Subjects: Computer Science >> Integration Theory of Computer Science submitted time 2019-04-01 Cooperative journals: 《计算机应用研究》
Abstract: With the rapid development of Internet technology, the circulation of sensitive-content images has shifted from small-scale, concealed exchange to mass data sharing, and traditional detection methods based on image feature extraction are no longer applicable. To overcome these difficulties, this paper proposes a sensitive content detection method based on sparse semantics and a double-layer deep convolutional neural network. The upper network preprocesses the training samples and constructs a sparse semantic representation of each image as the input to the neural network, while the lower network additionally takes third-party control mechanisms (such as government agents) into account and provides sensitive-content image detection for specific groups. Compared with existing sensitive-content image detection methods, the proposed method effectively reduces the number of training samples required, and its detection accuracy is more than 7% higher than that of traditional image detection methods (such as the bag-of-visual-words method).
Subjects: Computer Science >> Integration Theory of Computer Science submitted time 2018-10-11 Cooperative journals: 《计算机应用研究》
Abstract: With the construction and development of the network in Xinjiang, a large number of Uyghur webpages have been produced. To help maintain a healthy network environment, this paper proposes a Uyghur text filtering method that combines an N-gram statistical model with a class-unbalanced support vector machine (SVM) classifier. First, the webpage text is preprocessed and stems are initially extracted using the N-gram statistical model. Then, the stems are analyzed semantically, and stems with similar meanings are aggregated into one class, which reduces the stem dimension. Finally, a parameter that controls the distance between hyperplanes is introduced into the traditional SVM, yielding a class-unbalanced SVM that classifies Uyghur texts that are nonlinearly inseparable and class-imbalanced. Experimental results show that the method classifies bad texts accurately and with a shorter classification time.
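The stem-aggregation step can be illustrated with a minimal sketch. The abstract does not say how semantic similarity between stems is determined, so the `STEM_CLASS` mapping below is a hypothetical, hand-written stand-in (with English placeholder stems), and `aggregate` is an illustrative name rather than the authors' procedure. The sketch only shows how merging stems with similar meanings shrinks the feature dimension.

```python
from collections import Counter

# Hypothetical mapping from stem to semantic class (an assumption for
# illustration; the abstract does not describe how classes are derived).
STEM_CLASS = {
    "buy": "trade", "purchase": "trade", "sell": "trade",
    "fast": "speed", "quick": "speed", "rapid": "speed",
}

def aggregate(stems, stem_class):
    """Count occurrences per semantic class.

    Stems that have no class in the mapping keep their own dimension.
    """
    return Counter(stem_class.get(stem, stem) for stem in stems)

stems = ["buy", "purchase", "quick", "rapid", "fast", "book"]
features = aggregate(stems, STEM_CLASS)

print(len(set(stems)), "stem dimensions before aggregation")
print(len(features), "class dimensions after aggregation")
```

In this toy input, six distinct stems collapse into three feature dimensions, which is the kind of reduction the abstract describes before classification.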
Subjects: Computer Science >> Integration Theory of Computer Science submitted time 2018-05-24 Cooperative journals: 《计算机应用研究》
Abstract: At present, most research on the similarity of natural-language texts targets major languages such as English. To detect similarity between Uyghur texts, this paper proposes a similarity detection method based on N-grams and semantic analysis. First, an N-gram statistical model is used to obtain words based on Uyghur word features, and a word-text relation matrix is constructed from the frequency with which each word appears in each text. Then, latent semantic analysis (LSA) is applied to uncover the hidden associations between words and texts, which addresses semantic ambiguity in Uyghur and yields an exact similarity measure. Experiments on plagiarized text sets containing reorganization and synonym replacement show that the method detects similarity accurately and effectively.
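As a rough illustration of the first stage only, the sketch below builds a term-by-text frequency matrix from character n-grams and compares two texts by cosine similarity. Several things here are assumptions: character trigrams stand in for the paper's N-gram word extraction, English placeholder sentences stand in for Uyghur text, and the function names are illustrative. The LSA stage (a truncated singular value decomposition of the matrix) is deliberately omitted, so this is not the full method.

```python
from collections import Counter
import math

def char_ngrams(text, n=3):
    """Return overlapping character n-grams of each whitespace-separated token."""
    grams = []
    for token in text.lower().split():
        if len(token) < n:
            grams.append(token)  # keep short tokens whole
        else:
            grams.extend(token[i:i + n] for i in range(len(token) - n + 1))
    return grams

def term_text_matrix(texts, n=3):
    """Build the frequency matrix: one row per term, one column per text."""
    counts = [Counter(char_ngrams(t, n)) for t in texts]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in vocab]
    return vocab, matrix

def column(matrix, j):
    """Extract the frequency vector of text j."""
    return [row[j] for row in matrix]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

texts = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "stock prices fell sharply",
]
vocab, matrix = term_text_matrix(texts)
sim_close = cosine(column(matrix, 0), column(matrix, 1))
sim_far = cosine(column(matrix, 0), column(matrix, 2))
print(sim_close > sim_far)
```

Raw frequency vectors like these only match shared surface n-grams. The abstract's LSA step is what would let texts with different wording but related meaning (for example after synonym replacement) still score as similar.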
Subjects: Computer Science >> Integration Theory of Computer Science submitted time 2018-05-18 Cooperative journals: 《计算机应用研究》
Abstract: To address text filtering in Uyghur web forums, this paper proposes a text filtering method based on term selection and a Rocchio classifier. First, the forum text is preprocessed to remove useless words, and stems (terms) are extracted based on the N-gram statistical model. Then, a balanced mutual information term selection method (BMITS), which balances correlation and redundancy, is used to reduce the dimension of the initial term set and obtain a reduced term set. Finally, the text feature terms are used as input to a Rocchio classifier, which filters out bad text. Experimental results show that the proposed method accurately identifies bad text and is effective.
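The final classification stage can be sketched as a centroid-based Rocchio classifier: each class is represented by the average of its length-normalized term vectors, and a new text is assigned to the class whose centroid is most similar by cosine. This is a simplified variant under stated assumptions: the negative-prototype term of the general Rocchio formula is dropped, the term lists are hypothetical English placeholders, and the BMITS selection step is not reproduced because the abstract does not define it.

```python
from collections import Counter, defaultdict
import math

def normalize(vec):
    """Scale a sparse term vector to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)

def train_rocchio(docs, labels):
    """Compute one centroid per class by averaging normalized term vectors."""
    sums = defaultdict(Counter)
    sizes = Counter(labels)
    for terms, label in zip(docs, labels):
        for term, weight in normalize(Counter(terms)).items():
            sums[label][term] += weight
    return {
        label: {t: w / sizes[label] for t, w in vec.items()}
        for label, vec in sums.items()
    }

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def classify(centroids, terms):
    """Return the label of the centroid most similar to the given terms."""
    vec = Counter(terms)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))

# Hypothetical, already-selected feature terms for four training texts.
docs = [
    ["spam", "offer", "click"],
    ["offer", "prize", "click"],
    ["meeting", "agenda", "notes"],
    ["project", "notes", "deadline"],
]
labels = ["bad", "bad", "normal", "normal"]
centroids = train_rocchio(docs, labels)

print(classify(centroids, ["click", "prize", "now"]))
print(classify(centroids, ["agenda", "deadline"]))
```

Because training reduces to one averaged vector per class, classification cost grows with the number of classes and selected terms rather than with the number of training texts, which is one reason term selection before this stage matters.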
Subjects: Computer Science >> Integration Theory of Computer Science submitted time 2018-04-17 Cooperative journals: 《计算机应用研究》
Abstract: To address similarity calculation and plagiarism detection for documents written in Uyghur, a content-based Uyghur plagiarism detection (U-PD) method is proposed. First, in the preprocessing stage, the Uyghur texts are segmented, stop words are deleted, stems are extracted, and synonyms are replaced; stem extraction is based on N-gram statistical models. Then, the hash value of each text block is calculated with the BKDRhash algorithm, and the hash fingerprint of the entire document is constructed. Finally, using the hash fingerprints, the document is matched against the document library at the document, paragraph, and sentence levels with the RKR-GST matching algorithm, and the resulting document similarity is used to detect plagiarism. Experimental evaluation on Uyghur documents shows that the proposed method detects plagiarized documents accurately and is feasible and effective.
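The fingerprinting stage can be sketched as follows. `bkdr_hash` follows the commonly cited BKDR string hash (multiply by a seed such as 131, add the next character code, keep 31 bits), applied here to UTF-8 bytes, which is an assumption about the variant used. The block definition (overlapping three-word windows), the English placeholder text, and the set-containment score are also illustrative assumptions. In particular, containment is a much simpler stand-in for the RKR-GST matching named in the abstract and does not implement it.

```python
def bkdr_hash(text, seed=131):
    """BKDR-style string hash over UTF-8 bytes, kept to 31 bits."""
    h = 0
    for byte in text.encode("utf-8"):
        h = (h * seed + byte) & 0x7FFFFFFF
    return h

def fingerprint(text, k=3):
    """Hash every overlapping k-word block; the set of hashes is the fingerprint."""
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < k:
        return {bkdr_hash(" ".join(words))}
    return {
        bkdr_hash(" ".join(words[i:i + k]))
        for i in range(len(words) - k + 1)
    }

def containment(suspect, source, k=3):
    """Fraction of the suspect's block hashes that also occur in the source."""
    fs, fo = fingerprint(suspect, k), fingerprint(source, k)
    return len(fs & fo) / len(fs) if fs else 0.0

source = "the quick brown fox jumps over the lazy dog"
suspect = "quick brown fox jumps over a sleeping cat"

print(containment(suspect, source))
print(containment(source, source))
```

Comparing sets of integers instead of raw text blocks keeps matching against a large document library cheap. A set comparison ignores block order and contiguous run length, though, which is the information a tiling algorithm such as RKR-GST exploits.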