A Chinese text similarity algorithm based on YAKE and neural network

Open Access

Abstract: Traditional text similarity algorithms must process large amounts of text data and have high computational complexity. Keywords concentrate the main themes of a text, so extracting them can reduce the complexity of text similarity calculation. This paper therefore proposes a Chinese text similarity calculation method, YANN, that integrates an improved YAKE with a neural network. Because the Yet Another Keyword Extractor (YAKE) algorithm is not well suited to keyword extraction from Chinese text, the method improves its keyword candidate stage. First, new feature values are calculated for each word from its span, position, frequency, context relevance, and the number of different sentences in which it appears. Next, the features are combined into a keyword score for each candidate word, and the keywords are output in ascending order of score. Finally, the keyword set is fed into a trained word2vec model for vectorization: the keyword vectors obtained from the model are summed and averaged, and the similarity between texts is calculated as the cosine similarity of the resulting vectors. Experimental results show that the proposed method outperforms other algorithms in Chinese keyword extraction, and the similarity calculation results support the effectiveness of the method.


Keywords: Keyword Extraction; Word2vec; Text Similarity
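
The final step described in the abstract (averaging the keyword vectors of each text and comparing the averages by cosine similarity) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names and the three-dimensional `TOY_EMBEDDINGS` table are hypothetical, and the table merely stands in for a trained word2vec model, whose vectors would normally have far more dimensions.

```python
import math

# Hypothetical toy embeddings standing in for a trained word2vec model.
TOY_EMBEDDINGS = {
    "文本": [1.0, 0.0, 0.0],
    "相似度": [0.0, 1.0, 0.0],
    "关键词": [1.0, 1.0, 0.0],
    "天气": [0.0, 0.0, 1.0],
}


def mean_vector(keywords, embeddings):
    """Sum and average the vectors of the keywords found in the embedding table."""
    vectors = [embeddings[w] for w in keywords if w in embeddings]
    if not vectors:
        raise ValueError("none of the keywords has an embedding")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def text_similarity(keywords_a, keywords_b, embeddings):
    """Similarity of two texts, each represented by its extracted keyword set."""
    return cosine_similarity(
        mean_vector(keywords_a, embeddings),
        mean_vector(keywords_b, embeddings),
    )
```

With these toy vectors, `text_similarity(["文本", "相似度"], ["关键词"], TOY_EMBEDDINGS)` is approximately 1, because the averaged vector points in the same direction as the vector for "关键词", while comparing against `["天气"]` gives 0 because the vectors are orthogonal. Keywords missing from the table are skipped in this sketch; how the paper handles out-of-vocabulary keywords is not stated in the abstract.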