Once the global feature dictionary has been obtained, each page to be fingerprinted is scanned to collect all of its words. These words are then filtered against the feature dictionary: words that appear in the dictionary are kept, since they express the main content of the document, while words that do not appear in the dictionary are discarded. Finally, a hash function is applied to the extracted feature words, and the resulting hash value is the text fingerprint of the document.
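The filter-then-hash step above can be sketched in a few lines of Python. This is a minimal illustration, not the reference implementation: the function name, the use of SHA-1, and the space-joined encoding are my own choices for the example.

```python
import hashlib

def imatch_fingerprint(words, feature_dict):
    """Keep only the words found in the feature dictionary, then hash
    them into a single fingerprint for the whole document."""
    # Filtering discards words that are not in the global feature dictionary.
    # Sorting the surviving word set makes the result order-independent.
    kept = sorted(set(w for w in words if w in feature_dict))
    return hashlib.sha1(" ".join(kept).encode("utf-8")).hexdigest()
```

Because the kept words are deduplicated and sorted before hashing, two documents produce the same fingerprint whenever they share the same set of feature words, regardless of word order.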
The I-Match algorithm first gathers statistics over a large-scale text collection for every word that appears in it, ranks the words by IDF (inverse document frequency) from high to low, removes the words with the highest and the lowest scores, and keeps the remaining words as the feature dictionary. In other words, the main step is to delete uninformative words and preserve the important, discriminative ones. Below is a schematic diagram of the I-Match process:
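The dictionary-building step can also be sketched in Python. This is a hedged sketch under my own assumptions: the article does not specify the cut-off fractions, so the `low_cut`/`high_cut` parameters and the plain `log(N/df)` IDF formula are illustrative choices.

```python
import math
from collections import Counter

def build_feature_dict(docs, low_cut=0.1, high_cut=0.1):
    """Rank all words by IDF and drop the highest- and lowest-scoring
    tails, keeping the middle band as the feature dictionary."""
    n = len(docs)
    df = Counter()                       # document frequency of each word
    for doc in docs:
        df.update(set(doc))
    idf = {w: math.log(n / d) for w, d in df.items()}
    ranked = sorted(idf, key=idf.get, reverse=True)
    lo = int(len(ranked) * high_cut)                 # drop very rare words (highest IDF)
    hi = len(ranked) - int(len(ranked) * low_cut)    # drop very common words (lowest IDF)
    return set(ranked[lo:hi])
```

Dropping both tails removes words that are too common to discriminate between documents and words so rare they are likely noise.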
Once the fingerprints of all documents have been computed, checking whether two documents are near-duplicates only requires checking whether their fingerprints are identical: if the fingerprints match, the two documents are judged to be duplicates. This approach is very intuitive, highly efficient, and clearly effective.
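With fingerprints in hand, duplicate detection over a whole collection reduces to grouping equal values. A minimal sketch (the function name and the dict-of-fingerprints input format are my own assumptions):

```python
def find_duplicates(fingerprints):
    """Given a mapping of document id -> fingerprint, return the
    groups of document ids that share an identical fingerprint."""
    seen = {}
    for doc_id, fp in fingerprints.items():
        seen.setdefault(fp, []).append(doc_id)
    # Only groups with more than one member are duplicate clusters.
    return [ids for ids in seen.values() if len(ids) > 1]
```

Because each fingerprint is a single hash value, this runs in linear time over the collection, which is what makes the method practical at search-engine scale.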
In SEO work, people producing pseudo-original articles often swap the positions of words and paragraphs to trick the search engine into treating the result as original content. I-Match, however, is insensitive to word order: if two articles contain the same words and differ only in where those words appear, the I-Match algorithm will still judge them to be duplicates.
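This order-insensitivity is easy to demonstrate. In the sketch below (toy dictionary and texts are my own), the second sentence is just a reshuffling of the first, and the two fingerprints come out identical:

```python
import hashlib

def fingerprint(text, feature_dict):
    # Order-insensitive because the kept feature words are sorted.
    kept = sorted(set(text.split()) & feature_dict)
    return hashlib.md5(" ".join(kept).encode("utf-8")).hexdigest()

features = {"search", "engine", "duplicate", "page"}   # toy feature dictionary
a = "the search engine found a duplicate page"
b = "a duplicate page found the engine search"         # same words, reordered
print(fingerprint(a, features) == fingerprint(b, features))  # True
```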
But this algorithm still has problems. 1. It is prone to false positives, especially on short texts. A short text contains few words to begin with, so after filtering against the feature dictionary only a handful of feature words remain, and two documents that are not actually duplicates can easily be misjudged as duplicates; this problem is severe for short documents. 2. Its stability is poor: it is sensitive to small modifications of a document. If document B is produced by making a minor edit to document A, the algorithm is likely to miss the fact that the two documents are duplicates. For example, suppose we add a single word H to document A to obtain document B. When I-Match computes the fingerprints, the two articles differ only by the word H, yet if H survives the dictionary filter the two fingerprints will be completely different.
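The second weakness can be demonstrated directly. In the sketch below (toy dictionary and documents are my own), document B is document A plus one extra feature word, yet the fingerprints no longer match:

```python
import hashlib

def fingerprint(words, feature_dict):
    kept = sorted(set(words) & feature_dict)
    return hashlib.md5(" ".join(kept).encode("utf-8")).hexdigest()

features = {"search", "engine", "rank", "index"}   # toy feature dictionary
doc_a = ["search", "engine", "rank"]
doc_b = ["search", "engine", "rank", "index"]      # document A plus one word
print(fingerprint(doc_a, features) != fingerprint(doc_b, features))  # True
```

Because the whole word set is hashed into one value, a single added or removed feature word flips the entire fingerprint, so near-duplicate documents that differ by one word go undetected.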
A large number of duplicate pages exist on the Internet: according to statistics, near-duplicate pages account for about 29% of all pages, and exact duplicates account for another 22%. These repeated pages consume a great deal of search-engine resources, so page de-duplication is a very important part of a search engine. In this article we analyze one of the search engine's page de-duplication algorithms: the I-Match algorithm.