Detecting spam documents in a phrase based information retrieval system
|An information retrieval system uses phrases to index, retrieve, organize and describe documents. Phrases are identified that predict the presence of other phrases in documents. Documents are then indexed according to their included phrases. A spam document is identified based on the number of related phrases included in a document.|
At least that’s the opening volley of the document. As a basic refresher, the method looks not only at the search term but at related phrases for a given topic, and at the related phrase occurrences statistically expected to be present in a document. This is calculated over individual and multiple documents and over collections of documents (web pages and websites, for our purposes).
Then there’s my favorite ‘concise’ version of the process:
“To identify phrases that have sufficiently frequent and/or distinguished usage in the document collection to indicate that they are "valid" or "good" phrases”
So in essence it is looking at the ratio of actual co-occurrence rate to expected co-occurrence rate, which gives them a predictive measurement for any given phrase on any given topic on a web site or web page.
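For the code-minded, here is a minimal sketch of that ratio; the function name and the example numbers are mine, not the patent’s:

```python
def cooccurrence_ratio(actual_count: int, expected_count: float) -> float:
    """Ratio of actual to expected co-occurrences of a related phrase.

    A ratio near 1.0 means the phrase shows up about as often as the
    collection statistics predict; large deviations in either
    direction are what make a document interesting to a spam check.
    """
    if expected_count <= 0:
        return 0.0
    return actual_count / expected_count

# Hypothetical numbers: a phrase expected to co-occur ~4 times in a
# document of this length/topic actually appears 37 times.
print(cooccurrence_ratio(actual_count=37, expected_count=4.0))  # 9.25
```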
What is Spam?
Straight from the horse’s mouth:
|spam pages are documents that have little if any meaningful content, but instead comprise collections of popular words and phrases, often hundreds or even thousands of them; these pages are sometimes called "keyword stuffing pages." Others include specific words and phrases known to be of interest to advertisers. These types are created to cause search engines to retrieve such documents for display along with paid advertisements.|
A Phrase Based Indexing and Retrieval (PaIR) system could set about identifying a web page (or document) as a spam page by comparing the actual number of related phrases present in the document with the expected number of related phrases. A related phrase rate or density that is too high or too low relative to expectations could flag a given document for further algorithmic inspection. Plural and singular occurrences can also be evaluated as part of the spam detection process. A document that fails these checks is then added to a list of spam documents.
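A rough sketch of what that comparison might look like; the naive plural folding and the tolerance factor are my own illustration, not the patent’s actual math:

```python
def singularize(phrase: str) -> str:
    # Crude plural folding so "widgets" and "widget" count once;
    # a real system would use proper stemming.
    return phrase[:-1] if phrase.endswith("s") else phrase

def count_related_phrases(doc_text: str, related_phrases: list[str]) -> int:
    text = doc_text.lower()
    found = {singularize(p.lower()) for p in related_phrases if p.lower() in text}
    return len(found)

def flag_for_inspection(count: int, expected: float, tolerance: float = 3.0) -> bool:
    # Flag when the actual count deviates from the expected count by
    # more than the tolerance factor, in either direction.
    return count > expected * tolerance or count < expected / tolerance

doc = "Buy cheap widgets. Our widget store has widgets, gadgets and gadget deals."
related = ["widget", "widgets", "gadget", "gadgets", "store"]
n = count_related_phrases(doc, related)
print(n, flag_for_inspection(n, expected=12.0))  # 3 True (too few)
```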
The process takes place both at indexing and at retrieval. In essence, the document gets its spam score at indexation; then upon retrieval, should that page be included in the results, the previously calculated spam threshold score is applied, weighting is removed and the page is devalued during the ranking process.
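Here is one way to picture the two-phase idea; the data structure, the threshold and the penalty formula are purely illustrative:

```python
from dataclasses import dataclass

@dataclass
class IndexedDoc:
    url: str
    relevance: float   # base retrieval score
    spam_score: float  # 0.0 (clean) .. 1.0 (certain spam), set at indexing

def rank_score(doc: IndexedDoc) -> float:
    # At retrieval time, devalue (rather than drop) documents whose
    # index-time spam score crossed the threshold.
    if doc.spam_score > 0.5:
        return doc.relevance * (1.0 - doc.spam_score)
    return doc.relevance

results = [IndexedDoc("a.com", 0.9, 0.8), IndexedDoc("b.com", 0.7, 0.1)]
for d in sorted(results, key=rank_score, reverse=True):
    print(d.url, round(rank_score(d), 2))  # b.com 0.7, then a.com 0.18
```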
According to the folks that drafted it, a normal topical document contains related phrase occurrences on the order of 8-20, whereas the typical spam document would contain between 100 and 1,000 related phrases. So by looking for statistical deviations in related phrase occurrences the system can flag an item as spam. Once again it is mostly aimed at the high end, but an unusually low count can also be used as a flag (and could be compared against the link profile to catch link spam).
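Plugging in the ranges they quote, a toy classifier might look like this; the bucket names and the low-count cross-check are my framing:

```python
def classify_by_related_phrases(count: int) -> str:
    # Ranges taken from the patent discussion above: ~8-20 related
    # phrases is typical, spam documents run into the hundreds.
    if count >= 100:
        return "likely spam (keyword stuffing)"
    if count < 8:
        return "low: cross-check against link profile"
    if count <= 20:
        return "normal"
    return "elevated: further algorithmic inspection"

for n in (3, 12, 45, 400):
    print(n, "->", classify_by_related_phrases(n))
```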
Strangely, to me at least, there is a bit of ‘give’ to the system in that some documents (or sets of documents) that have been flagged as spam are entirely removed from the index, while other, not-so-spammy, documents are merely devalued. This means there is a grey area that could certainly cause some collateral damage, in my opinion; or it may be there to deal with potential false positives. Either way, you can see where some innocents can get caught up in the war on web spam.
That’s a Wrap
Well, that’s it for the world of PaIR for the moment. Time to move on to some new things. I have enjoyed the journey and look forward to great things as more search engines (hopefully) gravitate towards a similar model. I can see many advantages for quality, relevant sites, and a little tougher time for the spammers (I am sure they’ll get it sorted).
Until next time … Play Safe