Spam detection in a PaIR system
Monday, 05 February 2007
Detecting spam documents in a phrase based information retrieval system

This is a continuation of Phrase Based Optimization and Phrase Based Indexing and Retrieval II.
At least that’s the opening folly of the document. As a basic refresher, the method looks not only at the search term but also at related phrases for a given topic, and at how often those related phrases are statistically expected to occur in a document. It is calculated over individual documents, multiple documents, and collections of documents (web pages and websites for our purposes).
Then there is my favorite ‘concise’ version of the process: “To identify phrases that have sufficiently frequent and/or distinguished usage in the document collection to indicate that they are "valid" or "good" phrases.”

So in essence it is looking at the ratio of the actual co-occurrence rate to the expected co-occurrence rate, which gives them a predictive measurement for any given phrase on any given topic on a web site or web page.
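To make that ratio a little more concrete, here is a minimal Python sketch. It is not the system's actual implementation: the function name `cooccurrence_ratios`, the toy documents, and the choice to model the "expected" rate as what you would see if the two phrases appeared independently are all my own assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_ratios(documents):
    """Return {(phrase_a, phrase_b): actual_rate / expected_rate}.

    `documents` is a list of phrase lists, one list per document.
    """
    n_docs = len(documents)
    phrase_counts = Counter()  # number of documents containing each phrase
    pair_counts = Counter()    # number of documents containing both phrases

    for phrases in documents:
        unique = sorted(set(phrases))
        phrase_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    ratios = {}
    for (a, b), together in pair_counts.items():
        actual = together / n_docs
        # Assumption: the expected rate is the rate we would see if the
        # two phrases occurred independently of one another.
        expected = (phrase_counts[a] / n_docs) * (phrase_counts[b] / n_docs)
        ratios[(a, b)] = actual / expected
    return ratios


# Hypothetical toy collection, purely for illustration.
docs = [
    ["australian shepherd", "blue merle", "herding dog"],
    ["australian shepherd", "blue merle"],
    ["australian shepherd", "herding dog"],
    ["stock market", "herding dog"],
]

ratios = cooccurrence_ratios(docs)
for pair, ratio in sorted(ratios.items()):
    print(pair, round(ratio, 2))
```

A ratio above 1 means the two phrases turn up together more often than chance alone would suggest, which is the kind of signal that would mark them as related under this simplified model.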
Straight from the horse’s mouth:
A Phrase Based Indexing and Retrieval (PaIR) system could set about identifying a web page (or document) as a spam page by comparing the actual number of related phrases present in the document with the expected number of related phrases. A related phrase rate or density that is too high or too low compared with the expected one could flag a given document for further algorithmic inspection. Expectations of plural and singular occurrences can also be evaluated as part of the spam detection process. A document identified as spam is then added to a list of spam documents.
According to the folks that drafted it, a normal document contains on the order of 8-20 related topical phrases, whereas the typical Spam document would contain between 100 and 1000. So by looking for statistical deviations in related phrase occurrences, the system can flag an item as Spam. Once again, it is mostly aimed at the high end, but an unusually low count can also be used as a flag (which could be compared to the link profile for link spam).

Collateral Damage?

Strangely, to me at least, there is a bit of ‘give’ in the system: some documents (or sets of documents) that have been flagged as Spam are removed from the index entirely, while other, not-so-spammy documents would merely be devalued. This means there is a grey area that could certainly cause some collateral damage in my opinion; or it may be there to ‘deal’ with potential false positives. Either way, you can see where some innocents can get caught up in the war on web Spam.
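To picture how the related phrase counts and the remove-versus-devalue grey area might fit together, here is a rough Python sketch. Only the 8-20 'normal' range and the 100-1000 Spam range come from the text above; the function name `classify`, the `DEVALUE_ABOVE` cut-off of 50, and the idea of hard thresholds at all are hypothetical, since the source does not say where the lines are actually drawn.

```python
# From the article: ordinary documents carry roughly 8-20 related phrases,
# typical Spam documents somewhere between 100 and 1000.
NORMAL_RANGE = (8, 20)
REMOVE_ABOVE = 100  # lower end of the quoted Spam range

# Hypothetical cut-off, NOT from the source: where 'devalue' kicks in.
DEVALUE_ABOVE = 50


def classify(related_phrase_count):
    """Map a related phrase count to a (hypothetical) action."""
    low, high = NORMAL_RANGE
    if related_phrase_count >= REMOVE_ABOVE:
        return "remove"   # dropped from the index entirely
    if related_phrase_count >= DEVALUE_ABOVE:
        return "devalue"  # kept in the index, but ranked lower
    if related_phrase_count < low or related_phrase_count > high:
        return "inspect"  # statistical deviation: look closer
    return "ok"


for count in (3, 12, 30, 60, 400):
    print(count, classify(count))
```

Anything landing in "inspect" or "devalue" is the grey area: a legitimately thorough page with 60 related phrases would be treated as spammy under these made-up numbers, which is exactly how an innocent page could get caught.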
Well, that’s it for the world of PaIR for the moment. Time to move on to some new things. I have enjoyed the journey and look for great things as more search engines gravitate towards a similar model (hopefully). I can see many advantages for quality, relevant sites, and a little tougher time for the spammers (I am sure they’ll get it sorted).

Until next time… Play Safe

The Series: Phrase Based Optimization | Phrase Based Indexing and Retrieval II | Spam Detection in a PaIR system | Phrase Based Personalization of Search