Picking up where we left off with the overview of Phrase Based Optimization, I wanted to scan over some relevant points from the other Phrase Based Indexing and Retrieval (IR) patents. This time we'll step back from the algo-babble and explore the intricacies a little further.
As you (undoubtedly) remember, the core concept of the processing is to identify valid (actual/real) phrases in a given document collection (or web pages, in our case). The goal is to classify each potential phrase as either ‘a good phrase or a bad phrase’ depending on its usage and frequency, then to use those ‘good’ phrases to predict the usage of other ‘good phrases’ in the collection of web pages.
What’s a ‘Good Phrase’?
A possible phrase is classified as a good phrase when it ‘appears in a minimum number of documents, and appears a minimum number of instances in the document collection’; otherwise it is a bad phrase. What those minimums are, we don’t know. Those are the ‘dials’ that only the Search Gods themselves have access to. It is almost like looking at a phrase density over the aggregate of documents (the web site). Also, a BAD phrase is not one with dirty words; it is simply a phrase with too low a frequency count to make the ‘good’ list.
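To make the two-minimum rule a little more tangible, here is a minimal Python sketch. The threshold values (`MIN_DOCS`, `MIN_INSTANCES`), the function names and the sample pages are all my own assumptions for illustration; the real dials are not public, and plain substring matching is far cruder than whatever a search engine would actually do.

```python
from collections import Counter

# Hypothetical thresholds: the real values are not public.
MIN_DOCS = 2        # phrase must appear in at least this many documents
MIN_INSTANCES = 3   # and at least this many times across the collection


def count_phrases(docs, phrases):
    """Return (doc_count, instance_count) Counters for each candidate phrase."""
    doc_count, instance_count = Counter(), Counter()
    for text in docs:
        lowered = text.lower()
        for phrase in phrases:
            # Naive, case-insensitive substring count (an assumption).
            hits = lowered.count(phrase.lower())
            if hits:
                doc_count[phrase] += 1
                instance_count[phrase] += hits
    return doc_count, instance_count


def classify(docs, phrases, min_docs=MIN_DOCS, min_instances=MIN_INSTANCES):
    """Split candidate phrases into 'good' and 'bad' sets by frequency."""
    doc_count, instance_count = count_phrases(docs, phrases)
    good = {
        p for p in phrases
        if doc_count[p] >= min_docs and instance_count[p] >= min_instances
    }
    bad = set(phrases) - good
    return good, bad


docs = [
    "Vintage baseball cards for sale. Buy baseball cards online.",
    "Grading baseball cards is an art.",
    "Our football jerseys ship worldwide.",
]
good, bad = classify(docs, ["baseball cards", "football jerseys"])
print(good, bad)
```

In this toy collection ‘baseball cards’ clears both minimums (two documents, three instances) while ‘football jerseys’ falls short on both, so it lands on the bad list purely because of its low count.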
The proposed identification process begins as follows:
- Collect possible and good phrases, along with frequency and co-occurrence statistics of the phrases.
- Classify possible phrases as either good or bad phrases based on frequency statistics.
- Prune the good phrase list based on a predictive measure derived from the co-occurrence statistics.
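The pruning step in the list above could be sketched roughly like this. I am assuming the ‘predictive measure’ is a simple ratio of how often two good phrases actually share a document to how often they would if they were unrelated; that choice, the `GAIN_THRESHOLD` value and the function name `prune` are mine, not something taken from the patents.

```python
from itertools import permutations

GAIN_THRESHOLD = 1.5  # hypothetical cut-off, chosen only for illustration


def prune(docs_phrases, good, threshold=GAIN_THRESHOLD):
    """Keep good phrases that predict at least one other good phrase.

    docs_phrases: list of sets, the good phrases found in each document.
    """
    n = len(docs_phrases)
    doc_freq = {p: sum(p in d for d in docs_phrases) for p in good}
    kept = set()
    for a, b in permutations(good, 2):
        if doc_freq[a] == 0 or doc_freq[b] == 0:
            continue
        # Share of documents where both phrases really appear together.
        actual = sum(a in d and b in d for d in docs_phrases) / n
        # Share we would expect if the two phrases were independent.
        expected = (doc_freq[a] / n) * (doc_freq[b] / n)
        if actual / expected > threshold:
            kept.add(a)
    return kept


docs_phrases = [
    {"baseball cards", "card grading"},
    {"baseball cards", "card grading"},
    {"free shipping"},
    {"free shipping"},
]
good_list = {"baseball cards", "card grading", "free shipping"}
kept = prune(docs_phrases, good_list)
print(kept)
```

Here ‘baseball cards’ and ‘card grading’ turn up together twice as often as chance would suggest, so each predicts the other and both survive. ‘Free shipping’ is frequent enough to be good, but it predicts nothing else and gets pruned.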
So it basically has two filters to further refine the list of ‘good phrases’ and identify the strongest elements of the site, or what could be loosely described as a theme.
Making Predictions
With a collection of these ‘good phrases’ in hand, it can then analyze the complete set of pages for the frequency count and the number of distinguished instances of each phrase. A distinguished instance is ‘a phrase distinguished from neighboring content in the document by grammatical or format markers’, or one ‘such as delimited by markup tags or other morphological, format, or grammatical markers’. So while commas, hyphens and period-type markers may come into play, I would imagine prominence factors would get a better response under this framework.
Each good phrase must be predictive of at least one other good phrase on the document and web site. So the more relevant ‘good phrases’ there are in the overall site ‘theme’, the better the score would be. It can then judge a document (web page) by how well the phrases it contains predict the presence of other phrases on the page. In many ways I can see this encouraging better or more unique content (Markov will need to hit the gym).
It’s all related
‘this approach recognizes that topics, as indicated by related phrases, form a complex graph of relationships, where some phrases are related to many other phrases’
Once again there is an element of ‘theme’ being built up in the indexing and scoring of sets of documents, and much of it is based on expected occurrences of related phrases. Content that is broad, semantic and unique will certainly have its advantages and (hopefully for Google) make for better search results.
As you may imagine, the outbound links and inlinks (internal links) also get treatment. For outbound links it looks at the anchor text, compares it against the ‘good list’ and scores it accordingly. It also checks the target document (web page) against the good list, and further credit is given if the phrase appears there too. Partial scoring also comes into play if, for example, the target document has ‘Australian’ but not ‘Australian Football’. While not a complete miss, it wouldn’t get FULL marks.
Phrase Extensions and identification
Phrase extensions are merely additional words on the core term(s). If we had the core term ‘Baseball Cards’ we could ‘extend’ it with ‘Vintage Baseball Cards’, ‘Buy Vintage Baseball Cards’ and finally ‘Buy Vintage Baseball Cards Online’. You get the idea. To identify potential phrases, the algo looks at a run of text such as "Hillary Rodham Clinton Bill on the Senate Floor", from which it would take the candidates "Hillary Rodham Clinton Bill on," "Hillary Rodham Clinton Bill," and "Hillary Rodham Clinton". Only the last one is kept. It would also identify "Bill on the Senate Floor" and the inferences down to ‘bill’.
Making the Grade
In the end it is these related phrase/theme scores that are used in the ranking of documents for a given search query. Documents containing more related phrases and secondary related phrases for the query phrases would be ranked highest. The semantically topical, relevant page gets the highest ranking. Anchor phrase scoring also counts related query phrases found in the text links to other documents. There are two scores here: the ‘body’ score and the ‘anchor’ score. Greater scoring is obviously given if a good phrase is in the text link as well as in the body of the referenced document. Additionally, the anchor text pointing TO your site is also analyzed and scored under the same methods.
I personally am following this because it seems it would encourage more unique content and potentially better SERPs. I also believe searchers are getting more sophisticated in their search habits and in their understanding of how to use search engines.
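Coming back to the ‘body’ and ‘anchor’ scores for a moment, a toy version of the ranking might look like the sketch below. The weights, the function names, the page dictionaries and the example URLs are all hypothetical. I don’t know how the two scores are really combined, so a plain weighted sum stands in for it here.

```python
BODY_WEIGHT = 1.0    # hypothetical weights; the real combination
ANCHOR_WEIGHT = 0.5  # of the two scores is not something I know


def body_score(body, related_phrases):
    """Count how many phrases related to the query appear in the page body."""
    text = body.lower()
    return sum(1 for p in related_phrases if p.lower() in text)


def anchor_score(inbound_anchors, query_phrase):
    """Count inbound text links whose anchor text contains the query phrase."""
    q = query_phrase.lower()
    return sum(1 for a in inbound_anchors if q in a.lower())


def rank(pages, query_phrase, related_phrases):
    """Sort pages by a weighted sum of body and anchor scores, best first."""
    def total(page):
        return (BODY_WEIGHT * body_score(page["body"], related_phrases)
                + ANCHOR_WEIGHT * anchor_score(page["anchors"], query_phrase))
    return sorted(pages, key=total, reverse=True)


pages = [
    {"url": "a.example", "body": "Baseball cards for sale.", "anchors": []},
    {"url": "b.example",
     "body": "Vintage baseball cards, card grading and price guides.",
     "anchors": ["buy baseball cards", "home"]},
]
related = ["vintage baseball cards", "card grading", "price guides"]
ranked = rank(pages, "baseball cards", related)
print([p["url"] for p in ranked])
```

The second page wins because it carries three related phrases in its body and also has one inbound link using the query phrase as anchor text. The first page mentions the query phrase but none of the related ones, which is exactly the kind of thin page this approach would seem to push down.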
The Series: Phrase Based Optimization | Phrase Based Indexing and Retrieval II | Spam Detection in a PaIR system | Phrase Based Personalization of Search