Phrase Based Indexing and Retrieval 2
Monday, 29 January 2007


Picking up where we left off with the overview of Phrase Based Optimization, I wanted to scan over some relevant points from the other Phrase Based Indexing and Retrieval (IR) patents. This time we'll step back from the algo-babble and explore the intricacies a little further.

As you (undoubtedly) remember, the core concept of the processing is to identify valid (actual/real) phrases in a given document collection (web pages, in our case). The goal is to classify each potential phrase as either “a good phrase or a bad phrase” depending on its usage and frequency, and then to use those ‘good’ phrases to predict the usage of other ‘good phrases’ in the collection of web pages.


What’s a ‘Good Phrase’?

A possible phrase is classified as a good phrase when it ‘appears in a minimum number of documents, and appears a minimum number of instances in the document collection’. What those numbers are, we don’t know. Those are the ‘dials’ that only the Search Gods themselves have access to. It is almost like looking at a phrase density over the aggregate of documents (the web site). Also, a BAD phrase is not one with dirty words; it is simply a phrase with too low a frequency count to make the ‘good’ list.

The proposed identification process runs as follows:

  1. Collect possible and good phrases, along with frequency and co-occurrence statistics of the phrases.
  2. Classify possible phrases as either good or bad phrases based on frequency statistics.
  3. Prune the good phrase list based on a predictive measure derived from the co-occurrence statistics.

So it basically has two filters that refine the list of ‘good phrases’ down to the strongest elements of the site, or what could loosely be described as a theme.
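To make the first filter concrete, here is a minimal sketch of the frequency test in Python. The threshold values, the function name and the sample pages are all made up for illustration, since the real ‘dials’ are not public, and the plain substring counting is a stand-in for whatever phrase matching the engine actually does.

```python
from collections import Counter

# Hypothetical thresholds: the real values are not disclosed.
MIN_DOCS = 2        # phrase must appear in at least this many documents
MIN_INSTANCES = 3   # and at least this many times across the collection


def classify_phrases(docs, candidates):
    """Split candidate phrases into 'good' and 'bad' by frequency alone."""
    doc_count = Counter()       # number of documents containing the phrase
    instance_count = Counter()  # total occurrences across all documents
    for text in docs:
        lowered = text.lower()
        for phrase in candidates:
            # Naive substring count; real phrase matching would be stricter.
            hits = lowered.count(phrase.lower())
            if hits:
                doc_count[phrase] += 1
                instance_count[phrase] += hits
    good = [p for p in candidates
            if doc_count[p] >= MIN_DOCS and instance_count[p] >= MIN_INSTANCES]
    bad = [p for p in candidates if p not in good]
    return good, bad


# Made-up sample pages.
docs = [
    "Vintage baseball cards for sale. Buy baseball cards online.",
    "Our baseball cards guide covers grading and storage.",
    "Contact us about shipping rates.",
]
good, bad = classify_phrases(docs, ["baseball cards", "shipping rates"])
print(good, bad)
```

With these sample pages, ‘baseball cards’ clears both thresholds while ‘shipping rates’ appears on only one page, so it lands on the bad list despite being a perfectly clean phrase.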


Making Predictions

With a collection of these ‘good phrases’ in hand, it can then analyze the complete set of pages for the frequency count and the number of distinguished instances of each phrase. A distinguished instance is ‘a phrase distinguished from neighboring content in the document by grammatical or format markers’, ‘such as delimited by markup tags or other morphological, format, or grammatical markers’. So while commas, hyphens and period-type markers may come into play, I would imagine prominence factors would get a better response under this framework.

Each good phrase is used to predict the presence of at least one other good phrase in the document and across the web site. So the more relevant ‘good phrases’ there are in the overall site ‘theme’, the better the score would be. A document (web page) can then be judged by how well its phrases predict the presence of other phrases on the page. In many ways I can see this encouraging better or more unique content (Markov will need to hit the gym).
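The article does not spell out the predictive measure, so the sketch below swaps in a simple stand-in: the observed co-occurrence rate of two phrases divided by the rate expected if they appeared independently. The measure, the function names, the 1.5 threshold and the sample pages are all assumptions for illustration, not the method from the patents.

```python
from itertools import combinations


def cooccurrence_lift(doc_phrases, a, b):
    """Observed co-occurrence rate of a and b divided by the rate
    expected if the two phrases appeared independently."""
    n = len(doc_phrases)
    with_a = sum(1 for s in doc_phrases if a in s)
    with_b = sum(1 for s in doc_phrases if b in s)
    with_both = sum(1 for s in doc_phrases if a in s and b in s)
    if with_a == 0 or with_b == 0:
        return 0.0
    expected = (with_a / n) * (with_b / n)
    return (with_both / n) / expected


def prune(doc_phrases, good, threshold=1.5):
    """Keep only good phrases that predict at least one other good phrase.
    The threshold is a hypothetical value."""
    kept = set()
    for a, b in combinations(good, 2):
        if cooccurrence_lift(doc_phrases, a, b) >= threshold:
            kept.update((a, b))
    return [p for p in good if p in kept]


# Each set holds the good phrases found on one page (made-up data).
pages = [
    {"baseball cards", "card grading"},
    {"baseball cards", "card grading"},
    {"shipping rates"},
    {"shipping rates", "baseball cards"},
    set(),
]
good = ["baseball cards", "card grading", "shipping rates"]
survivors = prune(pages, good)
print(survivors)
```

In this toy data ‘baseball cards’ and ‘card grading’ turn up together more often than chance would suggest, so both survive, while ‘shipping rates’ predicts nothing and is pruned.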


It’s all related

‘this approach recognizes that topics, as indicated by related phrases, form a complex graph of relationships, where some phrases are related to many other phrases’

Once again there is an element of ‘theme’ being built up in the indexing and scoring of sets of documents, and much of it is based on expected occurrences of related phrases. Content that is broad, semantic and unique will certainly have its advantages and (hopefully for Google) make for better search results.

As you may imagine, the outbound links and inlinks (internal links) also get treatment. For outbound links it looks at the anchor text, compares it against the ‘good list’ and scores it accordingly. It also checks the target document (web page) against the good list, and further credit is given on that basis. Partial scoring also comes into play if, for example, the target document has ‘Australian’ but not ‘Australian Football’. While not a complete miss, it wouldn’t get FULL marks.
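Here is one way the partial scoring idea could look in Python. The rule used (full marks for the whole phrase, otherwise the fraction of its words found in the target) is purely an assumption to illustrate the ‘Australian’ versus ‘Australian Football’ case; how partial credit is actually weighted is not known.

```python
def partial_match_score(phrase, target_text):
    """Full marks if the whole phrase appears in the target text,
    otherwise the fraction of its words that do (an assumed rule)."""
    words = phrase.lower().split()
    if not words:
        return 0.0
    text = target_text.lower()
    if " ".join(words) in text:
        return 1.0
    text_words = set(text.split())
    found = sum(1 for w in words if w in text_words)
    return found / len(words)


# Made-up target documents.
full = partial_match_score("Australian Football", "Australian football fixtures")
partial = partial_match_score("Australian Football", "Australian wine regions")
miss = partial_match_score("Australian Football", "Canadian hockey scores")
print(full, partial, miss)
```

The second target is not a complete miss, but it only earns half marks under this assumed rule.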


Phrase Extensions and identification

Phrase extensions are merely additional words on the core term(s). If we had the core term ‘Baseball Cards’ we could ‘extend’ it with ‘Vintage Baseball Cards’, ‘Buy Vintage Baseball Cards’ and finally ‘Buy Vintage Baseball Cards Online’ – you get the idea.

To identify a potential phrase, the algo looks at a string of words such as "Hillary Rodham Clinton Bill on the Senate Floor", from which it would take "Hillary Rodham Clinton Bill on", "Hillary Rodham Clinton Bill" and "Hillary Rodham Clinton". Only the last one is kept. It would also identify "Bill on the Senate Floor" and the shorter candidates inferred from it, down to ‘bill’.
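The candidate generation in that example can be sketched in a few lines of Python. The five-word window is inferred from the example itself and the function name is hypothetical; deciding which candidates are kept is a separate step (the frequency test), which this sketch does not attempt.

```python
def candidate_phrases(text, start=0, window=5):
    """Yield candidate phrases from a window of words, longest first,
    dropping one trailing word each time. The window size is assumed."""
    words = text.split()[start:start + window]
    for end in range(len(words), 0, -1):
        yield " ".join(words[:end])


sentence = "Hillary Rodham Clinton Bill on the Senate Floor"
first = list(candidate_phrases(sentence))            # window starting at 'Hillary'
second = list(candidate_phrases(sentence, start=3))  # window starting at 'Bill'
print(first)
print(second)
```

The first window produces "Hillary Rodham Clinton Bill on" and works down through "Hillary Rodham Clinton"; the second starts at "Bill on the Senate Floor" and works down to "Bill".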


Making the Grade

In the end, it is these related phrase/theme scores that are used to rank documents for a given search query. Documents containing the most related phrases and secondary related phrases for the query phrases would be ranked highest. The semantically topical, relevant page gets the highest ranking.

Anchor phrase scoring also counts: it looks for the related query phrases in the text links to other documents. There are two scores here, the ‘body’ score and the ‘anchor’ score. Greater scoring is obviously given if a good phrase is in the text link as well as in the body of the referenced document. Additionally, the anchor text pointing TO your site is also analyzed and scored under the same methods.
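As a rough illustration of that ranking idea, the Python sketch below simply counts how many related phrases each page contains and sorts on that count. The related phrase list, the page names and the equal weighting of every phrase are all assumptions; a real system would surely weight things differently.

```python
def rank_by_related_phrases(pages, related):
    """Order page names by how many related phrases each page contains."""
    def score(item):
        _, text = item
        lowered = text.lower()
        return sum(1 for phrase in related if phrase.lower() in lowered)

    ranked = sorted(pages.items(), key=score, reverse=True)
    return [name for name, _ in ranked]


# Made-up pages and a hypothetical related-phrase list for 'baseball cards'.
pages = {
    "page-a": "Baseball cards in mint condition, with a card grading guide.",
    "page-b": "Baseball cards for sale.",
    "page-c": "Every rookie card here is in mint condition after card grading.",
}
related = ["card grading", "rookie card", "mint condition"]
ranking = rank_by_related_phrases(pages, related)
print(ranking)
```

Note that page-b mentions the query phrase itself but none of the related phrases, so it ends up last under this scoring.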


I personally am following this because it seems it would encourage more unique content and potentially better SERPs. I also believe searchers are getting more sophisticated in their search habits and in their understanding of how to use search engines.

**The Series - Phrase Based Optimization | Phrase Based Indexing and Retrieval II | Spam Detection in a PaIR system | Phrase Based Personalization of Search

 

 

Knowledge Base
Keyword or Phrase research and targeting practices

This article is not aimed at the advanced-level reader, folks. I talk to many people who don't even spend a passing thought on KW/P research, so this posting is written to give the reader a foundation, not to serve as an advanced guide. Of course, this being an inexact science, there could be some gems in here for the ragged SEO warriors as well.

For a while now, I have been intending to give my perspective on Keyword/Phrase (KW/P) research and targeting. Why? Because I think it is an essential, if not mandatory, part of any SEO campaign. Each and every term you target will have a cost associated with it. This is where the rubber meets the road. In simplest terms, if I spend $500 on link building, $200 on content creation and $100 on site adjustments, then I am looking at $800 invested. Once we reach the desired ranking, how long will it take to recoup that cash (including ongoing maintenance)? In short, where is the ROI and what is the ‘break even’ point? That is a very simplistic example, but hopefully you get the idea.

Simply aiming to be #1 on Google is foolhardy, since some terms bring little reward (traffic). Remember, cash that gets tied up chasing non-performing terms is money that could have been used elsewhere in your marketing endeavors. So this is certainly an important step in the SEO process. Mistakes here can be very costly later on and in the overall ‘big picture’ that is the site’s financial health.

 
