The main goal of this document is to give SEO enthusiasts a stronger grasp of how search engines deal with phrases, in an effort to help you further target and optimize your websites. The theories and information relate well to keyword/phrase research and content creation and, to a lesser extent, to developing backlink text.
The crux of the piece is based on an analysis of an existing Google patent on ‘Phrase based searching’ (see Resources at the end). That is as far as I shall go on the original patent, since going further can lead to assumptions about what may or may not be used in their indexing and retrieval processes (algorithms). Just because they filed the patent doesn’t necessarily mean they have implemented it. I feel the main point here is to get a better idea of HOW search engineers think and WHAT may possibly be in place now, or in future search technologies.
Why Phrase based indexing and retrieval is important
The problem facing search engines is that direct "Boolean" matching of query terms has known limitations. One problem is that it doesn’t identify documents that lack the query terms but contain related words. It is a very tightly defined result. A search on "Florida Snakes" doesn’t return results related to local species (Black Pine, for example). Conversely, it is likely to retrieve and highly rank documents related to ‘Florida’ rather than the desired or intended query.
Creating better clusters
The answer is a methodology that uses phrases to index, search, rank, and create descriptions for websites. It looks to identify phrases that have frequent and/or distinguished/unique usage. Using this methodology, phrases of four, five, or more terms can be identified. The system can also identify phrases that are related to one another by using a ‘prediction measure’, which compares how often two phrases are actually used together against their ‘expected usage’. In essence, the more ‘expected’ related phrasings there are within a document, the higher the score will be.
Related phrases are those that are commonly used to discuss or describe a topic or concept, such as "President of the United States" and "White House". Under a Boolean system these are seemingly semantic unknowns, yet they are obviously related to each other. Phrase based indexing and retrieval systems help alleviate this problem.
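To make the ‘actual versus expected’ idea a little more concrete, here is a minimal Python sketch. The formula is my own assumption (a plain ratio of the observed co-occurrence rate to the rate you would expect if the two phrases were unrelated), and the function name and sample pages are made up for illustration. It is not taken from any search engine.

```python
from collections import Counter
from itertools import combinations

def prediction_measures(docs):
    """docs: list of sets of phrases, one set per page.
    Returns {(a, b): actual co-occurrence rate / expected co-occurrence rate}.
    The ratio form is an assumption made for this sketch."""
    n = len(docs)
    phrase_count = Counter()
    pair_count = Counter()
    for phrases in docs:
        phrase_count.update(phrases)
        # Sort so each pair is always stored in the same (a, b) order.
        pair_count.update(combinations(sorted(phrases), 2))
    measures = {}
    for (a, b), together in pair_count.items():
        actual = together / n
        # Expected rate if the two phrases appeared independently of each other.
        expected = (phrase_count[a] / n) * (phrase_count[b] / n)
        measures[(a, b)] = actual / expected
    return measures

# Hypothetical mini-corpus: each set holds the phrases found on one page.
docs = [
    {"president of the united states", "white house"},
    {"president of the united states", "white house"},
    {"white house", "stanley cup"},
    {"stanley cup", "nhl hockey"},
]
measures = prediction_measures(docs)
```

A value above 1 means the two phrases show up together more often than chance would suggest; a value below 1 means less often. Where to draw the line for ‘related’ is a tuning decision this sketch leaves open.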
Multiple purpose phrase relevance
For each phrase, the system (indexing and retrieval) identifies pages that contain the phrase. Also, for a given phrase, a second list stores data that shows which of its related phrases are also present in the pages containing it. The system can then identify which pages have which phrases, as well as which pages also contain phrases related to the query phrases. This enables much tighter scoring of the results for a phrase query.
Using such a methodology creates a variety of clusters of related phrases, which “represent semantically meaningful groupings of phrases”. A cluster is formed from phrases that have a high prediction measure between all of the phrases in the cluster. The clusters can then be used to organize the results, to score and rank them, and to eliminate documents from the search results.
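The two lists described above can be pictured with a small Python sketch. The structure names (`postings`, `related_present`) and the sample data are hypothetical; a real index would be far more compact than plain dictionaries.

```python
from collections import defaultdict

def build_index(docs, related):
    """docs: {doc_id: set of phrases on that page}
    related: {phrase: list of its related phrases}
    Returns (postings, related_present)."""
    postings = defaultdict(set)          # phrase -> ids of pages containing it
    related_present = defaultdict(dict)  # phrase -> {page id -> related phrases also on that page}
    for doc_id, phrases in docs.items():
        for phrase in phrases:
            postings[phrase].add(doc_id)
            related_present[phrase][doc_id] = [
                r for r in related.get(phrase, []) if r in phrases
            ]
    return postings, related_present

# Hypothetical pages and related-phrase lists.
docs = {
    1: {"nhl hockey", "stanley cup"},
    2: {"nhl hockey"},
    3: {"stanley cup", "white house"},
}
related = {"nhl hockey": ["stanley cup"], "stanley cup": ["nhl hockey"]}
postings, related_present = build_index(docs, related)
```

Here `postings["nhl hockey"]` holds pages 1 and 2, while `related_present["nhl hockey"]` records that only page 1 also mentions the related phrase "stanley cup".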
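As a rough illustration of ‘a high prediction measure between all of the phrases in the cluster’, the Python sketch below groups phrases greedily. The greedy strategy, the threshold, and the scores are all assumptions of mine; the source only says that every pair within a cluster should score highly.

```python
def cluster_phrases(phrases, measure, threshold):
    """Greedy grouping: a phrase joins the first cluster in which its
    prediction measure with EVERY existing member meets the threshold;
    otherwise it starts a new cluster."""
    clusters = []
    for phrase in phrases:
        for cluster in clusters:
            if all(measure(phrase, member) >= threshold for member in cluster):
                cluster.append(phrase)
                break
        else:
            # No existing cluster accepted the phrase.
            clusters.append([phrase])
    return clusters

# Hypothetical pairwise scores; pairs not listed count as unrelated (0.0).
scores = {
    frozenset({"president of the united states", "white house"}): 3.0,
    frozenset({"nhl hockey", "stanley cup"}): 2.5,
}

def measure(a, b):
    return scores.get(frozenset({a, b}), 0.0)

clusters = cluster_phrases(
    ["president of the united states", "white house", "nhl hockey", "stanley cup"],
    measure,
    threshold=2.0,
)
```

With these made-up scores the politics phrases end up in one cluster and the hockey phrases in another. A greedy pass depends on the order of the input, so treat it as a picture of the idea rather than a robust clustering method.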
Query Processing and Phrase Extensions
The system uses the phrases when searching for pages in response to a query. It identifies any phrases that are present in the query so it can look up the related ‘lists’ and phrase information for those query phrases. It can also handle an incomplete phrase in a search query; such a phrase may be identified and replaced by a phrase extension. If someone enters ‘the President of the’, the system may add the extension ‘United States’ to complete the phrase, based upon the same identification and weighting system.
For search result document selection, the ‘related phrase information’ can be used to choose which pages to include in the search result. For a query with two query phrases (‘NHL Hockey’, ‘Stanley Cup’), the stored list for the first query phrase is used to identify pages containing the first query phrase, then the ‘related phrase information’ is used to identify which of those pages also have the second query phrase. These pages are then included in the search results.
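A phrase extension could look something like the following Python sketch. The weights, the phrase list, and the ‘pick the heaviest match’ rule are assumptions for illustration only; the source does not say how the weighting is actually done.

```python
def extend_phrase(partial, phrase_weights):
    """Return the highest-weighted known phrase that starts with the words
    of `partial` and is longer than it, or None if nothing extends it."""
    prefix = partial.lower().split()
    best, best_weight = None, float("-inf")
    for phrase, weight in phrase_weights.items():
        words = phrase.lower().split()
        is_extension = len(words) > len(prefix) and words[:len(prefix)] == prefix
        if is_extension and weight > best_weight:
            best, best_weight = phrase, weight
    return best

# Hypothetical known phrases with made-up weights.
phrase_weights = {
    "the president of the united states": 9.0,
    "the president of the company": 2.5,
    "white house": 7.0,
}
completed = extend_phrase("the President of the", phrase_weights)
```

With these made-up weights, ‘the President of the’ is completed to "the president of the united states", because it outweighs the other candidate that begins with the same words.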
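The two-phrase selection step might be sketched in Python like this. The dictionaries stand in for the stored lists and are hypothetical, and the sketch assumes the second query phrase is one of the first phrase's related phrases (otherwise its own page list would have to be consulted as well).

```python
def select_documents(first, second, postings, related_present):
    """Pages that contain `first` and whose stored related-phrase data
    for `first` shows that `second` is present too."""
    return sorted(
        doc_id
        for doc_id in postings.get(first, ())
        if second in related_present.get(first, {}).get(doc_id, ())
    )

# Hypothetical stored lists for the phrase "nhl hockey".
postings = {"nhl hockey": {1, 2, 4}}
related_present = {
    "nhl hockey": {
        1: ["stanley cup"],
        2: [],
        4: ["stanley cup", "hockey night"],
    }
}
results = select_documents("nhl hockey", "stanley cup", postings, related_present)
```

Only pages 1 and 4 survive: page 2 contains the first phrase but not the second. The page list for ‘Stanley Cup’ is never touched in this sketch.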
Document Ranking and Duplicates
The ‘related phrase information’ is stored in a format that expresses the relative significance of each related phrase, with the most weight given to the related phrase that has the highest prediction measure. For a given page and a desired phrase, the ‘related phrase information’ can then be used to score the page in question. Pages that have ‘high order related phrases’ for the query phrase are treated as more topically related to the query than those that have a low rate of relation.
It can also be used to identify and eliminate duplicate pages, either while indexing (crawling) or when processing a search query. For this part let me quote, since it is in semi-English:
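One way to picture a storage format that ‘expresses relative significance’ is a row of bits, one per related phrase, ordered from the strongest related phrase to the weakest. That encoding is my assumption for this Python sketch, and the phrases are made up.

```python
def related_phrase_score(page_phrases, related_ranked):
    """related_ranked: related phrases of the query phrase, ordered from
    highest prediction measure to lowest.
    One bit is set per related phrase found on the page; the strongest
    phrase lands in the most significant bit, so the resulting integer
    can be compared directly to order pages."""
    score = 0
    for phrase in related_ranked:
        score = (score << 1) | (1 if phrase in page_phrases else 0)
    return score

# Hypothetical related phrases for the query phrase "nhl hockey", strongest first.
related_ranked = ["stanley cup", "ice rink", "hockey stick"]

page_a = {"nhl hockey", "stanley cup"}               # has only the strongest related phrase
page_b = {"nhl hockey", "ice rink", "hockey stick"}  # has the two weaker ones

score_a = related_phrase_score(page_a, related_ranked)  # bits 100
score_b = related_phrase_score(page_b, related_ranked)  # bits 011
```

Under this encoding page A outscores page B even though B contains more related phrases, because a single ‘high order’ phrase outweighs all the lower ones combined. Whether that trade-off is desirable is a design choice.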
“For a given document, each sentence of the document has a count of how many related phrases are present in the sentence. The sentences of document can be ranked by this count, and a number of the top ranking sentences (e.g., five sentences) are selected to form a document description.
This description is then stored in association with the document. During indexing, a newly crawled document is processed in the same manner to generate the document description. The new document description can be matched (e.g., hashed) against previous document descriptions, and if a match is found, then the new document is a duplicate. Similarly, during preparation of the results of a search query, the documents in the search result set can be processed to eliminate duplicates.”
In a sense each page gets a ‘related phrase information’ footprint that can be compared against the index at indexing to establish the originality of the page.
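The quoted procedure (rank sentences by related phrase count, keep the top few as a description, then hash and compare) can be sketched in Python as follows. The sentence splitting, the SHA-256 hash, and all of the names are my own assumptions; the quote only says the descriptions can be matched, e.g. hashed.

```python
import hashlib
import re

def document_description(text, related_phrases, top_n=5):
    """Rank sentences by how many related phrases they contain and join
    the top_n into a description. Ties keep their document order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    def phrase_count(sentence):
        lowered = sentence.lower()
        return sum(1 for phrase in related_phrases if phrase in lowered)

    ranked = sorted(sentences, key=phrase_count, reverse=True)
    return " ".join(ranked[:top_n])

def fingerprint(description):
    """Hash of the description text (SHA-256 chosen arbitrarily here)."""
    return hashlib.sha256(description.encode("utf-8")).hexdigest()

def is_duplicate(text, related_phrases, seen):
    """seen: set of fingerprints of descriptions already indexed.
    Returns True on a match; otherwise records the new fingerprint."""
    fp = fingerprint(document_description(text, related_phrases))
    if fp in seen:
        return True
    seen.add(fp)
    return False

# Hypothetical related phrases and pages.
related = ["stanley cup", "nhl hockey"]
seen = set()
page = "The Stanley Cup is the NHL hockey trophy. It is very old."
first_visit = is_duplicate(page, related, seen)
second_visit = is_duplicate(page, related, seen)
other_page = is_duplicate("Florida snakes include the black pine snake.", related, seen)
```

The first visit to the page is recorded as new and the identical second visit is flagged as a duplicate. An exact hash like this only catches pages whose descriptions match word for word; anything fuzzier would need a different comparison.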
That’s a wrap!
The patent also has some information pertaining to personalization of search results based on this system, as well as creating result descriptions. It is certainly interesting information, but I felt it was not as important to understanding the phrase based concepts that I did cover. There is much I have not covered; this truly is a cursory view.
Once again, the main goal of the exercise is not a ‘Eureka!’ moment. This is NOT a blueprint to Google or any other search algorithm. It is a journey into the mind of search engineers and how they deal with the problems they face. Absorb the information and walk away with just another tool in the virtual toolbox.
Resources
Based on the patent: Phrase-based searching in an information retrieval system
More: Multiple index based information retrieval system - Phrase-based generation of document descriptions - Phrase identification in an information retrieval system - Detecting spam documents in a phrase based information retrieval system
ADDED, Jan. 07: I ran into some interesting data on search user behavior relating to phrase usage.
"According to data collected from users of European Web analytics provider OneStat, most people use 2- or 3-word queries in search engines. The RankStat research is based on a sample of 2 million visitors, made up of 20,000 visitors in 100 countries each day.
Here's the full breakdown:
Two-word phrases -- 28.38 percent
Three-word phrases -- 27.15 percent
Four-word phrases -- 16.42 percent
One-word phrase -- 13.48 percent
Five-word phrases -- 8.03 percent
Six-word phrases -- 3.67 percent
Seven-word phrases -- 1.63 percent
Eight-word phrases -- 0.73 percent
Nine-word phrases -- 0.34 percent
Ten-word phrases -- 0.16 percent"
From: Search Engine Watch