A Probabilistic Learning Model

 

Method and apparatus for learning a probabilistic generative model for text - Click here for Original Patent

This is an interesting method that seeks to ‘teach’ the system how to relate various documents, or more appropriately, the TEXT within the documents, from semantics to link nodes. Or as stated at one point – “a system that learns concepts by learning an explanatory model of text”. This is something they have worked on for a while, as can be seen in the earlier related patents: Test classification system and method and Method and system for creating improved search queries.

Moving along….

 

In section 2 – Related Art – we have;

“Processing text in a way that captures its underlying meaning--its semantics--is an often performed but poorly understood task. This function is most often performed in the context of search engines, which attempt to match documents in some repository to queries by users. It is sometimes also used by other library-like sources of information, for example to find documents with similar content. In general, understanding the semantics of text is an extremely useful subcomponent of such systems. Unfortunately, most systems written in the past have only a rudimentary understanding, focusing only on the words used in the text, not the meaning behind them.”

Call me a Phrase Based Indexing and Retrieval junkie (and you’d be right), but once again the concepts apply. The whole PaIR methodology sought to do just this – further comprehend the actual meaning of a document/text block rather than simply looking at individual words. For those paying attention, Google showed interest in the direction of ‘semantics’ when it purchased Applied Semantics and its ‘Latent Semantic Analysis’ technologies back in 2003 – though presumably for their AdWords/AdSense program. So this is not a new direction.

 

 

Oh.. and is this a shot or what? Can’t say I recall seeing ‘Patent Flaming’ before…Ha ha ha...

“Currently, if the document has the words "cooking class palo alto" several of the leading search engines will not find it, because they do not know that the words "class" and "classes" are related, because one is a subpart--a stem--of the other.”

 

Anyways, point is that once again they seem to be looking for deeper relevance.

From a systemic standpoint it is an evolutionary process: each ‘new’ model becomes the current one and is retrained, over and over, to produce a continuously tightening ‘understanding’ of a given document and its related concepts, in part by ‘selectively introducing new cluster nodes into the current model’:

“In a variation on this embodiment, the system performs an iterative process which (1) considers the new model to be the current model, and (2) applies the training documents to the current model to produce a subsequent new model.”

The initial training set is derived by establishing a core comparison model (the initial current model) and a set of documents with related terms. From there the training set grows geometrically with each pass – ‘each iteration uses twice as many training documents as the previous iteration until all available training documents are used.’
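To make that doubling schedule a bit more concrete, here is a minimal Python sketch. The names doubling_schedule, train, initial_batch and update_model are my own placeholders and do not come from the patent; the only thing borrowed from it is that each pass uses twice as many training documents as the last, and that the new model becomes the current model for the next pass.

```python
def doubling_schedule(total_docs, initial_batch):
    """Yield how many training documents each iteration uses."""
    if initial_batch <= 0:
        raise ValueError("initial_batch must be positive")
    size = initial_batch
    while size < total_docs:
        yield size
        size *= 2  # each iteration uses twice as many documents
    yield total_docs  # the last pass uses every available document


def train(documents, initial_batch, update_model, model=None):
    """Feed the current model back in as the starting point of each pass.

    update_model is a hypothetical callback (model, docs) -> new model.
    """
    for n in doubling_schedule(len(documents), initial_batch):
        model = update_model(model, documents[:n])
    return model
```

With 1,000 documents and a first batch of 100, the passes would use 100, 200, 400, 800 and finally all 1,000 documents.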

 

On the topic of ‘learning’

“In learning a generative model of text, in one embodiment of the present invention some source of text must be chosen. Some considerations in such a choice are as follows:

1. it should have related words in close proximity;
2. it should present evidence that is independent, given the model we are trying to learn (more on this later); and
3. it should be relevant to different kinds of text. For this reason, the implementation of the model which follows uses exemplary "query sessions" from a search engine as its small pieces of text.

We have also implemented and run our model on web pages and other sources of text, but for the purposes of making this exposition more concrete, we focus on the analysis of query sessions.”

Once again, the semantic and phrase-based models seem to be a factor here, coupled with the ‘probability’ aspects that have been introduced.

“… we define a query session (also referred to as a user session or a session) as the set of words used by a single user on a search engine for a single day. Often users will search for related material, issuing several queries in a row about a particular topic. Sometimes, these queries are interspersed with random other topics.”
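As a rough illustration of that definition, the sketch below groups a query log into per-user, per-day sets of words. The log format (user id, timestamp, query string) and the function name build_sessions are assumptions of mine for the example; the patent does not describe how the log is stored.

```python
from collections import defaultdict
from datetime import datetime


def build_sessions(query_log):
    """Group queries into sessions: one set of words per (user, day).

    query_log is assumed to be an iterable of
    (user_id, timestamp, query_text) tuples.
    """
    sessions = defaultdict(set)
    for user_id, timestamp, query in query_log:
        day = timestamp.date()
        # A session is a set, so repeated words are counted once.
        sessions[(user_id, day)].update(query.lower().split())
    return dict(sessions)
```

Two queries from the same user on the same day, say "cooking classes palo alto" and "palo alto restaurants", end up in one session containing cooking, classes, palo, alto and restaurants.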

 

So there seems to be an interesting interaction with the end user for further understanding potentially related terms for a given query. It is an interesting insight into how Google treats the query data to further ‘teach’ or refine the system’s understanding of related terms. It seeks to understand the user query models to establish probable relationships. In this model, it leans less on indexed documents to understand the human mind than on the humans themselves.

 

The System – section has some interesting stuff.

“One embodiment of the system considers the important information in a piece of text to be the words (and compounds) used in the text. For example in the query "cooking classes palo alto" the words are "cooking" and "classes", and the compounds consist of the simple compound "palo alto". Distinguishing compounds from words is done on the basis of compositionality. For example, "cooking classes" is not a compound because it is about both cooking and classes. However "palo alto" is not about "palo" and "alto" separately. This is sometimes a hard distinction to make, but good guesses can make such a system better than no guesses at all.”
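Here is a small sketch of what splitting a query into words and compounds might look like. It assumes a hand-supplied set of known two-word compounds, which is my own shortcut; the patent talks about judging compositionality and does not say the compounds come from a fixed list. The name split_terms is likewise hypothetical.

```python
def split_terms(query, compounds):
    """Split a query into plain words and known two-word compounds.

    compounds is an assumed, hand-supplied set of (first, second)
    word pairs that should be treated as a single unit.
    """
    tokens = query.lower().split()
    words, found = [], []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in compounds:
            found.append(" ".join(pair))
            i += 2  # consume both halves of the compound
        else:
            words.append(tokens[i])
            i += 1
    return words, found
```

Given the compound set {("palo", "alto")}, the query "cooking classes palo alto" splits into the words cooking and classes plus the compound palo alto, matching the patent's own example.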

So once again, as in the Phrase Based work, looking for the semantic relationships of words is an important aspect of the exercise. ‘German’ and ‘Shepherd’ each have their own meaning separately, and a different one together… a dog in this case. Teaching the system to understand these concepts is the trick.

Another example given is ‘to explain why the words grey and skies often occur together, why the words grey and elephant often occur together, but yet why the words "elephant" and "skies" rarely occur together.’ The idea is to ‘teach’ the system that words express ‘concepts’ or themes rather than simply text.

 

The Slow spots

From there the ‘probability’ model stuff is addressed, along with random generators (rolling the dice – ha ha) for the various nodes… boring stuff. It’s basically describing the model for ‘teaching’ the system – ‘this model captures the correlation between the words grey and elephant, grey and skies, but not elephant and skies.’
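For anyone who wants to see the grey/elephant/skies idea in numbers, here is a toy sketch. It assumes a noisy-OR style combination (a word shows up if at least one active cluster triggers it) and that words are independent once the active clusters are known. The cluster names, priors and link weights are all invented for illustration and are not taken from the patent.

```python
from itertools import product

# Hypothetical clusters: chance that each is active in a piece of text.
CLUSTER_PRIOR = {"ELEPHANT": 0.1, "SKY": 0.1}

# Hypothetical link weights: chance an active cluster triggers a word.
LINKS = {
    "ELEPHANT": {"grey": 0.5, "elephant": 0.8},
    "SKY": {"grey": 0.5, "skies": 0.8},
}


def p_word_given_active(word, active):
    """Noisy-OR: the word appears unless every active cluster fails to fire it."""
    p_absent = 1.0
    for cluster in active:
        p_absent *= 1.0 - LINKS[cluster].get(word, 0.0)
    return 1.0 - p_absent


def p_both(word_a, word_b):
    """Probability that both words appear, summed over all cluster states."""
    names = list(CLUSTER_PRIOR)
    total = 0.0
    for states in product([False, True], repeat=len(names)):
        p_state = 1.0
        active = []
        for name, is_on in zip(names, states):
            p_state *= CLUSTER_PRIOR[name] if is_on else 1.0 - CLUSTER_PRIOR[name]
            if is_on:
                active.append(name)
        total += (p_state
                  * p_word_given_active(word_a, active)
                  * p_word_given_active(word_b, active))
    return total
```

With these made-up numbers, p_both("grey", "elephant") and p_both("grey", "skies") come out several times larger than p_both("elephant", "skies"), because ‘elephant’ and ‘skies’ only meet when both clusters happen to be active at once.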

 

It then goes into some of the scalability issues, to keep the beast from eternally adding to its usage of ‘probabilistic networks’, which would obviously eat up more resources than is realistic. While interesting to me, there’s not much of real use there for the average reader of this post. There are also considerable descriptions of the computational models involved, where once again I am not seeing any major tidbits that we haven’t already generalized up to this point.

 

Differential Text Source Adjustment Techniques

Now, a lot of this so far was based on learning from user queries, but they remind us that:

“We have been discussing our model in the context of query sessions. However, as pointed out at the beginning of the disclosure, our model can be run on any source of text, such as web documents. One interesting technique we have developed is in training our model on one source of data, while applying it on another source.”

So, the process (model) can be used on queries, web pages and presumably other aspects within the universal search type of model. One can also surmise they can cross-reference the networks of varied source data. While there are certain differences in the concepts and probabilities from one source to another, some adjustments are discussed to deal with them. Point being, from here at least, is that they are still thinking ‘universal’. It also covers the issue of duplicates and how to deal with them:

“Large numbers of web pages are copies of each other, cut and pasted into different web servers. Training our model on all of these together is a little bit wasteful as it ends up learning exactly the repeated copies, without any of the hidden meaning behind them. In order to reduce this problem, one can eliminate all repeated runs of say N or more words (N is typically 10 or so) from a large set of documents. This can be done by fingerprinting all sequences of N words, sorting the fingerprints so as to group them, then iterating back over the training text removing words that are at the start of a 10 word run that is seen more than once. This technique has been applied with our model when training on web pages.”
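The duplicate-stripping step above can be sketched in a few lines of Python. This is a loose reading of the quoted description and not the patent's implementation: I count fingerprints with a Counter in place of sorting them into groups, I use SHA-1 as the fingerprint, and the names fingerprint and strip_repeated_runs are my own.

```python
import hashlib
from collections import Counter


def fingerprint(words):
    """Fingerprint a run of words (SHA-1 is an arbitrary choice here)."""
    return hashlib.sha1(" ".join(words).encode("utf-8")).hexdigest()


def strip_repeated_runs(documents, n=10):
    """Drop every word that starts an n-word run seen more than once."""
    tokenized = [doc.split() for doc in documents]

    # Pass 1: count how often each n-word run occurs across all documents.
    counts = Counter(
        fingerprint(words[i:i + n])
        for words in tokenized
        for i in range(len(words) - n + 1)
    )

    # Pass 2: keep a word unless it begins a run that occurs at least twice.
    cleaned = []
    for words in tokenized:
        kept = [
            word
            for i, word in enumerate(words)
            if i + n > len(words) or counts[fingerprint(words[i:i + n])] < 2
        ]
        cleaned.append(" ".join(kept))
    return cleaned
```

Note that, read literally, this removes only the word at the start of each repeated run, so the last n-1 words of a copied passage survive in every copy. A long copied block still mostly disappears, since nearly every word in it starts some repeated run.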

 

Demonstration

To understand the relevance triggers, the following is useful:

“The first terminal is "jobs". The information on the left, 1841287, is the number of times this cluster triggers the word "jobs". The information to the right of the word is again its best value and log likelihood of existence. The next few words are "job", "employment", "in", "job-search", "careers", "it", "career", "job-opportunities", "human-resources", and so on. All of these terminals are used when people talk about the concept of jobs! Note that many more terminals are linked to from this cluster, and only the most significant ones are displayed in this figure.”

The system looks not only at the core term but at common related terms used when discussing the concept of ‘jobs’. The weight is dependent on the occurrences within the existing matrix. There is a bunch of stuff on the ‘Parent’ and ‘Child’ node relationships… but I shall leave that for now. In short, the probability is a product of the strongest related nodes (terms).

If a search query does not produce enough results from the sets of documents, the system can break it down into its compound parts to seek related results. For ‘Palo Alto Restaurants’, if there were not enough results, it would look to pages containing ‘Palo Alto’ and see if they contained related words such as ‘dining’, ‘bar’, ‘café’, ‘nightlife’ and other related terms in the ‘restaurant’ cluster. It could also use more nodes related to ‘Palo Alto’ such as

 

Then there is the matter of dealing with generic terms that can have a few different conceptual meanings:

“A different way of using our model for web search is to assume that the distribution of clusters extends the query. For example, a query for the word "jaguar" is ambiguous. It could mean either the animal or the car. Our model will identify clusters that relate to both meanings in response to this search. In this case, we can consider that the user typed in one of either two queries, the jaguar (CAR) query or the jaguar (ANIMAL) query. We can then retrieve documents for both of these queries taking into account the ratio of their respective clusters' probabilities. By carefully balancing how many results we return for each meaning, we assure a certain diversity of results for a search.”
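One simple way to picture ‘taking into account the ratio of their respective clusters' probabilities’ is to split the result slots on a page in proportion to those probabilities. The sketch below does that with a largest-remainder rounding rule, which is my own choice; the patent does not say how the balancing is done, and allocate_slots and the probabilities in the usage note are hypothetical.

```python
def allocate_slots(cluster_probs, total_slots):
    """Split result slots across meanings in proportion to cluster probability."""
    total = sum(cluster_probs.values())
    if total <= 0:
        raise ValueError("cluster probabilities must sum to a positive value")

    # Ideal (fractional) share of the page for each meaning.
    shares = {c: p / total * total_slots for c, p in cluster_probs.items()}

    # Start from the whole-number part of each share.
    slots = {c: int(share) for c, share in shares.items()}

    # Hand out any leftover slots to the largest fractional remainders.
    leftover = total_slots - sum(slots.values())
    by_remainder = sorted(shares, key=lambda c: shares[c] - slots[c], reverse=True)
    for cluster in by_remainder[:leftover]:
        slots[cluster] += 1
    return slots
```

If the CAR cluster came out at 0.6 and the ANIMAL cluster at 0.4, a ten-result page would get six car results and four animal results.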

For an example - http://www.google.com/search?source=ig&hl=en&q=jaguar - if you notice, even the Ads are a mixture.

 

As for Usage

“… our model can be used to estimate the probabilities that various concepts are present in any piece of text. The same can be done for web pages as well, and by looking at the joint concepts present in a web page and a query, one of the uses of our model is for a search over web pages.”

Once again, while there is no direct mention of actual ranking mechanisms, we can see that it certainly can establish a ‘score’ for a page which could be used in other pre-existing ranking mechanisms in the indexing and retrieval system.

Some uses they highlight are;

“Guessing at the concepts behind a piece of text. The concepts can then be displayed to a user allowing the user to better understand the meaning behind the text.

Comparing the words and concepts between a document and a query. This can be the information retrieval scoring function that is required in any document search engine, including the special case where the documents are web pages.”

While it can certainly be used to understand user queries better, there are implications relating to the actual ‘scoring’ (ranking) of documents (which we will get into next time). As with all additions, unless you know the weight given to any such factor, it is hard to say how important it ultimately is in the ranking process.

 

What’s the point?

What does this mean to your SEO work? Well, if Content is King, then Relevance (within the content) is the Queen.. and we all know who really runs the kingdom, don’t we?

This really isn’t as much about ranking documents as it is about how they are trying to perceive concepts in text, to better define how information is processed and to assign probable relevance to given sets of words and concepts. If you are looking to take something away to help you in your SEO, it is that the days of KW density are giving way to relevance in the accompanying terms/words/phrases on a given page – and, one would assume, in link text as well.

 

I’d like to thank Bill Slawski for inviting me to play along with him on this, or as one of my cohorts called it – Bill and Dave’s Patent Adventure.


(also see related parts; Part II and Part III and the summary on my Blog)  

 

Resources - this is part of a 3 part series - Summary; Relevance through end user metrics - Learning a probabilistic generative model for text - Ranking documents based upon large data sets - Using concepts for Ad Targeting.

Original Patent - Method and apparatus for learning a probabilistic generative model for text

Patents of further interest - Query revision using known highly-ranked queries - User Distributed Search Results - Systems and methods for analyzing a user's web history - Systems and methods for modifying search results based on a user's history - Methods and systems for opportunistic cookie caching - 2002; Methods and apparatus for employing usage statistics in document retrieval -


 

Knowledge Base
Keyword or Phrase research and targeting practices

This article is not aimed at the advanced level reader here folks. I talk to many people that don't even spend a passing thought on the KW/P research - this posting is written to give the reader a foundation - not as an advanced guide. Of course, this being an inexact science, there could be some gems in here for the ragged SEO warriors as well.

For a while now, I have been intending to give my perspective on Keyword/Phrase (KW/P) research and targeting. Why? Because I think it is an essential if not mandatory part of any SEO campaign. Each and every term you target will have a cost associated with it. This is where the rubber meets the road. In simplest terms, if I spend $500 on link building, $200 on content creation and $100 on site adjustments, then I am looking at $800 invested. Once we reach the desired ranking, how long will it take to recoup that cash (including ongoing maintenance)? In short, where is the ROI and what is the ‘break even’ point? That is a very simplistic example, but hopefully you get the idea.
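To put the break-even question into numbers, here is a tiny Python sketch. The $800 figure is the example above; the monthly revenue and maintenance values in the usage note are made up purely for illustration, and break_even_months is a hypothetical helper.

```python
import math


def break_even_months(upfront_cost, monthly_revenue, monthly_maintenance=0.0):
    """Months needed for net monthly income to cover the upfront spend.

    Returns None when the term never pays for itself.
    """
    net_per_month = monthly_revenue - monthly_maintenance
    if net_per_month <= 0:
        return None
    return math.ceil(upfront_cost / net_per_month)
```

With the $800 invested above, a term bringing in $250 a month against $50 a month in upkeep would break even after four months, while one that earns only what its upkeep costs never does.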

Simply aiming to be #1 on Google is foolhardy since some terms contain little reward (traffic). Remember, cash that gets tied up chasing non-performing terms is money that could have been used elsewhere in your marketing endeavors. So, this is certainly an important step in the SEO process. Mistakes here can be very costly later on and in the overall ‘big picture’ that is the site’s financial health.

 
