A Probabilistic Learning Model
Method and apparatus for learning a probabilistic generative model for text
This is an interesting method that seeks to ‘teach’ the system how to relate various documents, or more appropriately, the TEXT within the documents, from semantics to link nodes. Or, as stated at one point, “a system that learns concepts by learning an explanatory model of text”. This is something they have worked on for a while, as can be seen in the earlier related patents: Test classification system and method, and Method and system for creating improved search queries. Moving along….
In section 2 – Related Art – we have;
Call me a Phrase Based Indexing and Retrieval junky (and you’d be right), but once again the concepts apply. The whole PaIR methodology sought to do just this: further comprehend the actual meaning of a document or text block rather than simply looking at individual words. For those paying attention, Google showed interest in the direction of ‘semantics’ when it purchased Applied Semantics and its ‘Latent Semantic Analysis’ technologies in 2003, though presumably for its AdWords/AdSense program. So this is not a new direction.
Oh, and is this a shot or what? Can’t say I recall seeing ‘Patent Flaming’ before… Ha ha ha... “Currently, if the document has the words ‘cooking class palo alto’ several of the leading search engines will not find it, because they do not know that the words ‘class’ and ‘classes’ are related, because one is a subpart--a stem--of the other.”
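To make the stemming complaint concrete, here is a minimal sketch of how reducing plurals to a stem lets a query match that document. It borrows only the plural-stripping rules from step 1a of the Porter stemmer; the `stem_plural` and `matches` helpers and the sample query are my own assumptions for illustration, not anything described in the patent.

```python
def stem_plural(word):
    """Strip common plural endings (Porter stemmer step 1a only)."""
    if word.endswith("sses"):
        return word[:-2]   # classes -> class
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # class stays class
    if word.endswith("s"):
        return word[:-1]   # restaurants -> restaurant
    return word


def matches(query, document, use_stems):
    """True if every query word appears in the document, optionally comparing stems."""
    normalize = stem_plural if use_stems else (lambda w: w)
    doc_terms = {normalize(w) for w in document.lower().split()}
    return all(normalize(w) in doc_terms for w in query.lower().split())


document = "cooking class palo alto"
query = "cooking classes palo alto"  # hypothetical user query

exact_match = matches(query, document, use_stems=False)
stemmed_match = matches(query, document, use_stems=True)
```

Exact word matching misses the page because ‘classes’ never appears in it, while the stemmed comparison treats ‘class’ and ‘classes’ as the same term.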
Anyways, the point is that once again they seem to be looking for deeper relevance. From a systemic standpoint it is an evolutionary process: each ‘new’ model is applied onto the existing one, over and over, to produce a continuously tightening ‘understanding’ of a given document and of generally related concepts, which are worked into the model by ‘selectively introducing new cluster nodes into the current model.’
The initial training set is derived by establishing a core comparison model (the ‘initial current model’) and a set of documents with related terms. Each later pass then grows geometrically: ‘each iteration uses twice as many training documents as the previous iteration until all available training documents are used.’
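The doubling rule quoted above can be sketched in a few lines. This is only an illustration of the schedule; the function name, the starting batch size and the document total are assumptions of mine, since the post does not give the patent's actual numbers.

```python
def doubling_schedule(total_docs, initial_batch):
    """Yield how many training documents each iteration uses.

    Each iteration uses twice as many documents as the one before,
    and the final iteration uses every available document.
    """
    if total_docs <= 0 or initial_batch <= 0:
        raise ValueError("total_docs and initial_batch must be positive")
    size = initial_batch
    while size < total_docs:
        yield size
        size *= 2
    yield total_docs


# Hypothetical numbers: 1,000 documents available, starting with 100.
schedule = list(doubling_schedule(total_docs=1000, initial_batch=100))
```

With these made-up numbers the model would be refined over five passes, the last one covering the full set.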
On the topic of ‘learning’ –
Once again, the semantic and phrase based models seem to be a factor here coupled with the ‘probability’ aspects that have been introduced. “we define a query session (also referred to as a user session or a session) as the set of words used by a single user on a search engine for a single day. Often users will search for related material, issuing several queries in a row about a particular topic. Sometimes, these queries are interspersed with random other topics.”
So there seems to be an interesting interaction with the end user for further understanding potentially related terms for a given query. It is an interesting insight into how Google treats query data to further ‘teach’ or refine the system’s understanding of related terms: it seeks to understand user query patterns to establish probable relationships. In this model, the system learns how people think less from indexed documents than from the people themselves.
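The session definition quoted above (all the words one user sends to the search engine in one day) is easy to picture as a grouping step over a query log. A rough sketch, assuming a hypothetical log of `(user_id, timestamp, query)` tuples; the log format and the `build_sessions` helper are mine, not the patent's.

```python
from collections import defaultdict
from datetime import datetime


def build_sessions(query_log):
    """Group query words into one set per (user, calendar day)."""
    sessions = defaultdict(set)
    for user_id, timestamp, query in query_log:
        key = (user_id, timestamp.date())
        sessions[key].update(query.lower().split())
    return dict(sessions)


# Hypothetical log entries: (user_id, timestamp, query text).
log = [
    ("u1", datetime(2007, 5, 1, 9, 0), "german shepherd puppies"),
    ("u1", datetime(2007, 5, 1, 9, 5), "dog training"),
    ("u1", datetime(2007, 5, 1, 13, 0), "palo alto restaurants"),
    ("u1", datetime(2007, 5, 2, 8, 0), "grey skies"),
    ("u2", datetime(2007, 5, 1, 10, 0), "jaguar"),
]
sessions = build_sessions(log)
```

Note how the first user's lunchtime restaurant search lands in the same session as the dog queries. That is the ‘interspersed with random other topics’ noise the quote mentions, and it is why the model has to work with probabilities rather than treating every co-occurrence as meaningful.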
The System – section has some interesting stuff.
So once again, as in the Phrase Based work, looking for the semantic relationships of words is an important aspect of the exercise. ‘German’ and ‘Shepherd’ each have a meaning of their own, and a different one together… a dog in this case. Teaching the system to understand these concepts is the trick. Another example given is; ‘to explain why the words grey and skies often occur together, why the words grey and elephant often occur together, but yet why the words "elephant" and "skies" rarely occur together.’ The aim is to ‘teach’ the system that words express ‘concepts’ or themes rather than simply text.
The Slow Spots
From there the ‘probability’ model stuff is addressed, along with random generators (rolling the dice, ha ha) for the various nodes… boring stuff. It’s basically describing the model for ‘teaching’ the system: ‘this model captures the correlation between the words grey and elephant, grey and skies, but not elephant and skies’.
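For anyone who wants to see the dice rolling, here is a toy sketch of that kind of generative model. Two hidden concept nodes each switch on with some probability, and an active concept emits each of its linked words with some probability. The cluster names and every number below are made up for illustration; this is not the patent's actual network or its parameters.

```python
import random
from collections import Counter
from itertools import combinations

# Hypothetical toy model: two concept clusters with invented probabilities.
CLUSTER_PRIOR = {"elephant_concept": 0.1, "weather_concept": 0.1}
LINKS = {
    "elephant_concept": {"grey": 0.8, "elephant": 0.8},
    "weather_concept": {"grey": 0.8, "skies": 0.8},
}


def sample_session(rng):
    """Roll the dice: choose active clusters, then let each one emit words."""
    words = set()
    for cluster, prior in CLUSTER_PRIOR.items():
        if rng.random() < prior:
            for word, link_probability in LINKS[cluster].items():
                if rng.random() < link_probability:
                    words.add(word)
    return words


def cooccurrence_counts(n_sessions, seed=0):
    """Count how often each pair of words lands in the same sampled session."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_sessions):
        for pair in combinations(sorted(sample_session(rng)), 2):
            counts[pair] += 1
    return counts


counts = cooccurrence_counts(20000)
grey_elephant = counts[("elephant", "grey")]
grey_skies = counts[("grey", "skies")]
elephant_skies = counts[("elephant", "skies")]
```

Because ‘grey’ hangs off both concepts while ‘elephant’ and ‘skies’ share none, the grey pairs show up far more often than the elephant/skies pair, which only occurs when both concepts happen to be active in the same session.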
It then goes into some of the scalability issues, to keep the beast from eternally adding to its usage of ‘probabilistic networks’, which would obviously eat up more resources than is realistic. While interesting to me, there’s not much of real use there for the average reader of this post. There are also considerable descriptions of the computational models involved, but once again I am not seeing any major tidbits that we haven’t already covered in general terms up to this point.
Differential Text Source Adjustment Techniques
Now, a lot of this so far was based on learning from user queries, but they remind us that; “We have been discussing our model in the context of query sessions. However, as pointed out at the beginning of the disclosure, our model can be run on any source of text, such as web documents. One interesting technique we have developed is in training our model on one source of data, while applying it on another source.” So, the process (model) can be used on queries, web pages and presumably other aspects within the universal search type of model. One can also surmise they can cross reference the networks of varied source data. While concepts and probabilities differ somewhat from one source to another, some adjustments are discussed to deal with that. The point being, from here at least, that they are still thinking ‘universal’. It also covers how to handle duplicates.
Demonstration
To understand the relevance triggers, the following is useful;
The system looks not only at the core term but at common related terms used when discussing the concept of ‘jobs’. The weight is dependent on the occurrences within the existing matrix. There is a bunch of stuff on the ‘Parent’ and ‘Child’ node relationships… but I shall leave that for now. In short, the probability is a product of the strongest related nodes (terms). If a search query does not produce enough results from the sets of documents, the system can break it down into its component parts to seek related results. Take ‘Palo Alto Restaurants’: should it not return enough results, the system would look to pages containing ‘Palo Alto’ and see if they contained related words such as ‘dining’, ‘bar’, ‘café’, ‘nightlife’ and other related terms in the ‘restaurant’ cluster. It could also use more nodes related to ‘Palo Alto’ such as
Then we have dealing with generic terms that can have a few different conceptual meanings –
For an example, see http://www.google.com/search?source=ig&hl=en&q=jaguar and notice that even the ads are a mixture.
As for Usage – “… our model can be used to estimate the probabilities that various concepts are present in any piece of text. The same can be done for web pages as well, and by looking at the joint concepts present in a web page and a query, one of the uses of our model is for a search over web pages.” Once again, while there is no direct mention of actual ranking mechanisms, we can see that it certainly can establish a ‘score’ for a page, which could be used in other pre-existing ranking mechanisms in the indexing and retrieval system. Some uses they highlight are;
While it can certainly be used to understand user queries better, there are implications relating to the actual ‘scoring’ (ranking) of documents, which we will get into next time. As with all additions, unless you know the weight given to any such factor, it is hard to say how important it ultimately is in the ranking process.
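To picture how ‘joint concepts present in a web page and a query’ might turn into a score, here is one simple possibility: multiply the probability of each concept in the query by its probability in the page and add them up. The formula, the concept names and all the numbers are my own assumptions for illustration; the patent, as quoted here, does not spell out a scoring function.

```python
def concept_overlap(query_concepts, page_concepts):
    """Sum, over shared concepts, of query probability times page probability."""
    return sum(
        probability * page_concepts.get(concept, 0.0)
        for concept, probability in query_concepts.items()
    )


# Hypothetical concept probabilities for the ambiguous query 'jaguar'.
query = {"cars": 0.5, "big_cats": 0.5}

# Hypothetical concept probabilities for three pages.
car_page = {"cars": 0.9, "luxury": 0.4}
cat_page = {"big_cats": 0.8, "wildlife": 0.7}
cooking_page = {"cooking": 0.9}

car_score = concept_overlap(query, car_page)
cat_score = concept_overlap(query, cat_page)
cooking_score = concept_overlap(query, cooking_page)
```

An ambiguous query spreads its weight over several concepts, so pages about either meaning score above zero while an unrelated page scores nothing. That lines up with the mixed ‘jaguar’ results, though how much weight such a score would carry in real ranking is anyone's guess.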
What’s the point?
What does this mean to your SEO work? Well, if Content is King, then Relevance (within the content) is the Queen.. and we all know who really runs the kingdom, don’t we? This really isn’t as much about ranking of documents as it is about how they are trying to perceive concepts in text, to better define how information is processed and to assign probable relevance to given sets of words and concepts. If you are looking for a takeaway to help your SEO, it is to further realize that the days of keyword density are giving way to relevance in the accompanying terms, words and phrases on a given page, and, one would assume, ultimately in link text as well.
I’d like to thank Bill Slawski for inviting me to play along with him on this, or as one of my cohorts called it, Bill and Dave’s Patent Adventure.
Resources
This is part of a 3 part series:
- Summary; Relevance through end user metrics
- Learning a probabilistic generative model for text
- Ranking documents based upon large data sets
- Using concepts for Ad Targeting
Original Patent:
- Method and apparatus for learning a probabilistic generative model for text
Patents of further interest:
- Query revision using known highly-ranked queries
- User Distributed Search Results
- Systems and methods for analyzing a user's web history
- Systems and methods for modifying search results based on a user's history
- Methods and systems for opportunistic cookie caching
- 2002; Methods and apparatus for employing usage statistics in document retrieval