A guide to assessing search patents
Well, what can I say? It just never ends. Oh no, my friends, start shaking yer heads... We talk about the LSI/Google crap not once, but two and three times. Behavioural data and, oh, I dunno, bounce rates a few times? How about the most important one, the Magic Bullet (2010 and 2007 versions). Time and time again, it seems to keep happening; SEOs grasping at straws.
Which brings us to this week's offender; the Reasonable Surfer (another Bill Slawski production – in theatres soon!). I have seen way too many (misguided) posts on this one. There's no bloody need to be pointing fingers at this point as it's been rampant the last while.
First off, let's get this straight; I love IR papers and patents, and Bill too. Any self-respecting search geek should. That's not the issue. It is more about SEO bloggers/media twisting it into present-day implications, better known as stating theories as fact. Those of us schooled at Slawski U know better.
What is this all about?
First off, the patent;
Ranking documents based on user behavior and/or feature data
Invented by Jeffrey A. Dean, Corin Anderson and Alexis Battle
Assigned to Google Inc.
United States Patent 7,716,225
Granted May 11, 2010
Filed: June 17, 2004
Catch that last one? You know, that little minor technicality of it being filed some 6 years ago? This means we must take the leap that, if it was incorporated, it has likely been heavily augmented, weighted and dampened on all levels. It is a history lesson.
What were they looking at?
“...generating a model based on user behavior data associated with a group of documents. The method may also include assigning weights to links based on the model, where the links may include references from first documents to second documents in a set of documents”
“...may also include means for assigning weights to references in a set of documents based on the model and means for assigning ranks to documents in the set of documents based on the weights assigned to the references.”
“...a reasonable surfer model that indicates that when a surfer accesses a document with a set of links, the surfer will follow some of the links with higher probability than others. This reasonable surfer model reflects the fact that not all of the links associated with a document are equally likely to be followed. Examples of unlikely followed links may include "Terms of Service" links, banner advertisements, and links unrelated to the document. “
Of note, the world of behavioural data wasn't nearly what it is today. Anyway, this patent looks at a simpler level of query analysis from existing data sets. They are looking at identifying less valuable types of links based on user interactions.
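To make the core idea concrete, here's a minimal sketch (my own toy graph, weights and damping value, nothing from the patent) of how a 'reasonable surfer' differs from the classic random surfer: instead of splitting a page's vote equally across its links, each link gets a share proportional to how likely a user is to actually click it.

```python
# Toy comparison of the classic "random surfer" PageRank vs the patent's
# "reasonable surfer" idea: weight each link by click likelihood instead of
# splitting a page's vote equally. Graph, weights and damping are invented.

def page_rank(links, weights=None, damping=0.85, iterations=50):
    """links: {page: [target, ...]}; weights: {(page, target): click weight}."""
    pages = set(links) | {t for ts in links.values() for t in ts}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / len(pages) for p in pages}
        for page, targets in links.items():
            if not targets:
                continue
            if weights is None:               # random surfer: equal split
                share = {t: 1.0 / len(targets) for t in targets}
            else:                             # reasonable surfer: by weight
                total = sum(weights[(page, t)] for t in targets)
                share = {t: weights[(page, t)] / total for t in targets}
            for target, s in share.items():
                new[target] += damping * rank[page] * s
        rank = new
    return rank

links = {"home": ["article", "terms"], "article": ["home"], "terms": []}
# A footer "Terms of Service" link is rarely followed, so it gets a low weight.
weights = {("home", "article"): 0.9, ("home", "terms"): 0.1,
           ("article", "home"): 1.0}
random_surfer = page_rank(links)
reasonable = page_rank(links, weights)
# the article link gains value, the Terms link is dampened
```

Same graph, same damping; the only change is how a page's vote is divided among its links, which is the whole of the 'reasonable' part.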
Know thine enemy well
Next we take another logical step and look at a few authors of it; of note, Jeffrey Dean. We can see he is a Google research geek. And on his profile there are some interesting tidbits such as;
“Some aspects of our search ranking algorithms, notably improved handling for dealing with off-page signals such as anchor text.”
“The design and implementation of prototyping infrastructure for rapid development and experimentation with new ranking algorithms.”
Ok, interesting to see some of the background. While it isn't his mainstay, he certainly does seem to have some experience with not only links, but a passion for infrastructure as well.
After that you'd maybe want to listen to/watch these;
Challenges in Building Large-Scale Information Retrieval Systems – Video Lectures
When you're done with that? Then look into some of the other folks, such as Corin Anderson and his papers (he doesn't seem to be with Google any longer), or Alexis Battle, last seen on the Google AU blog (interesting paper here). Then tell me, what do you see from these folks' past?
- Recommendation (engine)
- Learning mechanisms
- Links/ranking features
An interesting mix of talents. This can often be very helpful in better understanding the mindsets of the people behind the technology. This is an important part of any journey into better understanding what you're looking at (as far as patents/papers are concerned). They have a trifecta of links, optimization (processing) and behavioural in this crew.
Still with me? Sure you are. Now that we have that, let's start to look at some of the tidbits that are the 'reasonable surfer'.
Some Reasonable Thinking
Now that we have the timeline and some background on the engineers involved, we can start to look at this particular filing. What needs to be done, and hasn't in much of the coverage so far, is to add some qualifiers to the elements this patent raises. We can't simply take them at face value, we need to use the Art of SEO. This means assigning some possible value to some of these based on research and experience. If we assume the trail begins in '04, let us see if we can work out where it went from there.
We can start a framework (for further discussion) on what value they hold.
Features associated with a link;
The position of the link; for this one we're looking at location and, of course, page segmentation elements. That means the 'where' is important. There is evidence to suggest that header/footer links are the least desirable, followed by sidebars. Contextual (in-content) links are the most preferred, with above the fold being the prime real estate. Does this mean the user data they have bears this out?
It seems a reasonable assumption (even Matt has said there is location segmentation).
Weighting; of the signals we're looking at today, this one would certainly have the potential to be valuable. Position can also be used for spam detection (paid links) and as a dampener.
Number of words in anchor text of a link; for me a good area to be in is 2-5 words. Anything shorter, other than brand terms, is often 'here' or 'link' or 'said'. One certainly doesn't want to have links with overly long text either; not only poor targeting, but plenty of room for dilution. What is the perfect text length for optimal CTR?
Weighting; this one I'd certainly put into the mid-range category. Ultimately I would say it is beneficial more as a dampening factor (on longer anchor text?) than anything else.
Font color and/or other attributes of the link (e.g., italics, gray, same color as background, etc.); once more looking at prominence items, one small step above a plain text link. And obviously the 'same color as background' denotes some potential for spam detection/dampening. It stands to reason that links that (once more) have prominence would likely be more clicked.
Weighting; far more likely a good tool for spam assassins. I probably would not see a lot of strength in these signals. Call it a mid-range consideration.
Font size of anchor text associated with the link; Is it regular text? Is it a larger H1-2 or smaller H4-5? This might give a search engine a sense of the value or importance of the link. A larger font (link) is more likely to be selected than a smaller one. A 'reasonable' assumption.
Weighting; I can't see this one having a whole lot of git-up in it. Obviously a more prominent link is desired, but for link weighting? Not so sure this holds a ton of water. Sure, users are more likely to click, but that's a facet of the prominence, not really the value of the link to me.
Position of link when in list element; this is interesting in that I've talked about prominence factors in the past including lists, but never links within a list. Concept here being that more prominent placement garners a boost.
I can't see something like this being a great way to valuate a link due to the propensity of click bias at play.
Weighting; this one I can also not see as being a strong factor. Something to keep in mind, not to be obsessed over.
Words in the anchor text of a link;
The next section we will look at covers the actual textual factors that they discussed in the filing. These are some traditional considerations (sensible, considering it was '04).
How commercial the anchor text associated with a link is; while they don't fully define this, we can take a stab that it relates to transactional terms such as 'buy now' or possibly named entities such as 'Walmart'. In some query types, links that are commercial in nature may get a higher CTR than in others.
Weight; it is entirely possible that non-relevant links (in query type context) could be weighted less (dampener) based on this model. I'll give this a mid-range value position.
The context of a few words before and/or after the link; another common concept. Are there patterns in related text around links with a higher CTR? It may be that informational and transactional terms have related terms in the surrounding text.
Weight; in all likelihood I'd say there are semantic approaches and other behavioural data that works better for these kinds of signals. I am calling this one minimalistic.
A topical cluster with which the anchor text of the link is associated; seems related to the above and also segmentation. Once more, it is probably evolved into other semantic approaches to link analysis. That being said, it is surely a lower/mid-range level signal ultimately.
Type of link (e.g., text link, image link); an interesting idea, in that in some spaces or situations one or the other may tend to get a higher CTR. Maybe do a little split testing on your own site? Look at top pages in a given query space? Hard to say.
Weighting; this would certainly need to work with the segmentation elements, but might be useful. That being said, unlikely a strong indicator.
Whether the link leads somewhere on the same host or domain; this is seemingly about a level of trust and possibly resources. Are internal links getting a higher CTR on a given domain or in a given query space? Is the linked-to page part of the structure, or relevant?
Weight; this one is really hard to call. I would say in most situations internal links do better, but does that mean it gets a bump, or a dampening in value to level it off? NO CALL.
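Pulling the features above together, here's a purely hypothetical scorer; every feature name and coefficient is my own guess for the sake of discussion, as the patent discloses no actual values or formula.

```python
# A purely hypothetical linear scorer combining the link features discussed
# above into one click-likelihood weight. Every feature name and coefficient
# here is my own guess for discussion; the patent discloses no values.

FEATURE_WEIGHTS = {
    "in_content": 0.30,          # contextual beats header/footer/sidebar
    "above_fold": 0.20,          # prime real estate
    "good_anchor_length": 0.15,  # roughly 2-5 words of anchor text
    "visible_styling": 0.10,     # readable size, not background-coloured
    "image_link": -0.05,         # may or may not underperform a text link
    "in_footer": -0.25,          # 'Terms of Service' territory
}

def link_weight(features):
    """features: {name: bool}. Returns a clamped weight in [0.05, 1.0]."""
    score = 0.5  # neutral baseline
    for name, present in features.items():
        if present:
            score += FEATURE_WEIGHTS.get(name, 0.0)
    return max(0.05, min(1.0, score))  # even a poor link keeps a small floor

footer_tos = link_weight({"in_footer": True})
editorial = link_weight({"in_content": True, "above_fold": True,
                         "good_anchor_length": True, "visible_styling": True})
```

A footer Terms link scores well below the baseline while a prominent in-content link maxes out, which is the sort of spread the patent's model implies, whatever the real numbers are.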
And after some discussion of the source and target documents, they get more into the behavioural element;
User behaviour data associated with documents and links;
Ok so this part is about looking at some of the user performance metrics that can affect the above elements. If we take the above elements and start to add in factors that can be weighted against them, we find the ultimate inclination of the reasonable surfer, (or at least that's the theory). Please keep in mind, behavioural data is still considered a fairly weak signal in organic search. The ultimate weight of any of these factors, in 2004, was likely minimal.
The language of the users; obviously if we're using a form of personalization, then this type of signal would be somewhat important. Or in cases of geo-signals for links.
Weighting; these days I'd say that these are more important than they were in years past. I'd go as far as putting a high value on this (in context).
Information about how people access and interact with documents; this includes items such as navigational actions (e.g., links selected, web addresses entered, forms completed, etc.). This might include click data, dwell time, scrolling etc. As with all behavioural signals, there is potential, but often noisy.
Weighting; as we move into the world of caffeine, it is entirely possible that these types of actions can be used to value links on a given page (including source and destination documents). While I am not 100% convinced of their value, I will give this a mid-range rating.
Interests of the users; particularly interesting given what we've seen as far as interest in Personalized PageRank. If they are indeed able to categorize users in their new personalization schemes, this type of data could be useful in link valuations.
Weighting; as with the others, there is certainly potential here. I do though wonder if other methods aren't more appropriate. For that reason I am giving this a mid-range rating.
How often no links are selected on a page; most certainly this would be a reasonably good indicator. If no links are being selected from a given page, there is every reason to believe that they hold a low relevancy (to the user).
Weight; I would give this one a thumbs up as it is a simple and potentially cleaner signal.
Query terms entered; do certain queries produce better (on page) CTR? If so then they may look to adjust link value for one link type over another. The theory being that certain queries are bringing more relevant users to the page.
Weighting; this one is somewhat weak to my mind and might be better used as a dampening signal.
How often a link is selected; most certainly the CTR is going to be a fairly strong indicator of a link's value. We would also want to look at engagement with the destination document though or this one might not work out too well.
Weighting; I am going to keep this one in the basement due to the potential for noise.
How often links aren’t selected when one link is chosen; is one link on the page performing better than the others? Beyond mere page segmentation, this could certainly be seen as a good metric. It also can play into the prominence elements covered earlier.
Weighting; considering that a few elements might affect this, (such as segmentation) we will leave this in the lower classes as a stand alone metric.
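If we wanted to sketch how this behavioural data could fold into the feature-based weights, a simple (entirely hypothetical) blend might look like the following; the blend factor and click counts are invented, and the patent says nothing about how the two are actually combined.

```python
# Entirely hypothetical blend of a feature-based prior weight with observed
# click behaviour on the source page. The blend factor and counts are invented.

def behavioural_weight(prior, clicks, page_views, blend=0.5):
    """Mix a prior link weight with the link's observed click-through rate."""
    if page_views == 0:
        return prior  # no behavioural data yet: keep the feature-based weight
    ctr = clicks / page_views
    return (1 - blend) * prior + blend * ctr

# Two links with the same feature-based prior; user behaviour separates them.
popular = behavioural_weight(prior=0.5, clicks=400, page_views=1000)
ignored = behavioural_weight(prior=0.5, clicks=5, page_views=1000)
```

The fallback to the prior when there's no data is one plausible answer to the noise problem: behaviour only moves the needle once there's enough of it to trust.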
Link evaluation on the fly
One of the more entertaining thoughts I had would be an evolution of this system to work on the fly. If we consider the proliferation of toolbars and browsers at Google, they could continually adapt a page's rankings based on actual user interactions. Sadly, as with all behavioural data, it could be noisy, but a user-grouped PPR approach and deeper personalization might just help.
Imagine link valuations that change as Google better understands the value of the links on the page via user interaction? Whoa... that would be cool. This is certainly something I hadn't considered before, and on its own it makes this experience worth the time put in. And that's really the point of it all. Expanding our horizons, not unlocking the vault.
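That on-the-fly idea could be sketched as a simple running update: each new batch of toolbar/browser click data nudges the stored link weight toward what users actually do. Again, the alpha value and CTR figures below are invented for illustration, not anything from Google.

```python
# The on-the-fly idea as an exponential moving average: each new batch of
# toolbar/browser click data nudges the stored link weight toward what users
# actually do. The alpha value and CTR figures are invented for illustration.

def update_weight(current, observed_ctr, alpha=0.1):
    """Step the link weight a fraction alpha toward the latest observed CTR."""
    return (1 - alpha) * current + alpha * observed_ctr

weight = 0.8  # link starts with a strong feature-based weight
for batch_ctr in [0.05, 0.04, 0.06, 0.03]:  # users mostly ignore the link
    weight = update_weight(weight, batch_ctr)
# after four batches the valuation has drifted down noticeably
```

A small alpha is the noise control here: no single batch of clicks can swing the valuation, but a sustained pattern eventually does.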
.... that my friends, is how to get geeky with a patent. No magic bullets, no secret sauce, no blanket statements borne from supposition. Just some ideas to ponder and absorb into our everyday activities.
What's it all mean?
Well, to be honest, not a whole helluva lot. Why? Because there are more than a few factors that come into play when we look at all of this in context;
- Filed in 2004
- We don't know any of the actual valuations
- Behavioural metrics can be noisy
- How does it play with semantic, temporal and other data?
- Does personalized search for all and caffeine open new doors?
All we can do is understand the concepts and the historical significance therein. What is important about this exercise is that we have another glimpse into not only behavioural data but link valuations as well. But link related signals aren't really a new thing (around here) as we've looked at a few approaches including temporal data, phrase based IR, page segmentation and even Personalized PageRank to name a few.
We can surely start to realize the simple truth; all links were not created equal.
We can't take this one patent and start making assessments of how Google is treating links, (out of context). There are so many other methods that it is truly impossible to say. This now reminds us that a lot of SEO 'testing' is somewhat flawed. Forget other signals (geo, QDF, semantic etc..) it is tough enough just to isolate the link signals. We must have a better understanding of how all of this might play out.
The concept here is to better understand how things work so we don't follow advice blindly. M'kay? (still awake?) Absorb it and move on - nothing to see here.