The reasonable surfer makes for unreasonable thinkers
A guide to assessing search patents
Well, what can I say? It just never ends. Oh no my friends start shaking yer heads.. We talk about LSI/Google crap not once, but two and three times. Behavioural data and oh, I dunno, bounce rates a few times? How about the most important one, the Magic Bullet (2010 and 2007 versions). Time and time again, it seems to keep happening; SEOs grasping at straws.
Which brings us to this week's offender: the Reasonable Surfer (another Bill Slawski production – in theatres soon!). I have seen way too many misguided posts on this one. There's no bloody need to be pointing fingers at this point, as it's been rampant the last while.
First off, let's get this straight; I love IR papers and patents, and Bill too. Any self-respecting search geek should. That's not the issue. It is more about SEO bloggers/media twisting it into present-day implications, better known as stating theories as fact. Those of us schooled at Slawski U know better.
What is this all about?
First off, the patent;
Ranking documents based on user behavior and/or feature data
Invented by Jeffrey A. Dean, Corin Anderson and Alexis Battle
Assigned to Google Inc.
United States Patent 7,716,225
Granted May 11, 2010
Filed: June 17, 2004
Catch that last one? You know, that little minor technicality of it being filed some 6 years ago? This means we must assume that, if it was incorporated at all, it has likely been heavily augmented, weighted and dampened on all levels since. It is a history lesson.
What were they looking at?
“...generating a model based on user behavior data associated with a group of documents. The method may also include assigning weights to links based on the model, where the links may include references from first documents to second documents in a set of documents”
“...may also include means for assigning weights to references in a set of documents based on the model and means for assigning ranks to documents in the set of documents based on the weights assigned to the references.”
“...a reasonable surfer model that indicates that when a surfer accesses a document with a set of links, the surfer will follow some of the links with higher probability than others. This reasonable surfer model reflects the fact that not all of the links associated with a document are equally likely to be followed. Examples of unlikely followed links may include "Terms of Service" links, banner advertisements, and links unrelated to the document.”
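For the code-minded, here is a rough Python sketch of the idea in that last quote: a PageRank-style calculation where each link passes on rank in proportion to how likely it is to be followed, rather than splitting it evenly. To be clear, this is my own illustration and not the patent's method; the function name `weighted_pagerank`, the damping value of 0.85, the iteration count and the example weights are all assumptions made up for the sketch.

```python
def weighted_pagerank(links, damping=0.85, iterations=50):
    """Rank pages when links carry different follow likelihoods.

    links: {source: {target: weight}} where each weight is a hypothetical,
    non-negative estimate of how likely a surfer is to follow that link.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets the same baseline from the random-jump term.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for source in pages:
            targets = links.get(source, {})
            total = sum(targets.values())
            if total <= 0:
                # No usable outlinks: spread this page's rank over all pages.
                share = damping * rank[source] / n
                for page in pages:
                    new_rank[page] += share
                continue
            for target, weight in targets.items():
                # A link passes rank in proportion to its share of the weight.
                new_rank[target] += damping * rank[source] * weight / total
        rank = new_rank
    return rank


# Hypothetical site: the in-content link is followed far more often than
# the "Terms of Service" link in the footer.
example_links = {
    "home": {"article": 0.9, "tos": 0.1},
    "article": {"home": 1.0},
    "tos": {"home": 1.0},
}
for page, score in sorted(weighted_pagerank(example_links).items()):
    print(page, round(score, 4))
```

With equal weights on both links this collapses back to the classic random surfer, which is really the whole point: the only thing that changes is how a page's vote is divided up.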
Of note, the world of behavioural data wasn't nearly what it is these days. Anyway, this patent looks at a simpler level of query analysis from existing data sets. They are looking at identifying less valuable types of links based on user interactions.
Know thine enemy well
Next we take another logical step and look at a few of its authors; of note, Jeffrey Dean. We can see he is a Google research geek. And on his profile there are some interesting tidbits such as;
“Some aspects of our search ranking algorithms, notably improved handling for dealing with off-page signals such as anchor text.”
“The design and implementation of prototyping infrastructure for rapid development and experimentation with new ranking algorithms.”
OK, interesting to see some of the background. While it isn't his mainstay, he certainly does seem to have some experience with links, and a passion for infrastructure as well.
After that you'd want to maybe listen/watch these;
Challenges in Building Large-Scale Information Retrieval Systems – Video Lectures
When you're done that? Then look into some of the other folks, such as Corin Anderson and his papers (he doesn't seem to be with Google any longer) or Alexis Battle, last seen on the Google AU blog (interesting paper here). Then tell me, what do you see in these folks' past?
- Infrastructure
- Personalization
- Recommendation (engine)
- Learning mechanisms
- Links/ranking features
An interesting mix of talents. This can often be very helpful in better understanding the mindsets of the people behind the technology. This is an important part of any journey into better understanding what you're looking at (as far as patents/papers are concerned). They have a trifecta of links, optimization (processing) and behavioural data in this crew.
Still with me? Sure you are. Now that we have that, let's start to look at some of the tidbits that are the 'reasonable surfer'.
Some Reasonable Thinking
Now that we have the timeline and some background on the engineers involved, we can start to look at this particular filing. What needs to be done, and hasn't been in much of the coverage so far, is to add some qualifiers to the elements this patent raises. We can't simply take them at face value; we need to use the Art of SEO. This means assigning some possible value to each of these based on research and experience. If we assume the trail begins in '04, let us see if we can work out where it went from there.
We can start a framework (for further discussion) on what value they hold.
Features associated with a link;
It seems a reasonable assumption (even Matt has said there is location segmentation). Weighting; of the signals we're looking at today, this one would certainly have the potential to be valuable. Position can also be used for spam detection (paid links) and as a dampener.
Weighting; this one I'd certainly put into the mid-range category. Ultimately I would say it is as likely to be useful as a dampening factor (on longer term links?) as anything else.
Weighting; far more likely a good tool for spam assassins. I probably would not see a lot of strength in these signals. Call it a mid-range consideration.
Weighting; I can't see this one having a whole lot of git-up in it. Obviously a more prominent link is desired, but for link weighting? Not so sure this holds a ton of water. Sure, users are more likely to click, but that's a facet of the prominence, not really the value of the link to me.
I can't see something like this being a great way to valuate a link due to the propensity of click bias at play. Weighting; this one I can also not see as being a strong factor. Something to keep in mind, not to be obsessed over.
Words in the anchor text of a link;
The next section we will look at covers the actual textual factors that they discussed in the filing. These are some traditional considerations (sensible considering it was '04).
Weight; it is entirely possible that non-relevant links (in query type context) could be weighted less (dampener) based on this model. I'll give this a mid-range value position.
Weight; in all likelihood I'd say there are semantic approaches and other behavioural data that work better for these kinds of signals. I am calling this one minimal.
A topical cluster with which the anchor text of the link is associated; seems related to the above and also segmentation. Once more, it has probably evolved into other semantic approaches to link analysis. That being said, it is surely a lower/mid-range level signal ultimately.
Weighting; this would certainly need to work with the segmentation elements, but might be useful. That being said, unlikely a strong indicator.
Weight; this one is really hard to call. I would say in most situations internal links do better, but does that mean it gets a bump or a dampening in value to level it off? NO CALL.
And after some discussion of the source and target documents, they get more into the behavioural element;
User behaviour data associated with documents and links;
Ok so this part is about looking at some of the user performance metrics that can affect the above elements. If we take the above elements and start to add in factors that can be weighted against them, we find the ultimate inclination of the reasonable surfer, (or at least that's the theory). Please keep in mind, behavioural data is still considered a fairly weak signal in organic search. The ultimate weight of any of these factors, in 2004, was likely minimal.
Weighting; these days I'd say that these are more important than they were in years past. I'd go as far as putting a high value on this (in context).
Weighting; as we move into the world of Caffeine, it is entirely possible that these types of actions can be used to value links on a given page (including source and destination documents). While I am not 100% convinced of their value, I will give this a mid-range rating.
Weighting; as with the others, there is certainly potential here. I do wonder, though, if other methods aren't more appropriate. For that reason I am giving this a mid-range rating.
Weight; I would give this one a thumbs up as it is a simple and potentially cleaner signal.
Weighting; this one is somewhat weak to my mind and might be better used as a dampening signal.
Weighting; I am going to keep this one in the basement due to the potential for noise.
Weighting; considering that a few elements (such as segmentation) might affect this, we will leave this in the lower classes as a standalone metric.
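Since noise keeps coming up with behavioural data, here is a small Python sketch of one common way to tame it: pull a link's observed follow rate toward a prior until there is enough data to trust it. This is a generic smoothing trick, not something taken from the patent; the function name `smoothed_follow_rate` and the default values for `prior_rate` and `prior_strength` are assumptions for illustration only.

```python
def smoothed_follow_rate(clicks, impressions, prior_rate=0.05, prior_strength=20.0):
    """Estimate how often a link is followed, leaning on a prior when data is thin.

    clicks: times the link was followed (hypothetical log count).
    impressions: times the page holding the link was viewed.
    prior_rate: assumed follow rate for a link we know nothing about.
    prior_strength: how many "pretend" impressions the prior is worth.
    """
    if clicks < 0 or impressions < 0 or clicks > impressions:
        raise ValueError("need 0 <= clicks <= impressions")
    # Additive smoothing: blend the observed counts with the prior's counts.
    return (clicks + prior_rate * prior_strength) / (impressions + prior_strength)


# One lucky click on a single view barely moves the estimate...
print(smoothed_follow_rate(1, 1))
# ...while a steady pattern over many views dominates the prior.
print(smoothed_follow_rate(500, 1000))
```

A raw rate would call the first link a 100% winner off one visit, which is exactly the kind of noise that keeps these signals in the basement.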
Link evaluation on the fly
One of the more entertaining thoughts I had would be an evolution of this system to work on the fly. If we consider the proliferation of toolbars and browsers at Google, they could continually adapt a page's rankings based on actual user interactions. Sadly, as with all behavioural data, it could be noisy, but a user-grouped Personalized PageRank (PPR) approach and deeper personalization might just help.
Imagine link valuations that change as Google better understands the value of the links on the page via user interaction? Whoa... that would be cool. This is certainly something I hadn't considered before, and on its own it makes this experience worth the time put in. And that's really the point of it all. Expanding our horizons, not unlocking the vault.
...that, my friends, is how to get geeky with a patent. No magic bullets, no secret sauce, no blanket statements borne from supposition. Just some ideas to ponder and absorb into our everyday activities.
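For anyone who wants to ponder the on-the-fly idea in code, here is a purely speculative Python sketch: each link's weight is nudged toward 1 when a visitor follows it and toward 0 when they don't, using an exponential moving average. Nothing here comes from the patent or from anything Google has said; the class name `OnlineLinkWeights`, the `alpha` step size and the starting weight of 0.5 are all invented for the example.

```python
class OnlineLinkWeights:
    """Keep a running follow-likelihood per (page, link) pair."""

    def __init__(self, alpha=0.1, initial=0.5):
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha      # how quickly new behaviour overrides old
        self.initial = initial  # assumed weight for a link never seen before
        self.weights = {}

    def record(self, page, link, followed):
        """Update the weight after one page view; return the new weight."""
        key = (page, link)
        current = self.weights.get(key, self.initial)
        observation = 1.0 if followed else 0.0
        # Exponential moving average: recent behaviour counts for more.
        updated = (1.0 - self.alpha) * current + self.alpha * observation
        self.weights[key] = updated
        return updated

    def weight(self, page, link):
        """Current weight, or the initial value if nothing was recorded."""
        return self.weights.get((page, link), self.initial)


tracker = OnlineLinkWeights()
for _ in range(10):
    tracker.record("home", "article", followed=True)
    tracker.record("home", "tos", followed=False)
print(tracker.weight("home", "article"), tracker.weight("home", "tos"))
```

A small `alpha` keeps one odd visit from swinging the valuation, which is one simple way to cope with the noise problem mentioned above.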
What's it all mean?
Well, to be honest, not a whole helluva lot. Why? Because there are more than a few factors that come into play when we look at all of this in context;
- Filed in 2004
- We don't know any of the actual valuations
- Behavioural metrics can be noisy
- How does it play with semantic, temporal and other data?
- Does personalized search for all and Caffeine open new doors?
All we can do is understand the concepts and the historical significance therein. What is important about this exercise is that we have another glimpse into not only behavioural data but link valuations as well. But link-related signals aren't really a new thing (around here) as we've looked at a few approaches including temporal data, phrase based IR, page segmentation and even Personalized PageRank to name a few.
We can surely start to realize the simple truth; all links were not created equal.
We can't take this one patent out of context and start making assessments of how Google is treating links. There are so many other methods that it is truly impossible to say. This now reminds us that a lot of SEO 'testing' is somewhat flawed. Forget other signals (geo, QDF, semantic, etc.); it is tough enough just to isolate the link signals. We must have a better understanding of how all of this might play out.
The concept here is to better understand how things work so we don't follow advice blindly. M'kay? (still awake?) Absorb it and move on - nothing to see here.
The position of the link; for this one we're looking at location and of course
Number of words in anchor text of a link; for me a good area to be in is 2-5 words. Any shorter, other than brand terms, are often 'here' or 'link' or 'said'. One certainly doesn't want to have links with long text. Not only poor targeting, but plenty of room for dilution. What is the perfect text length for optimal CTR?
Position of link when in list element; this is interesting in that I've talked about prominence factors in the past including lists, but never links within a list. Concept here being that more prominent placement garners a boost.