Randfish
2007-6-3
Remarkable Openness from Google's Black Box Thanks to Saul Hansell
I'm more than a little skeptical of mainstream media articles about the search engines. With so many terrible experiences - inaccuracy, bias, shallow information, agenda-based reporting - it's easy to see why. However, today I'm thrilled to see an article from Saul Hansell in the NY Times that's not only impeccably well-written, but informative even to those of us most deeply inside the search industry. The article - Google Keeps Tweaking Its Search Engine - is quite possibly the best mainstream media article about Google, or modern search technology, in the last 5 years.
There are several big takeaways for search marketers, so let's dive right in:
Mr. Singhal is the master of what Google calls its “ranking algorithm” — the formulas that decide which Web pages best answer each user’s question. It is a crucial part of Google’s inner sanctum, a department called “search quality” that the company treats like a state secret. Google rarely allows outsiders to visit the unit, and it has been cautious about allowing Mr. Singhal to speak with the news media about the magical, mathematical brew inside the millions of black boxes that power its search engine.
Google values Mr. Singhal and his team so highly for the most basic of competitive reasons. It believes that its ability to decrease the number of times it leaves searchers disappointed is crucial to fending off ever fiercer attacks from the likes of Yahoo and Microsoft and preserving the tidy advertising gold mine that search represents.
It's nice to hear that Google feels much the same way I do about search quality - in particular that the current competitive advantage is primarily about the relevance of results. We're also getting a peek at a Googler that we've never met before (at least, outside the 'plex). I'm guessing that poor Mr. Singhal is now receiving quite a few emails to every possible variation of his name @ google.com (poor guy).
Any of Google’s 10,000 employees can use its “Buganizer” system to report a search problem, and about 100 times a day they do — listing Mr. Singhal as the person responsible to squash them.
“Someone brings a query that is broken to Amit, and he treasures it and cherishes it and tries to figure out how to fix the algorithm,” says Matt Cutts, one of Mr. Singhal’s officemates and the head of Google’s efforts to fight Web spam, the term for advertising-filled pages that somehow keep maneuvering to the top of search listings.
Some complaints involve simple flaws that need to be fixed right away. Recently, a search for “French Revolution” returned too many sites about the recent French presidential election campaign — in which candidates opined on various policy revolutions — rather than the ouster of King Louis XVI. A search-engine tweak gave more weight to pages with phrases like “French Revolution” rather than pages that simply had both words.
The Google bug system reminds us that behind all the magic, human beings toil to ensure quality, compare individual results and make tweaks based upon the best aggregate changes. The short paragraph about the French Revolution, if accurate, gives some insight into the fact that the algorithm is not uniform - not even close. Individual queries get individual attention - so next time you're stumped because Google's formula for some new term you're optimizing doesn't match up against your experiences from the past, you may simply be dealing with a different set of criteria.
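To make the "French Revolution" tweak concrete, here's a minimal sketch of phrase-aware scoring. To be clear, this is my own illustration and not Google's formula: the function names, the one-point-per-word base score and the `phrase_bonus` value are all assumptions. It only shows the general idea of giving a page extra weight when the query words appear together as a phrase rather than scattered around the page.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def phrase_aware_score(query, page_text, phrase_bonus=2.0):
    """Score a page for a query, rewarding exact phrase matches.

    Hypothetical scoring: one point per distinct query word found on the
    page, plus phrase_bonus if the query words appear adjacent and in order.
    """
    query_words = tokenize(query)
    page_words = tokenize(page_text)
    if not query_words:
        return 0.0

    # Base score: how many distinct query words occur anywhere on the page.
    vocabulary = set(page_words)
    score = float(sum(1 for word in set(query_words) if word in vocabulary))

    # Phrase check: slide a window over the page and look for the exact sequence.
    n = len(query_words)
    has_phrase = any(
        page_words[i:i + n] == query_words
        for i in range(len(page_words) - n + 1)
    )
    if has_phrase:
        score += phrase_bonus
    return score


history_page = "The French Revolution began in 1789 and ended the reign of Louis XVI."
election_page = "The French candidate promised a revolution in tax policy."

history_score = phrase_aware_score("French Revolution", history_page)
election_score = phrase_aware_score("French Revolution", election_page)
```

Both pages contain both words, so without the bonus they would tie. With it, the history page comes out ahead of the election page, which is the behavior the article describes.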
But Mr. Singhal often doesn’t rush to fix everything he hears about, because each change can affect the rankings of many sites. “You can’t just react on the first complaint,” he says. “You let things simmer.”
So he monitors complaints on his white board, prioritizing them if they keep coming back. For much of the second half of last year, one of the recurring items was “freshness.”
Freshness, which describes how many recently created or changed pages are included in a search result, is at the center of a constant debate in search: Is it better to provide new information or to display pages that have stood the test of time and are more likely to be of higher quality? Until now, Google has preferred pages old enough to attract others to link to them.
But last year, Mr. Singhal started to worry that Google’s balance was off. When the company introduced its new stock quotation service, a search for “Google Finance” couldn’t find it. After monitoring similar problems, he assembled a team of three engineers to figure out what to do about them.
Hmmmm... Google not showing fresh results, eh? Sounds mighty familiar, no? We at SEOmoz, and most of the rest of the informed SEO world, have been talking about this for the last few years; in particular since March of 2004 when the infamous "sandbox" first reared its ugly head. It's nice to get confirmation and feel the vindication of this transparency, but there's also a lesson to be learned - Google isn't perfect and they often look inward. The fact that this problem wasn't addressed until the query "Google Finance" didn't show Google Finance is strong evidence that Google is like many other companies. Things don't get fixed unless the folks internally feel the pain of the problem. Thus, next time you want to fight with the Google engineers about what you feel is inequitable treatment in the SERPs, the best way to do it might be to illustrate how the problem affects Google products.
Mr. Singhal introduced the freshness problem, explaining that simply changing formulas to display more new pages results in lower-quality searches much of the time. He then unveiled his team’s solution: a mathematical model that tries to determine when users want new information and when they don’t. (And yes, like all Google initiatives, it had a name: QDF, for “query deserves freshness.”)...
...“What do you take us for, slackers?” Mr. Singhal responded with a rebellious smile.
The QDF solution revolves around determining whether a topic is “hot.” If news sites or blog posts are actively writing about a topic, the model figures that it is one for which users are more likely to want current information. The model also examines Google’s own stream of billions of search queries, which Mr. Singhal believes is an even better monitor of global enthusiasm about a particular subject.
As an example, he points out what happens when cities suffer power failures. “When there is a blackout in New York, the first articles appear in 15 minutes; we get queries in two seconds,” he says.
Mr. Singhal says he tested QDF for a simple application: deciding whether to include a few news headlines among regular results when people do searches for topics with high QDF scores. Although Google already has a different system for including headlines on some search pages, QDF offered more sophisticated results, putting the headlines at the top of the page for some queries, and putting them in the middle or at the bottom for others.
In the SEO world, we're all familiar with the new onebox results that pop up with news results, and now we've got a bit of backstory on it. I also suspect that although it wasn't mentioned in the article, there may have been some tweaking to the organic listings to help support more freshness in the results themselves. Google's still favoring a lot of old results, but of the thousand or so queries we monitor internally and for clients, there are at least some indications that a freshness boost exists.
Another big takeaway here is the thought process about how temporal data and query analysis happen at the 'plex. The level of awareness of satisfaction with results is certainly impressive, and so is the exceptionally fast timeline for fixes (at least, some fixes - in SEO, we've got our own examples of tortoise-speed implementation). What the article says, though, is that Google can determine, by examining blog posts and news articles, what topics and queries might be getting "hot" and return more "fresh" results for those queries. This fits in precisely with how smart SEOs advise on "escaping" from the sandbox - get lots of link love and lots of people talking about you, i.e. become newsworthy.
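For those who think better in code, here's a rough sketch of how a "query deserves freshness" check could work in principle. The article gives no formulas, so everything here is assumed: the `TopicActivity` fields, the burst ratio, the 0.7 weight on queries (chosen only because Mr. Singhal calls the query stream the better monitor) and the placement thresholds are all hypothetical. Treat it as a toy model of the idea, not a description of Google's system.

```python
from dataclasses import dataclass


@dataclass
class TopicActivity:
    """Recent versus typical activity for one topic (all fields hypothetical)."""
    recent_articles: int      # news and blog posts in the latest time window
    baseline_articles: float  # typical posts per window for this topic
    recent_queries: int       # searches in the latest time window
    baseline_queries: float   # typical searches per window for this topic


def burst_ratio(recent, baseline):
    """How many times above its usual level the recent activity is."""
    return recent / max(baseline, 1.0)


def qdf_score(activity, query_weight=0.7):
    """Blend the query burst and the publishing burst into one freshness score."""
    article_burst = burst_ratio(activity.recent_articles, activity.baseline_articles)
    query_burst = burst_ratio(activity.recent_queries, activity.baseline_queries)
    return query_weight * query_burst + (1.0 - query_weight) * article_burst


def headline_placement(score, top=5.0, middle=2.0, bottom=1.2):
    """Map a freshness score to a slot for news headlines (assumed thresholds)."""
    if score >= top:
        return "top"
    if score >= middle:
        return "middle"
    if score >= bottom:
        return "bottom"
    return "none"


# A sudden blackout: articles and, even more so, queries spike above baseline.
blackout = TopicActivity(recent_articles=40, baseline_articles=2.0,
                         recent_queries=9000, baseline_queries=300.0)
# An evergreen topic: activity sits at or below its usual level.
evergreen = TopicActivity(recent_articles=1, baseline_articles=2.0,
                          recent_queries=310, baseline_queries=300.0)

blackout_slot = headline_placement(qdf_score(blackout))
evergreen_slot = headline_placement(qdf_score(evergreen))
```

In this toy model the blackout topic gets headlines at the top of the page and the evergreen topic gets none. That matches the spirit of the article, where hot topics earn fresher results and everything else keeps the tried-and-true pages.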
As Google compiles its index, it calculates a number it calls PageRank for each page it finds...
...Mr. Singhal has developed a far more elaborate system for ranking pages, which involves more than 200 types of information, or what Google calls “signals.” PageRank is but one signal. Some signals are on Web pages — like words, links, images and so on. Some are drawn from the history of how pages have changed over time. Some signals are data patterns uncovered in the trillions of searches that Google has handled over the years...
...Increasingly, Google is using signals that come from its history of what individual users have searched for in the past, in order to offer results that reflect each person’s interests. For example, a search for “dolphins” will return different results for a user who is a Miami football fan than for a user who is a marine biologist. This works only for users who sign into one of Google’s services, like Gmail...
...Once Google corrals its myriad signals, it feeds them into formulas it calls classifiers that try to infer useful information about the type of search, in order to send the user to the most helpful pages. Classifiers can tell, for example, whether someone is searching for a product to buy, or for information about a place, a company or a person. Google recently developed a new classifier to identify names of people who aren’t famous. Another identifies brand names...
...These signals and classifiers calculate several key measures of a page’s relevance, including one it calls “topicality” — a measure of how the topic of a page relates to the broad category of the user’s query. A page about President Bush’s speech about Darfur last week at the White House, for example, would rank high in topicality for “Darfur,” less so for “George Bush” and even less for “White House.” Google combines all these measures into a final relevancy score.
The sites with the 10 highest scores win the coveted spots on the first search page, unless a final check shows that there is not enough “diversity” in the results. “If you have a lot of different perspectives on one page, often that is more helpful than if the page is dominated by one perspective,” Mr. Cutts says. “If someone types a product, for example, maybe you want a blog review of it, a manufacturer’s page, a place to buy it or a comparison shopping site.”
Wow... OK - 200 signals of quality (we've covered a lot of the big ones here), a classification system that attempts to determine query intent and an automated system to determine diversity. That's a lot of confirmation about what many have only theorized until now. I'm not going to go into detail about each of these - I invite you to do so in the comments - but I'll certainly be writing about them sometime in the near future.
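As a thought experiment, the last two steps from the article (combining measures into a final relevancy score, then checking the top 10 for diversity) can be sketched in a few lines. This is purely illustrative: the three signal names, their weights, the `kind` label on each result and the `max_per_kind` cap are assumptions of mine, and the article doesn't say how Google weighs its signals or measures diversity.

```python
from dataclasses import dataclass


@dataclass
class Result:
    url: str
    kind: str      # hypothetical page type: "blog", "manufacturer", "store", ...
    signals: dict  # signal name -> value between 0 and 1


# Hypothetical weights; the real system reportedly uses 200+ signals.
WEIGHTS = {"topicality": 0.5, "pagerank": 0.3, "freshness": 0.2}


def relevance(result, weights=WEIGHTS):
    """Weighted sum of the available signals; missing signals count as zero."""
    return sum(w * result.signals.get(name, 0.0) for name, w in weights.items())


def first_page(results, page_size=10, max_per_kind=4):
    """Pick the highest-scoring results, capping how many share one kind."""
    ranked = sorted(results, key=relevance, reverse=True)
    page, skipped, per_kind = [], [], {}
    for result in ranked:
        if len(page) == page_size:
            break
        if per_kind.get(result.kind, 0) < max_per_kind:
            page.append(result)
            per_kind[result.kind] = per_kind.get(result.kind, 0) + 1
        else:
            skipped.append(result)
    # If the cap left the page short, backfill with the best skipped results.
    page.extend(skipped[:page_size - len(page)])
    return page


candidates = [
    Result("s0", "store", {"topicality": 0.9}),
    Result("s1", "store", {"topicality": 0.8}),
    Result("s2", "store", {"topicality": 0.7}),
    Result("s3", "store", {"topicality": 0.6}),
    Result("b0", "blog", {"topicality": 0.5}),
    Result("b1", "blog", {"topicality": 0.4}),
]
page = first_page(candidates, page_size=4, max_per_kind=2)
page_urls = [result.url for result in page]
```

With a four-slot page and at most two results per kind, the two lower-scoring blog pages displace the third and fourth store pages. That's the kind of trade-off Mr. Cutts describes when he talks about wanting a blog review, a manufacturer's page and a place to buy all on one page.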
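(Translation: Skywalker; editing: Levi)
Originally published on: 蓝杉 SEO team blog
All rights reserved. When reprinting, you must credit the author and the original source with a link and include this notice.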