Google Search

structure adapted from 'Cracking the PM Interview'

Google Search, or simply Google, is a web search engine developed by Google LLC. It is the most used search engine on the World Wide Web across all platforms, with 91.38% of the market share as of December 2020. Google Search ranks web pages in its results with an algorithm called PageRank, named after Larry Page, one of the founders of Google. It is not the only algorithm the company uses, but it was the first, and it remains the best known. PageRank is based on link popularity; it was first introduced by Larry Page and Sergey Brin as part of a research project on a new kind of search engine and later published in 1998. (Fun fact: Robin Li, the founder of Baidu, developed a search engine called RankDex in 1996, which Larry Page later referenced in some of his work on PageRank.)
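
Since PageRank carries much of the story here, a minimal sketch of the idea may help: link popularity propagated iteratively until it stabilizes. The tiny link graph below is made up, and the damping factor of 0.85 comes from the original paper; Google's production system is, of course, far more elaborate.

```python
# Minimal PageRank sketch: rank pages by link popularity via power iteration.
# The link graph is illustrative; damping=0.85 is the value from the 1998 paper.

def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:  # dangling page: spread its rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
        rank = new_rank
    return rank

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(graph))  # C ends up highest: both A and B link to it
```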

Market share

Google 91.38%, Bing 2.7%, Yahoo! 1.46%, Baidu 1.38%, Yandex 1%, DuckDuckGo 0.6% (GS StatCounter, December 2020)

Competitors

Baidu and Soso in China; Naver and Daum in South Korea; Yandex in Russia; Seznam.cz in the Czech Republic; Qwant in France; Yahoo! in Japan, Taiwan, and the US; and Bing and DuckDuckGo globally

Customer/market

all web users

Revenue

Search advertising. Alongside the organic results ranked as most relevant by Google's search algorithms, the results page also shows sponsored pages from Google Ads advertisers.

Relevant product

Googlebot - the generic name for Google's web crawler; it covers two crawler types: a desktop crawler and a mobile crawler
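
As a quick illustration of how crawlers like Googlebot interact with sites, here is a sketch using Python's standard urllib.robotparser to check crawl permissions. The robots.txt rules below are sample values, not any real site's configuration.

```python
# Sketch: checking which URLs a crawler may fetch, using Python's built-in
# robots.txt parser. The rules are invented; Googlebot honors each site's
# actual robots.txt before crawling it.
from urllib.robotparser import RobotFileParser

sample_robots_txt = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(sample_robots_txt)

print(parser.can_fetch("Googlebot", "https://example.com/page"))       # True
print(parser.can_fetch("Googlebot", "https://example.com/private/x"))  # False
print(parser.can_fetch("SomeOtherBot", "https://example.com/page"))    # False
```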

Love/Hate

Google Doodle:

a special, temporary alteration of the logo on Google's homepages, intended to commemorate holidays, events, achievements, and notable historical figures.

Discontinued features:

  1. Instant search:

    was announced in September 2010 as a feature that displayed suggested results while the user typed in their search query. The primary advantage was saving time - an estimated 2-5 seconds per search, which Google claimed added up to roughly 11 hours of collective user time saved every second. However, moderating it proved complex, particularly removing queries from autocomplete; part of the difficulty was the algorithmic approach. The algorithms looked not only for specific words but also for compound queries built from those words, across all languages. So, for example, if there was a bad word in Russian, the system would also remove compound queries that included the transliteration of the Russian word into English. Also, if any search result violated the policies, the algorithms could remove the triggering query from autocomplete, even if the query itself would not otherwise violate those policies. There was also inconsistency in which forms of a topic were allowed; for instance, 'lesbian' was blocked, but 'gay' was not. On July 26, 2017, Google removed Instant Search, citing the growing number of searches on mobile, where the interaction with search as well as the screen size differ significantly from a computer. (A toy sketch of the compound-query filtering problem appears after this list.)

  2. Real-time search:

    a feature in which search results would also sometimes include real-time information from Twitter, Facebook, and news websites. The feature was introduced on December 7, 2009, and went offline on July 2, 2011, after the deal with Twitter expired.
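
To make the compound-query moderation problem from Instant Search more concrete, here is a toy sketch of blocklist-based suggestion filtering. The blocklist entries (a hypothetical term plus a transliterated variant) are harmless placeholders, not Google's actual list.

```python
# Toy sketch of the moderation problem described above: suppress a suggestion
# if any blocklisted term appears anywhere inside the compound query.
# Blocklist entries are hypothetical placeholders, including a transliterated
# variant of the base term.

BLOCKLIST = {"badword", "badwort", "плохой"}  # hypothetical variants

def allow_suggestion(suggestion: str) -> bool:
    """Return False if any blocklisted term occurs inside the suggestion."""
    lowered = suggestion.lower()
    return not any(term in lowered for term in BLOCKLIST)

for query in ["weather today", "badword pictures", "compound-badwort-query"]:
    print(query, "->", "shown" if allow_suggestion(query) else "suppressed")
```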

Metrics/ranking system based on algorithms

(Google Search Central Blog)

Meaning of the query:

  • synonym system/natural language understanding: this involves steps as basic as identifying spelling mistakes, and extends to understanding what information you're looking for - the intent behind your query

  • specific or general information: determine what kind of information you're looking for - do words such as 'review' or 'opening hours' indicate a specific need behind the search? (a toy sketch after this list shows how such markers might be detected)

  • language: which language should results be returned in if, say, the query is written in French; likewise, if you are searching for a local restaurant, local info should be returned

  • timeliness: another important dimension of query analysis is whether you are looking for fresh content. If you search trending keywords, the freshness algorithms interpret that as a signal that up-to-date info may be more useful than older pages; recent events and hot topics, regularly recurring events, and frequently updated information should be promoted to the front page of web search results. e.g. the freshness algorithm built on the 2010 Caffeine indexing system improved roughly 35% of searches with more up-to-date results
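
As a toy illustration of two signals above (specific-information markers and freshness intent), here is a sketch of a query analyzer. The marker keyword lists are invented placeholders, not Google's.

```python
# Toy query analysis: does the query ask for specific info ("review",
# "opening hours"), and does it look freshness-sensitive? Marker lists
# are illustrative placeholders only.

SPECIFIC_MARKERS = {"review", "opening hours", "price", "near me"}
FRESHNESS_MARKERS = {"news", "score", "today", "live", "latest"}

def analyze_query(query: str) -> dict:
    q = query.lower()
    return {
        "specific_info": sorted(m for m in SPECIFIC_MARKERS if m in q),
        "wants_fresh_results": any(m in q for m in FRESHNESS_MARKERS),
    }

print(analyze_query("thai restaurant opening hours"))
# {'specific_info': ['opening hours'], 'wants_fresh_results': False}
print(analyze_query("election results today"))
# {'specific_info': [], 'wants_fresh_results': True}
```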

Relevance of webpages:

  • keyword matching: the most basic signal that information is relevant is when a webpage contains the same keywords as the search query. If the keywords appear on the page, in the headings, or in the body of the text, the information is more likely to be relevant (see the scoring sketch after this list)

  • relevance signals: can help assess whether a webpage contains an answer to your query rather than just repeating the question. Think of when you search 'dogs': you probably don't want a page with the word 'dog' on it a hundred times. Therefore the system should also match relevant content beyond keywords, such as pictures of dogs, videos, or lists of breeds.
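
Here is a minimal sketch of the keyword-matching signal: count query-term occurrences in the body and weight matches in headings more heavily. The weights are arbitrary illustration, not Google's actual scoring.

```python
# Minimal keyword-matching score: term counts in the body plus extra weight
# for terms appearing in headings. Weights are arbitrary for illustration.

def keyword_score(query: str, heading: str, body: str,
                  heading_weight: float = 2.0) -> float:
    heading_l, body_l = heading.lower(), body.lower()
    score = 0.0
    for term in query.lower().split():
        score += body_l.count(term) + heading_weight * heading_l.count(term)
    return score

page = {"heading": "Dog breeds guide",
        "body": "Photos and videos of popular dog breeds."}
print(keyword_score("dog breeds", page["heading"], page["body"]))  # 6.0
```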

Quality of content:

  • reliability: search should aim to prioritize the most reliable sources available. The algorithm should identify pages that demonstrate expertise, authoritativeness, and trustworthiness on a given topic.

  • aggregated feedback: look for sites that many users seem to value for similar queries. If more prominent websites link to the page (the signal known as PageRank), that suggests the information is well trusted. Aggregated feedback further refines how the system discerns the quality of information.

  • Spam-free: spam algorithms detect low-quality, sneaky sites and ensure that sites don't rise in search results through deceptive behavior.

Usability:

  • evaluates whether webpages are easy to use, reducing persistent user pain points. Algorithms should indicate whether all users can view the results: whether the site appears correctly in different browsers; whether it is designed to work for all device types and sizes, including desktops, tablets, and mobile devices; and whether page loading times work well for users with slow internet connections (see the Core Web Vitals sketch below).
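
One concrete anchor for usability is Google's published Core Web Vitals "good" thresholds (LCP ≤ 2.5 s, FID ≤ 100 ms, CLS ≤ 0.1). The pass/fail aggregation below is a simplified sketch, not how ranking actually consumes these metrics.

```python
# Sketch: check measured page metrics against the Core Web Vitals "good"
# thresholds Google has published. The all-or-nothing aggregation is a
# simplification for illustration.

THRESHOLDS = {"lcp_s": 2.5, "fid_ms": 100, "cls": 0.1}  # lower is better

def passes_core_web_vitals(metrics: dict) -> bool:
    return all(metrics[name] <= limit for name, limit in THRESHOLDS.items())

print(passes_core_web_vitals({"lcp_s": 1.9, "fid_ms": 40, "cls": 0.05}))  # True
print(passes_core_web_vitals({"lcp_s": 4.2, "fid_ms": 40, "cls": 0.05}))  # False
```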

Context and settings

  • Information such as location, past search history, and Search settings all help Search tailor the results to what is most useful and relevant for you at that moment. Say you are in Seattle and you search 'football': Google will most likely show you results about American football and the Seahawks first (a toy reranker after this list illustrates the idea).

  • Search settings are also an important indicator of which results you’re likely to find useful, such as if you set a preferred language or opted in to SafeSearch (a tool that helps filter out explicit results).
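
A toy reranker can make the Seattle/'football' example concrete: boost results whose region tag matches the user's inferred location. The result data, region tags, and boost factor are all made up for illustration.

```python
# Toy context-aware reranking: add a boost when a result's region tag matches
# the user's location. Scores, tags, and the boost value are invented.

results = [
    {"title": "Premier League fixtures", "region": "global", "base": 0.90},
    {"title": "Seahawks game preview",   "region": "us",     "base": 0.85},
]

def rerank(results, user_region, boost=0.2):
    return sorted(
        results,
        key=lambda r: r["base"] + (boost if r["region"] == user_region else 0.0),
        reverse=True,
    )

for r in rerank(results, user_region="us"):
    print(r["title"])  # Seahawks preview comes first for a US user
```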

News and Rumors:

News:

  • Updates on page experience ranking factors, announcing new speed and UX metrics

  • Google announced in November 2020 that the page experience signals in ranking will roll out in May 2021. The new page experience signals combine Core Web Vitals with existing search signals, including mobile-friendliness, safe browsing, HTTPS security, and intrusive interstitial guidelines. In mid-November, Google also started crawling a handful of sites over HTTP/2, the next generation of the transfer protocol that powers the web; this will save a considerable amount of resources for both sites and Googlebot (a small client-side HTTP/2 example follows this list).

  • Moving from Google Webmasters to Google Search Central centralizes help information on one site; the goal is still to improve the visibility of websites on Google Search and on social media.
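
For a hands-on feel for HTTP/2, here is a small client-side fetch using the third-party httpx library (installed with the http2 extra). This only demonstrates protocol negotiation from the client side; it is not Google's crawler code.

```python
# Fetch a page over HTTP/2 from the client side, just to see the protocol
# negotiation in action. Requires: pip install "httpx[http2]"
# (This illustrates the protocol Googlebot adopted; it is not crawler code.)
import httpx

with httpx.Client(http2=True) as client:
    response = client.get("https://www.google.com/")
    print(response.http_version)  # "HTTP/2" if the server negotiated it
```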

Rumors:

  • a rumored update to Google Search will add a new section called 'short videos' to let users access certain content, such as Instagram and TikTok videos, more easily. It will also let users watch videos in the browser without being forced into the apps. For now, these features are designed exclusively for mobile devices.

Product sense questions:

  1. How does a new ranking model affect the fraction of users who click on the first result? The second?

  2. How many users click on page 2 of results?

  3. Once a user clicks out to a result page, how long before they click the back button to come back to the search results page?

  • Applying different ranking models to real users is not how Google tests its search engine. In fact, Google does not use real usage data to tune its search ranking algorithms, because usage behavior/metrics are usually not sensitive to these changes. Instead, Google relies on a small panel of raters to test ranking models. There are two reasons. First, users are trained to trust Google and click on the first result no matter what, so a new ranking model would produce only a slight change in usage data. Second, users don't know what they are missing. This can be explained by the two broad classes of queries a search engine deals with:

    • navigational queries: where users are looking for a specific destination, usually a particular website, e.g. Stanford University. Users can easily tell the best result from the others, and it's usually the first one.

    • informational queries: where the user has a broader topic, such as diabetes in pregnancy; in this case, there is no single right answer. Suppose there's a really good answer on page 4 that provides better information than the first three pages - most users would never know it exists. Therefore, their usage behavior would not provide the best feedback for the rankings.

Resources

GS StatCounter. Search Engine Market Share. https://gs.statcounter.com/search-engine-market-share

Internet Live Stats. Google Search Statistics. https://www.internetlivestats.com/google-search-statistics/#trend

Google Search Central Blog. Google. https://developers.google.com/search/blog

Altucher, James. (2011, March 18). 10 Unusual Things About Google (Also: the Worst Decision I Ever Made). Forbes. https://www.forbes.com/sites/jamesaltucher/2011/03/18/10-unusual-things-about-google-also-the-worst-vc-decision-i-ever-made/?sh=1639ffa212ee

Johnson, Joseph. (2021, February 8). Google: Global Annual Revenue 2002-2020. Statista. https://www.statista.com/statistics/266206/googles-annual-global-revenue/
