Improving Search with rigorous testing

Our goal is always to provide you with the most useful and relevant information. Every change we make to Search is aimed at improving the usefulness of the results you see.

[Illustration: a world map with pins on every continent, each with a different avatar.]

Testing for usefulness

Search has changed over the years to meet the evolving needs and expectations of the people who use Google. From innovations like the Knowledge Graph to updates that ensure our systems keep highlighting relevant content, our goal is always to improve the usefulness of your results. That is why, while advertisers can pay to appear in clearly marked ad sections, no one can buy better placement in the Search results.

We put every proposed change to Search through a rigorous evaluation process, analyzing metrics to decide whether to implement it. Data from these evaluations and experiments goes through a thorough review by experienced engineers and search analysts, as well as legal and privacy experts, who determine whether the change is approved to launch. In 2023, we ran over 700,000 experiments that resulted in more than 4,000 improvements to Search.

We evaluate Search in multiple ways. In 2023, we ran:

4,781 launches

Every proposed change to Search goes through a review by our most experienced engineers and data scientists, who carefully examine the data from all the different experiments to decide whether the change should launch. If we can't show that a change actually makes things better for people, we don't launch it.

[Illustration: three potential changes rated good, neutral, and poor.]

16,871 live traffic experiments

We conduct live traffic experiments to see how real people interact with a feature before launching it to everyone.

[Illustration: a page comparison with graphs showing rates of user interaction.]

Running experiments

First, we enable the feature in question for just a small percentage of people, usually starting at 0.1%. We then compare the experiment group to a control group that does not have the feature enabled.
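As a concrete illustration, here is a minimal sketch of how deterministic traffic splitting for an experiment like this is often implemented. The 0.1% fraction comes from the text above; the function names, hashing scheme, and the idea of a same-sized control bucket are assumptions for the example, not a description of Google's infrastructure.

```python
import hashlib

# 0.1% starting fraction, per the text above; the rest is illustrative.
EXPERIMENT_FRACTION = 0.001

def assign_group(user_id: str, experiment_name: str) -> str:
    """Deterministically assign a user to the experiment or control group.

    Hashing (experiment_name, user_id) together keeps a person's assignment
    stable across sessions and independent across experiments.
    """
    digest = hashlib.sha256(f"{experiment_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) / 16 ** len(digest)  # uniform in [0, 1)
    if bucket < EXPERIMENT_FRACTION:
        return "experiment"
    if bucket < 2 * EXPERIMENT_FRACTION:
        return "control"  # a same-sized holdout group for comparison
    return "unenrolled"

print(assign_group("user-12345", "new-snippet-layout"))
```

Deriving the bucket from a hash rather than a random draw means no per-user state needs to be stored: the same user always lands in the same group for a given experiment.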

Analyzing metrics

Next, we look at a very long list of metrics, such as what people click on, how many queries were issued, whether queries were abandoned, and how long it took people to click on a result.
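To show what aggregating such metrics could look like, here is a small sketch that computes click rate, abandonment rate, and time-to-click from hypothetical query logs; the QueryLog schema and its field names are invented for this example.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional

@dataclass
class QueryLog:
    """One logged query; the schema is invented for this example."""
    group: str                         # "experiment" or "control"
    clicked: bool                      # did the user click any result?
    seconds_to_click: Optional[float]  # None when the query was abandoned

def summarize(logs: list[QueryLog], group: str) -> dict:
    """Aggregate the kinds of metrics described above for one group."""
    rows = [q for q in logs if q.group == group]
    clicks = [q for q in rows if q.clicked]
    return {
        "queries": len(rows),
        "click_rate": len(clicks) / len(rows),
        "abandonment_rate": 1 - len(clicks) / len(rows),
        "mean_seconds_to_click": mean(q.seconds_to_click for q in clicks),
    }

logs = [
    QueryLog("experiment", True, 4.2),
    QueryLog("experiment", False, None),
    QueryLog("experiment", True, 2.8),
]
print(summarize(logs, "experiment"))
```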

Measuring engagement

Finally, we use these results to measure whether engagement with the new feature is positive, ensuring that the changes we make increase the relevance and usefulness of our results for everyone.
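Judging whether a difference in engagement between the experiment and control groups is real rather than noise is typically a statistical question. The sketch below applies a standard two-proportion z-test to click rates; the figures are made up, and the choice of test is an assumption for illustration rather than Google's actual methodology.

```python
from math import sqrt, erf

def two_proportion_z_test(clicks_a: int, n_a: int, clicks_b: int, n_b: int):
    """Two-sided z-test for a difference in click rates between two groups."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Normal CDF via erf; two-sided p-value for the observed |z|.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical data: 5.3% vs 5.0% click rates over 100,000 queries each.
z, p = two_proportion_z_test(5_300, 100_000, 5_000, 100_000)
print(f"z = {z:.2f}, p = {p:.4f}")  # roughly z = 3.04, p = 0.0024
```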

719,326 search quality tests

We work with external Search Quality Raters to measure the quality of Search results on an ongoing basis. Raters assess how well website content fulfills a search request and evaluate the quality of results based on the expertise, authoritativeness, and trustworthiness of the content. These ratings do not directly impact ranking, but they help us benchmark the quality of our results and make sure they meet a high bar all around the world.

To ensure a consistent approach, we publish Search Quality Rater Guidelines that give Raters guidance and examples for appropriate website ratings. While evaluating the quality of search results might sound simple, there are many tricky cases to think through, so this feedback is critical to ensuring we maintain high-quality results for users.

[Illustration: search results with a checklist indicating high-quality results.]
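To illustrate how ratings like these can benchmark quality without feeding into ranking, here is a toy sketch that averages rater labels into a single score for a set of results; the label scale and scores are assumptions for the example, not drawn from the actual Rater Guidelines.

```python
# Hypothetical mapping from rater labels to numeric scores; the real
# guidelines use richer scales and criteria than this toy example.
LABEL_SCORES = {"poor": 0.0, "neutral": 0.5, "good": 1.0}

def benchmark_score(ratings: list[str]) -> float:
    """Average rater labels into one quality benchmark in [0, 1]."""
    return sum(LABEL_SCORES[label] for label in ratings) / len(ratings)

print(benchmark_score(["good", "good", "neutral", "poor"]))  # 0.625
```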

124,942 side-by-side experiments

Search isn’t static. We’re constantly improving our systems to return better results, and Search Quality Raters play an important role in the launch process. In a side-by-side experiment, we show Raters two different sets of Search results: one with the proposed change implemented and one without. We ask them which results they prefer and why.

[Illustration: two styles of Search results compared side by side, with a Search Quality Rater picking the better option.]
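One simple way to turn side-by-side preferences into a launch signal is a net preference score, as in the sketch below; the voting scheme and scoring are illustrative assumptions, not the actual process.

```python
from collections import Counter

def side_by_side_score(preferences: list[str]) -> float:
    """Net preference for the changed results ("B") over the baseline ("A").

    Each rater votes "A", "B", or "neither"; the score falls in [-1, 1]
    and is positive when raters prefer the proposed change.
    """
    tally = Counter(preferences)
    return (tally["B"] - tally["A"]) / len(preferences)

votes = ["B", "B", "A", "neither", "B", "A", "B"]
print(f"net preference for the change: {side_by_side_score(votes):+.2f}")  # +0.29
```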
