Improving Search with rigorous testing

Our goal is always to provide you with the most useful and relevant information. Every change we make to Search is aimed at improving the usefulness of the results you see.

[Illustration: a world map with pins on every continent, each with a different avatar.]

Testing for usefulness

Search has changed over the years to meet the evolving needs and expectations of the people who use Google. From innovations like the Knowledge Graph to updates that ensure our systems keep highlighting relevant content, our goal is always to improve the usefulness of your results. That is why, while advertisers can pay to appear in clearly marked ad sections, no one can buy better placement in the Search results.

We put every proposed change to Search through a rigorous evaluation process, analyzing metrics to decide whether to implement it. Data from these evaluations and experiments goes through a thorough review by experienced engineers and search analysts, as well as legal and privacy experts, who determine whether the change is approved to launch. In 2023, we ran over 700,000 experiments that resulted in more than 4,000 improvements to Search.

We evaluate Search in multiple ways. In 2023, we ran:

4,781 launches

Every proposed change to Search goes through a review by our most experienced engineers and data scientists, who carefully examine the data from all the different experiments to decide whether the change should launch. If we can't show that a change actually makes things better for people, we don't launch it.

[Illustration: three potential changes rated good, neutral, and poor.]

16,871 live traffic experiments

We conduct live traffic experiments to see how real people interact with a feature before launching it to everyone.

[Illustration: a page comparison with graphs showing rates of user interaction.]

Running experiments

First, we enable the feature in question for just a small percentage of people, usually starting at 0.1%. We then compare the experiment group to a control group that does not have the feature enabled.
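As a concrete illustration, here is a minimal sketch of how deterministic traffic splitting for an experiment like this is often implemented. The 0.1% fraction comes from the text above; the function names, hashing scheme, and the idea of a same-sized control bucket are assumptions for the example, not a description of Google's infrastructure.

```python
import hashlib

# 0.1% starting fraction, per the text above; the rest is illustrative.
EXPERIMENT_FRACTION = 0.001

def assign_group(user_id: str, experiment_name: str) -> str:
    """Deterministically assign a user to the experiment or control group.

    Hashing (experiment_name, user_id) together keeps a person's assignment
    stable across sessions and independent across experiments.
    """
    digest = hashlib.sha256(f"{experiment_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) / 16 ** len(digest)  # uniform in [0, 1)
    if bucket < EXPERIMENT_FRACTION:
        return "experiment"
    if bucket < 2 * EXPERIMENT_FRACTION:
        return "control"  # a same-sized holdout group for comparison
    return "unenrolled"

print(assign_group("user-12345", "new-snippet-layout"))
```

Deriving the bucket from a hash rather than a random draw means no per-user state needs to be stored: the same user always lands in the same group for a given experiment.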

Analyzing metrics

Next, we look at a very long list of metrics, such as what people click on, how many queries were issued, whether queries were abandoned, and how long it took people to click on a result.
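To show what aggregating such metrics could look like, here is a small sketch that computes click rate, abandonment rate, and time-to-click from hypothetical query logs; the QueryLog schema and its field names are invented for this example.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional

@dataclass
class QueryLog:
    """One logged query; the schema is invented for this example."""
    group: str                         # "experiment" or "control"
    clicked: bool                      # did the user click any result?
    seconds_to_click: Optional[float]  # None when the query was abandoned

def summarize(logs: list[QueryLog], group: str) -> dict:
    """Aggregate the kinds of metrics described above for one group."""
    rows = [q for q in logs if q.group == group]
    clicks = [q for q in rows if q.clicked]
    return {
        "queries": len(rows),
        "click_rate": len(clicks) / len(rows),
        "abandonment_rate": 1 - len(clicks) / len(rows),
        "mean_seconds_to_click": mean(q.seconds_to_click for q in clicks),
    }

logs = [
    QueryLog("experiment", True, 4.2),
    QueryLog("experiment", False, None),
    QueryLog("experiment", True, 2.8),
]
print(summarize(logs, "experiment"))
```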

Measuring engagement

Finally, we use these results to measure whether engagement with the new feature is positive, ensuring that the changes we make increase the relevance and usefulness of our results for everyone.
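Judging whether a difference in engagement between the experiment and control groups is real rather than noise is typically a statistical question. The sketch below applies a standard two-proportion z-test to click rates; the figures are made up, and the choice of test is an assumption for illustration rather than Google's actual methodology.

```python
from math import sqrt, erf

def two_proportion_z_test(clicks_a: int, n_a: int, clicks_b: int, n_b: int):
    """Two-sided z-test for a difference in click rates between two groups."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Normal CDF via erf; two-sided p-value for the observed |z|.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical data: 5.3% vs 5.0% click rates over 100,000 queries each.
z, p = two_proportion_z_test(5_300, 100_000, 5_000, 100_000)
print(f"z = {z:.2f}, p = {p:.4f}")  # roughly z = 3.04, p = 0.0024
```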

719,326 search quality tests

We work with external Search Quality Raters to measure the quality of Search results on an ongoing basis. Raters assess how well website content fulfills a search request and evaluate the quality of results based on the expertise, authoritativeness, and trustworthiness of the content. These ratings do not directly impact ranking, but they help us benchmark the quality of our results and make sure they meet a high bar all around the world.

To ensure a consistent approach, we publish Search Quality Rater Guidelines that give Raters guidance and examples for appropriate website ratings. While evaluating the quality of search results might sound simple, there are many tricky cases to think through, so this feedback is critical to ensuring we maintain high-quality results for users.

[Illustration: search results with a checklist indicating high-quality results.]
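To illustrate how ratings like these can benchmark quality without feeding into ranking, here is a toy sketch that averages rater labels into a single score for a set of results; the label scale and scores are assumptions for the example, not drawn from the actual Rater Guidelines.

```python
# Hypothetical mapping from rater labels to numeric scores; the real
# guidelines use richer scales and criteria than this toy example.
LABEL_SCORES = {"poor": 0.0, "neutral": 0.5, "good": 1.0}

def benchmark_score(ratings: list[str]) -> float:
    """Average rater labels into one quality benchmark in [0, 1]."""
    return sum(LABEL_SCORES[label] for label in ratings) / len(ratings)

print(benchmark_score(["good", "good", "neutral", "poor"]))  # 0.625
```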

124,942 side-by-side experiments

Search isn’t static. We’re constantly improving our systems to return better results, and Search Quality Raters play an important role in the launch process. In a side-by-side experiment, we show Raters two different sets of Search results: one with the proposed change implemented and one without. We ask them which results they prefer and why.

[Illustration: two styles of Search results compared side by side, with a Search Quality Rater picking the better option.]
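One simple way to turn side-by-side preferences into a launch signal is a net preference score, as in the sketch below; the voting scheme and scoring are illustrative assumptions, not the actual process.

```python
from collections import Counter

def side_by_side_score(preferences: list[str]) -> float:
    """Net preference for the changed results ("B") over the baseline ("A").

    Each rater votes "A", "B", or "neither"; the score falls in [-1, 1]
    and is positive when raters prefer the proposed change.
    """
    tally = Counter(preferences)
    return (tally["B"] - tally["A"]) / len(preferences)

votes = ["B", "B", "A", "neither", "B", "A", "B"]
print(f"net preference for the change: {side_by_side_score(votes):+.2f}")  # +0.29
```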
