Admin Console Help

Content Sources > Web Crawl > Freshness Tuning

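Use the Content Sources > Web Crawl > Freshness Tuning page to fine-tune the timing of crawls for different URLs. You can fine-tune crawling by:

  - Specifying URL patterns to crawl frequently
  - Specifying URL patterns to crawl infrequently
  - Specifying URL patterns to always force recrawl
  - Specifying URL patterns to recrawl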

Before Starting this Task

Before fine-tuning the timing of crawls on different URLs, complete the tasks listed in the following table.

Task: Ensure that the search appliance is crawling in continuous crawl mode.
Method: To select continuous crawl mode, use the Content Sources > Web Crawl > Crawl Schedule page in the Admin Console.

Task: Ensure that URLs that you type on the Content Sources > Web Crawl > Freshness Tuning page can be reached from start URLs.
Method: Check the URLs in the Start Crawling from the Following URLs box on the Content Sources > Web Crawl > Start and Block URLs page in the Admin Console.

Task: Ensure that the patterns that you type on the Content Sources > Web Crawl > Freshness Tuning page are included in follow and crawl patterns.
Method: Check the patterns in the Follow and Crawl Only URLs with the Following Patterns box on the Content Sources > Web Crawl > Start and Block URLs page in the Admin Console.
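
The last prerequisite can be checked before you type anything into the page. The following Python sketch is only an illustration, and it assumes that every pattern is a plain URL prefix. It does not model any other pattern form that the search appliance accepts. The function names and example URLs are hypothetical.

```python
from typing import Iterable, List


def is_covered(freshness_pattern: str, follow_patterns: Iterable[str]) -> bool:
    """Return True if the freshness pattern falls under a follow pattern.

    Assumption: all patterns are plain URL prefixes, so "covered" means
    that the freshness pattern starts with one of the follow patterns.
    """
    return any(freshness_pattern.startswith(p) for p in follow_patterns)


def uncovered(freshness_patterns: Iterable[str],
              follow_patterns: Iterable[str]) -> List[str]:
    """List the freshness patterns that no follow pattern covers."""
    follow = list(follow_patterns)
    return [p for p in freshness_patterns if not is_covered(p, follow)]


# Hypothetical example values.
follow = ["http://www.example.com/"]
fresh = ["http://www.example.com/news/", "http://intranet.example.com/wiki/"]
print(uncovered(fresh, follow))  # ['http://intranet.example.com/wiki/']
```

Any pattern that the sketch reports would need a matching entry on the Start and Block URLs page before freshness tuning can affect it.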

Specifying URL Patterns to Crawl Frequently

Use Crawl Frequently for URL patterns that match content that changes frequently, as often as once an hour or even every few minutes. Crawling these URLs frequently keeps your serving index fresh. However, listing too many URLs under Crawl Frequently can overload the crawler and slow the system down, so keep the number of URLs fairly small.

To set options for crawling frequently changing content:

  1. Select Content Sources > Web Crawl > Freshness Tuning.
  2. Under Crawl Frequently, type URL patterns for content that changes often.
  3. Click Save.

Specifying URL Patterns to Crawl Infrequently

Use Crawl Infrequently to index documents that are never updated or modified, such as a stable database, or that are only incrementally added to, such as in a mail or a news archive. With this option, you can set the crawler to crawl them once a week, once a month, or no more than once every 3 months. This reduces the load on your web servers.

To set options for crawling archival servers:

  1. Select Content Sources > Web Crawl > Freshness Tuning.
  2. Under Crawl Infrequently, type URL patterns for rarely changing or archived documents.
  3. From the Refresh Interval for Crawl Infrequently Patterns drop-down list, select how often the search appliance recrawls the URLs.
  4. Click Save.
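
The effect of the refresh interval can be pictured with a short Python sketch. It is not the search appliance's scheduler. It only illustrates the idea that a URL matching a Crawl Infrequently pattern is not recrawled until the selected interval has passed. The interval labels and function names are hypothetical, and a month is approximated as 30 days.

```python
from datetime import datetime, timedelta

# Hypothetical labels; the text in the drop-down list may differ.
REFRESH_INTERVALS = {
    "weekly": timedelta(days=7),
    "monthly": timedelta(days=30),    # approximation of one month
    "quarterly": timedelta(days=90),  # approximation of three months
}


def earliest_recrawl(last_crawled: datetime, interval: str) -> datetime:
    """Return the earliest time at which the URL may be recrawled."""
    return last_crawled + REFRESH_INTERVALS[interval]


def is_due(last_crawled: datetime, interval: str, now: datetime) -> bool:
    """Return True once the refresh interval has fully elapsed."""
    return now >= earliest_recrawl(last_crawled, interval)


last = datetime(2024, 1, 1)
print(earliest_recrawl(last, "weekly"))              # 2024-01-08 00:00:00
print(is_due(last, "monthly", datetime(2024, 1, 20)))  # False
```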

Specifying Always Force Recrawl of URL Patterns

The first time URLs are crawled, the data is indexed and stored on disk. Subsequently, to allow for faster crawls and less load on the servers, only files modified after the date in the search appliance's If-Modified-Since request header are recrawled. These updates are added to the index.
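
The If-Modified-Since exchange can be sketched from the web server's side. The following Python sketch shows how a well-behaved server would typically decide between 304 Not Modified and 200 OK for a conditional request. It is a simplified illustration of standard HTTP behavior, not code from the search appliance or from any particular web server, and the function name respond is hypothetical.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime
from typing import Optional


def respond(last_modified: datetime, if_modified_since: Optional[str] = None) -> int:
    """Return the status a server would send for a (conditional) GET.

    last_modified must be a timezone-aware datetime.
    """
    if if_modified_since is None:
        return 200  # unconditional request: always send the document
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200  # an unparseable header is ignored
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop microseconds.
    if last_modified.replace(microsecond=0) <= since:
        return 304  # unchanged since that date: no body is sent
    return 200


modified = datetime(2024, 3, 1, 12, 0, 0, tzinfo=timezone.utc)
later = format_datetime(datetime(2024, 3, 2, tzinfo=timezone.utc), usegmt=True)
earlier = format_datetime(datetime(2024, 2, 1, tzinfo=timezone.utc), usegmt=True)
print(respond(modified, later))    # 304
print(respond(modified, earlier))  # 200
```

A server whose clock or file dates are wrong can return 304 for a document that has in fact changed, which is how out-of-date pages end up staying in the index.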

Type URL patterns in the Always Force Recrawl section only if out-of-date pages are displayed in your index. The crawler attempts to determine which servers contain content with incorrect dates and to adjust automatically, but other types of errors may be present.

Make sure that your servers maintain the correct time. If you think one or more of your web servers does not support the If-Modified-Since option or is misconfigured, use this section to type URL patterns to recrawl. Refer problems with your web servers to your webmaster.
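
One way to get a first indication of whether a web server honors If-Modified-Since is to request a page, then request it again with the returned Last-Modified date. The following Python sketch does this with the standard library. It is a rough diagnostic under stated assumptions, not a feature of the search appliance. The function name and the opener parameter are hypothetical, and the opener exists only so that the function can be exercised without a network.

```python
import urllib.error
import urllib.request


def honors_if_modified_since(url, opener=urllib.request.urlopen):
    """Probe a URL with a conditional GET.

    Returns True if the server answers 304 Not Modified, False if it
    sends the document again, and None if the first response has no
    Last-Modified header to echo back.
    """
    with opener(url) as first:
        last_modified = first.headers.get("Last-Modified")
    if last_modified is None:
        return None

    request = urllib.request.Request(
        url, headers={"If-Modified-Since": last_modified}
    )
    try:
        with opener(request) as second:
            # A custom opener might return a 304 response instead of raising.
            return second.status == 304
    except urllib.error.HTTPError as err:
        if err.code == 304:  # urlopen reports 304 as an HTTPError
            return True
        raise


# Example call (hypothetical URL, requires network access):
# print(honors_if_modified_since("http://www.example.com/index.html"))
```

A result of False or None for pages that appear stale in the index suggests that the matching URL patterns are candidates for the Always Force Recrawl section.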

To force recrawling certain URL patterns, regardless of your web server's response to If-Modified-Since:

  1. Select Content Sources > Web Crawl > Freshness Tuning.
  2. Under Always Force Recrawl, type URL patterns for pages to always recrawl regardless of last-modified date.
  3. Click Save.

Specifying Recrawl of URL Patterns

If you discover that a set of URLs has not been recrawled recently (usually because changes were made to the web pages or because a temporary error or misconfiguration was present when the crawler last tried to crawl them), you can type a matching pattern in the Recrawl these URL Patterns box to inject the URLs quickly into the queue of URLs that the search appliance is crawling. The URLs are crawled soon, unless there are higher-priority URLs in the queue.

To have the search appliance recrawl a URL pattern:

  1. Select Content Sources > Web Crawl > Freshness Tuning.
  2. Under Recrawl these URL Patterns, type URL patterns for the pages that you want the search appliance to recrawl.
  3. Click Recrawl.

For More Information

For detailed information about freshness tuning, see "Administering Crawl: Advanced Topics," which is linked to the Google Search Appliance help center.

 


 
© Google Inc.