Admin Console Help

Content Sources > Web Crawl > HTTP Headers

Use the Content Sources > Web Crawl > HTTP Headers page to change the user agent name that identifies the Google Search Appliance, to modify the HTTP headers included in HTTP requests made during the search appliance crawl process, or to override automatic document encoding and language detection with values specified in HTTP response headers or HTML metatags.

The Google Search Appliance crawls web sites using a robot named the gsa-crawler. When the crawler requests a page from a web server, the HTTP request includes the user agent name and other information that identifies the crawler to the web server. The HTTP request also includes HTTP headers. The web server can store the user agent name in a log or use the other headers in the request to customize the response or implement access controls or a security policy. Also, the Google Search Appliance automatically detects a document's encoding and language.

HTTP headers are part of the HTTP requests made by the search appliance crawler to web servers. HTTP headers use the following format:

header_name:header_value

For example:

Authorization: Basic c29tZXVzZXI6c29tZXBhc3M=
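
As an illustration of the format above, the following minimal Python sketch builds a Basic Authorization header line from a user name and password. The function name is hypothetical and is not part of the search appliance; the credentials someuser and somepass are simply the values that the example token above decodes to.

```python
import base64


def basic_authorization_header(username: str, password: str) -> str:
    """Return an 'Authorization: Basic ...' header line (illustrative helper)."""
    # Basic authentication encodes "username:password" with base64.
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return f"Authorization: Basic {token}"


print(basic_authorization_header("someuser", "somepass"))
# prints: Authorization: Basic c29tZXVzZXI6c29tZXBhc3M=
```

Base64 is an encoding, not encryption, so anyone who can read the header can recover the credentials.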

Any HTTP headers you specify on this page must follow the formats defined in the HTTP protocol specifications.

Authorization and Proxy-Authorization are two commonly used additional headers.

For more information about using the Proxy-Authorization header, see Authenticating to a Proxy Server.

Caution: The crawler uses certain HTTP headers for its normal operation, such as Host, Connection, Accept, From, and User-Agent. Any values that you enter on this page for these headers overwrite the crawler's standard headers and may cause problems.

You can use nonstandard headers that enable passing information your web servers require, but ensure that all nonstandard headers are valid for your servers. Otherwise, search results may be returned in an unpredictable manner.

Changing the User Agent Name

The user agent name is part of the identifier used by the gsa-crawler to identify itself to a web server. The identifier consists of the following elements, which are all automatically appended when the crawler makes a request to a web server:

  • The user agent name, which is set by default to gsa-crawler. The user agent name must contain only alphabetic characters and hyphens.
  • A unique identifier that is associated with a particular search appliance. You cannot change this identifier.
  • The email address for problem reports that is entered on the Administration > System Settings page, in the Send problem reports to field. Webmasters can use this email address to contact you if search appliance traffic overloads a web server or otherwise affects web site operations.

For example, the crawler might identify itself as follows, where the user agent name is gsa-crawler, the unique identifier is GID01065, and the email address is yourname@yourcompany.com:

    gsa-crawler (Enterprise; GID01065; yourname@yourcompany.com)

To change the user agent name, enter a new user agent name and click Update.
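
The identifier described above can be sketched in Python as follows. This is only an illustration: the function name is hypothetical, and the layout of the string, including the Enterprise token, is inferred from the single example shown above rather than from a documented specification.

```python
import re


def build_crawler_identifier(user_agent_name: str, appliance_id: str, email: str) -> str:
    """Compose a crawler identifier in the layout of the example above (assumed)."""
    # The user agent name must contain only alphabetic characters and hyphens.
    if not re.fullmatch(r"[A-Za-z-]+", user_agent_name):
        raise ValueError("user agent name must contain only letters and hyphens")
    return f"{user_agent_name} (Enterprise; {appliance_id}; {email})"


print(build_crawler_identifier("gsa-crawler", "GID01065", "yourname@yourcompany.com"))
# prints: gsa-crawler (Enterprise; GID01065; yourname@yourcompany.com)
```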

Relaxing Strict Domain Checking of Cookies

By default, the Google Search Appliance enforces strict domain checking of cookies that the crawler sends to servers, typically for access to protected resources.

With strict domain checking of cookies enforced, the crawler sends a cookie only to servers whose hostnames exactly match the domain of the cookie. For example, suppose the crawler has a cookie with a domain name of .cosmoaud.com. In this case, the crawler sends the cookie to the servers www.cosmoaud.com, mail.cosmoaud.com, and so on, because there is a domain match. However, the crawler does not send the cookie to cosmoaud.com. Because the hostname cosmoaud.com does not contain the cookie domain's leading period, there isn't a domain match.

In some cases, you might want to relax strict domain checking of cookies, so that the crawler sends a cookie to a server even though there isn't an exact domain match. For example, you might want the crawler to send a cookie to cosmoaud.com, in addition to servers with matching domains. Google does not recommend relaxing strict domain checking except in special cases.

To relax strict domain checking on cookies, uncheck Enable strict domain check on cookie.
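
The matching behavior described above can be approximated with the following Python sketch. It is not the crawler's actual implementation. Both function names are hypothetical, and the relaxed rule shown here (also accepting the bare domain without the leading period) is an assumption based only on the cosmoaud.com example.

```python
def strict_domain_match(hostname: str, cookie_domain: str) -> bool:
    """Strict check: the hostname must end with the cookie domain as written."""
    host = hostname.lower()
    domain = cookie_domain.lower()
    if domain.startswith("."):
        # ".cosmoaud.com" matches "www.cosmoaud.com" but not "cosmoaud.com".
        return host.endswith(domain)
    return host == domain


def relaxed_domain_match(hostname: str, cookie_domain: str) -> bool:
    """Relaxed check (assumed): also accept the bare domain itself."""
    if strict_domain_match(hostname, cookie_domain):
        return True
    return hostname.lower() == cookie_domain.lower().lstrip(".")


for host in ("www.cosmoaud.com", "mail.cosmoaud.com", "cosmoaud.com"):
    print(host, strict_domain_match(host, ".cosmoaud.com"),
          relaxed_domain_match(host, ".cosmoaud.com"))
```

With the cookie domain .cosmoaud.com, only the bare hostname cosmoaud.com differs between the two checks: the strict check rejects it and the relaxed check accepts it.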

Specifying Additional HTTP Headers

This is an optional task. In most cases, you do not need to change or add to the HTTP headers for the crawler.

To specify additional HTTP headers:

  1. Click Content Sources > Web Crawl > HTTP Headers.
  2. In the Additional HTTP Headers for Crawler box, enter a new header.
  3. To add more headers, press Enter to start a new line.
  4. After all the headers are specified, click Update.
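
Before pasting headers into the box, it can help to check them offline. The following Python sketch is a hypothetical pre-check, not a feature of the Admin Console. It verifies that each line follows the header_name:header_value format and flags the headers named in the Caution above. The RESERVED set contains only the headers listed in that Caution and is not exhaustive.

```python
import re

# Headers named in the Caution above; the crawler may rely on others as well.
RESERVED = {"host", "connection", "accept", "from", "user-agent"}
# Characters that HTTP allows in a header name (a "token").
TOKEN = re.compile(r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+")


def check_header_lines(text: str):
    """Return (headers, warnings) for text that holds one header per line."""
    headers, warnings = [], []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped:
            continue  # ignore blank lines
        name, separator, value = stripped.partition(":")
        if not separator or not TOKEN.fullmatch(name):
            warnings.append(f"line {number}: not in header_name:header_value form")
            continue
        if name.lower() in RESERVED:
            warnings.append(f"line {number}: {name} overrides a standard crawler header")
        headers.append((name, value.strip()))
    return headers, warnings


sample = "Authorization: Basic c29tZXVzZXI6c29tZXBhc3M=\nUser-Agent: custom\nbad line"
parsed, problems = check_header_lines(sample)
print(parsed)
print(problems)
```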

Respect encoding and language specified in HTTP headers and/or HTML metatags

By default, the Google Search Appliance automatically detects a document's encoding (also known as charset) and language based on multiple factors, such as the top-level domain (TLD) and the characters and words in the document body. In some cases, the encoding or language is detected incorrectly. Instead of using the autodetected encoding and language, the Google Search Appliance can respect the encoding and language specified in a document's HTTP response headers or HTML metatags. External metadata is not parsed when setting encoding and language.

Document encoding can be parsed from the Content-Type HTTP header (if present in the HTTP response headers), the http-equiv HTML metatag, or the charset HTML metatag.

Document language can be parsed from either the Content-Language HTTP header (if present in the HTTP response headers) or the http-equiv HTML metatag.

NOTE: If the Google Search Appliance is configured to respect encoding and language from both HTTP headers and HTML metatags, a valid value from an HTML metatag takes precedence over the value from an HTTP header.
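
The precedence described in the note can be sketched in Python as follows. This is an illustration under stated assumptions, not the search appliance's implementation. The function, class, and parameter names (including respect_headers and respect_meta) are hypothetical stand-ins for the two settings, and the sketch does not check whether a value is a valid encoding or language code.

```python
from html.parser import HTMLParser


def charset_from_content_type(value: str):
    """Return the charset parameter of a Content-Type value, or None."""
    for part in value.split(";")[1:]:
        key, _, val = part.partition("=")
        if key.strip().lower() == "charset" and val.strip():
            return val.strip().strip('"').lower()
    return None


class MetaParser(HTMLParser):
    """Collect encoding and language from HTML metatags (first value wins)."""

    def __init__(self):
        super().__init__()
        self.encoding = None
        self.language = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (val or "") for name, val in attrs}
        if attr.get("charset") and self.encoding is None:
            self.encoding = attr["charset"].strip().lower()
        equiv = attr.get("http-equiv", "").lower()
        content = attr.get("content", "")
        if equiv == "content-type" and self.encoding is None:
            self.encoding = charset_from_content_type(content)
        elif equiv == "content-language" and self.language is None:
            self.language = content.strip().lower() or None


def resolve(headers, html, detected, respect_headers=True, respect_meta=True):
    """Apply the order: autodetected value, then HTTP header, then metatag."""
    encoding, language = detected
    if respect_headers:
        encoding = charset_from_content_type(headers.get("Content-Type", "")) or encoding
        language = headers.get("Content-Language", "").strip().lower() or language
    if respect_meta:
        parser = MetaParser()
        parser.feed(html)
        encoding = parser.encoding or encoding
        language = parser.language or language
    return encoding, language


headers = {"Content-Type": "text/html; charset=ISO-8859-1", "Content-Language": "en"}
page = '<html><head><meta charset="UTF-8"><meta http-equiv="Content-Language" content="fr"></head></html>'
print(resolve(headers, page, ("windows-1252", "de")))
# prints: ('utf-8', 'fr')
```

In the sample, the metatag values win over the header values, and the autodetected values are used only when neither source supplies one.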

To use HTTP headers (if present in the HTTP response for a document) to set encoding and language for all documents, check Respect Content-Type and Content-Language HTTP headers (will override encoding and language autodetection).

To use HTML metatags (if present in the document) to set encoding and language for all documents, check Respect encoding and language specified in HTML metatags (will override encoding and language autodetection and HTTP headers. Does not work with external metadata).

A document's encoding and language are set only at crawl time. Changes to either of the above settings do not affect documents already in the index until the next time they are crawled. To apply the new settings to documents already in the index, force the search appliance to recrawl URL patterns by using the Index > Diagnostics > Index Diagnostics page.

 
© Google Inc.