Admin Console Help
Content Sources > Web Crawl > HTTP Headers

Use the Content Sources > Web Crawl > HTTP Headers page to change the user agent name that identifies the Google Search Appliance, to modify the HTTP headers that are included in HTTP requests made during the search appliance crawl process, or to override automatic document encoding and language detection with values specified in HTTP response headers or HTML metatags. This help page contains information on the following topics:
The Google Search Appliance crawls web sites using a robot named the gsa-crawler. When the crawler requests a page from a web server, the HTTP request includes the user agent name and other information that identifies the crawler to the web server. The HTTP request also includes HTTP headers. The web server can store the user agent name in a log, or it can use the other headers in the request to customize the response or to implement access controls or a security policy. The Google Search Appliance also automatically detects a document's encoding and language.

HTTP headers are part of the HTTP requests made by the search appliance crawler to web servers. HTTP headers use the following format:
For example:

Authorization: Basic c29tZXVzZXI6c29tZXBhc3M=

Any HTTP headers you specify on this page must follow the formats defined in the following protocols:
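The sample Authorization value above can be reproduced with a short sketch. The Python below is illustrative only and is not part of the search appliance. The function name `basic_authorization_header` is hypothetical, and the credentials `someuser` and `somepass` are simply what the sample value decodes to. It assumes the standard Basic scheme, in which the header value is the word Basic followed by the base64 encoding of `username:password`.

```python
import base64


def basic_authorization_header(username: str, password: str) -> str:
    """Return a 'Name: value' header line for HTTP Basic authentication.

    Hypothetical helper for illustration; not a search appliance API.
    """
    # Basic credentials are the string "username:password", base64-encoded.
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return f"Authorization: Basic {token}"


header_line = basic_authorization_header("someuser", "somepass")
print(header_line)  # Authorization: Basic c29tZXVzZXI6c29tZXBhc3M=
```

Base64 is an encoding, not encryption, so a header built this way exposes the credentials to anyone who can read the request.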
Authorization and Proxy-Authorization are two commonly used additional headers. You can find more information on Authorization and Proxy-Authorization headers in the following locations:
For more information about using the Proxy-Authorization header, see Authenticating to a Proxy Server.

Caution: The crawler uses certain HTTP headers for its normal operation (such as Host, Connection, Accept, From, and User-Agent). Any new values that you enter on this page for these headers overwrite the crawler's standard headers and may cause problems. You can use nonstandard headers to pass information that your web servers require, but ensure that all nonstandard headers are valid for your servers. Otherwise, search results may be returned in an unpredictable manner.

Changing the User Agent Name

The user agent name is part of the identifier that the gsa-crawler uses to identify itself to a web server. The identifier consists of the following elements, which are all automatically appended when the crawler makes a request to a web server:
For example, the crawler might identify itself as follows, where the user agent name is gsa-crawler, the unique identifier is GID01065, and the email address is yourname@yourcompany.com:
To change the user agent name, enter a new user agent name and click Update.

Relaxing Strict Domain Checking of Cookies

By default, the Google Search Appliance enforces strict domain checking of cookies that the crawler sends to servers, typically for access to protected resources. With strict domain checking of cookies enforced, the crawler sends a cookie only to servers whose hostnames exactly match the domain of the cookie. For example, suppose the crawler has a cookie with a domain name of

In some cases, you might want to relax strict domain checking of cookies, so that the crawler sends a cookie to a server even though there isn't an exact domain match. For example, you might want the crawler to send a cookie to

To relax strict domain checking on cookies, uncheck Enable strict domain check on cookie.

Specifying Additional HTTP Headers

This is an optional task. In most cases, you do not need to change or add to the HTTP headers for the crawler. To specify additional HTTP headers:
Respect encoding and language specified in HTTP headers and/or HTML metatags

By default, the Google Search Appliance automatically detects a document's encoding (also known as charset) and language based on multiple factors, such as the top-level domain (TLD) and the characters and words in the document body. In some cases, the encoding, the language, or both are detected incorrectly. Instead of using the autodetected encoding and language, the Google Search Appliance can respect the encoding and language specified in the HTTP response headers for a crawl request or in a document's HTML metatags. External metadata is not parsed when setting encoding and language.
Document encoding can be parsed from the Content-Type HTTP header (if present in the HTTP response headers), the http-equiv HTML metatag, or the charset HTML metatag. For more information about Content-Type as an HTTP header and HTML metatag, or about the charset HTML metatag, see:
Document language can be parsed from either the Content-Language HTTP header (if present in the HTTP response headers) or the http-equiv HTML metatag. For more information about Content-Language as an HTTP header and HTML metatag, see:

NOTE: If the Google Search Appliance is configured to respect encoding and language from both HTTP headers and HTML metatags, a valid value from an HTML metatag takes precedence over the value from an HTTP header.

To use HTTP headers (if present in the HTTP response for a document) to set encoding and language for all documents, check Respect Content-Type and Content-Language HTTP headers (will override encoding and language autodetection).

To use HTML metatags (if present in the document) to set encoding and language for all documents, check Respect encoding and language specified in HTML metatags (will override encoding and language autodetection and HTTP headers. Does not work with external metadata).

A document's encoding and language are set only at crawl time. Changes to either of the above settings do not affect documents already in the index until the next time they are crawled. To apply the new settings to documents already in the index, force the search appliance to recrawl URL patterns by using the Index > Diagnostics > Index Diagnostics page.
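The precedence described above (HTML metatag over HTTP header over autodetection) can be sketched as follows. This Python is an illustration under stated assumptions, not the search appliance's implementation: the names `charset_from_content_type`, `MetaTagScanner`, and `resolve_encoding_and_language` are hypothetical, the two boolean parameters stand in for the two checkboxes, the autodetected values are passed in rather than computed, and any non-empty declared value is treated as valid.

```python
from html.parser import HTMLParser


def charset_from_content_type(value):
    """Return the charset parameter of a Content-Type value, or None."""
    for part in value.split(";")[1:]:
        name, _, param = part.partition("=")
        if name.strip().lower() == "charset" and param.strip():
            return param.strip().strip('"')
    return None


class MetaTagScanner(HTMLParser):
    """Collect the encoding and language declared in HTML metatags."""

    def __init__(self):
        super().__init__()
        self.encoding = None
        self.language = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        # Form 1: <meta charset="...">
        if attributes.get("charset", "").strip():
            self.encoding = attributes["charset"].strip()
        # Form 2: <meta http-equiv="..." content="...">
        http_equiv = attributes.get("http-equiv", "").lower()
        content = attributes.get("content", "")
        if http_equiv == "content-type":
            self.encoding = charset_from_content_type(content) or self.encoding
        elif http_equiv == "content-language":
            self.language = content.strip() or self.language


def resolve_encoding_and_language(headers, html, detected,
                                  respect_headers, respect_metatags):
    """Apply the precedence: metatag, then HTTP header, then autodetection.

    headers is assumed to be a dict keyed by exact header names.
    detected is the (encoding, language) pair from autodetection.
    """
    encoding, language = detected  # autodetected values are the fallback
    if respect_headers:
        header_charset = charset_from_content_type(headers.get("Content-Type", ""))
        encoding = header_charset or encoding
        language = headers.get("Content-Language", "").strip() or language
    if respect_metatags:
        scanner = MetaTagScanner()
        scanner.feed(html)
        # Metatag values are applied last, so they override header values.
        encoding = scanner.encoding or encoding
        language = scanner.language or language
    return encoding, language


headers = {"Content-Type": "text/html; charset=ISO-8859-1",
           "Content-Language": "de"}
html = ('<html><head><meta http-equiv="Content-Type" '
        'content="text/html; charset=UTF-8"></head><body></body></html>')
detected = ("windows-1252", "en")

print(resolve_encoding_and_language(headers, html, detected, True, True))
print(resolve_encoding_and_language(headers, html, detected, True, False))
print(resolve_encoding_and_language(headers, html, detected, False, False))
```

With both options enabled, the sample document takes its encoding from the metatag and its language from the header, because the page declares no language metatag. With both options disabled, the autodetected pair is returned unchanged.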
© Google Inc.