POST /1/crawlers

Creates a new crawler with the provided configuration.

Request headers

Name Type Required Description
Content-Type String Yes The media type of the request body.

Default value: "application/json"

Request body fields

Name Type Required Description
name String Yes

Name of the crawler.

config Object Yes

Crawler configuration.

config.linkExtractor Object No

Function for extracting URLs from links on crawled pages.

For more information, see the linkExtractor documentation.

config.linkExtractor.source String No
config.linkExtractor.__type String No

Possible values:

  • "function"
config.initialIndexSettings Object No

Crawler index settings.

These index settings are only applied during the first crawl of an index. Any subsequent changes won't be applied to the index. Instead, make changes to your index settings in the Algolia dashboard.
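
As a sketch, and assuming the object is keyed by index name, initial index settings might be declared like this (the index name and settings values are placeholders):

```js
// Hypothetical example: settings applied only during the first crawl of "crawler_docs".
const initialIndexSettings = {
  crawler_docs: {
    searchableAttributes: ["title", "description", "content"],
    customRanking: ["desc(popularity)"],
  },
};
```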

config.maxUrls Number No

Limits the number of URLs your crawler processes.

Change it to a low value, such as 100, for quick crawling tests. Change it to a higher explicit value for full crawls to prevent the crawler from getting "lost" in complex site structures.

Because the Crawler works on many pages simultaneously, maxUrls doesn't guarantee finding the same pages each time it runs.

config.startUrls[] Array No

URLs from where to start crawling.

config.renderJavaScript Object No

If true, use a Chrome headless browser to crawl pages.

Because crawling JavaScript-based web pages is slower than crawling regular HTML pages, you can apply this setting to a specific list of pages. Use micromatch to define URL patterns, including negations and wildcards.
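
A sketch of both forms described above: a plain boolean, or a list of micromatch URL patterns that limits rendering to specific pages. The URLs are placeholders.

```js
// Render every page with a headless Chrome browser:
const renderJavaScriptAll = true;

// Or render only the JavaScript-heavy section, with a negation pattern:
const renderJavaScriptSome = ["https://www.example.com/app/**", "!**/static/**"];
```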

config.externalData[] Array No

References to external data sources for enriching the extracted records.

For more information, see Enrich extracted records with external data.

config.appId String Yes

Algolia application ID where the crawler creates and updates indices. The Crawler add-on must be enabled for this application.

config.sitemaps[] Array No

Sitemaps with URLs from where to start crawling.

config.safetyChecks Object No

Checks to ensure the crawl was successful.

For more information, see the Safety checks documentation.

config.safetyChecks.beforeIndexPublishing Object No

Checks triggered after the crawl finishes but before the records are added to the Algolia index.

config.safetyChecks.beforeIndexPublishing.maxFailedUrls Number No

Stops the crawler if a specified number of pages fail to crawl.

config.safetyChecks.beforeIndexPublishing.maxLostRecordsPercentage Number No

Maximum difference, as a percentage, in the number of records between crawls.

Default value: 10
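
Putting both checks together, a safetyChecks object might look like this sketch (the thresholds are illustrative):

```js
// Illustrative thresholds: block index publishing if 50 pages failed to crawl,
// or if the record count dropped by more than 15% compared with the previous crawl.
const safetyChecks = {
  beforeIndexPublishing: {
    maxFailedUrls: 50,
    maxLostRecordsPercentage: 15,
  },
};
```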

config.ignoreCanonicalTo Object No
config.ignoreQueryParams[] Array No

Query parameters to ignore while crawling.

All URLs with the matching query parameters are treated as identical. This prevents indexing URLs that differ only by their query parameters.

You can use wildcard characters to pattern match.
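
For example (the parameter names are placeholders), tracking parameters can be ignored with wildcards:

```js
// Treat URLs as identical regardless of these query parameters.
const ignoreQueryParams = ["utm_*", "ref", "sessionid"];
```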

config.saveBackup Boolean No

Whether to back up your index before the crawler overwrites it with new records.

config.ignoreNoFollowTo Boolean No

Whether to ignore the nofollow meta tag or link attribute.

For more information, see the ignoreNoFollowTo documentation.

config.requestOptions Object No

Lets you add options to HTTP requests made by the crawler.

config.requestOptions.timeout Number No

Timeout in milliseconds for the crawl.

Default value: 30000

config.requestOptions.proxy String No

Proxy for all crawler requests.

config.requestOptions.retries Number No

Maximum number of retries to crawl one URL.

Default value: 3

config.requestOptions.headers Object No

Headers to add to all requests.

config.requestOptions.headers.Accept-Language String No

Preferred natural language and locale.

config.requestOptions.headers.Authorization String No

Basic authentication header.

config.requestOptions.headers.Cookie String No

Cookie header. It's replaced by the cookie the crawler retrieves when logging in.
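
Combining the options above, a requestOptions object could look like the following sketch; every value is a placeholder.

```js
const requestOptions = {
  timeout: 30000, // milliseconds
  retries: 3,
  proxy: "http://proxy.example.com:8080",
  headers: {
    "Accept-Language": "fr-FR",
    Authorization: "Basic dXNlcjpwYXNzd29yZA==", // base64 of "user:password" (placeholder)
    Cookie: "session=placeholder",
  },
};
```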

config.rateLimit Number Yes

Number of concurrent tasks per second.

If processing each URL takes n seconds, your crawler can process rateLimit / n URLs per second.

Higher numbers mean faster crawls but they also increase your bandwidth and server load.
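
For example, with rateLimit set to 8 and pages that take about 2 seconds each to process, the crawler handles roughly 8 / 2 = 4 URLs per second.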

config.indexPrefix String No

A prefix for all indices created by this crawler. It's combined with the indexName for each action to form the complete index name.

config.schedule String No

Schedule for running the crawl.

For more information, see the schedule documentation.

config.exclusionPatterns[] Array No

URLs to exclude from crawling.

config.extraUrls[] Array No

The Crawler treats extraUrls the same as startUrls. Specify extraUrls if you want to distinguish URLs you manually added to fix site crawling from those you initially specified in startUrls.

config.actions[] Array Yes

A list of actions.

config.actions[].hostnameAliases Object No

Key-value pairs to replace matching hostnames found in a sitemap, on a page, in canonical links, or redirects.

For more information, see the hostnameAliases documentation.
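
For instance (the hostnames are placeholders), a staging hostname can be rewritten to the production one:

```js
// URLs discovered with the staging hostname are treated as production URLs.
const hostnameAliases = {
  "dev.example.com": "www.example.com",
};
```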

config.actions[].name String No

Unique identifier for the action. This option is required if schedule is set.

config.actions[].autoGenerateObjectIDs Boolean No

Whether to generate an objectID for records that don't have one.

Default value: true

config.actions[].discoveryPatterns[] Array No

Indicates intermediary pages that the crawler should visit.

For more information, see the discoveryPatterns documentation.

config.actions[].cache Object No

Whether the crawler should cache crawled pages.

For more information, see the cache documentation.

config.actions[].cache.enabled Boolean No

Whether the crawler cache is active.

Default value: true

config.actions[].recordExtractor Object Yes

Function for extracting information from a crawled page and transforming it into Algolia records for indexing. The Crawler has an editor with autocomplete and validation to help you update the recordExtractor property.

For details, consult the recordExtractor documentation.

config.actions[].recordExtractor.source String No

A JavaScript function (as a string) that returns one or more Algolia records for each crawled page.

config.actions[].recordExtractor.__type String No

Possible values:

  • "function"
config.actions[].fileTypesToMatch[] Array No

File types for crawling non-HTML documents.

For more information, see Extract data from non-HTML documents.

Default value: [ "html" ]

config.actions[].indexName String Yes

Reference to the index used to store the action's extracted records. indexName is combined with the prefix you specified in indexPrefix.

config.actions[].pathAliases Object No

Key-value pairs to replace matching paths with new values.

It doesn't replace:

  • URLs in the startUrls, sitemaps, pathsToMatch, and other settings.
  • Paths found in extracted text.

The crawl continues from the transformed URLs.

config.actions[].pathsToMatch[] Array No

URLs to which this action should apply.

Uses micromatch for negation, wildcards, and more.
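
For example (the URLs are placeholders), micromatch negation can exclude a subsection:

```js
// Apply the action to documentation pages, except anything under /drafts/.
const pathsToMatch = [
  "https://www.example.com/docs/**",
  "!https://www.example.com/docs/drafts/**",
];
```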

config.actions[].selectorsToMatch[] Array No

DOM selectors for nodes that must be present on a page for it to be processed. If the page doesn't match any of the selectors, it's ignored.

config.ignoreNoIndex Boolean No

Whether to ignore the noindex robots meta tag. If true, pages with this meta tag are crawled.

config.ignoreRobotsTxtRules Boolean No

Whether to ignore rules defined in your robots.txt file.

config.apiKey String No

Algolia API key for indexing the records.

For more information, see the apiKey documentation.

config.login Object No

Authorization method and credentials for crawling protected content.

config.maxDepth Number No

Maximum path depth of crawled URLs. For example, if maxDepth is 2, https://example.com/foo/bar is crawled, but https://example.com/foo/bar/baz isn't. Trailing slashes increase the URL depth.
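
Tying the required fields together, a minimal create-crawler request might look like the following sketch. The base URL, credentials, index name, and all URLs are placeholders; only the required fields from the table above, plus startUrls, are set.

```js
// Placeholder endpoint and credentials; replace with your own values.
const CRAWLER_API_BASE_URL = "https://<crawler-api-host>";

const body = {
  name: "docs-crawler",
  config: {
    appId: "YOUR_ALGOLIA_APP_ID",
    rateLimit: 8,
    startUrls: ["https://www.example.com/docs/"],
    actions: [
      {
        indexName: "crawler_docs",
        pathsToMatch: ["https://www.example.com/docs/**"],
        recordExtractor: {
          __type: "function",
          source: `({ $, url }) => [{ objectID: url.href, title: $("title").text() }]`,
        },
      },
    ],
  },
};

const response = await fetch(`${CRAWLER_API_BASE_URL}/1/crawlers`, {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: "Basic <base64-encoded credentials>", // placeholder
  },
  body: JSON.stringify(body),
});
console.log(response.status, await response.json());
```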

How to start integrating

  1. Add the HTTP Task to your workflow definition.
  2. Search for the API you want to integrate with and click its name.
    • This loads the API reference documentation and prepares the HTTP request settings.
  3. Click Test request to send a test request to the API and see its response.