POST /1/crawlers

Creates a new crawler with the provided configuration.

Request headers

Name Type Required Description
Content-Type String Yes The media type of the request body.

Default value: "application/json"

Request body fields

Name Type Required Description
name String Yes

Name of the crawler.

config Object Yes

Crawler configuration.

config.linkExtractor Object No

Function for extracting URLs from links on crawled pages.

For more information, see the linkExtractor documentation.

config.linkExtractor.source String No
config.linkExtractor.__type String No

Possible values:

  • "function"
config.initialIndexSettings Object No

Crawler index settings.

These index settings are only applied during the first crawl of an index. Any subsequent changes won't be applied to the index. Instead, make changes to your index settings in the Algolia dashboard.
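
As a sketch, and assuming the object is keyed by index name, initial index settings might be declared like this (the index name and settings values are placeholders):

```js
// Hypothetical example: settings applied only during the first crawl of "crawler_docs".
const initialIndexSettings = {
  crawler_docs: {
    searchableAttributes: ["title", "description", "content"],
    customRanking: ["desc(popularity)"],
  },
};
```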

config.maxUrls Number No

Limits the number of URLs your crawler processes.

Change it to a low value, such as 100, for quick crawling tests. Change it to a higher explicit value for full crawls to prevent the crawler from getting "lost" in complex site structures.

Because the Crawler works on many pages simultaneously, maxUrls doesn't guarantee finding the same pages each time it runs.

config.startUrls[] Array No

URLs from where to start crawling.

config.renderJavaScript Object No

If true, use a Chrome headless browser to crawl pages.

Because crawling JavaScript-based web pages is slower than crawling regular HTML pages, you can apply this setting to a specific list of pages. Use micromatch to define URL patterns, including negations and wildcards.
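
A sketch of both forms described above: a plain boolean, or a list of micromatch URL patterns that limits rendering to specific pages. The URLs are placeholders.

```js
// Render every page with a headless Chrome browser:
const renderJavaScriptAll = true;

// Or render only the JavaScript-heavy section, with a negation pattern:
const renderJavaScriptSome = ["https://www.example.com/app/**", "!**/static/**"];
```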

config.externalData[] Array No

References to external data sources for enriching the extracted records.

For more information, see Enrich extracted records with external data.

config.appId String Yes

Algolia application ID where the crawler creates and updates indices. The Crawler add-on must be enabled for this application.

config.sitemaps[] Array No

Sitemaps with URLs from where to start crawling.

config.safetyChecks Object No

Checks to ensure the crawl was successful.

For more information, see the Safety checks documentation.

config.safetyChecks.beforeIndexPublishing Object No

Checks triggered after the crawl finishes but before the records are added to the Algolia index.

config.safetyChecks.beforeIndexPublishing.maxFailedUrls Number No

Stops the crawler if a specified number of pages fail to crawl.

config.safetyChecks.beforeIndexPublishing.maxLostRecordsPercentage Number No

Maximum difference, as a percentage, in the number of records between crawls.

Default value: 10
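
Putting both checks together, a safetyChecks object might look like this sketch (the thresholds are illustrative):

```js
// Illustrative thresholds: block index publishing if 50 pages failed to crawl,
// or if the record count dropped by more than 15% compared with the previous crawl.
const safetyChecks = {
  beforeIndexPublishing: {
    maxFailedUrls: 50,
    maxLostRecordsPercentage: 15,
  },
};
```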

config.ignoreCanonicalTo Object No
config.ignoreQueryParams[] Array No

Query parameters to ignore while crawling.

All URLs with the matching query parameters are treated as identical. This prevents indexing URLs that differ only by their query parameters.

You can use wildcard characters to pattern match.
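
For example (the parameter names are placeholders), tracking parameters can be ignored with wildcards:

```js
// Treat URLs as identical regardless of these query parameters.
const ignoreQueryParams = ["utm_*", "ref", "sessionid"];
```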

config.saveBackup Boolean No

Whether to back up your index before the crawler overwrites it with new records.

config.ignoreNoFollowTo Boolean No

Whether to ignore the nofollow meta tag or link attribute.

For more information, see the ignoreNoFollowTo documentation.

config.requestOptions Object No

Lets you add options to HTTP requests made by the crawler.

config.requestOptions.timeout Number No

Timeout in milliseconds for the crawl.

Default value: 30000

config.requestOptions.proxy String No

Proxy for all crawler requests.

config.requestOptions.retries Number No

Maximum number of retries to crawl one URL.

Default value: 3

config.requestOptions.headers Object No

Headers to add to all requests.

config.requestOptions.headers.Accept-Language String No

Preferred natural language and locale.

config.requestOptions.headers.Authorization String No

Basic authentication header.

config.requestOptions.headers.Cookie String No

Cookie header. It's replaced by the cookie the crawler retrieves when logging in.
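
Combining the options above, a requestOptions object could look like the following sketch; every value is a placeholder.

```js
const requestOptions = {
  timeout: 30000, // milliseconds
  retries: 3,
  proxy: "http://proxy.example.com:8080",
  headers: {
    "Accept-Language": "fr-FR",
    Authorization: "Basic dXNlcjpwYXNzd29yZA==", // base64 of "user:password" (placeholder)
    Cookie: "session=placeholder",
  },
};
```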

config.rateLimit Number Yes

Number of concurrent tasks per second.

If processing each URL takes n seconds, your crawler can process rateLimit / n URLs per second.

Higher numbers mean faster crawls but they also increase your bandwidth and server load.
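
For example, with rateLimit set to 8 and pages that take about 2 seconds each to process, the crawler handles roughly 8 / 2 = 4 URLs per second.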

config.indexPrefix String No

A prefix for all indices created by this crawler. It's combined with the indexName for each action to form the complete index name.

config.schedule String No

Schedule for running the crawl.

For more information, see the schedule documentation.

config.exclusionPatterns[] Array No

URLs to exclude from crawling.

config.extraUrls[] Array No

The Crawler treats extraUrls the same as startUrls. Specify extraUrls if you want to distinguish URLs you manually added to fix site crawling from those you initially specified in startUrls.

config.actions[] Array Yes

A list of actions.

config.actions[].hostnameAliases Object No

Key-value pairs to replace matching hostnames found in a sitemap, on a page, in canonical links, or redirects.

For more information, see the hostnameAliases documentation.
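
For instance (the hostnames are placeholders), a staging hostname can be rewritten to the production one:

```js
// URLs discovered with the staging hostname are treated as production URLs.
const hostnameAliases = {
  "dev.example.com": "www.example.com",
};
```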

config.actions[].name String No

Unique identifier for the action. This option is required if schedule is set.

config.actions[].autoGenerateObjectIDs Boolean No

Whether to generate an objectID for records that don't have one.

Default value: true

config.actions[].discoveryPatterns[] Array No

Indicates intermediary pages that the crawler should visit.

For more information, see the discoveryPatterns documentation.

config.actions[].cache Object No

Whether the crawler should cache crawled pages.

For more information, see the cache documentation.

config.actions[].cache.enabled Boolean No

Whether the crawler cache is active.

Default value: true

config.actions[].recordExtractor Object Yes

Function for extracting information from a crawled page and transforming it into Algolia records for indexing. The Crawler has an editor with autocomplete and validation to help you update the recordExtractor property.

For details, consult the recordExtractor documentation.

config.actions[].recordExtractor.source String No

A JavaScript function (as a string) that returns one or more Algolia records for each crawled page.

config.actions[].recordExtractor.__type String No

Possible values:

  • "function"
config.actions[].fileTypesToMatch[] Array No

File types for crawling non-HTML documents.

For more information, see Extract data from non-HTML documents.

Default value: [ "html" ]

config.actions[].indexName String Yes

Reference to the index used to store the action's extracted records. indexName is combined with the prefix you specified in indexPrefix.

config.actions[].pathAliases Object No

Key-value pairs to replace matching paths with new values.

It doesn't replace:

  • URLs in the startUrls, sitemaps, pathsToMatch, and other settings.
  • Paths found in extracted text.

The crawl continues from the transformed URLs.

config.actions[].pathsToMatch[] Array No

URLs to which this action should apply.

Uses micromatch for negation, wildcards, and more.
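
For example (the URLs are placeholders), micromatch negation can exclude a subsection:

```js
// Apply the action to documentation pages, except anything under /drafts/.
const pathsToMatch = [
  "https://www.example.com/docs/**",
  "!https://www.example.com/docs/drafts/**",
];
```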

config.actions[].selectorsToMatch[] Array No

DOM selectors for nodes that must be present on a page for it to be processed. If the page doesn't match any of the selectors, it's ignored.

config.ignoreNoIndex Boolean No

Whether to ignore the noindex robots meta tag. If true, pages with this meta tag are crawled.

config.ignoreRobotsTxtRules Boolean No

Whether to ignore rules defined in your robots.txt file.

config.apiKey String No

Algolia API key for indexing the records.

For more information, see the apiKey documentation.

config.login Object No

Authorization method and credentials for crawling protected content.

config.maxDepth Number No

Maximum path depth of crawled URLs. For example, if maxDepth is 2, https://example.com/foo/bar is crawled, but https://example.com/foo/bar/baz isn't. Trailing slashes increase the URL depth.
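
Tying the required fields together, a minimal create-crawler request might look like the following sketch. The base URL, credentials, index name, and all URLs are placeholders; only the required fields from the table above, plus startUrls, are set.

```js
// Placeholder endpoint and credentials; replace with your own values.
const CRAWLER_API_BASE_URL = "https://<crawler-api-host>";

const body = {
  name: "docs-crawler",
  config: {
    appId: "YOUR_ALGOLIA_APP_ID",
    rateLimit: 8,
    startUrls: ["https://www.example.com/docs/"],
    actions: [
      {
        indexName: "crawler_docs",
        pathsToMatch: ["https://www.example.com/docs/**"],
        recordExtractor: {
          __type: "function",
          source: `({ $, url }) => [{ objectID: url.href, title: $("title").text() }]`,
        },
      },
    ],
  },
};

const response = await fetch(`${CRAWLER_API_BASE_URL}/1/crawlers`, {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: "Basic <base64-encoded credentials>", // placeholder
  },
  body: JSON.stringify(body),
});
console.log(response.status, await response.json());
```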

How to start integrating

  1. Add the HTTP Task to your workflow definition.
  2. Search for the API you want to integrate with and click its name.
    • This loads the API reference documentation and prepares the HTTP request settings.
  3. Click Test request to send a test request to the API and see its response.