POST /1/crawlers
Creates a new crawler with the provided configuration.
Servers
- https://crawler.algolia.com/api
Request headers
Name | Type | Required | Description |
---|---|---|---|
Content-Type | String | Yes | The media type of the request body. Default value: "application/json" |
Request body fields
Name | Type | Required | Description |
---|---|---|---|
name | String | Yes | Name of the crawler. |
config | Object | Yes | Crawler configuration. See the example request after this table. |
config.linkExtractor | Object | No | Function for extracting URLs from links on crawled pages. For more information, see Algolia's linkExtractor documentation. |
config.linkExtractor.source | String | No | A JavaScript function (as a string) that returns the URLs to crawl from each page. |
config.linkExtractor.__type | String | No | Possible values: "function" |
config.initialIndexSettings | Object | No | Crawler index settings. These index settings are only applied during the first crawl of an index. Any subsequent changes won't be applied to the index. Instead, make changes to your index settings in the Algolia dashboard. |
config.maxUrls | Number | No | Limits the number of URLs your crawler processes. Change it to a low value, such as 100, for quick crawling tests. Change it to a higher explicit value for full crawls to prevent it from getting "lost" in complex site structures. Because the Crawler works on many pages simultaneously, this limit isn't exact: the number of processed URLs can slightly exceed it. |
config.startUrls[] | Array | No | URLs from where to start crawling. |
config.renderJavaScript | Object | No | If true, the crawler uses a headless browser to render JavaScript-based pages. Because crawling JavaScript-based web pages is slower than crawling regular HTML pages, you can apply this setting to a specific list of pages. Use micromatch to define URL patterns, including negations and wildcards. |
config.externalData[] | Array | No | References to external data sources for enriching the extracted records. For more information, see Enrich extracted records with external data. |
config.appId | String | Yes | Algolia application ID where the crawler creates and updates indices. The Crawler add-on must be enabled for this application. |
config.sitemaps[] | Array | No | Sitemaps with URLs from where to start crawling. |
config.safetyChecks | Object | No | Checks to ensure the crawl was successful. For more information, see the Safety checks documentation. |
config.safetyChecks.beforeIndexPublishing | Object | No | Checks triggered after the crawl finishes but before the records are added to the Algolia index. |
config.safetyChecks.beforeIndexPublishing.maxFailedUrls | Number | No | Stops the crawler if a specified number of pages fail to crawl. |
config.safetyChecks.beforeIndexPublishing.maxLostRecordsPercentage | Number | No | Maximum allowed difference, in percent, between the number of records of two consecutive crawls. Default value: 10 |
config.ignoreCanonicalTo | Object | No | Whether to ignore canonical redirects. Can be true (ignore canonical links on all pages) or a list of URL patterns for which canonical links are ignored. |
config.ignoreQueryParams[] | Array | No | Query parameters to ignore while crawling. URLs that differ only by these query parameters are treated as identical, which prevents indexing duplicate URLs. You can use wildcard characters for pattern matching. |
config.saveBackup | Boolean | No | Whether to back up your index before the crawler overwrites it with new records. |
config.ignoreNoFollowTo | Boolean | No | Whether to ignore the nofollow attribute of links and crawl them anyway. For more information, see Algolia's ignoreNoFollowTo documentation. |
config.requestOptions | Object | No | Lets you add options to HTTP requests made by the crawler. |
config.requestOptions.timeout | Number | No | Timeout in milliseconds for the crawl. Default value: 30000 |
config.requestOptions.proxy | String | No | Proxy for all crawler requests. |
config.requestOptions.retries | Number | No | Maximum number of retries to crawl one URL. Default value: 3 |
config.requestOptions.headers | Object | No | Headers to add to all requests. |
config.requestOptions.headers.Accept-Language | String | No | Preferred natural language and locale. |
config.requestOptions.headers.Authorization | String | No | Basic authentication header. |
config.requestOptions.headers.Cookie | String | No | Cookie header. It's replaced by the cookie retrieved when the crawler logs in. |
config.rateLimit | Number | Yes | Number of concurrent tasks per second. If processing each URL takes n seconds, your crawler can process rateLimit / n URLs per second. Higher numbers mean faster crawls, but they also increase your bandwidth and server load. |
config.indexPrefix | String | No | A prefix for all indices created by this crawler. It's combined with the indexName of each action to form the full index name. |
config.schedule | String | No | Schedule for running the crawl. For more information, see Algolia's scheduling documentation. |
config.exclusionPatterns[] | Array | No | URLs to exclude from crawling. |
config.extraUrls[] | Array | No | Additional URLs to crawl. The Crawler treats extraUrls the same as startUrls. |
config.actions[] | Array | Yes | A list of actions. |
config.actions[].hostnameAliases | Object | No | Key-value pairs to replace matching hostnames found in a sitemap, on a page, in canonical links, or redirects. For more information, see Algolia's hostnameAliases documentation. |
config.actions[].name | String | No | Unique identifier for the action. This option is required if your configuration has more than one action. |
config.actions[].autoGenerateObjectIDs | Boolean | No | Whether to generate an objectID for records that don't have one. Default value: true |
config.actions[].discoveryPatterns[] | Array | No | Indicates intermediary pages that the crawler should visit. For more information, see Algolia's discoveryPatterns documentation. |
config.actions[].cache | Object | No | Whether the crawler should cache crawled pages. For more information, see Algolia's caching documentation. |
config.actions[].cache.enabled | Boolean | No | Whether the crawler cache is active. Default value: true |
config.actions[].recordExtractor | Object | Yes | Function for extracting information from a crawled page and transforming it into Algolia records for indexing. The Crawler has an editor with autocomplete and validation to help you write the recordExtractor. For details, see Algolia's recordExtractor documentation and the serialization sketch after this table. |
config.actions[].recordExtractor.source | String | No | A JavaScript function (as a string) that returns one or more Algolia records for each crawled page. |
config.actions[].recordExtractor.__type | String | No | Possible values: "function" |
config.actions[].fileTypesToMatch[] | Array | No | File types for crawling non-HTML documents. For more information, see Extract data from non-HTML documents. Default value: [ "html" ] |
config.actions[].indexName | String | Yes | Reference to the index used to store the action's extracted records. If indexPrefix is set, the full index name is the prefix combined with this name. |
config.actions[].pathAliases | Object | No | Key-value pairs to replace matching paths with new values. It doesn't replace URLs in the startUrls, sitemaps, pathsToMatch, or other settings. The crawl continues from the transformed URLs. |
config.actions[].pathsToMatch[] | Array | No | URLs to which this action should apply. Uses micromatch for negation, wildcards, and more; see the pattern sketch after this table. |
config.actions[].selectorsToMatch[] | Array | No | DOM selectors for nodes that must be present on the page to be processed. If the page doesn't match any of the selectors, it's ignored. |
config.ignoreNoIndex | Boolean | No | Whether to ignore the noindex robots meta tag. If true, pages with this tag are crawled. |
config.ignoreRobotsTxtRules | Boolean | No | Whether to ignore the rules defined in your robots.txt file. |
config.apiKey | String | No | Algolia API key for indexing the records. For more information, see Algolia's API keys documentation. |
config.login | Object | No | Authorization method and credentials for crawling protected content. |
config.maxDepth | Number | No | Maximum path depth of crawled URLs. For example, if maxDepth is 2, https://example.com/foo/bar is crawled, but https://example.com/foo/bar/baz isn't. |
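Example request

The following sketch shows one way to call this endpoint with a minimal body, using only fields documented in the table above. All values (application ID, URLs, index name, rate limit) are hypothetical placeholders, and authentication headers are omitted because this reference only documents Content-Type; consult Algolia's Crawler documentation for the credentials the API expects.

```typescript
// Minimal sketch: create a crawler via POST /1/crawlers.
// All values are placeholders; required authentication headers are omitted.
async function createCrawler(): Promise<void> {
  const response = await fetch("https://crawler.algolia.com/api/1/crawlers", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      name: "docs-crawler", // required: name of the crawler
      config: {
        appId: "YOUR_APP_ID", // required: Algolia application ID
        rateLimit: 8, // required: concurrent tasks per second
        startUrls: ["https://example.com/docs/"],
        actions: [
          {
            indexName: "docs", // required for each action
            pathsToMatch: ["https://example.com/docs/**"],
            recordExtractor: {
              __type: "function",
              source:
                "({ url, $ }) => [{ objectID: url.href, title: $('title').text() }]",
            },
          },
        ],
      },
    }),
  });
  console.log(response.status, await response.json());
}
```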
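Because the request body is JSON, function-valued fields such as linkExtractor and recordExtractor are sent as objects with __type set to "function" and the function body as a string in source. The helper below (serializeFunction, a hypothetical name, not part of any Algolia SDK) is one way to build that shape from a plain function; the ({ url, $ }) argument shape is an assumption based on the recordExtractor description above, with $ standing for the parsed page.

```typescript
// Sketch: serialize a plain function into the { __type, source } shape
// used by linkExtractor and recordExtractor. Hypothetical helper.
type SerializedFunction = { __type: "function"; source: string };

function serializeFunction(fn: (...args: any[]) => unknown): SerializedFunction {
  return { __type: "function", source: fn.toString() };
}

// Example extractor returning one record per crawled page.
const recordExtractor = serializeFunction(({ url, $ }: { url: URL; $: any }) => [
  {
    objectID: url.href,
    title: $("title").text(),
  },
]);

console.log(JSON.stringify(recordExtractor, null, 2));
```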
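Several fields (pathsToMatch, discoveryPatterns, renderJavaScript, ignoreQueryParams) accept micromatch patterns. The sketch below uses the open-source micromatch package directly to show how wildcards and negations behave; it illustrates the pattern syntax only and isn't how the crawler evaluates patterns internally.

```typescript
import micromatch from "micromatch";

const urls = [
  "https://example.com/docs/intro",
  "https://example.com/docs/api/reference",
  "https://example.com/app/dashboard",
];

// "**" matches any number of path segments; "!" negates a pattern.
const patterns = ["https://example.com/docs/**", "!**/api/**"];

console.log(micromatch(urls, patterns));
// -> ["https://example.com/docs/intro"]
```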
How to start integrating
- Add an HTTP Task to your workflow definition.
- Search for the API you want to integrate with and click its name.
- This loads the API reference documentation and prepares the HTTP request settings.
- Click Test request to run your request against the API and see its response.