## Fetch options
URL content is retrieved via a Fetcher. A number of options control how the content is retrieved. These options are specified by the `fetch_options` array directive in the configuration:
| Option | Type | Default | Explanation |
|---|---|---|---|
| `concurrency` | int | 10 | Maximum number of concurrent requests used to fetch content. |
| `delay` | int | 100 | Delay between requests in milliseconds. |
| `cache_enabled` | boolean | true | If enabled, URL content is cached on disk for subsequent processing. |
| `cache_dir` | string | /tmp/merlin_cache | Directory in which to store the cache. If the path does not exist it will be created. |
| `fetcher_class` | string | `FetcherSpatieCrawler` (full class path `\Migrate\Fetcher\Fetchers\SpatieCrawler\FetcherSpatieCrawler`) | The fully namespaced class name of the Fetcher class used to retrieve content. In most circumstances this can be left alone. |
| `execute_js` | boolean | false | Executes JavaScript on the page after fetching. You need to ensure the necessary Node.js dependencies are installed. Note: JS execution is currently only available when using the default `FetcherSpatieCrawler`. |
| `follow_redirects` | boolean | true | If enabled, redirects (e.g. 302) will be followed. |
| `ignore_ssl_errors` | boolean | false | If enabled, SSL errors will be ignored. |
| `user_agent` | string | Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36 | Sets the User-Agent to identify as in requests. |
| `timeouts` | array | `connect_timeout: 15`, `timeout: 15`, `read_timeout: 30` | Various fetching timeouts. Note that if you use `execute_js` these timeouts will need to be at least doubled to allow time to run the JavaScript. |
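For instance, a minimal sketch of the timeout adjustments suggested above when `execute_js` is enabled. The values here are simply the defaults doubled; tune them to your site:

```yaml
fetch_options:
  execute_js: true
  # Allow extra time for the JavaScript to run; these are the defaults doubled.
  timeouts:
    connect_timeout: 30
    timeout: 30
    read_timeout: 60
```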
The default Fetcher class should cover most usage requirements; however, you can specify a custom class name if you need to do something different. Check `\Migrate\Fetcher\Fetchers\*` for examples of how to implement one.
### Example `fetch_options` configuration

```yaml
---
domain: http://www.example.com

urls:
  - /some/path
  - /some/path?with=a&query=123
  - /some/path?with=a&query=123#and-a-fragment

fetch_options:
  # Default 10
  concurrency: 10
  # Delay between requests, default 100 milliseconds
  delay: 100
  # Cache content (and use previously cached content), default true
  cache_enabled: true
  # Cache storage root dir (path created if it doesn't exist), default /tmp/merlin_cache
  cache_dir: '/tmp/merlin_cache'
  # Fetcher class, default FetcherSpatieCrawler
  # fetcher_class: '\Migrate\Fetcher\Fetchers\SpatieCrawler\FetcherSpatieCrawler'
  fetcher_class: '\Migrate\Fetcher\Fetchers\Curl\FetcherCurl'
  # fetcher_class: '\Migrate\Fetcher\Fetchers\RollingCurl\FetcherRollingCurl'
  # Execute on-load JS, default false.
  # Currently only available if using the FetcherSpatieCrawler fetcher class
  execute_js: false
  # Whether to follow redirects
  follow_redirects: true
  # Ignore SSL errors
  ignore_ssl_errors: true
  # Timeouts. When using execute_js, you want reasonably long timeouts.
  # Not all timeouts are applicable to all Fetchers.
  timeouts:
    connect_timeout: 15
    timeout: 60
    # FetcherSpatieCrawler only
    read_timeout: 30
```
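Since every option has a default, you would typically only set the values you want to change. A minimal sketch, assuming omitted options fall back to the defaults listed above:

```yaml
fetch_options:
  # Slow the crawl down for a rate-limited site; everything else keeps its default.
  concurrency: 2
  delay: 500
```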
## URL Options
There are a number of options that can apply to the URL list. These options are specified by the `url_options` array directive in the configuration:
| Option | Explanation |
|---|---|
| `include_query` | Includes the query part of the URL in the request. If set to false, the query string is stripped from the URL before fetching. |
| `include_fragment` | Includes the fragment part of the URL in the request. If set to false, the fragment is stripped from the URL before fetching. |
| `find_content_duplicates` | Checks for content duplicates. This creates a file called `url-content-duplicates.json` that contains a list of URLs that appear to resolve to the same content. This avoids content duplication in the target system and provides a way to easily generate aliases. |
| `hash_selector` | An XPath selector used to generate the hash of content for duplicate detection. By default, `sha1` is used as the hash algorithm and the `<body>` tag of the page is the determining content. |
| `hash_exclude_nodes` | An array of XPath selectors to exclude when generating the duplicate-detection hash. This could include elements that may appear on the page that are metadata/cache busters or contain timestamps etc. that can be safely excluded from the hash. By default, all `<script>`, `<!-- comment -->`, `<style>`, `<input>` and `<head>` tags are ignored. |
| `urls` | An associative array of URLs and their corresponding `include_query` and `include_fragment` settings (as above) to override the global setting, if required. |
### Example `url_options` configuration

```yaml
---
domain: http://www.example.com

urls:
  - /some/path
  - /some/path?with=a&query=123
  - /some/path?with=a&query=123#and-a-fragment

url_options:
  # Default false
  include_query: true
  # Default false
  include_fragment: true
  # Default true
  find_content_duplicates: true
  # Default '//body'
  hash_selector: '//body'
  # Default script, comment, style, input, head
  hash_exclude_nodes:
    - '//script'
    - '//comment()'
    - '//style'
    - '//input'
    - '//head'
```
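If your pages share boilerplate such as headers, footers and navigation, narrowing the hash to the main content region can make duplicate detection more reliable. A sketch, assuming a hypothetical `<div id="content">` wraps the unique content on your site:

```yaml
url_options:
  find_content_duplicates: true
  # Hash only the main content region. The id here is hypothetical;
  # use a selector that matches your site's markup.
  hash_selector: '//div[@id="content"]'
```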
### Example overriding options for specific URLs
If some URLs need a different query or fragment inclusion setting from the global setting, their behaviour can be specified independently:
```yaml
---
domain: http://www.example.com

urls:
  - /some/path
  - /some/path?with=a&query=123
  - /some/path?with=a&query=123#and-a-fragment

url_options:
  include_query: false
  include_fragment: false
  urls:
    -
      url: /some/path?with=a&query=123#and-a-fragment
      include_query: true
      include_fragment: true
```
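With this configuration, the second URL would be fetched as `/some/path` (its query stripped by the global settings), while the third keeps both its query and fragment because of its per-URL override.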
## URLs in a separate file
You can also provide a list of URLs in a separate file. Your configuration can provide both `urls` and `urls_file` properties, or just one. Supply the `urls_file` as a path relative to the config file.
```yaml
---
domain: http://www.example.com

urls:
  - /some/path
  - /some/path/subpath

urls_file: list_of_urls.yml
```
### Example of a separate URLs file
Provide the URLs in a separate file with a single `urls` property containing the list. Example content of a `urls_file`:
```yaml
---
urls:
  - /some/path
  - /some/other/path
```
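When both `urls` and `urls_file` are supplied, the two lists are presumably combined into a single crawl list, so the configuration above would fetch the inline paths as well as everything listed in `list_of_urls.yml`.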