Scrapy Seed Loaders¶
Frontera has some built-in Scrapy middlewares for seed loading.
Seed loaders use the process_start_requests method to generate requests from a source. These requests are later added to the FrontierManager.
Activating a Seed loader¶
Add the seed loader middleware to the SPIDER_MIDDLEWARES Scrapy setting:
SPIDER_MIDDLEWARES.update({
'frontera.contrib.scrapy.middlewares.seeds.FileSeedLoader': 650
})
FileSeedLoader¶
Load seed URLs from a file. The file must contain one URL per line:
http://www.asite.com
http://www.anothersite.com
...
You can disable a URL by prefixing its line with the # character:
...
#http://www.acommentedsite.com
...
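The file format described above can be read with a few lines of Python. This is a minimal sketch, not Frontera's own code: the function name parse_seed_lines is hypothetical, and skipping blank lines is an assumption that the text above does not state.

```python
def parse_seed_lines(lines):
    """Yield seed URLs from an iterable of text lines.

    Lines starting with '#' are treated as disabled URLs.
    Skipping blank lines is an assumption of this sketch.
    """
    for line in lines:
        url = line.strip()
        if not url or url.startswith('#'):
            continue
        yield url


# Example input mirroring the file format shown above.
sample = [
    "http://www.asite.com\n",
    "#http://www.acommentedsite.com\n",
    "\n",
    "http://www.anothersite.com\n",
]
seeds = list(parse_seed_lines(sample))
print(seeds)
```

Because the function accepts any iterable of lines, it works equally well with an open file object, for example parse_seed_lines(open(path)).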
Settings:
SEEDS_SOURCE: Path to the seeds file.
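Putting the activation step and the SEEDS_SOURCE setting together, a project's settings module might look like the sketch below. The file name seeds.txt is a placeholder, and the empty SPIDER_MIDDLEWARES dictionary is only there so the fragment runs on its own. In a real project the dictionary would normally already exist.

```python
# settings.py (sketch; 'seeds.txt' is a placeholder path)

# Defined here only so the fragment is self-contained.
SPIDER_MIDDLEWARES = {}

# Enable the file-based seed loader, using the path given in this document.
SPIDER_MIDDLEWARES.update({
    'frontera.contrib.scrapy.middlewares.seeds.FileSeedLoader': 650,
})

# Path to the file containing one seed URL per line.
SEEDS_SOURCE = 'seeds.txt'
```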
S3SeedLoader¶
Load seeds from a file stored in an Amazon S3 bucket. The file format is the same as the one used by FileSeedLoader.
Settings:
SEEDS_SOURCE: Path to the S3 bucket file, e.g. s3://some-project/seed-urls/
SEEDS_AWS_ACCESS_KEY: S3 credentials access key
SEEDS_AWS_SECRET_ACCESS_KEY: S3 credentials secret access key
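To illustrate how an s3:// value for SEEDS_SOURCE breaks down, the sketch below splits it into a bucket name and a key prefix using only the standard library. The helper name split_s3_source is hypothetical and this is not necessarily how S3SeedLoader parses the setting. No connection to S3 is made.

```python
from urllib.parse import urlparse


def split_s3_source(source):
    """Split an s3:// URL into (bucket_name, key_prefix).

    Hypothetical helper for illustration only.
    """
    parsed = urlparse(source)
    if parsed.scheme != 's3' or not parsed.netloc:
        raise ValueError("expected a URL of the form s3://bucket/path, got %r" % source)
    # The bucket is the network location; the rest is the key prefix.
    return parsed.netloc, parsed.path.lstrip('/')


bucket, prefix = split_s3_source('s3://some-project/seed-urls/')
print(bucket, prefix)
```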