Middlewares¶
Frontier Middleware sits between FrontierManager and Backend objects, using hooks for Request and Response processing according to frontier data flow.
It’s a light, low-level system for filtering and altering Frontier’s requests and responses.
Activating a middleware¶
To activate a Middleware component, add it to the MIDDLEWARES setting, which is a list whose values can be class paths or instances of Middleware objects.
Here’s an example:
MIDDLEWARES = [
    'frontera.contrib.middlewares.domain.DomainMiddleware',
]
Middlewares are called in the order in which they are defined in the list, so place your middleware at the position where you want it to run. The order matters because each middleware performs a different action, and your middleware could depend on some previous (or subsequent) middleware being applied.
Finally, keep in mind that some middlewares may need to be enabled through a particular setting. See each middleware’s documentation for more info.
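The ordering rule can be illustrated with a simplified model of the call chain. This is only a sketch, not Frontera's actual code: `run_chain`, `TagMiddleware` and `DropAllMiddleware` are hypothetical names, and plain dicts stand in for Request objects. It also shows the stop-on-None behaviour that the method descriptions on this page specify.

```python
def run_chain(middlewares, method_name, obj, *args):
    """Simplified model: pass obj through each middleware in list order."""
    for middleware in middlewares:
        obj = getattr(middleware, method_name)(obj, *args)
        if obj is None:
            # A middleware filtered the object, so later ones are skipped.
            return None
    return obj


class TagMiddleware:
    """Hypothetical middleware that records its name on each seed dict."""

    def __init__(self, name):
        self.name = name

    def add_seeds(self, seeds):
        for seed in seeds:
            seed.setdefault('seen_by', []).append(self.name)
        return seeds


class DropAllMiddleware:
    """Hypothetical middleware that filters everything by returning None."""

    def add_seeds(self, seeds):
        return None


# Seeds are plain dicts here (an assumption made to keep the sketch short).
seeds = run_chain(
    [TagMiddleware('first'), TagMiddleware('second')],
    'add_seeds',
    [{'url': 'http://example.com/'}],
)
print(seeds[0]['seen_by'])  # ['first', 'second']

dropped = run_chain(
    [DropAllMiddleware(), TagMiddleware('never-called')],
    'add_seeds',
    [{'url': 'http://example.com/'}],
)
print(dropped)  # None
```

Swapping the two `TagMiddleware` entries in the list would reverse the recorded order, which is why position in MIDDLEWARES matters.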
Writing your own middleware¶
Writing your own frontier middleware is easy. Each Middleware component is a single Python class inherited from Component.
FrontierManager will communicate with all active middlewares through the methods described below.
-
class frontera.core.components.Middleware¶
Interface definition for a Frontier Middleware.
Methods
-
frontier_start()¶ Called when the frontier starts, see starting/stopping the frontier.
-
frontier_stop()¶ Called when the frontier stops, see starting/stopping the frontier.
-
add_seeds(seeds)¶ This method is called when new seeds are added to the frontier.
Parameters: seeds (list) – A list of Request objects.
Returns: Request object list or None
Should either return None or a list of Request objects.
If it returns None, FrontierManager won’t continue processing any other middleware and the seeds will never reach the Backend.
If it returns a list of Request objects, the list will be passed to the next middleware. This process repeats for all active middlewares until the result is finally passed to the Backend.
If you want to filter any seed, just don’t include it in the returned object list.
-
page_crawled(response, links)¶ This method is called each time a page has been crawled.
Returns: Response or None
Should either return None or a Response object.
If it returns None, FrontierManager won’t continue processing any other middleware and the Backend will never be notified.
If it returns a Response object, the object will be passed to the next middleware. This process repeats for all active middlewares until the result is finally passed to the Backend.
If you want to filter a page, just return None.
-
request_error(page, error)¶ This method is called each time an error occurs when crawling a page.
Parameters:
- request (object) – The Request object that was crawled with an error.
- error (string) – A string identifier for the error.
Returns: Request or None
Should either return None or a Request object.
If it returns None, FrontierManager won’t continue processing any other middleware and the Backend will never be notified.
If it returns a Request object, the object will be passed to the next middleware. This process repeats for all active middlewares until the result is finally passed to the Backend.
If you want to filter a page error, just return None.
Class Methods
-
from_manager(manager)¶ Class method called from FrontierManager, passing the manager itself.
Example of usage:
@classmethod
def from_manager(cls, manager):
    return cls(settings=manager.settings)
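Putting the interface above together, a custom middleware might look like the following sketch. Several things here are assumptions made for illustration: `AllowedHostsMiddleware` and the `ALLOWED_HOSTS` setting are invented names, `Request` and `FakeManager` are tiny stand-ins so the example runs without Frontera installed, and the stand-in settings object is a plain dict. In a real project the class would inherit from `frontera.core.components.Middleware` and receive real Request objects.

```python
from urllib.parse import urlparse


class Request:
    """Stand-in for Frontera's Request (only url and meta are modelled)."""

    def __init__(self, url, meta=None):
        self.url = url
        self.meta = meta if meta is not None else {}


class FakeManager:
    """Stand-in for FrontierManager; settings is a plain dict in this sketch."""

    settings = {'ALLOWED_HOSTS': ['example.com']}


class AllowedHostsMiddleware:
    """Hypothetical middleware that drops seeds outside a set of allowed hosts."""

    def __init__(self, allowed_hosts):
        self.allowed_hosts = set(allowed_hosts)

    @classmethod
    def from_manager(cls, manager):
        # ALLOWED_HOSTS is a made-up setting name used only for this sketch.
        return cls(allowed_hosts=manager.settings.get('ALLOWED_HOSTS', []))

    def frontier_start(self):
        pass

    def frontier_stop(self):
        pass

    def add_seeds(self, seeds):
        # Leaving a seed out of the returned list filters it.
        return [s for s in seeds
                if urlparse(s.url).hostname in self.allowed_hosts]

    def page_crawled(self, response, links):
        # Pass the response through unchanged to the next middleware.
        return response

    def request_error(self, request, error):
        # Pass the failed request through unchanged.
        return request


middleware = AllowedHostsMiddleware.from_manager(FakeManager())
kept = middleware.add_seeds([
    Request('http://example.com/a'),
    Request('http://other.org/b'),
])
print([r.url for r in kept])  # ['http://example.com/a']
```

Returning the input unchanged from `page_crawled` and `request_error` keeps the middleware transparent for events it does not care about.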
Built-in middleware reference¶
This page describes all Middleware components that come with Frontera.
For information on how to use them and how to write your own middleware, see the middleware usage guide.
For a list of the components enabled by default (and their orders) see the MIDDLEWARES setting.
DomainMiddleware¶
-
class frontera.contrib.middlewares.domain.DomainMiddleware¶
This Middleware will add a domain info field to every Request.meta and Response.meta if it is activated.
The domain object will contain the following fields:
- netloc: URL netloc according to RFC 1808 syntax specifications
- name: Domain name
- scheme: URL scheme
- tld: Top level domain
- sld: Second level domain
- subdomain: URL subdomain(s)
An example for a Request object:
>>> request.url
'http://www.scrapinghub.com:8080/this/is/an/url'
>>> request.meta['domain']
{
    "name": "scrapinghub.com",
    "netloc": "www.scrapinghub.com",
    "scheme": "http",
    "sld": "scrapinghub",
    "subdomain": "www",
    "tld": "com"
}
If TEST_MODE is active, it will accept testing URLs, parsing letter domains:
>>> request.url
'A1'
>>> request.meta['domain']
{
    "name": "A",
    "netloc": "A",
    "scheme": "-",
    "sld": "-",
    "subdomain": "-",
    "tld": "-"
}
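To make the field names concrete, here is a rough approximation of how such a dictionary could be derived from a URL. It is not DomainMiddleware's implementation: `naive_domain_info` is a hypothetical helper that uses a last-two-labels heuristic, which gives wrong results for multi-part suffixes such as `co.uk`. A proper split would need a public suffix list. The port is omitted from `netloc` only to mirror the example output above.

```python
from urllib.parse import urlparse


def naive_domain_info(url):
    """Approximate the documented domain fields with a naive label split."""
    parts = urlparse(url)
    host = parts.hostname or ''
    labels = host.split('.')
    has_suffix = len(labels) > 1
    return {
        'netloc': host,                      # host without the port
        'name': '.'.join(labels[-2:]),       # e.g. scrapinghub.com
        'scheme': parts.scheme,
        'tld': labels[-1] if has_suffix else '',
        'sld': labels[-2] if has_suffix else host,
        'subdomain': '.'.join(labels[:-2]),  # everything left of the name
    }


info = naive_domain_info('http://www.scrapinghub.com:8080/this/is/an/url')
print(info['name'], info['subdomain'])  # scrapinghub.com www
```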
UrlFingerprintMiddleware¶
-
class frontera.contrib.middlewares.fingerprint.UrlFingerprintMiddleware¶
This Middleware will add a fingerprint field to every Request.meta and Response.meta if it is activated.
The fingerprint will be calculated from the object URL, using the function defined in the URL_FINGERPRINT_FUNCTION setting. You can write your own fingerprint calculation function and use it by changing this setting.
An example for a Request object:
>>> request.url
'http://www.scrapinghub.com:8080'
>>> request.meta['fingerprint']
'60d846bc2969e9706829d5f1690f11dafb70ed18'
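A custom fingerprint function could look like the sketch below. It assumes the function receives the URL as a string and returns a hex digest string, and that the setting takes a dotted path to the function. Neither detail is confirmed on this page. The 40-character example above is consistent with a SHA-1 hex digest, so the sketch uses SHA-1 too. `sha1_without_fragment` and the `myproject.fingerprints` path are invented for illustration.

```python
import hashlib
from urllib.parse import urldefrag


def sha1_without_fragment(url):
    """Hypothetical fingerprint: SHA-1 hex digest of the URL minus #fragment."""
    defragmented = urldefrag(url).url
    return hashlib.sha1(defragmented.encode('utf-8')).hexdigest()


# In the frontier settings module (the dotted path is a made-up example):
URL_FINGERPRINT_FUNCTION = 'myproject.fingerprints.sha1_without_fragment'

same = (sha1_without_fragment('http://example.com/page#top')
        == sha1_without_fragment('http://example.com/page'))
print(same)  # True
```

Ignoring the fragment means URLs that differ only after `#` share one fingerprint, a design choice that suits crawls where fragments do not change the fetched page.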
DomainFingerprintMiddleware¶
-
class frontera.contrib.middlewares.fingerprint.DomainFingerprintMiddleware¶
This Middleware will add a fingerprint field to the domain field of every Request.meta and Response.meta if it is activated.
The fingerprint will be calculated from the object URL, using the function defined in the DOMAIN_FINGERPRINT_FUNCTION setting. You can write your own fingerprint calculation function and use it by changing this setting.
An example for a Request object:
>>> request.url
'http://www.scrapinghub.com:8080'
>>> request.meta['domain']
{
    "fingerprint": "5bab61eb53176449e25c2c82f172b82cb13ffb9d",
    "name": "scrapinghub.com",
    "netloc": "www.scrapinghub.com",
    "scheme": "http",
    "sld": "scrapinghub",
    "subdomain": "www",
    "tld": "com"
}