Middlewares¶
Frontier Middleware sits between
FrontierManager and
Backend objects, using hooks for
Request
and Response processing according to
frontier data flow.
It’s a light, low-level system for filtering and altering Frontier’s requests and responses.
Activating a middleware¶
To activate a Middleware component, add it to the
MIDDLEWARES setting, which is a list whose values can be class paths or instances of
Middleware objects.
Here’s an example:
MIDDLEWARES = [
'frontera.contrib.middlewares.domain.DomainMiddleware',
]
Middlewares are called in the order in which they are defined in the list. To decide where to place your middleware, pick its position according to where you want it to run. The order matters because each middleware performs a different action, and your middleware may depend on some previous (or subsequent) middleware having been applied.
Finally, keep in mind that some middlewares may need to be enabled through a particular setting. See each middleware's documentation for more information.
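As a sketch of why the order can matter, the settings fragment below lists DomainMiddleware before DomainFingerprintMiddleware, on the assumption that the latter works on the domain field that the former adds to meta. The class paths are the ones given in the built-in middleware reference; whether a given project needs all three entries is an assumption of this example.

```python
# settings.py (illustrative sketch; adjust to your project)
MIDDLEWARES = [
    # Listed first so that meta['domain'] exists for later middlewares.
    'frontera.contrib.middlewares.domain.DomainMiddleware',
    # Adds meta['fingerprint'], computed from the URL.
    'frontera.contrib.middlewares.fingerprint.UrlFingerprintMiddleware',
    # Listed after DomainMiddleware: adds a fingerprint to meta['domain'].
    'frontera.contrib.middlewares.fingerprint.DomainFingerprintMiddleware',
]
```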
Writing your own middleware¶
Writing your own frontier middleware is easy. Each Middleware
component is a single Python class inherited from Component.
FrontierManager will communicate with all active middlewares
through the methods described below.
-
class
frontera.core.components.Middleware¶ Interface definition for a Frontier Middleware
Methods
-
frontier_start()¶ Called when the frontier starts, see starting/stopping the frontier.
-
frontier_stop()¶ Called when the frontier stops, see starting/stopping the frontier.
-
page_crawled(response)¶ This method is called every time a page has been crawled.
Parameters: response (object) – The Response object for the crawled page.
Returns: Response or None
Should either return None or a Response object.
If it returns None, FrontierManager won’t continue processing any other middleware and Backend will never be notified.
If it returns a Response object, this will be passed to the next middleware. This process will repeat for all active middlewares until the result is finally passed to the Backend.
If you want to filter a page, just return None.
-
request_error(page, error)¶ This method is called each time an error occurs when crawling a page.
Parameters:
- request (object) – The Request object that was crawled with an error (named page in the signature above).
- error (string) – A string identifier for the error.
Returns: Request or None
Should either return None or a Request object.
If it returns None, FrontierManager won’t continue processing any other middleware and Backend will never be notified.
If it returns a Request object, this will be passed to the next middleware. This process will repeat for all active middlewares until the result is finally passed to the Backend.
If you want to filter a page error, just return None.
Class Methods
-
classmethod
from_manager(manager)¶ Class method called from FrontierManager, passing the manager itself.
Example of usage:
@classmethod
def from_manager(cls, manager):
    # Build the middleware from the manager's settings.
    return cls(settings=manager.settings)
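The interface above can be illustrated with a minimal sketch. To keep it runnable without Frontera installed, it uses stand-ins: FakeResponse (with an assumed status_code attribute), ErrorPageFilter, TagMiddleware and run_page_crawled are all hypothetical names, and the loop only mimics the chain behaviour described above rather than reproducing FrontierManager. In a real project the middleware class would inherit from frontera.core.components.Middleware.

```python
class FakeResponse:
    """Stand-in for a Response object; only what this sketch needs."""

    def __init__(self, url, status_code=200):
        self.url = url
        self.status_code = status_code  # assumed attribute for illustration
        self.meta = {}


class ErrorPageFilter:
    """Hypothetical middleware that filters out 4xx/5xx responses."""

    def __init__(self, settings=None):
        self.settings = settings

    @classmethod
    def from_manager(cls, manager):
        return cls(settings=manager.settings)

    def frontier_start(self):
        pass  # nothing to set up in this sketch

    def frontier_stop(self):
        pass  # nothing to tear down in this sketch

    def page_crawled(self, response):
        if response.status_code >= 400:
            return None  # filtered: nothing after this middleware sees it
        return response

    def request_error(self, page, error):
        return page  # pass the failed request on unchanged


class TagMiddleware:
    """Hypothetical second middleware that marks every response it sees."""

    def page_crawled(self, response):
        response.meta['seen'] = True
        return response


def run_page_crawled(middlewares, response):
    """Mimic the described chain: stop as soon as a middleware returns None."""
    for middleware in middlewares:
        response = middleware.page_crawled(response)
        if response is None:
            return None  # the backend would never be notified
    return response  # this is what would reach the backend
```

With the chain [ErrorPageFilter(), TagMiddleware()], a response with status 200 comes out with meta['seen'] set, while a response with status 404 is dropped before TagMiddleware runs.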
Built-in middleware reference¶
This page describes all Middleware components that come with Frontera.
For information on how to use them and how to write your own middleware, see the
middleware usage guide.
For a list of the components enabled by default (and their orders) see the MIDDLEWARES setting.
DomainMiddleware¶
-
class
frontera.contrib.middlewares.domain.DomainMiddleware¶ This Middleware will add a domain info field to every Request.meta and Response.meta if it is activated.
The domain object will contain the following fields, with both keys and values as bytes:
- netloc: URL netloc according to RFC 1808 syntax specifications
- name: Domain name
- scheme: URL scheme
- tld: Top level domain
- sld: Second level domain
- subdomain: URL subdomain(s)
An example for a Request object:
>>> request.url
'http://www.scrapinghub.com:8080/this/is/an/url'
>>> request.meta['domain']
{
    "name": "scrapinghub.com",
    "netloc": "www.scrapinghub.com",
    "scheme": "http",
    "sld": "scrapinghub",
    "subdomain": "www",
    "tld": "com"
}
If TEST_MODE is active, it will accept testing URLs, parsing letter domains:
>>> request.url
'A1'
>>> request.meta['domain']
{
    "name": "A",
    "netloc": "A",
    "scheme": "-",
    "sld": "-",
    "subdomain": "-",
    "tld": "-"
}
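To make the fields concrete, here is a deliberately naive sketch of how such a dictionary could be derived from a URL with the standard library. It is not DomainMiddleware's implementation: naive_domain_info is a hypothetical helper, it treats the last label of the host as the TLD (so multi-part suffixes such as co.uk are split incorrectly), and it returns str values rather than bytes.

```python
from urllib.parse import urlsplit


def naive_domain_info(url):
    """Split a URL into domain fields by naively splitting the host on dots."""
    parts = urlsplit(url)
    host = parts.hostname or ''
    labels = host.split('.')
    has_tld = len(labels) >= 2
    tld = labels[-1] if has_tld else ''
    sld = labels[-2] if has_tld else host
    return {
        'netloc': parts.netloc,
        'name': sld + '.' + tld if tld else sld,
        'scheme': parts.scheme,
        'tld': tld,
        'sld': sld,
        # Everything left of the second-level domain, e.g. 'www'.
        'subdomain': '.'.join(labels[:-2]),
    }


info = naive_domain_info('http://www.scrapinghub.com/this/is/an/url')
```

For the URL above, info holds name 'scrapinghub.com', netloc 'www.scrapinghub.com', scheme 'http', sld 'scrapinghub', subdomain 'www' and tld 'com'.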
UrlFingerprintMiddleware¶
-
class
frontera.contrib.middlewares.fingerprint.UrlFingerprintMiddleware¶ This Middleware will add a fingerprint field to every Request.meta and Response.meta if it is activated.
The fingerprint will be calculated from the object's URL, using the function defined in the URL_FINGERPRINT_FUNCTION setting. You can write your own fingerprint calculation function and use it by changing this setting. The fingerprint must be bytes.
An example for a Request object:
>>> request.url
'http://www.scrapinghub.com:8080'
>>> request.meta['fingerprint']
'60d846bc2969e9706829d5f1690f11dafb70ed18'
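A custom fingerprint function might look like the sketch below, which hashes the URL with its fragment removed so that http://example.com/a and http://example.com/a#top share a fingerprint. The function name fragmentless_sha1 and the module path myproject.fingerprints are hypothetical, and the sketch assumes the setting accepts a dotted path to a function that takes the URL and returns bytes.

```python
import hashlib
from urllib.parse import urldefrag


def fragmentless_sha1(url):
    """Return the hex SHA-1 of the URL without its fragment, as bytes."""
    if isinstance(url, bytes):
        url = url.decode('utf-8')
    url_without_fragment, _fragment = urldefrag(url)
    digest = hashlib.sha1(url_without_fragment.encode('utf-8')).hexdigest()
    return digest.encode('ascii')  # the fingerprint must be bytes


# In the settings module (hypothetical module path, see the assumption above):
URL_FINGERPRINT_FUNCTION = 'myproject.fingerprints.fragmentless_sha1'
```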
-
frontera.utils.fingerprint.hostname_local_fingerprint(key)¶ This function is used for URL fingerprinting, which serves to uniquely identify the document in storage.
hostname_local_fingerprint constructs the fingerprint by taking the first 4 bytes as a CRC32 of the host and the remaining bytes as an MD5 of the rest of the URL. The default option is set to make use of the HBase block cache. All the documents of an average website are expected to fit within one cache block, which can be efficiently read from disk once.
Parameters: key – str URL
Returns: str, 20-byte hex string
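The layout described above (4 bytes of host CRC32 followed by 16 bytes of MD5, 20 bytes in total, hex-encoded) can be sketched with the standard library. This is an illustration of the scheme, not the library's code: the function name is hypothetical, and the exact way the rest of the URL is combined before hashing is an assumption.

```python
import hashlib
import zlib
from binascii import hexlify
from struct import pack
from urllib.parse import urlsplit


def hostname_local_fingerprint_sketch(url):
    """Return a 40-character hex string: CRC32(host) + MD5(rest of URL)."""
    parts = urlsplit(url)
    host = parts.hostname or ''
    # 4 bytes: unsigned CRC32 of the host name.
    host_crc = zlib.crc32(host.encode('utf-8')) & 0xFFFFFFFF
    # 16 bytes: MD5 of path, query and fragment (assumed combination).
    rest = parts.path + '?' + parts.query + '#' + parts.fragment
    rest_md5 = hashlib.md5(rest.encode('utf-8')).digest()
    # Big-endian unsigned int followed by the 16-byte digest.
    return hexlify(pack('>I16s', host_crc, rest_md5)).decode('ascii')
```

Because the host prefix comes first, fingerprints of pages from the same host share their first 8 hex characters and therefore sort next to each other in a key-ordered store.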
DomainFingerprintMiddleware¶
-
class
frontera.contrib.middlewares.fingerprint.DomainFingerprintMiddleware¶ This Middleware will add a fingerprint field to the domain field of every Request.meta and Response.meta if it is activated.
The fingerprint will be calculated from the object's URL, using the function defined in the DOMAIN_FINGERPRINT_FUNCTION setting. You can write your own fingerprint calculation function and use it by changing this setting. The fingerprint must be bytes.
An example for a Request object:
>>> request.url
'http://www.scrapinghub.com:8080'
>>> request.meta['domain']
{
    "fingerprint": "5bab61eb53176449e25c2c82f172b82cb13ffb9d",
    "name": "scrapinghub.com",
    "netloc": "www.scrapinghub.com",
    "scheme": "http",
    "sld": "scrapinghub",
    "subdomain": "www",
    "tld": "com"
}