Backends
The frontier Backend is where the crawling logic and policies live. It is responsible for receiving all the crawl info and selecting the next pages to be crawled. The FrontierManager calls it after the Middleware, using hooks for Request and Response processing according to the frontier data flow.
Unlike Middleware, which can have many different instances activated, only one Backend can be used per frontier.
Depending on the logic they implement, some backends require persistent storage to manage Request and Response object information.
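The division of labour described above (many middlewares, then exactly one backend) can be pictured with a small toy model. This is only a sketch under stated assumptions: `ToyManager`, `RecordingMiddleware` and `ListBackend` are hypothetical names invented for illustration, seeds are plain dictionaries, and none of this is Frontera's actual implementation.

```python
class RecordingMiddleware:
    """Hypothetical middleware: tags each seed with its name and passes it on."""

    def __init__(self, name):
        self.name = name

    def add_seeds(self, seeds):
        for seed in seeds:
            seed.setdefault("seen_by", []).append(self.name)
        return seeds


class ListBackend:
    """Hypothetical backend: stores whatever reaches it."""

    def __init__(self):
        self.stored = []

    def add_seeds(self, seeds):
        self.stored.extend(seeds)


class ToyManager:
    """Stand-in for the manager: many middlewares, exactly one backend."""

    def __init__(self, middlewares, backend):
        self.middlewares = list(middlewares)
        self.backend = backend

    def add_seeds(self, seeds):
        # Middlewares run first, in the order they were activated...
        for middleware in self.middlewares:
            seeds = middleware.add_seeds(seeds)
        # ...and the single backend is called last.
        self.backend.add_seeds(seeds)


manager = ToyManager(
    middlewares=[RecordingMiddleware("first"), RecordingMiddleware("second")],
    backend=ListBackend(),
)
manager.add_seeds([{"url": "http://example.com/"}])
```

After the call, `manager.backend.stored` holds the seed as both middlewares left it, which is the point of the ordering: the backend only ever sees data that has already passed through the middleware chain.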
Activating a backend
To activate the frontier backend component, set it through the BACKEND setting.
Here’s an example:
BACKEND = 'frontera.contrib.backends.memory.FIFO'
Keep in mind that some backends may need to be enabled through a particular setting. See each backend's documentation for more info.
Writing your own backend
Writing your own frontier backend is easy. Each Backend component is a single Python class inherited from Component.
The FrontierManager communicates with the active Backend through the methods described below.
- class frontera.core.components.Backend
  Interface definition for a Frontier Backend.
Methods
- frontier_start()
  Called when the frontier starts; see starting/stopping the frontier.
  Returns: None.
- frontier_stop()
  Called when the frontier stops; see starting/stopping the frontier.
  Returns: None.
- add_seeds(seeds)
  This method is called when new seeds are added to the frontier.
  Parameters: seeds (list) – A list of Request objects.
  Returns: None.
- get_next_requests(max_n_requests, **kwargs)
  Returns a list of the next requests to be crawled.
  Parameters:
  - max_n_requests (int) – Maximum number of requests to be returned by this method.
  - kwargs (dict) – Parameters from the downloader component.
  Returns: list of Request objects.
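- page_crawled(response, links)
  This method is called each time a page has been crawled.
  Parameters:
  - response (object) – The Response object for the crawled page.
  - links (list) – A list of Request objects for the links extracted from the crawled page.
  Returns: None.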
- request_error(page, error)
  This method is called each time an error occurs when crawling a page.
  Parameters:
  - page (object) – The Request object that was crawled with an error.
  - error (string) – A string identifier for the error.
  Returns: None.
Class Methods
- from_manager(manager)
  Class method called from FrontierManager, passing the manager itself.
  Example of usage:
  @classmethod
  def from_manager(cls, manager):
      return cls(settings=manager.settings)
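As an illustration of the interface above, here is a minimal sketch of a backend that hands requests back in arrival order. It rests on several assumptions: the class is a plain Python class so that the sketch runs without Frontera installed (a real backend would inherit from frontera.core.components.Backend), `Request` is a hypothetical stand-in that only carries a URL, and the `queue`, `seen` and `errors` attributes are invented for this example.

```python
from collections import deque


class Request:
    """Hypothetical stand-in for a frontier Request; only a URL is kept."""

    def __init__(self, url):
        self.url = url


class SimpleFIFOBackend:
    """Toy backend that returns requests in the order they arrived."""

    def __init__(self, settings=None):
        self.settings = settings
        self.queue = deque()  # requests waiting to be crawled
        self.seen = set()     # URLs that have already been scheduled
        self.errors = {}      # url -> error identifier

    @classmethod
    def from_manager(cls, manager):
        return cls(settings=manager.settings)

    def frontier_start(self):
        pass  # nothing to open for an in-memory queue

    def frontier_stop(self):
        self.queue.clear()

    def _schedule(self, requests):
        # Queue each request once, skipping URLs seen before.
        for request in requests:
            if request.url not in self.seen:
                self.seen.add(request.url)
                self.queue.append(request)

    def add_seeds(self, seeds):
        self._schedule(seeds)

    def get_next_requests(self, max_n_requests, **kwargs):
        batch = []
        while self.queue and len(batch) < max_n_requests:
            batch.append(self.queue.popleft())
        return batch

    def page_crawled(self, response, links):
        self._schedule(links)

    def request_error(self, page, error):
        self.errors[page.url] = error


backend = SimpleFIFOBackend()
backend.frontier_start()
backend.add_seeds([Request("http://example.com/"), Request("http://example.com/a")])
first_batch = backend.get_next_requests(max_n_requests=1)
backend.page_crawled(
    response=None,
    links=[Request("http://example.com/b"), Request("http://example.com/a")],
)
second_batch = backend.get_next_requests(max_n_requests=10)
```

The duplicate link to `/a` is dropped by the `seen` set, so the second batch contains `/a` followed by `/b`. De-duplicating by URL is a design choice of this sketch, not a requirement of the interface.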
Built-in backend reference
This page describes all the backend components that come with Frontera. For information on how to use them and how to write your own backend, see the backend usage guide.
To find out which Backend is activated by default, check the BACKEND setting.
Basic algorithms
Some of the built-in Backend objects implement basic algorithms such as FIFO/LIFO or DFS/BFS for page visit ordering.
They differ only in the storage engine used. For instance, memory.FIFO and sqlalchemy.FIFO use the same logic but with different storage engines.
Memory backends
This set of Backend objects uses a heapq object as storage for the basic algorithms.
- class frontera.contrib.backends.memory.FIFO
- class frontera.contrib.backends.memory.LIFO
- class frontera.contrib.backends.memory.BFS
- class frontera.contrib.backends.memory.DFS
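To see how a single heapq-based store can produce all four visit orders, consider the sketch below. It is an illustration under assumptions rather than Frontera's actual code: `HeapQueue`, `ORDERINGS` and `crawl_order` are hypothetical names, pages are reduced to `(url, depth)` pairs, and BFS/DFS are taken to mean "shallowest first" and "deepest first".

```python
import heapq
import itertools


class HeapQueue:
    """Priority queue of URLs built on heapq.

    key_func maps (insertion counter, depth) to a sort key;
    the entry with the smallest key is popped first.
    """

    def __init__(self, key_func):
        self._heap = []
        self._counter = itertools.count()
        self._key_func = key_func

    def push(self, url, depth):
        count = next(self._counter)
        # The unique counter breaks ties, so URLs are never compared.
        heapq.heappush(self._heap, (self._key_func(count, depth), count, url))

    def pop(self):
        return heapq.heappop(self._heap)[-1]

    def __len__(self):
        return len(self._heap)


# One sort key per algorithm; only the key differs, the storage is shared.
ORDERINGS = {
    "FIFO": lambda count, depth: count,    # oldest first
    "LIFO": lambda count, depth: -count,   # newest first
    "BFS": lambda count, depth: depth,     # shallowest first
    "DFS": lambda count, depth: -depth,    # deepest first
}


def crawl_order(name, pages):
    """Push all (url, depth) pairs, then pop them in the chosen order."""
    queue = HeapQueue(ORDERINGS[name])
    for url, depth in pages:
        queue.push(url, depth)
    return [queue.pop() for _ in range(len(queue))]


pages = [("a", 0), ("b", 1), ("c", 2), ("d", 1)]
fifo_order = crawl_order("FIFO", pages)
bfs_order = crawl_order("BFS", pages)
```

Keeping the ordering in a key function is what lets the same logic sit on top of different storage engines: a database-backed variant could express the same keys as an ORDER BY clause.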
SQLAlchemy backends
This set of Backend objects uses SQLAlchemy as storage for the basic algorithms.
By default it uses an in-memory SQLite database as a storage engine, but any database supported by SQLAlchemy can be used.
Request and Response are represented by a declarative SQLAlchemy model:
class Page(Base):
    __tablename__ = 'pages'
    __table_args__ = (
        UniqueConstraint('url'),
    )

    class State:
        NOT_CRAWLED = 'NOT CRAWLED'
        QUEUED = 'QUEUED'
        CRAWLED = 'CRAWLED'
        ERROR = 'ERROR'

    url = Column(String(1000), nullable=False)
    fingerprint = Column(String(40), primary_key=True, nullable=False, index=True, unique=True)
    depth = Column(Integer, nullable=False)
    created_at = Column(TIMESTAMP, nullable=False)
    status_code = Column(String(20))
    state = Column(String(10))
    error = Column(String(20))
If you need to create your own models, you can do so by using the DEFAULT_MODELS setting:
DEFAULT_MODELS = {
    'Page': 'frontera.contrib.backends.sqlalchemy.models.Page',
}
This setting is a dictionary in which each key is the name of the model to define and each value is the dotted path of the model class to use.
For instance, to add a model that represents domains:
DEFAULT_MODELS = {
    'Page': 'frontera.contrib.backends.sqlalchemy.models.Page',
    'Domain': 'myproject.backends.sqlalchemy.models.Domain',
}
Models can be accessed through the Backend's models dictionary attribute.
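One way a name-to-path dictionary like DEFAULT_MODELS could be turned into a dictionary of classes is sketched below. This is an assumption about the general technique, not a description of how Frontera resolves the setting: `load_object` and `load_models` are hypothetical helpers, and standard-library classes stand in for real model classes so that the example runs on its own.

```python
from importlib import import_module


def load_object(path):
    """Import and return the object named by a full dotted path."""
    module_path, _, name = path.rpartition(".")
    if not module_path:
        raise ValueError("not a full dotted path: %r" % path)
    return getattr(import_module(module_path), name)


def load_models(default_models):
    """Map each model name to the class its dotted path points at."""
    return {name: load_object(path) for name, path in default_models.items()}


# Standard-library classes stand in for model classes in this sketch.
EXAMPLE_MODELS = {
    "Page": "collections.OrderedDict",
    "Domain": "collections.Counter",
}
models = load_models(EXAMPLE_MODELS)
```

With a mapping resolved this way, `models["Domain"]` is the class itself, so swapping in a custom model only requires changing the path in the setting.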
For a complete list of all settings used by the SQLAlchemy backends, check the settings section.
- class frontera.contrib.backends.sqlalchemy.FIFO
- class frontera.contrib.backends.sqlalchemy.LIFO