Frontera API¶
This section documents the Frontera core API, and is intended for developers of middlewares and backends.
Frontera API / Manager¶
The main entry point to the Frontera API is the FrontierManager
object, passed to middlewares and backend through the from_manager class method. This object provides access to all
Frontera core components, and is the only way for middlewares and backend to access them and hook their
functionality into Frontera.
The FrontierManager is responsible for loading the installed
middlewares and backend, as well as for managing the data flow around the whole frontier.
Loading from settings¶
Although FrontierManager can be initialized using parameters, the most common way of doing this is using
Frontera Settings.
This can be done through the from_settings
class method, using either a string path:
>>> from frontera import FrontierManager
>>> frontier = FrontierManager.from_settings('my_project.frontier.settings')
or a Settings object instance:
>>> from frontera import FrontierManager, Settings
>>> settings = Settings()
>>> settings.MAX_PAGES = 0
>>> frontier = FrontierManager.from_settings(settings)
It can also be initialized without parameters; in this case the frontier will use the default settings:
>>> from frontera import FrontierManager
>>> frontier = FrontierManager.from_settings()
Frontier Manager¶
class frontera.core.manager.FrontierManager(request_model, response_model, backend, logger, event_log_manager, middlewares=None, test_mode=False, max_requests=0, max_next_requests=0, auto_start=True, settings=None)¶
The FrontierManager object encapsulates the whole frontier, providing an API to interact with it. It is also responsible for loading all the different frontier components and communicating with them.
Parameters:
- request_model (object/string) – The Request object to be used by the frontier.
- response_model (object/string) – The Response object to be used by the frontier.
- backend (object/string) – The Backend object to be used by the frontier.
- logger (object/string) – The Logger object to be used by the frontier.
- event_log_manager (object/string) – The EventLogger object to be used by the frontier.
- middlewares (list) – A list of Middleware objects to be used by the frontier.
- test_mode (bool) – Activate/deactivate frontier test mode.
- max_requests (int) – Number of pages after which the frontier would stop (See Finish conditions).
- max_next_requests (int) – Maximum number of requests returned by the get_next_requests method.
- auto_start (bool) – Activate/deactivate automatic frontier start (See starting/stopping the frontier).
- settings (object/string) – The Settings object used by the frontier.
Attributes
- request_model¶ The Request object to be used by the frontier. Can be defined with the REQUEST_MODEL setting.
- response_model¶ The Response object to be used by the frontier. Can be defined with the RESPONSE_MODEL setting.
- event_log_manager¶ The EventLogger object to be used by the frontier. Can be defined with the EVENT_LOGGER setting.
- middlewares¶ A list of Middleware objects to be used by the frontier. Can be defined with the MIDDLEWARES setting.
- test_mode¶ Boolean value indicating if the frontier is using frontier test mode. Can be defined with the TEST_MODE setting.
- max_requests¶ Number of pages after which the frontier would stop (See Finish conditions). Can be defined with the MAX_REQUESTS setting.
- max_next_requests¶ Maximum number of requests returned by the get_next_requests method. Can be defined with the MAX_NEXT_REQUESTS setting.
- auto_start¶ Boolean value indicating if automatic frontier start is activated. See starting/stopping the frontier. Can be defined with the AUTO_START setting.
- iteration¶ Current frontier iteration.
- n_requests¶ Number of accumulated requests returned by the frontier.
- finished¶ Boolean value indicating if the frontier has finished. See Finish conditions.
API Methods
- start()¶ Notifies all the components of the frontier start. Typically used for initializations (See starting/stopping the frontier).
  Returns: None.
- stop()¶ Notifies all the components of the frontier stop. Typically used for finalizations (See starting/stopping the frontier).
  Returns: None.
- add_seeds(seeds)¶ Adds a list of seed requests (seed URLs) as entry points for the crawl.
  Parameters: seeds (list) – A list of Request objects.
  Returns: None.
- get_next_requests(max_next_requests=0, **kwargs)¶ Returns a list of the next requests to be crawled. Optionally, a maximum number of pages can be passed. If no value is passed, FrontierManager.max_next_requests will be used instead (MAX_NEXT_REQUESTS setting).
  Parameters:
  - max_next_requests (int) – Maximum number of requests to be returned by this method.
  - kwargs (dict) – Arbitrary arguments that will be passed to the backend.
  Returns: list of Request objects.
- page_crawled(response, links=None)¶ Informs the frontier about the crawl result and extracted links for the current page.
  Parameters:
  - response (object) – The Response object for the crawled page.
  - links (list) – The links extracted from the crawled page.
  Returns: None.
- request_error(request, error)¶ Informs the frontier about a page crawl error. An error identifier must be provided.
  Parameters:
  - request (object) – The Request object that was crawled with an error.
  - error (string) – A string identifier for the error.
  Returns: None.
Class Methods
- classmethod from_settings(settings=None)¶ Returns a FrontierManager instance initialized with the passed settings argument. The argument value can be either a string path pointing to a settings file or a Settings object instance. If no settings are given, the frontier default settings are used.
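The way a crawler drives the methods above can be sketched with a small stand-in. This is a minimal illustration, not the real FrontierManager: ToyFrontier, SITE and crawl are hypothetical names, requests and responses are plain URL strings rather than Request and Response objects, and the deduplication shown is an assumption about what a backend might do.

```python
class ToyFrontier:
    """Toy stand-in exposing the same method names as the API above."""

    def __init__(self, max_next_requests=2):
        self.max_next_requests = max_next_requests
        self.queue = []    # URLs waiting to be crawled
        self.seen = set()  # URLs already scheduled once
        self.errors = {}   # url -> error identifier

    def add_seeds(self, seeds):
        for url in seeds:
            self._schedule(url)

    def get_next_requests(self, max_next_requests=0):
        # Fall back to the manager-wide limit when no value is passed.
        limit = max_next_requests or self.max_next_requests
        batch, self.queue = self.queue[:limit], self.queue[limit:]
        return batch

    def page_crawled(self, response, links=None):
        for url in links or []:
            self._schedule(url)

    def request_error(self, request, error):
        self.errors[request] = error

    def _schedule(self, url):
        if url not in self.seen:
            self.seen.add(url)
            self.queue.append(url)


# Hypothetical link graph standing in for a downloader; None means "fetch failed".
SITE = {"a": ["b", "c"], "b": ["c", "d"], "c": [], "d": None}


def crawl(frontier, seeds):
    """Ask for batches until the frontier has nothing left to return."""
    frontier.add_seeds(seeds)
    crawled = []
    while True:
        batch = frontier.get_next_requests()
        if not batch:
            break
        for url in batch:
            links = SITE.get(url)
            if links is None:
                frontier.request_error(url, "fetch_failed")
            else:
                frontier.page_crawled(url, links=links)
                crawled.append(url)
    return crawled


frontier = ToyFrontier()
crawled = crawl(frontier, ["a"])
```

In this sketch the crawler only ever talks to the frontier through add_seeds, get_next_requests, page_crawled and request_error, which is the division of labour the method list describes.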
Starting/Stopping the frontier¶
Sometimes, frontier components need to perform initialization and finalization operations. The frontier
notifies its components of the frontier start and stop through the
start() and
stop() methods,
respectively.
By default, the auto_start frontier value is activated,
which means that components will be notified once the
FrontierManager object is created.
If you need finer control over when the different components are initialized, deactivate
auto_start and manually call the frontier API
start() and
stop() methods.
Note
The frontier stop() method is not called automatically
when auto_start is active (because the frontier is
not aware of the crawling state). If you need to notify components of the frontier end, you should call the method
manually.
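The notification behaviour described in this section can be sketched as follows. ToyManager and RecordingComponent are hypothetical stand-ins written for this example; only the method names (start, stop, frontier_start, frontier_stop) and the auto_start flag come from the text above.

```python
class RecordingComponent:
    """Hypothetical component that records the notifications it receives."""

    def __init__(self):
        self.events = []

    def frontier_start(self):
        self.events.append("start")

    def frontier_stop(self):
        self.events.append("stop")


class ToyManager:
    """Toy stand-in for the manager, not the real FrontierManager."""

    def __init__(self, components, auto_start=True):
        self.components = components
        self.auto_start = auto_start
        if auto_start:
            # Components are notified as soon as the manager is created.
            self.start()

    def start(self):
        for component in self.components:
            component.frontier_start()

    def stop(self):
        # Never called implicitly: the caller decides when the crawl ends.
        for component in self.components:
            component.frontier_stop()


auto = RecordingComponent()
ToyManager([auto])  # auto_start left at its default

manual = RecordingComponent()
manager = ToyManager([manual], auto_start=False)
events_before_start = list(manual.events)
manager.start()
manager.stop()
```

With auto_start active the first component sees only a start notification, since nothing calls stop() for it; the second component sees nothing until start() is called explicitly.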
Frontier iterations¶
Once the frontier is running, the usual process is the one described in the data flow section.
The crawler asks the frontier for the next pages using the
get_next_requests() method.
Each time the frontier returns a non-empty list of pages (data available), a frontier iteration takes place.
The current frontier iteration can be accessed using the
iteration attribute.
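A minimal sketch of the counting rule, assuming the counter advances only when a call returns at least one page. ToyFrontier is a hypothetical stand-in, and its default batch size of 2 is chosen for the example only.

```python
class ToyFrontier:
    """Toy stand-in that tracks iterations the way the text describes."""

    def __init__(self, pending):
        self.pending = list(pending)
        self.iteration = 0

    def get_next_requests(self, max_next_requests=2):
        batch = self.pending[:max_next_requests]
        self.pending = self.pending[max_next_requests:]
        if batch:
            # Only a non-empty result counts as a frontier iteration.
            self.iteration += 1
        return batch


frontier = ToyFrontier(["p1", "p2", "p3"])
first = frontier.get_next_requests()   # two pages available
second = frontier.get_next_requests()  # one page left
third = frontier.get_next_requests()   # nothing left, iteration unchanged
```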
Finishing the frontier¶
The crawl can be finished either by the crawler or by Frontera. Frontera will finish when a maximum number
of pages has been returned. This limit is controlled by the
max_requests attribute
(MAX_REQUESTS setting).
If max_requests has a value of 0 (default value),
the frontier will continue indefinitely.
Once the frontier is finished, no more pages will be returned by the
get_next_requests method and the
finished attribute will be True.
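The finish condition can be sketched like this. ToyFrontier is a hypothetical stand-in that generates page names on demand; trimming the last batch so that max_requests is never exceeded is an assumption of this sketch, not something the text above states.

```python
import itertools


class ToyFrontier:
    """Toy stand-in illustrating the max_requests finish condition."""

    def __init__(self, max_requests=0):
        self.max_requests = max_requests
        self.n_requests = 0
        self._ids = itertools.count(1)

    @property
    def finished(self):
        # A limit of 0 means "no limit": the frontier never finishes by itself.
        return bool(self.max_requests) and self.n_requests >= self.max_requests

    def get_next_requests(self, max_next_requests=2):
        if self.finished:
            return []
        batch_size = max_next_requests
        if self.max_requests:
            # Assumption: never hand out more than the remaining budget.
            batch_size = min(batch_size, self.max_requests - self.n_requests)
        batch = ["page-%d" % next(self._ids) for _ in range(batch_size)]
        self.n_requests += len(batch)
        return batch


limited = ToyFrontier(max_requests=3)
batches = [limited.get_next_requests() for _ in range(3)]

unlimited = ToyFrontier()  # max_requests=0
for _ in range(5):
    unlimited.get_next_requests()
```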
Component objects¶
class frontera.core.components.Component¶
Interface definition for a frontier component.
The Component object is the base class for frontier Middleware and Backend objects. FrontierManager communicates with the active components using the hook methods listed below.
Implementations are different for Middleware and Backend objects, therefore methods are not fully described here but in their corresponding sections.
Attributes
- name¶ The component name.
Abstract methods
- frontier_start()¶ Called when the frontier starts, see starting/stopping the frontier.
- frontier_stop()¶ Called when the frontier stops, see starting/stopping the frontier.
- add_seeds(seeds)¶ This method is called when new seeds are added to the frontier.
  Parameters: seeds (list) – A list of Request objects.
- page_crawled(response, links)¶ This method is called each time a page has been crawled.
  Parameters:
  - response (object) – The Response object for the crawled page.
  - links (list) – The links extracted from the crawled page.
- request_error(page, error)¶ This method is called each time an error occurs when crawling a page.
  Parameters:
  - request (object) – The Request object that was crawled with an error.
  - error (string) – A string identifier for the error.
Class Methods
- classmethod from_manager(manager)¶ Class method called from FrontierManager passing the manager itself.
  Example of usage:
      @classmethod
      def from_manager(cls, manager):
          return cls(settings=manager.settings)
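As an illustration of the hook methods listed above, here is a sketch of a component that simply counts the notifications it receives. The Component base class below is a local stand-in declared with abc so that the example runs without Frontera installed; CountingMiddleware, FakeManager and the settings dictionary are hypothetical names.

```python
from abc import ABC, abstractmethod


class Component(ABC):
    """Local stand-in mirroring the hooks listed above, not frontera's class."""

    name = ""

    @abstractmethod
    def frontier_start(self):
        pass

    @abstractmethod
    def frontier_stop(self):
        pass

    @abstractmethod
    def add_seeds(self, seeds):
        pass

    @abstractmethod
    def page_crawled(self, response, links):
        pass

    @abstractmethod
    def request_error(self, page, error):
        pass


class CountingMiddleware(Component):
    """Hypothetical component that counts what passes through it."""

    name = "Counting Middleware"

    def __init__(self, settings):
        self.settings = settings
        self.running = False
        self.counts = {"seeds": 0, "pages": 0, "errors": 0}

    @classmethod
    def from_manager(cls, manager):
        # Same pattern as the usage example: take what is needed from the manager.
        return cls(settings=manager.settings)

    def frontier_start(self):
        self.running = True

    def frontier_stop(self):
        self.running = False

    def add_seeds(self, seeds):
        self.counts["seeds"] += len(seeds)

    def page_crawled(self, response, links):
        self.counts["pages"] += 1

    def request_error(self, page, error):
        self.counts["errors"] += 1


class FakeManager:
    """Minimal object exposing only the attribute from_manager reads."""

    settings = {"MAX_NEXT_REQUESTS": 10}


middleware = CountingMiddleware.from_manager(FakeManager())
middleware.frontier_start()
middleware.add_seeds(["seed-1", "seed-2"])
middleware.page_crawled("response-1", ["link-1"])
middleware.request_error("seed-2", "timeout")
middleware.frontier_stop()
```

Declaring the hooks as abstract in the stand-in means a subclass that forgets one of them cannot be instantiated, which is one way to catch an incomplete component early.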
Test mode¶
In some cases while testing, frontier components need to act differently from how they usually do (for instance,
the domain middleware accepts invalid URLs like 'A1' or 'B1' when parsing
domain URLs in test mode).
Components can know if the frontier is in test mode via the boolean
test_mode attribute.
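A sketch of how a component might branch on test mode, loosely modelled on the domain example above. ToyDomainMiddleware and FakeManager are hypothetical; the parsing rule is a simplification chosen for this example and is not the behaviour of Frontera's own domain middleware.

```python
from urllib.parse import urlparse


class ToyDomainMiddleware:
    """Hypothetical component that relaxes URL validation in test mode."""

    def __init__(self, manager):
        self.manager = manager

    @classmethod
    def from_manager(cls, manager):
        return cls(manager)

    def domain_of(self, url):
        netloc = urlparse(url).netloc
        if netloc:
            return netloc
        if self.manager.test_mode:
            # In test mode, placeholders such as 'A1' are accepted as-is.
            return url
        raise ValueError("invalid URL: %r" % url)


class FakeManager:
    """Minimal object exposing only the test_mode attribute."""

    def __init__(self, test_mode=False):
        self.test_mode = test_mode


testing = ToyDomainMiddleware.from_manager(FakeManager(test_mode=True))
strict = ToyDomainMiddleware.from_manager(FakeManager(test_mode=False))
```

Calling testing.domain_of('A1') returns the placeholder unchanged, while strict.domain_of('A1') raises ValueError; both accept a full URL such as 'http://example.com/page'.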
Other ways of using the frontier¶
Communication with the frontier can also be done through other mechanisms such as an HTTP API or a queue system. These functionalities are not available for the time being, but hopefully will be included in future versions.