urllib based www backend (tendril.utils.www.bare)

TODO Some details about the other stuff

This module also provides the WWWCachedFetcher class. An instance of it is available as cached_fetcher, which is used by get_soup() and by any application code that wants cached results.
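For example, a minimal usage sketch (the example.com URL is purely illustrative):

    from tendril.utils.www.bare import cached_fetcher, get_soup

    # Fetch raw content through the filesystem cache.
    content = cached_fetcher.fetch('http://example.com/datasheet.html')

    # Or get a bs4 soup directly; the page is cached on the way through.
    soup = get_soup('http://example.com/datasheet.html')
    print(soup.title)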

Overall, caching for this backend looks something like this:

  • WWWCachedFetcher provides short term (~5 days) caching, aggressively caching whatever goes through it. This caching is NOT HTTP/1.1 compliant. In case HTTP/1.1 compliant caching is desired, use the requests based implementation instead, or use an external caching proxy such as http-replicator (illustrated in the sketch after this list).

  • CachingRedirectHandler is something of a special case, handling redirects which would otherwise be incredibly expensive. Unfortunately, this layer is also the dumbest cacher, and never expires anything. To ‘invalidate’ an entry in this cache, the entire cache needs to be nuked. It may be worthwhile to consider moving this to redis instead.
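A rough illustration of the aggressive, non-HTTP/1.1 caching described above, assuming that passing max_age=0 marks any cached copy as stale and forces a refetch (the URL is illustrative):

    from tendril.utils.www.bare import cached_fetcher

    url = 'http://example.com/catalog.html'

    first = cached_fetcher.fetch(url)             # hits the network and populates the cache
    second = cached_fetcher.fetch(url)            # served from the cache, no conditional GET
    fresh = cached_fetcher.fetch(url, max_age=0)  # assumed to treat everything as stale and refetch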

tendril.utils.www.bare._test_opener(openr)[source]

Tests an opener obtained using urllib2.build_opener() by attempting to open Google’s homepage. This is used to test internet connectivity.
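A rough sketch of what such a connectivity test might look like. The documentation above refers to urllib2; this sketch uses Python 3's urllib.request, and the helper name and timeout value are assumptions:

    import urllib.request


    def _test_opener_sketch(openr):
        # Try opening a well-known page; success implies working connectivity.
        try:
            openr.open('http://www.google.com', timeout=5)
            return True
        except OSError:  # URLError is a subclass of OSError; also covers timeouts
            return False


    opener = urllib.request.build_opener()
    print(_test_opener_sketch(opener))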

tendril.utils.www.bare._create_opener()[source]

Creates an opener for the internet.

It also attaches the CachingRedirectHandler to the opener and sets its User-agent to Mozilla/5.0.

If the Network Proxy settings are set and recognized, it creates the opener and attaches the proxy_handler to it. The opener is tested and returned if the test passes.

If the test fails, an opener without the proxy settings is created and returned instead.
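A simplified sketch of this behaviour, assuming the proxy settings arrive as a plain dict; the real implementation reads them from tendril's configuration and attaches its own CachingRedirectHandler:

    import urllib.request


    def _create_opener_sketch(proxy_settings=None):
        # proxy_settings is assumed to look like {'http': 'http://proxy.local:3128'}.

        def _build(with_proxy):
            handlers = [urllib.request.HTTPRedirectHandler()]
            if with_proxy:
                handlers.append(urllib.request.ProxyHandler(proxy_settings))
            opener = urllib.request.build_opener(*handlers)
            opener.addheaders = [('User-agent', 'Mozilla/5.0')]
            return opener

        opener = _build(with_proxy=bool(proxy_settings))
        if proxy_settings:
            try:
                # Connectivity test, as done by _test_opener().
                opener.open('http://www.google.com', timeout=5)
            except OSError:
                # The proxied opener failed; fall back to a direct opener.
                opener = _build(with_proxy=False)
        return opener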

tendril.utils.www.bare.urlopen(url)[source]

Opens a url specified by the url parameter.

This function handles redirect caching, if enabled.
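A minimal usage sketch; the URL is illustrative, and the returned object is assumed to behave like a standard urllib response:

    from tendril.utils.www.bare import urlopen

    response = urlopen('http://example.com/index.html')
    html = response.read()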

class tendril.utils.www.bare.WWWCachedFetcher(cache_dir='/home/docs/.tendril/cache/soupcache')[source]

Bases: tendril.utils.www.caching.CacheBase

This class implements a simple filesystem cache which can be used to store and retrieve cached responses for requests made to internet resources.

The cache is stored in the folder defined by cache_dir, with a filename constructed by the _get_filepath() function.

If the cache’s _accessor() function is called with the getcpath attribute set to True, only the path to a (valid) file in the cache filesystem is returned, and opening and reading the file is left to the caller. This hook is provided to help deal with file encoding on a somewhat case-by-case basis, until the overall encoding problems can be ironed out.
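A short sketch of the getcpath hook in use via fetch(); the codec is chosen by the caller, which is exactly the flexibility this hook is meant to provide, and the URL and codec here are illustrative:

    from tendril.utils.www.bare import cached_fetcher

    # Ask for the path to the cached file only, then handle decoding explicitly.
    cpath = cached_fetcher.fetch('http://example.com/table.html', getcpath=True)
    with open(cpath, encoding='iso-8859-1') as f:  # codec chosen by the caller
        text = f.read()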

_get_filepath(url)[source]

Return a filename constructed from the md5 sum of the url (encoded as utf-8 if necessary).

Parameters

url – url of the resource to be cached

Returns

name of the cache file
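A minimal sketch of the mapping this describes, with a hypothetical helper name; the real method may differ in details such as a file extension:

    import hashlib
    import os


    def _get_filepath_sketch(cache_dir, url):
        # Encode to utf-8 if the url arrives as text, then use its md5 hex
        # digest as the cache filename.
        if isinstance(url, str):
            url = url.encode('utf-8')
        return os.path.join(cache_dir, hashlib.md5(url).hexdigest())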

_get_fresh_content(url)[source]

Retrieve a fresh copy of the resource from the source.

Parameters

url – url of the resource

Returns

contents of the resource
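A plausible sketch of what this amounts to, using the module's own urlopen(); the helper name is hypothetical and the actual implementation may differ:

    from tendril.utils.www.bare import urlopen


    def _get_fresh_content_sketch(url):
        # Fetch a fresh copy of the resource through the module's opener,
        # bypassing the filesystem cache.
        response = urlopen(url)
        return response.read()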

fetch(url, max_age=600000, getcpath=False)[source]

Return the content located at the url provided. If a fresh cached version exists, it is returned. If not, a fresh one is obtained, stored in the cache, and returned.

Parameters
  • url – url of the resource to retrieve.

  • max_age – maximum acceptable age of the cached copy, in seconds.

  • getcpath – (default False) if True, returns only the path to the cache file.
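For example (the URL and max_age are illustrative):

    from tendril.utils.www.bare import cached_fetcher

    # Accept a cached copy up to a day old; anything older is refetched.
    content = cached_fetcher.fetch('http://example.com/prices.html',
                                   max_age=86400)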

tendril.utils.www.bare.cached_fetcher = <tendril.utils.www.bare.WWWCachedFetcher object>

The module’s WWWCachedFetcher instance which should be used whenever cached results are desired. The cache is stored in the directory defined by tendril.config.WWW_CACHE.

tendril.utils.www.bare.get_soup(url)[source]

Gets a bs4 parsed soup for the url specified by the parameter, using the lxml parser. The soup is constructed from the cached page if a valid one exists; otherwise the page is fetched, dumped into the cache, and then parsed.
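In effect, this combines the cached fetcher with bs4's lxml parser. A rough equivalent, assuming bs4 and lxml are installed; the helper name is hypothetical:

    from bs4 import BeautifulSoup

    from tendril.utils.www.bare import cached_fetcher


    def get_soup_sketch(url):
        # Fetch through the filesystem cache, then parse with lxml.
        page = cached_fetcher.fetch(url)
        return BeautifulSoup(page, 'lxml')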