Although it's not documented, the Scrapy images pipeline supports storing your downloaded images in an Amazon S3 bucket. This is especially useful when your scraping is distributed over several nodes and you want a centralized image store.

To enable storing images on S3, all that is required is to add the following settings to your spider's settings.py file:

AWS_ACCESS_KEY_ID = "<your_id>"
AWS_SECRET_ACCESS_KEY = "<your_secret_key>"
IMAGES_STORE = "s3://<bucket_name>/<optional_prefix>"

along with the usual settings for the Scrapy images pipeline. You'll also need to install the boto library into your environment; usually pip install boto suffices. The AWS key and secret can be obtained from the AWS console. I'd recommend setting up a dedicated user with only the required access rights (see the "Bucket policy, users IAM and cors" section of the blog post here for how to do that). Note that the trailing slash is essential in the IMAGES_STORE setting, regardless of whether you use the prefix or not.
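
For reference, a minimal settings.py might end up looking something like this sketch (the bucket name and prefix are placeholders; ITEM_PIPELINES and the optional IMAGES_EXPIRES are just the standard images pipeline settings):

ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}

AWS_ACCESS_KEY_ID = "<your_id>"
AWS_SECRET_ACCESS_KEY = "<your_secret_key>"

# Trailing slash required, with or without a prefix
IMAGES_STORE = "s3://my-images-bucket/img/"

# Optional: don't re-download images fetched within the last 90 days
IMAGES_EXPIRES = 90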

Azure Storage

What is not currently possible out of the box is to use the Microsoft equivalent of S3, namely Azure Storage. I've submitted a pull request to Scrapy to introduce this feature into the main repo, but in the meantime one can use a custom images pipeline to achieve this.

The following pipeline customizes the images pipeline and lets you do just that, simply by adding settings equivalent to the S3 ones:

AZURE_ACCOUNT_NAME = "<your_azure_account_name>"
AZURE_SECRET_ACCESS_KEY = "<your_azure_secret_key>"
IMAGES_STORE = "azure://<your_azure_container>/"

Here is the code:

import calendar
import six
from scrapy.pipelines.images import ImagesPipeline
from scrapy.pipelines.files import FSFilesStore, S3FilesStore
from scrapy.utils.datatypes import CaselessDict
from twisted.internet import threads

class AzureFilesStore(object):

    AZURE_ACCOUNT_NAME = None
    AZURE_SECRET_ACCESS_KEY = None

    HEADERS = {
        'Cache-Control': 'max-age=172800',
    }

    def __init__(self, uri):
        from azure.storage.blob import BlockBlobService, ContentSettings
        self.BlockBlobService = BlockBlobService
        self.ContentSettings = ContentSettings
        assert uri.startswith('azure://')
        self.container, self.prefix = uri[8:].split('/', 1)

    def stat_file(self, path, info):
        def _onsuccess(blob_properties):
            if blob_properties:
                checksum = blob_properties.properties.etag.strip('"')
                # last_modified is a timezone-aware datetime
                last_modified = blob_properties.properties.last_modified
                # calendar.timegm() interprets the UTC time tuple as UTC, unlike
                # the platform-dependent strftime("%s") which ignores tzinfo
                modified_stamp = calendar.timegm(last_modified.utctimetuple())
                return {'checksum': checksum, 'last_modified': modified_stamp}
            # If media_to_download gets a None result it will also return None
            # and force download
            return None

        return self._get_azure_blob(path).addCallback(_onsuccess)

    def _get_azure_service(self):
        # If need http instead of https, use protocol kwarg
        return self.BlockBlobService(account_name=self.AZURE_ACCOUNT_NAME, account_key=self.AZURE_SECRET_ACCESS_KEY)

    def _get_azure_blob(self, path):
        blob_name = '%s%s' % (self.prefix, path)
        # Check if blob exists
        s = self._get_azure_service()
        # Note: returning None as result will force download in media_to_download
        if s.exists(self.container, blob_name=blob_name):
            # Get properties
            return threads.deferToThread(s.get_blob_properties, self.container, blob_name=blob_name)
        return threads.deferToThread(lambda _: _, None)

    def persist_file(self, path, buf, info, meta=None, headers=None):
        """Upload file to Azure blob storage"""
        blob_name = '%s%s' % (self.prefix, path)
        extra = self._headers_to_azure_content_kwargs(self.HEADERS)
        if headers:
            extra.update(self._headers_to_azure_content_kwargs(headers))
        buf.seek(0)
        s = self._get_azure_service()
        return threads.deferToThread(s.create_blob_from_bytes, self.container, blob_name, buf.getvalue(),
                                     metadata={k: str(v) for k, v in six.iteritems(meta or {})},
                                     content_settings=self.ContentSettings(**extra))

    def _headers_to_azure_content_kwargs(self, headers):
        """ Convert headers to Azure content settings keyword agruments.
        """
        # This is required while we need to support both boto and botocore.
        mapping = CaselessDict({
            'Content-Type': 'content_type',
            'Cache-Control': 'cache_control',
            'Content-Disposition': 'content_disposition',
            'Content-Encoding': 'content_encoding',
            'Content-Language': 'content_language',
            'Content-MD5': 'content_md5',
            })
        extra = {}
        for key, value in six.iteritems(headers):
            try:
                kwarg = mapping[key]
            except KeyError:
                raise TypeError(
                    'Header "%s" is not supported by Azure' % key)
            else:
                extra[kwarg] = value
        return extra


class CustomImagesPipeline(ImagesPipeline):
    STORE_SCHEMES = {
        '': FSFilesStore,
        'file': FSFilesStore,
        's3': S3FilesStore,
        'azure': AzureFilesStore,
    }

    @classmethod
    def from_settings(cls, settings):
        azureStore = cls.STORE_SCHEMES['azure']
        azureStore.AZURE_ACCOUNT_NAME = settings['AZURE_ACCOUNT_NAME']
        azureStore.AZURE_SECRET_ACCESS_KEY = settings['AZURE_SECRET_ACCESS_KEY']
        s3store = cls.STORE_SCHEMES['s3']
        s3store.AWS_ACCESS_KEY_ID = settings['AWS_ACCESS_KEY_ID']
        s3store.AWS_SECRET_ACCESS_KEY = settings['AWS_SECRET_ACCESS_KEY']

        store_uri = settings['IMAGES_STORE']
        return cls(store_uri, settings=settings)

In place of boto you'll need to install azure-storage into your environment, e.g. pip install azure-storage. Save the above code into your project, for example as myproject/pipelines.py. Then simply add this pipeline to your ITEM_PIPELINES in place of the original images pipeline:

ITEM_PIPELINES = {'myproject.pipelines.CustomImagesPipeline': 1,...}

along with the Azure settings mentioned above, and it should work out of the box.
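
To round things off, here's a minimal spider sketch (the spider name, URLs and selectors are purely illustrative) showing how images actually get queued: the pipeline reads URLs from the default image_urls field of each scraped item and writes the downloaded files to wherever IMAGES_STORE points, be that the local filesystem, S3 or Azure:

import scrapy

class ProductImagesSpider(scrapy.Spider):
    name = 'product_images'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        # The images pipeline looks at the 'image_urls' field by default
        # (configurable via IMAGES_URLS_FIELD) and records the downloaded
        # files in the 'images' field of the item.
        image_urls = [response.urljoin(src) for src in
                      response.css('img.product::attr(src)').extract()]
        yield {
            'title': response.css('h1::text').extract_first(),
            'image_urls': image_urls,
        }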
