Storing cache results - scrapy

I have activated the HTTP cache extension scrapy.extensions.httpcache.FilesystemCacheStorage so that responses are cached as gzipped files in a folder while scraping. However, I get the following error:
raise BadGzipFile('Not a gzipped file (%r)' % magic)
gzip.BadGzipFile: Not a gzipped file (b'\x80\x04')
I think the issue is in how the cache files are named and saved.
My settings.py:
HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
HTTPCACHE_DBM_MODULE = 'dbm'
HTTPCACHE_GZIP = True
How do I correctly activate the extension and store the cache as files in my working directory?
example scraper:
import scrapy

class email_spider(scrapy.Spider):
    name = 'email'
    start_urls = ['http://quotes.toscrape.com/tag/humor/']

    def parse(self, response):
        content = response.xpath("(//div[@class='col-md-8'])[2]//div")
        for stuff in content:
            yield {
                'stuff': stuff.xpath(".//a/@href").get(),
            }
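For reference, FilesystemCacheStorage writes each cached response under <HTTPCACHE_DIR>/<spider name>/<2-char prefix>/<request fingerprint>/ as separate files (response_body, response_headers, pickled_meta, ...), and with HTTPCACHE_GZIP = True those files are gzip-compressed. The b'\x80\x04' in the traceback is the pickle protocol-4 header, which suggests the cache directory still holds entries written before HTTPCACHE_GZIP was turned on, so deleting the old .scrapy/httpcache folder and re-crawling is worth trying. Below is a minimal inspection sketch of mine (not part of Scrapy's API; paths assumed from the defaults and the spider name above) for reading cached entries back:
import gzip
import pickle
from pathlib import Path

# Default layout used by FilesystemCacheStorage (HTTPCACHE_DIR is relative to the
# project's .scrapy directory); 'email' is the spider name from the example above.
cache_root = Path('.scrapy/httpcache/email')

for entry in cache_root.glob('*/*'):  # <2-char prefix>/<request fingerprint>
    try:
        # With HTTPCACHE_GZIP = True every file in the entry is gzip-compressed.
        meta = pickle.loads(gzip.open(entry / 'pickled_meta', 'rb').read())
        body = gzip.open(entry / 'response_body', 'rb').read()
    except gzip.BadGzipFile:
        # Entries written while HTTPCACHE_GZIP was off are plain files.
        meta = pickle.loads((entry / 'pickled_meta').read_bytes())
        body = (entry / 'response_body').read_bytes()
    print(meta['url'], len(body))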

Related

How to customize URI in Scrapy with non built in storage URI parameters

I want to customize the Scrapy feed URI to S3 to include the dimensions of the uploaded file. Currently I have the following in my settings.py file:
FEEDS = {
    's3://path-to-file/file_to_have_dimensions.csv': {
        'format': 'csv',
        'encoding': 'utf8',
        'store_empty': False,
        'indent': 4,
    }
}
But would like to have something like the following:
NUMBER_OF_ROWS_IN_CSV = file.height()
FEEDS = {
    f's3://path-to-files/file_to_have_dimensions_{NUMBER_OF_ROWS_IN_CSV}.csv': {
        'format': 'csv',
        'encoding': 'utf8',
        'store_empty': False,
        'indent': 4,
    }
}
Note that I would like the number of rows to be inserted automatically.
Is it possible to do this solely by changing settings.py, or does it require changing other parts of the Scrapy code?
The feed file is created when the spider starts running, at which point the number of items is not yet known. However, when the spider finishes running, it calls a method named closed, from which you can access the spider stats and settings, and also perform any other tasks you want to run after the spider has finished scraping and saving items.
In the example below I rename the feed file from initial_file.csv to final_file_{item_count}.csv.
As you cannot rename files in S3, I use the boto3 library to copy initial_file to a new object whose name includes the item_count value, and then delete the initial file.
import scrapy
import boto3

class SampleSpider(scrapy.Spider):
    name = 'sample'
    start_urls = [
        'http://quotes.toscrape.com/',
    ]
    custom_settings = {
        'FEEDS': {
            's3://path-to-file/initial_file.csv': {
                'format': 'csv',
                'encoding': 'utf8',
                'store_empty': False,
                'indent': 4,
            }
        }
    }

    def parse(self, response):
        for quote in response.xpath('//div[@class="quote"]'):
            yield {
                'text': quote.xpath('./span[@class="text"]/text()').extract_first(),
                'author': quote.xpath('.//small[@class="author"]/text()').extract_first(),
                'tags': quote.xpath('.//div[@class="tags"]/a[@class="tag"]/text()').extract()
            }

    def closed(self, reason):
        item_count = self.crawler.stats.get_value('item_scraped_count')
        try:
            session = boto3.Session(aws_access_key_id='awsAccessKey', aws_secret_access_key='awsSecretAccessKey')
            s3 = session.resource('s3')
            s3.Object('my_bucket', f'path-to-file/final_file_{item_count}.csv').copy_from(CopySource='my_bucket/path-to-file/initial_file.csv')
            s3.Object('my_bucket', 'path-to-file/initial_file.csv').delete()
        except Exception:
            self.logger.info("unable to rename s3 file")
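As a small variation of my own (not part of the original answer), the AWS credentials can be read from the Scrapy settings instead of being hardcoded, assuming AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are defined in settings.py. This closed() method is a drop-in replacement for the one above:
    def closed(self, reason):
        item_count = self.crawler.stats.get_value('item_scraped_count')
        # Credentials come from settings.py rather than being hardcoded in the spider.
        session = boto3.Session(
            aws_access_key_id=self.settings.get('AWS_ACCESS_KEY_ID'),
            aws_secret_access_key=self.settings.get('AWS_SECRET_ACCESS_KEY'),
        )
        s3 = session.resource('s3')
        s3.Object('my_bucket', f'path-to-file/final_file_{item_count}.csv').copy_from(
            CopySource='my_bucket/path-to-file/initial_file.csv')
        s3.Object('my_bucket', 'path-to-file/initial_file.csv').delete()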

Simple scraper with Scrapy API

I am writing a scraper with Scrapy within a larger project, and I'm trying to keep it as minimal as possible (without creating a whole Scrapy project). This code downloads a single URL correctly:
import scrapy
from scrapy.crawler import CrawlerProcess

class WebsiteSpider(scrapy.Spider):
    """
    https://docs.scrapy.org/en/latest/
    """
    custom_settings = {'DOWNLOAD_DELAY': 1, 'DEPTH_LIMIT': 3}
    name = 'my_website_scraper'

    def parse(self, response):
        html = response.body
        url = response.url
        # process page here

process = CrawlerProcess()
process.crawl(WebsiteSpider, start_urls=['https://www.bbc.co.uk/'])
process.start()
How can I enrich this code to keep scraping the links found in the start URLs (with a maximum depth, for example of 3)?
Try this.
from simplified_scrapy import Spider, SimplifiedDoc, SimplifiedMain

class WebsiteSpider(Spider):
    name = 'bbc.co.uk'
    allowed_domains = ['.bbc.co.uk']
    start_urls = ['https://www.bbc.co.uk/']
    # refresh_urls = True  # For debugging. If refresh_urls = True, start_urls will be crawled again.

    def extract(self, url, html, models, modelNames):
        doc = SimplifiedDoc(html)
        lstA = doc.listA(url=url["url"])  # Get link data for subsequent crawling
        data = [{"title": doc.title.text}]  # Get target data
        return {"Urls": lstA, "Data": data}  # Return data to framework

SimplifiedMain.startThread(WebsiteSpider())  # Start crawling
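If you would rather stay with plain Scrapy and CrawlerProcess as in the original snippet, here is a sketch of one way to do it (the link selector, allowed_domains, and the process-page placeholder are assumptions to adjust): follow every link found on each page and let DEPTH_LIMIT stop the crawl at depth 3.
import scrapy
from scrapy.crawler import CrawlerProcess

class WebsiteSpider(scrapy.Spider):
    name = 'my_website_scraper'
    # DEPTH_LIMIT caps how deep the crawl goes; allowed_domains keeps it on the site.
    custom_settings = {'DOWNLOAD_DELAY': 1, 'DEPTH_LIMIT': 3}
    allowed_domains = ['bbc.co.uk']

    def parse(self, response):
        html = response.body
        url = response.url
        # process page here

        # Queue every link on the page; Scrapy's depth middleware enforces DEPTH_LIMIT.
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)

process = CrawlerProcess()
process.crawl(WebsiteSpider, start_urls=['https://www.bbc.co.uk/'])
process.start()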

Looping through pages of Web Page's Request URL with Scrapy

I'm looking to adapt this tutorial (https://medium.com/better-programming/a-gentle-introduction-to-using-scrapy-to-crawl-airbnb-listings-58c6cf9f9808) to scrape this site of tiny home listings: https://tinyhouselistings.com/.
The tutorial uses the request URL to get a very complete and clean JSON file, but does so for the first page only. It seems that looping through the 121 pages of my tinyhouselistings request URL should be pretty straightforward, but I have not been able to get anything to work. The tutorial does not loop through the pages of the request URL, but rather uses Scrapy Splash, run within a Docker container, to get all the listings. I am willing to try that, but I just feel like it should be possible to loop through this request URL.
This outputs only the first page of the tinyhouselistings request URL for my project:
import scrapy

class TinyhouselistingsSpider(scrapy.Spider):
    name = 'tinyhouselistings'
    allowed_domains = ['tinyhouselistings.com']
    start_urls = ['http://www.tinyhouselistings.com']

    def start_requests(self):
        url = 'https://thl-prod.global.ssl.fastly.net/api/v1/listings/search?area_min=0&measurement_unit=feet&page=1'
        yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        _file = "tiny_listings.json"
        with open(_file, 'wb') as f:
            f.write(response.body)
I've tried this:
class TinyhouselistingsSpider(scrapy.Spider):
    name = 'tinyhouselistings'
    allowed_domains = ['tinyhouselistings.com']
    start_urls = ['']

    def start_requests(self):
        url = 'https://thl-prod.global.ssl.fastly.net/api/v1/listings/search?area_min=0&measurement_unit=feet&page='
        for page in range(1, 121):
            self.start_urls.append(url + str(page))
        yield scrapy.Request(url=start_urls, callback=self.parse)
But I'm not sure how to then pass start_urls to parse so as to write the response to the json being written at the end of the script.
Any help would be much appreciated!
Remove allowed_domains = ['tinyhouselistings.com'], because otherwise requests to thl-prod.global.ssl.fastly.net will be filtered out by Scrapy.
Since you are using the start_requests method, you do not need start_urls; you only need one of the two.
import json
import scrapy

class TinyhouselistingsSpider(scrapy.Spider):
    name = 'tinyhouselistings'
    listings_url = 'https://thl-prod.global.ssl.fastly.net/api/v1/listings/search?area_min=0&measurement_unit=feet&page={}'

    def start_requests(self):
        page = 1
        yield scrapy.Request(url=self.listings_url.format(page),
                             meta={"page": page},
                             callback=self.parse)

    def parse(self, response):
        resp = json.loads(response.body)
        for ad in resp["listings"]:
            yield ad
        page = int(response.meta['page']) + 1
        if page < int(resp['meta']['pagination']['page_count']):
            yield scrapy.Request(url=self.listings_url.format(page),
                                 meta={"page": page},
                                 callback=self.parse)
From the terminal, run the spider with the following command to save the scraped data to a JSON file:
scrapy crawl tinyhouselistings -o output_file.json

Serving STATIC FILES in development and Amazon S3 together

How can I serve static files from Amazon S3 and from my local server together?
I also don't know how to set up MEDIA_URL, STATIC_ROOT and MEDIA_ROOT.
Context:
I am serving my static files from Amazon S3 using django-boto, and my settings/base.py contains:
STATICFILES_LOCATION = 'assets'
STATICFILES_STORAGE = 'custom_storages.StaticStorage'
STATIC_URL = "https://%s/%s/" % (AWS_S3_CUSTOM_DOMAIN, STATICFILES_LOCATION)
MEDIAFILES_LOCATION = 'media'
MEDIA_URL = "https://%s/%s/" % (AWS_S3_CUSTOM_DOMAIN, MEDIAFILES_LOCATION)
DEFAULT_FILE_STORAGE = 'custom_storages.MediaStorage'
My custom_storages.py file content is:
from django.conf import settings
# from storages.backends.s3boto3 import S3Boto3Storage
from storages.backends.s3boto import S3BotoStorage

class StaticStorage(S3BotoStorage):
    location = settings.STATICFILES_LOCATION

class MediaStorage(S3BotoStorage):
    location = settings.MEDIAFILES_LOCATION
And all of this is working fine.
When I execute collectstatic my static files are being uploaded to my bucket in Amazon S3.
The problem I have is that every time I make a change in my CSS or JS files, I need to run the collectstatic command.
How can I setup my project (settings) to serve my static files from S3 and my django in a local server together?
I have a settings/development.py file in which I am overriding the following settings:
STATIC_URL = '/assets/'
STATICFILES_LOCATION = 'assets'
MEDIAFILES_LOCATION = 'media/'
MEDIA_URL = MEDIAFILES_LOCATION
STATIC_ROOT = os.path.join(BASE_DIR, "assets")
MEDIA_ROOT = os.path.join(BASE_DIR, "media")
And in my main urls.py file I have this condition:
if settings.DEBUG:
    urlpatterns += static(settings.STATIC_URL, document_root=settings.STATIC_ROOT)
    urlpatterns += static(settings.MEDIA_URL, document_root=settings.MEDIA_ROOT)
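One way to get this (a sketch of my own, not from the question) is to keep the S3 storage classes only in settings/base.py and switch back to Django's default storages in settings/development.py, so that runserver serves the files straight from disk while DEBUG is True; BASE_DIR is assumed to be defined as in a standard Django layout:
# settings/development.py (sketch; assumes BASE_DIR is defined in base.py)
import os
from .base import *  # noqa

DEBUG = True

# Use Django's default storages locally so changed CSS/JS are picked up
# without running collectstatic on every edit.
STATICFILES_STORAGE = 'django.contrib.staticfiles.storage.StaticFilesStorage'
DEFAULT_FILE_STORAGE = 'django.core.files.storage.FileSystemStorage'

STATIC_URL = '/assets/'
STATIC_ROOT = os.path.join(BASE_DIR, 'assets')

MEDIA_URL = '/media/'
MEDIA_ROOT = os.path.join(BASE_DIR, 'media')
With DEBUG = True the static() entries in the urls.py snippet above serve the local files, and collectstatic is only needed when deploying with the S3 settings.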

Any way to follow further requests in one web page?

I need to download a web page that makes heavy use of AJAX. Currently I am using Scrapy with AjaxCrawl enabled. After I write out the response and open it in a browser, there are still requests being initiated. Am I right that the downloaded response only includes the first-level requests? If so, how can I get Scrapy to include all sub-requests in one response?
In this case, 72 requests are sent when opening the page online, versus 23 when opening the saved copy offline.
Really appreciate it!
Here are the screenshots of the requests sent before and after the download:
[screenshot: requests sent before download]
[screenshot: requests sent after download]
Here is the code:
from scrapy.spiders import CrawlSpider
# SeedinvestDownloadItem is the project's item class (import path assumed here)
from ..items import SeedinvestDownloadItem

class SeedinvestSpider(CrawlSpider):
    name = "seedinvest"
    allowed_domains = ["seedinvest.com"]
    start_urls = (
        'https://www.seedinvest.com/caplinked/bridge',
    )

    def parse_start_url(self, response):
        item = SeedinvestDownloadItem()
        item['url'] = response.url
        item['html'] = response.body
        yield item
The code is as follows:
class SeedinvestSpider(CrawlSpider):
    name = "seedinvest"
    allowed_domains = ["seedinvest.com"]
    start_urls = (
        'https://www.seedinvest.com/startmart/pre.seed',
    )

    def parse_start_url(self, response):
        item = SeedinvestDownloadItem()
        item['url'] = response.url
        item['html'] = response.body
        yield item
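Scrapy on its own only fetches the initial HTML and does not execute the JavaScript that triggers the follow-up AJAX requests, so the saved body cannot contain their results. A common workaround is to render the page with a headless browser first; here is a rough sketch of mine (not from the question) using scrapy-splash, assuming a Splash instance is running locally and the scrapy-splash middlewares and SPLASH_URL are configured in settings.py as described in its README:
# Sketch assuming scrapy-splash is installed and a Splash instance is running,
# e.g. docker run -p 8050:8050 scrapinghub/splash
import scrapy
from scrapy_splash import SplashRequest

class SeedinvestSplashSpider(scrapy.Spider):
    name = "seedinvest_splash"
    allowed_domains = ["seedinvest.com"]
    start_urls = ['https://www.seedinvest.com/caplinked/bridge']

    def start_requests(self):
        for url in self.start_urls:
            # Wait a couple of seconds so the AJAX-loaded content is in the DOM.
            yield SplashRequest(url, self.parse, args={'wait': 2})

    def parse(self, response):
        yield {
            'url': response.url,
            'html': response.body,  # rendered HTML after JavaScript/AJAX ran
        }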