Run Scrapy on a set of a hundred-plus URLs - scrapy

I need to download the CPU and GPU data for a set of phones from GSMArena. As a first step, I downloaded the URLs of those phones by running Scrapy and deleted the unnecessary items.
The code for this is below.
# -*- coding: utf-8 -*-
from scrapy.selector import Selector
from scrapy import Spider
from gsmarena_data.items import gsmArenaDataItem


class MobileInfoSpider(Spider):
    name = "mobile_info"
    allowed_domains = ["gsmarena.com"]
    start_urls = (
        # 'http://www.gsmarena.com/samsung-phones-f-9-10.php',
        # 'http://www.gsmarena.com/apple-phones-48.php',
        # 'http://www.gsmarena.com/microsoft-phones-64.php',
        # 'http://www.gsmarena.com/nokia-phones-1.php',
        # 'http://www.gsmarena.com/sony-phones-7.php',
        # 'http://www.gsmarena.com/lg-phones-20.php',
        # 'http://www.gsmarena.com/htc-phones-45.php',
        # 'http://www.gsmarena.com/motorola-phones-4.php',
        # 'http://www.gsmarena.com/huawei-phones-58.php',
        # 'http://www.gsmarena.com/lenovo-phones-73.php',
        # 'http://www.gsmarena.com/xiaomi-phones-80.php',
        # 'http://www.gsmarena.com/acer-phones-59.php',
        # 'http://www.gsmarena.com/asus-phones-46.php',
        # 'http://www.gsmarena.com/oppo-phones-82.php',
        # 'http://www.gsmarena.com/blackberry-phones-36.php',
        # 'http://www.gsmarena.com/alcatel-phones-5.php',
        # 'http://www.gsmarena.com/xolo-phones-85.php',
        # 'http://www.gsmarena.com/lava-phones-94.php',
        # 'http://www.gsmarena.com/micromax-phones-66.php',
        # 'http://www.gsmarena.com/spice-phones-68.php',
        'http://www.gsmarena.com/gionee-phones-92.php',
    )

    def parse(self, response):
        hxs = Selector(response)
        phone_listings = hxs.css('.makers')
        for phone_listing in phone_listings:
            # create a fresh item for each listing instead of reusing one item
            phone = gsmArenaDataItem()
            phone['model'] = phone_listing.xpath("ul/li/a/strong/text()").extract()
            # '@href' (not '#href') extracts the link attribute
            phone['link'] = phone_listing.xpath("ul/li/a/@href").extract()
            yield phone
Now I need to run Scrapy on that set of URLs to get the CPU and GPU data. All of that info is available under the CSS selector ".ttl".
Kindly guide me on how to loop Scrapy over the set of URLs and output the data in a single CSV or JSON file. I'm well aware of how to create items and use CSS selectors; I need help with how to loop over those hundred-plus pages.
I have a list of urls like:
www.gsmarena.com/samsung_galaxy_s5_cdma-6338.php
www.gsmarena.com/samsung_galaxy_s5-6033.php
www.gsmarena.com/samsung_galaxy_core_lte_g386w-6846.php
www.gsmarena.com/samsung_galaxy_core_lte-6099.php
www.gsmarena.com/acer_iconia_one_8_b1_820-7217.php
www.gsmarena.com/acer_iconia_tab_a3_a20-7136.php
www.gsmarena.com/microsoft_lumia_640_dual_sim-7082.php
www.gsmarena.com/microsoft_lumia_532_dual_sim-6951.php
which are the links to the phone description pages on GSMArena.
Now I need to download the CPU and GPU info for the 100 models I have.
I have already extracted the URLs of those 100 models for which the data is required.
The spider written for this is:
from scrapy.selector import Selector
from scrapy import Spider
from gsmarena_data.items import gsmArenaDataItem


class MobileInfoSpider(Spider):
    name = "cpu_gpu_info"
    allowed_domains = ["gsmarena.com"]
    start_urls = (
        "http://www.gsmarena.com/microsoft_lumia_435_dual_sim-6949.php",
        "http://www.gsmarena.com/microsoft_lumia_435-6942.php",
        "http://www.gsmarena.com/microsoft_lumia_535_dual_sim-6792.php",
        "http://www.gsmarena.com/microsoft_lumia_535-6791.php",
    )

    def parse(self, response):
        hxs = Selector(response)
        # '.ttl' matches the spec-title cells on the phone detail page
        cpu_gpu = hxs.css('.ttl')
        for entry in cpu_gpu:
            phone = gsmArenaDataItem()
            # these XPaths were carried over from the listings spider and
            # still need to be adapted to the detail-page markup
            phone['cpu'] = entry.xpath("ul/li/a/strong/text()").extract()
            phone['gpu'] = entry.xpath("ul/li/a/@href").extract()
            yield phone
If I could somehow run the spider on the URLs for which I want to extract this data, I could get the required data in a single CSV file.

I think you need information from every vendor. If so, you don't have to put those hundreds of URLs in start_urls; alternatively, you can use this link as the start URL, and then in parse() you can extract those URLs programmatically and process what you want.
This answer will help you do so.
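For illustration, here is a minimal sketch of that idea, assuming you start from the vendor listing pages and follow each phone link into a second callback; the parse_phone name and the detail-page selector are assumptions, not taken from the original post:
from scrapy import Spider, Request
from scrapy.selector import Selector
from gsmarena_data.items import gsmArenaDataItem


class VendorToPhoneSpider(Spider):
    name = "vendor_to_phone"
    allowed_domains = ["gsmarena.com"]
    start_urls = (
        'http://www.gsmarena.com/gionee-phones-92.php',
        # add the other vendor listing pages here
    )

    def parse(self, response):
        # follow every phone link found on the vendor listing page
        for href in Selector(response).css('.makers').xpath("ul/li/a/@href").extract():
            yield Request(response.urljoin(href), callback=self.parse_phone)

    def parse_phone(self, response):
        phone = gsmArenaDataItem()
        phone['link'] = response.url
        # '.ttl' is the selector the question reports for the spec titles;
        # it still needs to be narrowed down to the CPU and GPU rows
        phone['cpu'] = Selector(response).css('.ttl').extract()
        yield phone
Running the spider with scrapy crawl vendor_to_phone -o phones.csv (or -o phones.json) then collects every yielded item into a single output file, which also covers the single-CSV part of the question.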

Related

How do I add a directory of .wav files to the Kedro data catalogue?

This is my first time trying to use the Kedro package.
I have a list of .wav files in an s3 bucket, and I'm keen to know how I can have them available within the Kedro data catalog.
Any thoughts?
I don't believe there's currently a dataset format that handles .wav files. You'll need to build a custom dataset that uses something like Wave - not as much work as it sounds!
This will enable you to do something like this in your catalog:
dataset:
  type: my_custom_path.WaveDataSet
  filepath: path/to/individual/wav_file.wav # this can be an s3:// url
and you can then access your WAV data natively within your Kedro pipeline. You can do this for each .wav file you have.
If you want to be able to access a whole folder's worth of .wav files, you might want to explore the notion of a "wrapper" dataset like the PartitionedDataSet, whose usage guide can be found in the documentation.
This worked:
import pandas as pd
from pathlib import Path, PurePosixPath

from kedro.io import AbstractDataSet, PartitionedDataSet


class WavFile(AbstractDataSet):
    '''Used to load a .wav file'''

    def __init__(self, filepath):
        self._filepath = PurePosixPath(filepath)

    def _load(self) -> pd.DataFrame:
        # load_wav is a user-supplied helper (e.g. wrapping the wave module
        # or scipy.io.wavfile) that reads the audio data
        df = pd.DataFrame({'file': [self._filepath],
                           'data': [load_wav(self._filepath)]})
        return df

    def _save(self, df: pd.DataFrame) -> None:
        df.to_csv(str(self._filepath))

    def _exists(self) -> bool:
        return Path(self._filepath.as_posix()).exists()

    def _describe(self):
        return dict(filepath=self._filepath)


class WavFiles(PartitionedDataSet):
    '''Replaces the PartitionedDataSet.load() method to return a DataFrame.'''

    def load(self) -> pd.DataFrame:
        '''Returns a single DataFrame built from all partitions'''
        dict_of_data = super().load()
        df = pd.concat(
            [delayed() for delayed in dict_of_data.values()]
        )
        return df


my_partitioned_dataset = WavFiles(
    path="path/to/folder/of/wav/files/",
    dataset=WavFile,
)

my_partitioned_dataset.load()

How do I tell the spider to stop requesting after n failed requests?

import scrapy


class MySpider(scrapy.Spider):
    start_urls = []

    def __init__(self, **kwargs):
        super(MySpider, self).__init__(**kwargs)
        for i in range(1, 1000):
            # str(i) is needed; concatenating an int to a str raises a TypeError
            self.start_urls.append("some url" + str(i))

    def parse(self, response):
        print(response)
Here we queue 1000 URLs in the __init__ function, but I want to stop making all those requests if they fail or return something undesirable. How do I tell the spider to stop making requests, say, after 10 failed requests?
You might want to set CLOSESPIDER_ERRORCOUNT to 10 in that case. It probably doesn't account for failed requests only, though. Alternatively, you might set HTTPERROR_ALLOWED_CODES so that even the error responses (failed requests) reach your callbacks, and implement your own failed-request counter inside the spider. Then, when the counter goes above the threshold, you raise the CloseSpider exception yourself.
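A minimal sketch of that second approach, assuming the failures you care about surface as HTTP error status codes (the status list and the threshold of 10 are illustrative values):
import scrapy
from scrapy.exceptions import CloseSpider


class MySpider(scrapy.Spider):
    name = "my_spider"
    # let error responses reach parse() instead of being filtered out
    custom_settings = {"HTTPERROR_ALLOWED_CODES": [404, 500, 502, 503]}
    max_failures = 10  # assumed threshold

    def __init__(self, **kwargs):
        super(MySpider, self).__init__(**kwargs)
        self.failed_count = 0
        self.start_urls = ["some url" + str(i) for i in range(1, 1000)]

    def parse(self, response):
        if response.status >= 400:
            self.failed_count += 1
            if self.failed_count >= self.max_failures:
                # stops scheduling new requests and shuts the spider down
                raise CloseSpider("too many failed requests")
            return
        print(response)
Requests that fail at the network level (DNS errors, timeouts) never reach parse(), so those would have to be counted in a Request errback instead.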

Pull imdbIDs for titles on search list

Would it be possible to get all the IMDb IDs for titles that meet a search criteria (such as number of votes, language, release year, etc)?
My priority is to compile a list of all the IMDb IDs that are classified as a feature film and have over 25,000 votes (i.e., those eligible to appear on the Top 250 list), as it appears here. At the time of this posting, there are 4,296 films that meet those criteria.
(If you are unfamiliar with IMDb IDs: it is a unique 7-digit code associated with every film/person/character/etc in the database. For instance, for the movie "Drive" (2011), the IMDb ID is "0780504".)
However, in the future, it would be helpful to set the search criteria as I see fit, as I can when typing in the url address (with &num_votes=##, &year=##, &title_type=##, ...)
I have been using IMDBpy with great success to pull information on individual movie titles and would love if this search feature I describe were accessible through that library.
Until now, I have been generating random 7-digit strings and testing whether they meet my criteria, but this will be inefficient moving forward because I waste processing time on superfluous IDs.
from imdb import IMDb, IMDbError
import random

i = IMDb(accessSystem='http')

movies = []
for _ in range(11000):
    randID = str(random.randint(0, 7221897)).zfill(7)
    movies.append(randID)

matching_ids = []
for m in movies:
    try:
        movie = i.get_movie(m)
    except IMDbError as err:
        print(err)
        continue
    if str(movie) == '':
        continue
    kind = movie.get('kind')
    if kind != 'movie':
        continue
    votes = movie.get('votes')
    if votes is None:
        continue
    if votes >= 25000:
        # the snippet is truncated here in the original post; presumably
        # the matching ID gets recorded, e.g.:
        matching_ids.append(m)
Take a look at http://www.omdbapi.com/
You can use the API directly to search by title or ID.
In Python 3:
import urllib.request
urllib.request.urlopen("http://www.omdbapi.com/?apikey=27939b55&s=moana").read()
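The response is JSON, so the IMDb IDs can be pulled out of it directly. A short sketch, assuming the standard OMDb search response with its "Search" array of results:
import json
import urllib.request

# search OMDb by title; each result in the "Search" list carries an imdbID
with urllib.request.urlopen("http://www.omdbapi.com/?apikey=27939b55&s=moana&type=movie") as resp:
    data = json.loads(resp.read().decode("utf-8"))

imdb_ids = [result["imdbID"] for result in data.get("Search", [])]
print(imdb_ids)  # list of 'tt'-prefixed IDs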
I found a solution using Beautiful Soup, based on a tutorial written by Alexandru Olteanu.
Here is my code:
from requests import get
from bs4 import BeautifulSoup
import re
import math
from time import time, sleep
from random import randint
from IPython.core.display import clear_output
from warnings import warn

url = "http://www.imdb.com/search/title?num_votes=25000,&title_type=feature&view=simple&sort=num_votes,desc&page=1&ref_=adv_nxt"
response = get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)

# figure out how many result pages there are (50 titles per page)
num_films_text = html_soup.find_all('div', class_='desc')
num_films = re.search(r'of (\d.+) titles', str(num_films_text[0])).group(1)
num_films = int(num_films.replace(',', ''))
print(num_films)
num_pages = math.ceil(num_films / 50)
print(num_pages)

ids = []
start_time = time()
requests = 0

# For every page in the interval
for page in range(1, num_pages + 1):
    # Make a get request
    url = "http://www.imdb.com/search/title?num_votes=25000,&title_type=feature&view=simple&sort=num_votes,desc&page=" + str(page) + "&ref_=adv_nxt"
    response = get(url)
    # Pause the loop
    sleep(randint(8, 15))
    # Monitor the requests
    requests += 1
    sleep(randint(1, 3))
    elapsed_time = time() - start_time
    print('Request: {}; Frequency: {} requests/s'.format(requests, requests / elapsed_time))
    clear_output(wait=True)
    # Throw a warning for non-200 status codes
    if response.status_code != 200:
        warn('Request: {}; Status code: {}'.format(requests, response.status_code))
    # Break the loop if the number of requests is greater than expected
    if requests > num_pages:
        warn('Number of requests was greater than expected.')
        break
    # Parse the content of the request with BeautifulSoup
    page_html = BeautifulSoup(response.text, 'html.parser')
    # Select all the 50 movie containers from a single page
    movie_containers = page_html.find_all('div', class_='lister-item mode-simple')
    # Scrape the ID
    for i in range(len(movie_containers)):
        id = re.search(r'tt(\d+)/', str(movie_containers[i].a)).group(1)
        ids.append(id)

print(ids)

Run multiple spiders from a script in Scrapy

I am working on a Scrapy project and I want to run multiple spiders at a time.
This is my code for running the spiders from a script, but I am getting an error. How do I do this?
from spiders.DmozSpider import DmozSpider
from spiders.CraigslistSpider import CraigslistSpider
from scrapy import signals, log
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings

TO_CRAWL = [DmozSpider, CraigslistSpider]
RUNNING_CRAWLERS = []


def spider_closing(spider):
    """Activates on spider closed signal"""
    log.msg("Spider closed: %s" % spider, level=log.INFO)
    RUNNING_CRAWLERS.remove(spider)
    if not RUNNING_CRAWLERS:
        reactor.stop()


log.start(loglevel=log.DEBUG)
for spider in TO_CRAWL:
    settings = Settings()
    # crawl responsibly
    settings.set("USER_AGENT", "Kiran Koduru (+http://kirankoduru.github.io)")
    crawler = Crawler(settings)
    crawler_obj = spider()
    RUNNING_CRAWLERS.append(crawler_obj)
    # stop reactor when spider closes
    crawler.signals.connect(spider_closing, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(crawler_obj)
    crawler.start()

# blocks the process, so always keep it as the last statement
reactor.run()
Sorry not to answer the question itself, but just to bring scrapyd and Scrapinghub to your attention (at least for a quick test). reactor.run() (once you get it working) will run any number of Scrapy instances on a single CPU. Do you want that side effect? Even if you look at scrapyd's code, it doesn't run multiple instances in a single thread; it forks/spawns subprocesses.
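For reference, a rough sketch of what scheduling both spiders through a running scrapyd instance could look like; the project name myproject, the spider names, and the default localhost:6800 address are all assumptions:
import requests

# scrapyd's schedule.json endpoint starts one spider run per call;
# each run gets its own subprocess, so the spiders execute in parallel
for spider_name in ("dmoz", "craigslist"):  # the spiders' name attributes (assumed)
    resp = requests.post(
        "http://localhost:6800/schedule.json",
        data={"project": "myproject", "spider": spider_name},
    )
    print(resp.json())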
You need something like the code below. You can easily find it in the Scrapy docs :)
First utility you can use to run your spiders is
scrapy.crawler.CrawlerProcess. This class will start a Twisted reactor
for you, configuring the logging and setting shutdown handlers. This
class is the one used by all Scrapy commands.
# -*- coding: utf-8 -*-
import sys
import logging
import traceback

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from spiders.DmozSpider import DmozSpider
from spiders.CraigslistSpider import CraigslistSpider

SPIDER_LIST = [
    DmozSpider, CraigslistSpider
]

if __name__ == "__main__":
    try:
        # set up a single CrawlerProcess, schedule every spider on it, then start them
        process = CrawlerProcess(get_project_settings())
        for spider in SPIDER_LIST:
            process.crawl(spider)
        process.start()
    except Exception as e:
        exc_type, exc_obj, exc_tb = sys.exc_info()
        logging.info('Error on line {}'.format(sys.exc_info()[-1].tb_lineno))
        logging.info("Exception: %s" % str(traceback.format_exc()))
References:
http://doc.scrapy.org/en/latest/topics/practices.html

Scrapy high CPU usage

I have a very simple test spider which does no parsing. However, I'm passing a large number of URLs (500k) to the spider in the start_requests method and seeing very high (99-100%) CPU usage. Is this the expected behaviour? If so, how can I optimize it (perhaps by batching and using spider_idle)?
from scrapy import Spider, Request


class TestSpider(Spider):
    name = 'test_spider'
    allowed_domains = ['mydomain.com']

    def __init__(self, **kw):
        # call Spider's __init__ (the original skipped it via super(Spider, ...))
        super(TestSpider, self).__init__(**kw)
        urls_list = kw.get('urls')
        if urls_list:
            self.urls_list = urls_list

    def parse(self, response):
        pass

    def start_requests(self):
        with open(self.urls_list, 'rb') as urls:
            for url in urls:
                yield Request(url.strip(), self.parse)
I think the main problem here is that you are scraping too many links; try adding a Rule to avoid scraping links that don't contain what you want.
Scrapy provides really useful docs, check them out:
http://doc.scrapy.org/en/latest/topics/spiders.html
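Since the question mentions batching and spider_idle, here is a rough sketch of that idea: instead of yielding all 500k requests up front, feed them in chunks and top the queue up from the spider_idle signal. The batch size, file handling, and class name are illustrative assumptions:
import scrapy
from scrapy import signals
from scrapy.exceptions import DontCloseSpider


class BatchedSpider(scrapy.Spider):
    name = 'batched_spider'
    batch_size = 1000  # assumed chunk size

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(BatchedSpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_idle, signal=signals.spider_idle)
        return spider

    def __init__(self, urls=None, **kw):
        super(BatchedSpider, self).__init__(**kw)
        self.url_iter = open(urls) if urls else iter([])

    def start_requests(self):
        return self.next_batch()

    def next_batch(self):
        # pull at most batch_size URLs from the file iterator
        for _, line in zip(range(self.batch_size), self.url_iter):
            yield scrapy.Request(line.strip(), callback=self.parse)

    def spider_idle(self, spider):
        # when the scheduler drains, queue the next chunk instead of closing
        requests = list(self.next_batch())
        if requests:
            for request in requests:
                # newer Scrapy versions take only the request: engine.crawl(request)
                self.crawler.engine.crawl(request, self)
            raise DontCloseSpider

    def parse(self, response):
        pass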