Scrapy: extracting URLs from a simple site

I'm trying to extract basic data from a simple ecommerce site, vapedonia.com. I can do it pretty easily by "reinventing the wheel" (mainly working on one big HTML string), but when I have to fit it into the Scrapy mold, it just does not work.
I first analyze the HTML and build my XPath expressions with a browser plugin. In the plugin everything looks fine, but when I put the expressions into my code (or even into the Scrapy shell), they don't work.
Here's the code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class MySpider(BaseSpider):
    name = "vapedonia"
    allowed_domains = ["vapedonia.com"]
    start_urls = ["https://www.vapedonia.com/23-e-liquidos"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        products = hxs.select("//div[@class='product-container clearfix']")
        for products in products:
            image = products.select("div[@class='center_block']/a/img/@src").extract()
            name = products.select("div[@class='center_block']/a/@title").extract()
            link = products.select("div[@class='right_block']/p[@class='s_title_block']/a/@href").extract()
            price = products.select("div[@class='right_block']/div[@class='content_price']/span[@class='price']").extract()
            print image, name, link, price
Here are the errors:
C:\Users\eric\Documents\Web Scraping\0 - Projets\Scrapy-\projects\craigslist_sample>scrapy crawl vapedonia
C:\Users\eric\Documents\Web Scraping\0 - Projets\Scrapy-\projects\craigslist_sample\craigslist_sample\spiders\test.py:1: ScrapyDeprecationWarning: Module `scrapy.spider` is deprecated, use `scrapy.spiders` instead
from scrapy.spider import BaseSpider
C:\Users\eric\Documents\Web Scraping\0 - Projets\Scrapy-\projects\craigslist_sample\craigslist_sample\spiders\test.py:6: ScrapyDeprecationWarning: craigslist_sample.spiders.test.MySpider inherits from deprecated class scrapy.spiders.BaseSpider, please inherit from scrapy.spiders.Spider. (warning only on first subclass, there may be others)
class MySpider(BaseSpider):
C:\Users\eric\Documents\Web Scraping\0 - Projets\Scrapy-\projects\craigslist_sample\craigslist_sample\spiders\test2.py:1: ScrapyDeprecationWarning: Module `scrapy.contrib.spiders` is deprecated, use `scrapy.spiders` instead
from scrapy.contrib.spiders import CrawlSpider, Rule
C:\Users\eric\Documents\Web Scraping\0 - Projets\Scrapy-\projects\craigslist_sample\craigslist_sample\spiders\test2.py:2: ScrapyDeprecationWarning: Module `scrapy.contrib.linkextractors` is deprecated, use `scrapy.linkextractors` instead
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
C:\Users\eric\Documents\Web Scraping\0 - Projets\Scrapy-\projects\craigslist_sample\craigslist_sample\spiders\test2.py:2: ScrapyDeprecationWarning: Module `scrapy.contrib.linkextractors.sgml` is deprecated, use `scrapy.linkextractors.sgml` instead
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
C:\Users\eric\Documents\Web Scraping\0 - Projets\Scrapy-\projects\craigslist_sample\craigslist_sample\spiders\test2.py:13: ScrapyDeprecationWarning: SgmlLinkExtractor is deprecated and will be removed in future releases. Please use scrapy.linkextractors.LinkExtractor
Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=('//a[@class="button next"]',)), callback="parse_items", follow= True),
C:\Users\eric\Documents\Web Scraping\0 - Projets\Scrapy-\projects\craigslist_sample\craigslist_sample\spiders\test4.py:15: ScrapyDeprecationWarning: SgmlLinkExtractor is deprecated and will be removed in future releases. Please use scrapy.linkextractors.LinkExtractor
Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=('//a[@class="button next"]',)), callback="parse_items", follow= True),
Traceback (most recent call last):
File "C:\Users\eric\Miniconda2\Scripts\scrapy-script.py", line 5, in <module>
sys.exit(scrapy.cmdline.execute())
File "C:\Users\eric\Miniconda2\lib\site-packages\scrapy\cmdline.py", line 148, in execute
cmd.crawler_process = CrawlerProcess(settings)
File "C:\Users\eric\Miniconda2\lib\site-packages\scrapy\crawler.py", line 243, in __init__
super(CrawlerProcess, self).__init__(settings)
File "C:\Users\eric\Miniconda2\lib\site-packages\scrapy\crawler.py", line 134, in __init__
self.spider_loader = _get_spider_loader(settings)
File "C:\Users\eric\Miniconda2\lib\site-packages\scrapy\crawler.py", line 330, in _get_spider_loader
return loader_cls.from_settings(settings.frozencopy())
File "C:\Users\eric\Miniconda2\lib\site-packages\scrapy\spiderloader.py", line 61, in from_settings
return cls(settings)
File "C:\Users\eric\Miniconda2\lib\site-packages\scrapy\spiderloader.py", line 25, in __init__
self._load_all_spiders()
File "C:\Users\eric\Miniconda2\lib\site-packages\scrapy\spiderloader.py", line 47, in _load_all_spiders
for module in walk_modules(name):
File "C:\Users\eric\Miniconda2\lib\site-packages\scrapy\utils\misc.py", line 71, in walk_modules
submod = import_module(fullpath)
File "C:\Users\eric\Miniconda2\lib\importlib\__init__.py", line 37, in import_module
__import__(name)
File "C:\Users\eric\Documents\Web Scraping\0 - Projets\Scrapy-\projects\craigslist_sample\craigslist_sample\spiders\test5.py", line 17
link = products.select("div[@class='right_block']/p[@class='s_title_block']/a/@href").extract()
^
IndentationError: unexpected indent
C:\Users\eric\Documents\Web Scraping\0 - Projets\Scrapy-\projects\craigslist_sample>
I don't know what the problem is, but I have several spiders coded in the spiders directory/folder. Maybe it's some kind of mix-up between the spiders' code.
Thanks.

When Scrapy runs, it scans all the spiders present in the project to find their names and run the one you specified. So if any spider has a syntax error, the whole crawl won't work:
File "C:\Users\eric\Documents\Web Scraping\0 - Projets\Scrapy-\projects\craigslist_sample\craigslist_sample\spiders\test5.py", line 17
link = products.select("div[@class='right_block']/p[@class='s_title_block']/a/@href").extract()
As you can see in the exception, the error is in your test5.py. Fix the indentation in that file, or comment it out if you don't need it. That should allow you to run your spider.
Edit-1: Mixing of Tabs and Spaces
Python depends on indentation, and indentation that looks identical on screen can differ in the actual code: different lines may mix tabs and spaces, which causes errors like this one. So set your editor to show tab and space characters, and convert all tabs to spaces.
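Once test5.py is fixed, it is also worth addressing the deprecation warnings above. A minimal sketch of the question's spider rewritten against the current Scrapy API (scrapy.Spider and response.xpath), keeping the question's XPath expressions unchanged, could look like this:
import scrapy

class VapedoniaSpider(scrapy.Spider):
    name = "vapedonia"
    allowed_domains = ["vapedonia.com"]
    start_urls = ["https://www.vapedonia.com/23-e-liquidos"]

    def parse(self, response):
        # Same XPath expressions as in the question, via the modern selector API.
        for product in response.xpath("//div[@class='product-container clearfix']"):
            yield {
                "image": product.xpath("div[@class='center_block']/a/img/@src").extract_first(),
                "name": product.xpath("div[@class='center_block']/a/@title").extract_first(),
                "link": product.xpath("div[@class='right_block']/p[@class='s_title_block']/a/@href").extract_first(),
                "price": product.xpath("div[@class='right_block']/div[@class='content_price']/span[@class='price']").extract_first(),
            }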

Related

Using scrapy 2.7 ModuleNotFoundError: No module named 'scrapy.squeue'

I run my Scrapy spider as a standalone script like this:
if __name__ == "__main__":
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    s = get_project_settings()
    process = CrawlerProcess(s)
    process.crawl(MySpider)
    process.start()
My scraper was consuming a huge amount of memory, so I thought of using these two custom settings:
SCHEDULER_DISK_QUEUE = "scrapy.squeue.PickleFifoDiskQueue"
SCHEDULER_MEMORY_QUEUE = "scrapy.squeue.FifoMemoryQueue"
But after adding these two custom settings, when I run my standalone spider I get this error:
Traceback (most recent call last):
File "/usr/local/lib/python3.9/dist-packages/twisted/internet/defer.py", line 1696, in _inlineCallbacks
result = context.run(gen.send, result)
File "/usr/local/lib/python3.9/dist-packages/scrapy/crawler.py", line 118, in crawl
yield self.engine.open_spider(self.spider, start_requests)
ModuleNotFoundError: No module named 'scrapy.squeue'
Any idea what the issue is?
ModuleNotFoundError: No module named 'scrapy.squeue'
You have a typo: the module is scrapy.squeues (plural), not scrapy.squeue:
SCHEDULER_DISK_QUEUE = "scrapy.squeues.PickleFifoDiskQueue"
SCHEDULER_MEMORY_QUEUE = "scrapy.squeues.FifoMemoryQueue"
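For example, a minimal sketch of the standalone script with the corrected module paths (the JOBDIR value is only an illustrative assumption; as far as I know the disk queue is only used when a job directory is configured):
if __name__ == "__main__":
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    s = get_project_settings()
    # Corrected module path: scrapy.squeues (plural), not scrapy.squeue
    s.set("SCHEDULER_DISK_QUEUE", "scrapy.squeues.PickleFifoDiskQueue")
    s.set("SCHEDULER_MEMORY_QUEUE", "scrapy.squeues.FifoMemoryQueue")
    # Assumption: requests are only serialized to disk when JOBDIR is set
    s.set("JOBDIR", "crawl_state")

    process = CrawlerProcess(s)
    process.crawl(MySpider)  # MySpider: the spider defined elsewhere in the script, as in the question
    process.start()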

AWS Notebook Instance is working but Lambda is not accepting the input

I developed an ANN tool using PyCharm/TensorFlow on my own computer. I uploaded the h5 and json files to Amazon SageMaker by creating a Notebook Instance, and I was finally able to successfully create an endpoint and make it work. The following code works in the Notebook Instance (Jupyter):
import json
import boto3
import numpy as np
import io
import sagemaker
from sagemaker.tensorflow.model import TensorFlowModel
client = boto3.client('runtime.sagemaker')
data = np.random.randn(1,6).tolist()
endpoint_name = 'sagemaker-tensorflow-**********'
response = client.invoke_endpoint(EndpointName=endpoint_name, Body=json.dumps(data))
response_body = response['Body']
print(response_body.read())
However, the problem occurs when I create a Lambda function and call the endpoint from there. The input should be a row of 6 features, that is, a 1-by-6 vector. I enter the following input into Lambda, {"data": "1,1,1,1,1,1"}, and it gives me the following error:
Traceback (most recent call last):
File "/var/task/lambda_function.py", line 20, in lambda_handler
Body=payload)
File "/var/runtime/botocore/client.py", line 316, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/var/runtime/botocore/client.py", line 635, in _make_api_call
raise error_class(parsed_response, operation_name)
I think the problem is that the input needs to be 1-by-6 instead of 6-by-1 and I don't know how to do that.
I assume the content type you specified is text/csv, so try out:
{"data": ["1,1,1,1,1,1"]}

Why does running multiple scrapy spiders through CrawlerProcess cause the spider_idle signal to fail?

I need to make thousands of requests that require a session token for authorization.
Queuing all the requests at once results in thousands of requests failing because the session token expires before the later requests are issued.
So, I am issuing a reasonable number of requests that will reliably complete before the session token expires.
When a batch of requests completes, the spider_idle signal is triggered.
If further requests are needed, the signal handler requests a new session token be used with the next batch of requests.
This works when running one spider normally, or one spider through CrawlerProcess.
However, the spider_idle signal fails with multiple spiders run through CrawlerProcess.
One spider will execute the spider_idle signal as expected, but the others fail with this exception:
2019-06-14 10:41:22 [scrapy.utils.signal] ERROR: Error caught on signal handler: <bound method ?.spider_idle of <SpideIdleTest None at 0x7f514b33c550>>
Traceback (most recent call last):
File "/home/loren/.virtualenv/spider_idle_test/local/lib/python2.7/site-packages/scrapy/utils/signal.py", line 30, in send_catch_log
*arguments, **named)
File "/home/loren/.virtualenv/spider_idle_test/local/lib/python2.7/site-packages/pydispatch/robustapply.py", line 55, in robustApply
return receiver(*arguments, **named)
File "fails_with_multiple_spiders.py", line 25, in spider_idle
spider)
File "/home/loren/.virtualenv/spider_idle_test/local/lib/python2.7/site-packages/scrapy/core/engine.py", line 209, in crawl
"Spider %r not opened when crawling: %s" % (spider.name, request)
I created a repo that shows the spider_idle behaving as expected with a single spider, and failing with multiple spiders.
https://github.com/loren-magnuson/scrapy_spider_idle_test
Here is the version that shows the failures:
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy import Request, signals
from scrapy.exceptions import DontCloseSpider
from scrapy.xlib.pydispatch import dispatcher


class SpiderIdleTest(scrapy.Spider):
    custom_settings = {
        'CONCURRENT_REQUESTS': 1,
        'DOWNLOAD_DELAY': 2,
    }

    def __init__(self):
        dispatcher.connect(self.spider_idle, signals.spider_idle)
        self.idle_retries = 0

    def spider_idle(self, spider):
        self.idle_retries += 1
        if self.idle_retries < 3:
            self.crawler.engine.crawl(
                Request('https://www.google.com',
                        self.parse,
                        dont_filter=True),
                spider)
            raise DontCloseSpider("Stayin' alive")

    def start_requests(self):
        yield Request('https://www.google.com', self.parse)

    def parse(self, response):
        print(response.css('title::text').extract_first())


process = CrawlerProcess()
process.crawl(SpiderIdleTest)
process.crawl(SpiderIdleTest)
process.crawl(SpiderIdleTest)
process.start()
I tried running multiple spiders concurrently using billiard as an alternative approach.
After getting the spiders running concurrently using billiard's Process, the spider_idle signal still failed, but with a different exception.
Traceback (most recent call last):
File "/home/louis_powersports/.virtualenv/spider_idle_test/lib/python3.6/site-packages/scrapy/utils/signal.py", line 30, in send_catch_log
*arguments, **named)
File "/home/louis_powersports/.virtualenv/spider_idle_test/lib/python3.6/site-packages/pydispatch/robustapply.py", line 55, in robustApply
return receiver(*arguments, **named)
File "test_with_billiard_process.py", line 25, in spider_idle
self.crawler.engine.crawl(
AttributeError: 'SpiderIdleTest' object has no attribute 'crawler'
This led me to try changing:
self.crawler.engine.crawl(
    Request('https://www.google.com',
            self.parse,
            dont_filter=True),
    spider)
to
spider.crawler.engine.crawl(
    Request('https://www.google.com',
            self.parse,
            dont_filter=True),
    spider)
which works.
Billiard is not necessary; the original attempt, based on Scrapy's documentation, works after making the change above.
Working version of the original:
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy import Request, signals
from scrapy.exceptions import DontCloseSpider
from scrapy.xlib.pydispatch import dispatcher


class SpiderIdleTest(scrapy.Spider):
    custom_settings = {
        'CONCURRENT_REQUESTS': 1,
        'DOWNLOAD_DELAY': 2,
    }

    def __init__(self):
        dispatcher.connect(self.spider_idle, signals.spider_idle)
        self.idle_retries = 0

    def spider_idle(self, spider):
        self.idle_retries += 1
        if self.idle_retries < 3:
            spider.crawler.engine.crawl(
                Request('https://www.google.com',
                        self.parse,
                        dont_filter=True),
                spider)
            raise DontCloseSpider("Stayin' alive")

    def start_requests(self):
        yield Request('https://www.google.com', self.parse)

    def parse(self, response):
        print(response.css('title::text').extract_first())


process = CrawlerProcess()
process.crawl(SpiderIdleTest)
process.crawl(SpiderIdleTest)
process.crawl(SpiderIdleTest)
process.start()

py2exe setup.py not working

I have a small program that uses pandas and sqlalchemy, declared in my main.py as:
import pandas as pd
from sqlalchemy import create_engine
This is my complete setup.py:
from distutils.core import setup
import py2exe
from glob import glob
data_files = [("Microsoft.VC90.CRT", glob(r'C:\Users\Flavio\Documents\Python_dll\*.*'))]
opts = {
    "py2exe": {
        "packages": ["pandas", "sqlalchemy"]
    }
}

setup(
    data_files=data_files,
    options=opts,
    console=['main.py']
)
And I'm using this command in the terminal:
python setup.py py2exe
But when I run main.exe, it opens the terminal, starts executing the code, and then suddenly closes the window.
When I run it from the terminal, this is the error:
C:\Users\Flavio\Documents\python\python\untitled\dist>main.exe
Please add a valid tradefile date as yyyymmdd: 20150914
Traceback (most recent call last):
File "main.py", line 11, in <module>
File "C:\Users\Flavio\Anaconda3\lib\site-packages\sqlalchemy\engine\__init__.p
y", line 386, in create_engine
return strategy.create(*args, **kwargs)
File "C:\Users\Flavio\Anaconda3\lib\site-packages\sqlalchemy\engine\strategies
.py", line 75, in create
dbapi = dialect_cls.dbapi(**dbapi_args)
File "C:\Users\Flavio\Anaconda3\lib\site-packages\sqlalchemy\connectors\pyodbc
.py", line 51, in dbapi
return __import__('pyodbc')
ImportError: No module named 'pyodbc'
Without knowing what your program does, I would try the following first:
Open a command window and run your .exe from there.
The window will not close, and any error messages (if any) will be displayed.
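Since the traceback above already ends in ImportError: No module named 'pyodbc', one likely follow-up (an assumption based on that error, not something I can verify without the program) is to tell py2exe to bundle pyodbc explicitly, because sqlalchemy imports it dynamically and py2exe cannot detect that:
from distutils.core import setup
import py2exe
from glob import glob

data_files = [("Microsoft.VC90.CRT", glob(r'C:\Users\Flavio\Documents\Python_dll\*.*'))]

opts = {
    "py2exe": {
        "packages": ["pandas", "sqlalchemy"],
        # pyodbc is loaded at runtime via __import__, so list it explicitly
        "includes": ["pyodbc"],
    }
}

setup(
    data_files=data_files,
    options=opts,
    console=['main.py']
)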

python 3.3 and beautifulsoup4-4.3.2

from bs4 import BeautifulSoup
import urllib
import socket
searchurl = "http://suchen.mobile.de/auto/search.html?scopeId=C&isSearchRequest=true&sortOption.sortBy=price.consumerGrossEuro"
f = urllib.request.urlopen(searchurl, None, None)
html = f.read()
soup = BeautifulSoup(html)
for link in soup.find_all("div", "listEntry "):
    print(link)
Traceback (most recent call last):
File "C:\Users\taha\Documents\worksapcetoon\Parser\com\test__init__.py", line 6, in
f = urllib.request.urlopen(searchurl, None, None)
AttributeError: 'module' object has no attribute 'request'
For the urllib.request docs, look here:
http://docs.python.org/py3k/library/urllib.request.html?highlight=urllib#urllib.request.urlopen
Use import urllib.request instead of import urllib.
Replace:
import urllib
with:
import urllib.request
Python does not automatically include submodules in a package's namespace; the programmer has to import them explicitly. Some packages do it themselves: for example, the 'os' package puts 'path' in its own namespace, so 'import os' is enough to use both 'os' and 'os.path' functions. But in general, every submodule import needs to be explicit.
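Put together, a corrected version of the script from the question might look like this (the explicit "html.parser" argument is optional; it just avoids a bs4 warning):
from bs4 import BeautifulSoup
import urllib.request  # explicit submodule import, required in Python 3

searchurl = "http://suchen.mobile.de/auto/search.html?scopeId=C&isSearchRequest=true&sortOption.sortBy=price.consumerGrossEuro"

f = urllib.request.urlopen(searchurl, None, None)
html = f.read()
soup = BeautifulSoup(html, "html.parser")
for link in soup.find_all("div", "listEntry "):
    print(link)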