How do I call Scrapy from an Airflow DAG?

My Scrapy project runs perfectly well with the 'scrapy crawl spider_1' command. How do I trigger it (or call the scrapy command) from an Airflow DAG?
with DAG(<args>) as dag:
    scrapy_task = PythonOperator(
        task_id='scrapy',
        python_callable= ?)

    task_2 = ()
    task_3 = ()
    ....

    scrapy_task >> [task_2, task_3, ...]

Run with BashOperator
https://airflow.apache.org/docs/apache-airflow/stable/howto/operator/bash.html
with DAG(<args>) as dag:
    scrapy_task = BashOperator(
        task_id='scrapy',
        bash_command='scrapy crawl spider_1')
If you're using a virtualenv, you may use the PythonVirtualenvOperator,
or, to reuse an existing environment, you can use source activate venv && scrapy crawl spider_1 as the bash_command (a sketch follows below).
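A minimal sketch of the virtualenv variant, assuming the Scrapy project lives in /opt/scrapy_project with its virtualenv at /opt/scrapy_project/venv (both paths are placeholders):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(dag_id='scrapy_bash_dag', start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    scrapy_task = BashOperator(
        task_id='scrapy',
        # cd into the project so scrapy.cfg is picked up, then activate the venv
        bash_command=(
            'cd /opt/scrapy_project && '
            'source venv/bin/activate && '
            'scrapy crawl spider_1'
        ),
    )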
Run with PythonOperator
From the Scrapy documentation: https://docs.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
process = CrawlerProcess(get_project_settings())
process.crawl('spider_1')
process.start() # the script will block here until the crawling is finished
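Wired into the DAG from the question, that snippet becomes the python_callable. A minimal sketch, assuming the Airflow worker can import your Scrapy project settings (run_spider and scrapy_python_dag are made-up names):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


def run_spider():
    # Runs the spider in-process and blocks until the crawl is finished.
    process = CrawlerProcess(get_project_settings())
    process.crawl('spider_1')
    process.start()


with DAG(dag_id='scrapy_python_dag', start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    scrapy_task = PythonOperator(
        task_id='scrapy',
        python_callable=run_spider,
    )

Note that the Twisted reactor cannot be restarted within one Python process, so this approach works best when each task run gets a fresh worker process; otherwise the BashOperator route is the safer choice.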

Related

Is there a way to run code after reactor.run() in scrapy?

I am working on a Scrapy API. One of my issues was that the Twisted reactor wasn't restartable. I fixed this by using CrawlerRunner instead of CrawlerProcess. My spider extracts links from a website and validates them. My issue is that if I add the validation code after reactor.run(), it doesn't work. This is my code:
import urllib.parse
from urllib.parse import urlparse

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from twisted.internet import reactor

list = set([])
list_validate = set([])

runner = CrawlerRunner()

class Crawler(CrawlSpider):
    name = "Crawler"
    start_urls = ['https://www.example.com']
    allowed_domains = ['www.example.com']
    rules = [Rule(LinkExtractor(), callback='parse_links', follow=True)]
    configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})

    def parse_links(self, response):
        base_url = urlparse(response.url).netloc  # domain of the current page
        href = response.xpath('//a/@href').getall()
        list.add(urllib.parse.quote(response.url, safe=':/'))
        for link in href:
            if base_url not in link:
                list.add(urllib.parse.quote(response.urljoin(link), safe=':/'))
        for link in list:
            if base_url in link:
                list_validate.add(link)

runner.crawl(Crawler)
reactor.run()
If I add the code that validates the links after reactor.run(), it doesn't get executed. And if I put the code before reactor.run(), nothing happens because the spider hasn't finished crawling all the links yet. What should I do? The code that validates the links is perfectly fine; I used it before and it works.
We can do that with d.addCallback(<callback_function>) and d.addErrback(<errback_function>):
...
runner = CrawlerRunner()
d = runner.crawl(MySpider)

def finished(result):
    print("finished :D")  # run your validation code here

def spider_error(err):
    print("Spider error :/")

d.addCallback(finished)
d.addErrback(spider_error)
d.addBoth(lambda _: reactor.stop())  # stop the reactor once the crawl has finished
reactor.run()
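Applied to the crawler above, the validation can be moved into its own function and chained onto the Deferred so it only runs after the crawl completes. A minimal sketch (validate_links and crawl_failed are made-up names; the real validation logic is whatever you already have):

runner = CrawlerRunner()

def validate_links(_):
    # `list` is fully populated at this point, because the crawl has finished.
    for link in list:
        list_validate.add(link)  # replace with your actual validation

def crawl_failed(failure):
    print("Spider error :/", failure)

d = runner.crawl(Crawler)
d.addCallback(validate_links)
d.addErrback(crawl_failed)
d.addBoth(lambda _: reactor.stop())
reactor.run()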
For your scraper API, you can use Klein.
Klein is a micro-framework for developing production-ready web services with Python. It is 'micro' in that it has an incredibly small API similar to Bottle and Flask.
...
import scrapy
from scrapy.crawler import CrawlerRunner
from klein import Klein

app = Klein()

@app.route('/')
async def hello(request):
    status = list()

    class TestSpider(scrapy.Spider):
        name = 'test'
        start_urls = [
            'https://quotes.toscrape.com/',
            'https://quotes.toscrape.com/page/2/',
            'https://quotes.toscrape.com/page/3/',
            'https://quotes.toscrape.com/page/4/'
        ]

        def parse(self, response):
            """
            parse
            """
            status.append(response.status)

    runner = CrawlerRunner()
    d = await runner.crawl(TestSpider)
    content = str(status)
    return content

@app.route('/h')
def index(request):
    return 'Index Page'

app.run('localhost', 8080)
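Once the Klein service is running, requesting the root endpoint triggers a crawl and returns the collected status codes. A quick check, assuming the server from the snippet above is running locally on port 8080:

import requests

# Expected to print something like "[200, 200, 200, 200]"
print(requests.get('http://localhost:8080/').text)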

How to add Proxy to Scrapy and Selenium Script

I would like to add a proxy to my script.
How do I do that? Do I have to use Selenium or Scrapy for it?
I think Scrapy makes the initial request, so it would make sense to handle the proxy there. But what exactly do I have to do?
Can you recommend a proxy list that works reliably?
This is my current script:
# -*- coding: utf-8 -*-
import scrapy
from scrapy import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
from selenium import webdriver
import re
import csv
from time import sleep

class PostsSpider(Spider):
    name = 'posts'
    allowed_domains = ['xyz']
    start_urls = ('xyz',)

    def parse(self, response):
        with open("urls.txt", "rt") as f:
            start_urls = [url.strip() for url in f.readlines()]
        for url in start_urls:
            self.driver = webdriver.Chrome(r'C:\webdrivers\chromedriver.exe')
            self.driver.get(url)
            try:
                self.driver.find_element_by_id('currentTab').click()
                sleep(3)
                self.logger.info('Sleeping for 5 sec.')
                self.driver.find_element_by_xpath('//*[@id="_blog-menu"]/div[2]/div/div[2]/a[3]').click()
                sleep(7)
                self.logger.info('Sleeping for 7 sec.')
            except NoSuchElementException:
                self.logger.info('Blog does not exist anymore')
            while True:
                try:
                    element = self.driver.find_element_by_id('last_item')
                    self.driver.execute_script("arguments[0].scrollIntoView(0, document.documentElement.scrollHeight-5);", element)
                    sleep(3)
                    self.driver.find_element_by_id('last_item').click()
                    sleep(7)
                except NoSuchElementException:
                    self.logger.info('No more tipps')
                    sel = Selector(text=self.driver.page_source)
                    allposts = sel.xpath('//*[@class="block media _feedPick feed-pick"]')
                    for post in allposts:
                        username = post.xpath('.//div[@class="col-sm-7 col-lg-6 no-padding"]/a/@title').extract()
                        publish_date = post.xpath('.//*[@class="bet-age text-muted"]/text()').extract()
                        yield {'Username': username,
                               'Publish date': publish_date}
                    self.driver.close()
                    break
One simple way is to set the http_proxy and https_proxy environment variables.
You could set them in your environment before starting your script, or maybe add this at the beginning of your script:
import os
os.environ['http_proxy'] = 'http://my/proxy'
os.environ['https_proxy'] = 'http://my/proxy'
For a list of publicly available proxies, you will find plenty if you just search Google.
You should also read about Scrapy's HttpProxyMiddleware to explore it fully; the different ways of using proxies are covered there, and a sketch follows below.
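A minimal sketch of both approaches, assuming a proxy reachable at http://1.2.3.4:8080 (a placeholder address):

import scrapy
from selenium import webdriver


class ProxiedSpider(scrapy.Spider):
    name = 'proxied'

    def start_requests(self):
        # Scrapy: HttpProxyMiddleware picks up the proxy from request.meta.
        yield scrapy.Request(
            'https://httpbin.org/ip',
            callback=self.parse,
            meta={'proxy': 'http://1.2.3.4:8080'},
        )

    def parse(self, response):
        self.logger.info(response.text)

        # Selenium: route the Chrome instance through the same proxy.
        options = webdriver.ChromeOptions()
        options.add_argument('--proxy-server=http://1.2.3.4:8080')
        driver = webdriver.Chrome(options=options)
        driver.get(response.url)
        driver.quit()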

Scrapy - CSS selector issue

I would like to get the link located in the href attribute of an a element. The URL is: https://www.drivy.com/location-voiture/antwerpen/bmw-serie-1-477429?address=Gare+d%27Anvers-Central&city_display_name=&country_scope=BE&distance=200&end_date=2019-05-20&end_time=18%3A30&latitude=51.2162&longitude=4.4209&start_date=2019-05-20&start_time=06%3A00
I'm searching for the href of this element:
<a class="car_owner_section" href="/users/2643273" rel="nofollow"></a>
When I enter response.css('a.car_owner_section::attr(href)').get() in the terminal I get nothing, yet the element exists, even in view(response).
Does anybody have a clue about this issue?
The site seems to be rendered with JavaScript; using Splash works perfectly.
Here is the code:
import scrapy
from scrapy_splash import SplashRequest

class ScrapyOverflow1(scrapy.Spider):
    name = "overflow1"

    def start_requests(self):
        url = 'https://www.drivy.com/location-voiture/antwerpen/bmw-serie-1-477429?address=Gare+d%27Anvers-Central&city_display_name=&country_scope=BE&distance=200&end_date=2019-05-20&end_time=18%3A30&latitude=51.2162&longitude=4.4209&start_date=2019-05-20&start_time=06%3A00'
        yield SplashRequest(url=url, callback=self.parse, args={'wait': 5})

    def parse(self, response):
        links = response.xpath('//a[@class="car_owner_section"]/@href').extract()
        print(links)
To use Splash, install Splash and scrapy-splash, and run sudo docker run -p 8050:8050 scrapinghub/splash
before running the spider. Here is a great article on installing and running Splash: article on scrapy-splash. Also add the middlewares to settings.py (also covered in the article); a sketch of the usual settings follows below.
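The settings.py additions usually look roughly like this, assuming Splash is running locally via the Docker command above:

# settings.py -- the usual scrapy-splash configuration
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'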

Scrapy doesn't crawl the whole site

I have fallen into a common trap and can't get out of it: my Scrapy spider is very lazy, so to speak; it only parses the start_urls. Code below:
import scrapy
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.item import Field
from scrapy.selector import Selector

class HabraPostSpider(scrapy.Spider):
    name = 'habrapost'
    allowed_domains = ['habrahabr.ru']
    start_urls = ['https://habrahabr.ru/interesting/']

    def parse(self, response):
        self.logger.info('A response from %s just arrived!', response.url)

    rules = (Rule(LinkExtractor()),
             Rule(LinkExtractor(allow=('/post/'),), callback='parse_post', follow=True))
I will be very happy if anybody can tell me how to fix my spider!
Reading between the lines, what I understand is that you want the crawler to follow every link it sees.
For that you have to use CrawlSpider instead of Spider
class HabraPostSpider(scrapy.spiders.CrawlSpider):
Check the documentation. A sketch of the reworked spider follows below.
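A minimal sketch of the reworked spider, assuming a parse_post callback that you still need to fill in with your own extraction logic:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class HabraPostSpider(CrawlSpider):
    name = 'habrapost'
    allowed_domains = ['habrahabr.ru']
    start_urls = ['https://habrahabr.ru/interesting/']

    # Rules must be a class attribute; CrawlSpider uses them to follow links.
    rules = (
        Rule(LinkExtractor(allow=('/post/',)), callback='parse_post', follow=True),
        Rule(LinkExtractor(), follow=True),
    )

    def parse_post(self, response):
        self.logger.info('A response from %s just arrived!', response.url)

Note that a CrawlSpider must not override parse(), which is why the callback is named parse_post.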

How to debug a Twisted application in PyCharm

I would like to debug a Twisted Application in PyCharm
from twisted.internet import defer
from twisted.application import service, internet
from txjason.netstring import JSONRPCServerFactory
from txjason import handler

class Example(handler.Handler):
    def __init__(self, who):
        self.who = who

    @handler.exportRPC("add")
    @defer.inlineCallbacks
    def _add(self, x, y):
        yield
        defer.returnValue(x + y)

    @handler.exportRPC()
    def whoami(self):
        return self.who

factory = JSONRPCServerFactory()
factory.addHandler(Example('foo'), namespace='bar')

application = service.Application("Example JSON-RPC Server")
jsonrpcServer = internet.TCPServer(7080, factory)
jsonrpcServer.setServiceParent(application)
I know how to run the app from the command line, but I can't work out how to start debugging it in PyCharm.
Create a new Run Configuration in PyCharm, under the "Python" section.
If you start this application using twistd, then configure the "Script" setting to point to that twistd, and the "script parameters" as you would have them on the command line. You'll probably want to include the --nodaemon option.
You should then be able to Run that under PyCharm, or set breakpoints and Debug it.
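For example, if the file above is saved as example.tac (a hypothetical name) and twistd is installed in your project's virtualenv, the Script field would point to that venv's twistd and the Script parameters would be --nodaemon --python example.tac, mirroring the twistd --nodaemon --python example.tac command you would otherwise run in a terminal.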