How to access Scrapy httpcache middleware data directly - scrapy

How can I access the data stored by Scrapy's httpcache middleware directly?
Something like this, in pseudocode:
URL = 'http://scrapedsite.com/category1/item1'
print(retrieveRawHtml(URL))

from scrapy.utils.response import open_in_browser
from scrapy.http import HtmlResponse
url = 'http://scrapedsite.com/category1/item1'
body = '<html>hello</html>'
# a str body needs an explicit encoding; a bytes body does not
response = HtmlResponse(url, body=body, encoding='utf-8')
open_in_browser(response)
Or, from your callback:
def parse_cb(self, response):
    from scrapy.utils.response import open_in_browser
    open_in_browser(response)
If caching is turned on (HTTPCACHE_ENABLED = True in the settings), the response will be pulled from the cache.
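If the goal is to read the cached HTML without running a spider at all, one option is to read the cache directory itself. The following is a minimal sketch, not Scrapy's own API: it assumes the default FilesystemCacheStorage backend with HTTPCACHE_GZIP disabled, and it assumes each cached request lives in its own directory holding a `pickled_meta` file (a pickled dict with a "url" key) and a `response_body` file. The helper name `read_cached_body` and the cache path in the usage note are hypothetical.

```python
import pickle
from pathlib import Path
from typing import Optional


def read_cached_body(cache_root: str, url: str) -> Optional[bytes]:
    """Return the cached raw response body for `url`, or None if not cached."""
    # Assumption: one directory per cached request, each containing
    # `pickled_meta` (pickled dict with a "url" key) and `response_body`.
    for meta_path in Path(cache_root).rglob("pickled_meta"):
        with meta_path.open("rb") as f:
            meta = pickle.load(f)
        if meta.get("url") == url:
            return (meta_path.parent / "response_body").read_bytes()
    return None
```

A call might look like `read_cached_body(".scrapy/httpcache/myspider", "http://scrapedsite.com/category1/item1")`. The body is returned as stored, so it may still be compressed if the server sent it that way. Because the on-disk layout is an implementation detail, check it against your Scrapy version before relying on it.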

Related

Passing google login cookies from scrapy splash to selenium

I want to sign in to my Google Account, enable a Google API, and extract the developer's key. My main task is to automate this process.
Google normally blocks sign-in from an automated browser, but I did manage to sign in using Scrapy Splash.
import re
import time
import base64
import scrapy
from scrapy_splash import SplashRequest
from selenium import webdriver
class GoogleScraperSpider(scrapy.Spider):
    name = 'google_scraper'
    script = """
    function main(splash)
        splash:init_cookies(splash.args.cookies)
        local url = splash.args.url
        local youtube_url = "https://console.cloud.google.com/apis/library/youtube.googleapis.com"
        assert(splash:go(url))
        assert(splash:wait(1))
        splash:set_viewport_full()
        local search_input = splash:select('.whsOnd.zHQkBf')
        search_input:send_text("xxxxxxxxxxx@gmail.com")
        assert(splash:wait(1))
        splash:runjs("document.getElementById('identifierNext').click()")
        splash:wait(5)
        local search_input = splash:select('.whsOnd.zHQkBf')
        search_input:send_text("xxxxxxxx")
        assert(splash:wait(1))
        splash:runjs("document.getElementById('passwordNext').click()")
        splash:wait(5)
        return {
            cookies = splash:get_cookies(),
            html = splash:html(),
            png = splash:png()
        }
    end
    """
    def start_requests(self):
        url = 'https://accounts.google.com'
        yield SplashRequest(url, self.parse, endpoint='execute', session_id="1", args={'lua_source': self.script})
    def parse(self, response):
        imgdata = base64.b64decode(response.data['png'])
        with open('image.png', 'wb') as file:
            file.write(imgdata)
        cookies = response.data.get("cookies")
        driver = webdriver.Chrome("./chromedriver")
        for cookie in cookies:
            if "." in cookie["domain"][:1]:
                url = f"https://www{cookie['domain']}"
            else:
                url = f"https://{cookie['domain']}"
            driver.get(url)
            driver.add_cookie(cookie)
        driver.get("https://console.cloud.google.com/apis/library/youtube.googleapis.com")
        time.sleep(5)
In the parse function I'm trying to retrieve those cookies and add them to my chromedriver to bypass the login process, so that I can move on to enabling the API and extracting the key. However, I always end up on the login page in the chromedriver.
Your help would be most appreciated.
Thanks.
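Try using pickle to save the cookies instead. Log in by hand in a Selenium-driven browser from any Python console, then save the cookies with this code:
import pickle
# `driver` is the Selenium WebDriver you logged in with
input('press enter when logged in')
with open('cookies.pkl', 'wb') as f:  # pickle needs binary mode
    pickle.dump(driver.get_cookies(), f)
This gives you a cookies.pkl file with the Google login data. Load it in your code using:
import pickle
with open('cookies.pkl', 'rb') as f:
    cookies = pickle.load(f)
# the driver must already be on a page of the cookies' domain
for cookie in cookies:
    driver.add_cookie(cookie)
driver.refresh()
# rest of work here...
Refreshing the driver makes the loaded cookies take effect.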
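If you would rather keep passing the Splash cookies straight to Selenium, another thing worth checking is the cookie format. This is a sketch under assumptions, not a confirmed fix for the Google login: it assumes `splash:get_cookies()` returns HAR-style dicts whose expiry is an ISO 8601 string under "expires", and that Selenium's `add_cookie()` wants integer epoch seconds under "expiry". The helper names `har_cookie_to_selenium` and `url_for_cookie_domain` are hypothetical.

```python
from datetime import datetime, timezone

# Keys that Selenium's add_cookie() is assumed to accept unchanged.
SELENIUM_COOKIE_KEYS = {"name", "value", "path", "domain", "secure", "httpOnly", "sameSite"}


def har_cookie_to_selenium(cookie: dict) -> dict:
    """Convert one HAR-style cookie dict into a Selenium-style cookie dict."""
    converted = {k: v for k, v in cookie.items() if k in SELENIUM_COOKIE_KEYS}
    expires = cookie.get("expires")
    if expires:
        # HAR stores an ISO 8601 string; Selenium expects integer epoch seconds.
        parsed = datetime.fromisoformat(expires.replace("Z", "+00:00"))
        if parsed.tzinfo is None:
            # Assumption: a timestamp without an offset is in UTC.
            parsed = parsed.replace(tzinfo=timezone.utc)
        converted["expiry"] = int(parsed.timestamp())
    return converted


def url_for_cookie_domain(domain: str) -> str:
    """Build a URL to visit before adding a cookie for `domain`."""
    # A leading dot (".google.com") means the cookie also covers subdomains.
    return f"https://www{domain}" if domain.startswith(".") else f"https://{domain}"
```

In the question's loop this would become `driver.get(url_for_cookie_domain(cookie["domain"]))` followed by `driver.add_cookie(har_cookie_to_selenium(cookie))`.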

Is there a way to run code after reactor.run() in scrapy?

I am working on a Scrapy API. One of my issues was that the Twisted reactor isn't restartable; I fixed this by using CrawlerRunner instead of CrawlerProcess. My spider extracts links from a website and validates them. My issue is that if I add the validation code after reactor.run(), it doesn't run. This is my code:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from twisted.internet import reactor
import urllib.parse
list = set([])
list_validate = set([])
runner = CrawlerRunner()
class Crawler(CrawlSpider):
    name = "Crawler"
    start_urls = ['https://www.example.com']
    allowed_domains = ['www.example.com']
    rules = [Rule(LinkExtractor(), callback='parse_links', follow=True)]
    configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
    def parse_links(self, response):
        # `url` (the site's base URL) is not defined in the snippet as posted
        base_url = url
        href = response.xpath('//a/@href').getall()
        list.add(urllib.parse.quote(response.url, safe=':/'))
        for link in href:
            if base_url not in link:
                list.add(urllib.parse.quote(response.urljoin(link), safe=':/'))
        for link in list:
            if base_url in link:
                list_validate.add(link)
runner.crawl(Crawler)
reactor.run()
If I add the code that validates the links after reactor.run(), it doesn't get executed. And if I put the code before reactor.run(), nothing happens because the spider hasn't finished crawling all the links yet. What should I do? The code that validates the links is perfectly fine; I have used it before and it works.
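You can do that by attaching callbacks to the Deferred that runner.crawl() returns, using d.addCallback(<callback_function>) and d.addErrback(<errback_function>):
...
runner = CrawlerRunner()
d = runner.crawl(MySpider)
def finished(result):
    # runs once the crawl has finished successfully
    print("finished :D")
def spider_error(err):
    # runs if the crawl fails
    print("Spider error :/")
d.addCallback(finished)
d.addErrback(spider_error)
# stop the reactor after either outcome so that reactor.run() returns
d.addBoth(lambda _: reactor.stop())
reactor.run()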
For your ScraperApi you can use Klein.
Klein is a micro-framework for developing production-ready web services with Python. It is 'micro' in that it has an incredibly small API similar to Bottle and Flask.
...
import scrapy
from scrapy.crawler import CrawlerRunner
from klein import Klein
app=Klein()
@app.route('/')
async def hello(request):
    status=list()
    class TestSpider(scrapy.Spider):
        name='test'
        start_urls=[
            'https://quotes.toscrape.com/',
            'https://quotes.toscrape.com/page/2/',
            'https://quotes.toscrape.com/page/3/',
            'https://quotes.toscrape.com/page/4/'
        ]
        def parse(self,response):
            """
            parse
            """
            status.append(response.status)
    runner=CrawlerRunner()
    d= await runner.crawl(TestSpider)
    content=str(status)
    return content
@app.route('/h')
def index(request):
    return 'Index Page'
app.run('localhost',8080)
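Whichever way the crawl is driven, the link bookkeeping in the question's parse_links can be kept as a plain function that is easy to test without a reactor. The following is a stdlib-only sketch of one reading of that intent (separating links on the crawled host from links elsewhere); the function name `classify_links` and the `base_host` parameter are hypothetical and not part of the original code.

```python
from urllib.parse import quote, urljoin, urlparse


def classify_links(page_url: str, hrefs: list, base_host: str) -> tuple:
    """Split the links found on a page into (internal, external) sets."""
    internal, external = set(), set()
    for href in hrefs:
        # Resolve relative links against the page, then percent-encode
        # everything except ':' and '/', as the question's code does.
        absolute = quote(urljoin(page_url, href), safe=":/")
        if urlparse(absolute).netloc == base_host:
            internal.add(absolute)
        else:
            external.add(absolute)
    return internal, external
```

Inside the spider callback this could be called as `classify_links(response.url, response.xpath('//a/@href').getall(), 'www.example.com')`, and the resulting sets validated in the `finished` callback once the crawl is done.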

How to add Proxy to Scrapy and Selenium Script

I would like to add a proxy to my script.
How do I do that? Should I set it up in Selenium or in Scrapy?
I think Scrapy makes the initial request, so it would make sense to use Scrapy for it. But what exactly do I have to do?
Can you recommend a proxy list that works reasonably reliably?
This is my current script:
# -*- coding: utf-8 -*-
import scrapy
from scrapy import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
from selenium import webdriver
import re
import csv
from time import sleep
class PostsSpider(Spider):
    name = 'posts'
    allowed_domains = ['xyz']
    start_urls = ('xyz',)
    def parse(self, response):
        with open("urls.txt", "rt") as f:
            start_urls = [url.strip() for url in f.readlines()]
        for url in start_urls:
            # raw string, so the backslashes in the Windows path are kept literally
            self.driver = webdriver.Chrome(r'C:\webdrivers\chromedriver.exe')
            self.driver.get(url)
            try:
                self.driver.find_element_by_id('currentTab').click()
                sleep(3)
                self.logger.info('Sleeping for 3 sec.')
                self.driver.find_element_by_xpath('//*[@id="_blog-menu"]/div[2]/div/div[2]/a[3]').click()
                sleep(7)
                self.logger.info('Sleeping for 7 sec.')
            except NoSuchElementException:
                self.logger.info('Blog does not exist anymore')
            while True:
                try:
                    element = self.driver.find_element_by_id('last_item')
                    self.driver.execute_script("arguments[0].scrollIntoView(0, document.documentElement.scrollHeight-5);", element)
                    sleep(3)
                    self.driver.find_element_by_id('last_item').click()
                    sleep(7)
                except NoSuchElementException:
                    self.logger.info('No more tipps')
                    sel = Selector(text=self.driver.page_source)
                    allposts = sel.xpath('//*[@class="block media _feedPick feed-pick"]')
                    for post in allposts:
                        username = post.xpath('.//div[@class="col-sm-7 col-lg-6 no-padding"]/a/@title').extract()
                        publish_date = post.xpath('.//*[@class="bet-age text-muted"]/text()').extract()
                        yield {'Username': username,
                               'Publish date': publish_date}
                    self.driver.close()
                    break
One simple way is to set the http_proxy and https_proxy environment variables.
You could set them in your environment before starting your script, or maybe add this at the beginning of your script:
import os
os.environ['http_proxy'] = 'http://my/proxy'
os.environ['https_proxy'] = 'http://my/proxy'
For lists of publicly available proxies, you will find plenty with a quick Google search.
You should also read the documentation for Scrapy's HttpProxyMiddleware, which describes the ways of using such proxies.
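To check that the environment variables are actually visible to Python, the standard library can report the proxies it detects. A minimal sketch, assuming a hypothetical proxy address (`proxy.example.com:8080` is a placeholder, not a real service); Scrapy's HttpProxyMiddleware is documented to obey the same `http_proxy` and `https_proxy` variables as `urllib.request`.

```python
import os
import urllib.request

# Hypothetical proxy address, used only for illustration.
PROXY_URL = "http://proxy.example.com:8080"

# Set the variables before any request is made.
os.environ["http_proxy"] = PROXY_URL
os.environ["https_proxy"] = PROXY_URL

# getproxies() returns the scheme-to-proxy mapping that urllib-based code sees.
proxies = urllib.request.getproxies()
print(proxies.get("http"), proxies.get("https"))
```

Note that the Chrome instance started by Selenium is a separate process, so it may need its own proxy configuration (for example Chrome's `--proxy-server` argument) rather than relying on these variables.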

Scrapy doesn't crawl over the site

I have fallen into a common trap and can't get out of it: my Scrapy spider is very lazy, in that it only parses start_urls. Code below:
import scrapy
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.item import Field
from scrapy.selector import Selector
class HabraPostSpider(scrapy.Spider):
    name = 'habrapost'
    allowed_domains = ['habrahabr.ru']
    start_urls = ['https://habrahabr.ru/interesting/']
    def parse(self, response):
        self.logger.info('A response from %s just arrived!', response.url)
    rules = (Rule(LinkExtractor()),
             Rule(LinkExtractor(allow=('/post/'),),callback='parse_post',follow= True))
I will be very happy if anybody could say how to fix my spider!
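Your English is totally broken, but reading between the lines, I understand that you want the crawler to follow every link it sees.
For that you have to use CrawlSpider instead of Spider, because only CrawlSpider applies the rules attribute:
class HabraPostSpider(scrapy.spiders.CrawlSpider):
Also, do not override parse in a CrawlSpider, since CrawlSpider uses parse internally to apply its rules. Put your handling in a differently named callback, such as the parse_post that your second rule already refers to.
Check the documentation.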

login using post request in scrapy

I want to log in to Rediffmail, but an error is raised:
exceptions.NameError: global name 'FormRequest' is not defined
Here is my spider code:
import scrapy
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from scrapy.http import Request
from rediffmail.items import RediffmailItem
class MySpider(BaseSpider):
    name = 'rediffmail'
    allowed_domains = ["rediff.com"]
    start_urls = ['https://mail.rediff.com/cgi-bin/login.cgi']
    def parse(self, response):
        return [FormRequest.from_response(response,formdata={'login': 'XXXX', 'passwd': 'XXXX'},callback=self.after_login)]
    def after_login(self, response):
        # check login succeed before going on
        if "authentication failed" in response.body:
            self.logger.error("Login failed")
            return
Please check whether there is any issue in my code. I am a beginner in Python.
You are missing the import of FormRequest. In your version of Scrapy, FormRequest lives in scrapy.http.
So add this line to your import section:
from scrapy.http import FormRequest
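A side note on the after_login check in the question, in case the spider is later moved to Python 3: there, response.body is bytes, and testing a str against bytes with `in` raises a TypeError. A minimal stdlib-only sketch of a bytes-safe check follows; the helper name `login_failed` and its parameters are hypothetical, and the marker text is simply the one the question uses.

```python
def login_failed(body: bytes, marker: str = "authentication failed",
                 encoding: str = "utf-8") -> bool:
    """Return True if the failure marker appears in the raw response body."""
    # Encode the marker so that bytes are compared with bytes.
    return marker.encode(encoding) in body
```

In the spider this would read `if login_failed(response.body):`. Comparing the marker against `response.text`, which Scrapy decodes to str, is an alternative.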