Just getting started with Scrapy, I'm hoping for a nudge in the right direction.
I want to scrape data from here:
https://www.sportstats.ca/display-results.xhtml?raceid=29360
This is what I have so far:
import scrapy
import re
class BlogSpider(scrapy.Spider):
name = 'sportstats'
start_urls = ['https://www.sportstats.ca/display-results.xhtml?raceid=29360']
def parse(self, response):
headings = []
results = []
tables = response.xpath('//table')
headings = list(tables[0].xpath('thead/tr/th/span/span/text()').extract())
rows = tables[0].xpath('tbody/tr[contains(#class, "ui-widget-content ui-datatable")]')
for row in rows:
result = []
tds = row.xpath('td')
for td in enumerate(tds):
if headings[td[0]].lower() == 'comp.':
content = None
elif headings[td[0]].lower() == 'view':
content = None
elif headings[td[0]].lower() == 'name':
content = td[1].xpath('span/a/text()').extract()[0]
else:
try:
content = td[1].xpath('span/text()').extract()[0]
except:
content = None
result.append(content)
results.append(result)
for result in results:
print(result)
Now I need to move on to the next page, which I can do in a browser by clicking the "right arrow" at the bottom, which I believe is the following li:
<li><a id="mainForm:j_idt369" href="#" class="ui-commandlink ui-widget fa fa-angle-right" onclick="PrimeFaces.ab({s:"mainForm:j_idt369",p:"mainForm",u:"mainForm:result_table mainForm:pageNav mainForm:eventAthleteDetailsDialog",onco:function(xhr,status,args){hideDetails('athlete-popup');showDetails('event-popup');scrollToTopOfElement('mainForm\\:result_table');;}});return false;"></a>
How can I get scrapy to follow that?
If you open the url in a browser without javascript you won't be able to move to the next page. As you can see inside the li tag, there is some javascript to be executed in order to get the next page.
Yo get around this, the first option is usually try to identify the request generated by javascript. In your case, it should be easy: just analyze the java script code and replicate it with python in your spider. If you can do that, you can send the same request from scrapy. If you can't do it, the next option is usually to use some package with javascript/browser emulation or someting like that. Something like ScrapyJS or Scrapy + Selenium.
You're going to need to perform a callback. Generate the url from the xpath from the 'next page' button. So url = response.xpath(xpath to next_page_button) and then when you're finished scraping that page you'll do yield scrapy.Request(url, callback=self.parse_next_page). Finally you create a new function called def parse_next_page(self, response):.
A final, final note is if it happens to be in Javascript (and you can't scrape it even if you're sure you're using the correct xpath) check out my repo in using splash with scrapy https://github.com/Liamhanninen/Scrape
Related
On this page (https://www.realestate.com.kh/buy/), I managed to grab a list of ads, and individually parse their content with this code:
import scrapy
class scrapingThings(scrapy.Spider):
name = 'scrapingThings'
# allowed_domains = ['https://www.realestate.com.kh/buy/']
start_urls = ['https://www.realestate.com.kh/buy/']
def parse(self, response):
ads = response.xpath('//*[#class="featured css-ineky e1jqslr40"]//a/#href')
c = 0
for url in ads:
c += 1
absolute_url = response.urljoin(url.extract())
self.item = {}
self.item['url'] = absolute_url
yield scrapy.Request(absolute_url, callback=self.parse_ad, meta={'item': self.item})
def parse_ad(self, response):
# Extract things
yield {
# Yield things
}
However, I'd like to automate the switching from one page to another to grab the entirety of the ads available (not only on the first page, but on all pages). By, I guess, simulating the pressings of the 1, 2, 3, 4, ..., 50 buttons as displayed on this screen capture:
Is this even possible with Scrapy? If so, how can one achieve this?
Yes it's possible. Let me show you two ways of doing it:
You can have your spider select the buttons, get the #href value of them, build a [full] URL and yield as a new request.
Here is an example:
def parse(self, response):
....
href = response.xpath('//div[#class="desktop-buttons"]/a[#class="css-owq2hj"]/following-sibling::a[1]/#href').get()
req_url = response.urljoin(href)
yield Request(url=req_url, callback=self.parse_ad)
The selector in the example will always return the #href of the next page's button (It returns only one value, if you are in page 2 it returns the #href of page 2)
In this page, the href is an relative url, so we need to use response.urljoin() method to build a full url. It will use the response as base.
We yield a new request, the response will be parsed in the callback function you determined.
This will require your callback function to always yield the request for the next page. So it's a recursive solution.
A more simple approach would be to just observe the pattern of the hrefs and manually yield all requests. Each button has a href of "/buy/?page={nr}" where {nr} is the number of the page, se can arbitrarily change this nr value and yield all requests at once.
def parse(self, response):
....
nr_pages = response.xpath('//div[#class="desktop-buttons"]/a[#class="css-1en2dru"]//text()').getall()
last_page_nr = int(nr_pages[-1])
for nr in range(2, last_page_nr + 1):
req_url = f'/buy/?page={nr}'
yield Request(url=response.urljoin(req_url), callback=self.parse_ad)
nr_pages returns the number of all buttons
last_page_nr selects the last number (which is the last available page)
We loop in the range between 2 to the value of last_page_nr (50 in this case) and in each loop we request a new page (that correspond to the number).
This way you can make all the requests in your parse method, and parse the response in the parse_ad without recursive calling.
Finally I suggest you take a look on scrapy tutorial it covers several common scenarios on scraping.
I would like to automatically get images saved as browser's data after the page renders, using their corresponding data URLs.
For example:
You can go to the webpage: https://en.wikipedia.org/wiki/Truck
Using the WebInspector from Firefox pick the first thumbnail image on the right.
Now on the Inspector tab, right click over the img tag, go to Copy and press "Image Data-URL"
Open a new tab, paste and enter to see the image from the data URL.
Notice that the data URL is not available on the page source. On the website I want to scrape, the images are rendered after passing through a php script. The server returns a 404 response if the images try to be accessed directly with the src tag attribute.
I believe it should be possible to list the data URLs of the images rendered by the website and download them, however I was unable to find a way to do it.
I normally scrape using selenium webdriver with Firefox coded in python, but any solution would be welcome.
I managed to work out a solution using chrome webdriver with CORS disabled as with Firefox I could not find a cli argument to disable it.
The solution executes some javascript to redraw the image on a new canvas element and then use toDataURL method to get the data url. To save the image I convert the base64 data to binary data and save it as png.
This apparently solved the issue in my use case.
Code to get first truck image
from binascii import a2b_base64
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument("--disable-web-security")
chrome_options.add_argument("--disable-site-isolation-trials")
driver = webdriver.Chrome(options=chrome_options)
driver.get("https://en.wikipedia.org/wiki/Truck")
img = driver.find_element_by_xpath("/html/body/div[3]/div[3]"
"/div[5]/div[1]/div[4]/div"
"/a/img")
img_base64 = driver.execute_script(
"""
const img = arguments[0];
const canvas = document.createElement('canvas');
const ctx = canvas.getContext('2d');
canvas.width = img.width;
canvas.height = img.height;
ctx.drawImage(img, 0, 0);
data_url = canvas.toDataURL('image/png');
return data_url
""",
img)
binary_data = a2b_base64(img_base64.split(',')[1])
with open('image.png', 'wb') as save_img:
save_img.write(binary_data)
Also, I found that the data url that you get with the procedure described in my question, was generated by the Firefox web inspector on request, so it should not be possible to get a list of data urls (that are not within the page source) as I first thought.
BeautifulSoup is the best library to use for such problem statements. When u wanna retrieve data from any website, u can blindly use BeautifulSoup as it is faster than selenium. BeautifulSoup just takes around 10 seconds to complete this task, whereas selenium would approximately take 15-20 seconds to complete the same task, so it is better to use BeautifulSoup. Here is how u do it using BeautifulSoup:
from bs4 import BeautifulSoup
import requests
import time
st = time.time()
src = requests.get('https://en.wikipedia.org/wiki/Truck').text
soup = BeautifulSoup(src,'html.parser')
divs = soup.find_all('div',class_ = "thumbinner")
count = 1
for x in divs:
url = x.a.img['srcset']
url = url.split('1.5x,')[-1]
url = url.split('2x')[0]
url = "https:" + url
url = url.replace(" ","")
path = f"D:\\Truck_Img_{count}.png"
response = requests.get(url)
file = open(path, "wb")
file.write(response.content)
file.close()
count+=1
print(f"Execution Time = {time.time()-st} seconds")
Output:
Execution Time = 9.65831208229065 seconds
29 Images. Here is the first image:
Hope that this helps!
I am doing the following. After creating a session I am doing a simple GET to a page. Problem is, this page if full of dynamic parts, so it takes between 10-30 seconds to fully generate the HTML I am interested in. The HTML I process with BeautifulSoup.
If I process the response object too quickly, I don't get the data I want. I have used "sleep" to pause for some time, but I think there should be a better way to check for complete page load. I cannot depend on status 200 code, because inside the main page, dynamic parts are still loading.
My code:
s = requests.session()
r = s.get('URL')
time.sleep(20)
... code to process response object...
I have tried to do it more "elegantly" to check for a certain tag through BeautifulSoup search, but doesn't seem to work.
My code:
title_found = False
while title_found == False:
soupje = BeautifulSoup(r.text, 'html.parser')
title_found_in_html_full = soupje.find(id='titleView!1Title')
if title_found_in_html_full is not None:
title_found_in_html = title_found_in_html_full.get('id')
if title_found_in_html == 'titleView!1Title':
title_found = True
Is it true the response object changes over time as the page is loading?
Any suggestions? Thanks
my question is similar to this post:
How to use scrapy for Amazon.com links after "Next" Button?
I want my crawler to traverse through all "Next" links. I've searched a lot, but most people ether focus on how to parse the ULR or simply put all URL's in the initial URL list.
So far, I am able to visit the first page and parse the next page's link. But I don't know how to visit that page using the same crawler(spider). I tried to append the new URL into my URL list, it does appended (I checked the length), but later it doesn't visit the link. I have no idea why...
Note that in my case, I only know the first page's URL. Second page's URL can only be obtained after visiting the first page. The same, (i+1)'th page's URL is hidden in the i'th page.
In the parse function, I can parse and print the correct next page link URL. I just don't know how to visit it.
Please help me. Thank you!
import scrapy
from bs4 import BeautifulSoup
class RedditSpider(scrapy.Spider):
name = "test2"
allowed_domains = ["http://www.reddit.com"]
urls = ["https://www.reddit.com/r/LifeProTips/search?q=timestamp%3A1427232122..1437773560&sort=new&restrict_sr=on&syntax=cloudsearch"]
def start_requests(self):
for url in self.urls:
yield scrapy.Request(url, self.parse, meta={
'splash': {
'endpoint': 'render.html',
'args': { 'wait': 0.5 }
}
})
`
def parse(self, response):
page = response.url[-10:]
print(page)
filename = 'reddit-%s.html' % page
#parse html for next link
soup = BeautifulSoup(response.body, 'html.parser')
mydivs = soup.findAll("a", { "rel" : "nofollow next" })
link = mydivs[0]['href']
print(link)
self.urls.append(link)
with open(filename, 'wb') as f:
f.write(response.body)
Update
Thanks to Kaushik's answer, I figured out how to make it work. Though I still don't know why my original idea of appending new URL's doesn't work...
The updated code is as follow:
import scrapy
from bs4 import BeautifulSoup
class RedditSpider(scrapy.Spider):
name = "test2"
urls = ["https://www.reddit.com/r/LifeProTips/search?q=timestamp%3A1427232122..1437773560&sort=new&restrict_sr=on&syntax=cloudsearch"]
def start_requests(self):
for url in self.urls:
yield scrapy.Request(url, self.parse, meta={
'splash': {
'endpoint': 'render.html',
'args': { 'wait': 0.5 }
}
})
def parse(self, response):
page = response.url[-10:]
print(page)
filename = 'reddit-%s.html' % page
with open(filename, 'wb') as f:
f.write(response.body)
#parse html for next link
soup = BeautifulSoup(response.body, 'html.parser')
mydivs = soup.findAll("a", { "rel" : "nofollow next" })
if len(mydivs) != 0:
link = mydivs[0]['href']
print(link)
#yield response.follow(link, callback=self.parse)
yield scrapy.Request(link, callback=self.parse)
What you require is explained very well in the Scrapy docs . I don't think you would need any other explanation other than that. Suggest going through it once for better understanding.
A brief explanation first though:
To follow a link to the next page, Scrapy provides many methods. The most basic methods is using the http.Request method
Request object :
class scrapy.http.Request(url[, callback,
method='GET', headers, body, cookies, meta, encoding='utf-8',
priority=0, dont_filter=False, errback, flags])
>>> yield scrapy.Request(url, callback=self.next_parse)
url (string) – the URL of this request
callback (callable) – the function that will be called with the response of this request (once its downloaded) as its first parameter.
For convenience though, Scrapy has inbuilt shortcut for creating Request objects by using response.follow where the url can be an absolute path or a relative path.
follow(url, callback=None, method='GET', headers=None, body=None,
cookies=None, meta=None, encoding=None, priority=0, dont_filter=False,
errback=None)
>>> yield response.follow(url, callback=self.next_parse)
In case if you have to move through to the next link by passing values to a form or any other type of input field, you can use the Form Request objects. The FormRequest class extends the base Request with functionality
for dealing with HTML forms. It uses lxml.html forms to pre-populate
form fields with form data from Response objects.
Form Request object
from_response(response[, formname=None,
formid=None, formnumber=0, formdata=None, formxpath=None,
formcss=None, clickdata=None, dont_click=False, ...])
If you want to simulate a HTML Form POST in your spider and send a couple of key-value fields, you can return a FormRequest object (from your spider) like this:
return [FormRequest(url="http://www.example.com/post/action",
formdata={'name': 'John Doe', 'age': '27'},
callback=self.after_post)]
Note : If a Request doesn’t specify a callback, the spider’s parse() method will be used. If exceptions are raised during processing, errback is called instead.
I'm trying to scrape content from listing detail page that can only be viewed by clicking the 'view' button which triggers a form submit . I am new to both Python and Scrapy
Example markup
<li><h3>Abc Widgets</h3>
<form action="/viewlisting?id=123" method="post">
<input type="image" src="/images/view.png" value="submit" >
</form>
</li>
My solution in Scrapy is to extract form actions then use Request to return the page with a callback to parse it for for the desired content. However I have hit a few issues
I'm getting the following error "request url must be str or unicode"
secondly when I hardcode a URL to overcome the above issue it seems my parsing function is returning what looks like a list
Here is my code - with reactions of the real URLs
from scrapy.spiders import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from wfi2.items import Wfi2Item
class ProfileSpider(Spider):
name = "profiles"
allowed_domains = ["wfi.com.au"]
start_urls = ["http://example.com/wps/wcm/connect/internet/wfi/Contact+Us/Find+Your+Local+Office/findYourLocalOffice.jsp?state=WA",
"http://example.com/wps/wcm/connect/internet/wfi/Contact+Us/Find+Your+Local+Office/findYourLocalOffice.jsp?state=VIC",
"http://example.com/wps/wcm/connect/internet/wfi/Contact+Us/Find+Your+Local+Office/findYourLocalOffice.jsp?state=QLD",
"http://example.com/wps/wcm/connect/internet/wfi/Contact+Us/Find+Your+Local+Office/findYourLocalOffice.jsp?state=NSW",
"http://example.com/wps/wcm/connect/internet/wfi/Contact+Us/Find+Your+Local+Office/findYourLocalOffice.jsp?state=TAS"
"http://example.com/wps/wcm/connect/internet/wfi/Contact+Us/Find+Your+Local+Office/findYourLocalOffice.jsp?state=NT"
]
def parse(self, response):
hxs = Selector(response)
forms = hxs.xpath('//*[#id="area-managers"]//*/form')
for form in forms:
action = form.xpath('#action').extract()
print "ACTION: ", action
#request = Request(url=action,callback=self.parse_profile)
request = Request(url=action,callback=self.parse_profile)
yield request
def parse_profile(self, response):
hxs = Selector(response)
profile = hxs.xpath('//*[#class="contentContainer"]/*/text()')
print "PROFILE", profile
I'm getting the following error "request url must be str or unicode"
Please have a look at the scrapy documentation for extract(). It says : "Serialize and return the matched nodes as a list of unicode strings" (bold added by me).
The first element of the list is probably what you want. So you could do something like:
request = Request(url=response.urljoin(action[0]), callback=self.parse_profile)
secondly when I hardcode a URL to overcome the above issue it seems my
parsing function is returning what looks like a list
According to the documentation of xpath it's a SelectorList. Add extract() to the xpath and you'll get a list with the text tokens. Eventually you want to clean up and join the elements that list before further processing.