How to pull down a large tbody from HTML using BeautifulSoup

I was trying to pull down a tbody from https://iextrading.com/trading/market-data/#hist-download but couldn't get the contents of the body. Whenever I tried to pull down the table I got an empty
<tbody id="hist-rows">
</tbody>
Here is a snippet of my code:
import requests
from bs4 import BeautifulSoup

BaseDataUrl = r"https://iextrading.com/trading/market-data"
base_download = r"https://www.googleapis.com/download/storage/v1/b/iex/o/data%2Ffeeds%2F"
BaseData = requests.get(BaseDataUrl)
soup = BeautifulSoup(BaseData.content, 'lxml')
tableofcontents = soup.find('div', class_="overflow-x-auto mb2")
tableofdownloads = soup.find('tbody', id="hist-rows")
tableofcontents = tableofcontents.findAll("th")
print(tableofdownloads)

On analyzing the website, you can see that it makes an AJAX call to get the data for the table, which is why the tbody is empty in the page source. Hence, call that API directly to get the data.
The following code will get you the data:
import requests
res = requests.get("https://iextrading.com/api/1.0/hist")
print(res.json())
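If you then want the individual download links out of that payload, here is a minimal sketch; the structure assumed below (a dict keyed by date whose values are lists of entries carrying a "link" field) is an assumption, so inspect the actual JSON before relying on it:
import requests

res = requests.get("https://iextrading.com/api/1.0/hist")
hist = res.json()

# Assumed layout: {"YYYYMMDD": [{"link": "...", "feed": "...", ...}, ...], ...}
for date, files in hist.items():
    for entry in files:
        print(date, entry.get("link"))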

Related

Access image data URL when the data URL is only obtained upon rendering

I would like to automatically get images saved as browser's data after the page renders, using their corresponding data URLs.
For example:
You can go to the webpage: https://en.wikipedia.org/wiki/Truck
Using the WebInspector from Firefox pick the first thumbnail image on the right.
Now on the Inspector tab, right click over the img tag, go to Copy and press "Image Data-URL"
Open a new tab, paste and enter to see the image from the data URL.
Notice that the data URL is not available in the page source. On the website I want to scrape, the images are rendered after passing through a PHP script. The server returns a 404 response if the images are accessed directly via the src attribute.
I believe it should be possible to list the data URLs of the images rendered by the website and download them, however I was unable to find a way to do it.
I normally scrape using selenium webdriver with Firefox coded in python, but any solution would be welcome.
I managed to work out a solution using the Chrome webdriver with CORS disabled, as I could not find a CLI argument to disable it in Firefox.
The solution executes some JavaScript to redraw the image on a new canvas element and then uses the toDataURL method to get the data URL. To save the image, I convert the base64 data to binary data and save it as a PNG.
This apparently solved the issue in my use case.
Code to get first truck image
from binascii import a2b_base64

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Disable web security so the canvas is not tainted by the cross-origin image
chrome_options = Options()
chrome_options.add_argument("--disable-web-security")
chrome_options.add_argument("--disable-site-isolation-trials")
driver = webdriver.Chrome(options=chrome_options)

driver.get("https://en.wikipedia.org/wiki/Truck")
img = driver.find_element_by_xpath("/html/body/div[3]/div[3]"
                                   "/div[5]/div[1]/div[4]/div"
                                   "/a/img")

# Redraw the image on a canvas and read it back as a data URL
img_base64 = driver.execute_script(
    """
    const img = arguments[0];
    const canvas = document.createElement('canvas');
    const ctx = canvas.getContext('2d');
    canvas.width = img.width;
    canvas.height = img.height;
    ctx.drawImage(img, 0, 0);
    return canvas.toDataURL('image/png');
    """,
    img)

# Strip the "data:image/png;base64," prefix and decode to raw bytes
binary_data = a2b_base64(img_base64.split(',')[1])
with open('image.png', 'wb') as save_img:
    save_img.write(binary_data)
Also, I found that the data URL you get with the procedure described in my question is generated by the Firefox web inspector on request, so it is not possible to get a list of data URLs (that are not within the page source) as I first thought.
BeautifulSoup is a good fit for this kind of problem. When the data you want is already in the page source, you can use BeautifulSoup instead of Selenium, and it is faster: BeautifulSoup takes around 10 seconds to complete this task, whereas Selenium takes roughly 15-20 seconds. Here is how you do it using BeautifulSoup:
from bs4 import BeautifulSoup
import requests
import time

st = time.time()
src = requests.get('https://en.wikipedia.org/wiki/Truck').text
soup = BeautifulSoup(src, 'html.parser')
divs = soup.find_all('div', class_="thumbinner")
count = 1
for x in divs:
    url = x.a.img['srcset']
    url = url.split('1.5x,')[-1]
    url = url.split('2x')[0]
    url = "https:" + url
    url = url.replace(" ", "")
    path = f"D:\\Truck_Img_{count}.png"
    response = requests.get(url)
    with open(path, "wb") as file:
        file.write(response.content)
    count += 1
print(f"Execution Time = {time.time()-st} seconds")
Output:
Execution Time = 9.65831208229065 seconds
29 images were downloaded.
Hope that this helps!

Having difficulty converting requests.models.Response to scrapy.selector.unified.Selector

This code
import requests
url = 'https://docs.scrapy.org/en/latest/_static/selectors-sample1.html'
response = requests.get(url)
gets me a requests.models.Response instance, from which I can extract data with scrapy
from scrapy import Selector
sel = Selector(response=response)
sel.xpath('//div')
A post gives a great way to access a website. Here is just part of it.
response = requests.get('https://www.zhihu.com/api/v4/columns/wangzhenotes/items', headers=headers)
print(response.json())
With that approach, I got the content from that site.
However, the same code cannot extract data from the Response instance
sel = Selector(response=response)
len(sel.xpath('//div'))
I just got 0. How do I fix this?
The result of this request
response = requests.get('https://www.zhihu.com/api/v4/columns/wangzhenotes/items', headers=headers)
is a JSON object, so of course it does not contain any div elements.
To get the required information you have to parse that JSON:
response = requests.get('https://www.zhihu.com/api/v4/columns/wangzhenotes/items', headers=headers)
data = response.json()['data']
Then you need to loop through the data list and take the fields you need, as in the sketch below.
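For illustration, a minimal sketch of that loop; the title and url keys are assumptions about the payload, and the headers dict is a placeholder for whatever headers the original post used, so inspect response.json() to see which fields are actually returned:
import requests

headers = {'User-Agent': 'Mozilla/5.0'}  # placeholder; use the headers from the original post
response = requests.get('https://www.zhihu.com/api/v4/columns/wangzhenotes/items', headers=headers)
data = response.json()['data']

# 'title' and 'url' are assumed field names -- adjust to the real payload
for entry in data:
    print(entry.get('title'), entry.get('url'))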
Again, if you want to use scrapy, you can make requests to the URL https://www.zhihu.com/api/v4/columns/wangzhenotes/items
and then, in the parse method, read the response as JSON:
j_obj = json.loads(response.body_as_unicode())
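Put together, a minimal spider might look like the sketch below; the spider name and the fields pulled out of the JSON are illustrative assumptions, not something from the original post:
import json

import scrapy


class ZhihuColumnSpider(scrapy.Spider):
    name = 'zhihu_column'  # illustrative name
    start_urls = ['https://www.zhihu.com/api/v4/columns/wangzhenotes/items']

    def parse(self, response):
        j_obj = json.loads(response.text)  # on Scrapy >= 2.2 you can use response.json()
        for entry in j_obj.get('data', []):
            # 'title' and 'url' are assumed field names
            yield {'title': entry.get('title'), 'url': entry.get('url')}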

Requests response object: how to check page loaded completely (dynamic content)?

I am doing the following. After creating a session, I am doing a simple GET to a page. The problem is, this page is full of dynamic parts, so it takes between 10-30 seconds to fully generate the HTML I am interested in. I process the HTML with BeautifulSoup.
If I process the response object too quickly, I don't get the data I want. I have used "sleep" to pause for some time, but I think there should be a better way to check for a complete page load. I cannot depend on the 200 status code, because the dynamic parts inside the main page are still loading.
My code:
import time

import requests

s = requests.session()
r = s.get('URL')
time.sleep(20)
# ... code to process response object ...
I have tried to do it more "elegantly" by checking for a certain tag through a BeautifulSoup search, but it doesn't seem to work.
My code:
title_found = False
while title_found == False:
    soupje = BeautifulSoup(r.text, 'html.parser')
    title_found_in_html_full = soupje.find(id='titleView!1Title')
    if title_found_in_html_full is not None:
        title_found_in_html = title_found_in_html_full.get('id')
        if title_found_in_html == 'titleView!1Title':
            title_found = True
Is it true the response object changes over time as the page is loading?
Any suggestions? Thanks

Scrapy returning same first row data in each row instead of separate data for each row

I have written a simple scraper using scrapy, but it keeps returning the first instance of the target data in every row instead of the correct data for each row. In this case, it returns the first link for all scraped jobs from the Indeed website instead of the correct link for each job.
I've tried both using (div) and avoiding (.//div) absolute paths, as well as using [0] at the end of the line. Without [0], it returns all data from all rows in each cell.
Link to an example of the source data:
https://www.indeed.co.uk/jobs?as_and=a&as_phr=&as_any=&as_not=IT+construction&as_ttl=Project+Manager&as_cmp=&jt=contract&st=&salary=%C2%A330K-%C2%A3460K&radius=25&fromage=2&limit=50&sort=date&psf=advsrch
The target data is href="/rc/clk?jk=56e4f5164620b6da&fccid=6920a3604c831610&vjs=3"
Target data from the page:
<div class="title">
<a target="_blank" id="jl_56e4f5164620b6da" href="/rc/clk?jk=56e4f5164620b6da&fccid=6920a3604c831610&vjs=3" onmousedown="return rclk(this,jobmap[0],1);" onclick=" setRefineByCookie(['radius', 'jobtype', 'salest']); return rclk(this,jobmap[0],true,1);" rel="noopener nofollow" title="Project Manager" class="jobtitle turnstileLink " data-tn-element="jobTitle">
<b>Project</b> <b>Manager</b></a>
Here's my code
def parse(self, response):
    titles = response.css('div.jobsearch-SerpJobCard')
    items = []
    for title in titles:
        item = ICcom4Item()
        home_url = ("http://www.indeed.co.uk")
        item['role_title_link'] = titles.xpath('div[@class="title"]/a/@href').extract()[0]
        items.append(item)
    return items
I just need the correct link from each job to appear. All help welcome!
The problem is in the line below:
item['role_title_link'] = titles.xpath('div[@class="title"]/a/@href').extract()[0]
Instead of titles.xpath, you should use title.xpath, like below:
item['role_title_link'] = title.xpath('div[@class="title"]/a/@href').extract()[0]
Then, your code will scrape the link for each job, as you want.
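For reference, a sketch of the whole loop with that one-character fix applied (assuming the ICcom4Item import from the original spider):
def parse(self, response):
    titles = response.css('div.jobsearch-SerpJobCard')
    items = []
    for title in titles:
        item = ICcom4Item()
        # query relative to the current card, not the whole selector list
        item['role_title_link'] = title.xpath('div[@class="title"]/a/@href').extract()[0]
        items.append(item)
    return items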

Scrapy Running Results

Just getting started with Scrapy, I'm hoping for a nudge in the right direction.
I want to scrape data from here:
https://www.sportstats.ca/display-results.xhtml?raceid=29360
This is what I have so far:
import scrapy
import re


class BlogSpider(scrapy.Spider):
    name = 'sportstats'
    start_urls = ['https://www.sportstats.ca/display-results.xhtml?raceid=29360']

    def parse(self, response):
        headings = []
        results = []
        tables = response.xpath('//table')
        headings = list(tables[0].xpath('thead/tr/th/span/span/text()').extract())
        rows = tables[0].xpath('tbody/tr[contains(@class, "ui-widget-content ui-datatable")]')
        for row in rows:
            result = []
            tds = row.xpath('td')
            for td in enumerate(tds):
                if headings[td[0]].lower() == 'comp.':
                    content = None
                elif headings[td[0]].lower() == 'view':
                    content = None
                elif headings[td[0]].lower() == 'name':
                    content = td[1].xpath('span/a/text()').extract()[0]
                else:
                    try:
                        content = td[1].xpath('span/text()').extract()[0]
                    except:
                        content = None
                result.append(content)
            results.append(result)
        for result in results:
            print(result)
Now I need to move on to the next page, which I can do in a browser by clicking the "right arrow" at the bottom, which I believe is the following li:
<li><a id="mainForm:j_idt369" href="#" class="ui-commandlink ui-widget fa fa-angle-right" onclick="PrimeFaces.ab({s:"mainForm:j_idt369",p:"mainForm",u:"mainForm:result_table mainForm:pageNav mainForm:eventAthleteDetailsDialog",onco:function(xhr,status,args){hideDetails('athlete-popup');showDetails('event-popup');scrollToTopOfElement('mainForm\\:result_table');;}});return false;"></a>
How can I get scrapy to follow that?
If you open the URL in a browser with JavaScript disabled, you won't be able to move to the next page. As you can see inside the li tag, there is some JavaScript that has to be executed in order to get the next page.
To get around this, the first option is usually to try to identify the request generated by the JavaScript. In your case it should be easy: just analyze the JavaScript code and replicate it with Python in your spider. If you can do that, you can send the same request from Scrapy. If you can't, the next option is usually to use a package that provides JavaScript/browser emulation, something like ScrapyJS (Splash) or Scrapy + Selenium.
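As a rough, untested sketch of that first option: PrimeFaces pagination is normally triggered by a POST back to the same URL carrying the JSF partial-AJAX fields, so something along these lines could work with scrapy.FormRequest. The exact field names and values below are assumptions and must be copied from the request your browser actually sends (check the network tab):
import scrapy


class SportstatsPagesSpider(scrapy.Spider):
    name = 'sportstats_pages'  # illustrative name
    start_urls = ['https://www.sportstats.ca/display-results.xhtml?raceid=29360']

    def parse(self, response):
        # ... scrape the current page here ...
        view_state = response.xpath(
            '//input[@name="javax.faces.ViewState"]/@value').get()
        # Replay the PrimeFaces AJAX post that the "next page" arrow triggers.
        # Field names/values are assumed; verify them in the browser.
        yield scrapy.FormRequest(
            response.url,
            formdata={
                'javax.faces.partial.ajax': 'true',
                'javax.faces.source': 'mainForm:j_idt369',
                'mainForm': 'mainForm',
                'mainForm:j_idt369': 'mainForm:j_idt369',
                'javax.faces.ViewState': view_state or '',
            },
            callback=self.parse_next_page,
        )

    def parse_next_page(self, response):
        # The reply is a partial XML update, not a full HTML page
        pass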
You're going to need to perform a callback. Generate the URL from the xpath of the 'next page' button, i.e. url = response.xpath(xpath_to_next_page_button), and then, when you're finished scraping that page, do yield scrapy.Request(url, callback=self.parse_next_page). Finally, you create a new method called def parse_next_page(self, response):.
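A generic sketch of that callback pattern (the XPath and names are placeholders, and it assumes the "next page" link exposes a real href, which on this PrimeFaces page it does not, so you would still need one of the approaches above):
import scrapy


class PaginatedSpider(scrapy.Spider):
    name = 'paginated'  # placeholder name
    start_urls = ['https://example.com/results?page=1']

    def parse(self, response):
        # ... scrape the current page ...
        next_href = response.xpath('//a[@rel="next"]/@href').get()  # placeholder XPath
        if next_href:
            yield scrapy.Request(response.urljoin(next_href),
                                 callback=self.parse_next_page)

    def parse_next_page(self, response):
        # same extraction logic as parse(), or just delegate back to it
        return self.parse(response)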
A final note: if the content happens to be rendered by JavaScript (and you can't scrape it even though you're sure you're using the correct xpath), check out my repo on using Splash with Scrapy: https://github.com/Liamhanninen/Scrape