Reading url using bs4 from yahoo finance - beautifulsoup

I am trying the following code to read the historical CSV data from yahoo finance:
import datetime
import time
from datetime import timedelta as td
from bs4 import BeautifulSoup
# period1/period2 are Unix timestamps for one year ago and today
per1 = str(int(time.mktime((datetime.datetime.today() - td(days=365)).timetuple())))
per2 = str(int(time.mktime((datetime.datetime.today()).timetuple())))
url = 'https://query1.finance.yahoo.com/v7/finance/download/MSFT?period1=' + per1 + '&period2=' + per2 + '&interval=1d&events=history&crumb=OQg/YFV3fvh'
The URL in the url variable is the one you see on Yahoo Finance when you look up a ticker and hover over the "Download Data" button.
I get an authentication error, which I believe is due to a missing cookie, so I tried the following:
import requests
ses = requests.Session()
url1 = 'https://finance.yahoo.com/quote/MSFT/history?p=MSFT'
ses.get(url1)  # visit the history page first, hoping to pick up the required cookie
soup = BeautifulSoup(ses.get(url).content, 'lxml')
print(soup.prettify())
This time I get an "incorrect cookie" error.
Can someone suggest how to work around this?

The crumb parameter of the query string keeps changing, probably with each browser session. So when you copy its value from the browser, close it and then use it in another browser instance, it has already expired.
It should therefore come as no surprise that by the time you use it in your requests session, Yahoo no longer recognizes the accompanying cookie value and returns an error.
Step 1
Studying the Network tab in any browser will help. In this particular case, the crumb is probably generated when you click on a ticker on the main page, so you'll have to fetch that URL first.
import requests
s = requests.Session()
tickers = ('000001.SS', 'NKE', 'MA', 'SBUX')
url = 'https://finance.yahoo.com/quote/{0}?p={0}'.format(tickers[0])
r = s.get(url, headers=req_headers)  # req_headers is defined under Note below
This URL needs to be fetched only once. So it doesn't matter which ticker you use for this.
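As a side effect, this request also stores Yahoo's cookies on the session object, which is what ties the crumb to your later requests. A quick sanity check (the exact cookie names are an assumption; Yahoo has historically set a cookie named B on this domain):
# optional: confirm the session picked up Yahoo's cookies
# (exact cookie names are an assumption, e.g. the historical 'B' cookie)
print(s.cookies.get_dict())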
Step 2
The response returned by the server contains the value that has to be passed to the crumb parameter of the query string when you download the CSV file.
However, it sits inside a script tag of the page returned by the previous request, which means you can't use BeautifulSoup alone to extract the crumb value.
I initially tried re to extract it from the script tag's text, but for some reason I wasn't able to, so I moved to json to parse it instead.
import re
import json
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.content, 'lxml')
script_tag = soup.find(text=re.compile('crumb'))
response_dict = json.loads(script_tag[script_tag.find('{"context":'):script_tag.find('}}}};') + 4])
crumb = response_dict['context']['dispatcher']['stores']['CrumbStore']['crumb']
Note that BeautifulSoup is required to extract the script element's contents to be later passed to json to parse it into a Python dict object.
I had to use pprint to print the resulting dict to a file to see exactly where the crumb value was stored.
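For completeness, a regex-based extraction can also be made to work. This is only a sketch, assuming the crumb appears in the script text in the form "CrumbStore":{"crumb":"..."} (the same CrumbStore key used above); forward slashes come back as \u002F escapes, hence the extra decode:
# hypothetical regex alternative to the json parsing above (a sketch, not the method used here)
match = re.search(r'"CrumbStore":\{"crumb":"(.*?)"\}', script_tag)
if match:
    crumb = match.group(1).encode('ascii').decode('unicode_escape')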
Step 3
The final URL that fetches the CSV file looks like this:
for ticker in tickers:
    csv_url = 'https://query1.finance.yahoo.com/v7/finance/download/{0}?period1=1506656676&period2=1509248676&interval=1d&events=history&crumb={1}'.format(ticker, crumb)
    r = s.get(csv_url, headers=req_headers)
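The loop above only pulls each CSV into memory; to end up with the files shown under Result, the response content can be written out inside the loop (a sketch; the filename pattern is my own choice):
    # inside the loop above: save each ticker's CSV to its own file
    # (the '<ticker>.csv' naming is just an illustrative choice)
    with open('{}.csv'.format(ticker), 'wb') as f:
        f.write(r.content)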
Result
Here are the first few lines of one of the files downloaded:
Date,Open,High,Low,Close,Adj Close,Volume
2017-09-29,3340.311035,3357.014893,3340.311035,3348.943115,3348.943115,144900
2017-10-09,3403.246094,3410.169922,3366.965088,3374.377930,3374.377930,191700
2017-10-10,3373.344971,3384.025879,3358.794922,3382.988037,3382.988037,179400
Note:
I used appropriate headers in both requests. So if you skip that part and don't get the desired results, you may have to include them as well:
req_headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.8',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
}

Related

Scrapy is not scraping the whole page but only some part of it

I am trying to scrape job profiles from the British Petroleum website. Initially the bot was not allowed to scrape, but after I set ROBOTSTXT_OBEY = False it started working; however, it is now not scraping the whole page. Below is my code:
import scrapy

class exxonmobilSpider(scrapy.Spider):
    name = "bp"
    start_urls = ['https://www.bp.com/en/global/corporate/careers/search-and-apply.html?query=data+scientist']

    def parse(self, response):
        name = response.xpath('//h3[@class="Hit_hitTitle__3MFk3"]')
        print(name)
        print(len(name))
As you can see in the image, that XPath does match the h3 tag, but when I run the code I get an empty list. Later I cross-checked by printing all the li or div tags and counting them, and found that only about half of the tags were being scraped. Does anyone have any idea why Scrapy scrapes only part of the page rather than the full page? Attaching a comparison image too.
You can see that the total number of li tags is 55; now compare that with the length of the response variable "name".
In the hope that the OP will include a minimal reproducible example in his next question, here is a way of getting those jobs. Bear in mind the jobs are pulled from an API by JavaScript in the page, so you need to either use Splash/scrapy-playwright or scrape the API directly. We will do the latter. The API URL is obtained from the browser's Dev tools - Network tab.
import scrapy

class BpscrapeSpider(scrapy.Spider):
    name = 'bpscrape'
    allowed_domains = ['algolianet.com', 'bp.com']

    def start_requests(self):
        headers = {
            'x-algolia-application-id': 'RF87OIMXXP',
            'x-algolia-api-key': 'f4f167340049feccfcf6141fb7b90a5d',
            'Origin': 'https://www.bp.com',
            'content-type': 'application/x-www-form-urlencoded',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'
        }
        api_url = 'https://rf87oimxxp-3.algolianet.com/1/indexes/*/queries?x-algolia-agent=Algolia%20for%20JavaScript%20(4.9.1)%3B%20Browser%3B%20JS%20Helper%20(3.4.4)%3B%20react%20(17.0.2)%3B%20react-instantsearch%20(6.11.0)'
        payload = '{"requests":[{"indexName":"candidatematcher_bp_navapp_prod","params":"highlightPreTag=%3Cais-highlight-0000000000%3E&highlightPostTag=%3C%2Fais-highlight-0000000000%3E&filters=type%3A%20Professionals&hitsPerPage=100&query=data%20scientist&maxValuesPerFacet=20&page=0&facets=%5B%22country%22%2C%22group%22%5D&tagFilters="}]}'
        yield scrapy.Request(
            url=api_url,
            headers=headers,
            body=payload,
            callback=self.parse,
            method="POST")

    def parse(self, response):
        data = response.json()['results'][0]['hits']
        for x in data:
            yield x
Run with scrapy crawl bpscrape -o bpdsjobs.json to get a json file with all 26 jobs.
You will need to do some data cleaning, as the JSON response is quite comprehensive and contains a lot of HTML tags, etc.
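As a hedged illustration of that cleaning step (the field names 'title', 'country' and 'description' are assumptions about the hit structure, not taken from the actual response), the parse method could yield a trimmed item instead of the raw hit:
from w3lib.html import remove_tags  # w3lib is already installed as a Scrapy dependency

def parse(self, response):
    data = response.json()['results'][0]['hits']
    for x in data:
        # keep only a few fields and strip embedded HTML;
        # the field names here are assumed, adjust them to the real hit keys
        yield {
            'title': remove_tags(x.get('title') or ''),
            'country': x.get('country'),
            'description': remove_tags(x.get('description') or ''),
        }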
For Scrapy documentation, please see https://docs.scrapy.org/en/latest/

Kraken - Pull Futures Data ("Unknown Asset Pair")

So I can pull spot data from Kraken with:
import requests
url = 'https://api.kraken.com/0/public/OHLC?pair=XBTUSD'
resp = requests.get(url)
resp.json()
But when I try to pull Futures data I always get
{'error': ['EQuery:Unknown asset pair']}
What I've done currently: I take the url from this website, https://demo-futures.kraken.com/futures/FI_XBTUSD_220930, which is "FI_BTCUSD_220930":
url = 'https://api.kraken.com/0/public/OHLC?pair=FI_BTCUSD_220930'
resp = requests.get(url)
resp.json()
I've considered that it might be because OHLC data can't be pulled for futures, but even when I try a simpler request, such as just getting info about the ticker, I get the same error.
I've looked in the documentation for separate rules for futures but can't find any reference to what to do differently for futures.

Send request to wordnet

I need to send a request to WordNet, knowing the tar_id (taken from ImageNet), to get the lemma assigned to that tar (e.g., I have a tar with houses, and I need to send the request and obtain the lemma written on WordNet, "living accommodation").
I used requests.get() first, with the URL, and then BeautifulSoup's parser.
The parsed HTML is returned, but there is no reference to the "body", meaning the part with the noun and its hypernyms/hyponyms.
Can you tell me how to get that part of Wordnet parsed with the rest of the page?
This is the URL I'm working on: http://wordnet-rdf.princeton.edu/pwn30/03546340-n
Just use the JSON endpoint.
For example:
import requests
headers = {
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:96.0) Gecko/20100101 Firefox/96.0",
}
url = "http://wordnet-rdf.princeton.edu/json/pwn30/03546340-n"
data = requests.get(url, headers=headers).json()
print(data[0]["definition"])
Output:
structures collectively in which people are housed
And if you switch the endpoint to
url = "http://wordnet-rdf.princeton.edu/json_rel/03551520-n"
You'll get all the word relation data.
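Since the identifier usually comes from ImageNet as a wnid such as n03546340, a small helper can tie the two together. This is just a sketch: the string conversion is my own, and it assumes the /json/pwn30/ endpoint shown above (add the same User-Agent header as above if the server rejects the request).
import requests

def definition_for_wnid(wnid):
    # hypothetical helper: 'n03546340' -> '03546340-n' (synset offset plus part-of-speech letter)
    synset_id = '{}-{}'.format(wnid[1:], wnid[0])
    url = 'http://wordnet-rdf.princeton.edu/json/pwn30/' + synset_id
    data = requests.get(url).json()
    return data[0]['definition']

print(definition_for_wnid('n03546340'))  # same definition as in the output above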

Coinbase Pro - Get Account Hold Pagination Requests

With reference to https://docs.pro.coinbase.com/#get-account-history
HTTP REQUEST
GET /accounts//holds
I am struggling to produce Python code to get the account holds via a paginated API request, and I could not find any example of how to implement it.
Could you please advise me on how to proceed with this one?
According to Coinbase Pro documentation, pagination works like so (example with the holds endpoint):
import requests
account_id = ...
url = f'https://api.pro.coinbase.com/accounts/{account_id}/holds'
response = requests.get(url)
assert response.ok
first_page = response.json() # here is the first page
cursor = response.headers['CB-AFTER']
response = requests.get(url, params={'after': cursor})
assert response.ok
second_page = response.json() # then the next one
cursor = response.headers['CB-AFTER']
# and so on, repeat until response.json() is an empty list
You should wrap this properly into helper functions or a class, or, even better, use an existing library and save yourself a lot of time.
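A minimal sketch of such a helper, assuming the same holds endpoint and CB-AFTER cursor header used above (authentication, which the real endpoint requires, and error handling are left out, just as in the snippet above):
import requests

def iter_holds(account_id):
    # hypothetical generator: yields every hold across all pages,
    # following the CB-AFTER cursor until a page comes back empty
    url = f'https://api.pro.coinbase.com/accounts/{account_id}/holds'
    params = {}
    while True:
        response = requests.get(url, params=params)
        response.raise_for_status()
        page = response.json()
        if not page:
            break
        yield from page
        cursor = response.headers.get('CB-AFTER')
        if not cursor:
            break
        params = {'after': cursor}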

Why is my scrapy response body empty?

I've been learning Scrapy recently, and I tried to use the simplest way to fetch a response body, but I got an empty string.
Here is my code:
>>> from scrapy.http import Response
>>> r = Response('http://zenofpython.blog.163.com/blog/static/23531705420146124552782')
>>> r.body
''
>>> r.headers
{}
>>> r.status
200
I can visit the URL I used above for the Scrapy Response through a browser with no difficulty; it has rich content.
What mistake have I made here?
Another reason for your problem can be that the site requires a User-Agent header. Try it like this:
scrapy shell http://www.to.somewhere -s USER_AGENT='Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0'
You can read more here
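Outside the shell, the equivalent is the USER_AGENT setting; a sketch, assuming a regular spider (the spider name and parse body are only illustrative):
import scrapy

class UAExampleSpider(scrapy.Spider):
    # illustrative spider: the only point here is the USER_AGENT override
    name = 'ua_example'
    start_urls = ['http://zenofpython.blog.163.com/blog/static/23531705420146124552782']
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0',
    }

    def parse(self, response):
        self.logger.info('body length: %d', len(response.body))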
You're supposed to fetch a Request and get a Response object in return.
Try doing:
r = Request(url='http://zenofpython.blog.163.com/blog/static/23531705420146124552782')
fetch(r)
in the scrapy shell (import Request from scrapy.http, or use scrapy.Request, if it isn't already defined there). fetch() populates the shell's response variable with the downloaded Response object, which you can then inspect:
print(response.body)