I've recently been learning Scrapy. I tried the simplest way I could find to fetch a response body, but I got an empty string.
Here is my code:
>>> from scrapy.http import Response
>>> r = Response('http://zenofpython.blog.163.com/blog/static/23531705420146124552782')
>>> r.body
''
>>> r.headers
{}
>>> r.status
200
I can visit the URL I passed to Response above in a browser with no difficulty, and it has rich content.
What mistake have I made here?
Another possible reason for your problem is that the site requires a User-Agent header. Try it like this:
scrapy shell http://www.to.somewhere -s USER_AGENT='Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0'
You can read more here
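If you are running a spider rather than the shell, the same idea can be expressed by overriding USER_AGENT in the spider's custom_settings. A minimal sketch (the spider name and URL below are placeholders, and the user agent string is the one from the shell command above):

import scrapy


class UaSpider(scrapy.Spider):
    # hypothetical spider used only to illustrate overriding USER_AGENT
    name = "ua_example"
    start_urls = ["http://www.to.somewhere"]  # placeholder URL
    custom_settings = {
        "USER_AGENT": "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    }

    def parse(self, response):
        # if the site only rejected Scrapy's default user agent, the body should now be non-empty
        self.logger.info("body length: %d", len(response.body))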
Constructing a Response directly like that doesn't download anything; you're supposed to fetch a Request and get a Response object in return.
Try doing:
r = Request(url='http://zenofpython.blog.163.com/blog/static/23531705420146124552782')
fetch(r)
on scrapy shell and you'll be able to get the result as a Response object.
print(response.body)
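Outside the shell, the same idea in a minimal spider sketch (the spider name is arbitrary; Scrapy downloads the scheduled Request and passes the resulting Response to parse):

import scrapy


class BlogSpider(scrapy.Spider):
    name = "blog_example"
    start_urls = ["http://zenofpython.blog.163.com/blog/static/23531705420146124552782"]

    def parse(self, response):
        # the Response here was actually downloaded, so its body is no longer empty
        print(response.status)
        print(response.body[:200])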
I am trying to use the OpenMeteo API (https://open-meteo.com/en/docs/historical-weather-api) to download CSVs via a script. Using their API Builder and their clickable site, I can do it for a single site just fine.
In trying to write a function that accomplishes this in a loop via the requests library, I am stumped. I cannot get the JSON data out of the requests.models.Response object and into JSON, or better yet CSV/pandas, format.
I cannot parse the response from the requests module into anything. The request appears to succeed (a 200 response) and to have data, but I cannot get the data out of the response object!
import requests

latitude = 60.358
longitude = -148.939
request_txt = ("https://archive-api.open-meteo.com/v1/archive?latitude=" + str(latitude)
               + "&longitude=" + str(longitude)
               + "&start_date=1959-01-01&end_date=2023-02-10&models=era5_land&daily=temperature_2m_max,temperature_2m_min,temperature_2m_mean,shortwave_radiation_sum,precipitation_sum,rain_sum,snowfall_sum,et0_fao_evapotranspiration&timezone=America%2FAnchorage&windspeed_unit=ms")
r = requests.get(request_txt).json()
The URL needs format=csv added as a parameter. StringIO is then used to read the downloaded text as an in-memory file object.
from io import StringIO
import pandas as pd
import requests
latitude = 60.358
longitude = -148.939
url = ("https://archive-api.open-meteo.com/v1/archive?"
f"latitude={latitude}&"
f"longitude={longitude}&"
"start_date=1959-01-01&"
"end_date=2023-02-10&"
"models=era5_land&"
"daily=temperature_2m_max,temperature_2m_min,temperature_2m_mean,"
"shortwave_radiation_sum,precipitation_sum,rain_sum,snowfall_sum,"
"et0_fao_evapotranspiration&"
"timezone=America%2FAnchorage&"
"windspeed_unit=ms&"
"format=csv")
with requests.Session() as request:
response = request.get(url, timeout=30)
if response.status_code != 200:
print(response.raise_for_status())
df = pd.read_csv(StringIO(response.text), sep=",", skiprows=2)
print(df)
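If you would rather keep the default JSON output instead of adding format=csv, the same data can be loaded into pandas from the parsed dict. A short sketch, assuming the response contains the documented daily block and reusing the url built above:

import pandas as pd
import requests

json_url = url.replace("&format=csv", "")  # same query, default JSON output
data = requests.get(json_url, timeout=30).json()

daily = pd.DataFrame(data["daily"])  # one column per requested daily variable plus 'time'
print(daily.head())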
I am trying to scrape job profiles from the British Petroleum website. Initially robots.txt was not allowing the scrape, but after I set ROBOTSTXT_OBEY = False it started working; however, it is not scraping the whole page. Below is my code:
import scrapy

class exxonmobilSpider(scrapy.Spider):
    name = "bp"
    start_urls = ['https://www.bp.com/en/global/corporate/careers/search-and-apply.html?query=data+scientist']

    def parse(self, response):
        name = response.xpath('//h3[@class="Hit_hitTitle__3MFk3"]')
        print(name)
        print(len(name))
As you can see in the image, that XPath matches the h3 tags, but when I run the code I get an empty list. Later I cross-checked by printing all the li or div tags and counting them, and I found that only half or some of the tags were being scraped. Does anyone have any idea why Scrapy scrapes only part of the page and not the full page? Attaching the comparison image too.
You can see the total number of li tags is 55, but now check the length of the response variable "name".
In the hope that the OP will include a minimal reproducible example in their next question, here is a way of getting those jobs. Bear in mind the jobs are pulled from an API by JavaScript in the page, so you need to either use Splash/scrapy-playwright or scrape the API directly. We will do the latter. The API URL was obtained from the browser's Dev tools, Network tab.
import scrapy


class BpscrapeSpider(scrapy.Spider):
    name = 'bpscrape'
    allowed_domains = ['algolianet.com', 'bp.com']

    def start_requests(self):
        headers = {
            'x-algolia-application-id': 'RF87OIMXXP',
            'x-algolia-api-key': 'f4f167340049feccfcf6141fb7b90a5d',
            'Origin': 'https://www.bp.com',
            'content-type': 'application/x-www-form-urlencoded',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'
        }
        api_url = 'https://rf87oimxxp-3.algolianet.com/1/indexes/*/queries?x-algolia-agent=Algolia%20for%20JavaScript%20(4.9.1)%3B%20Browser%3B%20JS%20Helper%20(3.4.4)%3B%20react%20(17.0.2)%3B%20react-instantsearch%20(6.11.0)'
        payload = '{"requests":[{"indexName":"candidatematcher_bp_navapp_prod","params":"highlightPreTag=%3Cais-highlight-0000000000%3E&highlightPostTag=%3C%2Fais-highlight-0000000000%3E&filters=type%3A%20Professionals&hitsPerPage=100&query=data%20scientist&maxValuesPerFacet=20&page=0&facets=%5B%22country%22%2C%22group%22%5D&tagFilters="}]}'
        yield scrapy.Request(
            url=api_url,
            headers=headers,
            body=payload,
            callback=self.parse,
            method="POST")

    def parse(self, response):
        data = response.json()['results'][0]['hits']
        for x in data:
            yield x
Run it with scrapy crawl bpscrape -o bpdsjobs.json to get a JSON file with all 26 jobs.
You will need to do some data cleaning, as the JSON response is quite comprehensive and contains a lot of HTML tags etc.
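One possible cleaning step (a sketch, not part of the original spider) is to strip the HTML tags from the string fields of each hit with w3lib, which Scrapy already depends on:

from w3lib.html import remove_tags

def clean_hit(hit):
    # strip HTML tags from every string field of one Algolia hit;
    # non-string values (numbers, lists, nested dicts) are passed through untouched
    return {key: remove_tags(value) if isinstance(value, str) else value
            for key, value in hit.items()}

Then yield clean_hit(x) instead of x in parse().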
For Scrapy documentation, please see https://docs.scrapy.org/en/latest/
I need to send a request to WordNet, knowing the tar ID (taken from ImageNet), to get the lemma assigned to that tar (e.g., I have a tar with houses; I need to send the request and obtain the lemma written on WordNet, "living accommodation").
I used requests.get() first with the URL, then BeautifulSoup's parser.
I get the parsed HTML back, but there is no reference to the "body", meaning the part with the noun and the hypernyms/hyponyms.
Can you tell me how to get that part of WordNet parsed along with the rest of the page?
This is the URL I'm working on: http://wordnet-rdf.princeton.edu/pwn30/03546340-n
Just use the JSON endpoint.
For example:
import requests
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:96.0) Gecko/20100101 Firefox/96.0",
}
url = "http://wordnet-rdf.princeton.edu/json/pwn30/03546340-n"
data = requests.get(url, headers=headers).json()
print(data[0]["definition"])
Output:
structures collectively in which people are housed
And if you switch the endpoint to
url = "http://wordnet-rdf.princeton.edu/json_rel/03551520-n"
You'll get all the word relation data.
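The exact shape of the json_rel payload isn't shown here, so a quick sketch to inspect it before picking out hypernyms/hyponyms (reusing the headers defined above):

import json
import requests

rel_url = "http://wordnet-rdf.princeton.edu/json_rel/03551520-n"
rel_data = requests.get(rel_url, headers=headers).json()

# pretty-print the start of the payload to see which relation fields are available
print(json.dumps(rel_data, indent=2)[:2000])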
I am trying the following code to read historical CSV data from Yahoo Finance:
import datetime
import time
from datetime import timedelta as td

from bs4 import BeautifulSoup

per1 = str(int(time.mktime((datetime.datetime.today() - td(days=365)).timetuple())))
per2 = str(int(time.mktime((datetime.datetime.today()).timetuple())))
url = 'https://query1.finance.yahoo.com/v7/finance/download/MSFT?period1=' + per1 + '&period2=' + per2 + '&interval=1d&events=history&crumb=OQg/YFV3fvh'
The URL can be seen when you go to Yahoo Finance, type in a ticker and hover over the "Download Data" button.
I get an authentication error, which I believe is due to a missing cookie, so I tried the following:
import requests
ses = requests.Session()
url1 = 'https://finance.yahoo.com/quote/MSFT/history?p=MSFT'
ses.get(url1)
soup = BeautifulSoup(ses.get(url).content)
print(soup.prettify())
This time I get an "incorrect cookie" error.
Can someone suggest how to work around this?
The crumb parameter of the query string keeps changing, perhaps with each browser session. So when you copy its value from the browser, close it, and then use it in another instance of the browser, it has already expired.
It should come as no surprise, then, that by the time you use it in your requests session, the server doesn't recognize the cookie value and returns an error.
Step 1
Studying the Network tab in any browser will help. In this particular case, the crumb is probably generated when you click on a ticker on the main page, so you'll have to fetch that URL first.
import requests

s = requests.Session()  # shared session so the cookie set here is reused for the CSV downloads
tickers = ('000001.SS', 'NKE', 'MA', 'SBUX')
url = 'https://finance.yahoo.com/quote/{0}?p={0}'.format(tickers[0])
r = s.get(url, headers=req_headers)
This URL needs to be fetched only once. So it doesn't matter which ticker you use for this.
Step 2
The response returned by the server contains the value passed to the crumb parameter in the query string when you download the CSV file.
However, it's contained in the script tag of the page returned by the previous request. This means you can't use BeautifulSoup alone to extract the crumb value.
I initially tried re to extract it from the script tag's text, but for some reason I wasn't able to, so I used json to parse it instead.
import json
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(r.content, 'lxml')
script_tag = soup.find(text=re.compile('crumb'))
response_dict = json.loads(script_tag[script_tag.find('{"context":'):script_tag.find('}}}};') + 4])
crumb = response_dict['context']['dispatcher']['stores']['CrumbStore']['crumb']
Note that BeautifulSoup is required to extract the script element's contents to be later passed to json to parse it into a Python dict object.
I had to use pprint to print the resulting dict to a file to see exactly where the crumb value was stored.
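That inspection step looked roughly like this (a sketch; the output file name is arbitrary):

from pprint import pprint

# dump the parsed dict to a file so the nested path to CrumbStore is easy to find
with open('response_dict.txt', 'w') as out:
    pprint(response_dict, stream=out)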
Step 3
The final URL that fetches the CSV file looks like this:
for ticker in tickers:
    csv_url = 'https://query1.finance.yahoo.com/v7/finance/download/{0}?period1=1506656676&period2=1509248676&interval=1d&events=history&crumb={1}'.format(ticker, crumb)
    r = s.get(csv_url, headers=req_headers)
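To actually keep the downloaded data, a write step can be added at the end of the same loop; a minimal sketch (the per-ticker file naming is a hypothetical choice):

for ticker in tickers:
    csv_url = ('https://query1.finance.yahoo.com/v7/finance/download/'
               '{0}?period1=1506656676&period2=1509248676&interval=1d'
               '&events=history&crumb={1}').format(ticker, crumb)
    r = s.get(csv_url, headers=req_headers)
    # save the raw CSV bytes to a per-ticker file, e.g. NKE.csv
    with open('{}.csv'.format(ticker), 'wb') as f:
        f.write(r.content)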
Result
Here are the first few lines of one of the downloaded files:
Date,Open,High,Low,Close,Adj Close,Volume
2017-09-29,3340.311035,3357.014893,3340.311035,3348.943115,3348.943115,144900
2017-10-09,3403.246094,3410.169922,3366.965088,3374.377930,3374.377930,191700
2017-10-10,3373.344971,3384.025879,3358.794922,3382.988037,3382.988037,179400
Note:
I used appropriate headers in both the requests. So if you skip that part and don't get the desired results, you may have to include them as well.
req_headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.8',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
}
I have the following code that works, and I get the file output.txt. I would like the output file to say "success" when the request works and to give the error code when it doesn't.
import requests
import json
f = open('output.txt', 'w')
url = 'https://webapi.teamviewer.com/api/v1/account'
payload = {'name': 'alias', 'email': 'user@teamviewer.com'}
headers = {"content-type": "application/json", "Authorization": "Bearer myuser token"}
r = requests.put(url, data=json.dumps(payload), headers=headers)
f.write(r.text)
f.close()
TeamViewer HTTP response codes are:
200 – OK: Used for successful GET, POST and DELETE.
204 – No Content: Used for PUT to indicate that the update succeeded, but no content is included in the response.
400 – Bad Request: One or more parameters for this function is either missing, invalid or unknown. Details should be included in the returned JSON.
401 – Unauthorized: Access token not valid (expired, revoked, …) or not included in the header.
403 – Forbidden / Rate Limit Reached: IP blocked or rate limit reached.
500 – Internal Server Error: Some (unexpected) error on the server. The same request should work if the server works as intended.
You can get the result and error code from your response (assuming the TeamViewer API is well behaved):

r = requests.put(url, data=json.dumps(payload), headers=headers)
if r.status_code in (200, 204):  # per the list above, a successful PUT returns 204 No Content
    f.write('success')
else:
    f.write('{0}: {1}'.format(r.status_code, r.text))
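An equivalent pattern, if you prefer exceptions over checking status codes yourself, is requests' built-in raise_for_status (a sketch using the same url, payload and headers as above):

try:
    r = requests.put(url, data=json.dumps(payload), headers=headers)
    r.raise_for_status()  # raises requests.HTTPError for any 4xx/5xx response
    f.write('success')
except requests.HTTPError as exc:
    # the exception message already contains the status code and reason
    f.write('error: {0}'.format(exc))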