Error when connecting to wunderground API - pandas

I am trying to scrape wind speed data for different UK weather stations from the Wunderground site. I assume they have an API; I just have a hard time connecting to it.
Here's the XHR link I use:
https://api.weather.com/v1/location/EGNV:9:GB/observations/historical.json?apiKey=6532d6454b8aa370768e63d6ba5a832e&units=e&startDate=20150101&endDate=20150131
This is the data I would like, namely the wind speed table at the bottom of this page:
https://www.wunderground.com/history/monthly/gb/darlington/EGNV/date/2015-1
My code is pretty simple: I first define the headers, and my get_data function returns the response as JSON.
In my main block I append the data to a DataFrame and print it.
from bs4 import BeautifulSoup
import pandas as pd
import requests
import urllib
from urllib.request import urlopen

headers = {
    ':authority': 'api.weather.com',
    #':path': '/v1/location/EGNV:9:GB/observations/historical.json?apiKey=6532d6454b8aa370768e63d6ba5a832e&units=e&startDate=20150101&endDate=20150131',
    ':scheme': 'https',
    'accept': 'application/json, text/plain, */*',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-GB,en;q=0.9,en-US;q=0.8,da;q=0.7',
    'origin': 'https://www.wunderground.com',
    #'apiKey': '6532d6454b8aa370768e63d6ba5a832e',
    'referer': 'https://www.wunderground.com/history/monthly/gb/darlington/EGNV/date/2015-1',
    'sec-fetch-mode': 'cors',
    'sec-fetch-site': 'cross-site',
    'user-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.87 Safari/537.36'
}

def get_data(response):
    df = response.json()
    return df

if __name__ == "__main__":
    date = pd.datetime.now().strftime("%d-%m-%Y")
    api_key = "6532d6454b8aa370768e63d6ba5a832e"
    start_date = "20150101"
    end_date = "20150131"
    urls = [
        "https://api.weather.com/v1/location/EGNV:9:GB/observations/historical.json?apiKey=" + api_key + "&units=e&startDate=" + start_date + "&endDate=" + end_date
    ]
    df = pd.DataFrame()
    for url in urls:
        res = requests.get(url, headers=headers)
        data = get_data(res)
        df = df.append(data)
    print(df)
The error I get:
SSLError: HTTPSConnectionPool(host='api.weather.com', port=443): Max retries exceeded with url: /v1/location/EGNV:9:GB/observations/historical.json?apiKey=6532d6454b8aa370768e63d6ba5a832e&units=e&startDate=20150101&endDate=20150131
(Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'tls_process_server_certificate', 'certificate verify failed')])")))
Update:
Even when I skip the API and instead scrape the page with BS4, I still get denied access. I'm not sure why, or how they can detect my scraper.

I solved it.
If I add verify=False to my requests.get() call, I manage to get around the error.
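For reference, a minimal sketch of that workaround. Note that verify=False disables TLS certificate verification entirely, so it should only be used for throwaway scraping; the urllib3 line just silences the resulting warning:

import requests
import urllib3

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

url = ("https://api.weather.com/v1/location/EGNV:9:GB/observations/"
       "historical.json?apiKey=6532d6454b8aa370768e63d6ba5a832e"
       "&units=e&startDate=20150101&endDate=20150131")

# verify=False skips certificate validation, which is what was failing
# with "certificate verify failed" during the handshake
res = requests.get(url, verify=False)
print(res.json())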

Related

How to get Scrapy to parse CSS

I am following this guide to scrape movie titles from my local cinema website. I am using Scrapy Spider and CSS parsing to get this done. Within the HTML for the site, each movie title is constructed like this:
<div class="col-md-12 movie-description">
<h2>Minions: The Rise of Gru<h2>
...
Here is my code that attempts to scrape this info:
import scrapy

class CinemaSpider(scrapy.Spider):
    name = "cinema"
    allowed_domains = ["cannonvalleycinema10.com"]
    start_urls = ["https://cannonvalleycinema10.com/"]

    def parse(self, response):
        movie_names = response.css(".col-md-12.movie-description h2::text").extract()
        for movie_name in movie_names:
            yield {
                'name': movie_name
            }
The cinema's website is here. I have tried all sorts of selector combinations to get the titles I'm looking for added to my JSON file, but I can't figure it out.
If it helps, I am running this code:
scrapy runspider .\cinema_scrape.py -o movies.json
I am in the proper directory, too.
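A quick way to confirm whether the titles are in the static HTML at all is Scrapy's interactive shell (a debugging sketch, run from a terminal):

scrapy shell "https://cannonvalleycinema10.com/"
>>> response.css(".col-md-12.movie-description h2::text").getall()

If that returns an empty list, the titles are injected by JavaScript and the spider never sees them in the raw response.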
The page is dynamically loaded, so you have to combine Scrapy with the JSON endpoint the page calls:
import scrapy

class TestSpider(scrapy.Spider):
    name = 'test'
    url = 'https://cabbtheatres.intensify-solutions.com/embed/ajaxGetRepertoire'

    cookies = {
        'PHPSESSID': 'i8l12572hvd3a702d4nfj3vbg0',
    }

    headers = {
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'Accept-Language': 'en-US,en;q=0.9',
        'Connection': 'keep-alive',
        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        # 'Cookie': 'PHPSESSID=i8l12572hvd3a702d4nfj3vbg0',
        'Origin': 'https://cabbtheatres.intensify-solutions.com',
        'Referer': 'https://cabbtheatres.intensify-solutions.com/embed?location=3663456',
        'Sec-Fetch-Dest': 'empty',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Site': 'same-origin',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
        'X-Requested-With': 'XMLHttpRequest',
        'sec-ch-ua': '".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"',
        'sec-ch-ua-mobile': '?0',
        'sec-ch-ua-platform': '"Windows"',
    }

    data = {
        'location': '3663456',
        'date': '2022-07-30',
        'lang': 'en',
        'soon': '',
    }

    def start_requests(self):
        yield scrapy.FormRequest(
            url=self.url,
            method='POST',
            formdata=self.data,
            headers=self.headers,
            callback=self.parse_item,
        )

    def parse_item(self, response):
        detail = response.json()
        titles = detail['data']
        for name in titles:
            title = name['title']
            print(title)
output:
Minions: The Rise of Gru
Thor Love and Thunder
DC League of Super-Pets
Elvis(2022)
Mrs. Harris Goes to Paris
Where the Crawdads Sing
Top Gun: Maverick
Nope
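Since the question runs scrapy runspider with -o movies.json, it is probably worth yielding items instead of printing them, so the feed export picks them up. A small variation on the parse_item above:

def parse_item(self, response):
    detail = response.json()
    for name in detail['data']:
        # yielded dicts are written to movies.json by the -o feed export
        yield {'name': name['title']}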

HLS Stream OneTimePassword (GetHLSToken from 511nj.org)

For research purposes, we analyzed and extracted large-scale traffic data from 511nj cameras. Starting in 2020, the site adds an OTP (one-time password) to the video link, which the website requests from the API https://511nj.org/api/client/camera/getHlsToken?Id=2&rnd=202205281151, as observed in the network traffic. However, the website itself gets a 200 response with a valid token, while accessing the API myself gets a 401, which indicates an unauthorized request. What confuses me is why I can get the camera feed with the OTP through the website but can't access the API directly. Is there a solution?
Failed attempt:
requests.get('https://511nj.org/api/client/camera/getHlsToken?Id=2&rnd='+datetime.datetime.now().strftime('%Y%m%d%H%M')).json()
20220601 Update: I modified the request based on @Brad's comment. However, the error message now turns into "An error has occurred". Please help!
def GetOTP():
    import requests
    import json
    import datetime

    Now = datetime.datetime.now()
    DT = Now.strftime('%Y%m%d%I%M%S')
    DT24 = Now.strftime('%Y%m%d%H%M')
    Enigma = dict(zip(map(str, range(0, 10)), ('2', '1', '*', '3', '4', '5', '6', '7', '8', '9')))

    cookies = {
        'ReloadVersion': '112',
        '_ga': 'GA1.2.335422978.1653768841',
        '_gid': 'GA1.2.721252567.1653768841',
    }

    headers = {
        'authority': '511nj.org',
        'accept': 'application/json, text/plain, */*',
        'accept-language': 'en-US,en;q=0.9',
        'referer': 'https://511nj.org/Scripts/Directives/HLSPlayer.html?v=V223',
        'responsetype': "".join([Enigma[key] for key in DT[4:6]])+'$'.join([Enigma[key] for key in DT[6:8]])+'$'.join([Enigma[key] for key in DT[:4]])+'#'+Enigma[DT[9]]+"!".join([Enigma[key] for key in DT[10:12]])+"!"+DT[-2:]+"#PK",
        'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="102", "Google Chrome";v="102"',
        'sec-ch-ua-mobile': '?0',
        'sec-ch-ua-platform': '"macOS"',
        'sec-fetch-dest': 'empty',
        'sec-fetch-mode': 'cors',
        'sec-fetch-site': 'same-origin',
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.61 Safari/537.36',
    }

    params = {
        'Id': '2',
        'rnd': DT24,
    }

    Response = requests.get('https://511nj.org/api/client/camera/getHlsToken?Id=2&rnd=' + DT24, headers=headers, cookies=cookies, params=params).json()
    print("".join([Enigma[key] for key in DT[4:6]])+'$'+''.join([Enigma[key] for key in DT[6:8]])+'$'+''.join([Enigma[key] for key in DT[:4]])+'#'+Enigma[DT[9]]+"!"+''.join([Enigma[key] for key in DT[10:12]])+"!"+DT[-2:]+"#PK")
    print(params['rnd'])
    print(Response)
    return Response['Data']
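One way to debug a 401 like this is to diff exactly what Python sends against what the browser sends. A sketch using requests' PreparedRequest, reusing the headers, cookies, and params built above:

import requests

req = requests.Request(
    'GET',
    'https://511nj.org/api/client/camera/getHlsToken',
    params=params,
    headers=headers,
    cookies=cookies,
)
prepared = req.prepare()
# print the exact header set that would go on the wire, then compare it
# field by field with the request captured in the browser's network tab
for key, value in prepared.headers.items():
    print(key, ':', value)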

Encountered suspected TLS fingerprint detection when using a Python crawler against Amazon's zip-code-change endpoint

When I use Python to send a simulated request, carrying cookies, to
https://www.amazon.com/gp/delivery/ajax/address-change.html
the response body is just \n\n. But when I route the same HTTP message through Charles as a man-in-the-middle proxy, it responds normally.
Similarly, sending the same simulated request from Node.js also gets a normal response. I tried three different network request libraries in Python (requests, httpx, and aiohttp) and got the same result.
Given the \n\n response, I narrowed the problem down to the TLS handshake packet Python sends. Even after modifying urllib3.util.ssl_.DEFAULT_CIPHERS, it still returns \n\n.
Comparing Wireshark captures shows that, in addition to the cipher suites, the Signature Algorithms part of the Client Hello is also fixed: all three Python request libraries send the same one, and it differs from the Client Hello produced by curl, Node.js, Charles, and Chrome.
I want to make the Signature Algorithms part of Python's TLS Client Hello match Chrome's. After debugging the source code of the Python request libraries, it looks like the signature algorithms are controlled inside the OpenSSL shared library.
This problem has troubled me for a long time. I hope it can be resolved; thank you very much.
import requests
from aiohttp import ClientSession
import httpx

cookies = {
    'csm-hit': 'tb:s-B8ZK0QTPQCGWKHY3QDT5|1620287052879&t:1620287054928&adb:adblk_no',
    'i18n-prefs': 'USD',
    'lc-main': 'en_US',
    'session-id': '143-0501748-3847056',
    'session-id-time': '2082787201l',
    'session-token': 'NxLWWkB7RnpUvmQEl7OcUzk44D9PnlSt/swrqvnSwBvry9WAPSeQt5U2hVCa7IeEEDwj+qzLHwrNhCnA+7pN8H7HELP5WYZuPjtTJ1d8jrTxLueLIQB+wh+3e+1c1vRrfYDa4FTsdm6jN2QR55zq0ybhNJt0jrXCTdlaktZ+e0tHPIjQnCsu1lidMvyOksR+',
    'skin': 'noskin',
    'sp-cdn': 'L5Z9:CN',
    'ubid-main': '134-5202210-0613519',
}

headers = {
    'Host': 'www.amazon.com',
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 UBrowser/6.2.4094.1',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-language': 'zh-CN,zh;q=0.9',
    'anti-csrftoken-a2z': 'gBtJDelwICZ60r+pGBgwbzjAf4Wr+LTRIoyWRyMAAAAMAAAAAGC1xeJyYXcAAAAA',
    'content-type': 'application/x-www-form-urlencoded;charset=utf-8',
}

data = 'locationType=LOCATION_INPUT&zipCode=90001&storeContext=generic&deviceType=web&pageType=Gateway&actionSource=glow&almBrandId=undefined'
url = 'https://www.amazon.com/gp/delivery/ajax/address-change.html'
# url = 'https://www.python-spider.com/nginx'
your_proxy_url = 'http://127.0.0.1:8888'
# your_proxy_url = ''

with httpx.Client(
        # http2=True,
        # proxies=your_proxy_url,
        verify=False) as client:
    # This HTTP request will be tunneled instead of forwarded.
    response = client.post(url=url, headers=headers, cookies=cookies, data=data)
    print(response.status_code)
    print(response.text)

# cert='/Users/yangyanhui/lbs/spider/amazon/amazon_cookie_pool/charles-ssl-proxying-certificate.pem'
response = requests.post(url, headers=headers, cookies=cookies, data=data)
print(response.status_code)
print(response.text)

import aiohttp, asyncio
# asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())  # add this line on Windows

async def main():  # aiohttp must be used inside an async function
    async with ClientSession(cookies=cookies, headers=headers) as session:
        async with session.post(url, data=data,
                                # proxy=your_proxy_url,
                                verify_ssl=False) as resp:
            print(await resp.text())
            print(resp.status)

loop = asyncio.get_event_loop()
loop.run_until_complete(main())
curl -H 'Host: www.amazon.com' -H 'Cookie: csm-hit=tb:s-B8ZK0QTPQCGWKHY3QDT5|1620287052879&t:1620287054928&adb:adblk_no; i18n-prefs=USD; lc-main=en_US; session-id=143-0501748-3847056; session-id-time=2082787201l; session-token=NxLWWkB7RnpUvmQEl7OcUzk44D9PnlSt/swrqvnSwBvry9WAPSeQt5U2hVCa7IeEEDwj+qzLHwrNhCnA+7pN8H7HELP5WYZuPjtTJ1d8jrTxLueLIQB+wh+3e+1c1vRrfYDa4FTsdm6jN2QR55zq0ybhNJt0jrXCTdlaktZ+e0tHPIjQnCsu1lidMvyOksR+; skin=noskin; sp-cdn=L5Z9:CN; ubid-main=134-5202210-0613519' -H 'user-agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 UBrowser/6.2.4094.1' -H 'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9' -H 'accept-language: zh-CN,zh;q=0.9' -H 'anti-csrftoken-a2z: gBtJDelwICZ60r+pGBgwbzjAf4Wr+LTRIoyWRyMAAAAMAAAAAGC1xeJyYXcAAAAA' -H 'content-type: application/x-www-form-urlencoded;charset=utf-8' --data-binary "locationType=LOCATION_INPUT&zipCode=90001&storeContext=generic&deviceType=web&pageType=Gateway&actionSource=glow&almBrandId=undefined" --compressed 'https://www.amazon.com/gp/delivery/ajax/address-change.html'
Maybe consider using Golang, which has a package that can modify your TLS fingerprint.
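If staying in Python is preferred, another option (an untested sketch, assuming the third-party curl_cffi package, installed with pip install curl_cffi) is to impersonate a browser's TLS fingerprint directly, reusing the url, headers, cookies, and data from the question:

# curl_cffi ships browser Client Hello profiles, including the
# signature_algorithms extension that requests/httpx/aiohttp cannot change
from curl_cffi import requests as cffi_requests

response = cffi_requests.post(
    url,
    headers=headers,
    cookies=cookies,
    data=data,
    impersonate='chrome110',  # mimic Chrome's TLS fingerprint
)
print(response.status_code)
print(response.text)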

Logging in to Amazon using BeautifulSoup

I am working on a script to scrape some information from Amazon's Prime Now grocery website. However, I am stumbling on the first step, in which I am attempting to start a session and log in to the page.
I am fairly sure the issue is in how I build the data object. There are 10 inputs in the HTML, but the data object I have constructed only has 9, the missing one being the submit button. I am not entirely sure whether that is relevant, as this is my first time working with BeautifulSoup.
Any help would be greatly appreciated! All of my code is below, with the last if/else statement confirming that the login has not worked when I run it.
import requests
from bs4 import BeautifulSoup

# define URL where login form is located
site = 'https://primenow.amazon.com/ap/signin?clientContext=133-1292951-7489930&openid.return_to=https%3A%2F%2Fprimenow.amazon.com%2Fap-post-redirect%3FsiteState%3DclientContext%253D131-7694496-4754740%252CsourceUrl%253Dhttps%25253A%25252F%25252Fprimenow.amazon.com%25252Fhome%252Csignature%253DIFISh0byLJrJApqlChzLdkc2FCEj3D&openid.identity=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.assoc_handle=amzn_houdini_desktop_us&openid.mode=checkid_setup&marketPlaceId=A1IXFGJ6ITL7J4&openid.claimed_id=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&pageId=amzn_pn_us&openid.ns=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0&openid.pape.max_auth_age=3600'

# initiate session
session = requests.Session()

# define session headers
session.headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.61 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Referer': site
}

# get login page
resp = session.get(site)
html = resp.text

# get BeautifulSoup object of the html of the login page
soup = BeautifulSoup(html, 'lxml')

# scrape login page to get all the needed inputs required for login
data = {}
form = soup.find('form')
for field in form.find_all('input'):
    try:
        data[field['name']] = field['value']
    except KeyError:
        # inputs without a name/value pair (e.g. the submit button) are skipped
        pass

# add username and password to the data for post request
data['email'] = 'my email'
data['password'] = 'my password'

# submit post request with username / password and other needed info
post_resp = session.post(site, data=data)
post_soup = BeautifulSoup(post_resp.content, 'lxml')

if post_soup.find_all('title')[0].text == 'Your Account':
    print('Login Successful')
else:
    print('Login Failed')
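One thing worth double-checking, independent of the missing submit input: the sign-in form usually declares its own target in an action attribute, so posting back to site may send the credentials to the wrong URL. A hedged sketch of that adjustment, reusing form and session from the code above:

# post to the form's declared action URL, falling back to the page URL
# if the form has no action attribute
action = form.get('action') or site
post_url = requests.compat.urljoin(site, action)
post_resp = session.post(post_url, data=data)

If the submit button carries a name, browsers include its name/value pair in the POST as well; adding it to data would mimic them more closely.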

Scrapy | How to get a response from a request without urllib?

I believe there is a better way to get a response using scrapy.Request than what I currently do:
...
import urllib.request
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
...

class MatchResultsSpider(scrapy.Spider):
    name = 'match_results'
    allowed_domains = ['site.com']
    start_urls = ['url.com']

    def get_detail_page_data(self, detail_url):
        req = urllib.request.Request(
            detail_url,
            data=None,
            headers={
                'user_agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
                'accept': 'application/json, text/javascript, */*; q=0.01',
                'referer': 'site.com',
            }
        )
        page = urllib.request.urlopen(req)
        response = HtmlResponse(url=detail_url, body=page.read())
        target = Selector(response=response)
        return target.xpath('//dd[@data-first_name]/text()').extract_first()
I get all the information inside the parse function, but in one place I need to get a little piece of data from inside a detail page.
# Lineups
lineup_team_tables = lineups_container.xpath('.//tbody')
for i, table in enumerate(lineup_team_tables):
    # lineup players
    line_up = []
    lineup_players = table.xpath('./tr[not(contains(string(), "Coach"))]')
    for lineup_player in lineup_players:
        line_up_entries = {}
        lineup_player_url = lineup_player.xpath('.//a/@href').extract_first()
        line_up_entries['player_id'] = get_id(lineup_player_url)
        line_up_entries['jersey_num'] = lineup_player.xpath('./td[@class="shirtnumber"]/text()').extract_first()
        abs_lineup_player_url = response.urljoin(lineup_player_url)
        line_up_entries['position_id_detail'] = self.get_detail_page_data(abs_lineup_player_url)
        line_up.append(line_up_entries)
    # team_lineup['line_up'] = line_up
    self.write_to_scuard(i, 'line_up', line_up)
Can I get data from the detail page using scrapy.Request(detail_url, callback_func)?
Thanks for your help!
Too much extra code. Use the standard Scrapy parsing scheme:
class ********(scrapy.Spider):
    name = '*******'
    domain = '****'
    allowed_domains = ['****']
    start_urls = ['https://******']
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64;AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36',
        'DEFAULT_REQUEST_HEADERS': {
            'ACCEPT': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
            'ACCEPT_ENCODING': 'gzip, deflate, br',
            'ACCEPT_LANGUAGE': 'en-US,en;q=0.9',
            'CONNECTION': 'keep-alive',
        }
    }

    def parse(self, response):
        # response already holds the HTML of start_urls here
        yield scrapy.Request(url, callback=self.parse_details)

Then you can parse further (nested) and come back to the parse callback:

def parse_details(self, response):
    ************
    yield scrapy.Request(url_2, callback=self.parse)
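Applied to the lineup case in the question, the partially built entry can travel with the request via cb_kwargs (Scrapy 1.7+; older versions use request.meta), so the detail-page callback can finish it instead of making a blocking urllib call. A sketch reusing the question's own names (line_up_entries, abs_lineup_player_url):

def parse(self, response):
    # ... build line_up_entries as in the question, then hand the
    # partial item to the detail request instead of calling urllib:
    yield scrapy.Request(
        abs_lineup_player_url,
        callback=self.parse_player_detail,
        cb_kwargs={'entry': line_up_entries},
    )

def parse_player_detail(self, response, entry):
    # finish the item with the detail-page field and emit it
    entry['position_id_detail'] = response.xpath(
        '//dd[@data-first_name]/text()').get()
    yield entry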