Suspected TLS-fingerprint risk control when using a Python crawler to call Amazon's zip-code-change endpoint - ssl

When I use Python to send a simulated request (carrying cookies) to
https://www.amazon.com/gp/delivery/ajax/address-change.html
the response body is just \n\n. But when I route the same HTTP message through Charles as a man-in-the-middle proxy, it responds normally.
Likewise, when I send the same simulated request from Node.js, I also get a normal response. I tried three different network request libraries in Python: requests, httpx, and aiohttp all got the same result.
Because of the \n\n response, I narrowed the problem down to the TLS handshake sent by Python. Even after modifying urllib3.util.ssl_.DEFAULT_CIPHERS, it still returns \n\n.
Comparing Wireshark captures shows that, besides the cipher suites, the signature_algorithms part of the Client Hello is also fixed: all three Python request libraries send the same signature_algorithms list, and it differs from the TLS Client Hello sent by curl, Node.js, Charles, and Chrome.
I want to make the signature_algorithms part of Python's TLS Client Hello match Chrome's. After debugging the source code of the Python request libraries, it looks like the signature algorithms are controlled inside the OpenSSL shared library (.so) itself.
This problem has troubled me for a long time. I hope it can be resolved, thank you very much.
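For reference, this is roughly how the cipher suites can be overridden on a requests session (a minimal sketch; the cipher string below is only illustrative, not Chrome's actual list). Note that even with this, the signature_algorithms extension stays at the OpenSSL default, because Python's ssl module does not expose a way to set it:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.ssl_ import create_urllib3_context

# Illustrative cipher string -- not Chrome's real cipher ordering.
CIPHERS = 'ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384'

class CipherAdapter(HTTPAdapter):
    # Build the connection pools with a custom SSLContext so the Client Hello
    # only offers the ciphers listed above.
    def init_poolmanager(self, *args, **kwargs):
        kwargs['ssl_context'] = create_urllib3_context(ciphers=CIPHERS)
        return super().init_poolmanager(*args, **kwargs)

    def proxy_manager_for(self, *args, **kwargs):
        kwargs['ssl_context'] = create_urllib3_context(ciphers=CIPHERS)
        return super().proxy_manager_for(*args, **kwargs)

session = requests.Session()
session.mount('https://', CipherAdapter())
# The signature_algorithms extension is still whatever OpenSSL sends by default.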
import asyncio

import requests
import httpx
from aiohttp import ClientSession

cookies = {
    'csm-hit': 'tb:s-B8ZK0QTPQCGWKHY3QDT5|1620287052879&t:1620287054928&adb:adblk_no',
    'i18n-prefs': 'USD',
    'lc-main': 'en_US',
    'session-id': '143-0501748-3847056',
    'session-id-time': '2082787201l',
    'session-token': 'NxLWWkB7RnpUvmQEl7OcUzk44D9PnlSt/swrqvnSwBvry9WAPSeQt5U2hVCa7IeEEDwj+qzLHwrNhCnA+7pN8H7HELP5WYZuPjtTJ1d8jrTxLueLIQB+wh+3e+1c1vRrfYDa4FTsdm6jN2QR55zq0ybhNJt0jrXCTdlaktZ+e0tHPIjQnCsu1lidMvyOksR+',
    'skin': 'noskin',
    'sp-cdn': 'L5Z9:CN',
    'ubid-main': '134-5202210-0613519',
}
headers = {
    'Host': 'www.amazon.com',
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 UBrowser/6.2.4094.1',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-language': 'zh-CN,zh;q=0.9',
    'anti-csrftoken-a2z': 'gBtJDelwICZ60r+pGBgwbzjAf4Wr+LTRIoyWRyMAAAAMAAAAAGC1xeJyYXcAAAAA',
    'content-type': 'application/x-www-form-urlencoded;charset=utf-8',
}
data = 'locationType=LOCATION_INPUT&zipCode=90001&storeContext=generic&deviceType=web&pageType=Gateway&actionSource=glow&almBrandId=undefined'
url = 'https://www.amazon.com/gp/delivery/ajax/address-change.html'
# url = 'https://www.python-spider.com/nginx'
your_proxy_url = 'http://127.0.0.1:8888'
# your_proxy_url = ''

# httpx
with httpx.Client(
        # http2=True,
        # proxies=your_proxy_url,
        verify=False) as client:
    # This HTTP request will be tunneled instead of forwarded.
    response = client.post(url=url, headers=headers, cookies=cookies, data=data)
    print(response.status_code)
    print(response.text)

# requests
# cert='/Users/yangyanhui/lbs/spider/amazon/amazon_cookie_pool/charles-ssl-proxying-certificate.pem'
response = requests.post(url, headers=headers, cookies=cookies, data=data)
print(response.status_code)
print(response.text)

# aiohttp
# asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())  # add this line on Windows
async def main():  # aiohttp must be used inside an async function
    async with ClientSession(cookies=cookies, headers=headers) as session:
        async with session.post(url, data=data,
                                # proxy=your_proxy_url,
                                verify_ssl=False) as resp:
            print(await resp.text())
            print(resp.status)

loop = asyncio.get_event_loop()
loop.run_until_complete(main())
curl -H 'Host: www.amazon.com' -H 'Cookie: csm-hit=tb:s-B8ZK0QTPQCGWKHY3QDT5|1620287052879&t:1620287054928&adb:adblk_no; i18n-prefs=USD; lc-main=en_US; session-id=143-0501748-3847056; session-id-time=2082787201l; session-token=NxLWWkB7RnpUvmQEl7OcUzk44D9PnlSt/swrqvnSwBvry9WAPSeQt5U2hVCa7IeEEDwj+qzLHwrNhCnA+7pN8H7HELP5WYZuPjtTJ1d8jrTxLueLIQB+wh+3e+1c1vRrfYDa4FTsdm6jN2QR55zq0ybhNJt0jrXCTdlaktZ+e0tHPIjQnCsu1lidMvyOksR+; skin=noskin; sp-cdn=L5Z9:CN; ubid-main=134-5202210-0613519' -H 'user-agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 UBrowser/6.2.4094.1' -H 'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9' -H 'accept-language: zh-CN,zh;q=0.9' -H 'anti-csrftoken-a2z: gBtJDelwICZ60r+pGBgwbzjAf4Wr+LTRIoyWRyMAAAAMAAAAAGC1xeJyYXcAAAAA' -H 'content-type: application/x-www-form-urlencoded;charset=utf-8' --data-binary "locationType=LOCATION_INPUT&zipCode=90001&storeContext=generic&deviceType=web&pageType=Gateway&actionSource=glow&almBrandId=undefined" --compressed 'https://www.amazon.com/gp/delivery/ajax/address-change.html'

Maybe consider using Golang, which has a package that can modify your TLS fingerprint.
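If you want to stay in Python, another option is the curl_cffi bindings for curl-impersonate, which can send a browser-like Client Hello (including Chrome's signature_algorithms). This is an untested sketch under the assumption that the third-party curl_cffi package is installed and supports a Chrome impersonation target; url, headers, cookies and data are the same variables defined in the question's code:

# Untested sketch: the package name and the 'chrome110' target are assumptions
# about curl_cffi, not something from the original question.
from curl_cffi import requests as curl_requests

response = curl_requests.post(
    url,
    headers=headers,
    cookies=cookies,
    data=data,
    impersonate='chrome110',  # any impersonation target the library supports
)
print(response.status_code)
print(response.text)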

Related

Converting HTML to PDF from an HTTPS site requiring authentication

I've been trying to convert HTML to PDF from my company's HTTPS-secured, authentication-required website.
I tried converting it directly with pdfkit first:
pdfkit.from_url("https://companywebsite.com", 'output.pdf')
However, I'm receiving these errors:
Error: Authentication Required
Error: Failed to load https://companywebsite.com,
with network status code 204 and http status code 401 - Host requires authentication
So I added an options argument:
options = {'username': username,
           'password': password}
pdfkit.from_url("https://companywebsite.com", 'output.pdf', options=options)
It loads forever without producing any output.
My second method was to try creating a session with requests:
import pdfkit
import requests
from requests.auth import HTTPBasicAuth

def download(session, username, password):
    session.get('https://companywebsite.com', auth=HTTPBasicAuth(username, password), verify=False)
    ua = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'
    session.headers = {'User-Agent': ua}
    payload = {'UserName': username,
               'Password': password,
               'AuthMethod': 'FormsAuthentication'}
    session.post('https://companywebsite.com', data=payload, headers=session.headers)
    my_html = session.get('https://companywebsite.com/thepageiwant')
    my_pdf = open('myfile.html', 'wb+')
    my_pdf.write(my_html.content)
    my_pdf.close()
    # Raw string so the Windows path is not mangled by escape sequences like \b
    path_wkhtmltopdf = r'C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe'
    config = pdfkit.configuration(wkhtmltopdf=path_wkhtmltopdf)
    pdfkit.from_file('myfile.html', 'out.pdf', configuration=config)

session = requests.Session()
download(session, username, password)
Could someone help me? I am getting a 200 from session.get, so it's definitely getting the session.
Maybe try using Selenium to access the site and take a screenshot.
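A minimal sketch of that idea, assuming Chrome and chromedriver are installed and the site accepts the same form-based login; the field names reuse the keys from the question's payload, and the submit selector is a placeholder:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://companywebsite.com')
# 'UserName' / 'Password' are taken from the question's login payload;
# the submit button selector is a placeholder and will differ on the real site.
driver.find_element(By.NAME, 'UserName').send_keys(username)
driver.find_element(By.NAME, 'Password').send_keys(password)
driver.find_element(By.CSS_SELECTOR, 'input[type=submit]').click()
driver.get('https://companywebsite.com/thepageiwant')
driver.save_screenshot('thepageiwant.png')  # snapshot of the rendered page
driver.quit()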

Scraping JSON data from XHR response

I am trying to scrape some information from this page: https://salesforce.wd1.myworkdayjobs.com/en-US/External_Career_Site/job/United-Kingdom---Wales---Remote/Enterprise-Account-Executive-Public-Sector_JR65970
When the page loads and I look at the XHR requests, the response tab for that URL delivers the info I'm looking for in JSON format. But if I try to do json.loads(response.body.decode('utf-8')) on that page, I don't get the data I'm looking for, because the page is rendered with JavaScript. Is it possible to pull that JSON data from the page somehow? A screenshot of what I'm looking at is below.
I saw this post on r/scrapy and thought I'd answer here.
It's always best to try to replicate the request when it comes to JSON data. The JSON data is served by the website's server on request, so if we make the right HTTP request we can get the response we want.
Using the dev tools under XHR, you can get the referring URL, headers and cookies. See the images below.
Request url: https://imgur.com/TMQxEGJ
Request headers and cookies: https://imgur.com/spCqCvS
Within Scrapy, the Request object allows you to specify the URL, in this case the request URL seen in the dev tools, but it also allows us to specify the headers and cookies, which we can get from the last image.
So something like the following would work:
import scrapy

class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['salesforce.wd1.myworkdayjobs.com']
    start_urls = ['https://salesforce.wd1.myworkdayjobs.com/en-US/External_Career_Site/job/United-Kingdom---Wales---Remote/']
    cookies = {
        'PLAY_LANG': 'en-US',
        'PLAY_SESSION': '5ff86346f3ba312f6d57f23974e3cff020b5c33e-salesforce_pSessionId=o3mgtklolr1pdpgmau0tc8nhnv^&instance=wd1prvps0003a',
        'wday_vps_cookie': '3425085962.53810.0000',
        'TS014c1515': '01560d0839d62a96c0b952e23282e8e8fa0dafd17f75af4622d072734673c51d4a1f4d3bc7f43bee3c1746a1f56a728f570e80f37e',
        'timezoneOffset': '-60',
        'cdnDown': '0',
    }
    headers = {
        'Connection': 'keep-alive',
        'Accept': 'application/json,application/xml',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36',
        'X-Workday-Client': '2020.27.015',
        'Content-Type': 'application/x-www-form-urlencoded',
        'Sec-Fetch-Site': 'same-origin',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Dest': 'empty',
        'Referer': 'https://salesforce.wd1.myworkdayjobs.com/en-US/External_Career_Site/job/United-Kingdom---Wales---Remote/Enterprise-Account-Executive-Public-Sector_JR65970',
        'Accept-Language': 'en-US,en;q=0.9',
    }

    def parse(self, response):
        url = response.url + 'Enterprise-Account-Executive-Public-Sector_JR65970'
        yield scrapy.Request(url=url, headers=self.headers,
                             cookies=self.cookies, callback=self.start)

    def start(self, response):
        info = response.json()
        print(info)
We specify a dictionary of headers and cookies at the start. We then use the parse function to build the correct URL.
Notice I used response.url, which gives us the starting URL specified above, and I append the last part of the URL seen in the dev tools. Not strictly necessary, but it means a little less repeated code.
We then make a scrapy Request with the correct headers and cookies and ask for the response to be called back to another function. There we deserialise the JSON response into a Python object and print it out.
Note that response.json() is a newer feature of Scrapy which deserialises JSON into a Python object; see here for details.
A great Stack Overflow discussion on replicating AJAX requests in Scrapy can be found here.
To read a JSON response in Scrapy you can also use the following code:
import json
j_obj = json.loads(response.body_as_unicode())

Unable to log in to PSN using the Python requests module

I am trying to log into PSN https://www.playstation.com/en-in/sign-in-and-connect/ using the Python requests module and the API endpoint taken from the browser's inspect element. Below is the code:
import requests

login_data = {
    'password': "mypasswordhere",
    'username': "myemailhere",
}
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36'
}
with requests.Session() as s1:
    url = "https://auth.api.sonyentertainmentnetwork.com/2.0/oauth/token"
    r = s1.post(url, data=login_data, headers=header)
    print(r.text)
With this, I got the below response from the server:
{"error":"invalid_client","error_description":"Bad client credentials","error_code":4102,"docs":"https://auth.api.sonyentertainmentnetwork.com/docs/","parameters":[]}
Is there an alternative way to log into PSN, preferably using the API model instead of Selenium? My objective is to log into PSN with my credentials and change my password, but I seem to be stuck at the login page.

Error when connecting to wunderground API

I am trying to scrape wind speed data for different UK weather stations using the site wunderground. I assume they have an API, I just have a hard time connecting to it.
Here's the XHR link I use:
https://api.weather.com/v1/location/EGNV:9:GB/observations/historical.json?apiKey=6532d6454b8aa370768e63d6ba5a832e&units=e&startDate=20150101&endDate=20150131
This is the data I would like (the wind speed table at the bottom of this page):
https://www.wunderground.com/history/monthly/gb/darlington/EGNV/date/2015-1
My code is pretty simple: I first load the headers, and my function get_data returns the response as JSON.
In my main block I append the data to a DataFrame and print it.
from bs4 import BeautifulSoup
import pandas as pd
import requests
import urllib
from urllib.request import urlopen

headers = {
    ':authority': 'api.weather.com',
    # ':path': '/v1/location/EGNV:9:GB/observations/historical.json?apiKey=6532d6454b8aa370768e63d6ba5a832e&units=e&startDate=20150101&endDate=20150131',
    ':scheme': 'https',
    'accept': 'application/json, text/plain, */*',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-GB,en;q=0.9,en-US;q=0.8,da;q=0.7',
    'origin': 'https://www.wunderground.com',
    # 'apiKey': '6532d6454b8aa370768e63d6ba5a832e',
    'referer': 'https://www.wunderground.com/history/monthly/gb/darlington/EGNV/date/2015-1',
    'sec-fetch-mode': 'cors',
    'sec-fetch-site': 'cross-site',
    'user-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.87 Safari/537.36'
}

def get_data(response):
    df = response.json()
    return df

if __name__ == "__main__":
    date = pd.datetime.now().strftime("%d-%m-%Y")
    api_key = "6532d6454b8aa370768e63d6ba5a832e"
    start_date = "20150101"
    end_date = "20150131"
    urls = [
        "https://api.weather.com/v1/location/EGNV:9:GB/observations/historical.json?apiKey=" + api_key + "&units=e&startDate=" + start_date + "&endDate=" + end_date
    ]
    df = pd.DataFrame()
    for url in urls:
        res = requests.get(url, headers=headers)
        data = get_data(res)
        df = df.append(data)
    print(df)
The error I get:
SSLError: HTTPSConnectionPool(host='api.weather.com', port=443): Max retries exceeded with url: /v1/location/EGNV:9:GB/observations/historical.json?apiKey=6532d6454b8aa370768e63d6ba5a832e&units=e&startDate=20150101&endDate=20150131
(Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'tls_process_server_certificate', 'certificate verify failed')])")))
Update:
Even without trying to connect to the API, just scraping the page with BS4, I still get denied access. I'm not sure why, or how they can detect my scraper.
I solved it.
If I add verify=False to my requests.get() call, I manage to get around the error.
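For illustration, a minimal sketch of that workaround (it skips certificate verification entirely, and urllib3 can be told to silence the resulting InsecureRequestWarning); url and headers are the same as in the question:

import requests
import urllib3

# verify=False works around the handshake error, but it also disables
# certificate checks, so use it with care.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

res = requests.get(url, headers=headers, verify=False)
print(res.json())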

Issue accessing the Jenkins API with a Vue/Axios call

I tried making a GET call with axios from my Vue.js codebase to the Jenkins API and I'm unable to do so.
I've read every resource I could find but wasn't able to fix this particular problem. I even created a .htaccess file to see if it would help, but it wasn't useful. I've run out of options, so I came here for help.
Below is the axios code I used within my App.vue file:
axios.get(
    *URL to access Jenkins that is currently running on a tomcat server*,
    {
        headers: {
            "jenkins-crumb": "*some numbers and letters*",
        },
        auth: {
            username: "*obvious username*",
            password: "*obvious password*"
        },
        withCredentials: true,
        crossdomain: true
    }
)
.then(response => (this.info = response))
.catch(error => (console.log(error)));
Console log output:
Access to XMLHttpRequest at 'url' from origin 'http://localhost:8080' has been blocked by CORS policy: Response to preflight request doesn't pass access control check: No 'Access-Control-Allow-Origin' header is present on the requested resource.
Network output:
General
Request URL: URL
Request Method: OPTIONS
Status Code: 403
Remote Address: localhost:8080
Referrer Policy: no-referrer-when-downgrade
Request Headers
Provisional headers are shown
Access-Control-Request-Headers: authorization,jenkins-crumb
Access-Control-Request-Method: GET
Origin: http://localhost:8080
Referer: http://localhost:8080/
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36
Please help!