Converting HTML to PDF from an HTTPS site requiring authentication

I've been trying to convert HTML to PDF from my company's HTTPS site, which requires authentication.
I tried converting it directly with pdfkit first:
pdfkit.from_url("https://companywebsite.com", 'output.pdf')
However, I'm receiving these errors:
Error: Authentication Required
Error: Failed to load https://companywebsite.com,
with network status code 204 and http status code 401 - Host requires authentication
So I added an options dictionary to the call:
options = {'username': username,
           'password': password}
pdfkit.from_url("https://companywebsite.com", 'output.pdf', options=options)
It loads forever without producing any output.
My second method was to try creating a session with requests:

def download(session, username, password):
    session.get('https://companywebsite.com', auth=HTTPBasicAuth(username, password), verify=False)
    ua = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'
    session.headers = {'User-Agent': ua}
    payload = {'UserName': username,
               'Password': password,
               'AuthMethod': 'FormsAuthentication'}
    session.post('https://companywebsite.com', data=payload, headers=session.headers)
    my_html = session.get('https://companywebsite.com/thepageiwant')
    my_file = open('myfile.html', 'wb+')
    my_file.write(my_html.content)
    my_file.close()
    path_wkhtmltopdf = r'C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe'
    config = pdfkit.configuration(wkhtmltopdf=path_wkhtmltopdf)
    pdfkit.from_file('myfile.html', 'out.pdf', configuration=config)

download(session, username, password)

Could someone help me? I am getting 200 from session.get, so it's definitely getting the session.

Maybe try using Selenium to access that site and take a screenshot.
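Alternatively, staying with the requests-plus-pdfkit route: pdfkit can forward the authenticated session's cookies to wkhtmltopdf. This is only a sketch, under the assumption that the form login actually sets session cookies; pdfkit passes repeatable wkhtmltopdf flags such as --cookie as a list of tuples, and the cookie name and value below are illustrative, not the real ones:

```python
import requests
# import pdfkit  # needed for the final conversion step

session = requests.Session()
# ... a successful session.post(login_url, data=payload) would populate session.cookies;
# here we set an illustrative cookie by hand so the sketch is self-contained
session.cookies.set("ASP.NET_SessionId", "example-session-id")

# pdfkit forwards repeatable wkhtmltopdf flags (--cookie NAME VALUE) as lists of tuples
options = {"cookie": list(session.cookies.get_dict().items())}
print(options)
# pdfkit.from_url("https://companywebsite.com/thepageiwant", "output.pdf", options=options)
```

This way wkhtmltopdf makes its own request to the page but presents the cookies the requests login obtained, instead of trying to authenticate itself.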

Related

Scraping Lazada data

I have used Selenium to get data such as item name, price, and reviews from the Lazada website. However, it blocks me after the first scrape. Is there any way to solve this? Could you give a detailed solution? Thank you.
Lazada has high security, so to get data without being blocked you must use a proxy. You can even get the data using Python requests; try the code below:
import requests

cookies = {
    "user": "en"
}
req_headers = {
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "x-requested-with": "XMLHttpRequest",
}
proxies = {"https": "http://000.0.0.0:0000"}
response_data = requests.get(product_url, headers=req_headers, cookies=cookies, proxies=proxies, verify=False)
You can get the product data from the response text.
For getting reviews you can use this URL:
host = "lazada.sg"  # you can use any region here
"https://my.{}/pdp/review/getReviewList?itemId={}&pageSize=100&filter=0&sort=1&pageNo={}".format(host, item_id, page_no)
If you want to use Selenium, you need to set the proxy in Selenium as well.
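As a concrete sketch of building that review endpoint (the item ID and page number below are made up for illustration):

```python
host = "lazada.sg"   # any region works here, e.g. "lazada.com.my"
item_id = 123456789  # hypothetical item ID taken from a product page
page_no = 1

# the endpoint returns up to 100 reviews per page as JSON
review_url = ("https://my.{}/pdp/review/getReviewList"
              "?itemId={}&pageSize=100&filter=0&sort=1&pageNo={}").format(host, item_id, page_no)
print(review_url)
# a requests.get(review_url, headers=req_headers, proxies=proxies) call would then
# return the review JSON, subject to the same proxy caveats as above
```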

Scraping JSON data from XHR response

I am trying to scrape some information from this page: https://salesforce.wd1.myworkdayjobs.com/en-US/External_Career_Site/job/United-Kingdom---Wales---Remote/Enterprise-Account-Executive-Public-Sector_JR65970
When the page loads and I look at the XHR requests, the response tab for that URL delivers the info I'm looking for in JSON format. But if I try json.loads(response.body.decode('utf-8')) on that page, I don't get the data I'm looking for, because the page is rendered with JavaScript. Is it possible to pull that JSON data from the page somehow? A screenshot of what I'm looking at is below.
I saw this post on r/scrapy and thought I'd answer here.
It's always best to try to replicate the requests when it comes to JSON data. The JSON data is served on request by the website's server, so if we make the right HTTP request we can get the response we want.
Using the dev tools under XHR, you can get the referring URL, headers and cookies. See the images below.
Request url: https://imgur.com/TMQxEGJ
Request headers and cookies: https://imgur.com/spCqCvS
Within scrapy the request object allows you to specify the URL in this case the request URL seen in the dev tools. But it also allows us to specify the headers and cookies too! Which we can get from the last image.
So something like the following would work:
import scrapy

class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['salesforce.wd1.myworkdayjobs.com']
    start_urls = ['https://salesforce.wd1.myworkdayjobs.com/en-US/External_Career_Site/job/United-Kingdom---Wales---Remote/']

    cookies = {
        'PLAY_LANG': 'en-US',
        'PLAY_SESSION': '5ff86346f3ba312f6d57f23974e3cff020b5c33e-salesforce_pSessionId=o3mgtklolr1pdpgmau0tc8nhnv^&instance=wd1prvps0003a',
        'wday_vps_cookie': '3425085962.53810.0000',
        'TS014c1515': '01560d0839d62a96c0b952e23282e8e8fa0dafd17f75af4622d072734673c51d4a1f4d3bc7f43bee3c1746a1f56a728f570e80f37e',
        'timezoneOffset': '-60',
        'cdnDown': '0',
    }

    headers = {
        'Connection': 'keep-alive',
        'Accept': 'application/json,application/xml',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36',
        'X-Workday-Client': '2020.27.015',
        'Content-Type': 'application/x-www-form-urlencoded',
        'Sec-Fetch-Site': 'same-origin',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Dest': 'empty',
        'Referer': 'https://salesforce.wd1.myworkdayjobs.com/en-US/External_Career_Site/job/United-Kingdom---Wales---Remote/Enterprise-Account-Executive-Public-Sector_JR65970',
        'Accept-Language': 'en-US,en;q=0.9',
    }

    def parse(self, response):
        url = response.url + 'Enterprise-Account-Executive-Public-Sector_JR65970'
        yield scrapy.Request(url=url, headers=self.headers,
                             cookies=self.cookies, callback=self.start)

    def start(self, response):
        info = response.json()
        print(info)
We specify a dictionary of headers and cookies at the start. We then use the parse function to build the correct URL.
Notice I used response.url, which gives us the start URL specified above, and appended the last part of the URL seen in the dev tools. Not strictly necessary, but it saves a little repeated code.
We then make a scrapy Request with the correct headers and cookies and ask for the response to be called back to another function, where we deserialise the JSON response into a Python object and print it out.
Note that response.json() is a relatively new feature of Scrapy which deserialises JSON into a Python object; see the Scrapy docs for details.
A great stackoverflow discussion on replicating AJAX requests in scrapy is found here.
To read a JSON response in Scrapy you can also use the following (response.body_as_unicode() is deprecated in recent Scrapy releases; response.text is the equivalent):
j_obj = json.loads(response.text)
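As a self-contained illustration of the deserialisation step (the field names below are invented for the example, not Workday's actual schema):

```python
import json

# a stand-in for response.text from the job-posting request
body = '{"jobPosting": {"id": "JR65970", "title": "Enterprise Account Executive"}}'

# json.loads turns the JSON text into nested Python dicts/lists,
# which is what response.json() does for you in recent Scrapy
info = json.loads(body)
print(info["jobPosting"]["title"])
```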

Unable to login into PSN using Python requests module

I am trying to log in to PSN (https://www.playstation.com/en-in/sign-in-and-connect/) using the Python requests module and the API endpoint found via the browser's inspect element. Below is the code:
import requests
login_data = {
    'password': "mypasswordhere",
    'username': "myemailhere",
}
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36'
}
with requests.Session() as s1:
    url = "https://auth.api.sonyentertainmentnetwork.com/2.0/oauth/token"
    r = s1.post(url, data=login_data, headers=header)
    print(r.text)
With this, I got the following response from the server:
{"error":"invalid_client","error_description":"Bad client credentials","error_code":4102,"docs":"https://auth.api.sonyentertainmentnetwork.com/docs/","parameters":[]}
Can anyone suggest an alternative method to log in to PSN, preferably using the API model instead of Selenium? My objective is to log in to PSN with my credentials and change my password, but I seem to be stuck at the login page...
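For what it's worth, an invalid_client / "Bad client credentials" error from an OAuth token endpoint usually means the request lacks HTTP Basic client credentials (a client_id/client_secret pair), which are separate from the user's username and password. The sketch below only shows how such credentials would be attached with requests; the client_id and client_secret values are placeholders, not real PSN credentials, and the endpoint may require other parameters as well:

```python
import requests

client_id = "my_client_id"          # placeholder, not a real PSN client id
client_secret = "my_client_secret"  # placeholder

# prepare (but do not send) the token request, so we can inspect the headers
req = requests.Request(
    "POST",
    "https://auth.api.sonyentertainmentnetwork.com/2.0/oauth/token",
    data={"grant_type": "password",
          "username": "myemailhere",
          "password": "mypasswordhere"},
    auth=(client_id, client_secret),  # becomes an Authorization: Basic ... header
).prepare()

print(req.headers["Authorization"])
# requests.Session().send(req) would actually perform the network call
```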

Login in to Amazon using BeautifulSoup

I am working on a script to scrape some information off Amazon's Prime Now grocery website. However, I am stumbling on the first step in which I am attempting to start a session and login to the page.
I am fairly sure the issue is in building the 'data' object. There are 10 inputs in the HTML, but the data object I have constructed only has 9, the missing one being the submit button. I am not entirely sure whether that is relevant, as this is my first time working with BeautifulSoup.
Any help would be greatly appreciated! All of my code is below, with the last if/else statement confirming that it has not worked when I run the code.
import requests
from bs4 import BeautifulSoup

# define URL where the login form is located
site = 'https://primenow.amazon.com/ap/signin?clientContext=133-1292951-7489930&openid.return_to=https%3A%2F%2Fprimenow.amazon.com%2Fap-post-redirect%3FsiteState%3DclientContext%253D131-7694496-4754740%252CsourceUrl%253Dhttps%25253A%25252F%25252Fprimenow.amazon.com%25252Fhome%252Csignature%253DIFISh0byLJrJApqlChzLdkc2FCEj3D&openid.identity=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.assoc_handle=amzn_houdini_desktop_us&openid.mode=checkid_setup&marketPlaceId=A1IXFGJ6ITL7J4&openid.claimed_id=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&pageId=amzn_pn_us&openid.ns=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0&openid.pape.max_auth_age=3600'

# initiate session
session = requests.Session()

# define session headers
session.headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.61 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Referer': site
}

# get login page
resp = session.get(site)
html = resp.text

# get a BeautifulSoup object of the login page html
soup = BeautifulSoup(html, 'lxml')

# scrape the login page for all the inputs required for login
data = {}
form = soup.find('form')
for field in form.find_all('input'):
    try:
        data[field['name']] = field['value']
    except KeyError:
        pass

# add username and password to the data for the post request
data['email'] = 'my email'
data['password'] = 'my password'

# submit post request with username / password and other needed info
post_resp = session.post(site, data=data)
post_soup = BeautifulSoup(post_resp.content, 'lxml')
if post_soup.find_all('title')[0].text == 'Your Account':
    print('Login Successful')
else:
    print('Login Failed')
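The hidden-input harvesting step above can be sanity-checked in isolation on a literal HTML snippet; the form fields below are made up, not Amazon's actual ones. It also shows why the data dict can end up with fewer entries than the form has inputs: fields without both a name and a value attribute (like the email field and the submit button here) are skipped.

```python
from bs4 import BeautifulSoup

# a hypothetical login form with two pre-filled hidden fields
html = """
<form>
  <input type="hidden" name="appActionToken" value="abc123"/>
  <input type="hidden" name="openid.return_to" value="https://primenow.amazon.com/home"/>
  <input type="email" name="email"/>
  <input type="submit"/>
</form>
"""

soup = BeautifulSoup(html, "html.parser")
data = {}
for field in soup.find("form").find_all("input"):
    # only keep inputs that carry both a name and a pre-filled value
    if field.has_attr("name") and field.has_attr("value"):
        data[field["name"]] = field["value"]
print(data)
```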

Python 3 basic auth with pinnaclesports API

I am trying to grab betting lines with Python from Pinnacle Sports using their API (http://www.pinnaclesports.com/api-xml/manual),
which requires basic authentication (http://www.pinnaclesports.com/api-xml/manual#authentication):
Authentication
The API uses HTTP Basic access authentication. Always use HTTPS to access the API. You need to send an HTTP request header like this:
Authorization: Basic <base64-encoded username:password>
For example:
Authorization: Basic U03MyOT23YbzMDc6d3c3O1DQ1
import urllib.request, urllib.parse, urllib.error
import socket
import base64
url = 'https://api.pinnaclesports.com/v1//feed?sportid=12&leagueid=6164'
username = "abc"
password = "xyz"
base64 = "Basic: " + base64.b64encode('{}:{}'.format(username,password).encode('utf-8')).decode('ascii')
print (base64)
details = urllib.parse.urlencode({ 'Authorization' : base64 })
details = details.encode('UTF-8')
url = urllib.request.Request(url, details)
url.add_header("User-Agent","Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.2.149.29 Safari/525.13")
responseData = urllib.request.urlopen(url).read().decode('utf8', 'ignore')
print (responseData)
Unfortunately I get an HTTP 500 error, which from my point of view means either my authentication isn't working properly or their API is not working.
Thanks in advance.
As it happens, I don't use the same Python version as you, so this has not been tested with your code, but there is an extraneous colon after "Basic" in your base64 string. In my own code, adding that colon after "Basic" indeed yields an HTTP 500 error.
Edit: Code example using Python 2.7 and urllib2:
import urllib2
import base64

def get_leagues():
    url = 'https://api.pinnaclesports.com/v1/leagues?sportid=33'
    username = "myusername"
    password = "mypassword"
    b64str = "Basic " + base64.b64encode('{}:{}'.format(username, password).encode('utf-8')).decode('ascii')
    headers = {'Content-length': '0',
               'Content-type': 'application/xml',
               'Authorization': b64str}
    req = urllib2.Request(url, headers=headers)
    responseData = urllib2.urlopen(req).read()
    ofn = 'api_leagues.txt'
    with open(ofn, 'w') as ofile:
        ofile.write(responseData)
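For anyone on Python 3, the same request can be sketched with urllib.request. This is untested against the live API and uses placeholder credentials; the key points are that the Authorization value goes in a header (not in the POST body, as in the question's code) and that "Basic" is followed by a space, not a colon:

```python
import base64
import urllib.request

url = 'https://api.pinnaclesports.com/v1/leagues?sportid=33'
username = "myusername"  # placeholder
password = "mypassword"  # placeholder

# "Basic", a space, then base64("username:password")
b64str = "Basic " + base64.b64encode(
    '{}:{}'.format(username, password).encode('utf-8')).decode('ascii')

req = urllib.request.Request(url, headers={
    'Content-type': 'application/xml',
    'Authorization': b64str,
})
print(req.get_header('Authorization'))
# response_data = urllib.request.urlopen(req).read()  # real network call, not run here
```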