Scraping Lazada data - Selenium

I have used Selenium to get data such as item name, price, and reviews from the Lazada website. However, it blocks me after the first scrape. Is there any way to solve this? Could you give a solution in detail? Thank you.

Lazada has strong anti-bot protection, so to get data without being blocked you must use a proxy. You can even get the data using the Python requests library; try the code below:
import requests

cookies = {
    "user": "en"
}
req_headers = {
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "x-requested-with": "XMLHttpRequest",
}
# placeholder proxy: substitute your own proxy host and port
proxies = {"https": "http://000.0.0.0:0000"}
response_data = requests.get(product_url, headers=req_headers, cookies=cookies, proxies=proxies, verify=False)
You can extract the product data from the response text.
To get reviews, you can use this URL:
host = "lazada.sg"  # you can use any region here
review_url = "https://my.{}/pdp/review/getReviewList?itemId={}&pageSize=100&filter=0&sort=1&pageNo={}".format(host, item_id, page_no)
If you want to keep using Selenium, you need to set the proxy in Selenium as well; see the sketch below.
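A minimal sketch of routing Chrome through a proxy in Selenium (the proxy address is a placeholder, and a Chromium-based driver is assumed):

from selenium import webdriver

proxy_address = "000.0.0.0:0000"  # placeholder: your proxy host and port

options = webdriver.ChromeOptions()
options.add_argument("--proxy-server=http://{}".format(proxy_address))

driver = webdriver.Chrome(options=options)
driver.get("https://www.lazada.sg/")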

Converting HTML to PDF from https requiring authentication

I've been trying to convert HTML to PDF from my company's HTTPS-secured, authentication-required website.
I tried directly converting it with pdfkit first.
pdfkit.from_url("https://companywebsite.com", 'output.pdf')
However, I'm receiving these errors:
Error: Authentication Required
Error: Failed to load https://companywebsite.com,
with network status code 204 and http status code 401 - Host requires authentication
So I added an options argument:
options = {'username': username,
           'password': password}
pdfkit.from_url("https://companywebsite.com", 'output.pdf', options=options)
It loads forever without any output.
My second method was to try creating a session with requests:
import pdfkit
import requests
from requests.auth import HTTPBasicAuth

def download(session, username, password):
    session.get('https://companywebsite.com', auth=HTTPBasicAuth(username, password), verify=False)
    ua = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'
    session.headers = {'User-Agent': ua}
    payload = {'UserName': username,
               'Password': password,
               'AuthMethod': 'FormsAuthentication'}
    session.post('https://companywebsite.com', data=payload, headers=session.headers)
    my_html = session.get('https://companywebsite.com/thepageiwant')
    my_pdf = open('myfile.html', 'wb+')
    my_pdf.write(my_html.content)
    my_pdf.close()
    path_wkhtmltopdf = r'C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe'  # raw string so the backslashes survive
    config = pdfkit.configuration(wkhtmltopdf=path_wkhtmltopdf)
    pdfkit.from_file('myfile.html', 'out.pdf', configuration=config)

session = requests.Session()
download(session, username, password)
Could someone help me? I am getting 200 from session.get, so it's definitely getting the session.
Maybe try using Selenium to access the site and snap a screenshot.
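A minimal sketch of that idea, assuming a simple form-based login page; the UserName and Password field names are hypothetical and should be taken from the real page's HTML, and username/password are the same variables as in the question:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://companywebsite.com')

# hypothetical form fields; inspect the real login page for the actual names
driver.find_element(By.NAME, 'UserName').send_keys(username)
driver.find_element(By.NAME, 'Password').send_keys(password)
driver.find_element(By.NAME, 'Password').submit()

# once logged in, navigate to the page and capture it
driver.get('https://companywebsite.com/thepageiwant')
driver.save_screenshot('thepageiwant.png')
driver.quit()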

Scraping JSON data from XHR response

I am trying to scrape some information from this page: https://salesforce.wd1.myworkdayjobs.com/en-US/External_Career_Site/job/United-Kingdom---Wales---Remote/Enterprise-Account-Executive-Public-Sector_JR65970
When the page loads and I look at the XHR, the response tab for that URL request delivers the info I'm looking for in JSON format. But if I try to do json.loads(response.body.decode('utf-8')) on that page, I don't get the data I'm looking for because the page loads via JavaScript. Is it possible to just pull that JSON data from the page somehow? Screenshot of what I'm looking at below.
I saw this post on r/scrapy and thought I'd answer here.
It's always best to try to replicate the request when it comes to JSON data. JSON data is fetched from the website's server on request, so if we make the right HTTP request we can get the response we want.
Using the dev tools under XHR, you can get the referring URL, headers, and cookies. See the images below.
Request url: https://imgur.com/TMQxEGJ
Request headers and cookies: https://imgur.com/spCqCvS
Within Scrapy, the Request object allows you to specify the URL, in this case the request URL seen in the dev tools. It also allows us to specify the headers and cookies, which we can get from the last image.
So something like this would work:
import scrapy


class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['salesforce.wd1.myworkdayjobs.com']
    start_urls = ['https://salesforce.wd1.myworkdayjobs.com/en-US/External_Career_Site/job/United-Kingdom---Wales---Remote/']

    cookies = {
        'PLAY_LANG': 'en-US',
        'PLAY_SESSION': '5ff86346f3ba312f6d57f23974e3cff020b5c33e-salesforce_pSessionId=o3mgtklolr1pdpgmau0tc8nhnv^&instance=wd1prvps0003a',
        'wday_vps_cookie': '3425085962.53810.0000',
        'TS014c1515': '01560d0839d62a96c0b952e23282e8e8fa0dafd17f75af4622d072734673c51d4a1f4d3bc7f43bee3c1746a1f56a728f570e80f37e',
        'timezoneOffset': '-60',
        'cdnDown': '0',
    }

    headers = {
        'Connection': 'keep-alive',
        'Accept': 'application/json,application/xml',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36',
        'X-Workday-Client': '2020.27.015',
        'Content-Type': 'application/x-www-form-urlencoded',
        'Sec-Fetch-Site': 'same-origin',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Dest': 'empty',
        'Referer': 'https://salesforce.wd1.myworkdayjobs.com/en-US/External_Career_Site/job/United-Kingdom---Wales---Remote/Enterprise-Account-Executive-Public-Sector_JR65970',
        'Accept-Language': 'en-US,en;q=0.9',
    }

    def parse(self, response):
        url = response.url + 'Enterprise-Account-Executive-Public-Sector_JR65970'
        yield scrapy.Request(url=url, headers=self.headers,
                             cookies=self.cookies, callback=self.start)

    def start(self, response):
        info = response.json()
        print(info)
We specify a dictionary of headers and cookies at the start. We then use the parse function to build the correct URL.
Notice I used response.url, which gives us the starting URL specified above, and appended the last part of the URL seen in the dev tools. Not strictly necessary, but it means a little less repeated code.
We then make a scrapy Request with the correct headers and cookies, and ask for the response to be called back to another function. In that function we deserialise the JSON response into a Python object and print it out.
Note that response.json() is a relatively new feature of Scrapy (added in version 2.2) which deserialises JSON into a Python object; see the Scrapy documentation for details.
There is also a great Stack Overflow discussion on replicating AJAX requests in Scrapy.
To read a JSON response in Scrapy you can also use the following code:

import json

j_obj = json.loads(response.body_as_unicode())
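On newer Scrapy versions, where body_as_unicode() is deprecated, a minimal equivalent would be:

import json

j_obj = json.loads(response.text)
# or, on Scrapy >= 2.2, simply:
j_obj = response.json()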

Unable to login into PSN using Python requests module

I am trying to log in to PSN https://www.playstation.com/en-in/sign-in-and-connect/ using the Python requests module and an API endpoint taken from the browser's inspect element. Below is the code:
import requests

login_data = {
    'password': "mypasswordhere",
    'username': "myemailhere",
}
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36'
}
with requests.Session() as s1:
    url = "https://auth.api.sonyentertainmentnetwork.com/2.0/oauth/token"
    r = s1.post(url, data=login_data, headers=header)
    print(r.text)
With this, I got the response below from the server:
{"error":"invalid_client","error_description":"Bad client credentials","error_code":4102,"docs":"https://auth.api.sonyentertainmentnetwork.com/docs/","parameters":[]}
Is there an alternative method to log in to the PSN network, preferably using the API model rather than Selenium? My objective is to log in to PSN with my credentials and change my password, but I seem to be stuck at the login page...

How to properly parse JSONP callback function in Meteor?

Does someone know how to parse a JSONP callback in Meteor server methods?
I do
let response = HTTP.call('GET', AVIASALES_API_ENDPOINTS.getLocationFromIP, {
  params: {
    locale: 'en',
    callback: 'useriata',
    ip: clientIP
  }
});
In response.content I've got:
useriata({"iata":"MSQ","name":"Minsk","country_name":"Belarus"})
How to properly parse it?
It could help to know what you're really trying to accomplish, but here is a working example; Meteor doesn't actually do anything unusual with these requests.
Meteor.startup(function () {
  var result = HTTP.call("GET", "https://api.github.com/legacy/repos/search/meteor", {
    params: {},
    headers: {
      "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36"
    }
  });
  console.log(result.data); // it's a js object, so you can do result.data.repositories[0].name
  console.log(JSON.stringify(result.data)); // json string
  console.log(JSON.parse(JSON.stringify(result.data))); // if for some reason you need to parse it, this will work, but it seems unnecessary
});
Update: The string you got back from the response wasn't valid JSON, so you couldn't parse it. I used some regex to remove the invalid parts; here is a working example: http://meteorpad.com/pad/JCy5WkFsrtciG9PR5/Copy%20of%20Leaderboard
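The general idea, sketched here in Python for illustration (the callback name useriata and the payload come from the question; the regex strips the callbackName(...) wrapper so the remaining string is plain JSON):

import json
import re

raw = 'useriata({"iata":"MSQ","name":"Minsk","country_name":"Belarus"})'

# drop the leading "useriata(" and the trailing ")" to leave valid JSON
match = re.match(r'^\s*\w+\((.*)\)\s*;?\s*$', raw, re.S)
data = json.loads(match.group(1))
print(data['iata'])  # MSQ

The same substitution can be done in Meteor's server-side JavaScript with String.prototype.replace before calling JSON.parse.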

CORS request is preflighted, but it seems like it should not be

The following cross-origin POST request, with a content-type of multipart/form-data and only simple headers, is preflighted. According to the W3C spec, unless I am reading it wrong, it should not be preflighted. I've confirmed this happens in Chrome 27 and Firefox 10.8.3. I haven't tested any other browsers.
Here are the request headers, etc:
Request URL:http://192.168.130.135:8081/upload/receiver
Request Method:POST
Status Code:200 OK
Request Headers:
Accept:*/*
Accept-Encoding:gzip,deflate,sdch
Accept-Language:en-US,en;q=0.8
Connection:keep-alive
Content-Length:27129
Content-Type:multipart/form-data; boundary=----WebKitFormBoundaryix5VzTyVtCMwcNv6
Host:192.168.130.135:8081
Origin:http://192.168.130.135:8080
Referer:http://192.168.130.135:8080/test/raytest-jquery.html
User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.37 Safari/537.36
And here is the OPTIONS (preflight) request:
Request URL:http://192.168.130.135:8081/upload/receiver
Request Method:OPTIONS
Status Code:200 OK
Request Headers:
Accept:*/*
Accept-Encoding:gzip,deflate,sdch
Accept-Language:en-US,en;q=0.8
Access-Control-Request-Headers:origin, content-type
Access-Control-Request-Method:POST
Connection:keep-alive
Host:192.168.130.135:8081
Origin:http://192.168.130.135:8080
Referer:http://192.168.130.135:8080/test/raytest-jquery.html
User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.37 Safari/537.36
The spec seems pretty clear:
Only simple headers: CHECK
Only simple methods: CHECK
UPDATE: Here's some simple client-side code that will reproduce this:
var xhr = new XMLHttpRequest(),
    formData = new FormData();

formData.append('myfile', someFileObj);

xhr.upload.onprogress = function(e) {
    // insert upload progress logic here
};

xhr.open('POST', 'http://192.168.130.135:8080/upload/receiver', true);
xhr.send(formData);
Does anyone know why this is being preflighted?
I ended up checking out the Webkit source code in an attempt to figure this out (after Google did not yield any helpful hits). It turns out that Webkit will force any cross-origin request to be preflighted simply if you register an onprogress event handler. I'm not entirely sure, even after reading the code comments, why this logic was applied.
In XMLHttpRequest.cpp:
void XMLHttpRequest::createRequest(ExceptionCode& ec)
{
    ...
    // The presence of upload event listeners forces us to use preflighting because POSTing to an URL that does not
    // permit cross origin requests should look exactly like POSTing to an URL that does not respond at all.
    // Also, only async requests support upload progress events.
    bool uploadEvents = false;
    if (m_async) {
        m_progressEventThrottle.dispatchEvent(XMLHttpRequestProgressEvent::create(eventNames().loadstartEvent));
        if (m_requestEntityBody && m_upload) {
            uploadEvents = m_upload->hasEventListeners();
            m_upload->dispatchEvent(XMLHttpRequestProgressEvent::create(eventNames().loadstartEvent));
        }
    }
    ...
    options.preflightPolicy = uploadEvents ? ForcePreflight : ConsiderPreflight;
    ...
}
UPDATE: Firefox applies the same logic as Webkit, it appears. Here is the relevant code from nsXMLHttpRequest.cpp:
nsresult
nsXMLHttpRequest::CheckChannelForCrossSiteRequest(nsIChannel* aChannel)
{
    ...
    // Check if we need to do a preflight request.
    nsCOMPtr<nsIHttpChannel> httpChannel = do_QueryInterface(aChannel);
    NS_ENSURE_TRUE(httpChannel, NS_ERROR_DOM_BAD_URI);

    nsAutoCString method;
    httpChannel->GetRequestMethod(method);
    if (!mCORSUnsafeHeaders.IsEmpty() ||
        (mUpload && mUpload->HasListeners()) ||
        (!method.LowerCaseEqualsLiteral("get") &&
         !method.LowerCaseEqualsLiteral("post") &&
         !method.LowerCaseEqualsLiteral("head"))) {
        mState |= XML_HTTP_REQUEST_NEED_AC_PREFLIGHT;
    }
    ...
}
Notice the mUpload && mUpload->HasListeners() portion of the conditional.
Seems like Webkit and Firefox (and possibly others) have inserted some logic into their preflight-determination code that is not sanctioned by the W3C spec. If I'm missing something in the spec, please comment.
My guess is that the "boundary" on the Content-Type header is causing issues. If you are able to reproduce this, it should be filed as a browser bug, since the spec states that the Content-Type header check should exclude parameters.