I have used Selenium to scrape data such as item name, price, and reviews from the Lazada website. However, the site blocks me after the first scrape. Is there any way to solve this? Could you give a detailed solution? Thank you.
Lazada has strong anti-bot protection, so to get data without being blocked you must use a proxy. You can even get the data with Python requests; try the code below:
import requests

cookies = {
    "user": "en"
}
req_headers = {
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "x-requested-with": "XMLHttpRequest",
}
# Placeholder proxy address; replace it with a real proxy host and port
proxies = {"https": "http://000.0.0.0:0000"}
# product_url is the URL of the product page you want to fetch
response_data = requests.get(product_url, headers=req_headers, cookies=cookies, proxies=proxies, verify=False)
You can get the product data from the response text.
To get reviews, you can use this URL:
host = "lazada.sg"  # you can use any region here
review_url = "https://my.{}/pdp/review/getReviewList?itemId={}&pageSize=100&filter=0&sort=1&pageNo={}".format(host, item_id, page_no)
If you want to use Selenium, you need to set the proxy in Selenium as well.
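Setting the proxy in Selenium can be sketched as follows. This assumes Chrome; `--proxy-server` is a Chromium command-line switch, while the helper names `chrome_proxy_args` and `make_driver` and the proxy address are made up for illustration.

```python
def chrome_proxy_args(proxy_url):
    """Return the Chrome command-line arguments that route traffic through proxy_url."""
    return ["--proxy-server={}".format(proxy_url)]


def make_driver(proxy_url):
    """Start Chrome through the given proxy (hypothetical helper; needs Selenium and a driver)."""
    from selenium import webdriver  # imported here so the helper above works without Selenium

    options = webdriver.ChromeOptions()
    for argument in chrome_proxy_args(proxy_url):
        options.add_argument(argument)
    return webdriver.Chrome(options=options)


# Placeholder address; replace it with a real proxy host and port
proxy_args = chrome_proxy_args("http://127.0.0.1:8080")
print(proxy_args)
```

Calling `make_driver("http://127.0.0.1:8080")` would then open a browser whose traffic goes through that proxy, provided Selenium and a matching driver are installed.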
I am trying to load cookies exported from Selenium into my requests session in Python, but when I do, it returns the following error:
"'list' object has no attribute 'extract_cookies'"
def load_cookies(filename):
    with open(filename, 'rb') as f:
        return pickle.load(f)
initial_state = requests.Session()
initial_state.cookies = load_cookies(time_cookie_file)
search_requests = initial_state.get(search_url)
Everywhere I look says this should work. My cookies are a list of dictionaries, which is what I understand all cookies to be, and which I assume is why this works with Selenium. For some reason it does not work with requests. Any help would be appreciated; it feels like I am missing something obvious!
Cookies have been dumped from Selenium using:
with open("Filepath.pkl", 'wb') as f:
    pickle.dump(driver.get_cookies(), f)
An example of the cookies would be (slightly obfuscated):
[{'domain': '.website.com',
'expiry': 1640787949,
'httpOnly': False,
'name': '_ga',
'path': '/',
'secure': False,
'value': 'GA1.2.1111111111.1111111111'},
{'domain': 'website.com',
'expiry': 1585488346,
'httpOnly': False,
'name': '__pnahc',
'path': '/',
'secure': False,
'value': '0'}]
I have now managed to load the cookies as per the answer below. However, they do not seem to be loaded properly, as the session does not remember anything, whereas the same cookies work fine when I load them while browsing through Selenium.
Cookie
The Cookie HTTP request header contains stored HTTP cookies previously sent by the server with the Set-Cookie header. An HTTP cookie is a small piece of data that a server sends to the user's web browser. The browser may store the cookie and send it back with the next request to the same server. Typically, cookies are used to tell whether two requests came from the same browser, for example to keep the user logged in.
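The round trip described above can be sketched with the standard library alone. This is only an illustration: the function name `cookie_header_from_set_cookie` and the sample Set-Cookie values are made up, and a real browser also applies domain, path, and expiry rules that are skipped here.

```python
from http.cookies import SimpleCookie


def cookie_header_from_set_cookie(set_cookie_values):
    """Build the Cookie request header value from a list of Set-Cookie header values."""
    pairs = []
    for raw in set_cookie_values:
        parsed = SimpleCookie()
        parsed.load(raw)  # attributes such as Path or HttpOnly are attached to the cookie, not returned as cookies
        for name, morsel in parsed.items():
            pairs.append("{}={}".format(name, morsel.value))
    return "; ".join(pairs)


# What a server might send in two Set-Cookie headers (sample values)
received = ["sessionid=abc123; Path=/; HttpOnly", "theme=dark; Path=/"]
header = cookie_header_from_set_cookie(received)
print("Cookie:", header)
```

Only the name/value pairs go back to the server; the attributes tell the client when and where to send them.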
Demonstration using Selenium
To demonstrate the usage of cookies with Selenium, we stored the cookies using pickle once the user had logged into the website http://demo.guru99.com/test/cookie/selenium_aut.php. In the next step, we opened the same website, added the cookies, and were able to land as a logged-in user.
Code Block to store the cookies:
from selenium import webdriver
import pickle
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
driver.get('http://demo.guru99.com/test/cookie/selenium_aut.php')
driver.find_element_by_name("username").send_keys("abc123")
driver.find_element_by_name("password").send_keys("123xyz")
driver.find_element_by_name("submit").click()
pickle.dump( driver.get_cookies() , open("cookies.pkl","wb"))
Code Block to use the stored cookies for automatic authentication:
from selenium import webdriver
import pickle
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
driver.get('http://demo.guru99.com/test/cookie/selenium_aut.php')
cookies = pickle.load(open("cookies.pkl", "rb"))
for cookie in cookies:
    driver.add_cookie(cookie)
driver.get('http://demo.guru99.com/test/cookie/selenium_cookie.php')
Demonstration using Requests
To demonstrate usage of cookies using session and requests we have accessed the site https://www.google.com, added a new dictionary of cookies:
{'name':'my_own_cookie','value': 'debanjan' ,'domain':'.stackoverflow.com'}
Next, we have used the same requests session to send another request which was successful as follows:
Code Block:
import requests
s1 = requests.session()
s1.get('https://www.google.com')
print("Original Cookies")
print(s1.cookies)
print("==========")
cookie = {'name':'my_own_cookie','value': 'debanjan' ,'domain':'.stackoverflow.com'}
s1.cookies.update(cookie)
print("After new Cookie added")
print(s1.cookies)
Console Output:
Original Cookies
<RequestsCookieJar[<Cookie 1P_JAR=2020-01-21-14 for .google.com/>, <Cookie NID=196=NvZMMRzKeV6VI1xEqjgbzJ4r_3WCeWWjitKhllxwXUwQcXZHIMRNz_BPo6ujQduYCJMOJgChTQmXSs6yKX7lxcfusbrBMVBN_qLxLIEah5iSBlkdBxotbwfaFHMd-z5E540x02-YZtCm-rAIx-MRCJeFGK2E_EKdZaxTw-StRYg for .google.com/>]>
==========
After new Cookie added
<RequestsCookieJar[<Cookie domain=.stackoverflow.com for />, <Cookie name=my_own_cookie for />, <Cookie value=debanjan for />, <Cookie 1P_JAR=2020-01-21-14 for .google.com/>, <Cookie NID=196=NvZMMRzKeV6VI1xEqjgbzJ4r_3WCeWWjitKhllxwXUwQcXZHIMRNz_BPo6ujQduYCJMOJgChTQmXSs6yKX7lxcfusbrBMVBN_qLxLIEah5iSBlkdBxotbwfaFHMd-z5E540x02-YZtCm-rAIx-MRCJeFGK2E_EKdZaxTw-StRYg for .google.com/>]>
Conclusion
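As the output shows, s1.cookies.update() treats each key of the dictionary as a separate cookie name, so the jar now holds three cookies called name, value, and domain rather than a single cookie my_own_cookie scoped to .stackoverflow.com. To add one cookie with a domain, use s1.cookies.set('my_own_cookie', 'debanjan', domain='.stackoverflow.com') instead.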
Passing Selenium Cookies to Python Requests
Now, if your use case is passing Selenium cookies to Python requests, you can use the following solution:
from selenium import webdriver
import pickle
import requests
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
driver.get('http://demo.guru99.com/test/cookie/selenium_aut.php')
driver.find_element_by_name("username").send_keys("abc123")
driver.find_element_by_name("password").send_keys("123xyz")
driver.find_element_by_name("submit").click()
# Storing cookies through Selenium
pickle.dump( driver.get_cookies() , open("cookies.pkl","wb"))
driver.quit()
# Passing cookies to Session
session = requests.session() # or an existing session
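with open('cookies.pkl', 'rb') as f:
    # get_cookies() returns a list of dicts, so add each cookie by name and value
    for cookie in pickle.load(f):
        session.cookies.set(cookie['name'], cookie['value'], domain=cookie['domain'], path=cookie['path'])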
search_requests = session.get('https://www.google.com/')
print(session.cookies)
You are replacing session.cookies (a RequestsCookieJar) with a list, which doesn't have those attributes, so it won't work.
You can import those cookies one by one by using:
for c in your_cookies_list:
    initial_state.cookies.set(name=c['name'], value=c['value'])
I tried loading the whole cookie, but requests doesn't recognize some of the keys and returns:
TypeError: create_cookie() got unexpected keyword arguments: ['expiry', 'httpOnly']
requests accepts expires instead, and HttpOnly comes nested within rest.
Update:
We can also rename the dict keys for expiry and httpOnly so that requests loads them correctly instead of throwing an exception. dict.pop() deletes an item by key and returns its value, so we store that value under the new key, then unpack the dict and pass it as keyword arguments:
for c in your_cookies_list:
    # Session cookies may have no expiry, so fall back to None instead of raising KeyError
    c['expires'] = c.pop('expiry', None)
    c['rest'] = {'HttpOnly': c.pop('httpOnly', None)}
    initial_state.cookies.set(**c)
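The key renaming can also be written as a small pure function that leaves the Selenium dict untouched. This is a sketch under assumptions: the name `selenium_cookie_to_requests_kwargs` is made up, `ALLOWED_KEYS` reflects my understanding of the keywords that requests' create_cookie accepts, and moving any other key (such as a `sameSite` entry that some drivers return) into `rest` is my own choice, not something the answer above prescribes.

```python
# Keyword names that requests' create_cookie is understood to accept (assumption, check your version)
ALLOWED_KEYS = {
    "version", "name", "value", "port", "domain", "path", "secure",
    "expires", "discard", "comment", "comment_url", "rest", "rfc2109",
}


def selenium_cookie_to_requests_kwargs(cookie):
    """Return a new dict that can be unpacked into session.cookies.set(**kwargs)."""
    kwargs = dict(cookie)  # copy, so the Selenium dict is not modified
    rest = {}
    if "expiry" in kwargs:
        kwargs["expires"] = kwargs.pop("expiry")
    if "httpOnly" in kwargs:
        rest["HttpOnly"] = kwargs.pop("httpOnly")
    # Anything else that create_cookie would reject is nested under rest
    for key in list(kwargs):
        if key not in ALLOWED_KEYS:
            rest[key] = kwargs.pop(key)
    kwargs["rest"] = rest
    return kwargs


selenium_cookie = {
    "domain": ".website.com", "expiry": 1640787949, "httpOnly": False,
    "name": "_ga", "path": "/", "secure": False,
    "value": "GA1.2.1111111111.1111111111", "sameSite": "Lax",
}
converted = selenium_cookie_to_requests_kwargs(selenium_cookie)
print(converted)
```

With a requests session at hand, the result would be used as `initial_state.cookies.set(**converted)`.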
You can take the cookies and use only their name/value pairs. You'll also need headers, which you can get from dev tools or by using a proxy.
Basic example:
driver.get('https://website.com/')
# ... login or do anything
cookies = {}
for cookie in driver.get_cookies():
    cookies[cookie['name']] = cookie['value']
# Write to a file if needed, or do something else
# import json
# with open("cookies.txt", 'w') as f:
#     f.write(json.dumps(cookies))
And usage:
# Read cookies from the file as a dict
# with open('cookies.txt') as reader:
#     cookies = json.loads(reader.read())
# use cookies
response = requests.get('https://website.com/', headers=headers, cookies=cookies)
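The commented-out file handling above can be written out as a runnable round trip. This is a minimal sketch: the helper names are made up, `sample` stands in for the list that `driver.get_cookies()` returns, and a temporary directory is used so the example does not touch your working directory.

```python
import json
import os
import tempfile


def selenium_cookies_to_dict(selenium_cookies):
    """Keep only the name/value pair of each Selenium cookie."""
    return {cookie["name"]: cookie["value"] for cookie in selenium_cookies}


def save_cookie_dict(cookies, path):
    """Write the name/value dict to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(cookies, f)


def load_cookie_dict(path):
    """Read the name/value dict back from the JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Stand-in for driver.get_cookies()
sample = [
    {"name": "_ga", "value": "GA1.2.1111111111.1111111111", "domain": ".website.com"},
    {"name": "__pnahc", "value": "0", "domain": "website.com"},
]
cookie_path = os.path.join(tempfile.mkdtemp(), "cookies.txt")
save_cookie_dict(selenium_cookies_to_dict(sample), cookie_path)
restored = load_cookie_dict(cookie_path)
print(restored)
```

The restored dict has the shape that the `cookies=` argument of `requests.get` expects.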
An example of Stack Overflow request headers follows; some headers may be required and others not. You can get the request headers from the Network tab in dev tools:
headers = {
'authority': 'stackoverflow.com',
'pragma': 'no-cache',
'cache-control': 'no-cache',
'dnt': '1',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36',
'sec-fetch-user': '?1',
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'sec-fetch-site': 'same-origin',
'sec-fetch-mode': 'navigate',
'referer': 'https://stackoverflow.com/questions/tagged?sort=Newest&tagMode=Watched&uqlId=8338',
'accept-encoding': 'gzip, deflate, br',
'accept-language': 'ru,en-US;q=0.9,en;q=0.8,tr;q=0.7',
}
You can create a session. The session class handles cookies between requests.
s = requests.Session()
login_resp = s.post('https://example.com/login', data=login_data)
cookies = login_resp.cookies
# Convert the cookie jar into a plain name/value dict
cookiedictreceived = requests.utils.dict_from_cookiejar(login_resp.cookies)
requests wants every value in your cookie to be a string, and possibly every key as well. The cookies attribute also does not accept a list, which is what your load_cookies function returns. A cookie jar can be created with requests.utils, using cookies = requests.utils.cookiejar_from_dict(...).
Let's say I go to "https://stackoverflow.com/" with Selenium and save the cookies as you have done.
from selenium import webdriver
import pickle
import requests
#Go to the website
driver = webdriver.Chrome(executable_path=r'C:\Path\To\Your\chromedriver.exe')
driver.get('https://stackoverflow.com/')
#Save the cookies in a file
with open(r"C:\Path\To\Your\Filepath.pkl", 'wb') as f:
    pickle.dump(driver.get_cookies(), f)
driver.quit()
#Your function to get the cookies from the file.
def load_cookies(filename):
    with open(filename, 'rb') as f:
        return pickle.load(f)
saved_cookies_list = load_cookies(r"C:\Path\To\Your\Filepath.pkl")
#Set request session
initial_state = requests.Session()
#Function to fix cookie values and add cookies to request_session
def fix_cookies_and_load_to_requests(cookie_list, request_session):
    for cookie in cookie_list:
        # requests wants string values, so convert anything that is not a string
        value = cookie['value'] if isinstance(cookie['value'], str) else str(cookie['value'])
        # Build the jar from {name: value}; passing the whole Selenium dict would
        # create one cookie per key (domain, expiry, ...) instead
        cookies = requests.utils.cookiejar_from_dict({cookie['name']: value})
        request_session.cookies.update(cookies)
    return request_session
initial_state_with_cookies = fix_cookies_and_load_to_requests(cookie_list=saved_cookies_list, request_session=initial_state)
search_requests = initial_state_with_cookies.get("https://stackoverflow.com/")
print("search_requests:", search_requests)
Requests also accepts http.cookiejar.CookieJar objects:
https://docs.python.org/3.8/library/http.cookiejar.html#cookiejar-and-filecookiejar-objects
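Building such a jar from Selenium-style cookie dicts might look like the sketch below. It uses only the standard library; the function name `selenium_cookies_to_jar` and the sample cookies are made up, and the defaults chosen for missing fields (path "/", no port) are assumptions.

```python
from http.cookiejar import Cookie, CookieJar


def selenium_cookies_to_jar(selenium_cookies):
    """Convert a list of Selenium cookie dicts into an http.cookiejar.CookieJar."""
    jar = CookieJar()
    for c in selenium_cookies:
        domain = c.get("domain", "")
        expiry = c.get("expiry")  # absent for session cookies
        jar.set_cookie(Cookie(
            version=0,
            name=c["name"],
            value=str(c["value"]),
            port=None,
            port_specified=False,
            domain=domain,
            domain_specified=bool(domain),
            domain_initial_dot=domain.startswith("."),
            path=c.get("path", "/"),
            path_specified=True,
            secure=c.get("secure", False),
            expires=expiry,
            discard=expiry is None,
            comment=None,
            comment_url=None,
            rest={"HttpOnly": c.get("httpOnly")},
        ))
    return jar


# Stand-in for driver.get_cookies(); the values are sample data
sample = [
    {"domain": ".website.com", "expiry": 4102444800, "httpOnly": False,
     "name": "_ga", "path": "/", "secure": False, "value": "GA1.2.1"},
    {"domain": "website.com", "httpOnly": False,
     "name": "__pnahc", "path": "/", "secure": False, "value": "0"},
]
jar = selenium_cookies_to_jar(sample)
cookie_names = {cookie.name for cookie in jar}
print(cookie_names)
```

The resulting jar can then be passed as `cookies=jar` to a request, or merged into a session with `session.cookies.update(jar)`.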
I've been trying to convert HTML to PDF from my company's HTTPS-secured website, which requires authentication.
I tried directly converting it with pdfkit first.
pdfkit.from_url("https://companywebsite.com", 'output.pdf')
However I'm receiving these errors
Error: Authentication Required
Error: Failed to load https://companywebsite.com,
with network status code 204 and http status code 401 - Host requires authentication
So I added options to the arguments:
options = {'username': username,
           'password': password}
pdfkit.from_url("https://companywebsite.com", 'output.pdf', options=options)
It loads forever without any output.
My second method was to try creating a session with requests:
def download(session,username,password):
    session.get('https://companywebsite.com', auth=HTTPBasicAuth(username,password),verify=False)
    ua = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'
    session.headers = {'User-Agent': ua}
    payload = {'UserName':username,
               'Password':password,
               'AuthMethod':'FormsAuthentication'}
    session.post('https://companywebsite.com', data = payload, headers = session.headers)
    my_html = session.get('https://companywebsite.com/thepageiwant')
    my_pdf = open('myfile.html','wb+')
    my_pdf.write(my_html.content)
    my_pdf.close()
    # Raw string so that \b in the path is not read as a backspace escape
    path_wkthmltopdf = r'C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe'
    config = pdfkit.configuration(wkhtmltopdf=bytes(path_wkthmltopdf, 'utf8'))
    pdfkit.from_file('myfile.html', 'out.pdf')
download(session,username,password)
Could someone help me? I am getting 200 from session.get, so it's definitely getting the session.
Maybe try using Selenium to access the site and take a screenshot.
I am working on a script to scrape some information off Amazon's Prime Now grocery website. However, I am stumbling on the first step in which I am attempting to start a session and login to the page.
I am fairly positive that the issue is in building the 'data' object. There are 10 inputs in the HTML, but the data object I have constructed only has 9, with the missing one being the submit button. I am not entirely sure whether that is relevant, as this is my first time working with BeautifulSoup.
Any help would be greatly appreciated! All of my code is below, with the last if/else statement confirming that it has not worked when I run the code.
import requests
from bs4 import BeautifulSoup
# define URL where login form is located
site = 'https://primenow.amazon.com/ap/signin?clientContext=133-1292951-7489930&openid.return_to=https%3A%2F%2Fprimenow.amazon.com%2Fap-post-redirect%3FsiteState%3DclientContext%253D131-7694496-4754740%252CsourceUrl%253Dhttps%25253A%25252F%25252Fprimenow.amazon.com%25252Fhome%252Csignature%253DIFISh0byLJrJApqlChzLdkc2FCEj3D&openid.identity=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.assoc_handle=amzn_houdini_desktop_us&openid.mode=checkid_setup&marketPlaceId=A1IXFGJ6ITL7J4&openid.claimed_id=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&pageId=amzn_pn_us&openid.ns=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0&openid.pape.max_auth_age=3600'
# initiate session
session = requests.Session()
# define session headers
session.headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.61 Safari/537.36',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en-US,en;q=0.5',
'Referer': site
}
# get login page
resp = session.get(site)
html = resp.text
# get BeautifulSoup object of the html of the login page
soup = BeautifulSoup(html , 'lxml')
# scrape login page to get all the needed inputs required for login
data = {}
form = soup.find('form')
for field in form.find_all('input'):
    try:
        data[field['name']] = field['value']
    except KeyError:
        # inputs without a name or value attribute are skipped
        pass
# add username and password to the data for post request
data['email'] = 'my email'
data['password'] = 'my password'
# submit post request with username / password and other needed info
post_resp = session.post(site, data = data)
post_soup = BeautifulSoup(post_resp.content , 'lxml')
if post_soup.find_all('title')[0].text == 'Your Account':
    print('Login Successful')
else:
    print('Login Failed')
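To see which inputs the try/except loop drops, it can help to collect them explicitly. The sketch below uses the standard library's html.parser instead of BeautifulSoup so that it runs on its own; the class name `FormInputCollector` and the sample form are made up, and it does not claim to fix the login itself.

```python
from html.parser import HTMLParser


class FormInputCollector(HTMLParser):
    """Collect name/value pairs of <input> tags and remember the ones without a name."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self.skipped = []

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name is None:
            # Such an input cannot be sent as form data; keep it for inspection
            self.skipped.append(attributes)
            return
        # A missing value attribute becomes an empty string instead of being dropped
        self.fields[name] = attributes.get("value") or ""


# Made-up form resembling a login page
sample_html = """
<form method="post">
  <input type="hidden" name="appActionToken" value="abc">
  <input type="email" name="email">
  <input type="password" name="password">
  <input type="submit" id="signInSubmit">
</form>
"""
collector = FormInputCollector()
collector.feed(sample_html)
print(collector.fields)
print(collector.skipped)
```

Here the submit button ends up in `skipped` because it has no name, which matches the kind of missing input described in the question.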
The following cross-origin POST request, with a content-type of multipart/form-data and only simple headers is preflighted. According to the W3C spec, unless I am reading it wrong, it should not be preflighted. I've confirmed this happens in Chrome 27 and Firefox 10.8.3. I haven't tested any other browsers.
Here are the request headers, etc:
Request URL:http://192.168.130.135:8081/upload/receiver
Request Method:POST
Status Code:200 OK
Request Headers
Accept:*/*
Accept-Encoding:gzip,deflate,sdch
Accept-Language:en-US,en;q=0.8
Connection:keep-alive
Content-Length:27129
Content-Type:multipart/form-data; boundary=----WebKitFormBoundaryix5VzTyVtCMwcNv6
Host:192.168.130.135:8081
Origin:http://192.168.130.135:8080
Referer:http://192.168.130.135:8080/test/raytest-jquery.html
User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.37 Safari/537.36
And here is the OPTIONS (preflight) request:
Request URL:http://192.168.130.135:8081/upload/receiver
Request Method:OPTIONS
Status Code:200 OK
Request Headers
Accept:*/*
Accept-Encoding:gzip,deflate,sdch
Accept-Language:en-US,en;q=0.8
Access-Control-Request-Headers:origin, content-type
Access-Control-Request-Method:POST
Connection:keep-alive
Host:192.168.130.135:8081
Origin:http://192.168.130.135:8080
Referer:http://192.168.130.135:8080/test/raytest-jquery.html
User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.37 Safari/537.36
The spec seems pretty clear:
Only simple headers: CHECK
Only simple methods: CHECK
UPDATE: Here's some simple client-side code that will reproduce this:
var xhr = new XMLHttpRequest(),
    formData = new FormData();
formData.append('myfile', someFileObj);
xhr.upload.onprogress = function(e) {
    //insert upload progress logic here
};
xhr.open('POST', 'http://192.168.130.135:8081/upload/receiver', true);
xhr.send(formData);
Does anyone know why this is being preflighted?
I ended up checking out the Webkit source code in an attempt to figure this out (after Google did not yield any helpful hits). It turns out that Webkit will force any cross-origin request to be preflighted simply if you register an onprogress event handler. I'm not entirely sure, even after reading the code comments, why this logic was applied.
In XMLHttpRequest.cpp:
void XMLHttpRequest::createRequest(ExceptionCode& ec)
{
    ...
    options.preflightPolicy = uploadEvents ? ForcePreflight : ConsiderPreflight;
    ...
    // The presence of upload event listeners forces us to use preflighting because POSTing to an URL that does not
    // permit cross origin requests should look exactly like POSTing to an URL that does not respond at all.
    // Also, only async requests support upload progress events.
    bool uploadEvents = false;
    if (m_async) {
        m_progressEventThrottle.dispatchEvent(XMLHttpRequestProgressEvent::create(eventNames().loadstartEvent));
        if (m_requestEntityBody && m_upload) {
            uploadEvents = m_upload->hasEventListeners();
            m_upload->dispatchEvent(XMLHttpRequestProgressEvent::create(eventNames().loadstartEvent));
        }
    }
    ...
}
UPDATE: Firefox applies the same logic as Webkit, it appears. Here is the relevant code from nsXMLHttpRequest.cpp:
nsresult
nsXMLHttpRequest::CheckChannelForCrossSiteRequest(nsIChannel* aChannel)
{
    ...
    // Check if we need to do a preflight request.
    nsCOMPtr<nsIHttpChannel> httpChannel = do_QueryInterface(aChannel);
    NS_ENSURE_TRUE(httpChannel, NS_ERROR_DOM_BAD_URI);
    nsAutoCString method;
    httpChannel->GetRequestMethod(method);
    if (!mCORSUnsafeHeaders.IsEmpty() ||
        (mUpload && mUpload->HasListeners()) ||
        (!method.LowerCaseEqualsLiteral("get") &&
         !method.LowerCaseEqualsLiteral("post") &&
         !method.LowerCaseEqualsLiteral("head"))) {
        mState |= XML_HTTP_REQUEST_NEED_AC_PREFLIGHT;
    }
    ...
}
Notice the mUpload && mUpload->HasListeners() portion of the conditional.
Seems like Webkit and Firefox (and possibly others) have inserted some logic into their preflight-determination code that is not sanctioned by the W3C spec. If I'm missing something in the spec, please comment.
My guess is that the "boundary" on the Content-Type header is causing issues. If you are able to reproduce this, it should be filed as a browser bug, since the spec states that the Content-Type header check should exclude parameters.
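The preflight decision discussed in this thread can be summarized in a short Python sketch. It is a simplification, not any browser's actual code: the function names are made up, only author-set headers are considered, and the spec's additional restrictions on header values are left out.

```python
SIMPLE_METHODS = {"GET", "HEAD", "POST"}
SIMPLE_HEADERS = {"accept", "accept-language", "content-language"}
SIMPLE_CONTENT_TYPES = {
    "application/x-www-form-urlencoded",
    "multipart/form-data",
    "text/plain",
}


def is_simple_header(name, value):
    """Return True if an author-set header counts as a simple header."""
    name = name.lower()
    if name in SIMPLE_HEADERS:
        return True
    if name == "content-type":
        # Parameters such as boundary=... are excluded from the check
        media_type = value.split(";", 1)[0].strip().lower()
        return media_type in SIMPLE_CONTENT_TYPES
    return False


def needs_preflight(method, author_headers, has_upload_listeners=False):
    """Decide whether a cross-origin request would be preflighted under these rules."""
    if has_upload_listeners:
        # The behaviour found in the WebKit and Firefox sources quoted above
        return True
    if method.upper() not in SIMPLE_METHODS:
        return True
    return any(not is_simple_header(n, v) for n, v in author_headers.items())


upload_headers = {
    "Content-Type": "multipart/form-data; boundary=----WebKitFormBoundaryix5VzTyVtCMwcNv6",
}
print(needs_preflight("POST", upload_headers))
print(needs_preflight("POST", upload_headers, has_upload_listeners=True))
```

Under these rules the multipart POST from the question is not preflighted on its own; it is the upload progress listener that triggers the preflight.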