How to reduce amount of scraping time when using Requests-HTML? - selenium

I currently use Requests-HTML 0.10.0 and Selenium 3.141.0. My project is to scrape the ratings of all articles on this website: https://openreview.net/group?id=ICLR.cc/2021/Conference. I use Selenium to open each page of the listing (the site has 53 pages with 50 articles each), and Requests-HTML to open each article on a page. My question is how to reduce the time it takes to open each article and get its rating. I currently call await r_inside.html.arender(sleep = 5, timeout=100), i.e. a sleep of 5 seconds and a timeout of 100 seconds. If I reduce the sleep to 0.5 seconds, I get an error because the page does not have enough time to render before it is scraped. With the sleep at 5 seconds, however, scraping all 2600 articles takes 6 to 13 hours. Also, although the 13-hour run does scrape all 2600 articles, the code uses 88 GB of RAM, which is a problem because I need to send this code to other people who will not have that much RAM. My goal is to reduce both the scraping time and the RAM usage. Below is the code I use.
import csv
link = 'https://openreview.net/group?id=ICLR.cc/2021/Conference'
from requests_html import HTMLSession, AsyncHTMLSession
import time
from tqdm import tqdm
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
id_list = []
keyword_list = []
abstract_list = []
title_list = []
driver = webdriver.Chrome('./requests_html/chromedriver.exe')
driver.get('https://openreview.net/group?id=ICLR.cc/2021/Conference')
cond = EC.presence_of_element_located((By.XPATH, '//*[@id="all-submissions"]/nav/ul/li[13]/a'))
WebDriverWait(driver, 10).until(cond)
for page in tqdm(range(1, 54)):
    text = ''
    elems = driver.find_elements_by_xpath('//*[@id="all-submissions"]/ul/li')
    for i, elem in enumerate(elems):
        try:
            # parse title
            title = elem.find_element_by_xpath('./h4/a[1]')
            link = title.get_attribute('href')
            paper_id = link.split('=')[-1]
            title = title.text.strip().replace('\t', ' ').replace('\n', ' ')
            # show details
            elem.find_element_by_xpath('./a').click()
            time.sleep(0.2)
            # parse keywords & abstract
            items = elem.find_elements_by_xpath('.//li')
            keyword = ''.join([x.text for x in items if 'Keywords' in x.text])
            abstract = ''.join([x.text for x in items if 'Abstract' in x.text])
            keyword = keyword.strip().replace('\t', ' ').replace('\n', ' ').replace('Keywords: ', '')
            abstract = abstract.strip().replace('\t', ' ').replace('\n', ' ').replace('Abstract: ', '')
            text += paper_id+'\t'+title+'\t'+link+'\t'+keyword+'\t'+abstract+'\n'
            title_list.append(title)
            id_list.append(paper_id)
            keyword_list.append(keyword)
            abstract_list.append(abstract)
        except Exception as e:
            print(f'page {page}, # {i}:', e)
            continue
    # next page
    try:
        driver.find_element_by_xpath('//*[@id="all-submissions"]/nav/ul/li[13]/a').click()
        time.sleep(2) # NOTE: increase sleep time if needed
    except:
        print('no next page, exit.')
        break
csv_file = open('./requests_html/bb_website_scrap.csv','w', encoding="utf-8")
csv_writer = csv.writer(csv_file)
csv_writer.writerow(['Title','Keyword','Abstract','Link','Total Number of Reviews','Average Rating','Average Confidence'])
n = 0
for item in range(len(id_list)):
    title = title_list[item]
    keyword = keyword_list[item]
    abstract = abstract_list[item]
    id = id_list[item]
    link_pdf = f'https://openreview.net/forum?id={id}'
    print(id)
    asession_inside = AsyncHTMLSession()
    r_inside = await asession_inside.get(link_pdf)
    print(type(r_inside))
    await r_inside.html.arender(sleep = 5, timeout=100)
    test_rating = r_inside.html.find('div.comment-level-odd div.note_contents span.note_content_value')
    print(len(test_rating))
    check_list = {'0', '1', '2', '3', '4', '5', '6', '7', '8', '9','10'}
    total_rating_confidence = []
    total_rating = []
    total_confidence = []
    # keep only the values that start with a number (ratings and confidences)
    for t in range(len(test_rating)):
        if any(test_rating[t].text.split(':')[0] in s for s in check_list):
            total_rating_confidence.append(test_rating[t].text.split(':')[0])
    # ratings and confidences alternate: even positions are ratings, odd are confidences
    for r in range(len(total_rating_confidence)):
        if (r % 2 == 0):
            total_rating.append(int(total_rating_confidence[r]))
        else:
            total_confidence.append(int(total_rating_confidence[r]))
    average_rating = sum(total_rating) / len(total_rating)
    average_confidence = sum(total_confidence) / len(total_confidence)
    csv_writer.writerow([title, keyword, abstract, link_pdf,len(total_rating),average_rating,average_confidence])
    n = n + 1
    print('Order {}'.format(n))
csv_file.close()

I'm no Python expert (in fact, I'm a rank beginner) but the simple answer is better parallelism & session management.
The useful answer is a bit more complicated.
You're leaving the Chromium session open, which is likely what's hoovering up all your RAM. If you call asession_inside.close() once you are done with an article, you may see an improvement in RAM usage.
As far as I can tell, you're doing everything in serial: you fetch each page and extract data on the articles in serial, then you query each article in serial as well.
You're using arender to fetch each article asynchronously, but you're awaiting it inside a standard for loop. As far as I understand, that means you're not getting any advantage from async; you're still processing each page one at a time (which explains your long processing time).
I'd suggest using asyncio to turn the for loop into a parallel version of itself, as suggested in this article. Make sure you set a task limit so that you don't try to load all the articles at once; that will also help with your RAM usage.
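A minimal sketch of that idea, assuming Python 3.7+: an asyncio.Semaphore caps how many articles are in flight while asyncio.gather runs them concurrently. The names fetch_rating, scrape_all and max_concurrent are made up for illustration, and asyncio.sleep stands in for the real get/arender/parse step, so this shows only the concurrency pattern, not working Requests-HTML code.

```python
import asyncio


async def fetch_rating(paper_id, limiter):
    # Hypothetical placeholder for the real work: session.get(), arender(),
    # parsing the ratings, and closing the session afterwards.
    async with limiter:            # at most max_concurrent bodies run at once
        await asyncio.sleep(0.01)  # stands in for network + rendering time
        return paper_id, len(paper_id)


async def scrape_all(paper_ids, max_concurrent=5):
    # Create the semaphore inside the running loop.
    limiter = asyncio.Semaphore(max_concurrent)
    tasks = [fetch_rating(pid, limiter) for pid in paper_ids]
    # gather() runs the tasks concurrently and keeps the input order.
    return await asyncio.gather(*tasks)


results = asyncio.run(scrape_all(['abc', 'de', 'f']))
print(results)
```

In a Jupyter notebook, where an event loop is already running, `results = await scrape_all(...)` would be used instead of asyncio.run().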

Related

Scraping Glassdoor returns duplicate entries

So I am trying to scrape job posts from Glassdoor using Requests, Beautiful Soup and Selenium. The entire code works except that, even after scraping data from 30 pages, most entries turn out to be duplicates (almost 80% of them!). It's not a headless scraper, so I know it is going to each new page. What could be the reason for so many duplicate entries? Could it be some sort of anti-scraping tool being used by Glassdoor, or is something off in my code?
The result turns out to be 870 entries of which a whopping 690 are duplicates!
My code:
def glassdoor_scraper(url):
driver = webdriver.Chrome()
driver.get(url)
time.sleep(10)
# Getting to the page where we want to start scraping
jobs_search_title = driver.find_element(By.ID, 'KeywordSearch')
jobs_search_title.send_keys('Data Analyst')
jobs_search_location = driver.find_element(By.ID, 'LocationSearch')
time.sleep(1)
jobs_search_location.clear()
jobs_search_location.send_keys('United States')
click_search = driver.find_element(By.ID, 'HeroSearchButton')
click_search.click()
for page_num in range(1,10):
time.sleep(10)
res = requests.get(driver.current_url)
soup = BeautifulSoup(res.text,'html.parser')
time.sleep(2)
companies = soup.select('.css-l2wjgv.e1n63ojh0.jobLink')
for company in companies:
companies_list.append(company.text)
positions = soup.select('.jobLink.css-1rd3saf.eigr9kq2')
for position in positions:
positions_list.append(position.text)
locations = soup.select('.css-l2fjlt.pr-xxsm.css-iii9i8.e1rrn5ka0')
for location in locations:
locations_list.append(location.text)
job_post = soup.select('.eigr9kq3')
for job in job_post:
salary_info = job.select('.e1wijj242')
if len(salary_info) > 0:
for salary in salary_info:
salaries_list.append(salary.text)
else:
salaries_list.append('Salary Not Found')
ratings = soup.select('.e1rrn5ka3')
for index, rating in enumerate(ratings):
if len(rating.text) > 0:
ratings_list.append(rating.text)
else:
ratings_list.append('Rating Not Found')
next_page = driver.find_elements(By.CLASS_NAME, 'e13qs2073')[1]
next_page.click()
time.sleep(5)
try:
close_jobalert_popup = driver.find_element(By.CLASS_NAME, 'modal_closeIcon')
except:
pass
else:
time.sleep(1)
close_jobalert_popup.click()
continue
#driver.close()
print(f'{len(companies_list)} jobs found for you!')
global glassdoor_dataset
glassdoor_dataset = pd.DataFrame(
{'Company Name': companies_list,
'Company Rating': ratings_list,
'Position Title': positions_list,
'Location' : locations_list,
'Est. Salary' : salaries_list
})
glassdoor_dataset.to_csv(r'glassdoor_jobs_scraped.csv')
You're going way too fast. You need to add some waits.
I see you are using fixed time.sleep() delays. Try explicit waits instead.
Something like the snippet below (put in your own conditions; you can also try an invisibility condition, e.g. wait for an element to become invisible and then visible again, to make sure you are on the next page).
If that does not help, increase your time.sleep() values.
WebDriverWait(driver, 40).until(expected_conditions.visibility_of_element_located(
    (By.XPATH, '//*[@id="wrapper"]/section/div/div/div[2]/button[2]')))
I don't think the repetition is due to a code issue - I think glassdoor just starts cycling results after a while. [If interested, see this gist for some stats - basically, from the 7th page or so, most of the 1st page results seem to be shown on every page onwards. I did a small test manually - with only 5 listings, by id, and even directly on an un-automated browser, they started repeating after a while....]
My suggestion would be to just filter them before looping to the next page - there's a data-id attribute for each li wrapped around the listings which seems to be a unique identifier. If we add that to the other columns' lists, we can start collecting only un-collected listings; if you just edit the for page_num loop to:
for page_num in range(1, 10):
time.sleep(10)
scrapedUrls.append(driver.current_url)
res = requests.get(driver.current_url)
soup = BeautifulSoup(res.text, 'html.parser')
# soup = BeautifulSoup(driver.page_source, 'html.parser') # no noticeable improvement
time.sleep(2)
filteredListings = [
di for di in soup.select('li[data-id]') if
di.get('data-id') not in datId_list
]
datId_list += [di.get('data-id') for di in filteredListings]
companies_list += [
t.select_one('.css-l2wjgv.e1n63ojh0.jobLink').get_text(strip=True)
if t.select_one('.css-l2wjgv.e1n63ojh0.jobLink')
else None for t in filteredListings
]
positions_list += [
t.select_one('.jobLink.css-1rd3saf.eigr9kq2').get_text(strip=True)
if t.select_one('.jobLink.css-1rd3saf.eigr9kq2')
else None for t in filteredListings
]
locations_list += [
t.select_one(
'.css-l2fjlt.pr-xxsm.css-iii9i8.e1rrn5ka0').get_text(strip=True)
if t.select_one('.css-l2fjlt.pr-xxsm.css-iii9i8.e1rrn5ka0')
else None for t in filteredListings
]
job_post = [
t.select('.eigr9kq3 .e1wijj242') for t in filteredListings
]
salaries_list += [
'Salary Not Found' if not j else
(j[0].text if len(j) == 1 else [s.text for s in j])
for j in job_post
]
ratings_list += [
t.select_one('.e1rrn5ka3').get_text(strip=True)
if t.select_one('.e1rrn5ka3')
else 'Rating Not Found' for t in filteredListings
]
and, if you added datId_list to the dataframe, it could serve as a meaningful index
dfDict = {'Data-Id': datId_list,
'Company Name': companies_list,
'Company Rating': ratings_list,
'Position Title': positions_list,
'Location': locations_list,
'Est. Salary': salaries_list
}
for k in dfDict:
print(k, len(dfDict[k]))
glassdoor_dataset = pd.DataFrame(dfDict)
glassdoor_dataset = glassdoor_dataset.set_index('Data-Id', drop=True)  # set_index returns a new DataFrame
glassdoor_dataset.to_csv(r'glassdoor_jobs_scraped.csv')
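The filtering idea can be reduced to a small standalone sketch. This variant swaps the list of seen ids for a set, which keeps the membership test cheap as the number of listings grows. The function name keep_unseen and the sample dictionaries are invented for illustration; in the real scraper the ids would come from the data-id attribute of each li.

```python
def keep_unseen(listings, seen_ids):
    """Return the listings whose 'data-id' is new, recording each id in seen_ids."""
    fresh = []
    for listing in listings:
        data_id = listing['data-id']
        if data_id not in seen_ids:
            seen_ids.add(data_id)
            fresh.append(listing)
    return fresh


# Made-up sample data: the second page repeats one listing from the first.
seen = set()
page_1 = [{'data-id': '101', 'company': 'A'}, {'data-id': '102', 'company': 'B'}]
page_7 = [{'data-id': '101', 'company': 'A'}, {'data-id': '205', 'company': 'C'}]

rows = keep_unseen(page_1, seen) + keep_unseen(page_7, seen)
print([row['data-id'] for row in rows])
```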

How to improve the speed of getting request content via the request module

The functions below extract content from 'http://thegreyhoundrecorder.com.au/form-guides/' and append it all to a list. The code works fine, but scraping the content from the website is slow. The line tree = html.fromstring(page.content) in particular seems to slow down the process. Is there a way I can improve the speed of my requests?
import lxml
from lxml import html
import requests
import re
import pandas as pd
from requests.exceptions import ConnectionError
greyhound_url = 'http://thegreyhoundrecorder.com.au/form-guides/'
def get_page(url):
    """Take the page url and return a list of links to the articles (fields)
    we want to scrape.
    """
    page = requests.get(url)
    tree = html.fromstring(page.content)
    my_list = tree.xpath('//tbody/tr/td[2]/a/@href') # grab all links
    print('Length of all links = ', len(my_list))
    my_url = [page.url.split('/form-guides')[0] + str(s) for s in my_list]
    return my_url
def extract_data(my_url):
    """
    Take a list of urls and extract the needed information from the
    greyhound website.
    return: a list with the extracted fields
    """
    new_list = []
    try:
        for t in my_url:
            print(t)
            page_detail = requests.get(t)
            tree_1 = html.fromstring(page_detail.content)
            title = ''.join(tree_1.xpath('//div/h1[@class="title"]/text()'))
            race_number = tree_1.xpath("//tr[@id = 'tableHeader']/td[1]/text()")
            Distance = tree_1.xpath("//tr[@id = 'tableHeader']/td[3]/text()")
            TGR_Grade = tree_1.xpath("//tr[@id = 'tableHeader']/td[4]/text()")
            TGR1 = tree_1.xpath("//tbody/tr[@class='fieldsTableRow raceTipsRow']//div/span[1]/text()")
            TGR2 = tree_1.xpath("//tbody/tr[@class='fieldsTableRow raceTipsRow']//div/span[2]/text()")
            TGR3 = tree_1.xpath("//tbody/tr[@class='fieldsTableRow raceTipsRow']//div/span[3]/text()")
            TGR4 = tree_1.xpath("//tbody/tr[@class='fieldsTableRow raceTipsRow']//div/span[4]/text()")
            clean_title = title.split(' ')[0].strip()
            #clean title and extract track name
            Track = title.split(' ')[0].strip()
            #clean title and extract track date
            date = title.split('-')[1].strip()
            #use the current year as the track year
            year = pd.to_datetime('now').year
            #convert date to pandas datetime
            race_date = pd.to_datetime(date + ' ' + str(year)).strftime('%d/%m/%Y')
            #extract race number
            new_rn = []
            for number in race_number:
                match = re.search(r'^(.).*?(\d+)$', number)
                new_rn.append(match.group(1) + match.group(2))
            new_list.append((race_date,Track,new_rn,Distance,TGR_Grade,TGR1,TGR2,TGR3,TGR4))
        return new_list
    except ConnectionError as e:
        print('Connection error, connect to a stronger network or reload the page')

Python multiprocessing how to update a complex object in a manager list without using .join() method

I started programming in Python about 2 months ago and I've been struggling with this problem in the last 2 weeks.
I know there are many similar threads to this one but I can't really find a solution which suits my case.
I need to have the main process which is the one which interacts with Telegram and another process, buffer, which understands the complex object received from the main and updates it.
I'd like to do this in a simpler and smoother way.
At the moment objects are not being updated due to the use of multi-processing without the join() method.
I then tried multi-threading instead, but it gives me compatibility problems with Pyrogram, a framework which I am using to interact with Telegram.
I have rewritten the structure of my project in a simplified form that reproduces the same error, so that it is easier for everyone to help.
a.py
class A():
def __init__(self, length = -1, height = -1):
self.length = length
self.height = height
b.py
from a import A
class B(A):
def __init__(self, length = -1, height = -1, width = -1):
super().__init__(length = -1, height = -1)
self.length = length
self.height = height
self.width = width
def setHeight(self, value):
self.height = value
c.py
class C():
def __init__(self, a, x = 0, y = 0):
self.a = a
self.x = x
self.y = y
def func1(self):
if self.x < 7:
self.x = 7
d.py
from c import C
class D(C):
def __init__(self, a, x = 0, y = 0, z = 0):
super().__init__(a, x = 0, y = 0)
self.a = a
self.x = x
self.y = y
self.z = z
def func2(self):
self.func1()
main.py
from b import B
from d import D
from multiprocessing import Process, Manager
from buffer import buffer
if __name__ == "__main__":
manager = Manager()
lizt = manager.list()
buffer = Process(target = buffer, args = (lizt, )) #passing the list as a parameter
buffer.start()
#can't invoke buffer.join() here because I need the below code to keep running while the buffer process takes a few minutes to end an instance passed in the list
#hence I can't wait the join() function to update the objects inside the buffer but i need objects updated in order to pop them out from the list
import datetime as dt
t = dt.datetime.now()
#library of kind of multithreading (pool of 4 processes), uses asyncio lib
#this while was put to reproduce the same error I am getting
while True:
if t + dt.timedelta(seconds = 10) < dt.datetime.now():
lizt.append(D(B(5, 5, 5)))
t = dt.datetime.now()
"""
#This is the code which looks like the one in my project
#main.py
from pyrogram import Client #library of kind of multithreading (pool of 4 processes), uses asyncio lib
from b import B
from d import D
from multiprocessing import Process, Manager
from buffer import buffer
if __name__ == "__main__":
api_id = 1234567
api_hash = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
app = Client("my_account", api_id, api_hash)
manager = Manager()
lizt = manager.list()
buffer = Process(target = buffer, args = (lizt, )) #passing the list as a parameter
buffer.start()
#can't invoke buffer.join() here because I need the below code to run at the same time as the buffer process
#hence I can't wait the join() function to update the objects inside the buffer
@app.on_message()
def my_handler(client, message):
lizt.append(complex_object_conatining_message)
"""
buffer.py
def buffer(buffer):
print("buffer was defined")
while True:
if len(buffer) > 0:
print(buffer[0].x) #prints 0
buffer[0].func2() #this changes the class attribute locally in the class instance but not in here
print(buffer[0].x) #prints 0, but I'd like it to be 7
print(buffer[0].a.height) #prints 5
buffer[0].a.setHeight(10) #and this has the same behaviour
print(buffer[0].a.height) #prints 5 but I'd like it to be 10
buffer.pop(0)
This is the whole code about the problem I am having.
Literally every suggestion is welcome, hopefully constructive, thank you in advance!
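For context, the behaviour seen in buffer.py matches what the multiprocessing documentation describes for manager proxies: reading buffer[0] returns a copy of the stored item, so mutating that copy does not change the item held by the manager unless it is assigned back. A minimal sketch, using a plain dict in place of the D/B objects (the function name demo is made up for illustration):

```python
from multiprocessing import Manager


def demo():
    with Manager() as manager:
        shared = manager.list()
        shared.append({'x': 0})

        # shared[0] hands back a copy, so this change is lost.
        shared[0]['x'] = 7
        before = shared[0]['x']

        # Fetch, modify, then assign back so the manager stores the update.
        item = shared[0]
        item['x'] = 7
        shared[0] = item
        after = shared[0]['x']
        return before, after


if __name__ == "__main__":
    print(demo())
```

Applied to the question's code, that would mean something like obj = buffer[0]; obj.func2(); buffer[0] = obj, assuming the objects can be pickled.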
In the end I had to change my approach: I solved the problem with asyncio, which the framework itself also uses.
This solution offers everything I was looking for:
- complex objects are updated
- the problems of multiprocessing are avoided (in particular with join())
It is also:
- lightweight: before, I had 2 Python processes, (1) about 40K and (2) about 75K.
The current single process is about 30K (and it's also faster and cleaner).
Here's the solution; I hope it will be as useful to someone else as it was to me.
The classes are omitted because this solution updates complex objects absolutely fine.
main.py
from pyrogram import Client
import asyncio
import time
def cancel_tasks():
#get all task in current loop
tasks = asyncio.Task.all_tasks()
for t in tasks:
t.cancel()
try:
buffer = []
firstWorker(buffer) #this one is the old buffer.py file and function
#the missing loop and loop method are explained in the next piece of code
except KeyboardInterrupt:
print("")
finally:
print("Closing Loop")
cancel_tasks()
firstWorker.py
import asyncio
def firstWorker(buffer):
print("First Worker Executed")
api_id = 1234567
api_hash = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
app = Client("my_account", api_id, api_hash)
@app.on_message()
async def my_handler(client, message):
print("Message Arrived")
buffer.append(complex_object_conatining_message)
await asyncio.sleep(1)
app.run(secondWorker(buffer)) #here is the trick: I changed the
#method run() of the Client class
#inside the Pyrogram framework
#since it was a loop itself.
#In this way I added another task
#to the existing loop in order to
#let both of them run together.
secondWorker.py
import asyncio
async def secondWorker(buffer):
while True:
if len(buffer) > 0:
print(buffer.pop(0))
await asyncio.sleep(1)
The resources to understand the asyncio used in this code can be found here:
Asyncio simple tutorial
Python Asyncio Official Documentation
This tutorial about how to fix classical Asyncio errors
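The single-loop idea can also be tried without Pyrogram. The sketch below is a variation, not the code above: it uses an asyncio.Queue instead of polling a list with sleep(1), and the Job class, producer and consumer are invented stand-ins for the message handler and secondWorker. Because both coroutines run in one process, the consumer mutates the very same objects the producer created.

```python
import asyncio


class Job:
    """Made-up stand-in for the complex object built from a message."""

    def __init__(self):
        self.x = 0

    def func(self):
        if self.x < 7:
            self.x = 7


async def producer(queue, count):
    jobs = []
    for _ in range(count):
        job = Job()
        jobs.append(job)
        await queue.put(job)
    await queue.put(None)  # sentinel: tells the consumer to stop
    return jobs


async def consumer(queue):
    while True:
        job = await queue.get()  # waits without blocking the event loop
        if job is None:
            break
        job.func()               # updates the shared object in place


async def main():
    queue = asyncio.Queue()
    jobs, _ = await asyncio.gather(producer(queue, 3), consumer(queue))
    return [job.x for job in jobs]


values = asyncio.run(main())
print(values)
```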

how to make cross hair mouse tracker on a PlotWidget() promoted in designer-qt5

I am trying to make a cross hair on my pyqtgraph interactive plots, which are embedded in a PyQt5 GUI thanks to the designer-qt5. I found a working
code in the pyqtgraph "examples". A simplified WORKING example is posted below. Now I want the same, but the problem seems to be that I promoted a
QGraphicsView() to a pg.PlotWidget in the designer, instead of pg.GraphicsWindow()? The Code does not work for me because my p1 is "pyqtgraph.widgets.PlotWidget.PlotWidget object" while in the example p1 is
"pyqtgraph.graphicsItems.PlotItem.PlotItem.PlotItem object".
So what should I do to make this example work for me?
import numpy as np
import pyqtgraph as pg
from pyqtgraph.Qt import QtGui, QtCore
from pyqtgraph.Point import Point
pg.setConfigOption('background', '#ffffff')
pg.setConfigOption('foreground', 'k')
pg.setConfigOptions(antialias=True)
app = QtGui.QApplication([])
win = pg.GraphicsWindow()
win.setWindowTitle('pyqtgraph example: crosshair')
label = pg.LabelItem(justify='right')
win.addItem(label)
p1 = win.addPlot(row=1, col=0)
p1.setAutoVisible(y=True)
#create numpy arrays
#make the numbers large to show that the xrange shows data from 10000 to all the way 0
data1 = 10000 + 15000 * pg.gaussianFilter(np.random.random(size=10000), 10) + 3000 * np.random.random(size=10000)
p1.plot(data1, pen="r")
#cross hair
vLine = pg.InfiniteLine(angle=90, movable=False)
hLine = pg.InfiniteLine(angle=0, movable=False)
p1.addItem(vLine, ignoreBounds=True)
p1.addItem(hLine, ignoreBounds=True)
vb = p1.vb
print(p1)
print(vb)
def mouseMoved(evt):
pos = evt[0] ## using signal proxy turns original arguments into a tuple
if p1.sceneBoundingRect().contains(pos):
mousePoint = vb.mapSceneToView(pos)
index = int(mousePoint.x())
if index > 0 and index < len(data1):
label.setText("<span style='font-size: 12pt'>x=%0.1f, <span style='color: green'>y2=%0.1f</span>" % (mousePoint.x(), data1[index]))
vLine.setPos(mousePoint.x())
hLine.setPos(mousePoint.y())
proxy = pg.SignalProxy(p1.scene().sigMouseMoved, rateLimit=60, slot=mouseMoved)
#p1.scene().sigMouseMoved.connect(mouseMoved)
## Start Qt event loop unless running in interactive mode or using pyside.
if __name__ == '__main__':
import sys
if (sys.flags.interactive != 1) or not hasattr(QtCore, 'PYQT_VERSION'):
QtGui.QApplication.instance().exec_()
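Sorry for the noise, I fixed it myself!
The important part was:
plot_wg.proxy = proxy
Very simple... Storing the SignalProxy on the widget keeps a reference to it; otherwise it is presumably garbage-collected as soon as the function returns.
Below is the function, which should work for any PlotWidget: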
def cross_hair(self, plot_wg, log=False):
    global fit
    ################### TEST cross hair ############
    vLine = pg.InfiniteLine(angle=90, movable=False)#, pos=0)
    hLine = pg.InfiniteLine(angle=0, movable=False)#, pos=2450000)
    plot_wg.addItem(vLine, ignoreBounds=True)
    plot_wg.addItem(hLine, ignoreBounds=True)
    vb = plot_wg.getViewBox()
    label = pg.TextItem()
    plot_wg.addItem(label)
    def mouseMoved(evt):
        pos = evt[0] ## using signal proxy turns original arguments into a tuple
        if plot_wg.sceneBoundingRect().contains(pos):
            mousePoint = vb.mapSceneToView(pos)
            if log == True:
                label.setText("x=%0.3f, y1=%0.3f"%(10**mousePoint.x(), mousePoint.y()))
            else:
                label.setText("x=%0.3f, y1=%0.3f"%(mousePoint.x(), mousePoint.y()))
            vLine.setPos(mousePoint.x())
            hLine.setPos(mousePoint.y())
            #print(mousePoint.x(),mousePoint.y())
    plot_wg.getViewBox().setAutoVisible(y=True)
    proxy = pg.SignalProxy(plot_wg.scene().sigMouseMoved, rateLimit=60, slot=mouseMoved)
    # keep a reference on the widget so the proxy stays alive
    plot_wg.proxy = proxy
    ################### TEST cross hair ############

How to get ASINs XPATH from 2 different Amazon pages that have the same parent nodes?

I made a web scraping program using Python and WebDriver, and I want to extract the ASIN from 2 different pages. I would like one XPath to work for both links.
These are the Amazon pages: https://www.amazon.com/Hydro-Flask-Wide-Mouth-Flip/dp/B01ACATW7E/ref=sr_1_3?s=kitchen&ie=UTF8&qid=1520348607&sr=1-3&keywords=-gfds and
https://www.amazon.com/Ubbi-Saving-Special-Required-Locking/dp/B00821FLSU/ref=sr_1_1?s=baby-products&ie=UTF8&qid=1520265799&sr=1-1&keywords=-hgfd&th=1. They have the same parent nodes (id, classes). How can I make this program work for these 2 links at the same time?
The problem is in these lines of code (36 and 41):
asin = driver.find_element_by_xpath('//div[@id="detail-bullets"]/table/tbody/tr/td/div/ul/li[4]').text
and
asin = driver.find_element_by_xpath('//div[@id="detail-bullets_feature_div"]/div[@id="detail-bullets"]/table/tbody/tr/td/div/ul/li[5]').text
I have to change these lines so that the CSV contains the ASINs for both products. For the first link it prints the wrong information, and for the second it prints the ASIN.
I have attached the code. I would appreciate any help.
from selenium import webdriver
import csv
import io
# set the proxies to hide actual IP
proxies = {
'http': 'http://5.189.133.231:80',
'https': 'https://27.111.43.178:8080'
}
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--proxy-server="%s"' % ';'.join(['%s=%s' % (k, v) for k, v in proxies.items()]))
driver = webdriver.Chrome(executable_path="C:\\Users\Andrei-PC\Downloads\webdriver\chromedriver.exe",
chrome_options=chrome_options)
header = ['Product title', 'ASIN']
with open('csv/bot_1.csv', "w") as output:
writer = csv.writer(output)
writer.writerow(header)
links=['https://www.amazon.com/Hydro-Flask-Wide-Mouth-Flip/dp/B01ACATW7E/ref=sr_1_3?s=kitchen&ie=UTF8&qid=1520348607&sr=1-3&keywords=-gfds',
'https://www.amazon.com/Ubbi-Saving-Special-Required-Locking/dp/B00821FLSU/ref=sr_1_1?s=baby-products&ie=UTF8&qid=1520265799&sr=1-1&keywords=-hgfd&th=1'
]
for i in range(len(links)):
    driver.get(links[i])
    product_title = driver.find_elements_by_xpath('//*[@id="productTitle"][1]')
    prod_title = [x.text for x in product_title]
    try:
        asin = driver.find_element_by_xpath('//div[@id="detail-bullets"]/table/tbody/tr/td/div/ul/li[4]').text
    except:
        print('no ASIN template one')
    try:
        asin = driver.find_element_by_xpath('//div[@id="detail-bullets_feature_div"]/div[@id="detail-bullets"]/table/tbody/tr/td/div/ul/li[5]').text
    except:
        print('no ASIN template two')
    try:
        data = [prod_title[0], asin]
    except:
        print('no items v3 ')
    with io.open('csv/bot_1.csv', "a", newline="", encoding="utf-8") as output:
        writer = csv.writer(output)
        writer.writerow(data)
You can simply use
//li[b="ASIN:"]
to get the required element on both pages.
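The point of that expression is that it selects the li by the text of its b label instead of by position, so it no longer matters whether the ASIN is the 4th or the 5th bullet. A small offline sketch with the standard library's ElementTree, which supports this predicate form; the two markup snippets are simplified, made-up stand-ins for the real product pages, and extract_asin is a hypothetical helper name. In the Selenium script the same expression would be passed to find_element_by_xpath.

```python
import xml.etree.ElementTree as ET

# Simplified, invented markup: the ASIN bullet sits at a different position on each page.
PAGE_ONE = """<div id="detail-bullets"><ul>
<li><b>Shipping Weight:</b> 1 pound</li>
<li><b>ASIN:</b> B01ACATW7E</li>
</ul></div>"""

PAGE_TWO = """<div id="detail-bullets"><ul>
<li><b>Item model number:</b> 10000</li>
<li><b>Shipping Weight:</b> 7 pounds</li>
<li><b>ASIN:</b> B00821FLSU</li>
</ul></div>"""


def extract_asin(markup):
    root = ET.fromstring(markup)
    # Select the <li> that has a <b> child whose text is exactly "ASIN:".
    li = root.find(".//li[b='ASIN:']")
    if li is None:
        return None
    text = "".join(li.itertext())  # label plus value
    return text.replace("ASIN:", "").strip()


asins = [extract_asin(PAGE_ONE), extract_asin(PAGE_TWO)]
print(asins)
```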