Scraping Glassdoor returns duplicate entries - selenium

So I am trying to scrape job posts from Glassdoor using Requests, Beautiful Soup and Selenium. The entire code works, except that even after scraping data from 30 pages, most entries turn out to be duplicates (almost 80% of them!). It's not a headless scraper, so I know it is going to each new page. What could be the reason for so many duplicate entries? Could it be some sort of anti-scraping tool being used by Glassdoor, or is something off in my code?
The result turns out to be 870 entries of which a whopping 690 are duplicates!
My code:
def glassdoor_scraper(url):
    driver = webdriver.Chrome()
    driver.get(url)
    time.sleep(10)
    # Getting to the page where we want to start scraping
    jobs_search_title = driver.find_element(By.ID, 'KeywordSearch')
    jobs_search_title.send_keys('Data Analyst')
    jobs_search_location = driver.find_element(By.ID, 'LocationSearch')
    time.sleep(1)
    jobs_search_location.clear()
    jobs_search_location.send_keys('United States')
    click_search = driver.find_element(By.ID, 'HeroSearchButton')
    click_search.click()
    for page_num in range(1,10):
        time.sleep(10)
        res = requests.get(driver.current_url)
        soup = BeautifulSoup(res.text, 'html.parser')
        time.sleep(2)
        companies = soup.select('.css-l2wjgv.e1n63ojh0.jobLink')
        for company in companies:
            companies_list.append(company.text)
        positions = soup.select('.jobLink.css-1rd3saf.eigr9kq2')
        for position in positions:
            positions_list.append(position.text)
        locations = soup.select('.css-l2fjlt.pr-xxsm.css-iii9i8.e1rrn5ka0')
        for location in locations:
            locations_list.append(location.text)
        job_post = soup.select('.eigr9kq3')
        for job in job_post:
            salary_info = job.select('.e1wijj242')
            if len(salary_info) > 0:
                for salary in salary_info:
                    salaries_list.append(salary.text)
            else:
                salaries_list.append('Salary Not Found')
        ratings = soup.select('.e1rrn5ka3')
        for index, rating in enumerate(ratings):
            if len(rating.text) > 0:
                ratings_list.append(rating.text)
            else:
                ratings_list.append('Rating Not Found')
        next_page = driver.find_elements(By.CLASS_NAME, 'e13qs2073')[1]
        next_page.click()
        time.sleep(5)
        try:
            close_jobalert_popup = driver.find_element(By.CLASS_NAME, 'modal_closeIcon')
        except:
            pass
        else:
            time.sleep(1)
            close_jobalert_popup.click()
        continue
    #driver.close()
    print(f'{len(companies_list)} jobs found for you!')
    global glassdoor_dataset
    glassdoor_dataset = pd.DataFrame(
        {'Company Name': companies_list,
         'Company Rating': ratings_list,
         'Position Title': positions_list,
         'Location': locations_list,
         'Est. Salary': salaries_list
         })
    glassdoor_dataset.to_csv(r'glassdoor_jobs_scraped.csv')

You're going way too fast; you need to add some waits.
I see you are relying on fixed time.sleep() pauses. Try explicit waits instead (put in your own conditions; you can also wait for an element to become invisible and then visible again, to make sure you are on the next page). If that doesn't help, increase your time.sleep() durations.
Something like this:
WebDriverWait(driver, 40).until(expected_conditions.visibility_of_element_located(
    (By.XPATH, '//*[@id="wrapper"]/section/div/div/div[2]/button[2]')))
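As a concrete sketch of the "invisible, then visible again" idea: the .loading-spinner overlay and the li[data-id] job card below are placeholder selectors (not taken from Glassdoor's actual markup), and driver / next_page are assumed from the question's code, so swap in whatever elements really appear on the page.
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 40)
next_page.click()
# placeholder selector: wait for a loading overlay to disappear...
wait.until(EC.invisibility_of_element_located((By.CSS_SELECTOR, '.loading-spinner')))
# ...then wait for a job card of the newly loaded page to become visible
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, 'li[data-id]')))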

I don't think the repetition is due to a code issue - I think Glassdoor just starts cycling results after a while. [If interested, see this gist for some stats - basically, from about the 7th page onwards, most of the 1st-page results seem to be shown on every page. I did a small manual test with only 5 listings, tracked by id, and even directly in an un-automated browser they started repeating after a while.]
My suggestion would be to filter the listings before looping to the next page - each listing's li wrapper has a data-id attribute that seems to be a unique identifier. If we track those ids alongside the other columns' lists, we can collect only listings that haven't been collected yet; just edit the for page_num loop to:
for page_num in range(1, 10):
    time.sleep(10)
    scrapedUrls.append(driver.current_url)
    res = requests.get(driver.current_url)
    soup = BeautifulSoup(res.text, 'html.parser')
    # soup = BeautifulSoup(driver.page_source, 'html.parser')  # no noticeable improvement
    time.sleep(2)
    filteredListings = [
        di for di in soup.select('li[data-id]') if
        di.get('data-id') not in datId_list
    ]
    datId_list += [di.get('data-id') for di in filteredListings]
    companies_list += [
        t.select_one('.css-l2wjgv.e1n63ojh0.jobLink').get_text(strip=True)
        if t.select_one('.css-l2wjgv.e1n63ojh0.jobLink')
        else None for t in filteredListings
    ]
    positions_list += [
        t.select_one('.jobLink.css-1rd3saf.eigr9kq2').get_text(strip=True)
        if t.select_one('.jobLink.css-1rd3saf.eigr9kq2')
        else None for t in filteredListings
    ]
    locations_list += [
        t.select_one(
            '.css-l2fjlt.pr-xxsm.css-iii9i8.e1rrn5ka0').get_text(strip=True)
        if t.select_one('.css-l2fjlt.pr-xxsm.css-iii9i8.e1rrn5ka0')
        else None for t in filteredListings
    ]
    job_post = [
        t.select('.eigr9kq3 .e1wijj242') for t in filteredListings
    ]
    salaries_list += [
        'Salary Not Found' if not j else
        (j[0].text if len(j) == 1 else [s.text for s in j])
        for j in job_post
    ]
    ratings_list += [
        t.select_one('.e1rrn5ka3').get_text(strip=True)
        if t.select_one('.e1rrn5ka3')
        else 'Rating Not Found' for t in filteredListings
    ]
And, if you add datId_list to the DataFrame, it can serve as a meaningful index:
dfDict = {'Data-Id': datId_list,
          'Company Name': companies_list,
          'Company Rating': ratings_list,
          'Position Title': positions_list,
          'Location': locations_list,
          'Est. Salary': salaries_list
          }
for k in dfDict:
    print(k, len(dfDict[k]))
glassdoor_dataset = pd.DataFrame(dfDict)
# set_index returns a new DataFrame, so the result has to be assigned back
glassdoor_dataset = glassdoor_dataset.set_index('Data-Id', drop=True)
glassdoor_dataset.to_csv(r'glassdoor_jobs_scraped.csv')
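As an extra safety net, any duplicates that still slip past the data-id filtering can also be dropped at the DataFrame stage; a one-line sketch, assuming the Data-Id index set just above:
# keep only the first row for each Data-Id index value
glassdoor_dataset = glassdoor_dataset[~glassdoor_dataset.index.duplicated(keep='first')]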

Related

Extract part of a string (soup element) within a list of lists

I'm having some issues with scraping fish images off a website.
species_with_foto = ['/fangster/aborre-perca-fluviatilis/1',
                     '/fangster/almindelig-tangnaal-syngnathus-typhle/155',
                     '/fangster/ansjos-engraulis-encrasicholus/66',
                     '/fangster/atlantisk-tun-blaafinnet-tun-thunnus-thynnus-/137']
titles = []
species = []
for x in species_with_foto:
    specie_page = 'https://www.fiskefoto.dk' + x
    driver.get(specie_page)
    content = driver.page_source
    soup = BeautifulSoup(content)
    brutto = soup.find_all('img', attrs={'class': 'rapportBillede'})
    species.append(brutto)
    #print(brutto)
    titles.append(x)
    try:
        driver.find_element(by=By.XPATH, value='/html/body/form/div[4]/div[1]/div/div[13]/div[2]/div/div').click()
        print('Clicked next', x)
    except NoSuchElementException:
        print('Successfully finished -', x)
    time.sleep(2)
This returns a list of lists, with each sublist looking like this:
[<img alt="Aborre (Perca fluviatilis) aborrefiskeri, striber, rygfinne, regnorm, majs, spinner, " class="rapportBillede" src="/images/400/aborre-perca-fluviatilis-medefiskeri-bundrig-0,220kg-24cm-striber-rygfinne-regnorm-majs-spinner-358-22-29-14-740-2013-21-4.jpg" style="width:50%;"/>,
<img alt="Aborre (Perca fluviatilis) aborrefiskeri, striber, rygfinne, regnorm, majs, spinner, " class="rapportBillede" src="/images/400/aborre-perca-fluviatilis-medefiskeri-prop-flaad-med-levende-skalle-paa-enkeltkrog-1.6kg-46cm-6604-1724617.jpg" style="width:calc(50% - 6px);margin-bottom:7px;"/>]
How can I clean up the list and keep only the src="/images/400/aborre-perca-fluviatilis-medefiskeri-prop-flaad-med-levende-skalle-paa-enkeltkrog-1.6kg-46cm-6604-1724617.jpg" part? I have tried other variables in soup.find_all but can't get it to work.
(The Selenium part is also not functioning properly, but I'll get to that later.)
EDIT:
This is my code now; I'm really getting close :) One issue is that my photos are no longer saved in a list of lists but in a flat list, and for the love of god I don't understand why this happens.
Help fixing and understanding this would be greatly appreciated!
titles = []
fish_photos = []
for x in species_with_foto_mini:
    site = "https://www.fiskefoto.dk/" + x
    html = urlopen(site)
    bs = BeautifulSoup(html, 'html.parser')
    titles.append(x)
    try:
        images = bs.find_all('img', attrs={'class': 'rapportBillede'})
        for img in images:
            if img.has_attr('src'):
                #print(img['src'])
                a = (img['src'])
                fish_photos.append(a)
    except KeyError:
        print('No src')
    # navigate pages
    try:
        driver.find_element(by=By.XPATH, value='/html/body/form/div[4]/div[1]/div/div[13]/div[2]/div/div').click()
        print('Clicked next', x)
    except NoSuchElementException:
        print('Successfully finished -', x)
    time.sleep(2)
EDIT:
I need the end result to be a list of lists looking something like this:
fish_photos = [
    ['/images/400/aborre-perca-fluviatilis-medefiskeri-bundrig-0,220kg-24cm-striber-rygfinne-regnorm-majs-spinner-358-22-29-14-740-2013-21-4.jpg',
     '/images/400/aborre-perca-fluviatilis-medefiskeri-prop-flaad-med-levende-skalle-paa-enkeltkrog-1.6kg-46cm-6604-1724617.jpg'],
    ['/images/400/tungehvarre-arnoglossus-laterna-medefiskeri-6650-2523403.jpg',
     '/images/400/ulk-myoxocephalus-scorpius-medefiskeri-bundrig-koebenhavner-koebenhavner-torsk-mole-sild-boersteorm-pigge-351-18-48-9-680-2013-6-4.jpg'],
    ['/images/400/graeskarpe-ctenopharyngodon-idella-medefiskeri-bobleflaad-med-toastbroed-paa-enkeltkrog-5.02kg-77cm-6436-7486.jpg',
     '/images/400/graeskarpe-ctenopharyngodon-idella-medefiskeri-bobleflaad-med-toastbroed-paa-enkeltkrog-10.38kg-96cm-6337-4823146.jpg']
]
EDIT:
My output now is a list of identical lists. I need it to put every species in its own list, like this: fish_photo_list = [[trout1, trout2, trout3], [other_fish1, other_fish2, other_fish3], [salmon1, salmon2]]
My initial code did this, but the current one doesn't.
Here is an example, you can change:
from urllib.request import urlopen
from bs4 import BeautifulSoup

site = "[insert name of the site]"
html = urlopen(site)
bs = BeautifulSoup(html, 'html.parser')
try:
    images = bs.find_all('img')
    for img in images:
        if img.has_attr('src'):
            print(img['src'])
except KeyError:
    print('No src')
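To get the list-of-lists shape asked for in the edit (one sublist per species), you can build one list of src values per page and append that whole list, instead of appending each src individually. A rough sketch, assuming the same species_with_foto paths and the rapportBillede class from the question:
from urllib.request import urlopen
from bs4 import BeautifulSoup

species_with_foto = ['/fangster/aborre-perca-fluviatilis/1',
                     '/fangster/ansjos-engraulis-encrasicholus/66']

fish_photos = []
for x in species_with_foto:
    bs = BeautifulSoup(urlopen('https://www.fiskefoto.dk' + x), 'html.parser')
    # one list of src values per species page -> a list of lists overall
    srcs = [img['src'] for img in bs.find_all('img', attrs={'class': 'rapportBillede'})
            if img.has_attr('src')]
    fish_photos.append(srcs)

print(fish_photos)  # [[...srcs for species 1...], [...srcs for species 2...]]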

Issues with Selenium background execution (python)

I wrote a little script to scrape a website and save its reviews; I'm rather a beginner at scraping.
It's the typical webstore, with a review section.
Comments in that review load dynamically by clicking on the "load more button".
My issue isn't specific to one site.
The idea of the code is to browse through the products and, for each product, load its reviews. Some products have more reviews than others, up to 2000 or so.
By default, the website displays 5 reviews, and if you want to see more, you've got to click on the "load more" button.
Basically, if I launch my code, it opens my Chrome browser and does all of that. But if I launch my code and then switch desktops (on Mac), it doesn't load the reviews: it detects that there are more reviews, but then keeps looking for the "load more" button without finding it (at least, that's what I assume, since it throws no error).
Obviously it's got something to do with the fact that I'm switching desktops, but I don't know how to fix it.
Has anyone experienced similar issues? Am I missing something here?
MacOs Big Sur
Chrome webdriver
using proxies
# imports added for completeness; read_perfume_links_from_pickle is the OP's own helper (not shown)
import random
import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from tqdm import tqdm

driver = webdriver.Chrome()

#####
# function to get the number of reviews
def get_number_of_reviews():
    try:
        driver.execute_script(
            "arguments[0].scrollIntoView(true);",
            WebDriverWait(driver, 10).until(
                EC.visibility_of_element_located(
                    (By.CSS_SELECTOR, ".bv_numReviews_text")
                )
            ),
        )
        number_of_reviews = driver.find_element_by_css_selector(
            ".bv_numReviews_text"
        ).text
        number_of_reviews = int(number_of_reviews[1 : (len(number_of_reviews) - 1)])
    except Exception as e:
        # A common exception is when there is no review: the element containing
        # the reviews is missing and therefore throws an exception
        number_of_reviews = 0
    return number_of_reviews

# function to load every review in the perfume review section
"""
# While it's possible, click load more
# If it's not clickable anymore or the button disappeared, break out
"""
def load_all_reviews(number_of_reviews, start):
    start = start
    while True:
        try:
            driver.execute_script(
                "arguments[0].scrollIntoView(true);",
                WebDriverWait(driver, 10).until(
                    EC.visibility_of_element_located(
                        (By.CSS_SELECTOR, ".bv-content-btn-pages-load-more")
                    )
                ),
            )
            driver.execute_script(
                "arguments[0].click();",
                WebDriverWait(driver, 20).until(
                    EC.element_to_be_clickable(
                        (By.CSS_SELECTOR, ".bv-content-btn-pages-load-more")
                    )
                ),
            )
            print("\n New reviews have been loaded!")
        except Exception as e:
            reviews = driver.find_elements_by_css_selector(".bv-content-review")
            print("\n There aren't any new reviews to load")
            print("\n Time elapsed: ", (time.time() - start))
            if len(reviews) == number_of_reviews or (time.time() - start) > 25:
                return reviews
            else:
                print("another round! ", len(reviews))
                return load_all_reviews(number_of_reviews, start)
            break

# function returning the user-assigned rating from 1 to 5
def get_rating():
    try:
        return review.find_element_by_class_name("bv-off-screen").text[0]
    except Exception as e:
        return "NA"

# function returning the review title
def get_title():
    try:
        return review.find_element_by_class_name("bv-content-title-container").text
    except:
        return "NA"

# function returning the review
def get_review():
    try:
        return review.find_element_by_class_name("bv-content-summary-body-text").text
    except:
        return "NA"

# We create a list containing all perfume links
links = read_perfume_links_from_pickle("data/perfume_list_sephora.pickle")
driver.get(links[random.randint(0, 1300)])
driver.implicitly_wait(10)
driver.find_element_by_xpath('//*[@id="footer_tc_privacy_button_2"]').click()

"""
# Iterating over the list of links to the products
# Checking the number of reviews
# If there are no reviews, we break out of the while loop
# Else, we collect each review
# Then, we move on to the next product
"""
lines = []
for link in tqdm(links, position=0, leave=True):
    url = link
    driver.get(url)
    # We check the number of reviews
    number_of_reviews = get_number_of_reviews()
    if number_of_reviews > 0:
        print("Number of reviews: ", number_of_reviews)
        print("sleeping")
        time.sleep(6)
        print("\n waking up")
        driver.find_element_by_xpath('//*[@id="ratings-summary"]').click()
        start = time.time()
        reviews = load_all_reviews(number_of_reviews, start)
        print("number of reviews: ", len(reviews))
        for review in tqdm(reviews, position=0, leave=True):
            unique_ID = url.split("-")[-1].split(".")[0]
            rating = get_rating()  # Rating
            title = get_title()
            review_text = get_review()  # Review
            temp = []
            temp.append(unique_ID)  # id
            temp.append(title)  # title
            temp.append(review_text)  # review
            temp.append(rating)  # rating
            lines.append(temp)
    else:
        print("")

driver.quit()

Why is scrapy suddenly giving me an *unpredictable* AttributeError, stating no attribute 'css'

For my job, I built a scrapy spider to quickly check in on ~200-500 website landing pages for clues that the pages are not functioning, outside of just 400-style errors (e.g. checking for the presence of "out of stock" on the page). This check happens across approx. 30 different websites under my purview, all of them using the same page structure.
This has worked fine, every day, for 4 months.
Then, suddenly, and without change to the code, I started getting unpredictable errors, about 4 weeks ago:
url_title = response.css("title::text").extract_first()
AttributeError: 'Response' object has no attribute 'css'
If I run this spider, this error will occur with, say... 3 out of 400 pages.
Then, if immediately run the spider again, those same 3 pages are scraped just fine without error, and 4 totally different pages will return the same error.
Furthermore, if I run the EXACT same spider as below, but replace mapping with just these 7 erroneous landing pages, they are scraped perfectly fine.
Is there something in my code that's not quite right?
I'm going to attach the whole code (sorry in advance!) because I fear that something I might deem superfluous may in fact be the cause. So this is the whole thing, with sensitive data replaced with ####.
I've checked all of the affected pages, and of course the css is valid, and the title is always present.
I've done sudo apt-get update & sudo apt-get dist-upgrade on the server running scrapy, in hopes that this would help. No luck.
import scrapy
from scrapy import signals
from sqlalchemy.orm import sessionmaker
from datetime import date, datetime, timedelta
from scrapy.http.request import Request
from w3lib.url import safe_download_url
from sqlalchemy import and_, or_, not_
import smtplib
from email.MIMEMultipart import MIMEMultipart
from email.MIMEText import MIMEText
from sqlalchemy.engine import create_engine

engine = create_engine('mysql://######:#######localhost/LandingPages', pool_recycle=3600, echo=False)
#conn = engine.connect()
from LandingPageVerifier.models import LandingPagesFacebook, LandingPagesGoogle, LandingPagesSimplifi, LandingPagesScrapeLog, LandingPagesScrapeResults

Session = sessionmaker(bind=engine)
session = Session()

# today = datetime.now().strftime("%Y-%m-%d")
# thisyear = datetime.now().strftime("%Y")
# thismonth = datetime.now().strftime("%m")
# thisday = datetime.now().strftime("%d")
# start = date(year=2019,month=04,day=09)
todays_datetime = datetime(datetime.today().year, datetime.today().month, datetime.today().day)
print todays_datetime

landingpages_today_fb = session.query(LandingPagesFacebook).filter(LandingPagesFacebook.created_on >= todays_datetime).all()
landingpages_today_google = session.query(LandingPagesGoogle).filter(LandingPagesGoogle.created_on >= todays_datetime).all()
landingpages_today_simplifi = session.query(LandingPagesSimplifi).filter(LandingPagesSimplifi.created_on >= todays_datetime).all()
session.close()

#Mix 'em together!
landingpages_today = landingpages_today_fb + landingpages_today_google + landingpages_today_simplifi
#landingpages_today = landingpages_today_fb

#Do some iterating and formatting work
landingpages_today = [(u.ad_url_full, u.client_id) for u in landingpages_today]
#print landingpages_today
landingpages_today = list(set(landingpages_today))
#print 'Unique pages: '
#print landingpages_today
# unique_landingpages = [(u[0]) for u in landingpages_today]
# unique_landingpage_client = [(u[1]) for u in landingpages_today]
# print 'Pages----->', len(unique_landingpages)

class LandingPage004Spider(scrapy.Spider):
    name = 'LandingPage004Spider'

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(LandingPage004Spider, cls).from_crawler(crawler, *args, **kwargs)
        #crawler.signals.connect(spider.spider_opened, signals.spider_opened)
        crawler.signals.connect(spider.spider_closed, signals.spider_closed)
        return spider

    def spider_closed(self, spider):
        #stats = spider.crawler.stats.get_stats()
        stats = spider.crawler.stats.get_value('item_scraped_count'),
        Session = sessionmaker(bind=engine)
        session = Session()
        logitem = LandingPagesScrapeLog(
            scrape_count = spider.crawler.stats.get_value('item_scraped_count'),
            is200 = spider.crawler.stats.get_value('downloader/response_status_count/200'),
            is400 = spider.crawler.stats.get_value('downloader/response_status_count/400'),
            is403 = spider.crawler.stats.get_value('downloader/response_status_count/403'),
            is404 = spider.crawler.stats.get_value('downloader/response_status_count/404'),
            is500 = spider.crawler.stats.get_value('downloader/response_status_count/500'),
            scrapy_errors = spider.crawler.stats.get_value('log_count/ERROR'),
            scrapy_criticals = spider.crawler.stats.get_value('log_count/CRITICAL'),
        )
        session.add(logitem)
        session.commit()
        session.close()

    #mapping = landingpages_today
    handle_httpstatus_list = [200, 302, 404, 400, 500]
    start_urls = []

    def start_requests(self):
        for url, client_id in self.mapping:
            yield Request(url, callback=self.parse, meta={'client_id': client_id})

    def parse(self, response):
        ##DEBUG - return all scraped data
        #wholepage = response.body.lower()
        url = response.url
        if 'redirect_urls' in response.request.meta:
            redirecturl = response.request.meta['redirect_urls'][0]
            if 'utm.pag.ca' in redirecturl:
                url_shortener = response.request.meta['redirect_urls'][0]
            else:
                url_shortener = 'None'
        else:
            url_shortener = 'None'
        client_id = response.meta['client_id']
        url_title = response.css("title::text").extract_first()
        # pagesize = len(response.xpath('//*[not(descendant-or-self::script)]'))
        pagesize = len(response.body)
        HTTP_code = response.status

        ####ERROR CHECK: Small page size
        if 'instapage' in response.body.lower():
            if pagesize <= 20000:
                err_small = 1
            else:
                err_small = 0
        else:
            if pagesize <= 35000:
                err_small = 1
            else:
                err_small = 0

        ####ERROR CHECK: Page contains the phrase 'not found'
        if 'not found' in response.xpath('//*[not(descendant-or-self::script)]').extract_first().lower():
            # their sites are full of HTML errors, making scrapy unable to notice what is and is not inside a script element
            if 'dealerinspire' in response.body.lower():
                err_has_not_found = 0
            else:
                err_has_not_found = 1
        else:
            err_has_not_found = 0

        ####ERROR CHECK: Page contains the phrase 'can't be found'
        if "can't be found" in response.xpath('//*[not(self::script)]').extract_first().lower():
            err_has_cantbefound = 1
        else:
            err_has_cantbefound = 0

        ####ERROR CHECK: Page contains the phrase 'unable to locate'
        if 'unable to locate' in response.body.lower():
            err_has_unabletolocate = 1
        else:
            err_has_unabletolocate = 0

        ####ERROR CHECK: Page contains the phrase 'no longer available'
        if 'no longer available' in response.body.lower():
            err_has_nolongeravailable = 1
        else:
            err_has_nolongeravailable = 0

        ####ERROR CHECK: Page contains the phrase 'no service specials'
        if 'no service specials' in response.body.lower():
            err_has_noservicespecials = 1
        else:
            err_has_noservicespecials = 0

        ####ERROR CHECK: Page contains the phrase 'Sorry, no' to match zero inventory for a search, which normally says "Sorry, no items matching your request were found."
        if 'sorry, no ' in response.body.lower():
            err_has_sorryno = 1
        else:
            err_has_sorryno = 0

        yield {'client_id': client_id, 'url': url, 'url_shortener': url_shortener, 'url_title': url_title, "pagesize": pagesize, "HTTP_code": HTTP_code, "err_small": err_small, 'err_has_not_found': err_has_not_found, 'err_has_cantbefound': err_has_cantbefound, 'err_has_unabletolocate': err_has_unabletolocate, 'err_has_nolongeravailable': err_has_nolongeravailable, 'err_has_noservicespecials': err_has_noservicespecials, 'err_has_sorryno': err_has_sorryno}

#E-mail settings
def sendmail(recipients, subject, body):
    fromaddr = "#######"
    toaddr = recipients
    msg = MIMEMultipart()
    msg['From'] = fromaddr
    msg['Subject'] = subject
    body = body
    msg.attach(MIMEText(body, 'html'))
    server = smtplib.SMTP('########')
    server.starttls()
    server.login(fromaddr, "##########")
    text = msg.as_string()
    server.sendmail(fromaddr, recipients, text)
    server.quit()
The expected result is a perfect scrape, with no errors.
The actual results are unpredictable AttributeErrors, claiming that attribute 'css' can't be found on some pages. But if I scrape those pages individually, using the same script, they scrape just fine.
Sometimes Scrapy can't parse HTML because of markup errors; that's why you can't call response.css(). You can catch these events in your code and analyze the broken HTML:
def parse(self, response):
    try:
        ....
        your code
        .....
    except:
        with open("Error.htm", "w") as f:
            f.write(response.body)
UPDATE: You can also try checking for an empty response:
def parse(self, response):
    if not response.body:
        yield scrapy.Request(url=response.url, callback=self.parse, meta={'client_id': response.meta["client_id"]})
    # your original code
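One caveat: by default Scrapy's duplicate filter drops requests for URLs it has already seen, so a retry of the same URL like the one above is likely to be silently discarded unless dont_filter=True is passed. A minimal variation of the snippet, under that assumption:
def parse(self, response):
    if not response.body:
        # dont_filter=True stops Scrapy's duplicate filter from discarding the retry,
        # since this URL has already been requested once
        yield scrapy.Request(
            url=response.url,
            callback=self.parse,
            meta={'client_id': response.meta['client_id']},
            dont_filter=True,
        )
        return
    # your original code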

How to get ASINs XPATH from 2 different Amazon pages that have the same parent nodes?

I made a web scraping program using Python and webdriver and I want to extract the ASIN from 2 different pages. I would like the XPath to work for these 2 links at the same time.
These are the Amazon pages: https://www.amazon.com/Hydro-Flask-Wide-Mouth-Flip/dp/B01ACATW7E/ref=sr_1_3?s=kitchen&ie=UTF8&qid=1520348607&sr=1-3&keywords=-gfds and
https://www.amazon.com/Ubbi-Saving-Special-Required-Locking/dp/B00821FLSU/ref=sr_1_1?s=baby-products&ie=UTF8&qid=1520265799&sr=1-1&keywords=-hgfd&th=1. They have the same parent nodes (id, classes). How can I make this program work for both links at the same time?
So the problem is on these lines of code: 36, 41
asin = driver.find_element_by_xpath('//div[@id="detail-bullets"]/table/tbody/tr/td/div/ul/li[4]').text
and
asin = driver.find_element_by_xpath('//div[@id="detail-bullets_feature_div"]/div[@id="detail-bullets"]/table/tbody/tr/td/div/ul/li[5]').text. I have to change these lines so that the CSV contains the ASINs for both products. For the first link it prints the wrong information, and for the second it prints the ASIN.
I attached the code. I will appreciate any help.
from selenium import webdriver
import csv
import io

# set the proxies to hide actual IP
proxies = {
    'http': 'http://5.189.133.231:80',
    'https': 'https://27.111.43.178:8080'
}
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--proxy-server="%s"' % ';'.join(['%s=%s' % (k, v) for k, v in proxies.items()]))
driver = webdriver.Chrome(executable_path="C:\\Users\Andrei-PC\Downloads\webdriver\chromedriver.exe",
                          chrome_options=chrome_options)
header = ['Product title', 'ASIN']
with open('csv/bot_1.csv', "w") as output:
    writer = csv.writer(output)
    writer.writerow(header)

links = ['https://www.amazon.com/Hydro-Flask-Wide-Mouth-Flip/dp/B01ACATW7E/ref=sr_1_3?s=kitchen&ie=UTF8&qid=1520348607&sr=1-3&keywords=-gfds',
         'https://www.amazon.com/Ubbi-Saving-Special-Required-Locking/dp/B00821FLSU/ref=sr_1_1?s=baby-products&ie=UTF8&qid=1520265799&sr=1-1&keywords=-hgfd&th=1'
         ]
for i in range(len(links)):
    driver.get(links[i])
    product_title = driver.find_elements_by_xpath('//*[@id="productTitle"][1]')
    prod_title = [x.text for x in product_title]
    try:
        asin = driver.find_element_by_xpath('//div[@id="detail-bullets"]/table/tbody/tr/td/div/ul/li[4]').text
    except:
        print('no ASIN template one')
    try:
        asin = driver.find_element_by_xpath('//div[@id="detail-bullets_feature_div"]/div[@id="detail-bullets"]/table/tbody/tr/td/div/ul/li[5]').text
    except:
        print('no ASIN template two')
    try:
        data = [prod_title[0], asin]
    except:
        print('no items v3')
    with io.open('csv/bot_1.csv', "a", newline="", encoding="utf-8") as output:
        writer = csv.writer(output)
        writer.writerow(data)
You can simply use
//li[b="ASIN:"]
to get the required element on both pages.
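Dropped into the loop from the question, that could look something like the sketch below; it assumes the same driver and links setup and the older find_element_by_xpath API used above, and the 'ASIN not found' fallback is just a placeholder.
for link in links:
    driver.get(link)
    titles = driver.find_elements_by_xpath('//*[@id="productTitle"]')
    try:
        # //li[b="ASIN:"] selects the <li> whose <b> child reads "ASIN:",
        # which matches both detail-bullets layouts
        asin = driver.find_element_by_xpath('//li[b="ASIN:"]').text.replace('ASIN:', '').strip()
    except Exception:
        asin = 'ASIN not found'
    print(titles[0].text if titles else 'No title', asin)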

requests + bs4 no results from pages

Here is code that can get info from https://www.gabar.org/membersearchresults.cfm
but cannot from https://www.gabar.org/membersearchresults.cfm?start=1&id=70FFBD1B-9C8E-9913-79DBB8B989DED6C1
from bs4 import BeautifulSoup
import requests
import traceback

links_to_visit = []
navigation_links = []  # for testing next button
base_url = 'https://www.gabar.org'

def make_soup(link):
    r = requests.get(link)
    soup = BeautifulSoup(r.content, 'html.parser')
    return soup

def all_results(url):
    global links_to_visit
    global navigation_links
    soup = make_soup(url)
    print(soup)
    div = soup.find('div', {'class': 'cs_control'})
    links = div.find_all('a')
    print(links)
    for link in links:
        try:
            if link.text == 'Next':  # prev, next, new search
                navigation_links.append(link)
                print('got it')
            elif not '/MemberSearchDetail.cfm?ID=' in link.get('href'):
                pass  # I dont need that link
            else:
                links_to_visit.append(link)
        except:
            traceback.print_exc()
    print(len(links_to_visit))
    print(links_to_visit)
    #print(links_to_visit[-1].get('href'))

def start():
    flag = 1
    page = 1
    while page < 60716:
        flag = 0
        if navigation_links[-1].text == 'Next':
            flag = 1
            next_link = navigation_links[-1]
            #print(next_link.get('href'))
            page += 25
            print(base_url + next_link.get('href'))
            all_results(base_url + next_link.get('href'))
            print('page is:', page)

if __name__ == '__main__':
    all_results('https://www.gabar.org/membersearchresults.cfm')
    start()
What do I need to understand or do if I want to get the full results?
What you need to understand is that there is more to an HTTP request than just the URL. In this case, a search result is only available to the session that executed the search, and it can therefore only be paged through if you are the "owner" of that session. Most websites identify a session using session cookies that you need to send along with your HTTP request.
This can be a huge hassle, but luckily Python's requests takes care of all of that for you with requests.session. Instead of using requests.get(url), you initialize the session with session = requests.session() and then use that session in subsequent requests: session.get(url). This will automagically preserve cookies for you and in many ways behave like an actual browser would.
You can read more about how requests.session works here.
And last but not least, your fixed code =)
from bs4 import BeautifulSoup
import requests
import traceback

links_to_visit = []
navigation_links = []  # for testing next button

# we initialize the session here
session = requests.session()

base_url = 'https://www.gabar.org'

def make_soup(link):
    # r = requests.get(link)
    # we use the session here in order to preserve cookies across requests
    r = session.get(link)
    soup = BeautifulSoup(r.content, 'html.parser')
    return soup

def all_results(url):
    # globals are almost never needed or recommended, and certainly not here;
    # you can just leave these out
    # global links_to_visit
    # global navigation_links
    soup = make_soup(url)
    print(soup)
    div = soup.find('div', {'class': 'cs_control'})
    links = div.find_all('a')
    print(links)
    for link in links:
        try:
            if link.text == 'Next':  # prev, next, new search
                navigation_links.append(link)
                print('got it')
            elif not '/MemberSearchDetail.cfm?ID=' in link.get('href'):
                pass  # I dont need that link
            else:
                links_to_visit.append(link)
        except:
            traceback.print_exc()
    print(len(links_to_visit))
    print(links_to_visit)
    #print(links_to_visit[-1].get('href'))

def start():
    flag = 1
    page = 1
    while page < 60716:
        flag = 0
        if navigation_links[-1].text == 'Next':
            flag = 1
            next_link = navigation_links[-1]
            #print(next_link.get('href'))
            page += 25
            print(base_url + next_link.get('href'))
            all_results(base_url + next_link.get('href'))
            print('page is:', page)

if __name__ == '__main__':
    all_results('https://www.gabar.org/membersearchresults.cfm')
    start()