There is a problem; let me explain what the script does.
First, it opens the main page; then a for loop goes to page 1, scrolls, and parses the links to the photos (it doesn't parse them without scrolling). Then it goes to page 2, and that's where the problem is.
It scrolls to the bottom in a second, and I need it to scroll gradually, like on the first page.
Here is the code:
driver.get(url=url)
n = 0
urla = driver.find_element(By.CLASS_NAME, "ipsPagination_pageJump").text
for page_number in range(int(urla.split()[3])):
    page_number = page_number + 1
    driver.get(url=url + f"page/{page_number}")
    time.sleep(2)
    imgs = driver.find_elements(By.CLASS_NAME, "cGalleryPatchwork_image")
    for i in imgs:
        n = n + 500
        driver.execute_script(f"window.scrollTo(0, {n})")
        time.sleep(0.2)
        print(i.get_attribute("src"))
driver.quit()
I know the code is very bad and not optimized.
To scroll gradually, element by element, use the following execute_script command: driver.execute_script("arguments[0].scrollIntoView(true);", i)
Code:
for page_number in range(int(urla.split()[3])):
    page_number = page_number + 1
    driver.get(url=url + f"page/{page_number}")
    time.sleep(2)
    imgs = driver.find_elements(By.CLASS_NAME, "cGalleryPatchwork_image")
    for i in imgs:
        #n = n + 500
        #driver.execute_script(f"window.scrollTo(0, {n})")
        driver.execute_script("arguments[0].scrollIntoView(true);", i)
        time.sleep(0.2)
        print(i.get_attribute("src"))
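If the fixed pauses still feel too jumpy, another option is to wait explicitly for the gallery images and ask the browser for a smooth, centered scroll. This is only a sketch of that idea, assuming the same class names and page-count element as in the question; it is not the poster's original code.
Code:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

for page_number in range(1, int(urla.split()[3]) + 1):
    driver.get(url + f"page/{page_number}")
    # wait until the gallery images are present instead of a fixed time.sleep(2)
    WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located((By.CLASS_NAME, "cGalleryPatchwork_image"))
    )
    for img in driver.find_elements(By.CLASS_NAME, "cGalleryPatchwork_image"):
        # smooth, centered scroll so lazy-loaded images have time to appear
        driver.execute_script(
            "arguments[0].scrollIntoView({behavior: 'smooth', block: 'center'});", img
        )
        time.sleep(0.2)
        print(img.get_attribute("src"))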
I'm having some issues with scraping fish images off a website.
species_with_foto = ['/fangster/aborre-perca-fluviatilis/1',
                     '/fangster/almindelig-tangnaal-syngnathus-typhle/155',
                     '/fangster/ansjos-engraulis-encrasicholus/66',
                     '/fangster/atlantisk-tun-blaafinnet-tun-thunnus-thynnus-/137']
titles = []
species = []
for x in species_with_foto:
    specie_page = 'https://www.fiskefoto.dk' + x
    driver.get(specie_page)
    content = driver.page_source
    soup = BeautifulSoup(content)
    brutto = soup.find_all('img', attrs={'class':'rapportBillede'})
    species.append(brutto)
    #print(brutto)
    titles.append(x)
    try:
        driver.find_element(by=By.XPATH, value='/html/body/form/div[4]/div[1]/div/div[13]/div[2]/div/div').click()
        print('Clicked next', x)
    except NoSuchElementException:
        print('Successfully finished - :', x)
    time.sleep(2)
This returns a list of lists, with each sublist looking like this:
[<img alt="Aborre (Perca fluviatilis) aborrefiskeri, striber, rygfinne, regnorm, majs, spinner, " class="rapportBillede" src="/images/400/aborre-perca-fluviatilis-medefiskeri-bundrig-0,220kg-24cm-striber-rygfinne-regnorm-majs-spinner-358-22-29-14-740-2013-21-4.jpg" style="width:50%;"/>,
<img alt="Aborre (Perca fluviatilis) aborrefiskeri, striber, rygfinne, regnorm, majs, spinner, " class="rapportBillede" src="/images/400/aborre-perca-fluviatilis-medefiskeri-prop-flaad-med-levende-skalle-paa-enkeltkrog-1.6kg-46cm-6604-1724617.jpg" style="width:calc(50% - 6px);margin-bottom:7px;"/>]
How can I clean up the list and keep only the src="/images/400/aborre-perca-fluviatilis-medefiskeri-prop-flaad-med-levende-skalle-paa-enkeltkrog-1.6kg-46cm-6604-1724617.jpg" part? I have tried other arguments in soup.find_all but can't get it to work.
(The Selenium part is also not functioning properly, but I'll get to that later.)
EDIT:
This is my code now, and I'm really getting close :) One issue is that my photos are now saved in a flat list instead of a list of lists, and for the love of god I don't understand why this happens.
Help to fix and understand this would be greatly appreciated!
titles = []
fish_photos = []
for x in species_with_foto_mini:
    site = "https://www.fiskefoto.dk/" + x
    html = urlopen(site)
    bs = BeautifulSoup(html, 'html.parser')
    titles.append(x)
    try:
        images = bs.find_all('img', attrs={'class':'rapportBillede'})
        for img in images:
            if img.has_attr('src'):
                #print(img['src'])
                a = (img['src'])
                fish_photos.append(a)
    except KeyError:
        print('No src')
    # navigate pages
    try:
        driver.find_element(by=By.XPATH, value='/html/body/form/div[4]/div[1]/div/div[13]/div[2]/div/div').click()
        print('Clicked next', x)
    except NoSuchElementException:
        print('Successfully finished -', x)
    time.sleep(2)
EDIT:
I need the end result to be a list of lists looking something like this:
fish_photos = [
    ['/images/400/aborre-perca-fluviatilis-medefiskeri-bundrig-0,220kg-24cm-striber-rygfinne-regnorm-majs-spinner-358-22-29-14-740-2013-21-4.jpg',
     '/images/400/aborre-perca-fluviatilis-medefiskeri-prop-flaad-med-levende-skalle-paa-enkeltkrog-1.6kg-46cm-6604-1724617.jpg'],
    ['/images/400/tungehvarre-arnoglossus-laterna-medefiskeri-6650-2523403.jpg',
     '/images/400/ulk-myoxocephalus-scorpius-medefiskeri-bundrig-koebenhavner-koebenhavner-torsk-mole-sild-boersteorm-pigge-351-18-48-9-680-2013-6-4.jpg'],
    ['/images/400/graeskarpe-ctenopharyngodon-idella-medefiskeri-bobleflaad-med-toastbroed-paa-enkeltkrog-5.02kg-77cm-6436-7486.jpg',
     '/images/400/graeskarpe-ctenopharyngodon-idella-medefiskeri-bobleflaad-med-toastbroed-paa-enkeltkrog-10.38kg-96cm-6337-4823146.jpg']
]
EDIT:
My output is now a list of identical lists. I need every species in its own list, like this: fish_photo_list = [[trout1, trout2, trout3], [other_fish1, other_fish2, other_fish3], [salmon1, salmon2]]
My initial code did this, but it no longer does.
Here is an example you can adapt:
from urllib.request import urlopen
from bs4 import BeautifulSoup

site = "[insert name of the site]"
html = urlopen(site)
bs = BeautifulSoup(html, 'html.parser')
try:
    images = bs.find_all('img')
    for img in images:
        if img.has_attr('src'):
            print(img['src'])
except KeyError:
    print('No src')
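For the EDIT part of the question (getting a list of lists, one sublist per species), the key is to collect the src values for one species into a fresh temporary list inside the loop and only then append that list to fish_photos. A minimal sketch of that, reusing the species_with_foto paths and the rapportBillede class from the question:
Code:
from urllib.request import urlopen
from bs4 import BeautifulSoup

species_with_foto = ['/fangster/aborre-perca-fluviatilis/1',
                     '/fangster/ansjos-engraulis-encrasicholus/66']

fish_photos = []
for x in species_with_foto:
    html = urlopen('https://www.fiskefoto.dk' + x)
    bs = BeautifulSoup(html, 'html.parser')
    # one fresh list per species, so the result stays nested
    photos_for_this_species = [img['src']
                               for img in bs.find_all('img', attrs={'class': 'rapportBillede'})
                               if img.has_attr('src')]
    fish_photos.append(photos_for_this_species)

print(fish_photos)  # [[...aborre urls...], [...ansjos urls...]]
Appending the per-species list (instead of the individual src strings) is what keeps the result nested.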
So I am trying to scrape job posts from Glassdoor using Requests, Beautiful Soup and Selenium. The entire code works, except that even after scraping data from 30 pages, most entries turn out to be duplicates (almost 80% of them!). It's not a headless scraper, so I know it is going to each new page. What could be the reason for so many duplicate entries? Could it be some sort of anti-scraping tool used by Glassdoor, or is something off in my code?
The result turns out to be 870 entries, of which a whopping 690 are duplicates!
My code:
def glassdoor_scraper(url):
    driver = webdriver.Chrome()
    driver.get(url)
    time.sleep(10)
    # Getting to the page where we want to start scraping
    jobs_search_title = driver.find_element(By.ID, 'KeywordSearch')
    jobs_search_title.send_keys('Data Analyst')
    jobs_search_location = driver.find_element(By.ID, 'LocationSearch')
    time.sleep(1)
    jobs_search_location.clear()
    jobs_search_location.send_keys('United States')
    click_search = driver.find_element(By.ID, 'HeroSearchButton')
    click_search.click()
    for page_num in range(1, 10):
        time.sleep(10)
        res = requests.get(driver.current_url)
        soup = BeautifulSoup(res.text, 'html.parser')
        time.sleep(2)
        companies = soup.select('.css-l2wjgv.e1n63ojh0.jobLink')
        for company in companies:
            companies_list.append(company.text)
        positions = soup.select('.jobLink.css-1rd3saf.eigr9kq2')
        for position in positions:
            positions_list.append(position.text)
        locations = soup.select('.css-l2fjlt.pr-xxsm.css-iii9i8.e1rrn5ka0')
        for location in locations:
            locations_list.append(location.text)
        job_post = soup.select('.eigr9kq3')
        for job in job_post:
            salary_info = job.select('.e1wijj242')
            if len(salary_info) > 0:
                for salary in salary_info:
                    salaries_list.append(salary.text)
            else:
                salaries_list.append('Salary Not Found')
        ratings = soup.select('.e1rrn5ka3')
        for index, rating in enumerate(ratings):
            if len(rating.text) > 0:
                ratings_list.append(rating.text)
            else:
                ratings_list.append('Rating Not Found')
        next_page = driver.find_elements(By.CLASS_NAME, 'e13qs2073')[1]
        next_page.click()
        time.sleep(5)
        try:
            close_jobalert_popup = driver.find_element(By.CLASS_NAME, 'modal_closeIcon')
        except:
            pass
        else:
            time.sleep(1)
            close_jobalert_popup.click()
            continue
    #driver.close()
    print(f'{len(companies_list)} jobs found for you!')
    global glassdoor_dataset
    glassdoor_dataset = pd.DataFrame(
        {'Company Name': companies_list,
         'Company Rating': ratings_list,
         'Position Title': positions_list,
         'Location': locations_list,
         'Est. Salary': salaries_list
         })
    glassdoor_dataset.to_csv(r'glassdoor_jobs_scraped.csv')
You're going way too fast; you need to put in some waits.
I see you have put in implicit waits. Try putting explicit waits instead (use your own conditions; you can also wait for an element to become invisible and then visible again, to make sure you are on the next page). If that doesn't help, increase your time.sleep().
Something like this:
WebDriverWait(driver, 40).until(expected_conditions.visibility_of_element_located(
    (By.XPATH, '//*[@id="wrapper"]/section/div/div/div[2]/button[2]')))
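For completeness, an explicit wait also needs these imports; the locator below is just the one from the snippet above and should be swapped for whatever element reliably appears once the next results page has loaded:
Code:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions

# after clicking next_page, block until the new page has rendered
WebDriverWait(driver, 40).until(
    expected_conditions.visibility_of_element_located(
        (By.XPATH, '//*[@id="wrapper"]/section/div/div/div[2]/button[2]')
    )
)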
I don't think the repetition is due to a code issue - I think Glassdoor just starts cycling results after a while. [If interested, see this gist for some stats - basically, from the 7th page or so, most of the 1st-page results seem to be shown on every page onwards. I did a small manual test with only 5 listings, tracked by id, and even directly in an un-automated browser they started repeating after a while.]
My suggestion would be to filter them out before looping to the next page - there's a data-id attribute on each li wrapped around the listings, which seems to be a unique identifier. If we add that to the other columns' lists, we can collect only listings that haven't been collected yet; just edit the for page_num loop to:
for page_num in range(1, 10):
    time.sleep(10)
    scrapedUrls.append(driver.current_url)
    res = requests.get(driver.current_url)
    soup = BeautifulSoup(res.text, 'html.parser')
    # soup = BeautifulSoup(driver.page_source, 'html.parser')  # no noticeable improvement
    time.sleep(2)
    filteredListings = [
        di for di in soup.select('li[data-id]') if
        di.get('data-id') not in datId_list
    ]
    datId_list += [di.get('data-id') for di in filteredListings]
    companies_list += [
        t.select_one('.css-l2wjgv.e1n63ojh0.jobLink').get_text(strip=True)
        if t.select_one('.css-l2wjgv.e1n63ojh0.jobLink')
        else None for t in filteredListings
    ]
    positions_list += [
        t.select_one('.jobLink.css-1rd3saf.eigr9kq2').get_text(strip=True)
        if t.select_one('.jobLink.css-1rd3saf.eigr9kq2')
        else None for t in filteredListings
    ]
    locations_list += [
        t.select_one(
            '.css-l2fjlt.pr-xxsm.css-iii9i8.e1rrn5ka0').get_text(strip=True)
        if t.select_one('.css-l2fjlt.pr-xxsm.css-iii9i8.e1rrn5ka0')
        else None for t in filteredListings
    ]
    job_post = [
        t.select('.eigr9kq3 .e1wijj242') for t in filteredListings
    ]
    salaries_list += [
        'Salary Not Found' if not j else
        (j[0].text if len(j) == 1 else [s.text for s in j])
        for j in job_post
    ]
    ratings_list += [
        t.select_one('.e1rrn5ka3').get_text(strip=True)
        if t.select_one('.e1rrn5ka3')
        else 'Rating Not Found' for t in filteredListings
    ]
And, if you add datId_list to the dataframe, it can serve as a meaningful index:
dfDict = {'Data-Id': datId_list,
          'Company Name': companies_list,
          'Company Rating': ratings_list,
          'Position Title': positions_list,
          'Location': locations_list,
          'Est. Salary': salaries_list
          }
for k in dfDict:
    print(k, len(dfDict[k]))
glassdoor_dataset = pd.DataFrame(dfDict)
glassdoor_dataset = glassdoor_dataset.set_index('Data-Id', drop=True)  # set_index returns a new frame
glassdoor_dataset.to_csv(r'glassdoor_jobs_scraped.csv')
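As an extra safety net (not part of the original answer), any duplicates that still slip through can be dropped at the dataframe stage, since Data-Id is unique per listing:
Code:
# keep only the first occurrence of each listing id; run this before set_index,
# while 'Data-Id' is still a regular column
glassdoor_dataset = glassdoor_dataset.drop_duplicates(subset='Data-Id', keep='first')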
My code goes to a webpage and clicks on a record, which then expands to show other records.
Is there a way to use XPath to pull all of these drop-down titles?
Currently, I copied the full XPath of the first drop-down title, and it is only pulling the first one.
That is fine, but how do I pull all of the entry titles that drop down?
My current code only handles the first line:
from selenium import webdriver
import time

driver = webdriver.Chrome()
for x in range(1, 2):
    driver.get(f'https://library.iaslc.org/conference-program?product_id=24&author=&category=&date=&session_type=&session=&presentation=&keyword=&available=&cme=&page={x}')
    time.sleep(4)
    productlist_length = len(driver.find_elements_by_xpath("//div[@class='accordin_title']"))
    for i in range(1, productlist_length + 1):
        product = driver.find_element_by_xpath("(//div[@class='accordin_title'])[" + str(i) + "]")
        title = product.find_element_by_xpath('.//h4').text.strip()
        buttonToClick = product.find_element_by_xpath('.//div[@class="sign"]')
        buttonToClick.click()
        time.sleep(5)
        dropDownTitle = product.find_element_by_xpath('//*[@id="accordin"]/div/ul/li[1]/div[2]/div/ul/li/div[1]/div[3]/h4').text  # this line is the full XPath
        print(dropDownTitle)
Can you check with the below code?
# Try executing it in maximized mode; sometimes the element is overlaid
driver.maximize_window()
for x in range(1, 5):
    driver.get(f'https://library.iaslc.org/conference-program?product_id=24&author=&category=&date=&session_type=&session=&presentation=&keyword=&available=&cme=&page={x}')
    time.sleep(4)
    productlist_length = len(driver.find_elements_by_xpath("//div[@class='accordin_title']"))
    for i in range(1, productlist_length + 1):
        product = driver.find_element_by_xpath("(//div[@class='accordin_title'])[" + str(i) + "]")
        title = product.find_element_by_xpath('.//h4').text.strip()
        # At some point the dropdown doesn't display any records and throws
        # ElementClickInterceptedException, so ActionChains is used to move to the element first
        buttonToClick = product.find_element_by_xpath('.//*[@class="info_right"]/h4')
        action = ActionChains(driver)
        action.move_to_element(buttonToClick).click().perform()
        time.sleep(5)
        # Here, if you just provide the index of the li, it will print that row's title
        dropDownTitle = product.find_element_by_xpath("//*[@id='accordin']/div/ul/li[" + str(i) + "]/div[1]/div[3]/h4").text
        print(dropDownTitle)
Required imports:
from time import sleep
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.action_chains import ActionChains
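A small optional tweak, not from the original answer: the fixed time.sleep(5) after the click can be replaced with an explicit wait for the expanded row, which is usually faster and more reliable. A sketch, assuming the same li[i] structure as above:
Code:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 seconds for the expanded drop-down title to become visible
dropDownTitle = WebDriverWait(driver, 10).until(
    EC.visibility_of_element_located(
        (By.XPATH, "//*[@id='accordin']/div/ul/li[" + str(i) + "]/div[1]/div[3]/h4")
    )
).text
print(dropDownTitle)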
I currently use Requests-HTML version 0.10.0 and Selenium 3.141.0. My project is to scrape the ratings of all articles on this website: https://openreview.net/group?id=ICLR.cc/2021/Conference. To open each page of the website (it has 53 pages, and each page has 50 articles), I use Selenium. Then, to open the articles on each page, I use Requests-HTML. My question is how to reduce the time it takes to open each article and get its rating. I currently use await r_inside.html.arender(sleep=5, timeout=100), i.e. a sleep time of 5 seconds and a timeout of 100 seconds. When I try to reduce the sleep time to 0.5 seconds, it causes an error because there is not enough time to render the page. However, if I keep the sleep time at 5 seconds, it takes 6 to 13 hours to scrape all 2600 articles. Also, after waiting 13 hours I can scrape all 2600 articles, but the code uses 88 GB of RAM, which I want to avoid because I need to send this code to other people who will not have enough RAM to run it. My goal is to reduce both the scraping time and the RAM usage. Below is the code I use.
import csv
link = 'https://openreview.net/group?id=ICLR.cc/2021/Conference'
from requests_html import HTMLSession, AsyncHTMLSession
import time
from tqdm import tqdm
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
id_list = []
keyword_list = []
abstract_list = []
title_list = []
driver = webdriver.Chrome('./requests_html/chromedriver.exe')
driver.get('https://openreview.net/group?id=ICLR.cc/2021/Conference')
cond = EC.presence_of_element_located((By.XPATH, '//*[@id="all-submissions"]/nav/ul/li[13]/a'))
WebDriverWait(driver, 10).until(cond)
for page in tqdm(range(1, 54)):
    text = ''
    elems = driver.find_elements_by_xpath('//*[@id="all-submissions"]/ul/li')
    for i, elem in enumerate(elems):
        try:
            # parse title
            title = elem.find_element_by_xpath('./h4/a[1]')
            link = title.get_attribute('href')
            paper_id = link.split('=')[-1]
            title = title.text.strip().replace('\t', ' ').replace('\n', ' ')
            # show details
            elem.find_element_by_xpath('./a').click()
            time.sleep(0.2)
            # parse keywords & abstract
            items = elem.find_elements_by_xpath('.//li')
            keyword = ''.join([x.text for x in items if 'Keywords' in x.text])
            abstract = ''.join([x.text for x in items if 'Abstract' in x.text])
            keyword = keyword.strip().replace('\t', ' ').replace('\n', ' ').replace('Keywords: ', '')
            abstract = abstract.strip().replace('\t', ' ').replace('\n', ' ').replace('Abstract: ', '')
            text += paper_id + '\t' + title + '\t' + link + '\t' + keyword + '\t' + abstract + '\n'
            title_list.append(title)
            id_list.append(paper_id)
            keyword_list.append(keyword)
            abstract_list.append(abstract)
        except Exception as e:
            print(f'page {page}, # {i}:', e)
            continue
    # next page
    try:
        driver.find_element_by_xpath('//*[@id="all-submissions"]/nav/ul/li[13]/a').click()
        time.sleep(2)  # NOTE: increase sleep time if needed
    except:
        print('no next page, exit.')
        break
csv_file = open('./requests_html/bb_website_scrap.csv','w', encoding="utf-8")
csv_writer = csv.writer(csv_file)
csv_writer.writerow(['Title','Keyword','Abstract','Link','Total Number of Reviews','Average Rating','Average Confidence'])
n = 0
for item in range(len(id_list)):
    title = title_list[item]
    keyword = keyword_list[item]
    abstract = abstract_list[item]
    id = id_list[item]
    link_pdf = f'https://openreview.net/forum?id={id}'
    print(id)
    asession_inside = AsyncHTMLSession()
    r_inside = await asession_inside.get(link_pdf)
    print(type(r_inside))
    await r_inside.html.arender(sleep=5, timeout=100)
    test_rating = r_inside.html.find('div.comment-level-odd div.note_contents span.note_content_value')
    print(len(test_rating))
    check_list = {'0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10'}
    total_rating_confidence = []
    total_rating = []
    total_confidence = []
    for t in range(len(test_rating)):
        if any(test_rating[t].text.split(':')[0] in s for s in check_list):
            total_rating_confidence.append(test_rating[t].text.split(':')[0])
    for r in range(len(total_rating_confidence)):
        if (r % 2 == 0):
            total_rating.append(int(total_rating_confidence[r]))
        else:
            total_confidence.append(int(total_rating_confidence[r]))
    average_rating = sum(total_rating) / len(total_rating)
    average_confidence = sum(total_confidence) / len(total_confidence)
    csv_writer.writerow([title, keyword, abstract, link_pdf, len(total_rating), average_rating, average_confidence])
    n = n + 1
    print('Order {}', n)
csv_file.close()
I'm no Python expert (in fact, I'm a rank beginner), but the simple answer is better parallelism and session management.
The useful answer is a bit more complicated.
You're leaving the Chromium session around, which is likely what's hoovering up all your RAM. If you call asession_inside.close(), you may see an improvement in RAM usage.
As far as I can tell, you're doing everything serially: you fetch each page and extract the article data in serial, and then you query each article in serial as well.
You're using arender to fetch each article asynchronously, but you're awaiting it and using a standard for loop. As far as I understand, that means you're not getting any advantage from async; you're still processing each article one at a time (which explains your long run time).
I'd suggest using asyncio to turn the for loop into a parallel version of itself, as suggested in this article. Make sure you set a task limit so that you don't try to load all the articles at once; that will also help with your RAM usage.
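The details depend on how requests_html's renderer behaves in your environment, so treat this as a rough sketch of the idea rather than a drop-in fix: one shared AsyncHTMLSession, an asyncio.Semaphore as the task limit, and asyncio.gather to run a bounded number of article fetches concurrently.
Code:
import asyncio
from requests_html import AsyncHTMLSession

MAX_CONCURRENT = 5  # task limit: tune to your RAM and CPU

async def fetch_ratings(asession, semaphore, paper_id):
    url = f'https://openreview.net/forum?id={paper_id}'
    async with semaphore:  # at most MAX_CONCURRENT pages rendering at once
        r = await asession.get(url)
        await r.html.arender(sleep=5, timeout=100)
    elements = r.html.find('div.comment-level-odd div.note_contents span.note_content_value')
    return paper_id, elements

async def scrape_all(id_list):
    asession = AsyncHTMLSession()  # one shared session instead of one per article
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    try:
        results = await asyncio.gather(
            *(fetch_ratings(asession, semaphore, pid) for pid in id_list))
    finally:
        await asession.close()  # free the headless Chromium session and its RAM
    return results
Run it with results = asyncio.run(scrape_all(id_list)) from a plain script (or results = await scrape_all(id_list) in a notebook); each (paper_id, elements) pair can then go through the same rating/confidence parsing as the original loop.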
The actual error is produced by a much larger program I've been writing, but the following sample reproduces the error:
import wx

class MyLine(wx.Frame):
    def __init__(self):
        self.thickness = 1
        self.length = 10
        self.spin_ctrl = []
        super(MyLine, self).__init__(None)
        self.SetBackgroundColour(wx.ColourDatabase().Find("GREY"))
        vbox = wx.BoxSizer(wx.VERTICAL)
        # Length section
        self.spin_ctrl.append(wx.SpinCtrl(self, initial=self.length, min=1, max=100))
        vbox.Add(self.spin_ctrl[-1], 0, wx.ALL | wx.ALIGN_CENTER, 5)
        # Thickness section
        self.spin_ctrl.append(wx.SpinCtrl(self, initial=self.thickness, min=1, max=10))
        vbox.Add(self.spin_ctrl[-1], 0, wx.ALL | wx.ALIGN_CENTER, 5)
        self.SetSizerAndFit(vbox)
        self.Show()

app = wx.App()
fr = MyLine()
app.MainLoop()
When the above is run, a window appears with two SpinCtrl buttons. If I click on the first one to change its value and then close the window, everything works fine and there are no error messages. When I click on the second button to change its value and then close the window, the following error appears:
Pango-CRITICAL **: pango_layout_get_cursor_pos: assertion 'index >= 0 && index <= layout->length' failed
Is this a bug, or am I not using SpinCtrl buttons correctly? I am running wxPython 4.0.3.