Selenium/BeautifulSoup - Web scrape this field

My code runs fine and prints the title for all rows except the rows with dropdowns.
For example, row 4 has a dropdown if clicked. I implemented a 'try' which should, in theory, click the dropdown and then pull the titles.
But when I execute click() and try to print, the titles for the rows with these dropdowns are not printing.
Expected output: print all titles, including the ones in the dropdowns.
A user has submitted an answer on this link StackOverFlowAnswer, but the format of his answer was different, and with his approach I do not know how to add fields such as date, time, chairs, or the field at the top which says "On demand".
Any approach would be appreciated; I would like to put the results into a dataframe. Thanks.
from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.common.action_chains import ActionChains
import time

driver = webdriver.Chrome()
actions = ActionChains(driver)
driver.get('https://cslide.ctimeetingtech.com/esmo2021/attendee/confcal/session/list')
time.sleep(4)
page_source = driver.page_source
soup = BeautifulSoup(page_source, 'html.parser')
new_titles = set()
productlist = driver.find_elements_by_xpath("//div[@class='card item-container session']")
for property in productlist:
    actions.move_to_element_with_offset(property, 0, 0).perform()
    time.sleep(4.5)
    sessiontitle = property.find_element_by_xpath(".//h4[@class='session-title card-title']").text
    #print(sessiontitle)
    ifDropdown = property.find_elements_by_xpath(".//*[@class='item-expand-action expand']")
    if ifDropdown:
        ifDropdown[0].click()
        time.sleep(4)
        open_titles = driver.find_elements_by_class_name('card-title')
        for open_title in open_titles:
            title = open_title.text
            if title not in new_titles:
                print(title)
                new_titles.add(title)

Your problem is with the driver.find_elements_by_class_name('item-expand-action expand') command. That locator is wrong: find_elements_by_class_name accepts a single class name, and those web elements have multiple class names. To locate these elements you can use a css_selector or an XPath.
Also, since there are several elements with dropdowns, you should iterate over them to perform the clicks. You cannot perform .click() on a list of web elements.
So your code should be like this:
ifDropdown = driver.find_elements_by_css_selector('.item-expand-action.expand')
for drop_down in ifDropdown:
    drop_down.click()
    time.sleep(0.5)
Alternatively to the css_selector above, you can use an XPath:
ifDropdown = driver.find_elements_by_xpath('//a[@class="item-expand-action expand"]')
UPD
If you wish to print the newly added titles, you can do this:
ifDropdown = driver.find_elements_by_css_selector('.item-expand-action.expand')
for drop_down in ifDropdown:
    drop_down.click()
    time.sleep(0.5)
newTitles = driver.find_elements_by_class_name('card-title')
for new_title in newTitles:
    print(new_title.text)
Here, after expanding all the dropdown elements, I get all the new titles and then iterate over that list, printing each element's text.
driver.find_elements_by_class_name returns a list of web elements. You cannot apply .text to the list itself; you have to iterate over the list, getting each single element's text in turn.
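For instance, a minimal illustration of that point, reusing the card-title class from the snippets above:
titles = driver.find_elements_by_class_name('card-title')
texts = [el.text for el in titles]  # .text is read per element, giving a list of strings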
UPD2
The entire code, opening the dropdowns and printing their inner titles, can look like this. I'm doing this with Selenium alone, not mixing in bs4:
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
import time

driver = webdriver.Chrome()
actions = ActionChains(driver)
driver.get('https://cslide.ctimeetingtech.com/esmo2021/attendee/confcal/session/list')
time.sleep(4)
new_titles = set()
productlist = driver.find_elements_by_xpath("//div[@class='card item-container session']")
for property in productlist:
    actions.move_to_element(property).perform()
    time.sleep(0.5)
    sessiontitle = property.find_element_by_xpath(".//h4[@class='session-title card-title']").text
    print(sessiontitle)
    ifDropdown = property.find_elements_by_xpath(".//*[@class='item-expand-action expand']")
    if ifDropdown:
        ifDropdown[0].click()
        time.sleep(4)
        open_titles = driver.find_elements_by_class_name('card-title')
        for open_title in open_titles:
            title = open_title.text
            if title not in new_titles:
                print(title)
                new_titles.add(title)
Here I check whether a row has a dropdown. If it does, I open it, then get all the currently opened titles. For each such title I check whether it is new or was seen previously; if the title is new (not yet in the set), I print it and add it to the set.

To get all the data, including the date, time, and chairs, you can use requests/BeautifulSoup alone. There's no need for Selenium.
import requests
import pandas as pd
from bs4 import BeautifulSoup

data = []
url = "https://cslide.ctimeetingtech.com/esmo2021/attendee/confcal/session/list?p={}"

for page in range(1, 5):  # <-- Increase number of pages here
    with requests.Session() as session:
        soup = BeautifulSoup(session.get(url.format(page)).content, "html.parser")
        for card in soup.select("div.card-block"):
            title = card.find(class_="session-title card-title").get_text()
            date = card.select_one(".internal_date div.property").get_text(strip=True)
            time = card.select_one(".internal_time div.property").get_text()
            try:
                chairs = card.select_one(".persons").get_text(strip=True)
            except AttributeError:
                chairs = "N/A"
            data.append({"title": title, "date": date, "time": time, "chairs": chairs})

df = pd.DataFrame(data)
print(df.to_string())
Output (truncated):
title date time chairs
0 Educational sessions on-demand Thu, 16.09.2021 08:30 - 09:40 N/A
1 Special Symposia on-demand Thu, 16.09.2021 12:30 - 13:40 N/A
2 Multidisciplinary sessions on-demand Thu, 16.09.2021 16:30 - 17:40 N/A
3 MSD - Homologous Recombination Deficiency: BRCA and beyond Fri, 17.09.2021 08:45 - 09:55 Frederique Penault-Llorca(Clermont-Ferrand, France)
4 Servier - The clinical value of IDH inhibition in cholangiocarcinoma Fri, 17.09.2021 08:45 - 10:15 Arndt Vogel(Hannover, Germany)Angela Lamarca(Manchester, United Kingdom)
5 AstraZeneca - Redefining Breast Cancer – Biology to Therapy Fri, 17.09.2021 08:45 - 10:15 Ian Krop(Boston, United States of America)
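If you also want to persist the scraped table (this is not part of the original answer; the filename is illustrative), pandas can write the DataFrame straight to CSV:
# Optional: save the sessions to disk for later analysis
df.to_csv("esmo_sessions.csv", index=False)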


The output from my selenium script is blank, how do I fix?

This is my first time using selenium to scrape a website, and I'm fairly new to Python. I have tried to scrape a Swedish housing site to extract price, address, area, size, etc., for every listing at a specific URL that shows all houses for sale in an area called "Lidingö".
I managed to bypass the pop-up window for accepting cookies.
However, when the script runs, the output in the terminal is blank. I get nothing: no error, no output.
What could possibly be wrong?
The code is:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

s = Service("/Users/brustabl1/hemnet/chromedriver")
url = "https://www.hemnet.se/bostader?location_ids%5B%5D=17846&item_types%5B%5D=villa"
driver = webdriver.Chrome(service=s)
driver.maximize_window()
driver.implicitly_wait(10)
driver.get(url)

# The cookie button clicker
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "/html/body/div[62]/div/div/div/div/div/div[2]/div[2]/div[2]/button"))).click()

lists = driver.find_elements(By.XPATH, '//*[@id="result"]/ul[1]/li[1]/a/div[2]')
for list in lists:
    adress = list.find_element(By.XPATH, '//*[@id="result"]/ul[1]/li[2]/a/div[2]/div/div[1]/div[1]/h2')
    area = list.find_element(By.XPATH, '//*[@id="result"]/ul[1]/li[1]/a/div[2]/div/div[1]/div[1]/div/span[2]')
    price = list.find_element(By.XPATH, '//*[@id="result"]/ul[1]/li[1]/a/div[2]/div/div[2]/div[1]/div[1]')
    rooms = list.find_element(By.XPATH, '//*[@id="result"]/ul[1]/li[1]/a/div[2]/div/div[2]/div[1]/div[3]')
    size = list.find_element(By.XPATH, '//*[@id="result"]/ul[1]/li[1]/a/div[2]/div/div[2]/div[1]/div[2]')
    print(adress.text)
There are a lot of flaws in your code...
lists = driver.find_elements(By.XPATH, '//*[@id="result"]/ul[1]/li[1]/a/div[2]')
In your code, this returns a list of elements in the variable lists.
for list in lists:
    adress = list.find_element(By.XPATH, '//*[@id="result"]/ul[1]/li[2]/a/div[2]/div/div[1]/div[1]/h2')
    area = list.find_element(By.XPATH, '//*[@id="result"]/ul[1]/li[1]/a/div[2]/div/div[1]/div[1]/div/span[2]')
    price = list.find_element(By.XPATH, '//*[@id="result"]/ul[1]/li[1]/a/div[2]/div/div[2]/div[1]/div[1]')
    rooms = list.find_element(By.XPATH, '//*[@id="result"]/ul[1]/li[1]/a/div[2]/div/div[2]/div[1]/div[3]')
    size = list.find_element(By.XPATH, '//*[@id="result"]/ul[1]/li[1]/a/div[2]/div/div[2]/div[1]/div[2]')
    print(adress.text)
You are not storing the value of each address in a list; instead, you are overwriting it on each iteration. And because each XPath is absolute, it refers to one exact element, so your loop selects the same element over and over again! Use XPaths relative to the current list item instead.
And scraping text through Selenium is bad practice; use BeautifulSoup instead.
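As a minimal sketch of the relative-locator idea (untested; the XPaths reuse the question's structure, and the exact paths are assumptions about the page):
cards = driver.find_elements(By.XPATH, '//*[@id="result"]/ul[1]/li/a/div[2]')
results = []
for card in cards:
    # The leading "." makes each XPath resolve relative to the current card
    adress = card.find_element(By.XPATH, './div/div[1]/div[1]/h2').text
    price = card.find_element(By.XPATH, './div/div[2]/div[1]/div[1]').text
    results.append({"adress": adress, "price": price})
print(results)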

How store values together after scrape

I am able to scrape individual fields off a website, but would like to map the title to the time.
The fields "have their own class, so I am struggling on how to map the time to the title.
A dictionary would work, but how would i structure/format this dictionary so that it stores values on a line by line basis?
url for reference - https://ash.confex.com/ash/2021/webprogram/STUDIO.html
expected output:
9:00 AM-9:30 AM, Defining Race, Ethnicity, and Genetic Ancestry
11:00 AM-11:30 AM, Definitions of Structural Racism
etc
from bs4 import BeautifulSoup
from selenium import webdriver
import time

driver = webdriver.Chrome()
driver.get('https://ash.confex.com/ash/2021/webprogram/STUDIO.html')
time.sleep(3)
page_source = driver.page_source
soup = BeautifulSoup(page_source, 'html.parser')
productlist = soup.find_all('div', class_='itemtitle')
for item in productlist:
    for eachLine in item.find_all('a', href=True):
        title = eachLine.text
        print(title)
times = driver.find_elements_by_class_name("time")
for t in times:
    print(t.text)
Selenium is overkill here. The website doesn't use any dynamic content, so you can scrape it with Python requests and BeautifulSoup. Here is code showing how to achieve it. You need to query productlist and times separately and then iterate by index to get both items at once. I pass the length of productlist to range() because I assume productlist and times have equal length.
import requests
from bs4 import BeautifulSoup

url = 'https://ash.confex.com/ash/2021/webprogram/STUDIO.html'
res = requests.get(url)
soup = BeautifulSoup(res.content, 'html.parser')
productlist = soup.select('div.itemtitle > a')
times = soup.select('.time')
for iterator in range(len(productlist)):
    row = times[iterator].text + ", " + productlist[iterator].text
    print(row)
Note: soup.select() gathers items by CSS selector.
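A slightly more idiomatic way to pair the two lists, under the same equal-length assumption, is zip(), which also stops safely at the shorter list if they ever differ:
# Pair each time with its title positionally
for t, p in zip(times, productlist):
    print(t.text + ", " + p.text)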

How do i extract data from yelp using selenium python

I am new to Python! I want to extract data from Yelp:
https://www.yelp.com/search?find_desc=nails+salons&find_loc=San+Francisco%2C+CA&ns=1
and then, after clicking on a name on the first page, i.e.
https://www.yelp.com/biz/joy-joy-nail-and-spa-san-francisco?osq=nails+salons
it should extract
Name
Address
Website
Contact No
Rating (number of reviews)
and then it should continue doing so for the full page.
Example output
Joy Joy Nail & Spa
4023 24th St San Francisco, CA 94114
joyjoynailspa.com
(415) 655-3216
6 Reviews
Sunset Nails
1810 Irving St
San Francisco, CA 94122
(415) 566-9888
1185 reviews
If any of the elements is not present (like the website), it should skip that field and continue.
So basically, you have to go to the page, use find_elements to see how many items there are to scrape, select the first one, scrape the desired elements, go back to the previous page, and do the same for the other listings.
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
import time

driver = webdriver.Chrome(driver_path)  # driver_path: path to your chromedriver
driver.maximize_window()
driver.implicitly_wait(50)
driver.get("https://www.yelp.com/search?find_desc=nails+salons&find_loc=San+Francisco%2C+CA&ns=1")
wait = WebDriverWait(driver, 20)
lnght = len(driver.find_elements(By.XPATH, "//div[contains(@class,'businessName')]/descendant::a"))
j = 0
for item in range(lnght):
    elements = driver.find_elements(By.XPATH, "//div[contains(@class,'arrange-unit') and contains(@class,'arrange-unit-fill')]//ancestor::div[contains(@class,'container') and contains(@class,'hover')]")
    time.sleep(1)
    #driver.execute_script("arguments[0].scrollIntoView(true);", elements[j])
    eles = driver.find_elements(By.XPATH, "//h4/descendant::a")
    ActionChains(driver).move_to_element(eles[j]).click().perform()
    #elements[j].click()
    time.sleep(2)
    print(wait.until(EC.visibility_of_element_located((By.XPATH, "//div[contains(@class,'headingLight')]//h1"))).text)
    print(wait.until(EC.visibility_of_element_located((By.XPATH, "//p[text()='Business website']/following-sibling::p/a"))).text)
    print(wait.until(EC.visibility_of_element_located((By.XPATH, "//p[text()='Phone number']/following-sibling::p"))).text)
    print(wait.until(EC.visibility_of_element_located((By.XPATH, "//a[text()='Get Directions']/../following-sibling::p"))).text)
    print(wait.until(EC.visibility_of_element_located((By.XPATH, "//span[contains(text(),'reviews')]"))).text)
    driver.execute_script("window.history.go(-1)")
    time.sleep(2)
    j = j + 1
Update 1:
Whichever line is causing the issue, try to wrap it like this:
try:
    print(wait.until(EC.visibility_of_element_located((By.XPATH, "//p[text()='Business website']/following-sibling::p/a"))).text)
except:
    pass
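To avoid repeating that try/except around every print, you could factor it into a small helper. This is a sketch, not part of the original answer; safe_text is a hypothetical name, and catching TimeoutException assumes the wait simply times out when the field is missing:
from selenium.common.exceptions import TimeoutException

def safe_text(wait, xpath, default="N/A"):
    # Return the element's text, or `default` if it never becomes visible
    try:
        return wait.until(EC.visibility_of_element_located((By.XPATH, xpath))).text
    except TimeoutException:
        return default

print(safe_text(wait, "//p[text()='Business website']/following-sibling::p/a"))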

Failing to scrape the full page from Google Search results using selenium

I'm trying to scrape Google results using selenium chromedriver. Before, I used requests + BeautifulSoup to scrape Google results, and this worked; however, I got blocked by Google after around 300 results. I've been reading up on this topic, and it seems to me that selenium + webdriver is less easily blocked by Google.
Now, I'm trying to scrape Google results using selenium. I would like to scrape the title, link and description of all items. Essentially, I want to do this: How to scrape all results from Google search results pages (Python/Selenium ChromeDriver)
However, that approach throws:
NoSuchElementException: no such element: Unable to locate element:
{"method":"css selector","selector":"h3"} (Session info: chrome=90.0.4430.212)
Therefore, I'm trying other code. This code is able to scrape some, but not ALL, of the titles + descriptions. See the picture below. I cannot scrape the last 4 titles, and the last 5 descriptions are also empty. Any clues? Much appreciated.
import urllib
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
root = "https://www.google.com/"
url = "https://google.com/search?q="
query = 'Why do I only see the first 4 results?' # Fill in google query
query = urllib.parse.quote_plus(query)
link = url + query
print(f'Main link to search for: {link}')
options = Options()
# options.headless = True
options.add_argument("--window-size=1920,1200")
driver = webdriver.Chrome(options=options)
driver.get(link)
wait = WebDriverWait(driver, 30)
wait.until(EC.presence_of_all_elements_located((By.XPATH, './/h3')))
link_tag = './/div[#class= "yuRUbf"]/a'
title_tag = './/h3'
description_tag = './/span[#class= "aCOpRe"]'
titles = driver.find_elements_by_xpath(title_tag)
links = driver.find_elements_by_xpath(link_tag)
descriptions = driver.find_elements_by_xpath(description_tag)
for t in titles:
    print('title:', t.text)
for l in links:
    print('links:', l.get_attribute("href"))
for d in descriptions:
    print('descriptions:', d.text)
# Why are the last 4 titles and the last 5 descriptions empty??
(Image of the results not included here.)
That's because those 4 are not actual result links; Google always shows a "People also ask" block. If you look at their DOM structure:
<div style="padding-right:24px" jsname="xXq91c" class="cbphWd" data-
kt="KjCl66uM1I_i7PsBqYb-irfI74DmAeDWm-uv7IveYLKIxo-bn9L1H56X2ZSUy9L-6wE"
data-hveid="CAgQAw" data-ved="2ahUKEwjAoJ2ivd3wAhXU-nMBHWj1D8EQuk4oAHoECAgQAw">
How do I get Google to show all results?
</div>
you can see it is not an anchor tag, so there is no href attribute, and your links list will have 4 empty values because there are 4 divs like that.
To grab those 4 you need to use a different locator:
XPath: //*[local-name()='svg']/../following-sibling::div[@style]
title_tags = driver.find_elements(By.XPATH, "//*[local-name()='svg']/../following-sibling::div[@style]")
for title in title_tags:
    print(title.text)
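If you want a single combined list, the two locators can be joined with an XPath union (a sketch, not from the original answer):
# "|" unions the regular result titles with the "People also ask" entries
all_titles = driver.find_elements(By.XPATH, "//h3 | //*[local-name()='svg']/../following-sibling::div[@style]")
for t in all_titles:
    print(t.text)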

How to extract the product titles from the website using Selenium Python

I'm trying to scrape the titles from a website, but it only returns 1 title. How can I get all the titles?
Below is one of the elements I'm trying to fetch using XPath (starts-with):
<div id="post-4550574" class="post-box " data-permalink="https://hypebeast.com/2019/4/undercover-nike-sfb-mountain-sneaker-release-info" data-title="The UNDERCOVER x Nike SFB Mountain Pack Gets a Release Date"><div class="post-box-image-container fixed-ratio-3-2">
This is my current code:
from selenium import webdriver
import requests
from bs4 import BeautifulSoup as bs
driver = webdriver.Chrome('/Users/Documents/python/Selenium/bin/chromedriver')
driver.get('https://hypebeast.com/search?s=nike+undercover')
element = driver.find_element_by_xpath(".//*[starts-with(#id, 'post-')]")
print(element.get_attribute('data-title'))
Output:
The UNDERCOVER x Nike SFB Mountain Pack Gets a Release Date
I was expecting a lot more titles, but it returns only one result.
To extract the product titles, since the desired elements are JavaScript-rendered, you need to induce a WebDriverWait for visibility_of_all_elements_located(), and you can use either of the following locator strategies (these snippets assume the usual imports: WebDriverWait, expected_conditions as EC, and By):
XPATH:
driver.get('https://hypebeast.com/search?s=nike+undercover')
print([element.text for element in WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.XPATH, "//h2/span")))])
CSS_SELECTOR:
driver.get('https://hypebeast.com/search?s=nike+undercover')
print([element.text for element in WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "h2>span")))])
Console Output:
['The UNDERCOVER x Nike SFB Mountain Pack Gets a Release Date', 'The UNDERCOVER x Nike SFB Mountain Surfaces in "Dark Obsidian/University Red"', 'A First Look at UNDERCOVER’s Nike SFB Mountain Collaboration', "Here's Where to Buy the UNDERCOVER x Gyakusou Nike Running Models", 'Take Another Look at the Upcoming UNDERCOVER x Nike Daybreak', "Take an Official Look at GYAKUSOU's SS19 Footwear and Apparel Range", 'UNDERCOVER x Nike Daybreak Expected to Hit Shelves This Summer', "The 10 Best Sneakers From Paris Fashion Week's FW19 Runways", "UNDERCOVER FW19 Debuts 'A Clockwork Orange' Theme, Nike & Valentino Collabs", 'These Are the Best Sneakers of 2018']
You don't need selenium. You can use requests, which is faster, and target the data-title attribute:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://hypebeast.com/search?s=nike+undercover')
soup = bs(r.content, 'lxml')
titles = [item['data-title'] for item in soup.select('[data-title]')]
print(titles)
If you do want selenium, the matching syntax is:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://hypebeast.com/search?s=nike+undercover')
titles = [item.get_attribute('data-title') for item in driver.find_elements_by_css_selector('[data-title]')]
print(titles)
If a locator matches multiple elements, find_element returns the first one; find_elements returns a list of all elements found by the locator.
You can then iterate over the list and get all the elements.
If all of the elements you are trying to find have the class post-box, then you can find the elements by class name.
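For example, a short sketch based on the post-box class from the question's HTML (untested against the live site):
# find_elements returns every matching element; iterate to read each one
posts = driver.find_elements_by_class_name('post-box')
for post in posts:
    print(post.get_attribute('data-title'))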
Just sharing my experience and what I've used; it might help someone. Just use:
element.get_attribute('ATTRIBUTE-NAME')