Check whether a product is sold out by crawler - beautifulsoup

I would like to access whether the product is avaliable on this website(mouse product). By using the following code, I expected to get "Sold out". However, I always got "Add to Cart", probably because there is a script below according to the condition.
How could I get "Sold out" in the situation? Thank you for your help.
page = r.get("https://finalmouse.com/collections/museum/products/starlight-12-phantom?variant=39672355324040").content
soup = bs(page, "html.parser")
span = soup.find("span", {"id":"AddToCartText"})
print(span.text)
website screen shot

As Sean wrote in comment, on this page the html is loaded dynamically, the page is updated via js. To parse such sites you need to connect selenium and webdriver.
Code for your case:
from bs4 import BeautifulSoup as bs
from selenium import webdriver
url = "https://finalmouse.com/collections/museum/products/starlight-12-phantom?variant=39672355324040"
driver = webdriver.Chrome()
driver.get(url)
page = driver.page_source
soup = bs(page, "html.parser")
span = soup.find("span", {"id":"AddToCartText"})
print(span.text)

Related

The output from my selenium script is blank, how do I fix?

First time using selenium for web scraping a website, and I'm fairly new to python. I have tried to scrape a Swedish housing site to extract price, address, area, size, etc., for every listing for a specific URL that shows all houses for sale in a specific area called "Lidingö".
I managed to bypass the pop-up window for accepting cookies.
However, the output I get from the terminal is blank when the script runs. I get nothing, not an error, not any output.
What could possibly be wrong?
The code is:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
s = Service("/Users/brustabl1/hemnet/chromedriver")
url = "https://www.hemnet.se/bostader?location_ids%5B%5D=17846&item_types%5B%5D=villa"
driver = webdriver.Chrome(service=s)
driver.maximize_window()
driver.implicitly_wait(10)
driver.get(url)
# The cookie button clicker
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "/html/body/div[62]/div/div/div/div/div/div[2]/div[2]/div[2]/button"))).click()
lists = driver.find_elements(By.XPATH, '//*[#id="result"]/ul[1]/li[1]/a/div[2]')
for list in lists:
adress = list.find_element(By.XPATH,'//*[#id="result"]/ul[1]/li[2]/a/div[2]/div/div[1]/div[1]/h2')
area = list.find_element(By.XPATH,'//*[#id="result"]/ul[1]/li[1]/a/div[2]/div/div[1]/div[1]/div/span[2]')
price = list.find_element(By.XPATH,'//*[#id="result"]/ul[1]/li[1]/a/div[2]/div/div[2]/div[1]/div[1]')
rooms = list.find_element(By.XPATH,'//*
[#id="result"]/ul[1]/li[1]/a/div[2]/div/div[2]/div[1]/div[3]')
size = list.find_element(By.XPATH,'//*[#id="result"]/ul[1]/li[1]/a/div[2]/div/div[2]/div[1]/div[2]')
print(adress.text)
There are a lot of flaws in your code...
lists = driver.find_elements(By.XPATH, '//*[#id="result"]/ul[1]/li[1]/a/div[2]')
in your code this returns a list of elements in the variable lists
for list in lists:
adress = list.find_element(By.XPATH,'//*[#id="result"]/ul[1]/li[2]/a/div[2]/div/div[1]/div[1]/h2')
area = list.find_element(By.XPATH,'//*[#id="result"]/ul[1]/li[1]/a/div[2]/div/div[1]/div[1]/div/span[2]')
price = list.find_element(By.XPATH,'//*[#id="result"]/ul[1]/li[1]/a/div[2]/div/div[2]/div[1]/div[1]')
rooms = list.find_element(By.XPATH,'//*
[#id="result"]/ul[1]/li[1]/a/div[2]/div/div[2]/div[1]/div[3]')
size = list.find_element(By.XPATH,'//*[#id="result"]/ul[1]/li[1]/a/div[2]/div/div[2]/div[1]/div[2]')
print(adress.text)
you are not storing the value of each address in a list, instead, you are updating its value through each iteration.And xpath refers to the exact element, your loop is selecting the same element over and over again!
And scraping text through selenium is a bad practice, use BeautifulSoup instead.

WebScrape - Pagination/Next page

The current code works and scrapes the page how I want it to.
However, how can I get this to run for the next page ? The URL is not unique for the second page and I want to run for all pages.
import requests
from bs4 import BeautifulSoup as bs
lists=[]
r = requests.get('https://journals.lww.com/ccmjournal/toc/2022/01001')
soup = bs(r.content, 'lxml')
d = {i.text.strip():i['href'] for i in soup.select('.ej-toc-subheader + div h4 > a')}
lists.append(d)
It's a dynamically loaded webpage you need to use selenium for that. Have a look here : Selenium with Python

Failing to scrape the full page from Google Search results using selenium

I'm trying to scrape Google results using selenium chromedriver. Before, I used requests + Beautifulsoup to scrape google Results, and this worked, however I got blocked from Google after around 300 results. I've been reading into this topic and it seems to me that using selenium + webdriver is less easily blocked by Google.
Now, I'm trying to scrape Google results using selenium. I would like to scrape the title, link and description of all items. Essentially, I want to do this: How to scrape all results from Google search results pages (Python/Selenium ChromeDriver)
NoSuchElementException: no such element: Unable to locate element:
{"method":"css selector","selector":"h3"} (Session info:
chrome=90.0.4430.212)
Therefore, I'm trying another code. This code is able to scrape some, but not ALL the titles + descriptions. See picture below. I cannot scrape the last 4 titles, and the last 5 descriptions are also empty. Any clues on this? Much appreciated.
import urllib
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
root = "https://www.google.com/"
url = "https://google.com/search?q="
query = 'Why do I only see the first 4 results?' # Fill in google query
query = urllib.parse.quote_plus(query)
link = url + query
print(f'Main link to search for: {link}')
options = Options()
# options.headless = True
options.add_argument("--window-size=1920,1200")
driver = webdriver.Chrome(options=options)
driver.get(link)
wait = WebDriverWait(driver, 30)
wait.until(EC.presence_of_all_elements_located((By.XPATH, './/h3')))
link_tag = './/div[#class= "yuRUbf"]/a'
title_tag = './/h3'
description_tag = './/span[#class= "aCOpRe"]'
titles = driver.find_elements_by_xpath(title_tag)
links = driver.find_elements_by_xpath(link_tag)
descriptions = driver.find_elements_by_xpath(description_tag)
for t in titles:
print('title:', t.text)
for l in links:
print('links:', l.get_attribute("href"))
for d in descriptions:
print('descriptions:', d.text)
# Why are the last 4 titles and the last 5 descriptions empty??
Image of the results:
Cause those 4 are not the actual links, Google always show "People also ask". If you see their DOM structure
<div style="padding-right:24px" jsname="xXq91c" class="cbphWd" data-
kt="KjCl66uM1I_i7PsBqYb-irfI74DmAeDWm-uv7IveYLKIxo-bn9L1H56X2ZSUy9L-6wE"
data-hveid="CAgQAw" data-ved="2ahUKEwjAoJ2ivd3wAhXU-nMBHWj1D8EQuk4oAHoECAgQAw">
How do I get Google to show all results?
</div>
it is not an anchor tag so you won't see href tag so your links list will have 4 empty value cause there are 4 divs like that.
to grab those 4 you need to use different locator :
XPATH : //*[local-name()='svg']/../following-sibling::div[#style]
title_tags = driver.find_elements(By.XPATH, "//*[local-name()='svg']/../following-sibling::div[#style]")
for title in title_tags:
print(title.text)

BeautifulSoup findAll() not finding all, regardless of which parser I use

So I've read through all the questions about findAll() not working that I can find, and the answer always seems to be an issue with the particular html parser. I have run the following code using the default 'html.parser' along with 'lxml' and 'html5lib' yet I can only find one instance when I should be finding 14.
import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://robertsspaceindustries.com/pledge/ships'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, features = "lxml")
containers = page_soup.findAll("ul", {"class":"ships-listing"})
len(containers)
I tried posting a picture of the HTML code, but I don't have enough reputation. Here is a link to the image (https://imgur.com/a/cOLO8GO).
When you download a page through urllib (or requests HTTP library) it downloads the original HTML source file.
Initially there's only sinlge tag with the class name 'ships-listing' because that tag comes with the source page. But once you scroll down, the page generates additional <ul class='ships-listing'> and these elements are generated by the JavaScript.
So when you download a page using urllib, the downloaded content only contains the original source page (you could see it by view-source option in the browser).

Web Scraping with Beautiful Soup in Python - JavaScript Table

Im trying to scrape a table from a website but I cant seem to figure it out with Beautifulsoup in Python. Im not sure if its because of the table format, but I basically want to turn this table into a CSV.
from bs4 import BeautifulSoup
import requests
page = requests.geenter code heret("https://spotwx.com/products/grib_index.php?model=hrrr_wrfprsf&lat=41.03399&lon=-73.76291&tz=America/New_York&display=table")
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify)
Any advice on how to isolate this data table? I've checked so many Beautifulsoup tutorials, but the HTML looks different than most references. Many thanks in advance for your help -
Try this. The table from that site generates dynamically so you can't get results using requests only.
from selenium import webdriver
from bs4 import BeautifulSoup
import csv
link = "https://spotwx.com/products/grib_index.php?model=hrrr_wrfprsf&lat=41.03399&lon=-73.76291&tz=America/New_York&display=table"
with open("spotwx.csv", "w", newline='') as infile:
writer = csv.writer(infile)
writer.writerow(['DateTime','Tmp','Dpt','Rh','Wh','Wd','Wg','Apcp','Slp'])
with webdriver.Chrome() as driver:
driver.get(link)
soup = BeautifulSoup(driver.page_source, 'lxml')
for item in soup.select("table#example tbody tr"):
data = [elem.text for elem in item.select('td')]
print(data)
writer.writerow(data)