How can I scrape all the comments on the site? - selenium

I want to scrape all the comments (436 of them) for the MacBook product on the e-commerce site linked below, but my code doesn't collect all of them; it only gets about half. How can I solve this?
import time
from selenium import webdriver

browser = webdriver.Chrome("C:/Users/Emre/Desktop/Elif/web_scraping/chromedriver.exe")
browser.get("https://www.trendyol.com/apple/macbook-air-13-m1-8gb-256gb-ssd-uzay-grisi-p-68042136")
Brand = browser.find_element_by_xpath("//*[@id='product-detail-app']/div/div[2]/div[1]/div[2]/div[2]/div/div/div[1]/h1/a").text
Brand_name = browser.find_element_by_xpath("//*[@id='product-detail-app']/div/div[2]/div[1]/div[2]/div[2]/div/div/div[1]/h1").text
Price = browser.find_element_by_xpath("//*[@id='product-detail-app']/div/div[2]/div[1]/div[2]/div[2]/div/div/div[5]/div/div/div/div[2]/span").text
Assesment = browser.find_element_by_class_name("rvw-cnt-tx")
Assesment.click()
time.sleep(5)
for i in range(1, 5):
    browser.execute_script('window.scrollTo(0,document.body.scrollHeight)')
    time.sleep(5)
Assesments = browser.find_elements_by_class_name("pr-rnr-com")
# degerlendirmeler = browser.find_elements_by_class_name("rnr-com-tx")  # ("degerlendirmeler" = reviews)
Assesment_List = []
for Assesment in Assesments:
    Assesment_List.append(Assesment.text)
with open("customer_comments.txt", "w", encoding="UTF-8") as file:
    for Assesment in Assesment_List:
        file.write(Assesment + "\n")
time.sleep(5)
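One possible approach (a sketch only, reusing the class names from the code above, which may change on the site): instead of scrolling a fixed four times, keep scrolling until the number of loaded review elements stops growing, then write them all out.

import time
from selenium import webdriver

browser = webdriver.Chrome("C:/Users/Emre/Desktop/Elif/web_scraping/chromedriver.exe")
browser.get("https://www.trendyol.com/apple/macbook-air-13-m1-8gb-256gb-ssd-uzay-grisi-p-68042136")
browser.find_element_by_class_name("rvw-cnt-tx").click()   # open the reviews section
time.sleep(5)

previous_count = -1
reviews = []
while True:
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(3)
    reviews = browser.find_elements_by_class_name("pr-rnr-com")  # class name taken from the question
    if len(reviews) == previous_count:   # nothing new loaded, so all reviews should be in the DOM
        break
    previous_count = len(reviews)

with open("customer_comments.txt", "w", encoding="UTF-8") as file:
    for review in reviews:
        file.write(review.text + "\n")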

Related

Check whether a product is sold out by crawler

I would like to check whether the product (a mouse) is available on this website. Using the following code, I expected to get "Sold out". However, I always get "Add to Cart", probably because the text is set by a script further down depending on the condition.
How can I get "Sold out" in this situation? Thank you for your help.
page = r.get("https://finalmouse.com/collections/museum/products/starlight-12-phantom?variant=39672355324040").content
soup = bs(page, "html.parser")
span = soup.find("span", {"id":"AddToCartText"})
print(span.text)
As Sean wrote in a comment, the HTML on this page is loaded dynamically; the page is updated via JavaScript. To parse such sites you need to use selenium with a webdriver.
Code for your case:
from bs4 import BeautifulSoup as bs
from selenium import webdriver
url = "https://finalmouse.com/collections/museum/products/starlight-12-phantom?variant=39672355324040"
driver = webdriver.Chrome()
driver.get(url)
page = driver.page_source
soup = bs(page, "html.parser")
span = soup.find("span", {"id":"AddToCartText"})
print(span.text)
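If the span is sometimes not rendered by the time page_source is read, an explicit wait helps. A small variation of the code above (the element id comes from the question; the 15-second timeout is an arbitrary choice):

from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

url = "https://finalmouse.com/collections/museum/products/starlight-12-phantom?variant=39672355324040"
driver = webdriver.Chrome()
driver.get(url)
# wait until the dynamically rendered button text is actually in the DOM
WebDriverWait(driver, 15).until(EC.presence_of_element_located((By.ID, "AddToCartText")))
soup = bs(driver.page_source, "html.parser")
print(soup.find("span", {"id": "AddToCartText"}).text)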

WebScrape - Pagination/Next page

The current code works and scrapes the page how I want it to.
However, how can I get this to run for the next page? The URL is not unique for the second page, and I want to run it for all pages.
import requests
from bs4 import BeautifulSoup as bs
lists=[]
r = requests.get('https://journals.lww.com/ccmjournal/toc/2022/01001')
soup = bs(r.content, 'lxml')
d = {i.text.strip():i['href'] for i in soup.select('.ej-toc-subheader + div h4 > a')}
lists.append(d)
It's a dynamically loaded webpage, so you need to use selenium for that. Have a look here: Selenium with Python
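A minimal sketch of that selenium route, reusing the soup logic from the question. The pagination selector below is only a placeholder assumption; inspect the page in devtools and replace it with the real "next page" control:

import time
from bs4 import BeautifulSoup as bs
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://journals.lww.com/ccmjournal/toc/2022/01001')
time.sleep(5)  # give the dynamic content time to load

lists = []
while True:
    soup = bs(driver.page_source, 'lxml')
    d = {i.text.strip(): i['href'] for i in soup.select('.ej-toc-subheader + div h4 > a')}
    lists.append(d)
    # 'a.pagination-next' is a hypothetical selector; use the one you find on the page
    next_buttons = driver.find_elements_by_css_selector('a.pagination-next')
    if not next_buttons:
        break
    next_buttons[0].click()
    time.sleep(5)
driver.quit()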

Failing to scrape the full page from Google Search results using selenium

I'm trying to scrape Google results using selenium chromedriver. Before, I used requests + BeautifulSoup to scrape Google results, and this worked; however, I got blocked by Google after around 300 results. I've been reading up on this topic, and it seems that selenium + webdriver is less easily blocked by Google.
Now I'm trying to scrape Google results using selenium. I would like to scrape the title, link and description of all items. Essentially, I want to do this: How to scrape all results from Google search results pages (Python/Selenium ChromeDriver). However, when I run that code, I get:
NoSuchElementException: no such element: Unable to locate element:
{"method":"css selector","selector":"h3"} (Session info:
chrome=90.0.4430.212)
Therefore, I'm trying another code. This code is able to scrape some, but not ALL the titles + descriptions. See picture below. I cannot scrape the last 4 titles, and the last 5 descriptions are also empty. Any clues on this? Much appreciated.
import urllib
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
root = "https://www.google.com/"
url = "https://google.com/search?q="
query = 'Why do I only see the first 4 results?' # Fill in google query
query = urllib.parse.quote_plus(query)
link = url + query
print(f'Main link to search for: {link}')
options = Options()
# options.headless = True
options.add_argument("--window-size=1920,1200")
driver = webdriver.Chrome(options=options)
driver.get(link)
wait = WebDriverWait(driver, 30)
wait.until(EC.presence_of_all_elements_located((By.XPATH, './/h3')))
link_tag = './/div[@class="yuRUbf"]/a'
title_tag = './/h3'
description_tag = './/span[@class="aCOpRe"]'
titles = driver.find_elements_by_xpath(title_tag)
links = driver.find_elements_by_xpath(link_tag)
descriptions = driver.find_elements_by_xpath(description_tag)
for t in titles:
    print('title:', t.text)
for l in links:
    print('links:', l.get_attribute("href"))
for d in descriptions:
    print('descriptions:', d.text)
# Why are the last 4 titles and the last 5 descriptions empty??
Image of the results:
That's because those 4 are not actual result links; Google always shows "People also ask" entries. If you look at their DOM structure:
<div style="padding-right:24px" jsname="xXq91c" class="cbphWd"
     data-kt="KjCl66uM1I_i7PsBqYb-irfI74DmAeDWm-uv7IveYLKIxo-bn9L1H56X2ZSUy9L-6wE"
     data-hveid="CAgQAw" data-ved="2ahUKEwjAoJ2ivd3wAhXU-nMBHWj1D8EQuk4oAHoECAgQAw">
  How do I get Google to show all results?
</div>
you can see it is not an anchor tag, so there is no href attribute, and your links list will have 4 empty values because there are 4 divs like that.
To grab those 4 you need to use a different locator:
XPATH : //*[local-name()='svg']/../following-sibling::div[@style]
title_tags = driver.find_elements(By.XPATH, "//*[local-name()='svg']/../following-sibling::div[@style]")
for title in title_tags:
    print(title.text)
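For completeness, here is a sketch that collects both kinds of results with the locators discussed here: the "yuRUbf" container comes from the question and the svg XPath from this answer. It continues from a driver that has already loaded the results page, and since Google changes its markup often, treat every selector as an assumption to verify:

# organic results: title and link are inside the "yuRUbf" container
organic_links = driver.find_elements_by_xpath('//div[@class="yuRUbf"]/a')
for a in organic_links:
    print('title:', a.find_element_by_xpath('.//h3').text)
    print('link :', a.get_attribute('href'))

# "People also ask" entries: plain divs, no anchor, so no href
people_also_ask = driver.find_elements_by_xpath(
    "//*[local-name()='svg']/../following-sibling::div[@style]")
for q in people_also_ask:
    print('related question:', q.text)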

Beautifulsoup Scraping Object Selection

I am trying to scrape with BeautifulSoup and am new to this; I need the table rows shown in the screenshot.
The tables come from a React app and are then shown on the website. I need a suggestion on how to do this: I am struggling to create the BeautifulSoup object and do not know which class to grab to reach the table rows and their content.
webpage = urlopen(req).read()
soup = bs(webpage, "html.parser")
table=soup.find('table', {'class': 'equity'})
rows=list()
for row in table.findAll("tr"):
rows.append(row)
I'd really appreciate your help; I'm having a hard time getting this done!
You can grab the td elements with this code:
webpage = urlopen(req).read()
soup = bs(webpage, "lxml")
table=soup.find('table', {'class': 'table'}).find('tr')
rows=list()
for row in table.findAll("td"):
rows.append(row)
I preferred using lxml as the parser because it has some advantages, but you can keep using html.parser.
You can also use pandas; it's much easier, and there is a lot to learn from its documentation.
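A minimal sketch of the pandas route: read_html parses every <table> in the markup into a DataFrame. It assumes the table is actually present in the HTML you fetched; for a React-rendered table you may need selenium and driver.page_source instead of urlopen:

import pandas as pd

tables = pd.read_html(str(soup))   # soup is the BeautifulSoup object built above
df = tables[0]                     # pick the table you need
print(df.head())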

Why does this Python 2.7 code have no output?

This is an example from a python book. When I run it I don't get any output. Can someone help me? Thanks!!!
from urllib import urlopen
from BeautifulSoup import BeautifulSoup
text = urlopen('https://python.org/community/jobs').read()
soup = BeautifulSoup(text)
jobs = set()
for header in soup('h3'):
    links = header('a', 'reference')
    if not links: continue
    link = links[0]
    jobs.add('%s (%s)' % (link.string, link['href']))
    print jobs.add('%s (%s)' % (link.string, link['href']))

print '\n'.join(sorted(jobs, key=lambda s: s.lower()))
Edit:
At first I only considered that the URL might be wrong, but I ignored the possibility that the HTML information I wanted to get does not exist on the page. Maybe this is why I get empty output.
If you open the page and inspect the html you'll notice there are no <h3> tags containing links. This is why you have no output.
So if not links: continue always continues.
This is probably because the page has moved to https://www.python.org/jobs/ so the <h3> tags containing links on the page are no longer present.
If you point this code's URL at the new page, I'd suggest taking some time to familiarize yourself with the page source; for instance, it uses <h2> instead of <h3> tags for its links.
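A hedged sketch of the same script updated for Python 3 and pointed at the current jobs page; the <h2> tag comes from the note above, and the remaining selectors are assumptions to verify against the page source:

from urllib.request import urlopen
from bs4 import BeautifulSoup

text = urlopen('https://www.python.org/jobs/').read()
soup = BeautifulSoup(text, 'html.parser')

jobs = set()
for header in soup('h2'):        # the jobs page now uses <h2> headings for its links
    links = header('a')
    if not links:
        continue
    link = links[0]
    jobs.add('%s (%s)' % (link.get_text(strip=True), link['href']))

print('\n'.join(sorted(jobs, key=lambda s: s.lower())))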