Check if a button is disabled with BeautifulSoup?

Is there any way to check if a button is disabled with BeautifulSoup? I'm trying to scrape an Airbnb page and I want to check if there is a next page or not. I tried:
def findNextPage(soupPage):
    ''' Finds the next page with listings if it exists '''
    if not(soupPage.find("div", {"class": "_jro6t0"}).find("button", {"class": "_za9j7e"}, attrs={"disabled"})):
        return "https://airbnb.com" + soupPage.find("div", {"class": "_jro6t0"}).find("a")["href"]
    else:
        return "no next page"

Related

Dynamic (with mouseover/coordinates) web scraping in Python, unable to extract information

I'm trying to scrape data that only appears on mouseover (Selenium). It's a concert map, and this is my entire code. I keep getting TypeError: 'ActionChains' object is not iterable.
The idea would be to hover over the whole map and scrape the code every time the HTML changes. I'm pretty sure I need two for loops for that, but I don't know yet how to combine them. Also, I know I'll have to use bs4; could someone share ideas on how I could go about this?
import time
from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.maximize_window()
driver.get('https://www.ticketcorner.ch/event/simple-minds-hallenstadion-12035377/')
# accept shadow-root cookie banner
time.sleep(5)
driver.execute_script('return document.querySelector("#cmpwrapper").shadowRoot.querySelector("#cmpbntyestxt")').click()
time.sleep(5)
# Click on the Saalplan so we get to the concert map
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, '#tickets > div.seat-switch-wrapper.js-tab-switch-group > a:nth-child(3) > div > div.seat-switch-radio.styled-checkbox.theme-switch-bg.theme-text-color.theme-link-color-hover.theme-switch-border > label'))).click()
time.sleep(5)
# Scroll to the concert map, which will be the element to hover over
element_map = driver.find_element(By.CLASS_NAME, 'js-tickettype-cta')
actions = ActionChains(driver)
actions.move_to_element(element_map).perform()
# Close the drop-down which is partially hiding the concert map
driver.find_element(by=By.XPATH, value='//*[@id="seatmap-tab"]/div[2]/div/div/section/div[2]/div[1]/div[1]/div/div/div[1]/div').click()
# Mouse hover over the concert map and find the empty seats to extract the data.
actions = ActionChains(driver)
data = actions.move_to_element_with_offset(driver.find_element(by=By.XPATH, value='//*[@id="seatmap-tab"]/div[2]/div/div/section/div[2]/div[1]/div[2]/div/div[2]/div[1]/div[2]/div[2]/canvas'), 0, 0)
for i in data:
    actions.move_by_offset(50, 50).perform()
    time.sleep(2)
    # print content of each box
    hover_data = driver.find_element(By.XPATH, '//*[@id="tooltipster-533522"]/div[1]').get_attribute('tooltipster-content')
    print(hover_data)
# The code I would use to hover over the element
# actions.move_by_offset(100, 50).perform()
# time.sleep(5)
# actions.move_by_offset(150, 50).perform()
# time.sleep(5)
# actions.move_by_offset(200, 50).perform()
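A minimal sketch of the two-loop hover idea described above, not a tested answer: step across the canvas in a grid and read whatever tooltip is present after each move. The canvas and tooltip locators are assumptions, and note that move_to_element_with_offset measures offsets from the element's top-left corner in older Selenium versions but from its center in Selenium 4.3+.

import time
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By

canvas = driver.find_element(By.TAG_NAME, 'canvas')   # assumes a single seat-map canvas
width, height = canvas.size['width'], canvas.size['height']
step = 50
for x in range(0, width, step):            # outer loop: walk across the map horizontally
    for y in range(0, height, step):       # inner loop: walk down each column
        ActionChains(driver).move_to_element_with_offset(canvas, x, y).perform()
        time.sleep(1)
        # Read any tooltip that appeared; parse its innerHTML with bs4 if needed.
        for tip in driver.find_elements(By.CSS_SELECTOR, 'div[id^="tooltipster-"]'):
            print(tip.text)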

Playwright: Download via Print to PDF?

I'm seeking to scrape a web page using Playwright.
I load the page, and click the download button with Playwright successfully. This brings up a print dialog box with a printer selected.
I would like to select "Save as PDF" and then click the "Save" button.
Here's my current code:
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    playwright_page = browser.new_page()
    got_error = False
    try:
        playwright_page.goto(url_to_start_from)
        print(playwright_page.title())
        html = playwright_page.content()
    except Exception as e:
        print(f"Playwright exception: {e}")
        got_error = True
    if not got_error:
        soup = BeautifulSoup(html, 'html.parser')
        # download pdf
        with playwright_page.expect_download() as download_info:
            playwright_page.locator("text=download").click()
        download = download_info.value
        path = download.path()
        download.save_as(DOWNLOADED_PDF_FOLDER)
    browser.close()
Is there a way to do this using Playwright?
Thanks very much to @KJ in the comments, who suggested that with headless=True, Chromium won't even put up a print dialog box in the first place.
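If the end goal is simply a PDF of the page rather than driving Chromium's print dialog, headless Chromium can render the PDF directly through Playwright's page.pdf(); a minimal sketch with a placeholder URL and output path:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)   # page.pdf() only works in headless Chromium
    page = browser.new_page()
    page.goto("https://example.com")             # placeholder URL
    page.pdf(path="page.pdf", format="A4")       # render the page straight to a PDF file
    browser.close()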

Locating elements in a section with Selenium

I'm trying to enter text into a field (the subject field in the image) in a section using Selenium.
I've tried locating by XPath, ID, and a few others, but it looks like maybe I need to switch context to the section. I've tried the following; errors are in comments after the lines.
from time import sleep
from selenium.webdriver import Firefox
from selenium.webdriver.firefox.options import Options

opts = Options()
browser = Firefox(options=opts)
browser.get('https://www.linkedin.com/feed/')
sign_in = '/html/body/div[1]/main/p/a'
browser.find_element_by_xpath(sign_in).click()
email = '//*[@id="username"]'
browser.find_element_by_xpath(email).send_keys(my_email)
pword = '//*[@id="password"]'
browser.find_element_by_xpath(pword).send_keys(my_pword)
signin = '/html/body/div/main/div[2]/div[1]/form/div[3]/button'
browser.find_element_by_xpath(signin).click()
search = '/html/body/div[8]/header/div[2]/div/div/div[1]/div[2]/input'
name = 'John McCain'
browser.find_element_by_xpath(search).send_keys(name + "\n")  # click()
# click on first result
first_result = '/html/body/div[8]/div[3]/div/div[1]/div/div[1]/main/div/div/div[1]/div/div/div/div[2]/div[1]/div[1]/span/div/span[1]/span/a/span/span[1]'
browser.find_element_by_xpath(first_result).click()
# hit message button
msg_btn = '/html/body/div[8]/div[3]/div/div/div/div/div[2]/div/div/main/div/div[1]/section/div[2]/div[1]/div[2]/div/div/div[2]/a'
browser.find_element_by_xpath(msg_btn).click()
sleep(10)
## find subject box in section
section_class = '/html/body/div[3]/section'
browser.find_element_by_xpath(section_class)  # no such element
browser.switch_to().frame('/html/body/div[3]/section')  # no such frame
subject = '//*[@id="compose-form-subject-ember156"]'
browser.find_element_by_xpath(subject).click()  # no such element
compose_class = 'compose-form__subject-field'
browser.find_element_by_class_name(compose_class)  # no such class
id = 'compose-form-subject-ember156'
browser.find_element_by_id(id)  # no such element
css_selector = 'compose-form-subject-ember156'
browser.find_element_by_css_selector(css_selector)  # no such element
wind = '//*[@id="artdeco-hoverable-outlet__message-overlay"]'
browser.find_element_by_xpath(wind)  # no such element
A figure showing the developer info for the text box in question is attached.
How do I locate the text box and send keys to it? I'm new to Selenium but have gotten through login and basic navigation to this point.
I've put the page source (as seen by the Selenium browser object at this point) here.
The page source (as seen when I click in the browser window and hit 'copy page source') is here.
Despite the window in focus being the one I wanted, it seems like the browser object saw things differently. Using
window_after = browser.window_handles[1]
browser.switch_to_window(window_after)
allowed me to find the element using an XPath.
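For reference, a sketch of the same handle switch with the non-deprecated API (switch_to_window was removed in Selenium 4); the locator for the subject field is a hypothetical placeholder:

from selenium.webdriver.common.by import By

original = browser.current_window_handle
for handle in browser.window_handles:
    if handle != original:
        browser.switch_to.window(handle)   # modern replacement for switch_to_window()
        break
# Placeholder locator; inspect the messaging overlay for the real attribute values.
subject_field = browser.find_element(By.CSS_SELECTOR, "input[name*='subject']")
subject_field.send_keys("Hello")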

How to simulate pressing buttons to keep on scraping more elements with Scrapy

On this page (https://www.realestate.com.kh/buy/), I managed to grab a list of ads, and individually parse their content with this code:
import scrapy

class scrapingThings(scrapy.Spider):
    name = 'scrapingThings'
    # allowed_domains = ['https://www.realestate.com.kh/buy/']
    start_urls = ['https://www.realestate.com.kh/buy/']

    def parse(self, response):
        ads = response.xpath('//*[@class="featured css-ineky e1jqslr40"]//a/@href')
        c = 0
        for url in ads:
            c += 1
            absolute_url = response.urljoin(url.extract())
            self.item = {}
            self.item['url'] = absolute_url
            yield scrapy.Request(absolute_url, callback=self.parse_ad, meta={'item': self.item})

    def parse_ad(self, response):
        # Extract things
        yield {
            # Yield things
        }
However, I'd like to automate the switching from one page to another so I can grab all of the ads available (not only on the first page, but on all pages) by, I guess, simulating presses of the 1, 2, 3, 4, ..., 50 buttons shown in this screen capture:
Is this even possible with Scrapy? If so, how can one achieve this?
Yes, it's possible. Let me show you two ways of doing it:
You can have your spider select the buttons, get their @href value, build a full URL, and yield it as a new request.
Here is an example:
def parse(self, response):
    ....
    href = response.xpath('//div[@class="desktop-buttons"]/a[@class="css-owq2hj"]/following-sibling::a[1]/@href').get()
    req_url = response.urljoin(href)
    yield Request(url=req_url, callback=self.parse_ad)
The selector in the example will always return the @href of the next page's button (it returns only one value; if you are on page 2, it returns the @href of page 3).
On this page the href is a relative URL, so we need to use the response.urljoin() method to build a full URL; it uses the response's URL as the base.
We yield a new request, and the response will be parsed in the callback function you specified.
This will require your callback function to always yield the request for the next page, so it's a recursive solution (see the sketch below).
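In sketch form, the recursive variant could look like this, reusing the selector from the answer above and assuming parse_ad is also where the listing fields are extracted:

def parse_ad(self, response):
    yield {
        # ... extract the fields for this page/ad ...
    }
    # Schedule the next page; .get() returns None on the last page, which stops the recursion.
    next_href = response.xpath('//div[@class="desktop-buttons"]/a[@class="css-owq2hj"]/following-sibling::a[1]/@href').get()
    if next_href:
        yield scrapy.Request(response.urljoin(next_href), callback=self.parse_ad)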
A simpler approach would be to just observe the pattern of the hrefs and yield all the requests manually. Each button has an href of "/buy/?page={nr}", where {nr} is the page number, so we can change this nr value arbitrarily and yield all the requests at once.
def parse(self, response):
    ....
    nr_pages = response.xpath('//div[@class="desktop-buttons"]/a[@class="css-1en2dru"]//text()').getall()
    last_page_nr = int(nr_pages[-1])
    for nr in range(2, last_page_nr + 1):
        req_url = f'/buy/?page={nr}'
        yield Request(url=response.urljoin(req_url), callback=self.parse_ad)
nr_pages holds the text of all the page buttons
last_page_nr selects the last number (which is the last available page)
We loop over the range from 2 to last_page_nr (50 in this case), and in each iteration we request a new page (the one corresponding to that number).
This way you can make all the requests in your parse method and parse the responses in parse_ad without recursive calls.
Finally, I suggest you take a look at the Scrapy tutorial; it covers several common scraping scenarios.

Scrapy Running Results

Just getting started with Scrapy, I'm hoping for a nudge in the right direction.
I want to scrape data from here:
https://www.sportstats.ca/display-results.xhtml?raceid=29360
This is what I have so far:
import scrapy
import re

class BlogSpider(scrapy.Spider):
    name = 'sportstats'
    start_urls = ['https://www.sportstats.ca/display-results.xhtml?raceid=29360']

    def parse(self, response):
        headings = []
        results = []
        tables = response.xpath('//table')
        headings = list(tables[0].xpath('thead/tr/th/span/span/text()').extract())
        rows = tables[0].xpath('tbody/tr[contains(@class, "ui-widget-content ui-datatable")]')
        for row in rows:
            result = []
            tds = row.xpath('td')
            for td in enumerate(tds):
                if headings[td[0]].lower() == 'comp.':
                    content = None
                elif headings[td[0]].lower() == 'view':
                    content = None
                elif headings[td[0]].lower() == 'name':
                    content = td[1].xpath('span/a/text()').extract()[0]
                else:
                    try:
                        content = td[1].xpath('span/text()').extract()[0]
                    except:
                        content = None
                result.append(content)
            results.append(result)
        for result in results:
            print(result)
Now I need to move on to the next page, which I can do in a browser by clicking the "right arrow" at the bottom, which I believe is the following li:
<li><a id="mainForm:j_idt369" href="#" class="ui-commandlink ui-widget fa fa-angle-right" onclick="PrimeFaces.ab({s:"mainForm:j_idt369",p:"mainForm",u:"mainForm:result_table mainForm:pageNav mainForm:eventAthleteDetailsDialog",onco:function(xhr,status,args){hideDetails('athlete-popup');showDetails('event-popup');scrollToTopOfElement('mainForm\\:result_table');;}});return false;"></a>
How can I get scrapy to follow that?
If you open the URL in a browser without JavaScript you won't be able to move to the next page. As you can see inside the li tag, there is some JavaScript that has to be executed in order to get the next page.
To get around this, the first option is usually to try to identify the request generated by the JavaScript. In your case it should be easy: just analyze the JavaScript code and replicate it with Python in your spider. If you can do that, you can send the same request from Scrapy. If you can't, the next option is usually to use a package for JavaScript/browser emulation, something like ScrapyJS or Scrapy + Selenium.
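As a rough illustration of the emulation route, a minimal scrapy-splash sketch; it assumes a running Splash instance with the scrapy-splash middlewares configured in settings.py, and it only shows the rendering setup, since driving the PrimeFaces pagination would additionally need a Lua script or a replicated POST request:

import scrapy
from scrapy_splash import SplashRequest

class SportstatsSplashSpider(scrapy.Spider):
    name = 'sportstats_splash'

    def start_requests(self):
        url = 'https://www.sportstats.ca/display-results.xhtml?raceid=29360'
        yield SplashRequest(url, callback=self.parse, args={'wait': 2})

    def parse(self, response):
        # response now contains the JavaScript-rendered HTML
        for row in response.xpath('//table/tbody/tr'):
            yield {'cells': row.xpath('td//text()').getall()}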
You're going to need to perform a callback. Generate the URL from the XPath of the 'next page' button, i.e. url = response.xpath(<xpath to next_page_button>), and then, when you're finished scraping that page, yield scrapy.Request(url, callback=self.parse_next_page). Finally, you create a new function called def parse_next_page(self, response):.
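In sketch form, and assuming the next-page link exposed a plain href (which, as noted above, it does not on this PrimeFaces page):

def parse(self, response):
    # ... scrape the current page ...
    next_url = response.xpath('//a[@class="next-page"]/@href').get()   # placeholder selector
    if next_url:
        yield scrapy.Request(response.urljoin(next_url), callback=self.parse_next_page)

def parse_next_page(self, response):
    # ... scrape the following page; repeat the same pattern to keep going ...
    pass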
A final note: if the content happens to be rendered by JavaScript (and you can't scrape it even though you're sure you're using the correct XPath), check out my repo on using Splash with Scrapy: https://github.com/Liamhanninen/Scrape