Scraping a page every time it changes - selenium

Hi, I am currently looking to scrape a page such as "https://www.tennis24.com/match/ABiALWlt/#match-statistics;0" every time the score changes. Currently I have the ability to scrape it using Selenium and BS with the code below:
from selenium import webdriver

chrome_path = r"C:\Users\Dan1\Desktop\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)
driver.get("https://www.tennis24.com/match/zVrM3ySQ/#match-statistics;0")

data = driver.find_elements_by_class_name("statTextGroup")
for d in data:
    sub_data = d.find_elements_by_xpath(".//*")
    assert len(sub_data) == 3
    for s_d in sub_data:
        print(s_d.get_attribute('class')[19:], s_d.get_attribute('innerText'))
but I have no idea how to automate it so that once the score at the top of the page (shown as "Medical timeout 6 : 6 ( 0 : 0 )") changes, the scraper scrapes the new data. The element to monitor is only visible while the match is in play, so it is not always present.
If you need any more info, please let me know and I'll be happy to add it.

You can poll the "scoreboard" class in a while loop: whenever its value differs from the previously stored value, the score has changed and you can scrape the other things you wanted, as in the sketch below.
Hope it helps.
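A minimal sketch of that polling loop, reusing the driver from the question's snippet and the same Selenium 3-style API. It assumes the score lives in an element with class "scoreboard" (as the answer suggests); the poll interval and the NoSuchElementException guard for when the match is not in play are my additions:
import time
from selenium.common.exceptions import NoSuchElementException

def scrape_stats():
    # the asker's existing scraping logic
    for d in driver.find_elements_by_class_name("statTextGroup"):
        for s_d in d.find_elements_by_xpath(".//*"):
            print(s_d.get_attribute('class')[19:], s_d.get_attribute('innerText'))

old_score = None
while True:
    try:
        score = driver.find_element_by_class_name("scoreboard").text
    except NoSuchElementException:
        score = None  # scoreboard only exists while the match is in play
    if score is not None and score != old_score:
        old_score = score      # remember the new score
        scrape_stats()         # re-scrape only when it changed
    time.sleep(5)  # poll interval; pick whatever granularity you need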

Related

C# stumped with screen scraping issue on aspx page

I'm having some trouble scraping some HTML that I'm getting from a postback on a site. It is an aspx page that I am trying to get the generated HTML from.
I have looked at the cookie data, session data, and form data being sent with Chrome developer tools, and I still cannot get the page to respond with the search results despite mimicking almost all of it in my code.
There are 3 dropdowns on the page, 2 of which are pre-populated when you first visit the page. After choosing values for the first 2 (it does a postback every time you select on those two), it will populate values for the 3rd drop down. Once selecting a value in the 3rd drop down, you hit the search button and the results come back in a table below that.
After hitting the search button and getting the results on the screen, I went into developer tools and grabbed all of the values that looked relevant (especially all form values) and captured them in my code, but still no luck. Even captured the big viewstate exactly.
Here is a code sample of many code samples that I've tried. Admittedly, I'm not very familiar with some of these classes and I've been trying different code snippets.
I'm not sure if I'm doing it wrong in my code or if I'm just missing form data or cookies to make it execute the POST and return the correct data. My code currently returns HTML from the page back to the responseInString variable, but the HTML looks like it's the first version of the page (as if you visited it for the first time) with no drop down boxes selected and the 3rd is not populated with any values. So I don't know if my code is actually hitting the code-behind and doing the form POST to make it return data.
Any help would be greatly appreciated. Thank you!
using (var wb = new WebClient())
{
    var data = new NameValueCollection();
    data["__EVENTTARGET"] = "";
    data["__EVENTARGUMENT"] = "";
    data["__LASTFOCUS"] = "";
    data["__VIEWSTATE"] = "(giant viewstate)";
    data["__VIEWSTATEGENERATOR"] = "D86C5D2F";
    // 3 more form input/select fields after this with values corresponding to the drop downs.

    wb.Headers.Add(HttpRequestHeader.Cookie,
        ".ASPXANONYMOUS=(long string);" +
        "ASP.NET_SessionId=(Redacted);" +
        "_gid=GA1.2.1071490528.1676265043;" +
        "LoginToken=(Redacted);" +
        "LoginUserID=(Redacted);" +
        "_ga=GA1.1.1195633641.1675746985;" +
        "_ga_38VTY8CNGZ=GS1.1.1676265043.7.1.1676265065.0.0.0");
    wb.Headers.Add("Sec-Fetch-Dest", "document");
    wb.Headers.Add("Sec-Fetch-Mode", "navigate");
    wb.Headers.Add("Sec-Fetch-Site", "same-origin");
    wb.Headers.Add("Sec-Fetch-User", "?1");
    wb.Headers.Add("Content-Type", "application/x-www-form-urlencoded");

    var response = wb.UploadValues("(the web page url)", "POST", data);
    string responseInString = Encoding.UTF8.GetString(response);
    return responseInString;
}

How to find Xpath of any "city" of variable dropdown?

I installed the Chrome addon "Selectors Hub".
I opened the site spicejet.com.
I chose a random city in the "from" dropdown.
With the help of the "Selectors Hub" addon, I grabbed the XPath of that city:
//div[@class='css-1dbjc4n r-14lw9ot r-z2wwpe r-vgw6uq r-156q2ks r-urutk0 r-8uuktl r-136ojw6']//div[11]
When validating this XPath in the console, it shows 0 matches.
This website is built with ReactJS on the front end, if I am not wrong, and finding elements on a ReactJS site is a bit challenging; on top of that, if you rely on a locator-finding tool, it gets even more difficult. It is always better to build your own locator strategy than to rely on tools, especially for websites built with React, Vue, etc.
Having said that, the strategy here is to find a reasonably narrowed-down relative locator, and then, since you are looking for a random selection of a city, collect all the cities first and apply random to that list. Here is what I figured:
I collected the cities, but along with them came some unwanted items (courtesy of my relative locator), so I check for those and skip them if they are picked; only when an actual city is selected at random do I click on it. Check this code:
import random
import time

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.spicejet.com/")
time.sleep(10)
driver.find_element(By.XPATH, "//div[@data-testid='to-testID-destination']").click()
time.sleep(2)
cities = driver.find_elements(By.XPATH, "//div[@data-testid='to-testID-destination']//div[@data-focusable='true']")
print(len(cities))
x = random.choice(cities)
if x.text in ['To', 'India', 'International']:
    pass  # not a city; skip the unwanted header items
else:
    print(x.text)
    x.click()
time.sleep(5)
driver.quit()
Output:
Pakyong
Pakyong Airport
PYG
Process finished with exit code 0

Make Selenium scroll LinkedIn to scrape jobs

I have this code scraping each job title and company name from:
https://www.linkedin.com/jobs/search/?geoId=106155005&location=Egypt
This is for every job title:
job_titles = browser.find_elements_by_css_selector("a.job-card-list__title")
c = []
for title in job_titles:
    c.append(title.text)
print(c)
print(len(c))
This is for every company name:
Company_Names = browser.find_elements_by_css_selector("a.job-card-container__company-name")
d = []
for name in Company_Names:
    d.append(name.text)
print(d)
print(len(d))
I provided the URL above; there are many, many pages! How can I make Selenium auto-open each page and scrape each of the 4 thousand results available?
I have found a way to paginate to each page, but I have yet to work out how to scrape each page.
So the URL is :
https://www.linkedin.com/jobs/search/?geoId=106155005&location=Egypt&start=25
The start parameter in the URL increments by 25 from one page to the next,
so we add this piece of code, which successfully navigates us to the other pages:
page = 25
pagination = browser.get('https://www.linkedin.com/jobs/search/?geoId=106155005&location=Egypt&start={}'.format(page))
for i in range(1, 40):
    page = i * 25
    pagination = browser.get('https://www.linkedin.com/jobs/search/?geoId=106155005&location=Egypt&start={}'.format(page))
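Building on that, here is a minimal sketch that scrapes inside the pagination loop — assuming the two selectors above still match and the browser session is already logged in; the explicit wait and the accumulator lists are my additions, not from the original post:
from selenium.webdriver.support.ui import WebDriverWait

all_titles, all_companies = [], []
for i in range(0, 40):
    # start=0, 25, 50, ... walks through the result pages
    browser.get('https://www.linkedin.com/jobs/search/?geoId=106155005&location=Egypt&start={}'.format(i * 25))
    # wait until at least one job card is present before reading the page
    WebDriverWait(browser, 10).until(
        lambda b: b.find_elements_by_css_selector("a.job-card-list__title"))
    all_titles += [t.text for t in browser.find_elements_by_css_selector("a.job-card-list__title")]
    all_companies += [n.text for n in browser.find_elements_by_css_selector("a.job-card-container__company-name")]
print(len(all_titles), len(all_companies))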

CSS Selector in Selenium - Web Scraping

I am doing Linkedin web scraping as a part of my college project. This is the code to locate the skills & endorsements, recommendations and accomplishments section:
skills = driver.find_elements_by_css_selector('#ember661')
recom = driver.find_elements_by_css_selector('#ember679')
acc = driver.find_elements_by_css_selector('#ember695')
But I am getting an empty list in all three variables. Please help!
There are a couple of reasons:
The ID is generated and is not the same across profiles.
You should not expect a list of elements: there is a single section of each type on a profile page, so a single element would be returned.
Those sections might be loaded asynchronously, so the page is loaded but the section is not there yet, and the locators find nothing. In that case you need explicit waits, like:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

waiter = WebDriverWait(driver, 10)
skills = waiter.until(EC.visibility_of_element_located((By.CSS_SELECTOR, '.pv-skill-categories-section')))
recom = waiter.until(EC.visibility_of_element_located((By.CSS_SELECTOR, '.pv-recommendations-section')))
acc = waiter.until(EC.visibility_of_element_located((By.CSS_SELECTOR, '.pv-accomplishments-section')))

Extracting information from same page popup in python Selenium webscraping

Note: I'm experienced in Python but just starting out with Selenium and web scraping. Please excuse this if it is a bad question or if my fundamentals in Selenium seem amiss; I could not find an answer in hours of searching, hence I ask here.
Goal: To extract the "About the Business" information found on the Yelp pages of businesses.
Some pages have their About the Business information inside a popup opened by a Read More button (e.g. https://www.yelp.com/biz/and-pizza-bethesda-bethesda).
Some pages do not have their business information in a Read More popup (e.g. https://www.yelp.com/biz/pneuma-fashions-upper-marlboro-3).
Problem: Unable to navigate to the About the Business popup that appears after clicking the Read More button and extract the text in it.
Attempts so far: From googling I found explanations of how to handle alert popups and window popups, but the code doesn't work: the popup that appears when clicking the Read More button does not cause any change in window_handles.
import re

# getting all sections of the page
result = driver.find_elements_by_tag_name("section")
About = None
for sec in result:
    if sec.text.startswith("About the Business"):
        # this pertains only to the About the Business section
        main_page = driver.current_window_handle
        print(main_page)  # returns the current handle
        sec.find_element_by_tag_name("button").click()
        popup = None
        for handle in driver.window_handles:  # an iterable with only one handle
            # the only handle present is the main_page handle
            print(handle)
            if handle != main_page:
                popup = handle
        print(popup)  # returns None
        driver.switch_to.window(popup)  # throws an error because popup=None

        # THE FOLLOWING SECTION IS NOT EXECUTED BECAUSE OF THE ERROR ABOVE
        button_contents = driver.find_elements_by_tag_name("p")
        for b in button_contents:
            print(b.text)  # intended to print text contents
        close = driver.find_element_by_tag_name("button")
        close.click()
        driver.switch_to.window(main_page)
Please help
Thank you to everyone who reads this question and provides advice and answers
That is a custom pop-up, so you won't need to switch to it. I suggest studying how to build relative XPaths. Use a loop to navigate over your URLs and include the code below:
from selenium.webdriver.common.by import By

driver.get(your_URL)
readMoreBtnXpath = "//h4[text()='About the Business']/ancestor::section//button"
aboutTheBusinessSec = "//h4[text()='About the Business']/ancestor::section"
fromTheBusinessSec = "((//h2[text()='From the business']/parent::div/following-sibling::div//div)[5]/div)[last()]/preceding-sibling::div"
try:
    # pages with a Read More popup: open it and read its paragraphs
    driver.find_element(By.XPATH, readMoreBtnXpath).click()
    button_contents = driver.find_elements(By.XPATH, fromTheBusinessSec)
    for b in button_contents:
        print(b.text)
except:
    # pages without the popup: read the section text directly
    print(driver.find_element(By.XPATH, aboutTheBusinessSec).text)
One thing that you should know is that the pop-up is not displayed in a new window; it is displayed on the same page itself. Here is the complete code to extract the text from the pop-up:
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.yelp.com/biz/and-pizza-bethesda-bethesda')
try:
    driver.find_element_by_xpath('//*[@id="wrap"]/div[3]/div/div[4]/div/div/div[2]/div/div/div[1]/div/div[1]/section[5]/div[2]/button').click()
    p1 = driver.find_element_by_xpath('//*[@id="modal-portal-container"]/div[2]/div/div/div/div[2]/div/div[2]/div/div[2]/div/div/div[1]/p').text
    p2 = driver.find_element_by_xpath('//*[@id="modal-portal-container"]/div[2]/div/div/div/div[2]/div/div[2]/div/div[2]/div/div/div[2]/p[2]').text
    print("Specialties --", p1)
    print("History --", p2)
except:
    print('Read more button not found')
Output:
Specialties -- Award-winning pizza: Named one of Fast Company's "World's Most Innovative Companies" in 2018, third-place in the Washington Post Express's of "Best Fast Casual" in 2018, third place in the Washington City Paper's "Best Gluten-Free Menu" in 2018 and won its "Best Pizza in D.C." in 2017, 11th on TripAdvisor's "Best Fast Casual Restaurants -- United States" in 2018.
History -- Since 2012, we've built pizza shops with an edge to their craft pies, beverages and shop design, created an environment where ALL of our Tribe can thrive, supported our local communities and now we'll text you back, if you want. Started with a pizza shop. Became a culture. That's &pizza.
Edit:
Since this doesn't work with this website, replace the first find_element_by_xpath with:
driver.find_element_by_xpath("//div[#class='lemon--div__373c0__1mboc border-color--default__373c0__3-ifU']/button[.='Read more']").click()
This works for both the websites.