What is the correct soup.find() command? - beautifulsoup

I am trying to web scrape the race name ('The Valley R2') and the horse name ('Ronniejay') from the following website: https://www.punters.com.au/form-guide/form-finder/e2a0f7e13bf0057b4c156aea23019b18.
What is the correct soup.find() code to do this?
My code to get the race name:
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.punters.com.au/form-guide/form-finder/e2a0f7e13bf0057b4c156aea23019b18').text
soup = BeautifulSoup(source,'lxml')
race = soup.find('h3')
print(race)

The website renders its content with JavaScript, which requests does not execute. We can use Selenium instead to scrape the page.
Install it with: pip install selenium.
Download the ChromeDriver build that matches your installed Chrome version.
from selenium import webdriver
from bs4 import BeautifulSoup
from time import sleep
URL = "https://www.punters.com.au/form-guide/form-finder/e2a0f7e13bf0057b4c156aea23019b18"
driver = webdriver.Chrome(r"C:\path\to\chromedriver.exe")
driver.get(URL)
# Wait for page to fully render
sleep(5)
soup = BeautifulSoup(driver.page_source, "lxml")
race_name = soup.select_one(".form-result-group__event span").text
horse_name = "".join(
x for x in soup.select_one(".form-result__competitor-name").text if x.isalpha()
)
print(race_name)
print(horse_name)
driver.quit()
Output:
The Valley R2
Ronniejay
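As an aside, a fixed sleep(5) can be flaky on slow connections. A minimal sketch of an explicit wait, assuming the same .form-result-group__event selector used above, would be:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 15 seconds for the event header to appear before parsing.
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, ".form-result-group__event span"))
)
soup = BeautifulSoup(driver.page_source, "lxml")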

Related

Scraping Amazon dropdown list using Selenium; Dynamic scraping

I am scraping amazon.ae. I was trying to scrape the size of clothes (e.g. jeans) by selecting from the dropdown menu. My code is as follows, but I got an error. Please help.
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.support.ui import Select
url='https://www.amazon.ae/Jack-Jones-Glenn-Original-Pants/dp/B07JQB87KL/ref=sr_1_5?crid=M8QQKGLLZ1O9&keywords=jeans&qid=1657289288&sprefix=jeans%2Caps%2C232&sr=8-5&th=1'
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get(url)
d=driver.find_element_by_name("dropdown_selected_size_name").click()
select=Select(d)
select.select_by_index(1)
#Error: AttributeError: 'NoneType' object has no attribute 'tag_name'
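The traceback itself points at the likely cause: element.click() returns None, so Select(d) is being handed None and fails when it checks tag_name. A minimal sketch of the fix, assuming the size picker really is a native <select> element (on Amazon it is often a styled widget, in which case Select() will not apply and you would click the option elements instead):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get(url)  # same product URL as above

# Keep a reference to the element itself; .click() returns None, which is what broke Select(d).
dropdown = driver.find_element(By.NAME, "dropdown_selected_size_name")

# Select() only works on a real <select> tag (assumption -- verify in the page source).
Select(dropdown).select_by_index(1)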

How to modify code to scrape data off of 2nd table on this webpage

I am trying to scrape data from a table on the following website: https://www.eliteprospects.com/league/nhl/stats/2021-2022
This is the code I found that successfully scrapes data from the first table, the skater stats:
import requests
import pandas as pd
from bs4 import BeautifulSoup
dfs = []
for page in range(1,10):
    url = f"https://www.eliteprospects.com/league/nhl/stats/2021-2022?sort=tp&page={page}"
    print(f"Loading {url=}")
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    df = (
        pd.read_html(str(soup.select_one(".player-stats")))[0]
        .dropna(how="all")
        .reset_index(drop=True)
    )
    dfs.append(df)

df_final = pd.concat(dfs).reset_index(drop=True)
print(df_final)
df_final.to_csv("data.csv", index=False)
But I am having difficulty scraping the goalie stats from the bottom table. Any idea how to modify the code to get the stats from that table? I tried changing line 13 to ".goalie-stats", but it returned an error when I tried to run the code.
Thank you!!
I found a way to get the data, but it isn't perfect: it produces a lot of unnamed columns. Still, it gets the data, so I hope it's helpful.
import requests
import pandas as pd
from bs4 import BeautifulSoup
dfs = []
for page in range(1,3):
    url = f"https://www.eliteprospects.com/league/nhl/stats/2021-2022?sort-goalie-stats=svp&page-goalie={page}#goalies"
    print(f"Loading {url=}")
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    df = (
        pd.read_html(str(soup.select_one(".goalie-stats")).replace('%', ''))[0]
        .dropna(how="all")
        .reset_index(drop=True)
    )
    dfs.append(df)

df_final = pd.concat(dfs).reset_index(drop=True)
print(df_final)
df_final.to_csv("data.csv", index=False)

WebScraping a changing webpage search results?

I'm trying to get data from a search result, but every time I try to give a specific link to Beautiful Soup I get errors, and I think it is because the webpage isn't the same every time you visit it? I'm not exactly sure what this is even called, so any help would be appreciated.
This is the link to the search results, but when you visit it without having already made a search, the results won't show up:
https://www.clarkcountycourts.us/Portal/Home/WorkspaceMode?p=0
Instead, if you copy and paste it, it will take you to this page to make a search, where you have to click Smart Search:
https://www.clarkcountycourts.us/Portal/
So for simplicity's sake, let's say we search for "Robinson" and I need to take the table data and export it to an Excel file. I can't give Beautiful Soup a link because it isn't valid, I believe? How would I go about this challenge?
Even pulling the tables up with a simple view of the table doesn't give any info about the data from our search for "Robinson", such as Case Number or File Date, to create a pandas data frame.
//EDIT//
So far, thanks to @Arundeep Chohan, this is what I've got. Huge shout out for the awesome help!
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
import time
from bs4 import BeautifulSoup
import requests
import pandas as pd
options = webdriver.ChromeOptions()
options.add_experimental_option('excludeSwitches', ['enable-logging'])
driver = webdriver.Chrome(options=options)
driver.implicitly_wait(20) # gives an implicit wait for 20 seconds
driver.get("https://www.clarkcountycourts.us/Portal/Home/Dashboard/29")
search_box = driver.find_element_by_id("caseCriteria_SearchCriteria")
search_box.send_keys("Robinson")
#Code to complete captchas
WebDriverWait(driver, 15).until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR,"iframe[name^='a-'][src^='https://www.google.com/recaptcha/api2/anchor?']")))
WebDriverWait(driver, 15).until(EC.element_to_be_clickable((By.XPATH, "//span[@id='recaptcha-anchor']"))).click()
driver.switch_to.default_content() #necessary to switch out of iframe element for submit button
time.sleep(5) #gives time to click submit to results
submit_box = driver.find_element_by_id("btnSSSubmit").click()
time.sleep(5)
soup = BeautifulSoup(driver.page_source,'html.parser')
df = pd.read_html(str(soup))[0]
print(df)
options = Options()
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()),options=options)
driver.maximize_window()
wait=WebDriverWait(driver,10)
driver.get('https://www.clarkcountycourts.us/Portal/')
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR,"a.portlet-buttons"))).click()
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR,"input#caseCriteria_SearchCriteria"))).send_keys("Robinson")
wait.until(EC.frame_to_be_available_and_switch_to_it((By.XPATH,"//iframe[@title='reCAPTCHA']")))
elem=wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR,"div.recaptcha-checkbox-checkmark")))
driver.execute_script("arguments[0].click()", elem)
driver.switch_to.default_content()
x = input("Waiting for recaptcha done")
wait.until(EC.element_to_be_clickable((By.XPATH,"(//input[#id='btnSSSubmit'])[1]"))).click()
soup = BeautifulSoup(driver.page_source, 'html.parser')
df = pd.read_html(str(soup))[0]
print(df)
This should be the minimum needed to get to your results page. There's an iframe to deal with and a spinner to handle; after that, just use pandas to grab the table.
(edit): They have since added a proper reCAPTCHA, so add a solver where I put my pause input.
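For the spinner, an explicit wait is usually enough. A minimal sketch, where div.loading-spinner is only a placeholder selector you would need to confirm in the page markup:
# Wait until the loading overlay disappears before reading the results table.
wait.until(EC.invisibility_of_element_located((By.CSS_SELECTOR, "div.loading-spinner")))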
Import:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
import pandas as pd
from bs4 import BeautifulSoup
Outputs:
Waiting for manual date to be entered. Enter YES when done.
Unnamed: 0_level_0 ... Date of Birth
Case Number ... File Date
Case Number ... File Date
0 NaN ... NaN
1 NaN ... Cases (1) Case NumberStyle / DefendantFile Da...
2 Case Number ... File Date
3 08A575873 ... 11/17/2008
4 NaN ... NaN
5 NaN ... Cases (1) Case NumberStyle / DefendantFile Da...
6 Case Number ... File Date
7 08A575874 ... 11/17/2008
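The wide frame above suggests read_html is also picking up wrapper tables. A hedged refinement is to pass the match argument so pandas only returns tables containing the text "Case Number", which the output shows is present in the results table:
# Only parse tables whose text matches "Case Number", then take the first hit.
df = pd.read_html(str(soup), match="Case Number")[0]
print(df)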

python webscraping fail in loop but works when i do it manually

I was trying to collect some data from the web programmatically for 6000 stocks, using Python 3.6 and the Selenium Firefox webdriver. [I intended to use BeautifulSoup to parse the HTML, but it seems that every time I update the page the link doesn't change, and soup doesn't cope with JavaScript.]
Anyway, when I create a for loop to do this, a specific row in my code, share_price = driver.find_element_by_css_selector(".highcharts-root > g:nth-child(25) > text:nth-child(2)"), goes wrong most of the time (it worked a couple of times, so I believe my code is good). However, it works fine if I do it manually (copy and paste into the Python IDLE and run it). I tried using time.sleep(4) to let the page load before I salvage anything from the background, but that doesn't seem to be the solution. Now I'm running out of ideas. Can anyone help me unravel this?
Below is my code:
from selenium import webdriver
import time
import pyautogui
filename = "historical_price_marketcap.csv"
f = open(filename,"w")
headers = "stock_ticker, share_price, market_cap\n"
f.write(headers)
driver = webdriver.Firefox()
def get_web():
    driver.get("https://stockrow.com")

import csv
with open("TICKER.csv") as file:
    read = csv.reader(file)
    TICKER=[]
    for row in read:
        ticker = row[0][1:-1]
        TICKER.append(ticker)

for Ticker in range(len(TICKER)):
    get_web()
    time.sleep(3)
    pyautogui.click(425, 337)
    pyautogui.typewrite(TICKER[Ticker],0.25)
    time.sleep(2)
    pyautogui.press("enter")
    time.sleep(2)
    pyautogui.click(268, 337)
    pyautogui.press("backspace")
    time.sleep(2)
    pyautogui.typewrite('Stock Price',0.25)
    time.sleep(2)
    pyautogui.press("enter")
    time.sleep(2)
    pyautogui.click(702, 427)
    for i in range(int(10)):
        pyautogui.press("backspace")
    time.sleep(2)
    pyautogui.typewrite("2013-12-01",0.25)
    pyautogui.press("enter")
    time.sleep(2)
    pyautogui.click(882, 425)
    for k in range(10):
        pyautogui.press("backspace")
    time.sleep(2)
    pyautogui.typewrite("2013-12-31",0.25)
    pyautogui.press("enter")
    time.sleep(2)
    pyautogui.click(1317, 318)
    for j in range(3):
        pyautogui.press("down")
    time.sleep(10)
    share_price = driver.find_element_by_css_selector(".highcharts-root > g:nth-child(25) > text:nth-child(2)")

    get_web()
    time.sleep(3)
    pyautogui.click(425, 337)
    pyautogui.typewrite(TICKER[Ticker],0.25)
    time.sleep(2)
    pyautogui.press("enter")
    time.sleep(2)
    pyautogui.click(268, 337)
    pyautogui.press("backspace")
    time.sleep(2)
    pyautogui.typewrite('Market Cap',0.25)
    time.sleep(2)
    pyautogui.press("enter")
    time.sleep(2)
    pyautogui.click(702, 427)
    for i in range(int(10)):
        pyautogui.press("backspace")
    time.sleep(2)
    pyautogui.typewrite("2013-12-01",0.25)
    pyautogui.press("enter")
    time.sleep(2)
    pyautogui.click(882, 425)
    for k in range(10):
        pyautogui.press("backspace")
    time.sleep(2)
    pyautogui.typewrite("2013-12-31",0.25)
    pyautogui.press("enter")
    time.sleep(2)
    pyautogui.click(1317, 318)
    for j in range(3):
        pyautogui.press("down")
    time.sleep(10)
    market_cap = driver.find_element_by_css_selector(".highcharts-root > g:nth-child(28) > text:nth-child(2)")

f.close()
It seems that the line bugging me is share_price = driver.find_element_by_css_selector(".highcharts-root > g:nth-child(25) > text:nth-child(2)") (and the equivalent market_cap line). Here is the error message from Python:
Traceback (most recent call last):
  File "C:\Users\HENGBIN\Desktop\get_historical_data.py", line 65, in <module>
    share_price = driver.find_element_by_css_selector(".highcharts-root > g:nth-child(25) > text:nth-child(2)")
  File "E:\Program Files\python3.6.1\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 457, in find_element_by_css_selector
    return self.find_element(by=By.CSS_SELECTOR, value=css_selector)
  File "E:\Program Files\python3.6.1\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 791, in find_element
    'value': value})['value']
  File "E:\Program Files\python3.6.1\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 256, in execute
    self.error_handler.check_response(response)
  File "E:\Program Files\python3.6.1\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 194, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchElementException: Message: Unable to locate element: .highcharts-root > g:nth-child(25) > text:nth-child(2)
It doesn't work most of the time in the loop, but works fine if I run it manually in the Python IDLE. I don't know what is going on.
There are several things in your script that I'd do differently.
First of all, try to get rid of pyautogui. Selenium has built-in functions for clicking (check out this SO question) and for sending all sorts of keys (check out this SO question). Also, when you change the content in the browser with pyautogui, my experience is that Selenium will not always be aware of those changes, which could explain your issues finding the resulting elements when you search for them with Selenium.
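As a rough sketch of what that looks like (the input.ticker-search selector is a placeholder, not the site's real markup; find the actual element in the inspector):
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

# Type the ticker into the search field and submit it -- no screen coordinates involved.
search_box = driver.find_element(By.CSS_SELECTOR, "input.ticker-search")  # placeholder selector
search_box.clear()
search_box.send_keys(TICKER[Ticker])
search_box.send_keys(Keys.ENTER)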
Secondly: your get_web() function could cause problems. Generally speaking, content inside a function has to be returned (or declared global) to be accessible outside the function. The driver that opens your webpage is global (you instantiate it outside the function), but the URL loaded inside the function is local, meaning you could have problems accessing the content outside the function. I'd recommend that you get rid of the function (it really doesn't do anything besides opening the URL) and simply replace the function call in your code like so:
for Ticker in range(len(TICKER)):
    driver.get("https://stockrow.com")
    time.sleep(3)
    # insert keys, clicks and so on...
This should make it possible for you to use Selenium's driver.find_element... methods.
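And instead of time.sleep(10) before grabbing the chart text, an explicit wait keyed to the exact selector that raises NoSuchElementException tends to be more reliable. A minimal sketch:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 30 seconds for the chart label to be rendered before reading it.
share_price = WebDriverWait(driver, 30).until(
    EC.visibility_of_element_located((By.CSS_SELECTOR, ".highcharts-root > g:nth-child(25) > text:nth-child(2)"))
)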
Thirdly: I assume that you'd like to extract some data from the site as well. If so, do the parsing with something other than Selenium, which is a slow parser; you could try BeautifulSoup instead.
Once the site is loaded, feed the HTML to BeautifulSoup and extract whatever you want (there's an SO question here that'll show you how to go about that):
from bs4 import BeautifulSoup
.....
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
element_you_want_to_retrieve = soup.find('tag_name', attrs={'key': 'value'})
But with this site, what you really should do is tap into the API calls the site makes on its own. Use Chrome's inspector tool: you'll see that it queries three APIs that you can call directly, avoiding the whole Selenium setup.
The URL for Apple looks like this:
url = 'https://stockrow.com/api/fundamentals.json?indicators[]=0&tickers[]=AAPL'
So with the requests library you could retrieve the content as json like so:
import requests
from pprint import pprint
url = 'https://stockrow.com/api/fundamentals.json?indicators[]=0&tickers[]=AAPL'
response = requests.get(url).json()
pprint(response)
This is a much faster solution than selenium.
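The same request can also be written with requests' params argument, which takes care of URL-encoding the bracketed keys (a sketch against the same endpoint as above):
import requests
from pprint import pprint

params = {"indicators[]": "0", "tickers[]": "AAPL"}
response = requests.get("https://stockrow.com/api/fundamentals.json", params=params).json()
pprint(response)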

Run multiple spiders from script in scrapy

I am doing a Scrapy project and I want to run multiple spiders at the same time.
This is my code for running the spiders from a script, but I am getting an error. How do I do this?
from spiders.DmozSpider import DmozSpider
from spiders.CraigslistSpider import CraigslistSpider
from scrapy import signals, log
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
TO_CRAWL = [DmozSpider, CraigslistSpider]
RUNNING_CRAWLERS = []
def spider_closing(spider):
    """Activates on spider closed signal"""
    log.msg("Spider closed: %s" % spider, level=log.INFO)
    RUNNING_CRAWLERS.remove(spider)
    if not RUNNING_CRAWLERS:
        reactor.stop()

log.start(loglevel=log.DEBUG)
for spider in TO_CRAWL:
    settings = Settings()
    # crawl responsibly
    settings.set("USER_AGENT", "Kiran Koduru (+http://kirankoduru.github.io)")
    crawler = Crawler(settings)
    crawler_obj = spider()
    RUNNING_CRAWLERS.append(crawler_obj)
    # stop reactor when spider closes
    crawler.signals.connect(spider_closing, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(crawler_obj)
    crawler.start()

# blocks process so always keep as the last statement
reactor.run()
Sorry for not answering the question itself, but just to bring scrapyd and scrapinghub to your attention (at least for a quick test): reactor.run() (when you get it working) will run any number of Scrapy instances on a single CPU. Do you want this side effect? Even if you look at scrapyd's code, they don't run multiple instances in a single thread; they fork/spawn subprocesses.
You need something like the code below. You can easily find it in the Scrapy docs :)
The first utility you can use to run your spiders is
scrapy.crawler.CrawlerProcess. This class will start a Twisted reactor
for you, configuring the logging and setting shutdown handlers. This
class is the one used by all Scrapy commands.
# -*- coding: utf-8 -*-
import sys
import logging
import traceback
from scrapy.crawler import CrawlerProcess
from scrapy.conf import settings
from scrapy.utils.project import get_project_settings
from spiders.DmozSpider import DmozSpider
from spiders.CraigslistSpider import CraigslistSpider
SPIDER_LIST = [
    DmozSpider, CraigslistSpider
]

if __name__ == "__main__":
    try:
        ## set up the crawler and start to crawl one spider at a time
        process = CrawlerProcess(get_project_settings())
        for spider in SPIDER_LIST:
            process.crawl(spider)
        process.start()
    except Exception, e:
        exc_type, exc_obj, exc_tb = sys.exc_info()
        logging.info('Error on line {}'.format(sys.exc_info()[-1].tb_lineno))
        logging.info("Exception: %s" % str(traceback.format_exc()))
References:
http://doc.scrapy.org/en/latest/topics/practices.html
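The practices page linked above also documents a CrawlerRunner variant, which is useful when you want control over the Twisted reactor yourself. A minimal sketch along those lines, reusing the same two spiders:
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

from spiders.DmozSpider import DmozSpider
from spiders.CraigslistSpider import CraigslistSpider

configure_logging()
runner = CrawlerRunner(get_project_settings())
for spider in (DmozSpider, CraigslistSpider):
    runner.crawl(spider)

# Stop the reactor once every crawl deferred has fired.
runner.join().addBoth(lambda _: reactor.stop())
reactor.run()  # blocks until all spiders are finished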