Web Scraping with Beautiful Soup in Python - JavaScript Table

I'm trying to scrape a table from a website, but I can't figure out how to do it with BeautifulSoup in Python. I'm not sure if it's because of the table format, but I basically want to turn this table into a CSV.
from bs4 import BeautifulSoup
import requests
page = requests.get("https://spotwx.com/products/grib_index.php?model=hrrr_wrfprsf&lat=41.03399&lon=-73.76291&tz=America/New_York&display=table")
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())
Any advice on how to isolate this data table? I've checked so many BeautifulSoup tutorials, but the HTML looks different from most references. Many thanks in advance for your help.

Try this. The table on that site is generated dynamically, so you can't get results using requests only.
from selenium import webdriver
from bs4 import BeautifulSoup
import csv
link = "https://spotwx.com/products/grib_index.php?model=hrrr_wrfprsf&lat=41.03399&lon=-73.76291&tz=America/New_York&display=table"
with open("spotwx.csv", "w", newline='') as infile:
    writer = csv.writer(infile)
    writer.writerow(['DateTime','Tmp','Dpt','Rh','Wh','Wd','Wg','Apcp','Slp'])
    with webdriver.Chrome() as driver:
        driver.get(link)
        # page_source holds the HTML after the browser has run the page's JavaScript
        soup = BeautifulSoup(driver.page_source, 'lxml')
        for item in soup.select("table#example tbody tr"):
            data = [elem.text for elem in item.select('td')]
            print(data)
            writer.writerow(data)
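
The parsing and CSV-writing half of the answer above can be tried without a browser. The following is only a minimal sketch: `RENDERED_HTML` is a made-up stand-in for `driver.page_source`, its values are invented, and `table_to_csv` is a hypothetical helper name. It uses `html.parser` so that lxml is not required.

```python
import csv
import io
from bs4 import BeautifulSoup

# Hypothetical rendered HTML standing in for driver.page_source (values are invented).
RENDERED_HTML = """
<table id="example">
  <thead><tr><th>DateTime</th><th>Tmp</th><th>Rh</th></tr></thead>
  <tbody>
    <tr><td>2020/04/01 10:00</td><td> 7.1 </td><td>62</td></tr>
    <tr><td>2020/04/01 11:00</td><td> 8.4 </td><td>58</td></tr>
  </tbody>
</table>
"""

def table_to_csv(html, out, selector="table#example"):
    """Write the header and body rows of the selected table to `out` as CSV.

    Returns the number of body rows written.
    """
    soup = BeautifulSoup(html, "html.parser")
    table = soup.select_one(selector)
    if table is None:
        raise ValueError(f"no table matches {selector!r}")
    writer = csv.writer(out)
    # Header cells come from <thead>; strip=True trims surrounding whitespace.
    writer.writerow([th.get_text(strip=True) for th in table.select("thead th")])
    count = 0
    for row in table.select("tbody tr"):
        writer.writerow([td.get_text(strip=True) for td in row.select("td")])
        count += 1
    return count

buffer = io.StringIO()
row_count = table_to_csv(RENDERED_HTML, buffer)
print(buffer.getvalue())
```

With a real page you would pass `driver.page_source` as `html` and an open file (opened with `newline=''`) as `out`.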

Related

Scrape wikipedia table using BeautifulSoup

I would like to scrape the table titled "List of chemical elements" from the wikipedia link below and display it using pandas
https://en.wikipedia.org/wiki/List_of_chemical_elements
I am new to BeautifulSoup and this is currently what I have.
from bs4 import BeautifulSoup
import requests as r
import pandas as pd
response = r.get('https://en.wikipedia.org/wiki/List_of_chemical_elements')
wiki_text = response.text
soup = BeautifulSoup(wiki_text, 'html.parser')
table_soup = soup.find_all('table')
You can select the table with beautifulsoup in different ways:
By its "title":
soup.select_one('table:-soup-contains("List of chemical elements")')
By order in tree (it is the first one):
soup.select_one('table')
soup.select('table')[0]
By its class (there is no id in your case):
soup.select_one('table.wikitable')
Or simply with pandas
pd.read_html('https://en.wikipedia.org/wiki/List_of_chemical_elements')[0]
*To get the expected result, try it yourself and if you have difficulties, ask a new question.
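
As a rough illustration of what happens after the table is selected, here is a sketch that turns a selected table into a list of dicts. `SAMPLE_HTML` is a hypothetical, heavily simplified stand-in for the Wikipedia page and `table_to_records` is a made-up helper name. The real table has a more complex header, so this would need adjusting.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page: two tables, only one of which is the target.
SAMPLE_HTML = """
<table class="infobox"><tr><td>not this one</td></tr></table>
<table class="wikitable">
  <caption>List of chemical elements</caption>
  <tr><th>Z</th><th>Symbol</th><th>Element</th></tr>
  <tr><td>1</td><td>H</td><td>Hydrogen</td></tr>
  <tr><td>2</td><td>He</td><td>Helium</td></tr>
</table>
"""

def table_to_records(table):
    """Turn a <table> Tag into a list of dicts keyed by the first row's header cells."""
    rows = table.find_all("tr")
    headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]
    records = []
    for row in rows[1:]:
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) == len(headers):  # skip rows that do not line up with the header
            records.append(dict(zip(headers, cells)))
    return records

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.select_one("table.wikitable")  # select by class rather than by position
records = table_to_records(table)
print(records)
```

A list of dicts like this can be passed directly to `pd.DataFrame(records)` if pandas is available.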

I am not sure between which two elements I should be looking to scrape and formatting error (jupyter + selenium)

I finally managed to display the page I need as text/HTML and confirmed that the data I need is included. For now I am printing the entire page, because I am unsure which element I should target. Of the three highlighted elements 1, 2, and 3, I am having trouble identifying which one to reference. I would go with the 'table' element, but it doesn't highlight the leftmost column with the ticker names, which is literally half the point of getting this data (though the name is referenced as shown in the part highlighted in yellow). Also, the class descriptions seem really long, and sometimes there appear to be two within the same element, so how would I address that? A less immediate problem: if you run the code, print the output, and scroll down a bit, the table data comes out in straight columns. Would that be fixed once I reference the proper element, or do I have to write something additional? Would the fact that I have multiple pages to scan also change anything in the code? Thank you in advance!
Code:
!pip install selenium
from bs4 import BeautifulSoup
from urllib.request import urlopen, Request
from selenium import webdriver
import pandas as pd
driver = webdriver.Chrome("D:/chromedriver/chromedriver.exe")
driver.get('https://www.barchart.com/options/unusual-activity/stocks')
soup = BeautifulSoup(driver.page_source, 'html.parser')
# get text
text = soup.get_text()
print(text)
edit
read_html without bs4
You won't need BeautifulSoup to reach your goal: pandas selects all HTML tables from the page source and puts them into a list of data frames.
In your case there is only one table in the page source, so you get your df by selecting the first element of the list with [0]:
df = pd.read_html(driver.page_source)[0]
Example
from selenium import webdriver
import pandas as pd
driver = webdriver.Chrome('D:/chromedriver/chromedriver.exe')
driver.get('https://www.barchart.com/options/unusual-activity/stocks')
df = pd.read_html(driver.page_source)[0]
driver.close()
Initial answer based on bs4
You're close to a solution. Let pandas take control: pass it the prettified HTML of the table with flavor='bs4', then modify the resulting data frame to your needs:
pd.read_html(soup.select_one('table').prettify(), flavor='bs4')
Example
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
driver = webdriver.Chrome('D:/chromedriver/chromedriver.exe')
driver.get('https://www.barchart.com/options/unusual-activity/stocks')
soup = BeautifulSoup(driver.page_source, 'html.parser')
df = pd.read_html(soup.select_one('table').prettify(), flavor='bs4')[0]
df
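
On the part of the question about long class attributes that seem to hold two classes at once: a class attribute can list several space-separated classes, and a selector only needs to name the ones you care about. The sketch below uses invented markup and class names (`SAMPLE_HTML` is hypothetical, not taken from the real page).

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the class names and values are made up for illustration.
SAMPLE_HTML = """
<table class="data-table dark-theme">
  <tr><td class="symbol text-left">AAPL</td><td class="volume">1200</td></tr>
  <tr><td class="symbol text-left">MSFT</td><td class="volume">800</td></tr>
</table>
"""
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# One class is enough to match, even if the element carries several.
by_one_class = soup.select("td.symbol")
# Chaining classes requires all of them to be present on the element.
by_both = soup.select("td.symbol.text-left")
# find_all with class_ also matches against each individual class.
by_find_all = soup.find_all("td", class_="symbol")

tickers = [td.get_text(strip=True) for td in by_one_class]
print(tickers)
```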

Web scraping using python throws empty array

import requests
from bs4 import BeautifulSoup as soup
my_url='http://cvr.ac.in/home4/index.php/eee-sp-870859316/eeefaculty'
page=requests.get(my_url)
data=page.text
page_soup=soup(data,'html.parser')
cont=page_soup.select("div",{"class": "item-page"})
print(cont)
I am trying to scrape the faculty details (name, designation, profile) into a CSV file.
When I use the code above, it prints an empty list [].
Any help is greatly appreciated.
The page is looking for any of a defined set of valid user agents. For example,
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('http://cvr.ac.in/home4/index.php/eee-sp-870859316/eeefaculty', headers = {'User-Agent': 'Chrome/80.0.3987.163'})
soup = bs(r.content, 'lxml')
print(soup.select('.item-page'))
Without that, you get a 406 response, and the classes you are looking for are not present in the HTML.
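
Separately from the user agent issue, it may be worth noting that `select` takes a CSS selector string, so the attribute dict passed in the question is not applied as a class filter. The dict form belongs to `find_all`. A small sketch on invented markup (`SAMPLE_HTML` is a hypothetical stand-in for the faculty page):

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
SAMPLE_HTML = """
<div class="header">menu</div>
<div class="item-page"><p>Faculty list</p></div>
"""
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# CSS selector form: tag name and class in one string.
via_select = soup.select("div.item-page")
# find_all form: tag name plus an attribute dict.
via_find_all = soup.find_all("div", {"class": "item-page"})

print(len(via_select), len(via_find_all))
```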

Using BS, I cannot "find" the ID of info, when I know it exists

I am a new user to Beautiful Soup and am trying to create a baby application that retrieves the view count from a YouTube url.
So, I looked at the BS docs and I saw that you could retrieve items by their id. So I attempted to retrieve the info id - but whenever I attempt to do this, it comes out as "None", so it must not be finding the id.
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
url = 'https://www.youtube.com/watch?v=bUPvE5yv72I'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
divisions = soup.findAll("div")
print(divisions[0])
info = soup.find(id="info")
print(info)
You can search for the meta tag with itemprop="interactionCount", but this value is often not exact. The best way is to use the official YouTube API. The meta tag approach looks like this:
import requests
from bs4 import BeautifulSoup
url = 'https://www.youtube.com/watch?v=bUPvE5yv72I'
soup = BeautifulSoup(requests.get(url).text, 'lxml')
print(soup.select_one('meta[itemprop="interactionCount"][content]')['content'])
Prints:
165011
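
Since `select_one` returns None when nothing matches, indexing the result directly fails if the tag is missing from the response. A slightly more defensive sketch, run here against invented markup rather than the live page (`SAMPLE_HTML` and `view_count` are hypothetical names):

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment; the real markup may differ or change over time.
SAMPLE_HTML = """
<html><head>
<meta itemprop="name" content="Some video">
<meta itemprop="interactionCount" content="165011">
</head><body></body></html>
"""

def view_count(html):
    """Return the interactionCount value as an int, or None if the tag is absent."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.select_one('meta[itemprop="interactionCount"][content]')
    if tag is None:
        return None
    return int(tag["content"])

print(view_count(SAMPLE_HTML))
print(view_count("<html><head></head></html>"))
```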

Beautiful Soup NoneType error

I am trying to write a small script in Python to help me through some of the more tedious parts of my job. I wrote this:
from bs4 import BeautifulSoup
import lxml
import os
import sys
questid = "1478"
soup = BeautifulSoup(open("Data/xmls/quests.xml"), "lxml")
for quest in soup.find_all('quest', {"id":questid}):
    print(quest)
    questSoup = BeautifulSoup(quest)
    for floor in questSoup.find_all('location_id'):
        print(floor)
What this is supposed to do is get a part of a huge XML file called "quests", based on the quest tag and its "id" attribute. Then it is supposed to make a new soup from that part and get all the location_id tags from within the quest. For now, before I figure out which quest ids I want to choose (and how I will handle input), I have just hardcoded one ("1478").
The script so far prints the quest, but fails to create a new soup from it.
Is it possible that the quest variable is not a string? Or am I doing something wrong?
for quest in soup.find_all('quest', {"id":questid}):
    print(quest)
    # questSoup = BeautifulSoup(quest)
    for floor in quest.find_all('location_id'):
        print(floor)
There is no need to build a new soup object from a Tag object. quest is a Tag, not a string, which is why BeautifulSoup(quest) fails. You can call find_all on both, because BeautifulSoup is a subclass of Tag, so they behave the same way and can be searched the same way.
As I see it, the soup object is a special Tag object named '[document]':
import requests, bs4
r =requests.get('http://www.google.com')
soup = bs4.BeautifulSoup(r.text, 'lxml')
soup.name
out:
'[document]'
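
The same point can be checked without a network request or the original file. In this sketch `SAMPLE_XML` is an invented fragment standing in for quests.xml, and `html.parser` is used only so that lxml is not required.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical fragment standing in for Data/xmls/quests.xml.
SAMPLE_XML = """
<quests>
  <quest id="1478"><location_id>12</location_id><location_id>15</location_id></quest>
  <quest id="2000"><location_id>99</location_id></quest>
</quests>
"""
soup = BeautifulSoup(SAMPLE_XML, "html.parser")

floors = []
for quest in soup.find_all("quest", {"id": "1478"}):
    # quest is already a Tag, so it can be searched directly.
    floors.extend(loc.get_text() for loc in quest.find_all("location_id"))

print(isinstance(soup, Tag), isinstance(quest, Tag), floors)
```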