Parsing HTML table to pandas with beautiful soup - pandas

I want to parse the "Team Batting" table from
http://www.baseball-reference.com/teams/NYM/2017.shtml
I can find the html table:
table = soup.find('div', attrs={'class': 'overthrow table_container'})
table_body = table.find('tbody')
And I can find the data in the table and store it into a list:
table_text = []
for tr in table_body.findAll('tr'):
    tds = tr.findAll('td')
    for td in tds:
        table_text.append(td.get_text())
How can I re-create this table in pandas? I was thinking of creating a dictionary but am not sure how to build one from this data. How can I scrape this HTML table and display it?

You are looking for pandas.read_html(), which you can point at your table using the match argument. Note that it returns a list of DataFrames, so take the first one:
import pandas as pd
url = "http://www.baseball-reference.com/teams/NYM/2017.shtml"
dfs = pd.read_html(url, match="Team Batting")
print(dfs[0])
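If you would rather build the frame yourself from the rows you already collected with BeautifulSoup, here is a minimal sketch, assuming the header cells sit in the table's thead as th elements and that the header and row cell counts line up (worth verifying for this page):
import pandas as pd
header = [th.get_text() for th in table.find('thead').find_all('th')]
rows = []
for tr in table_body.find_all('tr'):
    # each row may mix th (the rank) and td (the stats) cells
    cells = [cell.get_text() for cell in tr.find_all(['th', 'td'])]
    if cells:
        rows.append(cells)
df = pd.DataFrame(rows, columns=header)
print(df)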

Related

Split and explode a column of dictionaries into separate columns with Pandas

I have data fetched from an API. I am using Python 3.9 and am trying to split the measurements field into separate columns, but this column contains dictionaries.
I want the keys of the dictionaries to become new columns so I can visualize the data.
[screenshot: frame with the measurements column]
I need to split this column into separate columns, so that the DataFrame looks like this:
[screenshot: frame with separate measurement columns]
And below is my code:
import requests
import json
import pandas as pd
url = 'https://api.openaq.org/v2/latest?limit=1000&page=1&offset=0&sort=desc&radius=1000&order_by=lastUpdated&dumpRaw=false'
headers = {"Accept": "application/json"}
response = requests.get(url, headers=headers)
print(response.content)
data = json.loads(response.text)
data1 = pd.DataFrame(data['results'])
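For the splitting itself, a hedged sketch, assuming each measurements cell holds a list of dicts with keys like parameter and value (which is what the OpenAQ payload suggests): explode the lists into one dict per row, normalize the dicts into columns, and join back:
# explode: one measurement dict per row; drop rows whose list was empty
exploded = data1.explode('measurements').dropna(subset=['measurements']).reset_index(drop=True)
# normalize: one column per dict key (parameter, value, unit, ...)
details = pd.json_normalize(exploded['measurements'].tolist())
result = exploded.drop(columns='measurements').join(details)
print(result.head())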

Scraping data from website with selenium and pandas

I am scraping a website whose table appears to be split in two, because the rank and names live in a separate table, so I'm not sure how to get all of it and put it together as one CSV.
This is the website I want to scrape; it shows a partial table without a membership:
https://fantasydata.com/nba/dfs-projections/fanduel?date=02-03-2022&dfsoperator=2&dfsslateid=18504
[screenshot of the table]
I'm using:
WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.CSS_SELECTOR, "table")))
dfs = pd.read_html(driver.page_source, header=None)
driver.implicitly_wait(120)
dvp_projections = {}
for idx, table in enumerate(dfs):
    temp_df = table.iloc[1:]
    dvp_projections[idx] = temp_df
    temp_df.to_csv('/home/joe/NBA/Sportsdata_dvp.csv', index=False)
but I'm only getting this, and I'm also missing the header:
[screenshot of the output]
What you'll want to do is join/merge/concat the tables, but you'll want to concatenate the columns, not the rows (rows are what pd.concat() stacks by default).
So try:
df = pd.concat(dfs, axis=1)
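Putting that together with the rest of the loop, a sketch that stitches the two tables side by side and writes a single CSV, letting pandas keep the header row it parsed:
import pandas as pd
dfs = pd.read_html(driver.page_source)  # keep the parsed headers
df = pd.concat(dfs, axis=1)             # column-wise: rank/name next to the stats
df.to_csv('/home/joe/NBA/Sportsdata_dvp.csv', index=False)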

Scraping Table from Website with Pandas Returning Empty Data Frame

I am trying to extract the 'Holdings' table from
https://www.ishares.com/us/products/268752/ishares-global-reit-etf
I use pandas, but it returns an empty DataFrame with only the column names.
Could anyone help me with this please?
import pandas as pd
url = 'https://www.ishares.com/us/products/268752/ishares-global-reit-etf'
tab = pd.read_html(url)
df = pd.DataFrame(tab[6])
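An empty frame with only the column names usually means the table body is filled in client-side by JavaScript, so the raw HTML that read_html downloads contains just the shell. A sketch that renders the page first, assuming a Selenium setup is available, and then hands the populated source to read_html:
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
url = 'https://www.ishares.com/us/products/268752/ishares-global-reit-etf'
driver = webdriver.Chrome()
driver.get(url)
# wait until at least one populated table row is present before parsing
WebDriverWait(driver, 30).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'table tbody tr')))
tab = pd.read_html(driver.page_source)
driver.quit()
df = tab[6]  # same positional index the question used; verify it still points at 'Holdings'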

Python bs4 extracting the right table content

I have been struggling with this problem for some time now and need some help. I request the website "https://www.finanzen.net/boersenkurse" and want to extract the table in the "Meistgesuchte Aktien" section. As there are several tables in the document, I am also getting the other tables, which I am not interested in.
I want to create a DataFrame out of the data, so each row should look the same as on the website, i.e. Name = SAP, Kurs = 96,33, etc.
from bs4 import BeautifulSoup
import requests
URL = "https://www.finanzen.net/boersenkurse"
html = requests.get(URL, {}).text
soup = BeautifulSoup(html, 'html.parser')
tables = soup.find_all("tr")
tables
I do not get how to choose only the relevant tr elements. If someone has any idea, please let me know. Thanks in advance!
from bs4 import BeautifulSoup
import requests
import pandas as pd
URL = "https://www.finanzen.net/boersenkurse"
html = requests.get(URL).text
soup = BeautifulSoup(html, 'html.parser')
table = soup.find_all("table", class_="table-overflow-hidden")[1]
# Extract the column headings of the table.
rows = table.find_all('tr')
columns = []
headings = rows[0].find_all('th')
for col in headings:
    columns.append(col.text.strip())
print(columns)
# Extract all data of the table, row by row.
all_data = []
for row in rows[1:]:
    data = row.find_all('td')
    lst = []
    for d in data:
        lst.append(d.text.strip())
    all_data.append(lst)
# Create the DataFrame out of the extracted data.
ds = pd.DataFrame(all_data, columns=columns)
ds
It is easier to use pandas and index in for the table:
import pandas as pd
pd.read_html('https://www.finanzen.net/boersenkurse')[2]
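Positional indexing is brittle if the page layout shifts; as in the first answer above, the match argument can pin the table down by text it contains (assuming a name like 'SAP' actually appears in the target table, as the question suggests):
import pandas as pd
dfs = pd.read_html('https://www.finanzen.net/boersenkurse', match='SAP')
print(dfs[0])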
Or with BeautifulSoup, still using pandas at the end:
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
r = requests.get('https://www.finanzen.net/boersenkurse')
soup = bs(r.content, 'lxml')
t = soup.select_one('div:nth-child(4) table')
pd.DataFrame([[cell.text.strip() for cell in row.select('th,td')]
              for row in t.select('tr')])

Pandas and beautiful soup: print href instead of the value for a column

This is very similar to other posts on SO, e.g. here; I just can't see what I'm doing wrong.
I want to scrape the box labelled 'activity' on this page, and I want the output to look like this:
So you can see the two main features of interest compared to the original webpage: (1) combining multiple tables into one table, creating a new column whenever a column has not been seen before, and (2) extracting the actual href for that column as opposed to just the link text, e.g. 'Jacobsen et al.', because I want to eventually extract the PMID value (an integer) from the href.
These are my two goals. I wrote this code:
import requests
import pandas as pd
from bs4 import BeautifulSoup
for i in range(23, 24):
    # try:
    res = requests.get("http://www.conoserver.org/index.php?page=card&table=protein&id=" + str(i))
    soup = BeautifulSoup(res.content, 'lxml')
    table = soup.find_all('table', {'class': 'activitytable'})
    for each_table in table:
        # this can print the references
        print(each_table.a)
        # this can print the data frames
        df = pd.read_html(str(each_table))
        print(df)
        # how to combine the two?
Can someone tell me the correct way to print the href individually for each row of each table, essentially so that it adds an extra column with the actual href to each table? It should print out three tables, each with an extra href column.
Then I can focus on how to combine the tables; I've only mentioned the ultimate goal here in case someone can think of a more Pythonic way of killing two birds with one stone, but I think they're separate issues.
You can initialise a final dataframe. Then, as you iterate, store the href as a string and add it as a column to the sub-table's dataframe. Then keep appending those dataframes to the final dataframe:
import requests
import pandas as pd
from bs4 import BeautifulSoup
# Initialise an empty "final" dataframe
final_df = pd.DataFrame()
for i in range(20, 24):
    res = requests.get("http://www.conoserver.org/index.php?page=card&table=protein&id=" + str(i))
    soup = BeautifulSoup(res.content, 'lxml')
    table = soup.find_all('table', {'class': 'activitytable'})
    for each_table in table:
        # Store the href
        href = each_table.a['href']
        # Get the table
        df = pd.read_html(str(each_table))[0]
        # Put that href in the column 'ref'
        df['ref'] = href
        # Append that dataframe to the final dataframe and repeat
        # (DataFrame.append was removed in pandas 2.0; pd.concat is the replacement)
        final_df = pd.concat([final_df, df], sort=True).reset_index(drop=True)
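As a follow-up for the ultimate goal: if the stored hrefs end in the PMID digits (an assumption about conoserver's link format worth checking against a real href), one line can pull them into their own column:
# hypothetical: assumes the PMID is the trailing run of digits in each href
final_df['pmid'] = final_df['ref'].str.extract(r'(\d+)\s*$', expand=False)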