Python bs4 extracting the right table content - beautifulsoup

I have been struggling with this problem for some time now and need some help. I request the website "https://www.finanzen.net/boersenkurse" and want to extract the table in the section "Meistgesuchte Aktien". As there are several tables in the document, I am also getting the other tables, which I am not interested in.
I want to create a DataFrame out of the data, so each row should look the same as on the website, i.e. Name = SAP, Kurs = 96,33, etc.
from bs4 import BeautifulSoup
import requests

URL = "https://www.finanzen.net/boersenkurse"
html = requests.get(URL).text
soup = BeautifulSoup(html, 'html.parser')
tables = soup.find_all("tr")
tables
I do not get how to choose only the relevant tr elements. If someone has any idea, please let me know. Thanks in advance!

from bs4 import BeautifulSoup
import requests
import pandas as pd

URL = "https://www.finanzen.net/boersenkurse"
html = requests.get(URL).text
soup = BeautifulSoup(html, 'html.parser')

# The page holds several tables with this class; the second one is the
# "Meistgesuchte Aktien" table.
table = soup.find_all("table", class_="table-overflow-hidden")[1]

# Extracting the column headings of the table.
rows = table.find_all('tr')
columns = []
headings = rows[0].find_all('th')
for col in headings:
    columns.append(col.text.strip())
print(columns)

# Extracting all data of the table, row by row.
all_data = []
for row in rows[1:]:
    data = row.find_all('td')
    lst = []
    for d in data:
        lst.append(d.text.strip())
    all_data.append(lst)

# Creating the dataframe out of the extracted data.
ds = pd.DataFrame(all_data, columns=columns)
ds
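If you also want numeric values, note that the site uses German number formatting (decimal comma, dot as thousands separator), so the scraped cells are strings like "96,33". A minimal cleanup sketch, assuming the price column is headed "Kurs" as on the website:

# Assumed column name "Kurs"; converts e.g. "1.096,33" -> 1096.33
ds["Kurs"] = (
    ds["Kurs"]
    .str.replace(".", "", regex=False)   # drop thousands separators
    .str.replace(",", ".", regex=False)  # decimal comma -> decimal point
    .astype(float)
)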

It is easier to use pandas and index in for the table:
import pandas as pd
pd.read_html('https://www.finanzen.net/boersenkurse')[2]
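If the table's index position ever shifts, pd.read_html can also select by content via its match parameter. A hedged variant: match is a regex tested against the text inside each table (not the headings around it), so you have to pick a string you expect in the table itself, e.g. "SAP" from the question:

import pandas as pd

# match= keeps only tables whose text matches the given regex.
dfs = pd.read_html('https://www.finanzen.net/boersenkurse', match='SAP')
df = dfs[0]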
Using BeautifulSoup, still finishing with pandas:
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd

r = requests.get('https://www.finanzen.net/boersenkurse')
soup = bs(r.content, 'lxml')
t = soup.select_one('div:nth-child(4) table')
pd.DataFrame([[cell.text.strip() for cell in row.select('th,td')]
              for row in t.select('tr')])
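Note that div:nth-child(4) is tied to the current page layout and can silently start selecting a different table. A more robust sketch, assuming the section heading "Meistgesuchte Aktien" appears as visible text and the target table follows it in the markup:

from bs4 import BeautifulSoup as bs
import requests

r = requests.get('https://www.finanzen.net/boersenkurse')
soup = bs(r.content, 'lxml')

# Locate the heading text, then take the first <table> that follows it.
heading = soup.find(string=lambda s: s and 'Meistgesuchte Aktien' in s)
t = heading.find_next('table') if heading else None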

Related

How to read a csv file with commas in field with pandas python?

Hi, I have a CSV file with items like this:
product_id,url
100,https://url/p/Cimory-Yogurt-Squeeze-Original-120-g-745133
"1000,""https://url/p/OREO-Biskuit-Dark-&-White-Chocolate-123,5-g-559227"""
1002,https:/url/p/GARNIER-Micellar-Cleansing-Water-Sensitive-Skin-Pink-125-ml-371378
I tried using
import pandas as pd
productUrl = pd.read_csv('productUrl.csv',sep=","quotechar='"')
It returns:

product_id  url
100         https://url/p/Cimory-Yogurt-Squeeze-Original-120-g-745133
1000,"https://url/p/OREO-Biskuit-Dark-&-White-Chocolate-123,5-g-559227"
1002        https:/url/p/GARNIER-Micellar-Cleansing-Water-Sensitive-Skin-Pink-125-ml-371378
How do I read the CSV, given that the URL can contain commas too?
You do not need quotechar='"'; simply read it as is:
pd.read_csv('productUrl.csv')
Be aware that your pandas.read_csv() example won't work as written, because it is missing a , between the sep and quotechar parameters.
Example
import pandas as pd
from io import StringIO
csvString = """product_id,url
100,https://url/p/Cimory-Yogurt-Squeeze-Original-120-g-745133
1000,"https://url/p/OREO-Biskuit-Dark-&-White-Chocolate-123,5-g-559227"
1002,https:/url/p/GARNIER-Micellar-Cleansing-Water-Sensitive-Skin-Pink-125-ml-371378"""
pd.read_csv(StringIO(csvString))
Output

   product_id                                                                              url
0         100                       https://url/p/Cimory-Yogurt-Squeeze-Original-120-g-745133
1        1000                https://url/p/OREO-Biskuit-Dark-&-White-Chocolate-123,5-g-559227
2        1002  https:/url/p/GARNIER-Micellar-Cleansing-Water-Sensitive-Skin-Pink-125-ml-371378
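The reason the plain call works: fields containing the delimiter are quoted in standard CSV, and pandas honors that quoting by default, both when reading and when writing. A small round-trip sketch to illustrate:

import pandas as pd
from io import StringIO

df = pd.DataFrame({"product_id": [1000],
                   "url": ["https://url/p/example,with-comma"]})

buf = StringIO()
df.to_csv(buf, index=False)   # to_csv quotes the comma-containing field
print(buf.getvalue())         # ...,"https://url/p/example,with-comma"

buf.seek(0)
print(pd.read_csv(buf))       # read_csv restores the field unchanged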

Scraping Table from Website with Pandas Returning Empty Data Frame

I am trying to extract the 'Holdings' table from
https://www.ishares.com/us/products/268752/ishares-global-reit-etf
I use pandas, but it returns an empty dataframe with only the column names.
Could anyone help me with this, please?
import pandas as pd
url = 'https://www.ishares.com/us/products/268752/ishares-global-reit-etf'
tab = pd.read_html(url)
df = pd.DataFrame(tab[6])
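A likely cause (assuming the request itself succeeds): the holdings rows on this page are filled in by JavaScript after the page loads, while pd.read_html only sees the static HTML, so it finds the table skeleton with headers but no rows. A hedged way to check what the static HTML actually contains:

import pandas as pd

url = 'https://www.ishares.com/us/products/268752/ishares-global-reit-etf'
tabs = pd.read_html(url)

# Print the shape of every table shipped in the static HTML.
for i, t in enumerate(tabs):
    print(i, t.shape)

# If the holdings table shows 0 rows here, the data is loaded client-side
# and has to come from the site's data endpoint or a browser-driven tool
# rather than the raw HTML.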

Parse list and create DataFrame

I have been given a list called data which has the following content
data=[b'Name,Age,Occupation,Salary\r\nRam,37,Plumber,1769\r\nMohan,49,Elecrician,3974\r\nRahim,39,Teacher,4559\r\n']
I want a pandas dataframe that looks like the linked image:
Expected Dataframe
How can I achieve this?
You can try this:
import pandas as pd

data = [b'Name,Age,Occupation,Salary\r\nRam,37,Plumber,1769\r\nMohan,49,Elecrician,3974\r\nRahim,39,Teacher,4559\r\n']
# Decode the bytes, drop the \r characters, split into rows, then into fields.
processed_data = [x.split(',') for x in data[0].decode().replace('\r', '').strip().split('\n')]
df = pd.DataFrame(columns=processed_data[0], data=processed_data[1:])
Hope it helps.
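One caveat with this approach: every column comes out as strings, since the fields were split from text. A small follow-up sketch if numeric Age and Salary are wanted:

# Cast the numeric columns explicitly; they are still strings after the split.
df["Age"] = df["Age"].astype(int)
df["Salary"] = df["Salary"].astype(int)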
I would recommend converting this list to a string first, as it has only one element. Note that the element is bytes, so it needs decoding; a plain ''.join(data) would raise a TypeError:
str1 = data[0].decode()
Then use the solution provided here
import sys
if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO
import pandas as pd

TESTDATA = StringIO(str1)
df = pd.read_csv(TESTDATA, sep=",")
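As a side note, pd.read_csv also accepts a binary buffer directly, so the decode step can be skipped entirely; a minimal sketch:

import pandas as pd
from io import BytesIO

data = [b'Name,Age,Occupation,Salary\r\nRam,37,Plumber,1769\r\nMohan,49,Elecrician,3974\r\nRahim,39,Teacher,4559\r\n']

# read_csv decodes the bytes and handles the \r\n line endings itself.
df = pd.read_csv(BytesIO(data[0]))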

Pandas and beautiful soup: print href instead of the value for a column

This is so similar to other posts on SO, e.g. here, that I just can't see what I'm doing wrong.
I want to scrape the box labelled 'activity' on this page, and I want the output to look like this:
So you can see the two main features of interest compared to the original webpage: (1) combining multiple tables into one, creating a new column whenever a column has not been seen before, and (2) extracting the actual href for that column rather than just the link text (e.g. 'Jacobsen et al'), because I eventually want to extract the PMID value (an integer) from the href.
These are my two goals; I wrote this code:
import requests
import pandas as pd
from bs4 import BeautifulSoup
for i in range(23, 24):
    res = requests.get("http://www.conoserver.org/index.php?page=card&table=protein&id=" + str(i))
    soup = BeautifulSoup(res.content, 'lxml')
    table = soup.find_all('table', {'class': 'activitytable'})
    for each_table in table:
        # this can print references
        print(each_table.a)
        # this can print the data frames
        df = pd.read_html(str(each_table))
        print(df)
        # how to combine the two?
#how to combine the two?
Can someone tell me the correct way to print the href individually for each row of each table, essentially adding an extra column with the actual href? It should print out three tables, each with an extra href column.
Then I can focus on how to combine the tables; I only mention the ultimate goal here in case someone can think of a more pythonic way of killing two birds with one stone, but I think they are separate issues.
You can initialise a final dataframe. Then, as you iterate, store the href as a string and add it as a column to the sub-table dataframe. Then keep appending those dataframes to the final dataframe:
import requests
import pandas as pd
from bs4 import BeautifulSoup
# Initialize an empty "final" dataframe
final_df = pd.DataFrame()

for i in range(20, 24):
    res = requests.get("http://www.conoserver.org/index.php?page=card&table=protein&id=" + str(i))
    soup = BeautifulSoup(res.content, 'lxml')
    table = soup.find_all('table', {'class': 'activitytable'})
    for each_table in table:
        # Store the href
        href = each_table.a['href']
        # Get the table
        df = pd.read_html(str(each_table))[0]
        # Put that href in the column 'ref'
        df['ref'] = href
        # Append that dataframe to the final dataframe, and repeat
        final_df = final_df.append(df, sort=True).reset_index(drop=True)
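Two follow-up notes. First, DataFrame.append was removed in pandas 2.0; on current pandas, collect the frames in a list and pd.concat them once at the end. Second, since the stated end goal is the PMID integer, it can be pulled out of the stored href afterwards; a sketch, assuming the PMID is the only run of digits in the href (the actual conoserver.org link format may differ):

import re

# Hypothetical helper: grab the first run of digits from an href string.
def extract_pmid(href):
    m = re.search(r'(\d+)', href)
    return int(m.group(1)) if m else None

# Continuing from the loop above:
final_df['pmid'] = final_df['ref'].apply(extract_pmid)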

Parsing HTML table to pandas with beautiful soup

I want to parse the "Team Batting" table from
http://www.baseball-reference.com/teams/NYM/2017.shtml
I can find the html table:
table = soup.find('div', attrs={'class': 'overthrow table_container'})
table_body = table.find('tbody')
And I can find the data in the table and store it in a list:
table_text = []
for tr in table_body.findAll('tr'):
    tds = tr.findAll('td')
    for td in tds:
        table_text.append(td.get_text())
How can I re-create this table in pandas? I was thinking of creating a dictionary but am not sure how to do so from this data. How can I scrape this html table and display it?
You are looking for pandas.read_html(), which you can point at your table using the match argument. Note that it returns a list of DataFrames; take the first one:
import pandas as pd
url = "http://www.baseball-reference.com/teams/NYM/2017.shtml"
dfs = pd.read_html(url, match="Team Batting")
print(dfs[0])
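One caveat specific to this site: many baseball-reference tables sit inside HTML comments, where read_html does not look. If match= ever comes back empty, a hedged fallback is to unwrap the comments with BeautifulSoup first:

import requests
import pandas as pd
from bs4 import BeautifulSoup, Comment

url = "http://www.baseball-reference.com/teams/NYM/2017.shtml"
soup = BeautifulSoup(requests.get(url).content, "lxml")

# Search both the live markup and any comment-wrapped markup for the table.
chunks = [str(soup)]
chunks += [c for c in soup.find_all(string=lambda t: isinstance(t, Comment))
           if "<table" in c]

dfs = []
for chunk in chunks:
    try:
        dfs += pd.read_html(chunk, match="Team Batting")
    except ValueError:  # raised when a chunk has no matching table
        pass
print(len(dfs))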