BeautifulSoup - Table

On the website below there is a table called "tbListaOpc"
https://opcoes.net.br/opcoes/bovespa/ABEV3
I tried but was unsuccessful in retrieving the information from this table.
Can anyone help me figure out how to read all the rows in this tbListaOpc table?
Thank you!

You can use selenium combined with pandas in order to scrape the table:
from selenium import webdriver
import pandas as pd
import time
url = 'https://opcoes.net.br/opcoes/bovespa/ABEV3'
driver = webdriver.Chrome()
driver.get(url)
time.sleep(2)
#driver.find_element_by_xpath('//*[@id="grade-tipo-items"]/label[1]').click()  # uncomment this line to click on the CALLs button
#driver.find_element_by_xpath('//*[@id="grade-tipo-items"]/label[2]').click()  # uncomment this line to click on the PUTs button
time.sleep(0.5)
dfs = pd.read_html(driver.page_source)  # parse every <table> in the rendered page
driver.close()
df = dfs[-1]  # the options grid (tbListaOpc) is the last table on the page
columns = [col[0] for col in df.columns]  # flatten the MultiIndex column headers
df.columns = columns
print(df)
Output:
Ticker Tipo FM Mod. ... Gamma Theta ($) Theta (%) Vega
0 ABEVW958 PUT NaN E ... 424.0 -55.0 -2750.0 1669.0
1 ABEVK100 CALL NaN A ... NaN NaN NaN NaN
2 ABEVW100 PUT NaN E ... 691.0 -71.0 -2367.0 2417.0
3 ABEVW105 PUT NaN E ... 1118.0 -82.0 -2050.0 3302.0
4 ABEVK110 CALL NaN A ... 417.0 -49.0 -18.0 2062.0
5 ABEVW110 PUT NaN E ... 1978.0 -120.0 -1500.0 5315.0
6 ABEVK115 CALL NaN A ... 2742.0 -220.0 -259.0 8315.0
7 ABEVW115 PUT ✔ E ... 2897.0 -192.0 -914.0 8149.0
8 ABEVK120 CALL ✔ A ... 3385.0 -235.0 -452.0 9582.0
9 ABEVW120 PUT ✔ E ... 5170.0 -145.0 -580.0 9531.0
10 ABEVK125 CALL ✔ A ... 3446.0 -214.0 -764.0 9249.0
11 ABEVW125 PUT ✔ E ... 3242.0 -221.0 -316.0 9319.0
12 ABEVK130 CALL ✔ A ... 2740.0 -176.0 -1173.0 7495.0
13 ABEVW130 PUT ✔ E ... 2705.0 -174.0 -166.0 7618.0
14 ABEVK135 CALL ✔ A ... 1909.0 -131.0 -1638.0 5405.0
15 ABEVW135 PUT NaN E ... 1725.0 -71.0 -50.0 4050.0
16 ABEVK140 CALL ✔ A ... 1216.0 -87.0 -2175.0 3520.0
17 ABEVW140 PUT NaN E ... 771.0 -22.0 -12.0 1736.0
[18 rows x 17 columns]
You can also export these values to a CSV file by using the to_csv function:
df.to_csv('D:\\Values.csv',index=False)
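Note that Selenium 4 removed the find_element_by_* helpers used in the commented-out lines above, so on a current install those clicks need the find_element(By.XPATH, ...) API. A minimal sketch, assuming the page still exposes the same XPaths for the CALL/PUT filter buttons:
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get('https://opcoes.net.br/opcoes/bovespa/ABEV3')
# Selenium 4 style: find_element(By.XPATH, ...) replaces find_element_by_xpath(...)
driver.find_element(By.XPATH, '//*[@id="grade-tipo-items"]/label[1]').click()  # CALLs filter
driver.find_element(By.XPATH, '//*[@id="grade-tipo-items"]/label[2]').click()  # PUTs filter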

Related

Web Scraping: Pandas DataFrame.read_html(url_address) returns an Empty DataFrame?

I want to scrape the information from the table on this page, which is split across many other pages.
I wrote the following code:
url = 'https://dbaasp.org/search?id.value=&name.value=&sequence.value=&sequence.option=full&sequenceLength.value=&complexity.value=&synthesisType.value=Nonribosomal&uniprot.value=&nTerminus.value=&cTerminus.value=&unusualAminoAcid.value=&intraChainBond.value=&interChainBond.value=&coordinationBond.value=&threeDStructure.value=&kingdom.value=&source.value=&hemolyticAndCytotoxicActivitie.value=on&synergy.value=&articleAuthor.value=&articleJournal.value=&articleYear.value=&articleVolume.value=&articlePages.value=&articleTitle.value='
pep_table = pd.read_html(url)
But the output was this:
pep_table
[Empty DataFrame
Columns: [ID, Name, N terminus, Sequence, C terminus, View]
Index: []]
I also tried to get it through selenium webdriver:
chromedriver = '/usr/local/bin/chromedriver'
driver = webdriver.Chrome(chromedriver)
driver.get(url)
table = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "table#DataTables_Table_0_info")))
tableRows = table.get_attribute("outerHTML")
df = pd.read_html(tableRows)[0]
But it throws a Selenium WebDriver timeout error:
File "/home/es/anaconda3/envs/pyg-env/lib/python3.7/site-packages/selenium/webdriver/support/wait.py", line 80, in until
raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:
Am I using the wrong selector?
This page shows the search results. Do I need to add more selectors?
How can I solve this issue?
Your table locator was wrong; I have modified it. The easiest approach is to avoid clicking the pagination buttons and instead navigate directly to each page's URL.
You can use this URL, where you only have to change the offset value:
url="https://dbaasp.org/search?id.value=&name.value=&sequence.value=&sequence.option=full&sequenceLength.value=&complexity.value=&synthesisType.value=Nonribosomal&uniprot.value=&nTerminus.value=&cTerminus.value=&unusualAminoAcid.value=&intraChainBond.value=&interChainBond.value=&coordinationBond.value=&threeDStructure.value=&kingdom.value=&source.value=&hemolyticAndCytotoxicActivitie.value=on&synergy.value=&articleAuthor.value=&articleJournal.value=&articleYear.value=&articleVolume.value=&articlePages.value=&articleTitle.value=&limit=30&offset={}"
You need to create an empty DataFrame and concat each page's table onto it.
Use time.sleep() to wait between pages; otherwise the loop moves faster than the page loads and you won't capture all of the pages.
Code:
url="https://dbaasp.org/search?id.value=&name.value=&sequence.value=&sequence.option=full&sequenceLength.value=&complexity.value=&synthesisType.value=Nonribosomal&uniprot.value=&nTerminus.value=&cTerminus.value=&unusualAminoAcid.value=&intraChainBond.value=&interChainBond.value=&coordinationBond.value=&threeDStructure.value=&kingdom.value=&source.value=&hemolyticAndCytotoxicActivitie.value=on&synergy.value=&articleAuthor.value=&articleJournal.value=&articleYear.value=&articleVolume.value=&articlePages.value=&articleTitle.value=&limit=30&offset={}"
counter=0
df=pd.DataFrame()
while counter <150:
driver.get(url.format(counter))
time.sleep(2)
table = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "table#DataTables_Table_0")))
tableRows = table.get_attribute("outerHTML")
df1 = pd.read_html(tableRows)[0]
df = pd.concat([df,df1], ignore_index=True)
counter=counter+30
print(df)
Output:
ID Name N terminus Sequence C terminus View
0 1688 Gramicidin S, GS NaN VXLfPVXLfP NaN View
1 3314 Gratisin, GR NaN VXLfPyVXLfPy NaN View
2 3316 Tyrocidine A, TA NaN fPFfNQYVXL NaN View
3 4876 Trichogin GA IV C8 XGLXGGLXGIX NaN View
4 5374 Baceridin NaN WaXVlL NaN View
.. ... ... ... ... ... ...
137 19210 Burkholdine-1215 NaN xxGNSXXs NaN View
138 19212 Burkholdine-1213 NaN xnGNSNXs NaN View
139 19548 Hirsutatin A NaN XTSXXF NaN View
140 19549 Hirsutatin B NaN XTSXXX NaN View
141 19554 Hirsutellide NaN XxIXxI NaN View
[142 rows x 6 columns]

Different numbers of commas between fields in CSV files, throwing errors with pd.read_csv

I'm using the NOAA weather dataset to build a machine learning model to predict weather data. Python cannot read in this data because there are: a) commas inside the fields, and b) different numbers of commas between the fields.
Here are the headers and the first line:
"STATION","DATE","SOURCE","REPORT_TYPE","CALL_SIGN","QUALITY_CONTROL","AA1","AJ1","AL1","CIG","DEW","GA1","KA1","MA1","MF1","OC1","RH1","SLP","TMP","VIS","WND"
"72503014732","2022-01-01T00:00:00","4","FM-12","99999","V020",,,,"99999,9,9,N","+0078,1","99,9,+00450,1,99,9","120,M,+0128,1","99999,9,10129,1",,,,"10141,1","+0106,1","016000,1,9,9","160,1,N,0046,1"
When I open this in Excel, this is how it looks:
[Image of rendered data in an Excel sheet]
I have tried regex and I've tried setting the delimiter to ",", but it still doesn't work.
As your fields are quoted, the commas are not an issue for pandas:
df = pd.read_csv('yourfile.csv', sep=',')
Output:
STATION DATE SOURCE REPORT_TYPE CALL_SIGN \
0 72503014732 2022-01-01T00:00:00 4 FM-12 99999
QUALITY_CONTROL AA1 AJ1 AL1 CIG ... GA1 \
0 V020 NaN NaN NaN 99999,9,9,N ... 99,9,+00450,1,99,9
KA1 MA1 MF1 OC1 RH1 SLP TMP \
0 120,M,+0128,1 99999,9,10129,1 NaN NaN NaN 10141,1 +0106,1
VIS WND
0 016000,1,9,9 160,1,N,0046,1
[1 rows x 21 columns]
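If you want to verify the quoting behaviour without downloading the full NOAA file, here is a minimal self-contained check that pastes the header and first data line from the question into an in-memory buffer (io.StringIO is used only for illustration):
import io
import pandas as pd
# The two lines from the question, verbatim.
csv_text = '''"STATION","DATE","SOURCE","REPORT_TYPE","CALL_SIGN","QUALITY_CONTROL","AA1","AJ1","AL1","CIG","DEW","GA1","KA1","MA1","MF1","OC1","RH1","SLP","TMP","VIS","WND"
"72503014732","2022-01-01T00:00:00","4","FM-12","99999","V020",,,,"99999,9,9,N","+0078,1","99,9,+00450,1,99,9","120,M,+0128,1","99999,9,10129,1",,,,"10141,1","+0106,1","016000,1,9,9","160,1,N,0046,1"'''
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)           # (1, 21) -> the quoted commas did not create extra columns
print(df['CIG'].iloc[0])  # 99999,9,9,N -> the comma-separated sub-fields stay inside one cell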

Pandas: Merging multiple dataframes efficiently

I have a situation where I need to merge multiple dataframes, which I can do easily using the code below:
# Merge all the datasets together
df_prep1 = df_prep.merge(df1,on='e_id',how='left')
df_prep2 = df_prep1.merge(df2,on='e_id',how='left')
df_prep3 = df_prep2.merge(df3,on='e_id',how='left')
df_prep4 = df_prep3.merge(df_4,on='e_id',how='left')
df_prep5 = df_prep4.merge(df_5,on='e_id',how='left')
df_prep6 = df_prep5.merge(df_6,on='e_id',how='left')
But what I want to understand is whether there is a more efficient way to perform this merge, maybe using a helper function? If yes, how could I achieve that?
You can use reduce from the functools module to merge multiple dataframes:
from functools import reduce
dfs = [df_1, df_2, df_3, df_4, df_5, df_6]
out = reduce(lambda dfl, dfr: pd.merge(dfl, dfr, on='e_id', how='left'), dfs)
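A quick self-contained check of the reduce approach, using small made-up frames keyed on e_id (the frame and column names below are just for illustration):
from functools import reduce
import pandas as pd
df_prep = pd.DataFrame({'e_id': [1, 2, 3]})
df_a = pd.DataFrame({'e_id': [1, 2], 'score': [10, 20]})
df_b = pd.DataFrame({'e_id': [2, 3], 'label': ['x', 'y']})
dfs = [df_prep, df_a, df_b]
out = reduce(lambda dfl, dfr: pd.merge(dfl, dfr, on='e_id', how='left'), dfs)
print(out)
#    e_id  score label
# 0     1   10.0   NaN
# 1     2   20.0     x
# 2     3    NaN     y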
You can put all your dfs into a list (or pass them in from a function, a loop, etc.) and then have one main df that you merge everything onto.
You can start with a minimal df holding just the keys and iterate through the rest. In your case, since you are doing a left merge, it looks like your df_prep should already have all of the e_id values that you want. You'll need to figure out what to do with any additional columns, e.g. you can have pandas add _x and _y after conflicting column names that you don't merge on, or rename them, etc. See this toy example:
main_df = pd.DataFrame({'e_id': [0, 1, 2, 3, 4]})
for x in range(3):
    dfx = pd.DataFrame({'e_id': [x], 'another_col' + str(x): [x * 10]})
    main_df = main_df.merge(dfx, on='e_id', how='left')
to get:
e_id another_col0 another_col1 another_col2
0 0 0.0 NaN NaN
1 1 NaN 10.0 NaN
2 2 NaN NaN 20.0
3 3 NaN NaN NaN
4 4 NaN NaN NaN

Update a Pandas MultiIndex DataFrame

The dataframe "data" has a MultiIndex.
data.head()
                     Close     High      Low     Open    Volume
Symbol Date
A      1999-11-18  28.6358  33.5207  27.3725  30.6572  59753154
       1999-11-19  27.2040  28.9727  26.8253  28.9323  16172993
       1999-11-22  29.3517  29.3517  26.9935  27.8357   5435127
       1999-11-23  27.1198  28.8885  27.1198  28.6358   5035889
       1999-11-24  27.6676  28.2571  26.9513  27.0389   5141708
The dictionary f has a key 'AAPL' whose value is a regular DataFrame.
f['AAPL'].head()
Open High Low Close Volume
Date
2018-06-11 191.350 191.970 190.21 191.23 18308460
2018-06-12 191.385 192.611 191.15 192.28 16911141
2018-06-13 192.420 192.880 190.44 190.70 21638393
2018-06-14 191.550 191.570 190.22 190.80 21610074
2018-06-15 190.030 190.160 188.26 188.84 61719160
I'd like to append to data['AAPL'] so that it has the data from f['AAPL']. This works, but is not inplace:
data.loc['AAPL'].append(f['AAPL'], verify_integrity=True).tail()
Close High Low Open Volume
Date
2018-07-30 189.91 192.20 189.0700 191.90 21029535
2018-07-31 190.29 192.14 189.3400 190.30 39373038
2018-08-01 201.50 201.76 197.3100 199.13 67935716
2018-08-02 207.39 208.38 200.3500 200.58 62404012
2018-08-03 207.99 208.74 205.4803 207.03 33447396
When I try to update data, I get all NaNs.
data.loc['AAPL'] = data.loc['AAPL'].append(f['AAPL'], verify_integrity=True).tail()
Close High Low Open Volume
Date
2018-06-04 NaN NaN NaN NaN NaN
2018-06-05 NaN NaN NaN NaN NaN
2018-06-06 NaN NaN NaN NaN NaN
2018-06-07 NaN NaN NaN NaN NaN
2018-06-08 NaN NaN NaN NaN NaN
Edit:
The "data" DataFrame was created with pandas data_reader:
import pandas_datareader.data as web
data = web.DataReader(['A','AAPL','F'], 'morningstar', start, end)
"f" was created the same way, but using 'iex' as the source instead of 'morningstar' (at the moment the morningstar source is returning 404s, so I switched to iex).
I still don't know why assigning to data.loc['AAPL'] doesn't work, but the following does:
# Converts dict with keys as tickers, DataFrame as values, to a DataFrame with a MultiIndex
new_data = pd.concat(f)
# Just append, and sort index to align the dates
data = data.append(new_data).sort_index()
Personal preference: I would first create a temporary df holding the data to append, as a MultiIndex DataFrame.
toappend = pd.concat([f['AAPL']], keys=['AAPL'], names=['Symbol'])
And then create a new dataframe by appending the new data to the existing data.
newdata = data.append(toappend, verify_integrity=True)
or if you prefer to do it in one line:
newdata = data.append(pd.concat([f['AAPL']], keys=['AAPL'], names=['Symbol']), verify_integrity=True)
My full test code is:
import pandas as pd
import numpy as np
symbols = ['AAA', 'BBB', 'CCC']
dates = ['2018-06-11', '2018-06-12', '2018-06-13']
cols = ['Close', 'High', 'Low']
midx = pd.MultiIndex.from_product([symbols, dates], names=['Symbol', 'Date'])
data= pd.DataFrame(10, midx, cols)
aapldf = pd.DataFrame(15, dates, cols)
aapldf.index.name = 'Date'
f = {'AAPL': aapldf}
toappend = pd.concat([f['AAPL']], keys=['AAPL'], names=['Symbol'])
newdata = data.append(toappend, verify_integrity=True)
print(newdata)
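Note that DataFrame.append was deprecated and then removed in pandas 2.0, so on a recent install both snippets above need pd.concat instead. A sketch of the same test rewritten with pd.concat (same toy data, assuming a current pandas):
import pandas as pd
# Same toy setup as the test above.
symbols = ['AAA', 'BBB', 'CCC']
dates = ['2018-06-11', '2018-06-12', '2018-06-13']
cols = ['Close', 'High', 'Low']
midx = pd.MultiIndex.from_product([symbols, dates], names=['Symbol', 'Date'])
data = pd.DataFrame(10, midx, cols)
aapldf = pd.DataFrame(15, dates, cols)
aapldf.index.name = 'Date'
f = {'AAPL': aapldf}
# pd.concat replaces the removed DataFrame.append; verify_integrity still guards against duplicate index entries.
toappend = pd.concat([f['AAPL']], keys=['AAPL'], names=['Symbol'])
newdata = pd.concat([data, toappend], verify_integrity=True)
print(newdata)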

How does df.interpolate(inplace=True) function?

I am having trouble understanding how this functions. With inplace=True, the function doesn't output anything and the original df remains unchanged. How does this work?
Sorry, I wrote 'filter' in my first post. That was a very stupid mistake.
As @Alex requested, the example is as follows:
df = pd.DataFrame(np.random.randn(4,3), columns=map(chr, range(65,68)))
df['B'] = np.nan
print df
print df.interpolate(axis=1)
print df
print df.interpolate(axis=1, inplace=True)
print df
The output is as follows:
A B C
0 -0.956273 NaN 0.919723
1 1.127298 NaN -0.585326
2 -0.045163 NaN -0.946355
3 -1.375863 NaN -1.279663
A B C
0 -0.956273 -0.018275 0.919723
1 1.127298 0.270986 -0.585326
2 -0.045163 -0.495759 -0.946355
3 -1.375863 -1.327763 -1.279663
A B C
0 -0.956273 NaN 0.919723
1 1.127298 NaN -0.585326
2 -0.045163 NaN -0.946355
3 -1.375863 NaN -1.279663
None
A B C
0 -0.956273 NaN 0.919723
1 1.127298 NaN -0.585326
2 -0.045163 NaN -0.946355
3 -1.375863 NaN -1.279663
As you can see, the first interpolation created a copy of the original dataframe. What I wanted was to interpolate and update the original dataframe, so I tried inplace, since the documentation states the following:
inplace : bool, default False
Update the NDFrame in place if possible.
The second interpolation did not return any value, and it did not update the original dataframe. So I'm confused.
And as @joris requested, my pandas version is '0.15.1' (though that request was prompted by my mistake...).
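For reference, in recent pandas versions inplace=True on interpolate does update the original frame and, by convention, returns None. A minimal sketch of that behaviour, assuming a current pandas install (and using the default column-wise interpolation for simplicity):
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [1.0, np.nan, 3.0, np.nan, 5.0]})
result = df.interpolate(inplace=True)  # fills the NaNs in df itself
print(result)  # None -> in-place operations return None by convention
print(df)
#      A
# 0  1.0
# 1  2.0
# 2  3.0
# 3  4.0
# 4  5.0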