I'm trying to import the table with id "octable" from this page as a DataFrame:
import requests
from bs4 import BeautifulSoup
import lxml.html as lh
import pandas as pd
r = requests.get('https://nseindia.com/live_market/dynaContent/live_watch/option_chain/optionKeys.jsp?segmentLink=17&instrument=OPTIDX&symbol=BANKNIFTY')
doc = lh.fromstring(r.content)
data = doc.xpath('//*[@id="octable"]')
type(data)
df = pd.DataFrame(data)
print(df)
However, this is what I get:
0 [[[], [], []], [[], [], [], [], [], [], [], []...
1 \
0 [[], [[<Element img at 0x29d1aa5cbd8>]], [], [...
2 \
0 [[], [[<Element img at 0x29d1aa5ca98>]], [], [...
3 \
0 [[], [[<Element img at 0x29d1aa5cbd8>]], [], [...
This worked well for me. I'd recommend familiarizing yourself with read_html.
import requests
from bs4 import BeautifulSoup
import pandas as pd
r = requests.get('https://nseindia.com/live_market/dynaContent/live_watch/option_chain/optionKeys.jsp?segmentLink=17&instrument=OPTIDX&symbol=BANKNIFTY')
soup = BeautifulSoup(r.content,features='html.parser')
table = soup.find('table',{'id':'octable'})
df = pd.read_html(str(table))
print(df)
Output:
[ CALLS ... PUTS
Chart OI Chng in OI Volume IV LTP ... LTP IV Volume Chng in OI OI Chart
0 NaN 580 20 3 - 5929.35 ... 2.15 60.85 1300 -2920 14980 NaN
1 NaN - - - - - ... - - - - - NaN
2 NaN - - - - - ... 3.20 61.28 68 320 500 NaN
3 NaN 8620 -40 6 - 5585.00 ... 2.60 58.90 305 -60 13400 NaN
4 NaN - - - - - ... 2.50 57.62 8 -60 - NaN
.. ... ... ... ... ... ... ... ... ... ... ... ... ...
89 NaN - - - - - ... - - - - - NaN
90 NaN 20 20 2 31.08 4.30 ... - - - - - NaN
91 NaN - - - - - ... - - - - - NaN
92 NaN 80 60 9 28.39 1.20 ... 3000.00 - - - 140 NaN
93 Total 4568440 NaN 2456057 NaN NaN ... NaN NaN 2562288 NaN 4181760 Total
[94 rows x 23 columns]]
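Note that read_html returns a *list* of DataFrames, one per table it parses, so you usually want element 0. A minimal sketch of that, using a small inline table as a stand-in for the live NSE page (the values are illustrative):

```python
from io import StringIO
import pandas as pd

# A tiny stand-in for the option-chain table (made-up values).
html = """
<table id="octable">
  <tr><th>Strike</th><th>OI</th><th>LTP</th></tr>
  <tr><td>24000</td><td>580</td><td>5929.35</td></tr>
  <tr><td>24500</td><td>8620</td><td>5585.00</td></tr>
</table>
"""

dfs = pd.read_html(StringIO(html))  # list of DataFrames, one per <table>
df = dfs[0]                         # take the first (and only) match
print(df)
```

read_html needs lxml or html5lib installed; recent pandas versions also expect literal HTML strings to be wrapped in StringIO, as above.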
import pandas as pd
from io import StringIO
import requests
import matplotlib.pyplot as plt
import numpy as np
from scipy.interpolate import make_interp_spline
url = 'https://m-selig.ae.illinois.edu/ads/coord/b737a.dat'
response = requests.get(url).text
lines = []
for line in response.split('\n'):
    # keep only rows where every whitespace-separated token is numeric
    if all(x.replace('.', '').replace('-', '').isdecimal() for x in line.split()):
        lines.append(line)
lines = [x.split() for x in lines]
df = pd.DataFrame(lines)
df = df.dropna(axis=0)
df = df.astype(float)
df = df[~(df > 1).any(axis=1)]
print(df)
Output:
0 1
2 0.0000 0.0177
3 0.0023 0.0309
4 0.0050 0.0372
5 0.0076 0.0415
6 0.0143 0.0499
7 0.0249 0.0582
8 0.0495 0.0730
9 0.0740 0.0814
10 0.0990 0.0866
11 0.1530 0.0907
12 0.1961 0.0905
13 0.2504 0.0887
14 0.3094 0.0858
15 0.3520 0.0833
16 0.3919 0.0804
17 0.4477 0.0756
18 0.5034 0.0696
19 0.5593 0.0626
20 0.5965 0.0575
21 0.6488 0.0498
22 0.8351 0.0224
23 0.9109 0.0132
24 1.0000 0.0003
26 0.0000 0.0177
27 0.0022 0.0038
28 0.0049 -0.0018
29 0.0072 -0.0053
30 0.0119 -0.0106
31 0.0243 -0.0204
32 0.0486 -0.0342
33 0.0716 -0.0457
34 0.0979 -0.0516
35 0.1488 -0.0607
36 0.1953 -0.0632
37 0.2501 -0.0632
38 0.2945 -0.0626
39 0.3579 -0.0610
40 0.3965 -0.0595
41 0.4543 -0.0563
42 0.5050 -0.0527
43 0.5556 -0.0482
44 0.6063 -0.0427
45 0.6485 -0.0375
46 0.8317 -0.0149
47 0.9410 -0.0053
48 1.0000 -0.0003
This is my code for a website I'm scraping data from. I'm running into a problem: the x values start at zero, go up, and come back down to zero, which creates a line through the middle of the plot that I don't need.
Notice that df[0] is 0.0 on both rows 2 and 26. How can I write code that detects these duplicates?
Try one of the following:
Outside the loop:
df1 = df.drop_duplicates(keep='first')
Inside your loop:
lines = []
lines1 = []
for line in response.split('\n'):
    if all(x.replace('.', '').replace('-', '').isdecimal() for x in line.split()):
        if line not in lines1:
            lines.append(line)
        lines1.append(line)
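As a self-contained sketch (with made-up coordinate values), drop_duplicates removes the repeated leading-edge row while keeping its first occurrence:

```python
import pandas as pd

# The upper and lower airfoil surfaces both start at the same
# leading-edge point, so the combined table contains a duplicated row.
df = pd.DataFrame({0: [0.0000, 0.0023, 0.0050, 0.0000, 0.0022],
                   1: [0.0177, 0.0309, 0.0372, 0.0177, 0.0038]})

deduped = df.drop_duplicates(keep='first')
print(deduped)
```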
I'm trying to assign a list value to a column with the following code:
In [105]: df[df['review_meta_id'] == 5349]['tags'].head()
Out[105]:
4       NaN
2035    NaN
2630    NaN
3085    NaN
6833    NaN
Name: tags, dtype: object

In [106]: tags
Out[106]: ['자연공원', '도심상점']

In [107]: df.loc[df['review_meta_id'] == 5349, 'tags'] = pd.Series(tags)

In [108]: df[df['review_meta_id'] == 5349]['tags'].head()
Out[108]:
4       NaN
2035    NaN
2630    NaN
3085    NaN
6833    NaN
Name: tags, dtype: object
So why is the value not being assigned?
Edit:
It seems I can do something like
df.loc[df['review_meta_id'] == 5349, 'tags'] = pd.Series([tags] * len(df))
So why does the following not work?
df.loc[df['review_meta_id'] == 5349, 'tags'] = pd.Series([tags] * len(df[df['review_meta_id'] == 5349]))
The reason is that pandas aligns on the index when you assign a Series. The new Series gets a default RangeIndex (0 to n-1), which doesn't match the index of the rows you are assigning to, so pandas fills them with NaN. Build the Series with a matching index instead:
df.loc[df['review_meta_id'] == 5349, 'tags'] = pd.Series([tags] * len(df[df['review_meta_id'] == 5349]), index=df.index[df['review_meta_id'] == 5349])
I have a data frame that looks like this; it includes price, side, and volume parameters from multiple exchanges.
df = pd.DataFrame({
'price_ex1' : [9380.59650, 9394.85206, 9397.80000],
'side_ex1' : ['bid', 'bid', 'ask'],
'size_ex1' : [0.416, 0.053, 0.023],
'price_ex2' : [9437.24045, 9487.81185, 9497.81424],
'side_ex2' : ['bid', 'bid', 'ask'],
'size_ex2' : [10.0, 556.0, 23.0]
})
df
price_ex1 side_ex1 size_ex1 price_ex2 side_ex2 size_ex2
0 9380.59650 bid 0.416 9437.24045 bid 10.0
1 9394.85206 bid 0.053 9487.81185 bid 556.0
2 9397.80000 ask 0.023 9497.81424 ask 23.0
For each exchange (I have more than two exchanges), I want the index to be the union of all prices from all exchanges (i.e. the union of price_ex1, price_ex2, etc.) ranked from highest to lowest. Then, for each exchange, I want to create two size columns based on that exchange's side parameter. The output should look like this, where empty cells are NaN.
I am not sure which pandas function best fits this, whether it is pivot or melt, or how to use it when I have more than one side/size column pair to flatten.
Thank you for your help!
This is a three-step process: correct your columns into a MultiIndex, stack the dataset, then pivot it.
First, clean up the multiindex columns so that you more easily transform:
df.columns = pd.MultiIndex.from_product([['1', '2'], [col[:-4] for col in df.columns[:3]]], names=['exchange', 'params'])
exchange 1 2
params price side size price side size
0 9380.59650 bid 0.416 9437.24045 bid 10.0
1 9394.85206 bid 0.053 9487.81185 bid 556.0
2 9397.80000 ask 0.023 9497.81424 ask 23.0
Then stack and append the exchange num to the bid and ask values:
df = df.swaplevel(axis=1).stack()
df['side'] = df.apply(lambda row: row.side + '_ex' + row.name[1], axis=1)
params price side size
exchange
0 1 9380.59650 bid_ex1 0.416
2 9437.24045 bid_ex2 10.000
1 1 9394.85206 bid_ex1 0.053
2 9487.81185 bid_ex2 556.000
2 1 9397.80000 ask_ex1 0.023
2 9497.81424 ask_ex2 23.000
Finally, pivot and sort by price:
df.pivot_table(index=['price'], values=['size'], columns=['side']).sort_values('price', ascending=False)
params size
side ask_ex1 ask_ex2 bid_ex1 bid_ex2
price
9497.81424 NaN 23.0 NaN NaN
9487.81185 NaN NaN NaN 556.0
9437.24045 NaN NaN NaN 10.0
9397.80000 0.023 NaN NaN NaN
9394.85206 NaN NaN 0.053 NaN
9380.59650 NaN NaN 0.416 NaN
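The steps above, put together as one runnable script. This is a sketch assuming exactly the two sample exchanges from the question; the MultiIndex is built by splitting the column names rather than hard-coding them:

```python
import pandas as pd

df = pd.DataFrame({
    'price_ex1': [9380.59650, 9394.85206, 9397.80000],
    'side_ex1':  ['bid', 'bid', 'ask'],
    'size_ex1':  [0.416, 0.053, 0.023],
    'price_ex2': [9437.24045, 9487.81185, 9497.81424],
    'side_ex2':  ['bid', 'bid', 'ask'],
    'size_ex2':  [10.0, 556.0, 23.0],
})

# 1. Turn 'price_ex1' into ('1', 'price') etc., giving a two-level header.
df.columns = pd.MultiIndex.from_tuples(
    [(c.split('_ex')[1], c.split('_ex')[0]) for c in df.columns],
    names=['exchange', 'params'])

# 2. Stack the exchange level into rows, then tag each side with its exchange.
long = df.swaplevel(axis=1).stack()
long['side'] = long.apply(
    lambda row: '{}_ex{}'.format(row['side'], row.name[1]), axis=1)

# 3. Pivot so each (side, exchange) pair becomes its own size column.
out = (long.pivot_table(index='price', values='size', columns='side')
           .sort_values('price', ascending=False))
print(out)
```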
You can try something like this. First save the data you showed into a file named 'example.csv' with these columns:
price_ex1 side_ex1 size_ex1 price_ex2 side_ex2 size_ex2
import pandas as pd
import numpy as np
df = pd.read_csv('example.csv')
df1 = df[['price_ex1', 'side_ex1', 'size_ex1']]
df2 = df[['price_ex2', 'side_ex2', 'size_ex2']]
df3 = pd.concat([df1, df2])  # DataFrame.append was removed in pandas 2.0
df4 = df3[['price_ex1', 'price_ex2']]
arr = df4.values
# each row has exactly one non-NaN price; flatten them into a single column
df3['price_ex1'] = arr[~np.isnan(arr)].astype(float)
df3.drop(columns=['price_ex2'], inplace=True)
df3.columns = ['price', 'bid_ex1', 'ask_ex1', 'bid_ex2', 'ask_ex2']
def change(bid_ex1, ask_ex1, bid_ex2, ask_ex2, col_name):
    if col_name == 'bid_ex1_col':
        if bid_ex1 == 'bid':
            return bid_ex2
        return bid_ex1
    if col_name == 'ask_ex1_col':
        if bid_ex1 == 'ask':
            return bid_ex2
        return ask_ex1
    if col_name == 'ask_ex2_col':
        if ask_ex1 == 'ask':
            return ask_ex2
        return ask_ex1
    if col_name == 'bid_ex2_col':
        if ask_ex1 == 'bid':
            return ask_ex2
        return ask_ex1
df3['bid_ex1_col'] = df3.apply(lambda row: change(row['bid_ex1'],row['ask_ex1'],row['bid_ex2'],row['ask_ex2'], 'bid_ex1_col'), axis=1)
df3['ask_ex1_col'] = df3.apply(lambda row: change(row['bid_ex1'],row['ask_ex1'],row['bid_ex2'],row['ask_ex2'], 'ask_ex1_col'), axis=1)
df3['ask_ex2_col'] = df3.apply(lambda row: change(row['bid_ex1'],row['ask_ex1'],row['bid_ex2'],row['ask_ex2'], 'ask_ex2_col'), axis=1)
df3['bid_ex2_col'] = df3.apply(lambda row: change(row['bid_ex1'],row['ask_ex1'],row['bid_ex2'],row['ask_ex2'], 'bid_ex2_col'), axis=1)
df3.drop(columns=['bid_ex1', 'ask_ex1', 'bid_ex2', 'ask_ex2'], inplace=True)
df3.replace(to_replace=['ask', 'bid'], value=np.nan, inplace=True)
One option is to flip to long form with pivot_longer before flipping back to wide form with pivot_wider from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor
(df
.pivot_longer(names_to = ('ex1', 'ex2', 'ex'),
values_to=('price','side','size'),
names_pattern=['price', 'side', 'size'])
.loc[:, ['price', 'side','ex','size']]
.assign(ex = lambda df: df.ex.str.split('_').str[-1])
.pivot_wider('price', ('side', 'ex'), 'size')
.sort_values('price', ascending = False)
)
price bid_ex1 ask_ex1 bid_ex2 ask_ex2
5 9497.81424 NaN NaN NaN 23.0
4 9487.81185 NaN NaN 556.0 NaN
3 9437.24045 NaN NaN 10.0 NaN
2 9397.80000 NaN 0.023 NaN NaN
1 9394.85206 0.053 NaN NaN NaN
0 9380.59650 0.416 NaN NaN NaN
I have a column with house prices that looks like this:
0 0.0
1 1480000.0
2 1035000.0
3 0.0
4 1465000.0
5 850000.0
6 1600000.0
7 0.0
8 0.0
9 0.0
Name: Price, dtype: float64
and I want to create a new column called data['PriceRanges'] that puts each price into a given range. This is what my code looks like:
data = pd.read_csv("Melbourne_housing_FULL.csv")
data.fillna(0, inplace=True)
for i in range(0, 12000000, 50000):
    bins = np.array(i)
    labels = np.array(str(i))
data['PriceRange'] = pd.cut(data.Price, bins=bins, labels=labels, right=True)
And I get this Error message:
TypeError: len() of unsized object
I've been trying different approaches and seem to be stuck here. I'd really appreciate some help.
Thanks,
Hugo
The problem is that you overwrite bins and labels in the loop, so only the last value remains:
for i in range(0, 12000000, 50000):
    bins = np.array(i)
    labels = np.array(str(i))
print(bins)
11950000
print(labels)
11950000
No loop is necessary: use the numpy alternative arange instead of range, and build range labels from consecutive pairs of bin edges. Finally, add the parameter include_lowest=True to cut so that the first bin edge (0) is included in the first group.
bins = np.arange(0, 12000000, 50000)
labels = ['{} - {}'.format(i + 1, j) for i, j in zip(bins[:-1], bins[1:])]
#correct first value
labels[0] = '0 - 50000'
print (labels[:10])
['0 - 50000', '50001 - 100000', '100001 - 150000', '150001 - 200000',
'200001 - 250000', '250001 - 300000', '300001 - 350000', '350001 - 400000',
'400001 - 450000', '450001 - 500000']
data['PriceRange'] = pd.cut(data.Price,
bins=bins,
labels=labels,
right=True,
include_lowest=True)
print (data)
Price PriceRange
0 0.0 0 - 50000
1 1480000.0 1450001 - 1500000
2 1035000.0 1000001 - 1050000
3 0.0 0 - 50000
4 1465000.0 1450001 - 1500000
5 850000.0 800001 - 850000
6 1600000.0 1550001 - 1600000
7 0.0 0 - 50000
8 0.0 0 - 50000
9 0.0 0 - 50000
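The same recipe as a self-contained snippet, using only a handful of the prices shown in the question:

```python
import numpy as np
import pandas as pd

prices = pd.Series([0.0, 1480000.0, 1035000.0, 850000.0], name='Price')

bins = np.arange(0, 12000000, 50000)
labels = ['{} - {}'.format(i + 1, j) for i, j in zip(bins[:-1], bins[1:])]
labels[0] = '0 - 50000'   # correct the first label

price_range = pd.cut(prices, bins=bins, labels=labels,
                     right=True, include_lowest=True)
print(price_range)
```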
Why do the following assignments behave differently?
df.loc[rows, [col]] = ...
df.loc[rows, col] = ...
For example:
r = pd.DataFrame({"response": [1, 1, 1]}, index=[1, 2, 3])
df = pd.DataFrame({"x": [999, 99, 9]}, index=[3, 4, 5])
df = pd.merge(df, r, how="left", left_index=True, right_index=True)
df.loc[df["response"].isnull(), "response"] = 0
print(df)
x response
3 999 0.0
4 99 0.0
5 9 0.0
but
df.loc[df["response"].isnull(), ["response"]] = 0
print(df)
x response
3 999 1.0
4 99 0.0
5 9 0.0
Why should I expect the first to behave differently from the second?
df.loc[df["response"].isnull(), ["response"]]
returns a DataFrame, so anything you assign to it must be aligned by both index and columns.
Demo:
In [79]: df.loc[df["response"].isnull(), ["response"]] = \
pd.DataFrame([11,12], columns=['response'], index=[4,5])
In [80]: df
Out[80]:
x response
3 999 1.0
4 99 11.0
5 9 12.0
Alternatively, you can assign an array of the same shape:
In [83]: df.loc[df["response"].isnull(), ["response"]] = [11, 12]
In [84]: df
Out[84]:
x response
3 999 1.0
4 99 11.0
5 9 12.0
I'd also consider using the fillna() method:
In [88]: df.response = df.response.fillna(0)
In [89]: df
Out[89]:
x response
3 999 1.0
4 99 0.0
5 9 0.0