I'm trying to assign a list value to a column with the following code:
In [105]: df[df['review_meta_id'] == 5349]['tags'].head()
Out[105]:
4       NaN
2035    NaN
2630    NaN
3085    NaN
6833    NaN
Name: tags, dtype: object

In [106]: tags
Out[106]: ['자연공원', '도심상점']

In [107]: df.loc[df['review_meta_id'] == 5349, 'tags'] = pd.Series(tags)

In [108]: df[df['review_meta_id'] == 5349]['tags'].head()
Out[108]:
4       NaN
2035    NaN
2630    NaN
3085    NaN
6833    NaN
Name: tags, dtype: object
So why is the value not being assigned?
Edit:
So it seems I can do something like
df.loc[df['review_meta_id'] == 5349,'tags'] = pd.Series([tags] * len(df))
So why does the following not work?
df.loc[df['review_meta_id'] == 5349,'tags'] = pd.Series([tags] * len(df[df['review_meta_id'] == 5349]))
The reason is that pandas aligns on the index when you assign a Series, not on position. pd.Series([tags] * k) carries a fresh RangeIndex from 0 to k-1; when k == len(df) that range covers every label in df's RangeIndex, but for the shorter Series the labels mostly do not match the selected rows (4, 2035, 2630, ...), so pandas fills them with NaN. Passing the matching index fixes it:
mask = df['review_meta_id'] == 5349
df.loc[mask, 'tags'] = pd.Series([tags] * mask.sum(), index=df.index[mask])
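To see the alignment rule in isolation, here is a minimal self-contained sketch (made-up data, not the original df):

import pandas as pd

df = pd.DataFrame({'a': [0, 0, 0]}, index=[10, 20, 30])
mask = df.index > 10

# The fresh RangeIndex (0, 1) shares no labels with df.index[mask]
# (20, 30), so both targeted cells become NaN.
df.loc[mask, 'a'] = pd.Series([1, 2])

# Reusing the target labels lets the values land where intended.
df.loc[mask, 'a'] = pd.Series([1, 2], index=df.index[mask])
print(df)  # a is now 0.0, 1.0, 2.0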
I have a dataframe which uses an identifier for groups and has several columns with missing values.
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                       'Age': [10, 10, 10, 20, 20, 20, 20, 20, 20],
                       'A': [np.nan, 5, 5, np.nan, np.nan, np.nan, 10, 12, 12],
                       'B': [3, 4, 5, 2, 2, 1, np.nan, 4, 3]})
ID Age A B
0 1 10 NaN 3.0
1 1 10 5.0 4.0
2 1 10 5.0 5.0
3 2 20 NaN 2.0
4 2 20 NaN 2.0
5 2 20 NaN 1.0
6 3 20 10.0 NaN
7 3 20 12.0 4.0
8 3 20 12.0 3.0
Now I want to fill the NaNs by some rules, either within groups of the same age or just within the ID group:
group_mode = toy_df.groupby('Age')['A'].apply(lambda x: list(x.mode()))
group_median = toy_df.groupby('Age')['A'].median()
def impute_column(series, group_mode, group_median, agg_key, key, age):
    if series.isna().sum() == series.shape[0]:
        modes = group_mode[group_mode.index == age]
        # if multiple modes are available use median
        if np.ravel(modes.to_list()).shape[0] > 1:
            median_ = group_median[group_median.index == age]
            series = series.fillna(value=median_)
        else:
            mode_ = modes.item()[0]
            series = series.fillna(value=mode_)
    # if up to 3 values are missing use linear interpolation
    elif series.isna().sum() < 4:
        series = series.interpolate(limit_direction='both', method='linear')
    # else we have sparse values / use median
    else:
        median_ = series.median()
        series = series.fillna(value=median_)
    return series
And if I test it with one of the columns and groups, it works:
impute_column(series=toy_df['A'], group_mode=group_mode, group_median=group_median,
              agg_key='Age', key='A', age=10)

0    10.0
1     5.0
2     5.0
3    10.0
4    10.0
5    10.0
6    10.0
7    12.0
8    12.0
Now I want to be efficient and:

- group over IDs
- loop over the grouped object
- loop over all columns
- and update my dataframe
all_columns = ['A', 'B']
grouped = toy_df.groupby('ID')

for key in all_columns:
    group_mode = toy_df.groupby('Age')[key].apply(lambda x: list(x.mode()))
    group_median = toy_df.groupby('Age')[key].median()
    for _, group in grouped:
        age = group['Age'].iloc[0]
        group[key] = impute_column(series=group[key], group_mode=group_mode,
                                   group_median=group_median,
                                   agg_key='Age', key=key, age=age)
The calculations are running (I've printed them out), but the final dataframe isn't updated:
ID Age A B
0 1 10 NaN 3.0
1 1 10 5.0 4.0
2 1 10 5.0 5.0
3 2 20 NaN 2.0
4 2 20 NaN 2.0
5 2 20 NaN 1.0
6 3 20 10.0 NaN
7 3 20 12.0 4.0
8 3 20 12.0 3.0
What seems to work is the code below. But as you can see, it does not follow the bullet points above. Further, I am pretty sure that computing big groupby objects for each group is immensely inefficient:
def impute_column(series, group_mode, group_median, key, age):
    if series.isna().sum() == series.shape[0]:
        modes = group_mode[group_mode.index == age]
        # if multiple modes are available use median
        if np.ravel(modes.to_list()).shape[0] > 1:
            median_ = group_median[group_median.index == age]
            series = series.fillna(value=median_)
        else:
            mode_ = modes.item()[0]
            series = series.fillna(value=mode_)
    # if up to 3 values are missing use linear interpolation
    elif series.isna().sum() < 4:
        series = series.interpolate(limit_direction='both', method='linear')
    # else we have sparse values / use median
    else:
        median_ = series.median()
        series = series.fillna(value=median_)
    return series
def impute_frame(data, full_data, agg_key):
    age = data['Age'].iloc[0]
    for key in ['A', 'B']:
        group_mode = full_data.groupby(agg_key)[key].apply(lambda x: list(x.mode()))
        group_median = full_data.groupby(agg_key)[key].median()
        data[key] = impute_column(data[key], group_mode, group_median, key, age)
    return data

toy_df.groupby('ID').apply(impute_frame, full_data=toy_df, agg_key='Age')
ID Age A B
0 1 10 5.0 3.0
1 1 10 5.0 4.0
2 1 10 5.0 5.0
3 2 20 12.0 2.0
4 2 20 12.0 2.0
5 2 20 12.0 1.0
6 3 20 10.0 4.0
7 3 20 12.0 4.0
8 3 20 12.0 3.0
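For completeness, here is a sketch that keeps the bullet-point structure (group over IDs, loop over columns) while computing the groupby aggregations only once per column. The key change is writing the result back through toy_df.loc, since assigning to group[key] only mutates the temporary copy yielded by the groupby iteration:

all_columns = ['A', 'B']

for key in all_columns:
    # computed once per column, not once per ID group
    group_mode = toy_df.groupby('Age')[key].apply(lambda x: list(x.mode()))
    group_median = toy_df.groupby('Age')[key].median()
    for _, group in toy_df.groupby('ID'):
        age = group['Age'].iloc[0]
        filled = impute_column(series=group[key], group_mode=group_mode,
                               group_median=group_median, key=key, age=age)
        # write back through .loc; group[key] = ... only changes a copy
        toy_df.loc[group.index, key] = filled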
I have a data frame that looks like this, which includes price, side, and volume parameters from multiple exchanges:
import pandas as pd

df = pd.DataFrame({
    'price_ex1': [9380.59650, 9394.85206, 9397.80000],
    'side_ex1': ['bid', 'bid', 'ask'],
    'size_ex1': [0.416, 0.053, 0.023],
    'price_ex2': [9437.24045, 9487.81185, 9497.81424],
    'side_ex2': ['bid', 'bid', 'ask'],
    'size_ex2': [10.0, 556.0, 23.0]
})
df
price_ex1 side_ex1 size_ex1 price_ex2 side_ex2 size_ex2
0 9380.59650 bid 0.416 9437.24045 bid 10.0
1 9394.85206 bid 0.053 9487.81185 bid 556.0
2 9397.80000 ask 0.023 9497.81424 ask 23.0
For each exchange (I have more than two exchanges), I want the index to be the union of all prices from all exchanges (i.e. the union of price_ex1, price_ex2, etc.) ranked from highest to lowest. Then I want to create two size columns for each exchange, based on the side parameter of that exchange. The output should look like this, where empty cells are NaN.
I am not sure which pandas function is best for this, whether it is pivot or melt, and how to use that function when I have more than one binary column to flatten.
Thank you for your help!
This is a three-step process: after you correct your multiindexed columns, you stack your dataset, then pivot it.
First, clean up the multiindex columns so that you can transform more easily:
df.columns = pd.MultiIndex.from_product(
    [['1', '2'], [col[:-4] for col in df.columns[:3]]],
    names=['exchange', 'params'])
exchange 1 2
params price side size price side size
0 9380.59650 bid 0.416 9437.24045 bid 10.0
1 9394.85206 bid 0.053 9487.81185 bid 556.0
2 9397.80000 ask 0.023 9497.81424 ask 23.0
Then stack and append the exchange num to the bid and ask values:
df = df.swaplevel(axis=1).stack()
df['side'] = df.apply(lambda row: row.side + '_ex' + row.name[1], axis=1)
params price side size
exchange
0 1 9380.59650 bid_ex1 0.416
2 9437.24045 bid_ex2 10.000
1 1 9394.85206 bid_ex1 0.053
2 9487.81185 bid_ex2 556.000
2 1 9397.80000 ask_ex1 0.023
2 9497.81424 ask_ex2 23.000
Finally, pivot and sort by price:
df.pivot_table(index=['price'], values=['size'], columns=['side']).sort_values('price', ascending=False)
params size
side ask_ex1 ask_ex2 bid_ex1 bid_ex2
price
9497.81424 NaN 23.0 NaN NaN
9487.81185 NaN NaN NaN 556.0
9437.24045 NaN NaN NaN 10.0
9397.80000 0.023 NaN NaN NaN
9394.85206 NaN NaN 0.053 NaN
9380.59650 NaN NaN 0.416 NaN
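Since you have more than two exchanges, the hard-coded ['1', '2'] in the first step can be derived from the column names instead. A small sketch, assuming the _exN suffix convention holds and the original columns stay grouped by exchange:

# recover the exchange labels ('1', '2', ...) from the column suffixes
exchanges = sorted({col.rsplit('_ex', 1)[-1] for col in df.columns})
params = ['price', 'side', 'size']
df.columns = pd.MultiIndex.from_product([exchanges, params],
                                        names=['exchange', 'params'])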
You can try something like this. First, save the data you showed us as a CSV file named 'example.csv', with these columns:
price_ex1 side_ex1 size_ex1 price_ex2 side_ex2 size_ex2
import pandas as pd
import numpy as np

df = pd.read_csv('example.csv')

# split the per-exchange blocks and stack them vertically
df1 = df[['price_ex1', 'side_ex1', 'size_ex1']]
df2 = df[['price_ex2', 'side_ex2', 'size_ex2']]
df3 = pd.concat([df1, df2])  # DataFrame.append is deprecated in modern pandas

# collapse the two price columns into a single one
df4 = df3[['price_ex1', 'price_ex2']]
arr = df4.values
df3['price_ex1'] = arr[~np.isnan(arr)].astype(float)
df3.drop(columns=['price_ex2'], inplace=True)
df3.columns = ['price', 'bid_ex1', 'ask_ex1', 'bid_ex2', 'ask_ex2']
def change(bid_ex1, ask_ex1, bid_ex2, ask_ex2, col_name):
    # note: `x != np.nan` is always True, so the presence checks use pd.notna
    if col_name == 'bid_ex1_col':
        if (pd.notna(bid_ex1) or pd.notna(bid_ex2)) and bid_ex1 == 'bid':
            return bid_ex2
        else:
            return bid_ex1
    if col_name == 'ask_ex1_col':
        if (pd.notna(bid_ex1) or pd.notna(bid_ex2)) and bid_ex1 == 'ask':
            return bid_ex2
        else:
            return ask_ex1
    if col_name == 'ask_ex2_col':
        if (pd.notna(ask_ex1) or pd.notna(ask_ex2)) and ask_ex1 == 'ask':
            return ask_ex2
        else:
            return ask_ex1
    if col_name == 'bid_ex2_col':
        if (pd.notna(ask_ex1) or pd.notna(ask_ex2)) and ask_ex1 == 'bid':
            return ask_ex2
        else:
            return ask_ex1

for col in ['bid_ex1_col', 'ask_ex1_col', 'ask_ex2_col', 'bid_ex2_col']:
    df3[col] = df3.apply(lambda row: change(row['bid_ex1'], row['ask_ex1'],
                                            row['bid_ex2'], row['ask_ex2'],
                                            col), axis=1)

df3.drop(columns=['bid_ex1', 'ask_ex1', 'bid_ex2', 'ask_ex2'], inplace=True)
df3.replace(to_replace=['ask', 'bid'], value=np.nan, inplace=True)
One option is to flip to long form with pivot_longer before flipping back to wide form with pivot_wider from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor
(df
 .pivot_longer(names_to=('ex1', 'ex2', 'ex'),
               values_to=('price', 'side', 'size'),
               names_pattern=['price', 'side', 'size'])
 .loc[:, ['price', 'side', 'ex', 'size']]
 .assign(ex=lambda df: df.ex.str.split('_').str[-1])
 .pivot_wider('price', ('side', 'ex'), 'size')
 .sort_values('price', ascending=False)
)
price bid_ex1 ask_ex1 bid_ex2 ask_ex2
5 9497.81424 NaN NaN NaN 23.0
4 9487.81185 NaN NaN 556.0 NaN
3 9437.24045 NaN NaN 10.0 NaN
2 9397.80000 NaN 0.023 NaN NaN
1 9394.85206 0.053 NaN NaN NaN
0 9380.59650 0.416 NaN NaN NaN
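If you would rather stay in plain pandas, pd.wide_to_long handles the _exN suffixes directly. A rough sketch of the same reshape, using the df from the question:

import pandas as pd

# stubnames strip the '_ex<N>' suffix into a separate 'ex' index level
long_df = pd.wide_to_long(df.reset_index(), stubnames=['price', 'side', 'size'],
                          i='index', j='ex', sep='_ex')

out = (long_df
       .reset_index()
       .assign(side=lambda d: d['side'] + '_ex' + d['ex'].astype(str))
       .pivot_table(index='price', columns='side', values='size')
       .sort_index(ascending=False))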
I'm trying to import this as a table, from the table with id "octable":
import requests
from bs4 import BeautifulSoup
import lxml.html as lh
import pandas as pd
r = requests.get('https://nseindia.com/live_market/dynaContent/live_watch/option_chain/optionKeys.jsp?segmentLink=17&instrument=OPTIDX&symbol=BANKNIFTY')
doc = lh.fromstring(r.content)
data = doc.xpath('//*[@id="octable"]')
type(data)
df = pd.DataFrame(data)
print(df)
However, this is what I get:
0 [[[], [], []], [[], [], [], [], [], [], [], []...
1 \
0 [[], [[<Element img at 0x29d1aa5cbd8>]], [], [...
2 \
0 [[], [[<Element img at 0x29d1aa5ca98>]], [], [...
3 \
0 [[], [[<Element img at 0x29d1aa5cbd8>]], [], [...
This worked well for me. I'd recommend familiarizing yourself with read_html.
import requests
from bs4 import BeautifulSoup
import pandas as pd

r = requests.get('https://nseindia.com/live_market/dynaContent/live_watch/option_chain/optionKeys.jsp?segmentLink=17&instrument=OPTIDX&symbol=BANKNIFTY')
soup = BeautifulSoup(r.content, features='html.parser')
table = soup.find('table', {'id': 'octable'})  # isolate the target table
df = pd.read_html(str(table))                  # parse it with pandas
print(df)
Output:
[ CALLS ... PUTS
Chart OI Chng in OI Volume IV LTP ... LTP IV Volume Chng in OI OI Chart
0 NaN 580 20 3 - 5929.35 ... 2.15 60.85 1300 -2920 14980 NaN
1 NaN - - - - - ... - - - - - NaN
2 NaN - - - - - ... 3.20 61.28 68 320 500 NaN
3 NaN 8620 -40 6 - 5585.00 ... 2.60 58.90 305 -60 13400 NaN
4 NaN - - - - - ... 2.50 57.62 8 -60 - NaN
.. ... ... ... ... ... ... ... ... ... ... ... ... ...
89 NaN - - - - - ... - - - - - NaN
90 NaN 20 20 2 31.08 4.30 ... - - - - - NaN
91 NaN - - - - - ... - - - - - NaN
92 NaN 80 60 9 28.39 1.20 ... 3000.00 - - - 140 NaN
93 Total 4568440 NaN 2456057 NaN NaN ... NaN NaN 2562288 NaN 4181760 Total
[94 rows x 23 columns]]
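One detail to keep in mind: read_html returns a list of DataFrames, one per parsed table, so you would typically take the first element before working with it:

df = pd.read_html(str(table))[0]  # the list holds a single DataFrame here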
I have this pandas df:
                      value
index1 index2 index3
1      1      1        10.0
              2        -0.5
              3         0.0
2      2      1         3.0
              2         0.0
              3         0.0
       3      1         0.0
              2        -5.0
              3         6.0
I would like to get the 'value' for a specific combination of index levels, using a dict.
Usually I use, for example:
df = df.iloc[df.index.isin([2],level='index1')]
df = df.iloc[df.index.isin([3],level='index2')]
df = df.iloc[df.index.isin([2],level='index3')]
value = df.values[0][0]
Now, I would like to get my value = -5 in a shorter way using this dictionary:
d = {'index1':2,'index2':3,'index3':2}
And also, if I use:
d = {'index1':2,'index2':3}
I would like to get the array:
[0.0, -5.0, 6.0]
Tips?
You can use SQL-like method DataFrame.query():
In [69]: df.query(' and '.join('{}=={}'.format(k,v) for k,v in d.items()))
Out[69]:
value
index1 index2 index3
2.0 3.0 2 -5.0
for another dict:
In [77]: d = {'index1':2,'index2':3}
In [78]: df.query(' and '.join('{}=={}'.format(k,v) for k,v in d.items()))
Out[78]:
value
index1 index2 index3
2.0 3.0 1 0.0
2 -5.0
3 6.0
A non-query way would be
In [64]: df.loc[np.logical_and.reduce([
df.index.get_level_values(k) == v for k, v in d.items()])]
Out[64]:
value
index1 index2 index3
2 3 2 -5.0
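DataFrame.xs is another label-based option worth knowing; a sketch covering both the full combination and the partial dict from above:

# full key: the single row as a Series (value -5.0)
df.xs((2, 3, 2))

# partial key on named levels: the matching sub-frame (0.0, -5.0, 6.0)
df.xs((2, 3), level=('index1', 'index2'))

# the same, driven directly by the partial dict
df.xs(tuple(d.values()), level=tuple(d.keys()))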
Why do the following assignments behave differently?
df.loc[rows, [col]] = ...
df.loc[rows, col] = ...
For example:
r = pd.DataFrame({"response": [1, 1, 1]}, index=[1, 2, 3])
df = pd.DataFrame({"x": [999, 99, 9]}, index=[3, 4, 5])
df = pd.merge(df, r, how="left", left_index=True, right_index=True)
df.loc[df["response"].isnull(), "response"] = 0
print(df)
x response
3 999 0.0
4 99 0.0
5 9 0.0
but
df.loc[df["response"].isnull(), ["response"]] = 0
print(df)
x response
3 999 1.0
4 99 0.0
5 9 0.0
Why should I expect the first to behave differently from the second?
df.loc[df["response"].isnull(), ["response"]]
returns a DataFrame, so if you want to assign something to it, the value must be aligned by both index and columns.
Demo:
In [79]: df.loc[df["response"].isnull(), ["response"]] = \
pd.DataFrame([11,12], columns=['response'], index=[4,5])
In [80]: df
Out[80]:
x response
3 999 1.0
4 99 11.0
5 9 12.0
alternatively you can assign an array/matrix of the same shape:
In [83]: df.loc[df["response"].isnull(), ["response"]] = [11, 12]
In [84]: df
Out[84]:
x response
3 999 1.0
4 99 11.0
5 9 12.0
I'd also consider using the fillna() method:
In [88]: df.response = df.response.fillna(0)
In [89]: df
Out[89]:
x response
3 999 1.0
4 99 0.0
5 9 0.0
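As a side note, fillna also accepts a dict keyed by column, which is handy once several columns need defaults; a tiny sketch (the default for "x" is purely illustrative):

df = df.fillna({"response": 0, "x": 0})  # per-column defaults in one call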