I have a multindex dataframe df1 as:
node A1 A2
bkt B1 B2
Month
1 0.15 -0.83
2 0.06 -0.12
bs.columns
MultiIndex([( 'A1', 'B1'),
( 'A2', 'B2')],
names=[node, 'bkt'])
and another similar multiindex dataframe df2 as:
node A1 A2
bkt B1 B2
Month
1 -0.02 -0.15
2 0 0
3 -0.01 -0.01
4 -0.06 -0.11
I want to concat them vertically so that resulting dataframe df3 looks as following:
df3 = pd.concat([df1, df2], axis=0)
While concatenating I want to introduce 2 blank row between dataframes df1 and df2. In addition I want to introduce two strings Basis Mean and Basis P25 in df3 as shown below.
print(df3)
Basis Mean
node A1 A2
bkt B1 B2
Month
1 0.15 -0.83
2 0.06 -0.12
Basis P25
node A1 A2
bkt B1 B2
Month
1 -0.02 -0.15
2 0 0
3 -0.01 -0.01
4 -0.06 -0.11
I don't know whether there is anyway of doing the above.
I don't think that that is an actual concatenation you are talking about.
The following could already do the trick:
print('Basis Mean')
print(df1.to_string())
print('\n')
print('Basis P25')
print(df2.to_string())
This isn't usually how DataFrames are used, but perhaps you wish to append rows of empty strings in between df1 and df2, along with rows containing your titles?
df1 = pd.concat([pd.DataFrame([["Basis","Mean",""]],columns=df1.columns), df1], axis=0)
df1 = df1.append(pd.Series("", index=df1.columns), ignore_index=True)
df1 = df1.append(pd.Series("", index=df1.columns), ignore_index=True)
df1 = df1.append(pd.Series(["Basis","P25",""], index=df1.columns),ignore_index=True)
df3 = pd.concat([df1, df2], axis=0)
Author clarified in the comment that he wants to make it easy to print to an excel file. It can be achieved using pd.ExcelWriter.
Below is an example of how to do it.
from dataclasses import dataclass
from typing import Any, Dict, List, Optional
import pandas as pd
#dataclass
class SaveTask:
df: pd.DataFrame
header: Optional[str]
extra_pd_settings: Optional[Dict[str, Any]] = None
def fill_xlsx(
save_tasks: List[SaveTask],
writer: pd.ExcelWriter,
sheet_name: str = "Sheet1",
n_rows_between_blocks: int = 2,
) -> None:
current_row = 0
for save_task in save_tasks:
extra_pd_settings = save_task.extra_pd_settings or {}
if "startrow" in extra_pd_settings:
raise ValueError(
"You should not use parameter 'startrow' in extra_pd_settings"
)
save_task.df.to_excel(
writer,
sheet_name=sheet_name,
startrow=current_row + 1,
**extra_pd_settings
)
worksheet = writer.sheets[sheet_name]
worksheet.write(current_row, 0, save_task.header)
has_header = extra_pd_settings.get("header", True)
current_row += (
1 + save_task.df.shape[0] + n_rows_between_blocks + int(has_header)
)
if __name__ == "__main__":
# INPUTS
df1 = pd.DataFrame(
{"hello": [1, 2, 3, 4], "world": [0.55, 1.12313, 23.12, 0.0]}
)
df2 = pd.DataFrame(
{"foo": [3, 4]},
index=pd.MultiIndex.from_tuples([("foo", "bar"), ("baz", "qux")]),
)
# Xlsx creation
writer = pd.ExcelWriter("test.xlsx", engine="xlsxwriter")
fill_xlsx(
[
SaveTask(
df1,
"Hello World Table",
{"index": False, "float_format": "%.3f"},
),
SaveTask(df2, "Foo Table with MultiIndex"),
],
writer,
)
writer.save()
As an extra bonus, pd.ExcelWriter allows to save data on different sheets in Excel and choose their names right from Python code.
Related
I want to join two dataframes and get result as below. I tried many ways, but it fails.
I want only texts on df2 ['A'] which contain text on df1 ['A']. What do I need to change in my code?
I want:
0 A0_link0
1 A1_link1
2 A2_link2
3 A3_link3
import pandas as pd
df1 = pd.DataFrame(
{
"A": ["A0", "A1", "A2", "A3"],
})
df2 = pd.DataFrame(
{ "A": ["A0_link0", "A1_link1", "A2_link2", "A3_link3", "A4_link4", 'An_linkn'],
"B" : ["B0_link0", "B1_link1", "B2_link2", "B3_link3", "B4_link4", 'Bn_linkn']
})
result = pd.concat([df1, df2], ignore_index=True, join= "inner", sort=False)
print(result)
Create an intermediate dataframe and map:
d = (df2.assign(key=df2['A'].str.extract(r'([^_]+)'))
.set_index('key'))
df1['A'].map(d['A'])
Output:
0 A0_link0
1 A1_link1
2 A2_link2
3 A3_link3
Name: A, dtype: object
Or merge if you want several columns from df2 (df1.merge(d, left_on='A', right_index=True))
You can set the index as An and pd.concat on columns
result = (pd.concat([df1.set_index(df1['A']),
df2.set_index(df2['A'].str.split('_').str[0])],
axis=1, join="inner", sort=False)
.reset_index(drop=True))
print(result)
A A B
0 A0 A0_link0 B0_link0
1 A1 A1_link1 B1_link1
2 A2 A2_link2 B2_link2
3 A3 A3_link3 B3_link3
df2.A.loc[df2.A.str.split('_',expand=True).iloc[:,0].isin(df1.A)]
0 A0_link0
1 A1_link1
2 A2_link2
3 A3_link3
I have been trying to obtain a slice of a DataFrame using the .loc[] attribute. I provided a list of labels in the argument. But, I don't understand why the length of the DataFrame slice is greater than than the length of the list provided in the argument. In the following code, 'get_list_of_university_towns()' and 'convert_housing_data_to_quarters()' are two pre-defined functions which returned two different DataFrames.
utowns = get_list_of_university_towns().set_index(['State','RegionName']).index
alltowns_data = convert_housing_data_to_quarters()
X = alltowns_data.loc[utowns]
utowns_set = set(utowns)
non_utowns = [item for item in alltowns_data.index if(item not in utowns_set)]
len(non_utowns)
Y = alltowns_data.loc[non_utowns]
len(Y)
Here, I get the length of the non_utowns list as 10,461 and the length of the DataFrame slice to be 10,937.
Happens if there are duplicates in the index. A minimal example below:
In [5]: df = pd.DataFrame(np.random.randn(5,5),columns=list("abcde"),index=["a1","a1","a2","a3","a4"])
In [6]: df
Out[6]:
a b c d e
a1 0.127354 -0.629876 -0.632672 0.724379 -0.229413
a1 -0.460529 1.109605 -0.515612 1.351368 -0.458339
a2 0.803422 0.679793 1.294371 0.894829 0.035137
a3 1.413082 -0.893431 -2.041129 1.498739 0.021033
a4 1.233385 -0.829729 0.446116 1.665736 -0.309078
In [7]: nu = [x for x in df.index if x[-1] != "2"]
In [8]: nu
Out[8]: ['a1', 'a1', 'a3', 'a4']
In [9]: len(nu)
Out[9]: 4
In [10]: df.loc[nu]
Out[10]:
a b c d e
a1 0.127354 -0.629876 -0.632672 0.724379 -0.229413
a1 -0.460529 1.109605 -0.515612 1.351368 -0.458339
a1 0.127354 -0.629876 -0.632672 0.724379 -0.229413
a1 -0.460529 1.109605 -0.515612 1.351368 -0.458339
a3 1.413082 -0.893431 -2.041129 1.498739 0.021033
a4 1.233385 -0.829729 0.446116 1.665736 -0.309078
In [11]: df
Out[11]:
a b c d e
a1 0.127354 -0.629876 -0.632672 0.724379 -0.229413
a1 -0.460529 1.109605 -0.515612 1.351368 -0.458339
a2 0.803422 0.679793 1.294371 0.894829 0.035137
a3 1.413082 -0.893431 -2.041129 1.498739 0.021033
a4 1.233385 -0.829729 0.446116 1.665736 -0.309078
How to use the values of one column to access values in another
import numpy
impot pandas
numpy.random.seed(123)
df = pandas.DataFrame((numpy.random.normal(0, 1, 10)), columns=[['Value']])
df['bleh'] = df.index.to_series().apply(lambda x: numpy.random.randint(0, x + 1, 1)[0])
so how to access the value 'bleh' for each row?
df.Value.iloc[df['bleh']]
Edit:
Thanks to #ScottBoston. My DF constructor had one layer of [] too much.
The correct answer is:
numpy.random.seed(123)
df = pandas.DataFrame((numpy.random.normal(0, 1, 10)), columns=['Value'])
df['bleh'] = df.index.to_series().apply(lambda x: numpy.random.randint(0, x + 1, 1)[0])
df['idx_int'] = range(df.shape[0])
df['haa'] = df['idx_int'] - df.bleh.values
df['newcol'] = df.Value.iloc[df['haa'].values].values
Try:
df['Value'].tolist()
Output:
[-1.0856306033005612,
0.9973454465835858,
0.28297849805199204,
-1.506294713918092,
-0.5786002519685364,
1.651436537097151,
-2.426679243393074,
-0.42891262885617726,
1.265936258705534,
-0.8667404022651017]
Your dataframe constructor still needs to be fixed.
Are you looking for:
df.set_index('bleh')
output:
Value
bleh
0 -1.085631
1 0.997345
2 0.282978
1 -1.506295
4 -0.578600
0 1.651437
0 -2.426679
4 -0.428913
1 1.265936
7 -0.866740
If so you, your dataframe constructor has as extra set of [] in it.
np.random.seed(123)
df = pd.DataFrame((np.random.normal(0, 1, 10)), columns=['Value'])
df['bleh'] = df.index.to_series().apply(lambda x: np.random.randint(0, x + 1, 1)[0])
columns paramater in dataframe takes a list not a list of list.
How to apply a rolling Kalman Filter to a DataFrame column (without using external data)?
That is, pretending that each row is a new point in time and therefore requires for the descriptive statistics to be updated (in a rolling manner) after each row.
For example, how to apply the Kalman Filter to any column in the below DataFrame?
n = 2000
index = pd.date_range(start='2000-01-01', periods=n)
data = np.random.randn(n, 4)
df = pd.DataFrame(data, columns=list('ABCD'), index=index)
I've seen previous responses (1 and 2) however they are not applying it to a DataFrame column (and they are not vectorized).
How to apply a rolling Kalman Filter to a column in a DataFrame?
Exploiting some good features of numpy and using pykalman library, and applying the Kalman Filter on column D for a rolling window of 3, we can write:
import pandas as pd
from pykalman import KalmanFilter
import numpy as np
def rolling_window(a, step):
shape = a.shape[:-1] + (a.shape[-1] - step + 1, step)
strides = a.strides + (a.strides[-1],)
return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
def get_kf_value(y_values):
kf = KalmanFilter()
Kc, Ke = kf.em(y_values, n_iter=1).smooth(0)
return Kc
n = 2000
index = pd.date_range(start='2000-01-01', periods=n)
data = np.random.randn(n, 4)
df = pd.DataFrame(data, columns=list('ABCD'), index=index)
wsize = 3
arr = rolling_window(df.D.values, wsize)
zero_padding = np.zeros(shape=(wsize-1,wsize))
arrst = np.concatenate((zero_padding, arr))
arrkalman = np.zeros(shape=(len(arrst),1))
for i in range(len(arrst)):
arrkalman[i] = get_kf_value(arrst[i])
kalmandf = pd.DataFrame(arrkalman, columns=['D_kalman'], index=index)
df = pd.concat([df,kalmandf], axis=1)
df.head() should yield something like this:
A B C D D_kalman
2000-01-01 -0.003156 -1.487031 -1.755621 -0.101233 0.000000
2000-01-02 0.172688 -0.767011 -0.965404 -0.131504 0.000000
2000-01-03 -0.025983 -0.388501 -0.904286 1.062163 0.013633
2000-01-04 -0.846606 -0.576383 -1.066489 -0.041979 0.068792
2000-01-05 -1.505048 0.498062 0.619800 0.012850 0.252550
Is there an easy way to quickly see contents of two pd.DataFrames side-by-side in Jupyter notebooks?
df1 = pd.DataFrame([(1,2),(3,4)], columns=['a', 'b'])
df2 = pd.DataFrame([(1.1,2.1),(3.1,4.1)], columns=['a', 'b'])
df1, df2
You should try this function from #Wes_McKinney
def side_by_side(*objs, **kwds):
''' Une fonction print objects side by side '''
from pandas.io.formats.printing import adjoin
space = kwds.get('space', 4)
reprs = [repr(obj).split('\n') for obj in objs]
print(adjoin(space, *reprs))
# building a test case of two DataFrame
import pandas as pd
import numpy as np
n, p = (10, 3) # dfs' shape
# dfs indexes and columns labels
index_rowA = [t[0]+str(t[1]) for t in zip(['rA']*n, range(n))]
index_colA = [t[0]+str(t[1]) for t in zip(['cA']*p, range(p))]
index_rowB = [t[0]+str(t[1]) for t in zip(['rB']*n, range(n))]
index_colB = [t[0]+str(t[1]) for t in zip(['cB']*p, range(p))]
# buliding the df A and B
dfA = pd.DataFrame(np.random.rand(n,p), index=index_rowA, columns=index_colA)
dfB = pd.DataFrame(np.random.rand(n,p), index=index_rowB, columns=index_colB)
side_by_side(dfA,dfB) Outputs
cA0 cA1 cA2 cB0 cB1 cB2
rA0 0.708763 0.665374 0.718613 rB0 0.320085 0.677422 0.722697
rA1 0.120551 0.277301 0.646337 rB1 0.682488 0.273689 0.871989
rA2 0.372386 0.953481 0.934957 rB2 0.015203 0.525465 0.223897
rA3 0.456871 0.170596 0.501412 rB3 0.941295 0.901428 0.329489
rA4 0.049491 0.486030 0.365886 rB4 0.597779 0.201423 0.010794
rA5 0.277720 0.436428 0.533683 rB5 0.701220 0.261684 0.502301
rA6 0.391705 0.982510 0.561823 rB6 0.182609 0.140215 0.389426
rA7 0.827597 0.105354 0.180547 rB7 0.041009 0.936011 0.613592
rA8 0.224394 0.975854 0.089130 rB8 0.697824 0.887613 0.972838
rA9 0.433850 0.489714 0.339129 rB9 0.263112 0.355122 0.447154
The closest to what you want could be:
> df1.merge(df2, right_index=1, left_index=1, suffixes=("_1", "_2"))
a_1 b_1 a_2 b_2
0 1 2 1.1 2.1
1 3 4 3.1 4.1
It's not specific of the notebook, but it will work, and it's not that complicated. Another solution would be to convert your dataframe to an image and put them side by side in subplots. But it's a bit far-fetched and complicated.
I ended up using a helper function to quickly compare two data frames:
def cmp(df1, df2, topn=10):
n = topn
a = df1.reset_index().head(n=n)
b = df2.reset_index().head(n=n)
span = pd.DataFrame(data=[('-',) for _ in range(n)], columns=['sep'])
a = a.merge(span, right_index=1, left_index=1)
return a.merge(b, right_index=1, left_index=1, suffixes=['_L', '_R'])