Pandas concatenate dataframe with multiindex retaining index names - pandas

I have a list of DataFrames, where each DataFrame in the list looks like this:
dfList[0]
monthNum 1 2
G1
2.0 0.05 -0.16
3.0 1.17 0.07
4.0 9.06 0.83
dfList[1]
monthNum 1 2
G2
21.0 0.25 0.26
31.0 1.27 0.27
41.0 9.26 0.23
dfList[0].index
Float64Index([2.0, 3.0, 4.0], dtype='float64', name='G1')
dfList[0].columns
Int64Index([1, 2], dtype='int64', name='monthNum')
I am trying to achieve the following in a dataframe Final_Combined_DF:
monthNum 1 2
G1
2.0 0.05 -0.16
3.0 1.17 0.07
4.0 9.06 0.83
G2
21.0 0.25 0.26
31.0 1.27 0.27
41.0 9.26 0.23
I tried different combinations of:
pd.concat(dfList, axis=0)
but none has given me the desired output, and I am not sure how to go about this.

We can try pd.concat with keys, using the Index.name from each DataFrame to add a new index level to the final frame:
final_combined_df = pd.concat(
    df_list, keys=map(lambda d: d.index.name, df_list)
)
final_combined_df:
monthNum 0 1
G1 2.0 4 7
3.0 7 1
4.0 9 5
G2 21.0 8 1
31.0 1 8
41.0 2 6
Setup Used:
import numpy as np
import pandas as pd
np.random.seed(5)
df_list = [
    pd.DataFrame(np.random.randint(1, 10, (3, 2)),
                 columns=pd.Index([0, 1], name='monthNum'),
                 index=pd.Index([2.0, 3.0, 4.0], name='G1')),
    pd.DataFrame(np.random.randint(1, 10, (3, 2)),
                 columns=pd.Index([0, 1], name='monthNum'),
                 index=pd.Index([21.0, 31.0, 41.0], name='G2'))
]
df_list:
[monthNum 0 1
G1
2.0 4 7
3.0 7 1
4.0 9 5,
monthNum 0 1
G2
21.0 8 1
31.0 1 8
41.0 2 6]
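As a small variant of the above (my sketch, not part of the original answer): the keys can also be built with a list comprehension, and concat's names parameter can label the levels of the new MultiIndex ('group' and 'row' are made-up level names here):

```python
import numpy as np
import pandas as pd

np.random.seed(5)
df_list = [
    pd.DataFrame(np.random.randint(1, 10, (3, 2)),
                 columns=pd.Index([0, 1], name='monthNum'),
                 index=pd.Index([2.0, 3.0, 4.0], name='G1')),
    pd.DataFrame(np.random.randint(1, 10, (3, 2)),
                 columns=pd.Index([0, 1], name='monthNum'),
                 index=pd.Index([21.0, 31.0, 41.0], name='G2')),
]

# list comprehension instead of map; names= labels both MultiIndex levels,
# since the per-frame index names (G1/G2) differ and would otherwise leave
# the second level unnamed
final = pd.concat(df_list,
                  keys=[d.index.name for d in df_list],
                  names=['group', 'row'])
```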

Related

Pandas Cumsum in expanding rows

Looking to learn how to code this solution in a more elegant way. I need to split a set of rows into smaller pieces, control the utilization, and calculate the balance. The current solution is not generating the balance properly.
import pandas as pd
import numpy as np
box_list = [['Box0', 0.2],
            ['Box1', 1.0],
            ['Box2', 1.8],
            ['Box4', 2.0],
            ['Box8', 4.01]]
sdf = pd.DataFrame(box_list, columns = ['Name', 'Size'])
print(sdf)
   Name  Size
0  Box0  0.20
1  Box1  1.00
2  Box2  1.80
3  Box4  2.00
4  Box8  4.01
df = pd.DataFrame({'Name': np.repeat(sdf['Name'], sdf['Size'].apply(np.ceil)),
                   'Size': np.repeat(sdf['Size'], sdf['Size'].apply(np.ceil))})
df['Max_Units'] = df['Size'].apply(lambda x: np.ceil(x) if x > 1.0 else 1.0)
df = df.reset_index()
df['Utilization'] = df['Size'].apply(lambda x: x - int(x) if x > 1.0 else (x if x < 1.0 else 1.0))
df['Balance'] = df['Max_Units']
g = df.groupby(['index'], as_index=0, group_keys=0)
df['Utilization'] = g.apply(lambda x:
                            pd.Series(np.where((x.Balance.shift(1) >= 1.0),
                                               1.0,
                                               x.Utilization))).values
df.loc[(df.Utilization == 0.0), ['Utilization']] = 1.0
df['Balance'] = g.apply(lambda x:
                        pd.Series(np.where((x.Balance.shift(1) >= 1.0),
                                           x.Max_Units - x.Utilization,
                                           0))).values
print(df)
    index  Name  Size  Max_Units  Utilization  Balance
0       0  Box0  0.20        1.0         0.20      0.0
1       1  Box1  1.00        1.0         1.00      0.0
2       2  Box2  1.80        2.0         0.80      0.0
3       2  Box2  1.80        2.0         1.00      1.0
4       3  Box4  2.00        2.0         1.00      0.0
5       3  Box4  2.00        2.0         1.00      1.0
6       4  Box8  4.01        5.0         0.01      0.0
7       4  Box8  4.01        5.0         1.00      4.0
8       4  Box8  4.01        5.0         1.00      4.0
9       4  Box8  4.01        5.0         1.00      4.0
10      4  Box8  4.01        5.0         1.00      4.0
I'm not sure if I completely understand what all of these values are supposed to be representing.
However, I've achieved the correct desired output for your sample set in a more direct way:
import pandas as pd
import numpy as np
box_list = [['Box0', 0.2],
            ['Box1', 1.0],
            ['Box2', 1.8],
            ['Box4', 2.0],
            ['Box8', 4.01]]
df = pd.DataFrame(box_list, columns=['Name', 'Size'])
# Set ceil column to ceil of size since it's used more than once
df['ceil'] = df['Size'].apply(np.ceil)
# Duplicate Rows based on Ceil of Size
df = df.loc[df.index.repeat(df['ceil'])]
# Get Max Units by comparing it to the ceil column
df['Max_Units'] = df.apply(lambda s: max(s['ceil'], 1), axis=1)
# Extract Decimal Portion By Using % 1 (Catch Special Case of x == 1)
df['Utilization'] = df['Size'].apply(lambda x: 1 if x == 1 else x % 1)
# Everywhere Max_Units cumcount is not 0 set Utilization to 1
df.loc[df.groupby(df['Max_Units']).cumcount().ne(0), 'Utilization'] = 1
# Set Balance to index cumcount as float
df['Balance'] = df.groupby(df.index).cumcount().astype(float)
# Drop Unnecessary Column and reset index for output
df = df.drop(columns=['ceil']).reset_index()
# For Display
print(df)
Output:
    index  Name  Size  Max_Units  Utilization  Balance
0       0  Box0  0.20        1.0         0.20      0.0
1       1  Box1  1.00        1.0         1.00      0.0
2       2  Box2  1.80        2.0         0.80      0.0
3       2  Box2  1.80        2.0         1.00      1.0
4       3  Box4  2.00        2.0         1.00      0.0
5       3  Box4  2.00        2.0         1.00      1.0
6       4  Box8  4.01        5.0         0.01      0.0
7       4  Box8  4.01        5.0         1.00      1.0
8       4  Box8  4.01        5.0         1.00      2.0
9       4  Box8  4.01        5.0         1.00      3.0
10      4  Box8  4.01        5.0         1.00      4.0
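The row-duplication idiom used above (index.repeat plus a per-label cumcount) can be seen in isolation; a minimal sketch with made-up data:

```python
import pandas as pd

s = pd.DataFrame({'Name': ['a', 'b'], 'n': [1, 3]})
# repeat each row n times, keeping the original index label on every copy
out = s.loc[s.index.repeat(s['n'])].copy()
# number the duplicates within each original label: 0, 1, 2, ...
out['k'] = out.groupby(out.index).cumcount()
```

Grouping on the (now duplicated) index is what lets cumcount distinguish the first copy of a row from its repeats, which the answer above exploits for the Utilization and Balance columns.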

sort column by absolute value with pandas

I am trying to sort this dataframe, on abs(C)
A B C
0 10.3 11.3 -0.72
1 16.2 10.9 -0.84
2 18.1 15.2 0.64
3 12.2 11.3 0.31
4 17.2 12.2 -0.75
5 11.6 15.4 -0.08
6 16.0 10.4 0.05
7 18.8 14.7 -0.61
8 12.6 16.3 0.85
9 11.6 10.8 0.93
To do that, I have to append a new column D = abs(C), and then sort on D
df['D'] = abs(df['C'])
df.sort_values(by=['D'])
Is there a way to do the job in one method?
Use Series.abs with Series.argsort to get the positions of the sorted absolute values, then reorder the rows with DataFrame.iloc:
df2 = df.iloc[df.C.abs().argsort()]
print(df2)
A B C
6 16.0 10.4 0.05
5 11.6 15.4 -0.08
3 12.2 11.3 0.31
7 18.8 14.7 -0.61
2 18.1 15.2 0.64
0 10.3 11.3 -0.72
4 17.2 12.2 -0.75
1 16.2 10.9 -0.84
8 12.6 16.3 0.85
9 11.6 10.8 0.93
(From my answer in another post:)
A simple solution with pandas >= 1.1.0:
Use the key parameter of the sort_values function:
import pandas as pd
df = pd.DataFrame({'a': ['a', 'b', 'c', 'd', 'e', 'f'], 'b': [-3, -2, -1, 0, 1, 2]})
df.sort_values(by='b', key=abs)
will yield:
a b
3 d 0
2 c -1
4 e 1
1 b -2
5 f 2
0 a -3
import pandas as pd
ttt = pd.DataFrame({'a': ['a', 'b', 'c', 'd', 'e', 'f'], 'b': [-3, -2, -1, 0, 1, 2]})
# ascending order
ttt_as = ttt.iloc[ttt.b.abs().argsort()]
print(ttt_as)
# descending order
ttt_des = ttt.iloc[ttt.b.abs().argsort()][::-1]
print(ttt_des)
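For pandas >= 1.1.0, the key parameter also combines with ascending=False, which may read more cleanly than reversing with [::-1]; a small sketch:

```python
import pandas as pd

df = pd.DataFrame({'C': [-0.72, 0.05, 0.93]})
# sort by |C|, largest absolute value first
out = df.sort_values(by='C', key=abs, ascending=False)
```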

fill_value in the pandas shift doesn't work with groupby

I need to shift a column in a pandas dataframe for every name and fill the resulting NAs with a predefined value. Below is a code snippet written for Python 2.7:
import pandas as pd
d = {'Name': ['Petro', 'Petro', 'Petro', 'Petro', 'Petro', 'Mykola', 'Mykola', 'Mykola', 'Mykola', 'Mykola', 'Mykyta', 'Mykyta', 'Mykyta', 'Mykyta', 'Mykyta'],
     'Month': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
     'Value': [25, 2.5, 24.6, 28, 26.4, 35, 24, 35, 22, 27, 30, 30, 34, 30, 23]}
data = pd.DataFrame(d)
data['ValueLag'] = data.groupby('Name').Value.shift(-1, fill_value = 20)
print data
After running code above I get the following output
Month Name Value ValueLag
0 1 Petro 25.0 2.5
1 2 Petro 2.5 24.6
2 3 Petro 24.6 28.0
3 4 Petro 28.0 26.4
4 5 Petro 26.4 NaN
5 1 Mykola 35.0 24.0
6 2 Mykola 24.0 35.0
7 3 Mykola 35.0 22.0
8 4 Mykola 22.0 27.0
9 5 Mykola 27.0 NaN
10 1 Mykyta 30.0 30.0
11 2 Mykyta 30.0 34.0
12 3 Mykyta 34.0 30.0
13 4 Mykyta 30.0 23.0
14 5 Mykyta 23.0 NaN
It looks like fill_value did not work here, while I need the NaN to be filled with some number, say 4.
Or, to tell the whole story, I need the last value to be extended like this:
Month Name Value ValueLag
0 1 Petro 25.0 2.5
1 2 Petro 2.5 24.6
2 3 Petro 24.6 28.0
3 4 Petro 28.0 26.4
4 5 Petro 26.4 26.4
5 1 Mykola 35.0 24.0
6 2 Mykola 24.0 35.0
7 3 Mykola 35.0 22.0
8 4 Mykola 22.0 27.0
9 5 Mykola 27.0 27.0
10 1 Mykyta 30.0 30.0
11 2 Mykyta 30.0 34.0
12 3 Mykyta 34.0 30.0
13 4 Mykyta 30.0 23.0
14 5 Mykyta 23.0 23.0
Is there a way to fill forward with the last value (or backward with the first value when shifting a positive number of periods)?
It seems that fill_value is being ignored when shifting within groups here. Since you want each group's last value carried forward, shift and then forward-fill:
data['ValueLag'] = data.groupby('Name').Value.shift(-1).ffill()
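If a constant fill (say 4, as the question mentions) is wanted instead, a plain fillna after the group-wise shift should do it; a sketch on a trimmed version of the data:

```python
import pandas as pd

data = pd.DataFrame({'Name': ['Petro', 'Petro', 'Mykola', 'Mykola'],
                     'Value': [25, 2.5, 35, 24]})
# shift within each group, then replace the trailing NaN of each group with 4
data['ValueLag'] = data.groupby('Name').Value.shift(-1).fillna(4)
```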

Take product of columns in dataframe with lags

I have the following dataframe.
A = pd.Series([2, 3, 4, 5], index=[1, 2, 3, 4])
B = pd.Series([6, 7, 8, 9], index=[1, 2, 3, 4])
Aw = pd.Series([0.25, 0.3, 0.33, 0.36], index=[1, 2, 3, 4])
Bw = pd.Series([0.75, 0.7, 0.67, 0.65], index=[1, 2, 3, 4])
df = pd.DataFrame({'A': A, 'B': B, 'Aw': Aw, 'Bw': Bw})
df
Index A B Aw Bw
1 2 6 0.25 0.75
2 3 7 0.30 0.70
3 4 8 0.33 0.67
4 5 9 0.36 0.64
What I would like to do is multiply 'A' by the lag of 'Aw', and likewise 'B' by the lag of 'Bw'. The resulting dataframe will look like the following:
Index A B Aw Bw A_ctr B_ctr
1 2 6 NaN NaN NaN NaN
2 3 7 0.25 0.75 0.75 5.25
3 4 8 0.3 0.7 1.2 5.6
4 5 9 0.33 0.64 1.65 5.76
Thank you in advance
To get your desired output, first shift Aw and Bw, then multiply them by A and B:
df[['Aw','Bw']] = df[['Aw','Bw']].shift()
df[['A_ctr','B_ctr']] = df[['A','B']].values*df[['Aw','Bw']]
A B Aw Bw A_ctr B_ctr
1 2 6 NaN NaN NaN NaN
2 3 7 0.25 0.75 0.75 5.25
3 4 8 0.30 0.70 1.20 5.60
4 5 9 0.33 0.67 1.65 6.03
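As a variant (my sketch, not the answer above), shifting inline leaves the original Aw/Bw columns untouched instead of overwriting them:

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 3], 'B': [6, 7],
                   'Aw': [0.25, 0.30], 'Bw': [0.75, 0.70]}, index=[1, 2])
# shift the weights into a temporary frame so Aw/Bw keep their values
shifted = df[['Aw', 'Bw']].shift()
df['A_ctr'] = df['A'] * shifted['Aw']
df['B_ctr'] = df['B'] * shifted['Bw']
```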

Pandas dataframe creating multiple rows at once via .loc

I can create a new row in a dataframe using .loc:
>>> df = pd.DataFrame({'a':[10, 20], 'b':[100,200]}, index='1 2'.split())
>>> df
a b
1 10 100
2 20 200
>>> df.loc[3, 'a'] = 30
>>> df
a b
1 10.0 100.0
2 20.0 200.0
3 30.0 NaN
But how can I create more than one row using the same method?
>>> df.loc[[4, 5], 'a'] = [40, 50]
...
KeyError: '[4 5] not in index'
I'm familiar with .append() but am looking for a way that does NOT require constructing the new row as a Series before appending it to df.
Desired input:
>>> df.loc[[4, 5], 'a'] = [40, 50]
Desired output
a b
1 10.0 100.0
2 20.0 200.0
3 30.0 NaN
4 40.0 NaN
5 50.0 NaN
Where last 2 rows are newly added.
Admittedly, this is a very late answer, but I have had to deal with a similar problem and think my solution might be helpful to others as well.
After recreating your data, it is basically a two-step approach:
Recreate data:
import pandas as pd
df = pd.DataFrame({'a':[10, 20], 'b':[100,200]}, index='1 2'.split())
df.loc[3, 'a'] = 30
Extend the df.index using .reindex:
idx = list(df.index)
new_rows = list(map(str, range(4, 6)))  # more easily extensible than new_rows = ["4", "5"]
idx.extend(new_rows)
df = df.reindex(index=idx)
Set the values using .loc:
df.loc[new_rows, "a"] = [40, 50]
giving you
>>> df
a b
1 10.0 100.0
2 20.0 200.0
3 30.0 NaN
4 40.0 NaN
5 50.0 NaN
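The extend-then-reindex steps can likely be collapsed with Index.union; a sketch using string labels to match the '1 2'.split() index above:

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 20], 'b': [100, 200]}, index='1 2'.split())
df.loc['3', 'a'] = 30

new_rows = ['4', '5']
# union the existing index with the new labels in one call
# (sort=False keeps the original order and appends the new labels)
df = df.reindex(df.index.union(new_rows, sort=False))
df.loc[new_rows, 'a'] = [40, 50]
```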
Example data
>>> data = pd.DataFrame({
'a': [10, 6, -3, -2, 4, 12, 3, 3],
'b': [6, -3, 6, 12, 8, 11, -5, -5],
'id': [1, 1, 1, 1, 6, 2, 2, 4]})
Case 1. Note that the range can be altered to whatever you desire; labels 8 and 9 do not exist yet, so assigning to them creates new rows filled with NaN.
>>> for i in range(10):
... data.loc[i, 'a'] = 30
...
>>> data
a b id
0 30.0 6.0 1.0
1 30.0 -3.0 1.0
2 30.0 6.0 1.0
3 30.0 12.0 1.0
4 30.0 8.0 6.0
5 30.0 11.0 2.0
6 30.0 -5.0 2.0
7 30.0 -5.0 4.0
8 30.0 NaN NaN
9 30.0 NaN NaN
Case 2. Here we are adding a new column to a data frame that had 8 rows to begin with. As we extend our new column c to length 10, the other columns are extended with NaN.
>>> for i in range(10):
... data.loc[i, 'c'] = 30
...
>>> data
a b id c
0 10.0 6.0 1.0 30.0
1 6.0 -3.0 1.0 30.0
2 -3.0 6.0 1.0 30.0
3 -2.0 12.0 1.0 30.0
4 4.0 8.0 6.0 30.0
5 12.0 11.0 2.0 30.0
6 3.0 -5.0 2.0 30.0
7 3.0 -5.0 4.0 30.0
8 NaN NaN NaN 30.0
9 NaN NaN NaN 30.0
Also somewhat late, but my solution was similar to the accepted one:
import pandas as pd
df = pd.DataFrame({'a':[10, 20], 'b':[100,200]}, index=[1,2])
# single index assignment always works
df.loc[3, 'a'] = 30
# multiple indices
new_rows = [4,5]
# there should be a nicer way to add more than one index/row at once,
# but at least this is just one extra line:
df = df.reindex(index=df.index.append(pd.Index(new_rows))) # note: Index.append() doesn't accept non-Index iterables?
# multiple new rows now works:
df.loc[new_rows, "a"] = [40, 50]
print(df)
... which yields:
a b
1 10.0 100.0
2 20.0 200.0
3 30.0 NaN
4 40.0 NaN
5 50.0 NaN
This also works now (useful when performance matters while building up dataframes):
# inserting whole rows:
df.loc[new_rows] = [[41, 51], [61,71]]
print(df)
a b
1 10.0 100.0
2 20.0 200.0
3 30.0 NaN
4 41.0 51.0
5 61.0 71.0