Replicating a Row in a Pandas DataFrame

I have a dataset in this format:
df = pd.DataFrame({'ID': [1, 2, 3, 4],
                   'Type': ['A', 'B', 'C', 'D'],
                   'Value': [100, 200, 201, 120]})
I want to replicate one of the rows of this dataframe several times and assign each copy a new, incremental ID, so that the new dataset looks like the following:
df = pd.DataFrame({'ID': [1, 2, 3, 4, 10, 11, 12, 13],
                   'Type': ['A', 'B', 'C', 'D', 'B', 'B', 'B', 'B'],
                   'Value': [100, 200, 201, 120, 200, 200, 200, 200]})
I could replicate the rows in the following way:
df_B = df[df['Type'] == 'B']
df_B = pd.concat([df_B] * 4)
But I'm having trouble updating the ID column incrementally. Any guidance would be appreciated.
Thanks!

Do you mean something like this?
print(df.append(df.loc[df['Type'].eq('B')]
                  .iloc[[0] * 4]
                  .reset_index(drop=True)
                  .assign(ID=df['ID'] + 9),
                ignore_index=True))
Output:
   ID Type  Value
0   1    A    100
1   2    B    200
2   3    C    201
3   4    D    120
4  10    B    200
5  11    B    200
6  12    B    200
7  13    B    200

You can specify a starting value for the new incremental ID column and add it with DataFrame.assign, then append the result to the original with DataFrame.append:
N = 4
first = 10
df_B = df[df['Type'] == 'B']
df_B = df.append(pd.concat([df_B] * N).assign(ID=range(first, first + N)), ignore_index=True)
print(df_B)
   ID Type  Value
0   1    A    100
1   2    B    200
2   3    C    201
3   4    D    120
4  10    B    200
5  11    B    200
6  12    B    200
7  13    B    200
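Both answers rely on DataFrame.append, which was deprecated in pandas 1.4 and removed in pandas 2.0. A minimal sketch of the same approach on modern pandas, reusing df, N, and first from above, with pd.concat standing in for append:
df_B = df[df['Type'] == 'B']
# pd.concat([df, ...]) replaces the removed df.append(...)
out = pd.concat([df, pd.concat([df_B] * N).assign(ID=range(first, first + N))],
                ignore_index=True)
print(out)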

Related

Transform a pandas dataframe: rename columns based on row values

I have the following data frame, df:
data = [{'s': 123, 'x': 5, 'a': 1, 'b': 2, 'c': 3},
        {'s': 123, 'x': 22, 'a': 4, 'b': 5, 'c': 6},
        {'s': 123, 'x': 33, 'a': 7, 'b': 8, 'c': 9},
        {'s': 124, 'x': 5, 'a': 11, 'b': 12, 'c': 3},
        {'s': 124, 'x': 22, 'a': 14, 'b': 15, 'c': 16},
        {'s': 124, 'x': 33, 'a': 17, 'b': 18, 'c': 19}]
df = pd.DataFrame(data, columns=['s', 'x', 'a', 'b', 'c'])
and would like to produce df_out, where _x needs to be appended to the column names (e.g. a_5, a_22). Later I will index df_out on s and then do df_out.to_dict('index') to produce the desired output I need. I have tried transposing df and renaming the rows with a lambda function based on x, but I'm having trouble getting the desired df_out. Any help would be great.
Thanks
Pivot, then merge the column labels, converting x to str to aid in the joining:
df2 = df.pivot(index='s', columns='x').reset_index()
df2.columns = [str(col[0] + '_' + str(col[1])).strip('_') for col in df2.columns]
df2
     s  a_5  a_22  a_33  b_5  b_22  b_33  c_5  c_22  c_33
0  123    1     4     7    2     5     8    3     6     9
1  124   11    14    17   12    15    18    3    16    19
You can use a pivot:
(df
 .pivot(index='s', columns='x')
 .pipe(lambda d: d.set_axis(d.columns.map(lambda x: '_'.join(map(str, x))), axis=1))
 .reset_index()
)
Output:
     s  a_5  a_22  a_33  b_5  b_22  b_33  c_5  c_22  c_33
0  123    1     4     7    2     5     8    3     6     9
1  124   11    14    17   12    15    18    3    16    19
Here's another option:
df_out = df.melt(['s','x']).set_index(['s','x', 'variable']).unstack([2,1])['value']
df_out.columns = [f'{i}_{j}' for i, j in df_out.columns]
print(df_out.reset_index())
Output:
     s  a_5  a_22  a_33  b_5  b_22  b_33  c_5  c_22  c_33
0  123    1     4     7    2     5     8    3     6     9
1  124   11    14    17   12    15    18    3    16    19
One option is pivot_wider from pyjanitor:
# pip install pyjanitor
import janitor
import pandas as pd
df.pivot_wider(index = 's', names_from = 'x')
     s  a_5  a_22  a_33  b_5  b_22  b_33  c_5  c_22  c_33
0  123    1     4     7    2     5     8    3     6     9
1  124   11    14    17   12    15    18    3    16    19
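For completeness, the final step described in the question (indexing df_out on s and calling to_dict('index')) would look something like this, using df2 from the first answer:
# Index on s, then build one dict per row keyed by the s value
result = df2.set_index('s').to_dict('index')
# e.g. result[123] -> {'a_5': 1, 'a_22': 4, 'a_33': 7, 'b_5': 2, ...}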

Get DataFrame values from variable column

I would like to get a Series whose values come, for each row, from the column whose name is stored in that row's COL value.
import pandas as pd
df = pd.DataFrame({'A': [10, 20, 30, 40], 'B': [100, 200, 300, 400], 'COL': ['A', 'B', 'A', 'B']})
df
    A    B COL
0  10  100   A
1  20  200   B
2  30  300   A
3  40  400   B
What I need as a result:
     X
0   10
1  200
2   30
3  400
Use DataFrame.lookup; the only requirement is that all values in COL exist in the column names:
df['X'] = df.lookup(df.index, df['COL'])
print(df)
    A    B COL    X
0  10  100   A   10
1  20  200   B  200
2  30  300   A   30
3  40  400   B  400
This will also solve your problem:
df['X'] = df.apply(lambda x: x[x['COL']], axis=1)

Subtract columns from DataFrames with different shapes by looking up based on another column

df2 has more columns and rows than df1. For each row in df2, I want to look up a corresponding row in df1 based on matching values in one of their columns. From this matching row, I want to subtract df1's value in column B from df2's. I tried set_index and directly subtracting the dataframes, but that resulted in a lot of NaN.
df1 = pd.DataFrame([[1, 10], [2, 20], [3, 30]],
                   columns=['A', 'B'])
df2 = pd.DataFrame([[1, 100, 15], [1, 200, 20],
                    [2, 100, 30], [2, 200, 35],
                    [3, 100, 50], [3, 200, 55]],
                   columns=['A', 'X', 'B'])
# For each row in df2, look up the matching row in df1 based on column A,
# and produce the difference of the values in column B.
expected = pd.DataFrame([[1, 100, 5], [1, 200, 10],
                         [2, 100, 10], [2, 200, 15],
                         [3, 100, 20], [3, 200, 25]],
                        columns=['A', 'X', 'B'])
DataFrames:
df1
   A   B
0  1  10
1  2  20
2  3  30
df2
   A    X   B
0  1  100  15
1  1  200  20
2  2  100  30
3  2  200  35
4  3  100  50
5  3  200  55
expected
   A    X   B
0  1  100   5
1  1  200  10
2  2  100  10
3  2  200  15
4  3  100  20
5  3  200  25
Set the index of df1 to 'A' and map it onto df2.A; after that, do the subtraction:
df2['B'] -= df2.A.map(df1.set_index('A').B)
print(df2)
   A    X   B
0  1  100   5
1  1  200  10
2  2  100  10
3  2  200  15
4  3  100  20
5  3  200  25
Note: in case df2.A has values that don't exist in df1.A, the result will be NaN on those rows. I leave it that way because your sample data doesn't specify how to handle it. If you want to keep the value of B unchanged in that case, just chain .fillna(0) to the end of map, or call the subtract method with the fill_value=0 option:
df2['B'] -= df2.A.map(df1.set_index('A').B).fillna(0)
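The subtract variant mentioned above would look something like this:
# fill_value=0 treats unmatched lookups as subtracting 0, leaving B unchanged
df2['B'] = df2['B'].sub(df2['A'].map(df1.set_index('A')['B']), fill_value=0)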
You can also use merge:
df2.merge(df1, on='A').eval('B = B_x - B_y').drop(['B_x','B_y'], axis=1)
Output:
   A    X   B
0  1  100   5
1  1  200  10
2  2  100  10
3  2  200  15
4  3  100  20
5  3  200  25
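If you'd rather avoid eval, a sketch of the same merge with explicit column arithmetic (the '_r' suffix is an arbitrary choice here):
m = df2.merge(df1, on='A', suffixes=('', '_r'))
m['B'] = m['B'] - m.pop('B_r')  # pop drops the helper column as it's used
print(m)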

How to merge specific column values in pandas using an aggregate [duplicate]

I have a pandas dataframe
x = pd.DataFrame.from_dict({'row': [1, 1, 2, 2, 3, 3, 3],
                            'add': [1, 2, 3, 4, 5, 6, 7],
                            'take1': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
                            'take2': ['11', '22', '33', '44', '55', '66', '77'],
                            'range': [100, 200, 300, 400, 500, 600, 700]})
   add  range  row take1 take2
0    1    100    1     a    11
1    2    200    1     b    22
2    3    300    2     c    33
3    4    400    2     d    44
4    5    500    3     e    55
5    6    600    3     f    66
6    7    700    3     g    77
I want to group it by the row column, sum the entries in the add column, take the first entry from take1 and take2, and select the min and max from range:
   add  row take1 take2  min_range  max_range
0    3    1     a    11        100        200
1    7    2     c    33        300        400
2   18    3     e    55        500        700
Use DataFrameGroupBy.agg with a dict, but then some cleaning is necessary, because you get a MultiIndex in the columns:
#create a dictionary of column names and functions to apply to that column
d = {'add':'sum', 'take1':'first', 'take2':'first', 'range':['min','max']}
#group by the row column and apply the corresponding aggregation to each
#column as specified in the dictionary d
df = x.groupby('row', as_index=False).agg(d)
#rename some columns
df = df.rename(columns={'first':'', 'sum':''})
df.columns = ['{0[0]}_{0[1]}'.format(x).strip('_') for x in df.columns]
print(df)
   row take1  range_min  range_max take2  add
0    1     a        100        200    11    3
1    2     c        300        400    33    7
2    3     e        500        700    55   18
Details: aggregate the columns by the functions specified in the dictionary:
df = x.groupby('row', as_index=False).agg(d)
  row range      take2 take1 add
      min  max   first first sum
0   1  100  200    11     a   3
1   2  300  400    33     c   7
2   3  500  700    55     e  18
Replacing the column names sum and first with '' leads to:
  row range      take2 take1 add
      min  max
0   1  100  200    11     a   3
1   2  300  400    33     c   7
2   3  500  700    55     e  18
A list comprehension over the columns using string formatting then produces the desired flat column names; assigning it to df.columns gives the desired output.
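As a side note: on pandas 0.25+, named aggregation gives flat column names like those in the question directly, with no renaming step. A minimal sketch:
out = x.groupby('row', as_index=False).agg(
    add=('add', 'sum'),
    take1=('take1', 'first'),
    take2=('take2', 'first'),
    min_range=('range', 'min'),
    max_range=('range', 'max'),
)
print(out)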
Here's what I had, without the column renaming/sorting (ported to Python 3, with .loc in place of the long-removed .ix):
import numpy as np

x = pd.DataFrame.from_dict({'row': [1, 1, 2, 2, 3, 3, 3],
                            'add': [1, 2, 3, 4, 5, 6, 7],
                            'take1': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
                            'take2': ['11', '22', '33', '44', '55', '66', '77'],
                            'range': [100, 200, 300, 400, 500, 600, 700]})
x.reset_index(inplace=True)
min_cols = x.loc[x.groupby(['row'])['index'].idxmin().values][['row', 'take1', 'take2']]
x_grouped = x.groupby(['row']).agg({'add': 'sum', 'range': [np.min, np.max]})
x_out = pd.merge(x_grouped, min_cols, how='left', left_index=True, right_on=['row'])
print(x_out)
   (add, sum)  (range, amin)  (range, amax)  row take1 take2
0           3            100            200    1     a    11
2           7            300            400    2     c    33
4          18            500            700    3     e    55

Pandas - Select multiple nested rows after a groupby

Given the following dataset:
df = pd.DataFrame({'a': [1, 1, 3],
                   'b': [4, 5, 6],
                   'c': [7, 8, 9],
                   'cat_1': ['a', 'a', 'b'],
                   'cat_2': ['a', 'b', 'c']})
df
group_cat = df.groupby(['cat_1', 'cat_2'])
agg_cat = group_cat.agg({'a':['sum','mean'], 'b':['min','max']})
print(agg_cat)
              a         b
            sum mean  min max
cat_1 cat_2
a     a       1    1    4   4
      b       1    1    5   5
b     c       3    3    6   6
Using xs() I'm able to select specific nested columns:
print(agg_cat.xs([('a', 'sum'), ('b', 'max')], axis=1))
              a   b
            sum max
cat_1 cat_2
a     a       1   4
      b       1   5
b     c       3   6
But when I try to apply the same logic at the row level (axis=0), I get an error:
print(agg_cat.xs([('a', 'a'), ('b', 'c')], axis=0))
KeyError: (('a', 'a'), ('b', 'c'))
You need to use the .loc indexer to select rows by index labels:
agg_cat.loc[[('a','a'),('b','c')]]
              a         b
            sum mean  min max
cat_1 cat_2
a     a       1    1    4   4
b     c       3    3    6   6
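If you need specific rows and specific nested columns at once, .loc accepts lists of tuples on both axes; a small sketch:
# Two (cat_1, cat_2) row keys and two (column, aggregation) column keys
print(agg_cat.loc[[('a', 'a'), ('b', 'c')], [('a', 'sum'), ('b', 'max')]])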