Append columns from rows in pandas

I want to convert rows into new columns, like this:
Original dataframe:
attr_0 attr_1 attr_2 attr_3
0 day_0 -0.032546 0.161111 -0.488420 -0.811738
1 day_1 -0.341992 0.779818 -2.937992 -0.236757
2 day_2 0.592365 0.729467 0.421381 0.571941
3 day_3 -0.418947 2.022934 -1.349382 1.411210
4 day_4 -0.726380 0.287871 -1.153566 -2.275976
...
After conversion:
day_0_attr_0 day_0_attr_1 day_0_attr_2 day_0_attr_3 day_1_attr_0 \
0 -0.032546 0.144388 -0.992263 0.734864 -0.936625
day_1_attr_1 day_1_attr_2 day_1_attr_3 day_2_attr_0 day_2_attr_1 \
0 -1.717135 -0.228005 -0.330573 -0.28034 0.834345
day_2_attr_2 day_2_attr_3 day_3_attr_0 day_3_attr_1 day_3_attr_2 \
0 1.161089 0.385277 -0.014138 -1.05523 -0.618873
day_3_attr_3 day_4_attr_0 day_4_attr_1 day_4_attr_2 day_4_attr_3
0 0.724463 0.137691 -1.188638 -2.457449 -0.171268
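For reference, a minimal setup sketch (values are random, so they will not match the numbers shown) that builds a frame shaped like the original one; the commented line recreates the MultiIndex that the first answer below assumes:
import numpy as np
import pandas as pd

days = ['day_{}'.format(i) for i in range(5)]
attrs = ['attr_{}'.format(i) for i in range(4)]
df = pd.DataFrame(np.random.randn(5, 4), index=days, columns=attrs)
# df.index = pd.MultiIndex.from_arrays([list(range(5)), days])  # MultiIndex variant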

If the index is a MultiIndex, use:
print (df.index)
MultiIndex(levels=[[0, 1, 2, 3, 4], ['day_0', 'day_1', 'day_2', 'day_3', 'day_4']],
labels=[[0, 1, 2, 3, 4], [0, 1, 2, 3, 4]])
df = df.reset_index(level=0, drop=True).stack().reset_index()
level_0 level_1 0
0 day_0 attr_0 -0.032546
1 day_0 attr_1 0.161111
2 day_0 attr_2 -0.488420
3 day_0 attr_3 -0.811738
4 day_1 attr_0 -0.341992
5 day_1 attr_1 0.779818
6 day_1 attr_2 -2.937992
7 day_1 attr_3 -0.236757
8 day_2 attr_0 0.592365
9 day_2 attr_1 0.729467
10 day_2 attr_2 0.421381
11 day_2 attr_3 0.571941
12 day_3 attr_0 -0.418947
13 day_3 attr_1 2.022934
14 day_3 attr_2 -1.349382
15 day_3 attr_3 1.411210
16 day_4 attr_0 -0.726380
17 day_4 attr_1 0.287871
18 day_4 attr_2 -1.153566
19 day_4 attr_3 -2.275976
df = pd.DataFrame([df[0].values], columns = df['level_0'] + '_' + df['level_1'])
print (df)
day_0_attr_0 day_0_attr_1 ... day_4_attr_2 day_4_attr_3
0 -0.032546 0.161111 ... -1.153566 -2.275976
[1 rows x 20 columns]
Another solution uses itertools.product:
from itertools import product
cols = ['{}_{}'.format(a,b) for a, b in product(df.index.get_level_values(1), df.columns)]
print (cols)
['day_0_attr_0', 'day_0_attr_1', 'day_0_attr_2', 'day_0_attr_3',
'day_1_attr_0', 'day_1_attr_1', 'day_1_attr_2', 'day_1_attr_3',
'day_2_attr_0', 'day_2_attr_1', 'day_2_attr_2', 'day_2_attr_3',
'day_3_attr_0', 'day_3_attr_1', 'day_3_attr_2', 'day_3_attr_3',
'day_4_attr_0', 'day_4_attr_1', 'day_4_attr_2', 'day_4_attr_3']
df = pd.DataFrame([df.values.ravel()], columns=cols)
print (df)
day_0_attr_0 day_0_attr_1 ... day_4_attr_2 day_4_attr_3
0 -0.032546 0.161111 ... -1.153566 -2.275976
[1 rows x 20 columns]
If there is no MultiIndex, the solutions change slightly:
print (df.index)
Index(['day_0', 'day_1', 'day_2', 'day_3', 'day_4'], dtype='object')
df = df.stack().reset_index()
df = pd.DataFrame([df[0].values], columns = df['level_0'] + '_' + df['level_1'])

# or, starting again from the original df, with product:
from itertools import product
cols = ['{}_{}'.format(a,b) for a, b in product(df.index, df.columns)]
df = pd.DataFrame([df.values.ravel()], columns=cols)
print (df)
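A compact variant of the same idea (a sketch, not taken from the answers above) stacks the original df into a one-row frame and then flattens the resulting MultiIndex columns:
# starting again from the original (un-stacked) df
out = df.stack().to_frame().T
out.columns = ['{}_{}'.format(a, b) for a, b in out.columns]
print (out)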

You can use a melt and string-concatenation approach, i.e.
import numpy as np

idx = df.index
temp = df.melt()
# Repeat the index once per original column
temp['variable'] = pd.Series(np.concatenate([idx]*len(df.columns))) + '_' + temp['variable']
# Set index and transpose
temp.set_index('variable').T
variable day_0_attr_0 day_1_attr_0 day_2_attr_0 day_3_attr_0 day_4_attr_0 ...
value -0.032546 -0.341992 0.592365 -0.418947 -0.72638 ...

Related

How do I get a time delta that is closest to 0 days?

I have the following dataframes:
gp_columns = {
    'name': ['companyA', 'companyB'],
    'firm_ID': [1, 2],
    'timestamp_one': ['2016-04-01', '2017-09-01']
}

fund_columns = {
    'firm_ID': [1, 1, 2, 2, 2],
    'department_ID': [10, 11, 20, 21, 22],
    'timestamp_mult': ['2015-01-01', '2016-03-01', '2016-10-01', '2017-02-01', '2018-11-01'],
    'number': [400, 500, 1000, 3000, 4000]
}
gp_df = pd.DataFrame(gp_columns)
fund_df = pd.DataFrame(fund_columns)
gp_df['timestamp_one'] = pd.to_datetime(gp_df['timestamp_one'])
fund_df['timestamp_mult'] = pd.to_datetime(fund_df['timestamp_mult'])
merged_df = gp_df.merge(fund_df)
merged_df
merged_df_v1 = merged_df.copy()
merged_df_v1['incidence_num'] = merged_df.groupby('firm_ID')['department_ID'] \
    .transform('cumcount')
merged_df_v1['incidence_num'] = merged_df_v1['incidence_num'] + 1
merged_df_v1['time_delta'] = merged_df_v1['timestamp_mult'] - merged_df_v1['timestamp_one']
merged_wide = pd.pivot(merged_df_v1, index=['name', 'firm_ID', 'timestamp_one'],
                       columns='incidence_num',
                       values=['department_ID', 'time_delta', 'timestamp_mult', 'number'])
merged_wide.reset_index()
that looks as follows:
My question is how I get a column with the time delta that is closest to 0 for each row. Note that the time delta can be negative or positive, so .abs() alone does not work for me here.
I want a dataframe with this particular output:
You can stack (which removes NaTs), sort the rows by absolute value (with the key parameter of sort_values), and take groupby.first:
df = merged_wide.reset_index()
df['time_delta_min'] = (df['time_delta'].stack()
                        .sort_values(key=abs)
                        .groupby(level=0).first()
                        )
output:
name firm_ID timestamp_one department_ID \
incidence_num 1 2 3
0 companyA 1 2016-04-01 10 11 NaN
1 companyB 2 2017-09-01 20 21 22
time_delta timestamp_mult \
incidence_num 1 2 3 1 2
0 -456 days -31 days NaT 2015-01-01 2016-03-01
1 -335 days -212 days 426 days 2016-10-01 2017-02-01
number time_delta_min
incidence_num 3 1 2 3
0 NaT 400 500 NaN -31 days
1 2018-11-01 1000 3000 4000 -212 days
Another option: use a lookup with the column positions of the absolute minima from DataFrame.idxmin:
import numpy as np

idx, cols = pd.factorize(df['time_delta'].abs().idxmin(axis=1))
df['time_delta_min'] = (df['time_delta'].reindex(cols, axis=1).to_numpy()
                        [np.arange(len(df)), idx])
print (df)
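For comparison, a slower but very explicit row-wise sketch (an alternative, not taken from the answers above) that picks the signed delta with the smallest absolute value in each row:
# assumes df = merged_wide.reset_index() as above; Series.idxmin skips NaT
df['time_delta_min'] = df['time_delta'].apply(
    lambda row: row.loc[row.abs().idxmin()], axis=1
)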

Second-level aggregation in pandas

I have a simple example:
DF = pd.DataFrame(
{"F1" : ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C'],
"F2" : [1, 2, 1, 2, 2, 3, 1, 2, 3, 2],
"F3" : ['xx', 'yy', 'zz', 'zz', 'zz', 'xx', 'yy', 'zz', 'zz', 'zz']})
DF
How can I improve the code so that, in addition to the list of unique F3 values in each group, the number of appearances of each value within the group is also displayed, like this:
Use .groupby() + .sum() + value_counts() + .agg():
df2 = DF.groupby('F1')['F2'].sum()
df3 = (DF.groupby(['F1', 'F3'])['F3']
       .value_counts()
       .reset_index([2], name='count')
       .apply(lambda x: x['F3'] + '-' + str(x['count']), axis=1)
       )
df4 = df3.groupby(level=0).agg(' '.join)
df4.name = 'F3'
df_out = pd.concat([df2, df4], axis=1).reset_index()
Result:
print(df_out)
F1 F2 F3
0 A 4 xx-1 yy-1 zz-1
1 B 7 xx-1 zz-2
2 C 8 yy-1 zz-3
Seems like groupby aggregate with a per-column function mapping + Python's collections.Counter could work well here:
from collections import Counter

df2 = DF.groupby('F1', as_index=False).aggregate({
    'F2': 'sum',
    'F3': lambda g: ' '.join([f'{k}-{v}' for k, v in Counter(g).items()])
})
df2:
F1 F2 F3
0 A 4 xx-1 yy-1 zz-1
1 B 7 zz-2 xx-1
2 C 8 yy-1 zz-3
Aggregating to a Counter turns each group into a dictionary that maps the unique values to their counts:
df2 = DF.groupby('F1', as_index=False).aggregate({
    'F2': 'sum',
    'F3': Counter
})
F1 F2 F3
0 A 4 {'xx': 1, 'yy': 1, 'zz': 1}
1 B 7 {'zz': 2, 'xx': 1}
2 C 8 {'yy': 1, 'zz': 3}
The surrounding comprehension is only used to reformat that dictionary for display.
Sample for a single group:
' '.join([f'{k}-{v}' for k, v in Counter({'xx': 1, 'yy': 1, 'zz': 1}).items()])
'xx-1 yy-1 zz-1'
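For completeness, the same result can be sketched with pandas named aggregation, assuming a single groupby call is preferred:
from collections import Counter

# named aggregation: keyword = (source column, aggregation function)
df_out = DF.groupby('F1').agg(
    F2=('F2', 'sum'),
    F3=('F3', lambda g: ' '.join(f'{k}-{v}' for k, v in Counter(g).items())),
).reset_index()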

Pandas time re-sampling categorical data from a column with calculations from another numerical column

I have a data-frame with a categorical column and a numerical one, with the index set to time data:
df = pd.DataFrame({
    'date': [
        '2013-03-01', '2013-03-02',
        '2013-03-01', '2013-03-02',
        '2013-03-01', '2013-03-02'
    ],
    'Kind': [
        'A', 'B', 'A', 'B', 'B', 'B'
    ],
    'Values': [1, 1.5, 2, 3, 5, 3]
})
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
the above code gives:
Kind Values
date
2013-03-01 A 1.0
2013-03-02 B 1.5
2013-03-01 A 2.0
2013-03-02 B 3.0
2013-03-01 B 5.0
2013-03-02 B 3.0
My aim is to achieve the below data-frame:
A_count B_count A_Val max B_Val max
date
2013-03-01 2 1 2 5
2013-03-02 0 3 0 3
which also has the time as index. Note that if we use
data = pd.DataFrame(df.resample('D')['Kind'].value_counts())
we get:
Kind
date Kind
2013-03-01 A 2
B 1
2013-03-02 B 3
Use DataFrame.pivot_table and flatten the MultiIndex columns with a list comprehension:
df = pd.DataFrame({
    'date': [
        '2013-03-01', '2013-03-02',
        '2013-03-01', '2013-03-02',
        '2013-03-01', '2013-03-02'
    ],
    'Kind': [
        'A', 'B', 'A', 'B', 'B', 'B'
    ],
    'Values': [1, 1.5, 2, 3, 5, 3]
})
df['date'] = pd.to_datetime(df['date'])
# setting the index can be omitted here
# df = df.set_index('date')
df = df.pivot_table(index='date', columns='Kind', values='Values', aggfunc=['count','max'])
df.columns = [f'{b}_{a}' for a, b in df.columns]
print (df)
A_count B_count A_max B_max
date
2013-03-01 2.0 1.0 2.0 5.0
2013-03-02 NaN 3.0 NaN 3.0
Another solution uses Grouper to resample by day:
df = df.set_index('date')
df = df.groupby([pd.Grouper(freq='d'), 'Kind'])['Values'].agg(['count','max']).unstack()
df.columns = [f'{b}_{a}' for a, b in df.columns]
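If the NaNs in the pivoted output should read as 0, as in the desired frame from the question, a small follow-up sketch (the exact column labels are an assumption):
# assumes 'date' is still a regular column, as in the pivot_table solution above
out = df.pivot_table(index='date', columns='Kind', values='Values',
                     aggfunc=['count', 'max'], fill_value=0)
out.columns = [f'{b}_{a}' for a, b in out.columns]  # e.g. A_count, A_max
print (out)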

How can I select the column names where a condition is met

I need to select column names where the count is greater than 2. I have this dataset:
Index | col_1 | col_2 | col_3 | col_4
-------------------------------------
0 | 5 | NaN | 4 | 2
1 | 2 | 2 | NaN | 2
2 | NaN | 3 | NaN | 1
3 | 3 | NaN | NaN | 1
The expected result is a list: ['col_1', 'col_4']
When I use
df.count() > 2
I get
col_1 True
col_2 False
col_3 False
col_4 True
Length: 4, dtype: bool
This is the code for testing:
import pandas as pd
import numpy as np

data = {'col_1': [5, 2, np.NaN, 3],
        'col_2': [np.NaN, 2, 3, np.NaN],
        'col_3': [4, np.NaN, np.NaN, np.NaN],
        'col_4': [2, 2, 1, 1]}
frame = pd.DataFrame(data)
frame.count() > 2
You can do it this way:
import pandas as pd
import numpy as np

data = {'col_1': [5, 2, np.NaN, 3],
        'col_2': [np.NaN, 2, 3, np.NaN],
        'col_3': [4, np.NaN, np.NaN, np.NaN],
        'col_4': [2, 2, 1, 1]}
frame = pd.DataFrame(data)

expected_list = []
for col in list(frame.columns):
    if frame[col].count() > 2:
        expected_list.append(col)
Using a dict can easily solve this:
frame[[key for key, value in dict(frame.count() > 2).items() if value]]
Try:
frame.columns[(frame.count() > 2).values].to_list()
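For completeness, a minimal sketch that turns the boolean Series from the question directly into that list of names:
s = frame.count() > 2
cols = s[s].index.tolist()   # keep only the True entries
print(cols)   # ['col_1', 'col_4']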

How to merge columns using mask

I am trying to merge two columns (phone1 and phone2).
Here is my fake data:
import pandas as pd
employee = {'EmployeeID' : [0, 1, 2, 3, 4, 5, 6, 7],
'LastName' : ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'],
'Name' : ['w', 'x', 'y', 'z', None, None, None, None],
'phone1' : [1, 1, 2, 2, 4, 5, 6, 6],
'phone2' : [None, None, 3, 3, None, None, 7, 7],
'level_15' : [0, 1, 0, 1, 0, 0, 0, 1]}
df2 = pd.DataFrame(employee)
and I want the 'phone' column to be
'phone' : [1, 2, 3, 4, 5, 7, 9, 10]
At the beginning of my code, I split the names on '/', and the code below creates a column of 0s and 1s which I use as a mask for other tasks throughout my code.
df2 = (df2.set_index(cols)['name'].str.split('/',expand=True).stack().reset_index(name='Name'))
m = df2['level_15'].eq(0)
print (m)
#remove column level_15
df2 = df2.drop(['level_15'], axis=1)
#add last name for select first letter by condition, replace NaNs by forward fill
df2['last_name'] = df2['name'].str[:2].where(m).ffill()
df2['name'] = df2['name'].mask(m, df2['name'].str[2:])
I feel like there is a way to merge phone1 and phone2 using the 0s and 1s, but I can't figure it out. Thank you.
First, start by filling in NaNs:
df2['phone2'] = df2.phone2.fillna(df2.phone1)
# Alternatively, based on your latest update
# df2['phone2'] = df2.phone2.mask(df2.phone2.eq(0)).fillna(df2.phone1)
You can just use np.where to merge columns on odd/even indices:
import numpy as np

df2['phone'] = np.where(np.arange(len(df2)) % 2 == 0, df2.phone1, df2.phone2)
df2 = df2.drop(['phone1', 'phone2'], axis=1)
df2
EmployeeID LastName Name phone
0 0 a w 1
1 1 b x 2
2 2 c y 3
3 3 d z 4
4 4 e None 5
5 5 f None 6
6 6 g None 7
7 7 h None 8
Or, with Series.where/mask:
df2['phone'] = df2.pop('phone1').where(
    np.arange(len(df2)) % 2 == 0, df2.pop('phone2')
)
Or,
df2['phone'] = df2.pop('phone1').mask(
    np.arange(len(df2)) % 2 != 0, df2.pop('phone2')
)
df2
EmployeeID LastName Name phone
0 0 a w 1
1 1 b x 2
2 2 c y 3
3 3 d z 4
4 4 e None 5
5 5 f None 6
6 6 g None 7
7 7 h None 8
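If the merge should instead follow the 0/1 level_15 mask from the question rather than row parity, a minimal sketch (an assumption about the intended logic, reusing the mask m = df2['level_15'].eq(0)):
# assumes df2 still has the phone1/phone2 columns from the question
m = df2['level_15'].eq(0)
df2['phone'] = df2['phone1'].where(m, df2['phone2']).fillna(df2['phone1'])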