Pandas: how to groupby on concatenated dataframes with same column names?

How do I properly concat (or maybe this is .merge()?) N dataframes with the same column names, so that I can groupby them with a key that distinguishes the source dataframe? For example:
dfs = {
    'A': df1,  # columns are C1, C2, C3
    'B': df2,  # same columns C1, C2, C3
}
gathered_df = pd.concat(dfs.values()).groupby(['C2'])['C3']\
    .count()\
    .sort_values(ascending=False)\
    .reset_index()
I want to get something like
|         | A          | B          |
|---------|------------|------------|
| C2_val1 | count_perA | count_perB |
| C2_val2 | count_perA | count_perB |
| C2_val3 | count_perA | count_perB |

I think you need reset_index to create columns from the MultiIndex, then add that column to groupby to distinguish the dataframes. Last, reshape by unstack:
gathered_df = pd.concat(dfs).reset_index().groupby(['C2','level_0'])['C3'].count().unstack()
Note: count excludes NaN values, while size counts all rows including NaN (see: What is the difference between size and count in pandas?).
Sample:
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'C1':[1,2,3],
                    'C2':[4,5,5],
                    'C3':[7,8,np.nan]})
df2 = df1.mul(10).fillna(1)
df2.C2 = df1.C2
print (df1)
   C1  C2   C3
0   1   4  7.0
1   2   5  8.0
2   3   5  NaN
print (df2)
   C1  C2    C3
0  10   4  70.0
1  20   5  80.0
2  30   5   1.0
dfs = {
    'A': df1,
    'B': df2
}
gathered_df = pd.concat(dfs).reset_index().groupby(['C2','level_0'])['C3'].count().unstack()
gathered_df.index.name = None
gathered_df.columns.name = None
print (gathered_df)
   A  B
4  1  1
5  1  2
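To illustrate the count/size difference on this sample (a small sketch added here, not part of the original answer): size includes the NaN in df1's C3, so group (5, A) reports 2 instead of 1.
sized = pd.concat(dfs).reset_index().groupby(['C2','level_0'])['C3'].size().unstack()
print (sized)
level_0  A  B
C2
4        1  1
5        2  2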

Related

Pandas compare Dataframes to search not exact match [duplicate]

I want to merge the rows of the two dataframes below whenever a string in the Test1 column of DF2 contains a string from the Test1 column of DF1 as a substring.
DF1 = pd.DataFrame({'Test1':list('ABC'),
                    'Test2':[1,2,3]})
print (DF1)
  Test1  Test2
0     A      1
1     B      2
2     C      3
DF2 = pd.DataFrame({'Test1':['ee','bA','cCc','D'],
                    'Test2':[1,2,3,4]})
print (DF2)
  Test1  Test2
0    ee      1
1    bA      2
2   cCc      3
3     D      4
For that, using str.contains I am able to identify which strings of DF2.Test1 contain a substring from DF1.Test1:
INPUT:
for i in DF1.Test1:
    ok = DF2[DF2.Test1.str.contains(i)]
    print(ok)
OUTPUT:
Now, I would like to add to the output the merge of the Test1 substrings that match, along with their Test2 values.
OUTPUT:
For that, I tried with pd.merge and if, but I am not able to find the right code yet.
Do you have suggestions please?
for i in DF1.Test1:
    if DF2.Test1.str.contains(i) == 'True':
        ok = pd.merge(DF1, DF2, on= ['Test1'[i]], how='outer')
        print(ok)
Thank you for your ideas :)
I could not respond to jezrael's comment because of my reputation, but I changed his answer into a function that merges on lower-cased (non-capitalized) text.
def str_merge(part_string_df, full_string_df, merge_column):
    merge_column_lower = 'merge_column_lower'
    part_string_df[merge_column_lower] = part_string_df[merge_column].str.lower()
    full_string_df[merge_column_lower] = full_string_df[merge_column].str.lower()
    pat = '|'.join(r"{}".format(x) for x in part_string_df[merge_column_lower])
    full_string_df['Test3'] = full_string_df[merge_column_lower].str.extract('(' + pat + ')', expand=True)
    DF = pd.merge(part_string_df, full_string_df, left_on=merge_column_lower, right_on='Test3').drop([merge_column_lower + '_x', merge_column_lower + '_y', 'Test3'], axis=1)
    return DF
Used with example:
DF1 = pd.DataFrame({'Test1':list('ABC'),
                    'Test2':[1,2,3]})
DF2 = pd.DataFrame({'Test1':['ee','bA','cCc','D'],
                    'Test2':[1,2,3,4]})
print(str_merge(DF1, DF2, 'Test1'))
  Test1_x  Test2_x Test1_y  Test2_y
0       B        2      bA        2
1       C        3     cCc        3
I believe you need to extract the values into a new column and then merge; last, remove the helper column Test3:
pat = '|'.join(r"{}".format(x) for x in DF1.Test1)
DF2['Test3'] = DF2.Test1.str.extract('('+ pat + ')', expand=False)
DF = pd.merge(DF1, DF2, left_on= 'Test1', right_on='Test3').drop('Test3', axis=1)
print (DF)
Test1_x Test2_x Test1_y Test2_y
0 A 1 bA 2
1 C 3 cCc 3
Detail:
print (DF2)
Test1 Test2 Test3
0 ee 1 NaN
1 bA 2 A
2 cCc 3 C
3 D 4 NaN
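One caveat worth noting (my addition, not part of the original answer): the pattern is built by joining the raw lookup strings, so values containing regex metacharacters would break it; re.escape guards against that:
import re
# escape each lookup value before building the alternation pattern
pat = '|'.join(re.escape(str(x)) for x in DF1.Test1)
DF2['Test3'] = DF2.Test1.str.extract('(' + pat + ')', expand=False)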

Pivoting and transposing using pandas dataframe

Suppose that I have a pandas dataframe like the one below:
import pandas as pd
df = pd.DataFrame({'fk ID': [1,1,2,2],
                   'value': [3,3,4,5],
                   'valID': [1,2,1,2]})
The above would give me the following output:
print(df)
fk ID value valID
0 1 3 1
1 1 3 2
2 2 4 1
3 2 5 2
or
| fk ID | value | valID |
|-------|-------|-------|
| 1     | 3     | 1     |
| 1     | 3     | 2     |
| 2     | 4     | 1     |
| 2     | 5     | 2     |
and I would like to transpose and pivot it in such a way that I get the following table and the same order of column names:
| fk ID | value | valID | fk ID | value | valID |
|-------|-------|-------|-------|-------|-------|
| 1     | 3     | 1     | 1     | 3     | 2     |
| 2     | 4     | 1     | 2     | 5     | 2     |
The most straightforward solution I can think of is
df = pd.DataFrame({'fk ID': [1,1,2,2],
                   'value': [3,3,4,5],
                   'valID': [1,2,1,2]})

# concatenate the rows (Series) of each 'fk ID' group side by side
def flatten_group(g):
    return pd.concat(row for _, row in g.iterrows())

res = df.groupby('fk ID', as_index=False).apply(flatten_group)
However, using DataFrame.iterrows is not ideal, and it can be very slow if the size of each group is large.
Furthermore, the above solution doesn't work if the 'fk ID' groups have different sizes. To see that, we can add a third group to the DataFrame
>>> df2 = df.append({'fk ID': 3, 'value': 10, 'valID': 4},
...                 ignore_index=True)
>>> df2
fk ID value valID
0 1 3 1
1 1 3 2
2 2 4 1
3 2 5 2
4 3 10 4
>>> df2.groupby('fk ID', as_index=False).apply(flatten_group)
0  fk ID     1
   value     3
   valID     1
   fk ID     1
   value     3
   valID     2
1  fk ID     2
   value     4
   valID     1
   fk ID     2
   value     5
   valID     2
2  fk ID     3
   value    10
   valID     4
dtype: int64
The result is not a DataFrame as one could expect, because pandas can't align the columns of the groups.
To solve this I suggest the following solution. It should work for any group size, and should be faster for large DataFrames.
import numpy as np
def flatten_group(g):
    # flatten each group's data into a single row
    flat_data = g.to_numpy().reshape(1, -1)
    return pd.DataFrame(flat_data)

# group the rows by 'fk ID'
groups = df.groupby('fk ID', group_keys=False)

# get the maximum group size
max_group_size = groups.size().max()

# construct the new columns by repeating the
# original columns 'max_group_size' times
new_cols = np.tile(df.columns, max_group_size)

# aggregate the flattened rows
res = groups.apply(flatten_group).reset_index(drop=True)

# update the columns
res.columns = new_cols
Output:
# df
>>> res
   fk ID  value  valID  fk ID  value  valID
0      1      3      1      1      3      2
1      2      4      1      2      5      2

# df2
>>> res
   fk ID  value  valID  fk ID  value  valID
0      1      3      1    1.0    3.0    2.0
1      2      4      1    2.0    5.0    2.0
2      3     10      4    NaN    NaN    NaN
You can cast df to a numpy array, reshape it, cast it back to a dataframe, and then rename the columns (0..5). This also works if the values are strings rather than numbers.
import pandas as pd
df = pd.DataFrame({'fk ID': [1,1,2,2],
                   'value': [3,3,4,5],
                   'valID': [1,2,1,2]})
nrows = 2
array = df.to_numpy().reshape((nrows, -1))
pd.DataFrame(array).rename(mapper=lambda x: df.columns[x % len(df.columns)], axis=1)
If your group sizes are guaranteed to be the same, you could merge your odd and even rows:
import pandas as pd
df = pd.DataFrame({'fk ID': [1,1,2,2],
                   'value': [3,3,4,5],
                   'valID': [1,2,1,2]})
df_even = df[df.index%2==0].reset_index(drop=True)
df_odd = df[df.index%2==1].reset_index(drop=True)
df_odd.join(df_even, rsuffix='_2')
Yields
   fk ID  value  valID  fk ID_2  value_2  valID_2
0      1      3      2        1        3        1
1      2      5      2        2        4        1
I'd expect this to be pretty performant, and it could be generalized to any number of rows in each group (rather than assuming odd/even for two rows per group), but it does require that you have the same number of rows per fk ID; a rough sketch of one such generalization follows below.
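A sketch of one possible generalization (my own addition, not from the original answers): number the rows within each 'fk ID' group with cumcount, then unstack so each position becomes its own set of columns. Groups with fewer rows are padded with NaN, though the columns come out grouped by field (value_0, value_1, ...) rather than interleaved as above.
import pandas as pd

df2 = pd.DataFrame({'fk ID': [1, 1, 2, 2, 3],
                    'value': [3, 3, 4, 5, 10],
                    'valID': [1, 2, 1, 2, 4]})

# number the rows within each 'fk ID' group: 0, 1, 0, 1, 0
df2['pos'] = df2.groupby('fk ID').cumcount()

# spread each position into its own columns; missing positions become NaN
wide = df2.set_index(['fk ID', 'pos']).unstack('pos')

# flatten the MultiIndex columns, e.g. ('value', 1) -> 'value_1'
wide.columns = [f'{col}_{pos}' for col, pos in wide.columns]
print(wide.reset_index())
# fk ID 3 has only one row, so its second-position columns come back as NaN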

Grouped count of combinations in Pandas column

I have a dataset with two values per person, like the one below, and I want to generate all combinations and counts of the combinations. I have a working solution, but it's hardcoded and not scalable; I am looking for ideas on how to improve it.
Example:
d = {'person': [1,1,2,2,3,3,4,4,5,5,6,6], 'type': ['a','b','a','c','c','b','d','a','b','c','b','d']}
df = pd.DataFrame(data=d)
df
person type
0 1 a
1 1 b
2 2 a
3 2 c
4 3 c
5 3 b
6 4 d
7 4 a
8 5 b
9 5 c
10 6 b
11 6 d
My Inefficient Solution:
df = pd.get_dummies(df)
typecols = [col for col in df.columns if 'type' in col]
df = df.groupby(['person'], as_index=False)[typecols].apply(lambda x: x.astype(int).sum())
df["a_b"] = df["type_a"] + df["type_b"]
df["a_c"] = df["type_a"] + df["type_c"]
df["a_d"] = df["type_a"] + df["type_d"]
df["b_c"] = df["type_b"] + df["type_c"]
df["b_d"] = df["type_b"] + df["type_d"]
df["c_d"] = df["type_c"] + df["type_d"]
df["a_b"] = df.apply(lambda x: 1 if x["a_b"] == 2 else 0, axis=1)
df["a_c"] = df.apply(lambda x: 1 if x["a_c"] == 2 else 0, axis=1)
df["a_d"] = df.apply(lambda x: 1 if x["a_d"] == 2 else 0, axis=1)
df["b_c"] = df.apply(lambda x: 1 if x["b_c"] == 2 else 0, axis=1)
df["b_d"] = df.apply(lambda x: 1 if x["b_d"] == 2 else 0, axis=1)
df["c_d"] = df.apply(lambda x: 1 if x["c_d"] == 2 else 0, axis=1)
df_sums = df[['a_b','a_c','a_d','b_c','b_d','c_d']].sum()
print(df_sums.to_markdown(tablefmt="grid"))
+-----+-----+
| | 0 |
+=====+=====+
| a_b | 1 |
+-----+-----+
| a_c | 1 |
+-----+-----+
| a_d | 1 |
+-----+-----+
| b_c | 2 |
+-----+-----+
| b_d | 1 |
+-----+-----+
| c_d | 0 |
+-----+-----+
This solution works because every person has exactly two distinct values from a list of six distinct values, but it would quickly become unmanageable if there were NULLs or more than six distinct values.
We can do:
s = df.sort_values('type').groupby('person', sort=False)['type']\
.agg(tuple).value_counts()
s.index = [f'{x}_{y}' for x, y in s.index]
s = s.sort_index()
print(s)
a_b 1
a_c 1
a_d 1
b_c 2
b_d 1
Name: type, dtype: int64
Getting all the combinations is also simple:
from itertools import combinations
s = df.sort_values('type').groupby('person', sort=False)['type']\
.agg(tuple).value_counts()\
.reindex(list(combinations(df['type'].unique(), 2)), fill_value=0)
(a, b) 1
(a, c) 1
(a, d) 1
(b, c) 2
(b, d) 1
(c, d) 0
Name: type, dtype: int64
You can do a self-merge within person with a query to de-duplicate the matches (this is why we create the N column). Then we sort the types so we only get one of 'a_b' (and not also 'b_a'), create the labels, and take the value_counts. Using combinations we can get the list of all possibilities to reindex with.
import numpy as np
from itertools import combinations
ids = ['_'.join(x) for x in combinations(df['type'].unique(), 2)]
#['a_b', 'a_c', 'a_d', 'b_c', 'b_d', 'c_d']
df['N'] = range(len(df))
df1 = df.merge(df, on='person').query('N_x > N_y')
df1[['type_x', 'type_y']] = np.sort(df1[['type_x', 'type_y']].to_numpy(), 1)
df1['label'] = df1['type_x'].str.cat(df1['type_y'], sep='_')
df1['label'].value_counts().reindex(ids, fill_value=0)
a_b 1
a_c 1
a_d 1
b_c 2
b_d 1
c_d 0
Name: label, dtype: int64

Using .loc and shift() to add one to a serial number

I'm trying to combine two dataframes using concat with axis=0, so the columns stay the same but the index grows. One of the dataframes contains a specific column with a serial number (going from one upwards, but not necessarily in sequence, e.g. 1, 2, 3, 4, 5, etc.).
import pandas as pd
import numpy as np
a = pd.DataFrame(data = {'Name': ['A', 'B','C'],
                         'Serial Number': [1, 2,5]} )
b = pd.DataFrame(data = {'Name': ['D','E','F'],
                         'Serial Number': [np.nan,np.nan,np.nan]})
c = pd.concat([a,b],axis=0).reset_index()
I would like the 'Serial Number' column in dataframe c to continue from the existing maximum, so the first missing value becomes 5+1, the next 6+1, and so on.
I've tried a variety of things, e.g.:
c.loc[c['B'].isna(), 'B'] = c['B'].shift(1)+1
But it doesn't seem to work.
Desired output:
|   | Name | Serial Number |
|---|------|---------------|
| 1 | A    | 1             |
| 2 | B    | 2             |
| 3 | C    | 5             |
| 4 | D    | 6             |
| 5 | E    | 7             |
| 6 | F    | 8             |
One idea is to create an arange sized by the number of missing values and add the maximum value plus 1:
a = np.arange(c['Serial Number'].isna().sum()) + c['Serial Number'].max() + 1
c.loc[c['Serial Number'].isna(), 'Serial Number'] = a
print (c)
   index Name  Serial Number
0      0    A            1.0
1      1    B            2.0
2      2    C            5.0
3      0    D            6.0
4      1    E            7.0
5      2    F            8.0
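An equivalent one-liner (just a sketch of the same idea, relying on index alignment after reset_index): cumulatively count the missing rows and add them to the current maximum.
# each NaN gets max + its running position among the NaNs
c['Serial Number'] = c['Serial Number'].fillna(
    c['Serial Number'].isna().cumsum() + c['Serial Number'].max())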

Comparing and replacing column items pandas dataframe

I have three columns C1, C2, C3 in a pandas dataframe. My aim is to replace C3_i with C2_j whenever C3_i = C1_j. These are all strings. I tried where but failed. What is a good way to do this while avoiding a for loop?
If my data frame is
df=pd.DataFrame({'c1': ['a', 'b', 'c'], 'c2': ['d','e','f'], 'c3': ['c', 'z', 'b']})
Then I want c3 to be replaced by ['f','z','e']
I tried this, which takes a very long time:
for i in range(0, len(df)):
    for j in range(0, len(df)):
        if df.iloc[i]['c1'] == df.iloc[j]['c3']:
            df.iloc[j]['c3'] = df.iloc[i]['c2']
Use map with a Series created by set_index:
df['c3'] = df['c3'].map(df.set_index('c1')['c2']).fillna(df['c3'])
Alternative solution with update:
df['c3'].update(df['c3'].map(df.set_index('c1')['c2']))
print (df)
c1 c2 c3
0 a d f
1 b e z
2 c f e
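For reference (my addition), the intermediate Series produced by map on the original df, before fillna, shows why fillna is needed: unmatched values like 'z' come back as NaN.
df = pd.DataFrame({'c1': ['a', 'b', 'c'], 'c2': ['d','e','f'], 'c3': ['c', 'z', 'b']})
print (df['c3'].map(df.set_index('c1')['c2']))
0      f
1    NaN
2      e
Name: c3, dtype: object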
Example data:
dataframe = pd.DataFrame({'a':['10','4','3','40','5'], 'b':['5','4','3','2','1'], 'c':['s','d','f','g','h']})
Output:
a b c
0 10 5 s
1 4 4 d
2 3 3 f
3 40 2 g
4 5 1 h
Code:
def replace(df):
    if len(dataframe[dataframe.b==df.a]) != 0:
        df['a'] = dataframe[dataframe.b==df.a].c.values[0]
    return df

dataframe = dataframe.apply(replace, 1)
Output:
    a  b  c
0  10  5  s
1   d  4  d
2   f  3  f
3  40  2  g
4   s  1  h
Is this what you want?