Pandas - How to split a single dataframe into multiple dataframes? [duplicate]

This question already has answers here: Split pandas dataframe based on groupby (4 answers). Closed 15 days ago.
I wanted to create multiple dataframes and collect them into a list of dataframes, split by the veh value. For example, from the dataframe below, I wanted to get four single dataframes:
ped value 1 with veh value 1
ped value 1 with veh value 2
ped value 1 with veh value 3
ped value 1 with veh value 4
| ped value | veh value |
| --------- | --------- |
| 1 | 1 |
| 1 | 1 |
| 1 | 2 |
| 1 | 2 |
| 1 | 3 |
| 1 | 3 |
| 1 | 4 |
| 1 | 4 |
Wanted output:
| ped value | veh value |
| --------- | --------- |
| 1 | 1 |
| 1 | 1 |

| ped value | veh value |
| --------- | --------- |
| 1 | 2 |
| 1 | 2 |

| ped value | veh value |
| --------- | --------- |
| 1 | 3 |
| 1 | 3 |

| ped value | veh value |
| --------- | --------- |
| 1 | 4 |
| 1 | 4 |
grouped = df.groupby(['ped', 'veh'])
ped_veh1 = grouped.get_group(("P1", 1))
print(ped_veh1)
The code above is the initial code I used to split the dataframe. However, I have 100 different veh values, so is there any way to achieve the output above?
I have tried using a for i in range loop:
for i in range(1, 100):
    grouped = df.groupby(['ped', 'veh'])
    ped_veh1 = grouped.get_group(("P1", i))
    print(ped_veh1)
However, the code does not work because the veh values are not continuous, for example:
i = 1, 2, 3, 5, 6, 8, 9, 10, 12
The code stops running when it can't find i = 4 and a KeyError is raised.
Is there any way to solve this problem?
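For reference, one direct guard against the missing keys is to test membership in grouped.groups before calling get_group; a minimal sketch, assuming the asker's column names and key format:
grouped = df.groupby(['ped', 'veh'])
for i in range(1, 100):
    # skip (ped, veh) keys that don't exist instead of raising KeyError
    if ("P1", i) in grouped.groups:
        print(grouped.get_group(("P1", i)))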

You can split the DataFrame using the groupby() function and then get_group:
dfs = []
for v in df['veh value'].unique():
    dfs.append(df.groupby('veh value').get_group(v))
Output
[ ped value veh value
0 1 1
1 1 1,
ped value veh value
2 1 2
3 1 2,
ped value veh value
4 1 3
5 1 3,
ped value veh value
6 1 4
7 1 4]
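As a side note, if the goal is simply a list of per-group dataframes, iterating over the groupby object sidesteps get_group (and any KeyError on missing keys) entirely; a minimal sketch under the same assumptions:
# each iteration yields a (key, group) pair; keep only the groups
dfs = [group for _, group in df.groupby('veh value')]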

Related

How to check pair of string values in a column, after grouping the dataframe using ID column?

I have a dataframe containing two columns, ID and Code; the Flag column in the table below shows the expected output.
ID Code Flag
1 A 0
1 C 1
1 B 1
2 A 0
2 B 1
3 A 0
4 C 0
Within each ID, if Code 'A' exists with 'B' or 'C', then it should flag 1.
I tried groupby('ID') with filter(), but it does not give the correct result. Could anyone please help?
You can do the following:
First use df.groupby('ID') and concatenate the codes within each group using 'sum' to create a new column. Then assign the value 1 to rows whose Code is B or C and whose concatenated column contains an A:
df['s'] = df.groupby('ID').Code.transform('sum')
df['Flag'] = 0
df.loc[((df.Code == 'B') | (df.Code == 'C')) & df.s.str.contains('A'), 'Flag'] = 1
df = df.drop(columns = 's')
Output:
ID Code Flag
0 1 A 0
1 1 C 1
2 1 B 1
3 2 A 0
4 2 B 1
5 3 A 0
6 4 C 0
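For clarity, this is roughly what the intermediate s column holds: with object-dtype strings, 'sum' concatenates the codes within each ID group (a sketch, not part of the original answer):
# every row of ID 1 gets that group's codes concatenated in row order
print(df.groupby('ID').Code.transform('sum'))
# 0    ACB
# 1    ACB
# 2    ACB
# 3     AB
# 4     AB
# 5      A
# 6      C
# Name: Code, dtype: object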
You can use boolean masks: a direct one for B/C, a per-group one for A; then combine them and convert to integer:
# is the Code a B or C?
m1 = df['Code'].isin(['B', 'C'])
# is there also an A in the same group?
m2 = df['Code'].eq('A').groupby(df['ID']).transform('any')
# if both are True, flag 1
df['Flag'] = (m1 & m2).astype(int)
Output:
ID Code Flag
0 1 A 0
1 1 C 1
2 1 B 1
3 2 A 0
4 2 B 1
5 3 A 0
6 4 C 0

Insert column to df on sequenced location

I have a df like this:
| id | month |
| -- | ----- |
| 1 | 1 |
| 1 | 3 |
| 1 | 4 |
| 1 | 6 |
I want to transform it to become like this:
| id | 1 | 2 | 3 | 4 | 5 | 6 |
| -- | - | - | - | - | - | - |
| 1 | 1 | 0 | 1 | 1 | 0 | 1 |
I've tried using this code:
ndf = df[['id']].join(pd.get_dummies(df['month'])).groupby('id').max()
but it shows this:
| id | 1 | 3 | 4 | 6 |
| -- | - | - | - | - |
| 1 | 1 | 1 | 1 | 1 |
How can I insert the middle columns (2 and 5) even if they're not in the data?
You can use pd.crosstab instead, then build the full set of month columns using pd.RangeIndex based on the min and max month, and finally use DataFrame.reindex (and optionally DataFrame.reset_index afterwards):
import pandas as pd

# RangeIndex excludes its stop value, so add 1 to keep the max month
new_cols = pd.RangeIndex(df['month'].min(), df['month'].max() + 1)
res = (
    pd.crosstab(df['id'], df['month'])
    .reindex(columns=new_cols, fill_value=0)
    .reset_index()
)
Output:
>>> res
   id  1  2  3  4  5  6
0   1  1  0  1  1  0  1
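Alternatively, the asker's own get_dummies attempt only needs a reindex over the full month range; a minimal sketch under the same assumptions:
# build the dummy table as before, then reindex to fill in the missing months
full_months = range(df['month'].min(), df['month'].max() + 1)
ndf = (
    df[['id']]
    .join(pd.get_dummies(df['month']))
    .groupby('id').max()
    .astype(int)  # get_dummies may return booleans in newer pandas
    .reindex(columns=full_months, fill_value=0)
    .reset_index()
)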

Append new column to DF after sum?

I have a sample dataframe below:
sn C1-1 C1-2 C1-3 H2-1 H2-2 K3-1 K3-2
1 4 3 5 4 1 4 2
2 2 2 0 2 0 1 2
3 1 2 0 0 2 1 2
I would like to sum based on the prefixes C1, H2, K3 and output three new columns with the total sums. The final result is this:
sn total_c1 total_h2 total_k3
1 12 5 6
2 4 2 3
3 3 2 3
What I have tried on my original df:
lst = ["C1", "H2", "K3"]
lst2 = ["total_c1", "total_h2", "total_k3"]
for k in lst:
    idx = df.columns.str.startswith(k)
    for j in lst2:
        df[j] = df.iloc[:, idx].sum(axis=1)
df1 = df.append(df, sort=False)
But I kept getting this error:
IndexError: Item wrong length 35 instead of 36.
I can't figure out how to append the new total columns to produce my end result in the loop.
Any help will be appreciated (or a better suggestion, as opposed to a loop). Thank you.
You can use groupby:
# columns of interest
cols = df.columns[1:]
col_groups = cols.str.split('-').str[0]
out_df = df[['sn']].join(
    df[cols].groupby(col_groups, axis=1)
            .sum()
            .add_prefix('total_')
)
Output:
sn total_C1 total_H2 total_K3
0 1 12 5 6
1 2 4 2 3
2 3 3 2 3
Let us try: split, then groupby with axis=1:
out = (
    df.groupby(df.columns.str.split('-').str[0], axis=1)
      .sum()
      .set_index('sn')
      .add_prefix('Total_')
      .reset_index()
)
Out[84]:
sn Total_C1 Total_H2 Total_K3
0 1 12 5 6
1 2 4 2 3
2 3 3 2 3
Another option, where we create a dictionary to groupby the columns:
mapping = {entry: f"total_{entry[:2]}" for entry in df.columns[1:]}
result = df.groupby(mapping, axis=1).sum()
result.insert(0, "sn", df.sn)
result
sn total_C1 total_H2 total_K3
0 1 12 5 6
1 2 4 2 3
2 3 3 2 3
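Note that all three answers rely on groupby(..., axis=1), which is deprecated in recent pandas releases (2.1+); a transpose-based sketch that avoids it, assuming the same sample frame:
# group the transposed frame by column prefix, then transpose back
out = (
    df.set_index('sn')
      .T
      .groupby(lambda c: 'total_' + c.split('-')[0])
      .sum()
      .T
      .reset_index()
)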

Setting groupby nunique to dataframe column

I have a groupby result that I am trying to set as a new column within my dataframe, but when I set a column name to the result of the groupby, it returns NaN as the value of each row. If the groupby is assigned to a new variable and then printed, it returns the groupby keys and the nunique for each. Is the issue I am facing due to some indexing that needs to be resolved on the dataframe?
When set to a column:
merged_df['noramlized_values'] = merged_df.groupby(['be_hash'])['id'].nunique()
# normalized_values
# NaN
When set to a new value:
test = merged_df.groupby(['be_hash'])['id'].nunique()
# ij32ndshufho23nd    1
Data example
id date be_hash unique_call_rank normalized_calls What I want
1 10/20/20 10171 1 3 1
1 10/20/20 10171 1 3 0
2 10/20/20 10171 2 3 1
3 10/23/20 10171 3 3 1
Use DataFrame.duplicated with both columns and numpy.where:
import numpy as np

merged_df['noramlized_values'] = np.where(merged_df.duplicated(['be_hash', 'id']), 0, 1)
print (merged_df)
id date be_hash unique_call_rank normalized_calls What I want \
0 1 10/20/20 10171 1 3 1
1 1 10/20/20 10171 1 3 0
2 2 10/20/20 10171 2 3 1
3 3 10/23/20 10171 3 3 1
noramlized_values
0 1
1 0
2 1
3 1
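To address the original NaN symptom directly: groupby(...).nunique() is indexed by be_hash, so assigning it to the row-indexed frame misaligns. A hedged sketch of the usual fix (the column name unique_ids_per_hash is hypothetical), which broadcasts the statistic back to every row:
# transform returns a Series aligned to merged_df's original index
merged_df['unique_ids_per_hash'] = merged_df.groupby('be_hash')['id'].transform('nunique')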

Pandas: dataframe cumsum, reset if other column is false [duplicate]

This question already has an answer here: How to reset cumsum after change in sign of values? (1 answer). Closed 4 years ago.
I have a dataframe with 2 columns. The objective here is simple: reset the df.cumsum() whenever a row's condition column is False.
df
value condition
0 1 1
1 2 1
2 3 1
3 4 0
4 5 1
The wanted result is as follows:
df
value condition
0 1 1
1 3 1
2 6 1
3 4 0
4 9 1
If I loop over the dataframe as described in this post: Python pandas cumsum() reset after hitting max, I can achieve the wanted results, but I was looking for a more vectorized way using standard pandas functions.
How about:
df['cSum'] = df.groupby((df.condition == 0).cumsum()).value.cumsum()
Output:
value condition cSum
0 1 1 1
1 2 1 3
2 3 1 6
3 4 0 4
4 5 1 9
You'll group consecutive rows together until you encounter a 0 in the condition column, and then you apply the cumsum within each group separately.
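To see why this works, the grouping key is just the running count of zeros in condition, which starts a new group at every reset point; a minimal illustration with the sample data:
# the key increments each time condition == 0, opening a fresh cumsum group
print((df.condition == 0).cumsum().tolist())
# [0, 0, 0, 1, 1]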