I'm new to Python. Could you help me create a "Number" column that numbers the consecutive runs of values in the "test1" column? Thanks.
(For example, with the pandas groupby function?)
Try:
df["Number"] = (df.test1 != df.test1.shift()).cumsum()
print(df)
Prints:
test1 Number
0 AAA 1
1 AAA 1
2 AAA 1
3 AAA 1
4 BBB 2
5 BBB 2
6 BBB 2
7 AAA 3
8 AAA 3
9 AAA 3
10 CCC 4
11 CCC 4
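How this works: shift() aligns each value with the one on the previous row, so the comparison is True exactly at the first row of each consecutive run, and cumsum() turns those True flags into a running group number. A minimal sketch with toy data (the sample values are illustrative):

import pandas as pd

df = pd.DataFrame({"test1": ["AAA", "AAA", "BBB", "BBB", "AAA"]})

# True at the first row of each consecutive run (NaN from shift() compares unequal)
starts = df["test1"] != df["test1"].shift()

# Cumulative sum over the booleans numbers the runs 1, 2, 3, ...
df["Number"] = starts.cumsum()
print(df)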
First time asking. I know this may be a silly question, but I'm overwhelmed. I need to group a DataFrame and sum the values, but I'm having problems.
Let's suppose that I have this DF:
Plate 2021 2022
AAA 1 1
BBB 0 1
CCC 1 1
BBB 1 1
AAA 1 1
AAA 0 1
BBB 1 0
CCC 0 1
My expected outcome would be:
Plate 2021 2022 Total
AAA 2 3 5
BBB 2 2 4
CCC 1 2 3
I tried different combinations of groupby, such as:
df.groupby('Plate').sum()
df.groupby('Plate', '2020', '2021')('total').sum()
but still didn't reach the expected result.
I would appreciate your help :)
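A hedged sketch of one way to get there, assuming the year columns are literally named '2021' and '2022' (as in the sample) and that Total is the row-wise sum of the two grouped years:

out = df.groupby('Plate', as_index=False)[['2021', '2022']].sum()
out['Total'] = out['2021'] + out['2022']
print(out)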
I have a pandas df as follows:
MATERIAL DATE HIGH LOW
AAA 2022-01-01 10 0
AAA 2022-01-02 0 0
AAA 2022-01-03 5 2
BBB 2022-01-01 0 0
BBB 2022-01-02 10 5
BBB 2022-01-03 8 4
I want to group by MATERIAL, sort each group by DATE,
and keep all rows except the last one in each group.
The expected result is:
MATERIAL DATE HIGH LOW
AAA 2022-01-01 10 0
AAA 2022-01-02 0 0
BBB 2022-01-01 0 0
BBB 2022-01-02 10 5
I have tried df.sort_values('DATE').groupby('MATERIAL').head(-1) but this results in an empty df.
The DATE is a pd.datetime object.
Thanks!
Use Series.duplicated with keep='last' to keep all rows except the last one per group:
df = df.sort_values(['MATERIAL','DATE'])
df = df[df['MATERIAL'].duplicated(keep='last')]
print (df)
MATERIAL DATE HIGH LOW
0 AAA 2022-01-01 10 0
1 AAA 2022-01-02 0 0
3 BBB 2022-01-01 0 0
4 BBB 2022-01-02 10 5
A groupby solution is also possible with GroupBy.cumcount: count in descending order, then filter out the rows where the count is 0:
df = df.sort_values(['MATERIAL','DATE'])
df = df[df.groupby('MATERIAL').cumcount(ascending=False).ne(0)]
print (df)
MATERIAL DATE HIGH LOW
0 AAA 2022-01-01 10 0
1 AAA 2022-01-02 0 0
3 BBB 2022-01-01 0 0
4 BBB 2022-01-02 10 5
Another way is to sort by dates first, then group and take every row except the last one using indexing:
>>> df.sort_values("DATE").groupby("MATERIAL").apply(lambda group_df: group_df.iloc[:-1])
MATERIAL DATE HIGH LOW
MATERIAL
AAA 0 AAA 2022-01-01 10 0
1 AAA 2022-01-02 0 0
BBB 3 BBB 2022-01-01 0 0
4 BBB 2022-01-02 10 5
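Note that apply adds MATERIAL as an outer index level here. If you prefer the flat index of the other answers, one option (a small follow-up, not part of the original answer) is to drop that level:
>>> df.sort_values("DATE").groupby("MATERIAL").apply(lambda group_df: group_df.iloc[:-1]).droplevel("MATERIAL")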
You could use:
(df.sort_values('DATE')
   .groupby('MATERIAL', as_index=False, group_keys=False)
   .apply(lambda d: d.iloc[:-1])
)
output:
MATERIAL DATE HIGH LOW
0 AAA 2022-01-01 10 0
1 AAA 2022-01-02 0 0
3 BBB 2022-01-01 0 0
4 BBB 2022-01-02 10 5
Another way is to use groupby + transform with nth(-1), compare the result with the DATE column, and select only the rows that do not match:
df = df.sort_values(['MATERIAL','DATE'])
c = df['DATE'].ne(df.groupby("MATERIAL")['DATE'].transform('nth',-1))
out = df[c].copy()
print(out)
MATERIAL DATE HIGH LOW
0 AAA 2022-01-01 10 0
1 AAA 2022-01-02 0 0
3 BBB 2022-01-01 0 0
4 BBB 2022-01-02 10 5
Side note: since you have a date column, you can also use transform with max or last, but that limits you to dropping only the last row; for, say, the second-to-last row you would need nth, as shown above:
c = df['DATE'].ne(df.groupby("MATERIAL")['DATE'].transform('max'))
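For example, a hedged sketch of excluding the second-to-last row per group instead, using the same pattern with nth(-2) (the names c2 and out2 are illustrative):
c2 = df['DATE'].ne(df.groupby("MATERIAL")['DATE'].transform('nth', -2))
out2 = df[c2].copy()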
Yet another option chains duplicated with pipe:
df.loc[df.sort_values(['MATERIAL', 'DATE'])
       .duplicated(subset='MATERIAL', keep='last')]\
  .pipe(print)
MATERIAL DATE HIGH LOW
0 AAA 2022-01-01 10 0
1 AAA 2022-01-02 0 0
3 BBB 2022-01-01 0 0
4 BBB 2022-01-02 10 5
If I have this simple dataframe, how do I use groupby() to get the desired summary dataframe?
Using Python 3.8
Inputs
import pandas as pd

x = [1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4]
y = [100, 100, 100, 101, 102, 102, 102, 102, 103, 103, 104, 104, 104]
z = [1, 2, 3, 1, 1, 2, 3, 4, 1, 2, 1, 2, 3]
df = pd.DataFrame(list(zip(x, y, z)), columns=['id', 'set', 'n'])
display(df)
Desired Output (one row per unique id/set pair):
id set
1 100
2 101
2 102
3 103
4 104
With df.drop_duplicates
df.drop(columns="n").drop_duplicates(['id', 'set'])
id set
0 1 100
3 2 101
4 2 102
8 3 103
10 4 104
Groupby and explode
df.groupby('id')['set'].unique().explode()
id
1 100
2 101
2 102
3 103
4 104
You can try using .explode() and then reset the index of the result:
>>> df.groupby('id')['set'].unique().explode().reset_index(name='unique_value')
id unique_value
0 1 100
1 2 101
2 2 102
3 3 103
4 4 104
I have two dataframes, df_new and df_old; they have the same number of fields and the same field names. I want to subtract df_old from df_new for the numerical fields. For text fields, I just want to keep the df_new value.
I'm trying to put the result in a separate dataframe called changes.
I need help adding a dtype check to the formula below so it works only on numerical fields.
df_new.set_index('Index').subtract(df_old.set_index('Index'), fill_value=0)
NEW
Part Qty Stock Notes
0 AAA 40 10 yyy
1 BBB 40 10 yyy
2 CCC 50 20 yyy
3 DDD 40 10
4 EEE 40 10
5 FFF 40 10
OLD
Part Qty Stock Notes
0 AAA 40 10 xxx
1 BBB 40 10 xxx
2 CCC 40 10 xxx
3 DDD 40 10
4 EEE 40 10
5 FFF 40 10
CHANGES
Part Qty Stock Notes
0 AAA 0 0 yyy
1 BBB 0 0 yyy
2 CCC 10 10 yyy
3 DDD 0 0
4 EEE 0 0
5 FFF 0 0
What about:
from pandas.api.types import is_numeric_dtype
to_subtract = [column for column, dtype in df_new.dtypes.items() if is_numeric_dtype(dtype)]
df_new[to_subtract] = df_new[to_subtract] - df_old[to_subtract]
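A hedged sketch of building the separate changes frame the question asks for, assuming the two frames are row-aligned (otherwise set a shared index first, as in the set_index('Index') attempt above):

from pandas.api.types import is_numeric_dtype

# Start from a copy of df_new so neither input is mutated;
# text columns keep their df_new values automatically
changes = df_new.copy()

# Subtract only the numeric columns
numeric_cols = [c for c in df_new.columns if is_numeric_dtype(df_new[c])]
changes[numeric_cols] = df_new[numeric_cols] - df_old[numeric_cols]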
When trying to count rows with the same 'kind' in a data frame:
import pandas as pd
items = [('aaa','aaa text 1'), ('aaa','aaa text 2'), ('aaa','aaa text 3'),
('bb', 'bb text 1'), ('bb', 'bb text 2'), ('bb', 'bb text 3'),
('bb', 'bb text 4'),
('cccc','cccc text 1'), ('cccc','cccc text 2'),
('dd', 'dd text 1'),
('e', 'e text 1'),
('fff', 'fff text 1'),
]
df = pd.DataFrame(items, columns=['kind', 'msg'])
df
kind msg
0 aaa aaa text 1
1 aaa aaa text 2
2 aaa aaa text 3
3 bb bb text 1
4 bb bb text 2
5 bb bb text 3
6 bb bb text 4
7 cccc cccc text 1
8 cccc cccc text 2
9 dd dd text 1
10 e e text 1
11 fff fff text 1
This code works:
df = df[['kind']].groupby(['kind'])['kind'] \
.count() \
.reset_index(name='count') \
.sort_values(['count'], ascending=False) \
.head(5)
df
Resulting in:
kind count
1 bb 4
0 aaa 3
2 cccc 2
3 dd 1
4 e 1
Yet, how can one get a data frame with all the columns of the original plus a 'count' column, so the result has columns 'kind', 'msg', and 'count', in that order?
Also, how does one sort this resulting data frame in descending order of count?
IIUC
In [247]: df['count'] = df.groupby('kind').transform('count')
In [248]: df
Out[248]:
kind msg count
0 aaa aaa text 1 3
1 aaa aaa text 2 3
2 aaa aaa text 3 3
3 bb bb text 1 4
4 bb bb text 2 4
5 bb bb text 3 4
6 bb bb text 4 4
7 cccc cccc text 1 2
8 cccc cccc text 2 2
9 dd dd text 1 1
10 e e text 1 1
11 fff fff text 1 1
sorting:
In [249]: df.sort_values('count', ascending=False)
Out[249]:
kind msg count
3 bb bb text 1 4
4 bb bb text 2 4
5 bb bb text 3 4
6 bb bb text 4 4
0 aaa aaa text 1 3
1 aaa aaa text 2 3
2 aaa aaa text 3 3
7 cccc cccc text 1 2
8 cccc cccc text 2 2
9 dd dd text 1 1
10 e e text 1 1
11 fff fff text 1 1
Here is simple code to count the frequencies and add a count column to the data frame, grouping by the kind column:
df['count'] = df.groupby('kind')['kind'].transform('count')
This can also be achieved as part of a method chain:
df.assign(
count=lambda x: x.groupby('kind')['kind'].transform('count')
)
This is useful if you already have a chained expression, or if you need to pass the dataframe with the extra column to a function but don't want to overwrite the dataframe.
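For example, combining the chained count with the descending sort from earlier in one expression (a small usage sketch):

top = df.assign(
    count=lambda x: x.groupby('kind')['kind'].transform('count')
).sort_values('count', ascending=False)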