Pandas: how to group on column change?

I am working with a log system, and I need to group data in a non-standard way.
With my limited knowledge of Pandas I couldn't find any example, probably because I don't know the proper search terms.
This is a sample dataframe:
df = pd.DataFrame({
"speed": [2, 4, 6, 8, 8, 9, 2, 3, 8, 9, 13, 18, 25, 27, 18, 8, 6, 8, 12, 20, 27, 34, 36, 41, 44, 54, 61, 60, 61, 40, 17, 12, 15, 24],
"class": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 5, 5, 5, 3, 1, 1, 1, 2]
})
df.groupby(by="class").groups returns the indexes of each row, all grouped together by class value:
class indexes
1: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 14, 15, 16, 17, 18, 30, 32],
2: [12, 13, 19, 20, 21, 22, 33],
3: [23, 24, 29],
4: [25],
5: [26, 27, 28]
I need instead to split every time column class changes:
speed class
0 2 1
1 4 1
2 6 1
3 8 1
4 8 1
5 9 1
6 2 1
7 3 1
8 8 1
9 9 1
10 13 1
11 18 1
12 25 2 <= split here
13 27 2
14 18 1 <= split here
15 8 1
16 6 1
17 8 1
18 12 1 <= split here
19 20 2
20 27 2
21 34 2
22 36 2 <= split here
23 41 3
24 44 3 <= split here
25 54 4 <= split here
26 61 5
27 60 5
28 61 5 <= split here
29 40 3 <= split here
30 17 1 <= split here
31 12 1
32 15 1
33 24 2 <= split here
The desired grouping should return something like:
class count mean
0 1 12 7.50
1 2 2 26.00
2 1 5 10.40
3 2 4 29.25
4 3 2 42.50
5 4 1 54.00
6 5 3 60.66
7 3 1 40.00
8 1 3 14.66
9 2 1 24.00
Is there any command to do this without iterating?

Compare the column with its shifted values via Series.ne, take Series.cumsum to label each consecutive run, and aggregate with GroupBy.agg:
g = df["class"].ne(df["class"].shift()).cumsum()
df = (df.groupby(['class', g], sort=False)['speed'].agg(['count', 'mean'])
        .reset_index(level=1, drop=True)
        .reset_index())
print(df)
class count mean
0 1 12 7.500000
1 2 2 26.000000
2 1 5 10.400000
3 2 4 29.250000
4 3 2 42.500000
5 4 1 54.000000
6 5 3 60.666667
7 3 1 40.000000
8 1 3 14.666667
9 2 1 24.000000
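To see how the run ids form, here is a minimal sketch of the same `ne`/`shift`/`cumsum` pattern on a short series:

```python
import pandas as pd

s = pd.Series([1, 1, 2, 2, 1])
# True at every position where the value differs from the previous row
change = s.ne(s.shift())
# the cumulative sum turns each run of equal values into its own group id
g = change.cumsum()
print(g.tolist())  # [1, 1, 2, 2, 3]
```

Note that the first element always starts a new run, because `s.shift()` puts NaN there and NaN never compares equal.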

You can groupby the cumsum of a mask marking where the class column differs from the row above it:
df.groupby(df["class"].diff().ne(0).cumsum()).speed.agg(['size', 'mean'])
size mean
class
1 12 7.500000
2 2 26.000000
3 5 10.400000
4 4 29.250000
5 2 42.500000
6 1 54.000000
7 3 60.666667
8 1 40.000000
9 3 14.666667
10 1 24.000000
Update: I hadn't seen how you wanted the class column. What you can do is group by the original class column as well as the cumsum above, then do a bit of index sorting and resetting (but at this point this answer just converges with @jezrael's answer :P)
result = (
df.groupby(["class", df["class"].diff().ne(0).cumsum()])
.speed.agg(["size", "mean"])
.sort_index(level=1)
.reset_index(level=0)
.reset_index(drop=True)
)
class size mean
0 1 12 7.500000
1 2 2 26.000000
2 1 5 10.400000
3 2 4 29.250000
4 3 2 42.500000
5 4 1 54.000000
6 5 3 60.666667
7 3 1 40.000000
8 1 3 14.666667
9 2 1 24.000000
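A side note on the two masks: `diff` needs numeric data (it subtracts), while the `ne`/`shift` version also works on strings or categoricals. On numeric data they produce the same run labels; a small sketch to confirm:

```python
import pandas as pd

s = pd.Series([1, 1, 2, 2, 1])

m_shift = s.ne(s.shift())   # works for any dtype
m_diff = s.diff().ne(0)     # numeric columns only (diff subtracts)

print(m_shift.equals(m_diff))         # True
print(m_shift.cumsum().tolist())      # [1, 1, 2, 2, 3]
```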

Related

Cut continuous data by outliers

For example I have DataFrame
df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7, 8, 9], 'b': [2, 2, 4, 3, 1000, 2000, 1, 500, 3]})
I need to cut by outliers and get these intervals: 1-4, 5-6, 7, 8, 9.
Cutting with pd.cut and pd.qcut does not give these results
You can group them by consecutive values depending on the above/below mask:
m = df['b'].gt(100)
df['group'] = m.ne(m.shift()).cumsum()
output:
a b group
0 1 2 1
1 2 2 1
2 3 4 1
3 4 3 1
4 5 1000 2
5 6 2000 2
6 7 1 3
7 8 500 4
8 9 3 5
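If you also want the intervals from the question (1-4, 5-6, 7, 8, 9), you can aggregate each consecutive group down to the first and last value of `a`; a minimal sketch building on the mask above:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                   'b': [2, 2, 4, 3, 1000, 2000, 1, 500, 3]})

m = df['b'].gt(100)
group = m.ne(m.shift()).cumsum()

# collapse each consecutive run into its 'a' range:
# (1, 4), (5, 6), (7, 7), (8, 8), (9, 9)
ranges = df.groupby(group)['a'].agg(['first', 'last'])
print(ranges)
```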

create column based on column values - merge integers

I would like to create a new column "Group". The integer values from column "Step_ID" should be converted into 1 and 2: the first two values should be converted to 1, the second two values to 2, the third two values to 1, etc. See the image below.
import pandas as pd
data = {'Step_ID': [1, 1, 2, 2, 3, 4, 5, 6, 6, 7, 8, 8, 9, 10, 11, 11]}
df1 = pd.DataFrame(data)
You can try:
m = (df1.Step_ID % 2) + df1.Step_ID
df1['new_group'] = (m.ne(m.shift()).cumsum() % 2).replace(0, 2)
OUTPUT:
Step_ID new_group
0 1 1
1 1 1
2 2 1
3 2 1
4 3 2
5 4 2
6 5 1
7 6 1
8 6 1
9 7 2
10 8 2
11 8 2
12 9 1
13 10 1
14 11 2
15 11 2
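The key trick is the intermediate `m`: adding `Step_ID % 2` rounds every odd ID up to the next even number, so each pair (1, 2), (3, 4), ... collapses onto one key. A minimal sketch of that mechanism:

```python
import pandas as pd

s = pd.Series([1, 1, 2, 2, 3, 4, 5, 6, 6])
# odd values round up to the next even number, so (1,2), (3,4), ...
# share the same key
m = (s % 2) + s
print(m.tolist())  # [2, 2, 2, 2, 4, 4, 6, 6, 6]
# label the runs of equal keys, then alternate 1/2
group = (m.ne(m.shift()).cumsum() % 2).replace(0, 2)
print(group.tolist())  # [1, 1, 1, 1, 2, 2, 1, 1, 1]
```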

Get and Modify column in groups on rows that meet a condition

I have this DataFrame:
df = pd.DataFrame({'day': [1, 1, 1, 2, 2, 2, 3, 3, 3], 'hour': [10, 10, 10, 11, 11, 11, 12, 12, 12], 'sales': [0, 40, 30, 10, 80, 70, 0, 0, 20]})
day hour sales
0 1 10 0
1 1 10 40
2 1 10 30
3 2 11 10
4 2 11 80
5 2 11 70
6 3 12 0
7 3 12 0
8 3 12 20
And I would like to filter to get the first entry of each day that has sales greater than 0. As an additional thing I would like to change the 'hour' column for these rows to 9.
So to get something like this:
day hour sales
0 1 10 0
1 1 9 40
2 1 10 30
3 2 9 10
4 2 11 80
5 2 11 70
6 3 12 0
7 3 12 0
8 3 9 20
I only came up with this iterative solution. Is there a way to do it in a more functional, vectorized way?
# Group by day:
groups = df.groupby(by=['day'])
# Get all indices of the first non-zero sales entry per day:
indices = []
for name, group in groups:
    group = group[group['sales'] > 0]
    indices.append(group.index.to_list()[0])
# Change their values:
df.iloc[indices, df.columns.get_loc('hour')] = 9
You can build a mask checking if sales is greater than 0, group it by df['day'], get the index of the first True per day with idxmax, filter out days that don't have any positive value using any, then assign with loc[]:
g = df['sales'].gt(0).groupby(df['day'])
idx = g.idxmax()
df.loc[idx[g.any()],'hour']=9
print(df)
day hour sales
0 1 10 0
1 1 9 40
2 1 10 30
3 2 9 10
4 2 11 80
5 2 11 70
6 3 12 0
7 3 12 0
8 3 9 20
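To make the mechanics concrete: on a boolean series, `idxmax` returns the index label of the first `True` in each group, and `g.any()` guards against days with no positive sales at all (where `idxmax` would otherwise just return the group's first row). A small sketch on the question's data:

```python
import pandas as pd

df = pd.DataFrame({'day': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'hour': [10, 10, 10, 11, 11, 11, 12, 12, 12],
                   'sales': [0, 40, 30, 10, 80, 70, 0, 0, 20]})

g = df['sales'].gt(0).groupby(df['day'])
idx = g.idxmax()          # first positive-sales row per day
print(idx.tolist())       # [1, 3, 8]
```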
Create a series m holding, for each group of day and whether sales is non-zero, the first sales value of that group.
Then, compare df['sales'] against m (together with df['sales'] > 0) to change those specific rows' hour to 9 with np.where():
df = pd.DataFrame({'day': [1, 1, 1, 2, 2, 2, 3, 3, 3],
'hour': [10, 10, 10, 11, 11, 11, 12, 12, 12],
'sales': [0, 40, 30, 10, 80, 70, 0, 0, 20]})
m = df.groupby(['day', df['sales'].ne(0)])['sales'].transform('first')
df['hour'] = np.where((df['sales'] == m) & (df['sales'] > 0), 9, df['hour'])
df
Out[37]:
day hour sales
0 1 10 0
1 1 9 40
2 1 10 30
3 2 9 10
4 2 11 80
5 2 11 70
6 3 12 0
7 3 12 0
8 3 9 20

How to group consecutive values in other columns into ranges based on one column

I have the following dataframe:
I would like to get the following output from the dataframe
Is there any way to group the other columns ['B', 'index'] based on column 'A' using the groupby aggregate function or pivot_table in pandas?
I couldn't think of an approach to write the code.
Use:
df = df.reset_index()  # if 'index' is not already a column
g=df['A'].ne(df['A'].shift()).cumsum()
new_df=df.groupby(g,as_index=False).agg(index=('index',list),A=('A','first'),B=('B',lambda x: list(x.unique())))
print(new_df)
In pandas <0.25:
new_df=df.groupby(g,as_index=False).agg({'index':list,'A':'first','B':lambda x: list(x.unique())})
If you don't want repeated values in the index lists, use the same unique-ifying function for the index column as for B:
new_df=df.groupby(g,as_index=False).agg(index=('index',lambda x: list(x.unique())),A=('A','first'),B=('B',lambda x: list(x.unique())))
print(new_df)
Here is an example:
df=pd.DataFrame({'index':range(20),
'A':[1,1,1,1,2,2,0,0,0,1,1,1,1,1,1,0,0,0,3,3]
,'B':[1,2,3,5,5,5,7,8,9,9,9,12,12,14,15,16,17,18,19,20]})
print(df)
index A B
0 0 1 1
1 1 1 2
2 2 1 3
3 3 1 5
4 4 2 5
5 5 2 5
6 6 0 7
7 7 0 8
8 8 0 9
9 9 1 9
10 10 1 9
11 11 1 12
12 12 1 12
13 13 1 14
14 14 1 15
15 15 0 16
16 16 0 17
17 17 0 18
18 18 3 19
19 19 3 20
g=df['A'].ne(df['A'].shift()).cumsum()
new_df=df.groupby(g,as_index=False).agg(index=('index',list),A=('A','first'),B=('B',lambda x: list(x.unique())))
print(new_df)
index A B
0 [0, 1, 2, 3] 1 [1, 2, 3, 5]
1 [4, 5] 2 [5]
2 [6, 7, 8] 0 [7, 8, 9]
3 [9, 10, 11, 12, 13, 14] 1 [9, 12, 14, 15]
4 [15, 16, 17] 0 [16, 17, 18]
5 [18, 19] 3 [19, 20]

Sum weights based on other rows values

I'm looking for a query to sum the weights (from the first row, q17) for every row (every question), based on the answers to the other questions. The idea is to leave a value out of the sum if the question does not have an answer.
example
q17 5 5 4 3 5 5 4 5 = 36 for q17
q18 4 2 3 2 5 4 1 4 = 36 for q18
q19 5 2 4 2 5 4 1 4 = 36 for q19
q20 4 2 5 3 5 4 1 4 = 36 for q20
q21 4 5 3 5 5 1 = 26 for q21
q22 4 2 4 2 4 4 1 4 = 36 for q22
q23 4 1 4 3 5 4 1 4 = 36 for q23
q24 4 1 4 1 5 3 1 4 = 36 for q24
q25 5 2 4 3 5 4 1 4 = 36 for q25
q26 5 4 5 3 5 5 5 5 = 36 for q26
q27 5 4 4 1 5 4 1 4 = 36 for q27
q28 4 1 5 2 5 5 1 4 = 36 for q28
q29 5 5 5 4 5 4 5 5 = 36 for q29
q30 4 2 3 2 5 4 1 4 = 36 for q30
q31 4 3 4 4 5 4 1 5 = 36 for q31
q32 4 1 4 1 5 4 1 4 = 36 for q32
The weights are at q17, and I need to calculate the weight total for every question; where a question isn't answered, its weight must not be summed.
Don't take q18-q32 as values: treat them only as true/false (answered or not), then sum the weights from q17 based on this for every question.
the data is the following
Q A
17 5
18 4
19 5
20 4
21 4
22 4
23 4
24 4
25 5
26 5
27 5
28 4
29 5
30 4
31 4
32 4
17 5
18 2
19 2
20 2
22 2
23 1
24 1
25 2
26 4
27 4
28 1
29 5
30 2
31 3
32 1
17 4
18 3
19 4
20 5
21 5
22 4
23 4
24 4
25 4
26 5
27 4
28 5
29 5
30 3
31 4
32 4
17 3
18 2
19 2
20 3
21 3
22 2
23 3
24 1
25 3
26 3
27 1
28 2
29 4
30 2
31 4
32 1
17 5
18 5
19 5
20 5
21 5
22 4
23 5
24 5
25 5
26 5
27 5
28 5
29 5
30 5
31 5
32 5
17 5
18 4
19 4
20 4
21 5
22 4
23 4
24 3
25 4
26 5
27 4
28 5
29 4
30 4
31 4
32 4
17 4
18 1
19 1
20 1
21 1
22 1
23 1
24 1
25 1
26 5
27 1
28 1
29 5
30 1
31 1
32 1
17 5
18 4
19 4
20 4
22 4
23 4
24 4
25 4
26 5
27 4
28 4
29 5
30 4
31 5
32 4
I think this is what you require:
SELECT 'q'+CAST(Q AS nvarchar(4)) AS Q,
CAST(SUM(A) AS NVARCHAR(4)) + ' total for q' + CAST(Q AS NVARCHAR(4)) AS A
FROM tbl
GROUP BY Q
With a sample SQL fiddle
Output:
Q A
q17 36 total for q17
q18 25 total for q18
q19 27 total for q19
q20 28 total for q20
q21 23 total for q21
q22 25 total for q22
q23 26 total for q23
q24 23 total for q24
q25 28 total for q25
q26 37 total for q26
q27 28 total for q27
q28 27 total for q28
q29 38 total for q29
q30 25 total for q30
q31 30 total for q31
q32 24 total for q32
CREATE TABLE Table1
(`col0` varchar(3),
`col1` int,
`col2` int,
`col3` int,
`col4` int,
`col5` int,
`col6` int,
`col7` int,
`col8` int)
;
INSERT INTO Table1
(`col0`, `col1`, `col2`, `col3`, `col4`, `col5`, `col6`, `col7`, `col8`)
VALUES
('q17', 5, 5, 4, 3, 5, 5, 4, 5),
('q18', 4, 2, 3, 2, 5, 4, 1, 4),
('q19', 5, 2, 4, 2, 5, 4, 1, 4),
('q20', 4, 2, 5, 3, 5, 4, 1, 4),
('q21', 4, NULL, 5, 3, 5, 5, 1, Null),
('q22', 4, 2, 4, 2, 4, 4, 1, 4),
('q23', 4, 1, 4, 3, 5, 4, 1, 4),
('q24', 4, 1, 4, 1, 5, 3, 1, 4),
('q25', 5, 2, 4, 3, 5, 4, 1, 4),
('q26', 5, 4, 5, 3, 5, 5, 5, 5),
('q27', 5, 4, 4, 1, 5, 4, 1, 4),
('q28', 4, 1, 5, 2, 5, 5, 1, 4),
('q29', 5, 5, 5, 4, 5, 4, 5, 5),
('q30', 4, 2, 3, 2, 5, 4, 1, 4),
('q31', 4, 3, 4, 4, 5, 4, 1, 5),
('q32', 4, 1, 4, 1, 5, 4, 1, 4)
;
SELECT *,
Concat('= ',(IFNULL(`Col1`,0)+
IFNULL(`Col2`,0)+
IFNULL(`Col3`,0)+
IFNULL(`Col4`,0)+
IFNULL(`Col5`,0)+
IFNULL(`Col6`,0)+
IFNULL(`Col7`,0)+
IFNULL(`Col8`,0)),' For ',Col0) As Result from Table1
Live Demo
http://sqlfiddle.com/#!9/8dcff2/2
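Since the rest of this page is pandas, here is how the conditional weight sum the question describes (sum the q17 weights only over respondents who answered a given question) could be sketched in pandas too, assuming each respondent's block of long-format (Q, A) rows starts at Q == 17:

```python
import pandas as pd

# small hypothetical sample: two respondents, the second skipped q21
df = pd.DataFrame({
    'Q': [17, 18, 21, 17, 18],
    'A': [5, 4, 4, 4, 2],
})

# a new respondent starts at every Q == 17 row
respondent = df['Q'].eq(17).cumsum()
# each respondent's weight is their own q17 answer
weight = df.groupby(respondent)['A'].transform('first')
# sum weights per question over the respondents who answered it
totals = weight.groupby(df['Q']).sum()
print(totals)
```

Rows for unanswered questions simply don't exist in the long format, so they drop out of the sum automatically: here q21 gets only the first respondent's weight.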