How to count each x entries and mark the occurrence of this sequence with a value in a pandas DataFrame?

I want to create a column C (based on B) that numbers each consecutive series of 4 entries in B (or in the dataframe in general). I have the following pandas DataFrame:
A B
1 100
2 102
3 103
4 104
5 105
6 106
7 108
8 109
9 110
10 112
11 113
12 115
13 116
14 118
15 120
16 121
I want to create the following column C:
A C
1 1
2 1
3 1
4 1
5 2
6 2
7 2
8 2
9 3
10 3
11 3
12 3
13 4
14 4
15 4
16 4
This column C should number each series of 4 entries in the dataframe.
Thanks in advance.

Use:
df['C'] = df.index // 4 + 1
Given that you have a fairly simple dataframe, it's okay to assume you have a generic index, which is a RangeIndex object.
In your example it would look like this:
df.index
#RangeIndex(start=0, stop=16, step=1)
That being said, the values of this index are the following:
df.index.values
#array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], dtype=int64)
Converting such an array into your desired output uses the formula:
x // 4 + 1
where // is the floor-division operator.
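For completeness, a minimal runnable reproduction of this answer (the B values are copied from the question's frame):
import pandas as pd

df = pd.DataFrame({
    'A': range(1, 17),
    'B': [100, 102, 103, 104, 105, 106, 108, 109,
          110, 112, 113, 115, 116, 118, 120, 121],
})

# Floor-divide the default RangeIndex by the group size, shift to start at 1
df['C'] = df.index // 4 + 1
print(df['C'].tolist())
#[1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4]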

A general solution is to create a numpy array with np.arange, then use integer division by 4 and add 1 (because Python counts from 0):
import numpy as np

df['C'] = np.arange(len(df)) // 4 + 1
print(df)
A B C
0 1 100 1
1 2 102 1
2 3 103 1
3 4 104 1
4 5 105 2
5 6 106 2
6 7 108 2
7 8 109 2
8 9 110 3
9 10 112 3
10 11 113 3
11 12 115 3
12 13 116 4
13 14 118 4
14 15 120 4
15 16 121 4
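If the group size needs to vary, the same idea parameterizes cleanly. A small sketch; number_groups is a helper name of my own, not from the answer:
import numpy as np

def number_groups(df, size):
    # 1-based group number for each consecutive block of `size` rows
    return np.arange(len(df)) // size + 1

df['C'] = number_groups(df, 4)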

Related

Required data frame after explode, or another option to fill a running difference between two columns in a pandas dataframe

The input data frame is given below:
import pandas as pd

data = {
    'labels': ["A","B","A","B","A","B","M","B","M","B","M"],
    'start': [0,9,13,23,47,77,81,92,100,104,118],
    'stop': [9,13,23,47,77,81,92,100,104,118,145],
}
df = pd.DataFrame.from_dict(data)
labels start stop
0 A 0 9
1 B 9 13
2 A 13 23
3 B 23 47
4 A 47 77
5 B 77 81
6 M 81 92
7 B 92 100
8 M 100 104
9 B 104 118
10 M 118 145
The required output data frame expands each row into one row per step between start and stop, as shown in the answer's output below.
Try this:
# Build the range (start, stop] for each row, then explode to one row per value
df['start'] = df.apply(lambda x: range(x['start'] + 1, x['stop'] + 1), axis=1)
df = df.explode('start')
Output:
>>> df
labels start stop
0 A 1 9
0 A 2 9
0 A 3 9
0 A 4 9
0 A 5 9
0 A 6 9
0 A 7 9
0 A 8 9
0 A 9 9
1 B 10 13
1 B 11 13
1 B 12 13
1 B 13 13
2 A 14 23
2 A 15 23
2 A 16 23
2 A 17 23
2 A 18 23
2 A 19 23
2 A 20 23
2 A 21 23
2 A 22 23
2 A 23 23
...
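One follow-up worth knowing: explode keeps the original row index, hence the repeated 0, 1, 2 above. If a fresh sequential index is wanted, chain a reset_index (my addition, not part of the original answer):
df['start'] = df.apply(lambda x: range(x['start'] + 1, x['stop'] + 1), axis=1)
df = df.explode('start').reset_index(drop=True)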

Keep only the first value on duplicated IDs (set 0 to the others)

Supposing I have the following situation: a dataframe whose first column, 'ID', may have duplicated values.
import pandas as pd

df = pd.DataFrame({"ID": [1,2,3,4,4,5,5,5,6,6],
                   "l_1": [10,12,32,45,45,20,20,20,20,20],
                   "l_2": [11,12,32,11,21,27,38,12,9,6],
                   "l_3": [5,9,32,12,21,21,18,12,8,1],
                   "l_4": [6,21,12,77,77,2,2,2,8,8]})
ID l_1 l_2 l_3 l_4
1 10 11 5 6
2 12 12 9 21
3 32 32 32 12
4 45 11 12 77
4 45 21 21 77
5 20 27 21 2
5 20 38 18 2
5 20 12 12 2
6 20 9 8 8
6 20 6 1 8
When duplicated IDs occur:
I need to keep only the first value in columns l_1 and l_4 (the other duplicated rows must be zero).
Columns l_2 and l_3 must stay the same.
When IDs are duplicated, the values in columns l_1 and l_4 are duplicated on those rows as well.
Expected output:
ID l_1 l_2 l_3 l_4
1 10 11 5 6
2 12 12 9 21
3 32 32 32 12
4 45 11 12 77
4 0 21 21 0
5 20 27 21 2
5 0 38 18 0
5 0 12 12 0
6 20 9 8 8
6 0 6 1 0
Is there a straightforward way to accomplish this using pandas or numpy?
I could only accomplish it with all of these steps:
x1 = df[df.duplicated(subset=['ID'], keep=False)].copy()
x1.loc[x1.groupby('ID')['l_1'].apply(lambda x: (x.shift(1) == x)), 'l_1'] = 0
x1.loc[x1.groupby('ID')['l_4'].apply(lambda x: (x.shift(1) == x)), 'l_4'] = 0
df = df.drop_duplicates(subset=['ID'], keep=False)
df = pd.concat([df, x1])
Isn't this just:
df.loc[df.duplicated('ID'), ['l_1','l_4']] = 0
Output:
ID l_1 l_2 l_3 l_4
0 1 10 11 5 6
1 2 12 12 9 21
2 3 32 32 32 12
3 4 45 11 12 77
4 4 0 21 21 0
5 5 20 27 21 2
6 5 0 38 18 0
7 5 0 12 12 0
8 6 20 9 8 8
9 6 0 6 1 0
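For reference, df.duplicated('ID') marks every occurrence after the first as True, which is exactly the set of rows that gets zeroed:
print(df.duplicated('ID').tolist())
#[False, False, False, False, True, False, True, True, False, True]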

Sum of group but keep the same value for each row in pandas

How can I solve the same problem as in this link (Sum of group but keep the same value for each row in r) using pandas?
I know I can generate a separate df with the sum for each group and then merge it back with the original.
You can use groupby & transform as below to get your output.
df['sumx'] = df.groupby(['ID', 'Group'], sort=False)['x'].transform('sum')
df['sumy'] = df.groupby(['ID', 'Group'], sort=False)['y'].transform('sum')
df
Output:
ID Group x y sumx sumy
1 1 1 1 12 3 25
2 1 1 2 13 3 25
3 1 2 3 14 3 14
4 3 1 4 15 15 48
5 3 1 5 16 15 48
6 3 1 6 17 15 48
7 3 2 7 18 15 37
8 3 2 8 19 15 37
9 4 1 9 20 30 63
10 4 1 10 21 30 63
11 4 1 11 22 30 63
12 4 2 12 23 12 23
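Both columns can also be handled in a single transform call, since selecting a list of columns makes transform return a DataFrame (a sketch using the same column names):
sums = df.groupby(['ID', 'Group'], sort=False)[['x', 'y']].transform('sum')
df[['sumx', 'sumy']] = sums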

Winsorize within groups of dataframe

I have a dataframe like this:
import pandas as pd

df = pd.DataFrame([[1, 2],
                   [1, 4],
                   [1, 5],
                   [2, 65],
                   [2, 34],
                   [2, 23],
                   [2, 45]], columns=['label', 'score'])
Is there an efficient way to create a column score_winsor that winsorises the score column within the groups at the 1% level?
I tried this with no success:
df['score_winsor'] = df.groupby('label')['score'].transform(lambda x: max(x.quantile(.01), min(x, x.quantile(.99))))
You could use scipy's implementation of winsorize:
from scipy.stats.mstats import winsorize

df["score_winsor"] = df.groupby('label')['score'].transform(lambda row: winsorize(row, limits=[0.01, 0.01]))
Output
>>> df
label score score_winsor
0 1 2 2
1 1 4 4
2 1 5 5
3 2 65 65
4 2 34 34
5 2 23 23
6 2 45 45
This works: it clamps each score to the group's 1st and 99th percentiles. Because pandas quantiles interpolate, the values change even in these small groups, unlike scipy's winsorize above, which replaces a whole number of observations and therefore leaves groups this small untouched:
import numpy as np

df['score_winsor'] = df.groupby('label')['score'].transform(lambda x: np.maximum(x.quantile(.01), np.minimum(x, x.quantile(.99))))
Output
print(df.to_string())
label score score_winsor
0 1 2 2.04
1 1 4 4.00
2 1 5 4.98
3 2 65 64.40
4 2 34 34.00
5 2 23 23.33
6 2 45 45.00
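The same clamp reads a little more directly with Series.clip, which is equivalent to the np.maximum/np.minimum form above (a sketch):
df['score_winsor'] = df.groupby('label')['score'].transform(
    lambda x: x.clip(x.quantile(.01), x.quantile(.99)))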

Filter multiples in a pandas dataframe

My data can be easily converted into a pandas dataframe that looks something like:
import pandas as pd
data = {'a': ["t", "g"] * 9,
        'b': [1,2,3,4,5,6,1,2,3,4,5,6,1,2,3,4,5,6],
        'distance': [10, 15, 290, 300, 315, 320, 350, 360, 10, 25, 225, 240, 325, 335, 365, 205, 15, 35]}
df = pd.DataFrame(data, columns=['a','b','distance'])
print(df)
a b distance
0 t 1 10
1 g 2 15
2 t 3 290
3 g 4 300
4 t 5 315
5 g 6 320
6 t 1 350
7 g 2 360
8 t 3 10
9 g 4 25
10 t 5 225
11 g 6 240
12 t 1 325
13 g 2 335
14 t 3 365
15 g 4 205
16 t 5 15
17 g 6 35
I want to erase all the lines that share a value in the "b" column, keeping only the line with the smallest value in the "distance" column. In this example that means erasing every line with a "distance" greater than 200, so only the lines with index 0, 1, 8, 9, 16, 17 remain. In the end every line should have a different "b" value and the smallest "distance". It would look like:
a b distance
0 t 1 10
1 g 2 15
2 t 3 10
3 g 4 25
4 t 5 15
5 g 6 35
How could I do that?
Group by the 'b' column and call idxmin on the 'distance' column, then use the result to index the original df:
In [114]:
df.loc[df.groupby('b')['distance'].idxmin()]
Out[114]:
a b distance
0 t 1 10
1 g 2 15
8 t 3 10
9 g 4 25
16 t 5 15
17 g 6 35
Here you can see that idxmin returns the indices of the lowest values:
In [115]:
df.groupby('b')['distance'].idxmin()
Out[115]:
b
1 0
2 1
3 8
4 9
5 16
6 17
Name: distance, dtype: int64
Try this (it takes the per-group minimum of each column independently, which works here because 'a' is constant within each 'b' group):
df.groupby('b')[['a','b','distance']].min()
# a b distance
# b
# 1 t 1 10
# 2 g 2 15
# 3 t 3 10
# 4 g 4 25
# 5 t 5 15
# 6 g 6 35
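An alternative that keeps whole rows, rather than taking per-column minima, is to sort by distance and keep the first row per 'b' (a sketch):
df.sort_values('distance').drop_duplicates('b').sort_values('b')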