pandas: filter rows having max value per category

Starting out with data like this:
import numpy as np
import pandas as pd
np.random.seed(314)
df = pd.DataFrame({
    'date': [pd.date_range('2016-04-01', '2016-04-05')[r] for r in np.random.randint(0, 5, 20)],
    'cat': ['ABCD'[r] for r in np.random.randint(0, 4, 20)],
    'count': np.random.randint(0, 100, 20)
})
cat count date
0 B 87 2016-04-04
1 A 95 2016-04-05
2 D 89 2016-04-02
3 D 39 2016-04-05
4 A 39 2016-04-01
5 C 61 2016-04-05
6 C 58 2016-04-04
7 B 49 2016-04-03
8 D 20 2016-04-02
9 B 54 2016-04-01
10 B 87 2016-04-01
11 D 36 2016-04-05
12 C 13 2016-04-05
13 A 79 2016-04-04
14 B 91 2016-04-03
15 C 83 2016-04-05
16 C 85 2016-04-05
17 D 93 2016-04-01
18 C 85 2016-04-02
19 B 91 2016-04-03
I'd like to end up with only the rows where count is the maximum value in the corresponding cat:
cat count date
1 A 95 2016-04-05
14 B 91 2016-04-03
16 C 85 2016-04-05
17 D 93 2016-04-01
18 C 85 2016-04-02
19 B 91 2016-04-03
Note that there can be multiple records with the max count per category.

Use groupby with transform to get each category's maximum for every row, then compare it with count:
df[df['count']==df.groupby('cat')['count'].transform('max')]
Out[163]:
cat count date
1 A 95 2016-04-05
14 B 91 2016-04-03
16 C 85 2016-04-05
17 D 93 2016-04-01
18 C 85 2016-04-02
19 B 91 2016-04-03
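
The reason this keeps ties is that transform('max') returns a Series aligned with the original index, holding the group maximum for every row, so the comparison gives a plain boolean mask. A minimal sketch on a small hand-made frame (the values are made up for illustration and are not the question's random sample):

```python
import pandas as pd

# Small illustrative frame (made-up values, not the question's data)
df = pd.DataFrame({
    'cat':   ['A', 'A', 'B', 'B', 'B'],
    'count': [95, 39, 91, 49, 91],
})

# One value per row: the maximum count within that row's category
group_max = df.groupby('cat')['count'].transform('max')

# Keep every row whose count equals its category maximum (ties included)
result = df[df['count'] == group_max]
print(result)
```

Both rows of category B with count 91 survive, which is the behaviour the question asks for.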

Related

Dataframe Operation Splicing

I have a single column dataframe without headers and I want to split it into multiple columns as follows
The current dataframe -
1
2
3
4
5
.
.
100
I want to represent it as -
1 6 .. .. 96
2 7 .. .. 97
3 8 .. .. 98
4 9 .. .. 99
5 10 .. .. 100
Assuming such a DataFrame:
df = pd.DataFrame({'col': range(1, 101)})
you can use the underlying numpy array to reshape:
df2 = pd.DataFrame(df['col'].to_numpy().reshape(5, -1, order='F'))
output:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 \
0 1 6 11 16 21 26 31 36 41 46 51 56 61 66 71 76 81 86 91
1 2 7 12 17 22 27 32 37 42 47 52 57 62 67 72 77 82 87 92
2 3 8 13 18 23 28 33 38 43 48 53 58 63 68 73 78 83 88 93
3 4 9 14 19 24 29 34 39 44 49 54 59 64 69 74 79 84 89 94
4 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95
19
0 96
1 97
2 98
3 99
4 100
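
The order='F' argument is what makes the values run down each column before moving to the next one. A small sketch with 10 values instead of 100 (the names by_column and by_row are only for illustration) shows the difference from the default row-major order:

```python
import pandas as pd

# Smaller stand-in for the 1..100 column in the question
df = pd.DataFrame({'col': range(1, 11)})
arr = df['col'].to_numpy()

# order='F' fills the 5-row array column by column
by_column = arr.reshape(5, -1, order='F')

# The default (order='C') fills it row by row instead
by_row = arr.reshape(5, -1)

df2 = pd.DataFrame(by_column)
print(df2)
```

Passing -1 lets NumPy work out the number of columns, so this only works when the length of the column is a multiple of 5.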

Pandas drop_duplicates with multiple conditions

I have some measurement data that needs to be filtered. I read it into a dataframe like this:
df
RequestTime RequestID ResponseTime ResponseID
0 150 14 103 101
1 150 15 110 102
2 25 16 121 103
3 25 16 97 104
4 22 16 44 105
5 19 17 44 106
6 26 18 29 106
7 30 18 29 106
I need to apply two different conditions at the same time, that is, to drop duplicates on both 'RequestTime', 'RequestID' and 'ResponseTime', 'ResponseID' using drop_duplicates(subset=). I have used the following commands to get the filtered result for each of the two conditions separately:
>>>df[['RequestTime','RequestID','ResponseTime','ResponseID']].drop_duplicates(subset = ['ResponseTime','ResponseID'])
RequestTime RequestID ResponseTime ResponseID
0 150 14 103 101
1 150 15 110 102
2 25 16 121 103
4 22 16 44 105
5 19 17 44 106
6 26 18 29 106
7 30 18 29 106
>>>df[['RequestTime','RequestID','ResponseTime','ResponseID']].drop_duplicates(subset = ['RequestTime','RequestID'])
RequestTime RequestID ResponseTime ResponseID
0 150 14 103 101
1 150 15 110 102
2 25 16 121 103
3 25 16 97 104
4 22 16 44 105
5 19 17 44 106
6 26 18 29 106
But how can I combine the two conditions so that both row 3 and row 7 are dropped?
IIUC,
m = ~(df.duplicated(subset=['RequestTime','RequestID']) | df.duplicated(subset=['ResponseTime', 'ResponseID']))
df[m]
Output:
RequestTime RequestID ResponseTime ResponseID
0 150 14 103 101
1 150 15 110 102
2 25 16 121 103
4 22 16 44 105
5 19 17 44 106
6 26 18 29 106
This creates a mask (a boolean Series) and uses it to boolean-index your dataframe.
Or chain methods:
df.drop_duplicates(subset=['RequestTime', 'RequestID']).drop_duplicates(subset=['ResponseTime', 'ResponseID'])
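
As a runnable sketch, the question's sample data can be rebuilt by hand and both approaches compared. The variable names dup_request, dup_response, masked and chained are only for illustration:

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'RequestTime':  [150, 150, 25, 25, 22, 19, 26, 30],
    'RequestID':    [14, 15, 16, 16, 16, 17, 18, 18],
    'ResponseTime': [103, 110, 121, 97, 44, 44, 29, 29],
    'ResponseID':   [101, 102, 103, 104, 105, 106, 106, 106],
})

# True for every row that repeats an earlier request / response pair
dup_request = df.duplicated(subset=['RequestTime', 'RequestID'])
dup_response = df.duplicated(subset=['ResponseTime', 'ResponseID'])

# Mask approach: drop a row if it is a duplicate under either condition
masked = df[~(dup_request | dup_response)]

# Chained approach: drop duplicates one condition after the other
chained = (df.drop_duplicates(subset=['RequestTime', 'RequestID'])
             .drop_duplicates(subset=['ResponseTime', 'ResponseID']))

print(masked.index.tolist())
print(masked.equals(chained))
```

On this data both give the same rows. They are not guaranteed to agree in general: the mask checks duplicates against the full frame, while the second call in the chain only sees rows that survived the first.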

how to encode only categorical data in a dataframe

How can I encode only the categorical columns in a data frame like this one?
Income Length of Residence Median House Value Number of Vehicles Percentage Asian Percentage Black Percentage English Speaking Percentage Hispanic Percentage White MakeDescr SeriesDescr Msrp
1 90000 15.0 F 4 1 1 71 6 81 HYUNDAI Sonata-4 Cyl. 19395.0
2 125000 7.0 H 1 11 1 91 1 81 JEEP Grand Cherokee-V6 29135.0
3 90000 8.0 F 1 1 1 71 6 86 JEEP Liberty 20700.0
4 125000 8.0 F 3 1 1 86 6 86 VOLKSWAGEN Passat-V6 28750.0
5 90000 8.0 F 1 1 1 71 6 81 JEEP Wrangler 20210.0
6 110000 7.0 G 5 6 6 71 6 76 HYUNDAI Santa Fe-V6 25645.0
7 110000 7.0 G 3 11 6 71 6 71 HYUNDAI Sonata-4 Cyl. 15999.0
8 125000 8.0 G 1 1 11 81 6 76 HYUNDAI Santa Fe-V6 23645.0
9 125000 9.0 G 1 6 1 91 1 86 CHEVROLET TRUCK Trailblazer EXT 32040.0
10 110000 8.0 E 2 6 46 81 16 26 JEEP Wrangler-V6 18660.0
11 125000 11.0 G 3 6 1 76 1 86 CHEVROLET TRUCK Silverado 2500 HD 31775.0
12 125000 12.0 G 2 11 6 66 1 71 CHEVROLET Cobalt 13675.0
13 125000 13.0 G 2 1 16 95 6 71 HYUNDAI Veracruz-V6 28600.0
15 110000 11.0 F 5 6 41 61 11 41 HYUNDAI Santa Fe 22499.0
16 125000 9.0 F 2 1 6 91 1 81 HYUNDAI Santa Fe 22499.0
17 125000 8.0 G 2 11 11 66 1 66 MITSUBISHI Endeavor-V6 32602.0
18 110000 12.0 E 1 6 46 81 16 26 HYUNDAI Accent-4 Cyl. 10899.0
19 90000 9.0 F 4 1 6 71 6 81 JEEP Grand Cherokee-6 Cyl. 29080.0
21 125000 8.0 G 1 6 1 76 1 86 MITSUBISHI Endeavor-V6 29302.0
22 110000 12.0 F 2 6 26 66 11 51 HYUNDAI Santa Fe 22499.0
23 90000 9.0 F 1 6 6 66 6 76 HYUNDAI Santa Fe-V6 20995.0
24 125000 9.0 H 1 6 1 91 1 81 HYUNDAI Sonata-V6 18799.0
25 90000 14.0 F 2 1 6 71 11 81 HYUNDAI Elantra-4 Cyl. 13299.0
26 125000 9.0 G 3 1 11 81 6 76 JEEP Grand Cherokee-6 Cyl. 29080.0
27 125000 8.0 H 5 6 1 91 1 81 CHEVROLET TRUCK Trailblazer 29395.0
28 110000 12.0 E 4 6 41 61 11 36 HYUNDAI Sonata-4 Cyl. 15999.0
29 110000 10.0 E 1 6 41 61 11 36 HYUNDAI Santa Fe-V6 20995.0
30 125000 10.0 F 2 6 1 71 6 86 CHEVROLET TRUCK Tahoe 37000.0
32 90000 10.0 F 1 1 1 71 6 86 MITSUBISHI Galant-V6 19997.0
33 125000 12.0 F 1 1 1 86 6 86 CHEVROLET TRUCK Trailblazer 28175.0
... ... ... ... ... ... ... ... ... ... ... ... ...
4451 110000 9.0 F 3 6 41 61 11 36 NISSAN Sentra-4 Cyl. 17990.0
4452 125000 11.0 G 2 1 11 81 6 76 CHEVROLET TRUCK Tahoe 39515.0
4453 125000 8.0 H 1 6 1 91 1 81 HYUNDAI Elantra-4 Cyl. 15195.0
4454 110000 10.0 F 3 6 41 61 11 41 HYUNDAI Genesis-4 Cyl. 26750.0
4455 125000 7.0 H 4 11 1 76 1 76 HYUNDAI Sonata-4 Cyl. 19695.0
4456 125000 9.0 G 5 6 1 76 1 86 NISSAN Altima 22500.0
4457 110000 11.0 E 1 6 46 81 16 26 GMC LIGHT DUTY Denali 51935.0
4458 125000 6.0 H 1 11 1 76 1 76 JEEP Liberty-V6 24865.0
4459 125000 12.0 G 3 1 16 95 6 71 HONDA Accord-V6 26700.0
4460 125000 7.0 F 1 1 1 86 6 86 HYUNDAI Veloster-4 Cyl. 17300.0
4461 90000 10.0 F 2 6 11 66 6 71 CADILLAC SRX-V6 42210.0
4463 110000 8.0 F 3 6 26 61 11 56 GMC LIGHT DUTY Acadia 42390.0
4468 125000 8.0 G 1 1 1 91 1 86 HONDA Pilot-V6 40820.0
4469 125000 10.0 H 5 11 1 91 1 81 TOYOTA Highlander-V6 30695.0
4470 110000 12.0 F 1 6 41 61 11 41 HYUNDAI Elantra-4 Cyl. 15195.0
4473 110000 13.0 F 1 6 21 66 6 61 ACURA TSX 32910.0
4476 125000 9.0 G 1 6 1 76 1 86 BMW X3 36750.0
4482 125000 10.0 H 1 6 1 91 1 81 SUBARU Forester-4 Cyl. 21195.0
4486 125000 11.0 H 2 6 1 91 1 81 GMC LIGHT DUTY Yukon XL 44315.0
4492 125000 10.0 H 2 6 1 91 1 81 BMW 5 Series 53400.0
4493 110000 12.0 G 2 6 6 71 6 76 ACURA TL 33725.0
4494 125000 12.0 F 3 1 1 86 6 86 ACURA TL 33725.0
4495 125000 12.0 F 3 1 1 86 6 86 ACURA TL 33725.0
4496 125000 7.0 G 5 1 11 81 6 76 ACURA TL 33325.0
4497 125000 9.0 G 1 6 1 76 1 86 ACURA TL 33725.0
4498 125000 12.0 G 3 1 11 81 6 76 ACURA TL 33725.0
4499 110000 14.0 G 8 11 6 71 6 71 ACURA TL 33725.0
4501 125000 9.0 G 3 11 6 66 1 71 FORD Taurus-V6 20050.0
4502 110000 2.0 G 4 11 6 71 6 71 DODGE Stratus-4 Cyl. 15910.0
4503 125000 8.0 F 1 1 1 86 6 86 DODGE Stratus-4 Cyl. 19145.0
# Using the standard scikit-learn label encoder.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# Encode all string columns, assuming all categoricals have dtype object (str).
for c in df.select_dtypes(['object']).columns:
    print("Encoding column " + c)
    # fit_transform refits the encoder on each column, so codes are per column
    df[c] = le.fit_transform(df[c])
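
If scikit-learn is not available, a similar result can be sketched with pandas alone. This is an alternative to the LabelEncoder approach above, not the same code: it converts each non-numeric column to the category dtype and keeps the integer codes. The tiny frame below reuses a few column names from the question with made-up rows, and treating every non-numeric column as categorical is an assumption:

```python
import pandas as pd

# Tiny stand-in for the question's frame (a few columns, illustrative rows)
df = pd.DataFrame({
    'Income': [90000, 125000, 90000],
    'Median House Value': ['F', 'H', 'F'],
    'MakeDescr': ['HYUNDAI', 'JEEP', 'JEEP'],
    'Msrp': [19395.0, 29135.0, 20700.0],
})

# Assumption: every non-numeric column is a categorical to encode
cat_cols = df.select_dtypes(exclude='number').columns

for c in cat_cols:
    # Categories are the sorted unique values; codes are their positions
    df[c] = df[c].astype('category').cat.codes

print(df)
```

The numeric columns are left untouched, and each encoded column gets its own independent set of integer codes.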

SQL - return the smallest value in one column that matches the value of another column

How can I return in column4 the smallest number from column3 based on column1?
1 100 1
2 100 1
1 101 2
1 102 4
2 200 19
3 200 19
16 200 19
18 200 19
19 200 19
20 200 19
3 301 28
6 301 28
3 302 29
3 310 30
4 400 31
4 410 32
4 420 33
5 500 34
7 500 34
5 510 35
6 510 35
5 520 36
6 610 37
7 700 38
7 701 39
8 701 39
8 800 40
8 802 41
Thank you!
Join to a subquery that calculates the minimums for each col1:
select a.col1, a.col2, a.col3, mcol3
from mytable a
join (select col1, min(col3) mcol3 from mytable group by col1) b
on b.col1 = a.col1
See SQLFiddle, showing this output from your sample data:
1 100 1 1
2 100 1 1
1 101 2 1
1 102 4 1
2 200 19 1
3 200 19 19
16 200 19 19
18 200 19 19
19 200 19 19
20 200 19 19
3 301 28 19
6 301 28 28
3 302 29 19
3 310 30 19
4 400 31 31
4 410 32 31
4 420 33 31
5 500 34 34
7 500 34 34
5 510 35 34
6 510 35 28
5 520 36 34
6 610 37 28
7 700 38 34
7 701 39 34
8 701 39 39
8 800 40 39
8 802 41 39
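
To try the query without a database server, here is a minimal sketch using Python's built-in sqlite3 module on the first few rows of the sample data. The column types and the ORDER BY clause are assumptions added here (the latter only to make the row order deterministic); the join itself is the one from the answer:

```python
import sqlite3

# In-memory database holding a subset of the question's rows
conn = sqlite3.connect(':memory:')
conn.execute('create table mytable (col1 integer, col2 integer, col3 integer)')
rows = [
    (1, 100, 1), (2, 100, 1), (1, 101, 2),
    (2, 200, 19), (3, 200, 19), (3, 301, 28),
]
conn.executemany('insert into mytable values (?, ?, ?)', rows)

# Join each row to the minimum col3 of its col1 group
query = """
select a.col1, a.col2, a.col3, b.mcol3
from mytable a
join (select col1, min(col3) as mcol3 from mytable group by col1) b
  on b.col1 = a.col1
order by a.col2, a.col1
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)

conn.close()
```

Every row keeps its own col3 and gains the smallest col3 found for its col1 value as a fourth column.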

How to group result of a group by in the same column (Oracle)

I have a query that returns "RESULT 1".
It is a GROUP BY on X, Y, Z and W with a SUM of VALUE, but I want the W / VALUE pairs grouped onto a single row per X, Y, Z combination, as in "RESULT 2".
Is there an efficient way to do that in Oracle?
RESULT 1:
X Y Z W VALUE
-- ----------------- -------------------------- -------------------- ----------
45 18 1 101 1,12
45 18 1 104 1,12
45 18 1 137 2,58
45 18 1 216 6,06
45 18 1 218 5,9
45 18 1 223 7,08
45 18 1 302 4,86
45 18 1 303 4,68
45 18 11 101 9,38
45 18 11 104 9,38
45 18 11 201 9,38
45 18 13 118 9,21
45 18 13 137 2,69
45 18 13 201 9,38
RESULT 2:
X Y Z W VALUE W VALUE W VALUE W VALUE W VALUE W VALUE W VALUE W VALUE
-- ----------------- -------------------------- -------------------- ----------
45 18 1 101 1,12 104 1,12 137 2,58 216 6,06 218 5,9 223 7,08 302 4,86 303 4,68
45 18 11 101 9,38 104 9,38 201 9,38
45 18 13 118 9,21 137 2,69 201 9,38