PANDAS: a way to combine rows that are grouped by a field

I have a DataFrame that looks like:
import pandas as pd

test1 = pd.DataFrame({
    "ROUTE": ["MIA-ORD", "MIA-AUA", "ORD-MIA", "MIA-HOU", "MIA-JFK", "JFK-MIA", "JFK-YYZ"],
    "TICKET": ["123", "345", "123", "678", "456", "345", "456"],
    "COUPON": [1, 4, 2, 1, 1, 3, 2],
    "PAX": ["Jessica", "Alex", "Jessica", "Jamanica", "Ernest", "Alex", "Ernest"],
    "PAID": [100.00, 200.00, 100.00, 100.00, 200.00, 200.00, 200.00]})
This gives me:
ROUTE TICKET COUPON PAX PAID
0 MIA-ORD 123 1 Jessica 100.0
1 MIA-AUA 345 4 Alex 200.0
2 ORD-MIA 123 2 Jessica 100.0
3 MIA-HOU 678 1 Jamanica 100.0
4 MIA-JFK 456 1 Ernest 200.0
5 JFK-MIA 345 3 Alex 200.0
6 JFK-YYZ 456 2 Ernest 200.0
What I am trying to do is combine the ROUTE and COUPON data to get:
ROUTE TICKET COUPON PAX PAID
0 MIA-ORD-ORD-MIA 123 1-2 Jessica 100.0
1 JFK-MIA-MIA-AUA 345 3-4 Alex 200.0
2 MIA-HOU 678 1 Jamanica 100.0
3 MIA-JFK-JFK-YYZ 456 1-2 Ernest 200.0
So far I have been able to group by TICKET, since it's the obvious common identifier, and sort by COUPON, since the order of the flights for 'Alex' is inverted.
rs1 = test1.groupby(['TICKET']).apply(pd.DataFrame.sort_values, 'COUPON')
This results in:
ROUTE TICKET COUPON PAX PAID
TICKET
123 0 MIA-ORD 123 1 Jessica 100.0
2 ORD-MIA 123 2 Jessica 100.0
345 5 JFK-MIA 345 3 Alex 200.0
1 MIA-AUA 345 4 Alex 200.0
456 4 MIA-JFK 456 1 Ernest 200.0
6 JFK-YYZ 456 2 Ernest 200.0
678 3 MIA-HOU 678 1 Jamanica 100.0
But from here I cannot merge the ROUTE and COUPON columns.
I have tried:
st1=test1.groupby('TICKET').apply(lambda group: ','.join(group['ROUTE']))
But that only returns the merged column on its own, not the rest of the data.
TICKET
123 MIA-ORD,ORD-MIA
345 MIA-AUA,JFK-MIA
456 MIA-JFK,JFK-YYZ
678 MIA-HOU
dtype: object
Any ideas?

We can use groupby in combination with agg, applying '-'.join to each group:
test1['COUPON'] = test1['COUPON'].astype(str)
final = test1.groupby(['TICKET', 'PAX', 'PAID']).agg({'ROUTE': '-'.join,
                                                      'COUPON': '-'.join}).reset_index()
print(final)
print(final)
TICKET PAX PAID ROUTE COUPON
0 123 Jessica 100.0 MIA-ORD-ORD-MIA 1-2
1 345 Alex 200.0 MIA-AUA-JFK-MIA 4-3
2 456 Ernest 200.0 MIA-JFK-JFK-YYZ 1-2
3 678 Jamanica 100.0 MIA-HOU 1
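One caveat: because the rows are not sorted before aggregating, Alex's coupons come out as 4-3 rather than the 3-4 shown in the desired output. A minimal sketch of one way to address that (same data; sort by COUPON while it is still numeric, then join):

```python
import pandas as pd

test1 = pd.DataFrame({
    "ROUTE": ["MIA-ORD", "MIA-AUA", "ORD-MIA", "MIA-HOU", "MIA-JFK", "JFK-MIA", "JFK-YYZ"],
    "TICKET": ["123", "345", "123", "678", "456", "345", "456"],
    "COUPON": [1, 4, 2, 1, 1, 3, 2],
    "PAX": ["Jessica", "Alex", "Jessica", "Jamanica", "Ernest", "Alex", "Ernest"],
    "PAID": [100.00, 200.00, 100.00, 100.00, 200.00, 200.00, 200.00],
})

# Sort on the numeric COUPON first so each ticket's legs join in flight order,
# then convert to str so '-'.join works
test1 = test1.sort_values("COUPON")
test1["COUPON"] = test1["COUPON"].astype(str)
final = (test1.groupby(["TICKET", "PAX", "PAID"], as_index=False)
              .agg({"ROUTE": "-".join, "COUPON": "-".join}))
print(final)
```

With this, ticket 345 comes out as JFK-MIA-MIA-AUA / 3-4, matching the requested order.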

Related

How to find duplicates of the same id but different dates with SQL (Oracle)

I've got a data table like this:
id   line  datedt
123  1     01/01/2021
123  2     01/01/2021
123  3     01/01/2021
777  1     13/04/2020
777  2     13/04/2020
123  1     12/04/2021
123  2     12/04/2021
888  1     01/07/2020
888  2     01/07/2020
452  1     05/01/2020
888  1     02/05/2021
888  2     02/05/2021
I'd like to obtain a result like the one below, i.e. the number of distinct dates per id. For example, 123 appears with 2 different dates, as does 888, but there is only one date for 777 and 452.
id   nb
123  2
777  1
888  2
452  1
How could I get that? Hope it's clear :)
Thank you
select id, count(distinct datedt) as nb
from table
group by id
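For readers working in pandas rather than Oracle, count(distinct ...) per group corresponds to nunique. A small sketch on the same data:

```python
import pandas as pd

# Same rows as the Oracle table above
df = pd.DataFrame({
    "id":     [123, 123, 123, 777, 777, 123, 123, 888, 888, 452, 888, 888],
    "line":   [1, 2, 3, 1, 2, 1, 2, 1, 2, 1, 1, 2],
    "datedt": ["01/01/2021", "01/01/2021", "01/01/2021", "13/04/2020",
               "13/04/2020", "12/04/2021", "12/04/2021", "01/07/2020",
               "01/07/2020", "05/01/2020", "02/05/2021", "02/05/2021"],
})

# count(distinct datedt) ... group by id
nb = df.groupby("id")["datedt"].nunique()
print(nb)
```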

SQL Select statement and show only changed

I have a simple select script, and it generates the following audit table.
SELECT *
FROM Mytable
WHERE File = '123456A'
Output:
ID   File     StatusA  StatusB  User    UpdateDate
1    123456A  A        0        Tom     2021-01-01
12   123456A  B        0        Jack    2021-01-05
19   123456A  A        1        Alicia  2021-02-09
56   123456A  B        1        Jason   2021-03-09
87   123456A  A        1        Jason   2021-03-10
107  123456A  B        0        Ellie   2021-03-26
203  123456A  A        0        lucy    2021-04-08
239  123456A  B        1        Ellie   2021-04-16
I am trying to retrieve only the rows where column StatusB changed, so it would generate a table like this:
SELECT *
FROM Mytable
WHERE File = '123456A'
-AND StatusB is changed
ID   File     StatusA  StatusB  User    UpdateDate
1    123456A  A        0        Tom     2021-01-01
19   123456A  A        1        Alicia  2021-02-09
107  123456A  B        0        Ellie   2021-03-26
239  123456A  B        1        Ellie   2021-04-16
In this case, I can see that Alicia and Ellie changed the column StatusB. I am still trying to figure out how to accomplish this.
Thanks,
-Ming
You can use lag():
select t.*
from (select t.*,
             lag(statusB) over (order by updatedate) as prev_statusB
      from Mytable t
      where File = '123456A'
     ) t
where prev_statusB is null or prev_statusB <> statusB;
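The same lag-and-compare pattern translates to pandas via shift(). A minimal sketch using only the ID, StatusB, and UpdateDate columns from the example (the other columns are omitted for brevity):

```python
import pandas as pd

audit = pd.DataFrame({
    "ID": [1, 12, 19, 56, 87, 107, 203, 239],
    "StatusB": [0, 0, 1, 1, 1, 0, 0, 1],
    "UpdateDate": ["2021-01-01", "2021-01-05", "2021-02-09", "2021-03-09",
                   "2021-03-10", "2021-03-26", "2021-04-08", "2021-04-16"],
}).sort_values("UpdateDate")

# shift() plays the role of lag(): previous row's StatusB, NaN on the first row
prev = audit["StatusB"].shift()

# keep the first row plus every row where StatusB differs from the previous row
changed = audit[prev.isna() | (prev != audit["StatusB"])]
print(changed)
```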

Selecting the begin, middle & end 33% portions of a groupby object in pandas

I have a dataframe as below.
I want to group by "Cycle" & "Type". After the groupby is done, I want to perform several actions (sum, mean, var, std, rolling mean, linregress, ...) on the first 33%, middle 33% and last 33% of each group. How do I do it?
With head() & tail() I can select only the first & last few rows, and only if I know the number of rows I need; since the length of each group varies, I do not know these values. So, can anyone guide?
Cycle Type Time Values
2 2 101 20.402
2 2 102 20.402
2 2 103 20.402
2 2 104 20.402
2 2 105 20.402
2 2 106 20.383
2 2 107 20.383
2 2 108 20.383
2 2 109 20.383
2 2 110 20.383
2 2 111 20.36
2 2 112 20.36
2 2 113 20.36
2 2 114 20.36
2 2 115 20.36
2 2 116 20.36
2 2 117 20.36
2 2 118 20.36
2 2 119 20.36
2 2 120 20.36
2 2 121 20.348
2 2 122 20.348
2 2 123 20.348
2 2 124 20.348
2 2 125 20.348
3 1 126 20.34
3 1 127 20.34
3 1 128 20.34
3 1 129 20.34
3 1 130 20.34
3 1 131 20.337
3 1 132 20.337
3 1 133 20.337
3 1 134 20.337
3 1 135 20.337
3 1 136 20.342
3 1 137 20.342
3 1 138 20.342
3 1 139 20.342
3 1 140 20.342
3 1 141 20.342
3 1 142 20.342
3 1 143 20.342
3 1 144 20.342
3 1 145 20.342
3 1 146 20.335
3 1 147 20.335
3 1 148 20.335
3 1 149 20.335
5 2 102 20.402
5 2 103 20.402
5 2 104 20.402
5 2 105 20.402
5 2 106 20.383
5 2 107 20.383
5 2 108 20.383
5 2 109 20.383
5 2 110 20.383
5 2 111 20.36
5 2 112 20.36
5 2 113 20.36
5 2 114 20.36
5 2 115 20.36
5 2 116 20.36
5 2 117 20.36
5 2 118 20.36
5 2 119 20.36
Update: result achieved based on the suggestion from Valenteno.
Here is one way, using cumcount and transform with floor division (grouping by 'Cycle' and 'Type', per the question; the clip caps the bucket index at 2 so leftover rows fall into the last third):
g = df.groupby(['Cycle', 'Type'])
s = (g.cumcount() // (g.Cycle.transform('count') // 3)).clip(upper=2)
df.groupby([df.Cycle, df.Type, s]).apply(yourfunctionhere)
This should be close to what you want. Here I use only sum and mean; feel free to add other functions to the agg argument list.
def sample(x):
    aggrfunc = ['sum', 'mean']
    first = x.iloc[0:len(x)//3].agg(aggrfunc)
    middle = x.iloc[len(x)//3:2*len(x)//3].agg(aggrfunc)
    last = x.iloc[2*len(x)//3:].agg(aggrfunc)
    return pd.concat([first, middle, last], keys=['top 33%', 'middle 33%', 'bottom 33%'])

ddf = df.groupby(['Cycle', 'Type']).apply(sample)
Using your sample dataframe, this code produces ddf:
Cycle Type Time Values
Cycle Type
2 2 top 33% sum 16.0 16.0 836.0 163.159000
mean 2.0 2.0 104.5 20.394875
middle 33% sum 16.0 16.0 900.0 162.926000
mean 2.0 2.0 112.5 20.365750
bottom 33% sum 18.0 18.0 1089.0 183.180000
mean 2.0 2.0 121.0 20.353333
3 1 top 33% sum 24.0 8.0 1036.0 162.711000
mean 3.0 1.0 129.5 20.338875
middle 33% sum 24.0 8.0 1100.0 162.726000
mean 3.0 1.0 137.5 20.340750
bottom 33% sum 24.0 8.0 1164.0 162.708000
mean 3.0 1.0 145.5 20.338500
5 2 top 33% sum 30.0 12.0 627.0 122.374000
mean 5.0 2.0 104.5 20.395667
middle 33% sum 30.0 12.0 663.0 122.229000
mean 5.0 2.0 110.5 20.371500
bottom 33% sum 30.0 12.0 699.0 122.160000
mean 5.0 2.0 116.5 20.360000
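Another option, not taken from either answer above, is numpy's array_split, which divides each group into three near-equal parts without manual index arithmetic. A sketch on a small made-up frame (hypothetical values, same column names as the question):

```python
import numpy as np
import pandas as pd

# Hypothetical data: one 9-row group and one 6-row group
df = pd.DataFrame({
    "Cycle": [2] * 9 + [3] * 6,
    "Type":  [2] * 9 + [1] * 6,
    "Values": list(range(9)) + list(range(6)),
})

def thirds_means(s):
    # np.array_split copes with group lengths not divisible by 3
    parts = np.array_split(s.to_numpy(), 3)
    return pd.Series({name: part.mean()
                      for name, part in zip(["first", "middle", "last"], parts)})

# Any per-part aggregation (var, std, linregress, ...) can go inside thirds_means
out = df.groupby(["Cycle", "Type"])["Values"].apply(thirds_means)
print(out)
```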

Splitting Column Headers and Duplicating Row Values in Pandas Dataframe

In the example df below, I'm trying to find a way to split the column headers ('1;2', '4', '5;6') on the ';' and duplicate the row values into the split columns. (My actual df comes from an imported csv file, so generally I have around 50-80 column headers that need splitting.)
Below is my code with its output:
import pandas as pd
import numpy as np

data = np.array([['Market', 'Product Code', '1;2', '4', '5;6'],
                 ['Total Customers', 123, 1, 500, 400],
                 ['Total Customers', 123, 2, 400, 320],
                 ['Major Customer 1', 123, 1, 100, 220],
                 ['Major Customer 1', 123, 2, 230, 230],
                 ['Major Customer 2', 123, 1, 130, 30],
                 ['Major Customer 2', 123, 2, 20, 10],
                 ['Total Customers', 456, 1, 500, 400],
                 ['Total Customers', 456, 2, 400, 320],
                 ['Major Customer 1', 456, 1, 100, 220],
                 ['Major Customer 1', 456, 2, 230, 230],
                 ['Major Customer 2', 456, 1, 130, 30],
                 ['Major Customer 2', 456, 2, 20, 10]])
df = pd.DataFrame(data)
df.columns = df.iloc[0]
df = df.reindex(df.index.drop(0))
print(df)
0 Market Product Code 1;2 4 5;6
1 Total Customers 123 1 500 400
2 Total Customers 123 2 400 320
3 Major Customer 1 123 1 100 220
4 Major Customer 1 123 2 230 230
5 Major Customer 2 123 1 130 30
6 Major Customer 2 123 2 20 10
7 Total Customers 456 1 500 400
8 Total Customers 456 2 400 320
9 Major Customer 1 456 1 100 220
10 Major Customer 1 456 2 230 230
11 Major Customer 2 456 1 130 30
12 Major Customer 2 456 2 20 10
Below is my desired output
0 Market Product Code 1 2 4 5 6
1 Total Customers 123 1 1 500 400 400
2 Total Customers 123 2 2 400 320 320
3 Major Customer 1 123 1 1 100 220 220
4 Major Customer 1 123 2 2 230 230 230
5 Major Customer 2 123 1 1 130 30 30
6 Major Customer 2 123 2 2 20 10 10
7 Total Customers 456 1 1 500 400 400
8 Total Customers 456 2 2 400 320 320
9 Major Customer 1 456 1 1 100 220 220
10 Major Customer 1 456 2 2 230 230 230
11 Major Customer 2 456 1 1 130 30 30
12 Major Customer 2 456 2 2 20 10 10
Ideally I would like to perform such a task at the 'read_csv' level. Any thoughts?
Try reindex with repeat:
s = df.columns.str.split(';')
df = df.reindex(columns=df.columns.repeat(s.str.len()))
df.columns = sum(s.tolist(), [])
df
Out[247]:
Market Product Code 1 2 4 5 6
1 Total Customers 123 1 1 500 400 400
2 Total Customers 123 2 2 400 320 320
3 Major Customer 1 123 1 1 100 220 220
4 Major Customer 1 123 2 2 230 230 230
5 Major Customer 2 123 1 1 130 30 30
6 Major Customer 2 123 2 2 20 10 10
7 Total Customers 456 1 1 500 400 400
8 Total Customers 456 2 2 400 320 320
9 Major Customer 1 456 1 1 100 220 220
10 Major Customer 1 456 2 2 230 230 230
11 Major Customer 2 456 1 1 130 30 30
12 Major Customer 2 456 2 2 20 10 10
You can split the columns on ';' and then reconstruct a df:
pd.DataFrame({c:df[t] for t in df.columns for c in t.split(';')})
Out[157]:
1 2 4 5 6 Market Product Code
1 1 1 500 400 400 Total Customers 123
2 2 2 400 320 320 Total Customers 123
3 1 1 100 220 220 Major Customer 1 123
4 2 2 230 230 230 Major Customer 1 123
5 1 1 130 30 30 Major Customer 2 123
6 2 2 20 10 10 Major Customer 2 123
7 1 1 500 400 400 Total Customers 456
8 2 2 400 320 320 Total Customers 456
9 1 1 100 220 220 Major Customer 1 456
10 2 2 230 230 230 Major Customer 1 456
11 1 1 130 30 30 Major Customer 2 456
12 2 2 20 10 10 Major Customer 2 456
Or if you would like to preserve column order:
pd.concat([df[t].to_frame(c) for t in df.columns for c in t.split(';')], axis=1)
Out[167]:
Market Product Code 1 2 4 5 6
1 Total Customers 123 1 1 500 400 400
2 Total Customers 123 2 2 400 320 320
3 Major Customer 1 123 1 1 100 220 220
4 Major Customer 1 123 2 2 230 230 230
5 Major Customer 2 123 1 1 130 30 30
6 Major Customer 2 123 2 2 20 10 10
7 Total Customers 456 1 1 500 400 400
8 Total Customers 456 2 2 400 320 320
9 Major Customer 1 456 1 1 100 220 220
10 Major Customer 1 456 2 2 230 230 230
11 Major Customer 2 456 1 1 130 30 30
12 Major Customer 2 456 2 2 20 10 10
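On the question's ideal of handling this at the read_csv level: there is no built-in read_csv option for splitting headers, but the repeat-and-rename step can be applied immediately after reading. A sketch with an inline two-line CSV standing in for the real file:

```python
import io
import pandas as pd

# Inline stand-in for the imported csv file
csv_text = "Market,Product Code,1;2,4,5;6\nTotal Customers,123,1,500,400\n"
df = pd.read_csv(io.StringIO(csv_text))

# Repeat each column once per name in its ';'-separated header, then rename
split = df.columns.str.split(";")
df = df.reindex(columns=df.columns.repeat(split.str.len()))
df.columns = [name for names in split for name in names]
print(df)
```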

If last names are similar in [Name] column, fill in missing values of another column

Below is a sample of a much larger dataframe.
Fare Cabin Pclass Ticket Name
257 86.5000 B77 1 110152 Cherry, Miss. Gladys
759 86.5000 B77 1 110152 Rothes, the Countess. of (Lucy Noel Martha Dye...
504 86.5000 B79 1 110152 Maioni, Miss. Roberta
262 79.6500 E67 1 110413 Taussig, Mr. Emil
558 79.6500 E67 1 110413 Taussig, Mrs. Emil (Tillie Mandelbaum)
585 79.6500 NaN 1 110413 Taussig, Miss. Ruth
475 52.0000 A14 1 110465 Clifford, Mr. George Quincy
110 52.0000 C110 1 110465 Porter, Mr. Walter Chamberlain
335 26.0000 C106 1 110469 Maguire, Mr. John Edward
158 26.5500 D22 1 110489 Borebank, Mr. John James
430 26.5500 C52 1 110564 Bjornstrom-Steffansson, Mr. Mauritz Hakan
236 75.2500 D37 1 110813 Warren, Mr. Frank Manley
366 75.2500 D37 1 110813 Warren, Mrs. Frank Manley (Anna Sophia Atkinson)
191 26.0000 NaN 1 111163 Salomon, Mr. Abraham L
170 33.5000 B19 1 111240 Van der hoef, Mr. Wyckoff
462 38.5000 E63 1 111320 Gee, Mr. Arthur H
329 57.9792 NaN 1 111361 Hippach, Miss. Jean Gertrude
523 57.9792 B18 1 111361 Hippach, Mrs. Louis Albert (Ida Sophia Fischer)
I want to fill in the missing "Cabin" values with another passenger's "Cabin" value, but only if that other passenger has the same last name and is in the vicinity (i.e. one row above or below).
So in the dataframe above, [Taussig, Miss. Ruth]'s Cabin value of NaN would be replaced with [Taussig, Mrs. Emil]'s cabin value [E67], who is one row above her, because both conditions are met (same last name and in the vicinity).
And [Hippach, Miss. Jean Gertrude]'s missing cabin value would be replaced with
[Hippach, Mrs. Louis Albert (Ida Sophia Fischer)]'s Cabin value of [B18].
I tried to think of iteration but this is as far as I got
for x in df.Name.str.split(',')[x][0] ==df.Name.str.split(',')[x+1][0]:
if df.Cabin[x] or df.Cabin[x+1] == np.nan:
df.Cabin.replace(np.nan,
I want to make sure the np.nan value is replaced with an actual value, not another np.nan. I couldn't figure out how to do that.
Thanks.
Starting with your DataFrame
print(df)
Fare Cabin Pclass Ticket \
0 86.5000 B77 1 110152
1 86.5000 B77 1 110152
2 86.5000 B79 1 110152
3 79.6500 E67 1 110413
4 79.6500 E67 1 110413
5 79.6500 NaN 1 110413
6 52.0000 A14 1 110465
7 52.0000 C110 1 110465
8 26.0000 C106 1 110469
9 26.5500 D22 1 110489
10 26.5500 C52 1 110564
11 75.2500 D37 1 110813
12 75.2500 D37 1 110813
13 26.0000 NaN 1 111163
14 33.5000 B19 1 111240
15 38.5000 E63 1 111320
16 57.9792 NaN 1 111361
17 57.9792 B18 1 111361
Name
0 Cherry, Miss. Gladys
1 Rothes, the Countess. of (Lucy Noel Martha Dye...
2 Maioni, Miss. Roberta
3 Taussig, Mr. Emil
4 Taussig, Mrs. Emil (Tillie Mandelbaum)
5 Taussig, Miss. Ruth
6 Clifford, Mr. George Quincy
7 Porter, Mr. Walter Chamberlain
8 Maguire, Mr. John Edward
9 Borebank, Mr. John James
10 Bjornstrom-Steffansson, Mr. Mauritz Hakan
11 Warren, Mr. Frank Manley
12 Warren, Mrs. Frank Manley (Anna Sophia Atkinson)
13 Salomon, Mr. Abraham L
14 Van der hoef, Mr. Wyckoff
15 Gee, Mr. Arthur H
16 Hippach, Miss. Jean Gertrude
17 Hippach, Mrs. Louis Albert (Ida Sophia Fischer)
Creating a new column/series with just the last name. (Note: there might be a better way to do this with pandas str methods, but I couldn't get anything to work.)
df['LastName'] = df['Name'].map(lambda x : x[:x.find(',')])
Then we leverage Pandas' shift and boolean indexing to see if the passenger above has the same last name (ie the Taussig case)
filter = (df['Cabin'].isnull()) & (df['LastName'] == df['LastName'].shift())
df.loc[filter,'Cabin'] = df['Cabin'].shift()
and then the passenger below by passing a -1 to shift() (ie the Hippach case)
filter = (df['Cabin'].isnull()) & (df['LastName'] == df['LastName'].shift(-1))
df.loc[filter,'Cabin'] = df['Cabin'].shift(-1)
print(df)
Fare Cabin Pclass Ticket \
0 86.5000 B77 1 110152
1 86.5000 B77 1 110152
2 86.5000 B79 1 110152
3 79.6500 E67 1 110413
4 79.6500 E67 1 110413
5 79.6500 E67 1 110413
6 52.0000 A14 1 110465
7 52.0000 C110 1 110465
8 26.0000 C106 1 110469
9 26.5500 D22 1 110489
10 26.5500 C52 1 110564
11 75.2500 D37 1 110813
12 75.2500 D37 1 110813
13 26.0000 NaN 1 111163
14 33.5000 B19 1 111240
15 38.5000 E63 1 111320
16 57.9792 B18 1 111361
17 57.9792 B18 1 111361
Name LastName
0 Cherry, Miss. Gladys Cherry
1 Rothes, the Countess. of (Lucy Noel Martha Dye... Rothes
2 Maioni, Miss. Roberta Maioni
3 Taussig, Mr. Emil Taussig
4 Taussig, Mrs. Emil (Tillie Mandelbaum) Taussig
5 Taussig, Miss. Ruth Taussig
6 Clifford, Mr. George Quincy Clifford
7 Porter, Mr. Walter Chamberlain Porter
8 Maguire, Mr. John Edward Maguire
9 Borebank, Mr. John James Borebank
10 Bjornstrom-Steffansson, Mr. Mauritz Hakan Bjornstrom-Steffansson
11 Warren, Mr. Frank Manley Warren
12 Warren, Mrs. Frank Manley (Anna Sophia Atkinson) Warren
13 Salomon, Mr. Abraham L Salomon
14 Van der hoef, Mr. Wyckoff Van der hoef
15 Gee, Mr. Arthur H Gee
16 Hippach, Miss. Jean Gertrude Hippach
17 Hippach, Mrs. Louis Albert (Ida Sophia Fischer) Hippach
groupby + fillna
# back fill, then forward fill, within each last-name group
def bffill(x):
    return x.bfill().ffill()

df['Cabin'] = df.groupby(df.Name.str.split(',').str[0]).Cabin.apply(bffill)
df
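As a runnable check of the groupby + bfill/ffill idea, here is a sketch on a four-row subset of the sample data. Note one difference from the shift-based answer: this fills across the entire last-name group, not only between adjacent rows.

```python
import numpy as np
import pandas as pd

# Four-row subset of the Titanic sample above
df = pd.DataFrame({
    "Name": ["Taussig, Mr. Emil", "Taussig, Miss. Ruth",
             "Hippach, Miss. Jean Gertrude",
             "Hippach, Mrs. Louis Albert (Ida Sophia Fischer)"],
    "Cabin": ["E67", np.nan, np.nan, "B18"],
})

# Group by last name; back fill then forward fill Cabin within each group
last = df["Name"].str.split(",").str[0]
df["Cabin"] = df.groupby(last)["Cabin"].transform(lambda s: s.bfill().ffill())
print(df)
```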