Filter rows using multiple conditions to create a new column - pandas

I have a dataframe that looks something like this:
df = pd.DataFrame({
'A Code': ['123', '234', '345', '234', '789', '345'],
'B Code': ['345', '123', '234', '123', '567', '567'],
'C Code': ['678', '123', '456', '234', '321', '789'],
'X Code': ['987', '765', '432', '876', '321', '765'],
'Y Code': ['765', '876', '987', '765', '432', '543'],
'H Code 1': ['EF', 'AB', 'GH', 'CD', 'GH', 'CD'],
'H Code 2': ['AB', 'CD', 'CD', 'AB', 'CD', 'GH']
})
A Code B Code C Code X Code Y Code H Code 1 H Code 2
0 123 345 678 987 765 EF AB
1 234 123 123 765 876 AB CD
2 345 234 456 432 987 GH CD
3 234 123 234 876 765 CD AB
4 789 567 321 321 432 GH CD
5 345 567 789 765 543 CD GH
The following conditions indicate that a row represents an apple:
1) H Code of 'AB' or 'EF'
OR
2) X or Y Code of '765'
OR
3) A, B or C Code of '123' and X or Y Code of '987'
OR
4) A, B or C Code of '234' or '345' and X or Y Code of '876'
I want to create a column that labels rows that meet any of the conditions above as apples and those that don't as bananas. So something like this:
A Code B Code C Code X Code Y Code H Code 1 H Code 2 Fruit
0 123 345 678 987 765 EF AB apple
1 234 123 123 765 876 AB CD apple
2 345 234 456 432 987 GH CD banana
3 234 123 234 876 765 CD AB apple
4 789 567 321 321 432 GH CD banana
5 345 567 789 765 543 CD GH apple
Thank you!

Use individual boolean masks for each condition and group them according to your operators:
import numpy as np
import pandas as pd
# Condition 1: either H Code is 'AB' or 'EF'
m1 = df[['H Code 1', 'H Code 2']].isin(['AB', 'EF']).any(axis=1)
# Condition 2: X or Y Code is '765'
m2 = df[['X Code', 'Y Code']].eq('765').any(axis=1)
# Condition 3: the codes are stored as strings, so compare with '123', not 123
m3 = df[['A Code', 'B Code', 'C Code']].eq('123').any(axis=1)
m4 = df[['X Code', 'Y Code']].eq('987').any(axis=1)
# Condition 4: again string values, '234' or '345'
m5 = df[['A Code', 'B Code', 'C Code']].isin(['234', '345']).any(axis=1)
m6 = df[['X Code', 'Y Code']].eq('876').any(axis=1)
m = m1 | m2 | (m3 & m4) | (m5 & m6)
df['Fruit'] = np.where(m, 'apple', 'banana')
Output:
>>> df
A Code B Code C Code X Code Y Code H Code 1 H Code 2 Fruit
0 123 345 678 987 765 EF AB apple
1 234 123 123 765 876 AB CD apple
2 345 234 456 432 987 GH CD banana
3 234 123 234 876 765 CD AB apple
4 789 567 321 321 432 GH CD banana
5 345 567 789 765 543 CD GH apple
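One detail worth checking in masks like these: the code columns hold strings, so a comparison against an integer silently matches nothing. A minimal sketch with a hypothetical two-row frame (`codes` is illustrative and not part of the question's data):

```python
import pandas as pd

# Hypothetical miniature of the code columns; the values are strings, as in the question.
codes = pd.DataFrame({'A Code': ['123', '234'], 'B Code': ['345', '123']})

# Comparing string cells with an integer never matches.
int_match = codes.eq(123).any(axis=1)
# Comparing with a string literal matches as intended.
str_match = codes.eq('123').any(axis=1)

print(int_match.tolist())
print(str_match.tolist())
```

If the codes were numeric in the real data, the integer form would be the right one instead; `df.dtypes` shows which case applies.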

Related

Excluding IDs with some value day after day

I have this df:
ID Date X Y
A 16-07-19 123 56
A 17-07-19 456 84
A 18-07-19 0 58
A 19-07-19 123 81
B 19-07-19 456 70
B 21-07-19 789 46
B 22-07-19 0 19
B 23-07-19 0 91
C 14-07-19 0 86
C 16-07-19 456 91
C 17-07-19 456 86
C 18-07-19 0 41
C 19-07-19 456 26
C 20-07-19 456 17
D 06-07-19 789 98
D 08-07-19 789 90
D 09-07-19 0 94
I want to exclude IDs that have any value in X column (except for 0) day after day.
For example: A has the value 123 on 16-07-19, and 456 on 17-07-19. So all A's observations should be excluded.
Expected result:
ID Date X Y
B 19-07-19 456 70
B 21-07-19 789 46
B 22-07-19 0 19
B 23-07-19 0 91
D 06-07-19 789 98
D 08-07-19 789 90
D 09-07-19 0 94
Let's do this in a vectorized manner, to keep our code as efficient as possible
(meaning: we avoid using GroupBy.apply)
1) Check whether the difference in Date from the previous row is exactly 1 day.
2) Check whether the X column is not equal to 0.
3) Create a temporary column m that is True where both conditions hold.
4) Group by ID and remove every group in which any row of m is True.
# If Date is not already a datetime type, parse it explicitly as dd-mm-yy;
# without a format, dates such as 06-07-19 can be read month-first.
df['Date'] = pd.to_datetime(df['Date'], format='%d-%m-%y')
m1 = df['Date'].diff(1).eq(pd.Timedelta(1, unit='d'))
m2 = df['X'].ne(0)
df['m'] = m1 & m2
df = df[~df.groupby('ID')['m'].transform('any')].drop(columns='m')
Output:
ID Date X Y
4 B 2019-07-19 456 70
5 B 2019-07-21 789 46
6 B 2019-07-22 0 19
7 B 2019-07-23 0 91
14 D 2019-07-06 789 98
15 D 2019-07-08 789 90
16 D 2019-07-09 0 94
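A variant sketch that computes the day difference within each ID, so the last row of one ID is never compared with the first row of the next, and that also requires the previous day's X to be non-zero. That stricter reading of "day after day" is an assumption rather than something the question states outright. The frame is rebuilt from the question's table so the block runs on its own:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': list('AAAABBBBCCCCCCDDD'),
    'Date': ['16-07-19', '17-07-19', '18-07-19', '19-07-19',
             '19-07-19', '21-07-19', '22-07-19', '23-07-19',
             '14-07-19', '16-07-19', '17-07-19', '18-07-19', '19-07-19', '20-07-19',
             '06-07-19', '08-07-19', '09-07-19'],
    'X': [123, 456, 0, 123, 456, 789, 0, 0, 0, 456, 456, 0, 456, 456, 789, 789, 0],
    'Y': [56, 84, 58, 81, 70, 46, 19, 91, 86, 91, 86, 41, 26, 17, 98, 90, 94],
})
# The dates are dd-mm-yy, so parse them with an explicit format.
df['Date'] = pd.to_datetime(df['Date'], format='%d-%m-%y')

# True where the previous row of the same ID is exactly one day earlier.
one_day = df.groupby('ID')['Date'].diff().eq(pd.Timedelta(days=1))
# True where X is non-zero on this row and on the previous row of the same ID.
nonzero = df['X'].ne(0)
prev_nonzero = df.groupby('ID')['X'].shift().fillna(0).ne(0)

flag = one_day & nonzero & prev_nonzero
# Drop every ID that has at least one flagged row.
result = df[~flag.groupby(df['ID']).transform('any')]
print(result)
```

On the sample data this keeps the same IDs as the answer's approach; the two differ only when a non-zero X follows a zero on the previous day.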

PANDAS: a way to combine rows that are grouped by a field

I have a DataFrame that looks like:
test1 = pd.DataFrame( {
"ROUTE" : ["MIA-ORD", "MIA-AUA", "ORD-MIA", "MIA-HOU", "MIA-JFK", "JFK-MIA", "JFK-YYZ"],
"TICKET" : ["123", "345", "123", "678", "456", "345", "456"],
"COUPON" : [1,4,2,1,1,3,2],
"PAX" : ["Jessica", "Alex", "Jessica", "Jamanica", "Ernest","Alex", "Ernest"],
"PAID": [100.00,200.00,100.00,100.00,200.00,200.00,200.00]})
This gives me:
ROUTE TICKET COUPON PAX PAID
0 MIA-ORD 123 1 Jessica 100.0
1 MIA-AUA 345 4 Alex 200.0
2 ORD-MIA 123 2 Jessica 100.0
3 MIA-HOU 678 1 Jamanica 100.0
4 MIA-JFK 456 1 Ernest 200.0
5 JFK-MIA 345 3 Alex 200.0
6 JFK-YYZ 456 2 Ernest 200.0
What I am trying to do is combine the ROUTE and COUPON data so that it becomes:
ROUTE TICKET COUPON PAX PAID
0 MIA-ORD-ORD-MIA 123 1-2 Jessica 100.0
1 JFK-MIA-MIA-AUA 345 3-4 Alex 200.0
2 MIA-HOU 678 1 Jamanica 100.0
3 MIA-JFK-JFK-YYZ 456 1-2 Ernest 200.0
So far I have been able to group by ticket, since it's the obvious common identifier, and sort by coupon, since the order of the flights for Alex is inverted.
rs1 = test1.groupby(['TICKET']).apply(pd.DataFrame.sort_values,'COUPON')
This results
ROUTE TICKET COUPON PAX PAID
TICKET
123 0 MIA-ORD 123 1 Jessica 100.0
2 ORD-MIA 123 2 Jessica 100.0
345 5 JFK-MIA 345 3 Alex 200.0
1 MIA-AUA 345 4 Alex 200.0
456 4 MIA-JFK 456 1 Ernest 200.0
6 JFK-YYZ 456 2 Ernest 200.0
678 3 MIA-HOU 678 1 Jamanica 100.0
But from here I cannot merge the ROUTE and COUPON columns.
I have tried:
st1=test1.groupby('TICKET').apply(lambda group: ','.join(group['ROUTE']))
But that only returns the merged column on its own, not the rest of the data.
TICKET
123 MIA-ORD,ORD-MIA
345 MIA-AUA,JFK-MIA
456 MIA-JFK,JFK-YYZ
678 MIA-HOU
dtype: object
Any Ideas?
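We can use groupby in combination with agg and then apply '-'.join(). Sort by COUPON first so that each ticket's legs are joined in flight order:
# Sort while COUPON is still numeric, then convert it to str so it can be joined
test1 = test1.sort_values(['TICKET', 'COUPON'])
test1['COUPON'] = test1['COUPON'].astype(str)
final = test1.groupby(['TICKET', 'PAX', 'PAID']).agg({'ROUTE':'-'.join,
'COUPON':'-'.join}).reset_index()
print(final)
TICKET PAX PAID ROUTE COUPON
0 123 Jessica 100.0 MIA-ORD-ORD-MIA 1-2
1 345 Alex 200.0 JFK-MIA-MIA-AUA 3-4
2 456 Ernest 200.0 MIA-JFK-JFK-YYZ 1-2
3 678 Jamanica 100.0 MIA-HOU 1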

Exclude rows where keys match, but are on different rows

I'm looking for the best way to produce the result set in the scenario provided. My cust3 column isn't identifying the repeated values in the invid2 column. The end result I'm looking for is to exclude the rows where key1 and key2 match (ids: 1, 2, 6 and 7), then sum the amounts where the acctids match. If there's a better way to code this, I welcome all suggestions. Thanks!
WITH T10 as (
SELECT acctid,invid,(
case
when invid like '%-R' then left (InvID,LEN(invid) -2) else InvID
END) as InvID2
FROM table x
GROUP BY acctID,invID
),
T11 as (
SELECT acctid, Invid2, COUNT(InvID2) as cust3
FROM T10
GROUP BY InvID2,acctid
HAVING
COUNT (InvID2) > 1
)
select DISTINCT
a.acctid,
a.name,
b.invid,
C.invid2,
D.cust3,
b.amt,
b.key1,
b.key2
from table a
inner join table b (nolock) on a.acctid = b.acctid
inner join T10 C (nolock) on b.invid = c.invid
inner join T11 D (nolock) on C.invid2 = D.invid2
Resultset
id acctID name invid invid2 Cust3 amt key1 key2
1 123 James 101 101 2 $500 NULL 6789
2 123 james 101-R 101 2 ($500) 6789 NULL
3 123 James 102 102 2 $350 NULL NULL
4 123 James 103 103 2 $200 NULL NULL
5 246 Tony 98-R 98 2 ($750) 7423 NULL
6 432 David 45 45 2 $100 NULL 9634
7 432 David 45-R 45 2 ($100) 9634 NULL
8 359 Stan 39-R 39 2 ($50) 6157 NULL
9 753 George 95 95 2 $365 NULL NULL
10 753 George 108 108 2 $100 NULL NULL
Desired Resultset
id acctID name invid invid2 Cust3 amt key1 key2
1 123 James 101 101 2 $500 NULL 6789
2 123 james 101-R 101 2 ($500) 6789 NULL
3 123 James 102 102 1 $350 NULL NULL
4 123 James 103 103 1 $200 NULL NULL
5 246 Tony 98-R 98 1 ($750) 7423 NULL
6 432 David 45 45 2 $100 NULL 9634
7 432 David 45-R 45 2 ($100) 9634 NULL
8 359 Stan 39-R 39 1 ($50) 6157 NULL
9 753 George 95 95 1 $365 NULL NULL
10 753 George 108 108 1 $100 NULL NULL
Then to sum amt by acctid
id acctid name amt
1 123 James $550
2 246 Tony ($750)
3 359 Stan ($50)
4 753 George $465
Something like:
;WITH Keys as (
SELECT Key1.acctID, [Key] = Key1.Key1
FROM YourTable as Key1
INNER JOIN YourTable as Key2
ON Key1.Key1 = Key2.Key2 and Key1.acctID = Key2.acctID
)
SELECT t.acctID, t.name, amt = SUM(t.amt)
FROM YourTable as t
LEFT JOIN Keys as k
ON t.acctID = k.acctID and (t.Key1 = [Key] or t.Key2 = [Key])
WHERE k.acctID is Null
GROUP BY t.acctID, t.name
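The logic of that query (a self-join to find matched keys, then an anti-join and a sum) can be checked outside the database with a small pandas sketch. The frame below is rebuilt from the question's result set under two assumptions: amt is held as a signed number rather than a formatted string, and the lowercase 'james' is normalised to 'James'.

```python
import pandas as pd

rows = pd.DataFrame({
    'acctID': [123, 123, 123, 123, 246, 432, 432, 359, 753, 753],
    'name': ['James', 'James', 'James', 'James', 'Tony',
             'David', 'David', 'Stan', 'George', 'George'],
    'amt': [500, -500, 350, 200, -750, 100, -100, -50, 365, 100],
    'key1': [None, 6789, None, None, 7423, None, 9634, 6157, None, None],
    'key2': [6789, None, None, None, None, 9634, None, None, None, None],
})

# Self-join: a key counts as matched when one row's key1 equals another row's
# key2 within the same account. Missing keys are dropped first because a
# pandas merge would otherwise pair NaN with NaN.
k1 = rows.dropna(subset=['key1'])[['acctID', 'key1']].rename(columns={'key1': 'key'})
k2 = rows.dropna(subset=['key2'])[['acctID', 'key2']].rename(columns={'key2': 'key'})
matched = k1.merge(k2, on=['acctID', 'key'])
pairs = set(zip(matched['acctID'], matched['key']))

# Anti-join: keep only rows whose key1 and key2 are both unmatched.
excluded = pd.Series(
    [(a, x) in pairs or (a, y) in pairs
     for a, x, y in zip(rows['acctID'], rows['key1'], rows['key2'])],
    index=rows.index,
)
totals = rows[~excluded].groupby(['acctID', 'name'], as_index=False)['amt'].sum()
print(totals)
```

Account 432 disappears from the totals because both of its rows carry a matched key, which agrees with the summary the question asks for.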

SQL Join by comparing measures or loop with cursors?

In order to verify if Deliveries are done on time, I need to match delivery Documents to PO schedule lines (SchLin) based on the comparison between Required Quantity (ReqQty) and Delivered Quantity (DlvQty).
The Delivery Docs have a reference to the PO and POItm but not to the SchLin.
Once a Delivery Doc is assigned to a Schedule Line I can calculate the Delivery Delta (DlvDelta) as the number of days it was delivered early or late compared to the requirement (ReqDate).
Examples of the two base tables are as follows:
Schedule lines
PO POItm SchLin ReqDate ReqQty
123 1 1 10/11 20
123 1 2 30/11 30
124 2 1 15/12 10
124 2 2 24/12 15
Delivery Docs
Doc Item PO POItm DlvDate DlvQty
810 1 123 1 29/10 12
816 1 123 1 02/11 07
823 1 123 1 04/11 13
828 1 123 1 06/11 08
856 1 123 1 10/11 05
873 1 123 1 14/11 09
902 1 124 2 27/11 05
908 1 124 2 30/11 07
911 1 124 2 08/12 08
923 1 124 2 27/12 09
Important: Schedule Lines and Deliveries should have the same PO and POItm.
The other logic to link is to sum the DlvQty until we reach (or exceed) ReqQty.
Those deliveries are then linked to the schedule line. Subsequent deliveries are used for the following schedule line(s). A delivery should be matched to only one schedule line.
After comparing the ReqQty and DlvQty the assignments should result in following:
Result
Doc Item PO POItm Schlin ReqDate DlvDate DlvDelta
810 1 123 1 1 10/11 29/10 -11
816 1 123 1 1 10/11 02/11 -08
823 1 123 1 1 10/11 04/11 -06
828 1 123 1 2 30/11 06/11 -24
856 1 123 1 2 30/11 10/11 -20
873 1 123 1 2 30/11 14/11 -16
902 1 124 2 1 15/12 27/11 -18
908 1 124 2 1 15/12 30/11 -15
911 1 124 2 2 24/12 08/12 -16
923 1 124 2 2 24/12 27/12 +03
Up till now, I have done this with loops using cursors but performance is rather sluggish.
Is there another way in SQL (script) using e.g. joins by comparing measures to achieve the same result?
Regards,
Eric
If you can express the rule for matching a delivery with a schedule line, you can produce the results you want in a single query. And, yes, I promise it will be faster (and simpler) than executing the same logic in loops on cursors.
I can't reproduce your exact results because I don't quite understand how the two tables relate. Hopefully from the code below you'll be able to figure it out by adjusting the join criteria.
I don't have your DBMS. My code uses SQLite, which has its own peculiar date functions. You'll have to substitute the ones your system provides. In any event, I can't recommend 5-character strings for dates. Use a datetime type if you have one, and include 4-digit years regardless. Otherwise, how many days are there between Christmas and New Year's Day?
create table S (
PO int not NULL,
POItm int not NULL,
SchLin int not NULL,
ReqDate char not NULL,
ReqQty int not NULL,
primary key (PO, POItm, SchLin)
);
insert into S values
(123, 1, 1, '10/11', 20 ),
(123, 1, 2, '30/11', 30 ),
(124, 2, 1, '15/12', 10 ),
(124, 2, 2, '24/12', 15 );
create table D (
Doc int not NULL,
Item int not NULL,
PO int not NULL,
POItm int not NULL,
DlvDate char not NULL,
DlvQty int not NULL,
primary key (Doc, Item)
);
insert into D values
(810, 1, 123, 1, '29/10', 12 ),
(816, 1, 123, 1, '02/11', 07 ),
(823, 1, 123, 1, '04/11', 13 ),
(828, 1, 123, 1, '06/11', 08 ),
(856, 1, 123, 1, '10/11', 05 ),
(873, 1, 123, 1, '14/11', 09 ),
(902, 1, 124, 2, '27/11', 05 ),
(908, 1, 124, 2, '30/11', 07 ),
(911, 1, 124, 2, '08/12', 08 ),
(923, 1, 124, 2, '27/12', 09 );
select D.Doc, D.Item, D.PO, S.SchLin, S.ReqDate, D.DlvDate
, cast(
julianday('2018-' || substr(DlvDate, 4,2) || '-' || substr(DlvDate, 1,2))
- julianday('2018-' || substr(ReqDate, 4,2) || '-' || substr(ReqDate, 1,2))
as int) as DlvDelta
from S join D on S.PO = D.PO and S.POItm = D.POItm
;
Result:
Doc Item PO SchLin ReqDate DlvDate DlvDelta
---------- ---------- ---------- ---------- ---------- ---------- ----------
810 1 123 1 10/11 29/10 -12
810 1 123 2 30/11 29/10 -32
816 1 123 1 10/11 02/11 -8
816 1 123 2 30/11 02/11 -28
823 1 123 1 10/11 04/11 -6
823 1 123 2 30/11 04/11 -26
828 1 123 1 10/11 06/11 -4
828 1 123 2 30/11 06/11 -24
856 1 123 1 10/11 10/11 0
856 1 123 2 30/11 10/11 -20
873 1 123 1 10/11 14/11 4
873 1 123 2 30/11 14/11 -16
902 1 124 1 15/12 27/11 -18
902 1 124 2 24/12 27/11 -27
908 1 124 1 15/12 30/11 -15
908 1 124 2 24/12 30/11 -24
911 1 124 1 15/12 08/12 -7
911 1 124 2 24/12 08/12 -16
923 1 124 1 15/12 27/12 12
923 1 124 2 24/12 27/12 3
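The query above pairs every delivery with every schedule line of the same PO and POItm. The matching rule from the question (accumulate DlvQty until ReqQty is reached, then move to the next schedule line) can be expressed by comparing running totals. Here is a minimal Python sketch of that rule, not a drop-in replacement for the SQL. Assumptions: the year is 2018, as in the query above; any quantity delivered beyond a line's ReqQty carries over to the next line; deliveries past the last line stay on the last line.

```python
from bisect import bisect_right
from collections import defaultdict
from datetime import date
from itertools import accumulate

YEAR = 2018  # assumption: the source dates are dd/mm without a year

def parse(ddmm):
    day, month = map(int, ddmm.split('/'))
    return date(YEAR, month, day)

schedule = [  # (PO, POItm, SchLin, ReqDate, ReqQty)
    (123, 1, 1, '10/11', 20), (123, 1, 2, '30/11', 30),
    (124, 2, 1, '15/12', 10), (124, 2, 2, '24/12', 15),
]
deliveries = [  # (Doc, Item, PO, POItm, DlvDate, DlvQty)
    (810, 1, 123, 1, '29/10', 12), (816, 1, 123, 1, '02/11', 7),
    (823, 1, 123, 1, '04/11', 13), (828, 1, 123, 1, '06/11', 8),
    (856, 1, 123, 1, '10/11', 5), (873, 1, 123, 1, '14/11', 9),
    (902, 1, 124, 2, '27/11', 5), (908, 1, 124, 2, '30/11', 7),
    (911, 1, 124, 2, '08/12', 8), (923, 1, 124, 2, '27/12', 9),
]

# Schedule lines per (PO, POItm), ordered by SchLin.
lines_by_item = defaultdict(list)
for po, itm, schlin, req_date, req_qty in sorted(schedule):
    lines_by_item[(po, itm)].append((schlin, req_date, req_qty))

delivered_before = defaultdict(int)  # running DlvQty per (PO, POItm)
matches = []
ordered = sorted(deliveries, key=lambda d: (d[2], d[3], parse(d[4]), d[0]))
for doc, item, po, itm, dlv_date, qty in ordered:
    lines = lines_by_item[(po, itm)]
    cum_req = list(accumulate(q for _, _, q in lines))  # running ReqQty
    # First line whose running ReqQty exceeds what was delivered earlier.
    pos = min(bisect_right(cum_req, delivered_before[(po, itm)]), len(lines) - 1)
    schlin, req_date, _ = lines[pos]
    delta = (parse(dlv_date) - parse(req_date)).days
    matches.append((doc, item, po, itm, schlin, req_date, dlv_date, delta))
    delivered_before[(po, itm)] += qty

for row in matches:
    print(row)
```

The same comparison of running sums can be written in SQL with window functions (a cumulative SUM over each PO and POItm) where the DBMS supports them, which avoids cursors.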

SAS Transpose and summarize

I'm working on following scenario in SAS.
Input 1
AccountNumber Loans
123 abc, def, ghi
456 jkl, mnopqr, stuv
789 w, xyz
Output 1
AccountNumbers Loans
123 abc
123 def
123 ghi
456 jkl
456 mnopqr
456 stuv
789 w
789 xyz
Input 2
AccountNumbers Loans
123 15-abc
123 15-def
123 15-ghi
456 99-jkl
456 99-mnopqr
456 99-stuv
789 77-w
789 77-xyz
Output 2
AccountNumber Loans
123 15-abc, 15-def, 15-ghi
456 99-jkl, 99-mnopqr, 99-stuv
789 77-w, 77-xyz
I managed to get Input 2 from Output 1; I just need Output 2 now.
I will really appreciate the help.
Thanks!
Try this, replacing [Input 2] with the actual name of your Input 2 table. The BY statement expects the table to be sorted by accountnumbers.
data output2 (drop=loans);
do until (last.accountnumbers);
set [Input 2];
by accountnumbers;
length loans_combined $100;
loans_combined=catx(', ',loans_combined,loans);
end;
run;
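For comparison only, the same group-and-concatenate step can be sketched in pandas. This is an analogue of the data step above, not SAS code, and the frame name `input2` is hypothetical:

```python
import pandas as pd

# Input 2 from the question, rebuilt as a DataFrame.
input2 = pd.DataFrame({
    'AccountNumbers': [123, 123, 123, 456, 456, 456, 789, 789],
    'Loans': ['15-abc', '15-def', '15-ghi', '99-jkl', '99-mnopqr',
              '99-stuv', '77-w', '77-xyz'],
})

# One row per account, with the loans joined by ', ' in their original order.
output2 = (
    input2.groupby('AccountNumbers', as_index=False, sort=False)['Loans']
          .agg(', '.join)
)
print(output2)
```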