Python Pandas Combining 2 df by keys - pandas

I'm trying to combine these two dataframes:
df1 =
ID1 ID2
111 1001
112 1002
113 1003
114 1004
115 1005
df2 =
ID1 Name Age
111 ABC 20
111 ABC 21
1001 ABC 22
1002 QAZ 18
1002 QAZ 19
1002 QAZ 20
113 XYZ 25
113 XYZ 25
to get an output like this:
ID Name Age ID1 ID2
111 ABC 20 111 1001
111 ABC 21 111 1001
1001 ABC 22 111 1001
1002 QAZ 18 112 1002
1002 QAZ 19 112 1002
1002 QAZ 20 112 1002
113 XYZ 25 113 1003
113 XYZ 25 113 1003
Is this possible?
Thanks in advance!

merge + combine_first PS: I think the ID1 in df2 should be ID
df2.merge(df1,left_on='ID',right_on='ID1',how='left').\
combine_first(df2.merge(df1,left_on='ID',right_on='ID2',how='left'))
Out[912]:
ID Name Age ID1 ID2
0 111 ABC 20 111.0 1001.0
1 111 ABC 21 111.0 1001.0
2 1001 ABC 22 111.0 1001.0
3 1002 QAZ 18 112.0 1002.0
4 1002 QAZ 19 112.0 1002.0
5 1002 QAZ 20 112.0 1002.0
6 113 XYZ 25 113.0 1003.0
7 113 XYZ 25 113.0 1003.0

Related

Pandas drop_duplicates with multiple conditions

I have some measurement datas that need to be filtered, I read them as dataframe data, like these:
df
RequestTime RequestID ResponseTime ResponseID
0 150 14 103 101
1 150 15 110 102
2 25 16 121 103
3 25 16 97 104
4 22 16 44 105
5 19 17 44 106
6 26 18 29 106
7 30 18 29 106
and I need to use two different conditions at the same time, that is, to filter 'RequestTime' 'RequestID' and 'ResponseTime' 'ResponseID' by use drop_duplicate(subset=) at the same time. I have used follow command to get the filter results for each of the two conditions:
>>>df[['RequestTime','RequestID','ResponseTime','ResponseID']].drop_duplicates(subset = ['ResponseTime','ResponseID'])
RequestTime RequestID ResponseTime ResponseID
0 150 14 103 101
1 150 15 110 102
2 25 16 121 103
4 22 16 44 105
5 19 17 44 106
6 26 18 29 106
7 30 18 29 106
>>>df[['RequestTime','RequestID','ResponseTime','ResponseID']].drop_duplicates(subset = ['RequestTime','RequestID'])
RequestTime RequestID ResponseTime ResponseID
0 150 14 103 101
1 150 15 110 102
2 25 16 121 103
3 25 16 97 104
4 22 16 44 105
5 19 17 44 106
6 26 18 29 106
but how to combine the two conditions to drop duplicate row 3 and row 7?
IIUC,
m = ~(df.duplicated(subset=['RequestTime','RequestID']) | df.duplicated(subset=['ResponseTime', 'ResponseID']))
df[m]
Output:
RequestTime RequestID ResponseTime ResponseID
0 150 14 103 101
1 150 15 110 102
2 25 16 121 103
4 22 16 44 105
5 19 17 44 106
6 26 18 29 106
Create a mask (boolean series) to boolean index your dataframe.
Or chain methods:
df.drop_duplicates(subset=['RequestTime', 'RequestID']).drop_duplicates(subset=['ResponseTime', 'ResponseID'])

how to add incremental number to specific column in pandas

I have following dataframe in pandas
code tank length dia diff
123 3 625 210 -0.38
123 5 635 210 1.2
I want to add 1 only in length for 5 times if the diff is positive and subtract 1 if the dip is negative. My desired dataframe looks like
code tank length diameter
123 3 625 210
123 3 624 210
123 3 623 210
123 3 622 210
123 3 621 210
123 3 620 210
123 5 635 210
123 5 636 210
123 5 637 210
123 5 638 210
123 5 639 210
123 5 640 210
I am doing following in pandas.
df.add(1)
But, its adding 1 to all the columns.
Use Index.repeat 6 times, then add counter values by GroupBy.cumcount and last create default RangeIndex by DataFrame.set_index:
df1 = df.loc[df.index.repeat(6)].copy()
df1['length'] += df1.groupby(level=0).cumcount()
df1 = df1.reset_index(drop=True)
Or:
df1 = (df.loc[df.index.repeat(6)]
.assign(length = lambda x: x.groupby(level=0).cumcount() + x['length'])
.reset_index(drop=True))
print (df1)
code tank length dia
0 123 3 625 210
1 123 3 626 210
2 123 3 627 210
3 123 3 628 210
4 123 3 629 210
5 123 3 630 210
6 123 5 635 210
7 123 5 636 210
8 123 5 637 210
9 123 5 638 210
10 123 5 639 210
11 123 5 640 210
EDIT:
df1 = df.loc[df.index.repeat(6)].copy()
add = df1.groupby(level=0).cumcount()
mask = df1['diff'] < 0
df1['length'] = np.where(mask, df1['length'] - add, df1['length'] + add)
df1 = df1.reset_index(drop=True)
print (df1)
code tank length dia diff
0 123 3 625 210 -0.38
1 123 3 624 210 -0.38
2 123 3 623 210 -0.38
3 123 3 622 210 -0.38
4 123 3 621 210 -0.38
5 123 3 620 210 -0.38
6 123 5 635 210 1.20
7 123 5 636 210 1.20
8 123 5 637 210 1.20
9 123 5 638 210 1.20
10 123 5 639 210 1.20
11 123 5 640 210 1.20
We can use pd.concat, np.cumsum and groupby + .add.
If you want to substract, simply multiply addition * -1 so for example: (np.cumsum(np.ones(n))-1) * -1
n = 6
new = pd.concat([df]*n).sort_values(['code', 'length']).reset_index(drop=True)
addition = np.cumsum(np.ones(n))-1
new['length'] = new.groupby(['code', 'tank'])['length'].apply(lambda x: x.add(addition))
Output
code tank length dia
0 123 3 625.0 210
1 123 3 626.0 210
2 123 3 627.0 210
3 123 3 628.0 210
4 123 3 629.0 210
5 123 3 630.0 210
6 123 5 635.0 210
7 123 5 636.0 210
8 123 5 637.0 210
9 123 5 638.0 210
10 123 5 639.0 210
11 123 5 640.0 210

How can i give a numeric order to each line based on unique ID

I'm working on big data set and I need to give and print a numeric order for each unique ID ($1) and want to delete the lines above 335 numeric order for each unique ID.
The data looks like
101 24
101 13
101 15
102 25
102 21
102 23
103 20
103 12
103 18
The output looks like this
101 24 1
101 13 2
101 15 3
102 25 1
102 21 2
102 23 3
103 20 1
103 12 2
103 18 3
Try below one
Input
$ cat f
101 24
101 13
101 15
102 25
102 21
102 23
103 20
103 12
103 18
Output
$ awk '{print $0,++a[$1]}' f
101 24 1
101 13 2
101 15 3
102 25 1
102 21 2
102 23 3
103 20 1
103 12 2
103 18 3
If data is sorted ( column1 ) then use below one, faster
$ awk '$1!=p{n=0}{print $0,++n; p=$1}' f
101 24 1
101 13 2
101 15 3
102 25 1
102 21 2
102 23 3
103 20 1
103 12 2
103 18 3
To remove id above 335
$ awk '$1!=p{n=0; p=$1}++n<335{print $0,n}' f
$ awk '++a[$1]<335{print $0,a[$1]}' f

How to add status to the table

I have the following table where is clipping from my db. I have 2 types of contracts.
I: client pays for first 6mth 60$, next 6mth 120$ (111 client)
II: client pays for first 6mth 60$ but if want still pays 60$ the contract will be extended at 6mth, whole contract is 18mth. (321 client who still pays)
ID_Client | Amount | Amount_charge | Lenght | Date_from | Date_to | Reverse
--------------------------------------------------------------------------------
111 60 60 12 2015-01-01 2015-01-31 12
111 60 60 12 2015-02-01 2015-02-28 11
111 60 60 12 2015-03-01 2015-03-31 10
111 60 60 12 2015-04-01 2015-04-30 9
111 60 60 12 2015-05-01 2015-05-31 8
111 60 60 12 2015-06-01 2015-06-30 7
111 120 60 12 2015-07-01 2015-07-31 6
111 120 60 12 2015-08-01 2015-08-31 5
111 120 60 12 2015-09-01 2015-09-30 4
111 120 60 12 2015-10-01 2015-10-31 3
111 120 60 12 2015-11-01 2015-11-30 2
111 120 60 12 2015-12-01 2015-12-31 1
111 120 60 12 2016-01-01 2015-01-31 0
111 120 60 12 2016-02-01 2015-02-29 0
321 60 60 12 2015-01-01 2015-01-31 12
321 60 60 12 2015-02-01 2015-02-28 11
321 60 60 12 2015-03-01 2015-03-31 10
321 60 60 12 2015-04-01 2015-04-30 9
321 60 60 12 2015-05-01 2015-05-31 8
321 60 60 12 2015-06-01 2015-06-30 7
321 60 60 12 2015-07-01 2015-07-31 6
321 60 60 12 2015-08-01 2015-08-31 5
321 60 60 12 2015-09-01 2015-09-30 4
321 60 60 12 2015-10-01 2015-10-31 3
321 60 60 12 2015-11-01 2015-11-30 2
321 60 60 12 2015-12-01 2015-12-31 1
321 60 60 12 2016-01-01 2016-01-30 0
321 60 60 12 2016-02-01 2016-02-31 0
321 60 60 12 2016-03-01 2016-03-30 0
321 60 60 12 2016-04-01 2016-04-31 0
I need to add status column.
A - normal period of agreement
D - where the agreement is doubled after 6mth but after 12mth is E(nd of agreemnt)
E - where contract is finished
L - where contract after 6mth was extended, after 18mth the status will be type E
For 321 Client after 12mth the lenght of contract was updated from 12 to 18
I have a lot of clients so i think better will be using loop to go by all clients?
ID_Client | Amount | Amount_charge | Lenght | Date_from | Date_to | Reverse | Status
-----------------------------------------------------------------------------------------
111 60 60 12 2015-01-01 2015-01-31 12 A
111 60 60 12 2015-02-01 2015-02-28 11 A
111 60 60 12 2015-03-01 2015-03-31 10 A
111 60 60 12 2015-04-01 2015-04-30 9 A
111 60 60 12 2015-05-01 2015-05-31 8 A
111 60 60 12 2015-06-01 2015-06-30 7 A
111 120 60 12 2015-07-01 2015-07-31 6 D
111 120 60 12 2015-08-01 2015-08-31 5 D
111 120 60 12 2015-09-01 2015-09-30 4 D
111 120 60 12 2015-10-01 2015-10-31 3 D
111 120 60 12 2015-11-01 2015-11-30 2 D
111 120 60 12 2015-12-01 2015-12-31 1 D
111 120 60 12 2016-01-01 2015-01-31 0 E
111 120 60 12 2016-02-01 2015-02-29 0 E
321 60 60 12 2015-01-01 2015-01-31 12 A
321 60 60 12 2015-02-01 2015-02-28 11 A
321 60 60 12 2015-03-01 2015-03-31 10 A
321 60 60 12 2015-04-01 2015-04-30 9 A
321 60 60 12 2015-05-01 2015-05-31 8 A
321 60 60 12 2015-06-01 2015-06-30 7 A
321 60 60 12 2015-07-01 2015-07-31 6 L
321 60 60 12 2015-08-01 2015-08-31 5 L
321 60 60 12 2015-09-01 2015-09-30 4 L
321 60 60 12 2015-10-01 2015-10-31 3 L
321 60 60 12 2015-11-01 2015-11-30 2 L
321 60 60 12 2015-12-01 2015-12-31 1 L
321 60 60 18 2016-01-01 2016-01-30 0 L
321 60 60 18 2016-02-01 2016-02-31 0 L
321 60 60 18 2016-03-01 2016-03-30 0 L
321 60 60 18 2016-04-01 2016-04-31 0 L
If the Reverse column is what I think:
update table1 a
set "Status"=
CASE
WHEN A."Reverse" > 6 THEN
'A'
WHEN A."Reverse" > 0 THEN
DECODE (A."Amount", A."Amount_charge", 'L', 'D')
ELSE
CASE
WHEN A."Amount" <> A."Amount_charge" THEN
'E'
ELSE
CASE WHEN ADD_MONTHS ( (SELECT b."Date_from" FROM table1 b WHERE a."ID_Client" = b."ID_Client" AND b."Reverse" = 1),6) > a."Date_from" THEN 'L'
ELSE
'E'
END
END
END
Better is to calculate the sums. The amount per month come from first payment. Something like this:
DECLARE
CURSOR c2
IS
SELECT ID_CLIENT, --AMOUNT, AMOUNT_CHARGE, LENGTH, DATE_FROM, DATE_TO, REVERSE, STATUS,
FIRST_VALUE (amount_charge) OVER (PARTITION BY id_client ORDER BY date_from) first_amount_charge,
SUM (amount) OVER (PARTITION BY id_client ORDER BY date_from) sum_amount,
SUM (amount_charge) OVER (PARTITION BY id_client ORDER BY date_from) sum_amount_charge
FROM TABLE2
FOR UPDATE NOWAIT;
BEGIN
FOR c1 IN c2
LOOP
UPDATE table2
SET status = CASE WHEN c1.sum_amount <= 6 * c1.first_amount_charge THEN 'A'
WHEN c1.sum_amount > 18 * c1.first_amount_charge THEN 'E'
WHEN c1.sum_amount > c1.sum_amount_charge THEN 'D'
ELSE 'L'
END
WHERE CURRENT OF c2;
END LOOP;
END;

How to extract info based on the latest row

I have two tables:-
TABLE A :-
ORNO DEL PONO QTY
801 123 1 80
801 123 2 60
801 123 3 70
801 151 1 95
801 151 3 75
802 130 1 50
802 130 2 40
802 130 3 30
802 181 2 55
TABLE B:-
ORNO PONO STATUS ITEM
801 1 12 APPLE
801 2 12 ORANGE
801 3 12 MANGO
802 1 22 PEAR
802 2 22 KIWI
802 3 22 MELON
I wish to extract the info based on the latest DEL (in Table A) using SQL. The final output should look like this:-
OUTPUT:-
ORNO PONO STATUS ITEM QTY
801 1 12 APPLE 95
801 2 12 ORANGE 60
801 3 12 MANGO 75
802 1 22 PEAR 50
802 2 22 KIWI 55
802 3 22 MELON 30
Thanks.
select b.*, y.QTY
from
(
select a.ORNO, a.PONO, MAX(a.DEL) [max]
from #tA a
group by a.ORNO, a.PONO
)x
join #tA y on y.ORNO = x.ORNO and y.PONO = x.PONO and y.DEL = x.max
join #tB b on b.ORNO = y.ORNO and b.PONO = y.PONO
Output:
ORNO PONO STATUS ITEM QTY
----------- ----------- ----------- ---------- -----------
801 1 12 APPLE 95
801 2 12 ORANGE 60
801 3 12 MANGO 75
802 1 22 PEAR 50
802 2 22 KIWI 55
802 3 22 MELON 30