Matching two tables using pandas

I have these two tables
One table is called LineMax
OrigNode DestNode DivisionNum Prefix FromMP ToMP Suffix
7764 25961 3 AB 18 20.9
7764 50213 3 AB 18 17.3
7765 35444 3 AB 0 1.5
7841 35444 3 AB 6 1.5
15390 25961 3 AB 23.75 20.9
25961 7764 3 AB 20.9 18
25961 15390 3 AB 20.9 23.75
And I have another data set
OPER_MNT_DIV_CD TRK_CLS_NBR LN_PFX SEG_BGN_MP SEG_END_MP LN_SFX
4 1 362.7 362.71
4 1 362.71 362.83
4 1 362.83 362.98
4 1 362.98 363.35
4 1 363.35 363.4
4 1 363.4 363.54
4 1 363.54 363.67
4 1 363.67 363.81
4 1 363.81 363.95
4 1 363.95 364.1
4 1 364.1 364.15
4 1 364.15 364.5
4 1 364.5 364.55
I am trying to match the two tables. A row from the first table should match a row in the second table when the Prefix, Suffix, and DivisionNum are the same, i.e.:
Prefix = LN_PFX
Suffix = LN_SFX
DivisionNum = OPER_MNT_DIV_CD
I also want the FromMP and ToMP from my first table to be contained within SEG_BGN_MP and SEG_END_MP, like this: SEG_BGN_MP <= FromMP < ToMP <= SEG_END_MP (in the code I take the min/max of FromMP and ToMP, since they can appear in either order).
But I cannot seem to get my code to work. My second table had some whitespace, so I stripped it, and I converted OPER_MNT_DIV_CD from a string to an int to make the comparison easier.
I also stripped all whitespace and upper-cased every string in my first table.
But I still cannot get the matches I want.
import numpy as np
import pandas as pd
import pyodbc

# read the first table from CSV and the second table from the database
x = pyodbc.connect("DSN=DBP1")
table1 = pd.read_csv("LineMaxOrder.csv")
s2 = "select oper_mnt_div_cd, trk_cls_nbr, ln_pfx, seg_bgn_mp, seg_end_mp, ln_sfx, crvtr_mn, crvtr_deg, xstg_elev from dcmctrk.crv_seg where trk_cls_nbr = 1 order by oper_mnt_div_cd, ln_pfx,ln_sfx"
table1 = table1.drop(table1.columns[[0]], axis=1)   # drop the leading index column from the CSV
dChange = pd.read_sql_query(s2, x)

# normalise the key columns of the second table: strip whitespace, upper-case, cast the division to int
dChange["LN_PFX"] = dChange["LN_PFX"].str.strip().str.upper()
dChange["LN_SFX"] = dChange["LN_SFX"].str.strip().str.upper()
dChange["OPER_MNT_DIV_CD"] = dChange["OPER_MNT_DIV_CD"].astype(int)

# normalise every object (string) column of the first table the same way
dfObj2 = table1.select_dtypes(["object"])
table1[dfObj2.columns] = dfObj2.apply(lambda x: x.str.strip().str.upper())
table1 = table1.fillna('')

w = []
for idx, row in table1.iterrows():
    a = row[3]        # DivisionNum
    b = row[4]        # Prefix
    c = row[7]        # Suffix
    agu1 = row[5]     # FromMP
    agu2 = row[6]     # ToMP
    big = max(agu1, agu2)
    small = min(agu1, agu2)
    result = dChange[(dChange["OPER_MNT_DIV_CD"] == a)
                     & (dChange["LN_PFX"] == b)
                     & (dChange["LN_SFX"] == c)]
    if result.empty:
        continue
    # keep only segments that fully contain the [small, big] milepost range
    result = result[(result["SEG_BGN_MP"] <= small) & (result["SEG_END_MP"] >= big)]
    if result.empty:
        continue
    print(result)
    w.append(result)

Use pandas merge to join the two tables (as an inner, left, or right join), linking on the matching fields, and then filter the merged rows on the milepost condition.
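For example, a minimal sketch of that approach, assuming the two cleaned DataFrames are named table1 and dChange as in the question, and that table1 carries the column names shown in the first table above (DivisionNum, Prefix, Suffix, FromMP, ToMP):
import pandas as pd

# inner join on the three key columns replaces the row-by-row lookup
merged = table1.merge(
    dChange,
    how="inner",
    left_on=["DivisionNum", "Prefix", "Suffix"],
    right_on=["OPER_MNT_DIV_CD", "LN_PFX", "LN_SFX"],
)

# milepost containment: the segment must cover both FromMP and ToMP,
# whichever order they happen to be in
lo = merged[["FromMP", "ToMP"]].min(axis=1)
hi = merged[["FromMP", "ToMP"]].max(axis=1)
matches = merged[(merged["SEG_BGN_MP"] <= lo) & (merged["SEG_END_MP"] >= hi)]
The single vectorised filter at the end replaces the per-row milepost check in the loop.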

Related

the 'combine' of a split-apply-combine in pd.groupby() works brilliantly, but I'm not sure why

I have a fragment of code similar to the one below. It works perfectly, but I'm not sure why I am so lucky.
groupby() is a split-apply-combine operation, so I understand why qf.groupby(qf.g).mean() returns a result with two rows, the mean() for each of a and b.
And what's brilliant is that the combine step of qf.groupby(qf.g).cumsum() reassembles all the rows into their original order, as found in the starting df.
My question is, "Why am I able to count on this behavior?" I'm glad I can, but I cannot articulate why it's possible.
#split-apply-combine
import pandas as pd
#DF with a value, and an arbitrary category
qf= pd.DataFrame(data=[x for x in "aaabbaaaab"], columns=['g'])
qf['val'] = [1,2,3,1,2,3,4,5,6,9]
print(f"applying mean() to members in each group of a,b ")
print ( qf.groupby(qf.g).mean() )
print(f"\n\napplying cumsum() to members in each group of a,b ")
print( qf.groupby(qf.g).cumsum() ) #this combines them in the original index order thankfully
qf['running_totals'] = qf.groupby(qf.g).cumsum()
print (f"\n{qf}")
yields:
applying mean() to members in each group of a,b
val
g
a 3.428571
b 4.000000
applying cumsum() to members in each group of a,b
val
0 1
1 3
2 6
3 1
4 3
5 9
6 13
7 18
8 24
9 12
g val running_totals
0 a 1 1
1 a 2 3
2 a 3 6
3 b 1 1
4 b 2 3
5 a 3 9
6 a 4 13
7 a 5 18
8 a 6 24
9 b 9 12
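One way to see why this is dependable (a sketch, not code from the original question): cumsum is a transformation, i.e. it produces one value per input row, and pandas returns transformations aligned to the original index rather than to the group keys. That is the same contract exposed by groupby().transform(), which is why assigning the result back into qf lines up row by row:
import pandas as pd

qf = pd.DataFrame({'g': list("aaabbaaaab"),
                   'val': [1, 2, 3, 1, 2, 3, 4, 5, 6, 9]})

# cumsum is a transformation: the result keeps the original index, one value per input row
by_cumsum = qf.groupby('g')['val'].cumsum()

# the same result via an explicit transform call
by_transform = qf.groupby('g')['val'].transform('cumsum')

print(by_cumsum.equals(by_transform))      # True
print(by_cumsum.index.equals(qf.index))    # True: original row order preserved
In contrast, an aggregation such as mean() returns one row per group, indexed by the group keys, which is why the two calls in the question come back with different shapes.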

pandas duplicate values: Result visual inspection not duplicates

Hello, thanks in advance for all answers; I really appreciate the community's help.
Here is my dataframe, built from a CSV containing data scraped from car classified ads:
Unnamed: 0 NameYear \
0 0 BMW 7 серия, 2007
1 1 BMW X3, 2021
2 2 BMW 2 серия Gran Coupe, 2021
3 3 BMW X5, 2021
4 4 BMW X1, 2021
Price \
0 520 000 ₽
1 от 4 810 000 ₽\n4 960 000 ₽ без скидки
2 2 560 000 ₽
3 от 9 259 800 ₽\n9 974 800 ₽ без скидки
4 от 3 130 000 ₽\n3 220 000 ₽ без скидки
CarParams \
0 187 000 км, AT (445 л.с.), седан, задний, бензин
1 2.0 AT (190 л.с.), внедорожник, полный, дизель
2 1.5 AMT (140 л.с.), седан, передний, бензин
3 3.0 AT (400 л.с.), внедорожник, полный, дизель
4 2.0 AT (192 л.с.), внедорожник, полный, бензин
url
0 https://www.avito.ru/moskva/avtomobili/bmw_7_s...
1 https://www.avito.ru/moskva/avtomobili/bmw_x3_...
2 https://www.avito.ru/moskva/avtomobili/bmw_2_s...
3 https://www.avito.ru/moskva/avtomobili/bmw_x5_...
4 https://www.avito.ru/moskva/avtomobili/bmw_x1_...
THE TASK - I want to know if there are duplicate rows, i.e. if the SAME car advertisement appears twice. The most reliable column is probably url, because it should be unique (CarParams or NameYear can legitimately repeat), so I will check nunique and duplicated on the url column.
(screenshot of the duplicated() result omitted)
THE ISSUE: Visual inspection (sorry for the unprofessional jargon) shows the flagged urls are not the SAME, but I wanted to flag only urls that are exactly identical so I can check for repeated data. I tried setting keep=False as well.
Try:
df.duplicated(subset=["url"], keep=False)
df.duplicated() gives you a pd.Series of bool values.
Here is an example that you could probably use:
from random import randint
import pandas as pd

urls = ['http://www.google.com',
        'http://www.stackoverfow.com',
        'http://bla.xy',
        'http://bla.com']

# build some toy data: each url appears a random number of times
d = []
for i, url in enumerate(urls):
    for j in range(0, randint(1, i + 1)):
        d.append(dict(customer=str(randint(1, 100)), url=url))

df = pd.DataFrame(d)
df['dups'] = df['url'].duplicated(keep=False)   # True for every row whose url occurs more than once
print(df)
resulting in the following df:
customer url dups
0 89 http://www.google.com False
1 43 http://www.stackoverfow.com False
2 36 http://bla.xy True
3 86 http://bla.xy True
4 32 http://bla.com False
The dups column shows you which urls exist more than once; in my example data that is only the url http://bla.xy.
The important thing is to check what the keep parameter does:
keep{‘first’, ‘last’, False}, default ‘first’
Determines which duplicates (if any) to mark.
first : Mark duplicates as True except for the first occurrence.
last : Mark duplicates as True except for the last occurrence.
False : Mark all duplicates as True.
In my case I used False to get all duplicated values.
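Applied to the question's frame, a short sketch (assuming the DataFrame is named df and has the url column shown above): keep=False marks every member of a duplicate group, so the mask can be used directly to pull out the repeated adverts for inspection:
# rows whose url occurs more than once, grouped together for easier comparison
dupes = df[df.duplicated(subset=['url'], keep=False)].sort_values('url')
print(dupes)

# quick sanity check: number of distinct urls vs number of rows
print(df['url'].nunique(), len(df))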

How do I plot multiple time series grouped by different colours?

I am trying to have a single graph displaying 10 series. They are divided into groups, as in this simple example:
            Group A   |   Group B
time(h)   S1  S2  S3  |  S4  S5  S6
0          1   3   1  |   3   4   5
24         2   1   3  |   4   2   1
48         3   2   2  |   1   2   2
How can I put these 6 series on a single graph and categorise their group (A/B) by colour?
Thank you so much!
You can try seaborn:
import seaborn as sns

# assumes df has time(h) as its index and a two-level column MultiIndex of (group, series)
sns.lineplot(data=df.stack(level=[0, 1]).reset_index(name='value'),
             x='time', y='value', hue='level_1', style='level_2')
And you would get a single plot where colour encodes the group and line style distinguishes each series (the original answer's figure is omitted here).
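A self-contained sketch of the same idea (the column MultiIndex layout and the index name 'time' are assumptions made here to match the question's table, not part of the original answer):
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# rebuild the sample data with a (group, series) column MultiIndex and time as the index
cols = pd.MultiIndex.from_tuples([('A', 'S1'), ('A', 'S2'), ('A', 'S3'),
                                  ('B', 'S4'), ('B', 'S5'), ('B', 'S6')])
df = pd.DataFrame([[1, 3, 1, 3, 4, 5],
                   [2, 1, 3, 4, 2, 1],
                   [3, 2, 2, 1, 2, 2]],
                  index=pd.Index([0, 24, 48], name='time'),
                  columns=cols)

# reshape to long form: one row per (time, group, series) observation
long = df.stack(level=[0, 1]).reset_index(name='value')
long.columns = ['time', 'group', 'series', 'value']

# colour encodes the group, line style distinguishes the individual series
sns.lineplot(data=long, x='time', y='value', hue='group', style='series')
plt.show()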

iterrows() of 2 columns and save results in one column

In my data frame I want to iterrows() over two columns but save the result in one column. For example, df is:
x y
5 10
30 445
70 32
expected output is
points sequence
5 1
10 2
30 1
445 2
I know about iterrows(), but it saves the output in two different columns. How can I get the expected output, and is there any way to generate the sequence number according to a condition? Any help will be appreciated.
First, never use iterrows, because it is really slow.
If you want a 1, 2 sequence based on the number of columns, convert the values to a numpy array with DataFrame.to_numpy and flatten with numpy.ravel, then build the sequence with numpy.tile:
import numpy as np
import pandas as pd

df = pd.DataFrame({'points': df.to_numpy().ravel(),        # flatten row by row: x, y, x, y, ...
                   'sequence': np.tile([1, 2], len(df))})  # repeat 1, 2 once per original row
print (df)
points sequence
0 5 1
1 10 2
2 30 1
3 445 2
4 70 1
5 32 2
Or do it this way (this rebuilds a DataFrame from the rows yielded by iterrows()):
>>> pd.DataFrame([i[1] for i in df.iterrows()])
points sequence
0 5 1
1 10 2
2 30 1
3 445 2
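As an alternative sketch (not from the original answers), the same reshape can be done with DataFrame.stack, deriving the sequence number from the source column instead of tiling it; the column names 'x' and 'y' are taken from the question's example:
import pandas as pd

df = pd.DataFrame({'x': [5, 30, 70], 'y': [10, 445, 32]})

# stack() produces one row per (row, column) pair, in row-major order
long = df.stack().rename('points').reset_index()
long['sequence'] = long['level_1'].map({'x': 1, 'y': 2})   # sequence derived from the original column
out = long[['points', 'sequence']]
print(out)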

Pandas: keep the first three rows containing a value for each unique value [duplicate]

Suppose I have pandas DataFrame like this:
df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4], 'value':[1,2,3,1,2,3,4,1,1]})
which looks like:
id value
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 2 3
6 2 4
7 3 1
8 4 1
I want to get a new DataFrame with top 2 records for each id, like this:
id value
0 1 1
1 1 2
3 2 1
4 2 2
7 3 1
8 4 1
I can do it with numbering records within group after groupby:
dfN = df.groupby('id').apply(lambda x:x['value'].reset_index()).reset_index()
which looks like:
id level_1 index value
0 1 0 0 1
1 1 1 1 2
2 1 2 2 3
3 2 0 3 1
4 2 1 4 2
5 2 2 5 3
6 2 3 6 4
7 3 0 7 1
8 4 0 8 1
then for the desired output:
dfN[dfN['level_1'] <= 1][['id', 'value']]
Output:
id value
0 1 1
1 1 2
3 2 1
4 2 2
7 3 1
8 4 1
But is there a more effective/elegant approach to do this? Also, is there a more elegant way to number records within each group (like the SQL window function row_number())?
Did you try
df.groupby('id').head(2)
Output generated:
id value
id
1 0 1 1
1 1 2
2 3 2 1
4 2 2
3 7 3 1
4 8 4 1
(Keep in mind that you might need to order/sort before, depending on your data)
EDIT: As mentioned by the questioner, use
df.groupby('id').head(2).reset_index(drop=True)
to remove the MultiIndex and flatten the results:
id value
0 1 1
1 1 2
2 2 1
3 2 2
4 3 1
5 4 1
Since 0.14.1, you can now do nlargest and nsmallest on a groupby object:
In [23]: df.groupby('id')['value'].nlargest(2)
Out[23]:
id
1 2 3
1 2
2 6 4
5 3
3 7 1
4 8 1
dtype: int64
There's a slight weirdness that you get the original index in there as well, but this might be really useful depending on what your original index was.
If you're not interested in it, you can do .reset_index(level=1, drop=True) to get rid of it altogether.
(Note: From 0.17.1 you'll be able to do this on a DataFrameGroupBy too but for now it only works with Series and SeriesGroupBy.)
Sometimes sorting the whole data set ahead of time is very time consuming.
We can group by first and take the top k within each group:
topk = 2
g = df.groupby(['id']).apply(lambda x: x.nlargest(topk, ['value'])).reset_index(drop=True)
df.groupby('id').apply(lambda x : x.sort_values(by = 'value', ascending = False).head(2).reset_index(drop = True))
Here, sort_values with ascending=False behaves like nlargest, and ascending=True behaves like nsmallest.
The value passed to head() plays the same role as the value passed to nlargest: the number of rows to keep per group.
reset_index is optional.
This also works when there are duplicated values.
If you have duplicated values among the top-n values and want only unique values, you can do it like this:
import pandas as pd
ifile = "https://raw.githubusercontent.com/bhishanpdl/Shared/master/data/twitter_employee.tsv"
df = pd.read_csv(ifile,delimiter='\t')
print(df.query("department == 'Audit'")[['id','first_name','last_name','department','salary']])
id first_name last_name department salary
24 12 Shandler Bing Audit 110000
25 14 Jason Tom Audit 100000
26 16 Celine Anston Audit 100000
27 15 Michale Jackson Audit 70000
If we do not remove duplicates, the top 3 salaries for the Audit department are 110k, 100k and 100k.
If we want unique salaries per department, we can do this:
(df.groupby('department')['salary']
.apply(lambda ser: ser.drop_duplicates().nlargest(3))
.droplevel(level=1)
.sort_index()
.reset_index()
)
This gives
department salary
0 Audit 110000
1 Audit 100000
2 Audit 70000
3 Management 250000
4 Management 200000
5 Management 150000
6 Sales 220000
7 Sales 200000
8 Sales 150000
To get the first N rows of each group, another way is via groupby().nth[:N]. The outcome of this call is the same as groupby().head(N). For example, for the top-2 rows for each id, call:
N = 2
df1 = df.groupby('id', as_index=False).nth[:N]
To get the largest N values of each group, I suggest two approaches.
First sort by "id" and "value" (make sure to sort "id" in ascending order and "value" in descending order by using the ascending parameter appropriately) and then call groupby().nth[].
N = 2
df1 = df.sort_values(by=['id', 'value'], ascending=[True, False])
df1 = df1.groupby('id', as_index=False).nth[:N]
Another approach is to rank the values of each group and filter using these ranks.
# for the entire rows
N = 2
msk = df.groupby('id')['value'].rank(method='first', ascending=False) <= N
df1 = df[msk]
# for specific column rows
df1 = df.loc[msk, 'value']
Both of these are much faster than the groupby().apply() and groupby().nlargest() calls suggested in the other answers here (1, 2, 3). On a sample with 100k rows and 8000 groups, a %timeit test showed they were 24-150 times faster than those solutions.
Also, instead of slicing, you can pass a list/tuple/range to a .nth() call:
df.groupby('id', as_index=False).nth([0,1])
# doesn't even have to be consecutive
# the following returns 1st and 3rd row of each id
df.groupby('id', as_index=False).nth([0,2])
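Regarding the second part of the question (an equivalent of the SQL window function row_number()), here is a minimal sketch not taken from the answers above: groupby().cumcount() numbers the rows within each group, and filtering on that number gives the same top-2-per-id result, provided the rows are already in the order you want:
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 4],
                   'value': [1, 2, 3, 1, 2, 3, 4, 1, 1]})

# cumcount() is 0-based, so it plays the role of row_number() - 1 within each id
df['rn'] = df.groupby('id').cumcount()
top2 = df[df['rn'] < 2].drop(columns='rn')
print(top2)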