pandas duplicate values: visual inspection shows the flagged rows are not duplicates

Hello, thanks in advance for all answers; I really appreciate the community's help.
Here is my dataframe, from a CSV containing data scraped from car classified ads:
   Unnamed: 0                      NameYear  \
0           0             BMW 7 серия, 2007
1           1                  BMW X3, 2021
2           2  BMW 2 серия Gran Coupe, 2021
3           3                  BMW X5, 2021
4           4                  BMW X1, 2021
                                    Price  \
0                               520 000 ₽
1  от 4 810 000 ₽\n4 960 000 ₽ без скидки
2                             2 560 000 ₽
3  от 9 259 800 ₽\n9 974 800 ₽ без скидки
4  от 3 130 000 ₽\n3 220 000 ₽ без скидки
                                          CarParams  \
0  187 000 км, AT (445 л.с.), седан, задний, бензин
1    2.0 AT (190 л.с.), внедорожник, полный, дизель
2       1.5 AMT (140 л.с.), седан, передний, бензин
3    3.0 AT (400 л.с.), внедорожник, полный, дизель
4    2.0 AT (192 л.с.), внедорожник, полный, бензин
                                                 url
0  https://www.avito.ru/moskva/avtomobili/bmw_7_s...
1  https://www.avito.ru/moskva/avtomobili/bmw_x3_...
2  https://www.avito.ru/moskva/avtomobili/bmw_2_s...
3  https://www.avito.ru/moskva/avtomobili/bmw_x5_...
4  https://www.avito.ru/moskva/avtomobili/bmw_x1_...
THE TASK: I want to know whether there are duplicate rows, i.e. whether the SAME car advertisement appears twice. The most reliable key is probably url, because it should be unique, while CarParams or NameYear can repeat, so I will check nunique and duplicated on the url column.
Here is a screenshot to visually inspect the result of duplicated:
THE ISSUE: visual inspection (sorry for the unprofessional jargon) shows these urls are not the SAME, but I wanted to flag only exactly identical urls so I can check for repeated data. I tried setting keep=False as well.

Try:
df.duplicated(subset=["url"], keep=False)

df.duplicated() gives you a pd.Series of bool values.
Here is an example that you could probably adapt:
from random import randint
import pandas as pd

urls = ['http://www.google.com',
        'http://www.stackoverfow.com',
        'http://bla.xy',
        'http://bla.com']
d = []
for i, url in enumerate(urls):
    for j in range(0, randint(1, i + 1)):
        d.append(dict(customer=str(randint(1, 100)), url=url))
df = pd.DataFrame(d)
df['dups'] = df['url'].duplicated(keep=False)
print(df)
resulting in the following df:
customer url dups
0 89 http://www.google.com False
1 43 http://www.stackoverfow.com False
2 36 http://bla.xy True
3 86 http://bla.xy True
4 32 http://bla.com False
The dups column shows you which urls exist more than once. In my example data, that is only the url http://bla.xy.
The important thing is to check what the parameter keep does:
keep{‘first’, ‘last’, False}, default ‘first’
Determines which duplicates (if any) to mark.
first : Mark duplicates as True except for the first occurrence.
last : Mark duplicates as True except for the last occurrence.
False : Mark all duplicates as True.
In my case I used keep=False to mark all duplicated values.
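If you also want to look at the duplicated rows themselves, not just the boolean mask, you can index the frame with that mask. A small sketch with made-up urls and prices:

```python
import pandas as pd

# Made-up data: url "a" appears twice, "b" and "c" once.
df = pd.DataFrame({"url": ["a", "b", "a", "c"],
                   "price": [100, 200, 100, 300]})

# keep=False marks every member of a duplicate group, so boolean indexing
# returns all rows sharing a url, convenient for side-by-side inspection.
dups = df[df.duplicated(subset=["url"], keep=False)]
print(dups)
```

df["url"].value_counts() gives the same information as occurrence counts per url.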

Related

Kronecker product over the rows of a pandas dataframe

So I have these two dataframes and I would like to get a new dataframe which consists of the Kronecker product of the rows of the two dataframes. What is the correct way to do this?
As an example:
DataFrame1
c1 c2
0 10 100
1 11 110
2 12 120
and
DataFrame2
a1 a2
0 5 7
1 1 10
2 2 4
Then I would like to have the following matrix:
c1a1 c1a2 c2a1 c2a2
0 50 70 500 700
1 11 110 110 1100
2 24 48 240 480
I hope my question is clear.
PS. I saw this question was posted here: kronecker product pandas dataframes. However, the answer given there is not the correct answer (perhaps not to the original question either, but definitely not to mine). The answer there gives the Kronecker product of the whole dataframes, but I only want it over the rows.
Create a MultiIndex with MultiIndex.from_product, align both DataFrames to it with DataFrame.reindex, multiply the DataFrames, and finally flatten the MultiIndex:
c = pd.MultiIndex.from_product([df1, df2])
df = df1.reindex(c, axis=1, level=0).mul(df2.reindex(c, axis=1, level=1))
df.columns = df.columns.map(lambda x: f'{x[0]}{x[1]}')
print(df)
c1a1 c1a2 c2a1 c2a2
0 50 70 500 700
1 11 110 110 1100
2 24 48 240 480
Use numpy for efficiency:
import numpy as np

pd.DataFrame(np.einsum('nk,nl->nkl', df1, df2).reshape(df1.shape[0], -1),
             columns=pd.MultiIndex.from_product([df1, df2]).map(''.join))
Output:
c1a1 c1a2 c2a1 c2a2
0 50 70 500 700
1 11 110 110 1100
2 24 48 240 480
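For reference, a fully self-contained version of the einsum approach, using the question's sample data so it can be run directly:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"c1": [10, 11, 12], "c2": [100, 110, 120]})
df2 = pd.DataFrame({"a1": [5, 1, 2], "a2": [7, 10, 4]})

# Row-wise Kronecker product: the outer product of each pair of rows,
# flattened into one row of the result.
out = pd.DataFrame(
    np.einsum('nk,nl->nkl', df1, df2).reshape(df1.shape[0], -1),
    columns=pd.MultiIndex.from_product([df1, df2]).map(''.join),
)
print(out)
```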

Concatenate labels to an existing dataframe

I want to use a list of names "headers" to create a new column in my dataframe. In the initial table, the name of each division is positioned above the results for each team in that division. I want to add that header to each row entry for its division to make the data more identifiable, like this. I have the headers stored in the "headers" object in my code. How can I repeat each division header for the number of rows that appear in that division and append it to the dataset?
Edit: here is another snippet of what I want the get from the end product.
df3 = df.iloc[0:6]
df3.insert(0, 'Divisions', ['na', 'L5 Junior', 'L5 Junior', 'na',
                            'L5 Senior - Medium', 'L5 Senior - Medium'])
df3
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests

# Import HTML
scr = ('https://tv.varsity.com/results/7361971-2022-spirit-unlimited-'
       'battle-at-the-boardwalk-atlantic-city-grand-ntls/31220')
scr1 = requests.get(scr)
soup = BeautifulSoup(scr1.text, "html.parser")

# List of names to append
sp3 = soup.find(class_="full-content").find_all("h2")
headers = [elt.text for elt in sp3]
table_MN = pd.read_html(scr)

# Extract text and header from division info
div = pd.DataFrame(headers)
div.columns = ["division"]
df = pd.concat(table_MN, ignore_index=True)
df.columns = df.iloc[0]
df
It is still not clear what output you are looking for. However, may I suggest the following, which selects the common headers from the tables in table_MN and then concatenates the results. If it is going in the right direction, please let me know, and indicate what else you want to extract from the resulting table:
tmn_1 = [tbl.T.set_index(0).T for tbl in table_MN]
pd.concat(tmn_1, axis=0, ignore_index=True)
output:
Rank Program Name Team Name Raw Score Deductions Performance Score Event Score
-- ------ --------------------------- ----------------- ----------- ------------ ------------------- -------------
0 1 Rockstar Cheer New Jersey City Girls 47.8667 0 95.7333 95.6833
1 2 Cheer Factor Xtraordinary 46.6667 0.15 93.1833 92.8541
2 1 Rockstar Cheer New Jersey City Girls 47.7667 0 95.5333 23.8833
3 2 Cheer Factor Xtraordinary 46.0333 0.2 91.8667 22.9667
4 1 Star Athletics Roar 47.5333 0.9 94.1667 93.9959
5 1 Prime Time All Stars Lady Onyx 43.9 1.35 86.45 86.6958
6 1 Prime Time All Stars Lady Onyx 44.1667 0.9 87.4333 21.8583
7 1 Just Cheer All Stars Jag 5 46.4333 0.15 92.7167 92.2875
8 1 Just Cheer All Stars Jag 5 45.8 0.6 91 22.75
9 1 Quest Athletics Black Ops 47.4333 0.45 94.4167 93.725
10 1 Quest Athletics Black Ops 46.5 1.35 91.65 22.9125
11 1 The Stingray Allstars X-Rays 45.3 0.95 89.65 88.4375
12 1 Vortex Allstars Lady Rays 45.7 0.5 90.9 91.1083
13 1 Vortex Allstars Lady Rays 45.8667 0 91.7333 22.9333
14 1 Upper Merion All Stars Citrus 46.4333 0 92.8667 92.7
15 2 Cheer Factor JUNIOR X 45.9 1.1 90.7 90.6542
16 3 NJ Premier All Stars Prodigy 44.6333 0.05 89.2167 89.8292
17 1 Upper Merion All Stars Citrus 46.1 0 92.2 23.05
18 2 NJ Premier All Stars Prodigy 45.8333 0 91.6667 22.9167
19 3 Cheer Factor JUNIOR X 45.7333 0.95 90.5167 22.6292
20 1 Virginia Royalty Athletics Dynasty 46.5 0 93 92.9
21 1 Virginia Royalty Athletics Dynasty 46.3 0 92.6 23.15
22 1 South Jersey Storm Lady Reign 47.2333 0 94.4667 93.4875
...
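To actually get the division label onto every row (the original ask), one sketch, assuming headers and table_MN line up one-to-one, is to attach each header to its table before concatenating. The values below are made-up stand-ins for the scraped data:

```python
import pandas as pd

# Hypothetical stand-ins: one <h2> division header per results table.
headers = ["L5 Junior", "L5 Senior - Medium"]
table_MN = [
    pd.DataFrame({"Rank": [1, 2], "Team Name": ["City Girls", "Xtraordinary"]}),
    pd.DataFrame({"Rank": [1], "Team Name": ["Roar"]}),
]

# assign() broadcasts the scalar header down every row of its table,
# so after concat each row carries its division name.
labeled = [tbl.assign(Division=name) for name, tbl in zip(headers, table_MN)]
df = pd.concat(labeled, ignore_index=True)
print(df)
```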

Matching two tables using pandas

I have these two tables
One table is called LineMax
OrigNode DestNode DivisionNum Prefix FromMP ToMP Suffix
7764 25961 3 AB 18 20.9
7764 50213 3 AB 18 17.3
7765 35444 3 AB 0 1.5
7841 35444 3 AB 6 1.5
15390 25961 3 AB 23.75 20.9
25961 7764 3 AB 20.9 18
25961 15390 3 AB 20.9 23.75
And I have another data set
OPER_MNT_DIV_CD TRK_CLS_NBR LN_PFX SEG_BGN_MP SEG_END_MP LN_SFX
4 1 362.7 362.71
4 1 362.71 362.83
4 1 362.83 362.98
4 1 362.98 363.35
4 1 363.35 363.4
4 1 363.4 363.54
4 1 363.54 363.67
4 1 363.67 363.81
4 1 363.81 363.95
4 1 363.95 364.1
4 1 364.1 364.15
4 1 364.15 364.5
4 1 364.5 364.55
I am trying to match my data. Basically, a match in the second table should have the same Prefix, Suffix, and Division Number:
Prefix = LN_PFX
Suffix = LN_SFX
DivisionNum = OPER_MNT_DIV_CD
I also want FromMP and ToMP to be contained within SEG_BGN_MP and SEG_END_MP, i.e. SEG_BGN_MP <= FromMP < ToMP <= SEG_END_MP.
But I cannot seem to get my code to work. My second table had some whitespace, so I removed it, and I converted OPER_MNT_DIV_CD from a string to an int to make comparison easier.
I also removed all the whitespace and upper-cased every string in my first table.
But I cannot seem to get the matches I want.
import numpy as np
import pandas as pd
import pyodbc
import math

x = pyodbc.connect("DSN=DBP1")
table1 = pd.read_csv("LineMaxOrder.csv")
s2 = ("select oper_mnt_div_cd, trk_cls_nbr, ln_pfx, seg_bgn_mp, seg_end_mp, "
      "ln_sfx, crvtr_mn, crvtr_deg, xstg_elev from dcmctrk.crv_seg "
      "where trk_cls_nbr = 1 order by oper_mnt_div_cd, ln_pfx, ln_sfx")
table1 = table1.drop(table1.columns[[0]], axis=1)
dChange = pd.read_sql_query(s2, x)
dChange["LN_PFX"] = dChange["LN_PFX"].str.strip()
dChange["LN_PFX"] = dChange["LN_PFX"].str.upper()
dChange["LN_SFX"] = dChange["LN_SFX"].str.strip()
dChange["LN_SFX"] = dChange["LN_SFX"].str.upper()
dChange["OPER_MNT_DIV_CD"] = dChange["OPER_MNT_DIV_CD"].astype(int)
dfObj2 = table1.select_dtypes(["object"])
table1[dfObj2.columns] = dfObj2.apply(lambda x: x.str.strip())
table1[dfObj2.columns] = dfObj2.apply(lambda x: x.str.upper())
table1 = table1.fillna('')
w = []
for idx, row in table1.iterrows():
    a = row[3]
    b = row[4]
    c = row[7]
    agu1 = row[5]
    agu2 = row[6]
    big = max(agu1, agu2)
    small = min(agu1, agu2)
    result = dChange[(dChange["OPER_MNT_DIV_CD"] == a)
                     & (dChange["LN_PFX"] == b)
                     & (dChange["LN_SFX"] == c)]
    if result.empty:
        continue
    # the filtered frame must be assigned back, otherwise the filter is discarded
    result = result[(result["SEG_BGN_MP"] <= small) & (result["SEG_END_MP"] >= big)]
    if not result.empty:
        print(result)
        w.append(result)
Use pandas merge to join the two tables (as an inner, left, or right join) on the matching key fields, then filter on the milepost condition.
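A minimal sketch of that idea: equi-join on the three key columns, then keep rows where the (possibly reversed) FromMP/ToMP span lies inside the segment. Column names follow the question; the milepost values are invented:

```python
import pandas as pd

line_max = pd.DataFrame({
    "DivisionNum": [3, 3],
    "Prefix": ["AB", "AB"],
    "Suffix": ["", ""],
    "FromMP": [18.0, 0.0],
    "ToMP": [20.9, 1.5],
})
crv_seg = pd.DataFrame({
    "OPER_MNT_DIV_CD": [3, 3, 4],
    "LN_PFX": ["AB", "AB", ""],
    "LN_SFX": ["", "", ""],
    "SEG_BGN_MP": [17.0, 0.0, 362.7],
    "SEG_END_MP": [21.0, 2.0, 362.71],
})

# Inner join on the matching key fields.
m = line_max.merge(crv_seg,
                   left_on=["DivisionNum", "Prefix", "Suffix"],
                   right_on=["OPER_MNT_DIV_CD", "LN_PFX", "LN_SFX"])

# FromMP can be larger than ToMP in the source data, so order the span first.
lo = m[["FromMP", "ToMP"]].min(axis=1)
hi = m[["FromMP", "ToMP"]].max(axis=1)
matches = m[(m["SEG_BGN_MP"] <= lo) & (hi <= m["SEG_END_MP"])]
print(matches)
```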

Divide dataframe in different bins based on condition

I have a pandas dataframe:
id no_of_rows
1 2689
2 1515
3 3826
4 814
5 1650
6 2292
7 1867
8 2096
9 1618
10 923
11 766
12 191
I want to divide the ids into 5 different bins based on their number of rows,
such that every bin ends up with approximately the same total number of rows,
and assign the bin as a new column bin.
One approach I thought of was:
df.no_of_rows.sum() = 20247
div_factor = 20247 // 5 = 4049
If we add the 1st and 2nd rows, the sum = 2689 + 1515 = 4204 > div_factor,
so assign bin = 1 where id = 1. Now look for the next ones:
id no_of_rows bin
1 2689 1
2 1515 2
3 3826 3
4 814 4
5 1650 4
6 2292 5
7 1867
8 2096
9 1618
10 923
11 766
12 191
But this method proved wrong.
Is there a way to create 5 bins such that every bin gets a good (approximately equal) number of rows?
You can use an approach based on cumulative percentiles:
n_bins = 5
cum = df['no_of_rows'].sort_values().cumsum()
# clamp, so the row holding the maximum lands in the last bin instead of a 6th one
df['bin'] = cum.apply(lambda x: min(n_bins - 1, int(n_bins * x / cum.max())))
And then you can check with
df.groupby('bin').sum()
The more records you have, the more evenly the bin totals will be spread.
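A self-contained sketch of the cumulative-sum idea on the question's data, clamping so the largest cumulative value falls in the last bin (the 0–4 bin labels here are my choice):

```python
import pandas as pd

df = pd.DataFrame({
    "id": range(1, 13),
    "no_of_rows": [2689, 1515, 3826, 814, 1650, 2292,
                   1867, 2096, 1618, 923, 766, 191],
})

n_bins = 5
# Cumulative total after sorting; each id's bin is its cumulative share
# of the grand total, scaled to n_bins and clamped into the last bin.
cum = df["no_of_rows"].sort_values().cumsum()
df["bin"] = cum.apply(lambda x: min(n_bins - 1, int(n_bins * x / cum.max())))
print(df.groupby("bin")["no_of_rows"].sum())
```

The assignment aligns by index label, so the sorted order of cum does not scramble the rows.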

Using SPSS Reference Variables and an Index to Create a New Variable

Essentially, I have a log which contains a Unique identifier for a subject which is tracked through multiple cases. I then used the following code, suggested previously through the great community here, to create an Index. Unfortunately, I've run into a new challenge that I can't seem to figure out. Here is a sample of the current data set to provide perspective.
Indexing function
sort cases by Unique_Modifier.
if $casenum=1 or Unique_Modifier<>lag(Unique_Modifier) Index=1.
if Unique_Modifier=lag(Unique_Modifier) Index=lag(Index)+1.
format Index(f2).
execute.
Unique Identifier Index Variable of interest
A 1 101
A 2 101
A 3 607
A 4 607
A 5 101
A 6 101
B 1 108
B 2 210
C 1 610
C 2 987
C 3 1100
C 4 610
What I'd like to do is create a new variable which contains the number of discrete, different entries in the variable of interest column. The expected output would be as the following:
Unique Identifier Index Variable of interest Intended Output
A 1 101 1
A 2 101 1
A 3 607 2
A 4 607 2
A 5 101 2
A 6 101 2
B 1 108 1
B 2 210 2
C 1 610 1
C 2 987 2
C 3 1100 3
C 4 610 3
I've tried a few different ways to do it. One was to use a similar index function, which works when the variable of interest changes on subsequent lines, but sometimes a value recurs, say, 5 lines later. My next idea was to use the AGGREGATE function, but I looked through the IBM manual and there doesn't seem to be a function within AGGREGATE that would produce the intended output. Anyone have any ideas? I think a loop is the best bet, but loops in SPSS are a bit funky and hard to get working.
Try this:
data list list/Unique_Identifier Index VOI (3f) .
begin data.
1 1 101
1 2 101
1 3 607
1 4 607
1 5 101
1 6 101
2 1 108
2 2 210
3 1 610
3 2 987
3 3 1100
3 4 610
end data.
string voiT (a1000).
compute voiT=concat(ltrim(string(VOI,f10)),",").
compute Intended_Output=1.
do if index>1.
    do if index(lag(voiT), rtrim(voiT))>0.
        compute Intended_Output=lag(Intended_Output).
        compute voiT=lag(voiT).
    else.
        compute Intended_Output=lag(Intended_Output)+1.
        compute voiT=concat(rtrim(lag(voiT)), rtrim(voiT)).
    end if.
end if.
exe.
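This thread is otherwise pandas-flavoured, so for comparison, the same running count of distinct values per identifier can be sketched in pandas (the column names uid/voi are mine):

```python
import pandas as pd

df = pd.DataFrame({
    "uid": ["A"] * 6 + ["B"] * 2 + ["C"] * 4,
    "voi": [101, 101, 607, 607, 101, 101, 108, 210, 610, 987, 1100, 610],
})

# Mark the first occurrence of each (uid, voi) pair, then take the
# running total of those marks within each uid.
first_seen = ~df.duplicated(subset=["uid", "voi"])
df["intended_output"] = first_seen.astype(int).groupby(df["uid"]).cumsum()
print(df)
```

Unlike a substring search on a concatenated history, duplicated() compares whole values, so 101 is never confused with 1101.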