Pandas: sorting by integer notation value - pandas

In this dataframe, column key values correspond to integer notation of each song key.
df
track key
0 Last Resort 4
1 Casimir Pulaski Day 8
2 Glass Eyes 8
3 Ohio - Live At Massey Hall 1971 7
4 Ballad of a Thin Man 11
5 Can You Forgive Her? 11
6 The Only Thing 3
7 Goodbye Baby (Baby Goodbye) 4
8 Heart Of Stone 0
9 Ohio 0
10 the gate 2
11 Clampdown 2
12 Cry, Cry, Cry 4
13 What's Happening Brother 8
14 Stupid Girl 11
15 I Don't Wanna Play House 7
16 Inner City Blues (Make Me Wanna Holler) 11
17 The Lonesome Death of Hattie Carroll 4
18 Paint It, Black - (Original Single Mono Version) 5
19 Let Him Run Wild 11
20 Undercover (Of The Night) - Remastered 5
21 Between the Bars 7
22 Like a Rolling Stone 0
23 Once 2
24 Pale Blue Eyes 5
25 The Way You Make Me Feel - 2012 Remaster 1
26 Jeremy 2
27 The Entertainer 7
28 Pressure 9
29 Play With Fire - Mono Version / Remastered 2002 2
30 D-I-V-O-R-C-E 9
31 Big Shot 0
32 What's Going On 1
33 Folsom Prison Blues - Live 0
34 American Woman 1
35 Cocaine Blues - Live 8
36 Jesus, etc. 5
the notation is as follows:
'C' --> 0
'C#'--> 1
'D' --> 2
'Eb'--> 3
'E' --> 4
'F' --> 5
'F#'--> 6
'G' --> 7
'Ab'--> 8
'A' --> 9
'Bb'--> 10
'B' --> 11
what is specific about this notation is that 11 is closer to 0 than 2, for instance.
GOAL:
given an input_notation = 0, I would like to sort according to closeness to key 0, or 'C'.
you can get closest value by doing:
closest_key = (input_notation -1) % 12
so I would like to sort according to this logic, having on top input_notation values and then closest matches, like so:
8 Heart Of Stone 0
9 Ohio 0
22 Like a Rolling Stone 0
31 Big Shot 0
33 Folsom Prison Blues - Live 0
(...)
I have tried:
v = df[['key']].values
df = df.iloc[np.lexsort(np.abs(v - (input_notation - 1) %12 ).T)]
but this does not work..
any clues?

You can define the closeness firstly and then use argsort with iloc to sort the data frame:
input_notation = 0
# define the closeness or distance
diff = (df.key - input_notation).abs()
closeness = np.minimum(diff, 12 - diff)
# use argsort to calculate the sorting index, and iloc to reorder the data frame
closest_to_input = df.iloc[closeness.argsort(kind='mergesort')]
closest_to_input.head()
# track key
#8 Heart Of Stone 0
#9 Ohio 0
#22 Like a Rolling Stone 0
#31 Big Shot 0
#33 Folsom Prison Blues - Live 0

Related

How to extract car model name from the car dataset?

Can anyone help me to extact the car model names from the following sample dataframe?
index,Make,Model,Price,Year,Kilometer,Fuel Type,Transmission,Location,Color,Owner,Seller Type
0,Honda,Amaze 1.2 VX i-VTEC,505000,2017,87150,Petrol,Manual,Pune,Grey,First,Corporate
1,Maruti Suzuki,Swift DZire VDI,450000,2014,75000,Diesel,Manual,Ludhiana,White,Second,Individual
2,Hyundai,i10 Magna 1.2 Kappa2,220000,2011,67000,Petrol,Manual,Lucknow,Maroon,First,Individual
3,Toyota,Glanza G,799000,2019,37500,Petrol,Manual,Mangalore,Red,First,Individual
I have used this code :
model_name = df['Model'].str.extract(r'(\w+)')
How ever, i'm unable to get the car names which has names such as WR-V, CR-V ( or which has space or hyfen in between the names)
This is the detailed link of the dataset:https://www.kaggle.com/datasets/nehalbirla/vehicle-dataset-from-cardekho?select=car+details+v4.csv
Desired output should be:
index,0
0,Amaze
1,Swift
2,i10
3,Glanza
4,Innova
5,Ciaz
6,CLA
7,X1 xDrive20d
8,Octavia
9,Terrano
10,Elite
11,Kwid
12,Ciaz
13,Harrier
14,Polo
15,Celerio
16,Alto
17,Baleno
18,Wagon
19,Creta
20,S-Presso
21,Vento
22,Santro
23,Venue
24,Alto
25,Ritz
26,Creta
27,Brio
28,Elite
29,WR-V
30,Venue
Please help me!!
The exact logic is unclear, but assuming you want the first word (including special characters) or the first two words if the first word has only one or two characters:
df['Model'].str.extract(r'(\S{3,}|\S{1,2}\s+\S+)', expand=False)
Output:
0 Amaze
1 Swift
2 i10
3 Glanza
4 Innova
5 Ciaz
6 CLA
7 X1 xDrive20d
8 Octavia
9 Terrano
10 Elite
11 Kwid
12 Ciaz
13 Harrier
14 Polo
15 Celerio
16 Alto
17 Baleno
18 Wagon
19 Creta
20 S-Presso
21 Vento
22 Santro
23 Venue
24 Alto
25 Ritz
26 Creta
27 Brio
28 Elite
29 WR-V
... ...
Name: Model, dtype: object

compare value in two rows in a column pandas

I have a pandas df something like this:
color pct days text
1 red 5 7 good
2 red 10 30 good
3 red 11 60 bad
4 blue 6 7 bad
5 blue 15 30 good
6 blue 21 60 bad
7 yellow 2 7 good
8 yellow 5 30 bad
9 yellow 7 60 bad
So basically, for each color, I have percentage values for 7 days, 30 days and 60 days. Please note that these are not always in correct order as I gave in example above. My task now is to look at the change in percentage for each color between the consecutive days values and if the change is greater or equal to 5%, then write in column "text" as "NA". Text in days 7 category is default and cannot be overwritten.
Desired result:
color pct days text
1 red 5 7 good
2 red 10 30 NA
3 red 11 60 bad
4 blue 6 7 bad
5 blue 15 30 NA
6 blue 21 60 NA
7 yellow 2 7 good
8 yellow 5 30 bad
9 yellow 7 60 bad
I am able to achieve this by a very very long process that I am very sure is not efficient. I am sure there is a much better way of doing this, but I am new to python, so struggling. Can someone please help me with this? Many thanks in advance
A variation on a (now-deleted) suggested answer as comment:
# ensure numeric data
df['pct'] = pd.to_numeric(df['pct'], errors='coerce')
df['days'] = pd.to_numeric(df['days'], errors='coerce')
# update in place
df.loc[df.sort_values(['color','days'])
.groupby('color')['pct']
.diff().ge(5), 'text'] = 'NA'
Output:
color pct days text
1 red 5 7 good
2 red 10 30 NA
3 red 11 60 bad
4 blue 6 7 bad
5 blue 15 30 NA
6 blue 21 60 NA
7 yellow 2 7 good
8 yellow 5 30 bad
9 yellow 7 60 bad
In the code below I'm reading your example table into a pandas dataframe using io, you don't need to do this, you already have your pandas table.
import pandas as pd
import io
df = pd.read_csv(io.StringIO(
""" color pct days text
1 red 5 7 good
2 red 10 30 good
3 red 11 60 bad
4 blue 6 7 bad
5 blue 15 30 good
6 blue 21 60 bad
7 yellow 2 7 good
8 yellow 5 30 bad
9 yellow 7 60 bad"""
),delim_whitespace=True)
not_seven_rows = df['days'].ne(7)
good_rows = df['pct'].lt(5)
#Set the rows which are < 5 and not 7 days to be 'good'
df.loc[good_rows & not_seven_rows, 'text'] = 'good'
#Set the rows which are >= 5 and not 7 days to be 'NA'
df.loc[(~good_rows) & not_seven_rows, 'text'] = 'NA'
df
Output
def function1(dd:pd.DataFrame):
dd1=dd.sort_values("days")
return dd1.assign(text=np.where(dd1.pct.diff()>=5,"NA",dd1.text))
df1.groupby('color',sort=False).apply(function1).reset_index(drop=True)
out
color pct days text
0 red 5 7 good
1 red 10 30 NA
2 red 11 60 bad
3 blue 6 7 bad
4 blue 15 30 NA
5 blue 21 60 NA
6 yellow 2 7 good
7 yellow 5 30 bad
8 yellow 7 60 bad

How to reshape, group by and rename Julia dataframe?

I have the following DataFrame :
Police Product PV1 PV2 PV3 PM1 PM2 PM3
0 1 AA 10 8 14 150 145 140
1 2 AB 25 4 7 700 650 620
2 3 AA 13 22 5 120 80 60
3 4 AA 12 6 12 250 170 120
4 5 AB 10 13 5 500 430 350
5 6 BC 7 21 12 1200 1000 900
PV1 is the item PV for year 1, PV2 for year 2, ....
I would like to combine reshaping and group by operations + some renaming stuffs to obtain the DF below :
Product Item Year1 Year2 Year3
0 AA PV 35 36 31
1 AA PM 520 395 320
2 AB PV 35 17 12
3 AB PM 1200 1080 970
4 BC PV 7 21 12
5 BC PM 1200 1000 900
It makes a group by operation on product name and reshape the DF to pass the item as a column and put the sum of each in new columns years.
I found a way to do it in Python but I am now looking for a solution passing my code in Julia.
No problem for the groupby operation, but I have more issues with the reshaping / renaming part.
If you have any idea, I would be very grateful.
Thanks for any help
Edit :
As you recommended, I have installed Julia 1.5 and updated the DataFrames pkg to 0.22 version. As a result, the code runs well. The only remaining issue is related to the non constant lenght of column names in my real DF, which makes the transform part of the code not completly suitable. I will search for a way to split char/num with regular expression.
Thanks a lot for your time and sorry for the mistakes on editing.
There are probably several ways how you can do it. Here is an example using in-built functions (also taking advantage of several advanced features at once, so if you have any questions regarding the code please comment and I can explain):
julia> using CSV, DataFrames, Chain
julia> str = """
Police Product PV1 PV2 PV3 PM1 PM2 PM3
1 AA 10 8 14 150 145 140
2 AB 25 4 7 700 650 620
3 AA 13 22 5 120 80 60
4 AA 12 6 12 250 170 120
5 AB 10 13 5 500 430 350
6 BC 7 21 12 1200 1000 900""";
julia> #chain str begin
IOBuffer
CSV.read(DataFrame, ignorerepeated=true, delim=" ")
groupby(:Product)
combine(names(df, r"\d") .=> sum, renamecols=false)
stack(Not(:Product))
transform!(:variable => ByRow(x -> (first(x, 2), last(x, 1))) => [:Item, :Year])
unstack([:Product, :Item], :Year, :value, renamecols = x -> Symbol("Year", x))
sort!(:Product)
end
6×5 DataFrame
Row │ Product Item Year1 Year2 Year3
│ String String Int64? Int64? Int64?
─────┼─────────────────────────────────────────
1 │ AA PV 35 36 31
2 │ AA PM 520 395 320
3 │ AB PV 35 17 12
4 │ AB PM 1200 1080 970
5 │ BC PV 7 21 12
6 │ BC PM 1200 1000 900
I used Chain.jl just to show how it can be employed in practice (but of course it is not needed).
You can add #aside show(_) annotation after any stage of the processing to see the results of the processing steps.
Edit:
Is this the regex you need (split non-digit characters followed by digit characters)?
julia> match(r"([^\d]+)(\d+)", "fsdfds123").captures
2-element Array{Union{Nothing, SubString{String}},1}:
"fsdfds"
"123"
Then just write:
ByRow(x -> match(r"([^\d]+)(\d+)", x).captures)
as your transformation

Auctions System Logical Condition

I am trying to make an auctions system but can not figure out the logical conditions for doing so..
Lets say that I have 10 credit
$credit
I have already bet 5 credits on another auction... so I owe 5 from 10 $owe
I thus have 5 available... $available = $credit - $owe (=5)
I bet 3 from available (on a different item)...
I wish to bet again 4 (cancel 3, update to 4), but credit available is now $available - 3 (=2)
Can't find a logical solution.... written in code.
What is the condition for setting a bet???
Made up a matrix with the dependence between variables:
bet available owe lastbet
1 10 10 0
2 9 11 1
3 7 13 2
4 4 16 3
5 0 20 4
6 -5 25 5
7 -11 31 6
8 -18 38 7
9 -26 46 8
10 -35 55 9
11 -45 65 10
Need to translate it into a condition statement.... (the next row would not meet the conditions)
The condition should fail on the 11th row....
Based on the Matrix... I found out that the condition is:
if ($bet <= (($owe + $available) / 2)) {}
Not very intuitive......

Divide dataframe in different bins based on condition

i have a pandas dataframe
id no_of_rows
1 2689
2 1515
3 3826
4 814
5 1650
6 2292
7 1867
8 2096
9 1618
10 923
11 766
12 191
i want to divide id's into 5 different bins based on their no. of rows,
such that every bin has approx(equal no of rows)
and assign it as a new column bin
One approach i thought was
df.no_of_rows.sum() = 20247
div_factor = 20247//5 == 4049
if we add 1st and 2nd row its sum = 2689+1515 = 4204 > div_factor.
Therefore assign bin = 1 where id = 1.
Now look for the next ones
id no_of_rows bin
1 2689 1
2 1515 2
3 3826 3
4 814 4
5 1650 4
6 2292 5
7 1867
8 2096
9 1618
10 923
11 766
12 191
But this method proved wrong.
Is there a way to have 5 bins such that every bin has good amount of stores(approximately equal)
You can use an approach based on percentiles.
n_bins = 5
dfa = df.sort_values(by='no_of_rows').cumsum()
df['bin'] = dfa.no_of_rows.apply(lambda x: int(n_bins*x/dfa.no_of_rows.max()))
And then you can check with
df.groupby('bin').sum()
The more records you have the more fair it will be in terms of dispersion.