How to extract car model name from the car dataset? - pandas

Can anyone help me extract the car model names from the following sample dataframe?
index,Make,Model,Price,Year,Kilometer,Fuel Type,Transmission,Location,Color,Owner,Seller Type
0,Honda,Amaze 1.2 VX i-VTEC,505000,2017,87150,Petrol,Manual,Pune,Grey,First,Corporate
1,Maruti Suzuki,Swift DZire VDI,450000,2014,75000,Diesel,Manual,Ludhiana,White,Second,Individual
2,Hyundai,i10 Magna 1.2 Kappa2,220000,2011,67000,Petrol,Manual,Lucknow,Maroon,First,Individual
3,Toyota,Glanza G,799000,2019,37500,Petrol,Manual,Mangalore,Red,First,Individual
I have used this code:
model_name = df['Model'].str.extract(r'(\w+)')
However, I'm unable to extract the car names that contain a space or hyphen, such as WR-V or CR-V.
This is the detailed link of the dataset: https://www.kaggle.com/datasets/nehalbirla/vehicle-dataset-from-cardekho?select=car+details+v4.csv
Desired output should be:
index,0
0,Amaze
1,Swift
2,i10
3,Glanza
4,Innova
5,Ciaz
6,CLA
7,X1 xDrive20d
8,Octavia
9,Terrano
10,Elite
11,Kwid
12,Ciaz
13,Harrier
14,Polo
15,Celerio
16,Alto
17,Baleno
18,Wagon
19,Creta
20,S-Presso
21,Vento
22,Santro
23,Venue
24,Alto
25,Ritz
26,Creta
27,Brio
28,Elite
29,WR-V
30,Venue
Please help me!!

The exact logic is unclear, but assuming you want the first word (including special characters) or the first two words if the first word has only one or two characters:
df['Model'].str.extract(r'(\S{3,}|\S{1,2}\s+\S+)', expand=False)
Output:
0 Amaze
1 Swift
2 i10
3 Glanza
4 Innova
5 Ciaz
6 CLA
7 X1 xDrive20d
8 Octavia
9 Terrano
10 Elite
11 Kwid
12 Ciaz
13 Harrier
14 Polo
15 Celerio
16 Alto
17 Baleno
18 Wagon
19 Creta
20 S-Presso
21 Vento
22 Santro
23 Venue
24 Alto
25 Ritz
26 Creta
27 Brio
28 Elite
29 WR-V
... ...
Name: Model, dtype: object
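As a quick check, the pattern above can be verified on a few tricky values. A minimal sketch; the WR-V and X1 strings are hypothetical, patterned on the desired output:
import pandas as pd

sample = pd.Series([
    "Amaze 1.2 VX i-VTEC",      # from the question's data
    "i10 Magna 1.2 Kappa2",     # from the question's data
    "WR-V VX MT Petrol",        # hypothetical full model string
    "X1 xDrive20d Expedition",  # hypothetical full model string
])
# \S{3,}        -> a first word of 3+ non-space characters
#                  (covers "Amaze", "i10" and "WR-V"; the hyphen is non-space)
# \S{1,2}\s+\S+ -> a 1-2 character first word plus the following word
#                  (covers "X1 xDrive20d")
print(sample.str.extract(r"(\S{3,}|\S{1,2}\s+\S+)", expand=False))
# 0           Amaze
# 1             i10
# 2            WR-V
# 3    X1 xDrive20d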

Related

How to convert wide dataframe to long based on similar column

I have a pandas DataFrame in wide format and want to convert it to a long format.
I am not sure how to use the pd.wide_to_long function here.
Below is the data for creating the DataFrame:
Date, IN:Male teacher ,IN:Male engineer, IN: Male Atronaut , IN:female teacher ,IN:female engineer, IN: female Atronaut ,GB:Male teacher ,GB:Male engineer, GB: Male Atronaut,GB:female teacher ,GB:female engineer, GB: female Atronaut
20220405,25,29,5,41,23,23,12,23,34,11,22,34
20220404,21,29,4,40,23,22,12,23,32,10,23,34
Convert the Date column to the index. For all other columns, remove possible trailing spaces with str.strip, then replace spaces with : and split by one or more : into a MultiIndex. That makes it possible to reshape with DataFrame.stack, name the new index levels with rename_axis, and turn them into columns with reset_index:
df1 = df.set_index('Date')
# strip trailing spaces, normalize separators to ':', then split into a MultiIndex
df1.columns = (df1.columns.str.strip()
                          .str.replace(r'\s+', ':', regex=True)
                          .str.split(r'[:]+', expand=True))
df1 = df1.stack([0, 1]).rename_axis(['Date', 'Symbol', 'Gender']).reset_index()
print(df1)
Date Symbol Gender Atronaut engineer teacher
0 20220405 GB Male 34 23 12
1 20220405 GB female 34 22 11
2 20220405 IN Male 5 29 25
3 20220405 IN female 23 23 41
4 20220404 GB Male 32 23 12
5 20220404 GB female 34 23 10
6 20220404 IN Male 4 29 21
7 20220404 IN female 22 23 40
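For reference, df above can be rebuilt from the question's raw text to test the snippet. A minimal sketch; skipinitialspace=True is an assumption used to absorb the leading spaces after the commas:
import io
import pandas as pd

data = """Date, IN:Male teacher ,IN:Male engineer, IN: Male Atronaut , IN:female teacher ,IN:female engineer, IN: female Atronaut ,GB:Male teacher ,GB:Male engineer, GB: Male Atronaut,GB:female teacher ,GB:female engineer, GB: female Atronaut
20220405,25,29,5,41,23,23,12,23,34,11,22,34
20220404,21,29,4,40,23,22,12,23,32,10,23,34"""

# leading spaces after commas are skipped; trailing spaces in the column
# names remain, and those are exactly what str.strip above takes care of
df = pd.read_csv(io.StringIO(data), skipinitialspace=True)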
pivot_longer from pyjanitor offers an easy way to abstract the reshaping; in this case it can be solved with a regular expression:
# pip install pyjanitor
import pandas as pd
import janitor
df.pivot_longer(
    index='Date',
    names_to=('symbol', 'gender', '.value'),
    names_pattern=r"(.+):\s*(.+)\s+(.+)",
    sort_by_appearance=True)
Date symbol gender teacher engineer Atronaut
0 20220405 IN Male 25 29 5
1 20220405 IN female 41 23 23
2 20220405 GB Male 12 23 34
3 20220405 GB female 11 22 34
4 20220404 IN Male 21 29 4
5 20220404 IN female 40 23 22
6 20220404 GB Male 12 23 32
7 20220404 GB female 10 23 34
The regular expression has three capture groups; any group paired with .value stays as a header, and the rest become column values.
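As a sanity check on names_pattern, the capture groups can be inspected on a single stripped column label with Python's re (group 1 feeds symbol, group 2 gender, and group 3 the .value header):
import re

# one header from the question, after stripping the trailing space
print(re.match(r"(.+):\s*(.+)\s+(.+)", "IN: Male Atronaut").groups())
# ('IN', 'Male', 'Atronaut')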

Group rows using the cumulative sum of a third column

I have a table with two columns:
sort_column = A column I use for sorting
value_column = My metric of interest (a positive integer)
Using SQL, I need to create contiguous groups of rows, ordered by sort_column, such that the sum of value_column within each group is as large as possible while staying below 100 (exclusive).
Find below an example of my desired result.
Thanks
sort_column value_column desired_result
----------- ------------ --------------
          1           53              1
          2           25              1
          3           33              2
          4           25              2
          5           10              2
          6           46              3
          7            9              3
          8           49              4
          9           48              4
         10           53              5
         11           33              5
         12           52              6
         13           29              6
         14           16              6
         15           66              7
         16            1              7
         17           62              8
         18           57              9
         19           47             10
         20           12             10
OK, so after a few lengthy attempts, I came to the conclusion that the task is impossible with pure SQL: a given value of the desired column depends on previous values of that same column in a way that cannot be derived from the first two columns alone, so the problem cannot be tackled without a recursive CTE, which BigQuery does not support.
I solved the issue by writing a JavaScript UDF for the task. It seems to be working fine and produces the expected results.
Many thanks everyone!
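The JavaScript UDF itself is not shown above, but the grouping it performs is a single greedy scan over the ordered rows. Here is the same logic as a minimal Python sketch (the function name and the limit parameter are illustrative, not from the original answer):
def assign_groups(values, limit=100):
    """Greedy scan: start a new group whenever adding the next value
    would push the running sum to `limit` or above."""
    groups, running, group_id = [], 0, 1
    for v in values:
        if running + v >= limit:
            group_id += 1
            running = 0
        running += v
        groups.append(group_id)
    return groups

# value_column from the question, in sort_column order
vals = [53, 25, 33, 25, 10, 46, 9, 49, 48, 53,
        33, 52, 29, 16, 66, 1, 62, 57, 47, 12]
print(assign_groups(vals))
# [1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 6, 7, 7, 8, 9, 10, 10]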

How to reshape, group by and rename Julia dataframe?

I have the following DataFrame :
Police Product PV1 PV2 PV3 PM1 PM2 PM3
0 1 AA 10 8 14 150 145 140
1 2 AB 25 4 7 700 650 620
2 3 AA 13 22 5 120 80 60
3 4 AA 12 6 12 250 170 120
4 5 AB 10 13 5 500 430 350
5 6 BC 7 21 12 1200 1000 900
PV1 is the item PV for year 1, PV2 for year 2, ....
I would like to combine reshaping and group-by operations, plus some renaming, to obtain the DataFrame below:
Product Item Year1 Year2 Year3
0 AA PV 35 36 31
1 AA PM 520 395 320
2 AB PV 35 17 12
3 AB PM 1200 1080 970
4 BC PV 7 21 12
5 BC PM 1200 1000 900
This groups by product name and reshapes the DataFrame so that the item becomes a column, with the sum of each item placed in new year columns.
I found a way to do it in Python but I am now looking for a solution passing my code in Julia.
No problem for the groupby operation, but I have more issues with the reshaping / renaming part.
If you have any idea, I would be very grateful.
Thanks for any help
Edit :
As you recommended, I have installed Julia 1.5 and updated the DataFrames package to version 0.22. As a result, the code runs well. The only remaining issue is related to the non-constant length of column names in my real DataFrame, which makes the transform part of the code not completely suitable. I will search for a way to split characters from numbers with a regular expression.
Thanks a lot for your time and sorry for the mistakes on editing.
There are probably several ways to do it. Here is an example using built-in functions (also taking advantage of several advanced features at once, so if you have any questions regarding the code please comment and I can explain):
julia> using CSV, DataFrames, Chain
julia> str = """
Police Product PV1 PV2 PV3 PM1 PM2 PM3
1 AA 10 8 14 150 145 140
2 AB 25 4 7 700 650 620
3 AA 13 22 5 120 80 60
4 AA 12 6 12 250 170 120
5 AB 10 13 5 500 430 350
6 BC 7 21 12 1200 1000 900""";
julia> @chain str begin
           IOBuffer
           CSV.read(DataFrame, ignorerepeated=true, delim=" ")
           groupby(:Product)
           # `df` refers to the question's original data frame;
           # names(df, r"\d") selects the PV*/PM* columns to sum
           combine(names(df, r"\d") .=> sum, renamecols=false)
           stack(Not(:Product))
           transform!(:variable => ByRow(x -> (first(x, 2), last(x, 1))) => [:Item, :Year])
           unstack([:Product, :Item], :Year, :value, renamecols = x -> Symbol("Year", x))
           sort!(:Product)
       end
6×5 DataFrame
Row │ Product Item Year1 Year2 Year3
│ String String Int64? Int64? Int64?
─────┼─────────────────────────────────────────
1 │ AA PV 35 36 31
2 │ AA PM 520 395 320
3 │ AB PV 35 17 12
4 │ AB PM 1200 1080 970
5 │ BC PV 7 21 12
6 │ BC PM 1200 1000 900
I used Chain.jl just to show how it can be employed in practice (but of course it is not needed).
You can add an @aside show(_) annotation after any stage of the processing to see the results of that step.
Edit:
Is this the regex you need (splitting non-digit characters from the digit characters that follow)?
julia> match(r"([^\d]+)(\d+)", "fsdfds123").captures
2-element Array{Union{Nothing, SubString{String}},1}:
"fsdfds"
"123"
Then just write:
ByRow(x -> match(r"([^\d]+)(\d+)", x).captures)
as your transformation.

Pandas: sorting by integer notation value

In this dataframe, the key column values correspond to the integer notation of each song's key.
df
track key
0 Last Resort 4
1 Casimir Pulaski Day 8
2 Glass Eyes 8
3 Ohio - Live At Massey Hall 1971 7
4 Ballad of a Thin Man 11
5 Can You Forgive Her? 11
6 The Only Thing 3
7 Goodbye Baby (Baby Goodbye) 4
8 Heart Of Stone 0
9 Ohio 0
10 the gate 2
11 Clampdown 2
12 Cry, Cry, Cry 4
13 What's Happening Brother 8
14 Stupid Girl 11
15 I Don't Wanna Play House 7
16 Inner City Blues (Make Me Wanna Holler) 11
17 The Lonesome Death of Hattie Carroll 4
18 Paint It, Black - (Original Single Mono Version) 5
19 Let Him Run Wild 11
20 Undercover (Of The Night) - Remastered 5
21 Between the Bars 7
22 Like a Rolling Stone 0
23 Once 2
24 Pale Blue Eyes 5
25 The Way You Make Me Feel - 2012 Remaster 1
26 Jeremy 2
27 The Entertainer 7
28 Pressure 9
29 Play With Fire - Mono Version / Remastered 2002 2
30 D-I-V-O-R-C-E 9
31 Big Shot 0
32 What's Going On 1
33 Folsom Prison Blues - Live 0
34 American Woman 1
35 Cocaine Blues - Live 8
36 Jesus, etc. 5
The notation is as follows:
'C' --> 0
'C#'--> 1
'D' --> 2
'Eb'--> 3
'E' --> 4
'F' --> 5
'F#'--> 6
'G' --> 7
'Ab'--> 8
'A' --> 9
'Bb'--> 10
'B' --> 11
What is specific about this notation is that it is circular: 11 is closer to 0 than 2 is, for instance.
GOAL:
Given input_notation = 0, I would like to sort according to closeness to key 0, or 'C'.
You can get the closest value by doing:
closest_key = (input_notation - 1) % 12
So I would like to sort according to this logic, with the input_notation values on top, followed by the closest matches, like so:
8 Heart Of Stone 0
9 Ohio 0
22 Like a Rolling Stone 0
31 Big Shot 0
33 Folsom Prison Blues - Live 0
(...)
I have tried:
v = df[['key']].values
df = df.iloc[np.lexsort(np.abs(v - (input_notation - 1) %12 ).T)]
but this does not work.
Any clues?
You can define the closeness first and then use argsort with iloc to sort the data frame:
input_notation = 0
# define the closeness or distance
diff = (df.key - input_notation).abs()
closeness = np.minimum(diff, 12 - diff)
# use argsort to compute the sorting index (mergesort is stable, so ties
# keep their original order), and iloc to reorder the data frame
closest_to_input = df.iloc[closeness.argsort(kind='mergesort')]
closest_to_input.head()
# track key
#8 Heart Of Stone 0
#9 Ohio 0
#22 Like a Rolling Stone 0
#31 Big Shot 0
#33 Folsom Prison Blues - Live 0
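The np.minimum(diff, 12 - diff) step is what makes the distance circular, and it backs the claim above that 11 is closer to 0 than 2. A quick check (the key values here are illustrative):
import numpy as np
import pandas as pd

keys = pd.Series([0, 2, 11])
input_notation = 0

diff = (keys - input_notation).abs()
# plain difference: [0, 2, 11]; circular distance: [0, 2, 1]
print(np.minimum(diff, 12 - diff).tolist())
# key 11 is one semitone from C, so it sorts before key 2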

PowerPivot formula for row wise weighted average

I have a table in PowerPivot which contains the logged data of a traffic control camera mounted on a road. This table holds the velocity and the number of vehicles that pass the camera during a specific time window (e.g. 14:10 - 15:25). How can I get the average velocity of cars for a specific hour and list the results in a separate table with 24 rows (hours 0 - 23), where the second column of each row is the weighted average velocity of that hour? A sample of my stat_table data is given below:
count vel hour
----- --- ----
133 96.00237 15
117 91.45705 21
81 81.90521 6
2 84.29946 21
4 77.7841 18
1 140.8766 17
2 56.14951 14
6 71.72839 13
4 64.14309 9
1 60.949 17
1 77.00728 21
133 100.3956 6
109 100.8567 15
54 86.6369 9
1 83.96901 17
10 114.6556 21
6 85.39127 18
1 76.77993 15
3 113.3561 2
3 94.48055 2
In a separate PowerPivot table I have 24 rows and 2 columns, but when I enter my formula, every row gets updated with the same number. My formula is:
=sumX(FILTER(stat_table, stat_table[hour]=[hour]), stat_table[count] * stat_table[vel])/sumX(FILTER(stat_table, stat_table[hour]=[hour]), stat_table[count])
Create a new calculated column named "WeightedVelocity" as follows
WeightedVelocity = [count]*[vel]
Create a measure "WeightedAverage" as follows
WeightedAverage = sum(stat_table[WeightedVelocity]) / sum(stat_table[count])
Use the "WeightedAverage" measure in the VALUES area of the pivot table and the "hour" column in ROWS to get the desired result.
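For a quick cross-check outside PowerPivot, the same per-hour weighted average can be computed in pandas; a sketch over a few rows of the question's stat_table:
import pandas as pd

stat = pd.DataFrame({
    "count": [133, 117, 2, 81, 133],
    "vel":   [96.00237, 91.45705, 84.29946, 81.90521, 100.3956],
    "hour":  [15, 21, 21, 6, 6],
})

# same logic as the DAX measure: sum(count * vel) / sum(count) per hour
stat["wv"] = stat["count"] * stat["vel"]
grouped = stat.groupby("hour")
print(grouped["wv"].sum() / grouped["count"].sum())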