Can anyone help me extract the car model names from the following sample dataframe?
index,Make,Model,Price,Year,Kilometer,Fuel Type,Transmission,Location,Color,Owner,Seller Type
0,Honda,Amaze 1.2 VX i-VTEC,505000,2017,87150,Petrol,Manual,Pune,Grey,First,Corporate
1,Maruti Suzuki,Swift DZire VDI,450000,2014,75000,Diesel,Manual,Ludhiana,White,Second,Individual
2,Hyundai,i10 Magna 1.2 Kappa2,220000,2011,67000,Petrol,Manual,Lucknow,Maroon,First,Individual
3,Toyota,Glanza G,799000,2019,37500,Petrol,Manual,Mangalore,Red,First,Individual
I have used this code:
model_name = df['Model'].str.extract(r'(\w+)')
However, I'm unable to get the model names that contain a space or hyphen, such as WR-V or CR-V.
The full dataset is here: https://www.kaggle.com/datasets/nehalbirla/vehicle-dataset-from-cardekho?select=car+details+v4.csv
Desired output should be:
index,0
0,Amaze
1,Swift
2,i10
3,Glanza
4,Innova
5,Ciaz
6,CLA
7,X1 xDrive20d
8,Octavia
9,Terrano
10,Elite
11,Kwid
12,Ciaz
13,Harrier
14,Polo
15,Celerio
16,Alto
17,Baleno
18,Wagon
19,Creta
20,S-Presso
21,Vento
22,Santro
23,Venue
24,Alto
25,Ritz
26,Creta
27,Brio
28,Elite
29,WR-V
30,Venue
Please help me!!
The exact logic is unclear, but assuming you want the first word (including special characters) or the first two words if the first word has only one or two characters:
df['Model'].str.extract(r'(\S{3,}|\S{1,2}\s+\S+)', expand=False)
Output:
0 Amaze
1 Swift
2 i10
3 Glanza
4 Innova
5 Ciaz
6 CLA
7 X1 xDrive20d
8 Octavia
9 Terrano
10 Elite
11 Kwid
12 Ciaz
13 Harrier
14 Polo
15 Celerio
16 Alto
17 Baleno
18 Wagon
19 Creta
20 S-Presso
21 Vento
22 Santro
23 Venue
24 Alto
25 Ritz
26 Creta
27 Brio
28 Elite
29 WR-V
... ...
Name: Model, dtype: object
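A minimal, self-contained sketch of the pattern above, run against a handful of model strings taken from the question's data (including the hyphenated and short-first-word cases that broke the original `(\w+)` pattern):

```python
import pandas as pd

# Sample models from the question, including the tricky hyphenated names
models = pd.Series([
    "Amaze 1.2 VX i-VTEC",
    "Swift DZire VDI",
    "i10 Magna 1.2 Kappa2",
    "Glanza G",
    "S-Presso VXI",
    "WR-V i-VTEC VX",
    "X1 xDrive20d xLine",
], name="Model")

# First alternative: a token of 3+ non-space characters (covers S-Presso, WR-V);
# fallback: a 1-2 character token plus the following token (covers X1 xDrive20d)
names = models.str.extract(r'(\S{3,}|\S{1,2}\s+\S+)', expand=False)
print(names.tolist())
```

Because `\S` matches any non-whitespace character, hyphens are kept, and the two-branch alternation handles short prefixes like "X1".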
I'm working with a pandas dataframe that has multiple groups:
date | group | brand | calculated_value
_______________________________
5 | 1 | x | 1
6 | 1 | x | NaN
7 | 1 | x | NaN
5 | 2 | y | 1
6 | 2 | y | NaN
Within each date, group, and brand, I have initialized the first instance with a calculated_value. I am iterating through these with nested for loops so that I can update and assign the next sequential date occurrence of calculated_value (within date-group-brand).
The groupby()/apply() paradigm doesn't work for me, because in e.g. the third row above, the function being passed to apply() looks above and finds NaN. It is not a sequential update.
After calculating the value, I am attempting to assign it to the cell in question, using the right syntax to avoid the SettingWithCopy problem:
df.loc[ (df.date == 5) & (df.group == 1) & (df.brand == 'x'), "calculated_value" ] = calc_value
However, this fails to set the cell, and it remains NaN. Why is that? I've tried searching many terms, but I was not able to find an answer relevant to my case.
I have confirmed that each of the for loops is incrementing properly, and that I'm addressing the correct row in each iteration.
EDIT: I discovered the problem. When I pass the cells to calculate_function as individual arguments, they each pass as a single-value series, and the function returns a single-value series, which cannot be assigned to the NaN cell. No error was thrown on the mismatched assignment, and the for loop didn't terminate.
I fixed this by passing
calculate_function(arg1.values[0], arg2.values[0], ...)
Extracting the value array and taking its first index seems inelegant and brittle, but the default is quirky behavior compared to what I'm used to in R.
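The silent mismatch described in the edit can be reproduced in a small sketch (the frame is the sample from the question; `prev` stands in for one of the single-value Series arguments):

```python
import pandas as pd

df = pd.DataFrame({
    "date": [5, 6, 7],
    "group": [1, 1, 1],
    "brand": ["x", "x", "x"],
    "calculated_value": [1.0, None, None],
})

# Selecting with .loc and a mask returns a single-value Series, not a scalar
prev = df.loc[df.date == 5, "calculated_value"]  # Series with index [0]

# Assigning a Series back aligns on the index: the target row's label (1)
# is absent from prev's index (0), so NaN is written, with no error raised
df.loc[df.date == 6, "calculated_value"] = prev + 1
print(df.loc[1, "calculated_value"])  # still NaN

# Extracting a scalar first avoids the silent alignment
df.loc[df.date == 6, "calculated_value"] = prev.iloc[0] + 1
print(df.loc[1, "calculated_value"])  # 2.0
```

`prev.iloc[0]` (or `prev.item()` for a guaranteed single value) is the idiomatic equivalent of `prev.values[0]`.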
You can use groupby().idxmin() to identify the first date in each (group, brand) pair:
s = df.groupby(['group', 'brand']).date.idxmin()
df.loc[s,'calculated_value'] = 1
Output:
date group brand calculated_value
0 5 1 x 1.0
1 6 1 x NaN
2 7 1 x NaN
3 5 2 y 1.0
4 6 2 y NaN
Alternatively, you can use transform with min:
s=df.groupby(['group','brand']).date.transform('min')
df['calculated_value']=df.date.eq(s).astype(int)
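Both answers can be checked side by side on the question's sample frame (the second approach writes to a separate `flag` column here so the two results can be compared):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "date":  [5, 6, 7, 5, 6],
    "group": [1, 1, 1, 2, 2],
    "brand": ["x", "x", "x", "y", "y"],
})

# Approach 1: idxmin gives the row label of the earliest date per (group, brand)
s = df.groupby(["group", "brand"]).date.idxmin()
df["calculated_value"] = np.nan
df.loc[s, "calculated_value"] = 1

# Approach 2: transform('min') broadcasts each group's minimum date back to
# every row, so an equality test marks the first-date rows
m = df.groupby(["group", "brand"]).date.transform("min")
df["flag"] = df.date.eq(m).astype(int)

print(df)
```

The first approach leaves non-first rows as NaN (matching the question's layout); the second fills them with 0.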
I'm trying to tidy some data, specifically by taking two columns "measure" and "value" and making more columns for each unique value of measure.
So far I have some python (3) code that reads in data and pivots it to the form that I want--roughly. This code looks like so:
import pandas as pd
#Load the data
df = pd.read_csv(r"C:\Users\User\Documents\example data.csv")
#Pivot the dataframe
df_pivot = df.pivot_table(index=['Geography Type', 'Geography Name', 'Week Ending',
'Item Name'], columns='Measure', values='Value')
print(df_pivot.head())
This outputs:
Measure X Y Z
Geography Type Geography Name Week Ending Item Name
Type 1 Total US 1/1/2018 Item A 57 51 16
Item B 95 37 17
1/8/2018 Item A 92 8 32
Item B 36 49 54
Type 2 Region 1 1/1/2018 Item A 78 46 88
This is almost perfect, but the software I need to load this file into requires a value in every row, so the index values need to be repeated down the rows, like so:
Measure X Y Z
Geography Type Geography Name Week Ending Item Name
Type 1 Total US 1/1/2018 Item A 57 51 16
Type 1 Total US 1/1/2018 Item B 95 37 17
Type 1 Total US 1/8/2018 Item A 92 8 32
Type 1 Total US 1/8/2018 Item B 36 49 54
Type 2 Region 1 1/1/2018 Item A 78 46 88
and so on.
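One thing worth knowing: the blank cells in the printed output are only display sparsification of the MultiIndex; every row still carries the full index values. So `reset_index()` turns them into ordinary, fully repeated columns (and `to_csv` writes all levels on every row even without it). A sketch using a small frame built to match the question's layout:

```python
import pandas as pd

# Hypothetical sample matching the column names and values in the question
df = pd.DataFrame({
    "Geography Type": ["Type 1"] * 6,
    "Geography Name": ["Total US"] * 6,
    "Week Ending": ["1/1/2018"] * 3 + ["1/8/2018"] * 3,
    "Item Name": ["Item A"] * 6,
    "Measure": ["X", "Y", "Z"] * 2,
    "Value": [57, 51, 16, 92, 8, 32],
})

df_pivot = df.pivot_table(
    index=["Geography Type", "Geography Name", "Week Ending", "Item Name"],
    columns="Measure", values="Value",
)

# Turn the MultiIndex levels back into regular columns, repeated on every row
flat = df_pivot.reset_index()
print(flat)
```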
I am producing some statistics which require grouping results by church, and only counting those churches which have been visited more than once.
So I can do:
df = pd.read_excel('/home/tim/metatron/church_data.xlsx')
chthresh = 1 # Minimum number of visits to a church in order to be considered
chgp = df.groupby('Church')
chcnt = pd.DataFrame(chgp['Date'].count())
chcnt2 = chcnt[chcnt['Date'] > chthresh]
which gives me what I want:
In[8]: chcnt2
Out[8]:
Date
Church
Manchester 36
Sale 29
Salford 33
For the purposes of analysis, though, I would like to anonymise these churches and replace them with (say) A, B, C, etc. (There may be more than three churches.) What would be the easiest/best way to allocate some sort of alphabetic label, e.g. in this case "Manchester" -> "A", "Sale" -> "B", "Salford" -> "C"?
I can give the churches some sort of ordinal value:
chcnt3 = chcnt2.reset_index()
chcnt3['Ordinal']=chcnt3.index.values
Which produces
In[9]: chcnt3
Out[9]:
Church Date Ordinal
0 Manchester 36 0
1 Sale 29 1
2 Salford 33 2
But how would I convert this to some sort of letter? Is there a better way to do this?
You can create a letter map:
from string import ascii_uppercase
letter_map = dict(zip(range(len(ascii_uppercase)), ascii_uppercase))
and use this for mapping:
chcnt3['letter'] = chcnt3['Ordinal'].map(letter_map)
chcnt3
Out:
Church Date Ordinal letter
0 Manchester 36 0 A
1 Sale 29 1 B
2 Salford 33 2 C
Without creating the ordinal column, you can do this on the chcnt2 DataFrame too:
chcnt2['letter'] = list(ascii_uppercase[:len(chcnt2)])
chcnt2
Out:
Date letter
Church
Manchester 36 A
Sale 29 B
Salford 33 C
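Since the question notes there may be more than three churches, it's worth flagging that `ascii_uppercase` runs out after 26. A sketch (with the question's hypothetical counts) that extends the scheme to two-letter labels if needed:

```python
import pandas as pd
from itertools import product
from string import ascii_uppercase

# Hypothetical counts matching the question's example
chcnt2 = pd.DataFrame(
    {"Date": [36, 29, 33]},
    index=pd.Index(["Manchester", "Sale", "Salford"], name="Church"),
)

# Single letters A-Z, then two-letter labels AA, AB, ... for church 27 onward
labels = list(ascii_uppercase) + [
    "".join(p) for p in product(ascii_uppercase, repeat=2)
]
chcnt2["letter"] = labels[: len(chcnt2)]
print(chcnt2)
```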
I don't know if Stata can do this, but I use the tabulate command a lot to find frequencies. For instance, I have a success variable which takes values 0 or 1, and I would like to know the success rate for a certain group of observations, i.e. tab success if group==1. I was wondering if I can do sort of the inverse of this operation. That is, I would like to find a value of "group" for which the frequency is greater than or equal to 15%, for example.
Is there a command that does this?
Thanks
As an example
sysuse auto
gen success=mpg<29
Now I want to find the value of price such that the frequency of the success variable is greater than 75% for example.
Following @Nick's suggestion:
ssc install groups
sysuse auto
count
74
* running -return list- here is optional
local nobs = r(N)                     // r(N) gives the total number of observations
groups rep78, sel(f > 0.15*`nobs')    // gives the groups with frequency > 15%
+---------------------------------+
| rep78 Freq. Percent % <= |
|---------------------------------|
| 3 30 43.48 57.97 |
| 4 18 26.09 84.06 |
+---------------------------------+
groups rep78, sel(f > 0.10*`nobs')    // more than 10%
+----------------------------------+
| rep78 Freq. Percent % <= |
|----------------------------------|
| 2 8 11.59 14.49 |
| 3 30 43.48 57.97 |
| 4 18 26.09 84.06 |
| 5 11 15.94 100.00 |
+----------------------------------+
I'm not sure if I fully understand your question/situation, but I believe this might be useful. You can egen a variable that is equal to the mean of success, by group, and then see which observations have the value for mean(success) that you're looking for.
egen avgsuccess = mean(success), by(group)
tab group if avgsuccess >= 0.15
list group if avgsuccess >= 0.15
Does that accomplish what you want?