Incremental id based on another column's value - pandas

From this DataFrame:
car_id month
93829 September
27483 April
48372 October
93829 December
93829 March
48372 February
27483 March
How to add a third column which is basically a new id for car, but an incremental one, like this:
car_id month new_incremental_car_id
93829 September 0
27483 April 1
48372 October 2
93829 December 0
93829 March 0
48372 February 2
27483 March 1
Currently I'm doing it by using groupby('car_id') to create a new DataFrame, to which I add an incremental column, which I then join back to the original DataFrame using car_id join key.
Is there a less cumbersome, more direct method to achieve this goal?
EDIT
The code I'm currently using:
cars_id = pd.DataFrame(list(car_sales.groupby('car_id')['car_id'].groups))
cars_id['car_short_id'] = cars_id.index
cars_id.set_index(0, inplace=True)
car_sales.join(cars_id, on='car_id', how='left')

Apart from pd.factorize you can
Use, map a dict constructed from unique values.
In [959]: df.car_id.map({x: i for i, x in enumerate(df.car_id.unique())})
Out[959]:
0 0
1 1
2 2
3 0
4 0
5 2
6 1
Name: car_id, dtype: int64
Or, using category type and codes but not in the same order.
In [954]: df.car_id.astype('category').cat.codes
Out[954]:
0 2
1 0
2 1
3 2
4 2
5 1
6 0
dtype: int8

use factorize method:
In [49]: df['new_incremental_car_id'] = pd.factorize(df.car_id)[0].astype(np.uint16)
In [50]: df
Out[50]:
car_id month new_incremental_car_id
0 93829 September 0
1 27483 April 1
2 48372 October 2
3 93829 December 0
4 93829 March 0
5 48372 February 2
6 27483 March 1
In [51]: df.dtypes
Out[51]:
car_id int64
month object
new_incremental_car_id uint16
dtype: object

Related

Cumulative Deviation of 2 Columns in Pandas DF

I have a rather simple request and have not found a suitable solution online. I have a DF that looks like this below and I need to find the cumulative deviation as shown in a new column to the DF. My DF looks like this:
year month Curr Yr LT Avg
0 2022 1 667590.5985 594474.2003
1 2022 2 701655.5967 585753.1173
2 2022 3 667260.5368 575550.6112
3 2022 4 795338.8914 562312.5309
4 2022 5 516510.1103 501330.4306
5 2022 6 465717.9192 418087.1358
6 2022 7 366100.4456 344854.2453
7 2022 8 355089.157 351539.9371
8 2022 9 468479.4396 496831.2979
9 2022 10 569234.4156 570767.1723
10 2022 11 719505.8569 594368.6991
11 2022 12 670304.78 576495.7539
And, I need the cumulative deviation new column in this DF to look like this:
Cum Dev
0.122993392
0.160154637
0.159888559
0.221628609
0.187604073
0.178089327
0.16687643
0.152866293
0.129326033
0.114260993
0.124487107
0.128058305
In Excel, the calculation would look like this with data in Excel columns Z3:Z14, AA3:AA14 for the first row: =SUM(Z$3:Z3)/SUM(AA$3:AA3)-1 and for the next row: =SUM(Z$3:Z4)/SUM(AA$3:AA4)-1 and for the next as follows with the last row looking like this in the Excel example: =SUM(Z$3:Z14)/SUM(AA$3:AA14)-1
Thank you kindly for your help,
You can divide the cumulative sums of those 2 columns element-wise, and then subtract 1 at the end:
>>> (df["Curr Yr"].cumsum() / df["LT Avg"].cumsum()) - 1
0 0.122993
1 0.160155
2 0.159889
3 0.221629
4 0.187604
5 0.178089
6 0.166876
7 0.152866
8 0.129326
9 0.114261
10 0.124487
11 0.128058
dtype: float64

Rolling Rows in pandas.DataFrame

I have a dataframe that looks like this:
year
month
valueCounts
2019
1
73.411285
2019
2
53.589128
2019
3
71.103842
2019
4
79.528084
I want valueCounts column's values to be rolled like:
year
month
valueCounts
2019
1
53.589128
2019
2
71.103842
2019
3
79.528084
2019
4
NaN
I can do this by dropping first index of dataframe and assigning last index to NaN but it doesn't look efficient. Is there any simpler method to do this?
Thanks.
Assuming your dataframe are already sorted.
Use shift:
df['valueCounts'] = df['valueCounts'].shift(-1)
print(df)
# Output
year month valueCounts
0 2019 1 53.589128
1 2019 2 71.103842
2 2019 3 79.528084
3 2019 4 NaN

How to join columns in Julia?

I have opened a dataframe in julia where i have 3 columns like this:
day month year
1 1 2011
2 4 2015
3 12 2018
how can I make a new column called date that goes:
day month year date
1 1 2011 1/1/2011
2 4 2015 2/4/2015
3 12 2018 3/12/2018
I was trying with this:
df[!,:date]= df.day.*"/".*df.month.*"/".*df.year
but it didn't work.
in R i would do:
df$date=paste(df$day, df$month, df$year, sep="/")
is there anything similar?
thanks in advance!
Julia has an inbuilt Date type in its standard library:
julia> using Dates
julia> df[!, :date] = Date.(df.year, df.month, df.day)
3-element Vector{Date}:
2011-01-01
2015-04-02
2018-12-03

Pandas 1.0 create column of months from year and date

I have a dataframe df with values as:
df.iloc[1:4, 7:9]
Year Month
38 2020 4
65 2021 4
92 2022 4
I am trying to create a new MonthIdx column as:
df['MonthIdx'] = pd.to_timedelta(df['Year'], unit='Y') + pd.to_timedelta(df['Month'], unit='M') + pd.to_timedelta(1, unit='D')
But I get the error:
ValueError: Units 'M' and 'Y' are no longer supported, as they do not represent unambiguous timedelta values durations.
Following is the desired output:
df['MonthIdx']
MonthIdx
38 2020/04/01
65 2021/04/01
92 2022/04/01
So you can pad the month value in a series, and then reformat to get a datetime for all of the values:
month = df.Month.astype(str).str.pad(width=2, side='left', fillchar='0')
df['MonthIdx'] = pd.to_datetime(pd.Series([int('%d%s' % (x,y)) for x,y in zip(df['Year'],month)]),format='%Y%m')
This will give you:
Year Month MonthIdx
0 2020 4 2020-04-01
1 2021 4 2021-04-01
2 2022 4 2022-04-01
You can reformat the date to be a string to match exactly your format:
df['MonthIdx'] = df['MonthIdx'].apply(lambda x: x.strftime('%Y/%m/%d'))
Giving you:
Year Month MonthIdx
0 2020 4 2020/04/01
1 2021 4 2021/04/01
2 2022 4 2022/04/01

adding a new column using values in another column using pandas

I have a data set
id Category Date
1 Sick 2016-10-10
12:10:21
2 Active 2017-09-08
11:09:06
3 Weak 2018-11-12
06:10:04
Now i want to add a new column which only has year in the data set using pandas?
You could do:
import pandas as pd
data = [[1, 'Sick ', '2016-10-10 12:10:21'],
[2, 'Active', '2017-09-08 11:09:06'],
[3, 'Weak ', '2018-11-12 06:10:04']]
df = pd.DataFrame(data=data, columns=['id', 'category', 'date'])
df['year'] = pd.to_datetime(df['date']).dt.year
print(df)
Output
id category date year
0 1 Sick 2016-10-10 12:10:21 2016
1 2 Active 2017-09-08 11:09:06 2017
2 3 Weak 2018-11-12 06:10:04 2018
you can just do df['year'] = pd.DatetimeIndex(df['Date']).year
Output:
id category Date year
0 1 Sick 2016-10-10 12:10:21 2016
1 2 Active 2017-09-08 11:09:06 2017
2 3 Weak 2018-11-12 06:10:04 2018