Calculate age between two integer columns - pandas

I have the below df:
OnlineDate BDate
20190813 19720116
20190809 19570912
20190807 19600601
20190801 19760919
20190816 19530916
The two columns are integers that represent dates in YYYYMMDD format.
I'm trying to get a new column that holds the difference in years between these two dates.
So, the expected output is the following:
OnlineDate BDate NewColumn
20190813 19720116 47
20190809 19570912 61
20190807 19600601 59
20190801 19760919 42
20190816 19530916 65
I can't just subtract the years, because the days and months also determine the age in years.
Do I have to create a function to do it, or can I do it without one?

It requires some setup: convert both columns to datetimes, subtract them, and express the difference in years.
import pandas as pd
import numpy as np
# setup
onlinedate = [20190813, 20190809, 20190807, 20190801, 20190816]
bdate = [19720116, 19570912, 19600601, 19760919, 19530916]
df = pd.DataFrame({"onlinedate":onlinedate, "bdate":bdate})
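# convert to dates (%m is the month; %M would be parsed as minutes)
onlinedate_year = pd.to_datetime(df["onlinedate"], format="%Y%m%d")
bdate_year = pd.to_datetime(df["bdate"], format="%Y%m%d")
# Setup new column, columnwise operation
# Subtract the two dates and divide the day count by the average year length
# (newer pandas versions may reject np.timedelta64(1, 'Y') as ambiguous)
# This is an approximation and can be off by one within a day or so of a birthday
df["NewColumn"] = (onlinedate_year - bdate_year).dt.days / 365.2425
# convert the float column to int (truncates the fractional year)
df["NewColumn"] = df["NewColumn"].astype(int)
print(df)
output:
onlinedate bdate NewColumn
0 20190813 19720116 47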
1 20190809 19570912 61
2 20190807 19600601 59
3 20190801 19760919 42
4 20190816 19530916 65

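Convert the data types to datetime:
for col in ['OnlineDate','BDate']:
    df[col]=pd.to_datetime(df[col],format="%Y%m%d")
Subtract the years. Note that this ignores the month and day, so it overcounts by one whenever the birthday has not yet occurred in the OnlineDate year:
df['NewColumn']=df['OnlineDate'].dt.year-df['BDate'].dt.year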
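Since the question asks for an age that respects months and days without writing a custom function, here is a minimal sketch of one way to do it: subtract the years, then take one off wherever OnlineDate falls before that year's birthday. The names online, birth, before_birthday and the Check column are illustrative only. Because both columns are YYYYMMDD integers, floor-dividing their difference by 10000 should give the same result, which the last step uses as a cross-check.

```python
import pandas as pd

df = pd.DataFrame({
    "OnlineDate": [20190813, 20190809, 20190807, 20190801, 20190816],
    "BDate": [19720116, 19570912, 19600601, 19760919, 19530916],
})

# Parse the YYYYMMDD integers into datetimes
online = pd.to_datetime(df["OnlineDate"], format="%Y%m%d")
birth = pd.to_datetime(df["BDate"], format="%Y%m%d")

# True where OnlineDate's (month, day) comes before the birthday's (month, day)
before_birthday = (online.dt.month * 100 + online.dt.day) < (
    birth.dt.month * 100 + birth.dt.day
)

# Year difference, minus one if the birthday has not happened yet that year
df["NewColumn"] = online.dt.year - birth.dt.year - before_birthday.astype(int)

# Cross-check that relies on the YYYYMMDD integer layout (no datetime parsing)
df["Check"] = (df["OnlineDate"] - df["BDate"]) // 10000

print(df)
```

With the sample data both columns come out as 47, 61, 59, 42 and 65. A 29 February birthday is treated as not yet reached until 1 March in non-leap years.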

Related

How can I create a bar chart with ranges of values

I have a data frame with, among other things, a user id and an age. I need to produce a bar chart of the number of users that fall within ranges of ages. What's throwing me is that there is really no upper bound for the age range. The specific ranges I'm trying to plot are age <= 25, 25 < age <= 75 and age > 75.
I'm relatively new to Pandas and plotting, and I'm sure this is a simple thing for more experienced data wranglers. Any assistance would be greatly appreciated.
You'll need to use the pandas.cut method to do this, and you can supply custom bins and labels!
from pandas import DataFrame, cut
from numpy.random import default_rng
from numpy import arange
from matplotlib.pyplot import show
# Make some dummy data
rng = default_rng(0)
df = DataFrame({'id': arange(100), 'age': rng.normal(50, scale=20, size=100).clip(min=0)})
print(df.head())
id age
0 0 52.514604
1 1 47.357903
2 2 62.808453
3 3 52.098002
4 4 39.286613
# Use pandas.cut to bin all of the ages & assign
# these bins to a new column to demonstrate how it works
## bins are right-inclusive by default: [0, 25], (25, 75], (75, inf)
## include_lowest=True keeps an age of exactly 0 in the first bin
df['bin'] = cut(df['age'], [0, 25, 75, float('inf')], labels=['25 or under', 'over 25 to 75', 'over 75'], include_lowest=True)
print(df.head())
id age bin
0 0 52.514604 over 25 to 75
1 1 47.357903 over 25 to 75
2 2 62.808453 over 25 to 75
3 3 52.098002 over 25 to 75
4 4 39.286613 over 25 to 75
# Get the value_counts of those bins and plot!
df['bin'].value_counts().sort_index().plot.bar()
show()
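To make the edge behaviour of the bins concrete, here is a small sketch with hand-picked ages that sit on or near the boundaries. The sample values and label names are made up for illustration. It shows that pandas.cut closes each bin on the right by default, so 25 and 75 land in the lower bin, matching the ranges in the question.

```python
import pandas as pd

# Hand-picked ages on and around the bin edges (illustrative values)
ages = pd.Series([0, 25, 25.5, 75, 76, 120])

bins = [0, 25, 75, float("inf")]
labels = ["25 or under", "over 25 to 75", "over 75"]

# right=True is the default, so each bin includes its upper edge;
# include_lowest=True also keeps the lowest edge (age 0) in the first bin
binned = pd.cut(ages, bins=bins, labels=labels, include_lowest=True)

# Count per bin, in bin order rather than by frequency
counts = binned.value_counts().sort_index()
print(counts)
```

Each of the three bins receives two of the six ages. Calling counts.plot.bar() would then draw the chart.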

Duplicated rows when merging on pandas

I have a list that contains multiple pandas dataframes.
Each dataframe has a 'Trading Day' column and one or more maturity columns.
The names of the maturity columns depend on the maturity. For example, the first dataframe's columns are 'Trading Day', 'Y_2021', 'Y_2022'.
The second dataframe has 'Trading Day', 'Y_2022', 'Y_2023', 'Y_2024'.
The 'Trading Day' column holds unique np.datetime64 dates in every dataframe,
and the maturity columns hold either floats or NaNs.
My goal is to merge all the dataframes into one and get something like:
'Trading Day', 'Y_2021', 'Y_2022', 'Y_2023', ... 'Y_2030'
In my code, gh is the list that contains all the dataframes, and original is a dataframe that contains all the dates from 5 years ago through today.
gt is the final dataframe.
So far what I have done is:
original = pd.DataFrame()
original['Trading Day'] = np.arange(np.datetime64(str(year_now-5)+('-01-01')), np.datetime64(date.today())+1)
for i in range(len(gh)):
    gh[i]['Trading Day']=gh[i]['Trading Day'].astype('datetime64[ns]')
gt = pd.merge(original,gh[0],on='Trading Day',how = 'left')
for i in range (1,len(gh)):
    gt=pd.merge(gt,gh[i],how='outer')
The code more or less works. The problem is that when the year changes I get results like the following:
Trading Day Y_2021 Y_2023 Y_2024
2020-06-05 45 NaN NaN
2020-06-05 NaN 54 NaN
2020-06-05 NaN NaN 43
2020-06-06 34 NaN NaN
2020-06-06 NaN 23 NaN
2020-06-06 NaN NaN 34
#While what I want is:
Trading Day Y_2021 Y_2023 Y_2024
2020-06-05 45 54 43
2020-06-06 34 23 34
Given your actual output and what you want, you should be able to just:
output.ffill().bfill().drop_duplicates()
To get the output you want.
Found the fix:
gt = gt.groupby('Trading Day').sum()
gt = gt.replace(0, np.nan)
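One caveat with summing and then replacing 0 with NaN is that a genuine 0 in the data would also be turned into NaN. As a hedged alternative, the sketch below collapses the duplicated dates with groupby(...).first(), which takes the first non-null value per column in each group. The small gt frame here is a made-up stand-in for the merged result, with a real 0.0 added to show that it survives.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the merged frame: one row per (date, maturity)
gt = pd.DataFrame({
    "Trading Day": pd.to_datetime(["2020-06-05"] * 3 + ["2020-06-06"] * 3),
    "Y_2021": [45.0, np.nan, np.nan, 34.0, np.nan, np.nan],
    "Y_2023": [np.nan, 54.0, np.nan, np.nan, 23.0, np.nan],
    "Y_2024": [np.nan, np.nan, 43.0, np.nan, np.nan, 0.0],
})

# first() returns the first non-null entry of each column within a group,
# so the three partial rows per date collapse into one complete row
collapsed = gt.groupby("Trading Day").first()
print(collapsed)
```

If several non-null values can exist for the same date and column, first() silently keeps only one of them. In that case summing with sum(min_count=1) may be closer to the intent, since it leaves all-NaN groups as NaN without touching real zeros.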

How to split data by month?

Hi, I have a time series data set. I would like to make a new column for each month.
data:
creationDate fre skill
2019-02-15T20:43:29Z 14 A
2019-02-15T21:10:32Z 15 B
2019-03-22T07:14:50Z 41 A
2019-03-22T06:47:41Z 64 B
2019-04-11T09:49:46Z 25 A
2019-04-11T09:49:46Z 29 B
output:
skill 2019-02 2019-03 2019-04
A 14 41 25
B 15 64 29
I know I can do it manually like below and make columns (when I have date1_start and date1_end):
dfdate1=data[(data['creationDate'] >= date1_start) & (data['creationDate']<= date1_end)]
But since I have many months, it is not feasible to do it this way for each month.
Use DataFrame.pivot after converting the datetimes to month periods with Series.dt.to_period:
df['dates'] = pd.to_datetime(df['creationDate']).dt.to_period('M')
df = df.pivot(index='skill', columns='dates', values='fre')
Or convert to custom YYYY-MM strings with Series.dt.strftime:
df['dates'] = pd.to_datetime(df['creationDate']).dt.strftime('%Y-%m')
df = df.pivot(index='skill', columns='dates', values='fre')
EDIT:
ValueError: Index contains duplicate entries, cannot reshape
This error means there are duplicate skill/month pairs. Use DataFrame.pivot_table with an aggregation such as sum or mean:
df = df.pivot_table(index='skill',columns='dates',values='fre', aggfunc='sum')
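For reference, here is a self-contained sketch that runs the pivot_table approach on the sample data from the question. The variable name wide is illustrative. It uses the strftime variant so the column labels are plain YYYY-MM strings.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "creationDate": [
        "2019-02-15T20:43:29Z", "2019-02-15T21:10:32Z",
        "2019-03-22T07:14:50Z", "2019-03-22T06:47:41Z",
        "2019-04-11T09:49:46Z", "2019-04-11T09:49:46Z",
    ],
    "fre": [14, 15, 41, 64, 25, 29],
    "skill": ["A", "B", "A", "B", "A", "B"],
})

# Reduce each timestamp to its year-month label
df["dates"] = pd.to_datetime(df["creationDate"]).dt.strftime("%Y-%m")

# One row per skill, one column per month; duplicates are summed
wide = df.pivot_table(index="skill", columns="dates", values="fre", aggfunc="sum")
print(wide)
```

Using pivot_table from the start avoids the duplicate-entries error, at the cost of silently aggregating any repeated skill/month pairs.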

how to group by month and another column pandas data frame

I have a data frame that looks like below:
import pandas as pd
df = pd.DataFrame({'Date':['2019-08-06','2019-08-08','2019-08-01','2019-10-12'], 'Name':['A','A','B','C'], 'grade':[100,90,69,80]})
I want to group the data by month and year of the Date column and also by Name, then sum up the other columns.
So, the desired output will be something similar to this:
df = pd.DataFrame({'Date':['2019-08','2019-08','2019-10'], 'Name':['A','B','C'], 'grade':[190,69,80]})
I have tried Grouper:
df.groupby(pd.Grouper(key='Date', freq='M')).sum()
However, it doesn't take the Name column into account and just drops the entire column.
Try:
df['Date'] = pd.to_datetime(df.Date)
df.groupby([df.Date.dt.to_period('M'), 'Name']).sum().reset_index()
Date Name grade
0 2019-08 A 190
1 2019-08 B 69
2 2019-10 C 80
I assume the Date column is of dtype datetime. Then group with:
grouped = df.groupby([df.Date.dt.year, df.Date.dt.month, 'Name']).sum()
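The Grouper attempt from the question can also be made to work by passing it in a list together with 'Name'. The following is a sketch under that assumption, using the sample data with the dates written as strings. The name monthly is illustrative, and freq="MS" (month start) is chosen here only so that each group is labelled by the first day of its month.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(["2019-08-06", "2019-08-08", "2019-08-01", "2019-10-12"]),
    "Name": ["A", "A", "B", "C"],
    "grade": [100, 90, 69, 80],
})

# Group by calendar month of Date and by Name at the same time,
# then sum the remaining numeric column
monthly = (
    df.groupby([pd.Grouper(key="Date", freq="MS"), "Name"])["grade"]
      .sum()
      .reset_index()
)
print(monthly)
```

Compared with dt.to_period('M'), this keeps Date as a datetime column, which can be convenient for later plotting or resampling.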

How could I download a dataframe with the new result after performing some calculations on it?

Link: https://gist.github.com/dishantrathi/541db1a19a8feaf114723672d998b857
The input is a set of dates ranging from 2012 to 2015, and I need to count the number of times each date is repeated.
After counting, I have each unique date with its count, and now I have to download the counts with their corresponding dates in ascending order.
The output file should be a CSV.
I believe you need reset_index to turn the Series into a two-column DataFrame, then sort with sort_values:
df1 = df.groupby('Date').size().reset_index(name='count').sort_values('count')
Another solution with value_counts:
df1 = (df['Date'].value_counts()
         .rename_axis('Date')
         .reset_index(name='count')
         .sort_values('count'))
print (df1.head())
Date count
66 02-05-2014 54
594 13-05-2014 56
294 07-02-2014 57
877 19-04-2013 58
162 04-05-2014 59
df1.to_csv('file.csv', index=False)
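If "ascending order" in the question refers to the dates and not the counts, a possible variation is sketched below. The sample dates are made up, and the day-month-year format is an assumption based on the printed output above. Parsing the strings first matters, because sorting dd-mm-yyyy text alphabetically does not give chronological order. The CSV is written to an in-memory buffer here so the example has no side effects; passing a file name such as 'file.csv' works the same way.

```python
import io

import pandas as pd

# Hypothetical sample; dates are dd-mm-yyyy strings
df = pd.DataFrame({
    "Date": ["02-05-2014", "13-05-2014", "02-05-2014",
             "19-04-2013", "02-05-2014", "13-05-2014"],
})

# Count how many times each date occurs
counts = df.groupby("Date").size().reset_index(name="count")

# Parse to real dates so that sorting is chronological, then sort ascending
counts["Date"] = pd.to_datetime(counts["Date"], format="%d-%m-%Y")
counts = counts.sort_values("Date")

# Write the result as CSV without the index column
buffer = io.StringIO()
counts.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```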