Pandas: changing the date format of a long column

I'm polishing my code. At one point I want to convert a date given as a string into another string that holds the same date but shows it in a different format.
After each date there is a code, always the same code for a given date.
Here is my df:
import pandas as pd
data = ['2012-06-29 A','2012-08-29 B','2012-10-29 X','2012-10-15 A']*50000
data.sort()
df = pd.DataFrame({'A':data})
A
2012-06-29 A
2012-06-29 A
2012-06-29 A
2012-06-29 A
2012-06-29 A
And here is how I'm doing it now:
df['A'] = df['A'].apply(lambda x: pd.to_datetime(x.split(' ')[0]).strftime('%d %b %Y ') + x.split(' ')[1])
A
29 Jun 2012 A
29 Jun 2012 A
29 Jun 2012 A
29 Jun 2012 A
29 Jun 2012 A
It works fine, but it seems to create a bottleneck (not really, since it is only part of data preparation).
Can it be done better/faster?
In total I have about 15 distinct dates like this per df (and many dfs). I wonder whether creating a dict or a temporary support_df from the unique dates and applying it somehow (how?) through the lambda would avoid the repeated conversions.
Additional info (maybe useful): Column A, later on, becomes part of MultiIndex.

IIUC, my first attempt would be this method; there is no need for apply on the DataFrame:
(pd.to_datetime(df['A'].str.split().str[0]).dt.strftime('%d %b %Y') + ' '
+ df['A'].str.split().str[1])
Second attempt, using a list comprehension instead of the .str accessor:
(pd.to_datetime(pd.Series([i.split()[0] for i in df.A])).dt.strftime('%d %b %Y')
+ ' ' + pd.Series([i.split()[1] for i in df.A]))
Third attempt:
ls = [i.split() for i in df.A]
i,j = zip(*ls)
pd.Series(pd.to_datetime(i).strftime('%d %b %Y')) + ' ' + pd.Series(j)
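The dict idea from the question should also work well here: convert each unique value once and map the results back onto the column. A minimal sketch, assuming the df defined above:
# Build the formatted string once per unique value (~15 here), then map.
mapping = {v: pd.to_datetime(v.split(' ')[0]).strftime('%d %b %Y') + ' ' + v.split(' ')[1]
           for v in df['A'].unique()}
df['A'] = df['A'].map(mapping)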

Related

How to split data with respect to months?

Hi, I have a time-series data set. I would like to make a new column for each month.
data:
creationDate fre skill
2019-02-15T20:43:29Z 14 A
2019-02-15T21:10:32Z 15 B
2019-03-22T07:14:50Z 41 A
2019-03-22T06:47:41Z 64 B
2019-04-11T09:49:46Z 25 A
2019-04-11T09:49:46Z 29 B
output:
skill 2019-02 2019-03 2019-04
A 14 41 25
B 15 64 29
I know I can do it manually like below and make columns (when I have date1_start and date1_end):
dfdate1=data[(data['creationDate'] >= date1_start) & (data['creationDate']<= date1_end)]
But since I have many months, it is not feasible to do this by hand for each month.
Use DataFrame.pivot after converting the datetimes to month periods with Series.dt.to_period:
df['dates'] = pd.to_datetime(df['creationDate']).dt.to_period('m')
df = df.pivot(index='skill', columns='dates', values='fre')
Or convert to custom YYYY-MM strings with Series.dt.strftime:
df['dates'] = pd.to_datetime(df['creationDate']).dt.strftime('%Y-%m')
df = df.pivot(index='skill', columns='dates', values='fre')
EDIT: If you get
ValueError: Index contains duplicate entries, cannot reshape
it means there are duplicates, so use DataFrame.pivot_table with some aggregation, e.g. sum or mean:
df = df.pivot_table(index='skill',columns='dates',values='fre', aggfunc='sum')
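For reference, a self-contained sketch reproducing the sample above (values copied from the question):
import pandas as pd

df = pd.DataFrame({'creationDate': ['2019-02-15T20:43:29Z', '2019-02-15T21:10:32Z',
                                    '2019-03-22T07:14:50Z', '2019-03-22T06:47:41Z',
                                    '2019-04-11T09:49:46Z', '2019-04-11T09:49:46Z'],
                   'fre': [14, 15, 41, 64, 25, 29],
                   'skill': ['A', 'B', 'A', 'B', 'A', 'B']})
# Month labels as YYYY-MM strings, then one column per month.
df['dates'] = pd.to_datetime(df['creationDate']).dt.strftime('%Y-%m')
print(df.pivot(index='skill', columns='dates', values='fre'))
# dates  2019-02  2019-03  2019-04
# skill
# A           14       41       25
# B           15       64       29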

Pandas grouping values and getting most recent date

I have a large CSV file that I read into pandas, which gives me a DataFrame with "Community_Name" and "Date"; it is about 186k lines, with about 120 unique community names and a range of dates. I would like to group the data by community and find the most recent date for each one in the file. I will use this later on to pull data from each community up to that most recent date.
I am struggling with getting the most recent date value for each community. I thought .max() would work, but it returns the greatest value overall rather than for each community...
import csv
import datetime
import pandas as pd

dates_list = []
with open('communitydates.csv', 'r', newline='', encoding='utf-8') as csv_file:
    csv_reader = csv.DictReader(csv_file)
    for line in csv_reader:
        date = line['Date'] + " " + line['Year']
        date = datetime.datetime.strptime(date, '%B %d %Y').strftime('%Y %d %m')
        community_name = line['Community']
        entry = community_name, date
        dates_list.append(entry)

df = pd.DataFrame(dates_list)
df.columns = ["Community", "Date"]
df["Date"] = pd.to_datetime(df["Date"], format='%Y %d %m').max()
grouped_by_community = df.groupby("Community")
recent_date_by_community = grouped_by_community.first()
Ideally I want to convert the DataFrame into a Dictionary or List to do the check later on.
max_dates = recent_date_by_community.to_dict('index')
for k in max_dates:
    print(k, max_dates[k]['Date'])
This currently gives me the following, but the date is the same for all 102 communities instead of the actual date from the file.
Addison 2019-10-09 00:00:00
I assume I am using the .max() statement incorrectly, but I have not been able to figure out how to change it.
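For what it's worth, a minimal sketch of the per-community maximum, assuming the df built above: convert the dates without taking the global max, then aggregate per group.
# Convert only; no .max() here, so each row keeps its own date.
df["Date"] = pd.to_datetime(df["Date"], format='%Y %d %m')
# Most recent date per community, as a dict for the later check.
max_dates = df.groupby("Community")["Date"].max().to_dict()
for community, most_recent in max_dates.items():
    print(community, most_recent)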

Removing Space between bars in seaborn barplot

I am trying to plot the following data. The duration is Jan to Dec, and Type varies from 1 to 7. The key point is that not all types exist for each month. These are not missing values; those types simply do not exist.
Month Type Coef
Jan 1 2.3
Jan 2 2.1
..
Code:
ax = sns.barplot(x = 'Month', y = 'Coef_E',hue = 'LCZ',data = df_E, palette=palette)
Result: (plot image omitted)
I want to remove the spaces marked by the arrows.
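One possible workaround, sketched here with the column names from the sample data (the posted code uses Coef_E and LCZ instead): build one categorical label per existing (Month, Type) pair and drop hue, so seaborn never reserves a slot for a type that does not exist in that month.
import seaborn as sns

# One x label per combination that actually exists; no empty slots remain.
df['bar'] = df['Month'].astype(str) + '-' + df['Type'].astype(str)
ax = sns.barplot(x='bar', y='Coef', data=df)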

Create datetime from columns in a DataFrame

I've got a DataFrame with these columns:
year month day gender births
I'd like to create a new column of type "Date" based on the columns year, month, and day, formatted as "yyyy-mm-dd".
I'm just beginning in Python and I can't figure out how to proceed...
Assuming you are using pandas to create your dataframe, you can try:
>>> import pandas as pd
>>> df = pd.DataFrame({'year':[2015,2016],'month':[2,3],'day':[4,5],'gender':['m','f'],'births':[0,2]})
>>> df['dates'] = pd.to_datetime(df.iloc[:,0:3])
>>> df
year month day gender births dates
0 2015 2 4 m 0 2015-02-04
1 2016 3 5 f 2 2016-03-05
Taken from the pd.to_datetime examples and the "Selection" section (on iloc slicing) of "10 minutes to pandas".
You can use .assign.
For example:
df2 = df.assign(ColumnDate = df.Column1.astype(str) + '-' + df.Column2.astype(str) + '-' + df.Column3.astype(str))
It is simple, and it is much faster than apply with a lambda if you have tonnes of data.
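A small sketch applying the same idea to the year/month/day columns from the question, wrapped in pd.to_datetime so the result is a real datetime rather than a string (the column name dates is just an example):
# Concatenate the parts into 'yyyy-m-d' strings, then parse to datetime.
df2 = df.assign(dates=pd.to_datetime(df.year.astype(str) + '-'
                                     + df.month.astype(str) + '-'
                                     + df.day.astype(str)))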

How to go from relative dates to absolute dates in DataFrame columns

I have a pandas DataFrame containing forward prices for future maturities, quoted on multiple different trading months ('trade date'). Trade dates are given in absolute terms ('January'). The maturities are given in relative terms ('M+1').
How can I convert the maturities into an absolute format, i.e. for trade date 'January', the maturity 'M+1' should say 'February'?
Here is example data:
import pandas as pd
import numpy as np
data_keys = ['trade date', 'm+1', 'm+2', 'm+3']
data = {'trade date':['jan','feb','mar','apr'],
'm+1':np.random.randn(4),
'm+2':np.random.randn(4),
'm+3':np.random.randn(4)}
df = pd.DataFrame(data)
df = df[data_keys]
Starting data:
trade date m+1 m+2 m+3
0 jan -0.446535 -1.012870 -0.839881
1 feb 0.013255 0.265500 1.130098
2 mar 0.406562 -1.122270 -1.851551
3 apr -0.890004 0.752648 0.778100
Result:
Should have Feb, Mar, Apr, May, Jun, Jul in the columns. NaN will be shown in many instances.
The starting DataFrame:
trade date m+1 m+2 m+3
0 jan -1.350746 0.948835 0.579352
1 feb 0.011813 2.020158 -1.221110
2 mar -0.183187 -0.303099 1.323092
3 apr 0.081105 0.662628 -0.703152
Solution:
1. Define a list of all possible absolute dates you will encounter, in chronological order. Do the same for relative dates.
2. Create a function to act on groups coming from df.groupby. The function will convert the column names of each group appropriately to an absolute format.
3. Apply the function. Pandas handles the clever concatenation of all groups.
Code:
abs_in_order = ['jan','feb','mar','apr','may','jun','jul','aug']
rel_in_order = ['m+0','m+1','m+2','m+3','m+4']
def rel2abs(group, abs_in_order, rel_in_order):
    abs_date = group['trade date'].unique()[0]
    l = len(rel_in_order)
    i = abs_in_order.index(abs_date)
    namesmap = dict(zip(rel_in_order, abs_in_order[i:i+l]))
    group.rename(columns=namesmap, inplace=True)
    return group
grouped = df.groupby(['trade date'])
df = grouped.apply(rel2abs, abs_in_order, rel_in_order)
Pandas may mess up the column order. Do this to get back to something in chronological order:
order = ['trade date'] + abs_in_order
cols = [e for e in order if e in df.columns]
df[cols]
Result:
trade date feb mar apr may jun jul
0 jan -1.350746 0.948835 0.579352 NaN NaN NaN
1 feb NaN 0.011813 2.020158 -1.221110 NaN NaN
2 mar NaN NaN -0.183187 -0.303099 1.323092 NaN
3 apr NaN NaN NaN 0.081105 0.662628 -0.703152
Your question doesn't contain enough information to answer it.
You say that the prices are quoted on dates given in absolute terms ('January').
January is not a date, but 2-Jan-2015 is.
What is your actual 'date', and what is its format (i.e. text, datetime.date, pd.Timestamp, etc.)? You can use type(date) to check, where date is whatever object represents the quote date.
The easiest solution is to get your trade dates into pd.Timestamps and then add an offset:
>>> trade_date = pd.Timestamp('2015-1-15')
>>> trade_date + pd.DateOffset(months=1)
Timestamp('2015-02-15 00:00:00')
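Building on that, a small sketch of mapping the relative maturity labels to absolute month names with the same offset (the 'm+N' label format is assumed from the question):
>>> maturities = {'m+%d' % k: (trade_date + pd.DateOffset(months=k)).strftime('%b')
...               for k in (1, 2, 3)}
>>> maturities
{'m+1': 'Feb', 'm+2': 'Mar', 'm+3': 'Apr'}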