In R, how to modify a user-defined function for date parsing - tidyverse

In R, how can I modify the user-defined function my_date_parse below?
In the function, I want to first remove the strings GMT+9 and GMT-8 (the commented-out lines at the beginning) and then parse the dates, but my attempt failed.
library(lubridate)
library(tidyverse)
# raw data
df <- data.frame(
  date = c("1 Feb 2022 11:48:42 pm GMT+9", "1 Feb 2022 9:41:56 am GMT-8",
           "Feb 1, 2022 6:19:26 a.m. PST", "Feb 1, 2022 1:22:37 a.m. PST",
           "Feb 12, 2022 7:54:32 a.m. PST", "31.01.2022 23:11:54 UTC",
           "Feb 1, 2022 12:00:47 AM PST", "Feb 1, 2022 12:44:28 PM PST",
           "Feb 1, 2022 12:20:22 AM PST"),
  country = c("AU", "AU", "CA", "CA", "CA", "DE", "US", "US", "US"))
# User-defined function
my_date_parse <- function(country, date){
  # These attempts fail because [] defines a regex character class, so the
  # pattern removes every G, M, T, +, 9 (or 8) anywhere in the string:
  # date <- str_replace_all(date, '([GMT+9])', "")
  # date <- str_replace_all(date, '([GMT-8])', "")
  case_when(
    country == 'US' ~ mdy_hms(date),
    country == 'DE' ~ dmy_hms(date),
    country == 'CA' ~ mdy_hms(date),
    country == 'AU' ~ dmy_hms(date)
  )
}
# parse the date with User-defined function
df %>% mutate(date_parsed=my_date_parse(country,date))


forecasting in time series in R with dplyr and ggplot

Hope all goes well.
I have a data set; I can share a small piece of it:
date = c("2022-08-01", "2022-08-02", "2022-08-03", "2022-08-04",
         "2022-08-05", "2022-08-06")
sold_items = c(12, 18, 9, 31, 19, 10)
df <- data.frame(date = as.Date(date), sold_items)
df %>% sample_n(5)
date sold_items
1 2022-08-04 31
2 2022-08-03 9
3 2022-08-01 12
4 2022-08-06 10
5 2022-08-02 18
I need to forecast the number of sold items over the next two weeks (14 days after the last available date in the data).
I also need to show the forecasted data along with the current data on one graph using ggplot.
I have been looking into the forecast package to use ARIMA, but I am lost and could not convert this data to a time series object.
I wonder if someone can provide a solution with dplyr to my problem.
Thank you very much.
# first create df
library(fpp3)  # attaches tibble, tsibble, fable and the ARIMA() model

df <-
  tibble(
    sold = c(12, 18, 9, 31, 19, 10),
    date = seq(as.Date("2022-08-01"),
               by = "day",
               length.out = length(sold))) %>%
  relocate(date)

# then coerce to a tsibble object and model:
df %>%
  as_tsibble(index = date) %>%
  model(ARIMA(sold)) %>%
  forecast(h = 14)
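If you also need the ggplot overlay the question asks for, fabletools (attached by fpp3) provides a ggplot2-based autoplot() method for forecast objects, so piping the forecast into autoplot(df) should draw the 14-day forecast on top of the existing series.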

drop_duplicates isn't working on my imported CSV file

Looking for some help on this one. I do not know why, but drop_duplicates is not working; I tried a loop with a lambda, and still nothing I do will remove the multiple duplicates from the output.
# Import files for use in the program:
import pandas as pd
import os
import matplotlib.pyplot as plt
import matplotlib.font_manager as fm
import matplotlib.dates as mdates
import numpy as np
import csv
# Import CSV files into a data frame
Crash_Data_df = pd.read_csv("crash_data.csv",encoding='UTF-8')
#split date column
Crash_Data_df[["Day", "Month", "DayNum","Time","Zone","Year"]] =
Crash_Data_df["Date"].str.split(" ", n = 6, expand = True)
#define max and min dates
d_max=Crash_Data_df["Date"].min()
d_min=Crash_Data_df["Date"].max()
#split name column
Crash_Data_df[["A","B"]] = Crash_Data_df["Name"].str.split("_|2018100", n =
2, expand = True)
#Drop time zone
Crash_Data_df.drop(['Zone'], axis = 1, inplace = True)
Crash_Data_df.reset_index(drop=True)
# group by unnamed column
Crash_Data_df = Crash_Data_df.loc[Crash_Data_df['Unnamed: 0'] == 0, :]
#del columns
del Crash_Data_df['Unnamed: 0']
del Crash_Data_df['Name']
del Crash_Data_df['A']
Crash_Data_df = Crash_Data_df.loc[Crash_Data_df['B'] != 9954815, :]
Crash_Data_df = Crash_Data_df.dropna(how='any')
Crash_Data_df.drop_duplicates(subset=['Time'], keep=False)
Crash_Data_df.sort_values(by=['B'])
Crash_Data_df.reset_index(drop=True)
Crash_Data_df = Crash_Data_df.rename(columns=
{'B':'ID','Date':'DATE','Direction':'DIRECTION','Road':'ROAD',
'Location':'LOCATION','Event':'EVENT','Day':'DAY',
'Month':'MONTH','DayNum':'DAYNUM','Time':'TIME','Year':'YEAR'})
Crash_Data_df.set_index('ID', inplace=True,drop=True)
Crash_Data_df.to_csv("crash_data_check.csv", index=False, header=True)
Crash_Data_df.drop_duplicates()
Crash_Data_df.groupby("ID").filter(lambda x: len(x) > 1)
Crash_Data_df.head()
The ID duplicates are not dropped; I tried different columns, no deal. The output looks like:
DATE DIRECTION ROAD LOCATION EVENT DAY MONTH DAYNUM TIME YEAR
ID
9954815 Sun Oct 07 03:35:22 CDT 2018 WB T.H.62 # T.H.100 NB CRASH Sun Oct 07 03:35:22 2018
9954815 Sun Oct 07 03:35:22 CDT 2018 WB T.H.62 # T.H.100 NB CRASH Sun Oct 07 03:35:22 2018
9954815 Sun Oct 07 03:35:22 CDT 2018 WB T.H.62 # T.H.100 NB CRASH Sun Oct 07 03:35:22 2018
9954815 Sun Oct 07 03:35:22 CDT 2018 WB T.H.62 # T.H.100 NB CRASH Sun Oct 07 03:35:22 2018
9954815 Sun Oct 07 03:35:22 CDT 2018 WB T.H.62 # T.H.100 NB CRASH Sun Oct 07 03:35:22 2018
9954815 Sun Oct 07 03:35:22 CDT 2018 WB T.H.62 # T.H.100 NB CRASH Sun Oct 07 03:35:22 2018
DATE object
DIRECTION object
ROAD object
LOCATION object
EVENT object
DAY object
MONTH object
DAYNUM object
TIME object
YEAR object
dtype: object
Because .drop_duplicates returns a copy of the df, you want to either assign the result back to the df variable or do the drop with inplace=True.
Try:
Crash_Data_df = Crash_Data_df.drop_duplicates(subset=['Time'], keep=False)
Or
Crash_Data_df.drop_duplicates(subset=['Time'], keep=False, inplace=True)
Both should work.
BTW, the same applies to the other drop_duplicates call (and to the unassigned sort_values and reset_index calls, which likewise return new objects).
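A minimal sketch of the copy-versus-assignment difference, using toy data rather than the asker's CSV:
import pandas as pd

df = pd.DataFrame({"Time": ["03:35:22", "03:35:22", "04:10:00"]})

df.drop_duplicates(subset=["Time"], keep=False)       # returned copy is discarded
print(len(df))                                        # still 3

df = df.drop_duplicates(subset=["Time"], keep=False)  # assign the result back
print(len(df))                                        # 1: keep=False drops every duplicated row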

Pandas: changing the date format of a long column

I'm polishing my code. At one point I want to convert a date given as a string to another string that holds the same date but shows it in a different format.
After each date there is a code, and it is always the same code for a given date.
Here is my df:
import pandas as pd
data = ['2012-06-29 A','2012-08-29 B','2012-10-29 X','2012-10-15 A']*50000
data.sort()
df = pd.DataFrame({'A':data})
A
2012-06-29 A
2012-06-29 A
2012-06-29 A
2012-06-29 A
2012-06-29 A
And here is how I'm doing it now:
df['A'] = df['A'].apply(lambda x: pd.to_datetime(x.split(' ')[0]).strftime('%d %b %Y ') + x.split(' ')[1])
A
29 Jun 2012 A
29 Jun 2012 A
29 Jun 2012 A
29 Jun 2012 A
29 Jun 2012 A
It works fine, but it seems to create a bottleneck (not a severe one; it is only part of the data preparation).
Can it be done better/faster?
In total I have about 15 dates like this per df (and many dfs). I wonder about creating a dict or a temporary support_df from the unique dates and applying it somehow (how?) through a lambda, to avoid converting the same date many times.
Additional info (maybe useful): column A later becomes part of a MultiIndex.
IIUC, my first attempt would be this method:
No need to apply on the dataframe:
(pd.to_datetime(df['A'].str.split().str[0]).dt.strftime('%d %b %Y') + ' '
+ df['A'].str.split().str[1])
Second attempt using list comprehension instead of .str accessor:
(pd.to_datetime(pd.Series([i.split()[0] for i in df.A])).dt.strftime('%d %b %Y')
+ ' ' + pd.Series([i.split()[1] for i in df.A]))
Third attempt:
ls = [i.split() for i in df.A]
i,j = zip(*ls)
pd.Series(pd.to_datetime(i).strftime('%d %b %Y')) + ' ' + pd.Series(j)
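Since the question raises the dictionary idea: a sketch that converts each unique value exactly once and maps the results back onto the full column (worthwhile here because every date repeats thousands of times):
mapping = {
    v: pd.to_datetime(v.split(' ')[0]).strftime('%d %b %Y ') + v.split(' ')[1]
    for v in df['A'].unique()
}
df['A'] = df['A'].map(mapping)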

Append a tuple to a dataframe as a row

I am looking for a solution to add rows to a dataframe. Here is the data I have:
A grouped object (obtained by grouping a dataframe on month and year, i.e. in this grouped object the key is [month, year] and the value is all the rows / dates in that month and year).
I want to extract all the month, year combinations and put them in a new dataframe. Issue: when I iterate over the grouped object, the key is a tuple, so I converted the tuple into a list and added it to a dataframe using the append command. Instead of getting added as rows:
1 2014
2 2014
3 2014
it got added in one column
0 1
1 2014
0 2
1 2014
0 3
1 2014
...
I want to store these values in a new dataframe. Here is how I want the new dataframe to be :
month year
1 2014
2 2014
3 2014
I tried converting the tuple to a list and then tried various other things like pivoting. Inputs would be really helpful.
Here is the sample code :
df = df.groupby(['month','year'])
df = pd.DataFrame()
for key, value in df:
    print "type of key is:", type(key)
    print "type of list(key) is:", type(list(key))
    df = df.append(list(key))
print df
When you do the groupby, the resulting MultiIndex is available as:
In [11]: df = pd.DataFrame([[1, 2014, 42], [1, 2014, 44], [2, 2014, 23]], columns=['month', 'year', 'val'])
In [12]: df
Out[12]:
month year val
0 1 2014 42
1 1 2014 44
2 2 2014 23
In [13]: g = df.groupby(['month', 'year'])
In [14]: g.grouper.result_index
Out[14]:
MultiIndex(levels=[[1, 2], [2014]],
labels=[[0, 1], [0, 0]],
names=['month', 'year'])
Often this will be sufficient, and you won't need a DataFrame. If you do, one way is the following:
In [21]: pd.DataFrame(index=g.grouper.result_index).reset_index()
Out[21]:
month year
0 1 2014
1 2 2014
I thought there was a method to get this, but can't recall it.
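(For what it's worth, newer pandas does have a direct method for this; assuming pandas 0.24 or later, MultiIndex.to_frame does it in one step:)
g.grouper.result_index.to_frame(index=False)  # DataFrame with 'month' and 'year' columns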
If you really want the tuples you can use .values or to_series:
In [31]: g.grouper.result_index.values
Out[31]: array([(1, 2014), (2, 2014)], dtype=object)
In [32]: g.grouper.result_index.to_series()
Out[32]:
month year
1 2014 (1, 2014)
2 2014 (2, 2014)
dtype: object
You had initially declared both the groupby and the empty dataframe as df. Here's a modified version of your code that allows you to append a key tuple as a dataframe row:
g = df.groupby(['month','year'])
df = pd.DataFrame()
for (key1, key2), value in g:
    row_series = pd.Series((key1, key2), index=['month','year'])
    df = df.append(row_series, ignore_index=True)
print df
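(As an aside: DataFrame.append was removed in pandas 2.0. On current pandas, one way to build the same frame in a single call is from the group keys; a sketch:)
pd.DataFrame(list(g.groups.keys()), columns=['month', 'year'])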
If all you want are the unique values, you could use drop_duplicates
In [29]: df[['month','year']].drop_duplicates()
Out[29]:
month year
0 1 2014
2 2 2014

How to go from relative dates to absolute dates in DataFrame columns

I have a pandas DataFrame containing forward prices for future maturities, quoted on multiple different trading months ('trade date'). Trade dates are given in absolute terms ('January'). The maturities are given in relative terms ('M+1').
How can I convert the maturities into an absolute format, i.e. in trade date 'January' the maturity 'M+1' should say 'February'?
Here is example data:
import pandas as pd
import numpy as np
data_keys = ['trade date', 'm+1', 'm+2', 'm+3']
data = {'trade date':['jan','feb','mar','apr'],
'm+1':np.random.randn(4),
'm+2':np.random.randn(4),
'm+3':np.random.randn(4)}
df = pd.DataFrame(data)
df = df[data_keys]
Starting data:
trade date m+1 m+2 m+3
0 jan -0.446535 -1.012870 -0.839881
1 feb 0.013255 0.265500 1.130098
2 mar 0.406562 -1.122270 -1.851551
3 apr -0.890004 0.752648 0.778100
Result:
The columns should be Feb, Mar, Apr, May, Jun, Jul; NaN will appear in many cells.
The starting DataFrame:
trade date m+1 m+2 m+3
0 jan -1.350746 0.948835 0.579352
1 feb 0.011813 2.020158 -1.221110
2 mar -0.183187 -0.303099 1.323092
3 apr 0.081105 0.662628 -0.703152
Solution:
Define a list of all possible absolute dates you will encounter, in chronological order. Do the same for the relative dates.
Create a function to act on the groups coming from df.groupby. The function will rename the columns of each group to the appropriate absolute format.
Apply the function; pandas handles the concatenation of all the groups.
Code:
abs_in_order = ['jan','feb','mar','apr','may','jun','jul','aug']
rel_in_order = ['m+0','m+1','m+2','m+3','m+4']

def rel2abs(group, abs_in_order, rel_in_order):
    # each group holds a single trade date, e.g. 'jan'
    abs_date = group['trade date'].unique()[0]
    l = len(rel_in_order)
    i = abs_in_order.index(abs_date)
    # map each relative maturity to the month that many steps past the trade date
    namesmap = dict(zip(rel_in_order, abs_in_order[i:i+l]))
    group.rename(columns=namesmap, inplace=True)
    return group

grouped = df.groupby(['trade date'])
df = grouped.apply(rel2abs, abs_in_order, rel_in_order)
Pandas may mess up the column order. Do this to get back to something in chronological order:
order = ['trade date'] + abs_in_order
cols = [e for e in order if e in df.columns]
df[cols]
Result:
trade date feb mar apr may jun jul
0 jan -1.350746 0.948835 0.579352 NaN NaN NaN
1 feb NaN 0.011813 2.020158 -1.221110 NaN NaN
2 mar NaN NaN -0.183187 -0.303099 1.323092 NaN
3 apr NaN NaN NaN 0.081105 0.662628 -0.703152
Your question doesn't contain enough information to answer it.
You say that the prices are quoted on dates given in absolute terms ('January').
January is not a date, but 2-Jan-2015 is.
What is your actual 'date', and what is its format (i.e. text, datetime.date, pd.Timestamp, etc.)? You can call type(date) on one of your quote dates to check.
The easiest solution is to get your trade dates into pd.Timestamps and then add an offset:
>>> trade_date = pd.Timestamp('2015-1-15')
>>> trade_date + pd.DateOffset(months=1)
Timestamp('2015-02-15 00:00:00')
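(The offset also works element-wise, so a whole column of trade dates can be shifted at once; a sketch, assuming the dates are already parsed:)
>>> trade_dates = pd.to_datetime(pd.Series(['2015-01-15', '2015-02-15']))
>>> trade_dates + pd.DateOffset(months=1)
0   2015-02-15
1   2015-03-15
dtype: datetime64[ns]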