Transform a data frame into a time series when the date column is POSIXct

I have a data frame with the following two variables:
amount: num 1213.5 34.5 ...
txn_date: POSIXct, format "2017-05-01 12:13:30" ...
I want to transform it into a time series using ts().
I started using this code:
Z <- zoo(data$amount, order.by=as.Date(as.character(data$txn_date), format="%Y/%m/%d %H:%M:%S"))
But the problem is that I lose the dates in Z: all of them are reported as NA.
How can I solve it?
For my analysis it is important to keep the date in the format %Y/%m/%d %H:%M:%S,
for example 2017-05-01 12:13:30. I don't want to remove the time component from the variable txn_date.
Thank you for your help,
Andrea

I think your problem comes from the way you're manipulating your data frame. Could you post more details about it, please?
I think I have a fix for you.
Data frame I used:
> df1
$data
value
1 1.9150
2 3.1025
3 6.7400
4 8.5025
5 11.0025
6 9.8025
7 9.0775
8 7.0900
9 6.8525
10 7.4900
$date
%Y-%m-%d
1 1974-01-01
2 1974-01-02
3 1974-01-03
4 1974-01-04
5 1974-01-05
6 1974-01-06
7 1974-01-07
8 1974-01-08
9 1974-01-09
10 1974-01-10
> class(df1$data$value)
[1] "numeric"
> class(df1$date$`%Y-%m-%d`)
[1] "POSIXct" "POSIXt"
Then I can create a time series by calling zoo like this:
> Z<-zoo(df1$data,order.by=(as.POSIXct(df1$date$`%Y-%m-%d`)))
> Z
value
1974-01-01 1.9150
1974-01-02 3.1025
1974-01-03 6.7400
1974-01-04 8.5025
1974-01-05 11.0025
1974-01-06 9.8025
1974-01-07 9.0775
1974-01-08 7.0900
1974-01-09 6.8525
1974-01-10 7.4900
The important thing here is that I use df1$date$`%Y-%m-%d` instead of just
df1$date
In fact, if I try it the way you did, I get NA values too:
> Z<-zoo(df1$data,order.by=as.POSIXct(as.Date(as.character(df1$date),format("%Y-%m-%d"))))
> Z
value
<NA> 1.915
To get the column name inside data$txn_date, you can use names(data$txn_date), and then try my solution with your data frame and that name.
> names(df1$date)
[1] "%Y-%m-%d"

Related

Changing column name and its values at the same time

Pandas help!
I have a specific column like this,
Mpg
0 18
1 17
2 19
3 21
4 16
5 15
Mpg is miles per gallon.
Now I need to rename that 'Mpg' column to 'litre per 100 km' and convert its values to litres per 100 km at the same time. Any help? Thanks in advance.
-Tom
I managed to change the name of the column, but I could not do both at once.
Use pop to return and delete the column at the same time and rdiv to perform the conversion (litres per 100 km = 235.15 / mpg; a more precise constant for US gallons is 235.215):
df['litre per 100 km'] = df.pop('Mpg').rdiv(235.15)
If you want to insert the column in the same position:
df.insert(df.columns.get_loc('Mpg'), 'litre per 100 km',
          df.pop('Mpg').rdiv(235.15))
Output:
litre per 100 km
0 13.063889
1 13.832353
2 12.376316
3 11.197619
4 14.696875
5 15.676667
An alternative to pop would be to store the result in another dataframe. This way you can perform the two steps at the same time. In my code below, I first reproduce your dataframe, then store the constant for conversion and perform it on all entries using the apply method.
import pandas as pd
df = pd.DataFrame({'Mpg':[18,17,19,21,16,15]})
cc = 235.214583 # constant for conversion from mpg to L/100km
df2 = pd.DataFrame()
df2['litre per 100 km'] = df['Mpg'].apply(lambda x: cc/x)
print(df2)
The output of this code is:
litre per 100 km
0 13.067477
1 13.836152
2 12.379715
3 11.200694
4 14.700911
5 15.680972
as expected.
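Another option, sketched here under some assumptions, is to overwrite the values in place and then rename the column, which keeps the column in its original position without needing insert. The extra 'Cyl' column is hypothetical and only there to show that the position is preserved; the constant assumes US gallons.

```python
import pandas as pd

# Conversion constant for US gallons: L/100km = 235.214583 / mpg
MPG_TO_L_PER_100KM = 235.214583

# 'Cyl' is a made-up second column, used only to show column order is kept
df = pd.DataFrame({"Mpg": [18, 17, 19, 21, 16, 15],
                   "Cyl": [8, 8, 8, 6, 8, 8]})

# Step 1: convert the values while the column still has its old name
df["Mpg"] = MPG_TO_L_PER_100KM / df["Mpg"]

# Step 2: rename the column; its position in the frame does not change
df = df.rename(columns={"Mpg": "litre per 100 km"})

print(df)
```

This is two statements rather than one, but it avoids building a second dataframe.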

Calculating the difference between values based on their date

I have a dataframe that looks like this, where the "Date" is set as the index
A B C D E
Date
1999-01-01 1 2 3 4 5
1999-01-02 1 2 3 4 5
1999-01-03 1 2 3 4 5
1999-01-04 1 2 3 4 5
I'm trying to compare the percent difference between two pairs of dates. I think I can do the first bit:
start_1 = "1999-01-02"
end_1 = "1999-01-03"
start_2 = "1999-01-03"
end_2 = "1999-01-04"
Obs_1 = df.loc[end_1] / df.loc[start_1] -1
Obs_2 = df.loc[end_2] / df.loc[start_2] -1
The output I get for, e.g., Obs_1 looks like this:
A 0.011197
B 0.007933
C 0.012850
D 0.016678
E 0.007330
dtype: float64
I'm looking to build some correlations between Obs_1 and Obs_2. I think I need to create a new dataframe with the labels A-E as one column (or as the index), and then the data series from Obs_1 and Obs_2 as adjacent columns.
But I'm struggling! I can't 'see' what Obs_1 and Obs_2 'are' - have I created a list? A series? How can I tell? What would be the best way of combining the two into a single dataframe...say df_1.
I'm sure the answer is staring me in the face but I'm going mental trying to figure it out...and because I'm not quite sure what Obs_1 and Obs_2 'are', it's hard to search the SO archive to help me.
Thanks in advance
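A minimal sketch of one way to check this and combine the results, assuming a small made-up dataframe (the values below are invented for illustration, not the asker's data). Selecting a single row with df.loc returns a pandas Series indexed by the column labels, which type() confirms; pd.concat with axis=1 then places the two Series side by side.

```python
import pandas as pd

# Made-up data with a DatetimeIndex named "Date"
df = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0, 5.0],
     "B": [2.0, 3.0, 5.0, 6.0],
     "C": [3.0, 4.0, 6.0, 9.0]},
    index=pd.to_datetime(["1999-01-01", "1999-01-02",
                          "1999-01-03", "1999-01-04"]),
)
df.index.name = "Date"

# Each of these is a Series whose index is the column labels A, B, C
obs_1 = df.loc["1999-01-03"] / df.loc["1999-01-02"] - 1
obs_2 = df.loc["1999-01-04"] / df.loc["1999-01-03"] - 1
print(type(obs_1))

# Combine into one dataframe: labels as the index, one column per observation
df_1 = pd.concat([obs_1, obs_2], axis=1, keys=["Obs_1", "Obs_2"])
print(df_1)

# Pearson correlation between the two columns
corr = df_1["Obs_1"].corr(df_1["Obs_2"])
print(corr)
```

The dtype: float64 footer in the printed output is another hint that the object is a Series rather than a list.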

Unexpected groupby result: some rows are missing

I am facing an issue with transforming my data using Pandas' groupby. I have a table (several million rows and 3 variables) that I am trying to group by "Date" variable.
Snippet from a raw table:
Date V1 V2
07_19_2017_17_00_06 10 5
07_19_2017_17_00_06 20 6
07_19_2017_17_00_08 15 3
...
01_07_2019_14_06_59 30 1
01_07_2019_14_06_59 40 2
The goal is to group rows with the same value of "Date" by applying a mean function over V1 and sum function over V2. So that the expected result resembles:
Date V1 V2
07_19_2017_17_00_06 15 11 # This row has changed
07_19_2017_17_00_08 15 3
...
01_07_2019_14_06_59 35 3 # and this one too!
My code:
df = df.groupby(['Date'], as_index=False).agg({'V1': 'mean', 'V2': 'sum'})
The output I am getting, however, is totally unexpected and I can't find a reasonable explanation for why it happens. It seems like Pandas is only processing data from 01_01_2018_00_00_01 to 12_31_2018_23_58_40, instead of 07_19_2017_17_00_06 to 01_07_2019_14_06_59.
Date V1 V2
01_01_2018_00_00_01 30 3
01_01_2018_00_00_02 20 4
...
12_31_2018_23_58_35 15 3
12_31_2018_23_58_40 16 11
If you have any clue, I would really appreciate your input. Thank you!
I suspect that the issue comes from Pandas not recognizing the date format that I've used. The solution turned out to be quite simple: convert all of the dates into UNIX time format, divide by 60, and then repeat the groupby procedure.
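One likely explanation is that the "Date" values are plain strings, and groupby sorts string keys lexicographically by default. With a month-first format, "01_01_2018..." then sorts first and "12_31_2018..." last, so the 2017 and 2019 rows may simply sit in the middle of the result rather than being missing. A minimal sketch, assuming the format string below matches the real data, parses the column explicitly before grouping:

```python
import pandas as pd

# Small sample mirroring the rows shown in the question
df = pd.DataFrame({
    "Date": ["07_19_2017_17_00_06", "07_19_2017_17_00_06",
             "07_19_2017_17_00_08",
             "01_07_2019_14_06_59", "01_07_2019_14_06_59"],
    "V1": [10, 20, 15, 30, 40],
    "V2": [5, 6, 3, 1, 2],
})

# Parse month_day_year_hour_minute_second into real timestamps
df["Date"] = pd.to_datetime(df["Date"], format="%m_%d_%Y_%H_%M_%S")

# Group on the parsed timestamps; keys now sort chronologically
out = df.groupby("Date", as_index=False).agg({"V1": "mean", "V2": "sum"})
print(out)
```

With real timestamps as keys, the first and last rows of the result correspond to the earliest and latest dates in the data.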

find closest match within a vector to fill missing values using dplyr

A dummy dataset is :
data <- data.frame(
group = c(1,1,1,1,1,2),
dates = as.Date(c("2005-01-01", "2006-05-01", "2007-05-01","2004-08-01",
"2005-03-01","2010-02-01")),
value = c(10,20,NA,40,NA,5)
)
For each group, the missing values need to be filled with the non-missing value corresponding to the nearest date within the same group. In case of a tie, pick any.
I am using dplyr. I tried which.closest from birk, but it needs a vector and a single value. How can I look up each date within a vector without writing loops? An SQL solution would also do.
Any pointers to the solution?
Maybe something like: value = value[match(which.closest(dates,THISdate) & !is.na(value))]
I am not sure how to specify THISdate.
Edit: The expected value vector should look like:
value = c(10,20,20,40,10,5)
Using knn1 (nearest neighbor) from the class package (which comes with R, so there is no need to install it) and dplyr, define an na.knn1 function which replaces each NA value in x with the non-NA x value having the closest time.
library(class)
library(dplyr)
na.knn1 <- function(x, time) {
is_na <- is.na(x)
if (sum(is_na) == 0 || all(is_na)) return(x)
train <- matrix(time[!is_na])
test <- matrix(time[is_na])
cl <- x[!is_na]
x[is_na] <- as.numeric(as.character(knn1(train, test, cl)))
x
}
data %>% mutate(value = na.knn1(value, dates))
giving:
group dates value
1 1 2005-01-01 10
2 1 2006-05-01 20
3 1 2007-05-01 20
4 1 2004-08-01 40
5 1 2005-03-01 10
6 2 2010-02-01 5
Add an appropriate group_by if the intention was to do this by group.
You can use sapply to find the closest values, since the x argument of which.closest only takes a single value.
First create a vector vect in which the dates with no values are replaced with NA, and use it within the which.closest function.
library(birk)
vect=replace(data$dates,which(is.na(data$value)),NA)
transform(data,value=value[sapply(dates,which.closest,vec=vect)])
group dates value
1 1 2005-01-01 10
2 1 2006-05-01 20
3 1 2007-05-01 20
4 1 2004-08-01 40
5 1 2005-03-01 10
6 2 2010-02-01 5
If which.closest accepted a vector for x, there would be no need for sapply, but this is not the case.
Using the dplyr package:
library(birk)
library(dplyr)
data%>%mutate(vect=`is.na<-`(dates,is.na(value)),
value=value[sapply(dates,which.closest,vect)])%>%
select(-vect)

pd.datetime is failing to convert to date

I have a data frame with a column 'Date' of string type. Since I want to use the column 'Date' as the index, I first wanted to convert it to datetime, so I did:
data['Date'] = pd.to_datetime(data['Date'])
then I did,
data = data.set_index('Date')
but when I tried to do
data = data.loc['01/06/2006':'09/06/2006',]
the slicing does not happen. There is no error, but the data is not sliced. I then tried with iloc:
data = data.iloc['01/06/2006':'09/06/2006',]
and the error message is the following:
TypeError: cannot do slice indexing on <class 'pandas.tseries.index.DatetimeIndex'> with these indexers [01/06/2006] of <type 'str'>
So I came to the conclusion that pd.to_datetime didn't work, even though no error was raised?
Can anybody clarify what is going on? Thanks in advance
It seems you need to change the order of the datetime string to YYYY-MM-DD:
data = data.loc['2006-06-01':'2006-06-09']
Sample:
data = pd.DataFrame({'col':range(15)}, index=pd.date_range('2006-06-01','2006-06-15'))
print (data)
col
2006-06-01 0
2006-06-02 1
2006-06-03 2
2006-06-04 3
2006-06-05 4
2006-06-06 5
2006-06-07 6
2006-06-08 7
2006-06-09 8
2006-06-10 9
2006-06-11 10
2006-06-12 11
2006-06-13 12
2006-06-14 13
2006-06-15 14
data = data.loc['2006-06-01':'2006-06-09']
print (data)
col
2006-06-01 0
2006-06-02 1
2006-06-03 2
2006-06-04 3
2006-06-05 4
2006-06-06 5
2006-06-07 6
2006-06-08 7
2006-06-09 8
Since what I want is to create a new DataFrame with specific dates from the original DataFrame, I set the column 'Date' as the index:
data = data.set_index(data['Date'])
And then just create the new DataFrame using loc:
data1 = data.loc['01/06/2006':'09/06/2006']
I am quite new to Python and I thought that I needed to convert the string column 'Date' to datetime, but apparently that is not necessary. Thanks for your help @jezrael
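A note of caution on slicing a plain string index: the labels are compared as text, so the result depends on how the strings happen to sort rather than on the calendar. A minimal sketch of the datetime route, assuming the original strings are day-first (DD/MM/YYYY, which is a guess from the question), parses them with an explicit format and slices with unambiguous ISO strings. The sample frame is made up for illustration.

```python
import pandas as pd

# Made-up sample: 1 June 2006 to 15 June 2006 as day-first strings
raw = pd.DataFrame({
    "Date": ["%02d/06/2006" % day for day in range(1, 16)],
    "col": range(15),
})

# An explicit format removes the day-first vs month-first ambiguity
# (pd.to_datetime defaults to dayfirst=False)
raw["Date"] = pd.to_datetime(raw["Date"], format="%d/%m/%Y")

# Use the parsed dates as a sorted DatetimeIndex
data = raw.set_index("Date").sort_index()

# ISO strings are unambiguous; both endpoints are included in the slice
subset = data.loc["2006-06-01":"2006-06-09"]
print(subset)
```

If the strings are actually month-first, the format would be "%m/%d/%Y" instead.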