pd.datetime is failing to convert to date - dataframe

I have a data frame with a column 'Date' of string type. Since I want to use 'Date' as the index, I first converted it to datetime:
data['Date'] = pd.to_datetime(data['Date'])
then I did,
data = data.set_index('Date')
but when I tried to do
data = data.loc['01/06/2006':'09/06/2006',]
no slicing takes place: no error is raised, but the data frame comes back unsliced. I also tried with iloc:
data = data.iloc['01/06/2006':'09/06/2006',]
and the error message is the following:
TypeError: cannot do slice indexing on <class 'pandas.tseries.index.DatetimeIndex'> with these indexers [01/06/2006] of <type 'str'>
So I come to the conclusion that the pd.to_datetime didn't work, even though no Error was raised?
Can anybody clarify what is going on? Thanks in advance

It seems you need to change the order of the datetime string to YYYY-MM-DD:
data = data.loc['2006-06-01':'2006-06-09']
Sample:
data = pd.DataFrame({'col':range(15)}, index=pd.date_range('2006-06-01','2006-06-15'))
print (data)
col
2006-06-01 0
2006-06-02 1
2006-06-03 2
2006-06-04 3
2006-06-05 4
2006-06-06 5
2006-06-07 6
2006-06-08 7
2006-06-09 8
2006-06-10 9
2006-06-11 10
2006-06-12 11
2006-06-13 12
2006-06-14 13
2006-06-15 14
data = data.loc['2006-06-01':'2006-06-09']
print (data)
col
2006-06-01 0
2006-06-02 1
2006-06-03 2
2006-06-04 3
2006-06-05 4
2006-06-06 5
2006-06-07 6
2006-06-08 7
2006-06-09 8
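A minimal sketch of the whole round trip, assuming the original strings are day-first (DD/MM/YYYY); the sample rows and the names raw, indexed and subset are made up for illustration. Without an explicit format, pandas reads '01/06/2006' month-first (January 6), which may be why the slice appeared to do nothing. Note also that iloc accepts integer positions only, which is why it raised a TypeError for string labels.

```python
import pandas as pd

# Hypothetical sample: day-first date strings, as the question's data appears to be.
raw = pd.DataFrame({
    "Date": ["01/06/2006", "05/06/2006", "09/06/2006", "15/06/2006"],
    "col": [0, 1, 2, 3],
})

# An explicit format removes the day/month ambiguity.
raw["Date"] = pd.to_datetime(raw["Date"], format="%d/%m/%Y")
indexed = raw.set_index("Date").sort_index()

# ISO strings (YYYY-MM-DD) are unambiguous for label slicing with .loc;
# both ends of the slice are inclusive.
subset = indexed.loc["2006-06-01":"2006-06-09"]
print(subset)
```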

As what I want is to create a new DataFrame with specific dates from the original DataFrame, I set the column 'Date' as the index
data = data.set_index(data['Date'])
And then just create the new Data Frame using loc
data1 = data.loc['01/06/2006':'09/06/2006']
I am quite new to Python and I thought that I needed to convert the column 'Date', which is a string, to datetime, but apparently that is not necessary. Thanks for your help @jezrael

Related

convert to datetime based on condition

I want to convert my datetime object into seconds
0 49:36.5
1 50:13.7
2 50:35.8
3 50:37.4
4 50:39.3
...
92 1:00:47.8
93 1:01:07.7
94 1:02:15.3
95 1:05:03.0
96 1:05:29.6
Name: Finish, Length: 97, dtype: object
The problem is that the format changes at index 92, which results in an error: ValueError: expected hh:mm:ss format before .
This error is raised when I try to convert the column to seconds:
filt_data["F"] = pd.to_timedelta('00:'+filt_data["Finish"]).dt.total_seconds()
When I do the conversion in two steps it works, but it results in two different columns which I don't know how to merge, and it does not seem very efficient:
filt_data["F1"] = pd.to_timedelta('00:'+filt_data["Finish"].loc[0:89]).dt.total_seconds()
filt_data["F2"] = pd.to_timedelta('0'+filt_data["Finish"].loc[90:97]).dt.total_seconds()
The above code does not cause any error and gets the job done, but it results in two different columns. Any idea how to do this?
Ideally I would like to loop through the column and, based on the format, i.e. "50:39.3" or "1:00:47.8", add "00:" or "0" to the object.
I would use str.replace:
pd.to_timedelta(df['Finish'].str.replace(r'^(\d+:\d+\.\d+)', r'0:\1', regex=True))
Or str.count and map:
pd.to_timedelta(df['Finish'].str.count(':').map({1: '0:', 2: ''}).add(df['Finish']))
Output:
0 0 days 00:49:36.500000
1 0 days 00:50:13.700000
2 0 days 00:50:35.800000
3 0 days 00:50:37.400000
4 0 days 00:50:39.300000
92 0 days 01:00:47.800000
93 0 days 01:01:07.700000
94 0 days 01:02:15.300000
95 0 days 01:05:03
96 0 days 01:05:29.600000
Name: Finish, dtype: timedelta64[ns]
Given your data:
import pandas as pd
times = [
"49:36.5",
"50:13.7",
"50:35.8",
"50:37.4",
"50:39.3",
"1:00:47.8",
"1:01:07.7",
"1:02:15.3",
"1:05:03.0",
"1:05:29.6",
]
df = pd.DataFrame({'time': times})
df
You can write a function that you apply on each separate entry in the time column:
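def format_time(time):
    # keep only the part before the decimal point (fractional seconds are dropped)
    time = time.split('.')[0]
    time = time.split(':')
    # prepend an hours field when only minutes and seconds are present
    if len(time) < 3:
        time.insert(0, "0")
    return ":".join(time)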
df["formatted_time"] = df.time.apply(format_time)
df
Then you could undertake two steps:
Convert column to datetime
Convert column to UNIX timestamp (number of seconds since 1970-01-01)
df["time_datetime"] = pd.to_datetime(df.formatted_time, infer_datetime_format=True)
df["time_seconds"] = (df.time_datetime - pd.Timestamp("1970-01-01")) // pd.Timedelta('1s')
df
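Since these values are durations rather than calendar timestamps, another option is to compute the seconds directly in a per-entry function and skip the datetime step. This is only a sketch; to_seconds is a hypothetical helper, not part of pandas, and it keeps the fractional seconds.

```python
import pandas as pd

def to_seconds(text: str) -> float:
    """Convert 'MM:SS.f' or 'H:MM:SS.f' to seconds, treating the fields as base-60 digits."""
    total = 0.0
    for part in text.split(":"):
        total = total * 60 + float(part)
    return total

df = pd.DataFrame({"time": ["49:36.5", "50:39.3", "1:00:47.8", "1:05:03.0"]})
df["time_seconds"] = df["time"].apply(to_seconds)
print(df)
```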

'Timestamp' object has no attribute 'dt' pandas

I have a dataset of 100,000 rows and 15 columns in a 10 MB CSV.
The column I am working on is a Date/Time column in string format.
Source code
import pandas as pd
import datetime as dt
trupl = pd.DataFrame({'Time/Date' : ['12/1/2021 2:09','22/4/2021 21:09','22/6/2021 9:09']})
trupl['Time/Date'] = pd.to_datetime(trupl['Time/Date'])
print(trupl)
Output
Time/Date
0 2021-12-02 02:09:00
1 2021-04-22 21:09:00
2 2021-06-22 09:09:00
What I need to do is a bit confusing, but I'll try to make it simple:
if the time of day is between 12 am and 8 am, subtract one day from the Time/Date and put the new timestamp in a new column;
if not, keep it as it is.
Expected output
Time/Date Date_adjusted
0 12/2/2021 2:09 12/1/2021 2:09
1 22/4/2021 21:09 22/4/2021 21:09
2 22/6/2021 9:09 22/6/2021 9:09
I tried the below code :
trupl['Date_adjusted'] = trupl['Time/Date'].map(lambda x:x- dt.timedelta(days=1) if x >= dt.time(0,0,0) and x < dt.time(8,0,0) else x)
I get a TypeError: '>=' not supported between 'Timestamp' and 'datetime.time',
and when applying dt.time to x, I get the error 'Timestamp' object has no attribute 'dt'.
So how can I convert x to a time in order to compare it? Or is there a better workaround?
I searched a lot for a fix but couldn't find a similar case.
Try:
trupl['Date_adjusted'] = trupl['Time/Date'].map(lambda x: x - dt.timedelta(days=1) if (x.hour >= 0 and x.hour < 8) else x)
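For background: inside map, x is a single Timestamp, which exposes .hour directly, while .dt is the accessor for a whole datetime Series. A vectorized variant of the same rule might look like the sketch below. The sample timestamps are written in ISO form here to avoid any day/month ambiguity, and the name early is illustrative.

```python
import pandas as pd

trupl = pd.DataFrame({
    "Time/Date": pd.to_datetime(
        ["2021-12-02 02:09", "2021-04-22 21:09", "2021-06-22 09:09"]
    )
})

# .dt works on the Series as a whole; hours 0..7 count as "before 8 am".
early = trupl["Time/Date"].dt.hour < 8

# Keep the original value unless the row is early, in which case shift it back one day.
trupl["Date_adjusted"] = trupl["Time/Date"].where(
    ~early, trupl["Time/Date"] - pd.Timedelta(days=1)
)
print(trupl)
```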

Unexpected groupby result: some rows are missing

I am facing an issue with transforming my data using Pandas' groupby. I have a table (several million rows and 3 variables) that I am trying to group by "Date" variable.
Snippet from a raw table:
Date V1 V2
07_19_2017_17_00_06 10 5
07_19_2017_17_00_06 20 6
07_19_2017_17_00_08 15 3
...
01_07_2019_14_06_59 30 1
01_07_2019_14_06_59 40 2
The goal is to group rows with the same value of "Date" by applying a mean function over V1 and sum function over V2. So that the expected result resembles:
Date V1 V2
07_19_2017_17_00_06 15 11 # This row has changed
07_19_2017_17_00_08 15 3
...
01_07_2019_14_06_59 35 3 # and this one too!
My code:
df = df.groupby(['Date'], as_index=False).agg({'V1': 'mean', 'V2': 'sum'})
The output I am getting, however, is totally unexpected and I can't find a reasonable explanation of why it happens. It seems like Pandas is only processing data from 01_01_2018_00_00_01 to 12_31_2018_23_58_40, instead of 07_19_2017_17_00_06 to 01_07_2019_14_06_59.
Date V1 V2
01_01_2018_00_00_01 30 3
01_01_2018_00_00_02 20 4
...
12_31_2018_23_58_35 15 3
12_31_2018_23_58_40 16 11
If you have any clue, I would really appreciate your input. Thank you!
I suspect that the issue is based around Pandas not recognizing the date format that I've used. A solution turned out to be quite simple: convert all of the dates into UNIX time format, divide by 60 and then, repeat the groupby procedure.
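One possible explanation, offered as an assumption rather than a diagnosis of the original data: groupby sorts the group keys by default, and string keys sort lexicographically. The result would then start at 01_01_2018 and end at 12_31_2018 without any rows actually being dropped. The sketch below uses a few made-up rows in the question's format and parses the strings with an explicit format so that the groups come out in chronological order.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": [
        "07_19_2017_17_00_06", "07_19_2017_17_00_06",
        "01_01_2018_00_00_01",
        "01_07_2019_14_06_59", "01_07_2019_14_06_59",
    ],
    "V1": [10, 20, 30, 30, 40],
    "V2": [5, 6, 3, 1, 2],
})

# Grouping on the raw strings: keys come back in lexicographic order.
by_string = df.groupby("Date", as_index=False).agg({"V1": "mean", "V2": "sum"})
print(by_string)

# Parsing first makes the keys real timestamps, so the order is chronological.
df["Date"] = pd.to_datetime(df["Date"], format="%m_%d_%Y_%H_%M_%S")
by_time = df.groupby("Date", as_index=False).agg({"V1": "mean", "V2": "sum"})
print(by_time)
```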

Get coherent subsets from pandas series

I'm rather new to pandas and recently ran into a problem. I have a pandas DataFrame that I need to process. I need to extract parts of the DataFrame where specific conditions are met. However, I want these parts to be coherent blocks, not one big set.
Example:
Consider the following pandas DataFrame
col1 col2
0 3 11
1 7 15
2 9 1
3 11 2
4 13 2
5 16 16
6 19 17
7 23 13
8 27 4
9 32 3
I want to extract the subframes where the values of col2 >= 10, resulting maybe in a list of DataFrames in the form of (in this case):
col1 col2
0 3 11
1 7 15
col1 col2
5 16 16
6 19 17
7 23 13
Ultimately, I need to do further analysis on the values in col1 within the resulting parts. However, the start and end of each of these blocks are important to me, so simply creating a subset using pandas.DataFrame.loc isn't going to work for me, I think.
What I have tried:
Right now I have a workaround that gets the subset using pandas.DataFrame.loc and then extracts the start and end index of each coherent block afterwards, by iterating through the subset and checking whether there is a jump in the indices. However, it feels rather clumsy, and I feel that I'm missing a basic pandas function here that would make my code more efficient and clean.
This is code representing my current workaround as adapted to the above example
# here the blocks will be collected for further computations
blocks = []
# get all the items where col2 >10 using 'loc[]'
subset = df.loc[df['col2']>10]
block_start = 0
block_end = None
#loop through all items in subset
for i in range(1, len(subset)):
    # if the difference between the current index and the last is greater than 1 ...
    if subset.index[i]-subset.index[i-1] > 1:
        # ... this is the current blocks end
        next_block_start = i
        # extract the according block and add it to the list of all blocks
        block = subset[block_start:next_block_start]
        blocks.append(block)
        #the next_block_start index is now the new block's starting index
        block_start = next_block_start
#close and add last block
blocks.append(subset[block_start:])
Edit: I was by mistake previously referring to 'pandas.DataFrame.where' instead of 'pandas.DataFrame.loc'. I seem to be a bit confused by my recent research.
You can split your problem into parts. First, check the condition:
df['mask'] = (df['col2']>10)
We use this to see where a new subset starts:
df['new'] = df['mask'].gt(df['mask'].shift(fill_value=False))
Now you can combine this information into a group number. The cumsum will generate a step function which we set to zero (via the mask column) if this is not a group we are interested in.
df['grp'] = (df.new + 0).cumsum() * df['mask']
EDIT
You don't have to do the group calculation in your df:
s = (df['col2']>10)
s = (s.gt(s.shift(fill_value=False)) + 0).cumsum() * s
After that you can split this into a dict of separate DataFrames
import numpy as np
grp = {}
# skip the first unique value (0), which marks rows outside any group
for i in np.unique(s)[1:]:
    grp[i] = df.loc[s == i, ['col1', 'col2']]
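A compact variant of the same run-labelling idea, using groupby to collect the blocks into a list. This is a sketch on the question's example data; the names mask, run_id and blocks are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": [3, 7, 9, 11, 13, 16, 19, 23, 27, 32],
    "col2": [11, 15, 1, 2, 2, 16, 17, 13, 4, 3],
})

mask = df["col2"] > 10

# A new run starts whenever the mask value differs from the previous row,
# so the cumulative sum gives every run of equal values its own id.
run_id = (mask != mask.shift()).cumsum()

# Keep only the rows that satisfy the condition, one DataFrame per run.
blocks = [block for _, block in df[mask].groupby(run_id[mask])]

for block in blocks:
    # The first and last index labels give the start and end of each block.
    print(block.index[0], block.index[-1])
```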

transform data frame in time series for date type POSIXct

I have a data frame with the following two variables:
amount: num 1213.5 34.5 ...
txn_date: POSIXct, format "2017-05-01 12:13:30" ...
I want to transform it in a time series using ts().
I started using this code:
Z <- zoo(data$amount, order.by=as.Date(as.character(data$txn_date), format="%Y/%m/%d %H:%M:%S"))
But the problem is that I lose the dates in Z: all of them are reported as NA.
How can I solve it?
For my analysis it is important to keep the date in the format %Y/%m/%d %H:%M:%S,
for example 2017-05-01 12:13:30. I don't want to remove the time component from the variable txn_date.
Thank you for your help,
Andrea
I think your problem comes from the way you're manipulating your data frame. Could you post more details about it, please?
I think I have a fix for you.
Data frame I used :
> df1
$data
value
1 1.9150
2 3.1025
3 6.7400
4 8.5025
5 11.0025
6 9.8025
7 9.0775
8 7.0900
9 6.8525
10 7.4900
$date
%Y-%m-%d
1 1974-01-01
2 1974-01-02
3 1974-01-03
4 1974-01-04
5 1974-01-05
6 1974-01-06
7 1974-01-07
8 1974-01-08
9 1974-01-09
10 1974-01-10
> class(df1$data$value)
[1] "numeric"
> class(df1$date$`%Y-%m-%d`)
[1] "POSIXct" "POSIXt"
Then I can create a time series by calling zoo like this:
> Z<-zoo(df1$data,order.by=(as.POSIXct(df1$date$`%Y-%m-%d`)))
> Z
value
1974-01-01 1.9150
1974-01-02 3.1025
1974-01-03 6.7400
1974-01-04 8.5025
1974-01-05 11.0025
1974-01-06 9.8025
1974-01-07 9.0775
1974-01-08 7.0900
1974-01-09 6.8525
1974-01-10 7.4900
The important thing here is that I use df1$date$`%Y-%m-%d` instead of just
df1$date
In fact, if I try it the way you did, I get NA values too:
> Z<-zoo(df1$data,order.by=as.POSIXct(as.Date(as.character(df1$date),format("%Y-%m-%d"))))
> Z
value
<NA> 1.915
To get the name of data$txn_date you can use the command names(data$txn_date), then try my solution with your own data frame and name.
> names(df1$date)
[1] "%Y-%m-%d"