I would like to subset a data frame based on a date column, which originally has this format:
3/22/13
After I transform it to a date:
df['date']=pd.to_datetime(df['date'], format='%m/%d/%y')
I get this:
2013-03-22 00:00:00
Now I would like to subset it with something like this:
df.loc[(df['date']>'2014-06-22')]
But that gives me either an empty data frame or the full data frame, that is, no filtering at all.
Any suggestions how I can get this to work?
Remark: I am well aware that similar questions have been asked in other forums, but I could not figure out a solution since my date column looks different.
First convert your start date and end date to datetime objects. Then you can combine multiple conditions inside df.loc. Do not forget to assign the filtered result back to df:
import pandas as pd
from datetime import datetime
df['date']=pd.to_datetime(df['date'], format='%m/%d/%y')
date1 = datetime.strptime('2013-03-23', '%Y-%m-%d')
date2 = datetime.strptime('2013-03-25', '%Y-%m-%d')
df = df.loc[(df['date']>date1) & (df['date']<date2)]
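If the filter returns everything or nothing, it is worth checking that the column really has a datetime dtype before comparing. The following is a minimal sketch on made-up sample rows (the 'value' column and the dates are assumptions for illustration only), using pd.Timestamp for the bounds:

```python
import pandas as pd

# Hypothetical sample data in the question's m/d/yy format
df = pd.DataFrame({
    'date': ['3/22/13', '3/24/13', '6/23/14', '7/1/14'],
    'value': [1, 2, 3, 4],
})
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%y')

# Confirm the conversion took effect; an object dtype here would mean
# the column still holds strings and comparisons would be unreliable
print(df['date'].dtype)

# Open-ended filter: rows strictly after 2014-06-22
after = df.loc[df['date'] > pd.Timestamp('2014-06-22')]

# Range filter: rows strictly between two dates
start = pd.Timestamp('2013-03-23')
end = pd.Timestamp('2013-03-25')
in_range = df.loc[(df['date'] > start) & (df['date'] < end)]

print(after)
print(in_range)
```

Both filters return new frames, so df itself is unchanged until you assign the result back.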
I have data from an S3 bucket and want to convert the Date column from string to date. The current Date column is in the format 7/1/2022 12:0:15 AM.
Current code I am using in AWS Glue Studio to attempt the custom transformation:
def MyTransform(glueContext, dfc) -> DynamicFrameCollection:
    from pyspark.sql.functions import col, to_timestamp
    df = dfc.select(list(dfc.keys())[0]).toDF()
    df = df.withColumn('Date', to_timestamp(col("Date"), 'MM/dd/yyyy HH:MM:SS'))
    df_res = DynamicFrame.fromDF(df, glueContext, "df")
    return DynamicFrameCollection({"CustomTransform0": df_res}, glueContext)
With MM/dd/yyyy HH:MM:SS date formatting, it runs but returns null for the Date column. When I try any other date format besides this, it errors out. I suspect the date formatting may be the issue, but I am not certain.
After converting the string to a timestamp, you need to cast the column to the date type, like this (df_col is the column name, 'Date' in your case):
df = df.withColumn(df_col, df[df_col].cast("date"))
We ended up removing the HH:MM:SS portion of the date format and this worked for our needs. I would still be interested if anyone can figure out how to get the hours, minutes, seconds, and AM/PM to work, but we can do without for now.
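On the open question of hours, minutes, seconds, and AM/PM: in Spark's datetime patterns, MM means month and SS means fraction of a second, while minutes and seconds are lowercase m and s, and a 12-hour clock with an AM/PM marker needs h together with a. That would explain the nulls. An untested guess for the Spark pattern is 'M/d/yyyy h:m:s a'. The same idea can be checked locally with plain Python, whose strptime uses different directives (%I for the 12-hour clock, %p for AM/PM). This is only a sketch of the principle, not Glue code:

```python
from datetime import datetime

# Sample value in the format described in the question
raw = '7/1/2022 12:0:15 AM'

# %I is the 12-hour clock and %p the AM/PM marker; using the 24-hour
# directive %H here would ignore the AM/PM information
parsed = datetime.strptime(raw, '%m/%d/%Y %I:%M:%S %p')
print(parsed)  # 2022-07-01 00:00:15 (12 AM is hour 0)

# Keep only the calendar date if the time of day is not needed
date_only = parsed.date()
print(date_only)
```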
I have a pandas DataFrame that includes a datetime column, and I want to filter the data frame to the rows between the current hour and 10 hours ago. I have tried several approaches without success. The column is a Series, and I could not compare it against a timedelta directly. Comparing the column as strings against my time interval in a for loop works but is not efficient.
The table is like this:
And I want to filter the 'dateTime' column between current time and 10 hours ago, then filter based on 'weeks' > 80.
I have also tried the following code, but it did not work:
filter_criteria = main_table['dateTime'].sub(today).abs().apply(lambda x: x.hours <= 10)
main_table.loc[filter_criteria]
This returns an error:
TypeError: unsupported operand type(s) for -: 'str' and 'datetime.datetime'
Similarly this code has the same problem:
main_table.loc[main_table['dateTime'] >= (datetime.datetime.today() - pd.DateOffset(hours=10))]
And:
main_table[(pd.to_datetime('today') - main_table['dateTime'] ).dt.hours.le(10)]
In all of the code above main_table is the name of my data frame.
How can I filter them?
First, make sure the dtype of your datetime column is correct. You can check it with:
main_table.info()
If it is not datetime (for example, object), convert it:
# pass an explicit format= argument if automatic parsing fails
main_table['dateTime'] = pd.to_datetime(main_table['dateTime'])
Then compute the datetime object for ten hours before the current time:
from datetime import datetime, timedelta
date_time_ten_before = datetime.now() - timedelta(hours = 10)
All that remains is to filter the column:
main_table_10 = main_table[main_table['dateTime'] >= date_time_ten_before]
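The question also asks for 'weeks' > 80, which can be combined with the time window in a single boolean mask. Here is a minimal sketch on invented rows (the sample values are assumptions, since the original table is not shown). The TypeError in the question suggests the column held strings, so the sketch starts from strings too:

```python
import pandas as pd

now = pd.Timestamp.now()

# Hypothetical rows: 1, 5, 12 and 30 hours old, stored as strings
main_table = pd.DataFrame({
    'dateTime': [(now - pd.Timedelta(hours=h)).strftime('%Y-%m-%d %H:%M:%S')
                 for h in (1, 5, 12, 30)],
    'weeks': [90, 70, 95, 85],
})

# Convert the strings to datetime64 before any arithmetic or comparison
main_table['dateTime'] = pd.to_datetime(main_table['dateTime'])

# Keep rows from the last 10 hours that also have weeks > 80
cutoff = now - pd.Timedelta(hours=10)
mask = (
    (main_table['dateTime'] >= cutoff)
    & (main_table['dateTime'] <= now)
    & (main_table['weeks'] > 80)
)
recent = main_table[mask]
print(recent)

# Age in hours, if needed: timedeltas expose total_seconds(), not hours
age_hours = (now - main_table['dateTime']).dt.total_seconds() / 3600
```

Each condition needs its own parentheses because & binds more tightly than the comparison operators.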
This CSV has dates in ISO 8601 time.
0 2014-01-01T00:00:00.000
1 2014-01-01T00:46:43.000
2 2014-01-01T01:33:26.001
I want to select the rows up until January 2. I'm not sure how to do this. I thought including
parse_dates=True
would allow me to refer to the date/time values directly like this:
sat0 = sat0[sat0['epoch']<2014-01-02T00:00:00.000]
but it's not working.
You can use pd.to_datetime() to convert the date string column, passing the format that matches your data. In this case the format is '%Y-%m-%dT%H:%M:%S.%f'. Once the column is datetime, you can compare it against a date string, which must be quoted.
Here is how that looks in your case:
import pandas as pd
sat0 = pd.DataFrame({
'epoch': [
'2014-01-01T00:00:00.000',
'2014-01-01T00:46:43.000',
'2014-03-01T01:33:26.001',
]})
date_fmt = '%Y-%m-%dT%H:%M:%S.%f'
sat0['epoch'] = pd.to_datetime(sat0['epoch'], format=date_fmt)
sat0 = sat0[sat0['epoch'] < '2014-01-02T00:00:00.000']
Which outputs
epoch
0 2014-01-01 00:00:00
1 2014-01-01 00:46:43
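As for parse_dates: passing True only asks pandas to try parsing the index, so the 'epoch' column stays as strings unless it is named explicitly. Below is a minimal sketch that reads an in-memory CSV (the 'value' column and the file contents are assumptions for illustration):

```python
import io
import pandas as pd

# Stand-in for the CSV file from the question
csv_text = """epoch,value
2014-01-01T00:00:00.000,1
2014-01-01T00:46:43.000,2
2014-01-02T01:33:26.001,3
"""

# Name the column to parse; parse_dates=True would only target the index
sat0 = pd.read_csv(io.StringIO(csv_text), parse_dates=['epoch'])
print(sat0['epoch'].dtype)

# Select everything before January 2
before_jan2 = sat0[sat0['epoch'] < pd.Timestamp('2014-01-02')]
print(before_jan2)
```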
I am looking to convert datetime to date for a pandas datetime series.
I have listed the code below:
df = pd.DataFrame()
df = pandas.io.parsers.read_csv("TestData.csv", low_memory=False)
df['PUDATE'] = pd.Series([pd.to_datetime(date) for date in df['DATE_TIME']])
df['PUDATE2'] = datetime.datetime.date(df['PUDATE']) #Does not work
Can anyone guide me in right direction?
You can access the datetime methods of a pandas Series through the .dt accessor (in a similar way to how you would access string methods using .str). For your case, you can extract the date of your datetime column as:
df['PUDATE'].dt.date
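Note that .dt.date returns plain Python date objects, so the resulting column is no longer a datetime64 column. If you want to keep the datetime dtype and only reset the time to midnight, .dt.normalize() is an alternative. A minimal sketch with made-up timestamps (the sample values and the PUDATE_MIDNIGHT column name are assumptions):

```python
import datetime
import pandas as pd

# Hypothetical stand-in for the DATE_TIME column read from TestData.csv
df = pd.DataFrame({'DATE_TIME': ['2015-03-01 14:30:00', '2015-03-02 08:05:10']})

# Vectorised conversion; no list comprehension needed
df['PUDATE'] = pd.to_datetime(df['DATE_TIME'])

# Plain datetime.date objects
df['PUDATE2'] = df['PUDATE'].dt.date

# Still datetime64, with the time part set to 00:00:00
df['PUDATE_MIDNIGHT'] = df['PUDATE'].dt.normalize()

print(df.dtypes)
print(isinstance(df['PUDATE2'].iloc[0], datetime.date))
```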
This is a simple way to get the day of the month (and the month and year) from a pandas datetime column:
#create a dataframe with dates as a string
test_df = pd.DataFrame({'dob':['2001-01-01', '2002-02-02', '2003-03-03', '2004-04-04']})
#convert column to type datetime
test_df['dob']= pd.to_datetime(test_df['dob'])
# Extract day, month , year using dt accessor
test_df['DayOfMonth']=test_df['dob'].dt.day
test_df['Month']=test_df['dob'].dt.month
test_df['Year']=test_df['dob'].dt.year
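I think you need to specify the format when parsing the strings. Note that datetime.datetime.date() accepts neither a Series nor a format argument, so do the parsing with pd.to_datetime and then take the date, for example:
df['PUDATE2'] = pd.to_datetime(df['DATE_TIME'], format='%Y%m%d%H%M%S').dt.date
So you just need to know which format your data uses.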
I'm trying to get today's date in a few different formats and I keep getting errors:
pd.to_datetime('Today',format='%m/%d/%Y') + MonthEnd(-1)
ValueError: time data 'Today' does not match format '%m/%d/%Y' (match)
What is the correct syntax to get today's date in yyyy-mm-dd and yyyymm formats?
For YYYY-MM-DD format, you can do this:
import datetime as dt
print(dt.datetime.today().date())
2017-05-23
For YYYY-MM format, you can do this (use '%Y%m' instead if you want yyyymm without the hyphen):
print(dt.datetime.today().date().strftime('%Y-%m'))
2017-05
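The same can be done without leaving pandas, which also makes the MonthEnd arithmetic from the question work. This is a minimal sketch; the helper name date_formats is hypothetical and only there to make the formatting reusable:

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd


def date_formats(ts):
    """Return (yyyy-mm-dd, yyyymm) strings for a timestamp."""
    return ts.strftime('%Y-%m-%d'), ts.strftime('%Y%m')


# pd.Timestamp.today() replaces pd.to_datetime('Today', format=...),
# which fails because 'Today' is not a date in that format
today = pd.Timestamp.today().normalize()
print(date_formats(today))

# Last day of the previous month
prev_month_end = today + MonthEnd(-1)
print(date_formats(prev_month_end))

# Fixed date for a reproducible check
fixed = pd.Timestamp('2017-05-23')
print(date_formats(fixed))             # ('2017-05-23', '201705')
print(fixed + MonthEnd(-1))            # 2017-04-30 00:00:00
```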
If you only need to drop the time part on a few datetime columns, you can use:
import pandas as pd
dataframe_name['Date_Column_name'].dt.normalize()
This method doesn't use any module other than pandas. The older pd.tslib.normalize_date helper is not available in current pandas versions, so .dt.normalize() is used here instead. If you need a "custom" date format, which gives you strings, you can always do:
dataframe_name['Date_Column_name'].dt.strftime('%d/%m/%Y')
Here is a list of strftime options.