I have a csv file where a timestamp column contains values in the following format: 2022-05-12T07:09:33.727-07:00
When I try something like:
df['timestamp'] = pd.to_datetime(df['timestamp'])
It seems to fail silently as the dtype of that column is still object. I am wondering how I can parse such a value.
Also, what is the strategy so that it remains robust to a variety of input time formats?
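One way to handle ISO 8601 timestamps with UTC offsets, and to stay robust to messy input, is to normalize everything to UTC and coerce unparseable entries to NaT. A minimal sketch (the sample values are illustrative):

```python
import pandas as pd

s = pd.Series(["2022-05-12T07:09:33.727-07:00",
               "2022-05-13T10:09:33.727-04:00"])

# utc=True normalizes every offset to UTC; without it, rows with
# different offsets can leave the column as object instead of datetime
parsed = pd.to_datetime(s, utc=True)
print(parsed.dtype)  # datetime64[ns, UTC]

# errors="coerce" turns unparseable entries into NaT instead of raising,
# which helps when the input mixes several formats
robust = pd.to_datetime(s, utc=True, errors="coerce")
```

After parsing, rows where `robust.isna()` flags the values that did not match any recognizable format.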
So I got my dataframe from a JSON file, and the date is labelled as 2000M01 for January 2000, 2000M02 for February 2000, etc. I need it in a different format: 2000Jan, 2000Feb, etc. I have a different data set in that format; I could bring both of these to a third format if that's easier, like 2000-01 or some official date format.
My main issue is that, as far as I know, 2000M01 is not an official date format in any way, so I can't just convert it directly.
Any ideas how I could convert this?
You can easily feed a custom format to pd.to_datetime; in your case it would be '%YM%m', e.g.:
pd.to_datetime('2000M01', format='%YM%m')
Then you can convert it to any format you want.
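For example, chaining strftime onto the parsed value produces the 2000Jan style the question asks for:

```python
import pandas as pd

# Parse the non-standard 2000M01 format, then re-emit as 2000Jan
ts = pd.to_datetime('2000M01', format='%YM%m')
print(ts.strftime('%Y%b'))  # 2000Jan
```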
You can change the date format with the datetime module:
import datetime

def reformat_date(date_from_json):
    date = datetime.datetime.strptime(date_from_json, "%YM%m")
    return date.strftime("%Y%b")
As specified in the datetime documentation on strftime and strptime format codes, %YM%m handles the unusual input format (with the day defaulting to the 1st), and %Y%b gives you the format you want.
Then you map the function to the pandas dataframe
dataframe['DATE_COLUMN'] = dataframe['OLD_DATE_COLUMN'].map(reformat_date)
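Putting the pieces together on a small example (the column names here are placeholders):

```python
import datetime
import pandas as pd

def reformat_date(date_from_json):
    # Parse the 2000M01-style string, then format as 2000Jan
    date = datetime.datetime.strptime(date_from_json, "%YM%m")
    return date.strftime("%Y%b")

dataframe = pd.DataFrame({'OLD_DATE_COLUMN': ['2000M01', '2000M02']})
dataframe['DATE_COLUMN'] = dataframe['OLD_DATE_COLUMN'].map(reformat_date)
print(dataframe['DATE_COLUMN'].tolist())  # ['2000Jan', '2000Feb']
```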
I have a large dataset in python/pandas, and it's behaving very strangely. I have a column, df['station'], with numeric and string values, and I have converted it into an object. If I do dtypes, the column claims to be of dtype object.
station object
However, when I run df['station'].value_counts() on the column, it says the dtype is int64
I'm unable to access the values that are strings. If I try df[df['station']=='NONE'] or df[df['station']=='S52'], both of which are shown to be present in the value_counts() output, nothing shows up.
I tried removing NaNs from the column, then converting to string and running value_counts() again, but it still shows int64.
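One likely explanation, sketched on toy data: the int64 that value_counts() reports is the dtype of the counts themselves, not of the column values, so it is not evidence that the strings are gone. Casting the whole column to str makes equality comparisons behave uniformly:

```python
import pandas as pd

df = pd.DataFrame({'station': [101, 101, 'S52', 'NONE']})
print(df['station'].dtype)  # object

# value_counts() returns counts, and counts are integers -- the int64
# shown is the dtype of those counts, not of the station values
counts = df['station'].value_counts()
print(counts.dtype)  # int64

# Casting everything to str makes string comparisons match reliably
df['station'] = df['station'].astype(str)
print(df[df['station'] == 'S52'].shape[0])  # 1
```

If the string rows still fail to match after the cast, checking for stray whitespace with `df['station'].str.strip()` is a reasonable next step.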
I am very new to coding (this is the first code I am writing).
I have multiple csv files, all with the same headers. The files correspond to hourly ozone concentration for every day of the year, and each file is a separate year [range from 2009-2020]. I have a column 'date' that contains the year-month-day, and I have a column for hour of the day (0-23). I want to separate the year from the month-day, combine the hour with the month-day and make this the index, and then merge the other csv files into one dataframe.
In addition, I need to average data values from each day at each hour for all 10 years, however, three of my files include leap days (an extra 24 values). I would appreciate any advice on how to account for the leap years. I assume that I would need to add the leap day to the files without it, then provide null values, then drop the null values (but that seems circular).
Also, if you have any tips on how to simplify my process, feel free to share!
Thanks in advance for your help.
Update: I tried the advice from Rookie below, but after importing csv data, I get an error message:
import pandas as pd
import os
path = "C:/Users/heath/Documents/ARB project Spring2020/ozone/SJV/SKNP"
df = pd.DataFrame()
for file in os.listdir(path):
    df_temp = pd.read_csv(os.path.join(path, file))
    df = pd.concat((df, df_temp), axis=0)
First, I get an error message that says OSError: Initializing from file failed.
I tried to fix the issue by adding engine = 'python' based on advice from OSError: Initializing from file failed on csv in Pandas, but now I'm getting PermissionError: [Errno 13] Permission denied: 'C:/Users/heath/Documents/ARB project Spring2020/ozone/SJV/SKNP\\.ipynb_checkpoints'
Please help, I'm not sure what else to do. I edited the permission so that everyone has the read & write access. However, I still had the "permission denied" error when I imported the csv on Windows.
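The PermissionError points at `.ipynb_checkpoints`, which is a hidden directory Jupyter creates inside the folder; `os.listdir` returns it alongside the CSV files, and `pd.read_csv` cannot open a directory. A sketch of the loop that skips anything that is not a .csv file (the temporary directory here is only set up so the example is self-contained):

```python
import os
import tempfile
import pandas as pd

# Illustrative setup: a directory with one CSV and a checkpoint folder,
# mimicking what Jupyter leaves behind
path = tempfile.mkdtemp()
pd.DataFrame({"ozone": [1, 2]}).to_csv(os.path.join(path, "2009.csv"), index=False)
os.mkdir(os.path.join(path, ".ipynb_checkpoints"))

frames = []
for file in os.listdir(path):
    full = os.path.join(path, file)
    # Skip sub-directories such as .ipynb_checkpoints and any non-CSV
    # files -- reading those is what raises PermissionError / OSError
    if os.path.isfile(full) and file.lower().endswith(".csv"):
        frames.append(pd.read_csv(full))

df = pd.concat(frames, ignore_index=True)
print(len(df))  # 2
```

Collecting the frames in a list and concatenating once at the end is also faster than growing the DataFrame inside the loop.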
First, you want to identify what type of column you are dealing with once it is in a pandas DataFrame. This can be accomplished with the dtypes attribute. For example, if your DataFrame is df, you can do df.dtypes, which will tell you what the column types are. If you see an object type, pandas is interpreting the column as strings (sequences of characters, not actual date or time values). If you see datetime64[ns], pandas knows it is a datetime value (date and time combined). If you see timedelta64[ns], pandas knows it is a time difference (more on this later).
If the dtype is object, let's convert the column to a datetime64[ns] type so pandas knows we are dealing with date/time values. This can be done by simple reassignment. For example, if the format of the date is YYYY-mm-dd (2020-06-04), then we can convert the date column using the following method (assuming the name of your date column is "Date"). Please reference strftime for different formatting.
df["Date"] = pd.to_datetime(df["Date"], format="%Y-%m-%d")
The time column is slightly more tricky. Pandas has no bare time dtype, so we need to convert the time to a timedelta64[ns]. If the time format is hh:mm:ss (i.e. "21:02:24"), we can use the following method to convert the object type.
df["Time"] = pd.to_timedelta(df["Time"])
If the format is different, you will need to convert the string format to the hh:mm:ss format.
Now, to combine these columns, we can simply add them:
df["DateTime"] = df["Date"] + df["Time"]
To create the formatted datetime column you mentioned, you can create a new column in a string format. The below will give "06-04 21", indicating June 4 at 9 PM. strftime can guide whatever format you desire.
df["Formatted_DateTime"] = df["DateTime"].dt.strftime("%m-%d %H")
You will need to do this for each file. I recommend using a for loop here. Below is a full code snippet. This will obviously vary depending on your column types, file names, etc.
import os # module to iterate over the files
import pandas as pd
base_path = "path/to/directory" # This is the directory path where all your files are stored
# It will be faster to read in all files at once THEN format the date
df = pd.DataFrame()
for file in os.listdir(base_path):
    df_temp = pd.read_csv(os.path.join(base_path, file))  # Read every file in the base_path directory
    df = pd.concat((df, df_temp), axis=0)  # Concatenate (merge) the files
# Formatting the data
df["Date"] = pd.to_datetime(df["Date"], format="%Y-%m-%d") # Date conversion
df["Time"] = pd.to_timedelta(df["Time"]) # Time conversion
df["DateTime"] = df["Date"] + df["Time"] # Combine date and time to single column
df["Formatted_DateTime"] = df["DateTime"].dt.strftime("%m-%d %H") # Format the datetime values
Now that everything is formatted, the average portion is easy. Since you are only interested in averaging the values for each month-day hour, we can use the groupby capability.
df_group = df.groupby(["Formatted_DateTime"]) # This will group your data by unique values of the "Formatted_DateTime" column
df_average = df_group.mean() # This will average your data within each group (accounting for the leap years)
It's always good to check your work!
print(df_average.head(5)) # This will print the first 5 rows of averaged values
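A toy version of the groupby step, with made-up ozone values, also shows why the leap days need no special handling: the "02-29" groups simply average over fewer years than the other dates.

```python
import pandas as pd

# Illustrative data: two readings for one month-day hour, one for another
df = pd.DataFrame({
    'Formatted_DateTime': ['06-04 21', '06-04 21', '06-05 09'],
    'ozone': [30.0, 50.0, 20.0],
})

# mean() averages within each group; groups with fewer rows
# (e.g. Feb 29, present only in leap years) just average fewer values
df_average = df.groupby('Formatted_DateTime').mean()
print(df_average['ozone'].tolist())  # [40.0, 20.0]
```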
I am reading a csv file with Pandas. In the file there is a column with dates in dd/mm/yyyy format.
import datetime as dt
import pandas as pd

def load_csv():
    mydateparser = lambda x: dt.datetime.strptime(x, "%d/%m/%Y")
    return pd.read_csv('myfile.csv', delimiter=';', parse_dates=['data'], date_parser=mydateparser)
Using this parser, the 'data' column's type becomes datetime64[ns], but the display format changes to yyyy-mm-dd.
I need the 'data' column to be of type datetime64[ns] and formatted as dd/mm/yyyy.
How can it be done?
Regards,
Elio Fernandes
Dates are not stored in yyyy-mm-dd format or dd/mm/yyyy format; they are stored as datetime values. Pandas simply chooses to display them in yyyy-mm-dd format by default, but make no mistake, the underlying storage is still datetime.
You will get a better idea if you add a time component to the data and then try to display it.
The way to achieve what you want is to convert the dates to strings right before displaying, so the column remains datetime in the DataFrame but you get the specified string format when you display it.
The following uses the Series.dt.strftime() accessor to produce strings. Documentation here.
df['data'].dt.strftime('%d/%m/%Y')
or
The following uses datetime.strftime() on each element to produce strings. Documentation here.
df['data'].apply(lambda x: x.strftime('%d/%m/%Y'))
For further reference check out strftime-and-strptime-behavior.
This question will be of great help to understand how datetime is stored in python:
How does Python store datetime internally?
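A small sketch illustrating the point above: the column keeps its datetime64[ns] dtype, and strftime only produces display strings alongside it.

```python
import pandas as pd

df = pd.DataFrame({'data': pd.to_datetime(['2021-03-01', '2021-12-25'])})
print(df['data'].dtype)  # datetime64[ns]

# .dt.strftime builds a parallel string column; df['data'] is unchanged
formatted = df['data'].dt.strftime('%d/%m/%Y')
print(formatted.tolist())  # ['01/03/2021', '25/12/2021']
```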
The problem is somewhat simple. My objective is to compute the days difference between two dates, say A and B.
These are my attempts:
df['daydiff'] = df['A']-df['B']
df['daydiff'] = ((df['A']) - (df['B'])).dt.days
df['daydiff'] = (pd.to_datetime(df['A'])-pd.to_datetime(df['B'])).dt.days
These worked for me before, but for some reason I keep getting this error this time:
TypeError: <class 'datetime.time'> is not convertible to datetime
When I export the df to excel, then the date works just fine. Any thoughts?
Use pd.Timestamp to handle the awkward differences in your formatted times.
df['A'] = df['A'].apply(pd.Timestamp) # will handle parsing
df['B'] = df['B'].apply(pd.Timestamp) # will handle parsing
df['day_diff'] = (df['A'] - df['B']).dt.days
Of course, if you don't want to change the format of the df['A'] and df['B'] within the DataFrame that you are outputting, you can do this in a one-liner.
df['day_diff'] = (df['A'].apply(pd.Timestamp) - df['B'].apply(pd.Timestamp)).dt.days
This will give you the days between as an integer.
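The whole approach on a couple of sample dates (the values are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'A': ['2020-06-10', '2020-06-20'],
                   'B': ['2020-06-01', '2020-06-05']})

# pd.Timestamp parses each value, and the subtraction yields a
# timedelta column whose .dt.days gives whole days as integers
df['day_diff'] = (df['A'].apply(pd.Timestamp) - df['B'].apply(pd.Timestamp)).dt.days
print(df['day_diff'].tolist())  # [9, 15]
```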
When I applied the solution offered by emmet02, I got a TypeError: Cannot convert input [00:00:00] error as well. It basically means the dataframe contains missing timestamp values that are represented as [00:00:00], and these values are rejected by the pandas.Timestamp function.
To address this, simply apply a suitable missing-value strategy to clean your data set, before using
df.apply(pd.Timestamp)
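One possible cleaning step, sketched on a column where the missing entries show up as time-only 00:00:00 placeholders (the sample data is illustrative):

```python
import datetime
import pandas as pd

s = pd.Series([datetime.time(0, 0), '2020-06-10', '2020-06-20'])

# Filter out the time-only placeholders (displayed as 00:00:00)
# before parsing, so pd.Timestamp only ever sees parseable values
mask = s.apply(lambda v: not isinstance(v, datetime.time))
clean = s[mask].apply(pd.Timestamp)
print(len(clean))  # 2
```

Depending on the analysis, filling the removed rows with pd.NaT instead of dropping them may be more appropriate.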