Parse Datetime in Pandas Dataframe - pandas

I have a checkout column of type 'object' in my DataFrame, in the format '2017-08-04T23:31:19.000+02:00'.
But I want it in the format shown in the image.
Can anyone help me, please?
Thank you :)

You should be able to convert the object column to a datetime column, then use the built-in date and time accessors.
import pandas as pd

# create an intermediate column that we won't store on the DataFrame
checkout_as_datetime = pd.to_datetime(df['checkout'])
# add the desired columns to the DataFrame
df['checkout_date'] = checkout_as_datetime.dt.date
df['checkout_time'] = checkout_as_datetime.dt.time
Though, if your goal isn't to write these specific new columns out somewhere, but to use them for other calculations, it may be simpler to just overwrite your original column and use the datetime methods from there.
df['checkout'] = pd.to_datetime(df['checkout'])
df['checkout'].dt.date # to access the date
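For instance, here's a quick end-to-end sketch using made-up rows in the question's timestamp format:
import pandas as pd

# made-up rows in the question's timestamp format
df = pd.DataFrame({'checkout': ['2017-08-04T23:31:19.000+02:00',
                                '2017-08-05T10:05:00.000+02:00']})
df['checkout'] = pd.to_datetime(df['checkout'])
print(df['checkout'].dt.date)  # e.g. 2017-08-04
print(df['checkout'].dt.time)  # e.g. 23:31:19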

I haven't tested this, but something along the lines of:
df['CheckOut_date'] = pd.to_datetime(df["CheckOut_date"].dt.strftime('%Y-%m-%d'))
df['CheckOut_time'] = pd.to_datetime(df["CheckOut_time"].dt.strftime('%H:%M:%S'))

Related

Convert dataframe to xarray with datetime coordinate

I have the following pandas dataframe:
I would like to convert it to an xarray dataset, but when I use the code below, it uses the date column as a variable.
dsn = nino.to_xarray()
print(dsn)
I need the date column to be the time coordinate (i.e. datetime64). How do I do this?
Further to this, I tried the following:
dsn = dsn.assign_coords(time=dsn['DATE'])
dsn = dsn.drop('YR')
dsn = dsn.drop('MON')
dsn = dsn.drop('DATE')
But when I try to select certain years (using dsn.sel(time=slice())), I get an error: 'no index found for coordinate time'. Any guidance would be greatly appreciated.
To convert a variable into datetime format, see the following: xarray: coords conversion to datetime64
Xarray distinguishes between coordinates and dimensions. If you want to be able to easily slice your data by a variable, you'll need to set it as a dimension of your dataset.
See here for more info on xarray's data structures: https://docs.xarray.dev/en/stable/user-guide/data-structures.html
Presuming you'd like to get rid of the index and replace it with the time coordinate, the following should work:
# set 'DATE' as a coordinate, then swap the 'index' dimension for 'DATE'
dsn = dsn.set_coords('DATE').swap_dims({'index': 'DATE'})
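Putting it all together, here's a sketch with a hypothetical frame shaped like the question's data (the YR/MON/ANOM columns and values are stand-ins). drop_vars is the current spelling in recent xarray releases (older versions use drop), and renaming the dimension to 'time' makes the time=slice() selection from the question work:
import pandas as pd

# hypothetical frame shaped like the question's data
nino = pd.DataFrame({'YR': [1982, 1982], 'MON': [1, 2],
                     'DATE': ['1982-01-01', '1982-02-01'],
                     'ANOM': [0.1, 0.2]})
nino['DATE'] = pd.to_datetime(nino['DATE'])  # make it datetime64 before converting

dsn = nino.to_xarray()
dsn = dsn.set_coords('DATE').swap_dims({'index': 'DATE'})
dsn = dsn.drop_vars(['YR', 'MON']).rename({'DATE': 'time'})

# slicing by year now finds an index on the 'time' dimension
print(dsn.sel(time=slice('1982-01-01', '1982-06-30')))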

Convert dataframe column from Object to numeric

Hello, I have a conversion question. I'm using some code to conditionally add a value to a new column in my DataFrame (df). The new column ('new_col') is created with type object. How do I convert 'new_col' to float for aggregation in the code that follows? I'm new to Python and have tried several functions and methods. Any help would be greatly appreciated.
import numpy as np

conds = [(df['sc1'] == 'UP_MJB'), (df['sc1'] == 'UP_MSCI')]
actions = [df['st1'], df['st2']]
df['new_col'] = np.select(conds, actions, default=df['sc1'])
Tried astype(float), got a ValueError. Talked to a teammate, tried pd.to_numeric(np.select(conds, actions, default=df['sc1'])). That worked.
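For reference, a minimal sketch of that working approach, with made-up data shaped like the question's frame (the values here are hypothetical, and errors='coerce' is an optional extra that turns anything unparseable into NaN):
import numpy as np
import pandas as pd

# hypothetical data shaped like the question's frame
df = pd.DataFrame({'sc1': ['UP_MJB', 'UP_MSCI', 'OTHER'],
                   'st1': [1.5, 2.5, 3.5],
                   'st2': [10.0, 20.0, 30.0]})

conds = [df['sc1'] == 'UP_MJB', df['sc1'] == 'UP_MSCI']
actions = [df['st1'], df['st2']]

# pd.to_numeric is a top-level pandas function, not a DataFrame method;
# errors='coerce' turns unparseable values (here 'OTHER') into NaN
df['new_col'] = pd.to_numeric(np.select(conds, actions, default=df['sc1']),
                              errors='coerce')
print(df['new_col'].dtype)  # float64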

Replacing the formulae in dataframe (pandas)

I have a .csv file which has many columns containing formulas like this:
="010",
="011"
By default, when I store it in a pandas DataFrame, the formulas are stored as-is. Can anyone help me convert these values in the whole DataFrame to plain values:
010,
011
You can use Series.apply() with a custom function.
Here is a very basic example that should work with the data you have shown.
def to_float(item):
    return float(item[1:].strip('"'))

df['COUNTRY CODE'] = df['COUNTRY CODE'].apply(to_float)
The function to_float() is just an example; you can decide how to implement the custom function that does the transformation.
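If you need to keep the leading zeros (float turns "010" into 10.0), a string-based variant is possible. Here is a sketch, assuming every cell has the ="..." shape:
import pandas as pd

# hypothetical column of Excel-style formula strings
df = pd.DataFrame({'COUNTRY CODE': ['="010"', '="011"']})

# strip the leading '=' and the surrounding quotes, keeping the result as text
df['COUNTRY CODE'] = df['COUNTRY CODE'].str.lstrip('=').str.strip('"')
print(df['COUNTRY CODE'].tolist())  # ['010', '011']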

How to split year from date and make a new column; how to deal with leap years

I am very new to coding (this is the first code I am writing).
I have multiple csv files, all with the same headers. The files correspond to hourly ozone concentration for every day of the year, and each file is a separate year [range from 2009-2020]. I have a column 'date' that contains the year-month-day, and I have a column for hour of the day (0-23). I want to separate the year from the month-day, combine the hour with the month-day and make this the index, and then merge the other csv files into one dataframe.
In addition, I need to average data values from each day at each hour for all 10 years, however, three of my files include leap days (an extra 24 values). I would appreciate any advice on how to account for the leap years. I assume that I would need to add the leap day to the files without it, then provide null values, then drop the null values (but that seems circular).
Also, if you have any tips on how to simplify my process, feel free to share!
Thanks in advance for your help.
Update: I tried the advice from Rookie below, but after importing csv data, I get an error message:
import pandas as pd
import os

path = "C:/Users/heath/Documents/ARB project Spring2020/ozone/SJV/SKNP"
df = pd.DataFrame()
for file in os.listdir(path):
    df_temp = pd.read_csv(os.path.join(path, file))
    df = pd.concat((df, df_temp), axis=0)
First, I get an error message that says OSError: Initializing from file failed.
I tried to fix the issue by adding engine = 'python' based on advice from OSError: Initializing from file failed on csv in Pandas, but now I'm getting PermissionError: [Errno 13] Permission denied: 'C:/Users/heath/Documents/ARB project Spring2020/ozone/SJV/SKNP\\.ipynb_checkpoints'
Please help, I'm not sure what else to do. I edited the permissions so that everyone has read & write access. However, I still get the "permission denied" error when I import the csv files on Windows.
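The PermissionError is most likely because os.listdir also returns the .ipynb_checkpoints directory, and pd.read_csv cannot open a directory. A sketch of the same loop that skips anything that isn't a .csv file:
import os
import pandas as pd

path = "C:/Users/heath/Documents/ARB project Spring2020/ozone/SJV/SKNP"

frames = []
for file in os.listdir(path):
    if file.endswith(".csv"):  # skip .ipynb_checkpoints and other non-CSV entries
        frames.append(pd.read_csv(os.path.join(path, file)))
df = pd.concat(frames, axis=0, ignore_index=True)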
First, you want to identify what type of column you are dealing with once it is in a pandas DataFrame. This can be done with the dtypes attribute. For example, if your DataFrame is df, df.dtypes will tell you the column types. If you see an object type, pandas is interpreting the column as strings (sequences of characters, not actual date or time values). If you see datetime64[ns], pandas knows it is a datetime value (date and time combined). If you see timedelta64[ns], pandas knows it is a time difference (more on this later).
If the dtypes are object, let's convert them to datetime64[ns] so pandas knows we are dealing with date/time values. This can be done by simple reassignment. For example, if the format of the date is YYYY-mm-dd (2020-06-04), we can convert the date column using the following (assuming the name of your date column is "Date"). Please reference strftime for the different format codes.
df["Date"] = pd.to_datetime(df["Date"], format="%Y-%m-%d")
The time column is slightly trickier. Pandas is not aware of time-only values, so we need to convert the time to a timedelta64[ns]. If the time format is hh:mm:ss (i.e. "21:02:24"), we can use the following to convert from the object type.
df["Time"] = pd.to_timedelta(df["Time"])
If the format is different, you will need to convert the string format to the hh:mm:ss format.
Now to combine these columns, we can simply add them:
df["DateTime"] = df["Date"] + df["Time"]
To create the formatted datetime column you mentioned, you can create a new column in string format. The below will give "06-04 21", indicating June 4, 9 PM. strftime can guide whatever format you desire.
df["Formatted_DateTime"] = df["DateTime"].dt.strftime("%m-%d %H")
You will need to do this for each file. I recommend using a for loop here. Below is a full code snippet. This will obviously vary depending on your column types, file names, etc.
import os  # module to iterate over the files
import pandas as pd

base_path = "path/to/directory"  # the directory path where all your files are stored

# It will be faster to read in all files at once THEN format the date
df = pd.DataFrame()
for file in os.listdir(base_path):
    df_temp = pd.read_csv(os.path.join(base_path, file))  # read every file in the base_path directory
    df = pd.concat((df, df_temp), axis=0)  # concatenating (merging) the files

# Formatting the data
df["Date"] = pd.to_datetime(df["Date"], format="%Y-%m-%d")  # date conversion
df["Time"] = pd.to_timedelta(df["Time"])  # time conversion
df["DateTime"] = df["Date"] + df["Time"]  # combine date and time into a single column
df["Formatted_DateTime"] = df["DateTime"].dt.strftime("%m-%d %H")  # format the datetime values
Now that everything is formatted, the average portion is easy. Since you are only interested in averaging the values for each month-day hour, we can use the groupby capability.
df_group = df.groupby(["Formatted_DateTime"]) # This will group your data by unique values of the "Formatted_DateTime" column
df_average = df_group.mean() # This will average your data within each group (accounting for the leap years)
It's always good to check your work!
print(df_average.head(5)) # This will print the first 5 rows of the averaged values

Reading .txt into Julia DataFrame as Date Type

Is there way to read date ("2000-01") variables from text files into a Julia DataFrame directly, as a date? There's no documentation on this from what I have seen.
df = readtable("path/dates.txt", eltypes = [Date, Date])
This doesn't work, even though it seems like it should. My usual process is to read the dates in as strings and then loop over each row to create a new date variable. This has become a bottleneck in some of my processes now, due to the size of the DataFrames.
My usual flow is to do something like this:
full_df[:real_date] = Date(full_df[:temp_dte_string], "m/d/y")
Thank you
I don't think there's currently any way to do the loading in a single step like your first suggested code. However you can speed up the second method somewhat by making a DateFormat object and calling Date with that instead of with a string.
(This is mentioned briefly here.)
dfmt = Dates.DateFormat("m/d/y")
full_df[:real_date] = Date(full_df[:temp_dte_string], dfmt)
(For some reason I thought Date was not vectorized and have been doing this inside a for loop in all my code. Whoops.)
By delete a variable do you mean delete a column or a row? If you mean the former, then there are a few other ways to do this, including things like
function vectorin(a, b)  # IMHO this should be in Base
    bset = Set(b)
    [i in bset for i in a]
end
df = DataFrame(A1="", A2="", A3="", D="", E="", F="") #Some long list of columns
badCols = [:D, :F] #Some long list of columns you want to remove
df = df[names(df)[!vectorin(names(df), badCols)]]
Sometimes I read in csv files with a lot of extra columns, then just do something like
df = readtable("data.csv")
df = df[[:Only, :the, :cols, :I, :want]]