I have the following pandas dataframe:
I would like to convert it to an xarray dataset, but when I use the code below, it uses the date column as a variable.
dsn = nino.to_xarray()
print(dsn)
I need the date column to be the time coordinate (i.e. datetime64). How do I do this?
Further to this, I tried the following:
dsn = dsn.assign_coords(time=dsn['DATE'])
dsn = dsn.drop('YR')
dsn = dsn.drop('MON')
dsn = dsn.drop('DATE')
But when I try to select certain years (using dsn.sel(time=slice(...))), I get an error: 'no index found for coordinate time'. Any guidance would be greatly appreciated.
To convert a variable into datetime format, see the following: xarray: coords conversion to datetime64
Xarray distinguishes between coordinates and dimensions. If you want to be able to easily slice your data by a variable, you'll need to set it as a dimension of your dataset.
See here for more info on xarray's data structures: https://docs.xarray.dev/en/stable/user-guide/data-structures.html
Presuming you'd like to get rid of the index and replace it with the time coordinate, the following should work:
# Set 'DATE' as a coordinate, then swap the 'index' dimension for 'DATE'.
dsn = dsn.set_coords('DATE').swap_dims({'index': 'DATE'})
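Putting the whole flow together with the renaming and dropping from your follow-up, a minimal sketch (the frame below is a hypothetical stand-in for the nino data, with YR, MON, DATE, and one value column assumed from the question):
import pandas as pd

nino = pd.DataFrame({
    "YR": [1982, 1982],
    "MON": [1, 2],
    "DATE": ["1982-01-01", "1982-02-01"],
    "ANOM": [0.26, 0.01],  # hypothetical value column
})
nino["DATE"] = pd.to_datetime(nino["DATE"])  # make sure DATE is datetime64

dsn = nino.to_xarray()  # 'index' becomes the dimension; DATE is a data variable
dsn = dsn.set_coords("DATE").swap_dims({"index": "DATE"})  # DATE is now the dimension
dsn = dsn.rename({"DATE": "time"}).drop_vars(["YR", "MON", "index"])

print(dsn.sel(time=slice("1982-01-01", "1982-06-30")))  # slicing by time now works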
I have a csv file where a timestamp column comes with values in the following format: 2022-05-12T07:09:33.727-07:00
When I try something like:
df['timestamp'] = pd.to_datetime(df['timestamp'])
It seems to fail silently as the dtype of that column is still object. I am wondering how I can parse such a value.
Also, what is a good strategy for keeping the parsing robust to a variety of input time formats?
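Not tested against your exact file, but one common cause: when the strings carry different UTC offsets, pandas keeps the result as object dtype rather than picking a single timezone. Passing utc=True normalizes everything to tz-aware UTC, and errors="coerce" keeps the call robust by turning unparseable strings into NaT instead of raising. A minimal sketch (column name assumed from the question):
import pandas as pd

df = pd.DataFrame({"timestamp": ["2022-05-12T07:09:33.727-07:00",
                                 "2022-05-12T10:09:33.727-04:00"]})

# Mixed UTC offsets stay object dtype by default; utc=True converts everything
# to a proper tz-aware datetime64[ns, UTC] column.
df["timestamp"] = pd.to_datetime(df["timestamp"], utc=True, errors="coerce")
print(df["timestamp"].dtype)  # datetime64[ns, UTC]
Rows that fail to parse show up as NaT, which you can inspect with df["timestamp"].isna() before deciding whether to drop them.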
Hello, I have a conversion question. I'm using some code to conditionally add a value to a new column in my dataframe (df). The new column ('new_col') is created with type object. How do I convert 'new_col' to float for aggregation in the code that follows? I'm new to Python and have tried several functions and methods. Any help would be greatly appreciated.
conds = [(df['sc1']=='UP_MJB'),(df['sc1']=='UP_MSCI')]
actions = [df['st1'],df['st2']]
df['new_col'] = np.select(conds,actions,default=df['sc1'])
Tried astype(float), got a ValueError. Talked to a teammate and tried pd.to_numeric(np.select(conds, actions, default=df['sc1'])). That worked.
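For completeness, a minimal sketch of that working conversion (the frame is a hypothetical stand-in; to_numeric is the top-level pd.to_numeric, and errors="coerce" assumes you want unparseable strings to become NaN rather than raise):
import numpy as np
import pandas as pd

df = pd.DataFrame({"sc1": ["UP_MJB", "UP_MSCI", "OTHER"],
                   "st1": [1.5, 2.0, 3.0],
                   "st2": [4.0, 5.5, 6.0]})

conds = [(df["sc1"] == "UP_MJB"), (df["sc1"] == "UP_MSCI")]
actions = [df["st1"], df["st2"]]

# np.select mixes floats with the string default, so the result is object dtype
df["new_col"] = np.select(conds, actions, default=df["sc1"])

# Strings that are not numbers (e.g. "OTHER") become NaN under errors="coerce"
df["new_col"] = pd.to_numeric(df["new_col"], errors="coerce")
print(df["new_col"].dtype)  # float64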
Coming from Python, I started using Julia for its speed in a big-data project. When reading data from .xlsx files, the datatype in each column is Any, despite most of the data being integers or floats.
Is there any Julia way of inferring the datatypes in a DataFrame (like df = infertype.(df))?
This may be difficult in Julia, given the reduced flexibility on datatypes, but any tips on how to accomplish it would be appreciated. Assume, ex-ante, I do not know which column is which, but the types can only be int, float, string, or date.
using DataFrames
using XLSX
df = DataFrame(XLSX.readtable("MyFile.xlsx", "Sheet1")...)
You can just do:
df = DataFrame(XLSX.readtable("MyFile.xlsx", "Sheet1"; infer_eltypes=true)...)
Additionally, it is worth knowing that in Julia, typing ? before a command shows help that can contain such information:
help?> XLSX.readtable
readtable(filepath, sheet, [columns]; [first_row], [column_labels], [header], [infer_eltypes], [stop_in_empty_row], [stop_in_row_function]) -> data, column_labels
Returns tabular data from a spreadsheet as a tuple (data, column_labels). (...)
(...)
Use infer_eltypes=true to get data as a Vector{Any} of typed vectors. The default value is infer_eltypes=false.
(...)
I am very new to coding (this is the first code I am writing).
I have multiple csv files, all with the same headers. The files correspond to hourly ozone concentration for every day of the year, and each file is a separate year (2009-2020). I have a column 'date' that contains the year-month-day, and a column for hour of the day (0-23). I want to separate the year from the month-day, combine the hour with the month-day, make this the index, and then merge the other csv files into one dataframe.
In addition, I need to average data values from each day at each hour for all 10 years, however, three of my files include leap days (an extra 24 values). I would appreciate any advice on how to account for the leap years. I assume that I would need to add the leap day to the files without it, then provide null values, then drop the null values (but that seems circular).
Also, if you have any tips on how to simplify my process, feel free to share!
Thanks in advance for your help.
Update: I tried the advice from Rookie below, but after importing csv data, I get an error message:
import pandas as pd
import os
path = "C:/Users/heath/Documents/ARB project Spring2020/ozone/SJV/SKNP"
df = pd.DataFrame()
for file in os.listdir(path):
    df_temp = pd.read_csv(os.path.join(path, file))
    df = pd.concat((df, df_temp), axis=0)
First, I get an error message that says OSError: Initializing from file failed.
I tried to fix the issue by adding engine = 'python' based on advice from OSError: Initializing from file failed on csv in Pandas, but now I'm getting PermissionError: [Errno 13] Permission denied: 'C:/Users/heath/Documents/ARB project Spring2020/ozone/SJV/SKNP\\.ipynb_checkpoints'
Please help, I'm not sure what else to do. I edited the permissions so that everyone has read & write access, but I still get the "permission denied" error when importing the csv files on Windows.
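One likely fix, assuming the PermissionError comes from os.listdir returning the .ipynb_checkpoints directory (which pd.read_csv cannot open as a file): filter the listing down to .csv files before reading.
import os
import pandas as pd

path = "C:/Users/heath/Documents/ARB project Spring2020/ozone/SJV/SKNP"

df = pd.DataFrame()
for file in os.listdir(path):
    if not file.endswith(".csv"):  # skip directories like .ipynb_checkpoints
        continue
    df_temp = pd.read_csv(os.path.join(path, file))
    df = pd.concat((df, df_temp), axis=0)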
First, you want to identify what type of column you are dealing with once it is in a pandas DataFrame. This can be accomplished with the dtypes attribute. For example, if your DataFrame is df, you can do df.dtypes, which will tell you what the column types are. If you see an object type, pandas is interpreting the values as strings (sequences of characters, not actual date or time values). If you see datetime64[ns], pandas knows this is a datetime value (date and time combined). If you see timedelta64[ns], pandas knows this is a time difference (more on this later).
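For example, a quick check on a hypothetical frame with the columns from this question (the Ozone value is a placeholder):
import pandas as pd

df = pd.DataFrame({"Date": ["2020-06-04"], "Time": ["21:02:24"], "Ozone": [0.031]})
print(df.dtypes)
# Date      object   <- pandas sees plain strings here, not dates
# Time      object
# Ozone    float64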
If the dtypes are object, let's convert them to datetime64[ns] so pandas knows we are dealing with date/time values. This can be done by simple reassignment. For example, if the format of the date is YYYY-mm-dd (2020-06-04), we can convert the date column as follows (assuming the name of your date column is "Date"). Please reference strftime for different formatting.
df["Date"] = pd.to_datetime(df["Date"], format="%Y-%m-%d")
The time column is slightly more tricky. Pandas is not aware of time on its own, so we need to convert the time to a timedelta64[ns]. If the time format is hh:mm:ss (i.e. "21:02:24"), we can use the following method to convert the object type.
df["Time"] = pd.to_timedelta(df["Time"])
If the format is different, you will need to convert the string format to the hh:mm:ss format.
Now, to combine these columns, we can simply add them:
df["DateTime"] = df["Date"] + df["Time"]
To create the formatted datetime column you mentioned, create a new column in string format. The below will give "06-04 21", indicating June 4 at 9 PM. strftime can guide whatever format you desire.
df["Formatted_DateTime"] = df["DateTime"].dt.strftime("%m-%d %H")
You will need to do this for each file. I recommend using a for loop here. Below is a full code snippet. This will obviously vary depending on your column types, file names, etc.
import os  # module to iterate over the files
import pandas as pd

base_path = "path/to/directory"  # This is the directory path where all your files are stored

# It will be faster to read in all files at once THEN format the date
df = pd.DataFrame()
for file in os.listdir(base_path):
    df_temp = pd.read_csv(os.path.join(base_path, file))  # Read every file in the base_path directory
    df = pd.concat((df, df_temp), axis=0)  # Concatenate (merge) the files

# Formatting the data
df["Date"] = pd.to_datetime(df["Date"], format="%Y-%m-%d")  # Date conversion
df["Time"] = pd.to_timedelta(df["Time"])  # Time conversion
df["DateTime"] = df["Date"] + df["Time"]  # Combine date and time into a single column
df["Formatted_DateTime"] = df["DateTime"].dt.strftime("%m-%d %H")  # Format the datetime values
Now that everything is formatted, the average portion is easy. Since you are only interested in averaging the values for each month-day hour, we can use the groupby capability.
df_group = df.groupby(["Formatted_DateTime"])  # This will group your data by unique values of the "Formatted_DateTime" column
df_average = df_group.mean()  # This will average your data within each group (a leap day like "02-29" is simply averaged over the leap years only, so no null-filling is needed)
It's always good to check your work!
print(df_average.head(5))  # This will print the first 5 rows of averaged values
I have a checkout column in my DataFrame of type 'object', in the format '2017-08-04T23:31:19.000+02:00'.
But I want it in the format shown in the image.
Can anyone help me please?
Thank you :)
You should be able to convert the object column to a date time column, then use the built in date and time functions.
# create an intermediate column that we won't store on the DataFrame
checkout_as_datetime = pd.to_datetime(df['checkout'])
# Add the desired columns to the dataframe
df['checkout_date'] = checkout_as_datetime.dt.date
df['checkout_time'] = checkout_as_datetime.dt.time
Though, if your goal isn't to write these specific new columns out somewhere, but to use them for other calculations, it may be simpler to just overwrite your original column and use the datetime methods from there.
df['checkout'] = pd.to_datetime(df['checkout'])
df['checkout'].dt.date # to access the date
I haven't tested this, but something along the lines of:
df['CheckOut_date'] = pd.to_datetime(df['checkout']).dt.strftime('%Y-%m-%d')  # date as a 'YYYY-MM-DD' string
df['CheckOut_time'] = pd.to_datetime(df['checkout']).dt.strftime('%H:%M:%S')  # time as an 'HH:MM:SS' string (%M = minutes, %S = seconds)