How to write POSIXct with milliseconds using vroom_write()? - formatting

How can I write POSIXct columns with milliseconds using vroom::vroom_write()?
I can use format() before saving to "render" the time as character (see below), but I wonder if there's a neater way, e.g., by setting some option?
# Example data
df = data.frame(time = Sys.time() + runif(5, 0, 10^6))
# Convert POSIXct cols
POSIXct_cols = sapply(df, \(x) "POSIXct" %in% class(x))
df[POSIXct_cols] = lapply(df[POSIXct_cols], \(x) format(x, "%Y-%m-%d %H:%M:%OS3"))
# Save
vroom::vroom_write(df, "df.csv")

Related

add negative seconds using numpy date time

I'm trying to add an array of time offsets (in seconds, which can be both positive and negative) to a constant timestamp using numpy.
numpy version is 1.19.1, python = 3.7.4
If "offsets" is all positive numbers, things work just fine:
time0 = numpy.datetime64("2007-04-03T15:06:48.032208Z")
offsets = numpy.arange(0, 10)
time = offsets.astype("datetime64[s]")
time2 = time0 + time
But, if offsets includes some negative numbers:
offsets = numpy.arange(-5, 5)
time = offsets.astype("datetime64[s]")
time2 = time0 + time
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
numpy.core._exceptions.UFuncTypeError: ufunc 'add' cannot use operands with types dtype('<M8[ms]') and dtype('<M8[s]')
How do I deal with an offsets array that can contain both positive and negative numbers?
Any insight appreciated, I'm stumped here.
Catherine
The error tells you that you cannot perform addition of two dates (two datetime64 arrays). As you can imagine, adding, say, May 12 and May 19 together does not make logical sense. Running your first example produces the same error in my environment, even with only positive values in the offsets array.
Instead, you can convert your offsets values into timedelta values:
import numpy
time0 = numpy.datetime64("2007-04-03T15:06:48.032208Z")
offsets = numpy.arange(0, 10)
time = offsets.astype(numpy.timedelta64(1, "s"))
time2 = time0 + time
print(time2)
# ['2007-04-03T15:06:48.032208' '2007-04-03T15:06:49.032208'
# '2007-04-03T15:06:50.032208' '2007-04-03T15:06:51.032208'
# '2007-04-03T15:06:52.032208' '2007-04-03T15:06:53.032208'
# '2007-04-03T15:06:54.032208' '2007-04-03T15:06:55.032208'
# '2007-04-03T15:06:56.032208' '2007-04-03T15:06:57.032208']
offsets = numpy.arange(-5, 5)
time = offsets.astype(numpy.timedelta64(1, "s"))
time2 = time0 + time
print(time2)
# ['2007-04-03T15:06:43.032208' '2007-04-03T15:06:44.032208'
# '2007-04-03T15:06:45.032208' '2007-04-03T15:06:46.032208'
# '2007-04-03T15:06:47.032208' '2007-04-03T15:06:48.032208'
# '2007-04-03T15:06:49.032208' '2007-04-03T15:06:50.032208'
# '2007-04-03T15:06:51.032208' '2007-04-03T15:06:52.032208']
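Note that the intermediate astype is not strictly needed either: multiplying the integer offsets by a timedelta64 scalar already yields a timedelta64 array that can be added to the timestamp. A quick sketch (not part of the original answer):
import numpy

time0 = numpy.datetime64("2007-04-03T15:06:48.032208")
offsets = numpy.arange(-5, 5)
# int array * timedelta64 scalar -> timedelta64[s] array
time2 = time0 + offsets * numpy.timedelta64(1, "s")
print(time2)
# same values as above, from 15:06:43.032208 through 15:06:52.032208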

how to convert {{ds}} airflow macro to DATE datatype

df['date'] = {{ds}}
In this data frame I want to store a value of type DATE, but all I get is a string.
You can't use macros outside of an operator's scope. Macros are rendered as part of operator execution; otherwise they are just plain text. {{ ds }} works only on the templated fields of an operator. In your example you are clearly trying to get the value of ds inside a Python callable rather than in a templated field, so you can get the value from the context instead:
import pandas as pd
from airflow.operators.python_operator import PythonOperator  # Airflow 1.x import path

def func(**kwargs):
    execution_date = kwargs['execution_date']
    df = pd.DataFrame()
    df['execution_date'] = [execution_date]
    # If you also want to convert the column to datetime you can add
    df['execution_date'] = pd.to_datetime(df['execution_date'])
    print(df)

op = PythonOperator(
    task_id='example',
    python_callable=func,
    provide_context=True,
    dag=dag
)
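If you actually need a DATE value rather than a string, the context also carries execution_date as a datetime-like object, so a plain date can be derived from either field. A rough sketch along the same lines (func_date and the column names are just illustrative; wire it up through a PythonOperator with provide_context=True as above):
import pandas as pd

def func_date(**kwargs):
    execution_date = kwargs['execution_date']   # datetime-like (pendulum) object
    ds_string = kwargs['ds']                    # the same day as a 'YYYY-MM-DD' string
    df = pd.DataFrame({'date': [execution_date.date()]})   # plain datetime.date value
    # Or, starting from the string version:
    df['date_from_ds'] = pd.to_datetime(ds_string).date()
    print(df.dtypes)   # both columns are dtype 'object', holding datetime.date values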

Pandas timezone - how to check if it is not defined for a whole column?

To check if a timezone is not defined for the first row of a "timestamp" column in a pandas Series I can query .tz for a single element with:
import pandas as pd
dates = pd.Series(pd.date_range('2/2/2002', periods=10, freq='M'))
assert dates.iloc[0].tz is None
Is there a way to check whether any element has a timezone defined, or even better, a way to list all the timezones in the whole series, without looping through its elements as in:
dates.iloc[5] = dates.iloc[5].tz_localize('Africa/Abidjan')
dates.iloc[7] = dates.iloc[7].tz_localize('Africa/Banjul')
zones = []
for k in range(dates.shape[0]):
    zones.append(dates.iloc[k].tz)
print(set(zones))
?
You can get the time zone setting of a datetime Series using the dt accessor, i.e. S.dt.tz. This will raise ValueError if you have multiple time zones since the datetime objects will then be stored in an object array, as opposed to a datetime64 array if you have only one time zone or None. You can make use of this to get a solution that is a bit more efficient than looping every time:
import pandas as pd
# tzinfo is None:
dates0 = pd.Series(pd.date_range('2/2/2002', periods=10, freq='M'))
# one timezone:
dates1 = dates0.dt.tz_localize('Africa/Abidjan')
# mixed timezones:
dates2 = dates0.copy()
dates2.iloc[5] = dates2.iloc[5].tz_localize('Africa/Abidjan')
dates2.iloc[7] = dates2.iloc[7].tz_localize('Africa/Banjul')
for ds in [dates0, dates1, dates2]:
    try:
        zones = ds.dt.tz
    except ValueError:
        zones = set(t.tz for t in ds.values)
    print(zones)
# prints:
# None
# Africa/Abidjan
# {None, <DstTzInfo 'Africa/Banjul' GMT0:00:00 STD>, <DstTzInfo 'Africa/Abidjan' GMT0:00:00 STD>}
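If you need this check repeatedly, the try/except above can be wrapped in a small helper. This is just a sketch built on the same idea; the function names are made up, and it also catches AttributeError since the exact exception raised by .dt on an object-dtype column can vary between pandas versions:
import pandas as pd

def series_timezones(s: pd.Series) -> set:
    """Return the set of time zones present in a datetime-like Series."""
    try:
        # Fast path: a datetime64 column carries a single tz (or None)
        return {s.dt.tz}
    except (ValueError, AttributeError):
        # Mixed tz values are stored as objects, so inspect each element
        return {t.tz for t in s}

def has_tz_aware_elements(s: pd.Series) -> bool:
    """True if at least one element carries a time zone."""
    return any(tz is not None for tz in series_timezones(s))
With the example data, has_tz_aware_elements(dates2) returns True, while series_timezones(dates0) is {None}.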

how to merge two columns in one column as date with pandas?

I have a csv where the first column is the date and the 5th is the hour.
I would like to merge them into a single column with a specific format in order to write another csv file.
This is basically the file:
DATE,DAY.WEEK,DUMMY.WEEKENDS.HOLIDAYS,DUMMY.MONDAY,HOUR
01/01/2015,5,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,2,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,3,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,4,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,5,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,6,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,7,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,8,1,0,0,0,0,0,0,0,0,0,0,0
I have tried to read the dataframe as
dataR = pd.read_csv(fnamecsv)
and to convert the first entry to a date, as:
date_dt3 = datetime.strptime(dataR["DATE"].iloc[0], '%d/%m/%Y')
However, this does not seem to be the right approach, for two reasons:
1) it fills in the hour without considering the HOUR column;
2) it does not make use of pandas features.
Thanks for any kind of help,
Diedro
Using + operator
You need to convert the data frame elements to strings before joining. You can also use different separators when joining, e.g. a dash, underscore, or space.
import pandas as pd
df = pd.DataFrame({'Last': ['something', 'you', 'want'],
                   'First': ['merge', 'with', 'this']})
print('Before Join')
print(df, '\n')
print('After join')
df['Name'] = df["First"].astype(str) + " " + df["Last"]
print(df)
You can use read_csv with the parse_dates parameter set to a list with both column names, and date_parser to specify the format:
f = lambda x: pd.to_datetime(x, format='%d/%m/%Y %H')
dataR = pd.read_csv(fnamecsv, parse_dates=[['DATE','HOUR']], date_parser=f)
Or convert hours to timedeltas and add to datetimes later:
dataR = pd.read_csv(fnamecsv, parse_dates=[0], dayfirst=True)
dataR['DATE'] += pd.to_timedelta(dataR.pop('HOUR'), unit='H')
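Putting it together for the columns in the question, a rough sketch (it assumes the HOUR values fit %H, i.e. 0-23, and the output file name is made up):
import pandas as pd

dataR = pd.read_csv(fnamecsv)
# Join the two columns as strings, then parse the combined value
combined = dataR['DATE'].astype(str) + ' ' + dataR['HOUR'].astype(str)
dataR['DATETIME'] = pd.to_datetime(combined, format='%d/%m/%Y %H')
# Render the merged column in the desired format before writing the new csv
dataR['DATETIME'] = dataR['DATETIME'].dt.strftime('%Y-%m-%d %H:%M')
dataR.to_csv('merged.csv', index=False)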

Using pandas on a CSV file, how do I filter a data set by "month" if the "date" column is in format "MM/DD/YYYY"

I have a CSV file with a column "date" whose values are formatted MM/DD/YYYY. I was wondering if there was a way I could filter the data in this file based on just month using pandas in python.
### csv file ###
___, Date, ...
12/4/2003
6/15/2012
#################
data = pd.read_csv("file.csv")
# how do i do this line?
is_data_july = data["date"].onlyCheckFirstChar == "6"
Thanks
You might want to have a look at pd.to_datetime.
df = pd.read_csv("file.csv")
df['date'] = pd.to_datetime(df['date'], ...)
mask = df['date'].dt.month == 6
df.loc[mask].to_csv("newfile.csv")
In fact, pd.read_csv has a shortcut for this (if the default options in pd.to_datetime work for you):
df = pd.read_csv("file.csv", parse_dates=['date'])
mask = df['date'].dt.month == 6
df.loc[mask].to_csv("newfile.csv")
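Since the sample dates are MM/DD/YYYY and not zero-padded, one safe way to fill in the '...' above is an explicit format (a sketch using the file and column names from the question):
import pandas as pd

df = pd.read_csv("file.csv")
# %m/%d/%Y also accepts single-digit months and days such as "6/15/2012"
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%Y')
june = df[df['date'].dt.month == 6]
june.to_csv("newfile.csv", index=False)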