I'm trying to add an array of time offsets (in seconds, which can be both positive and negative) to a constant timestamp using numpy.
numpy version is 1.19.1, python = 3.7.4
If "offsets" is all positive numbers, things work just fine:
time0 = numpy.datetime64("2007-04-03T15:06:48.032208Z")
offsets = numpy.arange(0, 10)
time = offsets.astype("datetime64[s]")
time2 = time0 + time
But, if offsets includes some negative numbers:
offsets = numpy.arange(-5, 5)
time = offsets.astype("datetime64[s]")
time2 = time0 + time
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
numpy.core._exceptions.UFuncTypeError: ufunc 'add' cannot use operands with types dtype('<M8[ms]') and dtype('<M8[s]')
How do I deal with an offsets array that can contain both positive and negative numbers?
Any insight appreciated, I'm stumped here.
Catherine
The error tells you that you cannot add two dates together (two datetime64 arrays). As you can imagine, adding, say, May 12 and May 19 makes no logical sense. Running your first example produces the same error in my environment, even with only positive values in the offsets array.
Instead, you can convert your offsets values into timedelta values:
import numpy
time0 = numpy.datetime64("2007-04-03T15:06:48.032208Z")
offsets = numpy.arange(0, 10)
time = offsets.astype("timedelta64[s]")
time2 = time0 + time
print(time2)
# ['2007-04-03T15:06:48.032208' '2007-04-03T15:06:49.032208'
# '2007-04-03T15:06:50.032208' '2007-04-03T15:06:51.032208'
# '2007-04-03T15:06:52.032208' '2007-04-03T15:06:53.032208'
# '2007-04-03T15:06:54.032208' '2007-04-03T15:06:55.032208'
# '2007-04-03T15:06:56.032208' '2007-04-03T15:06:57.032208']
offsets = numpy.arange(-5, 5)
time = offsets.astype("timedelta64[s]")
time2 = time0 + time
print(time2)
# ['2007-04-03T15:06:43.032208' '2007-04-03T15:06:44.032208'
# '2007-04-03T15:06:45.032208' '2007-04-03T15:06:46.032208'
# '2007-04-03T15:06:47.032208' '2007-04-03T15:06:48.032208'
# '2007-04-03T15:06:49.032208' '2007-04-03T15:06:50.032208'
# '2007-04-03T15:06:51.032208' '2007-04-03T15:06:52.032208']
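An equivalent sketch, if you prefer arithmetic over astype: multiplying the integer offsets by a one-second numpy.timedelta64 also yields a timedelta array. (The trailing "Z" is dropped here, since timezone-aware datetime64 strings are deprecated in numpy.)

```python
import numpy as np

# Same idea without astype: integer array * timedelta64 unit -> timedelta64 array,
# which can be added to a datetime64 scalar, negative offsets included.
time0 = np.datetime64("2007-04-03T15:06:48.032208")
offsets = np.arange(-5, 5)
time2 = time0 + offsets * np.timedelta64(1, "s")
print(time2[0])   # 2007-04-03T15:06:43.032208
print(time2[-1])  # 2007-04-03T15:06:52.032208
```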
df['date'] = {{ds}}
In this data frame I want to store a value of type DATE, but all I get is a string.
You can't use macros outside of an operator's scope. Macros are rendered as part of operator execution; anywhere else they are just plain text. {{ ds }} works only in templated fields of an operator. In your example it's clear that you want the value of ds inside a Python callable rather than in a templated field, so you can get it as:
import pandas as pd
# Airflow 1.x import path; in Airflow 2 use airflow.operators.python
from airflow.operators.python_operator import PythonOperator

def func(**kwargs):
    execution_date = kwargs['execution_date']
    df = pd.DataFrame()
    df['execution_date'] = [execution_date]  # wrap in a list so the empty frame gains a row
    # If you also want to convert the column to datetime you can add:
    df['execution_date'] = pd.to_datetime(df['execution_date'])
    print(df)

op = PythonOperator(
    task_id='example',
    python_callable=func,
    provide_context=True,  # not needed in Airflow 2, where context is passed automatically
    dag=dag,
)
To check whether a timezone is defined for the first row of a "timestamp" column in a pandas Series, I can query .tz on a single element:
import pandas as pd
dates = pd.Series(pd.date_range('2/2/2002', periods=10, freq='M'))
assert dates.iloc[0].tz is None
Is there a way to check whether any element has a timezone defined or, even better, a way to list all the timezones in the whole series without looping through its elements, such as:
dates.iloc[5] = dates.iloc[5].tz_localize('Africa/Abidjan')
dates.iloc[7] = dates.iloc[7].tz_localize('Africa/Banjul')
zones = []
for k in range(dates.shape[0]):
    zones.append(dates.iloc[k].tz)
print(set(zones))
?
You can get the time zone setting of a datetime Series through the dt accessor, i.e. S.dt.tz. If you have multiple time zones, the datetime objects are stored in an object array rather than a datetime64 array, and the accessor raises an error (ValueError, or AttributeError depending on the pandas version). You can make use of this to get a solution that is a bit more efficient than looping every time:
import pandas as pd
# tzinfo is None:
dates0 = pd.Series(pd.date_range('2/2/2002', periods=10, freq='M'))
# one timezone:
dates1 = dates0.dt.tz_localize('Africa/Abidjan')
# mixed timezones:
dates2 = dates0.copy()
dates2.iloc[5] = dates2.iloc[5].tz_localize('Africa/Abidjan')
dates2.iloc[7] = dates2.iloc[7].tz_localize('Africa/Banjul')
for ds in [dates0, dates1, dates2]:
    try:
        zones = ds.dt.tz
    except (ValueError, AttributeError):  # which one is raised depends on the pandas version
        zones = set(t.tz for t in ds.values)
    print(zones)
# prints
None
Africa/Abidjan
{None, <DstTzInfo 'Africa/Banjul' GMT0:00:00 STD>, <DstTzInfo 'Africa/Abidjan' GMT0:00:00 STD>}
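If you only need the set of zones and don't want to rely on which exception the dt accessor raises, a small sketch (assuming every element is a pandas Timestamp, naive or aware) maps the tz attribute element-wise:

```python
import pandas as pd

dates = pd.Series([pd.Timestamp('2002-02-28'), pd.Timestamp('2002-03-31')])
dates = dates.astype(object)  # object dtype allows mixed naive/aware values
dates.iloc[1] = dates.iloc[1].tz_localize('Africa/Abidjan')

# Series.map applies the attribute lookup element-wise.
zones = set(dates.map(lambda t: t.tz))
print(zones)
```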
I have a csv with the first column the date and the 5th the hours.
I would like to merge them in a single column with a specific format in order to write another csv file.
This is basically the file:
DATE,DAY.WEEK,DUMMY.WEEKENDS.HOLIDAYS,DUMMY.MONDAY,HOUR
01/01/2015,5,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,2,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,3,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,4,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,5,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,6,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,7,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,8,1,0,0,0,0,0,0,0,0,0,0,0
I have tried to read the dataframe as
dataR = pd.read_csv(fnamecsv)
and convert the first line to date, as:
date_dt3 = datetime.strptime(dataR["DATE"].iloc[0], '%d/%m/%Y')
However, this does not seem the correct way to me, for two reasons:
1) it sets the time without considering the hour column;
2) it does not seem to use pandas features.
Thanks for any kind of help,
Diedro
Using the + operator
You need to convert the DataFrame elements to strings before joining. You can also use a different separator, e.g. a dash, underscore or space.
import pandas as pd

df = pd.DataFrame({'Last': ['something', 'you', 'want'],
                   'First': ['merge', 'with', 'this']})
print('Before join')
print(df, '\n')
print('After join')
df['Name'] = df["First"].astype(str) + " " + df["Last"]
print(df)
You can use read_csv with the parse_dates parameter, passing a list containing both column names, and date_parser to specify the format:
f = lambda x: pd.to_datetime(x, format='%d/%m/%Y %H')
dataR = pd.read_csv(fnamecsv, parse_dates=[['DATE','HOUR']], date_parser=f)
Or convert the hours to timedeltas and add them to the datetimes afterwards:
dataR = pd.read_csv(fnamecsv, parse_dates=[0], dayfirst=True)
dataR['DATE'] += pd.to_timedelta(dataR.pop('HOUR'), unit='h')
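As a quick self-contained check of the second variant, here is a sketch on an inline sample with the same shape as the file above (extra columns trimmed, and the fnamecsv path replaced by a StringIO buffer):

```python
import io
import pandas as pd

csv_text = """DATE,DAY.WEEK,DUMMY.WEEKENDS.HOLIDAYS,DUMMY.MONDAY,HOUR
01/01/2015,5,1,0,1
01/01/2015,5,1,0,2
"""

dataR = pd.read_csv(io.StringIO(csv_text), parse_dates=[0], dayfirst=True)
# Pop the HOUR column and fold it into DATE as a timedelta.
dataR['DATE'] += pd.to_timedelta(dataR.pop('HOUR'), unit='h')
print(dataR['DATE'].tolist())
# [Timestamp('2015-01-01 01:00:00'), Timestamp('2015-01-01 02:00:00')]
```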
I have a CSV file with a column "date" whose values are formatted MM/DD/YYYY. I was wondering if there was a way I could filter the data in this file based on just month using pandas in python.
### csv file ###
___, Date, ...
12/4/2003
6/15/2012
#################
data = pd.read_csv("file.csv")
# how do i do this line?
is_data_july = data["date"].onlyCheckFirstChar == "6"
Thanks
You might want to have a look at pd.to_datetime.
df = pd.read_csv("file.csv")
df['date'] = pd.to_datetime(df['date'], ...)
mask = df['date'].dt.month == 6
df.loc[mask].to_csv("newfile.csv")
In fact, pd.read_csv has a shortcut for this (if the default options in pd.to_datetime work for you):
df = pd.read_csv("file.csv", parse_dates=['date'])
mask = df['date'].dt.month == 6
df.loc[mask].to_csv("newfile.csv")
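Put together on the two sample rows from the question, a self-contained sketch (assuming the MM/DD/YYYY format stated above, hence format='%m/%d/%Y') looks like:

```python
import io
import pandas as pd

csv_text = """date
12/4/2003
6/15/2012
"""

df = pd.read_csv(io.StringIO(csv_text))
# Parse MM/DD/YYYY explicitly so 12/4 means December 4, not April 12.
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%Y')
june = df[df['date'].dt.month == 6]
print(june)  # the single 2012-06-15 row
```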