how to convert {{ds}} airflow macro to DATE datatype - dataframe

df['date'] = {{ds}}
In this dataframe I want to store a value of type DATE, but all I get is a string.

You can't use macros outside of an operator's scope. Macros are rendered as part of operator execution; otherwise they are just plain text strings. {{ ds }} works only in templated fields of an operator. In your example you are looking to get the value of ds inside a Python callable rather than in a templated field, so you can get the value like this:
import pandas as pd
from airflow.operators.python_operator import PythonOperator

def func(**kwargs):
    execution_date = kwargs['execution_date']
    df = pd.DataFrame()
    # wrap the scalar in a list so the empty frame gets one row
    df['execution_date'] = [execution_date]
    # if you also want to convert the column to datetime you can add
    df['execution_date'] = pd.to_datetime(df['execution_date'])
    print(df)

op = PythonOperator(
    task_id='example',
    python_callable=func,
    provide_context=True,
    dag=dag
)
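If you specifically need a DATE (a datetime.date) rather than a full timestamp, a minimal sketch, assuming the same callable setup as above: take .date() of the execution date, or parse the ds string, both of which Airflow passes in the context:

def func(**kwargs):
    # execution_date is a datetime-like object; .date() drops the time part
    df = pd.DataFrame({'date': [kwargs['execution_date'].date()]})
    # ds is the same value as a 'YYYY-MM-DD' string, if you prefer to parse it:
    # from datetime import datetime
    # df = pd.DataFrame({'date': [datetime.strptime(kwargs['ds'], '%Y-%m-%d').date()]})
    print(df.dtypes)  # the column holds datetime.date objects (dtype: object)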

Related

Pandas Sort Values By Date Doesn't Sort By Year

I have a large data set that is in this format
I'd like to order this data set by the "created_at" column, so I converted the "created_at" column to type datetime following this guide:
https://www.geeksforgeeks.org/how-to-sort-a-pandas-dataframe-by-date/
data = pd.read_csv(PATH_TO_CSV)
data['created_at'] = data['created_at'].str.split("+").str[0]
data['created_at'] = pd.to_datetime(data['created_at'],format="%Y-%m-%dT%H:%M:%S")
data.sort_values(by='created_at')
But it's not sorting by year as expected. The values starting with 2012 should be at the top, but they aren't
print(data)
print(type(data['created_at'][0]))
What am I missing?
With a datetime type this should sort directly; make sure to assign the output, as sorting is not done in place:
# no need for an intermediate column nor to pass the full format
data['created_at'] = pd.to_datetime(data['created_at'].str.split("+").str[0])
# assign output
data = data.sort_values(by='created_at')
As already stated in the comments, the sorted df needs to be assigned again; sort_values doesn't work in place by default.
data = data.sort_values(by='created_at')
# OR
data.sort_values(by='created_at', inplace=True)
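A quick self-contained demo of the difference, using made-up dates just for illustration:

import pandas as pd

data = pd.DataFrame({'created_at': ['2015-03-01T10:00:00', '2012-07-15T08:30:00']})
data['created_at'] = pd.to_datetime(data['created_at'])

data.sort_values(by='created_at')         # returns a sorted copy; data is unchanged
data = data.sort_values(by='created_at')  # now the 2012 row comes first
print(data)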

Casting from timestamp[us, tz=Etc/UTC] to timestamp[ns] would result in out of bounds timestamp

I have a feature which lets me query a Databricks delta table from a client app. This is the code I use for that purpose:
df = spark.sql('SELECT * FROM EmployeeTerritories LIMIT 100')
dataframe = df.toPandas()
dataframe_json = dataframe.to_json(orient='records', force_ascii=False)
However, the second line throws me the error
Casting from timestamp[us, tz=Etc/UTC] to timestamp[ns] would result in out of bounds timestamp
I know what this error means: my date-typed field is out of bounds. I searched for a solution, but none of the ones I found fit my scenario.
The solutions I found target a specific dataframe column, but my problem is global: I have tons of delta tables and I don't know which columns are date-typed, so I can't do type manipulation on them individually to avoid this.
Is it possible to find all Timestamp type columns and cast them to string? Does this seem like a good solution? Do you have any other ideas on how can I achieve what I'm trying to do?
Is it possible to find all Timestamp type columns and cast them to string?
Yes, that's the way to go. You can loop through df.dtypes and cast the columns having type "timestamp" into strings before calling df.toPandas():
import pyspark.sql.functions as F

df = df.select(*[
    F.col(c).cast("string").alias(c) if t == "timestamp" else F.col(c)
    for c, t in df.dtypes
])
dataframe = df.toPandas()
You can define this as a function that takes a df as parameter and use it with all your tables:
from pyspark.sql import DataFrame

def stringify_timestamps(df: DataFrame) -> DataFrame:
    return df.select(*[
        F.col(c).cast("string").alias(c) if t == "timestamp" else F.col(c).alias(c)
        for c, t in df.dtypes
    ])
If you want to preserve the timestamp type, you can consider nullifying the timestamp values which are greater than pd.Timestamp.max as shown in this post instead of converting into strings.
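A minimal sketch of that alternative, assuming pd.Timestamp.max as the cutoff, so that out-of-bounds values become NULL and the column keeps its timestamp type:

import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql import DataFrame

# pandas timestamps are nanosecond-based and cannot exceed this bound
PANDAS_MAX = pd.Timestamp.max.to_pydatetime()

def nullify_out_of_bounds(df: DataFrame) -> DataFrame:
    # F.when without .otherwise yields NULL for rows past the cutoff
    return df.select(*[
        F.when(F.col(c) <= F.lit(PANDAS_MAX), F.col(c)).alias(c)
        if t == "timestamp" else F.col(c)
        for c, t in df.dtypes
    ])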

Set a pandas column to a timezone only datetime object

I'm trying to refactor a piece of code which is very slow. It takes the timezoneId of every row and applies the pytz.timezone() method to transform it into a datetime object with only the timezone info.
for example, the Dataframe looks like this:
df = pd.DataFrame([['America/Sao_Paulo'], ['US/Eastern'], ['Europe/Moscow']], index =['ID1', 'ID2', 'ID3'])
I need these strings to be converted to datetime objects.
If I try to use .apply(pytz.timezone) I get the following error:
AttributeError: 'Series' object has no attribute 'upper'
And I cannot use the .to_datetime Pandas method with only timezone information.
How would I got about creating a datetime object with only timezone information?
EDIT:
Here's what the code I'm rewriting looks like:
try:
    tz_id = data[-1]['timezone']['timeZoneId']
    self.timezone = pytz.timezone(tz_id)
except:
    self.timezone = pytz.timezone("US/Eastern")
return self.timezone
It only takes one ID at a time, which is why it currently works but is so slow.
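For what it's worth, a sketch of a vectorized fix: the AttributeError happens because df.apply passes whole columns (Series) to pytz.timezone, which expects a single string (it calls .upper() on its argument). Mapping element-wise over the column works:

import pandas as pd
import pytz

df = pd.DataFrame([['America/Sao_Paulo'], ['US/Eastern'], ['Europe/Moscow']],
                  index=['ID1', 'ID2', 'ID3'])

# map applies pytz.timezone to each string, not to the Series as a whole
df['tz'] = df[0].map(pytz.timezone)
print(df['tz'])  # a column of tzinfo objects, not datetimes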

Setting the data model in pandas

So I'm used to database ETL. In SQL I create the table and set the char lengths, data types, etc. As I understand it, pandas uses the max length of whatever is put into the dataframe. That's fine if you're staying in Python, but I need to specify these things explicitly.
Here's some base code to work from, pointers welcome:
df = pd.DataFrame()
df['ID'] = ...    # some data, probably i + 1
df['text'] = ...  # some text, length set to max 255
Here is an informative article on pandas datatypes:
https://pbpython.com/pandas_dtypes.html
If you want to see the datatypes of your dataframe, you can do:
df.info()
You can set datatypes explicitly with .astype(), where you can choose to raise or ignore conversion errors:
df['ID'] = df['ID'].astype(int, errors='raise')
df['ID'] = df['ID'].astype(int, errors='ignore')
For strings you can set the datatype as follows:
df['text'] = df['text'].astype('string')
Or if you're using an older version of pandas < 1.0 then do:
df['text'] = df['text'].astype(str)
If you want to set a max length for your string, you could do:
df['text'] = df['text'].str.slice(0, 255)
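If you want to declare the model up front, ETL-style, a minimal sketch, with a hypothetical file and column names, is to set dtypes at load time and enforce the length cap on the way in:

import pandas as pd

# 'data.csv', 'ID' and 'text' are assumptions for illustration
df = pd.read_csv('data.csv', dtype={'ID': 'int64', 'text': 'string'})
df['text'] = df['text'].str.slice(0, 255)  # VARCHAR(255)-like truncation
print(df.dtypes)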

How to add return value from function into dataframe Column?

I want to add multiple return values from a function into one column. For example, here I am splitting the combined hours and minutes into separate variables, "hour" and "minutes". Now I want to add these values into one dataframe column.
def issuetimesplit(x):
    hour = str(x)[:-2]
    minutes = str(x)[-2:]
    time = print(f"{hour} : {minutes}")
    return time

for i, v in timeData.items():
    issuetimesplit(v)
Let me sum up the comments.
You can't use print to define a string variable; print returns None, so time is None here.
In the function, your new string can be returned immediately, which means the time variable is not needed (though it is not a mistake to define it).
You can use the apply() function; that could be a solution for your problem.
The code is as follows.
# import pandas
import pandas as pd

# define the function
def make_new_time_feature(row):
    hour = row['old_time_feature'][:-2]
    minutes = row['old_time_feature'][-2:]
    return f'{hour} : {minutes}'

# call apply
your_df['new_time_feature'] = your_df.apply(lambda row: make_new_time_feature(row), axis=1)
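As a side note, since the function only slices strings, a vectorized sketch, assuming old_time_feature already holds strings, avoids apply entirely:

s = your_df['old_time_feature'].astype(str)
your_df['new_time_feature'] = s.str[:-2] + ' : ' + s.str[-2:]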