Casting from timestamp[us, tz=Etc/UTC] to timestamp[ns] would result in out of bounds timestamp - pandas

I have a feature which lets me query a Databricks Delta table from a client app. This is the code I use for that purpose:
df = spark.sql('SELECT * FROM EmployeeTerritories LIMIT 100')
dataframe = df.toPandas()
dataframe_json = dataframe.to_json(orient='records', force_ascii=False)
However, the second line throws the error:
Casting from timestamp[us, tz=Etc/UTC] to timestamp[ns] would result in out of bounds timestamp
I know what this error means: a timestamp field is out of bounds. I searched for a solution, but none of the ones I found applied to my scenario.
The solutions I found target a specific dataframe column, but in my case the problem is global: I have many Delta tables and I don't know in advance which columns are timestamp-typed, so I can't manipulate their types one by one to avoid this.
Is it possible to find all Timestamp type columns and cast them to string? Does this seem like a good solution? Do you have any other ideas on how I can achieve what I'm trying to do?

Is it possible to find all Timestamp type columns and cast them to string?
Yes, that's the way to go. You can loop through df.dtypes and cast the columns whose type is "timestamp" to string before calling df.toPandas():
import pyspark.sql.functions as F

df = df.select(*[
    F.col(c).cast("string").alias(c) if t == "timestamp" else F.col(c)
    for c, t in df.dtypes
])
dataframe = df.toPandas()
You can define this as a function that takes a df as parameter and reuse it with all your tables:
from pyspark.sql import DataFrame
import pyspark.sql.functions as F

def stringify_timestamps(df: DataFrame) -> DataFrame:
    return df.select(*[
        F.col(c).cast("string").alias(c) if t == "timestamp" else F.col(c).alias(c)
        for c, t in df.dtypes
    ])
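For example (a hypothetical usage, reusing the to_json call from the question):

dataframe = stringify_timestamps(df).toPandas()
dataframe_json = dataframe.to_json(orient='records', force_ascii=False)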
If you want to preserve the timestamp type, you can instead consider nullifying the timestamp values that are greater than pd.Timestamp.max, as shown in this post, rather than converting them to strings.
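For reference, here is a minimal sketch of that alternative (an assumption of what the linked post does, not its exact code): it nulls out values pandas cannot represent while keeping the timestamp type.

import pandas as pd
import pyspark.sql.functions as F

# pandas can only represent timestamps up to pd.Timestamp.max (year 2262)
pd_max = F.lit(pd.Timestamp.max.strftime("%Y-%m-%d %H:%M:%S")).cast("timestamp")

df = df.select(*[
    F.when(F.col(c) > pd_max, F.lit(None)).otherwise(F.col(c)).alias(c)
    if t == "timestamp" else F.col(c)
    for c, t in df.dtypes
])
dataframe = df.toPandas()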

Related

Aggregating multiple data types in pandas groupby

I have a data frame with rows that are mostly translations of other rows, e.g. an English row and an Arabic row. They share an identifier (location_shelfLocator), and I'm trying to merge the rows based on the identifier match. In some columns the Arabic row doesn't contain a translation but the same English value (e.g. for the language column both records might have ['ger'], which becomes ['ger', 'ger']), so I would like to get rid of these duplicate values. This is my code:
df_merged = df_filled.groupby("location_shelfLocator").agg(
    lambda x: np.unique(x.tolist())
)
It works when the values being aggregated are the same type (e.g. when they are both strings or when they are both arrays). When one is a string and the other is an array, it doesn't work. I get this warning:
FutureWarning: ['subject_name_namePart'] did not aggregate successfully. If any error is raised this will raise in a future version of pandas. Drop these columns/ops to avoid this warning.
df_merged = df_filled.groupby("location_shelfLocator").agg(lambda x: np.unique(x.tolist()))
and the offending column is removed from the final data frame. Any idea how I can combine these values and remove duplicates when they are both lists, both strings, or one of each?
Here is some sample data:
location_shelfLocator,language_languageTerm,subject_topic,accessCondition,subject_name_namePart
81055/vdc_100000000094.0x000093,ara,"['فلك، العرب', 'فلك، اليونان', 'فلك، العصور الوسطى', 'الكواكب']",المُلكية العامة,كلاوديوس بطلميوس (بطليمو)
81055/vdc_100000000094.0x000093,ara,"['Astronomy, Arab', 'Astronomy, Greek', 'Astronomy, Medieval', 'Constellations']",Public Domain,"['Claudius Ptolemaeus (Ptolemy)', ""'Abd al-Raḥmān ibn 'Umar Ṣūfī""]"
And expected output:
location_shelfLocator,language_languageTerm,subject_topic,accessCondition,subject_name_namePart
"[‘81055/vdc_100000000094.0x000093’] ",[‘ara’],"['فلك، العرب', 'فلك، اليونان', 'فلك، العصور الوسطى', ‘الكواكب’, 'Astronomy, Arab', 'Astronomy, Greek', 'Astronomy, Medieval', 'Constellations']","[‘المُلكية العامة’, ‘Public Domain’]","[‘كلاوديوس بطلميوس (بطليمو)’,’Claudius Ptolemaeus (Ptolemy)', ""'Abd al-Raḥmān ibn 'Umar Ṣūfī""]"
If you cannot control the input values, you need to normalize them first. Something like the following, which converts the plain string values in subject_name_namePart into lists of strings:
from ast import literal_eval

# Wrap bare string values in list syntax so every value parses as a list
mask = df.subject_name_namePart.str[0] != '['
df.loc[mask, 'subject_name_namePart'] = "['" + df.loc[mask, 'subject_name_namePart'] + "']"
df['subject_name_namePart'] = df.subject_name_namePart.transform(literal_eval)
Then you can explode and aggregate:
df = df.explode('subject_name_namePart')
df = df.groupby('location_shelfLocator').agg(lambda x: x.unique().tolist())
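If other columns also mix plain strings and lists, a hedged generalization of the same idea (the helper names below are hypothetical, and it assumes any list-like string columns have already been parsed with literal_eval as above) is to wrap scalars in one-element lists and deduplicate while flattening:

def to_list(value):
    # Keep lists as-is; wrap scalars (e.g. plain strings) in a one-element list
    return value if isinstance(value, list) else [value]

def merge_unique(series):
    # Flatten every value in the group and drop duplicates, preserving order
    flat = [item for value in series for item in to_list(value)]
    return list(dict.fromkeys(flat))

df_merged = df_filled.groupby("location_shelfLocator").agg(merge_unique)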

Round values of a varying quantity of columns on Databricks Scala

I am using Scala on Databricks and:
I have a dataframe that has N columns.
All but the first Y columns are of the type "float" and have numbers that I want to round to 0 decimals.
I don't want to write one specific line of code for each column that needs to be rounded, because there may be a lot of such columns and they vary.
In order to do that, I tried to create a function with Map (not sure if it is the best option):
def roundValues(precision: Int)(df: DataFrame): DataFrame = {
  val roundedCols = df.columns.map(c => round(col(c), precision).as(c))
  df.select(roundedCols: _*)
}

df.transform(roundValues(0))
But I always get an error because the first Y columns are strings, dates, or other types.
My questions:
How can I round the values on all of the necessary columns?
The number of Y columns in the beginning may vary, as well as the number of N-Y columns that I need to round. Is there a way for me not to have to manually insert the names of the columns that will need to be rounded? (e.g. round only the columns of type float, ignore all others)
In the end, should I convert from float to other type? I am going to use the final dataframe to do some plots or some simple calculations. I won't need decimals anymore for these things.
You can get the datatype information from the dataframe schema:
import org.apache.spark.sql.types.FloatType

// Names of all float-typed columns
val floatColumns = df.schema.fields.filter(_.dataType == FloatType).map(_.name)

// Round the float columns, keep everything else unchanged
val selectExpr = df.columns.map(c =>
  if (floatColumns.contains(c))
    round(col(c), 0).as(c)
  else col(c)
)

val df1 = df.select(selectExpr: _*)

PySpark dataframe Pandas UDF returns empty dataframe

I'm trying to apply a pandas_udf to my PySpark dataframe for some filtering, following the groupby('Key').apply(UDF) method. To use the pandas_udf I defined an output schema and have a condition on the column Number. As an example, the simplified idea here is that I wish only to return the ID of the rows with odd Number.
This brings up a problem: sometimes there is no odd Number in a group, so the UDF returns an empty dataframe, which conflicts with the defined schema expecting an int for Number.
Is there a way to solve this and output only the odd-Number rows, combined into a new dataframe?
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("Key", StringType()),
    StructField("Number", IntegerType())
])

@pandas_udf(schema, functionType=PandasUDFType.GROUPED_MAP)
def get_odd(df):
    odd = df.loc[df['Number'] % 2 == 1]
    return odd[['ID', 'Number']]
I came across this issue with empty DataFrames in some groups. I solved it by checking for an empty DataFrame and returning a DataFrame with the schema defined:
if df_out.empty:
    # change the schema as needed
    return pd.DataFrame({'fullVisitorId': pd.Series([], dtype='str'),
                         'time': pd.Series([], dtype='datetime64[ns]'),
                         'total_transactions': pd.Series([], dtype='int')})
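Applied to the question's schema, a minimal sketch could look like this (the empty-frame dtypes are my assumption, chosen to line up with the declared StructType, and it returns Key rather than ID so the output matches the schema):

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("Key", StringType()),
    StructField("Number", IntegerType())
])

@pandas_udf(schema, functionType=PandasUDFType.GROUPED_MAP)
def get_odd(df):
    # Keep only the rows with an odd Number
    odd = df.loc[df['Number'] % 2 == 1, ['Key', 'Number']]
    if odd.empty:
        # Return an empty frame whose dtypes line up with the declared schema
        return pd.DataFrame({'Key': pd.Series([], dtype='str'),
                             'Number': pd.Series([], dtype='int')})
    return odd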

Manipulating duplicate rows across a subset of columns in dataframe pandas

Suppose I have a dataframe as follows:
df = pd.DataFrame({"user":[11,11,11,21,21,21,21,21,32,32],
"event":[0,0,1,0,0,1,1,1,0,0],
"datetime":['05:29:54','05:32:04','05:32:08',
'15:35:26','15:36:07','15:36:16','15:36:50','15:36:54',
'09:29:12', '09:29:25'] })
I would like to handle the repeated rows sharing the same value in the first column (user) to reach the following result.
In this case, the 'event' column is replaced with the maximum value for that user (for example, for user=11 the maximum value of event is 1), and the 'datetime' column is replaced by the average of the datetimes.
P.S. Dropping repetitive rows has already been discussed here; however, I do not want to drop rows blindly, especially when I am dealing with a dataframe with a lot of attributes.
You want to groupby and aggregate:
df.groupby('user').agg({'event': 'max',
                        'datetime': lambda s: pd.to_timedelta(s).mean()})
Alternatively, you can convert your datetime column to timedelta first using pd.to_timedelta and then simply take the mean in the agg, as sketched below.
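A minimal sketch of that alternative (assuming the df from the question):

# Convert once up front, then the agg only needs a built-in reducer
df['datetime'] = pd.to_timedelta(df['datetime'])
df.groupby('user').agg({'event': 'max', 'datetime': 'mean'})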
You can use str to format the result the way you intend:
df.groupby('user').agg({'event': 'max',
                        'datetime': lambda s: str(pd.to_timedelta(s).mean().to_pytimedelta())})
You can convert the datetimes to integer nanoseconds, aggregate with mean, then convert back and use strftime for HH:MM:SS strings:
import numpy as np
import pandas as pd

df['datetime'] = pd.to_datetime(df['datetime']).astype(np.int64)
df1 = df.groupby('user', as_index=False).agg({'event': 'max', 'datetime': 'mean'})
df1['datetime'] = pd.to_datetime(df1['datetime']).dt.strftime('%H:%M:%S')
print(df1)

   user  event  datetime
0    11      1  05:31:22
1    21      1  15:36:18
2    32      0  09:29:18

What's the Pandas way to write `if()` conditional between two `timeseries` columns?

My naive approach to Pandas Series needs some pointers. I have one Pandas DataFrame built from two joined tables. The left table had a timestamp column titled Time1 and the right had Time2; my new DataFrame has both.
At this step I'm comparing the two datetime columns using helper functions g() and f():
df['date_error'] = g(df['Time1'], df['Time2'])
The working helper function g() compares two datetime values:
def g(newer, older):
    value = newer > older
    return value
This gives me a column of (True, False) values. When I use the conditional in the helper function f(), I get an error because newer and older are Pandas Series:
def f(newer, older):
    if newer > older:
        delta = (newer - older)
    else:
        # arbitrarily large value to maintain col dtype
        delta = datetime.timedelta(minutes=1000)
    return delta
Ok. Fine. I know I'm not unpacking the Pandas Series correctly, because I can get this to work with the following monstrosity:
def f(newer, older):
    delta = []
    for (k, v), (k2, v2) in zip(newer.iteritems(), older.iteritems()):
        if v > v2:
            delta.append(v - v2)
        else:
            # arbitrarily large value to maintain col dtype
            delta.append(datetime.timedelta(minutes=1000))
    return pd.Series(delta)
What's the Pandas way to write a conditional between two DataFrame columns?
Usually where is the pandas equivalent of if:
df = pd.DataFrame([['1/1/01 11:00', '1/1/01 12:00'],
                   ['1/1/01 14:00', '1/1/01 13:00']],
                  columns=['Time1', 'Time2']
                  ).apply(pd.to_datetime)

(df.Time1 - df.Time2).where(df.Time1 > df.Time2)

0        NaT
1   01:00:00
dtype: timedelta64[ns]
If you don't want nulls in this column, you could call fillna afterwards with a suitable Timedelta (e.g. pd.Timedelta(minutes=1000), the question's placeholder); however, note that this datatype does support a null value, NaT (not a time).
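A minimal sketch of that follow-up (assuming the df built above and reusing the question's arbitrary 1000-minute placeholder):

delta = (df.Time1 - df.Time2).where(df.Time1 > df.Time2)
df['date_error'] = delta.fillna(pd.Timedelta(minutes=1000))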