Round values of a varying quantity of columns on Databricks Scala - dataframe

I am using Scala on Databricks and:
I have a dataframe that has N columns.
All but the first Y columns are of the type "float" and have numbers that I want to round to 0 decimals.
I don't want to write one specific line of code for each column that needs rounding, because there may be many such columns and they vary.
In order to do that, I tried to create a function using map (not sure if it is the best option):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, round}

def roundValues(precision: Int)(df: DataFrame): DataFrame = {
  val roundedCols = df.columns.map(c => round(col(c), precision).as(c))
  df.select(roundedCols: _*)
}

df.transform(roundValues(0))
But I always get an error because the first Y columns are strings, dates, or other types.
My questions:
How can I round the values on all of the necessary columns?
The number of Y columns at the beginning may vary, as well as the number of N-Y columns I need to round. Is there a way to avoid manually listing the names of the columns that need to be rounded? (e.g., round only the float columns and ignore all others)
In the end, should I convert from float to other type? I am going to use the final dataframe to do some plots or some simple calculations. I won't need decimals anymore for these things.

You can get the datatype information from the dataframe schema:

import org.apache.spark.sql.functions.{col, round}
import org.apache.spark.sql.types.FloatType

val floatColumns = df.schema.fields.filter(_.dataType == FloatType).map(_.name)

val selectExpr = df.columns.map(c =>
  if (floatColumns.contains(c)) round(col(c), 0).as(c)
  else col(c)
)

val df1 = df.select(selectExpr: _*)
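On the last question (whether to convert away from float afterwards): since the rounded values are whole numbers, casting them to integers is reasonable for plots and simple calculations. As a quick sanity check of the same select-only-the-float-columns idea, here is a pandas sketch; the column names are invented for illustration:

```python
import pandas as pd

# Hypothetical frame: one non-numeric column followed by float columns.
df = pd.DataFrame({"name": ["a", "b"], "x": [1.26, 2.4], "y": [3.51, 4.49]})

# Pick out only the float columns, round them, and cast to integers.
float_cols = df.select_dtypes(include="float").columns
df[float_cols] = df[float_cols].round(0).astype("int64")
```

In Spark itself the equivalent cast would be round(col(c), 0).cast("int") in the select expression above.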

Aggregating multiple data types in pandas groupby

I have a data frame with rows that are mostly translations of other rows, e.g. an English row and an Arabic row. They share an identifier (location_shelfLocator), and I'm trying to merge the rows together based on the identifier match. In some columns the Arabic row doesn't contain a translation but repeats the same English value (e.g. for the language column, both records might have ['ger'], which becomes ['ger', 'ger']), so I would like to get rid of these duplicate values. This is my code:
df_merged = df_filled.groupby("location_shelfLocator").agg(
    lambda x: np.unique(x.tolist())
)
It works when the values being aggregated are the same type (e.g. when they are both strings or when they are both arrays). When one is a string and the other is an array, it doesn't work. I get this warning:
FutureWarning: ['subject_name_namePart'] did not aggregate successfully. If any error is raised this will raise in a future version of pandas. Drop these columns/ops to avoid this warning.
df_merged = df_filled.groupby("location_shelfLocator").agg(lambda x: np.unique(x.tolist()))
and the offending column is removed from the final data frame. Any idea how I can combine these values and remove duplicates when they are both lists, both strings, or one of each?
Here is some sample data:
location_shelfLocator,language_languageTerm,subject_topic,accessCondition,subject_name_namePart
81055/vdc_100000000094.0x000093,ara,"['فلك، العرب', 'فلك، اليونان', 'فلك، العصور الوسطى', 'الكواكب']",المُلكية العامة,كلاوديوس بطلميوس (بطليمو)
81055/vdc_100000000094.0x000093,ara,"['Astronomy, Arab', 'Astronomy, Greek', 'Astronomy, Medieval', 'Constellations']",Public Domain,"['Claudius Ptolemaeus (Ptolemy)', ""'Abd al-Raḥmān ibn 'Umar Ṣūfī""]"
And expected output:
location_shelfLocator,language_languageTerm,subject_topic,accessCondition,subject_name_namePart
"['81055/vdc_100000000094.0x000093']",['ara'],"['فلك، العرب', 'فلك، اليونان', 'فلك، العصور الوسطى', 'الكواكب', 'Astronomy, Arab', 'Astronomy, Greek', 'Astronomy, Medieval', 'Constellations']","['المُلكية العامة', 'Public Domain']","['كلاوديوس بطلميوس (بطليمو)', 'Claudius Ptolemaeus (Ptolemy)', ""'Abd al-Raḥmān ibn 'Umar Ṣūfī""]"
If you cannot control the input values, you need to fix them somehow.
Something like this: here, I am converting the string values in subject_name_namePart to arrays of strings.
from ast import literal_eval
mask = df.subject_name_namePart.str[0] != '['
df.loc[mask, 'subject_name_namePart'] = "['" + df.loc[mask, 'subject_name_namePart'] + "']"
df['subject_name_namePart'] = df.subject_name_namePart.transform(literal_eval)
Then, you can explode and aggregate:
df = df.explode('subject_name_namePart')
df = df.groupby('location_shelfLocator').agg(lambda x: x.unique().tolist())
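Putting the normalization, explode, and aggregation together on a tiny made-up frame (the identifier and values here are invented just to show the mechanics):

```python
import pandas as pd
from ast import literal_eval

# Two rows sharing an identifier: one bare string, one stringified list.
df = pd.DataFrame({
    "location_shelfLocator": ["id1", "id1"],
    "subject_name_namePart": ["Ptolemy", "['Ptolemy', 'Sufi']"],
})

# Wrap bare strings so every cell parses as a list, then parse.
mask = df.subject_name_namePart.str[0] != '['
df.loc[mask, 'subject_name_namePart'] = "['" + df.loc[mask, 'subject_name_namePart'] + "']"
df['subject_name_namePart'] = df.subject_name_namePart.transform(literal_eval)

# One row per list element, then collect the unique values per identifier.
df = df.explode('subject_name_namePart')
out = df.groupby('location_shelfLocator').agg(lambda x: x.unique().tolist())
```

Note the wrapping step assumes the bare strings contain no single quotes; cells with apostrophes would need escaping first.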

How to rename the columns starting with 'abcd' to starting with 'wxyz' in Spark Scala?

How can I rename the columns starting with abcd so that they start with wxyz?
List of columns: abcd_name, abcd_id, abcd_loc, empId, empCode
I need to change the names of columns in a dataframe that starts with abcd
Required column list: wxyz_name, wxyz_id, wxyz_loc, empId, empCode
I tried getting the list of matching columns using the code below, but I am not sure how to proceed from there.
val df_cols_abcd = df.columns.filter(_.startsWith("abcd")).map(df(_))
You can do that with foldLeft:

val oldPrefix = "abcd"
val newPrefix = "wxyz"

val newDf = df.columns
  .filter(_.startsWith(oldPrefix))
  .foldLeft(df)((acc, oldName) =>
    acc.withColumnRenamed(oldName, newPrefix + oldName.substring(oldPrefix.length))
  )
Your first idea, filtering columns with startsWith, is correct. The only thing you are missing is the part where you rename all the columns.
I recommend doing some research on foldLeft if you're not familiar with it. The idea is the following:
Start with an initial dataframe (df, in the first pair of parentheses).
Then apply a function to it for each of the columns that need renaming (the function in the second pair of parentheses). This function takes an accumulator (acc), which is an intermediate dataframe (the columns are renamed one at a time), and a second argument, which is the current element of the list (here the list contains the names of the columns that need to be modified).
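The foldLeft pattern has a direct analogue in Python's functools.reduce. This plain-Python sketch mirrors the accumulator logic on a list of column names (no Spark needed; the names come from the question):

```python
from functools import reduce

old_prefix, new_prefix = "abcd", "wxyz"
cols = ["abcd_name", "abcd_id", "abcd_loc", "empId", "empCode"]

# Like foldLeft: start from the full column list and apply one rename per step.
to_rename = [c for c in cols if c.startswith(old_prefix)]
renamed = reduce(
    lambda acc, old: [new_prefix + c[len(old_prefix):] if c == old else c for c in acc],
    to_rename,
    cols,
)
```

Each step of the reduce plays the role of one withColumnRenamed call: the accumulator is the partially renamed list, and the current element is the next column to rename.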

Casting from timestamp[us, tz=Etc/UTC] to timestamp[ns] would result in out of bounds timestamp

I have a feature which lets me query a Databricks delta table from a client app. This is the code I use for that purpose:
df = spark.sql('SELECT * FROM EmployeeTerritories LIMIT 100')
dataframe = df.toPandas()
dataframe_json = dataframe.to_json(orient='records', force_ascii=False)
However, the second line throws me the error
Casting from timestamp[us, tz=Etc/UTC] to timestamp[ns] would result in out of bounds timestamp
I know what this error means: my date-type field is out of bounds. I tried searching for a solution, but none of those I found fit my scenario.
The solutions I found were about a specific dataframe column but in my case I have a global problem because I have tons of delta tables and I don't know the specific date-typed column so I can do type manipulation in order to avoid this.
Is it possible to find all Timestamp type columns and cast them to string? Does this seem like a good solution? Do you have any other ideas on how can I achieve what I'm trying to do?
Is it possible to find all Timestamp type columns and cast them to string?

Yes, that's the way to go. You can loop through df.dtypes and handle columns having type = "timestamp" by casting them into strings before calling df.toPandas():
import pyspark.sql.functions as F

df = df.select(*[
    F.col(c).cast("string").alias(c) if t == "timestamp" else F.col(c)
    for c, t in df.dtypes
])
dataframe = df.toPandas()
You can define this as a function that takes df as a parameter and use it with all your tables:

from pyspark.sql import DataFrame

def stringify_timestamps(df: DataFrame) -> DataFrame:
    return df.select(*[
        F.col(c).cast("string").alias(c) if t == "timestamp" else F.col(c)
        for c, t in df.dtypes
    ])
If you want to preserve the timestamp type, you can consider nullifying the timestamp values which are greater than pd.Timestamp.max as shown in this post instead of converting into strings.
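To see why nullifying helps: pandas (before 2.0) stores timestamps as nanoseconds since the epoch, so anything after pd.Timestamp.max (in the year 2262) cannot be represented. A small pandas-only sketch of the clipping idea, with invented values:

```python
import datetime
import pandas as pd

# One in-range value and one far beyond nanosecond-precision bounds.
values = [datetime.datetime(2020, 1, 1), datetime.datetime(3000, 1, 1)]

# Nullify anything that cannot fit in nanosecond precision before converting.
limit = pd.Timestamp.max.to_pydatetime()
safe = [v if v <= limit else None for v in values]
s = pd.Series(safe, dtype="datetime64[ns]")
```

The out-of-bounds value becomes NaT instead of raising, which is the same effect the linked post achieves on the Spark side before toPandas().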

Manipulating duplicate rows across a subset of columns in dataframe pandas

Suppose I have a dataframe as follows:
import pandas as pd

df = pd.DataFrame({"user": [11, 11, 11, 21, 21, 21, 21, 21, 32, 32],
                   "event": [0, 0, 1, 0, 0, 1, 1, 1, 0, 0],
                   "datetime": ['05:29:54', '05:32:04', '05:32:08',
                                '15:35:26', '15:36:07', '15:36:16', '15:36:50', '15:36:54',
                                '09:29:12', '09:29:25']})
I would like to collapse the rows that repeat in the first column (user) to reach the following.
In this case, the 'event' column is replaced with the maximum event value for that user (for example, for user=11 the maximum event value is 1), and the third column is replaced by the average of the datetimes.
P.S. Dropping duplicate rows has already been discussed here; however, I do not want to drop rows blindly, especially when I am dealing with a dataframe that has many attributes.
You want to groupby and aggregate:

df.groupby('user').agg({'event': 'max',
                        'datetime': lambda s: pd.to_timedelta(s).mean()})
If you want, you can also convert your datetime column to timedelta first using pd.to_timedelta and simply take the mean in the agg.
You can wrap the result in str to get the representation you intend:

df.groupby('user').agg({'event': 'max',
                        'datetime': lambda s: str(pd.to_timedelta(s).mean().to_pytimedelta())})
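Putting the timedelta approach together as a runnable sketch with the sample data from the question:

```python
import pandas as pd

df = pd.DataFrame({"user": [11, 11, 11, 21, 21, 21, 21, 21, 32, 32],
                   "event": [0, 0, 1, 0, 0, 1, 1, 1, 0, 0],
                   "datetime": ['05:29:54', '05:32:04', '05:32:08',
                                '15:35:26', '15:36:07', '15:36:16', '15:36:50', '15:36:54',
                                '09:29:12', '09:29:25']})

# HH:MM:SS strings parse directly as timedeltas, which average cleanly.
out = df.groupby('user').agg({'event': 'max',
                              'datetime': lambda s: pd.to_timedelta(s).mean()})
```

For user 11 this yields event 1 and the mean time 05:31:22, matching the integer-conversion approach below.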
You can convert the datetimes to native integers, aggregate the mean, then convert back; for HH:MM:SS strings use strftime:

import numpy as np
import pandas as pd

df['datetime'] = pd.to_datetime(df['datetime']).astype(np.int64)
df1 = df.groupby('user', as_index=False).agg({'event': 'max', 'datetime': 'mean'})
df1['datetime'] = pd.to_datetime(df1['datetime']).dt.strftime('%H:%M:%S')
print(df1)
   user  event  datetime
0    11      1  05:31:22
1    21      1  15:36:18
2    32      0  09:29:18

Group DataFrame by binning a column::Float64, in Julia

Say I have a DataFrame with a column of Float64s, I'd like to group the dataframe by binning that column. I hear the cut function might help, but it's not defined over dataframes. Some work has been done (https://gist.github.com/tautologico/3925372), but I'd rather use a library function rather than copy-pasting code from the Internet. Pointers?
EDIT Bonus karma for finding a way of doing this by month over UNIX timestamps :)
You could bin dataframes based on a column of Float64s like this. Here my bins are increments of 0.1 from 0.0 to 1.0, binning the dataframe based on a column of 100 random numbers between 0.0 and 1.0.
using DataFrames # load DataFrames

df = DataFrame(index = rand(Float64, 100)) # a DataFrame with some random Float64 numbers

# Map an anonymous function that selects every row whose index value falls
# between the two numbers of a tuple x, over an array of tuples built with zip.
df_array = map(x -> df[(df[:index] .>= x[1]) .& (df[:index] .< x[2]), :],
               zip(0.0:0.1:0.9, 0.1:0.1:1.0))
This will produce an array of 10 dataframes, each one with a different 0.1-sized bin.
As for the UNIX timestamp question, I'm not as familiar with that side of things, but after playing around a bit maybe something like this could work:
using Dates

df = DataFrame(unixtime = rand(1E9:1:1.1E9, 100)) # pretend unix timestamps
df[:date] = Dates.unix2datetime.(df[:unixtime]) # convert timestamps to DateTime

# Make a "year month" string for every row
df[:year_month] = map(date -> string(Dates.Year(date)) * " " * string(Dates.Month(date)), df[:date])

# Bin based on each unique year_month string
df_array = map(ym -> df[df[:year_month] .== ym, :], unique(df[:year_month]))
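For comparison, the cut-style binning the question asks about is built into pandas; this sketch groups 100 random floats into 0.1-wide half-open bins, mirroring the Julia approach above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"index": rng.random(100)})

# Label each row with its half-open [lo, hi) bin, then split by bin label.
df["bin"] = pd.cut(df["index"], bins=np.arange(0.0, 1.01, 0.1), right=False)
groups = {label: g for label, g in df.groupby("bin", observed=True)}
```

Each entry of groups corresponds to one of the sub-dataframes in df_array above.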