replace .withColumn with a df.select - dataframe

I am doing a basic transformation on my PySpark DataFrame, but here I am using multiple .withColumn statements.
from pyspark.sql import functions as F

def trim_and_lower_col(col_name):
    return F.when(F.trim(col_name) == "", F.lit("unspecified")).otherwise(F.lower(F.trim(col_name)))

df = (
    source_df.withColumn("browser", trim_and_lower_col("browser"))
    .withColumn("browser_type", trim_and_lower_col("browser_type"))
    .withColumn("domains", trim_and_lower_col("domains"))
)
I read that chaining multiple withColumn statements isn't very efficient and I should use df.select() instead.
I tried this:
cols_to_transform = [
    "browser",
    "browser_type",
    "domains"
]

df = (
    source_df.select([trim_and_lower_col(col).alias(col) for col in cols_to_transform] + source_df.columns)
)
but it gives me a duplicate column error
What else can I try?

You get the duplicate column error because you pass each transformed column twice in that list: once as your newly transformed column (through .alias) and once as the original column (by name in source_df.columns). This solution lets you use a single select statement, preserves the column order, and avoids the duplication issue:
df = (
    source_df.select([trim_and_lower_col(col).alias(col) if col in cols_to_transform else col for col in source_df.columns])
)
Chaining many .withColumn calls does pose a problem: the unresolved query plan can get pretty large and cause a StackOverflowError on the Spark driver during query plan optimisation. One good explanation of this problem is shared here: https://medium.com/@manuzhang/the-hidden-cost-of-spark-withcolumn-8ffea517c015
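To see the effect for yourself, here is a minimal sketch (the column names are made up) comparing the parsed logical plans: each withColumn adds another Project node, while the equivalent single select produces just one.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
base = spark.range(10).select(F.col("id").alias("c0"))

# One extra Project node is added to the logical plan per withColumn call
chained = base
for i in range(50):
    chained = chained.withColumn(f"c{i + 1}", F.col("c0") + i)

# The same 50 new columns expressed as a single Project
selected = base.select("c0", *[(F.col("c0") + i).alias(f"c{i + 1}") for i in range(50)])

chained.explain(extended=True)   # parsed plan: ~50 nested Projects
selected.explain(extended=True)  # parsed plan: a single Project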

You are naming your new columns with .alias(col), which means they have the same name as the columns you used to create them.
During creation (using .withColumn) this does not pose a problem, but as soon as you try to select both, Spark does not know which column to pick.
You could fix it for example by giving the new columns a suffix:
cols_to_transform = [
    "browser",
    "browser_type",
    "domains"
]

df = (
    source_df.select([trim_and_lower_col(col).alias(f"{col}_new") for col in cols_to_transform] + source_df.columns)
)
Another solution, which does pollute the DAG though, would be:
cols_to_transform = [
    "browser",
    "browser_type",
    "domains"
]

for col in cols_to_transform:
    source_df = source_df.withColumn(col, trim_and_lower_col(col))

If you only have these few withColumn calls, keep using them. They are still far more readable, and therefore more maintainable and self-explanatory.
If you look into it, you'll see that Spark only tells you to be careful with withColumn when you have something like 200 of them.
Using select also makes your code more error prone, since it is more complex to read.
Now, if you have many columns, I would define:
the list of columns to transform,
the list of columns to keep,
and then do the select:
cols_to_transform = [
    "browser",
    "browser_type",
    "domains"
]

cols_to_keep = [c for c in source_df.columns if c not in cols_to_transform]
cols_transformed = [trim_and_lower_col(c).alias(c) for c in cols_to_transform]

source_df.select(*cols_to_keep, *cols_transformed)
This gives you the same columns as the withColumn version, though the untransformed columns now come first; if you need to preserve the original column order, build the select list by iterating over source_df.columns as in the first answer above.

Related

How to dynamically build a select list from an API payload using PyPika

I have a JSON API payload containing a tablename and a columnlist - how do I build a SELECT query from it using pypika?
So far I have been able to use a plain string columnlist, but I am not able to do advanced querying using functions, analytics, etc.
from pypika import Table, Query, functions as fn

def generate_sql(tablename, collist):
    table = Table(tablename)
    columns = [str(table) + '.' + each for each in collist]
    q = Query.from_(table).select(*columns)
    return q.get_sql(quote_char=None)

tablename = 'customers'
collist = ['id', 'fname', 'fn.Sum(revenue)']
print(generate_sql(tablename, collist))  #1

table = Table(tablename)
q = Query.from_(table).select(table.id, table.fname, fn.Sum(table.revenue))
print(q.get_sql(quote_char=None))  #2
#1 outputs
SELECT "customers".id,"customers".fname,"customers".fn.Sum(revenue) FROM customers
#2 outputs correctly
SELECT id,fname,SUM(revenue) FROM customers
You should not be trying to assemble the query as a string yourself; that defeats the whole purpose of pypika.
Since you have the name of the table and the columns coming as text in a JSON object, you can use * to unpack the values from collist and use the obj[key] syntax to look up a table attribute by name from a string.
q = Query.from_(table).select(*(table[col] for col in collist))
# SELECT id,fname,fn.Sum(revenue) FROM customers
Hmm... that doesn't quite work for the fn.Sum(revenue). The goal is to get SUM(revenue).
This can get much more complicated from this point on. If you are only sending column names that you know belong to that table, the above solution is enough.
But if you have complex SQL expressions that reference SQL functions or even different tables, I suggest you rethink your decision of sending that as plain strings in JSON. You might end up building something as complex as pypika itself, like a custom parser or whatever. A better option here would be to change the format of your JSON response object, for example as sketched below.
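A minimal sketch of that idea, assuming a hypothetical payload format in which each column is either a plain name or a small object naming a pypika function and the column it applies to:
from pypika import Table, Query, functions as fn

payload = {
    "table": "customers",
    "columns": ["id", "fname", {"fn": "Sum", "arg": "revenue"}],
}

table = Table(payload["table"])

def to_term(spec):
    # Plain strings are column names; dicts name a pypika function and its column argument
    if isinstance(spec, str):
        return table[spec]
    return getattr(fn, spec["fn"])(table[spec["arg"]])

q = Query.from_(table).select(*(to_term(c) for c in payload["columns"]))
print(q.get_sql(quote_char=None))
# SELECT id,fname,SUM(revenue) FROM customers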
If you know you only need to support a very limited set of capabilities, it could be feasible. For example, you can assume the following constraints:
all column names refer to only one table, with no joins or aliases
all functions will be prefixed by fn.
no fancy stuff like window functions, distinct, count(*)...
Then you can do something like:
from pypika import Table, Query, functions as fn
import re

tablename = 'customers'
collist = ['id', 'fname', 'fn.Sum(revenue / 2)', 'revenue % fn.Count(id)']

def parsed(cols):
    pattern = r'(?:\bfn\.[a-zA-Z]\w*)|([a-zA-Z]\w*)'
    subst = lambda m: f"{'' if m.group().startswith('fn.') else 'table.'}{m.group()}"
    yield from (re.sub(pattern, subst, col) for col in cols)

table = Table(tablename)
env = dict(table=table, fn=fn)
q = Query.from_(table).select(*(eval(col, env) for col in parsed(collist)))
print(q.get_sql(quote_char=None))
Output:
SELECT id,fname,SUM(revenue/2),MOD(revenue,COUNT(id)) FROM customers

Scala - Dataframe - Select based on list of Columns AND apply function to some columns

I am using Apache Spark DataFrames to process some files, and my requirements are as follows:
I need to select specific columns from a dataframe and create another dataframe - this is simple.
I need to apply some function to specific columns of the dataframe and create another dataframe - I am able to do it.
//For 1
dataframe.select("COLUMN1")
//more specifically I use the below
dataframe.select(listOfCol.map(col): _*)
//For 2
dataframe.withColumn("newColumn", someUDF(col("COLUMN2")))
But I need both to be applied together in an elegant way. I came across this - Spark apply function to certain columns.
Can I follow this, or is there any better way? Any suggestion or direction would be a great help. Thanks in advance.
EDIT
There is a predefined map that says which function (if any) should be applied to which specific column. A sample one:
val colFunctionMap = Map(
"COLUMN1" -> ("function1", true),
"COLUMN2" -> ("function2", true),
"COLUMN3" -> ("", false),
"COLUMN4" -> ("", false),
...
)
So I need something like this - below is just pseudo-code:
dataframe.select(
    for each (listOfCol)
        if function has to be applied -> apply and select column
        else -> just select column
)
Please let me know if you need any other details...
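A rough PySpark sketch of that pattern (the Scala version is analogous; the map entries, the functions and the dataframe variable are placeholders based on the question):
from pyspark.sql import functions as F

# Hypothetical mapping: column name -> (function to apply, apply it?)
col_function_map = {
    "COLUMN1": (F.upper, True),
    "COLUMN2": (F.lower, True),
    "COLUMN3": (None, False),
    "COLUMN4": (None, False),
}

result = dataframe.select(
    [func(F.col(c)).alias(c) if needed else F.col(c)
     for c, (func, needed) in col_function_map.items()]
)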

Spark is not able to resolve the columns correctly when joining data frames

I'm using PySpark (Python 3.8) on Spark 3.0 on Databricks. When running this DataFrame join:
next_df = (
    days_currencies_matrix.alias("a")
    .join(
        data_to_merge.alias("b"),
        [
            days_currencies_matrix.dt == data_to_merge.RATE_DATE,
            days_currencies_matrix.CURRENCY_CODE == data_to_merge.CURRENCY_CODE,
        ],
        "LEFT",
    )
    .select(
        days_currencies_matrix.CURRENCY_CODE,
        days_currencies_matrix.dt.alias("RATE_DATE"),
        data_to_merge.AVGYTD,
        data_to_merge.ENDMTH,
        data_to_merge.AVGMTH,
        data_to_merge.AVGWEEK,
        data_to_merge.AVGMTD,
    )
)
And I’m getting this error:
Column AVGYTD#67187, AVGWEEK#67190, ENDMTH#67188, AVGMTH#67189, AVGMTD#67191 are ambiguous. It's probably because you joined several Datasets together, and some of these Datasets are the same. This column points to one of the Datasets but Spark is unable to figure out which one. Please alias the Datasets with different names via Dataset.as before joining them, and specify the column using qualified name, e.g. df.as("a").join(df.as("b"), $"a.id" > $"b.id"). You can also set spark.sql.analyzer.failAmbiguousSelfJoin to false to disable this check.
This tells me that the above columns belong to more than one dataset. Why is that happening? The code tells Spark exactly which source dataframe each column comes from;
also, the days_currencies_matrix has only 2 columns: dt and CURRENCY_CODE.
The source data frames are defined as follows:
currencies_in_merging_data = data_to_merge.select('CURRENCY_CODE').distinct()
days_currencies_matrix = dt_days.crossJoin(currencies_in_merging_data)
Is it because the days_currencies_matrix DataFrame is actually built on top of data_to_merge? Is this related to lazy evaluation, or is it a bug?
BTW, this version works with no issues:
next_df = (
    days_currencies_matrix.alias("a")
    .join(
        data_to_merge.alias("b"),
        [
            days_currencies_matrix.dt == data_to_merge.RATE_DATE,
            days_currencies_matrix.CURRENCY_CODE == data_to_merge.CURRENCY_CODE,
        ],
        "LEFT",
    )
    .select(
        col("a.dt").alias("RATE_DATE"),
        col("a.CURRENCY_CODE"),
        col("b.AVGYTD"),
        col("b.ENDMTH"),
        col("b.AVGMTH"),
        col("b.AVGWEEK"),
        col("b.AVGMTD"),
    )
)
I found the problem.
The first select() operates on next_df, but I was referencing the columns through the original DataFrames that were joined, and those references are not what ends up in the joined result.
In the second version I correctly reference the columns through the aliases assigned at join time ("a" and "b"), and those are the correct names.
BTW:
This works too:
next_df = days_currencies_matrix.alias('a').join(
    data_to_merge.alias('b'),
    [
        days_currencies_matrix.dt == data_to_merge.RATE_DATE,
        days_currencies_matrix.CURRENCY_CODE == data_to_merge.CURRENCY_CODE,
    ],
    'LEFT',
)
next_df = next_df.select(next_df.AVGYTD, next_df.AVGWEEK, next_df.ENDMTH)

EXCEPT command not working in Databricks SQL (Spark SQL)

I have written this EXCEPT query to get the difference in records between two Hive tables from a Databricks notebook. (I am trying to get the same result as in MSSQL, i.e. only the difference in the result set.)
select PreqinContactID,PreqinContactName,PreqinPersonTitle,EMail,City
from preqin_7dec.PreqinContact where filename='InvestorContactPD.csv'
except
select CONTACT_ID,NAME,JOB_TITLE,EMAIL,CITY
from preqinct.InvestorContactPD where contact_id in (
select PreqinContactID from preqin_7dec.PreqinContact
where filename='InvestorContactPD.csv')
But the result set returned also contains matching records. The record I have shown above appears in the result set, but when I checked it separately by contact_id it is the same, so I am not sure why EXCEPT is also returning matching records.
I just want to know how to use EXCEPT, or any other difference-finding command, in a Databricks notebook using SQL.
I want to see nothing in the result set if the source and target data are the same.
EXCEPT works perfectly well in Databricks as this simple test will show:
val df = Seq((3445256, "Avinash Singh", "Chief Manager", "asingh@gmail.com", "Mumbai"))
  .toDF("contact_id", "name", "job_title", "email", "city")

// Save the dataframe to a temp view
df.createOrReplaceTempView("tmp")
df.show
The SQL test:
%sql
SELECT *
FROM tmp
EXCEPT
SELECT *
FROM tmp;
This query yields no results. Is it possible you have some leading or trailing spaces, for example? Spark is also case-sensitive, so that could also be causing your issue. Try a case-insensitive test by applying the LOWER function to all columns, e.g. along the lines of the sketch below.
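A minimal PySpark sketch of that normalised comparison, assuming the table and column names from the question and that spark is the notebook's SparkSession; every column is trimmed and lower-cased before taking the difference:
from pyspark.sql import functions as F

def normalize(df):
    # Trim and lower-case every column so the comparison ignores case and stray whitespace
    return df.select([F.lower(F.trim(F.col(c))).alias(c) for c in df.columns])

src = (
    spark.table("preqin_7dec.PreqinContact")
    .where(F.col("filename") == "InvestorContactPD.csv")
    .select("PreqinContactID", "PreqinContactName", "PreqinPersonTitle", "EMail", "City")
)
tgt = spark.table("preqinct.InvestorContactPD").select("CONTACT_ID", "NAME", "JOB_TITLE", "EMAIL", "CITY")

# subtract() has EXCEPT DISTINCT semantics; an empty result means the two sides match
normalize(src).subtract(normalize(tgt)).show()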

update an SQL table via R sqlSave

I have a data frame in R with 3 columns; using sqlSave I can easily create a table in an SQL database:
channel <- odbcConnect("JWPMICOMP")
sqlSave(channel, dbdata, tablename = "ManagerNav", rownames = FALSE, append = TRUE, varTypes = c(DateNav = "datetime"))
odbcClose(channel)
This data frame contains information about managers (Name, Nav and Date) which is updated every day with new values for the current date; old values might also need to be corrected in case of errors.
How can I accomplish this task in R?
I tried to use sqlUpdate but it returns the following error:
> sqlUpdate(channel, dbdata, tablename = "ManagerNav")
Error in sqlUpdate(channel, dbdata, tablename = "ManagerNav") :
cannot update ‘ManagerNav’ without unique column
When you create a table "the white shark-way" (see the documentation), it does not get a primary index; it is just plain columns, often of the wrong type. Usually I use your approach to get the column names right, but after that you should go into your database and assign a primary index and correct column widths and types.
After that, sqlUpdate() might work; I say might, because I have given up on sqlUpdate(), there are too many caveats, and I use sqlQuery(..., paste("Update....))) for the real work.
What I would do for this is the following
Solution 1
sqlUpdate(channel, dbdata,tablename="ManagerNav", index=c("ManagerNav"))
Solution 2
Lcolumns <- list(dbdata[0,])
sqlUpdate(channel, dbdata,tablename="ManagerNav", index=c(Lcolumns))
index is used to specify which column(s) identify the rows R is going to update.
Hope this helps!
If none of the other solutions work and your data is not that big, I'd suggest using sqlQuery() and looping through your data frame.
one_row_of_your_df <- function(i) {
  sql_query <-
    paste0("INSERT INTO your_table_name (column_name1, column_name2, column_name3) VALUES ",
           "(",
           "'", your_dataframe[i, 1], "', ",
           "'", your_dataframe[i, 2], "', ",
           "'", your_dataframe[i, 3], "'",
           ")")
  return(sql_query)
}
This function is Exasol-specific; it is pretty similar to MySQL, but not identical, so small changes could be necessary.
Then use a simple for loop like this one:
for (i in 1:nrow(your_dataframe)) {
  sqlQuery(your_connection, one_row_of_your_df(i))
}