Select field only if it exists (SQL or Scala)

The input dataframe may not always have all the columns. In SQL or Scala, I want to write a select statement that won't error out when a column is missing from the dataframe and will simply output the columns that do exist.
For example, this statement will work.
Select store, prod, distance from table
+-----+----+--------+
|store|prod|distance|
+-----+----+--------+
|51   |42  |2       |
|51   |42  |5       |
|89   |44  |9       |
+-----+----+--------+
If the dataframe looks like below, I want the same statement to work, to just ignore what's not there, and just output the existing columns (in this case 'store' and 'prod')
+-----+----+
|store|prod|
+-----+----+
|51   |42  |
|51   |42  |
|89   |44  |
+-----+----+

You can keep the full list of columns in a Seq, either hard-coded or built from other metadata, and intersect it with the dataframe's columns:
import org.apache.spark.sql.functions.col

val columnNames = Seq("c1", "c2", "c3", "c4")
df.select(df.columns.intersect(columnNames).map(x => col(x)): _*).show()

You can make use of the columns method on DataFrame. That would look like this:
val result =
  if (df.columns.contains("distance")) df.select("store", "prod", "distance")
  else df.select("store", "prod")
Edit:
Having many such columns, you can keep them in an array, for example cols, and filter it:
val selectedCols = cols.filter(c => df.columns.contains(c)).map(col)
val result = df.select(selectedCols: _*)

Assuming you use the expanded SQL template, like select a,b,c from tab, you could do something like the following to get the required result:
1. Get the SQL string and convert it to lowercase.
2. Split the SQL on spaces or commas to get the individual words in an array.
3. Remove "select" and "from" from that array, as they are SQL keywords.
4. The last index now holds the table name.
5. The elements from the first index up to, but not including, the last one are the requested select columns.
6. To get the required columns, just filter that list against df2.columns. Columns that appear in the SQL but not in the table are filtered out.
7. Construct the SQL again from the individual pieces.
8. Run it using spark.sql(reqd_sel_string) to get the results.
Check this out
scala> val df2 = Seq((51,42),(51,42),(89,44)).toDF("store","prod")
df2: org.apache.spark.sql.DataFrame = [store: int, prod: int]
scala> df2.createOrReplaceTempView("tab2")
scala> val sel_query="Select store, prod, distance from tab2".toLowerCase
sel_query: String = select store, prod, distance from tab2
scala> val tabl_parse = sel_query.split("[ ,]+").filter(_!="select").filter(_!="from")
tabl_parse: Array[String] = Array(store, prod, distance, tab2)
scala> val tab_name=tabl_parse(tabl_parse.size-1)
tab_name: String = tab2
scala> val tab_cols = (0 until tabl_parse.size-1).map(tabl_parse(_))
tab_cols: scala.collection.immutable.IndexedSeq[String] = Vector(store, prod, distance)
scala> val reqd_cols = tab_cols.filter( x=>df2.columns.contains(x))
reqd_cols: scala.collection.immutable.IndexedSeq[String] = Vector(store, prod)
scala> val reqd_sel_string = "select " + reqd_cols.mkString(",") + " from " + tab_name
reqd_sel_string: String = select store,prod from tab2
scala> spark.sql(reqd_sel_string).show(false)
+-----+----+
|store|prod|
+-----+----+
|51 |42 |
|51 |42 |
|89 |44 |
+-----+----+
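If you need this repeatedly, the same steps can be wrapped into a small helper. Below is a minimal sketch based on the session above; the name pruneSelect is made up for this example, and it assumes the same simple "select a,b,c from tab" template, nothing more.

import org.apache.spark.sql.{DataFrame, SparkSession}

// Rewrites "select a,b,c from tab" so that only columns present in df
// survive, then runs the pruned query.
def pruneSelect(spark: SparkSession, df: DataFrame, sql: String): DataFrame = {
  val parts    = sql.toLowerCase.split("[ ,]+").filter(w => w != "select" && w != "from")
  val tabName  = parts.last
  val reqdCols = parts.dropRight(1).filter(df.columns.contains)
  spark.sql("select " + reqdCols.mkString(",") + " from " + tabName)
}

// e.g. pruneSelect(spark, df2, "Select store, prod, distance from tab2").show(false)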

Related

convert this sql left-join query to spark dataframes (scala)

I have this SQL query, which is a left join with a select statement at the beginning that also chooses columns from the right table.
Can you please help me convert it to Spark dataframes and get the result using spark-shell?
I don't want to use the SQL code in Spark; instead I want to use dataframes.
I know the join syntax in Scala, but I don't know how to select from the right table (here it is count(w.id2)) when the dataframe resulting from the left join doesn't seem to have access to the right table's columns.
Thank you!
select count(x.user_id) user_id_count, count(w.id2) current_id2_count
from
  (select user_id
   from tb1
   where year = '2021'
     and month = 1
  ) x
left join
  (select id1, max(id2) id2 from tb2 group by id1) w
on x.user_id = w.id1;
In Spark I would create two dataframes, x and w, and join them:
var x = spark.sqlContext.table("tb1").where("year='2021' and month=1")
var w = spark.sqlContext.table("tb2").groupBy("id1").agg(max("id2").alias("id2"))
var joined = x.join(w, x("user_id") === w("id1"), "left")
EDIT:
I was confused about the left join. Spark raised an error that column id2 was not available, and I thought it was because the dataframe resulting from the left join would only have the left table's columns. However, the real reason was that when I selected max(id2) I had to give it an alias correctly.
Here is a sample and the solution:
var x = Seq("1", "2", "3", "4").toDF("user_id")
var w = Seq(("1", 1), ("1", 2), ("3", 10), ("1", 5), ("5", 4)).toDF("id1", "id2")
var z = w.groupBy("id1").agg(max("id2").alias("id2"))
val xJoinsZ = x.join(z, x("user_id") === z("id1"), "left")
  .select(count(col("user_id")).alias("user_id_count"), count(col("id2")).alias("current_id2_count"))
scala> x.show(false)
+-------+
|user_id|
+-------+
|1 |
|2 |
|3 |
|4 |
+-------+
scala> z.show(false)
+---+---+
|id1|id2|
+---+---+
|3 |10 |
|5 |4 |
|1 |5 |
+---+---+
scala> xJoinsZ.show(false)
+-------------+-----------------+
|user_id_count|current_id2_count|
+-------------+-----------------+
|4            |2                |
+-------------+-----------------+
Your request is quite difficult to understand; however, I am going to try to reply, taking the SQL code you provided as a baseline and reproducing it with Spark.
// The snippets below assume these imports
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Reading tb1 (x), filtering for Jan 2021 and selecting only "user_id"
val x: DataFrame = spark.read
  .table("tb1")
  .filter(col("year") === "2021")
  .filter(col("month") === 1)
  .select("user_id")

// Reading tb2 (w) and, for each "id1", getting the max "id2"
val w: DataFrame = spark.read
  .table("tb2")
  .groupBy(col("id1"))
  .max("id2")

// Joining tb1 (x) and tb2 (w) on "user_id" === "id1", then counting user_id and id2
val xJoinsW: DataFrame = x
  .join(w, x("user_id") === w("id1"), "left")
  .select(count(col("user_id")).as("user_id_count"), count(col("max(id2)")).as("current_id2_count"))
A small but relevant remark: as you are using Scala and Spark, I would suggest using val rather than var. val means the reference is final and cannot be reassigned, whereas var can be reassigned later.
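A trivial sketch of the difference:

val a = 1
// a = 2        // does not compile: reassignment to val
var b = 1
b = 2           // fine: a var can be reassigned later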
Lastly, feel free to swap the Spark reading mechanism for whatever you like.

How to get counts for null, not null, distinct values, all rows for all columns in a sql table?

I'm currently looking to produce a table with counts of null, not-null, and distinct values, plus the total row count, for every column in a given table. This happens to be in Databricks (Apache Spark).
Something that looks like what is shown below.
I know I can do this with something like the SQL shown below. Also, I can use something like Java or Python, etc., to generate the SQL.
The question is:
Is this the most efficient approach?
Is there a better way to write this query (less typing and/or more efficient)?
select
    1 col_position,
    'location_id' col_name,
    count(*) all_records,
    count(location_id) not_null,
    count(*) - count(location_id) null,
    count(distinct location_id) distinct_values
from
    admin
union
select
    2 col_position,
    'location_zip' col_name,
    count(*) all_records,
    count(location_zip) not_null,
    count(*) - count(location_zip) null,
    count(distinct location_zip) distinct_values
from
    admin
union
select
    3 col_position,
    'provider_npi' col_name,
    count(*) all_records,
    count(provider_npi) not_null,
    count(*) - count(provider_npi) null,
    count(distinct provider_npi) distinct_values
from
    admin
order by col_position
;
As said in the comments, using UNION ALL should be efficient.
Using SQL
To avoid typing out all the columns and sub-queries, you can generate the SQL query from the list of columns like this:
val df = spark.sql("select * from admin")

// generate the same query from the columns list
val sqlQuery =
  df.columns.zipWithIndex.map { case (c, i) =>
    Seq(
      s"$i col_position",
      s"$c col_name",
      "count(*) all_records",
      s"count($c) not_null",
      s"count(*) - count($c) null",
      s"count(distinct $c) distinct_values"
    ).mkString("select ", ", ", " from admin")
  }.mkString("", " union all\n", "\norder by col_position")

spark.sql(sqlQuery).show
Using DataFrame (Scala)
There are some optimizations you can do with the DataFrame API, like computing count(*) only once, avoiding typing all the column names, and the possibility of using caching.
Example input DataFrame:
//+---+---------+--------+---------------------+------+
//|id |firstName|lastName|email                |salary|
//+---+---------+--------+---------------------+------+
//|1  |michael  |armbrust|no-reply@berkeley.edu|100K  |
//|2  |xiangrui |meng    |no-reply@stanford.edu|120K  |
//|3  |matei    |null    |no-reply@waterloo.edu|140K  |
//|4  |null     |wendell |null                 |160K  |
//|5  |michael  |jackson |no-reply@neverla.nd  |null  |
//+---+---------+--------+---------------------+------+
First, get the row count and the column list:
val cols = df.columns
val allRecords = df.count
Then, calculate each metric by looping over the columns list (you could also create a function per metric; see the sketch after the output below):
import org.apache.spark.sql.functions._

val nullsCountDF = df.select(
  (Seq(expr("'nulls' as metric")) ++ cols.map(c =>
    sum(when(col(c).isNull, lit(1)).otherwise(lit(0))).as(c)
  )): _*
)

val distinctCountDF = df.select(
  (Seq(expr("'distinct_values' as metric")) ++ cols.map(c =>
    countDistinct(c).as(c)
  )): _*
)

val maxDF = df.select(
  (Seq(expr("'max_value' as metric")) ++ cols.map(c => max(c).as(c))): _*
)

val minDF = df.select(
  (Seq(expr("'min_value' as metric")) ++ cols.map(c => min(c).as(c))): _*
)
val allRecordsDF = spark.sql("select 'all_records' as metric," + cols.map(c => s"$allRecords as $c").mkString(","))
Finally, union the data frames created above:
val metricsDF = Seq(allRecordsDF, nullsCountDF, distinctCountDF, maxDF, minDF).reduce(_ union _)
metricsDF.show
//+---------------+---+---------+--------+---------------------+------+
//|metric |id |firstName|lastName|email |salary|
//+---------------+---+---------+--------+---------------------+------+
//|all_records |5 |5 |5 |5 |5 |
//|nulls |0 |1 |1 |1 |1 |
//|distinct_values|5 |3 |4 |4 |4 |
//|max_value |5 |xiangrui |wendell |no-reply#waterloo.edu|160K |
//|min_value |1 |matei |armbrust|no-reply#berkeley.edu|100K |
//+---------------+---+---------+--------+---------------------+------+
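As mentioned above, the per-metric blocks can also be factored into a small function so that each metric is just a name plus a Column-producing function. A minimal sketch, reusing the same df and cols as above; the helper name metricRow is made up for this example.

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions._

// Builds one "metric row": a literal metric name followed by one aggregated value per column.
def metricRow(df: DataFrame, cols: Seq[String], name: String, f: String => Column): DataFrame =
  df.select((Seq(lit(name).as("metric")) ++ cols.map(c => f(c).as(c))): _*)

val nullsCountDF2    = metricRow(df, cols, "nulls", c => sum(when(col(c).isNull, 1).otherwise(0)))
val distinctCountDF2 = metricRow(df, cols, "distinct_values", c => countDistinct(c))
val maxDF2           = metricRow(df, cols, "max_value", c => max(c))
val minDF2           = metricRow(df, cols, "min_value", c => min(c))

// These can then be unioned exactly as shown above.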
Using DataFrame (Python)
For Python example, you can see my other answer.
Use count(ifnull(field, 0)) total_count to count all rows: the NULLs are replaced with 0 before counting, so nothing is skipped. A plain count(field) is what counts only the non-null rows.
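A minimal sketch of the difference, using an inline three-row table (values 1, null, 3):

// count(field) skips NULLs (expected: 2); count(ifnull(field, 0)) counts every row (expected: 3)
spark.sql("""
  select count(field) as non_null, count(ifnull(field, 0)) as total
  from values (1), (null), (3) as t(field)
""").show()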

Applying withColumn only when column exists in the dataframe

I am using spark-sql-2.4.1v with Java 8. I have a scenario where I will be passed the column names as a list/Seq, and only for those columns do I need to perform certain operations like sum, avg, percentages, etc.
In my scenario, let's say I have the columns column1, column2, column3. The first time I pass the name column1.
I pull/select the "column1" data and perform some operation based on "column1". The second time I pass the name column2, but "column1" was not pulled this time, so my dataset does not contain "column1" and the earlier conditions break with the error "AnalysisException: cannot resolve 'column1' given input columns".
Hence I need to check the columns: if a column exists, perform the operations related to that column, otherwise ignore them.
How to do this in Spark?
Sample data as it is in the database:
val data = List(
  ("20", "score", "school", "2018-03-31", 14, 12, 20),
  ("21", "score", "school", "2018-03-31", 13, 13, 21),
  ("22", "rate", "school", "2018-03-31", 11, 14, 22),
  ("21", "rate", "school", "2018-03-31", 13, 12, 23)
)
val df = data.toDF("id", "code", "entity", "date", "column1", "column2", "column3")
  .select("id", "code", "entity", "date", "column2") // these are passed for each run... this set will keep changing
Dataset<Row> enrichedDs = df
    .withColumn("column1_org", col("column1"))
    .withColumn("column1",
        when(col("column1").isNotNull(), functions.callUDF("lookUpData", col("column1").cast(DataTypes.StringType)))
    );
The above logic is only applicable when "column1" is among the selected columns. It is failing in the second run because "column1" is not selected, so I need some logic that applies these operations only when "column1" is actually available.
Check if this is helpful - you can filter out the missing columns and process only the valid ones:
df.show(false)
/**
* +---+-----+------+----------+-------+-------+-------+
* |id |code |entity|date |column1|column2|column3|
* +---+-----+------+----------+-------+-------+-------+
* |20 |score|school|2018-03-31|14 |12 |20 |
* |21 |score|school|2018-03-31|13 |13 |21 |
* |22 |rate |school|2018-03-31|11 |14 |22 |
* |21 |rate |school|2018-03-31|13 |12 |23 |
* +---+-----+------+----------+-------+-------+-------+
*/
// list of columns
val cols = Seq("column1", "column2" ,"column3", "column4")
val processColumns = cols.filter(df.columns.contains).map(sqrt)
df.select(processColumns: _*).show(false)
/**
* +------------------+------------------+-----------------+
* |SQRT(column1) |SQRT(column2) |SQRT(column3) |
* +------------------+------------------+-----------------+
* |3.7416573867739413|3.4641016151377544|4.47213595499958 |
* |3.605551275463989 |3.605551275463989 |4.58257569495584 |
* |3.3166247903554 |3.7416573867739413|4.69041575982343 |
* |3.605551275463989 |3.4641016151377544|4.795831523312719|
* +------------------+------------------+-----------------+
*/
Not sure if I fully understand your requirement, but are you simply trying to perform some conditional operation depending on which columns are available in your dataframe, which is not known prior to execution?
If so, DataFrame.columns returns the list of columns, which you can inspect and select from accordingly,
i.e.
df.columns.foreach { println }
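For the withColumn case in the question, that check could look something like the sketch below; lookUpData is the UDF name from the question and is assumed to be registered already.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Apply the enrichment only if "column1" actually exists in the selected dataframe.
val enriched: DataFrame =
  if (df.columns.contains("column1"))
    df.withColumn("column1_org", col("column1"))
      .withColumn("column1",
        when(col("column1").isNotNull, callUDF("lookUpData", col("column1").cast("string"))))
  else
    df // column absent: skip the column1-related operations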

Where/filtering in pyspark

I used SQL with pyspark, but when I used where for filtering, the result was an empty table. That must be wrong, because I do have data matching this filter.
"Lesividad" is a string:
|-- LESIVIDAD: string (nullable = true)
t_acc = spark.sql("""SELECT LESIVIDAD, COUNT(LESIVIDAD) AS COUNT FROM acc_table
                     WHERE LESIVIDAD = 'IL' GROUP BY LESIVIDAD""")
t_acc.show()
+---------+-----+
|LESIVIDAD|COUNT|
+---------+-----+
+---------+-----+
My table "Lesividad" is:
t_acc = spark.sql("""SELECT LESIVIDAD FROM acc_table GROUP BY
LESIVIDAD""").show()

+--------------------+
| LESIVIDAD|
+--------------------+
| NO ASIGNADA|
|IL ...|
|MT ...|
|HG ...|
|HL ...|
+--------------------+
Your code is fine. I presume the problem is with the data value you are trying to match, i.e. LESIVIDAD = 'IL'.
Please note that in pyspark the header/column names of a table are case-insensitive, whereas the data inside the table is case-sensitive. So if your table contains 'il' / 'Il' / 'iL' and there is no 'IL', the query returns an empty table.
Hence make sure the value you are searching for matches the stored data exactly; the truncated output above ('IL ...') also suggests the stored values may be longer strings that merely start with 'IL', in which case an exact equality check will return nothing either.

How to get the COUNT of emails for each id in Scala

I use this query in SQL to return how many user_id's have more than one email. How would I write the same query against a users DataFrame in Scala? Also, how would I be able to return the exact emails for each user_id?
SELECT DISTINCT user_id
FROM Users
Group by user_id
Having count(DISTINCT email) > 1
Let's assume that you have a dataframe of users. In spark, one could create a sample of such a dataframe like this:
import spark.implicits._
val df = Seq(("me", "contact@me.com"),
  ("me", "me@company.com"),
  ("you", "you@company.com")).toDF("user_id", "email")
df.show()
+-------+---------------+
|user_id|          email|
+-------+---------------+
|     me| contact@me.com|
|     me| me@company.com|
|    you|you@company.com|
+-------+---------------+
Now, the logic is very similar to the one you have in SQL:
import org.apache.spark.sql.functions.countDistinct

df.groupBy("user_id")
  .agg(countDistinct("email") as "count")
  .where('count > 1)
  .show()
+-------+-----+
|user_id|count|
+-------+-----+
| me| 2|
+-------+-----+
Then you can add a .drop("count") or a .select("user_id") to only keep users.
Note that there is no having clause in Spark. Once you have called agg to aggregate your dataframe by user, you have a regular dataframe on which you can call any transformation function, such as the filter on the count column here.
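For the second part of the question (returning the actual emails for each user_id), one possible sketch is to collect them alongside the distinct count, for example with collect_set:

import org.apache.spark.sql.functions.{countDistinct, collect_set}

// Also keep the distinct emails per user_id
df.groupBy("user_id")
  .agg(countDistinct("email") as "count", collect_set("email") as "emails")
  .where('count > 1)
  .select("user_id", "emails")
  .show(false)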