Convert this SQL left-join query to Spark DataFrames (Scala)

I have this SQL query, which is a left join with a SELECT at the top that also picks columns from the right table.
Can you please help me convert it to Spark DataFrames and get the result using spark-shell?
I don't want to use SQL code in Spark; I want to use DataFrames instead.
I know the join syntax in Scala, but I don't know how to select from the right table (here count(w.id2)) when the DataFrame resulting from the left join doesn't seem to have access to the right table's columns.
Thank you!
select count(x.user_id) user_id_count, count(w.id2) current_id2_count
from
  (select user_id
   from tb1
   where year='2021'
     and month=1
  ) x
left join
  (select id1, max(id2) id2 from tb2 group by id1) w
on x.user_id = w.id1;
In Spark I would create two DataFrames, x and w, and join them:
var x = spark.sqlContext.table("tb1").where("year='2021' and month=1")
var w = spark.sqlContext.table("tb2").groupBy("id1").agg(max("id2")).alias("id2")
var joined = x.join(w, x("user_id")===w("id1"), "left")
EDIT:
I was confused about the left join. Spark reported an error that column id2 was not available, and I thought it was because the DataFrame resulting from the left join would only have the left table's columns. However, the real reason was that when I selected max(id2) I had to give it an alias correctly.
Here is a sample and the solution:
var x = Seq("1","2","3","4").toDF("user_id")
var w = Seq (("1", 1), ("1",2), ("3",10),("1",5),("5",4)).toDF("id1", "id2")
var z= w.groupBy("id1").agg(max("id2").alias("id2"))
val xJoinsZ = x.join(z, x("user_id") === z("id1"), "left").select(count(col("user_id")).alias("user_id_count"), count(col("id2")).alias("current_id2_count"))
scala> x.show(false)
+-------+
|user_id|
+-------+
|1      |
|2      |
|3      |
|4      |
+-------+
scala> z.show(false)
+---+---+
|id1|id2|
+---+---+
|3  |10 |
|5  |4  |
|1  |5  |
+---+---+
scala> xJoinsZ.show(false)
+-------------+-----------------+
|user_id_count|current_id2_count|
+-------------+-----------------+
|4            |2                |
+-------------+-----------------+

Your request is a bit difficult to understand; however, I'll try to answer by taking the SQL code you provided as a baseline and reproducing it with Spark.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Reading tb1 (x), filtering for Jan 2021 and selecting only "user_id"
val x: DataFrame = spark.read
  .table("tb1")
  .filter(col("year") === "2021")
  .filter(col("month") === 1)
  .select("user_id")

// Reading tb2 (w) and, for each "id1", getting the max "id2"
val w: DataFrame = spark.read
  .table("tb2")
  .groupBy(col("id1"))
  .max("id2")

// Joining tb1 (x) and tb2 (w) on "user_id" === "id1", then counting user_id and id2
val xJoinsW: DataFrame = x
  .join(w, x("user_id") === w("id1"), "left")
  .select(count(col("user_id")).as("user_id_count"), count(col("max(id2)")).as("current_id2_count"))
A small but relevant remark: as you're using Scala and Spark, I would suggest using val rather than var. val means the reference is final and cannot be reassigned, whereas var can be reassigned later. You can read more here.
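For example, a quick illustration in spark-shell (nothing Spark-specific, just plain Scala):
val a = 1
// a = 2   // does not compile: reassignment to val
var b = 1
b = 2      // fine: a var can be reassigned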
Lastly, feel free to swap the Spark reading mechanism for whatever you like.
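For instance, a sketch of the same filter if the data were stored as Parquet files rather than catalog tables (the path is hypothetical):
val x = spark.read.parquet("/path/to/tb1")
  .filter(col("year") === "2021" && col("month") === 1)
  .select("user_id")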

Related

Extracting value of a JSON in Spark SQL

I am looking to aggregate by extracting the value of a JSON key from one of the columns. Can someone help me with the right syntax in Spark SQL?
select count(distinct(Name)) as users, xHeaderFields['xyz'] as app group by app order by users desc
The table column looks something like this (I have removed other columns for simplification); the table has columns like Name etc.
Assuming that your dataset is called ds and there is only one key=xyz object per column:
First, parse the JSON (if needed):
ds = ds.withColumn("xHeaderFields", expr("from_json(xHeaderFields, 'array<struct<key:string,value:string>>')"))
Then filter the key = xyz and take the first element (assuming there is only one xyz key):
.withColumn("xHeaderFields", expr("filter(xHeaderFields, x -> x.key == 'xyz')[0]"))
Finally, extract value from your object:
.withColumn("xHeaderFields", expr("xHeaderFields.value"))
Final result:
+-------------+
|xHeaderFields|
+-------------+
|null         |
|null         |
|Settheclass  |
+-------------+
Good luck!

How can I extend an SQL table with new primary keys as well as add up values for existing keys?

I want to join or update the following two tables and also add up df for existing words. So if the word endeavor does not exist in the first table, it should be added with its df value, and if the word hallo exists in both tables, its df values should be summed up.
FYI, I'm using MariaDB and PySpark to do word counts on documents and calculate tf, df, and tfidf values.
Table name: df
+--------+----+
|    word|  df|
+--------+----+
|vicinity|   5|
|   hallo|   2|
|  admire|   3|
| settled|   1|
+--------+----+
Table name: word_list
+----------+---+
|      word| df|
+----------+---+
|     hallo|  1|
|   settled|  1|
|  endeavor|  1|
+----------+---+
So in the end the updated/combined table should look like this:
+----------+---+
|      word| df|
+----------+---+
|  vicinity|  5|
|     hallo|  3|
|    admire|  3|
|   settled|  2|
|  endeavor|  1|
+----------+---+
What I've tried to do so far is the following:
SELECT df.word, df.df + word_list.df FROM df FULL OUTER JOIN word_list ON df.word=word_list.word
SELECT df.word FROM df JOIN word_list ON df.word=word_list.word
SELECT df.word FROM df FULL OUTER JOIN word_list ON df.word=word_list.word
None of them worked; I either get a table with just null values, some null values, or some exception. I'm sure there must be an easy SQL statement to achieve this, but I've been stuck on it for hours and haven't found anything relevant on Stack Overflow.
You just need to UNION the two tables first, then aggregate on the word. Since the tables are identically structured, it's very easy. Look at this fiddle. I have used MariaDB 10.3 since you didn't specify a version, but these queries should be compliant with (just about) any DBMS.
https://dbfiddle.uk/?rdbms=mariadb_10.3&fiddle=c6d86af77f19fc1f337ad1140ef07cd2
select word, sum(df) as df
from (
  select * from df
  UNION ALL
  select * from word_list
) z
group by word
order by sum(df) desc;
UNION is the vertical cousin of JOIN: UNION combines two datasets vertically (row-wise), while JOIN combines them horizontally, by adding columns to the output. Both datasets need to have the same number of columns for the UNION to work, and you need UNION ALL here so that the union returns all rows, because the default behavior is to return only unique rows. In this dataset, since settled has a value of 1 in both tables, it would get only one entry in the UNION if you left out the ALL keyword, and the summed df would then be 1 instead of the 2 you are expecting.
The ORDER BY isn't necessary if you are just transferring to a new table. I just added it to get my results in the same order as your sample output.
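Since you also mention PySpark, here is a rough Spark DataFrame equivalent in Scala (a sketch; it assumes the two tables are already loaded as DataFrames named df and wordList with the same schema):
import org.apache.spark.sql.functions.{col, sum}

val combined = df.unionByName(wordList)   // UNION ALL semantics, matching columns by name
  .groupBy("word")
  .agg(sum("df").as("df"))
  .orderBy(col("df").desc)

combined.show(false)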
Let me know if this worked for you.

How to get counts for null, not null, distinct values, all rows for all columns in a SQL table?

I'm currently looking to get a table that gets counts for null, not null, distinct values, and all rows for all columns in a given table. This happens to be in Databricks (Apache Spark).
Something that looks like what is shown below.
I know I can do this with something like the SQL shown below. Also, I can use something like Java or Python, etc., to generate the SQL.
The question is:
Is this the most efficient approach?
Is there a better way to write this query (less typing and/or more efficient)?
select
1 col_position,
'location_id' col_name,
count(*) all_records,
count(location_id) not_null,
count(*) - count(location_id) null,
count(distinct location_id) distinct_values
from
admin
union
select
2 col_position,
'location_zip' col_name,
count(*) all_records,
count(location_zip) not_null,
count(*) - count(location_zip) null,
count(distinct location_zip) distinct_values
from
admin
union
select
3 col_position,
'provider_npi' col_name,
count(*) all_records,
count(provider_npi) not_null,
count(*) - count(provider_npi) null,
count(distinct provider_npi) distinct_values
from
admin
order by col_position
;
As said in the comments, using UNION ALL should be efficient.
Using SQL
To avoid typing out all the columns and subqueries, you can generate the SQL query from the list of columns like this:
val df = spark.sql("select * from admin")

// generate the same query from the columns list
val sqlQuery =
  df.columns.zipWithIndex.map { case (c, i) =>
    Seq(
      s"${i + 1} col_position",
      s"'$c' col_name",
      "count(*) all_records",
      s"count($c) not_null",
      s"count(*) - count($c) null",
      s"count(distinct $c) distinct_values"
    ).mkString("select ", ", ", " from admin")
  }.mkString("", " union all\n", " order by col_position")

spark.sql(sqlQuery).show
Using DataFrame (Scala)
There are some optimizations you can do by using DataFrames, like computing count(*) only once, avoiding typing all the column names, and the possibility of caching.
Example input DataFrame:
//+---+---------+--------+---------------------+------+
//|id |firstName|lastName|email                |salary|
//+---+---------+--------+---------------------+------+
//|1  |michael  |armbrust|no-reply#berkeley.edu|100K  |
//|2  |xiangrui |meng    |no-reply#stanford.edu|120K  |
//|3  |matei    |null    |no-reply#waterloo.edu|140K  |
//|4  |null     |wendell |null                 |160K  |
//|5  |michael  |jackson |no-reply#neverla.nd  |null  |
//+---+---------+--------+---------------------+------+
First, get the row count and the column list:
val cols = df.columns
val allRecords = df.count
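Since df will be scanned once per metric below, it may be worth caching it (optional; this is the caching possibility mentioned above):
df.cache()   // keep df in memory so the metric computations below reuse it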
Then, calculate each metric by looping through the columns list (you could create a function for each metric, for example):
val nullsCountDF = df.select(
(Seq(expr("'nulls' as metric")) ++ cols.map(c =>
sum(when(col(c).isNull, lit(1)).otherwise(lit(0))).as(c)
)): _*
)
val distinctCountDF = df.select(
(Seq(expr("'distinct_values' as metric")) ++ cols.map(c =>
countDistinct(c).as(c)
)): _*
)
val maxDF = df.select(
(Seq(expr("'max_value' as metric")) ++ cols.map(c => max(c).as(c))): _*
)
val minDF = df.select(
(Seq(expr("'min_value' as metric")) ++ cols.map(c => min(c).as(c))): _*
)
val allRecordsDF = spark.sql("select 'all_records' as metric," + cols.map(c => s"$allRecords as $c").mkString(","))
Finally, union the data frames created above:
val metricsDF = Seq(allRecordsDF, nullsCountDF, distinctCountDF, maxDF, minDF).reduce(_ union _)
metricsDF.show
//+---------------+---+---------+--------+---------------------+------+
//|metric         |id |firstName|lastName|email                |salary|
//+---------------+---+---------+--------+---------------------+------+
//|all_records    |5  |5        |5       |5                    |5     |
//|nulls          |0  |1        |1       |1                    |1     |
//|distinct_values|5  |3        |4       |4                    |4     |
//|max_value      |5  |xiangrui |wendell |no-reply#waterloo.edu|160K  |
//|min_value      |1  |matei    |armbrust|no-reply#berkeley.edu|100K  |
//+---------------+---+---------+--------+---------------------+------+
Using DataFrame (Python)
For a Python example, you can see my other answer.
Use count(ifnull(field, 0)) as total_count; since ifnull replaces nulls with 0, this counts every row (plain count(field) counts only the non-null ones).
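For illustration, a small spark-shell sketch showing how the different counts behave on a column containing nulls (the column names are made up):
import spark.implicits._
import org.apache.spark.sql.functions._

val demo = Seq((1, Some(10)), (2, None), (3, Some(30)), (4, Some(30))).toDF("id", "c")
demo.select(
  count(lit(1)).as("all_records"),           // 4: counts every row
  count($"c").as("not_null"),                // 3: nulls are skipped
  count(expr("ifnull(c, 0)")).as("total"),   // 4: nulls replaced by 0, so every row counts
  countDistinct($"c").as("distinct_values")  // 2: distinct non-null values
).show(false)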

Select field only if it exists (SQL or Scala)

The input dataframe may not always have all the columns. In SQL or Scala, I want to create a select statement where, even if the dataframe does not have a column, it won't error out and will only output the columns that do exist.
For example, this statement will work.
Select store, prod, distance from table
+-----+------+--------+
|store|prod  |distance|
+-----+------+--------+
|51   |42    |2       |
|51   |42    |5       |
|89   |44    |9       |
+-----+------+--------+
If the dataframe looks like the one below, I want the same statement to work, ignore what's not there, and output only the existing columns (in this case 'store' and 'prod').
+-----+------+
|store|prod  |
+-----+------+
|51   |42    |
|51   |42    |
|89   |44    |
+-----+------+
You can keep a list of all expected columns (either hard-coded or prepared from other metadata) and intersect it with the DataFrame's columns:
val columnNames = Seq("c1", "c2", "c3", "c4")
df.select(df.columns.intersect(columnNames).map(col): _*).show()
You can make use of the columns method on DataFrame. It would look like this:
val result = if(df.columns.contains("distance")) df.select("store", "prod", "distance")
else df.select("store", "prod")
Edit:
If you have many such columns, you can keep them in an array, for example cols, and filter it:
val selectedCols = cols.filter(c => df.columns.contains(c)).map(col)
val result = df.select(selectedCols:_*)
Assuming you use the expanded SQL template, like select a,b,c from tab, you could do something like the below to get the required results.
Get the SQL string and convert it to lowercase.
Split the SQL on spaces or commas to get the individual words in an array.
Remove "select" and "from" from the array, as they are SQL keywords.
Now the last index is the table name.
The indexes from the first up to the last-but-one contain the list of selected columns.
To get the required columns, just filter them against df2.columns. The columns that are in the SQL but not in the table will be filtered out.
Now construct the SQL from the individual pieces.
Run it using spark.sql(reqd_sel_string) to get the results.
Check this out
scala> val df2 = Seq((51,42),(51,42),(89,44)).toDF("store","prod")
df2: org.apache.spark.sql.DataFrame = [store: int, prod: int]
scala> df2.createOrReplaceTempView("tab2")
scala> val sel_query="Select store, prod, distance from tab2".toLowerCase
sel_query: String = select store, prod, distance from tab2
scala> val tabl_parse = sel_query.split("[ ,]+").filter(_!="select").filter(_!="from")
tabl_parse: Array[String] = Array(store, prod, distance, tab2)
scala> val tab_name=tabl_parse(tabl_parse.size-1)
tab_name: String = tab2
scala> val tab_cols = (0 until tabl_parse.size-1).map(tabl_parse(_))
tab_cols: scala.collection.immutable.IndexedSeq[String] = Vector(store, prod, distance)
scala> val reqd_cols = tab_cols.filter( x=>df2.columns.contains(x))
reqd_cols: scala.collection.immutable.IndexedSeq[String] = Vector(store, prod)
scala> val reqd_sel_string = "select " + reqd_cols.mkString(",") + " from " + tab_name
reqd_sel_string: String = select store,prod from tab2
scala> spark.sql(reqd_sel_string).show(false)
+-----+----+
|store|prod|
+-----+----+
|51   |42  |
|51   |42  |
|89   |44  |
+-----+----+
scala>

Splitting a string in SparkSQL

I have a file with several lines. For example:
A B C
awer.ttp.net Code 554
abcd.ttp.net Code 747
asdf.ttp.net Part 554
xyz.ttp.net Part 747
I want to write a SparkSQL statement that splits just column A of the table and adds a new column D to the table, with the values awer, abcd, asdf, and xyz.
You can use the split function and take the first element for the new column D.
Here is a simple example:
val spark = SparkSession.builder().master("local").getOrCreate()
import spark.implicits._
val data = spark.sparkContext.parallelize(Seq(
("awer.ttp.net","Code", 554),
("abcd.ttp.net","Code", 747),
("asdf.ttp.net","Part", 554),
("xyz.ttp.net","Part", 747)
)).toDF("A","B","C")
data.withColumn("D", split($"A", "\\.")(0)).show(false)
//using SQL
data.createOrReplaceTempView("tempTable")
data.sqlContext.sql("SELECT A, B, C, SUBSTRING_INDEX(A, '.', 1) as D from tempTable").show(false)
Output:
+------------+----+---+----+
|A           |B   |C  |D   |
+------------+----+---+----+
|awer.ttp.net|Code|554|awer|
|abcd.ttp.net|Code|747|abcd|
|asdf.ttp.net|Part|554|asdf|
|xyz.ttp.net |Part|747|xyz |
+------------+----+---+----+
You can do something similar to the below in Spark SQL:
select A,B,C, split(A,'\\.')[0] as D from tablename;
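For example, reusing the tempTable view registered in the answer above (just a usage sketch):
spark.sql("""select A, B, C, split(A, '\\.')[0] as D from tempTable""").show(false)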