Splitting a string in SparkSQL - sql

I have a file with several lines. For example
A            B    C
awer.ttp.net Code 554
abcd.ttp.net Code 747
asdf.ttp.net Part 554
xyz.ttp.net  Part 747
I want to write a SparkSQL statement that splits just column A of the table, adding a new column D with the values awer, abcd, asdf, and xyz.

You can use the split function and take the first element for the new column D.
Here is a simple example:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local").getOrCreate()
import spark.implicits._
val data = spark.sparkContext.parallelize(Seq(
("awer.ttp.net","Code", 554),
("abcd.ttp.net","Code", 747),
("asdf.ttp.net","Part", 554),
("xyz.ttp.net","Part", 747)
)).toDF("A","B","C")
data.withColumn("D", split($"A", "\\.")(0)).show(false)
//using SQL
data.createOrReplaceTempView("tempTable")
data.sqlContext.sql("SELECT A, B, C, SUBSTRING_INDEX(A, '.', 1) as D from tempTable").show(false)
Output:
+------------+----+---+----+
|A |B |C |D |
+------------+----+---+----+
|awer.ttp.net|Code|554|awer|
|abcd.ttp.net|Code|747|abcd|
|asdf.ttp.net|Part|554|asdf|
|xyz.ttp.net |Part|747|xyz |
+------------+----+---+----+

You can do something similar to the below in Spark SQL:
select A,B,C, split(A,'\\.')[0] as D from tablename;

Related

Extracting the value of a JSON key in Spark SQL

I am looking to aggregate by extracting the value of a JSON key from one of the columns. Can someone help me with the right syntax in Spark SQL?
select count(distinct(Name)) as users, xHeaderFields['xyz'] as app group by app order by users desc
The table column is something like this (I have removed other columns for simplification). The table has columns like Name, etc.
Assuming that your dataset is called ds and there is only one key=xyz object per column:
First, to JSON conversion (if needed):
ds = ds.withColumn("xHeaderFields", expr("from_json(xHeaderFields, 'array<struct<key:string,value:string>>')"))
Then filter the key = xyz and take the first element (assuming there is only one xyz key):
.withColumn("xHeaderFields", expr("filter(xHeaderFields, x -> x.key == 'xyz')[0]"))
Finally, extract value from your object:
.withColumn("xHeaderFields", expr("xHeaderFields.value"))
Final result:
+-------------+
|xHeaderFields|
+-------------+
|null |
|null |
|Settheclass |
+-------------+
Good luck!
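If the end goal is the aggregation from the question (distinct Name counts per extracted xyz value), the steps above can be folded into a single statement. A minimal PySpark/Spark SQL sketch, assuming the view name events is just a placeholder, the dataset has a Name column, and xHeaderFields has already been parsed into array<struct<key:string,value:string>>:
# Hedged sketch: `ds` is the parsed dataset from the steps above; "events" is
# only a placeholder view name.
ds.createOrReplaceTempView("events")

result = spark.sql("""
    SELECT filter(xHeaderFields, x -> x.key == 'xyz')[0].value AS app,
           count(DISTINCT Name)                                AS users
    FROM events
    GROUP BY filter(xHeaderFields, x -> x.key == 'xyz')[0].value
    ORDER BY users DESC
""")
result.show(truncate=False)
Note that filter as a higher-order SQL function (also used in the answer above) requires Spark 2.4 or later.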

Convert this SQL left-join query to Spark DataFrames (Scala)

I have this SQL query, which is a left join and has a select statement at the beginning that chooses columns from the right table as well.
Can you please help me convert it to Spark DataFrames and get the result using spark-shell?
I don't want to use the SQL code in Spark; instead I want to use DataFrames.
I know the join syntax in Scala, but I don't know how to select from the right table (here it is count(w.id2)) when the DataFrame resulting from the left join doesn't seem to have access to the right table's columns.
Thank you!
select count(x.user_id) user_id_count, count(w.id2) current_id2_count
from
(select
user_id
from
tb1
where
year='2021'
and month=1
) x
left join
(select id1, max(id2) id2 from tb2 group by id1) w
on
x.user_id=w.id1;
In Spark I would create two DataFrames x and w and join them:
var x = spark.sqlContext.table("tb1").where("year='2021' and month=1")
var w = spark.sqlContext.table("tb2").groupBy("id1").agg(max("id2")).alias("id2")
var joined = x.join(w, x("user_id")===w("id1"), "left")
EDIT:
I was confused about the left join. Spark raised an error that column id2 is not available, and I thought it was because the DataFrame resulting from the left join would only have the left table's columns. However, the real reason was that when selecting max(id2) I had to give it the alias correctly.
Here is a sample and the solution:
var x = Seq("1","2","3","4").toDF("user_id")
var w = Seq (("1", 1), ("1",2), ("3",10),("1",5),("5",4)).toDF("id1", "id2")
var z= w.groupBy("id1").agg(max("id2").alias("id2"))
val xJoinsZ= x.join(z, x("user_id") === z("id1"), "left").select(count(col("user_id").alias("user_id_count")), count(col("id2").alias("current_id2_count")))
scala> x.show(false)
+-------+
|user_id|
+-------+
|1 |
|2 |
|3 |
|4 |
+-------+
scala> z.show(false)
+---+---+
|id1|id2|
+---+---+
|3 |10 |
|5 |4 |
|1 |5 |
+---+---+
scala> xJoinsZ.show(false)
+---------------------------------+---------------------------------+
|count(user_id AS `user_id_count`)|count(id2 AS `current_id2_count`)|
+---------------------------------+---------------------------------+
|4 |2 |
+---------------------------------+---------------------------------+
Your request is quite difficult to understand; however, I am going to try to reply, taking the SQL code you provided as a baseline and reproducing it with Spark.
// Reading tb1 (x) and filtering for Jan 2021, selecting only "user_id"
val x: DataFrame = spark.read
.table("tb1")
.filter(col("year") === "2021")
.filter(col("mont") === "01")
.select("user_id")
// Reading tb2 (w) and for each "id1" getting the max "id2"
val w: DataFrame = spark.read
.table("tb2")
.groupBy(col("id1"))
.max("id2")
// Joining tb1 (x) and tb2 (w) on "user_id" === "id1", then counting user_id and id2
val xJoinsW: DataFrame = x
.join(w, x("user_id") === w("id1"), "left")
.select(count(col("user_id").as("user_id_count")), count(col("max(id2)").as("current_id2_count")))
A small but relevant remark: as you're using Scala and Spark, I would suggest using val and not var. val means the reference is final and cannot be reassigned, whereas var can be reassigned later. You can read more here.
Lastly, feel free to change the Spark reading mechanism to whatever you like.

How to get counts for null, not null, distinct values, all rows for all columns in a sql table?

I'm currently looking to get a table that gets counts for null, not null, distinct values, and all rows for all columns in a given table. This happens to be in Databricks (Apache Spark).
Something that looks like what is shown below.
I know I can do this with something like the SQL shown below. Also, I can use something like Java or Python, etc., to generate the SQL.
The question is:
Is this the most efficient approach?
Is there a better way to write this query (less typing and/or more efficient)?
select
1 col_position,
'location_id' col_name,
count(*) all_records,
count(location_id) not_null,
count(*) - count(location_id) null,
count(distinct location_id) distinct_values
from
admin
union
select
2 col_position,
'location_zip' col_name,
count(*) all_records,
count(location_zip) not_null,
count(*) - count(location_zip) null,
count(distinct location_zip) distinct_values
from
admin
union
select
3 col_position,
'provider_npi' col_name,
count(*) all_records,
count(provider_npi) not_null,
count(*) - count(provider_npi) null,
count(distinct provider_npi) distinct_values
from
admin
order by col_position
;
As said in the comments, using UNION ALL should be efficient.
Using SQL
To avoid typing out all the columns and subqueries, you can generate the SQL query from the list of columns like this:
val df = spark.sql("select * from admin")
// generate the same query from the columns list
val sqlQuery =
df.columns.zipWithIndex.map { case (c, i) =>
Seq(
s"$i col_position",
s"$c col_name",
"count(*) all_records",
s"count($c) not_null",
s"count(*) - count($c) null",
s"count(distinct $c) distinct_values"
).mkString("select ", ", ", " from admin")
}.mkString("", " union all\n", " order by col_position")
spark.sql(sqlQuery).show
Using DataFrame (Scala)
There are some optimizations you can do by using the DataFrame API, like computing count(*) only once, avoiding typing all the column names, and the possibility of caching.
Example input DataFrame:
//+---+---------+--------+---------------------+------+
//|id |firstName|lastName|email |salary|
//+---+---------+--------+---------------------+------+
//|1 |michael |armbrust|no-reply#berkeley.edu|100K |
//|2 |xiangrui |meng |no-reply#stanford.edu|120K |
//|3 |matei |null |no-reply#waterloo.edu|140K |
//|4 |null |wendell |null |160K |
//|5 |michael |jackson |no-reply#neverla.nd |null |
//+---+---------+--------+---------------------+------+
First, get the count and the column list:
val cols = df.columns
val allRecords = df.count
Then, calculate each metric by looping through the column list (you could create a function for each metric, for example):
val nullsCountDF = df.select(
(Seq(expr("'nulls' as metric")) ++ cols.map(c =>
sum(when(col(c).isNull, lit(1)).otherwise(lit(0))).as(c)
)): _*
)
val distinctCountDF = df.select(
(Seq(expr("'distinct_values' as metric")) ++ cols.map(c =>
countDistinct(c).as(c)
)): _*
)
val maxDF = df.select(
(Seq(expr("'max_value' as metric")) ++ cols.map(c => max(c).as(c))): _*
)
val minDF = df.select(
(Seq(expr("'min_value' as metric")) ++ cols.map(c => min(c).as(c))): _*
)
val allRecordsDF = spark.sql("select 'all_records' as metric," + cols.map(c => s"$allRecords as $c").mkString(","))
Finally, union the data frames created above:
val metricsDF = Seq(allRecordsDF, nullsCountDF, distinctCountDF, maxDF, minDF).reduce(_ union _)
metricsDF.show
//+---------------+---+---------+--------+---------------------+------+
//|metric |id |firstName|lastName|email |salary|
//+---------------+---+---------+--------+---------------------+------+
//|all_records |5 |5 |5 |5 |5 |
//|nulls |0 |1 |1 |1 |1 |
//|distinct_values|5 |3 |4 |4 |4 |
//|max_value |5 |xiangrui |wendell |no-reply#waterloo.edu|160K |
//|min_value |1 |matei |armbrust|no-reply#berkeley.edu|100K |
//+---------------+---+---------+--------+---------------------+------+
Using DataFrame (Python)
For a Python example, you can see my other answer.
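Since the Python version is only linked, here is a minimal PySpark sketch of the same idea (an illustrative rewrite, not the linked answer); it assumes df is the input DataFrame and spark is an active session:
from pyspark.sql import functions as F

cols = df.columns

# One row per metric and one column per input column; each select is a single
# global aggregation, and the metric rows are then unioned together.
all_records_df = df.select(
    [F.expr("'all_records' as metric")]
    + [F.count(F.lit(1)).alias(c) for c in cols]
)
nulls_count_df = df.select(
    [F.expr("'nulls' as metric")]
    + [F.sum(F.when(F.col(c).isNull(), 1).otherwise(0)).alias(c) for c in cols]
)
distinct_count_df = df.select(
    [F.expr("'distinct_values' as metric")]
    + [F.countDistinct(F.col(c)).alias(c) for c in cols]
)

metrics_df = all_records_df.union(nulls_count_df).union(distinct_count_df)
metrics_df.show(truncate=False)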
Use count(field) total_count; this counts all non-null rows, since count skips nulls. Wrapping it as count(ifnull(field, 0)) would instead count every row, because ifnull replaces nulls with 0.

Read Rules from a file and apply those rules to pyspark dataframe rows

I have a rule book CSV; the data looks like this:
operator|lastname|operator|firstname|val
equals | ABC |contains| XYZ | 2
equals | QWE |contains| rty | 3
So if lastname equals ABC and firstname contains XYZ, then val will be 2, and so on. This file can be changed or modified, so the conditions are dynamic. Rows can even be added in the future.
Now, my pyspark dataframe is:
lastname| firstname| service
ABC | XYZNMO | something
QUE | rtysdf | something
I need to apply the rules from that CSV file to this dataframe and add the val column. So my desired output dataframe would look like:
lastname| firstname| service | val
ABC | XYZNMO | something| 2
QUE | rtysdf | something| 3
Remember that the rule book is dynamic: rules can be added, deleted, or modified at any time. Even the operators in the rule book can be modified.
Thanks in advance.
Use a CSV parser to parse the rules file and get the rules data. Then programmatically build a SQL statement from the rule data, something similar to the query below (a sketch of generating it from the rule file follows the code):
query = "SELECT
CASE WHEN lastname = 'ABC' and firstname LIKE 'XYZ%' THEN 2
ELSE
CASE WHEN lastname = 'QUE' and firstname LIKE 'rty% THEN 3
END
END AS val
FROM table"
then run:
df.createOrReplaceTempView("table")
result_df = spark.sql(query) # above dynamic query
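A minimal sketch of the "build it programmatically" step. The rule file is assumed to be pipe-delimited as shown and is read with explicit column names (op1, lastname, op2, firstname, val) because the original header repeats operator; the path, view name, and the equals/contains handling are all illustrative assumptions:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Header row is skipped; the names come from the explicit schema because the
# original header repeats "operator".
rules = (spark.read
         .option("delimiter", "|")
         .option("header", True)
         .schema("op1 string, lastname string, op2 string, firstname string, val string")
         .csv("/path/to/rulebook.csv"))   # placeholder path

def condition(op, column, value):
    # Translate one rule operator into a SQL predicate; only the two operators
    # from the sample file are handled here.
    op, value = op.strip(), value.strip()
    if op == "equals":
        return f"{column} = '{value}'"
    if op == "contains":
        return f"{column} LIKE '%{value}%'"
    raise ValueError(f"unsupported operator: {op}")

whens = " ".join(
    f"WHEN {condition(r['op1'], 'lastname', r['lastname'])} "
    f"AND {condition(r['op2'], 'firstname', r['firstname'])} THEN {r['val'].strip()}"
    for r in rules.collect()
)
query = f"SELECT *, CASE {whens} END AS val FROM table"

df.createOrReplaceTempView("table")   # df is the input DataFrame from the question
result_df = spark.sql(query)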
You can achieve it by using the below process, I believe:
1. Create temporary views on top of your dataframes.
2. Write the SQL using the Spark SQL API and keep it in a text file as a single record.
3. Read the SQL statement you prepared in step 2, for example with sqlStatement = spark.sparkContext.textFile("sqllocation").first().toString()
4. Run it using spark.sql(sqlStatement).
This way you can update the SQL statement in the text file as and when required (a minimal end-to-end sketch follows).
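A minimal PySpark sketch of that flow; the path and view name are placeholders, and df is the question's input DataFrame:
# Placeholder path and view name; the text file holds the full CASE-based
# SELECT statement as a single record.
df.createOrReplaceTempView("input_table")
sql_statement = spark.sparkContext.textFile("/rules/case_statement.sql").first()
result_df = spark.sql(sql_statement)
result_df.show()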

Select field only if it exists (SQL or Scala)

The input dataframe may not always have all the columns. In SQL or Scala, I want to create a select statement that, even if the dataframe does not have a column, won't error out and will output only the columns that do exist.
For example, this statement will work.
Select store, prod, distance from table
+-----+------+--------+
|store|prod |distance|
+-----+------+--------+
|51 |42 |2 |
|51 |42 |5 |
|89 |44 |9 |
If the dataframe looks like below, I want the same statement to work, to just ignore what's not there, and just output the existing columns (in this case 'store' and 'prod')
+-----+------+
|store|prod |
+-----+------+
|51 |42 |
|51 |42 |
|89 |44 |
You can have the list of all columns in a list, either hard-coded or prepared from other metadata, and use intersect:
val columnNames = Seq("c1","c2","c3","c4")
df.select( df.columns.intersect(columnNames).map(x=>col(x)): _* ).show()
You can make use of the columns method on DataFrame. This would look like this:
val result = if(df.columns.contains("distance")) df.select("store", "prod", "distance")
else df.select("store", "prod")
Edit:
If you have many such columns, you can keep them in an array, for example cols, and filter it:
val selectedCols = cols.filter(c => df.columns.contains(c)).map(col)
val result = df.select(selectedCols:_*)
Assuming you use the expanded SQL template, like select a,b,c from tab, you could do something like below to get the required results.
1. Get the SQL string and convert it to lowercase.
2. Split the SQL on space or comma to get the individual words in an array.
3. Remove "select" and "from" from the above array, as they are SQL keywords.
4. Now the last index is the table name, and the indexes from the first up to the last-but-one contain the list of select columns.
5. To get the required columns, just filter them against df2.columns. The columns that are in the SQL but not in the table will be filtered out.
6. Now construct the SQL using the individual pieces.
7. Run it using spark.sql(reqd_sel_string) to get the results.
Check this out
scala> val df2 = Seq((51,42),(51,42),(89,44)).toDF("store","prod")
df2: org.apache.spark.sql.DataFrame = [store: int, prod: int]
scala> df2.createOrReplaceTempView("tab2")
scala> val sel_query="Select store, prod, distance from tab2".toLowerCase
sel_query: String = select store, prod, distance from tab2
scala> val tabl_parse = sel_query.split("[ ,]+").filter(_!="select").filter(_!="from")
tabl_parse: Array[String] = Array(store, prod, distance, tab2)
scala> val tab_name=tabl_parse(tabl_parse.size-1)
tab_name: String = tab2
scala> val tab_cols = (0 until tabl_parse.size-1).map(tabl_parse(_))
tab_cols: scala.collection.immutable.IndexedSeq[String] = Vector(store, prod, distance)
scala> val reqd_cols = tab_cols.filter( x=>df2.columns.contains(x))
reqd_cols: scala.collection.immutable.IndexedSeq[String] = Vector(store, prod)
scala> val reqd_sel_string = "select " + reqd_cols.mkString(",") + " from " + tab_name
reqd_sel_string: String = select store,prod from tab2
scala> spark.sql(reqd_sel_string).show(false)
+-----+----+
|store|prod|
+-----+----+
|51 |42 |
|51 |42 |
|89 |44 |
+-----+----+
scala>