WHERE/filtering in PySpark SQL

I'm using SQL with PySpark, but when I filter with WHERE the result is an empty table, even though the data does contain rows that match this filter.
LESIVIDAD is a string column:
|-- LESIVIDAD: string (nullable = true)
t_acc = spark.sql("SELECT LESIVIDAD, COUNT(LESIVIDAD) AS COUNT FROM acc_table
WHERE LESIVIDAD = 'IL' GROUP BY LESIVIDAD")
t_acc.show()
+---------+-----+
|LESIVIDAD|COUNT|
+---------+-----+
+---------+-----+
My table "Lesividad" is:
t_acc = spark.sql("""SELECT LESIVIDAD FROM acc_table GROUP BY
LESIVIDAD""").show()

+--------------------+
| LESIVIDAD|
+--------------------+
| NO ASIGNADA|
|IL ...|
|MT ...|
|HG ...|
|HL ...|
+--------------------+

Your code is fine. I presume the problem is with the data value you are trying to match, i.e. LESIVIDAD = 'IL'.
Please note that in PySpark, table header/column names are case-insensitive, whereas the data inside the table is case-sensitive. So if your table contains 'il' / 'Il' / 'iL' and there is no 'IL', the query will return an empty table.
Hence the value you are searching for is case-sensitive; make sure you type it exactly as it is stored.
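For example, if the stored values differ only in case or carry surrounding whitespace, a more forgiving comparison may help. This is just a sketch in PySpark, assuming the same acc_table view (the "IL ..." in your show() output may also indicate the stored values contain extra text after "IL"):
# A sketch: ignore case and surrounding whitespace when comparing.
t_acc = spark.sql("""
    SELECT LESIVIDAD, COUNT(LESIVIDAD) AS COUNT
    FROM acc_table
    WHERE UPPER(TRIM(LESIVIDAD)) = 'IL'
    GROUP BY LESIVIDAD
""")
t_acc.show()

# If the stored values contain extra text after 'IL', a prefix match may be what you need:
# WHERE LESIVIDAD LIKE 'IL%'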


Extracting value of a JSON in Spark SQL

I am looking to aggregate by extracting the value of a JSON key from one of the columns. Can someone help me with the right syntax in Spark SQL?
select count(distinct(Name)) as users, xHeaderFields['xyz'] as app group by app order by users desc
The table column looks something like this (I have removed other columns for simplification). The table also has columns like Name, etc.
Assuming that your dataset is called ds and there is only one key = 'xyz' object per row:
First, convert the string column to JSON (if needed):
ds = ds.withColumn("xHeaderFields", expr("from_json(xHeaderFields, 'array<struct<key:string,value:string>>')"))
Then filter the key = xyz and take the first element (assuming there is only one xyz key):
.withColumn("xHeaderFields", expr("filter(xHeaderFields, x -> x.key == 'xyz')[0]"))
Finally, extract value from your object:
.withColumn("xHeaderFields", expr("xHeaderFields.value"))
Final result:
+-------------+
|xHeaderFields|
+-------------+
|null |
|null |
|Settheclass |
+-------------+
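Putting the steps together with the aggregation from your original query, here is a rough PySpark sketch (the column name Name and the 'app' alias are taken from your post; adjust as needed):
from pyspark.sql.functions import expr, countDistinct, col

# Parse the JSON string, pull out the value for key 'xyz', then aggregate.
result = (ds
    .withColumn("xHeaderFields", expr("from_json(xHeaderFields, 'array<struct<key:string,value:string>>')"))
    .withColumn("app", expr("filter(xHeaderFields, x -> x.key == 'xyz')[0].value"))
    .groupBy("app")
    .agg(countDistinct("Name").alias("users"))
    .orderBy(col("users").desc()))
result.show()
Note that the filter(...) higher-order function requires Spark 2.4 or later.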
Good luck!

How can I extend an SQL table with new primary keys as well as add up values for existing keys?

I want to join or update the following two tables and also add up df for existing words. So if the word endeavor does not exist in the first table, it should be added with its df value, and if the word hallo exists in both tables, the df values should be summed.
FYI I'm using MariaDB and PySpark to do word counts on documents and calculate tf, df, and tfidf values.
Table name: df
+--------+----+
| word| df|
+--------+----+
|vicinity| 5|
| hallo| 2|
| admire| 3|
| settled| 1|
+--------+----+
Table name: word_list
+----------+---+
|      word| df|
+----------+---+
|     hallo|  1|
|   settled|  1|
|  endeavor|  1|
+----------+---+
So in the end the updated/combined table should look like this:
+----------+---+
|      word| df|
+----------+---+
|  vicinity|  5|
|     hallo|  3|
|    admire|  3|
|   settled|  2|
|  endeavor|  1|
+----------+---+
What I've tried to do so far is the following:
SELECT df.word, df.df + word_list.df FROM df FULL OUTER JOIN word_list ON df.word=word_list.word
SELECT df.word FROM df JOIN word_list ON df.word=word_list.word
SELECT df.word FROM df FULL OUTER JOIN word_list ON df.word=word_list.word
None of them worked; I either get a table with just null values, some null values, or an exception. I'm sure there must be an easy SQL statement to achieve this, but I've been stuck on it for hours and haven't found anything related on Stack Overflow.
You just need to UNION the two tables first, then aggregate on the word. Since the tables are identically structured, it's very easy. Look at this fiddle. I used MariaDB 10.3 since you didn't specify a version, but these queries should be compliant with just about any DBMS.
https://dbfiddle.uk/?rdbms=mariadb_10.3&fiddle=c6d86af77f19fc1f337ad1140ef07cd2
select word, sum(df) as df
from (
select * from df
UNION ALL
select * from word_list
) z
group by word
order by sum(df) desc;
UNION is the vertical cousin of JOIN: UNION combines two datasets vertically (row-wise), whereas JOIN combines them horizontally, by adding columns to the output. Both datasets need to have the same number of columns for the UNION to work, and you need UNION ALL here so that the union returns all rows, because the default behavior is to return only unique rows. In this dataset, since settled has a value of 1 in both tables, it would appear only once in the UNION without the ALL keyword, and so when you sum it the value of df would be 1 instead of 2, as you are expecting.
The ORDER BY isn't necessary if you are just transferring to a new table. I just added it to get my results in the same order as your sample output.
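Since you mention you're also using PySpark, the same union-then-aggregate idea works directly on DataFrames. A sketch, assuming the two tables have been loaded as DataFrames named df_table and word_list (the names are placeholders):
from pyspark.sql import functions as F

# Stack the two DataFrames row-wise (UNION ALL semantics), then sum df per word.
combined = (df_table.unionByName(word_list)
            .groupBy("word")
            .agg(F.sum("df").alias("df"))
            .orderBy(F.col("df").desc()))
combined.show()
unionByName keeps duplicate rows, which matches the UNION ALL behavior above.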
Let me know if this worked for you.

Convert this SQL left-join query to Spark DataFrames (Scala)

I have this SQL query, which is a left join with a SELECT at the top that also picks columns from the right table.
Can you please help me convert it to Spark DataFrames and get the result using spark-shell?
I don't want to run the SQL directly in Spark; instead I want to use DataFrames.
I know the join syntax in Scala, but I don't know how to select from the right table (here count(w.id2)) when the DataFrame resulting from the left join doesn't seem to have access to the right table's columns.
Thank you!
select count(x.user_id) user_id_count, count(w.id2) current_id2_count
from
(select
user_id
from
tb1
where
year='2021'
and month=1
) x
left join
(select id1, max(id2) id2 from tb2 group by id1) w
on
x.user_id=w.id1;
In Spark I would create two DataFrames, x and w, and join them:
var x = spark.sqlContext.table("tb1").where("year='2021' and month=1")
var w = spark.sqlContext.table("tb2").groupBy("id1").agg(max("id2")).alias("id2")
var joined = x.join(w, x("user_id")===w("id1"), "left")
EDIT:
I was confused about the left join. Spark raised an error that column id2 was not available, and I thought it was because the DataFrame resulting from the left join only keeps the left table's columns. However, the real reason was that when selecting max(id2) I had to give it an alias correctly.
Here is a sample and the solution:
var x = Seq("1","2","3","4").toDF("user_id")
var w = Seq (("1", 1), ("1",2), ("3",10),("1",5),("5",4)).toDF("id1", "id2")
var z= w.groupBy("id1").agg(max("id2").alias("id2"))
val xJoinsZ= x.join(z, x("user_id") === z("id1"), "left").select(count(col("user_id").alias("user_id_count")), count(col("id2").alias("current_id2_count")))
scala> x.show(false)
+-------+
|user_id|
+-------+
|1 |
|2 |
|3 |
|4 |
+-------+
scala> z.show(false)
+---+---+
|id1|id2|
+---+---+
|3 |10 |
|5 |4 |
|1 |5 |
+---+---+
scala> xJoinsZ.show(false)
+---------------------------------+---------------------------------+
|count(user_id AS `user_id_count`)|count(id2 AS `current_id2_count`)|
+---------------------------------+---------------------------------+
|4 |2 |
+---------------------------------+---------------------------------+
Your request is quite difficult to understand; however, I am going to try to reply, taking the SQL code you provided as a baseline and reproducing it with Spark.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, count}

// Reading tb1 (x), filtering for Jan 2021 and selecting only "user_id"
val x: DataFrame = spark.read
  .table("tb1")
  .filter(col("year") === "2021")
  .filter(col("month") === 1)
  .select("user_id")
// Reading tb2 (w) and for each "id1" getting the max "id2"
val w: DataFrame = spark.read
.table("tb2")
.groupBy(col("id1"))
.max("id2")
// Joining tb1 (x) and tb2 (w) on "user_id" === "id1", then counting user_id and id2
val xJoinsW: DataFrame = x
.join(w, x("user_id") === w("id1"), "left")
  .select(count(col("user_id")).as("user_id_count"), count(col("max(id2)")).as("current_id2_count"))
A small but relevant remark: as you're using Scala and Spark, I would suggest using val and not var. val means the reference is final and cannot be reassigned, whereas var can be reassigned later. You can read more here.
Lastly, feel free to change the Spark reading mechanism with whatever you like.

How to get the COUNT of emails for each id in Scala

I use this query in SQL to return how many user_ids have more than one email. How would I write the same query against a users DataFrame in Scala? Also, how would I be able to return the exact emails for each user_id?
SELECT DISTINCT user_id
FROM Users
Group by user_id
Having count(DISTINCT email) > 1
Let's assume that you have a dataframe of users. In spark, one could create a sample of such a dataframe like this:
import spark.implicits._
val df = Seq(("me", "contact#me.com"),
("me", "me#company.com"),
("you", "you#company.com")).toDF("user_id", "email")
df.show()
+-------+---------------+
|user_id| email|
+-------+---------------+
| me| contact#me.com|
| me| me#company.com|
| you|you#company.com|
+-------+---------------+
Now, the logic is very similar to the one you have in SQL:
df.groupBy("user_id")
.agg(countDistinct("email") as "count")
.where('count > 1)
.show()
+-------+-----+
|user_id|count|
+-------+-----+
| me| 2|
+-------+-----+
Then you can add a .drop("count") or a .select("user_id") to only keep users.
Note that there is no having clause in Spark. Once you have called agg to aggregate your DataFrame by user, you have a regular DataFrame on which you can call any transformation function, such as a filter on the count column here.

Select field only if it exists (SQL or Scala)

The input DataFrame may not always have all the columns. In SQL or Scala, I want to create a select statement that won't error out even if the DataFrame does not have a column, and will only output the columns that do exist.
For example, this statement will work.
Select store, prod, distance from table
+-----+------+--------+
|store|prod  |distance|
+-----+------+--------+
|51   |42    |2       |
|51   |42    |5       |
|89   |44    |9       |
+-----+------+--------+
If the DataFrame looks like the one below, I want the same statement to work: just ignore what's not there and output only the existing columns (in this case store and prod).
+-----+------+
|store|prod  |
+-----+------+
|51   |42    |
|51   |42    |
|89   |44    |
+-----+------+
You can keep a list of all possible columns (either hard-coded or prepared from other metadata) and use intersect:
val columnNames = Seq("c1", "c2", "c3", "c4")
df.select(df.columns.intersect(columnNames).map(x => col(x)): _*).show()
You can make use of the columns method on DataFrame. It would look like this:
val result = if(df.columns.contains("distance")) df.select("store", "prod", "distance")
else df.select("store", "prod")
Edit:
Having many such columns, you can keep them in a collection, for example cols, and filter it:
val selectedCols = cols.filter(c => df.columns.contains(c)).map(col)
val result = df.select(selectedCols: _*)
Assuming you use the expanded SQL form, like select a, b, c from tab, you could do something like the below to get the required result.
Get the SQL string and convert it to lowercase.
Split the SQL on spaces or commas to get the individual words in an array.
Remove "select" and "from" from the array, as they are SQL keywords.
Now the last element is the table name.
The elements from the first up to (but not including) the last are the selected columns.
To get the required columns, filter them against df2.columns; the columns that are in the SQL but not in the table will be filtered out.
Now reconstruct the SQL from the individual pieces.
Run it using spark.sql(reqd_sel_string) to get the results.
Check this out
scala> val df2 = Seq((51,42),(51,42),(89,44)).toDF("store","prod")
df2: org.apache.spark.sql.DataFrame = [store: int, prod: int]
scala> df2.createOrReplaceTempView("tab2")
scala> val sel_query="Select store, prod, distance from tab2".toLowerCase
sel_query: String = select store, prod, distance from tab2
scala> val tabl_parse = sel_query.split("[ ,]+").filter(_!="select").filter(_!="from")
tabl_parse: Array[String] = Array(store, prod, distance, tab2)
scala> val tab_name=tabl_parse(tabl_parse.size-1)
tab_name: String = tab2
scala> val tab_cols = (0 until tabl_parse.size-1).map(tabl_parse(_))
tab_cols: scala.collection.immutable.IndexedSeq[String] = Vector(store, prod, distance)
scala> val reqd_cols = tab_cols.filter( x=>df2.columns.contains(x))
reqd_cols: scala.collection.immutable.IndexedSeq[String] = Vector(store, prod)
scala> val reqd_sel_string = "select " + reqd_cols.mkString(",") + " from " + tab_name
reqd_sel_string: String = select store,prod from tab2
scala> spark.sql(reqd_sel_string).show(false)
+-----+----+
|store|prod|
+-----+----+
|51 |42 |
|51 |42 |
|89 |44 |
+-----+----+
scala>