Group query in subquery to get column value as column name - sql

The data I have in my database:
| id | some_id | status  |
| 1  | 1       | SUCCESS |
| 2  | 2       | SUCCESS |
| 3  | 1       | SUCCESS |
| 4  | 3       | SUCCESS |
| 5  | 1       | SUCCESS |
| 6  | 4       | FAILED  |
| 7  | 1       | SUCCESS |
| 8  | 1       | FAILED  |
| 9  | 4       | FAILED  |
| 10 | 1       | FAILED  |
.......
I ran a query that groups by some_id and status to get the result below:
| some_id | count | status  |
| 1       | 20    | SUCCESS |
| 2       | 5     | SUCCESS |
| 3       | 10    | SUCCESS |
| 2       | 15    | FAILED  |
| 3       | 12    | FAILED  |
| 4       | 25    | FAILED  |
I want to use the above query as a subquery to get the result below, where the distinct status values become the column names.
| some_id | SUCCESS | FAILED |
| 1       | 20      | null/0 |
| 2       | 5       | 15     |
| 3       | 10      | 12     |
| 4       | null/0  | 25     |
Any other approach to get the final data is also appreciated. Let me know if you need more info.
Thanks

You may use a pivot query here with the help of FILTER:
SELECT
    some_id,
    COUNT(*) FILTER (WHERE status = 'SUCCESS') AS SUCCESS,
    COUNT(*) FILTER (WHERE status = 'FAILED') AS FAILED
FROM yourTable
GROUP BY
    some_id;
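If your database does not support the FILTER clause (it is available in PostgreSQL and SQLite, among others), a minimal equivalent sketch uses conditional aggregation, assuming the same yourTable name:
SELECT
    some_id,
    SUM(CASE WHEN status = 'SUCCESS' THEN 1 ELSE 0 END) AS SUCCESS,
    SUM(CASE WHEN status = 'FAILED' THEN 1 ELSE 0 END) AS FAILED
FROM yourTable
GROUP BY
    some_id;
Both forms return 0 for a missing status (matching the null/0 column in the desired output); dropping the ELSE 0 would give NULL instead.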


pyspark add 0 with empty index

I have a dataframe like below:
+--------+---------+---------+
| name | index | score |
+--------+---------+---------+
| name0 | 0 | 50 |
| name0 | 2 | 90 |
| name0 | 3 | 100 |
| name0 | 5 | 85 |
| name1 | 1 | 65 |
| name1 | 2 | 50 |
| name1 | 3 | 70 |
+--------+---------+---------+
and the index should run from 0 to 5, so what I want to get is:
+--------+---------+---------+
| name | index | score |
+--------+---------+---------+
| name0 | 0 | 50 |
| name0 | 1 | 0 |
| name0 | 2 | 90 |
| name0 | 3 | 100 |
| name0 | 4 | 0 |
| name0 | 5 | 85 |
| name1 | 0 | 0 |
| name1 | 1 | 65 |
| name1 | 2 | 50 |
| name1 | 3 | 70 |
| name1 | 4 | 0 |
| name1 | 5 | 0 |
+--------+---------+---------+
I want to fill the missing index rows with 0, but I have no idea how.
Is there any solution? Please note that I don't use pandas.
Cross join the names with a range of indices, then left join to the original dataframe using name and index, and replace nulls with 0.
spark.conf.set("spark.sql.crossJoin.enabled", True)

df2 = (df.select('name')
         .distinct()
         .join(spark.range(6).toDF('index'))
         .join(df, ['name', 'index'], 'left')
         .fillna({'score': 0})
      )
df2.show()
+-----+-----+-----+
| name|index|score|
+-----+-----+-----+
|name0| 0| 50|
|name0| 1| 0|
|name0| 2| 90|
|name0| 3| 100|
|name0| 4| 0|
|name0| 5| 85|
|name1| 0| 0|
|name1| 1| 65|
|name1| 2| 50|
|name1| 3| 70|
|name1| 4| 0|
|name1| 5| 0|
+-----+-----+-----+
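If you prefer not to flip the spark.sql.crossJoin.enabled flag, newer Spark versions let you make the cross join explicit instead. A minimal sketch under the same assumptions (the frame is named df and indices run 0 through 5):
df2 = (df.select('name')
         .distinct()
         .crossJoin(spark.range(6).toDF('index'))
         .join(df, ['name', 'index'], 'left')
         .fillna({'score': 0}))
df2.show()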

SQL to add position depending on multiple columns

I have a table to which I am adding a position column. I need to add a numbered position to all rows already in the table. The numbering restarts for each group of rows that match on 4 columns. For example,
| id | name | fax | cart | area |
| 1  | jim  | 1   | 4    | 1    |
| 2  | jim  | 1   | 4    | 1    |
| 3  | jim  | 2   | 4    | 1    |
| 4  | jim  | 2   | 4    | 1    |
| 5  | bob  | 1   | 4    | 1    |
| 6  | bob  | 1   | 4    | 1    |
| 7  | bob  | 2   | 5    | 1    |
| 8  | bob  | 2   | 5    | 2    |
| 9  | bob  | 2   | 5    | 2    |
| 10 | bob  | 2   | 5    | 2    |
would result in:
| id | name | fax | cart | area | position |
| 1  | jim  | 1   | 4    | 1    | 1        |
| 2  | jim  | 1   | 4    | 1    | 2        |
| 3  | jim  | 2   | 4    | 1    | 1        |
| 4  | jim  | 2   | 4    | 1    | 2        |
| 5  | bob  | 1   | 4    | 1    | 1        |
| 6  | bob  | 1   | 4    | 1    | 2        |
| 7  | bob  | 2   | 5    | 1    | 1        |
| 8  | bob  | 2   | 5    | 2    | 1        |
| 9  | bob  | 2   | 5    | 2    | 2        |
| 10 | bob  | 2   | 5    | 2    | 3        |
I need an SQL query that will iterate over the table and add the position.
Use row_number():
select
    t.*,
    row_number() over(partition by name, fax, cart, area order by id) position
from mytable t
If you wanted an update query:
update mytable as t
set position = rn
from (
    select id, row_number() over(partition by name, fax, cart, area order by id) rn
    from mytable
) x
where x.id = t.id
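The UPDATE ... FROM form above is PostgreSQL-style syntax. If you happen to be on MySQL 8+ instead (window functions are required), a rough equivalent joins the derived table directly:
update mytable t
join (
    select id, row_number() over(partition by name, fax, cart, area order by id) rn
    from mytable
) x on x.id = t.id
set t.position = x.rn;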

Creating a Third Table Using Two Tables With the Help of Spark-SQL or PySpark (No Use of Pandas)

I am trying to create a third table using two tables with the help of Spark-SQL or PySpark (no use of pandas).
Dataframe One:
+---------+---------+------------+-----------+
| NAME | NAME_ID | CLIENT | CLIENT_ID |
+---------+---------+------------+-----------+
| RISHABH | 1 | SINGH | 5 |
| RISHABH | 1 | PATHAK | 3 |
| RISHABH | 1 | KUMAR | 2 |
| KEDAR | 2 | PATHAK | 3 |
| KEDAR | 2 | JADHAV | 1 |
| ANKIT | 3 | SRIVASTAVA | 6 |
| ANKIT | 3 | KUMAR | 2 |
| SUMIT | 4 | SINGH | 5 |
| SUMIT | 4 | SHARMA | 4 |
+---------+---------+------------+-----------+
Dataframe Two:
+---------+---------+------------+-----------+
| NAME    | NAME_ID | CLIENT     | CLIENT_ID |
+---------+---------+------------+-----------+
| RISHBAH | _____   | SRIVASTAVA | _____     |
| KEDAR   | _____   | KUMAR      | _____     |
| RISHABH | _____   | SINGH      | _____     |
| KEDAR   | _____   | PATHAK     | _____     |
+---------+---------+------------+-----------+
Required Dataframe Output:
+---------+---------+------------+-----------+
| NAME    | NAME_ID | CLIENT     | CLIENT_ID |
+---------+---------+------------+-----------+
| RISHBAH | 1       | SRIVASTAVA | 6         |
| KEDAR   | 2       | KUMAR      | 2         |
| RISHABH | 1       | SINGH      | 5         |
| KEDAR   | 2       | PATHAK     | 3         |
+---------+---------+------------+-----------+
Using Spark-SQL or Spark.
I tried df1.join(df2, df1.NAME == df2.NAME, "left"), but I am not getting the output as required.
I would suggest the following Spark-SQL approach:
val df1 = <assuming data loaded>
val df2 = <assuming data loaded>
// create views on top of the dataframes
df1.createOrReplaceTempView("tbl1")
df2.createOrReplaceTempView("tbl2")
// extract the unique names and nameIds from the first df
val uniqueNameDF = sparkSession.sql("select distinct name, name_Id from tbl1")
// extract the unique client names and clientIds
val uniqueClientDF = sparkSession.sql("select distinct client, client_Id from tbl1")
// create views on these temporary results
uniqueNameDF.createOrReplaceTempView("name")
uniqueClientDF.createOrReplaceTempView("client")
// join the above views with df2 to get the desired result
val resultDF = sparkSession.sql("select n.name, n.name_id, c.client, c.client_id from tbl2 join name n on tbl2.name = n.name join client c on tbl2.client = c.client")
# FROM DATAFRAME ONE AS df_with_key
# SPLIT OUT DISTINCT BY NAME AND CLIENT
nameDF=df_with_key.select("NAME","NAME_ID").distinct()
clientDF=df_with_key.select("CLIENT","CLIENT_ID").distinct()
# DATAFRAME TWO AS df_with_client
+-------+-------+----------+---------+
| NAME|NAME_ID| CLIENT|CLIENT_ID|
+-------+-------+----------+---------+
| KEDAR| null| KUMAR| null|
| KEDAR| null| PATHAK| null|
|RISHABH| null| SINGH| null|
|RISHBAH| null|SRIVASTAVA| null|
+-------+-------+----------+---------+
# NOW JOIN FIRST WITH NAME AND THEN CLIENT
(df_with_client
    .drop("NAME_ID")
    .join(nameDF, nameDF.NAME == df_with_client.NAME, "LEFT")
    .drop(nameDF.NAME)
    .drop("CLIENT_ID")
    .join(clientDF, df_with_client.CLIENT == clientDF.CLIENT)
    .drop(clientDF.CLIENT)
    .select("NAME", "NAME_ID", "CLIENT", "CLIENT_ID")
    .show())
+-------+-------+----------+---------+
| NAME|NAME_ID| CLIENT|CLIENT_ID|
+-------+-------+----------+---------+
| KEDAR| 2| KUMAR| 2|
| KEDAR| 2| PATHAK| 3|
|RISHABH| 1| SINGH| 5|
|RISHBAH| 1|SRIVASTAVA| 6|
+-------+-------+----------+---------+

How to flatten a pyspark dataframe that contains multiple rows per id?

I have a pyspark dataframe with two id columns id and id2. Each id is repeated exactly n times. All id's have the same set of id2's. I'm trying to "flatten" the matrix resulting from each unique id into one row according to id2.
Here's an example to explain what I'm trying to achieve, my dataframe looks like this:
+----+-----+--------+--------+
| id | id2 | value1 | value2 |
+----+-----+--------+--------+
| 1 | 1 | 54 | 2 |
+----+-----+--------+--------+
| 1 | 2 | 0 | 6 |
+----+-----+--------+--------+
| 1 | 3 | 578 | 14 |
+----+-----+--------+--------+
| 2 | 1 | 10 | 1 |
+----+-----+--------+--------+
| 2 | 2 | 6 | 32 |
+----+-----+--------+--------+
| 2 | 3 | 0 | 0 |
+----+-----+--------+--------+
| 3 | 1 | 12 | 2 |
+----+-----+--------+--------+
| 3 | 2 | 20 | 5 |
+----+-----+--------+--------+
| 3 | 3 | 63 | 22 |
+----+-----+--------+--------+
The desired output is the following table:
+----+----------+----------+----------+----------+----------+----------+
| id | value1_1 | value1_2 | value1_3 | value2_1 | value2_2 | value2_3 |
+----+----------+----------+----------+----------+----------+----------+
| 1 | 54 | 0 | 578 | 2 | 6 | 14 |
+----+----------+----------+----------+----------+----------+----------+
| 2 | 10 | 6 | 0 | 1 | 32 | 0 |
+----+----------+----------+----------+----------+----------+----------+
| 3 | 12 | 20 | 63 | 2 | 5 | 22 |
+----+----------+----------+----------+----------+----------+----------+
So, basically, for each unique id and for each column col, I will have n new columns col_1,... for each of the n id2 values.
Any help would be appreciated!
In Spark 2.4 you can do it this way:
var df3 =Seq((1,1,54 , 2 ),(1,2,0 , 6 ),(1,3,578, 14),(2,1,10 , 1 ),(2,2,6 , 32),(2,3,0 , 0 ),(3,1,12 , 2 ),(3,2,20 , 5 ),(3,3,63 , 22)).toDF("id","id2","value1","value2")
scala> df3.show()
+---+---+------+------+
| id|id2|value1|value2|
+---+---+------+------+
| 1| 1| 54| 2|
| 1| 2| 0| 6|
| 1| 3| 578| 14|
| 2| 1| 10| 1|
| 2| 2| 6| 32|
| 2| 3| 0| 0|
| 3| 1| 12| 2|
| 3| 2| 20| 5|
| 3| 3| 63| 22|
+---+---+------+------+
Use coalesce with first to retrieve the first value for each id and id2 combination:
scala> var df4 = df3.groupBy("id").pivot("id2").agg(coalesce(first("value1")),coalesce(first("value2"))).orderBy(col("id"))
scala> val newNames = Seq("id","value1_1","value2_1","value1_2","value2_2","value1_3","value2_3")
Renaming the columns:
scala> df4.toDF(newNames: _*).show()
+---+--------+--------+--------+--------+--------+--------+
| id|value1_1|value2_1|value1_2|value2_2|value1_3|value2_3|
+---+--------+--------+--------+--------+--------+--------+
| 1| 54| 2| 0| 6| 578| 14|
| 2| 10| 1| 6| 32| 0| 0|
| 3| 12| 2| 20| 5| 63| 22|
+---+--------+--------+--------+--------+--------+--------+
Rearrange the columns if needed. Let me know if you have any questions related to this. Happy Hadoop!
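Since the question asks for PySpark, here is a rough Python equivalent of the same groupBy/pivot approach, assuming the dataframe is named df with the columns shown above:
from pyspark.sql import functions as F

df4 = (df.groupBy("id")
         .pivot("id2", [1, 2, 3])
         .agg(F.first("value1").alias("value1"),
              F.first("value2").alias("value2"))
         .orderBy("id"))
# Columns come out as 1_value1, 1_value2, 2_value1, ...; rename them (e.g. with toDF)
# if you want the value1_1, value1_2, ... naming from the desired output.
df4.show()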

When querying the same table, Spark SQL returns null values but Hive and Impala get normal data?

I have a table in Hive.
I query the same table in two ways.
With Hive or Impala I get the expected results, like this:
0: jdbc:hive2://cdh-master3:10000/> SELECT * FROM kafka_table.risk_order_user_level_info rouli WHERE rouli.month = '2019_01' AND rouli.day = '08' AND rouli.order_id > 0 limit 5;
INFO : OK
+-----------------+-------------------+------------+--------------+---------------+-------------------+-----------------------+---------------+---------------------+----------------------+-------------------+--------------+------------+--+
| rouli.order_id | rouli.order_type | rouli.uid | rouli.po_id | rouli.status | rouli.user_level | rouli.pre_user_level | rouli.credit | rouli.down_payment | rouli.open_order_id | rouli.createtime | rouli.month | rouli.day |
+-----------------+-------------------+------------+--------------+---------------+-------------------+-----------------------+---------------+---------------------+----------------------+-------------------+--------------+------------+--+
| 39180235 | 2 | 10526665 | -999 | 100 | 10 | 106 | 27000 | 0 | -999 | 1546887803138 | 2019_01 | 08 |
| 39180235 | 2 | 10526665 | -999 | 100 | 10 | 106 | 27000 | 0 | -999 | 1546887805302 | 2019_01 | 08 |
| 39180235 | 2 | 10526665 | -999 | 100 | 10 | 106 | 27000 | 0 | -999 | 1546887807457 | 2019_01 | 08 |
| 39180235 | 2 | 10526665 | -999 | 100 | 10 | 106 | 27000 | 0 | -999 | 1546887809610 | 2019_01 | 08 |
| 39804907 | 2 | 15022908 | -999 | 100 | -999 | -999 | 0 | 85000 | -999 | 1546887807461 | 2019_01 | 08 |
+-----------------+-------------------+------------+--------------+---------------+-------------------+-----------------------+---------------+---------------------+----------------------+-------------------+--------------+------------+--+
but using Spark, whether Python or Scala, I get this; several columns are null:
scala> spark.sql("SELECT * FROM kafka_table.risk_order_user_level_info WHERE month = '2019_01' AND day = '08' limit 5").show()
+--------+----------+--------+-----+------+----------+--------------+-------+------------+-------------+-------------+-------+---+
|order_id|order_type| uid|po_id|status|user_level|pre_user_level| credit|down_payment|open_order_id| createTime| month|day|
+--------+----------+--------+-----+------+----------+--------------+-------+------------+-------------+-------------+-------+---+
| null| null|14057428| null| 90| null| null|2705000| null| null|1546920940672|2019_01| 08|
| null| null| 5833953| null| 90| null| null|2197000| null| null|1546920941872|2019_01| 08|
| null| null|10408291| null| 100| null| null|1386000| null| null|1546920941979|2019_01| 08|
| null| null| 621761| null| 100| null| null| 100000| null| null|1546920942282|2019_01| 08|
| null| null|10408291| null| 100| null| null|1386000| null| null|1546920942480|2019_01| 08|
+--------+----------+--------+-----+------+----------+--------------+-------+------------+-------------+-------------+-------+---+
How can I make Spark SQL return the expected results?
PS:
I executed the following SQL in both Spark and Hive and found different results:
SELECT * FROM kafka_table.risk_order_user_level_info rouli
WHERE rouli.month = '2019_01' AND rouli.day = '08'
and order_id IN (
39906526,
39870975,
39832606,
39889240,
39836630
)
The two results are different; that difference is what prompted me to post this question.
I also checked the number of records in the table both ways, and the counts are the same.
Include the rouli.order_id > 0 condition in your Spark SQL query as well; you will then see non-null records in your Spark SQL output.
Note: LIMIT returns records in no guaranteed order, so the rows shown in the two scenarios above belong to different order_ids.
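To compare the two engines row-for-row, you could also add an ORDER BY before the LIMIT so the sample is deterministic; a quick sketch of the Spark side:
spark.sql("""
    SELECT * FROM kafka_table.risk_order_user_level_info
    WHERE month = '2019_01' AND day = '08' AND order_id > 0
    ORDER BY order_id, createtime
    LIMIT 5
""").show()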
Solved on my own.
The data in this table was written by Spark SQL, but the field names used in Scala (Spark) differ from those in Hive (the CREATE TABLE SQL),
e.g. orderID (Scala) but order_id (SQL).
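A minimal, hypothetical sketch of the kind of fix this implies: rename the DataFrame columns to the exact names from the Hive CREATE TABLE statement before the job writes the data (the mapping below is illustrative, not the real schema):
# hypothetical mapping from the Scala/DataFrame-side names to the Hive DDL names
hive_names = {"orderID": "order_id", "orderType": "order_type", "userLevel": "user_level"}

df_fixed = df
for old, new in hive_names.items():
    df_fixed = df_fixed.withColumnRenamed(old, new)

df_fixed.printSchema()  # verify the names match the Hive DDL before writing the partition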