pyspark: add rows with score 0 for missing index values in a dataframe

I have a dataframe like the one below:
+-------+-------+-------+
| name  | index | score |
+-------+-------+-------+
| name0 | 0     | 50    |
| name0 | 2     | 90    |
| name0 | 3     | 100   |
| name0 | 5     | 85    |
| name1 | 1     | 65    |
| name1 | 2     | 50    |
| name1 | 3     | 70    |
+-------+-------+-------+
The index should run from 0 to 5, so what I want to get is:
+-------+-------+-------+
| name  | index | score |
+-------+-------+-------+
| name0 | 0     | 50    |
| name0 | 1     | 0     |
| name0 | 2     | 90    |
| name0 | 3     | 100   |
| name0 | 4     | 0     |
| name0 | 5     | 85    |
| name1 | 0     | 0     |
| name1 | 1     | 65    |
| name1 | 2     | 50    |
| name1 | 3     | 70    |
| name1 | 4     | 0     |
| name1 | 5     | 0     |
+-------+-------+-------+
I want to fill the missing index positions with a score of 0, but I have no idea how. Is there any solution? Please note that I can't use pandas.
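For reference, the sample dataframe above can be built like this (a minimal sketch; the SparkSession setup is an assumption, and the literal values are copied from the tables):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# sample data copied from the question
df = spark.createDataFrame(
    [('name0', 0, 50), ('name0', 2, 90), ('name0', 3, 100), ('name0', 5, 85),
     ('name1', 1, 65), ('name1', 2, 50), ('name1', 3, 70)],
    ['name', 'index', 'score'])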

Cross join the names with a range of indices, then left join to the original dataframe using name and index, and replace nulls with 0.
spark.conf.set("spark.sql.crossJoin.enabled", True)

df2 = (df.select('name')
         .distinct()
         .join(spark.range(6).toDF('index'))
         .join(df, ['name', 'index'], 'left')
         .fillna({'score': 0})
      )
df2.show()
+-----+-----+-----+
| name|index|score|
+-----+-----+-----+
|name0| 0| 50|
|name0| 1| 0|
|name0| 2| 90|
|name0| 3| 100|
|name0| 4| 0|
|name0| 5| 85|
|name1| 0| 0|
|name1| 1| 65|
|name1| 2| 50|
|name1| 3| 70|
|name1| 4| 0|
|name1| 5| 0|
+-----+-----+-----+
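As a side note, on reasonably recent Spark versions (2.1+, I believe) you can avoid flipping the global crossJoin config flag by calling crossJoin() explicitly; a sketch of the same idea:
df2 = (df.select('name')
         .distinct()
         .crossJoin(spark.range(6).toDF('index'))   # explicit cross join, no config change needed
         .join(df, ['name', 'index'], 'left')
         .fillna({'score': 0}))
df2.show()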


Fill null values by joining dataframes with different numbers of rows on multiple columns

I've searched, and although I found similar scenarios, I didn't find exactly what I was looking for.
I have the following two dataframes:
+-------+-------+------+
| ID    | Value | type |
+-------+-------+------+
| user0 | 100   | Car  |
| user1 | 102   | Car  |
| user2 | 109   | Dog  |
| user3 | 103   | NA   |
| user4 | 110   | Dog  |
| user5 | null  | null |
| user6 | null  | null |
| user7 | null  | null |
+-------+-------+------+
+-------+--------+--------+
| ID2   | Value2 | type2  |
+-------+--------+--------+
| user5 | 115    | Cell   |
| user6 | 103    | Cell   |
| user7 | 100    | Fridge |
+-------+--------+--------+
I'd like to join those two and have this as result:
+-------+-------+--------+
| ID    | Value | type   |
+-------+-------+--------+
| user0 | 100   | Car    |
| user1 | 102   | Car    |
| user2 | 109   | Dog    |
| user3 | 103   | NA     |
| user4 | 110   | Dog    |
| user5 | 115   | Cell   |
| user6 | 103   | Cell   |
| user7 | 100   | Fridge |
+-------+-------+--------+
I have tried the following, but it doesn't return the desired result:
df_joined = df1.join(df2, (df1.id == df2.id2) &
                          (df1.value == df2.value2) &
                          (df1.type == df2.type2),
                     "left").drop('id2', 'value2', 'type2')
I only get the values from the first df. Maybe left isn't the right join type, but I can't figure out what I should use instead.
You just need to join on ID, not on the other columns, because their values don't line up (and an equality comparison against null never evaluates to true, so the rows with null Value and type could never match anyway). To combine the other columns, use coalesce, which returns the first non-null value.
import pyspark.sql.functions as F

df_joined = df1.join(df2, df1.ID == df2.ID2, 'left').select(
    'ID',
    F.coalesce(df1.Value, df2.Value2).alias('Value'),
    F.coalesce(df1.type, df2.type2).alias('type')
)
df_joined.show()
+-----+-----+------+
| ID|Value| type|
+-----+-----+------+
|user0| 100| Car|
|user1| 102| Car|
|user2| 109| Dog|
|user3| 103| NA|
|user4| 110| Dog|
|user5| 115| Cell|
|user6| 103| Cell|
|user7| 100|Fridge|
+-----+-----+------+
You can also union the two dataframes and then take the max for each ID (max ignores nulls):
from pyspark.sql import functions as F
result = df1.union(df2).groupBy("ID").agg(
    F.max("value").alias("value"),
    F.max("type").alias("type")
)
result.show()
#+-----+-----+------+
#| ID|value| type|
#+-----+-----+------+
#|user0| 100| Car|
#|user1| 102| Car|
#|user2| 109| Dog|
#|user3| 103| NA|
#|user4| 110| Dog|
#|user5| 115| Cell|
#|user6| 103| Cell|
#|user7| 100|Fridge|
#+-----+-----+------+
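If you prefer the union to be explicit about column names rather than relying on column position, a hedged variant is to rename df2's columns first (names taken from the question); since max ignores nulls, the non-null value wins for each ID:
from pyspark.sql import functions as F

result = (df1.union(df2.toDF('ID', 'Value', 'type'))   # rename so both sides share one schema
             .groupBy('ID')
             .agg(F.max('Value').alias('Value'),
                  F.max('type').alias('type')))
result.show()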

Group query in a subquery to get column values as column names

The data I have in my database:
| id | some_id | status  |
| 1  | 1       | SUCCESS |
| 2  | 2       | SUCCESS |
| 3  | 1       | SUCCESS |
| 4  | 3       | SUCCESS |
| 5  | 1       | SUCCESS |
| 6  | 4       | FAILED  |
| 7  | 1       | SUCCESS |
| 8  | 1       | FAILED  |
| 9  | 4       | FAILED  |
| 10 | 1       | FAILED  |
.......
I ran a query grouping by some_id and status to get the result below:
| some_id | count | status  |
| 1       | 20    | SUCCESS |
| 2       | 5     | SUCCESS |
| 3       | 10    | SUCCESS |
| 2       | 15    | FAILED  |
| 3       | 12    | FAILED  |
| 4       | 25    | FAILED  |
I want to use the above query as a subquery to get the result below, where the distinct status values become column names.
| some_id | SUCCESS | FAILED |
| 1       | 20      | null/0 |
| 2       | 5       | 15     |
| 3       | 10      | 12     |
| 4       | null/0  | 25     |
Any other approach to get the final result is also appreciated. Let me know if you need more info. Thanks.
You may use a pivot query here with the help of FILTER:
SELECT
    some_id,
    COUNT(*) FILTER (WHERE status = 'SUCCESS') AS SUCCESS,
    COUNT(*) FILTER (WHERE status = 'FAILED') AS FAILED
FROM yourTable
GROUP BY
    some_id;
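Since most of this thread is PySpark, here is a hedged DataFrame-API sketch of the same pivot, assuming the raw rows live in a DataFrame named df with columns some_id and status:
result = (df.groupBy('some_id')
            .pivot('status', ['SUCCESS', 'FAILED'])   # pivot the distinct status values into columns
            .count())
result.show()
Missing combinations come back as null; chain .na.fill(0) if you want zeros instead.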

How to group by with a condition in PySpark

How to group by with a condition in PySpark?
Here is some example data:
+-----+-------+-------------+------------+
| zip | state | Agegrouping | patient_id |
+-----+-------+-------------+------------+
| 123 | x | Adult | 123 |
| 124 | x | Children | 231 |
| 123 | x | Children | 456 |
| 156 | x | Adult | 453 |
| 124 | y | Adult | 34 |
| 432 | y | Adult | 23 |
| 234 | y | Children | 13 |
| 432 | z | Children | 22 |
| 234 | z | Adult | 44 |
+-----+-------+-------------+------------+
I then want to see the data as:
+-----+-------+-------+----------+------------+
| zip | state | Adult | Children | patient_id |
+-----+-------+-------+----------+------------+
| 123 | x | 1 | 1 | 2 |
| 124 | x | 1 | 1 | 2 |
| 156 | x | 1 | 0 | 1 |
| 432 | y | 1 | 1 | 2 |
| 234 | z | 1 | 1 | 2 |
+-----+-------+-------+----------+------------+
How can I do this?
Here is the Spark SQL version.
df.createOrReplaceTempView('table')
spark.sql('''
    select zip, state,
           count(if(Agegrouping = 'Adult', 1, null)) as adult,
           count(if(Agegrouping = 'Children', 1, null)) as children,
           count(1) as patient_id
    from table
    group by zip, state
''').show()
+---+-----+-----+--------+----------+
|zip|state|adult|children|patient_id|
+---+-----+-----+--------+----------+
|123| x| 1| 1| 2|
|156| x| 1| 0| 1|
|234| z| 1| 0| 1|
|432| z| 0| 1| 1|
|234| y| 0| 1| 1|
|124| y| 0| 0| 1|
|124| x| 0| 1| 1|
|432| y| 1| 0| 1|
+---+-----+-----+--------+----------+
You can use conditional aggregation:
select zip, state,
sum(case when agegrouping = 'Adult' then 1 else 0 end) as adult,
sum(case when agegrouping = 'Children' then 1 else 0 end) as children,
count(*) as num_patients
from t
group by zip, state;
Use conditional aggregation:
select
    zip,
    state,
    sum(case when agegrouping = 'Adult' then 1 else 0 end) as adult,
    sum(case when agegrouping = 'Children' then 1 else 0 end) as children,
    count(*) as patient_id
from mytable
group by zip, state
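For completeness, a hedged PySpark DataFrame-API version of the same conditional aggregation (the DataFrame and column names are assumed from the question):
from pyspark.sql import functions as F

result = (df.groupBy('zip', 'state')
            .agg(F.sum(F.when(F.col('Agegrouping') == 'Adult', 1).otherwise(0)).alias('adult'),
                 F.sum(F.when(F.col('Agegrouping') == 'Children', 1).otherwise(0)).alias('children'),
                 F.count(F.lit(1)).alias('patient_id')))
result.show()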

How to flatten a pyspark dataframe that contains multiple rows per id?

I have a pyspark dataframe with two id columns, id and id2. Each id is repeated exactly n times, and all ids have the same set of id2 values. I'm trying to "flatten" the matrix resulting from each unique id into one row, keyed by id2.
Here's an example to explain what I'm trying to achieve, my dataframe looks like this:
+----+-----+--------+--------+
| id | id2 | value1 | value2 |
+----+-----+--------+--------+
| 1 | 1 | 54 | 2 |
+----+-----+--------+--------+
| 1 | 2 | 0 | 6 |
+----+-----+--------+--------+
| 1 | 3 | 578 | 14 |
+----+-----+--------+--------+
| 2 | 1 | 10 | 1 |
+----+-----+--------+--------+
| 2 | 2 | 6 | 32 |
+----+-----+--------+--------+
| 2 | 3 | 0 | 0 |
+----+-----+--------+--------+
| 3 | 1 | 12 | 2 |
+----+-----+--------+--------+
| 3 | 2 | 20 | 5 |
+----+-----+--------+--------+
| 3 | 3 | 63 | 22 |
+----+-----+--------+--------+
The desired output is the following table:
+----+----------+----------+----------+----------+----------+----------+
| id | value1_1 | value1_2 | value1_3 | value2_1 | value2_2 | value2_3 |
+----+----------+----------+----------+----------+----------+----------+
| 1 | 54 | 0 | 578 | 2 | 6 | 14 |
+----+----------+----------+----------+----------+----------+----------+
| 2 | 10 | 6 | 0 | 1 | 32 | 0 |
+----+----------+----------+----------+----------+----------+----------+
| 3 | 12 | 20 | 63 | 2 | 5 | 22 |
+----+----------+----------+----------+----------+----------+----------+
So, basically, for each value column col I want n new columns col_1, ..., col_n, one for each of the n id2 values.
Any help would be appreciated!
In Spark 2.4 you can do it this way (shown here in Scala):
var df3 = Seq((1,1,54,2),(1,2,0,6),(1,3,578,14),(2,1,10,1),(2,2,6,32),(2,3,0,0),(3,1,12,2),(3,2,20,5),(3,3,63,22)).toDF("id","id2","value1","value2")
scala> df3.show()
+---+---+------+------+
| id|id2|value1|value2|
+---+---+------+------+
| 1| 1| 54| 2|
| 1| 2| 0| 6|
| 1| 3| 578| 14|
| 2| 1| 10| 1|
| 2| 2| 6| 32|
| 2| 3| 0| 0|
| 3| 1| 12| 2|
| 3| 2| 20| 5|
| 3| 3| 63| 22|
+---+---+------+------+
Use first (wrapped in coalesce) to retrieve the single value for each (id, id2) pair.
scala> var df4 = df3.groupBy("id").pivot("id2").agg(coalesce(first("value1")),coalesce(first("value2"))).orderBy(col("id"))
scala> val newNames = Seq("id","value1_1","value2_1","value1_2","value2_2","value1_3","value2_3")
Rename the columns:
scala> df4.toDF(newNames: _*).show()
+---+--------+--------+--------+--------+--------+--------+
| id|value1_1|value2_1|value1_2|value2_2|value1_3|value2_3|
+---+--------+--------+--------+--------+--------+--------+
| 1| 54| 2| 0| 6| 578| 14|
| 2| 10| 1| 6| 32| 0| 0|
| 3| 12| 2| 20| 5| 63| 22|
+---+--------+--------+--------+--------+--------+--------+
Rearrange the columns if needed. Let me know if you have any questions. Happy Hadoop!
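Since the question asks for PySpark, here is a hedged sketch of the same pivot with the Python DataFrame API (assuming the sample data sits in a PySpark DataFrame named df3); the generated column names come out as 1_value1, 1_value2, ..., so a rename pass like the one above may still be needed:
from pyspark.sql import functions as F

df4 = (df3.groupBy('id')
          .pivot('id2', [1, 2, 3])                 # listing the pivot values avoids an extra pass over the data
          .agg(F.first('value1').alias('value1'),
               F.first('value2').alias('value2'))
          .orderBy('id'))
df4.show()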

How to add a column to a pyspark dataframe containing the mean of one column grouped by another column

This is similar to some other questions, but not quite the same.
Let's say we have a pyspark dataframe df as below:
+-----+------+-----+
|col1 | col2 | col3|
+-----+------+-----+
|A | 5 | 6 |
+-----+------+-----+
|A | 5 | 8 |
+-----+------+-----+
|A | 6 | 3 |
+-----+------+-----+
|A | 5 | 9 |
+-----+------+-----+
|B | 9 | 6 |
+-----+------+-----+
|B | 3 | 8 |
+-----+------+-----+
|B | 9 | 8 |
+-----+------+-----+
|C | 3 | 4 |
+-----+------+-----+
|C | 5 | 1 |
+-----+------+-----+
I want to add another column, new_col, which contains the mean of col2 grouped by col1. So the answer should be as follows:
+-----+------+------+--------+
|col1 | col2 | col3 | new_col|
+-----+------+------+--------+
| A | 5 | 6 | 5.25 |
+-----+------+------+--------+
| A | 5 | 8 | 5.25 |
+-----+------+------+--------+
| A | 6 | 3 | 5.25 |
+-----+------+------+--------+
| A | 5 | 9 | 5.25 |
+-----+------+------+--------+
| B | 9 | 6 | 7 |
+-----+------+------+--------+
| B | 3 | 8 | 7 |
+-----+------+------+--------+
| B | 9 | 8 | 7 |
+-----+------+------+--------+
| C | 3 | 4 | 4 |
+-----+------+------+--------+
| C | 5 | 1 | 4 |
+-----+------+------+--------+
Any help will be appreciated.
Step 1: Creating the DataFrame.
from pyspark.sql.functions import avg, col
from pyspark.sql.window import Window
values = [('A',5,6),('A',5,8),('A',6,3),('A',5,9),('B',9,6),('B',3,8),('B',9,8),('C',3,4),('C',5,1)]
df = sqlContext.createDataFrame(values,['col1','col2','col3'])
df.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| A| 5| 6|
| A| 5| 8|
| A| 6| 3|
| A| 5| 9|
| B| 9| 6|
| B| 3| 8|
| B| 9| 8|
| C| 3| 4|
| C| 5| 1|
+----+----+----+
Step 2: Creating another column containing the mean, partitioning by col1 with a window function.
w = Window().partitionBy('col1')
df = df.withColumn('new_col',avg(col('col2')).over(w))
df.show()
+----+----+----+-------+
|col1|col2|col3|new_col|
+----+----+----+-------+
| B| 9| 6| 7.0|
| B| 3| 8| 7.0|
| B| 9| 8| 7.0|
| C| 3| 4| 4.0|
| C| 5| 1| 4.0|
| A| 5| 6| 5.25|
| A| 5| 8| 5.25|
| A| 6| 3| 5.25|
| A| 5| 9| 5.25|
+----+----+----+-------+
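The same window aggregation can also be written in Spark SQL if you prefer; a minimal sketch (the temp view name t is an assumption):
df.createOrReplaceTempView('t')
spark.sql('select *, avg(col2) over (partition by col1) as new_col from t').show()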
OK, after a lot of trying, I managed to answer the question myself. I'm posting the answer here for anyone else with a similar question. The original data comes from a csv file.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# reading the file
df = spark.read.csv("file's name.csv", header=True)
df.show()
output
+-----+------+-----+
|col1 | col2 | col3|
+-----+------+-----+
|A | 5 | 6 |
+-----+------+-----+
|A | 5 | 8 |
+-----+------+-----+
|A | 6 | 3 |
+-----+------+-----+
|A | 5 | 9 |
+-----+------+-----+
|B | 9 | 6 |
+-----+------+-----+
|B | 3 | 8 |
+-----+------+-----+
|B | 9 | 8 |
+-----+------+-----+
|C | 3 | 4 |
+-----+------+-----+
|C | 5 | 1 |
+-----+------+-----+
from pyspark.sql import functions as func
# Grouping the dataframe by col1
col1group = df.groupBy('col1')
# Computing the average of col2 for each group
a = col1group.agg(func.avg("col2"))
a.show()
output
+-----+----------+
|col1 | avg(col2)|
+-----+----------+
| A | 5.25 |
+-----+----------+
| B | 7.0 |
+-----+----------+
| C | 4.0 |
+-----+----------+
Now, we join the last table with the initial dataframe to generate our desired dataframe:
df = df.join(a, on='col1', how='inner')
df.show()
output
+-----+------+------+---------+
|col1 | col2 | col3 |avg(col2)|
+-----+------+------+---------+
| A | 5 | 6 | 5.25 |
+-----+------+------+---------+
| A | 5 | 8 | 5.25 |
+-----+------+------+---------+
| A | 6 | 3 | 5.25 |
+-----+------+------+---------+
| A | 5 | 9 | 5.25 |
+-----+------+------+---------+
| B | 9 | 6 | 7 |
+-----+------+------+---------+
| B | 3 | 8 | 7 |
+-----+------+------+---------+
| B | 9 | 8 | 7 |
+-----+------+------+---------+
| C | 3 | 4 | 4 |
+-----+------+------+---------+
| C | 5 | 1 | 4 |
+-----+------+------+---------+
Now rename the last column to whatever we want:
df = df.withColumnRenamed('avg(col2)', 'new_col')
df.show()
output
+-----+------+------+--------+
|col1 | col2 | col3 | new_col|
+-----+------+------+--------+
| A | 5 | 6 | 5.25 |
+-----+------+------+--------+
| A | 5 | 8 | 5.25 |
+-----+------+------+--------+
| A | 6 | 3 | 5.25 |
+-----+------+------+--------+
| A | 5 | 9 | 5.25 |
+-----+------+------+--------+
| B | 9 | 6 | 7 |
+-----+------+------+--------+
| B | 3 | 8 | 7 |
+-----+------+------+--------+
| B | 9 | 8 | 7 |
+-----+------+------+--------+
| C | 3 | 4 | 4 |
+-----+------+------+--------+
| C | 5 | 1 | 4 |
+-----+------+------+--------+
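The same groupBy-and-join approach can be condensed by aliasing the aggregate directly, which skips the withColumnRenamed step; a hedged sketch, starting again from the original df:
from pyspark.sql import functions as func

a = df.groupBy('col1').agg(func.avg('col2').alias('new_col'))   # alias instead of renaming afterwards
df = df.join(a, on='col1', how='inner')
df.show()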