Add aggregated columns to pivot without join - dataframe

Considering the table:
import pyspark.sql.functions as sf
df=sc.parallelize([(1,1,1),(5,0,2),(27,1,1),(1,0,3),(5,1,1),(1,0,2)]).toDF(['id', 'error', 'timestamp'])
df.show()
+---+-----+---------+
| id|error|timestamp|
+---+-----+---------+
| 1| 1| 1|
| 5| 0| 2|
| 27| 1| 1|
| 1| 0| 3|
| 5| 1| 1|
| 1| 0| 2|
+---+-----+---------+
I would like to pivot on the timestamp column while keeping some other aggregated information from the original table. The result I am interested in can be achieved by
df1=df.groupBy('id').agg(sf.sum('error').alias('Ne'),sf.count('*').alias('cnt'))
df2=df.groupBy('id').pivot('timestamp').agg(sf.count('*')).fillna(0)
df1.join(df2, on='id').filter(sf.col('cnt')>1).show()
with the resulting table:
+---+---+---+---+---+---+
| id| Ne|cnt| 1| 2| 3|
+---+---+---+---+---+---+
| 5| 1| 2| 1| 1| 0|
| 1| 1| 3| 1| 1| 1|
+---+---+---+---+---+---+
However, there are at least two issues with the mentioned solution:
I am filtering by cnt at the end of the script. If I could do this at the beginning, I could avoid almost all of the processing, because the filter removes a large portion of the data. Is there any way to do this other than the collect and isin methods?
I am doing groupBy on id twice: first to aggregate the columns I need in the results, and a second time to get the pivot columns. Finally, I need a join to merge these columns. I feel I must be missing something, because it should be possible to do this with just one groupBy and without a join, but I cannot figure out how.

I think you cannot get around the join, because the pivot needs the timestamp values while the first grouping must not consider them. To create the Ne and cnt values you have to group the dataframe only by id, which loses timestamp; if you want to preserve those values as columns, you have to do the pivot separately, as you did, and join it back.
The only improvement that can be made is to move the filter into the creation of df1. As you said, this should already improve performance, since df1 should be much smaller after filtering on your real data.
from pyspark.sql.functions import *
df=sc.parallelize([(1,1,1),(5,0,2),(27,1,1),(1,0,3),(5,1,1),(1,0,2)]).toDF(['id', 'error', 'timestamp'])
df1=df.groupBy('id').agg(sum('error').alias('Ne'),count('*').alias('cnt')).filter(col('cnt')>1)
df2=df.groupBy('id').pivot('timestamp').agg(count('*')).fillna(0)
df1.join(df2, on='id').show()
Output:
+---+---+---+---+---+---+
| id| Ne|cnt| 1| 2| 3|
+---+---+---+---+---+---+
| 5| 1| 2| 1| 1| 0|
| 1| 1| 3| 1| 1| 1|
+---+---+---+---+---+---+
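One more optional tweak, under the assumption that the filtered df1 is small for your real data: marking it with a broadcast hint lets Spark replace the shuffle join with a broadcast hash join.
from pyspark.sql.functions import broadcast
broadcast(df1).join(df2, on='id').show()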

Actually, it is possible to avoid the join entirely by using window functions:
import pyspark.sql.functions as sf
from pyspark.sql import Window

w1 = Window.partitionBy('id')
w2 = Window.partitionBy('id', 'timestamp')
df.select('id', 'timestamp',
          sf.sum('error').over(w1).alias('Ne'),
          sf.count('*').over(w1).alias('cnt'),
          sf.count('*').over(w2).alias('cnt_2')) \
  .filter(sf.col('cnt') > 1) \
  .groupBy('id', 'Ne', 'cnt').pivot('timestamp').agg(sf.first('cnt_2')).fillna(0).show()
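If the distinct timestamp values are known in advance (an assumption about your data), the whole result can even be produced with a single groupBy and conditional counts, with no pivot, window, or join. A minimal sketch:
import pyspark.sql.functions as sf

timestamps = [1, 2, 3]  # assumed to be known up front
df.groupBy('id').agg(
    sf.sum('error').alias('Ne'),
    sf.count('*').alias('cnt'),
    *[sf.count(sf.when(sf.col('timestamp') == t, True)).alias(str(t)) for t in timestamps]
).filter(sf.col('cnt') > 1).show()
Along the same lines, passing the known values to pivot('timestamp', timestamps) in the original approach spares Spark an extra job to discover the distinct pivot values.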

Related

Number of foods that scored "true" in being good, grouped by culture SQL

Okay, so I've been driving myself crazy trying to get this to display in SQL. I have a table that stores types of food, the culture they come from, a score, and a boolean value for whether or not they are good. I want to display a count of how many "goods" each culture racks up. The table is animals_db.foods (don't ask about the database name).
So I've tried:
SELECT count(good = 1), culture FROM animals_db.foods group by culture;
Or
SELECT count(good = true), culture FROM animals_db.foods group by culture;
But neither presents the correct results; it seems to count every row that has any "good" value (1 or 0) at all.
How do I get the data I want?
Instead of count, use sum:
SELECT sum(good), culture FROM animals_db.foods group by culture; -- assumes the good column has an integer type where good is represented as 1, otherwise 0
Or, the other way is using count:
select count( case when good=1 then 1 end) , culture from animals_db.foods group by culture;
If the purpose is to count the number of rows with good=1 for each culture, this works:
select culture,
count(*)
from foods
where good=1
group by 1
order by 1;
Result:
culture |count(*)|
--------+--------+
        |       1|
American|       1|
Chinese |       1|
European|       1|
Italian |       2|
The reason your first query doesn't return the expected result can be explained as follows:
select culture,
good=1 as is_good
from foods
order by 1;
You actually get:
culture |is_good|
--------+-------+
        |      1|
American|      0|
American|      1|
Chinese |      1|
European|      1|
French  |      0|
French  |      0|
German  |      0|
Italian |      1|
Italian |      1|
After applying group by culture and count(good=1), you're actually counting the number of NOT NULL values of the expression good=1. For example:
select culture,
count(good=0) as c0,
count(good=1) as c1,
count(good=2) as c2,
count(good) as c3,
count(null) as c4
from foods
group by culture
order by culture;
Outcome:
culture |c0|c1|c2|c3|c4|
--------+--+--+--+--+--+
        | 1| 1| 1| 1| 0|
American| 2| 2| 2| 2| 0|
Chinese | 1| 1| 1| 1| 0|
European| 1| 1| 1| 1| 0|
French  | 2| 2| 2| 2| 0|
German  | 1| 1| 1| 1| 0|
Italian | 2| 2| 2| 2| 0|
Update: this is similar to your question: Is it possible to specify a condition in Count()?
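For what it's worth, the same count-only-the-matching-rows pattern translates directly to PySpark; a minimal sketch, assuming a hypothetical foods DataFrame with the same culture and good columns:
import pyspark.sql.functions as F

# count(when(cond, ...)) counts only rows where the condition holds,
# mirroring count(case when good=1 then 1 end) in SQL
foods.groupBy("culture").agg(
    F.count(F.when(F.col("good") == 1, 1)).alias("good_count")
).show()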

Create new columns in Spark dataframe based on a hundred column pairs

I am trying to create around 9-10 columns based on the values found in 100 columns (schm0, schm1, ..., schm100); the values of these new columns come from the corresponding columns (idsm0, idsm1, ..., idsm100) in the same dataframe.
There are additional columns as well, apart from these 100 column pairs.
The problem is that not all the scheme columns (schm0, schm1, ..., schm100) have values; we have to traverse each one to find the values and create the columns accordingly. 85+ of the columns are empty most of the time, so we need to ignore them.
Input dataframe example:
+----+----+----+----+-----+-----+-----+-----+-----+-----+-----+
|col1|col2|col3|sch0|idsm0|schm1|idsm1|schm2|idsm2|schm3|idsm3|
+----+----+----+----+-----+-----+-----+-----+-----+-----+-----+
| a| b| c| 0| 1| 2| 3| 4| 5| null| null|
+----+----+----+----+-----+-----+-----+-----+-----+-----+-----+
schm and idsm can go up to 100, so it's basically 100 key-value column pairs.
Expected output:
+----+----+----+----------+-------+-------+
|col1|col2|col3|found_zero|found_2|found_4|
+----+----+----+----------+-------+-------+
| a| b| c| 1| 3| 5|
+----+----+----+----------+-------+-------+
Note: there is no fixed value in any column; any column can have any value. The columns we create have to be based on the values found in any of the scheme columns (schm0...schm100), and the values in the created columns would be the corresponding idsymbol values (idsm0...idsm100).
I am finding it difficult to formulate a plan to do this; any help would be greatly appreciated.
Edited-
Adding another input example--
+----+----+------+------+------+------+------+------+------+------+------+------+------+------+
|col1|col2|schm_0|idsm_0|schm_1|idsm_1|schm_2|idsm_2|schm_3|idsm_3|schm_4|idsm_4|schm_5|idsm_5|
+----+----+------+------+------+------+------+------+------+------+------+------+------+------+
| 2| 6| b1| id1| i| id2| xs| id3| ch| id4| null| null| null| null|
| 3| 5| b2| id5| x2| id6| ch| id7| be| id8| null| null| db| id15|
| 4| 7| b1| id9| ch| id10| xs| id11| us| id12| null| null| null| null|
+----+----+------+------+------+------+------+------+------+------+------+------+------+------+
For one particular record, the columns schm_0, schm_1, ..., schm_100 can hold around 9 to 10 unique values, as not all of them are populated.
We need to create 9 different columns based on those 9 unique values. In short, for each row we need to iterate over each of the 100 scheme columns and collect the values found there; based on the found values, separate columns need to be created, and the values in those created columns would be the values in the matching idsm columns (idsm_0, idsm_1, ..., idsm_100).
I.e., if schm_0 has the value 'cb', we need to create a new column, e.g. 'col_cb', and the value in 'col_cb' would be the value in the 'idsm_0' column.
We need to do the same for all 100 columns (leaving out the empty ones).
Expected output-
+----+----+------+------+-----+------+------+------+------+------+------+
|col1|col2|col_b1|col_b2|col_i|col_x2|col_ch|col_xs|col_be|col_us|col_db|
+----+----+------+------+-----+------+------+------+------+------+------+
| 2| 6| id1| null| id2| null| id4| id3| null| null| null|
| 3| 5| null| id5| null| id6| id7| null| id8| null| id15|
| 4| 7| id9| null| null| null| id10| id11| null| id12| null|
+----+----+------+------+-----+------+------+------+------+------+------+
Hope this clarifies the problem statement.
Any help on this would be highly appreciated.
Editing again for a small issue:
As we see in the above example, the columns that we create are based on the values found in the scheme-symbol columns, and there is already a defined set of 10 columns that should get created, e.g. (col_a, col_b, col_c, col_d, col_e, col_f, col_g, col_h, col_i, col_j).
Not all 10 keywords (a, b, c, ..., j) will be present at all times in the dataset under (schm_0 ... schm_99).
The requirement is that all 10 columns are produced; if some of the keys (a, b, c, ..., j) are not present, the created column should contain null values.
You can get the required output you are expecting, but it will be a multi-step process.
First, you have to create two separate dataframes out of the original dataframe: one that contains the schm columns and another that contains the idsm columns. You will have to unpivot the schm columns and the idsm columns.
Then you join both dataframes on the unique combination of columns and filter out the null values. Finally, you group by the unique columns, pivot on the schm values, and take the first value of the idsm column.
//Sample Data
import org.apache.spark.sql.functions._
val initialdf = Seq((2,6,"b1","id1","i","id2","xs","id3","ch","id4",null,null,null,null),(3,5,"b2","id5","x2","id6","ch","id7","be","id8",null,null,"db","id15"),(4,7,"b1","id9","ch","id10","xs","id11","us","id12","es","id00",null,null)).toDF("col1","col2","schm_0","idsm_0","schm_1","idsm_1","schm_2","idsm_2","schm_3","idsm_3","schm_4","idsm_4","schm_5","idsm_5")
//creating two separate dataframes
val schmdf = initialdf.selectExpr("col1","col2", "stack(6, 'schm_0',schm_0, 'schm_1',schm_1,'schm_2',schm_2,'schm_3' ,schm_3, 'schm_4',schm_4,'schm_5',schm_5) as (schm,schm_value)").withColumn("id",split($"schm", "_")(1))
val idsmdf = initialdf.selectExpr("col1","col2", "stack(6, 'idsm_0',idsm_0, 'idsm_1',idsm_1,'idsm_2',idsm_2,'idsm_3' ,idsm_3, 'idsm_4',idsm_4,'idsm_5',idsm_5) as (idsm,idsm_value)").withColumn("id",split($"idsm", "_")(1))
//joining two dataframes and applying filter operation and giving alias for the column names to be used in next operation
val df = schmdf.join(idsmdf,Seq("col1","col2","id"),"inner").filter($"idsm_value" =!= "null").select("col1","col2","schm","schm_value","idsm","idsm_value").withColumn("schm_value", concat(lit("col_"),$"schm_value"))
df.groupBy("col1","col2").pivot("schm_value").agg(first("idsm_value")).show
You can see the output below:
+----+----+------+------+------+------+------+------+-----+------+------+------+
|col1|col2|col_b1|col_b2|col_be|col_ch|col_db|col_es|col_i|col_us|col_x2|col_xs|
+----+----+------+------+------+------+------+------+-----+------+------+------+
| 2| 6| id1| null| null| id4| null| null| id2| null| null| id3|
| 3| 5| null| id5| id8| id7| id15| null| null| null| id6| null|
| 4| 7| id9| null| null| id10| null| id00| null| id12| null| id11|
+----+----+------+------+------+------+------+------+-----+------+------+------+
Updated Answer using Map:
If you have n columns and you know them in advance, you can use the approach below, which is more generic than the one above.
//sample Data
val initialdf = Seq((2,6,"b1","id1","i","id2","xs","id3","ch","id4",null,null,null,null),(3,5,"b2","id5","x2","id6","ch","id7","be","id8",null,null,"db","id15"),(4,7,"b1","id9","ch","id10","xs","id11","us","id12","es","id00",null,null)).toDF("col1","col2","schm_0","idsm_0","schm_1","idsm_1","schm_2","idsm_2","schm_3","idsm_3","schm_4","idsm_4","schm_5","idsm_5")
import org.apache.spark.sql.functions._
val schmcols = Seq("schm_0", "schm_1", "schm_2","schm_3","schm_4","schm_5")
val schmdf = initialdf.select($"col1",$"col2", explode(array(
schmcols.map(column =>
struct(
lit(column).alias("schm"),
col(column).alias("schm_value")
)): _*
)).alias("schmColumn"))
.withColumn("id",split($"schmColumn.schm", "_")(1))
.withColumn("schm",$"schmColumn.schm")
.withColumn("schm_value",$"schmColumn.schm_value").drop("schmColumn")
val idcols = Seq("idsm_0", "idsm_1", "idsm_2","idsm_3","idsm_4","idsm_5")
val idsmdf = initialdf.select($"col1",$"col2", explode(array(
idcols.map(
column =>
struct(
lit(column).alias("idsm"),
col(column).alias("idsm_value")
)): _*
)).alias("idsmColumn"))
.withColumn("id",split($"idsmColumn.idsm", "_")(1))
.withColumn("idsm",$"idsmColumn.idsm")
.withColumn("idsm_value",$"idsmColumn.idsm_value").drop("idsmColumn")
val df = schmdf.join(idsmdf,Seq("col1","col2","id"),"inner")
.filter($"idsm_value" =!= "null")
.select("col1","col2","schm","schm_value","idsm","idsm_value")
.withColumn("schm_value", concat(lit("col_"),$"schm_value"))
df.groupBy("col1","col2").pivot("schm_value")
.agg(first("idsm_value")).show

Filtering rows in pyspark dataframe and creating a new column that contains the result

so I am trying to identify the crimes that happen within the SF downtown boundary on Sundays. My idea was to first write a UDF to label whether each crime is in the area I identify as downtown: if it happened within the area it gets the label "1", and "0" otherwise. After that, I am trying to create a new column to store those results. I tried my best to write everything I can, but it just doesn't work for some reason. Here is the code I wrote:
from pyspark.sql.types import StructType, StructField, BooleanType
from pyspark.sql.functions import udf

def filter_dt(x, y):
    if (((x < -122.4213) & (x > -122.4313)) & ((y > 37.7540) & (y < 37.7740))):
        return '1'
    else:
        return '0'

schema = StructType([StructField("isDT", BooleanType(), False)])
filter_dt_boolean = udf(lambda row: filter_dt(row[0], row[1]), schema)
#First, pick out the crime cases that happen on Sunday
q3_sunday = spark.sql("SELECT * FROM sf_crime WHERE DayOfWeek='Sunday'")
#Then, we add a new column to identify whether the crime is in DT
q3_final = q3_result.withColumn("isDT", filter_dt(q3_sunday.select('X'),q3_sunday.select('Y')))
The error I am getting is shown in the attached screenshot.
My guess is that the udf I have right now doesn't accept whole columns as input for the comparison, but I don't know how to fix it to make it work. Please help! Thank you!
Sample data would have helped. For now, I assume that your data looks like this:
+----+---+---+
|val1| x| y|
+----+---+---+
| 10| 7| 14|
| 5| 1| 4|
| 9| 8| 10|
| 2| 6| 90|
| 7| 2| 30|
| 3| 5| 11|
+----+---+---+
Then you don't need a udf, since you can do the evaluation using the when() function:
import pyspark.sql.functions as F
tst= sqlContext.createDataFrame([(10,7,14),(5,1,4),(9,8,10),(2,6,90),(7,2,30),(3,5,11)],schema=['val1','x','y'])
tst_res = tst.withColumn("isdt", F.when((tst.x.between(4,10)) & (tst.y.between(11,20)), 1).otherwise(0))
This will give the result:
tst_res.show()
+----+---+---+----+
|val1| x| y|isdt|
+----+---+---+----+
| 10| 7| 14| 1|
| 5| 1| 4| 0|
| 9| 8| 10| 0|
| 2| 6| 90| 0|
| 7| 2| 30| 0|
| 3| 5| 11| 1|
+----+---+---+----+
If I have got the data wrong and you still need to pass multiple values to a udf, you have to pass them as an array or a struct. I prefer a struct:
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

@udf(IntegerType())
def check_data(row):
    # between(4,10) and between(11,20) are inclusive, hence range(4,11) and range(11,21)
    if (row.x in range(4, 11)) and (row.y in range(11, 21)):
        return 1
    else:
        return 0

tst_res1 = tst.withColumn("isdt", check_data(F.struct('x', 'y')))
The result will be the same. But it is always better to avoid UDFs and use Spark's built-in functions, since the Catalyst optimizer cannot see the logic inside a udf and therefore cannot optimize it.
Try changing the last line as below:
from pyspark.sql.functions import col
q3_final = q3_result.withColumn("isDT", filter_dt(col('X'),col('Y')))
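For completeness, a minimal sketch of the corrected UDF route, assuming q3_sunday is the Sunday-only DataFrame from the question and that its X and Y columns are numeric:
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

@udf(StringType())
def filter_dt(x, y):
    # label '1' if the point falls inside the downtown bounding box, '0' otherwise
    if x is not None and y is not None and \
            -122.4313 < x < -122.4213 and 37.7540 < y < 37.7740:
        return '1'
    return '0'

q3_final = q3_sunday.withColumn("isDT", filter_dt(col("X"), col("Y")))
The key differences from the original attempt are that the function is registered as a udf (so it accepts Column arguments) and that whole columns are passed with col('X') and col('Y') rather than q3_sunday.select(...).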

Pyspark : how to code complicated dataframe calculation lead sum

I have been given a dataframe that looks like this.
This dataframe is sorted by date, and col1 is just some random value.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
TEST_schema = StructType([StructField("date", StringType(), True),
                          StructField("col1", IntegerType(), True)])
TEST_data = [('2020-08-01',3),('2020-08-02',1),('2020-08-03',-1),('2020-08-04',-1),('2020-08-05',3),\
('2020-08-06',-1),('2020-08-07',6),('2020-08-08',4),('2020-08-09',5)]
rdd3 = sc.parallelize(TEST_data)
TEST_df = sqlContext.createDataFrame(TEST_data, TEST_schema)
TEST_df.show()
+----------+----+
| date|col1|
+----------+----+
|2020-08-01| 3|
|2020-08-02| 1|
|2020-08-03| -1|
|2020-08-04| -1|
|2020-08-05| 3|
|2020-08-06| -1|
|2020-08-07| 6|
|2020-08-08| 4|
|2020-08-09| 5|
+----------+----+
LOGIC : lead(col1) +1, if col1 ==-1, then from the previous value lead(col1) +2...
The resulting dataframe will look like this (the WANT column is what I want as output):
+----------+----+----+
| date|col1|WANT|
+----------+----+----+
|2020-08-01| 3| 2|
|2020-08-02| 1| 6|
|2020-08-03| -1| 5|
|2020-08-04| -1| 4|
|2020-08-05| 3| 8|
|2020-08-06| -1| 7|
|2020-08-07| 6| 5|
|2020-08-08| 4| 6|
|2020-08-09| 5| -1|
+----------+----+----+
Look at the last row, where col1 == 5: that 5 is picked up by lead and incremented by 1, which appears as want == 6 on 2020-08-08.
If we have col1 == -1, then we add 1 more; if col1 == -1 is repeated twice, then we add 2 more, and so on.
This is hard to explain in words. Lastly, since the last row has no lead value and would be null, it is replaced with -1.
You can check whether the following code and logic work for you:
Create a sub-group label g that takes a running sum of int(col1 != -1); we only care about rows with col1 == -1, so all other rows are nullified.
The residual is 1; if col1 == -1, add the running count over window w2.
Take prev_col1 over w1 as the last value that is not -1 (using nullif); the name prev_col1 might be confusing, since it forward-fills (the typical PySpark way to do ffill) only where col1 == -1 and otherwise keeps the original value.
Set val = prev_col1 + residual, take its lag, and replace the resulting null with -1.
Code below:
from pyspark.sql.functions import when, col, expr, count, desc, lag, coalesce, lit
from pyspark.sql import Window
w1 = Window.orderBy(desc('date'))
w2 = Window.partitionBy('g').orderBy(desc('date'))
TEST_df.withColumn('g', when(col('col1') == -1, expr("sum(int(col1!=-1))").over(w1))) \
.withColumn('residual', when(col('col1') == -1, count('*').over(w2) + 1).otherwise(1)) \
.withColumn('prev_col1',expr("last(nullif(col1,-1),True)").over(w1)) \
.withColumn('want', coalesce(lag(expr("prev_col1 + residual")).over(w1),lit(-1))) \
.orderBy('date').show()
+----------+----+----+--------+---------+----+
| date|col1| g|residual|prev_col1|want|
+----------+----+----+--------+---------+----+
|2020-08-01| 3|null| 1| 3| 2|
|2020-08-02| 1|null| 1| 1| 6|
|2020-08-03| -1| 4| 3| 3| 5|
|2020-08-04| -1| 4| 2| 3| 4|
|2020-08-05| 3|null| 1| 3| 8|
|2020-08-06| -1| 3| 2| 6| 7|
|2020-08-07| 6|null| 1| 6| 5|
|2020-08-08| 4|null| 1| 4| 6|
|2020-08-09| 5|null| 1| 5| -1|
+----------+----+----+--------+---------+----+

Spark SQL: Is there a way to distinguish columns with same name?

I have a CSV with a header that contains columns with the same name.
I want to process it with Spark using only SQL and be able to refer to these columns unambiguously.
Ex.:
id  name    age  height  name
1   Alex    23   1.70
2   Joseph  24   1.89
I want to get only the first name column, using only Spark SQL.
As mentioned in the comments, I think the less error-prone method would be to change the schema of the input data.
Still, in case you are looking for a quick workaround, you can simply index the duplicated column names.
For instance, let's create a dataframe with three id columns.
val df = spark.range(3)
.select('id * 2 as "id", 'id * 3 as "x", 'id, 'id * 4 as "y", 'id)
df.show
+---+---+---+---+---+
| id| x| id| y| id|
+---+---+---+---+---+
| 0| 0| 0| 0| 0|
| 2| 3| 1| 4| 1|
| 4| 6| 2| 8| 2|
+---+---+---+---+---+
Then I can use toDF to set new column names. Let's assume I know that only id is duplicated. If we don't know, adding the extra logic to figure out which columns are duplicated would not be very difficult.
var i = -1
val names = df.columns.map( n =>
if(n == "id") {
i+=1
s"id_$i"
} else n )
val new_df = df.toDF(names : _*)
new_df.show
+----+---+----+---+----+
|id_0| x|id_1| y|id_2|
+----+---+----+---+----+
| 0| 0| 0| 0| 0|
| 2| 3| 1| 4| 1|
| 4| 6| 2| 8| 2|
+----+---+----+---+----+