Okay, so I've been driving myself crazy trying to get this to display in SQL. I have a table that stores types of food, the culture they come from, a score, and a boolean value indicating whether or not they are good. I want to display a count of how many "goods" each culture racks up. Here's the table (don't ask about the database name):
So I've tried:
SELECT count(good = 1), culture FROM animals_db.foods group by culture;
Or
SELECT count(good = true), culture FROM animals_db.foods group by culture;
But neither presents the correct results; each seems to count every row that has any "good" value (1 or 0) at all.
How do I get the data I want?
Instead of count, use sum:
SELECT sum(good), culture FROM animals_db.foods group by culture; -- assumes the good column has an integer data type, with good represented as 1 and 0 otherwise
The other way is to use count with a CASE expression:
select count(case when good=1 then 1 end), culture from animals_db.foods group by culture;
If the purpose is to count the number of rows with good=1 for each culture, this works:
select culture,
count(*)
from foods
where good=1
group by 1
order by 1;
Result:
culture |count(*)|
--------+--------+
| 1|
American| 1|
Chinese | 1|
European| 1|
Italian | 2|
The reason your first query doesn't return the expected result can be explained as follows:
select culture,
good=1 as is_good
from foods
order by 1;
You actually get:
culture |is_good|
--------+-------+
| 1|
American| 0|
American| 1|
Chinese | 1|
European| 1|
French | 0|
French | 0|
German | 0|
Italian | 1|
Italian | 1|
After applying group by culture and count(good=1), you're actually counting the number of NOT NULL values of the expression good=1. For example:
select culture,
count(good=0) as c0,
count(good=1) as c1,
count(good=2) as c2,
count(good) as c3,
count(null) as c4
from foods
group by culture
order by culture;
Outcome:
culture |c0|c1|c2|c3|c4|
--------+--+--+--+--+--+
| 1| 1| 1| 1| 0|
American| 2| 2| 2| 2| 0|
Chinese | 1| 1| 1| 1| 0|
European| 1| 1| 1| 1| 0|
French | 2| 2| 2| 2| 0|
German | 1| 1| 1| 1| 0|
Italian | 2| 2| 2| 2| 0|
Update: This is similar to your question: Is it possible to specify a condition in Count()?
I am trying to create around 9-10 columns based on the values in 100 columns (schm0, schm1, ..., schm100); the values of these new columns would come from the corresponding columns (idsm0, idsm1, ..., idsm100), which are part of the same dataframe.
There are additional columns as well, apart from these two sets of 100.
The problem is that not all of the scheme columns (schm0, schm1, ..., schm100) will have values in them, and we have to traverse each one to find the values and create the columns accordingly; 85+ of them will be empty most of the time, so we need to ignore them.
Input dataframe example:
+----+----+----+----+-----+-----+-----+-----+-----+-----+-----+
|col1|col2|col3|sch0|idsm0|schm1|idsm1|schm2|idsm2|schm3|idsm3|
+----+----+----+----+-----+-----+-----+-----+-----+-----+-----+
| a| b| c| 0| 1| 2| 3| 4| 5| null| null|
+----+----+----+----+-----+-----+-----+-----+-----+-----+-----+
schm and idsm can go up to 100, so it's basically 100 key-value pairs of columns.
Expected output:
+----+----+----+----------+-------+-------+
|col1|col2|col3|found_zero|found_2|found_4|
+----+----+----+----------+-------+-------+
| a| b| c| 1| 3| 5|
+----+----+----+----------+-------+-------+
Note: There is no fixed value in any column; any column can have any value. The columns that we create have to be based on the values found in any of the scheme columns (schm0...schm100), and the values in the created columns would be the corresponding idsymbol values (idsm0...idsm100).
I am finding it difficult to formulate a plan to do this; any help would be greatly appreciated.
Edited:
Adding another input example:
+----+----+------+------+------+------+------+------+------+------+------+------+------+------+
|col1|col2|schm_0|idsm_0|schm_1|idsm_1|schm_2|idsm_2|schm_3|idsm_3|schm_4|idsm_4|schm_5|idsm_5|
+----+----+------+------+------+------+------+------+------+------+------+------+------+------+
| 2| 6| b1| id1| i| id2| xs| id3| ch| id4| null| null| null| null|
| 3| 5| b2| id5| x2| id6| ch| id7| be| id8| null| null| db| id15|
| 4| 7| b1| id9| ch| id10| xs| id11| us| id12| null| null| null| null|
+----+----+------+------+------+------+------+------+------+------+------+------+------+------+
For one particular record, the columns (schm_0, schm_1, ..., schm_100) can have around 9 to 10 unique values, as not all the columns would be populated with values.
We need to create 9 different columns based on those 9 unique values. In short, for one row we need to iterate over each of the 100 scheme columns, collect all the values found there, and create a separate column for each found value; the values in those created columns would be the values of the corresponding idsm columns (idsm_0, idsm_1, ..., idsm_100).
I.e., if schm_0 has the value 'cb', we need to create a new column, e.g. 'col_cb', and the value in this column 'col_cb' would be the value of the 'idsm_0' column.
We need to do the same for all 100 columns (leaving out the empty ones).
Expected output:
+----+----+------+------+-----+------+------+------+------+------+------+
|col1|col2|col_b1|col_b2|col_i|col_x2|col_ch|col_xs|col_be|col_us|col_db|
+----+----+------+------+-----+------+------+------+------+------+------+
| 2| 6| id1| null| id2| null| id4| id3| null| null| null|
| 3| 5| null| id5| null| id6| id7| null| id8| null| id15|
| 4| 7| id9| null| null| null| id10| id11| null| id12| null|
+----+----+------+------+-----+------+------+------+------+------+------+
Hope this clears up the problem statement.
Any help on this would be highly appreciated.
Editing again for a small issue:
As we see in the above example, the columns that we create are based on the values found in the scheme-symbol columns, and there is already a defined set of columns that will get created, which is 10 in number; the columns would be, for example, (col_a, col_b, col_c, col_d, col_e, col_f, col_g, col_h, col_i, col_j).
Not all 10 keywords, i.e. (a, b, c, ..., j), would be present at all times in the dataset under (schm_0.....schm_99).
The requirement is that we output all 10 columns; if some of the keys (a, b, c, ..., j) are not present, the created column would just hold null values.
You can get the required output that you are expecting, but it is a multi-step process.
First you have to create two separate dataframes out of the original dataframe, i.e. one which contains the schm columns and another which contains the idsm columns, unpivoting the schm and idsm columns in each.
Then you join both dataframes on the unique combination of columns and filter out the rows with null values. Finally, you group by the unique columns, pivot on the schm values, and take the first value of the idsm column.
//Sample Data
import org.apache.spark.sql.functions._
val initialdf = Seq((2,6,"b1","id1","i","id2","xs","id3","ch","id4",null,null,null,null),(3,5,"b2","id5","x2","id6","ch","id7","be","id8",null,null,"db","id15"),(4,7,"b1","id9","ch","id10","xs","id11","us","id12","es","id00",null,null)).toDF("col1","col2","schm_0","idsm_0","schm_1","idsm_1","schm_2","idsm_2","schm_3","idsm_3","schm_4","idsm_4","schm_5","idsm_5")
//creating two separate dataframes
val schmdf = initialdf.selectExpr("col1","col2", "stack(6, 'schm_0',schm_0, 'schm_1',schm_1,'schm_2',schm_2,'schm_3' ,schm_3, 'schm_4',schm_4,'schm_5',schm_5) as (schm,schm_value)").withColumn("id",split($"schm", "_")(1))
val idsmdf = initialdf.selectExpr("col1","col2", "stack(6, 'idsm_0',idsm_0, 'idsm_1',idsm_1,'idsm_2',idsm_2,'idsm_3' ,idsm_3, 'idsm_4',idsm_4,'idsm_5',idsm_5) as (idsm,idsm_value)").withColumn("id",split($"idsm", "_")(1))
//joining two dataframes and applying filter operation and giving alias for the column names to be used in next operation
val df = schmdf.join(idsmdf,Seq("col1","col2","id"),"inner").filter($"idsm_value" =!= "null").select("col1","col2","schm","schm_value","idsm","idsm_value").withColumn("schm_value", concat(lit("col_"),$"schm_value"))
df.groupBy("col1","col2").pivot("schm_value").agg(first("idsm_value")).show
You can see the output below:
+----+----+------+------+------+------+------+------+-----+------+------+------+
|col1|col2|col_b1|col_b2|col_be|col_ch|col_db|col_es|col_i|col_us|col_x2|col_xs|
+----+----+------+------+------+------+------+------+-----+------+------+------+
| 2| 6| id1| null| null| id4| null| null| id2| null| null| id3|
| 3| 5| null| id5| id8| id7| id15| null| null| null| id6| null|
| 4| 7| id9| null| null| id10| null| id00| null| id12| null| id11|
+----+----+------+------+------+------+------+------+-----+------+------+------+
Updated Answer using Map:
If you have n columns and you know them in advance, you can use the below approach, which is more generic than the one above.
//sample Data
val initialdf = Seq((2,6,"b1","id1","i","id2","xs","id3","ch","id4",null,null,null,null),(3,5,"b2","id5","x2","id6","ch","id7","be","id8",null,null,"db","id15"),(4,7,"b1","id9","ch","id10","xs","id11","us","id12","es","id00",null,null)).toDF("col1","col2","schm_0","idsm_0","schm_1","idsm_1","schm_2","idsm_2","schm_3","idsm_3","schm_4","idsm_4","schm_5","idsm_5")
import org.apache.spark.sql.functions._
val schmcols = Seq("schm_0", "schm_1", "schm_2","schm_3","schm_4","schm_5")
val schmdf = initialdf.select($"col1",$"col2", explode(array(
schmcols.map(column =>
struct(
lit(column).alias("schm"),
col(column).alias("schm_value")
)): _*
)).alias("schmColumn"))
.withColumn("id",split($"schmColumn.schm", "_")(1))
.withColumn("schm",$"schmColumn.schm")
.withColumn("schm_value",$"schmColumn.schm_value").drop("schmColumn")
val idcols = Seq("idsm_0", "idsm_1", "idsm_2","idsm_3","idsm_4","idsm_5")
val idsmdf = initialdf.select($"col1",$"col2", explode(array(
idcols.map(
column =>
struct(
lit(column).alias("idsm"),
col(column).alias("idsm_value")
)): _*
)).alias("idsmColumn"))
.withColumn("id",split($"idsmColumn.idsm", "_")(1))
.withColumn("idsm",$"idsmColumn.idsm")
.withColumn("idsm_value",$"idsmColumn.idsm_value").drop("idsmColumn")
val df = schmdf.join(idsmdf,Seq("col1","col2","id"),"inner")
.filter($"idsm_value" =!= "null")
.select("col1","col2","schm","schm_value","idsm","idsm_value")
.withColumn("schm_value", concat(lit("col_"),$"schm_value"))
df.groupBy("col1","col2").pivot("schm_value")
.agg(first("idsm_value")).show
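If you also need the fixed set of ten output columns from the question's edit (col_a ... col_j) regardless of which keys actually appear in the data, one option is to pass an explicit value list to pivot, which makes Spark emit a column for every listed key and fill it with null where the key is absent. The snippet below is only a sketch, written in PySpark (the Scala API has the same pivot(pivotColumn, values) overload); it assumes a dataframe df built the same way as the df above, i.e. unpivoted, joined, filtered, and with schm_value already prefixed with col_, and fixed_keys is a hypothetical stand-in for your real ten keys.
from pyspark.sql.functions import first

# Hypothetical fixed key list from the question's edit; replace with the real ten keys.
fixed_keys = ["col_a", "col_b", "col_c", "col_d", "col_e",
              "col_f", "col_g", "col_h", "col_i", "col_j"]

# Passing the value list to pivot() gives a stable output schema:
# every listed key becomes a column, null-filled for rows where it is absent.
out = (df.groupBy("col1", "col2")
         .pivot("schm_value", fixed_keys)
         .agg(first("idsm_value")))
out.show()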
So I am trying to identify the crimes that happen within the SF downtown boundary on Sundays. My idea was to first write a UDF that labels whether each crime is in the area I identify as the downtown area: if it happened within the area it gets a label of "1", and "0" if not. After that, I am trying to create a new column to store those results. I tried my best to write everything I can, but it just doesn't work for some reason. Here is the code I wrote:
from pyspark.sql.types import BooleanType
from pyspark.sql.functions import udf
def filter_dt(x,y):
    if (((x < -122.4213) & (x > -122.4313)) & ((y > 37.7540) & (y < 37.7740))):
        return '1'
    else:
        return '0'
schema = StructType([StructField("isDT", BooleanType(), False)])
filter_dt_boolean = udf(lambda row: filter_dt(row[0], row[1]), schema)
#First, pick out the crime cases that happens on Sunday BooleanType()
q3_sunday = spark.sql("SELECT * FROM sf_crime WHERE DayOfWeek='Sunday'")
#Then, we add a new column for us to filter out(identify) if the crime is in DT
q3_final = q3_result.withColumn("isDT", filter_dt(q3_sunday.select('X'),q3_sunday.select('Y')))
The error I am getting is: [picture of the error message]
My guess is that the UDF I have right now doesn't support whole columns as input to be compared, but I don't know how to fix it to make it work. Please help! Thank you!
Sample data would have helped. For now I assume that your data looks like this:
+----+---+---+
|val1| x| y|
+----+---+---+
| 10| 7| 14|
| 5| 1| 4|
| 9| 8| 10|
| 2| 6| 90|
| 7| 2| 30|
| 3| 5| 11|
+----+---+---+
Then you don't need a UDF, as you can do the evaluation using the when() function:
import pyspark.sql.functions as F
tst= sqlContext.createDataFrame([(10,7,14),(5,1,4),(9,8,10),(2,6,90),(7,2,30),(3,5,11)],schema=['val1','x','y'])
tst_res = tst.withColumn("isdt",F.when(((tst.x.between(4,10))&(tst.y.between(11,20))),1).otherwise(0))
This will give the result:
tst_res.show()
+----+---+---+----+
|val1| x| y|isdt|
+----+---+---+----+
| 10| 7| 14| 1|
| 5| 1| 4| 0|
| 9| 8| 10| 0|
| 2| 6| 90| 0|
| 7| 2| 30| 0|
| 3| 5| 11| 1|
+----+---+---+----+
If I have got the data wrong and you still need to pass multiple values to a UDF, you have to pass them as an array or a struct. I prefer a struct:
from pyspark.sql.functions import udf
from pyspark.sql.types import *
@udf(IntegerType())
def check_data(row):
    if((row.x in range(4,5))&(row.y in range(1,20))):
        return(1)
    else:
        return(0)
tst_res1 = tst.withColumn("isdt",check_data(F.struct('x','y')))
The result will be the same. But it is always better to avoid UDFs and use Spark's built-in functions, since the Spark Catalyst optimizer cannot understand the logic inside a UDF and therefore cannot optimize it.
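If you want to see that difference for yourself, a quick check (just a sketch, assuming the tst dataframe and the check_data UDF defined above) is to compare the physical plans of the two versions with explain():
import pyspark.sql.functions as F

# Built-in version: the comparison logic is visible to Catalyst and appears
# directly in the physical plan, where it can be optimized.
tst.withColumn("isdt",
               F.when((tst.x.between(4, 10)) & (tst.y.between(11, 20)), 1)
                .otherwise(0)).explain()

# UDF version: the plan shows an opaque BatchEvalPython step for the
# Python UDF, so Catalyst cannot see or simplify the comparison.
tst.withColumn("isdt", check_data(F.struct("x", "y"))).explain()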
Try changing the last line as below:
from pyspark.sql.functions import col
q3_final = q3_result.withColumn("isDT", filter_dt(col('X'),col('Y')))
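Note that for this call to work, filter_dt itself has to be registered as a UDF; in the question it is a plain Python function and only the unused filter_dt_boolean wrapper is wrapped with udf(). A minimal end-to-end sketch, assuming the q3_sunday dataframe from the question and keeping the '1'/'0' string labels of the original function (so StringType rather than the BooleanType schema):
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

# Register the labelling function itself as a UDF that returns a string label.
@udf(StringType())
def filter_dt(x, y):
    # guard against missing coordinates
    if x is None or y is None:
        return '0'
    # '1' when the coordinates fall inside the assumed downtown bounding box
    if (-122.4313 < x < -122.4213) and (37.7540 < y < 37.7740):
        return '1'
    return '0'

# q3_sunday is the Sunday-only dataframe from the question; use q3_result
# instead if that is the dataframe the column should be added to.
q3_final = q3_sunday.withColumn("isDT", filter_dt(col('X'), col('Y')))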
I have a given dataframe that looks like this.
This dataframe is sorted by date, and col1 is just some random value.
TEST_schema = StructType([StructField("date", StringType(), True),\
StructField("col1", IntegerType(), True),\
])
TEST_data = [('2020-08-01',3),('2020-08-02',1),('2020-08-03',-1),('2020-08-04',-1),('2020-08-05',3),\
('2020-08-06',-1),('2020-08-07',6),('2020-08-08',4),('2020-08-09',5)]
rdd3 = sc.parallelize(TEST_data)
TEST_df = sqlContext.createDataFrame(TEST_data, TEST_schema)
TEST_df.show()
+----------+----+
| date|col1|
+----------+----+
|2020-08-01| 3|
|2020-08-02| 1|
|2020-08-03| -1|
|2020-08-04| -1|
|2020-08-05| 3|
|2020-08-06| -1|
|2020-08-07| 6|
|2020-08-08| 4|
|2020-08-09| 5|
+----------+----+
LOGIC: lead(col1) + 1; if col1 == -1, then from the previous value it becomes lead(col1) + 2, and so on...
The resulting dataframe will look like this (the WANT column is what I want as output):
+----------+----+----+
| date|col1|WANT|
+----------+----+----+
|2020-08-01| 3| 2|
|2020-08-02| 1| 6|
|2020-08-03| -1| 5|
|2020-08-04| -1| 4|
|2020-08-05| 3| 8|
|2020-08-06| -1| 7|
|2020-08-07| 6| 5|
|2020-08-08| 4| 6|
|2020-08-09| 5| -1|
+----------+----+----+
Let's look at the last row, where col1 == 5: that 5 is led back one row and incremented by 1, which gives want == 6 on 2020-08-08.
If we have col1 == -1, then we add +1 more; if col1 == -1 is repeated twice, then we add +2 more, and so on.
This is hard to explain in words. Lastly, since the last row would otherwise end up null instead of a lead value, replace it with -1. I have a diagram.
You can check if the following code and logic work for you:
1. Create a sub-group label g which takes the running sum of int(col1 != -1); we only care about rows with col1 == -1 and nullify all other rows.
2. The residual is 1, plus the running count over Window w2 if col1 == -1.
3. Take prev_col1 over w1, which is the last value that is not -1 (using nullif). (The naming of prev_col1 might be confusing, since it is only filled in when col1 == -1, using the typical PySpark way to do a forward-fill; otherwise it keeps the original value.)
4. Set val = prev_col1 + residual, take the lag over w1, and replace the resulting null with -1.
Code below:
from pyspark.sql.functions import when, col, expr, count, desc, lag, coalesce, lit
from pyspark.sql import Window
w1 = Window.orderBy(desc('date'))
w2 = Window.partitionBy('g').orderBy(desc('date'))
TEST_df.withColumn('g', when(col('col1') == -1, expr("sum(int(col1!=-1))").over(w1))) \
.withColumn('residual', when(col('col1') == -1, count('*').over(w2) + 1).otherwise(1)) \
.withColumn('prev_col1',expr("last(nullif(col1,-1),True)").over(w1)) \
.withColumn('want', coalesce(lag(expr("prev_col1 + residual")).over(w1),lit(-1))) \
.orderBy('date').show()
+----------+----+----+--------+---------+----+
| date|col1| g|residual|prev_col1|want|
+----------+----+----+--------+---------+----+
|2020-08-01| 3|null| 1| 3| 2|
|2020-08-02| 1|null| 1| 1| 6|
|2020-08-03| -1| 4| 3| 3| 5|
|2020-08-04| -1| 4| 2| 3| 4|
|2020-08-05| 3|null| 1| 3| 8|
|2020-08-06| -1| 3| 2| 6| 7|
|2020-08-07| 6|null| 1| 6| 5|
|2020-08-08| 4|null| 1| 4| 6|
|2020-08-09| 5|null| 1| 5| -1|
+----------+----+----+--------+---------+----+