It's hard to put into words what I am trying to do. My SQL knowledge is too weak to use the right terminology, so I will illustrate it with an example.
Say I have a big table with the columns "value", "user_id" and "type":
value|user_id|type|
100| 1| 1|
200| 1| 1|
100| 1| 2|
722| 1| 3|
48| 2| 1|
724| 2| 2|
175| 2| 3|
1) I calculate the sum of "value" for each "user_id" and each "type":
SELECT SUM("value"), "user_id", "type" from "table" group by "user_id", "type"
giving me
value|user_id|type|
300| 1| 1|
100| 1| 2|
722| 1| 3|
48| 2| 1|
724| 2| 2|
175| 2| 3|
2) I want to obtain the rank of each "user_id" within each "type", based on the summed "value".
For type 1, the value for user 1 is greater than for user 2, so user 1 ranks 1 and user 2 ranks 2.
For type 2, the value for user 2 is greater, so user 2 ranks 1, and so on.
In other words, I want to produce this table for user 1:
rank|type
1|1
2|2
1|3
and for user 2:
rank|type
2|1
1|2
2|3
I would really appreciate help with this.
You can use the result of an aggregate function for a window function. The window function will be processed after the group by:
SELECT sum(value), user_id, type,
rank() over (partition by type order by sum(value) desc)
from the_table
group by user_id, type
order by user_id, type;
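To check the query against the sample data from the question, here is a minimal sketch using Python's built-in sqlite3 module. It assumes a SQLite build with window-function support (3.25 or newer); the table name the_table comes from the answer, while the aliases total and rnk are only illustrative.

```python
import sqlite3

# In-memory database holding the sample rows from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE the_table (value INTEGER, user_id INTEGER, type INTEGER)")
conn.executemany(
    "INSERT INTO the_table VALUES (?, ?, ?)",
    [(100, 1, 1), (200, 1, 1), (100, 1, 2), (722, 1, 3),
     (48, 2, 1), (724, 2, 2), (175, 2, 3)],
)

# The window function is evaluated after GROUP BY, so it can rank by SUM(value).
query = """
SELECT user_id, type, SUM(value) AS total,
       RANK() OVER (PARTITION BY type ORDER BY SUM(value) DESC) AS rnk
FROM the_table
GROUP BY user_id, type
ORDER BY user_id, type
"""
rows = conn.execute(query).fetchall()

# Map (user_id, type) -> rank for easy lookup.
ranks = {(user_id, type_): rnk for user_id, type_, total, rnk in rows}
for row in rows:
    print(row)
```

For user 1 this should give ranks 1, 2, 1 for types 1, 2, 3, and for user 2 ranks 2, 1, 2, which matches the tables requested in the question.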
Okay, so I've been driving myself crazy trying to get this to display in SQL. I have a table that stores types of food, the culture they come from, a score, and a boolean value about whether or not they are good. I want to display a record of how many "goods" each culture racks up. Here's the table (don't ask about the database name):
So I've tried:
SELECT count(good = 1), culture FROM animals_db.foods group by culture;
Or
SELECT count(good = true), culture FROM animals_db.foods group by culture;
But it doesn't return the correct results; it seems to count every row that has any "good" value (1 or 0) at all.
How do I get the data I want?
Instead of count, use sum:
SELECT sum(good), culture FROM animals_db.foods group by culture; -- assumes the good column is an integer type, with 1 meaning good and 0 otherwise
Another way is to use count with a case expression:
select count( case when good=1 then 1 end) , culture from animals_db.foods group by culture;
If the purpose is to count the number of good=1 for each culture, this works:
select culture,
count(*)
from foods
where good=1
group by 1
order by 1;
Result:
culture |count(*)|
--------+--------+
| 1|
American| 1|
Chinese | 1|
European| 1|
Italian | 2|
The reason your first query doesn't return the result can be explained as below:
select culture,
good=1 as is_good
from foods
order by 1;
You actually get:
culture |is_good|
--------+-------+
| 1|
American| 0|
American| 1|
Chinese | 1|
European| 1|
French | 0|
French | 0|
German | 0|
Italian | 1|
Italian | 1|
After applying group by culture and count(good=1), you are actually counting the rows where the expression good=1 is NOT NULL, whether it evaluates to 0 or 1. For example:
select culture,
count(good=0) as c0,
count(good=1) as c1,
count(good=2) as c2,
count(good) as c3,
count(null) as c4
from foods
group by culture
order by culture;
Outcome:
culture |c0|c1|c2|c3|c4|
--------+--+--+--+--+--+
| 1| 1| 1| 1| 0|
American| 2| 2| 2| 2| 0|
Chinese | 1| 1| 1| 1| 0|
European| 1| 1| 1| 1| 0|
French | 2| 2| 2| 2| 0|
German | 1| 1| 1| 1| 0|
Italian | 2| 2| 2| 2| 0|
Update: This is similar to the question "Is it possible to specify condition in Count()?".
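The difference between the three counting expressions can be checked with a small sketch using Python's built-in sqlite3 module. The original table is not shown, so the rows below are an assumption reconstructed from the culture/is_good output above; the aliases non_null_count, sum_good and count_case are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foods (culture TEXT, good INTEGER)")
# Assumed data: one row per line of the culture/is_good listing above.
conn.executemany("INSERT INTO foods VALUES (?, ?)", [
    ("", 1), ("American", 0), ("American", 1), ("Chinese", 1), ("European", 1),
    ("French", 0), ("French", 0), ("German", 0), ("Italian", 1), ("Italian", 1),
])

query = """
SELECT culture,
       COUNT(good = 1)                      AS non_null_count, -- counts 0s and 1s alike
       SUM(good)                            AS sum_good,       -- adds up the 1s
       COUNT(CASE WHEN good = 1 THEN 1 END) AS count_case      -- NULL when good <> 1
FROM foods
GROUP BY culture
ORDER BY culture
"""
result = {culture: (n, s, k) for culture, n, s, k in conn.execute(query)}
for culture, counts in result.items():
    print(repr(culture), counts)
```

With this data, COUNT(good = 1) reports 2 for French even though neither French row is good, while SUM(good) and the CASE form both report 0.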
I have a CSV file whose header contains columns with the same name.
I want to process it with Spark using only SQL and be able to refer to these columns unambiguously.
Ex.:
id name age height name
1 Alex 23 1.70
2 Joseph 24 1.89
I want to get only the first name column, using only Spark SQL.
As mentioned in the comments, I think the least error-prone method would be to change the schema of the input data.
Yet, in case you are looking for a quick workaround, you can simply index the duplicated column names.
For instance, let's create a dataframe with three id columns.
val df = spark.range(3)
.select('id * 2 as "id", 'id * 3 as "x", 'id, 'id * 4 as "y", 'id)
df.show
+---+---+---+---+---+
| id| x| id| y| id|
+---+---+---+---+---+
| 0| 0| 0| 0| 0|
| 2| 3| 1| 4| 1|
| 4| 6| 2| 8| 2|
+---+---+---+---+---+
Then I can use toDF to set new column names. Let's assume I know that only id is duplicated. If we didn't know that, adding the extra logic to figure out which columns are duplicated would not be very difficult.
var i = -1
val names = df.columns.map( n =>
if(n == "id") {
i+=1
s"id_$i"
} else n )
val new_df = df.toDF(names : _*)
new_df.show
+----+---+----+---+----+
|id_0| x|id_1| y|id_2|
+----+---+----+---+----+
| 0| 0| 0| 0| 0|
| 2| 3| 1| 4| 1|
| 4| 6| 2| 8| 2|
+----+---+----+---+----+
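The "extra logic" for finding the duplicated names could look roughly like the following plain-Python sketch. The helper name index_duplicates is hypothetical; it only computes the new names, which in PySpark could presumably be passed on as df.toDF(*names).

```python
from collections import Counter


def index_duplicates(names):
    """Append _0, _1, ... to every name that occurs more than once."""
    totals = Counter(names)   # how often each name occurs overall
    seen = Counter()          # how many occurrences were already renamed
    renamed = []
    for name in names:
        if totals[name] > 1:
            renamed.append(f"{name}_{seen[name]}")
            seen[name] += 1
        else:
            renamed.append(name)  # unique names are left untouched
    return renamed


# Header from the question and column list from the example above.
print(index_duplicates(["id", "name", "age", "height", "name"]))
print(index_duplicates(["id", "x", "id", "y", "id"]))
```

For the CSV header in the question, the first name column would become name_0 and the second name_1, so each can be referenced unambiguously in SQL.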
Considering the table:
df=sc.parallelize([(1,1,1),(5,0,2),(27,1,1),(1,0,3),(5,1,1),(1,0,2)]).toDF(['id', 'error', 'timestamp'])
df.show()
+---+-----+---------+
| id|error|timestamp|
+---+-----+---------+
| 1| 1| 1|
| 5| 0| 2|
| 27| 1| 1|
| 1| 0| 3|
| 5| 1| 1|
| 1| 0| 2|
+---+-----+---------+
I would like to make a pivot on timestamp column keeping some other aggregated information from the original table. The result I am interested in can be achieved by
import pyspark.sql.functions as sf
df1=df.groupBy('id').agg(sf.sum('error').alias('Ne'),sf.count('*').alias('cnt'))
df2=df.groupBy('id').pivot('timestamp').agg(sf.count('*')).fillna(0)
df1.join(df2, on='id').filter(sf.col('cnt')>1).show()
with the resulting table:
+---+---+---+---+---+---+
| id| Ne|cnt| 1| 2| 3|
+---+---+---+---+---+---+
| 5| 1| 2| 1| 1| 0|
| 1| 1| 3| 1| 1| 1|
+---+---+---+---+---+---+
However, there are at least two issues with the mentioned solution:
1) I am filtering by cnt at the end of the script. If I could do this at the beginning, I would avoid almost all of the processing, because this filter removes a large portion of the data. Is there any way to do this other than with the collect and isin methods?
2) I am doing groupBy on id twice: first to aggregate some columns I need in the results, and a second time to get the pivot columns. Finally, I need a join to merge these columns. I feel that I am surely missing some solution, because it should be possible to do this with just one groupBy and without a join, but I cannot figure out how.
I think you cannot get around the join, because the pivot needs the timestamp values while the first grouping must not consider them. To create the Ne and cnt values you have to group the dataframe by id only, which loses the timestamp column. If you want to preserve the timestamp values as columns, you have to do the pivot separately, as you did, and join it back.
The only improvement I see is to move the filter to the creation of df1. As you said, this could already improve performance, since for your real data df1 should be much smaller after the filtering.
import pyspark.sql.functions as sf  # an alias avoids shadowing Python's built-in sum
df=sc.parallelize([(1,1,1),(5,0,2),(27,1,1),(1,0,3),(5,1,1),(1,0,2)]).toDF(['id', 'error', 'timestamp'])
df1=df.groupBy('id').agg(sf.sum('error').alias('Ne'),sf.count('*').alias('cnt')).filter(sf.col('cnt')>1)
df2=df.groupBy('id').pivot('timestamp').agg(sf.count('*')).fillna(0)
df1.join(df2, on='id').show()
Output:
+---+---+---+---+---+---+
| id| Ne|cnt| 1| 2| 3|
+---+---+---+---+---+---+
| 5| 1| 2| 1| 1| 0|
| 1| 1| 3| 1| 1| 1|
+---+---+---+---+---+---+
Actually, it is indeed possible to avoid the join by using Window functions:
import pyspark.sql.functions as sf
from pyspark.sql.window import Window
w1 = Window.partitionBy('id')
w2 = Window.partitionBy('id', 'timestamp')
df.select('id', 'timestamp',
sf.sum('error').over(w1).alias('Ne'),
sf.count('*').over(w1).alias('cnt'),
sf.count('*').over(w2).alias('cnt_2')
).filter(sf.col('cnt')>1) \
.groupBy('id', 'Ne', 'cnt').pivot('timestamp').agg(sf.first('cnt_2')).fillna(0).show()
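As a sanity check on what either approach should return, the same aggregation can be emulated in plain Python on the six sample rows. This is only a sketch of the logic (per-id sum and count, per-id/timestamp count, filter on cnt > 1), not Spark code, and the variable names are illustrative.

```python
from collections import Counter, defaultdict

# Sample rows from the question: (id, error, timestamp).
rows = [(1, 1, 1), (5, 0, 2), (27, 1, 1), (1, 0, 3), (5, 1, 1), (1, 0, 2)]

ne = defaultdict(int)          # sum of error per id       -> column Ne
cnt = Counter()                # number of rows per id     -> column cnt
per_ts = defaultdict(Counter)  # rows per (id, timestamp)  -> pivot columns

for id_, error, ts in rows:
    ne[id_] += error
    cnt[id_] += 1
    per_ts[id_][ts] += 1

# Pivot columns are the distinct timestamps; missing combinations count as 0.
timestamps = sorted({ts for _, _, ts in rows})
result = {
    id_: (ne[id_], cnt[id_], *[per_ts[id_][ts] for ts in timestamps])
    for id_ in cnt
    if cnt[id_] > 1            # same filter as cnt > 1 in the Spark code
}
for id_, values in sorted(result.items()):
    print(id_, values)
```

The two surviving ids, 1 and 5, carry the same Ne, cnt and pivot values as the tables shown above, and id 27 is dropped because it has a single row.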
With Spark dataframe, I want to update a row value based on other rows with same id.
For example,
I have records below,
id,value
1,10
1,null
1,null
2,20
2,null
2,null
I want to get the result as below
id,value
1,10
1,10
1,10
2,20
2,20
2,20
To summarize, the value column is null in some rows, and I want to fill those in if another row with the same id has a valid value.
In SQL, I can simply write an UPDATE statement with an inner join, but I didn't find an equivalent in Spark SQL.
update combineCols a
inner join combineCols b
on a.id = b.id
set a.value = b.value
(this is how I do it in SQL)
Let's use the SQL approach to solve this issue:
myValues = [(1,10),(1,None),(1,None),(2,20),(2,None),(2,None)]
df = sqlContext.createDataFrame(myValues,['id','value'])
df.registerTempTable('table_view')
df1=sqlContext.sql(
'select id, sum(value) over (partition by id) as value from table_view'
)
df1.show()
+---+-----+
| id|value|
+---+-----+
| 1| 10|
| 1| 10|
| 1| 10|
| 2| 20|
| 2| 20|
| 2| 20|
+---+-----+
Caveat: This code assumes that there is only one non-null value for any particular id. When we aggregate over the rows of an id, we have to use an aggregation function, and I have used sum. If there are 2 non-null values for an id, they will be summed up. If an id could have multiple non-null values, it is better to use min/max, so that we get one of the values rather than their sum.
df1=sqlContext.sql(
'select id, max(value) over (partition by id) as value from table_view'
)
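The caveat can be illustrated with a small sketch using Python's built-in sqlite3 module instead of Spark (assuming a SQLite build with window-function support, 3.25 or newer). The data is deliberately altered from the question: id 1 gets a second non-null value, 15, which is a made-up value for demonstration, and fill is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_view (id INTEGER, value INTEGER)")
# Hypothetical data: id 1 has two non-null values (10 and 15).
conn.executemany(
    "INSERT INTO table_view VALUES (?, ?)",
    [(1, 10), (1, None), (1, 15), (2, 20), (2, None), (2, None)],
)


def fill(agg):
    """Run the window query with the given aggregate (sum, max, ...)."""
    sql = (f"SELECT id, {agg}(value) OVER (PARTITION BY id) AS value "
           "FROM table_view ORDER BY id")
    return conn.execute(sql).fetchall()


with_sum = fill("sum")  # the two values of id 1 are added together
with_max = fill("max")  # one of the existing values is picked
print(with_sum)
print(with_max)
```

With sum, every row of id 1 gets 25, a value that never occurred in the data; with max, every row gets 15. For id 2, which has a single non-null value, both variants return 20.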
You can use a window function to do this (in PySpark):
from pyspark.sql import functions as F
from pyspark.sql.window import Window
# create dataframe
df = sc.parallelize([
[1,10],
[1,None],
[1,None],
[2,20],
[2,None],
[2,None],
]).toDF(('id', 'value'))
window = Window.partitionBy('id').orderBy(F.desc('value'))
df \
.withColumn('value', F.first('value').over(window)) \
.show()
Results:
+---+-----+
| id|value|
+---+-----+
| 1| 10|
| 1| 10|
| 1| 10|
| 2| 20|
| 2| 20|
| 2| 20|
+---+-----+
You can use the same functions in scala.
I'll start by saying that I am new to SQL and what I've written is based on tutorials (I am using SQL Server 2012). I am trying to take data from 4 different tables and put them into 1 table to be read from Access. However, I keep getting duplicate rows if a value is different from the rest.
The tables look like
Cell1
|LotNum|SerialNum|PassFail|
| Lot11| 1234| 1|
| Lot11| 2345| 1|
| Lot11| 3456| 1|
| Lot11| 4567| 1|
Cell2
|LotNum|SerialNum|PassFail|
| Lot11| 1234| 1|
| Lot11| 2345| 1|
| Lot11| 3456| 1|
| Lot11| 4567| 1|
Cell3
|LotNum|SerialNum|PassFail|
| Lot11| 1234| 1|
| Lot11| 2345| 1|
| Lot11| 3456| 1|
| Lot11| 4567| 1|
Cell4
|LotNum|SerialNum|PassFail|
| Lot11| 1234| 1|
| Lot11| 2345| 1|
| Lot11| 3456| 1|
| Lot11| 4567| 0|
My code is
Alter Procedure [dbo].[spSingleData](
@LotNum varchar(50)
)
AS
Truncate Table dbo.SingleSheet
Begin
Insert INTO dbo.SingleSheet (SerialNum, Cell1PF, Cell2Pf, Cell3PF, Cell4PF)
Select Distinct Cell1.SerialNum, Cell1.PF, Cell2.PF, Cell3.PF, Cell4.PF
From dbo.Cell1
Left Join Cell2 On Cell1.LotNum=Cell2.LotNum
Left Join Cell3 On Cell1.LotNum=Cell3.LotNum
Left Join Cell4 On Cell1.LotNum=Cell4.LotNum
Where Cell1.LotNum = @LotNum
Order by SerialNum
End
PassFail can be 0, 1, or NULL. However, as in the example above, if one of the PassFail values is different from the rest, the resulting table is
|1234| 1| 1| 1| 0|
|1234| 1| 1| 1| 1|
|2345| 1| 1| 1| 0|
|2345| 1| 1| 1| 1|
|3456| 1| 1| 1| 0|
|3456| 1| 1| 1| 1|
|4567| 1| 1| 1| 0|
|4567| 1| 1| 1| 1|
Am I just using the wrong Join or should I be using something else?
Is this what you are trying to achieve:
If so, then you are missing a JOIN predicate on SerialNum, and you do not need the DISTINCT.
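To see why the missing predicate produces the duplicates, here is a rough sketch with Python's built-in sqlite3 module rather than SQL Server. It recreates the four sample tables from the question and runs the same SELECT DISTINCT twice, once joining on LotNum only and once on LotNum and SerialNum; the helper name run is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
serials = [1234, 2345, 3456, 4567]

# Four tables with the sample data; only Cell4 / serial 4567 has PassFail = 0.
for n in range(1, 5):
    conn.execute(
        f"CREATE TABLE Cell{n} (LotNum TEXT, SerialNum INTEGER, PassFail INTEGER)")
    rows = [("Lot11", s, 0 if (n == 4 and s == 4567) else 1) for s in serials]
    conn.executemany(f"INSERT INTO Cell{n} VALUES (?, ?, ?)", rows)


def run(extra):
    """Join Cell2..Cell4 to Cell1 on LotNum plus an optional extra predicate."""
    joins = " ".join(
        f"LEFT JOIN Cell{n} ON Cell1.LotNum = Cell{n}.LotNum {extra.format(n=n)}"
        for n in (2, 3, 4))
    sql = ("SELECT DISTINCT Cell1.SerialNum, Cell1.PassFail, Cell2.PassFail, "
           f"Cell3.PassFail, Cell4.PassFail FROM Cell1 {joins} "
           "ORDER BY Cell1.SerialNum")
    return conn.execute(sql).fetchall()


lot_only = run("")  # every serial is paired with every row of the same lot
lot_and_serial = run("AND Cell1.SerialNum = Cell{n}.SerialNum")
print(len(lot_only), len(lot_and_serial))
```

Joining on LotNum alone pairs each serial number with all four Cell4 rows of the lot, so DISTINCT still leaves two variants per serial (eight rows in total). Adding the SerialNum predicate yields one row per serial, and only 4567 shows a 0 for Cell4.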
Sample Data:
IF OBJECT_ID('tempdb..#Cell1') IS NOT NULL
DROP TABLE #Cell1
CREATE TABLE #Cell1 (LotNum varchar(10),SerialNum int,PassFail bit)
INSERT INTO #Cell1
VALUES
('Lot11',1234,1),
('Lot11',2345,1),
('Lot11',3456,1),
('Lot11',4567,1)
IF OBJECT_ID('tempdb..#Cell2') IS NOT NULL
DROP TABLE #Cell2
CREATE TABLE #Cell2 (LotNum varchar(10),SerialNum int,PassFail bit)
INSERT INTO #Cell2
VALUES
('Lot11',1234,1),
('Lot11',2345,1),
('Lot11',3456,1),
('Lot11',4567,1)
IF OBJECT_ID('tempdb..#Cell3') IS NOT NULL
DROP TABLE #Cell3
CREATE TABLE #Cell3 (LotNum varchar(10),SerialNum int,PassFail bit)
INSERT INTO #Cell3
VALUES
('Lot11',1234,1),
('Lot11',2345,1),
('Lot11',3456,1),
('Lot11',4567,1)
IF OBJECT_ID('tempdb..#Cell4') IS NOT NULL
DROP TABLE #Cell4
CREATE TABLE #Cell4 (LotNum varchar(10),SerialNum int,PassFail bit)
INSERT INTO #Cell4
VALUES
('Lot11',1234,1),
('Lot11',2345,1),
('Lot11',3456,1),
('Lot11',4567,0)
Query:
SELECT #Cell1.SerialNum,
#Cell1.PassFail,
#Cell2.PassFail,
#Cell3.PassFail,
#Cell4.PassFail
FROM #Cell1
LEFT JOIN #Cell2 ON #Cell1.LotNum = #Cell2.LotNum AND #Cell1.SerialNum = #Cell2.SerialNum
LEFT JOIN #Cell3 ON #Cell1.LotNum = #Cell3.LotNum AND #Cell1.SerialNum = #Cell3.SerialNum
LEFT JOIN #Cell4 ON #Cell1.LotNum = #Cell4.LotNum AND #Cell1.SerialNum = #Cell4.SerialNum
ORDER BY SerialNum;
Results: