How to use a window function in Redshift?

I have 2 tables:
| Product |
|:----: |
| product_id |
| source_id|
| Source |
|:----: |
| source_id |
| priority |
Sometimes one product_id has several sources, and my task is to select the row with the minimum priority for each product. For example, from:
| product_id | source_id| priority|
|:----: |:------:| :-----:|
| 10| 2| 9|
| 10| 4| 2|
| 20| 2| 9|
| 20| 4| 2|
| 30| 2| 9|
| 30| 4| 2|
The correct result should be:
| product_id | source_id| priority|
|:----: |:------:| :-----:|
| 10| 4| 2|
| 20| 4| 2|
| 30| 4| 2|
I am using this query:
SELECT p.product_id, p.source_id, s.priority FROM Product p
INNER JOIN Source s on s.source_id = p.source_id
WHERE s.priority = (SELECT Min(s1.priority) OVER (PARTITION BY p.product_id) FROM Source s1)
It returns the error "this type of correlated subquery pattern is not supported yet". As I understand it, I can't use this variant in Redshift. How should this be solved? Are there any other ways?

You just need to move the WHERE-clause logic into a derived table, and the easiest way to flag the minimum priority is the ROW_NUMBER() window function. Your query asks Redshift to rerun the window function for every row the join tests, which is very inefficient on a clustered database. Try the following (untested):
SELECT product_id, source_id, priority
FROM (
    -- number the sources of each product, lowest priority first
    SELECT p.product_id,
           p.source_id,
           s.priority,
           ROW_NUMBER() OVER (PARTITION BY p.product_id ORDER BY s.priority) AS row_num
    FROM Product p
    INNER JOIN Source s
        ON s.source_id = p.source_id
) ranked
WHERE row_num = 1
Now the window function only runs once. You can also move the subquery to a CTE if that improves readability in your full case.
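The ROW_NUMBER() idea can be tried locally before running it on the cluster. The sketch below uses Python's built-in sqlite3 module as a stand-in for Redshift (an assumption: SQLite 3.25 or newer, which supports window functions) and loads the sample rows from the question. The window is computed over the joined rows so that product_id is in scope; the alias ranked is made up for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for Redshift (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Product (product_id INTEGER, source_id INTEGER);
    CREATE TABLE Source  (source_id INTEGER, priority INTEGER);
    INSERT INTO Product VALUES (10, 2), (10, 4), (20, 2), (20, 4), (30, 2), (30, 4);
    INSERT INTO Source  VALUES (2, 9), (4, 2);
""")

query = """
    SELECT product_id, source_id, priority
    FROM (
        SELECT p.product_id, p.source_id, s.priority,
               ROW_NUMBER() OVER (PARTITION BY p.product_id
                                  ORDER BY s.priority) AS row_num
        FROM Product p
        INNER JOIN Source s ON s.source_id = p.source_id
    ) ranked
    WHERE row_num = 1
    ORDER BY product_id
"""
rows = conn.execute(query).fetchall()
print(rows)
```

ROW_NUMBER() keeps exactly one row per product, so if two sources share the lowest priority, one of them is picked arbitrarily.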

I already found the best solution for my case:
SELECT product_id, source_id, priority
FROM (
    SELECT
        p.product_id
      , p.source_id
      , s.priority
      , Min(s.priority) OVER (PARTITION BY p.product_id) as min_priority
    FROM Product p
    INNER JOIN Source s
        ON s.source_id = p.source_id
) t
-- a window result cannot be filtered at the level where it is computed
WHERE priority = min_priority
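A windowed minimum generally has to be computed in a derived table before it can be filtered on, since WHERE is evaluated before window functions. Here is a minimal sketch of that pattern, again using sqlite3 as a stand-in for Redshift (an assumption). The extra source 5 is made-up data, added to show that this approach keeps every row tied for the minimum.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Product (product_id INTEGER, source_id INTEGER);
    CREATE TABLE Source  (source_id INTEGER, priority INTEGER);
    -- source 5 is hypothetical: it ties with source 4 on priority 2
    INSERT INTO Source  VALUES (2, 9), (4, 2), (5, 2);
    INSERT INTO Product VALUES (10, 2), (10, 4), (10, 5), (20, 2), (20, 4);
""")

query = """
    SELECT product_id, source_id, priority
    FROM (
        SELECT p.product_id, p.source_id, s.priority,
               MIN(s.priority) OVER (PARTITION BY p.product_id) AS min_priority
        FROM Product p
        INNER JOIN Source s ON s.source_id = p.source_id
    ) t
    WHERE priority = min_priority
    ORDER BY product_id, source_id
"""
tied_rows = conn.execute(query).fetchall()
print(tied_rows)  # product 10 appears twice because sources 4 and 5 tie
```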

Related

Make parent and child hierarchy with records for parents as well as children

I have a parent-child id_table hierarchy - e.g.
|parent|child|
|------|-----|
| | 0|
| 0| 1|
| 0| 2|
| 0| 3|
| 1| 4|
| 1| 5|
| 2| 6|
| 4| 7|
| 4| 8|
I'm building a visual tree hierarchy, where above data would be formatted as:
|parent|child1|child2|child3|
|------|------|------|------|
| 0| 1| 4| 7|
| 0| 1| 4| 8|
| 0| 1| 5| |
| 0| 2| 6| |
| 0| 3| | |
Now I want to modify this query to include a row for each parent standalone without the child, so above data would become:
|parent|child1|child2|child3|
|------|------|------|------|
| 0| | | |
| 0| 1| | |
| 0| 1| 4| |
| 0| 1| 4| 7|
| 0| 1| 4| 8|
| 0| 1| 5| |
| 0| 2| | |
| 0| 2| 6| |
| 0| 3| | |
To get the first result, I am building the data with repeated left joins on the source data from the first example (as far as I understand, I can't do this with recursion), e.g.:
SELECT t1.child AS parent,
t2.child AS child1,
t3.child AS child2,
t4.child AS child3
FROM id_table t1
LEFT JOIN id_table t2
ON t1.child = t2.parent
LEFT JOIN id_table t3
ON t2.child = t3.parent
LEFT JOIN id_table t4
ON t3.child = t4.parent
WHERE t1.child = '0'
This gets me the second example, but I'm lacking a record for each parent as well, as shown in the third example.
I assume this is probably a simple question, I'm just struggling with the syntax. TIA for any help.
EDIT: I had a prior question for a similar implementation in SAS EG: SQL - Recursive Tree Hierarchy with Record at Each Level, however that was with the SAS SQL implementation which is much more restricted - with that method I eventually had to just create temp tables at each level then union the end result, which was messy. Trying to find a cleaner solution.
GROUP BY ROLLUP can be used to create those extra rows:
SELECT DISTINCT
t1.child AS Parent
,t2.child AS child1
,t3.child AS child2
,t4.child AS child3
-- one more column for each additional level
FROM id_table t1
LEFT JOIN id_table t2
ON t1.child = t2.Parent
LEFT JOIN id_table t3
ON t2.child = t3.Parent
LEFT JOIN id_table t4
ON t3.child = t4.Parent
-- one additional join for each new level
WHERE t1.child = '0'
GROUP BY ROLLUP (t1.child,t2.child,t3.child,t4.child)
HAVING t1.child IS NOT NULL
Or use a recursive query to traverse the hierarchy, build the path, and then split it into columns:
WITH RECURSIVE cte AS
( -- traverse the hierarchy and build the path
SELECT 1 AS lvl
,child
,Cast(child AS VARCHAR(500)) AS Path -- must be large enough for concatenating all levels
FROM id_table
WHERE Parent IS NULL
UNION ALL
SELECT lvl+1
,t.child
,cte.Path || ',' || Trim(t.child)
FROM cte JOIN id_table AS t
ON cte.child = t.Parent
WHERE lvl < 20 -- just in case there's an endless loop
)
SELECT
StrTok(Path, ',', 1)
,StrTok(Path, ',', 2)
,StrTok(Path, ',', 3)
,StrTok(Path, ',', 4)
-- one additional StrTok for each new level
FROM cte
I don't know which one is more efficient.
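For a quick experiment with the recursive approach, here is a sketch in Python's sqlite3 (an assumption: SQLite stands in for the target database). SQLite has no StrTok, so instead of building a path string and splitting it, this variant carries the level columns through the recursion. That is a swapped-in technique, not the query above. The column name root and the fixed depth of three child levels are illustrative choices.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE id_table (parent INTEGER, child INTEGER)")
conn.executemany(
    "INSERT INTO id_table VALUES (?, ?)",
    [(None, 0), (0, 1), (0, 2), (0, 3), (1, 4), (1, 5), (2, 6), (4, 7), (4, 8)],
)

query = """
    WITH RECURSIVE cte (lvl, child, root, child1, child2, child3) AS (
        -- anchor: the root node, with no children filled in yet
        SELECT 0, child, child, NULL, NULL, NULL
        FROM id_table
        WHERE parent IS NULL
        UNION ALL
        -- each step fills the column that belongs to the next level
        SELECT cte.lvl + 1, t.child, cte.root,
               CASE WHEN cte.lvl = 0 THEN t.child ELSE cte.child1 END,
               CASE WHEN cte.lvl = 1 THEN t.child ELSE cte.child2 END,
               CASE WHEN cte.lvl = 2 THEN t.child ELSE cte.child3 END
        FROM cte
        JOIN id_table AS t ON t.parent = cte.child
        WHERE cte.lvl < 3  -- stop after child3; also guards against loops
    )
    SELECT root AS parent, child1, child2, child3
    FROM cte
    ORDER BY root, child1, child2, child3  -- SQLite sorts NULLs first
"""
tree_rows = conn.execute(query).fetchall()
for row in tree_rows:
    print(row)
```

Because every node visited by the recursion produces its own row, the standalone parent rows come for free, without ROLLUP.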

how to update a row based on another row with same id

With a Spark DataFrame, I want to update a row's value based on other rows with the same id.
For example,
I have the records below:
id,value
1,10
1,null
1,null
2,20
2,null
2,null
I want to get the result as below
id,value
1,10
1,10
1,10
2,20
2,20
2,20
To summarize: the value column is null in some rows, and I want to fill those in when another row with the same id has a valid value.
In SQL, I can simply write an UPDATE statement with an inner join, but I couldn't find an equivalent in Spark SQL.
update combineCols a
inner join combineCols b
on a.id = b.id
set a.value = b.value
(this is how I do it in sql)
Let's use the SQL method to solve this issue:
myValues = [(1,10),(1,None),(1,None),(2,20),(2,None),(2,None)]
df = sqlContext.createDataFrame(myValues,['id','value'])
df.registerTempTable('table_view')
df1=sqlContext.sql(
'select id, sum(value) over (partition by id) as value from table_view'
)
df1.show()
+---+-----+
| id|value|
+---+-----+
| 1| 10|
| 1| 10|
| 1| 10|
| 2| 20|
| 2| 20|
| 2| 20|
+---+-----+
Caveat: This code assumes that there is only one non-null value for any particular id. When we group values by id, we have to use an aggregate function, and I have used sum. If an id has 2 non-null values, they will be summed up. If an id could have multiple non-null values, it's better to use min/max, so that we get one of the values rather than their sum.
df1=sqlContext.sql(
'select id, max(value) over (partition by id) as value from table_view'
)
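The difference between sum and max in the caveat can be checked with a small sketch. It runs the same two window queries through Python's sqlite3 instead of Spark (an assumption made so the example is self-contained), and the second non-null value 5 for id 1 is made-up data. The helper name fill is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_view (id INTEGER, value INTEGER)")
conn.executemany(
    "INSERT INTO table_view VALUES (?, ?)",
    # id 1 has two non-null values (10 and 5) to expose the difference
    [(1, 10), (1, 5), (1, None), (2, 20), (2, None), (2, None)],
)

def fill(agg):
    """Fill value per id using the given aggregate as a window function."""
    sql = (
        f"SELECT id, {agg}(value) OVER (PARTITION BY id) AS value "
        "FROM table_view ORDER BY id"
    )
    return conn.execute(sql).fetchall()

summed = fill("sum")  # id 1 becomes 15, a value that never existed
maxed = fill("max")   # id 1 becomes 10, one of the original values
print(summed)
print(maxed)
```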
You can use a window to do this (in PySpark):
from pyspark.sql import functions as F
from pyspark.sql.window import Window
# create dataframe
df = sc.parallelize([
[1,10],
[1,None],
[1,None],
[2,20],
[2,None],
[2,None],
]).toDF(('id', 'value'))
window = Window.partitionBy('id').orderBy(F.desc('value'))
df \
.withColumn('value', F.first('value').over(window)) \
.show()
Results:
+---+-----+
| id|value|
+---+-----+
| 1| 10|
| 1| 10|
| 1| 10|
| 2| 20|
| 2| 20|
| 2| 20|
+---+-----+
You can use the same functions in Scala.

SQL - Pivot or Unpivot?

Another time, another problem. I have the following table:
|assemb.|Repl_1|Repl_2|Repl_3|Repl_4|Repl_5|Amount_1|Amount_2|Amount_3|Amount_4|Amount_5|
|-------|------|------|------|------|------|--------|--------|--------|--------|--------|
|4711001|111000|222000|333000|444000|555000| 1| 1| 1| 1| 1|
|4711002|222000|333000|444000|555000|666000| 1| 1| 1| 1| 1|
And here what I need:
|Article|Amount|
|-------|------|
| 111000| 1|
| 222000| 2|
| 333000| 2|
| 444000| 2|
| 555000| 2|
| 666000| 1|
Repl_1 to Repl_10 are replacement articles of the assembly. I can have n assemblies with up to 10 replacement articles each. In the end I need an overview of all articles with their total amounts across all assemblies.
THX.
Best greetz
Vegeta
This is probably the quickest way of achieving it, using UNION ALL. However, I'd recommend normalising your table.
SELECT Article, SUM(Amount) AS Amount FROM (
SELECT Repl_1 AS Article, SUM(Amount_1) AS Amount FROM #Test GROUP BY Repl_1
UNION ALL
SELECT Repl_2 AS Article, SUM(Amount_2) AS Amount FROM #Test GROUP BY Repl_2
UNION ALL
SELECT Repl_3 AS Article, SUM(Amount_3) AS Amount FROM #Test GROUP BY Repl_3
UNION ALL
SELECT Repl_4 AS Article, SUM(Amount_4) AS Amount FROM #Test GROUP BY Repl_4
UNION ALL
SELECT Repl_5 AS Article, SUM(Amount_5) AS Amount FROM #Test GROUP BY Repl_5
) tbl GROUP BY Article
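With up to ten Repl_n/Amount_n pairs, writing every UNION ALL branch by hand gets repetitive, so one option is to generate them. The sketch below does that in Python against an in-memory SQLite table (an assumption: the original uses the SQL Server temp table #Test; here it is a plain table named Test, and the first column is named assembly). It also skips the inner GROUP BY and aggregates once at the end, which gives the same totals.

```python
import sqlite3

N_PAIRS = 5  # number of Repl_n / Amount_n column pairs in the sample

conn = sqlite3.connect(":memory:")
columns = ", ".join(
    [f"Repl_{i} INTEGER" for i in range(1, N_PAIRS + 1)]
    + [f"Amount_{i} INTEGER" for i in range(1, N_PAIRS + 1)]
)
conn.execute(f"CREATE TABLE Test (assembly INTEGER, {columns})")
conn.executemany(
    "INSERT INTO Test VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
    [
        (4711001, 111000, 222000, 333000, 444000, 555000, 1, 1, 1, 1, 1),
        (4711002, 222000, 333000, 444000, 555000, 666000, 1, 1, 1, 1, 1),
    ],
)

# One SELECT per column pair, stacked with UNION ALL (the unpivot step).
branches = " UNION ALL ".join(
    f"SELECT Repl_{i} AS Article, Amount_{i} AS Amount FROM Test"
    for i in range(1, N_PAIRS + 1)
)
sql = (
    f"SELECT Article, SUM(Amount) AS Amount FROM ({branches}) AS tbl "
    "WHERE Article IS NOT NULL GROUP BY Article ORDER BY Article"
)
totals = conn.execute(sql).fetchall()
print(totals)
```

The WHERE Article IS NOT NULL filter is an addition for assemblies that use fewer than the maximum number of replacement articles.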

SQL query for finding the most frequent value of a grouped by value

I'm using SQLite browser and I'm trying to write a query that finds the most frequent value in one column for each group defined by another column.
The table is called main:
| |Place |Value|
| 1| London| 101|
| 2| London| 20|
| 3| London| 101|
| 4| London| 20|
| 5| London| 20|
| 6| London| 20|
| 7| London| 20|
| 8| London| 20|
| 9| France| 30|
| 10| France| 30|
| 11| France| 30|
| 12| France| 30|
The result I'm looking for is the most frequent value for each place:
| |Place |Most Frequent Value|
| 1| London| 20|
| 2| France| 30|
Or even better
| |Place |Most Frequent Value|Largest Percentage|2nd Largest Percentage|
| 1| London| 20| 0.75| 0.25|
| 2| France| 30| 1| 0.75|
You can group by place, then value, and order by frequency, e.g.:
select place,value,count(value) as freq from cars group by place,value order by place, freq;
This will not give exactly the answer you want, but something close to it:
France | 30 | 4
London | 101 | 2
London | 20 | 6
Now select place and value from this intermediate table and group by place, so that only one row per place is displayed.
select place,value from
(select place,value,count(value) as freq from cars group by place,value order by place, freq)
group by place;
This will produce the result like following:
France | 30
London | 20
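This works in SQLite because, for a column that is neither grouped nor aggregated, it returns the value from one row of each group, which here happens to be the last row in the subquery's order. SQLite does not guarantee which row it picks, and other programs might return the place and value with the lowest frequency instead. In those, you can use order by place, freq desc to solve your problem.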
The first part would be something like this.
http://sqlfiddle.com/#!7/ac182/8
with tbl1 as
(select a.place,a.value,count(a.value) as val_count
from table1 a
group by a.place,a.value
)
select t1.place,
t1.value as most_frequent_value
from tbl1 t1
inner join
(select place,max(val_count) as val_count from tbl1
group by place) t2
on t1.place=t2.place
and t1.val_count=t2.val_count
Here we derive tbl1, which gives the count for each place and value combination. We then join it with another derived table, t2, which finds the maximum count per place, to get the required result.
I am not sure how you want the percentages in the second output, but if you understand this query, you can add some logic on top of it to derive the required output. Play around with the sqlfiddle. All the best.
RANK
SQLite now supports RANK, so we can use the exact same syntax that works on PostgreSQL, similar to https://stackoverflow.com/a/12448971/895245
SELECT "city", "value", "cnt"
FROM (
SELECT
"city",
"value",
COUNT(*) AS "cnt",
RANK() OVER (
PARTITION BY "city"
ORDER BY COUNT(*) DESC
) AS "rnk"
FROM "Sales"
GROUP BY "city", "value"
) AS "sub"
WHERE "rnk" = 1
ORDER BY
"city" ASC,
"value" ASC
This returns all tied values in case of a tie. To return just one, you could use ROW_NUMBER instead of RANK.
Tested on SQLite 3.34.0 and PostgreSQL 14.3.
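The "Largest Percentage" column from the question can be derived with one more window function. This is a minimal sketch under a few assumptions: Python's sqlite3 with SQLite 3.25 or newer, a table named places instead of main, and made-up helper names counts, ranked and share. It uses ROW_NUMBER with value as a tie-breaker, so exactly one row per place comes back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE places (place TEXT, value INTEGER)")
sample = (
    [("London", 101), ("London", 20), ("London", 101)]
    + [("London", 20)] * 5
    + [("France", 30)] * 4
)
conn.executemany("INSERT INTO places VALUES (?, ?)", sample)

query = """
    WITH counts AS (
        -- how often each value occurs within each place
        SELECT place, value, COUNT(*) AS cnt
        FROM places
        GROUP BY place, value
    ),
    ranked AS (
        SELECT place, value,
               cnt * 1.0 / SUM(cnt) OVER (PARTITION BY place) AS share,
               ROW_NUMBER() OVER (PARTITION BY place
                                  ORDER BY cnt DESC, value) AS rn
        FROM counts
    )
    SELECT place, value, share
    FROM ranked
    WHERE rn = 1
    ORDER BY place
"""
most_frequent = conn.execute(query).fetchall()
print(most_frequent)
```

Selecting the row with rn = 2 in the same way would give the second-largest share.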

SQL needed to get most popular product based also off a quantity

I'm currently trying to get the most popular productID from my MSSQL Database. This is what the table looks like (With a bit of dummy data):
OrderItems:
+--+-------+--------+---------+
|ID|OrderID|Quantity|ProductID|
+--+-------+--------+---------+
| 1| 1| 1| 1|
| 2| 1| 1| 2|
| 3| 2| 1| 1|
| 4| 2| 50| 2|
The OrderID field can be ignored, but I need to find the most popular ProductIDs from this table, ordering them by total quantity. The result set should look something like this:
+---------+
|ProductID|
+---------+
|        2|
|        1|
As ProductID 2 has a total quantity of 51, it needs to come out first, followed by ProductID 1 which only has a total quantity of 2.
(Note: Query needs to be compatible back to MSSQL-2008)
SELECT
productID
FROM
yourTable
GROUP BY
productID
ORDER BY
SUM(Quantity) DESC
GROUP BY allows SUM(), but you don't have to use it in the SELECT to be allowed to use it in the ORDER BY.
select ProductID
from OrderItems
group by ProductId
order by sum(Quantity) desc;
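To see the ordering on the dummy data from the question, here is a small sketch that runs the same GROUP BY / ORDER BY SUM() query through Python's sqlite3 (an assumption: SQLite stands in for MSSQL, and the statement uses only standard SQL).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE OrderItems "
    "(ID INTEGER, OrderID INTEGER, Quantity INTEGER, ProductID INTEGER)"
)
conn.executemany(
    "INSERT INTO OrderItems VALUES (?, ?, ?, ?)",
    [(1, 1, 1, 1), (2, 1, 1, 2), (3, 2, 1, 1), (4, 2, 50, 2)],
)

# The aggregate appears only in ORDER BY, not in the SELECT list.
popular = conn.execute(
    "SELECT ProductID FROM OrderItems "
    "GROUP BY ProductID ORDER BY SUM(Quantity) DESC"
).fetchall()
print(popular)  # ProductID 2 (total 51) comes before ProductID 1 (total 2)
```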