PERCENTILE_DISC() in PostgreSQL as a window function

We are porting our system from SQL Server to PostgreSQL. As part of that, we calculate the median daily turnover for every company on every date over the past 3 months. Below is a simplified version of the query:
SELECT B.Company, B.Dt, B.Turnover,
       (SELECT DISTINCT PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY Turnover)
                        OVER (PARTITION BY B.Company, B.Dt)
        FROM Example_Tbl AS G
        WHERE G.Company = B.Company
          AND G.Dt <= B.Dt
          AND G.Dt > DateAdd(dd, -92, B.Dt)) AS Med_3m_Turnover
FROM Example_Tbl B;
The problem is that PostgreSQL doesn't support the use of percentile_disc() as a window function. The error message is:
ERROR: OVER is not supported for ordered-set aggregate percentile_disc
Is there any way I can implement the same functionality using something else in PostgreSQL?
edit: Here is example input data in Example_Tbl
Company  Dt  Turnover
x        1   10
x        2   45
x        3   20
y        1   300
y        2   100
y        3   200
And the output should be as below. Please note, we are ignoring the 3-month window right now and just have 3 rows per company:
Company  Dt  Turnover  Med_3m_Turnover
x        1   10        10
x        2   45        10 or 45, depending on the percentile_disc ordering
x        3   20        20
y        1   300       300
y        2   100       300 or 100, depending on the percentile_disc ordering
y        3   200       200

Your partition by clause (PARTITION BY B.Company, B.Dt) is using values from the outer query (alias B), not the subquery (alias G), which wasn't immediately obvious to me. Because the values of B.Company and B.Dt are constant for each execution of the subquery, your partition clause is really no different from simply writing it like this:
... over (partition by 1)
You can test it in SQL Server if you want, but you'll find that the results are the same. Now, I don't know whether using B.Company, B.Dt was intentional or not, but in effect it means that the partition by clause is not actually partitioning anything.
So, as a result, the good news for you is that to write the equivalent query in PostgreSQL, you simply need to omit the OVER (PARTITION BY B.Company, B.Dt) clause entirely, and the behavior will be the same as in SQL Server.
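For concreteness, here is a minimal sketch of the ported query. It assumes Dt is a date column in the real schema (so SQL Server's DateAdd(dd, -92, B.Dt) becomes plain date arithmetic), and it drops the now-unnecessary DISTINCT, since the ordered-set aggregate returns a single row:
SELECT B.Company, B.Dt, B.Turnover,
       (SELECT PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY G.Turnover)
        FROM Example_Tbl AS G
        WHERE G.Company = B.Company
          AND G.Dt <= B.Dt
          AND G.Dt > B.Dt - 92  -- assumes Dt is a date; use an interval for timestamps
       ) AS Med_3m_Turnover
FROM Example_Tbl B;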

Related

Count different groups in the same query

Imagine I have a table like this:
# | A | B | MoreFieldsHere
1 | 1 | 1 |
2 | 1 | 3 |
3 | 1 | 5 |
4 | 2 | 6 |
5 | 2 | 7 |
6 | 3 | 9 |
B is associated to A in a 1:n relationship. The table could have been created by a join, for example.
I want to get both the total count and the count of different A.
I know I can use a query like this:
SELECT v1.cnt AS total, v2.cnt AS num_of_A
FROM
(
SELECT COUNT(*) AS cnt
FROM SomeComplicatedQuery
WHERE 1=1
-- AND SomeComplicatedCondition
) v1,
(
SELECT COUNT(A) AS cnt
FROM SomeComplicatedQuery
WHERE 1=1
-- AND SomeComplicatedCondition
GROUP BY A
) v2
However, SomeComplicatedQuery would be a complicated and slow query, and SomeComplicatedCondition would be the same in both cases, so I want to avoid running it unnecessarily. Aside from that, if the query changes, you need to make sure to change it in the other place too, which is error-prone and creates (probably unnecessary) work.
Is there a way to do this more efficiently?
Are you looking for this?
SELECT COUNT(*) AS total, COUNT(DISTINCT A) AS num_of_A
FROM (. . . ) q
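A quick way to sanity-check this, with the sample rows from the question inlined as a derived table (the alias q is arbitrary):
SELECT COUNT(*) AS total, COUNT(DISTINCT A) AS num_of_A
FROM (
    SELECT 1 AS A, 1 AS B UNION ALL
    SELECT 1, 3 UNION ALL
    SELECT 1, 5 UNION ALL
    SELECT 2, 6 UNION ALL
    SELECT 2, 7 UNION ALL
    SELECT 3, 9
) q;
-- total = 6, num_of_A = 3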

How to write a LEFT JOIN in BigQuery's Standard SQL?

We have a query that works in BigQuery's Legacy SQL. How do we write it in Standard SQL so it works?
SELECT Hour, Average, L.Key AS Key
FROM (SELECT 1 AS Key, *
      FROM test.table_L AS L)
LEFT JOIN (SELECT 1 AS Key, Avg(Total) AS Average
           FROM test.table_R) AS R
ON L.Key = R.Key
ORDER BY Hour ASC
Currently the error it gives is:
Equality is not defined for arguments of type ARRAY<INT64> at [4:74]
BigQuery has two modes for queries: Legacy SQL and Standard SQL. We have looked at the BigQuery Standard SQL documentation and also see just one SO answer on Standard SQL joins in BigQuery - but so far, it is unclear to us what the key change needed might be.
Table_L looks like this:
Row Hour
1 A
2 B
3 C
Table_R looks like this:
Row Value
1 10
2 20
3 30
Results Desired:
Row Hour Average(OfR) Key
1 A 20 1
2 B 20 1
3 C 20 1
How do we rewrite this BigQuery Legacy SQL query to work in Standard SQL?
Based on your recent update to the question and the comments, try the query below:
WITH Table_L AS (
SELECT 1 AS Row, 'A' AS Hour UNION ALL
SELECT 2 AS Row, 'B' AS Hour UNION ALL
SELECT 3 AS Row, 'C' AS Hour
),
Table_R AS (
SELECT 1 AS Row, 10 AS Value UNION ALL
SELECT 2 AS Row, 20 AS Value UNION ALL
SELECT 3 AS Row, 30 AS Value
)
SELECT
Row,
Hour,
(SELECT AVG(Value) FROM Table_R) AS AverageOfR,
1 AS Key
FROM Table_L
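With the inlined sample data this produces the desired result:
-- Row | Hour | AverageOfR | Key
-- 1   | A    | 20.0       | 1
-- 2   | B    | 20.0       | 1
-- 3   | C    | 20.0       | 1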
The above is for testing; the query you should run in "production" is:
SELECT
Row,
Hour,
(SELECT AVG(Value) FROM Table_R) AS AverageOfR,
1 AS Key
FROM Table_L
If for some reason you are bound to a JOIN, use the CROSS JOIN version below:
SELECT
Row,
Hour,
AverageOfR,
1 AS Key
FROM Table_L
CROSS JOIN (SELECT AVG(Value) AS AverageOfR FROM Table_R)
Or use the LEFT JOIN version below, with the Key field involved (in case Key really is important for your logic, which somehow I feel is true):
SELECT
Row,
Hour,
AverageOfR,
L.Key AS Key
FROM (SELECT 1 AS Key, Row, Hour FROM Table_L) AS L
LEFT JOIN (SELECT 1 AS Key, AVG(Value) AS AverageOfR FROM Table_R) AS R
ON L.Key = R.Key
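All three variants return the same three rows for the sample data, with AverageOfR = 20 (the average of 10, 20 and 30).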
Your error message suggests that Key is not a column in table_L. If it isn't, don't include it in the query.
It looks like you simply want the average of the total from table_R. You can approach this as:
SELECT l.*, r.average
FROM test.table_L as l CROSS JOIN
(SELECT Avg(Total) as average
FROM test.table_R
) R
ORDER BY l.hour ASC;
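A quick sanity check of this approach, with the question's sample rows inlined (the column name Total is assumed from the original query; the question's table listing calls it Value):
WITH table_L AS (
  SELECT 'A' AS Hour UNION ALL
  SELECT 'B' UNION ALL
  SELECT 'C'
),
table_R AS (
  SELECT 10 AS Total UNION ALL
  SELECT 20 UNION ALL
  SELECT 30
)
SELECT l.*, r.average
FROM table_L AS l
CROSS JOIN (SELECT AVG(Total) AS average FROM table_R) r
ORDER BY l.Hour ASC;
-- Hour | average
-- A    | 20.0
-- B    | 20.0
-- C    | 20.0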

How to divide two values from different rows

I have used this formula:
Quote change = (current month data / previous month data) * 100
My data, stored in a SQL Server table, looks like this:
id DATE DATA
1 2015/01/01 10
2 2015/02/01 20
3 2015/03/01 30
4 2015/04/01 40
5 2015/05/01 50
6 2015/06/01 60
7 2015/07/01 70
8 2015/08/01 80
9 2015/09/01 90
How can I implement this formula in a SQL function?
For example, if the current month is 2015/02/01:
Quote change = (current month data / previous month data) * 100
Quote change = (20 / 10) * 100
And if the current date is 2015/01/01, there is no data before it, so I need to show 0 (or some placeholder).
SQL Server 2012 has a window function called LAG that is very useful in situations like this.
LAG returns the value of a specific column in the previous row (as specified by the ORDER BY part of the OVER clause).
Try this:
;WITH cte AS
(
    SELECT Id, Date, Data, LAG(Data) OVER(ORDER BY Date) AS LastMonthData
    FROM YourTable
)
SELECT Id,
       Date,
       Data,
       CASE WHEN ISNULL(LastMonthData, 0) = 0
            THEN 0
            ELSE (Data * 100.0 / LastMonthData)  -- 100.0 forces decimal division
       END AS Quote
FROM cte
I've used a CTE just so I wouldn't have to repeat the LAG expression twice.
The CASE expression prevents a divide-by-zero error when LastMonthData is 0, and returns 0 for the first row, where LAG yields NULL.
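With the sample data this returns 0 for 2015/01/01 (no previous row), 200 for 2015/02/01 (20 / 10 * 100), 150 for 2015/03/01 (30 / 20 * 100), and so on.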
You can use a self join as mentioned below. Note the LEFT JOIN: it keeps the first month's row so ISNULL can return 0 for it, and the cast happens before the division to avoid integer truncation:
select a.*, isnull(cast(a.data as decimal(10,2)) / b.data * 100, 0) as quote
from TableA as a
left join TableA as b
    on b.date = dateadd(mm, -1, a.date)
Let me know if this helps.

counting subquery in SQL

I have the following query to count how many times each process_track_id occurs in a table:
SELECT
a.process_track_id,
COUNT(1) AS 'num'
FROM
transreport.process_name a
GROUP BY
a.process_track_id
This returns the following results:
process_track_id | num
1                | 14
2                | 44
3                | 16
5                | 8
6                | 18
7                | 17
8                | 14
This is great. Now comes the part where I am stuck. I would like to get the following table:
num | count
8   | 1
14  | 2
16  | 1
17  | 1
18  | 1
44  | 1
Where num holds the distinct counts from the first table, and count is how many times that frequency occurs.
Here is what I have tried (it's a subquery, but I'm not sold on the method) and I haven't been able to get it to work just yet. I'm new to SQL and I think I'm missing out on some key aspects of the syntax.
SELECT
X.id_count,
count(1) as 'num_count'
FROM
(SELECT
a.process_track_id,
COUNT(1) AS 'id_count'
FROM
transreport.process_name a
GROUP BY
a.process_track_id
--COUNT(1) AS 'id_count'
) X;
Any ideas?
It's probably good to keep in mind that this may have to be run on a database with at least 1 million records, and I don't have the ability to create a new table in the process.
Thanks!
Here's the subquery method you were driving at:
SELECT id_count, COUNT(*) AS 'num_count'
FROM (SELECT a.process_track_id,
             COUNT(*) AS 'id_count'
      FROM transreport.process_name a
      GROUP BY a.process_track_id
     ) sub
GROUP BY id_count
Not sure there's a better method as the aggregation needs to run once anyway.
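As a sanity check against the first result set: the num values there are 14, 44, 16, 8, 18, 17, 14, so 14 occurs twice and every other value once, matching the desired output.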
Try this:
SELECT x.num, COUNT(*) AS count
FROM (
    SELECT
        a.process_track_id,  -- <--- you may remove this column
        COUNT(*) AS 'num'
    FROM
        transreport.process_name a
    GROUP BY
        a.process_track_id
) x
GROUP BY x.num

What is a SQL frequency-distribution query to count ranges with group-by, and include 0 counts?

Given:
table 'thing':
age
---
3.4
3.4
10.1
40
45
49
I want to count the number of things for each 10-year range, e.g.,
age_range | count
----------+-------
        0 | 2
       10 | 1
       20 | 0
       30 | 0
       40 | 3
This query comes close:
SELECT FLOOR(age / 10) as age_range, COUNT(*)
FROM thing
GROUP BY FLOOR(age / 10) ORDER BY FLOOR(age / 10);
Output:
age_range | count
----------+-------
        0 | 2
        1 | 1
        4 | 3
However, it doesn't show me the ranges which have 0 counts. How can I modify the query so that it also shows the ranges in between with 0 counts?
I found similar stackoverflow questions for counting ranges, some for 0 counts, but they involve having to specify each range (either hard-coding the ranges into the query, or putting the ranges in a table). I would prefer to use a generic query like that above where I do not have to explicitly specify each range (e.g., 0-10, 10-20, 20-30, ...). I'm using PostgreSQL 9.1.3.
Is there a way to modify the simple query above to include 0 counts?
Similar:
Oracle: how to "group by" over a range?
Get frequency distribution of a decimal range in MySQL
generate_series to the rescue:
select 10 * s.d, count(t.age)
from generate_series(0, 10) s(d)
left outer join thing t on s.d = floor(t.age / 10)
group by s.d
order by s.d
Figuring out the upper bound for generate_series should be trivial with a separate query; I just used 10 as a placeholder.
This:
generate_series(0, 10) s(d)
essentially generates an inline table called s with a single column d which contains the values from 0 to 10 (inclusive).
You could wrap the two queries (one to figure out the range, one to compute the counts) into a function if necessary.
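For instance, here is a sketch that folds the upper bound into the same statement via a scalar subquery (this assumes the thing table from the question, and that an uncorrelated subquery is accepted as the function argument):
select 10 * s.d as age_range, count(t.age) as count
from generate_series(0, (select floor(max(age) / 10)::int from thing)) as s(d)
left outer join thing t on s.d = floor(t.age / 10)
group by s.d
order by s.d;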
You need some way to invent the table of age ranges. Row number usually works nicely. Do a cartesian product against a big table to get lots of numbers.
WITH ranges AS (
    SELECT (rownum - 1) * 10 AS age_range
    FROM ( SELECT row_number() OVER () AS rownum
           FROM pg_tables
         ) n,
         ( SELECT ceil( max(age) / 10 ) AS range_end
           FROM thing
         ) m
    WHERE n.rownum <= m.range_end
)
SELECT r.age_range, COUNT(t.age) AS count
FROM ranges r
LEFT JOIN thing t ON r.age_range = FLOOR(t.age / 10) * 10
GROUP BY r.age_range
ORDER BY r.age_range;
EDIT: mu is too short has a much more elegant answer, but if you didn't have a generate_series function on the db, ... :)