% of total calculation without subquery in Postgres

I'm trying to create a "Percentage of Total" column and am currently using a subquery with no issues:
SELECT ID, COUNT(*), COUNT(*) / (SELECT COUNT(*) FROM DATA) AS "% OF TOTAL"
FROM DATA GROUP BY ID;
| ID | COUNT | % OF TOTAL |
| 1 | 100 | 0.10 |
| 2 | 800 | 0.80 |
| 3 | 100 | 0.10 |
However, for reasons outside the scope of this question, I'm looking to see if there is any way to accomplish this without using a subquery. Essentially, the application uses logic outside of the SQL query to determine what the WHERE clause is and injects it into the query. That logic does not account for the existence of subqueries like the above, so before going back and rebuilding all of the existing logic to account for this scenario, I figured I'd see if there's another solution first.
I've tried accomplishing this effect with a window function, but to no avail.

Use window functions:
SELECT ID, COUNT(*),
COUNT(*) / SUM(COUNT(*)) OVER () AS "% OF TOTAL"
FROM DATA
GROUP BY ID;
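To see the idea run end to end, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for Postgres (an assumption, chosen only so the example is self-contained; it needs SQLite 3.25 or later for window functions). The function name percent_of_total, the lowercase table name data and the sample ids are made up for illustration. The `* 1.0` is there because SQLite would otherwise do integer division.

```python
import sqlite3


def percent_of_total(ids):
    """Return (id, count, share_of_total) rows, one per distinct id."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE data (id INTEGER)")
    conn.executemany("INSERT INTO data (id) VALUES (?)", [(i,) for i in ids])
    rows = conn.execute(
        """
        SELECT id,
               COUNT(*) AS ct,
               -- SUM(COUNT(*)) OVER () adds the per-group counts across all groups;
               -- the * 1.0 forces floating-point division in SQLite.
               ROUND(COUNT(*) * 1.0 / SUM(COUNT(*)) OVER (), 2) AS pct
        FROM data
        GROUP BY id
        ORDER BY id
        """
    ).fetchall()
    conn.close()
    return rows


# Hypothetical sample: one row for id 1, eight rows for id 2, one row for id 3.
sample_ids = [1] + [2] * 8 + [3]
for row in percent_of_total(sample_ids):
    print(row)
```

The window function runs after GROUP BY, so it sees one row per id, and there is still only a single WHERE clause for the application to inject.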

SELECT id, count(*) AS ct
, round(count(*)::numeric
/ sum(count(*)) OVER (ORDER BY id), 2) AS pct_of_running_total
FROM data
GROUP BY id;
You must add ORDER BY to the window function or the order of rows is arbitrary. It may seem correct at first, but that can change at any time and without warning. It seems you want to order rows by id.
And you obviously don't want integer division, which would truncate fractional digits. I cast to numeric and round the result to two fractional digits like in your result.
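A small sketch to compare the variants side by side, again with Python's sqlite3 as a stand-in (an assumption, not Postgres itself; SQLite 3.25+ is required). The function name compare_frames and the sample data are hypothetical. It shows an uncast division, a cast division over the whole result (OVER ()), and a cast division over a running total (OVER (ORDER BY id)). Whether the uncast form truncates depends on the result types of the engine you use; in SQLite it does.

```python
import sqlite3


def compare_frames(ids):
    """Return (id, truncated, pct_of_total, pct_of_running_total) per id."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE data (id INTEGER)")
    conn.executemany("INSERT INTO data (id) VALUES (?)", [(i,) for i in ids])
    rows = conn.execute(
        """
        SELECT id,
               -- integer / integer in SQLite: the fraction is truncated
               COUNT(*) / SUM(COUNT(*)) OVER () AS truncated,
               -- cast first, divide by the total over all groups
               ROUND(CAST(COUNT(*) AS REAL)
                     / SUM(COUNT(*)) OVER (), 2) AS pct_of_total,
               -- with ORDER BY the default frame ends at the current row,
               -- so the divisor is a running total
               ROUND(CAST(COUNT(*) AS REAL)
                     / SUM(COUNT(*)) OVER (ORDER BY id), 2) AS pct_of_running_total
        FROM data
        GROUP BY id
        ORDER BY id
        """
    ).fetchall()
    conn.close()
    return rows


# Hypothetical sample: one row for id 1, eight rows for id 2, one row for id 3.
for row in compare_frames([1] + [2] * 8 + [3]):
    print(row)
```

Note that only the OVER () variant reproduces the shares from the question; the ORDER BY variant divides by the count accumulated so far.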
Related answer:
Postgres window function and group by exception
Key to understanding why this works is the sequence of events in a SELECT query:
Best way to get result count before LIMIT was applied

Related

SQL Finding maximum average time for distinct cell

I have a table with a large number of records, and I am trying to find only the 10 numbers with the largest average time per number.
So the table may look like so:
number | time
012345 | 10s
012345 | 20s
055555 | 50s
055555 | 30s
068976 | 11s
etc...
and the output should look like so:
number | time
012345 | 15s
055555 | 40s
068976 | 11s
I tried this, but to no avail:
select distinct(destination), avg(totalqueuetime)
from call
group by destination, totalqueuetime
order by totalqueue time desc limit 10;
It does not seem to group the numbers.

Please try the following code, which has been tested and confirmed to work.
(If you wish to sort by average total queue time, as your code sample above suggests)
SELECT destination,
AVG( totalqueuetime ) AS avgTQT
FROM call
GROUP BY destination
ORDER BY avgTQT DESC LIMIT 10;
(If you wish to sort by destination, as your desired output sample above suggests)
SELECT destination,
AVG( totalqueuetime ) AS avgTQT
FROM call
GROUP BY destination
ORDER BY destination DESC LIMIT 10;
If you have any questions or comments, then please feel free to post a Comment accordingly.
Note: As for your supplied code, if you remove totalqueuetime from the GROUP BY clause you will not need to use DISTINCT. Because totalqueuetime is also a grouping key, your SELECT statement computes an average for every distinct pair of destination and totalqueuetime, so each destination can appear many times. Grouping by destination alone reduces the list to one row per destination.

Your group by has two keys. It should only have one:
select destination, avg(totalqueuetime)
from call
group by destination
order by avg(totalqueuetime) desc
limit 10;
Notes on the use of distinct. select distinct is almost never needed with group by. In fact, in almost all cases, you don't need select distinct at all -- because you can use group by.
In addition, distinct is not a function. It applies to the entire row. So, don't use parentheses around the first column, unless you want to confuse yourself.
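To make the difference between the two GROUP BY variants concrete, here is a small sketch using Python's sqlite3 module (an assumption; the question does not say which database is in use). The sample rows come from the question, with times stored as plain integer seconds, which is also an assumption. The function names are hypothetical.

```python
import sqlite3

# Sample rows from the question, times as integer seconds (assumed).
SAMPLE_CALLS = [
    ("012345", 10),
    ("012345", 20),
    ("055555", 50),
    ("055555", 30),
    ("068976", 11),
]


def load_calls(rows):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE call (destination TEXT, totalqueuetime INTEGER)")
    conn.executemany("INSERT INTO call VALUES (?, ?)", rows)
    return conn


def top_destinations(conn, limit=10):
    """One row per destination, largest average first."""
    return conn.execute(
        """
        SELECT destination, AVG(totalqueuetime) AS avg_tqt
        FROM call
        GROUP BY destination
        ORDER BY avg_tqt DESC
        LIMIT ?
        """,
        (limit,),
    ).fetchall()


def grouped_by_both_keys(conn):
    """The original grouping: one row per (destination, totalqueuetime) pair."""
    return conn.execute(
        """
        SELECT destination, AVG(totalqueuetime)
        FROM call
        GROUP BY destination, totalqueuetime
        """
    ).fetchall()


conn = load_calls(SAMPLE_CALLS)
print(top_destinations(conn))
print(len(grouped_by_both_keys(conn)), "rows when grouping by both columns")
```

With two grouping keys every distinct time forms its own group, so the average is just that time again and nothing is collapsed.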

Referencing current row in FILTER clause of window function

In PostgreSQL 9.4 the window functions have the new option of a FILTER to select a sub-set of the window frame for processing. The documentation mentions it, but provides no sample. An online search yields some samples, including from 2ndQuadrant but all that I found were rather trivial examples with constant expressions. What I am looking for is a filter expression that includes the value of the current row.
Assume I have a table with a bunch of columns, one of which is of date type:
col1 | col2 | dt
------------------------
1 | a | 2015-07-01
2 | b | 2015-07-03
3 | c | 2015-07-10
4 | d | 2015-07-11
5 | e | 2015-07-11
6 | f | 2015-07-13
...
A window definition for processing on the date over the entire table is trivially constructed: WINDOW win AS (ORDER BY dt)
I am interested in knowing how many rows are present in, say, the 4 days prior to the current row (inclusive). So I want to generate this output:
col1 | col2 | dt | count
--------------------------------
1 | a | 2015-07-01 | 1
2 | b | 2015-07-03 | 2
3 | c | 2015-07-10 | 1
4 | d | 2015-07-11 | 3
5 | e | 2015-07-11 | 3
6 | f | 2015-07-13 | 4
...
The FILTER clause of the window functions seems like the obvious choice:
count(*) FILTER (WHERE current_row.dt - dt <= 4) OVER win
But how do I specify current_row.dt (for lack of a better syntax)? Is this even possible?
If this is not possible, are there other ways of selecting date ranges in a window frame? The frame specification is no help as it is all row-based.
I am not interested in alternative solutions using sub-queries, it has to be based on window processing.

You are not actually aggregating rows, so the new aggregate FILTER clause is not the right tool. A window function is more like it. A problem remains, however: the frame definition of a window cannot depend on values of the current row. It can only count a given number of rows preceding or following with the ROWS clause.
To make that work, aggregate counts per day and LEFT JOIN to a full set of days in range. Then you can apply a window function:
SELECT t.*, ct.ct_last4days
FROM (
SELECT *, sum(ct) OVER (ORDER BY dt ROWS 3 PRECEDING) AS ct_last4days
FROM (
SELECT generate_series(min(dt), max(dt), interval '1 day')::date AS dt
FROM tbl t1
) d
LEFT JOIN (SELECT dt, count(*) AS ct FROM tbl GROUP BY 1) t USING (dt)
) ct
JOIN tbl t USING (dt);
Omitting ORDER BY dt in the window frame definition usually works, since the order is carried over from generate_series() in the subquery. But there are no guarantees in the SQL standard without explicit ORDER BY and it might break in more complex queries.
SQL Fiddle.
Related:
Select finishes where athlete didn't finish first for the past 3 events
PostgreSQL: running count of rows for a query 'by minute'
PostgreSQL unnest() with element number

I don't think there is any syntax that means "current row" in an expression. The gram.y file for postgres makes a filter clause
take just an a_expr, which is just the normal expression clauses. There
is nothing specific to window functions or filter clauses in an expression.
As far as I can find, the only current row notion in a window clause is for specifying the window frame boundaries. I don't think this gets you
what you want.
It's possible that you could get some traction from an enclosing query:
http://www.postgresql.org/docs/current/static/sql-expressions.html
When an aggregate expression appears in a subquery (see Section 4.2.11
and Section 9.22), the aggregate is normally evaluated over the rows
of the subquery. But an exception occurs if the aggregate's arguments
(and filter_clause if any) contain only outer-level variables: the
aggregate then belongs to the nearest such outer level, and is
evaluated over the rows of that query.
but it's not obvious to me how.

https://www.postgresql.org/docs/release/11.0/
Window functions now support all framing options shown in the SQL:2011
standard, including RANGE distance PRECEDING/FOLLOWING, GROUPS mode,
and frame exclusion options
https://dbfiddle.uk/p-TZHp7s
You can do something like
count(dt) over(order by dt RANGE BETWEEN INTERVAL '3 DAYS' PRECEDING AND CURRENT ROW)
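As a rough check of the RANGE idea against the sample data from the question, here is a sketch using Python's sqlite3 module as a stand-in (an assumption; it needs SQLite 3.28 or later for RANGE frames with an offset). SQLite has no interval type, so the sketch orders by julianday(dt) and uses a numeric offset of 3, which is the SQLite counterpart of INTERVAL '3 DAYS' and not Postgres syntax. The function name is hypothetical.

```python
import sqlite3

# Sample rows from the question.
SAMPLE = [
    (1, "a", "2015-07-01"),
    (2, "b", "2015-07-03"),
    (3, "c", "2015-07-10"),
    (4, "d", "2015-07-11"),
    (5, "e", "2015-07-11"),
    (6, "f", "2015-07-13"),
]


def count_recent_rows(rows):
    """Count rows dated within the 3 days before each row's date, inclusive."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tbl (col1 INTEGER, col2 TEXT, dt TEXT)")
    conn.executemany("INSERT INTO tbl VALUES (?, ?, ?)", rows)
    result = conn.execute(
        """
        SELECT col1, col2, dt,
               -- RANGE compares ORDER BY values, not row positions, so the
               -- frame covers every row whose date is at most 3 days earlier.
               COUNT(*) OVER (
                   ORDER BY julianday(dt)
                   RANGE BETWEEN 3 PRECEDING AND CURRENT ROW
               ) AS ct
        FROM tbl
        ORDER BY col1
        """
    ).fetchall()
    conn.close()
    return result


for row in count_recent_rows(SAMPLE):
    print(row)
```

In RANGE mode CURRENT ROW includes peers, which is why both rows dated 2015-07-11 get the same count.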

How to count the number of active days in a dataset with SQL Server 2008

SQL Server 2008, rendered in html via aspx webpage.
What I want to achieve, is to get an average per day figure that makes allowance for missing days. To do this I need to count the number of active days in a table.
Example:
Date | Amount
---------------------
2014-08-16 | 234.56
2014-08-16 | 258.30
2014-08-18 | 25.84
2014-08-19 | 259.21
The sum of the lot (777.91) divided by the number of active days (3) would = 259.30
So it needs to go "count number of different dates in the returned range"
Is there a tidy way to do this?

If you just want that one row of output then this should work:
select sum(amount) / count(distinct date) as your_average
from your_table
Fiddle:
http://sqlfiddle.com/#!2/7ffd1/1/0
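A quick way to check the arithmetic is a small sketch with Python's sqlite3 module (a stand-in, not SQL Server; the same SUM and COUNT(DISTINCT ...) aggregates exist there). The table name your_table follows the query above, and the function name is hypothetical.

```python
import sqlite3

# Sample rows from the question.
SAMPLE_ROWS = [
    ("2014-08-16", 234.56),
    ("2014-08-16", 258.30),
    ("2014-08-18", 25.84),
    ("2014-08-19", 259.21),
]


def average_per_active_day(rows):
    """Return (total, active_days, average per active day)."""
    conn = sqlite3.connect(":memory:")
    conn.execute('CREATE TABLE your_table ("date" TEXT, amount REAL)')
    conn.executemany("INSERT INTO your_table VALUES (?, ?)", rows)
    result = conn.execute(
        """
        SELECT SUM(amount),
               COUNT(DISTINCT "date"),
               SUM(amount) / COUNT(DISTINCT "date")
        FROM your_table
        """
    ).fetchone()
    conn.close()
    return result


total, active_days, average = average_per_active_day(SAMPLE_ROWS)
print(round(total, 2), active_days, round(average, 2))
```

COUNT(DISTINCT "date") counts each date once, so the two rows on 2014-08-16 contribute a single active day.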

I don't know whether this will help you, but how about using GROUP BY with the AVG and COUNT functions?
SELECT Date, AVG(Amount) AS 'AmountAverage', COUNT(*) AS 'NumberOfActiveDays'
FROM YourTable WITH(NOLOCK)
GROUP BY Date
About AVG function, see here: Link

MS Access: Rank SUM() Values

I am working on an old web app that is still using MS Access as its data source, and I have run into an issue while trying to rank SUM() values.
Let's say I have 2 different account numbers, and each of those account numbers has an unknown number of invoices. I need to sum up the total of all the invoices, group it by account number, then add a rank (1-2).
RAW TABLE EXAMPLE...
Account | Sales | Invoice Number
001 | 400 | 123
002 | 150 | 456
001 | 300 | 789
DESIRED RESULTS...
Account | Sales | Rank
001 | 700 | 1
002 | 150 | 2
I tried...
SELECT Account, SUM(Sales) AS Sales,
(SELECT COUNT(*) FROM Invoices) AS RANK
FROM Invoices
ORDER BY Account
But that query keeps returning the number of records assigned to that account and not a rank.

This would be easier in a report, with a running count: Report - Running Count within a Group
This is not standard in a query, but you can do something with custom functions (it's elaborate, but possible):
http://support.microsoft.com/kb/94397/en-us

The easiest way is to break it up into 2 queries. The first one is this, and I've saved it as qryInvoices:
SELECT Invoices.Account, Sum(Invoices.Sales) AS Sales
FROM Invoices
GROUP BY Invoices.Account;
And then the second query uses the first as follows:
SELECT qryInvoices.Account, qryInvoices.Sales, (SELECT Count(*) FROM qryInvoices AS I WHERE I.Sales > qryInvoices.Sales)+1 AS Rank
FROM qryInvoices
ORDER BY qryInvoices.Sales DESC;
I've tested this and got the desired results as outlined in the question.
Note: It may be possible to achieve this in one query using a derived table, but in this instance it was looking a little ugly.

If you need the answer in one query, it should be
SELECT inv.*, (
SELECT 1+COUNT(*) FROM (
SELECT Account, Sum(Sales) AS Sum_sales FROM Invoices GROUP BY Account
) WHERE Sum_sales > inv.Sum_sales
) AS Rank
FROM (
SELECT Account, Sum(Sales) AS Sum_sales FROM Invoices GROUP BY Account
) inv
I have tried it on Access and it works. You may also use different names for the two instances of "Sum_sales" above to avoid confusion (in which case you can drop the "inv." prefix).
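The counting trick is portable, so here is a sketch of it using Python's sqlite3 module (a stand-in for Access, which is an assumption; Access SQL differs in details). The rank of an account is 1 plus the number of accounts with a strictly larger total. The function name and the aliases other, other_sales and sales_rank are made up for illustration.

```python
import sqlite3

# Sample rows from the question: (account, sales, invoice number).
SAMPLE_INVOICES = [
    ("001", 400, 123),
    ("002", 150, 456),
    ("001", 300, 789),
]


def rank_accounts(invoices):
    """Return (account, total sales, rank) with rank 1 for the largest total."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE invoices (account TEXT, sales INTEGER, invoice_number INTEGER)"
    )
    conn.executemany("INSERT INTO invoices VALUES (?, ?, ?)", invoices)
    rows = conn.execute(
        """
        SELECT inv.account,
               inv.sum_sales,
               -- rank = 1 + number of accounts whose total is strictly larger
               (SELECT 1 + COUNT(*)
                FROM (SELECT account, SUM(sales) AS other_sales
                      FROM invoices
                      GROUP BY account) AS other
                WHERE other.other_sales > inv.sum_sales) AS sales_rank
        FROM (SELECT account, SUM(sales) AS sum_sales
              FROM invoices
              GROUP BY account) AS inv
        ORDER BY sales_rank
        """
    ).fetchall()
    conn.close()
    return rows


for row in rank_accounts(SAMPLE_INVOICES):
    print(row)
```

Accounts with equal totals get the same rank under this scheme, because neither counts the other as strictly larger.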

Optimizing a Vertica SQL query to do running totals

I have a table S with time series data like this:
key day delta
For a given key, it's possible but unlikely that days will be missing.
I'd like to construct a cumulative column from the delta values (positive INTs), for the purposes of inserting this cumulative data into another table. This is what I've got so far:
SELECT key, day,
SUM(delta) OVER (PARTITION BY key ORDER BY day asc RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
delta
FROM S
In my SQL flavor, the default window clause is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, but I left that in there to be explicit.
This query is really slow, about an order of magnitude slower than the old broken query, which filled in 0s for the cumulative count. Any suggestions for other methods to generate the cumulative numbers?
I did look at the solutions here:
Running total by grouped records in table
The RDBMS I'm using is Vertica. Vertica SQL precludes the first subselect solution there, and its query planner predicts that the 2nd left outer join solution is about 100 times more costly than the analytic form I show above.

I think you're essentially there. You may just need to update the syntax a bit:
SELECT s_qty,
Sum(s_price)
OVER(
partition BY NULL
ORDER BY s_qty ASC rows UNBOUNDED PRECEDING ) "Cumulative Sum"
FROM sample_sales;
Output:
S_QTY | Cumulative Sum
------+----------------
1 | 1000
100 | 11000
150 | 26000
200 | 28000
250 | 53000
300 | 83000
2000 | 103000
(7 rows)
reference link:
https://dwgeek.com/vertica-cumulative-sum-average-and-example.html/

Sometimes it's faster to just use a correlated subquery:
SELECT
"key"
, "day"
, delta
, (SELECT SUM(delta) FROM S WHERE "key" = t1."key" AND "day" <= t1."day") AS DeltaSum
FROM S t1
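For comparing the two approaches on a small data set, here is a sketch using Python's sqlite3 module (a stand-in for Vertica, which is an assumption, so it says nothing about Vertica's performance). It computes the per-key running total once with a ROWS frame and once with a correlated subquery restricted to the same key and to days up to the current one, so the results can be checked against each other. The sample rows, integer day numbers and function names are made up for illustration.

```python
import sqlite3

# Hypothetical sample: (key, day, delta), with day 3 missing for key "a".
SAMPLE_SERIES = [
    ("a", 1, 5),
    ("a", 2, 3),
    ("a", 4, 2),
    ("b", 1, 7),
    ("b", 2, 1),
]


def load_series(rows):
    conn = sqlite3.connect(":memory:")
    conn.execute('CREATE TABLE s ("key" TEXT, "day" INTEGER, delta INTEGER)')
    conn.executemany("INSERT INTO s VALUES (?, ?, ?)", rows)
    return conn


def running_total_window(conn):
    """Running total per key using a window function with a ROWS frame."""
    return conn.execute(
        """
        SELECT "key", "day", delta,
               SUM(delta) OVER (
                   PARTITION BY "key"
                   ORDER BY "day"
                   ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
               ) AS cumulative
        FROM s
        ORDER BY "key", "day"
        """
    ).fetchall()


def running_total_correlated(conn):
    """Same result using a correlated subquery per row."""
    return conn.execute(
        """
        SELECT t1."key", t1."day", t1.delta,
               (SELECT SUM(t2.delta)
                FROM s AS t2
                WHERE t2."key" = t1."key"
                  AND t2."day" <= t1."day") AS cumulative
        FROM s AS t1
        ORDER BY t1."key", t1."day"
        """
    ).fetchall()


conn = load_series(SAMPLE_SERIES)
for row in running_total_window(conn):
    print(row)
```

Both forms agree as long as each (key, day) pair occurs at most once; with duplicate days a ROWS frame and a "day" <= comparison can differ on the tied rows.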

We Keep Coding
