Cumulative count of duplicates - SQL

For a table looking like
ID | Value
-------------
1 | 2
2 | 10
3 | 3
4 | 2
5 | 0
6 | 3
7 | 3
I would like to calculate the number of IDs with a higher Value, for each Value that appears in the table, i.e.
Value | Position
----------------
10 | 0
3 | 1
2 | 4
0 | 6
This equates to the offset of the Value in an ORDER BY Value DESC ordering.
I have considered doing this by calculating the number of duplicates with something like
SELECT Value, count(*) AS ct FROM table GROUP BY Value;
And then accumulating the result, but I guess that is not the optimal way to do it (nor have I managed to combine the commands accordingly).
How would one go about calculating this efficiently (for several tens of thousands of rows)?

This seems like a perfect opportunity for the window function rank() (not the related dense_rank()):
SELECT DISTINCT ON (value)
       value, rank() OVER (ORDER BY value DESC) - 1 AS position
FROM   tbl
ORDER  BY value DESC;
rank() starts with 1, while your count starts with 0, so subtract 1.
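To see why, line up the sample values in descending order and compare the two functions (worked out by hand from the table above):
value        | 10 | 3 | 3 | 3 | 2 | 2 | 0
rank()       |  1 | 2 | 2 | 2 | 5 | 5 | 7
dense_rank() |  1 | 2 | 2 | 2 | 3 | 3 | 4
rank() - 1 gives 0, 1, 1, 1, 4, 4, 6, which is exactly the number of rows with a higher value; dense_rank() - 1 would not.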
Add a DISTINCT step (DISTINCT ON is slightly cheaper here) to remove duplicate rows after the ranks have been computed; DISTINCT is applied after window functions. Details in this related answer:
Best way to get result count before LIMIT was applied
Result exactly as requested.
An index on value will help performance.
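For example (just a sketch; the table and column names follow the query above, and the index name is only illustrative):
CREATE INDEX tbl_value_idx ON tbl (value);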

You might also try this if you're not comfortable with window functions:
SELECT t1.value, COUNT(DISTINCT t2.id) AS position
FROM tbl t1
LEFT OUTER JOIN tbl t2 ON t1.value < t2.value
GROUP BY t1.value;
Note the self-join: each row of t1 is matched against every row of t2 with a larger value, and COUNT(DISTINCT t2.id) comes out as 0 for the largest value thanks to the LEFT OUTER JOIN.

Related

SQL compares the value of 2 columns and select the column with max value row-by-row

I have a table something like this:
GROUP | NAME | Value_1 | Value_2
------+------+---------+--------
1     | ABC  | 0       | 0
1     | DEF  | 4       | 4
50    | XYZ  | 6       | 6
50    | QWE  | 6       | 7
100   | XYZ  | 26      | 2
100   | QWE  | 26      | 2
What I would like to do is to group by GROUP and select the NAME with the highest Value_1. If their Value_1 values are the same, compare and select the one with the higher Value_2. If they're still the same, select the first one.
The output will be something like:
GROUP | NAME | Value_1 | Value_2
------+------+---------+--------
1     | DEF  | 4       | 4
50    | QWE  | 6       | 7
100   | XYZ  | 26      | 2
The challenge for me here is that I don't know how many categories are in NAME, so a simple CASE WHEN is not working. Thanks for the help.
You can use window functions to solve the bulk of your problem:
select t.*
from (select t.*,
             row_number() over (partition by "group"
                                order by value_1 desc, value_2 desc) as seqnum
      from t
     ) t
where seqnum = 1;
The one caveat is the condition:
If they're still the same, select the first one.
SQL tables represent unordered (multi-) sets. There is no "first" one unless a column specifies the ordering. The best you can do is choose an arbitrary value when all the other values are the same.
That said, you might have another column that has an ordering. If so, add that as a third key to the order by.
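For example, assuming a hypothetical created_at column defines that ordering (substitute whichever column you actually have):
select t.*
from (select t.*,
             -- created_at is a hypothetical tie-breaking column
             row_number() over (partition by "group"
                                order by value_1 desc, value_2 desc, created_at) as seqnum
      from t
     ) t
where seqnum = 1;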

Running "distinct on" across all unique thresholds in a postgres table

I have a Postgres 11 table called sample_a that looks like this:
time | cat | val
------+-----+-----
1 | 1 | 5
1 | 2 | 4
2 | 1 | 6
3 | 1 | 9
4 | 3 | 2
I would like to create a query that, for each unique timestep, gets the most recent value of each category at or before that timestep and aggregates these values by dividing their sum by their count.
I believe I have the query to do this for a given timestep. For example, for time 3 I can run the following query:
select sum(val)::numeric / count(val) as result from (
select distinct on (cat) * from sample_a where time <= 3 order by cat, time desc
) x;
and get 6.5. (This is because at time 3, the latest from category 1 is 9 and the latest from category 2 is 4. The count of the values are 2, and they sum up to 13, and 13 / 2 is 6.5.)
However, I would ideally like to run a query that will give me all the results for each unique time in the table. The output of this new query would look as follows:
time | result
------+----------
1 | 4.5
2 | 5
3 | 6.5
4 | 5
This new query ideally would avoid adding another subselect clause if possible; an efficient query would be preferred. I could get these prior results by running the prior query inside my application for each timestep, but this doesn't seem efficient for a large sample_a.
What would this new query look like?
See if performance is acceptable this way. Syntax might need minor tweaks:
select t.time, avg(mr.val) as result
from (select distinct time from sample_a) t,
     lateral (
       select distinct on (cat) val
       from sample_a a
       where a.time <= t.time
       order by a.cat, a.time desc
     ) mr
group by t.time;
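If that turns out to be slow, an index matching the lateral lookup's predicate and sort order may help (a sketch, assuming you are free to add indexes):
create index on sample_a (cat, time desc);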
I think you just want cumulative functions:
select time,
       sum(sum_val) over (order by time) / sum(num_val) over (order by time) as result
from (select time, sum(val) as sum_val, count(*) as num_val
      from sample_a
      group by time
     ) a;
Note that if val is an integer, you might need to cast to numeric to get fractional values.
This can be expressed without a subquery as well:
select time,
       sum(sum(val)) over (order by time) / sum(count(*)) over (order by time) as result
from sample_a
group by time;
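A minimal sketch of that cast, applied to the subquery-free variant (the ::numeric cast is the only change):
select time,
       sum(sum(val)::numeric) over (order by time) / sum(count(*)) over (order by time) as result
from sample_a
group by time;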

Calculate "position in run" in SQL

I have a table of consecutive ids (integers, 1 ... n), and values (integers), like this:
Input Table:
id value
-- -----
1 1
2 1
3 2
4 3
5 1
6 1
7 1
Going down the table, i.e. in order of increasing id, I want to count how many times in a row the same value has been seen consecutively, i.e. the position in a run:
Output Table:
id value position in run
-- ----- ---------------
1 1 1
2 1 2
3 2 1
4 3 1
5 1 1
6 1 2
7 1 3
Any ideas? I've searched for a combination of windowing functions including lead and lag, but can't come up with it. Note that the same value can appear in the value column as part of different runs, so partitioning by value may not help solve this. I'm on Hive 1.2.
One way is to use a difference-of-row-numbers approach to classify consecutive identical values into one group, then use row_number() again to get the desired positions within each group.
Query to assign groups (Running this will help you understand how the groups are assigned.)
select t.*,
       row_number() over(order by id)
         - row_number() over(partition by value order by id) as rnum_diff
from tbl t
Final query, using row_number() to get the positions within each group assigned by the above query.
select id, value,
       row_number() over(partition by value, rnum_diff order by id) as pos_in_grp
from (select t.*,
             row_number() over(order by id)
               - row_number() over(partition by value order by id) as rnum_diff
      from tbl t
     ) t
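To see how the grouping works, here is the intermediate result of the first query for the sample data (worked out by hand):
id | value | row_number by id | row_number per value | rnum_diff
---+-------+------------------+----------------------+----------
 1 |   1   |        1         |          1           |    0
 2 |   1   |        2         |          2           |    0
 3 |   2   |        3         |          1           |    2
 4 |   3   |        4         |          1           |    3
 5 |   1   |        5         |          3           |    2
 6 |   1   |        6         |          4           |    2
 7 |   1   |        7         |          5           |    2
Each run gets a constant (value, rnum_diff) pair, e.g. the first run of 1s is (1, 0) and the second is (1, 2), so partitioning by both columns in the final query numbers the positions within each run.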

Compare column entry to every other entry in the same column

I have a Column of values in SQLite.
value
-----
1
2
3
4
5
For each value I would like to know how many of the other values are larger and display the result. E.g. For value 1 there are 4 entries that have higher values.
value | Count
-------------
1 | 4
2 | 3
3 | 2
4 | 1
5 | 0
I have tried nested select statements and using the Count(*) function but I do not seem to be able to extract the correct levels. Any suggestions would be much appreciated.
Many Thanks
You can do this with a correlated subquery in SQLite:
select value,
(select count(*) from t t2 where t2.value > t.value) as "count"
from t;
In most other databases, you would use a ranking function such as rank() or dense_rank(), but SQLite did not support window functions until version 3.25.
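That said, if you are on SQLite 3.25 or later, a sketch of the same idea with a window function (equivalent to the correlated subquery above):
select value,
       rank() over (order by value desc) - 1 as "count"
from t
order by value;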

Running Totals again. No over clause, no cursor, but increasing order

I am still having trouble creating a running total based on the increasing order of the value. The Row id has no real meaning; it is just the PK. My server doesn't support OVER.
Row Value
1 3
2 7
3 1
4 2
Result:
Row RunningTotal
3 1
4 3
1 6
2 13
I have tried self and cross joins where I specify that the value of the second amount (the one being summed up) is less than the current value of the first. I have also tried doing this with the HAVING clause, but that always threw an error when I tried it that way. Can someone explain why it would be wrong to use it in that manner and how I should be doing it?
Here is one way to do a running total:
select row, value,
(select sum(value) from t t2 where t2.value <= t.value) as runningTotal
from t
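One caveat: if the same value can appear more than once, the <= comparison double-counts ties. A sketch that breaks ties on the row column (the question says it is the PK, so it is unique):
select row, value,
       (select sum(t2.value)
        from t t2
        where t2.value < t.value
           or (t2.value = t.value and t2.row <= t.row)) as runningTotal
from t;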
You can use GROUP BY ... WITH ROLLUP if you have SQL Server 2008, although that produces grouped subtotal rows rather than a per-row running total:
select value, sum(value) from t group by value with rollup
If your platform supports recursive queries, you can also do it with a recursive CTE (IIRC you should omit the RECURSIVE keyword on Microsoft SQL Server). Because the CTE needs to find the beginning and end of a "chain", the tuples unfortunately need to be ordered in some way (I use the "row" field; an internal tuple id would be perfect for this purpose):
WITH RECURSIVE sums AS (
-- Terminal part
SELECT d0.row
, d0.value AS value
, d0.value AS runsum
FROM data d0
WHERE NOT EXISTS (
SELECT * FROM data nx
WHERE nx.row < d0.row
)
UNION
-- Recursive part
SELECT t1.row AS row
, t1.value AS value
, t0.runsum + t1.value AS runsum
FROM data t1
, sums t0
WHERE t1.row > t0.row
AND NOT EXISTS (
SELECT * FROM data nx
WHERE nx.row > t0.row
AND nx.row < t1.row
)
)
SELECT * FROM sums
;
RESULT:
row | value | runsum
-----+-------+--------
1 | 3 | 3
2 | 7 | 10
3 | 1 | 11
4 | 2 | 13
(4 rows)