SQL select elements where sum of field is less than N - sql

Given that I've got a table with the following, very simple content:
# select * from messages;
id | verbosity
----+-----------
1 | 20
2 | 20
3 | 20
4 | 30
5 | 100
(5 rows)
I would like to select N messages whose sum of verbosity is lower than Y (for testing purposes let's say Y is 70; then the correct result is the messages with id 1, 2, 3).
It's really important to me, that solution should be database independent (it should work at least on Postgres and SQLite).
I was trying with something like:
SELECT * FROM messages GROUP BY id HAVING SUM(verbosity) < 70;
However, it doesn't work as expected: since id is unique, GROUP BY id puts every row in its own group, so SUM(verbosity) only ever sums a single row's verbosity rather than computing a running total.
I would be very grateful for any hints/help.

SELECT m.id, sum(m1.verbosity) AS total
FROM messages m
JOIN messages m1 ON m1.id <= m.id
WHERE m.verbosity < 70 -- optional, to avoid pointless evaluation
GROUP BY m.id
HAVING SUM(m1.verbosity) < 70
ORDER BY total DESC
LIMIT 1;
This assumes a unique, ascending id like you have in your example.
In modern Postgres - or generally with modern standard SQL (SQLite only gained window functions in version 3.25):
Simple CTE
WITH cte AS (
SELECT *, sum(verbosity) OVER (ORDER BY id) AS total
FROM messages
)
SELECT *
FROM cte
WHERE total < 70
ORDER BY id;
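If you also need to cap the number of rows at N (the question asks for N messages as well as the sum limit), here is a sketch under the same assumptions, with a hypothetical N = 3:
WITH cte AS (
   SELECT *
        , sum(verbosity)  OVER (ORDER BY id) AS total
        , row_number()    OVER (ORDER BY id) AS rn
   FROM   messages
   )
SELECT *
FROM   cte
WHERE  total < 70
AND    rn <= 3  -- hypothetical N
ORDER  BY id;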
Recursive CTE
Should be faster for big tables where you only retrieve a small set.
WITH RECURSIVE cte AS (
( -- parentheses required
SELECT id, verbosity, verbosity AS total
FROM messages
ORDER BY id
LIMIT 1
)
UNION ALL
SELECT c1.id, c1.verbosity, c.total + c1.verbosity
FROM cte c
JOIN LATERAL (
SELECT *
FROM messages
WHERE id > c.id
ORDER BY id
LIMIT 1
) c1 ON c1.verbosity < 70 - c.total
WHERE c.total < 70
)
SELECT *
FROM cte
ORDER BY id;
All standard SQL, except for LIMIT.
Strictly speaking, there is no such thing as "database-independent". There are various SQL-standards, but no RDBMS complies completely. LIMIT works for PostgreSQL and SQLite (and some others). Use TOP 1 for SQL Server, rownum for Oracle. Here's a comprehensive list on Wikipedia.
The SQL:2008 standard would be:
...
FETCH FIRST 1 ROWS ONLY
... which PostgreSQL supports - but hardly any other RDBMS.
The pure alternative that works with more systems would be to wrap it in a subquery and
SELECT max(total) FROM <subquery>
But that is slow and unwieldy.
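To make the trade-off concrete, a sketch of that wrapped form (reusing the self-join running total from the first query; it avoids LIMIT entirely, so it should run on both Postgres and SQLite):
SELECT id, total
FROM  (
   SELECT m.id, sum(m1.verbosity) AS total
   FROM   messages m
   JOIN   messages m1 ON m1.id <= m.id
   GROUP  BY m.id
   HAVING sum(m1.verbosity) < 70
   ) sub
WHERE total = (
   SELECT max(total)
   FROM  (
      SELECT sum(m1.verbosity) AS total
      FROM   messages m
      JOIN   messages m1 ON m1.id <= m.id
      GROUP  BY m.id
      HAVING sum(m1.verbosity) < 70
      ) sub2
   );
The running-total derived table has to be written out twice, which is exactly the "slow and unwieldy" part.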

This will work...
select *
from messages
where id <=
(
    select MAX(id) from
    (
        select m2.id, SUM(m1.verbosity) sv
        from messages m1
        inner join messages m2 on m1.id <= m2.id
        group by m2.id
    ) v
    where sv < 70
)
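For intuition, the inner derived table v computes a running total per id. With the sample data from the question it would produce:
id | sv
---+-----
 1 |  20
 2 |  40
 3 |  60
 4 |  90
 5 | 190
MAX(id) with sv < 70 is therefore 3, and the outer query returns the messages with id 1 through 3.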
However, you should understand that SQL is designed as a set-based language rather than an iterative one, so it is designed to treat data as a set rather than row by row.


Selectivity estimation error on a simple query

Let us have a simple table tt created like this:
WITH x AS (SELECT n FROM (VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) v(n)), t1 AS
(
SELECT ones.n + 10 * tens.n + 100 * hundreds.n + 1000 * thousands.n + 10000 * tenthousands.n as id
FROM x ones, x tens, x hundreds, x thousands, x tenthousands, x hundredthousands
)
SELECT id,
id % 100 groupby,
row_number() over (partition by id % 100 order by id) orderby,
row_number() over (partition by id % 100 order by id) / (id % 100 + 1) local_search
INTO tt
FROM t1
I have a simple query Q1:
select distinct g1.groupby,
(select count(*) from tt g2
where local_search = 1 and g1.groupby = g2.groupby) as orderby
from tt g1
option(maxdop 1)
I would like to know why SQL Server estimates the result size so badly for Q1 (see the screenshot). Most of the operators in the query plan are estimated precisely; however, at the root, the Hash Match operator introduces a completely insane guess.
To make it more interesting, I have tried different rewritings of Q1. If I apply decorrelation of the subquery, I obtain an equivalent query Q2:
select main.groupby,
coalesce(sub1.orderby,0) orderby
from
(
select distinct g1.groupby
from tt g1
) main
left join
(
select groupby, count(*) orderby
from tt g2
where local_search = 1
group by groupby
) sub1 on sub1.groupby = main.groupby
option(maxdop 1)
This query is interesting in two aspects: (1) the estimation is accurate (see the screenshot), (2) it also has a different query plan, which is more efficient than the query plan of Q1.
So the question is: why is the estimation of Q1 incorrect, whereas the estimation of Q2 is precise? Please do not post other rewritings of this SQL (I know it can be written even without subqueries); I'm interested only in an explanation of the selectivity estimator's behaviour. Thanks.
It doesn't recognize that the orderby value will be the same for all rows with the same groupby, so it thinks distinct groupby, orderby will have more combinations than just distinct groupby.
It multiplies the estimate for DISTINCT orderby (for me this is 35.0367) and the estimate for DISTINCT groupby (for me this is 100) as if they were uncorrelated.
I get an estimate of 3503.67 for the root node in Q1.
This rewrite avoids it as it now only groups by the single groupby column.
SELECT groupby,
max(orderby) AS orderby
FROM (SELECT g1.groupby,
(SELECT count(*)
FROM tt g2
WHERE local_search = 1
AND g1.groupby = g2.groupby) AS orderby
FROM tt g1) d
GROUP BY groupby
OPTION(maxdop 1)
This is not the optimal approach to this query though, as shown by your Q2 and the comment @GarethD makes about the inefficiency of running the correlated subquery multiple times and discarding the duplicates.

SQL percentage of the total

Hi, how can I get the percentage of each record over the total?
Let's imagine I have one table with the following:
ID   code   Points
1    101    2
2    201    3
3    233    4
4    123    1
The percentage for ID 1 is 20%, for ID 2 it is 30%, and so on.
How do I get it?
There are a couple of approaches to getting that result.
You essentially need the "total" points from the whole table (or whatever subset), repeated on each row. Getting the percentage is then a simple matter of arithmetic; the exact expression depends on the datatypes and on how you want the result formatted.
Here's one way (out of a couple of possible ways) to get the specified result:
SELECT t.id
, t.code
, t.points
-- , s.tot_points
, ROUND(t.points * 100.0 / s.tot_points,1) AS percentage
FROM onetable t
CROSS
JOIN ( SELECT SUM(r.points) AS tot_points
FROM onetable r
) s
ORDER BY t.id
The view query s is run first; that gives a single row. The join operation matches that row with every row from t, and that gives us the values we need to calculate a percentage.
Another way to get this result, without using a join operation, is to use a subquery in the SELECT list to return the total.
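For example, a sketch of that scalar-subquery form (same onetable as above):
SELECT t.id
     , t.code
     , t.points
     , ROUND(t.points * 100.0 / (SELECT SUM(r.points) FROM onetable r),1) AS percentage
  FROM onetable t
 ORDER BY t.id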
Note that the join approach can be extended to get percentage for each "group" of records.
id type points %type
-- ---- ------ -----
1 sold 11 22%
2 sold 4 8%
3 sold 25 50%
4 bought 1 50%
5 bought 1 50%
6 sold 10 20%
To get that result, we can use the same query, but with a view query for s that returns totals GROUP BY r.type, and then the join operation isn't a CROSS join but a match based on type:
SELECT t.id
, t.type
, t.points
-- , s.tot_points_by_type
, ROUND(t.points * 100.0 / s.tot_points_by_type,1) AS `%type`
FROM onetable t
JOIN ( SELECT r.type
, SUM(r.points) AS tot_points_by_type
FROM onetable r
GROUP BY r.type
) s
ON s.type = t.type
ORDER BY t.id
To get that same result with a subquery, it would have to be a correlated subquery, and that subquery is likely to get executed for every row in t.
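As a sketch, that correlated form would look like this (the subquery references t.type, so it can't be evaluated just once):
SELECT t.id
     , t.type
     , t.points
     , ROUND(t.points * 100.0 / (SELECT SUM(r.points)
                                   FROM onetable r
                                  WHERE r.type = t.type),1) AS `%type`
  FROM onetable t
 ORDER BY t.id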
This is why it's more natural for me to use a join operation, rather than a subquery in the SELECT list... even when a subquery works the same. (The patterns we use for more complex queries, like assigning aliases to tables, qualifying all column references, and formatting the SQL... those patterns just work their way back into simple queries. The rationale for these patterns is kind of lost in simple queries.)
Try like this:
select id, code, points, (points * 100)/(select sum(points) from table1) from table1
To add to a good list of responses, this should be fast performance-wise, and rather easy to understand:
DECLARE @T TABLE (ID INT, code VARCHAR(256), Points INT)
INSERT INTO @T VALUES (1,'101',2), (2,'201',3), (3,'233',4), (4,'123',1)
;WITH CTE AS
(SELECT * FROM @T)
SELECT C.*, CAST(ROUND((C.Points/B.TOTAL)*100, 2) AS DEC(32,2)) [%_of_TOTAL]
FROM CTE C
JOIN (SELECT CAST(SUM(Points) AS DEC(32,2)) TOTAL FROM CTE) B ON 1=1
Just replace the table variable with your actual table inside the CTE.

SQL to find missing numbers in sequence starting from min?

I have found several examples of SQL queries which will find missing numbers in a sequence. For example this one:
Select T1.val+1
from table T1
where not exists(select val from table T2 where T2.val = T1.val + 1);
This will only find gaps in an existing sequence. I would like to find gaps in a sequence starting from a minimum.
For example, if the values in my sequence are 2, 4 then the query above will return 3,5.
I would like to specify that my sequence must start at 0 so I would like the query to return 0,1,3,5.
How can I add the minimum value to my query?
A few answers to questions below:
There is no maximum, only a minimum
The DB is oracle
This is quite easy in Postgres:
select x.i as missing_sequence_value
from (
select i
from generate_series(0,5) i -- 0,5 are the lower and upper bounds you want
) x
left join the_table t on t.val = x.i
where t.val is null;
SQLFiddle: http://www.sqlfiddle.com/#!15/acb07/1
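Since the question states there is no maximum, a variant (a sketch) that derives the upper bound from the table instead of hardcoding 5:
select x.i as missing_sequence_value
from generate_series(0, (select max(val) from the_table)) x (i)
left join the_table t on t.val = x.i
where t.val is null;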
Edit
The Oracle solution is a bit more complex because generating the numbers requires a workaround
with numbers as (
select level - 1 as val
from dual
connect by level <= (select max(val) + 2 from the_table) -- this is the maximum
), number_range as (
select val
from numbers
where val >= 0 -- this is the minimum
)
select nr.val as missing_sequence_value
from number_range nr
left join the_table t on t.val = nr.val
where t.val is null;
SQLFiddle: http://www.sqlfiddle.com/#!4/71584/4
The idea (in both cases) is to generate a list of numbers you are interested in (from 0 to 5) and then do an outer join against the values in your table. The rows where the outer join does not return something from your table (that's the condition where t.val is null) are the values that are missing.
The Oracle solution requires two common table expressions ("CTE", the "with" things) because you can't add a where level >= x in the first CTE that generates the numbers.
Note that connect by level <= ... relies on an undocumented (and unsupported) way of using connect by. But so many people use it as a "number generator" that I doubt Oracle will actually remove it.
If you have the option to use a common table expression, you can generate a sequence of numbers and use that as the source of numbers.
The variables @start and @end define the range of numbers (you could easily use max(val) from your table as the end instead).
This example is for MS SQL Server (but CTEs are a SQL:1999 feature supported by many databases):
declare @start int, @end int
select @start=0, @end=5
;With sequence(num) as
(
select @start as num
union all
select num + 1
from sequence
where num < @end
)
select * from sequence seq
where not exists(select val from YourTable where YourTable.val = seq.num)
Option (MaxRecursion 1000) -- the default recursion limit is 100; raise it for longer ranges
Just presenting two small variations on the suggestions above, both for Oracle.
The first is a small variation on the one presented by jpw, using a recursive Common Table Expression (CTE); simply to demonstrate that this technique is also available in Oracle.
WITH
seq (val)
AS (
SELECT 0 FROM dual
UNION ALL
SELECT val + 1
FROM seq
WHERE val < (
SELECT MAX(val) FROM the_table -- note below
)
)
SELECT
seq.val AS missing_sequence_value
FROM seq
LEFT JOIN the_table t
ON seq.val = t.val
WHERE t.val IS NULL
ORDER BY
missing_sequence_value
;
A notable difference from SQL Server: you can use a subquery to limit the recursion.
Also, Oracle documentation often refers to this as Subquery Factoring (e.g. subquery_factoring_clause::=) instead of CTE.
The second is a variation on the use of connect by level as used by a_horse_with_no_name.
Level is a pseudocolumn available in Oracle's hierarchical queries, and the root of a hierarchy is 1. When using connect by level, by default it will commence at 1.
For this variation I just wished to demonstrate that it does not need to be coupled with CTEs at all, and hence the syntax can be quite concise.
SELECT
seq.val AS missing_sequence_value
FROM (
SELECT
level - 1 AS val
FROM dual
CONNECT BY LEVEL <= (SELECT max(val) FROM the_table)
) seq
LEFT JOIN the_table t
ON seq.val = t.val
WHERE t.val IS NULL
ORDER BY
missing_sequence_value
;

How can I select adjacent rows to an arbitrary row (in sql or postgresql)?

I want to select some rows based on certain criteria, and then take one entry from that set and the 5 rows before it and after it.
Now, I can do this numerically if there is a primary key on the table, (e.g. primary keys that are numerically 5 less than the target row's key and 5 more than the target row's key).
So select the row with the primary key of 7 and the nearby rows:
select primary_key from table where primary_key > (7-5) order by primary_key limit 11;
2
3
4
5
6
-=7=-
8
9
10
11
12
But if I select only certain rows to begin with, I lose that numeric method of using primary keys (and that was assuming the keys didn't have any gaps in their order anyway), and need another way to get the closest rows before and after a certain targeted row.
The primary key output of such a select might look more random and thus less susceptible to mathematical locating (since some results would be filtered out, e.g. with a where active=1):
select primary_key from table where primary_key > (34-5)
and active=1 order by primary_key limit 11;
30
-=34=-
80
83
100
113
125
126
127
128
129
Note how, due to the gaps in the primary keys caused by the example where condition (for example because there are many inactive items), I'm no longer getting the closest 5 above and 5 below; instead I'm getting the closest 1 below and the closest 9 above.
There are a lot of ways to do it if you run two queries from a programming language, but here's one way to do it in a single SQL query:
(SELECT * FROM table WHERE id >= 34 AND active = 1 ORDER BY id ASC LIMIT 6)
UNION
(SELECT * FROM table WHERE id < 34 AND active = 1 ORDER BY id DESC LIMIT 5)
ORDER BY id ASC
This would return the 5 rows above, the target row, and 5 rows below.
Here's another way to do it with the analytic functions lead and lag. It would be nice if we could use analytic functions in the WHERE clause, so instead you need to use subqueries or CTEs. Here's an example that will work with the pagila sample database.
WITH base AS (
SELECT lag(customer_id, 5) OVER (ORDER BY customer_id) lag,
lead(customer_id, 5) OVER (ORDER BY customer_id) lead,
c.*
FROM customer c
WHERE c.active = 1
AND c.last_name LIKE 'B%'
)
SELECT base.* FROM base
JOIN (
-- Select the center row, coalesce so it still works if there aren't
-- 5 rows in front or behind
SELECT COALESCE(lag, 0) AS lag, COALESCE(lead, 99999) AS lead
FROM base WHERE customer_id = 280
) sub ON base.customer_id BETWEEN sub.lag AND sub.lead
The problem with sgriffinusa's solution is that you don't know which row_number your center row will end up being. He assumed it will be row 30.
For a similar query I use analytic functions without a CTE. Something like:
select ...,
LEAD(gm.id) OVER (ORDER BY Cit DESC) as leadId,
LEAD(gm.id, 2) OVER (ORDER BY Cit DESC) as leadId2,
LAG(gm.id) OVER (ORDER BY Cit DESC) as lagId,
LAG(gm.id, 2) OVER (ORDER BY Cit DESC) as lagId2
...
where id = 25912
or leadId = 25912 or leadId2 = 25912
or lagId = 25912 or lagId2 = 25912
Such a query runs faster for me than the CTE with a join (the answer from Scott Bailey), but of course it's less elegant.
You could do this utilizing row_number() (available as of PostgreSQL 8.4). This may not be the exact syntax (I'm not familiar with postgresql), but hopefully the idea is illustrated:
SELECT *
FROM (SELECT ROW_NUMBER() OVER (ORDER BY primary_key) AS r, *
FROM table
WHERE active=1) t
WHERE 25 < r and r < 35
This will generate a first column with sequential numbers. You can use it to identify the single row and the rows above and below it.
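To avoid hardcoding the window 25..35, here is a sketch that first locates the target row's number (keeping the question's placeholder names, and assuming the target primary key is 34):
WITH numbered AS (
  SELECT ROW_NUMBER() OVER (ORDER BY primary_key) AS r, *
  FROM table
  WHERE active=1
)
SELECT n.*
FROM numbered n
JOIN numbered c ON c.primary_key = 34   -- the target row
WHERE n.r BETWEEN c.r - 5 AND c.r + 5
ORDER BY n.primary_key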
If you wanted to do it in a 'relationally pure' way, you could write a query that sorted and numbered the rows. Like:
select (
select count(*) from employees b
where b.name < a.name
) as idx, name
from employees a
order by name
Then use that as a common table expression. Write a select which filters it down to the rows you're interested in, then join it back onto itself using a criterion that the index of the right-hand copy of the table is no more than k larger or smaller than the index of the row on the left. Project over just the rows on the right. Like:
with numbered_emps as (
select (
select count(*)
from employees b
where b.name < a.name
) as idx, name
from employees a
order by name
)
select b.*
from numbered_emps a, numbered_emps b
where a.name like '% Smith' -- this is your main selection criterion
and ((b.idx - a.idx) between -5 and 5) -- this is your adjacency fuzzy-join criterion
What could be simpler!
I'd imagine the row-number based solutions will be faster, though.

ORDER BY the IN value list

I have a simple SQL query in PostgreSQL 8.3 that grabs a bunch of comments. I provide a sorted list of values to the IN construct in the WHERE clause:
SELECT * FROM comments WHERE (comments.id IN (1,3,2,4));
This returns comments in an arbitrary order, which in my case happens to be ids like 1,2,3,4.
I want the resulting rows sorted like the list in the IN construct: (1,3,2,4).
How to achieve that?
You can do it quite easily with VALUES lists (introduced in PostgreSQL 8.2).
The syntax will be like this:
select c.*
from comments c
join (
values
(1,1),
(3,2),
(2,3),
(4,4)
) as x (id, ordering) on c.id = x.id
order by x.ordering
In Postgres 9.4 or later, this is simplest and fastest:
SELECT c.*
FROM comments c
JOIN unnest('{1,3,2,4}'::int[]) WITH ORDINALITY t(id, ord) USING (id)
ORDER BY t.ord;
WITH ORDINALITY was introduced in Postgres 9.4.
No need for a subquery, we can use the set-returning function like a table directly. (A.k.a. "table-function".)
A string literal to hand in the array instead of an ARRAY constructor may be easier to implement with some clients.
For convenience (optionally), copy the column name we are joining to ("id" in the example), so we can join with a short USING clause to only get a single instance of the join column in the result.
Works with any input type. If your key column is of type text, provide something like '{foo,bar,baz}'::text[].
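For instance, assuming a hypothetical text key column slug on comments:
SELECT c.*
FROM comments c
JOIN unnest('{foo,bar,baz}'::text[]) WITH ORDINALITY t(slug, ord) USING (slug)
ORDER BY t.ord;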
Detailed explanation:
PostgreSQL unnest() with element number
Just because it is so difficult to find and deserves to be spread: in MySQL this can be done much more simply, but I don't know if it works in other SQL dialects.
SELECT * FROM `comments`
WHERE `comments`.`id` IN ('12','5','3','17')
ORDER BY FIELD(`comments`.`id`,'12','5','3','17')
With Postgres 9.4 this can be done a bit shorter:
select c.*
from comments c
join (
select *
from unnest(array[43,47,42]) with ordinality
) as x (id, ordering) on c.id = x.id
order by x.ordering;
Or a bit more compact without a derived table:
select c.*
from comments c
join unnest(array[43,47,42]) with ordinality as x (id, ordering)
on c.id = x.id
order by x.ordering
Removing the need to manually assign/maintain a position to each value.
With Postgres 9.6 this can be done using array_position():
with x (id_list) as (
values (array[42,48,43])
)
select c.*
from comments c, x
where id = any (x.id_list)
order by array_position(x.id_list, c.id);
The CTE is used so that the list of values only needs to be specified once. If that is not important this can also be written as:
select c.*
from comments c
where id in (42,48,43)
order by array_position(array[42,48,43], c.id);
I think this way is better:
SELECT * FROM "comments" WHERE ("comments"."id" IN (1,3,2,4))
ORDER BY id=1 DESC, id=3 DESC, id=2 DESC, id=4 DESC
Another way to do it in Postgres would be to use the idx function.
SELECT *
FROM comments
ORDER BY idx(array[1,3,2,4], comments.id)
Don't forget to create the idx function first, as described here: http://wiki.postgresql.org/wiki/Array_Index
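The function on that wiki page is roughly the following (a sketch; check the wiki for the canonical version):
CREATE OR REPLACE FUNCTION idx(anyarray, anyelement)
  RETURNS int AS
$$
  SELECT i FROM (
     SELECT generate_series(array_lower($1,1), array_upper($1,1))
  ) g(i)
  WHERE $1[i] = $2
  LIMIT 1;
$$ LANGUAGE sql IMMUTABLE;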
In Postgresql:
select *
from comments
where id in (1,3,2,4)
order by position(id::text in '1,3,2,4') -- caveat: plain substring matching; ids that share digits (e.g. 2 and 23) can collide
On researching this some more I found this solution:
SELECT * FROM "comments" WHERE ("comments"."id" IN (1,3,2,4))
ORDER BY CASE "comments"."id"
WHEN 1 THEN 1
WHEN 3 THEN 2
WHEN 2 THEN 3
WHEN 4 THEN 4
END
However this seems rather verbose and might have performance issues with large datasets.
Can anyone comment on these issues?
To do this, I think you should probably have an additional "order" table which defines the mapping of IDs to order (effectively doing what your response to your own question said), which you can then use as an additional column in your select to sort on.
In that way, you explicitly describe the ordering you desire in the database, where it should be.
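A sketch of that idea (the table and column names here are hypothetical):
CREATE TABLE comment_order (
    comment_id integer PRIMARY KEY,
    sort       integer NOT NULL
);
INSERT INTO comment_order (comment_id, sort)
VALUES (1,1), (3,2), (2,3), (4,4);

SELECT c.*
FROM comments c
JOIN comment_order o ON o.comment_id = c.id
ORDER BY o.sort;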
sans SEQUENCE, works only on 8.4:
select * from comments c
join
(
select id, row_number() over() as id_sorter
from (select unnest(ARRAY[1,3,2,4]) as id) as y
) x on x.id = c.id
order by x.id_sorter
SELECT * FROM "comments" JOIN (
SELECT 1 as "id",1 as "order" UNION ALL
SELECT 3,2 UNION ALL SELECT 2,3 UNION ALL SELECT 4,4
) j ON "comments"."id" = j."id" ORDER BY j.ORDER
or if you prefer evil over good:
SELECT * FROM "comments" WHERE ("comments"."id" IN (1,3,2,4))
ORDER BY POSITION(',' || "comments"."id" || ',' IN ',1,3,2,4,')
And here's another solution that works and uses a constant table (http://www.postgresql.org/docs/8.3/interactive/sql-values.html):
SELECT * FROM comments AS c,
(VALUES (1,1),(3,2),(2,3),(4,4) ) AS t (ord_id,ord)
WHERE (c.id IN (1,3,2,4)) AND (c.id = t.ord_id)
ORDER BY ord
But again I'm not sure that this is performant.
I've got a bunch of answers now. Can I get some voting and comments so I know which is the winner!
Thanks All :-)
create sequence serial start 1;
select * from comments c
join (select unnest(ARRAY[1,3,2,4]) as id, nextval('serial') as id_sorter) x
on x.id = c.id
order by x.id_sorter;
drop sequence serial;
[EDIT]
unnest is not yet built into 8.3, but you can create one yourself (the beauty of any*):
create function unnest(anyarray) returns setof anyelement
language sql as
$$
select $1[i] from generate_series(array_lower($1,1),array_upper($1,1)) i;
$$;
That function can work with any type:
select unnest(array['John','Paul','George','Ringo']) as beatle
select unnest(array[1,3,2,4]) as id
A slight improvement over the version that uses a sequence, I think:
CREATE OR REPLACE FUNCTION in_sort(anyarray, OUT id anyelement, OUT ordinal int)
RETURNS SETOF record
LANGUAGE SQL AS
$$
SELECT $1[i], i FROM generate_series(array_lower($1,1),array_upper($1,1)) i;
$$;
SELECT
*
FROM
comments c
INNER JOIN (SELECT * FROM in_sort(ARRAY[1,3,2,4])) AS in_sort
USING (id)
ORDER BY in_sort.ordinal;
select * from comments where comments.id in
(select unnest(ids) from bbs where id=19795)
order by array_position((select ids from bbs where id=19795),comments.id)
Here, bbs is the main table that has a field called ids,
and ids is the array that stores the comments.id values.
Tested on PostgreSQL 9.6.
Let's get a visual impression of what was already said. For example, you have a table with some tasks:
SELECT a.id,a.status,a.description FROM minicloud_tasks as a ORDER BY random();
id | status | description
----+------------+------------------
4 | processing | work on postgres
6 | deleted | need some rest
3 | pending | garden party
5 | completed | work on html
And you want to order the list of tasks by its status.
The status is a list of string values:
(processing, pending, completed, deleted)
The trick is to give each status value an integer and order the list numerically:
SELECT a.id,a.status,a.description FROM minicloud_tasks AS a
JOIN (
VALUES ('processing', 1), ('pending', 2), ('completed', 3), ('deleted', 4)
) AS b (status, id) ON (a.status = b.status)
ORDER BY b.id ASC;
Which leads to:
id | status | description
----+------------+------------------
4 | processing | work on postgres
3 | pending | garden party
5 | completed | work on html
6 | deleted | need some rest
Credit: @user80168
I agree with all the other posters who say "don't do that" or "SQL isn't good at that". If you want to sort by some facet of comments, then add another integer column to one of your tables to hold your sort criteria and sort by that value, e.g. ORDER BY comments.sort DESC. If you want to sort these in a different order every time, then... SQL won't be for you in this case.