SQL/PostgreSQL: How to select a limited number of rows of different types based on limits stored in a different table?

I have a table (table 1) whose first column is the key and whose second column contains elements of different types. In table 1 there are three types (A, B, C), but the actual database has many more types.
Table 1. A minimal example.

| _Key | attribute |
|------|-----------|
| k1   | A         |
| k2   | A         |
| k3   | B         |
| k4   | C         |
| k5   | C         |
From table 1, I am interested in retrieving only a limited number of elements of each type. The number of elements allowed for a given type is provided by table 2, in which the element type is the key of the table (_Element).
To clarify: the number of elements of type A to obtain from table 1 in this minimal example is 1. Likewise, for type B it is 2 and for type C it is 1.
Table 2. Limits of items to obtain for each type in table 1.

| _Element | Limit |
|----------|-------|
| A        | 1     |
| B        | 2     |
| C        | 1     |
Finally, the elements should be retrieved from table 1 from top to bottom (that is, in ascending key order).
Thanks for any help and/or pointers / gus.
P.S.
For the above minimal example, the expected output would be:

| Key | Attribute |
|-----|-----------|
| k1  | A         |
| k3  | B         |
| k4  | C         |
The output contains a single C row (k4) because the limit for type C is 1. Note that if the limit for C had instead been 2 and, say, 5 elements of type C existed, the following table would have been obtained:
| Key | Attribute |
|-----|-----------|
| k1  | A         |
| k3  | B         |
| k4  | C         |
| k5  | C         |

You can always do it with a union.
SELECT TOP (SELECT Limit FROM Table2 WHERE _Element = 'A') * FROM Table1
WHERE attribute = 'A'
UNION ALL
SELECT TOP (SELECT Limit FROM Table2 WHERE _Element = 'B') * FROM Table1
WHERE attribute = 'B'
UNION ALL
SELECT TOP (SELECT Limit FROM Table2 WHERE _Element = 'C') * FROM Table1
WHERE attribute = 'C'
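Note that TOP is SQL Server syntax. Since the question is tagged PostgreSQL, a minimal sketch of the same idea there would use LIMIT, which also accepts a scalar subquery (assuming the limit column was created quoted as "Limit", since LIMIT is a reserved word in Postgres):

-- one branch; repeat per type, or parenthesise each branch and combine them with UNION ALL
SELECT *
FROM Table1
WHERE attribute = 'A'
ORDER BY _Key
LIMIT (SELECT "Limit" FROM Table2 WHERE _Element = 'A');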
Or using row_number:
WITH cte AS (
    SELECT _Key,
           attribute,
           ROW_NUMBER() OVER (PARTITION BY attribute ORDER BY _Key ASC) AS rowno
    FROM Table1
)
SELECT cte._Key, cte.attribute
FROM cte
JOIN Table2 ON Table2._Element = cte.attribute
WHERE cte.rowno <= Table2.Limit

I truly like the power of PostgreSQL arrays. So
select
    table2._element,
    unnest((array_agg(table1._key order by table1._key desc))[1:table2.limit]) as _key
from
    table1 join table2 on (table1.attribute = table2._element)
group by
    table2._element, table2.limit
where, in the second field of the query:
array_agg(table1._key order by table1._key desc) collects values into an array in the specified order (note that order by table1._key desc is just an example; you may want to skip it or specify a different one),
(...)[1:table2.limit] returns array elements 1 through table2.limit,
unnest(...) unwraps the previous result into rows.
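For the sample data above, switching the ordering to order by table1._key asc (to honour the top-to-bottom requirement) should return roughly:

 _element | _key
----------+------
 A        | k1
 B        | k3
 C        | k4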

Related

Comparing aggregated columns to non-aggregated columns to remove matches

I have two separate tables from two different databases on which I am performing a matching check.
If the values match, I want them out of the result set. The first table (A) has multiple entries with the same symbol that match the corresponding columns in the second table (B).
The entries in table B, if added up, should ideally equal the value of one of the matching rows of A.
The tables look like below when queried separately.
Underneath the tables is what my query currently looks like. I thought that if I grouped the columns by symbol, I could compare the SUM of B against the value of A, which would get rid of those entries. However, because I am summing from B and not from A, A's value doesn't count as an aggregated column, so it must be included in the GROUP BY, which prevents the summing from working the way I want.
How would I be able to run this query so that the values in B are all summed up and then, if they match the symbol/value of any of the entries in A, don't get included in the result set?
Table A
| Symbol | Value |
|--------|-------|
| A | 1000 |
| A | 1000 |
| B | 1440 |
| B | 1440 |
| C | 1235 |
Table B
| Symbol | Value |
|--------|-------|
| A      | 750   |
| A      | 250   |
| B      | 24    |
| B      | 1416  |
| C      | 1874  |
SELECT DBA.A, DBB.B
FROM DatabaseA DBA
INNER JOIN DatabaseB DBB on DBA.Symbol = DBB.Symbol
and DBA.Value != DBB.Value
group by DBA.Symbol, DBB.Symbol, DBB.Value
having SUM(DBB.Value) != DBA.Value
order by Symbol, Value
Edited to add ideal results
Table C
| SymbolB| ValueB| SymbolA | ValueA |
|--------|-------|---------|--------|
| C | 1874 | C | 1235 |
Wherever B's values add up to A's value, remove both. If they don't add up, leave the numbers in the result set.
I will use a common table expression (CTE) to compute the per-symbol totals of table B, then join table A and table B on Symbol.
WITH tDBB as (
SELECT DBB.Symbol, SUM(DBB.Value) as total
FROM tableB as DBB
GROUP BY DBB.Symbol
)
SELECT distinct DBB.Symbol as SymbolB, DBB.Value as ValueB, DBA.Symbol as SymbolA, DBA.Value as ValueA
FROM tableA as DBA
INNER JOIN tableB as DBB on DBA.Symbol = DBB.Symbol
WHERE DBA.Symbol in (Select Symbol from tDBB)
AND NOT DBA.Value in (Select total from tDBB)
Result:
| SymbolB | ValueB | SymbolA | ValueA |
|---------|--------|---------|--------|
| C       | 1874   | C       | 1235   |
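Note that NOT DBA.Value in (Select total from tDBB) compares each value in A against every symbol's total, not just its own symbol's. It happens to work for the sample data, but a stricter per-symbol sketch (same table and column names as above) would join the totals directly:

WITH tDBB AS (
    SELECT Symbol, SUM(Value) AS total
    FROM tableB
    GROUP BY Symbol
)
SELECT DISTINCT DBB.Symbol AS SymbolB, DBB.Value AS ValueB,
                DBA.Symbol AS SymbolA, DBA.Value AS ValueA
FROM tableA AS DBA
INNER JOIN tableB AS DBB ON DBA.Symbol = DBB.Symbol
INNER JOIN tDBB ON tDBB.Symbol = DBA.Symbol
WHERE DBA.Value <> tDBB.total;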
with t3 as (
select symbol
,sum(value) as value
from t2
group by symbol
)
select *
from t3 join t on t.symbol = t3.symbol and t.value != t3.value
| symbol | value | Symbol | Value |
|--------|-------|--------|-------|
| C      | 1874  | C      | 1235  |

How to generate a sequence of IDs based on mapping tables and form values in MS-Access (SQL)?

I want to generate IDs based on the form values in MS-Access, and then, for each ID generated, create a group of IDs by adding another 4 digits at the end based on a mapping table, representing different four-digit suffixes for different time points (12 IDs based on the initial ID and the mapping table).
For example, if the ID generated based on form values is 123456, I want to add another four digits and create a group of IDs, say from a mapping table. Like:
1234561111
1234561112
1234561113
So for each primary ID, I am slapping on four digits at the end and generating a group of 12 IDs.
I am a beginner in Access and I have tried some code:
UPDATE Table1 SET GenID = UPDATE Table1 SET Table1.GenID = t1 (SELECT Map.V FROM MAP as t1)
However, I get an error that Access does not recognize Map as a valid field or expression. I was able to break the problem down this far, but could not find a way to go further and design the query.
Sample data (the Short ID and Long ID tables use the mapping tables shown below each of them):
Short ID Table:

| ID | Subject_ID | Organ_Type | Category | Short_ID  |
|----|------------|------------|----------|-----------|
| 1  | 100        | Kidney     | A        | 100200300 |
| 2  | 400        | Heart      | B        | 400500600 |
Mapping Tables for Short ID:
Map1 for Table1:

| Map_from | Map_to |
|----------|--------|
| Kidney   | 200    |
| Heart    | 500    |

Map2 for Table1:

| Map_cat_from | Map_cat_to |
|--------------|------------|
| A            | 300        |
| B            | 600        |
Long ID Table (shown here for just 2 time points rather than 12):

| Subject_ID | Short_ID  | Long_ID Timepoint |
|------------|-----------|-------------------|
| 100        | 100200300 | 1002003000001     |
| 100        | 100200300 | 1002003000002     |
| 400        | 400500600 | 4005006000001     |
| 400        | 400500600 | 4005006000002     |
Timepoint Map for Long ID Table:

| Timepoint | Value_to_append |
|-----------|-----------------|
| 1         | 0001            |
| 2         | 0002            |
I need to generate these short and long IDs from the mapping tables directly when input is given in the form (Category, Organ_Type, Subject_ID).
TL;DR:
generate an ID from the mapping tables and form values (short ID creation)
append four digits at the end and create a group of 12 IDs (long ID creation) based on a mapping table (which holds the 12 four-digit values to be appended at the end)
First, create a query, QShortID:
SELECT
Table1.ID,
Table1.Subject_ID,
Table1.Organ_Type,
Table1.Category,
[Subject_ID] & [Map_to] & [Map_cat_to] AS Short_ID
FROM
(Table1
INNER JOIN
Map1
ON Table1.Organ_Type = Map1.Map_from)
INNER JOIN
Map2
ON Table1.Category = Map2.Map_cat_from;
Output:

| ID | Subject_ID | Organ_Type | Category | Short_ID  |
|----|------------|------------|----------|-----------|
| 1  | 100        | Kidney     | A        | 100200300 |
| 2  | 400        | Heart      | B        | 400500600 |
Next, create a query, Dozen, that will return 12 rows:
SELECT DISTINCT
    Abs([id] Mod 12) AS N
FROM
    MSysObjects;
(Abs([id] Mod 12) over the system table MSysObjects yields the distinct values 0 through 11, provided that table has enough rows, which it normally does.)
Finally, create a Cartesian (multiplying) query, QLongID:
SELECT
QShortID.Subject_ID,
QShortID.Short_ID,
[Short_ID] & Format([N] + 1, "0000") AS Long_ID
FROM
QShortID,
Dozen
ORDER BY
[Short_ID] & Format([N] + 1, "0000");
Output (first rows; twelve per Short_ID in total):

| Subject_ID | Short_ID  | Long_ID       |
|------------|-----------|---------------|
| 100        | 100200300 | 1002003000001 |
| 100        | 100200300 | 1002003000002 |
| ...        | ...       | ...           |
Edit:
To use the timepoint mapping, use:
SELECT
QShortID.Subject_ID,
QShortID.Short_ID,
[Short_ID] & [Value_to_append] AS Long_ID
FROM
QShortID,
TimepointMap
ORDER BY
[Short_ID] & [Value_to_append];
Output:

| Subject_ID | Short_ID  | Long_ID       |
|------------|-----------|---------------|
| 100        | 100200300 | 1002003000001 |
| 100        | 100200300 | 1002003000002 |
| 400        | 400500600 | 4005006000001 |
| 400        | 400500600 | 4005006000002 |

Best Way to Join One Column on Columns From Two Other Tables

I have a schema like the following in Oracle
Section:
+--------+----------+
| sec_ID | group_ID |
+--------+----------+
| 1 | 1 |
| 2 | 1 |
| 3 | 2 |
| 4 | 2 |
+--------+----------+
Section_to_Item:
+--------+---------+
| sec_ID | item_ID |
+--------+---------+
| 1 | 1 |
| 1 | 2 |
| 2 | 3 |
| 2 | 4 |
+--------+---------+
Item:
+---------+------+
| item_ID | data |
+---------+------+
| 1 | a |
| 2 | b |
| 3 | c |
| 4 | d |
+---------+------+
Item_Version:
+---------+----------+--------+
| item_ID | start_ID | end_ID |
+---------+----------+--------+
| 1 | 1 | |
| 2 | 1 | 3 |
| 3 | 2 | |
| 4 | 1 | 2 |
+---------+----------+--------+
Section_to_Item has FK into Section and Item on the *_ID columns.
Item_version is indexed on item_ID but has no FK to Item.item_ID (ran out of space in the snapshot group).
I have code that receives a list of version IDs and I want to get all items in sections in a given group that are valid for at least one of the versions passed in. If an item has no end_ID, it's valid for anything starting with start_ID. If it has an end_id, it's valid for anything up until (not including) end_ID.
What I currently have is:
SELECT Item.data
FROM Section, Section_to_Item, Item, Item_Version
WHERE Section.group_ID = 1
AND Section_to_Item.sec_ID = Section.sec_ID
AND Item.item_ID = Section_to_Item.item_ID
AND Item.item_ID = Item_Version.item_ID
AND EXISTS (
    SELECT *
    FROM (
        SELECT 2 AS version FROM DUAL
        UNION ALL SELECT 3 AS version FROM DUAL
    ) passed_versions
    WHERE Item_Version.start_ID <= passed_versions.version
    AND (Item_Version.end_ID IS NULL OR Item_Version.end_ID > passed_versions.version)
)
Note that the UNION ALL statement is dynamically generated from the list of passed in versions.
This query currently does a cartesian join and is very slow.
For some reason, if I change the query to join
AND Item_Version.item_ID = Section_to_Item.item_ID
which is not a FK, the query does not do the cartesian join and is much faster.
A) Can anyone explain why this is?
B) Is this the right way to be joining this sequence of tables (I feel weird about joining Item.item_ID to two different tables)
C) Is this the right way to get versions between start_ID and end_ID?
Edit
Same query with inner join syntax:
SELECT Item.data
FROM Item
INNER JOIN Section_to_Item ON Section_to_Item.item_ID = Item.item_ID
INNER JOIN Section ON Section.sec_ID = Section_to_Item.sec_ID
INNER JOIN Item_Version ON Item_Version.item_ID = Item.item_ID
WHERE Section.group_ID = 1
AND EXISTS (
    SELECT *
    FROM (
        SELECT 2 AS version FROM DUAL
        UNION ALL SELECT 3 AS version FROM DUAL
    ) passed_versions
    WHERE Item_Version.start_ID <= passed_versions.version
    AND (Item_Version.end_ID IS NULL OR Item_Version.end_ID > passed_versions.version)
)
Note that in this case the performance difference comes from joining on Item_Version first and then joining Section_to_Item on Item_Version.item_ID.
In terms of table size, Section_to_Item, Item, and Item_Version should be similar (1000s) while Section should be small.
Edit
I just found out that apparently the schema has no FKs: the FKs specified in the schema configuration files are ignored and are just there for documentation, so there's no difference between joining on a FK column or not. That being said, by changing the joins into a cascade of SELECT ... IN subqueries, I'm able to avoid joining the entire Item table twice. I don't love the resulting query, and I don't really understand the difference, but the statistics indicate it's much less work (the A-Rows returned from the innermost scan on Section drop from 656,000 to 488; it used to be 656k starts each returning 1 row, now it's 488 starts each returning 1 row).
Edit
It turned out to be stale statistics: the two queries were equivalent the whole time, but with the incomplete statistics the DB happened to pick the correct plan only in the second form. After updating statistics, both queries generated the same plan.
I'm not sure if this is the best idea but this seems to avoid the cartesian join:
select data
from Item
where item_ID in (
    select item_ID
    from Item_Version
    where item_ID in (
        select item_ID
        from Section_to_Item
        where sec_ID in (
            select sec_ID
            from Section
            where group_ID = 1
        )
    )
    and exists (
        select 1
        from (
            select 2 as version from dual
            union all
            select 3 as version from dual
        ) versions
        where versions.version >= start_ID
        and (end_ID is null or versions.version < end_ID)
    )
)

Counting the total number of rows with SELECT DISTINCT ON without using a subquery

I am performing some queries using PostgreSQL's SELECT DISTINCT ON syntax. I would like the query to return the total number of rows alongside every result row.
Assume I have a table my_table like the following:
CREATE TABLE my_table(
    id int,
    my_field text,
    id_reference bigint
);
I then have a couple of values:
id | my_field | id_reference
----+----------+--------------
1 | a | 1
1 | b | 2
2 | a | 3
2 | c | 4
3 | x | 5
Basically my_table contains some versioned data. The id_reference is a reference to a global version of the database. Every change to the database increases the global version number; changes always add new rows to the tables (instead of updating or deleting values), and the new rows carry the new version number.
My goal is to perform a query that will only retrieve the latest values in the table, alongside with the total number of rows.
For example, in the above case I would like to retrieve the following output:
| total | id | my_field | id_reference |
+-------+----+----------+--------------+
| 3 | 1 | b | 2 |
+-------+----+----------+--------------+
| 3 | 2 | c | 4 |
+-------+----+----------+--------------+
| 3 | 3 | x | 5 |
+-------+----+----------+--------------+
My attempt is the following:
select distinct on (id)
count(*) over () as total,
*
from my_table
order by id, id_reference desc
This returns almost the correct output, except that total is the number of rows in my_table instead of the number of rows returned by the query:
total | id | my_field | id_reference
-------+----+----------+--------------
5 | 1 | b | 2
5 | 2 | c | 4
5 | 3 | x | 5
(3 rows)
As you can see it has 5 instead of the expected 3.
I can fix this by first materialising the distinct rows in a subquery:
with my_values as (
select distinct on (id)
*
from my_table
order by id, id_reference desc
)
select count(*) over (), * from my_values
Which produces my expected output.
My question: is there a way to avoid using this subquery and have something similar to count(*) over () return the result I want?
You are looking at my_table 3 ways:
to find the latest id_reference for each id
to find my_field for the latest id_reference for each id
to count the number of distinct ids in the table
I therefore prefer this solution:
select
c.id_count as total,
a.id,
a.my_field,
b.max_id_reference
from
my_table a
join
(
select
id,
max(id_reference) as max_id_reference
from
my_table
group by
id
) b
on
a.id = b.id and
a.id_reference = b.max_id_reference
join
(
select
count(distinct id) as id_count
from
my_table
) c
on true;
This is a bit longer (especially the long, thin way I write SQL), but it makes it clear what is happening. If you come back to it in a few months' time (somebody usually does), then it will take less time to understand what is going on.
The "on true" at the end is a deliberate cartesian product: there can only ever be exactly one result from subquery "c", and you do want a cartesian product with that.
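For what it's worth, the same intent can be spelled with CROSS JOIN, which makes the deliberate cartesian product explicit; a compact sketch against the same schema:

select
    c.id_count as total,
    a.id,
    a.my_field,
    b.max_id_reference
from my_table a
join (
    select id, max(id_reference) as max_id_reference
    from my_table
    group by id
) b on a.id = b.id and a.id_reference = b.max_id_reference
cross join (
    select count(distinct id) as id_count
    from my_table
) c;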
There is nothing necessarily wrong with subqueries.

Deleting similar columns in SQL

In PostgreSQL 8.3, let's say I have a table called widgets with the following:
id | type | count
--------------------
1 | A | 21
2 | A | 29
3 | C | 4
4 | B | 1
5 | C | 4
6 | C | 3
7 | B | 14
I want to remove duplicates based upon the type column, leaving only those with the highest count column value in the table. The final data would look like this:
id | type | count
--------------------
2 | A | 29
3 | C | 4 /* `id` for this record might be '5' depending on your query */
7 | B | 14
I feel like I'm close, but I can't seem to wrap my head around a query that works to get rid of the duplicate rows.
count is a SQL reserved word, so it will have to be escaped somehow; in PostgreSQL that is done with double quotes, which I've used below. In any case, the following should theoretically work (but I didn't actually test it):
delete from widgets where id not in (
    select max(w2.id)
    from widgets as w2
    inner join
        (select max(w1."count") as "count", type
         from widgets as w1
         group by w1.type) as sq
        on sq."count" = w2."count" and sq.type = w2.type
    group by w2.type
);
There is a slightly simpler answer than Asaph's, using the EXISTS SQL operator:
DELETE FROM widgets AS a
WHERE EXISTS
(SELECT * FROM widgets AS b
WHERE (a.type = b.type AND b.count > a.count)
OR (b.id > a.id AND a.type = b.type AND b.count = a.count))
The EXISTS operator returns TRUE if the subquery returns at least one record. The second condition (b.id > a.id with equal counts) breaks ties so that exactly one row per type survives; for the sample data this keeps ids 2, 5 (rather than 3, since ties resolve toward the higher id) and 7.
According to your requirements, it seems to me that this should work (comparing (type, count) pairs, since a plain NOT IN cannot compare one column against two):
DELETE
FROM widgets
WHERE (type, count) NOT IN
(
    SELECT type, MAX(count)
    FROM widgets
    GROUP BY type
)
Note that if two rows of the same type tie for the highest count, this keeps both.
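Since this is PostgreSQL, one more sketch: DISTINCT ON (available in 8.3) can pick the survivors directly, breaking ties deterministically toward the higher id:

DELETE FROM widgets
WHERE id NOT IN (
    SELECT DISTINCT ON (type) id
    FROM widgets
    ORDER BY type, count DESC, id DESC
);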