PL/SQL pseudo Sequencing - sql

I have the following scenario
ID SEQ
-- ---
123 2
123 4
What I want to do is produce a list of these values and fill in the missing numbers up to a maximum, say 6 (which I have from another source), where those numbers do not already exist with the ID on the table.
ID NEW_SEQ
-- ---
123 1
123 2
123 3
123 4
123 5
123 6
Thanks
C

This generates a sequence of numbers from 1 through 6, cross joins with all the ids of the table to associate each of the sequence numbers with each id, then removes the already existing combinations.
SELECT t.id, s.seq
FROM (SELECT DISTINCT id FROM myTable) t
,(SELECT rownum AS seq
FROM dual
CONNECT BY LEVEL <= 6) s
MINUS
SELECT id, seq
FROM myTable
ORDER BY 1, 2
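A minimal sketch of the same idea, run through Python's sqlite3 module so it can be tried without an Oracle instance. The recursive CTE and EXCEPT are SQLite stand-ins for CONNECT BY LEVEL and MINUS, and MAX_SEQ is a hypothetical name for the upper bound that the question says comes from another source.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE myTable (id INTEGER, seq INTEGER);
    INSERT INTO myTable VALUES (123, 2), (123, 4);
""")

MAX_SEQ = 6  # hypothetical upper bound, supplied from elsewhere

# Recursive CTE standing in for Oracle's CONNECT BY LEVEL <= 6.
numbers_cte = """
    WITH RECURSIVE s(seq) AS (
        SELECT 1
        UNION ALL
        SELECT seq + 1 FROM s WHERE seq < :max_seq
    )
"""

# Every (id, seq) combination minus the ones already stored.
missing = conn.execute(numbers_cte + """
    SELECT t.id, s.seq
    FROM (SELECT DISTINCT id FROM myTable) t CROSS JOIN s
    EXCEPT               -- SQLite's spelling of Oracle's MINUS
    SELECT id, seq FROM myTable
    ORDER BY 1, 2
""", {"max_seq": MAX_SEQ}).fetchall()

# Without the EXCEPT half, the cross join alone lists every number.
full = conn.execute(numbers_cte + """
    SELECT t.id, s.seq
    FROM (SELECT DISTINCT id FROM myTable) t CROSS JOIN s
    ORDER BY 1, 2
""", {"max_seq": MAX_SEQ}).fetchall()

print(missing)
print(full)
```

With the sample rows, missing holds the gaps (1, 3, 5 and 6 for ID 123), while full holds the complete 1 through 6 list shown in the question, so which half you keep depends on which of the two outputs you need.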

If you have a list of the numbers you want to use in OTHER_TABLE then I suggest you use an outer join, as in:
SELECT o.ID, o.NEW_SEQ
FROM OTHER_TABLE o
LEFT OUTER JOIN (SELECT ID, SEQ FROM MY_TABLE) t
ON (o.ID = t.ID AND o.NEW_SEQ = t.SEQ)
WHERE t.SEQ IS NULL
ORDER BY o.ID, o.NEW_SEQ
The outer join will include all rows from the first table (OTHER_TABLE, in this case) joined with the rows which exist from the second table (here, MY_TABLE). If there is a row in OTHER_TABLE which does not have a matching row in MY_TABLE, the fields from MY_TABLE will be NULL - thus, by checking for t.SEQ being NULL you're able to find the rows which exist in OTHER_TABLE but which are not in MY_TABLE.
SQLFiddle here.
Share and enjoy.
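The anti-join described above can be tried with a small sketch in Python's sqlite3 module. The table contents are assumptions made up for illustration: MY_TABLE holds the two rows from the question and OTHER_TABLE is assumed to hold the numbers 1 through 6 for the same ID.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE MY_TABLE (ID INTEGER, SEQ INTEGER);
    CREATE TABLE OTHER_TABLE (ID INTEGER, NEW_SEQ INTEGER);
    INSERT INTO MY_TABLE VALUES (123, 2), (123, 4);
    INSERT INTO OTHER_TABLE VALUES
        (123, 1), (123, 2), (123, 3), (123, 4), (123, 5), (123, 6);
""")

# Rows of OTHER_TABLE with no partner in MY_TABLE come back with
# t.SEQ set to NULL, so the WHERE clause keeps only those.
not_in_my_table = conn.execute("""
    SELECT o.ID, o.NEW_SEQ
    FROM OTHER_TABLE o
    LEFT OUTER JOIN MY_TABLE t
        ON o.ID = t.ID AND o.NEW_SEQ = t.SEQ
    WHERE t.SEQ IS NULL
    ORDER BY o.ID, o.NEW_SEQ
""").fetchall()

print(not_in_my_table)
```

Only the pairs missing from MY_TABLE survive the IS NULL filter; removing that filter would return all six rows of OTHER_TABLE.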

Related

SQL Joining two tables and removing the duplicates from the two tables but without losing any duplicates from the tables itself

I want to join two tables and remove duplicates from both the tables but keeping any duplicate value found in the first table.
T1
Name
-----
A
A
B
C
T2
Name
----
A
D
E
Expected result
A - > FROM T1
A - > FROM T1
B
C
D
E
I tried union, but it removes all duplicates of 'A' from both tables.
How can I achieve this?
Filter T2 before UNION ALL
select col
from T1
union all
select col
from T2
where not exists (select 1 from T1 where T1.col = T2.col)
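A small sketch of the filter-then-UNION ALL approach, using Python's sqlite3 module and the sample data from the question. The column is called Name here to match the question's tables; that renaming is an assumption of this example, not part of the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE T1 (Name TEXT);
    CREATE TABLE T2 (Name TEXT);
    INSERT INTO T1 VALUES ('A'), ('A'), ('B'), ('C');
    INSERT INTO T2 VALUES ('A'), ('D'), ('E');
""")

# UNION ALL keeps T1's own duplicates; NOT EXISTS drops every T2 row
# whose Name already appears in T1.
rows = conn.execute("""
    SELECT Name FROM T1
    UNION ALL
    SELECT Name FROM T2
    WHERE NOT EXISTS (SELECT 1 FROM T1 WHERE T1.Name = T2.Name)
""").fetchall()

# UNION ALL gives no ordering guarantee, so sort on the client side.
names = sorted(name for (name,) in rows)
print(names)
```

Both A rows from T1 are kept and the A from T2 is dropped, which matches the expected result in the question.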
Assuming you want, for each value, the number of duplicates from whichever table has the most repetitions, you can do it with the ROW_NUMBER() window function, which numbers each duplicate by its position within the set of repetitions in each table so that UNION can eliminate the matching ones.
SELECT Name FROM (
SELECT Name, ROW_NUMBER() OVER ( PARTITION BY Name ORDER BY Name ) AS Row
FROM T1
UNION
SELECT Name, ROW_NUMBER() OVER ( PARTITION BY Name ORDER BY Name ) AS Row
FROM T2
) x
ORDER BY Name
To see how this works out, we add two B rows to T2 then do this:
SELECT Name, ROW_NUMBER() OVER ( PARTITION BY Name ORDER BY Name ) AS Row
FROM T1
Name Row
A 1
A 2
B 1
C 1
SELECT Name, ROW_NUMBER() OVER ( PARTITION BY Name ORDER BY Name ) AS Row
FROM T2
Name Row
A 1
B 1
B 2
D 1
E 1
Now UNION them without ALL to combine and eliminate duplicates:
SELECT Name, ROW_NUMBER() OVER ( PARTITION BY Name ORDER BY Name ) AS Row
FROM T1
UNION
SELECT Name, ROW_NUMBER() OVER ( PARTITION BY Name ORDER BY Name ) AS Row
FROM T2
Name Row
A 1
A 2
B 1
B 2
C 1
D 1
E 1
The final query up top is then just eliminating the Row column and sorting the result, to ensure ascending order.
See SQL Fiddle for demo.
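The walkthrough above can be reproduced with a sketch in Python's sqlite3 module, assuming a SQLite build with window function support (3.25 or later). The alias rn replaces Row purely as a precaution against keyword clashes, and T2 includes the two extra B rows used in the explanation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE T1 (Name TEXT);
    CREATE TABLE T2 (Name TEXT);
    INSERT INTO T1 VALUES ('A'), ('A'), ('B'), ('C');
    INSERT INTO T2 VALUES ('A'), ('B'), ('B'), ('D'), ('E');
""")

# Each duplicate gets its own (Name, rn) pair, so plain UNION only
# merges the n-th copy in T1 with the n-th copy in T2.
rows = conn.execute("""
    SELECT Name FROM (
        SELECT Name,
               ROW_NUMBER() OVER (PARTITION BY Name ORDER BY Name) AS rn
        FROM T1
        UNION
        SELECT Name,
               ROW_NUMBER() OVER (PARTITION BY Name ORDER BY Name) AS rn
        FROM T2
    ) x
    ORDER BY Name
""").fetchall()

names = [name for (name,) in rows]
print(names)
```

A appears twice because T1 has two copies, and B appears twice because T2 has two copies, so each value ends up with the larger of its two counts.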
select * from T1
union all
select * from T2 where name not in (select distinct name from T1)
Sql Fiddle Demo
You should use "union all" instead of "union".
"union" removes duplicate records, while "union all" returns all of them.
For your result, because we filtered the intersecting rows out of table 2 in the "where" clause, we don't need "UNION ALL":
select col1 from t1
union
select col1 from t2 where t2.col1 not in(select t1.col1 from t1)
I don't know whether the following code is good practice or not, but it works:
select name from T1
UNION
select name from T2 Where name not in (select name from T1)
The above query filters T2 based on the values in T1, then combines the values of the two tables and shows the result.
I hope it helps, thanks.
Note: this is not the best way to get the result, as it affects performance.
I will update this with a better solution after more research.
You want all names from T1 and all names from T2 except the names that are in T1.
So you can use UNION ALL for the 2 cases and the operator EXCEPT to filter the rows of T2:
SELECT Name FROM T1
UNION ALL
(
SELECT Name FROM T2
EXCEPT
SELECT Name FROM T1
)
See the demo.
Results:
> | Name |
> | :--- |
> | A |
> | A |
> | B |
> | C |
> | D |
> | E |

How to write a query to delete everything except maximum value grouped by an ID?

I am trying to write a query to delete duplicate records based on a ID and a value. There are multiple rows with the same ID. Condition to get the result are (and the queries I have written as per my understanding),
1. Look for the maximum value available for the ID column in the Value column (SELECT * FROM TABLE WHERE VALUE IN (SELECT MAX(VALUE) FROM TABLE GROUP BY ID))
Example:
Table data:
ID - Value
a - 1
a - 2
a - 3
b - 2
c - 3
Output:
ID - Value
a - 3
b - 2
c - 3
2. Ignore the results from point 1 in the table (SELECT * FROM TABLE WHERE NOT EXISTS (SELECT * FROM TABLE WHERE VALUE IN (SELECT MAX(VALUE) FROM TABLE GROUP BY ID)))
Edit: I wrote a query that finally outputs the required result for point 2
SELECT t1.* FROM TABLE t1
LEFT JOIN
(
SELECT 1 AS aux, * FROM (SELECT * FROM TABLE
WHERE VALUE IN
(SELECT MAX(VALUE) FROM TABLE group by ID))
) t2
ON
t2.ID= t1.ID
and
t2.VALUE= t1.VALUE
WHERE t2.aux IS NULL
Example:
Table data:
ID - Value
a - 1
a - 2
a - 3
b - 2
c - 3
Output:
ID - Value
a - 1
a - 2
3. Use the query of point 2 to delete rows from the table (DELETE FROM TABLE WHERE (ID, VALUE) IN (SELECT * FROM TABLE WHERE NOT EXISTS (SELECT * FROM TABLE WHERE VALUE IN (SELECT MAX(VALUE) FROM TABLE GROUP BY ID))))
Example:
Table data:
ID - Value
a - 1
a - 2
a - 3
b - 2
c - 3
Table data:
ID - Value
a - 3
b - 2
c - 3
Point 2 does not work; it gives no results. When I compared the total row count of the output of the query from point 2 with the total row count of the table, there was a difference.
Since point 2 does not work, point 3 fails as well. What am I doing wrong?
After our discussion, I understand that you aim to select the rows that match each id together with its max(value). Therefore, I suggest the following script:
SELECT
DISTINCT a.*
FROM
`test-proj-261014.sample.id_value` a
RIGHT JOIN (
SELECT
id,
MAX(value) AS max_val
FROM
`test-proj-261014.sample.id_value`
GROUP BY
id
ORDER BY
id) b
ON
a.id = b.id
AND a.value = b.max_val
WHERE
a.value IS NOT NULL
ORDER BY
id;
Note that I use SELECT DISTINCT, which will not select duplicated data. In addition, due to the possibility of null values, I added the condition WHERE a.value IS NOT NULL, which excludes the rows that do not satisfy the join condition.
The above query should solve the problem. However, if you find any discrepancy with the expected number of rows, I encourage you to explore your data set and find out why there are more or fewer rows. You can use different types of joins to do so; one example would be the following query:
SELECT
a.*
FROM
`test-proj-261014.sample.id_value` a
LEFT JOIN (
SELECT
id,
MAX(value) AS max_val
FROM
`test-proj-261014.sample.id_value`
GROUP BY
id
ORDER BY
id) b
ON
a.id = b.id
AND a.value = b.max_val
WHERE
b.max_val IS NULL
ORDER BY
id;
This query retrieves all the values which are not present in the final output generated by the first query. This would help you understand better the data you are dealing with.
I hope it helps.
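For the delete itself, here is a hedged sketch in Python's sqlite3 module rather than BigQuery; the table name id_value and the correlated-subquery form are assumptions of this example, not the method from the answer above. It also shows one likely source of the row-count discrepancy: the uncorrelated VALUE IN (SELECT MAX(VALUE) ... GROUP BY ID) test matches a value that is the maximum of any ID, not only of the row's own ID.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE id_value (id TEXT, value INTEGER);
    INSERT INTO id_value VALUES
        ('a', 1), ('a', 2), ('a', 3), ('b', 2), ('c', 3);
""")

# Uncorrelated check from point 1: ('a', 2) slips through because
# 2 happens to be the maximum for id 'b'.
uncorrelated = conn.execute("""
    SELECT id, value FROM id_value
    WHERE value IN (SELECT MAX(value) FROM id_value GROUP BY id)
    ORDER BY id, value
""").fetchall()

# Correlated delete: remove every row whose value is below the
# maximum for its own id. Ties on the maximum are all kept.
conn.execute("""
    DELETE FROM id_value
    WHERE value < (SELECT MAX(v.value)
                   FROM id_value AS v
                   WHERE v.id = id_value.id)
""")

remaining = conn.execute(
    "SELECT id, value FROM id_value ORDER BY id, value"
).fetchall()

print(uncorrelated)
print(remaining)
```

The uncorrelated query returns four rows instead of three, while the correlated delete leaves exactly one maximum row per id for this sample data.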

How to select values from a column that have only specific values from another column and not other values?

I have a pgsql schema with a table that has two columns among others: id and status. The status values are of varchar type, ranging from '1' to '6'. I want to select values of id that have only specific statuses: precisely, one id having only one status ('1'), then another having two values ('1' and '2'), then another having only three values ('1', '2' and '3'), and so on.
This is for a pgsql database. I have tried using inner query joining with the same table.
select *
from srt s
join ( select id
from srt
group by id
having count(distinct status) = 2
) t on t.id = s.id
where srt.status in ('1', '2')
limit 10
I used this to get the IDs having only status values 1 and 2 (and not having any rows with status values 3, 4, 5, 6) but didn't get the expected result
The expected result would be something like this
id status
123 1
234 1
234 2
345 1
345 2
345 3
456 1
456 2
456 3
456 4
567 1
567 2
567 3
567 4
567 5
678 1
678 2
678 3
678 4
678 5
678 6
Move your where condition inside sub-query -
select *
from srt s
join ( select id
from srt
where status in ('1', '2')
group by id
having count(distinct status) = 2
) t on t.id = s.id
limit 10
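Filtering inside the sub-query finds ids that have both statuses, but on its own it does not exclude ids that also carry other statuses. The sketch below, run with Python's sqlite3 module on made-up sample rows, compares that query with a stricter variant; the extra SUM(CASE ...) condition is an assumption of this example, not part of the answer above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE srt (id INTEGER, status TEXT);
    INSERT INTO srt VALUES
        (123, '1'),
        (234, '1'), (234, '2'),
        (345, '1'), (345, '2'), (345, '3');
""")

# Ids that have both '1' and '2', whatever else they have.
has_both = [r[0] for r in conn.execute("""
    SELECT id FROM srt
    WHERE status IN ('1', '2')
    GROUP BY id
    HAVING COUNT(DISTINCT status) = 2
    ORDER BY id
""")]

# Ids that have '1' and '2' and no row with any other status.
only_those = [r[0] for r in conn.execute("""
    SELECT id FROM srt
    GROUP BY id
    HAVING COUNT(DISTINCT status) = 2
       AND SUM(CASE WHEN status NOT IN ('1', '2') THEN 1 ELSE 0 END) = 0
    ORDER BY id
""")]

print(has_both)
print(only_those)
```

Here id 345 passes the first query even though it also has status '3', while the stricter variant keeps only id 234.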
To identify the ids with consecutive statuses, you can do:
select id, max(status) as max_status
from srt s
group by id
having min(status::int) = 1 and
max(status::int) = count(*);
Then, you can narrow this down to one example using distinct on and use join to bring in your results:
select s.*
from srt s join
(select distinct on (max(status)) id, max(status) as max_status
from srt s
group by id
having min(status::int) = 1 and
max(status::int) = count(*)
order by max_status asc
) ss
on ss.id = s.id
order by ss.max_status, s.status;
This is a tricky one. My solution is to first specify a list of the "target statuses" you want to match:
with target_statuses(s) as ( values (1),(2),(3) )
Then JOIN your srt table to it and count the occurrences grouped by id.
with target_statuses(s) as ( values (1),(2),(3) )
select id, count(*), row_number() OVER (partition by count(*) order by id) rownum
from srt
join target_statuses on status=s
group by id
This query also captures a row number, which we'll later use to limit it to the first id that has one match, the first id that has two matches, etc. Note the order by clause... I assume you want the alphabetically lowest id first in each case, but you may change that.
Since you can't put a window function in a HAVING clause, I wrap up the whole result at ids_and_counts_of_statuses and perform a follow-up query that rejoins it with the srt table to output it:
with ids_and_counts_of_statuses as(
with target_statuses(s) as ( values (1),(2),(3) )
select id, count(*), row_number() OVER (partition by count(*) order by id) rownum
from srt
join target_statuses on status=s
group by id
)
select srt.id, srt.status
from ids_and_counts_of_statuses
join srt on ids_and_counts_of_statuses.id=srt.id
where rownum=1;
Note that I have changed your varchar values to integers just so I didn't have to type quite so much punctuation. It works, here's an example: https://www.db-fiddle.com/f/wwob31uiNgr9aAkZoe1Jgs/0

SQLite - Return Rows Even If They Are Duplicates

I have a simple SQLite table which has just one ID column.
I have some variable IDs that may be duplicates of each other like: 1,2,3,4,3,1 (These IDs are just examples, there could be hundreds of them).
And I have a simple query as follows:
SELECT ID FROM TABLE WHERE ID in (1,2,3,4,3,1)
In the usual case the answer contains only 4 rows with ids 1,2,3,4. Is there any way to force SQLite to return rows in the order of the request (1,2,3,4,3,1) even if they are duplicates?
I have n IDs in my query and I want n rows in return even if they are duplicates.
Edit: The Table Definition is:
CREATE TABLE TEST(ID TEXT PRIMARY KEY)
You can use left join:
select t.*
from (select 1 as id, 1 as ord union all
select 2 as id, 2 as ord union all
select 3 as id, 3 as ord union all
select 4 as id, 4 as ord union all
select 3 as id, 5 as ord union all
select 1 as id, 6 as ord
) ids left join
t
on t.id = ids.id
order by ids.ord;
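A sketch of the same left join driven from Python's sqlite3 module, where the list of requested IDs is built at run time instead of being typed out as UNION ALL branches. The variable names and the VALUES-based CTE are assumptions of this example; the TEST table follows the definition in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TEST (ID TEXT PRIMARY KEY);
    INSERT INTO TEST VALUES ('1'), ('2'), ('3'), ('4');
""")

requested = ['1', '2', '3', '4', '3', '1']  # hypothetical input list

# One "(?, ?)" pair per requested id: the id itself and its position.
placeholders = ", ".join("(?, ?)" for _ in requested)
params = []
for position, wanted_id in enumerate(requested):
    params.extend([wanted_id, position])

sql = f"""
    WITH ids(id, ord) AS (VALUES {placeholders})
    SELECT t.ID
    FROM ids
    LEFT JOIN TEST t ON t.ID = ids.id
    ORDER BY ids.ord
"""
result = [row[0] for row in conn.execute(sql, params)]
print(result)
```

Each requested id yields one output row in request order, and an id that is absent from TEST would come back as None because of the left join. Very long lists may run into SQLite's limit on bound parameters, in which case a temporary table is an alternative.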

What is the difference between Postgres DISTINCT vs DISTINCT ON?

I have a Postgres table created with the following statement. This table is filled by a dump of data from another service.
CREATE TABLE data_table (
date date DEFAULT NULL,
dimension1 varchar(64) DEFAULT NULL,
dimension2 varchar(128) DEFAULT NULL
) TABLESPACE pg_default;
One of the steps in an ETL I'm building is extracting the unique values of dimension1 and inserting them in another intermediary table.
However, during some tests I found out that the 2 commands below do not return the same results. I would expect both to return the same count.
The first command returns a larger number than the second (1504 vs. 1466).
-- command 1
SELECT DISTINCT count(dimension1)
FROM data_table;
-- command 2
SELECT count(*)
FROM (SELECT DISTINCT ON (dimension1) dimension1
FROM data_table
GROUP BY dimension1) AS tmp_table;
Any obvious explanations for this? Alternatively to an explanation, is there any suggestion of any check on the data I should do?
EDIT: The following queries both return 1504 (same as the "simple" DISTINCT)
SELECT count(*)
FROM data_table WHERE dimension1 IS NOT NULL;
SELECT count(dimension1)
FROM data_table;
Thank you!
DISTINCT and DISTINCT ON have completely different semantics.
First the theory
DISTINCT applies to an entire tuple. Once the result of the query is computed, DISTINCT removes any duplicate tuples from the result.
For example, assume a table R with the following contents:
#table r;
a | b
---+---
1 | a
2 | b
3 | c
3 | d
2 | e
1 | a
(6 rows)
SELECT distinct * from R will result in:
# select distinct * from r;
a | b
---+---
1 | a
3 | d
2 | e
2 | b
3 | c
(5 rows)
Note that distinct applies to the entire list of projected attributes: thus
select distinct * from R
is semantically equivalent to
select distinct a,b from R
You cannot issue
select a, distinct b From R
DISTINCT must follow SELECT. It applies to the entire tuple, not to an attribute of the result.
DISTINCT ON is a postgresql addition to the language. It is similar, but not identical, to group by.
Its syntax is:
SELECT DISTINCT ON (attributeList) <rest as any query>
For example:
SELECT DISTINCT ON (a) * from R
Its semantics can be described as follows. Compute the query as usual, without the DISTINCT ON (a), but before the projection of the result, sort the current result and group it according to the attribute list in DISTINCT ON (similar to group by). Now, do the projection using the first tuple in each group and ignore the other tuples.
Example:
select * from r order by a;
a | b
---+---
1 | a
2 | e
2 | b
3 | c
3 | d
(5 rows)
Then for every different value of a (in this case, 1, 2 and 3), take the first tuple. Which is the same as:
SELECT DISTINCT on (a) * from r;
a | b
---+---
1 | a
2 | b
3 | c
(3 rows)
Some DBMS (most notably sqlite) will allow you to run this query:
SELECT a,b from R group by a;
And this gives you a similar result.
Postgresql will allow this query if and only if there is a functional dependency from a to b. In other words, this query will be valid if, for any instance of the relation R, there is only one unique tuple for every value of a (thus selecting the first tuple is deterministic: there is only one tuple).
For instance, if the primary key of R is a, then a->b and:
SELECT a,b FROM R group by a
is identical to:
SELECT DISTINCT on (a) a, b from r;
Now, back to your problem:
First query:
SELECT DISTINCT count(dimension1)
FROM data_table;
computes the count of dimension1 (the number of tuples in data_table where dimension1 is not null). This query
returns one tuple, which is always unique (hence DISTINCT
is redundant).
Query 2:
SELECT count(*)
FROM (SELECT DISTINCT ON (dimension1) dimension1
FROM data_table
GROUP BY dimension1) AS tmp_table;
This is a query within a query. Let me rewrite it for clarity:
WITH tmp_table AS (
SELECT DISTINCT ON (dimension1)
dimension1 FROM data_table
GROUP by dimension1)
SELECT count(*) from tmp_table
Let us compute first tmp_table. As I mentioned above,
let us first ignore the DISTINCT ON and do the rest of the
query. This is a group by by dimension1. Hence this part of the query
will result in one tuple per different value of dimension1.
Now, the DISTINCT ON. It uses dimension1 again. But dimension1 is unique already (due to the group by). Hence
this makes the DISTINCT ON superfluous (it does nothing).
The final count is simply a count of all the tuples in the group by.
As you can see, the following queries are equivalent (this applies to any relation with an attribute a):
SELECT DISTINCT ON (a) a
FROM R
and
SELECT a FROM R group by a
and
SELECT DISTINCT a FROM R
Warning
A query that uses DISTINCT ON might be non-deterministic for a given instance of the database, unless an ORDER BY fixes which tuple comes first in each group.
In other words, the query might return different results for the same tables.
One interesting aspect
Distinct ON emulates a bad behaviour of sqlite in a much cleaner way. Assume that R has two attributes a and b:
SELECT a, b FROM R group by a
is an illegal statement in SQL. Yet, it runs on sqlite. It simply takes a random value of b from any of the tuples in the group of same values of a.
In Postgresql this statement is illegal. Instead, you must use DISTINCT ON and write:
SELECT DISTINCT ON (a) a,b from R
Corollary
DISTINCT ON is useful in a group by when you want to access a value that is functionally dependent on the group by attributes. In other words, if you know that for every group of attributes they always have the same value of the third attribute, then use DISTINCT ON that group of attributes. Otherwise you would have to make a JOIN to retrieve that third attribute.
The first query gives the number of not null values of dimension1, while the second one returns the number of distinct values of the column. These numbers obviously are not equal if the column contains duplicates or nulls.
The word DISTINCT in
SELECT DISTINCT count(dimension1)
FROM data_table;
makes no sense, as the query returns a single row. Maybe you wanted
SELECT count(DISTINCT dimension1)
FROM data_table;
which returns the number of distinct not null values of dimension1. Note, that it is not the same as
SELECT count(*)
FROM (
SELECT DISTINCT ON (dimension1) dimension1
FROM data_table
-- GROUP BY dimension1 -- redundant
) AS tmp_table;
The last query yields the number of all (null or not) distinct values of the column.
To learn and understand what happens, it helps to work through a visual example.
Here's a bit of SQL to execute on PostgreSQL:
DROP TABLE IF EXISTS test_table;
CREATE TABLE test_table (
id int NOT NULL primary key,
col1 varchar(64) DEFAULT NULL
);
INSERT INTO test_table (id, col1) VALUES
(1,'foo'), (2,'foo'), (3,'bar'), (4,null);
select count(*) as total1 from test_table;
-- returns: 4
-- Because the table has 4 records.
select distinct count(*) as total2 from test_table;
-- returns: 4
-- The count(*) is just one value. Making 1 total unique can only result in 1 total.
-- So the distinct is useless here.
select col1, count(*) as total3 from test_table group by col1 order by col1;
-- returns 3 rows: ('bar',1),('foo',2),(NULL,1)
-- Since there are 3 unique col1 values. NULL's are included.
select distinct col1, count(*) as total4 from test_table group by col1 order by col1;
-- returns 3 rows: ('bar',1),('foo',2),(NULL,1)
-- The result is already grouped, and therefore already unique.
-- So again, the distinct does nothing extra here.
select count(distinct col1) as total5 from test_table;
-- returns 2
-- NULL's aren't counted in a count by value. So only 'foo' & 'bar' are counted
select distinct on (col1) id, col1 from test_table order by col1 asc, id desc;
-- returns 3 rows: (3,'bar'),(2,'foo'),(4,NULL)
-- So it gets the records with the maximum id per unique col1
-- Note that the "order by" matters here. Changing that DESC to ASC would get the minimum id.
select count(*) as total6 from (select distinct on (col1) id, col1 from test_table order by col1 asc, id desc) as q;
-- returns 3.
-- After seeing the previous query, what else would one expect?
select distinct col1 from test_table order by col1;
-- returns 3 unique values : ('bar'),('foo'),(null)
select distinct id, col1 from test_table order by col1;
-- returns all records.
-- Because id is the primary key and therefore makes each returned row unique
Here's a more direct summary that might be useful for Googlers, answering the title but not the intricacies of the full post:
SELECT DISTINCT
availability: ISO
behaviour:
SELECT DISTINCT col1, col2, col3 FROM mytable
returns col1, col2 and col3 and omits any rows in which all of the tuple (col1, col2, col3) are the same. E.g. you could get a result like:
1 2 3
1 2 4
because those two rows are not identical due to the 4. But you could never get:
1 2 3
1 2 4
1 2 3
because 1 2 3 appears twice, and both rows are exactly the same. That is what DISTINCT prevents.
vs GROUP BY: SELECT DISTINCT is basically a subset of GROUP BY where you can't use aggregate functions: Is there any difference between GROUP BY and DISTINCT
SELECT DISTINCT ON
availability: PostgreSQL extension, WONTFIXED by SQLite
behavior: unlike DISTINCT, DISTINCT ON allows you to separate
what you want to be unique
from what you want to return
E.g.:
SELECT DISTINCT ON(col1) col2, col3 FROM mytable
returns col2 and col3, and does not return any two rows with the same col1. E.g.:
1 2 3
1 4 5
could not happen, because we have 1 twice on col1.
And e.g.:
SELECT DISTINCT ON(col1, col2) col2, col3 FROM mytable
would prevent any duplicated (col1, col2) tuples, e.g. you could get:
1 2 3
1 4 3
as it has different (1, 2) and (1, 4) tuples, but not:
1 2 3
1 2 4
where (1, 2) happens twice, only one of those two could appear.
We can uniquely determine which one of the possible rows will be selected with ORDER BY which guarantees that the first match is taken, e.g.:
SELECT DISTINCT ON(col1, col2) col2, col3 FROM mytable
ORDER BY col1 DESC, col2 DESC, col3 DESC
would ensure that among:
1 2 3
1 2 4
only 1 2 4 would be picked as it happens first on our DESC sorting.
vs GROUP BY: DISTINCT ON is not a subset of GROUP BY because it allows you to access extra columns not present in the GROUP BY, which is generally not allowed in GROUP BY, unless:
you group by primary key in Postgres (unique not null is a TODO for them)
or if that is allowed as an ISO extension as in SQLite/MySQL
This makes DISTINCT ON extremely useful to fulfill the common use case of "find the full row that reaches the maximum/minimum of some column": Is there any difference between GROUP BY and DISTINCT
E.g. to find the city of each country that has the most sales:
SELECT DISTINCT ON ("country") "country", "city", "amount"
FROM "Sales"
ORDER BY "country" ASC, "amount" DESC, "city" ASC
or equivalently with * if we want all columns:
SELECT DISTINCT ON ("country") *
FROM "Sales"
ORDER BY "country" ASC, "amount" DESC, "city" ASC
Here each country appears only once, within each country we then sort by amount DESC and take the first, and therefore highest, amount.
RANK and ROW_NUMBER window functions
These can be used basically as supersets of DISTINCT ON, and are implemented in both SQLite and PostgreSQL (tested as of SQLite 3.34 and PostgreSQL 14.3). I highly recommend also looking into them, see e.g.: How to SELECT DISTINCT of one column and get the others?
This is how the above "city with the highest amount of each country" query would look like with ROW_NUMBER:
SELECT *
FROM (
SELECT
ROW_NUMBER() OVER (
PARTITION BY "country"
ORDER BY "amount" DESC, "city" ASC
) AS "rnk",
*
FROM "Sales"
) sub
WHERE
"sub"."rnk" = 1
ORDER BY
"sub"."country" ASC
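To try the ROW_NUMBER version without a PostgreSQL server, here is a sketch using Python's sqlite3 module, assuming a SQLite build with window functions (3.25 or later). The rows in "Sales" are made-up sample data, including a tie in one country to show the "city" ASC tie-break.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE "Sales" ("country" TEXT, "city" TEXT, "amount" INTEGER);
    INSERT INTO "Sales" VALUES
        ('US', 'NYC', 10),
        ('US', 'LA', 30),
        ('FR', 'Paris', 20),
        ('FR', 'Lyon', 20);
""")

# Number the rows of each country from highest to lowest amount,
# breaking ties by city name, then keep only row number 1.
top_cities = conn.execute("""
    SELECT "country", "city", "amount"
    FROM (
        SELECT
            ROW_NUMBER() OVER (
                PARTITION BY "country"
                ORDER BY "amount" DESC, "city" ASC
            ) AS "rnk",
            *
        FROM "Sales"
    ) sub
    WHERE sub."rnk" = 1
    ORDER BY sub."country" ASC
""").fetchall()

print(top_cities)
```

France has two cities tied at 20, and the secondary sort on "city" picks Lyon; swapping ROW_NUMBER for RANK would instead return both tied rows.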
Try
SELECT count(dimension1a)
FROM (SELECT DISTINCT ON (dimension1) dimension1a
FROM data_table
ORDER BY dimension1) AS tmp_table;
DISTINCT ON appears to be synonymous with GROUP BY.