Is order preserved after UNION in PostgreSQL?

Here is the code:
CREATE TABLE audit_trail (
old_email TEXT NOT NULL,
new_email TEXT NOT NULL
);
INSERT INTO audit_trail(old_email, new_email)
VALUES ('harold_gim#yahoo.com', 'hgimenez#hotmail.com'),
('hgimenez#hotmail.com', 'harold.gimenez#gmail.com'),
('harold.gimenez#gmail.com', 'harold#heroku.com'),
('foo#bar.com', 'bar#baz.com'),
('bar#baz.com', 'barbaz#gmail.com');
WITH RECURSIVE all_emails AS (
SELECT old_email, new_email
FROM audit_trail
WHERE old_email = 'harold_gim#yahoo.com'
UNION
SELECT at.old_email, at.new_email
FROM audit_trail at
JOIN all_emails a
ON (at.old_email = a.new_email)
)
SELECT * FROM all_emails;
old_email | new_email
--------------------------+--------------------------
harold_gim#yahoo.com | hgimenez#hotmail.com
hgimenez#hotmail.com | harold.gimenez#gmail.com
harold.gimenez#gmail.com | harold#heroku.com
(3 rows)
select old_email, new_email into iter1
from audit_trail where old_email = 'harold_gim#yahoo.com';
select * from iter1;
-- old_email | new_email
-- ----------------------+----------------------
-- harold_gim#yahoo.com | hgimenez#hotmail.com
-- (1 row)
select a.old_email, a.new_email into iter2
from audit_trail a join iter1 b on (a.old_email = b.new_email);
select * from iter2;
-- old_email | new_email
-- ----------------------+--------------------------
-- hgimenez#hotmail.com | harold.gimenez#gmail.com
-- (1 row)
select * from iter1 union select * from iter2;
-- old_email | new_email
-- ----------------------+--------------------------
-- hgimenez#hotmail.com | harold.gimenez#gmail.com
-- harold_gim#yahoo.com | hgimenez#hotmail.com
-- (2 rows)
As you can see, the recursive code gives the result in the right order, but the non-recursive code does not.
They both use UNION, so why the difference?

Basically, your query is incorrect to begin with. Use UNION ALL, not UNION, or you would incorrectly remove duplicate entries. (There is nothing to say the trail cannot switch back and forth between the same emails.)
The Postgres implementation of UNION ALL returns rows in the sequence in which they are appended, as long as you do not add ORDER BY at the end or do anything else with the result.
Be aware, though, that each SELECT returns rows in arbitrary order unless ORDER BY is appended. There is no natural order in tables.
The same is not true for UNION, which has to process all rows to remove possible duplicates. There are various ways to determine duplicates; the resulting order of rows depends on the chosen algorithm, is implementation-dependent, and is completely unreliable unless, again, ORDER BY is appended.
So use instead:
SELECT * FROM iter1
UNION ALL -- union all!
SELECT * FROM iter2;
To get a reliable sort order, and "simulate the record of growth", you can track levels like this:
WITH RECURSIVE all_emails AS (
SELECT *, 1 AS lvl
FROM audit_trail
WHERE old_email = 'harold_gim#yahoo.com'
UNION ALL -- union all!
SELECT t.*, a.lvl + 1
FROM all_emails a
JOIN audit_trail t ON t.old_email = a.new_email
)
TABLE all_emails
ORDER BY lvl;
db<>fiddle here
Old sqlfiddle
Aside: if old_email is not defined UNIQUE in some way, you can get multiple trails. You would need a unique column (or combination of columns) to keep it unambiguous. If all else fails you can (ab-)use the internal tuple ID ctid for the purpose of telling trails apart. But you should rather use your own columns. (Added example in the fiddle.)
In-order sequence generation
Consider:
How to return records in correct order in PostgreSQL
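The level-tracking idea can be tried out without a PostgreSQL server. What follows is only a sketch: it uses Python's built-in sqlite3 module as a stand-in (an assumption, SQLite is not Postgres), loads the audit_trail rows from the question, and runs an equivalent recursive CTE with an explicit lvl column. The only point it illustrates is that the final ORDER BY lvl, not UNION ALL itself, is what pins down the order.

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE audit_trail (old_email TEXT NOT NULL, new_email TEXT NOT NULL);
    INSERT INTO audit_trail (old_email, new_email) VALUES
        ('harold_gim#yahoo.com', 'hgimenez#hotmail.com'),
        ('hgimenez#hotmail.com', 'harold.gimenez#gmail.com'),
        ('harold.gimenez#gmail.com', 'harold#heroku.com'),
        ('foo#bar.com', 'bar#baz.com'),
        ('bar#baz.com', 'barbaz#gmail.com');
""")

# Each recursion step carries lvl + 1, so the outer ORDER BY can restore
# the order in which the trail was followed.
trail = conn.execute("""
    WITH RECURSIVE all_emails(old_email, new_email, lvl) AS (
        SELECT old_email, new_email, 1
        FROM audit_trail
        WHERE old_email = 'harold_gim#yahoo.com'
        UNION ALL
        SELECT t.old_email, t.new_email, a.lvl + 1
        FROM all_emails a
        JOIN audit_trail t ON t.old_email = a.new_email
    )
    SELECT old_email, new_email, lvl
    FROM all_emails
    ORDER BY lvl
""").fetchall()

for row in trail:
    print(row)
```

Sorting on lvl works the same way whether the rows happen to come back in insertion order or not, which is why it is safer than relying on the behaviour of UNION ALL.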

Ordering is never preserved after any operation in any reasonable database. If you want the result set in a particular order, use ORDER BY. Period.
This is especially true after a UNION. UNION removes duplicates and that operation is going to change the ordering of the rows, in all likelihood.

Order is defined if you add a single ORDER BY after the last SELECT of the UNION, as below:
select "ClassName","SectionName","Students","OrderNo" from table
UNION
select '----TOTAL----' as "ClassName",'----' as "SectionName",sum("Total Students"),9999 as "OrderNo" from table
ORDER BY "OrderNo"

Related

SQL UNION ALL but with lots of columns on BigQuery?

The image above is a screenshot of my table, as a quick initial reference.
The focal point is the set of mech columns (mech1, mech2, mech3, and mech4).
Board games in this table have multiple attributes called mechanisms, so I've separated them into 4 different columns.
I've learned how to combine columns vertically via UNION ALL so that I can query the count of each unique game mechanism in my table.
However, it got me wondering if there's a shorter and more efficient way to achieve what I've done:
WITH mechanism_info AS
(
WITH
mechanism_col_combined AS
(
SELECT mech1 AS all_mech_columns_combined
FROM `ckda-portfolio-2022.bg_collection.base`
UNION ALL
## There's no IS NOT NULL condition defined for column 'mech1' since there's at least one mechanism noted for a game.
SELECT mech2
FROM `ckda-portfolio-2022.bg_collection.base`
WHERE mech2 IS NOT NULL
UNION ALL
SELECT mech3
FROM `ckda-portfolio-2022.bg_collection.base`
WHERE mech3 IS NOT NULL
UNION ALL
SELECT mech4
FROM `ckda-portfolio-2022.bg_collection.base`
WHERE mech4 IS NOT NULL
)
## Temporary table with all mechanism column in the collection combined.
SELECT DISTINCT(all_mech_columns_combined) AS unique_mechanisms, COUNT(*) AS count
FROM mechanism_col_combined
GROUP BY all_mech_columns_combined
ORDER BY all_mech_columns_combined
)
SELECT *
FROM mechanism_info
By querying this temp. table, SQL returns the information that I've anticipated as below:
unique_mechanisms | count
Acting | 1
Action Points | 3
Action Queue | 1
Action Retrieval | 1
Area Movement | 1
Auction/Bidding | 5
Bag Building | 1
Betting & Bluffing| 2
Bingo | 1
Bluffing | 7
Now, I want to shorten my code and I know there has to be a way to shorten the repetitive process of combining columns with UNION ALL.
And if there's any other tips or methods on how to shorten my query, please let me know!
Thank you.
You can combine the multiple columns [mech1, mech2, ...] into a single array column mech_arr and then use UNNEST to expand the array so that each row holds one scalar value.
For example:
WITH table1 AS (
SELECT 'AA' AS mech1, 'BB' AS mech2, 'CC' AS mech3,
UNION ALL SELECT 'AA' AS mech1, 'CC' AS mech2, 'EE' AS mech3
),
table2 AS (SELECT [mech1, mech2, mech3] AS mech_arr FROM table1)
SELECT mech, COUNT(*) AS mech_counts
FROM table2, UNNEST(mech_arr) AS mech
GROUP BY mech
Output
mech mech_counts
AA 2
BB 1
CC 2
EE 1
You could send join into the table, but the performance would not improve and the query would be just as long.
You can simplify as follows:
SELECT
mech_column,
count(*) AS mech_count
FROM (
SELECT mech1 AS mech_column
FROM `ckda-portfolio-2022.bg_collection.base`
UNION ALL
SELECT mech2
FROM `ckda-portfolio-2022.bg_collection.base`
UNION ALL
SELECT mech3
FROM `ckda-portfolio-2022.bg_collection.base`
UNION ALL
SELECT mech4
FROM `ckda-portfolio-2022.bg_collection.base`
) m
WHERE mech_column IS NOT NULL
GROUP BY mech_column
ORDER BY mech_column;
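To check that moving the NULL filter out of the individual branches does not change the counts, here is a small sketch. It runs on SQLite through Python's sqlite3 module rather than on BigQuery, and the table base and its three rows are made-up sample data, not the asker's collection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical sample data: unused mechanism slots are NULL.
    CREATE TABLE base (game TEXT, mech1 TEXT, mech2 TEXT, mech3 TEXT, mech4 TEXT);
    INSERT INTO base VALUES
        ('game_a', 'Bluffing', 'Bingo', NULL, NULL),
        ('game_b', 'Acting', 'Bluffing', 'Bag Building', NULL),
        ('game_c', 'Bluffing', NULL, NULL, NULL);
""")

# Variant 1: filter NULLs inside every UNION ALL branch.
per_branch = conn.execute("""
    SELECT mech, COUNT(*) FROM (
        SELECT mech1 AS mech FROM base
        UNION ALL SELECT mech2 FROM base WHERE mech2 IS NOT NULL
        UNION ALL SELECT mech3 FROM base WHERE mech3 IS NOT NULL
        UNION ALL SELECT mech4 FROM base WHERE mech4 IS NOT NULL
    ) GROUP BY mech ORDER BY mech
""").fetchall()

# Variant 2: stack all columns first, then filter NULLs once.
single_filter = conn.execute("""
    SELECT mech, COUNT(*) FROM (
        SELECT mech1 AS mech FROM base
        UNION ALL SELECT mech2 FROM base
        UNION ALL SELECT mech3 FROM base
        UNION ALL SELECT mech4 FROM base
    ) WHERE mech IS NOT NULL
    GROUP BY mech ORDER BY mech
""").fetchall()

print(per_branch == single_filter)
print(single_filter)
```

Both variants return the same counts; the second one just states the filter once.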
I didn't find a shorter way to write the query, but I did find a way to avoid adding WHERE column IS NOT NULL to every column that is stacked vertically into a single column:
WITH mechanism_info AS
(
WITH
mechanism_col_combined AS
(
SELECT mech1 AS mech_columns
FROM `ckda-portfolio-2022.bg_collection.base`
UNION ALL
SELECT mech2
FROM `ckda-portfolio-2022.bg_collection.base`
UNION ALL
SELECT mech3
FROM `ckda-portfolio-2022.bg_collection.base`
UNION ALL
SELECT mech4
FROM `ckda-portfolio-2022.bg_collection.base`
## Removed the WHERE clauses from the SELECTs above and added a single one below instead.
)
## Temporary table with all mechanism columns in the collection combined.
SELECT DISTINCT(mech_columns) AS mechanisms, COUNT(*) AS count
FROM mechanism_col_combined
WHERE mech_columns IS NOT NULL ## <--- Added here!
GROUP BY mech_columns
ORDER BY mech_columns
)
SELECT *
FROM mechanism_info
Since mechanism_col_combined is nested inside mechanism_info, I can apply a single WHERE mech_columns IS NOT NULL condition in the query that reads from it.
I'm still looking to reduce this query down to something more efficient. It's unfortunate that UNION ALL can't select multiple columns with a single call :(

How to limit number of groups returned in a query, but not the number of rows in Oracle

How to limit the number of groups in a query, but not the number of rows in Oracle?
If I had to do that manually, I would have to use a DISTINCT.
Would be something like this:
FOR d IN (
SELECT DISTINCT COLUMN_1 FROM myTable
WHERE myDate BETWEEN x AND y
OFFSET o ROWS
FETCH NEXT l ROWS ONLY
) LOOP
And then, do the selects from each of the ids returned in the query, which, in my opinion, is a terrible solution.
SAMPLE DATA:
If I limit the number of groups to 2 by using COLUMN_2, the expected result should be something like:
I believe you may be looking for something like this:
select *
from mytable
where id in (
select distinct id
from mytable
where my_date between x and y
fetch first :n rows only
)
;
:n is a bind variable, encoding the number of groups you want to select.
This should be more efficient than solutions using analytic functions - even if it must read the base table twice. In tests posted on OTN, I showed that the difference is not small.
EDIT If I remember correctly, FETCH is not implemented in the most efficient way (perhaps for good reasons, having to do with features we don't need in this query - such as how to deal with ties). FETCH itself resembles a DENSE_RANK() implementation rather than the faster row limiting clause (using ROWNUM). I would likely need to modify the query to do away with FETCH, if speed was really important. END EDIT
Further edit to do with performance comparisons
Frequent poster MT0 requested a pointer for the claim that aggregate solutions can be (and often are) more efficient than analytic function approaches, even when the former may require multiple passes through the data where the analytic function approach requires only one.
Alas, OTN (what now calls itself the "Oracle Groundbreakers Developer Community", the discussion board hosted by Oracle itself) went through a massive - and massively botched - platform change at the end of September 2020; that messed up both the search facilities and the formatting of old posts, to the point of rendering them almost unusable.
Instead, I will show here a simple mock-up of the OP's problem in this thread; code that anyone can run so they can repeat the tests on their own machine.
I created a table with two columns, ID and STR - the ID plays the same role as in the OP's question, and STR is just extra payload to mimic real-life data. ID is number and STR is varchar2(100). I populated the table with 9 million rows - 1 million ID's, nine rows for each ID. The task is to select just three "groups" (three distinct ID's, then select all the rows from the base table for those three distinct ID's).
With no index on the ID column, the aggregate solution runs in 0.81 seconds on my machine; with an index on ID, it runs in 0.47 seconds. The analytic functions solution runs in 0.91 seconds, with or without an index (obviously - there is no way an index can benefit the analytic function solution). All these results are for column ID not declared NOT NULL.
Here is the code to create the table, the index on ID, and the two queries I tested. Note: As I explained in my first edit (above), fetch is slow; I replaced it with a standard row-limiting technique using ROWNUM in an over-query.
drop table t purge;
create table t (id number, str varchar2(100));
insert into t
with row_gen as (select level from dual connect by level <= 3000)
select mod(344227 * rownum, 1000000), rpad('x', 100, 'x')
from row_gen cross join row_gen
;
commit;
create index t_idx on t(id);
select *
from t
where id in (
select id from (select distinct id from t)
where rownum <= 3
);
select *
from ( select t.*, dense_rank() over (order by id) dr from t )
where dr <= 3;
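For readers without an Oracle instance, the two query shapes can be compared on a tiny table. This is only a sketch under assumptions: it runs on SQLite via Python's sqlite3 (window functions need SQLite 3.25 or later), the six rows are invented, and the subquery uses ORDER BY id with LIMIT instead of ROWNUM so that both queries pick the same groups. It says nothing about relative performance on Oracle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Invented data: four groups (ids 1..4) with one or two rows each.
    CREATE TABLE t (id INTEGER, str TEXT);
    INSERT INTO t VALUES (1,'a'), (2,'b'), (2,'c'), (3,'d'), (3,'e'), (4,'f');
""")

# Aggregate-style approach: pick the group keys first, then fetch their rows.
via_subquery = conn.execute("""
    SELECT id, str FROM t
    WHERE id IN (SELECT DISTINCT id FROM t ORDER BY id LIMIT 2)
    ORDER BY id, str
""").fetchall()

# Analytic approach: rank the group keys and keep the first two ranks.
via_dense_rank = conn.execute("""
    SELECT id, str FROM (
        SELECT id, str, DENSE_RANK() OVER (ORDER BY id) AS dr FROM t
    ) WHERE dr <= 2
    ORDER BY id, str
""").fetchall()

print(via_subquery)
print(via_dense_rank)
```

Both return every row of the first two groups, so the limit applies to groups rather than to rows.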
You can use DENSE_RANK:
SELECT *
FROM (
SELECT t.*,
DENSE_RANK() OVER ( ORDER BY column2 ) AS rnk
FROM table_name t
)
WHERE rnk <= 2;
Which, for the sample data:
CREATE TABLE table_name ( column1, column2, column3, column4 ) AS
SELECT 1, 1, 1.0, 1.0 FROM DUAL UNION ALL
SELECT 2, 2, 2.0, 2.0 FROM DUAL UNION ALL
SELECT 2, 2, 2.2, 2.1 FROM DUAL UNION ALL
SELECT 2, 2, 2.2, 2.2 FROM DUAL UNION ALL
SELECT 2, 2, 2.0, 2.3 FROM DUAL UNION ALL
SELECT 3, 3, 3.0, 3.1 FROM DUAL UNION ALL
SELECT 3, 3, 3.1, 3.1 FROM DUAL UNION ALL
SELECT 3, 3, 3.1, 3.1 FROM DUAL UNION ALL
SELECT 4, 4, 4.2, 4.0 FROM DUAL;
Outputs:
COLUMN1 | COLUMN2 | COLUMN3 | COLUMN4 | RNK
------: | ------: | ------: | ------: | --:
1 | 1 | 1 | 1 | 1
2 | 2 | 2 | 2 | 2
2 | 2 | 2.2 | 2.1 | 2
2 | 2 | 2.2 | 2.2 | 2
2 | 2 | 2 | 2.3 | 2
(and, if you want DISTINCT rows then add DISTINCT to the outer query)
If I understand correctly, you want ROW_NUMBER():
SELECT t.*
FROM (SELECT t.*,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) as seqnum
FROM myTable t
WHERE t.myDate BETWEEN x AND y
) t
WHERE seqnum = 1;
This returns an arbitrary row for each id meeting the conditions.

How to get substring for filter and group by clause in AWS Redshift database

How to get substring from column which contains records for filter and group by clause in AWS Redshift database.
I have table with records like:
Table_Id | Categories | Value
<ID> | ABC1; ABC1-1; XYZ | 10
<ID> | ABC1; ABC1-2; XYZ | 15
<ID> | XYZ | 5
.....
Now I want to filter records based on individual category like 'ABC1' or 'ABC1 and XYZ'
Expected output from query would like:
Table_Id | Categories | Value
<ID> | ABC1 | 25
<ID> | ABC1-1 | 10
<ID> | ABC1-2 | 15
<ID> | XYZ | 30
.....
So need to group results based on individual categories.
If you have at most 3 values in any "categories" cell you can unnest the cells, get the list of unique values and use that list in a join condition like this:
WITH
values as (
select distinct category
from (
-- trim() strips the space that follows each ';' in the sample data
select distinct trim(split_part(categories,';',1)) as category from your_table
union select distinct trim(split_part(categories,';',2)) from your_table
union select distinct trim(split_part(categories,';',3)) from your_table
) parts
where nullif(category,'') is not null
)
SELECT
t2.category
,sum(t1.value)
FROM your_table t1
JOIN values t2
ON trim(split_part(categories,';',1))=t2.category
OR trim(split_part(categories,';',2))=t2.category
OR trim(split_part(categories,';',3))=t2.category
GROUP BY t2.category
If you have more than 3 options, just add another split_part level both in the WITH part and in the join condition.
@JonScott, @AlexYes, and anyone else struggling with a similar situation:
I found a better approach than the one suggested by @AlexYes.
I flattened the category column so that each category becomes an individual record, which I can then process further.
Query:
select row_number() over(order by 1) as r1,
to_char(timestamptz 'epoch' + date_time * interval '1 second', 'yyyy-mm-dd') AS DAY,
split_part(categories, ';', numbers.n) as catg,
value
from <TABLE>
join numbers
on numbers.n <= regexp_count(category_string, ';') + 1 <OTHER_CONDITIONS>
Explanation:
Two functions are useful here: first, the split_part function, which takes a string, splits it on ';' delimiter, and returns the first, second, ... , nth value specified from the split string; second, regexp_count, which tells us how many times a particular pattern is found in our string.
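The flatten-then-aggregate logic is easy to lose in the SQL, so here is the same idea in plain Python, outside Redshift entirely. It is only an illustration: the function name totals_by_category is made up, and the three rows are the sample rows from the question. Each delimited cell is split into one record per category, and the values are then summed per category.

```python
from collections import defaultdict

# Sample rows from the question: (categories, value).
rows = [
    ("ABC1; ABC1-1; XYZ", 10),
    ("ABC1; ABC1-2; XYZ", 15),
    ("XYZ", 5),
]

def totals_by_category(rows, delimiter=";"):
    """Split each delimited cell into categories and sum value per category."""
    totals = defaultdict(int)
    for categories, value in rows:
        for part in categories.split(delimiter):
            category = part.strip()  # drop the space that follows each ';'
            if category:             # skip empty pieces such as a trailing ';'
                totals[category] += value
    return dict(totals)

totals = totals_by_category(rows)
for category in sorted(totals):
    print(category, totals[category])
```

The totals match the expected output in the question (ABC1 25, ABC1-1 10, ABC1-2 15, XYZ 30). In the SQL version, the join against the numbers table plays the role of the inner loop over the split parts.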
To do this fully dynamically, you need to transpose or pivot values in "categories" column into separate rows.
Unfortunately, a "fully dynamic" solution (without knowing the different values beforehand) is NOT possible using redshift.
Your options are as follows:
1. Use the method suggested by AlexYes in another answer. This is semi-dynamic and is probably your best option.
2. Outside of Redshift, run some ETL code to perform the column -> multiple rows transformation.
3. Create a hardcoded solution and perform the pivot, something like this:
select table_id,'ABC1' as category, case when concat(Categories,';') ilike '%ABC1;%' then value else 0 end as value from your_table
union all
select table_id,'ABC1-1' as category, case when concat(Categories,';')ilike '%ABC1-1;%' then value else 0 end as value from your_table
union all
etc

Recursive SQL Query or Just Multiple Unions? Or Something Else?

I'm trying to create a query grabbing data from 5 different tables. To return records for every date and every account, I have to create a 'master' table with date and account id.
Since I really don't have a reference table for the account_id, I was thinking of writing the query as such.
select tab1.calendar_date, tab1.cal_d, (0) as account_id from calendar.table
union all
select tab1.calendar_date, tab1.cal_d, (1) as account_id from calendar.table
union all
select tab1.calendar_date, tab1.cal_d, (2) as account_id from calendar.table
and so on to account id 5.
The resulting table is then mapped to 5 other tables to pull the other information. Is there another way for me to restructure this query so it's not doing 4/5 joins? A co-worker suggested a recursive table, but I'm not familiar with it. I'm almost referencing as the master 'fact' table.
Additional context. I need the resulting table to look like the following:
calendar_date_id calendar_date account_id
2766 2014-01-01 1
2766 2014-01-01 2
2766 2014-01-01 3
... 2014-01-01 6
After this table/result is generated, I will join it with other tables with other metrics/dimensions.
I suggest generating the zero-to-five values with a recursive CTE, like so:
with zerotofive as (
select 0 as a
union all
select a+1 as a from zerotofive
where a<5
)
select tab1.calendar_date, tab1.cal_d, zerotofive.a as account_id
from calendar.table tab1
cross join
zerotofive
and then join that with any other tables you may have
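A quick way to see what this produces is to run the same shape of query on a throwaway database. The sketch below uses SQLite through Python's sqlite3 as a stand-in for whatever engine the question targets; the table name calendar and its two rows are assumptions made for the example. SQLite needs the RECURSIVE keyword, which some engines (SQL Server, for instance) omit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical two-day calendar table.
    CREATE TABLE calendar (calendar_date_id INTEGER, calendar_date TEXT);
    INSERT INTO calendar VALUES (2766, '2014-01-01'), (2767, '2014-01-02');
""")

# The recursive CTE generates 0..5; the cross join pairs every date
# with every generated account id.
master = conn.execute("""
    WITH RECURSIVE zerotofive(a) AS (
        SELECT 0
        UNION ALL
        SELECT a + 1 FROM zerotofive WHERE a < 5
    )
    SELECT c.calendar_date_id, c.calendar_date, z.a AS account_id
    FROM calendar c
    CROSS JOIN zerotofive z
    ORDER BY c.calendar_date_id, z.a
""").fetchall()

for row in master:
    print(row)
```

Two dates times six account ids gives twelve rows, one per (date, account) pair, which can then be joined to the other tables.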
Use a recursive common table expression, like this:
;WITH CTEmaster AS(
select tab1.calendar_date, tab1.cal_d, 0 as account_id from calendar.table tab1
union all
select calendar_date, cal_d, account_id+1 as account_id from CTEmaster
where account_id<5
)
select * from CTEmaster -- then join to your desired tables

Select data from a table where only the first two columns are distinct

Background
I have a table which has six columns. The first three columns create the pk. I'm tasked with removing one of the pk columns.
I selected (using distinct) the data into a temp table (excluding the third column), and tried inserting all of that data back into the original table with the third column being '11' for every row as this is what I was instructed to do. (this column is going to be removed by a DBA after I do this)
However, when I went to insert this data back into the original table I get a pk constraint error. (shocking, I know)
The other three columns are just date columns, so the distinct select didn't create a unique pk for each record. What I'm trying to achieve is just calling a distinct on the first two columns, and then just arbitrarily selecting the three other columns as it doesn't matter which dates I choose (at least not on dev).
What I've tried
I found the following post which seems to achieve what I want:
How do I (or can I) SELECT DISTINCT on multiple columns?
I tried the answers from both Joel and Erwin.
Attempt 1:
However, with Joel's answer the set returned is too large; the inner join isn't doing what I thought it would do. Selecting distinct col1 and col2 returns 400 rows, but when I use his solution 600 rows are returned. I checked the data, and there were in fact duplicate pk's. Here is my attempt at duplicating Joel's answer:
select a.emp_no,
a.eec_planning_unit_cde,
'11' as area, create_dte,
create_by_emp_no, modify_dte,
modify_by_emp_no
from tempdb.guest.temp_part_time_evaluator b
inner join
(
select emp_no, eec_planning_unit_cde
from tempdb.guest.temp_part_time_evaluator
group by emp_no, eec_planning_unit_cde
) a
ON b.emp_no = a.emp_no AND b.eec_planning_unit_cde = a.eec_planning_unit_cde
Now, if I execute just the inner select statement 400 rows are returned. If I select the whole query 600 rows are returned? Isn't inner join supposed to only show the intersection of the two sets?
Attempt 2:
I also tried the answer from Erwin. This one has a syntax error and I'm having trouble googling the spec on the where clause (specifically, the trick he is using with (emp_no, eec_planning_unit_cde))
Here is the attempt:
select emp_no,
eec_planning_unit_cde,
'11' as area, create_dte,
create_by_emp_no,
modify_dte,
modify_by_emp_no
where (emp_no, eec_planning_unit_cde) IN
(
select emp_no, eec_planning_unit_cde
from tempdb.guest.temp_part_time_evaluator
group by emp_no, eec_planning_unit_cde
)
Now, I realize that the post I referenced is for postgresql. Doesn't T-SQL have something similar? Trying to google parenthesis isn't working too well.
Overview of Questions:
Why doesn't inner join return an intersection of two sets? From googling this is what I thought it was supposed to do
Is there another way to achieve the same method that I was trying in attempt 2 in t-sql?
It doesn't matter to me which one of these I use, or if I use another solution... how should I go about this?
A select distinct will be based on all columns so it does not guarantee the first two to be distinct
select pk1, pk2, '11', max(c1), max(c2), max(c3)
from table
group by pk1, pk2
You could TRY this:
SELECT a.emp_no,
a.eec_planning_unit_cde,
'11' as area,
b.create_dte,
b.create_by_emp_no,
b.modify_dte,
b.modify_by_emp_no
FROM
(
SELECT emp_no, eec_planning_unit_cde
FROM tempdb.guest.temp_part_time_evaluator
GROUP BY emp_no, eec_planning_unit_cde
) a
JOIN tempdb.guest.temp_part_time_evaluator b
ON a.emp_no = b.emp_no AND a.eec_planning_unit_cde = b.eec_planning_unit_cde
That would give you a distinct on those fields, but if there are differences in the data between columns you might have to try a more brute-force approach.
SELECT a.emp_no,
a.eec_planning_unit_cde,
a.area,
a.create_dte,
a.create_by_emp_no,
a.modify_dte,
a.modify_by_emp_no
FROM
(
-- Number the rows within each (emp_no, eec_planning_unit_cde) pair;
-- the ORDER BY column is arbitrary, since any row of the pair will do.
SELECT ROW_NUMBER() OVER(PARTITION BY emp_no, eec_planning_unit_cde ORDER BY create_dte) rownumber,
emp_no,
eec_planning_unit_cde,
'11' as area,
create_dte,
create_by_emp_no,
modify_dte,
modify_by_emp_no
FROM tempdb.guest.temp_part_time_evaluator
) a
WHERE rownumber = 1
I'll reply one by one:
Why doesn't inner join return an intersection of two sets? From googling this is what I thought it was supposed to do
An inner join does not compute an intersection. Let's suppose these tables:
T1        T2
n  s      n  s
1  A      2  X
2  B      2  Y
2  C
3  D
If you join both tables by numeric column you don't get the intersection (2 rows). You get:
select *
from t1 inner join t2
on t1.n = t2.n;
| N | S |
---------
| 2 | B |
| 2 | B |
| 2 | C |
| 2 | C |
And, your second query approach:
select *
from t1
where t1.n in (select n from t2);
| N | S |
---------
| 2 | B |
| 2 | C |
Is there another way to achieve the same method that I was trying in attempt 2 in t-sql?
Yes, this subquery:
select *
from t1
where exists (
select 1
from t2
where t2.n = t1.n
);
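The difference between the join and the two semi-join forms can be reproduced with the T1 and T2 tables from this answer. This is a sketch on SQLite via Python's sqlite3, not T-SQL, although the three queries themselves use only standard syntax. It shows that the inner join multiplies matching rows, while IN and a correlated EXISTS return each matching T1 row once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (n INTEGER, s TEXT);
    CREATE TABLE t2 (n INTEGER, s TEXT);
    INSERT INTO t1 VALUES (1,'A'), (2,'B'), (2,'C'), (3,'D');
    INSERT INTO t2 VALUES (2,'X'), (2,'Y');
""")

# Inner join: every t1 row is paired with every matching t2 row.
joined = conn.execute("""
    SELECT t1.n, t1.s, t2.s
    FROM t1 INNER JOIN t2 ON t1.n = t2.n
    ORDER BY t1.s, t2.s
""").fetchall()

# IN: a t1 row is returned once if any match exists.
with_in = conn.execute("""
    SELECT n, s FROM t1
    WHERE n IN (SELECT n FROM t2)
    ORDER BY s
""").fetchall()

# Correlated EXISTS: same result as IN, and it extends to several
# key columns by adding more conditions inside the subquery.
with_exists = conn.execute("""
    SELECT n, s FROM t1
    WHERE EXISTS (SELECT 1 FROM t2 WHERE t2.n = t1.n)
    ORDER BY s
""").fetchall()

print(len(joined), len(with_in), len(with_exists))
```

The join yields four rows (2 rows in T1 times 2 rows in T2), while IN and EXISTS each yield the two matching T1 rows.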
It doesn't matter to me which one of these I use, or if I use another solution... how should I go about this?
Yes, using @JTC's second query.