Recursive SQL Query or Just Multiple Unions? Or Something Else? - sql

I'm trying to create a query grabbing data from 5 different tables. To return records for every date and every account, I have to create a 'master' table with date and account id.
Since I really don't have a reference table for the account_id, I was thinking of writing the query as such.
select tab1.calendar_date, tab1.cal_d, (0) as account_id from calendar.table tab1
union all
select tab1.calendar_date, tab1.cal_d, (1) as account_id from calendar.table tab1
union all
select tab1.calendar_date, tab1.cal_d, (2) as account_id from calendar.table tab1
and so on to account id 5.
The resulting table is then joined to 5 other tables to pull the other information. Is there another way for me to restructure this query so it's not doing 4/5 joins? A co-worker suggested a recursive table, but I'm not familiar with it. I'm essentially using the result as the master 'fact' table.
Additional context. I need the resulting table to look like the following:
calendar_date_id calendar_date account_id
2766 2014-01-01 1
2766 2014-01-01 2
2766 2014-01-01 3
... 2014-01-01 6
After this table/result is generated, I will join it with other tables with other metrics/dimensions.

I suggest generating the zero-to-five values with a recursive CTE, like so:
with zerotofive as (
select 0 as a
union all
select a+1 as a from zerotofive
where a<5
)
select tab1.calendar_date, tab1.cal_d, zerotofive.a as account_id
from calendar.table tab1
cross join
zerotofive
and then join that with any other tables you may have
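A minimal sketch of the recursive-CTE-plus-cross-join idea, run against an in-memory SQLite database from Python so it can be tried without a server. The `calendar` table and its two sample rows are assumptions standing in for `calendar.table`; SQLite accepts the `RECURSIVE` keyword, which SQL Server omits.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical stand-in for calendar.table, with two sample dates.
conn.executescript("""
CREATE TABLE calendar (calendar_date_id INTEGER, calendar_date TEXT);
INSERT INTO calendar VALUES (2766, '2014-01-01'), (2767, '2014-01-02');
""")

rows = conn.execute("""
WITH RECURSIVE zerotofive(a) AS (
    SELECT 0                      -- anchor row
    UNION ALL
    SELECT a + 1 FROM zerotofive  -- recursive step
    WHERE a < 5                   -- stops after a = 5
)
SELECT c.calendar_date_id, c.calendar_date, z.a AS account_id
FROM calendar AS c
CROSS JOIN zerotofive AS z
ORDER BY c.calendar_date_id, z.a
""").fetchall()

# One row per (date, account id) pair: 2 dates x 6 ids.
for row in rows:
    print(row)
```

The cross join pairs every calendar row with every generated id, which is the 'master' date/account grid the question asks for.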

Use a recursive common table expression, like this:
;WITH CTEmaster AS (
select tab1.calendar_date, tab1.cal_d, 0 as account_id from calendar.table tab1
union all
select calendar_date, cal_d, account_id + 1 as account_id from CTEmaster
where account_id < 5
)
select * from CTEmaster join (to your desired tables )

Related

SQL UNION ALL but with lots of columns on BigQuery?

Above image is a screenshot of my table just as a quick initial reference.
The focal points are the multiple mech columns (mech1, mech2, mech3, and mech4).
Board games in this table have multiple attributes called mechanisms, so I've separated them into 4 different columns.
So I've learned how to combine columns vertically via UNION ALL so that I can query the count of all unique game mechanisms in my table.
However, it got me wondering if there's a shorter and more efficient way to achieve what I've done:
WITH mechanism_info AS
(
WITH
mechanism_col_combined AS
(
SELECT mech1 AS all_mech_columns_combined
FROM `ckda-portfolio-2022.bg_collection.base`
UNION ALL
## There's no IS NOT NULL condition defined for column 'mech1' since there's at least one mechanism noted for a game.
SELECT mech2
FROM `ckda-portfolio-2022.bg_collection.base`
WHERE mech2 IS NOT NULL
UNION ALL
SELECT mech3
FROM `ckda-portfolio-2022.bg_collection.base`
WHERE mech3 IS NOT NULL
UNION ALL
SELECT mech4
FROM `ckda-portfolio-2022.bg_collection.base`
WHERE mech4 IS NOT NULL
)
## Temporary table with all mechanism column in the collection combined.
SELECT DISTINCT(all_mech_columns_combined) AS unique_mechanisms, COUNT(*) AS count
FROM mechanism_col_combined
GROUP BY all_mech_columns_combined
ORDER BY all_mech_columns_combined
)
SELECT *
FROM mechanism_info
By querying this temp. table, SQL returns the information that I've anticipated as below:
unique_mechanisms | count
Acting | 1
Action Points | 3
Action Queue | 1
Action Retrieval | 1
Area Movement | 1
Auction/Bidding | 5
Bag Building | 1
Betting & Bluffing| 2
Bingo | 1
Bluffing | 7
Now, I want to shorten my code and I know there has to be a way to shorten the repetitive process of combining columns with UNION ALL.
And if there's any other tips or methods on how to shorten my query, please let me know!
Thank you.
You can combine the columns [mech1, mech2, ...] into a single array column mech_arr and then use UNNEST to expand the array so that each row holds one scalar value.
For example:
WITH table1 AS (
SELECT 'AA' AS mech1, 'BB' AS mech2, 'CC' AS mech3,
UNION ALL SELECT 'AA' AS mech1, 'CC' AS mech2, 'EE' AS mech3
),
table2 AS (SELECT [mech1, mech2, mech3] AS mech_arr FROM table1)
SELECT mech, COUNT(*) AS mech_counts
FROM table2, UNNEST(mech_arr) AS mech
GROUP BY mech
Output
mech mech_counts
AA 2
BB 1
CC 2
EE 1
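UNNEST over an array literal is BigQuery syntax. As a rough, engine-neutral sketch of the same unpivot idea, the snippet below uses SQLite from Python and swaps in a different technique: a cross join against a small index table plus a CASE expression. The `base` table and `idx` helper are hypothetical names for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical copy of the two sample rows from the answer above.
conn.executescript("""
CREATE TABLE base (mech1 TEXT, mech2 TEXT, mech3 TEXT);
INSERT INTO base VALUES ('AA', 'BB', 'CC'), ('AA', 'CC', 'EE');
""")

unpivot_counts = conn.execute("""
WITH idx(i) AS (VALUES (1), (2), (3))  -- one index per mech column
SELECT CASE i WHEN 1 THEN mech1
              WHEN 2 THEN mech2
              ELSE mech3 END AS mech,
       COUNT(*) AS mech_counts
FROM base
CROSS JOIN idx          -- each source row is repeated once per column
GROUP BY mech
ORDER BY mech
""").fetchall()

print(unpivot_counts)
```

The table is read once and each row is fanned out per column, which is the effect the array-plus-UNNEST query achieves in BigQuery.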
You could join the table to itself instead, but performance would not improve and the query would be just as long.
You can simplify as follows:
SELECT
mech_column,
COUNT(*) AS mech_count
FROM (
SELECT mech1 AS mech_column
FROM `ckda-portfolio-2022.bg_collection.base`
UNION ALL
SELECT mech2
FROM `ckda-portfolio-2022.bg_collection.base`
UNION ALL
SELECT mech3
FROM `ckda-portfolio-2022.bg_collection.base`
UNION ALL
SELECT mech4
FROM `ckda-portfolio-2022.bg_collection.base`
) m
WHERE mech_column IS NOT NULL
GROUP BY mech_column
ORDER BY mech_column;
I didn't find a smoother way to write the query, but I did find a way to avoid adding WHERE column IS NOT NULL to each and every column that is stacked into the single column:
WITH mechanism_info AS
(
WITH
mechanism_col_combined AS
(
SELECT mech1 AS mech_columns
FROM `ckda-portfolio-2022.bg_collection.base`
UNION ALL
SELECT mech2
FROM `ckda-portfolio-2022.bg_collection.base`
UNION ALL
SELECT mech3
FROM `ckda-portfolio-2022.bg_collection.base`
UNION ALL
SELECT mech4
FROM `ckda-portfolio-2022.bg_collection.base`
## Removed the WHERE clauses from the SELECTs above and added a single one below instead.
)
## Temporary table with all mechanism columns in the collection combined.
SELECT DISTINCT(mech_columns) AS mechanisms, COUNT(*) AS count
FROM mechanism_col_combined
WHERE mech_columns IS NOT NULL ## <--- Added here!
GROUP BY mech_columns
ORDER BY mech_columns
)
SELECT *
FROM mechanism_info
Since mechanism_col_combined is nested inside mechanism_info, I can apply a single WHERE mech_columns IS NOT NULL condition in the outer query instead of filtering each branch.
I'm still looking to reduce this query down to something more efficient. It's unfortunate that UNION ALL can't select multiple columns with a single call :(

SQL query with grouping and MAX

I have a table that looks like the following but also has more columns that are not needed for this instance.
ID DATE Random
-- -------- ---------
1 4/12/2015 2
2 4/15/2015 2
3 3/12/2015 2
4 9/16/2015 3
5 1/12/2015 3
6 2/12/2015 3
ID is the primary key
Random is a foreign key, but I am not actually using the table it points to.
I am trying to design a query that groups the results by Random and Date, selects the MAX Date within each group, and gives me the associated ID.
If I run the following query
select top 100 ID, Random, MAX(Date) from DateBase group by Random, Date, ID
I get duplicate Randoms, since ID is the primary key and will always be unique.
The results I need would look something like this
ID DATE Random
-- -------- ---------
2 4/15/2015 2
4 9/16/2015 3
Another question: there could be times when several rows share the same date. What will MAX do in that case?
You can use NOT EXISTS() :
SELECT * FROM YourTable t
WHERE NOT EXISTS(SELECT 1 FROM YourTable s
WHERE s.random = t.random
AND s.date > t.date)
This selects only the rows for which no later date exists for the same random value.
It can also be done using IN() with a row value, which works in databases such as PostgreSQL and MySQL but not in SQL Server:
SELECT * FROM YourTable t
WHERE (t.random,t.date) in (SELECT s.random,max(s.date)
FROM YourTable s
GROUP BY s.random)
Or with a join:
SELECT t.* FROM YourTable t
INNER JOIN (SELECT s.random,max(s.date) as max_date
FROM YourTable s
GROUP BY s.random) tt
ON(t.date = tt.max_date and tt.random = t.random)
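A small sketch to check the NOT EXISTS form against the question's sample rows, using SQLite from Python. The column name `dt` and the ISO date strings are assumptions made so the dates compare correctly as text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Sample rows from the question; dates rewritten as ISO strings (assumption).
conn.executescript("""
CREATE TABLE DateBase (ID INTEGER PRIMARY KEY, dt TEXT, Random INTEGER);
INSERT INTO DateBase VALUES
    (1, '2015-04-12', 2), (2, '2015-04-15', 2), (3, '2015-03-12', 2),
    (4, '2015-09-16', 3), (5, '2015-01-12', 3), (6, '2015-02-12', 3);
""")

latest = conn.execute("""
SELECT t.ID, t.dt, t.Random
FROM DateBase AS t
WHERE NOT EXISTS (
    SELECT 1
    FROM DateBase AS s
    WHERE s.Random = t.Random
      AND s.dt > t.dt        -- a later row in the same group disqualifies t
)
ORDER BY t.Random
""").fetchall()

print(latest)
```

If two rows in a group share the latest date, both survive this filter, so a tie-break on ID is still needed when exactly one row per group is required.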
In SQL Server you could do something like the following:
select a.* from DateBase a inner join
(select Random,
MAX(dt) as dt from DateBase group by Random) as x
on a.dt =x.dt and a.random = x.random
This method will work in all versions of SQL as there are no vendor specifics (you'll need to format the dates using your vendor specific syntax)
You can do this in two stages:
The first step is to work out the max date for each random:
SELECT MAX(DateField) AS MaxDateField, Random
FROM Example
GROUP BY Random
Now you can join back onto your table to get the max ID for each combination:
SELECT MAX(e.ID) AS ID
,e.DateField AS DateField
,e.Random
FROM Example AS e
INNER JOIN (
SELECT MAX(DateField) AS MaxDateField, Random
FROM Example
GROUP BY Random
) data
ON data.MaxDateField = e.DateField
AND data.Random = e.Random
GROUP BY e.DateField, e.Random
To answer your second question:
If there are multiples of the same date, the MAX(e.ID) will simply choose the highest number. If you want the lowest, you can use MIN(e.ID) instead.
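To illustrate the tie behaviour, here is a hedged sketch in SQLite via Python. The `Example` rows are made up so that IDs 1 and 2 share the latest date for Random = 2; the derived-table alias `latest` is also an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data: IDs 1 and 2 tie on the latest date within Random = 2.
conn.executescript("""
CREATE TABLE Example (ID INTEGER PRIMARY KEY, DateField TEXT, Random INTEGER);
INSERT INTO Example VALUES
    (1, '2015-04-15', 2), (2, '2015-04-15', 2),
    (3, '2015-03-12', 2), (4, '2015-09-16', 3);
""")

QUERY = """
SELECT {agg}(e.ID) AS ID, e.DateField, e.Random
FROM Example AS e
INNER JOIN (
    SELECT MAX(DateField) AS MaxDateField, Random
    FROM Example
    GROUP BY Random
) AS latest
  ON latest.MaxDateField = e.DateField
 AND latest.Random = e.Random
GROUP BY e.DateField, e.Random
ORDER BY e.Random
"""

highest_id = conn.execute(QUERY.format(agg="MAX")).fetchall()
lowest_id = conn.execute(QUERY.format(agg="MIN")).fetchall()
print(highest_id)
print(lowest_id)
```

Swapping MAX for MIN only changes which of the tied rows is reported; groups without ties are unaffected.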

Is order preserved after UNION in PostgreSQL?

Here is the code:
CREATE TABLE audit_trail (
old_email TEXT NOT NULL,
new_email TEXT NOT NULL
);
INSERT INTO audit_trail(old_email, new_email)
VALUES ('harold_gim#yahoo.com', 'hgimenez#hotmail.com'),
('hgimenez#hotmail.com', 'harold.gimenez#gmail.com'),
('harold.gimenez#gmail.com', 'harold#heroku.com'),
('foo#bar.com', 'bar#baz.com'),
('bar#baz.com', 'barbaz#gmail.com');
WITH RECURSIVE all_emails AS (
SELECT old_email, new_email
FROM audit_trail
WHERE old_email = 'harold_gim#yahoo.com'
UNION
SELECT at.old_email, at.new_email
FROM audit_trail at
JOIN all_emails a
ON (at.old_email = a.new_email)
)
SELECT * FROM all_emails;
old_email | new_email
--------------------------+--------------------------
harold_gim#yahoo.com | hgimenez#hotmail.com
hgimenez#hotmail.com | harold.gimenez#gmail.com
harold.gimenez#gmail.com | harold#heroku.com
(3 rows)
select old_email, new_email into iter1
from audit_trail where old_email = 'harold_gim#yahoo.com';
select * from iter1;
-- old_email | new_email
-- ----------------------+----------------------
-- harold_gim#yahoo.com | hgimenez#hotmail.com
-- (1 row)
select a.old_email, a.new_email into iter2
from audit_trail a join iter1 b on (a.old_email = b.new_email);
select * from iter2;
-- old_email | new_email
-- ----------------------+--------------------------
-- hgimenez#hotmail.com | harold.gimenez#gmail.com
-- (1 row)
select * from iter1 union select * from iter2;
-- old_email | new_email
-- ----------------------+--------------------------
-- hgimenez#hotmail.com | harold.gimenez#gmail.com
-- harold_gim#yahoo.com | hgimenez#hotmail.com
-- (2 rows)
As you can see, the recursive code returns the rows in the right order, but the non-recursive code does not.
They both use union, why the difference?
Basically, your query is incorrect to begin with. Use UNION ALL, not UNION or you would incorrectly remove duplicate entries. (There is nothing to say the trail cannot switch back and forth between the same emails.)
The Postgres implementation for UNION ALL returns values in the sequence as appended - as long as you do not add ORDER BY at the end or do anything else with the result.
Be aware though, that each SELECT returns rows in arbitrary order unless ORDER BY is appended. There is no natural order in tables.
The same is not true for UNION, which has to process all rows to remove possible duplicates. There are various ways to determine duplicates, the resulting order of rows depends on the chosen algorithm and is implementation-dependent and completely unreliable - unless, again, ORDER BY is appended.
So use instead:
SELECT * FROM iter1
UNION ALL -- union all!
SELECT * FROM iter2;
To get a reliable sort order, and "simulate the record of growth", you can track levels like this:
WITH RECURSIVE all_emails AS (
SELECT *, 1 AS lvl
FROM audit_trail
WHERE old_email = 'harold_gim#yahoo.com'
UNION ALL -- union all!
SELECT t.*, a.lvl + 1
FROM all_emails a
JOIN audit_trail t ON t.old_email = a.new_email
)
TABLE all_emails
ORDER BY lvl;
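The level-tracking query can be tried outside PostgreSQL as well. The sketch below runs the same idea in SQLite from Python, with two assumed adaptations: the columns are listed explicitly, and a plain SELECT replaces the PostgreSQL-only `TABLE all_emails` shorthand.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Same sample rows as the question's audit_trail table.
conn.executescript("""
CREATE TABLE audit_trail (old_email TEXT NOT NULL, new_email TEXT NOT NULL);
INSERT INTO audit_trail (old_email, new_email) VALUES
    ('harold_gim#yahoo.com', 'hgimenez#hotmail.com'),
    ('hgimenez#hotmail.com', 'harold.gimenez#gmail.com'),
    ('harold.gimenez#gmail.com', 'harold#heroku.com'),
    ('foo#bar.com', 'bar#baz.com'),
    ('bar#baz.com', 'barbaz#gmail.com');
""")

trail = conn.execute("""
WITH RECURSIVE all_emails AS (
    SELECT old_email, new_email, 1 AS lvl
    FROM audit_trail
    WHERE old_email = 'harold_gim#yahoo.com'
    UNION ALL
    SELECT t.old_email, t.new_email, a.lvl + 1   -- one level deeper per hop
    FROM all_emails AS a
    JOIN audit_trail AS t ON t.old_email = a.new_email
)
SELECT old_email, new_email, lvl
FROM all_emails
ORDER BY lvl            -- the explicit sort is what guarantees the order
""").fetchall()

for row in trail:
    print(row)
```

The order of the output depends only on the `lvl` column and the ORDER BY, not on how the engine happens to evaluate the union.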
Aside: if old_email is not defined UNIQUE in some way, you can get multiple trails. You would need a unique column (or combination of columns) to keep it unambiguous. If all else fails you can (ab-)use the internal tuple ID ctid for the purpose of telling trails apart. But you should rather use your own columns. (Added example in the fiddle.)
In-order sequence generation
Consider:
How to return records in correct order in PostgreSQL
Ordering is never preserved after any operation in any reasonable database. If you want the result set in a particular order, use ORDER BY. Period.
This is especially true after a UNION. UNION removes duplicates and that operation is going to change the ordering of the rows, in all likelihood.
You can get a defined order by adding a single ORDER BY after the last SELECT of the UNION, as below:
select "ClassName","SectionName","Students","OrderNo" from table
UNION
select '----TOTAL----' as "ClassName",'----' as "SectionName",sum("Total Students"),9999 as "OrderNo" from table
ORDER BY "OrderNo"

SQL Remove Duplicates, save lowest of certain column

I've been looking for an answer to this but couldn't find anything the same as this particular situation.
I have one table that I want to remove duplicates from.
__________________
| JobNumber-String |
| JobOp - Number |
------------------
There are multiple rows with these two values; together they make up the key for the row. I want to keep every distinct job number with its lowest job op. How can I do this? I've tried a bunch of things, mainly the MIN function, but that only seems to work on the entire table, not on each set of JobNumber rows. Thanks!
Original Table Values:
JobNumber Jobop
123 100
123 101
456 200
456 201
780 300
Code Ran:
DELETE FROM table
WHERE CONCAT(JobNumber,JobOp) NOT IN
(
SELECT CONCAT(JobNumber,MIN(JobOp))
FROM table
GROUP BY JobNumber
)
Ending Table Values:
JobNumber Jobop
123 100
456 200
780 300
With SQL Server 2008 or higher you can enhance the MIN function with an OVER clause specifying a PARTITION BY section.
Please have a look at https://msdn.microsoft.com/en-us/library/ms189461.aspx
You can simply select the values you want to keep:
select JobNumber, min(JobOp) from table group by JobNumber
Then you can delete the records you don't want:
DELETE t FROM table t
left JOIN (select JobNumber, min(JobOp) as minJobOp from table group by JobNumber) e
ON t.JobNumber = e.JobNumber and t.JobOp = e.minJobOp
Where e.JobNumber is null
I like to do this with window functions:
with todelete as (
select t.*, min(jobop) over (partition by jobnumber) as minjop
from table t
)
delete from todelete
where jobop > minjop;
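Deleting through a CTE as above is SQL Server syntax. As a portable sketch of the same "keep the lowest JobOp per JobNumber" rule, the snippet below uses a correlated subquery in SQLite from Python; the table name `jobs` is a hypothetical stand-in for the question's table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Sample rows from the question; `jobs` is an assumed table name.
conn.executescript("""
CREATE TABLE jobs (
    JobNumber TEXT,
    JobOp INTEGER,
    PRIMARY KEY (JobNumber, JobOp)
);
INSERT INTO jobs VALUES
    ('123', 100), ('123', 101), ('456', 200), ('456', 201), ('780', 300);
""")

# Remove every row whose JobOp is above the minimum for its JobNumber.
conn.execute("""
DELETE FROM jobs
WHERE JobOp > (
    SELECT MIN(j2.JobOp)
    FROM jobs AS j2
    WHERE j2.JobNumber = jobs.JobNumber
)
""")

remaining = conn.execute(
    "SELECT JobNumber, JobOp FROM jobs ORDER BY JobNumber"
).fetchall()
print(remaining)
```

Comparing the two key columns directly avoids building a concatenated string key, which can collide when different pairs produce the same text.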
It sounds like you are not using the correct GROUP BY clause when using the MIN function. This sql should give you the minimum JobOp value for each JobNumber:
SELECT JobNumber, MIN(JobOp) FROM test.so_test GROUP BY JobNumber;
Using this in a subquery, along with CONCAT (this is MySQL; SQL Server might use a different function) because both fields form your key, gives you this SQL, which lists the rows to remove:
SELECT * FROM so_test WHERE CONCAT(JobNumber,JobOp)
NOT IN (SELECT CONCAT(JobNumber,MIN(JobOp)) FROM test.so_test GROUP BY JobNumber);

Select data from a table where only the first two columns are distinct

Background
I have a table which has six columns. The first three columns create the pk. I'm tasked with removing one of the pk columns.
I selected (using distinct) the data into a temp table (excluding the third column), and tried inserting all of that data back into the original table with the third column being '11' for every row as this is what I was instructed to do. (this column is going to be removed by a DBA after I do this)
However, when I went to insert this data back into the original table I get a pk constraint error. (shocking, I know)
The other three columns are just date columns, so the distinct select didn't create a unique pk for each record. What I'm trying to achieve is just calling a distinct on the first two columns, and then just arbitrarily selecting the three other columns as it doesn't matter which dates I choose (at least not on dev).
What I've tried
I found the following post which seems to achieve what I want:
How do I (or can I) SELECT DISTINCT on multiple columns?
I tried the answers from both Joel and Erwin.
Attempt 1:
However, with Joel's answer the set returned is too large; the inner join isn't doing what I thought it would do. Selecting distinct col1 and col2 returns 400 rows, but when I use his solution 600 rows are returned. I checked the data and there were in fact duplicate pk's. Here is my attempt at duplicating Joel's answer:
select a.emp_no,
a.eec_planning_unit_cde,
'11' as area, create_dte,
create_by_emp_no, modify_dte,
modify_by_emp_no
from tempdb.guest.temp_part_time_evaluator b
inner join
(
select emp_no, eec_planning_unit_cde
from tempdb.guest.temp_part_time_evaluator
group by emp_no, eec_planning_unit_cde
) a
ON b.emp_no = a.emp_no AND b.eec_planning_unit_cde = a.eec_planning_unit_cde
Now, if I execute just the inner select statement 400 rows are returned. If I select the whole query 600 rows are returned? Isn't inner join supposed to only show the intersection of the two sets?
Attempt 2:
I also tried the answer from Erwin. This one has a syntax error and I'm having trouble googling the spec on the where clause (specifically, the trick he is using with (emp_no, eec_planning_unit_cde))
Here is the attempt:
select emp_no,
eec_planning_unit_cde,
'11' as area, create_dte,
create_by_emp_no,
modify_dte,
modify_by_emp_no
where (emp_no, eec_planning_unit_cde) IN
(
select emp_no, eec_planning_unit_cde
from tempdb.guest.temp_part_time_evaluator
group by emp_no, eec_planning_unit_cde
)
Now, I realize that the post I referenced is for postgresql. Doesn't T-SQL have something similar? Trying to google parenthesis isn't working too well.
Overview of Questions:
Why doesn't inner join return an intersection of two sets? From googling this is what I thought it was supposed to do
Is there another way to achieve the same method that I was trying in attempt 2 in t-sql?
It doesn't matter to me which one of these I use, or if I use another solution... how should I go about this?
A SELECT DISTINCT is based on all selected columns, so it does not guarantee that the first two are distinct. Group by the two key columns and aggregate the rest instead:
select pk1, pk2, '11', max(c1), max(c2), max(c3)
from table
group by pk1, pk2
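A minimal sketch of the GROUP BY approach, run in SQLite from Python. The `evaluator` table, its shortened column names, and the sample rows are assumptions chosen to mimic two rows that share the first two key columns but differ in the third.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical three-column key (emp_no, unit_cde, area) plus date columns.
conn.executescript("""
CREATE TABLE evaluator (
    emp_no INTEGER, unit_cde TEXT, area TEXT,
    create_dte TEXT, modify_dte TEXT,
    PRIMARY KEY (emp_no, unit_cde, area)
);
INSERT INTO evaluator VALUES
    (1, 'A', '10', '2020-01-01', '2020-02-01'),
    (1, 'A', '12', '2020-03-01', '2020-04-01'),
    (2, 'B', '10', '2020-05-01', '2020-06-01');
""")

collapsed = conn.execute("""
SELECT emp_no, unit_cde, '11' AS area,
       MAX(create_dte) AS create_dte,   -- any aggregate works; MAX is arbitrary
       MAX(modify_dte) AS modify_dte
FROM evaluator
GROUP BY emp_no, unit_cde
ORDER BY emp_no
""").fetchall()

print(collapsed)
```

Because the result has exactly one row per (emp_no, unit_cde) pair, inserting it back with a constant third column cannot violate the primary key.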
You could TRY this:
SELECT a.emp_no,
a.eec_planning_unit_cde,
'11' as area,
b.create_dte,
b.create_by_emp_no,
b.modify_dte,
b.modify_by_emp_no
FROM
(
SELECT emp_no, eec_planning_unit_cde
FROM tempdb.guest.temp_part_time_evaluator
GROUP BY emp_no, eec_planning_unit_cde
) a
JOIN tempdb.guest.temp_part_time_evaluator b
ON a.emp_no = b.emp_no AND a.eec_planning_unit_cde = b.eec_planning_unit_cde
Note that this join returns every row of b that matches a key, so it yields one row per key only when each key occurs once. If keys repeat, you might have to try a more brute-force approach that numbers the rows within each key and keeps the first:
SELECT a.emp_no,
a.eec_planning_unit_cde,
a.area,
a.create_dte,
a.create_by_emp_no,
a.modify_dte,
a.modify_by_emp_no
FROM
(
SELECT ROW_NUMBER() OVER(PARTITION BY t.emp_no, t.eec_planning_unit_cde ORDER BY t.create_dte) AS rownumber,
t.emp_no,
t.eec_planning_unit_cde,
'11' AS area,
t.create_dte,
t.create_by_emp_no,
t.modify_dte,
t.modify_by_emp_no
FROM tempdb.guest.temp_part_time_evaluator t
) a
WHERE a.rownumber = 1
I'll reply one by one:
Why doesn't inner join return an intersection of two sets? From googling this is what I thought it was supposed to do
An inner join doesn't compute an intersection. Let's suppose we have these tables:
T1        T2
n  s      n  s
1  A      2  X
2  B      2  Y
2  C
3  D
If you join both tables on the numeric column you don't get the intersection (2 rows). You get 4 rows, one for each matching pair:
select *
from t1 inner join t2
on t1.n = t2.n;
| N | S |
---------
| 2 | B |
| 2 | B |
| 2 | C |
| 2 | C |
And, your second query approach:
select *
from t1
where t1.n in (select n from t2);
| N | S |
---------
| 2 | B |
| 2 | C |
Is there another way to achieve the same method that I was trying in attempt 2 in t-sql?
Yes. T-SQL has no row-value IN, but a correlated EXISTS subquery does the same job:
select *
from t1
where exists (
select 1
from t2
where t2.n = t1.n
);
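A short sketch contrasting the join with a semi-join written with EXISTS, using the T1/T2 sample data in SQLite from Python. It is only meant to show the row counts, not to stand in for the real tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# T1 and T2 from the example above.
conn.executescript("""
CREATE TABLE t1 (n INTEGER, s TEXT);
CREATE TABLE t2 (n INTEGER, s TEXT);
INSERT INTO t1 VALUES (1, 'A'), (2, 'B'), (2, 'C'), (3, 'D');
INSERT INTO t2 VALUES (2, 'X'), (2, 'Y');
""")

# Inner join: one output row per matching (t1, t2) pair.
joined = conn.execute("""
SELECT t1.n, t1.s, t2.n, t2.s
FROM t1
INNER JOIN t2 ON t1.n = t2.n
ORDER BY t1.s, t2.s
""").fetchall()

# Semi-join: each t1 row appears at most once, however many matches exist.
semi = conn.execute("""
SELECT n, s
FROM t1
WHERE EXISTS (SELECT 1 FROM t2 WHERE t2.n = t1.n)
ORDER BY s
""").fetchall()

print(joined)
print(semi)
```

With several key columns, the EXISTS subquery simply gets one equality per column in its WHERE clause, which covers the multi-column case that row-value IN handles in PostgreSQL.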
It doesn't matter to me which one of these I use, or if I use another solution... how should I go about this?
Yes, use @JTC's second query.