Split record into 2 records with distinct values based on a unique id - sql

I have a table with some IDs that correspond to duplicate data that I would like to get rid of. They are linked by a GroupID. Currently my data looks like this:
|GroupID|NID1 |NID2 |
|S1 |644763|643257|
|T2 |4759 |84689 |
|W3 |96676 |585876|
In order for the software to run, I need the data in the following format:
|GroupID|NID |
|S1 |644763|
|S1 |643257|
|T2 |4759 |
|T2 |84689 |
|W3 |96676 |
|W3 |585876|
Thank you for your time.

You want UNION ALL:
select groupid, nid1 as nid
from table t
union all -- use "union" instead if you don't want duplicate rows
select groupid, nid2
from table t;

In Oracle 12C+, you can use lateral joins:
select t.groupid, v.nid
from t cross apply
(select t.nid1 as nid from dual union all
select t.nid2 as nid from dual
) v;
This is more efficient than union all because it only scans the table once.
You can also express this as:
select t.groupid,
(case when n.n = 1 then t.nid1 when n.n = 2 then t.nid2 end) as nid
from t cross join
(select 1 as n from dual union all select 2 from dual) n;
A little more complicated, but still only one scan of the table.
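To try both shapes without an Oracle instance, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in. SQLite has no DUAL and no CROSS APPLY, so the number table is written as a plain two-row subquery; the table name t and the sample rows are taken from the question, everything else is an assumption for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for the Oracle table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (groupid TEXT, nid1 INTEGER, nid2 INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [("S1", 644763, 643257), ("T2", 4759, 84689), ("W3", 96676, 585876)],
)

# Variant 1: read t twice and stack the two results with UNION ALL.
union_rows = conn.execute(
    """
    SELECT groupid, nid1 AS nid FROM t
    UNION ALL
    SELECT groupid, nid2 FROM t
    ORDER BY groupid, nid
    """
).fetchall()

# Variant 2: read t once and cross join it to a two-row number table,
# picking nid1 or nid2 depending on the number.
cross_rows = conn.execute(
    """
    SELECT t.groupid,
           CASE n.n WHEN 1 THEN t.nid1 WHEN 2 THEN t.nid2 END AS nid
    FROM t
    CROSS JOIN (SELECT 1 AS n UNION ALL SELECT 2) AS n
    ORDER BY t.groupid, nid
    """
).fetchall()

for row in cross_rows:
    print(row)
```

Both variants return the same six (groupid, nid) rows here; whether the single-scan form is actually faster depends on the database and the data.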

SQL UNION ALL but with lots of columns on BigQuery?

The image above is a screenshot of my table, as a quick initial reference.
The focal point is the set of mech columns (mech1, mech2, mech3, and mech4).
Board games in this table have multiple attributes called mechanisms, so I've separated them into 4 different columns.
I've learned how to combine columns vertically via UNION ALL so that I can query the count of each unique game mechanism in my table.
However, it got me wondering if there's a shorter and more efficient way to achieve what I've done:
WITH mechanism_info AS
(
WITH
mechanism_col_combined AS
(
SELECT mech1 AS all_mech_columns_combined
FROM `ckda-portfolio-2022.bg_collection.base`
UNION ALL
## There's no IS NOT NULL condition defined for column 'mech1' since there's at least one mechanism noted for a game.
SELECT mech2
FROM `ckda-portfolio-2022.bg_collection.base`
WHERE mech2 IS NOT NULL
UNION ALL
SELECT mech3
FROM `ckda-portfolio-2022.bg_collection.base`
WHERE mech3 IS NOT NULL
UNION ALL
SELECT mech4
FROM `ckda-portfolio-2022.bg_collection.base`
WHERE mech4 IS NOT NULL
)
## Temporary table with all mechanism column in the collection combined.
SELECT DISTINCT(all_mech_columns_combined) AS unique_mechanisms, COUNT(*) AS count
FROM mechanism_col_combined
GROUP BY all_mech_columns_combined
ORDER BY all_mech_columns_combined
)
SELECT *
FROM mechanism_info
By querying this temp. table, SQL returns the information that I've anticipated as below:
unique_mechanisms | count
Acting | 1
Action Points | 3
Action Queue | 1
Action Retrieval | 1
Area Movement | 1
Auction/Bidding | 5
Bag Building | 1
Betting & Bluffing| 2
Bingo | 1
Bluffing | 7
Now, I want to shorten my code and I know there has to be a way to shorten the repetitive process of combining columns with UNION ALL.
And if there's any other tips or methods on how to shorten my query, please let me know!
Thank you.
You can convert the multiple columns [mech1, mech2, ...] into a single array column mech_arr and then use UNNEST to expand it so that each row holds one scalar value.
For example:
WITH table1 AS (
SELECT 'AA' AS mech1, 'BB' AS mech2, 'CC' AS mech3,
UNION ALL SELECT 'AA' AS mech1, 'CC' AS mech2, 'EE' AS mech3
),
table2 AS (SELECT [mech1, mech2, mech3] AS mech_arr FROM table1)
SELECT mech, COUNT(*) AS mech_counts
FROM table2, UNNEST(mech_arr) AS mech
GROUP BY mech
Output
mech mech_counts
AA 2
BB 1
CC 2
EE 1
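If it helps to see what the array-then-UNNEST step does to the data, here is a rough emulation in plain Python. It does not call BigQuery; the rows are the two sample rows from table1 above, and the None filter is an extra assumption that plays the role of a WHERE mech IS NOT NULL.

```python
from collections import Counter

# Sample rows mirroring table1 in the query above.
rows = [
    {"mech1": "AA", "mech2": "BB", "mech3": "CC"},
    {"mech1": "AA", "mech2": "CC", "mech3": "EE"},
]
mech_columns = ["mech1", "mech2", "mech3"]

# Step 1: one list per row (the role of [mech1, mech2, mech3] AS mech_arr).
mech_arrays = [[row[col] for col in mech_columns] for row in rows]

# Step 2: flatten to one value per element (the role of UNNEST).
# Skipping None is an added filter, not something UNNEST does by itself.
flattened = [mech for arr in mech_arrays for mech in arr if mech is not None]

# Step 3: count occurrences (the role of GROUP BY mech with COUNT(*)).
mech_counts = Counter(flattened)

for mech in sorted(mech_counts):
    print(mech, mech_counts[mech])
```

The counts match the output table above: AA 2, BB 1, CC 2, EE 1.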
You could do a self join on the table, but the performance would not improve and the query would be just as long.
You can simplify as follows:
SELECT
mech_column,
COUNT(*) AS `number`
FROM (
SELECT mech1 AS mech_column
FROM `ckda-portfolio-2022.bg_collection.base`
UNION ALL
SELECT mech2
FROM `ckda-portfolio-2022.bg_collection.base`
UNION ALL
SELECT mech3
FROM `ckda-portfolio-2022.bg_collection.base`
UNION ALL
SELECT mech4
FROM `ckda-portfolio-2022.bg_collection.base`
) m
WHERE mech_column IS NOT NULL
GROUP BY mech_column
ORDER BY mech_column;
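A quick way to check that a single outer WHERE filters the stacked column the same way as one filter per SELECT is to run the shape against a tiny table. This sketch uses Python's sqlite3 module instead of BigQuery; the table name base and the three sample rows are made up for illustration.

```python
import sqlite3

# In-memory SQLite table standing in for the BigQuery table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE base (mech1 TEXT, mech2 TEXT, mech3 TEXT, mech4 TEXT)")
conn.executemany(
    "INSERT INTO base VALUES (?, ?, ?, ?)",
    [
        ("Bluffing", "Auction/Bidding", None, None),
        ("Action Points", "Bluffing", "Bingo", None),
        ("Bluffing", None, None, None),
    ],
)

# Stack the four columns, then drop NULLs once in the outer query.
mech_counts = conn.execute(
    """
    SELECT mech_column, COUNT(*) AS mech_count
    FROM (
        SELECT mech1 AS mech_column FROM base
        UNION ALL SELECT mech2 FROM base
        UNION ALL SELECT mech3 FROM base
        UNION ALL SELECT mech4 FROM base
    ) AS m
    WHERE mech_column IS NOT NULL
    GROUP BY mech_column
    ORDER BY mech_column
    """
).fetchall()

for mech, mech_count in mech_counts:
    print(mech, mech_count)
```

The NULL cells never reach the GROUP BY, so no NULL group shows up in the result.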
I didn't find a smoother way to query, but I did find a way to avoid adding WHERE column IS NOT NULL for every column that is stacked vertically into a single column:
WITH mechanism_info AS
(
WITH
mechanism_col_combined AS
(
SELECT mech1 AS mech_columns
FROM `ckda-portfolio-2022.bg_collection.base`
UNION ALL
SELECT mech2
FROM `ckda-portfolio-2022.bg_collection.base`
UNION ALL
SELECT mech3
FROM `ckda-portfolio-2022.bg_collection.base`
UNION ALL
SELECT mech4
FROM `ckda-portfolio-2022.bg_collection.base`
## Removed all WHERE clauses from the SELECTs above and added one below instead.
)
## Temporary table with all mechanism columns in the collection combined.
SELECT DISTINCT(mech_columns) AS mechanisms, COUNT(*) AS count
FROM mechanism_col_combined
WHERE mech_columns IS NOT NULL ## <--- Added here!
GROUP BY mech_columns
ORDER BY mech_columns
)
SELECT *
FROM mechanism_info
Since mechanism_info contains a nested temp. table, I can add a single WHERE mech_columns IS NOT NULL clause to the query that reads from it, instead of one per column.
I'm still looking to reduce this query down to something more efficient. It's unfortunate that UNION ALL can't select multiple columns with a single call :(

How to limit number of groups returned in a query, but not the number of rows in Oracle

How to limit the number of groups in a query, but not the number of rows in Oracle?
If I had to do that manually, I would have to use a DISTINCT.
Would be something like this:
FOR d IN (
SELECT DISTINCT COLUMN_1 FROM myTable
WHERE myDate BETWEEN x AND y
OFFSET o ROWS
FETCH NEXT l ROWS ONLY
) LOOP
And then, do the selects from each of the ids returned in the query, which, in my opinion, is a terrible solution.
SAMPLE DATA:
If I limit the number of groups to 2 by using COLUMN_2, the expected result should be something like:
I believe you may be looking for something like this:
select *
from mytable
where id in (
select distinct id
from mytable
where my_date between x and y
fetch first :n rows only
)
;
:n is a bind variable, encoding the number of groups you want to select.
This should be more efficient than solutions using analytic functions - even if it must read the base table twice. In tests posted on OTN, I showed that the difference is not small.
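For anyone who wants to see the shape of the IN-subquery approach on toy data, here is a small sketch with Python's sqlite3 module. SQLite uses LIMIT rather than FETCH FIRST, and the table contents, the payload column and the ORDER BY inside the subquery are assumptions added so the result is deterministic.

```python
import sqlite3

# In-memory SQLite table standing in for the Oracle table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER, payload TEXT)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [(1, "a"), (2, "b"), (2, "c"), (2, "d"), (3, "e"), (3, "f"), (4, "g")],
)

group_limit = 2  # number of groups to keep, bound as :n

# Pick the first :n distinct ids, then return every row for those ids.
rows = conn.execute(
    """
    SELECT *
    FROM mytable
    WHERE id IN (
        SELECT DISTINCT id
        FROM mytable
        ORDER BY id
        LIMIT :n
    )
    ORDER BY id, payload
    """,
    {"n": group_limit},
).fetchall()

for row in rows:
    print(row)
```

The limit applies to groups, not rows: two ids come back, with one row for id 1 and three rows for id 2.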
EDIT If I remember correctly, FETCH is not implemented in the most efficient way (perhaps for good reasons, having to do with features we don't need in this query - such as how to deal with ties). FETCH itself resembles a DENSE_RANK() implementation rather than the faster row limiting clause (using ROWNUM). I would likely need to modify the query to do away with FETCH, if speed was really important. END EDIT
Further edit to do with performance comparisons
Frequent poster MT0 requested a pointer for the claim that aggregate solutions can be (and often are) more efficient than analytic function approaches, even when the former may require multiple passes through the data where the analytic function approach requires only one.
Alas, OTN (what now calls itself the "Oracle Groundbreakers Developer Community", the discussion board hosted by Oracle itself) went through a massive - and massively botched - platform change at the end of September 2020; that messed up both the search facilities and the formatting of old posts, to the point of rendering them almost unusable.
Instead, I will show here a simple mock-up of the OP's problem in this thread; code that anyone can run so they can repeat the tests on their own machine.
I created a table with two columns, ID and STR - the ID plays the same role as in the OP's question, and STR is just extra payload to mimic real-life data. ID is number and STR is varchar2(100). I populated the table with 9 million rows - 1 million ID's, nine rows for each ID. The task is to select just three "groups" (three distinct ID's, then select all the rows from the base table for those three distinct ID's).
With no index on the ID column, the aggregate solution runs in 0.81 seconds on my machine; with an index on ID, it runs in 0.47 seconds. The analytic functions solution runs in 0.91 seconds, with or without an index (obviously - there is no way an index can benefit the analytic function solution). All these results are for column ID not declared NOT NULL.
Here is the code to create the table, the index on ID, and the two queries I tested. Note: As I explained in my first edit (above), fetch is slow; I replaced it with a standard row-limiting technique using ROWNUM in an over-query.
drop table t purge;
create table t (id number, str varchar2(100));
insert into t
with row_gen as (select level from dual connect by level <= 3000)
select mod(344227 * rownum, 1000000), rpad('x', 100, 'x')
from row_gen cross join row_gen
;
commit;
create index t_idx on t(id);
select *
from t
where id in (
select id from (select distinct id from t)
where rownum <= 3
);
select *
from ( select t.*, dense_rank() over (order by id) dr from t )
where dr <= 3;
You can use DENSE_RANK:
SELECT *
FROM (
SELECT t.*,
DENSE_RANK() OVER ( ORDER BY column2 ) AS rnk
FROM table_name t
)
WHERE rnk <= 2;
Which, for the sample data:
CREATE TABLE table_name ( column1, column2, column3, column4 ) AS
SELECT 1, 1, 1.0, 1.0 FROM DUAL UNION ALL
SELECT 2, 2, 2.0, 2.0 FROM DUAL UNION ALL
SELECT 2, 2, 2.2, 2.1 FROM DUAL UNION ALL
SELECT 2, 2, 2.2, 2.2 FROM DUAL UNION ALL
SELECT 2, 2, 2.0, 2.3 FROM DUAL UNION ALL
SELECT 3, 3, 3.0, 3.1 FROM DUAL UNION ALL
SELECT 3, 3, 3.1, 3.1 FROM DUAL UNION ALL
SELECT 3, 3, 3.1, 3.1 FROM DUAL UNION ALL
SELECT 4, 4, 4.2, 4.0 FROM DUAL;
Outputs:
COLUMN1 | COLUMN2 | COLUMN3 | COLUMN4 | RNK
------: | ------: | ------: | ------: | --:
1 | 1 | 1 | 1 | 1
2 | 2 | 2 | 2 | 2
2 | 2 | 2.2 | 2.1 | 2
2 | 2 | 2.2 | 2.2 | 2
2 | 2 | 2 | 2.3 | 2
(and, if you want DISTINCT rows then add DISTINCT to the outer query)
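The same DENSE_RANK query can be replayed locally with Python's sqlite3 module, assuming the bundled SQLite is 3.25 or newer (window functions were added in that release). The rows are the sample data above; the final ORDER BY is only there to make the printed order stable.

```python
import sqlite3

# In-memory SQLite copy of the sample data (SQLite stands in for Oracle here).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table_name "
    "(column1 INTEGER, column2 INTEGER, column3 REAL, column4 REAL)"
)
conn.executemany(
    "INSERT INTO table_name VALUES (?, ?, ?, ?)",
    [
        (1, 1, 1.0, 1.0),
        (2, 2, 2.0, 2.0),
        (2, 2, 2.2, 2.1),
        (2, 2, 2.2, 2.2),
        (2, 2, 2.0, 2.3),
        (3, 3, 3.0, 3.1),
        (3, 3, 3.1, 3.1),
        (3, 3, 3.1, 3.1),
        (4, 4, 4.2, 4.0),
    ],
)

# Rank the groups by column2 and keep every row of the first two groups.
ranked = conn.execute(
    """
    SELECT *
    FROM (
        SELECT t.*, DENSE_RANK() OVER (ORDER BY column2) AS rnk
        FROM table_name AS t
    )
    WHERE rnk <= 2
    ORDER BY column1, column4
    """
).fetchall()

for row in ranked:
    print(row)
```

Because DENSE_RANK gives tied rows the same rank without gaps, rnk <= 2 keeps all rows of the first two column2 values.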
If I understand correctly, you want ROW_NUMBER():
SELECT t.*
FROM (SELECT t.*,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) as seqnum
FROM myTable t
WHERE t.myDate BETWEEN x AND y
) t
WHERE seqnum = 1;
This returns an arbitrary row for each id meeting the conditions.

`INTERSECT` does not return anything from two tables, separately values are returned fine

I'm not sure what I am doing wrong here since I didn't touch SQL queries for several years plus MSSQL query language is a bit strange to me but after 30 minutes of googling I still cannot find the answer.
Problem
I have two queries that work perfectly fine:
SELECT COUNT(*) AS 'NumberOfAccounts' FROM Accounts
SELECT COUNT(*) AS 'NumberOfUsers' FROM Users
I need to get this information in one go in my API response since I don't want to execute two statements. How can I combine them into one query so it will return table as follows:
+------------------+---------------+
| NumberOfAccounts | NumberOfUsers |
+------------------+---------------+
| 10 | 16 |
+------------------+---------------+
What I have tried
UNION:
SELECT COUNT(*) AS 'NumberOfAccounts' FROM Accounts
UNION
SELECT COUNT(*) AS 'NumberOfUsers' FROM Users
This is giving me the result of both tables, however it all pushes it into NumberOfAccounts and the result is invalid for me to parse.
+------------------+
| NumberOfAccounts |
+------------------+
| 10 |
| 16 |
+------------------+
INTERSECT:
SELECT COUNT(*) AS 'NumberOfAccounts' FROM Accounts
INTERSECT
SELECT COUNT(*) AS 'NumberOfUsers' FROM Users
This just gives me empty result with only NumberOfAccounts column in it.
You can just put these as subqueries in a select:
SELECT (SELECT COUNT(*) FROM Accounts) as NumberOfAccounts,
(SELECT COUNT(*) FROM Users) as NumberOfUsers
In SQL Server, no FROM clause is needed.
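Here is a small runnable sketch of the scalar-subquery form, using Python's sqlite3 module in place of SQL Server (SQLite also accepts a SELECT without FROM). The single-column tables and the 10 and 16 row counts are assumptions chosen to match the numbers in the question.

```python
import sqlite3

# In-memory SQLite tables standing in for the SQL Server ones (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Accounts (id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE Users (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO Accounts (id) VALUES (?)", [(i,) for i in range(1, 11)])
conn.executemany("INSERT INTO Users (id) VALUES (?)", [(i,) for i in range(1, 17)])

# Each scalar subquery becomes one column of a single result row.
cursor = conn.execute(
    """
    SELECT (SELECT COUNT(*) FROM Accounts) AS NumberOfAccounts,
           (SELECT COUNT(*) FROM Users)    AS NumberOfUsers
    """
)
column_names = [description[0] for description in cursor.description]
counts = cursor.fetchone()

print(dict(zip(column_names, counts)))
```

One statement, one row, two named columns, which is the shape the API response needs.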
UNION is the wrong usage here. Union will "merge" rows of identical tables (or identical selects) and not columns.
One solution might be:
SELECT AccountCount, UserCount FROM
(SELECT COUNT(*) AS AccountCount, 1 AS Id FROM Accounts) AS a
JOIN
(SELECT COUNT(*) AS UserCount, 1 as Id FROM Users) AS u ON (a.Id = u.Id)
Be aware of the artificial surrogate key 1 you need to insert to join both sub-selects together.
For completeness' sake, with UNION ALL you'd do:
SELECT 'NumberOfAccounts' AS what, COUNT(*) AS howmany FROM accounts
UNION ALL
SELECT 'NumberOfUsers' AS what, COUNT(*) AS howmany FROM users;
which results in
+------------------+---------+
| what | howmany |
+------------------+---------+
| NumberOfAccounts | 10 |
| NumberOfUsers | 16 |
+------------------+---------+
And another variation:
WITH cte AS
(
SELECT COUNT(*) AS cntAccounts, 0 AS cntUsers FROM accounts
UNION ALL
SELECT 0 AS cntAccounts, COUNT(*) AS cntUsers FROM users
)
SELECT
SUM(cntAccounts) AS NumberOfAccounts
,SUM(cntUsers ) AS NumberOfUsers
FROM cte
If you want (need) better performance you can get the row counts from the following query which uses sys.dm_db_partition_stats to get the row counts:
SELECT (
SELECT SUM (row_count)
FROM sys.dm_db_partition_stats
WHERE object_id=OBJECT_ID('Accounts')
AND (index_id=0 or index_id=1)) NumberOfAccounts,
(
SELECT SUM (row_count)
FROM sys.dm_db_partition_stats
WHERE object_id=OBJECT_ID('Users')
AND (index_id=0 or index_id=1)) NumberOfUsers

Select data from a table where only the first two columns are distinct

Background
I have a table which has six columns. The first three columns create the pk. I'm tasked with removing one of the pk columns.
I selected (using distinct) the data into a temp table (excluding the third column), and tried inserting all of that data back into the original table with the third column being '11' for every row as this is what I was instructed to do. (this column is going to be removed by a DBA after I do this)
However, when I went to insert this data back into the original table I get a pk constraint error. (shocking, I know)
The other three columns are just date columns, so the distinct select didn't create a unique pk for each record. What I'm trying to achieve is just calling a distinct on the first two columns, and then just arbitrarily selecting the three other columns as it doesn't matter which dates I choose (at least not on dev).
What I've tried
I found the following post which seems to achieve what I want:
How do I (or can I) SELECT DISTINCT on multiple columns?
I tried the answers from both Joel and Erwin.
Attempt 1:
However, with Joel's answer the set returned is too large; the inner join isn't doing what I thought it would do. Selecting distinct col1 and col2 returns 400 rows, but his solution returns 600 rows. I checked the data and in fact there were duplicate pk's. Here is my attempt at duplicating Joel's answer:
select a.emp_no,
a.eec_planning_unit_cde,
'11' as area, create_dte,
create_by_emp_no, modify_dte,
modify_by_emp_no
from tempdb.guest.temp_part_time_evaluator b
inner join
(
select emp_no, eec_planning_unit_cde
from tempdb.guest.temp_part_time_evaluator
group by emp_no, eec_planning_unit_cde
) a
ON b.emp_no = a.emp_no AND b.eec_planning_unit_cde = a.eec_planning_unit_cde
Now, if I execute just the inner select statement 400 rows are returned. If I select the whole query 600 rows are returned? Isn't inner join supposed to only show the intersection of the two sets?
Attempt 2:
I also tried the answer from Erwin. This one has a syntax error and I'm having trouble googling the spec on the where clause (specifically, the trick he is using with (emp_no, eec_planning_unit_cde))
Here is the attempt:
select emp_no,
eec_planning_unit_cde,
'11' as area, create_dte,
create_by_emp_no,
modify_dte,
modify_by_emp_no
where (emp_no, eec_planning_unit_cde) IN
(
select emp_no, eec_planning_unit_cde
from tempdb.guest.temp_part_time_evaluator
group by emp_no, eec_planning_unit_cde
)
Now, I realize that the post I referenced is for PostgreSQL. Doesn't T-SQL have something similar? Trying to google parentheses isn't working too well.
Overview of Questions:
Why doesn't inner join return an intersection of two sets? From googling this is what I thought it was supposed to do
Is there another way to achieve the same method that I was trying in attempt 2 in t-sql?
It doesn't matter to me which one of these I use, or if I use another solution... how should I go about this?
A SELECT DISTINCT is based on all columns, so it does not guarantee that the first two are distinct. Group by them instead:
select pk1, pk2, '11', max(c1), max(c2), max(c3)
from table
group by pk1, pk2
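To see the GROUP BY with MAX() collapse duplicates on the first two key columns, here is a minimal sketch with Python's sqlite3 module. The table name evaluator, the shortened column names and the sample rows are hypothetical; only the query shape comes from the answer above.

```python
import sqlite3

# Hypothetical stand-in for the temp table, with shortened column names.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE evaluator "
    "(emp_no INTEGER, unit_cde TEXT, area TEXT, create_dte TEXT, modify_dte TEXT)"
)
conn.executemany(
    "INSERT INTO evaluator VALUES (?, ?, ?, ?, ?)",
    [
        (100, "A", "01", "2020-01-01", "2020-02-01"),
        (100, "A", "02", "2020-03-01", "2020-04-01"),
        (100, "B", "01", "2020-05-01", "2020-05-02"),
        (200, "A", "03", "2021-01-01", "2021-01-05"),
    ],
)

# One row per (emp_no, unit_cde); area is forced to '11' and the dates are
# picked arbitrarily with MAX(), since any value is acceptable here.
deduped = conn.execute(
    """
    SELECT emp_no, unit_cde, '11' AS area,
           MAX(create_dte) AS create_dte,
           MAX(modify_dte) AS modify_dte
    FROM evaluator
    GROUP BY emp_no, unit_cde
    ORDER BY emp_no, unit_cde
    """
).fetchall()

for row in deduped:
    print(row)
```

Each (emp_no, unit_cde, '11') combination now appears exactly once, so inserting the result back would not violate a key on those three columns.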
You could TRY this:
SELECT a.emp_no,
a.eec_planning_unit_cde,
'11' as area,
b.create_dte,
b.create_by_emp_no,
b.modify_dte,
b.modify_by_emp_no
FROM
(
SELECT emp_no, eec_planning_unit_cde
FROM tempdb.guest.temp_part_time_evaluator
GROUP BY emp_no, eec_planning_unit_cde
) a
JOIN tempdb.guest.temp_part_time_evaluator b
ON a.emp_no = b.emp_no AND a.eec_planning_unit_cde = b.eec_planning_unit_cde
That would give you a distinct on those fields, but if there are differences in the data between columns you might have to try a more brute-force approach.
SELECT a.emp_no,
a.eec_planning_unit_cde,
a.area,
a.create_dte,
a.create_by_emp_no,
a.modify_dte,
a.modify_by_emp_no
FROM
(
-- Number the rows within each (emp_no, eec_planning_unit_cde) pair
SELECT ROW_NUMBER() OVER(PARTITION BY t.emp_no, t.eec_planning_unit_cde ORDER BY t.create_dte) rownumber,
t.emp_no,
t.eec_planning_unit_cde,
'11' as area,
t.create_dte,
t.create_by_emp_no,
t.modify_dte,
t.modify_by_emp_no
FROM tempdb.guest.temp_part_time_evaluator t
) a
-- Keep one row per pair
WHERE rownumber = 1
I'll reply one by one:
Why doesn't inner join return an intersection of two sets? From googling this is what I thought it was supposed to do
An inner join doesn't do an intersection. Let's suppose these tables:
T1 T2
n s n s
1 A 2 X
2 B 2 Y
2 C
3 D
If you join both tables by numeric column you don't get the intersection (2 rows). You get:
select *
from t1 inner join t2
on t1.n = t2.n;
| N | S |
---------
| 2 | B |
| 2 | B |
| 2 | C |
| 2 | C |
And, your second query approach:
select *
from t1
where t1.n in (select n from t2);
| N | S |
---------
| 2 | B |
| 2 | C |
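The difference is easy to reproduce. The sketch below loads the T1 and T2 rows from above into SQLite through Python's sqlite3 module (used here only as a convenient stand-in for SQL Server) and compares the inner join with the IN filter and with a correlated EXISTS.

```python
import sqlite3

# In-memory SQLite copies of T1 and T2 from the example above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (n INTEGER, s TEXT)")
conn.execute("CREATE TABLE t2 (n INTEGER, s TEXT)")
conn.executemany(
    "INSERT INTO t1 VALUES (?, ?)", [(1, "A"), (2, "B"), (2, "C"), (3, "D")]
)
conn.executemany("INSERT INTO t2 VALUES (?, ?)", [(2, "X"), (2, "Y")])

# Inner join: every matching pair of rows is returned (2 x 2 = 4 rows).
joined = conn.execute(
    "SELECT t1.n, t1.s, t2.s FROM t1 INNER JOIN t2 ON t1.n = t2.n "
    "ORDER BY t1.s, t2.s"
).fetchall()

# IN: each t1 row is returned at most once, however many matches exist.
in_rows = conn.execute(
    "SELECT n, s FROM t1 WHERE n IN (SELECT n FROM t2) ORDER BY s"
).fetchall()

# Correlated EXISTS: same filter, and it extends to multi-column keys.
exists_rows = conn.execute(
    "SELECT n, s FROM t1 "
    "WHERE EXISTS (SELECT 1 FROM t2 WHERE t2.n = t1.n) ORDER BY s"
).fetchall()

print(len(joined), len(in_rows), len(exists_rows))
```

The join multiplies rows when the key is duplicated on both sides, which is why 400 key pairs can turn into 600 joined rows; IN and EXISTS only filter.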
Is there another way to achieve the same method that I was trying in attempt 2 in t-sql?
Yes, this subquery:
select *
from t1
where exists (
select 1
from t2
where t2.n = t1.n
);
It doesn't matter to me which one of these I use, or if I use another solution... how should I go about this?
Yes, using @JTC's second query.

How can I select unique rows in a database over two columns?

I have found similar solutions online but none that I've been able to apply to my specific problem.
I'm trying to "unique-ify" data from one table to another. In my original table, data looks like the following:
USERIDP1 USERIDP2 QUALIFIER DATA
1 2 TRUE AB
1 2 CD
1 3 EF
1 3 GH
The user IDs are composed of two parts, USERIDP1 and USERIDP2 concatenated. I want to transfer all the rows that correspond to a user who has QUALIFIER=TRUE in ANY row they own, but ignore users who do not have a TRUE QUALIFIER in any of their rows.
To clarify, all of User 12's rows would be transferred, but not User 13's. The output would then look like:
USERIDP1 USERIDP2 QUALIFIER DATA
1 2 TRUE AB
1 2 CD
So basically, I need to find rows with distinct user ID components (involving two unique fields) that also possess a row with QUALIFIER=TRUE and copy all and only all of those users' rows.
Although this nested query will be very slow for large tables, this could do it.
SELECT DISTINCT X.USERIDP1, X.USERIDP2, X.QUALIFIER, X.DATA
FROM YOUR_TABLE_NAME AS X
WHERE EXISTS (SELECT 1 FROM YOUR_TABLE_NAME AS Y WHERE Y.USERIDP1 = X.USERIDP1
AND Y.USERIDP2 = X.USERIDP2 AND Y.QUALIFIER = TRUE)
It could be written as an inner join with itself too:
SELECT DISTINCT X.USERIDP1, X.USERIDP2, X.QUALIFIER, X.DATA
FROM YOUR_TABLE_NAME AS X
INNER JOIN YOUR_TABLE_NAME AS Y ON Y.USERIDP1 = X.USERIDP1
AND Y.USERIDP2 = X.USERIDP2 AND Y.QUALIFIER = TRUE
For a large table, create a new auxiliary table containing only USERIDP1 and USERIDP2 columns for rows that have QUALIFIER = TRUE and then join this table with your original table using inner join similar to the second option above. Remember to create appropriate indexes.
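As a sanity check of the EXISTS version on the question's four rows, here is a sketch with Python's sqlite3 module. The table names src and dst are hypothetical, and QUALIFIER is stored as 1 or NULL because SQLite has no separate boolean type.

```python
import sqlite3

# In-memory SQLite copy of the question's sample rows (names are assumptions).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE src "
    "(useridp1 INTEGER, useridp2 INTEGER, qualifier INTEGER, data TEXT)"
)
conn.executemany(
    "INSERT INTO src VALUES (?, ?, ?, ?)",
    [(1, 2, 1, "AB"), (1, 2, None, "CD"), (1, 3, None, "EF"), (1, 3, None, "GH")],
)

# Copy every row of each user that has at least one qualifying row.
conn.execute(
    """
    CREATE TABLE dst AS
    SELECT x.useridp1, x.useridp2, x.qualifier, x.data
    FROM src AS x
    WHERE EXISTS (
        SELECT 1
        FROM src AS y
        WHERE y.useridp1 = x.useridp1
          AND y.useridp2 = x.useridp2
          AND y.qualifier = 1
    )
    """
)

copied = conn.execute("SELECT * FROM dst ORDER BY data").fetchall()
for row in copied:
    print(row)
```

Both rows of user (1, 2) are copied, including the one without the flag, and user (1, 3) is skipped entirely.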
This should do the trick. Partition by both id columns directly; concatenating them instead would require casting integer ids to varchar first.
SELECT 1 as id1,2 as id2,'TRUE' as qualifier,'AB' as data into #sampled
UNION ALL SELECT 1,2,NULL,'CD'
UNION ALL SELECT 1,3,NULL,'EF'
UNION ALL SELECT 1,3,NULL,'GH'
;WITH data as
(
SELECT
id1
,id2
,qualifier
,data
,SUM(CASE WHEN qualifier = 'TRUE' THEN 1 ELSE 0 END)
OVER (PARTITION BY id1, id2) as num_qualifier
from #sampled
)
SELECT
id1
,id2
,qualifier
,data
from data
where num_qualifier > 0
Select *
from yourTable
INNER JOIN (Select UserIDP1, UserIDP2 FROM yourTable WHERE Qualifier=TRUE) B
ON yourTable.UserIDP1 = B.UserIDP1 and YourTable.UserIDP2 = B.UserIDP2
How about a subquery as a where clause?
SELECT *
FROM theTable t1
WHERE CAST(t1.useridp1 AS VARCHAR) + CAST(t1.useridp2 AS VARCHAR) IN
(SELECT CAST(t2.useridp1 AS VARCHAR) + CAST(t2.useridp2 AS VARCHAR)
FROM theTable t2
WHERE t2.qualifier = TRUE
);
This is a solution in mysql, but I believe it should transfer to sql server pretty easily. Use a subquery to pick out groups of (id1, id2) combinations with at least one True 'qualifier' row; then join that to the original table on (id1, id2).
mysql> SELECT u1.*
FROM users u1
JOIN (SELECT id1,id2
FROM users
WHERE qualifier
GROUP BY id1, id2) u2
USING(id1, id2);
+------+------+-----------+------+
| id1 | id2 | qualifier | data |
+------+------+-----------+------+
| 1 | 2 | 1 | aa |
| 1 | 2 | 0 | bb |
+------+------+-----------+------+
2 rows in set (0.00 sec)