How to conduct large-scale lookups in SQL? - sql

I have a SQL table with several million rows, each row contains columns ID1 and ID2.
I'm trying to look up the entries in this table for 100,000 unique combinations of ID1 and ID2 and export the results of this query to a CSV file.
In the past, for smaller numbers of rows, I've been able to use a query along the lines of
SELECT *
FROM Database.Table
WHERE ID1 = 5 AND ID2 = 3
OR ID1 = 9 AND ID2 = 33
OR ID1 = 59 AND ID2 = 332...
However this seems to break once I get beyond a few thousand combinations of ID1 and ID2.
What is the best approach for handling large lookups like this in SQL?

Load the CSV file into a table with two columns: id1 and id2. Make these the primary key.
Then use join or exists:
select bt.*
from bigtable bt join
csvtable csv
on bt.id1 = csv.id1 and bt.id2 = csv.id2;
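A minimal sketch of this load-then-join approach, using Python's built-in sqlite3 module rather than the asker's (unnamed) database. The `payload` column, the sample rows, and the `lookup_pairs` helper are hypothetical and only there for illustration; `bigtable` and `csvtable` follow the names in the query above.

```python
import csv
import io
import sqlite3


def lookup_pairs(conn, pairs, out):
    """Stage (id1, id2) pairs in a keyed table, join against bigtable, write CSV."""
    cur = conn.cursor()
    # Staging table; the composite primary key gives the join an index to use.
    cur.execute(
        "CREATE TEMP TABLE csvtable (id1 INTEGER, id2 INTEGER, PRIMARY KEY (id1, id2))"
    )
    # INSERT OR IGNORE (SQLite syntax) skips duplicate pairs instead of failing.
    cur.executemany("INSERT OR IGNORE INTO csvtable VALUES (?, ?)", pairs)
    cur.execute(
        "SELECT bt.* FROM bigtable bt "
        "JOIN csvtable csv ON bt.id1 = csv.id1 AND bt.id2 = csv.id2 "
        "ORDER BY bt.id1, bt.id2"
    )
    header = [col[0] for col in cur.description]
    rows = cur.fetchall()
    writer = csv.writer(out)
    writer.writerow(header)
    writer.writerows(rows)
    cur.execute("DROP TABLE csvtable")
    return rows


# Hypothetical sample data standing in for the multi-million-row table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bigtable (id1 INTEGER, id2 INTEGER, payload TEXT)")
conn.executemany(
    "INSERT INTO bigtable VALUES (?, ?, ?)",
    [(5, 3, "a"), (9, 33, "b"), (59, 332, "c"), (7, 7, "d")],
)

buf = io.StringIO()  # stands in for an open CSV file
matched = lookup_pairs(conn, [(5, 3), (59, 332), (1, 1)], buf)
print(buf.getvalue())
```

With 100,000 pairs the shape stays the same: the pairs are inserted in bulk once, and the database resolves the lookup as a single indexed join instead of parsing a very long OR chain.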

Have you tried the IN operator with row constructors? This form is supported in MySQL and PostgreSQL, though not in SQL Server. https://www.mysqltutorial.org/sql-in.aspx/
SELECT * FROM table WHERE (ID1, ID2) in ( (5, 3), (10, 6), ..)

Related

An association x-reference table

I've got a SQL table like below where one value is linked to a second value and vice versa.
ROW ID1 ID2
1 1 2
2 2 1
3 3 4
4 4 3
....
This might be some bad design but this is what I'm stuck with. I need to produce a SQL query in SQL Server to return only the following (doesn't matter which order):
ROW ID1 ID2
1 1 2
3 3 4
....
OR
ROW ID1 ID2
2 2 1
4 4 3
....
I've got a list of IDs (1, 2, 3, 4) which I used to query the table against the ID1 field or the ID2 field, but it always returns all the rows because those IDs exist in both columns.
I've tried eliminating one row by checking whether the value in one field also exists in the other column, but then I get no results. Obviously.
The one solution that could work is to look at the rownum field and only take the even or odd rows. But this feels hacky. Also, there might be other values in that list that are not part of my IN list, so that could possibly miss some rows?
Is there anything elegant to consider from a T-SQL perspective?
Here's one (quite cumbersome but pretty effective) way to do it.
First, create and populate a sample table (please save us this step in future questions):
CREATE TABLE Table1 (
[ROW] int,
[ID1] int,
[ID2] int
);
INSERT INTO Table1 ([ROW], [ID1], [ID2]) VALUES
(1, 1, 2),
(2, 2, 1),
(3, 3, 4),
(4, 4, 3),
(5, 1, 4);
Note: The last row is not part of the sample data you've provided, but I assumed you would also like the results to include records where only one row holds the connection between Id1 and Id2.
Then, use a couple of common table expressions to get the minimum row number of any pair of Id1 and Id2, regardless of the order of the ids, and then query the original table joined to the second CTE:
WITH CTE1 AS
(
SELECT Row,
IIF(Id1 < Id2, Id1, Id2) As Small,
IIF(Id1 < Id2, Id2, Id1) As Big
FROM Table1
), CTE2 AS
(
SELECT Min(Row) As MinRow
FROM CTE1
GROUP BY Small, Big
)
SELECT Row, Id1, Id2
FROM Table1
JOIN CTE2
ON Row = MinRow;
Results:
Row Id1 Id2
1 1 2
3 3 4
5 1 4
You can see a live demo on DB<>Fiddle
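The same idea can be tried locally as a rough sketch with Python's sqlite3 module. This is SQLite rather than SQL Server, so the two-argument scalar `MIN`/`MAX` functions stand in for `IIF`, and the column is named `row_id` (a hypothetical rename) to sidestep keyword issues with `ROW`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Table1 (row_id INTEGER, id1 INTEGER, id2 INTEGER)")
conn.executemany(
    "INSERT INTO Table1 VALUES (?, ?, ?)",
    [(1, 1, 2), (2, 2, 1), (3, 3, 4), (4, 4, 3), (5, 1, 4)],
)

query = """
WITH normalized AS (
    -- Put each pair in a canonical (small, big) order so (1,2) and (2,1) match.
    SELECT row_id,
           MIN(id1, id2) AS small,
           MAX(id1, id2) AS big
    FROM Table1
),
first_rows AS (
    -- Keep the lowest row number seen for each unordered pair.
    SELECT MIN(row_id) AS min_row
    FROM normalized
    GROUP BY small, big
)
SELECT t.row_id, t.id1, t.id2
FROM Table1 t
JOIN first_rows f ON t.row_id = f.min_row
ORDER BY t.row_id
"""

deduped = conn.execute(query).fetchall()
print(deduped)
```

Swapping `MIN(row_id)` for `MAX(row_id)` would return the other member of each mirrored pair, which the question says is equally acceptable.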

Combine three columns from different tables into one row

I am new to SQL and am trying to combine a column value from each of three different tables into one row in DB2 Warehouse on Cloud. Each table consists of only one row and has a unique column name. So what I want is just to join these three into one row, keeping their original column names.
Each table is built from a statement that looks like this:
SELECT SUM(FUEL_TEMP.FUEL_MLAD_VALUE) AS FUEL
FROM
(SELECT ML_ANOMALY_DETECTION.MLAD_METRIC AS MLAD_METRIC, ML_ANOMALY_DETECTION.MLAD_VALUE AS FUEL_MLAD_VALUE, ML_ANOMALY_DETECTION.TAG_NAME AS TAG_NAME, ML_ANOMALY_DETECTION.DATETIME AS DATETIME, DATA_CONFIG.SYSTEM_NAME AS SYSTEM_NAME
FROM ML_ANOMALY_DETECTION
INNER JOIN DATA_CONFIG ON
(ML_ANOMALY_DETECTION.TAG_NAME =DATA_CONFIG.TAG_NAME AND
DATA_CONFIG.SYSTEM_NAME = 'FUEL')
WHERE ML_ANOMALY_DETECTION.MLAD_METRIC = 'IFOREST_SCORE'
AND ML_ANOMALY_DETECTION.DATETIME >= (CURRENT DATE - 9 DAYS)
ORDER BY DATETIME DESC)
AS FUEL_TEMP
I have tried JOIN, INNER JOIN, UNION/UNION ALL, but can't get it to work as it should. How can I do this?
Use a cross-join like this:
create table table1 (field1 char(10));
create table table2 (field2 char(10));
create table table3 (field3 char(10));
insert into table1 values('value1');
insert into table2 values('value2');
insert into table3 values('value3');
select *
from table1
cross join table2
cross join table3;
Result:
field1 field2 field3
---------- ---------- ----------
value1 value2 value3
A cross join joins all the rows on the left with all the rows on the right. You will end up with a product of rows (table1 rows x table2 rows x table3 rows). Since each table only has one row, you will get (1 x 1 x 1) = 1 row.
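To see the row-count product in action, here is a small sketch using Python's sqlite3 module (SQLite, not DB2, so treat it as an illustration of the join semantics only). The extra `'value2b'` row is a hypothetical addition to show what happens when one table stops being single-row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE table1 (field1 TEXT);
    CREATE TABLE table2 (field2 TEXT);
    CREATE TABLE table3 (field3 TEXT);
    INSERT INTO table1 VALUES ('value1');
    INSERT INTO table2 VALUES ('value2');
    INSERT INTO table3 VALUES ('value3');
    """
)

cross = "SELECT * FROM table1 CROSS JOIN table2 CROSS JOIN table3"

# 1 x 1 x 1 = 1 row, with each table's column kept under its own name.
one_row = conn.execute(cross).fetchall()

# Adding a second row to table2 makes the product 1 x 2 x 1 = 2 rows.
conn.execute("INSERT INTO table2 VALUES ('value2b')")
two_rows = conn.execute(cross).fetchall()

print(one_row)
print(len(two_rows))
```

This is why the cross join is only a safe way to build a single row when every input is guaranteed to return exactly one row, as the aggregate `SUM` queries in the question do.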
Using UNION should solve your problem. Something like this:
SELECT
WarehouseDB1.WarehouseID AS TheID,
'A' AS TheSystem,
WarehouseDB1.TheValue AS TheValue
FROM WarehouseDB1
UNION
SELECT
WarehouseDB2.WarehouseID AS TheID,
'B' AS TheSystem,
WarehouseDB2.TheValue AS TheValue
FROM WarehouseDB2
UNION
SELECT
WarehouseDB3.WarehouseID AS TheID,
'C' AS TheSystem,
WarehouseDB3.TheValue AS TheValue
FROM WarehouseDB3
I'll adapt the code with your table names and rows if you tell me what they are. This kind of query would return something like the following:
TheID TheSystem TheValue
1 A 10
2 A 20
3 B 30
4 C 40
5 C 50
As long as your column names match in each query, you should get the desired results.

Need to get rows where combination of two columns both exist and don't exist

Trying to figure out if it's possible to write a single, set based query to return what I want with data in one single table. The below is just an example, and I need something that could easily work if most (but not all) of combinations 1 to 9 (or 1 to 20 etc) exist.
Table AllCovered has two columns. ID1 and ID2. There are 16 rows in this table, each containing a combination of the numbers 1 to 4 (so 1,1 1,2 1,3 1,4 2,1 .... 4,3 4,4)
Table SomeGaps has the same structure but only has 12 rows, again each row is a combination of 1 to 4, but with some of the combinations missing.
SELECT ID1, ID2, COUNT(ID1) as THIS
FROM AllCovered
GROUP BY ID1, ID2
- this query returns 16 rows, each combination with 1 in the 3rd column (THIS)
SELECT ID1, ID2, COUNT(ID1) as THIS
FROM SomeGaps
GROUP BY ID1, ID2
- this returns the 12 rows. How can I create a query that will return 16 rows, one for each combination, but with 0 in THIS for the combinations that are missing in SomeGaps?
ID1 ID2 THIS
1 1 1
1 2 0 (1,2 combination does NOT exist in SomeGaps)
1 3 1
1 4 1
2 1 1
2 2 0 (2,2 combination does NOT exist in SomeGaps)
Obviously I've tried using a cross join to get all combinations of ID1 and ID2, but the COUNT is, as expected, vastly inflated.
Hope this makes sense. Apologies if it's an easy solution, I can't seem to crack it!
You can do this by cross-joining all the distinct values for the two columns. Then use left outer join and aggregation to get the counts for all combinations:
-- select and group by the cross-joined values; the left-joined columns are NULL for missing combinations
select ac1.id1, ac2.id2, count(sg.id1) as cnt
from (select distinct id1 from AllCovered) ac1 cross join
(select distinct id2 from AllCovered) ac2 left join
SomeGaps sg
on sg.id1 = ac1.id1 and sg.id2 = ac2.id2
group by ac1.id1, ac2.id2;
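A minimal sketch of the cross-join-then-left-join idea against a single table, run through Python's sqlite3 module. It assumes (as the question suggests) that every value 1 to 4 still appears at least once in each column of `SomeGaps`, so the distinct values can be taken from that table alone; the 12 sample rows are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SomeGaps (id1 INTEGER, id2 INTEGER)")
conn.executemany(
    "INSERT INTO SomeGaps VALUES (?, ?)",
    [(1, 1), (1, 2), (1, 3), (1, 4), (2, 1), (2, 4),
     (3, 1), (3, 2), (3, 3), (4, 1), (4, 2), (4, 4)],
)

query = """
SELECT v1.id1, v2.id2, COUNT(sg.id1) AS this
FROM (SELECT DISTINCT id1 FROM SomeGaps) v1
CROSS JOIN (SELECT DISTINCT id2 FROM SomeGaps) v2
LEFT JOIN SomeGaps sg
       ON sg.id1 = v1.id1 AND sg.id2 = v2.id2
GROUP BY v1.id1, v2.id2
ORDER BY v1.id1, v2.id2
"""

counts = conn.execute(query).fetchall()
# Combinations whose count is 0 are the gaps.
missing = [(id1, id2) for id1, id2, this in counts if this == 0]
print(len(counts), missing)
```

The key detail is that the selected and grouped columns come from the cross-joined value lists, while `COUNT` is applied to a column of the left-joined table, which is NULL (and therefore not counted) when the combination is absent.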
I'm probably missing something obvious, but I'll take a bite anyway:
create table #AllCovered (id1 int, id2 int);
insert #AllCovered values
(1,1),(1,2),(1,3),(1,4),(2,1),(2,2),(2,3),(2,4),(3,1),(3,2),(3,3),(3,4),(4,1),(4,2),(4,3),(4,4);
create table #gaps (id1 int, id2 int);
insert #gaps values(1,1),(1,2),(1,3),(1,4),(2,1),(2,4),(3,1),(3,2),(3,3),(4,1),(4,2),(4,4);
select #AllCovered.id1, #AllCovered.id2,
count(#gaps.id1) as this
from #AllCovered
left outer join #gaps
on #AllCovered.id1 = #gaps.id1 and #AllCovered.id2 = #gaps.id2
group by #AllCovered.id1, #AllCovered.id2;
drop table #AllCovered, #gaps
From your narrative, there are no duplicate combinations of (id1, id2) in either table, and AllCovered contains all possible combinations. Otherwise you would need distinct subqueries to fabricate AllCovered.

SQL foreach using table rows

I have two Access tables. One table (table1) has a unique field, MyID and another table (table2) has several rows with MyID and another column, Data. I'd like to write an SQL statement that is similar to a foreach where all the values for MyID are selected and averaged from table2's Data and then updated in the MyID row under another field, Avg.
**Table1**
MyID
ID1
ID2
ID3
**Table2**
MyID Data Mon
ID2 10 Jan
ID2 20 Feb
ID1 10 Jan
ID3 30 Mar
ID1 30 Mar
Expecting results like:
**Table1**
MyID Avg
ID1 20
ID2 15
ID3 30
Maybe there's a better way to do this in SQL, but don't currently know.
UPDATE table1
INNER JOIN
(
SELECT MyID, AVG(Data) AS avgCol
FROM table2
GROUP BY MyID
) b ON table1.MyID = b.MyID
SET table1.[Avg] = b.avgCol
MS Access update with join
This works in MS Access as a SELECT query, but it does not write the averages back into a table.
SELECT Table2.[MyID], Avg(Table2.[Data]) AS [AVG]
FROM Table2
GROUP BY Table2.[MyID]
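For comparison, the "foreach" can also be written as an UPDATE with a correlated subquery. The sketch below runs it in SQLite through Python's sqlite3 module, so it only illustrates the set-based idea and is not Access syntax; the target column is called `AvgData` here (a hypothetical name, chosen to avoid clashing with the `AVG` function).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Table1 (MyID TEXT PRIMARY KEY, AvgData REAL)")
conn.execute("CREATE TABLE Table2 (MyID TEXT, Data REAL, Mon TEXT)")
conn.executemany(
    "INSERT INTO Table1 (MyID) VALUES (?)", [("ID1",), ("ID2",), ("ID3",)]
)
conn.executemany(
    "INSERT INTO Table2 VALUES (?, ?, ?)",
    [("ID2", 10, "Jan"), ("ID2", 20, "Feb"), ("ID1", 10, "Jan"),
     ("ID3", 30, "Mar"), ("ID1", 30, "Mar")],
)

# For every row of Table1, compute the average of its matching Table2 rows.
conn.execute(
    """
    UPDATE Table1
    SET AvgData = (SELECT AVG(Data)
                   FROM Table2
                   WHERE Table2.MyID = Table1.MyID)
    """
)

averages = conn.execute(
    "SELECT MyID, AvgData FROM Table1 ORDER BY MyID"
).fetchall()
print(averages)
```

A MyID with no rows in Table2 would end up with NULL, since `AVG` over an empty set is NULL.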

mysql query performance

Can somebody give a hint on this one? :
I have a table, let's say tblA, where I have id1 and id2 as columns and index(id1,id2).
I want to select the id1s where the id2s belong to several sets. So I would want to say
select id1 from tblA
where id2 in (val1,val2,val3 ...)
union
select id1 from tblA
where id2 in (val4,val2,val3 ...)
union
(...)*
Let's say we have in table A the following:
(1,1)
(1,2)
(1,3)
(1,4)
(1,5)
(2,1)
(2,2)
(2,3)
Now I want all the id1s that have id2 in (3,4).
So what I want to get is id1 = 1.
2 shouldn't appear because although we have a relation (2,3) we don't have (2,4).
Any ideas how to perform this query? I guess the way above has a problem with performance if the (...) grows too much!? Thanks.
greets
You should create a temporary table like this:
CREATE TABLE temp (id INT NOT NULL PRIMARY KEY) ENGINE MEMORY;
, fill it with values you are searching for (3 and 4 in your example):
INSERT
INTO temp
VALUES (3), (4)
and issue this query:
SELECT ad.id1
FROM (
SELECT DISTINCT id1
FROM a
) ad
WHERE NOT EXISTS
(
SELECT NULL
FROM temp
WHERE NOT EXISTS
(
SELECT NULL
FROM a
WHERE a.id1 = ad.id1
AND a.id2 = temp.id
)
)
You should create a composite index on (id1, id2) for this to work efficiently.
For each id1, this will probe each value in temp against a at most once, and will return false as soon as the first temp value that is missing for that id1 is found.
Here's the plan for the query:
1, 'PRIMARY', '<derived2>', 'ALL', '', '', '', '', 2, 'Using where'
3, 'DEPENDENT SUBQUERY', 'temp', 'ALL', '', '', '', '', 2, 'Using where'
4, 'DEPENDENT SUBQUERY', 'a', 'eq_ref', 'PRIMARY', 'PRIMARY', '8', 'ad.id1,test.temp.id', 1, 'Using index'
2, 'DERIVED', 'a', 'range', '', 'PRIMARY', '4', '', 3, 'Using index for group-by'
, no temporary, no filesort.
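The double NOT EXISTS reads as "there is no wanted value that this id1 lacks". A small sketch with Python's sqlite3 module, using the sample rows from the question; the lookup table is called `wanted` here (a hypothetical name, since `temp` is a keyword in SQLite), and SQLite stands in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (id1 INTEGER, id2 INTEGER, PRIMARY KEY (id1, id2))")
conn.executemany(
    "INSERT INTO a VALUES (?, ?)",
    [(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (2, 1), (2, 2), (2, 3)],
)
conn.execute("CREATE TABLE wanted (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO wanted VALUES (?)", [(3,), (4,)])

query = """
SELECT ad.id1
FROM (SELECT DISTINCT id1 FROM a) ad
WHERE NOT EXISTS (
    -- a wanted value ...
    SELECT NULL
    FROM wanted w
    WHERE NOT EXISTS (
        -- ... that this id1 does not have
        SELECT NULL
        FROM a
        WHERE a.id1 = ad.id1
          AND a.id2 = w.id
    )
)
ORDER BY ad.id1
"""

result = [row[0] for row in conn.execute(query)]
print(result)
```

Here id1 = 2 is rejected because the wanted value 4 has no matching (2, 4) row, which matches the expected answer in the question.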
The union is gonna kill your performance. Use something like this:
select id1 from tblA where id2 in (val1,val2,val3 ...) or id2 in (val4,val2,val3)
Can you combine all the sets into one large set?
If the order is not important, this would seem to be the fastest way.
First, remember that
select id1 from tblA where id2 in (val1, val2, val3) union
select id1 from tblA where id2 in (val4, val5, val6)
should give the same result as (UNION removes duplicate rows, hence the DISTINCT)
select distinct id1 from tblA where id2 in (val1, val2, val3, val4, val5, val6)
so you can perhaps improve efficiency by formulating a single query rather than using a union.
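A quick check of that equivalence, sketched with Python's sqlite3 module on the question's sample rows (the value sets (1, 2) and (4, 5) are arbitrary picks for illustration). Note that UNION removes duplicate rows, so the single query needs DISTINCT to return exactly the same rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tblA (id1 INTEGER, id2 INTEGER)")
conn.executemany(
    "INSERT INTO tblA VALUES (?, ?)",
    [(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (2, 1), (2, 2), (2, 3)],
)

# Two IN lists combined with UNION (duplicates removed by UNION).
union_rows = conn.execute(
    "SELECT id1 FROM tblA WHERE id2 IN (1, 2) "
    "UNION "
    "SELECT id1 FROM tblA WHERE id2 IN (4, 5) "
    "ORDER BY id1"
).fetchall()

# One merged IN list; DISTINCT reproduces UNION's de-duplication.
single_rows = conn.execute(
    "SELECT DISTINCT id1 FROM tblA WHERE id2 IN (1, 2, 4, 5) ORDER BY id1"
).fetchall()

# Without DISTINCT, every matching row contributes its id1.
plain_rows = conn.execute(
    "SELECT id1 FROM tblA WHERE id2 IN (1, 2, 4, 5)"
).fetchall()

print(union_rows, single_rows, len(plain_rows))
```

Both forms answer "which id1 has any of these id2 values", which is a different question from "which id1 has all of them".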
Secondly (and independent of the above) you should add an index on id2 to tblA. Without it the id2 values are randomly distributed through both the existing index and the table data, so the optimizer will have no option but to perform a linear scan - of the index, if you are lucky.
But all these queries give back both ids from column id1! I think Robert meant that as a result he just wants "1" from column id1:
id1 id2
1 | 1
1 | 2
1 | 3
1 | 4 --> id1s that have id2 with 3 and 4
1 | 5
2 | 1
2 | 2
2 | 3
Because id1=2 does not have 3 AND 4 it should not be a result.
Please correct me if I misunderstood...
I was trying to do a statement but I could not get just the id1=1 back, but I am as well very interested in an efficient solution to this!
You need to create a separate index on column 'id2', because the combined index on (id1,id2) will not be used when looking up by id2 only.
This query does what you mentioned
SELECT id1 FROM tblA WHERE id2 IN (?,?,?,?)
GROUP BY id1 HAVING COUNT(id2)=4
NOTE: You need to adjust the COUNT(id2) condition in the HAVING clause to the number of values in the IN clause. Here I used four '?' to represent four values, which is why I have written COUNT(id2)=4.
For the scenario which you mentioned in the comment, query will look like following
SELECT id1 FROM tblA WHERE id2 IN (3,4)
GROUP BY id1 HAVING COUNT(id2)=2
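The need to keep the HAVING count in sync with the IN list can be handled by building both from the same list. A minimal sketch with Python's sqlite3 module follows; `ids_covering` is a hypothetical helper, SQLite stands in for MySQL, and `COUNT(DISTINCT id2)` is used as a precaution in case the table could hold duplicate (id1, id2) rows.

```python
import sqlite3


def ids_covering(conn, wanted):
    """Return the id1 values that have a row for every id2 in `wanted`."""
    wanted = sorted(set(wanted))  # duplicates in the input must not inflate the target count
    placeholders = ", ".join("?" for _ in wanted)
    sql = (
        f"SELECT id1 FROM tblA WHERE id2 IN ({placeholders}) "
        "GROUP BY id1 HAVING COUNT(DISTINCT id2) = ? "
        "ORDER BY id1"
    )
    # The last parameter is the number of distinct wanted values.
    return [row[0] for row in conn.execute(sql, (*wanted, len(wanted)))]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tblA (id1 INTEGER, id2 INTEGER)")
conn.executemany(
    "INSERT INTO tblA VALUES (?, ?)",
    [(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (2, 1), (2, 2), (2, 3)],
)

print(ids_covering(conn, [3, 4]))
print(ids_covering(conn, [1, 2, 3]))
```

Only the placeholders are interpolated into the SQL text; the values themselves are passed as bound parameters.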