Best way to compare two tables in SQL by matching string?

I have a program whose goal is to take data from an API and capture how that data changes from minute to minute. It involves three tables: Table 1 (for the new data), Table 2 (for the previous minute's data), and a Results table (for the differences).
The sequence of the program is like this:
Update Table 1 -> calculate the differences against Table 2 and insert them into the Results table -> copy Table 1 to Table 2.
Then it repeats! It's simple and it works.
Here is my SQL query:
Insert into Results (symbol, bid, ask, description, Vol_Dif, Price_Dif, Time)
Select symbol, bid, ask, description, Vol_Dif, Price_Dif, '$now' as Time FROM (
    Select t1.symbol, t1.bid, t1.ask, t1.description,
           (t1.volume - t2.volume) AS Vol_Dif,
           (t1.totalPrice - t2.totalPrice) AS Price_Dif
    FROM `Table_1` t1
    Inner Join (
        Select id, volume, totalPrice FROM `Table_2`
    ) t2 ON t2.id = t1.id
) as test;
The tables are identical in structure, obviously. The primary key is the 'id' field that auto-increments. And as you can see, I am comparing both tables on the basis of these 'id' fields being equal.
The PROBLEM is that the API seems to be inconsistent. One API call will have 50,000 entries; the next will have 51,000. And the new entries are not just appended to the end or inserted at the beginning; they are mixed into the middle.
So, comparing on equal IDs means I am comparing entries for DIFFERENT data whenever the API calls return a different number of entries.
The data I am trying to diff from minute to minute is the 'bid', 'ask', 'Vol_Dif', and 'Price_Dif'. There are many rows with the same 'symbol', so I can't match on that. The ONLY other way to match entries between the tables, besides matching IDs, would be matching the "description" fields.
I have tried this. The script is almost the same as above, except the end of the query is
ON t2.description = t1.description
The problem is that matching on the description fields takes 3 minutes for 50,000 entries, whereas matching on IDs takes 1 second.
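I suspect the difference is that 'id' is the primary key and therefore indexed, while 'description' is not, so the join has to scan. If that's the cause, adding an index might help. A minimal sketch, assuming MySQL (suggested by the backticks in my query) and that description is a VARCHAR/TEXT column; the 64-character prefix is an arbitrary choice:
-- assumption: description is VARCHAR/TEXT; 64 is an arbitrary prefix length
CREATE INDEX idx_t1_description ON `Table_1` (description(64));
CREATE INDEX idx_t2_description ON `Table_2` (description(64));
A prefix index like this only helps if the first 64 characters of the descriptions are reasonably distinctive.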
Is there a better, faster way to do what I'm trying to do? Thanks in advance. Any help is appreciated.

Related

How to get the differences between two - kind of - duplicated tables (sql)

Prologue:
I have two tables in two different databases; one is an updated version of the other. For example, imagine that one year ago I duplicated table 1 into the new db (say, table 2), and since then I have worked on table 2, never updating table 1.
I would like to compare the two tables, to get the differences that have accumulated over this period of time (the tables have kept the same structure, so the comparison is meaningful).
My approach was to create a third table, copy both table 1 and table 2 into it, and then count the number of repetitions of every entry.
In my opinion this, together with a new attribute that specifies for every entry the table it came from, would do the job.
Problem:
Copying the two tables into the third table, I get the (obvious) error of having two duplicate key values in a unique or primary key constraint.
How could I bypass the error, or how could I do the same job better? Any idea is appreciated.
Something like this should do what you want if A and B have the same structure; otherwise just select and rename the columns you want to compare. Note that the NOT EXISTS must be correlated to the outer row, e.g. on the key columns:
SELECT *
FROM B
WHERE NOT EXISTS (SELECT * FROM A WHERE A.id = B.id)
If NOT EXISTS doesn't work in your DBMS, you could also use a left outer join, comparing the rows' column values and keeping the rows that find no match:
SELECT A.*
FROM A LEFT OUTER JOIN B
  ON A.col = B.col AND ....
WHERE B.col IS NULL
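For a full two-way diff (rows that were added and rows that disappeared), here is one hedged sketch, assuming both tables share a key column named id, are reachable from one connection (otherwise qualify with database names), and have identical column lists; the 'which' labels are just for readability:
-- rows in B (the updated table) with no counterpart in A
SELECT 'only_in_B' AS which, B.* FROM B
WHERE NOT EXISTS (SELECT 1 FROM A WHERE A.id = B.id)
UNION ALL
-- rows in A that no longer exist in B
SELECT 'only_in_A', A.* FROM A
WHERE NOT EXISTS (SELECT 1 FROM B WHERE B.id = A.id);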

MS Access - trying to find duplicates across 4 tables based on Column1 and Column2

MS Access - trying to find duplicates across 4 tables based on the info in Column1 and Column2. I would also like the resulting query to show me Column3, Column4 and Column5 for easy review. I tried following a YouTube video on a union query and was successful, but that's as far as I can go. I tried to follow along with some of the answers but I can't make it work. Just note that I have zero programming knowledge. Thank you very much in advance!
Column1 = unique reference
Column2 = loss date (DOL)
Duplicates happen when a row has the same unique ref and the same DOL. This can be within one table or across tables, e.g. one entry in Table2019 and another in Table2022, or two entries in Table2019 with four more spread across the other tables.
SELECT [t2019].ID, [t2019].[ClaimNo], [t2019].DOL, [t2019].[Amount], [t2019].[Cause], [t2019].[Ref], [t2019].[Regn], [t2019].Remarks
FROM [t2019]
UNION
SELECT [t2020].ID, [t2020].[ClaimNo], [t2020].DOL, [t2020].[Amount], [t2020].[Cause], [t2020].[Ref], [t2020].[Regn], [t2020].Remarks
FROM [t2020]
UNION
SELECT [t2021].ID, [t2021].[ClaimNo], [t2021].DOL, [t2021].[Amount], [t2021].[Cause], [t2021].[Ref], [t2021].[Regn], [t2021].Remarks
FROM [t2021]
UNION
SELECT [t2022].ID, [t2022].[ClaimNo], [t2022].DOL, [t2022].[Amount], [t2022].[Cause], [t2022].[Ref], [t2022].[Regn], [t2022].Remarks
FROM [t2022];
Access has a wizard to help write the relatively difficult SQL for finding duplicate records. So first gather up all the records that need to be searched for duplicates, then use the wizard.
To gather the records, open the query designer, go to the SQL pane, select Union, and adapt the following SQL:
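Presumably something along the lines of the question's query above, but with ALL added (see the note about ALL below):
SELECT ID, ClaimNo, DOL, Amount, Cause, Ref, Regn, Remarks FROM [t2019]
UNION ALL
SELECT ID, ClaimNo, DOL, Amount, Cause, Ref, Regn, Remarks FROM [t2020]
UNION ALL
SELECT ID, ClaimNo, DOL, Amount, Cause, Ref, Regn, Remarks FROM [t2021]
UNION ALL
SELECT ID, ClaimNo, DOL, Amount, Cause, Ref, Regn, Remarks FROM [t2022];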
Unfortunately, there is no graphical interface to help here.
Get typing, and don't forget that semicolon. UNION is used to combine SELECT statements, so we're combining everything from all the tables. The ALL is important: by itself, UNION skips rows where every column is an exact match to a previous row. We are looking for duplicates, so we add ALL to keep those rows.
When you have all the rows, go to the Query Wizard under the Create tab and run the Find Duplicates wizard.
Here is the resulting SQL for my example data:
SELECT Query1.[ID], Query1.[DOL], Query1.[ClaimNo], Query1.[Amount], Query1.[Cause], Query1.[Ref], Query1.[Regn], Query1.[Remarks]
FROM Query1
WHERE (((Query1.[ID]) In (SELECT [ID] FROM [Query1] As Tmp GROUP BY [ID],[DOL] HAVING Count(*)>1 And [DOL] = [Query1].[DOL])))
ORDER BY Query1.[ID], Query1.[DOL]
Note:
In Access, ID is a primary key and an AutoNumber by default, and it looks suspicious here. If the default settings are intact and you are entering data in Access, then every table starts with ID 1 and you have duplicate IDs in every table. Instead, I would normally combine all these year tables into one table with a year column, which also avoids the union query. I would only use year tables if I had millions of records and couldn't afford the space for a year column.
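For illustration, a year column can also be added on the fly in the union itself; a sketch using the question's table names (repeat the pattern for [t2021] and [t2022]):
SELECT 2019 AS Yr, ID, ClaimNo, DOL, Amount, Cause, Ref, Regn, Remarks FROM [t2019]
UNION ALL
SELECT 2020, ID, ClaimNo, DOL, Amount, Cause, Ref, Regn, Remarks FROM [t2020];
With Yr in place, the pair (Yr, ID) distinguishes rows even when the AutoNumber IDs collide across tables.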

Selecting a large number of rows by index using SQL

I am trying to select a number of rows by the value of a column called ID. I know you can do this pretty easily by:
SELECT col1, col2, col3 FROM mytable WHERE id IN (1,2,3,4,5...)
However, what if there are a few million IDs I want to select, and the IDs don't always follow a pattern (which means I can't use something like BETWEEN x AND y)? Does this SELECT statement still work, or are there better ways of doing it?
The actual application is this: filters are specified by users and compared against some attributes of the records. From those filters, we create a subset of the data that is of interest to a particular user. There are about 30 million records, each with roughly 3,000 attributes (stored across roughly 30 tables, but every table has ID as a primary key), so every time someone queries their desired subset of records, we have to join many tables, apply the filters, and work out what the subset looks like. To avoid joining many tables every time, I thought it might be better to join the tables once, record the IDs of the selected subset, and then, for each new query, just select the relevant columns of the rows matching the filtered IDs.
This depends on the database and the interface you are using. For a few hundred or thousand values, no problem. But your question specifies millions, and that could run into limits on the length of the query -- whether imposed by the database, the tool you are using, or intermediate libraries.
If you have that many ids, I would strongly recommend that you load them into a table in the database, with the id as the primary key. Then use a join or exists to identify the matching rows in your table.
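A minimal sketch of that approach, assuming MySQL-style syntax and the hypothetical names mytable and wanted_ids:
-- one-time: stage the ids (bulk-load in practice rather than literal VALUES)
CREATE TEMPORARY TABLE wanted_ids (id INT PRIMARY KEY);
INSERT INTO wanted_ids (id) VALUES (1), (17), (42);
-- then filter by joining instead of using a huge IN (...) list
SELECT m.col1, m.col2, m.col3
FROM mytable m
JOIN wanted_ids w ON w.id = m.id;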
Often, such a list would be generated in the database anyway. In that case, you can use a subquery or CTE and just include that code in your final query.

How to find rows which differ by a given amount in SQL?

So I have a data table where each row has a timestamp column in Unix time. I need to find all the places where two entries with the same resource_id are x (days, months, years, etc.) apart, so I need a query that will go through, look at the differences between one row and the next, and spit back the ones that differ by more than a specified amount.
Anybody have any ideas on how to do this? Thanks in advance
You may use a cross join to compare every row in the table with every other row for the same resource_id, then compare the timestamps. For example, the following returns pairs of rows that are 2 months apart (FROM_UNIXTIME converts the Unix-time column into a datetime that TIMESTAMPDIFF can work with):
SELECT t.resource_id, t.timestamp, s.timestamp
FROM `table` t CROSS JOIN `table` s
WHERE t.resource_id = s.resource_id
  AND TIMESTAMPDIFF(MONTH, FROM_UNIXTIME(t.timestamp), FROM_UNIXTIME(s.timestamp)) = 2
Note that this could be extremely slow if the table is large. Also, according to the MySQL docs, writing JOIN without specifying a join condition results in a Cartesian product, which is equivalent to a CROSS JOIN.
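Since the question asks about the difference between one row and the next, a window function avoids the pair explosion of the cross join. A sketch assuming MySQL 8.0+ (for LAG) and a hypothetical threshold of more than one day, i.e. 86400 seconds:
SELECT resource_id, prev_ts, ts
FROM (
    SELECT resource_id,
           timestamp AS ts,
           -- previous row's timestamp for the same resource_id
           LAG(timestamp) OVER (PARTITION BY resource_id ORDER BY timestamp) AS prev_ts
    FROM `table`
) d
WHERE ts - prev_ts > 86400;  -- assumed threshold: one day in seconds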

Is there any reason this simple SQL query should be so slow?

This query takes about a minute to give results:
SELECT MAX(d.docket_id), MAX(cus.docket_id) FROM docket d, Cashup_Sessions cus
Yet this one:
SELECT MAX(d.docket_id) FROM docket d UNION SELECT MAX(cus.docket_id) FROM Cashup_Sessions cus
gives its results instantly. I can't see what the first one is doing that would take so much longer - I mean they both simply check the same two lists of numbers for the greatest one and return them. What else could it be doing that I can't see?
I'm using Jet SQL on an MS Access database via Java.
The first one is doing a cross join between the two tables, while the second one is not.
That's all there is to it.
The first one uses a Cartesian product to form its source data, which means that every row from the first table is paired with each row from the second one. After that, it searches that source to find the max values of the columns.
The second doesn't join the tables. It just finds the max from the first table and the max from the second table, and then returns two rows.
The first query makes a cross join between the tables before getting the maximums, which means that each record in one table is joined with every record in the other table.
If you have two tables with 1,000 rows each, you get a result with 1,000,000 rows to go through to find the maximums.
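One detail worth noting: plain UNION removes duplicate rows, so if both tables happened to share the same maximum, the fast query would return one row instead of two. UNION ALL avoids that:
SELECT MAX(d.docket_id) FROM docket d
UNION ALL
SELECT MAX(cus.docket_id) FROM Cashup_Sessions cus;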