Find duplicates in large dataset Excel

I have a recurring task for which I need a better solution.
I pull data from two different databases in two different systems (don't ask why, it's just the way it is). Ideally the two datasets would be the same size. Both have a primary key, let's call it "ID". I want to compare the IDs from table1 and table2 and get the values that appear in only one of them (so I can go on and see why one table has more rows). My dataset gets very large (a bit over 100,000 rows), which makes my VLOOKUP function in Excel extremely slow. Is there a fast way to solve this in Excel? Solutions using a VBA macro, pivot tables, or Excel's built-in SQL would be fine. I am using Excel 2016.
Sample table:
ID_TableA ID_TableB
123456789208435989 123456789208435989
123456789239344137 123456789368934745
123456789368934745 123456789381895013
123456789381895013 123456789447760867
123456789447760867 123456789466692531
123456789466692531 123456789470807304
123456789470807304 123456789504343451
123456789504343451 123456789571573964
123456789563853210 123456789666106771
123456789571573964 123456789683792216
123456789666106771 123456789719645070
123456789683792216 123456789747751420
123456789719645070 123456789770236822
123456789747751420 123456789839975896
123456789770236822 123456789920037815
123456789825288494 123456789930612286
123456789839975896 123456789936072949
123456789920037815 123456789948401617
123456789930612286 123456789982601470
123456789936072949
123456789948401617
123456789982601470
The result from the solution should output:
123456789825288494
123456789563853210
123456789239344137
The IDs in the tables are 18-digit number series in which the first 9 digits never change.
Edit: Both tables could contain unique values. The result should return the values that are unique to either table.

Assuming you have both these columns in separate tables on a single database, then this problem is easy to handle using SQL. Here is one way:
SELECT a.ID_TableA
FROM TableA a
LEFT JOIN TableB b
ON a.ID_TableA = b.ID_TableB
WHERE b.ID_TableB IS NULL
UNION
SELECT b.ID_TableB
FROM TableA a
RIGHT JOIN TableB b
ON a.ID_TableA = b.ID_TableB
WHERE a.ID_TableA IS NULL;
Another way, using EXISTS:
SELECT ID_TableA
FROM TableA a
WHERE NOT EXISTS (SELECT 1 FROM TableB b WHERE a.ID_TableA = b.ID_TableB)
UNION
SELECT ID_TableB
FROM TableB b
WHERE NOT EXISTS (SELECT 1 FROM TableA a WHERE a.ID_TableA = b.ID_TableB);
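The NOT EXISTS variant can be tried out without a database server. Below is a minimal sketch that runs it against an in-memory SQLite database from Python, assuming the IDs are loaded as text so that all 18 digits are preserved. The first lists use a few IDs from the question; the last ID in ids_b is made up so that both halves of the UNION return something.

```python
import sqlite3

# A few IDs from the question's sample, kept as text so all 18 digits survive.
ids_a = ["123456789208435989", "123456789239344137",
         "123456789368934745", "123456789563853210"]
# The last ID here is made up for illustration (it exists only in TableB).
ids_b = ["123456789208435989", "123456789368934745",
         "123456789999999999"]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TableA (ID_TableA TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE TableB (ID_TableB TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO TableA VALUES (?)", [(i,) for i in ids_a])
conn.executemany("INSERT INTO TableB VALUES (?)", [(i,) for i in ids_b])

# IDs in A with no partner in B, plus IDs in B with no partner in A.
query = """
SELECT ID_TableA FROM TableA a
WHERE NOT EXISTS (SELECT 1 FROM TableB b WHERE a.ID_TableA = b.ID_TableB)
UNION
SELECT ID_TableB FROM TableB b
WHERE NOT EXISTS (SELECT 1 FROM TableA a WHERE a.ID_TableA = b.ID_TableB)
ORDER BY 1
"""
unmatched = [row[0] for row in conn.execute(query)]
print(unmatched)
```

The primary keys give SQLite an index on each ID column, which is what keeps the NOT EXISTS lookups cheap on larger tables.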

While I would do this with an Access query, as others suggested, here are my 2 cents on your question.
VLOOKUP is slow and not the right function for this.
COUNTIF is a bit better, but ISNUMBER(MATCH()) seems to be the fastest combination by far.
Have a look at https://stackoverflow.com/a/29983885/78522

You can use Power Query (Get & Transform Data):
let
SourceA = Excel.CurrentWorkbook(){[Name="tblA"]}[Content],
SourceB = Excel.CurrentWorkbook(){[Name="tblB"]}[Content],
UniqueA = Table.Join(SourceA,{"ID_TableA"},SourceB,{"ID_TableB"},JoinKind.LeftAnti),
UniqueB = Table.Join(SourceA,{"ID_TableA"},SourceB,{"ID_TableB"},JoinKind.RightAnti),
OutputList = List.Combine({UniqueA[ID_TableA], UniqueB[ID_TableB]})
in
OutputList
(Edited having seen your requirement to return unique values from EITHER table)
Doing some testing, using some mocked up data in a similar format, this seems pretty fast:
Input from tblA Rows: 250,000
Input from tblB Rows: 250,000
Start: 25/10/2018 14:17:13
End: 25/10/2018 14:17:15
Returned 41,042 unique values in about 2 seconds
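Taken together, the left anti join and the right anti join amount to a symmetric difference of the two ID columns. As a rough sketch of that idea outside Excel (not the Power Query implementation itself), here it is with Python sets on the sample data from the question. The helper name unmatched_ids is made up for illustration, and the IDs are kept as strings.

```python
PREFIX = "123456789"  # the first 9 digits are constant in the question's data

# Last 9 digits of each ID in the question's sample columns.
suffixes_a = ["208435989", "239344137", "368934745", "381895013", "447760867",
              "466692531", "470807304", "504343451", "563853210", "571573964",
              "666106771", "683792216", "719645070", "747751420", "770236822",
              "825288494", "839975896", "920037815", "930612286", "936072949",
              "948401617", "982601470"]
suffixes_b = ["208435989", "368934745", "381895013", "447760867", "466692531",
              "470807304", "504343451", "571573964", "666106771", "683792216",
              "719645070", "747751420", "770236822", "839975896", "920037815",
              "930612286", "936072949", "948401617", "982601470"]


def unmatched_ids(ids_a, ids_b):
    """Return the IDs that occur in exactly one of the two inputs, sorted."""
    return sorted(set(ids_a) ^ set(ids_b))


ids_a = [PREFIX + s for s in suffixes_a]
ids_b = [PREFIX + s for s in suffixes_b]
result = unmatched_ids(ids_a, ids_b)
print(result)
```

Set membership is a hash lookup, so the cost grows roughly linearly with the number of rows instead of requiring a scan of the other column for every ID.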

Related

How to run a sql query multiple times and combine the results into single output?

I have a list of 2500 obj numbers stored in Excel for which I need to run the below SQL:
SELECT
a.objno,
a.table_comment,
b.queue_comment
FROM
aq$_queue_tables a
JOIN
AQ$_QUEUES b ON a.objno = b.table_objno
WHERE
a.objno = 19551;
Is there any way I can write a loop on above SQL with objno feeding from a list or from a different table? I also want to store/produce all the results from each loop run as a single output.
I considered the option to upload the numbers into a new table and add a where condition:
a.objno=(SELECT newtab.objectno FROM newtab);
However, the logic I'll be writing in the query would exclude certain objectno results. For example, if an objectno has a certain queue_comment as of a certain date, I do not want to pull that record. This condition would match some objectno values and not others. With that condition in place, running the query against all the objectno values returns 0 results. I can't share the original logic because it would reveal business rules and violate policy.
So, I need to run the query on each objectno separately and combine the results.
I'm totally new to SQL and got this task assigned. I'm aware of regular loops, such as FOR, in SQL, but I don't think I can apply them in this situation.
Any guidance or reference links to helpful topics is much appreciated as well.
Thanks in advance for the help.

One option is to upload the object numbers from the Excel sheet to a table in the database and run the query as follows, assuming newtab is the table the objectno values are uploaded to.
SELECT
a.objno,
a.table_comment,
b.queue_comment
FROM
aq$_queue_tables a JOIN AQ$_QUEUES b on a.objno = b.table_objno
WHERE
a.objno IN (SELECT newtab.objectno FROM newtab);
I have used a subquery here; joining newtab to the aq$ tables would work as well.
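To illustrate the upload-then-filter idea end to end, here is a small sketch using an in-memory SQLite database from Python. The tables queue_tables and queues are simplified stand-ins for aq$_queue_tables and AQ$_QUEUES, and all rows are made up; only newtab and the column names come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Stand-ins for aq$_queue_tables and AQ$_QUEUES; names and rows are made up.
conn.executescript("""
CREATE TABLE queue_tables (objno INTEGER, table_comment TEXT);
CREATE TABLE queues (table_objno INTEGER, queue_comment TEXT);
CREATE TABLE newtab (objectno INTEGER);
INSERT INTO queue_tables VALUES (19551, 'orders'), (19552, 'billing'), (19553, 'audit');
INSERT INTO queues VALUES (19551, 'q1'), (19552, 'q2'), (19553, 'q3');
""")

# Object numbers as they might be read from the Excel sheet.
object_numbers = [19551, 19553]
conn.executemany("INSERT INTO newtab VALUES (?)", [(n,) for n in object_numbers])

# One query covers every uploaded object number; no loop is needed.
rows = conn.execute("""
SELECT a.objno, a.table_comment, b.queue_comment
FROM queue_tables a
JOIN queues b ON a.objno = b.table_objno
WHERE a.objno IN (SELECT objectno FROM newtab)
ORDER BY a.objno
""").fetchall()
print(rows)
```

The result is already a single combined output, one row per matching object number.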

Having read the comments, I think you need to extend your Excel sheet with 2 additional columns and load it into a new table.
IN can also be used with several columns at once:
SELECT
a.objno,
a.table_comment,
b.queue_comment
FROM
aq$_queue_tables a
JOIN
AQ$_QUEUES b ON a.objno = b.table_objno
WHERE
(a.objno,a.table_comment,b.queue_comment) IN ((19551,'something','something'));
With the new table, this becomes:
WHERE
(a.objno,a.table_comment,b.queue_comment) IN
(select n.objno, n.table_comment, n.queue_comment from new_table n)

How do I join two dataframes, based on conditions, with no common variable?

I am trying to recreate the following SAS code in R
PROC SQL;
create table counts_2018 as
select a.*, b.cell_no
from work.universe201808 a, work.selpar17 b
where a.newregionxx = b.lower_region2
and a.froempment >= b.lower_size
and a.froempment <= b.upper_size
and a.frosic07_2 >= b.lower_class2
and a.frosic07_2 <= b.upper_class2;
QUIT;
What this does, in effect, is assign the cell_no found in selpar17 to the data in universe201808, based on the fulfillment of all the conditions outlined in the code. Data which does not fulfill these conditions, and thus won't have a cell_no assigned to it, is not included in the final table.
The documentation/answers I have found so far all start with a step where the two dataframes are merged by a common variable, then an sqldf select is carried out. I do not have a common column, and thus I cannot merge my dataframes.

Currently, you are running an implicit join between the two tables, which is not advised in SQL. The ANSI SQL-92 standard (a specification more than 25 years old) made the explicit JOIN the standard way of joining relations, so consider revising your SQL query accordingly.
Contrary to your statement, you do in fact have a common column between the tables, as shown in your equality condition: a.newregionxx = b.lower_region2, which can serve as the JOIN condition. You can also use the BETWEEN operator for concision:
new_df <- sqldf('select u.*, s.cell_no
from universe201808 u
inner join selpar17 s
on u.newregionxx = s.lower_region2
where u.froempment between s.lower_size and s.upper_size
and u.frosic07_2 between s.lower_class2 and s.upper_class2')
In fact, you can remove the where altogether and place all in the on clause:
...
on u.newregionxx = s.lower_region2
and u.froempment between s.lower_size and s.upper_size
and u.frosic07_2 between s.lower_class2 and s.upper_class2
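The same join can be checked outside R. Here is a minimal sketch that runs the all-in-ON version against an in-memory SQLite database from Python (sqldf uses SQLite by default, so the SQL should carry over, though that is an assumption about your setup). The unit column and every row below are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Toy versions of the two tables; the 'unit' column and all rows are made up.
conn.executescript("""
CREATE TABLE universe201808 (unit TEXT, newregionxx TEXT,
                             froempment INTEGER, frosic07_2 INTEGER);
CREATE TABLE selpar17 (cell_no INTEGER, lower_region2 TEXT,
                       lower_size INTEGER, upper_size INTEGER,
                       lower_class2 INTEGER, upper_class2 INTEGER);
INSERT INTO universe201808 VALUES
  ('u1', 'AA', 5, 12), ('u2', 'AA', 50, 12), ('u3', 'BB', 5, 12);
INSERT INTO selpar17 VALUES
  (1, 'AA', 0, 9, 10, 19), (2, 'AA', 10, 99, 10, 19);
""")

# Equality on region plus two range conditions, all inside the ON clause.
rows = conn.execute("""
SELECT u.unit, s.cell_no
FROM universe201808 u
INNER JOIN selpar17 s
  ON u.newregionxx = s.lower_region2
 AND u.froempment BETWEEN s.lower_size AND s.upper_size
 AND u.frosic07_2 BETWEEN s.lower_class2 AND s.upper_class2
ORDER BY u.unit
""").fetchall()
print(rows)
```

Row u3 has no matching region, so the inner join drops it, which matches the behavior described for the SAS code.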

select query showing decimal places on some fields but not others

I have two tables, A & B.
Table A has a column called Nominal which is a float.
Table B has a column called Units which is also a float.
I have a simple select query that highlights any differences between Nominals in table A & Units in table B.
select coalesce(A.Id, B.Id) Id, A.Nominal, B.Units, isnull(A.Nominal, 0) - isnull(B.Units, 0) Diff
from tblA A full outer join tblB B
on A.Id = B.Id
where isnull(A.Nominal, 0) - isnull(B.Units, 0) <> 0
This query works. However, this morning I have a slight problem.
The query is showing one line as having a difference:
Id Nominal Units Diff
FJLK 100000 100000 1.4515E-11
So obviously one or both of the figures are not exactly 100,000. However, when I run a select query on each table individually for this Id, both return 100,000, so I can't see which one has decimal places. Why is this? Is this some sort of default display in SQL Server?

You will find the same kind of behavior in Excel.
It's the standard way to represent very small numbers. The number 1.4515E-11 you got is the same as 1.4515 * 10^(-11).
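To show how two floats can display identically and still differ, here is a small Python sketch. It assumes the client rounds floats for display, which I have not verified for SQL Server's tools; the values are constructed for illustration and are not the ones from your tables.

```python
# 2**-36 is the gap between adjacent double-precision floats near 100000,
# so this is the smallest value above 100000.0 that a float can hold.
nominal = 100000.0 + 2**-36
units = 100000.0

shown_nominal = f"{nominal:.6f}"   # what a display rounded to 6 decimals shows
shown_units = f"{units:.6f}"
diff = nominal - units

print(shown_nominal, shown_units)  # both look like 100000.000000
print(nominal == units)            # False: the stored values differ
print(f"{diff:.4E}")               # tiny difference in scientific notation
```

The difference only becomes visible once the two large, nearly equal values are subtracted, which is exactly what the Diff column does.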

SQL - LEFT JOIN and WHERE statement to show just first row

I have read many threads but didn't find the right solution to my problem. It's comparable to this thread.
I have a query that gathers data and writes it to a csv file via a shell script:
SELECT
'"Dose History ID"' = d.dhs_id,
'"TxFieldPoint ID"' = tp.tfp_id,
'"TxFieldPointHistory ID"' = tph.tph_id,
...
FROM txfield t
LEFT JOIN txfielpoint tp ON t.fld_id = tp.fld_id
LEFT JOIN txfieldpoint_hst tph ON fh.fhs_id = tph.fhs_id
...
WHERE d.dhs_id NOT IN ('1000', '10000')
AND ...
ORDER BY d.datetime,...;
This is based on a very big database with lots of tables and machine values. I picked my columns of interest and linked them by their built-in table IDs. Now I have to reduce my result, because I get many rows with the same values where only the IDs change. I just need one (the first) row per "tph.tph_id", with a mechanism like
WHERE "Rownumber" is 1
or something like this. So far I couldn't implement a proper subquery or use the ROW_NUMBER() SQL function. Your help would be much appreciated. The result looks like this and, based on the last ID, I just need one row for each of these numbers (the IDs are not strictly consecutive).
A01";261511;2843119;714255;3634457;
A01";261511;2843113;714256;3634457;
A01";261511;2843113;714257;3634457;
A02";261512;2843120;714258;3634464;
A02";261512;2843114;714259;3634464;
....

I think "GROUP BY" may suit your needs.
You can group rows with the same values for a set of columns into a single row
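As a rough sketch of how GROUP BY could pick one row per group, the example below groups on the last ID, takes the smallest value of the fourth column, and joins back to get the full row. It runs on an in-memory SQLite database from Python. The table name, the column names, and the mapping of the sample output to columns are all assumptions, since the question does not say which output column is which.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Column names are hypothetical; the rows mirror the sample output in the question.
conn.execute("""CREATE TABLE result (label TEXT, c1 INTEGER, c2 INTEGER,
                                     tph_id INTEGER, group_id INTEGER)""")
conn.executemany("INSERT INTO result VALUES (?, ?, ?, ?, ?)", [
    ("A01", 261511, 2843119, 714255, 3634457),
    ("A01", 261511, 2843113, 714256, 3634457),
    ("A01", 261511, 2843113, 714257, 3634457),
    ("A02", 261512, 2843120, 714258, 3634464),
    ("A02", 261512, 2843114, 714259, 3634464),
])

# Find the smallest tph_id per group, then join back to fetch the whole row.
rows = conn.execute("""
SELECT r.*
FROM result r
JOIN (SELECT group_id, MIN(tph_id) AS first_tph_id
      FROM result
      GROUP BY group_id) g
  ON r.group_id = g.group_id AND r.tph_id = g.first_tph_id
ORDER BY r.tph_id
""").fetchall()
print(rows)
```

Joining back to the grouped subquery avoids selecting non-aggregated columns next to GROUP BY, which many databases reject.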

SAS: how to properly use intck() in proc sql

I have the following code in SAS:
proc sql;
create table play2
as select a.anndats,a.amaskcd,count(b.amaskcd) as experience
from test1 as a, test1 as b
where a.amaskcd = b.amaskcd and intck('day', b.anndats, a.anndats)>0
group by a.amaskcd, a.ANNDATS;
quit;
The dataset test1 has 32 distinct obs, while play2 only returns 22 obs. All I want to do is, for each obs, count the number of earlier appearances of the same amaskcd. What is the best way to solve this? Thanks.

The reason this would return 22 observations - which might not actually be 22 distinct from the 32 - is that this is a comma join, which in this case ends up being basically an inner join. For any given row a, if there are no rows b which have an earlier anndats with the same amaskcd, then that a will not be returned.
What you want to do here is a left join, which returns all rows from a once.
create table play2
as select ...
from test1 a
left join test1 b
on a.amaskcd=b.amaskcd
where intck(...)>0
group by ...
;
I would actually write this differently, as I'm not sure the above will do exactly what you want: the where filter on b still discards any row of a that has no matching b.
create table play2
as select a.anndats, a.amaskcd,
(select count(1) from test1 b
where b.amaskcd=a.amaskcd
and b.anndats<a.anndats /* earlier rows only; intck('day') is pointless, dates are stored as integer days */
) as experience
from test1 a
;
If your test1 isn't already grouped by amaskcd and anndats, you may need to rework this some. This kind of subquery is easier to write and more accurately reflects what you're trying to do, I suspect.
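To make the intended result concrete, here is a small Python sketch of what the question asks for: for each row, the number of earlier rows with the same amaskcd. The helper name count_prior and the rows are made up for illustration, and the nested scan is meant to mirror the correlated subquery, not to be efficient.

```python
from datetime import date


def count_prior(rows):
    """For each (amaskcd, anndats) row, count earlier rows with the same amaskcd."""
    return [
        (amaskcd, anndats,
         sum(1 for other_cd, other_dt in rows
             if other_cd == amaskcd and other_dt < anndats))
        for amaskcd, anndats in rows
    ]


# Made-up rows; the amaskcd values and dates are for illustration only.
test1 = [
    (101, date(2018, 1, 5)),
    (101, date(2018, 3, 9)),
    (101, date(2018, 6, 1)),
    (202, date(2018, 2, 2)),
]
play2 = count_prior(test1)
for row in play2:
    print(row)
```

Every input row appears in the output, including first appearances with a count of 0, which is what the inner join in the original query loses.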

If the anndats variables in both datasets are date type (not datetime), then you can simply compare them directly. Date variables in SAS are simply integers where 1 represents one day. You do not need the intck function to get the difference in days; just use subtraction.
The second thing I noticed is that your code looks for > 0 days returned. The intck function can return a negative value if the end date is earlier than the start date.
I am still not sure I understand what you're looking to produce in the query. It joins two datasets using the amaskcd field as the key. You're then filtering based on anndats, only selecting records where b's anndats value is less than a's, i.e. b.anndats < a.anndats.
