How do I check whether two rows in the same table share the same or intersecting data when the table is huge, in SQL Server?

I have a table with the sample data below. I want to compare each record with all other records in the same table and return the ID of every record it collides with. The Name column holds comma-separated data, so if one record has 'A,C' as Name and another has 'A' (see the input below), they collide with each other because 'A' is common to both.
Likewise, one of the records has NULL in Name. A NULL Name should collide with all the other records. Besides Name, I have around 10 columns like this to check.
Input
ID Name
1 A,C
2 B
3 A
4 NULL
OUTPUT
ID ColloidID
1 3
1 4
2 4
3 1
3 4
4 1
4 2
4 3
Problem: I have implemented the solution below, and it works as expected. It is fine when the table holds little data (<100k rows), but it takes far more time and space when dealing with millions of rows (e.g. >20M).
SELECT DISTINCT A.ID,B.ID AS ColloidID
FROM #Temp1 A
CROSS APPLY #Temp1 B
WHERE A.ID<>B.ID
AND master.dbo.fIntersection(COALESCE(A.Name,B.Name,''),COALESCE(B.Name,A.Name,'')) = 1

Ideally you should not store multiple pieces of info in a single column.
Be that as it may, you can use a nested EXISTS with STRING_SPLIT to compare the two columns.
SELECT t1.ID, t2.ID
FROM #Temp1 t1
JOIN #Temp1 t2 ON t2.ID <> t1.ID
    AND (t1.Name IS NULL OR t2.Name IS NULL
        OR EXISTS (SELECT 1
            FROM STRING_SPLIT(t1.Name, ',') s1
            JOIN STRING_SPLIT(t2.Name, ',') s2 ON s2.value = s1.value
        )
    )
ORDER BY
    t1.ID,
    t2.ID;

20M isn't a lot of data, provided a good database design is used, with proper indexes. This is definitely not a good design. It violates the most basic design rule - one value per field. As a result, it's impossible to index Name, forcing 4*10^14 comparisons.
The only way to get acceptable performance is to fix the design. To do that Name has to be split into separate rows. The data needs to be stored in a table whose Name column is covered by an index or primary key:
create table #Id_Names (
ID bigint not null,
Name varchar(30) null,
INDEX IX_Id_Names (Name,ID)
);
GO
INSERT INTO #Id_Names (Id,Name)
select ID,value
from #Temp1 t
-- OUTER APPLY keeps rows whose Name is NULL (STRING_SPLIT returns no rows for NULL input)
OUTER APPLY STRING_SPLIT(Name,',');
After that, the query is simplified to :
SELECT
t1.ID,t2.ID as ColloidID
FROM #Id_Names t1
INNER JOIN #Id_Names t2
ON t1.ID<>t2.ID
AND (t1.Name=t2.Name
OR t1.Name IS NULL
OR t2.Name IS NULL)
This can run a lot faster. The only real problem is the logic of treating NULL as a wildcard: a row with a NULL Name matches every other row in the table. And since the table joins itself, each NULL row adds on the order of 2*(20M-1) extra rows, one set with the NULL row on the left side of the join and one with it on the right. The same relations will also be repeated twice, e.g. (1,4) and (4,1).
If #Temp1 was a proper table, an alternative would be to create an indexed view. Creating an index over a VIEW essentially generates, stores and updates its results automatically.
Another option is to create a Clustered Columnstore index. This provides both compression and acceleration. The data is stored per column in buckets of roughly 1M rows. In each bucket, each column value is only stored once.
create table #Id_Names (
ID bigint not null,
Name varchar(30) null,
INDEX CCI_Id_Names CLUSTERED COLUMNSTORE
);
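The split-then-join approach from this answer can be tried on the question's four sample rows without a SQL Server instance. The following is only a sketch under stated assumptions: it uses Python's built-in sqlite3 module, does the comma split in Python because SQLite has no STRING_SPLIT, and the function name colliding_pairs and the table name id_names are hypothetical. DISTINCT is added so that a pair sharing several names is reported once.

```python
import sqlite3

def colliding_pairs(rows):
    """rows: iterable of (id, name); name is a comma-separated string or None."""
    conn = sqlite3.connect(":memory:")
    # One row per (id, single name), indexed on name as the answer suggests.
    conn.execute("CREATE TABLE id_names (id INTEGER NOT NULL, name TEXT)")
    conn.execute("CREATE INDEX ix_id_names ON id_names (name, id)")
    split = []
    for row_id, name in rows:
        if name is None:
            split.append((row_id, None))  # keep NULL rows so they can act as wildcards
        else:
            split.extend((row_id, part) for part in name.split(","))
    conn.executemany("INSERT INTO id_names VALUES (?, ?)", split)
    pairs = conn.execute(
        """
        SELECT DISTINCT t1.id, t2.id
        FROM id_names t1
        JOIN id_names t2
          ON t1.id <> t2.id
         AND (t1.name = t2.name OR t1.name IS NULL OR t2.name IS NULL)
        ORDER BY t1.id, t2.id
        """
    ).fetchall()
    conn.close()
    return pairs

sample = [(1, "A,C"), (2, "B"), (3, "A"), (4, None)]
print(colliding_pairs(sample))
```

On the sample input this should reproduce the OUTPUT table from the question, including the mirrored pairs such as (1,4) and (4,1).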

Related

Optimized way to check if record is present in table 1. If not then check table 2, else return default value

Asked in an interview:
I have 2 tables, one table has records like ID, Name, address. id(pk) is from 1 to 10000000.
Another table has records from 10000001 to 20000000.
I have to check if a particular ID is present in table 1 or table 2 and return corresponding result.
Because table size is big, have to think an optimized way to do this.
DECLARE @ID BIGINT
SET @ID=10000000
IF EXISTS(SELECT ID FROM TABLE1 WHERE ID=@ID)
SELECT ID,NAME,ADDRESS FROM TABLE1 WHERE ID=@ID
ELSE IF EXISTS(SELECT ID FROM TABLE2 WHERE ID=@ID)
SELECT ID,NAME,ADDRESS FROM TABLE2 WHERE ID=@ID
ELSE
SELECT @ID
A few ideas off the top of my head:
In Hive, you can use a map-side join, which is much faster than a regular join when one table is large and the other is small (here the second table being the ID you are searching for).
You can also optimize the way you store the data: keep it sorted by the id column if such queries are frequent. A columnar format such as ORC keeps track of the range of ids in each file, which makes such queries faster.

SQL Server inconsistent results over 2 columns using = and <>

I am trying to replace a manual process with an SQL-SERVER (2012) based automated one. Prior to doing this, I need to analyse the data in question over time to produce some data quality measures/statistics.
Part of this entails comparing the values in two columns. I need to count where they match and where they do not so I can prove my varied stats tally. This should be simple but seems not to be.
Basically, I have a table containing two columns both of which are defined identically as type INT with null values permitted.
SELECT * FROM TABLE
WHERE COLUMN1 is NULL
returns zero rows
SELECT * FROM TABLE
WHERE COLUMN2 is NULL
also returns zero rows.
SELECT COUNT(*) FROM TABLE
returns 3780
and
SELECT * FROM TABLE
returns 3780 rows.
So I have established that there are 3780 rows in my table and that there are no NULL values in the columns I am interested in.
SELECT * FROM TABLE
WHERE COLUMN1=COLUMN2
returns zero rows as expected.
Conversely therefore in a table of 3780 rows, with no NULL values in the columns being compared, I expect the following SQL
SELECT * FROM TABLE
WHERE COLUMN1<>COLUMN2
or in desperation
SELECT * FROM TABLE
WHERE NOT (COLUMN1=COLUMN2)
to return 3780 rows but it doesn't. It returns 3709!
I have tried SELECT * instead of SELECT COUNT(*) in case NULL values in some other columns were impacting but this made no difference, I still got 3709 rows.
Also, there are some negative values in 73 rows for COLUMN1 - is this what causes the issue (but 73+3709=3782 not 3780 my number of rows)?
What is a better way of proving the values in these numeric columns never match?
Update 09/09/2016: At Lamak's suggestion below I isolated the 71 missing rows and found that in each one, COLUMN1 = NULL and COLUMN2 = -99. So the issue is NULL values, but why doesn't
SELECT * FROM TABLE WHERE COLUMN1 is NULL
pick them up? Here is the information in Information Schema Views and System Views:
ORDINAL_POSITION COLUMN_NAME DATA_TYPE CHARACTER_MAXIMUM_LENGTH IS_NULLABLE
1 ID int NULL NO
.. .. .. .. ..
7 COLUMN1 int NULL YES
8 COLUMN2 int NULL YES
CONSTRAINT_NAME
PK__TABLE___...
name type_desc is_unique is_primary_key
PK__TABLE___... CLUSTERED 1 1
Suspect the CHARACTER_MAXIMUM_LENGTH of NULL must be the issue?
You can find the counts with the queries below, using a left join of the table to itself on ID so that each row is compared only with itself.
--To find COLUMN1=COLUMN2 Count
--------------------------------
SELECT COUNT(T1.ID)
FROM TABLE T1
LEFT JOIN TABLE T2 ON T1.ID=T2.ID AND T1.COLUMN1=T2.COLUMN2
WHERE t2.id is not null
--To find COLUMN1<>COLUMN2 Count (rows with a NULL in either column are counted here too)
--------------------------------
SELECT COUNT(T1.ID)
FROM TABLE T1
LEFT JOIN TABLE T2 ON T1.ID=T2.ID AND T1.COLUMN1=T2.COLUMN2
WHERE t2.id is null
Through the exhaustive comment chain above with all help gratefully received, I suspect this to be a problem with the table creation script data types for the columns in question. I have no explanation from an SQL code point of view, as to why the "is NULL" intermittently picked up NULL values.
I was able to identify the 71 rows that were not being picked up as expected by using an "except".
i.e. I flipped the SQL that was missing 71 rows, namely:
SELECT * FROM TABLE WHERE COLUMN1 <> COLUMN2
through an except:
SELECT * FROM TABLE
EXCEPT
SELECT * FROM TABLE WHERE COLUMN1 <> COLUMN2
Through that I could see that COLUMN1 was always NULL in the missing 71 rows - even though the "is NULL" was not picking them up for me when I ran
SELECT * FROM TABLE WHERE COLUMN1 IS NULL
which returned zero rows.
Regarding the comparison of values stored in the columns, as my data volumes are low (3780 recs), I am just forcing the issue by using ISNULL and setting to 9999 (a numeric value I know my data will never contain) to make it work.
SELECT * FROM TABLE
WHERE ISNULL(COLUMN1, 9999) <> COLUMN2
I then get the 3780 rows as expected. It's not ideal but it'll have to do and is more or less appropriate as there are null values in there so they have to be handled.
Also, using Bertrand's tip above I could view the table creation script and the columns were definitely set up as INT.
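Why the = and <> counts fail to add up to the table size can be reproduced on a tiny table. This is a hedged sketch, not the poster's data: it uses Python's sqlite3 module, which applies the same three-valued logic to NULL comparisons, a hypothetical three-row table t, and COALESCE in place of T-SQL's ISNULL. It does not explain why IS NULL returned zero rows in the original environment.

```python
import sqlite3

def count_rows(where_clause):
    """Count rows of a small demo table matching the given WHERE clause."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE t (id INTEGER PRIMARY KEY, column1 INTEGER, column2 INTEGER)"
    )
    # Two ordinary rows plus one row shaped like the 71 'missing' rows.
    conn.executemany(
        "INSERT INTO t VALUES (?, ?, ?)",
        [(1, 5, 7), (2, -3, 4), (3, None, -99)],
    )
    (n,) = conn.execute(f"SELECT COUNT(*) FROM t WHERE {where_clause}").fetchone()
    conn.close()
    return n

for clause in (
    "column1 = column2",                   # NULL = -99 is UNKNOWN, row filtered out
    "column1 <> column2",                  # NULL <> -99 is also UNKNOWN
    "NOT (column1 = column2)",             # NOT UNKNOWN is still UNKNOWN
    "COALESCE(column1, 9999) <> column2",  # sentinel makes the comparison TRUE
):
    print(clause, "->", count_rows(clause))
```

A row with a NULL in either column satisfies neither = nor <>, so the two counts together fall short of the table size by exactly the number of such rows.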

Inner join on two columns where one column has a single trailing character

Hi I'm new to SQL and I have 2 tables that I am trying to do an inner-join with.
------------------------
First table:
------------------------
ID-Number CustomerName
------------------------
Second table
------------------------
ID-Number CustomerDevice
(ID with a single trailing character)
Questions
What would be the best-performing way to execute the inner join on both tables' ID-Number?
Is there a method to remove the trailing character within the inner-join command?
You don't have much choice. Here is how you can express the logic:
select . . .
from t1 join
t2
on t2.id like t1.id + '_';
Unfortunately, this may not make use of indexes. (Also note that + for string concatenation is SQL Server-specific).
You might be able to rewrite the query as:
on t1.id = left(t2.id, len(t2.id) - 1)
This should be able to use an index on t1(id).
The best approach is to fix the data, so your ids are the same type, same length, and have a properly declared foreign key relationship. Another alternative available in SQL Server is an index on a computed column:
alter table t2 add realId as (left(id, len(id) - 1));
create index idx_t2_realId on t2(realId);
Then write the join logic using realId.
Would this work?
SELECT
t1.[ID-Number],
t1.CustomerName,
t2.CustomerDevice
FROM t1
INNER JOIN t2 on t1.[ID-Number]=LEFT(t2.[ID-Number],LEN(t2.[ID-Number])-1)
EDIT: Forgot the 1
Given that the table Customer has this column
ID_number int not null;
And the table Device has this column
ID_number varchar(15);
And we know that Device.ID_number, if it is not NULL, is always equal to some Customer.ID_number with a letter appended, then (SQL Server):
SELECT *
FROM Customer c
JOIN Device d
ON c.ID_number = CAST(SUBSTRING(d.ID_number, 1, LEN(d.ID_number) - 1) AS int)
More robust solutions that allow for more possibilities in the data require more defensive coding. You may want to define a scalar function to process Device.ID_number.
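The strip-the-last-character join can be checked on a few rows before running it against real tables. The sketch below is an illustration only: it uses Python's sqlite3 module, SQLite's substr and length in place of SUBSTRING and LEN, and hypothetical table, column, and function names (customer, device, join_on_trimmed_id).

```python
import sqlite3

def join_on_trimmed_id(customers, devices):
    """customers: (id_number, name) rows; devices: (id_number_with_suffix, device) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (id_number INTEGER NOT NULL, customer_name TEXT)")
    conn.execute("CREATE TABLE device (id_number TEXT, customer_device TEXT)")
    conn.executemany("INSERT INTO customer VALUES (?, ?)", customers)
    conn.executemany("INSERT INTO device VALUES (?, ?)", devices)
    rows = conn.execute(
        """
        SELECT c.id_number, c.customer_name, d.customer_device
        FROM customer c
        JOIN device d
          -- drop the single trailing character, then compare as integers
          ON c.id_number = CAST(substr(d.id_number, 1, length(d.id_number) - 1) AS INTEGER)
        ORDER BY c.id_number
        """
    ).fetchall()
    conn.close()
    return rows

print(join_on_trimmed_id(
    [(101, "Ann"), (102, "Bob")],
    [("101A", "phone"), ("102B", "tablet"), ("999Z", "orphan")],
))
```

Device rows whose trimmed ID has no matching customer, such as 999Z here, simply drop out of the inner join.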

Oracle Compare data between two different table

I have two tables: in one, every field is VARCHAR2, while the other uses a different type for each kind of data.
For Example :
Table One
==========================
Col 1 VARCHAR2 UNIQUE KEY
Col 2 VARCHAR2
Col 3 VARCHAR2
===========================
Table Two
==========================
Col One VARCHAR2 UNIQUE KEY
Col Two TIMESTAMP
Col Three NUMBER
==========================
We have a mapping table that denotes which column of Table One has to be compared with which column of Table Two.
For Example
Mapping Table
==============================
Table One Table Two
==============================
Col 1 Col One
Col 2 Col Three
Col 3 Col Two
==============================
Now, using the UNIQUE KEY of Table One, we have to find the same row in Table Two, compare the rows column by column, and get the changes in the data.
Currently we use a Java program that compares the data row by row and column by column and reports the changes between rows with the same UNIQUE KEY. It works fine but takes too much time, as we have 100000 records in the DB.
Now my question is: is there any way I can compare the data at the SQL level and get the changes?
You can do it 'manually' with a query like the one below. It's a lot of work, but there are only three different types of checks you need to do, so it's not very complex:
select
*
from
Table1 t1
full outer join Table2 t2 on t2.ID = t1.ID
where
-- Check ID: the record is missing from one of the tables.
t1.ID is null or
t2.ID is null or
-- Not nullable field can be easily compared.
t1.NotNullableField1 <> t2.NotNullableField1 or
-- Nullable field is slightly more work.
t1.NullableField1 <> t2.NullableField1 or
(t1.NullableField1 is null and t2.NullableField1 is not null) or
(t1.NullableField1 is not null and t2.NullableField1 is null)
Another solution is to use MINUS, which is a bit like UNION, only it returns a dataset minus the records in a second dataset:
select * from Table1 t1
MINUS
select * from Table2 t2
This works only one way (which might be fine for your purpose), but you can also combine it with UNION to make it bidirectional.
select
*
from
( select * from Table1
MINUS
select * from Table2)
UNION ALL
( select * from Table2
MINUS
select * from Table1)
The output of both solutions is a bit different.
In the FULL OUTER JOIN query, the IDs will be joined and the values of the matching rows will be displayed next to each other as a single row.
In the MINUS query, the result will be presented as a single dataset. If a record exists in only one of the tables, it will be displayed once. If a record (ID) exists in both tables but other fields differ, you will get both rows, so it's a bit harder to compare them.
See: http://www.techonthenet.com/oracle/minus.php
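The bidirectional MINUS pattern can be tried on small data as follows. This is a sketch under assumptions: it uses Python's sqlite3 module, where the operator is spelled EXCEPT rather than Oracle's MINUS, two hypothetical two-column tables, and an extra source column (not part of the answer) that tags which table each differing row came from.

```python
import sqlite3

def symmetric_difference(rows1, rows2):
    """Return rows that appear in only one of the two tables, tagged by source."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE table1 (id INTEGER, val TEXT)")
    conn.execute("CREATE TABLE table2 (id INTEGER, val TEXT)")
    conn.executemany("INSERT INTO table1 VALUES (?, ?)", rows1)
    conn.executemany("INSERT INTO table2 VALUES (?, ?)", rows2)
    diff = conn.execute(
        """
        SELECT 'table1' AS source, d.*
        FROM (SELECT * FROM table1 EXCEPT SELECT * FROM table2) AS d
        UNION ALL
        SELECT 'table2' AS source, d.*
        FROM (SELECT * FROM table2 EXCEPT SELECT * FROM table1) AS d
        ORDER BY 2, 1  -- by id, then by source table
        """
    ).fetchall()
    conn.close()
    return diff

print(symmetric_difference(
    [(1, "a"), (2, "b"), (3, "c")],
    [(1, "a"), (2, "B"), (4, "d")],
))
```

A changed row (id 2 here) shows up twice, once per table, while rows present on only one side show up once, which matches the description of the MINUS output above.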

MS SQL - Joining on two tables with a substringed key in one column

I have 2 tables I need to join; however, on one of the tables I need to extract a key from a varchar field in each row.
Table 1 Description (numeric 18,varchar 4000)
descriptionid description
1 Blah Blah: Queue 1Blah Blah
2 foobar:Queue 2
3 rem:Queue 2 -This is a note
4 Anotherrow: Queue 3
5 Something else
Table 2 Queue - (numeric 18, varchar 100)
queueid queue
123 Queue 1
124 Queue 2
127 Queue 3
129 Queue 4
So I need to produce the output like so
View 3 Queue-Description (numeric 18, numeric 18)
descriptionid queueid
1 123
2 124
3 124
4 127
5 null
So in table 1 row 1, I need to strip out the value Queue 1 from the description, verify it is in the queue table, and look up the queueid.
I am unable to change the structure of tables 1 and 2.
What ways can this be achieved in MSSQL?
What is the most efficient way to do this in SQL - using MSSQL 2005 here.
most efficient way
Well... don't know about that but it is a way.
select T1.descriptionid,
T2.queueid
from Table1 as T1
left outer join Table2 as T2
on T1.description like '%'+T2.queue+'%'
Another way
select T1.descriptionid,
T2.queueid
from Table1 as T1
left outer join Table2 as T2
on charindex(T2.queue, T1.description, 1) > 0
If there are more than one match (see comment by Ed Harper) you can use this to pick the one with the longest match.
select T1.descriptionid,
T2.queueid
from Table1 as T1
outer apply (
select top 1 T3.queueid
from Table2 as T3
where charindex(T3.queue, T1.description, 1) > 0
order by len(T3.queue) desc
) as T2(queueid)
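The longest-match variant can be checked against the sample rows from the question. The following is a sketch, not the answer's exact query: it uses Python's sqlite3 module, with instr standing in for charindex and a correlated subquery with LIMIT 1 standing in for OUTER APPLY with TOP 1. The function name map_descriptions_to_queues is hypothetical.

```python
import sqlite3

def map_descriptions_to_queues(descriptions, queues):
    """Pair each description with the queue whose name is the longest substring match."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE table1 (descriptionid INTEGER, description TEXT)")
    conn.execute("CREATE TABLE table2 (queueid INTEGER, queue TEXT)")
    conn.executemany("INSERT INTO table1 VALUES (?, ?)", descriptions)
    conn.executemany("INSERT INTO table2 VALUES (?, ?)", queues)
    rows = conn.execute(
        """
        SELECT t1.descriptionid,
               (SELECT t2.queueid
                FROM table2 AS t2
                WHERE instr(t1.description, t2.queue) > 0
                ORDER BY length(t2.queue) DESC  -- prefer the longest queue name
                LIMIT 1) AS queueid
        FROM table1 AS t1
        ORDER BY t1.descriptionid
        """
    ).fetchall()
    conn.close()
    return rows

descriptions = [
    (1, "Blah Blah: Queue 1Blah Blah"),
    (2, "foobar:Queue 2"),
    (3, "rem:Queue 2 -This is a note"),
    (4, "Anotherrow: Queue 3"),
    (5, "Something else"),
]
queues = [(123, "Queue 1"), (124, "Queue 2"), (127, "Queue 3"), (129, "Queue 4")]
print(map_descriptions_to_queues(descriptions, queues))
```

Descriptions with no matching queue come back with a NULL queueid (None in Python), as in the requested output for descriptionid 5.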
The most efficient way to do this is to add an extra column to your table and insert the extracted ID from the string. You can do this when rows are added, and you can process the existing ones fairly easily. But trying to left join like this will be very slow.
In Sql Server 2005 you can extract your queue string using regex. The Data Extraction section on this page contains an example.
In a stored procedure you can then build an indexed temp table that contains a new column (this allows you to do this without changing the table metadata).
If you can change the table metadata you can:
Trigger the content into another column (on insert).
Or if the information is not needed immediately a daily sql job could extract the information.