Need solution to avoid repeated scanning of a huge table - SQL

I have an event table with 40 columns that holds up to 2 billion records. In that event table I would like to query for a combination of events, i.e. Event A together with Event B. Sometimes I may want to find a larger combination, such as Event A with B and C; it may go up to 5 or 6 events.
I don't want to scan the table once per event in the combination, i.e. one scan for event A and another scan for event B, and I need a generic approach that works for larger combinations as well.
Note: the 2 billion records are partitioned by event date, and the data is split evenly across partitions.
E.g.:
I need to find ids which have events A, B and C, and ids which have only A and B.
The number of events in the combination is dynamic. I don't want to scan the table for each event and then intersect the results.

There may be some mileage in using a SQL Server equivalent of the MySQL GROUP_CONCAT function.
For example
drop table t
create table t (id int, dt date, event varchar(1))
insert into t values
(1,'2017-01-01','a'),(1,'2017-01-01','b'),(1,'2017-01-01','c'),(1,'2017-01-02','c'),(1,'2017-01-03','d'),
(2,'2017-02-01','a'),(2,'2017-02-01','b')
select id,
stuff(
(
select cast(',' as varchar(max)) + t1.event
from t as t1
WHERE t1.id = t.id
order by t1.event -- sort by the event value so the concatenated order is deterministic
for xml path('')
), 1, 1, '') AS groupconcat
from t
group by t.id
Results in
id groupconcat
----------- -----------
1 a,b,c,c,d
2 a,b
If you then add a patindex
select * from
(
select id,
stuff(
(
select cast(',' as varchar(max)) + t1.event
from t as t1
WHERE t1.id = t.id
order by t1.event
for xml path('')
), 1, 1, '') AS groupconcat
from t
group by t.id
) s
where patindex('a,b,c%',groupconcat) > 0
you get this
id groupconcat
----------- ------------
1 a,b,c,c,d
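On SQL Server 2017 and later the same grouped string can be built with STRING_AGG, which is easier to read than the FOR XML PATH trick. A minimal sketch against the same sample table t (assumes a 2017+ instance; the result matches the groupconcat column above):
select id,
string_agg(event, ',') within group (order by event) as groupconcat
from t
group by id
The patindex filter can then be applied to this output in exactly the same way.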

SELECT * from table as A
JOIN table AS B
ON A.Id = B.Id AND A.Date = B.Date
WHERE A.Date = '1-Jan'
AND A.Event = 'A'
AND B.Event = 'B'
This will give you rows where Date is '1-Jan' and the Id is the same for both events.
You can join the table again and again if you want to filter by more events.
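For example, a three-event version of that pattern might look like the sketch below (EventTable, Id, Date and Event are placeholder names standing in for the real table used above):
SELECT A.Id
FROM EventTable AS A
JOIN EventTable AS B ON A.Id = B.Id AND A.Date = B.Date
JOIN EventTable AS C ON A.Id = C.Id AND A.Date = C.Date
WHERE A.Date = '1-Jan'
AND A.Event = 'A'
AND B.Event = 'B'
AND C.Event = 'C'
Each additional event adds another self-join, so this stays readable for two or three events but becomes unwieldy as the combinations grow.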

The HAVING clause lets you filter on the result of an aggregate function. I've used a plain COUNT here, but you may need a DISTINCT count, depending on your table design.
Example:
-- Returns ids with 3 or more events.
SELECT
x.Id,
COUNT(*) AS EventCount
FROM
(
VALUES
(1, '2017-01-01', 'A'),
(1, '2017-01-01', 'B'),
(1, '2017-01-03', 'C'),
(1, '2017-01-04', 'C'),
(1, '2017-01-05', 'E'),
(2, '2017-01-01', 'A'),
(2, '2017-01-01', 'B'),
(3, '2017-01-01', 'A')
) AS x(Id, [Date], [Event])
GROUP BY
x.Id
HAVING
COUNT(*) > 2
;
Returns
Id EventCount
1 5
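To test for a specific combination of events rather than a raw count, conditional aggregation in the same GROUP BY/HAVING shape does the job in a single pass over the table. A sketch (EventTable, Id and Event are placeholder names for the real 2-billion-row table):
-- ids that have all three of the events A, B and C
SELECT Id
FROM EventTable
WHERE [Event] IN ('A', 'B', 'C')
GROUP BY Id
HAVING COUNT(DISTINCT [Event]) = 3
Because the filter is one WHERE plus one GROUP BY, the table is scanned once no matter how many events are in the combination; only the IN list and the target count change.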

Select UNIQUE, NOT DISTINCT values

I am trying to select values from a table that are not duplicates - for example, with the following input set, I would like to select only the values in Column 1 that don't have a duplicated value in Column 2
Column 1 Column 2
A X
B X
C Y
D Y
E Z
Resulting in
Column 1 Column 2
E Z
This is made harder by a character limit on my SQL statement and by having to join a couple of tables in the same query.
My existing statement is below, and this is where I am stuck.
SELECT d.o_docguid, d.o_itemdesc
FROM dms_doc d
INNER JOIN
(SELECT s.o_itemno as si, s.o_projectno as sp, t.o_itemno as ti, t.o_projectno as tp
FROM env_bs1192_1 s, env_bs1192_2 t
WHERE s.TB_FILE_ID = t.TB_FILE_ID) as r
ON (si = d.o_itemno AND sp = d.o_projectno)
OR (ti = d.o_itemno AND tp = d.o_projectno)
Results look like
o_docguid o_itemdesc
aguid adescription
bguid adescription
cguid bdescription
I want to filter this list so that only the unique descriptions and their associated guids remain, i.e. only the rows whose description appears exactly once; put another way, if there is a duplicate, throw both rows away. In this instance, cguid and bdescription should be the only result.
The last challenge, which I still haven't solved, is that this SQL statement needs to fit into a character limit of 242 characters.
Taking the first part as a question, the answer might be:
create table #Table (Column1 char(1), Column2 char(1));
insert into #Table values
('A', 'X'),
('B', 'X'),
('C', 'Y'),
('D', 'Y'),
('E', 'Z');
select
Column1 = max(Column1),
Column2
from
#Table
group by
Column2
having
count(*) = 1;
Here is how to do it with generic data:
DROP TABLE IF EXISTS #MyTable
CREATE TABLE #MyTable(Column1 VARCHAR(50),Column2 VARCHAR(50))
INSERT INTO #MyTable(Column1,Column2)
VALUES
('A','X'),
('B','X'),
('C','Y'),
('D','Y'),
('E','Z')
;WITH UniqueCol2 AS
(
SELECT Column2
FROM #MyTable
GROUP BY Column2
HAVING COUNT(*) = 1
)
SELECT
mt.*
FROM UniqueCol2
JOIN #MyTable mt ON mt.Column2 = UniqueCol2.Column2
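A windowed count expresses the same filter without the extra CTE, which may help when you are fighting a character limit. A sketch against the same #MyTable (assumes SQL Server 2005 or later):
SELECT Column1, Column2
FROM (
SELECT Column1, Column2,
COUNT(*) OVER (PARTITION BY Column2) AS cnt
FROM #MyTable
) x
WHERE cnt = 1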

SQL group three columns into one

I have a table with three columns:
[ID] [name] [link]
1 sample_name_1 sample_link_1
2 sample_name_2 sample_link_2
3 sample_name_3 sample_link_3
I need to somehow group them into one column, so the ideal result is this:
[one_column]
1
sample_name_1
sample_link_1
2
sample_name_2
sample_link_2
3
sample_name_3
sample_link_3
Does anyone have any suggestions on where to look and how to get it done in SQL Server?
You may try to use the VALUES table value constructor with CROSS APPLY:
Table:
CREATE TABLE MyTable (
ID int,
name varchar(50),
link varchar(50)
)
INSERT INTO MyTable (ID, name, link)
VALUES
(1, 'sample_name_1', 'sample_link_1'),
(2, 'sample_name_2', 'sample_link_2'),
(3, 'sample_name_3', 'sample_link_3')
Statement:
SELECT v.one_column
FROM MyTable t
CROSS APPLY (VALUES
(1, CONVERT(varchar(50), ID)),
(2, CONVERT(varchar(50), name)),
(3, CONVERT(varchar(50), link))
) v (rn, one_column)
ORDER BY t.ID, v.rn
Result:
one_column
1
sample_name_1
sample_link_1
2
sample_name_2
sample_link_2
3
sample_name_3
sample_link_3
While this is something you should do in your presentation layer (i.e. your app or website), you can also do it in SQL:
select one_column
from
(
select cast(id as varchar(10)) as one_column, id as sortkey1, 1 as sortkey2 from mytable
union all
select name as one_column, id as sortkey1, 2 as sortkey2 from mytable
union all
select link as one_column, id as sortkey1, 3 as sortkey2 from mytable
) unioned
order by sortkey1, sortkey2;

Alternative to NOT IN in SSMS

I have my table in this structure. I am trying to find all the unique IDs whose words do not appear in the list. How can I achieve this in MS SQL Server?
id word
1 hello
2 friends
2 world
3 cat
3 dog
2 country
1 phone
4 eyes
I have a list of words
List
phone
eyes
hair
body
Expected Output
I need all the unique IDs except those whose words appear in the list. In this case the result is:
2
3
1 and 4 are not in the output because their words appear in the list.
I tried the code below:
Select count(distinct ID)
from Table1
where word not in ('phone','eyes','hair','body')
I also tried NOT EXISTS, which did not work.
You can also use GROUP BY
SELECT id
FROM Table1
GROUP BY id
HAVING MAX(CASE WHEN word IN('phone', 'eyes', 'hair', 'body') THEN 1 ELSE 0 END) = 0
One way to do it is to use NOT EXISTS, where the inner query is linked to the outer query by id and filtered by the search words.
First, create and populate a sample table (please save us this step in your future questions):
CREATE TABLE #T (
id int,
word varchar(20)
)
INSERT INTO #T VALUES
(1, 'hello'),
(2, 'friends'),
(2, 'world'),
(3, 'cat'),
(3, 'dog'),
(2, 'country'),
(1, 'phone'),
(4, 'eyes')
The query:
SELECT DISTINCT id
FROM #T t0
WHERE NOT EXISTS
(
SELECT 1
FROM #T t1
WHERE word IN('phone', 'eyes', 'hair', 'body')
AND t0.Id = t1.Id
)
Result:
id
2
3
SELECT t.id FROM dbo.table AS t
WHERE NOT EXISTS (SELECT 1 FROM dbo.table AS t2
INNER JOIN
(VALUES('phone'),('eyes'),('hair'),('body')) AS lw(word)
ON t2.word = lw.word
AND t2.id = t.id)
GROUP BY t.id;
You can try this as well; it stages the excluded words and the matching ids in intermediate temp tables:
CREATE TABLE #T (id int, word varchar(20))
INSERT INTO #T VALUES
(1, 'hello'),
(2, 'friends'),
(2, 'world'),
(3, 'cat'),
(3, 'dog'),
(2, 'country'),
(1, 'phone'),
(4, 'eyes')
CREATE TABLE #tblNotUsed (id int, word varchar(20))
CREATE TABLE #tblNotUsedIds (id int)
INSERT INTO #tblNotUsed VALUES
(1, 'phone'),
(2, 'eyes'),
(3, 'hair'),
(4, 'body')
INSERT INTO #tblNotUsedIds (id)
SELECT [#T].id FROM #T INNER JOIN #tblNotUsed ON [#tblNotUsed].word = [#T].word
SELECT DISTINCT id FROM #T
WHERE id NOT IN (SELECT id FROM #tblNotUsedIds)
The nice thing about SQL is that there are often many ways to do things. One way is to place your list of known values into a #temp table and then run something like this.
Select * from dbo.maintable
EXCEPT
Select * from #tempExcludeValues
The results will give you all records that aren't in your predefined list (note that EXCEPT compares entire rows, so both selects need matching column lists). A second way is to do the join, as Larnu mentioned in the comment above. NOT IN is typically not the fastest approach on larger datasets; joins are generally far more efficient for filtering, often many times faster than an IN or NOT IN clause.
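For this particular question, the join-based exclusion might look something like the sketch below (#ExcludeWords is a placeholder name for the temp table holding the word list):
SELECT t.id
FROM Table1 t
LEFT JOIN #ExcludeWords x ON x.word = t.word
GROUP BY t.id
HAVING COUNT(x.word) = 0
HAVING COUNT(x.word) = 0 keeps only the ids that never matched an excluded word, which is the same result the NOT EXISTS versions above produce.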

SQL join using recursive CTE

Edit: Added another case scenario in the notes and updated the sample attachment.
I am trying to write a SQL query that produces the output attached to this question along with the sample data.
There are two tables: one holds distinct IDs (pk) with their current flag,
the other holds an Active ID and an Inactive ID (both fk to the pk of the first table).
The final output should return two columns: the first contains all distinct IDs from the first table, and the second contains the matching Active ID from the second table.
Below is the SQL:
IF OBJECT_ID('tempdb..#main') IS NOT NULL DROP TABLE #main;
IF OBJECT_ID('tempdb..#merges') IS NOT NULL DROP TABLE #merges
IF OBJECT_ID('tempdb..#final') IS NOT NULL DROP TABLE #final
SELECT DISTINCT id,
current
INTO #main
FROM tb_ID t1
--get list of all active_id and inactive_id
SELECT DISTINCT active_id,
inactive_id,
Update_dt
INTO #merges
FROM tb_merges
-- Combine where the id from the main table matched to the inactive_id (should return all the rows from #main)
SELECT id,
active_id AS merged_to_id
INTO #final
FROM (SELECT t1.*,
t2.active_id,
Update_dt ,
Row_number()
OVER (
partition BY id, active_id
ORDER BY Update_dt DESC) AS rn
FROM #main t1
LEFT JOIN #merges t2
ON t1.id = t2.inactive_id) t3
WHERE rn = 1
SELECT *
FROM #final
This SQL partially works. It doesn't work where an id was once active and then became inactive.
Please note:
the active ID column should contain the most recent active ID
an ID which doesn't have any active ID should return either null or the ID itself
for IDs where current = 0, the active ID should be the ID that is current in tb_ID
IDs may get interchanged. For example, with two IDs 6 and 7, when 6 is active 7 is inactive and vice versa; the only way to know the most current active state is by the update date
The attached sample might make this easier to understand.
It looks like I might have to use a recursive CTE to achieve the results. Can someone please help?
Thank you for your time!
I think you're correct that a recursive CTE looks like a good solution for this. I'm not entirely certain that I've understood exactly what you're asking for, particularly with regard to the update_dt column, just because the data is a little abstract as-is, but I've taken a stab at it, and it does seem to work with your sample data. The comments explain what's going on.
create table #tb_id (id bigint, [current] bit);
create table #tb_merges (active_id bigint, inactive_id bigint, update_dt datetime2);
insert #tb_id values
-- Sample data from the question.
(1, 1),
(2, 1),
(3, 1),
(4, 1),
(5, 0),
-- A few additional data to illustrate a deeper search.
(6, 1),
(7, 1),
(8, 1),
(9, 1),
(10, 1);
insert #tb_merges values
-- Sample data from the question.
(3, 1, '2017-01-11T13:09:00'),
(1, 2, '2017-01-11T13:07:00'),
(5, 4, '2013-12-31T14:37:00'),
(4, 5, '2013-01-18T15:43:00'),
-- A few additional data to illustrate a deeper search.
(6, 7, getdate()),
(7, 8, getdate()),
(8, 9, getdate()),
(9, 10, getdate());
if object_id('tempdb..#ValidMerge') is not null
drop table #ValidMerge;
-- Get the subset of merge records whose active_id identifies a "current" id and
-- rank by date so we can consider only the latest merge record for each active_id.
with ValidMergeCTE as
(
select
M.active_id,
M.inactive_id,
[Priority] = row_number() over (partition by M.active_id order by M.update_dt desc)
from
#tb_merges M
inner join #tb_id I on M.active_id = I.id
where
I.[current] = 1
)
select
active_id,
inactive_id
into
#ValidMerge
from
ValidMergeCTE
where
[Priority] = 1;
-- Here's the recursive CTE, which draws on the subset of merges identified above.
with SearchCTE as
(
-- Base case: any record whose active_id is not used as an inactive_id is an endpoint.
select
M.active_id,
M.inactive_id,
Depth = 0
from
#ValidMerge M
where
not exists (select 1 from #ValidMerge M2 where M.active_id = M2.inactive_id)
-- Recursive case: look for records whose active_id matches the inactive_id of a previously
-- identified record.
union all
select
S.active_id,
M.inactive_id,
Depth = S.Depth + 1
from
#ValidMerge M
inner join SearchCTE S on M.active_id = S.inactive_id
)
select
I.id,
S.active_id
from
#tb_id I
left join SearchCTE S on I.id = S.inactive_id;
Results:
id active_id
------------------
1 3
2 3
3 NULL
4 NULL
5 4
6 NULL
7 6
8 6
9 6
10 6
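One practical note: SQL Server stops a recursive CTE after 100 recursion levels by default, so if the merge chains can be longer than that, the final query may need the limit raised. This is just the last SELECT from above with the hint appended:
select
I.id,
S.active_id
from
#tb_id I
left join SearchCTE S on I.id = S.inactive_id
option (maxrecursion 0); -- 0 removes the default 100-level cap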

MSSQL ORDER BY Passed List

I am using Lucene to perform queries on a subset of SQL data, which returns a scored list of RecordIDs, e.g. 11,4,5,25,30.
I want to use this list to retrieve a set of results from the full SQL Table by RecordIDs.
So SELECT * FROM MyFullRecord
where RecordID in (11,5,3,25,30)
I would like the retrieved list to maintain the scored order.
I can do it by using an ORDER BY like so:
ORDER BY (CASE WHEN RecordID = 11 THEN 0
WHEN RecordID = 5 THEN 1
WHEN RecordID = 3 THEN 2
WHEN RecordID = 25 THEN 3
WHEN RecordID = 30 THEN 4
END)
I am concerned about the load on the server, especially if I am passing long lists of RecordIDs. Does anyone have experience with this, or know how I can determine an optimum list length?
Are there any other ways to achieve this functionality in MSSQL?
Roger
You can load your list into a temp table or table variable together with sorting priorities,
and then join your table to this sorting table.
CREATE TABLE #tSortOrder (RecordID INT, SortOrder INT)
INSERT INTO #tSortOrder (RecordID, SortOrder)
SELECT 11, 1 UNION ALL
SELECT 5, 2 UNION ALL
SELECT 3, 3 UNION ALL
SELECT 25, 4 UNION ALL
SELECT 30, 5
SELECT *
FROM yourTable T
LEFT JOIN #tSortOrder S ON T.RecordID = S.RecordID
ORDER BY S.SortOrder
Instead of writing a searched CASE expression in the ORDER BY, you could create an in-memory table to join. It's easier on the eyes and definitely scales better.
SQL Statement
SELECT mfr.*
FROM MyFullRecord mfr
INNER JOIN (
SELECT *
FROM (VALUES (1, 11),
(2, 5),
(3, 3),
(4, 25),
(5, 30)
) q(ID, RecordID)
) q ON q.RecordID = mfr.RecordID
ORDER BY
q.ID
Look here for a fiddle
Something like:
SELECT * FROM MyFullRecord where RecordID in (11,5,3,25,30)
ORDER BY
CHARINDEX(','+CAST(RecordID AS varchar)+',',
','+'11,5,3,25,30'+',')
SQLFiddle demo
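If you are on SQL Server 2022 or Azure SQL Database, STRING_SPLIT accepts an optional third argument that returns each element's ordinal position, so the passed list can drive the sort directly. A sketch (older versions would need a splitter function that preserves order):
SELECT r.*
FROM STRING_SPLIT('11,5,3,25,30', ',', 1) AS s
JOIN MyFullRecord AS r
ON r.RecordID = CAST(s.value AS int)
ORDER BY s.ordinal;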