SQL Combination Match Count - sql

My aim is to calculate all possible combinations of the unique entries in column 'role' and then count, for each combination, how many refids contain it. I've built something that works, but it's really slow, so I'm wondering if anyone has a better solution?
Data Model
refid  role
-----  ----
1000   xxx
1000   yyy
1001   xxx
1001   yyy
Output Table
a_role  b_role  match_count
------  ------  -----------
xxx     yyy     2
Existing Solution
I've written a stored procedure that performs the following steps:
1. Generate a list of all combinations of the unique roles in column role (97,032 combinations)
2. WHILE-loop through every entry from step 1 and update it with the calculated count
Appreciate the help.

You could use a self join and aggregation, as follows (the T.role < D.role condition keeps each unordered pair exactly once and excludes self-pairs):
SELECT T.role AS a_role, D.role AS b_role,
       COUNT(*) AS match_count
FROM table_name T
JOIN table_name D ON T.refid = D.refid AND T.role < D.role
GROUP BY T.role, D.role
ORDER BY T.role, D.role

I set up a little test script; try this out with your data and see if it's any faster. Note this assumes you only have two roles per refid; it'll need a rework if you have more (see the sketch after the script).
declare @input table (
    refid int,
    descriptor nvarchar(1)
)

insert into @input (refid, descriptor) values
(1000, 'x'),
(1000, 'y'),
(1001, 'x'),
(1001, 'y'),
(1002, 'a'),
(1002, 'b')

select role1, role2, count(*) as match_count from (
    select min(descriptor) as role1, max(descriptor) as role2
    from @input
    group by refid
) a
group by role1, role2
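If a refid can carry more than two roles, one option is to pair the roles with a self join instead of min/max. A minimal sketch against the same @input test data (column names are taken from the script above; this rework is an assumption, not the author's code):

-- Sketch only: pairs every distinct role combination per refid,
-- so it also works when a refid has three or more roles.
select a.descriptor as role1, b.descriptor as role2, count(*) as match_count
from @input a
join @input b
    on a.refid = b.refid
   and a.descriptor < b.descriptor  -- keep each unordered pair once
group by a.descriptor, b.descriptor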

Related

Query database for distinct values and aggregate data based on condition

I am trying to extract distinct items from a Postgres database, pairing a column from one table with a column from another table based on a condition. A simplified version looks like this:
CREATE TABLE users
(
    id   SERIAL PRIMARY KEY,
    name VARCHAR(255)
);

CREATE TABLE photos
(
    id      INT PRIMARY KEY,
    user_id INTEGER REFERENCES users(id),
    flag    VARCHAR(255)
);
INSERT INTO users VALUES (1, 'Bob');
INSERT INTO users VALUES (2, 'Alice');
INSERT INTO users VALUES (3, 'John');
INSERT INTO photos VALUES (1001, 1, 'a');
INSERT INTO photos VALUES (1002, 1, 'b');
INSERT INTO photos VALUES (1003, 1, 'c');
INSERT INTO photos VALUES (1004, 2, 'a');
INSERT INTO photos VALUES (1005, 2, 'x');
What I need is to extract each user name, only once, and a flag value for each of them. The flag value should prioritize a specific one, let's say b. So, the result should look like:
Bob b
Alice a
Bob owns a photo with the b flag, while Alice does not, and John has no photos at all. For Alice the flag value in the output is not important (a or x would be just as good), as long as she owns no photo flagged b.
The closest thing I found was some self-join queries where the flag value is aggregated using min() or max(), but I am looking for a particular value that is neither first nor last. I also found out that you can define your own aggregate functions, but I wonder if there is an easier way of conditioning the query to obtain the required data.
Thank you!
Here is a method with aggregation:
select u.name,
coalesce(max(flag) filter (where flag = 'b'),
min(flag)
) as flag
from users u left join
photos p
on u.id = p.user_id
group by u.id, u.name;
That said, a more typical method would be a prioritization query. Perhaps:
select distinct on (u.id) u.name, p.flag
from users u left join
photos p
on u.id = p.user_id
order by u.id, (p.flag = 'b') desc;
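The boolean ordering puts rows where p.flag = 'b' first, so DISTINCT ON keeps the prioritized photo per user. As a minimal sketch, the same prioritization can be written with a CASE expression (same tables as above; this variant is an assumption, not part of the original answer):

-- Sketch only: CASE maps the prioritized flag to 0 so it sorts first.
select distinct on (u.id) u.name, p.flag
from users u
left join photos p on p.user_id = u.id
order by u.id,
         case when p.flag = 'b' then 0 else 1 end;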

Alternative to NOT IN in SSMS

I have my table in this structure. I am trying to find all the unique IDs whose words do not appear in the list. How can I achieve this in MS SQL Server?
id word
1 hello
2 friends
2 world
3 cat
3 dog
2 country
1 phone
4 eyes
I have a list of words
List
phone
eyes
hair
body
Expected Output
Excluding IDs whose words appear in the list, I need all the unique IDs. In this case that is:
2
3
1 & 4 are not in the output as their words appear in the list.
I tried the below code
Select count(distinct ID)
from Table1
where word not in ('phone','eyes','hair','body')
I also tried NOT EXISTS, which did not work.
You can also use GROUP BY
SELECT id
FROM Table1
GROUP BY id
HAVING MAX(CASE WHEN word IN('phone', 'eyes', 'hair', 'body') THEN 1 ELSE 0 END) = 0
One way to do it is to use not exists, where the inner query is linked to the outer query by id and is filtered by the search words.
First, create and populate a sample table (please save us this step in your future questions):
DECLARE @T AS TABLE (
    id int,
    word varchar(20)
)

INSERT INTO @T VALUES
(1, 'hello'),
(2, 'friends'),
(2, 'world'),
(3, 'cat'),
(3, 'dog'),
(2, 'country'),
(1, 'phone'),
(4, 'eyes')
The query:
SELECT DISTINCT id
FROM @T t0
WHERE NOT EXISTS
(
    SELECT 1
    FROM @T t1
    WHERE word IN ('phone', 'eyes', 'hair', 'body')
    AND t0.Id = t1.Id
)
Result:
id
2
3
SELECT t.id FROM dbo.table AS t
WHERE NOT EXISTS (SELECT 1 FROM dbo.table AS t2
INNER JOIN
(VALUES('phone'),('eyes'),('hair'),('body')) AS lw(word)
ON t2.word = lw.word
AND t2.id = t.id)
GROUP BY t.id;
You can try this as well, using table variables:

DECLARE @T AS TABLE (id int, word varchar(20))

INSERT INTO @T VALUES
(1, 'hello'),
(2, 'friends'),
(2, 'world'),
(3, 'cat'),
(3, 'dog'),
(2, 'country'),
(1, 'phone'),
(4, 'eyes')

DECLARE @tblNotUsed AS TABLE (id int, word varchar(20))
DECLARE @tblNotUsedIds AS TABLE (id int)

INSERT INTO @tblNotUsed VALUES
(1, 'phone'),
(2, 'eyes'),
(3, 'hair'),
(4, 'body')

INSERT INTO @tblNotUsedIds (id)
SELECT t.id FROM @T t INNER JOIN @tblNotUsed nu ON nu.word = t.word

SELECT DISTINCT id FROM @T
WHERE id NOT IN (SELECT id FROM @tblNotUsedIds)
The nice thing about SQL is there are sometimes many ways to do things. One way is to place your list of known values into a #temp table and then run something like this.
Select * from dbo.maintable
EXCEPT
Select * from #tempExcludeValues
The results will give you all records that aren't in your predefined list. A second way is to do the join as Larnu mentioned in the comments; a sketch of that approach is below. NOT IN is typically not the fastest option on larger datasets; JOINs are usually a far more efficient way of filtering data than an IN or NOT IN clause.
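A minimal sketch of that join-based exclusion (the anti-join pattern is assumed here, since Larnu's actual comment isn't shown; table and column names come from the question):

-- Sketch only: keep ids for which no row matched a listed word.
SELECT t.id
FROM Table1 t
LEFT JOIN (VALUES ('phone'), ('eyes'), ('hair'), ('body')) AS lw(word)
    ON lw.word = t.word
GROUP BY t.id
HAVING COUNT(lw.word) = 0;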

Need solution to avoid repeated scanning in huge table

I have an event table which has 40 columns and up to 2 billion records. In that event table I would like to query for a combination of events, i.e. Event A with Event B. Sometimes I may want to find a larger combination, like Event A with B and C. It may go up to 5 or 6 events.
I don't want to scan that table once for every event in the combination, i.e. scanning for event A and then scanning for event B. I also need a generic approach that works for larger combinations.
Note: Those 2 billion records are partitioned by event date and the data is split evenly.
Eg:
Find ids which have events A, B, and C, and find ids which have only A and B.
The number of events in the combination is dynamic. I don't want to scan the table once per event and then intersect the results.
There may be some mileage in using a SQL Server equivalent of the MySQL group_concat function.
For example
drop table t
create table t (id int, dt date, event varchar(1))
insert into t values
(1,'2017-01-01','a'),(1,'2017-01-01','b'),(1,'2017-01-01','c'),(1,'2017-01-02','c'),(1,'2017-01-03','d'),
(2,'2017-02-01','a'),(2,'2017-02-01','b')
select id,
       stuff(
       (
           select cast(',' as varchar(max)) + t1.event
           from t as t1
           where t1.id = t.id
           order by t1.event  -- order by event so the concatenated list is deterministic
           for xml path('')
       ), 1, 1, '') as groupconcat
from t
group by t.id
Results in
id groupconcat
----------- -----------
1 a,b,c,c,d
2 a,b
If you then add a patindex
select * from
(
    select id,
           stuff(
           (
               select cast(',' as varchar(max)) + t1.event
               from t as t1
               where t1.id = t.id
               order by t1.event
               for xml path('')
           ), 1, 1, '') as groupconcat
    from t
    group by t.id
) s
where patindex('a,b,c%', groupconcat) > 0
you get this
id groupconcat
----------- ------------
1 a,b,c,c,d
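On SQL Server 2017 and later, STRING_AGG is a simpler alternative to the FOR XML PATH trick. A minimal sketch against the same t table (assuming that version is available):

-- Sketch only: STRING_AGG requires SQL Server 2017+.
select id,
       string_agg(event, ',') within group (order by event) as groupconcat
from t
group by id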
SELECT *
FROM table AS A
JOIN table AS B
    ON A.Id = B.Id AND A.Date = B.Date
WHERE A.Date = '1-Jan'
  AND A.Event = 'A'
  AND B.Event = 'B'
This will give you rows where Date is '1-Jan' and the Id is the same for both events.
You can join the table again and again if you want to filter by more events; a sketch for a third event is below.
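A minimal sketch of that extension for a third event (the table is named events here because the placeholder name table is a reserved word; everything else is carried over from the query above):

-- Sketch only: one extra self join per additional event in the combination.
SELECT A.Id
FROM events AS A
JOIN events AS B ON B.Id = A.Id AND B.Date = A.Date
JOIN events AS C ON C.Id = A.Id AND C.Date = A.Date
WHERE A.Date  = '1-Jan'
  AND A.Event = 'A'
  AND B.Event = 'B'
  AND C.Event = 'C';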
The HAVING clause allows you to filter using the result of an aggregate function. I've used a regular count, but you may need a distinct count depending on your table design. A conditional variant that checks for specific events is sketched after the example.
Example:
-- Returns ids with 3 or more events.
SELECT
x.Id,
COUNT(*) AS EventCount
FROM
(
VALUES
(1, '2017-01-01', 'A'),
(1, '2017-01-01', 'B'),
(1, '2017-01-03', 'C'),
(1, '2017-01-04', 'C'),
(1, '2017-01-05', 'E'),
(2, '2017-01-01', 'A'),
(2, '2017-01-01', 'B'),
(3, '2017-01-01', 'A')
) AS x(Id, [Date], [Event])
GROUP BY
x.Id
HAVING
COUNT(*) > 2
;
Returns
Id EventCount
1 5
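If you need ids that have a specific set of events rather than a minimum count, one sketch is conditional aggregation in the HAVING clause (same inline sample data as above; this variant is an assumption, not part of the original answer):

-- Sketch only: keeps ids that have at least one A, one B, and one C event.
SELECT x.Id
FROM
(
    VALUES
    (1, '2017-01-01', 'A'),
    (1, '2017-01-01', 'B'),
    (1, '2017-01-03', 'C'),
    (2, '2017-01-01', 'A'),
    (2, '2017-01-01', 'B'),
    (3, '2017-01-01', 'A')
) AS x(Id, [Date], [Event])
GROUP BY x.Id
HAVING SUM(CASE WHEN x.[Event] = 'A' THEN 1 ELSE 0 END) > 0
   AND SUM(CASE WHEN x.[Event] = 'B' THEN 1 ELSE 0 END) > 0
   AND SUM(CASE WHEN x.[Event] = 'C' THEN 1 ELSE 0 END) > 0;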

Insert values in table only one column changes value

I've got a table with 2 columns,
GROUP PROJECTS
10001 1
10001 2
The first column (GROUP) keeps the same value, 10001.
The second column (PROJECTS) changes: 3, 5, 9, 100, etc. (I have 400 project IDs).
What would be the correct (loop?) statement to insert all 400 PROJECTS?
I used INSERT ... VALUES for smaller lists:
INSERT INTO table (GROUP_ID, PROJECTS) VALUES (10001, 1), (10001, 2), (10001, etc), (10001, etc);
I have the list in Excel (if needed I can create a Temp table with all 400 project ID's)
Thanks.
I typically write such inserts as:
INSERT INTO table(GROUP_ID, PROJECTS)
select 10001, 1 from dual union all
select 10001, 2 from dual union all
. . . ;
You should be able to generate the select statement pretty easily in Excel.
If the project IDs exist in their own table (or you can create one from your Excel data), then you can get the list of values from there and cross-join those with all the group IDs:
insert into group_projects (group_id, project_id)
select g.group_id, p.project_id
from groups g
cross join projects p
where not exists (
select 1 from group_projects gp
where gp.group_id = g.group_id and gp.project_id = p.project_id
);
The where not exists() excludes all the existing pairs so you don't insert duplicates.
If the groups don't have their own table then you can use the existing values from a subquery:
insert into group_projects (group_id, project_id)
select g.group_id, p.project_id
from (select distinct group_id from group_projects) g
cross join projects p
where not exists (
select 1 from group_projects gp
where gp.group_id = g.group_id and gp.project_id = p.project_id
);
You could use Gordon's approach to generate the project ID list as a subquery as well, if you didn't want to create a table for those; a sketch of that combination is below.
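A minimal sketch of that combination (the inlined project values are illustrative only; the real list would be generated from Excel as Gordon describes):

-- Sketch only: the project list is an inline union-all subquery
-- instead of a real projects table.
insert into group_projects (group_id, project_id)
select g.group_id, p.project_id
from (select distinct group_id from group_projects) g
cross join (
    select 1 as project_id from dual union all
    select 2 as project_id from dual union all
    select 3 as project_id from dual
) p
where not exists (
    select 1 from group_projects gp
    where gp.group_id = g.group_id and gp.project_id = p.project_id
);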
I'd go with what I view as a simpler and much more readable solution: create the temp table with the data from Excel, then run this:
DECLARE
    CURSOR c1 IS
        SELECT project_id
        FROM temp_table
        WHERE project_id IS NOT NULL;
BEGIN
    FOR rec IN c1
    LOOP
        INSERT INTO table
        VALUES (10001, rec.project_id);
        COMMIT;
    END LOOP;
END;
Seems cleaner than one giant insert statement or something complex with joins and sub-queries. If you wanted to make sure the value doesn't already exist in "table", add that criterion to the cursor's SELECT statement, or if you have constraints on the table, add an exception handler in the loop (a sketch is below).
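A minimal sketch of that exception handler, assuming a unique constraint on the target table (group_projects is a hypothetical name for the target table, borrowed from the earlier answer; c1 is the same cursor as in the block above):

-- Sketch only: duplicates are skipped instead of aborting the whole run.
DECLARE
    CURSOR c1 IS
        SELECT project_id
        FROM temp_table
        WHERE project_id IS NOT NULL;
BEGIN
    FOR rec IN c1
    LOOP
        BEGIN
            INSERT INTO group_projects VALUES (10001, rec.project_id);
        EXCEPTION
            WHEN DUP_VAL_ON_INDEX THEN
                NULL;  -- the pair already exists, skip it
        END;
    END LOOP;
    COMMIT;
END;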

How to replace NULL in a result set with the last NOT NULL value in the same column?

A colleague of mine has a problem with a SQL query:-
Take the following as an example, two temp tables:-
select 'John' as name,10 as value into #names
UNION ALL SELECT 'Abid',20
UNION ALL SELECT 'Alyn',30
UNION ALL SELECT 'Dave',15;
select 'John' as name,'SQL Expert' as job into #jobs
UNION ALL SELECT 'Alyn','Driver'
UNION ALL SELECT 'Abid','Case Statement';
We run the following query on the tables to give us a joined resultset:-
select #names.name, #names.value, #jobs.job
FROM #names left outer join #jobs
on #names.name = #jobs.name
name value job
John 10 SQL Expert
Abid 20 Case Statement
Alyn 30 Driver
Dave 15 NULL
As 'Dave' does not exist in the #jobs table, he is given a NULL value as expected.
My colleague wants to modify the query so each NULL value is given the same value as the previous entry.
So the above would be:-
name value job
John 10 SQL Expert
Abid 20 Case Statement
Alyn 30 Driver
Dave 15 Driver
Note that Dave is now a 'Driver'
There may be more than one NULL value in sequence,
name value job
John 10 SQL Expert
Abid 20 Case Statement
Alyn 30 Driver
Dave 15 NULL
Joe 15 NULL
Pete 15 NULL
In this case Dave, Joe and Pete should all be 'Driver', as 'Driver' is the last non null entry.
There are probably better ways to do this. Here is one way I could achieve the result: use a Common Table Expression (CTE) and then perform an OUTER APPLY on its output to find the previous person's job. The query uses id to sort the records and then determines what the previous person's job was. You need at least one criterion to sort the records, because data in tables is considered an unordered set.
Also, the assumption is that the first person in the sequence should have a job. If the first person doesn't have a job, then there is no value to pick from.
Script:
CREATE TABLE names
(
id INT NOT NULL IDENTITY
, name VARCHAR(20) NOT NULL
, value INT NOT NULL
);
CREATE TABLE jobs
(
id INT NOT NULL
, job VARCHAR(20) NOT NULL
);
INSERT INTO names (name, value) VALUES
('John', 10),
('Abid', 20),
('Alyn', 30),
('Dave', 40),
('Jill', 50),
('Jane', 60),
('Steve', 70);
INSERT INTO jobs (id, job) VALUES
(1, 'SQL Expert'),
(2, 'Driver' ),
(5, 'Engineer'),
(6, 'Barrista');
;WITH empjobs AS
(
SELECT
TOP 100 PERCENT n.id
, n.name
, n.value
, job
FROM names n
LEFT OUTER JOIN jobs j
on j.id = n.id
ORDER BY n.id
)
SELECT e1.id
, e1.name
, e1.value
, COALESCE(e1.job , e2.job) job FROM empjobs e1
OUTER APPLY (
SELECT
TOP 1 job
FROM empjobs e2
WHERE e2.id < e1.id
AND e2.job IS NOT NULL
ORDER BY e2.id DESC
) e2;
Output:
ID NAME VALUE JOB
--- ------ ----- -------------
1 John 10 SQL Expert
2 Abid 20 Driver
3 Alyn 30 Driver
4 Dave 40 Driver
5 Jill 50 Engineer
6 Jane 60 Barrista
7 Steve 70 Barrista
What do you mean by "last" non-null entry? You need a well-defined ordering for "last" to have a consistent meaning. Here's a query with data definitions that uses the "value" column to define last, and that might be close to what you want.
CREATE TABLE #names
(
id INT NOT NULL IDENTITY
, name VARCHAR(20) NOT NULL
, value INT NOT NULL PRIMARY KEY
);
CREATE TABLE #jobs
(
name VARCHAR(20) NOT NULL
, job VARCHAR(20) NOT NULL
);
INSERT INTO #names (name, value) VALUES
('John', 10),
('Abid', 20),
('Alyn', 30),
('Dave', 40),
('Jill', 50),
('Jane', 60),
('Steve', 70);
INSERT INTO #jobs (name, job) VALUES
('John', 'SQL Expert'),
('Abid', 'Driver' ),
('Alyn', 'Engineer'),
('Dave', 'Barrista');
with Partial as (
select
#names.name,
#names.value,
#jobs.job as job
FROM #names left outer join #jobs
on #names.name = #jobs.name
)
select
name,
value,
(
select top 1 job
from Partial as P
where job is not null
and P.value <= Partial.value
order by value desc
)
from Partial;
It might be more efficient to insert the joined data into a temp table first and then update the NULL rows; a sketch of that approach follows.
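A minimal sketch of that insert-then-update approach (reusing the #names and #jobs temp tables defined above; the #result table name and the ordering by value are assumptions):

-- Sketch only: materialize the joined rows, then fill each NULL job from the
-- nearest preceding non-NULL job, ordered by value.
SELECT n.name, n.value, j.job
INTO #result
FROM #names n
LEFT OUTER JOIN #jobs j ON j.name = n.name;

UPDATE r
SET r.job = prev.job
FROM #result r
CROSS APPLY (
    SELECT TOP 1 r2.job
    FROM #result r2
    WHERE r2.job IS NOT NULL
      AND r2.value < r.value
    ORDER BY r2.value DESC
) prev
WHERE r.job IS NULL;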