Count Distinct not working as expected, output is equal to count - sql

I have a table where I'm trying to count the distinct number of members per group. I know there are duplicates based on the count(distinct memberid) result. But when I group the rows by group and count distinct, I don't get the number I'd expect.
select count(distinct memberid), count(*)
from dbo.condition c
output:
count count
303,781 348,722
select groupid, count(*), count(distinct memberid)
from dbo.condition c
group by groupid
output:
groupid count count
2 19,984 19,984
3 25,689 25,689
5 14,400 14,400
24 56,058 56,058
25 200,106 200,106
29 27,847 27,847
30 1,370 1,370
31 3,268 3,268
The two counts in the second query are equal when they shouldn't be. Does anyone know what I'm doing wrong? I need the 3rd column to add up to 303,781, not 348,722.
Thanks!

There's nothing wrong with your second query. Since you're aggregating on the "groupid" field, the output tells you that no "memberid" value is duplicated within the same groupid (so within each group, counting rows and counting distinct values give the same result).
On the other hand, the first query aggregates without any partitioning, and its output shows that some "memberid" values are repeated across different "groupid" values.
I took the liberty of adding an example that corroborates your answer:
create table aa (groupid int not null, memberid int not null );
insert into aa (groupid, memberid)
values
(1, 1), (1, 2), (1, 3), (2, 1), (3, 1), (3, 2), (3, 3), (4, 1), (4, 2), (4, 3), (4, 5), (5, 3);
select groupid, count(*), count(distinct memberid)
from aa group by groupid;
select count(*), count(distinct memberid)
from aa;
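To see where the gap between the two counts comes from, the example can be run end to end. The following is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for SQL Server (an assumption made only so the snippet is runnable); the last query, which lists members that appear in more than one group, is an addition for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for the SQL Server table `aa`.
conn = sqlite3.connect(":memory:")
conn.execute("create table aa (groupid int not null, memberid int not null)")
conn.executemany(
    "insert into aa (groupid, memberid) values (?, ?)",
    [(1, 1), (1, 2), (1, 3), (2, 1), (3, 1), (3, 2), (3, 3),
     (4, 1), (4, 2), (4, 3), (4, 5), (5, 3)],
)

# Without grouping: total rows versus distinct members over the whole table.
overall = conn.execute(
    "select count(*), count(distinct memberid) from aa"
).fetchone()

# Per group: both counts match because no member repeats inside a group.
per_group = conn.execute(
    "select groupid, count(*), count(distinct memberid) "
    "from aa group by groupid order by groupid"
).fetchall()

# Members that belong to more than one group account for the difference.
shared = conn.execute(
    "select memberid, count(distinct groupid) as group_count "
    "from aa group by memberid "
    "having count(distinct groupid) > 1 order by memberid"
).fetchall()

print(overall)    # (12, 4)
print(per_group)  # [(1, 3, 3), (2, 1, 1), (3, 3, 3), (4, 4, 4), (5, 1, 1)]
print(shared)     # [(1, 4), (2, 3), (3, 4)]
```

The per-group distinct counts add up to 12, the same as count(*), because a member who sits in several groups is counted once per group.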

Related

running sums, find blocks of rows that sum to given list of values

Here is the test data:
create table #trial (id int, val int);
insert into #trial (id, val)
values (1, 1), (2, 3),(3, 2), (4, 4), (5, 5),(6, 6), (7, 7), (8, 2),(9, 3), (10, 4), (11, 6),(12, 10), (13, 5), (14, 3),(15, 2) ;
select * from #trial order by id asc
Description of the data:
I have a list of n values that represent sums; assume they are (10, 53) for this example. The values in #trial can be both negative and positive. Note that the values in #trial always add up to the total of the given sums.
Description of the pattern:
10 is the 1st sum I want to match and 53 is the 2nd. The dataset is set up so that blocks of consecutive rows always add up to these sums: in this example, the first 4 rows sum to 10, and the next 11 rows sum to 53. In other words, the 1st given sum is found by summing rows 1 to i, the 2nd by summing rows i + 1 to j, and so on.
Finally, I want an id that identifies the group of rows making up each sum. In this example, rows 1 to 4 take id 1 and rows 5 to 15 take id 2.
This answers the original question.
From what you describe you can do something like this:
select v.grp, t.*
from (select t.*, sum(val) over (order by id) as running_val
      from #trial t
     ) t left join
     (select grp, lag(upper, 1, -1) over (order by upper) as lower, upper
      -- upper is the cumulative target: 10, then 10 + 53 = 63
      from (values (1, 10), (2, 63)) v(grp, upper)
     ) v
     on t.running_val > v.lower and
        t.running_val <= v.upper;
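The range join above compares a running total against cumulative bounds, so the bounds themselves can be derived from the list of target sums. Here is a minimal sketch of that idea, run against SQLite through Python's sqlite3 module (window functions need SQLite 3.25 or later). The `targets` table and the `lo_bound`/`hi_bound` names are hypothetical, and the sketch assumes the running total only increases, which holds for the sample data but not necessarily when negative values are present.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table trial (id int, val int)")
vals = [1, 3, 2, 4, 5, 6, 7, 2, 3, 4, 6, 10, 5, 3, 2]
conn.executemany("insert into trial values (?, ?)",
                 list(enumerate(vals, start=1)))

# Hypothetical helper table holding the sums to match, in order.
conn.execute("create table targets (grp int, target int)")
conn.executemany("insert into targets values (?, ?)", [(1, 10), (2, 53)])

query = """
with running as (
    select id, val, sum(val) over (order by id) as running_val
    from trial
),
bounds as (
    -- hi_bound is the cumulative target (10, then 63);
    -- lo_bound is the cumulative target of the previous group (0, then 10)
    select grp,
           sum(target) over (order by grp) - target as lo_bound,
           sum(target) over (order by grp) as hi_bound
    from targets
)
select b.grp, r.id, r.val
from running r
left join bounds b
  on r.running_val > b.lo_bound and r.running_val <= b.hi_bound
order by r.id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

With the sample data, ids 1 to 4 come back with grp 1 and ids 5 to 15 with grp 2.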

Running total over duplicate column values and no other columns

I want to compute a running total, but there is no unique column or id column to use in the OVER clause.
CREATE TABLE piv2([name] varchar(5), [no] int);
INSERT INTO piv2
([name], [no])
VALUES
('a', 1),
('a', 2),
('a', 3),
('a', 4),
('b', 1),
('b', 2),
('b', 3);
There are only 2 columns: name, which has duplicate values, and no, on which I want to compute the running total, in SQL Server 2017.
expected result:
a 1
a 3
a 6
a 10
b 11
b 13
b 16
Any help?
The following query would generate the output you expect, at least for the exact sample data you showed us:
SELECT
name,
SUM(no) OVER (ORDER BY name, no) AS no_sum
FROM piv2;
If the order you intend to use for the rolling sum is something other than the order given by the name and no columns, then you should reveal that logic along with sample data.
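One thing to watch when no column is unique: with an ORDER BY and no explicit frame, the window defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so rows that tie on (name, no) all receive the same total. The sketch below illustrates this with SQLite through Python's sqlite3 module (an assumption for runnability; the default frame is the same in SQL Server) and a small made-up data set containing an exact duplicate row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('create table piv2 ("name" varchar(5), "no" int)')
# Made-up rows for illustration: ('a', 1) appears twice.
conn.executemany("insert into piv2 values (?, ?)",
                 [("a", 1), ("a", 1), ("a", 2), ("b", 1)])

# Default frame (RANGE): tied rows are peers and share one running total.
range_rows = conn.execute('''
    select "name", sum("no") over (order by "name", "no") as no_sum
    from piv2
    order by "name", "no"
''').fetchall()

# Explicit ROWS frame: every physical row advances the running total.
rows_rows = conn.execute('''
    select "name",
           sum("no") over (order by "name", "no"
                           rows between unbounded preceding
                                    and current row) as no_sum
    from piv2
    order by "name", no_sum
''').fetchall()

print(range_rows)  # [('a', 2), ('a', 2), ('a', 4), ('b', 5)]
print(rows_rows)   # [('a', 1), ('a', 2), ('a', 4), ('b', 5)]
```

With a ROWS frame the order among tied rows is arbitrary, but since tied rows are identical here the output is the same either way.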

Select TOP columns from table1, join table2 with their names

I have a TABLE1 with these two columns, storing departure and arrival identifiers from flights:
dep_id arr_id
1 2
6 2
6 2
6 2
6 2
3 2
3 2
3 2
3 4
3 4
3 6
3 6
and a TABLE2 with the respective IDs containing their ICAO codes:
id icao
1 LPPT
2 LPFR
3 LPMA
4 LPPR
5 LLGB
6 LEPA
7 LEMD
How can I select the most used departure id and the most used arrival id from TABLE1 and return each with its ICAO code from TABLE2, so that the example data above gives:
most_arrivals most_departures
LPFR LPMA
It's simple to get ONE of them, but combining two or more columns doesn't work for me no matter what I try.
You can do it like this.
Create and populate tables.
CREATE TABLE dbo.Icao
(
id int NOT NULL PRIMARY KEY,
icao nchar(4) NOT NULL
);
CREATE TABLE dbo.Flight
(
dep_id int NOT NULL
FOREIGN KEY REFERENCES dbo.Icao(id),
arr_id int NOT NULL
FOREIGN KEY REFERENCES dbo.Icao(id)
);
INSERT INTO dbo.Icao (id, icao)
VALUES
(1, N'LPPT'),
(2, N'LPFR'),
(3, N'LPMA'),
(4, N'LPPR'),
(5, N'LLGB'),
(6, N'LEPA'),
(7, N'LEMD');
INSERT INTO dbo.Flight (dep_id, arr_id)
VALUES
(1, 2),
(6, 2),
(6, 2),
(6, 2),
(6, 2),
(3, 2),
(3, 2),
(3, 2),
(3, 4),
(3, 4),
(3, 6),
(3, 6);
Then do a SELECT using two subqueries.
SELECT
(SELECT TOP 1 I.icao
FROM dbo.Flight AS F
INNER JOIN dbo.Icao AS I
ON I.id = F.arr_id
GROUP BY I.icao
ORDER BY COUNT(*) DESC) AS 'most_arrivals',
(SELECT TOP 1 I.icao
FROM dbo.Flight AS F
INNER JOIN dbo.Icao AS I
ON I.id = F.dep_id
GROUP BY I.icao
ORDER BY COUNT(*) DESC) AS 'most_departures';
To see how the query is executed, turn on "Include Actual Execution Plan" on the SQL Server Management Studio toolbar before you execute the query.
In the graphical execution plan, each icon represents an operation performed by the SQL Server engine, and the arrows represent data flows. Data flows from right to left, so the result is the leftmost icon.
Try this one:
select
    (select icao
     from table2 where id = (
         select top 1 arr_id
         from table1
         group by arr_id
         order by count(*) desc)
    ) as most_arrivals,
    (select icao
     from table2 where id = (
         select top 1 dep_id
         from table1
         group by dep_id
         order by count(*) desc)
    ) as most_departures
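Both answers rely on the same pattern: one scalar subquery per column, each picking the top group by count. As a rough check, here is that pattern run against the sample data in SQLite through Python's sqlite3 module. This is a sketch under assumptions: LIMIT 1 replaces the T-SQL TOP 1, the `most_used` helper is hypothetical, and the secondary sort on `icao` is an added tie-breaker that is not in the original queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table icao (id integer primary key, icao text not null);
create table flight (dep_id int not null references icao(id),
                     arr_id int not null references icao(id));
insert into icao values (1,'LPPT'),(2,'LPFR'),(3,'LPMA'),(4,'LPPR'),
                        (5,'LLGB'),(6,'LEPA'),(7,'LEMD');
insert into flight values (1,2),(6,2),(6,2),(6,2),(6,2),(3,2),(3,2),(3,2),
                          (3,4),(3,4),(3,6),(3,6);
""")


def most_used(column):
    """Build a scalar subquery returning the ICAO code used most in `column`."""
    # Only two fixed column names are accepted, so formatting them in is safe.
    if column not in ("dep_id", "arr_id"):
        raise ValueError(column)
    return (
        f"(select i.icao from flight f join icao i on i.id = f.{column} "
        "group by i.icao order by count(*) desc, i.icao limit 1)"
    )


row = conn.execute(
    f"select {most_used('arr_id')} as most_arrivals, "
    f"{most_used('dep_id')} as most_departures"
).fetchone()
print(row)  # ('LPFR', 'LPMA')
```

Without a tie-breaker, which code is returned when two airports share the top count is up to the engine.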

SQL - Counting sets of Field-B values for each Field-A value

First of all sorry that I could not think of a more descriptive title.
What I want to do is the following using only SQL:
I have some lists of strings, list1, list2 and list3.
I have a dataset that contains two interesting columns, A and B. Column A contains a TransactionID and column B contains an ItemID.
Naturally, there can be multiple rows that share the same TransactionIDs.
I need to catch those transactions that have at least one ItemID in each and every list (list1 AND list2 AND list3).
I also need to count how many times that happens for each transaction.
[EDIT] That is, count how many full sets of ItemIDs there are for each TransactionID, a "full set" being any element of list1 together with any element of list2 and any element of list3.
I hope that makes enough sense, perhaps I will be able to explain it better with a clear head.
Thanks in advance
In MySQL if you have the following lists:
list1 = ('1', '3')
list2 = ('2', '3')
list3 = ('3', '5')
then you can do this:
SELECT
TransactionID,
SUM(ItemID IN ('1', '3')) AS list1_count,
SUM(ItemID IN ('2', '3')) AS list2_count,
SUM(ItemID IN ('3', '5')) AS list3_count
FROM table1
GROUP BY TransactionID
HAVING list1_count > 0 AND list2_count > 0 AND list3_count > 0
Result:
TransactionId list1_count list2_count list3_count
1 3 2 1
3 2 2 1
Test data:
CREATE TABLE table1 (ID INT NOT NULL, TransactionID INT NOT NULL, ItemID INT NOT NULL);
INSERT INTO table1 (ID, TransactionID, ItemID) VALUES
(1, 1, 1),
(2, 1, 2),
(3, 1, 3),
(4, 1, 4),
(5, 1, 1),
(6, 2, 1),
(7, 2, 2),
(8, 2, 1),
(9, 2, 4),
(10, 3, 3),
(11, 3, 2),
(12, 3, 1);
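SUM(ItemID IN (...)) depends on MySQL treating a boolean as 0 or 1, and the HAVING clause depends on MySQL accepting select-list aliases there. A more portable spelling uses CASE expressions and repeats them in HAVING. The following is a minimal sketch of that variant, run on the same test data in SQLite through Python's sqlite3 module (the choice of SQLite is an assumption for runnability, and integer literals are used because ItemID is an INT column).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table table1 (ID int not null, "
             "TransactionID int not null, ItemID int not null)")
rows_in = [(1, 1, 1), (2, 1, 2), (3, 1, 3), (4, 1, 4), (5, 1, 1),
           (6, 2, 1), (7, 2, 2), (8, 2, 1), (9, 2, 4),
           (10, 3, 3), (11, 3, 2), (12, 3, 1)]
conn.executemany("insert into table1 values (?, ?, ?)", rows_in)

query = """
select TransactionID,
       sum(case when ItemID in (1, 3) then 1 else 0 end) as list1_count,
       sum(case when ItemID in (2, 3) then 1 else 0 end) as list2_count,
       sum(case when ItemID in (3, 5) then 1 else 0 end) as list3_count
from table1
group by TransactionID
-- keep only transactions with at least one item from every list
having sum(case when ItemID in (1, 3) then 1 else 0 end) > 0
   and sum(case when ItemID in (2, 3) then 1 else 0 end) > 0
   and sum(case when ItemID in (3, 5) then 1 else 0 end) > 0
order by TransactionID
"""
result = conn.execute(query).fetchall()
print(result)  # [(1, 3, 2, 1), (3, 2, 2, 1)]
```

Transaction 2 is dropped because none of its items appear in list3.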
Depending on your dialect, and assuming your lists are other tables...
SELECT
TransactionID, Count1, Count2, Count3
FROM
MyDataSet M
JOIN
(SELECT ItemID, COUNT(*) AS Count1 FROM List1 GROUP BY ItemID) T1 ON T1.ItemID = M.ItemID
JOIN
(SELECT ItemID, COUNT(*) AS Count2 FROM List2 GROUP BY ItemID) T2 ON T2.ItemID = M.ItemID
JOIN
(SELECT ItemID, COUNT(*) AS Count3 FROM List3 GROUP BY ItemID) T3 ON T3.ItemID = M.ItemID
If list1, list2, and list3 are actually known enumerations, you could go with:
SELECT TransactionID, COUNT(*)
FROM MyTable
WHERE ItemID IN (list1) AND ItemID IN (list2) AND ItemID IN (list3)
GROUP BY TransactionID
If you have a lot of lists, you may want to generate the SQL in a program. However, it should still perform pretty well, even for a lot of lists. Put the lists you expect to have the fewest matches in first, so that you stop evaluating the predicate as soon as possible.
If your lists are in another table, perhaps a bunch of tuples of the form (list_id, item_id), that's a trickier problem. I'd like to know more before trying to come up with a query for that.

what mysql query should i use to select a category that matches ALL my criteria?

I have the following data in my table called cat_product:
cat_id product_id
1 2
1 3
2 2
2 4
Given a set of values for product_id, e.g. (2,3), I want to find the unique cat_id that matches. In this case, that will be cat_id 1.
How should I construct the MySQL query?
I tried to use
select distinct cat_id from cat_product where product_id IN (2,3)
but it returns both 1 and 2.
If I use
select distinct cat_id from cat_product where product_id NOT IN (2,3)
I get 2.
Is there a better way than
select distinct cat_id from cat_product where product_id IN (2,3)
and cat_id NOT IN
(select distinct cat_id from cat_product where product_id NOT IN (2,3) )
I need to return the cat_id that has the EXACT set of product ids I am looking for.
In practice I have about 10 product ids as input.
SELECT cat_id
FROM (
SELECT DISTINCT cat_id
FROM cat_product
) cpo
WHERE EXISTS
(
SELECT NULL
FROM cat_product cpi
WHERE cpi.cat_id = cpo.cat_id
AND product_id IN (2, 3)
LIMIT 1, 1
)
You need to have a UNIQUE index on (cat_id, product_id) (in this order) for this to work fast.
This solution will use INDEX FOR GROUP BY to get a list of distinct categories, and EXISTS predicate will be a little bit faster than COUNT(*) (since the aggregation requires some overhead).
If you have more than two products to search for, adjust the first argument to LIMIT accordingly.
It should be LIMIT n - 1, 1, where n is the number of items in the IN list.
Update:
To return the categories holding all products from the list and nothing else, use this:
SELECT cat_id
FROM (
SELECT DISTINCT cat_id
FROM cat_product
) cpo
WHERE EXISTS
(
SELECT NULL
FROM cat_product cpi
WHERE cpi.cat_id = cpo.cat_id
AND product_id IN (2, 3)
LIMIT 1, 1
)
AND NOT EXISTS
(
SELECT NULL
FROM cat_product cpi
WHERE cpi.cat_id = cpo.cat_id
AND product_id NOT IN (2, 3)
)
SELECT * FROM
(SELECT cat_id FROM cat_product WHERE product_id=2) a
INNER JOIN
(SELECT cat_id FROM cat_product WHERE product_id=3) b
ON a.cat_id = b.cat_id;
Your own solution would not work (at least not if I understand the question correctly). It will return the id of a category that contains one or more of the listed products and no other products. Try adding the following row to your table, and see if you get the expected result:
insert into cat_product (cat_id, product_id) values (1,5)
If you really need to find the id of a category that has all of the listed products (no matter what other products might be in the category), try this query:
select cat_id
from cat_product
where product_id in (2,3)
group by cat_id
having count(*) = 2
The number 2 on the last line of the query corresponds to the size of the set of products you are searching for. If you are executing the query using some parameterized API, make sure to create an additional parameter bound to productsArray.length or similar.
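As a concrete illustration of binding the set size as an extra parameter, here is a minimal sketch using Python's sqlite3 module in place of a MySQL client (an assumption for runnability; the placeholder style differs between drivers). The function name `categories_with_all` is hypothetical, and the query relies on (cat_id, product_id) pairs being unique.

```python
import sqlite3


def categories_with_all(conn, product_ids):
    """Return cat_ids that contain every product in product_ids."""
    ids = sorted(set(product_ids))  # de-duplicate so the count is correct
    placeholders = ", ".join("?" for _ in ids)
    sql = (
        "select cat_id from cat_product "
        f"where product_id in ({placeholders}) "
        "group by cat_id having count(*) = ? order by cat_id"
    )
    # The last parameter is the size of the product set.
    return [r[0] for r in conn.execute(sql, (*ids, len(ids)))]


conn = sqlite3.connect(":memory:")
conn.execute("create table cat_product (cat_id int not null, "
             "product_id int not null, primary key (cat_id, product_id))")
conn.executemany("insert into cat_product values (?, ?)",
                 [(1, 2), (1, 3), (2, 2), (2, 4)])

print(categories_with_all(conn, [2, 3]))  # [1]
print(categories_with_all(conn, [2]))     # [1, 2]
```

Generating one placeholder per id keeps the values out of the SQL text, and only the number of placeholders changes with the input.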
Find the cat_ids that contain all the given products and nothing else.
Let's create some test data...
create table cat_product (
cat_id int not null,
product_id int not null,
primary key (cat_id, product_id));
delete from cat_product;
insert into cat_product
values
(1, 2),
(1, 3),
(2, 2),
(2, 4),
(3, 1),
(3, 2),
(3, 3),
(3, 4),
(3, 5),
(4, 1),
(4, 2),
(4, 3),
(4, 4),
(4, 6),
(5, 1),
(5, 2),
(5, 3),
(5, 4),
(5, 5),
(5, 6);
Plug in your list of product_ids at cp.product_id IN (1, 2, 3, 4, 5),
and plug in the number of product_ids on the last line of the query at cats.match_count = 5.
select
cats.cat_id
from
/* Count how many products for each cat match the list of products */
(select
cp.cat_id,
count(*) as match_count
from
cat_product cp
where
cp.product_id IN (1, 2, 3, 4, 5)
group by
cp.cat_id) as cats,
/* Count the number of products in each cat */
(select cat_id, count(*) as cat_count
from cat_product
group by cat_id) as cat_count
where
cats.cat_id = cat_count.cat_id
/* We matched all the products in the cat */
and cats.match_count = cat_count.cat_count
/* We matched all the products we wanted. Without this clause
the query would also match cats that only contain products in
our list (but not all of them) */
and cats.match_count = 5;
Relational Division is what you are looking for
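For the exact-match case (all listed products and nothing else), the two counts can also be compared in a single GROUP BY. The sketch below is one possible formulation, not taken from the answers above: it uses Python's sqlite3 module as a stand-in for MySQL, the name `categories_with_exactly` is hypothetical, and it assumes (cat_id, product_id) is unique, as the primary key in the test data guarantees.

```python
import sqlite3


def categories_with_exactly(conn, product_ids):
    """Return cat_ids whose product set equals product_ids exactly."""
    ids = sorted(set(product_ids))
    placeholders = ", ".join("?" for _ in ids)
    sql = (
        "select cat_id from cat_product group by cat_id "
        # the category has n products in total ...
        "having count(*) = ? "
        # ... and n of them are in the wanted list
        f"and sum(case when product_id in ({placeholders}) "
        "then 1 else 0 end) = ? "
        "order by cat_id"
    )
    n = len(ids)
    return [r[0] for r in conn.execute(sql, (n, *ids, n))]


conn = sqlite3.connect(":memory:")
conn.execute("create table cat_product (cat_id int not null, "
             "product_id int not null, primary key (cat_id, product_id))")
data = {1: [2, 3], 2: [2, 4], 3: [1, 2, 3, 4, 5],
        4: [1, 2, 3, 4, 6], 5: [1, 2, 3, 4, 5, 6]}
conn.executemany("insert into cat_product values (?, ?)",
                 [(c, p) for c, ps in data.items() for p in ps])

print(categories_with_exactly(conn, [1, 2, 3, 4, 5]))  # [3]
print(categories_with_exactly(conn, [2, 3]))           # [1]
```

Category 5 is rejected because it holds six products, and category 4 because only four of its five products are in the list.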