How to find count differences for IDs in two large tables - sql

I have two large tables, each containing around 17M rows. They should have exactly the same number of rows, but I am finding that the counts differ by 343. I want to find out where the counts are different. The tables look like this:
Table A
ID | color
---|------
1  | red
1  | green
1  | blue
2  | white
3  | black
3  | red
Table B
ID | sale_dates
---|-----------
1  | 2020-10-01
1  | 2020-01-10
2  | 2018-01-09
3  | 2017-08-08
Based on the above I would like an output like below:
ID | Table A | Table B | Difference
---|---------|---------|-----------
1  | 3       | 2       | 1
2  | 1       | 1       | 0
3  | 2       | 1       | 1
Or, even better, only find the IDs where the difference is not 0.

If the two tables will always have the same set of ID values, you can just JOIN two derived tables of COUNT(*) values to get your desired output:
SELECT A.ID,
       "Table A",
       "Table B",
       "Table A" - "Table B" AS Difference
FROM (
    SELECT ID, COUNT(*) AS "Table A"
    FROM A
    GROUP BY ID
) A
JOIN (
    SELECT ID, COUNT(*) AS "Table B"
    FROM B
    GROUP BY ID
) B ON A.ID = B.ID
ORDER BY A.ID
Output:
id  Table A  Table B  difference
1   3        2        1
2   1        1        0
3   2        1        1
Demo on dbfiddle
If you only want the ID values which have a non-zero difference, add
WHERE "Table A" - "Table B" <> 0
before the ORDER BY clause.
Demo on dbfiddle
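Put together, the filtered version would look like this (a sketch; the WHERE clause can reference "Table A" and "Table B" directly because they are columns of the derived tables, not aliases defined in the outer SELECT):
SELECT A.ID,
       "Table A",
       "Table B",
       "Table A" - "Table B" AS Difference
FROM (
    SELECT ID, COUNT(*) AS "Table A"
    FROM A
    GROUP BY ID
) A
JOIN (
    SELECT ID, COUNT(*) AS "Table B"
    FROM B
    GROUP BY ID
) B ON A.ID = B.ID
WHERE "Table A" - "Table B" <> 0  -- keep only mismatched counts
ORDER BY A.ID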

This is a tweak on Nick's answer. I think a full join is very important in this type of situation, because it is possible that some ids are missing from one table or the other:
SELECT ID, a.cnt, b.cnt,
       (COALESCE(a.cnt, 0) - COALESCE(b.cnt, 0)) AS difference
FROM (SELECT UPPER(ID) AS id, COUNT(*) AS cnt
      FROM A
      GROUP BY UPPER(ID)
     ) A FULL JOIN
     (SELECT UPPER(ID) AS id, COUNT(*) AS cnt
      FROM B
      GROUP BY UPPER(ID)
     ) B
     USING (id)
ORDER BY difference DESC;
Add:
WHERE COALESCE(a.cnt, 0) <> COALESCE(b.cnt, 0)
before the ORDER BY if you only want ids where the counts are not the same.
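For reference, the complete filtered version of the full-join query would look like this (a sketch):
SELECT ID, a.cnt, b.cnt,
       (COALESCE(a.cnt, 0) - COALESCE(b.cnt, 0)) AS difference
FROM (SELECT UPPER(ID) AS id, COUNT(*) AS cnt
      FROM A
      GROUP BY UPPER(ID)
     ) A FULL JOIN
     (SELECT UPPER(ID) AS id, COUNT(*) AS cnt
      FROM B
      GROUP BY UPPER(ID)
     ) B
     USING (id)
WHERE COALESCE(a.cnt, 0) <> COALESCE(b.cnt, 0)  -- only ids whose counts differ
ORDER BY difference DESC;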

Related

Sum of two counts from one table with additional data from another table

I have two tables as follows:
TABLE A
| id | col_a | col_b | user_id |
--------------------------------
| 1  | false | true  | 1       |
| 2  | false | true  | 2       |
| 3  | true  | true  | 2       |
| 4  | true  | true  | 3       |
| 5  | true  | false | 1       |
TABLE B
| id | name    |
----------------
| 1  | Bob     |
| 2  | Jim     |
| 3  | Helen   |
| 4  | Michael |
| 5  | Jen     |
I want to get the sum of two counts, which are the number of true values in col_a and number of true values in col_b. I want to group that data by user_id. I also want to join Table B and get the name of each user. The result would look like this:
| user_id | total (col_a + col_b) | name  |
-------------------------------------------
| 1       | 2                     | Bob   |
| 2       | 3                     | Jim   |
| 3       | 2                     | Helen |
So far I got the total sum with the following query:
SELECT
    (SELECT COUNT(*) FROM "TABLE_A" WHERE "col_a" IS TRUE) +
    (SELECT COUNT(*) FROM "TABLE_A" WHERE "col_b" IS TRUE)
    AS total
However, I'm not sure how to proceed with grouping these counts by user_id.
Something like this is typically fastest:
SELECT *
FROM "TABLE_B" b
JOIN (
    SELECT user_id AS id
         , count(*) FILTER (WHERE col_a)
         + count(*) FILTER (WHERE col_b) AS total
    FROM "TABLE_A"
    GROUP BY 1
) a USING (id);
While fetching all rows, aggregate first, join later. That's cheaper. See:
Query with LEFT JOIN not returning rows for count of 0
The aggregate FILTER clause is typically fastest. See:
For absolute performance, is SUM faster or COUNT?
Aggregate columns with additional (distinct) filters
Often, you want to keep total counts of 0 in the result. You did say:
get the name of each user.
SELECT b.id AS user_id, b.name, COALESCE(a.total, 0) AS total
FROM "TABLE_B" b
LEFT JOIN (
    SELECT user_id AS id
         , count(col_a OR NULL)
         + count(col_b OR NULL) AS total
    FROM "TABLE_A"
    GROUP BY 1
) a USING (id);
...
count(col_a OR NULL) is an equivalent alternative, shortest, and still fast. (Use the FILTER clause from above for best performance.)
The LEFT JOIN keeps all rows from "TABLE_B" in the result.
COALESCE() returns 0 instead of NULL for the total count.
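As an aside, these three conditional-count idioms are equivalent in PostgreSQL; they differ only in brevity and speed (a quick sketch):
count(*) FILTER (WHERE col_a)           -- standard-SQL aggregate filter, typically fastest in Postgres
count(col_a OR NULL)                    -- shortest: count() ignores NULLs, and (false OR NULL) is NULL
sum(CASE WHEN col_a THEN 1 ELSE 0 END)  -- portable to databases without FILTER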
If col_a and col_b have only a few true values, this is typically (much) faster - basically what you had already:
SELECT b.*, COALESCE(aa.ct, 0) + COALESCE(ab.ct, 0) AS total
FROM "TABLE_B" b
LEFT JOIN (
    SELECT user_id AS id, count(*) AS ct
    FROM "TABLE_A"
    WHERE col_a
    GROUP BY 1
) aa USING (id)
LEFT JOIN (
    SELECT user_id AS id, count(*) AS ct
    FROM "TABLE_A"
    WHERE col_b
    GROUP BY 1
) ab USING (id);
Especially with (small in this case!) partial indexes like:
CREATE INDEX a_true_idx on "TABLE_A" (user_id) WHERE col_a;
CREATE INDEX b_true_idx on "TABLE_A" (user_id) WHERE col_b;
Aside: use legal, lower-case unquoted names in Postgres to make your life simpler.
Are PostgreSQL column names case-sensitive?
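To illustrate (a minimal sketch of Postgres's identifier folding):
CREATE TABLE "TABLE_A" (id int);  -- quoted: case-sensitive, must be written as "TABLE_A" from now on
CREATE TABLE table_a (id int);    -- unquoted: folded to lower case
SELECT * FROM TABLE_A;            -- unquoted names fold down, so this reads table_a, NOT "TABLE_A"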
select user_id, name
     , count(case when col_a = true then 1 end)
     + count(case when col_b = true then 1 end) as total
from tableA a
join tableB b on a.user_id = b.id
group by user_id, name
This double counts Jim. If that is not intended, since his user_id only shows up in two rows and not three, you can do the following:
with cte_A as (
    select id, user_id
    from A
    where col_a = true
    union -- use UNION ALL if you do want to double count Jim
    select id, user_id
    from A
    where col_b = true
)
select B.id as user_id, count(*) as total, B.name
from cte_A
join B
  on cte_A.user_id = B.id
group by B.id, B.name
If you actually do want to double count, use UNION ALL instead of UNION.
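With the sample data, the UNION version gives Jim a total of 2 (his user_id appears in rows 2 and 3 of TABLE A), while UNION ALL gives 3, because row 3 is counted once for col_a and once for col_b.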

How to compare counts of two different columns for multiple rows and find only mismatches

I have two tables:
Table A
---------
id | num_of_records
---|---------------
10 | 2
20 | 9
30 | 1
40 | 3
Table B
---------
id | details
---| --------
10 | somedetail 2
10 | somedetail 3
20 | foobar
30 | random
40 | random 2
In the above I want to get records only where the num_of_records from TableA does not match the count(*) from TableB, so the result based on the above would be:
id | difference
---| --------
20 | 8
40 | 2
My real TableB has around 20M records and TableA has around 4k records.
I've written something like this, which doesn't work:
select id from tableA join on tableB where tableA.num_of_records <> (select count(*) from tableB where id = id);
You can use a full join:
select id, coalesce(a.num_of_records, 0) - coalesce(b.cnt, 0)
from a full join
     (select id, count(*) as cnt
      from b
      group by id
     ) b
     using (id)
where a.num_of_records is distinct from b.cnt;
This uses full join to handle the case when either table is missing an id.
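The is distinct from predicate matters here: a plain <> comparison yields NULL when one side is NULL (so the row is filtered out by WHERE), which is exactly what happens for an id that is missing from one of the tables. A minimal illustration:
select null <> 1;                 -- NULL: the row would be silently dropped by WHERE
select null is distinct from 1;   -- true: the mismatch is reported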

pulling data from max field

I have a table structure with columns similar to the following:
ID | line | value
1  | 1    | 10
1  | 2    | 5
2  | 1    | 6
3  | 1    | 7
3  | 2    | 4
Ideally, I'd like to pull the following:
ID | value
1  | 5
2  | 6
3  | 4
One solution would be to do something like the following:
select a.ID, a.value
from myTable a
inner join (select id, max(line) as line from myTable group by id) b
    on a.id = b.id and a.line = b.line
Given the size of the table and that this is just a part of a larger pull, I'd like to see if there's a more elegant / simpler way of pulling this directly.
This is a task for OLAP functions:
select *
from myTable a
qualify
    rank() -- assign a rank within each id
    over (partition by id
          order by line desc) = 1
This might return multiple rows per id if they share the same max line. If you want to return only one of them, add another column to the ORDER BY to make it unique, or switch to row_number to get a single (arbitrary) row.
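QUALIFY is Teradata/Snowflake-style syntax. On databases without it (e.g. PostgreSQL or SQL Server), the same idea needs a derived table; a sketch:
select ID, value
from (
    select a.*,
           row_number() over (partition by ID order by line desc) as rn
    from myTable a
) t
where rn = 1;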

Comparing different columns in SQL for each row

After some transformations I have a result from a cross join (of tables a and b) that I want to do some analysis on. The table for this looks like this:
+-----+------+------+------+------+-----+------+------+------+------+
| id | 10_1 | 10_2 | 11_1 | 11_2 | id | 10_1 | 10_2 | 11_1 | 11_2 |
+-----+------+------+------+------+-----+------+------+------+------+
| 111 | 1 | 0 | 1 | 0 | 222 | 1 | 0 | 1 | 0 |
| 111 | 1 | 0 | 1 | 0 | 333 | 0 | 0 | 0 | 0 |
| 111 | 1 | 0 | 1 | 0 | 444 | 1 | 0 | 1 | 1 |
| 112 | 0 | 1 | 1 | 0 | 222 | 1 | 0 | 1 | 0 |
+-----+------+------+------+------+-----+------+------+------+------+
The ids in the first column are different from the ids in the sixth column.
In a row are always two different IDs that are matched with each other. The other columns always have either 0 or 1 as a value.
I am now trying to find out how many values (meaning both have "1" in 10_1, 10_2, etc.) two IDs have in common on average, but I don't really know how to do so.
I was trying something like this as a start:
SELECT SUM(CASE WHEN a.10_1 = 1 AND b.10_1 = 1 THEN 1 END)
But this would obviously only count how often two ids have 10_1 in common. I could write something like this covering the other columns:
SELECT SUM(CASE WHEN (a.10_1 = 1 AND b.10_1 = 1)
    OR (a.10_2 = 1 AND b.10_2 = 1) OR [...] THEN 1 END)
to count in general how often two IDs have at least one thing in common, but this would of course also count pairs that have two or more things in common. Plus, I would also like to know how often two IDs have two things, three things, etc. in common.
One "problem" in my case is that I have ~30 columns I want to look at, so I can hardly write down every possible combination for each case.
Does anyone know how I can approach my problem in a better way?
Thanks in advance.
Edit:
A possible result could look like this:
+-----------+---------+
| in_common | count |
+-----------+---------+
| 0 | 100 |
| 1 | 500 |
| 2 | 1500 |
| 3 | 5000 |
| 4 | 3000 |
+-----------+---------+
With the codes as column names, you're going to have to write some code that explicitly references each column name. To keep that to a minimum, you could write those references in a single union statement that normalizes the data, such as:
select id, '10_1' as code from t where "10_1" = 1
union
select id, '10_2' as code from t where "10_2" = 1
union
select id, '11_1' as code from t where "11_1" = 1
union
select id, '11_2' as code from t where "11_2" = 1;
Here t stands for your source table. This needs to be modified to include whatever additional columns you need to link up different IDs. For the purpose of this illustration, I assume the following data model:
create table p (
    id integer not null primary key,
    sex character(1) not null,
    age integer not null
);
create table t1 (
    id integer not null,
    code character varying(4) not null,
    constraint pk_t1 primary key (id, code)
);
Though your data evidently does not currently resemble this structure, normalizing your data into a form like this would allow you to apply the following solution to summarize your data in the desired form.
select
    in_common,
    count(*) as count
from (
    select
        count(*) as in_common
    from (
        select
            a.id as a_id, a.code,
            b.id as b_id, b.code
        from
            (select p.*, t1.code
             from p left join t1 on p.id = t1.id
            ) as a
        inner join
            (select p.*, t1.code
             from p left join t1 on p.id = t1.id
            ) as b on b.sex <> a.sex and b.age between a.age - 10 and a.age + 10
        where
            a.id < b.id
            and a.code = b.code
    ) as c
    group by
        a_id, b_id
) as summ
group by
    in_common;
This proposed solution first takes one step back from the cross-join table, as the identical column names are super annoying. Instead, we take the ids from the two tables and put them in a temporary table. The following query gets the result wanted in the question. It assumes table_a and table_b from the question are the same table, called tbl, but this assumption is not needed; tbl can be replaced by table_a and table_b in the two sub-SELECT queries. It looks complicated and uses a JSON trick to flatten the columns, but it works here:
WITH idtable AS (
    SELECT a.id AS id_1, b.id AS id_2 FROM
    -- put cross join of table a and table b here
)
SELECT in_common,
       count(*)
FROM (
    SELECT idtable.*,
           sum(CASE
                   WHEN meltedR.value::text = meltedL.value::text THEN 1
                   ELSE 0
               END) AS in_common
    FROM idtable
    JOIN (
        SELECT tbl.id, b.*
        FROM tbl,                          -- change here to table_a
             json_each(row_to_json(tbl)) b -- and here too
        WHERE key <> 'id'
    ) meltedL ON (idtable.id_1 = meltedL.id)
    JOIN (
        SELECT tbl.id, b.*
        FROM tbl,                          -- change here to table_b
             json_each(row_to_json(tbl)) b -- and here too
        WHERE key <> 'id'
    ) meltedR ON (idtable.id_2 = meltedR.id
                  AND meltedL.key = meltedR.key)
    GROUP BY idtable.id_1, idtable.id_2
) tt
GROUP BY in_common
ORDER BY in_common;
The output here looks like this:
in_common | count
-----------+-------
2 | 2
3 | 1
4 | 1
(3 rows)

Creating sql view where the select is dependent on the value of two columns

I want to create a view in my database, based on these three tables:
I would like to select the rows in table3 that have the highest value in Weight, among rows that have the same value in Count.
Then I want them grouped by Category_ID and ordered by Date, so that if two rows in table3 are identical, I want the newest.
Let me give you an example:
Table1
ID | Date | UserId
1 | 2015-01-01 | 1
2 | 2015-01-02 | 1
Table2
ID | table1_ID | Category_ID
1 | 1 | 1
2 | 2 | 1
Table3
ID | table2_ID | Count | Weight
1 | 1 | 5 | 10
2 | 1 | 5 | 20 <-- count is 5 and weight is highest
3 | 1 | 3 | 40
4 | 2 | 5 | 10
5 | 2 | 3 | 40 <-- newest of the two equal rows
Then the result should be row 2 and 5 from table 3.
PS I'm doing this in mssql.
PPS I'm sorry if the title is not appropriate, but I did not know how to formulate a good one.
SELECT *
FROM (
    SELECT
        t3.*
        , RANK() OVER (PARTITION BY [Count] ORDER BY [Weight] DESC, Date DESC) AS highest
    FROM TABLE3 t3
    INNER JOIN TABLE2 t2 ON t2.Id = t3.Table2_Id
    INNER JOIN TABLE1 t1 ON t1.Id = t2.Table1_Id
) t
WHERE t.highest = 1
This will group by the Count (which must be the same). Then it will determine which row has the highest weight. If two or more of them have the same highest weight, it takes the one with the most recent date first.
You can use the RANK() analytic function here to give the rows a rank, and then choose the first rank in each group.
Something like:
select *
from (
    select
        ID, table2_ID, "Count", Weight,
        RANK() OVER (PARTITION BY "Count" ORDER BY Weight DESC) as Highest
    from table3
)
where Highest = 1;
This is the syntax for Oracle; if you are not using Oracle, look up the corresponding syntax for your database, which should be almost the same.
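Since the question mentions mssql: SQL Server additionally requires an alias on the derived table and conventionally brackets reserved words, so the same idea would look roughly like this (a sketch):
select *
from (
    select
        ID, table2_ID, [Count], [Weight],
        RANK() OVER (PARTITION BY [Count] ORDER BY [Weight] DESC) as Highest
    from table3
) t
where t.Highest = 1;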