SQL count(distinct) across both tables

I have 2 tables. Let's say Table A and Table B. Table A has a column called "name". Table B also has a column "name". I want to find out the count(distinct name). Name should take values from both the columns.
For example:
Table A
name
A
B
C
Table B
name
A
B
D
Output should be 4.

The general approach is to first combine the data the way you want using a subquery, and then dedupe (or do whatever your second step is).
For example,
WITH COMBINED AS (
SELECT
name
FROM
TableA
UNION ALL
SELECT
name
FROM
TableB
)
SELECT
DISTINCT name
FROM
COMBINED
In your situation, the 2nd step can be accomplished by changing UNION ALL to a UNION. This will dedupe the values automatically. You won't even need a subquery or a 2nd step. But I wanted to teach you the concept because it comes up often.
SELECT name FROM TableA
UNION
SELECT name FROM TableB

A UNION in the CTE will remove all duplicates,
so a COUNT(*) will suffice:
WITH CTE AS (
SELECT name FROM TableA
UNION
SELECT name FROM TableB
)
SELECT COUNT(*) FROM CTE
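To check the two approaches above against the sample data from the question, here is a minimal sketch that runs both queries in an in-memory SQLite database through Python's sqlite3 module. SQLite is an assumption made for convenience; the question does not say which database is in use, and the two forms agree here because the sample data contains no NULLs.

```python
import sqlite3

# In-memory database holding the sample data from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TableA (name TEXT);
    CREATE TABLE TableB (name TEXT);
    INSERT INTO TableA (name) VALUES ('A'), ('B'), ('C');
    INSERT INTO TableB (name) VALUES ('A'), ('B'), ('D');
""")

# UNION removes duplicates across both tables, so COUNT(*) counts distinct names.
count_union = conn.execute("""
    WITH combined AS (
        SELECT name FROM TableA
        UNION
        SELECT name FROM TableB
    )
    SELECT COUNT(*) FROM combined
""").fetchone()[0]

# Two-step form: combine with UNION ALL, then dedupe explicitly.
count_distinct = conn.execute("""
    SELECT COUNT(DISTINCT name)
    FROM (
        SELECT name FROM TableA
        UNION ALL
        SELECT name FROM TableB
    ) AS combined
""").fetchone()[0]

print(count_union, count_distinct)  # 4 4
```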

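This query should do it. Note that summing the per-table distinct counts would not work, because names present in both tables would be counted twice; the names have to be combined first and counted afterwards:
SELECT COUNT(DISTINCT name) AS total_names
FROM (
SELECT name FROM TableA
UNION ALL
SELECT name FROM TableB
) t;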
Note: Tested with sql server

Yet another option:
select hll_count.merge(hll_sketch) names
from (
select hll_count.init(name) hll_sketch from tableA
union all
select hll_count.init(name) from tableB
)
HLL++ functions are approximate aggregate functions. Approximate aggregation typically requires less memory than exact aggregation functions, like COUNT(DISTINCT), but also introduces statistical error. This makes HLL++ functions appropriate for large data streams for which linear memory usage is impractical, as well as for data that is already approximate.
See more about benefits of using HyperLogLog++ functions

Related

Is it possible to UNION distinct rows but disregard one column to determine uniqueness?

select d.id, d.registration_number
from DOCUMENTS d
union
select dd.id, dd.registration_number
from DIFFERENT_DOCUMENTS dd
Would it be possible to union those results based solely on the uniqueness of the registration_number, disregarding the id of the documents?
Or, is it possible to achieve the same result in a different way?
Just to add: actually I'm unioning 5 queries, each ~20 lines long, with 4 columns that should be disregarded in determining uniqueness.
You basically need to wrap the unioned data in an outer query to get only the rows you want. Many databases require an alias on the derived table, so one is included here:
SELECT min(id) AS id, registration_number
FROM (SELECT id, registration_number
FROM documents
UNION ALL
SELECT id, registration_number
FROM different_documents) u
GROUP BY registration_number
Union will check the combination of all the columns for uniqueness. You could, however, use union all (that does not remove duplicates) and then apply the logic yourself using the row_number window function:
SELECT id, registration_number
FROM (SELECT id, registration_number,
ROW_NUMBER() OVER (PARTITION BY registration_number ORDER BY id) AS rn
FROM (SELECT id, registration_number
FROM documents
UNION ALL
SELECT id, registration_number
FROM different_documents) u
) r
WHERE rn = 1
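As a rough illustration of the ROW_NUMBER approach, the sketch below runs it in an in-memory SQLite database from Python. The sample rows and the extra title column are invented for the example, and window functions assume SQLite 3.25 or newer. Because the filter keeps whole rows, the extra column always comes from the same source row as the chosen id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data: 'R-200' appears in both tables with different ids.
conn.executescript("""
    CREATE TABLE documents (id INTEGER, registration_number TEXT, title TEXT);
    CREATE TABLE different_documents (id INTEGER, registration_number TEXT, title TEXT);
    INSERT INTO documents VALUES (1, 'R-100', 'first'), (2, 'R-200', 'second');
    INSERT INTO different_documents VALUES (7, 'R-200', 'copy'), (8, 'R-300', 'third');
""")

# Number the rows within each registration_number and keep only the first one.
# Requires window function support (SQLite >= 3.25).
deduped = conn.execute("""
    SELECT id, registration_number, title
    FROM (SELECT id, registration_number, title,
                 ROW_NUMBER() OVER (PARTITION BY registration_number
                                    ORDER BY id) AS rn
          FROM (SELECT id, registration_number, title FROM documents
                UNION ALL
                SELECT id, registration_number, title FROM different_documents) AS u
         ) AS r
    WHERE rn = 1
    ORDER BY registration_number
""").fetchall()

print(deduped)
# [(1, 'R-100', 'first'), (2, 'R-200', 'second'), (8, 'R-300', 'third')]
```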
Since the other answers are already correct, may I ask why you need to retrieve other columns in that query, since the primary purpose appears to be gathering unique registration numbers?
Wouldn't it be simpler to first gather the unique registration numbers and then retrieve the other info?
Or, in your actual query, first gather the info without the columns that should be disregarded, and then gather the info in those columns if need be?
For example, by making a view with
SELECT d.registration_number
FROM DOCUMENTS d
UNION
SELECT dd.registration_number
FROM DIFFERENT_DOCUMENTS dd
and then gathering information using that view and JOINs?
Assuming registration_number is unique in each table, you can use not exists:
select d.id, d.registration_number
from DOCUMENTS d
union all
select dd.id, dd.registration_number
from DIFFERENT_DOCUMENTS dd
where not exists (select 1
from DOCUMENTS d
where dd.registration_number = d.registration_number
);
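A small sketch of the NOT EXISTS variant, again using an in-memory SQLite database from Python with invented sample rows. It relies on the stated assumption that registration_number is unique within each table; rows from DOCUMENTS always win when a number appears in both.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data: 'R-200' exists in both tables.
conn.executescript("""
    CREATE TABLE DOCUMENTS (id INTEGER, registration_number TEXT);
    CREATE TABLE DIFFERENT_DOCUMENTS (id INTEGER, registration_number TEXT);
    INSERT INTO DOCUMENTS VALUES (1, 'R-100'), (2, 'R-200');
    INSERT INTO DIFFERENT_DOCUMENTS VALUES (7, 'R-200'), (8, 'R-300');
""")

# Take every row from DOCUMENTS, plus only those rows from
# DIFFERENT_DOCUMENTS whose registration_number is not already in DOCUMENTS.
rows = conn.execute("""
    SELECT d.id, d.registration_number
    FROM DOCUMENTS d
    UNION ALL
    SELECT dd.id, dd.registration_number
    FROM DIFFERENT_DOCUMENTS dd
    WHERE NOT EXISTS (SELECT 1
                      FROM DOCUMENTS d
                      WHERE dd.registration_number = d.registration_number)
""").fetchall()

# UNION ALL gives no ordering guarantee, so sort in Python before printing.
result = sorted(rows)
print(result)  # [(1, 'R-100'), (2, 'R-200'), (8, 'R-300')]
```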

Generic comparison method for two tables in BigQuery if tables contain STRUCT type

I'm looking for a generic method to compare two tables in BigQuery, even if they have columns that are STRUCT type.
It should work for any pair of tables, and ideally wouldn't involve writing a query that depends on the actual columns of the tables. All I really need to know is whether the tables are equal or not, but it would be a bonus if it could show me the difference between the rows that aren't the same.
So something like (in pseudo code)
sizeOf( TABLE A EXCEPT TABLE B ) == 0
or
Hash(TABLE A) == HASH(TABLE B)
Would be fine.
I tried using this:
( SELECT * FROM table1
EXCEPT DISTINCT
SELECT * FROM table2)
UNION ALL
( SELECT * FROM table2
EXCEPT DISTINCT
SELECT * FROM table1)
But I got this error.
Column 1 in EXCEPT ALL has type that does not support set operation comparisons: STRUCT at [3:5]
Does anyone know of a way to get around this?
EDIT
Should have mentioned before, but I need this to work regardless of the ordering of the rows of the table.
I think you are looking for something like the below to start with:
#standardSQL
SELECT TO_JSON_STRING(a) FROM `project.dataset.tableA` a
EXCEPT DISTINCT
SELECT TO_JSON_STRING(b) FROM `project.dataset.tableB` b
Or, a more complete example that shows the differences. Note: the output can be very large for tables that differ a lot.
#standardSQL
SELECT 'a' table, * FROM (
SELECT TO_JSON_STRING(a) record FROM `project.dataset.tableA` a
EXCEPT DISTINCT
SELECT TO_JSON_STRING(b) FROM `project.dataset.tableB` b
)
UNION ALL
SELECT 'b', * FROM (
SELECT TO_JSON_STRING(b) FROM `project.dataset.tableB` b
EXCEPT DISTINCT
SELECT TO_JSON_STRING(a) FROM `project.dataset.tableA` a
)
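The idea behind the TO_JSON_STRING queries is to turn each row, nested STRUCTs included, into a single comparable string and then take set differences in both directions. The sketch below shows the same idea outside BigQuery, in plain Python on lists of dicts; row_key and table_diff are hypothetical helper names and the sample rows are invented. Like EXCEPT DISTINCT, it ignores row order and duplicate rows.

```python
import json

def row_key(row):
    # Canonical string for one row. sort_keys makes the order of fields
    # inside nested dicts (the stand-in for STRUCTs) irrelevant.
    return json.dumps(row, sort_keys=True)

def table_diff(table_a, table_b):
    """Return (only_in_a, only_in_b) as sorted lists of serialized rows."""
    keys_a = {row_key(r) for r in table_a}
    keys_b = {row_key(r) for r in table_b}
    return sorted(keys_a - keys_b), sorted(keys_b - keys_a)

# Invented sample tables: same rows in a different order, except that
# the nested "tags" value of row 1 differs.
table_a = [
    {"id": 1, "info": {"city": "Oslo", "tags": ["x", "y"]}},
    {"id": 2, "info": {"city": "Rome", "tags": []}},
]
table_b = [
    {"id": 2, "info": {"tags": [], "city": "Rome"}},
    {"id": 1, "info": {"city": "Oslo", "tags": ["x", "z"]}},
]

only_in_a, only_in_b = table_diff(table_a, table_b)
tables_equal = not only_in_a and not only_in_b
print(tables_equal)  # False
print(only_in_a)
print(only_in_b)
```

The tables are equal exactly when both returned lists are empty, which corresponds to the combined query above returning no rows.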

SQL combine two query results

I can't use a Union because it's not the result I want, and I can't use join because I haven't any common column. I have tried many different SQL query structures and nothing works as I want.
I need help to achieve what I believe is a really simple SQL query. What I am doing now is
select a, b
from (select top 4 a from element_type order by c) as Y,
(SELECT * FROM (VALUES (NULL), (1), (2), (3)) AS X(b)) as Z
The first is a part of a table and the second is a hand created select that gives results like this:
select a; --Give--> a,b,c,d (1 column)
select b; --Give--> 1,2,3,4 (1 column)
I need a query based on the two first that give me (2 column) :
a,1
b,2
c,3
d,4
How can I do this? With UNION, JOIN or anything else? Or maybe it can't be done.
All I can get for now is this:
a,1
a,2
a,3
a,4
b,1
b,2
...
If you want to join two tables together purely on the order the rows appear, then I hope your database supports analytic (window) functions:
SELECT * FROM
(SELECT t.*, ROW_NUMBER() OVER(ORDER BY x) as rown FROM table1 t) t1
INNER JOIN
(SELECT t.*, ROW_NUMBER() OVER(ORDER BY x) as rown FROM table2 t) t2
ON t1.rown = t2.rown
Essentially we invent something to join them on by numbering the rows. If one of your tables already contains incrementing integers from 1, you don't need ROW_NUMBER() OVER() on that table, because it already has suitable data to join to; you just invent a fake column of incrementing numbers in the other table and then join them together.
Actually, even if it doesn't support analytics, there are ugly ways of doing row numbering, such as joining the table back to itself using id < id and COUNT(*) .. GROUP BY id to number the rows. I hate doing it, but if your DB doesn't support ROW_NUMBER I'll post an example.. :/
Bear in mind, of course, that RDBMS have R in the name for a reason - related data is.. well.. related. They don't do so well when data is unrelated, so if your hope is to join the "chalks" table to the "cheese" table even though the two are completely unrelated, you're finding out now why it's hard work! :)
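The row-numbering join described above can be sketched as follows, using an in-memory SQLite database from Python. The table names letters and numbers are made up to mirror the question's two single-column results, and window functions assume SQLite 3.25 or newer. The second query is a variant of the self-join fallback mentioned above (it uses <= rather than <, so numbering starts at 1) and only works when the ordering column has no duplicates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical tables standing in for the two unrelated result sets.
conn.executescript("""
    CREATE TABLE letters (a TEXT);
    CREATE TABLE numbers (b INTEGER);
    INSERT INTO letters VALUES ('a'), ('b'), ('c'), ('d');
    INSERT INTO numbers VALUES (1), (2), (3), (4);
""")

# Number the rows on each side, then join on the invented row number.
pairs = conn.execute("""
    SELECT t1.a, t2.b
    FROM (SELECT a, ROW_NUMBER() OVER (ORDER BY a) AS rown FROM letters) AS t1
    INNER JOIN
         (SELECT b, ROW_NUMBER() OVER (ORDER BY b) AS rown FROM numbers) AS t2
      ON t1.rown = t2.rown
    ORDER BY t1.rown
""").fetchall()
print(pairs)  # [('a', 1), ('b', 2), ('c', 3), ('d', 4)]

# Fallback without window functions: count how many rows sort at or
# before each row. Assumes the values in column a are unique.
numbered = conn.execute("""
    SELECT l.a, COUNT(*) AS rown
    FROM letters AS l
    JOIN letters AS l2 ON l2.a <= l.a
    GROUP BY l.a
    ORDER BY rown
""").fetchall()
print(numbered)  # [('a', 1), ('b', 2), ('c', 3), ('d', 4)]
```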
Try using row_number. I've created something that might help you. See below:
-- A table variable needs the @ prefix (# is for temp tables made with CREATE TABLE)
declare @tableChar table(letter varchar(1))
insert into @tableChar(letter)
select 'a';
insert into @tableChar(letter)
select 'b';
insert into @tableChar(letter)
select 'c';
insert into @tableChar(letter)
select 'd';
select letter, ROW_NUMBER() over(order by letter) as rn from @tableChar
You can use row_number() to achieve this:
select a,row_number() over(order by a) as b from element_type;
As you are not taking the second part from another table, you do not need a join. But if you are doing this on different tables, then you can use row_number() to create a key for both tables and join based on those keys.
Hope it will help.

Get row count including column values in sql server

I need to get the row count of a query, and also get the query's columns in one single query. The count should be a part of the result's columns (It should be the same for all rows, since it's the total).
for example, if I do this:
select count(1) from table
I can have the total number of rows.
If I do this:
select a,b,c from table
I'll get the column's values for the query.
What I need is to get the count and the column values in one query, in an efficient way.
For example:
select Count(1), a,b,c from table
with no group by, since I want the total.
The only way I've found is to do a temp table (using variables), insert the query's result, then count, then returning the join of both. But if the result gets thousands of records, that wouldn't be very efficient.
Any ideas?
@Jim H is almost right, but chooses the wrong ranking function:
create table #T (ID int)
insert into #T (ID)
select 1 union all
select 2 union all
select 3
select ID,COUNT(*) OVER (PARTITION BY 1) as RowCnt from #T
drop table #T
Results:
ID RowCnt
1 3
2 3
3 3
Partitioning by a constant makes it count over the whole resultset.
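A quick way to see the windowed count in action is the sketch below, which uses an in-memory SQLite database from Python (an assumption for the sake of a runnable example; the question is about SQL Server). It uses the empty OVER () clause, which also treats the whole result set as one partition and is accepted by SQL Server as well. Window functions assume SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (id INTEGER);
    INSERT INTO t (id) VALUES (1), (2), (3);
""")

# COUNT(*) OVER () attaches the total row count to every row
# without collapsing the rows the way GROUP BY would.
rows = conn.execute("""
    SELECT id, COUNT(*) OVER () AS row_cnt
    FROM t
    ORDER BY id
""").fetchall()

print(rows)  # [(1, 3), (2, 3), (3, 3)]
```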
Using CROSS JOIN:
SELECT a.*, b.numRows
FROM YOUR_TABLE a
CROSS JOIN (SELECT COUNT(*) AS numRows
FROM YOUR_TABLE) b
Look at the Ranking functions of SQL Server.
SELECT ROW_NUMBER() OVER (ORDER BY a) AS 'RowNumber', a, b, c
FROM table;
You could do it like this:
SELECT x.total, a, b, c
FROM
table
JOIN (SELECT total = COUNT(*) FROM table) AS x ON 1=1
which will return the total number of records in the first column, followed by fields a,b & c
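The join-based answers can be tried out with the sketch below, which runs against an in-memory SQLite database from Python; the table name your_table and its rows are placeholders. One thing worth noticing with this approach is that the count comes from a separate subquery, so any filter on the main query has to be repeated there for the total to match.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Placeholder table and data for illustration only.
conn.executescript("""
    CREATE TABLE your_table (a TEXT, b INTEGER);
    INSERT INTO your_table VALUES ('x', 1), ('y', 2), ('z', 3);
""")

# The one-row count subquery is paired with every row of the table.
rows = conn.execute("""
    SELECT t.a, t.b, c.num_rows
    FROM your_table AS t
    CROSS JOIN (SELECT COUNT(*) AS num_rows FROM your_table) AS c
    ORDER BY t.a
""").fetchall()
print(rows)  # [('x', 1, 3), ('y', 2, 3), ('z', 3, 3)]

# With a filter, the same condition must appear in both places.
filtered = conn.execute("""
    SELECT t.a, c.num_rows
    FROM your_table AS t
    CROSS JOIN (SELECT COUNT(*) AS num_rows
                FROM your_table
                WHERE b >= 2) AS c
    WHERE t.b >= 2
    ORDER BY t.a
""").fetchall()
print(filtered)  # [('y', 2), ('z', 2)]
```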

combine SELECTS in ONE VIEW DISPLAY

I need to know of a way to combine multiple SELECT statements in one VIEW? I tried the UNION ALL, but it fails since I am using unique columns to aggregate the GRAND TOTAL.
I am a student this is part of a group project.
I have one table with 4 columns: account, description, short_description, and balance. The COA (chart of accounts) is an excel spreadsheet that is imported.
CREATE VIEW [account_balance_sums]
AS
SELECT SUM(balance) AS total,
SUBSTRING (Account,0,2) AS account_group
FROM COA
GROUP BY account_group
GO
SELECT * FROM [account_balance_sums]
SELECT SUM(total) AS Grand_total
FROM [account_balance_sums]
Assuming that you are trying to create a view that gives account group and total balance with a single extra row for the total across all accounts then this view should help:
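CREATE VIEW [account_balance_sums] AS
SELECT SUM(balance) AS total, SUBSTRING(Account, 0, 2) AS account_group
FROM COA
-- SQL Server cannot GROUP BY a column alias, so repeat the expression
GROUP BY SUBSTRING(Account, 0, 2)
UNION ALL
-- the grand total is computed from the base table
SELECT SUM(balance), 'Grand Total'
FROM COA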
By the way, taking a substring of the first characters of the account suggests that you have more than one piece of data in a single column. This indicates data that is not properly normalised, which you should probably address if you want top marks. See wikipedia on normal form
In a UNION'd statement:
There must be the same number of columns in each SELECT statement
The data types must match at each position in the SELECT statement
Use:
SELECT *
FROM [account_balance_sums]
UNION ALL
SELECT SUM(total),
NULL AS account_group
FROM [account_balance_sums]
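Here is a minimal sketch of the per-group totals plus a grand total row, run in an in-memory SQLite database from Python. The account numbers and balances are invented, the balance column is assumed to be an integer for simplicity, and SQLite's SUBSTR(account, 1, 1) stands in for the SQL Server SUBSTRING call in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Invented chart-of-accounts rows; the first digit is the account group.
conn.executescript("""
    CREATE TABLE COA (account TEXT, description TEXT,
                      short_description TEXT, balance INTEGER);
    INSERT INTO COA VALUES
        ('1000', 'Cash', 'Cash', 500),
        ('1200', 'Receivables', 'AR', 250),
        ('2000', 'Payables', 'AP', -300);
    CREATE VIEW account_balance_sums AS
        SELECT SUM(balance) AS total, SUBSTR(account, 1, 1) AS account_group
        FROM COA
        GROUP BY SUBSTR(account, 1, 1);
""")

# Both SELECTs return two columns of compatible types, so UNION ALL is valid.
# NULL marks the grand total row.
rows = conn.execute("""
    SELECT total, account_group FROM account_balance_sums
    UNION ALL
    SELECT SUM(total), NULL FROM account_balance_sums
""").fetchall()

for total, account_group in rows:
    print(account_group or "Grand total", total)
```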
UNION ALL should work. The basic structure is like this:
select a,b,c,d
from t1
union all
select a,b,c,e
from t2
so long as d and e are the same data type.
To do the sum, you then wrap this in an aggregation layer, using this structure as an inline view (among other methods).
Something like:
select sum( d )
from (
select a,b,c,d
from t1
union all
select a,b,c,e
from t2
) u
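The sketch below tries the aggregation-over-inline-view structure in an in-memory SQLite database from Python, with invented rows for t1 and t2. It also shows why UNION ALL matters here: a plain UNION would collapse identical rows from the two tables before summing and understate the total.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Invented data: the single t2 row is identical to the first t1 row.
conn.executescript("""
    CREATE TABLE t1 (a INTEGER, b INTEGER, c INTEGER, d INTEGER);
    CREATE TABLE t2 (a INTEGER, b INTEGER, c INTEGER, e INTEGER);
    INSERT INTO t1 VALUES (1, 1, 1, 10), (2, 2, 2, 5);
    INSERT INTO t2 VALUES (1, 1, 1, 10);
""")

# Column names of the combined result come from the first SELECT,
# so the outer query refers to d even for rows that came from t2.e.
sum_union_all = conn.execute("""
    SELECT SUM(d)
    FROM (SELECT a, b, c, d FROM t1
          UNION ALL
          SELECT a, b, c, e FROM t2) AS u
""").fetchone()[0]

# Same query with UNION: the duplicate row is removed before summing.
sum_union = conn.execute("""
    SELECT SUM(d)
    FROM (SELECT a, b, c, d FROM t1
          UNION
          SELECT a, b, c, e FROM t2) AS u
""").fetchone()[0]

print(sum_union_all, sum_union)  # 25 15
```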