Counts for distinct values in different tables where columns are common to separate tables - sql

I have no idea if that title conveys what I want it to.
I have two tables containing phone records (one for each account) and I'd like to get call counts for the numbers that are common to each account. In other words:
Table 1
Number ...
8675309
8675309
8675310
8675310
8675312
Table 2
Number ...
8675309
8675309
8675309
8675310
8675311
Querying with something like:
SELECT DISTINCT table1.number, COUNT(table1.number), COUNT(table2.number) FROM table1, table2 WHERE table1.number = table2.number GROUP BY table1.number
would hopefully produce:
8675309|2|3
8675310|2|1
Instead, it currently produces something like:
8675309|6|6
8675310|2|2
It appears to be multiplying the count from each table. Presumably, this is because I'm not joining the tables the way I should for this goal. Or because by the time I ask for COUNT(table1.number) the tables have already been joined in some multiplicative way. Should I not be doing a JOIN and instead something that would read like: "where table2.number CONTAINS(table1.number)"?
Any tips?

One way is with subqueries:
SELECT t1.number, t1.table1Count, t2.table2Count
from (select number, count(*) table1Count
from table1
group by number) t1
inner join (select number, count(*) table2Count
from table2
group by number) t2
on t2.number = t1.number
This assumes that you only want to list numbers that appear in both tables. If you want to list all numbers that appear in one table and optionally the other, you'd use a left or right outer join; if you wanted all numbers that appeared in either or both tables, you'd use a full outer join.
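To see the difference on the sample data from the question, here is a minimal sketch using Python's built-in sqlite3 module (the in-memory database and the variable names are just for illustration; any RDBMS should behave the same way for these two queries):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (number INTEGER);
    CREATE TABLE table2 (number INTEGER);
    INSERT INTO table1 VALUES (8675309), (8675309), (8675310), (8675310), (8675312);
    INSERT INTO table2 VALUES (8675309), (8675309), (8675309), (8675310), (8675311);
""")

# The query from the question: rows are paired up first, then counted,
# so each count is the product of the per-table counts.
naive = conn.execute("""
    SELECT table1.number, COUNT(table1.number), COUNT(table2.number)
    FROM table1, table2
    WHERE table1.number = table2.number
    GROUP BY table1.number
    ORDER BY table1.number
""").fetchall()

# The subquery version: each table is counted on its own, then joined.
per_table = conn.execute("""
    SELECT t1.number, t1.table1Count, t2.table2Count
    FROM (SELECT number, COUNT(*) AS table1Count FROM table1 GROUP BY number) t1
    INNER JOIN (SELECT number, COUNT(*) AS table2Count FROM table2 GROUP BY number) t2
        ON t2.number = t1.number
    ORDER BY t1.number
""").fetchall()

print(naive)      # [(8675309, 6, 6), (8675310, 2, 2)]
print(per_table)  # [(8675309, 2, 3), (8675310, 2, 1)]
```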
Another and potentially more efficient way requires the presence of a single column that uniquely identifies each row in each table:
SELECT
t1.number
,count(distinct t1.PrimaryKeyValue) table1Count
,count(distinct t2.PrimaryKeyValue) table2Count
from table1 t1
inner join table2 t2
on t2.number = t1.number
group by t1.number
This makes the same assumptions as before, and can also be adjusted via outer joins.
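A rough sketch of the COUNT(DISTINCT ...) variant, again with sqlite3. The `id` column is an assumption standing in for PrimaryKeyValue; the question's tables may not have such a column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- id is a hypothetical surrogate key, one unique value per row
    CREATE TABLE table1 (id INTEGER PRIMARY KEY, number INTEGER);
    CREATE TABLE table2 (id INTEGER PRIMARY KEY, number INTEGER);
    INSERT INTO table1 (number) VALUES (8675309), (8675309), (8675310), (8675310), (8675312);
    INSERT INTO table2 (number) VALUES (8675309), (8675309), (8675309), (8675310), (8675311);
""")

QUERY = """
    SELECT t1.number,
           COUNT(DISTINCT t1.id) AS table1Count,
           COUNT(DISTINCT t2.id) AS table2Count
    FROM table1 t1
    {join} JOIN table2 t2 ON t2.number = t1.number
    GROUP BY t1.number
    ORDER BY t1.number
"""

# Only numbers present in both tables.
inner_rows = conn.execute(QUERY.format(join="INNER")).fetchall()
# All numbers from table1; numbers missing from table2 get a count of 0,
# because COUNT(DISTINCT ...) ignores the NULLs produced by the outer join.
left_rows = conn.execute(QUERY.format(join="LEFT")).fetchall()

print(inner_rows)  # [(8675309, 2, 3), (8675310, 2, 1)]
print(left_rows)   # [(8675309, 2, 3), (8675310, 2, 1), (8675312, 1, 0)]
```

The join still produces the multiplied rows internally; counting distinct key values just undoes the duplication afterwards.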

One way is to use a couple of derived tables to compute your counts separately and then join them to produce your final summary:
select t1.number, t1.count1, t2.count2
from (select number, count(number) as count1 from table1 group by number) as t1
join (select number, count(number) as count2 from table2 group by number) as t2
on t1.number = t2.number
There are probably other ways but that should work and it is the first thing that came to mind.
You're getting your "multiplicative" effect pretty much for the reasons you suspect. If you have this:
table1(id,x) table2(id,x)
------------ ------------
1, a 4, a
2, a 5, a
3, b 6, b
Then joining them on x will give you this:
1,a, 4,a
1,a, 5,a
2,a, 4,a
2,a, 5,a
...
Usually you could use a GROUP BY to sort out the duplicates but you can't do that because it would mess up your per-table counts.
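The row multiplication is easy to reproduce. A small sketch with sqlite3, using the toy tables above (the variable names are made up for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER, x TEXT);
    CREATE TABLE table2 (id INTEGER, x TEXT);
    INSERT INTO table1 VALUES (1, 'a'), (2, 'a'), (3, 'b');
    INSERT INTO table2 VALUES (4, 'a'), (5, 'a'), (6, 'b');
""")

# Every 'a' row in table1 pairs with every 'a' row in table2: 2 * 2 = 4 rows.
joined = conn.execute("""
    SELECT t1.id, t1.x, t2.id, t2.x
    FROM table1 t1
    JOIN table2 t2 ON t1.x = t2.x
    ORDER BY t1.id, t2.id
""").fetchall()

# Number of joined rows per value of x.
rows_per_x = conn.execute("""
    SELECT t1.x, COUNT(*)
    FROM table1 t1
    JOIN table2 t2 ON t1.x = t2.x
    GROUP BY t1.x
    ORDER BY t1.x
""").fetchall()

for row in joined:
    print(row)
print(rows_per_x)  # [('a', 4), ('b', 1)]
```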

Try this:
select tab1.number,tab1.num1,tab2.num2
from
(SELECT number, COUNT(number) as num1 from table1 group by number) as tab1
left join
(SELECT number, COUNT(number) as num2 from table2 group by number) as tab2
on tab1.number = tab2.number

Related

sql - ignore duplicates while joining

I have two tables.
Table1 is 1591 rows. Table2 is 270 rows.
I want to fetch a specific column from Table2 based on a join condition, while ignoring duplicates in Table2. That is, I want to join the tables but get only one value from Table2 even if the condition matches more than once. The result should be exactly 1591 rows.
I tried left, right, and inner joins, but the result always has more or fewer than 1591 rows.
Example
Table1
type,address,name
40,blabla,Adam
20,blablabla,Joe
Table2
type,currency
40,usd
40,gbp
40,omr
Joining on 'type'
Result
type,address,name,currency
40,blabla,Adam,usd
20,blablabla,Joe,null
Try this. It uses a left join so that rows of Table1 without a match in Table2 are kept:
select *
from
Table1 h
left join
(select type,currency,ROW_NUMBER()over (partition by type order by
currency) as rn
from
Table2
) sr on
sr.type=h.type
and sr.rn=1
Try this. It uses a correlated subquery to pick a single currency per type, so it should work on most RDBMSs:
select * from Table1 AS t
LEFT OUTER JOIN Table2 AS y ON t.type = y.type and y.currency = (SELECT MAX(z.currency) FROM Table2 AS z WHERE z.type = t.type)
If you want to control which currency is joined, consider altering Table2 by adding a new column active/non active and modifying accordingly the JOIN clause.
You can use outer apply if it's supported.
select a.type, a.address, a.name, b.currency
from Table1 a
outer apply (
select top 1 currency
from Table2
where Table2.type = a.type
) b
A typical way to do this uses a correlated subquery. This guarantees that all rows in the first table are kept. A scalar subquery generates an error if it returns more than one row, so the subquery below is limited to one row.
So:
select t1.*,
(select t2.currency
from table2 t2
where t2.type = t1.type
fetch first 1 row only
) as currency
from table1 t1;
You don't specify what database you are using, so this uses standard syntax for returning one row. Some databases use limit or top instead.
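As a quick check on the sample data from the question, here is a sketch in sqlite3, which uses LIMIT rather than FETCH FIRST. The ORDER BY inside the subquery is my addition so that the chosen currency is deterministic; without it, which row you get is up to the database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (type INTEGER, address TEXT, name TEXT);
    CREATE TABLE Table2 (type INTEGER, currency TEXT);
    INSERT INTO Table1 VALUES (40, 'blabla', 'Adam'), (20, 'blablabla', 'Joe');
    INSERT INTO Table2 VALUES (40, 'usd'), (40, 'gbp'), (40, 'omr');
""")

# A plain left join repeats Adam once per matching currency.
plain_join_count = conn.execute("""
    SELECT COUNT(*)
    FROM Table1 t1
    LEFT JOIN Table2 t2 ON t2.type = t1.type
""").fetchone()[0]

# The correlated subquery returns at most one value per Table1 row,
# so the result always has exactly as many rows as Table1.
rows = conn.execute("""
    SELECT t1.type, t1.address, t1.name,
           (SELECT t2.currency
            FROM Table2 t2
            WHERE t2.type = t1.type
            ORDER BY t2.currency
            LIMIT 1) AS currency
    FROM Table1 t1
    ORDER BY t1.name
""").fetchall()

print(plain_join_count)  # 4
print(rows)              # [(40, 'blabla', 'Adam', 'gbp'), (20, 'blablabla', 'Joe', None)]
```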

SQL Select statement (from 2 different tables)

Hey, I'm new to SQL and I'd just like to know if there's a way to write select statements with conditions that refer to other tables.
I want to select all name values that have a number that identifies that they have committed a crime. I only want to select a name once.
"SELECT distinct * FROM Table1 WHERE number LIKE table2.number "
Are you looking for IN?
SELECT t1.*
FROM Table1 t1
WHERE t1.number IN (SELECT t2.number FROM table2 t2);
Under most circumstances, the rows in a table should be unique. So, you don't need SELECT DISTINCT. The DISTINCT can add a considerable amount of overhead to such a query.
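A small sketch of why IN avoids the duplicates that a join would need DISTINCT to clean up. The data is made up; I am assuming table2 can list the same number more than once (e.g. one row per crime).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- hypothetical sample data
    CREATE TABLE Table1 (name TEXT, number INTEGER);
    CREATE TABLE table2 (number INTEGER);
    INSERT INTO Table1 VALUES ('Ann', 1), ('Bob', 2), ('Cy', 3);
    INSERT INTO table2 VALUES (1), (1), (3);
""")

# IN only tests for membership, so each Table1 row appears at most once.
in_rows = conn.execute("""
    SELECT t1.name
    FROM Table1 t1
    WHERE t1.number IN (SELECT t2.number FROM table2 t2)
    ORDER BY t1.name
""").fetchall()

# A join returns one row per matching pair, so Ann shows up twice.
join_rows = conn.execute("""
    SELECT t1.name
    FROM Table1 t1
    JOIN table2 t2 ON t2.number = t1.number
    ORDER BY t1.name
""").fetchall()

print(in_rows)    # [('Ann',), ('Cy',)]
print(join_rows)  # [('Ann',), ('Ann',), ('Cy',)]
```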
You can also use an INNER JOIN, like below:
select tbl1.Name from tableOne tbl1
inner join tableTwo tbl2 ON tbl1.commonKey = tbl2.commonKey
where tbl1.columnName = 'any value'

Left Join with Distinct Clause

Below is my insert query.
INSERT INTO /*+ APPEND*/ TEMP_CUSTPARAM(CUSTNO, RATING)
SELECT DISTINCT Q.CUSTNO, NVL(((NVL(P.RATING,0) * '10.0')/100),0) AS RATING
FROM TB_ACCOUNTS Q LEFT JOIN TB_CUSTPARAM P
ON P.TEXT_PARAM IN (SELECT DISTINCT PRDCD FROM TB_ACCOUNTS)
AND P.TABLENAME='TB_ACCOUNTS' AND P.COLUMNNAME='PRDCD';
In the previous version of the query the join condition was P.TEXT_PARAM=Q.PRDCD, but the insert into TEMP_CUSTPARAM failed due to a violation of the unique constraint on CUSTNO.
The insert query is taking ages to complete. Would like to know how to use distinct with LEFT JOIN statement.
Thanks.
Apply DISTINCT in a derived table and left join to that:
SELECT T1.Col1, T2.Col2 FROM Table1 T1
Left JOIN
(SELECT Distinct Id, Col2 FROM Table2
) T2 ON T2.Id = T1.Id
You are missing criteria to join TB_ACCOUNTS records with their related TB_ACCOUNTS/PRDCD TB_CUSTPARAM records and thus cross join them instead. I guess you want:
INSERT /*+ APPEND */ INTO TEMP_CUSTPARAM(CUSTNO, RATING)
SELECT DISTINCT
Q.CUSTNO,
NVL(P.RATING, 0) * 0.1 AS RATING
FROM TB_ACCOUNTS Q
LEFT JOIN TB_CUSTPARAM P ON P.TEXT_PARAM = Q.PRDCD
AND P.TABLENAME = 'TB_ACCOUNTS'
AND P.COLUMNNAME = 'PRDCD';
If the query is taking ages to complete, first check the execution plan. If you see a cartesian join on two non-trivial tables, the query should probably be revisited.
Then ask yourself what you expect from the query.
Do you expect one record per CUSTNO? Or can a customer have more than one rating?
One rating per customer would make sense from a business point of view. To get a unique customer list with ratings:
1) first get a UNIQUE CUSTNO. Note that this is in general not done with a DISTINCT clause; if there are several rows per customer, use a filter predicate, e.g. one selecting the most recent row.
2) then join to the rating table
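The two steps above could look roughly like this. This is only a sketch in sqlite3, not Oracle, and the OPENED column and the "most recent account wins" rule are assumptions of mine; the question does not say how to pick one row per customer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- OPENED is a hypothetical column used to pick one account per customer
    CREATE TABLE TB_ACCOUNTS (CUSTNO INTEGER, PRDCD TEXT, OPENED TEXT);
    CREATE TABLE TB_CUSTPARAM (TEXT_PARAM TEXT, RATING INTEGER,
                               TABLENAME TEXT, COLUMNNAME TEXT);
    INSERT INTO TB_ACCOUNTS VALUES
        (1, 'SAV',  '2020-01-01'),
        (1, 'LOAN', '2021-06-01'),
        (2, 'SAV',  '2019-03-01'),
        (3, 'FX',   '2022-02-02');
    INSERT INTO TB_CUSTPARAM VALUES
        ('SAV',  50, 'TB_ACCOUNTS', 'PRDCD'),
        ('LOAN', 80, 'TB_ACCOUNTS', 'PRDCD');
""")

rows = conn.execute("""
    -- Step 1: one row per customer (the most recently opened account).
    -- If two accounts share the latest date, both survive, so a real
    -- query would need a tie-breaker.
    WITH latest AS (
        SELECT a.CUSTNO, a.PRDCD
        FROM TB_ACCOUNTS a
        WHERE a.OPENED = (SELECT MAX(b.OPENED)
                          FROM TB_ACCOUNTS b
                          WHERE b.CUSTNO = a.CUSTNO)
    )
    -- Step 2: join the rating; customers without one get 0.
    SELECT l.CUSTNO, COALESCE(p.RATING, 0) * 0.1 AS RATING
    FROM latest l
    LEFT JOIN TB_CUSTPARAM p
           ON p.TEXT_PARAM = l.PRDCD
          AND p.TABLENAME = 'TB_ACCOUNTS'
          AND p.COLUMNNAME = 'PRDCD'
    ORDER BY l.CUSTNO
""").fetchall()

for custno, rating in rows:
    print(custno, rating)
```

Because step 1 already yields one row per CUSTNO, no DISTINCT is needed on the final result.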

Comparing two tables for equality in HIVE

I have two tables, table1 and table2. Each with the same columns:
key, c1, c2, c3
I want to check whether these tables are equal to each other (i.e. they have the same rows). So far I have these two queries (<> means "not equal" in Hive):
select count(*) from table1 t1
left outer join table2 t2
on t1.key=t2.key
where t2.key is null or t1.c1<>t2.c1 or t1.c2<>t2.c2 or t1.c3<>t2.c3
And
select count(*) from table1 t1
left outer join table2 t2
on t1.key=t2.key and t1.c1=t2.c1 and t1.c2=t2.c2 and t1.c3=t2.c3
where t2.key is null
So my idea is that, if a zero count is returned, the tables are the same. However, I'm getting a zero count for the first query, and a non-zero count for the second query. How exactly do they differ? If there is a better way to check this certainly let me know.
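The first one misses rows where t1.c1, t1.c2, t1.c3, t2.c1, t2.c2, or t2.c3 is null, because a <> comparison with NULL evaluates to NULL rather than true. For matched keys, that means a row is only counted when both sides of some comparison are non-null and different.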
The second one will find rows that exist in t1 but not in t2.
To also find rows that exist in t2 but not in t1 you can do a full outer join. The following SQL assumes that all columns are NOT NULL:
select count(*) from table1 t1
full outer join table2 t2
on t1.key=t2.key and t1.c1=t2.c1 and t1.c2=t2.c2 and t1.c3=t2.c3
where t1.key is null /* this condition matches rows that only exist in t2 */
or t2.key is null /* this condition matches rows that only exist in t1 */
If you want to check for duplicates and the tables have exactly the same structure and the tables do not have duplicates within them, then you can do:
select t.key, t.c1, t.c2, t.c3, count(*) as cnt
from ((select t1.*, 1 as which from table1 t1) union all
(select t2.*, 2 as which from table2 t2)
) t
group by t.key, t.c1, t.c2, t.c3
having cnt <> 2;
There are various ways that you can relax the conditions in the first paragraph, if necessary.
Note that this version also works when the columns have NULL values. These might be causing the problem with your data.
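To illustrate the NULL behaviour, here is a sketch in sqlite3 rather than Hive, with made-up sample rows. The two tables are identical, but one row has a NULL in c1. I use HAVING COUNT(*) instead of the alias to stay portable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (key INTEGER, c1 TEXT, c2 TEXT, c3 TEXT);
    CREATE TABLE table2 (key INTEGER, c1 TEXT, c2 TEXT, c3 TEXT);
    INSERT INTO table1 VALUES (1, 'a', 'b', 'c'), (2, NULL, 'x', 'y');
    INSERT INTO table2 VALUES (1, 'a', 'b', 'c'), (2, NULL, 'x', 'y');
""")

# Join on every column: NULL = NULL is not true, so row 2 finds no partner
# and is reported as missing even though the tables are identical.
anti_join_count = conn.execute("""
    SELECT COUNT(*)
    FROM table1 t1
    LEFT OUTER JOIN table2 t2
      ON t1.key = t2.key AND t1.c1 = t2.c1 AND t1.c2 = t2.c2 AND t1.c3 = t2.c3
    WHERE t2.key IS NULL
""").fetchone()[0]

# UNION ALL + GROUP BY: grouping treats NULLs as equal, so identical rows
# (including ones with NULLs) end up in a group of size 2.
DIFF_SQL = """
    SELECT t.key, t.c1, t.c2, t.c3, COUNT(*) AS cnt
    FROM (SELECT * FROM table1 UNION ALL SELECT * FROM table2) t
    GROUP BY t.key, t.c1, t.c2, t.c3
    HAVING COUNT(*) <> 2
    ORDER BY t.c3
"""
null_safe_diffs = conn.execute(DIFF_SQL).fetchall()

# Introduce a real difference and run the same check again.
conn.execute("UPDATE table2 SET c3 = 'z' WHERE key = 1")
diffs_after_change = conn.execute(DIFF_SQL).fetchall()

print(anti_join_count)     # 1
print(null_safe_diffs)     # []
print(diffs_after_change)  # [(1, 'a', 'b', 'c', 1), (1, 'a', 'b', 'z', 1)]
```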
Well, the best way is to calculate the hash sum of each table and compare the two sums.
No matter how many columns there are or what their data types are, as long as the two tables have the same schema, you can use the following queries for the comparison:
select sum(hash(*)) from t1;
select sum(hash(*)) from t2;
And you just need to compare the return values.
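The same idea can be sketched outside Hive. The snippet below is not Hive's hash function; it is a Python stand-in that hashes each row with SHA-256 and sums the results, and the helper name table_checksum is made up. Summing makes the result independent of row order. Equal sums make equality very likely but do not strictly prove it, since hashes can collide.

```python
import hashlib
import sqlite3

def table_checksum(conn, table):
    """Order-independent checksum: sum of per-row hashes modulo 2**64.

    The table name is interpolated directly, so only pass trusted names.
    """
    total = 0
    for row in conn.execute(f"SELECT * FROM {table}"):
        digest = hashlib.sha256(repr(row).encode("utf-8")).digest()
        total = (total + int.from_bytes(digest[:8], "big")) % (1 << 64)
    return total

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (key INTEGER, c1 TEXT);
    CREATE TABLE t2 (key INTEGER, c1 TEXT);
    CREATE TABLE t3 (key INTEGER, c1 TEXT);
    INSERT INTO t1 VALUES (1, 'a'), (2, 'b'), (3, NULL);
    INSERT INTO t2 VALUES (3, NULL), (1, 'a'), (2, 'b');  -- same rows, other order
    INSERT INTO t3 VALUES (1, 'a'), (2, 'B'), (3, NULL);  -- one value differs
""")

same = table_checksum(conn, "t1") == table_checksum(conn, "t2")
different = table_checksum(conn, "t1") != table_checksum(conn, "t3")
print(same, different)
```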
I used EXCEPT statement and it worked.
select * from Original_table
EXCEPT
select * from Revised_table
This shows all the rows of the Original table that are not in the Revised table.
If your table is partitioned you will have to provide a partition predicate.
Fyi, partition values don't need to be provided if you use Presto and querying via SQL lab.
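Note that EXCEPT only looks in one direction, so to decide equality you would run it both ways. A sketch with sqlite3 and made-up rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- hypothetical sample data
    CREATE TABLE Original_table (key INTEGER, c1 TEXT);
    CREATE TABLE Revised_table (key INTEGER, c1 TEXT);
    INSERT INTO Original_table VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO Revised_table VALUES (1, 'a'), (2, 'B'), (4, 'd');
""")

# Rows in the original that are missing from the revised table.
only_in_original = conn.execute("""
    SELECT * FROM Original_table
    EXCEPT
    SELECT * FROM Revised_table
    ORDER BY 1
""").fetchall()

# The reverse direction: rows that only exist in the revised table.
only_in_revised = conn.execute("""
    SELECT * FROM Revised_table
    EXCEPT
    SELECT * FROM Original_table
    ORDER BY 1
""").fetchall()

print(only_in_original)  # [(2, 'b'), (3, 'c')]
print(only_in_revised)   # [(2, 'B'), (4, 'd')]
```

EXCEPT also removes duplicates, so two empty results mean the tables contain the same set of distinct rows, not necessarily the same number of copies of each.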
I would recommend not using any JOINs to compare tables:
it is quite an expensive operation when tables are big (which is often the case in Hive)
it can cause problems when some rows/"join keys" are repeated
(and it can also be impractical when data are in different clusters/datacenters/clouds).
Instead, I think using a checksum approach and comparing the checksums of both tables is best.
I have developed a Python script that lets you do such a comparison easily and see the differences in a web browser:
https://github.com/bolcom/hive_compared_bq
I hope that can help you!
1: First get the row count of both tables, C1 and C2. C1 and C2 should be equal. They can be obtained with the following query (run it once per table):
select count(*) from table1
if C1 and C2 are not equal, then the tables are not identical.
2: Find distinct count for both the tables DC1 and DC2. DC1 and DC2 should be equal. Number of distinct records can be found using the following query:
select count(*) from (select distinct * from table1) t
if DC1 and DC2 are not equal, the tables are not identical.
3: Now get the number of records obtained by performing a union on the 2 tables. Let it be U. Use the following query to get the number of records in a union of 2 tables:
SELECT count (*)
FROM
(SELECT *
FROM table1
UNION
SELECT *
FROM table2) u
You can say that the data in the 2 tables is identical if distinct count for the 2 tables is equal to the number of records obtained by performing union of the 2 tables. ie DC1 = U and DC2 = U
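The three checks can be put together roughly like this. It is a sketch in sqlite3 with made-up data, and the helper name looks_identical is mine. One caveat worth knowing: the checks compare totals and distinct rows, so two tables that contain the same distinct rows with differently distributed duplicates still pass, as the last line shows.

```python
import sqlite3

def looks_identical(conn, a, b):
    """Apply the three count checks to tables a and b.

    Table names are interpolated directly, so only pass trusted names.
    """
    def count(sql):
        return conn.execute(sql).fetchone()[0]

    c1 = count(f"SELECT COUNT(*) FROM {a}")
    c2 = count(f"SELECT COUNT(*) FROM {b}")
    dc1 = count(f"SELECT COUNT(*) FROM (SELECT DISTINCT * FROM {a}) AS d")
    dc2 = count(f"SELECT COUNT(*) FROM (SELECT DISTINCT * FROM {b}) AS d")
    u = count(f"SELECT COUNT(*) FROM (SELECT * FROM {a} UNION SELECT * FROM {b}) AS u")
    return c1 == c2 and dc1 == u and dc2 == u

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (key INTEGER, c1 TEXT);
    CREATE TABLE table2 (key INTEGER, c1 TEXT);
    CREATE TABLE table3 (key INTEGER, c1 TEXT);
    CREATE TABLE table4 (key INTEGER, c1 TEXT);
    INSERT INTO table1 VALUES (1, 'a'), (2, 'b'), (2, 'b');
    INSERT INTO table2 VALUES (2, 'b'), (1, 'a'), (2, 'b');  -- same rows
    INSERT INTO table3 VALUES (1, 'a'), (1, 'a'), (2, 'b');  -- different duplicate
    INSERT INTO table4 VALUES (1, 'a'), (2, 'b'), (3, 'c');  -- different row
""")

print(looks_identical(conn, "table1", "table2"))  # True
print(looks_identical(conn, "table1", "table4"))  # False
print(looks_identical(conn, "table1", "table3"))  # True, although the duplicates differ
```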
another variant
select c1-c2 "different row counts"
, c1-c3 "mismatched rows"
from
( select count(*) c1 from table1)
,( select count(*) c2 from table2 )
,(select count(*) c3 from table1 t1, table2 t2
where t1.key= t2.key
and T1.c1=T2.c1 )
Try with WITH Clause:
With cnt as(
select count(*) cn1 from table1
)
select 'X' from dual,cnt where cnt.cn1 = (select count(*) from table2);
One easy solution is to do an inner join. Suppose we have two Hive tables, table1 and table2. Both tables have the same columns: col1, col2 and col3. The number of rows should also be the same. Then the query would be as follows:
select count(*) from table1
inner join table2
on table1.col1 = table2.col1
and table1.col2 = table2.col2
and table1.col3 = table2.col3 ;
If the output value is the same as the number of rows in table1 and table2, then all the columns have the same values. If, however, the output count is lower, then some of the data differ.
Use a MINUS operator:
SELECT count(*) FROM
(SELECT t1.c1, t1.c2, t1.c3 from table1 t1
MINUS
SELECT t2.c1, t2.c2, t2.c3 from table2 t2)

SQL SELECT across two tables

I am a little confused as to how to approach this SQL query.
I have two tables (with an equal number of records), and I would like to return a column which is the division between the two.
In other words, here is my not-working-correctly query:
SELECT( (SELECT v FROM Table1) / (SELECT DotProduct FROM Table2) );
How would I do this? All I want is a column where each row equals the same row in Table1 divided by the same row in Table2. The resulting table should have the same number of rows, but I am getting something with a lot more rows than the original two tables.
I am at a complete loss. Any advice?
It sounds like you have some kind of key between the two tables. You need an Inner Join:
select t1.v / t2.DotProduct
from Table1 as t1
inner join Table2 as t2
on t1.ForeignKey = t2.PrimaryKey
Should work. Just make sure you watch out for division by zero errors.
You didn't specify the full table structure so I will assume a common ID column to link rows in the tables.
SELECT table1.v/table2.DotProduct
FROM Table1 INNER JOIN Table2
ON (Table1.ID=Table2.ID)
You need to do a JOIN on the tables and divide the columns you want.
SELECT (Table1.v / Table2.DotProduct) FROM Table1 JOIN Table2 ON something
You need to substitute something that tells SQL how to match up the rows:
Something like: Table1.id = Table2.id
In case your fields are both integers you need to do this to avoid integer math:
select t1.v / (t2.DotProduct*1.00)
from Table1 as t1
inner join Table2 as t2
on t1.ForeignKey = t2.PrimaryKey
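Here is a small sketch of the integer-math issue in sqlite3, assuming a common ID column as in the earlier answer and made-up values. The NULLIF guard is my addition for the division-by-zero case mentioned above. SQLite itself returns NULL for a division by zero, whereas many other databases raise an error.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- hypothetical sample data; ID is the assumed common key
    CREATE TABLE Table1 (ID INTEGER PRIMARY KEY, v INTEGER);
    CREATE TABLE Table2 (ID INTEGER PRIMARY KEY, DotProduct INTEGER);
    INSERT INTO Table1 VALUES (1, 1), (2, 9), (3, 5);
    INSERT INTO Table2 VALUES (1, 2), (2, 3), (3, 0);
""")

rows = conn.execute("""
    SELECT t1.ID,
           t1.v / t2.DotProduct                    AS int_division,
           t1.v / NULLIF(t2.DotProduct * 1.0, 0)   AS real_division
    FROM Table1 t1
    INNER JOIN Table2 t2 ON t1.ID = t2.ID
    ORDER BY t1.ID
""").fetchall()

for row in rows:
    print(row)
# (1, 0, 0.5)      integer division truncates 1/2 to 0
# (2, 3, 3.0)
# (3, None, None)  zero divisor: NULL instead of an error
```

The join on ID gives exactly one output row per matching pair, so with one row per ID in each table the result has the same number of rows as the inputs.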
If you have multiple values in table2 relating to values in table1 you need to specify which to use. Here I chose the largest one.
select t1.v / (max(t2.DotProduct)*1.00)
from Table1 as t1
inner join Table2 as t2
on t1.ForeignKey = t2.PrimaryKey
Group By t1.v