Distinct vs Group by Performance - sql

I have two queries: one uses DISTINCT and the other uses GROUP BY. Which is faster, DISTINCT or GROUP BY? And if I have more than 10 columns, which one is faster?
select distinct column1, column2
from table;

select column1, column2
from table
group by column1, column2;
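As a rough sketch of how the two forms relate, the snippet below runs both against a tiny in-memory SQLite table through Python's sqlite3 module. The table name `tbl` and the sample rows are made up for illustration. It only checks that both queries return the same set of rows; it says nothing about speed, which depends on the engine, the indexes, and the execution plan.

```python
import sqlite3

# Hypothetical table and rows, used only to compare the two query forms.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (column1 INTEGER, column2 TEXT)")
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?)",
    [(1, "a"), (1, "a"), (2, "b"), (1, "b")],
)

distinct_rows = conn.execute(
    "SELECT DISTINCT column1, column2 FROM tbl"
).fetchall()
grouped_rows = conn.execute(
    "SELECT column1, column2 FROM tbl GROUP BY column1, column2"
).fetchall()

# Neither query guarantees an order, so compare the results as sorted lists.
print(sorted(distinct_rows) == sorted(grouped_rows))  # True
print(sorted(distinct_rows))  # [(1, 'a'), (1, 'b'), (2, 'b')]
```

To compare performance on a real system, it is probably more useful to look at the execution plan each query gets on your own data than to rely on a general rule.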

Related

Removing duplicates of column2, then grouping by column1 and summing the values of column3 in SQL

The table looks like
column1 column2 column3
400196 2021-07-06 33
400196 2021-07-06 33
400196 2021-08-16 33
I want to get the sum of the column3 values grouped by column1, but rows with a duplicate date (column2) should only be counted once.
The desired output is:
column1 column3
400196 66
The query I wrote is
select sum(column3)
from table_name
group by column1
But this gives me result 99
You can remove duplicate values in a subquery:
select t.column1, sum(t.column3)
from (select distinct t.column1, t.column2, t.column3
from t
) t
group by t.column1;
Note: This sort of problem can arise when you are joining tables together. Removing duplicates may not always be the right solution. Often it is better to do the calculation before joining, so you don't have duplicate values to deal with.
You could use a two step process here, first remove duplicates, then aggregate and sum:
SELECT column1, SUM(column3) AS column3
FROM (SELECT DISTINCT column1, column2, column3 FROM yourTable) t
GROUP BY column1;
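Here is a small sketch of the difference, using the three sample rows from the question. It runs on an in-memory SQLite database through Python's sqlite3 module, which is an assumption made for the sake of a runnable example; the table name `t` follows the first answer.

```python
import sqlite3

# In-memory table holding the three sample rows from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (column1 INTEGER, column2 TEXT, column3 INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        (400196, "2021-07-06", 33),
        (400196, "2021-07-06", 33),
        (400196, "2021-08-16", 33),
    ],
)

# Plain GROUP BY adds every row, including the duplicated date.
naive = conn.execute(
    "SELECT column1, SUM(column3) FROM t GROUP BY column1"
).fetchall()

# Remove duplicate rows first, then aggregate.
deduped = conn.execute(
    """
    SELECT d.column1, SUM(d.column3)
    FROM (SELECT DISTINCT column1, column2, column3 FROM t) AS d
    GROUP BY d.column1
    """
).fetchall()

print(naive)    # [(400196, 99)]
print(deduped)  # [(400196, 66)]
```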

SQL Server - improve performance of searching a values in table

I'm facing a problem with one query. It is easiest to explain step by step:
First, I search for specific values in column1 of table1 using a query like this:
Query #1:
select column1
from table1
where column1 in('xxx','yyy','zzz')
group by column1
having count(*) >3
So now I have a list of values from column1 that occur more than 3 times.
Then I need to use that list in the WHERE condition of another query:
select column1, column2, column3
from table1
where column1 in (query 1)
Unfortunately, when I use query 1 as a subquery, execution is really slow, so I need to find a different way to do this. Any suggestions on how I can improve the performance?
Best regards and thank you in advance
If they are the same table, then use window functions:
select t.*
from (select t.*, count(*) over (partition by column1) as cnt
from table1 t
where column1 in ('xxx', 'yyy', 'zzz')
) t
where cnt > 3;
Both this and your original query will benefit from having an index on table1(column1).
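As a quick sanity check, the sketch below runs the original subquery form and the window-function form side by side and confirms they return the same rows. It uses an in-memory SQLite database through Python's sqlite3 module as a stand-in for SQL Server, so it shows equivalence only, not timing. The sample rows and the index name `idx_table1_column1` are made up, and window functions assume SQLite 3.25 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (column1 TEXT, column2 INTEGER, column3 INTEGER)")
rows = (
    [("xxx", i, 0) for i in range(4)]    # 4 rows: passes the > 3 filter
    + [("yyy", i, 0) for i in range(2)]  # 2 rows: filtered out
    + [("aaa", i, 0) for i in range(5)]  # not in the IN list at all
)
conn.executemany("INSERT INTO table1 VALUES (?, ?, ?)", rows)
# Hypothetical index name; the answer only says to index table1(column1).
conn.execute("CREATE INDEX idx_table1_column1 ON table1 (column1)")

# Original approach: query 1 used as a subquery.
subquery_form = conn.execute(
    """
    SELECT column1, column2, column3
    FROM table1
    WHERE column1 IN (
        SELECT column1
        FROM table1
        WHERE column1 IN ('xxx', 'yyy', 'zzz')
        GROUP BY column1
        HAVING COUNT(*) > 3
    )
    ORDER BY column1, column2
    """
).fetchall()

# Window-function approach: count per column1 in a single pass, then filter.
window_form = conn.execute(
    """
    SELECT column1, column2, column3
    FROM (
        SELECT t.*, COUNT(*) OVER (PARTITION BY column1) AS cnt
        FROM table1 AS t
        WHERE column1 IN ('xxx', 'yyy', 'zzz')
    ) AS counted
    WHERE cnt > 3
    ORDER BY column1, column2
    """
).fetchall()

print(subquery_form == window_form)  # True
print(len(window_form))              # 4
```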
1) First of all, check whether the query is properly indexed.
You may need to add an index on column1.
2) Try this:
select T1.column1, T1.column2, T1.column3
from table1 as T1 inner join (query 1) as T2
on T1.column1 = T2.column1

how to do nested SQL select count

I'm querying a system that won't allow using DISTINCT, so my alternative is to use a GROUP BY to get close to the result.
My desired query was meant to look like this:
SELECT
SUM(column1) AS column1,
SUM(column2) AS column2,
COUNT(DISTINCT(column3)) AS column3
FROM table
For the alternative, I think I need some type of nested query along the lines of this:
SELECT
SUM(column1) AS column1,
SUM(column2) AS column2,
COUNT(SELECT column FROM table GROUP BY column) AS column3
FROM table
But it didn't work. Am I close?
You are using the wrong syntax for COUNT(DISTINCT). The DISTINCT part is a keyword, not a function. Based on the docs, this ought to work:
SELECT
SUM(column1) AS column1,
SUM(column2) AS column2,
COUNT(DISTINCT column3) AS column3
FROM table
Do, however, read the docs. BigQuery's implementation of COUNT(DISTINCT) is a bit unusual, apparently so as to scale better for big data. If you are trying to count a large number of distinct values then you may need to specify a second parameter (and you have an inherent scaling problem).
Update:
If you have a large number of distinct column3 values to count, and you want an exact count, then perhaps you can perform a join instead of putting a subquery in the select list (which BigQuery seems not to permit):
SELECT *
FROM (
SELECT
SUM(column1) AS column1,
SUM(column2) AS column2
FROM table
)
CROSS JOIN (
SELECT count(*) AS column3
FROM (
SELECT column3
FROM table
GROUP BY column3
)
)
Update 2:
Not that joining two one-row tables would be at all expensive, but @FelipeHoffa got me thinking more about this, and I realized I had missed a simpler solution:
SELECT
SUM(column1) AS column1,
SUM(column2) AS column2,
COUNT(*) AS column3
FROM (
SELECT
SUM(column1) AS column1,
SUM(column2) AS column2
FROM table
GROUP BY column3
)
This one computes a subtotal of column1 and column2 values, grouping by column3, then counts and totals all the subtotal rows. It feels right.
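To check that reasoning, here is a minimal sketch that compares COUNT(DISTINCT) with the grouped-subtotal query on a few made-up rows. It runs on an in-memory SQLite database through Python's sqlite3 module rather than BigQuery, and the table name `sample` is hypothetical (`table` itself is a reserved word).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sample (column1 INTEGER, column2 INTEGER, column3 TEXT)")
conn.executemany(
    "INSERT INTO sample VALUES (?, ?, ?)",
    [(1, 10, "a"), (2, 20, "a"), (3, 30, "b"), (4, 40, "c"), (5, 50, "c")],
)

# Reference result using COUNT(DISTINCT ...).
direct = conn.execute(
    "SELECT SUM(column1), SUM(column2), COUNT(DISTINCT column3) FROM sample"
).fetchone()

# Same totals without DISTINCT: subtotal per column3 value, then total and
# count the subtotal rows.
grouped = conn.execute(
    """
    SELECT SUM(column1), SUM(column2), COUNT(*)
    FROM (
        SELECT SUM(column1) AS column1, SUM(column2) AS column2
        FROM sample
        GROUP BY column3
    ) AS per_group
    """
).fetchone()

print(direct)   # (15, 150, 3)
print(grouped)  # (15, 150, 3)
```

One caveat: if column3 can be NULL, the two forms differ by one, because COUNT(DISTINCT column3) ignores NULLs while GROUP BY puts them into a group of their own.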
FWIW, the way you are trying to use DISTINCT isn't how it's normally used, as it's meant to show unique rows, not unique values for one column in a dataset. GROUP BY is more in line with what I believe you are ultimately trying to accomplish.
Depending on what you need, you could do one of a couple of things. Using your second query, you would need to modify your subquery to get a count, not the actual values, like:
SELECT
SUM(column1) AS column1,
SUM(column2) AS column2,
(SELECT COUNT(*) FROM (SELECT column FROM table GROUP BY column) g) AS column3
FROM table
Alternatively, you could do a query off your initial query, something like this:
SELECT sum(column1), sum(column2), sum(column4) from (
SELECT
SUM(column1) AS column1,
SUM(column2) AS column2,
1 AS column4
FROM table GROUP BY column3)
GROUP BY column4
Edit: The above is generic SQL, not too familiar with Google Big Query
You can probably use a CTE:
WITH result AS (SELECT column FROM table GROUP BY column)
SELECT
SUM(column1) AS column1,
SUM(column2) AS column2,
(SELECT COUNT(*) FROM result) AS column3
FROM table
Instead of doing a COUNT(DISTINCT), you can get the same results by running a GROUP BY first, and then counting results.
For example, the number of different words that Shakespeare used by year:
SELECT corpus_date, COUNT(word) different_words
FROM (
SELECT word, corpus_date
FROM [publicdata:samples.shakespeare]
GROUP BY word, corpus_date
)
GROUP BY corpus_date
ORDER BY corpus_date
As a bonus, let's add a column that identifies which books were written during each year:
SELECT corpus_date, COUNT(word) different_words, GROUP_CONCAT(UNIQUE(corpus)) books
FROM (
SELECT word, corpus_date, UNIQUE(corpus) corpus
FROM [publicdata:samples.shakespeare]
GROUP BY word, corpus_date
)
GROUP BY corpus_date
ORDER BY corpus_date

Finding rows that have many similar values and one different one

I'm trying to isolate a problem with a violation of a unique key index. I'm pretty certain the cause is rows that have the same values in 3 columns but not the same value in the 4th (when they should). As an example...
Key Column1 Column2 Column3 Column4
1 A B C D
2 A B C D
3 A B C D
4 A B C Z
I basically want to select column 4, or find some way to identify it. I know it's a matter of using aggregate functions, but I'm not very familiar with them. Can anyone suggest a way to select Key, Column4 for rows that have a different column 4 value and the same column 1-3 values?
This is what you want:
select column1, column2, column3
from t
group by column1, column2, column3
having min(column4) <> max(column4)
Once you get the right values for the first three columns, you can join back in to get the specific rows.
Or, you can use window functions like this:
select t.*
from (select t.*, min(column4) over (partition by column1, column2, column3) as min4,
max(column4) over (partition by column1, column2, column3) as max4
from t
) t
where min4 <> max4;
If NULL is a valid "other" value that you want to count, you will need additional logic for that.
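Here is a sketch of the "group, then join back" idea applied to the sample rows from the question, plus one extra consistent group added for contrast. It runs on an in-memory SQLite database through Python's sqlite3 module. The column name `key_id` is a stand-in for the question's Key column, chosen to avoid using a keyword as an identifier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (key_id INTEGER, column1 TEXT, column2 TEXT, "
    "column3 TEXT, column4 TEXT)"
)
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?, ?)",
    [
        (1, "A", "B", "C", "D"),
        (2, "A", "B", "C", "D"),
        (3, "A", "B", "C", "D"),
        (4, "A", "B", "C", "Z"),  # the odd one out
        (5, "E", "F", "G", "H"),  # extra consistent group, not in the question
        (6, "E", "F", "G", "H"),
    ],
)

# Step 1: find the (column1, column2, column3) groups with mixed column4 values.
groups = conn.execute(
    """
    SELECT column1, column2, column3
    FROM t
    GROUP BY column1, column2, column3
    HAVING MIN(column4) <> MAX(column4)
    """
).fetchall()

# Step 2: join back to list every row in those groups with its key and column4.
members = conn.execute(
    """
    SELECT t.key_id, t.column4
    FROM t
    JOIN (
        SELECT column1, column2, column3
        FROM t
        GROUP BY column1, column2, column3
        HAVING MIN(column4) <> MAX(column4)
    ) AS g
      ON t.column1 = g.column1
     AND t.column2 = g.column2
     AND t.column3 = g.column3
    ORDER BY t.key_id
    """
).fetchall()

print(groups)   # [('A', 'B', 'C')]
print(members)  # [(1, 'D'), (2, 'D'), (3, 'D'), (4, 'Z')]
```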
If you want to get all columns, then use this (it would be simpler if the windowed count supported DISTINCT, but it doesn't):
with cte1 as (
select distinct * from Table1
), cte2 as (
select
*,
count(column4) over(partition by column1, column2, column3) as cnt
from cte1
)
select * from cte2 where cnt > 1;
If you just want to select the key columns:
select
column1, column2, column3
from Table1
group by column1, column2, column3
having count(distinct column4) > 1

TeraData aggregate function

When I try to select a couple of columns along with a count, I get the following error:
Selected non-aggregate values must be part of the associated group
My query is something like this.
SELECT COUNT(1), COLUMN1, COLUMN2
FROM TABLE_NAME
If you're after a count for each combination of COLUMN1 and COLUMN2:
SELECT COUNT(1), COLUMN1, COLUMN2 FROM TABLE_NAME GROUP BY COLUMN1, COLUMN2
If you're after a count of all records in the table:
SELECT COUNT(1) OVER (), COLUMN1, COLUMN2 FROM TABLE_NAME
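The sketch below shows what each of the two queries returns on a few made-up rows. It runs on an in-memory SQLite database through Python's sqlite3 module as a stand-in for Teradata, so it illustrates the shape of the results only. The sample rows are hypothetical, an ORDER BY is added to make the output deterministic, and the window form assumes SQLite 3.25 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_name (column1 TEXT, column2 TEXT)")
conn.executemany(
    "INSERT INTO table_name VALUES (?, ?)",
    [("a", "x"), ("a", "x"), ("b", "y")],
)

# One row per (column1, column2) combination, with that combination's count.
per_combination = conn.execute(
    """
    SELECT COUNT(1), column1, column2
    FROM table_name
    GROUP BY column1, column2
    ORDER BY column1, column2
    """
).fetchall()

# Every original row is kept, each carrying the total row count of the table.
with_total = conn.execute(
    """
    SELECT COUNT(1) OVER (), column1, column2
    FROM table_name
    ORDER BY column1, column2
    """
).fetchall()

print(per_combination)  # [(2, 'a', 'x'), (1, 'b', 'y')]
print(with_total)       # [(3, 'a', 'x'), (3, 'a', 'x'), (3, 'b', 'y')]
```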