Count distinct multiple columns in redshift - sql

I am trying to count rows which have a distinct combination of 2 columns in Amazon redshift. The query I am using is -
select count(distinct col1, col2)
from schemaname.tablename
where some filters
It is throwing me this error:
Amazon Invalid operation: function count(character varying, bigint) does not exist
I tried casting bigint to char but it didn't work.

You can use a subquery and count:
select count(*) from (
select distinct col1, col2
from schemaname.tablename
where some filter
) as t

A little late to the party, but anyway: you can also try concatenating the columns with the || operator. It might be inefficient, so I wouldn't use it in prod code, but for ad-hoc analysis it should be fine.
select count(distinct col1 || '_' || col2)
from schemaname.tablename
where some filters
Note that the separator choice might matter, e.g.
both 'foo' || '_' || 'bar_baz' and 'foo_bar' || '_' || 'baz' yield 'foo_bar_baz' and are thus counted as equal. In some cases this might be a concern; in others it's so insignificant that you can skip the separator completely.
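Also, since col2 in the question is a bigint, you may need an explicit cast if the implicit conversion to text isn't applied (a sketch, using Redshift's :: cast syntax):
select count(distinct col1 || '_' || col2::varchar)
from schemaname.tablename
where some filters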

You can use a GROUP BY to see the count for each combination:
select col1, col2, count(*)
from schemaname.tablename
where some filters
group by col1, col2

If you are just trying to do count(distinct) then Zaynul's answer is correct. If you want other aggregations as well, here is another method:
select . . .,
       sum(case when seqnum = 1 then 1 else 0 end) as col1_col2_unique_count
from (select t.*,
             row_number() over (partition by col1, col2 order by col1) as seqnum
      from schemaname.tablename t
      where some filters
     ) c

Related

How to express "either the single resulting record or NULL", without an inner-query LIMIT?

Consider the following query:
SELECT (SELECT MIN(col1) FROM table1) = 7;
Assuming col1 is non-NULLable, this will yield either true or false - or possibly NULL when table1 is empty.
But now suppose I have:
SELECT (
    SELECT FIRST_VALUE(col2) OVER (ORDER BY col1) AS col2_for_first_col1
    FROM table1
) = 7;
(and assume col2 is also non-NULLable for simplicity.)
If there is a unique col2 value for the lowest col1 value, or the table is empty, then this works just like before. But if there are multiple col2 values for the lowest col1, I'm going to get a query runtime error.
My question: What is a short, elegant way to get NULL from this last query also in the case of multiple inner-query results? I could of course duplicate it and check the count, but I would rather avoid that.
Important caveat: I'm using MonetDB, and it doesn't seem to support ORDER BY ... LIMIT 1 on inner queries.
Without the MonetDB limitation, you would seem to want:
SELECT (SELECT col2
        FROM table1
        ORDER BY col1
        LIMIT 1
       ) = 7;
With the limitation, you can use window functions differently:
SELECT (SELECT col2
        FROM (SELECT col2, ROW_NUMBER() OVER (ORDER BY col1) as seqnum
              FROM table1
             ) t
        WHERE seqnum = 1
       ) = 7;
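If you also want NULL (rather than an arbitrary pick) when the lowest col1 has several distinct col2 values, one sketch - assuming MonetDB supports RANK() and COUNT(DISTINCT), and not part of the answer above - is to rank the rows and collapse with a uniqueness check:
SELECT (SELECT CASE WHEN COUNT(DISTINCT col2) = 1 THEN MIN(col2) END
        FROM (SELECT col2, RANK() OVER (ORDER BY col1) AS rnk
              FROM table1
             ) t
        WHERE rnk = 1
       ) = 7;
All rows tied for the lowest col1 get rnk = 1; the aggregate subquery always returns exactly one row, yielding NULL when the table is empty or when the tied rows disagree on col2.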

In SQL Server, how to concat --> group by using the concat column

SELECT
'dbo.our_table' as table_name,
CONCAT(col1, '-', col2, '-', col3) as table_id,
COUNT(*) as ct
FROM dbo.our_table
group by table_name, table_id
-- group by 1, 2 -- this doesn't work either...
order by ct desc
This does not work in SQL Server because it does not recognize table_name or table_id. I understand that this can be done by nesting a second SELECT inside the FROM so that table_name and table_id are explicitly available, but I am trying to understand whether this output can be achieved without a nested SELECT statement, keeping the structure of my current query with only a small tweak.
Thanks
As mentioned in the comments, you need to put your 3 columns (col1, col2 & col3) into the GROUP BY, as shown below. Unlike those 3 columns, 'dbo.our_table' is not needed in the GROUP BY because it is a string literal, i.e. a constant.
SQL Server executes the components of SELECT queries in a particular order starting with FROM, then WHERE, GROUP BY, etc. In this case, SQL Server doesn't recognize the aliases table_name & table_id in the GROUP BY because they are not set until the SELECT, which is executed after the GROUP BY.
Googling "SQL Server SELECT query execution order" should give you a number of resources which will explain the order of execution in more detail.
You need to specify the full calculation for the GROUP BY as well as for the SELECT. This is because GROUP BY is logically considered before SELECT, so cannot access those calculations.
You could do it like this (table_name is not necessary because it's purely computed):
SELECT
'dbo.our_table' as table_name,
CONCAT(col1, '-', col2, '-', col3) as table_id,
COUNT(*) as ct
FROM dbo.our_table
group by CONCAT(col1, '-', col2, '-', col3)
order by ct desc;
But much better is to place the calculation in a CROSS APPLY; this means it is accessible later by name, as you wished:
SELECT
'dbo.our_table' as table_name,
v.table_id,
COUNT(*) as ct
FROM dbo.our_table
CROSS APPLY (VALUES (CONCAT(col1, '-', col2, '-', col3) ) ) as v(table_id)
group by v.table_id
order by ct desc;

BigQuery: Use COUNT as LIMIT

I want to select everything from mytable1 and combine that with just as many rows from mytable2. In my case mytable1 always has fewer rows than mytable2, and I want the final table to be a 50-50 mix of data from each table. While I feel like the following code expresses what I want logically, it doesn't work syntax-wise:
(SELECT * FROM `mytable1`)
UNION ALL (
SELECT * FROM `mytable2`
LIMIT (SELECT COUNT(*) FROM `mytable1`)
)
It throws:
Syntax error: Expected "#" or integer literal or keyword CAST but got "(" at [3:1]
Using standard SQL in BigQuery.
The docs state that the LIMIT clause accepts only literal or parameter values. I think you can ROW_NUMBER() the rows from the second table and limit based on that:
SELECT col1, col2, col3
FROM mytable1
UNION ALL
SELECT col1, col2, col3
FROM (
SELECT col1, col2, col3, ROW_NUMBER() OVER () AS rn
FROM mytable2
) AS x
WHERE x.rn <= (SELECT COUNT(*) FROM mytable1)
Each SELECT statement within a UNION must have the same number of columns.
The columns must also have similar data types.
The columns in each SELECT statement must also be in the same order.
Since your mytable1 has fewer columns than mytable2, you have to select the same number of columns in each branch:
select col1, col2, col3, '' as col4 from mytable1 -- pad the missing column with an aliased literal
union all
select col1, col2, col3, col4 from mytable2

How to find duplicates in a SQL Server table which has trailing spaces values in a column

select COL1, count(COL1)
from Table1
group by COL1
having count (COL1) > 1;
I have tried the above query and got some results for data without trailing spaces; however, the query above does not catch data with trailing spaces, so I tried the query below and got no results. Please advise.
select COL1, count(COL1)
from Table1
where COL1 in(select Ltrim(Rtrim(COL1))from Table1)
group by COL1
having count (COL1) > 1;
If you want to tally the text contents of COL1 ignoring leading and trailing whitespace, then just do that. Use ltrim(rtrim(COL1)) when aggregating:
select ltrim(rtrim(COL1)) AS COL1_trimmed,
       count(*) as cnt
from Table1
group by ltrim(rtrim(COL1))
having count(*) > 1;
In general, SQL Server ignores trailing spaces with varchar(). However, it does not when using char(). I am guessing the trailing "spaces" are not really spaces.
Here is an example.
with t as (
select cast('a' as varchar(255)) as x union all
select cast('a ' as varchar(255))
)
select t.x, count(*), min(t.x + '|'), max(t.x + '|')
from t
group by t.x;
This returns:
a 2 "a |" "a|"
(I added the double quotes to clarify the results.) Note that one row is returned, not two. But the spaces really are at the end of the values.
This leads me to suspect that the trailing characters are not spaces.
One way to investigate what they are is by using the ASCII() function.
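For example, a sketch (the alias names are illustrative; LEN ignores trailing spaces while DATALENGTH does not, so a mismatch flags them):
select COL1,
       LEN(COL1) as len_without_trailing_spaces,
       DATALENGTH(COL1) as size_in_bytes,
       ASCII(RIGHT(COL1, 1)) as last_char_code -- 32 = space, 9 = tab, 160 = non-breaking space
from Table1;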
Another way is to first remove the trailing and leading spaces from that column in your table.
If COL1 is a VARCHAR type:
update Table1
set COL1 = rtrim(ltrim(COL1))
where COL1 != rtrim(ltrim(COL1));
If COL1 is a CHAR type then you only need to left trim:
update Table1
set COL1 = ltrim(COL1)
where COL1 != ltrim(COL1);
After that cleanup, you can just use a grouping query without trimming the column:
select COL1, count(*) as Total
from Table1
group by COL1
having count(*) > 1;

How to write a listagg on Redshift?

For sample data as below,
Col1 Col2
1 A
1 B
1 C
2 A
2 B
the output I am looking for is
COL1 COL2
1 A B C
2 A B
This can be done using LISTAGG on Oracle or with recursive queries on other DBs, but Redshift supports neither. How do I achieve this on Redshift?
They just added LISTAGG() to Redshift (2015-07-31). http://docs.aws.amazon.com/redshift/latest/dg/r_LISTAGG.html
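With it, the sample data above can be aggregated directly (a sketch; sample_table is a stand-in for your table):
select col1,
       listagg(col2, ' ') within group (order by col2) as col2_list
from sample_table
group by col1
order by col1;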
Here is a solution from another similar question -
SELECT col1,
LISTAGG(col2,', ')
WITHIN GROUP (ORDER BY col2)
OVER (PARTITION BY col1) AS EMPLOYEE
FROM YOUR_TABLE
ORDER BY col1
Redshift has introduced a LISTAGG window function that makes this possible now. Here is a quick solution to your problem - it may or may not be useful, but putting it here so that people will know!
Here is the documentation about the function, and this is the announcement.
Try getting the row number of each row within its group in a subquery, and then on top of the subquery do:
max(case when row_num_value = 1 then col_value end) || ',' ||
max(case when row_num_value = 2 then col_value end) || ',' ||
max(case when row_num_value = 3 then col_value end) || ...
This is of course a limited version, capped at however many terms you choose; a fuller sketch follows.
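For instance (a sketch assuming at most 3 values per group; sample_table is a stand-in, COALESCE drops empty slots, and RTRIM removes the trailing separator):
select col1,
       rtrim(coalesce(max(case when rn = 1 then col2 end) || ' ', '')
          || coalesce(max(case when rn = 2 then col2 end) || ' ', '')
          || coalesce(max(case when rn = 3 then col2 end) || ' ', '')) as col2_list
from (
    select col1, col2,
           row_number() over (partition by col1 order by col2) as rn
    from sample_table
) t
group by col1;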
Try this (note that STRING_AGG exists in PostgreSQL and SQL Server but not in Redshift, where LISTAGG is the equivalent):
SELECT COL1, STRING_AGG(COL2, ' ') AS COL2 FROM TABLE_NAME GROUP BY COL1
SELECT Col1, ARRAY_TO_STRING(ARRAY_AGG(Col2 ORDER BY Col2 ASC), ' ')
FROM MyTable
GROUP BY Col1;
I don't know what version of PostgreSQL you are using. Prior to version 8.4, you would have had to define the function array_agg before using it:
CREATE AGGREGATE array_agg (anyelement)
(
sfunc = array_append,
stype = anyarray,
initcond = '{}'
);
select distinct COL_1,
       listagg(distinct COL_2, ',') within group (order by COL_2 desc) as my_list
from table
group by 1
However, I have a follow-up question: how can we retrieve the 2nd element of this list without using substring (e.g. if it were an array, we could just do array[1])?
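For that follow-up, Redshift's SPLIT_PART(string, delimiter, part) can extract the n-th element of a delimited string (a sketch reusing the query above; note the part index is 1-based):
select COL_1,
       split_part(my_list, ',', 2) as second_element
from (
    select distinct COL_1,
           listagg(distinct COL_2, ',') within group (order by COL_2 desc) as my_list
    from table -- placeholder name from the query above
    group by 1
) t;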