I am trying to convert Teradata BTEQ SQL scripts to Redshift SQL. My current Redshift PostgreSQL version is 8.0.2, and the Redshift version is 1.0.1499. The current version of Redshift does not support the ROLLUP() and GROUPING() functions. How can I work around this? What are the equivalent Redshift constructs? Could anyone explain, with some examples, how to do it?
Sample Teradata SQL-
select
PRODUCT_ID,CUST_ID,
GROUPING (PRODUCT_ID),
GROUPING (CUST_ID),
row_number() over (order by PRODUCT_ID,CUST_ID) AS "ROW_OUTPUT_NUM"
from products
group by rollup(PRODUCT_ID,CUST_ID);
Need to convert above sql query to Redshift
Implement the ROLLUP by hand
Since Redshift does not currently support the ROLLUP clause, you must implement this grouping technique the hard way.
ROLLUP with 1 argument
With ROLLUP (e.g. PostgreSQL)
SELECT column1, aggregate_function(*)
FROM some_table
GROUP BY ROLLUP(column1)
The equivalent implementation
-- First, the same GROUP BY without the ROLLUP
-- For efficiency, we will reuse this table
DROP TABLE IF EXISTS tmp_totals;
CREATE TEMP TABLE tmp_totals AS
SELECT column1, aggregate_function(*) AS total1
FROM some_table
GROUP BY column1;
-- Show the table 'tmp_totals'
SELECT * FROM tmp_totals
UNION ALL
-- The aggregation of 'tmp_totals'
-- (re-aggregate with a function that combines partial results, e.g. SUM for SUM or COUNT totals)
SELECT null, aggregate_function(total1) FROM tmp_totals
ORDER BY 1
Example output
Country | Sales
-------- | -----
Poland | 2
Portugal | 4
Ukraine | 3
null | 9
ROLLUP with 2 arguments
With ROLLUP (e.g. PostgreSQL)
SELECT column1, column2, aggregate_function(*)
FROM some_table
GROUP BY ROLLUP(column1, column2);
The equivalent implementation
-- First, the same GROUP BY without the ROLLUP
-- For efficiency, we will reuse this table
DROP TABLE IF EXISTS tmp_totals;
CREATE TEMP TABLE tmp_totals AS
SELECT column1, column2, aggregate_function(*) AS total1
FROM some_table
GROUP BY column1, column2;
-- Show the table 'tmp_totals'
SELECT * FROM tmp_totals
UNION ALL
-- The sub-totals of the first category
SELECT column1, null, sum(total1) FROM tmp_totals GROUP BY column1
UNION ALL
-- The full aggregation of 'tmp_totals'
SELECT null, null, sum(total1) FROM tmp_totals
ORDER BY 1, 2;
Example output
Country | Segment | Sales
-------- | -------- | -----
Poland | Premium | 0
Poland | Base | 2
Poland | null | 2 <- sub total
Portugal | Premium | 1
Portugal | Base | 3
Portugal | null | 4 <- sub total
Ukraine | Premium | 1
Ukraine | Base | 2
Ukraine | null | 3 <- sub total
null | null | 9 <- grand total
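The UNION ALL approach above can also stand in for GROUPING() by adding constant flag columns to each branch. Below is a minimal sketch, assuming an in-memory SQLite database as a stand-in for Redshift (SQLite has no ROLLUP either); the table `sales`, its columns, and the flag names `g_country` / `g_segment` are hypothetical names chosen for illustration.

```python
import sqlite3

# In-memory SQLite as a stand-in for an engine without ROLLUP/GROUPING.
# The table 'sales' and its columns are hypothetical illustration names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (country TEXT, segment TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("Poland", "Premium", 0), ("Poland", "Base", 2),
        ("Portugal", "Premium", 1), ("Portugal", "Base", 3),
        ("Ukraine", "Premium", 1), ("Ukraine", "Base", 2),
    ],
)

# Finest-level aggregation, computed once and reused for every rollup level.
conn.execute(
    """
    CREATE TEMP TABLE tmp_totals AS
    SELECT country, segment, SUM(amount) AS total
    FROM sales
    GROUP BY country, segment
    """
)

# g_country / g_segment play the role of GROUPING(): 1 means the column
# was rolled up in that row, 0 means it holds a real grouping value.
rollup_rows = conn.execute(
    """
    SELECT country, segment, total, 0 AS g_country, 0 AS g_segment
    FROM tmp_totals
    UNION ALL
    SELECT country, NULL, SUM(total), 0, 1 FROM tmp_totals GROUP BY country
    UNION ALL
    SELECT NULL, NULL, SUM(total), 1, 1 FROM tmp_totals
    ORDER BY g_country, country, g_segment, segment
    """
).fetchall()

for row in rollup_rows:
    print(row)
```

Sorting on the flag columns keeps each sub-total after its detail rows and the grand total last, without relying on how the engine orders NULLs. On Redshift the NULL literals may need an explicit cast to the column type.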
If you use the UNION technique that others have pointed to, you'll be scanning the underlying table multiple times.
If the fine-level GROUPing actually results in a significant reduction in the data size, a better solution may be:
create temp table summ1
as
select PRODUCT_ID,CUST_ID, ...
from products
group by PRODUCT_ID,CUST_ID;
create temp table summ2
as
select PRODUCT_ID,cast(NULL as INT) AS CUST_ID, ...
from summ1 -- aggregate the already-reduced summ1 instead of rescanning products
group by PRODUCT_ID;
select * from summ1
union all
select * from summ2
union all
select cast(NULL as INT) AS PRODUCT_ID, cast(NULL as INT) AS CUST_ID, ...
from summ2
Related
I have records in which each of a set of procedure codes should occur only once per year per member. I'm trying to identify occurrences where this rule is broken.
I've tried the below SQL, is that correct?
Table
+---------------+--------+-------------+
| ProcedureCode | Member | ServiceDate |
+---------------+--------+-------------+
| G0443 | 1234 | 01-03-2017 |
+---------------+--------+-------------+
| G0443 | 1234 | 05-03-2018 |
+---------------+--------+-------------+
| G0443 | 1234 | 07-03-2018 |
+---------------+--------+-------------+
| G0444 | 3453 | 01-03-2017 |
+---------------+--------+-------------+
| G0443 | 5676 | 07-03-2018 |
+---------------+--------+-------------+
Expected results where rule is broken
+---------------+--------+
| ProcedureCode | Member |
+---------------+--------+
| G0443 | 1234 |
+---------------+--------+
SQL
Select ProcedureCD, Mbr_Id
From CLAIMS
Where ProcedureCD IN ('G0443', 'G0444')
GROUP BY ProcedureCD,Mbr_Id, YEAR(ServiceFromDate)
having count(YEAR(ServiceFromDate))>1
The query you've written will work, provided you correct the column names (your query uses different column names from the sample data you posted). It can be simplified visually by using COUNT(*) in the HAVING clause. COUNT adds 1 for each non-null value and 0 for each null, so wrapping the column in YEAR inside the count has no significance here: all the dates are non-null and COUNT isn't interested in the value itself. count(*), count(1), count(0) and count(member) would all work equally well here.
The only time count(column) works differently from count(*) is when the column contains null values. There is also a form of COUNT where you put DISTINCT inside the brackets, which causes the counting to ignore repeated values.
COUNT(DISTINCT column) on a column whose 6 rows hold the values 1, 1, 2, null, 3, 3 would return 3 (3 unique values). COUNT(column) on the same column would return 5 (5 non-null values), and COUNT(*) would return 6.
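The three counts described above are easy to check for yourself. Here is a small sketch, assuming an in-memory SQLite database; the table `t` and column `val` are hypothetical names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (val INTEGER)")
# Six rows: 1, 1, 2, NULL, 3, 3
conn.executemany(
    "INSERT INTO t VALUES (?)",
    [(1,), (1,), (2,), (None,), (3,), (3,)],
)

# COUNT(*) counts rows, COUNT(val) counts non-null values,
# COUNT(DISTINCT val) counts unique non-null values.
counts = conn.execute(
    "SELECT COUNT(*), COUNT(val), COUNT(DISTINCT val) FROM t"
).fetchone()
print(counts)
```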
You should understand that by putting the YEAR(...) in the group by but not the select, you might produce duplicate-looking rows in the output. For example if you had these rows also:
Member, Code, Date
1234, G0443, 1-1-19
1234, G0443, 2-1-19
And you're grouping on year (but not showing it) then you'll see:
1234, G0443 --it's for year 2018
1234, G0443 --it's for year 2019
Personally I think it'd be handy to show the year in the select list, so you can better pinpoint where the problem is, but if you want to squish these duplicate rows, do a SELECT DISTINCT. Alternatively, leverage the difference between count and count distinct: remove the year from the GROUP BY and instead say HAVING COUNT(*) > COUNT(DISTINCT YEAR(ServiceDate)).
As discussed above, COUNT(*) will be greater than the COUNT(DISTINCT ...) of the year if any year occurs more than once.
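To try the COUNT(*) > COUNT(DISTINCT year) idea on the sample data, here is a rough sketch using an in-memory SQLite database. Two assumptions: SQLite has no YEAR() function, so `strftime('%Y', ...)` stands in for it, and the dates are stored as ISO strings (the day/month order of the original dates is a guess and does not affect the year).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE claims (ProcedureCode TEXT, Member INTEGER, ServiceDate TEXT)"
)
conn.executemany(
    "INSERT INTO claims VALUES (?, ?, ?)",
    [
        ("G0443", 1234, "2017-01-03"),
        ("G0443", 1234, "2018-05-03"),
        ("G0443", 1234, "2018-07-03"),
        ("G0444", 3453, "2017-01-03"),
        ("G0443", 5676, "2018-07-03"),
    ],
)

# One output row per (code, member); a pair is flagged when it has more
# rows than distinct years, i.e. at least one year occurs twice.
violations = conn.execute(
    """
    SELECT ProcedureCode, Member
    FROM claims
    WHERE ProcedureCode IN ('G0443', 'G0444')
    GROUP BY ProcedureCode, Member
    HAVING COUNT(*) > COUNT(DISTINCT strftime('%Y', ServiceDate))
    """
).fetchall()
print(violations)
```

Because the year is no longer in the GROUP BY, each offending code/member pair appears once, however many years break the rule.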
Select ProcedureCode, Member,YEAR(ServiceDate) [Year],Count(*) Occurrences
From CLAIMS
Where ProcedureCode IN ('G0443', 'G0444')
GROUP BY ProcedureCode, Member,YEAR(ServiceDate)
HAVING Count(*) > 1
Hope this code will help you:
create table #temp (ProcedureCode varchar(20),Member varchar(20),ServiceDate Date)
insert into #temp (ProcedureCode,Member,ServiceDate) values ('G0443','1234','01-03-2017')
insert into #temp (ProcedureCode,Member,ServiceDate) values ('G0443','1234','05-03-2018 ')
insert into #temp (ProcedureCode,Member,ServiceDate) values ('G0443','1234','07-03-2018')
insert into #temp (ProcedureCode,Member,ServiceDate) values ('G0444','3453','01-03-2017')
insert into #temp (ProcedureCode,Member,ServiceDate) values ('G0443','5676','07-03-2018')
select ProcedureCode,Member from #temp
where YEAR(ServiceDate) in (Select year(ServiceDate) ServiceDate from #temp group by
ServiceDate having count(ServiceDate)>1)
and Member in (Select Member from #temp group by Member having count(Member)>1)
Group by ProcedureCode,Member
drop table #temp
I need to add a row of sums as the last row of the table. For example:
book_name | some_row1 | some_row2 | sum
---------------+---------------+---------------+----------
book1 | some_data11 | some_data12 | 100
book2 | some_data21 | some_data22 | 300
book3 | some_data31 | some_data32 | 500
total_books=3 | NULL | NULL | 900
How can I do this? (T-SQL)
You can use UNION ALL:
select book_name, some_row1, some_row2, sum
from table t
union all
select cast(count(*) as varchar(255)), null, null, sum(sum)
from table t;
However, count(*) gives you the number of rows in the table. If book_name can also contain NULL values, you need count(book_name) instead of count(*).
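Here is a small runnable sketch of the UNION ALL idea, using an in-memory SQLite database rather than SQL Server. Assumptions: the numeric column is called `amount` instead of `sum`, the label is built with SQLite's `||` operator (T-SQL would need `+` and a cast), and the helper column `is_total` is a hypothetical addition that keeps the total row last.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE books "
    "(book_name TEXT, some_row1 TEXT, some_row2 TEXT, amount INTEGER)"
)
conn.executemany(
    "INSERT INTO books VALUES (?, ?, ?, ?)",
    [
        ("book1", "some_data11", "some_data12", 100),
        ("book2", "some_data21", "some_data22", 300),
        ("book3", "some_data31", "some_data32", 500),
    ],
)

# Detail rows plus one summary row; is_total only drives the ordering
# and is dropped from the final select list.
rows = conn.execute(
    """
    SELECT book_name, some_row1, some_row2, amount
    FROM (
        SELECT book_name, some_row1, some_row2, amount, 0 AS is_total
        FROM books
        UNION ALL
        SELECT 'total_books=' || COUNT(book_name), NULL, NULL, SUM(amount), 1
        FROM books
    ) AS u
    ORDER BY is_total, book_name
    """
).fetchall()

for row in rows:
    print(row)
```

Without an explicit ORDER BY, nothing guarantees that the total row comes out last.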
Try with ROLLUP
SELECT CASE
WHEN (GROUPING([book_name]) = 1) THEN 'total_books'
ELSE [book_name] END AS [book_name],some_row1, some_row2
,SUM([sum]) as Total_Sales
From Before
GROUP BY
[book_name] WITH ROLLUP
I find that grouping sets is much more flexible than rollup. I would write this as:
select coalesce(book_name,
replace('total_books=#x', '#x', count(*))
) as book_name,
col2, col3, sum(whatever)
from t
group by grouping sets ( (book_name), () );
Strictly speaking, the GROUPING function with a CASE is better than COALESCE(). However, NULL values in the grouping keys are quite rare.
I am performing some queries using PostgreSQL's SELECT DISTINCT ON syntax. I would like the query to return the total number of rows alongside every result row.
Assume I have a table my_table like the following:
CREATE TABLE my_table(
id int,
my_field text,
id_reference bigint
);
I then have a couple of values:
id | my_field | id_reference
----+----------+--------------
1 | a | 1
1 | b | 2
2 | a | 3
2 | c | 4
3 | x | 5
Basically my_table contains some versioned data. The id_reference is a reference to a global version of the database. Every change to the database will increase the global version number and changes will always add new rows to the tables (instead of updating/deleting values) and they will insert the new version number.
My goal is to perform a query that retrieves only the latest values in the table, along with the total number of rows.
For example, in the above case I would like to retrieve the following output:
| total | id | my_field | id_reference |
+-------+----+----------+--------------+
| 3 | 1 | b | 2 |
+-------+----+----------+--------------+
| 3 | 2 | c | 4 |
+-------+----+----------+--------------+
| 3 | 3 | x | 5 |
+-------+----+----------+--------------+
My attempt is the following:
select distinct on (id)
count(*) over () as total,
*
from my_table
order by id, id_reference desc
This returns almost the correct output, except that total is the number of rows in my_table instead of being the number of rows of the resulting query:
total | id | my_field | id_reference
-------+----+----------+--------------
5 | 1 | b | 2
5 | 2 | c | 4
5 | 3 | x | 5
(3 rows)
As you can see it has 5 instead of the expected 3.
I can fix this by using a subquery (here a CTE) and applying count(*) over () to its result:
with my_values as (
select distinct on (id)
*
from my_table
order by id, id_reference desc
)
select count(*) over (), * from my_values
Which produces my expected output.
My question: is there a way to avoid using this subquery and have something similar to count(*) over () return the result I want?
You are looking at my_table 3 ways:
to find the latest id_reference for each id
to find my_field for the latest id_reference for each id
to count the distinct number of ids in the table
I therefore prefer this solution:
select
c.id_count as total,
a.id,
a.my_field,
b.max_id_reference
from
my_table a
join
(
select
id,
max(id_reference) as max_id_reference
from
my_table
group by
id
) b
on
a.id = b.id and
a.id_reference = b.max_id_reference
join
(
select
count(distinct id) as id_count
from
my_table
) c
on true;
This is a bit longer (especially in the long thin way I write SQL) but it makes it clear what is happening. If you come back to it in a few months' time (somebody usually does), it will take less time to understand what is going on.
The "on true" at the end is a deliberate cartesian product because there can only ever be exactly one result from the subquery "c" and you do want a cartesian product with that.
There is nothing necessarily wrong with subqueries.
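If you want to check the join-based query against the sample data, here is a quick sketch using an in-memory SQLite database as a stand-in for PostgreSQL. One assumption: `CROSS JOIN` replaces the `on true` join, which is equivalent here because the counting subquery returns exactly one row. An ORDER BY is added so the output is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table (id INTEGER, my_field TEXT, id_reference INTEGER)"
)
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?)",
    [(1, "a", 1), (1, "b", 2), (2, "a", 3), (2, "c", 4), (3, "x", 5)],
)

# b: latest id_reference per id; c: number of distinct ids (one row).
latest = conn.execute(
    """
    SELECT c.id_count AS total, a.id, a.my_field, b.max_id_reference
    FROM my_table a
    JOIN (
        SELECT id, MAX(id_reference) AS max_id_reference
        FROM my_table
        GROUP BY id
    ) b ON a.id = b.id AND a.id_reference = b.max_id_reference
    CROSS JOIN (
        SELECT COUNT(DISTINCT id) AS id_count FROM my_table
    ) c
    ORDER BY a.id
    """
).fetchall()

for row in latest:
    print(row)
```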
The title was hard to word but the question is pretty simple. I searched all over here and could not find something for my specific issue so here it is. I'm using Microsoft SQL Server Management Studio 2010.
Table Currently looks like this
| Value | Product Name|
| 300 | Bike |
| 400 | Bike |
| 300 | Car |
| 300 | Car |
I need the table to show me the sum of Values where Product Name matches - like this
| TOTAL | ProductName |
| 700 | Bike |
| 600 | Car |
I've tried a simple
SELECT
SUM(Value) AS 'Total'
,ProductName
FROM TableX
But the above doesn't work. I end up getting the sum of all values in the column. How can I sum based on the product name matching?
Thanks!
SELECT SUM(Value) AS 'Total', [Product Name]
FROM TableX
GROUP BY [Product Name]
Any time you use an aggregate function (SUM, MIN, MAX ... ) alongside a plain column in the SELECT statement, you must use GROUP BY. This clause indicates which column(s) to group the aggregate by. Further, any column that is neither inside an aggregate nor listed in the GROUP BY cannot appear in your SELECT statement.
For example, the following syntax is invalid because you are specifying columns (col2) which are not in your GROUP BY (even though MySQL allows for this):
SELECT col1, col2, SUM(col3)
FROM table
GROUP BY col1
The solution to your question would be:
SELECT ProductName, SUM(Value) AS 'Total'
FROM TableX
GROUP BY ProductName
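For anyone who wants to see the GROUP BY at work on the sample data, here is a minimal sketch using an in-memory SQLite database instead of SQL Server. The column is named "Product Name" as in the sample table, so it is quoted with double quotes, which is SQLite's equivalent of the square brackets. The ORDER BY is an addition to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE TableX (Value INTEGER, "Product Name" TEXT)')
conn.executemany(
    "INSERT INTO TableX VALUES (?, ?)",
    [(300, "Bike"), (400, "Bike"), (300, "Car"), (300, "Car")],
)

# Without GROUP BY, SUM(Value) collapses the whole table into one row;
# with it, one total is produced per distinct product name.
totals = conn.execute(
    """
    SELECT SUM(Value) AS Total, "Product Name"
    FROM TableX
    GROUP BY "Product Name"
    ORDER BY "Product Name"
    """
).fetchall()
print(totals)
```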
I have the following query:
select column_name, count(column_name)
from table
group by column_name
having count(column_name) > 1;
What would be the difference if I replaced all calls to count(column_name) with count(*)?
This question was inspired by How do I find duplicate values in a table in Oracle?.
To clarify the accepted answer (and maybe my question), replacing count(column_name) with count(*) would return an extra row in the result that contains a null and the count of null values in the column.
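The extra row is easy to reproduce. Here is a small sketch using an in-memory SQLite database rather than Oracle; the table `t` and its contents are hypothetical. The ORDER BY is added only to make the output deterministic, and SQLite sorts NULLs first in ascending order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (column_name TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?)",
    [("a",), ("a",), ("b",), (None,), (None,)],
)

# COUNT(column_name) is 0 for the NULL group, so HAVING filters it out.
with_column = conn.execute(
    """
    SELECT column_name, COUNT(column_name)
    FROM t
    GROUP BY column_name
    HAVING COUNT(column_name) > 1
    ORDER BY column_name
    """
).fetchall()

# COUNT(*) counts the NULL rows too, so the NULL group survives.
with_star = conn.execute(
    """
    SELECT column_name, COUNT(*)
    FROM t
    GROUP BY column_name
    HAVING COUNT(*) > 1
    ORDER BY column_name
    """
).fetchall()

print(with_column)
print(with_star)
```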
count(*) counts NULLs and count(column) does not
[edit] added this code so that people can run it
create table #bla(id int,id2 int)
insert #bla values(null,null)
insert #bla values(1,null)
insert #bla values(null,1)
insert #bla values(1,null)
insert #bla values(null,1)
insert #bla values(1,null)
insert #bla values(null,null)
select count(*),count(id),count(id2)
from #bla
results
7 3 2
Another minor difference, between using * and a specific column, is that in the column case you can add the keyword DISTINCT, and restrict the count to distinct values:
select column_a, count(distinct column_b)
from table
group by column_a
having count(distinct column_b) > 1;
A further and perhaps subtle difference is that in some database implementations the count(*) is computed by looking at the indexes on the table in question rather than the actual data rows. Since no specific column is specified, there is no need to bother with the actual rows and their values (as there would be if you counted a specific column). Allowing the database to use the index data can be significantly faster than making it count "real" rows.
The explanation in the docs helps to explain this:
COUNT(*) returns the number of items in a group, including NULL values and duplicates.
COUNT(expression) evaluates expression for each row in a group and returns the number of nonnull values.
So count(*) includes nulls, the other method doesn't.
We can use the Stack Exchange Data Explorer to illustrate the difference with a simple query. The Users table in Stack Overflow's database has columns that are often left blank, like the user's Website URL.
-- count(column_name) vs. count(*)
-- Illustrates the difference between counting a column
-- that can hold null values, a 'not null' column, and count(*)
select count(WebsiteUrl), count(Id), count(*) from Users
If you run the query above in the Data Explorer, you'll see that the count is the same for count(Id) and count(*) because the Id column doesn't allow null values. The WebsiteUrl count is much lower, though, because that column allows null.
COUNT(*) tells SQL Server to count all the rows in a table, including those containing NULLs.
COUNT(column_name) counts only the rows that have a non-null value in that column.
Please see the following test code, run on SQL Server 2008:
-- Table variable
DECLARE @Table TABLE
(
CustomerId int NULL
, Name nvarchar(50) NULL
)
-- Insert some records for tests
INSERT INTO @Table VALUES( NULL, 'Pedro')
INSERT INTO @Table VALUES( 1, 'Juan')
INSERT INTO @Table VALUES( 2, 'Pablo')
INSERT INTO @Table VALUES( 3, 'Marcelo')
INSERT INTO @Table VALUES( NULL, 'Leonardo')
INSERT INTO @Table VALUES( 4, 'Ignacio')
-- Count all the rows by indicating *
SELECT COUNT(*) AS 'AllRowsCount'
FROM @Table
-- Count only the rows with a value in CustomerId (excludes NULLs)
SELECT COUNT(CustomerId) AS 'OnlyNotNullCounts'
FROM @Table
COUNT(*) – Returns the total number of records in a table (Including NULL valued records).
COUNT(Column Name) – Returns the total number of Non-NULL records. It means that, it ignores counting NULL valued records in that particular column.
Basically, COUNT(*) counts all the rows in a table, whereas COUNT(COLUMN_NAME) does not: it excludes null values, as everyone here has already answered.
But the most interesting part is query optimization: unless you are doing multiple counts or a complex query, it is better to use COUNT(*) rather than COUNT(COLUMN_NAME). Otherwise, it can noticeably lower your DB performance when dealing with a large amount of data.
Further elaborating on the answers given by @SQLMenace and @Brannon, here is an example that makes use of the GROUP BY clause, which the OP mentioned but which is not present in the answer by @SQLMenace.
CREATE TABLE table1 (
id INT
);
INSERT INTO table1 VALUES
(1),
(2),
(NULL),
(2),
(NULL),
(3),
(1),
(4),
(NULL),
(2);
SELECT * FROM table1;
+------+
| id |
+------+
| 1 |
| 2 |
| NULL |
| 2 |
| NULL |
| 3 |
| 1 |
| 4 |
| NULL |
| 2 |
+------+
10 rows in set (0.00 sec)
SELECT id, COUNT(*) FROM table1 GROUP BY id;
+------+----------+
| id | COUNT(*) |
+------+----------+
| 1 | 2 |
| 2 | 3 |
| NULL | 3 |
| 3 | 1 |
| 4 | 1 |
+------+----------+
5 rows in set (0.00 sec)
Here, COUNT(*) counts the number of occurrences of each type of id including NULL
SELECT id, COUNT(id) FROM table1 GROUP BY id;
+------+-----------+
| id | COUNT(id) |
+------+-----------+
| 1 | 2 |
| 2 | 3 |
| NULL | 0 |
| 3 | 1 |
| 4 | 1 |
+------+-----------+
5 rows in set (0.00 sec)
Here, COUNT(id) counts the number of occurrences of each type of id but does not count the number of occurrences of NULL
SELECT id, COUNT(DISTINCT id) FROM table1 GROUP BY id;
+------+--------------------+
| id | COUNT(DISTINCT id) |
+------+--------------------+
| NULL | 0 |
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 1 |
+------+--------------------+
5 rows in set (0.00 sec)
Here, COUNT(DISTINCT id) counts the number of occurrences of each type of id only once (does not count duplicates) and also does not count the number of occurrences of NULL
It is best to use
Count(1) in place of a column name or *
to count the number of rows in a table. It is faster than any other format because it never has to check whether the column name exists in the table.
There is no difference if you count a single fixed column of your table; if you want to count over more than one column, you have to specify which columns you need to count.
As mentioned in the previous answers, Count(*) counts even the NULL columns, whereas count(Columnname) counts only if the column has values.
It's always best practice to avoid * (Select *, count *, …)