Teradata: Results with duplicate values converted into comma delimited strings - sql

I have a typical table where each row represents a customer-product holding. If a customer has multiple products, there will be multiple rows with the same customer ID. I'm trying to roll this up so that each customer is represented by a single row, with all product codes concatenated together in a single comma-delimited string. For example, rows (1, 'A'), (1, 'B') and (1, 'C') should become the single row (1, 'A,B,C').
After googling this, I managed to get it to work using the XMLAGG function, but this only worked on a small sample of data; when scaled up, Teradata complained about running out of 'spool space', so I figure it's not very efficient.
Does anyone know how to efficiently achieve this?
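For reference, the XMLAGG approach the question describes typically looks like this (a sketch; cust_id, product_code and cust_products are assumed names, not taken from the post):
SELECT cust_id,
       TRIM(TRAILING ',' FROM
            (XMLAGG(product_code || ',' ORDER BY product_code) (VARCHAR(10000)))) AS products
FROM cust_products
GROUP BY cust_id;
The XML intermediate representation is what tends to blow up spool space on larger tables.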

Newer versions of Teradata support NPath, which can be used for this. You have to get used to the syntax; it's a Table Operator :-)
E.g. this returns the column list for each table in your system:
SELECT *
FROM NPath(
   ON (SELECT databasename, tablename, columnname, columnid
       FROM dbc.columnsV
      ) AS dt                                -- input data
      PARTITION BY databasename, tablename   -- group by columns
      ORDER BY columnid                      -- order within list
   USING
      MODE (NonOverlapping)                  -- required syntax
      Symbols (True AS F)                    -- every row
      Pattern ('F*')                         -- is returned
      RESULT (First (databasename OF F) AS DatabaseName,  -- group by column
              First (tablename OF F) AS TableName,        -- group by column
              Count (* OF F) AS Cnt,
              Accumulate (Translate(columnname USING unicode_to_latin) OF ANY (F)) AS ListAgg
             )
   );
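Adapted to the question's customer/product roll-up, the same pattern would look something like this (a sketch; cust_id, product_code and cust_products are assumed names):
SELECT *
FROM NPath(
   ON (SELECT cust_id, product_code
       FROM cust_products
      ) AS dt
      PARTITION BY cust_id
      ORDER BY product_code
   USING
      MODE (NonOverlapping)
      Symbols (True AS F)
      Pattern ('F*')
      RESULT (First (cust_id OF F) AS cust_id,
              Accumulate (product_code OF ANY (F)) AS products
             )
   );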
Should be waaaaaay better than XMLAgg.

Related

dynamically cast() values to string and unpivot in BigQuery

I have tables (of different schemas) that consist of numerous rows (millions) with a unique id and at least 100-200 columns of various data types (INT64, STRING, DATETIME, FLOAT64, etc.). I need to unpivot the columns to rows dynamically and display the pertaining values (including null values) in the next column. I need this only for data related to a selected id.
Here is an example of what I need, and an idea of how the tables look and the final result.
I wrote this code but I am getting the following error:
"Query error: The datatype of column does not match with other datatypes in the IN clause. Expected STRING, Found INT64 at [4:74]"
The code I wrote:
declare myup string;
set myup = (
  select concat('(', string_agg(column_name, ','), ')')
  from (select distinct column_name
        from `abc-def-bigquery-ghi.dataset_info.INFORMATION_SCHEMA.COLUMNS`
        where table_name = 'table_1'
        and column_name not in ("id")
       )
);
execute immediate format("""
select * from `abc-def-bigquery-ghi.dataset_info.table_1`
unpivot
(values for column_name in %s)""", myup);
It is not possible to explicitly cast each column by name into string since some tables have up to 200 columns.
Null values also need to be displayed in final result since this needs to then be visualized on Google Data Studio.
Any ideas on how to solve this are highly appreciated.
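One direction worth trying (a sketch, not from the original post, reusing the same project and dataset names): cast every column to STRING in a dynamically built subquery so the UNPIVOT IN list sees a single datatype, and use INCLUDE NULLS to keep the null values:
declare casts string;
declare collist string;
-- build a "cast(col as string) as col" list and a plain column list
set casts = (
  select string_agg(format('cast(%s as string) as %s', column_name, column_name), ', ')
  from `abc-def-bigquery-ghi.dataset_info.INFORMATION_SCHEMA.COLUMNS`
  where table_name = 'table_1' and column_name not in ("id")
);
set collist = (
  select concat('(', string_agg(column_name, ','), ')')
  from `abc-def-bigquery-ghi.dataset_info.INFORMATION_SCHEMA.COLUMNS`
  where table_name = 'table_1' and column_name not in ("id")
);
execute immediate format("""
select * from (select id, %s from `abc-def-bigquery-ghi.dataset_info.table_1`)
unpivot include nulls (value for column_name in %s)""", casts, collist);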

Query on result of Hive's Describe

In Hue/Hive,
Describe mytablename;
gives the list of columns, their types and comments. Is there any way to query in Hive, treating the result of describe as a table?
For example, I want to count the number of numeric/character/specific-type columns, filter column names, get the total number of columns (currently this requires scrolling down 100 rows at a time, which is a hassle with 1000+ columns), etc.
Queries such as
select count(*) from (Describe mytablename);
select count(*) from (select * from describe mytablename);
are of course invalid.
Any ideas?
You can create a SQL file, hive.sql, containing "describe dbname.tablename", then run it and redirect the output to a file:
hive -f hive.sql > /path/file.txt
Create a table to hold that output:
create table dbname.desc
(
name String,
type String,
desc String
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
Then load the file into it:
load data local inpath '/path/file.txt' into table dbname.desc;

Is it possible to avoid specifying a column list in a SQL Server CTE?

Is it possible to avoid specifying a column list in a SQL Server CTE?
I'd like to create a CTE from a table that has many columns so that the structure is identical. There probably is a way to accomplish this without relisting every column name.
I've tried (unsuccessfully):
with pay_cte as
(select * from payments)
select * from pay_cte
I'm encouraged in my quest by this statement in the msdn documentation:
The list of column names is optional only if distinct names for all resulting columns are supplied in the query definition.
https://msdn.microsoft.com/en-us/library/ms175972.aspx
Yes, assuming you mean that you don't have to name every column in the with cte(Col1, Col2) as section.
You can easily try this yourself with a very simple test query along the lines of:
with cte as
(
select *
from sys.tables
)
select *
from cte

Count the number of attributes that are NULL for a row

I want to add a new column to a table to record the number of attributes whose value is null for each tuple (row). How can I use SQL to get that number?
For example, if a tuple is like this:
Name | Age | Sex
-----+-----+-----
Blice| 100 | null
I want to update the tuple as this:
Name | Age | Sex | nNULL
-----+-----+-----+--------
Blice| 100 | null| 1
Also, because I'm writing a PL/pgSQL function and the table name is obtained from an argument, I don't know the schema of the table beforehand. That means I need to update the table given only the input table name. Does anyone know how to do this?
Possible without spelling out columns. Unpivot columns to rows and count.
The aggregate function count(<expression>) only counts non-null values, while count(*) counts all rows. The shortest and fastest way to count NULL values for more than a few columns is count(*) - count(col) ...
Works for any table with any number of columns of any data types.
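Spelled out for the example table, the per-column building block looks like this (the per-row solutions below apply the same principle to the columns of a single, unpivoted row):
SELECT count(*) - count(name) AS name_nulls
     , count(*) - count(age)  AS age_nulls
     , count(*) - count(sex)  AS sex_nulls
FROM tbl;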
In Postgres 9.3+ with built-in JSON functions:
SELECT *, (SELECT count(*) - count(v)
           FROM json_each_text(row_to_json(t)) x(k,v)) AS ct_nulls
FROM tbl t;
What is x(k,v)?
json_each_text() returns a set of rows with two columns. Default column names are key and value as can be seen in the manual where I linked. I provided table and column aliases so we don't have to rely on default names. The second column is named v.
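A standalone example makes the two columns visible (using the question's sample values):
SELECT * FROM json_each_text('{"name":"Blice","age":100,"sex":null}') x(k,v);
This returns three rows; v is NULL in the row for sex, which is exactly why count(v) skips it.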
Or, in any Postgres version since at least 8.3 with the additional module hstore installed, even shorter and a bit faster:
SELECT *, (SELECT count(*) - count(v) FROM svals(hstore(t)) v) AS ct_nulls
FROM tbl t;
This simpler version only returns a set of single values. I only provide a simple alias v, which is automatically taken to be table and column alias.
Best way to install hstore on multiple schemas in a Postgres database?
Since the additional column is functionally dependent, I would consider not persisting it in the table at all. Rather, compute it on the fly as demonstrated above, or create a tiny function with a polymorphic input type for the purpose:
CREATE OR REPLACE FUNCTION f_ct_nulls(_row anyelement)
RETURNS int LANGUAGE sql IMMUTABLE PARALLEL SAFE AS
'SELECT (count(*) - count(v))::int FROM svals(hstore(_row)) v';
(PARALLEL SAFE only for Postgres 9.6 or later.)
Then:
SELECT *, f_ct_nulls(t) AS ct_nulls
FROM tbl t;
You could wrap this into a VIEW ...
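For example (a sketch wrapping the function from above):
CREATE VIEW tbl_with_null_count AS
SELECT *, f_ct_nulls(t) AS ct_nulls
FROM tbl t;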
This should also answer your second question:
... the table name is obtained from argument, I don't know the schema of a table beforehand. That means I need to update the table with the input table name.
In Postgres, you can express this as:
select t.*,
       ((name is null)::int +
        (age is null)::int +
        (sex is null)::int
       ) as numnulls
from tbl t;
In order to implement this on an unknown table, you will need to use dynamic SQL and obtain a list of columns (say, from information_schema.columns).
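A minimal sketch of building that expression from information_schema.columns (using the tbl table from above):
SELECT string_agg(format('(%I is null)::int', column_name), ' + ')
FROM information_schema.columns
WHERE table_schema = 'public'
AND table_name = 'tbl';
The result can then be spliced into a dynamic UPDATE, as the function below demonstrates.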
Function to add column automatically
This is an audited version of what @winged panther posted, per request.
The function adds a column with given name to any existing table that the calling role has the necessary privileges for:
CREATE OR REPLACE FUNCTION f_add_null_count(_tbl regclass, _newcol text)
  RETURNS void AS
$func$
BEGIN
   -- add new col
   EXECUTE format('ALTER TABLE %s ADD COLUMN %I smallint', _tbl, _newcol);

   -- update new col with dynamic count of nulls
   EXECUTE (
      SELECT format('UPDATE %s SET %I = (', _tbl, _newcol)  -- regclass used as text
          || string_agg(quote_ident(attname), ' IS NULL)::int + (')
          || ' IS NULL)::int'
      FROM   pg_catalog.pg_attribute
      WHERE  attnum > 0
      AND    NOT attisdropped
      AND    attrelid = _tbl     -- regclass used as OID
      AND    attname <> _newcol  -- no escaping here, it's the *text*!
      );
END
$func$ LANGUAGE plpgsql;
How to treat identifiers properly
Sanitize identifiers with cast to regclass, format() with %I or quote_ident().
I am using all three techniques in the example; each happens to be the best choice where it is used. More here:
Table name as a PostgreSQL function parameter
Other points
I am basing my query on pg_catalog.pg_attribute, but that's an optional decision with pros and cons. It makes my query simpler and faster because I can use the OID of the table. Related:
How to check if a table exists in a given schema
Select columns with particular column names in PostgreSQL
You have to exclude the newly added column from the count, or the count will be off by one.
Using data type smallint for the count, since there cannot be more than 1600 columns in a table.
I don't use a variable but execute the result of the SELECT statement directly. Assignments are comparatively expensive in plpgsql. Not a big deal, though. Also a matter of taste and style.
I make it a habit to prepend parameters and variables with an underscore (_tbl) to rule out ambiguity between variables and column names.
I just created a function to fulfill the OP's requirement using Gordon Linoff's answer, with the following table and data:
Table det:
CREATE TABLE det (
name text,
age integer,
sex text
);
Data:
insert into det (name,age,sex) values
('Blice',100,NULL),
('Glizz',NULL,NULL),
(NULL,NULL,NULL);
Function:
create or replace function fn_alter_nulls(tbl text,new_col text) returns void as
$$
declare vals text;
begin
-- dynamically getting list of columns *
select string_agg(format('(%s is null)::int',column_name),'+') into vals
from information_schema.columns
where table_schema='public' and table_name=''||tbl||'' and table_catalog='yourDB_Name';
-- adds new column
execute format('alter table %s add column "%s" int',tbl,new_col);
--updates new column
execute format('update det set %s =(%s)',new_col,vals);
end;
$$
language plpgsql
Function call:
select fn_alter_nulls('det','nnulls');
Since the null count is derived data and simple/cheap to determine at query time, why not create a view:
create view MyTableWithNullCount as
select
*,
case when nullableColumn1 is null then 1 else 0 end +
case when nullableColumn2 is null then 1 else 0 end +
...
case when nullableColumnn is null then 1 else 0 end as nNull
from myTable
And just use the view instead.
This has the upside of not having to write triggers/code to maintain a physical null count column, which will be a bigger headache than this approach.

How to get unique values from each column based on a condition?

I have been trying to find an optimal solution to select the unique values from each column. My problem is that I don't know the column names in advance, since different tables have different numbers of columns. So first I have to find the column names, and I can use the query below to do it:
select column_name from information_schema.columns
where table_name='m0301010000_ds' and column_name like 'c%'
Sample output for column names:
c1, c2a, c2b, c2c, c2d, c2e, c2f, c2g, c2h, c2i, c2j, c2k, ...
Then I would use the returned column names to get the unique/distinct values in each column, not just distinct rows.
I know the simplest (and lousy) way is to write select distinct <column> from table for every single column (around 20-50 times), which is also very time-consuming. Since I can't apply DISTINCT to more than one column at a time this way, I am stuck with this old-school solution.
I am sure there must be a faster and more elegant way to achieve this; I just couldn't figure out how. I will really appreciate any help on this.
You can't just return rows, since distinct values don't go together any more.
You could return arrays, which can be had simpler than you may have expected:
SELECT array_agg(DISTINCT c1)  AS c1_arr
     , array_agg(DISTINCT c2a) AS c2a_arr
     , array_agg(DISTINCT c2b) AS c2b_arr
     , ...
FROM m0301010000_ds;
This returns distinct values per column. One array (possibly big) for each column. All connections between values in columns (what used to be in the same row) are lost in the output.
Build SQL automatically
CREATE OR REPLACE FUNCTION f_build_sql_for_dist_vals(_tbl regclass)
  RETURNS text AS
$func$
SELECT 'SELECT ' || string_agg(format('array_agg(DISTINCT %1$I) AS %1$I_arr', attname)
                             , E'\n     , ' ORDER BY attnum)
    || E'\nFROM   ' || _tbl
FROM   pg_attribute
WHERE  attrelid = _tbl     -- valid, visible table name
AND    attnum >= 1         -- exclude tableoid & friends
AND    NOT attisdropped    -- exclude dropped columns
$func$ LANGUAGE sql;
Call:
SELECT f_build_sql_for_dist_vals('public.m0301010000_ds');
Returns an SQL string as displayed above.
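In psql (9.6 or later) you can feed the generated statement straight back to the server with \gexec (a usage sketch):
SELECT f_build_sql_for_dist_vals('public.m0301010000_ds')\gexec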
I use the system catalog pg_attribute instead of the information schema. And the object identifier type regclass for the table name. More explanation in this related answer:
PLpgSQL function to find columns with only NULL values in a given table
If you need this in "real time", you won't be able to achieve it with a SQL query that has to do a full table scan.
I would advise you to create a separate table containing the distinct values for each column (initialized with the SQL from @Erwin Brandstetter ;) and maintain it using a trigger on the original table.
Your new table will have one column per field; its number of rows will equal the maximum number of distinct values for a single field. A sketch of such a trigger follows after this list.
On insert: for each maintained field, check whether the value is already there. If not, add it.
On update: for each maintained field whose old value differs from the new value, check whether the new value is already there. If not, add it. For the old value, check whether any other row still has it, and if not, remove it from the list (set the field to null).
On delete: for each maintained field, check whether any other row still has the value, and if not, remove it from the list (set the value to null).
This way the load is mainly moved to the trigger, and queries on the value-list table will be super fast.
P.S.: Make sure to run all the SQL in your triggers through EXPLAIN to confirm it uses the best index and execution plan possible. For update/delete, just check whether the old value exists (LIMIT 1).
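As a sketch of the insert case (hypothetical names, only one maintained field shown; not from the original answer):
-- value-list table: one column per maintained field
CREATE TABLE distinct_vals (c1 text);

CREATE OR REPLACE FUNCTION trg_maintain_distinct_vals()
  RETURNS trigger AS
$$
BEGIN
   -- add the new value to the list if it is not there yet
   IF NOT EXISTS (SELECT 1 FROM distinct_vals WHERE c1 = NEW.c1) THEN
      INSERT INTO distinct_vals (c1) VALUES (NEW.c1);
   END IF;
   RETURN NEW;
END
$$ LANGUAGE plpgsql;

CREATE TRIGGER maintain_distinct_vals
AFTER INSERT ON m0301010000_ds
FOR EACH ROW EXECUTE PROCEDURE trg_maintain_distinct_vals();
The update and delete cases follow the same shape, with the additional "is any other row still using the old value" check.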