Return column name and distinct values - sql

Say I have a simple table in postgres as the following:
+--------+--------+----------+
| Car | Pet | Name |
+--------+--------+----------+
| BMW | Dog | Sam |
| Honda | Cat | Mary |
| Toyota | Dog | Sam |
| ... | ... | ... |
I would like to run a sql query that could return the column name in the first column and unique values in the second column. For example:
+--------+--------+
| Col | Vals |
+--------+--------+
| Car | BMW |
| Car | Toyota |
| Car | Honda |
| Pet | Dog |
| Pet | Cat |
| Name | Sam |
| Name | Mary |
| ... | ... |
I found a bit of code that can be used to return all of the unique values from multiple fields into one column:
-- Query 4b. (104 ms, 128 ms)
select distinct unnest( array_agg(a)||
array_agg(b)||
array_agg(c)||
array_agg(d) )
from t ;
But I don't understand the code well enough to know how to append the column name into another column.
I also found a query that can return the column names in a table. Maybe a sub-query of this in combination with the "Query 4b" shown above?

SQL Fiddle
SELECT distinct
unnest(array['car', 'pet', 'name']) AS col,
unnest(array[car, pet, name]) AS vals
FROM t
order by col

It's bad style to put set-returning functions in the SELECT list and not allowed in the SQL standard. Postgres supports it for historical reasons, but since LATERAL was introduced Postgres 9.3, it's largely obsolete. We can use it here as well:
SELECT x.col, x.val
FROM tbl, LATERAL (VALUES ('car', car)
, ('pet', pet)
, ('name', name)) x(col, val)
GROUP BY 1, 2
ORDER BY 1, 2;
You'll find this LATERAL (VALUES ...) technique discussed under the very same question on dba.SE you already found yourself. Just don't stop reading at the first answer.
Up until Postgres 9.4 there was still an exception: "parallel unnest" required to combine multiple set-returning functions in the SELECT list. Postgres 9.4 brought a new variant of unnest() to remove that necessity, too. More:
Unnest multiple arrays in parallel
The new function is also does not derail into a Cartesian product if the number of returned rows should not be exactly the same for all set-returning functions in the SELECT list, which is (was) a very odd behavior. The new syntax variant should be preferred over the now outdated one:
SELECT DISTINCT x.*
FROM tbl t, unnest('{car, pet, name}'::text[]
, ARRAY[t.car, t.pet, t.name]) AS x(col, val)
ORDER BY 1, 2;
Also shorter and faster than two unnest() calls in parallel.
Returns:
col | val
------+--------
car | BMW
car | Honda
car | Toyota
name | Mary
name | Sam
pet | Cat
pet | Dog
DISTINCT or GROUP BY, either is good for the task.

With JSON functions
row_to_json() and json_each_text() you can do it not specifying number and names of columns:
select distinct key as col, value as vals
from (
select row_to_json(t) r
from a_table t
) t,
json_each_text(r)
order by 1, 2;
SqlFiddle.

Related

How to get the mode of a text[] in SQL?

In a table, I have a column of type text[]. I want to extract the most frequent string in each row. How can I do that?
Trivial example:
id | fruit
----------------------------------
10 | ['apple','pear','apple']
20 | ['pear','pear','banana']
30 | ['pineapple','apple','apple']
After running the query I would like to have:
id | fruit | mode
-----------------------------------------
10 | ['apple','pear','apple'] | apple
20 | ['pear','pear','banana'] | pear
30 | ['pineapple','apple','apple']| apple
You can use a scalar sub-query after unnesting the elements:
select *,
(select mode() within group (order by u.word)
from unnest(u.fruit) as u(word)) as mode
from the_table t
This assumes that fruit is a text[] column. If it's a json or jsonb in reality, you need to use json_array_elements_text() instead of unnest.
If you need that a lot, you can create a function for that.

SQL Server stored procedure inserting duplicate rows

I have a table with column GetDup and I'd like to the duplicate records based on the value of this column. For example, if value on is 1 in GetDup, then duplicate the record once. If value in the column is 2, then duplicate the record twice and so on and the statement has to be in looping statement.
What will be a good way to write a stored procedures for this? Please help.
Input:
+--------+--------------+---------------+
| Getdup | CustomerName | CustomerAdd |
+--------+--------------+---------------+
| 1 | John | 123 SomeWhere |
| 2 | Bob | 987 SomeWhere |
+--------+--------------+---------------+
What I want:
+--------+--------------+---------------+
| Getdup | CustomerName | CustomerAdd |
+--------+--------------+---------------+
| 1 | John | 123 SomeWhere |
| 1 | John | 123 SomeWhere |
| 2 | Bob | 987 SomeWhere |
| 2 | Bob | 987 SomeWhere |
| 2 | Bob | 987 SomeWhere |
+--------+--------------+---------------+
picture of data
Answer #2 After Clarification
Number Table to the Rescue!
The number table in my example (or tally table, if you want to call it that), is both temporary and very small. To make it bigger, just add more values to z and add more CROSS JOINs. In my opinion, a number table and a calendar table are both things that should be in every database you have. They are extremely useful.
SQL Fiddle
MS SQL Server 2017 Schema Setup:
CREATE TABLE mytable ( Getdup int, CustomerName varchar(10), CustomerAdd varchar(20) ) ;
INSERT INTO mytable (Getdup, CustomerName, CustomerAdd)
VALUES (1,'John','123 SomeWhere'), (2,'Bob','987 SomeWhere')
;
Query 1:
;WITH z AS (
SELECT *
FROM ( VALUES(0),(0),(0),(0) ) v(x)
)
, numTable AS (
SELECT num
FROM (
SELECT ROW_NUMBER() OVER (ORDER BY z1.x)-1 num
FROM z z1
CROSS JOIN z z2
) s1
)
SELECT t1.Getdup, t1.CustomerName, t1.CustomerAdd
FROM mytable t1
INNER JOIN numTable ON t1.getdup >= numTable.num
ORDER BY CustomerName, CustomerAdd
Results:
| Getdup | CustomerName | CustomerAdd |
|--------|--------------|---------------|
| 2 | Bob | 987 SomeWhere |
| 2 | Bob | 987 SomeWhere |
| 2 | Bob | 987 SomeWhere |
| 1 | John | 123 SomeWhere |
| 1 | John | 123 SomeWhere |
--------------------------------------------------------------------------
ORIGINAL ANSWER
EDIT: After further clarification of the problem, this won't duplicate rows, this will only duplicate the data in a column.
Something like one of these might work.
T-SQL
SELECT replicate(mycolumn,getdup) AS x
FROM mytable
MySQL
SELECT repeat(mycolumn,getdup) AS x
FROM mytable
Oracle SQL
SELECT rpad(mycolumn,getdup*length(mycolumn),mycolumn) AS x
FROM mytable
PostgreSQL
SELECT repeat(mycolumn,getdup+1) AS x
FROM mytable
If you can provide more details for exactly what you want and what you're working with, we might be able to help you better.
NOTE 2: Depending on what you need, you may need to do some math magic. You say above if GetDup is 1 then you want one duplicate. If that means that your output should be GetDup``GetDup, then you'll want to add one in the repeat(),replicate() or rpad() functions. ie replicate(mycolumn,getdup+1). Oracle SQL will be a little different, since it uses rpad().
In standard SQL you can use a recursive CTE:
with recursive cte as (
select t.dup, . . .
from t
union all
select cte.dup - 1, . . .
from cte
where cte.dup > 1
)
select *
from cte;
Of course, not all databases support recursive CTEs (and the recursive keyword is not used in some of them).
So, you want recursive solution :
with t as (
select Getdup, CustomerName, CustomerAdd, 0 as id
from table
union all
select Getdup, CustomerName, CustomerAdd, id + 1
from t
where id < getdup
)
insert into table (col1, col2, col3)
select Getdup, CustomerName, CustomerAdd
from t
order by getdup
option (maxrecursion 0);

Use something like TOP with GROUP BY

With table table1 like below
+--------+-------+-------+------------+-------+
| flight | orig | dest | passenger | bags |
+--------+-------+-------+------------+-------+
| 1111 | sfo | chi | david | 3 |
| 1112 | sfo | dal | david | 7 |
| 1112 | sfo | dal | kim | 10|
| 1113 | lax | san | ameera | 5 |
| 1114 | lax | lfr | tim | 6 |
| 1114 | lax | lfr | jake | 8 |
+--------+-------+-------+------------+-------+
I'm aggregating the table by orig like below
select
orig
, count(*) flight_cnt
, count(distinct passenger) as pass_cnt
, percentile_cont(0.5) within group ( order by bags ASC) as bag_cnt_med
from table1
group by orig
I need to add the passenger with the longest name ( length(passenger) ) for each orig group - how do I go about it?
Output expected
+------+-------------+-----------+---------------+-------------------+
| orig | flight_cnt | pass_cnt | bags_cnt_med | pass_max_len_name |
+------+-------------+-----------+---------------+-------------------+
| sfo | 3 | 2 | 7 | david |
| lax | 3 | 3 | 6 | ameera |
+------+-------------+-----------+---------------+-------------------+
You can conveniently retrieve the passenger with the longest name per group with DISTINCT ON.
Select first row in each GROUP BY group?
But I see no way to combine that (or any other simple way) with your original query in a single SELECT. I suggest to join two separate subqueries:
SELECT *
FROM ( -- your original query
SELECT orig
, count(*) AS flight_cnt
, count(distinct passenger) AS pass_cnt
, percentile_cont(0.5) WITHIN GROUP (ORDER BY bags) AS bag_cnt_med
FROM table1
GROUP BY orig
) org_query
JOIN ( -- my addition
SELECT DISTINCT ON (orig) orig, passenger AS pass_max_len_name
FROM table1
ORDER BY orig, length(passenger) DESC NULLS LAST
) pas USING (orig);
USING in the join clause conveniently only outputs one instance of orig, so you can simply use SELECT * in the outer SELECT.
If passenger can be NULL, it is important to add NULLS LAST:
PostgreSQL sort by datetime asc, null first?
From multiple passenger names with the same maximum length in the same group, you get an arbitrary pick - unless you add more expressions to ORDER BY as tiebreaker. Detailed explanation in the answer linked above.
Performance?
Typically, a single scan is superior, especially with sequential scans.
The above query uses two scans (maybe index / index-only scans). But the second scan is comparatively cheap unless the table is too huge to fit in cache (mostly). Lukas suggested an alternative query with only a single SELECT adding:
, (ARRAY_AGG (passenger ORDER BY LENGTH (passenger) DESC))[1] -- I'd add NULLS LAST
The idea is smart, but last time I tested, array_agg with ORDER BY did not perform so well. (The overhead of per-group ORDER BY is substantial, and array handling is expensive, too.)
The same approach can be cheaper with a custom aggregate function first() like instructed in the Postgres Wiki here. Or, faster, yet, with a version written in C, available on PGXN. Eliminates the extra cost for array handling, but we still need per-group ORDER BY. May be faster for only few groups. You would then add:
, first(passenger ORDER BY length(passenger) DESC NULLS LAST)
Gordon and Lukas also mention the window function first_value(). Window functions are applied after aggregate functions. To use it in the same SELECT, we would need to aggregate passenger somehow first - catch 22. Gordon solves this with a subquery - another candidate for good performance with standard Postgres.
first() does the same without subquery and should be simpler and a bit faster. But it still won't be faster than a separate DISTINCT ON for most cases with few rows per group. For lots of rows per group, a recursive CTE technique is typically faster. There are yet faster techniques if you have a separate table holding all relevant, unique orig values. Details:
Optimize GROUP BY query to retrieve latest record per user
The best solution depends on various factors. The proof of the pudding is in the eating. To optimize performance you have to test with your setup. The above query should be among the fastest.
One method uses the window function first_value(). Unfortunately, this is not available as an aggregation function:
select orig,
count(*) flight_cnt,
count(distinct passenger) as pass_cnt,
percentile_cont(0.5) within group ( order by bags ASC) as bag_cnt_med,
max(longest_name) as longest_name
from (select t1.*,
first_value(name) over (partition by orig order by length(name) desc) as longest_name
from table1
) t1
group by orig;
You are looking for something like Oracle's KEEP FIRST/LAST where you get a value (the passenger name) according to an aggregate (the name length). PostgreSQL doesn't have such function as far as I know.
One way to go about this is a trick: Combine length and name, get the maximum, then extract the name: '0005david' > '0003kim' etc.
select
orig
, count(*) flight_cnt
, count(distinct passenger) as pass_cnt
, percentile_cont(0.5) within group ( order by bags ASC) as bag_cnt_med,
, substr(max(to_char(char_length(passenger), '0000') || passenger), 5) as name
from table1
group by orig
order by orig;
For small group sizes, you could use array_agg()
SELECT
orig
, COUNT (*) AS flight_cnt
, COUNT (DISTINCT passenger) AS pass_cnt
, PERCENTILE_CONT (0.5) WITHIN GROUP (ORDER BY bags ASC) AS bag_cnt_med
, (ARRAY_AGG (passenger ORDER BY LENGTH (passenger) DESC))[1] AS pass_max_len_name
FROM table1
GROUP BY orig
Having said so, while this is shorter syntax, a first_value() window function based approach might be faster for larger data sets as array accumulation might become expensive.
bot it does not solve problem if you have several names wqith same length:
t=# with p as (select distinct orig,passenger,length(trim(passenger)),max(length(trim(passenger))) over (partition by orig) from s127)
, o as ( select
orig
, count(*) flight_cnt
, count(distinct passenger) as pass_cnt
, percentile_cont(0.5) within group ( order by bags ASC) as bag_cnt_med
from s127
group by orig)
select distinct o.*,p.passenger from o join p on p.orig = o.orig where max=length;
orig | flight_cnt | pass_cnt | bag_cnt_med | passenger
---------+------------+----------+-------------+--------------
lax | 3 | 3 | 6 | ameera
sfo | 3 | 2 | 7 | david
(2 rows)
populate:
t=# create table s127(flight int,orig text,dest text, passenger text, bags int);
CREATE TABLE
Time: 52.678 ms
t=# copy s127 from stdin delimiter '|';
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 1111 | sfo | chi | david | 3
>> 1112 | sfo | dal | david | 7
1112 | sfo | dal | kim | 10
1113 | lax | san | ameera | 5
1114 | lax | lfr | tim | 6
1114 | lax | lfr | jake | 8 >> >> >> >>
>> \.
COPY 6

Splitting a string column in BigQuery

Let's say I have a table in BigQuery containing 2 columns. The first column represents a name, and the second is a delimited list of values, of arbitrary length. Example:
Name | Scores
-----+-------
Bob |10;20;20
Sue |14;12;19;90
Joe |30;15
I want to transform into columns where the first is the name, and the second is a single score value, like so:
Name,Score
Bob,10
Bob,20
Bob,20
Sue,14
Sue,12
Sue,19
Sue,90
Joe,30
Joe,15
Can this be done in BigQuery alone?
Good news everyone! BigQuery can now SPLIT()!
Look at "find all two word phrases that appear in more than one row in a dataset".
There is no current way to split() a value in BigQuery to generate multiple rows from a string, but you could use a regular expression to look for the commas and find the first value. Then run a similar query to find the 2nd value, and so on. They can all be merged into only one query, using the pattern presented in the above example (UNION through commas).
Trying to rewrite Elad Ben Akoune's answer in Standart SQL, the query becomes like this;
WITH name_score AS (
SELECT Name, split(Scores,';') AS Score
FROM (
(SELECT * FROM (SELECT 'Bob' AS Name ,'10;20;20' AS Scores))
UNION ALL
(SELECT * FROM (SELECT 'Sue' AS Name ,'14;12;19;90' AS Scores))
UNION ALL
(SELECT * FROM (SELECT 'Joe' AS Name ,'30;15' AS Scores))
))
SELECT name, score
FROM name_score
CROSS JOIN UNNEST(name_score.score) AS score;
And this outputs;
+------+-------+
| name | score |
+------+-------+
| Bob | 10 |
| Bob | 20 |
| Bob | 20 |
| Sue | 14 |
| Sue | 12 |
| Sue | 19 |
| Sue | 90 |
| Joe | 30 |
| Joe | 15 |
+------+-------+
If someone is still looking for an answer
select Name,split(Scores,';') as Score
from (
# replace the inner custome select with your source table
select *
from
(select 'Bob' as Name ,'10;20;20' as Scores),
(select 'Sue' as Name ,'14;12;19;90' as Scores),
(select 'Joe' as Name ,'30;15' as Scores)
);

Sybase SQL Select Distinct Based on Multiple Columns with an ID

I'm trying to query a sybase server to get examples of different types of data we hold for testing purposes.
I have a table that looks like the below (abstracted)
Animals table:
id | type | breed | name
------------------------------------
1 | dog | german shepard | Bernie
2 | dog | german shepard | James
3 | dog | husky | Laura
4 | cat | british blue | Mr Fluffles
5 | cat | other | Laserchild
6 | cat | british blue | Sleepy head
7 | fish | goldfish | Goldie
As I mentioned I want an example of each type so for the above table would like a results set like (in reality I just want the ID's):
id | type | breed
---------------------------
1 | dog | german shepard
3 | dog | husky
4 | cat | british blue
5 | cat | other
7 | fish | goldfish
I've tried multiple combinations of queries such as the below but they are either invalid SQL (for sybase) or return invalid results
SELECT id, DISTINCT ON type, breed FROM animals
SELECT id, DISTINCT(type, breed) FROM animals
SELECT id FROM animals GROUP BY type, breed
I've found other questions such as SELECT DISTINCT on one column but this only deal with one column
Do you have any idea how to implement this query?
Maybe you have to use aggregate function max or min for column ID. It will return only one ID for grouped columns.
select max(Id), type, breed
from animals
group by type, breed
EDIT:
Other different ways to do it:
With having and aggregate function
select id, type, breed
from animals
group by type, breed
having id = max(Id)
With having and aggregate subquery
select id, type, breed
from animals a1
group by type, breed
having id = (
select max(id)
from animals a2
where a2.type = a1.type
and a2.breed = a1.breed
)
Try this and let me know if it works:
select distinct breed, max(id) as id , max(type) as type
from animals
You may have to play around with max()
The arbitrary choice here is max(), but you could arbitrarily use min() instead.
max() returns the largest value for that columns, min() the smallest