In SQL (PSQL), how to group by partitions of rows (how to do nested group by)?

The wording of this question needs improvement; I'm not sure how to describe the problem accurately.
Given a table foo, count how many languages each person can speak, grouped by format. Example:
name | format | language
------+----------+------------
abe | compiled | golang
abe | compiled | c
abe | scripted | javascript
jon | scripted | ruby
jon | scripted | javascript
wut | spoken | english
(6 rows)
Result:
name | format | count
------+----------+------------
abe | compiled | 2
abe | scripted | 1
jon | scripted | 2
wut | spoken | 1
Example data can be created using:
create table foo
(
    name     varchar(40) not null,
    format   varchar(40) not null,
    language varchar(40) not null
);

insert into foo
values
    ( 'abe', 'compiled', 'golang' ),
    ( 'abe', 'compiled', 'c' ),
    ( 'abe', 'scripted', 'javascript' ),
    ( 'jon', 'scripted', 'ruby' ),
    ( 'jon', 'scripted', 'javascript' ),
    ( 'wut', 'spoken', 'english' );
I've tried the window function count(*) over (partition by format), but it doesn't collapse rows, and it would require nesting a window by name and then by format; count(*) ... group by name on its own, meanwhile, collapses the result into one row per name.

Use a group by clause:
select name, format, count(*)
from foo
group by name, format;
However, if you want to use a window function, you can do that too:
select distinct name, format,
count(*) over (partition by name, format)
from foo f;
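As an aside, if the "nested group by" in the title also means you want subtotals per name, GROUP BY ROLLUP (available in Postgres 9.5+) produces them in one pass. A minimal sketch against the same table (my addition, not part of the original answer):
select name, format, count(*)
from foo
group by rollup (name, format)
order by name, format;
The rollup adds one subtotal row per name (with format null) plus a grand-total row.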

Related

Can I count the occurrences for a postgres array field?

I have a Postgres table that uses the array data type. It allows some magic that makes it possible to avoid having more tables, but its non-standard nature makes it harder for a beginner to work with.
I would like to get some summary data out of it.
Sample content:
CREATE TABLE public.cts (
    id serial NOT NULL,
    day timestamp NULL,
    ct varchar[] NULL,
    CONSTRAINT ctrlcts_pkey PRIMARY KEY (id)
);
INSERT INTO public.cts (id, day, ct)
VALUES (29, '2015-01-24 00:00:00.000', '{ct286,ct281}');

INSERT INTO public.cts (id, day, ct)
VALUES (30, '2015-01-25 00:00:00.000', '{ct286,ct281}');

INSERT INTO public.cts (id, day, ct)
VALUES (31, '2015-01-26 00:00:00.000', '{ct286,ct277,ct281}');
I would like to get the total occurrences of each array member, with output like this, for example:
name  | value
------+-------
ct286 | 3
ct281 | 3
ct277 | 1
Use the Postgres array function unnest():
SELECT name, COUNT(*) cnt
FROM cts, unnest(ct) as u(name)
GROUP BY name
Demo on DB Fiddle:
| name | cnt |
| ----- | --- |
| ct277 | 1 |
| ct281 | 3 |
| ct286 | 3 |
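The comma between cts and unnest(ct) in the FROM clause is an implicit lateral join; if you prefer spelling that out, this equivalent form (a sketch against the same table, my addition) does the same thing:
SELECT u.name, COUNT(*) AS cnt
FROM cts
CROSS JOIN LATERAL unnest(ct) AS u(name)
GROUP BY u.name
ORDER BY cnt DESC;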

Reuse a query for operations on LIMIT and OFFSET in PostgreSQL

Using PostgreSQL 9.4, I have a table like this:
CREATE TABLE products
AS
SELECT id::uuid, title, kind, created_at
FROM ( VALUES
( '61c5292d-41f3-4e86-861a-dfb5d8225c8e', 'foo', 'standard' , '2017/04/01' ),
( 'def1d3f9-3e55-4d1b-9b42-610d5a46631a', 'bar', 'standard' , '2017/04/02' ),
( 'cc1982ab-c3ee-4196-be01-c53e81b53854', 'qwe', 'standard' , '2017/04/03' ),
( '919c03b5-5508-4a01-a97b-da9de0501f46', 'wqe', 'standard' , '2017/04/04' ),
( 'b3d081a3-dd7c-457f-987e-5128fb93ce13', 'tyu', 'other' , '2017/04/05' ),
( 'c6e9e647-e1b4-4f04-b48a-a4229a09eb64', 'ert', 'irregular', '2017/04/06' )
) AS t(id,title,kind,created_at);
I need to split the data into n equally sized parts. If this table had a sequential id it would be easier, but since it has a uuid I can't use modulo operations (as far as I know).
So far I did this:
SELECT * FROM products
WHERE kind = 'standard'
ORDER BY created_at
LIMIT (
    SELECT count(*)
    FROM products
    WHERE kind = 'standard'
) / 2
OFFSET (
    (
        SELECT count(*)
        FROM products
        WHERE kind = 'standard'
    ) / 2
) * 1;
It works fine, but running the same subquery three times doesn't seem like a good idea; the count isn't expensive, but anyone who wants to modify the query has to change it in all three places.
Note that n is currently set to 2 and the offset multiplier to 1, but both can take other values. Also, LIMIT rounds down, so a row may go missing; I can fix that by other means, but handling it in the query would be nice.
Just to dispel a myth: you can never use a serial and modulus to get parts, because a serial isn't guaranteed to be gapless. You can use row_number() though.
SELECT row_number() OVER () % 3 AS parts, * FROM products;
parts | id | title | kind | created_at
-------+--------------------------------------+-------+-----------+------------
1 | 61c5292d-41f3-4e86-861a-dfb5d8225c8e | foo | standard | 2017/04/01
2 | def1d3f9-3e55-4d1b-9b42-610d5a46631a | bar | standard | 2017/04/02
0 | cc1982ab-c3ee-4196-be01-c53e81b53854 | qwe | standard | 2017/04/03
1 | 919c03b5-5508-4a01-a97b-da9de0501f46 | wqe | standard | 2017/04/04
2 | b3d081a3-dd7c-457f-987e-5128fb93ce13 | tyu | other | 2017/04/05
0 | c6e9e647-e1b4-4f04-b48a-a4229a09eb64 | ert | irregular | 2017/04/06
(6 rows)
This won't give equal parts unless the number of parts divides the row count evenly.
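To address the original goal of not repeating the count subquery: ntile() splits the ordered rows into n buckets whose sizes differ by at most one, so the rounding problem disappears as well. A minimal sketch (my addition, not from the original answer), with n = 2 and the second bucket playing the role of the OFFSET arithmetic:
SELECT id, title, kind, created_at
FROM (
    SELECT *, ntile(2) OVER (ORDER BY created_at) AS part
    FROM products
    WHERE kind = 'standard'
) t
WHERE part = 2;
Changing the bucket count or the part filter changes n and the offset without touching anything else.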

Select Top User over a list of Pages

I have a table containing records of Users' internet history. The table's structure contains the User_ID, the Page Accessed, and the Date Accessed of the page. For example:
+==========================================+
|User_ID | Page_Accessed | Date_Accessed |
+==========================================+
|Johh.Doe | Google | 1/1/2015 |
|Johh.Doe | Google | 1/1/2015 |
|Suzy.Lue | Google | 7/11/2015 |
|Suzy.Lue | Wikipedia | 4/23/2015 |
|Babe Ruth| StackOverflow | 9/1/2015 |
+==========================================+
I am currently trying to use a SQL query that uses:
RANK() OVER (PARTITION BY [Page Accessed] ORDER BY Count(DateAcc))
Then I use a PIVOT() by the various sites. However, after selecting the records WHERE (Num = 1) from the PIVOT() and a GROUP BY [Rank], I'm ending up with a result similar to:
+=================================================+
|Rank | Google | Wikipedia | StackOverflow |
+=================================================+
| 1 | John Doe| NULL | NULL |
| 1 | NULL | Suzy Lue | NULL |
| 1 | NULL | NULL | Babe Ruth |
+=================================================+
Instead I need to reformat my output as:
+=================================================+
|Rank | Google | Wikipedia | StackOverflow |
+=================================================+
| 1 | John Doe| Suzy Lue | Babe Ruth |
+=================================================+
My Current Query:
SELECT Rank, Google, Wikipedia, StackOverflow
FROM(
SELECT TOP (100) PERCENT User_ID, Page_Accessed, COUNT(Date_Accessed) AS Views,
RANK() OVER (PARTITION BY Page_Accessed ORDER BY Count(Date_Accessed) DESC) AS Rank
FROM Record_Table
GROUP BY dbo.location_key.subSite, dbo.user_info_list_parse.Name
ORDER BY Views DESC) AS tb
PIVOT (
max(tb.User_ID) FOR
Page_Accessed IN ( Google, Wikipedia, StackOverflow)
) pvt
WHERE (Num = 1)
Are there any creative solutions to obtain this result?
I think you've already found a solution, but for your information and for others reading this, let me remove the noise from this query. There is no need for the ORDER BY, no need to apply TOP (100) PERCENT, and the Views column is redundant. I would simplify the query as follows:
CREATE TABLE InternetHistory
(
    [User_ID] varchar(20),
    [Page_Accessed] varchar(20),
    [Date_Accessed] datetime
)
INSERT InternetHistory VALUES
('Johh.Doe', 'Google', '2015-01-01'),
('Johh.Doe', 'Google', '2015-01-01'),
('Suzy.Lue', 'Google', '2015-07-11'),
('Suzy.Lue', 'Wikipedia', '2015-04-23'),
('Babe Ruth', 'StackOverflow', '2015-01-09')
SELECT * FROM
(
    SELECT [User_ID], [Page_Accessed],
           RANK() OVER (PARTITION BY [Page_Accessed] ORDER BY COUNT(*) DESC) AS Ranking
    FROM InternetHistory
    GROUP BY [User_ID], [Page_Accessed]
) AS Src
PIVOT
(
    MAX([User_ID]) FOR [Page_Accessed] IN ([Google], [Wikipedia], [StackOverflow])
) AS Pvt
WHERE Ranking = 1

Return column name and distinct values

Say I have a simple table in postgres as the following:
+--------+--------+----------+
| Car | Pet | Name |
+--------+--------+----------+
| BMW | Dog | Sam |
| Honda | Cat | Mary |
| Toyota | Dog | Sam |
| ... | ... | ... |
I would like to run a sql query that could return the column name in the first column and unique values in the second column. For example:
+--------+--------+
| Col | Vals |
+--------+--------+
| Car | BMW |
| Car | Toyota |
| Car | Honda |
| Pet | Dog |
| Pet | Cat |
| Name | Sam |
| Name | Mary |
| ... | ... |
I found a bit of code that can be used to return all of the unique values from multiple fields into one column:
-- Query 4b. (104 ms, 128 ms)
select distinct unnest( array_agg(a)||
array_agg(b)||
array_agg(c)||
array_agg(d) )
from t ;
But I don't understand the code well enough to know how to append the column name into another column.
I also found a query that can return the column names in a table. Maybe a sub-query of this in combination with the "Query 4b" shown above?
SQL Fiddle
SELECT distinct
unnest(array['car', 'pet', 'name']) AS col,
unnest(array[car, pet, name]) AS vals
FROM t
order by col
It's bad style to put set-returning functions in the SELECT list, and it's not allowed in the SQL standard. Postgres supports it for historical reasons, but since LATERAL was introduced in Postgres 9.3, it's largely obsolete. We can use it here as well:
SELECT x.col, x.val
FROM tbl, LATERAL (VALUES ('car', car)
, ('pet', pet)
, ('name', name)) x(col, val)
GROUP BY 1, 2
ORDER BY 1, 2;
You'll find this LATERAL (VALUES ...) technique discussed under the very same question on dba.SE you already found yourself. Just don't stop reading at the first answer.
Up until Postgres 9.4 there was still an exception: "parallel unnest" required combining multiple set-returning functions in the SELECT list. Postgres 9.4 brought a new variant of unnest() to remove that necessity, too. More:
Unnest multiple arrays in parallel
The new function also does not derail into a Cartesian product when the set-returning functions don't all return exactly the same number of rows, which was a very odd behavior of the old form. The new syntax variant should be preferred over the now outdated one:
SELECT DISTINCT x.*
FROM tbl t, unnest('{car, pet, name}'::text[]
, ARRAY[t.car, t.pet, t.name]) AS x(col, val)
ORDER BY 1, 2;
Also shorter and faster than two unnest() calls in parallel.
Returns:
col | val
------+--------
car | BMW
car | Honda
car | Toyota
name | Mary
name | Sam
pet | Cat
pet | Dog
DISTINCT or GROUP BY, either is good for the task.
With the JSON functions row_to_json() and json_each_text() you can do it without specifying the number and names of columns:
select distinct key as col, value as vals
from (
select row_to_json(t) r
from a_table t
) t,
json_each_text(r)
order by 1, 2;
SqlFiddle.
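On more recent Postgres versions the same trick works with the jsonb variants; a minimal sketch (my addition, same a_table as above):
SELECT DISTINCT j.key AS col, j.value AS vals
FROM a_table t
CROSS JOIN LATERAL jsonb_each_text(to_jsonb(t)) AS j(key, value)
ORDER BY 1, 2;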

How to combine certain data from different rows based on a certain column?

I have a table that looks like this:
--------------------------------
| name | email | friend |
--------------------------------
1 | bob | bobs email | kate |
--------------------------------
2 | bob | bobs email | joe |
--------------------------------
3 | tim | tims email | eddie |
How can I create new columns (friend1, friend2, etc.) and move friends there, on the condition that name and email are the same (there might be two bobs, for instance, bob and bob with a different email).
My desired table looks like this:
-----------------------------------------------------
| name | email | friend1 | friend2 | friend3 |
-----------------------------------------------------
1 | bob | bobs email | kate | joe | |
-----------------------------------------------------
2 | tim | tims email | eddie | | |
This can't be done directly, because the query you need has no static metadata (i.e. you don't know the columns in advance), and they might change over time as friends are added. But if you mean that you only ever need three columns for friends, you can use the PIVOT command. You can use the link below as an example:
http://blogs.msdn.com/b/spike/archive/2009/03/03/pivot-tables-in-sql-server-a-simple-sample.aspx
Another solution (which unfortunately is not easily available in SQL Server) is to aggregate the friends, i.e. you will have only one column containing all friends, regardless of their count, separated by commas. This can be achieved using a CLR function (Example: http://www.mssqltips.com/sqlservertip/2022/concat-aggregates-sql-server-clr-function/), a CTE (Example: Optimal way to concatenate/aggregate strings) or FOR XML (Example: Does T-SQL have an aggregate function to concatenate strings?).
Hope this helps...
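For illustration, a minimal sketch of the FOR XML approach mentioned above, assuming a hypothetical dbo.Friends table with the same three columns (on SQL Server 2017+, STRING_AGG does this more directly):
SELECT f1.[name], f1.[email],
       STUFF((SELECT ',' + f2.[friend]
              FROM dbo.Friends f2
              WHERE f2.[name] = f1.[name]
                AND f2.[email] = f1.[email]
              FOR XML PATH('')), 1, 1, '') AS friends
FROM dbo.Friends f1
GROUP BY f1.[name], f1.[email];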
Having these sample data:
DECLARE @DataSource TABLE
(
    [name] VARCHAR(12)
   ,[email] VARCHAR(24)
   ,[friend] VARCHAR(12)
)
INSERT INTO @DataSource ([name], [email], [friend])
VALUES ('bob', 'bobs email', 'kate')
      ,('bob', 'bobs email', 'joe')
      ,('tim', 'tims email', 'eddie')
The following query:
SELECT DD.[name]
,DD.[email]
,Friends.[friend]
,ROW_NUMBER() OVER (PARTITION BY DD.[name], DD.[email] ORDER BY Friends.[friend]) AS [FriendNumber]
FROM
(
SELECT DISTINCT [name]
,[email]
FROM @DataSource
) DD -- Distinct Data
CROSS APPLY
(
SELECT [friend]
FROM @DataSource DS
WHERE DS.[name] = DD.[name]
AND DS.[email] = DD.[email]
) Friends
will give you:
name | email      | friend | FriendNumber
-----+------------+--------+--------------
bob  | bobs email | joe    | 1
bob  | bobs email | kate   | 2
tim  | tims email | eddie  | 1
So, you can now build what you want using PIVOT, but note that you need to know the maximum number of friends a person could have:
SELECT *
FROM
(
SELECT DD.[name]
,DD.[email]
,Friends.[friend]
,'friend' + CAST(ROW_NUMBER() OVER (PARTITION BY DD.[name], DD.[email] ORDER BY Friends.[friend]) AS VARCHAR(2)) AS [FriendNumber]
FROM
(
SELECT DISTINCT [name]
,[email]
FROM @DataSource
) DD -- Distinct Data
CROSS APPLY
(
SELECT [friend]
FROM @DataSource DS
WHERE DS.[name] = DD.[name]
AND DS.[email] = DD.[email]
) Friends
) DS
PIVOT
(
MAX([friend]) FOR [FriendNumber] IN ([friend1], [friend2], [friend3])
) PVT
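If the maximum number of friends isn't known up front, the usual workaround is dynamic SQL: derive the [friend1], [friend2], ... column list from the data, then run the pivot through sp_executesql. A rough sketch (my addition, assuming the rows live in a permanent table dbo.Friends, since a table variable is not visible inside the dynamic batch):
DECLARE @cols NVARCHAR(MAX), @sql NVARCHAR(MAX);

-- Build a column list such as [friend1],[friend2],[friend3],
-- sized to the largest friend count found in the data.
SELECT @cols = STUFF((
    SELECT DISTINCT ',' + QUOTENAME('friend' + CAST(fn.[FriendNumber] AS VARCHAR(2)))
    FROM (
        SELECT ROW_NUMBER() OVER (PARTITION BY [name], [email] ORDER BY [friend]) AS [FriendNumber]
        FROM dbo.Friends
    ) fn
    FOR XML PATH('')), 1, 1, '');

SET @sql = N'
SELECT *
FROM (
    SELECT [name], [email], [friend],
           ''friend'' + CAST(ROW_NUMBER() OVER (PARTITION BY [name], [email] ORDER BY [friend]) AS VARCHAR(2)) AS [FriendNumber]
    FROM dbo.Friends
) DS
PIVOT (MAX([friend]) FOR [FriendNumber] IN (' + @cols + N')) PVT;';

EXEC sys.sp_executesql @sql;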