SQL: histograms for multiple columns - sql

Given the following table:
Column A    Column B
east        red
west        blue
east        green
I want to find the distinct values of each column and how many times each value occurs in the table. For the table above, the result should look like:
A values    A value counts    B values    B value counts
east        2                 red         1
west        1                 blue        1
                              green       1
This is achievable by running SELECT colX, count(colX) FROM Table GROUP BY colX once per column. That does not scale if there is a complex WHERE condition, since the condition has to be evaluated again for every column's query.
An alternative is to execute the complex WHERE query once and compute the aggregations in server code. But is there a single SQL query that can compute this?

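You can use window functions:
select cola, count(*) over (partition by cola) as a_count,
colb, count(*) over (partition by colb) as b_count
from t
This returns every source row with the count of its cola value and the count of its colb value alongside (one output row per source row, not one per distinct value).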
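To see the shape of what the window-function query returns, here is a minimal sketch that runs it on the question's sample data through Python's built-in sqlite3 module. This is an illustration under assumptions: SQLite 3.25 or later (for window functions) stands in for whatever database is actually used, and the table name t, the id column, and the names cola/colb are hypothetical.

```python
import sqlite3

# In-memory database with the sample rows from the question (names are hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("create table t (id integer primary key, cola text, colb text)")
conn.executemany(
    "insert into t (cola, colb) values (?, ?)",
    [("east", "red"), ("west", "blue"), ("east", "green")],
)

# Each window count is computed per partition, but no rows are collapsed.
window_rows = conn.execute(
    """
    select cola, count(*) over (partition by cola) as a_count,
           colb, count(*) over (partition by colb) as b_count
    from t
    order by id
    """
).fetchall()

for row in window_rows:
    print(row)
```

The output has three rows, one per source row, so a DISTINCT or GROUP BY step is still needed to turn it into one histogram line per value.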

You can use subqueries to aggregate, then union all and aggregate again to combine the results:
select max(a) as a, max(a_cnt) as a_cnt, max(b) as b, max(b_cnt) as b_cnt
from ((select a, count(*) as a_cnt, null as b, null as b_cnt,
row_number() over (order by count(*) desc) as seqnum
from t
group by a
) union all
(select null, null, b, count(*),
row_number() over (order by count(*) desc) as seqnum
from t
group by b
)
) ab
group by seqnum
order by seqnum;
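The aggregate, union all, aggregate-again idea can be tried out with the sketch below. It is not the query above verbatim: it is rewritten with CTEs so that it runs on SQLite (3.25 or later) via Python's sqlite3 module, and it adds a hypothetical active column so that the "complex WHERE" is written only once, in the filtered CTE. All table and column names are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t (a text, b text, active integer)")
conn.executemany(
    "insert into t values (?, ?, ?)",
    [("east", "red", 1), ("west", "blue", 1), ("east", "green", 1), ("north", "black", 0)],
)

histogram_rows = conn.execute(
    """
    with filtered as (            -- the WHERE condition appears only here
        select a, b from t where active = 1
    ),
    a_hist as (select a, count(*) as cnt from filtered group by a),
    b_hist as (select b, count(*) as cnt from filtered group by b),
    combined as (
        select a, cnt as a_cnt, null as b, null as b_cnt,
               row_number() over (order by cnt desc, a) as seqnum
        from a_hist
        union all
        select null, null, b, cnt,
               row_number() over (order by cnt desc, b)
        from b_hist
    )
    -- max() ignores NULLs, so each seqnum keeps one A entry and one B entry
    select max(a), max(a_cnt), max(b), max(b_cnt)
    from combined
    group by seqnum
    order by seqnum
    """
).fetchall()

for row in histogram_rows:
    print(row)
```

Rows from the two histograms are paired purely by rank, so the A and B values on the same output line have no relationship to each other.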

If you are using Oracle, you can use user_tab_cols to generate the SQL for all columns in your table:
SELECT 'SELECT '
|| Listagg(column_name
||',count(1) over (partition by '
||column_name
||') as '
||column_name
||'_cnt', ',')
within GROUP (ORDER BY column_id)
||' FROM '
||'TEST_DATA'
FROM user_tab_cols
WHERE table_name = 'TEST_DATA'
Sample output is below:
SELECT ID,count(1) over (partition by ID) as ID_cnt,VALUE,count(1) over (partition by VALUE) as VALUE_cnt FROM TEST_DATA
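The same generate-the-query idea can be sketched outside Oracle. The example below is an assumption-laden illustration, not the Oracle approach itself: it reads column names from SQLite's PRAGMA table_info instead of user_tab_cols and builds the statement in Python. The helper name build_count_query is hypothetical.

```python
import sqlite3

def build_count_query(conn, table_name):
    """Build one SELECT with a windowed count per column of table_name."""
    # PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk) per column.
    # Identifiers are interpolated directly, so only use trusted table names here.
    columns = [row[1] for row in conn.execute(f"PRAGMA table_info({table_name})")]
    select_list = ",".join(
        f"{col},count(1) over (partition by {col}) as {col}_cnt" for col in columns
    )
    return f"SELECT {select_list} FROM {table_name}"

conn = sqlite3.connect(":memory:")
conn.execute("create table TEST_DATA (ID integer, VALUE text)")
conn.executemany("insert into TEST_DATA values (?, ?)", [(1, "x"), (2, "x"), (3, "y")])

generated_sql = build_count_query(conn, "TEST_DATA")
print(generated_sql)

# Run the generated statement; sorted only to make the printed order stable.
generated_rows = sorted(conn.execute(generated_sql).fetchall())
print(generated_rows)
```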

Related

Selecting records that appear several times in a row

My problem is that I would like to select records that appear several times in a row (consecutively).
For example, given a table like this:
x
x
x
y
y
x
x
y
The query should return:
x 3
y 2
x 2
y 1
SQL tables represent unordered sets. Your question only makes sense if there is a column that specifies the ordering. If so, you can use the difference-of-row-numbers to determine the groups and then aggregate:
select col1, count(*)
from (select t.*,
row_number() over (order by <ordering col>) as seqnum,
row_number() over (partition by col1 order by <ordering col>) as seqnum_2
from t
) t
group by col1, (seqnum - seqnum_2)
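As a worked check of the difference-of-row-numbers technique, the sketch below runs it on the question's sample data with Python's sqlite3 module (SQLite 3.25 or later assumed). The id column playing the role of the ordering column is an assumption, since the question does not name one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t (id integer primary key, col1 text)")
conn.executemany(
    "insert into t (col1) values (?)",
    [(v,) for v in ["x", "x", "x", "y", "y", "x", "x", "y"]],
)

# Within one run of equal values both row numbers grow in step,
# so their difference is constant and identifies the run.
runs = conn.execute(
    """
    select col1, count(*) as run_length
    from (select col1, id,
                 row_number() over (order by id) as seqnum,
                 row_number() over (partition by col1 order by id) as seqnum_2
          from t) numbered
    group by col1, (seqnum - seqnum_2)
    order by min(id)
    """
).fetchall()

print(runs)
```

For the x rows at ids 1, 2, 3, 6, 7 the differences are 0, 0, 0, 2, 2, which is what splits them into a run of three and a run of two.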
I made a SQL Fiddle
http://sqlfiddle.com/#!18/f8900/5
CREATE TABLE [dbo].[SomeTable](
[data] [nchar](1) NULL,
[id] [int] IDENTITY(1,1) NOT NULL
);
INSERT INTO SomeTable
([data])
VALUES
('x'),
('x'),
('x'),
('y'),
('y'),
('x'),
('x'),
('y')
;
select * from SomeTable;
WITH SomeTable_CTE (Data, total, BaseId, NextId)
AS
(
SELECT
Data,
1 as total,
Id as BaseId,
Id+1 as NextId
FROM SomeTable
where not exists(
Select * from SomeTable Previous
where Previous.Id+1 = SomeTable.Id
and Previous.Data = SomeTable.Data)
UNION ALL
select SomeTable_CTE.Data, SomeTable_CTE.total+1, SomeTable_CTE.BaseId as BaseId, SomeTable.Id+1 as NextId
from SomeTable_CTE inner join SomeTable on
SomeTable.Data = SomeTable_CTE.Data
and
SomeTable.Id = SomeTable_CTE.NextId
)
SELECT Data, max(total) as total
FROM SomeTable_CTE
group by Data, BaseId
order by BaseId
The elephant in the room is the missing column(s) to establish the order of rows. Assuming such a column exists (order_column below):
SELECT col1, count(*)
FROM (
SELECT col1, order_column
, row_number() OVER (ORDER BY order_column)
- row_number() OVER (PARTITION BY col1 ORDER BY order_column) AS grp
FROM tbl
) t
GROUP BY col1, grp
ORDER BY min(order_column);
To exclude partitions with only a single row, add a HAVING clause:
SELECT col1, count(*)
FROM (
SELECT col1, order_column
, row_number() OVER (ORDER BY order_column)
- row_number() OVER (PARTITION BY col1 ORDER BY order_column) AS grp
FROM tbl
) t
GROUP BY col1, grp
HAVING count(*) > 1
ORDER BY min(order_column);
Add a final ORDER BY to maintain original order (and a meaningful result). You may want to add a column like min(order_column) as well.
Related:
Find the longest streak of perfect scores per player
Select longest continuous sequence
Group by repeating attribute

alias table with union all not works

I get ORA-00904 (invalid identifier) in a simple query, but I can't see why. Could you help me, please?
select columnA, 'More than 4000 bytes'
from tableA
union all
select p.columnB, listagg(p.columnC, ',') within group (order by p.columnC)
from (
select distinct b.job_name, a.hostname
from tableB a, emuser.def_job b
) p
group by p.columnB
order by p.columnB desc;
ORDER BY applies to the result set of the whole compound query, and that result set only has the column names of the first query. So there is no columnB to order by here.
Try this:
SELECT columnA, 'More than 4000 bytes' as columnC FROM tableA
UNION ALL
SELECT p.columnB, LISTAGG (p.columnC, ',') WITHIN GROUP (ORDER BY p.columnC)
FROM (SELECT DISTINCT b.job_name, a.hostname
FROM tableB a, emuser.def_job b) p
GROUP BY p.columnB
ORDER BY columnA DESC;
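The naming rule can be observed in a small sketch. It uses SQLite through Python's sqlite3 module as a stand-in for Oracle, with group_concat in place of LISTAGG, and made-up tables and data. SQLite's own name resolution may differ from Oracle's in what it rejects, so this only shows the working form: the compound result takes its column names from the first SELECT, and ORDER BY refers to those names without a table alias.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table tableA (columnA text)")
conn.execute("create table tableB (columnB text, columnC text)")
conn.executemany("insert into tableA values (?)", [("b",), ("a",)])
conn.executemany("insert into tableB values (?, ?)", [("c", "1"), ("c", "2")])

cur = conn.execute(
    """
    select columnA, 'More than 4000 bytes' as columnC from tableA
    union all
    select columnB, group_concat(columnC, ',') from tableB group by columnB
    order by columnA desc
    """
)

# Column names of the whole result come from the first branch of the union.
result_columns = [d[0] for d in cur.description]
union_rows = cur.fetchall()

print(result_columns)
print([row[0] for row in union_rows])
```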

Remove duplicate rows as an additional column

I have a SQL table of student records with duplicate rows in the student dimension, because a student can have more than one major. So now I have something like this:
ID Major
----------
1 CS
1 Mgt
What I want is to combine these two rows into this form:
ID Major Major2
----------
1 CS Mgt
You need a number for pivoting. Then you can pivot using either pivot or conditional aggregation:
select id,
max(case when seqnum = 1 then major end) as major_1,
max(case when seqnum = 2 then major end) as major_2
from (select t.*,
row_number() over (partition by id order by (select null)) as seqnum
from t
) t
group by id;
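Here is a minimal sketch of the conditional-aggregation pivot, run on SQLite (3.25 or later) via Python's sqlite3 module rather than SQL Server. The sample rows, including a second student with a single major, are made up, and the rows are numbered by major name instead of order by (select null) so the output is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t (id integer, major text)")
conn.executemany(
    "insert into t values (?, ?)",
    [(1, "CS"), (1, "Mgt"), (2, "Math")],
)

# Number the majors per student, then spread numbers 1 and 2 into columns.
pivoted = conn.execute(
    """
    select id,
           max(case when seqnum = 1 then major end) as major_1,
           max(case when seqnum = 2 then major end) as major_2
    from (select id, major,
                 row_number() over (partition by id order by major) as seqnum
          from t) numbered
    group by id
    order by id
    """
).fetchall()

print(pivoted)
```

A student with only one major gets NULL (None in Python) in the second column, and a third major would be silently dropped, which is why the maximum count per id is worth checking first.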
Note: you should validate that "2" is large enough to count the majors. You can get the maximum using:
select top 1 id, count(*)
from t
group by id
order by count(*) desc;
If you have at most two different values of major:
select a.id as id,
a.major as major,
b.major as major2
from YOUR_TABLE a
left join YOUR_TABLE b on
a.id = b.id
and a.major < b.major
where not exists (select 1 from YOUR_TABLE c
where c.id = a.id and c.major < a.major)
This will help you
Select
ID,
(select top 1 Major from <Your_Table> where id=T.Id order by Major) Major,
(case when count(Id)>1 then (select top 1 Major from <Your_Table> where id=T.Id order by Major desc) else null end) Major2
from <Your_Table> T
Group By
ID
You can use the PIVOT function directly (this hard-codes the major values CS and Mgt):
SELECT [ID],[CS] AS Major , [Mgt] AS Major2 from Your_Table_Name
PIVOT
(max(Major)for [Major] IN ([CS] , [Mgt]))as p

How to create "subsets" as a result from a column in SQL

Let's suppose that I've got as a result from one query the next set of values of one column:
Value
1 A
2 B
3 C
4 D
5 E
6 F
7 G
8 H
9 I
10 J
Now, I would like to see this information arranged differently, with a limit on the number of values in each subset. Suppose I choose 3 as the limit; the information would then be presented like this (one column for all the subsets):
Values
1 A, B, C
2 D, E, F
3 G, H, I
4 J,
Obviously, the last row will contain the remaining values when their number is smaller than the limit established.
Is it possible to perform a query like this in SQL?
What if the limit is dynamic? It can be chosen arbitrarily.
create table dee_t (id int identity(1,1),name varchar(10))
insert into dee_t values ('A'),('B'),('c'),('D'),('E'),('F'),('g'),('H'),('I'),('J')
;with cte as
(
select (id-1)/3 +1 rno ,* from dee_t
) select rno ,
(select name+',' from cte where rno = c.rno for xml path('') )
from cte c group by rno
You can do this with a few calculations on row_number, like this:
select
GRP,
max(case when RN = 1 then Value end),
max(case when RN = 2 then Value end),
max(case when RN = 0 then Value end)
from (
select
row_number() over (order by Value) % 3 as RN,
(row_number() over (order by Value)+2) / 3 as GRP,
Value
from Table1
) X
group by GRP
The first row_number creates numbers for the columns (1,2,0,1,2,0...) and the second one creates numbers for the rows (1,1,1,2...). Those are then used to group the values into correct place using case, but you can also use pivot instead of it if you like it more.
If you want them into same column, of course just concatenate the cases instead of selecting them on different columns, but beware of nulls.
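For the single-column variant with a dynamic limit, one possible sketch is below. It runs on SQLite (3.25 or later) through Python's sqlite3 module, passes the limit as a bound parameter, and does the concatenation in Python so the order inside each subset is explicit. The table name table1, the column name value, and the helper chunked_values are assumptions for illustration.

```python
import sqlite3
from itertools import groupby

def chunked_values(conn, size):
    """Return the values of table1 as comma-separated subsets of at most `size`."""
    rows = conn.execute(
        """
        select (row_number() over (order by value) - 1) / ? as grp, value
        from table1
        order by grp, value
        """,
        (size,),
    ).fetchall()
    # Integer division puts rows 1..size in group 0, the next `size` rows in group 1, etc.
    return [", ".join(v for _, v in group) for _, group in groupby(rows, key=lambda r: r[0])]

conn = sqlite3.connect(":memory:")
conn.execute("create table table1 (value text)")
conn.executemany("insert into table1 values (?)", [(c,) for c in "ABCDEFGHIJ"])

subsets_of_3 = chunked_values(conn, 3)
subsets_of_4 = chunked_values(conn, 4)
print(subsets_of_3)
print(subsets_of_4)
```

Because the subset number is just (row number - 1) divided by the limit, changing the limit needs no change to the query text.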
Thanks a lot for all your replies. I finally got a solution with the help of Rajen Singh.
This is the code that can be used:
WITH CTE_0 AS
(
SELECT DISTINCT column_A_VALUE AS id
FROM Table
WHERE column_A_VALUE IS NOT NULL
), CTE_1 AS
(
SELECT ROW_NUMBER() OVER (ORDER BY id) RN, id
FROM CTE_0
), CTE_2 AS
(
SELECT RN%30 GRP, ID
FROM CTE_1
)SELECT STUFF(( SELECT ','''+CAST(ID AS NVARCHAR(20))+''''
FROM CTE_2
WHERE GRP = A.GRP
FOR XML PATH('')),1,1,'') IDS
FROM CTE_2 A
GROUP BY GRP

Cumulative sum over a table

What is the best way to perform a cumulative sum over a table in Postgres, in a way that gives the best performance and flexibility in case more fields / columns are added to the table?
Table
     a    b    d
1   59   15  181
2        16  268
3            219
4            102
Cumulative
     a    b    d
1   59   15  181
2        31  449
3            668
4            770
You can use window functions, but you need additional logic to avoid values where there are NULLs:
SELECT id,
(case when a is not null then sum(a) OVER (ORDER BY id) end) as a,
(case when b is not null then sum(b) OVER (ORDER BY id) end) as b,
(case when d is not null then sum(d) OVER (ORDER BY id) end) as d
FROM tbl;
This assumes that the first column that specifies the ordering is called id.
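To check the CASE-guarded running sum against the numbers in the question, here is a minimal sketch using Python's sqlite3 module. SQLite 3.25 or later stands in for Postgres, and the table name tbl and the id column are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table tbl (id integer primary key, a integer, b integer, d integer)")
conn.executemany(
    "insert into tbl values (?, ?, ?, ?)",
    [(1, 59, 15, 181), (2, None, 16, 268), (3, None, None, 219), (4, None, None, 102)],
)

# The CASE blanks out the running total on rows where the source value is NULL.
running = conn.execute(
    """
    select id,
           case when a is not null then sum(a) over (order by id) end as a,
           case when b is not null then sum(b) over (order by id) end as b,
           case when d is not null then sum(d) over (order by id) end as d
    from tbl
    order by id
    """
).fetchall()

for row in running:
    print(row)
```

Without the CASE, column a would repeat 59 on every row, because the window sum simply skips NULLs.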
Window functions for running sum.
SELECT sum(a) OVER (ORDER BY d) as "a",
sum(b) OVER (ORDER BY d) as "b",
sum(d) OVER (ORDER BY d) as "d"
FROM tbl;
If you have more than one running sum, make sure the orders are the same.
http://www.postgresql.org/docs/8.4/static/tutorial-window.html
http://www.postgresonline.com/journal/archives/119-Running-totals-and-sums-using-PostgreSQL-8.4-Windowing-functions.html
It's important to note that if you want your columns to appear as the aggregate table in your question (each field uniquely ordered), it'd be a little more involved.
Update: I've modified the query to do the required sorting, without a given common field.
WITH
rcd AS (
select row_number() OVER() as num,a,b,d
from tbl
),
sorted_a AS (
select row_number() OVER(w1) as num, sum(a) over(w2) a
from tbl
window w1 as (order by a nulls last),
w2 as (order by a nulls first)
),
sorted_b AS (
select row_number() OVER(w1) as num, sum(b) over(w2) b
from tbl
window w1 as (order by b nulls last),
w2 as (order by b nulls first)
),
sorted_d AS (
select row_number() OVER(w1) as num, sum(d) over(w2) d
from tbl
window w1 as (order by d nulls last),
w2 as (order by d nulls first)
)
SELECT sorted_a.a, sorted_b.b, sorted_d.d
FROM rcd
JOIN sorted_a USING(num)
JOIN sorted_b USING(num)
JOIN sorted_d USING(num)
ORDER BY num;
I think what you are really looking for is this:
SELECT id
, sum(a) OVER (PARTITION BY a_grp ORDER BY id) as a
, sum(b) OVER (PARTITION BY b_grp ORDER BY id) as b
, sum(d) OVER (PARTITION BY d_grp ORDER BY id) as d
FROM (
SELECT *
, count(a IS NULL OR NULL) OVER (ORDER BY id) as a_grp
, count(b IS NULL OR NULL) OVER (ORDER BY id) as b_grp
, count(d IS NULL OR NULL) OVER (ORDER BY id) as d_grp
FROM tbl
) sub
ORDER BY id;
The expression count(col IS NULL OR NULL) OVER (ORDER BY id) forms groups of consecutive non-null rows for a, b and d in the subquery sub.
In the outer query we run cumulative sums per group. NULL values form their own group and stay NULL automatically. No additional CASE statement necessary.
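The effect of the grouping expression is easiest to see on a column with a gap in the middle. The sketch below uses Python's sqlite3 module (SQLite 3.25 or later standing in for Postgres) and a made-up single column a, reduced from the three-column query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table tbl (id integer primary key, a integer)")
conn.executemany(
    "insert into tbl values (?, ?)",
    [(1, 1), (2, 2), (3, None), (4, 3), (5, 4)],
)

# (a IS NULL OR NULL) is true for NULL rows and NULL otherwise, so the running
# count goes up by one at every NULL row and starts a new group there.
grouped_running = conn.execute(
    """
    select id, sum(a) over (partition by a_grp order by id) as a
    from (select id, a,
                 count(a is null or null) over (order by id) as a_grp
          from tbl) sub
    order by id
    """
).fetchall()

print(grouped_running)
```

Note the difference from a plain guarded running sum: here the total restarts after each NULL (3, then 7) instead of carrying on from the earlier rows.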
SQL Fiddle (with some added values for column a to demonstrate the effect).