is there a way using sql, in bigquery more specifically, to get one line per unique value in a given column
I know that this is possible using a sequence of union queries where you have as much union as distinct values as there is in the column of interest. but i'm wondering if there is a better way to do it.
You can use row_number():
select t.* except (seqnum)
from (select t.*, row_number() over (partition by col order by col) as seqnum
from t
) t
where seqnum = 1;
This returns an arbitrary row. You can control which row by adjusting the order by.
Another fun solution in BigQuery uses structs:
select array_agg(t limit 1)[ordinal(1)].*
from t
group by col;
You can add an order by (order by X limit 1) if you want a particular row.
here is just a more formated format :
select tab.* except(seqnum)
from (
select *, row_number() over (partition by column_x order by column_x) as seqnum
from `project.dataset.table`
) as tab
where seqnum = 1
Below is for BigQuery Standard SQL
#standardSQL
SELECT AS VALUE ANY_VALUE(t)
FROM `project.dataset.table` t
GROUP BY col
You can test, play with above using dummy data as in below example
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 id, 1 col UNION ALL
SELECT 2, 1 UNION ALL
SELECT 3, 1 UNION ALL
SELECT 4, 2 UNION ALL
SELECT 5, 2 UNION ALL
SELECT 6, 3
)
SELECT AS VALUE ANY_VALUE(t)
FROM `project.dataset.table` t
GROUP BY col
with result
Row id col
1 1 1
2 4 2
3 6 3
Related
I have table with values
val1
1
2
3
4
I want following as output
1.00
2.50
4.25
6.12
Each value in a table is computed as val1+0.5*val1(from Previous row)
so for .eg.
row with 2 ---> output is computed as 2+0.5*1.00= 2.50
row with 3 ---> output is computed as 3+0.5*2.50 = 4.25
When I use following sql windows function
SELECT *
,val1+SUM(0.50*val1) OVER (ORDER BY val1 ROWS between 1 PRECEDING and 1 PRECEDING) AS r
FROM #a1
I get output as
1.00
2.500
4.000
5.500
This can be done with a recursive cte.
with rownums as (select val,row_number() over(order by val) as rnum
from tbl)
/* This is the recursive cte */
,cte as (select val,rnum,cast(val as float) as new_val from rownums where rnum=1
union all
select r.val,r.rnum,r.val+0.5*c.new_val
from cte c
join rownums r on c.rnum=r.rnum-1
)
/* End Recursive cte */
select val,new_val
from cte
Sample Demo
This is called exponential averaging. You can do it with some sort of power function, say it is called power() (this might differ among databases).
The following will work -- but I'm not sure about what happens if the sequences get long. Note that this has an id column to specify the ordering:
with t as (
select 1 as id, 1 as val union all
select 2, 2 union all select 3, 3 union all select 4, 4
)
select t.*,
( sum(p_seqnum * val) over (order by id) ) / p_seqnum
from (select t.*,
row_number() over (order by id desc) as seqnum,
power(cast(0.5 as float), row_number() over (order by id desc)) as p_seqnum
from t
) t;
Here is a rextester for Postgres. Here is a SQL Fiddle for SQL Server.
This works because exponential averaging is "memory-less". If this were not true, you would need a recursive CTE, and that can be much more expensive.
I need to sumarize a sequence of values into intervals of nonchanging values - begin, end and value for each such interval. I can easily do it in plsql but would like a pure sql solution for both performance and educational reasons. I have been trying for some time to solve it with analytical functions, but can't figure how to properly define windowing clause. The problem I am having is with a repeated value.
Simplified example -
given input:
id value
1 1
2 1
3 2
4 2
5 1
I'd like to get output
from to val
1 2 1
3 4 2
5 5 1
You want to identify groups of adjacent values. One method is to use lag() to find the beginning of the sequence, then a cumulative sum to identify the groups.
Another method is the difference of row number:
select value, min(id) as from_id, max(id) as to_id
from (select t.*,
(row_number() over (order by id) -
row_number() over (partition by val order by id
) as grp
from table t
) t
group by grp, value;
Using a CTE to collect all the rows and identifying them into changing values, then finally grouping together for the changing values.
CREATE TABLE #temp (
ID INT NOT NULL IDENTITY(1,1),
[Value] INT NOT NULL
)
GO
INSERT INTO #temp ([Value])
SELECT 1 UNION ALL
SELECT 1 UNION ALL
SELECT 2 UNION ALL
SELECT 2 UNION ALL
SELECT 1;
WITH Marked AS (
SELECT
*,
grp = ROW_NUMBER() OVER (ORDER BY ID)
- ROW_NUMBER() OVER (PARTITION BY Value ORDER BY ID)
FROM #temp
)
SELECT MIN(ID) AS [From], MAX(ID) AS [To], [VALUE]
FROM Marked
GROUP BY grp, Value
ORDER BY MIN(ID)
DROP TABLE #temp;
I have table(Id, Name, Type) in sql.
Id, Name, Type:
1, AA, 1
2, BB, 2
3, CC, 4
4, DD, 2
5, EE, 3
6, FF, 3
I want select the first non-duplicate data. Result:
Id, Name, Type:
1, AA, 1
2, BB, 2
3, CC, 4
6, FF, 3
I use DISTINCT and GROUP BY, but not working, I have select all row not select Type with DISTINCT or GROUP BY.
select DISTINCT Type
from tbltest
I like CTE's and ROW_NUMBER since it allows to change it easily to delete the duplicates.
Presuming that you want to remove duplicate Types and first means according to the ID:
WITH CTE AS(
SELECT Id, Name, Type,
RN = ROW_NUMBER() OVER ( PARTITION BY Type ORDER BY ID )
FROM dbo.Table1
)
SELECT Id, Name, Type FROM CTE WHERE RN = 1
You can do this in several ways. My preference is row_number():
select id, name, type
from (select t.*, row_number() over (partition by type order by id) as seqnum
from tbltest t
) t
where seqnum = 1;
EDIT:
Performance of the above should be reasonable. However, the following might be faster with an index on type, id:
selct id, name, type
from tbltest t
where not exists (select 1 from tbltest t2 where t2.type = t.type and t2.id < t.id);
That is, select the rows that have no lower id for the same type.
My table:
ID NUM VAL
1 1 Hello
1 2 Goodbye
2 2 Hey
2 4 What's up?
3 5 See you
If I want to return the max number for each ID, it's really nice and clean:
SELECT MAX(NUM) FROM table GROUP BY (ID)
But what if I want to grab the value associated with the max of each number for each ID?
Why can't I do:
SELECT MAX(NUM) OVER (ORDER BY NUM) FROM table GROUP BY (ID)
Why is that an error? I'd like to have this select grouped by ID, rather than partitioning separately for each window...
EDIT: The error is "not a GROUP BY expression".
You could probably use the MAX() KEEP(DENSE_RANK LAST...) function:
with sample_data as (
select 1 id, 1 num, 'Hello' val from dual union all
select 1 id, 2 num, 'Goodbye' val from dual union all
select 2 id, 2 num, 'Hey' val from dual union all
select 2 id, 4 num, 'What''s up?' val from dual union all
select 3 id, 5 num, 'See you' val from dual)
select id, max(num), max(val) keep (dense_rank last order by num)
from sample_data
group by id;
When you use windowing function, you don't need to use GROUP BY anymore, this would suffice:
select id,
max(num) over(partition by id)
from x
Actually you can get the result without using windowing function:
select *
from x
where (id,num) in
(
select id, max(num)
from x
group by id
)
Output:
ID NUM VAL
1 2 Goodbye
2 4 What's up
3 5 SEE YOU
http://www.sqlfiddle.com/#!4/a9a07/7
If you want to use windowing function, you might do this:
select id, val,
case when num = max(num) over(partition by id) then
1
else
0
end as to_select
from x
where to_select = 1
Or this:
select id, val
from x
where num = max(num) over(partition by id)
But since it's not allowed to do those, you have to do this:
with list as
(
select id, val,
case when num = max(num) over(partition by id) then
1
else
0
end as to_select
from x
)
select *
from list
where to_select = 1
http://www.sqlfiddle.com/#!4/a9a07/19
If you're looking to get the rows which contain the values from MAX(num) GROUP BY id, this tends to be a common pattern...
WITH
sequenced_data
AS
(
SELECT
ROW_NUMBER() OVER (PARTITION BY id ORDER BY num DESC) AS sequence_id,
*
FROM
yourTable
)
SELECT
*
FROM
sequenced_data
WHERE
sequence_id = 1
EDIT
I don't know if TeraData will allow this, but the logic seems to make sense...
SELECT
*
FROM
yourTable
WHERE
num = MAX(num) OVER (PARTITION BY id)
Or maybe...
SELECT
*
FROM
(
SELECT
*,
MAX(num) OVER (PARTITION BY id) AS max_num_by_id
FROM
yourTable
)
AS sub_query
WHERE
num = max_num_by_id
This is slightly different from my previous answer; if multiple records are tied with the same MAX(num), this will return all of them, the other answer will only ever return one.
EDIT
In your proposed SQL the error relates to the fact that the OVER() clause contains a field not in your GROUP BY. It's like trying to do this...
SELECT id, num FROM yourTable GROUP BY id
num is invalid, because there can be multiple values in that field for each row returned (with the rows returned being defined by GROUP BY id).
In the same way, you can't put num inside the OVER() clause.
SELECT
id,
MAX(num), <-- Valid as it is an aggregate
MAX(num) <-- still valid
OVER(PARTITION BY id), <-- Also valid, as id is in the GROUP BY
MAX(num) <-- still valid
OVER(PARTITION BY num) <-- Not valid, as num is not in the GROUP BY
FROM
yourTable
GROUP BY
id
See this question for when you can't specify something in the OVER() clause, and an answer showing when (I think) you can: over-partition-by-question
This should be a simple question, but I can't get it to work :(
How to select rows that have the maximum column value,as group by another column?
For example,
I have the following table definition:
ID
Del_Index
docgroupviewid
The issue now is that I want to group by results by docgroupviewid first, and then choose one row from each docgroupviewid group, depending on which one has the highest del_index.
I tried
SELECT docgroupviewid, max(del_index),id FROM table
group by docgroupviewid
But instead of return me with the correct id, it returns me with the earliest id from the group with the same docgroupviewid.
Any ideas?
I've struggled with this many times myself and the solution is to think about your query differently.
I want each DocGroupViewID row where the Del_Index is the highest(max) for all rows with that DocGroupViewID:
SELECT
T.DocGroupViewID,
T.Del_Index,
T.ID
FROM MyTable T
WHERE T.Del_Index = (
SELECT MAX( T1.Del_Index ) FROM MyTable T1
WHERE T1.DocGroupViewID = T.DocGroupViewID
)
It gets more complex when more than one row can have the same Del_Index, since then you need some way to choose which one to show.
EDIT: wanted to follow up with another option
You can use the RANK() or ROW_NUMBER() functions with a CTE to get more control over the results, as follows:
-- fake a source table
DECLARE #t TABLE (
ID int IDENTITY(1,1) PRIMARY KEY,
Del_Index int,
DocGroupViewID int
)
INSERT INTO #t
SELECT 1, 1 UNION ALL
SELECT 2, 1 UNION ALL
SELECT 3, 1 UNION ALL
SELECT 1, 2 UNION ALL
SELECT 2, 2 UNION ALL
SELECT 2, 2 UNION ALL
SELECT 1, 3 UNION ALL
SELECT 2, 3 UNION ALL
SELECT 3, 3 UNION ALL
SELECT 4, 3
-- show our source
SELECT * FROM #t
-- select using RANK (can have duplicates)
;WITH cteRank AS
(
SELECT
DocGroupViewID,
Del_Index,
ID,
RANK() OVER
(PARTITION BY DocGroupViewID ORDER BY Del_Index DESC)
AS RowRank,
ROW_NUMBER() OVER
(PARTITION BY DocGroupViewID ORDER BY Del_Index DESC)
AS RowNumber
FROM #t
)
SELECT *
FROM cteRank
WHERE RowRank = 1
-- select using ROW_NUMBER
;WITH cteRowNumber AS
(
SELECT
DocGroupViewID,
Del_Index,
ID,
RANK() OVER
(PARTITION BY DocGroupViewID ORDER BY Del_Index DESC)
AS RowRank,
ROW_NUMBER() OVER
(PARTITION BY DocGroupViewID ORDER BY Del_Index DESC)
AS RowNumber
FROM #t
)
SELECT *
FROM cteRowNumber
WHERE RowNumber = 1
If you have ways to sort out ties, just add it to the ORDER BY.
You will have to complicate your query a little bit:
select a.docgroupviewid, a.del_index, a.id from table a
where a.del_index = (select max(b.del_index) from table
where b.docgroupviewid = a.docgroupviewid)