I have a table with 6 columns in Teradata as follows:
ID Feature1 Feature2 Feature3 Feature4 Feature5
1 12 15 1 22 350
2 121 0.9 999 756 879
...
I need to get the column names for the greatest, 2nd greatest and 3rd greatest values per row, so, I need output that looks like this:
ID Greatest 2nd_Greatest 3rd_Greatest
1 Feature5 Feature4 Feature2
2 Feature3 Feature5 Feature4
Can someone help please.
Thank you!
You can do this with a massive case statement, which gets even more complicated if any of the values are NULL. That would be the fastest way, though.
The easiest method might be to unpivot the data and re-summarize it:
select id,
max(case when seqnum = 1 then feature end) as greatest_feature,
max(case when seqnum = 2 then feature end) as greatest_feature2,
max(case when seqnum = 3 then feature end) as greatest_feature3,
max(case when seqnum = 1 then which end) as which_1,
max(case when seqnum = 2 then which end) as which_2,
max(case when seqnum = 3 then which end) as which_3
from (select id, feature, which, row_number() over (partition by id order by feature desc) as seqnum
from ((select id, feature1 as feature, 'feature1' as which from table) union all
(select id, feature2 as feature, 'feature2' as which from table) union all
(select id, feature3 as feature, 'feature3' as which from table) union all
(select id, feature4 as feature, 'feature4' as which from table) union all
(select id, feature5 as feature, 'feature5' as which from table)
) t
) t
group by id;
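The unpivot-and-rank idea above can be tried end to end without a Teradata system. The following is only a sketch: it uses Python's built-in sqlite3 module as a stand-in (assuming the bundled SQLite is 3.25 or newer, which is when window functions arrived), the table name tab and the aliases val, u and r are made up for illustration, and the two rows are the ones from the question.

```python
import sqlite3

# Hypothetical stand-in for the Teradata table, loaded with the two sample rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tab (id INTEGER, feature1 REAL, feature2 REAL,"
    " feature3 REAL, feature4 REAL, feature5 REAL)"
)
conn.executemany(
    "INSERT INTO tab VALUES (?, ?, ?, ?, ?, ?)",
    [(1, 12, 15, 1, 22, 350), (2, 121, 0.9, 999, 756, 879)],
)

# Unpivot with UNION ALL, rank the values per id, then pivot the names of the
# three largest columns back into one row per id.
query = """
SELECT id,
       MAX(CASE WHEN seqnum = 1 THEN which END) AS greatest,
       MAX(CASE WHEN seqnum = 2 THEN which END) AS second_greatest,
       MAX(CASE WHEN seqnum = 3 THEN which END) AS third_greatest
FROM (SELECT id, which,
             ROW_NUMBER() OVER (PARTITION BY id ORDER BY val DESC) AS seqnum
      FROM (SELECT id, feature1 AS val, 'Feature1' AS which FROM tab
            UNION ALL SELECT id, feature2, 'Feature2' FROM tab
            UNION ALL SELECT id, feature3, 'Feature3' FROM tab
            UNION ALL SELECT id, feature4, 'Feature4' FROM tab
            UNION ALL SELECT id, feature5, 'Feature5' FROM tab) AS u
     ) AS r
GROUP BY id
ORDER BY id
"""
top3 = conn.execute(query).fetchall()
for row in top3:
    print(row)
```

Ties are broken arbitrarily by ROW_NUMBER here; add a second ORDER BY key if that matters for your data.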
Refining Gordon's query:
Instead of several passes over the source table for those UNIONs you can create a list of features and then cross join it:
SELECT t.id, f.feature,
CASE f.feature
WHEN 'feature1' THEN t.feature1
WHEN 'feature2' THEN t.feature2
WHEN 'feature3' THEN t.feature3
WHEN 'feature4' THEN t.feature4
WHEN 'feature5' THEN t.feature5
END AS val
FROM tab AS t CROSS JOIN
(
SELECT * FROM (SELECT 'feature1' AS feature) AS dt
UNION ALL
SELECT * FROM (SELECT 'feature2' AS feature) AS dt
UNION ALL
SELECT * FROM (SELECT 'feature3' AS feature) AS dt
UNION ALL
SELECT * FROM (SELECT 'feature4' AS feature) AS dt
UNION ALL
SELECT * FROM (SELECT 'feature5' AS feature) AS dt
) AS f
You can create the list on the fly like above using UNIONs or as a real table.
Starting with TD14.10 there's also a TD_UNPIVOT table operator (but still no PIVOT):
SELECT *
FROM TD_UNPIVOT
(
ON (SELECT id, feature1, feature2, feature3, feature4, feature5 FROM tab)
USING
VALUE_COLUMNS('val')
UNPIVOT_COLUMN('feature')
COLUMN_LIST('feature1', 'feature2', 'feature3', 'feature4', 'feature5')
) AS dt
Also starting with TD14.10 there's LAST_VALUE, which can be combined with ROW_NUMBER to find the nth-greatest value, thus avoiding the final aggregation:
SELECT id,
feature AS "Greatest",
LAST_VALUE(feature)
OVER (PARTITION BY id ORDER BY val DESC
ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING) AS "2nd_Greatest",
LAST_VALUE(feature)
OVER (PARTITION BY id ORDER BY val DESC
ROWS BETWEEN 2 FOLLOWING AND 2 FOLLOWING) AS "3rd_Greatest"
FROM TD_UNPIVOT
(
ON (SELECT id, feature1, feature2, feature3, feature4, feature5 FROM tab)
USING
VALUE_COLUMNS('val')
UNPIVOT_COLUMN('feature')
COLUMN_LIST('feature1', 'feature2', 'feature3', 'feature4', 'feature5')
) AS dt
QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY val DESC) = 1;
Is there a way using SQL, in BigQuery more specifically, to get one row per unique value in a given column?
I know this is possible with a sequence of UNION queries, one per distinct value in the column of interest, but I'm wondering if there is a better way to do it.
You can use row_number():
select t.* except (seqnum)
from (select t.*, row_number() over (partition by col order by col) as seqnum
from t
) t
where seqnum = 1;
This returns an arbitrary row. You can control which row by adjusting the order by.
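To see how the ORDER BY inside ROW_NUMBER picks the surviving row, here is a small sketch. It runs against SQLite through Python's sqlite3 module rather than BigQuery (an assumption made so it can run locally; SQLite 3.25+ is needed for window functions), and the six sample rows and the helper name one_row_per_col are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, col INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(1, 1), (2, 1), (3, 1), (4, 2), (5, 2), (6, 3)],
)


def one_row_per_col(order_by):
    """Return one row per distinct col; order_by decides which row wins.

    order_by is only ever one of the fixed strings used below, never user input.
    """
    sql = f"""
        SELECT id, col
        FROM (SELECT t.*,
                     ROW_NUMBER() OVER (PARTITION BY col ORDER BY {order_by}) AS seqnum
              FROM t) AS s
        WHERE seqnum = 1
        ORDER BY col
    """
    return conn.execute(sql).fetchall()


lowest_id = one_row_per_col("id")        # keeps the smallest id per col
highest_id = one_row_per_col("id DESC")  # keeps the largest id per col
print(lowest_id)
print(highest_id)
```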
Another fun solution in BigQuery uses structs:
select array_agg(t limit 1)[ordinal(1)].*
from t
group by col;
You can add an order by (order by X limit 1) if you want a particular row.
Here is the same query in a more readable format:
select tab.* except(seqnum)
from (
select *, row_number() over (partition by column_x order by column_x) as seqnum
from `project.dataset.table`
) as tab
where seqnum = 1
Below is for BigQuery Standard SQL
#standardSQL
SELECT AS VALUE ANY_VALUE(t)
FROM `project.dataset.table` t
GROUP BY col
You can test and play with the above using dummy data, as in the example below:
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 id, 1 col UNION ALL
SELECT 2, 1 UNION ALL
SELECT 3, 1 UNION ALL
SELECT 4, 2 UNION ALL
SELECT 5, 2 UNION ALL
SELECT 6, 3
)
SELECT AS VALUE ANY_VALUE(t)
FROM `project.dataset.table` t
GROUP BY col
with result
Row id col
1 1 1
2 4 2
3 6 3
I have two tables: table1 contains old values and table2 contains the latest values. I want to show the latest value for each row of table1, but nothing in table2 tells me which of its rows is the latest.
For example:
Table1
CID-----PID-----RID
CT1-----C-------R1
CT2-----C-------R2
CT3-----C-------R3
CT4-----C-------R4
Table2
CID-----PID----RID
CT1-----A-------R1
CT1-----C-------R11
CT2-----C-------R2
CT3-----A-------R3
CT4-----A-------R4
The condition is that I have to give priority to the value C when both values (A and C) exist for the same CID. Its RID also changes, so I need that in the output table too. When a CID has only one value in table2, I will simply replace the table1 row with it. So the output will look like this:
Table3
CID-----PID----RID
CT1-----C-------R11
CT2-----C-------R2
CT3-----A-------R3
CT4-----A-------R4
I may be missing something, but isn't this simply:
select cid, max(pid)
from table2
group by cid;
If you want whole records, use a ranking with ROW_NUMBER instead:
select cid, pid, rid
from
(
select cid, pid, rid, row_number() over (partition by cid order by pid desc) as rn
from table2
)
where rn = 1;
You can also use case expressions for ranking, e.g.:
(partition by cid order by case pid when 'C' then 1 when 'A' then 2 else 3 end) as rn
UPDATE: Now that you've finally explained what you are after ...
You want more or less the second query I gave you above. Only that you want data from both tables, which you can get with UNION ALL. You can easily give each row a rank on the way:
table2 PID C => rank #1
table2 PID A => rank #2
table1 rank #3
Then again take the row with the best rank:
select cid, pid, rid
from
(
select cid, pid, rid, row_number() over (partition by cid order by rnk) as rn
from
(
select cid, pid, rid, case when pid = 'C' then 1 else 2 end as rnk from table2
union all
select cid, pid, rid, 3 as rnk from table1
)
)
where rn = 1;
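As a quick check of the ranking approach, the sketch below loads the Table1 and Table2 rows from the question into an in-memory SQLite database (a stand-in chosen only so it can run anywhere; SQLite 3.25+ is assumed for ROW_NUMBER) and runs the same shape of query. The aliases u and r are added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (cid TEXT, pid TEXT, rid TEXT)")
conn.execute("CREATE TABLE table2 (cid TEXT, pid TEXT, rid TEXT)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?)",
    [("CT1", "C", "R1"), ("CT2", "C", "R2"), ("CT3", "C", "R3"), ("CT4", "C", "R4")],
)
conn.executemany(
    "INSERT INTO table2 VALUES (?, ?, ?)",
    [("CT1", "A", "R1"), ("CT1", "C", "R11"), ("CT2", "C", "R2"),
     ("CT3", "A", "R3"), ("CT4", "A", "R4")],
)

# Rank: table2 rows with pid 'C' first, other table2 rows second, table1 last;
# then keep the best-ranked row per cid.
query = """
SELECT cid, pid, rid
FROM (SELECT cid, pid, rid,
             ROW_NUMBER() OVER (PARTITION BY cid ORDER BY rnk) AS rn
      FROM (SELECT cid, pid, rid,
                   CASE WHEN pid = 'C' THEN 1 ELSE 2 END AS rnk
            FROM table2
            UNION ALL
            SELECT cid, pid, rid, 3 AS rnk FROM table1) AS u
     ) AS r
WHERE rn = 1
ORDER BY cid
"""
table3 = conn.execute(query).fetchall()
for row in table3:
    print(row)
```

With the sample data this reproduces the Table3 rows from the question. A cid that exists only in table1 would fall through to rank 3 and keep its old row.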
I have a SQL table for student records, and I have some duplicate rows in the student dimension because of the major, so now I have something like this:
ID Major
----------
1 CS
1 Mgt
What I want is to combine these two rows in this form:
ID Major Major2
----------
1 CS Mgt
You need a number for pivoting. Then you can pivot using either pivot or conditional aggregation:
select id,
max(case when seqnum = 1 then major end) as major_1,
max(case when seqnum = 2 then major end) as major_2
from (select t.*,
row_number() over (partition by id order by (select null)) as seqnum
from t
) t
group by id;
Note: you should validate that "2" is large enough to count the majors. You can get the maximum using:
select top 1 id, count(*)
from t
group by id
order by count(*) desc;
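Here is a runnable sketch of the conditional-aggregation pivot, again using SQLite via Python's sqlite3 as a stand-in for SQL Server (an assumption; SQLite 3.25+ for ROW_NUMBER). Two things differ from the query above and are my own choices: the rows are numbered by major rather than by (select null) so the result is deterministic, and student 2 is an invented extra row showing what happens with a single major.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, major TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(1, "CS"), (1, "Mgt"), (2, "Math")],  # student 2 is made up
)

# Number the majors within each student, then spread them into columns.
pivot_sql = """
SELECT id,
       MAX(CASE WHEN seqnum = 1 THEN major END) AS major_1,
       MAX(CASE WHEN seqnum = 2 THEN major END) AS major_2
FROM (SELECT t.*,
             ROW_NUMBER() OVER (PARTITION BY id ORDER BY major) AS seqnum
      FROM t) AS s
GROUP BY id
ORDER BY id
"""
pivoted = conn.execute(pivot_sql).fetchall()

# Sanity check: the largest number of majors any student has, which tells
# you how many major_N columns the pivot needs.
max_majors = conn.execute(
    "SELECT MAX(cnt) FROM (SELECT COUNT(*) AS cnt FROM t GROUP BY id) AS c"
).fetchone()[0]

print(pivoted)
print(max_majors)
```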
If you have at most two different values of major:
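select a.id as id,
a.major as major,
b.major as major2
from YOUR_TABLE a
left join YOUR_TABLE b on
a.id = b.id
and a.major < b.major
-- keep only the row whose major sorts first; b supplies the other major, if any
where not exists (select 1 from YOUR_TABLE c
where c.id = a.id and c.major < a.major)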
This will help you
Select
ID,
(select top 1 Major from <Your_Table> where id=T.Id order by Major) Major,
(case when count(Id)>1 then (select top 1 Major from <Your_Table> where id=T.Id order by Major desc) else null end) Major2
from <Your_Table> T
Group By
ID
You can use pivot function directly
SELECT [ID],[CS] AS Major , [Mgt] AS Major2 from Your_Table_Name
PIVOT
(max(Major)for [Major] IN ([CS] , [Mgt]))as p
I have a master table with a number of IDs in it:
ID ...
0 ...
1 ...
And multiple tables (say vtbl1, vtbl2, vtbl3) with a foreign key to master, a timestamp and a value:
ID Timestamp Value
0 01/01/01.. 2
1 01/01/02.. 7
0 01/01/03.. 5
I would like to get one or more entries for each ID in master with an entry (or null if no entries exist) containing the most recent entry in each v... table (grouped by timestamps):
ID Timestamp vtbl1.Value vtbl2.Value vtbl3.value
0 01/01/03.. 5 2
0 01/01/01.. 4
1 01/01/02.. 7 4 9
I'm sure this is fairly simple but my SQL is rusty and I've been going in circles. Any help would be appreciated.
Clarification
These values come from one or more sensors able to read one or more of the values. So the latest value in each value table for the ID is to be considered the current system state for that ID. If the timestamps match they are considered one update.
I need the minimal set of updates required for each ID to give a full data set for the current state.
Also the values can be of different types.
If I understand your question correctly, one option is to use conditional aggregation and union all:
select id, timestamp,
max(case when tbl = 'tbl1' then value end) t1value,
max(case when tbl = 'tbl2' then value end) t2value,
max(case when tbl = 'tbl3' then value end) t3value
from (
select id, timestamp, value, 'tbl1' tbl
from tbl1
union all
select id, timestamp, value, 'tbl2' tbl
from tbl2
union all
select id, timestamp, value, 'tbl3' tbl
from tbl3
) t
group by id, timestamp
Or, if you have multiple records per id and you want only the most recent one by timestamp, you can include row_number() in your subquery:
select id, timestamp,
max(case when tbl = 'tbl1' then value end) t1value,
max(case when tbl = 'tbl2' then value end) t2value,
max(case when tbl = 'tbl3' then value end) t3value
from (
select id, timestamp, value, 'tbl1' tbl,
row_number() over (partition by id order by timestamp desc) rn
from tbl1
union all
select id, timestamp, value, 'tbl2' tbl,
row_number() over (partition by id order by timestamp desc) rn
from tbl2
union all
select id, timestamp, value, 'tbl3' tbl,
row_number() over (partition by id order by timestamp desc) rn
from tbl3
) t
where rn = 1
group by id, timestamp
This can get difficult though if max(timestamp) values aren't the same in each of the child tables. Which do you join on at that point?
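To make the behaviour with differing latest timestamps concrete, here is a sketch of the row_number() variant. It is run against SQLite through Python's sqlite3 (an assumption for local testing; SQLite 3.25+), the timestamp column is renamed ts, and all sample values are invented to resemble the question's example. Ids that have no rows in any value table would need a left join from master, which is left out here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
samples = {
    "vtbl1": [(0, "2001-01-01", 2), (1, "2001-01-02", 7), (0, "2001-01-03", 5)],
    "vtbl2": [(0, "2001-01-01", 4), (1, "2001-01-02", 4)],
    "vtbl3": [(0, "2001-01-03", 2), (1, "2001-01-02", 9)],
}
for name, rows in samples.items():
    conn.execute(f"CREATE TABLE {name} (id INTEGER, ts TEXT, value INTEGER)")
    conn.executemany(f"INSERT INTO {name} VALUES (?, ?, ?)", rows)

# Keep the latest row per id from each table, then line the values up by
# (id, ts): tables whose latest timestamps match share one output row.
query = """
SELECT id, ts,
       MAX(CASE WHEN tbl = 1 THEN value END) AS t1value,
       MAX(CASE WHEN tbl = 2 THEN value END) AS t2value,
       MAX(CASE WHEN tbl = 3 THEN value END) AS t3value
FROM (SELECT id, ts, value, 1 AS tbl,
             ROW_NUMBER() OVER (PARTITION BY id ORDER BY ts DESC) AS rn
      FROM vtbl1
      UNION ALL
      SELECT id, ts, value, 2,
             ROW_NUMBER() OVER (PARTITION BY id ORDER BY ts DESC)
      FROM vtbl2
      UNION ALL
      SELECT id, ts, value, 3,
             ROW_NUMBER() OVER (PARTITION BY id ORDER BY ts DESC)
      FROM vtbl3) AS u
WHERE rn = 1
GROUP BY id, ts
ORDER BY id, ts DESC
"""
state = conn.execute(query).fetchall()
for row in state:
    print(row)
```

Id 0 comes back as two rows because vtbl2's latest entry is older than the other two, which is the "minimal set of updates" shape the question asks for.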
select m.*, v1.value as t1_val, v2.value as t2_val, v3.value as t3_val
from master m
left join (select x.*
from vtbl1 x
join (select id, max(timestamp) as last_ts
from vtbl1
group by id) y
on x.id = y.id
and x.timestamp = y.last_ts) v1
on m.id = v1.id
left join (select x.*
from vtbl2 x
join (select id, max(timestamp) as last_ts
from vtbl2
group by id) y
on x.id = y.id
and x.timestamp = y.last_ts) v2
on m.id = v2.id
left join (select x.*
from vtbl3 x
join (select id, max(timestamp) as last_ts
from vtbl3
group by id) y
on x.id = y.id
and x.timestamp = y.last_ts) v3
on m.id = v3.id
The fastest query technique depends on the distribution of values. DISTINCT ON would be a simple solution in Postgres, ideal for just a few values per id in each child table. But guessing from your description I expect many rows per id, so I suggest a solution with LATERAL joins. Requires Postgres 9.3+:
Optimize GROUP BY query to retrieve latest record per user
One more complication for your already-not-so-simple case:
Also the values can be of different types
Alternative 1
Cast all values to text. Every data type can be cast to text.
Base query
SELECT m.id, v.timestamp, 1 AS tbl, v.value -- simple int as table id
FROM master m
, LATERAL (
SELECT timestamp, value::text -- cast to text
FROM vtbl1
WHERE id = m.id -- lateral reference
ORDER BY timestamp DESC NULLS LAST
LIMIT 1
) v
UNION ALL
SELECT m.id, v.timestamp, 2 AS tbl, v.value -- ascending without gaps
FROM master m
, LATERAL (
SELECT timestamp, value::text
FROM vtbl2
WHERE id = m.id
ORDER BY timestamp DESC NULLS LAST
LIMIT 1
) v
UNION ALL
SELECT m.id, v.timestamp, 3 AS tbl, value
FROM ...
;
All you need for this to be fast is an index on (id, timestamp) for each child table. Best in this form (adding value is only useful if you get index-only scans out of it):
CREATE INDEX vtbl1_combo_idx ON vtbl1 (id, timestamp DESC NULLS LAST, value)
1a. Aggregate (pseudo-crosstab)
To format as desired, use aggregate functions on CASE expressions in Postgres 9.3 or older (as demonstrated by @sgeddes) or (better) the new aggregate FILTER clause in Postgres 9.4+:
How can I simplify this game statistics query?
SELECT id, timestamp
, max(value) FILTER (WHERE tbl = 1) AS val1
, max(value) FILTER (WHERE tbl = 2) AS val2
, ...
FROM ( <query from above> ) t
GROUP BY 1, 2;
1b. Crosstab
Actual cross tabulation (also called "pivot" in other RDBMS) should be considerably faster. You need the additional module tablefunc installed, instructions below.
The special difficulty here: we have a composite "row name" (id, timestamp), but the function expects a single column as row name. So we substitute with row_number(), but do not display that surrogate key in the result:
SELECT id, timestamp, val1, val2, val3, ...
-- normally SELECT * is enough; explicit list to filter rn
FROM crosstab(
$$
SELECT row_number() OVER (ORDER BY id, timestamp DESC NULLS LAST) AS rn
, id, timestamp, tbl, value
FROM ( <query from above> ) t
ORDER BY 1
$$
, 'SELECT generate_series(1,3)' -- replace 3 with highest table nr.
) AS ct (
rn int, id int, timestamp date
, val1 text, val2 text, val3 text, ...);
Closely related:
Postgres - Transpose Rows to Columns
Relevant basics:
PostgreSQL Crosstab Query
Pivot on Multiple Columns using Tablefunc
Alternative 2
Simple, but may be just as fast and preserves original data types:
SELECT id, timestamp
, max(val1) AS val1, max(val2) AS val2, max(val3) AS val3, ...
FROM (
SELECT m.id, v.timestamp
, v.value AS val1, NULL::int AS val2, NULL::numeric AS val3, ...
-- list all values with actual data type
FROM master m
, LATERAL (
SELECT timestamp, value
FROM vtbl1
WHERE id = m.id
ORDER BY timestamp DESC NULLS LAST
LIMIT 1
) v
UNION ALL
SELECT m.id, v.timestamp
, NULL, v.value, NULL, ... -- column names & data types defined in first SELECT
FROM master m
, LATERAL (
SELECT timestamp, value
FROM vtbl2
WHERE id = m.id
ORDER BY timestamp DESC NULLS LAST
LIMIT 1
) v
UNION ALL
SELECT m.id, v.timestamp
, NULL, NULL, v.value, ...
FROM ...
) t
GROUP BY 1, 2
ORDER BY 1, 2;
Aside: Never use basic type names or reserved words (in standard SQL) like timestamp as identifier.
My table:
ID NUM VAL
1 1 Hello
1 2 Goodbye
2 2 Hey
2 4 What's up?
3 5 See you
If I want to return the max number for each ID, it's really nice and clean:
SELECT MAX(NUM) FROM table GROUP BY (ID)
But what if I want to grab the value associated with the max of each number for each ID?
Why can't I do:
SELECT MAX(NUM) OVER (ORDER BY NUM) FROM table GROUP BY (ID)
Why is that an error? I'd like to have this select grouped by ID, rather than partitioning separately for each window...
EDIT: The error is "not a GROUP BY expression".
You could probably use the MAX() KEEP(DENSE_RANK LAST...) function:
with sample_data as (
select 1 id, 1 num, 'Hello' val from dual union all
select 1 id, 2 num, 'Goodbye' val from dual union all
select 2 id, 2 num, 'Hey' val from dual union all
select 2 id, 4 num, 'What''s up?' val from dual union all
select 3 id, 5 num, 'See you' val from dual)
select id, max(num), max(val) keep (dense_rank last order by num)
from sample_data
group by id;
When you use a window function you don't need GROUP BY anymore; this would suffice:
select id,
max(num) over(partition by id)
from x
Actually, you can get the result without using a window function:
select *
from x
where (id,num) in
(
select id, max(num)
from x
group by id
)
Output:
ID NUM VAL
1 2 Goodbye
2 4 What's up?
3 5 See you
http://www.sqlfiddle.com/#!4/a9a07/7
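If you don't have an Oracle instance handy, the tuple-IN query can be sketched against SQLite from Python. This assumes SQLite 3.15 or newer, which supports row values on the left of IN; the table name x follows the answer above and the rows are the ones from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE x (id INTEGER, num INTEGER, val TEXT)")
conn.executemany(
    "INSERT INTO x VALUES (?, ?, ?)",
    [(1, 1, "Hello"), (1, 2, "Goodbye"), (2, 2, "Hey"),
     (2, 4, "What's up?"), (3, 5, "See you")],
)

# Keep only the rows whose (id, num) pair is the per-id maximum.
max_rows = conn.execute(
    """
    SELECT id, num, val
    FROM x
    WHERE (id, num) IN (SELECT id, MAX(num) FROM x GROUP BY id)
    ORDER BY id
    """
).fetchall()
for row in max_rows:
    print(row)
```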
If you want to use a window function, you might try this:
select id, val,
case when num = max(num) over(partition by id) then
1
else
0
end as to_select
from x
where to_select = 1
Or this:
select id, val
from x
where num = max(num) over(partition by id)
But since it's not allowed to do those, you have to do this:
with list as
(
select id, val,
case when num = max(num) over(partition by id) then
1
else
0
end as to_select
from x
)
select *
from list
where to_select = 1
http://www.sqlfiddle.com/#!4/a9a07/19
If you're looking to get the rows which contain the values from MAX(num) GROUP BY id, this tends to be a common pattern...
WITH
sequenced_data
AS
(
SELECT
ROW_NUMBER() OVER (PARTITION BY id ORDER BY num DESC) AS sequence_id,
*
FROM
yourTable
)
SELECT
*
FROM
sequenced_data
WHERE
sequence_id = 1
EDIT
I don't know if Teradata will allow this, but the logic seems to make sense...
SELECT
*
FROM
yourTable
WHERE
num = MAX(num) OVER (PARTITION BY id)
Or maybe...
SELECT
*
FROM
(
SELECT
*,
MAX(num) OVER (PARTITION BY id) AS max_num_by_id
FROM
yourTable
)
AS sub_query
WHERE
num = max_num_by_id
This is slightly different from my previous answer; if multiple records are tied with the same MAX(num), this will return all of them, the other answer will only ever return one.
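The difference in tie handling is easy to see on a small data set. The sketch below uses SQLite through Python's sqlite3 (an assumption for convenience; SQLite 3.25+ for window functions), names the table your_table, and adds one invented row (1, 2, 'Bye') so that id 1 has two rows tied at the maximum.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE your_table (id INTEGER, num INTEGER, val TEXT)")
conn.executemany(
    "INSERT INTO your_table VALUES (?, ?, ?)",
    [(1, 1, "Hello"), (1, 2, "Goodbye"), (1, 2, "Bye"),  # 'Bye' is a made-up tie
     (2, 2, "Hey"), (2, 4, "What's up?"), (3, 5, "See you")],
)

# ROW_NUMBER keeps exactly one row per id, picking arbitrarily among ties.
one_per_id = conn.execute(
    """
    SELECT id, num, val
    FROM (SELECT t.*,
                 ROW_NUMBER() OVER (PARTITION BY id ORDER BY num DESC) AS sequence_id
          FROM your_table AS t) AS s
    WHERE sequence_id = 1
    ORDER BY id
    """
).fetchall()

# Comparing against MAX(num) OVER (...) keeps every row tied at the maximum.
all_ties = conn.execute(
    """
    SELECT id, num, val
    FROM (SELECT t.*,
                 MAX(num) OVER (PARTITION BY id) AS max_num_by_id
          FROM your_table AS t) AS s
    WHERE num = max_num_by_id
    ORDER BY id, val
    """
).fetchall()

print(len(one_per_id), one_per_id)
print(len(all_ties), all_ties)
```

If you need all tied rows with the ROW_NUMBER pattern, swapping in RANK() is one option.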
EDIT
In your proposed SQL the error relates to the fact that the OVER() clause contains a field not in your GROUP BY. It's like trying to do this...
SELECT id, num FROM yourTable GROUP BY id
num is invalid, because there can be multiple values in that field for each row returned (with the rows returned being defined by GROUP BY id).
In the same way, you can't put num inside the OVER() clause.
SELECT
id,
MAX(num), <-- Valid as it is an aggregate
MAX(num) <-- still valid
OVER(PARTITION BY id), <-- Also valid, as id is in the GROUP BY
MAX(num) <-- still valid
OVER(PARTITION BY num) <-- Not valid, as num is not in the GROUP BY
FROM
yourTable
GROUP BY
id
See this question for when you can't specify something in the OVER() clause, and an answer showing when (I think) you can: over-partition-by-question