I have a table t1 ordered by tasteRating:
Fruit  | tasteRating | Cost
---------------------------
Apple  | 99          | 1
Banana | 87          | 2
Cherry | 63          | 5
I want t2
Fruit  | Cost | Total Cost
--------------------------
Apple  | 1    | 1
Banana | 2    | 3
Cherry | 5    | 8
Is there a way to generate Total Cost dynamically in SQL based on value of Cost?
Doing this on Redshift.
Thanks
A running sum like that can easily be done in a modern DBMS using window functions:
select col_1,
       sum(col_1) over (order by taste_rating desc) as col_2
from the_table;
Note however that a running sum without an order by doesn't make sense. So you have to include a column that defines the order of the rows.
SQLFiddle: http://sqlfiddle.com/#!15/166b9/1
EDIT: (By Gordon)
Redshift has weird limitations on window functions. For some reason, it requires the ROWS BETWEEN syntax:
sum(col_1) over (order by taste_rating desc
                 rows between unbounded preceding and current row
                ) as col_2
I have no idea why it has this requirement. It is not required by ANSI (although it is supported) and it is not a limitation in Postgres (the base database for Redshift).
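For reference, a sketch of the complete query adapted to the question's table (table and column names taken from the question; Redshift folds unquoted identifiers to lower case, so adjust if yours are quoted):

select fruit,
       cost,
       sum(cost) over (order by tasterating desc
                       rows between unbounded preceding and current row
                      ) as total_cost
from t1
order by tasterating desc;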
Related
+----+------+-------+---------+---------+
| id | order| value | type | account |
+----+------+-------+---------+---------+
| 1 | 1 | a | 2 | 1 |
| 1 | 2 | b | 1 | 1 |
| 1 | 3 | c | 4 | 1 |
| 1 | 4 | d | 2 | 1 |
| 1 | 5 | e | 1 | 1 |
| 1 | 5 | f | 6 | 1 |
| 2 | 6 | g | 1 | 1 |
+----+------+-------+---------+---------+
I need to select all fields of this table, but get only one row for each combination of id + type (I don't care which of the matching rows I get). I have tried some approaches without success.
As soon as I use DISTINCT I can't include the rest of the fields to make them available in a subquery. If I add ROWNUM in the subquery, all rows become different, so that doesn't work either.
Some ideas?
My best query at the moment is this:
SELECT ID, TYPE, VALUE, ACCOUNT
FROM MYTABLE
WHERE ROWID IN (SELECT DISTINCT MAX(ROWID)
                FROM MYTABLE
                GROUP BY ID, TYPE);
It seems you need to select one (random) row for each distinct combination of id and type. If so, you could do that efficiently using the row_number analytic function. Something like this:
select id, type, value, account
from (
       select id, type, value, account,
              row_number() over (partition by id, type order by null) as rn
       from   your_table
     )
where rn = 1
;
order by null means random ordering of rows within each group (partition) by (id, type); this means that the ordering step, which is usually time-consuming, will be trivial in this case. Also, Oracle optimizes such queries (for the filter rn = 1).
Or, in versions 12.1 and higher, you can get the same with the match_recognize clause:
select id, type, value, account
from   my_table
match_recognize (
  partition by id, type
  all rows per match
  pattern (^r)
  define r as null is null
);
This partitions the rows by id and type, it doesn't order them (which means random ordering), and selects just the "first" row from each partition. Note that some analytic functions, including row_number(), require an order by clause (even when we don't care about the ordering) - order by null is customary, but it can't be left out completely. By contrast, in match_recognize you can leave out the order by clause (the default is "random order"). On the other hand, you can't leave out the define clause, even if it imposes no conditions whatsoever. Why Oracle doesn't use a default for that clause too, only Oracle knows.
This is the code I am running:
ROW_NUMBER() OVER (PARTITION BY id ORDER BY a,b) as seq
This is an example of a table I am working with:
| id | a | b |name|
| --- | --- | - | -- |
| 1 |12345| 14 |John|
| 1 |12345| 14 |Anne|
| 1 |23456| 14 |Dave|
| 2 |45445| 16 |Matt|
When a seq value is assigned to the first two rows, how is the order decided? I'm asking because id, a and b are the same for both rows, and the result seems to change between different runs.
how is the order decided?
The order will be whatever is most convenient for SQL Server. Sometimes this will be table order (as determined by the primary key (clustered index) of the source table rather than insert order), but lots of things can mess with this, such that you might even get different orders from one moment to the next for the same query. If this matters, you must add more fields to the ORDER BY clause until it is specific enough, as in the sketch below.
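A sketch of that, assuming the table is called my_table (hypothetical name) and that name (or any other column, ideally a unique key) can serve as a tiebreaker in your data:

SELECT id, a, b, name,
       -- name acts as a tiebreaker, so the numbering no longer changes between runs
       ROW_NUMBER() OVER (PARTITION BY id ORDER BY a, b, name) AS seq
FROM my_table;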
With a table table1 like below
+--------+------+------+-----------+------+
| flight | orig | dest | passenger | bags |
+--------+------+------+-----------+------+
| 1111   | sfo  | chi  | david     | 3    |
| 1112   | sfo  | dal  | david     | 7    |
| 1112   | sfo  | dal  | kim       | 10   |
| 1113   | lax  | san  | ameera    | 5    |
| 1114   | lax  | lfr  | tim       | 6    |
| 1114   | lax  | lfr  | jake      | 8    |
+--------+------+------+-----------+------+
I'm aggregating the table by orig like below
select
    orig
  , count(*) as flight_cnt
  , count(distinct passenger) as pass_cnt
  , percentile_cont(0.5) within group (order by bags ASC) as bag_cnt_med
from table1
group by orig
I need to add the passenger with the longest name ( length(passenger) ) for each orig group - how do I go about it?
Output expected
+------+-------------+-----------+---------------+-------------------+
| orig | flight_cnt | pass_cnt | bags_cnt_med | pass_max_len_name |
+------+-------------+-----------+---------------+-------------------+
| sfo | 3 | 2 | 7 | david |
| lax | 3 | 3 | 6 | ameera |
+------+-------------+-----------+---------------+-------------------+
You can conveniently retrieve the passenger with the longest name per group with DISTINCT ON.
Select first row in each GROUP BY group?
But I see no way to combine that (or any other simple way) with your original query in a single SELECT. I suggest joining two separate subqueries:
SELECT *
FROM ( -- your original query
SELECT orig
, count(*) AS flight_cnt
, count(distinct passenger) AS pass_cnt
, percentile_cont(0.5) WITHIN GROUP (ORDER BY bags) AS bag_cnt_med
FROM table1
GROUP BY orig
) org_query
JOIN ( -- my addition
SELECT DISTINCT ON (orig) orig, passenger AS pass_max_len_name
FROM table1
ORDER BY orig, length(passenger) DESC NULLS LAST
) pas USING (orig);
USING in the join clause conveniently only outputs one instance of orig, so you can simply use SELECT * in the outer SELECT.
If passenger can be NULL, it is important to add NULLS LAST:
PostgreSQL sort by datetime asc, null first?
From multiple passenger names with the same maximum length in the same group, you get an arbitrary pick - unless you add more expressions to ORDER BY as a tiebreaker. Detailed explanation in the answer linked above.
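For instance, a sketch of the second subquery with the passenger name itself added as a tiebreaker:

SELECT DISTINCT ON (orig)
       orig, passenger AS pass_max_len_name
FROM   table1
ORDER  BY orig, length(passenger) DESC NULLS LAST, passenger;  -- passenger breaks ties deterministically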
Performance?
Typically, a single scan is superior, especially with sequential scans.
The above query uses two scans (maybe index / index-only scans). But the second scan is comparatively cheap unless the table is too huge to fit in cache (mostly). Lukas suggested an alternative query with only a single SELECT adding:
, (ARRAY_AGG (passenger ORDER BY LENGTH (passenger) DESC))[1] -- I'd add NULLS LAST
The idea is smart, but last time I tested, array_agg with ORDER BY did not perform so well. (The overhead of per-group ORDER BY is substantial, and array handling is expensive, too.)
The same approach can be cheaper with a custom aggregate function first(), as instructed in the Postgres Wiki here. Or, faster yet, with a version written in C, available on PGXN. That eliminates the extra cost of array handling, but we still need the per-group ORDER BY. It may be faster for only a few groups. You would then add:
, first(passenger ORDER BY length(passenger) DESC NULLS LAST)
Gordon and Lukas also mention the window function first_value(). Window functions are applied after aggregate functions. To use it in the same SELECT, we would need to aggregate passenger somehow first - catch 22. Gordon solves this with a subquery - another candidate for good performance with standard Postgres.
first() does the same without subquery and should be simpler and a bit faster. But it still won't be faster than a separate DISTINCT ON for most cases with few rows per group. For lots of rows per group, a recursive CTE technique is typically faster. There are yet faster techniques if you have a separate table holding all relevant, unique orig values. Details:
Optimize GROUP BY query to retrieve latest record per user
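To illustrate the general idea (a rough sketch only; it assumes an index on orig and mirrors the "loose index scan" style of the technique discussed in the linked answer, it is not code taken from it):

WITH RECURSIVE origs AS (
   (SELECT orig FROM table1 ORDER BY orig LIMIT 1)            -- smallest orig
   UNION ALL
   SELECT (SELECT t.orig                                      -- next bigger orig
           FROM   table1 t
           WHERE  t.orig > o.orig
           ORDER  BY t.orig
           LIMIT  1)
   FROM   origs o
   WHERE  o.orig IS NOT NULL                                  -- stop after the last orig
)
SELECT o.orig, p.passenger AS pass_max_len_name
FROM   origs o
CROSS  JOIN LATERAL (
   SELECT passenger
   FROM   table1
   WHERE  orig = o.orig
   ORDER  BY length(passenger) DESC NULLS LAST
   LIMIT  1
   ) p
WHERE  o.orig IS NOT NULL;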
The best solution depends on various factors. The proof of the pudding is in the eating. To optimize performance you have to test with your setup. The above query should be among the fastest.
One method uses the window function first_value(). Unfortunately, this is not available as an aggregation function:
select orig,
       count(*) as flight_cnt,
       count(distinct passenger) as pass_cnt,
       percentile_cont(0.5) within group (order by bags ASC) as bag_cnt_med,
       max(longest_name) as longest_name
from (select t1.*,
             first_value(passenger) over (partition by orig order by length(passenger) desc) as longest_name
      from table1 t1
     ) t1
group by orig;
You are looking for something like Oracle's KEEP FIRST/LAST, where you get a value (the passenger name) according to an aggregate (the name length). PostgreSQL doesn't have such a function as far as I know.
One way to go about this is a trick: Combine length and name, get the maximum, then extract the name: '0005david' > '0003kim' etc.
select
    orig
  , count(*) as flight_cnt
  , count(distinct passenger) as pass_cnt
  , percentile_cont(0.5) within group (order by bags ASC) as bag_cnt_med
  -- 'FM0000' suppresses to_char's leading sign space, so the prefix is exactly
  -- 4 digits and substr(..., 5) strips it off again
  , substr(max(to_char(char_length(passenger), 'FM0000') || passenger), 5) as name
from table1
group by orig
order by orig;
For small group sizes, you could use array_agg():
SELECT
    orig
  , COUNT (*) AS flight_cnt
  , COUNT (DISTINCT passenger) AS pass_cnt
  , PERCENTILE_CONT (0.5) WITHIN GROUP (ORDER BY bags ASC) AS bag_cnt_med
  , (ARRAY_AGG (passenger ORDER BY LENGTH (passenger) DESC))[1] AS pass_max_len_name
FROM table1
GROUP BY orig
Having said that, while this is the shorter syntax, a first_value() window function based approach might be faster for larger data sets, as array accumulation might become expensive.
But it does not solve the problem if you have several names with the same length:
t=# with p as (select distinct orig,passenger,length(trim(passenger)),max(length(trim(passenger))) over (partition by orig) from s127)
, o as ( select
orig
, count(*) flight_cnt
, count(distinct passenger) as pass_cnt
, percentile_cont(0.5) within group ( order by bags ASC) as bag_cnt_med
from s127
group by orig)
select distinct o.*,p.passenger from o join p on p.orig = o.orig where max=length;
orig | flight_cnt | pass_cnt | bag_cnt_med | passenger
---------+------------+----------+-------------+--------------
lax | 3 | 3 | 6 | ameera
sfo | 3 | 2 | 7 | david
(2 rows)
populate:
t=# create table s127(flight int,orig text,dest text, passenger text, bags int);
CREATE TABLE
Time: 52.678 ms
t=# copy s127 from stdin delimiter '|';
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 1111 | sfo | chi | david | 3
>> 1112 | sfo | dal | david | 7
>> 1112 | sfo | dal | kim | 10
>> 1113 | lax | san | ameera | 5
>> 1114 | lax | lfr | tim | 6
>> 1114 | lax | lfr | jake | 8
>> \.
COPY 6
I am looking to calculate cumulative sum across columns in Google Big Query.
Assume there are five columns (NAME,A,B,C,D) with two rows of integers, for example:
NAME | A | B | C | D
----------------------
Bob | 1 | 2 | 3 | 4
Carl | 5 | 6 | 7 | 8
I am looking for a windowing function or UDF to calculate the cumulative sum across rows to generate this output:
NAME | A | B | C | D
-------------------------
Bob | 1 | 3 | 6 | 10
Carl | 5 | 11 | 18 | 26
Any thoughts or suggestions greatly appreciated!
I think there are a number of reasonable workarounds for your requirements, mostly in the area of designing your table better. It all really depends on how you input your data and, most importantly, how you then consume it.
Still, to stay with the presented requirements - the below is not exactly the output you expect in your question, but it might be useful as an example:
SELECT name, GROUP_CONCAT(STRING(cum)) AS all FROM (
SELECT name,
SUM(INTEGER(num))
OVER(PARTITION BY name
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cum
FROM (
SELECT name, SPLIT(all) AS num FROM (
SELECT name,
CONCAT(STRING(a),',',STRING(b),',',STRING(c),',',STRING(d)) AS all
FROM yourtable
)
)
)
GROUP BY name
Output is:
name all
Bob 1,3,6,10
Carl 5,11,18,26
Depending on how you then consume this data, it can still work for you.
Note: you no longer have to write something like col1 + col2 + .. + col89 + col90, but you still need to mention each column explicitly just once.
In case you have the "luxury" of implementing your requirements outside of the BigQuery UI, in some client, you can use the BigQuery API to programmatically acquire the table schema, build your logic/query on the fly, and then execute it.
Take a look at below APIs to start with:
To get table schema - https://cloud.google.com/bigquery/docs/reference/v2/tables/get
To issue query job - https://cloud.google.com/bigquery/docs/reference/v2/jobs/insert
There's no need for a UDF:
SELECT name, a, a+b, a+b+c, a+b+c+d
FROM tab
I am getting a weird result when trying to get the LAST_VALUE from a table in SQL Server 2012.
This is the table I have
PK | Id1 | Id2
1 | 2 | 5
2 | 2 | 6
3 | 2 | 5
4 | 2 | 6
This is my query
SELECT
Id1, Id2, LAST_VALUE(PK) OVER (PARTITION BY Id1 ORDER BY Id2) AS LastValue
FROM
#Data
This is the result I am expecting
Id1 | Id2 | LastValue
2 | 5 | 3
2 | 5 | 3
2 | 6 | 4
2 | 6 | 4
This is what I am receiving
Id1 | Id2 | LastValue
2 | 5 | 3
2 | 5 | 3
2 | 6 | 2
2 | 6 | 2
Here is a demonstration of the problem
http://sqlfiddle.com/#!6/5c729/1
Is there anything wrong with my query?
SQL Server doesn't know or care about the order in which rows were inserted into the table. If you need a specific order, always use ORDER BY. In your example the ORDER BY is ambiguous unless you include PK in it. Besides, the LAST_VALUE function can return odd results if you are not careful - see below.
You can get your expected result using MAX or LAST_VALUE (SQLFiddle). They are equivalent in this case:
SELECT
    PK, Id1, Id2
    ,MAX(PK) OVER (PARTITION BY Id1, Id2) AS MaxValue
    ,LAST_VALUE(PK) OVER (PARTITION BY Id1, Id2 ORDER BY PK
                          ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS LastValue
FROM
    Data
ORDER BY Id1, Id2, PK
Result of this query will be the same regardless of the order in which rows were originally inserted into the table. You can try to put INSERT statements in different order in the fiddle. It doesn't affect the result.
Also, LAST_VALUE behaves not quite as you'd intuitively expect with the default window (when you have just ORDER BY in the OVER clause). The default window frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, while you'd expect it to be ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING. Here is a SO answer with a good explanation; the link to it is on the MSDN page for LAST_VALUE. So, once the row window is specified explicitly in the query, it returns what is needed.
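To see the difference, a sketch (against the #Data table from the question) comparing the default frame with an explicit full frame:

SELECT Id1, Id2, PK,
       -- default frame: RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       LAST_VALUE(PK) OVER (PARTITION BY Id1 ORDER BY Id2) AS last_default_frame,
       -- explicit full frame: sees the whole partition
       LAST_VALUE(PK) OVER (PARTITION BY Id1 ORDER BY Id2
                            ROWS BETWEEN UNBOUNDED PRECEDING
                                     AND UNBOUNDED FOLLOWING) AS last_full_frame
FROM #Data;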
If you want to know the order in which rows were inserted into the table, I think the simplest way is to use IDENTITY. So the definition of your table would change to this:
CREATE TABLE Data
(PK INT IDENTITY(1,1) PRIMARY KEY,
Id1 INT,
Id2 INT)
When you INSERT into this table you don't need to specify the value for PK; the server generates it automatically. It guarantees that generated values are unique and growing (with a positive increment parameter), even if you have many clients inserting into the table at the same time. There may be gaps between generated values, but the relative order of the generated values tells you which row was inserted after which.
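For instance, a sketch of an insert against that table (values taken from the question's data):

-- PK is generated by IDENTITY, so it is not listed in the column list.
INSERT INTO Data (Id1, Id2)
VALUES (2, 5), (2, 6), (2, 5), (2, 6);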
It is never a good idea to rely on implicit order caused by the particular implementation of the underlying database engine.
I don't know why, but running the query
SELECT * FROM #Data ORDER BY Id2
the result will be
+----+-----+-----+
| PK | id1 | id2 |
+----+-----+-----+
| 1 | 2 | 5 |
| 3 | 2 | 5 |
| 4 | 2 | 6 |
| 2 | 2 | 6 |
+----+-----+-----+
which means SQL Server decided the order of rows in a way that is different from the insert order.
That's why the LAST_VALUE behavior differs from what you expected, but it is consistent with how SQL Server sorted the data.
But how does SQL Server sort your data?
The best answer we have is the accepted answer to this question (from which I took the sentence at the beginning of my answer).
SELECT Id1
     , Id2
     , LAST_VALUE(PK)
         OVER (PARTITION BY Id1
               ORDER BY Id2 DESC) AS LastValue
FROM Data
ORDER BY Id2 ASC
Result
Id1 Id2 LastValue
2 5 3
2 5 3
2 6 4
2 6 4