Postgresql Writing max() Window function with multiple partition expressions? - sql

I am trying to get the max value of column A ("original_list_price") over windows defined by 2 columns (namely - a unique identifier, called "address_token", and a date field, called "list_date"). I.e. I would like to know the max "original_list_price" of rows with both the same address_token AND list_date.
E.g.:
SELECT
address_token, list_date, original_list_price,
max(original_list_price) OVER (PARTITION BY address_token, list_date) as max_list_price
FROM table1
The query already takes >10 minutes when I use just 1 expression in the PARTITION (e.g. using address_token only, nothing after that). Sometimes the query times out. (I use Mode Analytics and get this error: An I/O error occurred while sending to the backend) So my questions are:
1) Will the Window function with multiple PARTITION BY expressions work?
2) Any other way to achieve my desired result?
3) Any way to make Windows functions, especially the Partition part run faster? e.g. use certain data types over others, try to avoid long alphanumeric string identifiers?
Thank you!

The complexity of the window functions partitioning clause should not have a big impact on performance. Do realize that your query is returning all the rows in the table, so there might be a very large result set.
Window functions should be able to take advantage of indexes. For this query:
SELECT address_token, list_date, original_list_price,
max(original_list_price) OVER (PARTITION BY address_token, list_date) as max_list_price
FROM table1;
You want an index on table1(address_token, list_date, original_list_price).
You could try writing the query as:
select t1.*,
(select max(t2.original_list_price)
from table1 t2
where t2.address_token = t1.address_token and t2.list_date = t1.list_date
) as max_list_price
from table1 t1;
This should return results more quickly, because it doesn't have to calculate the window function value first (for all rows) before returning values.

Related

Finding the last 4, 3, 2, 1 months consecutive order drops among clients based on drop variance

Here I have this query that finds out the drop percentage of a bunch of clients based on the orders they have received(i.e. It finds the percentage difference in orders by comparing the current month with the previous month). What I want to achieve here is to have a field where I can see the clients who had 4 months continuous drop, 3 months drop, 2 months drop, and 1 month drop.
I know, it can only be achieved by comparing the last 4 months using the lag function or sub queries. can you guys pls help me out on this one, would appreciate it very much
select
fd.customers2, fd.Month1, fd.year1, fd.variance, case when
(fd.variance < -0.00001 and fd.year1 = '2022.0' and fd.Month1 = '1')
then '1month drop' else fd.customers2 end as 1_most_host_drop
from 
(SELECT
c.*,
sa.customers as customers2,
sum(sa.order) as orders,
date_part(mon, sa.date) as Month1,
date_part(year, sa.date) as year1,
(cast(orders - LAG(orders) OVER(Partition by customers2 ORDER BY
 year1, Month1) as NUMERIC(10,2))/NULLIF(LAG(orders) 
OVER(partition by customers2 ORDER BY year1, Month1) * 1, 0)) AS variance
FROM stats sa join (select distinct
    d.id, d.customers 
     from configer d 
    ) c on sa.customers=c.customers
WHERE sa.date >= '2021-04-1' 
GROUP BY Month1, sa.customers, c.id,  year1, 
     c.customers)fd
In a spirit of friendliness: I think you are a little premature in posting this here as there are several issues with the syntax before even reaching the point where you can solve the problem:
You have at least two places with a comma immediately preceding the word FROM:
...AS variance, FROM stats_archive sa ...
...d.customers, FROM config d...
Recommend you don't use VARIANCE as an alias (it is a system function in PostgreSQL and so is likely also a system function name in Redshift)
Not super important, but there's no need for c.* - just select the columns you will use
DATE_PART requires a string as the first parameter DATE_PART('mon',current_date)
I might be wrong about this, but I suspect you cannot use column aliases in the partition by or order by of a window function. Put the originating expressions there instead:
... OVER (PARTITION BY customers2 ORDER BY DATE_PART('year',sa.date),DATE_PART('mon',sa.date))
LAG has three parameters. (1) The column you want to retrieve the value from, (2) the row offset, where a positive integer indicates how many rows prior to the current row you should retrieve a value from according to the partition and order context and (3) the value the function should return as a default (in case of the first row in the partition). As such, you don't need NULLIF. So, to get the row immediately prior to the current row, or return 0 in case the current row is the first row in the partition:
LAG(orders,1,0) OVER (PARTITION BY customers2 ORDER BY DATE_PART('year',sa.date),DATE_PART('mon',sa.date))
If you use 0 as a default in the calculation of what is currently aliased variance, you will almost certainly run into a div/0 error either now, or worse, when you least expect it in the future. You should protect against that with some CASE logic or better, provide a more appropriate default value or even better, calculate the LAG with the default 0, then filter out the 0 rows before doing the calculation.
You can't use column aliases in the GROUP BY. You must reference each field that is not participating in an aggregate in the group by, whether through direct mention (sa.date) or indirectly in an expression (DATE_PART('mon',sa.date))
Your date should be '2021-04-01'
All in all, without sample data, expected results using the posted sample data and without first removing syntax errors, it is a tall order to attempt to offer advice on the problem which is any more specific than:
Build the source of the calculation as a completely separate query first. Calculate the LAG in that source query. Only when you've run that source query and verified that the LAG is producing the correct result should you then wrap it as a sub-query or CTE (not sure if Redshift supports these, but presumably) at which point you can filter out the rows with a zero as the denominator (the first month of orders for each customer).
Good luck!

SQL: Reduce resultset to X rows?

I have the following MYSQL table:
measuredata:
- ID (bigint)
- timestamp
- entityid
- value (double)
The table contains >1 billion entries. I want to be able to visualize any time-window. The time window can be size of "one day" to "many years". There are measurement values round about every minute in DB.
So the number of entries for a time-window can be quite different. Say from few hundrets to several thousands or millions.
Those values are ment to be visualiuzed in a graphical chart-diagram on a webpage.
If the chart is - lets say - 800px wide, it does not make sense to get thousands of rows from database if time-window is quite big. I cannot show more than 800 values on this chart anyhow.
So, is there a way to reduce the resultset directly on DB-side?
I know "average" and "sum" etc. as aggregate function. But how can I i.e. aggregate 100k rows from a big time-window to lets say 800 final rows?
Just getting those 100k rows and let the chart do the magic is not the preferred option. Transfer-size is one reason why this is not an option.
Isn't there something on DB side I can use?
Something like avg() to shrink X rows to Y averaged rows?
Or a simple magic to just skip every #th row to shrink X to Y?
update:
Although I'm using MySQL right now, I'm not tied to this. If PostgreSQL f.i. provides a feature that could solve the issue, I'm willing to switch DB.
update2:
I maybe found a possible solution: https://mike.depalatis.net/blog/postgres-time-series-database.html
See section "Data aggregation".
The key is not to use a unixtimestamp but a date and "trunc" it, avergage the values and group by the trunc'ed date. Could work for me, but would require a rework of my table structure. Hmm... maybe there's more ... still researching ...
update3:
Inspired by update 2, I came up with this query:
SELECT (`timestamp` - (`timestamp` % 86400)) as aggtimestamp, `entity`, `value` FROM `measuredata` WHERE `entity` = 38 AND timestamp > UNIX_TIMESTAMP('2019-01-25') group by aggtimestamp
Works, but my DB/index/structue seems not really optimized for this: Query for last year took ~75sec (slow test machine) but finally got only a one value per day. This can be combined with avg(value), but this further increases query time... (~82sec). I will see if it's possible to further optimize this. But I now have an idea how "downsampling" data works, especially with aggregation in combination with "group by".
There is probably no efficient way to do this. But, if you want, you can break the rows into equal sized groups and then fetch, say, the first row from each group. Here is one method:
select md.*
from (select md.*,
row_number() over (partition by tile order by timestamp) as seqnum
from (select md.*, ntile(800) over (order by timestamp) as tile
from measuredata md
where . . . -- your filtering conditions here
) md
) md
where seqnum = 1;

SQLite alias (AS) not working in the same query

I'm stuck in an (apparently) extremely trivial task that I can't make work , and I really feel no chance than to ask for advice.
I used to deal with PHP/MySQL more than 10 years ago and I might be quite rusty now that I'm dealing with an SQLite DB using Qt5.
Basically I'm selecting some records while wanting to make some math operations on the fetched columns. I recall (and re-read some documentation and examples) that the keyword "AS" is going to conveniently rename (alias) a value.
So for example I have this query, where "X" is an integer number that I render into this big Qt string before executing it with a QSqlQuery. This query lets me select all the electronic components used in a Project and calculate how many of them to order (rounding to the nearest multiple of 5) and the total price per component.
SELECT Inventory.id, UsedItems.pid, UsedItems.RefDes, Inventory.name, Inventory.category,
Inventory.type, Inventory.package, Inventory.value, Inventory.manufacturer,
Inventory.price, UsedItems.qty_used as used_qty,
UsedItems.qty_used*X AS To_Order,
ROUND((UsedItems.qty_used*X/5)+0.5)*5*CAST((X > 0) AS INT) AS Nearest5,
Inventory.price*Nearest5 AS TotPrice
FROM Inventory
LEFT JOIN UsedItems ON Inventory.id=UsedItems.cid
WHERE UsedItems.pid='1'
ORDER BY RefDes, value ASC
So, for example, I aliased UsedItems.qty_used as used_qty. At first I tried to use it in the next field, multiplying it by X, writing "used_qty*X AS To_Order" ... Query failed. Well, no worries, I had just put the original tab.field name and it worked.
Going further, I have a complex calculation and I want to use its result on the next field, but the same issue popped out: if I alias "ROUND(...)" AS Nearest5, and then try to use this value by multiplying it in the next field, the query will fail.
Please note: the query WORKS, but ONLY if I don't use aliases in the following fields, namely if I don't use the alias Nearest5 in the TotPrice field. I just want to avoid re-writing the whole ROUND(...) thing for the TotPrice field.
What am I missing/doing wrong? Either SQLite does not support aliases on the same query or I am using a wrong syntax and I am just too stuck/confused to see the mistake (which I'm sure it has to be really stupid).
Column aliases defined in a SELECT cannot be used:
For other expressions in the same SELECT.
For filtering in the WHERE.
For conditions in the FROM clause.
Many databases also restrict their use in GROUP BY and HAVING.
All databases support them in ORDER BY.
This is how SQL works. The issue is two things:
The logic order of processing clauses in the query (i.e. how they are compiled). This affects the scoping of parameters.
The order of processing expressions in the SELECT. This is indeterminate. There is no requirement for the ordering of parameters.
For a simple example, what should x refer to in this example?
select x as a, y as x
from t
where x = 2;
By not allowing duplicates, SQL engines do not have to make a choice. The value is always t.x.
You can try with nested queries.
A SELECT query can be nested in another SELECT query within the FROM clause;
multiple queries can be nested, for example by following the following pattern:
SELECT *,[your last Expression] AS LastExp From (SELECT *,[your Middle Expression] AS MidExp FROM (SELECT *,[your first Expression] AS FirstExp FROM yourTables));
Obviously, respecting the order that the expressions of the innermost select query can be used by subsequent select queries:
the first expressions can be used by all other queries, but the other intermediate expressions can only be used by queries that are further upstream.
For your case, your query may be:
SELECT *, PRC*Nearest5 AS TotPrice FROM (SELECT *, ROUND((UsedItems.qty_used*X/5)+0.5)*5*CAST((X > 0) AS INT) AS Nearest5 FROM (SELECT Inventory.id, UsedItems.pid, UsedItems.RefDes, Inventory.name, Inventory.category, Inventory.type, Inventory.package, Inventory.value, Inventory.manufacturer, Inventory.price AS PRC, UsedItems.qty_used*X AS To_Order FROM Inventory LEFT JOIN UsedItems ON Inventory.id=UsedItems.cid WHERE UsedItems.pid='1' ORDER BY RefDes, value ASC))

Get the first row of a nested field in BigQuery

I have been struggling with a question that seem simple, yet eludes me.
I am dealing with the public BigQuery table on bitcoin and I would like to extract the first transaction of each block that was mined. In other word, to replace a nested field by its first row, as it appears in the table preview. There is no field that can identify it, only the order in which it was stored in the table.
I ran the following query:
#StandardSQL
SELECT timestamp,
block_id,
FIRST_VALUE(transactions) OVER (ORDER BY (SELECT 1))
FROM `bigquery-public-data.bitcoin_blockchain.blocks`
But it process 492 GB when run and throws the following error:
Error: Resources exceeded during query execution: The query could not be executed in the allotted memory. Sort operator used for OVER(ORDER BY) used too much memory..
It seems so simple, I must be missing something. Do you have an idea about how to handle such task?
#standardSQL
SELECT * EXCEPT(transactions),
(SELECT transaction FROM UNNEST(transactions) transaction LIMIT 1) transaction
FROM `bigquery-public-data.bitcoin_blockchain.blocks`
Recommendation: while playing with large table like this one - I would recommend creating smaller version of it - so it incur less cost for your dev/test. Below can help with this - you can run it in BigQuery UI with destination table which you will then be using for your dev. Make sure you set Allow Large Results and unset Flatten Results so you preserve original schema
#legacySQL
SELECT *
FROM [bigquery-public-data:bitcoin_blockchain.blocks#1529518619028]
The value of 1529518619028 is taken from below query (at a time of running) - the reason I took four days ago is that I know number of rows in this table that time was just 912 vs current 528,858
#legacySQL
SELECT INTEGER(DATE_ADD(USEC_TO_TIMESTAMP(NOW()), -24*4, 'HOUR')/1000)
An alternative approach to Mikhail's: Just ask for the first row of an array with [OFFSET(0)]:
#StandardSQL
SELECT timestamp,
block_id,
transactions[OFFSET(0)] first_transaction
FROM `bigquery-public-data.bitcoin_blockchain.blocks`
LIMIT 10
That first row from the array still has some nested data, that you might want to flatten to only their first row too:
#standardSQL
SELECT timestamp
, block_id
, transactions[OFFSET(0)].transaction_id first_transaction_id
, transactions[OFFSET(0)].inputs[OFFSET(0)] first_transaction_first_input
, transactions[OFFSET(0)].outputs[OFFSET(0)] first_transaction_first_output
FROM `bigquery-public-data.bitcoin_blockchain.blocks`
LIMIT 1000

Output of CSUM() in teradata

Can anyone please help me in unstanding below csum function.
What will be the output in each case.
csum(1,1),
csum(1,1) + emp_no
csum(1,emp_no)+emp_no
CSUM is an old deprecated function from V2R3, over 15 years ago. It can always be rewritten using newer Standard SQL compliant syntax.
CSUM(1,1) returns the same as ROW_NUMBER() OVER (ORDER BY 1), a sequence starting with 1.
But you should never use it like that as ORDER BY 1 within a Windowed Aggregate Function is not the same as the final ORDER BY 1 of a SELECT, it's ordering all rows by the same value 1. Teradata calculates those functions in parallel based on the values in PARTITION BY and ORDER BY, this means all rows with the same PARTITION/ORDER data are processed on a single AMP, if there's only a single value one AMP will process all rows, resulting in a totally skewed distribution.
Instead of ORDER BY 1 you should use a column which is more or less unique in best case.
csum(1,emp_no)+emp_no is probably used with another SELECT to get the current maximum value of a column and add the new sequential values to it, i.e. creating your own gap-less sequence numbers.
This is the best way to do it:
SELECT ROW_NUMBER() OVER (ORDER BY column(s)_with_a_low_number_of_rows_per_value)
+ COALESCE((SELECT MAX(seqnum) FROM table),0)
,....
FROM table