Redshift LISTAGG frame clause - sql

I am trying to aggregate strings, but limited to only the preceding rows, not the whole partition. Does anyone know how to do this in Redshift?
What I am trying to achieve is the appended_event_namespace column below.
This is what I've tried so far.
LISTAGG(event_namespace, '/')
WITHIN GROUP (ORDER BY tstamp_true)
OVER (PARTITION BY acct_id) AS appended_event_namespace
This results in the full ApplicationLaunch/CategoryBrowse/NotificationCenter/UserProfile aggregation on every single row instead of what is in the desired screenshot.
The difficulty is in getting it to only append up to the current row since there doesn't seem to be a frame-clause for Redshift's LISTAGG(). Thanks for any ideas that may help.

You can hack this together with another query. Start with appended_event_namespace as the result of your original LISTAGG, then trim it for each row:
SELECT event_namespace,
SUBSTRING(appended_event_namespace,
1,
POSITION(event_namespace IN appended_event_namespace) + LEN(event_namespace) - 1
) AS appended_event_namespace_cum
FROM your_table;
Basically, you take your aggregated, ordered string and keep the first N characters, where N is ([the position where the event appears in the aggregated string] + [its length] - 1). That cuts off everything after that item and gives you a cumulative namespace. Note that POSITION returns the first match, so this only works when each event_namespace appears once per partition and is not a substring of an earlier value.
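To make the trimming logic concrete, here is a minimal Python sketch of the same idea, outside the database. The function name cumulative_prefix is hypothetical; the sample path is the one from the question.

```python
def cumulative_prefix(aggregated, event):
    """Keep everything up to and including the first occurrence of event."""
    start = aggregated.find(event)  # 0-based, whereas SQL POSITION is 1-based
    if start == -1:
        raise ValueError("event not found in aggregated string")
    return aggregated[: start + len(event)]


full_path = "ApplicationLaunch/CategoryBrowse/NotificationCenter/UserProfile"
for event in full_path.split("/"):
    print(cumulative_prefix(full_path, event))
```

If the same event occurs twice in a partition, the second occurrence is cut at the first match, which is the limitation of this approach.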

LISTAGG with a frame clause is not supported in Redshift yet. If you have columns that you can use for partitioning and ordering, you can use a self join (not very performant, but it accomplishes what you want):
SELECT
t1.id
,t1.tstamp_true
,t1.event_namespace
,LISTAGG(t2.event_namespace,'/') WITHIN GROUP (ORDER BY t2.tstamp_true) AS appended_event_namespace
FROM your_table t1
JOIN your_table t2
ON t1.id=t2.id
AND t1.tstamp_true>=t2.tstamp_true
GROUP BY 1,2,3
Alternatively, if you want to avoid the self join, you can use LISTAGG to build a JSON string with the following structure:
[{tstamp_true_1,event_namespace_1},{tstamp_true_N,event_namespace_N},...]
Then write a Python UDF that takes this JSON for the given group of rows, plus the tstamp_true of the current row, and returns the path. The function needs to keep only the entries whose tstamp_true_N is not later than the second parameter and concatenate their event_namespace_N values for the output.
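A rough sketch of what the body of such a function could look like, in plain Python. The assumptions here are mine, not from the answer: the JSON is an array of objects with the keys "tstamp_true" and "event_namespace", timestamps are strings in a format that sorts lexicographically, and the name build_path is hypothetical. The CREATE FUNCTION wrapper is left out.

```python
import json


def build_path(events_json, current_tstamp, sep="/"):
    """Join the event namespaces that occurred up to current_tstamp."""
    events = json.loads(events_json)
    # Keep the current row and everything before it (assumed key names).
    kept = [e for e in events if e["tstamp_true"] <= current_tstamp]
    kept.sort(key=lambda e: e["tstamp_true"])
    return sep.join(e["event_namespace"] for e in kept)


sample = json.dumps([
    {"tstamp_true": "2017-01-01 00:00:03", "event_namespace": "NotificationCenter"},
    {"tstamp_true": "2017-01-01 00:00:01", "event_namespace": "ApplicationLaunch"},
    {"tstamp_true": "2017-01-01 00:00:02", "event_namespace": "CategoryBrowse"},
])
print(build_path(sample, "2017-01-01 00:00:02"))
```

Because it filters on the timestamp rather than on the event name, this variant still works when the same event appears more than once in a group.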

Related

Eliminating Entries Based On Revision

I need to figure out how to eliminate older revisions from my query's results. My database stores orders as 'Q000000', and revisions have an appended '-number'. My query is currently as follows:
SELECT DISTINCT Estimate.EstimateNo
FROM Estimate
INNER JOIN EstimateDetails ON EstimateDetails.EstimateID = Estimate.EstimateID
INNER JOIN EstimateDoorList ON EstimateDoorList.ItemSpecID = EstimateDetails.ItemSpecID
WHERE (Estimate.SalesRepID = '67' OR Estimate.SalesRepID = '61') AND Estimate.EntryDate >= '2017-01-01 00:00:00.000' AND EstimateDoorList.SlabSpecies LIKE '%MDF%'
ORDER BY Estimate.EstimateNo
So for instance, the results would include:
Q120455-10
Q120455-11
Q121675-2
Q122361-1
Q123456
Q123456-1
From this, I need to eliminate 'Q120455-10' because of the presence of '-11' for that order, and 'Q123456' because of the presence of the '-1' revision. I'm struggling to figure out how to do this. My immediate thought was to use CASE statements, but I'm not sure of the best way to implement them or how to filter. Thank you in advance, and let me know if any more information is needed.
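First, parse the EstimateNo column into an order number and a revision number using CHARINDEX and SUBSTRING (or STRING_SPLIT in newer versions), and CAST/CONVERT the revision to a numeric type. Estimates without a '-' suffix need special handling, otherwise the CAST fails on them; here they are treated as revision 0:
SELECT
Estimate.EstimateNo AS [FullEstimateNo],
-- appending '-' guarantees that CHARINDEX finds a dash, even for unrevised estimates
LEFT(Estimate.EstimateNo, CHARINDEX('-', Estimate.EstimateNo + '-') - 1) AS [EstimateNo],
CASE WHEN CHARINDEX('-', Estimate.EstimateNo) > 0
THEN CAST(SUBSTRING(Estimate.EstimateNo, CHARINDEX('-', Estimate.EstimateNo) + 1, LEN(Estimate.EstimateNo)) AS INT)
ELSE 0 END AS [EstimateRevision]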
FROM
...
You can then use either:
APPLY, to select the TOP 1 row that matches the EstimateNo, or
a window function such as ROW_NUMBER, to select only the rows with a row number of 1.
For example, using ROW_NUMBER would look something like this:
SELECT
ROW_NUMBER() OVER (PARTITION BY EstimateNo ORDER BY EstimateRevision DESC) AS LastRevisionForEstimate,
-- rest of the needed columns
FROM
(
-- query above goes here
) AS parsed -- a derived table needs an alias in SQL Server
You can then wrap the query above in a simple select with a where predicate filtering out a specific value of LastRevisionForEstimate, for instance
SELECT --needed columns
FROM -- result set above
WHERE LastRevisionForEstimate = 1
Please note that this is, to a certain extent, pseudocode, as I do not have your schema and cannot test the query.
If you dislike the nested selects, check out Common Table Expressions.
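As a way to check the intended result outside the database, here is a small Python sketch of the same "split, then keep the highest revision per order" logic. The function names are hypothetical, the sample list is illustrative, and treating an estimate without a suffix as revision 0 is an assumption.

```python
def split_estimate(estimate_no):
    """Split 'Q123456-2' into ('Q123456', 2); no suffix counts as revision 0."""
    base, dash, revision = estimate_no.partition("-")
    return base, int(revision) if dash else 0


def latest_revisions(estimate_numbers):
    """Return only the highest revision of each order, sorted by estimate number."""
    latest = {}
    for number in estimate_numbers:
        base, revision = split_estimate(number)
        if base not in latest or revision > latest[base][0]:
            latest[base] = (revision, number)
    return sorted(number for _, number in latest.values())


sample = ["Q120455-10", "Q120455-11", "Q121675-2", "Q122361-1", "Q123456", "Q123456-1"]
print(latest_revisions(sample))
```

Casting the revision to an integer matters here for the same reason as in the SQL: compared as strings, '10' would sort before '2'.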

Hive collect list is not working with millions of records

I have a Hive query that uses collect_list in the outer query. The inner query returns an ordered list of 1.8 million records. Every time I run the query, 500-600 records come back wrong, with the order broken in a pattern. I also tried the collect UDF from the Brickhouse jar, which gives the same result, with 500-600 records differing. I don't have any clue how to debug this.
select concat_ws(',',collect_list(host)),
concat_ws(',',collect_list(cast(total_data_volume_host as string))),
concat_ws(',',collect_list(cast(event_duration_host as string))),
concat_ws(',',collect_list(application_name)),
concat_ws(',',collect_list(cast(total_data_volume_app as string))),
concat_ws(',',collect_list(cast(event_duration_app as string))),
Using ORDER BY in a sub-query does not guarantee an ordered array, so use sort_array instead.
concat_ws works only with arrays of strings, which is why you cast your values to string before using collect_list.
The problem then is that the natural order of the elements changes:
100 > 20 but '20' > '100' (alphabetic order).
The workaround is to lpad the values with spaces, sort, concatenate, and then remove the spaces.
with t as (select explode(array(100,20,3)) as i)
select translate(concat_ws(',',sort_array(collect_list(lpad(cast(i as string),10,' ')))),' ','')
from t;
3,20,100
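The same pad, sort, strip trick can be sketched in Python to show why the padding restores numeric order. The function name is hypothetical, and the sketch assumes non-negative integers that fit within the padding width.

```python
def sorted_numeric_concat(values, width=10, sep=","):
    """Sort numbers as left-padded strings, then join them and drop the padding."""
    padded = [str(v).rjust(width) for v in values]  # same role as lpad(..., 10, ' ')
    padded.sort()                                   # plain string sort, as on an array of strings
    return sep.join(padded).replace(" ", "")        # same role as translate(..., ' ', '')


print(sorted(str(v) for v in [100, 20, 3]))  # unpadded string sort: wrong numeric order
print(sorted_numeric_concat([100, 20, 3]))
```

Padding makes every string the same length, so comparing character by character agrees with comparing the numbers.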

Postgresql Writing max() Window function with multiple partition expressions?

I am trying to get the max value of column A ("original_list_price") over windows defined by 2 columns (namely - a unique identifier, called "address_token", and a date field, called "list_date"). I.e. I would like to know the max "original_list_price" of rows with both the same address_token AND list_date.
E.g.:
SELECT
address_token, list_date, original_list_price,
max(original_list_price) OVER (PARTITION BY address_token, list_date) as max_list_price
FROM table1
The query already takes >10 minutes when I use just 1 expression in the PARTITION (e.g. using address_token only, nothing after that). Sometimes the query times out. (I use Mode Analytics and get this error: An I/O error occurred while sending to the backend) So my questions are:
1) Will the Window function with multiple PARTITION BY expressions work?
2) Any other way to achieve my desired result?
3) Any way to make Windows functions, especially the Partition part run faster? e.g. use certain data types over others, try to avoid long alphanumeric string identifiers?
Thank you!
The complexity of the window function's partitioning clause should not have a big impact on performance. Keep in mind that your query returns all the rows in the table, so the result set might be very large.
Window functions should be able to take advantage of indexes. For this query:
SELECT address_token, list_date, original_list_price,
max(original_list_price) OVER (PARTITION BY address_token, list_date) as max_list_price
FROM table1;
You want an index on table1(address_token, list_date, original_list_price).
You could try writing the query as:
select t1.*,
(select max(t2.original_list_price)
from table1 t2
where t2.address_token = t1.address_token and t2.list_date = t1.list_date
) as max_list_price
from table1 t1;
This should return results more quickly, because it doesn't have to calculate the window function value first (for all rows) before returning values.
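Both forms of the query compute the same thing: every row is returned, annotated with the maximum price of its (address_token, list_date) group. A minimal Python sketch of that result, with a hypothetical function name and made-up sample rows:

```python
def add_group_max(rows):
    """Append the per-(address_token, list_date) maximum price to each row."""
    best = {}
    for token, list_date, price in rows:
        key = (token, list_date)
        if key not in best or price > best[key]:
            best[key] = price
    return [(t, d, p, best[(t, d)]) for t, d, p in rows]


rows = [
    ("a", "2017-01-01", 100),
    ("a", "2017-01-01", 150),
    ("a", "2017-02-01", 90),
]
for row in add_group_max(rows):
    print(row)
```

The output has exactly as many rows as the input, which is why the size of the result set can dominate the running time.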

Output of CSUM() in teradata

Can anyone please help me understand the CSUM function below?
What will the output be in each case?
csum(1,1),
csum(1,1) + emp_no
csum(1,emp_no)+emp_no
CSUM is an old deprecated function from V2R3, over 15 years ago. It can always be rewritten using newer Standard SQL compliant syntax.
CSUM(1,1) returns the same as ROW_NUMBER() OVER (ORDER BY 1), a sequence starting with 1.
But you should never use it like that: ORDER BY 1 within a windowed aggregate function is not the same as the final ORDER BY 1 of a SELECT. It orders all rows by the same constant value, 1. Teradata calculates those functions in parallel based on the values in PARTITION BY and ORDER BY, which means all rows with the same PARTITION/ORDER data are processed on a single AMP. If there is only a single value, one AMP processes all rows, resulting in a totally skewed distribution.
Instead of ORDER BY 1, use a column that is, in the best case, more or less unique.
csum(1,emp_no)+emp_no is probably used with another SELECT to get the current maximum value of a column and add the new sequential values to it, i.e. creating your own gap-less sequence numbers.
This is the best way to do it:
SELECT ROW_NUMBER() OVER (ORDER BY column(s)_with_a_low_number_of_rows_per_value)
+ COALESCE((SELECT MAX(seqnum) FROM table),0)
,....
FROM table
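The arithmetic behind that query, sketched in Python: number the new rows from 1 in a defined order, then offset them by the current maximum, or by 0 when the target is empty. The function name and sample values are hypothetical.

```python
def assign_sequence_numbers(new_rows, existing_seqnums, key=None):
    """Give new rows gapless numbers that continue after the current maximum."""
    start = max(existing_seqnums, default=0)  # COALESCE(MAX(seqnum), 0)
    ordered = sorted(new_rows, key=key)       # ORDER BY inside ROW_NUMBER()
    return [(start + n, row) for n, row in enumerate(ordered, start=1)]


print(assign_sequence_numbers(["carol", "alice", "bob"], [7, 8, 9]))
print(assign_sequence_numbers(["carol", "alice"], []))
```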

Oracle Group by issue

I have the query below. The problem is that the subquery for the last column, productdescr, returns two records even with DISTINCT, so the query fails. I need to add one more column to the WHERE clause of that subquery so that it returns one record. The issue is that the column I need to add should not be part of the GROUP BY clause.
SELECT product_billing_id,
billing_ele,
SUM(round(summary_net_amt_excl_gst/100)) gross,
(SELECT DISTINCT description
FROM RES.tariff_nt
WHERE product_billing_id = aa.product_billing_id
AND billing_ele = aa.billing_ele) productdescr
FROM bil.bill_sum aa
WHERE file_id = 38613 --1=1
AND line_type = 'D'
AND (product_billing_id, billing_ele) IN (SELECT DISTINCT
product_billing_id,
billing_ele
FROM bil.bill_l2 )
AND trans_type_desc <> 'Change'
GROUP BY product_billing_id, billing_ele
I want to modify the subquery as shown below, adding a new filter to its WHERE clause so that it returns one record.
(SELECT DISTINCT description
FROM RES.tariff_nt
WHERE product_billing_id = aa.product_billing_id
AND billing_ele = aa.billing_ele
AND (rate_structure_start_date <= TO_DATE(aa.p_effective_date,'yyyymmdd')
AND rate_structure_end_date > TO_DATE(aa.p_effective_date,'yyyymmdd'))
) productdescr
The aa.p_effective_date should not be a part of GROUP BY clause. How can I do it? Oracle is the Database.
So there are multiple RES.tariff_nt records for a given product_billing_id/billing_ele, differentiated by their start/end dates.
You want the description for the record whose date range encompasses the p_effective_date from bil.bill_sum. The kicker is that you can't (or don't want to) include that column in the GROUP BY. That suggests you have multiple rows in bil.bill_sum with different effective dates.
The question is what you want to happen when you summarise multiple rows with different dates: which of those dates should be used to look up the description?
If it doesn't matter, simply use MIN(aa.p_effective_date), or MAX.
Have you looked into Oracle's analytic functions? A good reference is "Analytical Functions by Example".
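To illustrate the MIN(aa.p_effective_date) suggestion, here is a small Python sketch of the lookup it implies: reduce the group's effective dates to one date, then find the tariff whose date range contains it. The function name, the tuple layout, and the sample tariffs are hypothetical; the 'yyyymmdd' format and the half-open range (start <= date < end) follow the question.

```python
from datetime import date, datetime


def description_for_group(effective_dates, tariffs):
    """Pick the tariff description valid on the earliest effective date of a group."""
    chosen = min(datetime.strptime(d, "%Y%m%d").date() for d in effective_dates)
    matches = [desc for desc, start, end in tariffs if start <= chosen < end]
    if len(matches) > 1:
        raise ValueError("overlapping tariff date ranges")
    return matches[0] if matches else None


tariffs = [
    ("Old plan", date(2015, 1, 1), date(2016, 1, 1)),
    ("New plan", date(2016, 1, 1), date(2017, 1, 1)),
]
print(description_for_group(["20160301", "20151215"], tariffs))
```

Swapping min for max picks the description valid on the latest date instead, which is the choice the answer leaves open.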