I have a massive table with records that all have a date and a price:
id | date | price | etc...
And then I have a list of random date ranges, never with the same length:
ARRAY [
daterange('2020-11-02','2020-11-05'),
daterange('2020-11-15','2020-11-20')
]
How would I most efficiently go about summing and grouping the records by their existence in one of the ranges, like so:
range | sum
------------------------------------------
[2020-11-02,2020-11-05) | 125.55
[2020-11-15,2020-11-20) | 566.12
You can unnest the array, left join the table on dates that are contained in ranges, and finally aggregate:
select x.rg, sum(t.price) as sum_price
from unnest($1) as x(rg)
left join mytable t on x.rg @> t.date
group by x.rg
$1 represents the array of dateranges that you want to pass to the query.
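For a quick test without a bind parameter, the array can be inlined as a literal; a sketch reusing mytable and the ranges from the question:
select x.rg, sum(t.price) as sum_price
from unnest(array[
       daterange('2020-11-02','2020-11-05'),
       daterange('2020-11-15','2020-11-20')
     ]) as x(rg)
left join mytable t on x.rg @> t.date
group by x.rg;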
Is it possible to get the sum of value from the calendar_table into the main_table without a join like the one below?
select
date, sum(value)
from
main_table
inner join
calendar_table on start_date <= date and end_date >= date
group by
date
I am trying to avoid a join like this because main_table is a very large table whose rows span very wide start-to-end date ranges, and it is absolutely killing my performance. And I've already indexed both tables.
Sample desired results:
+-----------+-------+
| date | total |
+-----------+-------+
| 7-24-2010 | 11 |
+-----------+-------+
Sample tables
calendar_table:
+-----------+-------+
| date | value |
+-----------+-------+
| 7-24-2010 | 5 |
| 7-25-2010 | 6 |
| ... | ... |
| 7-23-2020 | 2 |
| 7-24-2020 | 10 |
+-----------+-------+
main_table:
+------------+-----------+
| start_date | end_date |
+------------+-----------+
| 7-24-2010 | 7-25-2010 |
| 8-1-2011 | 8-5-2011 |
+------------+-----------+
You want the sum in the calendar table. So, I would recommend an "incremental" approach. This starts by unpivoting the data and putting the value as an increment and decrement in the results:
select c.date, c.value as inc
from main_table m join
     calendar_table c
     on m.start_date = c.date
union all
select dateadd(day, 1, c.date), - c.value as inc
from main_table m join
     calendar_table c
     on m.end_date = c.date;
The final step is to aggregate and do a cumulative sum:
select date, sum(inc) as value_on_date,
       sum(sum(inc)) over (order by date) as net_value
from ((select c.date, c.value as inc
       from main_table m join
            calendar_table c
            on m.start_date = c.date
      ) union all
      (select dateadd(day, 1, c.date), - c.value as inc
       from main_table m join
            calendar_table c
            on m.end_date = c.date
      )
     ) c
group by date
order by date;
This processes two rows of data for each row in the main table. Assuming your time spans typically run longer than two days per main-table row, the resulting data set is much smaller. And smaller data implies a faster query.
Here's a CROSS APPLY example to possibly work from. Note that main_table has no date column of its own, so the range endpoints are returned instead:
select main_table.start_date
     , main_table.end_date
     , CalendarTable.ValueSum
from main_table
CROSS APPLY (
    SELECT SUM(value) as ValueSum
    FROM calendar_table
    WHERE calendar_table.date >= main_table.start_date
      and calendar_table.date <= main_table.end_date
) as CalendarTable;
You could try something like this ... but be aware, it is still technically 'joined' to the main table. If you look at an execution plan, you will see that there is a join operation of some kind going on.
select
    m.start_date,
    m.end_date,
    (select sum(value)
     from calendar_table t
     where m.start_date <= t.date and m.end_date >= t.date) as total
from
    main_table m
The thing about that query is that main_table is not grouped as part of the results. You could do that grouping outside the select, but I don't know what you are trying to achieve; if you are grouping just to get the SUM, then keeping main_table in the grouping is probably superfluous.
As already mentioned, you must perform a join of some sort in order to get data from more than one table in a query.
You did not provide details of the indexes, which are important for performance. I suggest the following indexes to optimize query performance.
For calendar_table, make sure you have a unique clustered index (or primary key) on date. Alternatively, a unique nonclustered index on date with the value column included.
A composite index on the main_table start_date and end_date columns may also be beneficial.
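For illustration, DDL for those suggestions might look like this (the index names are placeholders, not from the original post):
CREATE UNIQUE CLUSTERED INDEX cdx_calendar ON dbo.calendar_table (date);
-- alternative to the clustered index above:
-- CREATE UNIQUE NONCLUSTERED INDEX ix_calendar_date ON dbo.calendar_table (date) INCLUDE (value);
CREATE INDEX ix_main_range ON dbo.main_table (start_date, end_date);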
Even with optimal indexes, the query will still take some time against a 500M row table (e.g. a couple of minutes) with no additional filter criteria. If you need results in milliseconds, create an indexed view to materialize the join and aggregation results. Be aware the indexed view will add overhead for inserts/deletes on both tables as well as for updates to the value column in order to keep the index consistent with the underlying data.
Below is an indexed view DDL example.
CREATE VIEW dbo.vw_example
WITH SCHEMABINDING
AS
SELECT
date, sum(value) AS value, COUNT_BIG(*) AS countbig
from
dbo.main_table
inner join
dbo.calendar_table on start_date <= date and end_date >= date
group by
date;
GO
CREATE UNIQUE CLUSTERED INDEX cdx ON dbo.vw_example(date);
GO
Depending on your SQL Server edition, the optimizer may be able to use the indexed view automatically so your original query can use the view index without changes. Otherwise, query the view directly and specify a NOEXPAND hint:
SELECT date, value AS total
FROM dbo.vw_example WITH (NOEXPAND);
EDIT:
With the query improvement @GordonLinoff suggested, a non-clustered index on the main_table end_date column will help optimize that query.
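A sketch of that index (again, the name is a placeholder):
CREATE NONCLUSTERED INDEX ix_main_end_date ON dbo.main_table (end_date);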
I have the following table structure in my Postgres DB (v12.0)
id | pieces | item_id | material_detail
---|--------|---------|-----------------
1 | 10 | 2 | [{"material_id":1,"pieces":10},{"material_id":2,"pieces":20},{"material_id":3,"pieces":30}]
2 | 20 | 2 | [{"material_id":1,"pieces":40}]
3 | 30 | 3 | [{"material_id":1,"pieces":20},{"material_id":3,"pieces":30}]
I am running a GROUP BY query on these records, like below:
SELECT SUM(pieces) FROM detail_table GROUP BY item_id HAVING item_id = 2
Using this I get the total pieces, 30. But how could I also get the total pieces from material_detail, grouped by material_id?
I want result something like this
pieces | material_detail
-------| ------------------
30 | [{"material_id":1,"pieces":50},{"material_id":2,"pieces":20},{"material_id":3,"pieces":30}]
As I come from a MySQL background, I don't know how to achieve this with JSON fields in Postgres.
Note: material_detail column is of JSONB type.
You are aggregating on two different levels, and I can't think of a solution that wouldn't need two separate aggregation steps. Additionally, to aggregate the material information, all arrays for the item_id have to be unnested first before the actual pieces value can be aggregated per material_id. Then this has to be aggregated back into a JSON array.
with pieces as (
  -- the basic aggregation for the "detail pieces"
  select dt.item_id, sum(dt.pieces) as pieces
  from detail_table dt
  where dt.item_id = 2
  group by dt.item_id
), details as (
  -- normalize the material information and aggregate the pieces per material_id
  select dt.item_id,
         (m.detail -> 'material_id')::int as material_id,
         sum((m.detail -> 'pieces')::int) as pieces
  from detail_table dt
    cross join lateral jsonb_array_elements(dt.material_detail) as m(detail)
  where dt.item_id in (select item_id from pieces) --<< don't aggregate too much
  group by dt.item_id, material_id
), material as (
  -- now de-normalize the material aggregation back into a single JSON array
  -- for each item_id
  select item_id, jsonb_agg(to_jsonb(d) - 'item_id') as material_detail
  from details d
  group by item_id
)
-- join both results together
select p.item_id, p.pieces, m.material_detail
from pieces p
  join material m on m.item_id = p.item_id;
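To try this locally, a minimal setup built from the question's sample rows could look like:
create table detail_table (id int, pieces int, item_id int, material_detail jsonb);
insert into detail_table values
  (1, 10, 2, '[{"material_id":1,"pieces":10},{"material_id":2,"pieces":20},{"material_id":3,"pieces":30}]'),
  (2, 20, 2, '[{"material_id":1,"pieces":40}]'),
  (3, 30, 3, '[{"material_id":1,"pieces":20},{"material_id":3,"pieces":30}]');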
How do I get a substring from a column, for use in filter and GROUP BY clauses, in an AWS Redshift database?
I have table with records like:
Table_Id | Categories | Value
<ID> | ABC1; ABC1-1; XYZ | 10
<ID> | ABC1; ABC1-2; XYZ | 15
<ID> | XYZ | 5
.....
Now I want to filter records based on an individual category like 'ABC1', or 'ABC1 and XYZ'.
Expected output from the query would look like:
Table_Id | Categories | Value
<ID> | ABC1 | 25
<ID> | ABC1-1 | 10
<ID> | ABC1-2 | 15
<ID> | XYZ | 30
.....
So I need to group the results by individual categories.
If you have at most 3 values in any "categories" cell you can unnest the cells, get the list of unique values and use that list in a join condition like this:
WITH
categories as (
  select distinct category
  from (
    select trim(split_part(categories,';',1)) as category from your_table
    union select trim(split_part(categories,';',2)) from your_table
    union select trim(split_part(categories,';',3)) from your_table
  ) s
  where nullif(category,'') is not null
)
SELECT
 t2.category
,sum(t1.value)
FROM your_table t1
JOIN categories t2
ON trim(split_part(t1.categories,';',1))=t2.category
OR trim(split_part(t1.categories,';',2))=t2.category
OR trim(split_part(t1.categories,';',3))=t2.category
GROUP BY t2.category
If you have more than 3 categories per cell, just add another split_part level in both the WITH part and the join condition (e.g. split_part(categories,';',4)).
@JonScott, @AlexYes, and other pals who struggle with similar situations:
I found a better approach than the one suggested by @AlexYes.
What I did was flatten the categories column into individual records, which I can then process further.
Query:
select row_number() over (order by 1) as r1,
       to_char(timestamptz 'epoch' + date_time * interval '1 second', 'yyyy-mm-dd') as day,
       split_part(categories, ';', numbers.n) as catg,
       value
from <TABLE>
join numbers
  on numbers.n <= regexp_count(categories, ';') + 1
<OTHER_CONDITIONS>
Explanation:
Two functions are useful here: first, the split_part function, which takes a string, splits it on ';' delimiter, and returns the first, second, ... , nth value specified from the split string; second, regexp_count, which tells us how many times a particular pattern is found in our string.
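For context, the query above assumes a small helper table of integers; a minimal sketch (the name numbers matches the query, and the size just needs to cover the maximum category count):
create table numbers (n int);
insert into numbers values (1),(2),(3),(4),(5),(6),(7),(8),(9),(10);
And to illustrate the two functions on the sample data:
select split_part('ABC1; ABC1-1; XYZ', ';', 2);    -- returns ' ABC1-1' (note the leading space; trim() if needed)
select regexp_count('ABC1; ABC1-1; XYZ', ';') + 1; -- returns 3, the number of categories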
To do this fully dynamically, you need to transpose or pivot the values in the "categories" column into separate rows.
Unfortunately, a "fully dynamic" solution (without knowing the different values beforehand) is NOT possible in Redshift.
Your options are as follows:
1. Use the method suggested by AlexYes in another answer. This is semi-dynamic and is probably your best option.
2. Outside of Redshift, run some ETL code to perform the column-to-multiple-rows transformation.
3. Create a hardcoded solution, and perform the pivot something like this:
select table_id, 'ABC1' as category, case when concat(categories,';') ilike '%ABC1;%' then value else 0 end as value from your_table
union all
select table_id, 'ABC1-1' as category, case when concat(categories,';') ilike '%ABC1-1;%' then value else 0 end as value from your_table
union all
etc
I have a table t in a Postgres database. It has a column data which contains jsonb data in the following format (for each record):
{
"20161214": {"4": ["3-14", "5-16", "642"], "9": ["3-10", "5-10", "664"] },
"20161217": {"3": ["3-14", "5-16", "643"], "7": ["3-10", "5-10", "661"] }
}
where 20161214 is the date, "4" is the month, 642 is the amount.
I need to find the minimum amount for each record of the table and the month that amount belongs to.
What I have tried:
Using the jsonb_each function to separate key/value pairs and then using the min function. But I still can't get the month that the amount belongs to.
How can this be achieved?
select x.date
      ,x.month
      ,x.amount
from t
left join lateral
     (select j1.date
            ,j2.month
            ,(j2.value ->> 2)::numeric as amount
      from jsonb_each(t.data) j1 (date, value)
      left join lateral jsonb_each(j1.value) j2 (month, value)
        on true
      order by amount
      limit 1
     ) x
  on true
+----------+-------+--------+
| date | month | amount |
+----------+-------+--------+
| 20161214 | 4 | 642 |
+----------+-------+--------+
Alternatively (without joins):
select
  min(case when amount = min_amount then month end) as month,
  min_amount as amount
from (
  select
    key as month,
    (select min((value ->> 2)::int) from jsonb_each(value)) as amount,
    min((select min((value ->> 2)::int) from jsonb_each(value))) over (partition by rnum) as min_amount,
    rnum
  from (
    select
      (jsonb_each(data)).*,
      row_number() over () as rnum
    from t
  ) t
) t
group by
  rnum, min_amount;
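Either query can be tested against a minimal table built from the sample document in the question:
create table t (data jsonb);
insert into t values ('{
  "20161214": {"4": ["3-14", "5-16", "642"], "9": ["3-10", "5-10", "664"]},
  "20161217": {"3": ["3-14", "5-16", "643"], "7": ["3-10", "5-10", "661"]}
}');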
I got a values table such as:
id | user_id | value | date
---------------------------------
1 | 12 | 38 | 2014-04-05
2 | 15 | 19 | 2014-04-05
3 | 12 | 47 | 2014-04-08
I want to retrieve all values for given dates. However, if I don't have a value for one specific date, I want to get the previous available value. For instance, with the above dataset, if I query values for user 12 for dates 2014-04-07 and 2014-04-08, I need to retrieve 38 and 47.
I succeeded using one query per date, like:
SELECT *
FROM values
WHERE date <= $date
ORDER BY date DESC
LIMIT 1
However, that would require dates.length requests each time. So I'm wondering if there is a more performant solution to retrieve all my values in a single request?
In general, you would use a VALUES clause to specify multiple values in a single query.
If you have only occasional dates missing (and thus no big gaps in dates between rows for any particular user_id) then this would be an elegant solution:
SELECT dt, coalesce(value, lag(value) OVER (ORDER BY dt)) AS value
FROM (VALUES ('2014-04-07'::date), ('2014-04-08')) AS dates(dt)
LEFT JOIN "values" ON "date" = dt AND user_id = 12;
The lag() window function picks the previous value if the current row does not have a value.
If, on the other hand, there may be big gaps, you need to do some more work:
SELECT DISTINCT dt, first_value(value) OVER (PARTITION BY dt ORDER BY diff) AS value
FROM (
    SELECT dt, value, dt - "date" AS diff
    FROM (VALUES ('2014-04-07'::date), ('2014-04-08')) AS dates(dt)
    CROSS JOIN "values"
    WHERE user_id = 12 AND "date" <= dt) sub;
In this case a CROSS JOIN is made between the dates in the VALUES clause and the table rows for user_id = 12, and the difference between each requested date and each row's date is computed in a sub-query, keeping only rows on or before the requested date. So every requested date has at least one candidate value. In the main query the value with the smallest difference is selected per requested date using the first_value() window function, partitioned by dt. Note that simply ordering on diff and picking the first row would not work here because you want values for multiple dates returned.
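To verify both variants, here is a minimal setup built from the question's sample rows (the gap-tolerant second query should return 38 for 2014-04-07 and 47 for 2014-04-08):
create table "values" (id int, user_id int, value int, "date" date);
insert into "values" values
  (1, 12, 38, '2014-04-05'),
  (2, 15, 19, '2014-04-05'),
  (3, 12, 47, '2014-04-08');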