daily difference calculation performance improvement - sql

I need to calculate the daily price difference in percentage. The query I have works but is getting slower every day. The main idea is to calculate the delta with the previous row. The previous row is normally the previous day, but there might sometimes be a day missing. When that happens it needs to take the last day available.
I'm looking for a way to limit the set that I retrieve in the inner query. There are about 20.000 records added per day.
update
price_watches pw
set
min_percent_changed = calc.delta
from
(select
id,
product_id,
calculation_date,
(1 - (price_min / lag(price_min) over (order by product_id, calculation_date))) * 100 as delta
from
price_watches
where
price_min > 0) calc
where
calc.id = pw.id;

This is wrong on many levels.
1.) It looks like you are updating all rows, including old rows that already have their min_percent_changed set and probably shouldn't be updated again.
2.) You are updating even if the new min_percent_changed is the same as the old.
3.) You are updating rows to store a redundant value that could be calculated on the fly rather cheaply (if done right), thereby making the row bigger and more error prone and producing lots of dead row versions, which means a lot of work for vacuum and slowing down everything else.
You shouldn't be doing any of this.
If you need to materialize the daily delta for read performance optimization, I suggest a small additional 1:1 table that can be updated cheaply without messing with the main table, especially if you recalculate the value for every row every time. Better yet, calculate only new data.
If you really want to recalculate for every row (like your current UPDATE seems to do), make that a MATERIALIZED VIEW to automate the process.
If the new query I am going to demonstrate is fast enough, don't store any redundant data and calculate deltas on the fly.
For your current setup, this query should be much faster, when combined with this matching index:
CREATE INDEX price_watches_product_id_calculation_date_idx
ON price_watches(product_id, calculation_date DESC NULLS LAST);
Query:
UPDATE price_watches pw
SET min_percent_changed = calc.delta
FROM price_watches p1
, LATERAL (
SELECT (1 - p1.price_min / p2.price_min) * 100 AS delta
FROM price_watches p2
WHERE p2.product_id = p1.product_id
AND p2.calculation_date < p1.calculation_date
ORDER BY p2.calculation_date DESC NULLS LAST
LIMIT 1
) calc
WHERE p1.price_min > 0
AND p1.calculation_date = current_date - 1 -- only update new rows!
AND pw.id = p1.id
AND pw.min_percent_changed IS DISTINCT FROM calc.delta;
I am restricting the update to rows from "yesterday": current_date - 1. This is a wild guess at what you actually need.
Explanation for the added last line of the query:
How do I (or can I) SELECT DISTINCT on multiple columns?
Similar to this answer on dba.SE from just a few hours ago:
Slow window function query with big table
Proper information in the question would allow me to adapt the query and give more explanation.
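The "previous available row" lookup and the "only touch rows that actually change" filter can be tried out in a small sketch. This is not the PostgreSQL query above: it uses Python's bundled sqlite3 as a stand-in, replaces LATERAL with a correlated subquery, and uses SQLite's null-safe IS NOT in place of IS DISTINCT FROM. The sample rows and the helper name update_deltas are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE price_watches (
    id                  INTEGER PRIMARY KEY,
    product_id          INTEGER NOT NULL,
    calculation_date    TEXT    NOT NULL,  -- ISO dates compare correctly as text
    price_min           REAL    NOT NULL,
    min_percent_changed REAL
);
CREATE INDEX price_watches_product_id_calculation_date_idx
    ON price_watches (product_id, calculation_date DESC);
INSERT INTO price_watches (id, product_id, calculation_date, price_min) VALUES
    (1, 1, '2024-01-01', 100.0),
    (2, 1, '2024-01-02', 110.0),
    -- 2024-01-03 is missing for product 1 on purpose
    (3, 1, '2024-01-04',  99.0),
    (4, 2, '2024-01-03',  50.0),
    (5, 2, '2024-01-04',  50.0);
""")

# For each row of the given day, look up the latest earlier row of the same
# product (not necessarily the day before) and compute the delta against it.
DELTA_SQL = """
SELECT id, delta
FROM (
    SELECT p1.id AS id,
           (1 - p1.price_min / (
                SELECT p2.price_min
                FROM price_watches p2
                WHERE p2.product_id = p1.product_id
                  AND p2.calculation_date < p1.calculation_date
                ORDER BY p2.calculation_date DESC
                LIMIT 1)) * 100 AS delta
    FROM price_watches p1
    WHERE p1.calculation_date = ?
      AND p1.price_min > 0
)
WHERE delta IS NOT NULL  -- rows without any earlier row are skipped
"""

def update_deltas(conn, day):
    """Update only rows of `day` whose stored delta differs; return rows changed."""
    before = conn.total_changes
    rows = conn.execute(DELTA_SQL, (day,)).fetchall()
    conn.executemany(
        "UPDATE price_watches SET min_percent_changed = ? "
        "WHERE id = ? AND min_percent_changed IS NOT ?",  # null-safe comparison
        [(delta, row_id, delta) for row_id, delta in rows],
    )
    return conn.total_changes - before

changed_first = update_deltas(conn, "2024-01-04")  # both rows of that day are new
changed_again = update_deltas(conn, "2024-01-04")  # nothing differs any more
deltas = dict(conn.execute(
    "SELECT id, min_percent_changed FROM price_watches ORDER BY id"))
```

Row 3 is compared against 2024-01-02 because 2024-01-03 is missing, and the second call writes nothing, which is the point of the last WHERE condition: no dead row versions for unchanged values.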

Related

SQL: Reduce resultset to X rows?

I have the following MYSQL table:
measuredata:
- ID (bigint)
- timestamp
- entityid
- value (double)
The table contains >1 billion entries. I want to be able to visualize any time-window. The time window can be size of "one day" to "many years". There are measurement values round about every minute in DB.
So the number of entries for a time-window can vary widely, from a few hundred to several thousand or even millions.
Those values are meant to be visualized in a chart on a webpage.
If the chart is, let's say, 800px wide, it does not make sense to get thousands of rows from the database if the time-window is quite big. I cannot show more than 800 values on this chart anyhow.
So, is there a way to reduce the resultset directly on DB-side?
I know "average" and "sum" etc. as aggregate functions. But how can I, e.g., aggregate 100k rows from a big time-window to, let's say, 800 final rows?
Just getting those 100k rows and let the chart do the magic is not the preferred option. Transfer-size is one reason why this is not an option.
Isn't there something on DB side I can use?
Something like avg() to shrink X rows to Y averaged rows?
Or a simple magic to just skip every #th row to shrink X to Y?
update:
Although I'm using MySQL right now, I'm not tied to this. If PostgreSQL f.i. provides a feature that could solve the issue, I'm willing to switch DB.
update2:
I maybe found a possible solution: https://mike.depalatis.net/blog/postgres-time-series-database.html
See section "Data aggregation".
The key is not to use a Unix timestamp but a date, "trunc" it, average the values and group by the trunc'ed date. Could work for me, but would require a rework of my table structure. Hmm... maybe there's more ... still researching ...
update3:
Inspired by update 2, I came up with this query:
SELECT (`timestamp` - (`timestamp` % 86400)) as aggtimestamp, `entity`, `value` FROM `measuredata` WHERE `entity` = 38 AND timestamp > UNIX_TIMESTAMP('2019-01-25') group by aggtimestamp
Works, but my DB/index/structure seems not really optimized for this: the query for the last year took ~75 sec (slow test machine) and finally returned only one value per day. This can be combined with avg(value), but this further increases query time (~82 sec). I will see if it's possible to further optimize this. But I now have an idea how "downsampling" data works, especially with aggregation in combination with "group by".
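A minimal sketch of that bucketing idea, using Python's bundled sqlite3 as a stand-in for MySQL so it can be run as-is. The composite index, the sample data and the helper names bucket_for_width and downsample are assumptions for illustration. Unlike the query in update 3, it aggregates value with AVG so each bucket returns a well-defined number, and it derives the bucket size from the chart width.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE measuredata ("
    "id INTEGER PRIMARY KEY, timestamp INTEGER, entityid INTEGER, value REAL)"
)
# Assumed index: filter by entity first, then range-scan the timestamps.
conn.execute("CREATE INDEX measuredata_entity_ts ON measuredata (entityid, timestamp)")

# Three days of one-minute samples for entity 38; the value is the day number.
conn.executemany(
    "INSERT INTO measuredata (timestamp, entityid, value) VALUES (?, ?, ?)",
    [(ts, 38, float(ts // 86400)) for ts in range(0, 3 * 86400, 60)],
)

def bucket_for_width(start, end, width_px):
    """Smallest bucket size (seconds) that yields at most width_px buckets."""
    return max(1, -(-(end - start) // width_px))  # ceiling division

def downsample(conn, entity, start, end, bucket_seconds):
    """Average all samples per bucket of bucket_seconds within [start, end)."""
    return conn.execute(
        """
        SELECT timestamp - (timestamp % ?) AS aggtimestamp,
               AVG(value)                  AS avg_value,
               COUNT(*)                    AS samples
        FROM measuredata
        WHERE entityid = ? AND timestamp >= ? AND timestamp < ?
        GROUP BY aggtimestamp
        ORDER BY aggtimestamp
        """,
        (bucket_seconds, entity, start, end),
    ).fetchall()

daily = downsample(conn, 38, 0, 3 * 86400, 86400)
one_year_bucket = bucket_for_width(0, 365 * 86400, 800)
```

Only the aggregated rows leave the database, so the transfer size is bounded by the number of buckets rather than by the number of samples in the window.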
There is probably no efficient way to do this. But, if you want, you can break the rows into equal sized groups and then fetch, say, the first row from each group. Here is one method:
select md.*
from (select md.*,
row_number() over (partition by tile order by timestamp) as seqnum
from (select md.*, ntile(800) over (order by timestamp) as tile
from measuredata md
where . . . -- your filtering conditions here
) md
) md
where seqnum = 1;
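The same query shape can be run against a toy table. This sketch assumes Python's bundled sqlite3 (SQLite 3.25 or later for window functions) rather than MySQL, uses 4 tiles instead of 800, and the helper name first_row_per_tile is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE measuredata ("
    "id INTEGER PRIMARY KEY, timestamp INTEGER, entityid INTEGER, value REAL)"
)
# Ten one-minute samples for entity 38.
conn.executemany(
    "INSERT INTO measuredata (timestamp, entityid, value) VALUES (?, ?, ?)",
    [(i * 60, 38, float(i)) for i in range(10)],
)

def first_row_per_tile(conn, entity, tiles):
    """Split the rows into `tiles` equal-sized groups, keep the first of each."""
    tiles = int(tiles)  # inlined into the SQL text, so force it to be an integer
    sql = f"""
        SELECT timestamp, value
        FROM (SELECT md.*,
                     row_number() OVER (PARTITION BY tile ORDER BY timestamp) AS seqnum
              FROM (SELECT md.*, ntile({tiles}) OVER (ORDER BY timestamp) AS tile
                    FROM measuredata md
                    WHERE entityid = ?
                   ) md
             ) md
        WHERE seqnum = 1
        ORDER BY timestamp
    """
    return conn.execute(sql, (entity,)).fetchall()

sampled = first_row_per_tile(conn, 38, 4)
```

With 10 rows and 4 tiles the groups have sizes 3, 3, 2 and 2, so the rows at positions 0, 3, 6 and 8 are returned. The database still has to read and sort every row in the window; only the transfer is reduced.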

How would I optimize this? I need to pull data from different tables and use results of queries in queries

Please propose an approach I should follow since I am obviously missing the point. I am new to SQL and still think in terms of MS Access. Here's an example of what I'm trying to do: Like I said, don't worry about the detail, I just want to know how I would do this in SQL.
I have the following tables:
Hrs_Worked (staff, date, hrs) (200 000+ records)
Units_Done (team, date, type) (few thousand records)
Rate_Per_Unit (date, team, RatePerUnit) (few thousand records)
Staff_Effort (staff, team, timestamp) (eventually 3 - 4 million records)
SO I need to do the following:
1) Calculate what each team earned by multiplying their units with RatePerUnit and Grouping on Team and Date. I create a view TeamEarnPerDay:
Create View teamEarnPerDay AS
SELECT
Units_Done.Date
,Units_Done.TeamID
,Sum([Units_Done]*[Rate_Per_Unit].[Rate]) AS Earn
FROM Units_Done INNER JOIN Rate_Per_Unit
ON (Units_Done.quality = Rate_Per_Unit.quality)
AND (Units_Done.type = Rate_Per_Unit.type)
AND (Units_Done.TeamID = Rate_Per_Unit.TeamID)
AND (Units_Done.Date = Rate_Per_Unit.Date)
GROUP BY
Units_Done.Date,
Units_Done.TeamID;
2) Count the TEAM's effort by Grouping Staff_Effort on Team and Date and counting records. This table has a few million records.
I have to cast the timestamp as a date....
CREATE View team_effort AS
SELECT
TeamID
,CAST([Timestamp] AS Date) as TeamDate
,Count(Staff_EffortID) AS TeamEffort
FROM Staff_Effort
GROUP BY
TeamID
,CAST([Timestamp] AS Date);
3) Calculate the Team's Rate_of_pay: (1) Team_earnings / (2) Team_effort
I use the 2 views I created above. This view's performance drops but is still acceptable to me.
Create View team_rate_of_pay AS
SELECT
tepd.Date
,tepd.TeamID
,tepd.Earn
,tepd.TeamBags
,[Earn]/[TeamEffort] AS teamRate
FROM teamEarnPerDay tepd
INNER JOIN team_effort te
ON (tepd.Date = te.TeamDate)
AND (tepd.TeamID = te.TeamID);
4) Group Staff_Effort on Date and Staff and count records to get each individuals's effort. (share of the team effort)
I have to cast the Timestamp as a date....
Create View staff_effort AS
SELECT
TeamID
,StaffID
,CAST([Timestamp] AS Date) as StaffDate
,Count(Staff_EffortID) AS StaffEffort
FROM Staff_Effort
GROUP BY
TeamID
,StaffID
,CAST([Timestamp] AS Date);
5) Calculate Staff earnings by: (4) Staff_Effort x (3) team_rate_of_pay
Multiply the individual's effort by the team rate he worked at on the day.
This one is ridiculously slow. In fact, it's useless.
CREATE View staff_earnings AS
SELECT
staff_effort.StaffDate
,staff_effort.StaffID
,sum(staff_effort.StaffEffort) AS StaffEffort
,sum([StaffEffort]*[TeamRate]) AS StaffEarn
FROM staff_effort INNER JOIN team_rate_of_pay
ON (staff_effort.TeamID = team_rate_of_pay.TeamID)
AND (staff_effort.StaffDate = team_rate_of_pay.Date)
Group By
staff_effort.StaffDate,
staff_effort.StaffID;
So you see what I mean.... I need various results and subsequent queries are dependent on those results.
What I tried to do is to write a view for each of the above steps and then just use the view in the next step and so on. They work fine but view nr 3 runs slower than the rest, even though still acceptable. View nr 5 is just ridiculously slow.
I actually have another view after nr.5 which brings hours worked into play as well but that just takes forever to produce a few rows.
I want a single line for each staff member, showing what he earned each day calculated as set out above, with his hours worked each day.
I also tried to reduce the number of views by using sub-queries instead but that took even longer.
A little guidance / direction will be much appreciated.
Thanks in advance.
--EDIT--
Taking the query posted in the comments, with some formatting, added aliases and a little cleanup, it would look like this:
SELECT epd.CompanyID
,epd.DATE
,epd.TeamID
,epd.Earn
,tb.TeamBags
,epd.Earn / tb.TeamBags AS RateperBag
FROM teamEarnPerDay epd
INNER JOIN teamBags tb ON epd.DATE = tb.TeamDate
AND epd.TeamID = tb.TeamID;
I eventually did 2 things:
1) Managed to reduce the nr of nested views by using sub-queries. This did not improve performance by much but it seems simpler with fewer views.
2) The actual improvement was caused by using LEFT JOIN instead of INNER JOIN.
The final view ran for 50 minutes with the Inner Join without producing a single row yet.
With LEFT JOIN, it produced all the results in 20 seconds!
Hope this helps someone.

Get filtered row count using dm_db_partition_stats

I'm using paging in my app but I've noticed that paging has gone very slow and the line below is the culprit:
SELECT COUNT (*) FROM MyTable
On my table, which only has 9 million rows, it takes 43 seconds to return the row count. I read in another article which states that to return the row count for 1.4 billion rows, it takes over 5 minutes. This obviously cannot be used with paging as it is far too slow and the only reason I need the row count is to calculate the number of available pages.
After a bit of research I found out that I get the row count pretty much instantly (and accurately) using the following:
SELECT SUM (row_count)
FROM sys.dm_db_partition_stats
WHERE object_id=OBJECT_ID('MyTable')
AND (index_id=0 or index_id=1)
But the above returns me the count for the entire table which is fine if no filters are applied but how do I handle this if I need to apply filters such as a date range and/or a status?
For example, what is the row count for MyTable when the DateTime field is between 2013-04-05 and 2013-04-06 and status='warning'?
Thanks.
UPDATE-1
In case I wasn't clear, I require the total number of rows available so that I can determine the number of pages required that will match my query when using 'paging' feature. For example, if a page returns 20 records and my total number of records matching my query is 235, I know I'll need to display 12 buttons below my grid.
01 - (row 1 to 20) - 20 rows displayed in grid.
02 - (row 21 to 40) - 20 rows displayed in grid.
...
11 - (row 201 to 220) - 20 rows displayed in grid.
12 - (row 221 to 235) - 15 rows displayed in grid.
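The page arithmetic in this list can be written down as a short sketch (plain Python, independent of the database; the function name page_ranges is made up for illustration):

```python
def page_ranges(total_rows, page_size):
    """Return (page_no, first_row, last_row) for every page, rows counted from 1."""
    pages = (total_rows + page_size - 1) // page_size  # ceiling division
    return [
        (page, (page - 1) * page_size + 1, min(page * page_size, total_rows))
        for page in range(1, pages + 1)
    ]

ranges = page_ranges(235, 20)
```

For 235 matching rows and 20 rows per page this gives 12 pages, with the last page holding rows 221 to 235.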
There will be additional logic added to handle a large amount of pages but that's a UI issue, so this is out of scope for this topic.
My problem with using "Select count(*) from MyTable" is that it is taking 40+ seconds on 9 million records (though it isn't anymore and I need to find out why!), but using this method I was able to add the same filter as my query to determine the row count. For example,
SELECT COUNT(*) FROM [MyTable]
WHERE [DateTime] BETWEEN '2018-04-05' AND '2018-04-06' AND
[Status] = 'Warning'
Once I determine the page count, I would then run the same query but include the fields instead of count(*), the CurrentPageNo and PageSize in order to filter my results by page number using the row ids and navigate to a specific page if needed.
SELECT RowId, DateTime, Status, Message FROM [MyTable]
WHERE [DateTime] BETWEEN '2018-04-05' AND '2018-04-06' AND
[Status] = 'Warning' AND
RowId BETWEEN (CurrentPageNo * PageSize) AND ((CurrentPageNo + 1) * PageSize)
Now, if I use the other mentioned method to get the row count i.e.
SELECT SUM (row_count)
FROM sys.dm_db_partition_stats
WHERE object_id=OBJECT_ID('MyTable')
AND (index_id=0 or index_id=1)
It returns the count instantly but how do I filter this so that I can include the same filters as if I was using the SELECT COUNT(*) method, so I could end up with something like:
SELECT SUM (row_count)
FROM sys.dm_db_partition_stats
WHERE object_id=OBJECT_ID('MyTable') AND
(index_id=0 or index_id=1) AND
([DateTime] BETWEEN '2018-04-05' AND '2018-04-06') AND
([Status] = 'Warning')
The above clearly won't work as I'm querying dm_db_partition_stats, but I would like to know if I can somehow perform a join or something similar to get the total number of rows instantly, filtered rather than for the entire table.
Thanks.
Have you ever asked for directions to Alpha Centauri? No? Well, the answer is: you can't get there from here.
Adding indexes, re-orgs/re-builds, updating stats will only get you so far. You should consider changing your approach.
sp_spaceused will typically return the record count instantly. You may be able to use it, but depending on what you are using the count for (you haven't given us quite enough information), it might not be adequate.
I am not sure if you are trying to use this count as a means to short circuit a larger operation or how you are using the count in your application. When you start to highlight 1.4 billion records and you're looking for a window in said set, it sounds like you might be a candidate for partitioned tables.
This allows you to assign several smaller tables, typically separated by date (years / months), that act as a single table. When you give the date range on 1.4+ billion records, SQL can meet performance expectations. This does depend on SQL Edition, but there is also view partitioning.
Kimberly Tripp has a blog and some videos out there, and Kendra Little also has some good content on how they are used and how to set them up. This would be a design change. It is a bit complex and not something you would want to implement on a whim.
Here is a link to Kimberly's Blog: https://www.sqlskills.com/blogs/kimberly/sqlskills-sql101-partitioning/
Dev banter:
Also, I hear you blaming SQL, are you using entity framework by chance?

Order by in subquery behaving differently than native sql query?

So I am honestly a little puzzled by this!
I have a query that returns a set of transactions that contain both repair costs and an odometer reading at the time of repair on the master level. To get an accurate Cost per mile reading I need to do a subquery to get both the first meter reading between a starting date and an end date, and an ending meter.
(select top 1 wf2.ro_num
from wotrans wotr2
left join wofile wf2
on wotr2.rop_ro_num = wf2.ro_num
and wotr2.rop_fac = wf2.ro_fac
where wotr.rop_veh_num = wotr2.rop_veh_num
and wotr.rop_veh_facility = wotr2.rop_veh_facility
AND ((#sdate = '01/01/1900 00:00:00' and wotr2.rop_tran_date = 0)
OR ([dbo].[udf_RTA_ConvertDateInt](#sdate) <= wotr2.rop_tran_date
AND [dbo].[udf_RTA_ConvertDateInt](#edate) >= wotr2.rop_tran_date))
order by wotr2.rop_tran_date asc) as highMeter
The reason I have the tables aliased as xx2 is because those tables are also used in the main query, and I don't want these to interact with each other except to pull the correct vehicle number and facility.
Basically, when I run the main query it returns a value that is not correct; it returns the one that is second (keep in mind that the first and second have the same date). But when I take the subquery, copy and paste it into its own query and run it, it returns the correct value.
I do have a workaround for this, but I am just curious as to why this is happening. I have searched quite a bit and found not much (other than the fact that people don't like ORDER BY in subqueries). Talking to one of my friends who also does quite a bit of SQL scripting, it looks to us as if the subquery is ordering differently than the same query run by itself when multiple rows have the same value for the ORDER BY column (e.g. 10 dates of 08/05/2016).
Any ideas would be helpful!
Like I said I have a work around that works in this one case, but don't know yet if it will work on a larger dataset.
Let me know if you want more code.

Poor DB Performance when using ORDER BY

I'm working with a non-profit that is mapping out solar potential in the US. Needless to say, we have a ridiculously large PostgreSQL 9 database. Running a query like the one shown below is speedy until the order by line is uncommented, in which case the same query takes forever to run (185 ms without sorting compared to 25 minutes with). What steps should be taken to ensure this and other queries run in a more manageable and reasonable amount of time?
select A.s_oid, A.s_id, A.area_acre, A.power_peak, A.nearby_city, A.solar_total
from global_site A cross join na_utility_line B
where (A.power_peak between 1.0 AND 100.0)
and A.area_acre >= 500
and A.solar_avg >= 5.0
AND A.pc_num <= 1000
and (A.fips_level1 = '06' AND A.fips_country = 'US' AND A.fips_level2 = '025')
and B.volt_mn_kv >= 69
and B.fips_code like '%US06%'
and B.status = 'active'
and ST_within(ST_Centroid(A.wkb_geometry), ST_Buffer((B.wkb_geometry), 1000))
--order by A.area_acre
offset 0 limit 11;
The sort is not the problem - in fact the CPU and memory cost of the sort is close to zero since Postgres has Top-N sort where the result set is scanned while keeping up to date a small sort buffer holding only the Top-N rows.
select count(*) from (1 million row table) -- 0.17 s
select * from (1 million row table) order by x limit 10; -- 0.18 s
select * from (1 million row table) order by x; -- 1.80 s
So you see the Top-10 sorting only adds 10 ms to a dumb fast count(*) versus a lot longer for a real sort. That's a very neat feature, I use it a lot.
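The idea behind a Top-N sort can be sketched in a few lines. This is an illustration of the general technique (a bounded heap that never holds more than N rows), not PostgreSQL's actual implementation; the function name top_n and the sample data are made up.

```python
import heapq
import random

def top_n(rows, n, key):
    """Scan rows once, keeping only the n smallest by key in a bounded heap."""
    if n <= 0:
        return []
    heap = []  # min-heap on (-key, -position): heap[0] is the worst row kept so far
    for pos, row in enumerate(rows):
        k = key(row)
        if len(heap) < n:
            heapq.heappush(heap, (-k, -pos, row))
        elif k < -heap[0][0]:
            # The new row beats the worst kept row: swap it in, O(log n).
            heapq.heapreplace(heap, (-k, -pos, row))
    # Only n rows are left to sort, no matter how large the input was.
    return [row for _, _, row in sorted(heap, key=lambda e: (-e[0], -e[1]))]

rng = random.Random(0)
table = [(i, rng.randint(0, 999)) for i in range(10_000)]  # (id, x)
first_ten = top_n(table, 10, key=lambda r: r[1])
```

Memory stays at N rows and most input rows are rejected with a single comparison, which is why ORDER BY ... LIMIT 10 costs little more than a plain scan while a full sort costs much more.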
OK, now without EXPLAIN ANALYZE it's impossible to be sure, but my feeling is that the real problem is the cross join. Basically you're filtering the rows in both tables using:
where (A.power_peak between 1.0 AND 100.0)
and A.area_acre >= 500
and A.solar_avg >= 5.0
AND A.pc_num <= 1000
and (A.fips_level1 = '06' AND A.fips_country = 'US' AND A.fips_level2 = '025')
and B.volt_mn_kv >= 69
and B.fips_code like '%US06%'
and B.status = 'active'
OK. I don't know how many rows are selected in both tables (only EXPLAIN ANALYZE would tell), but it's probably significant. Knowing those numbers would help.
Then we have the worst-case CROSS JOIN condition ever:
and ST_within(ST_Centroid(A.wkb_geometry), ST_Buffer((B.wkb_geometry), 1000))
This means all rows of A are matched against all rows of B (so, this expression is going to be evaluated a large number of times), using a bunch of pretty complex, slow, and cpu-intensive functions.
Of course it's horribly slow!
When you remove the ORDER BY, postgres just comes up (by chance?) with a bunch of matching rows right at the start, outputs those, and stops since the LIMIT is reached.
Here's a little example:
Tables a and b are identical and contain 1000 rows, and a column of type BOX.
select * from a cross join b where (a.b && b.b) --- 0.28 s
Here 1000000 box overlap (operator &&) tests are completed in 0.28s. The test data set is generated so that the result set contains only 1000 rows.
create index a_b on a using gist(b);
create index b_b on b using gist(b);
select * from a cross join b where (a.b && b.b) --- 0.01 s
Here the index is used to optimize the cross join, and speed is ridiculous.
You need to optimize that geometry matching.
add columns which will cache:
ST_Centroid(A.wkb_geometry)
ST_Buffer((B.wkb_geometry), 1000)
There is NO POINT in recomputing those slow functions a million times during your CROSS JOIN, so store the results in a column. Use a trigger to keep them up to date.
add columns of type BOX which will cache:
Bounding Box of ST_Centroid(A.wkb_geometry)
Bounding Box of ST_Buffer((B.wkb_geometry), 1000)
add gist indexes on the BOXes
add a Box overlap test (using the && operator) which will use the index
keep your ST_Within which will act as a final filter on the rows that pass
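The two-stage filter in these steps (cheap bounding-box test first, exact geometry test only on the survivors) can be sketched outside the database. This is a simplified stand-in: centroids are plain points, the buffered geometry is a circle of radius r around a point, and the distance check plays the role of the expensive ST_Within. All names here are hypothetical, and no index is modelled; a GiST index would additionally avoid visiting most pairs at all.

```python
import math
import random

def exact_within(point, center, r):
    """Stand-in for the expensive exact test: is point inside the circular buffer?"""
    return math.hypot(point[0] - center[0], point[1] - center[1]) <= r

def bbox_contains(point, center, r):
    """Cheap test: is point inside the bounding box of the buffer?"""
    return abs(point[0] - center[0]) <= r and abs(point[1] - center[1]) <= r

def match(points, centers, r, prefilter):
    """Return matching (point, center) index pairs and the number of exact tests."""
    pairs, exact_calls = [], 0
    for i, p in enumerate(points):
        for j, c in enumerate(centers):
            if prefilter and not bbox_contains(p, c, r):
                continue  # rejected cheaply, the exact test is never evaluated
            exact_calls += 1
            if exact_within(p, c, r):
                pairs.append((i, j))
    return pairs, exact_calls

rng = random.Random(42)
points = [(rng.uniform(0, 10_000), rng.uniform(0, 10_000)) for _ in range(100)]
centers = [(rng.uniform(0, 10_000), rng.uniform(0, 10_000)) for _ in range(100)]

slow_pairs, slow_calls = match(points, centers, 1000, prefilter=False)
fast_pairs, fast_calls = match(points, centers, 1000, prefilter=True)
```

Both runs return the same pairs because the bounding box fully contains the buffer, so the prefilter never rejects a true match; only the number of exact tests drops.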
Maybe you can just index the ST_Centroid and ST_Buffer columns... and use an (indexed) "contains" operator, see here :
http://www.postgresql.org/docs/8.2/static/functions-geometry.html
I would suggest creating an index on area_acre. You may want to take a look at the following: http://www.postgresql.org/docs/9.0/static/sql-createindex.html
I would recommend doing this sort of thing during off-peak hours though, because it can be somewhat intensive with a large amount of data. One thing you will also have to look at with indexes is rebuilding them on a schedule to ensure performance over time. Again, this schedule should be outside of peak hours.
You may want to take a look at this article from a fellow SO'er and his experience with database slowdowns over time with indexes: Why does PostgresQL query performance drop over time, but restored when rebuilding index
If the A.area_acre field is not indexed that may slow it down. You can run the query with EXPLAIN to see what it is doing during execution.
First off, I would look at creating indexes, ensuring your db is being vacuumed, and increasing the shared_buffers and work_mem settings for your db install.
First thing to look at is whether you have an index on the field you're ordering by. If not, adding one will dramatically improve performance. I don't know postgresql that well but something similar to:
CREATE INDEX area_acre ON global_site(area_acre)
As noted in other replies, the indexing process is intensive when working with a large data set, so do this during off-peak.
I am not familiar with the PostgreSQL optimizations, but it sounds like what is happening when the query is run with the ORDER BY clause is that the entire result set is created, then it is sorted, and then the top 11 rows are taken from that sorted result. Without the ORDER BY, the query engine can just generate the first 11 rows in whatever order it pleases and then it's done.
Having an index on the area_acre field very possibly may not help for the sorting (ORDER BY) depending on how the result set is built. It could, in theory, be used to generate the result set by traversing the global_site table using an index on area_acre; in that case, the results would be generated in the desired order (and it could stop after generating 11 rows in the result). If it does not generate the results in that order (and it seems like it may not be), then that index will not help in sorting the results.
One thing you might try is to remove the "CROSS JOIN" from the query. I doubt that this will make a difference, but it's worth a test. Because a WHERE clause is involved joining the two tables (via ST_WITHIN), I believe the result is the same as an inner join. It is possible that the use of the CROSS JOIN syntax is causing the optimizer to make an undesirable choice.
Otherwise (aside from making sure indexes exist for fields that are being filtered), you could play a bit of a guessing game with the query. One condition that stands out is the area_acre >= 500. This means that the query engine is considering all rows that meet that condition. But then only the first 11 rows are taken. You could try changing it to area_acre >= 500 and area_acre <= somevalue. The somevalue is the guessing part that would need adjustment to make sure you get at least 11 rows. This, however, seems like a pretty cheesy thing to do, so I mention it with some reticence.
Have you considered creating Expression based indexes for the benefit of the hairier joins and where conditions?