Enhancing Performance - SQL

I'm not as clued up on shortcuts in SQL so I was hoping to utilize the brainpower on here to help speed up a query I'm using. I'm currently using Oracle 8i.
I have a query:
SELECT
NAME_CODE, ACTIVITY_CODE, GPS_CODE
FROM
(SELECT
a.NAME_CODE, b.ACTIVITY_CODE, a.GPS_CODE,
ROW_NUMBER() OVER (PARTITION BY a.GPS_DATE ORDER BY b.ACTIVITY_DATE DESC) AS RN
FROM GPS_TABLE a, ACTIVITY_TABLE b
WHERE a.NAME_CODE = b.NAME_CODE
AND a.GPS_DATE >= b.ACTIVITY_DATE
AND TRUNC(a.GPS_DATE) > TRUNC(SYSDATE) - 2)
WHERE
RN = 1
and this takes about 7 minutes give or take 10 seconds to run.
Now the GPS_TABLE currently has 6,586,429 rows and continues to grow as new GPS coordinates are put into the system; each day it grows by about 8,000 rows (the table has 6 columns).
The ACTIVITY_TABLE currently has 1,989,093 rows and continues to grow as new activities are put into the system; each day it grows by about 2,000 rows (the table has 31 columns).
So all in all these are not small tables and I understand that there will always be a time hit running this or similar queries. As you can see I'm already limiting it to only the last 2 days worth of data, but anything to speed it up would be appreciated.

Your strongest filter seems to be the filter on the last 2 days of GPS_TABLE. It should filter GPS_TABLE down to about 15k rows, so one of the best candidates for improvement is an index on the column GPS_DATE.
You will find that your filter TRUNC(a.GPS_DATE) > TRUNC(SYSDATE) - 2 is equivalent to a.GPS_DATE >= TRUNC(SYSDATE) - 1, so a simple index on the column will work if you change the query. If you can't change it, you could add a function-based index on TRUNC(GPS_DATE).
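For reference, the two options could look like this (index names are illustrative):
-- plain index, usable once the predicate references the raw column
CREATE INDEX idx_gps_gps_date ON GPS_TABLE (GPS_DATE);

-- function-based index, usable if the query keeps TRUNC(GPS_DATE)
CREATE INDEX idx_gps_trunc_gps_date ON GPS_TABLE (TRUNC(GPS_DATE));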
Once you have this index in place, we need to access the rows in ACTIVITY_TABLE. The problem with your join is that we will get all the old activities and therefore a good portion of the table. This means that the join as it is will not be efficient with index scans.
I suggest you define an index on ACTIVITY_TABLE(name_code, activity_date DESC) and a PL/SQL function that will retrieve the last activity in the least amount of work using this index specifically:
CREATE OR REPLACE FUNCTION get_last_activity (p_name_code VARCHAR2,
                                              p_gps_date  DATE)
   RETURN ACTIVITY_TABLE.activity_code%TYPE IS
   l_result ACTIVITY_TABLE.activity_code%TYPE;
BEGIN
   -- latest activity on or before the GPS date; the inner query is served
   -- by the (name_code, activity_date DESC) index
   SELECT activity_code
     INTO l_result
     FROM (SELECT activity_code
             FROM activity_table
            WHERE name_code = p_name_code
              AND activity_date <= p_gps_date
            ORDER BY activity_date DESC)
    WHERE ROWNUM = 1;
   RETURN l_result;
END;
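The composite index suggested above could be created like this (index name is illustrative):
CREATE INDEX idx_activity_name_date
    ON ACTIVITY_TABLE (NAME_CODE, ACTIVITY_DATE DESC);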
Modify your query to use this function:
SELECT a.NAME_CODE,
       a.GPS_CODE,
       get_last_activity(a.name_code, a.gps_date) AS activity_code
  FROM GPS_TABLE a
 WHERE a.GPS_DATE >= TRUNC(SYSDATE) - 1

Optimising an SQL query is generally done by:
Adding some indexes
Trying a different way to get the same information
So, start by adding an index on ACTIVITY_DATE, and perhaps on some other fields that are used in the conditions.
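For example (a sketch; the exact columns worth indexing depend on the execution plan, and the index name is illustrative):
CREATE INDEX idx_activity_activity_date ON ACTIVITY_TABLE (ACTIVITY_DATE);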

Limit result rows for minimal time intervals for PostgreSQL

Background: I am running TeslaMate/Grafana to monitor my car status; one of the gauges plots the battery level fetched from the database. My server is located remotely and runs in Docker on an old NAS, so both query performance and network overhead matter.
I found the kiosk page frequently hangs and, on investigation, it might be caused by the query -- two of the plots return 10~100k rows of results from the database. I want to limit the number of rows returned by the SQL queries, as the plots certainly don't have that much precision for drawing such detailed intervals.
I tried to follow this answer and use row_number() to keep only every 100th row of the results, but a more complicated issue turned up: the time intervals between rows are not consistent.
The car has 4 statuses: driving / online / asleep / offline.
If the car is in the driving status, the time interval can be less than 200 ms, as the car pushes the status whenever it has new data.
If the car is in the online status, the time interval can be several minutes, as the system actively fetches the status from the car.
Even worse, if the system thinks the car is going to sleep and needs to stop fetching status (to avoid preventing the car from sleeping), the interval can be up to 40 minutes, depending on settings.
If the car is in the asleep/offline status, no data is recorded at all.
This obviously makes skipping every n-th row a bad idea: for cases 2-4 above, lots of data points might be missing, so Grafana cannot plot a correct graph representing the battery level at satisfactory precision.
I wonder if it's possible to skip rows by the time interval from a datetime field rather than by row_number(), without much overhead in the query? I.e., fetch every row that is at least 1000 ms after the previously returned row.
E.g., given the following data in the table, I want rows 1, 4 and 5 to be returned.
row date
[1] 1610000001000
[2] 1610000001100
[3] 1610000001200
[4] 1610000002000
[5] 1610000005000
The current (problematic) method I am using is as follows:
SELECT $__time(t.date), t.battery_level AS "SOC [%]"
FROM (
SELECT date, battery_level, row_number() OVER(ORDER BY date ASC) AS row
FROM (
SELECT battery_level, date
FROM positions
WHERE car_id = $car_id AND $__timeFilter(date)
UNION ALL
SELECT battery_level, date
FROM charges c
JOIN charging_processes p ON p.id = c.charging_process_id
WHERE $__timeFilter(date) AND p.car_id = $car_id) AS data
ORDER BY date ASC) as t
WHERE t.row % 100 = 0;
This method clearly has the problem that it only returns every n-th row instead of what I wanted (e.g., it returns alternate rows when the last line reads t.row % 2 = 0).
PS: please ignore the table structures and the UNION in the sample code; I haven't dug deep enough into the tables, so there could be other tweaks, but they're irrelevant to this question anyway.
Thanks in advance!
You can use a recursive CTE:
WITH RECURSIVE rec(cur_row, cur_date) AS (
    (
        -- anchor member: the earliest row
        SELECT row, date
        FROM t
        ORDER BY date
        LIMIT 1
    )
    UNION ALL
    (
        -- recursive member: the next row at least 1000 ms after the previous one
        SELECT row, date
        FROM t
        JOIN rec
          ON t.date >= cur_date + 1000
        ORDER BY t.date
        LIMIT 1
    )
)
SELECT *
FROM rec;
cur_row | cur_date
--------|--------------
1       | 1610000001000
4       | 1610000002000
5       | 1610000005000
View on DB Fiddle
Using a function instead would probably be faster:
CREATE OR REPLACE FUNCTION f() RETURNS SETOF t AS
$$
DECLARE
    row      t%ROWTYPE;
    cur_date BIGINT;
BEGIN
    FOR row IN
        SELECT *
        FROM t
        ORDER BY date
    LOOP
        -- emit the first row, then every row at least 1000 ms after the last emitted one
        IF cur_date IS NULL OR row.date >= cur_date + 1000
        THEN
            cur_date := row.date;
            RETURN NEXT row;
        END IF;
    END LOOP;
END;
$$ LANGUAGE plpgsql;
SELECT *
FROM f();
row | date
----|--------------
1   | 1610000001000
4   | 1610000002000
5   | 1610000005000

Executing an Aggregate function within a case without Group by

I am trying to assign a specific code to a client based on the number of gifts that they have given in the past 6 months, using a CASE. I am unable to use WITH (screenshot) due to the limitations of the software that I am creating the query in; it only allows SELECT statements. I am unsure how to get a distinct count from another table (transaction data) and use that as a parameter in the CASE I have currently built (based on my client information table). Does anyone know of any workarounds for this?
I am unable to GROUP BY clientID at the end of my query because not all of my columns are aggregates, and I only need to GROUP BY clientID for this particular WHEN branch of the CASE. I have looked into the OVER() clause, but I need the date range I am evaluating to be dynamic (counting transactions over the last six months), and the number of rows included is variable, as the transaction count varies month to month. Also, the software that I am building this in does not recognize the PARTITION BY clause of OVER().
Any help would be great!
EDIT:
it is not letting me attach an image... -____- I have added the two sections of code that I am looking for assistance with!
WITH "6MonthGIftCount" (
"ConstituentID"
,"GiftCount"
)
AS (
SELECT COUNT(DISTINCT "GiftView"."GiftID" FROM "GiftView" WHERE MONTHS_BETWEEN("GiftView"."GiftDate", getdate()) <= 6 GROUP BY "GiftView"."ConstituentID")
SELECT...CASE
WHEN "6MonthGiftCount"."GiftCount" >= 4
THEN 'A010'
)
Perform your grouping/COUNT(1) in a subquery to obtain the total number of donations by ConstituentID, then JOIN this total into your main query, which uses the new column in its CASE statement.
select
hist.*,
case when timesDonated > 5 then 'gracious donor'
when timesDonated > 3 then 'repeated donor'
when timesDonated >= 1 then 'donor'
else null end as donorCode
from gifthistory hist
left join ( /* your grouping subquery here, pretending to be a new table */
select
personID,
count(1) as timesDonated
from gifthistory i
WHERE abs(months_between(giftDate, sysdate)) <= 6
group by personid ) grp on hist.personid = grp.personID
order by 1;
*Naturally, syntax will vary by DB; you didn't specify which one you're using, but you should be able to adapt this template to whichever you utilize. It works in both Oracle and SQL Server after tweaking the month calculation appropriately.
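For SQL Server, the month filter inside the grouping subquery could be tweaked roughly like this (a sketch; note that DATEDIFF counts crossed month boundaries rather than exact months, so it only approximates MONTHS_BETWEEN):
select
    personID,
    count(1) as timesDonated
from gifthistory i
where abs(datediff(month, giftDate, getdate())) <= 6
group by personID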

Can I speed up this subquery nested PostgreSQL Query

I have the following PostgreSQL code (which works, but slowly) which I'm using to create a materialized view; however, it is quite slow and the length of the code seems cumbersome with the multiple sub-queries. Is there any way I can improve the speed this code executes at, or rewrite it so it's shorter and easier to maintain?
CREATE MATERIALIZED VIEW station_views.obs_10_min_avg_ffdi_powerbi AS
SELECT t.station_num,
initcap(t.station_name) AS station_name,
t.day,
t.month_int,
to_char(to_timestamp(t.month_int::text, 'MM'), 'TMMonth') AS Month,
round(((date_part('year', age(t2.dmax, t2.dmin)) * 12 + date_part('month', age(t2.dmax, t2.dmin))) / 12)::numeric, 1) AS record_years,
round((t2.count_all_vals / t2.max_10_periods * 100)::numeric, 1) AS per_datset,
max(t.avg_bom_fdi) AS max,
avg(t.avg_bom_fdi) AS avg,
percentile_cont(0.95) WITHIN GROUP (ORDER BY t.avg_bom_fdi) AS percentile_cont_95,
percentile_cont(0.99) WITHIN GROUP (ORDER BY t.avg_bom_fdi) AS percentile_cont_99
FROM ( SELECT a.station_num,
d.station_name,
a.ten_minute_intervals_utc,
date_part('day', a.ten_minute_intervals_utc) AS day,
date_part('month', a.ten_minute_intervals_utc) AS month_int,
a.avg_bom_fdi
FROM analysis.obs_10_min_avg_ffdi_bom a,
obs_minute_stn_det d
WHERE d.station_num = a.station_num) t,
( SELECT obs_10_min_avg_ffdi_bom_view.station_num,
obs_10_min_avg_ffdi_bom_view.station_name,
min(obs_10_min_avg_ffdi_bom_view.ten_minute_intervals_utc) AS dmin,
max(obs_10_min_avg_ffdi_bom_view.ten_minute_intervals_utc) AS dmax,
date_part('epoch', max(obs_10_min_avg_ffdi_bom_view.ten_minute_intervals_utc) - min(obs_10_min_avg_ffdi_bom_view.ten_minute_intervals_utc)) / 600 AS max_10_periods,
count(*) AS count_all_vals
FROM analysis.obs_10_min_avg_ffdi_bom_view
GROUP BY obs_10_min_avg_ffdi_bom_view.station_num, obs_10_min_avg_ffdi_bom_view.station_name) t2
WHERE t.station_num = t2.station_num
GROUP BY t.station_num, t.station_name, Month, t.month_int, t.day, record_years, per_datset
ORDER BY t.month_int, t.day
WITH DATA;
The output I get is a row for each weather station (station_num & station_name) along with the day & month that a weather variable (avg_bom_fdi) is recorded. The month value is retained and converted to a name for the purpose of plotting values averaged per month on the chart. I also pull in the total number of years that recordings exist for that station (record_years) and a percentage indicating how complete that dataset is (per_datset). These both come from the second subquery (t2). The first subquery (t) is used to average the data per day and return the daily max, average and 95th/99th percentiles.
I agree with running the explain plan / execution plan on this query.
Also, if it's not needed, remove the ORDER BY.
If, while reviewing the execution plan, you see a lot of time spent fetching a particular value, try creating an index on that particular column. Depending on whether the cardinality is high or low, you can choose a B-Tree or a Bitmap index.
I think you need to read something about execution plans; it's a good way to understand what is happening with your query. I recommend the documentation about this problem - LINK
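As a concrete starting point, you could let the planner time the heavy join in isolation (a sketch using the tables from the query; the index is only worth creating if analysis.obs_10_min_avg_ffdi_bom is a plain table and the plan shows the join dominating, and its name is illustrative):
EXPLAIN (ANALYZE, BUFFERS)
SELECT a.station_num, a.ten_minute_intervals_utc, a.avg_bom_fdi
FROM analysis.obs_10_min_avg_ffdi_bom a
JOIN obs_minute_stn_det d ON d.station_num = a.station_num;

-- possible index on the join key (illustrative name)
CREATE INDEX idx_ffdi_bom_station_num
    ON analysis.obs_10_min_avg_ffdi_bom (station_num);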

Oracle group by takes a lot of time

I have a select query which takes a lot of time:
select user_id, variable, round(AVG(v_Score),1) v_score
from TEST_1M_SCORE_V1 where clock between 1 and 12 group by user_id, variable
This table - TEST_1M_SCORE_V1 has 260,000,000 rows.
Is there any other way of writing the group by clause so that it runs faster?
Table definition:
Name Null Type
------------- ---- -------------
USER_ID NUMBER
CLOCK NUMBER
VARIABLE VARCHAR2(255)
V_SCORE NUMBER
This is two answers, not one, depending on the data. This is your query:
select user_id, variable, round(AVG(v_Score), 1) as v_score
from TEST_1M_SCORE_V1
where clock between 1 and 12
group by user_id, variable;
Option 1 is that relatively few rows satisfy the where condition -- where "relatively few" is definitely not more than a handful of percent. In this case, an index on TEST_1M_SCORE_V1(clock) would be useful. You can extend this to TEST_1M_SCORE_V1(clock, user_id, variable, v_score), for a covering index. Oracle will still need to do all the work for the group by, but on much less data.
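A sketch of the two indexes for this case (index names are illustrative):
CREATE INDEX idx_score_clock ON TEST_1M_SCORE_V1 (clock);

-- covering variant: the query can then be answered from the index alone
CREATE INDEX idx_score_clock_covering
    ON TEST_1M_SCORE_V1 (clock, user_id, variable, v_score);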
Option 2 is when more rows satisfy the where condition. In this case, you want Oracle to do a full index scan for the group by. The problem is that where clause. One approach is to incorporate it into the index, using a function-based index. However, that is highly specific (it works for 1 and 12 but not 1 and 11).
Instead, write the query as:
select user_id, variable,
round(AVG(case when clock between 1 and 12 then v_Score end), 1) as v_score
from TEST_1M_SCORE_V1
group by user_id, variable
having sum(case when clock between 1 and 12 then 1 else 0 end) > 0;
(The having clause may not be necessary, depending on how much you care about user_id/variable combos where the avg() will be NULL.)
This query is equivalent to the original. It seems to be doing more work, but that work is highly optimized for an index scan on: TEST_1M_SCORE_V1(user_id, variable, clock, v_score). The idea is that Oracle can read the index, in order, doing the group by and calculations at the same time. It never needs to look up data in the original data set and it never needs to process the group by using a hash- or sort-based algorithm.
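A sketch of the index that this version is designed to use (index name is illustrative):
CREATE INDEX idx_score_user_variable
    ON TEST_1M_SCORE_V1 (user_id, variable, clock, v_score);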

SQLite query to get the closest datetime

I am trying to write an SQLite statement to get the closest datetime to a user input (from a WPF datepicker). I have a table IRquote(rateId, quoteDateAndTime, quoteValue).
For example, if the user enters 10/01/2000 and the database only has fixings stored for 08/01/2000, 07/01/2000 and 14/01/2000, it would return 08/01/2000, that being the closest date to 10/01/2000.
Of course, I'd like it to work not only with dates but also with time.
I tried with this query, but it returns the row with the furthest date, and not the closest one:
SELECT quoteValue FROM IRquote
WHERE rateId = '" + pRefIndexTicker + "'
ORDER BY abs(datetime(quoteDateAndTime) - datetime('" + DateTimeSQLite(pFixingDate) + "')) ASC
LIMIT 1;
Note that I have a function DateTimeSQLite to transform user input to the right format.
I don't get why this does not work.
How could I do it? Thanks for your help
To get the closest date, you will need to use the strftime('%s', datetime) SQLite function.
With this example/demo, you will get the closest date to your given date.
Note that the date 2015-06-25 10:00:00 is the input datetime that the user selected.
select t.ID, t.Price, t.PriceDate,
abs(strftime('%s','2015-06-25 10:00:00') - strftime('%s', t.PriceDate)) as 'ClosestDate'
from Test t
order by abs(strftime('%s','2015-06-25 10:00:00') - strftime('%s', PriceDate))
limit 1;
SQL explanation:
We use strftime('%s') - strftime('%s') to calculate the difference, in seconds, between the two dates (note: it has to be '%s', not '%S'). Since this can be either positive or negative, we also need the abs function to make it positive, so that the order by and the subsequent limit 1 work correctly.
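As for why the original query doesn't work: datetime() returns a text value, and subtracting two text values makes SQLite coerce them to numbers using only their leading digits (effectively just the year), so the difference is 0 for any two same-year dates and the ordering becomes arbitrary. A minimal illustration:
-- both operands are coerced to the number 2015, so the result is 0
SELECT datetime('2015-06-25 10:00:00') - datetime('2015-06-23 11:00:00') AS diff;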
If the table is big, and there is an index on the datetime column, this will use the index to get the 2 closest rows (above and below the supplied value) and will be more efficient:
select *
from
( select *
from
( select t.ID, t.Price, t.PriceDate
from Test t
where t.PriceDate <= datetime('2015-06-23 10:00:00')
order by t.PriceDate desc
limit 1
) d
union all
select * from
( select t.ID, t.Price, t.PriceDate
from Test t
where t.PriceDate > datetime('2015-06-23 10:00:00')
order by t.PriceDate asc
limit 1
) a
) x
order by abs(julianday('2015-06-23 10:00:00') - julianday(PriceDate))
limit 1 ;
Tested in SQLfiddle.
Another useful solution is using the BETWEEN operator, if you can determine upper and lower bounds for your time/date query. I encountered this solution just recently, here in this link. This is what I've used for my application on a time column named t (changing the code for a date column and date function is not difficult):
select *
from myTable
where t BETWEEN '09:35:00' and '09:45:00'
order by ABS(strftime('%s',t) - strftime('%s','09:40:00')) asc
limit 1
Also, I must correct my comment on the above post. I did a simple test of the speed of the 3 approaches proposed by #BerndLinde, #ypercubeᵀᴹ and me. I have around 500 tables with 150 rows each and medium hardware in my PC. The results are:
Solution 1 (using strftime) takes around 12 seconds.
Adding an index on column t to solution 1 improves speed by around 30% and takes around 8 seconds. I didn't see any improvement from an index on time(t).
Solution 2 also gives around a 30% speed improvement over Solution 1 and takes around 8 seconds.
Finally, Solution 3 gives around a 50% improvement and takes around 5.5 seconds. Adding an index on column t gives a little more improvement, down to around 4.8 seconds. An index on time(t) has no effect in this solution.
Note: I'm a simple programmer and this is a simple test in .NET code. A real performance test must consider more professional aspects, which I'm not aware of. There were also some computations in my code after querying and reading from the database. Also, as #ypercubeᵀᴹ states, these results may not hold for large amounts of data.