Simplify query in H2 database - alternative to TOP X PERCENT - sql

I'm having performance issues with a query and was wondering how to simplify it.
I have a table "Evaluations" (Sample, Category, Jury, Value)
And created some custom functions to get some average values for each sample, so I have this view:
CREATE VIEW Results AS
SELECT Sample,
Category,
IFNULL(COUNT_VALID(Value),0) || ' / ' || COUNT(Value) AS Valid,
CUSTOM_MEAN(Value) AS Mean,
CUSTOM_MEDIAN(Value) AS Median
FROM Evaluations GROUP BY Sample, Category;
Then I want another field telling me whether each sample is within the best-valued 30% of its category. TOP(X) PERCENT would be perfect, but H2 doesn't seem to support it, so I made a second view that calculates each sample's position within its category, multiplies it by 100, divides by the total count in the category, and compares the result to 30:
CREATE VIEW Res AS
SELECT R1.*,
CASE
WHEN (
((SELECT COUNT(*) FROM Results R2
WHERE R2.Category = R1.Category
AND (R2.Mean > R1.Mean OR (R2.Mean = R1.Mean AND R2.Median > R1.Median))) + 1) * 100
/
(SELECT COUNT(*) FROM Results R2 WHERE R2.Category = R1.Category) )
> 30
THEN 'over 30%'
ELSE 'within 30%'
END AS "30PERCENT"
FROM Results R1 ORDER BY Mean DESC, Median DESC;
This works properly, but even with just 500 records it takes a noticeable amount of time to retrieve the results.
Could someone suggest a more efficient way of constructing this query?
Thanks and regards!
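For reference, recent H2 versions (1.4.198 and later) support window functions, so the same "position within category" logic could presumably be expressed without correlated subqueries. A minimal, untested sketch reusing the Results view and column names from above:
CREATE VIEW Res AS
SELECT R1.*,
       CASE
           WHEN RANK() OVER (PARTITION BY Category ORDER BY Mean DESC, Median DESC) * 100.0
                / COUNT(*) OVER (PARTITION BY Category) > 30
           THEN 'over 30%'
           ELSE 'within 30%'
       END AS "30PERCENT"
FROM Results R1
ORDER BY Mean DESC, Median DESC;
This scans Results only once instead of running two correlated subqueries against it for every row, which is where the original view spends its time.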

Related

Multiple/Dependent Sub-Queries

I am currently working within SQL Workbench/J and Redshift. I am still learning a bit and have a question about creating a sub-query that depends on another sub-query's result. In the example below, a sub-query produces the mean of multiple records grouped by a unique symbol. I then use the mean in the primary query to calculate additional values (USD/UCL/LCL). However, I need to add a where clause on these aggregate values, which I cannot do. How would I implement another layer of sub-query to pre-calculate UCL/LCL, since they depend on the first sub-query? I have tried adding it to the first sub-query, but have been unsuccessful. I appreciate the help in advance, as I am just learning.
select
symbol,
mean,
avg(volume) as volume,
(mean * avg(volume) * 0.001) as USD,
STDV,
(MEAN + STDV * 3) as UCL,
(MEAN - STDV * 3) as LCL,
sum((high > ucl)::int) as ucltest,
sum((low < lcl)::int) as lcltest
from
(select
h.*,
avg(close) over (partition by symbol) as mean,
cast(stddev_samp(close) over (partition by symbol) as dec(14,2)) as STDV
from
historical h)
group by
symbol, mean, STDV;
You don't need another layer of query, but there are issues. The first thing to remember is that the mean (and stdv) for each value of symbol is repeated on every row coming out of the sub-query. This is why you need the group by on these columns to get down to a single value per symbol. That is one way to do it, but since these values are all the same within a symbol, it is cleaner to use an aggregate function such as MIN(MEAN) AS MEAN. It also avoids the confusion of having two values with the name MEAN.
Now the real issue is that you are trying to use ucl and lcl inside an aggregate function. These aliases are defined at the same level of the query, so they cannot be referenced there, which is why you are seeing an error. You just need to repeat the calculation for these values. Like this (untested):
select
symbol,
min(lmean) as mean,
avg(volume) as volume,
(min(lmean) * avg(volume) * 0.001) as USD,
min(LSTDV) as STDV,
min(LMEAN + LSTDV * 3) as UCL,
min(LMEAN - LSTDV * 3) as LCL,
sum((high > (LMEAN + LSTDV * 3))::int) as ucltest,
sum((low < (LMEAN - LSTDV * 3))::int) as lcltest
from
(select
h.*,
avg(close) over (partition by symbol) as lmean,
cast(stddev_samp(close) over (partition by symbol) as dec(14,2)) as LSTDV
from
historical h) as src -- give the derived table an alias
group by
symbol
having
sum((low < (LMEAN - LSTDV * 3))::int) = 0 and sum((high > (LMEAN + LSTDV * 3))::int) = 0; -- Having excludes any symbol where either test is non-zero (expressions repeated rather than reusing the output aliases)

How to find neighboring records in an SQL table in terms of month and year?

Please help me to optimize my SQL query.
I have a table with the fields: date, commodity_id, exp_month_id, exp_year, price, where the first 4 fields are the primary key. The months are designated with alphabetically ordered letters, e.g. F (for Jan), G (for Feb), H (for March), etc. Thus the letter of a month more distant from January is larger than the letter of a less distant month (F < G < H < ...). Some commodity_ids have all 12 months in the table, some only 5 or 3, and these sets are constant across years.
I need to calculate the difference between prices (the gradient) of neighboring records in terms of (exp_month_id, exp_year). As a first step, I want to define for every couple (exp_month_id, exp_year) the valid couple (next_month_id, next_year). The main problem here is that if the current exp_month_id is the last one in the year, then next_year = exp_year + 1 and next_month_id should be the first month in the year.
I have written the following query to do the job:
WITH trading_months AS (
SELECT DISTINCT commodity_id,
exp_month_id
FROM futures
ORDER BY exp_month_id
)
SELECT DISTINCT f.commodity_id,
f.exp_month_id,
f.exp_year,
(
WITH [temp] AS (
SELECT exp_month_id
FROM trading_months
WHERE commodity_id = f.commodity_id
)
SELECT exp_month_id
FROM [temp]
WHERE exp_month_id > f.exp_month_id
UNION ALL
SELECT exp_month_id
FROM [temp]
LIMIT 1
)
AS next_month_id,
(
SELECT CASE WHEN EXISTS (
SELECT commodity_id,
exp_month_id
FROM trading_months
WHERE commodity_id = f.commodity_id AND
exp_month_id > f.exp_month_id
LIMIT 1
)
THEN f.exp_year ELSE f.exp_year + 1 END
)
AS next_year
FROM futures AS f
This query serves as a base for a dynamic table (view) which is subsequently used for calculating the gradient. However, the execution of this query takes more than one second, and thus the whole process takes minutes. I wonder if you could help me optimize the query.
Note: The following requires SQLite 3.25 or newer for window function support:
The lack of sample data (preferably as CREATE TABLE and INSERT statements for easy importing) and expected results makes this hard to test, but if your end goal is computing the difference in prices between expiration dates (which makes your question a bit of an XY problem), maybe something like:
SELECT date, commodity_id, price, exp_year, exp_month_id
, price - lag(price, 1) OVER (PARTITION BY commodity_id ORDER BY exp_year, exp_month_id) AS "change from last price"
FROM futures;
Thanks to the hint of #Shawn to use window functions I could rewrite the query in much shorter form:
CREATE VIEW "futures_nextmonths_win" AS
WITH trading_months AS (
SELECT DISTINCT commodity_id,
exp_month_id,
exp_year
FROM futures)
SELECT commodity_id,
exp_month_id,
exp_year,
lead(exp_month_id) OVER w AS next_month_id,
lead(exp_year) OVER w AS next_year
FROM trading_months
WINDOW w AS (PARTITION BY commodity_id ORDER BY exp_year, exp_month_id);
which is also slightly faster than the original one.

SQL - count occurrences of items in different price diapasons

I have a question about SQL, and I honestly tried to search for methods before asking. I will give an abstract (but precise) description below, and would greatly appreciate an example solution (SQL query).
What I have:
Table A with the category id and price (in USD) of each item. The category id is an int; the price is a string that looks like "USD 200000000" (the real value is multiplied by 10^7). Table A also has a kind column of int type.
Table B with relation of category id and name.
What I need:
Get a table with price diapasons (like 0-100 | 100-200 | ...) as column names and the count of items for each category id (as row names) in each price diapason. All results must be filtered on the kind column (from table A) with value 3.
Problems that I encountered (and which made me ask for an example SQL query):
Cut "USD " from the price string value, divide it by 10^7 and convert to float.
Build diapasons of price values (0-100 | 100-200 | ...) with a given step over a given interval (the maximum price is unknown at the start). Example: step 100 on the 0-500 interval, and step 200 for values > 500.
Put the diapasons of price values into the column names of the result table.
For each diapason, count the number of items in each category (category_id). The left limit of a diapason shall not be counted (e.g. in the 1000-1200 diapason, items with price 1000 shall not be counted).
Using table B, display the name instead of the category id.
Response is appreciated, ignorance will be understood.
If you only need category ids, then you do not need B. What you are looking for is conditional aggregation, something like:
select category_id,
sum(case when cast(substring(price, 4, 100) as int)/10000000 < 100 then 1 else 0 end) as price_000_100,
sum(case when cast(substring(price, 4, 100) as int)/10000000 >= 100 and cast(substring(price, 4, 100) as int)/10000000 < 200
then 1 else 0
end) as price_100_200,
. . .
from a
group by category_id
There is no standard way to do what you describe.
That is because to do (3) you need a pivot, a.k.a. crosstab, and this is not in ANSI SQL. Each DBMS has its own implementation. Plus, dynamic columns in a pivot table are an additional complication.
For example, Postgres calls it a "crosstab" and requires the tablefunc module to be installed. See this SO question and the documentation. Compare to SQL Server, which uses the PIVOT command.
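As a rough, untested illustration of the vendor-specific syntax, the SQL Server PIVOT form might look something like the sketch below; the price_usd column is assumed to have already been converted to a number (for example with the substring/cast shown above), and the diapason labels are hard-coded, so the dynamic-column problem remains:
SELECT category_id, [0-100], [100-200], [200-300]
FROM (
    SELECT category_id,
           1 AS item,                                -- one counter per row
           CASE WHEN price_usd <= 100 THEN '0-100'   -- upper bound inclusive, per the question's bucket rule
                WHEN price_usd <= 200 THEN '100-200'
                ELSE '200-300'
           END AS diapason
    FROM a
    WHERE kind = 3
) AS src
PIVOT (SUM(item) FOR diapason IN ([0-100], [100-200], [200-300])) AS p;
Diapasons with no items come back as NULL rather than 0, and every diapason label has to be listed twice, which is exactly the dynamic-column limitation mentioned above.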
You can get close using reasonably standard SQL.
Here is an example based on SQLite. A little conversion would provide a solution for other systems; e.g. SUBSTR would be substring(string [from int] [for int]) in Postgres.
Assuming a data table of the format data(product_id, price, kind)
and a category name table of the format cat_name(cat_id, name),
the following code will produce one row per item with its diapason label:
WITH dataCTE AS
(SELECT product_id AS 'ID', CAST(SUBSTR(price, 5) AS INT)/10000000 AS 'USD',
CASE WHEN (CAST(SUBSTR(price, 5) AS INT)/10000000) <= 500 THEN
100 ELSE 200
END AS 'Interval'
FROM data
WHERE kind = 3),
groupCTE AS
(SELECT dataCTE.ID AS 'ID', dataCTE.USD AS 'USD', dataCTE.Interval AS 'Interval',
CASE WHEN dataCTE.Interval = 100 THEN
CAST(dataCTE.USD AS INT)/100
ELSE
(CAST(dataCTE.USD-500 AS INT)/200)+5
END AS 'GroupID'
FROM dataCTE),
cleanCTE AS
(SELECT *, CASE WHEN groupCTE.Interval = 100 THEN
CAST(groupCTE.GroupID *100 AS VARCHAR)
|| '-' ||
CAST((groupCTE.GroupID *100)+99 AS VARCHAR)
ELSE
CAST(((groupCTE.GroupID-5)*200)+500 AS VARCHAR)
|| '-' ||
CAST(((groupCTE.GroupID-5)*200)+500+199 AS VARCHAR)
END AS 'diapason'
FROM groupCTE
INNER JOIN cat_name AS cn ON groupCTE.ID = cn.cat_id)
SELECT *
FROM cleanCTE;
If you modify the last SELECT to:
SELECT name, diapason, COUNT(diapason)
FROM cleanCTE
GROUP BY name, diapason;
then you get a grouped output: one row per name and diapason with the count of items in it.
This is as close as you will get without specifying the exact system; even then you will have a problem with dynamically creating the column names.

SQL Percentage of Occurrences

I'm working on some SQL code as part of my University work. The data is fictitious, just to be clear. I'm trying to count the occurrences of 1 and 0 in the SQL table Fact_Stream; this is stored in the Free_Stream column/attribute as a Boolean/bit value.
As calculations can't be made on bit values (at least in the way I'm trying), I've converted the value to an integer -- just to be clear on that. The table contains information on a streaming company's streams; a 1 indicates the stream was free of charge, a 0 indicates the stream was paid for. My code:
SELECT Fact_Stream.Free_Stream, ((CAST(Free_Stream AS INT)) / COUNT(*) * 100) As 'Percentage of Streams'
FROM Fact_Stream
GROUP BY Free_Stream
The result/output is nearly where I want it to be, but it doesn't display the percentage correctly.
Using MS SQL Management Studio | MS SQL Server 2012 (I believe)
The percentage should be based on all rows, so you need to divide the count per 1/0 by a count of all rows. The easiest way to get this is utilizing a Windowed Aggregate Function:
SELECT Fact_Stream.Free_Stream,
100.0 * COUNT(*) -- count per bit
/ SUM(COUNT(*)) OVER () -- sum of those counts = count of all rows
As "Percentage of Streams"
FROM Fact_Stream
GROUP BY Free_Stream
You have INTs as divisor and dividend, so the result is also an INT. Just cast one of them to decimal (notice how I changed 100 to 100.0). Also, you should divide the count of elements in the group by the total count of rows in the table:
select Free_Stream,
count(*) * 100.0 / (select count(*) from Fact_Stream) as 'Percentage of Streams'
from Fact_Stream
group by Free_Stream
Your equation is dividing the identifier (1 or 0) by the number of streams for each one, instead of dividing the count of free or paid by the total count. One way to do this is to get the total count first, then use it in your query:
declare #totalcount real;
select #totalcount = count(*) from Fact_Stream;
SELECT Fact_Stream.Free_Stream,
(Cast(Count(*) as real) / #totalcount)*100 AS 'Percentage of Streams'
FROM Fact_Stream
group by Fact_Stream.Free_Stream

SQL Percentile Calculation

I have the following query, which even without a ton of data (~3k rows) is still a bit slow to execute, and the logic is a bit over my head. I was hoping to get some help optimizing the query, or even an alternate methodology:
Select companypartnumber, (PartTotal + IsNull(Cum_Lower_Ranks, 0) ) / Sum(PartTotal) over() * 100 as Cum_PC_Of_Total
FROM PartSalesRankings PSRMain
Left join
(
Select PSRTop.Item_Rank, Sum(PSRBelow.PartTotal) as Cum_Lower_Ranks
from partSalesRankings PSRTop
Left join PartSalesRankings PSRBelow on PSRBelow.Item_Rank < PSRTop.Item_Rank
Group by PSRTop.Item_Rank
) as PSRLowerCums on PSRLowerCums.Item_Rank = PSRMain.Item_Rank
The PartSalesRankings table simply consists of CompanyPartNumber(bigint) which is a part number designation, PartTotal(decimal 38,5) which is the total sales, and Item_Rank(bigint) which is the rank of the item based on total sales.
I'm trying to end up with my parts grouped into categories based on their percentile: an "A" item would be the top 5%, a "B" item the next 15%, and "C" items the remaining 80%. The view I created works fine, it just takes almost three seconds to execute, which for my purposes is quite slow. I narrowed the bottleneck down to the above query - any help would be greatly appreciated.
The problem you are having is the calculation of the cumulative sum of PartTotal. If you are using SQL Server 2012, you can do something like:
select (case when ratio <= 0.05 then 'A'
when ratio <= 0.20 then 'B'
else 'C'
end),
t.*
from (select psr.companypartnumber,
(sum(PartTotal) over (order by PartTotal desc) * 1.0 / sum(PartTotal) over ()) as ratio -- cumulative share of sales, largest sellers first
FROM PartSalesRankings psr
) t
SQL Server 2012 also has percentile functions and other functions not available in earlier versions.
In earlier versions, the question is how to get the cumulative sum efficiently. Your query is probably as good as anything that can be done in one query. Can the cumulative sum be calculated when partSalesRankings is created? Can you use temporary tables?
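As a rough, untested sketch of the temporary-table route for pre-2012 servers (table and column names taken from the question, and assuming Item_Rank 1 is the best-selling part): compute the grand total once, materialize each part's cumulative percentage, then classify.
-- Grand total computed once up front
DECLARE @grand decimal(38,5);
SELECT @grand = SUM(PartTotal) FROM PartSalesRankings;

-- Cumulative percentage per part; still a quadratic self-join,
-- but it runs once instead of as two correlated subqueries per row
SELECT p1.CompanyPartNumber,
       SUM(p2.PartTotal) * 100.0 / @grand AS Cum_PC_Of_Total
INTO #cum
FROM PartSalesRankings p1
JOIN PartSalesRankings p2 ON p2.Item_Rank <= p1.Item_Rank
GROUP BY p1.CompanyPartNumber;

-- Classify into A/B/C from the materialized percentages
SELECT CompanyPartNumber,
       CASE WHEN Cum_PC_Of_Total <= 5  THEN 'A'
            WHEN Cum_PC_Of_Total <= 20 THEN 'B'
            ELSE 'C'
       END AS ABC_Class
FROM #cum;
If PartSalesRankings is rebuilt periodically anyway, the cumulative column could be populated at that point instead, as suggested above.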