Result if TOP 10 values are not there - sql

SELECT AVG(f.P_PRICE_LOW) AS TOP10_Average
FROM (SELECT TOP 10 P_PRICE_LOW
      FROM fp_basic_bd
      WHERE fs_perm_sec_id = 'B00242-S-US'
      ORDER BY fs_perm_sec_id
     ) AS f
With this query I am calculating the average of the top 10 price values. My question is:
How will the average be calculated if the subquery returns fewer than 10 values?

The average is calculated over whatever the inner query returns, with NULL values being ignored by AVG().
If the inner query returns 10 non-null values, then result = SUM(10 values) / 10.
If the inner query returns 3 non-null values, then result = SUM(3 values) / 3.
If the inner query returns no non-null values, or no rows at all, then result = NULL.

You can think of TOP as restricting the total number of rows returned, not adding or manipulating any values. So if your query returns 90 rows in total, TOP will just return the first N of those, so the first 10 in your case. If fewer than N rows are returned, TOP will return all of the rows found, since the maximum specified was not exceeded.
For your specific case, if your subquery returns fewer than 10 rows, the average will be based on those rows instead of 10. Since you are using the AVG function instead of computing the average manually, the value will still be the correct average of the rows found. So if the subquery returns 6 rows, AVG does the same as (r1 + r2 + ... + r6)/6.
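The behaviour described in both answers can be checked with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for the asker's database, so TOP 10 becomes LIMIT 10. Ordering by P_PRICE_LOW DESC is an assumption about what "top 10 price values" means, and the helper name top10_average is hypothetical.

```python
import sqlite3


def top10_average(prices):
    """Average of at most 10 of the highest prices; None stands for SQL NULL."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE fp_basic_bd (fs_perm_sec_id TEXT, P_PRICE_LOW REAL)"
    )
    conn.executemany(
        "INSERT INTO fp_basic_bd VALUES ('B00242-S-US', ?)",
        [(p,) for p in prices],
    )
    # SQLite has no TOP; LIMIT 10 plays the same row-capping role.
    # Assumption: "top" means highest price, hence ORDER BY ... DESC.
    row = conn.execute(
        """
        SELECT AVG(f.P_PRICE_LOW) AS TOP10_Average
        FROM (SELECT P_PRICE_LOW
              FROM fp_basic_bd
              WHERE fs_perm_sec_id = 'B00242-S-US'
              ORDER BY P_PRICE_LOW DESC
              LIMIT 10) AS f
        """
    ).fetchone()
    conn.close()
    return row[0]


print(top10_average([10, 20, 30]))     # three rows: SUM / 3
print(top10_average([4.0, None, 8.0])) # the NULL is ignored by AVG
print(top10_average([]))               # no rows: AVG yields NULL (None)
```

With only three rows the divisor is 3, not 10, and an empty or all-NULL input yields NULL rather than 0.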

Related

Calculating the mode/median/most frequent observation in categorical variables in SQL impala

I would like to calculate the mode/median or, better, the most frequent observation of a categorical variable within my query.
For example, if the variable has the following string values:
dog, dog, dog, cat, cat, I want to get dog, since it's 3 vs 2.
Is there any function that does that? I tried APPX_MEDIAN(), but it only returns the first 10 characters as the median and I do not want that.
Also, if there is a tie, I would like to get the most frequent observation with respect to date.
Thank you!
The most frequent observation is the mode, and you can calculate it like this.
A single-value mode can be calculated on a value column by getting the count per value and picking the row with the maximum count.
select count(*),value from mytable group by value order by 1 desc limit 1
Now, in case you have multiple modes, you need to join the per-value counts back to the maximum count to find all matches.
select orig.v from
(select count(*) c, value v from mytable group by value) orig
join (select count(*) cmode from mytable group by value order by 1 desc limit 1) cmode
ON orig.c = cmode.cmode
This computes the count for every value and then keeps the values whose count equals the maximum count. If one value reaches the maximum count you get 1 row, if two values reach it you get 2 rows, and so on.
Calculating the median is a little trickier: it gives you the middle value, which is not necessarily the most frequent one.
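As a rough check of the multiple-mode approach, here is a minimal sketch that runs the join against an in-memory SQLite database rather than Impala. The per-value subquery needs its own GROUP BY for the join to work. The helper name modes and the final ORDER BY are additions for illustration only.

```python
import sqlite3


def modes(values):
    """Return every value whose frequency equals the highest frequency."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mytable (value TEXT)")
    conn.executemany("INSERT INTO mytable VALUES (?)", [(v,) for v in values])
    rows = conn.execute(
        """
        SELECT orig.v
        FROM (SELECT COUNT(*) AS c, value AS v
              FROM mytable
              GROUP BY value) AS orig
        JOIN (SELECT COUNT(*) AS cmode
              FROM mytable
              GROUP BY value
              ORDER BY 1 DESC
              LIMIT 1) AS top
          ON orig.c = top.cmode
        ORDER BY orig.v  -- only to make the output order deterministic
        """
    ).fetchall()
    conn.close()
    return [r[0] for r in rows]


print(modes(["dog", "dog", "dog", "cat", "cat"]))   # single mode
print(modes(["dog", "dog", "cat", "cat", "bird"]))  # two tied modes
```

Breaking a tie by date, as the question asks, would need a date column that is not shown in the question, so it is left out here.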

Query smallest number of rows to match a given value threshold

I would like to create a query that operates similar to a cash register. Imagine a cash register full of coins of different sizes. I would like to retrieve a total value of coins in the fewest number of coins possible.
Given this table:
id   value
1    100
2    100
3    500
4    500
5    1000
How would I query for a list of rows that:
has a total value of AT LEAST a given threshold
with the minimum excess value (value above the threshold)
in the fewest possible rows
For example, if my threshold is 1050, this would be the expected result:
id   value
1    100
5    1000
I'm working with postgres and elixir/ecto. If it can be done in a single query great, if it requires a sequence of multiple queries no problem.
I had a go at this myself, using answers from previous questions:
Using ABS() to order by the closest value to the threshold
Select rows until a sum reduction of a single column reaches a threshold
Based on @TheImpaler's comment above, this prioritises the minimum number of rows over the minimum excess. It's not 100% what I was looking for, so I'm open to improvements if anyone has them, but if not I think this is going to be good enough:
-- outer query selects all rows underneath the threshold
-- inner subquery adds a running total column
-- window function orders by the difference between value and threshold
SELECT
    *
FROM (
    SELECT
        i.*,
        SUM(i.value) OVER (
            ORDER BY
                ABS(i.value - $THRESHOLD),
                i.id
        ) AS total
    FROM
        inputs i
) t
WHERE
    t.total - t.value < $THRESHOLD;
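For comparison, the three requirements as originally stated (total at least the threshold, minimum excess, then fewest rows) can be pinned down with a brute-force sketch. This is plain Python rather than SQL, it tries every subset and so only suits a handful of rows, and the function name pick_rows is hypothetical. It is a reference for the expected result, not a replacement for the query above.

```python
from itertools import combinations


def pick_rows(rows, threshold):
    """Pick (id, value) rows whose total is >= threshold with the smallest
    excess, breaking ties by the fewest rows. Exponential in len(rows)."""
    best_key = None
    best_combo = []
    for size in range(1, len(rows) + 1):
        for combo in combinations(rows, size):
            total = sum(value for _, value in combo)
            if total < threshold:
                continue  # does not reach the threshold
            key = (total - threshold, size)  # (excess, number of rows)
            if best_key is None or key < best_key:
                best_key = key
                best_combo = list(combo)
    return best_combo


coins = [(1, 100), (2, 100), (3, 500), (4, 500), (5, 1000)]
print(pick_rows(coins, 1050))
```

For the sample table and a threshold of 1050 this selects ids 1 and 5, matching the expected result in the question. When several subsets tie, the first one in input order wins.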

How can I retrieve data from a Hive table from two columns with non null values and top 500 records in one query?

I have a Hive table (my_table) in ORC format with 30 columns. Two of the columns (col_us, col_ds) store numeric values, which can be 0, null, or some integer. The table is partitioned by day and hour.
The table has approx. 8 million x 96 records in a day's partition, and I am referring to 15 daily partitions.
Currently I am running separate queries to retrieve the top 500 records with a value greater than 0 using a rank function: one query to retrieve col_us and another for col_ds.
It is possible that col_us has a numeric value while col_ds is 0 or null.
Question:
I want to retrieve top 500 non null and non 0 records from each of these columns from one query.
My Query:
FROM (
    SELECT D.COL_US, D.DATESTAMP,
           ROW_NUMBER() OVER (PARTITION BY D.ID, D.SUB_ID ORDER BY CONCAT(D.DATESTAMP, D.HOURSTAMP, D.TIMESTAMP) DESC) AS RNK
    FROM ${wf_table_name} D
    WHERE DATESTAMP >= '${datestamp_15}' AND DATESTAMP < '${datestamp}'
      AND COL_US > 0
) T
INSERT OVERWRITE TABLE ${wf_us_table}
SELECT T.COL_US, T.DATESTAMP, T.RNK WHERE T.RNK < 500;
From your query, I guess that you are trying to get the top 500 rows from your table based on date/time, that is, the latest 500 rows where col_us and col_ds both have a value > 0, rather than the top 500 from each of these columns.
From your question, your table may hold two kinds of values, for example:
col_us   col_ds
0        5
NULL     10
10       0
5        NULL
or both columns may have a value > 0.
So instead of 'AND COL_US > 0' in the WHERE clause, use 'AND (COL_US > 0 AND col_ds > 0)'.
But with this condition you will not get any value from the four rows shown above.
So if you want to get 10, 5 from col_us along with 5, 10 from col_ds, then I would say it's not possible using a single query.
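The effect of the two filters on the four sample rows can be checked with a short sketch. It uses SQLite through Python's sqlite3 module instead of Hive, and the helper name run is hypothetical. Comparisons against NULL behave the same way here: NULL > 0 is not true, so the row is dropped.

```python
import sqlite3


def run(sql):
    """Load the four sample rows and return the first column of each result row."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE my_table (col_us INTEGER, col_ds INTEGER)")
    conn.executemany(
        "INSERT INTO my_table VALUES (?, ?)",
        [(0, 5), (None, 10), (10, 0), (5, None)],
    )
    rows = [r[0] for r in conn.execute(sql)]
    conn.close()
    return rows


# Requiring both columns to be positive rejects every sample row.
both = run("SELECT col_us FROM my_table WHERE col_us > 0 AND col_ds > 0")

# Filtering each column on its own keeps that column's positive values.
us_only = run("SELECT col_us FROM my_table WHERE col_us > 0 ORDER BY col_us DESC")
ds_only = run("SELECT col_ds FROM my_table WHERE col_ds > 0 ORDER BY col_ds DESC")

print(both, us_only, ds_only)
```

The combined filter returns nothing, while the per-column filters return 10, 5 for each column, which is why the two conditions cannot simply be merged into one WHERE clause.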
Again, your question states: "I want to retrieve top 500 non null and non 0 records from each of these columns from one query."
From that, I guess that you want the top 500 records from col_us and col_ds ranked by the value of col_us/col_ds; in that case you must use these columns within the rank clause instead of date/time.
You may be able to get what you want with an UPDATE query, depending on the other available columns, but before that I would ask you to share exactly what you want (top 500 based on col_us/col_ds, or the latest 500) along with your base and target table structures.

How do modified condition operators work? SQL - ALL and ANY

Let's say I have a table A with an attribute numbers that looks like this.
A
numbers
1
2
3
4
5
6
7
8
9
10
What will this query return? How is the 5 getting compared?
SELECT numbers
FROM A
WHERE 5 > ALL (SELECT numbers FROM a)
The ALL operator requires that ALL of the results returned by your subquery
(SELECT numbers FROM A)
satisfy the condition (be smaller than 5); otherwise the condition is not met and no results are returned.
In your case, the subquery SELECT numbers FROM A returns 6, 7, 8, 9 and 10, which are greater than 5, so not ALL numbers satisfy the condition, the condition evaluates to FALSE, and no rows are returned.
Update:
Based on your comments I added details to my answer:
The statement using ALL condition should be read as:
"If ALL of the numbers returned by (SELECT numbers FROM A) are smaller than 5, then return the numbers selected by your MAIN SELECT."
The statement using ANY condition should be read as:
"If ANY of the numbers returned by (SELECT numbers FROM A) are smaller than 5, then return the numbers selected by your MAIN SELECT."
You can run the query in this SQLFiddle to see how the results change, just replace ANY with ALL and see the difference.
It will return an empty resultset (no rows).
The WHERE clause is evaluated for each row in the table A [first instance].
The WHERE clause tests whether 5 is greater than EACH row in table A [second instance].
It is not (there are several rows where the value is greater than 5) so the WHERE clause is always false.
Therefore no rows from table A [first instance] pass the query, therefore no rows are returned.
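The row-by-row evaluation described in both answers can be modelled in a few lines of Python. This is only an analogy: the built-in all() and any() stand in for SQL's ALL and ANY, NULL handling is ignored, and the function names are hypothetical.

```python
numbers = list(range(1, 11))  # the contents of table A


def rows_where_5_gt_all(table):
    # WHERE 5 > ALL (SELECT numbers FROM A): the same test for every outer row
    return [n for n in table if all(5 > m for m in table)]


def rows_where_5_gt_any(table):
    # WHERE 5 > ANY (SELECT numbers FROM A): true if at least one value is < 5
    return [n for n in table if any(5 > m for m in table)]


print(rows_where_5_gt_all(numbers))  # no rows, because 5..10 fail the test
print(rows_where_5_gt_any(numbers))  # every row, because 1..4 pass the test
```

Because the condition does not depend on the outer row, it is either true for all rows or false for all rows, so the query returns either the whole table or nothing.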

Split a query result based on the result count

I have a query based on basic criteria that will return X number of records on any given day.
I'm trying to check the result of the basic query then apply a percentage split to it based on the total of X and split it in 2 buckets. Each bucket will be a percentage of the total query result returned in X.
For example:
Query A returns 3500 records.
If the number of records returned from Query A is <= 3000, then split the 3500 records into a 40% / 60% split (1,400 / 2,100).
If the number of records returned from Query A is >= 3001 and <= 50,000, then split the records into a 10% / 90% split, and so on.
I want the actual records returned, and not just the math acting on the records that returns one row with a number in it (in the column).
I'm not sure how you want to display the different parts of the result set, so I've just added an additional column (part) to the result set: the value 1 indicates that the row belongs to the first part, and 2 that it belongs to the second part.
select z.*
     , case
         when cnt_all <= 3000 and cnt <= 40
         then 1
         when (cnt_all between 3001 and 50000) and (cnt <= 10)
         then 1
         else 2
       end part
  from (select t.*
             , 100*(count(col1) over(order by col1) / count(col1) over() ) cnt
             , count(col1) over() cnt_all
          from split_rowset t
         order by col1
       ) z
Demo #1 number of rows 3000.
Demo #2 number of rows 3500.
For better usability you can create a view using the query above and then query that view filtering by part column.
Demo #3 using of a view.
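Since the demos are external links, here is a self-contained sketch of the same idea that can be run locally. It uses SQLite (3.25 or later, for window functions) through Python's sqlite3 module rather than the database used in the answer. The multiplier is written as 100.0 to avoid integer division in SQLite, col1 is assumed to be unique, and the helper name part_sizes is hypothetical.

```python
import sqlite3

SPLIT_SQL = """
SELECT z.col1,
       CASE
         WHEN z.cnt_all <= 3000 AND z.cnt <= 40 THEN 1
         WHEN (z.cnt_all BETWEEN 3001 AND 50000) AND (z.cnt <= 10) THEN 1
         ELSE 2
       END AS part
FROM (SELECT t.col1,
             -- running percentage of rows seen so far, ordered by col1
             100.0 * COUNT(t.col1) OVER (ORDER BY t.col1)
                   / COUNT(t.col1) OVER () AS cnt,
             COUNT(t.col1) OVER () AS cnt_all
      FROM split_rowset t) z
"""


def part_sizes(n_rows):
    """Build n_rows rows and return (rows in part 1, rows in part 2)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE split_rowset (col1 INTEGER)")
    conn.executemany(
        "INSERT INTO split_rowset VALUES (?)",
        [(i,) for i in range(1, n_rows + 1)],
    )
    parts = [part for _, part in conn.execute(SPLIT_SQL)]
    conn.close()
    return parts.count(1), parts.count(2)


print(part_sizes(3000))  # 40% / 60% rule applies
print(part_sizes(3500))  # 10% / 90% rule applies
```

Every original row is still returned. The part column only labels which bucket it falls into, so filtering on part = 1 or part = 2 gives the actual records of each bucket.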