Split a query result based on the result count - sql

I have a query based on basic criteria that will return X number of records on any given day.
I'm trying to check the result count of the basic query and then split the records into 2 buckets based on the total X. Each bucket will be a percentage of the total query result returned in X.
For example:
Query A returns 3500 records.
If the number of records returned from Query A is <= 3,000, then split the records into a 40% / 60% split (1,400 / 2,100).
If the number of records returned from Query A is >= 3,001 and <= 50,000, then split the records into a 10% / 90% split. Etc.
I want the actual records returned, not just math acting on the records that returns a single row with a number in it.

I'm not sure how you want to display the different parts of the resulting set of rows, so I've just added an additional column (part) to the result set, containing the value 1 for rows belonging to the first part and 2 for the second.
select z.*
     , case
         -- total <= 3000: the first 40% of rows form part 1
         when cnt_all <= 3000 and cnt <= 40
         then 1
         -- total between 3001 and 50000: the first 10% form part 1
         when (cnt_all between 3001 and 50000) and (cnt <= 10)
         then 1
         else 2
       end part
from (select t.*
           -- running percentage of rows, ordered by col1
           , 100 * (count(col1) over(order by col1) / count(col1) over()) cnt
           -- total row count
           , count(col1) over() cnt_all
      from split_rowset t
      order by col1
     ) z
Demo #1: number of rows 3,000.
Demo #2: number of rows 3,500.
For better usability you can create a view using the query above and then query that view, filtering by the part column.
Demo #3: using a view.
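A minimal sketch of the view approach, assuming the split_rowset table from above and a hypothetical view name split_rowset_parts:

create view split_rowset_parts as
select z.*
     , case
         when cnt_all <= 3000 and cnt <= 40 then 1
         when (cnt_all between 3001 and 50000) and (cnt <= 10) then 1
         else 2
       end part
from (select t.*
           , 100 * (count(col1) over(order by col1) / count(col1) over()) cnt
           , count(col1) over() cnt_all
      from split_rowset t
     ) z;

-- first bucket (40% or 10% of the rows, depending on the total)
select * from split_rowset_parts where part = 1;
-- second bucket
select * from split_rowset_parts where part = 2;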

Query smallest number of rows to match a given value threshold

I would like to create a query that operates similar to a cash register. Imagine a cash register full of coins of different sizes. I would like to retrieve a total value of coins in the fewest number of coins possible.
Given this table:
id | value
---+------
 1 |  100
 2 |  100
 3 |  500
 4 |  500
 5 | 1000
How would I query for a list of rows that:
- has a total value of AT LEAST a given threshold
- with the minimum excess value (value above the threshold)
- in the fewest possible rows
For example, if my threshold is 1050, this would be the expected result:
id | value
---+------
 1 |  100
 5 | 1000
I'm working with Postgres and Elixir/Ecto. If it can be done in a single query, great; if it requires a sequence of multiple queries, no problem.
I had a go at this myself, using answers from previous questions:
- Using ABS() to order by the closest value to the threshold
- Select rows until a sum reduction of a single column reaches a threshold
Based on #TheImpaler's comment above, this prioritises the minimum number of rows over the minimum excess. It's not 100% what I was looking for, so I'm open to improvements if anyone has any, but if not I think this is going to be good enough:
-- outer query selects all rows underneath the threshold
-- inner subquery adds a running total column
-- window function orders by the difference between value and threshold
SELECT *
FROM (
    SELECT
        i.*,
        SUM(i.value) OVER (
            ORDER BY
                ABS(i.value - $THRESHOLD),
                i.id
        ) AS total
    FROM inputs i
) t
WHERE t.total - t.value < $THRESHOLD;
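Tracing this against the sample table with $THRESHOLD = 1050 shows the trade-off: the window orders the rows by ABS(value - 1050) as id 5 (diff 50), then ids 3 and 4 (diff 550), then ids 1 and 2 (diff 950), giving running totals 1000, 1500, 2000, 2100, 2200. The filter total - value < 1050 keeps ids 5 and 3 (total 1500, excess 450) rather than the ideal pair of ids 5 and 1 (total 1100, excess 50), which is exactly the gap left open for improvement above.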

Most Efficient way to Search Massive Redshift Table for Duplicate Values

I have a large Redshift table (hundreds of millions of rows, with ~50 columns per row).
I need to find rows that share a duplicate value in a specific column.
Example:
my table has the columns 'column_of_interest' and 'date_time'. In those hundreds of millions of rows, I need to find all the instances where 'column_of_interest' has more than one occurrence within a certain 'date_time' range.
eg:
       column_of_interest   date_time
ROW 1: ABCD-1234            165895896565
ROW 2: FCEG-3434            165895896577
ROW 3: ABCD-1234            165895986688
ROW 4: ZZZZ-9999            165895986689
ROW 5: ZZZZ-9999            165895987790
In the above, since ROW 1 and ROW 3 have the same column_of_interest, I would like that column_of_interest returned; ROW 4 and ROW 5 match as well, so I would like that value returned too.
So the end result would be:
duplicates
ABCD-1234
ZZZZ-9999
I have found a few things online, but the table is so large that the queries time out before any results are returned. Am I going about this the wrong way? Here are a couple that I tried just to get the results back (but they time out before returning).
SELECT column_of_interest, COUNT(*)
FROM my_table
GROUP BY column_of_interest
HAVING COUNT(*) > 1
WHERE date_time >= 1601510400000 AND date_time < 1601596800000
LIMIT 200
SELECT a.*
FROM my_table a
JOIN (SELECT column_of_interest, COUNT(*)
      FROM my_table
      GROUP BY column_of_interest
      HAVING COUNT(*) > 1) b
  ON a.column_of_interest = b.column_of_interest
ORDER BY a.column_of_interest
LIMIT 200
This should be a fine method, and it should not "time out". Your first version has a syntax error: the WHERE clause must come before GROUP BY, not after HAVING.
So try:
SELECT column_of_interest, COUNT(*)
FROM my_table
WHERE date_time >= 1601510400000 AND date_time < 1601596800000
GROUP BY column_of_interest
HAVING COUNT(*) > 1
LIMIT 200
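If you need the duplicate rows themselves, as in your second attempt, the same date filter can be pushed into both the outer query and the subquery so the aggregation only scans the one-day range. A sketch assuming the same table and columns:

SELECT a.*
FROM my_table a
JOIN (SELECT column_of_interest
      FROM my_table
      WHERE date_time >= 1601510400000 AND date_time < 1601596800000
      GROUP BY column_of_interest
      HAVING COUNT(*) > 1) b
  ON a.column_of_interest = b.column_of_interest
WHERE a.date_time >= 1601510400000 AND a.date_time < 1601596800000
ORDER BY a.column_of_interest
LIMIT 200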

Sample certain number of result rows from a postgres table based on given proportions

Let's say I have a table named population with 1000 rows, each row holding a Group_Name and a value.
And I have another table named proportions that holds the desired proportion of each Group_Name that I want to extract (e.g. Group-A: 0.50, Group-B: 0.30, Group-C: 0.20).
I want to randomly sample 100 rows from the population table such that the proportions of the Group_Names within the sample are in line with the Proportion field of the proportions table. So in that 100-row sample, 50 rows should be Group-A, 30 rows should be Group-B and 20 rows should be Group-C.
I can manually sample like:
CREATE EXTENSION tsm_system_rows;
SELECT * FROM population TABLESAMPLE SYSTEM_ROWS(100);
But I do not know how to sample from population programmatically based on the proportions table, especially if the proportions table has many more Group_Names than the 3 shown in the example.
The main problem that you will be facing is that TABLESAMPLE takes the sample before applying your group filter. Say that you want 20 rows from group C. The chances of getting those 20 by running
SELECT *
FROM population TABLESAMPLE system_rows(20)
WHERE group_name = 'C'
are pretty slim if group C is small relative to the other groups in population: if group C is, say, 2% of the table, a blind 20-row sample will contain on average only 0.4 rows from group C.
I'd solve this by writing a stored function that receives as parameters the group name and wanted amount of rows, and samples the table until reaching the wanted amount of rows.
You should also limit the number of iterations, in case the group is very sparse or there are not enough rows to fulfil the request.
So the function could look like this:
CREATE OR REPLACE FUNCTION sample_group (p_group_name text, sample_size int, max_iterations int)
    RETURNS int[]
    LANGUAGE plpgsql AS $$
DECLARE
    result int[];
    i int := 0;
BEGIN
    WHILE i < max_iterations AND coalesce(array_length(result, 1), 0) < sample_size LOOP
        WITH sample AS (
            -- draw a fresh 1% Bernoulli sample, capped at 10x the wanted size
            SELECT group_name, value
            FROM population TABLESAMPLE BERNOULLI (1)
            LIMIT 10 * sample_size
        ), add_rows AS (
            -- append the sampled values of the wanted group to the accumulator
            SELECT result || array_agg(value) AS arr
            FROM sample
            WHERE group_name = p_group_name
        )
        SELECT array_agg(DISTINCT value), i + 1
        INTO result, i
        FROM add_rows, unnest(arr) AS t(value);
    END LOOP;
    -- trim any overshoot from the final iteration
    RETURN result[1:sample_size];
END;
$$;
I'm using BERNOULLI sampling to avoid getting the same rows over and over.
The function does most of the work for you. All that remains is to call it. In this example I'm setting an upper limit of 500 iterations.
SELECT group_name,
       unnest(sample_group(group_name, (100 * proportion)::int, 500)) AS value
FROM proportions;
You can sample based on randomly assigned row numbers:
select *
from (
    select *,
           case
               when row_number() over (partition by pop.group_name
                                       order by random()) <= pr.proportion * 100 -- sample size
               then 1
               else 0
           end as flag
    from population as pop
    join proportions as pr
      on pop.group_name = pr.group_name
) as dt
where flag = 1
Edit:
If the table is large, creating a sample before applying ROW_NUMBER might greatly reduce the number of rows processed. Of course, the sample size must be large enough to contain at least the required number of rows per group, i.e. well over 100 rows.
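A minimal sketch of that combination, assuming the tsm_system_rows extension from the question; the inner sample size of 1000 is an assumption and must comfortably exceed the 100-row target for every group:

select dt.*
from (
    select pop.*,
           pr.proportion,
           row_number() over (partition by pop.group_name
                              order by random()) as rn
    -- sample first, then rank; assumes the 1000 sampled rows cover every group's quota
    from (select * from population tablesample system_rows(1000)) as pop
    join proportions as pr
      on pop.group_name = pr.group_name
) as dt
where rn <= proportion * 100;  -- 100 = total sample size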

MS Access query for returning a set number of records based on two criteria

I have an MS Access query which returns records based on the following criteria:
1. Returns records where the ID number is IN [2,15,30...]; this is a list of 106 numbers.
2. Returns all records NOT LIKE "** LQ **". This is currently returning 10 records.
I want this query to always return 116 records, but the number of LQ records will always be changing. So instead of always adjusting the list of 106 records to accommodate the number of LQ records, is there a way to run a query which first pulls all LQ records and then fills in the gap to get to 116 with random records from the list of 106?
So, since you are sure that you will always get 106 records from the filter
where ID_numbers IN [2,15,30...]
limit the LIKE filter output to the top 10, and UNION ALL the two sets:
Select top 10 *
From #t
where FldContainingLQ NOT LIKE "** LQ **"
Union All
Select *
From #t
where ID_numbers IN [2,15,30...]
Also be cautious about the difference between UNION and UNION ALL in the situation where NOT LIKE "** LQ **" does not return 10 records as expected.
If I understand correctly, this is about ordering the explicit list before the rest of the rows. You can use conditional logic in the order by:
select t.*
from t
order by iif(id in (2, 15, 30, . . . ), 1, 2)
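Combining the two criteria with this trick and capping the result with TOP yields all LQ rows plus enough listed IDs to reach 116. A sketch assuming the hypothetical column name FldContainingLQ from the first answer, with the 106-ID list abbreviated to three values:

select top 116 t.*
from t
where t.FldContainingLQ not like "** LQ **"
   or t.id in (2, 15, 30)
order by iif(t.FldContainingLQ not like "** LQ **", 1, 2), t.id

The WHERE keeps only LQ rows and listed IDs, the ORDER BY puts the LQ rows first, and TOP 116 then takes every LQ row plus enough listed IDs to reach 116. The trailing t.id is needed as a tiebreaker because Access TOP returns all tied rows; it also makes the fill deterministic rather than random.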
The query below will adjust, returning extra records from the ID condition if the LQ filter yields fewer than 10 records.
Select * from
    -- Query 1 returns 10 records if NOT LIKE "** LQ **" has 10 records to return,
    -- else it tops up with rows from the ID condition
    (Select top 10 *
     From (Select *
           From t
           Where Colmn NOT LIKE "** LQ **"
           Union All
           Select *
           From t
           Where ID IN (2,15,30...)) as q
    ) as tab
Union All
Select *
From t
Where ID IN (2,15,30...)

How can I retrieve the top 500 records with non-null, non-zero values from two columns of a Hive table in one query?

I have a Hive table (my_table) which is in ORC format and has 30 columns. Two of the columns (col_us, col_ds) store numeric values which can be 0, null, or some integer. The table is partitioned by day and hour.
The table has approx. 8 million x 96 records in a day's partitions, and I am referring to 15 daily partitions.
Currently I am running separate queries to retrieve the top 500 records with a value greater than 0, using a rank function: one query to retrieve col_us and another for col_ds.
It is possible that col_US may have a numeric value while col_DS is 0 or null.
Question:
I want to retrieve top 500 non null and non 0 records from each of these columns from one query.
My Query:
FROM (
    SELECT D.COL_US, D.DATESTAMP,
           ROW_NUMBER() OVER (PARTITION BY D.ID, D.SUB_ID
                              ORDER BY CONCAT(D.DATESTAMP, D.HOURSTAMP, D.TIMESTAMP) DESC) AS RNK
    FROM ${wf_table_name} D
    WHERE DATESTAMP >= '${datestamp_15}' AND DATESTAMP < '${datestamp}'
      AND COL_US > 0
) T
INSERT OVERWRITE TABLE ${wf_us_table}
SELECT T.COL_US, T.DATESTAMP, T.RNK
WHERE T.RNK < 500;
From your query I would guess that you are trying to get the top 500 rows from your table based on date/time, that is, the latest 500 rows where col_us and col_ds both have a value > 0, and not the top 500 from each of these columns.
As per your question, your table may have two kinds of values, for example:
col_us  col_ds
------  ------
0       5
NULL    10
10      0
5       NULL
or both columns may have a value > 0 in the same row.
So instead of 'AND COL_US > 0' in the WHERE clause, use 'AND (COL_US > 0 AND COL_DS > 0)'.
But with this condition you will not get any of the 4 rows shown above.
So if you want to get 10 and 5 from col_us along with 5 and 10 from col_ds, then I should say it's not possible with a single ranking.
Again, going by your question's wording, "I want to retrieve top 500 non null and non 0 records from each of these columns from one query",
I would guess that you want the top 500 records from col_us and col_ds ranked by the values of those columns; in that case you must use these columns in the rank clause's ORDER BY instead of date/time.
You may be able to get what you want with an updated query depending on the other available columns, but before that, please share exactly what you want (the top 500 based on col_us/col_ds, or the latest 500) along with your base and target table structures.
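For what it's worth, one way to get the latest 500 rows per column in a single statement is to rank each column's qualifying rows separately and union the two filtered sets. This is a sketch reusing the question's variables; it still scans the partitions twice but is submitted as one query, and it assumes independent rankings per column are acceptable:

SELECT 'US' AS SRC, T.COL_US AS VAL, T.DATESTAMP, T.RNK
FROM (
    SELECT D.COL_US, D.DATESTAMP,
           ROW_NUMBER() OVER (PARTITION BY D.ID, D.SUB_ID
                              ORDER BY CONCAT(D.DATESTAMP, D.HOURSTAMP, D.TIMESTAMP) DESC) AS RNK
    FROM ${wf_table_name} D
    WHERE DATESTAMP >= '${datestamp_15}' AND DATESTAMP < '${datestamp}'
      AND COL_US > 0
) T
WHERE T.RNK < 500
UNION ALL
SELECT 'DS' AS SRC, T.COL_DS AS VAL, T.DATESTAMP, T.RNK
FROM (
    SELECT D.COL_DS, D.DATESTAMP,
           ROW_NUMBER() OVER (PARTITION BY D.ID, D.SUB_ID
                              ORDER BY CONCAT(D.DATESTAMP, D.HOURSTAMP, D.TIMESTAMP) DESC) AS RNK
    FROM ${wf_table_name} D
    WHERE DATESTAMP >= '${datestamp_15}' AND DATESTAMP < '${datestamp}'
      AND COL_DS > 0
) T
WHERE T.RNK < 500;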