Selecting a 1% sample in Aginity Workbench SQL

I need to scoop up a random sample of 1% of the records in a table (with the number of rows growing every second).
My idea is to
SELECT DISTINCT
random(),
name,
age,
registrationNumber
FROM everGrowingTable
ORDER BY random desc
LIMIT (
(select count(*) from everGrowingTable) * 0.01
) -- this is attempting to get 1%
The compiler complains about the * operator, although it is fine when I hard-code the table size.
I've tried the IBM documentation, but it only covers calculations using known values, not values that grow (as is the case with my table).
There doesn't seem to be an Aginity SQL function that does this. I've noticed the MINUS function in the Aginity Workbench IntelliSense but, alas, no multiplication equivalent.

You could use window functions in a subquery to assign a random number to each record and compute the total number of records, and then do the filtering in the outer query:
SELECT name, age, registrationNumber
FROM (
SELECT
name,
age,
registrationNumber,
ROW_NUMBER() OVER(ORDER BY random()) rn,
COUNT(*) OVER() cnt
FROM everGrowingTable
) x
WHERE rn <= cnt / 100
ORDER BY rn
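The query above can be tried out on a toy table. The following is a minimal sketch, assuming an in-memory SQLite database (3.25 or later, for window functions) driven from Python's sqlite3 module as a stand-in for the real server; the 1000 generated rows are made up for illustration.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE everGrowingTable (name TEXT, age INTEGER, registrationNumber INTEGER)"
)
# Hypothetical data: 1000 rows, so a 1% sample should hold 10 rows.
conn.executemany(
    "INSERT INTO everGrowingTable VALUES (?, ?, ?)",
    [(f"name{i}", 20 + i % 50, i) for i in range(1000)],
)

sample = conn.execute("""
    SELECT name, age, registrationNumber
    FROM (
        SELECT name, age, registrationNumber,
               ROW_NUMBER() OVER (ORDER BY random()) AS rn,  -- random rank per row
               COUNT(*) OVER () AS cnt                       -- total row count
        FROM everGrowingTable
    ) x
    WHERE rn <= cnt / 100
    ORDER BY rn
""").fetchall()

print(len(sample))  # 10
```

Because the count is computed in the same statement as the ranking, the 1% threshold follows the table size at query time rather than a hard-coded value.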

Related

How to query samples in relativity?

I have a large data set with about 100 million rows. I want to 'compress' it by taking a 1% sample of the entire data set while preserving relativity, i.e. the relative distribution of the data.
How can such a query be implemented?
Step 1: create the helper table
You can use aggregation to group records by visit_number, and CROSS JOIN with a query that computes the total number of records in the table to compute the distribution percentage:
CREATE TABLE my_helper AS
SELECT
t.visit_number,
COUNT(*) visit_count,
SUM(t.purchase_id) sum_purchase,
COUNT(*) * 1.0 / MAX(total.cnt) distribution -- * 1.0 avoids integer division; MAX() because total.cnt is not in the GROUP BY
FROM
mytable t
CROSS JOIN (SELECT COUNT(*) cnt FROM mytable) total
GROUP BY t.visit_number
Step 2: sample the main table using the helper table
Within a subquery, you can use ROW_NUMBER() OVER(PARTITION BY visit_number ORDER BY RANDOM()) to assign a random rank to each record within groups of records sharing the same visit_number. Then, in the outer query, you can join on the helper table to select the correct number of records for each visit_number:
SELECT x.*
FROM (
SELECT
t.*,
ROW_NUMBER() OVER(PARTITION BY visit_number ORDER BY RANDOM()) rn
FROM mytable t
) x
INNER JOIN my_helper h ON h.visit_number = x.visit_number
WHERE x.rn <= 1000000 * h.distribution
Side notes:
This only works if there are indeed more than 1 million records in the source table.
The exact number of records in the output might be slightly below or above 1 million (depending on the distribution in the original table).
It should be possible to combine both queries into a single one, which would avoid the need for a helper table.
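As a sketch of that last side note, the helper table can be replaced by two extra window counts in the subquery. This is one possible way to do it, not necessarily what the answer had in mind. It assumes an in-memory SQLite database (3.25 or later) driven from Python; the toy data and the TARGET value, which stands in for the 1000000 above, are hypothetical.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (visit_number INTEGER, purchase_id INTEGER)")
# Hypothetical data: 900 rows for visit_number 1 and 100 rows for visit_number 2.
rows = [(1, i) for i in range(900)] + [(2, i) for i in range(900, 1000)]
conn.executemany("INSERT INTO mytable VALUES (?, ?)", rows)

TARGET = 100  # hypothetical target sample size

per_group = conn.execute("""
    SELECT visit_number, COUNT(*) AS sampled
    FROM (
        SELECT t.*,
               ROW_NUMBER() OVER (PARTITION BY visit_number ORDER BY random()) AS rn,
               COUNT(*) OVER (PARTITION BY visit_number) AS visit_count,  -- rows in this group
               COUNT(*) OVER () AS total                                  -- rows in the table
        FROM mytable t
    ) x
    WHERE rn <= ? * 1.0 * visit_count / total  -- * 1.0 avoids integer division
    GROUP BY visit_number
    ORDER BY visit_number
""", (TARGET,)).fetchall()

print(per_group)  # [(1, 90), (2, 10)]
```

The outer GROUP BY is only there to show how many rows each group contributes; selecting x.* instead would return the sampled rows themselves.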
This is doable. A quick way is to take only every nth record.
1) order by a random column (probably ID)
2) apply a rownum attribute
3) filter on mod(rownum, n) = 0 for whatever percentage makes sense (e.g. for 1%, mod(rownum, 100) = 0)
You may need steps 1 and 2 in a subquery and step 3 on the outside.
Enjoy and good luck!
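The three steps above can be sketched as follows. This assumes an in-memory SQLite database (3.25 or later) driven from Python, with ROW_NUMBER() standing in for a rownum attribute; the table layout and data are hypothetical. Note that ordering by ID gives a systematic every-100th-row sample rather than a random one.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, payload TEXT)")
# Hypothetical data: 1000 rows with ids 1..1000.
conn.executemany(
    "INSERT INTO mytable (payload) VALUES (?)",
    [(f"row{i}",) for i in range(1000)],
)

picked = conn.execute("""
    SELECT id, payload
    FROM (
        -- steps 1 and 2: order the rows and number them
        SELECT id, payload, ROW_NUMBER() OVER (ORDER BY id) AS rn
        FROM mytable
    ) numbered
    WHERE rn % 100 = 0   -- step 3: keep every 100th row, i.e. 1%
    ORDER BY rn
""").fetchall()

print(len(picked))  # 10
```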

Get two most frequent data from SQL tbl?

I have a table, call it tbl_test, into which data is continuously inserted; it holds approximately 10^6 records at a time. It has the columns
Acquire_Id (value between 1 and 20),
Status_Msg (value between 'A' and 'Z'),
Status_Code (value between 1 and 26)
There is a one-to-one mapping between Status_Msg and Status_Code.
Now I want to get the two most frequent Status_Msg and Status_Code values, with their counts, for each acquirer, if they are present in the table.
The query should be cost-efficient.
Most databases support the ANSI standard window functions. You can get what you want using row_number() (or rank() or dense_rank(), depending on how you want ties handled) after aggregating the values.
The following returns two rows for each acquirer, even if there are ties (or a single row for an acquirer with only one distinct status).
select t.*
from (select t.acquire_id, t.status_msg, t.status_code, count(*) as cnt,
row_number() over (partition by t.acquire_id order by count(*) desc) as seqnum
from tbl_test t
group by t.acquire_id, t.status_msg, t.status_code
) t
where seqnum <= 2;
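A small runnable check of this approach, assuming an in-memory SQLite database (3.25 or later) driven from Python and a handful of made-up rows. To keep the sketch simple, it aggregates in an inner subquery and ranks the resulting cnt column one level up, which should be equivalent to ordering by count(*) directly in the window.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tbl_test (acquire_id INTEGER, status_msg TEXT, status_code INTEGER)"
)
# Hypothetical data: acquirer 1 has A x3, B x2, C x1; acquirer 2 has Z x1.
rows = [(1, "A", 1)] * 3 + [(1, "B", 2)] * 2 + [(1, "C", 3)] + [(2, "Z", 26)]
conn.executemany("INSERT INTO tbl_test VALUES (?, ?, ?)", rows)

top_two = conn.execute("""
    SELECT acquire_id, status_msg, status_code, cnt
    FROM (
        SELECT c.*,
               ROW_NUMBER() OVER (PARTITION BY acquire_id ORDER BY cnt DESC) AS seqnum
        FROM (
            -- one row per (acquirer, status) with its frequency
            SELECT acquire_id, status_msg, status_code, COUNT(*) AS cnt
            FROM tbl_test
            GROUP BY acquire_id, status_msg, status_code
        ) c
    ) ranked
    WHERE seqnum <= 2
    ORDER BY acquire_id, seqnum
""").fetchall()

print(top_two)  # [(1, 'A', 1, 3), (1, 'B', 2, 2), (2, 'Z', 26, 1)]
```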

Sqlite get ROW NUMBER and COUNT of records found on every SELECT request

I'm using SQLite3 in a grid application, pretty much like the one in that post here.
The grid needs the rows being shown and the total number of rows found, which is used for paging.
In Oracle I use the following statement to get rows 100 to 500 (fields Id, Name, Phone, where Deleted is false):
SELECT * FROM (SELECT ROW_NUMBER()
OVER (ORDER BY ID) AS RN,
COUNT(*) OVER (ORDER BY (SELECT NULL)) AS CNT,
Id, Name, Phone FROM MyTable WHERE Deleted='F')
T WHERE RN > 100 AND RN < 500;
In MySQL, I normally use the excellent SELECT SQL_CALC_FOUND_ROWS followed by a SELECT FOUND_ROWS() call.
So my questions are:
a) Is there any SQLite3 equivalent of either the Oracle or the MySQL option above?
b) How can I accomplish that in SQLite3 without issuing 2 selects (one for querying and another for counting)?
The question posted here does not solve my problem because it does not return the number of records, just pages through the table.
Thanks a lot for helping...
In recent versions of SQLite (3.25.0 and later) there is support for window functions. So you can rewrite your query as follows:
SELECT * FROM (SELECT ROW_NUMBER() OVER (ORDER BY ID) AS RN,
COUNT(*) OVER () AS CNT,
Id, Name, Phone FROM MyTable WHERE Deleted='F')
T WHERE RN > 100 AND RN < 500;
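Here is a minimal sketch of that query run through Python's sqlite3 module, assuming the bundled SQLite is 3.25 or later. The table contents and the smaller page bounds (rows 11 to 20) are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE MyTable (Id INTEGER PRIMARY KEY, Name TEXT, Phone TEXT, Deleted TEXT)"
)
# Hypothetical data: 50 rows, every 5th one flagged as deleted (40 remain).
conn.executemany(
    "INSERT INTO MyTable VALUES (?, ?, ?, ?)",
    [(i, f"name{i}", f"555-{i:04d}", "T" if i % 5 == 0 else "F") for i in range(1, 51)],
)

page = conn.execute("""
    SELECT * FROM (
        SELECT ROW_NUMBER() OVER (ORDER BY Id) AS RN,
               COUNT(*) OVER () AS CNT,   -- total matching rows, repeated on every row
               Id, Name, Phone
        FROM MyTable
        WHERE Deleted = 'F'
    ) T
    WHERE RN > 10 AND RN <= 20
    ORDER BY RN
""").fetchall()

total_rows = page[0][1] if page else 0
print(len(page), total_rows)  # 10 40
```

Every returned row carries the total in its CNT column, so a single SELECT serves both the page and the pager. If the requested page is empty, no row comes back and the total is not available from this query.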

clubbing multiple "With" clauses in sql

I am using Oracle Database 10g and trying to compute the upper control limit and lower control limit for a data set. It seems pointless for phone number values, but I am just using it as a learning experience. The output should list, row by row, the entries:
salutation, zip, lcl and ucl value
which would allow a better understanding of the data.
with q as(
select student_id,salutation,zip,first_name,last_name from tempTable)
with r as(
select avg(phone) as average,stddev(phone) as sd from tempTable)
select salutation,zip,average-3*sd as"lcl",average+3*sd as"UCL"
from
q ,r
The error given is that the SELECT statement is missing. Please tell me what is wrong; I am a SQL newbie and can't work it out myself.
When stacking CTEs, only the first one takes the WITH keyword; for each subsequent CTE, put a comma before its name instead. Try this syntax:
WITH q
AS (SELECT student_id,
salutation,
zip,
first_name,
last_name
FROM temptable),
r
AS (SELECT Avg(phone) AS average,
STDDEV(phone) AS sd
FROM temptable)
SELECT salutation,
zip,
average - 3 * sd AS "lcl",
average + 3 * sd AS "UCL"
FROM q Cross Join r;
I don't think you need a WITH clause at all to run such a query. It might be better to use the AVG() and STDDEV() functions as window functions (analytic functions in Oracle lingo):
SELECT temp1.*, average - 3 * sd AS lcl, average + 3 * sd AS ucl
FROM (
SELECT student_id, salutation, zip, first_name, last_name
, AVG(phone) OVER ( ) AS average, STDDEV(phone) OVER ( ) AS sd
FROM tempTable
) temp1
You don't even need the subquery but it helps save some keystrokes. See this SQL Fiddle demo with dummy data from DUAL.
P.S. You do need the alias (in this case, temp1) for the subquery if you want to use * to get all the columns selected in the subquery - it won't work otherwise. Alternately you could name the columns explicitly, which is a good practice anyway.

Evaluating the mean absolute deviation of a set of numbers in Oracle

I'm trying to implement a procedure to evaluate the median absolute deviation of a set of numbers (usually obtained via a GROUP BY clause).
An example of a query where I'd like to use this is:
select id, mad(values) from mytable group by id;
I'm going by the aggregate function example but am a little confused since the function needs to know the median of all the numbers before all the iterations are done.
Any pointers to how such a function could be implemented would be much appreciated.
In Oracle 10g+:
SELECT MEDIAN(ABS(value - med))
FROM (
SELECT value, MEDIAN(value) OVER() AS med
FROM mytable
)
Or the same with GROUP BY:
SELECT id, MEDIAN(ABS(value - med))
FROM (
SELECT id, value, MEDIAN(value) OVER(PARTITION BY id) AS med
FROM mytable
)
GROUP BY
id
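To sanity-check what the grouped query computes, here is the same two-pass logic in plain Python (no database involved, since the MEDIAN function used above is Oracle-specific). The sample rows are made up for illustration.

```python
from collections import defaultdict
from statistics import median

# Hypothetical (id, value) rows standing in for mytable.
rows = [(1, 1), (1, 2), (1, 3), (1, 4), (1, 100), (2, 10), (2, 10), (2, 20)]

# Collect the values of each id, like PARTITION BY id / GROUP BY id.
groups = defaultdict(list)
for id_, value in rows:
    groups[id_].append(value)

mad = {}
for id_, values in groups.items():
    med = median(values)                             # first pass: median of the group
    mad[id_] = median(abs(v - med) for v in values)  # second pass: median of |value - med|

print(mad)  # {1: 1, 2: 0}
```

The group median has to be known before any deviation can be computed, which is why the SQL version computes it with a window function in the subquery and aggregates in the outer query.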