How to obtain a uniform sample from my SQL table? - sql

Say I have a simple table:
id, value
2, 5
4, 3
10, 4
20, 5
24, 4
40, 3
60, 3
80, 3
150, 3
90, 3
120, 3
As you can see, most rows in the value column are 3. If I take a subset of this table with, e.g., SELECT * FROM TABLE LIMIT 10, there is a high likelihood that 3 would dominate. How can I instead obtain a sample that is uniform across values, i.e., a subset that contains 2 rows for each distinct value?

You can use a query like this:
WITH cte AS (
    SELECT
        *,
        ROW_NUMBER() OVER (PARTITION BY value ORDER BY id) AS rn
    FROM data
)
SELECT
    id,
    value
FROM cte
WHERE rn < 3
GROUP BY value, id
It would give you at most 2 rows per value.
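To try the query without a database server, here is a minimal sketch using Python's built-in sqlite3 module. It assumes the table is called data, as in the query above, and an SQLite build with window-function support (3.25 or later). The sample rows are the ones from the question; the final ORDER BY is added only to make the printed result deterministic.

```python
import sqlite3

# In-memory database holding the sample rows from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (id INTEGER, value INTEGER)")
conn.executemany(
    "INSERT INTO data VALUES (?, ?)",
    [(2, 5), (4, 3), (10, 4), (20, 5), (24, 4), (40, 3),
     (60, 3), (80, 3), (150, 3), (90, 3), (120, 3)],
)

# Number the rows within each value, then keep the first two per value.
QUERY = """
WITH cte AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY value ORDER BY id) AS rn
    FROM data
)
SELECT id, value FROM cte WHERE rn < 3 ORDER BY value, id
"""
sample = conn.execute(QUERY).fetchall()
print(sample)
```

Because the window is ordered by id, this keeps the two lowest ids per value. Ordering the window by a random function instead (RANDOM() in SQLite) would pick two rows per value at random.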

Related

Progressive Select Query in Oracle Database

I want to write a select query that selects distinct rows of data progressively.
To explain with an example:
Say I have 5000 accounts selected for loan repayment, ordered by outstanding amount in descending order (account 1 has the highest outstanding amount, while account 5000 has the lowest).
I want to select 1000 unique accounts 5 times, such that the total outstanding amount is similar in all 5 cases.
I have tried a few methods, such as selecting rownums based on odd/even, but that only works for up to 2 distributions. I was expecting something more like an arithmetic progression (A.P.) that selects data progressively.
A naïve method of splitting the rows into (for example) 5 bins, numbered 0 to 4, is to give each row a unique sequential numeric index and then, in order of size, assign the first 10 rows to bins 0,1,2,3,4,4,3,2,1,0, repeating the pattern for each additional set of 10 rows:
WITH indexed_values (value, rn) AS (
  SELECT value,
         ROW_NUMBER() OVER (ORDER BY value DESC) - 1
  FROM   table_name
),
assign_bins (value, rn, bin) AS (
  SELECT value,
         rn,
         CASE WHEN MOD(rn, 2 * 5) >= 5
              THEN 5 - MOD(rn, 5) - 1
              ELSE MOD(rn, 5)
         END
  FROM   indexed_values
)
SELECT bin,
       COUNT(*) AS num_values,
       SUM(value) AS bin_size
FROM   assign_bins
GROUP BY bin
Which, for some random data:
CREATE TABLE table_name ( value ) AS
SELECT FLOOR(DBMS_RANDOM.VALUE(1, 1000001)) FROM DUAL CONNECT BY LEVEL <= 1000;
May output:
BIN NUM_VALUES BIN_SIZE
0 200 100012502
1 200 100004633
2 200 99980342
3 200 99976774
4 200 100005756
It will not get the bins to have equal values but it is relatively simple and will get a close approximation if your values are approximately evenly distributed.
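The bin assignment can also be traced outside the database. The following is a small Python sketch of the same back-and-forth pattern; the function name assign_bins and the toy input 1..20 are illustrative assumptions, not part of the original answer.

```python
def assign_bins(values, n_bins=5):
    """Return n_bins lists, filled in the 0,1,...,n-1,n-1,...,1,0 pattern."""
    bins = [[] for _ in range(n_bins)]
    # rn plays the role of ROW_NUMBER() OVER (ORDER BY value DESC) - 1.
    for rn, value in enumerate(sorted(values, reverse=True)):
        pos = rn % n_bins
        # The second half of every block of 2 * n_bins rows runs backwards.
        if rn % (2 * n_bins) >= n_bins:
            pos = n_bins - pos - 1
        bins[pos].append(value)
    return bins


bins = assign_bins(range(1, 21))
bin_sizes = [sum(b) for b in bins]
print(bin_sizes)
```

For perfectly evenly spaced values such as 1..20, every bin ends up with the same total. Real data will only balance approximately, as noted above.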
If you want to select values from a certain bin then:
WITH indexed_values (value, rn) AS (
  SELECT value,
         ROW_NUMBER() OVER (ORDER BY value DESC) - 1
  FROM   table_name
),
assign_bins (value, rn, bin) AS (
  SELECT value,
         rn,
         CASE WHEN MOD(rn, 2 * 5) >= 5
              THEN 5 - MOD(rn, 5) - 1
              ELSE MOD(rn, 5)
         END
  FROM   indexed_values
)
SELECT value
FROM   assign_bins
WHERE  bin = 0

ORACLE SQL: select MAX value based on 2 different variables (group by function) [duplicate]

I think my problem is quite simple but I don't really get it :)
I'm using SQL Developer as IDE and have a large table which looks like this:
ID technology speed
1 3G 20
1 2G 10
1 4G 40
1 5G 100
2 3G 60
2 4G 90
2 5G 150
3 3G 30
3 4G 50
I need the max value of 'technology' for each 'ID' and also need the 'speed' in the result:
ID technology speed
1 5G 100
2 5G 150
3 4G 50
My SQL looks like this:
SELECT ID, MAX(technology) AS technology, speed
FROM "table"
GROUP BY ID, speed;
But with this SQL I get multiple rows for each ID.
Any ideas?
Since you're including speed in your select and group, and those values vary per row, your current query will basically return the full table just with the MAX(technology) for each row. This is because speed can't be grouped into a single value as they are all different.
ie.
ID technology speed
1 5G 20
1 5G 10
1 5G 40
1 5G 100
Based purely on your sample set, you could select the MAX(speed) since it always coincides with the MAX(technology), and then you would get the right results:
ID technology speed
1 5G 100
However, if the row with the MAX(technology) for an ID ever has a speed lower than the MAX(speed) for that ID, the above would become incorrect.
A better approach would be to use a window function because you would remove that potential flaw:
WITH cte AS (
    SELECT ID, technology, speed,
           ROW_NUMBER() OVER (PARTITION BY ID ORDER BY technology DESC) AS RN
    FROM "table"
)
SELECT *
FROM cte
WHERE RN = 1
This assigns a number to each row, starting with number 1 for the row that has the MAX(technology) (ie. ORDER BY technology DESC), and does this for each ID (ie. PARTITION BY ID).
Therefore when we select only the rows that are assigned row number 1, we are getting the full row for each max technology / id combination.
One last note: if there are several rows with the same ID and technology but different speeds, this would pick one of them arbitrarily. In that case you would need to add speed to the ORDER BY as a tie-breaker. Based on your sample set this doesn't happen, but just an FYI.
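The difference between the two approaches can be checked with a short sketch using Python's built-in sqlite3 module (SQLite 3.25 or later is assumed for window functions). The table name speeds is an assumption, and the last inserted row is hypothetical: it is not in the question and was added so that the highest technology for ID 3 no longer has the highest speed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE speeds (id INTEGER, technology TEXT, speed INTEGER)")
conn.executemany(
    "INSERT INTO speeds VALUES (?, ?, ?)",
    [(1, "3G", 20), (1, "2G", 10), (1, "4G", 40), (1, "5G", 100),
     (2, "3G", 60), (2, "4G", 90), (2, "5G", 150),
     (3, "3G", 30), (3, "4G", 50),
     (3, "5G", 40)],  # hypothetical row, not part of the question's data
)

# Two independent aggregates: the speed may come from a different row.
naive = conn.execute(
    "SELECT id, MAX(technology), MAX(speed) FROM speeds GROUP BY id ORDER BY id"
).fetchall()

# Window function: the whole row with the highest technology is returned.
windowed = conn.execute("""
    WITH cte AS (
        SELECT id, technology, speed,
               ROW_NUMBER() OVER (
                   PARTITION BY id ORDER BY technology DESC, speed DESC
               ) AS rn
        FROM speeds
    )
    SELECT id, technology, speed FROM cte WHERE rn = 1 ORDER BY id
""").fetchall()

print(naive)
print(windowed)
```

For ID 3 the naive query pairs 5G with speed 50, which belongs to the 4G row, while the windowed query returns the actual 5G row with speed 40.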
You can also use the keep keyword. It is handy in situations like this one since ordering by one column and outputting another column happens in one clause, no subquery or CTE is needed. The drawback is that it is Oracle proprietary syntax.
with t (ID, technology, speed) as (
select 1, '3G', 20 from dual union all
select 1, '2G', 10 from dual union all
select 1, '4G', 40 from dual union all
select 1, '5G', 100 from dual union all
select 2, '3G', 60 from dual union all
select 2, '4G', 90 from dual union all
select 2, '5G', 150 from dual union all
select 3, '3G', 30 from dual union all
select 3, '4G', 50 from dual
)
select id
     , max(technology) as technology
     -- speed taken from the row with the highest technology for this id
     , max(speed) keep (dense_rank first order by technology desc) as speed
from t
group by id

Hive query to select only records in certain percentile

I have table with two columns - ID and total duration:
id tot_dur
123 1
124 2
125 5
126 8
I want a Hive query that selects only the records at or above the 75th percentile. Here that should be only the last record:
id tot_dur
126 8
This is what I have, but it's hard for me to understand the use of OVER() and PARTITION BY, which, from what I have researched, are what I should use. To get the tot_dur column I first need to sum the duration column, grouped by id. I'm not sure percentile is the correct function, because I also found use cases with percentile_approx.
select k1.id as id, percentile(cast(tot_dur as bigint),0.75) OVER () as tot_dur
from (
SELECT id, sum(duration) as tot_dur
FROM data_source
GROUP BY id) k1
group by id
If I've got you right, this is what you want:
with data as (
    select stack(4,
                 123, 1,
                 124, 2,
                 125, 5,
                 126, 8) as (id, tot_dur)
)
-- keep only the rows at or above the 75th-percentile threshold
select data.id, data.tot_dur
from data
cross join (select percentile(tot_dur, 0.75) as threshold from data) as t
where data.tot_dur >= t.threshold;
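To see why only the last record survives the filter, here is a small Python sketch of the same threshold-then-filter idea. It assumes the exact percentile is computed by linear interpolation between neighbouring ranks, which is my understanding of how Hive's percentile behaves for integer input; the helper name percentile is just for illustration.

```python
import math


def percentile(values, p):
    """Exact percentile using linear interpolation between neighbouring ranks."""
    ordered = sorted(values)
    position = p * (len(ordered) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


rows = [(123, 1), (124, 2), (125, 5), (126, 8)]
threshold = percentile([tot_dur for _, tot_dur in rows], 0.75)
selected = [row for row in rows if row[1] >= threshold]
print(threshold, selected)
```

Under that assumption the threshold lies between 5 and 8, so the row with tot_dur 5 is excluded and only id 126 remains.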

SQL random number that doesn't repeat within a group

Suppose I have a table:
HH SLOT RN
--------------
1 1 null
1 2 null
1 3 null
--------------
2 1 null
2 2 null
2 3 null
I want to set RN to be a random number between 1 and 10. It's ok for the number to repeat across the entire table, but it's bad to repeat the number within any given HH. E.g.,:
HH SLOT RN_GOOD RN_BAD
--------------------------
1 1 9 3
1 2 4 8
1 3 7 3 <--!!!
--------------------------
2 1 2 1
2 2 4 6
2 3 9 4
This is on Netezza if it makes any difference. This one's being a real headscratcher for me. Thanks in advance!
To get a random number between 1 and the number of rows in the hh, you can use:
select hh, slot, row_number() over (partition by hh order by random()) as rn
from t;
The larger range of values is a bit more challenging. The following calculates a table (called randoms) with numbers and a random position in the same range. It then uses slot to index into the position and pull the random number from the randoms table. Note that randoms is computed once, so every hh receives the same number for a given slot:
with nums as (
    select 1 as n union all select 2 union all select 3 union all select 4 union all select 5 union all
    select 6 union all select 7 union all select 8 union all select 9 union all select 10
),
randoms as (
    select n, row_number() over (order by random()) as pos
    from nums
)
select t.hh, t.slot, hnum.n
from (select h.hh, randoms.n, randoms.pos
      from (select distinct hh
            from t
           ) h cross join
           randoms
     ) hnum join
     t
     on t.hh = hnum.hh and
        t.slot = hnum.pos;
Here is a SQLFiddle that demonstrates this in Postgres, which I assume is close enough to Netezza to have matching syntax.
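If the assignment can be done outside the database, sampling without replacement gives the "no repeats within an HH" guarantee directly. The following Python sketch is one way to do that under that assumption; the function name, the (hh, slot) tuple layout and the seed parameter are illustrative choices, not something from the question.

```python
import random


def assign_random_numbers(rows, low=1, high=10, seed=None):
    """rows: iterable of (hh, slot) pairs. Returns {(hh, slot): rn}."""
    rng = random.Random(seed)
    slots_by_hh = {}
    for hh, slot in rows:
        slots_by_hh.setdefault(hh, []).append(slot)

    assignment = {}
    for hh, slots in slots_by_hh.items():
        # sample() draws without replacement, so numbers cannot repeat
        # within one hh; each hh gets its own independent draw.
        numbers = rng.sample(range(low, high + 1), len(slots))
        for slot, rn in zip(slots, numbers):
            assignment[(hh, slot)] = rn
    return assignment


rows = [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3)]
result = assign_random_numbers(rows, seed=0)
print(result)
```

This requires that no HH has more slots than there are numbers in the range; otherwise sample() raises a ValueError.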
I am not an expert on SQL, but you could probably do something like this:
1. Initialize a counter CNT=1.
2. Create a table such that you sample 1 row randomly from each group, along with a count of null RN values, say C_NULL_RN.
3. With probability C_NULL_RN/(10-CNT+1) for each row, assign CNT as RN.
4. Increment CNT and go to step 2.
Well, I couldn't get a slick solution, so I did a hack:
1. Created a new integer field called rand_inst.
2. Assign a random number to each empty slot.
3. Update rand_inst to be the instance number of that random number within this household. E.g., if I get two 3's, then the second 3 will have rand_inst set to 2.
4. Update the table to assign a different random number anywhere that rand_inst>1.
5. Repeat assignment and update until we converge on a solution.
Here's what it looks like. Too lazy to anonymise it, so the names are a little different from my original post:
/* Iterative hack to fill 6 slots with a random number between 1 and 13.
A random number *must not* repeat within a household_id.
*/
update c3_lalfinal a
set a.rand_inst = b.rnum
from (
select household_id
,slot_nbr
,row_number() over (partition by household_id,rnd order by null) as rnum
from c3_lalfinal
) b
where a.household_id = b.household_id
and a.slot_nbr = b.slot_nbr
;
update c3_lalfinal
set rnd = CAST(0.5 + random() * (13-1+1) as INT)
where rand_inst>1
;
/* Repeat until this query returns 0: */
select count(*) from (
select household_id from c3_lalfinal group by 1 having count(distinct(rnd)) <> 6
) x
;
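The same re-draw loop can be simulated in a few lines of Python, which may help when reasoning about how it converges. This is only a sketch of the idea: the function name, the dictionary layout and the seed parameter are assumptions, and the defaults of 6 slots and numbers 1 to 13 are taken from the comment in the SQL above.

```python
import random
from collections import Counter


def fill_without_repeats(households, n_slots=6, low=1, high=13, seed=None):
    """Draw a number per slot, then re-draw repeats until none remain."""
    rng = random.Random(seed)
    rnd = {(hh, slot): rng.randint(low, high)
           for hh in households for slot in range(1, n_slots + 1)}
    passes = 0
    while True:
        seen = Counter()
        repeats = []
        for key in sorted(rnd):
            hh, _slot = key
            seen[(hh, rnd[key])] += 1
            # Equivalent of rand_inst > 1: a second or later occurrence
            # of the same number within the household.
            if seen[(hh, rnd[key])] > 1:
                repeats.append(key)
        if not repeats:
            return rnd, passes
        for key in repeats:
            rnd[key] = rng.randint(low, high)
        passes += 1


rnd, passes = fill_without_repeats([101, 102, 103], seed=1)
print(passes, rnd)
```

The number of passes is random. The loop can only finish if the range contains at least as many numbers as there are slots per household.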

Transposing rows to columns in MsAccess query

I have a table with
cID, side, row, column
with some data of
24, 1, 10, 5
25, 1, 12, 6
24, 2, 18, 3
and so on. Now I want this data to be shown in the form of:
cID=24
side 1 2
row 10 18
column 5 3
cID=25
side 1
row 12
column 6
The cID is filtered in the query so the output will be the 3 rows (side, row, column) and the data of them of a specific cID.
Is that possible with MsAccess Query/SQL and how?
Thanks!
Something along these lines:
TRANSFORM First(q.rvalue) AS firstofrow
SELECT q.rhead
FROM (SELECT cid,
             side,
             [row] AS rvalue,
             "row" AS rhead
      FROM atable
      UNION ALL
      SELECT cid,
             side,
             [column] AS rvalue,
             "column" AS rhead
      FROM atable) AS q
WHERE q.cid = 24
GROUP BY q.rhead
PIVOT q.side;
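To make the shape of the result explicit, here is a minimal Python sketch of the same unpivot-then-pivot idea on the sample rows from the question. The function name pivot_by_side and the dictionary layout are assumptions made for illustration.

```python
def pivot_by_side(records, cid):
    """Return {field: {side: value}} for one cID, one entry per output row."""
    fields = ("row", "column")
    table = {field: {} for field in fields}
    for rec in records:
        if rec["cID"] != cid:
            continue
        for field in fields:
            # Like First(): keep the first value seen for each side.
            table[field].setdefault(rec["side"], rec[field])
    return table


records = [
    {"cID": 24, "side": 1, "row": 10, "column": 5},
    {"cID": 25, "side": 1, "row": 12, "column": 6},
    {"cID": 24, "side": 2, "row": 18, "column": 3},
]
pivoted = pivot_by_side(records, 24)
print(pivoted)
```

Each key of the outer dictionary corresponds to one output row (rhead), and each inner key to one side column produced by PIVOT.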