I am trying to get a sample of data from a large table and want to make sure the sample can be repeated later on. Other SQL dialects allow repeatable sampling by setting a seed, e.g. with set.seed(integer) or a REPEATABLE (integer) clause. However, this is not working for me in Presto. Is such a command not available yet? Thanks.
One solution is to simulate the sampling by adding a column (or creating a view) with a random value per row (such as a UUID) and then selecting rows by filtering on that column (for example, rows whose UUID ends with '1'). You can tune the condition to get the sample size you need.
By design, the result is random, yet repeatable across multiple runs, because the random values are stored with the rows rather than generated at query time.
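The idea can be sanity-checked outside the database with a short script (a sketch, not Presto code; the in-memory "table" and the suffix '1' are stand-ins): once the random token is stored with each row, filtering on it returns the same subset every time.

```python
import uuid

# Simulate a table: each row gets a random token exactly once, as if the
# column had been materialized in the table or a view.
table = [{"id": i, "tok": uuid.uuid4().hex} for i in range(1000)]

def sample(rows):
    # Filter on the stored token, e.g. tokens ending in '1'
    # (~1/16 of rows for one hex character; add characters to shrink the sample).
    return [r["id"] for r in rows if r["tok"].endswith("1")]

first = sample(table)
second = sample(table)
assert first == second  # repeatable: the randomness lives in the data, not the query
```

The repeatability comes entirely from the token being persisted; the Presto-side filter would just be something like `where uuid_col like '%1'` against that stored column.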
If you are using Presto 0.263 or higher, you can use key_sampling_percent to reproducibly map a varchar to a double between 0.0 and 1.0.
For example, to reproducibly sample 20% of the records in a table using the id column:
select
id
from table
where key_sampling_percent(id) < 0.2
If you are using an older version of Presto (e.g. the engine behind AWS Athena), you can replicate what the source code of key_sampling_percent does:
select
id
from table
where (abs(from_ieee754_64(xxhash64(cast(id as varbinary)))) % 100) / 100. < 0.2
I have found that you have to use from_big_endian_64 instead of from_ieee754_64 to get reliable results in Athena. Otherwise I got too many numbers close to zero, because from_ieee754_64 reinterprets the hash bits as a double, and the many bit patterns with a negative exponent all collapse toward zero.
select id
from table
where (abs(from_big_endian_64(xxhash64(cast(id as varbinary)))) % 100) / 100. < 0.2
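The difference is easy to reproduce outside the database. The sketch below uses SHA-256 from Python's standard library as a stand-in for xxhash64 (which is not in the stdlib); the point is how the 8 hash bytes are treated, not the hash function. Reading them as a big-endian integer gives a roughly uniform bucket in [0, 1), while reinterpreting the same bits as an IEEE-754 double skews the selection rate badly:

```python
import hashlib
import struct

def hash8(key: str) -> bytes:
    # First 8 bytes of a hash of the key (stand-in for xxhash64).
    return hashlib.sha256(key.encode()).digest()[:8]

def frac_big_endian(key: str) -> float:
    # Mirrors abs(from_big_endian_64(...)) % 100 / 100.
    n = struct.unpack(">q", hash8(key))[0]  # signed 64-bit, big-endian
    return abs(n) % 100 / 100.0

def frac_ieee754(key: str) -> float:
    # Mirrors abs(from_ieee754_64(...)) % 100 / 100. - the problematic version:
    # the same bits are read as a double, so tiny-exponent values dominate.
    x = struct.unpack(">d", hash8(key))[0]
    return abs(x) % 100 / 100.0

ids = [str(i) for i in range(10_000)]
rate_be = sum(frac_big_endian(i) < 0.2 for i in ids) / len(ids)
rate_ie = sum(frac_ieee754(i) < 0.2 for i in ids) / len(ids)
print(rate_be, rate_ie)  # rate_be lands near the intended 0.2; rate_ie is far above it
```

Under the big-endian reading the selection rate stays close to the requested 20%; under the IEEE-754 reading, roughly half of all bit patterns decode to doubles smaller than 1, which survive the modulo unchanged and are almost always selected.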
You may create a simple intermediate table with selected ids:
CREATE TABLE IF NOT EXISTS <temp1>
AS
SELECT <id_column>
FROM <tablename> TABLESAMPLE SYSTEM (10);
This table will contain only the sampled ids and is ready to use downstream in your analysis by joining it with the data of interest.
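In miniature, the join pattern looks like this (a sqlite3 sketch run from Python - SQLite has no TABLESAMPLE, so a deterministic filter stands in for the 10% sample, and the table and column names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE measurements (id INTEGER, value REAL)")
cur.executemany("INSERT INTO measurements VALUES (?, ?)",
                [(i, i * 0.5) for i in range(100)])

# Stand-in for: CREATE TABLE temp1 AS
#               SELECT id FROM measurements TABLESAMPLE SYSTEM (10);
# SQLite has no TABLESAMPLE, so take a deterministic ~10% instead.
cur.execute("CREATE TABLE temp1 AS SELECT id FROM measurements WHERE id % 10 = 3")

# Downstream analysis touches only the sampled rows via a JOIN.
rows = cur.execute("""
    SELECT m.id, m.value
    FROM measurements m
    JOIN temp1 s ON s.id = m.id
    ORDER BY m.id
""").fetchall()
print(len(rows))  # 10 of the 100 rows
```

Because the sampled ids are materialized once, every later query that joins against temp1 sees exactly the same subset.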
I've been trying to visualise how to do this with a CTE, as on the surface that appears to be the best way, but I just can't get it going. Maybe it needs a temp table as well. I am using SQL Server 2008 R2.
I need to create intercepts (essentially a length along a line) with the following parameters.
The average of the intercept must be greater than 0.7
The aim is to get the largest intercept possible
There can be up to 2 consecutive meters of values less than 0.7 (internal waste), but no more
There is no limit to the total internal waste within an intercept
There is no minimum intercept length (well there is but I'll take care of it later)
Note: there will be no gaps, as I have taken care of that, and the from and to values can be decimal.
An example is shown here: [image]
In space, seen here with the assay on the left and depth on the right: [image]
So, for a little more clarity if needed: intervals 6 to 7 and 17 to 18 do not form part of the larger intercept because the internal waste (7-9 and/or 15-17) would bring the average below 0.7 - not because of the amount of internal waste.
However, the result for 21-22 is not included because there are 3 meters of internal waste between it and the result for 17-18.
Note that there are multiple sites and areas which form part of the original table's primary key, so I imagine a partition by area and site would be used in any ROW_NUMBER ... OVER statements.
Edit: the original code had errors in the from and to values (multiple 14-to-15 rows), which would have been confusing - sorry. There will be no overlapping from-to ranges, which hopefully simplifies things.
Example values to use:
create table #temp_inter (
area nvarchar(10),
site_ID nvarchar(10),
d_from decimal (18,3),
d_to decimal (18,3),
assay decimal (18,3))
insert into #temp_inter
values ('area_1','abc','0','5','0'),
('area_1','abc','5','6','0.165'),
('area_1','abc','6','7','0.761'),
('area_1','abc','7','8','0.321'),
('area_1','abc','8','9','0.292'),
('area_1','abc','9','10','1.135'),
('area_1','abc','10','11','0.225'),
('area_1','abc','11','12','0.983'),
('area_1','abc','12','13','0.118'),
('area_1','abc','13','14','0.438'),
('area_1','abc','14','15','0.71'),
('area_1','abc','15','16','0.65'),
('area_1','abc','16','17','2'),
('area_1','abc','17','18','0.367'),
('area_1','abc','18','19','0.047'),
('area_1','abc','19','20','0.71'),
('area_1','abc','20','21','0'),
('area_1','abc','21','22','0'),
('area_1','abc','22','23','0'),
('area_1','abc','23','24','2'),
('area_1','abc','24','25','0'),
('area_1','abc','25','26','0'),
('area_1','abc','26','30','0'),
('area_2','zzz','0','5','0'),
('area_2','zzz','5','6','1.165'),
('area_2','zzz','6','7','0.396'),
('area_2','zzz','7','8','0.46'),
('area_2','zzz','8','9','0.111'),
('area_2','zzz','9','10','0.053'),
('area_2','zzz','10','11','0.057'),
('area_2','zzz','11','12','0.055'),
('area_2','zzz','12','13','0.03'),
('area_2','zzz','13','14','0.026'),
('area_2','zzz','14','15','0.194'),
('area_2','zzz','15','16','0.367'),
('area_2','zzz','16','17','0.431'),
('area_2','zzz','17','18','0.341'),
('area_2','zzz','18','19','0.071'),
('area_2','zzz','19','20','0.26'),
('area_2','zzz','20','21','0.659'),
('area_2','zzz','21','22','0.602'),
('area_2','zzz','22','23','2.436'),
('area_2','zzz','23','24','0.874'),
('area_2','zzz','24','25','3.173'),
('area_2','zzz','25','26','0.179'),
('area_2','zzz','26','27','0.065'),
('area_2','zzz','27','28','0.024'),
('area_2','zzz','28','29','0')
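For anyone attempting this, here is a brute-force sketch (in Python, for readability, not T-SQL) of one possible reading of the rules, useful for pinning down the expected output before writing the set-based version. It assumes an intercept must start and end on a sample grading at least 0.7 (implied by the word "internal"), that the average is length-weighted, and that "largest" means greatest total length - all assumptions, since the question leaves these points open:

```python
# Brute force: try every candidate (start, end) pair of >=0.7 samples and
# keep the longest interval that satisfies all the rules.
data = {
    ("area_1", "abc"): [
        (0, 5, 0), (5, 6, 0.165), (6, 7, 0.761), (7, 8, 0.321),
        (8, 9, 0.292), (9, 10, 1.135), (10, 11, 0.225), (11, 12, 0.983),
        (12, 13, 0.118), (13, 14, 0.438), (14, 15, 0.71), (15, 16, 0.65),
        (16, 17, 2), (17, 18, 0.367), (18, 19, 0.047), (19, 20, 0.71),
        (20, 21, 0), (21, 22, 0), (22, 23, 0), (23, 24, 2),
        (24, 25, 0), (25, 26, 0), (26, 30, 0),
    ],
    ("area_2", "zzz"): [
        (0, 5, 0), (5, 6, 1.165), (6, 7, 0.396), (7, 8, 0.46),
        (8, 9, 0.111), (9, 10, 0.053), (10, 11, 0.057), (11, 12, 0.055),
        (12, 13, 0.03), (13, 14, 0.026), (14, 15, 0.194), (15, 16, 0.367),
        (16, 17, 0.431), (17, 18, 0.341), (18, 19, 0.071), (19, 20, 0.26),
        (20, 21, 0.659), (21, 22, 0.602), (22, 23, 2.436), (23, 24, 0.874),
        (24, 25, 3.173), (25, 26, 0.179), (26, 27, 0.065), (27, 28, 0.024),
        (28, 29, 0),
    ],
}

CUTOFF = 0.7
MAX_WASTE_RUN = 2  # metres of consecutive internal waste allowed

def best_intercept(intervals):
    """Longest (d_from, d_to, weighted_avg) meeting the rules, or None."""
    best = None
    n = len(intervals)
    for i in range(n):
        if intervals[i][2] < CUTOFF:      # must start on a >=0.7 sample
            continue
        for j in range(i, n):
            if intervals[j][2] < CUTOFF:  # ...and end on one
                continue
            span = intervals[i:j + 1]
            run = worst = 0
            for f, t, a in span:          # longest run of consecutive waste metres
                run = run + (t - f) if a < CUTOFF else 0
                worst = max(worst, run)
            if worst > MAX_WASTE_RUN:
                continue
            length = intervals[j][1] - intervals[i][0]
            avg = sum((t - f) * a for f, t, a in span) / length
            if avg > CUTOFF and (best is None or length > best[1] - best[0]):
                best = (intervals[i][0], intervals[j][1], avg)
    return best

for key, intervals in data.items():
    print(key, best_intercept(intervals))
```

Under this reading, area_1/abc yields the intercept 9-17 (length-weighted average about 0.782) and area_2/zzz yields 22-25, which is consistent with the narrative above: 6-7, 17-18 and 21-22 all fall outside the winning intercept.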