How to Pass Query Answer into Limit Function Impala - sql

I am attempting to sample 20% of a table in Impala. I have heard that the built-in Impala sampling function has issues.
Is there a way to pass a subquery to the Impala LIMIT clause to sample n percent of the entire table?
I have something like this:
select
* from
table_a
order by rand()
limit
(
select
round( (count(distinct ids)) *.2,0)
from table_a
)
The subquery gives me 20% of all records.

I'm not sure if Impala has specific sampling logic (some databases do). But you can use window functions:
select a.*
from (select a.*,
row_number() over (order by rand()) as seqnum,
count(*) over () as cnt
from table_a a
) a
where seqnum <= cnt * 0.2;
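
For anyone who wants to try the window-function approach without a cluster, here is a minimal sketch that runs the same query shape against an in-memory SQLite database from Python. SQLite is only a stand-in for Impala here: it spells the random function random() instead of rand(), and the table contents and the helper name sample_fraction are made up for illustration. It assumes a SQLite build with window-function support (3.25 or later).

```python
import sqlite3


def sample_fraction(conn, fraction):
    """Return about `fraction` of the rows of table_a, picked at random."""
    sql = """
        SELECT id, val
        FROM (SELECT a.*,
                     ROW_NUMBER() OVER (ORDER BY random()) AS seqnum,
                     COUNT(*) OVER () AS cnt
              FROM table_a a) a
        WHERE seqnum <= cnt * ?
    """
    return conn.execute(sql, (fraction,)).fetchall()


# Hypothetical demo data: 100 rows with ids 1..100.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_a (id INTEGER PRIMARY KEY, val TEXT)")
conn.executemany(
    "INSERT INTO table_a VALUES (?, ?)",
    [(i, f"row{i}") for i in range(1, 101)],
)

sample = sample_fraction(conn, 0.2)
print(len(sample))  # 20 of the 100 rows
```

Every row gets a random position and the filter keeps the first 20% of positions, so the sample size is fixed while the chosen rows change from run to run.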

Related

SQL. Limited query

I have a view and I request data from it with a simple query like:
SELECT * FROM my_view WHERE id IN (...)
In general on "normal data" it should return 10-100 entries per id, but for some ids, it may return more than 1,000,000 entries!
I would like to limit my query so that it would not return more than 100 entries per id, but I really have no idea other than running a query for each id separately.
Use row_number():
select v.*
from (select v.*, row_number() over (partition by id order by id) as seqnum
from my_view v
where id in (...)
) v
where seqnum <= 100;
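
A small sketch of the same per-id cap, run against SQLite from Python so it can be tried locally. The table stands in for the view, and the data and the helper name capped_rows are hypothetical; one id has 5 rows and the other has 250.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A plain table stands in for my_view in this sketch.
conn.execute("CREATE TABLE my_view (id INTEGER, payload INTEGER)")
rows = [(1, n) for n in range(5)] + [(2, n) for n in range(250)]
conn.executemany("INSERT INTO my_view VALUES (?, ?)", rows)


def capped_rows(conn, ids, cap):
    """Return at most `cap` rows for each requested id."""
    placeholders = ", ".join("?" for _ in ids)
    sql = f"""
        SELECT id, payload
        FROM (SELECT v.*,
                     ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS seqnum
              FROM my_view v
              WHERE id IN ({placeholders})) v
        WHERE seqnum <= ?
    """
    return conn.execute(sql, (*ids, cap)).fetchall()


result = capped_rows(conn, [1, 2], 100)

# Count how many rows came back for each id.
per_id = {}
for id_, _payload in result:
    per_id[id_] = per_id.get(id_, 0) + 1
print(sorted(per_id.items()))  # [(1, 5), (2, 100)]
```

Because the ORDER BY inside the window is the partition key itself, which 100 rows survive for a large id is arbitrary. Order by a timestamp or similar column if you care which ones are kept.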

How to get the top N percent (e.g., 50%) of a table in BigQuery (standard SQL)?

I have tried the following approaches, none of which worked:
Using SELECT TOP 50 PERCENT: BigQuery does not have a TOP function.
Using LIMIT (SELECT COUNT(*) FROM tabl)/2: BigQuery does not accept a non-integer value here.
Using SET to set the median value and then filtering with WHERE.
In BigQuery I would use window function percent_rank().
select t.* except (prnk)
from (select t.*, percent_rank() over(order by id) prnk from mytable t) t
where prnk <= 0.5
Note: any answer to your question will require that you provide a column to order your data. I assumed that this column is called id.
One method uses window functions:
select t.* except (seqnum, cnt)
from (select t.*, row_number() over (order by ?) as seqnum,
count(*) over () as cnt
from t
) t
where seqnum <= cnt / 2;
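
The percent_rank() and row_number() versions do not always keep the same number of rows. Here is a sketch that compares them on an 11-row table, using SQLite from Python as a stand-in for BigQuery. The table and its contents are made up, and the row_number() query divides by 2.0 because SQLite would otherwise do integer division.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(1, 12)])  # 11 rows

# percent_rank() is (rank - 1) / (rows - 1), so the middle row scores exactly 0.5.
by_percent_rank = conn.execute("""
    SELECT id
    FROM (SELECT t.*, PERCENT_RANK() OVER (ORDER BY id) AS prnk FROM t) t
    WHERE prnk <= 0.5
""").fetchall()

# row_number() against half the row count; 2.0 forces non-integer division in SQLite.
by_row_number = conn.execute("""
    SELECT id
    FROM (SELECT t.*,
                 ROW_NUMBER() OVER (ORDER BY id) AS seqnum,
                 COUNT(*) OVER () AS cnt
          FROM t) t
    WHERE seqnum <= cnt / 2.0
""").fetchall()

print(len(by_percent_rank), len(by_row_number))  # 6 5
```

With an odd row count the percent_rank() filter includes the middle row and the row_number() filter does not, so pick whichever boundary behaviour you actually want.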
Another possibility would be to limit the data with a WHERE clause instead of LIMIT. This is an example if you want to filter by an ID (note that it only returns half of the rows when the IDs are consecutive integers starting at 1):
SELECT * FROM table_name as t
WHERE t.id <= (SELECT COUNT(*) FROM table_name)/2;
And if you want to filter by the row number:
SELECT t.* except (rn)
FROM (
SELECT t.*, ROW_NUMBER() OVER () AS rn
FROM table_name as t
) AS t
WHERE t.rn <= (SELECT COUNT(*) FROM table_name)/2;
To scale up, you can use an approx algorithm to find the 50% point:
DECLARE mid_date TIMESTAMP DEFAULT (
SELECT APPROX_QUANTILES(creation_date, 2)[OFFSET(1)] mid_date
FROM `fh-bigquery.stackoverflow_archive.201909_posts_answers` )
;
SELECT mid_date
, COUNTIF(creation_date > mid_date) first_half
, COUNTIF(creation_date < mid_date) second_half
FROM `fh-bigquery.stackoverflow_archive.201909_posts_answers`
Looks like it works well.
Now let's get these records out:
CREATE TABLE `temp.fifty_percent`
AS
SELECT *
FROM `fh-bigquery.stackoverflow_archive.201909_posts_answers`
WHERE creation_date < (
SELECT APPROX_QUANTILES(creation_date, 2)[OFFSET(1)] mid_date
FROM `fh-bigquery.stackoverflow_archive.201909_posts_answers`
)
This method will happily scale, while solutions using OVER(ORDER BY) won't.
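
The shape of this approach is two steps: find a split value once, then filter with a plain WHERE. Below is a rough sketch of that shape in Python with SQLite. SQLite has no APPROX_QUANTILES, so the split value here is an exact median found with ORDER BY ... LIMIT 1 OFFSET, which is a swapped-in technique and does not have the scaling benefit described above. The posts table and its integer creation_date values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, creation_date INTEGER)")
conn.executemany(
    "INSERT INTO posts VALUES (?, ?)",
    [(i, 1000 + i) for i in range(100)],  # creation_date runs from 1000 to 1099
)

# Step 1: find the split value (exact median here, standing in for APPROX_QUANTILES).
(n,) = conn.execute("SELECT COUNT(*) FROM posts").fetchone()
(mid_date,) = conn.execute(
    "SELECT creation_date FROM posts ORDER BY creation_date LIMIT 1 OFFSET ?",
    (n // 2,),
).fetchone()

# Step 2: a plain filter on the split value, with no window function involved.
first_half = conn.execute(
    "SELECT id FROM posts WHERE creation_date < ?", (mid_date,)
).fetchall()

print(mid_date, len(first_half))  # 1050 50
```

If many rows share the split value, the two halves will not be exactly equal in size, which is the same caveat that applies to the approximate version.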

GBQ window function AND arithmetic operations

Does anyone know if it is possible to do any arithmetic operation on a result derived using GBQ window functions?
For example, can I increase row_number by 100 (some number) using pseudocode like this:
SELECT 100 + ROW_NUMBER() OVER (PARTITION BY X ORDER BY x_id DESC) increased_row_num
FROM Table1
...
You will need to use a subquery for that:
SELECT 100 + row_num AS increased_row_num FROM (
SELECT ROW_NUMBER() OVER (PARTITION BY X ORDER BY x_id DESC) AS row_num
FROM Table1
)
but I was hoping that there is another solution.
With BigQuery Standard SQL, the expected functionality now works as is:
#standardSQL
SELECT 100 + ROW_NUMBER() OVER (PARTITION BY X ORDER BY x_id DESC) increased_row_num
FROM Table1
See Enabling Standard SQL and Migrating from legacy SQL
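
To see arithmetic applied directly to a window function result, here is a tiny sketch run on SQLite from Python. SQLite is only a stand-in for BigQuery Standard SQL, and the rows in Table1 are invented for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Table1 (X TEXT, x_id INTEGER)")
conn.executemany(
    "INSERT INTO Table1 VALUES (?, ?)",
    [("a", 1), ("a", 2), ("a", 3), ("b", 1), ("b", 2)],
)

# The window function is used inside an ordinary expression, with no subquery.
numbered = conn.execute("""
    SELECT X, x_id,
           100 + ROW_NUMBER() OVER (PARTITION BY X ORDER BY x_id DESC) AS increased_row_num
    FROM Table1
    ORDER BY X, x_id
""").fetchall()

for row in numbered:
    print(row)
# ('a', 1, 103)
# ('a', 2, 102)
# ('a', 3, 101)
# ('b', 1, 102)
# ('b', 2, 101)
```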

Order groups by partition size in sql?

I'm trying to select the group_items of the top N largest groups with the same grouping_attribute from a table, and doing something like this:
SELECT grouping_attribute, group_item,
ROW_NUMBER() OVER (PARTITION BY grouping_attribute ORDER BY ???) AS rn
FROM a_table
WHERE rn < N;
But I don't know what to put in the ORDER BY clause to make it happen. I'm trying to order the rows by the size of their corresponding partitions. COUNT(*) doesn't run. I was hoping there was some way to refer to the size of the partition, but I can't find anything.
If I understand correctly, you want count(*) not row_number(). Use count(*) to get the size of the partitions and then order the resulting rows afterwards. For instance:
SELECT a.*
FROM (SELECT grouping_attribute, group_item,
COUNT(*) over (partition by grouping_attribute) as cnt
FROM a_table
) a
ORDER BY cnt DESC;
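
Here is a short sketch of that query on made-up data, run through SQLite from Python. The group names and items are hypothetical, and the two extra ORDER BY columns are only there to make the output order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a_table (grouping_attribute TEXT, group_item TEXT)")
conn.executemany(
    "INSERT INTO a_table VALUES (?, ?)",
    [("x", "x1"), ("y", "y1"), ("y", "y2"), ("y", "y3"), ("z", "z1"), ("z", "z2")],
)

# Attach each row's partition size, then sort the rows by that size.
ordered = conn.execute("""
    SELECT a.*
    FROM (SELECT grouping_attribute, group_item,
                 COUNT(*) OVER (PARTITION BY grouping_attribute) AS cnt
          FROM a_table) a
    ORDER BY cnt DESC, grouping_attribute, group_item
""").fetchall()

for row in ordered:
    print(row)
# ('y', 'y1', 3)
# ('y', 'y2', 3)
# ('y', 'y3', 3)
# ('z', 'z1', 2)
# ('z', 'z2', 2)
# ('x', 'x1', 1)
```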

Get rows from the table using row no in sql server

I want to get rows 100-150 from my table in SQL Server 2008. How can I do that? As far as I can find, the LIMIT keyword is available in MySQL, while for SQL Server the suggestion is to use a common table expression, but I don't want to do that. Is there another way, similar to what is available in MySQL?
select * from
(select row_number() over (order by #column) as row,* from Table) as t
where row between 100 and 150
#column is to be replaced by a column from your table, which will be used to order the result.
use sql limit
http://php.about.com/od/mysqlcommands/g/Limit_sql.htm
In SQL Server 2005 and above there is a ROW_NUMBER() function. If you need something that works for both MySQL and SQL Server, though, I don't know whether this is available in MySQL, as I've never used it.
http://msdn.microsoft.com/en-us/library/ms186734.aspx
The example given in the linked page that seems most relevant is the following, where the results of a query are ordered by date, and then rows 50 to 60 from that result set are returned.
USE AdventureWorks2012;
GO
WITH OrderedOrders AS
(
SELECT SalesOrderID, OrderDate,
ROW_NUMBER() OVER (ORDER BY OrderDate) AS RowNumber
FROM Sales.SalesOrderHeader
)
SELECT SalesOrderID, OrderDate, RowNumber
FROM OrderedOrders
WHERE RowNumber BETWEEN 50 AND 60;
Actually, the least expensive way to do this is to use top and then row_number():
select *
from (select *, row_number() over (order by (select NULL)) as rownum
from (select top 150 t.*
from t
) t
) t
where rownum >= 100
However, I do give you one caution. There is no such thing as rows 100-150 in a relational table, because these are inherently unordered. You need to specify the ordering. For this, you need order by:
select *
from (select *, row_number() over (order by <field>) as rownum
from (select top 150 t.*
from t
order by <field>
) t
) t
where rownum >= 100
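
A runnable sketch of the ordered version, using SQLite from Python. SQLite has no TOP, so the inner query uses LIMIT 150 as a swapped-in equivalent of select top 150, and the table t with ids 1 to 200 is made up for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(i, f"v{i}") for i in range(1, 201)],
)

# Keep the first 150 rows by id, number them, then keep numbers 100 and up.
page = conn.execute("""
    SELECT id, rownum
    FROM (SELECT t.*, ROW_NUMBER() OVER (ORDER BY id) AS rownum
          FROM (SELECT * FROM t ORDER BY id LIMIT 150) t) t
    WHERE rownum >= 100
    ORDER BY rownum
""").fetchall()

print(len(page), page[0], page[-1])  # 51 (100, 100) (150, 150)
```

Rows 100 through 150 inclusive are 51 rows, not 50, which is easy to miss when writing the bounds.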