How to Pass Query Answer into Limit Function Impala

How to Pass Query Answer into Limit Function Impala - sql

I am attempting to sample 20% of a table in impala. I have heard somewhere that the built in impala sampling function has issues.
Is there a way to pass in a subquery to the impala limit function to sample n percent of the entire table.
I have something like this:
select
* from
table_a
order by rand()
limit
(
select
round( (count(distinct ids)) *.2,0)
from table_a)
)
The sub query gives me 20% of all records

I'm not sure if Impala has specific sampling logic (some databases do). But you can use window functions:
select a.*
from (select a.*,
row_number() over (order by rand()) as seqnum,
count(*) over () as cnt
from table_a
) a
where seqnum <= cnt * 0.2;

Related

SQL. Limited query

I have a view and I request data from it with a simple query like:
SELECT * FROM my_view WHERE id IN (...)
In general on "normal data" it should return 10-100 entries per id, but for some ids, it may return more than 1,000,000 entries!
I would like to limit my query so that it would not return more than 100 entries per id, but I really have no idea other than running a query for each id separately.

Use row_number():
select v.*
from (select v.*, row_number() over (partition by id order by id) as seqnum
from my_view
where id in (...)
) v
where seqnum <= 100;

How to get the top N percent (e.g., 50%) of a table in BigQuery (standard SQL)?

I have tried the following approaches which none of them worked:
Using SELECT TOP 50 PERCENT: BigQuery does not have top function
Using LIMIT (SELECT COUNT(*) FROM tabl)/2: the reason is BigQuery does not accept any non integer value.
Using SET to set the median value and then use WHERE

In BigQuery I would use window function percent_rank().
select t.* except (prnk)
from (select t.*, percent_rank() over(order by id) prnk from mytable t) t
where prnk <= 0.5
Note: any answer to your question will require that you provide a column to order your data. I assumed that this column is called id.

One method uses window functions:
select t.* except (seqnum, cnt)
from (select t.*, row_number() over (order by ?) as seqnum,
count(*) over () as cnt
from t
) t
where seqnum <= cnt / 2;

Another possibility would be to limit the data with a WHERE clause instead of LIMIT. This is an example if you want yo filter by an ID:
SELECT * FROM table_name as t
WHERE t.id <= (SELECT COUNT(*) FROM table_name)/2;
And if you want to filter by the row number:
SELECT t.* except (rn)
FROM (
SELECT t.*, ROW_NUMBER() OVER () AS rn
FROM table_name as t
) AS t
WHERE t.rn <= (SELECT COUNT(*) FROM table_name)/2;

To scale up, you can use an approx algorithm to find the 50% point:
DECLARE mid_date TIMESTAMP DEFAULT (
SELECT APPROX_QUANTILES(creation_date, 2)[OFFSET(1)] mid_date
FROM `fh-bigquery.stackoverflow_archive.201909_posts_answers` )
;
SELECT mid_date
, COUNTIF(creation_date > mid_date) first_half
, COUNTIF(creation_date < mid_date) second_half
FROM `fh-bigquery.stackoverflow_archive.201909_posts_answers`
Looks like it works well:
Now let's get these records out:
CREATE TABLE `temp.fifty_percent`
AS
SELECT *
FROM `fh-bigquery.stackoverflow_archive.201909_posts_answers`
WHERE creation_date < (
SELECT APPROX_QUANTILES(creation_date, 2)[OFFSET(1)] mid_date
FROM `fh-bigquery.stackoverflow_archive.201909_posts_answers`
)
This method will happily scale, while solutions using OVER(ORDER BY) won't.

GBQ window function AND arithmetic operations

Does anyone know if it is possible to do any arithmetic operation on a result derived using GBQ window functions?
For example, can I increase row_number by 100 (some number) using pseudocode like this:
SELECT 100 + ROW_NUMBER() OVER (PARTITION BY X ORDER BY x_id DESC) increased_row_num
FROM Table1
...

You will need to use subquery for that
SELECT 100 + row_num AS increased_row_num FROM (
SELECT ROW_NUMBER() OVER (PARTITION BY X ORDER BY x_id DESC) AS row_num
FROM Table1
)

but I'we hoped that there is another solution
With BigQuery Standard SQL expected functionality works now as is
#standardSQL
SELECT 100 + ROW_NUMBER() OVER (PARTITION BY X ORDER BY x_id DESC) increased_row_num
FROM Table1
See Enabling Standard SQL and Migrating from legacy SQL

Order groups by partition size in sql?

I'm trying to select the group_items of the top N largest groups with the same grouping_attribute from a table, and doing something like this:
SELECT grouping_attribute, group_item,
ROW_NUMBER() OVER (PARTITION BY grouping_attribute ORDER BY ???) AS rn
FROM a_table
WHERE rn < N;
But I don't know what to put in the ORDER BY clause to make it happen. I'm trying to order the rows by the size of their corresponding partitions. COUNT(*) doesn't run. I was hoping there was some way to refer to the size of the partition, but I can't find anything.

If I understand correctly, you want count(*) not row_number(). Use count(*) to get the size of the partitions and then order the resulting rows afterwards. For instance:
SELECT a.*
FROM (SELECT grouping_attribute, group_item,
COUNT(*) over (partition by grouping_attribute) as cnt
FROM a_table
) a
ORDER BY cnt DESC;

Get rows from the table using row no in sql server

I want to get rows from 100-150 from my table in sql server 2008, how i can do that? Is there any way to do so? as much i search Limit keyword is available in mysql but for sql server use common table technique but i don't want to do like that is there any other way available as it is available in Mysql?

select * from
(select row_number() over (order by #column) as row,* from Table) as t
where row between 100 and 150
#column to be replaced by a colomn from your table witch well be used to order the result

use sql limit
http://php.about.com/od/mysqlcommands/g/Limit_sql.htm

In SQL 2005 and above there is a ROW_NUMBER() function. If you need something that works for both MySQL and SQL Server though then I don't know if this is available in MySQL as I've never used it.
http://msdn.microsoft.com/en-us/library/ms186734.aspx
The example given in the linked page that seems most relevant is the following, where the results of a query are ordered by date, and then rows 50 to 60 from that result set are returned.
USE AdventureWorks2012;
GO
WITH OrderedOrders AS
(
SELECT SalesOrderID, OrderDate,
ROW_NUMBER() OVER (ORDER BY OrderDate) AS RowNumber
FROM Sales.SalesOrderHeader
)
SELECT SalesOrderID, OrderDate, RowNumber
FROM OrderedOrders
WHERE RowNumber BETWEEN 50 AND 60;

Actuall, the least expensive way to do this is using top, and then row_number()
select *
from (select *, row_number() over (order by (select NULL)) as rownum
from (select top 150 t.*
from t
) t
) t
where rownum >= 100
However, I do give you one caution. There is no such thing as rows 100-150 in a relational table, because these are inherently unordered. You need to specify the ordering. For this, you need order by:
select *
from (select *, row_number() over (order by <field>) as rownum
from (select top 150 t.*
from t
order by <field>
) t
) t
where rownum >= 100

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

How to Pass Query Answer into Limit Function Impala - sql

I'm not sure if Impala has specific sampling logic (some databases do). But you can use window functions: select a.* from (select a., row_number() over (order by rand()) as seqnum, count() over () as cnt from table_a ) a where seqnum <= cnt * 0.2;

Related

SQL. Limited query

How to get the top N percent (e.g., 50%) of a table in BigQuery (standard SQL)?

GBQ window function AND arithmetic operations

Order groups by partition size in sql?

Get rows from the table using row no in sql server

Categories

Resources

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

How to Pass Query Answer into Limit Function Impala - sql

I'm not sure if Impala has specific sampling logic (some databases do). But you can use window functions: select a.* from (select a.*, row_number() over (order by rand()) as seqnum, count(*) over () as cnt from table_a ) a where seqnum <= cnt * 0.2;

Related

SQL. Limited query

How to get the top N percent (e.g., 50%) of a table in BigQuery (standard SQL)?

GBQ window function AND arithmetic operations

Order groups by partition size in sql?

Get rows from the table using row no in sql server

Categories

Resources

I'm not sure if Impala has specific sampling logic (some databases do). But you can use window functions: select a.* from (select a., row_number() over (order by rand()) as seqnum, count() over () as cnt from table_a ) a where seqnum <= cnt * 0.2;