select rows where column > percentile(column, 0.5) in hive - hive

I want to select the rows where the value of a certain column, say A, is larger than its p50 over the whole data set. So I wrote the following SQL in Hive:
set hive.mapred.mode=nonstrict;
with temp_table as (
select percentile(A, 0.5) as p50
from my_table
)
select
my_table.A
from my_table, temp_table
where my_table.A > temp_table.p50
But the process hangs. Is the SQL correct? Or is there a better way to do this task?

Related

What is the most efficient way to randomly sample with replacement in BigQuery?

The answers to this question explain how to randomly sample from a BigQuery table. Is there an efficient way to do this with replacement?
As an example, suppose I have a table with 1M rows and I wish to select 100K rows, each sampled independently at random.
Found a neat solution:
Index the rows of the table
Generate a dummy table with 100K random integers between 1 and 1M
Inner join the tables on index = random value
Code:
# randomly sample 100K rows from `table` with replacement
with large_table as (select *, row_number() over() as rk from `table`),
num_elements as (select count(1) as n from large_table),
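# floor(rand() * n) is uniform on 0..n-1; a bare cast would round and under-sample the first and last rows
dummy_table as (select 1 + cast(floor(rand() * (select n from num_elements)) as int64) as i from unnest(generate_array(1, 100000)))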
select * from dummy_table
inner join large_table on dummy_table.i = large_table.rk
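The three steps above can be tried locally with a minimal sketch. It assumes Python's bundled SQLite (3.25 or newer, for window functions) as a stand-in for BigQuery, and the table name `big`, its 1,000 rows, and the sample size of 200 are made up for illustration. The random indices are drawn in Python rather than in SQL.

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE big (val TEXT)")  # hypothetical source table
conn.executemany("INSERT INTO big VALUES (?)", [(f"row{i}",) for i in range(1000)])

n = conn.execute("SELECT COUNT(*) FROM big").fetchone()[0]
sample_size = 200  # illustrative; the question uses 100K out of 1M

# Dummy table of random integers, uniform on 1..n (randint includes both ends).
conn.execute("CREATE TABLE dummy (i INTEGER)")
conn.executemany(
    "INSERT INTO dummy VALUES (?)",
    [(random.randint(1, n),) for _ in range(sample_size)],
)

# Index the rows, then inner join on index = random value.
sample = conn.execute(
    """
    WITH indexed AS (SELECT val, ROW_NUMBER() OVER () AS rk FROM big)
    SELECT indexed.val
    FROM dummy
    INNER JOIN indexed ON dummy.i = indexed.rk
    """
).fetchall()

print(len(sample), "rows sampled;", len(set(sample)), "distinct")
```

Because every random index matches exactly one row, the join returns exactly `sample_size` rows, and the same row can appear more than once.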

Select max value of each group using partition by

I have the following code, which is taking a looong time to execute. What I need to do is select the rows whose row number equals 1 after partitioning by three columns (col_1, col_2, col_3) [which are also the key columns] and ordering by some columns as shown below. The table has around 90 million records. Am I following the best approach, or is there a better one?
with cte as (SELECT
b.*
,ROW_NUMBER() OVER ( PARTITION BY col_1,col_2,col_3
ORDER BY new_col DESC, new_col_2 DESC, new_col_3 DESC ) AS ROW_NUMBER
FROM (
SELECT
*
,CASE
WHEN update_col = ' ' THEN new_update_col
ELSE update_col
END AS new_col_1
FROM schema_name.table_name
) b
)
select top 10 * from cte WHERE ROW_NUMBER=1
Currently you are applying CASE to every row in the table, and CASE (a string comparison) is a costly operation.
In the end, you keep only the records with ROW_NUMBER = 1. If, as a guess, that filter keeps half of your records, you can improve the execution time by filtering first (generate the row number and keep the rows with RN = 1) and only then applying CASE to the remaining rows.
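A minimal sketch of that reordering, run against Python's bundled SQLite rather than the asker's database. It assumes the ordering column does not depend on the CASE result; the table `t`, the column `sort_col`, and the sample rows are hypothetical. Whether this actually reduces work depends on the engine's optimizer, so treat it as an illustration of the query shape only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (col_1, col_2, col_3, sort_col, update_col, new_update_col)"
)
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("a", 1, 1, 10, " ", "fallback-1"),
        ("a", 1, 1, 20, "kept", "unused"),
        ("b", 2, 2, 5, " ", "fallback-2"),
    ],
)

result = conn.execute(
    """
    WITH ranked AS (
        -- Rank first, without computing the CASE expression.
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY col_1, col_2, col_3
                                  ORDER BY sort_col DESC) AS rn
        FROM t
    )
    -- Apply CASE only to the rows that survive the rn = 1 filter.
    SELECT col_1, col_2, col_3,
           CASE WHEN update_col = ' ' THEN new_update_col
                ELSE update_col
           END AS new_col_1
    FROM ranked
    WHERE rn = 1
    ORDER BY col_1
    """
).fetchall()

print(result)
```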

Selecting a Random Sample from a View in Postgresql

I have generated a view from a table in PostgreSQL consisting of 50,000 rows. I want to take a random sample from this view based on a number of conditions. I understand this can be done in the following way:
select * from viewname
where columnname = 'A' -- the condition
order by columnname
limit 5;
However, instead of 'limit 5', I want to take a percentage of the number of rows which meet this condition. So for instance, 'limit 5%' (though this is not correct syntax). I understand a similar thing can be done with the tablesample clause but this does not apply to views.
You could use the window function PERCENT_RANK
SELECT *
FROM
(
select *, PERCENT_RANK() OVER (PARTITION BY columnname ORDER BY random()) AS pcrnk
from tablename
where columnname = 'A'
) q
WHERE pcrnk <= 0.05
And if you don't want to see that pcrnk in the result?
SELECT (t).*
FROM
(
select t, PERCENT_RANK() OVER (PARTITION BY columnname ORDER BY random()) AS pcrnk
from tablename t
where columnname = 'A'
) q
WHERE pcrnk <= 0.05
These queries retrieve 5% of the rows that would normally be returned for the criterion columnname = 'A'.
For example, if there are 100 'A' rows and 1000 'B' rows, they return 5 records.
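The 100 'A' / 1000 'B' example can be checked with a small sketch. It assumes Python's bundled SQLite (3.25 or newer) as a stand-in for PostgreSQL; the `id` column is added only so the sampled rows can be told apart.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tablename (id INTEGER PRIMARY KEY, columnname TEXT)")
conn.executemany(
    "INSERT INTO tablename (columnname) VALUES (?)",
    [("A",)] * 100 + [("B",)] * 1000,
)

sample = conn.execute(
    """
    SELECT id, columnname
    FROM (
        SELECT *,
               PERCENT_RANK() OVER (PARTITION BY columnname
                                    ORDER BY random()) AS pcrnk
        FROM tablename
        WHERE columnname = 'A'
    ) q
    WHERE pcrnk <= 0.05
    """
).fetchall()

print(len(sample), "rows sampled from the 'A' group")
```

PERCENT_RANK is (rank - 1) / (rows - 1), so with 100 matching rows the ranks 1 through 5 fall at or below 0.05 and rank 6 (5/99) does not.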
And if you want to return 5% of all the records in the table, here's another trick.
select *
from tablename
where columnname = 'A'
order by random()
limit 0.05 * (select count(*) from tablename)
In order to randomly select a percentage of your rows, and if you have Postgres 9.5 or higher, have a look at Postgres TABLESAMPLE.
It has two options : BERNOULLI and SYSTEM :
The BERNOULLI and SYSTEM sampling methods each accept a single argument which is the fraction of the table to sample, expressed as a percentage between 0 and 100. [...] These two methods each return a randomly-chosen sample of the table that will contain approximately the specified percentage of the table's rows.
SYSTEM is faster, but BERNOULLI gives a better random distribution because each record has the same probability of being selected.
SELECT *
FROM tablename TABLESAMPLE SYSTEM(5)
WHERE columnname = 'A' -- the condition
ORDER BY columnname;
NB : this only works if you are querying a table, and not for views.

How to retrieve specific rows from SQL Server table?

I was wondering whether there is a way to retrieve, for example, the 2nd and 5th rows from a SQL table that contains 100 rows.
I saw some solutions with a WHERE clause, but they all assume that the column the WHERE clause is applied to is sequential, starting at 1.
Is there another way to query a SQL Server table for specific rows when the table doesn't have a column whose values start at 1?
P.S. - I know of a solution with temporary tables, where you copy your select statement output and add a sequential column to the table. I am using T-SQL.
Try this,
SELECT * FROM (
SELECT *,
ROW_NUMBER() OVER (ORDER BY ColumnName ASC) AS rownumber
FROM TableName
) as temptablename
WHERE rownumber IN (2,5)
With SQL Server:
; WITH Base AS (
SELECT *, ROW_NUMBER() OVER (ORDER BY id) RN FROM YourTable
)
SELECT *
FROM Base WHERE RN IN (2, 5)
Replace id with your primary key or ordering column, and YourTable with your table.
It's a CTE (Common Table Expression), so it isn't a temporary table. It's something that will be expanded together with your query.
There is no 2nd or 5th row in the table.
There is only the 2nd or 5th result in a resultset that you return, as determined by the order you specify in that query.
If you are on SQL Server 2005 or above, you could use Row_Number() function. Ex:
;With CTE as (
select col1, ..., row_number() over (order by yourOrderingCol) rn
from yourTable
)
select col1,...
from cte
where rn in (2,5)
Please note that yourOrderingCol will decide the value of row number (i.e. rn).
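To illustrate that the "2nd" and "5th" rows exist only relative to the ordering in the query, here is a small sketch. It uses Python's bundled SQLite (3.25 or newer) instead of SQL Server, and the ids and names in `YourTable` are made up. The ids are deliberately non-sequential and inserted in reverse order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE YourTable (id INTEGER, name TEXT)")

# ids 17, 27, ..., 87: no column starts at 1, and insertion order is reversed.
rows = [(10 * k + 7, f"item{k}") for k in range(1, 9)]
conn.executemany("INSERT INTO YourTable VALUES (?, ?)", list(reversed(rows)))

picked = conn.execute(
    """
    WITH Base AS (
        SELECT id, name, ROW_NUMBER() OVER (ORDER BY id) AS rn
        FROM YourTable
    )
    SELECT id, name, rn
    FROM Base
    WHERE rn IN (2, 5)
    ORDER BY rn
    """
).fetchall()

print(picked)
```

Changing the ORDER BY inside the window changes which rows count as the 2nd and 5th.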

SQL select segment

I'm using SQL Server 2008.
I have a table with x amount of rows. I would like to always divide x by 5 and select the 3rd group of records.
Let's say there are 100 records in the table:
100 / 5 = 20
the 3rd segment will be record 41 to 60.
How will I be able in SQL to calculate and select this 3rd segment only?
Thanks.
You can use NTILE.
Distributes the rows in an ordered partition into a specified number of groups.
Example:
SELECT col1, col2, ..., coln
FROM
(
SELECT
col1, col2, ..., coln,
NTILE(5) OVER (ORDER BY id) AS groupno
FROM yourtable
) AS t
WHERE groupno = 3
That's a perfect use for the NTILE ranking function.
Basically, you define your query inside a CTE and add an NTILE to your rows - a number going from 1 to n (the argument to NTILE). You order your rows by some column, and then you get the n groups of rows you're looking for, and you can operate on any one of those "groups" of data.
So try something like this:
;WITH SegmentedData AS
(
SELECT
(list of your columns),
GroupNo = NTILE(5) OVER (ORDER BY SomeColumnOfYours)
FROM dbo.YourTable
)
SELECT *
FROM SegmentedData
WHERE GroupNo = 3
Of course, you can also use an UPDATE statement after the CTE to update those rows.
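The 100-record example from the question can be reproduced with a short sketch. It assumes Python's bundled SQLite (3.25 or newer) as a stand-in for SQL Server 2008, and the table `yourtable` with ids 1 to 100 is made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourtable (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO yourtable VALUES (?)", [(i,) for i in range(1, 101)])

third_segment = conn.execute(
    """
    WITH SegmentedData AS (
        SELECT id, NTILE(5) OVER (ORDER BY id) AS GroupNo
        FROM yourtable
    )
    SELECT id
    FROM SegmentedData
    WHERE GroupNo = 3
    ORDER BY id
    """
).fetchall()

ids = [row[0] for row in third_segment]
print(ids[0], "to", ids[-1], "-", len(ids), "rows")
```

When the row count is not divisible by 5, NTILE makes the earlier groups one row larger than the later ones, so the segment boundaries shift accordingly.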