Need explanation on BUCKET and rand() function in Hive

Can anyone please explain what the following queries mean?
1. SELECT * from numbers TABLESAMPLE(BUCKET 3 OUT OF 10 ON rand()) s;
2. SELECT * from numbers TABLESAMPLE(BUCKET 3 OUT OF 10 ON number) s;
3. SELECT * from numbers TABLESAMPLE(BUCKET 1 OUT OF 2 ON number) s;
4. SELECT * from numbers TABLESAMPLE(BUCKET 2 OUT OF 2 ON number) s;
I tried to understand the above queries every way I could, but couldn't make any progress. Please explain them in detail.
Thanks in advance.

@John Deer, when we bucket on a column, the data is divided into the specified number of buckets and a file is created in Hadoop for each bucket. When retrieving data from a specified bucket, the rows are pulled from that bucket's file, so the result stays the same across runs.
Whereas if we use the rand() function (which produces random numbers), the result changes with every execution of rand().
SELECT * from numbers TABLESAMPLE(BUCKET 3 OUT OF 10 ON rand()) s;
Explanation: Here the data is hashed into 10 buckets, but we are using rand() as the bucketing expression. So instead of always reading the 3rd bucket, the query pulls a random sample of roughly one tenth of the rows, and the result changes with every execution because of rand().
SELECT * from numbers TABLESAMPLE(BUCKET 1 OUT OF 2 ON number) s;
Explanation: Here the data is hashed into 2 buckets and we are bucketing on the table column itself. So the data is pulled from the 1st bucket, and it will not change even if you run the query any number of times.
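To see both behaviours end to end, here is a minimal sketch (the numbers table with a single number column is assumed from your question):
-- Bucketed table: rows are distributed into 10 files by hash(number) % 10
CREATE TABLE numbers (number INT)
CLUSTERED BY (number) INTO 10 BUCKETS;
-- Deterministic: always returns the rows whose hash lands in the 3rd bucket
SELECT * FROM numbers TABLESAMPLE(BUCKET 3 OUT OF 10 ON number) s;
-- Non-deterministic: rand() is evaluated per row at query time,
-- so a different ~1/10 of the rows comes back on every run
SELECT * FROM numbers TABLESAMPLE(BUCKET 3 OUT OF 10 ON rand()) s;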
Hope this helps!!

Related

How to extract all (including int and float) numerical values in a string column in Google BigQuery?

I have a table Table_1 on Google BigQuery which includes a string column str_column. I would like to write a SQL query (compatible with Google BigQuery) to extract all numerical values in str_column and append them as new numerical columns to Table_1. For example, if str_column includes "first measurement is 22 and the other is 2.5", I need to extract 22 and 2.5 and save them under new columns numerical_val_1 and numerical_val_2. The number of new numerical columns should ideally be equal to the maximum number of numerical values in str_column, but if that'd be too complex, extracting the first 2 numerical values in str_column (and therefore 2 new columns) would be fine too. Any ideas?
Consider the approach below:
select * from (
  -- explode every numeric token in str_column into its own row,
  -- numbered by its position in the string (1-based)
  select str_column, offset + 1 as offset, num
  from your_table, unnest(regexp_extract_all(str_column, r'\b([\d.]+)\b')) num with offset
)
-- turn the first three tokens into columns numerical_val_1, _2, _3
pivot (min(num) as numerical_val for offset in (1, 2, 3))
If applied to sample data like that in your question, the output has one row per original row, with columns numerical_val_1, numerical_val_2, and numerical_val_3 holding the extracted numbers.
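If the new columns need to be numeric rather than strings, a small variation of the same approach (your_table is a placeholder, as above) casts each token before pivoting; safe_cast would skip any token that is not a well-formed number:
select * from (
  select str_column, offset + 1 as offset, cast(num as float64) as num
  from your_table, unnest(regexp_extract_all(str_column, r'\b([\d.]+)\b')) num with offset
)
pivot (min(num) as numerical_val for offset in (1, 2, 3))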

Extract the highest key:value pair from a string in Standard SQL

I have the data below; it is a kind of key-value pair, such as 116=0.2875. BigQuery has stored this as a string. What I am required to do is to extract the key, i.e. 116, from each row.
To make things more complicated, if a row has more than one key-value pair, the pair to extract is the one with the highest number on the right, e.g. for {1=0.1,2=0.8} the extracted key would be 2.
I am struggling to use SQL to perform this, particularly as some rows have one value and some have multiple:
This is as close as I have managed to get: I can extract the highest right-hand value (which I don't need), but I just can't seem to get either the whole key-value pair (which would be fine and work for me) or just the key (which would be great).
SELECT column
  ,(SELECT MAX(CAST(Values AS NUMERIC))
    FROM UNNEST(JSON_EXTRACT_ARRAY(REPLACE(REPLACE(REPLACE(column,"{","["),"}","]"),"=",","))) AS Values
    WHERE Values LIKE "%.%") AS Highest
FROM `table`
Here is some sample data:
1 {99=0.25}
2 {99=0.25}
3 {99=0.25}
4 {116=0.2875, 119=0.6, 87=0.5142857142857143}
5 {105=0.308724832214765}
6 {105=0.308724832214765}
7 {139=0.5712754555198284}
8 {127=0.5767967894928858}
9 {134=0.2530120481927711, 129=0.29696599825632086, 73=0.2662459427947186}
10 {80=0.21242613001118038}
Any help on this conundrum would be greatly appreciated!
Consider the approach below:
select column,
  ( -- pull out every key=value token, sort by the value, keep the key of the largest
    select cast(split(kv, '=')[offset(0)] as int64)
    from unnest(regexp_extract_all(column, r'(\d+=\d+\.\d+)')) kv
    order by cast(split(kv, '=')[offset(1)] as float64) desc
    limit 1
  ) key
from your_table
If applied to the sample data in your question, the output is the key with the highest value in each row, e.g. 119 for row 4 and 129 for row 9.
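Since you mention that getting the whole key=value pair would also work, the same subquery can return the raw token instead of just the key (your_table is again a placeholder):
select column,
  ( select kv
    from unnest(regexp_extract_all(column, r'(\d+=\d+\.\d+)')) kv
    order by cast(split(kv, '=')[offset(1)] as float64) desc
    limit 1
  ) as highest_pair
from your_table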

Athena: How to check number of duplicate elements in arrays of different rows

My table is on AWS Athena. I am not familiar with SQL, Hive, or Athena in general. I have the following table:
col_id , col_list
ABC , [abcde, 123gd, 12345, ...]
B3C , [bbbbb, ergdg, 12345, ...]
YUT , [uyteh, bbbbb, 12345, ...]
col_id is unique and the elements in the array of one single row are also unique.
I need to run a query that counts the total number of elements that repeat across the arrays of different rows. In the example above, the array element 12345 shows up in the 1st, 2nd, and 3rd rows, and bbbbb shows up in the 2nd and 3rd rows, so the number of repeated elements is 2.
The number of rows is not big so I guess the performance is not a concern here.
Could anyone please let me know how to write this query in Athena? Thank you!
You can explode the array and aggregate:
select col, count(*)
from t
lateral view explode(t.col_list) x as col
group by col
order by count(*) desc;
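Note that lateral view explode is HiveQL syntax; Athena's engine is Presto, which uses CROSS JOIN UNNEST instead. A minimal sketch in Athena SQL (t and col_list as in your question) that also finishes the job by counting the elements seen in more than one row:
select count(*) as repeated_elements
from (
  -- one row per (row, element), then keep elements appearing in 2+ rows
  select element
  from t
  cross join unnest(col_list) as u(element)
  group by element
  having count(*) > 1
) dup
For your sample data this returns 2 (12345 and bbbbb).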

Sqoop Import Split by Column Data type

Should the datatype of the --split-by column in a Sqoop import always be a numeric datatype (integer, bigint, numeric)? Can't it be a string?
Yes, you can split on a non-numeric datatype.
But this is not recommended.
WHY?
To split the data, Sqoop fires
SELECT MIN(col1), MAX(col1) FROM TABLE
and then divides that range by your number of mappers.
Now take an example of an integer --split-by column.
The table has an id column with values 1 to 100 and you are using 4 mappers (-m 4 in your sqoop command).
Sqoop gets the MIN and MAX values using:
SELECT MIN(id), MAX(id) FROM TABLE
OUTPUT:
1,100
Splitting on an integer is easy. You will get 4 parts:
1-25
26-50
51-75
76-100
Now take a string as the --split-by column.
The table has a name column with values from "dev" to "sam" and you are using 4 mappers (-m 4 in your sqoop command).
Sqoop gets the MIN and MAX values using:
SELECT MIN(name), MAX(name) FROM TABLE
OUTPUT:
dev,sam
Now how will it be divided into 4 parts? As per the Sqoop docs,
/**
* This method needs to determine the splits between two user-provided
* strings. In the case where the user's strings are 'A' and 'Z', this is
* not hard; we could create two splits from ['A', 'M') and ['M', 'Z'], 26
* splits for strings beginning with each letter, etc.
*
* If a user has provided us with the strings "Ham" and "Haze", however, we
* need to create splits that differ in the third letter.
*
* The algorithm used is as follows:
* Since there are 2**16 unicode characters, we interpret characters as
* digits in base 65536. Given a string 's' containing characters s_0, s_1
* .. s_n, we interpret the string as the number: 0.s_0 s_1 s_2.. s_n in
* base 65536. Having mapped the low and high strings into floating-point
* values, we then use the BigDecimalSplitter to establish the even split
* points, then map the resulting floating point values back into strings.
*/
And you will see the warning in the code:
LOG.warn("Generating splits for a textual index column.");
LOG.warn("If your database sorts in a case-insensitive order, "
+ "this may result in a partial import or duplicate records.");
LOG.warn("You are strongly encouraged to choose an integral split column.");
In the integer example, all the mappers get a balanced load (each fetches 25 records from the RDBMS).
In the string case, it is much harder to divide the range into even parts, so it is difficult to give similar loads to all the mappers.
In a nutshell, go for an integer column as the --split-by column.
Yes, we can, but it is not recommended due to performance issues, since Sqoop runs the boundary query "select min(split-by column), max(split-by column) from table where condition" to calculate the split size for the mappers:
split-size = (max - min) / number of mappers
Let's say there is a table called employee:
id name age
1 baba 20
2 kishor 30
3 jay 40
..........
100000 pk 60
Scenario 1:
Performing split-by on the id column
In this case Sqoop will fire the boundary query select min(id), max(id) from employee to compute the split size.
min = 1
max = 100000
default number of mappers = 4
split-size = (100000 - 1) / 4 ≈ 25000
So each mapper will process about 25000 records:
mapper 1: 1 - 25000
mapper 2: 25001-50000
mapper 3: 50001-75000
mapper 4: 75001-100000
So it's very easy for Sqoop to split the records if we have an integral column.
Scenario 2:
Performing split-by on the name column
In this case SQOOP will fire "select min(name),max(name) from employee" to compute split size.
min = baba, max= pk
SQOOP wont able to compute split size easily because min and max have text values((min-max)/no of mappers) so it will run TextSplitter class to perform split, which will create extra overhead and may impact the performance.
Note : We need to pass extra argument -D org.apache.sqoop.splitter.allow_text_splitter= true to use TextSplitter class.
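For reference, a minimal sketch of what such an import could look like (the JDBC URL is a placeholder; the -D generic argument must come before the tool-specific options):
sqoop import \
  -Dorg.apache.sqoop.splitter.allow_text_splitter=true \
  --connect jdbc:mysql://dbhost/corp \
  --table employee \
  --split-by name \
  -m 4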
No, it must be numeric, because according to the specs: "By default sqoop will use query select min(<split-by>), max(<split-by>) from <table name> to find out boundaries for creating splits." The alternative is to use --boundary-query, which also requires numeric columns. Otherwise the Sqoop job will fail. If you don't have such a column in your table, the only workaround is to use only 1 mapper: "-m 1".

Oracle/SQL - Split list into 3 segments

I'm wondering how I would go about writing a query to split a table into 3 segments. When I've had to split a table into 2 before, I've always based it on rownum and doing a MOD on it. I know I could again use rownum and select based on ranges, but if the list varies in record count each time the queries are run, the ranges will have to be updated.
Any thoughts?
Why can't you continue to use MOD, as in MOD(rownum, 3) = 0, 1 or 2? If it worked for 2, why not 3?
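A minimal sketch (your_table is a placeholder): tag each row with a segment number, then filter on it.
select *
from (
  -- rownum is assigned in the inner query, so every row keeps a stable tag
  select t.*, mod(rownum, 3) as segment
  from your_table t
)
where segment = 0  -- use 0, 1, or 2 for the three segments
Each segment gets roughly a third of the rows no matter how the record count changes, so the ranges never need updating.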