How to implement "group-by" sampling in Hive?

Given a Hive table:
create table mock
(user string,
url string
);
How can I sample a certain percentage of urls (say 50%), or a certain number of urls, for each user?

There is a built-in clause to extract samples from a table, although it samples the table as a whole rather than per user:
SELECT * FROM mock TABLESAMPLE(50 PERCENT)
Here is an alternative solution using row_number(). First, number the rows for each user:
with numbered as (
SELECT user, url, row_number() OVER (PARTITION BY user ORDER BY user) as rn FROM mock
)
Then select either the odd or the even rows using pmod to get a 50% sample:
SELECT user, url FROM numbered where pmod(rn,2) = 0
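Putting the two steps together, here is a minimal sketch of one complete query against the mock table above; ordering by a per-row rand() value instead of by user is an assumption on my part, to make the 50% pick random rather than arbitrary.
-- Sketch: random ~50% of urls per user (mock table from the question)
with randomized as (
SELECT user, url, rand() as r FROM mock
),
numbered as (
SELECT user, url, row_number() OVER (PARTITION BY user ORDER BY r) as rn FROM randomized
)
SELECT user, url FROM numbered where pmod(rn,2) = 0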

Related

SQL: Joining two tables based on a certain description

I have two tables:
I want to add GTIN from table 2 to table 1 based on brand name. However, I can't use = or LIKE because, as you can see in the highlighted row, they are not fully matched.
For example
The second row in table 1 is supposed to get the first GTIN from table 2, because both are Ziagen 300mg tablet. However, everything I tried failed to match all rows correctly.
Postgres has a pg_trgm module described here. Start with a cross join of both tables and calculate similarity(t1.brand, t2.brand), which returns a real number.
Next, filter the results based on some heuristic threshold. Then narrow it down by choosing the single best match with the row_number() window function.
The results might not be accurate; you could improve them by taking the similarity of the generic names into account as well.
with cross_similarity(generic1,brand1,gtin,brand2,generic2,sim) as (
select *, similarity(t1.brand, t2.brand) as sim
from t1,
t2
where similarity(t1.brand, t2.brand) > 0
)
, max_similarity as (
select *,
row_number() over (partition by gtin order by sim desc) as best_match_rank
from cross_similarity
)
select * from max_similarity where best_match_rank =1;
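Note that similarity() comes from the pg_trgm extension, which has to be installed once per database before the query above will run. A quick sanity check, comparing two hypothetical spellings of the brand from the question:
-- pg_trgm must be enabled before similarity() is available
CREATE EXTENSION IF NOT EXISTS pg_trgm;
-- returns a real number between 0 and 1; higher means more similar
SELECT similarity('Ziagen 300mg tablet', 'ZIAGEN 300 MG TABLETS');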

Access 10th through 70th element in STRUCT

I have 3 fields: username, tracking_id, timestamp. One user will have multiple rows (some more, some fewer) with different tracking ids and timestamps for each action he has taken on my website. I want to group by the username and get the tracking ids of that user's 10th through 70th action. I use standard SQL on BigQuery.
The first problem is that I can't find syntax to access a range in the STRUCT (only a single element, or a LIMIT to get the first/last 70 rows, for example). Then, I can imagine that after managing to access a range, there could be an issue with the index being out of bounds, because some users might not have 70 or more actions.
SELECT
username,
ARRAY_AGG(STRUCT(tracking_id,
timestamp)
ORDER BY
timestamp
)[OFFSET (9 to 69)] #??????
FROM
table
The result should be a table with the same 3 fields: username, tracking_id, timestamp, but instead of containing ALL the user's rows, it should only contain each user's 10th to 70th rows.
Below is for BigQuery Standard SQL
#standardSQL
SELECT username,
ARRAY_AGG(STRUCT(tracking_id, `timestamp`) ORDER BY `timestamp`) AS selected_actions
FROM (
SELECT * EXCEPT(pos) FROM (
SELECT *, ROW_NUMBER() OVER(PARTITION BY username ORDER BY `timestamp`) pos
FROM `project.dataset.table`
)
WHERE pos BETWEEN 10 AND 70
)
GROUP BY username
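Since the desired result is a flat table of username, tracking_id and timestamp rather than an array per user, a variant of the same idea (using the same project.dataset.table placeholder) simply skips the ARRAY_AGG and returns the filtered rows directly:
#standardSQL
SELECT username, tracking_id, `timestamp`
FROM (
SELECT *, ROW_NUMBER() OVER(PARTITION BY username ORDER BY `timestamp`) pos
FROM `project.dataset.table`
)
WHERE pos BETWEEN 10 AND 70
Users with fewer than 10 actions simply contribute no rows, so there is no out-of-bounds problem; users with fewer than 70 actions contribute whatever rows they have in that range.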

Extracting key from the value of json object in postgres

Let us say I have a json object {"key1":0.5,"key2":0.3,"key3":0.1} in a particular column in a table called test. I want to return the key of the highest value. To get the highest value in postgres, I can write this query:
select greatest(column1->'key1',column1->'key2',column1->'key3') from test
Now, it returns the greatest value. But the one I want is the key associated with the highest value. Is this possible in postgres json querying?
You need to extract all key/value pairs as rows. Once you have done that, it's a greatest-n-per-group problem, though without the "group" part, as you are looking at all rows.
select k,val
from (
select t.*, row_number() over (order by t.val::numeric desc) as rn
from jsonb_each_text('{"key1":0.5,"key2":0.3,"key3":0.1}'::jsonb) as t(k,val)
) t
where rn = 1;
Online example: http://rextester.com/OLBM23414
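Applied to every row of the test table (assuming the column is named column1, as in the question's query, and is of type jsonb), the same idea fits into a LATERAL subquery that picks the top key per row:
-- Sketch: key of the highest value for each row of test
-- (use json_each_text instead if column1 is json rather than jsonb)
select t.*, best.k as max_key
from test t
cross join lateral (
select e.k
from jsonb_each_text(t.column1) as e(k, val)
order by e.val::numeric desc
limit 1
) best;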

How do I get the average date interval of a column in SQL?

I have a table of user interactions on a web site and I need to calculate the average time between interactions of each user. To make it simpler to understand, here are some records from the table:
Where the first column is the user id and the second is the interaction time. The result I need is the average time between interactions for each user. Example:
The user 12345 average interaction interval is 1 day
I've already tried to use window functions, but I couldn't get the average because PostgreSQL doesn't let me combine GROUP BY or AVG with window functions. I could get the intervals using the following command, but couldn't group them by user id.
SELECT INTERACTION_DATE - LAG(INTERACTION_DATE ) OVER (ORDER BY INTERACTION_DATE )
So, I decided to create my own custom function, and after that a custom aggregate function, to use in a GROUP BY clause:
CREATE OR REPLACE FUNCTION DATE_INTERVAL(TIMESTAMP)
RETURNS TABLE (USER_INTERVALS INTERVAL)
AS $$
SELECT $1 - LAG($1) OVER (ORDER BY $1)
$$
LANGUAGE SQL
IMMUTABLE;
But this function only returns several rows with a single column of null values.
Is there a better way to do this?
You need to first calculate the difference between the interactions for each row (and user), then you can calculate the average on that:
select user_id, avg(interaction_time)
from (
select user_id,
interaction_date - lag(interaction_date) over (partition by user_id order by interaction_date) as interaction_time
from the_table
) t
group by user_id;
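If a plain number is easier to consume than an interval, a variant of the same query (same the_table placeholder, and assuming interaction_date is a timestamp as in the function definition above) converts the average gap to days:
-- variant: average gap per user expressed in days instead of as an interval
select user_id,
       extract(epoch from avg(interaction_time)) / 86400 as avg_gap_days
from (
  select user_id,
         interaction_date - lag(interaction_date) over (partition by user_id order by interaction_date) as interaction_time
  from the_table
) t
group by user_id;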
Encapsulate your first query in a derived table, then compute the average over it:
SELECT AVG(InteractionTime) FROM (
SELECT INTERACTION_DATE - LAG(INTERACTION_DATE) OVER (ORDER BY INTERACTION_DATE) AS InteractionTime
FROM the_table -- your interactions table
) t

Random sample table with Hive, but including matching rows

I have a large table containing a userID column and other user variable columns, and I would like to use Hive to extract a random sample of users based on their userID. Furthermore, these users will sometimes appear on multiple rows, and if a randomly selected userID is contained in other parts of the table, I would like to extract those rows too.
I had a look at the Hive sampling documentation and I see that something like this can be done to extract a 1% sample:
SELECT * FROM source
TABLESAMPLE (1 PERCENT) s;
but I am not sure how to add the constraint that all other rows for those sampled userIDs are selected too.
You can use rand() to split the data randomly at the userID level, so you get the desired percentage of userIDs in each category. I recommend rand() with a seed, because setting the seed makes the results repeatable.
select c.*
from
    (select userID
          , if(rand(5555) < 0.1, 'test', 'train') as type -- adjust 0.1 to the fraction you want (0.01 for a 1% sample)
     from
         (select userID
          from mytable
          group by userID
         ) a
    ) b
right outer join
    (select *
     from mytable
    ) c
on b.userid = c.userid
where b.type = 'test'
;
This is set up for entity level modeling purposes, which is why I have test and train as types.
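If you only need the sampled users and their rows (without a train/test split), a more compact sketch under the same mytable/userID assumptions is to sample the distinct userIDs with a seeded rand() and join back to pull in every row of the sampled users:
-- Sketch: keep ALL rows belonging to a ~1% random sample of users
select t.*
from mytable t
join
    (select userID
     from (select distinct userID from mytable) u
     where rand(5555) < 0.01 -- ~1% of users; the seed makes the sample repeatable
    ) s
on t.userID = s.userID;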