Hive Distinct to sample certain number of column values

My aim is to return the data so that the 'name' column has at most 3 distinct values.
My table looks like this:
id, name, year
1, John, 2012
2, Jake, 2012
3, Jenna, 2013
1, John, 2013
4, Tyler, 2012
5, Jenna, 2013
I need to apply a distinct on the name field so that the output contains at most 3 unique values in the name field. A name can be repeated because the other fields have different values, and those rows should appear in the output as well. For example, if I set the threshold to 3, the output should contain only 3 distinct names, with repetition allowed.
The output I need is:
id, name, year
1, John, 2012
2, Jake, 2012
3, Jenna, 2013
1, John, 2013
5, Jenna, 2013
How can I achieve this kind of result with distinct? If I select all the columns, distinct works on whole records rather than on the name alone.

Consider the approach below:
WITH sample_table AS (
SELECT 1 id, 'John' name, 2012 `year` UNION ALL
SELECT 2, 'Jake', 2012 UNION ALL
SELECT 3, 'Jenna', 2013 UNION ALL
SELECT 1, 'John', 2013 UNION ALL
SELECT 4, 'Tyler', 2012 UNION ALL
SELECT 5, 'Jenna', 2013
)
SELECT id, name, `year` FROM (
SELECT *, DENSE_RANK() OVER (ORDER BY name) rnk
FROM sample_table
) t WHERE rnk <= 3;
For random sampling, you might consider the query below:
SELECT id, name, `year` FROM (
SELECT *, DENSE_RANK() OVER (ORDER BY hash) rnk FROM (
SELECT *, SUM(rand()) OVER (PARTITION BY name) hash
FROM sample_table
) t
) t WHERE rnk <= 3;
This will return a different result from the one above, and the result can change from run to run.
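The selection logic of both queries can also be sketched outside SQL. The Python below is only an illustration under stated assumptions: the rows are the sample data from the question, and `sample_names` is a hypothetical helper, not something Hive provides. Without a random generator it chooses names in sorted order, which mirrors `DENSE_RANK() OVER (ORDER BY name)`; with one it chooses the names at random, which mirrors ranking by a per-name random key.

```python
import random

# Sample rows from the question: (id, name, year)
rows = [
    (1, "John", 2012),
    (2, "Jake", 2012),
    (3, "Jenna", 2013),
    (1, "John", 2013),
    (4, "Tyler", 2012),
    (5, "Jenna", 2013),
]


def sample_names(rows, limit, rng=None):
    """Keep every row whose name is among `limit` chosen distinct names.

    Hypothetical helper for illustration only.
    rng=None  -> names are chosen in sorted order (like ORDER BY name).
    rng given -> names are chosen at random.
    """
    names = sorted({name for _, name, _ in rows})
    if rng is None:
        chosen = set(names[:limit])
    else:
        chosen = set(rng.sample(names, min(limit, len(names))))
    # Rows keep their original order; repeated names are all retained.
    return [row for row in rows if row[1] in chosen]


result = sample_names(rows, 3)
random_result = sample_names(rows, 3, random.Random())
print(result)
print(random_result)
```

With the sample data the deterministic call keeps Jake, Jenna and John and drops the single Tyler row, while the random call may keep any three of the four names.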

Related

SQL query to get the employee who has not been appraised

There is an employee salary history table, sal_hist, which has the columns id, name, salary and effective_date. The requirement is to get the employee who has not been appraised. Below is the table:
Id, name, salary, date
1, a, 1000, 10-5-2020
1, a, 2000, 12-6-2020
1, a, 3000, 12-7-2020
2, b, 2500, 12-5-2020
2, b, 3500, 12-7-2020
3, c, 2500, 12-5-2020
Below is the query I have:
Select id,name from sal_hist group by id,name having count(1)=1;
Is there a different way to achieve the result?

If your date column is the appraisal date and you want the user who did not receive an appraisal on the latest appraisal date, then:
SELECT id, name, salary, appraisal_date
FROM (
SELECT s.*,
MAX(appraisal_date) OVER (PARTITION BY id) AS max_user_appraisal_date,
MAX(appraisal_date) OVER () AS max_appraisal_date
FROM salary_history s
)
WHERE appraisal_date = max_user_appraisal_date
AND max_user_appraisal_date < max_appraisal_date
Which, for the sample data:
CREATE TABLE salary_history (Id, name, salary, appraisal_date) AS
SELECT 1, 'a', 1000, DATE '2020-05-10' FROM DUAL UNION ALL
SELECT 1, 'a', 2000, DATE '2020-06-12' FROM DUAL UNION ALL
SELECT 1, 'a', 3000, DATE '2020-07-12' FROM DUAL UNION ALL
SELECT 2, 'b', 2500, DATE '2020-05-12' FROM DUAL UNION ALL
SELECT 2, 'b', 3500, DATE '2020-07-12' FROM DUAL UNION ALL
SELECT 3, 'c', 2500, DATE '2020-05-12' FROM DUAL;
Outputs:
ID, NAME, SALARY, APPRAISAL_DATE
3, c, 2500, 12-MAY-20

If you are looking for another way to achieve the result and find the employee with only 1 salary listed (no appraisals in the past) you can use a subquery like this:
SELECT id, name
FROM (
SELECT id, name, COUNT(id) AS cnt
FROM sal_hist
GROUP BY id, name) t
WHERE cnt = 1;
But I would note that using HAVING as you have currently is a better method.
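To compare the HAVING query from the question with the subquery variant on the sample data, here is a minimal sketch using Python's built-in sqlite3 module. SQLite only stands in for the real database, which the question does not name. The dates follow the sample data in the first answer, and the derived-table alias `t` and the column alias `cnt` are choices made for this sketch.

```python
import sqlite3

# In-memory SQLite database standing in for the real one (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sal_hist (id INTEGER, name TEXT, salary INTEGER, effective_date TEXT)"
)
conn.executemany(
    "INSERT INTO sal_hist VALUES (?, ?, ?, ?)",
    [
        (1, "a", 1000, "2020-05-10"),
        (1, "a", 2000, "2020-06-12"),
        (1, "a", 3000, "2020-07-12"),
        (2, "b", 2500, "2020-05-12"),
        (2, "b", 3500, "2020-07-12"),
        (3, "c", 2500, "2020-05-12"),
    ],
)

# Query from the question: employees with exactly one salary row.
having_rows = conn.execute(
    "SELECT id, name FROM sal_hist GROUP BY id, name HAVING COUNT(*) = 1"
).fetchall()

# Same result through a derived table that is filtered afterwards.
subquery_rows = conn.execute(
    """
    SELECT id, name
    FROM (SELECT id, name, COUNT(id) AS cnt
          FROM sal_hist
          GROUP BY id, name) t
    WHERE cnt = 1
    """
).fetchall()

print(having_rows)
print(subquery_rows)
```

Both queries return only employee 3, 'c', the one employee with a single salary record.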

SQL/Snowflake Sampling with specific probability

Suppose I have table 1 below, how can I select the values from table 1 with the specified probabilities, where each probability is the chance of the respective value getting selected?
Table 1:
Group Value Probability
A 1 5%
A 10 5%
A 50 20%
A 30 70%
B 5 5%
B 25 70%
B 100 25%
A possible outcome is (assuming 30 and 25 are selected simply because of their higher probabilities):
Table 2:
Group Value
A 30
B 25
I'm trying to solve this on Snowflake and have not been able to through various methods, including partitioning the values and comparing their ranks, as well as using the Uniform function to create random probabilities. Not sure if there's a more elegant way to do a sampling and partition by Group. The end goal is to have the Value field in Table 1 deduplicated, so that each value is given a chance of getting selected based on their probabilities.

Give each value a consecutive range within its group, as wide as its probability. For example, a value with a probability of 15% might get the range from 30 to 45.
Pick a random number from 0 up to, but not including, 100.
Find the range in which that random number falls:
create or replace temp table probs
as
select 'a' id, 1 value, 20 prob
union all select 'a', 2, 30
union all select 'a', 3, 40
union all select 'a', 4, 10
union all select 'b', 1, 5
union all select 'b', 2, 7
union all select 'b', 3, 8
union all select 'b', 4, 80;
with calculated_ranges as (
select *, range_prob2-prob range_prob1
from (
select *, sum(prob) over(partition by id order by prob, value) range_prob2 -- value breaks ties so equal probabilities get separate ranges
from probs
)
)
select id, random_draw, value, prob
from (
select id, any_value(uniform(0, 99, random())) random_draw -- integer in 0..99; uniform is inclusive, so a draw of 100 would match no range
from probs group by id
) a
join calculated_ranges b
using (id)
where range_prob1<=random_draw and range_prob2>random_draw
;
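The range technique above can be sketched in plain Python to make the steps explicit. This is an illustration, not Snowflake code: `build_ranges` and `pick` are hypothetical helper names, the data is the `probs` table from the answer, and the ranges are laid out in row order rather than ordered by probability, which does not change the chance of each value being picked.

```python
import random

# (id, value, prob) rows from the probs table; prob is a whole percentage.
probs = [
    ("a", 1, 20), ("a", 2, 30), ("a", 3, 40), ("a", 4, 10),
    ("b", 1, 5), ("b", 2, 7), ("b", 3, 8), ("b", 4, 80),
]


def build_ranges(rows):
    """Map each id to consecutive half-open ranges (low, high, value)."""
    ranges = {}
    for group_id, value, prob in rows:
        group = ranges.setdefault(group_id, [])
        low = group[-1][1] if group else 0  # start where the previous range ended
        group.append((low, low + prob, value))
    return ranges


def pick(ranges, group_id, draw):
    """Return the value whose range contains draw (low <= draw < high)."""
    for low, high, value in ranges[group_id]:
        if low <= draw < high:
            return value
    raise ValueError(f"draw {draw} falls outside every range of group {group_id!r}")


ranges = build_ranges(probs)
rng = random.Random()
# One integer draw in 0..99 per group, so every draw lands in exactly one range.
sample = {group_id: pick(ranges, group_id, rng.randrange(100)) for group_id in ranges}
print(sample)
```

Because the ranges are half-open and the draw is an integer from 0 to 99, a value with probability p owns exactly p of the 100 possible draws.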

Felipe's answer is great, it definitely solved the problem.
While trying out different approaches yesterday, I tested out this approach on Felipe's table and it seems to be working as well.
I'm giving each record a random probability and comparing against the actual probability. The idea is that if the random probability is less than or equal to the actual probability, then it's accepted and the partitioning will do the deduplication based on a descending order with the probabilities.
create or replace temp table probs
as
select 'a' id, 1 value, 20 prob
union all select 'a', 2, 30
union all select 'a', 3, 40
union all select 'a', 4, 10
union all select 'b', 1, 5
union all select 'b', 2, 7
union all select 'b', 3, 8
union all select 'b', 4, 80;
create or replace temp table t2 as
select *,
min(compare_prob) over(partition by id) as min_compare_prob,
max(compare_prob) over(partition by id) as max_compare_prob,
min_compare_prob <> max_compare_prob as not_all_identical --true when the group contains both accepted (1) and rejected (0) rows
from (select id,
value,
prob,
UNIFORM(0.00001::float,1::float,random(2)) as rand_prob, --random probability
case when prob >= rand_prob then 1 else 0 end as compare_prob
from (select id, value, prob/100 as prob from probs)
);
--dedup results
select id, value, prob, rand_prob
from (select *,
row_number() over(partition by id order by prob desc, rand_prob desc) as rn
from t2
where not_all_identical = FALSE
union all
select *,
row_number() over(partition by id order by prob desc, COMPARE_PROB desc) as rn
from t2
where not_all_identical = TRUE)
where rn = 1;

Looping through specific values on a table and insert new rows

I have the following table:
ID, UserID, CompanyID, AccountID, Year1, Month1
I need to insert 10 rows to each AccountID, is there a way to loop through all AccountIDs and to insert for each one of them the following values?
INSERT INTO Perms (UserID, CompanyID, AccountID, Year1, Month1)
VALUES
(175, 74,x,2017,3),
(175, 74,x,2017,4),
(175, 74,x,2017,5),
(175, 74,x,2017,6),
(175, 74,x,2017,7),
(175, 74,x,2017,8),
(175, 74,x,2017,9),
(175, 74,x,2017,10),
(175, 74,x,2017,11),
(175, 74,x,2017,12)
I have about 100 AccountIDs and I need some sort of a loop.
Is that doable?

Use CTEs to represent the account and date sequences. In the case of the account ID values, we can use a recursive CTE. Below I arbitrarily generate values from 1 to 100, though this approach should work with any continuous range. For the year/month combinations, because there are only 10 we can simply hard code them in a CTE. Then, use INSERT INTO ... SELECT with a cross join of the two CTEs.
WITH accounts AS (
SELECT 1 AS account
UNION ALL
SELECT account + 1
FROM accounts
WHERE account + 1 <= 100
),
cte AS (
SELECT 2017 AS year, 3 AS month UNION ALL
SELECT 2017, 4 UNION ALL
SELECT 2017, 5 UNION ALL
SELECT 2017, 6 UNION ALL
SELECT 2017, 7 UNION ALL
SELECT 2017, 8 UNION ALL
SELECT 2017, 9 UNION ALL
SELECT 2017, 10 UNION ALL
SELECT 2017, 11 UNION ALL
SELECT 2017, 12
)
INSERT INTO Perms (UserID, CompanyID, AccountID, Year1, Month1)
SELECT 175, 74, account, year, month
FROM accounts
CROSS JOIN cte
OPTION (MAXRECURSION 255);
Edit:
If your account IDs are not continuous, then continuing with this answer you may just manually list them in a CTE, e.g.
WITH accounts AS (
SELECT 71 AS account UNION ALL
SELECT 74 UNION ALL
SELECT 78 UNION ALL
SELECT 112 UNION ALL
SELECT 119
-- and others
)
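The cross join is simply every account paired with every month. As a rough sketch of that idea, the Python below builds the same rows with `itertools.product` and inserts them into an in-memory SQLite table. The account IDs are the five listed in the example above, SQLite stands in for SQL Server, and the column types of `Perms` are assumed.

```python
import sqlite3
from itertools import product

account_ids = [71, 74, 78, 112, 119]  # example IDs from the answer above
months = range(3, 13)                 # months 3 through 12 of 2017

# Cartesian product: one row per (account, month) pair.
rows = [
    (175, 74, account_id, 2017, month)
    for account_id, month in product(account_ids, months)
]

# In-memory SQLite table standing in for Perms (column types are assumed).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Perms (ID INTEGER PRIMARY KEY, UserID INTEGER, CompanyID INTEGER, "
    "AccountID INTEGER, Year1 INTEGER, Month1 INTEGER)"
)
conn.executemany(
    "INSERT INTO Perms (UserID, CompanyID, AccountID, Year1, Month1) "
    "VALUES (?, ?, ?, ?, ?)",
    rows,
)

inserted = conn.execute("SELECT COUNT(*) FROM Perms").fetchone()[0]
per_account = conn.execute(
    "SELECT AccountID, COUNT(*) FROM Perms GROUP BY AccountID ORDER BY AccountID"
).fetchall()
print(inserted, per_account)
```

Five accounts times ten months gives 50 rows, 10 per account, with no explicit loop in the SQL.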

Try this. It is very similar to the existing answer, but more compact:
;with cte as (
select 175 [UserID], 74 [CompanyID], 2017 [Year1], 3 [Month1]
union all
select 175 [UserID], 74 [CompanyID], 2017 [Year1], [Month1] + 1 from cte
where [Month1] < 12
)
select A.[UserID], A.[CompanyID], B.[AccountID], A.[Year1], A.[Month1] from cte A cross join TABLE_NAME B

If you have the account IDs stored in a table and you want to insert 10 rows for each account ID, with Month1 from 3 to 12, try this:
WITH CTE
AS
(
SELECT
Month2 = 1
UNION ALL
SELECT
Month2+1
FROM CTE
WHERE Month2 <12
)
INSERT INTO Perms (UserID, CompanyID, AccountID, Year1, Month1)
SELECT
UserID = 175,
CompanyID ='X',
AccountID = YAT.AccountID,
Year1 = 2017,
Month1 = CTE.Month2
FROM CTE
INNER JOIN YourAccountTable YAT
ON CTE.Month2 BETWEEN 3 AND 12
Change the BETWEEN clause if you want different values.

SQL : Sorting data by fields which are not in select and group by clause

I have a SQL query that joins two tables.
I want to sort my search results by a column of the child table that is neither in the select clause nor in the group by (I cannot group by it and cannot include it in the select).
Is there some way I can achieve it?

In SQL Server, it results in the error
Column "test2.DATE" is invalid in the ORDER BY clause because it is not contained in either an aggregate function or the GROUP BY clause.
Fiddle here : http://sqlfiddle.com/#!6/d3d28/1
In Oracle, it results in error
ORA-00979: not a GROUP BY expression
00979. 00000 - "not a GROUP BY expression"
With test2 as
(
SELECT 1 ID, 'Hari' fName,30 KEYID ,sysdate DATEE FROM DUAL UNION ALL
SELECT 1, 'Hari' ,20 , sysdate FROM DUAL UNION ALL
SELECT 2, 'John' ,55, sysdate FROM DUAL UNION ALL
SELECT 2, 'John' ,89, sysdate FROM DUAL UNION ALL
SELECT 2, 'John' ,38, sysdate FROM DUAL
)
select id, fname, sum(keyid) from test2
group by id, fname
order by dateE desc;
So, in Oracle, it is advisable to include the child-table column in your select clause. Your display UI logic or result set processing logic can then ignore that column.

It doesn't really make sense to order by a field that you are removing duplicates from, since some of the values that are removed could affect the sort order. Just using the example of Hari and John above, imagine this data:
1, 'Hari', 4
1, 'Hari', 6
2, 'John', 5
2, 'John', 7
2, 'John', 8
If you want these to be ordered Hari then John, because Hari has the lowest sort field and John has the next lowest (i.e. for the purposes of ordering, you ignore the sort field values 6, 7 and 8), then order by Min(SortField).
Again, using Nishanthi's example, the query would become something like this:
select id, fname, sum(keyid) from test2
group by id, fname
order by Min(dateE) desc;
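To see how ordering by an aggregate behaves, here is a small sketch that loads the five example rows above into an in-memory SQLite table through Python's sqlite3 module. SQLite is used only because it is easy to run; the column name `sort_field` is made up for this sketch, and `COUNT(*)` stands in for the `sum(keyid)` of the original query because the example rows carry only the sort value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# sort_field is a hypothetical name for the column that is neither selected nor grouped.
conn.execute("CREATE TABLE test2 (id INTEGER, fname TEXT, sort_field INTEGER)")
conn.executemany(
    "INSERT INTO test2 VALUES (?, ?, ?)",
    [(1, "Hari", 4), (1, "Hari", 6), (2, "John", 5), (2, "John", 7), (2, "John", 8)],
)

# Order each group by its smallest sort value: Hari (4) comes before John (5).
by_min = conn.execute(
    "SELECT id, fname, COUNT(*) FROM test2 "
    "GROUP BY id, fname ORDER BY MIN(sort_field)"
).fetchall()

# Order each group by its largest sort value, descending: John (8) before Hari (6).
by_max_desc = conn.execute(
    "SELECT id, fname, COUNT(*) FROM test2 "
    "GROUP BY id, fname ORDER BY MAX(sort_field) DESC"
).fetchall()

print(by_min)
print(by_max_desc)
```

The aggregate collapses each group's sort values to a single one, which is what makes the ORDER BY well defined after grouping.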

SQL server 2008 R2, select one value of a column for each distinct value of another column

On SQL server 2008 R2, I would like to select one value of a column for each distinct value of another column.
e.g.
name id_num
Tom 53
Tom 60
Tom 27
Jane 16
Jane 16
Bill 97
Bill 83
I need to get one id_num for each distinct name, such as
name id_num
Tom 27
Jane 16
Bill 97
For each name, the id_num can be randomly picked up (not required to be max or min) as long as it is associated with the name.
For example, for Bill, I can pick up 97 or 83. Either one is ok.
I do not know how to write the SQL query.
Thanks

SELECT
name,MIN(id_num)
FROM YourTable
GROUP BY name
UPDATE:
If you want to pick id_num randomly, you may try this:
WITH cte AS (
SELECT
name, id_num,rn = ROW_NUMBER() OVER (PARTITION BY name ORDER BY newid())
FROM YourTable
)
SELECT *
FROM cte
WHERE rn = 1
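The ROW_NUMBER() with ORDER BY newid() pattern amounts to shuffling the rows of each name and keeping the first one. A minimal Python sketch of the same idea, using the sample data from the question, is shown below; `pick_one_per_name` is a hypothetical helper, and `random.Random.choice` takes the place of the shuffle.

```python
import random
from collections import defaultdict

# Sample rows from the question: (name, id_num)
rows = [
    ("Tom", 53), ("Tom", 60), ("Tom", 27),
    ("Jane", 16), ("Jane", 16),
    ("Bill", 97), ("Bill", 83),
]


def pick_one_per_name(rows, rng):
    """Group id_num values by name, then pick one at random per name."""
    by_name = defaultdict(list)
    for name, id_num in rows:
        by_name[name].append(id_num)
    return {name: rng.choice(id_nums) for name, id_nums in by_name.items()}


picked = pick_one_per_name(rows, random.Random())
print(picked)
```

Each name appears exactly once in the result, and the chosen id_num always belongs to that name.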

You could grab the max id like this:
SELECT name, MAX(id_num)
FROM tablename
GROUP BY name
That would get you one id for each distinct name.

select name, max(id_num)
from [mytable]
group by name

ORDER BY (SELECT 1) in the CTE does not really order the rows within each partition, so the row that receives ID = 1 is arbitrary. Note that arbitrary is not the same as random: the engine may return the same row on every run.
CREATE TABLE #tmp
(
name VARCHAR(10)
, id_num INT
)
INSERT INTO #tmp
SELECT 'Tom', 53 UNION ALL
SELECT 'Tom', 60 UNION ALL
SELECT 'Tom', 27 UNION ALL
SELECT 'Jane', 16 UNION ALL
SELECT 'Jane', 16 UNION ALL
SELECT 'Bill', 97 UNION ALL
SELECT 'Bill', 83
;WITH CTE AS (
SELECT
ROW_NUMBER() OVER (PARTITION BY name ORDER BY (SELECT 1)) AS ID
, name
, id_num
FROM #tmp
)
SELECT *
FROM CTE
WHERE ID = 1
