Hive: How to limit the number of entries in collect_set

Let's say I have a table that contains two columns:
Category Productname
Cat1 prod1
Cat1 prod2
Cat1 prod3
Cat1 prod4
Cat1 prod5
Cat2 prod6
Cat2 prod7
Cat2 prod8
Now if I do something like:
SELECT Category, collect_set(Productname)
FROM myTable
GROUP BY Category;
I would get something like:
Cat1 [prod1...prod5]
Cat2 [prod6...prod8]
There are 5 products in Cat1 and 3 in Cat2, but I want to limit the number of products in each category. Let's say the upper limit is 3. The 3 products can be any random ones as long as they belong to the same category. The upper limit can also be a large number. Ideally, collect_set would stop once it has reached the upper threshold, because generating the full output and then filtering it can be expensive. Methods other than collect_set are also welcome. Thanks for any suggestions!

You can do this by combining a window function with collect_set. Partition by category, keep at most 3 rows per category, and then apply collect_set to that subset.
select a.category, collect_set(a.productname)
from
(
select category, productname, row_number() over (partition by category) as r_no
from table_name
) a
where a.r_no <= 3
group by a.category
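The cap-then-aggregate pattern above can be tried locally with a small sketch. This is not Hive: it assumes Python's built-in sqlite3 module (SQLite 3.25 or later for window functions) as a stand-in engine, and it uses group_concat in place of collect_set, which SQLite does not provide. The table name table_name comes from the answer; LIMIT_PER_CATEGORY and capped are hypothetical names used only for illustration.

```python
import sqlite3

# Assumption: SQLite stands in for Hive, and group_concat stands in for
# collect_set (SQLite has no collect_set).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_name (category TEXT, productname TEXT)")

# Sample rows from the question: prod1..prod5 in Cat1, prod6..prod8 in Cat2.
rows = [("Cat1", f"prod{i}") for i in range(1, 6)]
rows += [("Cat2", f"prod{i}") for i in range(6, 9)]
conn.executemany("INSERT INTO table_name VALUES (?, ?)", rows)

LIMIT_PER_CATEGORY = 3  # hypothetical name for the upper limit

query = """
SELECT a.category, group_concat(a.productname) AS products
FROM (
    SELECT category, productname,
           row_number() OVER (PARTITION BY category) AS r_no
    FROM table_name
) a
WHERE a.r_no <= ?
GROUP BY a.category
ORDER BY a.category
"""

# Map each category to the list of products that survived the cap.
capped = {
    category: products.split(",")
    for category, products in conn.execute(query, (LIMIT_PER_CATEGORY,))
}
print(capped)
```

Because row_number() has no ORDER BY here, which 3 products are kept per category is up to the engine, which matches the "any random ones" requirement in the question.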

Related

Can I use SQL to Sum() and/or count() for all elements in the group except for the returned category?

I have a table of data. Column 1 is a list of categories, and column 2 is a boolean. I have N number of categories, with N number of rows per category.
I would like to return a table with the data grouped by category, and summary of the number of rows for each category, and the sum of the boolean column (number of rows with value = 1).
I would also like to return a summary of: (Sum(BooleanField)/Count(BooleanField))/(Sum(BooleanField)/Count(BooleanField)), where the numerator does not include rows with the category (Category_name) that my Group By function returns for, and the denominator is all-inclusive (all categories).
So far, I have the code
SELECT(Category_name),
COUNT(BooleanField),
SUM(BooleanField),
SUM(BooleanField)/COUNT(BooleanField) -- this is % True for each category
-- some logic that takes the % true for all categories except the category
-- that we are grouping by later / by the % true overall (all observations)
FROM Data.Source
GROUP BY Category_Name
This code so far is just exploratory.
The "magic number" column explains what I am looking for next, with the other columns representing what is being returned by my code so far: https://docs.google.com/spreadsheets/d/17oienILCeATmH-kNzBZqz0s0Bj9ptjKZ9HfcQJCvAdA/edit#gid=0
Thanks for any help.
Sample Data:
Category BooleanField
Cat1 0
Cat1 1
Cat2 1
Cat2 1
Cat2 1
Cat2 0
Cat2 0
Cat2 1
Cat2 1
Cat2 1
Cat3 0
Cat3 0
Cat3 0
Cat3 1
Cat4 1
Cat4 0
Cat4 0
Cat4 0
Cat4 0
Cat4 1
Desired Result
Category | Percent True | Sum | Count | Magic Number
Cat1     | 50.00%       | 1   | 2     | 1.0000
Cat2     | 75.00%       | 6   | 8     | 0.6667
Cat3     | 25.00%       | 1   | 4     | 1.1250
Cat4     | 33.33%       | 2   | 6     | 1.1429
The magic number column is where I'm having trouble. I need it to determine which categories drag the overall % true down the most, so that removing the most negatively influential category would increase the overall % true the most.
You can use window functions. I think the logic you want is:
select category,
avg(cast(booleanField as float)) as percent_true,
sum(cast(booleanField as int)) as total,
count(*) as cnt,
-- % true over all other categories, divided by % true over all rows
(sum(sum(cast(booleanField as int))) over() - sum(cast(booleanField as int))) * 1.0
/ (sum(count(*)) over() - count(*))
/ (sum(sum(cast(booleanField as int))) over() * 1.0 / sum(count(*)) over())
as magic_number
from mytable
group by category
order by category
You can use an OVER() window function to sum all values, and then subtract the current row.
with data as (
select $1 num, $2 str
from (values (1, 'one'), (2, 'two'), (3, 'three'))
)
select num
, sum(num) over() everything_added
, sum(num) over() - num everything_but_this_row
from data;
With these basic components, you can now build any desired formula, like in "where the numerator does not include rows with the category".
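As a worked check of the "everything minus this row" idea against the question's sample data, here is a minimal sketch. It assumes Python's built-in sqlite3 module (SQLite 3.25 or later) rather than the asker's database, treats the category labels as case-insensitive (Cat1 through Cat4), and aggregates per category in a CTE before applying the window sums. The names per_category, sample, and magic are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (category TEXT, booleanField INTEGER)")

# Sample data from the question, with category capitalisation normalised
# (an assumption: "cat1" and "Cat1" are meant to be the same category).
sample = {
    "Cat1": [0, 1],
    "Cat2": [1, 1, 1, 0, 0, 1, 1, 1],
    "Cat3": [0, 0, 0, 1],
    "Cat4": [1, 0, 0, 0, 0, 1],
}
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [(cat, value) for cat, values in sample.items() for value in values],
)

query = """
WITH per_category AS (
    SELECT category, SUM(booleanField) AS total, COUNT(*) AS cnt
    FROM mytable
    GROUP BY category
)
SELECT category,
       -- % true over all other categories ...
       (SUM(total) OVER () - total) * 1.0 / (SUM(cnt) OVER () - cnt)
       -- ... divided by % true over all rows
       / (SUM(total) OVER () * 1.0 / SUM(cnt) OVER ()) AS magic_number
FROM per_category
ORDER BY category
"""

# Round to 4 decimals to compare with the "Magic Number" column.
magic = {category: round(value, 4) for category, value in conn.execute(query)}
print(magic)
```

With these inputs the rounded values agree with the Magic Number column in the desired result, for example 0.6667 for Cat2 (4 true out of 12 rows elsewhere, against 10 out of 20 overall).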

Get a distinct row count from another column

SQL Table is as follows:
Category | Subcategory
A        | 1
A        | 1
A        | 2
B        | 1
B        | 2
I need the number of each subcategory for each category, not including duplicate subcategories within the category.
You'll notice there are 3 total "1" subcategories, but only a count of 2 as the duplicate is redundant and not included.
Example output:
subcategory | count
1           | 2
2           | 2
How can I achieve this? I am familiar with COUNT but I can only get the raw number of rows.
Using Snowflake.
Thanks!
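You can use GROUP BY with COUNT(DISTINCT ...). To match the example output, which is keyed by subcategory, group by Subcategory and count the distinct categories:
select Subcategory, count(distinct Category) as "count"
from t
group by Subcategory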

SQL sum rows and select one unique id

I am seeking a method of summing multiple rows and selecting the unique ID of one of the rows to be the unique ID for the sum row, if that makes sense.
For example if I have a table like this
ID | Value1 Value2 Text1 Text2
---------|-------------------------------------------
1 | 100 150 Bananas Hawaii
2 | 200 100 Bananas Hawaii
3 | 300 200 Bananas Hawaii
---------|--------------------------------------------
1,2 or 3 | 600 450 Bananas Hawaii
To get the result row I would do something like this
SELECT
sum(Value1) as Value1,
sum(Value2) as Value2
FROM
db..table
GROUP BY
Text1
,Text2
However, I need to retrieve just one of the IDs to put on my result row. I don't care whether it is 1, 2 or 3.
The reason is that I have a massive database and a big program to retrieve data, but due to some new programming I can now have several copies of the same row with different unique IDs. Hence I am interested in summing the rows and keeping just one of the unique IDs.
Assigning a new Unique ID to the result row will not help me because of the way everything is designed right now.
I am using Microsoft SQL Server 2005.
You could just use another aggregate like MIN or MAX
SELECT min(ID) as ID
,sum(Value1) as Value1
,sum(Value2) as Value2
FROM
db..table
GROUP BY
Text1
,Text2
Please try:
SELECT
min(ID) as ID,
sum(Value1) as Value1,
sum(Value2) as Value2
FROM
db..table
GROUP BY
Text1
,Text2

Postgres: get average for all values of a column for each distinct from another column

I have a table that looks like:
sku | qty
----|----
sku1| 1
sku1| 3
sku1| 1
sku1| 3
sku2| 1
And I'm trying to write a query that will return the average of qty for each distinct sku.
So for the data above, the output from the query would look like:
sku | qty
----|----
sku1| 2
sku2| 1
So the average for sku1 comes from 1, 3, 1, 3, and the average for sku2 is just 1.
I know it's going to involve some kind of subquery, but I just can't seem to figure it out.
SELECT sku, AVG(qty)
FROM (SELECT DISTINCT sku FROM table)
How do I query for the average qty for each sku?
That's precisely what group by is for:
SELECT sku, AVG(qty)
FROM the_table
GROUP BY sku;
The manual has some examples: http://www.postgresql.org/docs/current/static/queries-table-expressions.html#QUERIES-GROUP

Efficient ways to count the number of times two items are ordered together

I am currently stuck on a problem where I have to write a SQL query to count the number of times a pair of items is ordered together.
The table that I have at my disposal is something like:
ORDER_ID | PRODUCT_ID | QUANTITY
1        | 1          | 10
1        | 2          | 20
1        | 3          | 10
2        | 1          | 10
2        | 2          | 20
3        | 3          | 50
4        | 2          | 10
I am looking to write a SQL query that can, for every unique pair of items, count the number of times they were ordered together and tell me the quantities when they were in the same order.
The resulting table should look like:
PRODUCT_ID_1 | PRODUCT_ID_2 | NUM_JOINT_ORDERS | SUM_QUANTITY_1 | SUM_QUANTITY_2
1            | 2            | 2                | 20             | 40
1            | 3            | 1                | 10             | 10
2            | 3            | 1                | 20             | 10
Some things to exploit are that:
Some orders only contain 1 item and so are not relevant in counting the pairwise relationship (not sure how to exclude these but maybe it makes sense to filter them first)
We only need to list the pairwise relationship once in the final table (so maybe a WHERE PRODUCT_ID_1 < PRODUCT_ID_2)
There is a similar post here, though I have reposted the question because
I really want to know the fastest way to do this since my original table is huge and my computational resources are limited, and
in this case I only have a single table and no table that lists the number.
You may use the following approach, which gives you the result shown above.
select
PRODUCT1 as PRODUCT_ID_1,
PRODUCT2 as PRODUCT_ID_2,
count(*) as NUM_JOINT_ORDERS,
sum(QUANTITY1) as SUM_QUANTITY_1,
sum(QUANTITY2) as SUM_QUANTITY_2
from (
select
T1.PRODUCT_ID AS PRODUCT1,
T2.PRODUCT_ID AS PRODUCT2,
T1.QUANTITY AS QUANTITY1,
T2.QUANTITY AS QUANTITY2
from TABLE as T1, TABLE as T2
where T1.ORDER_ID=T2.ORDER_ID
and T1.PRODUCT_ID<T2.PRODUCT_ID
) as PAIRS
group by PRODUCT1, PRODUCT2
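The self-join can be checked against the question's sample rows with a short sketch. It assumes Python's built-in sqlite3 module as a stand-in engine and a hypothetical table name, orders, in place of the placeholder TABLE. It also assumes each (ORDER_ID, PRODUCT_ID) combination appears at most once, as in the sample.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "orders" is a hypothetical name standing in for the placeholder TABLE.
conn.execute(
    "CREATE TABLE orders (ORDER_ID INTEGER, PRODUCT_ID INTEGER, QUANTITY INTEGER)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 1, 10), (1, 2, 20), (1, 3, 10), (2, 1, 10), (2, 2, 20), (3, 3, 50), (4, 2, 10)],
)

query = """
SELECT T1.PRODUCT_ID AS PRODUCT_ID_1,
       T2.PRODUCT_ID AS PRODUCT_ID_2,
       COUNT(*)         AS NUM_JOINT_ORDERS,
       SUM(T1.QUANTITY) AS SUM_QUANTITY_1,
       SUM(T2.QUANTITY) AS SUM_QUANTITY_2
FROM orders AS T1
JOIN orders AS T2
  ON T1.ORDER_ID = T2.ORDER_ID
 AND T1.PRODUCT_ID < T2.PRODUCT_ID   -- list each pair only once
GROUP BY T1.PRODUCT_ID, T2.PRODUCT_ID
ORDER BY T1.PRODUCT_ID, T2.PRODUCT_ID
"""

pairs = conn.execute(query).fetchall()
for row in pairs:
    print(row)
```

Orders with a single item (orders 3 and 4 here) drop out on their own, since the join finds no partner row with a larger PRODUCT_ID, so no separate pre-filter is needed for correctness. On a large table, an index on (ORDER_ID, PRODUCT_ID) may help the join, though that depends on the engine.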