Countif or CASE with multiple conditions - google-bigquery

I am trying to figure out the most efficient way to count products being placed in an online cart. I have ranked the first 3 items placed in a cart by purchase time (the time they were put in the cart, not the actual checkout time), but I am now struggling to find a way to count the different combinations of items going into the cart.
Counting the individual ranks is easy enough, but I need a count for purchasing product 1 first and product 1 second, as well as for all the other possible combinations (5 products total). I only need to count first items in the cart, all combinations of first item in cart to second item in cart, and all combinations of second item in cart to third item in cart.
SELECT
COUNTIF(product = 'Product1' and rank = 1) as firstpurchase_product1,
COUNTIF((product = 'Product1' and rank = 1) and (product = 'Product1' and rank = 2)) as firstpurchase_product1_secondpurchase_product1,
COUNTIF((product = 'Product1' and rank = 1) and (product = 'Product2' and rank = 2)) as firstpurchase_product1_secondpurchase_product2,
#code would continue for all combinations.
FROM (
SELECT
customer_info.customer_id as customer_id,
customer_info.session_id as session_id,
customer_info.product_purchased as product,
ROW_NUMBER() OVER (PARTITION BY customer_info.session_id ORDER BY customer_info.purchase_time ASC) AS rank
FROM customer_purchases AS customer_info
WHERE p_date >= "2022-04-12"
) rnk
where rnk.rank in (1,2,3)
This seems like a lot of code. Is there a better way to do it? The query returns 0 for every line except the ones counting just first purchases. Should I be using CASE instead?
Any thoughts or ideas would be appreciated.
Thanks!
Example of input:
Product 1, Product 2, Product 3
Product 1, Product 1, Product 1
Product 4, Product 2, Product 1
Product 3, Product 3, Product 5
Product 4, Product 2, Product 4
--this goes on for hundreds of lines
Output:
Count Product 1 in first column
Count Product 2 in first column
#continue for all 5
Count of customers who put product 1 in cart first AND product 1 in cart second
Count of customers who put product 1 in cart first AND product 2 in cart second
###continue with all combinations with product 1
Count of customers who put product 2 in cart first and product 1 in cart second
Count of customers who put product 2 in the cart first and product 2 in the cart second
###continue with all combinations of product 2,3,4, and 5

It seems to me that you want to GROUP BY a set of columns (item1, item2, item3) and produce a count of the number of times each combination occurs.
Possibly (it's a little unclear from your wording - a well-formatted table showing example raw data and desired results for that example would be helpful), you want to know an overall count for values of item1 regardless of the other items. This can be achieved via GROUP BY ROLLUP(item1, item2, item3).
So, our aim is to get an unaggregated table with those columns, so that we can aggregate it as described!
You have a long-format table (customer ID, session ID, product, rank) and we want a wide-format table with a column for each value of rank. This is a PIVOT operation:
WITH rnk AS (
SELECT
customer_id,
session_id,
product_purchased AS product,
ROW_NUMBER() OVER (PARTITION BY session_id ORDER BY purchase_time ASC) AS rank
FROM customer_info
WHERE p_date >= "2022-04-12"
QUALIFY rank IN (1,2,3)
),
pivoted AS (
SELECT *
FROM rnk PIVOT(
ANY_VALUE(product) AS item FOR rank in (1,2,3)
)
)
SELECT
item_1,
item_2,
item_3,
COUNT(*) AS N
FROM
pivoted
GROUP BY
ROLLUP(item_1, item_2, item_3)
Does that get you what you want?
A couple of features to note:
I use common table expressions (WITH) to make this more readable
QUALIFY is a filter clause applied to the output of a window function
Pivoting requires an aggregation function because in general there could be many records with the same value of session, product, and rank. Here we know there will be one record only, so it's safe to use ANY_VALUE (which 'aggregates' by non-deterministically choosing one of the values).
Just to prevent confusion: ROLLUP will give you something like 'Product A', NULL, NULL for some of its records - this doesn't mean items 2 and 3 don't exist, it's just how it signals those records that group only by item 1 and aggregate over all values of the other items.
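To see why the original COUNTIF version returns 0 for every combination: each condition is evaluated against one row at a time, and a single row cannot have rank 1 and rank 2 at once. The sketch below reproduces that on a tiny in-memory SQLite table driven from Python, then counts the pairs after reshaping to one row per session. SQLite has neither PIVOT nor ROLLUP, so conditional aggregation (MAX(CASE ...)) stands in for the pivot and a plain GROUP BY for the rollup. The table name cart, the column name rnk, and the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical long-format table: one row per (session, ranked item).
conn.execute("CREATE TABLE cart (session_id INTEGER, product TEXT, rnk INTEGER)")
rows = [
    (1, "Product1", 1), (1, "Product2", 2), (1, "Product3", 3),
    (2, "Product1", 1), (2, "Product1", 2), (2, "Product1", 3),
    (3, "Product4", 1), (3, "Product2", 2), (3, "Product1", 3),
    (4, "Product1", 1), (4, "Product2", 2), (4, "Product4", 3),
]
conn.executemany("INSERT INTO cart VALUES (?, ?, ?)", rows)

# Row-wise condition: no single row is both rank 1 and rank 2,
# so the sum of matches is 0 (the COUNTIF symptom from the question).
rowwise = conn.execute("""
    SELECT SUM(product = 'Product1' AND rnk = 1
               AND product = 'Product2' AND rnk = 2)
    FROM cart
""").fetchone()[0]

# Reshape to one row per session first (stand-in for PIVOT),
# then count how often each (item_1, item_2) combination occurs.
pair_counts = conn.execute("""
    WITH pivoted AS (
        SELECT session_id,
               MAX(CASE WHEN rnk = 1 THEN product END) AS item_1,
               MAX(CASE WHEN rnk = 2 THEN product END) AS item_2
        FROM cart
        GROUP BY session_id
    )
    SELECT item_1, item_2, COUNT(*) AS n
    FROM pivoted
    GROUP BY item_1, item_2
    ORDER BY item_1, item_2
""").fetchall()

print(rowwise)
print(pair_counts)
```

With these sample rows, the row-wise count is 0, while the grouped query reports each first/second combination with its session count.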

Related

Check for multiple columns with Single Value using SELECT query

I have a table Fruits. In the UI, I provide a single field for the search criterion. Using this single search criterion, I want to search multiple columns of the Fruits table.
Consider that the Fruits table contains the columns ID, Desc, Price, Quant, Stock. Here Price and Quant are integers and Stock is a varchar.
I have tried the query below, which returns the results, but I am worried about performance.
Suppose the user enters 2 in the field provided in the UI and clicks search; the query will then be as shown below:
select ID, Desc, Price, Quant, Stock
from Fruits
where Price = '2'
or Quant = '2'
or stock = '2'
Is this the right way to search multiple columns of the same table? Will it have any effect on performance?
First, you want to be sure that the types are compatible. In all likelihood these values are numbers, so drop the quotes:
select ID, Desc, Price, Quant, Stock
from Fruits f
where Price = 2 or Quant = 2 or stock = 2;
This can more simply be written as:
select ID, Desc, Price, Quant, Stock
from Fruits f
where 2 in (Price, Quant, Stock);
but that will not help performance.
In most databases your query will require a full table scan -- although some databases support a particular type of index scan called a skip scan which can help.
The only way I can think to get around that is to have a separate index on each column:
create index idx_fruits_price on fruits(price);
create index idx_fruits_quant on fruits(quant, price);
create index idx_fruits_stock on fruits(stock, quant, price);
(You'll see why the extra columns are helpful.)
And then use union all:
select ID, Desc, Price, Quant, Stock
from Fruits f
where Price = 2
union all
select ID, Desc, Price, Quant, Stock
from Fruits f
where quant = 2 and price <> 2
union all
select ID, Desc, Price, Quant, Stock
from Fruits f
where stock = 2 and price <> 2 and quant <> 2;
Each of the subqueries can use one of the indexes. Because of the inequalities, the results are exclusive -- assuming the column values are not null. If nulls are allowed, the logic can be adjusted to handle that.
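As a quick sanity check of the exclusivity argument, here is a small sketch using Python's sqlite3 on made-up rows. It compares the plain OR filter with a three-branch UNION ALL in which each later branch excludes rows already matched on an earlier column. All three searched columns are assumed to be non-null integers here (the question describes Stock as a varchar, so that part is an assumption), and the table contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fruits (id INTEGER, price INTEGER, quant INTEGER, stock INTEGER)"
)
conn.executemany("INSERT INTO fruits VALUES (?, ?, ?, ?)", [
    (1, 2, 5, 7),   # matches on price only
    (2, 3, 2, 9),   # matches on quant only
    (3, 4, 6, 2),   # matches on stock only
    (4, 2, 2, 2),   # matches on all three columns
    (5, 8, 8, 8),   # no match
])

# Baseline: a single filter with OR.
or_ids = sorted(r[0] for r in conn.execute(
    "SELECT id FROM fruits WHERE price = 2 OR quant = 2 OR stock = 2"
))

# UNION ALL form: each branch skips rows an earlier branch already returned,
# so no row is emitted twice even without UNION's de-duplication.
union_ids = sorted(r[0] for r in conn.execute("""
    SELECT id FROM fruits WHERE price = 2
    UNION ALL
    SELECT id FROM fruits WHERE quant = 2 AND price <> 2
    UNION ALL
    SELECT id FROM fruits WHERE stock = 2 AND price <> 2 AND quant <> 2
"""))

print(or_ids, union_ids)
```

Row 4 matches on every column but is returned only once by the UNION ALL form, because the inequalities route it to the first branch alone.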

Search results ordering based on rating and other values

I am struggling to build a complex ordering algorithm for the search results page.
I would like to order my items by rating (rating count, average rating), but I only want the rating take between 60-80% of the results page. One page has 12 items. They should be distributed randomly on a page.
I want to apply simple ordering as a secondary criteria, such as created_at field.
Does anybody have an idea how to do that?
My interpretation of your requirements is:
12 total items returned
4 items should be the most recently created items
The remaining 8 should be the highest rated items
Items should not appear twice, so if an item is recently created AND highly rated, we will need an extra item
In order to achieve this, I tried the following:
Assign ordered row numbers to both the created_at and avg_rating columns
Calculate the number of items that are in BOTH the top 4 by creation date and the top 8 by rating (we'll call this num_duplicates)
Increase the total number of highly rated items to be returned by num_duplicates
select *
from
(
select a.*,
sum(
/* We want the total number of items that meet both criteria */
/* For every one of these items, we want to include an extra row */
case when created_row_num <= 4 and rating_row_num <= 8
then 1
else 0
end ) over() as num_duplicates
from (
select ratings.*,
row_number() over( order by created_at desc ) as created_row_num,
row_number() over( order by avg_rating desc ) as rating_row_num
from ratings
) as a
) as b
where created_row_num <= 4
/* Get top 8 by rating, plus 1 for every record that has already been selected by creation date */
or rating_row_num <= 8 + num_duplicates
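The following sketch runs essentially the same query against an in-memory SQLite database from Python (window functions need SQLite 3.25 or newer). The 20 sample rows are invented so that exactly two of the four newest items are also among the eight best rated; created_at and avg_rating are plain integers here purely to keep the example deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ratings (id INTEGER, created_at INTEGER, avg_rating INTEGER)"
)
# Items 1..18: older items have higher ratings. Items 19 and 20 are the
# newest AND the best rated, so they satisfy both criteria.
items = [(i, i, 19 - i) for i in range(1, 19)]
items += [(19, 19, 99), (20, 20, 100)]
conn.executemany("INSERT INTO ratings VALUES (?, ?, ?)", items)

query = """
SELECT id, num_duplicates
FROM (
    SELECT a.*,
           SUM(CASE WHEN created_row_num <= 4 AND rating_row_num <= 8
                    THEN 1 ELSE 0 END) OVER () AS num_duplicates
    FROM (
        SELECT ratings.*,
               ROW_NUMBER() OVER (ORDER BY created_at DESC) AS created_row_num,
               ROW_NUMBER() OVER (ORDER BY avg_rating DESC) AS rating_row_num
        FROM ratings
    ) AS a
) AS b
WHERE created_row_num <= 4
   OR rating_row_num <= 8 + num_duplicates
ORDER BY id
"""
result = conn.execute(query).fetchall()
ids = [row[0] for row in result]
num_duplicates = result[0][1]

print(num_duplicates, ids)
```

For this data the query finds two duplicates and returns 12 distinct items. Note that if one of the extra rating slots is itself among the four newest items, fewer than 12 rows come back, so the approach is a close approximation rather than a guarantee.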
I ended up using a solution that gives unrated items a chance to end up among the rated items. The idea of the algorithm is as follows:
ORDER BY
CASE WHEN rating IS NOT NULL OR RANDOM() < 0.0x THEN 1 + RANDOM() ELSE RANDOM() END
DESC NULLS LAST

Nested groups and counts within SQL

I can't eloquently explain my end problem, feel free to edit both this title and content of this question if you can describe the SQL solution better.
I have data in a table containing the columns "Item", "Category", and I have the following query calculating the number of distinct categories within each item.
SELECT
[ItemID], COUNT(DISTINCT [CategoryText] ) AS 'number of categories'
FROM [SampleTable]
GROUP BY
ItemID
This gives me the output
'ItemID', 'number of categories'
100 1
101 3
102 1
What I now want to do is group the number of items by the number of categories they have, with my aim being to determine 'Do most items only have one category?'
For the example above, I would expect the outcome to be
'Number of categories', 'Number of items'
1, 2
3, 1
I'm sure there is a simple query to get to this but am going around in circles without making progress.
Aggregate the results by number of categories.
select [number of categories], count(*) [number of items]
from (SELECT [ItemID], COUNT(DISTINCT [CategoryText]) AS 'number of categories'
FROM [SampleTable]
GROUP BY ItemID) t
group by [number of categories]
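A minimal runnable version of the same two-level aggregation, using Python's sqlite3 with sample rows chosen to match the counts in the question (the category values A, B, C are invented, and the bracketed SQL Server identifiers are replaced with plain aliases):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SampleTable (ItemID INTEGER, CategoryText TEXT)")
conn.executemany("INSERT INTO SampleTable VALUES (?, ?)", [
    (100, "A"), (100, "A"),               # item 100: 1 distinct category
    (101, "A"), (101, "B"), (101, "C"),   # item 101: 3 distinct categories
    (102, "B"),                           # item 102: 1 distinct category
])

# Inner query: distinct categories per item.
# Outer query: how many items share each category count.
histogram = conn.execute("""
    SELECT num_categories, COUNT(*) AS num_items
    FROM (
        SELECT ItemID, COUNT(DISTINCT CategoryText) AS num_categories
        FROM SampleTable
        GROUP BY ItemID
    ) AS t
    GROUP BY num_categories
    ORDER BY num_categories
""").fetchall()

print(histogram)
```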

PostgreSQL: group with most common items: handling overlapping items

Postgresql: 9.3
I have a long log of "shopping cart ids" and the "product ids" each shopping cart contains.
I'm looking for a way to create groups that have the most "product ids" in common. The "product ids" can be in multiple groups at the same time.
As a result I need the "shopping cart ids","product ids" and the name of the groups (group 1, group 2, ...).
Does anyone have a hint on how to do it? I know a SQL query is not ideal for this, but it's all I have at the moment.
EDIT: With the below query I know groups of xx Shopping Carts have xx Products in common.
WITH a AS (
SELECT Shopping_Cart.Product_Id AS Product_Id, count(Shopping_Cart.Product_Id) AS "count" FROM Shopping_Cart
GROUP BY Shopping_Cart.Product_Id
ORDER BY "count"
)
SELECT a."count" AS "Product in Common", count(DISTINCT Shopping_Cart.id) AS "Shopping Cart Count" FROM a
RIGHT JOIN Shopping_Cart ON Shopping_Cart.Product_Id = a.Product_Id
GROUP BY a."count"
It's better than nothing but if I have 7 shoppers with items 1,2,3 and 7 shoppers with items 4,5,6 they fall into the same group of shoppers with 3 items in common. I need to separate them.
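I bet you might want to loop through your Product table, joining back on your Shopping table by CartId. Maybe something like this (the function name, table names, column names, and integer types are placeholders to adapt):
CREATE OR REPLACE FUNCTION products_in_common()
RETURNS TABLE (prod_id integer, other_product_id integer, cart_count bigint) AS $$
DECLARE
p record;
BEGIN
-- For each product, count the carts it shares with every product
FOR p IN SELECT ProductId FROM ProductTable
LOOP
RETURN QUERY
SELECT p.ProductId, s.ProductId, count(s.CartId)
FROM ShoppingTable s
WHERE s.CartId IN
(SELECT DISTINCT CartId FROM ShoppingTable WHERE ProductId = p.ProductId)
GROUP BY s.ProductId
ORDER BY count(s.CartId) DESC;
END LOOP;
RETURN;
END
$$ LANGUAGE plpgsql;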

Most efficient way to implement this logic in Oracle

The problem is as follows. Let there be two types of orders, VE and VC (VE orders have priority over VC orders), and two priorities, HIGH and LOW. Every order is identified by an ORDER_ID, then labeled with an order type and lastly a priority. Over time, orders can improve their type, priority, or both, resulting in several new entries with duplicate order IDs. The task is to label the state with the highest priority for each order with 1 and the rest with 0. How would you attempt to do this, considering that the ORDERS table is sufficiently big and that in some cases some rows would have to be re-labeled?
Example input:
Example output:
"How would you attempt to do this considering that the ORDERS table is sufficiently big"
Well, first of all I wouldn't query all the rows in a "sufficiently big" table. That's why Nature gave us the WHERE clause.
So, given some form of filtering, the remaining logic is:
select order_id
, order_type
, priority
, case when rn = 1 then 1 else 0 end as temp_label
from
( select order_id
, order_type
, priority
, row_number() over ( partition by order_id
order by decode(order_type, 'VE', 1, 2)
, decode(priority, 'HIGH', 1, 2)
) as rn
from your_table
where whatever = 'BLAH' -- your criteria go here
)
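Since the example input and output are not shown, here is a small sketch with invented rows that exercises the same ranking logic through Python's sqlite3 (SQLite 3.25 or newer for window functions). SQLite has no DECODE, so equivalent CASE expressions are swapped in; the table name orders and the rows are hypothetical, and the WHERE filter is left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, order_type TEXT, priority TEXT)"
)
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, "VC", "LOW"), (1, "VC", "HIGH"), (1, "VE", "LOW"),
    (2, "VC", "HIGH"),
    (3, "VE", "LOW"), (3, "VE", "HIGH"),
])

# Rank the states of each order: VE before VC, then HIGH before LOW.
# The top-ranked state per order gets label 1, every other state 0.
labelled = conn.execute("""
    SELECT order_id, order_type, priority,
           CASE WHEN rn = 1 THEN 1 ELSE 0 END AS temp_label
    FROM (
        SELECT order_id, order_type, priority,
               ROW_NUMBER() OVER (
                   PARTITION BY order_id
                   ORDER BY CASE order_type WHEN 'VE' THEN 1 ELSE 2 END,
                            CASE priority WHEN 'HIGH' THEN 1 ELSE 2 END
               ) AS rn
        FROM orders
    )
    ORDER BY order_id, rn
""").fetchall()

winners = [row[:3] for row in labelled if row[3] == 1]
print(winners)
```

Because order type is compared before priority, order 1 is labeled on its VE/LOW state rather than its VC/HIGH state.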