I am struggling to build a complex ordering algorithm for the search results page.
I would like to order my items by rating (rating count, average rating), but I only want rating-based ordering to take up between 60-80% of the results page. One page has 12 items, and they should be distributed randomly on the page.
I want to apply simple ordering as a secondary criteria, such as created_at field.
Does anybody have an idea how to do that?
My interpretation of your requirements is:
12 total items returned
4 items should be the most recently created items
The remaining 8 should be the highest rated items
Items should not appear twice, so if an item is recently created AND highly rated, we will need an extra item
In order to achieve this, I tried the following:
Assign ordered row numbers to both the created_at and avg_rating columns
Calculate the number of items that are in BOTH the top 4 by created_at and the top 8 by rating (we'll call this num_duplicates)
Increase the total number of highly rated items to be returned by num_duplicates
SQL Fiddle Here
select *
from
(
    select a.*,
           sum(
               /* We want the total number of items that meet both criteria */
               /* For every one of these items, we want to include an extra row */
               case when created_row_num <= 4 and rating_row_num <= 8
                    then 1
                    else 0
               end ) over() as num_duplicates
    from (
        select ratings.*,
               row_number() over( order by created_at desc ) as created_row_num,
               row_number() over( order by avg_rating desc ) as rating_row_num
        from ratings
    ) as a
) as b
where created_row_num <= 4
   /* Get top 8 by rating, plus 1 for every record that has already been selected by creation date */
   or rating_row_num <= 8 + num_duplicates
I ended up using a solution which gives unrated items a chance to end up in the middle of the rated items. The idea of the algorithm is as follows:
ORDER BY
CASE WHEN rating IS NOT NULL OR RANDOM() < 0.0x THEN 1 + RANDOM() ELSE RANDOM() END
DESC NULLS LAST
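For context, the full query would look roughly like the sketch below. The items table, its rating and created_at columns, and the 0.05 value standing in for the 0.0x placeholder are all illustrative assumptions, not the actual schema:
SELECT *
FROM items                                         -- assumed table name
ORDER BY
    CASE
        WHEN rating IS NOT NULL OR random() < 0.05 -- 0.05 stands in for the 0.0x threshold above
            THEN 1 + random()                      -- rated (and "lucky" unrated) items score in [1, 2)
        ELSE random()                              -- remaining unrated items score in [0, 1)
    END DESC NULLS LAST,
    created_at DESC                                -- secondary ordering criterion from the question
LIMIT 12;                                          -- one page of 12 items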
I need help with the following problem: I want to make a query that contains multiple sums and then takes those sums and uses them to get a percentage: percentage = s1/(s1+s2).
I have as input the following data:
Order shipping date, number of orders that have arrived late, number of orders that have arrived on time.
What I want as output: the percentage of orders that have arrived late and of orders that have arrived on time.
I want another column in the table that will hold this percentage, computed using SQL.
Concrete example:
On 2022/01/04 at 10:00 AM I have 3 orders late and 4 orders on time => 7 orders in total. Percentage = 3/7 late, 4/7 on time.
At 2022/01/04 11:00 AM I have 5 orders late and 6 orders on time => 11 orders in total (but this entry is summed with the previous entry, so 5+3 orders late, 4+6 orders on time, 18 orders in total) => percentage = 8/18 late, 10/18 on time.
In order to sum the order counts of previous entries with status "LATE" up to the current entry, I wrote the following SQL (this sum is s1):
SELECT s1.EventDate,
       (
           SELECT SUM(s2.NbOfOrders)
           FROM OrderShipmentStats s2
           WHERE s2.EventDate <= s1.EventDate AND s2.Status = 'LATE'
       ) AS cnt
FROM OrderShipmentStats s1
GROUP BY s1.EventDate, s1.Status
The same kind of SQL was written for "On Time" and it works. But what I need now is to take the values from the two queries, add them together, and, depending on whether the status is late or on time, compute s1/(s1+s2) or s2/(s1+s2).
My problem is that I do not know how to express this formula in a single query using those two subqueries; any help would be great.
Picture with Table
Above is the link to the picture showing what the table looks like (I am new, so I am not allowed to embed a photo).
The percentage column is the one I will add, and there are lines pointing to how it is calculated.
I created the table based on your image and added a few rows to it.
In the query you can see the total order count per hour, per status, and the grand total, as you mentioned in the image.
The query looks like:
create table OrderShipmentsStats
(
EventDate datetime not null,
Status varchar(10) not null,
OrdersCount int not null
)
insert into OrderShipmentsStats
values
('2022-01-04T10:00:00','Late',3),
('2022-01-04T10:00:00','On Time',4),
('2022-01-04T11:00:00','Late',5),
('2022-01-04T11:00:00','On Time',6),
('2022-01-04T12:00:00','Late',1),
('2022-01-04T12:00:00','On Time',2)
SELECT
    EventDate,
    Status,
    OrdersCount,
    TotalPerHour,
    StatusTotal,
    GrandStatusTotal,
    -- Multiplying by 1.0 below converts the integer division to a decimal, e.g. 0.45 or 0.123.
    -- To get an actual percent like 15% or 50%, multiply by 100.
    cast(1.0 * o.StatusTotal / o.GrandStatusTotal as decimal(5,3)) * 100 as Percentage
from
(
    select
        EventDate,
        Status,
        OrdersCount,
        TotalPerHour,
        StatusTotal,
        SUM(TotalPerHour) over (partition by Status order by EventDate asc) as GrandStatusTotal
    from
    (
        select
            EventDate,
            Status,
            OrdersCount,
            Sum(OrdersCount) over (partition by EventDate order by EventDate asc) as TotalPerHour,
            SUM(OrdersCount) over (partition by Status order by EventDate asc) as StatusTotal
        from OrderShipmentsStats
    ) as t
) as o
order by EventDate, Status
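If it helps, the same percentage can also be computed a bit more compactly, since the percentage per status is just the running status total divided by the running grand total. This is only a sketch against the same OrderShipmentsStats table created above:
SELECT
    EventDate,
    Status,
    OrdersCount,
    -- running total for this status divided by the running total across both statuses, as a percent
    cast(100.0 * SUM(OrdersCount) over (partition by Status order by EventDate)
               / SUM(OrdersCount) over (order by EventDate) as decimal(6,3)) as Percentage
FROM OrderShipmentsStats
ORDER BY EventDate, Status
For the sample rows above, the 11:00 AM entries come out as 8/18 late and 10/18 on time, matching the example in the question.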
I have a table with reviews for products. I want to sort product_ids that have more than 100 verified reviews (a verified review is a review with verified_purchase = True) by the fraction of 5-star reviews to all reviews. I tried to implement this in one select, but after numerous tries I ended up needing to create views. I managed to write a query that counts the number of 5-star reviews, but I can't do better. Can anybody give me a hint?
My best query:
select low_reviews.product_id, count(*)
from (
    select *
    from reviews
    where star_rating = 5
) low_reviews
left join (
    select distinct filtered_reviews.product_id
    from (
        select *
        from (
            select verified_reviews.product_id, count(*) as verified_reviews_number
            from (
                select *
                from reviews
                where verified_purchase = True
            ) as verified_reviews
            group by verified_reviews.product_id
        ) as counted_verified_reviews
        where counted_verified_reviews.verified_reviews_number > 100
    ) as filtered_reviews
) filtered_product_ids on low_reviews.product_id = filtered_product_ids.product_id
group by low_reviews.product_id;
Data example:
review_id customer_id product_id star_rating helpful_votes total_votes vine verified_purshase review_headline review_body review_date
14830128 R158AS05ZMH7VQ 0615349439 5 2 2 N false Planting a Church ... Witnessing To Dracula... 2011-02-14
I want to sort product_ids that have more than 100 verified reviews (a verified review is a review with verified_purchase = True) by the fraction of 5-star reviews to all reviews.
Based on your description, I would expect a query like this:
select product_id
from reviews
where verified_purchase
group by product_id
having count(*) > 100
order by avg( (star_rating = 5)::int ) desc;
The expression avg( (star_rating = 5)::int ) is a shorthand way of saying count(*) filter (where star_rating = 5) * 1.0 / count(star_rating). It works because it converts the expression star_rating = 5 to an int, which is 1 for true and 0 for false. The average is the proportion of rows where it is true.
Actually, the above assumes that you only care about star ratings of verified purchases. If you want to include all reviews (even non-verified ones) for the ordering:
select product_id
from reviews
group by product_id
having count(*) filter (where verified_purchase) > 100
order by avg( (star_rating = 5)::int ) desc;
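If you also want to return the fraction itself, the first query can be written with the filter form mentioned above. This is just a sketch using the same assumed column names (star_rating, verified_purchase):
select product_id,
       count(*) filter (where star_rating = 5) * 1.0 / count(*) as five_star_fraction
from reviews
where verified_purchase
group by product_id
having count(*) > 100
order by five_star_fraction desc;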
I need to find a group of lots to satisfy a demand for X items. I can't do it with aggregate functions, and it seems to me that I need something more than a window function; do you know anything that can help me solve this problem?
For example, if I have a demand for 1 item, the query should return any lot with a quantity greater than or equal to 1. But if I have a demand for 15 and there are no lots with that availability, it should return a lot of 10 and another with 5, or one of 10 and two of 3, etc.
With a programming language like Java this is simple, but is it possible with SQL? I am trying to achieve it with window functions, but I cannot find a way to add up the available quantity of the current row until reaching the required quantity.
SELECT id, VC_NUMERO_LOTE, SF_FECHA_CREACION, SI_ID_M_ARTICULO, VI_CANTIDAD, NEXT, VI_CANTIDAD + NEXT AS TOT
FROM (
    SELECT row_number() over (ORDER BY SF_FECHA_CREACION desc) AS id,
           VC_NUMERO_LOTE, SF_FECHA_CREACION, SI_ID_M_ARTICULO, VI_CANTIDAD,
           LEAD(VI_CANTIDAD, 1) OVER (ORDER BY SF_FECHA_CREACION desc) AS NEXT
    FROM PUBLIC.M_LOTE
    WHERE SI_ID_M_ARTICULO = 44974
      AND VI_CANTIDAD > 0
) AS T
WHERE MOD(id, 2) != 0
I tried using LEAD and then summing only the odd records, but I saw that this is not the way; any suggestions?
You need a recursive query like this:
demo:db<>fiddle
WITH RECURSIVE lots_with_rowcount AS (    -- 1
    SELECT
        *,
        row_number() OVER (ORDER BY avail_qty DESC) as rowcnt
    FROM mytable
), lots AS (                              -- 2
    SELECT                                -- 3
        lot_nr,
        avail_qty,
        rowcnt,
        avail_qty as total_qty
    FROM lots_with_rowcount
    WHERE rowcnt = 1
    UNION
    SELECT
        t.lot_nr,
        t.avail_qty,
        t.rowcnt,
        l.total_qty + t.avail_qty         -- 4
    FROM lots_with_rowcount t
    JOIN lots l ON t.rowcnt = l.rowcnt + 1
              AND l.total_qty < --<your demand here>
)
SELECT * FROM lots                        -- 5
1. This CTE only provides a row count for each record, which is used within the recursion to join the next record.
2. This is the recursive CTE. A recursive CTE contains two parts: the initial SELECT statement and the recursion.
3. Initial part: queries the lot record with the highest avail_qty value. Naturally, you can order the lots any way you like; largest quantity first yields the smallest output.
4. After the UNION comes the recursion part: the current row is joined to the previous output, with an additional condition: join only if the previous output does not yet cover your demand value. In that case, the next total_qty value is calculated from the previous total and the current quantity.
5. The recursion ends when there is no record left that fits the join condition. Then you can SELECT the entire recursion output.
Note: if your demand is higher than all your available quantities in total, this would return the entire table, because the recursion runs until either the demand is reached or the table ends. You should add a query beforehand which checks for this:
SELECT SUM(avail_qty) > demand FROM mytable
I gratefully fiddled around with S-Man's fiddle and found a query that is at least simpler to understand:
select lot_nr, avail_qty, tot_amount
from (
    select lot_nr, avail_qty,
           sum(avail_qty) over (order by avail_qty desc rows between unbounded preceding and current row) as tot_amount,
           sum(avail_qty) over (order by avail_qty desc rows between unbounded preceding and current row) - avail_qty as last_amount
    from mytable
) amounts
where last_amount < 15 -- your amount here
So this lists all rows for which the running total of the preceding rows (in descending order by avail_qty) has not yet reached the limit.
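For illustration, here is a self-contained version of that query with a few made-up lots (the lot numbers and quantities are purely hypothetical) and a demand of 15:
with mytable(lot_nr, avail_qty) as (
    values ('L1', 10), ('L2', 5), ('L3', 3)   -- hypothetical lots
)
select lot_nr, avail_qty, tot_amount
from (
    select lot_nr, avail_qty,
           sum(avail_qty) over (order by avail_qty desc rows between unbounded preceding and current row) as tot_amount,
           sum(avail_qty) over (order by avail_qty desc rows between unbounded preceding and current row) - avail_qty as last_amount
    from mytable
) amounts
where last_amount < 15;
-- returns L1 (running total 10) and L2 (running total 15); L3 is excluded because the demand of 15 is already covered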
Here is a simple old-school PL/pgSQL version that uses a (slow) loop. It returns only the lot numbers as an illustration. Basically, it returns lot numbers for a particular item_id in a certain order (one that reflects the required business rules) and allocates the available quantities until the allocated quantity equals or exceeds the required quantity.
create function get_lots(required_item integer, required_qty integer) returns setof text as
$$
declare
    r record;
    allocated_qty integer := 0;
begin
    for r in select * from lots where item_id = required_item order by <your biz-rule> loop
        return next r.lot_number;
        allocated_qty := allocated_qty + r.available_qty;
        exit when allocated_qty >= required_qty;
    end loop;
end;
$$ language plpgsql;
-- Use
select lot_id from get_lots(1, 17) lot_id;
I want to select some rows based on certain criteria, and then take one entry from that set and the 5 rows before it and after it.
Now, I can do this numerically if there is a primary key on the table (e.g. primary keys ranging from numerically 5 less than the target row's key to 5 more than the target row's key).
So select the row with the primary key of 7 and the nearby rows:
select primary_key from table where primary_key >= (7-5) order by primary_key limit 11;
2
3
4
5
6
-=7=-
8
9
10
11
12
But if I select only certain rows to begin with, I lose that numeric method of using primary keys (and that was assuming the keys didn't have any gaps in their order anyway), and need another way to get the closest rows before and after a certain targeted row.
The primary key output of such a select might look more random and thus be less susceptible to mathematical locating (since some results would be filtered out, e.g. with a where active=1):
select primary_key from table where primary_key > (34-5) and active=1
order by primary_key limit 11;
30
-=34=-
80
83
100
113
125
126
127
128
129
Note how, due to the gaps in the primary keys caused by the example where condition (for example because there are many inactive items), I'm no longer getting the closest 5 above and 5 below; instead I'm getting the closest 1 below and the closest 9 above.
There are a lot of ways to do it if you run two queries with a programming language, but here's one way to do it in a single SQL query:
(SELECT * FROM table WHERE id >= 34 AND active = 1 ORDER BY id ASC LIMIT 6)
UNION
(SELECT * FROM table WHERE id < 34 AND active = 1 ORDER BY id DESC LIMIT 5)
ORDER BY id ASC
This would return the 5 rows above, the target row, and 5 rows below.
Here's another way to do it with the analytic functions lead and lag. It would be nice if we could use analytic functions in the WHERE clause, but instead you need to use subqueries or CTEs. Here's an example that will work with the pagila sample database.
WITH base AS (
    SELECT lag(customer_id, 5) OVER (ORDER BY customer_id) lag,
           lead(customer_id, 5) OVER (ORDER BY customer_id) lead,
           c.*
    FROM customer c
    WHERE c.active = 1
      AND c.last_name LIKE 'B%'
)
SELECT base.*
FROM base
JOIN (
    -- Select the center row; coalesce so it still works if there aren't
    -- 5 rows in front or behind
    SELECT COALESCE(lag, 0) AS lag, COALESCE(lead, 99999) AS lead
    FROM base
    WHERE customer_id = 280
) sub ON base.customer_id BETWEEN sub.lag AND sub.lead
The problem with sgriffinusa's solution is that you don't know which row_number your center row will end up being. He assumed it will be row 30.
For a similar query I use analytic functions without a CTE. Something like:
select ...,
LEAD(gm.id) OVER (ORDER BY Cit DESC) as leadId,
LEAD(gm.id, 2) OVER (ORDER BY Cit DESC) as leadId2,
LAG(gm.id) OVER (ORDER BY Cit DESC) as lagId,
LAG(gm.id, 2) OVER (ORDER BY Cit DESC) as lagId2
...
where id = 25912
or leadId = 25912 or leadId2 = 25912
or lagId = 25912 or lagId2 = 25912
Such a query works faster for me than a CTE with a join (the answer from Scott Bailey), but it is of course less elegant.
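Since the analytic aliases can't actually be referenced in the WHERE clause of the same query level, a runnable variant of this idea wraps them in a subquery. The table name gm and the column Cit are placeholders carried over from the snippet above:
select *
from (
    select gm.*,
           LEAD(gm.id)    OVER (ORDER BY Cit DESC) as leadId,
           LEAD(gm.id, 2) OVER (ORDER BY Cit DESC) as leadId2,
           LAG(gm.id)     OVER (ORDER BY Cit DESC) as lagId,
           LAG(gm.id, 2)  OVER (ORDER BY Cit DESC) as lagId2
    from gm
) t
where 25912 in (id, leadId, leadId2, lagId, lagId2);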
You could do this utilizing row_number() (available as of 8.4). This may not be the correct syntax (not familiar with postgresql), but hopefully the idea will be illustrated:
SELECT *
FROM (SELECT ROW_NUMBER() OVER (ORDER BY primary_key) AS r, *
FROM table
WHERE active=1) t
WHERE 25 < r and r < 35
This will generate a first column having sequential numbers. You can use this to identify the single row and the rows above and below it.
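If you don't want to hard-code the row-number window (25 to 35 here), a variant can look up the target row's position first. This is only a sketch: mytable stands in for the question's table and 34 for the target primary key:
WITH numbered AS (
    SELECT ROW_NUMBER() OVER (ORDER BY primary_key) AS r, *
    FROM mytable
    WHERE active = 1
)
SELECT n.*
FROM numbered n
JOIN numbered target ON target.primary_key = 34   -- the target row
WHERE n.r BETWEEN target.r - 5 AND target.r + 5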
If you wanted to do it in a 'relationally pure' way, you could write a query that sorted and numbered the rows. Like:
select (
    select count(*)
    from employees b
    where b.name < a.name
) as idx, name
from employees a
order by name
Then use that as a common table expression. Write a select which filters it down to the rows you're interested in, then join it back onto itself using a criterion that the index of the right-hand copy of the table is no more than k larger or smaller than the index of the row on the left. Project over just the rows on the right. Like:
with numbered_emps as (
    select (
        select count(*)
        from employees b
        where b.name < a.name
    ) as idx, name
    from employees a
    order by name
)
select b.*
from numbered_emps a, numbered_emps b
where a.name like '% Smith'              -- this is your main selection criterion
  and ((b.idx - a.idx) between -5 and 5) -- this is your adjacency fuzzy-join criterion
What could be simpler!
I'd imagine the row-number based solutions will be faster, though.
I have a table called trends_points, this table has the following columns:
id (the unique id of the row)
userId (the id of the user that has entered this in the table)
term (a word)
time (a unix timestamp)
Now, I'm trying to run a query on this table which will get the rows in a specific time frame, ordered by how many times each value of the term column appears in the table during that time frame. So, for example, if the table has the following rows:
id | userId | term      | time
---+--------+-----------+------------
 1 | 28     | new year  | 1262231638
 2 | 37     | new year  | 1262231658
 3 | 1      | christmas | 1262231666
 4 | 34     | new year  | 1262231665
 5 | 12     | christmas | 1262231667
 6 | 52     | twitter   | 1262231669
I'd like the rows to come out ordered like this:
new year
christmas
twitter
This is because "new year" exists three times in the timeframe, "christmas" exists twice and "twitter" is only in one row.
So far I've assumed it's a simple WHERE for the specific timeframe part of the query and a GROUP BY to stop the same term from coming up twice in the list.
This makes the following query:
SELECT *
FROM `trends_points`
WHERE ( time >= <time-period_start>
AND time <= <time-period_end> )
GROUP BY `term`
Does anyone know how I'd do the final part of the query (ordering the query's results by how many rows contain the same "term" column value)?
Use:
SELECT tp.term,
COUNT(*) 'term_count'
FROM trends_points tp
WHERE tp.time BETWEEN <time-period_start> AND <time-period_end>
GROUP BY tp.term
ORDER BY term_count DESC, tp.term
See this question about why to use BETWEEN vs using the >=/<= operators.
Keep in mind there can be ties - the ORDER BY falls back to sorting alphabetically by the term value when this happens, but you could use other criteria.
Also, if you want to additionally limit the number of rows/terms coming back you can add the LIMIT clause to the end of the query. For example, this query will return the top five terms:
SELECT tp.term,
COUNT(*) 'term_count'
FROM trends_points tp
WHERE tp.time BETWEEN <time-period_start> AND <time-period_end>
GROUP BY tp.term
ORDER BY term_count DESC, tp.term
LIMIT 5
Quick answer:
SELECT
term, count(*) as thecount
FROM
mytable
WHERE
(...)
GROUP BY
term
ORDER BY
thecount DESC
SELECT t.term
FROM trends_points t
WHERE t.time >= <time-period_start> AND t.time <= <time-period_end>
GROUP BY t.term
ORDER BY COUNT(t.term) DESC
COUNT() will give you the number of rows in the group, so just order by that.
SELECT * FROM `trends_points`
WHERE ( `time` >= <time-period_start> AND `time` <= <time-period_end> )
GROUP BY `term`
ORDER BY COUNT(`term`) DESC