Calculating percent of votes inside a MySQL statement - sql

UPDATE polls_options SET `votes`=`votes`+1, `percent`=ROUND((`votes`+1) / (SELECT voters FROM polls WHERE poll_id=? LIMIT 1) * 100,1)
WHERE option_id=?
AND poll_id=?
I don't have table data yet to test it properly. :)
And by the way, what data type should the percentage be stored as in the database?
Thanks for the help!

You don't say which database you're using (PostgreSQL, MySQL, Oracle, etc.), but if you're using MySQL you could get away with a TINYINT data type: assuming your percentages will always be between 0 and 100, you'll be fine. Note, though, that your ROUND(..., 1) keeps one decimal place; TINYINT would drop it, so use something like DECIMAL(4,1) if that digit matters.
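For example, a minimal sketch of the column definition in MySQL (assuming the polls_options table from your query):
-- whole-number percentages (0-100)
ALTER TABLE polls_options MODIFY `percent` TINYINT UNSIGNED NOT NULL DEFAULT 0;
-- or, to keep the decimal place produced by ROUND(..., 1)
ALTER TABLE polls_options MODIFY `percent` DECIMAL(4,1) NOT NULL DEFAULT 0.0;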

Your problem seems to be that you don't have any test data, so you are unable to test the syntax of your query. But that is a problem you can easily solve yourself, and it doesn't even take that long:
Just make up some data and use that to test.
This isn't as hard as it might sound. For example, here I create two polls, the first of which has four votes and the second of which has two. I then try to add a vote to option 1 of poll 1 using your query.
CREATE TABLE polls_options (
poll_id INT NOT NULL,
option_id INT NOT NULL,
votes INT NOT NULL,
percent FLOAT NOT NULL
);
INSERT INTO polls_options (poll_id, option_id, votes, percent) VALUES
(1, 1, 1, 25),
(1, 2, 3, 75),
(2, 1, 1, 50),
(2, 2, 1, 50);
CREATE TABLE polls (poll_id INT NOT NULL, voters INT NOT NULL);
INSERT INTO polls (poll_id, voters) VALUES
(1, 4),
(2, 2);
UPDATE polls_options
SET votes = votes + 1,
percent = ROUND((votes + 1) / (SELECT voters FROM polls WHERE poll_id = 1 LIMIT 1) * 100,1)
WHERE option_id = 1
AND poll_id = 1;
SELECT * FROM polls_options;
Here are the results:
poll_id  option_id  votes  percent
1        1          2      75
1        2          3      75
2        1          1      50
2        2          1      50
You can see that there are a number of problems:
The polls table isn't updated yet, so the total vote count for poll 1 is wrong (4 instead of 5). Notice that you don't even need this table: it duplicates information that can already be found in the polls_options table. Having to keep these two tables in sync is extra work. If you need to adjust the results for some reason, for example to remove some spam voting, you will have to remember to update both tables. It's unnecessary extra work and an extra source of errors.
Even if you had remembered to update the polls table first, the percentage for option 1 would still be calculated incorrectly: it comes out as 3/5 instead of 2/5, because MySQL evaluates the SET clauses from left to right, so votes has already been incremented when the percent expression adds 1 again - effectively ((votes + 1) + 1).
The percentage for option 2 isn't updated, causing the total percentage for poll 1 to be greater than 100.
You probably shouldn't even be storing the percentage in the database. Instead of persisting this value, consider calculating it on the fly only when you need it (see the sketch below).
You might want to reconsider your table design to avoid redundant data. Consider normalizing your table structure. If you do this then all the problems I listed above will be solved and your statements will be much simpler.
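For example, here is a minimal sketch of computing the percentages on the fly in MySQL, using only the polls_options table defined above (no percent column or polls table needed):
SELECT po.option_id,
       po.votes,
       ROUND(po.votes / totals.poll_votes * 100, 1) AS percent
FROM polls_options po
JOIN (SELECT poll_id, SUM(votes) AS poll_votes
      FROM polls_options
      GROUP BY poll_id) totals ON totals.poll_id = po.poll_id
WHERE po.poll_id = 1;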
Good luck!


Keep track of item's versions after new insert

I'm currently working on creating a log table that will hold all the data from another table and will also record, as versions, changes in the prices of items in the main table.
I would like to know how to save the versions, that is, how to increment the version by 1 each time the same item is inserted into the Log table.
The Log table is loaded via a Merge of data coming from the User API, on a python script using PYODBC:
MERGE LogTable as t
USING (VALUES(?,?,?)) AS s(ID, ItemPrice, ItemName)
ON t.ID = s.ID AND t.ItemPrice = s.ItemPrice
WHEN NOT MATCHED BY TARGET
THEN INSERT (ID, ItemPrice, ItemName, Date)
VALUES (s.ID, s.ItemPrice, s.ItemName, GETDATE())
Table example:
Id | ItemPrice | ItemName | Version | Date
---|-----------|----------|---------|------
1  | 50        | Foo      | 1       | Today
2  | 30        | bar      | 1       | Today
And after inserting the Item with ID = 1 again with a different price, the table should look like this:
Id | ItemPrice | ItemName | Version | Date
---|-----------|----------|---------|------
1  | 50        | Foo      | 1       | Today
2  | 30        | bar      | 1       | Today
1  | 45        | Foo      | 2       | Today
I saw some similar questions mentioning triggers, but in those cases the data was not inserted into the Log table via a MERGE.
The following may help you; modify your insert statement like this:
Insert Into tbl_name
Values (1, 45, 'Foo',
COALESCE((Select MAX(D.Version) From tbl_name D Where D.Id = 1), 0) + 1, GETDATE())
See a demo from db<>fiddle.
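In the parameterized style of your pyodbc MERGE, the same insert might look like this sketch (I'm assuming your real table and column names; note the item ID has to be passed twice):
INSERT INTO LogTable (ID, ItemPrice, ItemName, Version, Date)
VALUES (?, ?, ?,
    COALESCE((SELECT MAX(D.Version) FROM LogTable D WHERE D.ID = ?), 0) + 1,
    GETDATE())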
Update, incorporating the enhancements proposed by @GarethD:
First: using ISNULL instead of COALESCE will perform better here. Performance matters when the fallback is not a constant but a query of some sort: COALESCE is expanded to a CASE expression that can evaluate the subquery twice, while ISNULL evaluates it once.
Second: prevent the race condition that can occur when multiple threads try to read the MAX value at the same time. The query then becomes the following:
Insert Into tbl_name WITH (HOLDLOCK)
Values (1, 45, 'Foo',
ISNULL((Select MAX(D.Version) From tbl_name D Where D.Id = 1), 0) + 1, GETDATE())
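If you'd rather keep the MERGE from the question instead of switching to a plain INSERT, an untested sketch is to compute the next version inside the USING source (again assuming your real table and column names; the same race-condition caveat applies):
MERGE LogTable AS t
USING (
    SELECT v.ID, v.ItemPrice, v.ItemName,
        ISNULL((SELECT MAX(l.Version) FROM LogTable l WHERE l.ID = v.ID), 0) + 1 AS NextVersion
    FROM (VALUES (?, ?, ?)) AS v(ID, ItemPrice, ItemName)
) AS s
ON t.ID = s.ID AND t.ItemPrice = s.ItemPrice
WHEN NOT MATCHED BY TARGET
THEN INSERT (ID, ItemPrice, ItemName, Version, Date)
VALUES (s.ID, s.ItemPrice, s.ItemName, s.NextVersion, GETDATE());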

How to speed up a slow MariaDB SQL query that has a flat BNL join?

I'm having problems with a slow SQL query running on the following system:
Operating system: Debian 11 (bullseye)
Database: MariaDB 10.5.15 (the version packaged for bullseye)
The table schemas and some sample data (no DB Fiddle as it doesn't support MariaDB):
DROP TABLE IF EXISTS item_prices;
DROP TABLE IF EXISTS prices;
DROP TABLE IF EXISTS item_orders;
CREATE TABLE item_orders
(
id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
ordered_date DATE NOT NULL
) Engine=InnoDB;
CREATE TABLE prices
(
id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
selected_flag TINYINT UNSIGNED NOT NULL
) Engine=InnoDB;
CREATE TABLE item_prices
(
item_order_id INT UNSIGNED NOT NULL,
price_id INT UNSIGNED NOT NULL,
PRIMARY KEY (item_order_id, price_id),
FOREIGN KEY (item_order_id) REFERENCES item_orders(id),
FOREIGN KEY (price_id) REFERENCES prices(id)
) Engine=InnoDB;
INSERT INTO item_orders VALUES (1, '2022-01-01');
INSERT INTO item_orders VALUES (2, '2022-02-01');
INSERT INTO item_orders VALUES (3, '2022-03-01');
INSERT INTO prices VALUES (1, 0);
INSERT INTO prices VALUES (2, 0);
INSERT INTO prices VALUES (3, 1);
INSERT INTO prices VALUES (4, 0);
INSERT INTO prices VALUES (5, 0);
INSERT INTO prices VALUES (6, 1);
INSERT INTO item_prices VALUES (1, 1);
INSERT INTO item_prices VALUES (1, 2);
INSERT INTO item_prices VALUES (1, 3);
INSERT INTO item_prices VALUES (2, 4);
INSERT INTO item_prices VALUES (2, 5);
INSERT INTO item_prices VALUES (3, 6);
A high-level overview of the table usage is:
For any given month, there will be thousands of rows in item_orders.
A row in item_orders will link to zero or more rows in item_prices (item_orders.id = item_prices.item_order_id).
A row in item_prices will have exactly one linked row in prices (item_prices.price_id = prices.id).
For any given row in item_orders, there will be zero or one row in prices where the selected_flag is 1 (item_orders.id = item_prices.item_order_id AND item_prices.price_id = prices.id AND prices.selected_flag = 1). This is enforced by the application rather than the database (i.e. it's not defined as a CONSTRAINT).
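(Side note: that application-enforced rule can be spot-checked with a query along these lines, using the schema above - it should return no rows:)
SELECT item_prices.item_order_id, COUNT(*) AS selected_count
FROM item_prices
JOIN prices ON prices.id = item_prices.price_id
WHERE prices.selected_flag = 1
GROUP BY item_prices.item_order_id
HAVING COUNT(*) > 1;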
What I want to get, in a single query, are:
The number of rows in item_orders.
The number of rows in item_orders where the related selected_flag is 1.
At the moment I have the following query:
SELECT
COUNT(item_orders.id) AS item_order_count,
SUM(CASE WHEN prices.id IS NOT NULL THEN 1 ELSE 0 END) AS item_order_selected_count
FROM
item_orders
LEFT JOIN prices ON prices.id IN (
SELECT price_id
FROM item_prices
WHERE
item_prices.item_order_id = item_orders.id)
AND prices.selected_flag = 1
This query returns the correct data (item_order_count = 3, item_order_selected_count = 2), however it takes a long time (over 10 seconds) to run on a live dataset, which is too slow for users (it is a heavily-used report, refreshed repeatedly through the day). I think the problem is the subquery in the LEFT JOIN, as removing the LEFT JOIN and the associated SUM reduces the query time to around 0.1 seconds. Also, the EXPLAIN output for the join has this in the Extra column:
Using where; Using join buffer (flat, BNL join)
Searching for 'flat BNL join' reveals a lot of information, of which the summary seems to be: 'BNL joins are slow, avoid them if you can'.
Is it possible to rewrite this query to return the same information, but avoiding the BNL join?
Things I've considered already:
All the ID columns are indexed (item_orders.id, prices.id, item_prices.item_order_id, item_prices.price_id).
Splitting the query in two - one for item_order_count (no JOIN), the other for item_order_selected_count (INNER JOIN, as I only need rows which match). This works but isn't ideal as I want to build up this query to return more data (I've stripped it back to the minimum for this question). Also, I'm trying to keep the query output as close as possible to what the user will see, as that makes debugging easier and makes the database (which is optimised for that workload) do the work, rather than the application.
Changing the MariaDB configuration: Some of the search results for BNL joins suggest changing configuration options, however I'm wary of doing this as there are hundreds of other queries in the application and I don't want to cause a regression (e.g. speed up this query but accidentally slow down all the others).
Upgrading MariaDB: This would be a last resort as it would involve using a version different to that packaged with Debian, might break other parts of the application, and the system has just been through a major upgrade.
Not sure whether this will be any faster, but it's worth a try (joins on indexed foreign keys are fast, and sometimes simplicity is king...)
SELECT
(SELECT COUNT(*) FROM item_orders) AS item_order_count,
(SELECT COUNT(*)
FROM item_orders io
JOIN item_prices ip
ON io.id = ip.item_order_id
JOIN prices p
ON ip.price_id = p.id
WHERE p.selected_flag = 1) AS item_order_selected_count;
I came back to this question this week as the performance got even worse as the number of rows increased, to the point where it was taking over 2 minutes to run the query (with around 100,000 rows in the item_orders table, so hardly 'big data').
I remembered that it was possible to list multiple tables in the FROM clause and wondered if the same was true of a LEFT JOIN. It turns out this is the case and the query can be rewritten as:
SELECT
COUNT(item_orders.id) AS item_order_count,
SUM(CASE WHEN prices.id IS NOT NULL THEN 1 ELSE 0 END) AS item_order_selected_count
FROM
item_orders
LEFT JOIN (item_prices, prices) ON
item_prices.item_order_id = item_orders.id
AND prices.id = item_prices.price_id
AND prices.selected_flag = 1
This returns the same results but takes less than a second to execute. Unfortunately I don't know any relational algebra to prove this, but effectively what I am saying is 'only LEFT JOIN where everything matches on both item_prices and prices'.
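For anyone puzzled by the comma: it is the MySQL/MariaDB nested-join shorthand, and the same query can be written with an explicit inner join inside the LEFT JOIN (a sketch, not benchmarked against the live dataset):
SELECT
    COUNT(item_orders.id) AS item_order_count,
    SUM(CASE WHEN prices.id IS NOT NULL THEN 1 ELSE 0 END) AS item_order_selected_count
FROM
    item_orders
LEFT JOIN (item_prices JOIN prices
        ON prices.id = item_prices.price_id
        AND prices.selected_flag = 1)
    ON item_prices.item_order_id = item_orders.id;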

An SQL query with OFFSET/FETCH is returning unexpected results

I have an SQL Server 2019 database table named User with 1,000 rows.
I am having a hard time understanding how this SELECT query with OFFSET/FETCH is returning unexpected results:
SELECT *
FROM [User]
WHERE (([NameGiven] LIKE '%1%')
OR ([NameFamily] LIKE '%2%'))
ORDER BY [Id] ASC
OFFSET 200 ROWS FETCH NEXT 100 ROWS ONLY;
The query returns rows with Id values ranging from 264 to 452, 100 rows in total. Why would records 201, 211, etc. not show up? Am I wrong in my expectations, or is there a mistake in the query criteria?
If I remove the OFFSET/FETCH options from the ORDER BY clause, the results are as expected. That makes me think that the WHERE clause is not the problem.
Any advice would be appreciated.
The problem is that you expect the offset to happen before the filter, but in actuality it doesn't happen until after the filter. Think about a simpler example, where you want all the people named 'sam' and there are more people named 'sam' than your offset:
CREATE TABLE dbo.foo(id int, name varchar(32));
INSERT dbo.foo(id, name) VALUES
(1, 'sam'),
(2, 'sam'),
(3, 'bob'),
(4, 'sam'),
(5, 'sam'),
(6, 'sam');
If you just say:
SELECT id FROM dbo.foo WHERE name = 'sam';
You get:
1
2
4
5
6
If you then add an offset of 3,
-- this offsets 3 rows _from the filtered result_,
-- not the full table
SELECT id FROM dbo.foo
WHERE name = 'sam'
ORDER BY id
OFFSET 3 ROWS FETCH NEXT 2 ROWS ONLY;
You get:
5
6
It takes all the rows that match the filter, then skips the first three of those filtered rows (1, 2, 4) - not 1, 2, 3 as your question implies you expect.
Example db<>fiddle
Going back to your case in the question, you are filtering out rows like 77 and 89 because they don't contain a 1 or a 2. So the offset you asked for is 200, but in terms of which rows that means, the offset is actually more like:
200 PLUS the number of rows that *don't* match your filter
until you hit the 200th row that *does*
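If you want to see this on your own data, count how many filtered rows come before the first row of the page you got back (a sketch using the columns from your query; if the count is exactly 200, the offset is behaving as described):
SELECT COUNT(*)
FROM [User]
WHERE (([NameGiven] LIKE '%1%')
    OR ([NameFamily] LIKE '%2%'))
    AND [Id] < 264;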
You could try to force the filter to happen after, e.g.:
;WITH u AS
(
SELECT *
FROM [User]
ORDER BY [Id]
OFFSET 200 ROWS FETCH NEXT 100 ROWS ONLY
)
SELECT * FROM u
WHERE (([NameGiven] LIKE '%1%')
OR ([NameFamily] LIKE '%2%'))
ORDER BY [Id]; -- yes you still need this one
...but then you would almost certainly never get 100 rows in each page because some of those 100 rows would then be removed by the filter. I don't think this is what you're after.

Storing operators with operands in table in SQL Server

I work at a company that sells many versions of a product to several different resellers, and each reseller adds parameters that change the resale price of the product.
For example, we sell a vehicle service contract where, for a certain vehicle, the reserve price of the contract is $36. The dealer marks up every reserve by 30% (to $47), adds a premium of $33 to the reserve price (now $80), and adds a set of fees--like commissions and administrative costs--to bring the contract total to $235.
The reserve price is the same for every dealer on this program, but they all use different increases that are either flat or a percentage. There are of course dozens of parameters for each contract.
My question is this: can I store a table of parameters like "x*1.3" or "y+33" that are indexed to a unique ID, and then join or cross apply that table to one full of values like the reserve price mentioned above?
I looked at the SQL Server "table valued parameters," but I don't see from the MSDN examples if they apply to my case.
Thanks so much for your kind replies.
EDIT:
As I feared, my example seems to be a little too esoteric (my fault). So consider this:
Twinings recommends different temperatures for brewing various kinds of tea. Depending on your elevation, your boiling point might be different. So there must be a way to store a table of values that looks like this--
[Twinings brewing-temperature chart] (source: twinings.co.uk)
A user enters a ZIP code that has a corresponding elevation, and SQL Server calculates and returns the correct brew temperature for you. Is that any better an example?
Again, thanks to those who have already contributed.
I don't know if I like this solution, but it does seem to at least work. The only real way to iteratively construct totals is to use some form of "loop", and the most set-based way of doing that these days is with a recursive CTE:
declare @actions table (ID int identity(1,1) not null, ApplicationOrder int not null,
Multiply decimal(12,4), AddValue decimal(12,4))
insert into @actions (ApplicationOrder,Multiply,AddValue) values
(1,1.3,null),
(2,null,33),
(3,null,155)
declare @todo table (ID int not null, Reserve decimal(12,4))
insert into @todo(ID,Reserve) values (1,36)
;With Applied as (
select
t.ID, Reserve as Computed, 0 as ApplicationOrder
from
@todo t
union all
select a.ID,
CONVERT(decimal(12,4),
((a.Computed * COALESCE(Multiply,1)) + COALESCE(AddValue,0))),
act.ApplicationOrder
from
Applied a
inner join
@actions act
on
a.ApplicationOrder = act.ApplicationOrder - 1
), IdentifyFinal as (
select
*,ROW_NUMBER() OVER (PARTITION BY ID ORDER BY ApplicationOrder desc) as rn
from Applied
)
select
*
from
IdentifyFinal
where
rn = 1
Here I've got a simple single set of actions to apply to each price (in @actions) and a set of prices to apply them to (in @todo). I then use the recursive CTE to apply each action in turn.
My result:
ID Computed ApplicationOrder rn
----------- --------------------------------------- ---------------- --------------------
1 234.8000 3 1
Which isn't far off your $235 :-)
I appreciate that you may have different actions to apply to each particular price, and so my @actions may instead, for you, be something that works out which rules to apply in each case. That may be one or more CTEs before mine that do that work, possibly using another ROW_NUMBER() expression to work out the correct ApplicationOrder values. You may also need more columns and join conditions in the CTE to satisfy this.
Note that I've modelled the actions so that each can apply a multiplication and/or an add at each stage. You may want to play around with that sort of idea (or e.g. add a "rounding" flag of some kind as well so that we might well end up with the $235 value).
Applied ends up containing the initial values and each intermediate value as well. The IdentifyFinal CTE gets us just the final results, but you may want to select from Applied instead just to see how it worked.
You can use a very simple structure to store costs:
DECLARE @costs TABLE (
ID INT,
Perc DECIMAL(18, 6),
Flat DECIMAL(18, 6)
);
The Perc column represents a percentage of the base price. It is possible to store complex calculations in this structure, but it gets ugly. For example, if we have:
Base Price: $100
Flat Fee: $20
Tax: 11.5%
Processing Fee: 3%
Then it will be stored as:
INSERT INTO @costs VALUES
-- op example
(1, 0.0, NULL),
(1, 0.3, NULL),
(1, NULL, 33.0),
(1, NULL, 155.0),
-- above example
(2, 0.0, NULL),
(2, NULL, 20.0),
(2, 0.115, NULL),
(2, NULL, 20.0 * 0.115),
(2, 0.03, NULL),
(2, NULL, 20.0 * 0.03),
(2, 0.115 * 0.03, NULL),
(2, NULL, 20 * 0.115 * 0.03);
And queried as:
DECLARE @tests TABLE (
ID INT,
BasePrice DECIMAL(18, 2)
);
INSERT INTO @tests VALUES
(1, 36.0),
(2, 100.0);
SELECT t.ID, SUM(
BasePrice * COALESCE(Perc, 0) +
COALESCE(Flat, 0)
) AS TotalPrice
FROM @tests t
INNER JOIN @costs c ON t.ID = c.ID
GROUP BY t.ID
ID | TotalPrice
---+-------------
1 | 234.80000000
2 | 137.81400000
The other, better, solution is to use a structure such as the following:
DECLARE @costs TABLE (
ID INT,
CalcOrder INT,
PercOfBase DECIMAL(18, 6),
PercOfPrev DECIMAL(18, 6),
FlatAmount DECIMAL(18, 6)
);
Where CalcOrder represents the order in which the calculation is done (e.g. tax before processing fee). PercOfBase and PercOfPrev specify whether the base price or the running total is multiplied. This allows you to handle situations where, for example, a commission is added on the base price but must not be included in tax and vice versa. This approach requires a recursive or iterative query, sketched below.
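Here is a self-contained, untested sketch of that recursive query, reusing the numbers from the question (PercOfPrev multiplies the running total, PercOfBase the base price):
DECLARE @costs TABLE (
    ID INT,
    CalcOrder INT,
    PercOfBase DECIMAL(18, 6),
    PercOfPrev DECIMAL(18, 6),
    FlatAmount DECIMAL(18, 6)
);
DECLARE @tests TABLE (
    ID INT,
    BasePrice DECIMAL(18, 2)
);
INSERT INTO @costs VALUES
(1, 1, NULL, 0.3, NULL),   -- mark up the running total by 30%
(1, 2, NULL, NULL, 33.0),  -- flat premium
(1, 3, NULL, NULL, 155.0); -- flat fees
INSERT INTO @tests VALUES (1, 36.0);

WITH Calc AS (
    SELECT t.ID, t.BasePrice, t.BasePrice AS Running, 0 AS CalcOrder
    FROM @tests t
    UNION ALL
    SELECT c.ID, c.BasePrice,
        CONVERT(DECIMAL(18, 2),
            c.Running
            + c.BasePrice * COALESCE(s.PercOfBase, 0)
            + c.Running * COALESCE(s.PercOfPrev, 0)
            + COALESCE(s.FlatAmount, 0)),
        s.CalcOrder
    FROM Calc c
    INNER JOIN @costs s
        ON s.ID = c.ID AND s.CalcOrder = c.CalcOrder + 1
)
SELECT c.ID, c.Running AS TotalPrice -- 36 * 1.3 + 33 + 155 = 234.80
FROM Calc c
WHERE c.CalcOrder = (SELECT MAX(s.CalcOrder) FROM @costs s WHERE s.ID = c.ID);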

finding consecutive date pairs in SQL

I have a question here that looks a little like some of the ones I found in search, but those have solutions for slightly different problems and, importantly, solutions that don't work in SQL Server 2000.
I have a very large table with a lot of redundant data that I am trying to reduce down to just the useful entries. It's a history table, and the way it works, if two entries are essentially duplicates and consecutive when sorted by date, the latter can be deleted. The data from the earlier entry will be used when historical data is requested from a date between the effective date of that entry and the next non-duplicate entry.
The data looks something like this:
id  user_id  effective_date  important_value  useless_value
1   1        1/3/2007        3                0
2   1        1/4/2007        3                1
3   1        1/6/2007        NULL             1
4   1        2/1/2007        3                0
5   2        1/5/2007        12               1
6   3        1/1/1899        7                0
With this sample set, we would consider two consecutive rows duplicates if the user_id and the important_value are the same. From this sample set, we would only delete the row with id=2, preserving the information from 1/3/2007, showing that the important_value changed on 1/6/2007, and then showing the relevant change again on 2/1/2007.
My current approach is awkward and time-consuming, and I know there must be a better way. I wrote a script that uses a cursor to iterate through the user_id values (since that breaks the huge table up into manageable pieces), and creates a temp table of just the rows for that user. Then to get consecutive entries, it takes the temp table, joins it to itself on the condition that there are no other entries in the temp table with a date between the two dates. In the pseudocode below, UDF_SameOrNull is a function that returns 1 if the two values passed in are the same or if they are both NULL.
WHILE (@@FETCH_STATUS <> -1)
BEGIN
SELECT * INTO #history FROM History WHERE user_id = @UserId
--return entries to delete
SELECT h2.id
INTO #delete_history_ids
FROM #history h1
JOIN #history h2 ON
h1.effective_date < h2.effective_date
AND dbo.UDF_SameOrNull(h1.important_value, h2.important_value)=1
WHERE NOT EXISTS (SELECT 1 FROM #history hx WHERE hx.effective_date > h1.effective_date and hx.effective_date < h2.effective_date)
DELETE h1
FROM History h1
JOIN #delete_history_ids dh ON
h1.id = dh.id
FETCH NEXT FROM UserCursor INTO @UserId
END
It also loops over the same set of duplicates until there are none, since taking out rows creates new consecutive pairs that are potentially dupes. I left that out for simplicity.
Unfortunately, I must use SQL Server 2000 for this task and I am pretty sure that it does not support ROW_NUMBER() for a more elegant way to find consecutive entries.
Thanks for reading. I apologize for any unnecessary backstory or errors in the pseudocode.
OK, I think I figured this one out, excellent question!
First, I made the assumption that the effective_date column will not be duplicated for a user_id. I think it can be modified to work if that is not the case - so let me know if we need to account for that.
The process basically takes the table of values and self-joins on equal user_id and important_value and prior effective_date. Then, we do 1 more self-join on user_id that effectively checks to see if the 2 joined records above are sequential by verifying that there is no effective_date record that occurs between those 2 records.
It's just a select statement for now - it should select all records that are to be deleted. So if you verify that it is returning the correct data, simply change the select * to delete tcheck.
Let me know if you have questions.
select
*
from
History tcheck
inner join History tprev
on tprev.[user_id] = tcheck.[user_id]
and (tprev.important_value = tcheck.important_value
or (tprev.important_value is null and tcheck.important_value is null)) -- treat two NULLs as equal, like UDF_SameOrNull
and tprev.effective_date < tcheck.effective_date
left join History checkbtwn
on tcheck.[user_id] = checkbtwn.[user_id]
and checkbtwn.effective_date < tcheck.effective_date
and checkbtwn.effective_date > tprev.effective_date
where
checkbtwn.[user_id] is null
OK guys, I did some thinking last night and I think I found the answer. I hope this helps someone else who has to match consecutive pairs in data and for some reason is also stuck in SQL Server 2000.
I was inspired by the other results that say to use ROW_NUMBER(), and I used a very similar approach, but with an identity column.
--create table with identity column
CREATE TABLE #history (
id int,
user_id int,
effective_date datetime,
important_value int,
useless_value int,
idx int IDENTITY(1,1)
)
--insert rows ordered by effective_date and now indexed in order
INSERT INTO #history
SELECT * FROM History
WHERE user_id = @user_id
ORDER BY effective_date
--get pairs where consecutive values match
SELECT *
FROM #history h1
JOIN #history h2 ON
h1.idx+1 = h2.idx
WHERE h1.important_value = h2.important_value
OR (h1.important_value IS NULL AND h2.important_value IS NULL) -- treat two NULLs as a match, like UDF_SameOrNull
With this approach, I still have to iterate over the results until it returns nothing, but I can't think of any way around that and this approach is miles ahead of my last one.
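For reference, feeding those pairs back into a DELETE against the real table might look like this sketch (it removes the later row of each matched pair, reusing the NULL-tolerant match above; after each pass, #history has to be rebuilt before the next one):
DELETE History
FROM History
JOIN (
    SELECT h2.id
    FROM #history h1
    JOIN #history h2 ON h1.idx + 1 = h2.idx
    WHERE h1.important_value = h2.important_value
        OR (h1.important_value IS NULL AND h2.important_value IS NULL)
) dupes ON History.id = dupes.id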