Is storing calculated values in database a bad idea? [closed] - sql

Let's say I'm working on an online store project. So, I'm gonna have to create a table called 'Product' in my database. Say I also want users to be able to 'like' my products. That requires me to create another table called 'ProductLike' to store users' IDs alongside the ID of the product they like (a junction table).
The main scenario: Every time a user sends a request to my website to get a product page, I'm gonna have to recalculate the number of likes that product has.
My question is: I know the standard approach is not to store 'calculated values' in the database (normalization). But what about cases like this, where it might be expensive to calculate something? For instance, in the example above, isn't it better to have a column named 'NumberOfLikes' in the 'Product' table to store the calculated number of the product's likes for fast retrieval?

Update
isn't it better to have a column named 'NumberOfLikes' in the 'Product' table to store the calculated number of the product likes?
IMHO, the direct answer to this question is "No, unless you have a real performance problem due to the counting of likes".
If you do have a performance problem, and you've identified its source as the count of likes, then you might want to consider adding a LikesCount column to the products table. If you do add such a column, please note you are going to have to update it on every change to the ProductLike table - delete, update, and insert.
This means you are going to have to write a trigger for this table to handle all these cases, but it shouldn't be too hard since you can do everything in a single trigger - something like this:
create trigger ProductLikeChanged on ProductLike
for insert, update, delete
as
    update p
    set LikesCount = (select count(*) from ProductLike as pl where pl.productId = p.Id)
    from product as p
    where exists
    (
        select 1 from inserted as i where p.id = i.productId
    )
    or exists
    (
        select 1 from deleted as d where p.id = d.productId
    )
Original version
Based on your description, "calculating" the number of likes for a product is simply a count of rows in the ProductLike table where the product id is the id of the product you are currently displaying to the user.
This can be done very fast, especially if the ProductLike table's clustered index is on ProductId and then UserId, thus allowing SQL Server to use a clustered index seek and not a table scan.
Basically, your ProductLike table should look like this:
Create table ProductLike
(
    ProductId int,
    UserId int,
    Constraint PK_ProductLike PRIMARY KEY (ProductId, UserId)
)
Note that by default, SQL Server will use the primary key as the clustered index of the table.
Then your select statement for the product page can be something like this:
select Name, Description, -- Other product related details
       (select count(*)
        from productLike as pl
        where pl.ProductId = p.Id) as likeCount
from product as p

By "calculated" value, I suspect you mean an accumulation of the number of requests.
The simplest approach in terms of database design and maintenance is to store each request as a row in a table and to summarize when needed. This has certain nice features:
A user can "unrequest" or "unlike" quite easily.
Inserts are (typically) at the "end" of the table, minimizing fragmentation and speeding inserts. Note: This can result in contention for the last page if multiple threads are writing at the same time.
Counts can be flexible, limited to a particular date range or type of user for instance (see the sketch after this list).
The data is drill-downable. That is, for a given count you know exactly what produced it.
Summarization is often very reasonable, if you have the right indexes and partitions on the data.
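For instance, the date-limited count mentioned in the list might look like the sketch below. One assumption to flag: it requires ProductLike to carry a CreatedAt timestamp column, which is not part of the schema shown earlier.
select count(*) as likeCount
from ProductLike as pl
where pl.ProductId = @productId
  and pl.CreatedAt >= @rangeStart   -- CreatedAt is a hypothetical audit column
  and pl.CreatedAt < @rangeEnd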
That said, such summarization does not meet all needs. A traditional approach is to use a trigger to maintain summary tables -- adding lots of complexity for maintenance (you need insert, delete, and update triggers). I think @daniherrera's answer gives guidance on the best approach.

In real life, these are real solutions. You may need to materialize this field and denormalize the database to keep performance acceptable. You have several options for keeping this field up to date:
Materialized views.
Triggers.
Stored procedures.
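In SQL Server, the materialized-view option takes the form of an indexed view. A minimal sketch, reusing the ProductLike table from the question (the view and index names are illustrative):
create view dbo.ProductLikeCount
with schemabinding
as
    select productId, count_big(*) as LikesCount  -- COUNT_BIG is required in an indexed view with GROUP BY
    from dbo.ProductLike
    group by productId
go

create unique clustered index IX_ProductLikeCount
    on dbo.ProductLikeCount (productId)
Once the unique clustered index exists, SQL Server maintains the stored counts automatically on every insert, update, and delete against ProductLike.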
Disclaimer: Your question is primarily opinion-based; I guess it will be closed in a while.

The number of products liked by a user can be fetched from the UserProductLike table, where userid is the id of your user.

Related

How to create an aggregate table (data mart) that will improve chart performance?

I created a table named user_preferences where user preferences have been grouped by user_id and month.
Each month I collect all user_ids and assign all preferences:
city
district
number of rooms
the maximum price they can spend
The plan assumes displaying a graph showing users' shopping intentions like this:
The blue line is the number of interested users for the selected values in the filters.
The graph should enable filtering by parameters marked in red.
What you see above is a simplified form for clarifying the subject. In fact, there are many more users, and every month the table grows by several hundred thousand records. The SQL query that retrieves (feeds) the data for the chart takes up to 50 seconds. That's far too long - I can't afford it.
So, I need to create a table (table/aggregation/data mart) where I can insert the previously calculated number of interested users for all combinations. Thanks to this, the end user will not have to wait for the data to be counted.
Now the question is - how to create such a table in PostgreSQL?
I know how to write a SQL query that will calculate a specific example.
SELECT
    month,
    count(DISTINCT user_id) AS interested_users
FROM
    user_preferences
WHERE
    month BETWEEN '2020-01' AND '2020-03'
    AND city = 'Madrid'
    AND district = 'Latina'
    AND rooms IN (1, 2)
    AND price_max BETWEEN 400001 AND 500000
GROUP BY
    1
The question is - how to calculate all possible combinations? Can I write multiple nested loops in SQL?
The topic is extremely important to me, I think it will also be useful to others for the future.
I will be extremely grateful for any tips.
Well, based on your query, you have the following filters:
month
city
district
rooms
price_max
You can try creating a view with the following structure:
SELECT month
      ,city
      ,district
      ,rooms
      ,price_max
      ,count(DISTINCT user_id) AS interested_users
FROM user_preferences
GROUP BY month
        ,city
        ,district
        ,rooms
        ,price_max
You can make this view materialized, so the query behind the view is not executed every time it is queried; it behaves like a table.
When you add new records to the base table, you will need to refresh the view (unfortunately, PostgreSQL does not support auto-refresh like some other databases):
REFRESH MATERIALIZED VIEW my_view;
or you can schedule a task.
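For example, if the pg_cron extension happens to be available (an assumption - it is not part of core PostgreSQL), the refresh can be scheduled in SQL; the job name is illustrative:
-- refresh once an hour, on the hour
SELECT cron.schedule('refresh-mv', '0 * * * *', 'REFRESH MATERIALIZED VIEW my_view');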
If you are using only exact search for each field, this will work. But in your example, you have criteria like:
month BETWEEN '2020-01' AND '2020-03'
AND rooms IN (1,2)
AND price_max BETWEEN 400001 AND 500000
In such cases, I usually write the same query but SUM the data from the materialized view. In your case, however, you are using DISTINCT, and this may lead to counting a user multiple times.
If this is an issue, you would need to precalculate too many combinations, and I doubt that is the answer. Alternatively, you can try to normalize your data - this may improve the performance of the aggregations.
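For illustration, the SUM-over-the-view approach might look like the sketch below, assuming the view is named my_view and the count column is aliased interested_users as above. The caveat from the previous paragraph applies: the view stores per-group DISTINCT counts, so summing across groups can count the same user more than once.
SELECT month,
       sum(interested_users) AS interested_users  -- may double-count a user appearing in several groups
FROM my_view
WHERE month BETWEEN '2020-01' AND '2020-03'
  AND rooms IN (1, 2)
  AND price_max BETWEEN 400001 AND 500000
GROUP BY month;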

What are some of the best practices for tracking updates to a record over time in sql? [closed]

I have a product table with a primary key #productid (bigint), a product number (int), and a version (int).
Any time someone makes changes to the product record ONLY, I plan on inserting a new row in the database with the same product number and version number + 1. This gives me the historical tracking I need for the record, because I can see the version changes over time.
/* Selecting the current version is simple */
Select top 1 *
from products
where productnumber = #productnumber
order by version desc
However, my problem comes in with the Foreign key one-to-many or many-to-many relationship tables. This table points to many others (i.e. product pricing with date ranges, product categories, etc.) which also need to be tracked.
/* Product categories, pricing */
/* Should I use #productnumber here? How do I track changes to these records? */
select name
from productcategories
where productid = #productid
select price
from productpricing
where productid = #productid and
      StartDate > #StartDate and
      EndDate < #EndDate
So now, any time there is a version change, I plan to re-insert new category and pricing records with the newly generated product id primary key. This is going to lead to a ton of duplicates, especially if no changes were made to these records.
There is also the issue of what happens if a category is removed but there were no changes to the product record. I would want to see who removed the category. Essentially, a full audit is needed on each table.
I have seen some different examples, but most of them only seem to deal with a record in one table and not a record that is part of one-to-many or many-to-many relationships. I was hoping this could be done without the need for additional tables.
Are there any better methods or practices? Is this going to be a performance nightmare?
If you are using a newer version of SQL Server, as you are, you should look into temporal tables, as this might be your best option.
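A minimal sketch of a system-versioned temporal table (SQL Server 2016+), with the column set abbreviated; the history table name is illustrative:
create table dbo.Products
(
    ProductId bigint not null primary key,
    ProductNumber int not null,
    Name varchar(100) not null,
    ValidFrom datetime2 generated always as row start not null,
    ValidTo datetime2 generated always as row end not null,
    period for system_time (ValidFrom, ValidTo)
)
with (system_versioning = on (history_table = dbo.ProductsHistory))

-- point-in-time query: the product as it looked on a given date
select *
from dbo.Products for system_time as of '2020-01-01'
where ProductNumber = 123
With this design, updates happen in place and SQL Server moves the old row version to the history table automatically, so no explicit version column is needed.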
If you need to support older versions, my preferred method is to have a history table with a new PK column, a change flag (I/U/D), a date modified, the user that made the change, and all of the columns from the primary table. I then index the column related to the PK of the non-history table. Triggers don't impact performance too much as long as you don't put logic in them.
Example (written out as T-SQL):
create table Car
(
    CarID int identity(1,1) primary key,
    Name varchar(50)
)

create table Car_hist
(
    Car_histID int identity(1,1) primary key,
    Change char(1),               -- 'I', 'U' or 'D'
    DateOfChange datetime2,
    ChangedByUser varchar(128),   -- or an int user id
    CarID int,                    -- PK value of the Car row this entry describes
    Name varchar(50)
)

-- non-clustered index for fast lookup of a car's history by its parent PK
create nonclustered index IX_Car_hist_CarID on Car_hist (CarID)
You can write a generator in SQL that generates the script to create the history table, indexes, etc. It helps if you have a consistent table design practice.
Now the reason: I rarely have to query history tables, but when I do, it is almost always for a single record to see what happened and who changed it. This method allows you to select from the history on the parent table's PK value quickly and read it as a historical change log easily (who changed what and when). I don't see how you can do that in your design. If you are really slick, you can find or write a grid that diffs rows for you and you can quickly see what changed.
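For completeness, a minimal sketch of the kind of trigger that could populate Car_hist (the trigger name is illustrative, and suser_sname() is just one way to capture the user):
create trigger Car_hist_trigger on Car
after insert, update, delete
as
begin
    set nocount on;

    -- inserts, and the new side of updates
    insert into Car_hist (Change, DateOfChange, ChangedByUser, CarID, Name)
    select case when exists (select 1 from deleted as d where d.CarID = i.CarID)
                then 'U' else 'I' end,
           sysutcdatetime(), suser_sname(), i.CarID, i.Name
    from inserted as i;

    -- deletes
    insert into Car_hist (Change, DateOfChange, ChangedByUser, CarID, Name)
    select 'D', sysutcdatetime(), suser_sname(), d.CarID, d.Name
    from deleted as d
    where not exists (select 1 from inserted as i where i.CarID = d.CarID);
end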

SQL Select Sub Query [closed]

I have the following database tables:
User                 List
------               ------
userId (PK)          listId (PK)
fullName             description
                     addedById (FK to userId in User table)
                     modifiedById (FK to userId in User table)
I need to pull out all data from the List table, but instead of showing the IDs for addedById and modifiedById, I need to pull the fullName from the User table.
This query works, and gives me the data I need. However, I'm not sure whether there is a better way of constructing this query. I'm not keen on having multiple sub-select queries within my main select query, mainly because of performance concerns.
select t1.[description],
       t1.addedById,
       t1.modifiedById,
       (select fullName from dbo.User where userId = t1.addedById) as [AddedByUser],
       (select fullName from dbo.User where userId = t1.modifiedById) as [ModifiedByUser]
from dbo.List t1
I'd really appreciate if anyone could suggest improvements to the query, or advise to keep as is.
Thanks.
A more standard SQL method would be:
SELECT t1.description,
       t1.addedById,
       t1.modifiedById,
       addedBy.fullName AS [AddedByUser],
       modifiedBy.fullName AS [ModifiedByUser]
FROM dbo.List t1
LEFT JOIN dbo.[User] addedBy    -- ADD is a reserved word in T-SQL, so spelled-out aliases are safer
    ON addedBy.userId = t1.addedById
LEFT JOIN dbo.[User] modifiedBy
    ON modifiedBy.userId = t1.modifiedById
However, I suspect this will perform identically to your query and probably has an identical execution plan. The only real advantage to this method is that it's easier to expand. For example, if you wanted to add new columns from the User table this would be easier.
As @Gordon notes, performance should be fine if there's an index on the fields you're joining with.
Is the double join to the same table bothering you? It's not slow; it's the preferred way of doing this, rather than correlated subqueries.
I am assuming you have indexes on the PK and both FKs.
The only thing you can minimize is the number of joined rows. You can do that either by using an inner join instead of a left join, or by filtering at the end with a where clause stating that both keys used in the join must be not null.
I could write both examples for you, but this should be self-explanatory.
The other major thing you can do is to PRESELECT values from Users. For example, suppose you have a huge number of users, you know only a few of them can appear in either role, and you can filter which ones by some column in Users you haven't listed (even better if that column has an index).
Then you may profit from pre-selecting only those users and joining to that selection result - either via a temp table, if both roles can be pre-selected, or by joining to a select on the spot instead of to the entire table. I'm not sure the numbers here are big enough for this to become relevant, but a sketch follows below.
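A minimal sketch of the pre-selection idea, under an assumption the question doesn't state: that some column (here a hypothetical isStaff flag) identifies the few users who can appear in either role:
-- pre-select the small set of users who can add or modify lists
select userId, fullName
into #editors
from dbo.[User]
where isStaff = 1;   -- hypothetical filter column

select t1.[description],
       a.fullName as [AddedByUser],
       m.fullName as [ModifiedByUser]
from dbo.List t1
left join #editors a on a.userId = t1.addedById
left join #editors m on m.userId = t1.modifiedById;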

Is it better to have int joins instead of string columns?

Let's say I have a User which has a status and the user's status can be 'active', 'suspended' or 'inactive'.
Now, when creating the database, I was wondering... would it be better to have a column with the string value (with an enum type, or rule applied) so it's easier to both query and know the current user status or are joins better and I should join in a UserStatuses table which contains the possible user statuses?
Assuming, of course statuses can not be created by the application user.
Edit: Some clarification
I would NOT use string joins; it would be an int join to the UserStatuses PK.
My primary concern is performance.
The possible statuses ARE STATIC and will NEVER change.
On most systems it makes little or no difference to performance. Personally I'd use a short string for clarity and join that to a table with more detail as you suggest.
create table intLookup
(
    pk integer primary key,
    value varchar(20) not null
)

insert into intLookup (pk, value) values
(1, 'value 1'),
(2, 'value 2'),
(3, 'value 3'),
(4, 'value 4')

create table stringLookup
(
    pk varchar(4) primary key,
    value varchar(20) not null
)

insert into stringLookup (pk, value) values
('1', 'value 1'),
('2', 'value 2'),
('3', 'value 3'),
('4', 'value 4')

create table masterData
(
    stuff varchar(50),
    fkInt integer references intLookup(pk),
    fkString varchar(4) references stringLookup(pk)
)

create index i on masterData (fkInt)
create index s on masterData (fkString)
go

-- GO 1000 repeats only this batch (SSMS batch repetition), filling the table
insert into masterData (stuff, fkInt, fkString)
select COLUMN_NAME, (ORDINAL_POSITION % 4) + 1, (ORDINAL_POSITION % 4) + 1
from INFORMATION_SCHEMA.COLUMNS
go 1000
This results in 300K rows.
select *
from masterData m
inner join intLookup i on m.fkInt = i.pk

select *
from masterData m
inner join stringLookup s on m.fkString = s.pk
On my system (SQL Server)
- the query plans, I/O and CPU are identical
- execution times are identical.
- The lookup table is read and processed once (in either query)
There is NO difference using an int or a string.
I think, as a whole, everyone has hit on important components of the answer to your question, and their good points should be taken together rather than separately.
As logixologist mentioned, a healthy amount of normalization is generally considered to increase performance. However, in contrast to logixologist, I think your situation is the perfect time for normalization. In this case, using a numeric key, as Santhosh suggested, which then leads back to a code table containing the decodes for the statuses, will result in less data being stored per record. This difference wouldn't show in a small Access database, but it would likely show in a table with millions of records, each with a status.
As David Aldridge suggested, you might find that normalizing this particular data point will result in a more controlled end-user experience. Normalizing the status field will also allow you to edit the status flag at a later date in one location and have that change perpetuated throughout the database. If your boss is like mine, then you might have to change the Status of Inactive to Closed (and then back again next week!), which would be more work if the status field was not normalized. By normalizing, it's also easier to enforce referential integrity. If a status key is not in the Status code table, then it can't be added to your main table.
If you're concerned about the performance when querying in the future, then there are some different things to consider. To pull back status, if it's normalized, you'll be adding a join to your query. That join will probably not hurt you in any sized recordset but I believe it will help in larger recordsets by limiting the amount of raw text that must be handled. If your primary concern is performance when querying the data, here's a great resource on how to optimize queries: http://www.sql-server-performance.com/2007/t-sql-where/ and I think you'll find that a lot of the rules discussed here will also apply to any inclusion criteria you enforce in the join itself.
Hope this helps!
Christopher
The whole idea behind normalization is to keep data from repeating (well, at least one of its concepts).
In this case there is only one status a user can have at a time (I assume), so there is no reason to put it in its own table. You would simply complicate things. The only reason you would have a separate table is if for some reason these statuses were not static, meaning next month you might add "Sort of Active" and "Maybe Inactive". That would mean changing code to make up for it if you didn't put them in their own table. You could create a maintenance page where users could add statuses, and that would then require you to create a separate table.
An issue to consider is whether these status values have attributes of their own.
For example, perhaps you would want to have a default sort order that is different from the alphabetical order of the status text. You might also want to treat two of the statuses in a particular way that you do not treat the other, and that could be an attribute.
If you have a need for that, or suspect a future need for that, then move the status text to a different table and use an integer key value for them.
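A minimal sketch of that design, using the UserStatuses name from the question (column names are illustrative):
create table UserStatuses
(
    statusId int primary key,
    statusText varchar(20) not null unique,
    sortOrder int not null            -- example of a status-level attribute
)

create table Users
(
    userId int primary key,
    fullName varchar(100) not null,
    statusId int not null references UserStatuses(statusId)
)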
I would suggest using integer values like 0, 1, 2, if the set is fixed. When interpreting the results in reports, we can map these statuses back to strings.

Naming tables to show association with another table [closed]

I have a table called Order which holds customer orders.
I want people to be able to add notes to orders. What's the best way to name this new table? Because other tables may also have notes, e.g. Customer, I need the table to be named so that it shows association with the relevant table. The relationship is that an Order will have 0-or-many Notes.
So should I name it:
Order_Note
OrderNote
Both seem fine. I also need to create another table that will list the 'types' of Order that have been placed. An Order can have one or many 'types'. So how would I name this association in the table name?
Order_Type
OrderType
In this Order_Type table, it will have just two columns OrderID and TypeID. So I need a final table which holds all possible Types of order that can be placed. How do I name this, given that it will be confusing with the table above?
Order_Types
OrderTypes
But this is breaking the rule of not having plurals in table names.
Edit:
The Order table is a data table. Order_Type is a joining table. And finally, OrderTypes is a lookup table. Thanks to Hogan for making this obvious to me. I have also removed hyphenation between words as an option as it may cause future problems.
SOLUTION 1:
Name association between tables using underscore e.g. Order_Type
Name lookup and data tables without underscores e.g. Order, OrderType
I'll also use a schema so that lookup tables show like Lookup.OrderType which helps to clarify what is what.
The way I've done it is listed below
As a side note, these suggestions have to do with how to think about joined table names; it does not matter whether you use camel case or not, underscores or not, etc.
The important note here is that there is fundamentally a difference between a joining table, a lookup table, and a data table. The names should reflect this and be consistent.
1)
I would make a note table and call it Note. Then I would add a relationship (junction) table between orders and notes and call it order2note, orderNote, or orderNoteRel.
This table name defines the two joined tables in some order; sometimes you can put the non-FK table first, but in many cases it is best to just default to alphabetical.
2)
For tables that define a code (or a 'type', as you put it), I use the convention of ending the table name with "Type", "Code", "CD", etc.
So orderType, orderCD, or orderCode would be the table that defines order types.
3)
The final table is actually a join between the order table and the orderType table, so it would be
order2orderType or orderOrderCD or orderOrderCodeRel
(or some other combination of the conventions I've shown.)
This is the important one. If you remember that the table you are joining to should have "order" in its name (it is the orderType table), then the join between order and order type should have "order" twice in its name. While this seems redundant at first, once you get used to it, it makes total sense.
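To make the three roles concrete, a minimal sketch under these conventions (column names are illustrative; [Order] is bracketed because ORDER is a reserved word):
create table [Order]                -- data table
(
    orderId int primary key
    -- other order columns
)

create table orderType              -- lookup table of the possible order types
(
    orderTypeId int primary key,
    description varchar(50) not null
)

create table order2orderType        -- joining table: which types each order has
(
    orderId int not null references [Order](orderId),
    orderTypeId int not null references orderType(orderTypeId),
    primary key (orderId, orderTypeId)
)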