SQL Server logging row visits best practice - sql

I currently have a database for articles that keeps track of the most read article for a certain amount of time by incrementing the "visits" counter on page_load. The current "visits" counter is a column in the articles table (see below):
id | title | description | visits | creation_date
---+--------+-------------+--------+-----------------
1 | test1 | test test.. | 10 | 2019-01-01
2 | test2 | test test.. | 20 | 2019-01-01
Sometimes I experienced connection timeouts, and I suspected a deadlock in the "visits" write procedure (the database locks when concurrent users increment the same row at once). I thought of the scenario below as an enhancement:
Remove the Visits counter from the table Articles
Create a new table article_visits with two columns: article_id and date
Articles
id | title | desc | creation_date
---+-------+------+---------------
1 | test1 | desd | 2019-01-01
2 | test2 | desd | 2019-01-01
article_visits
article_id | visit_date
-----------+----------------------
1 | 2019-01-01
1 | 2019-01-01
1 | 2019-01-01
1 | 2019-01-01
1 | 2019-01-01
1 | 2019-01-01
2 | 2019-01-01
2 | 2019-01-01
2 | 2019-01-01
With this alternative option, each new visit triggers an insert of a new row into the article_visits table, avoiding any deadlocks on the articles table. This solution will make the article_visits table grow very quickly, but I don't think table size is a problem.
I would like to know if this is the proper way to log article visits, and if this optimization is a better option than the original solution.

This is a fine way to record article visits. It is much less (or not at all) prone to deadlocks, because you are basically just appending new rows.
It is more flexible. You can get the number of visits between two dates, for instance, and that can be decided at query time. You can store the exact time, so you can determine whether there are time-of-day preferences for views.
The downside is performance on querying. If you frequently need the counts, then the calculation can be expensive.
If this is an issue, there are multiple possible approaches:
A process that summarizes all the data periodically (say, daily).
A process that summarizes the data on a periodic basis for that period (say, a daily summary).
A materialized/indexed view, which lets the database keep the summary up to date (sketched below).
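If you are on SQL Server, that last option can be an indexed view over the log table. A minimal sketch, assuming the article_visits table described in the question (object names are illustrative):

-- Indexed view maintaining a per-article visit count.
-- Requires SCHEMABINDING, two-part table names and COUNT_BIG(*).
CREATE VIEW dbo.vw_article_visit_counts
WITH SCHEMABINDING
AS
SELECT article_id,
       COUNT_BIG(*) AS visit_count
FROM dbo.article_visits
GROUP BY article_id;
GO

-- The unique clustered index is what materializes the view.
CREATE UNIQUE CLUSTERED INDEX IX_vw_article_visit_counts
ON dbo.vw_article_visit_counts (article_id);
GO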

This is certainly valid, though you may want to do some scoping on how much additional storage and memory load this will require for your database server.
Additionally, I might add a full datetime or datetime2 column for the actual timestamp (in addition to the current date column rather than instead of it, since you'll want to do aggregation by date only and having that value pre-computed can improve performance), and perhaps a few other columns such as IP Address and Referrer. Then you can use this data for additional purposes, such as auditing, tracking referrer/advertiser ROI, etc.
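For example, the logging table might end up looking something like this (column names and types are only a sketch, not a prescription):

Create Table article_visits(
     article_id int not null
    ,visit_date date not null          -- pre-computed date, used for grouping
    ,visit_time datetime2(0) not null  -- exact timestamp of the visit
    ,ip_address varchar(45) null       -- wide enough for the textual form of IPv6
    ,referrer nvarchar(400) null
)
GO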

I'm interested to understand why you are getting a deadlock. A database platform should be able to handle an UPDATE tablename SET field = field + 1 concurrently just fine: the table or row will lock and then release, but not for long enough to cause a deadlock error.
You COULD get a deadlock error if you are updating or locking more than one table in a transaction, especially if different sessions touch the tables in a different order.
So the question is: in your original code, are you touching multiple tables when you do the update statement? The solution could be as simple as making your update atomic to one table.
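For reference, a single-statement counter update against the original table touches only one table and, on its own, is very unlikely to deadlock (a sketch; @article_id is a placeholder parameter):

-- Atomic single-table increment; no transaction spanning other tables.
UPDATE articles
SET visits = visits + 1
WHERE id = @article_id;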
However, I do agree -- the table you describe is a more functional design.

The current Articles table is not in normalized form. Putting a visits column in the Articles table is not a proper way of de-normalizing.
The current Articles table not only gives you the deadlock issue, it also prevents many other types of report, such as daily and weekly visit reports.
Creating an Article_visits table is a very good move; it will be updated very frequently.
My Article_visits design:
article_visit_id | article_id | visit_date | visit_count
-----------------+------------+------------+------------
               1 |          1 | 2019-01-01 |           6
               2 |          2 | 2019-01-01 |           3
Here article_visit_id is int identity(1,1), which is also the clustered index.
Create NonClustered Index NCI_Articleid_date ON Article_visits(article_id, visit_date)
GO
In short, creating the clustered index on (article_id, visit_date) would be an expensive affair.
If no record exists for that article on that date, insert one with visit_count = 1; if it exists, increase visit_count by 1 (see the upsert sketch after the table definitions below).
It is normalized.
You can create any kind of report, for the current requirement plus any future requirement.
You can show per-article counts; the query is easy and performant.
You can get weekly or even yearly reports easily, and without an indexed view.
Actual table design:
Create Table Article(Articleid int identity(1,1) primary key
    ,title varchar(100) not null
    ,Descriptions varchar(max) not null
    ,CreationDate datetime2(0))
GO
Create Table Article_Visit(Article_VisitID int identity(1,1) primary key
    ,Articleid int not null
    ,Visit_Date datetime2(0) not null
    ,Visit_Count int not null)
GO
--Create trusted FK (WITH CHECK keeps the constraint trusted)
ALTER TABLE Article_Visit
WITH CHECK
ADD CONSTRAINT FK_Articleid FOREIGN KEY(Articleid)
REFERENCES Article(Articleid);
GO
--Create NonClustered Index NCI_Articleid_Date on
--    Article_Visit(Articleid, Visit_Date)
--GO
Create NonClustered Index NCI_Articleid_Date1 on
    Article_Visit(Visit_Date) include(Articleid)
GO
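A sketch of the insert-or-update logic mentioned above (the local variables stand in for procedure parameters; a unique index on (Articleid, Visit_Date) would be needed to fully guard against duplicate rows under heavy concurrency):

-- Placeholder values for illustration.
DECLARE @Articleid int = 1;
DECLARE @Visit_Date date = CAST(SYSDATETIME() AS date);

-- Record one visit: bump the counter if a row exists, otherwise insert it.
UPDATE Article_Visit
SET Visit_Count = Visit_Count + 1
WHERE Articleid = @Articleid
  AND Visit_Date = @Visit_Date;

IF @@ROWCOUNT = 0
    INSERT INTO Article_Visit (Articleid, Visit_Date, Visit_Count)
    VALUES (@Articleid, @Visit_Date, 1);
GO

-- Example report: total visits per article over the last 7 days.
SELECT Articleid, SUM(Visit_Count) AS Visits
FROM Article_Visit
WHERE Visit_Date >= DATEADD(DAY, -7, SYSDATETIME())
GROUP BY Articleid;
GO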
Create a trusted FK to get the index-seek benefit (in short).
I think NCI_Articleid_Date is no longer required, because Articleid is a trusted FK.
Deadlock issue: the trusted FK was also created to help overcome the deadlock issue. Deadlocks often occur due to bad application code, un-optimized SQL queries, or bad table design. Besides this there are several other valid reasons, like handling race conditions; it is quite a DBA thing. If deadlocks hurt too much then, after addressing the above, you may have to change the isolation level.
Many deadlock issues are handled automatically by SQL Server itself.
There are many articles online about the causes of deadlocks.
"I don't think table size is a problem"
Table size is a big issue. The chance of a deadlock in either design is very, very small, but you will always face the other drawbacks of a very large table. I suggest reading a few more articles on this.
A few questions:
Is this exactly your real table, with the same data types?
How frequently will each table be inserted into or updated?
Which table will be queried more frequently?
What is the concurrent use of each table?
Deadlocks can only be minimized, so that there is no performance or transaction issue.
What is the relation between Visitorid and Articleid?

Related

Materialized View vs Trigger for aggregating data?

I have a TASK table :
ID | NAME | STATUS |
----------------------
1 | Task 1 | Open |
2 | Task 2 | Closed |
3 | Task 3 | Closed |
And in my application I constantly query for a count of tasks grouped by status, so I'm looking for a caching solution.
Naturally, I thought of a trigger that automatically updates an aggregation table on any change to the TASKS table
TASK_COUNT table :
OPEN | CLOSED |
----------------
1 | 2 |
But I've read there is also materialized views.
Which is more recommended for aggregating data: materialized views or triggers?
Important to note that in my actual scenario I have more aggregations than just STATUS, and more tables than just TASK.
Also this is a rapidly evolving table, and I need the aggregated data to be always up to date.
The downside to materialized views is that the data may not be totally current. As explained in the documentation:
While access to the data stored in a materialized view is often much faster than accessing the underlying tables directly or through a view, the data is not always current; yet sometimes current data is not needed.
The advantage of materialized views is that they are much simpler to maintain -- basically define and go. But there can be a lag for updates.
If you need totally current information, then triggers are probably the better solution.
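As an illustration only (the summary table here is keyed by status rather than pivoted into OPEN/CLOSED columns, and all object names are made up; the syntax is SQL Server flavored):

-- Summary table maintained by the trigger.
CREATE TABLE TASK_STATUS_COUNT (
     STATUS varchar(20) PRIMARY KEY
    ,TASK_COUNT int NOT NULL
);
GO

CREATE TRIGGER trg_task_status_count
ON TASK
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;
    -- Simple but safe: rebuild the small summary from scratch on every change.
    DELETE FROM TASK_STATUS_COUNT;
    INSERT INTO TASK_STATUS_COUNT (STATUS, TASK_COUNT)
    SELECT STATUS, COUNT(*)
    FROM TASK
    GROUP BY STATUS;
END;
GO

A delta-based trigger (adjusting counts from the inserted/deleted pseudo-tables) scales better on busy tables, but this version shows the idea with the fewest moving parts.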

How to bond N database table with one master-table?

Let's assume that I have N tables for N bookstores. I have to keep data about books in separate tables for each bookstore, because each table has a different schema (the number and types of columns differ); however, there is a set of columns common to all bookstore tables.
Now I want to create one "MasterTable" with only few columns.
MasterTable
| id | title | isbn |
|----|-------|------|
| 1  | abc   | 123  |

MasterToBookstores
| m_id | tb_id | p_id |
|------|-------|------|
| 1    | 1     | 2    |
| 1    | 2     | 1    |

BookStore_Foo
| p_id | title | isbn | date | size |
|------|-------|------|------|------|
| 1    | xyz   | 456  | 1998 | 3KB  |
| 2    | abc   | 123  | 2003 | 4KB  |

BookStore_Bar
| p_id | title | isbn | publisher | Format |
|------|-------|------|-----------|--------|
| 1    | abc   | 123  | H&K       | PDF    |
| 2    | mnh   | 986  | Amazon    | MOBI   |
My question: is it right to keep data this way? What are the best practices for this and similar cases? Can I give a particular bookstore table an alias with a number, which would help me manage the whole set of tables?
Is there a better way of doing such thing?
I think you are confusing the concepts of "store" and "book".
From your comments and the example data, it appears the problem is in having different sets of attributes for books, not stores. If so, you'll need a structure similar to this:
The symbol in the diagram denotes inheritance[1]. The BOOK is the "base class" and BOOK1/BOOK2/BOOK3 are various "subclasses"[2]. This is a common strategy when entities share a set of attributes or relationships[3]. For a fuller explanation of this concept, please search for "Subtype Relationships" in the ERwin Methods Guide.
Unfortunately, inheritance is not directly supported by current relational databases, so you'll need to transform this hierarchy into plain tables. There are generally 3 strategies for doing so, as described in these posts:
Interpreting ER diagram
Parent and Child tables - ensuring children are complete
Supertype-subtype database design
NOTE: The structure above allows various book types to be mixed inside the same bookstore. Let me know if that's not desirable (i.e. you need exactly one type of books in any given bookstore)...
[1] Aka category, subclassing, subtyping, generalization hierarchy, etc.
[2] I.e. types of books, depending on which attributes they require.
[3] In this case, books of all types are in a many-to-many relationship with stores.
If you have at least two columns that all the other tables use, then you could have a base table for all books and add more tables for the rest of the data, using the id from the base table.
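A rough sketch of that idea, with invented table and column names (generic SQL):

-- Common columns live in the base table...
CREATE TABLE Book (
     id int PRIMARY KEY
    ,title varchar(200) NOT NULL
    ,isbn varchar(20)
);

-- ...and each bookstore's extra columns go in its own extension table,
-- keyed by the base table's id.
CREATE TABLE Book_Foo (
     book_id int PRIMARY KEY REFERENCES Book(id)
    ,pub_date int
    ,size_kb int
);

CREATE TABLE Book_Bar (
     book_id int PRIMARY KEY REFERENCES Book(id)
    ,publisher varchar(100)
    ,format varchar(20)
);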
UPDATE:
If you use Entity Framework to connect to your DB, I suggest you try this:
Create your entity model along these lines, then let Entity Framework generate the database for you (Update Database from Model). Note that this uses inheritance in the model, not in the database.
Let me know if you have questions.
Suggested data model:
1. Have a master database, which holds the master data.
2. The dimension tables in the master database are transactionally replicated to your distributed bookstore databases.
3. You can choose to use an updatable subscriber; merge replication is also a good choice.
4. Each distributed bookstore database still works independently; master data is merged back either by merge replication or by the updatable subscriber.
5. If you want to ensure master data integrity, you can use read-only subscribers and transactional replication to distribute the master data to the distributed databases, but in this design you need stored procedures in the master database to register your dimension data. Make sure there is no double-hop issue.
I would suggest you have two tables:
bookStores:
id name someMoreColumns
books:
id bookStore_id title isbn date publisher format size someMoreColumns
It's easy to see the relationship here: a bookStore has many books.
Note that I'm putting all the columns you have across your BookStore tables into just one table, even if some rows will not have a value for some of those columns.
Why I prefer this way:
1) Of all the data from the BookStore tables, only a few columns will never have a value in the books table (for example, size and format, if you don't have an e-book version). The other columns can be filled in someday (you can set a date for your e-books, but you don't have this column in your BookStore_Bar table, which seems to refer to the e-books). This way you can have much more detailed info for all your books if someday you want to update it.
2) If you have a bunch of BookStore tables, let's say 12, you will not be able to handle your data easily. That is, if you want to run some query over all your books (which means over all your tables), you will have at least three ways:
First: run the query manually against each of the 12 tables and then merge the data;
Second: write a query with 12 joins, or put 12 tables in your FROM clause, to query all your data;
Third: depend on some script, stored procedure or software to do the first or the second option for you;
I like to be able to work with my data as easy as possible and with no dependence of some other script or software, unless I really need it.
3) In MySQL (because I know MySQL much better) you can use partitioning on your books table. It is a high-level data-management feature by which you can distribute the data from your table across several files on disk instead of just one, as a table is generally allocated. It is very useful when handling a large amount of data in a single table, and it speeds up queries based on your data-distribution plan. Let's see an example:
Let's say you already have 12 distinct bookstores, but under my database model. For each row in your books table you'll have an association to one of the 12 bookstores. If you partition your data on bookStore_id it will be almost the same as having 12 tables, because you can create a partition for each bookStore_id and each partition will then hold only the related data (the data that matches that bookStore_id).
Let's say you want to query the books table for bookStore_id IN (1, 4, 9). If your query really only needs these three partitions to give you the desired output, then the others will not be queried and it will be as fast as querying each separate table.
You can drop a partition and the others will not be affected. You can add new partitions to handle new bookstores. You can subpartition a partition. You can merge two partitions. In a nutshell, you can turn your single books table into an easy-to-handle, multi-storage table.
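A minimal MySQL sketch of that layout (table definition and partition count are illustrative; note that MySQL requires the partitioning column to be part of every unique key, hence the composite primary key):

-- One books table, partitioned by bookStore_id.
CREATE TABLE books (
    id           INT NOT NULL,
    bookStore_id INT NOT NULL,
    title        VARCHAR(200) NOT NULL,
    isbn         VARCHAR(20),
    PRIMARY KEY (id, bookStore_id)
)
PARTITION BY HASH (bookStore_id)
PARTITIONS 12;

-- A query filtering on bookStore_id only touches the matching partitions:
SELECT * FROM books WHERE bookStore_id IN (1, 4, 9);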
Side effects:
1) I don't know every detail of table partitioning, so it's good to refer to the documentation to learn all the important points for creating and managing it.
2) Take care of your data with regular backups (dumps), as you will probably have a very populated books table.
I hope this helps you!

Column with alternate serials

I would like to create a user_widgets table whose primary key is (user_id, user_widget_id), where user_widget_id works like a serial, except that it starts at 1 for each user.
Is there a common or practical solution for this? I am using PostgreSQL, but an agnostic solution would be appreciated as well.
Example table: user_widgets
| user_id | user_widget_id | user_widget_name |
+-----------+------------------+----------------------+
| 1 | 1 | Andy's first widget |
+-----------+------------------+----------------------+
| 1 | 2 | Andy's second widget |
+-----------+------------------+----------------------+
| 1 | 3 | Andy's third widget |
+-----------+------------------+----------------------+
| 2 | 1 | Jake's first widget |
+-----------+------------------+----------------------+
| 2 | 2 | Jake's second widget |
+-----------+------------------+----------------------+
| 2 | 3 | Jake's third widget |
+-----------+------------------+----------------------+
| 3 | 1 | Fred's first widget |
+-----------+------------------+----------------------+
Edit:
I just wanted to include some reasons for this design.
1. Less information disclosure, not just "Security through obscurity"
In a system where users should not be aware of one another, they also should not be aware of each other's widget_ids. If this were a table of inventory, trade secrets, invoices, or something more sensitive, each user should be able to have their own uninfluenced set of IDs for those widgets. In addition to the obvious routine security checks, this adds an implicit security layer where the table has to be filtered by both the widget id and the user id.
2. Data Imports
Users should be permitted to import their data from some other system without having to trash all of their legacy IDs (if they have integer IDs).
3. Cleanliness
Not terribly dissimilar from my first point, but I think that users who create less content than others may be baffled or annoyed by significant jumps in their widget IDs. This of course is more superficial than functional, but could still be valuable.
A possible solution
One of the answers suggests that the application layer handle this. I could store a next_id column on the users table that gets incremented, or perhaps just count the rows per user and not allow deletion of records (using a deleted/deactivated flag instead). Could this be done with a trigger function, or even a stored procedure, rather than in the application layer?
If you have a table:
CREATE TABLE user_widgets (
   user_id int
  ,user_widget_name text  -- should probably be a foreign key to a look-up table
  ,PRIMARY KEY (user_id, user_widget_name)
);
You could assign user_widget_id dynamically and query:
WITH x AS (
SELECT *, row_number() OVER (PARTITION BY user_id
ORDER BY user_widget_name) AS user_widget_id
FROM user_widgets
)
SELECT *
FROM x
WHERE user_widget_id = 2;
user_widget_id is applied alphabetically per user in this scenario and has no gaps. Adding, changing or deleting entries can obviously result in changes.
More about window functions in the manual.
Somewhat more (but not completely) stable:
CREATE TABLE user_widgets (
   user_id int
  ,user_widget_id serial
  ,user_widget_name text
  ,PRIMARY KEY (user_id, user_widget_id)
);
And:
WITH x AS (
SELECT *, row_number() OVER (PARTITION BY user_id
ORDER BY user_widget_id) AS user_widget_nr
FROM user_widgets
)
SELECT *
FROM x
WHERE user_widget_nr = 2;
Addressing question update
You can implement a regime to count existing widgets per user. But you will have a hard time making it bulletproof for concurrent writes. You would have to lock the whole table or use SERIALIZABLE transaction mode - both of which are real downers for performance and need additional code.
But if you guarantee that no rows are deleted, you could go with my second approach: one sequence for user_widget_id across the table, giving you a "raw" ID. A sequence is a proven solution for concurrent load, preserves the relative order of user_widget_id and is fast. You could provide access to the table through a view that dynamically replaces the "raw" user_widget_id with the corresponding user_widget_nr, like in my query above.
You could (in addition) "materialize" a gapless user_widget_id by replacing it with user_widget_nr at off hours or triggered by events of your choosing.
To improve performance I would have the sequence for user_widget_id start with a very high number. Seems like there can only be a handful of widgets per user.
SELECT setval('user_widgets_user_widget_id_seq', 100000);
If no number is high enough to be safe, add a flag instead. Use the condition WHERE user_widget_id > 100000 to quickly identify "raw" IDs. If your table is huge, you may want to add a partial index using that condition (it will be small), for use in the mentioned view in a CASE expression, and in this statement to "materialize" IDs:
UPDATE user_widgets w
SET user_widget_id = u.user_widget_nr
FROM (
SELECT user_id, user_widget_id
,row_number() OVER (PARTITION BY user_id
ORDER BY user_widget_id) AS user_widget_nr
FROM user_widgets
WHERE user_widget_id > 100000
) u
WHERE w.user_id = u.user_id
AND w.user_widget_id = u.user_widget_id;
Possibly follow up with a REINDEX or even VACUUM FULL ANALYZE user_widgets at off hours. Consider a FILLFACTOR below 100, as columns will be updated at least once.
I would certainly not leave this to the application. That introduces multiple additional points of failure.
I am going to join in, in questioning the specific requirements. In general, if you are trying to order things of this sort, that might be better left to the application. If you knew me you'd realize this was really saying something. My concern is that every case I can think of may require re-ordering on the part of the application because otherwise the numbers would be irrelevant.
So I would just:
CREATE TABLE user_widgets (
user_id int references users(id),
widget_id int,
widget_name text not null,
primary key(user_id, widget_id)
);
And I'd leave it at that.
Now, based on your justification, this addresses all of your concerns (imports). However, once in a long while I have had to do something similar. The use case I had was one where a local tax jurisdiction required that packing slips (!) be sequentially numbered without gaps, separately from invoices. Counting records, by the way, won't meet your import requirements.
What we did was create a table with one row per sequence, use that, and tie it in with a trigger.
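A rough PostgreSQL sketch of that pattern, with invented names and assuming a version with ON CONFLICT (9.5+); it targets the user_widgets table defined above:

-- One counter row per user; the row lock taken by the UPDATE serializes
-- concurrent inserts for the same user, which keeps the numbering gapless.
CREATE TABLE user_widget_counters (
    user_id int PRIMARY KEY,
    last_id int NOT NULL DEFAULT 0
);

CREATE OR REPLACE FUNCTION assign_user_widget_id()
RETURNS trigger AS $$
BEGIN
    -- Upsert the counter row and grab the next value for this user.
    INSERT INTO user_widget_counters AS c (user_id, last_id)
    VALUES (NEW.user_id, 1)
    ON CONFLICT (user_id)
    DO UPDATE SET last_id = c.last_id + 1
    RETURNING c.last_id INTO NEW.widget_id;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_assign_user_widget_id
BEFORE INSERT ON user_widgets
FOR EACH ROW
WHEN (NEW.widget_id IS NULL)
EXECUTE PROCEDURE assign_user_widget_id();

This trades a per-user hot row for gapless numbering; imported rows that already carry a widget_id bypass the trigger because of the WHEN clause.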

performance issue in a select query from a single table

I have a table as below
dbo.UserLogs
-------------------------------------
Id | UserId |Date | Name| P1 | Dirty
-------------------------------------
There can be several records per userId[even in millions]
I have clustered index on Date column and query this table very frequently in time ranges.
The column 'Dirty' is non-nullable and can take either 0 or 1 only so I have no indexes on 'Dirty'
I have several million records in this table, and in one particular case in my application I need to query this table to get all UserIds that have at least one record marked dirty.
I tried this query: select distinct(UserId) from UserLogs where Dirty=1
I have 10 million records in total, this takes about 10 minutes to run, and I want it to run much faster than that.
[I am able to query this table on the Date column in less than a minute.]
Any comments/suggestions are welcome.
My environment: 64-bit, Sybase 15.0.3, Linux.
My suggestion would be to reduce the amount of data that needs to be queried by "archiving" log entries to an archive table at suitable intervals.
You can still access all entries if you provide a union view over current and archived log data, but the amount of current log data to scan would be much smaller.
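For example, the combined view might look like this (UserLogsArchive is a hypothetical archive table with the same columns):

-- Union view over current and archived log data.
CREATE VIEW UserLogsAll
AS
SELECT Id, UserId, Date, Name, P1, Dirty FROM UserLogs
UNION ALL
SELECT Id, UserId, Date, Name, P1, Dirty FROM UserLogsArchive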
Add an index containing both the UserId and Dirty fields. Put UserId before Dirty in the index as it has more unique values.
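For example (the index name is illustrative):

-- Composite index; UserId first because it has more unique values.
CREATE INDEX IX_UserLogs_UserId_Dirty
ON UserLogs (UserId, Dirty)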

Change all primary keys in access table to new numbers

I have an Access table with an AutoNumber primary key, a date, and other data. The first record starts at 36, due to deleted records. I want to change all the primary keys so they begin at 1 and increment, ordered by the date. What's the best way to do this?
I want to change the table from this:
| TestID | Date | Data |
| 36 | 12/02/09 | .54 |
| 37 | 12/04/09 | .52 |
To this:
| TestID | Date | Data |
| 1 | 12/02/09 | .54 |
| 2 | 12/04/09 | .52 |
EDIT: Thanks for the input and to those who answered. I think some were reading a little too much into my question, which is okay because it still adds to my learning and thinking process. The purpose of my question was twofold: 1) it would simply be nicer for me to have the PK match the order of my data's dates, and 2) to learn whether something like this is possible for later use, such as if I want to add a new column to the table which numbers the tests, labels the type of test, etc. I am trying to learn a lot at once right now, so I sometimes get confused about where to start. I am building .NET apps and trying to learn SQL and database management, and it is sometimes confusing finding the right info with the different RDBMSs and ways to interact with them.
Following from MikeW, you can use the following SQL command to copy the data from the old to the new table:
INSERT INTO
    NewTable ([Date], Data)
SELECT
    [Date], Data
FROM
    OldTable;
The new TestID will start from 1 if you use an AutoNumber (auto-increment) field, so it is not included in the column list above.
I would create a new table, with autoincrement.
Then select all the existing data into it, ordering by date. That will result in the IDs being recreated from "1".
Then you could drop the original table, and rename the new one.
Assuming no foreign keys - if so you'd have to drop and recreate those too.
An AutoNumber used as a surrogate primary key is not data, but metadata used for nothing but connecting records in related tables. If you need to control the values in that field, then it is data, and you can't use an AutoNumber; you have to roll your own auto-increment routine. You might want to look at this thread for a starting point, but code for this for use in Access is available everywhere Access programmers congregate on the Net.
I agree that the value of the auto-generated IDENTITY values should have no meaning, even for the coder, but for education purposes, here's how to reseed the IDENTITY using ADO:
ACC2000: Cannot Change Default Seed and Increment Value in UI
Note that the article is out of date where it says, "there are no options available in the user interface (UI) for you to make this change." In later versions of Access, the SQL DDL can be executed when in ANSI-92 Query Mode, e.g. something like this:
ALTER TABLE MyTable ALTER COLUMN TestID IDENTITY (1, 1);