I have a TASK table:
ID | NAME | STATUS |
----------------------
1 | Task 1 | Open |
2 | Task 2 | Closed |
3 | Task 3 | Closed |
And in my application I constantly query for a count of tasks grouped by status, so I'm looking for a caching solution.
Naturally, I thought of a trigger that automatically updates an aggregation table on any change to the TASK table:
TASK_COUNT table:
OPEN | CLOSED |
----------------
1 | 2 |
But I've read that there are also materialized views.
Which is more recommended for aggregating data: materialized views or triggers?
Important to note that in my actual scenario I have more aggregations than just STATUS, and more tables than just TASK.
Also this is a rapidly evolving table, and I need the aggregated data to be always up to date.
The downside to materialized views is that the data may not be totally current. As explained in the documentation:
While access to the data stored in a materialized view is often much faster than accessing the underlying tables directly or through a view, the data is not always current; yet sometimes current data is not needed.
The advantage of materialized views is that they are much simpler to maintain -- basically define and go. But there can be a lag for updates.
If you need totally current information, then triggers are probably the better solution.
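If you do go the trigger route, here is a minimal sketch in MySQL/MariaDB-style syntax, assuming the TASK table and a single-row TASK_COUNT table as in the question; companion AFTER INSERT and AFTER DELETE triggers would be needed for full coverage:
CREATE TRIGGER task_count_after_update
AFTER UPDATE ON TASK
FOR EACH ROW
    -- comparisons evaluate to 1 or 0, so both counters are adjusted in one statement
    UPDATE TASK_COUNT
    SET `OPEN`   = `OPEN`   + (NEW.STATUS = 'Open')   - (OLD.STATUS = 'Open'),
        `CLOSED` = `CLOSED` + (NEW.STATUS = 'Closed') - (OLD.STATUS = 'Closed');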
I know similar questions have been asked, but I will try to explain why they haven't answered my exact confusion.
To clarify, I am a complete beginner to SQL so bear with me if this is an obvious question.
Despite being a beginner I have been fortunate enough to be given a role doing some data science, and I was recently doing some work where I wrote a query that self-joined a table, then used an inline view on the result, which I then selected from. I can include the code if necessary, but I don't think it's needed for the question.
After running this, the admin emailed me and asked to please stop since it was creating very large temp tables. That was all sorted and he helped me write it more efficiently, but it made me very confused.
My understanding was that temp tables are specifically created by a statement like
SELECT INTO #temp1
I was simply using a nested select statement. Other questions on here seem to confirm that temp tables are different. For example the question here along with many others.
In fact I don't even have privileges to create new tables, so what am I misunderstanding? Was he using "temp tables" differently from the standard use, or do inline views create the same temp tables?
From what I can gather, the only explanation I can think of is that genuine temp tables are physical tables in the database, while inline views just store an array in RAM rather than in the actual database. Is my understanding correct?
There are two kinds of temporary tables in MariaDB/MySQL:
Temporary tables created via SQL
CREATE TEMPORARY TABLE t1 (a int)
Creates a temporary table t1 that is only available for the current session and is automatically removed when the current session ends. A typical use case is tests in which you don't want to clean everything up at the end.
Temporary tables/files created by server
If memory is too low (or the data size is too large), the correct indexes are not used, and so on, the database server needs to create temporary tables or files for sorting, for collecting results from subqueries, etc. Temporary files are an indicator that your database design and/or your statements should be optimized: disk access is much slower than memory access and unnecessarily wastes resources.
A typical example of temporary table/file usage is a simple GROUP BY on a column which is not indexed (the information is displayed in the "Extra" column):
MariaDB [test]> explain select first_name from test group by first_name;
+------+-------------+-------+------+---------------+------+---------+------+---------+---------------------------------+
| id   | select_type | table | type | possible_keys | key  | key_len | ref  | rows    | Extra                           |
+------+-------------+-------+------+---------------+------+---------+------+---------+---------------------------------+
|    1 | SIMPLE      | test  | ALL  | NULL          | NULL | NULL    | NULL | 4785970 | Using temporary; Using filesort |
+------+-------------+-------+------+---------------+------+---------+------+---------+---------------------------------+
1 row in set (0.000 sec)
The same statement with an index doesn't need to create a temporary table:
MariaDB [test]> alter table test add index(first_name);
Query OK, 0 rows affected (7.571 sec)
Records: 0 Duplicates: 0 Warnings: 0
MariaDB [test]> explain select first_name from test group by first_name;
+------+-------------+-------+-------+---------------+------------+---------+------+------+--------------------------+
| id   | select_type | table | type  | possible_keys | key        | key_len | ref  | rows | Extra                    |
+------+-------------+-------+-------+---------------+------------+---------+------+------+--------------------------+
|    1 | SIMPLE      | test  | range | NULL          | first_name | 58      | NULL | 2553 | Using index for group-by |
+------+-------------+-------+-------+---------------+------------+---------+------+------+--------------------------+
I currently have a database for articles that keeps track of the most read article for a certain amount of time by incrementing the "visits" counter on page_load. The current "visits" counter is a column in the articles table (see below):
id | title | description | visits | creation_date
---+--------+-------------+--------+-----------------
1 | test1 | test test.. | 10 | 2019-01-01
2 | test2 | test test.. | 20 | 2019-01-01
Sometimes, I experienced connection timeouts and I suspected a deadlock from the "visits" write procedure (database locks if concurrent users were incrementing the same row at once). I thought of the below scenario as an enhancement:
Remove the Visits counter from the table Articles
Create a new table article_visits with two columns: article_id and date
Articles
id | title | desc | creation_date
---+-------+------+---------------
1 | test1 | desc | 2019-01-01
2 | test2 | desc | 2019-01-01
article_visits
article_id | visit_date
-----------+----------------------
1 | 2019-01-01
1 | 2019-01-01
1 | 2019-01-01
1 | 2019-01-01
1 | 2019-01-01
1 | 2019-01-01
2 | 2019-01-01
2 | 2019-01-01
2 | 2019-01-01
With this alternative, each new visit inserts a new row into the article_visits table, avoiding any deadlocks on the articles table. This solution will make the article_visits table grow very big very quickly, but I don't think table size is a problem.
I would like to know if this is the proper way to log article visits, and if this optimization is a better option than the original solution.
This is a fine way to record article visits. It is much less (or not at all) prone to deadlocks, because you are basically just appending new rows.
It is more flexible. You can get the number of visits between two dates, for instance, and that can be decided at query time. You can store the exact time, to determine whether there are time-of-day preferences for views.
The downside is performance on querying. If you frequently need the counts, then the calculation can be expensive.
If this is an issue, there are multiple possible approaches:
A process that summarizes all the data periodically (say, daily), as sketched after this list.
A process that summarizes the data on a per-period basis, for that period only (say, a daily summary).
A materialized/indexed view which allows the database to keep the data up-to-date.
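For the summary approaches, a minimal sketch in MySQL-style syntax (article_visits_daily is a hypothetical table; restricting the refresh to the most recent period turns it into the per-period variant):
CREATE TABLE article_visits_daily (
    article_id  INT  NOT NULL,
    visit_date  DATE NOT NULL,
    visit_count INT  NOT NULL,
    PRIMARY KEY (article_id, visit_date)
);

-- run periodically (e.g. from a scheduled job)
INSERT INTO article_visits_daily (article_id, visit_date, visit_count)
SELECT article_id, visit_date, COUNT(*)
FROM article_visits
GROUP BY article_id, visit_date
ON DUPLICATE KEY UPDATE visit_count = VALUES(visit_count);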
This is certainly valid, though you may want to do some scoping on how much additional storage and memory load this will require for your database server.
Additionally, I might add a full datetime or datetime2 column for the actual timestamp (in addition to the current date column rather than instead of it, since you'll want to aggregate by date only, and having that value pre-computed can improve performance), and perhaps a few other columns such as IP address and referrer. You can then use this data for additional purposes, such as auditing and tracking referrer/advertiser ROI.
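A sketch of those extra columns in SQL Server syntax (the column names are illustrative):
ALTER TABLE article_visits ADD
    visit_ts   datetime2(0) NOT NULL DEFAULT SYSUTCDATETIME(),
    ip_address varchar(45)  NULL,  -- wide enough for the IPv6 text form
    referrer   varchar(400) NULL;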
I'm interested to understand why you are getting a deadlock. A database platform should be able to handle an update tablename set field = field + 1 running concurrently just fine. The table or row will lock and then release, but the lock should not be held long enough to cause a deadlock error.
You could get a deadlock error if you are updating or locking more than one table in a transaction that spans multiple tables, especially if different sessions touch them in a different order.
So the question is: in your original code, are you touching multiple tables when you do the update statement? The solution could be as simple as making your update atomic to one table, as illustrated below.
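For illustration, a single-row, single-table increment like the one below should not deadlock on its own; it is when it runs inside a larger transaction that also touches other tables, in inconsistent order, that deadlocks tend to appear:
UPDATE articles
SET visits = visits + 1
WHERE id = 1;  -- one row locked, and only briefly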
However, I do agree -- the table you describe is a more functional design.
The current Articles table is not in normalized form. Putting a visits column in the Articles table is not a proper way of denormalizing.
The current Articles table not only gives you the deadlock issue, it also prevents many other kinds of reports, such as daily or weekly visit reports.
Creating an Article_visits table is a very good move. It will be updated very frequently.
My Article_visits design:
article_visit_id | article_id | visit_date | visit_count
-----------------+--------------+----------------------+----------------------
1 | 1 | 2019-01-01 | 6
2 | 2 | 2019-01-01 | 3
Here Article_Visit_id is an int identity(1,1) column, which is also the clustered index.
Create NonClustered Index NCI_Articleid_date ON Article_visits(article_id,visit_date)
GO
In short, creating the clustered index on (article_id, visit_date) would be an expensive affair.
If no record exists for that article on that date, insert one with visit_count = 1; if it exists, update visit_count, i.e. increase it by 1.
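A rough T-SQL sketch of that logic (the @ArticleId and @VisitDate variables are illustrative; a real version would also have to handle the race between the two statements, e.g. inside a transaction):
UPDATE Article_Visit
SET Visit_Count = Visit_Count + 1
WHERE Articleid = @ArticleId
  AND Visit_Date = @VisitDate;

IF @@ROWCOUNT = 0
    INSERT INTO Article_Visit (Articleid, Visit_Date, Visit_Count)
    VALUES (@ArticleId, @VisitDate, 1);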
This design is normalized.
You can create any kind of report, for the current requirement and any future requirement.
You can show per-article counts; the query is easy and performant.
You can get weekly or even yearly reports easily, without an indexed view.
Actual table design:
CREATE TABLE Article(
    Articleid int identity(1,1) PRIMARY KEY,
    title varchar(100) NOT NULL,
    Descriptions varchar(max) NOT NULL,
    CreationDate datetime2(0)
)
GO
CREATE TABLE Article_Visit(
    Article_VisitID int identity(1,1) PRIMARY KEY,
    Articleid int NOT NULL,
    Visit_Date datetime2(0) NOT NULL,
    Visit_Count int NOT NULL
)
GO
-- Create trusted FK
ALTER TABLE Article_Visit
WITH CHECK
ADD CONSTRAINT FK_Articleid FOREIGN KEY (Articleid)
REFERENCES Article(Articleid);
GO
--CREATE NONCLUSTERED INDEX NCI_Articleid_Date ON
--    Article_Visit(Articleid, Visit_Date)
--GO
CREATE NONCLUSTERED INDEX NCI_Articleid_Date1 ON
    Article_Visit(Visit_Date) INCLUDE (Articleid)
GO
The trusted FK is created to get the index-seek benefit (in short).
I think NCI_Articleid_Date is no longer required because Articleid is a trusted FK.
Deadlock issue: the trusted FK was also created to help with the deadlock issue.
Deadlocks often occur due to bad application code, un-optimized SQL queries or bad table design. Besides this there are several other valid reasons, like handling race conditions; it is quite a DBA topic. If deadlocks hurt too much then, after addressing the reasons above, you may have to adjust the isolation level.
Many deadlock situations are handled automatically by SQL Server itself.
There are many articles online about deadlock causes.
I don't think table size is a problem
Table size is a big issue. The chance of deadlocks in both designs is very low, but you will always face the other downsides of a big table.
I suggest reading a few more articles on this.
I hope this is exactly your real table, with the same data types?
How frequently will each table be inserted into/updated?
Which table will be queried more frequently?
What is the concurrent use of each table?
Deadlocks can only be minimized, so that there is no performance or transaction issue.
What is the relation between Visitorid and Articleid?
I have a database structure in MySQL similar to Instagram, where I have a table containing paths to pictures in a file system and a table containing user information as such:
Users:
ID | userName | age | gender
---|-----------|-----|-------
1 | MrBanana | 15 | 0
2 | BobTheMan | 21 | 0
3 | TheBest | 19 | 1
4 | MsTest | 24 | 1
Pictures:
ID | Path | userID
---|-----------|--------
1 | www.test1 | 2
2 | www.test2 | 4
3 | www.test3 | 3
4 | www.test4 | 2
Now the requirement is that whenever a picture is fetched, it should include the user's name and ID. So the first idea I had was to create a view that joins the two tables, so that a picture also has the name and ID of its user attached to it, and then query the pictures out of that view. The query would be placed in a stored procedure. My question is whether this is efficient, or whether it would be more efficient to do the query and join in one statement and put that into the stored procedure.
My concern is that if I use the view approach, every time it queries the view it will have to first join the entirety of the two tables, and if these tables become very big this would be a very time-consuming process. So if I create a stored procedure that first finds all the needed pictures and then joins the users to them, it would be more efficient.
I am not sure if I am understanding this correctly, and would like to ask for help on which approach is better and would scale more effectively.
Not sure which RDBMS you are using, but from my experience with SQL Server (and I guess the other vendors do the same), an ordinary view will use the indexes
of the tables included in the view query just as if you were running that query outside the view.
So if you are worried about whether your vwPicturesWithUser would use the index of the Pictures table when you query for the picture with ID=3, the answer is yes (well, I guess somebody could come up with some odd scenario where the query planner decides to ignore the index, but that would happen when querying without the view too).
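As a minimal sketch of the view approach, using the names from the question (MySQL syntax; a simple join view like this is typically merged into the outer query, so the filter on the picture id can still use the primary key/index on Pictures):
CREATE VIEW vwPicturesWithUser AS
SELECT p.ID   AS pictureId,
       p.Path AS path,
       u.ID   AS userId,
       u.userName
FROM Pictures AS p
JOIN Users    AS u ON u.ID = p.userID;

SELECT pictureId, path, userId, userName
FROM vwPicturesWithUser
WHERE pictureId = 3;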
I'm Brazilian and my English is not very good; I apologize.
I have a problem: when replicating tables, I want to set rules so that some columns are not replicated, or are replicated with a default value. For example:
id | descrisaoProduto | estoque
1 | abcd | 10
on replication
id | descrisaoProduto | estoque
1 | (null or default value) | 10
I would also like to find out whether there is any way, when the data is replicated, to convert one table's layout into another, for example:
id | estoqueLocal | estoqueMatriz
1 | 10 | 0
on replication
id | estoqueLocal | estoqueMatriz
1 | 0 | 10
Probably the simplest way to accomplish this would be to create a view representing the data you wish the subscriber to see, and then replicate that view instead of the underlying source table. Views can be replicated as easily as tables.
In your scenario, you would want to replicate an indexed view as a table on the subscriber side. In this way, you would not need to replicate the underlying table. From the article above:
For indexed views, transactional replication also allows you to replicate the indexed view as a table rather than a view, eliminating the need to also replicate the base table. To do this, specify one of the "indexed view logbased" options for the @type parameter of sp_addarticle (Transact-SQL).
Here's an article demonstrating how to set up replication of an indexed view with transactional replication.
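As a rough sketch of such an indexed view in SQL Server syntax, assuming a hypothetical source table dbo.Produto with the columns from the second example (the view swaps the two stock columns and exposes only what the subscriber should receive):
CREATE VIEW dbo.vProdutoReplica
WITH SCHEMABINDING
AS
SELECT id,
       estoqueMatriz AS estoqueLocal,  -- swapped on purpose, per the example
       estoqueLocal  AS estoqueMatriz
FROM dbo.Produto;
GO

-- the unique clustered index is what makes it an indexed view, which can then be
-- published with one of the "indexed view logbased" @type options mentioned above
CREATE UNIQUE CLUSTERED INDEX IX_vProdutoReplica ON dbo.vProdutoReplica (id);
GO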
Let's assume that I have N tables for N bookstores. I have to keep data about books in separate tables for each bookstore, because each table has a different schema (the number and types of columns differ), however there is a set of columns common to all the bookstore tables.
Now I want to create one "MasterTable" with only a few columns.
| MasterTable |
|id. | title| isbn|
| 1 | abc | 123 |
| MasterToBookstores |
|m_id | tb_id | p_id |
| 1 | 1 | 2 |
| 1 | 2 | 1 |
| BookStore_Foo |
|p_id| title| isbn| date | size|
| 1 | xyz | 456 | 1998 | 3KB |
| 2 | abc | 123 | 2003 | 4KB |
| BookStore_Bar |
|p_id| title| isbn| publisher | Format |
| 1 | abc | 123 | H&K | PDF |
| 2 | mnh | 986 | Amazon | MOBI |
My question: is it right to keep data this way? What are the best practices for this and similar cases? Can I give a particular bookstore table an alias with a number, which will help me manage the whole set of tables?
Is there a better way of doing this?
I think you are confusing the concepts of "store" and "book".
From your comments and the example data, it appears the problem is in having different sets of attributes for books, not stores. If so, you'll need an inheritance structure [1]: BOOK is the "base class" and BOOK1/BOOK2/BOOK3 are various "subclasses" [2]. This is a common strategy when entities share a set of attributes or relationships [3]. For a fuller explanation of this concept, please search for "Subtype Relationships" in the ERwin Methods Guide.
Unfortunately, inheritance is not directly supported by current relational databases, so you'll need to transform this hierarchy into plain tables. There are generally 3 strategies for doing so, as described in these posts:
Interpreting ER diagram
Parent and Child tables - ensuring children are complete
Supertype-subtype database design
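As a rough sketch of the "table per subtype" strategy (the table and column names book, book_print, book_digital, bookstore and bookstore_book are invented from the example data, not taken from the linked posts):
CREATE TABLE book (
    id    INT PRIMARY KEY,
    title VARCHAR(200) NOT NULL,
    isbn  VARCHAR(20)  NOT NULL
);

-- one subtype table per distinct attribute set, keyed by the base book id
CREATE TABLE book_print (
    id       INT PRIMARY KEY,
    pub_year INT,
    size_kb  INT,
    FOREIGN KEY (id) REFERENCES book (id)
);

CREATE TABLE book_digital (
    id        INT PRIMARY KEY,
    publisher VARCHAR(100),
    format    VARCHAR(20),
    FOREIGN KEY (id) REFERENCES book (id)
);

-- stores and the store-book link are independent of the subtypes
CREATE TABLE bookstore (
    id   INT PRIMARY KEY,
    name VARCHAR(100) NOT NULL
);

CREATE TABLE bookstore_book (
    bookstore_id INT,
    book_id      INT,
    PRIMARY KEY (bookstore_id, book_id),
    FOREIGN KEY (bookstore_id) REFERENCES bookstore (id),
    FOREIGN KEY (book_id)      REFERENCES book (id)
);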
NOTE: The structure above allows various book types to be mixed inside the same bookstore. Let me know if that's not desirable (i.e. you need exactly one type of books in any given bookstore)...
[1] Aka. category, subclassing, subtyping, generalization hierarchy, etc.
[2] I.e. types of books, depending on which attributes they require.
[3] In this case, books of all types are in a many-to-many relationship with stores.
If you have at least two columns that all the other tables use, then you could have a base table for all books and add more tables for the rest of the data, using the id from the base table.
UPDATE:
If you use Entity Framework to connect to your DB, I suggest you try this: create your entity model with a base Book entity and derived entities for the different book types, then let Entity Framework generate the database for you (Update Database from Model). Note that this uses inheritance in the model, not in the database.
Let me know if you have questions.
Suggested data model:
1. Have a master database, which holds the master data.
2. The dimension tables in the master database are transactionally replicated to your distributed bookstore databases.
3. You can choose to use an updatable subscriber; merge replication is also a good choice.
4. Each distributed bookstore database still works independently; master data is merged back either by merge replication or by the updatable subscriber.
5. If you want to ensure master data integrity, you can use read-only subscribers and use transactional replication to distribute the master data into the distributed databases, but in this design you need stored procedures in the master database to register your dimension data. Make sure there is no double-hop issue.
I would suggest you have two tables:
bookStores:
id name someMoreColumns
books:
id bookStore_id title isbn date publisher format size someMoreColumns
It's easy to see the relationship here: a bookStore has many books.
Note that I'm putting all the columns you have across all of your BookStore tables into just one table, even if some rows from some tables do not have a value for some columns.
Why I prefer this way:
1) Of all the data from the BookStore tables, only a few columns will never have a value in the books table (for example, size and format if you don't have an e-book version). The other columns can be filled in someday (you can set a date for your e-books, but you don't have that column in your BookStore_Bar table, which seems to refer to the e-books). This way you can have much more detailed information about all your books if you ever want to update it.
2) If you have a bunch of BookStore tables, let's say 12, you will not be able to handle your data easily. What I mean is, if you want to run some query over all your books (which means over all your tables), you have at least three ways (contrasted in the sketch after this list):
First: run the query manually against each of the 12 tables and then merge the data;
Second: write a single query that combines the 12 tables (e.g. with UNION) or puts all 12 tables in the FROM clause;
Third: depend on some script, stored procedure or software to do the first or second way for you;
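As a rough contrast between the second way and the single-table design, using the table names from the examples above:
-- per-store tables: one branch per table, repeated for all 12 stores
SELECT title, isbn FROM BookStore_Foo
UNION ALL
SELECT title, isbn FROM BookStore_Bar;
-- ... UNION ALL the remaining 10 tables

-- single books table: one statement, regardless of how many stores exist
SELECT title, isbn FROM books;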
I like to be able to work with my data as easily as possible and with no dependency on some other script or software, unless I really need it.
3) In MySQL (because I know MySQL much better) you can use partitioning on your books table. It is a higher level of data management in which you distribute the data from your table across several files on disk, instead of the single one a table is generally allocated. It is very useful when handling a large amount of data in the same table, and it speeds up queries based on your data distribution plan. Let's see an example:
Let's say you already have 12 distinct bookstores, but under my database model. Each row in your books table will be associated with one of the 12 bookstores. If you partition your data by bookStore_id, it will be almost the same as having 12 tables, because you can create a partition for each bookStore_id and each partition will then handle only the related data (the rows that match that bookStore_id).
Let's say you want to query the books table for bookStore_id IN (1, 4, 9). If your query really only needs these three partitions to produce the desired output, the others will not be read and it will be as fast as querying each separate table.
You can drop a partition and the others will not be affected. You can add new partitions to handle new bookstores. You can subpartition a partition, or merge two partitions. In a nutshell, you can turn your single books table into an easy-to-handle, multi-storage table.
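A minimal sketch of this in MySQL (the column list is abbreviated and the partition names are illustrative; note that MySQL requires the partitioning column to be part of every unique key, hence the composite primary key):
CREATE TABLE books (
    id           INT NOT NULL AUTO_INCREMENT,
    bookStore_id INT NOT NULL,
    title        VARCHAR(200),
    isbn         VARCHAR(20),
    PRIMARY KEY (id, bookStore_id)
)
PARTITION BY LIST (bookStore_id) (
    PARTITION p_store1 VALUES IN (1),
    PARTITION p_store2 VALUES IN (2),
    PARTITION p_store3 VALUES IN (3)
    -- ... one partition per bookStore_id
);

-- only partitions p_store1 and p_store3 are read (partition pruning)
SELECT title, isbn FROM books WHERE bookStore_id IN (1, 3);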
Side Effects:
1) I don't know everything about table partitioning, so it's good to refer to the documentation to learn all the important points about creating and managing it.
2) Take care of your data with regular backups (dumps), as you will probably have a very heavily populated books table.
I hope it helps you!