I want to create a copy of a table, fewer columns than original... but still updated from it - sql

I have a 30 GB table with 30-40 columns. I create reports from this table and it causes performance problems. I only use 4-5 of its columns for the reports, so I want to create a second table just for reporting. But the second table must be kept up to date when the original table changes, without using triggers.
No matter what my query is, when it is executed SQL Server tries to cache all 30 GB. When the cache is full, it starts to use disk. This is what I want to avoid.
How can I do this?
Is there a way of doing this using SSIS?
Thanks in advance.

CREATE VIEW myView
AS
SELECT
    column1,
    column3,
    column4 * column7 AS column4x7   -- a computed column in a view must be given a name
FROM
    yourTable
A view is effectively just a stored query, like a macro. You can then select from that view as if it were a normal table.
Unless you go for materialised views, it's not really a table, it's just a query. So it won't speed anything up, but it does encapsulate code and help control what data different users/logins can read.

If you are using SQL Server, what you want is an indexed view. Create a view with the columns you want and then place an index on it.
An indexed view stores the data in the view. It should keep the view up-to-date with the underlying table, and it should reduce the I/O for reading the table. Note: this assumes that your 4-5 columns are much narrower than the overall table.
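As a minimal sketch (table and column names are placeholders; col1 is assumed to be the table's unique key):
CREATE VIEW dbo.ReportView
WITH SCHEMABINDING            -- required for an indexed view
AS
SELECT col1, col2, col3, col4, col5
FROM dbo.BigTable;            -- two-part table name is required with SCHEMABINDING
GO
-- The first index on the view must be a unique clustered index; once it exists,
-- SQL Server maintains the view's rows in sync with the base table automatically.
CREATE UNIQUE CLUSTERED INDEX IX_ReportView ON dbo.ReportView (col1);
GO
On editions other than Enterprise, query it with FROM dbo.ReportView WITH (NOEXPAND) so the optimizer reads the view's index instead of expanding the definition back onto the base table.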

Dems' answer with the view seems ideal, but if you are truly looking for a new table, create it and have it automatically updated with triggers.
Triggers on the primary table can be added for all INSERT, UPDATE and DELETE actions against it. When the action happens, the trigger fires and can be used to do additional work, such as updating your new secondary table. You pull the changed rows from the inserted and deleted pseudo-tables (see MSDN).
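As a rough sketch only (table and column names here are invented), an AFTER trigger that keeps a narrow reporting copy in sync might look like this:
CREATE TRIGGER trg_BigTable_SyncReport
ON dbo.BigTable
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;

    -- Remove rows that were deleted, or the old versions of updated rows
    DELETE r
    FROM dbo.ReportTable AS r
    INNER JOIN deleted AS d ON d.Id = r.Id;

    -- Copy the inserted/updated rows, keeping only the reporting columns
    INSERT INTO dbo.ReportTable (Id, col1, col2, col3, col4)
    SELECT i.Id, i.col1, i.col2, i.col3, i.col4
    FROM inserted AS i;
END;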
There are many great existing articles on triggers here and in the documentation.

You can create that second table just like you're thinking, and use triggers to update table 2 whenever table 1 is updated.
However, triggers present performance problems of their own; the speed of your inserts and updates will suffer. I would recommend looking for more conventional ways to improve query performance in what sounds like SQL Server, since you mentioned SSIS.
Since it's only 4-5 out of 30 columns, have you tried adding an index which covers the query? I'm not sure if there are even more columns in your WHERE clause, but you should try that first. A covering index would actually do exactly what you're describing, since the table would never need to be touched by the query. Of course, this does cost a little in terms of space and insert/update performance. There's always a tradeoff.
On top of that, I can't believe that you would need to pull a large percentage of rows out of a 30 GB table for any given report; that is simply too much data for a report. A filtered index can improve query performance even more by only indexing the rows that are most likely to be asked for. If you have a report which lists results for the past calendar month, you could add a condition to only index the rows WHERE report_date > '5/1/2012', for example.
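For illustration only (the key and included columns are placeholders; the real index should key on whatever the report's WHERE clause filters on):
-- Covering index: the INCLUDE columns let the report be answered
-- entirely from the index, without touching the wide base table.
CREATE NONCLUSTERED INDEX IX_BigTable_Report
    ON dbo.BigTable (report_date)
    INCLUDE (col1, col2, col3, col4);

-- Filtered index: only index the rows the reports actually ask for.
CREATE NONCLUSTERED INDEX IX_BigTable_Report_Recent
    ON dbo.BigTable (report_date)
    INCLUDE (col1, col2, col3, col4)
    WHERE report_date > '20120501';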

Related

PostgreSQL efficient count of number of rows WHERE parameter=TRUE

I have a very massive PostgreSQL database table. One of its columns stores a boolean parameter. What I want is a very simple query:
SELECT COUNT(*) FROM myTable WHERE myTable.isSthTrue;
The problem is that it is very slow, as it needs to check every row to see whether it satisfies the criterion. Adding an index on this column speeds up the calculation roughly two times, but doesn't improve the complexity in general.
Some people on the internet suggest adding triggers that update the count which is stored in a separate table. But that feels like too much effort for an easy problem like this.
If you need an exact result, then yes, a trigger-based solution is likely the best path to take.
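A rough sketch of that approach (assuming isSthTrue is a NOT NULL boolean; the side table and trigger names are made up for illustration):
-- One-row table holding the maintained count
CREATE TABLE myTableTrueCount (cnt bigint NOT NULL);
INSERT INTO myTableTrueCount
    SELECT COUNT(*) FROM myTable WHERE isSthTrue;

CREATE OR REPLACE FUNCTION adjust_true_count() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        IF NEW.isSthTrue THEN
            UPDATE myTableTrueCount SET cnt = cnt + 1;
        END IF;
    ELSIF TG_OP = 'DELETE' THEN
        IF OLD.isSthTrue THEN
            UPDATE myTableTrueCount SET cnt = cnt - 1;
        END IF;
    ELSIF TG_OP = 'UPDATE' THEN
        IF NEW.isSthTrue AND NOT OLD.isSthTrue THEN
            UPDATE myTableTrueCount SET cnt = cnt + 1;
        ELSIF OLD.isSthTrue AND NOT NEW.isSthTrue THEN
            UPDATE myTableTrueCount SET cnt = cnt - 1;
        END IF;
    END IF;
    RETURN NULL;   -- AFTER trigger: the return value is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_adjust_true_count
AFTER INSERT OR UPDATE OR DELETE ON myTable
FOR EACH ROW EXECUTE PROCEDURE adjust_true_count();
Reading the count is then a single-row lookup: SELECT cnt FROM myTableTrueCount;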
If an estimate is okay, consider Materialized Views. (Postgres 9.3+)
Something like CREATE MATERIALIZED VIEW myCount AS SELECT COUNT(*) FROM myTable WHERE myTable.isSthTrue; would maintain a copy of the expensive query you reference. The only caveat is that this aggregate view would not be automatically updated. To do that you need to call REFRESH MATERIALIZED VIEW, which could be done from a cron job or other scheduled task.
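Concretely (a sketch using the names from the question; the cnt alias is added here for convenience):
CREATE MATERIALIZED VIEW myCount AS
SELECT COUNT(*) AS cnt
FROM myTable
WHERE isSthTrue;

-- Re-run the underlying query and replace the stored result,
-- e.g. from a cron job or other scheduled task.
REFRESH MATERIALIZED VIEW myCount;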

Why is select count(*) faster than select * even if the table has no indexes?

Is there any difference between the time taken by SELECT * and SELECT COUNT(*) for a table having no primary key or other indexes in SQL Server 2008 R2?
I have tried SELECT COUNT(*) from a view and it took 00:05:41 for 410,063,922 records.
SELECT * from the view has already taken 10 minutes for the first 600,000 records and the query is still running, so it looks like it will take more than 1 hour.
Is there any way through which I can make this view faster without any change in the structure of the underlying tables?
Can I create indexed view for tables without indexes?
Can I use caching for the view inside sql server so if it is called again, it takes less time?
It's a view which contains 20 columns from one table only. The table does not have any indexes. The user is able to query the view, and I am not sure whether they do SELECT * or SELECT somecolumn with WHERE conditions. The only thing I want to do is propose some changes so that their queries on the view return results faster. I am thinking of indexing and caching, but I am not sure whether they are possible on a view over a table that has no indexes. Indexing is not possible here, as mentioned in one of the answers.
Can anyone put some light on caching within sql server 2008 R2?
COUNT(*) returns just a number, while SELECT * returns all the data. Imagine having to move all that data, and the time that takes for your hundreds of thousands of records. Even if your table were indexed properly, running SELECT * over hundreds of thousands of records would still take a lot of time, even if less than before, and should never be needed in the first place.
Can I create indexed view for tables without indexes?
No, you have to add indexes to get indexed results.
Can I use caching for the view inside sql server so if it is called again, it takes less time?
Yes you can, but it's of no use for such a requirement. Why are you selecting so many records in the first place? You should never need to return millions or thousands of rows of complete data in any query.
Edit
In fact you are trying to get billions of rows without any WHERE clause. This is bound to fail on any server you can get hold of, so better stop there :)
TL;DR
Indexes do not matter for a SELECT * FROM myTABLE query because there is no WHERE condition and billions of rows are involved. Unless you change your query, no optimization can help you.
The execution time difference is due to the fact that SELECT * returns the entire content of your table, while SELECT COUNT(*) only counts how many rows are present without returning them.
Answer about optimisation
In my opinion you're approaching the problem from the wrong angle. First of all it's important to define the real needs of your clients; once the requirements are defined, you'll certainly be able to improve your view to get better performance and avoid returning billions of rows.
Optimisations can even be made on the table structure sometimes (we don't have any info about your current structure).
SQL Server automatically uses caching to make execution quicker, but that will not solve your problem.
SQL Server apparently does very different work when its result set field list is different. I just did a test of a query joining several tables where many millions of rows were in play. I tested different queries, which were all the same except for the list of fields in the SELECT clause. Also, the base query (for all tests) returned zero rows.
The SELECT COUNT(*) took 6 seconds and the SELECT MyPrimaryKeyField took 6 seconds. But once I added any other column (even small ones) to the SELECT list, the time jumped to 20 minutes - even though there were no records to return.
When SQL Server thinks it needs to leave its indexes (e.g., to access table columns not included in an index) then its performance is very different - we all know this (which is why SQL Server supports including base columns when creating indexes).
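As a sketch of that (the table and column names are invented, not from the question):
-- The key column supports seeks and ordering; the INCLUDE columns are stored
-- at the leaf level, so the query never has to visit the base table at all.
CREATE NONCLUSTERED INDEX IX_Orders_CustomerId
    ON dbo.Orders (CustomerId)
    INCLUDE (OrderDate, TotalAmount);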
Getting back to the original question, the SQL Server optimizer apparently chooses to access the base table data outside of the indexes before it knows that it has no rows to return. In the poster's original scenario, though, there were no indexes or PK (don't know why), but maybe SQL Server is still accessing table data differently with COUNT(*).

Select Query VS View

Generally we aggregate transaction data into one table periodically using a SQL job. Now the customer needs live data, so an interval of an hour is too long. Instead of running the job every few minutes, the team suggests creating a view which includes the logic for selecting data from the transaction table itself.
But my confusion is whether there is any performance-related difference between a select query and a view. Putting data into an aggregate table periodically and selecting from that aggregation is easy because:
1. It does not affect the main tables
Selecting directly from the transaction tables impacts insert operations, which happen frequently. If I create a view, I think the concept is the same as a select statement: we are still selecting from the main tables. Or is my understanding wrong?
Please suggest the best approach for this, and why.
If i create a view i think the concept is same as select statement
Correct, a view in its simplest form is nothing more than a saved query definition. When the view is used in a query, the definition is expanded out into the outer query and optimised accordingly, so there is no performance benefit. This is not true of course for indexed views, where the view essentially becomes its own table, and using the NOEXPAND query hint stops the definition being expanded so the engine simply reads from the view's index(es). Since this is an aggregation query, though, I suspect an indexed view won't even be possible, never mind a viable solution.
The next part about having a table to store the aggregation is more difficult to answer. Yes, this could benefit performance, but at the cost of not having up-to-the-minute data, and also of having to maintain the table. Whether this is a suitable solution depends entirely on your needs: how up to date the data needs to be, how often it is required, and how long it takes to (a) populate the reporting table and (b) run the query on its own.
For example if it takes 20 seconds to run the query, but it is only needed twice a day, then there is no point running this query every hour to maintain a reporting table to assist a query that is run twice a day.
Another option could be maintaining this reporting table through triggers, i.e. when a row is inserted/updated, cascade the change to the reporting table there and then. This would keep the reporting table up to date, but again, if you are inserting millions of transactions a day and running the report a few times, you have to weigh up whether the additional overhead caused by the trigger is worth it.
You could reduce the impact on write operations by using a transaction isolation level of READ UNCOMMITTED for the SELECT, but as with the summary table option, this comes at the cost of not having completely accurate, up-to-the-second information, as you will be reading uncommitted transactions.
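For illustration (the table and columns here are placeholders):
-- Read uncommitted ("dirty") data so the report does not block writers,
-- at the risk of seeing rows that are later rolled back.
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SELECT SomeKey, SUM(Amount) AS Total
FROM dbo.Transactions
GROUP BY SomeKey;

-- Equivalent per-table form using a hint:
SELECT SomeKey, SUM(Amount) AS Total
FROM dbo.Transactions WITH (NOLOCK)
GROUP BY SomeKey;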
A final, intermediate option: create a summary table daily and partition/index your main table by date. You can then get the summary of historic data from the daily table, and UNION it with only today's data, which should be relatively fast with the right indexes.
A good approach is to create the view and profile your database. Yes, the best way to check whether the view is a good solution is to create it and test the results.
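A rough sketch of that pattern, with made-up table and column names:
-- Historic totals come from the pre-aggregated daily summary table;
-- only today's rows are aggregated on the fly from the live table.
SELECT SomeKey, SUM(Total) AS Total
FROM (
    SELECT SomeKey, Total
    FROM dbo.DailySummary
    WHERE SummaryDate < CAST(GETDATE() AS date)

    UNION ALL

    SELECT SomeKey, SUM(Amount) AS Total
    FROM dbo.Transactions
    WHERE TransactionDate >= CAST(GETDATE() AS date)
    GROUP BY SomeKey
) AS combined
GROUP BY SomeKey;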
A similar question was asked here:
Is a view faster than a simple query?

Speeding up deletes that have joins

I am running a stored procedure to delete data from two tables:
DELETE FROM TESTING_TestResults
FROM TESTING_TestResults
INNER JOIN TESTING_QuickLabDump
    ON TESTING_QuickLabDump.quicklabdumpid = TESTING_TestResults.quicklabdumpid
WHERE TESTING_QuickLabDump.[Specimen ID] = @specimen   -- @specimen is the procedure's parameter

DELETE FROM TESTING_QuickLabDump
WHERE [Specimen ID] = @specimen
One table has 60M rows and the other about 2M rows.
The procedure takes about 3 seconds to run.
Is there any way I can speed this up? Perhaps using EXISTS?
Meaning IF EXISTS...THEN DELETE, because the delete should not be occurring every single time.
Something like this:
if @specimen exists in TESTING_QuickLabDump, then do the procedure with the two deletes
thank you !!!
Rewriting the query probably won't help speed this up. Use the profiler to find out which parts of the query are slow; for this, have the profiler output the execution plan. Then try adding appropriate indexes. Perhaps one or both tables could use an index over [Specimen ID].
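For example, something along these lines (a sketch, assuming [Specimen ID] and the join column are not already indexed):
CREATE NONCLUSTERED INDEX IX_QuickLabDump_Specimen
    ON TESTING_QuickLabDump ([Specimen ID]);

-- The join column on the other table also benefits from an index,
-- so the joined delete can seek straight to the affected rows.
CREATE NONCLUSTERED INDEX IX_TestResults_QuickLabDumpID
    ON TESTING_TestResults (quicklabdumpid);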
For a table with 60 mil rows I would definitely look into partitioning the data horizontally and/or vertically. If it's time-sensitive data then you ought to be able to move old data into a history table. That's usually the first and most obvious thing people do so I would imagine if that were a possibility you would have already done it.
If there are many columns then it would definitely benefit you to denormalize the data into multiple tables. If you did this, I would suggest renaming the tables and creating a view of all the partitioned tables named after the original table. Doing that should ensure existing code isn't broken.
If you 'really' want to fine-tune the speed then you should look into getting a faster hard drive and learn a little about how hard drives work. Whether the data is stored towards the inner or outer section of the disk will slightly affect access speed, for example. And solid state drives have come a long way, so you might look into getting one of those.
Besides indexing the "obvious" fields, also look in your database schema and check if you have any FOREIGN KEYs whose ON DELETE CASCADE or SET NULL might be triggered by your delete (unlike Oracle, MS SQL Server will tend to show these in the execution plan). Fortunately, this is usually fairly easy to fix by indexing the child endpoint of the FOREIGN KEY.
Also check if you have any expensive triggers.

sql: DELETE + INSERT vs UPDATE + INSERT

A similar question has been asked, but since it always depends, I'm asking for my specific situation separately.
I have a website page that shows some data that comes from a database, and to generate that data I have to run some fairly complex queries with multiple joins.
The data is being updated once a day (nightly).
I would like to pre-generate the data for that page to speed up page access.
For that I am creating a table that contains exactly the data I need.
Question: for my situation, is it reasonable to do a complete table wipe followed by an insert, or should I do an update plus insert?
SQL-wise, DELETE + INSERT seems easier (the INSERT part is a single SQL statement).
EDIT: RDBMS: MS SQL Server 2008 Ent
TRUNCATE will be faster than DELETE, so if you need to empty a table, do that instead.
You didn't specify your RDBMS vendor, but some of them also have MERGE/UPSERT commands. These let you update the table if the data exists and insert if it doesn't.
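Since the edit says SQL Server 2008 (which supports MERGE), a sketch of that upsert might look like this; the table and column names are placeholders:
MERGE dbo.ReportData AS target
USING dbo.SourceQueryResults AS source
    ON target.Id = source.Id
WHEN MATCHED THEN
    UPDATE SET target.Value1 = source.Value1,
               target.Value2 = source.Value2
WHEN NOT MATCHED THEN
    INSERT (Id, Value1, Value2)
    VALUES (source.Id, source.Value1, source.Value2);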
It partly depends on how the data is accessed. If you have a period of time with no (or very few) users accessing it, then there won't be much impact on the data disappearing (between the DELETE and the completion of the INSERT) for a short while.
Have you considered using a materialized view (MSSQL calls them indexed views) instead of doing it manually? This could also have other performance benefits, as an indexed view gives the query optimizer more choices when it's constructing execution plans for other queries that reference the table(s) in the view.
It depends on the size of the table and the recovery model of the database. If you are deleting many hundreds of thousands of records and re-inserting them, versus updating a small batch of a few hundred and inserting tens of rows, it will add unnecessary growth to your transaction log. However, you could use TRUNCATE to get around this, as it is only minimally logged.
Do you have the option of a MERGE/UPSERT? If you're using MS-SQL you can use CROSS APPLY to do something similar if you don't.
One approach to handling this type of problem is to insert into a new table, then do a table rename. This ensures that all the new data becomes visible at the same time.
What if some data that was present yesterday is not there anymore? The delete approach may be safer, or you could end up having to delete some records anyway.
And in the end it doesn't really matter which way you go, except in the case #kevinw mentioned.
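A rough sketch of that swap, with made-up object names (sp_rename is SQL Server's rename procedure):
-- Build the new copy off to the side
INSERT INTO dbo.ReportData_staging (Id, Value1, Value2)
SELECT Id, Value1, Value2
FROM dbo.SourceQueryResults;

-- Swap the tables inside a transaction so readers always see a complete data set
BEGIN TRANSACTION;
    EXEC sp_rename 'dbo.ReportData', 'ReportData_old';
    EXEC sp_rename 'dbo.ReportData_staging', 'ReportData';
COMMIT TRANSACTION;

-- The old copy can then be emptied and reused as the next staging table
EXEC sp_rename 'dbo.ReportData_old', 'ReportData_staging';
TRUNCATE TABLE dbo.ReportData_staging;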
Although I fully agree with SQLMenace's answer, I would like to point out that MERGE does NOT remove unneeded records! If you're sure that your new data will be a superset of the existing data, then MERGE is great; otherwise you'll either need to make sure that you delete any superfluous records later on, or use the TRUNCATE + INSERT method ...
(Personally I'm still a fan of the latter, as it usually is quite fast; just make sure to drop all indexes/unique constraints up front and rebuild them one by one. This has the benefit of the INSERT transaction being smaller and the index-adding being done later on in (smaller) transactions.) (**)
(**: yes, this might be tricky on a live system, but then again he already mentioned this is done in some kind of overnight process anyway; I'm extrapolating that there is no user access at that time.)