Clean/truncate table before selecting into it - sql

Just a quick question to check whether this is the right way to do something.
I have a query whose results I want to materialize into a table. I thought about updating the table with only the changed rows, but since my query takes only about a minute to run, I think it is easier to just drop the whole table and rerun the query every time I do my hourly update.
Will this be the way to do it?
TRUNCATE TABLE MyTable;

SELECT *
INTO MyTable
FROM TableTwo
WHERE X;
And then just take that query, turn it into a stored procedure, and turn the procedure into a job that runs once an hour.
Also, I want this table to have indexes. Will they be preserved even if I truncate every time?

You can do this, but I would probably advise dropping the table instead. That handles changes in table structure better.
If you do use TRUNCATE, you want INSERT rather than INTO:
INSERT INTO MyTable  -- a column list is recommended
SELECT *
FROM TableTwo
WHERE X;
Dropping the table might take an iota more time, and it doesn't preserve triggers, constraints, foreign key references, or storage definitions. (I'm guessing those are not important here.) However, it does allow the query to change over time, which might be useful for future-proofing the code.
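If you go the drop route, a minimal sketch might look like this (using the SQL Server OBJECT_ID idiom; table names follow the question):

IF OBJECT_ID('dbo.MyTable', 'U') IS NOT NULL
    DROP TABLE dbo.MyTable;

SELECT *
INTO dbo.MyTable
FROM dbo.TableTwo
WHERE X;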

TRUNCATE doesn't drop the table from the database; it just empties it.
So you won't be able to run SELECT INTO, because the object ID already exists. However, TRUNCATE preserves all the indexes and keys and the table's integrity.
If you drop the table instead, you get rid of its object ID, and then you can run SELECT INTO. That's only a good idea if the table you're inserting from is going to change all the time (which is a bad thing on its own). This method doesn't preserve any indexes or keys, so you'd have to create them in the stored proc every time you run it, which is again a bad thing on its own.
My suggestion is to turn it into an INSERT INTO stored procedure with the TRUNCATE inside it. If your company decides to make changes to the table, you then have to change both MyTable and the SP, which is more of a headache, but companies usually don't change their database structure very often unless the database is still in development or testing rather than live. In that case SELECT INTO would only be a temporary solution anyway.
CREATE PROC MyProc
AS
BEGIN
    TRUNCATE TABLE MyTable;

    INSERT INTO MyTable (Col1, Col2, Col3)
    SELECT Col1, Col2, Col3
    FROM TableTwo;
END
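The hourly SQL Server Agent job step then needs only a single line:

EXEC MyProc;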


What is the most efficient way to implement a Redshift merge/upsert operation

I am in the process of writing a custom upsert function for a specific use case for a Redshift table. In their docs, AWS suggest two methods which I'm drawing inspiration from. Here is what I want to accomplish:
Insert any new rows to an existing table, but only if they don't already exist.
There is never a need to delete or modify an existing row (for my use case)
I have so far come up with two separate ways to do this, but I'm wondering what the tradeoffs of each could be.
Using an EXCEPT query to insert only the new rows from a temp table:
insert into persisted_table (
    select *
    from temp_table
    except
    select *
    from persisted_table
);
Storing the results of a UNION query over the temp table and the persisted table, and using that as the new persisted table:
insert into new_table (
    select *
    from temp_table
    union
    select *
    from persisted_table
);
alter table persisted_table rename to old_persisted_table_marked_for_deletion;
alter table new_table rename to persisted_table;
I'm aware that UNION (unlike UNION ALL, it has to de-duplicate) is slow and generally not recommended for bulk/large-scale operations. Apart from that, though, are there any arguments that could influence this decision?
The first advice I'd give is to remember that Redshift is a cluster. Whatever process you select, if the data is large, you will want the comparison that determines whether a row already exists to stay "on node", so you will want the tables in question to be distributed on the same key.
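For example, a sketch of co-locating the two tables on a shared distribution key (the id column and types here are assumptions, not from the question):

-- distributing both tables on the same key keeps the existence
-- comparison on node and avoids cross-node redistribution
create table persisted_table (id bigint, payload varchar(256))
    diststyle key distkey (id) sortkey (id);

create temp table temp_table (id bigint, payload varchar(256))
    diststyle key distkey (id);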
Next I would think about what the key into the data is. The processes you laid out compare all columns. Is that needed? If a subset of columns can serve as the key, the comparison can be made more efficient:
insert into persisted_table (
    select a.*
    from temp_table a
    left join persisted_table b on a.{key} = b.{key}
    where b.{key} is null  -- no match in persisted_table, so the row is new
);
Hopefully these aspects will help your decision process.

Caching joined tables in SQL Server

My website has a search procedure that runs very slowly. One thing that slows it down is the 8-table join it has to do (it also has a WHERE clause on ~6 search parameters). I've tried to make the query faster using various methods such as adding indexes, but these have not helped.
One idea I have is to cache the result of the 8-table join: I could create a temporary table of the join and make the search procedure query this table, updating the table every 10 minutes or so.
Using pseudocode, I would change my procedure to look like this:
IF CachedTable is NULL or CachedTable is older than 10 minutes
    DROP TABLE CachedTable
    CREATE TABLE CachedTable as (select * from .....)
ENDIF

Select * from CachedTable Where Name = #SearchName
AND EmailAddress = #SearchEmailAddress
Is this a working strategy? I don't really know what syntax I would need to pull this off, or what I would need to lock to stop things from breaking if two queries happen at the same time.
Also, it might take quite a long time to make a new CachedTable each time, so I thought of trying something like double buffering in computer graphics:
IF CachedTable is NULL
    CREATE TABLE CachedTable as (select * from ...)
ELSE IF CachedTable is older than 10 minutes
    -- Somehow do this asynchronously, so that the next time a search comes
    -- through, the new table is used?
    ASYNCHRONOUS (
        CREATE TABLE BufferedCachedTable as (select * from ...)
        DROP TABLE CachedTable
        RENAME TABLE BufferedCachedTable as CachedTable
    )

Select * from CachedTable Where Name = #SearchName
AND EmailAddress = #SearchEmailAddress
Does this make any sense? If so, how would I achieve it? If not, what should I do instead? I tried using indexed views, but that resulted in weird errors, so I want something like this that I have more control over (and something I can potentially move onto a different server in the future).
Also, what about indexes and so on for tables created like this?
This may be obvious from the question, but I don't know that much about SQL or the options I have available.
You can use multiple schemas (you should always specify the schema!) and play switch-a-roo, as I demonstrated in this question. Basically, you need two additional schemas: one to hold a copy of the table temporarily, and one to hold the cached copy.
CREATE SCHEMA cache AUTHORIZATION dbo;
CREATE SCHEMA hold AUTHORIZATION dbo;
Now, create a mimic of the table in the cache schema:
SELECT * INTO cache.CachedTable FROM dbo.CachedTable WHERE 1 = 0;
-- then create any indexes etc.
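For example, an index to support the search predicate from the question (the index name is made up):

CREATE INDEX ix_CachedTable_Name_Email
    ON cache.CachedTable (Name, EmailAddress);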
Now when it comes time to refresh the data:
-- step 1:
TRUNCATE TABLE cache.CachedTable;
-- (if you need to maintain FKs you may need to delete)
INSERT INTO cache.CachedTable SELECT ...
-- step 2:
-- this transaction will be almost instantaneous,
-- since it is a metadata operation only:
BEGIN TRANSACTION;
ALTER SCHEMA hold TRANSFER dbo.CachedTable;
ALTER SCHEMA dbo TRANSFER cache.CachedTable;
ALTER SCHEMA cache TRANSFER hold.CachedTable;
COMMIT TRANSACTION;
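Putting it together, a minimal sketch of the refresh procedure (the name is hypothetical) that a SQL Server Agent job could call every 10 minutes:

CREATE PROCEDURE dbo.RefreshCachedTable
AS
BEGIN
    SET NOCOUNT ON;

    -- step 1: repopulate the staging copy
    TRUNCATE TABLE cache.CachedTable;
    INSERT INTO cache.CachedTable SELECT ...;  -- the expensive 8-table join

    -- step 2: metadata-only three-way swap
    BEGIN TRANSACTION;
    ALTER SCHEMA hold TRANSFER dbo.CachedTable;
    ALTER SCHEMA dbo TRANSFER cache.CachedTable;
    ALTER SCHEMA cache TRANSFER hold.CachedTable;
    COMMIT TRANSACTION;
END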

Is it possible to add an index to a temp table? And what's the difference between create #t and declare @t

I need to do a very complex query.
At one point, this query must join to a view that unfortunately cannot be indexed.
The view is itself complex, joining big tables.
View's output can be simplified as this:
PID (int), Kind (int), Date (date), D1,D2..DN
where the PID, Kind, and Date fields are not unique (there may be more than one row with the same combination of PID, Kind, and Date), but they are the fields used in joins like this:
left join ComplexView mkcs on mkcs.PID=q4.PersonID and mkcs.Date=q4.date and mkcs.Kind=1
left join ComplexView mkcl on mkcl.PID=q4.PersonID and mkcl.Date=q4.date and mkcl.Kind=2
left join ComplexView mkco on mkco.PID=q4.PersonID and mkco.Date=q4.date and mkco.Kind=3
Now, if I just do it like this, executing the query takes significant time, because the complex view is run three times (I assume), and out of its huge number of rows only some are actually used (e.g., out of 40,000 only 2,000).
What I did is declare @temptable and run insert into @temptable select * from ComplexView where Date ...; once per query, I select only the rows I am going to use from my ComplexView, and then I join this @temptable.
This reduced execution time significantly.
However, I noticed that if I create a real table in my database with a clustered index on (PID, Kind, Date) (non-unique clustered), then deleting everything from that table and re-inserting from the complex view takes only a few seconds (3 or 4), and using that table in my query (left joining it three times) cuts the query time in half, from 1 minute to 30 seconds!
So, my question is, first of all: is it possible to create indexes on declared @temptables?
And then: I've seen people talk about the create #temptable syntax. Maybe this is what I need? Where can I read about the difference between declare @temptable and create #temptable? Which should I use for a query like mine? (This query is for an MS Reporting Services report, if that matters.)
#tablename is a physical table, stored in tempdb, that the server will drop automatically when the connection that created it is closed; @tablename is a table variable that lives for the lifetime of the batch/procedure that created it, just like a local variable (despite common belief, it too is backed by tempdb rather than living purely in memory).
You can only add a (non-PK) index to a #temp table:
create table #blah (fld int)
create nonclustered index idx on #blah (fld)
It's not a complete answer, but: #table creates a temporary table that lives in tempdb until you drop it or the session ends, whereas @table is a table variable that will not persist longer than your batch.
Also, I think this post will answer the other part of your question.
Creating an index on a table variable
Yes, you can create indexes on temp tables or table variables. http://sqlserverplanet.com/sql/create-index-on-table-variable/
The @tableName syntax is a table variable. These are rather limited. The syntax is described in the documentation for DECLARE @local_variable. You can kind of have indexes on table variables, but only indirectly, by specifying PRIMARY KEY and UNIQUE constraints on columns. So if the data in the columns you need an index on happens to be unique, you can do this; see this answer. This may be “enough” for many use cases, but only for small numbers of rows. If you don't have indexes on your table variable, the optimizer will generally treat it as if it contains one row (regardless of how many rows it actually holds), which can result in terrible query plans if there are hundreds or thousands of rows in it instead.
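A small sketch of that workaround (the columns are hypothetical):

DECLARE @People TABLE
(
    PersonId INT NOT NULL PRIMARY KEY,       -- enforced via a clustered index
    Email    NVARCHAR(128) NOT NULL UNIQUE   -- enforced via a nonclustered index
);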
The #tableName syntax is a locally-scoped temporary table. You can create these using either SELECT…INTO #tableName or CREATE TABLE #tableName syntax. The scope of these tables is a little more complex than that of variables. If you have CREATE TABLE #tableName in a stored procedure, all references to #tableName in that stored procedure refer to that table. If you simply reference #tableName in a stored procedure (without creating it), it will look into the caller’s scope. So you can create #tableName in one procedure, call another procedure, and in that other procedure read/update #tableName. However, once the procedure that created #tableName runs to completion, that table will be automatically unreferenced and cleaned up by SQL Server. So there is no reason to clean up these tables manually unless you have a procedure which is meant to loop/run indefinitely or for long periods of time.
You can define complex indexes on temporary tables, just as if they are permanent tables, for the most part. So if you need to index columns but have duplicate values which prevents you from using UNIQUE, this is the way to go. You do not even have to worry about name collisions on indexes. If you run something like CREATE INDEX my_index ON #tableName(MyColumn) in multiple sessions which have each created their own table called #tableName, SQL Server will do some magic so that the reuse of the global-looking identifier my_index does not explode.
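For example, mirroring the asker's scenario (the column types are assumptions), a non-unique clustered index on the join columns:

CREATE TABLE #ComplexView (PID INT, Kind INT, [Date] DATE, D1 INT, D2 INT);
CREATE CLUSTERED INDEX ix_pid_kind_date ON #ComplexView (PID, Kind, [Date]);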
Additionally, temporary tables automatically get statistics built on them, just like normal tables. The query optimizer recognizes that temporary tables can have more than just one row in them, which can in itself yield great performance gains over table variables. Of course, this also adds a tiny amount of overhead, though that overhead is likely worth it and not noticeable if your query’s runtime is longer than one second.
To extend Alex K.'s answer, you can create a PRIMARY KEY on a temp table:
IF OBJECT_ID('tempdb..#tempTable') IS NOT NULL
    DROP TABLE #tempTable

CREATE TABLE #tempTable
(
    Id INT PRIMARY KEY
    ,Value NVARCHAR(128)
)

INSERT INTO #tempTable
VALUES
    (1, 'first value')
    ,(3, 'second value')
-- will cause: Violation of PRIMARY KEY constraint 'PK__#tempTab__3214EC071AE8C88D'.
-- Cannot insert duplicate key in object 'dbo.#tempTable'. The duplicate key value is (1).
--  ,(1, 'first value one more time')

SELECT * FROM #tempTable

What does 'select to a temp table' mean?

This answer had me slightly confused. What is a 'select to a temp table' and can someone show me a simple example of it?
A temp table is a table that exists just for the duration of the stored procedure and is commonly used to hold temporary results on the way to a final calculation.
In SQL Server, all temp tables are prefixed with a # so if you issue a statement like
create table #tmp (id int, columnA varchar(30))
Then SQL Server will automatically know that the table is temporary, and it will be destroyed when the stored procedure goes out of scope unless the table is explicitly dropped, like:
drop table #tmp
I commonly use them in stored procedures that run against huge tables with a high transaction volume, because I can insert the subset of data that I need into the temp table as a temporary copy and work on the data without fear of bringing down a production system if what I'm doing with the data is a fairly intense operation.
In SQL Server, all temp tables live in the tempdb database.
See this article for more information.
If you have a complex set of results that you want to use again and again, do you keep querying the main tables (where the data is changing and performance may be impacted), or do you store the results in a temporary table for further processing? It's often better to use a temporary table.
Or, if you really need to iterate through rows in a non-set-based fashion, you can use a temp table (or a CURSOR).
If you only do simple CRUD against a DB, then you probably have no need for temp tables.
You have:
table variables: DECLARE @foo TABLE (bar int...)
explicit temp tables: CREATE TABLE #foo (bar int...)
inline created: SELECT ... INTO #foo FROM...
A temp table is a table that is dynamically created by using some such syntax:
SELECT [columns] INTO #MyTable FROM SomeExistingTable
What you then have is a table that is populated with the values that you selected into it. Now you can select against it, update it, whatever.
SELECT FirstName FROM #MyTable WHERE...
The table lives for some predetermined scope of time, for example for the duration of the stored procedure in which it lives. Then it's gone and never accessible again. Temporary.
HTH
You can use SELECT ... INTO to both create a temp table and populate it like so:
SELECT Col1, Col2...
INTO #Table
FROM ...
WHERE ...
(BTW, this syntax is for SQL Server and Sybase.)
EDIT: Once you have created the table as I did above, you can then use it in other queries on the same connection:
Select *
From OtherTable
Join #Table
    On #Table.Col = OtherTable.Col
The key here is that it all happens on the same connection. Thus, creating and using a temp table from a client script would be awkward, in that you would have to ensure that all subsequent uses of the table happen on the same connection. Instead, most people use temp tables in stored procedures, where they create the table on one line and use it a few lines later in the same procedure.
Think of temp tables as SQL variables of type 'table'. Use them in scripts and stored procedures. They come in handy when you need to manipulate data that is not a simple value but a subset of a database table (both vertical and horizontal slices).
Once you realize these benefits, you can take advantage of the further power that comes with the various sharing models (scopes) for temp tables: private, global, transaction, etc. All major RDBMS engines support temp tables, but there is no standard feature set or syntax for them.
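For instance, in SQL Server the name prefix selects the sharing model:

create table #private_tmp (id int)    -- private: visible only to the creating session
create table ##global_tmp (id int)    -- global: visible to all sessions while referenced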
For an example of usage, see this answer.

Improving performance of Sql Delete

We have a query to remove some rows from a table based on an id field (the primary key). It is a pretty straightforward query:
delete from OUR_TABLE where ID in (123, 345, ...)
The problem is that the number of ids can be huge (e.g. 70k), so the query takes a long time. Is there any way to optimize this?
(We are using Sybase, if that matters.)
There are two ways to make statements like this one perform:
Create a new table and copy all but the rows to delete, then swap the tables afterwards (alter table name ...). I suggest giving it a try even though it sounds stupid; some databases are much faster at copying than at deleting.
Partition your tables. Create N tables and use a view to join them into one. Sort the rows into different tables grouped by the delete criterion. The idea is to drop a whole table instead of deleting individual rows.
Consider running this in batches. A loop running 1000 records at a time may be much faster than one query that does everything, and in addition it will not keep the table locked out from other users for as long at a stretch.
If you have cascading deletes (and lots of foreign key tables affected) or triggers involved, you may need to run in even smaller batches. You'll have to experiment to see which number is best for your situation. I've had tables where I had to delete in batches of 100 and others where 50000 worked (fortunate in that case, as I was deleting a million records).
But in any event I would put the key values I intend to delete into a temp table and delete from there, as sketched below.
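A hedged sketch of combining the two ideas in Sybase-style syntax (table and column names are hypothetical; SET ROWCOUNT caps each pass):

create table #IDsToDelete (ID int not null)
-- ... populate #IDsToDelete ...

set rowcount 1000          -- delete at most 1000 rows per pass

while 1 = 1
begin
    delete OUR_TABLE
    from #IDsToDelete
    where OUR_TABLE.ID = #IDsToDelete.ID

    if @@rowcount = 0 break
end

set rowcount 0             -- reset to unlimited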
I'm wondering if parsing an IN clause with 70K items in it is a problem. Have you tried using a temp table with a join instead?
Can Sybase handle 70K arguments in an IN clause? All databases I have worked with have some limit on the number of arguments in an IN clause. For example, Oracle has a limit of around 1000.
Can you use a subselect instead of the IN clause? That will shorten the SQL, and it may help with such a big number of values. Something like this:
DELETE FROM OUR_TABLE WHERE ID IN
(SELECT ID FROM somewhere WHERE some_condition)
Deleting a large number of records can be sped up with some interventions in the database, if the database model permits. Here are some strategies:
You can speed things up by dropping indexes, deleting the records, and recreating the indexes afterwards. This eliminates the rebalancing of index trees while the records are deleted:
drop all indexes on table
delete records
recreate indexes
If you have lots of relations to this table, try disabling constraints, if you are absolutely sure that the delete will not break any integrity constraint. The delete will go much faster because the database won't be checking integrity. Enable the constraints after the delete:
disable integrity constraints, disable check constraints
delete records
enable constraints
Disable triggers on the table, if you have any and if your business rules allow it. Delete the records, then enable the triggers again.
Last, do as others have suggested: make a copy of the table that contains only the rows that are not to be deleted, then drop the original, rename the copy, and recreate the integrity constraints, if there are any. (See the sketch below.)
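A hedged sketch of that last strategy (sp_rename exists in both Sybase and SQL Server; index and constraint DDL is omitted, and the names are hypothetical):

select *
into OUR_TABLE_KEEP
from OUR_TABLE
where ID not in (select ID from #IDsToDelete)

drop table OUR_TABLE

exec sp_rename 'OUR_TABLE_KEEP', 'OUR_TABLE'
-- then recreate indexes, constraints, triggers, and permissions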
I would try a combination of 1, 2 and 3. If that does not work, then 4. If everything is still slow, I would look for a bigger box: more memory, faster disks.
Find out what is using up the performance!
In many cases you might use one of the solutions provided. But there might be others (based on Oracle knowledge, so things will be different on other databases; edit: I just saw that you mentioned Sybase):
Do you have foreign keys on that table? Make sure the referring ids are indexed.
Do you have indexes on that table? It might be that dropping them before the delete and recreating them afterwards is faster.
Check the execution plan. Is it using an index where a full table scan might be faster, or the other way round? HINTS might help.
Instead of a select into new_table as suggested above, a create table as select might be even faster.
But remember: Find out what is using up the performance first.
When you use DDL statements, make sure you understand and accept the consequences they might have for transactions and backups.
Try sorting the IDs you are passing into the IN clause in the same order as the table or index is stored in. You may then get more hits on the disk cache.
Putting the IDs to be deleted into a temp table that has them sorted in the same order as the main table may let the database do a simple scan over the main table.
You could try using more than one connection and splitting the work across the connections, so as to use all the CPUs on the database server; however, think about which locks will be taken out, etc., first.
I also think that the temp table is likely the best solution.
If you were to do a "delete from .. where ID in (select id from ...)", it can still be slow with large queries, though. I thus suggest deleting using a join; many people don't know about that functionality.
So, given this example table:
-- set up tables for this example
if exists (select id from sysobjects where name = 'OurTable' and type = 'U')
drop table OurTable
go
create table OurTable (ID integer primary key not null)
go
insert into OurTable (ID) values (1)
insert into OurTable (ID) values (2)
insert into OurTable (ID) values (3)
insert into OurTable (ID) values (4)
go
We can then write our delete code as follows:
create table #IDsToDelete (ID integer not null)
go
insert into #IDsToDelete (ID) values (2)
insert into #IDsToDelete (ID) values (3)
go
-- ... etc ...
-- Now do the delete - notice that we aren't using 'from'
-- in the usual place for this delete
delete OurTable from #IDsToDelete
where OurTable.ID = #IDsToDelete.ID
go
drop table #IDsToDelete
go
-- This returns only items 1 and 4
select * from OurTable order by ID
go
Does OUR_TABLE have a reference with ON DELETE CASCADE?