I have a string vector data containing items that I want to insert into a table named foos. It's possible that some of the elements in data already exist in the table, so I must watch out for those.
The solution I'm using starts by transforming the data vector into the virtual table old_and_new; it then builds the virtual table old, which contains the elements already present in foos; then it constructs the virtual table new with the elements that are genuinely new. Finally, it inserts the new elements into the table foos.
WITH old_and_new AS (SELECT unnest($data::text[]) AS foo),
     old AS (SELECT foo FROM foos INNER JOIN old_and_new USING (foo)),
     new AS (SELECT * FROM old_and_new EXCEPT SELECT * FROM old)
INSERT INTO foos (foo) SELECT foo FROM new;
This works fine in a non-concurrent setting, but fails if concurrent threads
try to insert the same new element at the same time. I know I can solve this
by setting the isolation level to serializable, but that's very heavy-handed.
Is there some other way I can solve this problem? If only there were a way to
tell PostgreSQL that it is safe to ignore INSERT errors...
Is there some other way I can solve this problem?
There are plenty, but none are a panacea...
You can't lock for inserts the way you can do a SELECT FOR UPDATE, since the rows don't exist yet.
You can lock the entire table, but that's even heavier-handed than serializing your transactions.
You can use advisory locks, but be super wary about deadlocks. Sort new keys so as to obtain the locks in a consistent, predictable order. (Someone more knowledgeable with PG's source code will hopefully chime in, but I'm guessing that the predicate locks used in the serializable isolation level amount to doing precisely that.)
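For illustration, a minimal sketch of that advisory-lock idea, assuming the foos / $data setup from the question. hashtext() is used here only to map each text key to a lock id, and relying on the subquery's ORDER BY to drive lock-acquisition order is a simplification:
BEGIN;
-- take transaction-scoped advisory locks on the incoming keys, in sorted
-- order, so concurrent writers of the same keys queue up instead of colliding
SELECT pg_advisory_xact_lock(hashtext(foo))
FROM  (SELECT DISTINCT foo FROM unnest($data::text[]) AS foo ORDER BY foo) k;

-- ... then run the INSERT from the question within the same transaction
COMMIT;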
In pure SQL you could also use a DO statement to loop through the rows one by one, and trap the errors as they occur:
http://www.postgresql.org/docs/9.2/static/sql-do.html
http://www.postgresql.org/docs/9.2/static/plpgsql-control-structures.html#PLPGSQL-ERROR-TRAPPING
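For example, a minimal sketch of such a DO block, assuming foos.foo carries the UNIQUE constraint (a literal array stands in for $data here):
DO
$$
DECLARE
   _foo text;
BEGIN
   FOREACH _foo IN ARRAY '{foo,bar,baz}'::text[]
   LOOP
      BEGIN
         INSERT INTO foos (foo) VALUES (_foo);
      EXCEPTION WHEN unique_violation THEN
         NULL;  -- already present (or inserted concurrently): skip it
      END;
   END LOOP;
END
$$;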
Similarly, you could create a convoluted upsert function and call it once per piece of data...
If you're building $data at the app level, you could run the inserts one by one and ignore errors.
And I'm sure I forgot some additional options...
Whatever your course of action is (@Denis gave you quite a few options), this rewritten INSERT command will be much faster:
INSERT INTO foos (foo)
SELECT n.foo
FROM unnest ($data::text[]) AS n(foo)
LEFT JOIN foos o USING (foo)
WHERE o.foo IS NULL
It also leaves a much smaller time frame for a possible race condition.
In fact, the time frame should be so small, that unique violations should only be popping up under heavy concurrent load or with huge arrays.
Dupes in the array?
Except when your problem is built in: do you have duplicates in the input array itself? In this case, transaction isolation is not going to help you. The enemy is within!
Consider this example / solution:
INSERT INTO foos (foo)
SELECT n.foo
FROM (SELECT DISTINCT foo FROM unnest('{foo,bar,foo,baz}'::text[]) AS foo) n
LEFT JOIN foos o USING (foo)
WHERE o.foo IS NULL
I use DISTINCT in the subquery to eliminate the "sleeper agents", a.k.a. duplicates.
People tend to forget that the dupes may come within the import data.
Full automation
This function is one way to deal with concurrency for good. If a UNIQUE_VIOLATION occurs, the INSERT is just retried. The newly present rows are excluded from the new attempt automatically.
It does not take care of the opposite problem, that a row might have been deleted concurrently - this would not get re-inserted. One might argue that this outcome is OK, since such a DELETE happened concurrently. If you want to prevent this, make use of SELECT ... FOR SHARE to protect rows from concurrent DELETE.
CREATE OR REPLACE FUNCTION f_insert_array(_data text[], OUT ins_ct int) AS
$func$
BEGIN
LOOP
   BEGIN
      INSERT INTO foos (foo)
      SELECT n.foo
      FROM  (SELECT DISTINCT foo FROM unnest(_data) AS foo) n
      LEFT  JOIN foos o USING (foo)
      WHERE  o.foo IS NULL;

      GET DIAGNOSTICS ins_ct = ROW_COUNT;
      RETURN;

   EXCEPTION WHEN UNIQUE_VIOLATION THEN     -- foos.foo has a UNIQUE constraint
      RAISE NOTICE 'It actually happened!'; -- hardly ever happens
   END;
END LOOP;
END
$func$
LANGUAGE plpgsql;
I made the function return the count of inserted rows, which is completely optional.
-> SQLfiddle demo
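And if you do want the SELECT ... FOR SHARE protection mentioned above, a minimal sketch (same foos / $data assumptions) could lock the existing matches in the same transaction before the insert:
BEGIN;
-- share-lock rows that already exist so they can't be deleted concurrently
SELECT f.foo
FROM   foos f
JOIN   unnest($data::text[]) AS n(foo) USING (foo)
FOR    SHARE OF f;

INSERT INTO foos (foo)
SELECT n.foo
FROM   unnest($data::text[]) AS n(foo)
LEFT   JOIN foos o USING (foo)
WHERE  o.foo IS NULL;
COMMIT;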
I like both Erwin's and Denis's answers, but another approach might be this: have the concurrent sessions perform the unnesting and load into a separate temporary (staging) table, optionally eliminating what duplicates they can against the target table; then have a single session select from that staging table, resolve its internal duplicates in an appropriate manner, insert into the target table while checking again for existing values, and delete the selected staging rows (in the same query, using a common table expression).
This would be more batch-oriented, in the style of a data-warehouse extract-load-transform paradigm, but it would guarantee that no unique-constraint issues need to be dealt with.
Other advantages/disadvantages apply, such as decoupling the final insert from the data gathering (possible advantage), and needing to vacuum the temp table frequently (possible disadvantage), which may not be relevant to Jon's case, but might be useful info to others in the same situation.
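A rough sketch of what the single consolidating session might run, assuming a hypothetical staging table foo_staging(foo text) that the concurrent sessions load into:
WITH moved AS (
   DELETE FROM foo_staging      -- drain the staging table
   RETURNING foo
)
INSERT INTO foos (foo)
SELECT DISTINCT m.foo           -- resolve duplicates within the batch
FROM   moved m
LEFT   JOIN foos o USING (foo)  -- check again for existing values
WHERE  o.foo IS NULL;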
Can anyone explain the situations in which we need to make use of temporary tables in stored procedures?
There are many cases where a complex join can really trip up the optimizer and make it do very expensive things. Sometimes the easiest way to cool the optimizer down is to break the complex query into smaller parts. You'll find a lot of misinformation out there about using a @table variable instead of a #temp table because @table variables always live in memory - this is a myth; don't believe it.
You may also find this worthwhile if you have an outlier query that would really benefit from a different index that is not on the base table, and you are not permitted (or it may be detrimental) to add that index to the base table (it may be an alternate clustered index, for example). A way to get around that would be to put the data in a #temp table (it may be a limited subset of the base table, acting like a filtered index), create the alternate index on the #temp table, and run the join against the #temp table. This is especially true if the data filtered into the #temp table is going to be used multiple times.
There are also times when you need to make many updates against some data, but you don't want to update the base table multiple times. You may have multiple things you need to do against a variety of other data that can't be done in one query. It can be more efficient to put the affected data into a #temp table, perform your series of calculations / modifications, then update back to the base table once instead of n times. If you use a transaction here against the base tables you could be locking them from your users for an extended period of time.
Another example is if you are using linked servers and the join across servers turns out to be very expensive. Instead you can stuff the remote data into a local #temp table first, create indexes on it locally, and run the query locally.
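As an illustration of the alternate-index case (table and column names here are invented), the pattern looks roughly like this:
-- stage a filtered subset of the base table
SELECT OrderID, CustomerID, OrderDate
INTO   #RecentOrders
FROM   dbo.Orders
WHERE  OrderDate >= DATEADD(MONTH, -1, GETDATE());

-- give it the index the base table doesn't have (or can't have)
CREATE CLUSTERED INDEX IX_RecentOrders_Customer ON #RecentOrders (CustomerID);

-- run the join(s) against the indexed #temp table, possibly several times
SELECT c.CustomerName, o.OrderID, o.OrderDate
FROM   dbo.Customers AS c
JOIN   #RecentOrders AS o ON o.CustomerID = c.CustomerID;

DROP TABLE #RecentOrders;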
For my company I am redesigning some stored procedures. The original procedures use lots of permanent tables which are filled during the execution of the procedure, with the values deleted at the end. The number of rows can range from 100 to 50,000 rows for calculating the aggregations.
My question is: will there be severe performance issues if I replace those tables with temp tables? Is it feasible to use temp tables?
It depends on how often you're using them, how long the processing takes, and whether you are concurrently accessing data from the tables while writing.
If you use a temp table, it won't be sitting around waiting for indexing and caching while it's not in use. So it should save an ever so slight bit of resources there. However, you will incur overhead with the temp tables (i.e. creating and destroying).
I would re-examine how your queries function in the procedures and consider employing more in-procedure CURSOR operations instead of loading everything into tables and deleting them.
However, databases are for storing information and retrieving information. I would shy away from using permanent tables for routine temp work and stick with the temp tables.
Overall performance shouldn't be affected for the use case you specified in your question.
Hope this helps,
Jeffrey Kevin Pry
Yes, it's certainly feasible. You may want to check whether the permanent tables have any indexes on them to speed up joins and so on.
I agree with Jeffrey. It always depends.
Since you're using SQL Server 2008 you might have a look at table variables.
They should be lighter than TEMP tables.
I define a user-defined function which returns a table variable like this:
CREATE FUNCTION dbo.ufd_GetUsers ( @UserCode INT )
RETURNS @UsersTemp TABLE
(
    UserCode INT NOT NULL,
    RoleCode INT NOT NULL
)
AS
BEGIN
    INSERT @UsersTemp
    SELECT
        dbo.UsersRoles.Code, Roles.Code
    FROM
        dbo.UsersRoles
        INNER JOIN dbo.UsersRolesRelations ON dbo.UsersRoles.Code = dbo.UsersRolesRelations.UserCode
        INNER JOIN dbo.UsersRoles Roles ON dbo.UsersRolesRelations.RoleCode = Roles.Code
    WHERE dbo.UsersRoles.Code = @UserCode

    INSERT @UsersTemp VALUES(@UserCode, @UserCode)

    RETURN
END
A big question is: can more than one person run one of these stored procedures at a time? I regularly see this kind of table carried over from old single-user databases (or from programmers who couldn't do subqueries or much of anything beyond SELECT * FROM). What happens if more than one user tries to run the same procedure? What happens if it crashes midway through - does the table get cleaned up? With temp tables or table variables you have the ability to properly scope the table to just the current connection.
Definitely use a temporary table, especially since you've alluded to the fact that its purpose is to assist with calculations and aggregates. If you used a table inside one of your database's schemas all that work is going to be logged - written, backed up, and so on. Using a temporary table eliminates that overhead for data that in the end you probably don't care about.
You actually might save some time from the fact that you can drop the temp tables at the end instead of deleting rows (you said you had multiple users, so you have to delete rather than truncate). Deleting is a logged operation and can add considerable time to the process. If the permanent tables are indexed, then create the temp tables and index them as well. I would bet you would see an increase in performance unless your tempdb is close to running out of space.
Table variables also might work, but they can't be indexed and they are generally only faster for smaller datasets. So you might try a combination of temp tables for the things that will be large enough to benefit from indexing and table variables for the smaller items.
An advantage of using temp tables and table variables is that you guarantee that one user's process won't interfere with another user's process. You say they currently have a way to identify which records are whose, but all it takes is one bug being introduced to break that when using permanent tables. Permanent tables for temporary processing are a very risky choice. Temp tables and table variables can never see the data from someone else's process and thus are a far safer choice.
Table variables are normally the way to go.
SQL2K and below can have significant performance bottlenecks if there are many temp tables being manipulated - the issue is the blocking DDL on the system tables.
SQL 2005 is better, but table variables avoid the whole issue by not using those system tables at all, so they can perform without inter-user locking issues (except those involving the source data).
The issue is then that table variables only persist within scope, so if there is genuinely a large amount of data that needs to be processed repeatedly and persisted over a (relatively) long duration, then 'static' work tables may actually be faster - they do need a user key of some sort and regular cleaning, though. A last resort, really.
I have 2 procedures. One that builds a temp table and another (or several others) that use a temp table that's created in the first proc. Is this considered bad form? I'm walking into an existing codebase here and it uses this pattern a lot. It certainly bothers me but I can't say that it's explicitly a bad idea. I just find it to be an annoying pattern -- something smells rotten but I can't say what. I'd prefer to build the table locally and then fill it with an exec. But that requires procs that only return one table, which is unusual in the existing codebase.
Do the gurus of SQL shy away from this sort of temp table sharing? If so, why? I'm trying to form a considered opinion about this and would like some input.
Would a view be a viable alternative?
What about performance? Would it be better to build a #table locally or build a #table in the "fill" proc?
There's a good discussion of all methods here: http://www.sommarskog.se/share_data.html
As a programming paradigm it is ugly because it involves passing out-of-band parameters ('context parameters') that are not explicitly called out in the procedure signatures. This creates hidden dependencies and leads to a spaghetti effect.
But in the specific context of SQL, there is simply no alternative. SQL works with data sets, and you cannot pass these data sets back and forth as procedure parameters. You have a few alternatives:
Pass through the client. All too obviously not a real option.
XML or delimited strings to represent the results. Not a serious option by any stretch: they may give good 'programming' semantics (i.e. Law of Demeter conformance), but they really suck when performance comes into play.
Shared tables (be it in tempdb or in the app db), like Raj suggests. You lose the #temp automated maintenance (cleanup when the ref count goes to 0) and you have to be prepared for creation race conditions. They can also grow large for no reason (they're no longer partitioned by session into separate rowsets, like #temp tables are).
@table variables. They're scoped to the declaration context (i.e. the procedure) and cannot be passed back and forth between procedures. I also ran into some nasty problems under memory pressure.
The biggest problem with sharing temp tables is that it introduces external dependencies into a procedure that may not be apparent at first glance. Say you have some procedure p1 that calls a procedure p2, and a temp table #t1 is used to pass information between the two procedures. If you want to run p2 in isolation to see what it does, you have to create a "harness" that defines #t1 before you can run it. Using temp tables in T-SQL is often equivalent to using global variables in other languages - not recommended, but sometimes unavoidable.
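For example, a minimal harness to exercise p2 on its own might look like this (the column definitions are invented; p2 and #t1 are from the scenario above):
-- create the temp table p2 expects, exactly as p1 would
CREATE TABLE #t1 (id int PRIMARY KEY, payload varchar(50));

INSERT INTO #t1 (id, payload) VALUES (1, 'sample row');  -- seed test data

EXEC dbo.p2;        -- p2 sees #t1 because it runs on the same connection

DROP TABLE #t1;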
SQL Server 2008 now has table-valued parameters but Microsoft chose to make them read-only in this release. Still, it means you don't have to use temp tables in some scenarios where you had to before.
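A rough sketch of that table-valued-parameter alternative (type and procedure names are made up):
CREATE TYPE dbo.IdList AS TABLE (id int PRIMARY KEY);
GO
CREATE PROCEDURE dbo.p2_tvp
    @ids dbo.IdList READONLY    -- TVPs must be declared READONLY
AS
BEGIN
    SELECT t.id
    FROM   @ids AS t;           -- use the parameter like a table
END
GO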
My advice is, use them if you have to, but document their use thoroughly. If you have some proc that depends on a temp table to run, call this out in a comment.
Sometimes this is the only way to go. If you need to do this, then DOCUMENT, DOCUMENT, DOCUMENT the facts. Here is one way to make it clear: put the table definition as a comment in the parameter section...
CREATE PROCEDURE xyz
(
     @param1 int          --REQUIRED, what it does
    ,@param2 char(1)      --OPTIONAL, what it does
    ,@param3 varchar(25)  --OPTIONAL, what it does
    --this temp table is required and must be created in the calling procedure
    --#TempXyz (RowID int not null primary key
    --         ,DataValue varchar(10) not null
    --         ,DateValue datetime null
    --         )
)
Also document in the calling procedure where the temp table is created....
--** THIS TEMP TABLE IS PASSED BETWEEN STORED PROCEDURES **
--** ALL CHANGES MUST TAKE THIS INTO CONSIDERATION!!! **
CREATE TABLE #TempXyz
(RowID int not null primary key
,DataValue varchar(10) not null
,DateValue datetime null
)
I've encountered this same exact problem. What I can say from personal experience:
If there's a way to avoid using a #temp table across multiple procedures, use it. #temp tables are easy to lose track of and can easily grow tempdb if you're not careful.
In some cases, this method is unavoidable (my classic example is certain reporting functionality that builds data differently based on report configuration). If managed carefully, I believe that in these situations it is acceptable.
Sharing temp tables is not a bad idea, but we need to ensure that the table-level data is not manipulated by the two procs at the same time, which would lead to dirty read/write scenarios. Having a single, centralized table that the procs work on also helps keep the data in sync.
Regarding performance, it is true that having multiple procs share the same temp data will decrease performance. But at the same time, having individual tables [per proc] will lead to increased memory consumption.
I have used this approach in a few cases. But I always declare the temp table as a fully qualified table and not by using #tablename.
CREATE TABLE [tempdb].[dbo].[tablename]
With this approach, it is easy to keep track of the temp table.
Raj
Another technique I've seen to avoid using temp tables is so-called "SPID-keyed" tables. Instead of a temp table, you define a regular table with the following properties:
CREATE TABLE spid_table
(
    id INT IDENTITY(1, 1) NOT NULL,
    -- Your columns here
    spid INT NOT NULL DEFAULT @@SPID,
    PRIMARY KEY (id)
);
In your procedures, you would have code like the following:
SELECT @var = some_value
FROM spid_table
WHERE -- some condition
  AND spid = @@SPID;
and at the end of processing:
DELETE FROM spid_table
WHERE spid = @@SPID;
The big disadvantage of this is that the table uses the same recovery model as the rest of the database so all these transient inserts, updates and deletes are being logged. The only real advantage is the dependency is more apparent than using a temp table.
I guess it is OK if the 2nd proc checks for the existence of the temp table and moves forward if it exists. It can also raise an error if the temp table doesn't exist, asking the user to run proc1 first.
Why is the work divided into 2 stored procs? Can't everything be done in 1 proc?
Why is using '*' to build a view bad?
Suppose that you have a complex join and all fields may be used somewhere.
Then you just choose the fields you need:
SELECT field1, field2 FROM aview WHERE ...
The view "aview" could be SELECT table1.*, table2.* ... FROM table1 INNER JOIN table2 ...
We have a problem if 2 fields have the same name in table1 and table2.
Is this the only reason why using '*' in a view is bad?
With '*', you may use the view in a different context because the information is there.
What am I missing ?
Regards
I don't think there's much in software that is "just bad", but there's plenty of stuff that is misused in bad ways :-)
The example you give is a reason why * might not give you what you expect, and I think there are others. For example, if the underlying tables change, maybe columns are added or removed, a view that uses * will continue to be valid, but might break any applications that use it. If your view had named the columns explicitly then there was more chance that someone would spot the problem when making the schema change.
On the other hand, you might actually want your view to blithely accept all changes to the underlying tables, in which case a * would be just what you want.
Update: I don't know if the OP had a specific database vendor in mind, but it is now clear that my last remark does not hold true for all types. I am indebted to user12861 and Jonny Leeds for pointing this out, and sorry it's taken over 6 years for me to edit my answer.
Although many of the comments here are very good and reference one common problem of using wildcards in queries, such as causing errors or different results if the underlying tables change, another issue that hasn't been covered is optimization. A query that pulls every column of a table tends to not be quite as efficient as a query that pulls only those columns you actually need. Granted, there are those times when you need every column and it's a major PIA having to reference them all, especially in a large table, but if you only need a subset, why bog down your query with more columns than you need.
Another reason why "*" is risky, not only in views but in queries, is that columns can change name or change position in the underlying tables. Using a wildcard means that your view accommodates such changes easily without needing to be changed. But if your application references columns by position in the result set, or if you use a dynamic language that returns result sets keyed by column name, you could experience problems that are hard to debug.
I avoid using the wildcard at all times. That way if a column changes name, I get an error in the view or query immediately, and I know where to fix it. If a column changes position in the underlying table, specifying the order of the columns in the view or query compensates for this.
These other answers all have good points, but on SQL Server at least they also have some wrong points. Try this:
create table temp (i int, j int)
go
create view vtemp as select * from temp
go
insert temp select 1, 1
go
alter table temp add k int
go
insert temp select 1, 1, 1
go
select * from vtemp
SQL Server doesn't learn about the "new" column when it is added. Depending on what you want this could be a good thing or a bad thing, but either way it's probably not good to depend on it. So avoiding it just seems like a good idea.
To me this weird behavior is the most compelling reason to avoid select * in views.
The comments have taught me that MySQL has similar behavior and Oracle does not (it will learn about changes to the table). This inconsistency to me is all the more reason not to use select * in views.
Using '*' for anything production is bad. It's great for one-off queries, but in production code you should always be as explicit as possible.
For views in particular, if the underlying tables have columns added or removed, the view will either be wrong or broken until it is recompiled.
Using SELECT * within the view does not incur much of a performance overhead if columns aren't used outside the view - the optimizer will optimize them out; SELECT * FROM TheView can perhaps waste bandwidth, just like any time you pull more columns across a network connection.
In fact, I have found that views which link almost all the columns from a number of huge tables in my data warehouse have not introduced any performance issues at all, even though relatively few of those columns are requested from outside the view. The optimizer handles that well and is able to push the external filter criteria down into the view very well.
However, for all the reasons given above, I very rarely use SELECT *.
I have some business processes where a number of CTEs are built on top of each other, effectively building derived columns from derived columns from derived columns (which will hopefully one day be refactored as the business rationalizes and simplifies these calculations), and in that case I need all the columns to drop through each time, so I use SELECT * - but SELECT * is not used at the base layer, only between the first CTE and the last.
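Roughly, the pattern looks like this (the names are invented): explicit columns at the base layer, SELECT * only between the intermediate CTEs:
WITH base AS (
    SELECT OrderID, Quantity, UnitPrice           -- explicit at the base layer
    FROM   dbo.OrderLines
),
step1 AS (
    SELECT *, Quantity * UnitPrice AS LineTotal   -- derived column
    FROM   base
),
step2 AS (
    SELECT *, LineTotal * 0.9 AS DiscountedTotal  -- derived from a derived column
    FROM   step1
)
SELECT OrderID, LineTotal, DiscountedTotal
FROM   step2;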
The situation on SQL Server is actually even worse than the answer by @user12861 implies: if you use SELECT * against multiple tables, adding columns to a table referenced early in the query will actually cause your view to return the values of the new columns under the guise of the old columns. See the example below:
-- create two tables
CREATE TABLE temp1 (ColumnA INT, ColumnB DATE, ColumnC DECIMAL(2,1))
CREATE TABLE temp2 (ColumnX INT, ColumnY DATE, ColumnZ DECIMAL(2,1))
GO
-- populate with dummy data
INSERT INTO temp1 (ColumnA, ColumnB, ColumnC) VALUES (1, '1/1/1900', 0.5)
INSERT INTO temp2 (ColumnX, ColumnY, ColumnZ) VALUES (1, '1/1/1900', 0.5)
GO
-- create a view with a pair of SELECT * statements
CREATE VIEW vwtemp AS
SELECT *
FROM temp1 INNER JOIN temp2 ON 1=1
GO
-- SELECT showing the columns properly assigned
SELECT * FROM vwTemp
GO
-- add a few columns to the first table referenced in the SELECT
ALTER TABLE temp1 ADD ColumnD varchar(1)
ALTER TABLE temp1 ADD ColumnE varchar(1)
ALTER TABLE temp1 ADD ColumnF varchar(1)
GO
-- populate those columns with dummy data
UPDATE temp1 SET ColumnD = 'D', ColumnE = 'E', ColumnF = 'F'
GO
-- notice that the original columns have the wrong data in them now, causing any datatype-specific queries (e.g., arithmetic, dateadd, etc.) to fail
SELECT *
FROM vwtemp
GO
-- clean up
DROP VIEW vwTemp
DROP TABLE temp2
DROP TABLE temp1
It's because you don't always need every variable, and also to make sure that you are thinking about what you specifically need.
There's no point getting all the hashed passwords out of the database when building a list of users on your site for instance, so a select * would be unproductive.
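For instance (hypothetical column names):
-- list users without dragging password hashes across the wire
SELECT UserID, UserName, Email
FROM   dbo.Users;
-- rather than: SELECT * FROM dbo.Users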
Once upon a time, I created a view against a table in another database (on the same server) with
Select * From dbname..tablename
Then one day, a column was added to the targeted table. The view started returning totally incorrect results until it was redeployed.
Totally incorrect : no rows.
This was on Sql Server 2000.
I speculate that this is because of syscolumns values that the view had captured, even though I used *.
A SQL query is basically a functional unit designed by a programmer for use in some context. For long-term stability and supportability (possibly by someone other than you) everything in a functional unit should be there for a purpose, and it should be reasonably evident (or documented) why it's there - especially every element of data.
If I were to come along two years from now with the need or desire to alter your query, I would expect to grok it pretty thoroughly before I would be confident that I could mess with it. Which means I would need to understand why all the columns are called out. (This is even more obviously true if you are trying to reuse the query in more than one context. Which is problematic in general, for similar reasons.) If I were to see columns in the output that I couldn't relate to some purpose, I'd be pretty sure that I didn't understand what it did, and why, and what the consequences would be of changing it.
It's generally a bad idea to use *. Some code certification engines mark this as a warning and advise you to explicitly refer only to the necessary columns. The use of * can lead to performance losses, as you might only need some columns and not all. But, on the other hand, there are some cases where the use of * is ideal. Imagine that, no matter what, using the example you provided, for this view (aview) you would always need all the columns in these tables. In the future, when a column is added, you wouldn't need to alter the view. This can be good or bad depending on the case you are dealing with.
I think it depends on the language you are using. I prefer to use select * when the language or DB driver returns a dict (Python, Perl, etc.) or associative array (PHP) of the results. It makes your code a lot easier to understand if you are referring to the columns by name instead of as an index in an array.
No one else seems to have mentioned it, but within SQL Server you can also set up your view with the schemabinding attribute.
This prevents modifications to any of the base tables (including dropping them) that would affect the view definition.
This may be useful to you for some situations. I realise that I haven't exactly answered your question, but thought I would highlight it nonetheless.
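For reference, a minimal sketch (table and column names are invented); note that a schema-bound view must use two-part names and cannot use SELECT *:
CREATE VIEW dbo.vw_ActiveUsers
WITH SCHEMABINDING
AS
SELECT u.UserID, u.UserName
FROM   dbo.Users AS u
WHERE  u.IsActive = 1;
GO
-- Dropping or altering dbo.Users.UserName now fails while the view exists.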
And if you have joins, using select * automatically means you are returning more data than you need, as the data in the join fields is repeated. This is wasteful of database and network resources.
If you are naive enough to use views that call other views, using select * can make them even worse performers (this is a technique that is bad for performance on its own; pulling multiple columns you don't need makes it much worse).