SELECT INTO where the source data is in a different database than the target table - SQL

I am executing a SELECT INTO query whose source data is in a different database than the table I insert into (but on the same server).
When I run the query while connected to the database that holds the source data (USE DATABASE_MY_SOURCE_DATA), it completes in under a minute. When I change to the database where my target table sits, it doesn't complete within 10 minutes (I don't know the exact time because I cancelled it).
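For reference, this is the shape of the two runs (the database, schema, and table names other than DATABASE_MY_SOURCE_DATA are placeholders for my real ones):
-- Run 1: connected to the source database; completes in under a minute.
USE DATABASE_MY_SOURCE_DATA;
SELECT s.*
INTO DATABASE_MY_TARGET.dbo.TargetTable
FROM dbo.SourceTable AS s;

-- Run 2: connected to the target database; cancelled after 10+ minutes.
USE DATABASE_MY_TARGET;
SELECT s.*
INTO dbo.TargetTable
FROM DATABASE_MY_SOURCE_DATA.dbo.SourceTable AS s;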
Why is that? Why is the difference so huge? I can't get my head around it.

Querying cross-database, even over a linked server connection, is likely (at least as of 2021) to present performance concerns.
The first problem is that the optimizer doesn't have the statistics it needs to estimate the number of rows in the remote table(s). It's also going to miss indexes on those tables, resorting to table scans (which tend to be a lot slower on large tables than index seeks).
Another issue is that there is no data caching, so the engine makes round-trips to the remote database for every necessary operation.
More information (from a great source):
https://www.brentozar.com/archive/2021/07/why-are-linked-server-queries-so-bad/
Assuming that you want this to be more performant, and that you are doing substantial filtering on the remote data source, you may see some performance benefit from creating, on the remote database, a view that filters to just the rows you need in the target table, and querying that view for your results.
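A rough sketch of that idea (the object names here are placeholders, not your real ones):
-- On the remote (source) database: push the filtering as close to the data as possible.
CREATE VIEW dbo.FilteredSourceRows
AS
SELECT SourceID, SourceDate, Amount
FROM dbo.SourceTable
WHERE SourceDate >= '2021-01-01';
GO

-- On the target side: pull only what the view exposes.
SELECT f.*
INTO dbo.TargetTable
FROM DATABASE_MY_SOURCE_DATA.dbo.FilteredSourceRows AS f;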
Alternatively (and likely more correctly) you should wrap these operations in an ETL process (such as SSIS) that better manages these connections.

Related

Is it possible to implement point in time recovery (PITR) in PostgreSQL for a single table?

Let's say I have a database with lots of tables, but there's one big table that's being updated regularly. At any given point in time, this table contains billions of rows, and let's say the table is updated so regularly that we can expect a 100% refresh of the table by the end of each quarter. So the volume of data being moved around is on the order of tens of billions of rows. Because this table changes so constantly, I want to implement PITR, but only for this one table. I have two options:
1. Hack PostgreSQL's in-house PITR to apply to only one table.
2. Build it myself by creating a base backup, setting up continuous archiving (a rough sketch of the archiving setup is below), and using a Python script to execute the log of SQL statements up to a point in time (or use PostgreSQL's EXECUTE statement to loop through the archive). The big con with this is that it won't have the timeline functionality.
My problem is, I don't know if option 1 is even possible, and I don't know if option 2 even makes sense (looping through billions of rows sounds like it defeats the purpose of PITR, which is speed and convenience). What other options do I have?
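For context, the continuous-archiving piece I have in mind for option 2 is roughly this (PostgreSQL 12+; the paths and commands are placeholders, not a tested setup):
-- Enable WAL archiving so a base backup plus archived WAL can be replayed to a point in time.
ALTER SYSTEM SET wal_level = 'replica';                          -- minimum level needed for archiving
ALTER SYSTEM SET archive_mode = 'on';                            -- requires a server restart
ALTER SYSTEM SET archive_command = 'cp %p /mnt/wal_archive/%f';  -- %p = WAL file path, %f = file name
SELECT pg_reload_conf();                                         -- picks up whatever doesn't need a restart
-- Base backup (run from the shell): pg_basebackup -D /mnt/base_backup -Fp -Xs -P
-- Recovery target, set before starting recovery: recovery_target_time = '2024-03-31 23:59:59'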

MS SQL Server Query caching

One of my projects has a very large database on which I can't edit indexes etc.; I have to work with it as it is.
While testing some queries that I will be running against their database via a service I am writing in .NET, I noticed that they are quite slow the first time they run.
Here is what they used to do: they have 2 main (large) tables that are used the most. They showed me that they open SQL Server Management Studio and run a query along the lines of
SELECT *
FROM table1
JOIN table2
ON table1.<key> = table2.<key>
This query takes around 5 minutes to run the first time, but about 30 seconds if you run it again without closing SQL Server Management Studio. So they keep SQL Server Management Studio open 24/7: when one of their programs executes queries related to these 2 tables (which seems to be almost all of the queries their program runs), they get the 30-second run time instead of the 5 minutes.
This happens, I assume, because the 2 tables get cached, so afterwards there are no (or close to no) disk reads.
Is it a good idea to have a service that runs a query to cache these 2 tables every now and then? Or is there a better solution, given that I can't edit indexes or split the tables, etc.?
Edit:
Sorry, I was possibly unclear: the DB hopefully has indexes already; I am just not allowed to edit them or anything else.
Edit 2:
Query plan
This could be a candidate for an indexed view (if you can persuade your DBA to create it!), something like:
CREATE VIEW dbo.transhead_transdata
WITH SCHEMABINDING  -- required for an indexed view
AS
SELECT
<columns of interest>
FROM
dbo.transhead th    -- two-part names are required under SCHEMABINDING
JOIN dbo.transdata td
ON th.GID = td.HeadGID;
GO
CREATE UNIQUE CLUSTERED INDEX transjoined_uci ON dbo.transhead_transdata (<something unique>);
This will "precompute" the JOIN (and keep it in sync as transhead and transdata change).
You can't create indexes? This is your biggest problem regarding performance. A better solution would be to create the proper indexes and address any performance issues by checking wait stats, resource contention, etc. I'd start with Brent Ozar's blog and open-source tools, and move forward from there.
Keeping SSMS open doesn't prevent the plan cache from being cleared. I would start with a few links:
Understanding the query plan cache
Check your current plan cache (a quick sketch of such a check follows this list)
Understanding why the cache would clear (memory pressure, too many plans to hold them all, index rebuild operations, etc.); Brent talks about this in this answer
How to clear it manually
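For the plan cache check, something like this against the standard DMVs will do (no custom objects assumed):
-- What is cached right now, how big each plan is, and how often it has been reused.
SELECT cp.usecounts, cp.objtype, cp.size_in_bytes, st.text
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
ORDER BY cp.usecounts DESC;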
Aside from that... that query is suspect. I wouldn't expect your application to use those results; that is, I wouldn't expect you to load every row and column from two tables into your application every time it was called. Understand that a different query on those same tables, like one selecting fewer columns or adding a predicate, could and likely would cause SQL Server to generate a new, more optimized query plan. The current query, with no predicates, selecting every column, and with no usable indexes as you stated, would simply do two table scans. Any increase in performance going forward wouldn't be because the plan was cached, but because the data was stored in memory and subsequent reads wouldn't incur physical reads, i.e. it is reading from memory versus disk.
There's a lot more that could be said, but I'll stop here.
You might also consider putting this query into a stored procedure which can then be scheduled to run at a regular interval through SQL Agent, which will keep the required pages cached.
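A minimal sketch of that idea (the procedure name is a placeholder; COUNT_BIG(*) is just a cheap way to read through the tables without returning a huge result set):
CREATE PROCEDURE dbo.WarmTableCache
AS
BEGIN
SET NOCOUNT ON;
-- Reading the tables pulls their pages into the buffer pool; exactly which pages
-- depends on the plan chosen, but repeated runs keep the hot data in memory.
SELECT COUNT_BIG(*) FROM dbo.table1;
SELECT COUNT_BIG(*) FROM dbo.table2;
END;
GO
-- Schedule dbo.WarmTableCache as a SQL Agent job step running every few minutes.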
Thanks to both #scsimon and #Branko Dimitrijevic for their answers; I think they were really useful and they guided me in the right direction.
In the end it turned out that the 2 biggest issues were hardware resources (RAM, no SSD) and the Auto Close feature being set to True.
Other fixes that I have made (writing them here for anyone else trying to improve performance):
A helper service tool will reorganize (defragment) the indexes once a week and rebuild them once a month (sketch below).
Create a view which has all the columns from the 2 tables in question - to eliminate JOIN cost.
Advised that a DBA can probably help with better tables/indexes
Advised to improve server hardware...
Will accept #Branko Dimitrijevic's answer as I can't accept both.
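A rough sketch of the Auto Close and index maintenance fixes (database and table names are placeholders):
-- Turn off Auto Close so the database isn't shut down (and its caches discarded)
-- every time the last connection closes.
ALTER DATABASE [MyDatabase] SET AUTO_CLOSE OFF;

-- Weekly job step: reorganize (defragment) the indexes in place.
ALTER INDEX ALL ON dbo.table1 REORGANIZE;
ALTER INDEX ALL ON dbo.table2 REORGANIZE;

-- Monthly job step: rebuild the indexes.
ALTER INDEX ALL ON dbo.table1 REBUILD;
ALTER INDEX ALL ON dbo.table2 REBUILD;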

stored proc time outs

We are currently having difficulties with a SQL Server stored procedure timing out. 9 times out of 10 the query will run within 5 seconds max; however, on occasion the proc can continue to run in excess of 2 minutes, causing time outs on the front end (a .NET MVC application).
They have been investigating this for over a week now, checking jobs and server performance, and all seems to be OK.
The DBAs have narrowed it down to a particular table which is being bombarded with inserts/updates from different applications. This, in combination with the complex SELECT query that joins on that table (I'm being told), is causing the time outs.
Are there any suggestions at all on how to get around these time outs? For example:
Replicate the table and query the new table?
Any additional debugging that can prove this is actually the issue?
Perhaps cache the data on the front end and, if there is a time out, serve the data from the cache?
A table being bombarded with updates is a table being bombarded with locks. And yes, this can affect performance.
First, copy the table and run the query multiple times. There are other possibilities for the performance issue.
One cause of unstable stored procedure performance in SQL Server is compilation. The code in the stored procedure is compiled the first time it is executed -- the resulting execution plan might work for some inputs and not others. This is readily fixed by using the option to recompile the queries each time (although this adds overhead).
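A minimal sketch of that option (the procedure, table, and parameter names are placeholders):
-- OPTION (RECOMPILE) rebuilds the plan for this statement on every call,
-- using the actual parameter value, at the cost of a compile each time.
CREATE PROCEDURE dbo.GetOrdersForCustomer
@CustomerID int
AS
BEGIN
SET NOCOUNT ON;
SELECT o.OrderID, o.OrderDate, o.Amount
FROM dbo.Orders AS o
WHERE o.CustomerID = @CustomerID
OPTION (RECOMPILE);
END;
GO
-- Alternatively, WITH RECOMPILE on the procedure itself recompiles the whole proc on every call.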
Then, think about the query. Does it need the most up-to-date data? If not, perhaps you can just copy the table once per hour or once per day.
If the most recent data is needed, you might need to re-think the architecture. An insert-only table with a clustered identity column always inserts at the end of the table. This is less likely to interfere with queries on the table.
Replication may or may not help the problem. After all, full replication will be doing the updates on the replicated copy. You don't solve the "bombardment" problem by bombarding two tables.
If your queries involve a lot of historical data, then partitioning might help. Only the most recent partition would be "bombarded", leaving the others more responsive to queries.
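A rough sketch of date-based partitioning (the function, scheme, table, and boundary values are all placeholders):
-- Partition by month so only the newest partition takes the heavy write traffic.
CREATE PARTITION FUNCTION pf_OrderMonth (date)
AS RANGE RIGHT FOR VALUES ('2021-01-01', '2021-02-01', '2021-03-01');

CREATE PARTITION SCHEME ps_OrderMonth
AS PARTITION pf_OrderMonth ALL TO ([PRIMARY]);

CREATE TABLE dbo.OrdersPartitioned
(
OrderID bigint IDENTITY(1,1) NOT NULL,
OrderDate date NOT NULL,
Amount decimal(18,2) NOT NULL,
CONSTRAINT PK_OrdersPartitioned PRIMARY KEY CLUSTERED (OrderDate, OrderID)
)
ON ps_OrderMonth (OrderDate);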
The DBAs have narrowed it down to a particular table which is being bombarded with inserts/updates from different applications. This, in combination with the complex SELECT query that joins on that table (I'm being told), is causing the time outs.
We used to face many time outs and got a lot of escalations. This is the approach we followed to reduce them.
Some may be applicable in your case, some may not, but following these steps will not cause any harm.
Change the following SQL Server settings (for example via sp_configure, as sketched below):
1. Remote login timeout: 60
2. Remote query timeout: 0
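A sketch of making those two changes with sp_configure (a value of 0 for the query timeout means no timeout):
-- Both are server-wide settings and take effect after RECONFIGURE.
EXEC sp_configure 'remote login timeout (s)', 60;
EXEC sp_configure 'remote query timeout (s)', 0;
RECONFIGURE;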
Also, if your Windows server is set to use dynamic RAM, try changing it to static RAM.
You may also have to tune some of the Windows server settings; see:
TCP Offloading/Chimney & RSS…What is it and should I disable it?
Following the above steps reduced our time outs by 99%.
For the remaining 1%, we dealt with each case separately:
1. Update statistics for the tables involved in the query (see the sketch below)
2. Try fine-tuning the query further
This helped us eliminate time outs entirely.
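A minimal example of the statistics step (the table name is a placeholder; FULLSCAN is optional but gives the most accurate statistics):
-- Refresh statistics on one table so the optimizer has up-to-date row estimates.
UPDATE STATISTICS dbo.Orders WITH FULLSCAN;

-- Or refresh statistics for every table in the current database:
EXEC sp_updatestats;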

SQL complex query - get differential data

I have a complex SQL query that should be executed every day to load a table. The query is executed once for all the data, and should then be executed on the differential data of one day.
My question is: what is the most performant way to load the data? I have two solutions:
1. Execute the query against the whole database with a WHERE clause that takes just the changed data.
2. Build a copy of the source tables that is truncated every time and loaded with just the differential data, and then execute the query against these tables.
The performance characteristics of a query depend heavily on which DBMS you're using and on the physical data model (indexes, statistics, etc.). Very little can be said that's generally applicable to answer that question.
With good indexing, etc. (whatever that exactly means for the DBMS you're using) you can get very good performance just querying the changed data (provided there's a simple, index-able expression for "data that has changed").
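A sketch of what that could look like (shown in T-SQL since the question doesn't name the DBMS; the table and column names, including the LastModified column, are assumptions):
-- An index on the change-tracking column makes "rows changed in the last day" cheap to find.
CREATE INDEX IX_Orders_LastModified ON dbo.Orders (LastModified);

-- Daily differential load: only rows changed in the last day.
SELECT o.OrderID, o.CustomerID, o.Amount, o.LastModified
FROM dbo.Orders AS o
WHERE o.LastModified >= DATEADD(day, -1, SYSDATETIME());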
While I would strongly suspect that you'd technically get "the fastest" performance for that query by loading the data into a table that contains only the deltas, it may not save enough performance to offset the costs, which include:
It's more complex: more tables to maintain, more scripts to move data around, ...
The act of adding new data is made less efficient, because you have to write it twice (once to the incremental table, and once - either at the same time or later - to the table where the historical data is accumulated).

Generated de-normalised View table

We have a system that makes use of a database view, which takes data from a few reference tables (lookups) and then does a lot of pivoting and complex work on a hierarchy table of (pretty much fixed and static) locations, returning a view of the data to the application.
This view is getting slow, as new requirements are added.
A possible solution would be to create a normal table, select from the view into this table, and let the application use that highly indexed and fast table for its querying.
The issue, I guess, is that if the underlying tables change, the new table will show old results. But the data that drives this table changes very infrequently, and if it does, a business/technical process could be put in place so that an 'Update the Table' procedure is run to refresh this data. Or even an update/insert trigger on the primary driving table?
Is this practice advised/ill-advised? And are there ways of making it safer?
The ideal solution is to optimise the underlying queries.
In SSMS, run the slow query and include the actual execution plan (Ctrl + M); this will give you a graphical representation of how the query is being executed against your database.
Another helpful step is to turn on IO statistics, since IO is usually the main bottleneck with queries. Put this line at the top of your query window:
SET STATISTICS IO ON;
Check whether SQL Server recommends any missing indexes (displayed in green in the execution plan); as you say the data changes infrequently, it should be safe to add additional indexes if needed.
In the execution plan you can hover your mouse over any element for more information. Check the estimated rows against the actual rows returned; if these differ greatly, update the statistics for the tables, which can help the query optimiser find the best execution plan.
To do this for all tables in a database:
USE [Database_Name]
GO
exec sp_updatestats
Still no luck in optimising the view / query?
Be careful with update triggers: if the schema of the view/table changes (say you add a new column to the source table), the new column will not be inserted into your 'optimised' table unless you update the trigger.
If it is not a business requirement to report on real-time data, there is not too much harm in having a separate optimized table for reporting (much like a data mart); just use a SQL Agent job to refresh it nightly during non-peak hours.
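A sketch of such a nightly refresh (the object names are placeholders; the procedure would be called from a SQL Agent job step):
CREATE PROCEDURE dbo.RefreshReportingTable
AS
BEGIN
SET NOCOUNT ON;
-- Rebuild the reporting copy from the slow view during non-peak hours.
TRUNCATE TABLE dbo.LocationReport;
INSERT INTO dbo.LocationReport (LocationID, LocationName, Level1, Level2)
SELECT LocationID, LocationName, Level1, Level2
FROM dbo.vw_LocationHierarchy;
END;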
There are a few cons to this approach though:
More storage space / duplicated data
More complex database
Additional workload during the refresh
Decreased cache hits