Can we create an index on a table while data is being read from it? I am using an ETL tool to load data into the target table. The performance is very slow and I do not want to stop the data load.
Will performance improve if I create the index while the data load is in progress?
ETL tool used: BODS
In an analytical DB you basically have 2 use cases for indexes:
To speed up data loads
To speed up queries
The more indexes you have on a table, the slower it will be to load. Traditionally, therefore, "query" indexes were dropped at the start of each data load and rebuilt once the load completed. This is still good practice where possible, but it is obviously not an option if you have massive tables (and therefore excessive index rebuild times), users running queries during data loads, or continuous/streaming loading.
Creating indexes while loading data is likely to slow down the data load - not necessarily because of any impact on the tables being indexed but because indexing uses DB resources and therefore those resources are not available for use by the data loading activities.
Creating an index on a table while that table is being loaded will not speed up the data load and will probably slow it down (see the previous paragraph). When SQL is executed on a DB, one of the first things the DB does is generate an execution plan, i.e. determine the most efficient way to execute the statement based on table statistics, available indexes, etc. Once the SQL statement is executing (based on that plan), it will not continuously check whether the indexes have changed since execution started, re-calculate the plan and re-execute the statement if a more efficient plan is now available.
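If the target happens to be SQL Server, a minimal sketch of that drop-and-rebuild pattern might look like the following, run as pre-load and post-load steps around the BODS job (the table and index names are hypothetical):
-- Pre-load step: drop the "query" index so the load does not have to maintain it.
-- (DROP INDEX IF EXISTS requires SQL Server 2016 or later.)
DROP INDEX IF EXISTS IX_FactSales_Query ON dbo.FactSales;

-- ... the data load runs here ...

-- Post-load step: rebuild the index so queries are fast again.
CREATE NONCLUSTERED INDEX IX_FactSales_Query
    ON dbo.FactSales (CustomerId, SaleDate)
    INCLUDE (Amount);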
I am running a SELECT INTO query where the source data is in a different database than the table I insert into (but on the same server).
When I execute the query while using the database where my source data lives (USE DATABASE_MY_SOURCE_DATA), it completes in under a minute. When I switch to the database where my target table sits, it doesn't complete within 10 minutes (I don't know the exact time because I cancelled it).
Why is that? Why is the difference so huge? I can't get my head around it.
Querying cross-database, even using a linked server connection, is always likely (at least in 2021) to present performance concerns.
The first problem is that the optimizer doesn't have the statistics it needs to estimate the number of rows in the remote table(s). It is also going to miss indexes on those tables, resorting to table scans (which tend to be a lot slower on large tables than index seeks).
Another issue is that there is no data caching, so every necessary operation results in a round-trip to the remote database.
More information (from a great source):
https://www.brentozar.com/archive/2021/07/why-are-linked-server-queries-so-bad/
Assuming you want this to be more performant and that you are doing substantial filtering on the remote data source, you may see some performance benefit from creating, on the remote database, a view that filters to just the rows you want in the target table, and querying that view for your results.
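For example, a rough sketch of that idea with hypothetical object names (only the source database name is taken from your question):
-- On the source database: a view that returns only the rows you actually need.
CREATE VIEW dbo.vw_FilteredSource
AS
SELECT Id, Col1, Col2
FROM dbo.BigSourceTable
WHERE IsActive = 1;
GO

-- From the target database: pull from the view instead of the raw table.
SELECT s.Id, s.Col1, s.Col2
INTO dbo.TargetTable
FROM DATABASE_MY_SOURCE_DATA.dbo.vw_FilteredSource AS s;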
Alternatively (and likely more correctly) you should wrap these operations in an ETL process (such as SSIS) that better manages these connections.
We have a system that makes use of a database view, which takes data from a few reference (lookup) tables and then does a lot of pivoting and complex work on a hierarchy table of (pretty much fixed and static) locations, returning a view of the data to the application.
This view is getting slow, as new requirements are added.
A solution that may be an option would be to create a normal table, select from the view into this table, and let the application use that highly indexed and fast table for its querying.
The issue, I guess, is that if the underlying tables change, the new table will show stale results. But the data that drives this table changes very infrequently, and if it does change, a business/technical process could be put in place so that an 'update the table' procedure is run to refresh the data. Or even an update/insert trigger on the primary driving table?
Is this practice advised/ill-advised? And are there ways of making it safer?
The ideal solution is to optimise the underlying queries.
In SSMS run the slow query and include the actual execution plan (Ctrl + M), this will give you a graphical representation of how the query is being executed against your database.
Another helpful tool is to turn on IO statistics; IO is usually the main bottleneck with queries. Put this line at the top of your query window:
SET STATISTICS IO ON;
Check whether SQL Server recommends any missing indexes (displayed in green in the execution plan). As you say the data changes infrequently, it should be safe to add additional indexes if needed.
In the execution plan you can hover your mouse over any element for more information. Check the value for estimated rows vs actual rows returned; if these vary greatly, update the statistics for the tables, which can help the query optimiser find the best execution plan.
To do this for all tables in a database:
USE [Database_Name]
GO
exec sp_updatestats
Still no luck in optimising the view / query?
Be careful with update triggers: if the schema changes on the view/table (say you add a new column to the source table), the new column will not be inserted into your 'optimised' table unless you also update the trigger.
If it is not a business requirement to report on real-time data, there is not too much harm in having a separate optimised table for reporting (much like a data mart); just use a SQL Agent job to refresh it nightly during non-peak hours (a sketch of such a refresh routine follows the list below).
There are a few cons to this approach though:
More storage space / duplicated data
More complex database
Additional workload during the refresh
Decreased cache hits
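As mentioned above, a sketch of such a nightly refresh could be as simple as a stored procedure that a SQL Agent job calls; all object and column names here are hypothetical:
-- Rebuild the reporting table from the slow view in one go.
-- (CREATE OR ALTER requires SQL Server 2016 SP1 or later.)
CREATE OR ALTER PROCEDURE dbo.RefreshReportingTable
AS
BEGIN
    SET NOCOUNT ON;

    BEGIN TRANSACTION;

    TRUNCATE TABLE dbo.ReportingTable;

    INSERT INTO dbo.ReportingTable (LocationId, LocationPath, SomeMeasure)
    SELECT LocationId, LocationPath, SomeMeasure
    FROM dbo.SlowComplexView;

    COMMIT TRANSACTION;
END;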
I have created more than 500 tables with geometry columns, and creating the spatial indexes after inserting the data is taking a long time. How can I create spatial indexes for all of these tables within a few seconds?
Is there any other way to speed up spatial index creation?
What is it that you want to speed up? The creation of the spatial indexes (i.e. the CREATE INDEX commands)? Or the inserts/updates to the tables? How big are those tables? And how long is acceptable for your business process?
Indexes are typically created once only upon the initial creation and population of a spatial table. Unless your application is one that inserts massive quantities of data at short intervals, or one that tracks moving objects, there is no need whatsoever to rebuild the indexes after the initial creation. They are obviously maintained automatically.
As for the time to build a spatial index, it does obviously take longer (sometimes significantly so) than building a regular index on the same number of rows. Obviously, the larger your tables, the longer that can take. Then again, if you think you can build 500 indexes (spatial or not) in a few seconds, you are deluded, even if you use very powerful hardware and the tables are empty.
There are two ways to make your index builds take less time:
1) For large tables, partition them and do the build in parallel. That is only possible if you have licensed the Partitioning option (only available in Enterprise Edition). Then again, that makes sense only for large tables, i.e. 50 million rows and more; it makes little sense on small tables.
2) For a large number of smallish tables, just do the builds in parallel, i.e. run multiple index creations concurrently. Use the database's scheduler (DBMS_SCHEDULER) to orchestrate that.
If what you are after is how to accelerate data ingestion (moving objects, tracking, ...) then you need to explain a bit more about your workflow.
We have a very big database, WriteDB, which stores raw trading data and which we use for fast writes. Then, with SQL scripts, I import data from WriteDB into ReadDB into roughly the same table, but extended with some extra values and a relation added. The import script looks like this:
TRUNCATE TABLE [ReadDB].[dbo].[Price]
GO
INSERT INTO [ReadDB].[dbo].[Price]
SELECT a.*, 0 AS ValueUSD, 0 AS ValueEUR
FROM [WriteDB].[dbo].[Price] a
JOIN [ReadDB].[dbo].[Companies] b ON a.QuoteId = b.QuoteID
So initially there are around 130 million rows in this table (~50 GB). Each day some rows are added and some change, so for now we decided not to over-complicate the logic and simply re-import all the data. The problem is that, for some reason, the script takes longer and longer over time on roughly the same amount of data: the first run took ~1h, and now it already takes 3h.
SQL Server also misbehaves during and after the import. If I try to run other queries, even the simplest ones, they often fail with timeout errors.
What is the reason for this bad behavior and how can I fix it?
One theory is that your first 50GB dataset has filled available memory for caching. Upon truncating the table, your cache is now effectively empty. This alternating behavior makes effective use of the cache difficult and incurs a substantial number of cache misses / increased IO time.
Consider the following sequence of events:
You load your initial dataset into WriteDb. During the load operation, pages in WriteDb are cached. There's very little memory contention because there's only one copy of the dataset and sufficient memory.
You initially populate ReadDb. The pages required to populate ReadDb (the data in WriteDb) are already largely cached. Fewer reads are required from disk, and your IO time can be dedicated to writing the inserted data for ReadDb. (This is your fast first run.)
You load your second dataset into WriteDb. During the load operation, there is insufficient memory to cache both existing data in ReadDb and new data written to WriteDb. This memory contention leads to fewer pages of WriteDb cached.
You truncate ReadDb. This invalidates a substantial portion of your cache (i.e. the 50GB of ReadDb data that was cached).
You then attempt your second load of ReadDb. Here you have very little of WriteDb cached, so your IO time is split between reading pages of WriteDb (your query) and writing pages of ReadDb (your insert). (This is your slow second run.)
You could test this theory by comparing the SQL Server cache miss ratio during your first and second load operations.
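One way to make that comparison (a sketch; the ratio is derived from two Buffer Manager counters exposed by sys.dm_os_performance_counters) is to sample the buffer cache hit ratio while each load runs:
-- Buffer cache hit ratio = 'Buffer cache hit ratio' / 'Buffer cache hit ratio base'.
SELECT
    100.0 * a.cntr_value / NULLIF(b.cntr_value, 0) AS buffer_cache_hit_ratio_pct
FROM sys.dm_os_performance_counters AS a
JOIN sys.dm_os_performance_counters AS b
    ON b.object_name = a.object_name
WHERE a.counter_name = 'Buffer cache hit ratio'
  AND b.counter_name = 'Buffer cache hit ratio base'
  AND a.object_name LIKE '%Buffer Manager%';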
Some ways to improve performance might be to:
Use separate disk arrays for ReadDb / WriteDb to increase parallel IO performance.
Increase the available cache (amount of server memory) to accommodate the combined size of ReadDb + WriteDb and minimize cache misses.
Minimize the impact of each load operation on existing cached pages by using a MERGE statement instead of dumping / loading 50GB of data at a time (see the sketch below).
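A rough sketch of that MERGE, reusing the query from your import script (the PriceDate/Value columns and the join key are hypothetical, since I don't know your schema):
MERGE [ReadDB].[dbo].[Price] AS tgt
USING
(
    SELECT a.QuoteId, a.PriceDate, a.Value,      -- hypothetical column list
           0 AS ValueUSD, 0 AS ValueEUR
    FROM [WriteDB].[dbo].[Price] AS a
    JOIN [ReadDB].[dbo].[Companies] AS b ON a.QuoteId = b.QuoteID
) AS src
    ON tgt.QuoteId = src.QuoteId AND tgt.PriceDate = src.PriceDate   -- hypothetical key
WHEN MATCHED THEN
    UPDATE SET tgt.Value = src.Value
WHEN NOT MATCHED BY TARGET THEN
    INSERT (QuoteId, PriceDate, Value, ValueUSD, ValueEUR)
    VALUES (src.QuoteId, src.PriceDate, src.Value, src.ValueUSD, src.ValueEUR);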
When I run a certain stored procedure for the first time it takes about 2 minutes to finish. When I run it for the second time it finished in about 15 seconds. I'm assuming that this is because everything is cached after the first run. Is it possible for me to "warm the cache" before I run this procedure for the first time? Is the cached information only used when I call the same stored procedure with the same parameters again or will it be used if I call the same stored procedure with different params?
When you perform your query, the data is read into memory in blocks. These blocks remain in memory, but they get "aged". This means each block is tagged with its last access time, and when SQL Server requires another block for a new query and the memory cache is full, the least recently used block (the oldest) is kicked out of memory. (In most cases, full-table-scan blocks are aged instantly to prevent full table scans from overrunning memory and choking the server.)
What is happening here is that the data blocks in memory from the first query haven't been kicked out of memory yet so can be used for your second query, meaning disk access is avoided and performance is improved.
So what your question is really asking is "can I get the data blocks I need into memory without reading them into memory (actually doing a query)?". The answer is no, unless you want to cache the entire tables and have them reside in memory permanently which, from the query time (and thus data size) you are describing, probably isn't a good idea.
Your best bet for performance improvement is looking at your query execution plans and seeing whether changing your indexes might give a better result. There are two major areas that can improve performance here:
creating an index where the query could use one to avoid inefficient queries and full table scans
adding more columns to an index to avoid a second disk read. For example, say you have a query that returns columns A and B with a WHERE clause on A and C, and you have an index on column A. Your query will use the index on A, requiring one disk read, but will then need a second disk hit to get columns B and C. If the index contained all of columns A, B and C, that second disk hit could be avoided.
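A minimal sketch of that covering-index idea, with a made-up table matching the A/B/C example:
-- A hypothetical table matching the example above.
CREATE TABLE dbo.Demo (A INT, B INT, C INT);

-- Index on A that also carries B and C ("covers" the example query),
-- so it can be answered from the index alone without a second lookup.
CREATE NONCLUSTERED INDEX IX_Demo_A_Covering
    ON dbo.Demo (A)
    INCLUDE (B, C);

-- This query needs no extra disk hit to fetch B and C:
SELECT A, B
FROM dbo.Demo
WHERE A = 42 AND C = 7;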
I don't think that generating the execution plan will cost more than 1 second.
I believe that the difference between first and second run is caused by caching the data in memory.
The data in the cache can be reused by any further query (stored procedure or simple select).
You can 'warm' the cache by reading the data through any SELECT that reads the same data, but that initial read will itself cost roughly the same ~90 seconds.
You can check the execution plan to find out which tables and indexes your query uses. You can then execute some SQL to get the data into the cache, depending on what you see.
If you see a clustered index seek, you can simply do SELECT * FROM my_big_table to force all the table's data pages into the cache.
If you see a non-clustered index seek, you could try SELECT first_column_in_index FROM my_big_table.
To force a load of a specific index, you can also use the WITH(INDEX(index)) table hint in your cache warmup queries.
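For example (a sketch; the table, column and index names are placeholders for whatever your execution plan actually shows):
-- Clustered index seek in the plan: pull the table's data pages into the buffer pool.
SELECT * FROM dbo.MyBigTable;

-- Non-clustered index seek in the plan: pull just that index's pages.
SELECT FirstIndexedColumn FROM dbo.MyBigTable;

-- Or force a specific index to be read via a table hint.
SELECT * FROM dbo.MyBigTable WITH (INDEX(IX_MyBigTable_SomeIndex));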
SQL Server caches data read from disk.
Subsequent reads therefore do less IO.
This helps a great deal, since disk IO is usually the bottleneck.
More at:
http://blog.sqlauthority.com/2014/03/18/sql-server-performance-do-it-yourself-caching-with-memcached-vs-automated-caching-with-safepeak/
The execution plan (the cached info for your procedure) is reused every time, even with different parameters. It is one of the benefits of using stored procs.
The very first time a stored procedure is executed, SQL Server generates an execution plan and puts it in the procedure cache.
Certain changes to the database can trigger an automatic update of the execution plan (and you can also explicitly demand a recompile).
Execution plans are dropped from the procedure cache based on their "age". (From MSDN: objects infrequently referenced are soon eligible for deallocation, but are not actually deallocated unless memory is required for other objects.)
I don't think there is any way to "warm the cache", except to perform the stored proc once. This will guarantee that there is an execution plan in the cache and any subsequent calls will reuse it.
More detailed information is available in the MSDN documentation: http://msdn.microsoft.com/en-us/library/ms181055(SQL.90).aspx
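If you want to confirm that a plan for your procedure is sitting in the cache (and see how often it has been reused), a query along these lines works; the procedure name in the filter is a placeholder:
SELECT cp.usecounts, cp.cacheobjtype, cp.objtype, st.text
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
WHERE cp.objtype = 'Proc'
  AND st.text LIKE '%MyStoredProc%';   -- replace with your procedure name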