We are redesigning a very large (~100 GB, partitioned) table in SQL Server 2012.
For that, we need to convert data from the old (existing) table into the newly designed table on the production server. The new table is also partitioned. Rows are only ever appended to the table.
The problem is that a lot of users work on this server, so we can run the conversion only in chunks, and only when the server is not under heavy load (a couple of hours a day).
Is there a better and faster way?
This time we will finish the conversion process in a few days (and then switch our application to use the new table), but what would we do if the table were 1 TB? Or 10 TB?
PS. More details on the current process:
The tables are partitioned based on the CloseOfBusinessDate column (DATE). Currently we run this query when the server is under low load:
INSERT INTO
NewTable
...
SELECT ... FROM
OldTable -- this SELECT involves xml parsing and CROSS APPLY
WHERE
CloseOfBusinessDate = @currentlyMigratingDate
Every day, about 1M rows from the old table get converted into about 200M rows in the new table.
When we finish the conversion process we will simply update our application to use NewTable.
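For context, here is a minimal sketch of how one daily chunk is driven. The schema is simplified; Id, Payload and Value are stand-ins for our real columns, not the actual table definition:
-- Simplified sketch: pick the next business date that hasn't been migrated yet
-- and convert only that day's data. Payload is a hypothetical XML column.
DECLARE @currentlyMigratingDate DATE =
(
    SELECT MIN(o.CloseOfBusinessDate)
    FROM   OldTable AS o
    WHERE  NOT EXISTS (SELECT 1 FROM NewTable AS n
                       WHERE n.CloseOfBusinessDate = o.CloseOfBusinessDate)
);

IF @currentlyMigratingDate IS NOT NULL
BEGIN
    INSERT INTO NewTable (CloseOfBusinessDate, Id, Value)
    SELECT  o.CloseOfBusinessDate,
            o.Id,
            x.value('.', 'decimal(18,4)') AS Value          -- the XML parsing step
    FROM    OldTable AS o
    CROSS APPLY o.Payload.nodes('/rows/row/value') AS t(x)  -- the CROSS APPLY step
    WHERE   o.CloseOfBusinessDate = @currentlyMigratingDate;
END;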
To everybody who took the time to read the question and tried to help: I'm sorry, I didn't have enough details myself. It turns out the query that selects data from the old table and converts it is VERY slow (thanks to @Martin Smith I decided to check the SELECT query). The query parses XML and uses CROSS APPLY. I think the better way in our case would be to write a small application that simply loads the data from the old table for each day, converts it in memory, and then uses bulk copy to insert it into the new table.
I've got a query that pulls data from SQL Server 2012 into Excel using Power Query. However, when I change the date range in my query, I'd like to pull the new data into my table and store it below the previously pulled data without deleting it. So far all I've found is the refresh button, which will rerun my query with the new dates but replace the old data. Is there a way to accomplish this? I'm using it to build an automated QA testing program that will compare this period to last. Thank you.
This sounds like an incremental load. If your table doesn't exceed about 1.1 million rows, you can use the technique described here: http://www.thebiccountant.com/2016/02/09/how-to-create-a-load-history-or-load-log-in-power-query-or-power-bi/
So I have a summary I need to return to the end-user application.
It should accept 3 parameters: DateType, StartDate, and EndDate.
DateType determines which date field I use to filter the data.
The way I accomplished this was to put the IDs of all records matching the chosen date type into a temp table and then join my summary to that list of IDs.
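Roughly, the pattern looks like this (table and column names here are made up for illustration; the real ones differ):
-- Inside the stored procedure, @DateType, @StartDate and @EndDate are the parameters.
CREATE TABLE #MatchingIds (Id INT PRIMARY KEY)

INSERT INTO #MatchingIds (Id)
SELECT r.RecordId
FROM dbo.Records AS r
WHERE (@DateType = 'Created'  AND r.CreatedDate  BETWEEN @StartDate AND @EndDate)
   OR (@DateType = 'Closed'   AND r.ClosedDate   BETWEEN @StartDate AND @EndDate)
   OR (@DateType = 'Invoiced' AND r.InvoicedDate BETWEEN @StartDate AND @EndDate)

-- The summary query then joins to the temp table of IDs.
SELECT s.*
FROM dbo.SummaryView AS s
JOIN #MatchingIds AS m ON m.Id = s.RecordId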
This worked fine when running the query on the SQL Server that houses the data.
However, that is a replicated server, so when I compiled it into a stored procedure on the server with the rest of the application data, the query slowed down: 2 seconds vs. 50 seconds.
I think creating the temp table on the application SQL Server and then joining it across to the tables on the replication server is causing the slowdown.
Are there any methods or techniques that I can use to get around this and build it all in one stored procedure?
If I create 3 stored procedures, each with its own date range, then they are fast again. However, that means maintaining multiple stored procs for the same thing.
First off, if you are running a version of SQL Server older than 2012 SP1, one problem is that users who aren't allowed to run DBCC SHOW_STATISTICS (which is most users who aren't sysadmins, see the "Permissions" section in the documentation) don't get access to statistics on remote tables. This can severely cripple the optimizer's ability to generate a good execution plan. Upgrading SQL Server or granting more permissions can help there.
If your query involves filtering or joining on a character column, make sure the remote server is flagged in the linked server options as "collation compatible". If this option is off, SQL Server can't assume strings can be compared across the servers and it will start pumping entire tables up and down just to make sure the data ends up where the comparison has to be made.
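For example, assuming a linked server named RemoteSrv (and only after confirming the remote collations really do match; enabling this when they don't can give wrong results):
-- Tell the optimizer it is safe to evaluate string comparisons on the remote server.
EXEC sp_serveroption @server = N'RemoteSrv',
                     @optname = 'collation compatible',
                     @optvalue = 'true'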
If the execution plan is as good as it gets and it's still not good enough, one general (lame) technique is to transfer all data locally first (SELECT * INTO #localtable FROM remote.db.schema.table), then run the query as a non-distributed query. Obviously, in order for this to work, the remote table cannot be "too big" and in some cases this actually has worse performance, depending on how many rows are involved. But it's always worth considering, because the optimizer does a better job with local tables.
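A sketch of that technique, with placeholder server, table and column names, and filtering as much as possible before pulling the data across:
-- Pull only the needed rows/columns from the remote server into tempdb...
SELECT id, CreatedDate, Amount
INTO #localtable
FROM RemoteSrv.SalesDb.dbo.Orders
WHERE CreatedDate >= @StartDate AND CreatedDate < @EndDate

-- ...then aggregate locally, where the optimizer has full statistics.
SELECT id, SUM(Amount) AS Total
FROM #localtable
GROUP BY id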
Another approach that avoids pulling tables together across servers is packing up data in parameters to remote stored procedure calls. Entire tables can be passed as XML through an NVARCHAR(MAX) parameter, since neither XML columns nor table-valued parameters are supported in distributed queries. The basic idea is the same: avoid the need for the optimizer to figure out an efficient distributed query. The best approach greatly depends on your data and your query, obviously.
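A sketch of the XML-parameter idea (every name here is invented for illustration, and the linked server needs 'rpc out' enabled):
-- Serialize the local ID list to XML text...
DECLARE @idsXml NVARCHAR(MAX) =
    (SELECT Id FROM #MatchingIds FOR XML PATH('row'), ROOT('ids'))

-- ...and hand it to a procedure that runs entirely on the remote server.
EXEC RemoteSrv.SalesDb.dbo.usp_SummaryForIds @IdsXml = @idsXml

-- Inside usp_SummaryForIds, the remote side shreds the text back into rows:
-- DECLARE @x XML = CAST(@IdsXml AS XML)
-- SELECT n.value('(Id/text())[1]', 'int') AS Id
-- FROM @x.nodes('/ids/row') AS t(n)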
I'm looking for a little guidance on a SAS/SQL performance issue I'm having. In SAS Enterprise Guide, I've created a program that creates a table. This table has about 90k rows:
CREATE TABLE test AS (
SELECT id, SUM(myField) AS myFieldSum
FROM table1
GROUP BY id
)
I have a much larger table with millions of rows. Each row has an id. I want to sum values in this table, using only the ids present in the 'test' table. I tried this:
CREATE TABLE test2 AS(
SELECT big.id, SUM(big.myOtherField) AS myOtherFieldSum
FROM big
INNER JOIN test
ON test.id = big.id
GROUP BY big.id
)
The problem I'm having is that it takes forever to run the second query against the big table with millions of records. I thought the inner join on the subset of ids would help (and maybe it does), but I wanted to make sure I was doing everything I could to speed it up.
I don't have any way to get information on the indexing of the underlying database. I'm more interested in getting the opinion of someone who has more SQL and SAS experience than me.
From what you show in your question, you are joining two SAS data sets, not two database objects. In any case, you can speed up the processing by defining indexes on the JOIN columns used in each table. Assuming you have permission to do so, here are examples:
proc sql;
create index id on big(id);
create index id on test(id);
quit;
Of course, you probably should first check the table definition before doing that. You can use the "describe" statement to see the structure:
proc sql;
describe table big;
quit;
Indexes improve access performance at the cost of disk space and update maintenance. Once created, the indexes will be a permanent part of the SAS data set and will be automatically updated if you use SQL INSERT or DELETE statements. But be aware that the indexes will be deleted if you recreate the data set with a simple data step.
On the other hand, if these tables really are in an external database (like Oracle, for example), you have a different challenge. If that's the case, I'd ask a new question and provide a complete example of the SAS code you are using (including any LIBNAME statements).
If you are working with non-SAS data, i.e. data that resides in a SQL database (or a NoSQL database, for that matter), you will see significant improvements in performance by using pass-through SQL or, if it's supported and you have the licenses for it, in-database processing.
One important point about PROC SQL vs. pass-through SQL: by default, PROC SQL copies the original source data into SAS data sets before doing the work, whereas pass-through simply requests the result set from the source data provider. In short, a table with 5 million rows will take a lot longer to use with PROC SQL (even if you are only interested in about 1% of the data) than if you only have to pull that 1% of the data across the network using the pass-through mechanism.
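For example, a minimal pass-through sketch, assuming the tables live in an Oracle schema accessed via SAS/ACCESS; the connection details are placeholders:
proc sql;
  connect to oracle as db (user=myuser password=mypass path='mydb');
  /* The inner query runs entirely on the database server;
     only the aggregated result set comes back to SAS. */
  create table test2 as
  select * from connection to db
    ( select id, sum(myOtherField) as total
      from big
      group by id );
  disconnect from db;
quit;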
I'm new to Oracle. I've worked with Microsoft SQL Server for years, though. I was brought into a project that was already overdue and over budget, and I need to be "the Oracle expert." I've got a table that has 14 million rows in it. And I've got a complicated query that updates the table. But I'm going to start with the simple queries. When I issue a simple update that modifies a small number of records (100 to maybe 10,000 or so) it takes no more than 2 to 5 minutes to table scan the table and update the affected records. (Less time if the query can use an index.) But if I update the entire table with:
UPDATE MyTable SET MyFlag = 1;
Then it takes 3 hours!
If the table scan completes in minutes, why should this take hours? I could certainly use some advice on how to troubleshoot this, since I don't have enough experience with Oracle to know which diagnostic queries to run. (I'm on Oracle 11g and using Oracle SQL Developer as the client.)
Thanks.
When you run an UPDATE in Oracle, every change is first recorded sequentially in the redo log; the modified data blocks themselves are written back to the data files later by the database writer (checkpoints force this to happen).
In addition, the old version of the data is copied into undo space to support possible transaction rollbacks and to give concurrent sessions a consistent view of the old values.
All of this can take significantly more time than a pure read operation, which doesn't modify any data.
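As a rough check, you can see how much redo and undo your own session has generated so far (this assumes you have SELECT access to the V$ views):
-- Redo and undo generated by the current session.
SELECT sn.name, ms.value
FROM   v$mystat ms
JOIN   v$statname sn ON sn.statistic# = ms.statistic#
WHERE  sn.name IN ('redo size', 'undo change vector size');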
I need to quickly implement a read-only database containing data pulled from two identically structured live databases.
The live dbs are actually company dbs from a Dynamics accounting system so I'm happy for any Dynamics specific advice but this is mostly a SQL question. It's a fairly old version of Dynamics from before Great Plains was acquired by Microsoft. This is on SQL Server 2000.
We have reports and applications which access the Dynamics data. These apps are designed to look at one company db. Now we need to add another. It's appropriate that most of these reports and apps see combined data. They don't really care which company an order or invoice exists in. They only look at a small number of the tables.
It seems to me that the simplest solution is to create a reports only db with combined data. Preferably, we need an efficient way to update this db with changes several times a day.
I'm a developer, not a db expert but here's my plan:
Create the combined reporting db with the required tables, initially with the same table structure as the live dbs.
All Dynamics tables seem to have an int identity column called DEX_ROW_ID. I'm not sure what it's used for (it's not indexed), but it seems like the obvious generic way to uniquely identify rows. On the reporting db I will change it to a normal int (not an identity). I will create a unique index on DEX_ROW_ID in all dbs.
Dynamics does not have timestamps, so I will add a timestamp column to the tables in the live dbs and a corresponding binary(8) column in the reporting db. I'm assuming and hoping that Dynamics won't be upset by the additional index and column.
Add an int CompanyId column to the reporting db tables and add it to the end of any unique indexes. Most data will be naturally unique even without that, i.e. order and invoice numbers etc. will be different for the two live dbs. We may need to make some minor changes to the applications, but I'm not expecting to do much other than point them to the new reporting db.
Assuming my reporting db is called Reports, the live dbs are Live1 and Live2, the timestamp column is called TS and all dbs are on the same server ... here's my first attempt at an update script for copying the changes in one table called MyTable in Live1 to the reporting db.
USE Reports
CREATE TABLE #Changes
(
ReportId int,
LiveId int
)
/* Collect in a temp table the ids of rows which have been deleted or changed
in the live db. L.DEX_ROW_ID will be null if the row has been deleted */
INSERT INTO #Changes
SELECT R.DEX_ROW_ID, L.DEX_ROW_ID
FROM MyTable R LEFT OUTER JOIN Live1.dbo.MyTable L ON L.DEX_ROW_ID = R.DEX_ROW_ID
WHERE R.CompanyId = 1 AND (L.DEX_ROW_ID IS NULL OR L.TS <> R.TS)
/* Delete rows that have been deleted or changed on the live db
I wonder if using join syntax would run better than the subquery. */
DELETE FROM MyTable
WHERE CompanyId = 1 AND DEX_ROW_ID IN (SELECT ReportId FROM #Changes)
/* Recopy rows that have changed in the live db */
INSERT INTO MyTable
SELECT 1 AS CompanyId, * FROM Live1.dbo.MyTable L
WHERE L.DEX_ROW_ID IN (SELECT ReportId FROM #Changes WHERE LiveId IS NOT NULL)
/* Copy the rows that are new in the live db */
INSERT INTO MyTable
SELECT 1 AS CompanyId, * FROM Live1.dbo.MyTable
WHERE DEX_ROW_ID > (SELECT MAX(DEX_ROW_ID) FROM MyTable WHERE CompanyId = 1)
Then do the same for the Live2 db, and repeat for every table in Reports. I know I should use a parameter @CompanyId instead of the literal, but I can't do that for the live db name, so I might generate these scripts dynamically with a C# program or something.
I'm looking for any advice, suggestions or critique on what I'm doing here. I know it won't be atomic. Things could be happening on the live db while this script runs. I think we can live with that. We'll probably do a full copy either nightly or weekly when nothing is happening on the live dbs.
We need to favor performance over elegance or perfection. Some initial testing has the first query, with the TS comparisons, running in about 30 seconds for the biggest table, so I'm optimistic that this is going to work, but I'd also like to know if I'm missing something obvious or not seeing the forest for the trees.
We don't really want to deal with log files on the reporting db. Can we just set it to the simple recovery model and forget about the logs?
Thanks
I think there are a couple open questions here.
Do you need these reports to be near-real-time, or is this the sort of reporting that could live with daily updates? I'll assume you need up-to-the-minute data.
Have you considered querying the databases directly and merging the data per-report on the fly? You'll have to do a lot of reporting to duplicate the effort that's going to go into designing, creating, and supporting a real-time merged replicated database.
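For example, a combined view over the two (identically structured) company databases can be as simple as the sketch below; the Invoices table name is invented for illustration:
-- Reports query this view instead of a single company database.
CREATE VIEW dbo.AllInvoices AS
SELECT 1 AS CompanyId, i.* FROM Live1.dbo.Invoices AS i
UNION ALL
SELECT 2 AS CompanyId, i.* FROM Live2.dbo.Invoices AS i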
Thirty seconds is (IMHO) unacceptable for any single query against a production database. There could be any number of tuning-related reasons for taking this long, but it at least means you're going to need serious professional SQL Server optimization resources (i.e. people). And if this is a problem for the queries for reports, it doesn't bode well for the queries to maintain a separate database for reporting.
Tuck into the back of your mind that, if you do need to consolidate to a single database, it's worth considering whether you should make it an OLAP database rather than a mirror. The mirror will be quicker and easier, but the OLAP database would be far more flexible and powerful in the long term, and it might be worthwhile to go the whole way from the beginning.
The last thing I'd want to do is write a custom update script. Try these bulletproof methods first:
Method 1: Let's hope your production databases are backed up. Restore those backups every night to the reporting server. You can automate restores with the RESTORE command, which will work with a file on a network server (see the sketch below).
Method 2: Use SQL Server replication to push data from the live servers to the reporting server.
Method 3: Schedule a DTS package every night to import the entire production databases.
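For Method 1, the nightly job can be a single RESTORE per company database; here's a sketch with placeholder paths and logical file names:
-- Overwrite the reporting copy with last night's full backup of Live1.
RESTORE DATABASE ReportsLive1
FROM DISK = N'\\backupserver\backups\Live1_full.bak'
WITH MOVE 'Live1_Data' TO N'D:\Data\ReportsLive1.mdf',
     MOVE 'Live1_Log' TO N'E:\Logs\ReportsLive1.ldf',
     REPLACE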
This might seem like brute force, but since you're copying a 2000-era database, brute force shouldn't be a problem on today's hardware. As an added advantage, these methods can be supported by a sysadmin instead of a developer.
Method 1 has the further advantage of serving as backup verification. :)