I want to add a new column to a Sql Server table with about 20,000 rows using this query
ALTER TABLE [db].[dbo].[table]
ADD entry_date date;
the query result still working, until now it takes 10 minute and still working, Is it normal !
10 minutes seems like a long time, but adding a column takes time.
Why? Adding a column (generally) requires changing the layout of the table on the pages where the data is stored. In your case, the entry date is 4 bytes, so every row needs to be expanded by 4 bytes -- and fit on one data page.
Adding columns generally requires rewriting all data.
Also note that while this is occurring, locks and transactions are also involved. So, if other activity on the server is accessing the table (particularly data changes), then that also affects performance.
Oh, did I mention that indexes are involved as well? Indexes care about the physical location of data.
All that said, 10 minutes seems like a long time for 20,000 rows. You might find that things go much faster if you do:
Create a new table with the columns you want -- but no indexes except for the primary key or clustered index.
Insert the data from the existing table.
Add whatever indexes and triggers you want.
Of course, you need to be careful if you have an identity column.
Related
I have a table tblCallDataStore which has a flow of 2 millions records daily.Daily, I need to delete any record older than 48 hours. If I create a delete job, it runs for more than 13 hours or sometimes more than that. What is the feasible way to partition the table and truncate the partition.
How may I do that?
I have a Receivedate column and I want to do the partition on its basis.
Unfortunately, you will need to bite the one-time bullet and add partitioning to your table (hint: base it off the DATEPART(DAY, ...) of the timestamp you are keying off of so you can quickly get this as your #ParitionID variable later). After that, you can use the SWITCH PARTITION statement to move all rows of the specified partition ID to another table. You can then quickly TRUNCATE the receiving table since you don't care about the data anymore.
There are a couple of tricky setup steps in this process: Managing the partition IDs themselves, and ensuring the receiving table has an identical structure (and various other requirements, such as not being able to refer to your partitioned table with foreign keys (umm... ouch)). Once you have the partition ID you want, the code is simple:
ALTER TABLE [x] SWITCH PARTITION #PartitionID TO [x_copy]
I have been meaning to write a stored procedure to automatically maintain the receiving tables, since at work we have just been manually updating both tables in tandem whenever we make a change (e.g. adding a new column or index). To manually do it, basically just copy the table definition and append something to all the entity names (e.g. just put the number 2 on the end).
P.S. Maybe you want to use hourly partitioning instead, then you have much more control over the "48 hours" requirement, and also have the flexibility to handle time zones (ugh). You'd just have to be more clever since DATEPART(HOUR, ...) obviously won't work directly. Also note that your partition IDs must be defined as a range, so eventually you will end up cycling through them, so keep that in mind.
P.P.S. As noted in the comments by David Browne, partitioning was an enterprise-only feature until SQL 2016 SP1.
In my application, users can create custom tables with three column types, Text, Numeric and Date. They can have up to 20 columns. I create a SQL table based on their schema using nvarchar(430) for text, decimal(38,6) for numeric and datetime, along with an Identity Id column.
There is the potential for many of these tables to be created by different users, and the data might be updated frequently by users uploading new CSV files. To get the best performance during the upload of the user data, we truncate the table to get rid of existing data, and then do batches of BULK INSERT.
The user can make a selection based on a filter they build up, which can include any number of columns. My issue is that some tables with a lot of rows will have poor performance during this selection. To combat this I thought about adding indexes, but as we don't know what columns will be included in the WHERE condition we would have to index every column.
For example, on a local SQL server one table with just over a million rows and a WHERE condition on 6 of its columns will take around 8 seconds the first time it runs, then under one second for subsequent runs. With indexes on every column it will run in under one second the first time the query is ran. This performance issue is amplified when we test on an SQL Azure database, where the same query will take over a minute the first time its run, and does not improve on subsequent runs, but with the indexes it takes 1 second.
So, would it be a suitable solution to add a index on every column when a user creates a column, or is there a better solution?
Yes, it's a good idea given your model. There will, of course, be more overhead maintaining the indexes on the insert, but if there is no predictable standard set of columns in the queries, you don't have a lot of choices.
Suppose by 'updated frequently,' you mean data is added frequently via uploads rather than existing records being modified. In that case, you might consider one of the various non-SQL databases (like Apache Lucene or variants) which allow efficient querying on any combination of data. For reading massive 'flat' data sets, they are astonishingly fast.
I have a Composite-List-List partitioned table with 19 Columns and about 400 million rows. Once a week new data is inserted in this table and before the insert I need to set the values of 2 columns to null for specific partitions.
Obvious approach would be something like the following where COLUMN_1 is the partition criteria:
UPDATE BLABLA_TABLE
SET COLUMN_18 = NULL, SET COLUMN_19 = NULL
WHERE COLUMN_1 IN (VALUE1, VALUE2…)
Of course this would be awfully slow.
My second thought was to use CTAS for every partition that I need to set those two columns to null and then use EXCHANGE PARTITION to update the data in my big table. Unfortunately that wouldn’t work because it´s a Composite-Partition.
I could use the same approach with subpartitions but then I would have to use CATS about 8000 times and drop those tables afterwards every week. I guess that would not pass the upcoming code-review.
May somebody has another idea how to performantly solve this?
PS: I’m using ORACLE 11g as database.
PPS: Sorry for my bad English…..
You've ruled out updating through DDL (switch partitions), so this lets us with only DML to consider.
I don't think that it's actually that bad an update with a table so heavily partitioned. You can easily split the update in 8k mini updates (each a single tiny partition):
UPDATE BLABLA_TABLE SUBPARTITION (partition1) SET COLUMN_18 = NULL...
Each subpartition would contain 15k rows to be updated on average so the update would be relatively tiny.
While it still represents a very big amount of work, it should be easy to set to run in parallel, hopefully during hours where database activity is very light. Also the individual updates are easy to restart if one of them fails (rows locked?) whereas a 120M update would take such a long time to rollback in case of error.
If I were to update almost 90% of rows in table, I would check feasibility/duration of just inserting to another table of same structure (less redo, no row chaining/migration, bypass cache and so on via direct insert. drop indexes and triggers first. exclude columns to leave them null in target table), rename the tables to "swap" them, rebuild indexes and triggers, then drop the old table.
From my experience in data warehousing, plain direct insert is better than update/delete. More steps needed but it's done in less time overall. I agree, partition swap is easier said than done when you have to process most of the table and just makes it more complex for the ETL developer (logic/algorithm bound to what's in the physical layer), we haven't encountered need to do partition swaps so far.
I would also isolate this table in its own tablespaces, then alternate storage between these two tablespaces (insert to 2nd drop table from 1st, vice-versa in next run, resize empty tablespace to reclaim space).
Assume a table named 'log', there are huge records in it.
The application usually retrieves data by simple SQL:
SELECT *
FROM log
WHERE logLevel=2 AND (creationData BETWEEN ? AND ?)
logLevel and creationData have indexes, but the number of records makes it take longer to retrieve data.
How do we fix this?
Look at your execution plan / "EXPLAIN PLAN" result - if you are retrieving large amounts of data then there is very little that you can do to improve performance - you could try changing your SELECT statement to only include columns you are interested in, however it won't change the number of logical reads that you are doing and so I suspect it will only have a neglible effect on performance.
If you are only retrieving small numbers of records then an index of LogLevel and an index on CreationDate should do the trick.
UPDATE: SQL server is mostly geared around querying small subsets of massive databases (e.g. returning a single customer record out of a database of millions). Its not really geared up for returning truly large data sets. If the amount of data that you are returning is genuinely large then there is only a certain amount that you will be able to do and so I'd have to ask:
What is it that you are actually trying to achieve?
If you are displaying log messages to a user, then they are only going to be interested in a small subset at a time, and so you might also want to look into efficient methods of paging SQL data - if you are only returning even say 500 or so records at a time it should still be very fast.
If you are trying to do some sort of statistical analysis then you might want to replicate your data into a data store more suited to statistical analysis. (Not sure what however, that isn't my area of expertise)
1: Never use Select *
2: make sure your indexes are correct, and your statistics are up-to-date
3: (Optional) If you find you're not looking at log data past a certain time (in my experience, if it happened more than a week ago, I'm probably not going to need the log for it) set up a job to archive that to some back-up, and then remove unused records. That will keep the table size down reducing the amount of time it takes search the table.
Depending on what kinda of SQL database you're using, you might look into Horizaontal Partitioning. Oftentimes, this can be done entirely on the database side of things so you won't need to change your code.
Do you need all columns? First step should be to select only those you actually need to retrieve.
Another aspect is what you do with the data after it arrives to your application (populate a data set/read it sequentially/?).
There can be some potential for improvement on the side of the processing application.
You should answer yourself these questions:
Do you need to hold all the returned data in memory at once? How much memory do you allocate per row on the retrieving side? How much memory do you need at once? Can you reuse some memory?
A couple of things
do you need all the columns, people usually do SELECT * because they are too lazy to list 5 columns of the 15 that the table has.
Get more RAM, themore RAM you have the more data can live in cache which is 1000 times faster than reading from disk
For me there are two things that you can do,
Partition the table horizontally based on the date column
Use the concept of pre-aggregation.
Pre-aggregation:
In preagg you would have a "logs" table, "logs_temp" table, a "logs_summary" table and a "logs_archive" table. The structure of logs and logs_temp table is identical. The flow of application would be in this way, all logs are logged in the logs table, then every hour a cron job runs that does the following things:
a. Copy the data from the logs table to "logs_temp" table and empty the logs table. This can be done using the Shadow Table trick.
b. Aggregate the logs for that particular hour from the logs_temp table
c. Save the aggregated results in the summary table
d. Copy the records from the logs_temp table to the logs_archive table and then empty the logs_temp table.
This way results are pre-aggregated in the summary table.
Whenever you wish to select the result, you would select it from the summary table.
This way the selects are very fast, because the number of records are far less as the data has been pre-aggregated per hour. You could even increase the threshold from an hour to a day. It all depends on your needs.
Now the inserts would be fast too, because the amount of data is not much in the logs table as it holds the data only for the last hour, so index regeneration on inserts would take very less time as compared to very large data-set hence making the inserts fast.
You can read more about Shadow Table trick here
I employed the pre-aggregation method in a news website built on wordpress. I had to develop a plugin for the news website that would show recently popular (popular during the last 3 days) news items, and there are like 100K hits per day, and this pre-aggregation thing has really helped us a lot. The query time came down from more than 2 secs to under a second. I intend on making the plugin publically available soon.
As per other answers, do not use 'select *' unless you really need all the fields.
logLevel and creationData have indexes
You need a single index with both values, what order you put them in will affect performance, but assuming you have a small number of possible loglevel values (and the data is not skewed) you'll get better performance putting creationData first.
Note that optimally an index will reduce the cost of a query to log(N) i.e. it will still get slower as the number of records increases.
C.
I really hope that by creationData you mean creationDate.
First of all, it is not enough to have indexes on logLevel and creationData. If you have 2 separate indexes, Oracle will only be able to use 1.
What you need is a single index on both fields:
CREATE INDEX i_log_1 ON log (creationData, logLevel);
Note that I put creationData first. This way, if you only put that field in the WHERE clause, it will still be able to use the index. (Filtering on just date seems more likely scenario that on just log level).
Then, make sure the table is populated with data (as much data as you will use in production) and refresh the statistics on the table.
If the table is large (at least few hundred thousand rows), use the following code to refresh the statistics:
DECLARE
l_ownname VARCHAR2(255) := 'owner'; -- Owner (schema) of table to analyze
l_tabname VARCHAR2(255) := 'log'; -- Table to analyze
l_estimate_percent NUMBER(3) := 5; -- Percentage of rows to estimate (NULL means compute)
BEGIN
dbms_stats.gather_table_stats (
ownname => l_ownname ,
tabname => l_tabname,
estimate_percent => l_estimate_percent,
method_opt => 'FOR ALL INDEXED COLUMNS',
cascade => TRUE
);
END;
Otherwise, if the table is small, use
ANALYZE TABLE log COMPUTE STATISTICS FOR ALL INDEXED COLUMNS;
Additionally, if the table grows large, you shoud consider to partition it by range on creationDate column. See these links for the details:
Oracle Documentation: Range Partitioning
OraFAQ: Range partitions
How to Create and Manage Partition Tables in Oracle
I have a SQL Server table in production that has millions of rows, and it turns out that I need to add a column to it. Or, to be more accurate, I need to add a field to the entity that the table represents.
Syntactically this isn't a problem, and if the table didn't have so many rows and wasn't in production, this would be easy.
Really what I'm after is the course of action. There are plenty of websites out there with extremely large tables, and they must add fields from time to time. How do they do it without substantial downtime?
One thing I should add, I did not want the column to allow nulls, which would mean that I'd need to have a default value.
So I either need to figure out how to add a column with a default value in a timely manner, or I need to figure out a way to update the column at a later time and then set the column to not allow nulls.
ALTER TABLE table1 ADD
newcolumn int NULL
GO
should not take that long... What takes a long time is to insert columns in the middle of other columns... b/c then the engine needs to create a new table and copy the data to the new table.
I did not want the column to allow nulls, which would mean that I'd need to have a default value.
Adding a NOT NULL column with a DEFAULT Constraint to a table of any number of rows (even billions) became a lot easier starting in SQL Server 2012 (but only for Enterprise Edition) as they allowed it to be an Online operation (in most cases) where, for existing rows, the value will be read from meta-data and not actually stored in the row until the row is updated, or clustered index is rebuilt. Rather than paraphrase any more, here is the relevant section from the MSDN page for ALTER TABLE:
Adding NOT NULL Columns as an Online Operation
Starting with SQL Server 2012 Enterprise Edition, adding a NOT NULL column with a default value is an online operation when the default value is a runtime constant. This means that the operation is completed almost instantaneously regardless of the number of rows in the table. This is because the existing rows in the table are not updated during the operation; instead, the default value is stored only in the metadata of the table and the value is looked up as needed in queries that access these rows. This behavior is automatic; no additional syntax is required to implement the online operation beyond the ADD COLUMN syntax. A runtime constant is an expression that produces the same value at runtime for each row in the table regardless of its determinism. For example, the constant expression "My temporary data", or the system function GETUTCDATETIME() are runtime constants. In contrast, the functions NEWID() or NEWSEQUENTIALID() are not runtime constants because a unique value is produced for each row in the table. Adding a NOT NULL column with a default value that is not a runtime constant is always performed offline and an exclusive (SCH-M) lock is acquired for the duration of the operation.
While the existing rows reference the value stored in metadata, the default value is stored on the row for any new rows that are inserted and do not specify another value for the column. The default value stored in metadata is moved to an existing row when the row is updated (even if the actual column is not specified in the UPDATE statement), or if the table or clustered index is rebuilt.
Columns of type varchar(max), nvarchar(max), varbinary(max), xml, text, ntext, image, hierarchyid, geometry, geography, or CLR UDTS, cannot be added in an online operation. A column cannot be added online if doing so causes the maximum possible row size to exceed the 8,060 byte limit. The column is added as an offline operation in this case.
The only real solution for continuous uptime is redundancy.
I acknowledge #Nestor's answer that adding a new column shouldn't take long in SQL Server, but nevertheless, it could still be an outage that is not acceptable on a production system. An alternative is to make the change in a parallel system, and then once the operation is complete, swap the new for the old.
For example, if you need to add a column, you may create a copy of the table, then add the column to that copy, and then use sp_rename() to move the old table aside and the new table into place.
If you have referential integrity constraints pointing to this table, this can make the swap even more tricky. You probably have to drop the constraints briefly as you swap the tables.
For some kinds of complex upgrades, you could completely duplicate the database on a separate server host. Once that's ready, just swap the DNS entries for the two servers and voilà!
I supported a stock exchange company
in the 1990's who ran three duplicate
database servers at all times. That
way they could implement upgrades on
one server, while retaining one
production server and one failover
server. Their operations had a
standard procedure of rotating the
three machines through production,
failover, and maintenance roles every
day. When they needed to upgrade
hardware, software, or alter the
database schema, it took three days to
propagate the change through their
servers, but they could do it with no
interruption in service. All thanks
to redundancy.
"Add the column and then perform relatively small UPDATE batches to populate the column with a default value. That should prevent any noticeable slowdowns"
And after that you have to set the column to NOT NULL which will fire off in one big transaction. So everything will run really fast until you do that so you have probably gained very little really. I only know this from first hand experience.
You might want to rename the current table from X to Y. You can do this with this command sp_RENAME '[OldTableName]' , '[NewTableName]'.
Recreate the new table as X with the new column set to NOT NULL and then batch insert from Y to X and include a default value either in your insert for the new column or placing a default value on the new column when you recreate table X.
I have done this type of change on a table with hundreds of millions of rows. It still took over an hour, but it didn't blow out our trans log. When I tried to just change the column to NOT NULL with all the data in the table it took over 20 hours before I killed the process.
Have you tested just adding a column filling it with data and setting the column to NOT NULL?
So in the end I don't think there's a magic bullet.
select into a new table and rename. Example, Adding column i to table A:
select *, 1 as i
into A_tmp
from A_tbl
//Add any indexes here
exec sp_rename 'A_tbl', 'A_old'
exec sp_rename 'A_tmp', 'A_tbl'
Should be fast and won't touch your transaction log like inserting in batches might.
(I just did this today w/ a 70 million row table in < 2 min).
You can wrap it in a transaction if you need it to be an online operation (something might change in the table between the select into and the renames).
Another technique is to add the column to a new related table (Assume a one-to-one relationship which you can enforce by giving the FK a unique index). You can then populate this in batches and then you can add the join to this table wherever you want the data to appear. Note I would only consider this for a column that I would not want to use in every query on the original table or if the record width of my original table was getting too large or if I was adding several columns.