If my understanding is right, the number of rows stored on one page in SQL Server is determined by the number of columns in the table and their datatypes. One I/O operation can read one page, so the more rows that fit on one page, the more rows one I/O operation can return, and the faster your queries run.
I wonder what happens when you drop a column. Does SQL Server go back to the stored data and "re-store" it, i.e. what if I drop enough columns for more rows to fit on one page? And if SQL Server does not do that automatically, can I force the process?
I'm removing a lot of text columns and a few ID columns from a heavily used table, and I'm hoping that I/O will improve after I drop the columns.
Dropping a column is a logical operation, not a physical one. No data gets modified. The column metadata gets marked as 'deleted' and will be ignored. The record size is unchanged. Read SQL Server table columns under the hood for a more detailed explanation and clear examples demonstrating this claim.
As Stefan said, you have to rebuild the table (heap or clustered index) to 'reclaim' the space.
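As a minimal sketch (assuming a table called dbo.BigTable with a clustered primary key named PK_BigTable; both names are placeholders), the rebuild that physically reclaims the space of the dropped columns could look like this:

    -- Heap: rebuild the table itself (SQL Server 2008 and later).
    ALTER TABLE dbo.BigTable REBUILD;

    -- Clustered table: rebuilding the clustered index rewrites every data row,
    -- which discards the space still occupied by the dropped columns.
    ALTER INDEX PK_BigTable ON dbo.BigTable REBUILD;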
I'm removing a lot of text columns and a few ID columns from a heavily used table, and I'm hoping that I/O will improve after I drop the columns.
Use indexes to reduce IO. For unindexable ad-hoc analytical workloads, use columnstores. Read How to analyse SQL Server performance.
You can rebuild the clustered index to force SQL Server to reuse the free space. Otherwise you have "holes" in the pages.
Related
Performance-wise, does a clustered index help or not when bulk inserting hundreds of millions of rows in a table?
Later edit: after the INSERTs I have to put the database into production, so I will have to create one or more indexes.
A clustered index specifies that the data is ordered on the data pages.
When you are inserting data, the new data has to be sorted and compared to existing values. This is going to incur overhead.
The one exception is when you have an identity column -- that is being generated during the insert. Then the database knows that the new data goes "at the end" of the table.
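A minimal sketch of that pattern (the table and column names are made up for illustration): with an ever-increasing identity key as the clustered index, every inserted row lands on the last page, so a bulk insert does not have to place rows into the middle of the index.

    -- Hypothetical load table: the IDENTITY clustered primary key means new rows
    -- always append at the logical end of the index.
    CREATE TABLE dbo.FactLoad
    (
        FactLoadID bigint IDENTITY(1,1) NOT NULL,
        SourceKey  int           NOT NULL,
        Amount     decimal(18,2) NOT NULL,
        CONSTRAINT PK_FactLoad PRIMARY KEY CLUSTERED (FactLoadID)
    );

    -- Each insert appends; no sorting against existing key values is needed.
    INSERT INTO dbo.FactLoad (SourceKey, Amount) VALUES (42, 19.99);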
Indexes are meant to speed up retrieval (SELECT) of rows. They only work against you for INSERT, DELETE and UPDATE. And, in your case, if INSERT is the predominant operation to be performed in your system, don't go for indexes at all. Even in your production system, assess the ratio between retrieval operations and insert/update operations, and if it turns out that retrieval is going to be dominant, then you can think about indexes.
Note: whenever we define a primary key on a table, a basic index structure is automatically created for that table. So, without any specific need for retrieval optimization, there is no actual need to design and implement further indexes.
You can read more here: https://www.geeksforgeeks.org/sql-indexes/
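A quick way to see this for yourself (the table name is hypothetical): declaring a primary key creates a supporting index, which you can confirm in the catalog views.

    -- Declaring a primary key creates a unique (by default clustered) index.
    CREATE TABLE dbo.Customer
    (
        CustomerID int NOT NULL CONSTRAINT PK_Customer PRIMARY KEY,
        Name       nvarchar(100) NOT NULL
    );

    -- The automatically created index shows up in sys.indexes.
    SELECT name, type_desc, is_primary_key
    FROM sys.indexes
    WHERE object_id = OBJECT_ID('dbo.Customer');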
I want to reduce logical reads to speed up stored procedure execution time in SQL Server. Later I came to know that by using an index I might find my solution.
I need to know how indexing works and its benefits.
Indexes are used to find rows with specific column values quickly. Without an index, SQL must begin with the first row and then read through the entire table to find the relevant rows. The larger the table, the more this costs. If the table has an index for the columns in question, SQL can quickly determine the position to seek to in the middle of the data file without having to look at all the data. This is much faster than reading every row sequentially.
BUT indexes slow down inserts and updates (which can become a really serious issue with locking) and cost disk space.
Read MSDN for the different kinds of indexes that exist in SQL Server.
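As a small illustrative sketch (the table and column names are assumptions, not from the question), an index on the searched column lets the engine seek directly to the matching rows instead of reading the whole table:

    -- Without this index, the query below has to scan every row of dbo.Orders.
    CREATE NONCLUSTERED INDEX IX_Orders_CustomerID
        ON dbo.Orders (CustomerID);

    -- With the index, SQL Server can seek straight to the rows for one customer.
    SELECT OrderID, OrderDate
    FROM dbo.Orders
    WHERE CustomerID = 12345;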
At the following link
http://www.programmerinterview.com/index.php/database-sql/selectivity-in-sql-databases/
the author has written that since the "SEX" column has only two possible values, its selectivity for 10,000 records would be, according to the formula given, 0.02%.
But my question is: how does a database system come to know that this particular column has this many unique values? Wouldn't the database system have to scan the entire table at least once? Or is there some other way the database system learns about those unique values?
First, you are applying the formula wrong. The selectivity for sex (in the example given) would be 50% not 0.02%. That means that each value appears about 50% of the time.
The general way that databases keep track of this is using something called "statistics". These are measures that are kept about all tables and used by the optimizer. Sometimes, the information can also be provided by an index on the column.
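In SQL Server, for example, you can inspect those stored measures directly (the table and statistics names below are placeholders):

    -- List the statistics objects SQL Server keeps for a table.
    SELECT name, auto_created
    FROM sys.stats
    WHERE object_id = OBJECT_ID('dbo.Person');

    -- Show the density and histogram the optimizer uses for one of them
    -- (here, the statistics that back a hypothetical index on the Sex column).
    DBCC SHOW_STATISTICS ('dbo.Person', 'IX_Person_Sex');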
Coming back to your actual question: yes, the database scans the table data periodically and saves some statistics (e.g. max value, min value, number of distinct keys, number of rows in the table, etc.) in an internal table. These statistics are used to estimate the size of the result of your query (or other DML operations) in order to evaluate the optimal execution plan. You can manually trigger statistics generation by running the command EXEC DBMS_STATS.GATHER_DATABASE_STATS; or one of the other gathering procedures. You can also advise Oracle to read only a sample of all data (e.g. 10% of all rows).
Usually the data content does not change drastically, so it does not matter if the numbers are not absolutely exact; they are (usually) sufficient to estimate an execution plan.
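A minimal sketch of both variants, using the procedure already mentioned above (the 10% sample figure is just an example):

    -- Gather statistics by reading the full data.
    EXEC DBMS_STATS.GATHER_DATABASE_STATS;

    -- Or advise Oracle to read only roughly a 10% sample of the rows.
    EXEC DBMS_STATS.GATHER_DATABASE_STATS(estimate_percent => 10);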
Oracle has many processes related to calculating the number of distinct values (NDV).
Manual Statistics Gathering: Statistics gathering can be triggered manually, through many different procedures in DBMS_STATS (a small call sketch follows this list).
AUTOTASK: Since 10g Oracle has a default AUTOTASK job, "auto optimizer stats collection". It will only gather statistics if the current stats are stale.
Bulk Load: In 12c statistics can be gathered during a bulk load.
Sample: The NDV can be computed from 100% of the data or can be estimated based on a sample. The sample can be either based on blocks or rows.
One-pass distinct sampling: 11g introduced a new AUTO_SAMPLE_SIZE algorithm. It scans the entire table but only uses one pass. It's much faster to scan the whole table than to have to sort even a small part of it. There are several more in-depth descriptions of the algorithm, such as this one.
Incremental Statistics: For partitioned tables Oracle can store extra information about the NDV, called a synopsis. With this information, if only a single partition is modified, only that one partition needs to be analyzed to generate both partition and global statistics.
Index NDV: Index statistics are created by default when an index is created. Also, the information can be periodically re-gathered from DBMS_STATS.GATHER_INDEX_STATS or the cascade option in other procedures in DBMS_STATS.
Custom Statistics: The NDV can be manually set with DBMS_STATS.SET_* or ASSOCIATE STATISTICS.
Dynamic Sampling: Right before a query is executed, Oracle can automatically sample a small number of blocks from the table to estimate the NDV. This usually only happens when statistics are missing.
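As a small call sketch for the manual route (the schema and table names are placeholders): gathering one table's statistics with the AUTO_SAMPLE_SIZE algorithm and cascading to its indexes.

    -- One-pass AUTO_SAMPLE_SIZE estimate of the NDV; CASCADE also refreshes
    -- the index statistics for the table.
    EXEC DBMS_STATS.GATHER_TABLE_STATS(ownname => 'SALES', tabname => 'ORDERS', estimate_percent => DBMS_STATS.AUTO_SAMPLE_SIZE, cascade => TRUE);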
The database scans the data set in a table so that it can use the most efficient method to retrieve the data. It measures the uniqueness of values using the following formula:
Index Selectivity = number of distinct values / total number of values
The result will be between zero and one. An index selectivity of zero means that there are no unique values at all; in these cases an index actually reduces performance, so the database uses sequential scanning instead of seek operations.
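As a tiny worked sketch (the table and column names are invented), you can compute that ratio yourself; for a two-value SEX column over 10,000 rows it comes out to 0.0002, i.e. 0.02%.

    -- Index selectivity = distinct values / total values.
    SELECT COUNT(DISTINCT Sex) * 1.0 / COUNT(*) AS index_selectivity
    FROM dbo.Person;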
For more information on indexes read https://dba.stackexchange.com/questions/42553/index-seek-vs-index-scan
I'm working with table partitioning on an extremely large fact table in a warehouse. I have executed the script a few different ways, with and without nonclustered indexes. With the indexes it appears to dramatically expand the log file, while without the nonclustered indexes it does not expand the log file as much but takes more time to run due to the rebuilding of the indexes.
What I am looking for is any links or information as to what is happening behind the scenes, specifically to the log file, when you split a table partition.
I think it isn't too hard to theorize what is going on (to a certain extent). Behind the scenes each partition is given a different HoBT, which in normal language means each partition is in effect sitting on its own hidden table.
So, in theory, splitting a partition (assuming data is moving) would involve:
inserting the data into the new table
removing data from the old table
The nonclustered index behaviour can be figured out too, but the theorizing changes depending on whether there is a clustered index or not. It also matters whether the index is partition-aligned or not.
Given a bit more information on the table (clustered or heap), we could theorize this further.
If the partition function is used by a partitioned table and SPLIT results in partitions where both will contain data, SQL Server will move the data to the new partition. This data movement will cause transaction log growth due to inserts and deletes.
This is from an article by Microsoft on Partitioned Table and Index Strategies
So it looks like it is doing a delete from the old partition and an insert into the new partition. This would explain the growth in the transaction log.
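For context, a split that forces that data movement might look like the following minimal sketch (the partition scheme, partition function name and boundary value are assumptions); splitting on a boundary that leaves existing rows on both sides is what triggers the logged delete/insert.

    -- Tell the partition scheme which filegroup the new partition will use,
    -- then split the existing range at a new boundary value.
    ALTER PARTITION SCHEME ps_FactDate NEXT USED [PRIMARY];
    ALTER PARTITION FUNCTION pf_FactDate() SPLIT RANGE ('2015-01-01');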
I have a big table in SQL Server 2005 that's taking about 3.5 GB of space (according to sp_spaceused). It has 10 million records, and several indexes.
I just dropped a bunch of columns from it, such that the record length was reduced by half, and to my surprise it took zero time to do that. Obviously, sp_spaceused was still reporting the same space used; SQL Server hadn't really done anything when dropping the columns, other than marking them as "dropped".
So I moved all the data from this table into another new table, truncated it, and moved all the data back, so that it'd get all reconstructed.
Now, after that, data is taking 2.8 GB, which IS less than before, but I expected a bigger drop.
Is it possible that the fact that this table originally had these columns is still leaving something there?
Was truncating it not enough? Should I drop it and create it again with the smaller column set?
Or is the data really taking 2.8 GB?
Thanks!
You will need to rebuild the clustered index (assuming you have one - by default, your primary key is the clustered key).
ALTER INDEX (your clustered index) ON (your table) REBUILD
The data is really the leaf level of your clustered index - once you rebuild it, it will be "compacted" and the rows should be stored on much fewer data pages, reducing your database size, too.
If that doesn't help at all, you might also need to run a DBCC SHRINKDATABASE on your database to really reclaim the space. These two steps together should really get you some smaller database file!
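Put together, a minimal sketch (the index, table and database names are placeholders, and note that shrinking a database can introduce fragmentation, so it is usually a one-off step):

    -- Rewrite the clustered index so the smaller rows are packed onto fewer pages.
    ALTER INDEX PK_BigTable ON dbo.BigTable REBUILD;

    -- Then release the now-unused space in the data files.
    DBCC SHRINKDATABASE (MyDatabase);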
Marc
How did you calculate that "expected a bigger drop"? Note that the data comes in 8K pages, which means that even if individual rows are smaller, that does not always mean you need fewer pages to store them.
For example (an extreme example), if your rows used to be 7.5K each, only one row per page would fit. You drop some columns, your row is 5K, but still it is one row per page.
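One way to check what is actually happening for your table (the object name is a placeholder) is to compare the average record size and page count before and after the rebuild:

    -- Average record size and page count at the leaf level of the clustered
    -- index or heap; these columns require the SAMPLED or DETAILED mode.
    SELECT index_id, avg_record_size_in_bytes, page_count, avg_page_space_used_in_percent
    FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.BigTable'), NULL, NULL, 'DETAILED')
    WHERE index_level = 0;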