I want to optimize the storage of a big table by moving the values of its varchar columns out into an external lookup table (there are many duplicated values).
The process of doing this is quite technical in its nature (creating a lookup table and referencing it instead of the actual value), and it sounds like something that should be part of the infrastructure (SQL Server in this case, or any RDBMS).
Then I thought: it should be an option of an index - do not store duplicate values,
only a reference to the duplicated value.
Can an index be optimized in such a manner - not holding duplicated values, but just a reference?
It should make the size of the table and the index much smaller when there are many duplicated values.
SQL Server cannot do deduplication of column values. An index stores one row for each row of the base table. They are just sorted differently.
If you want to deduplicate you can keep a separate table that holds all possible (or actually occurring) values with a much shorter ID. You can then refer to the values by only storing their ID.
You can maintain that deduplication table in the application code or using triggers.
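For illustration, a minimal sketch of such a deduplication table in T-SQL; the table and column names here are hypothetical:

-- Hypothetical lookup table that stores each distinct varchar value once
CREATE TABLE dbo.CommentText (
    CommentTextId int IDENTITY(1,1) PRIMARY KEY,
    CommentText   varchar(400) NOT NULL UNIQUE  -- enforces one row per value
);

-- The big table stores only the short ID instead of the repeated string
CREATE TABLE dbo.Ticket (
    TicketId      int IDENTITY(1,1) PRIMARY KEY,
    CommentTextId int NOT NULL REFERENCES dbo.CommentText (CommentTextId)
);

-- Reading the original value back requires a join
SELECT t.TicketId, c.CommentText
FROM dbo.Ticket AS t
JOIN dbo.CommentText AS c ON c.CommentTextId = t.CommentTextId;

Keeping dbo.CommentText current as new values arrive is the part you would handle in application code or in a trigger.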
Does creating an index on a column that will always have a different value in each record (like a unique column) improve performance on SELECTs?
I understand that having an index on a column named e.g. status, which can have 3 values (such as PENDING, DONE, FAILED), will make searching only for FAILED among 1 million records faster.
But what happens if I have a unique id (not the primary key) across 1 million records, and I do a SELECT on that column?
An index on a unique column is actually better than an index on a column with a few values.
To understand why, you need a basic understanding of how databases manage storage. This is a high-level view.
The primary purpose of an index is to reduce the number of pages that need to be read for a query. The rows themselves are stored on data pages. If you don't have an index, then all the data needs to be read.
The index is a data structure that makes it efficient to find a particular value. You can think of it as a sorted list, where a binary search is used to identify the right location. In actual fact, these are usually stored in a structure called a b-tree (where the "b" is generally taken to stand for "balanced", not "binary"), but that is an implementation detail. And there are types of indexes that don't use b-trees.
So, if the values are unique, then an index is extremely helpful. Instead of doing a full table scan, the "row id" can efficiently be looked up in the index and then only one data page needs to be read.
Note that unique constraints are implemented using indexes. If you have declared a column to be unique, there is no need for an additional index because it is already there.
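As a rough sketch (table and column names invented), a unique index lets the engine satisfy an equality search with a single seek:

-- A unique id that is not the primary key still deserves a unique index
CREATE UNIQUE NONCLUSTERED INDEX IX_Orders_ExternalId
    ON dbo.Orders (ExternalId);

-- This can be answered with an index seek plus at most one data-page read,
-- instead of scanning all million rows
SELECT *
FROM dbo.Orders
WHERE ExternalId = 'A1B2-C3D4';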
TL;DR:
What is an efficient way to loop through 500k rows, apply a custom transformation logic/cipher to a column (or a few) in each row, and update the column with the transformed data?
Is there a way to do it efficiently in SQL without having to write a separate program to loop through each row and apply the logic?
Background:
We have a table (~500k rows), and some columns contain sensitive data that needs to be masked. As we are masking identity columns that are used in joins, the masking needs to be consistent across all other tables. After much deliberation over MD5 / CRC / hashing algorithms, we decided to stick with our own cipher algorithm, which guarantees uniqueness without producing too many meaningless characters.
If you want to replace the values in place, then you need to be sure that the types are compatible -- that the output of your cipher has the same type as the existing column. Then:
update t
set col = my_cipher(col);
This is relatively expensive because every row is being updated.
If the types are not the same, then you have a bigger challenge. It sounds like the columns you want to change are foreign keys. You probably need to drop all foreign key constraints and modify the tables so the columns have the same type, change the values, and then add the foreign keys back in.
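A hedged sketch of that sequence, assuming the cipher preserves the column's type; the table names, the constraint name, and the dbo.my_cipher function are placeholders for your own:

-- 1. Temporarily drop the foreign key so both sides can be rewritten
ALTER TABLE dbo.Orders DROP CONSTRAINT FK_Orders_Customers;

-- 2. Apply the same cipher to the key on both sides so joins still match
UPDATE dbo.Customers SET CustomerId = dbo.my_cipher(CustomerId);
UPDATE dbo.Orders    SET CustomerId = dbo.my_cipher(CustomerId);

-- 3. Put the constraint back
ALTER TABLE dbo.Orders
    ADD CONSTRAINT FK_Orders_Customers
    FOREIGN KEY (CustomerId) REFERENCES dbo.Customers (CustomerId);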
This is for SQL Server.
I have a table that will contain a lot of rows and that table will be queried multiple times so I need to make sure my design is optimized.
Just for the question, let's say the table contains 2 columns: Name and Type.
Name is a varchar and it will be unique.
Type can have 5 different values (type1 ... type5). (It could possibly contain more values in the future.)
Should I make Type a varchar (and create an index on it), or would it be better to create a table of types that contains 5 rows, with only a column for the name, and make Type a foreign key?
Is there a performance difference between the two approaches? The queries will not always have the same conditions. Sometimes they will filter on the name, the type, or both, with different values.
EDIT: Consider that in my application, if Type were a table, the IDs would be cached so I wouldn't have to query the Type table every time.
Strictly speaking, you'll probably get better query performance if you keep all the data in one table. However, doing this is known as "denormalization" and comes with a number of pretty significant drawbacks.
If your table has "a lot of rows", storing an extra varchar field for every row, as opposed to, say, a smallint or even a tinyint, can add a non-trivial amount of size to your table.
If any of that data needs to change, you'll have to perform lots of updates against that table. This means transaction log growth and potential blocking on the table while the modification locks are held. If you store it as a separate table with 5-ish rows and you need to update the data associated with a type, you just update the one row of the 5 that you need.
Denormalizing data means that the definition of that data is no longer stored in one place, but in multiple places (actually, it's stored across every single row that contains those values).
For all the reasons listed above, managing that data (inserts, updates, deletes, and simply defining the data) can quickly become far more overhead than simply normalizing the data correctly in the first place, and for little to no benefit beyond what can be done with proper indexing.
If you find the need to return both the "big" table and some other information from the type table and you're worried about join performance, truthfully, I wouldn't be. That's a generalization, but if your big table has, say, 500M rows in it, I can't see many use cases where you'd want all those rows returned; you're probably going to get a subset, in which case that join becomes more manageable. Provided you index Type, the join should be pretty snappy.
If you do go the route of denormalizing your data, I'd recommend still having the lookup table as the "master definition" of what a "type" is, so it's not a conglomeration of millions of rows of data.
If you STILL want to denormalize the data WITHOUT a lookup table, at least put a CHECK constraint on the column to limit which values are allowed.
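To make the two options concrete, here is a rough sketch (all names invented):

-- Normalized: a small lookup table and a tinyint foreign key
CREATE TABLE dbo.ItemType (
    TypeId   tinyint     PRIMARY KEY,
    TypeName varchar(20) NOT NULL UNIQUE
);

CREATE TABLE dbo.Item (
    Name   varchar(100) NOT NULL PRIMARY KEY,
    TypeId tinyint      NOT NULL REFERENCES dbo.ItemType (TypeId)
);
CREATE INDEX IX_Item_TypeId ON dbo.Item (TypeId);

-- Denormalized alternative: store the varchar, but constrain its values
CREATE TABLE dbo.ItemDenormalized (
    Name varchar(100) NOT NULL PRIMARY KEY,
    Type varchar(20)  NOT NULL
        CHECK (Type IN ('type1', 'type2', 'type3', 'type4', 'type5'))
);
CREATE INDEX IX_ItemDenormalized_Type ON dbo.ItemDenormalized (Type);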
How much is "a lot of rows"?
If it is hundreds of thousands or more, then a Columnstore Index may be a good fit.
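For example, on SQL Server 2014 or later a clustered columnstore index could look like this (table name invented; note that in 2014 a clustered columnstore index cannot coexist with other indexes on the same table):

-- Column-wise storage compresses highly repetitive values such as Type very well
CREATE TABLE dbo.BigItem (
    Name varchar(100) NOT NULL,
    Type varchar(20)  NOT NULL
);
CREATE CLUSTERED COLUMNSTORE INDEX CCI_BigItem ON dbo.BigItem;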
It depends on your needs, but usually you would want the type column to be of a numeric type (in your case, tinyint).
I'm using SQL Server 2014. I'm creating multiple tables, always with more than 500 columns, and the exact number of columns will vary.
So I created the columns as sparse columns so that I could be sure there wouldn't be a problem if the number of columns exceeds 1024. Now there is a new problem:
Cannot create a row that has sparse data of size 8710 which is greater
than the allowable maximum sparse data size of 8023.
I know SQL Server allows only 8 KB of data in a row; I need to know what the workaround for this is. If I need to plan a move to NoSQL (MongoDB), how much impact will it have on converting my stored procedures?
Maximum number of columns in an ordinary table is 1024. Maximum number of columns in a wide (sparse) table is 30,000. Sparse columns are usually used when you have a lot of columns, but most of them are NULL.
In any case, there is a limit of 8060 bytes per row, so sparse columns won't help.
Often, having a thousand columns in a table indicates that there are problems with the database design and normalisation.
If you are sure that you need these thousand values as columns, not as rows in a related table, then the only workaround that comes to mind is to partition the table vertically.
For example, you have a Table1 with a column ID (which is the primary key) and 1000 other columns. Split it into Table1 and Table2. Each will have the same ID as its primary key and 500 columns each. The tables would be linked 1:1 using a foreign key constraint.
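A bare-bones sketch of that split, continuing the Table1/Table2 naming from above (the individual columns are placeholders):

-- Part 1 owns the primary key
CREATE TABLE dbo.Table1 (
    ID     int NOT NULL PRIMARY KEY,
    Col001 varchar(50) NULL,
    -- ... roughly the first 500 columns ...
    Col500 varchar(50) NULL
);

-- Part 2 shares the same key, linked 1:1 back to Table1
CREATE TABLE dbo.Table2 (
    ID     int NOT NULL PRIMARY KEY REFERENCES dbo.Table1 (ID),
    Col501 varchar(50) NULL,
    -- ... the remaining columns ...
    Col999 varchar(50) NULL
);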
The datatypes that are used, and how much of the data in each row is NULL, determine the effectiveness of sparse columns. If all of the fields in a row are populated, there is actually more overhead in storing those rows, and it will cause you to hit that maximum row size faster. If that is the case, then don't use sparse columns.
See how many columns you can convert from fixed-length to variable-length datatypes (varchar, nvarchar, varbinary). This might buy you some additional space in the page, as variable-length fields can be pushed to overflow pages, but they carry an overhead of 24 bytes for the pointer into the overflow page. I suspect you were thinking that sparse columns were going to allow you to store 30K columns... that only applies when you have a wide table where most of the columns are NULL.
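For instance, converting a single fixed-length column (hypothetical table and column names):

-- char(200) always occupies 200 bytes in the row; varchar(200) only uses what it needs
ALTER TABLE dbo.WideTable ALTER COLUMN Notes varchar(200) NULL;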
MongoDB will not be your answer... at least not without a lot of refactoring. You will not be able to leverage your existing stored procedures. It might be the best fit for you, but there are many things to consider when moving to MongoDB. Your data access layer will need to be rebuilt unless you just happen to be persisting your data in the relational structure as JSON documents :). I assume that is not the case.
I am assuming that you have wide tables and they are densely populated...based on that assumption here is my recommendation.
Partition the table as Vladimir suggested, but create a view that joins all these tables together to make it look like one table. Now you have the same structure as you did before. Then add an INSTEAD OF trigger to the view to update the underlying tables. This way you can get what you want without having to do major refactoring of your code. There is some code you need to add for the trigger, but my experience has been that it's easy to write, and most of the time I didn't write the code by hand but created a script to generate it for all the views I had to do this for, since it was repetitive.
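A hedged sketch of what that could look like, continuing the Table1/Table2 split from the earlier answer (only the UPDATE path is shown; INSERT and DELETE triggers would follow the same pattern):

-- The view presents the two vertical partitions as one wide table
CREATE VIEW dbo.Table1Wide
AS
SELECT t1.ID, t1.Col001, t1.Col500, t2.Col501, t2.Col999
FROM dbo.Table1 AS t1
JOIN dbo.Table2 AS t2 ON t2.ID = t1.ID;
GO

-- The INSTEAD OF trigger routes updates against the view to the base tables
CREATE TRIGGER dbo.trg_Table1Wide_Update
ON dbo.Table1Wide
INSTEAD OF UPDATE
AS
BEGIN
    SET NOCOUNT ON;

    UPDATE t1
    SET t1.Col001 = i.Col001,
        t1.Col500 = i.Col500
    FROM dbo.Table1 AS t1
    JOIN inserted AS i ON i.ID = t1.ID;

    UPDATE t2
    SET t2.Col501 = i.Col501,
        t2.Col999 = i.Col999
    FROM dbo.Table2 AS t2
    JOIN inserted AS i ON i.ID = t2.ID;
END;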
Given a large table (10-100 million rows) what's the best way to add some extra (unindexed) columns to it?
Just add the columns.
Create a separate table for each extra column, and use joins when you want to access the extra values.
Does the answer change depending on whether the extra columns are dense (mostly not null) or sparse (mostly null)?
A column with a NULL value can be added to a row without any changes to the rest of the data page in most cases. Only one bit has to be set in the NULL bitmap. So, yes, a sparse column is much cheaper to add in most cases.
Whether it is a good idea to create a separate 1:1 table for additional columns very much depends on the use case. It is generally more expensive. For starters, there is an overhead of 28 bytes (heap tuple header plus item identifier) per row and some additional overhead per table. It is also much more expensive to JOIN rows in a query than to read them in one piece. And you need to add a primary / foreign key column plus an index on it. Splitting may be a good idea if you don't need the additional columns in most queries. Mostly it is a bad idea.
Adding a column is fast in PostgreSQL. Updating the values in the column is what may be expensive, because every UPDATE writes a new row (due to the MVCC model). Therefore, it is a good idea to update multiple columns at once.
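For example (PostgreSQL syntax; table and column names are made up):

-- Adding nullable columns only touches the catalog; existing rows are not rewritten
ALTER TABLE measurements
    ADD COLUMN note    text,
    ADD COLUMN flagged boolean;

-- Filling them in rewrites every affected row (MVCC), so set both columns in one pass
UPDATE measurements
SET note    = 'backfilled',
    flagged = false;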
Database page layout in the manual.
How to calculate row sizes:
Making sense of Postgres row sizes
Calculating and saving space in PostgreSQL