Process and update each row in SQL efficiently

TL;DR:
What is an efficient way to loop through 500k rows, apply a custom transformation/cipher logic to a column (or a few columns) in each row, and update the column with the transformed data?
Is there a way to do this efficiently in SQL, without having to write a separate program that loops through each row and applies the logic?
Background:
We have a table (~500k rows), and some columns contain sensitive data that needs to be masked. Since we are masking identity columns that are used in joins, the masking needs to be consistent across all other tables. After much deliberation over MD5 / CRC / other hashing algorithms, we decided to stick with our own cipher algorithm, which guarantees uniqueness without producing too many meaningless characters.

If you want to replace the values in place, then you need to be sure that the types are compatible -- that is, the value produced by the cipher has the same type as the column it replaces. Then:
update t
set col = my_cipher(col);
This is relatively expensive because every row is being updated.
If the types are not the same, then you have a bigger challenge. It sounds like the columns you want to change are used as foreign keys. You probably need to drop all foreign key constraints, modify the tables so the columns have the same type, change the values, and then add the foreign keys back in.
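To make that concrete, here is a minimal T-SQL sketch. The scalar function dbo.my_cipher stands in for your own cipher, and the table and constraint names (Customers, Orders, FK_Orders_Customers) are made up for illustration:
BEGIN TRANSACTION;

-- Drop the foreign key so both sides can be rewritten.
ALTER TABLE dbo.Orders DROP CONSTRAINT FK_Orders_Customers;

-- Apply the same cipher to the parent column and every referencing column.
UPDATE dbo.Customers SET CustomerId = dbo.my_cipher(CustomerId);
UPDATE dbo.Orders    SET CustomerId = dbo.my_cipher(CustomerId);

-- Re-create the constraint once the values are consistent again.
ALTER TABLE dbo.Orders
    ADD CONSTRAINT FK_Orders_Customers
    FOREIGN KEY (CustomerId) REFERENCES dbo.Customers (CustomerId);

COMMIT TRANSACTION;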

Related

Query on varchar vs foreign key performance

This is for SQL Server.
I have a table that will contain a lot of rows, and that table will be queried many times, so I need to make sure my design is optimized.
Just for the question, let's say the table contains two columns: Name and Type.
Name is a varchar and it will be unique.
Type can be one of 5 different values (type1 ... type5). (It could possibly contain more values in the future.)
Should I make Type a varchar (and create an index), or would it be better to create a table of types containing those 5 rows, with only a column for the name, and make Type a foreign key?
Is there a performance difference between the two approaches? The queries will not always have the same conditions. Sometimes they will filter on the name, the type, or both, with different values.
EDIT: Consider that in my application, if Type were a table, the IDs would be cached so I wouldn't have to query the Type table every time.
Strictly speaking, you'll probably get better query performance if you keep all the data in one table. However, doing this is known as "denormalization" and comes with a number of pretty significant drawbacks.
If your table has "a lot of rows", storing an extra varchar field for every row, as opposed to, say, a smallint or even a tinyint, can add a non-trivial amount of size to your table.
If any of that data needs to change, you'll have to perform lots of updates against that table. That means transaction log growth and potential blocking while modification locks are held on the table. If you instead store it as a separate table with 5-ish rows, updating the data associated with a type means updating just the one row you need.
Denormalizing data means that the definition of that data is no longer stored in one place, but in multiple places (actually, it's stored across every single row that contains those values).
For all the reasons listed above, managing that data (inserts, updates, deletes, and simply defining the data) can quickly become far more overhead than simply normalizing the data correctly in the first place, and for little to no benefit beyond what can be done with proper indexing.
If you find the need to return both the "big" table and some other information from the type table and you're worried about join performance, truthfully, I wouldn't be. That's a generalization, but if your big table has, say, 500M rows in it, I can't see many use cases where you'd want all those rows returned; you're probably going to get a subset, in which case that join becomes much more manageable. Provided you index Type, the join should be pretty snappy.
If you do go the route of denormalizing your data, I'd recommend still having the lookup table as the "master definition" of what a "type" is, so it's not a conglomeration of millions of rows of data.
If you STILL want to denormalize the data WITHOUT a lookup table, at least put a CHECK constraint on the column to restrict which values are allowed.
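As a rough sketch of the two options (all table, index, and constraint names here are hypothetical):
-- Normalized option: a small lookup table plus a tinyint foreign key.
CREATE TABLE dbo.ItemType (
    TypeId   tinyint     NOT NULL PRIMARY KEY,
    TypeName varchar(50) NOT NULL UNIQUE
);

CREATE TABLE dbo.Item (
    Name   varchar(100) NOT NULL PRIMARY KEY,
    TypeId tinyint      NOT NULL
        CONSTRAINT FK_Item_ItemType REFERENCES dbo.ItemType (TypeId)
);
CREATE INDEX IX_Item_TypeId ON dbo.Item (TypeId);

-- Denormalized option: keep the varchar, but at least constrain and index it.
CREATE TABLE dbo.ItemDenormalized (
    Name varchar(100) NOT NULL PRIMARY KEY,
    Type varchar(20)  NOT NULL
        CONSTRAINT CK_ItemDenormalized_Type
        CHECK (Type IN ('type1', 'type2', 'type3', 'type4', 'type5'))
);
CREATE INDEX IX_ItemDenormalized_Type ON dbo.ItemDenormalized (Type);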
How much is "a lot of rows"?
If it is hundreds of thousands or more, then a Columnstore Index may be a good fit.
It depends on your needs, but usually you would want the type column to be a numeric type (in your case, a tinyint).

Sparse column size limitation workaround

I'm using SQL Server 2014. I'm creating multiple tables, always with more than 500 columns, and the set of columns varies from table to table.
So I made the columns sparse, so that I could be sure there wouldn't be a problem if the number of columns exceeded 1,024. Now there is a new problem:
Cannot create a row that has sparse data of size 8710 which is greater
than the allowable maximum sparse data size of 8023.
I know SQL Server allows only about 8 KB of data per row; I need to know what the workaround for this is. If I need to plan a move to NoSQL (MongoDB), how much impact will that have on converting my stored procedures?
The maximum number of columns in an ordinary table is 1,024. The maximum number of columns in a wide (sparse) table is 30,000. Sparse columns are usually used when you have a lot of columns, but most of their values are NULL.
In any case, there is a limit of 8,060 bytes per row, so sparse columns won't help.
Often, having a thousand columns in a table indicates that there are problems with the database design and normalisation.
If you are sure that you need these thousand values as columns, not as rows in a related table, then the only workaround that comes to mind is to partition the table vertically.
For example, you have Table1 with a column ID (the primary key) and 1,000 other columns. Split it into Table1 and Table2. Each will have the same ID as its primary key and 500 columns each. The tables would be linked 1:1 using a foreign key constraint.
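A minimal sketch of that vertical split (column names and datatypes are made up for illustration):
CREATE TABLE dbo.Table1 (
    ID     int NOT NULL PRIMARY KEY,
    Col001 varchar(50) NULL,
    -- ... columns 002 through 499 ...
    Col500 varchar(50) NULL
);

CREATE TABLE dbo.Table2 (
    ID      int NOT NULL PRIMARY KEY,
    Col501  varchar(50) NULL,
    -- ... columns 502 through 999 ...
    Col1000 varchar(50) NULL,
    CONSTRAINT FK_Table2_Table1 FOREIGN KEY (ID) REFERENCES dbo.Table1 (ID)
);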
The datatypes used and how densely the rows are populated (how much of the data in each row is NULL) determine the effectiveness of sparse columns. If all of the fields in a table are populated, there is actually more overhead in storing those rows, which will cause you to hit that maximum row size faster. If that is the case, don't use sparse columns.
See how many columns you can convert from fixed-length to variable-length datatypes (varchar, nvarchar, varbinary). This might buy you some additional space in the page, as variable-length fields can be pushed into row-overflow pages, but they carry an overhead of 24 bytes for the pointer to the overflow page. I suspect you were thinking that sparse columns would allow you to store 30,000 columns... that only holds when you have a wide table where most of the columns are NULL.
MongoDB will not be your answer... at least not without a lot of refactoring. You will not be able to leverage your existing stored procedures. It might be the best fit for you, but there are many things to consider when moving to MongoDB. Your data access layer will need to be rebuilt, unless you just happen to be persisting your data in the relational structure as JSON documents :). I assume that is not the case.
I am assuming that you have wide tables and they are densely populated... based on that assumption, here is my recommendation.
Partition the table vertically as Vladimir suggested, but create a view that joins all these tables together to make it look like one table. Now you have the same structure as before. Then add an INSTEAD OF trigger to the view to update the underlying tables. This way you can get what you want without major refactoring of your code. There is code you need to add for the trigger, but my experience has been that it's easy to write; most times I didn't write it by hand but created a script to generate the trigger code for all the views, since it was repetitive.
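As a rough sketch, continuing with the hypothetical Table1 / Table2 split from the previous answer (only a few of the columns shown):
CREATE VIEW dbo.Table1Wide
AS
SELECT t1.ID, t1.Col001, t1.Col500, t2.Col501, t2.Col1000
FROM dbo.Table1 AS t1
JOIN dbo.Table2 AS t2 ON t2.ID = t1.ID;
GO

CREATE TRIGGER dbo.trg_Table1Wide_Update
ON dbo.Table1Wide
INSTEAD OF UPDATE
AS
BEGIN
    SET NOCOUNT ON;

    -- Route each column back to the base table that actually stores it.
    UPDATE t1
    SET t1.Col001 = i.Col001,
        t1.Col500 = i.Col500
    FROM dbo.Table1 AS t1
    JOIN inserted AS i ON i.ID = t1.ID;

    UPDATE t2
    SET t2.Col501  = i.Col501,
        t2.Col1000 = i.Col1000
    FROM dbo.Table2 AS t2
    JOIN inserted AS i ON i.ID = t2.ID;
END;
GO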

Removing non-adjacent duplicates comparing all fields

What is the most (time) efficient way of removing all exact duplicates from an unsorted standard internal table (non-deep structure, arbitrarily large)?
All I can think of is simply sorting the entire thing by all of its fields before running DELETE ADJACENT DUPLICATES FROM itab COMPARING ALL FIELDS. Is there a faster or preferred alternative? Will this cause problems if the structure mixes alphanumeric fields with numerics?
To provide context, I'm trying to improve performance on some horrible select logic in legacy programs. Most of these run full table scans on 5-10 joined tables, some of them self-joining. I'm left with hundreds of thousands of rows in memory and I'm fairly sure a large portion of them are just duplicates. However, changing the actual selects is too complex and would require /ex[tp]ensive/ retesting. Just removing duplicates would likely cut runtime in half but I want to make sure that the deduplication doesn't add too much overhead itself.
I would investigate two methods:
Store the original index in an auxiliary field, SORT BY the fields you want to compare (possibly using STABLE), DELETE ADJACENT DUPLICATES, then re-SORT BY the stored index.
Use a HASHED TABLE keyed on the fields you want to compare and LOOP through the data table. Use READ TABLE .. TRANSPORTING NO FIELDS on the hashed table to find out whether the value already exists; if so, delete the current row from the data table, otherwise add the values to the hashed table.
I'm not sure about the relative performance, but I would recommend running SAT on a plausible data set for both methods and comparing the results.

Do duplicate values in an index take up duplicate space

I want to optimize the storage of a big table by moving the values of its varchar columns out to an external lookup table (there are many duplicated values).
The process of doing this is very technical in nature (creating a lookup table and referencing it instead of the actual value), and it sounds like something that should be part of the infrastructure (SQL Server in this case, or any RDBMS).
Then I thought it should be an option on an index: do not store duplicate values,
only a reference to the duplicated value.
Can an index be optimized in such a manner, holding not the duplicated values themselves but just references?
That should make the size of the table and index much smaller when there are many duplicated values.
SQL Server cannot do deduplication of column values. An index stores one row for each row of the base table. They are just sorted differently.
If you want to deduplicate, you can keep a separate table that holds all possible (or actually occurring) values with a much shorter ID. You can then refer to the values by storing only their ID.
You can maintain that deduplication table in the application code or using triggers.
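A minimal sketch of that manual deduplication, with hypothetical names (CityLookup holds the distinct strings; the big Customer table stores only the short ID):
CREATE TABLE dbo.CityLookup (
    CityId   int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    CityName varchar(200) NOT NULL UNIQUE
);

CREATE TABLE dbo.Customer (
    CustomerId int NOT NULL PRIMARY KEY,
    CityId     int NOT NULL
        CONSTRAINT FK_Customer_CityLookup REFERENCES dbo.CityLookup (CityId)
);

-- Insert pattern: add the value to the lookup table if it is new,
-- then store only its ID in the big table.
DECLARE @CityName varchar(200) = 'Springfield';

INSERT INTO dbo.CityLookup (CityName)
SELECT @CityName
WHERE NOT EXISTS (SELECT 1 FROM dbo.CityLookup WHERE CityName = @CityName);

INSERT INTO dbo.Customer (CustomerId, CityId)
SELECT 1, CityId
FROM dbo.CityLookup
WHERE CityName = @CityName;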

PostgreSQL: performance impact of extra columns

Given a large table (10-100 million rows), what's the best way to add some extra (unindexed) columns to it?
Just add the columns.
Create a separate table for each extra column, and use joins when you want to access the extra values.
Does the answer change depending on whether the extra columns are dense (mostly not null) or sparse (mostly null)?
A column with a NULL value can be added to a row without any changes to the rest of the data page in most cases. Only one bit has to be set in the NULL bitmap. So, yes, a sparse column is much cheaper to add in most cases.
Whether it is a good idea to create a separate 1:1 table for additional columns very much depends on the use case. It is generally more expensive. For starters, there is an overhead of 28 bytes (heap tuple header plus item identifier) per row and some additional overhead per table. It is also much more expensive to JOIN rows in a query than to read them in one piece. And you need to add a primary / foreign key column plus an index on it. Splitting may be a good idea if you don't need the additional columns in most queries. Mostly it is a bad idea.
Adding a column is fast in PostgreSQL. Updating the values in the column is what may be expensive, because every UPDATE writes a new row (due to the MVCC model). Therefore, it is a good idea to update multiple columns at once.
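A small illustrative sketch with hypothetical table and column names: adding nullable columns is a quick catalog change, while backfilling them rewrites rows, so it pays to do the backfill in a single pass:
ALTER TABLE measurements
    ADD COLUMN note       text,
    ADD COLUMN quality    integer,
    ADD COLUMN checked_at timestamptz;

-- One UPDATE writes one new row version per row (MVCC),
-- instead of three passes for three separate single-column updates.
UPDATE measurements
SET note       = 'backfilled',
    quality    = 0,
    checked_at = now();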
See "Database page layout" in the manual.
How to calculate row sizes:
Making sense of Postgres row sizes
Calculating and saving space in PostgreSQL