Multiple column selection in columnar database - sql

I am just learning the difference between row-based and column-based databases. I know their benefits, but I have a few questions.
Let's say I have a table with 3 columns - col1, col2 and col3. I want to fetch all col2, col3 pairs where col3 matches a particular value. Let's say the column values are stored on disk like below.
Block1 = col1
Block2,Block3 = col2
Block4 = col3
My understanding is that the column value along with row id information will be stored in a block. Eg: (Block4 -> apple:row_2, banana:row_1). Am I correct?
Are values in the block sorted by column value? Eg: (Block4 -> apple:row_2, banana:row_1 instead of Block4 -> banana:row_1, apple:row_2). If not, how does filtering or search work without compromising performance?
Assuming values in the block are sorted by column value, how will the corresponding col2 values be filtered based on the row ids fetched from Block4? Does it require a linear search then?

The purpose of a columnar database is to improve performance for read queries by limiting the IO only to those columns used in the query. It does this by separating the columns into separate storage spaces.
A naive form of a columnar database would store one or a set of columns with a primary key and then use JOIN to bring together all the columns for a table. Columns that are not referenced would not be included.
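To make that naive form concrete, here is a minimal sketch (table and column names are invented, loosely following the col2/col3 example from the question): each column lives in its own narrow table keyed by a row id, and a query only touches the column tables it actually references.

CREATE TABLE t_col2 (row_id BIGINT PRIMARY KEY, col2 VARCHAR(50));
CREATE TABLE t_col3 (row_id BIGINT PRIMARY KEY, col3 VARCHAR(50));

-- Only col2 and col3 storage is read; col1 is never touched.
SELECT c2.col2, c3.col3
FROM t_col3 c3
JOIN t_col2 c2 ON c2.row_id = c3.row_id
WHERE c3.col3 = 'apple';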
However, databases that provide native support for columnar storage have much more sophisticated functionality than the naive example. Each columnar database stores data in its own way, so your answer depends on the particular database, which you haven't specified.
They might store "blocks" of values for a column and these blocks represent (in some way) a range of rows. So, if you are choosing 1 row from a billion row table, only the blocks with those rows need to be read.
Storing columns separately allows for enhanced functionality at the column level:
Compression. Values with the same data type can be much more easily compressed than rows which contain different values.
Block statistics. Blocks can be summarized statistically -- such as min and max values -- which can facilitate filtering (see the sketch after this list).
Secondary data structures. Indexes for instance can be used within blocks (and these might be akin to "sorting" the values, actually).
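As a rough illustration of how min/max block statistics can prune I/O (purely conceptual -- real engines keep this metadata internally; the block_stats table and its columns are made up here):

-- Hypothetical per-block metadata ("zone map") for each column.
CREATE TABLE block_stats (
    column_name VARCHAR(50),
    block_id    INT,
    min_value   VARCHAR(50),
    max_value   VARCHAR(50)
);

-- Only blocks whose [min, max] range could contain 'apple' need to be read at all.
SELECT block_id
FROM block_stats
WHERE column_name = 'col3'
  AND 'apple' BETWEEN min_value AND max_value;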
The cost of all this is that inserts are no longer simple, so ACID properties are trickier with a column orientation. Because such databases are often used for decision support queries, this may not be an important limitation.
The "rows" are determined -- essentially -- by row ids. However, the row ids may actually consist of multiple parts, such as a block id and a row-within-a-block. This allows the store to use, say, 4 bytes for each component but not be limited to 4 billion rows.
Reconstructing rows between different columns is obviously a critical piece of functionality for any such database. In the "naive" example, this is handled via JOIN algorithms. However, specialized data structures would clearly have more specific approaches. Storing the data essentially in "row order" would be a typical solution.

Related

Query on varchar vs foreign key performance

This is for SQL Server.
I have a table that will contain a lot of rows, and that table will be queried multiple times, so I need to make sure my design is optimized.
Just for the question, let's say that table contains 2 columns: Name and Type.
Name is a varchar and it will be unique.
Type can be one of 5 different values (type1... type5). (It could possibly contain more values in the future.)
Should I make Type a varchar (and create an index), or would it be better to create a table of types that contains 5 rows with only a column for the name, and make Type a foreign key?
Is there a performance difference between both approaches? The queries will not always have the same condition. Sometimes they will query by name, by type, or by both with different values.
EDIT: Consider that in my application, if Type were a table, the IDs would be cached so I wouldn't have to query the Type table every time.
Strictly speaking, you'll probably get better query performance if you keep all the data in one table. However doing this is known as "denormalization" and comes with a number of pretty significant drawbacks.
If your table has "a lot of rows", storing an extra varchar field for every row, as opposed to, say, a smallint or even tinyint, can add a non-trivial amount of size to your table.
If any of that data needs to change, you'll have to perform lots of updates against that table. This means transaction log growth and potential blocking on the table during modification locks. If you store it as a separate table with 5-ish rows and a type definition needs to change, you just update one of those 5 rows.
Denormalizing data means that the definition of that data is no longer stored in one place, but in multiple places (actually it's stored across every single row that contains those values).
For all the reasons listed above, managing that data (inserts, updates, deletes, and simply defining the data) can quickly become far more overhead than simply normalizing the data correctly in the first place, and for little to no benefit beyond what can be done with proper indexing.
If you find the need to return both the "big" table and some other information from the type table and you're worried about join performance, truthfully, I wouldn't be. That's a generalization, but if your big table has, say, 500M rows in it, I can't see many use cases where you'd want all those rows returned; you're probably going to get a subset, in which case the join is more manageable. Provided you index type, the join should be pretty snappy.
If you do go the route of denormalizing your data, I'd recommend still having the lookup table as the "master definition" of what a "type" is, so it's not a conglomeration of millions of rows of data.
If you STILL want to denormalize the data WITHOUT a lookup table, at least put a CHECK constraint on the column to limit which values are allowable or not.
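For example (the table and constraint names are illustrative; the values come from the question):

ALTER TABLE MyTable
ADD CONSTRAINT CK_MyTable_Type
    CHECK (Type IN ('type1', 'type2', 'type3', 'type4', 'type5'));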
How much is "a lot of rows"?
If it is hundreds of thousands or more, then a Columnstore Index may be a good fit.
It depends on your needs, but usually you would want the type column to be of a numerical value (in your case tinyint).
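A minimal sketch of that normalized layout with a tinyint key (all names are illustrative):

CREATE TABLE TypeLookup (
    TypeId   TINYINT     NOT NULL PRIMARY KEY,
    TypeName VARCHAR(50) NOT NULL UNIQUE
);

CREATE TABLE Items (
    Name   VARCHAR(100) NOT NULL UNIQUE,
    TypeId TINYINT      NOT NULL REFERENCES TypeLookup (TypeId)
);

-- Index the foreign key so filtering and joining on type stays cheap.
CREATE INDEX IX_Items_TypeId ON Items (TypeId);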

Sparse column size limitation workaround

I'm using SQL Server 2014. I'm creating multiple tables, always with more than 500 columns, and the set of columns will vary accordingly.
So, I created sparse columns so that I could be sure that if the number of my columns exceeds 1024 there won't be a problem. Now there is a new problem:
Cannot create a row that has sparse data of size 8710 which is greater
than the allowable maximum sparse data size of 8023.
I know SQL Server allows only 8 KB of data in a row; I need to know what the workaround for this is. If I need to plan a move to NoSQL (MongoDB), how much impact will it have on converting my stored procedures?
Maximum number of columns in an ordinary table is 1024. Maximum number of columns in a wide (sparse) table is 30,000. Sparse columns are usually used when you have a lot of columns, but most of them are NULL.
In any case, there is a limit of 8060 bytes per row, so sparse columns won't help.
Often, having a thousand columns in a table indicates that there are problems with the database design and normalisation.
If you are sure that you need these thousand values as columns, not as rows in a related table, then the only workaround that comes to mind is to partition the table vertically.
For example, you have a Table1 with column ID (which is the primary key) and 1000 other columns. Split it into Table1 and Table2. Each will have the same ID as a primary key and 500 columns each. The tables would be linked 1:1 using foreign key constraint.
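A sketch of that split (column names, types and the exact split point are placeholders):

CREATE TABLE Table1 (
    ID     INT NOT NULL PRIMARY KEY,
    Col001 VARCHAR(100) NULL,
    -- ... Col002 through Col499 ...
    Col500 VARCHAR(100) NULL
);

CREATE TABLE Table2 (
    ID      INT NOT NULL PRIMARY KEY
            REFERENCES Table1 (ID),   -- 1:1 link back to Table1
    Col501  VARCHAR(100) NULL,
    -- ... Col502 through Col999 ...
    Col1000 VARCHAR(100) NULL
);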
The datatypes that are used and the density of how much data in a row is null determine the effectiveness of sparse columns. If all of the fields in a table are populated, there is actually more overhead in storing those rows, which will cause you to hit that maximum page size faster. If that is the case, then don't use sparse columns.
See how many columns you can convert from fixed-length to variable-length datatypes (varchar, nvarchar, varbinary). This might buy you some additional space in the page, as variable-length fields can be put into overflow pages, but they carry an overhead of 24 bytes for the pointer into the overflow page. I suspect you were thinking that sparse columns were going to allow you to store 30K columns... that would only be the case for a wide table where most of the columns are NULL.
MongoDB will not be your answer...at least not without a lot of refactoring. You will not be able to leverage your existing stored procedures. It might be the best fit for you but there are many things to consider when moving to MongoDB. Your data access layer will need to be rebuilt unless you just happen to be persisting your data in the relational structure as JSON documents :). I assume that is not the case.
I am assuming that you have wide tables and they are densely populated...based on that assumption here is my recommendation.
Partition the table as Vladimir suggested, but create a view that joins all these tables together to make it look like one table. Now you have the same structure as you did before. Then add an INSTEAD OF trigger to the view to update the tables. This is a way you can get what you want without having to do major refactoring of your code. There is code you need to add for the trigger, but my experience has been that it's easy to write, and most times I didn't write the code by hand but created a script to generate it for all the views I had to do this for, since it was repetitive.
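A rough sketch of that view-plus-trigger approach (SQL Server syntax; the names follow the partitioning sketch above, and only one column per base table is shown):

CREATE VIEW Table1_All
AS
SELECT t1.ID, t1.Col001, t2.Col501   -- ... list every column from both halves ...
FROM Table1 t1
JOIN Table2 t2 ON t2.ID = t1.ID;
GO

CREATE TRIGGER trg_Table1_All_Update
ON Table1_All
INSTEAD OF UPDATE
AS
BEGIN
    -- Route each column back to the base table that actually stores it.
    UPDATE t1 SET Col001 = i.Col001
    FROM Table1 t1 JOIN inserted i ON i.ID = t1.ID;

    UPDATE t2 SET Col501 = i.Col501
    FROM Table2 t2 JOIN inserted i ON i.ID = t2.ID;
END;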

how to discover all possible ways in which two tables can be joined to produce significant results

Given two tables(only data), I want to find out all possible ways in which the two tables can be joined to produce significant results. Each way corresponds to a mapping of attributes from one table to the other. The strength of a join for a mapping M from table T1 to T2 is the percentage of rows from T1 that join to some row of T2 under the mapping M. I am interested in finding all the mappings that have support greater than a threshold t. This is a very expensive operation since the number of possible mappings is exponential in the number of attributes. Thus, I am thinking to consider sampling techniques for join discovery.
For example, consider the TPC-H benchmark database. Say we were given the tables for the customer and orders relations. We have raw data. We don't know what each attribute corresponds to. Now, by seeing the data, we should be able to derive that customer.customer_id and orders.customer_id are join columns with high support, but not customer.customer_age and orders.customer_id. Similarly, we should find all joins possible and order them according to their support. Since checking every possible combination of attributes is very costly, we need an efficient technique.
Real use-case: I am given a raw huge dataset where columns are isolated(assume 2 tables are there for simplicity). I want to discover what are the joins possible from it with support value efficiently.
(Note: Since I don't know anything about the types of the attributes, I am thinking of treating all of them as strings and using sampling.)
I am clear that sampling is required. My questions are only the following.
What is a good sampling strategy here? What metadata should be computed to decide the sample sizes? Can the sampling of each table be done independently or should they be correlated?
One possibility would be to analyze the statistical properties of each single column. This way you would discover that customers.id and orders.cust_id have similar distributions and you would not even try to match orders.item_count against customers.age:
                    min      max      average   variance ...
orders.item_count   1        29       3.1782    ...
customers.age       18       75       38.45     ...
customers.id        17239    29115    23177     ...
orders.cust_id      17445    29037    23491     ...
Moreover, these properties can be derived from a sampling of each table, without examining the whole table. But for a proper estimation of support, you would need to do a uniform sampling, which can be as expensive as an additional full table scan. However, you would only do this if the statistical properties look promising. Your cost would be n + m + k(n*m) with a hopefully small k.
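A sketch of collecting those per-column properties from a sample (the TABLESAMPLE clause and the exact aggregate names vary by engine, and MySQL has no TABLESAMPLE at all, so treat this as illustrative):

SELECT MIN(cust_id)                              AS min_value,
       MAX(cust_id)                              AS max_value,
       AVG(cust_id)                              AS avg_value,
       COUNT(DISTINCT cust_id) * 1.0 / COUNT(*)  AS distinct_ratio
FROM orders TABLESAMPLE BERNOULLI (1);   -- roughly a 1% sample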
I know this isn't much, but perhaps it might be a start.
(This is one of many possible meanings of "significant": "the two columns refer to the same entity". Your mileage may vary.)
Update: MySQL isn't very well suited for the preliminary elaboration, where we actually use no RDBMS features at all. You might consider running the initial analysis using e.g. Hadoop.
How to go about it
One piece of data which is required is a ballpark estimate of the number of rows in tables A and B. Otherwise, things get really hairy (and performance goes out of the window).
Begin scan
At this point we have read one row and we know what fields will be in table A. We also know that table A has one billion rows, and we know (from the features of our system) that we can't afford to read more than ten million rows, nor write more than one million.
So we start reading table A, taking one row every 100 (1 bil/10 mil). We will keep one row every 1000 (1 bil/1 mil), i.e. one in ten of the rows we read (1000/100), into SampleA. While we read, we accumulate statistical data on every column, i.e. we have in memory, for every column, a list of values such as Col12_Min, Col12_Max, Col12_Sum, Col12_SumSquare, ... . We can add other heuristic parameters, such as Col12_Increasing and Col12_Decreasing: we add 1 to Col12_Increasing every time the value we read is greater than the previous one, and 1 to _Decreasing if it is less. This allows us to quickly recognize "counter" columns if the table is clustered.
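Plain SQL cannot really skip physical reads the way this scan does, but as a very rough approximation of the 1-in-1000 systematic sample (TableA and SampleA are the names used above; ROW_NUMBER/SELECT INTO are SQL Server flavoured):

SELECT *
INTO SampleA
FROM (
    SELECT a.*, ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS rn
    FROM TableA a
) x
WHERE rn % 1000 = 0;   -- keep every 1000th row; the helper rn column comes along for the ride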
The whole concept of sampling/reading one row every N requires that the table has no regular distributions at that frequency: if, for example, column 23 contains a customer ID except once every hundred rows, when it contains a zero, by reading that column with period 100 we will read all zeroes, and come to wrong conclusions. If this is the case, I'm sorry; there are too many requirements, and to satisfy them all you can't use shortcuts - you need to read every row. A complicated enough case can't be solved in a simple way.
But supposing the case is more realistic, we do the same for table B, which goes into SampleB.
When finished, we have two much smaller tables, information on the columns, and a problem.
The "free for all" join is a matrix such as
        ACol1  ACol2  ACol3  ...  AColN
BCol1     ?      ?      ?    ...    ?
BCol2     ?      ?      ?    ...    ?
BCol3     ?      ?      ?    ...    ?
...
BColM     ?      ?      ?    ...    ?
By examining the maxima, minima, and other parameters of the columns, as well as their data type, we immediately strike out all matrix cells where data types do not match, or statistical parameters are too different the ones from the other.
ACol1 ACol2 ACol3 ... AColN
BCol1 ?
BCol2 ? ?
BCol3 ? ?
...
BColM ?
But now what can we expect when we join SampleA.cust_id against SampleB.cust_id? SampleA contains only one customer every thousand. So when attempting to join SampleA and SampleB, we can expect getting no more than a 0.1% match. Given a customer ID from SampleB, the likelihood of it having been harvested in SampleA is 1/1000.
We can now run an additional check: verify whether the columns are unique or not. We will see that SampleA.cust_id is unique, while SampleB.cust_id is not. This tells us that the join, if it holds, will be one-to-many.
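That uniqueness check is cheap to run on the samples (cust_id stands for whichever column pair is being tested):

-- If every sampled value is distinct, this column is a candidate "one" side of the join.
SELECT COUNT(*)                AS sampled_rows,
       COUNT(DISTINCT cust_id) AS distinct_values
FROM SampleA;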
Supposing we know (from statistical data) that SampleA.cust_id contains numbers in the range 10000000-20000000 and holds 53000 rows, and SampleB.cust_id contains numbers in the same range and holds 29000 rows; if the two columns were not correlated but had those parameters, we would expect that generating one number at random in a range ten million wide (which we do when we extract a row from SampleB and read its cust_id) would have a probability of 53000/10000000 = 0.53% of matching a row in SampleA.
The two probabilities are different enough (always supposing we're dealing with uniform distributions) that we can try and use them to discriminate the two cases.
When we have sufficiently restricted the number of column pairs, we can run a "fake join test" by reading again the whole A (another full table scan) and verifying that all the values of SampleB.cust_id are indeed present.
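One way to phrase that verification pass (names follow the example above; without an index on TableA.cust_id this is effectively another full scan):

SELECT COUNT(*) AS unmatched
FROM SampleB b
WHERE NOT EXISTS (SELECT 1 FROM TableA a WHERE a.cust_id = b.cust_id);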
Important: if either table is incomplete, i.e. the join is not "perfect" in the original tables, this will introduce an error. If this error is large enough, then it will no longer be possible to tell two columns' relation by comparing probabilities. Also, some distributions may conspire so that the probabilities are close enough to prevent a definite answer either way. In all those cases, you need to come up with some different heuristic, based on the actual nature of the data. You can't expect to find a "universal joining algorithm".
Also: all of the above holds for one column vs one column relations. Composite keys are a different can of worms altogether, and statistical analysis, while possible, will require very different tools - BigData, and something akin to OLAP - and above all, very different (and massive) processing costs.
What assumptions can you make going in? If this is a big blob o’ data about which you can assume nothing, then you’ve really got your work cut out for you.
Matching datatypes. Integers do not join on strings; dates do not join with floats. Ditto, precision. 4 byte integers do not join with 2 byte integers; 50-max strings do not match with 128-max strings. If you’re converting everything to strings… how do you tell binary from integer from floating point from time from string to Unicode?
"Significant" statistical correlation depends on the type of relation. Stats for (One to one), (one to one-or-more), (one to zero or more), (one to zero or one) are all doable, but they don’t look like each other. Can you rule out (many-to-many) in your data?
Without scanning all the data, any sampled statistics are in doubt. Was the data was ordered by one column? What if sparse data clumped somewhere, making it appear more prevalent than it is?
Big big assumption: joins occur between single columns. If there are "compound column identifiers" (and it does happen), and you factor in order (col1 + col2 vs. col2 + col1), it gets much, much more complex than an (m * n) analysis. You did indicate that you don't know anything about this data that you picked up…
Without some reasonable starting assumptions or guesses, you just won’t get anywhere on a problem like this without serious time and effort.
You need to start with a metric of "matchedness". Let's say it is the number of values that match divided by the total number of values in both tables.
You can readily generate the SQL you need as something like:
select max(name1) as name1, max(name2) as name2,
       count(*) as totalvalues,
       sum(in1) as totalvalues_table1,
       sum(in2) as totalvalues_table2,
       sum(in1 * in2) * 1.0 / count(*) as measure   -- share of distinct values present in both tables
from (select value, max(name1) as name1, max(name2) as name2,
             max(inone) as in1, max(intwo) as in2
      from ((select 'col1' as name1, NULL as name2, col1 as value, 1 as inone, 0 as intwo from table1 t1)
            union all
            (select NULL, 'col1', col1, 0, 1 from table2 t2)
           ) tt
      group by value            -- one row per distinct value across both tables
     ) v;
The "measure" gives some indication of overlap, at the values level. You could use sum() to get rows -- but then you have a problem that some tables have many more rows than others.
You can then automatically generate the SQL for all columns using the INFORMATION_SCHEMA.COLUMNS table, which is available in most databases (all real databases have a similar method for getting metadata, even if the view/table name is different).
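For instance, the candidate column pairs to feed into the query above can be enumerated like this (table names are placeholders; you would then generate one comparison query per pair):

SELECT c1.COLUMN_NAME AS table1_column,
       c2.COLUMN_NAME AS table2_column
FROM INFORMATION_SCHEMA.COLUMNS c1
CROSS JOIN INFORMATION_SCHEMA.COLUMNS c2
WHERE c1.TABLE_NAME = 'table1'
  AND c2.TABLE_NAME = 'table2';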
However, I think you might quickly find that this problem is harder to define than you think. Play around with tables of different sizes and different amounts of overlap to see if you can come up with a good heuristic.

Do duplicate values in an index take duplicate space

I want to optimize the storage of a big table by moving the values of varchar columns out to an external lookup table (there are many duplicated values).
The process of doing it is very technical in its nature (creating a lookup table and referencing it instead of the actual value), and it sounds like it should be part of the infrastructure (SQL Server in this case, or any RDBMS).
Then I thought it should be an option of an index - do not store duplicate values,
only a reference to the duplicated value.
Can an index be optimized in such a manner - not holding duplicated values, but just a reference?
It should make the size of the table and index much smaller when there are many duplicated values.
SQL Server cannot do deduplication of column values. An index stores one row for each row of the base table. They are just sorted differently.
If you want to deduplicate you can keep a separate table that holds all possible (or actually occurring) values with a much shorter ID. You can then refer to the values by only storing their ID.
You can maintain that deduplication table in the application code or using triggers.
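A sketch of that manual deduplication (names and sizes are made up): the big table stores a small integer ID instead of repeating the varchar, and the lookup table holds each distinct value exactly once.

CREATE TABLE ValueLookup (
    ValueId INT IDENTITY(1,1) PRIMARY KEY,
    Value   VARCHAR(200) NOT NULL UNIQUE
);

CREATE TABLE BigTable (
    Id      BIGINT NOT NULL PRIMARY KEY,
    ValueId INT    NOT NULL REFERENCES ValueLookup (ValueId)
);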

PostgreSQL: performance impact of extra columns

Given a large table (10-100 million rows) what's the best way to add some extra (unindexed) columns to it?
Just add the columns.
Create a separate table for each extra column, and use joins when you want to access the extra values.
Does the answer change depending on whether the extra columns are dense (mostly not null) or sparse (mostly null)?
A column with a NULL value can be added to a row without any changes to the rest of the data page in most cases. Only one bit has to be set in the NULL bitmap. So, yes, a sparse column is much cheaper to add in most cases.
Whether it is a good idea to create a separate 1:1 table for additional columns very much depends on the use case. It is generally more expensive. For starters, there is an overhead of 28 bytes (heap tuple header plus item identifier) per row and some additional overhead per table. It is also much more expensive to JOIN rows in a query than to read them in one piece. And you need to add a primary / foreign key column plus an index on it. Splitting may be a good idea if you don't need the additional columns in most queries. Mostly it is a bad idea.
Adding a column is fast in PostgreSQL. Updating the values in the column is what may be expensive, because every UPDATE writes a new row (due to the MVCC model). Therefore, it is a good idea to update multiple columns at once.
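A minimal sketch (big_table, extra_a and extra_b are placeholder names):

-- Adding nullable columns without defaults only touches the catalog; no table rewrite.
ALTER TABLE big_table
    ADD COLUMN extra_a integer,
    ADD COLUMN extra_b text;

-- Back-filling is the expensive part: each UPDATE writes a new row version,
-- so set both columns in one pass instead of running two separate updates.
UPDATE big_table
SET extra_a = 0,
    extra_b = 'n/a';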
Database page layout in the manual.
How to calculate row sizes:
Making sense of Postgres row sizes
Calculating and saving space in PostgreSQL