HBase multiple column families performance - apache

I have 2 HBase tables - one with a single column family, and other has 4 column families. Both tables are keyed by same rowkey, and the column families all have a single column qualifier each, with a json string as value (each json payload is about 10-20K in size). All column families use fast-diff encoding and gzip compression.
After loading about 60MM rows to each table, a scan test on any single column family in 2nd table takes 4x the time to scan the single column family from 1st table. Note that the scan on 2nd table uses addFamily to limit scan to only 1 column family, and both tests scan 1MM rows exactly - so the net workload (and hence performance expectation) should be the same in both cases. However, tests show 4x time on any column family in 2nd table vs 1st table. Performance did not change much even after running a major compaction on both tables.
Though HBase doc and other tech forums recommend not using more than 1 column family per table, nothing I have read so far suggests scan performance will linearly degrade based on number of column families. Has anyone else experienced this, and is there a simple explanation for this?
To note, the reason second table has 4 column families is even though I only scan one column family at a time now, there are requirements to scan multiple column families from that table given a set of rowkeys.
Thanks for any insight into the performance question.

That's a normal behavior, if I've got your situation right.
Since each column family represents a separate Store on RegionServer, accessing multiple stores takes more time.
You can limit your scan to specific column families, use
addFamily on your scan object.

Related

Multiple tables vs one table with more columns

My chosen database is MongoDB. But the question should be independent.
So for example, each row of record will have a flag that can take 1 of 2 possible values.
What is the pro and con of:
Having 1 table with a column to hold the value of this flag.
versus:
the pro and con of:
Having 2 tables to hold the two different types of records distinguished by the aforementioned flag?
Would this be cheaper in terms of storage, since you don't have that extra column?
Would this also be faster in queries, since you know exactly which table to look without having to perform a filter?
What is the common practice in industry?
Storage for a single column holding just a flag (e.g. active and archived) should be negligible. The query could be faster with two tables, however your application is more complex, you have to write 2 queries.
When you have only 2 distinct values and these values are more or less evenly distributed, then an index will not be used, thus the performance should be equal - unless you select the entire table.
It might be useful to have 2 tables if the flags are not evenly distributed. For example you have a rather small active data set which is queried frequently, and a big archive data set which is much bigger but hardly queried.
If available, you can also work with partitions which is actually a good combination of both.

How to store large currency like data in database

My data is similar to currency in many aspects so I will use it for demonstration.
I have 10-15 different groups of data, we can say different currencies like Dollar or Euro.
They need to have these columns:
timestamp INT PRIMARY KEY
value INT
Each of them will have more than 1 billion rows and i will append new rows as time passes.
I will just select them in some intervals and create graphs. Probably multiple currency in same graph.
Question is should I add a group column and store all in one table or leave it separately. If they are in same column timestamp will not be unique anymore and probably I should use advanced SQL techniques to make it efficient.
10 - 15 "currencies"? 1 billion rows each? Consider list partitioning in Postgres 11 or later. This way, the timestamp column stays unique per partition. (Although I am not sure why that is a necessity.)
Or simply have 10 - 15 separate tables without storing the "currency" redundantly per row. Size matters with this many rows.
Or, if you typically have multiple values (one for each "currency") for the same timestamp, you might use a single table with 10-15 dedicated "currency" columns. Much smaller overall, as it saves the tuple overhead for each "currency" (28 bytes per row or more). See:
Making sense of Postgres row sizes
The practicality of a single row for multiple "currencies" depends on detailed specs. For example: might not work so well for many updates on individual values.
You added:
I have read clustered indexes which orders data in physical order in disk. I will not insert new rows in middle of table
That seems like a perfect use case for BRIN indexes, which are dramatically smaller than their B-tree relatives. Typically a bit slower, but with your setup maybe even faster. Related:
How do I improve date-based query performance on a large table?

Why does Redshift need to do a full table scan to find the max value of the DIST/SORT key?

I'm doing simple tests on Redshift to try and speed up the insertion of data into a Redshift table. One thing I noticed today is that doing something like this
CREATE TABLE a (x int) DISTSTYLE key DISTKEY (x) SORTKEY (x);
INSERT INTO a (x) VALUES (1), (2), (3), (4);
VACUUM a; ANALYZE a;
EXPLAIN SELECT MAX(x) FROM a;
yields
QUERY PLAN
XN Aggregate (cost=0.05..0.05 rows=1 width=4)
-> XN Seq Scan on a (cost=0.00..0.04 rows=4 width=4)
I know this is only 4 rows, but it still shouldn't be doing a full table scan to find the max value of a pre-sorted column. Isn't that metadata included in the work done by ANALYZE?
And just as a sanity check, the EXPLAIN for SELECT x FROM a WHERE x > 3 only scans 2 rows instead of the whole table.
Edit: I inserted 1,000,000 more rows into the table with random values from 1 to 10,000. Did a vacuum and analyze. The query plan still says it has to scan all 1,000,004 rows.
Analyzing query plans in a tiny data set does not yield any practical insight on how the database would perform a query.
The optimizer has thresholds and when the cost difference between different plans is small enough it stops considering alternative plans. The idea is that for simple queries, the time spent searching for the "perfect" execution plan, can possibly exceed the total execution time of a less optimal plan.
Redshift has been developed on the code for ParAccel DB. ParAccel has literally hundreds of parameters that can be changed/adjusted to optimize the database for different workloads/situations.
Since Redshift is a "managed" offering, it has these settings preset at levels deemed optimal by Amazon engineers given an "expected" workload.
In general, Redshift and ParAccel are not that great for single slice queries. These queries tend to be run in all slices anyway, even if they are only going to find data in a single slice.
Once a query is executing in a slice, the minimum amount of data read is a block. Depending on block size this can mean hundreds of thousand rows.
Remember, Redshift does not have indexes. So you are not going to have a simple record lookup that will read a few entries off an index and then go laser focused on a single page on the disk. It will always read at least an entire block for that table, and it will do that in every slice.
How to have a meaningful data set to be able to evaluate a query plan?
The short answer is that your table would have a "large number" of data blocks per slice.
How many blocks is per slice is my table going to require? The answer depends on several factors:
Number of nodes in your cluster
Type of node in the cluster - Number of slices per node
Data Type - How many bytes each value requires.
The type of compression encoding for the column involved in the
query. The optimal encoding depends on data demographics
So let's start at the top.
Redshift is an MPP Database, where processing is spread accross multiple nodes. See Redshift's architecture here.
Each node is further sub-divided in slices, which are dedicated data partitions and corresponding hardware resources to process queries on that partition of the data.
When a table is created in Redshift, and data is inserted, Redshift will allocate a minimum of one block per slice.
Here is a simple example:
If you created a cluster with two ds1.8xlarge nodes, you would have 16 slices per node times two nodes for a total of 32 slices.
Let's say we are querying and column in the WHERE clause is something like "ITEM_COUNT" an integer. An integer consumes 4 bytes.
Redshift uses a block size of 1MB.
So in this scenario, your ITEM_COUNT column would have available to it a minimum of 32 blocks times block size of 1MB which would equate to 32MB of storage.
If you have 32MB of storage and each entry only consumes 4 bytes, you can have more than 8 million entries, and they could all fit inside of a single block.
In this example in the Amazon Redshift documentation they load close to 40 million rows to evaluate and compare different encoding techniques. Read it here.
But wait.....
There is compression, if you have a 75% compression rate, that would mean that even 32 million records would still be able to fit into that single block.
What is the bottom line?
In order to analyze your query plan you would need tables, columns that have several blocks. In our example above 32 milion rows would still be a single block.
This means that in the configuration above, with all the assumptions, a table with a single record would basically most likely have the same query plan as a table with 32 million records, because, in both cases the database only needs to read a single block per slice.
If you want to understand how your data is distributed across slices and how many blocks are being used you can use the queries below:
How many rows per slice:
Select trim(name) as table_name, id, slice, sorted_rows, rows
from stv_tbl_perm
where name like '<<your-tablename>>'
order by slice;
How to count how many blocks:
select trim(name) as table_name, col, b.slice, b.num_values, count(b.slice)
from stv_tbl_perm a, stv_blocklist b
where a.id = b.tbl
and a.slice = b.slice
and name like '<<your-tablename>>'
group by 1,2,3,4
order by col, slice;

How to increase scan speed in Hbase

I am new to Apache Hbase and I am using hbase-0.98.13 and I have created a table sample with column family sample_family. And I have loaded the output from pig script to hbase table. when I try to scan the table based on one of the column in column family it takes more than 2 minutes.
Here is the query
scan 'sample', {FILTER=>"SingleColumnValueFilter('sample_family','id',=,'binary:1000')"}
Can any one tell me how to bring this process in one or two seconds?
Is there any configuration changes to be made for this? Can any one help me in this?
There's no silver bullet to make a search in HBase fast.
A scan in your example has to iterate over all the rows in a table, that's why it takes significant time on large tables. And there are no secondary indices in HBase that help to improve a search by specific columns.
The most effective way to improve scans perfomance is to have properly designed row keys. HBase internally keeps rows sorted by row keys, and you can specify start and end rows for a scan. So it's crucial to have row keys designed for search by the most frequent criteria. In your question you search by column id where a value is 1000. You could put this id into the row key (however, you have to make sure you avoid regions hotspotting).

How does database indexing work? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 12 months ago.
The community reviewed whether to reopen this question yesterday and left it closed:
Original close reason(s) were not resolved
Improve this question
Given that indexing is so important as your data set increases in size, can someone explain how indexing works at a database-agnostic level?
For information on queries to index a field, check out How do I index a database column.
Why is it needed?
When data is stored on disk-based storage devices, it is stored as blocks of data. These blocks are accessed in their entirety, making them the atomic disk access operation. Disk blocks are structured in much the same way as linked lists; both contain a section for data, a pointer to the location of the next node (or block), and both need not be stored contiguously.
Due to the fact that a number of records can only be sorted on one field, we can state that searching on a field that isn’t sorted requires a Linear Search which requires (N+1)/2 block accesses (on average), where N is the number of blocks that the table spans. If that field is a non-key field (i.e. doesn’t contain unique entries) then the entire tablespace must be searched at N block accesses.
Whereas with a sorted field, a Binary Search may be used, which has log2 N block accesses. Also since the data is sorted given a non-key field, the rest of the table doesn’t need to be searched for duplicate values, once a higher value is found. Thus the performance increase is substantial.
What is indexing?
Indexing is a way of sorting a number of records on multiple fields. Creating an index on a field in a table creates another data structure which holds the field value, and a pointer to the record it relates to. This index structure is then sorted, allowing Binary Searches to be performed on it.
The downside to indexing is that these indices require additional space on the disk since the indices are stored together in a table using the MyISAM engine, this file can quickly reach the size limits of the underlying file system if many fields within the same table are indexed.
How does it work?
Firstly, let’s outline a sample database table schema;
Field name Data type Size on disk
id (Primary key) Unsigned INT 4 bytes
firstName Char(50) 50 bytes
lastName Char(50) 50 bytes
emailAddress Char(100) 100 bytes
Note: char was used in place of varchar to allow for an accurate size on disk value.
This sample database contains five million rows and is unindexed. The performance of several queries will now be analyzed. These are a query using the id (a sorted key field) and one using the firstName (a non-key unsorted field).
Example 1 - sorted vs unsorted fields
Given our sample database of r = 5,000,000 records of a fixed size giving a record length of R = 204 bytes and they are stored in a table using the MyISAM engine which is using the default block size B = 1,024 bytes. The blocking factor of the table would be bfr = (B/R) = 1024/204 = 5 records per disk block. The total number of blocks required to hold the table is N = (r/bfr) = 5000000/5 = 1,000,000 blocks.
A linear search on the id field would require an average of N/2 = 500,000 block accesses to find a value, given that the id field is a key field. But since the id field is also sorted, a binary search can be conducted requiring an average of log2 1000000 = 19.93 = 20 block accesses. Instantly we can see this is a drastic improvement.
Now the firstName field is neither sorted nor a key field, so a binary search is impossible, nor are the values unique, and thus the table will require searching to the end for an exact N = 1,000,000 block accesses. It is this situation that indexing aims to correct.
Given that an index record contains only the indexed field and a pointer to the original record, it stands to reason that it will be smaller than the multi-field record that it points to. So the index itself requires fewer disk blocks than the original table, which therefore requires fewer block accesses to iterate through. The schema for an index on the firstName field is outlined below;
Field name Data type Size on disk
firstName Char(50) 50 bytes
(record pointer) Special 4 bytes
Note: Pointers in MySQL are 2, 3, 4 or 5 bytes in length depending on the size of the table.
Example 2 - indexing
Given our sample database of r = 5,000,000 records with an index record length of R = 54 bytes and using the default block size B = 1,024 bytes. The blocking factor of the index would be bfr = (B/R) = 1024/54 = 18 records per disk block. The total number of blocks required to hold the index is N = (r/bfr) = 5000000/18 = 277,778 blocks.
Now a search using the firstName field can utilize the index to increase performance. This allows for a binary search of the index with an average of log2 277778 = 18.08 = 19 block accesses. To find the address of the actual record, which requires a further block access to read, bringing the total to 19 + 1 = 20 block accesses, a far cry from the 1,000,000 block accesses required to find a firstName match in the non-indexed table.
When should it be used?
Given that creating an index requires additional disk space (277,778 blocks extra from the above example, a ~28% increase), and that too many indices can cause issues arising from the file systems size limits, careful thought must be used to select the correct fields to index.
Since indices are only used to speed up the searching for a matching field within the records, it stands to reason that indexing fields used only for output would be simply a waste of disk space and processing time when doing an insert or delete operation, and thus should be avoided. Also given the nature of a binary search, the cardinality or uniqueness of the data is important. Indexing on a field with a cardinality of 2 would split the data in half, whereas a cardinality of 1,000 would return approximately 1,000 records. With such a low cardinality the effectiveness is reduced to a linear sort, and the query optimizer will avoid using the index if the cardinality is less than 30% of the record number, effectively making the index a waste of space.
Classic example "Index in Books"
Consider a "Book" of 1000 pages, divided by 10 Chapters, each section with 100 pages.
Simple, huh?
Now, imagine you want to find a particular Chapter that contains a word "Alchemist". Without an index page, you have no other option than scanning through the entire book/Chapters. i.e: 1000 pages.
This analogy is known as "Full Table Scan" in database world.
But with an index page, you know where to go! And more, to lookup any particular Chapter that matters, you just need to look over the index page, again and again, every time. After finding the matching index you can efficiently jump to that chapter by skipping the rest.
But then, in addition to actual 1000 pages, you will need another ~10 pages to show the indices, so totally 1010 pages.
Thus, the index is a separate section that stores values of indexed
column + pointer to the indexed row in a sorted order for efficient
look-ups.
Things are simple in schools, isn't it? :P
An index is just a data structure that makes the searching faster for a specific column in a database. This structure is usually a b-tree or a hash table but it can be any other logic structure.
The first time I read this it was very helpful to me. Thank you.
Since then I gained some insight about the downside of creating indexes:
if you write into a table (UPDATE or INSERT) with one index, you have actually two writing operations in the file system. One for the table data and another one for the index data (and the resorting of it (and - if clustered - the resorting of the table data)). If table and index are located on the same hard disk this costs more time. Thus a table without an index (a heap) , would allow for quicker write operations. (if you had two indexes you would end up with three write operations, and so on)
However, defining two different locations on two different hard disks for index data and table data can decrease/eliminate the problem of increased cost of time. This requires definition of additional file groups with according files on the desired hard disks and definition of table/index location as desired.
Another problem with indexes is their fragmentation over time as data is inserted. REORGANIZE helps, you must write routines to have it done.
In certain scenarios a heap is more helpful than a table with indexes,
e.g:- If you have lots of rivalling writes but only one nightly read outside business hours for reporting.
Also, a differentiation between clustered and non-clustered indexes is rather important.
Helped me:- What do Clustered and Non clustered index actually mean?
Now, let’s say that we want to run a query to find all the details of any employees who are named ‘Abc’?
SELECT * FROM Employee
WHERE Employee_Name = 'Abc'
What would happen without an index?
Database software would literally have to look at every single row in the Employee table to see if the Employee_Name for that row is ‘Abc’. And, because we want every row with the name ‘Abc’ inside it, we can not just stop looking once we find just one row with the name ‘Abc’, because there could be other rows with the name Abc. So, every row up until the last row must be searched – which means thousands of rows in this scenario will have to be examined by the database to find the rows with the name ‘Abc’. This is what is called a full table scan
How a database index can help performance
The whole point of having an index is to speed up search queries by essentially cutting down the number of records/rows in a table that need to be examined. An index is a data structure (most commonly a B- tree) that stores the values for a specific column in a table.
How does B-trees index work?
The reason B- trees are the most popular data structure for indexes is due to the fact that they are time efficient – because look-ups, deletions, and insertions can all be done in logarithmic time. And, another major reason B- trees are more commonly used is because the data that is stored inside the B- tree can be sorted. The RDBMS typically determines which data structure is actually used for an index. But, in some scenarios with certain RDBMS’s, you can actually specify which data structure you want your database to use when you create the index itself.
How does a hash table index work?
The reason hash indexes are used is because hash tables are extremely efficient when it comes to just looking up values. So, queries that compare for equality to a string can retrieve values very fast if they use a hash index.
For instance, the query we discussed earlier could benefit from a hash index created on the Employee_Name column. The way a hash index would work is that the column value will be the key into the hash table and the actual value mapped to that key would just be a pointer to the row data in the table. Since a hash table is basically an associative array, a typical entry would look something like “Abc => 0x28939″, where 0x28939 is a reference to the table row where Abc is stored in memory. Looking up a value like “Abc” in a hash table index and getting back a reference to the row in memory is obviously a lot faster than scanning the table to find all the rows with a value of “Abc” in the Employee_Name column.
The disadvantages of a hash index
Hash tables are not sorted data structures, and there are many types of queries which hash indexes can not even help with. For instance, suppose you want to find out all of the employees who are less than 40 years old. How could you do that with a hash table index? Well, it’s not possible because a hash table is only good for looking up key value pairs – which means queries that check for equality
What exactly is inside a database index?
So, now you know that a database index is created on a column in a table, and that the index stores the values in that specific column. But, it is important to understand that a database index does not store the values in the other columns of the same table. For example, if we create an index on the Employee_Name column, this means that the Employee_Age and Employee_Address column values are not also stored in the index. If we did just store all the other columns in the index, then it would be just like creating another copy of the entire table – which would take up way too much space and would be very inefficient.
How does a database know when to use an index?
When a query like “SELECT * FROM Employee WHERE Employee_Name = ‘Abc’ ” is run, the database will check to see if there is an index on the column(s) being queried. Assuming the Employee_Name column does have an index created on it, the database will have to decide whether it actually makes sense to use the index to find the values being searched – because there are some scenarios where it is actually less efficient to use the database index, and more efficient just to scan the entire table.
What is the cost of having a database index?
It takes up space – and the larger your table, the larger your index. Another performance hit with indexes is the fact that whenever you add, delete, or update rows in the corresponding table, the same operations will have to be done to your index. Remember that an index needs to contain the same up to the minute data as whatever is in the table column(s) that the index covers.
As a general rule, an index should only be created on a table if the data in the indexed column will be queried frequently.
See also
What columns generally make good indexes?
How do database indexes work
Simple Description!
The index is nothing but a data structure that stores the values for a specific column in a table. An index is created on a column of a table.
Example: We have a database table called User with three columns – Name, Age and Address. Assume that the User table has thousands of rows.
Now, let’s say that we want to run a query to find all the details of any users who are named 'John'.
If we run the following query:
SELECT * FROM User
WHERE Name = 'John'
The database software would literally have to look at every single row in the User table to see if the Name for that row is ‘John’. This will take a long time.
This is where index helps us: index is used to speed up search queries by essentially cutting down the number of records/rows in a table that needs to be examined.
How to create an index:
CREATE INDEX name_index
ON User (Name)
An index consists of column values(Eg: John) from one table, and those values are stored in a data structure.
So now the database will use the index to find employees named John
because the index will presumably be sorted alphabetically by the
Users name. And, because it is sorted, it means searching for a name
is a lot faster because all names starting with a “J” will be right
next to each other in the index!
Just think of Database Index as Index of a book.
If you have a book about dogs and you want to find an information about let's say, German Shepherds, you could of course flip through all the pages of the book and find what you are looking for - but this of course is time consuming and not very fast.
Another option is that, you could just go to the Index section of the book and then find what you are looking for by using the Name of the entity you are looking ( in this instance, German Shepherds) and also looking at the page number to quickly find what you are looking for.
In Database, the page number is referred to as a pointer which directs the database to the address on the disk where entity is located. Using the same German Shepherd analogy, we could have something like this (“German Shepherd”, 0x77129) where 0x77129 is the address on the disk where the row data for German Shepherd is stored.
In short, an index is a data structure that stores the values for a specific column in a table so as to speed up query search.