Does Google BigTable support range scans?

I am a student learning about Google BigTable's design.
I am confused: each SSTable is sorted internally, but two SSTables may not be sorted relative to each other. In that case, it seems BigTable cannot do an efficient range scan on the primary key, for example "select * where id between 100 and 200"? BigTable may need to scan all the SSTables to get the result.
My understanding, then, is that an SSTable is sorted so that a single primary-key lookup can be done with a binary search within one SSTable.
Another question I have: is the MemTable sorted? If so, how? The MemTable needs to be updated frequently, and if it uses a data structure like a tree, do we then need to traverse the tree when writing the MemTable out to an SSTable?

It sounds like you've at least been through an overview of the original Bigtable paper, but here's a reference if you haven't read the whole thing; your questions can mostly be answered by a closer read: https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf
Your intuitions about Bigtable are spot on. Both the SSTables on disk and the memtable are sorted by the primary key, and any read (not just a scan) requires consulting all of them to produce a merged view. However, note that they are all sorted on the same key, so this amounts to a parallel traversal: we seek to the beginning of the range to be read in each SSTable and in the memtable, and traverse them in parallel from there.
This process is alluded to in section 5.3: "A valid read operation is executed on a merged view of the sequence of SSTables and the memtable. Since the SSTables and the memtable are lexicographically sorted data structures, the merged view can be formed efficiently."
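For intuition, here is a minimal sketch of such a merged range scan (Python; plain sorted lists stand in for SSTable files and the memtable, and the newest-wins resolution of duplicate keys between runs is omitted):

    import bisect
    import heapq

    def merged_range_scan(runs, start_key, end_key):
        # Seek each sorted run (each SSTable and the memtable) to the
        # first entry >= start_key, then traverse them in parallel.
        cursors = []
        for run in runs:
            i = bisect.bisect_left(run, (start_key,))
            cursors.append(run[i:])
        # heapq.merge performs the k-way merge that produces the
        # "merged view" in sorted key order.
        for key, value in heapq.merge(*cursors):
            if key > end_key:
                break
            yield key, value

    memtable = [(150, 'v150'), (300, 'v300')]
    sstable1 = [(100, 'v100'), (250, 'v250')]
    sstable2 = [(120, 'v120'), (180, 'v180')]
    print(list(merged_range_scan([memtable, sstable1, sstable2], 100, 200)))
    # [(100, 'v100'), (120, 'v120'), (150, 'v150'), (180, 'v180')]

The cost of the scan is therefore proportional to the amount of data in the range (plus one seek per run), not to the total size of the tablet.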
Some of these lookups can be mitigated using Bloom Filters as described in section 6 of the paper, but not all.
Of course, if there are too many SSTables in a tablet, this will still become inefficient. Section 5.4 of the paper goes into a bit more detail about how this is solved, which is to periodically merge the "stack" of SSTables for a tablet into a smaller number of new SSTables. If a particular tablet stops receiving writes, this will eventually quiesce its state down to a single file.
Regarding the efficiency of the memtable, the paper does not prescribe a particular data structure. However, suffice it to say that there are many examples of efficient in-memory sorted data structures. In addition, Section 5.4 mentions that there can actually be multiple memtables in a given tablet. By the time we scan a memtable to write it out to disk, we have already installed a new memtable at the top of the tablet "stack" and are serving the latest incoming reads from there.
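As a sketch of that idea (not the actual implementation; LevelDB, for example, uses a skip list for its memtable), here is a toy sorted memtable using only the Python standard library:

    import bisect

    class Memtable:
        # Toy sorted write buffer; real systems use a skip list or a
        # balanced tree so that writes stay O(log n).
        def __init__(self):
            self._entries = []                    # kept sorted by key

        def put(self, key, value):
            i = bisect.bisect_left(self._entries, (key,))
            if i < len(self._entries) and self._entries[i][0] == key:
                self._entries[i] = (key, value)   # overwrite in place
            else:
                self._entries.insert(i, (key, value))

        def flush(self):
            # Writing an SSTable is just an in-order walk: the entries
            # are already sorted, so no re-sort is needed, only a
            # sequential pass.
            sstable = list(self._entries)         # stand-in for a disk write
            self._entries = []
            return sstable

So the answer to the last part of the question is: yes, writing the memtable out requires one in-order pass over it, but that pass is sequential and cheap, and new writes go to a fresh memtable in the meantime.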

Related

How does postgres implement a sequential scan?

I understand that when the majority of a table is estimated to be required in the result set for a given query, a sequential scan may be preferred over using an index.
What I'm curious about is how postgres actually reads the pages into memory?
Does it organise them into some kind of ad-hoc in memory index whilst it reads them?
What if the table's too large to fit into memory?
Are there any high level papers on the topic?
(I've done some searching, but the results are full of blog posts explaining the basics of indexing, not the implementation details of a sequential scan. I expect it's not as straightforward as reading everything into an array when evaluating a join condition over most of a table.)
What I'm curious about is how postgres actually reads the pages into memory?
The engine reads the whole heap in any order while discarding rows marked as deleted. Hot blocks (already present in the cache) are much faster to process.
Does it organise them into some kind of ad-hoc in memory index whilst it reads them?
No, a sequential scan avoids indexes and reads the heap directly using buffering and the cache.
What if the table's too large to fit into memory?
A sequential scan is pipelined. This means I/O blocks are read as needed: the engine does not need to have the whole heap in memory before it starts processing it. It reads a few blocks, processes them, and discards them; then it does this again and again until it has read all the blocks of the heap.
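A sketch of what "pipelined" means here (Python; the file name, fixed-size rows, and block size are made-up simplifications of Postgres's 8 KB pages and variable-length tuples):

    def seq_scan(heap_file, block_size=8192, row_size=64):
        # Read one block, yield its rows, discard it, move on.
        # Memory use stays bounded no matter how large the heap is.
        with open(heap_file, 'rb') as f:
            while True:
                block = f.read(block_size)
                if not block:
                    break
                for off in range(0, len(block) - row_size + 1, row_size):
                    # A real engine would also skip rows marked deleted.
                    yield block[off:off + row_size]

    # Downstream operators consume rows one at a time, e.g.:
    # matches = (row for row in seq_scan('table.heap') if qualifies(row))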
Are there any high level papers on the topic?
There probably are but, in any case, any good book on query optimization will describe this process in detail.
EDIT For Your Second Question:
What I guess I mean is if you're joining on some random column X, does it have to iterate through each possible row multiple times to find the correct row for each value in the other table, or does it do something more advanced than that?
Well, when you join a couple of tables (or more), the engine's query planner produces a plan that includes a "Nested Loop", a "Hash Join", or a "Merge Join" operator. There are more operators, but these are the common ones.
The Nested Loop Join retrieves, for each row of the first table, the matching rows from the linked table. It can perform an index seek or scan on the related table (ideal) or a full table scan (not ideal).
The Hash Join hashes the secondary table first (incurring a high startup cost) and then joins fast.
The Merge Join sorts both tables by the join key (assuming an equi-join, again incurring a heavy startup cost) and then joins fast (like a zipper).
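To make the trade-off concrete, here are toy Python versions of the last two operators (rows are dicts; real engines work page-at-a-time and spill to disk, none of which is shown):

    from collections import defaultdict

    def hash_join(build, probe, build_key, probe_key):
        # Startup cost: hash the (ideally smaller) build input once...
        table = defaultdict(list)
        for row in build:
            table[row[build_key]].append(row)
        # ...then each probe row joins in O(1) expected time.
        for row in probe:
            for match in table[row[probe_key]]:
                yield match, row

    def merge_join(left, right, left_key, right_key):
        # Startup cost: both inputs must already be sorted on the join
        # key. Then a single zipper-like pass joins them (equi-join;
        # left keys assumed unique to keep the sketch short).
        it_l, it_r = iter(left), iter(right)
        l, r = next(it_l, None), next(it_r, None)
        while l is not None and r is not None:
            if l[left_key] < r[right_key]:
                l = next(it_l, None)
            elif l[left_key] > r[right_key]:
                r = next(it_r, None)
            else:
                yield l, r
                r = next(it_r, None)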

How does Bigtable serve write requests?

I'm reading google's bigtable paper. I noticed that in section 5.3, it says
Updates are committed to a commit log that stores redo records. Of these updates, the recently committed ones are stored in memory in a sorted buffer called a memtable; the older updates are stored in a sequence of SSTables.
What confuses me is that, according to this answer, an SSTable should store sorted key-value pairs. But the text quoted above gives me the feeling that both the memtable and the SSTables store update operations rather than the actual values. So what does Bigtable actually do when a write request comes in?
According to the official documentation [1]:
"A Cloud Bigtable table is sharded into blocks of contiguous rows, called tablets, to help balance the workload of queries. (Tablets are similar to HBase regions.) Tablets are stored on Colossus, Google's file system, in SSTable format. An SSTable provides a persistent, ordered immutable map from keys to values, where both keys and values are arbitrary byte strings. Each tablet is associated with a specific Cloud Bigtable node. In addition to the SSTable files, all writes are stored in Colossus's shared log as soon as they are acknowledged by Cloud Bigtable, providing increased durability."
The official documentation links to this document, where it is explained in more detail [2]:
“A "Sorted String Table" then is exactly what it sounds like, it is a file which contains a set of arbitrary, sorted key-value pairs inside. Duplicate keys are fine, there is no need for "padding" for keys or values, and keys and values are arbitrary blobs.
If we need to preserve the fast read access which SSTables give us, but we also want to support fast random writes, turns out, we already have all the necessary pieces: random writes are fast when the SSTable is in memory, that is the definition of the memtable.”
In effect, what happens during a write is that the Tablet Server (Cloud Bigtable node) generates a commit log entry describing the mutation, plus a modification to the row in the memtable. Once this memtable grows too large, the entire memtable is compacted into many immutable SSTables, partitioned by locality group (column family), and each one is then added to the respective stack of SSTables for its locality group.
Note that each SSTable does not contain the cell values for all rows in the locality group, only the most recent updates. Reads may need to group updates from one or many SSTables in the locality group to construct a response.
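Putting the pieces together, the write path can be sketched like this (Python; a toy single-locality-group version, with a local JSON-lines file standing in for the shared commit log):

    import json

    class TabletWritePath:
        def __init__(self, log_path, flush_threshold=4):
            self.log = open(log_path, 'a')
            self.memtable = {}           # row key -> (timestamp, value)
            self.sstables = []           # immutable, newest last
            self.flush_threshold = flush_threshold
            self.clock = 0

        def write(self, row_key, value):
            self.clock += 1
            # 1. Append a redo record to the commit log (durability).
            self.log.write(json.dumps([row_key, self.clock, value]) + '\n')
            self.log.flush()
            # 2. Apply the mutation to the in-memory buffer (a dict
            #    here; the real memtable is kept sorted).
            self.memtable[row_key] = (self.clock, value)
            # 3. Minor compaction: freeze and persist the memtable.
            if len(self.memtable) >= self.flush_threshold:
                self.sstables.append(sorted(self.memtable.items()))
                self.memtable = {}

Note that each flushed SSTable only contains the mutations that were sitting in that memtable, which is exactly why reads must merge across the memtable and the SSTable stack, and why the timestamps matter.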
See section "5.4 Compactions" in the paper [3] for more information on how mutations may be moved around to increase performance. Furthermore, see the heading "Locality Groups" under section "6 Refinements" for more information on the implications of using locality groups.
[1] https://cloud.google.com/bigtable/docs/overview#architecture
[2] https://www.igvita.com/2012/02/06/sstable-and-log-structured-storage-leveldb/
[3] https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf

How well does a unique hash index perform in comparison to the record ID?

Following CQRS practices, I will need to supply a custom generated ID (like a UUID) in any create command. This means when using OrientDB as storage, I won't be able to use its generated RIDs, but rather perform lookups on a manual index using the UUIDs.
Now in the OrientDB docs it states that the performance of fetching records using the RID is independent of the database size O(1), presumably because it already describes the physical location of the record. Is that also the case when using a UNIQUE_HASH_INDEX?
Is it worth bending CQRS practices to request a RID from the database when assembling the create command, or is the performance difference negligible?
I have tested the performance of record retrieval based on RIDs and indexed UUID fields using a database holding 180,000 records. For the measurement, 30,000 records were looked up, clearing the local cache between retrievals. This is the result:
RID: about 0.2s per record
UUID: about 0.3s per record
I've run these queries at several points while populating the database, in steps of 30,000 records. The retrieval time wasn't significantly influenced by the database size in either case. Don't mind the relatively high times, as this experiment was done on an overloaded PC; it's the relation between the two that is relevant.
To answer my own question: a UNIQUE_HASH_INDEX based query is close enough to RID-based queries.

How to implement a scalable, unordered collection in DynamoDB?

I am looking into implementing a scalable unordered collection of objects on top of Amazon DynamoDB. So far the following options have been considered:
Use DynamoDB document data types (map, list) and use document paths to access stand-alone items. This has one obvious drawback: the collection is limited to 400KB of data, meaning perhaps 1..10K objects depending on their size. A less obvious drawback is that the cost of inserting a new object into such a collection is going to be huge: Amazon specifies that write capacity is deducted based on the total item size, not just the newly added object, therefore ~400 capacity units for inserting a 1KB object when approaching the size limit. So I consider this ruled out.
Use a composite primary hash + range key, where the primary hash remains the same for all objects in the collection, and the range key is just something random or an atomic counter (sketched in the code after this question). The obvious drawback is that an identical hash key results in bad key distribution; cardinality is low when there are collections with a large number of objects. This means bad partitioning: all reads/writes on the same collection are stuck on one shard and become subject to the 3000 reads / 1000 writes per second limitation of a DynamoDB partition.
Use a global secondary index with a secondary hash + range key, where the hash key remains the same for all objects belonging to the same collection, and the range key is just something random or an atomic counter. Similar to the above, partitioning becomes poor for the GSI, and it will become a bottleneck, with too many identical hashes rapidly draining all the capacity provisioned for the index. I couldn't find out exactly how GSIs are implemented, so I'm not sure how badly they suffer from low cardinality.
The question is whether I could live with (2) or (3) and suffer from non-ideal key distribution, or whether there is another way of implementing a collection that I have overlooked; or perhaps I should consider looking into another NoSQL database engine altogether.
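For concreteness, option (2) amounts to something like this (a boto3 sketch; the table and attribute names are made up):

    import uuid
    import boto3

    table = boto3.resource('dynamodb').Table('collections')

    def add_member(collection_id, payload):
        # Every member of the collection shares the hash key, so the
        # whole collection lands in one partition -- the hot-key issue
        # described above.
        table.put_item(Item={
            'collection_id': collection_id,   # hash key, identical for all
            'member_id': str(uuid.uuid4()),   # range key, random
            'payload': payload,
        })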
This is a "shooting from the hip" answer, what you end up doing may depend on how much and what type of reading and writing you do.
Two things the dynamo docs encourage you to avoid are hot keys and, in general, scans. You noted that in cases (2) and (3), you end up with a hot key. If you expect this to scale (large collections), the hot key will probably hurt more and more, especially if this is a write-intensive application.
The docs on Query and Scan operations (http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/QueryAndScan.html) say that, for a query, "you must specify the hash key attribute name and value as an equality condition." So if you want to avoid scans, this might still force your hand and put you back into that hot key situation.
Maybe one route would be to embrace doing a scan operation, but just have one table devoted to your collection. Then you could just have a fully random (well distributed) hash key and do a scan every time. This assumes you always want everything from the collection (you didn't say). This will still hurt if you scale up to a large collection, but if you always want the full set back, you'll have to deal with that pain regardless. If you just want a subset, you can add a limit parameter. This would help performance, but you will always get back the same subset (or you can use the last evaluated key and keep going). The docs also mention parallel scans.
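A sketch of what that scan-based approach could look like (boto3; the table name is made up, and the segments run sequentially here even though they're meant to be handed to parallel workers):

    import boto3

    table = boto3.resource('dynamodb').Table('collection_items')

    def scan_collection(total_segments=4):
        items = []
        for segment in range(total_segments):
            kwargs = {'Segment': segment, 'TotalSegments': total_segments}
            while True:
                page = table.scan(**kwargs)
                items.extend(page['Items'])
                if 'LastEvaluatedKey' not in page:
                    break
                # Resume the segment where the previous page stopped.
                kwargs['ExclusiveStartKey'] = page['LastEvaluatedKey']
        return items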
If you are using AWS, elasticache/redis might be another route to try? The first pass might code up a lot faster/cleaner than situation (1) that you mentioned.

compression, defragmentation, reclaiming space, shrinkdatabase vs. shrinkfile

[1] states:
"When data is deleted from a heap, the data on the page is not compressed (reclaimed). And should all of the rows of a heap page are deleted, often the entire page cannot be reclaimed"
"The ALTER INDEX rebuild and reorganize options cannot be used to defragment and reclaim space in a heap (but they can used to defragment non-clustered indexes on a heap).
If you want to defragment a heap in SQL Server 2005, you have three options:
1) create a clustered index on the heap, then drop the clustered index;
2) Use SELECT INTO to copy the old table to a new table; or
3) use BCP or SSIS to move the data from the old table to a new table.
In SQL Server 2008, the ALTER TABLE command has been changed so that it now has the ability to rebuild a heap"
Please explain to me:
What are the differences between compression, (de)fragmentation, reclaiming space, shrinkfile, and shrinkdatabase in MS SQL Server 2005?
What do shrinkfile and shrinkdatabase accomplish in MS SQL Server 2005?
Update:
The question was inspired by the discussion in [2]: how to shrink a database in MS SQL Server 2005?
Update 2: @PerformanceDBA,
Congrats! You've gained over 500 points in just a week. This is remarkable!
Your diagram
Thanks, once more, for your time.
I shall ask later and not here.
Internals are not my primary preoccupation, and not the easiest one.
It is very succinct and generally does not provoke any doubts or questions.
I'd prefer some tool, descriptions/instructions, or technique around which to develop my doubts, questions, and discussion.
Please see, for example, my questions:
http://www.sqlservercentral.com/Forums/Topic1013693-373-2.aspx#bm1014385
http://www.sqlservercentral.com/Forums/Topic1013975-373-1.aspx
They are basically duplicates of what I asked but cannot discuss in stackoverflow.com
Update 3: @PerformanceDBA,
Thanks, once more. The main purpose of my questions was to determine how to resolve concrete questions (what to rely on, and what to avoid) when faced with contradictory docs, articles, discussions, answers, etc., which you helped me to detect.
Currently I do not have further (unresolved and blocking) questions in this area.
[1] Brad McGehee, "Brad's Sure Guide to Indexes" (11 June 2009), http://www.simple-talk.com/sql/database-administration/brads-sure-guide-to-indexes/
[2] Answers and feedback to the question "Shrink a database below its initial size", https://stackoverflow.com/questions/3543884/shrink-a-database-below-its-initial-size/3774639#3774639
No one touched this in over one month.
The answers to the first three are actually in the Diagram I Made for You, which you have not bothered to digest and ask questions about ... it is often used as a platform for discussion.
(That is a condensed version of my much more elaborate Sybase diagrams, which I have butchered for the MS context. There is a link at the bottom of that doc, if you want the full Sybase set.)
Therefore I am not going to spend much time on you either. And please do not ask for links to "reference sites", there ain't no such thing (what is available is non-technical rubbish), which is precisely why I have my own diagrams; there are very few people who understand MS SQL Internals.
reclaiming the space
That is the correct term. MS does not remove deleted rows from the page, or deleted pages from the extent. Reclaiming the space is an operation that goes through the Heap and removes the unused (a) rows and (b) pages. Of course that changes the RowIds, so all Nonclustered indices have to be rebuilt.
compression
In the context of the pasted text: same as Reclaiming space.
defragmentation
The operation of full-scale removal of unused space. There are three Levels:
I. Database (AllocationUnits), across all objects
II. Object (Extent & Page), Page Chains, Split Pages, Overflow Pages
III. Heap Only (No Clustered index), the subject of the post
shrinkfile
Quite a different operation: reducing the space allocated on a Device (File). This removes unused AllocationUnits (hence 'shrink'), but it is not the same as de-fragmenting AllocationUnits.
shrinkdatabase
The same operation for a Database: all AllocationUnits used by the database, across all its Devices.
Response to Comments
The poster at SSC is clueless and does not address your question directly.
there is no such thing as a Clustered table (CREATE CLUSTERED TABLE fails)
there is such a thing as a Clustered index (CREATE CLUSTERED INDEX succeeds)
as per my diagrams, it is a single physical structure; the clustered index INCLUDES the rows and thus the Heap is eliminated
where there is no Clustered index, there are two physical structures: a Heap and a separate Nonclustered Index
Now, before you go diving into them with DBCC (which is too low-level, and clueless folks cannot identify, let alone explain, the whys and wherefores), you need to understand and confirm the above:
create a Table_CI (we are intending to add a CI, there is still no such thing as a Clustered Table)
add a unique clustered index UC_PK to it
add a few rows
create a table Heap
add a unique Nonclustered index NC_PK to it
add a few rows
SELECT * FROM sysindexes WHERE id = OBJECT_ID("Table_CI")
SELECT * FROM sysindexes WHERE id = OBJECT_ID("Heap")
note that each sysindexes entry is a complete, independent, data storage structure (look at the columns)
contemplate the output
compare with my diagram
compare with the rubbish in the universe
In future, I will not answer questions about the confused rubbish in the universe, or about the incorrect and misinformed posts on other sites (I do not care if they are MS Certified Professionals; they have proved that they are incapable of inspecting their databases and determining the correct information).
There is a reason I have bothered to create accurate diagrams (the manuals, pictures, and all available info from MS are all rubbish; no use looking for accurate info from the "authority", because the "authority" is technically bankrupt).
Even Gail finally gets around to it:
I suspect you'd benefit from more reading on overall architecture of indexes before fiddling with the low level internals.
Except there isn't any; at least, none that is not confusing, non-technical, and inconsistent.
There is a reason I have bothered to create accurate diagrams.
Back to the DBCCs. Gail is simply incorrect. In a Clustered Index (which includes the rows), the single page contains rows. Yes, rows: that is the leaf level of the index. There is a B-Tree; it lives at the top of the page, but it is so small that you can't see it. Look at the sysindexes output. The root and firstpage pointer IS pointing to the page; that is the root of the Clustered Index. When you dive into the ocean, you need to know what to look for, AND where to find it, otherwise you won't find what you are looking for, and you will get distracted by the flotsam and jetsam that you do find by accident.
Now look at the TWO SEPARATE STRUCTURES for the NCI and the Heap.
Oh, and MS has changed from using the OAM terminology to the IAM, where the data structure is an index. That introduces confusion. In terms of data structures (entries in sysindexes), they are all Objects (they might or might not be Indices). The point is: who cares, we know what it is, it is an ObjectAllocationMap ... if you are looking at an NCI, gee, it is an IndexObjectAllocationMap; if you are looking at a Heap, it is a HeapObjectAllocMap. I will let you ponder what it is in the case of a CI. In chasing it down, or in using it (finding the pages that belong to the OBJECT), it does not matter: they are all Objects. When doing that, you need to know that some objects have a PageChain and others do not (another of your questions). CIs have them; NCIs and Heaps do not.
Gail Shaw: "I doubt these kinds of internals are documented anywhere. After all, we're using undocumented features. Definition of index depends who you ask and where you look.
ROTFLMAO. My sides hurt, I could not read the posts that followed it. These are supposed to be intelligent human beings? Working in the IT world? Definitions CHANGE? What, with the temperature or the time of day? And that was SQL Server Central? Not the backwoods?
When MS stole SQL Server from Sybase, the documentation was rock solid. Of course, with each major release, they "rewrite" it, and the docs get weaker and more fluffy (recall our discussion in another post). Now we have cute pictures that make people feel good but are technically inaccurate, which is why earnest people like you have problems. The pictures do not even match the text in the manuals.
Anyway, DEFINITIONS do not change. That's the definition of definitions. They are true in any context. And um, the feature you are using is an ordinary, documented feature. Has been since 1987. Except MS lost it somewhere and no one can find it. You'll have to ask a Sybase guru who was around in the old days, who remembers what exact data structures were in the code that they acquired. And if you are really lucky, he will be up to date with the differences that MoronSociety has introduced in 2000, 2005, 2008. He might even have a single accurate diagram that matches the output of sysindexes and DBCC on your box. If you find him, kiss his ring and shower him with gold. Lock up your daughters.
(not serious, my sides are killing me, the mirth is overflowing).
Now do you see why I will not answer questions about the confused rubbish in the universe ? There are just SO MANY morons out there in MoronSociety.
-----
Gail again:
"Scans:
An index scan is a complete read of all of the leaf pages in the index. When an index scan is done on the clustered index, it’s a table scan in all but name.
When an index scan is done by the query processor, it is always a full read of all of the leaf pages in the index, regardless of whether all of the rows are returned. It is never a partial scan.
A scan does not only involve reading the leaf levels of the index, the higher level pages are also read as part of the index scan."
There must be a reason she is named after a fast wind. She writes "books"? Yeah, fantasy novels. Hot air is for balloonists, not IT professionals.
Complete and total drivel. The whole point of an Index Scan, AND WHY IT IS PREFERABLE TO A TABLE SCAN (because it is trying to AVOID A TABLE SCAN), is that:
- the engine (executing the query tree) can go directly to the Index (Clustered or Nonclustered, at this point)
- navigate the B-Tree to find the place to start (which, up to this point, is much the same as when it is getting a few rows, i.e. not scanning)
- the B-Tree (from any good TECHNICAL diagram) is a few pages, containing many, many index entries per page, so it is very fast
- that's the root plus non-leaf levels
- until it finds a leaf-level entry that qualifies
- from that point on, it does a SCAN, sequentially, through the LEAF level of said index (fat blue arrow)
now for NCIs, if you remember your homework, that means the leaf level pages are full of index_leaf_level_entry + CI_key
so it is scanning sequentially across the NCI Leaf level (that's why there is a PageChain only at the leaf level of NCIs, so that it can navigate across)
but jumping all over the place on the HEAP, to get the data rows
but for a CI, the leaf level IS the data row (data pages, with only data rows, that's why you cannot see an "index" in them; the non-leaf-level CI pages are pure index pages containing index_entries only)
so when it SCANS the index leaf_level sequentially, using the PageChain, it is SCANNING the data sequentially, they are the same operation (fat green arrow)
no Heap
no jumping around
For comparison, then, a TABLE SCAN (MS Only):
- has no PageChain on the Heap
- has no choice, but to start at the beginning
- and read every data page
- of which many will be fragmented (contain unused space left by deleted or forwarded rows)
- and others will be completely empty
The whole intent is: the optimiser has already decided not to go for a table (heap) scan, because it could go for an Index Scan instead (it requires LESS than the full range of data, and it can find the starting point of that data via some index). If you look at your SHOWPLAN, even for retrieving a single unique PK row, it says "INDEX SCAN". All that means is that it will navigate the B-Tree first, to find at least one row. And then it may scan the leaf level until it finds an end point. If it is a covered query, it never goes to the data rows.
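If it helps, the difference between the two access paths can be sketched in a few lines (Python; a sorted list stands in for the page-chained leaf level, an unsorted list for the Heap):

    import bisect

    def index_scan(leaf_level, start_key, end_key):
        # Navigate to the start point (the B-Tree descent), then read
        # the leaf level sequentially until past the end of the range.
        i = bisect.bisect_left(leaf_level, (start_key,))
        for key, row in leaf_level[i:]:
            if key > end_key:
                break
            yield row

    def table_scan(heap, start_key, end_key):
        # No ordering and no PageChain: start at the beginning, read
        # every page, keep only the rows that qualify.
        for key, row in heap:
            if start_key <= key <= end_key:
                yield row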
There is no substitute for a Clustered Index.