This may be a silly question, but I couldn't find out how to "index" the row key in HBase. So I am assuming that when HBase stores a row key, it has built-in support to automatically index the table based on that row key; in other words, it treats the row key as a primary key automatically?
Thanks!
The table is not just indexed by the key; it is actually lexicographically ordered by the key. That is, HBase knows which region server holds each key, and within that region server, which region and which specific HFile. The data written to the HFile is ordered by the key.
The lexicographic ordering also means you can retrieve data by partial key: e.g. a scan for "a" will get everything that starts with "a". This is often used to put multiple dimensions into the key; e.g. you can set the key to country followed by city, so you can get aggregates per country and then drill down by city efficiently.
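HBase itself has no SQL layer, but as a conceptual sketch only (the table and column names here are made up), a prefix scan over a lexicographically ordered key is just a range predicate:

-- Conceptual SQL sketch, not actual HBase syntax:
-- a prefix scan for 'a' is one contiguous range over the ordered key.
SELECT *
FROM   hbase_table          -- hypothetical name
WHERE  row_key >= 'a'       -- start of the prefix range
  AND  row_key <  'b';      -- first key after the prefix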
Yes, the tables are ordered by the row key. Clients can look up which region servers hold which ranges of row keys, allowing them to connect directly to the region server that contains the row key they need. Furthermore, since the keys are ordered byte arrays, the region server can do a binary search to retrieve a row from the rows it holds. This makes random retrieval very efficient, and it makes scanning contiguous rows very efficient.
While refactoring the code for a project I am currently working on, I have been wondering whether a certain kind of entity insert/update into an SQL database could be done more elegantly.
TL;DR of the module I am working on: a large amount of data is regularly synchronized from a weakly structured data store (Google Sheets) into a relational database. The source store is database-agnostic and contains no database IDs or foreign keys. The database (EF Core, code first) uses standard incremental integer IDs and foreign keys.
My problem: the sheets in the Google Sheets document largely resemble the database tables. Related entities are referenced by a (by design) unique string value ("Name"). These natural names need to be resolved to the artificial foreign keys when inserting into and updating the database. So far, this process is done by loading the IDs and names of all related entities into memory, finding the matching entity, and then saving the dependent entity with the found foreign keys to the database.
While this process works well, it requires either many more database round trips (if the related entities are looked up one by one) or a decent amount of memory to keep all [Name, Id] pairs in memory, and it also renders the database index on the name column useless, as the searching is done in memory.
As an example, an excerpt from my data model:
Table Items, with columns
Id (Primary Key, unique, clustered index)
Name (String, unique, indexed)
Table Locations, with columns
Id (Primary Key, unique, clustered index)
Name (String, unique, indexed)
Table PlacedItems (specific Items placed at a specific Location), with columns:
Id (Primary Key, unique, clustered index)
LocationId (Foreign Key referencing Locations, indexed)
ItemId (Foreign Key referencing Items, indexed)
The Google Sheet which I import has the following sheets:
Sheet items, with column "Name" (string)
Sheet locations, with column "Name" (string)
Sheet placed_items, with columns "LocationName" (string) and "ItemName" (string)
My question:
Assuming, all Items and Locations have been imported, and have been given auto-incremented IDs, how do I most elegantly / efficiently insert all the Placed Items?
The most intriguing kind of solution would be one that resolves the names to IDs entirely on the database side, in a single round trip, without having to load any item/location data into memory.
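For illustration, the shape of solution I am imagining is a single set-based statement along these lines (just a sketch; it assumes the sheet rows have first been bulk-loaded into a hypothetical staging table PlacedItemsStaging):

-- Sketch: resolve both names to IDs and insert in one round trip.
-- PlacedItemsStaging(LocationName, ItemName) is an assumed staging table.
INSERT INTO PlacedItems (LocationId, ItemId)
SELECT l.Id, i.Id
FROM   PlacedItemsStaging s
JOIN   Locations l ON l.Name = s.LocationName
JOIN   Items     i ON i.Name = s.ItemName;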
To my naïve self it seems like this is the type of problem that I am not the first to try to solve, and that a smart solution already exists which I am just not seeing yet.
Additional points
I'd like to stick with the artificial IDs in the database. While the strings are unique, they are values designed to be edited by humans, so readability is the prime concern for these strings, not the kind of efficiency the DB-generated IDs presumably offer. When doing reads and any further operations on the data, the IDs are used as the primary source of identity and uniqueness of records.
I don't expect Entity Framework Core to have a solution to this problem (but surprise me if it has!), so I am also very interested in solutions using raw SQL, stored procedures, and similar.
What exactly does the SORT statement without a key specification do when run on a standard internal table? As per the documentation:
If no explicit sort key is entered using the addition BY, the internal table itab is sorted by the primary table key. The priority of the sort is based on the order in which the key fields are specified in the table definition. In standard keys, the sort is prioritized according to the order of the key fields in the row type of the table. If the primary table key of a standard table is empty, no sort takes place. If this is known statically, the syntax check produces a warning.
With the primary table key being defined as:
Each internal table has a primary table key that is either a self-defined key or the standard key. For hashed tables, the primary key is a hash key, for sorted tables, the primary key is a sorted key. Both of these table types are key tables for which key access is optimized and the primary key thus has its own administration. The key fields of these tables are write-protected when you access individual rows. Standard tables also have a primary key, but the corresponding access is not optimized, there is no separate key administration, and the key fields are not write-protected.
And for good measure, the standard key is defined as:
Primary table key of an internal table, whose key fields in a structured row type are all table fields with character-like data types and byte-like data types. If the row type contains substructures, these are broken down into elementary components. The standard key for non-structured row types is the entire table row if the row type itself is not a table type. If there are no corresponding table fields, or the row type itself is a table type, the standard key from standard tables is empty or contains no key fields.
All of which mainly just confuses me, as I'm not sure whether I can really rely on the basic SORT statement to produce a reliable or safe result. Should I just avoid it in all situations, or does it have a purpose if used properly?
By extension, if I want to run DELETE ADJACENT DUPLICATES FROM itab COMPARING ALL FIELDS, when would it be safe to do so after a simple SORT itab.? Only if I added a key on all fields? Without an explicit key, only if I have an internal table with clike and xsequence columns? If I want to execute that DELETE statement, what is the optimal SORT statement to run on the internal table?
SORT without BY should be avoided in all situations, because it "makes the program difficult to understand and possibly unpredictable" (as the ABAP documentation puts it). I think that if you don't mention BY, a static check in the Code Inspector produces a warning. You should use SORT itab BY table_line, where table_line is a special name (a "pseudo-component") meaning "all fields of the line".
Not your question, but you may also define the internal table with primary and secondary keys, so that you don't need to sort explicitly; DELETE ADJACENT DUPLICATES can be used with any of those keys.
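As a rough cross-language analogy only (SQL, not ABAP; the table name is invented): sorting by the entire line and then deleting adjacent duplicates comparing all fields leaves one copy of each distinct row, much like:

-- Rough SQL analogy, not ABAP: keep one copy of each fully distinct row.
SELECT DISTINCT * FROM itab_equivalent;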
Internal tables can have keys that are either inherited from the structures the itab is based on or specified explicitly. As the documentation says, SORT without BY sorts by the primary key, and that is safe assuming the internal table is defined correctly.
I think this feature is designed as a dynamic feature to be used with smart table key design. If done correctly, SORT without BY can let your program adapt to table key changes in the future (so if your key changes, the sort changes with it). Problems might arise when the key is modified in an odd way.
As a rule of thumb:
The more specific your program code is, the less prone to errors (and safer) it is.
So sorting BY key_id key_date will always produce the same sort by those two fields.
Dynamic components in an application make it more flexible, but they tend to produce (often hard-to-notice) bugs when the things they rely on are modified.
So if you take the previous example with two key fields and add one in the middle (say key_is_active between the two existing fields), the sorting results might change in a way you did not expect.
If you had an algorithm that processes based on date, your algorithm might be broken by that change.
In your particular case with DELETE ADJACENT DUPLICATES, I would follow Sandra Rossi's advice.
I need to understand how one can search an attribute of a DynamoDB item when that attribute is part of an array.
So, in denormalising a table, say a person that has many email addresses, I would create an array in the person table to store the email addresses.
Now, as the email address is not part of the sort key, if I need to search on an email address to find the person record, I need to index the email attribute.
Can I create an index on the email address, which has a one-to-many relationship with the person record and, as I understand it, is stored as an array in DynamoDB?
Would this secondary index be global or local, assuming I have billions of person records?
If I could create it as either LSI or GSI, please explain the pros/cons of each.
Thank you very much!
It's worth getting the terminology right to start with. DynamoDB's supported data types are:
Scalar - String, number, binary, boolean
Document - List, Map
Sets - String Set, Number Set, Binary Set
I think you are suggesting you have an attribute that contains a list of emails. The attribute might look like this:
Emails: ["one#email.com", "two#email.com", "three#email.com"]
There are a couple of relevant points about key attributes described here. Firstly, keys must be top-level attributes (they can't be nested in JSON documents). Secondly, they must be of scalar types (i.e. String, Number, or Binary).
As your list of emails is not a scalar type, you cannot use it in a key or index.
Given this schema, you would have to perform a scan, in which you would set the FilterExpression on your Emails attribute using the CONTAINS operator.
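For example, using PartiQL for DynamoDB (the table name "Persons" is assumed; note this still reads the whole table rather than using an index):

-- PartiQL for DynamoDB: contains() filters during a full table scan.
SELECT * FROM "Persons" WHERE contains("Emails", 'one#email.com')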
Stu's answer has some great information in it, and he is right: you can't use an array itself as a key.
What you CAN sometimes do is concatenate several variables (or an array) into a single string with a known separator (maybe '_', for example), and then use that string as a sort key.
I used this concept to create a composite sort key that consisted of multiple ISO 8601 date values (DynamoDB stores dates as ISO 8601 strings in String type attributes). I also used several attributes that were not dates but were integers with a fixed character length.
By using the BETWEEN comparison, I am able to individually query each of the variables concatenated into the sort key, or construct a complex query that matches against all of them as a group.
In other words a data object could use a Sort Key like this:
email#gmail.com_email#msn.com_email#someotherplace.com
Then you could query that (assuming you knew the partition key) with pseudo-SQL along these lines:
SELECT * FROM Users
WHERE User='Bob' AND Emails LIKE '%email#msn.com%'
YOU MUST know the partition key in order to perform a Query, no matter what you choose as your sort key and no matter how that sort key is constructed.
I think the real question you are asking is: what should my sort keys and partition keys be? That will depend on exactly which queries you want to make and how frequently each type of query is used.
I have found that I have way more success with DynamoDB if I think about the queries I want to make first, and then go from there.
A word on Secondary Indexes (GSI / LSI)
The issue here is that you still need to 'know' the partition key for your secondary data structure. GSIs and LSIs help you avoid needing to create additional DynamoDB tables for the sole purpose of improving data access.
From Amazon:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/SecondaryIndexes.html
To me it sounds more like the issue is selecting the Keys.
LSI (Local Secondary Index)
If (for your query case) you don't know the partition key to begin with (as it seems you don't), then a local secondary index won't help, since it has the SAME partition key as the base table.
GSI (Global Secondary Index)
A Global Secondary Index could help in that you can have a DIFFERENT Partition Key and Sort Key (presumably a partition key that you could 'know' for this query).
So you could use the email attribute (perhaps composite) as the sort key on your GSI, and then something like a service name or sign-up stage as your partition key. This would let you 'know' what partition that user would be in, based on their progress or the service they signed up from (for example).
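A sketch of what querying such a GSI could look like in PartiQL (the index and attribute names here are invented for the example):

-- Hypothetical GSI "EmailsByStage": partition key SignupStage, sort key Email.
SELECT * FROM "Persons"."EmailsByStage"
WHERE "SignupStage" = 'confirmed' AND begins_with("Email", 'one#')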
Note that, unlike the base table's primary key, GSI / LSI key values do not have to be unique, so keep that in mind!
I would like to create a surrogate key for a Hive table, but one that could be reproduced every time the data is loaded into the table. Other tables would reference this table through the surrogate key, and the table could be regenerated to add more rows without that association being broken. My thought is basically to use a composite key of all the columns in the table.
Is it reasonable to concatenate all of my columns and take the MD5 hash of that string to use as an easy look-up for that row (see the sketch at the end of this question)?
The problems that I see with this solution are:
If the data changes in the rows, the association will still be broken
There is no real guarantee that the hash values are unique (though with my numbers, collisions are very unlikely)
Notes on the data:
The data is partitioned by day, and there are around 100k rows for each day.
There are cases where two rows have the exact same data, and it's fine if they end up with the same key.
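For concreteness, the kind of query I have in mind looks something like this (the column and table names are invented; assumes a Hive version that ships the md5() UDF, 1.3.0+):

-- Sketch: a reproducible key derived from the row's contents.
-- concat_ws() takes strings, hence the cast; the separator guards
-- against two different rows concatenating to the same string.
SELECT md5(concat_ws('||', cast(col_a AS string), col_b, col_c)) AS surrogate_key,
       t.*
FROM   my_table t;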
You have answered your own question:
There is no real guarantee that the hash values are unique (though with my numbers, collisions are very unlikely)
Keys need to be unique; that's their purpose. If you give me a record's key (be it surrogate or natural), I can find that record. Hashes are not going to be unique.
You need to go back and ask yourself WHY you want this surrogate key. If it's just for a unique identifier, then use your DB's unique identifier/sequence type and be done with it.
If there is a business requirement (the need to replicate the SK: why?), then go back to that reason and try to come up with a more direct, simpler solution for it.
(We tried hashes for type 2 change detection; it did not work, and we went back to column-by-column comparisons.)
This concerns me:
There are cases that two rows have the exact same data and it's fine if they end up with the same key
If you have two records in your database that are exactly the same, then you are missing data: even a sequence or a timestamp, something that can be used to differentiate your records. If you don't have a natural key, you are probably missing something.
Traditionally I have always used an ID column in SQL (mostly mysql and postgresql).
However, I am wondering if it is really necessary if the rest of the columns in each row make it unique. In my latest project I have the "ID" column set as my primary key, but I never call it or use it in any way, as the data in the row makes it unique and is much more useful to me.
So, if every row in an SQL table is unique, does it need a primary key ID column, and are there any performance changes with or without one?
Thanks!
EDIT/Additional info:
The specific example that made me ask this question is a table I am using as a many-to-many-to-many-to-many table (if we still call it that at that point). It has 4 columns (plus ID), each of which represents an ID in an external table, and each row will always be numeric and unique. Only one of the columns is allowed to be null.
I understand that for normal tables an ID primary key column is a VERY good thing to have. But I get the feeling that on this particular table it just wastes space and slows down adding new rows.
If you really do have some pre-existing column in your data set that already uniquely identifies each row, then no, there's no need for an extra ID column. The primary key, however, must be unique (in ALL circumstances) and cannot be empty (must be NOT NULL).
In my 20+ years of experience in database design, however, this is almost never truly the case. Most "natural" IDs that appear to be unique ultimately aren't. US Social Security Numbers, for example, aren't guaranteed to be unique, and most other "natural" keys end up being almost unique, and that's just not good enough for a database system.
So if you really do have a proper, unique key in your data already, use it! But most of the time, it's easier and more convenient to have just a single surrogate ID that you can guarantee will be unique over all rows.
Don't confuse the logical model with the implementation.
The logical model shows a candidate key (all columns) which could become your primary key.
Great. However...
In practice, having a multi-column primary key has downsides: it's wide, it's not good when clustered, etc. There is plenty of information out there and in the "Related" questions list on the right.
So, you'd typically:
add a surrogate key (ID column)
add a unique constraint to keep the other columns unique
the ID column will be the clustered key (can be only one per table)
You can make either key the primary key now
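A minimal sketch of that pattern (SQL Server flavoured syntax; the table and column names are invented):

-- Narrow surrogate key, clustered, plus a unique constraint that
-- keeps the natural candidate key unique.
CREATE TABLE OrderItem (
    ID        int IDENTITY(1,1) NOT NULL,
    OrderID   int NOT NULL,
    ProductID int NOT NULL,
    CONSTRAINT PK_OrderItem PRIMARY KEY CLUSTERED (ID),
    CONSTRAINT UQ_OrderItem_Natural UNIQUE (OrderID, ProductID)
);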
The main exception is link or many-to-many tables that link two ID columns: a surrogate isn't needed (unless you have a braindead ORM); see the sketch below.
Edit, a link: "What should I choose for my primary key?"
Edit2
For many-many tables: SQL: Do you need an auto-incremental primary key for Many-Many tables?
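For a plain link table, the composite key alone could look like this (again only a sketch with made-up names):

-- Many-to-many link table: the composite key is the primary key,
-- no surrogate needed.
CREATE TABLE StudentCourse (
    StudentID int NOT NULL REFERENCES Student (ID),
    CourseID  int NOT NULL REFERENCES Course (ID),
    CONSTRAINT PK_StudentCourse PRIMARY KEY (StudentID, CourseID)
);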
Yes, you could have many attributes (values) in a record (row) that you could use to make a record unique. This would be called a composite primary key.
However, it will generally be much slower, because constructing the primary index will be much more expensive. The primary index is used by relational database management systems (RDBMSs) not only to determine uniqueness, but also to decide how they order and structure records on disk.
A simple primary key of one incrementing value is generally the most performant and the easiest solution for the RDBMS to manage.
You should have one column in every table that is unique.
EDITED...
This is one of the fundamentals of database table design. It's the row identifier: the identifier determines which row(s) are being acted upon (updated, deleted, etc.). Relying on column combinations that are "unique", e.g. (first_name, last_name, city), as your key can quickly lead to problems when two John Smiths exist, or worse, when John Smith moves city and you get a collision.
In most cases, it's best to use an artificial key that's guaranteed to be unique, like an auto-increment integer. That's why they are so popular: they're needed. Commonly, the key column is simply called id, or sometimes <tablename>_id. (I prefer id.)
If natural data is available that is unique and present for every row (perhaps retinal scan data for people), you can use that, but all too often such data isn't available for every row.
Ideally, you should have only one unique column. That is, there should only be one key.
Using IDs to key tables means you can change the content as needed without having to repoint things.
For example: if every row points to a unique user, what would happen if he/she changed his name to, let's say, John Blblblbe, which was already in the db? And then again, what would happen if your software wants to pick up John Blblblbe's details: whose details would be picked up, the old John's or those of the one who changed his name? Well, if the answer to both questions is 'nothing special is gonna happen', then, yep, you don't really need an "ID" column :]
Important:
Also, having a numeric ID column is much faster when you're looking for an exact row, even when the table hasn't got any index keys or has more than one unique column.
If you are sure that some other column is going to have unique data for every row and will never be NULL, then there is no need for a separate ID column to distinguish each row from the others; you can make that existing column the primary key for your table.
No, single-attribute keys are not essential, and neither are surrogate keys. Keys should have as many attributes as are necessary for data integrity: to ensure that uniqueness is maintained, to represent accurately the universe of discourse, and to allow users to identify the data of interest to them. If you have already identified a suitable key and you don't find any real need to create another one, then it would make no sense to add redundant attributes and indexes to your table.
An ID can also be meaningful: for example, an employee ID can encode which department the employee is from, the year they joined, and so on. Apart from that, RDBMSs support lots of operations with IDs.