Is it possible to batch insert edges into ArangoDB by referencing a uniquely indexed attribute instead of user-defined keys?
For example (in pseudo code):
from db.C.name=x to db.D.number=y
Where both name and number have unique indexes, but defining user-originated keys would be an issue.
The idea of an edge index is to link vertex documents, which are defined by their _id attribute (e.g. collection/key). Because of the way the engine works, you must provide a _from and _to attribute with each edge...
...but that doesn't stop you from adding your own attributes (and indexing them)!
Because of the unique nature of edge indexes, I was forced to add my own from_id and to_id values, which mirrored the _from and _to, respectively. Adding a hash index to these allowed me to quickly reconcile new, existing, and obsolete records.
Alternatively, it might be possible to use the name and number values as _key values. Nothing says you need to use the system-supplied _key. The only caveat is the _key and _id values have character restrictions.
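As a rough illustration of that second suggestion, the sketch below derives _from and _to directly from the unique name and number values, so no lookup is needed before a batch insert. The collection names C and D come from the question; the helper names and the deliberately conservative character check are assumptions made for illustration, not ArangoDB's exact key rules.

```python
import re

# Conservative subset of characters assumed to be safe in an ArangoDB _key.
# ArangoDB permits more characters than this; check its naming conventions
# before relying on anything outside this set.
SAFE_KEY = re.compile(r"[A-Za-z0-9_\-]{1,254}")


def to_key(value) -> str:
    """Turn a unique attribute value into a candidate _key, or raise if unsafe."""
    key = str(value)
    if not SAFE_KEY.fullmatch(key):
        raise ValueError(f"value {key!r} cannot be used as a _key as-is")
    return key


def make_edge(name, number) -> dict:
    """Build an edge from the C vertex keyed by name to the D vertex keyed by number."""
    return {"_from": f"C/{to_key(name)}", "_to": f"D/{to_key(number)}"}


def make_edges(pairs) -> list:
    """Build a batch of edge documents that could be sent in one insert request."""
    return [make_edge(name, number) for name, number in pairs]
```

Values that fail the check (for example, names containing spaces) would need an escaping scheme of your own, which is where the mirrored from_id and to_id attributes described above may be the simpler route.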
Context
We are limited by ArangoDB's recommendation against using attribute names starting with an underscore _ https://www.arangodb.com/docs/stable/data-modeling-naming-conventions-attribute-names.html because we want to be certain that any such attribute would not be used by ArangoDB at a later stage.
We could add an attribute
properties:{myproperty1:'abc',_myUnderscoreProperty:'def'},
but in case we would do this for documents representing users, which would have
properties:{_name:'abc',_email:'abc#graphileon.com'},
we would need to be able to create a unique constraint on properties._name. But this does not seem to be possible.
Question
Is this possible, or is there a workaround?
Yes, it is possible. You can create a unique index on field properties._name
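A minimal sketch of what that index definition might look like, written as the JSON body one could send to ArangoDB's HTTP index endpoint (or pass to ensureIndex in arangosh). The function name is hypothetical, and the choice of a persistent index type is an assumption; the nested path is written with a dot.

```python
import json


def unique_index_definition(field_path: str) -> dict:
    """Describe a unique persistent index on one (possibly nested) attribute path."""
    return {"type": "persistent", "fields": [field_path], "unique": True}


# Nested attributes are addressed with dot notation.
definition = unique_index_definition("properties._name")
print(json.dumps(definition))
```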
I need to understand how to search on a DynamoDB attribute that is stored as part of an array.
Say I denormalise a table so that a person can have many email addresses: I would store the email addresses as an array in the person table.
The email address is not part of the sort key, so if I need to search by email address to find the person record, I need to index the email attribute.
Can I create an index on the email address, which has a one-to-many relationship with a person record and, as I understand it, is stored as an array in DynamoDB?
Would this secondary index be global or local? Assuming I have billions of person records?
If I could create it as either LSI or GSI, please explain the pros/cons of each.
Thank you very much!
It's worth getting the terminology right to start with. The data types DynamoDB supports are:
Scalar - String, number, binary, boolean
Document - List, Map
Sets - String Set, Number Set, Binary Set
I think you are suggesting you have an attribute that contains a list of emails. The attribute might look like this
Emails: ["one#email.com", "two#email.com", "three#email.com"]
There are a couple of relevant points about Key attributes described here. Firstly, keys must be top-level attributes (they can't be nested in JSON documents). Secondly, they must be of scalar types (i.e. String, Number or Binary).
As your list of emails is not a scalar type, you cannot use it in a key or index.
Given this schema you would have to perform a scan, in which you would set the FilterExpression on your Emails attribute using the CONTAINS operator.
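As a sketch of that scan, the parameters below follow the shape of DynamoDB's low-level Scan request (the kind of dictionary an SDK client's scan call accepts). The table name Person and the function name are assumptions; Emails is the attribute from the example above.

```python
def scan_params_for_email(table_name: str, email: str) -> dict:
    """Build Scan parameters that keep only items whose Emails list contains email."""
    return {
        "TableName": table_name,
        # contains() checks list membership when Emails is a List or Set.
        "FilterExpression": "contains(Emails, :email)",
        "ExpressionAttributeValues": {":email": {"S": email}},
    }


params = scan_params_for_email("Person", "one#email.com")
```

Keep in mind that the filter is applied after the items are read, so the scan still reads every item in the table, which is costly with billions of records.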
Stu's answer has some great information in it and he is right, you can't use an Array itself as a key.
What you CAN sometimes do is concatenate several variables (or an Array) into a single string with a known separator (maybe '_' for example), and then use that string as a Sort Key.
I used this concept to create a composite Sort Key that consisted of multiple ISO 8601 dates (DynamoDB has no native date type, so dates are commonly stored as ISO 8601 strings in String attributes). I also used several attributes that were not dates but were integers with a fixed character length.
By using the BETWEEN comparison I am able to individually query each of the variables that are concatenated into the Sort Key, or construct a complex query that matches against all of them as a group.
In other words a data object could use a Sort Key like this:
email#gmail.com_email#msn.com_email#someotherplace.com
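Then you could query that (assuming you knew what the partition key is) with something like this, written as SQL-like pseudocode (a real DynamoDB key condition on the Sort Key supports only =, <, <=, >, >=, BETWEEN and begins_with, not a substring match):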
SELECT * FROM Users
WHERE User='Bob' AND Emails LIKE '%email#msn.com%'
YOU MUST know the partition key in order to perform a Query no matter what you choose as your Sort Key and no matter how that Sort Key is constructed.
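The composite Sort Key idea described above can be sketched in plain Python. This is only an in-memory imitation of how a BETWEEN condition behaves on string keys within one partition; the function names, the '_' separator and the four-digit integer width are assumptions for illustration.

```python
def composite_sort_key(date_iso: str, stage: int) -> str:
    """Join an ISO 8601 date and a fixed-width integer with '_' as the separator."""
    return f"{date_iso}_{stage:04d}"


def query_between(sort_keys, low: str, high: str) -> list:
    """Mimic a BETWEEN condition: keep keys in [low, high] in lexicographic order."""
    return [key for key in sorted(sort_keys) if low <= key <= high]


keys = [
    composite_sort_key("2020-01-15", 3),
    composite_sort_key("2020-02-10", 12),
    composite_sort_key("2020-02-10", 7),
    composite_sort_key("2020-03-01", 1),
]

# Everything dated in February 2020, whatever the integer component is.
february = query_between(keys, "2020-02-01_0000", "2020-02-29_9999")
```

Fixed-width components matter here: string comparison only orders the keys correctly if every component always occupies the same number of characters, and range conditions work most naturally on the leftmost components of the key.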
I think the real question you are asking is what should my sort keys and partition keys be? That will depend on exactly which queries you want to make and how frequently each type of query is used.
I have found that I have way more success with DynamoDB if I think about the queries I want to make first, and then go from there.
A word on Secondary Indexes (GSI / LSI)
The issue here is that you still need to 'know' the Partition Key for your secondary data structure. GSI / LSI help you avoid needing to create additional DynamoDB tables for the sole purpose of improving data access.
From Amazon:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/SecondaryIndexes.html
To me it sounds more like the issue is selecting the Keys.
LSI (Local Secondary Index)
If (for your Query case) you don't know the Partition Key to begin with (as it seems you don't) then a Local Secondary Index won't help — since it has the SAME Partition Key as the base table.
GSI (Global Secondary Index)
A Global Secondary Index could help in that you can have a DIFFERENT Partition Key and Sort Key (presumably a partition key that you could 'know' for this query).
So you could use the Email attribute (perhaps composite) as the Sort Key on your GSI and then something like a service name, or sign-up stage, as your Partition Key. This would let you 'know' what partition that user would be in based on their progress or the service they signed up from (for example).
Keep in mind that, unlike the base table's primary key, the key values in a GSI / LSI do not have to be unique, so a query against an index may return more than one item!
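A hedged sketch of what a Query against such a GSI might look like, again as a low-level request dictionary. The table name, index name and the attribute names ServiceName and Email are all hypothetical, and the sketch assumes each item carries a single scalar Email attribute, since a list cannot be an index key.

```python
def gsi_query_params(table_name: str, index_name: str, service: str, email: str) -> dict:
    """Build Query parameters that look up one email within one service partition."""
    return {
        "TableName": table_name,
        "IndexName": index_name,
        # Partition key must be matched exactly; the sort key narrows the result.
        "KeyConditionExpression": "ServiceName = :service AND Email = :email",
        "ExpressionAttributeValues": {
            ":service": {"S": service},
            ":email": {"S": email},
        },
    }


params = gsi_query_params("Users", "ServiceEmailIndex", "newsletter", "email#msn.com")
```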
I'm new to Neo4j and I need some help.
I'm trying to create constraints on multiple node properties at once, in two senses:
I need to declare constraints on many properties without typing out the command again and again for each property.
I need to define several properties as ONE combined constraint, as in SQL where 3 attributes together form a primary key rather than each separately.
How can I achieve it?
You are actually asking 2 questions.
The APOC procedure apoc.schema.assert is helpful for conveniently ensuring that the DB has the required set of indexes and constraints. (Be aware, though, that this procedure will drop any existing indexes and constraints not specified in the call.)
For example, as shown in the documentation, this call:
CALL apoc.schema.assert(
{Track:['title','length']},
{Artist:['name'],Track:['id'],Genre:['name']});
will return a result like this (also, if an index or constraint had been dropped, a row with the action value of "DROPPED" would have been returned as well):
╒════════════╤═══════╤══════╤═══════╕
│label │key │unique│action │
╞════════════╪═══════╪══════╪═══════╡
│Track │title │false │CREATED│
├────────────┼───────┼──────┼───────┤
│Track │length │false │CREATED│
├────────────┼───────┼──────┼───────┤
│Artist │name │true │CREATED│
├────────────┼───────┼──────┼───────┤
│Genre │name │true │CREATED│
├────────────┼───────┼──────┼───────┤
│Track │id │true │CREATED│
└────────────┴───────┴──────┴───────┘
Since there is not (yet) any way to create an index or constraint on multiple properties of a node label, one popular workaround is to use an extra property whose value is an array of the values you want to use. You will have to make sure the values are all of the same type, converting some if necessary. Unfortunately, this requires storing some data redundantly, and makes your code a bit more complex.
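A minimal sketch of that workaround on the client side: before writing a node, compute the extra property from the values that together should be unique, converting them all to strings so the array has a single type. The property name compositeKey and the sample values are made up for illustration; the uniqueness constraint itself would then be declared on that one property.

```python
def composite_value(properties: dict, key_names) -> list:
    """Collect the key properties, in a fixed order, as a list of strings."""
    return [str(properties[name]) for name in key_names]


track = {"title": "Blue", "length": 214, "album": "Kind"}

# Store the redundant array alongside the original properties.
track["compositeKey"] = composite_value(track, ["title", "length", "album"])
```

The order of key_names has to be the same everywhere the property is computed, otherwise two nodes with identical values would produce different arrays.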
I am trying to read up on best practices on DynamoDB. I saw that DynamoDB has two PK types:
Hash Key
Hash and Range Key
From what I read, it appears the latter is like the former but supports sorting and indexing of a finite set of columns.
So my question is why ever use only a hash key without a range key? Is it a viable choice only when the table is not searched?
It'd also be great to have some general guidelines on when to use what key type. I've read several guides (including Amazon's own documentation on DynamoDB) but none of them appear to directly address this question.
Thanks
The choice of which key to use comes down to your Use Cases and Data Requirements for a particular scenario. For example, if you are storing User Session Data it might not make much sense using the Range Key since each record could be referenced by a GUID and accessed directly with no grouping requirements. In general terms once you know the Session Id you just get the specific item querying by the key. Another example could be storing User Account or Profile data, each user has his own and you most likely will access it directly (by User Id or something else).
However, if you are storing Order Items then the Range Key makes much more sense since you probably want to retrieve the items grouped by their Order.
In terms of the Data Model, the Hash Key allows you to uniquely identify a record from your table, and the Range Key can be optionally used to group and sort several records that are usually retrieved together. Example: If you are defining an Aggregate to store Order Items, the Order Id could be your Hash Key, and the OrderItemId the Range Key. Whenever you would like to search the Order Items from a particular Order, you just query by the Hash Key (Order Id), and you will get all your order items.
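The Order Items example can be imitated with a small in-memory structure. This is only a toy model of the access pattern, not of how DynamoDB stores data; the class and method names are invented for the sketch.

```python
from collections import defaultdict


class OrderItemsTable:
    """In-memory stand-in for a table keyed by (order_id, order_item_id)."""

    def __init__(self):
        self._partitions = defaultdict(dict)

    def put(self, order_id: str, order_item_id: int, item: dict) -> None:
        # The (hash key, range key) pair is unique, so a repeated pair overwrites.
        self._partitions[order_id][order_item_id] = item

    def get(self, order_id: str, order_item_id: int) -> dict:
        # Direct lookup needs both parts of the composite key.
        return self._partitions[order_id][order_item_id]

    def query(self, order_id: str) -> list:
        # Querying by hash key alone returns the group, ordered by range key.
        partition = self._partitions.get(order_id, {})
        return [partition[key] for key in sorted(partition)]


table = OrderItemsTable()
table.put("order-1", 2, {"sku": "pen"})
table.put("order-1", 1, {"sku": "ink"})
table.put("order-2", 1, {"sku": "pad"})
items = table.query("order-1")
```

A hash-only table corresponds to the case where every partition holds exactly one item, which is why it suits session or profile data that is always fetched by its own id.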
You can find below a formal definition for the use of these two keys:
"Composite Hash Key with Range Key allows the developer to create a
primary key that is the composite of two attributes, a 'hash
attribute' and a 'range attribute.' When querying against a composite
key, the hash attribute needs to be uniquely matched but a range
operation can be specified for the range attribute: e.g. all orders
from Werner in the past 24 hours, or all games played by an individual
player in the past 24 hours." [VOGELS]
So the Range Key adds a grouping capability to the Data Model. However, the use of these two keys also has implications for the Storage Model:
"Dynamo uses consistent hashing to partition its key space across its
replicas and to ensure uniform load distribution. A uniform key
distribution can help us achieve uniform load distribution assuming
the access distribution of keys is not highly skewed."
[DDB-SOSP2007]
Not only does the Hash Key allow you to uniquely identify the record, it is also the mechanism that ensures load distribution. The Range Key (when used) indicates which records will mostly be retrieved together, so the storage can also be optimized for that need.
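To make the load-distribution point concrete, here is a generic consistent-hashing sketch. It is not DynamoDB's internal implementation; the class name, the use of MD5 and the number of virtual points per node are assumptions chosen to keep the example short.

```python
import bisect
import hashlib


def _position(value: str) -> int:
    """Map a string to a stable position on the ring."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Assign hash keys to storage nodes using a ring of virtual points."""

    def __init__(self, nodes, replicas: int = 64):
        # Each node is placed at several points to even out the key ranges.
        self._ring = sorted(
            (_position(f"{node}#{i}"), node) for node in nodes for i in range(replicas)
        )
        self._positions = [position for position, _ in self._ring]

    def node_for(self, hash_key: str) -> str:
        # Walk clockwise to the first point at or after the key, wrapping around.
        index = bisect.bisect_left(self._positions, _position(hash_key))
        return self._ring[index % len(self._ring)][1]


ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("order-1")
```

Because the node is chosen from the Hash Key alone, all items sharing a Hash Key land together, and a Hash Key with few distinct values or very skewed access concentrates the load on a few nodes.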
Choosing the correct keys to represent your data is one of the most critical aspects of your design process, and it directly impacts how well your application will perform and scale, and how much it will cost.
Footnotes:
The Data Model is the model through which we perceive and manipulate our data. It describes how we interact with the data in the database [FOWLER]. In other words, it is how you abstract your data model, the way you group your entities, the attributes that you choose as primary keys, etc
The Storage Model describes how the database stores and manipulates the data internally [FOWLER]. Although you cannot control this directly, you can certainly optimize how the data is retrieved or written by knowing how the database works internally.
I want to know whether the two settings node_auto_indexing and relationship_auto_indexing in neo4j.properties concern the ids of nodes and relationships.
Or does Neo4j automatically create an index for the ids of the inserted nodes and relationships?
The auto index creates an index for all properties listed on the *_keys_indexable lines in the neo4j.properties file.
The index then binds the node ID to the specific property value. Thus, searching the index for the property value will return the node.
Since your question is a bit unclear to me, you might want to take a look at the official documentation:
http://docs.neo4j.org/chunked/milestone/auto-indexing.html
No you shouldn't add your ID to the auto index. There is no use for it, since you can already retrieve nodes by ID, without using auto index.
There are, however, occasions where the usual ID is not sufficient. For instance, when working with users, you may have a user id of some kind. You'd then store this in a property, and add that property to the auto index. This way, you can search by user id. Under the hood, Neo4j matches your custom user ID with the actual node id.
It is important to keep in mind here that, by definition, the auto index is not unique. You need to design your application in such a fashion that the property is in fact unique, if you're expecting a single node result.
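The non-uniqueness can be pictured with a toy index that maps a property value to every node id carrying it. This models the behaviour described above only; it is not Neo4j's API, and the class and method names are hypothetical.

```python
from collections import defaultdict


class PropertyIndex:
    """Toy stand-in for an auto index on one property: value -> node ids."""

    def __init__(self):
        self._entries = defaultdict(list)

    def add(self, node_id: int, value) -> None:
        # Nothing stops two nodes from being indexed under the same value.
        self._entries[value].append(node_id)

    def lookup(self, value) -> list:
        return list(self._entries.get(value, []))

    def lookup_single(self, value):
        """Return the only matching node id, or raise if the value is not unique."""
        matches = self.lookup(value)
        if len(matches) != 1:
            raise LookupError(f"expected one node for {value!r}, found {len(matches)}")
        return matches[0]


index = PropertyIndex()
index.add(17, "user-42")
index.add(23, "user-42")
index.add(31, "user-7")
```

An application that expects one node per user id would need a check like lookup_single, or would have to prevent duplicates at write time.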