What is the optimal way to deal with nested documents with elasticsearch? - kotlin

I'm trying to work out a good way of indexing a data model in Elasticsearch. I have a few business entities that can contain each other, so I'm not sure whether a nested or a distributed structure should be used.
Let me describe the example model:
Company {
id: Int,
name: String,
address: String,
... 10 more unique company properties ...
}
Partner {
id: Int,
registrationNumber: String,
contract: String,
... 10 more unique partner properties ...
}
Representative {
id: Int,
lastName: String,
firstName: String,
... 10 more unique representative properties ...
}
For now I see two ways of organizing the index structure:
1. Single index. The obvious way is to create a single index, representatives, that will contain all this data, like:
representatives:
- id
- lastName
- firstName
- partner:
- id
- registrationNumber
- contract
- company:
- id
- name
- address
- other fields...
- other fields...
- other fields...
But I expect a few problems here:
If a partner has n representatives, then the partner and company fields will be duplicated for each representative. This seems likely to hurt performance.
Saving documents with repeated info into the index will make it grow, which seems like unnecessary overhead.
2. Multi-index. On the other hand, a separate index can be created for each business entity: companies, partners, and representatives:
representatives:
- id
- lastName
- firstName
- other fields...
partners:
- id
- registrationNumber
- contract
- other fields...
companies:
- id
- name
- address
- other fields...
This index structure will not contain duplicated data, but I expect new problems here:
I will have to make 3 search queries instead of 1.
I will have to deal somehow with multiple search responses and build complex logic to determine the expected hits.
I don't have enough experience with Elasticsearch to understand the pros & cons of these two structures...
Could anyone please advise me which is better, the first or the second option?

The best option is usually to flatten (denormalize) your data, including the duplication that entails. This often simplifies queries and maximizes performance.
Otherwise, there are other ways to model it, such as the join field type:
https://www.elastic.co/guide/en/elasticsearch/reference/8.3/parent-join.html#parent-join
But ensure you read up on the performance and limitations considerations:
https://www.elastic.co/guide/en/elasticsearch/reference/8.3/parent-join.html#_parent_join_and_performance
Some more reading (in the fairly old/outdated "Definitive Guide" documentation) is here and may still provide a useful overview of the options:
https://www.elastic.co/guide/en/elasticsearch/guide/master/relations.html
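For illustration, here is a minimal sketch of the flattened approach, assuming the official Python client and made-up index/field values (the question itself doesn't prescribe these):

# Minimal sketch of the denormalized (single-index) option.
# Assumes the official `elasticsearch` Python client and a local
# cluster; the index name and sample values are illustrative.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# One document per representative, with the partner and company
# fields copied (duplicated) into it.
doc = {
    "id": 42,
    "firstName": "Jane",
    "lastName": "Doe",
    "partner": {"id": 7, "registrationNumber": "REG-001", "contract": "C-17"},
    "company": {"id": 1, "name": "Acme", "address": "1 Main St"},
}
es.index(index="representatives", id=doc["id"], document=doc)

# A single query can now filter on representative, partner, and
# company fields at once -- no client-side joins needed.
resp = es.search(
    index="representatives",
    query={
        "bool": {
            "must": [
                {"match": {"lastName": "Doe"}},
                {"term": {"company.id": 1}},
            ]
        }
    },
)
for hit in resp["hits"]["hits"]:
    print(hit["_source"]["firstName"], hit["_source"]["lastName"])

The duplication costs storage, but it buys exactly what the multi-index option loses: one query, one response, no client-side join logic.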

Related

What would be the downsides of storing many-to-many relates in a JSON field instead of in a tertiary table?

Say I intend to have the following objects stored in a relational database (pseudo-code):
Domain:
id: 123
Name: A
Variables: Thing_X, Thing_Y, Thing_Z
Domain:
id: 789
Name: B
Variables: Thing_W, Thing_X, Thing_Y
I know the standard way to structure the many-to-many relationship between Domains and Variables would be to use a tertiary table. However, I think I can do some interesting stuff if I represent the relates as JSON, and I would like to know the deficiencies of doing something like the following:
Domain:
id: 123
name: A
variable_relates_JSON: [
{table: 'Variable', id: 314, name: 'Thing_X'},
...
]
Variable:
id: 314
name: Thing_X
domain_relates_JSON: [
{table: 'Domain', id: 123, name: 'A'},
...
]
I've made another post more specifically about the time complexity of this JSON method versus using a tertiary table. I'm happy to hear answers to that question here as well. But, I'm also interested in general challenges I may encounter with this approach.
JSON strings incur the overhead of having to store the name of the field as well as the value. This multiplies the size of the data.
In addition, dates and numbers are stored as strings. In a regular database format, 2020-01-01 occupies 4 bytes or so versus the 10 bytes in a string (not including the delimiters). Similarly for numbers.
More space required for the data slows down databases. Then, searching for a particular JSON requires scanning or parsing the JSON string. Some databases provide support for binary JSON formats or strings to facilitate this, but you have to set that up as part of the table.
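A toy Python comparison makes the size point concrete (the fixed binary layout is a crude stand-in for native database storage; purely illustrative):

# Toy comparison: JSON text vs. a fixed binary layout.
# The struct layout stands in for how an RDBMS might store a date
# and a number natively; purely illustrative.
import json
import struct

row = {"date": "2020-01-01", "amount": 1234.5}

as_json = json.dumps(row)                         # field names + string values
as_binary = struct.pack("<if", 20200101, 1234.5)  # 4-byte int + 4-byte float

print(len(as_json))    # ~40 bytes
print(len(as_binary))  # 8 bytes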
Here's an issue I thought of: CONCURRENT UPDATES.
Continuing the example above, let's say users assign both Variable Thing_XX and Variable Thing_YY to Domain A.
For each assignment, I will have to get the JSON and add the relevant id somewhere in its structure. If either gets the JSON before the other finishes assignment, then it will overwrite the assignment of the other.
A scrappy solution might be to somehow 'lock' the field while someone is editing it. But, that could become quite problematic.
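For contrast, a minimal sqlite3 sketch (table and column names invented for this example) shows why the tertiary table avoids this race entirely: each assignment is an independent INSERT, with no read-modify-write cycle to interleave:

# Sketch: with a junction (tertiary) table, each assignment is a
# single INSERT, so concurrent assignments cannot overwrite each
# other the way read-modify-write on a shared JSON field can.
# Uses sqlite3 from the standard library; names are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE domain   (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE variable (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE domain_variable (
        domain_id   INTEGER REFERENCES domain(id),
        variable_id INTEGER REFERENCES variable(id),
        PRIMARY KEY (domain_id, variable_id)
    );
""")
conn.execute("INSERT INTO domain VALUES (123, 'A')")
conn.execute("INSERT INTO variable VALUES (314, 'Thing_X')")
conn.execute("INSERT INTO variable VALUES (315, 'Thing_Y')")

# Two 'concurrent' assignments: independent INSERTs, no lost update.
conn.execute("INSERT INTO domain_variable VALUES (123, 314)")
conn.execute("INSERT INTO domain_variable VALUES (123, 315)")

rows = conn.execute("""
    SELECT v.name FROM variable v
    JOIN domain_variable dv ON dv.variable_id = v.id
    WHERE dv.domain_id = 123
""").fetchall()
print([r[0] for r in rows])  # ['Thing_X', 'Thing_Y']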

How to properly store a JSON object into a Table?

I am working on a scenario where I have invoices available in my Data Lake Store.
Invoice example (extremely simplified):
{
"business_guid":"b4f16300-8e78-4358-b3d2-b29436eaeba8",
"ingress_timestamp": 1523053808,
"client":{
"name":"Jake",
"age":55
},
"transactions":[
{
"name":"peanut",
"amount":100
},
{
"name":"avocado",
"amount":2
}
]
}
All invoices are stored in ADLS and can be queried. But I would like to provide access to the same data inside an ADL database.
I am not an expert on unstructured data: I have an RDBMS background. Taking that into consideration, I can only think of 2 possible scenarios:
2/3 tables - invoice, client (could be removed) and transaction. In this scenario, I would have to create an invoice ID to be able to build relationships between those tables
1 table - client info could be normalized into invoice data. But, transactions could (maybe) be defined as an SQL.ARRAY<SQL.MAP<string, object>>
I have mainly 3 questions:
What is the correct way of doing so? Solution 1 seems much better structured.
If I go with solution 1, how do I properly create an ID (probably GUID)? Is it acceptable to require ID creation when working with ADL?
Is there another solution I am missing here?
Thanks in advance!
This type of question is a bit like asking whether you prefer your sauce on the pasta or next to the pasta :). The answer is: it depends.
To answer your 3 questions more seriously:
#1 has the benefit of being normalized, which works well if you want to operate on the data separately (e.g., just clients, just invoices, just transactions), want the benefits of normalization, need the right indexing, and are not limited by row-size limits (e.g., your array of maps needs to fit into a row). So I would recommend that approach unless your transaction data is always small, you always access the data together, and you mainly search on the column data.
U-SQL per se has no understanding of the hierarchy of the JSON document. Thus, you would have to write an extractor that turns your JSON into rows in a way that either preserves the correlation of parent to child (normally done by stepwise downward navigation with CROSS APPLY), using the key value of the parent data item as the foreign key, or has the extractor generate the key (as an int or GUID).
There are some sample JSON extractors on the U-SQL GitHub site (start at http://usql.io) that can get you started with the JSON-to-rowset conversion. Note that you will probably want to optimize the extraction at some point to be JSON-reader based, so you can process larger docs without loading them entirely into memory.
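To make the shape of that extraction concrete, here is a sketch in Python rather than U-SQL (the row layouts and the GUID choice are assumptions; a real extractor would do the equivalent inside EXTRACT):

# Sketch of the JSON-to-rowset flattening described above, in Python
# for illustration. The invoice gets a generated GUID key; child rows
# carry it as a foreign key. Row layouts are made up.
import json
import uuid

invoice_json = """{
  "business_guid": "b4f16300-8e78-4358-b3d2-b29436eaeba8",
  "ingress_timestamp": 1523053808,
  "client": {"name": "Jake", "age": 55},
  "transactions": [
    {"name": "peanut", "amount": 100},
    {"name": "avocado", "amount": 2}
  ]
}"""

doc = json.loads(invoice_json)
invoice_id = str(uuid.uuid4())  # the generated key mentioned above

invoice_row = (invoice_id, doc["business_guid"], doc["ingress_timestamp"])
client_row = (invoice_id, doc["client"]["name"], doc["client"]["age"])
transaction_rows = [(invoice_id, t["name"], t["amount"])
                    for t in doc["transactions"]]

print(invoice_row)
print(client_row)
print(transaction_rows)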

Searching in values of a redis db

I am a novice in using Redis DB. After reading some of the documentation, looking at some examples on the Internet, and scanning stackoverflow.com, I can see that Redis is very fast and scales well, but this comes at the price that we have to think out at design time how our data will be accessed and what operations it will have to undergo. This I can understand, but I am a little confused about searching in the data, which was so easy, however slow, with plain old SQL. I could do it in one way with the KEYS command, but it is an O(N) operation, not O(log(N)), so I would lose one of the advantages of Redis.
What do more experienced colleagues say here?
Let's take an example use case: we need to store personal data for approx. 100,000 people, and those data need to be searched by name or phone number.
For this I would use the following structures:
1. SET for storing all persons' ids {id1, id2, ...}
2. HASH for each person to store personal data and name it
like map:<id> e.g. map:id1{name:<name>, phone:<number>, etc...}
Solution 1:
1. HASH for storing all persons' ids, but keyed by phone number
2. Then with the command KEYS 123* all ids could be retrieved for people whose phone number
starts with 123. On the basis of the ids, the other personal data could also be retrieved.
3. Likewise, a separate HASH would have to be created for each attribute to be searched.
But a major drawback of this solution is that the attribute values must be unique, so that the assignment of phone numbers to ids in the HASH is unambiguous. On top of that, the O(N) runtime is not ideal.
Moreover, this uses more space than necessary, and the KEYS command degrades access performance. (http://redis.io/commands/keys)
What is the right way to do it? I could also imagine putting the ids in a ZSET with the searchable data as the scores, but that only makes range queries possible, not searches.
Thank you also in advance, regards, Tamas
Answer summary:
Actually, both responses state that Redis was not designed for searching in the values of keys. If this use case is necessary, workarounds need to be implemented, as shown in my original solution or in the solution below.
The solution below by Eli performs much better than my original one, because access to the keys can be considered constant; only the list of ids needs to be iterated through. This data model also allows one person to have the same phone number as another, and likewise for names, so a 1-n relationship is also possible (I would say, in old ERD terminology).
The drawback of this solution is that it consumes much more space than mine, and phone numbers of which only the starting digits are known cannot be searched.
Thanks for both responses.
Redis is for use cases where you need to access and update data at very high frequency and where you benefit from use of data structures (hashes, sets, lists, strings, or sorted sets). It's made to fill very specific use cases. If you have a general use case like very flexible searching, you'd be much better served by something built for that purpose, like Elasticsearch or Solr.
That said, if you must do this in Redis, here's how I'd do it (assuming users can share names and phone numbers):
name:some_name -> set([id1, id2, etc...])
name:some_other_name -> set([id3, id4, etc...])
phone:some_phone -> set([id1, id3, etc...])
phone:some_other_phone -> set([id2, id4, etc...])
id1 -> {'name' : 'bob', 'phone' : '123-456-7891', etc...}
id2 -> {'name' : 'alice', 'phone' : '987-456-7891', etc...}
In this case, we're making a new key for every name (prefixed with "name:") and every phone number (prefixed "phone:"). Each key points to a set of ids that have all the info you want for a user. When you search, for a phone, for example, you'll do:
SMEMBERS 'phone:123-456-7891'
and then loop through the results and return whatever info on each (name in our example) in your language of choice (you can do this whole thing in server-side Lua on the Redis box to go even faster and avoid network back-and-forth, if you want):
for id in results:
HGET id 'name'
Your cost here will be O(m), where m is the number of users with the given phone number, and this will be a very fast operation on Redis because of how optimized it is for speed. It'll be overkill in your case because you probably don't need things to go so fast, and you'd prefer having flexible search, but this is how you would do it.
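A runnable sketch of the same pattern with the redis-py client (a local Redis server is assumed; key names follow the answer):

# Secondary-index-by-sets pattern from the answer above, via redis-py.
# Assumes a Redis server on localhost; key names follow the answer.
import redis

r = redis.Redis(decode_responses=True)

def add_user(uid, name, phone):
    r.hset(uid, mapping={"name": name, "phone": phone})
    r.sadd("name:" + name, uid)    # index by name
    r.sadd("phone:" + phone, uid)  # index by phone

add_user("id1", "bob", "123-456-7891")
add_user("id2", "alice", "987-456-7891")

# Search by phone: read the id set, then fetch each user's fields.
for uid in r.smembers("phone:123-456-7891"):
    print(uid, r.hget(uid, "name"))  # O(m) in the number of matches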
Redis is awesome, but it's not built for searching on anything other than keys. You simply can't query on values without building extra data sets to store items to facilitate such querying, and even then you don't get true search, just more maintenance, inefficient use of memory, yada, yada...
This question has already been addressed, you've got some reading to do :-D
To search strings, build auto-complete in redis and other cool things...
How do I search strings in redis?
Why using MongoDB over redis is smart when searching inside documents...
What's the most efficient document-oriented database engine to store thousands of medium sized documents?
Original Secondary Indices in Redis
The accepted answer here is correct in that the traditional way of handling searching in Redis has been through secondary indices built around Sets and Sorted Sets.
e.g.
HSET Person:1 firstName Bob lastName Marley age 32 phoneNum 8675309
You would maintain secondary indices, so you would have to call
SADD Person:firstName:Bob Person:1
SADD Person:lastName:Marley Person:1
SADD Person:phoneNum:8675309 Person:1
ZADD Person:age 32 Person:1
This allows you to now perform search-like operations
e.g.
SELECT p.age
FROM People AS p
WHERE p.firstName = 'Bob' and p.lastName = 'Marley' and p.phoneNum = '8675309'
Becomes:
ids = SINTER Person:firstName:Bob Person:lastName:Marley Person:phoneNum:8675309
foreach id in ids:
age = HGET id age
print(age)
The key challenge with this methodology is that, in addition to being relatively complicated to set up (it really forces you to think about your model), it becomes extremely difficult to maintain atomically, particularly in sharded environments (where cross-shard key constraints can become problematic). Consequently, the keys and the index can drift apart, forcing you to periodically loop through and rebuild the index.
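For what it's worth, here is how that SQL-to-SINTER translation might look from redis-py (a local Redis server is assumed; keys follow the example above):

# The SQL-to-SINTER translation above, sketched with redis-py.
# Assumes a local Redis; keys follow the example in the answer.
import redis

r = redis.Redis(decode_responses=True)

r.hset("Person:1", mapping={"firstName": "Bob", "lastName": "Marley",
                            "age": 32, "phoneNum": "8675309"})
r.sadd("Person:firstName:Bob", "Person:1")
r.sadd("Person:lastName:Marley", "Person:1")
r.sadd("Person:phoneNum:8675309", "Person:1")
r.zadd("Person:age", {"Person:1": 32})

# WHERE firstName='Bob' AND lastName='Marley' AND phoneNum='8675309'
ids = r.sinter("Person:firstName:Bob",
               "Person:lastName:Marley",
               "Person:phoneNum:8675309")
for pid in ids:
    print(r.hget(pid, "age"))  # SELECT p.age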
Newer Secondary Indices with RediSearch
Caveat: This uses RediSearch, a Redis module available under the Redis Source Available License.
There's a newer module that plugs into Redis that can do all this for you, called RediSearch. It lets you declare secondary indices, and then takes care of indexing everything for you as you insert it. For the above example, you would just need to run:
FT.CREATE person-idx ON HASH PREFIX 1 Person: SCHEMA firstName TAG lastName TAG phoneNumber TEXT age NUMERIC SORTABLE
That would declare the index, and after that all you need to do is insert stuff into Redis, e.g.
HSET Person:1 firstName Bob lastName Marley phoneNumber 8675309 age 32
Then you could run:
FT.SEARCH person-idx "@firstName:{Bob} @lastName:{Marley} @phoneNumber:8675309 @age:[-inf 33]"
to return all the items matching the pattern; see the query syntax documentation for more details.
zeeSQL is a novel Redis module with SQL and secondary-index capabilities, allowing search by the values of Redis keys.
You can set it up in such a way to track the values of all the hashes and put them into a standard SQL table.
For your example of searching people by phone number and name, you could do something like.
> ZEESQL.CREATE_DB DB
"OK"
> ZEESQL.INDEX DB NEW PREFIX customer:* TABLE customer SCHEMA id INT name STRING phone STRING
At this point zeeSQL will track all the hashes whose keys start with customer: and will put them into a SQL table. It will store the field id as an integer, and the fields name and phone as strings.
You can populate the table simply by adding hashes to Redis, and zeeSQL will keep everything in sync.
> HMSET customer:1 id 1 name joseph phone 123-345-2345
> HMSET customer:2 id 2 name lukas phone 234-987-4453
> HMSET customer:3 id 3 name mary phone 678-443-2341
At this point you can look into the customer table and you will find the result you are looking for.
> ZEESQL.EXEC DB COMMAND "select * from customer"
1) 1) RESULT
2) 1) id
2) 2) name
2) 3) phone
3) 1) INT
3) 2) STRING
3) 3) STRING
4) 1) 1
4) 2) joseph
4) 3) 123-345-2345
5) 1) 2
5) 2) lukas
5) 3) 234-987-4453
6) 1) 3
6) 2) mary
6) 3) 678-443-2341
The reply lists first the column names, then the column types, and finally the actual result set.
zeeSQL is based on SQLite and it supports all the SQLite syntax for filtering and aggregation.
For instance, you could search for people knowing only the prefix of their phone number.
> ZEESQL.EXEC DB COMMAND "select name from customer where phone like '678%'"
1) 1) RESULT
2) 1) name
3) 1) STRING
4) 1) mary
You can find more examples in the tutorial: https://doc.zeesql.com/tutorial#using-secondary-indexes-or-search-by-values-in-redis

Modelling NoSQL database (when converting from SQL database)

I have a SQL database that I want to convert to a NoSQL one (currently I'm using RavenDB)
Here are my tables:
Trace:
ID (PK, bigint, not null)
DeploymentID (FK, int, not null)
AppCode (int, not null)
Deployment:
DeploymentID (PK, int, not null)
DeploymentVersion (varchar(10), not null)
DeploymentName (nvarchar(max), not null)
Application:
AppID (PK, int, not null)
AppName (nvarchar(max), not null)
Currently I have these rows in my tables:
Trace:
ID: 1 , DeploymentID: 1, AppCode: 1
ID: 2 , DeploymentID: 1, AppCode: 2
ID: 3 , DeploymentID: 1, AppCode: 3
ID: 4 , DeploymentID: 2, AppCode: 1
Deployment:
DeploymentID: 1 , DeploymentVersion: 1.0, DeploymentName: "Test1"
DeploymentID: 2 , DeploymentVersion: 1.0, DeploymentName: "Test2"
Application:
AppID: 1 , AppName: "Test1"
AppID: 2 , AppName: "Test2"
AppID: 3 , AppName: "Test3"
My question is: HOW should I build my NoSQL document model?
Should it look like:
trace/1
{
"Deployment": [ { "DeploymentVersion": "1.0", "DeploymentName": "Test1" } ],
"Application": "Test1"
}
trace/2
{
"Deployment": [ { "DeploymentVersion": "1.0", "DeploymentName": "Test1" } ],
"Application": "Test2"
}
trace/3
{
"Deployment": [ { "DeploymentVersion": "1.0", "DeploymentName": "Test1" } ],
"Application": "Test3"
}
trace/4
{
"Deployment": [ { "DeploymentVersion": "1.0", "DeploymentName": "Test2" } ],
"Application": "Test1"
}
And what if Deployment 1 gets changed? Should I go through each document and change the data?
And when should I use references in NoSQL?
Document databases such as Raven are not relational databases. You CANNOT first build the database model and then later on decide on various interesting ways of querying it. Instead, you should first determine what access patterns you want to support, and then design the document schemas accordingly.
So in order to answer your question, what we really need to know is how you intend to use the data. For example, displaying all traces ordered by time is a distinctly different scenario than displaying traces associated with a specific deployment or application. Each one of those requirements will dictate a different design, as will supporting them both.
This in itself may be useful information to you (?), but I suspect you want more concrete answers :) So please add some additional details on your intended usage.
There are a few "dos" and "don'ts" when deciding on a strategy:
DO: Optimize for the common use-cases. There is often a 20/80 breakdown where 20% of the UX drives 80% of the load - the homepage/landing page of web apps is a classic example. The first priority is to make sure that these are as efficient as possible. Make sure that your data model either A) allows loading those in a single IO request or B) is cache-friendly.
DONT: Don't fall into the dreaded "N+1" trap. This pattern occurs when your data model forces you to make N calls in order to load N entities, often preceded by an additional call to get the list of the N IDs. This is a killer, especially together with #3...
DO: Always cap (via the UX) the amount of data you are willing to fetch. If the user has 3729 comments, you obviously aren't going to fetch them all at once. Even if it were feasible from a database perspective, the user experience would be horrible. That's why search engines use the "next 20 results" paradigm. So you can (for example) align the database structure with the UX and save the comments in blocks of 20. Then each page refresh involves a single DB get (see the sketch after this list).
DO: Balance the read and write requirements. Some types of systems are read-heavy, and you can assume that for each write there will be many reads (StackOverflow is a good example). There it makes sense to make writes more expensive in order to gain benefits in read performance, for example via data denormalization and duplication. Other systems are evenly balanced or even write-heavy and require other approaches.
DO: Use the dimension of TIME to your advantage. Twitter is a classic example: 99.99% of tweets will never be accessed after the first hour/day/week/whatever. That opens all kinds of interesting optimization possibilities in your data schema.
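As a toy illustration of the comments-in-blocks idea from the list above (a plain dict stands in for the document store; the key layout is invented for this sketch):

# 'Comments in blocks of 20': one UX page maps to one document,
# so each page refresh is a single get. A dict stands in for the
# document store; the key layout is invented for this sketch.
BLOCK_SIZE = 20
store = {}  # doc_id -> document

def add_comment(post_id, comment):
    # Append to the newest block; start a new block when one fills up.
    meta = store.setdefault(post_id + "/meta", {"blocks": 0})
    block_id = "%s/comments/%d" % (post_id, meta["blocks"])
    block = store.setdefault(block_id, [])
    block.append(comment)
    if len(block) == BLOCK_SIZE:
        meta["blocks"] += 1

def get_page(post_id, page):
    # One document fetch per page of BLOCK_SIZE comments.
    return store.get("%s/comments/%d" % (post_id, page), [])

for i in range(45):
    add_comment("post1", "comment %d" % i)
print([len(get_page("post1", p)) for p in range(3)])  # [20, 20, 5]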
This is just the tip of the iceberg. I suggest reading up a little on column-based NoSQL systems (such as Cassandra).
How you model your documents depends mostly on your application and its domain. From there, the document model can be refined by understanding your data access patterns.
Blindly attempting to map a relational data model to a non-relational one is probably not a good idea.
UPDATE: I think Matt got the main idea of my point here. What I am trying to say is that there is no prescribed method (that I am aware of anyway) to translate a relational data model (like a normalized SQL Schema) to a non-relational data model (like a document model) without understanding and considering the domain of the application. Let me elaborate a bit here...
After looking at your SQL schema, I have no idea what a trace is besides a table that appears to join Applications and Deployments. I also have no idea how your application typically queries the data. Knowing a little about this makes a difference when you model your documents, just as it would make a difference in the way you model your application objects (or domain objects).
So the document model suggested in your question may or may not work for your application.

Solandra to replace our Lucene + RDBMS?

Currently we are using a combination of SQL Server and Lucene to index some relational data about domain names. We have a Domain table, and about 10 various other tables for histories of different metrics we calculate and store about domains. For example:
Domain
Id BIGINT
Domain NVARCHAR
IsTracked BIT
SeoScore
Id BIGINT
DomainId BIGINT
Score INT
Timestamp DATETIME
We are trying to include all the domains from major zone files in our database, so we are looking at about 600 million records eventually, which seems like it's going to be a bit of a chore to scale in SQL Server. Given our reliance on Lucene to do some pretty advanced queries, Solandra seems like it may be a nice fit. I am having a hard time not thinking about our data in relational database terms.
The SeoScore table would map one to many Domains (one record for each time we calculated the score). I'm thinking that in Solandra terms, the best way to achieve this would be use two indexes, one for Domain and one for SeoScore.
Here are the querying scenarios we need to achieve:
A 'current snapshot' of the latest metrics for each domain (so the latest SeoScore for a given domain). I'm assuming we would find the Domain records we want first, and then run further queries to get the latest snapshot of each metric separately.
Domains with SeoScores not having been checked since x datetime, and having IsTracked=1, so we would know which ones need to be recalculated. We would need some sort of batching system here so we could 'check out' domains and run calculations on them without duplicating efforts.
Am I way off track here? Would we be right in basically mapping our tables to separate indexes in Solandra in this case?
UPDATE
Here's some JSON notation of what I'm thinking:
Domains : { //Index
domain1.com : { //Document ID
Middle : "domain1", //Field
Extension : "com",
Created : '2011-01-01 01:01:01.000',
ContainsDashes : false,
ContainsNumbers : false,
IsIDNA : false,
},
domain2.com : {
...
}
}
SeoScores : { //Index
domain1.com : { //Document ID
'2011-02-01 01:01:01.000' : {
SeoScore: 3
},
'2011-01-01 01:01:01.000' : {
SeoScore: -1
}
},
domain2.com : {
...
}
}
For SeoScores you might want to consider using virtual cores:
https://github.com/tjake/Solandra/wiki/ManagingCores
This lets you partition the data by domain, so you can have SeoScores.domain1 and make each document represent one timestamp.
The rest sounds fine.