Modelling NoSQL database (when converting from SQL database)

I have a SQL database that I want to convert to a NoSQL one (currently I'm using RavenDB).
Here are my tables:
Trace:
ID (PK, bigint, not null)
DeploymentID (FK, int, not null)
AppCode (int, not null)
Deployment:
DeploymentID (PK, int, not null)
DeploymentVersion (varchar(10), not null)
DeploymentName (nvarchar(max), not null)
Application:
AppID (PK, int, not null)
AppName (nvarchar(max), not null)
Currently I have these rows in my tables:
Trace:
ID: 1 , DeploymentID: 1, AppCode: 1
ID: 2 , DeploymentID: 1, AppCode: 2
ID: 3 , DeploymentID: 1, AppCode: 3
ID: 4 , DeploymentID: 2, AppCode: 1
Deployment:
DeploymentID: 1 , DeploymentVersion: 1.0, DeploymentName: "Test1"
DeploymentID: 2 , DeploymentVersion: 1.0, DeploymentName: "Test2"
Application:
AppID: 1 , AppName: "Test1"
AppID: 2 , AppName: "Test2"
AppID: 3 , AppName: "Test3"
My question is: HOW should I build my NoSQL document model?
Should it look like:
trace/1
{
    "Deployment": [ { "DeploymentVersion": "1.0", "DeploymentName": "Test1" } ],
    "Application": "Test1"
}
trace/2
{
    "Deployment": [ { "DeploymentVersion": "1.0", "DeploymentName": "Test1" } ],
    "Application": "Test2"
}
trace/3
{
    "Deployment": [ { "DeploymentVersion": "1.0", "DeploymentName": "Test1" } ],
    "Application": "Test3"
}
trace/4
{
    "Deployment": [ { "DeploymentVersion": "1.0", "DeploymentName": "Test2" } ],
    "Application": "Test1"
}
And what if Deployment 1 gets changed? Should I go through each document and change the data?
And when should I use references in NoSQL?

Document databases such as Raven are not relational databases. You CANNOT first build the database model and then later on decide on various interesting ways of querying it. Instead, you should first determine what access patterns you want to support, and then design the document schemas accordingly.
So in order to answer your question, what we really need to know is how you intend to use the data. For example, displaying all traces ordered by time is a distinctly different scenario than displaying traces associated with a specific deployment or application. Each one of those requirements will dictate a different design, as will supporting them both.
This in itself may be useful information to you (?), but I suspect you want more concrete answers :) So please add some additional details on your intended usage.
There are a few "do"s and "don't"s when deciding on a strategy:
DO: Optimize for the common use-cases. There is often a 20/80 breakdown where 20% of the UX drives 80% of the load - the homepage/landing page of web apps is a classic example. The first priority is to make sure that these are as efficient as possible. Make sure that your data model either A) allows loading those in a single IO request or B) is cache-friendly.
DON'T: Fall into the dreaded "N+1" trap. This pattern occurs when your data model forces you to make N calls in order to load N entities, often preceded by an additional call to get the list of the N IDs. This is a killer, especially together with the next point...
DO: Always cap (via the UX) the amount of data you are willing to fetch. If the user has 3729 comments you obviously aren't going to fetch them all at once. Even if it were feasible from a database perspective, the user experience would be horrible. That's why search engines use the "next 20 results" paradigm. So you can (for example) align the database structure to the UX and save the comments in blocks of 20. Then each page refresh involves a single DB get.
DO: Balance the read and write requirements. Some types of systems are read-heavy and you can assume that for each write there will be many reads (StackOverflow is a good example). There it makes sense to make writes more expensive in order to gain benefits in read performance, for example via data denormalization and duplication. Other systems are evenly balanced or even write-heavy and require other approaches.
DO: Use the dimension of TIME to your advantage. Twitter is a classic example: 99.99% of tweets will never be accessed after the first hour/day/week/whatever. That opens up all kinds of interesting optimization possibilities in your data schema.
This is just the tip of the iceberg. I suggest reading up a little on column-based NoSQL systems (such as Cassandra)
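To make the trade-off concrete for your Trace example, here is a minimal sketch of the two extremes, written as plain Python dicts (the "traces/1" / "deployments/1" style IDs just follow the usual RavenDB document-ID convention, and the field names are assumptions, not a prescription):

# A minimal sketch (plain Python dicts) of the two modelling extremes for a Trace.
# The document IDs ("traces/1", "deployments/1") and field names are hypothetical;
# which option is right depends entirely on your access patterns.

# Option A: reference the deployment and application by ID. Reads that need
# deployment details cost an extra load (RavenDB can bring the referenced
# document back in the same round trip via Include), but renaming Deployment 1
# touches exactly one document.
trace_with_references = {
    "Id": "traces/1",
    "DeploymentId": "deployments/1",
    "ApplicationId": "applications/1",
}

# Option B: denormalize a snapshot of the deployment/application into the trace.
# Every read is a single document load, but changing Deployment 1 means patching
# every trace that embeds it (fine if deployments are effectively immutable).
trace_denormalized = {
    "Id": "traces/1",
    "Deployment": {"DeploymentVersion": "1.0", "DeploymentName": "Test1"},
    "Application": "Test1",
}

print(trace_with_references)
print(trace_denormalized)

As a rough rule, a reference is the better fit when the referenced data changes independently and is shared by many documents, while embedding wins when traces are read far more often than deployments change, or when the embedded data is meant to be an immutable snapshot of how things looked at trace time.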

How you model your documents depends mostly on your application and its domain. From there, the document model can be refined by understanding your data access patterns.
Blindly attempting to map a relational data model to a non-relational one is probably not a good idea.
UPDATE: I think Matt got the main idea of my point here. What I am trying to say is that there is no prescribed method (that I am aware of anyway) to translate a relational data model (like a normalized SQL Schema) to a non-relational data model (like a document model) without understanding and considering the domain of the application. Let me elaborate a bit here...
After looking at your SQL schema, I have no idea what a trace is besides a table that appears to join Applications and Deployments. I also have no idea how your application typically queries the data. Knowing a little about this makes a difference when you model your documents, just as it would make a difference in the way you model your application objects (or domain objects).
So the document model suggested in your question may or may not work for your application.

Related

Unnesting a big quantity of columns in BigQuery and BigTable

I have a table in BigTable, with a single column family, containing some lead data. I was following the Google Cloud guide to querying BigTable data from BigQuery (https://cloud.google.com/bigquery/external-data-bigtable) and so far so good.
I've created the table definition file, like the docs required:
{
    "sourceFormat": "BIGTABLE",
    "sourceUris": [
        "https://googleapis.com/bigtable/projects/{project_id}/instances/{instance_id}/tables/{table_id}"
    ],
    "bigtableOptions": {
        "readRowkeyAsString": "true",
        "columnFamilies": [
            {
                "familyId": "leads",
                "columns": [
                    {
                        "qualifierString": "Id",
                        "type": "STRING"
                    },
                    {
                        "qualifierString": "IsDeleted",
                        "type": "STRING"
                    },
                    ...
                ]
            }
        ]
    }
}
But then, things started to go south...
This is how the BigQuery "table" ended up looking:
Each row is a row key, and inside each column there's a nested cell structure; the only value I need is the value from leads.Id.cell (in this case).
After a bit of searching I found a solution to this:
https://stackoverflow.com/a/70728545/4183597
So in my case it would be something like this:
SELECT
ARRAY_TO_STRING(ARRAY(SELECT value FROM UNNEST(leads.Id.cell)), "") AS Id,
...
FROM xxx
The problem is that I'm dealing with a dataset with more than 600 columns per row. It is infeasible (and impossible, given BigQuery's subquery limits) to repeat this process more than 600 times per row/query.
I couldn't think of a way to automate this query or even think about other methods to unnest this many cells (my SQL knowledge stops here).
Is there any way to do an unnesting like this for 600+ columns with an SQL/BigQuery query? Preferably in a more efficient way? If not, I'm thinking of doing a daily batch process, using a simple Python connector from BigTable to BigQuery, but I'm afraid of the costs this would incur.
Any documentation, reference or idea will be greatly appreciated.
Thank you.
In general, you're setting yourself up for a world of pain when you try to query a NoSQL database (like BigTable) using SQL. Unnesting data is a very expensive operation in SQL because you're effectively performing a cross join (which is many-to-many) every time UNNEST is called, so trying to do that 600+ times will give you either a query timeout or a huge bill.
The BigTable API will be way more efficient than SQL since it's designed to query NoSQL structures. A common pattern is to have a script that runs daily (such as a Python script in a Cloud Function) and uses the API to get that day's data, parse it, and then output that to a file in Cloud Storage. Then you can query those files via BigQuery as needed. A daily script that loops through all the columns of your data without requiring extensive data transforms is usually cheap and definitely less expensive than trying to force it through SQL.
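For illustration, here is a rough Python sketch of that daily export using the google-cloud-bigtable client; the project/instance/table names and the output file are placeholders, and only the "leads" column family from your table definition is assumed:

# Sketch of the daily batch approach, assuming the Python google-cloud-bigtable
# client; "my-project" / "my-instance" / "my-table" are placeholders, and the
# "leads" column family comes from the question's table definition.
import json
from google.cloud import bigtable

client = bigtable.Client(project="my-project")
instance = client.instance("my-instance")
table = instance.table("my-table")

with open("leads.ndjson", "w") as out:
    for row in table.read_rows():  # consider a row-key range to limit this to "today"
        cells = row.cells.get("leads", {})
        record = {"rowkey": row.row_key.decode()}
        # Take the latest cell value for every qualifier; 600+ columns are handled
        # in one loop instead of 600 hand-written expressions.
        for qualifier, versions in cells.items():
            record[qualifier.decode()] = versions[0].value.decode()
        out.write(json.dumps(record) + "\n")

# The newline-delimited JSON file can then be copied to Cloud Storage and exposed
# to BigQuery as a much flatter external or native table.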
That being said, if you're really set on using SQL, you might be able to use BigQuery's JSON functions to extract the nested data you need. It's hard to visualize what your data structure is without sample data, but you may be able to read the whole row in as a single column of JSON or a string. Then if you have a predictable path for the values you are looking to extract, you could use a function like JSON_EXTRACT_STRING_ARRAY to extract all of those values into an array. A Regex function could be used similarly as well. But if you need to do this kind of parsing on the whole table in order to query it, a batch job to transform the data first will still be much more efficient.
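If you do stay in SQL for now, one low-tech workaround for the hand-writing problem is to generate the repetitive SELECT expressions from a list of qualifiers instead of typing 600 of them. Note this does nothing about the UNNEST cost or the query-size limits mentioned in the question; it only automates the typing. A hypothetical sketch:

# Hypothetical helper that generates the repetitive UNNEST expressions from a
# list of qualifiers. The table name "xxx" and the "leads" family follow the
# question; the generated query still pays the full UNNEST cost when run.
qualifiers = ["Id", "IsDeleted"]  # ... extend with the remaining 600+ column names

select_list = ",\n  ".join(
    f'ARRAY_TO_STRING(ARRAY(SELECT value FROM UNNEST(leads.{q}.cell)), "") AS {q}'
    for q in qualifiers
)
query = f"SELECT\n  {select_list}\nFROM xxx"
print(query)  # paste into BigQuery, or submit it via the BigQuery client library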

What would be the downsides of storing many-to-many relates in a JSON field instead of in a tertiary table?

Say I intend to have the following objects stored in a relational database (pseudo-code):
Domain:
id: 123
Name: A
Variables: Thing_X, Thing_Y, Thing_Z
Domain:
id: 789
Name: B
Variables: Thing_W, Thing_X, Thing_Y
I know the standard way to structure the many-to-many relationship between Domains and Variables would be to use a tertiary table. However, I think I can do some interesting stuff if I represent the relates in JSON, and I would like to know the deficiencies of doing something like the following:
Domain:
id: 123
name: A
variable_relates_JSON: [
    {table: 'Variable', id: 314, name: 'Thing_X'},
    ...
]
Variable:
id: 314
name: Thing_X
domain_relates_JSON: [
    {table: 'Domain', id: 123, name: 'A'},
    ...
]
I've made another post more specifically about the time complexity of this JSON method versus using a tertiary table. I'm happy to hear answers to that question here as well. But, I'm also interested in general challenges I may encounter with this approach.
JSON strings incur the overhead of having to store the name of the field as well as the value. This multiplies the size of the data.
In addition, dates and numbers are stored as strings. In a regular database format, 2020-01-01 occupies 4 bytes or so versus the 10 bytes in a string (not including the delimiters). Similarly for numbers.
More space required for the data slows down databases. Then, searching for a particular JSON requires scanning or parsing the JSON string. Some databases provide support for binary JSON formats or strings to facilitate this, but you have to set that up as part of the table.
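To put a rough number on the overhead described above, here is a tiny, purely illustrative comparison of one relation stored as JSON text versus fixed-width binary columns (the field names mirror the example in the question):

# Rough illustration of the storage overhead described above: the same relation
# stored as JSON text versus fixed-width binary columns. Numbers are illustrative.
import json
import struct

as_json = json.dumps({"table": "Variable", "id": 314, "date": "2020-01-01"})
as_binary = struct.pack("<iHBB", 314, 2020, 1, 1)  # id plus a packed date, no field names

print(len(as_json), "bytes as JSON text")          # field names and quoting repeat per row
print(len(as_binary), "bytes as fixed-width columns")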
Here's an issue I thought of: CONCURRENT UPDATES.
Continuing the example above, let's say users assign both Variable Thing_XX and Variable Thing_YY to Domain A.
For each assignment, I will have to get the JSON and add the relevant id somewhere in its structure. If either user gets the JSON before the other finishes their assignment, then it will overwrite the assignment of the other.
A scrappy solution might be to somehow 'lock' the field while someone is editing it. But, that could become quite problematic.
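Here is a small sketch of that lost update, using sqlite3 only so it runs anywhere; the table and column names are made up to match the example above:

# Sketch of the lost-update hazard described above, using sqlite3 for brevity.
# Two "users" both read the JSON, each appends its own relation, and the second
# write silently discards the first.
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE domain (id INTEGER PRIMARY KEY, variable_relates_json TEXT)")
db.execute("INSERT INTO domain VALUES (123, '[]')")

def read_json():
    return json.loads(db.execute(
        "SELECT variable_relates_json FROM domain WHERE id = 123").fetchone()[0])

# Both users read before either writes (the read-modify-write race):
user_a = read_json()
user_b = read_json()
user_a.append({"table": "Variable", "id": 314, "name": "Thing_XX"})
user_b.append({"table": "Variable", "id": 315, "name": "Thing_YY"})
db.execute("UPDATE domain SET variable_relates_json = ? WHERE id = 123", (json.dumps(user_a),))
db.execute("UPDATE domain SET variable_relates_json = ? WHERE id = 123", (json.dumps(user_b),))

print(read_json())  # only Thing_YY survives; Thing_XX's assignment was overwritten

# With a tertiary table each assignment is an independent INSERT, so concurrent
# writes cannot clobber each other; row locking (SELECT ... FOR UPDATE) or an
# optimistic version column are the usual fixes if the JSON column is kept.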

How to properly store a JSON object into a Table?

I am working on a scenario where I have invoices available in my Data Lake Store.
Invoice example (extremely simplified):
{
    "business_guid": "b4f16300-8e78-4358-b3d2-b29436eaeba8",
    "ingress_timestamp": 1523053808,
    "client": {
        "name": "Jake",
        "age": 55
    },
    "transactions": [
        {
            "name": "peanut",
            "amount": 100
        },
        {
            "name": "avocado",
            "amount": 2
        }
    ]
}
All invoices are stored in ADLS and can be queried. But it is my desire to provide access to the same data inside an ADL DB.
I am not an expert on unstructured data: I have an RDBMS background. Taking that into consideration, I can only think of 2 possible scenarios:
2/3 tables - invoice, client (could be removed) and transaction. In this scenario, I would have to create an invoice ID to be able to build relationships between those tables
1 table - client info could be denormalized into the invoice data. But transactions could (maybe) be defined as an SQL.ARRAY<SQL.MAP<string, object>>
I have mainly 3 questions:
What is the correct way of doing so? Solution 1 seems much better structured.
If I go with solution 1, how do I properly create an ID (probably GUID)? Is it acceptable to require ID creation when working with ADL?
Is there another solution I am missing here?
Thanks in advance!
This type of question is a bit like do you prefer your sauce on the pasta or next to the pasta :). The answer is: it depends.
To answer your 3 questions more seriously:
#1 has the benefit of being normalized, which works well if you want to operate on the data separately (e.g., just clients, just invoices, just transactions), want the benefits of normalization and the right indexing, and are not limited by row-size limits (e.g., your array of maps needs to fit into a row). So I would recommend that approach unless your transaction data is always small and you always access the data together and mainly search on the column data.
U-SQL per se has no understanding of the hierarchy of the JSON document. Thus, you would have to write an extractor that turns your JSON into rows in a way that either preserves the correlation of the parent to the child (normally done by stepwise downward navigation with CROSS APPLY), using the key value of the parent data item as the foreign key, or has the extractor generate the key (as an int or GUID).
There are some sample JSON extractors on the U-SQL GitHub site (start at http://usql.io) that can get you started with the JSON-to-rowset conversion. Note that you will probably want to optimize the extraction at some point to be JSON-reader based, so you can process larger docs without loading them into memory.
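The linked extractors are C#/U-SQL, but the shape of the transformation, including the generated key from question #2, can be sketched in a few lines of Python; all names here are illustrative only:

# Rough sketch of the "solution 1" split: one generated invoice key (a GUID, per
# question #2) ties the invoice, client and transaction rows together. This mirrors
# what a U-SQL JSON extractor would emit; the row shapes are illustrative.
import json
import uuid

raw = """{
  "business_guid": "b4f16300-8e78-4358-b3d2-b29436eaeba8",
  "ingress_timestamp": 1523053808,
  "client": {"name": "Jake", "age": 55},
  "transactions": [{"name": "peanut", "amount": 100},
                   {"name": "avocado", "amount": 2}]
}"""

doc = json.loads(raw)
invoice_id = str(uuid.uuid4())  # generated key, since the source JSON has none

invoice_row = (invoice_id, doc["business_guid"], doc["ingress_timestamp"])
client_row = (invoice_id, doc["client"]["name"], doc["client"]["age"])
transaction_rows = [(invoice_id, t["name"], t["amount"]) for t in doc["transactions"]]

print(invoice_row)
print(client_row)
print(transaction_rows)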

How to query multiple aggregates efficiently with DDD?

When I need to invoke some business method, I need to get all aggregate roots related to the operation, even if the operation is as primitive as the one given below (just adding an item into a collection). What am I missing? Or is a CRUD-based approach, where you run one single query including table joins, selects and an insert at the end, and the database engine does all the work for you, actually better in terms of performance?
In the code below I need to query a separate aggregate root (which creates another database connection and sends another select query). In real-world applications I have been querying a lot more than one single aggregate, up to 8 for a single business action. How can I improve performance/query overhead?
Domain aggregate roots:
class Device
{
    Set<ParameterId> parameters;

    void AddParameter(Parameter parameter)
    {
        parameters.Add(parameter.Id);
    }
}

class Parameter
{
    ParameterId Id { get; }
}
Application layer:
class DeviceApplication
{
    private DeviceRepository _deviceRepo;
    private ParameterRepository _parameterRepo;

    void AddParameterToDevice(string deviceId, string parameterId)
    {
        var aParameterId = new ParameterId(parameterId);
        var aDeviceId = new DeviceId(deviceId);

        var parameter = _parameterRepo.FindById(aParameterId);
        if (parameter == null) throw;

        var device = _deviceRepo.FindById(aDeviceId);
        if (device == null) throw;

        device.AddParameter(parameter);
        _deviceRepo.Save(device);
    }
}
Possible solution
I've been told that you can pass just an Id of another aggregate like this:
class Device
{
    void AddParameter(ParameterId parameterId)
    {
        parameters.Add(parameterId);
    }
}
But IMO it breaks encapsulation (by explicitly pushing the term ID into the business logic), and it doesn't prevent passing a wrong or otherwise incorrect identity (created by the user).
And Vaughn Vernon gives examples of application services that use the first approach (passing the whole aggregate instance).
The short answer is - don't query your aggregates at all.
An aggregate is a model that exposes behaviour, not data. Generally, it is considered a code smell to have getters on aggregates (ID is the exception). This makes querying a little tricky.
Broadly speaking there are 2 related ways to go about solving this. There are probably more but at least these don't break the encapsulation.
Option 1: Use domain events -
By getting your domain (aggregate roots) to emit events which illustrate the changes to internal state you can build up tables in your database specifically designed for querying. Done right you will have highly performant, denormalised queryable data, which can be linearly scaled if necessary. This makes for very simple queries. I have an example of this on this blog post: How to Build a Master-Details View when using CQRS and Event Sourcing
Option 2: Infer query tables -
I'm a huge fan of option 1 but if you don't have an event sourced approach you will still need to persist the state of your aggregates at some point. There are all sorts of ways to do this but you could plug into the persistence pipeline for your aggregates a process whereby you extract queryable data into a read model for use with your queries.
I hope that makes sense.
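For what it's worth, here is a minimal sketch of option 1, written in Python with invented class and event names: the aggregate records an event, and a projector folds it into a flat read model that queries hit directly:

# Minimal sketch of option 1: the aggregate emits an event; a projector folds the
# event into a flat read model that queries hit directly. All names are illustrative.
from dataclasses import dataclass, field

@dataclass
class ParameterAddedToDevice:          # domain event emitted by the Device aggregate
    device_id: str
    parameter_id: str

@dataclass
class Device:                          # behaviour only, no query getters
    device_id: str
    _parameters: set = field(default_factory=set)
    _pending_events: list = field(default_factory=list)

    def add_parameter(self, parameter_id: str) -> None:
        if parameter_id not in self._parameters:
            self._parameters.add(parameter_id)
            self._pending_events.append(
                ParameterAddedToDevice(self.device_id, parameter_id))

# Read side: a denormalised "device parameters" view kept up to date by events.
device_parameters_view = {}

def project(event: ParameterAddedToDevice) -> None:
    device_parameters_view.setdefault(event.device_id, []).append(event.parameter_id)

device = Device("device-1")
device.add_parameter("parameter-7")
for event in device._pending_events:   # a real app service would collect and dispatch these
    project(event)

print(device_parameters_view)          # queries read this view, never the aggregates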
If you figured out that an RDBMS query with joins would work in this case, you probably have the wrong aggregate boundaries.
For example, why would you need to load the Parameter in order to add it to the Device? You already have the identity of this Parameter; all you need to do is add this id to the list of referenced Parameters in the Device. If you do it in order to satisfy your ORM, you're most probably doing something wrong.
Also remember that your aggregate is the transactional boundary. You really would want to complete all database operations inside one transaction and one connection.
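A minimal sketch of that id-only style (Python, with a hypothetical cheap `exists` lookup on the parameter repository that is deliberately not an aggregate load):

# Sketch of the id-only approach from this answer: only the Device aggregate is
# loaded and saved inside the transaction; the Parameter is never fetched.
# parameter_repo.exists(...) is a hypothetical, cheap existence check.
class DeviceApplication:
    def __init__(self, device_repo, parameter_repo):
        self._device_repo = device_repo
        self._parameter_repo = parameter_repo

    def add_parameter_to_device(self, device_id: str, parameter_id: str) -> None:
        if not self._parameter_repo.exists(parameter_id):     # guards against bad ids
            raise ValueError(f"unknown parameter {parameter_id}")
        device = self._device_repo.find_by_id(device_id)      # the one aggregate we load
        if device is None:
            raise ValueError(f"unknown device {device_id}")
        device.add_parameter(parameter_id)                    # the id goes straight in
        self._device_repo.save(device)                        # single-aggregate transaction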

Solandra to replace our Lucene + RDBMS?

Currently we are using a combination of SQL Server and Lucene to index some relational data about domain names. We have a Domain table, and about 10 other various other tables for histories of different metrics we calculate and store about domains. For example:
Domain
Id BIGINT
Domain NVARCHAR
IsTracked BIT
SeoScore
Id BIGINT
DomainId BIGINT
Score INT
Timestamp DATETIME
We are trying to include all the domains from major zone files in our database, so we are looking at about 600 million records eventually, which seems like it's going to be a bit of a chore to scale in SQL Server. Given our reliance on Lucene to do some pretty advanced queries, Solandra seems like it may be a nice fit. I am having a hard time not thinking about our data in relational database terms.
The SeoScore table would map one to many Domains (one record for each time we calculated the score). I'm thinking that in Solandra terms, the best way to achieve this would be use two indexes, one for Domain and one for SeoScore.
Here are the querying scenarios we need to achieve:
A 'current snapshot' of the latest metrics for each domain (so the latest SeoScore for a given domain). I'm assuming we would find the Domain records we want first, and then run further queries to get the latest snapshot of each metric separately.
Domains with SeoScores not having been checked since x datetime, and having IsTracked=1, so we would know which ones need to be recalculated. We would need some sort of batching system here so we could 'check out' domains and run calculations on them without duplicating efforts.
Am I way off track here? Would we be right in basically mapping our tables to separate indexes in Solandra in this case?
UPDATE
Here's some JSON notation of what I'm thinking:
Domains : { //Index
    domain1.com : { //Document ID
        Middle : "domain1", //Field
        Extension : "com",
        Created : '2011-01-01 01:01:01.000',
        ContainsDashes : false,
        ContainsNumbers : false,
        IsIDNA : false
    },
    domain2.com : {
        ...
    }
}
SeoScores : { //Index
    domain1.com : { //Document ID
        '2011-02-01 01:01:01.000' : {
            SeoScore: 3
        },
        '2011-01-01 01:01:01.000' : {
            SeoScore: -1
        }
    },
    domain2.com : {
        ...
    }
}
For SeoScores you might want to consider using virtual cores:
https://github.com/tjake/Solandra/wiki/ManagingCores
This lets you partition the data by domain, so you can have SeoScores.domain1 and make each document represent one timestamp.
The rest sounds fine.