FlatBuffers schema: vectors of unions

I got this error while trying to use vectors of unions:
error: Vectors of unions are not yet supported in all the specified
programming languages.
Clearly FlatBuffers doesn't support vectors of unions, so I need another data type to solve my problem. Here's my case:
I'm using the Entity Component System (ECS) model. I have 3 entities and 3 components, with the following structure:
EntityA: component1, component3
EntityB: component1, component2
EntityC: component3
If I could use vectors of unions, the schema would look like this:
union Components { Component1, Component2, Component3 }
table Update {
  component:[Components];
}
where each ComponentN is a table. I actually have a solution that works without vectors of unions:
table Update {
  component1:[Component1];
  component2:[Component2];
  component3:[Component3];
}
But as the number of component types grows, this becomes unmanageable.
I'm sorry, I'm using ECS and this is actually for game development. But the question isn't about the game itself, so I think this is the right place to ask this kind of problem.
How can I solve this without vectors of unions, in a way that is better than the solution above?

Yes, vectors of unions are a new feature (added just a few weeks ago) that so far is only available in C++.
The traditional way is to create a table Component { c:Components; } to wrap the union value, and then make a [Component] vector out of those wrappers.
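A minimal sketch of that wrapper-table schema, using the names from the question (the vector field name is just a placeholder):
union Components { Component1, Component2, Component3 }
// Wrapper table holding a single union value.
table Component {
  c:Components;
}
table Update {
  components:[Component];
}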
Using multiple vectors may indeed become inefficient if the number of components is high.

Related

Is there a more efficient and readable alternative to nested dictionaries with different keys in Python?

I have a data structure in my application which takes the form of an enormous nested dictionary. It is difficult to work with and visualize, and I'm trying to see if there is an alternative. The structure is as follows:
{top_level_key: {key: value,
                 key: value,
                 ....
                 other_dict_key: {other_key: other_value,
                                  other_key: other_value,
                                  other_key: other_value}},
 top_level_key: {...}
}
I thought about using a pandas MultiIndex, but the fact that the keys are different in the inner dictionaries makes this impossible (I think), or just not useful, because I would end up with a bunch of null values for the outer dictionary entries for those keys.
Any ideas would be much appreciated.
If you just need to visualize your data more easily, I would suggest using the pprint module from the Python standard library, which acts as a pretty printer that is especially useful for nested data structures.
See this part of the documentation for usage example.
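For example, a minimal sketch (the keys are just stand-ins for the real data):
from pprint import pprint

# Small nested structure standing in for the real data (illustrative keys only).
data = {
    "top_level_key": {
        "key": "value",
        "other_dict_key": {"other_key": "other_value",
                           "another_key": "another_value"},
    },
    "second_top_level_key": {"key": "value"},
}

# pprint indents nested containers, which makes deep nesting much easier to scan.
pprint(data, width=60)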

Select random N records from GraphQL query

I am building a simple quiz app that will allow a user to choose various categories and generate a 5-question quiz to test their knowledge. I have a long list of questions set up in AppSync, accessible via GraphQL. However, as that list keeps growing, it doesn't make sense for me to pull these to the client and randomly select there.
Does GraphQL support choosing random 5 from a query? Such that, serverside, I can select just 5 records at random?
query listAll {
  listQuestions(filter: {
    topic: {
      contains: "chocolate"
    }
  }) {
    items {
      question
      answer
    }
  }
}
I have thought about other approaches such as randomly assigning each record a number and filtering on this, but this would not be random each time.
Any ideas?
Does GraphQL support choosing random 5 from a query?
Not directly, no. Most of the more "interesting" things you might imagine doing in an SQL query, even simpler things like "return only the first 10 records" or "has a family name of 'Jones'", aren't directly supported in GraphQL. You have to build this sort of thing out of the primitives it gives you.
Such that, serverside, I can select just 5 records at random?
Most GraphQL server implementations support resolver functions which are arbitrary code called when a field value is requested. You could write a schema like
type Query {
  listQuestions(filter: QuestionFilter, random: Int): [Question!]!
}
and get access to the arguments in the resolver function.
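As a rough, framework-agnostic sketch of what such a resolver could do (the storage access is faked with an in-memory list, and all names are illustrative):
import random

# Stand-in for whatever the real resolver reads from (database, AppSync data source, ...).
QUESTIONS = [
    {"question": "Q%d" % i, "answer": "A%d" % i, "topic": "chocolate"}
    for i in range(20)
]

def resolve_list_questions(topic_contains, sample_size=5):
    # Filter server-side, then return a uniform random sample of the survivors.
    matching = [q for q in QUESTIONS if topic_contains in q["topic"]]
    # random.sample raises ValueError if asked for more items than exist, so clamp.
    return random.sample(matching, min(sample_size, len(matching)))

print(resolve_list_questions("chocolate"))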
It looks like AppSync has its own resolver system. It's not obvious to me from paging through the documentation that it supports a "pick n at random" method; it seems to be mostly designed as a facade around database storage, and most databases aren't optimized for this kind of query.
David is right about writing this logic inside a resolver (that is the GraphQL way of doing it).
If you are using AWS AppSync, you can use a Lambda resolver and attach it to the query, so you can write the logic to pick random values inside the Lambda and have it become part of the GraphQL response. This is one way of doing it.
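A rough sketch of such a Lambda handler (the questions are faked in memory, and the event shape assumes a direct Lambda resolver where AppSync passes the GraphQL arguments in event["arguments"]; adjust for a custom request mapping template):
import random

# Faked data source; a real Lambda would query DynamoDB or another backend here.
QUESTIONS = [{"question": "Q%d" % i, "answer": "A%d" % i} for i in range(50)]

def handler(event, context):
    # Number of questions to return, taken from the GraphQL arguments if present.
    count = event.get("arguments", {}).get("count", 5)
    return random.sample(QUESTIONS, min(count, len(QUESTIONS)))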

How to properly store a JSON object into a Table?

I am working on a scenario where I have invoices available in my Data Lake Store.
Invoice example (extremely simplified):
{
  "business_guid": "b4f16300-8e78-4358-b3d2-b29436eaeba8",
  "ingress_timestamp": 1523053808,
  "client": {
    "name": "Jake",
    "age": 55
  },
  "transactions": [
    {
      "name": "peanut",
      "amount": 100
    },
    {
      "name": "avocado",
      "amount": 2
    }
  ]
}
All invoices are stored in ADLS and can be queried. But it is my desire to provide access to the same data inside an ADL database.
I am not an expert on unstructured data; I have an RDBMS background. Taking that into consideration, I can only think of 2 possible scenarios:
2/3 tables - invoice, client (could be removed) and transaction. In this scenario, I would have to create an invoice ID to be able to build relationships between those tables.
1 table - client info could be folded (denormalized) into the invoice data. But transactions could (maybe) be defined as an SQL.ARRAY<SQL.MAP<string, object>>.
I have mainly 3 questions:
What is the correct way of doing so? Solution 1 seems much better structured.
If I go with solution 1, how do I properly create an ID (probably GUID)? Is it acceptable to require ID creation when working with ADL?
Is there another solution I am missing here?
Thanks in advance!
This type of question is a bit like asking whether you prefer your sauce on the pasta or next to the pasta :). The answer is: it depends.
To answer your 3 questions more seriously:
#1 has the benefit of being normalized, which works well if you want to operate on the data separately (e.g., just clients, just invoices, just transactions), want the benefits of normalization and the right indexing, and are not limited by row-size limits (e.g., your array of maps would need to fit into a row). So I would recommend that approach unless your transaction data is always small, you always access the data together, and you mainly search on the column data.
U-SQL per se has no understanding of the hierarchy of the JSON document. Thus, you would have to write an extractor that turns your JSON into rows in a way that either gives you the correlation of the parent to the child (normally done by stepwise downwards navigation with CROSS APPLY) and uses the key value of the parent data item as the foreign key, or has the extractor generate the key (as an int or GUID).
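To illustrate the second variant (a generated surrogate key shared by the child rows), here is a sketch in Python rather than U-SQL, using the simplified invoice from the question; the row shapes are only assumptions:
import uuid

# Simplified invoice, matching the example above.
invoice = {
    "business_guid": "b4f16300-8e78-4358-b3d2-b29436eaeba8",
    "ingress_timestamp": 1523053808,
    "client": {"name": "Jake", "age": 55},
    "transactions": [
        {"name": "peanut", "amount": 100},
        {"name": "avocado", "amount": 2},
    ],
}

# Generate a surrogate key for the invoice and reuse it as the foreign key
# in the client and transaction rows (the same idea as a key-generating extractor).
invoice_id = uuid.uuid4()

invoice_row = (invoice_id, invoice["business_guid"], invoice["ingress_timestamp"])
client_row = (invoice_id, invoice["client"]["name"], invoice["client"]["age"])
transaction_rows = [(invoice_id, t["name"], t["amount"]) for t in invoice["transactions"]]

print(invoice_row)
print(client_row)
print(transaction_rows)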
There are some sample JSON extractors on the U-SQL GitHub site (start at http://usql.io) that can get you started with the JSON-to-rowset conversion. Note that you will probably want to optimize the extraction at some point to be JSON-reader based so you can process larger documents without loading them into memory.

How to query multiple aggregates efficiently with DDD?

When I need to invoke some business method, I need to get all the aggregate roots related to the operation, even if the operation is as primitive as the one given below (just adding an item into a collection). What am I missing? Or is a CRUD-based approach, where you run one single query (including table joins, selects, and an insert at the end) and the database engine does all the work for you, actually better in terms of performance?
In the code below I need to query a separate aggregate root (which creates another database connection and sends another select query). In real-world applications I have been querying a lot more than one single aggregate, up to 8 for a single business action. How can I improve performance/query overhead?
Domain aggregate roots:
class Device
{
    Set<ParameterId> parameters;

    void AddParameter(Parameter parameter)
    {
        parameters.Add(parameter.Id);
    }
}

class Parameter
{
    ParameterId Id { get; }
}
Application layer:
class DeviceApplication
{
    private DeviceRepository _deviceRepo;
    private ParameterRepository _parameterRepo;

    void AddParameterToDevice(string deviceId, string parameterId)
    {
        var aParameterId = new ParameterId(parameterId);
        var aDeviceId = new DeviceId(deviceId);

        var parameter = _parameterRepo.FindById(aParameterId);
        if (parameter == null) throw;

        var device = _deviceRepo.FindById(aDeviceId);
        if (device == null) throw;

        device.AddParameter(parameter);
        _deviceRepo.Save(device);
    }
}
Possible solution
I've been told that you can pass just an Id of another aggregate like this:
class Device
{
    void AddParameter(ParameterId parameterId)
    {
        parameters.Add(parameterId);
    }
}
But IMO it breaks encapsulation (by explicitly pushing the term ID into the business language), and it also doesn't prevent passing a wrong or otherwise invalid identity (created by the user).
And Vaughn Vernon gives examples of application services that use the first approach (passing the whole aggregate instance).
The short answer is - don't query your aggregates at all.
An aggregate is a model that exposes behaviour, not data. Generally, it is considered a code smell to have getters on aggregates (ID is the exception). This makes querying a little tricky.
Broadly speaking there are 2 related ways to go about solving this. There are probably more but at least these don't break the encapsulation.
Option 1: Use domain events -
By getting your domain (aggregate roots) to emit events which illustrate the changes to internal state you can build up tables in your database specifically designed for querying. Done right you will have highly performant, denormalised queryable data, which can be linearly scaled if necessary. This makes for very simple queries. I have an example of this on this blog post: How to Build a Master-Details View when using CQRS and Event Sourcing
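As a very rough sketch of the idea in Python (the event shape and the in-memory "table" are made up for illustration):
# Hypothetical projection: consume ParameterAddedToDevice events and keep a flat,
# query-friendly structure up to date, instead of querying the aggregates themselves.
read_model = {}  # device_id -> list of parameter ids

def on_parameter_added_to_device(event):
    read_model.setdefault(event["device_id"], []).append(event["parameter_id"])

# Applying events builds the denormalised view that the UI queries directly.
on_parameter_added_to_device({"device_id": "dev-1", "parameter_id": "par-7"})
on_parameter_added_to_device({"device_id": "dev-1", "parameter_id": "par-9"})
print(read_model)  # {'dev-1': ['par-7', 'par-9']}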
Option 2: Infer query tables -
I'm a huge fan of option 1 but if you don't have an event sourced approach you will still need to persist the state of your aggregates at some point. There are all sorts of ways to do this but you could plug into the persistence pipeline for your aggregates a process whereby you extract queryable data into a read model for use with your queries.
I hope that makes sense.
If you have figured out that an RDBMS query with joins would work in this case, you probably have the wrong aggregate boundaries.
For example, why would you need to load the Parameter in order to add it to the Device? You already have the identity of that Parameter; all you need to do is add its id to the list of referenced Parameters in the Device. If you do it only to satisfy your ORM, you're most probably doing something wrong.
Also remember that your aggregate is the transactional boundary. You really would want to complete all database operations inside one transaction and one connection.

What is the best way to organise a complex CouchDB view (SQL-like query)?

In my application I need an SQL-like query over the documents. The big picture is that there is a page with a paginated table showing the CouchDB documents of a certain "type". I have about 15 searchable columns like timestamp, customer name, the US state, different numeric fields, etc. All of these columns are orderable, and there is also a filter form allowing the user to filter by each of the fields.
To be more concrete, below is a typical query resulting from a customer setting some of the filter options and going to the second page. It's written in pseudo-SQL code, just to explain the problem:
timestamp > last_weeks_monday_epoch AND timestamp < this_weeks_monday_epoch AND marked_as_test = False AND dataspace="production" AND fico > 650
SORT BY timestamp DESC
LIMIT 15
SKIP 15
This would be a trivial problem if I were using any SQL-like database, but CouchDB is way more fun ;) To solve this I've created a view with the following structure for the emitted rows:
key: [field, value], id: doc._id, value: null
Now, to resolve the example query above I need to perform a bunch of queries:
{startkey: ["timestamp", last_weeks_monday_epoch], endkey: ["timestamp", this_weeks_monday_epoch]}, the *_epoch here are integers epoch timestamps,
{key: ["marked_as_test", False]},
{key: ["dataspace", "production"]},
{startkey: ["fico", 650], endkey: ["fico", {}]}
Once I have the results of the queries above, I calculate the intersection of the sets of document IDs and apply the sorting using the result of the timestamp query. Then, finally, I can apply the slice, resolving the document IDs of rows 15-30, and download their content using a bulk get operation.
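In a small Python sketch, the current approach looks roughly like this (the ID sets are fabricated for illustration):
# One set of matching document IDs per filter query.
ids_by_filter = [
    {"doc1", "doc2", "doc3", "doc7"},   # timestamp range
    {"doc1", "doc2", "doc7"},           # marked_as_test = False
    {"doc1", "doc7"},                   # dataspace = "production"
    {"doc1", "doc7", "doc9"},           # fico > 650
]
matching = set.intersection(*ids_by_filter)

# Order by the timestamp query's result, then slice out the requested page
# (rows 15-30 in the real query) and fetch those documents with a bulk get.
timestamp_order = ["doc7", "doc1", "doc2", "doc3"]
ordered = [doc_id for doc_id in timestamp_order if doc_id in matching]
page = ordered[15:30]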
Needless to say, it's not the fastest operation. Currently the dataset I'm working with is roughly 10K documents. I can already see that the part where I calculate the intersection of the sets can take around 4 seconds, so obviously I need to optimize it further. I'm afraid to think how slow it's going to get in a few months when my dataset doubles, triples, etc.
Ok, so having explained the situation I'm in, let me ask the actual questions.
Is there a better, more natural way to reach my goal without losing the flexibility of the tool?
Is the view structure I've used optimal? At some point I was considering using a separate map() function generating the value of each field. This would result in smaller b-trees but more work for the view server to generate the index. Could I benefit this way?
The part of the algorithm where I have to calculate intersections of big sets just to later take a slice of the result bothers me. It's not a scalable approach. Does anyone know a better algorithm for this?
Given a map function like this:
function(doc) {
  if (doc.marked_as_test) return;
  emit([doc.dataspace, doc.timestamp, doc.fico], null);
}
you can make a request like this:
http://localhost:5984/db/_design/ddoc/_view/view?startkey=["production", :this_weeks_monday_epoch]&endkey=["production", :last_weeks_monday_epoch, 650]&descending=true&limit=15&skip=15
However, you should pass the :this_weeks_monday_epoch and :last_weeks_monday_epoch values from the client side (I believe they are some calculable variables on the database side, right?).
If you don't care about the dataspace field (e.g. it's always constant), you may move it into the map function code instead of having it in the query parameters.
I don't think CouchDB is a good fit for the general solution to your problem. However, there are two basic ways you can mitigate the mismatch between CouchDB and the problem.
Write/generate a bunch of map() functions that use each separate column as the key (for even better read/query performance, you can even do combinatoric approaches). That way you can do smart filtering and sorting, making use of a bunch of different indices over the data. On the other hand, this will cost extra disk space and index caching performance.
Try to find out which of the filters/sort orders your users actually use, and optimize for those. It seems unlikely that each combination of filters/sort orders is used equally, so you should be able to find some of the most-used patterns and write view functions that are optimal for those patterns.
I like the second option better, but it really depends on your use case. This is one of those things SQL engines have been pretty good at traditionally.