How to add an already serialized bytebuffer to a builder that is creating a vector of tables? - flatbuffers

I am working in Java with FlatBuffers. I have a vector of tables, something like:
table T2 {
property : [T1];
}
table T1 {
field1 : int;
field2 : int;
}
So, records of T1 were already serialized and the ByteBuffers are cached in, say, Redis. When I retrieve them (the ByteBuffers) back from Redis, I want to add these records to T2 and create a vector of tables. Since all these T1 records are already serialized, I want to know the most efficient way to build the FlatBuffer for T2. I could do it by reading each T1's fields and creating new objects in the new builder, but I believe that is not the most efficient way to achieve this. I am hoping there is a way to directly add the serialized ByteBuffers of T1 to the builder and get their corresponding offsets to pass to the .createVectorOfTables() method.

From the scenario you describe, it looks like you want to handle two different buffers. In that case it is better to keep it that way in the schema. You can try:
table T2 {
property: [T1Buffer]; // Stores the buffer
}
table T1Buffer {
tBuff: [ubyte];
}
table T1 {
field1: int;
field2: int;
}
Now, when you are done writing a T1, just push the obtained buffer into a T1Buffer table, and then create a vector of these buffer tables.
During reads, you can get each T1Buffer and deserialize it independently.
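A minimal Java sketch of that approach, assuming flatc-generated classes T1Buffer and T2 for the schema above (the create.../getRootAs... method names follow the usual FlatBuffers codegen conventions, so treat them as illustrative rather than exact):

import com.google.flatbuffers.FlatBufferBuilder;
import java.nio.ByteBuffer;
import java.util.List;

// cachedT1Buffers: the already serialized T1 ByteBuffers fetched back from Redis
public static ByteBuffer buildT2(List<ByteBuffer> cachedT1Buffers) {
    FlatBufferBuilder builder = new FlatBufferBuilder(1024);
    int[] bufferTableOffsets = new int[cachedT1Buffers.size()];
    for (int i = 0; i < cachedT1Buffers.size(); i++) {
        ByteBuffer bb = cachedT1Buffers.get(i).duplicate();
        byte[] bytes = new byte[bb.remaining()];
        bb.get(bytes); // copy the cached payload without disturbing the original buffer's position
        // store the serialized T1 bytes as the [ubyte] field of a T1Buffer table
        int tBuffOffset = T1Buffer.createTBuffVector(builder, bytes);
        bufferTableOffsets[i] = T1Buffer.createT1Buffer(builder, tBuffOffset);
    }
    // vector of T1Buffer tables for the T2.property field
    int propertyOffset = T2.createPropertyVector(builder, bufferTableOffsets);
    builder.finish(T2.createT2(builder, propertyOffset));
    return builder.dataBuffer();
}

On the read side, T2.getRootAsT2(buffer) gives you the T1Buffer entries, and each nested payload (for example via the generated tBuffAsByteBuffer() accessor) can be passed to T1.getRootAsT1 independently.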

Related

Determining whether SPARQL query inserted anything

Say I have an update of the form:
INSERT DATA {
# ... data
}
WHERE {
FILTER EXISTS {
# ... condition
}
}
which may or may not insert data depending on whether the FILTER condition holds. As far as I can tell, the SPARQL 1.1 update standard makes no recommendations about the response that a SPARQL engine must return after successfully running this query. In other words, there is no way to tell whether data was inserted or not.
Of course, one could subsequently run a SELECT query to check whether data has been inserted/changed, but this second query would not run as part of the same transaction as the INSERT, so false positives and negatives can be expected.
Am I missing something here? Is there some way, aside from vendor-specific solutions, to determine whether filter conditions matched or not? This seems like a pretty significant limitation.
The only hack I can think of is generating, with every insert, a triple marked with a unique UUID, which gets added to the graph provided that the FILTER condition holds. Then a subsequent SELECT for this UUID would determine conclusively whether the INSERT ran or not.
INSERT DATA { } WHERE { } isn't legal syntax.
There is INSERT DATA { } (for plain data, no variables) or INSERT { } WHERE { } (for a template with binding variables).
INSERT DATA :: https://www.w3.org/TR/sparql11-update/#insertData
INSERT {} WHERE {} :: https://www.w3.org/TR/sparql11-update/#insert
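For what it's worth, the UUID-marker workaround from the question can be driven from client code using the legal INSERT { } WHERE { } form. A sketch with Apache Jena against a hypothetical endpoint (the endpoint URLs and the example graph pattern are placeholders, not from the original question):

import org.apache.jena.query.*;
import org.apache.jena.update.*;
import java.util.UUID;

public class ConditionalInsertCheck {
    public static void main(String[] args) {
        String updateEndpoint = "http://localhost:3030/ds/update"; // hypothetical
        String queryEndpoint = "http://localhost:3030/ds/query";   // hypothetical
        String marker = "urn:uuid:" + UUID.randomUUID();

        // the marker triple is only inserted when the FILTER EXISTS condition holds
        String update =
            "INSERT { <" + marker + "> <urn:ex:insertedAt> ?now } " +
            "WHERE { BIND(NOW() AS ?now) " +
            "        FILTER EXISTS { ?s a <urn:ex:SomeCondition> } }";
        UpdateRequest request = UpdateFactory.create(update);
        UpdateExecutionFactory.createRemote(request, updateEndpoint).execute();

        // a subsequent ASK for the unique marker tells you whether the INSERT fired
        Query ask = QueryFactory.create("ASK { <" + marker + "> ?p ?o }");
        try (QueryExecution qe = QueryExecutionFactory.sparqlService(queryEndpoint, ask)) {
            System.out.println("insert ran: " + qe.execAsk());
        }
    }
}

Because the marker IRI is unique per request, the follow-up ASK is conclusive even though it runs outside the original transaction.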

Tree-Structures & SQL - Looking for design recommendations

From what I've researched so far, this topic is both well documented and very broad. So I'm hoping you can save me some time diving into the depths of how to store trees in a database by pointing me in the right direction.
I'm working with questionnaires, similar to how HL7/FHIR approaches them: there are two classes, Questionnaire and Item, with a Questionnaire consisting of a Set of Items. However, Items can refer to any number of additional Items (i.e. children). So basically, I have an n-ary tree-like structure with, depending on how you want to look at it,
a) a Questionnaire object as root and several Items as children, or
b) several Items each as a root (i.e. a forest), again each with several Items as children.
class Questionnaire {
    val items: Set<Item> = setOf()
    inner class Item {
        val children: Set<Item> = setOf()
    }
}
This part of the data structure unfortunately is non-negotiable (other than the use of inner classes, which I could change).
I'm now looking for a sensible way to store such a structure in my database (currently MySQL).
Luckily, I'm only ever storing and reading the whole questionnaire. I do not need to access individual nodes or branches, and the data will not be changed / updated (because any change to an existing Questionnaire will result in a new Questionnaire as per my projects definition). So I only need to work with SELECT and INSERT statements, each for one complete Questionnaire (with all its items).
My first approach was to reverse the Item-to-Item relationship, i.e. referring to one parent rather than several children. However, I fear that this might be hell to translate back into the already fixed object-structure. I'm hoping for a fairly easy solution.
Please note that I am aware that there are probably really nice solutions using an ORM, but I've been having trouble wrapping my head around the whole setup process lately, and am now too pressed for time to get into that. Right now, I need a solution in plain SQL to show results. ORM will have to wait a little, but I will get back to it! Also note that performance does not matter right now.
Thanks in advance for your efforts, your help will be much appreciated!
So here's what I ended up doing in case anyone else is looking for an answer:
Let's take as an example my QuestionnaireResponse class:
data class QuestionnaireResponse(
    val qID: String,
    val timeStamp: String,
    val items: List<Item> = listOf<Item>()) {
    inner class Item(val itemID: String, val itemType: String, var unit: String,
                     var answers: MutableList<String> = mutableListOf())
}
Where qID references the Questionnaire that has been answered here.
When a Questionnaire is answered, I receive the above object as JSON. I decided to parse the incoming JSON into my data structure, extract qID and timeStamp, and store those two values alongside the raw JSON in my database. That way I can select only those QuestionnaireResponses answering a specific Questionnaire, and filter by timeStamp, while avoiding having to represent that (basically recursive) structure in my DB.
The SQL code to create the corresponding table looks like this:
CREATE TABLE `questionnaireresponses` (
    `questionnaireID` int NOT NULL,
    `timestamp` varchar(25) NOT NULL,
    `questionnaireResponseObject` json DEFAULT NULL,
    PRIMARY KEY (`questionnaireID`, `timestamp`),
    CONSTRAINT `answeredQuestionnaire` FOREIGN KEY (`questionnaireID`) REFERENCES `questionnaires` (`id`));
From what I read, not all databases support the json data type. What it does in MySQL is ensure that the inserted data is well-formed JSON. I never mind additional checks, but since I've already been successfully parsing the JSON in my application before inserting it into the DB, that step can be omitted. Thus, if your DB doesn't support the json type, any type that can store strings of variable length (e.g. TEXT or BLOB) should work as well.
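For completeness, a minimal JDBC sketch of the store/load round trip under this design (the DAO class, method names and parameter types are illustrative assumptions; the column names follow the CREATE TABLE above):

import java.sql.*;
import java.util.ArrayList;
import java.util.List;

public class QuestionnaireResponseDao {
    private final Connection conn;

    public QuestionnaireResponseDao(Connection conn) { this.conn = conn; }

    public void save(int questionnaireId, String timestamp, String responseJson) throws SQLException {
        String sql = "INSERT INTO questionnaireresponses "
                   + "(questionnaireID, timestamp, questionnaireResponseObject) VALUES (?, ?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setInt(1, questionnaireId);
            ps.setString(2, timestamp);
            ps.setString(3, responseJson); // MySQL validates the string against the json column type
            ps.executeUpdate();
        }
    }

    public List<String> loadAllFor(int questionnaireId) throws SQLException {
        String sql = "SELECT questionnaireResponseObject FROM questionnaireresponses "
                   + "WHERE questionnaireID = ? ORDER BY timestamp";
        List<String> result = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setInt(1, questionnaireId);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    result.add(rs.getString(1)); // parse back into QuestionnaireResponse in the app layer
                }
            }
        }
        return result;
    }
}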

Abstract view of how distinct queries are implemented in NoSQL

I am developing a system using Google Datastore, where there is a Kind, Posts, which has 2 properties:
1. message (string)
2. hashtags (list)
I want to query the distinct hashtags together with their counts. For example, say the posts are:
[
    {
        "message": "msg1",
        "tags": ["abc", "cde", "efr"]
    },
    {
        "message": "msg2",
        "tags": ["abc", "efgh", "efk"]
    },
    {
        "message": "msg3",
        "tags": ["abc", "efgh", "efr"]
    }
]
The output should be:
{
    "abc": 3,
    "cde": 1,
    "efk": 1,
    "efgh": 2,
    "efr": 2
}
But with Datastore's NoSQL implementation I can't query this directly. To get it, I would have to load all the posts and compute the distinct tags and their counts myself, which is time-consuming.
I have seen MongoDB's db.collection.distinct() function, which I think might help with this kind of problem. If it has to be done on any NoSQL store, what would be the optimal solution?
Unfortunately, projection queries with 'distinct on' will only return a single result per distinct value (https://cloud.google.com/datastore/docs/concepts/queries#projection_queries). It will not provide a count of each distinct value. You'll need to do the count yourself, but you can use a projection query to save cost by only returning the tag values instead of the full entities.
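A rough sketch of that suggestion with the google-cloud-datastore Java client (the kind and property names come from the question; the client setup, and the assumption that each array value of a projected property comes back as its own result, are mine). The counting itself happens client-side:

import com.google.cloud.datastore.*;
import java.util.HashMap;
import java.util.Map;

public class TagCounts {
    public static void main(String[] args) {
        Datastore datastore = DatastoreOptions.getDefaultInstance().getService();

        // project only the "tags" property instead of fetching full entities
        ProjectionEntityQuery query = Query.newProjectionEntityQueryBuilder()
                .setKind("Posts")
                .setProjection("tags")
                .build();

        Map<String, Long> counts = new HashMap<>();
        QueryResults<ProjectionEntity> results = datastore.run(query);
        while (results.hasNext()) {
            String tag = results.next().getString("tags");
            counts.merge(tag, 1L, Long::sum); // count client-side; Datastore has no GROUP BY/COUNT per value
        }
        counts.forEach((tag, n) -> System.out.println(tag + ": " + n));
    }
}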

Reference an arbitrary row and field in another table

Is there any way (data type, inheritance, ...) to implement in PostgreSQL something like this:
CREATE TABLE log (
datareferenced table_row_column_reference,
logged boolean
);
The referenced data may be any row/field in the database. My objective is to implement something like this without using a procedural language or implementing it in a higher layer, using only a relational approach and without modifying the rest of the tables. Another desired feature is referential integrity, for example:
-- Table foo (id, field1, field2, fieldn)
-- ('bar', '2014-01-01', 4.33, Null)
-- Table log (datareferenced, logged)
-- ({table foo -> id:'bar' -> field2 } <=> 4.33, True)
DELETE FROM foo where id='bar';
-- as result, on cascade, deleted both rows.
I have an application built on an MVC pattern. The logic is written in Python. The application is a management tool and very data intensive. My goal is to implement a module that can store additional information for every piece of data present in the database. For example, a client has a series of attributes (name, address, phone, email, ...) spread across multiple tables, and I want the app to be able to store metadata for every record in the whole database. Such metadata could be the last modification, a user flag, etc.
I have implemented the metadata model (in Postgres), its mapping to objects, and a partial API. But the part that is left is the most important: the glue. My plan B is to create that glue in the data mapping layer as a module. Something like this:
address = person.addresses[0]
address.saveMetadata('foo', 'bar')

# in the superclass of Address
def saveMetadata(self, code, value):
    self.mapper.metadata_adapter.save(self, code, value)

# in the metadata adapter class
def save(self, entity, code, value):
    # values are passed as parameters, assuming the mapper hands them on to the DB driver
    sql = """UPDATE metadata_values SET value = %s
             WHERE code = %s AND idmetadata =
                 (SELECT id FROM metadata_rels mr
                  WHERE mr.schema = %s AND mr.table = %s AND
                        mr.field = %s AND mr.recordpk = %s)"""
    params = (value, code,
              self.class2data[entity.__class__]["schema"],
              self.class2data[entity.__class__]["table"],
              self.class2data[entity.__class__]["field"],
              entity.id)
    self.mapper.execute(sql, params)

def read(self, entity, code):
    sql = """SELECT mv.value
             FROM metadata_values mv
             JOIN metadata_rels mr ON mv.idmetadata = mr.id
             WHERE mv.code = %s AND mr.schema = %s AND mr.table = %s AND
                   mr.field = %s AND mr.recordpk = %s"""
    params = (code,
              self.class2data[entity.__class__]["schema"],
              self.class2data[entity.__class__]["table"],
              self.class2data[entity.__class__]["field"],
              entity.id)
    return self.mapper.execute(sql, params)
But this would add overhead between Python and PostgreSQL and complicate the Python logic, and using PL and triggers would be laborious and bug-prone. That is why I'm looking at doing the same at the database level.
No, there's nothing like that in PostgreSQL.
You could build triggers yourself to do it, probably using a composite type. But you've said (for some reason) you don't want to use PL/PgSQL, so you've ruled that out. Getting RI triggers right is quite hard, though, and you must apply a trigger to the referencing and referenced ends.
Frankly, this seems like a square peg, round hole kind of problem. Are you sure PostgreSQL is the right choice for this application?
Describe your needs and goal in context. Why do you want this? What problem are you trying to solve? Maybe there's a better way to approach the same problem one step back...

How to implement deduplication in a billion rows table ssis

Which is the best option to implement a distinct operation in SSIS?
I have a table with more than 200 columns that contains more than 10 million rows.
I need to get the distinct rows from this table. Is it wise to use an Execute SQL Task (with a SELECT DISTINCT query to deduplicate the rows), or is there another way to achieve this in SSIS?
I do understand that the SSIS Sort component can deduplicate the rows, but it is a blocking component, so it is not a good idea to use it. Please let me know your views on this.
I did it in 3 steps, this way:
1. Dump the MillionRow table into a HashDump table, which has only 2 columns: Id int identity PK, and Hash varbinary(20). This table should be indexed on its Hash column.
2. Dump the HashDump table into HashUni, ordered by the Hash column. In between is a Script Component that checks whether the current row's Hash value is the same as the previous row's. If it is the same, direct the row to the Duplicate output, otherwise to Unique. This way you can log the duplicates even if what you need is just the unique rows.
3. Dump the MillionRow table into a MillionUni table. In between is a Lookup Component that uses HashUni to tell which rows are unique.
This method allows me to log each duplicate with a message such as: "Row 1000 is a duplicate of row 100".
I have not found a better way than this. Earlier, I put a unique index on MillionUni and dumped MillionRow directly into it, but then I was not able to use "fast load", and without it the load was way too slow.
Here is one way to populate the Hash column:
// requires: using System.Text; using System.Security.Cryptography;
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    StringBuilder sb = new StringBuilder();
    // concatenate the column values with a "|" delimiter so adjacent values cannot collide
    sb.Append(Row.Col1String_IsNull ? "" : Row.Col1String); sb.Append("|");
    sb.Append(Row.Col2Num_IsNull ? "" : Row.Col2Num.ToString()); sb.Append("|");
    sb.Append(Row.Col3Date_IsNull ? "" : Row.Col3Date.ToString("yyyy-MM-dd"));
    // SHA-1 of the concatenated string becomes the row's Hash value
    var sha1Provider = HashAlgorithm.Create("SHA1");
    Row.Hash = sha1Provider.ComputeHash(Encoding.UTF8.GetBytes(sb.ToString()));
}
If 200 columns prove to be a chore for you, part of this article may inspire you: it loops over all column objects and concatenates their values into a single string.
And to compare the Hash, use this method:
// requires: using System.Collections; (for StructuralComparisons)
byte[] previousHash;
int previousRowNo;
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // byte-wise comparison of the current Hash with the previous row's Hash
    if (StructuralComparisons.StructuralEqualityComparer.Equals(Row.Hash, previousHash))
    {
        Row.DupRowNo = previousRowNo;
        Row.DirectRowToDuplicate();
    }
    else
    {
        Row.DirectRowToUnique();
    }
    previousHash = Row.Hash;
    previousRowNo = Row.RowNo;
}
I wouldn't bother with SSIS for this; a couple of queries will do. Also, you have a lot of data, so I suggest you check the execution plan before running the queries and optimize your indexes.
Check out a small article I wrote on the same topic:
http://www.brijrajsingh.com/2011/03/delete-duplicate-record-but-keep.html
As far as I know, the Sort component is the only transformation that lets you remove the duplicates. Alternatively, you can use a SQL-like command.
If the sorting operation is a problem, then (assuming your source is a database) you should use "SQL Command" as the Data Access Mode, SELECT DISTINCT your data there, and that's it. You may also save a bit of time, as the ETL won't have to go through the Sort component.