Lightweight data format

As is well known, JSON is a lighter data format than XML and is usually preferable. But when you transfer big arrays of objects that all share the same structure, JSON is bloated with repeated data too. For example:
[
  {
    name: 'John',
    surname: 'Smith',
    info: { age: 25, comments: '' }
  },
  {
    name: 'Sam',
    surname: 'Black',
    info: { age: 27, comments: '' }
  },
  {
    name: 'Tom',
    surname: 'Lewis',
    info: { age: 21, comments: '' }
  }
]
Declaring name, surname, age and comments three times is useless if I know for certain that every object in the array has the same structure.
Is there any data format that can minify such array data while staying flexible enough?

Admittedly, this is a hackish solution, but we've used it and it works. You can flatten everything into arrays. For example, the above would be represented as:
[
  ['John', 'Smith', [25, '']],
  ['Sam', 'Black', [27, '']],
  ['Tom', 'Lewis', [21, '']]
]
The downside is that on serializing/deserializing, you have to do some custom logic. However, this does result in additional savings for a text-based solution, and Ray is right -- if you really want maximal savings, binary is the way to go.
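To make the custom serialize/deserialize logic concrete, here is a minimal Python sketch of the idea (the fixed field order and the helper names flatten/unflatten are my own assumptions, not part of the original answer):

import json

def flatten(person):
    # Turn a dict into a positional array: ['John', 'Smith', [25, '']]
    return [person["name"], person["surname"],
            [person["info"]["age"], person["info"]["comments"]]]

def unflatten(row):
    # Rebuild the original dict from the positional array
    name, surname, (age, comments) = row
    return {"name": name, "surname": surname,
            "info": {"age": age, "comments": comments}}

people = [
    {"name": "John", "surname": "Smith", "info": {"age": 25, "comments": ""}},
    {"name": "Sam", "surname": "Black", "info": {"age": 27, "comments": ""}},
    {"name": "Tom", "surname": "Lewis", "info": {"age": 21, "comments": ""}},
]

compact = json.dumps([flatten(p) for p in people], separators=(",", ":"))
restored = [unflatten(r) for r in json.loads(compact)]
assert restored == people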

Well, if you stick with text formats, YAML tries to have minimal markup; it pretty much gets rid of the braces and commas. And text compresses pretty well anyway.
But if you want to remove redundancies in property names, you have to go with a binary format. Look into MessagePack, Protocol Buffers, or Avro. I don't know of any text-based formats that remove this kind of redundancy.
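For a rough sense of what a binary format buys you, here is a small Python sketch using the msgpack package (the package choice and the reuse of the array layout from the previous answer are my assumptions; exact byte counts will vary):

import json
import msgpack  # pip install msgpack

rows = [["John", "Smith", [25, ""]],
        ["Sam", "Black", [27, ""]],
        ["Tom", "Lewis", [21, ""]]]

packed = msgpack.packb(rows)                       # compact binary encoding
as_json = json.dumps(rows, separators=(",", ":"))  # compact text encoding

print(len(packed), "bytes as MessagePack vs", len(as_json), "bytes as JSON")
assert msgpack.unpackb(packed) == rows             # round-trips losslessly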
LATE ADDITION:
Oh my, after using Hadoop to process dozens of gigabytes at a shot for the past year, how could I have forgotten CSV? Geez. The first row can be the schema, and you really don't need quotes. And the separator can be up to you. Something like this:
name|surname|info_age|info_comments
John|Smith|25|
Sam|Black|27|Hi this is a comment
Tom|Lewis|21|This comment has an \| escaped pipe
For small docs it might be smaller than some binary formats, but binary is good for storing real numbers.
Also, CSV is really only good when you have a collection of items that are all the same. For complex object hierarchies go with binary, YAML, or @incaren's array-based solution.
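Here is a minimal Python sketch of reading such a pipe-separated file with the standard csv module (the DictReader/QUOTE_NONE/backslash-escape choices are my assumptions, not part of the answer above):

import csv
import io

data = """\
name|surname|info_age|info_comments
John|Smith|25|
Sam|Black|27|Hi this is a comment
Tom|Lewis|21|This comment has an \\| escaped pipe
"""

reader = csv.DictReader(
    io.StringIO(data),      # in practice this would be an open file
    delimiter="|",          # the separator is up to you
    quoting=csv.QUOTE_NONE, # no quotes, as suggested above
    escapechar="\\",        # backslash escapes a literal pipe
)
for row in reader:
    print(row["name"], row["surname"], int(row["info_age"]), row["info_comments"])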

Related

How to query and filter efficiently on FaunaDB?

For example, let’s assume we have a collection with hundreds of thousands of client documents with 3 fields: name, monthly_salary, and age.
How can I search for documents where monthly_salary is higher than 2000 and age is higher than 30?
In SQL this would be straightforward, but with Fauna I'm struggling to understand the best approach, because Index terms only work with an exact match. I see in the docs that I can use the Filter function, but I would need to fetch all documents in advance, which seems counterintuitive and not performant.
Below is an example of how I can achieve it, but I'm not sure it's the best approach, especially if the collection contains a lot of records.
Map(
  Filter(
    Paginate(Documents(Collection('clients'))),
    Lambda(
      'client',
      And(
        GT(Select(['data', 'monthly_salary'], Get(Var('client'))), 2000),
        GT(Select(['data', 'age'], Get(Var('client'))), 30)
      )
    )
  ),
  Lambda(
    'filteredClients',
    Get(Var('filteredClients'))
  )
)
Is this correct, or am I missing some fundamental concepts about Fauna and FQL?
Can anyone help?
Thanks in advance.
Efficient searching is performed using Indexes. You can check out the docs for search with Indexes, and there is a "cookbook" for some different search examples.
There are two ways to use Indexes to search, and which one you use depends on if you are searching for equality (exact match) or inequality (greater than or less than, for example).
Searching for equality
If you need an exact match, then use Index terms. This is most explicit in the docs, and it is also not what your original question is about, so I won't dwell on it much here. But here is a simple example.
Given user documents with this shape
{
  ref: Ref(Collection("User"), "1234"),
  ts: 16934907826026,
  data: {
    name: "John Doe",
    email: "jdoe@example.com",
    age: 50,
    monthly_salary: 3000
  }
}
and an index defined like the following
CreateIndex({
  name: "users_by_email",
  source: Collection("User"),
  terms: [ { field: ["data", "email"] } ],
  unique: true // user emails are unique
})
You can search for exact matches with... the Match function!
Get(
  Match(Index("users_by_email"), "jdoe@example.com")
)
Searching for inequality
Searching for inequalities is more interesting and also complicated. It requires using Index values and the Range function.
Keeping with the document above, we can create a new index
CreateIndex({
  name: "users__sorted_by_monthly_salary",
  source: Collection("User"),
  values: [
    { field: ["data", "monthly_salary"] },
    { field: ["ref"] }
  ]
})
Note that I've not defined any terms in the above Index. The important thing for inequalities is again the values. We've also included the ref as a value, since we will need that later.
Now we can use Range to get all users with a salary in a given range. This query will get all users with a salary of 2000 or above.
Paginate(
  Range(
    Match(Index("users__sorted_by_monthly_salary")),
    [2000],
    []
  )
)
Combining Indexes
For "OR" operations, use the Union function.
For "AND" operations, use the Intersection function.
Functions like Match and Range return Sets. A really important part of this is to make sure that when you "combine" Sets with functions like Intersection, the shape of the data is the same.
Using Sets with the same shape is not difficult for Indexes with no values; they default to the same single ref value.
Paginate(
  Intersection(
    Match(Index("user_by_age"), 50),             // type is Set<Ref>
    Match(Index("user_by_monthly_salary"), 3000) // type is Set<Ref>
  )
)
When the Sets have different shapes, they need to be modified, or else the Intersection will never return results:
Paginate(
  Intersection(
    Range(
      Match(Index("users__sorted_by_age")),
      [30],
      []
    ), // type is Set<[age, Ref]>
    Range(
      Match(Index("users__sorted_by_monthly_salary")),
      [2000],
      []
    ) // type is Set<[salary, Ref]>
  )
)

{
  data: [] // Intersection is empty
}
So how do we change the shape of the Set so they can be intersected? We can use the Join function, along with the Singleton function.
Join will run an operation over all entries in the Set. We will use that to return only a ref.
Join(
  Range(Match(Index("users__sorted_by_age")), [30], []),
  Lambda(["age", "ref"], Singleton(Var("ref")))
)
Altogether then:
Paginate(
  Intersection(
    Join(
      Range(Match(Index("users__sorted_by_age")), [30], []),
      Lambda(["age", "ref"], Singleton(Var("ref")))
    ),
    Join(
      Range(Match(Index("users__sorted_by_monthly_salary")), [2000], []),
      Lambda(["salary", "ref"], Singleton(Var("ref")))
    )
  )
)
Tips for combining indexes
You can use additional logic to combine different indexes when different terms are provided, or search for missing fields using bindings. Lots of cool stuff you can do.
Do check out the cookbook and the Fauna forums as well for ideas.
BUT WHY!!!
It's a good question!
Consider this: Since Fauna is served as a serverless API, you get charged for each individual read and write on your documents and indexes as well as the compute time to execute your query. SQL can be much easier, but it is a much higher level language. Behind SQL sits a query planner making assumptions about how to get you your data. If it cannot do it efficiently it may default to scanning your entire table of data or otherwise performing an operation much more expensive than you might have expected.
With Fauna, YOU are the query planner. That means it is much more complicated to get started, but it also means you have fine control over the performance of your database and thus your cost.
We are working on improving the experience of defining schemas and the indexes you need, but at the moment you do have to define these queries at a low level.

Large list literals in Kotlin stalling/crashing compiler

I'm using val globalList = listOf("a1" to "b1", "a2" to "b2") to create a large list of Pairs of strings.
All is fine until you try to put more than 1000 Pairs into a List. The compiler either takes more than 5 minutes or just crashes (both in IntelliJ and Android Studio).
The same happens if you use simple lists of Strings instead of Pairs.
Is there a better way / best practice to include large lists in your source code without resorting to a database?
You can replace a listOf(...) expression with a list created using a constructor or a factory function and adding the items to it:
val globalList: List<Pair<String, String>> = mutableListOf<Pair<String, String>>().apply {
    add("a1" to "b1")
    add("a2" to "b2")
    // ...
}
This is definitely a simpler construct for the compiler to analyze.
If you need something quick and dirty instead of data files, one workaround is to use a large string, then split and map it into a list. Here's an example mapping into a list of Ints.
val onCommaWhitespace = "[\\s,]+".toRegex() // in this example split on commas w/ whitespace
val bigListOfNumbers: List<Int> = """
    0, 1, 2, 3, 4,
    :
    :
    :
    8187, 8188, 8189, 8190, 8191
""".trimIndent()
    .split(onCommaWhitespace)
    .map { it.toInt() }
Of course for splitting into a list of Strings, you'd have to choose an appropriate delimiter and regex that don't interfere with the actual data set.
There's no good way to do what you want; for something that size, reading the values from a data file (or calculating them, if that were possible) is a far better solution all round — more maintainable, much faster to compile and run, easier to read and edit, less likely to cause trouble with build tools and frameworks…
If you let the compiler finish, its output will tell you the problem.  (‘Always read the error messages’ should be one of the cardinal rules of development!)
I tried hotkey's version using apply(), and it eventually gave this error:
…
Caused by: org.jetbrains.org.objectweb.asm.MethodTooLargeException: Method too large: TestKt.main ()V
…
There's the problem: MethodTooLargeException.  The JVM allows only 65535 bytes of bytecode within a single method; see this answer.  That's the limit you're coming up against here: once you have too many entries, its code would exceed that limit, and so it can't be compiled.
If you were a real masochist, you could probably work around this to an extent by splitting the initialisation across many methods, keeping each one's code just under the limit.  But please don't!  For the sake of your colleagues, for the sake of your compiler, and for the sake of your own mental health…

What is the benefit of defining datatypes for literals in an RDF graph?

I am using rdflib in Python to build my first RDF graph. However, I do not understand the explicit purpose of defining Literal datatypes. I have pored over the documentation and did my due diligence with Google and the Stack Overflow search, but I cannot seem to find an actual explanation for this. Why not just leave everything as a plain old Literal?
From what I have experimented with, is this so that you can search for explicit terms in your SPARQL query with BIND? Does this also help with FILTERing, i.e. FILTER (?var1 > ?var2), where var1 and var2 should represent integers/floats/etc.? Does it help with querying speed? Or am I just way off altogether?
Specifically, why add the following triple to mygraph
mygraph.add((amazingrdf, ns['hasValue'], Literal('42.0', datatype=XSD.float)))
instead of just this?
mygraph.add((amazingrdf, ns['hasValue'], Literal("42.0")))
I suspect that there must be some purpose I am overlooking. I appreciate your help and explanations - I want to learn this right the first time! Thanks!
Comparing two xsd:integer values in SPARQL:
ASK { FILTER (9 < 15) }
Result: true. Now with xsd:string:
ASK { FILTER ("9" < "15") }
Result: false, because when sorting strings, 9 comes after 1.
Some equality checks with xsd:decimal:
ASK { FILTER (+1.000 = 01.0) }
Result is true, it’s the same number. Now with xsd:string:
ASK { FILTER ("+1.000" = "01.0") }
False, because they are clearly different strings.
Doing some maths with xsd:integer:
SELECT (1+1 AS ?result) {}
It returns 2 (as an xsd:integer). Now for strings:
SELECT ("1"+"1" AS ?result) {}
It returns "11" as an xsd:string, because adding strings is interpreted as string concatenation (at least in Jena where I tried this; in other SPARQL engines, adding two strings might be an error, returning nothing).
As you can see, using the right datatype is important to communicate your intent to code that works with the data. The SPARQL examples make this very clear, but when working directly with an RDF API, the same kind of issues crop up around object identity, ordering, and so on.
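In rdflib specifically, the datatype is what turns a literal into a typed Python value and lets SPARQL treat it as a number. A small sketch (the example namespace is made up, but Literal.toPython() and Graph.query() are the real rdflib API):

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import XSD

ns = Namespace("http://example.org/")  # made-up namespace for the example

print(type(Literal("42.0", datatype=XSD.float).toPython()))  # <class 'float'>
print(type(Literal("42.0").toPython()))                      # a string-like Literal, not a number

g = Graph()
g.add((ns.a, ns.hasValue, Literal("42.0", datatype=XSD.float)))
g.add((ns.b, ns.hasValue, Literal("42.0")))  # plain literal, no datatype

# The numeric FILTER only matches the typed value; the untyped "42.0"
# is just a string to SPARQL and drops out of the results.
rows = g.query("""
    PREFIX ns: <http://example.org/>
    SELECT ?s WHERE { ?s ns:hasValue ?v . FILTER (?v > 10) }
""")
print([str(s) for s, in rows])  # ['http://example.org/a']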
As shown in the examples above, SPARQL offers convenient syntax for xsd:string, xsd:integer and xsd:decimal (and, not shown, for xsd:boolean and for language-tagged strings). That elevates those datatypes above the rest.

Lucene NOT_ANALYZED not working with uppercase characters

I have built an index using a StandardAnalyzer; in this index are a few fields. For example purposes, imagine it has Id and Type. Both are NOT_ANALYZED, meaning you can only search for them as-is.
There are a few entries in my index:
{Id: "1", Type: "Location"},
{Id: "2", Type: "Group"},
{Id: "3", Type: "Location"}
When I search for +Id:1 or any other number, I get the appropriate result (again using StandardAnalyzer).
However, when I search for +Type:Location or +Type:Group, I'm not getting any results. The strange thing is that when I enable leading wildcards, +Type:*ocation does return results! +Type:*Location and other combinations do not.
This leads me to believe the indexer/query doesn't like uppercase characters! After lowercasing the Type to location and group before indexing them, I could search for them as such.
If I turn the Type field into ANALYZED, it works with pretty much any search (uppercase/lowercase, etc.), but I want to query the Type field as-is.
I'm completely baffled why it's doing this. Could anyone explain to me why my indexer doesn't let me search for NOT_ANALYZED fields that have a capital in their value?
Are you using the StandardAnalyzer when parsing your query string (+Type:Location)? The StandardAnalyzer lower-cases all terms, so you're really searching with +Type:location.
Always use the same analyzer when searching and indexing. Look into using a PerFieldAnalyzerWrapper and set the Type field to use the KeywordAnalyzer.

Tips for designing a serialization file format that will permit easy merging

Say I'm building a UML modeling tool. There's some hierarchical organization to the data, and model elements need to be able to refer to others. I need some way to save model files to disk. If multiple people might be working on the files simultaneously, the time will come to merge these model files. Also, it would be nice to compare two revisions in source control and see what has changed. This seems like it would be a common problem across many domains.
For this to work well using existing difference and merge tools, the file format should be text, separated onto multiple lines.
What are some existing serialization formats that do a good job (or poor job) addressing such problems? Or, if designing a custom file format, what are some tips / guidelines / pitfalls?
Bonus question: Any additional guidance if I want to eventually split the model up into multiple files, each separately source controlled?
I solved that problem a while ago for Octave/MATLAB; now I need something similar for C#.
The task was to merge two Octave structs into one. I found no merge tool and no fitting serializer, so I had to come up with something myself.
The most important design decision was to split the struct tree into lines, each holding the complete path and the content of the leaf.
The basic idea was:
Serialize the struct to lines, where each line represents a basic variable (matrix, string, float, ...).
An array or matrix of structs will have the index in the path.
Concatenate the two resulting text files and sort the lines.
Detect collisions and do collision handling (very easy, because identical properties end up directly under each other after the sorting).
Deserialize.
Example:
>> s1
s1 =
  scalar structure containing the fields:
    b =
      2x2 struct array containing the fields:
        bruch
    t = Textstring
    f = 3.1416
    s =
      scalar structure containing the fields:
        a = 3
        b = 4
will be serialized to
root.b(1,1).bruch=txt2base('isfloat|[ [ 0, 4 ] ; [ 1, 0 ] ; ]');
root.b(1,2).bruch=txt2base('isfloat|[ [ 1, 6 ] ; [ 1, 0 ] ; ]');
root.b(2,1).bruch=txt2base('isfloat|[ [ 2, 7 ] ; [ 1, 0 ] ; ]');
root.b(2,2).bruch=txt2base('isfloat|[ [ 7 ] ; [ 1 ] ; ]');
root.f=txt2base('isfloat|[3.1416]');
root.s.a=txt2base('isfloat|[3]');
root.s.b=txt2base('isfloat|[4]');
root.t=txt2base('ischar|Textstring');
The advantage of this method is that it is very easy to implement and it is human-readable. First you have to write the two functions base2txt and txt2base, which convert basic types to strings and back. Then you just go recursively through the tree and write, for each struct property, the path to the property (here separated by ".") and its content on one line.
The big disadvantage is that my implementation of this, at least, is very slow.
The answer to the second question ("Is there already something like this out there?"): I don't know... but I searched for a while, so I don't think so.
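A rough Python sketch of the same recursive path/value flattening idea (the function name, the '=' separator, and the use of plain repr() instead of the txt2base type tagging are my own simplifications):

def flatten(node, path="root"):
    # Walk the tree and emit one "path=value" line per leaf.
    lines = []
    if isinstance(node, dict):
        for key, value in node.items():
            lines += flatten(value, f"{path}.{key}")
    elif isinstance(node, list):
        for i, value in enumerate(node):
            lines += flatten(value, f"{path}({i + 1})")  # 1-based index, as in Octave
    else:
        lines.append(f"{path}={node!r}")
    return lines

s1 = {"t": "Textstring", "f": 3.1416, "s": {"a": 3, "b": 4}}
s2 = {"f": 2.7183, "s": {"a": 3, "c": 5}}

# Concatenate both serializations and sort them; identical paths land on
# adjacent lines, which makes collision detection and merging simple.
for line in sorted(flatten(s1) + flatten(s2)):
    print(line)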
Some guidelines:
The format should be designed so that when only one thing has changed in a model, there is only one corresponding change in the file. Some counterexamples:
It's no good if the file format uses arbitrary reference IDs that change every time you edit and save the model.
It's no good if array items are stored with their indices listed explicitly, since inserting items into the middle of an array will cause all the following indices to get shuffled down. That will cause those items to show up in a 'diff' unnecessarily.
Regarding references: if IDs are created serially, then two people editing the same revision of the model could end up creating new elements with the same ID. This will become a problem when merging.