How to query and filter efficiently in FaunaDB?

For example, let’s assume we have a collection with hundreds of thousands of client documents with 3 fields: name, monthly_salary, and age.
How can I search for documents where monthly_salary is higher than 2000 and age is higher than 30?
In SQL this would be straightforward, but with Fauna I'm struggling to understand the best approach, because Index terms only work with exact matches. I see in the docs that I can use the Filter function, but I would need to get all documents in advance, so it looks a bit counterintuitive and not performant.
Below is an example of how I can achieve it, but I'm not sure it's the best approach, especially if the collection contains a lot of records.
Map(
  Filter(
    Paginate(Documents(Collection('clients'))),
    Lambda(
      'client',
      And(
        GT(Select(['data', 'monthly_salary'], Get(Var('client'))), 2000),
        GT(Select(['data', 'age'], Get(Var('client'))), 30)
      )
    )
  ),
  Lambda(
    'filteredClients',
    Get(Var('filteredClients'))
  )
)
Is this correct, or am I missing some fundamental concepts about Fauna and FQL?
Can anyone help?
Thanks in advance

Efficient searching is performed using Indexes. You can check out the docs for searching with Indexes, and there is a "cookbook" with some different search examples.
There are two ways to use Indexes to search, and which one you use depends on whether you are searching for equality (exact match) or inequality (greater than or less than, for example).
Searching for equality
If you need an exact match, then use Index terms. This is covered most explicitly in the docs, and it is also not what your original question is about, so I am not going to dwell on it here. But here is a simple example,
given user documents with this shape:
{
  ref: Ref(Collection("User"), "1234"),
  ts: 16934907826026,
  data: {
    name: "John Doe",
    email: "jdoe@example.com",
    age: 50,
    monthly_salary: 3000
  }
}
and an index defined like the following
CreateIndex({
  name: "users_by_email",
  source: Collection("User"),
  terms: [ { field: ["data", "email"] } ],
  unique: true // user emails are unique
})
You can search for exact matches with... the Match function!
Get(
  Match(Index("users_by_email"), "jdoe@example.com")
)
Searching for inequality
Searching for inequalities is more interesting and also more complicated. It requires using Index values and the Range function.
Keeping with the document above, we can create a new index
CreateIndex({
  name: "users__sorted_by_monthly_salary",
  source: Collection("User"),
  values: [
    { field: ["data", "monthly_salary"] },
    { field: ["ref"] }
  ]
})
Note that I've not defined any terms in the above Index. The important thing for inequalities is the values. We've also included the ref as a value, since we will need it later.
Now we can use Range to get all users with salary in a given range. This query will get all users with a salary of 2000 or above.
Paginate(
  Range(
    Match(Index("users__sorted_by_monthly_salary")),
    [2000],
    []
  )
)
Combining Indexes
For "OR" operations, use the Union function.
For "AND" operations, use the Intersection function.
Functions like Match and Range return Sets. A really important part of this is to make sure that when you combine Sets with functions like Intersection, the shape of the data is the same.
Using Sets with the same shape is not difficult for Indexes with no values, since they default to the same single ref value.
Paginate(
  Intersection(
    Match(Index("user_by_age"), 50),              // type is Set<Ref>
    Match(Index("user_by_monthly_salary"), 3000)  // type is Set<Ref>
  )
)
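An OR over the same two Sets is a straight swap of Intersection for Union. Here is a minimal sketch, assuming the same two term-based indexes exist:
Paginate(
  Union(
    Match(Index("user_by_age"), 50),
    Match(Index("user_by_monthly_salary"), 3000)
  )
)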
When the Sets have different shapes, they need to be modified, or else the Intersection will never return results:
Paginate(
  Intersection(
    Range(
      Match(Index("users__sorted_by_age")),
      [30],
      []
    ), // type is Set<[age, Ref]>
    Range(
      Match(Index("users__sorted_by_monthly_salary")),
      [2000],
      []
    )  // type is Set<[salary, Ref]>
  )
)

{
  data: [] // Intersection is empty
}
So how do we change the shape of the Set so they can be intersected? We can use the Join function, along with the Singleton function.
Join will run an operation over all entries in the Set. We will use that to return only a ref.
Join(
  Range(Match(Index("users__sorted_by_age")), [30], []),
  Lambda(["age", "ref"], Singleton(Var("ref")))
)
Altogether then:
Paginate(
  Intersection(
    Join(
      Range(Match(Index("users__sorted_by_age")), [30], []),
      Lambda(["age", "ref"], Singleton(Var("ref")))
    ),
    Join(
      Range(Match(Index("users__sorted_by_monthly_salary")), [2000], []),
      Lambda(["salary", "ref"], Singleton(Var("ref")))
    )
  )
)
Tips for combining indexes
You can use additional logic to combine different indexes when different terms are provided, or search for missing fields using bindings. There's a lot of cool stuff you can do.
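For example, here is a minimal sketch of combining the two inequality indexes above when either filter might be absent. The hard-coded parameter values and the use of Documents as an "everything" Set are assumptions for illustration, not a prescribed pattern:
Let(
  // Pretend these came from the client; either one may be null (no filter requested)
  { age: null, salary: 2000 },
  Paginate(
    Intersection(
      If(
        IsNull(Var("age")),
        Documents(Collection("User")), // no age filter: fall back to the whole collection
        Join(
          Range(Match(Index("users__sorted_by_age")), [Var("age")], []),
          Lambda(["age", "ref"], Singleton(Var("ref")))
        )
      ),
      If(
        IsNull(Var("salary")),
        Documents(Collection("User")),
        Join(
          Range(Match(Index("users__sorted_by_monthly_salary")), [Var("salary")], []),
          Lambda(["salary", "ref"], Singleton(Var("ref")))
        )
      )
    )
  )
)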
Do check out the cookbook and the Fauna forums as well for ideas.
BUT WHY!!!
It's a good question!
Consider this: Since Fauna is served as a serverless API, you get charged for each individual read and write on your documents and indexes as well as the compute time to execute your query. SQL can be much easier, but it is a much higher level language. Behind SQL sits a query planner making assumptions about how to get you your data. If it cannot do it efficiently it may default to scanning your entire table of data or otherwise performing an operation much more expensive than you might have expected.
With Fauna, YOU are the query planner. That means it is much more complicated to get started, but it also means you have fine control over the performance of your database and thus your cost.
We are working on improving the experience of defining schemas and the indexes you need, but at the moment you do have to define these queries at a low level.

Related

How to create where statement based on result of multiset

So, I would like to filter my query by an exact match within the result of a multiset. Any ideas how to do it in jOOQ?
Example:
val result = dsl.select(
    PLANT_PROTECTION_REGISTRATION.ID,
    PLANT_PROTECTION_REGISTRATION.REGISTRATION_NUMBER,
    PLANT_PROTECTION_REGISTRATION.PLANT_PROTECTION_ID,
    multiset(
        select(
            PLANT_PROTECTION_APPLICATION.ORGANISM_ID,
            PLANT_PROTECTION_APPLICATION.ORGANISM_TEXT
        ).from(PLANT_PROTECTION_APPLICATION)
         .where(PLANT_PROTECTION_APPLICATION.REGISTRATION_ID.eq(PLANT_PROTECTION_REGISTRATION.ID))
    ).`as`("organisms")
).from(PLANT_PROTECTION_REGISTRATION)
 // here I would like to filter my result only for records whose organisms contain a specific
 // organism id
 .where("organisms.organism_id".contains(organismId))
I've explained the following answer more in depth in this blog post
About the MULTISET value constructor
The MULTISET value constructor operator is so powerful, we'd like to use it everywhere :) But the way it works is that it creates a correlated subquery, which produces a nested data structure, which is hard to further process in the same SQL statement. It's not impossible. You could create a derived table and then unnest the MULTISET again, but that would probably be quite unwieldy. I've shown an example using native PostgreSQL in that blog post
Alternative using MULTISET_AGG
If you're not nesting things much more deeply, how about using the lesser known and lesser hyped MULTISET_AGG alternative instead? In your particular case, you could do:
// Using aliases to make things a bit more readable
val ppa = PLANT_PROTECTION_APPLICATION.`as`("ppa")

// Also, an implicit join helps keep things simpler
val ppr = ppa.plantProtectionRegistration().`as`("ppr")

dsl.select(
    ppr.ID,
    ppr.REGISTRATION_NUMBER,
    ppr.PLANT_PROTECTION_ID,
    multisetAgg(ppa.ORGANISM_ID, ppa.ORGANISM_TEXT).`as`("organisms"))
   .from(ppa)
   .groupBy(
       ppr.ID,
       ppr.REGISTRATION_NUMBER,
       ppr.PLANT_PROTECTION_ID)
   // Retain only those groups which contain the desired ORGANISM_ID
   .having(
       boolOr(trueCondition())
           .filterWhere(ppa.ORGANISM_ID.eq(organismId)))
   .fetch()

Index that returns the "data" field entirely in FaunaDB

I am trying to create an index that returns the entire data object of the documents in a collection.
Here is the code:
CreateIndex({
  name: "users_by_data",
  source: Collection("users"),
  values: { field: ['data'] }
})
but after creation it says: "Values Not set (using ref by default)"
If I specifically define fields (separately by their name), it behaves as expected, but data isn't working. The question is:
Is it impossible (e.g. for performance reasons) or am I doing it wrong?
Side note: I am aware that I can use a Lambda function on Paginate and achieve a similar result, but this question is specifically about the Index level.
You can currently index regular values (strings, numbers, dates, etc.), and you can index an array, which will more or less 'unroll' the array into separate index entries. However, what you are trying (indexing an object) is not possible at this point. An object (like data) will be ignored if you try to index it.
Currently, you have two options:
as you mentioned, using Map/Get at query time.
listing all values of the data object in the index, since you can select specific values of an object in the index (which is, however, less flexible if new attributes are added to the object later on); a sketch of this follows below.
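For example, a minimal sketch of the second option, assuming the documents have name and age fields (the index name and field list here are made up for illustration):
CreateIndex({
  name: "users_by_data_fields",
  source: Collection("users"),
  values: [
    { field: ["data", "name"] },
    { field: ["data", "age"] },
    { field: ["ref"] }
  ]
})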
We intend to support indexing of objects in the future, I can't provide an ETA yet though. There is a feature request on our forums as well that you can vote up: https://forums.fauna.com/t/object-as-terms-instead-of-scalar-s/628
You're going to want to use the Select function on the Ref you get back from the Index if you only want the data field back.
For an individual document, you can do something like this
Select( "data",
Get(
Match(
Index("yourIndexName"),
**yourIndexTerm // Could point to String/Number/FQL Ref
)
)
)
For a list of documents, you can use Paginate as you said, but you can still pull the data property out of each document:
Map(
  Paginate(
    Match(
      Index("yourIndexName"),
      yourIndexTerm // Could point to a String/Number/FQL Ref
    )
  ),
  Lambda("doc", Select("data", Get(Var("doc"))))
)

What is the benefit of defining datatypes for literals in an RDF graph?

I am using rdflib in Python to build my first RDF graph. However, I do not understand the explicit purpose of defining Literal datatypes. I have scoured the documentation and did my due diligence with Google and the Stack Overflow search, but I cannot seem to find an actual explanation for this. Why not just leave everything as a plain old Literal?
From what I have experimented with, is this so that you can search for explicit terms in your SPARQL query with BIND? Does this also help with FILTERing? i.e. FILTER (?var1 > ?var2), where var1 and var2 should represent integers/floats/etc.? Does it help with querying speed? Or am I just way off altogether?
Specifically, why add the following triple to mygraph
mygraph.add((amazingrdf, ns['hasValue'], Literal('42.0', datatype=XSD.float)))
instead of just this?
mygraph.add((amazingrdf, ns['hasValue'], Literal("42.0")))
I suspect that there must be some purpose I am overlooking. I appreciate your help and explanations - I want to learn this right the first time! Thanks!
Comparing two xsd:integer values in SPARQL:
ASK { FILTER (9 < 15) }
Result: true. Now with xsd:string:
ASK { FILTER ("9" < "15") }
Result: false, because when sorting strings, 9 comes after 1.
Some equality checks with xsd:decimal:
ASK { FILTER (+1.000 = 01.0) }
Result is true, it’s the same number. Now with xsd:string:
ASK { FILTER ("+1.000" = "01.0") }
False, because they are clearly different strings.
Doing some maths with xsd:integer:
SELECT (1+1 AS ?result) {}
It returns 2 (as an xsd:integer). Now for strings:
SELECT ("1"+"1" AS ?result) {}
It returns "11" as an xsd:string, because adding strings is interpreted as string concatenation (at least in Jena where I tried this; in other SPARQL engines, adding two strings might be an error, returning nothing).
As you can see, using the right datatype is important to communicate your intent to code that works with the data. The SPARQL examples make this very clear, but when working directly with an RDF API, the same kind of issues crop up around object identity, ordering, and so on.
As shown in the examples above, SPARQL offers convenient syntax for xsd:string, xsd:integer and xsd:decimal (and, not shown, for xsd:boolean and for language-tagged strings). That elevates those datatypes above the rest.
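The same distinction shows up directly in rdflib, outside of SPARQL. A minimal sketch (the values are made up):
from rdflib import Literal
from rdflib.namespace import XSD

# A plain literal stays a string; a typed literal converts to a native Python value.
plain = Literal("42.0")
typed = Literal("42.0", datatype=XSD.float)

print(plain.toPython())  # '42.0' (str): compares and sorts as text
print(typed.toPython())  # 42.0 (float): compares and sorts numerically

# They are also not equal as RDF terms, even though they look alike.
print(plain == typed)    # False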

Elasticsearch: match every position only once

In my Elasticsearch index I have documents that have multiple tokens at the same position.
I want to get a document back when I match at least one token at every position.
The order of the tokens is not important.
How can I accomplish that? I use Elasticsearch 0.90.5.
Example:
I index a document like this.
{
  "field": "red car"
}
I use a synonym token filter that adds synonyms at the same positions as the original token.
So now in the field, there are 2 positions:
Position 1: "red"
Position 2: "car", "automobile"
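For reference, the synonym setup described above would look roughly like this in the index settings (a sketch; the analyzer and filter names are made up):
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonyms": {
          "type": "synonym",
          "synonyms": ["car, automobile"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "my_synonyms"]
        }
      }
    }
  }
}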
My solution for now:
To be able to ensure that all positions match, I index the maximum position as well.
{
  "field": "red car",
  "max_position": 2
}
I have a custom similarity that extends DefaultSimilarity and returns 1 for tf(), idf() and lengthNorm(). The resulting score is the number of matching terms in the field.
Query:
{
  "custom_score": {
    "query": {
      "match": {
        "field": "a car is an automobile"
      }
    },
    "_script": "_score*100/doc[\"max_position\"]+_score"
  },
  "min_score": "100"
}
Problem with my solution:
The above search should not match the document, because there is no token "red" in the query string. But it does match, because Elasticsearch counts the matches for car and automobile as two matches; that gives a score of 2, which leads to a script score of 102, which satisfies the "min_score".
If you needed to guarantee 100% matches against the query terms, you could use minimum_should_match. This is the more common case.
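For example, a minimal sketch of that more common case (against the same field; note this is not what is being asked for here):
{
  "query": {
    "match": {
      "field": {
        "query": "a car is an automobile",
        "minimum_should_match": "100%"
      }
    }
  }
}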
Unfortunately, in your case, you wish to require 100% matches of the indexed terms. To do this, you'll have to drop down to the Lucene level and write a custom (Java; here's boilerplate you can fork) Similarity class, because you need access to low-level index information that is not exposed to the Query DSL:
Per document/field scanned in the query scorer:
Number of analyzed terms matched (overlap is the Lucene terminology; it is used in the coord() method of the DefaultSimilarity class)
Number of total analyzed terms in the field: Look at this thread for a couple different ways to get this information: How to count the number of terms for each document in lucene index?
Then your custom similarity (you can probably even extend DefaultSimilarity) will need to detect queries where terms matched < total terms and multiply their score by zero.
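As a very rough Java sketch of the zero-out idea only (hypothetical class name; overriding coord() handles "all query terms must match" and does not by itself count the indexed terms per field, which is the harder part described above):
import org.apache.lucene.search.similarities.DefaultSimilarity;

// Hypothetical sketch: only score a document when every query term matched.
// coord() multiplies the summed term scores, so returning 0 zeroes the score.
public class AllQueryTermsSimilarity extends DefaultSimilarity {
    @Override
    public float coord(int overlap, int maxOverlap) {
        // overlap = number of query terms that matched, maxOverlap = total query terms
        return overlap == maxOverlap ? 1.0f : 0.0f;
    }
}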
Since query and index-time analysis have already happened at this level of scoring, the total number of indexed terms will already be expanded to include synonyms, as should the query terms, avoiding the false-positive "a car is an automobile" issue above.

In Django, what is the most efficient way to check for an empty query set?

I've heard suggestions to use the following:
if qs.exists():
    ...

if qs.count():
    ...

try:
    qs[0]
except IndexError:
    ...
Copied from comment below: "I'm looking for a statement like "In MySQL and PostgreSQL count() is faster for short queries, exists() is faster for long queries, and use QuerySet[0] when it's likely that you're going to need the first element and you want to check that it exists. However, when count() is faster it's only marginally faster so it's advisable to always use exists() when choosing between the two."
query.exists() is the most efficient way.
Especially on Postgres, count() can be very expensive, sometimes more expensive than a normal select query.
exists() runs a query with no select_related, field selections or sorting, and only fetches a single record. This is much faster than counting the entire query with table joins and sorting.
qs[0] would still include select_related, field selections and sorting, so it would be more expensive.
The Django source code is here (django/db/models/sql/query.py RawQuery.has_results):
https://github.com/django/django/blob/60e52a047e55bc4cd5a93a8bd4d07baed27e9a22/django/db/models/sql/query.py#L499
def has_results(self, using):
    q = self.clone()
    if not q.distinct:
        q.clear_select_clause()
    q.clear_ordering(True)
    q.set_limits(high=1)
    compiler = q.get_compiler(using=using)
    return compiler.has_results()
Another gotcha that got me the other day is invoking a QuerySet in an if statement. That executes the query and fetches the whole result set!
If the variable query_set may be None (unset argument to your function) then use:

if query_set is None:
    #

not:

if query_set:
    # you just hit the database
exists() is generally faster than count(), though not always (see test below). count() can be used to check for both existence and length.
Only use qs[0] if you actually need the object. It's significantly slower if you're just testing for existence.
On Amazon SimpleDB, 400,000 rows:
bare qs: 325.00 usec/pass
qs.exists(): 144.46 usec/pass
qs.count(): 144.33 usec/pass
qs[0]: 324.98 usec/pass
On MySQL, 57 rows:
bare qs: 1.07 usec/pass
qs.exists(): 1.21 usec/pass
qs.count(): 1.16 usec/pass
qs[0]: 1.27 usec/pass
I used a random query for each pass to reduce the risk of db-level caching. Test code:
import timeit

base = """
import random
from plum.bacon.models import Session
ip_addr = str(random.randint(0,256))+'.'+str(random.randint(0,256))+'.'+str(random.randint(0,256))+'.'+str(random.randint(0,256))
try:
    session = Session.objects.filter(ip=ip_addr)%s
    if session:
        pass
except:
    pass
"""

query_variatons = [
    base % "",
    base % ".exists()",
    base % ".count()",
    base % "[0]"
]

for s in query_variatons:
    t = timeit.Timer(stmt=s)
    print "%.2f usec/pass" % (1000000 * t.timeit(number=100)/100000)
It depends on the context of use.
According to the documentation:
Use QuerySet.count()
...if you only want the count, rather than doing len(queryset).
Use QuerySet.exists()
...if you only want to find out if at least one result exists, rather than if queryset.
But:
Don't overuse count() and exists()
If you are going to need other data from the QuerySet, just evaluate it.
So, I think that QuerySet.exists() is the most recommended way if you just want to check for an empty QuerySet. On the other hand, if you want to use results later, it's better to evaluate it.
I also think that your third option is the most expensive, because you need to retrieve all records just to check if any exists.
@Sam Odio's solution was a decent starting point, but there are a few flaws in the methodology, namely:
The random IP address could end up matching 0 or very few results
An exception would skew the results, so we should aim to avoid handling exceptions
So instead of filtering something that might match, I decided to exclude something that definitely won't match, hopefully still avoiding the DB cache, but also ensuring the same number of rows.
I only tested against a local MySQL database, with the dataset:
>>> Session.objects.all().count()
40219
Timing code:
import timeit

base = """
import random
import string
from django.contrib.sessions.models import Session
never_match = ''.join(random.choice(string.ascii_uppercase) for _ in range(10))
sessions = Session.objects.exclude(session_key=never_match){}
if sessions:
    pass
"""

query_variations = [
    "",
    ".exists()",
    ".count()",
    "[0]",
]

for variation in query_variations:
    t = timeit.Timer(stmt=base.format(variation))
    print "{} => {:02f} usec/pass".format(variation.ljust(10), 1000000 * t.timeit(number=100)/100000)
outputs:
=> 1390.177710 usec/pass
.exists() => 2.479579 usec/pass
.count() => 22.426991 usec/pass
[0] => 2.437079 usec/pass
So you can see that count() is roughly 9 times slower than exists() for this dataset.
[0] is also fast, but it needs exception handling.
I would imagine that the first method is the most efficient way (you could easily implement it in terms of the second method, so perhaps they are almost identical). The last one requires actually getting a whole object from the database, so it is almost certainly the most expensive.
But, like all of these questions, the only way to know for your particular database, schema and dataset is to test it yourself.
I was also in this trouble. Yes, exists() is faster in most cases, but it depends a lot on the kind of queryset you are running. For example, for a simple query like my_objects = MyObject.objects.all(), you would use my_objects.exists(). But for a query like MyObject.objects.filter(some_attr='anything').exclude(something='what').distinct('key').values(), you probably need to test which one fits better (exists(), count(), len(my_objects)). Remember that the DB engine is the one that will perform the query, and getting good performance depends a lot on the data structure and how the query is formed. One thing you can do is audit the queries, run them yourself against the DB engine, and compare the results; you may be surprised by how naive Django sometimes is. Try QueryCountMiddleware to see all the queries executed, and you will see what I am talking about.
my_objects = MyObject.objets.all() you would use my_objects.exists(). But if you were to do a query like: MyObject.objects.filter(some_attr='anything').exclude(something='what').distinct('key').values() probably you need to test which one fits better (exists(), count(), len(my_objects)). Remember the DB engine is the one who will perform the query, and to get a good result in performance, it depends a lot on the data structure and how the query is formed. One thing you can do is, audit the queries and test them on your own against the DB engine and compare your results you will be surprised by how naive sometimes django is, try QueryCountMiddleware to see all the queries executed, and you will see what i am talking about.