How to handle concurrency in FaunaDB

I have some backend APIs which connect to FaunaDB; I'm able to do everything I need with the data, but I have some serious doubts about concurrent modifications (which are maybe not strictly related to FaunaDB, but I'd like to understand how to deal with them using this technology).
One example above all: I want to create a new document (A) in a collection (X) which is linked (via reference or other fields) to other documents (B and C) in another collection (Y); in order to be linked, these documents (B and C) must satisfy a condition (e.g. field F = "V"). Once A has been created, B and C cannot be modified (or the condition will be invalidated!).
Of course, the API that creates document A can run concurrently with the API used to modify documents B and C.
Here comes the doubt: what if, while creating document A linked to documents B and C, someone else changes field F of document B to something different from "V"?
I could end up with A linked to a wrong document, because neither API knows what the other one is doing.
Do I need to use the "Do" function in both APIs to create atomic transactions? So I can:
- Check if B and C are valid and, if yes, create A in a single transaction
- Check if B is linked to A and, if it isn't, modify it in a single transaction
Thanks everyone.

Fauna tries to present a consistent data view no matter where or when your clients ask. Isolation of transaction effects is what matters on short time scales (typically less than 10ms).
The Do function merely lets you combine multiple disparate FQL expressions into a single query. There is no conditional processing aspect to Do.
You can certainly check conditions before undertaking operations, and all Fauna queries are atomic transactions: all of the query succeeds or none of it does.
Arranging for intermediate query values in order to perform conditional logic does tend to make FQL queries more complex, but such queries are definitely possible.
The query for your first API might look something like this:
Let(
  {
    document_b: Get(<reference to B document>),
    document_c: Get(<reference to C document>),
    required_b: Select(["data", "required_field"], Var("document_b")),
    required_c: Select(["data", "other_required"], Var("document_c")),
    condition: And(Var("required_b"), Var("required_c")),
  },
  If(
    Var("condition"),
    Create(Collection("A"), { data: { <document A data> } }),
    Abort("Cannot create A because the conditions have not been met.")
  )
)
The Let function allows you to compose named values for intermediate expressions, which can read or write whatever they need, along with logical operations that determine which conditions need to be tested. The value composition is followed by an expression which, in this example, tests the conditions and only creates the document in the A collection when the conditions are met. When the conditions are not met, the transaction is aborted with an appropriate error message.
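If your backend API uses Fauna's Python driver (the faunadb package), the same transaction could be issued roughly as follows; this is only a sketch, and the secret, collection names, document IDs, and the data stored on A are placeholders:

from faunadb import query as q
from faunadb.client import FaunaClient

client = FaunaClient(secret="YOUR_FAUNA_SECRET")  # placeholder secret

# Placeholder references to documents B and C in collection Y.
b_ref = q.ref(q.collection("Y"), "101")
c_ref = q.ref(q.collection("Y"), "102")

result = client.query(
    q.let(
        {
            "document_b": q.get(b_ref),
            "document_c": q.get(c_ref),
            "required_b": q.select(["data", "required_field"], q.var("document_b")),
            "required_c": q.select(["data", "other_required"], q.var("document_c")),
            "condition": q.and_(q.var("required_b"), q.var("required_c")),
        },
        q.if_(
            q.var("condition"),
            # The linked refs stand in for whatever data document A actually needs.
            q.create(q.collection("A"), {"data": {"linked_b": b_ref, "linked_c": c_ref}}),
            q.abort("Cannot create A because the conditions have not been met."),
        ),
    )
)

Because the whole expression is submitted as one query, the reads of B and C and the conditional Create of A happen in a single atomic transaction: a concurrent change to field F either commits before this transaction (so the Abort branch runs) or after it.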
Let can nest Lets as much as required, provided the query fits within the maximum query length of 16MB, so you can embed a significant amount of logic into your queries. When the length of a single query is not sufficient, you can define UDFs that can be called, which let you store business logic that you can reuse any number of times.
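For example, if the conditional logic above were stored in a UDF (the name create_a_if_valid below is hypothetical, as is its two-reference argument list), each API call shrinks to a single Call; a sketch using the same Python driver and client as above:

result = client.query(
    q.call(
        q.function("create_a_if_valid"),   # hypothetical UDF wrapping the Let/If logic shown earlier
        q.ref(q.collection("Y"), "101"),   # placeholder reference to B
        q.ref(q.collection("Y"), "102"),   # placeholder reference to C
    )
)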
See the E-commerce tutorial for a UDF that performs all of the processing required to submit an order: it checks whether there is sufficient product in stock, deducts the requested quantities from inventory, sets the backordered status, and creates the order.

Related

Airflow: BigQueryOperator vs BigQuery Quotas and Limits

Is there any practical way to control quotas and limits on Airflow?
I'm especially interested in controlling BigQuery concurrency.
There are different levels of quotas on BigQuery, so, according to the Operator inputs, there should be a way to check whether the conditions are met and otherwise wait until they are fulfilled.
It seems to call for a composition of Sensors and Operators, querying against a database like Redis, for example:
QuotaSensor(Project, Dataset, Table, Query) >> QuotaAddOperator(Project, Dataset, Table, Query)
QuotaAddOperator(Project, Dataset, Table, Query) >> BigQueryOperator(Project, Dataset, Table, Query)
BigQueryOperator(Project, Dataset, Table, Query) >> QuotaSubOperator(Project, Dataset, Table, Query)
The Sensor must check conditions like:
- Global running queries <= 300
- Project running queries <= 100
- .. etc
Is there any lib that already does that for me? A plugin perhaps?
Or any other easier solution?
Otherwise, following the Sensor-Operator approach: how can I encapsulate all of it under a single operator (say, QuotaBigQueryOperator) to avoid repetition of code?
Currently, it is only possible to get the Compute Engine quotas programmatically. However, there is an open feature request to get/set other project quotas via the API. You can post the specific case you would like to have implemented there, and follow the request to track it and ask for updates.
Meanwhile, as a workaround, you can try to use the PythonOperator. With it you can define your own custom code, so you could implement retries for the queries that fail with a quotaExceeded error (or whichever specific error you are getting). In this way you wouldn't have to explicitly check the quota levels; you would just run the queries and retry until they get executed. This is simplified code for the strategy I have in mind:
for query in QUERIES_TO_RUN:
    while True:
        try:
            run(query)
        except quotaExceededException:
            continue  # Jumps to the next cycle of the nearest enclosing loop.
        break
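Wiring that into a DAG could look roughly like the following; run, quotaExceededException and QUERIES_TO_RUN are still the placeholder names from the snippet above, and the DAG itself is just an illustrative stub:

from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

dag = DAG(
    dag_id="bq_quota_retry_example",   # illustrative DAG, not a working pipeline
    start_date=datetime(2019, 1, 1),
    schedule_interval=None,
)

def run_queries_with_retry():
    # QUERIES_TO_RUN, run() and quotaExceededException stand for your own query
    # list, execution helper and quota-error class.
    for query in QUERIES_TO_RUN:
        while True:
            try:
                run(query)
            except quotaExceededException:
                continue  # retry this query until the quota frees up
            break  # query succeeded, move on to the next one

run_queries = PythonOperator(
    task_id="run_queries_with_retry",
    python_callable=run_queries_with_retry,
    dag=dag,
)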

Efficient Querying Data With Shared Conditions

I have multiple sets of data which are sourced from an Entity Framework code-first context (SQL CE). There's a GUI which displays the number of records in each query set, and upon changing some set condition (e.g. Date), the sets all need to recalculate their "count" value.
While every set's query is slightly different in some way, most of them share common conditions in some way. A simple example:
RelevantCustomers = People.Where(P => P.Transactions.Where(T => T.Date > SelectedDate).Count() > 0 && P.Type == "Customer");
RelevantSuppliers = People.Where(P => P.Transactions.Where(T => T.Date > SelectedDate).Count() > 0 && P.Type == "Supplier");
So the thing is, there's enough of these demanding queries, that each time the user changes some condition (e.g. SelectedDate), it takes a really long time to recalculate the number of records in each set.
I realise that part of the reason for this is the need to query through, for example, the transactions each time to check what is really the same condition for both RelevantCustomers and RelevantSuppliers.
So my question is: given that these sets share common "base conditions" which depend on the same sets of data, is there some more efficient way I could be calculating these sets?
I was thinking something with custom generic classes like this:
QueryGroup<People>(P=>P.Transactions.Where(T=>T.Date>SelectedDate).Count>0)
{
new Query<People>("Customers", P=>P.Type=="Customer"),
new Query<People>("Suppliers", P=>P.Type=="Supplier")
}
I can structure this just fine, but what I'm finding is that it makes basically no difference to the efficiency as it still needs to repeat the "shared condition" for each set.
I've also tried pulling the base condition data out as a static "ToList()" first, but this causes issues when running into navigation entities (i.e. People.Addresses don't get loaded).
Is there some method I'm not aware of here in terms of efficiency?
Thanks in advance!
Give something like this a try: combine "similar" values into fewer queries, then separate the results afterwards. Also, use Any() rather than Count() for an existence check. Your updated attempt goes part-way, but will still result in two hits to the database. When querying, it also helps to ensure that you are querying against indexed fields, and those indexes will be more efficient with numeric IDs rather than strings (i.e. a TypeID of 1 vs. 2 for "Customer" vs. "Supplier"). Normalized values are better for indexing and lead to smaller records, at the cost of more verbose queries.
var types = new string[] { "Customer", "Supplier" };
var people = People.Where(p => types.Contains(p.Type)
    && p.Transactions.Any(t => t.Date > selectedDate)).ToList();
var relevantCustomers = people.Where(p => p.Type == "Customer").ToList();
var relevantSuppliers = people.Where(p => p.Type == "Supplier").ToList();
This results in just one hit to the database, and the Any should be more performant than fetching an entire count. We split the customers and suppliers after the fact from the in-memory set. The caveat here is that any attempt to access details such as transactions etc. on those customers and suppliers would result in lazy-load hits, since we didn't eager-load them. If you need entire entity graphs then be sure to .Include() the relevant details, or be more selective about the data extracted in the first query, i.e. select anonymous types with the applicable details rather than just the entity.

What's more optimal: query chaining parent & child or selecting from parent's child objects

Curious which of these is better performance-wise. If you have a User with many PlanDates, and you know your user is the user with an id of 60, held in a variable current_user, is it better to do:
plan_dates = current_user.plan_dates.select { |pd| pd.attribute == test }
OR
plan_dates = PlanDate.joins(:user).where("plan_dates.attribute" => test).where("users.id" => 60)
Asking because I keep reading about the dangers of using select since it builds the entire object...
select is discouraged because, unlike the ActiveRelation methods where, joins, etc., it's a Ruby method from Enumerable which means that the entire user.plan_dates relation must be loaded into memory before the selection can begin. That may not make a difference at a small scale, but if an average user has 3,000 plan dates, then you're in trouble!
So, your second option, which uses just one SQL query to get the result, is the better choice. However, you can also write it like so:
user.plan_dates.where(attribute: test)
This is still just one SQL query, but leverages the power of ActiveRelation for a more expressive result.
The second. The select has to compare objects at the code level, while the second is just a query.
In addition, the second expression may not actually be executed unless you use the variable, while the first will always be executed.

NHibernate Filtering data best practices

I have the following situation:
A user logs in and opens an overview of all products, but can only see the products for which a condition is added; this condition is variable. Example: WHERE category in ('catA', 'CatB')
An administrator logs in and opens an overview of all products; he can see all products, no filter applied.
I need to make this as dynamic as possible. My data access classes use generics most of the time.
I've seen filters, but my conditions are very variable, so I don't see this as scalable enough.
We use NH filters for something similar, and it works fine. If no filter needs to be applied, you can omit setting any value for the filter. We use these filters for more basic stuff, filters that are applied nearly 100% of the time, e.g. deleted-object filters, client data segregation, etc. Not sure what scalability aspect you're looking for.
For more high level and complex filtering, we use a custom class that manipulates a repository root. Something like the following:
public IQueryOver<TIn, TOut> Apply(IQueryOver<TIn, TOut> query)
{
return query.Where(x => ... );
}
If you have an IoC container integrated with your NH usage, something like this can easily be generalized and plugged into your stack. We have these repository manipulators that apply simple where clauses, others that generate complex where clauses referencing domain logic, and others that join a second table and filter on it.
You could save all categories in a category list and pass this list to the query. If the list is not null and contains elements, you can work with the following:
List<string> allowedCategoriesList = new List<string>();
allowedCategoriesList.Add(...);
...
.WhereRestrictionOn(x => x.category).IsIn(allowedCategoriesList)
It's only important to skip this restriction if you do not have any filters (i.e. you want to see all entries without filtering), as otherwise you will not see a single result.

django objects.values(): select only some fields

I'm optimizing the memory load (~2GB, offline accounting and analysis routine) of this line:
l2 = Photograph.objects.filter(**(movie.get_selectors())).values()
Is there a way to convince django to skip certain columns when fetching values()?
Specifically, the routine obtains all rows of the table matching certain criteria (the DB is optimized and performs this very quickly), but it is a bit too much for Python to handle; there is a long string referenced in each row, storing the URLs for thumbnails.
I only really need three fields from each row, but, if all the fields are included, it suddenly consumes about 5kB/row which sadly pushes the RAM to the limit.
The values(*fields) function allows you to specify which fields you want.
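For example (a sketch; the three field names below are invented, since the question doesn't say which columns are needed):

l2 = Photograph.objects.filter(**(movie.get_selectors())).values('id', 'title', 'timestamp')

Each returned row is then a small dict containing only those columns, so the long thumbnail-URL string is never fetched.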
Check out the QuerySet method only. When you declare that you only want certain fields to be loaded immediately, the QuerySet manager will not pull in the other fields of your object until you try to access them.
If you have to deal with ForeignKeys that must also be pre-fetched, then also check out select_related.
The two links above to the Django documentation have good examples that should clarify their use.
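A sketch of both, again with invented field and relation names:

# only(): defer everything except the needed columns; deferred fields load lazily on access.
l2 = Photograph.objects.filter(**(movie.get_selectors())).only('id', 'title', 'timestamp')

# select_related(): if a ForeignKey (here a hypothetical `album`) is needed as well,
# fetch it in the same query instead of issuing one extra query per row.
l2 = Photograph.objects.filter(**(movie.get_selectors())).select_related('album')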
Take a look at Django Debug Toolbar; it comes with a debugsqlshell management command that allows you to see the SQL queries being generated, along with the time taken, as you play around with your models in a Django/Python shell.
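Assuming django-debug-toolbar is installed and 'debug_toolbar' has been added to INSTALLED_APPS, usage is roughly:

$ python manage.py debugsqlshell
>>> Photograph.objects.filter(**(movie.get_selectors())).values('id')[:5]
# the generated SQL and its execution time are printed each time a queryset is evaluated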