Raven Index Efficiency: Map Reduce vs. Transformer - ravendb

I'm wondering what is most efficient in Raven, using a transformer or using a map reduce. I'm looking at a case where all fields from several collections are returned, but I only need a fraction of them indexed. I ran a test on one example and I found that both Index building and Index retrieval is faster using the transformer. I'd like to know if this result will hold and what Raven is doing in the background to increase efficiency.
Below is a "toy example", where a Professor can teach in multiple Classes. I only need an index on Classes.Name.
Collection: Professors
ProfessorID, Name, PField1, PField2, ...
Collection: Classes
ClassID, Name, ProfessorID, CField1, CField2, ...
Transformer pseudo-code:
Map:
from class in docs.Classes select new {Name = class.Name}
Transformer:
from result in results
let prof = LoadDocument(result.ProfessorID)
select new {
result.ClassID, result.Name, result.ProfessorID,
result.CField1, result.CField2, ...
prof.Name, prof.PField1, prof.PField2, ...
}
Map/Reduce pseudo-code
Map1:
from prof in Professors
select { all fields}
Map2:
from class in Classes
select {all fields}
Reduce
from result in results
group result by result.ProfessorID into g
select {all fields}

A transformer runs during query, so it only does the work when/if you need it to.
Map/Reduce is meant to reduce the data, for aggregation, not really for what you are using it for.
It runs in the background and will build the projection for you.
I would recommend transformer for this.

Related

How to automatically break down a SQL-like query with many joins into discrete, independent steps?

Note: This is a learning exercise to learn how to implement a SQL-like relational database. This is just one thin slice of a question in the overall grand vision.
I have the following query, given a test database with a few hundred records:
select distinct "companies"."name"
from "companies"
inner join "projects" on "projects"."company_id" = "companies"."id"
inner join "posts" on "posts"."project_id" = "projects"."id"
inner join "comments" on "comments"."post_id" = "posts"."id"
inner join "addresses" on "addresses"."company_id" = "companies"."id"
where "addresses"."name" = 'Address Foo'
and "comments"."message" = 'Comment 3/3/2/1';
Here, the query is kind of unrealistic, but it demonstrates the point which I am trying to make. The point is to have a query with a few joins, so that I can figure out how to write this in sequential steps.
The first part of the question is (which I think I've partially figured out), is how do you write these joins as a sequence of independent steps, with the output of one fed into the input of the other? Also, is there more than one way to do it?
// step 1
let companies = select('companies')
// step 2
let projects = join(companies, select('projects'), 'id', 'company_id')
// step 3
let posts = join(projects, select('posts'), 'id', 'project_id')
// step 4
let comments = join(posts, select('comments'), 'id', 'post_id')
// step 5
let finalPosts = posts.filter(post => !!comments.find(comment => comment.post_id === post.id))
// step 6
let finalProjects = projects.filter(project => !!posts.find(post => post.project_id === project.id))
// step 7, could also be run in parallel to step 2 potentially
let addresses = join(companies, select('addresses'), 'id', 'company_id')
// step 8
let finalCompanies = companies.filter(company => {
return !!posts.find(post => post.company_id === company.id)
&& !!addresses.find(address => address.company_id === company.id)
})
These filters could probably be more optimized using indexes of some sort, but that is beside the point I think. This just demonstrates that there seem to be about 8 steps to find the companies we are looking for.
The main question is, how do you automatically figure out the steps from the SQL query?
I am not asking about how to parse the SQL query into an AST. Assume we have some sort of object structure we are dealing with, like an AST, to start.
How would you have to have the SQL query in structured object form, such that it would lead to these 8 steps? I would like to be able to specify a query (using a custom JSON-like syntax, not SQL), and then have it divide the query into these steps to divide and conquer so to speak and perform the queries in parts (for learning how to implement distributed databases). But I don't see how we go from SQL-like syntax, to 8 steps. Can you show how that might be done?
Here is the full code for the demo, which you can run with psql postgres -f test.sql. The result should be "Company 3".
Basically looking for a high level algorithm (doesn't even need to be code), which describes the key way you would break down some sort of AST-like object representation of a SQL query, into the actual planned steps of the query.
My algorithm looks like this in my head:
represent SQL query in object tree.
convert object tree to steps.
I am not really sure what (1) should be structured like, and even if we had some sort of structure, I'm not sure how to get that to complete (2). Looking for more details on the implementations of these steps, mainly step (2).
My "object structure" for step 1 would be something like this:
const query = {
select: {
distinct: true,
columns: ['companies.name'],
from: ['companies'],
},
joins: [
{
type: 'inner join',
table: 'projects',
left: 'projects.company_id',
right: 'companies.id',
},
...
],
conditions: [
{
left: 'addresses.name',
op: '=',
right: 'Address Foo'
},
...
]
}
I am not sure how useful that is, but it doesn't relate to steps at all. At a high level, what kind of code would I have to write to convert that object sort of structure into steps? Seems like one potential avenue is do a topological sort on the joins. But then you need to combine that with the select and conditions somehow, not sure how you would even begin to programmatically know what step should be before what other step, or even what the steps are. Maybe if I somehow could break it into known "chunks", then it would be simple to apply TOP sort to it after that, but then the question is still, how to get into chunks from the object structure / SQL?
Basically, I have been reading about the theory behind "query planning and optimization", but don't know how to apply it in this regard. How did this site do it?
One aspect is breaking at least the where conditions into CNF.
Implementing joins is a huge topic which is probably out of scope for a StackOverflow answer.
If you're looking for practical information about how joins are implemented, I would suggest...
The Join Operation section of Use The Index, Luke for different types of join implementation.
Section 7 of the The SQLite Query Optimizer Overview covers joins. And reading the SQLite source code. It is about as small a practical SQL implementation will get.
The output of explain in Postgresql gives very detailed information about how it has implemented the query. And they are explained in Operator Optimization Information

Cypher query for gremlin traversal

New to cypher query
g.V().has('Entity1','id',within(id1)).in('Entity2').
where(__.out('Entity3').where(__.out('Entity4').has('name',within(name))))
how to convert the above gremlin to cypher and return adjacent Entity2 invertex.
Here condition is
out('Entity3') should be out of Entity2
out('Entity4') should be out of Entity3 and name in the provided list of values
Return adjacent vertex of inEntity2
Straight answer:
MATCH (m:Entity1 )<-[:Entity2]-(n)
WHERE (n)-[:Entity3]->()-[:Entity4]->({name: "ABC"})
AND m.id in ["id1"]
RETURN n
# Assuming id is a property here.
# If id is the actual ID of the node
MATCH (m:Entity1 )<-[:Entity2]-(n)
WHERE (n)-[:Entity3]->()-[:Entity4]->({name: "ABC"})
AND ID(m) in ["id1"]
RETURN n
I tried to create the graph for you use-case using this query:
CREATE (a:Entity2)-[:Entity2]->(b:Entity1 {id:"id1"}),
(a)-[:Entity3]->(:Entity3)-[:Entity4]->(:Entity4 {name:"ABC"})
the graph looks like this:
However, I think while writing your gremlin traversal you had the intention of specifying the label of the vertex rather than label of the edge. That is why in the query I wrote to create the graph, the relationship and the vertex, relationship is pointing to have same label.
If that is your intention then your cypher query would look like.
MATCH (:Entity1 {id:"id1"})<--(n:Entity2)
WHERE (n)-->(:Entity3)-->(:Entity4 {name: "ABC"})
RETURN n
I'm not 100% sure what you are looking for as the Gremlin above seems incomplete compared to the description but I think what you are looking for is something like this:
MATCH (e1:Entity1)<-[:Entity2]-(e2)-[:Entity3]->(e3)-[:Entity4]->(e4 {code: 'LHR'})
WHERE e1 IN (id1)
RETURN e2

How to create query with Spring Data JPA?

Please help me. I can not create query from native sql using spring data.
How do I convert the following expression?
select distinct brand, model from cars
where brand like '%:query%' or model like '%:query%' limit 2;
You could use projections with Spring Data JPA, something of the likes of:
interface SimpleCar {
String getBrand();
String getModel();
}
and then:
List<SimpleCar> findDistinctByBrandLikeOrModelLike(String query);
You can review here for different options to query a DB with some examples.
By the way, if you want to see what queries your methods are executing you can activate DEBUG logging mode in log4j for org.springframework package or org.hibernate (if you're using Hibernate)
that helped me, hopefully you too
https://www.petrikainulainen.net/programming/spring-framework/spring-data-jpa-tutorial-three-custom-queries-with-query-methods/
Use:
#Query("select distinct brand, model from cars " +
"where brand like %:query% or model like %:query% limit 2")
public List<Car> find(#Param("query") String query);
Basically what you had but without the speech marks.

Search by exact words in a phrase using Umbraco Examine

I have some description field per content and those are:
For content1:
The quick brown fox jumps over the lazy dog. And the lazy dog is good.
For content2:
The lazy fog is crazy.
Now, when I use keyword = lazy dog, I want to give result as content1 and not content2
I tried like:
BaseSearchProvider searcher = ExamineManager.Instance.SearchProviderCollection["MySearch"];
ISearchCriteria criteria =
searcher.CreateSearchCriteria()
.GroupedAnd( new List<string> { "description" }, "lazy dog") )
.Compile();
ISearchResults result = searcher.Search( criteria );
But it didn't gave me desired results, it give me results: content1 and content2.
What should I do in order to get as content1 result ?
By default examine is compiling this query to:
+(+description:lazy dog)
and based on it it's returning the results with both: lazy and dog words.
What you want to achieve is:
+(+description:"lazy dog")
First of what you need to try is to escape the phrase. In your case it will be:
BaseSearchProvider searcher = ExamineManager.Instance.SearchProviderCollection["MySearch"];
ISearchCriteria criteria =
searcher.CreateSearchCriteria()
.GroupedAnd( new List<string> { "description" }, "lazy dog".Escape()) )
.Compile();
ISearchResults result = searcher.Search( criteria );
Can't test it now, but there were some problems with it in the past from what I remember. The second option and a life saver for you, may be building the search query manually and using the raw query.
BaseSearchProvider searcher = ExamineManager.Instance.SearchProviderCollection["MySearch"];
ISearchCriteria criteria = searcher.CreateSearchCriteria();
var query = criteria.RawQuery("+description:\"lazy dog\"");
ISearchResults result = searcher.Search( query );
And it should return you correct = matched result only. Personally, I've used also some boosting of specific words to just point some results higher in the score list, but if you want to have only matched items, try above solutions and let me know if it helped you.
If you want to deal with more than one property, you can either use some fluent API methods like GroupedAnd or GroupedOr (depending of the desired behaviour of search) or build more advanced raw query.
For the first option, check Grouped Operations documentation: https://github.com/Shazwazza/Examine/wiki/Grouped-Operations.
For the second scenario it would be the best to analyze how it's done e.g. in ezSearch package (which btw. is awesome!): https://github.com/umco/umbraco-ezsearch/blob/master/Src/Our.Umbraco.ezSearch/Web/UI/Views/MacroPartials/ezSearch.cshtml.

Django ORM Cross Product

I have three models:
class Customer(models.Model):
pass
class IssueType(models.Model):
pass
class IssueTypeConfigPerCustomer(models.Model):
customer=models.ForeignKey(Customer)
issue_type=models.ForeignKey(IssueType)
class Meta:
unique_together=[('customer', 'issue_type')]
How can I find all tuples of (custmer, issue_type) where there is no IssueTypeConfigPerCustomer object?
I want to avoid a loop in Python. A solution which solves this in the DB would be preferred.
Background: for every customer and for every issue-type, there should be a config in the DB.
If you can afford to make one database trip for each issue type, try something like this untested snippet:
def lacking_configs():
for issue_type in IssueType.objects.all():
for customer in Customer.objects.filter(
issuetypeconfigpercustomer__issue_type=None
):
yield customer, issue_type
missing = list(lacking_configs())
This is probably OK unless you have a lot of issue types or if you are doing this several times per second, but you may also consider having a sensible default instead of making a config object mandatory for each combination of issue type and customer (IMHO it is a bit of a design-smell).
[update]
I updated the question: I want to avoid a loop in Python. A solution which solves this in the DB would be preferred.
In Django, every Queryset is either a list of Model instances or a dict (values querysets), so it is impossible to return the format you want (a list of tuples of Model) without some Python (and possibly multiple trips to the database).
The closest thing to a cross product would be using the "extra" method without a where parameter, but it involves raw SQL and knowing the underlying table name for the other model:
missing = Customer.objects.extra(
select={"issue_type_id": 'appname_issuetype.id'},
tables=['appname_issuetype']
)
As a result, each Customer object will have an extra attribute, "issue_type_id", containing the id of one IssueType. You can use the where parameter to filter based on NOT EXISTS (SELECT 1 FROM appname_issuetypeconfigpercustomer WHERE issuetype_id=appname_issuetype.id AND customer_id=appname_customer.id). Using the values method you can have something close to what you want - this is probably enough information to verify the rule and create the missing records. If you need other fields from IssueType just include them in the select argument.
In order to assemble a list of (Customer, IssueType) you need something like:
cross_product = [
(customer, IssueType.objects.get(pk=customer.issue_type_id))
for customer in
Customer.objects.extra(
select={"issue_type_id": 'appname_issuetype.id'},
tables=['appname_issuetype'],
where=["""
NOT EXISTS (
SELECT 1
FROM appname_issuetypeconfigpercustomer
WHERE issuetype_id=appname_issuetype.id
AND customer_id=appname_customer.id
)
"""]
)
]
Not only this requires the same number of trips to the database as the "generator" based version but IMHO it is also less portable, less readable and violates DRY. I guess you can lower the number of database queries to a couple using something like this:
missing = Customer.objects.extra(
select={"issue_type_id": 'appname_issuetype.id'},
tables=['appname_issuetype'],
where=["""
NOT EXISTS (
SELECT 1
FROM appname_issuetypeconfigpercustomer
WHERE issuetype_id=appname_issuetype.id
AND customer_id=appname_customer.id
)
"""]
)
issue_list = dict(
(issue.id, issue)
for issue in
IssueType.objects.filter(
pk__in=set(m.issue_type_id for m in missing)
)
)
cross_product = [(c, issue_list[c.issue_type_id]) for c in missing]
Bottom line: in the best case you make two queries at the cost of legibility and portability. Having sensible defaults is probably a better design compared to mandatory config for each combination of Customer and IssueType.
This is all untested, sorry if some homework was left for you.