What is filter pushdown optimization? (SQL)

Could you please provide some examples?

First, here is a proper definition of "filter pushdown":
One way to prevent loading data that is not actually needed is filter pushdown (sometimes also referred to as predicate pushdown), which enables the execution of certain filters at the data source before it is loaded to an executor process. This becomes even more important if the executors are not on the same physical machine as the data.
Note that:
In many cases, filter pushdown is automatically applied by Spark without explicit commands or input from the user. But in certain cases, users have to provide specific information or even implement certain functionality themselves, especially when creating custom data sources, e.g. for unsupported database types or unsupported file types.
Now, you can find a simple example in Databricks. In this example, you can see that the order of select and filter operations can be optimized with regard to query execution performance.
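For illustration, here is a minimal sketch in Spark (Scala); the file path and column names are hypothetical, not from the original post:

import org.apache.spark.sql.SparkSession

// Minimal sketch: read a (hypothetical) Parquet file and filter it.
val spark = SparkSession.builder().appName("pushdown-demo").getOrCreate()
import spark.implicits._

val cars = spark.read.parquet("/data/cars.parquet")

// The filter is written after the read, but Catalyst pushes the predicate
// down to the Parquet reader, so row groups that cannot match are skipped
// at the source instead of being loaded into executors.
val cheap = cars.filter($"price" < 10000).select($"make")

// explain() prints the physical plan; a pushed-down predicate appears
// under "PushedFilters" in the FileScan node.
cheap.explain()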

Related

Incremental indexing for semantic search

I wonder if there are any standards or best practices for performing incremental indexing of a triple store for semantic search purposes.
To support semantic search, one usually uses Solr or Elasticsearch, where resources are indexed according to some specific SPARQL query. While one can re-index the entire resource set once a day, for instance, that is not desirable. Hence the need to perform it incrementally. However, that requires tracking changes somehow, with the ultimate goal of indexing or deleting only what has changed.
For instance, to index only what has changed, the SPARQL query should include some kind of timestamp filter.
If anyone has suggestions or experience with this that they would like to share, it would be much appreciated.
So far I have been somewhat inspired by the EEA ElasticSearch RDF River Plugin. I'm also looking at the Changeset Ontology.
The easiest way to accomplish this would be to get something involved in the transaction lifecycle. Then you're able to see the changes to the database, which will give you the graph that needs to be indexed.
But don't dismiss doing a full re-index on a periodic schedule, such as nightly. Unless your requirement is that full-text searches must always be against the most recent data and your data changes quickly, a full re-index on a regular basis will work just fine.
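As a concrete illustration of the timestamp-filter idea from the question, here is a minimal sketch using Apache Jena from Scala; the endpoint URL and the assumption that the store maintains dcterms:modified for every resource are hypothetical:

import org.apache.jena.query.QueryExecutionFactory

val endpoint = "http://example.org/sparql" // hypothetical endpoint
val lastIndexed = "2014-06-01T00:00:00Z"   // persisted after each indexing run

// Select only resources modified since the last indexing run.
val query =
  s"""PREFIX dcterms: <http://purl.org/dc/terms/>
     |PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
     |SELECT ?resource ?modified WHERE {
     |  ?resource dcterms:modified ?modified .
     |  FILTER (?modified > "$lastIndexed"^^xsd:dateTime)
     |}""".stripMargin

val exec = QueryExecutionFactory.sparqlService(endpoint, query)
val changed = exec.execSelect() // feed these resources to the indexer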

GridGain SQL queries without data model + other GridGain SQL questions

I have been checking out GridGain for a while and came across some features regarding GridGain's SQL capabilities, which led me to some questions (to which I couldn't find a firm answer in the docs).
From the examples, there is always an explicit data model. I am using Java, so that means there's always a class definition of the model to be queried for. The examples in the API docs: http://atlassian.gridgain.com/wiki/display/GG60/SQL,+Scan,+And+Full+Text+Queries begin by showing how properties must be annotated, which suggests to me that an explicit model is always required. Properties of the model can be annotated for SQL querying, e.g. with "@GridCacheQuerySqlField". Is an explicit data model always required? Ideally, I would like a way to avoid explicitly stating the model, as my use case changes often and has complex relations.
What subset of SQL queries can be performed through GridGain's SQL API? My use cases often require very complex queries. For example, the docs (same link as above) state that "Continuous Queries cannot be used with SQL. Only predicate-based queries are supported." Where can I find what subset of SQL is supported, and under what conditions?
Thanks in advance for the insight.
GridGain has support for a non-fixed data model in the Enterprise version, namely portable objects. Portable objects allow you to represent the data model as a map-like nested structure, which allows dynamic structure changes, indexing, and portability across different languages (Java, C#, .NET). You can take a look at portable objects in the GridGain Enterprise edition examples and read the documentation here: http://entdoc.gridgain.org/latest/Portable+Cross+Platform+Objects In the open-source version an explicit class definition is always required.
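For reference, a minimal sketch (in Scala) of the explicit-model approach the question describes, based on the GridGain 6.x API from the linked docs; the Car class and its fields are made up for illustration:

import org.gridgain.grid.cache.query.GridCacheQuerySqlField

// Hypothetical cache value type; in the open-source edition an explicit
// class like this is required for SQL querying.
class Car extends Serializable {
  @GridCacheQuerySqlField(index = true)
  var make: String = _

  @GridCacheQuerySqlField
  var price: Double = _
}

// Querying then goes through the cache's query API, roughly along these lines:
// cache.queries().createSqlQuery(classOf[Car], "price < ?").execute(Double.box(10000))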
The SQL limitations are described in GridCacheQuery javadoc: http://gridgain.com/sdk/6.5.0/javadoc/org/gridgain/grid/cache/query/GridCacheQuery.html
Group by and sort by statements are applied separately on each node, so the result set will likely be incorrectly grouped or sorted after results from multiple remote nodes are combined.
Aggregation functions like sum, max, avg, etc. are also applied on each node. Therefore you will get several results containing aggregated values, one for each node (see the merge sketch after this list).
Joins will work correctly only if joined objects are stored in collocated mode or at least one side of the join is stored in REPLICATED cache.
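To make the aggregation caveat concrete, here is a small client-side merge sketch (Scala); the Partial shape is hypothetical, and it illustrates why per-node averages cannot simply be averaged again:

// Each node returns its own aggregate. For AVG, the per-node partials must
// carry (sum, count): averaging the per-node averages would give a node with
// ten rows the same weight as a node with ten million rows.
case class Partial(sum: Double, count: Long)

def mergeAvg(partials: Seq[Partial]): Double = {
  require(partials.nonEmpty, "need at least one per-node partial")
  val total = partials.reduce((a, b) => Partial(a.sum + b.sum, a.count + b.count))
  total.sum / total.count
}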

Does Scalding support record filtering via predicate pushdown w/Parquet?

There are obvious speed benefits from not having to read records that would fail a filter. I see that Spark supports it, but I haven't found any documentation on how to do it with Scalding.
Unfortunately there is no support for this in scalding-parquet yet. We at Tapad started working on implementing predicate support in Scalding. Once we get something working, we'll share it.
We have implemented our own ParquetAvroSource that can read/store Avro records in Parquet. It's possible to use column projection and read only the columns/fields required by a Scalding job. In some cases, using this feature, jobs read only 1% of the input bytes.
Predicate pushdown was added to Scalding, but it is not documented yet.
For more details see scalding issue #1089
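For a sense of what such a predicate looks like, here is a sketch of Parquet's filter2 API (Scala), which is the predicate representation this work builds on; the column names are made up, the package name varies across Parquet versions, and the exact hook for attaching the predicate to a Scalding source depends on the Scalding version:

import org.apache.parquet.filter2.predicate.{FilterApi, FilterPredicate}
import org.apache.parquet.io.api.Binary

// Hypothetical columns "age" (int) and "country" (string). Row groups whose
// statistics prove the predicate can never match are skipped entirely,
// which is where the read-time speedup comes from.
val predicate: FilterPredicate = FilterApi.and(
  FilterApi.gtEq(FilterApi.intColumn("age"), Integer.valueOf(18)),
  FilterApi.eq(FilterApi.binaryColumn("country"), Binary.fromString("US"))
)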

Elasticsearch querying multiple types and grouped by types?

Suppose I am to search against two types [cars] and [buildings], and I want the results to be separated. Is there a way one can group results by types?
I understand one simple way would be to query each type separately, but for other use cases one may actually need to query tens or hundreds of types together. Is there a native way or a hacky way (like using sort) to achieve this?
This type of grouping behavior is (currently) not available in elasticsearch. It has been a long-standing request:
https://github.com/elasticsearch/elasticsearch/issues/256
There are two approaches that can help, both of which are far from perfect, but may be good enough for some use cases.
Client-side aggregation. Request a lot more results than you plan on displaying and then bucket those on the client (see the sketch below).
Using multi-query. This allows you to easily pass down some number of queries in a single batch, but will have potential scaling problems if the number of queries gets too large.
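A minimal sketch of the client-side approach (Scala); the Hit shape stands in for a deserialized Elasticsearch hit and is hypothetical:

// Over-fetch, then bucket hits by their _type and keep the top N per bucket.
case class Hit(tpe: String, id: String, score: Double)

def topPerType(hits: Seq[Hit], perType: Int = 10): Map[String, Seq[Hit]] =
  hits.groupBy(_.tpe).map { case (t, sameType) =>
    t -> sameType.sortBy(-_.score).take(perType)
  }

The obvious drawback is that a type matched mostly by low-scoring documents may still end up with an empty bucket unless you over-fetch aggressively.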
This is one feature that Solr has that elasticsearch doesn't, but I have never tried it. I used a similar feature with Autonomy IDOL years back, but the performance was abysmal.
If you want the results separated into groups of documents, you're going to have to restructure your documents, since Elasticsearch is focused on finding matching documents. You might get around this by designing a document that has child documents; then you can query for matches on the parent document that represents your type.
I guess there might be some common field (let's say it's [price]) if you want to search against different types. Then it would be reasonable to add some different type like [price_aggregator] and put the fields [type] and [price] into it. Then you could easily build your query against just one type. This requires some additional work while indexing and more memory to store the index, but it's much more performant when you search.
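A sketch of what such a denormalized document might look like (Scala); the field names are made up:

// One price_aggregator document per searchable entity, whatever its original
// type; grouping then reduces to filtering on the "type" field of one type.
case class PriceAggregatorDoc(`type`: String, originalId: String, price: Double)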

Named Graphs and Federated SPARQL Endpoints

I recently came across the working draft for SPARQL 1.1 Federation Extensions and wondered whether this was already possible using Named Graphs (not to detract from the usefulness of the aforementioned draft).
My understanding of Named Graphs is a little hazy, save that the only thing I have gleaned from reading the specs comprises the rules around merger and non-merger in relation to other graphs at query time. Since this doesn't fully satisfy my understanding, my question is as follows:
Given the following query:
SELECT ?something
FROM NAMED <http://www.vw.co.uk/models/used>
FROM NAMED <http://www.autotrader.co.uk/cars/used>
WHERE {
...
}
Is it reasonable to assume that a query processor/endpoint could or should in the context of the named graphs do the following:
Check if the named graph exists locally
If it doesn't, then perform the following operation (in the case of the above query, I will use the second named graph)
GET /sparql/?query=EncodedQuery HTTP/1.1
Host: www.autotrader.co.uk
User-agent: my-sparql-client/0.1
Where the EncodedQuery only includes the second named graph in the FROM NAMED clause and the WHERE clause is amended accordingly with respect to GRAPH clauses (e.g. if a GRAPH <http://www.vw.co.uk/models/used> {...} is being used).
Only if it can't perform the above, do either of the following:
GET /cars/used HTTP/1.1
Host: www.autotrader.co.uk
or
LOAD <http://www.autotrader.co.uk/cars/used>
Return appropriate search results.
Obviously there might be some additional considerations around OFFSETs and LIMITs (a sketch of this dispatch strategy follows below).
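A sketch of the dispatch strategy described above (Scala); every name here is hypothetical and only serves to make the proposed control flow concrete:

// Steps for resolving a single FROM NAMED graph URI, as proposed above:
// 1. answer locally if the named graph is in the store;
// 2. otherwise rewrite the query for that graph and send it to the endpoint
//    presumed to serve that URI;
// 3. as a last resort, dereference/LOAD the graph URI itself.
trait FederatingEndpoint {
  def hasLocalGraph(graphUri: String): Boolean
  def queryLocal(query: String, graphUri: String): Seq[String]
  def queryRemote(endpoint: String, rewritten: String): Option[Seq[String]]
  def loadAndQuery(graphUri: String, query: String): Seq[String]
  def endpointFor(graphUri: String): String // e.g. scheme + host + "/sparql/"
  def rewriteForGraph(query: String, graphUri: String): String

  def resolve(graphUri: String, query: String): Seq[String] =
    if (hasLocalGraph(graphUri)) queryLocal(query, graphUri)
    else queryRemote(endpointFor(graphUri), rewriteForGraph(query, graphUri))
      .getOrElse(loadAndQuery(graphUri, query))
}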
I also remember reading somewhere, a long time ago in a galaxy far, far away, that the default graph of any SPARQL endpoint should be a named graph according to the following convention:
For http://www.vw.co.uk/sparql/ there should be a named graph http://www.vw.co.uk that represents the default graph; by the above logic, it should already be possible to federate SPARQL endpoints using named graphs.
The reason I ask is that I want to start promoting federation across the domains in the above example, without having to wait around for the standard, making sure that I won't do something that is out of kilter or incompatible with something else in the future.
Named graphs and the URLs used in federated queries (via SERVICE or FROM) are two different things. The latter point to SPARQL endpoints; named graphs live within a triple store and have the main function of separating different data sets. This, in turn, can be useful both to improve performance and to represent knowledge, such as recording the source of a set of statements.
For instance, you might have two data sources both stating that ?movie has-rating ?x, and you might want to know which source is stating which rating. In this case you can use two named graphs associated with the two sources (e.g., http://www.example.com/rotten-tomatoes and http://www.example.com/imdb). If you're storing both data sets in the same triple store, you will probably want to use NGs; remote endpoints are a different thing. Furthermore, the URL of a named graph can be used with vocabularies like VoID to describe a dataset as a whole (e.g., the data set name, where and when the triples were imported from, who the maintainer is, the usage licence). This is another reason to partition your triple store into NGs.
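To make the distinction concrete, here is a sketch (Scala with Apache Jena) of a single query that combines a local named graph with a remote endpoint via SERVICE; the endpoint URLs and the has-rating predicate are illustrative only:

import org.apache.jena.query.QueryExecutionFactory

val q =
  """SELECT ?movie ?localRating ?remoteRating WHERE {
    |  # named graph inside the local triple store
    |  GRAPH <http://www.example.com/rotten-tomatoes> {
    |    ?movie <http://www.example.com/has-rating> ?localRating .
    |  }
    |  # remote SPARQL endpoint, contacted at query time
    |  SERVICE <http://remote.example.org/sparql> {
    |    ?movie <http://www.example.com/has-rating> ?remoteRating .
    |  }
    |}""".stripMargin

// The local endpoint evaluates the GRAPH pattern itself and dispatches the
// SERVICE pattern to the remote endpoint.
val exec = QueryExecutionFactory.sparqlService("http://local.example.org/sparql", q)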
That said, your mechanism to bind NGs to endpoint URLs might be implemented as an option, but I don't think it's a good idea to have it as mandatory, since managing remote endpoint URLs and NGs separately can be more useful.
Moreover, the real challenge in federated queries is to offer endpoint-transparent queries, making the query engine smart enough to analyse the query and understand how to split it and perform partial queries on the right endpoints (and join the results later, in an efficient way). There is a lot of research being done on that, one of the most significant results (as far as I know) is FedX, which has been used to implement several query distribution optimisations (example).
One last thing to add: I vaguely remember the convention you mention about $url, $url/sparql. There are a couple of approaches around (e.g., the LOD cloud). That said, in most of today's triple stores (e.g., Virtuoso), queries that don't specify a named graph (i.e., don't use GRAPH) do not simply fall back to a default graph; they actually query the union of all named graphs in the store, which is usually much more useful (when you don't know where something is stated, or when you want to integrate cross-graph data).