Is it possible, using RDF, to create a relationship between points (latitude/longitude) such that, when the data is loaded into a store (such as Jena Fuseki with the GeoSPARQL extension), the distance between points will be inferred, e.g. in kilometers?
Otherwise, in order to determine the distance between points and add it to the triple store, I imagine that a query would be necessary for each point (and/or a single query with a subquery that effectively runs the calculation for all points at once). Does this sound correct?
Ideally, the solution is fast, as there could be millions of points. For example, I tried measuring the distances from 600,000 points to 9,000 linestrings (600k x 9k comparisons) in Jena GeoSPARQL Fuseki, and each comparison was taking roughly 10 seconds.
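For reference, the query-per-point approach described above might look roughly like the following, using GeoSPARQL's geof:distance at query time (a minimal sketch issued through SPARQLWrapper against a hypothetical Fuseki endpoint; the endpoint URL, the geo:hasGeometry/geo:asWKT layout and the unit handling are assumptions, not something the question confirms):

from SPARQLWrapper import SPARQLWrapper, JSON

# Sketch only: endpoint and data layout are placeholders.
QUERY = """
PREFIX geo:  <http://www.opengis.net/ont/geosparql#>
PREFIX geof: <http://www.opengis.net/def/function/geosparql/>
PREFIX uom:  <http://www.opengis.net/def/uom/OGC/1.0/>

SELECT ?pointA ?pointB ?km
WHERE {
  ?pointA geo:hasGeometry/geo:asWKT ?wktA .
  ?pointB geo:hasGeometry/geo:asWKT ?wktB .
  FILTER(?pointA != ?pointB)
  BIND(geof:distance(?wktA, ?wktB, uom:metre) / 1000 AS ?km)
}
LIMIT 10
"""

sparql = SPARQLWrapper("http://localhost:3030/dataset/query")
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["pointA"]["value"], row["pointB"]["value"], row["km"]["value"])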
An argument in favor of graph DBMSs with native storage over relational DBMSs, made by Neo4j (also in the Neo4j graph databases book), is that "index-free adjacency" is the most efficient means of processing data in a graph (due to the 'clustering' of the data/nodes in a graph-based model).
Based on some benchmarking I've performed, where three nodes are sequentially connected (A->B<-C) and, given the id of A, I'm querying for C, the scalability is clearly O(n) when testing the same query on databases with 1M, 5M, 10M and 20M nodes. This is reasonable (with my limited understanding), considering I am not limiting my query to one node only, so all nodes need to be checked for a match. HOWEVER, when I index the queried node property, the execution time for the same query is relatively constant.
The figure shows execution time by database node count before and after indexing; the orange plot is an O(n) reference line, while the blue plot shows the observed execution times.
Based on these results I'm trying to figure out where the advantage of index-free adjacency comes in. Is it advantageous when querying with a limit of 1 for deep(er) links, e.g. a depth of 4 in A->B->C->D->E, querying for E given A? In this case we know that there is only one match for A (hence no need to brute-force through all the other nodes that are not part of this sub-network).
As this is highly dependent on the query, I'm listing an example Cypher query below for reference (where I'm matching the entity-labeled node with id 1, and returning the associated node (B in the above example) and the secondary-linked node (C in the above example)):
MATCH (:entity{id:1})-[:LINK]->(result_assoc:assoc)<-[:LINK]-(result_entity:entity) RETURN result_entity, result_assoc
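For the deeper-link case mentioned above (finding E given A at a depth of 4 with a limit of 1), the corresponding query could be run through the official Neo4j Python driver along these lines (a sketch only; the URI, credentials and labels are placeholders and not part of the original benchmark):

from neo4j import GraphDatabase

# Sketch only: connection details and node labels are illustrative.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

DEPTH_4_QUERY = """
MATCH (:entity {id: 1})-[:LINK]->()-[:LINK]->()-[:LINK]->()-[:LINK]->(e)
RETURN e
LIMIT 1
"""

with driver.session() as session:
    for record in session.run(DEPTH_4_QUERY):
        print(record["e"])

driver.close()

The LIMIT 1 mirrors the "only one match" assumption discussed above.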
UPDATE / ADDITIONAL INFORMATION
This source states: "The key message of index-free adjacency is, that the complexity to traverse the whole graph is O(n), where n is the number of nodes. In contrast, using any index will have complexity O(n log n)." This statement explains the O(n) results before indexing. I guess the O(1) performance after indexing is essentially hash-table performance(?). I'm not sure why the complexity with any other index would be O(n log n), when even a hash table's worst case is O(n).
From my understanding, the index-free aspect is only pertinent to adjacent nodes (that's why it's called index-free adjacency). What your plots are demonstrating is that once you have found A, the additional time to find C is negligible, and the question of whether to use an index or not only applies to finding the initially queried node A.
Finding A without an index takes O(n), because the database has to scan through all of its nodes; with an index, it's effectively a hash-table lookup and takes O(1) (no clue why the book says O(n log n) either).
Beyond that, finding the adjacent nodes is not that hard for Neo4j, because they are linked to A, whereas in a relational model the linkage is not as explicit, so a join, which is expensive, followed by a scan/filter, is required. So to truly see the advantage, one should compare the performance of graph DBs and relational DBs while varying the depth of the relations/links. It would also be interesting to see how a query performs when the number of neighbours of the entity nodes is increased (i.e., the graph becomes denser): does Neo4j rely on the graphs never being too dense? Otherwise the problem of looking through the neighbours to find the right one repeats itself.
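To make the join cost concrete, the relational equivalent of the depth-2 Cypher pattern above might look like the following (a sketch using SQLite and a made-up link table; each additional hop means another join):

import sqlite3

# Hypothetical relational layout for the A -> B <- C pattern above:
# a link table joins entity rows to assoc rows (the entity/assoc attribute
# tables themselves are omitted for brevity).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE link (entity_id INTEGER, assoc_id INTEGER);
    CREATE INDEX idx_link_entity ON link(entity_id);
    CREATE INDEX idx_link_assoc  ON link(assoc_id);
""")

# Two passes over the link table reproduce A -> B <- C; deeper paths add more joins.
rows = conn.execute("""
    SELECT l2.entity_id AS result_entity, l1.assoc_id AS result_assoc
    FROM link AS l1
    JOIN link AS l2 ON l2.assoc_id = l1.assoc_id AND l2.entity_id <> l1.entity_id
    WHERE l1.entity_id = 1
""").fetchall()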
I created a test dataset of roughly 450GB in BigQuery and I am getting execution times of ~9 seconds when querying the largest table (10bn rows) from the web UI. I just wanted to check whether this is a 'normal' expected result, and whether it would get worse with larger sizes (i.e. 100bn rows+) and more complex queries. I am aware of table partitioning etc., but I just want to get a sense of what 'normal' expected speed is before getting into optimization, since the above seems like a 'smallish' size for what BQ is meant for.
The above result is achieved on a simple query like this:
select ColumnA from DataSet.Table order by ColumnB desc limit 100
So the result returned to the client is very small. ColumnA contains UUIDs stored as strings and ColumnB is an integer.
It's almost impossible to say whether this is "normal" or not. BigQuery is a multi-tenant architecture/infrastructure, which means we all share the same resources (i.e. compute power) in the cluster when running queries. Therefore, query times are never deterministic in BigQuery, i.e. they can vary depending on the number of concurrent queries from other users at any given time. That said, you can get reserved slots for a flat-rate price, although you'd need to be spending quite a lot of money to justify that.
You can improve execution times by removing compute/shuffle/memory-intensive steps such as ORDER BY. Obviously, the complexity of the query will also have an impact on query times.
On some of our projects we can smash through 3TB-5TB with a relatively complex query in about 15s-20s. Sometimes it's quicker, sometimes it's slower. We also run queries over much smaller datasets that can take the same amount of time. This comes back to what I wrote at the beginning: BigQuery query times are not deterministic.
Finally, BigQuery will cache results, so if you issue the same query multiple times over the same dataset, it will be returned from the cache, i.e. much quicker!
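If you want to confirm whether the cache was used (and how much data a run actually scanned), the Python client exposes that on the job object. A minimal sketch, assuming application-default credentials and the table from the question:

from google.cloud import bigquery

client = bigquery.Client()  # assumes application-default credentials

sql = "SELECT ColumnA FROM DataSet.Table ORDER BY ColumnB DESC LIMIT 100"

job = client.query(sql)          # cached results are reused automatically when possible
rows = list(job.result())

print("cache hit:", job.cache_hit)
print("bytes processed:", job.total_bytes_processed)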
I was introduced to ElasticSearch's significant terms aggregation a while ago and was positively surprised by how good and relevant this metric turns out to be. For those not familiar with it, it's quite a simple concept: for a given query (the foreground set), a given property is scored against its statistical significance relative to the background set.
For example, if we were querying for the most significant crime types in the British Transport Police:
C = 5,064,554 -- total number of crimes
T = 66,799 -- total number of bicycle thefts
S = 47,347 -- total number of crimes in British Transport Police
I = 3,640 -- total number of bicycle thefts in British Transport Police
Ordinarily, bicycle thefts represent only 1% of crimes (66,799/5,064,554), but for the British Transport Police, who handle crime on railways and stations, 7% of crimes (3,640/47,347) are bicycle thefts. This is a significant seven-fold increase in frequency.
The significance for "bicycle theft" would be [(I/S) - (T/C)] * [(I/S) / (T/C)] = 0.371...
Where:
C is the number of all documents in the collection
S is the number of documents matching the query
T is the number of documents with the specific term
I is the number of documents that intersect both S and T
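Spelled out in code, the score works out as follows (a minimal sketch reproducing the bicycle-theft numbers above):

def significance(C, T, S, I):
    # [(I/S) - (T/C)] * [(I/S) / (T/C)] from the formula above
    foreground = I / S   # term frequency in the foreground (query) set
    background = T / C   # term frequency in the whole collection
    return (foreground - background) * (foreground / background)

# The British Transport Police example:
print(significance(C=5_064_554, T=66_799, S=47_347, I=3_640))  # ~0.371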
For practical reasons (the sheer amount of data I have and huge ElasticSearch memory requirements), I'm looking to implement the significant terms aggregation in SQL or directly in code.
I've been looking at some ways to potentially optimize this kind of query, specifically, decreasing the memory requirements and increasing the query speed, at the expense of some error margin - but so far I haven't cracked it. It seems to me that:
The variables C and S are easily cacheable or queryable.
The variable T could be derived from a Count-Min Sketch instead of querying the database (a sketch of such a structure follows this list).
The variable I, however, seems impossible to derive with the Count-Min Sketch from T.
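For illustration, a Count-Min Sketch along the lines of the second point might look like this (a generic sketch, not tied to any particular library; it over-estimates counts but never under-estimates them):

import hashlib

class CountMinSketch:
    def __init__(self, width=2048, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, term):
        # One hashed column per row, derived from a salted digest of the term.
        for row in range(self.depth):
            digest = hashlib.md5(f"{row}:{term}".encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, term, count=1):
        for row, col in self._cells(term):
            self.table[row][col] += count

    def estimate(self, term):
        # The minimum over rows bounds the over-estimation error.
        return min(self.table[row][col] for row, col in self._cells(term))

Estimating T this way trades a per-term database lookup for a bounded over-count.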
I was also looking at the MinHash, but from the description it seems that it couldn't be applied here.
Does anyone know about some clever algorithm or data structure that helps tackle this problem?
I doubt a SQL impl will be faster.
The values for C and T are maintained ahead of time by Lucene.
S is a simple count derived from the query results, and I is looked up using O(1) data structures. The main cost is the many T lookups for each of the terms observed in the chosen field. Using min_doc_count typically helps to drastically reduce the number of these lookups.
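For example, raising min_doc_count in the aggregation request is a one-line change (a sketch using an elasticsearch-py 8.x-style call; the index, field names and threshold are illustrative):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

resp = es.search(
    index="crimes",          # illustrative index and field names
    size=0,
    query={"term": {"force": "British Transport Police"}},
    aggs={
        "significant_crime_types": {
            "significant_terms": {
                "field": "crime_type",
                "min_doc_count": 10,   # skip very rare terms to cut down T lookups
            }
        }
    },
)
print(resp["aggregations"]["significant_crime_types"]["buckets"])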
"For practical reasons (the sheer amount of data I have and huge ElasticSearch memory requirements)"
Have you looked into using doc values to manage elasticsearch memory better? See https://www.elastic.co/blog/support-in-the-wild-my-biggest-elasticsearch-problem-at-scale
An efficient solution is possible when the foreground set is small enough that you can afford to process all of its documents (a code sketch follows the steps below).
Collect the set {Xk} of all terms occurring in the foreground set for the chosen field, as well as their frequencies {fk} in the foreground set.
For each Xk, calculate its significance as (fk - Fk) * (fk / Fk), where Fk = Tk/C is the frequency of Xk in the background set.
Select the terms with the highest significance values.
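A minimal sketch of those steps, assuming you already have per-document term sets for the foreground set and a background document-count table for Tk:

from collections import Counter

def significant_terms(foreground_docs, background_counts, C, top_n=10):
    # foreground_docs: iterable of per-document term sets for the chosen field
    # background_counts: dict mapping term Xk -> Tk (background document count)
    # C: total number of documents in the collection
    S = 0
    doc_freq = Counter()
    for terms in foreground_docs:
        S += 1
        doc_freq.update(set(terms))               # per-term counts in the foreground set

    scores = {}
    for term, count in doc_freq.items():
        fk = count / S                            # frequency in the foreground set
        Fk = background_counts.get(term, 0) / C   # frequency in the background set
        if Fk > 0:
            scores[term] = (fk - Fk) * (fk / Fk)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]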
However, due to the simplicity of this approach, I wonder if ElasticSearch already contains that optimization. If it doesn't - then it very soon will!
I commonly deal with datasets that have well over 5 billion data points in a 3D grid over time. Each data point has a certain value, which needs to be visualised, so it's effectively a 5-dimensional dataset. Let's say the data for each point looks like (x, y, z, time, value).
I need to run arbitrary queries against these datasets where, for example, the value is within a certain range or below a certain value.
I also need to run queries that return all data for a specific z value.
These are the most common queries that I would need to run against this dataset. I have tried the likes of MySQL and MongoDB and created indices for those values, but the resource requirements are quite extreme and the query runtimes long. I ended up writing my own file format to at least store the data for relatively easy retrieval, but this approach makes it difficult to find data without having to read/scan the entire dataset.
I have looked at the likes of Hadoop and Hive, but the queries are not designed to be run in real-time. In terms of data size it seems a better fit though.
What would be the best method to index such large amounts of data efficiently? Is a custom indexing system the best approach, or should I slice the data into smaller chunks and devise a specific way of indexing them (which way, though?)? The goal is to be able to run queries against the data and have the results returned in under 0.5 seconds. My best so far was 5 seconds, by running the entire DB on a huge RAM drive.
Any comments and suggestions are welcome.
EDIT:
x, y, z, time and value are all stored as FLOATs.
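To make the "slice the data into smaller chunks and index them" idea concrete, here is a toy sketch with NumPy that keeps per-chunk min/max summaries so a value-range query only scans candidate chunks (the chunk size and the (x, y, z, time, value) column order are assumptions):

import numpy as np

CHUNK_ROWS = 1_000_000  # illustrative chunk size

def build_chunks(points):
    # points: an (N, 5) float array laid out as (x, y, z, time, value)
    chunks, summaries = [], []
    for start in range(0, len(points), CHUNK_ROWS):
        chunk = points[start:start + CHUNK_ROWS]
        chunks.append(chunk)
        summaries.append((chunk.min(axis=0), chunk.max(axis=0)))
    return chunks, summaries

def query_value_range(chunks, summaries, lo, hi, value_col=4):
    hits = []
    for chunk, (mins, maxs) in zip(chunks, summaries):
        if maxs[value_col] < lo or mins[value_col] > hi:
            continue                      # the whole chunk can be skipped
        mask = (chunk[:, value_col] >= lo) & (chunk[:, value_col] <= hi)
        hits.append(chunk[mask])
    return np.concatenate(hits) if hits else np.empty((0, 5))

A real implementation would also sort or partition the chunks (e.g. by z and time) so that the min/max summaries prune more aggressively.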
It really depends on the hardware you have available, but regardless of that and considering the type and the amount of data you are dealing with, I definitely suggest a clustered solution.
As you already mentioned, Hadoop is not a good fit because it is primarily a batch processing tool.
Have a look at Cassandra and see if it solves your problem. I feel like a column-store RDBMS such as CitusDB (free up to 6 nodes) or Vertica (free up to 3 nodes) may also prove useful.
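As an illustration of why a clustered column-family layout fits the "all data for a specific z" query, a Cassandra schema might partition on a z bucket (a sketch only; the keyspace, table and bucketing scheme are made up, and value-range queries would still need a different layout or a secondary index):

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()   # placeholder contact point

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS grid
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")

# Partitioning on a z bucket puts each z-slice on one partition, so
# "all data for a specific z" becomes a single-partition read.
session.execute("""
    CREATE TABLE IF NOT EXISTS grid.points (
        z_bucket int,
        t double,
        x float, y float, z float,
        value float,
        PRIMARY KEY ((z_bucket), t, x, y)
    )
""")

rows = session.execute("SELECT x, y, z, value FROM grid.points WHERE z_bucket = 42")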
I'm building a screener that should be able to search through a table about 50 columns wide and 7000 rows long really fast.
Each row is composed of the following columns.
primary_key, quantity1, quantity2, quantity3...quantity50.
All quantities are essentially floats or integers. Hence a typical screener would look like this.
Get all rows which have quantity1 > x and quantity2 < y and quantity3 >= z.
Indexing all columns should lead to really fast search times; however, some of the columns will be updated in real time, and indexing everything obviously makes inserts/updates very slow.
A portion of the columns is fairly static, though. Hence the idea of segregating the data into two tables: one containing all the static columns and the other containing the dynamic data. A screener would then be applied to both tables, based on the actual query, and the results combined at the end.
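A rough sketch of that static/dynamic split and a combined screener query, using SQLite for illustration (table and column names are made up; the static side carries the indexes, the dynamic side stays cheap to update):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE static_quantities (
        primary_key INTEGER PRIMARY KEY,
        quantity1 REAL,
        quantity2 REAL          -- ... the rarely-updated columns, heavily indexed
    );
    CREATE INDEX idx_static_q1 ON static_quantities(quantity1);
    CREATE INDEX idx_static_q2 ON static_quantities(quantity2);

    CREATE TABLE dynamic_quantities (
        primary_key INTEGER PRIMARY KEY,
        quantity3 REAL          -- ... the real-time columns, left mostly unindexed
    );
""")

# Filter the static side via its indexes, then join to the dynamic side
# and apply the remaining predicates.
rows = conn.execute("""
    SELECT s.primary_key
    FROM static_quantities AS s
    JOIN dynamic_quantities AS d USING (primary_key)
    WHERE s.quantity1 > ? AND s.quantity2 < ? AND d.quantity3 >= ?
""", (1.0, 2.0, 3.0)).fetchall()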
I am currently planning to use a MySQL engine, most probably InnoDB. However, I'm looking for much faster response times. An implementation of the same problem on a certain site was very snappy: regardless of the query size, I was getting the results within 500 ms. I'm wondering what other options are available to implement this functionality.