We have Task objects in Aerospike. Each task has a list of dependent Task ids, like:
TaskA = {id=TaskA, dependencies=['TaskB', 'TaskC']}
TaskB = {id=TaskB, dependencies=[]}
TaskC = {id=TaskC, dependencies=['TaskD']}
TaskD = {id=TaskD, dependencies=[]}
I'm looking for the most efficient way to get all dependencies (including transitive ones) in one query to Aerospike. In our example, the query should return all 4 tasks.
You are trying to run a graph query. Aerospike, like most popular NoSQL databases, is not a native graph database and so does not offer index-free adjacency; that is, Aerospike cannot traverse the links on the server and send you just the results. You will have to pull each record into your application, read its relationships, then read those individual records in turn - i.e. traverse in the application.
Alternatively, embed the entire "graph" in a single record, if that works for your data model. NoSQL modeling uses this "encapsulation" technique as an alternative to a flexible and extensible graph data model.
In the near future, Aerospike will be supported as a storage layer in a graph implementation and then you can store graph data models and run Gremlin queries with data storage on Aerospike. More here: https://medium.com/aerospike-developer-blog/are-graph-databases-finally-ready-for-prime-time-8f7ddd49a855
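The application-side traversal described above could be sketched roughly as follows. This is a minimal illustration, not Aerospike client code: `batchFetch` is a hypothetical stand-in for a batch read against the database, and `exampleStore` is an in-memory substitute loaded with the four tasks from the question. Fetching one whole level of the graph per call keeps the number of round trips equal to the depth of the graph rather than the number of tasks.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

final class DependencyTraversal {

    /**
     * Collects rootId plus all of its direct and transitive dependencies.
     * batchFetch is a hypothetical stand-in for a batch read: given task ids,
     * it returns each id's list of direct dependency ids.
     */
    static Set<String> collectAll(String rootId,
            Function<List<String>, Map<String, List<String>>> batchFetch) {
        Set<String> visited = new LinkedHashSet<>();
        visited.add(rootId);
        List<String> frontier = List.of(rootId);
        while (!frontier.isEmpty()) {
            // One batch call per level of the graph, not one call per task.
            Map<String, List<String>> fetched = batchFetch.apply(frontier);
            List<String> next = new ArrayList<>();
            for (String id : frontier) {
                for (String dep : fetched.getOrDefault(id, List.of())) {
                    // add() returns false for ids already seen, which also guards against cycles.
                    if (visited.add(dep)) {
                        next.add(dep);
                    }
                }
            }
            frontier = next;
        }
        return visited;
    }

    /** In-memory stand-in for the database, loaded with the example tasks. */
    static Function<List<String>, Map<String, List<String>>> exampleStore() {
        Map<String, List<String>> tasks = Map.of(
                "TaskA", List.of("TaskB", "TaskC"),
                "TaskB", List.<String>of(),
                "TaskC", List.of("TaskD"),
                "TaskD", List.<String>of());
        return ids -> {
            Map<String, List<String>> result = new LinkedHashMap<>();
            for (String id : ids) {
                if (tasks.containsKey(id)) {
                    result.put(id, tasks.get(id));
                }
            }
            return result;
        };
    }

    public static void main(String[] args) {
        System.out.println(collectAll("TaskA", exampleStore()));
    }
}
```

For the example data this takes three batch calls (TaskA, then TaskB and TaskC, then TaskD) and returns all four task ids. It is still several round trips, not the single query asked for; only the single-record "embedded graph" model avoids that.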
I have Regions in GemFire with a large number of records.
I need to look up elements in those Regions for validation purposes. The lookup happens for every item we scan, and there can be more than 10000 items.
What would be an efficient way to look up elements in Regions?
Please suggest.
Vikas-
There are several ways in which you can look up, or fetch multiple elements from a GemFire Region.
As you can see, a GemFire Region indirectly implements java.util.Map, and so provides all the basic Map operations, such as get(key):value, in addition to several other operations that are not available in Map like getAll(Collection keys):Map.
However, get(key):value is not going to be the most "efficient" method for looking up multiple items at once, whereas getAll(..) allows you to pass in a Collection of keys for all the values you want returned. Of course, you have to know the keys of all the values you want in advance.
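As a rough sketch of the bulk-lookup idea, the helper below fetches keys in chunks rather than one get(key) per scanned item. It uses only java.util so it can run on its own: `bulkGet` is a hypothetical stand-in for a bulk read, the class name and the chunk size are arbitrary choices, and treating a null value as "no entry for this key" is an assumption made here for illustration.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

final class BatchedLookup {

    /**
     * Looks up all keys in chunks of batchSize instead of one call per key.
     * bulkGet is a stand-in for a bulk read that returns a key-to-value map.
     */
    static <K, V> Map<K, V> lookupAll(List<K> keys, int batchSize,
            Function<Collection<K>, Map<K, V>> bulkGet) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        Map<K, V> found = new HashMap<>();
        for (int start = 0; start < keys.size(); start += batchSize) {
            int end = Math.min(start + batchSize, keys.size());
            // One bulk call per chunk of keys.
            Map<K, V> chunk = bulkGet.apply(keys.subList(start, end));
            for (Map.Entry<K, V> entry : chunk.entrySet()) {
                // Assumption: a null value means there is no entry for that key.
                if (entry.getValue() != null) {
                    found.put(entry.getKey(), entry.getValue());
                }
            }
        }
        return found;
    }
}
```

Assuming a `Region<String, Person>` named `people`, a call might look like `BatchedLookup.lookupAll(keys, 500, people::getAll)`; any key missing from the returned map then fails validation. Whether 500 is a sensible chunk size is something to measure rather than assume.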
If you do not know the keys up front, you can obtain GemFire's QueryService from the Region by calling region.getRegionService().getQueryService(). The QueryService allows you to write GemFire Queries with OQL (Object Query Language). See GemFire's User Guide on Querying for more details.
The advantage of using OQL over getAll(keys) is, of course, that you do not need to know the keys of all the values you might need to validate up front. If the validation logic is based on some criteria that match the values to be evaluated, you can express these criteria in the OQL Query Predicate.
For example...
SELECT * FROM /People p WHERE p.age >= 21;
To call upon the GemFire QueryService to write the query above, you would...
Region people = cache.getRegion("/People");
...
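// Region.getRegionService() returns the cache (or client cache) that owns the Region
QueryService queryService = people.getRegionService().getQueryService();
Query query = queryService.newQuery("SELECT * FROM /People p WHERE p.age >= $1");
// bind 21 to the $1 parameter; execute(..) declares checked query exceptions
SelectResults<Person> results = (SelectResults<Person>) query.execute(new Object[] { 21 });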
// process (e.g. validate) the results
OQL Queries can be parameterized and arguments passed to the Query.execute(args:Object[]) method as shown above. When the appropriate Indexes are added to your GemFire Regions, then the performance of your Queries can improve dramatically. See the GemFire User Guide on creating Indexes.
Finally, with GemFire PARTITION Regions especially, your Region data is partitioned, or "sharded", and distributed across the nodes (GemFire Servers) in the cluster that host the Region of interest (e.g. /People). In that case you can combine querying with GemFire's Function Execution service to query the data locally on the node where it actually exists (e.g. the shard/bucket of the PARTITION Region containing a subset of the data), rather than bringing the data to you. You can even encapsulate the "validation" logic in the GemFire Function you write.
You will need to use the RegionFunctionContext along with the PartitionRegionHelper to get the local data set of the Region to query. Read the Javadoc of PartitionRegionHelper as it shows the particular example you are looking for in this case.
Spring Data GemFire can help with many of these concerns...
For Querying, you can use the SD Repository abstraction and extension provided in SDG.
For Function Execution, you can use SD GemFire's Function Execution Annotation support.
Be careful though: using the SD Repository abstraction inside a Function context is not going to limit the query to just the "local" data set of the PARTITION Region. SD Repos always work on the entire data set of the "logical" Region, where the data is necessarily distributed across the nodes in a cluster in a partitioned (sharded) setup.
You should definitely familiarize yourself with GemFire Partitioned Regions.
In summary...
The approach you choose above really depends on several factors, such as, but not limited to:
How you organized the data in the first place (e.g. PARTITION vs. REPLICATE, which refers to the Region's DataPolicy).
How amenable your validation logic is to supplying "criteria" to, say, an OQL Query Predicate to "SELECT" only the Region data you want to validate. Additionally, efficiency might be further improved by applying appropriate Indexing.
How many nodes are in the cluster and how distributed your data is, in which case a Function might be the most advantageous approach, i.e. bring the logic to your data rather than the data to your logic. The latter involves selecting the matching data on the nodes where the data resides (which could involve several network hops depending on your topology and configuration, i.e. "single-hop access", etc.), serializing the data to send over the wire, thereby increasing the saturation on your network, and so on.
Depending on your use case, other factors to consider are your expiration/eviction policies (e.g. whether data has been overflowed to disk), the frequency of the needed validations based on how often the data changes, etc.
Most of the time, it is better to validate the data on the way in and catch errors early. Naturally, as data is updated, you may also need to perform subsequent validations, but that is no substitute for verifying as early as possible.
There are many factors to consider and the optimal approach is not always apparent, so test and make sure your optimizations and overall approach have the desired effect.
Hope this helps!
Regards,
-John
Set up the PDX serializer and use the query service to get your element. "Select element from /region where id=xxx". This will return your element field without deserializing the record. Make sure that id is indexed.
There are other ways to validate quickly if your inbound data is streaming rather than a client lookup, such as the Function Service.
I am building a factory QC fixture that measures, analyzes, and stores data on the physical dimensions of products leaving a factory. The raw data for each measured product starts off as a table with 5 columns, and up to 86000 rows. To get useful information, this table must undergo some processing. This data is collected in LabVIEW, but stored in an SQL server database. I want to ask whether it's best to pass the data to the server and process it in there (via stored procedure), or process it outside the server and then add it in?
Let me tell you about the processing to be done:
To get meaningful information from the raw data, each record in the table needs to be passed into a function that calculates parameters of interest. The function also uses other records (I'll call them secondary records) from the raw table. The contents of the record originally passed into the function dictate what secondary records the function uses to perform calculations. The function also utilizes some trigonometric operators. My concern is that SQL will be very slow or buggy when doing calculations over a big table. I am not sure if this sort of task is something SQL is efficient at doing, or if I'm better off trying to get it done through the GPU using CUDA.
Hopefully I'm clear enough on what I need, please let me know if not.
Thanks in advance for your answers!
Generally we need SQL Server to help us sort, search, index, and update data, and to share the data with multiple users (according to their privileges) at the same time. I see no work for SQL Server in the task you've described. It looks like no one needs any of those 86000 rows before they've been processed.
HBase's co-processor is a good example of "moving calculations not data". Does Aerospike support something similar?
Aerospike supports user-defined functions (UDFs), which are functions users load into the database and execute.
Aerospike provides two types of UDF, record and stream, both of which are equivalent to HBase's Endpoint coprocessors, in that they execute against the data and return a result. The record UDF is executed against a single record, allowing for record modifications and calculations on a single record. The stream UDF is executed against query results, providing the ability to analyze or aggregate the data. Both UDF types execute on the node containing the data, and return a user-defined result.
Aerospike does not support the concept of HBase's Observer coprocessors, which are executed based on an event.
This isn't quite a direct answer to your question, but VoltDB supports near-arbitrary Java processing in the database process local to the partition of data you're interested in. You can mix Java and SQL in a fully transactional environment and still scale to millions of ACID transactions per second.
New to MongoDB. Is MongoDB efficient for real-time queries where the criteria values change on every query? There will also be some aggregation of the result set before the response is sent back to the user. As an example, my use case needs to produce data in the following format after processing a collection for different criteria values.
Service   Total   Improved
A         1000    500
B         2000    700
..        ..      ..
I see MongoDB has Aggregation, which processes records and returns computed results. Should I be using aggregation instead for efficiency? If aggregation is the way to go, I guess I would run it every time my source data changes. Also, is this what Mongo Hadoop is used for? Am I on the right track in my understanding? Thanks in advance.
Your question is too general, IMHO.
Speed depends on the size of your data, the kind of query you run, whether you have put an index on your key, etc.
Changing the values in your queries is not critical, AFAIK.
For example I work on a MongoDB with 3 million docs and can do some queries in a couple of seconds, some in a couple of minutes. A simple map reduce over all 3 M docs takes about 25 min on that box.
I have not tried the aggregation API yet, which seems to be a successor/alternative to map / reduce runs.
I did not know about the MongoDB / Hadoop integration. It seems to keep MongoDB as an easy-to-use storage unit that feeds data to a Hadoop cluster and gets results back from it, using Hadoop's more advanced map reduce framework (more phases, better use of a cluster of Hadoop nodes).
I would follow MongoDB's guidelines for counting stuff.
See MongoDB's documentation page for pre-aggregated reports.
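The idea behind pre-aggregated reports is to keep the counters up to date as each record arrives, so that producing the Service / Total / Improved table is a cheap read instead of a scan over the whole collection. The sketch below only illustrates that pattern with an in-memory map; the class and method names are hypothetical, and the meaning of "improved" is assumed to be a per-record flag. In MongoDB the counters would live in a document and be updated with an increment operator such as `$inc`.

```java
import java.util.Map;
import java.util.TreeMap;

final class PreAggregatedReport {

    // Running counters per service: index 0 = total, index 1 = improved.
    private final Map<String, long[]> counters = new TreeMap<>();

    /** Called once per incoming record, at write time. */
    void record(String service, boolean improved) {
        long[] counts = counters.computeIfAbsent(service, s -> new long[2]);
        counts[0]++;
        if (improved) {
            counts[1]++;
        }
    }

    /** Total number of records seen for the service (0 if none). */
    long total(String service) {
        long[] counts = counters.get(service);
        return counts == null ? 0 : counts[0];
    }

    /** Number of records flagged as improved for the service (0 if none). */
    long improved(String service) {
        long[] counts = counters.get(service);
        return counts == null ? 0 : counts[1];
    }
}
```

This only fits when the report criteria are known ahead of time. If the criteria really change on every query, the counters cannot be prepared in advance and an indexed query or the aggregation framework is the more natural choice.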
Hadoop is good for batch processing, which you probably don't need for these counting use cases?
See this list for other typical Hadoop use cases: link.
And here's a resource for typical Mongo+Hadoop use cases: link.
I have a situation where I want to do some DB-related operations in a Java application (e.g. in Eclipse). I use MySQL as an RDBMS and Hibernate as an ORM provider.
I retrieve all records using embedded SQL in Java:
// Define connections ...etc
ResultSet result = myStmt.executeQuery("SELECT * FROM employees");
// iterator
I retrieve all records using Hibernate ORM / JPQL:
// Connections, Entity Manager ...etc
List<Employees> result = em.createQuery("SELECT emp FROM Employees emp", Employees.class).getResultList();
// iterator
I know that the RDBMS is located in secondary memory (disk). The question is: when I get both results back, where are the employees actually? In secondary memory (SM) or in main memory (MM)?
In the end I want to have two object populations for further testing, one operating on SM and one on MM. How is this possible?
Thanks
Frank
Your Java objects are real Java objects; they are in (to use your term) MM, at least for a while. The beauty of the Hibernate/JPA programming model is that while in MM you can pretty much treat the objects as if they were any other Java object, make a few changes to them, etc. Then, at some agreed time, Hibernate's persistence mechanism gets them back to SM (disk).
You will need to read up on the implications of Sessions and Transactions in order to understand when the transitions between MM and SM occur, and also very importantly, what happens if two users want to work with the same data at the same time.
Maybe start here
It is also possible to create objects in MM that are (at least for now) not related to any data on disk - these are "transient" objects - and also to "disconnect" data in memory from what's on disk.
My bottom line here is that Hibernate/JPA does remove much grunt work from persistence coding, but it cannot hide the complexity of scale. As your data volumes increase, your data model's complexity grows, and your users' actions contend for data, you need to do serious thinking. Hibernate allows you to achieve good things, but it can't do that thinking for you; you have to make careful choices as your problem domain gets more complex.