HBase's co-processor is a good example of "moving calculations, not data". Not sure if Aerospike supports something similar to this?
Aerospike supports user-defined functions (UDFs), which are functions users load into the database and execute.
Aerospike provides two types of UDF, record and stream, both of which are equivalent to HBase's Endpoint coprocessors in that they execute against the data and return a result. A record UDF executes against a single record, allowing for record modifications and calculations on that record. A stream UDF executes against query results, providing the ability to analyze or aggregate the data. Both UDF types execute on the node containing the data and return a user-defined result.
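For illustration, here is a minimal sketch using the Aerospike Java client; the namespace "test", set "products", Lua module "adjust.lua", and function "increment" are all hypothetical names chosen for this example, not part of any real application.
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Key;
import com.aerospike.client.Language;
import com.aerospike.client.Value;
import com.aerospike.client.task.RegisterTask;

public class RecordUdfExample {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);
        try {
            // Ship the (hypothetical) Lua UDF module to the cluster; it runs server-side.
            RegisterTask task = client.register(null, "udf/adjust.lua", "adjust.lua", Language.LUA);
            task.waitTillComplete();

            // Invoke the record UDF against a single record; the computation
            // happens on the node that owns this key.
            Key key = new Key("test", "products", "product-42");
            Object result = client.execute(null, key, "adjust", "increment",
                    Value.get("counter"), Value.get(10));
            System.out.println("UDF returned: " + result);
        } finally {
            client.close();
        }
    }
}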
Aerospike does not support the concept of HBase's Observer coprocessors, which are executed based on an event.
This isn't quite a direct answer to your question, but VoltDB supports near-arbitrary Java processing in the database process local to the partition of data you're interested in. You can mix Java and SQL in a fully transactional environment and still scale to millions of ACID transactions per second.
I have Regions in GemFire with a large number of records.
I need to look up elements in those Regions for validation purposes. The lookup happens for every item we scan; there can be more than 10,000 items.
What would be an efficient way to look up elements in those Regions?
Please suggest.
Vikas-
There are several ways in which you can look up, or fetch multiple elements from a GemFire Region.
As you can see, a GemFire Region indirectly implements java.util.Map, and so provides all the basic Map operations, such as get(key):value, in addition to several other operations that are not available in Map, like getAll(Collection keys):Map.
get(key):value is not going to be the most "efficient" method for looking up multiple items at once, but getAll(..) allows you to pass in a Collection of keys for all the values you want returned. Of course, you have to know the keys of all the values you want in advance, so...
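For example, a minimal sketch of a bulk lookup with getAll(..); package names assume Apache Geode, and older GemFire releases use com.gemstone.gemfire.* instead.
import java.util.Collection;
import java.util.Map;

import org.apache.geode.cache.Region;

public class BulkLookup {
    // Fetch the values for many keys in one operation rather than N get(key) calls.
    public static <K, V> Map<K, V> lookupAll(Region<K, V> region, Collection<K> keys) {
        return region.getAll(keys);
    }
}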
You can obtain GemFire's QueryService from the Region by calling region.getRegionService().getQueryService(). The QueryService allows you to write GemFire Queries with OQL (or Object Query Language). See GemFire's User Guide on Querying for more details.
The advantage of using OQL over getAll(keys) is, of course, you do not need to know the keys of all the values you might need to validate up front. If the validation logic is based on some criteria that matches the values that need to be evaluated, you can express this criteria in the OQL Query Predicate.
For example...
SELECT * FROM /People p WHERE p.age >= 21;
To call upon the GemFire QueryService to write the query above, you would...
Region people = cache.getRegion("/People");
...
QueryService queryService = people.getRegionService().getQueryService();
Query query = queryService.newQuery("SELECT * FROM /People p WHERE p.age >= $1");
SelectResults<Person> results = (SelectResults<Person>) query.execute(new Object[] { 21 });
// process (e.g. validate) the results
OQL Queries can be parameterized and arguments passed to the Query.execute(args:Object[]) method as shown above. When the appropriate Indexes are added to your GemFire Regions, then the performance of your Queries can improve dramatically. See the GemFire User Guide on creating Indexes.
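For instance, a minimal sketch of creating such an Index on the age field used in the Query above; the index name "ageIdx" is just an example, and package names assume Apache Geode (older GemFire releases use com.gemstone.gemfire.*).
import org.apache.geode.cache.Cache;
import org.apache.geode.cache.query.QueryService;

public class AgeIndexSetup {
    public static void createAgeIndex(Cache cache) throws Exception {
        QueryService queryService = cache.getQueryService();
        // Index name, indexed expression, and FROM clause (Region path with alias).
        queryService.createIndex("ageIdx", "p.age", "/People p");
    }
}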
Finally, with GemFire PARTITION Regions especially, where your Region data is partitioned, or "sharded", and distributed across the nodes (GemFire Servers) in the cluster that host the Region of interest (e.g. /People), you can combine querying with GemFire's Function Execution service to query the data locally on the node where the data actually exists (e.g. the shard/bucket of the PARTITION Region containing a subset of the data), rather than bringing the data to you. You can even encapsulate the "validation" logic in the GemFire Function you write.
You will need to use the RegionFunctionContext along with the PartitionRegionHelper to get the local data set of the Region to query. Read the Javadoc of PartitionRegionHelper as it shows the particular example you are looking for in this case.
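Below is a minimal, hedged sketch of such a data-aware Function. Package names assume Apache Geode (older GemFire releases use com.gemstone.gemfire.*), and the class name, hard-coded query, and result handling are only illustrative, not a definitive implementation.
import org.apache.geode.cache.Region;
import org.apache.geode.cache.execute.Function;
import org.apache.geode.cache.execute.FunctionContext;
import org.apache.geode.cache.execute.RegionFunctionContext;
import org.apache.geode.cache.partition.PartitionRegionHelper;
import org.apache.geode.cache.query.Query;
import org.apache.geode.cache.query.SelectResults;

public class LocalValidationFunction implements Function {

    @Override
    public void execute(FunctionContext context) {
        RegionFunctionContext regionContext = (RegionFunctionContext) context;

        // Only the buckets of the PARTITION Region hosted on this member,
        // not the entire "logical" Region.
        Region<?, ?> localData = PartitionRegionHelper.getLocalDataForContext(regionContext);

        try {
            Query query = localData.getRegionService().getQueryService()
                .newQuery("SELECT * FROM /People p WHERE p.age >= 21");

            // Executing with the RegionFunctionContext restricts the query
            // to the local data set of the partitioned Region.
            SelectResults<?> results = (SelectResults<?>) query.execute(regionContext);

            // ... apply the validation logic to the local results here ...
            context.getResultSender().lastResult(results.size());
        }
        catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public String getId() {
        return getClass().getSimpleName();
    }
}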
Spring Data GemFire can help with many of these concerns...
For Querying, you can use the SD Repository abstraction and extension provided in SDG.
For Function Execution, you can use SD GemFire's Function Execution Annotation support.
Be careful though: using the SD Repository abstraction inside a Function context is not going to limit the query to the "local" data set of the PARTITION Region. SD Repos always work on the entire data set of the "logical" Region, where the data is necessarily distributed across the nodes in the cluster in a partitioned (sharded) setup.
You should definitely familiarize yourself with GemFire Partitioned Regions.
In summary...
The approach you choose above really depends on several factors, such as, but not limited to:
How you organized the data in the first place (e.g. PARTITION vs. REPLICATE, which refers to the Region's DataPolicy).
How amenable your validation logic is to supplying "criteria" to, say, an OQL Query Predicate to "SELECT" only the Region data you want to validate. Additionally, efficiency might be further improved by applying appropriate Indexing.
How many nodes are in the cluster and how distributed your data is, in which case a Function might be the most advantageous approach, i.e. bringing the logic to your data rather than the data to your logic. The latter involves selecting the matching data on the nodes where it resides (which could mean several network hops, depending on your topology and configuration, e.g. "single-hop access"), serializing that data to send it over the wire, thereby increasing the saturation of your network, and so on.
Depending on your use case, other factors to consider are your expiration/eviction policies (e.g. whether data has been overflowed to disk), the frequency of the needed validations based on how often the data changes, etc.
Most of the time, it is better to validate the data on the way in and catch errors early. Naturally, as data is updated, you may also need to perform subsequent validations, but that is no substitute for verifying the data as early as possible.
There are many factors to consider and the optimal approach is not always apparent, so test and make sure your optimizations and overall approach have the desired effect.
Hope this helps!
Regards,
-John
Set up the PDX serializer and use the query service to get your element. "Select element from /region where id=xxx". This will return your element field without deserializing the record. Make sure that id is indexed.
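A hedged sketch of that approach with a client cache follows; the Region name, field names, locator address, and serialized package pattern are all hypothetical placeholders.
import org.apache.geode.cache.client.ClientCache;
import org.apache.geode.cache.client.ClientCacheFactory;
import org.apache.geode.cache.query.SelectResults;
import org.apache.geode.pdx.ReflectionBasedAutoSerializer;

public class PdxFieldLookup {
    public static void main(String[] args) throws Exception {
        ClientCache cache = new ClientCacheFactory()
            .addPoolLocator("localhost", 10334)
            // PDX serialization lets the servers read individual fields
            // without deserializing the whole domain object.
            .setPdxSerializer(new ReflectionBasedAutoSerializer("com.example.model.*"))
            .setPdxReadSerialized(true)
            .create();

        SelectResults<?> results = (SelectResults<?>) cache.getQueryService()
            .newQuery("SELECT e.element FROM /region e WHERE e.id = $1")
            .execute(new Object[] { "xxx" });

        // ... validate the projected field values ...
        cache.close();
    }
}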
There are other ways to validate quickly if your inbound data is streaming rather than a client lookup, such as the Function Service.
I have a very complex query that needs to join 9 or more tables with some 'group by' expressions. Most of these tables have almost the same number of rows. These tables also have some columns that can be used as the 'key' to partition the tables.
Previously, the app ran fine, but now the data set has 3~4 times as much data as before. My tests showed that if the row count of each table is less than 4,000,000, the application still runs pretty nicely. However, if the count is more than that, the application writes hundreds of terabytes of shuffle data and stalls (no matter how I adjust the memory, partitions, executors, etc.). The actual data is probably just dozens of GBs.
I would think that if the partitioning works properly, Spark shouldn't shuffle so much and the join should be done on each node. It is puzzling why Spark is not 'smart' enough to do so.
I could split the data set (with the 'key' I mentioned above) into many smaller data sets that can be dealt with independently, but then the burden would be on me... and it defeats the very reason to use Spark. What other approaches could help?
I use Spark 2.0 over Hadoop YARN.
My tests showed that if the row count of each table is less than 4,000,000, the application still runs pretty nicely. However, if the count is more than that, the application writes hundreds of terabytes of shuffle data
When joining datasets, if the size of one side is less than a certain configurable threshold, Spark broadcasts that entire table to each executor so that the join can be performed locally everywhere. Your observation above is consistent with this. You can also provide a broadcast hint explicitly to Spark, like so: df1.join(broadcast(df2))
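A minimal sketch of that hint with the Spark 2.x Java API; the input paths, relative table sizes, and join column "key" are hypothetical.
import static org.apache.spark.sql.functions.broadcast;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class BroadcastJoinExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("broadcast-join-example")
            .getOrCreate();

        Dataset<Row> bigTable = spark.read().parquet("/data/big_table");
        Dataset<Row> smallTable = spark.read().parquet("/data/small_table");

        // Ship the small table to every executor so the join runs locally,
        // instead of shuffling the big table across the network.
        Dataset<Row> joined = bigTable.join(broadcast(smallTable), "key");

        joined.show();
        spark.stop();
    }
}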
Other than that, can you please provide more specifics about your problem?
[Some time ago I was also grappling with the issue of joins and shuffles for one of our jobs that had to handle a couple of TBs. We were using RDDs (and not the Dataset API). I wrote about my findings here; they may be of some use as you try to reason about the underlying data shuffle.]
Update: According to the documentation, spark.sql.autoBroadcastJoinThreshold is the configurable property key. Its default value is 10 MB, and it does the following:
Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when performing a join. By setting this value to -1 broadcasting can be disabled. Note that currently statistics are only supported for Hive Metastore tables where the command ANALYZE TABLE COMPUTE STATISTICS noscan has been run.
So apparently, the statistics that drive automatic broadcasting are currently supported only for Hive tables.
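For example, a hedged sketch of raising that threshold when building the session; the 200 MB figure is only illustrative, not a recommendation.
import org.apache.spark.sql.SparkSession;

public class BroadcastThresholdExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("broadcast-threshold-example")
            // Tables smaller than this size (in bytes) are broadcast for joins;
            // set it to -1 to disable broadcasting entirely.
            .config("spark.sql.autoBroadcastJoinThreshold", 200L * 1024 * 1024)
            .getOrCreate();

        // ... run the joins here ...
        spark.stop();
    }
}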
I am confused about Cassandra's eventual consistency vs. query sequencing. I have the following questions:
1) If I send 2 queries in sequence (without turning on the isIdempotent property), where the first query deletes a record and the second query creates a record, is it possible that query 2 executes before query 1?
My Java code will look like this:
public void foo() {
    delete(entity); // First, delete a record
    create(entity); // Second, create a record
}
Another thing: I am not specifying any timestamp in my queries.
2) My second question is: Cassandra is eventually consistent. If I send both of the above queries in sequential order and they don't immediately get replicated to all nodes, will those queries maintain their order when they actually get replicated to all nodes?
I tried looking at the Cassandra documentation; although it talks about query sequencing in batch operations, it doesn't talk about query sequencing in non-batch operations.
I am using cassandra 2.1
By default, in modern versions, we use client side timestamps. Check the driver documentation here:
https://datastax.github.io/java-driver/manual/query_timestamps/
Based on the timestamp, C* resolves writes using LWW (last write wins): if the create has an earlier timestamp than the delete, a query won't return data; if the create has a newer timestamp, it will.
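For illustration, a hedged sketch with the DataStax Java driver 2.1 (the keyspace, table, and column names are hypothetical) showing how each statement can carry an explicit client-side timestamp, so LWW resolves the delete and the create in the intended order.
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class TimestampedWrites {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("my_keyspace");
        try {
            long deleteTs = System.currentTimeMillis() * 1000; // microseconds
            long createTs = deleteTs + 1;                      // strictly newer

            SimpleStatement delete = new SimpleStatement("DELETE FROM entities WHERE id = 42");
            delete.setDefaultTimestamp(deleteTs);

            SimpleStatement create = new SimpleStatement("INSERT INTO entities (id, name) VALUES (42, 'foo')");
            create.setDefaultTimestamp(createTs);

            session.execute(delete);
            session.execute(create);
        } finally {
            session.close();
            cluster.close();
        }
    }
}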
If you need linearizability, i.e. the guarantee that certain operations will be executed in sequence, you can use lightweight transactions based on Paxos:
http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0
I am building a factory QC fixture that measures, analyzes, and stores data on the physical dimensions of products leaving a factory. The raw data for each measured product starts off as a table with 5 columns and up to 86,000 rows. To get useful information, this table must undergo some processing. This data is collected in LabVIEW, but stored in a SQL Server database. I want to ask whether it's best to pass the data to the server and process it there (via a stored procedure), or process it outside the server and then add it in?
Let me tell you about the processing to be done:
To get meaningful information from the raw data, each record in the table needs to be passed into a function that calculates parameters of interest. The function also uses other records (I'll call them secondary records) from the raw table. The contents of the record originally passed into the function dictate what secondary records the function uses to perform calculations. The function also utilizes some trigonometric operators. My concern is that SQL will be very slow or buggy when doing calculations over a big table. I am not sure if this sort of task is something SQL is efficient at doing, or if I'm better off trying to get it done through the GPU using CUDA.
Hopefully I'm clear enough on what I need, please let me know if not.
Thanks in advance for your answers!
Generally, we need SQL Server to help us sort, search, index, and update data, and share the data with multiple users (according to their privileges) at the same time. I see no work for SQL Server in the task you've described. It looks like no one needs any of those 86,000 rows before they've been processed.
New to MongoDB. Is MongoDB efficient for real-time queries where the values for the criteria change every time for my query? Also, there will be some aggregation of the result set before sending the response back to the user. As an example, my use case needs to produce data in the following format after processing a collection for different criteria values.
Service   Total   Improved
A         1000    500
B         2000    700
..        ..      ..
I see MongoDB has Aggregation, which processes records and returns computed results. Should I be using aggregation instead for efficiency? If aggregation is the way to go, I guess I would run it every time my source data changes. Also, is this what Mongo Hadoop is used for? Am I on the right track in my understanding? Thanks in advance.
Your question is too general, IMHO.
Speed depends on the size of your data, on the kind of query, and on whether you have put an index on your key, etc.
Changing the values in your queries is not critical, AFAIK.
For example I work on a MongoDB with 3 million docs and can do some queries in a couple of seconds, some in a couple of minutes. A simple map reduce over all 3 M docs takes about 25 min on that box.
I have not tried the aggregation API yet, which seems to be a successor/alternative to map / reduce runs.
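For what it's worth, here is a hedged sketch of what an aggregation pipeline for the per-service counts in the question might look like with a recent MongoDB Java driver; the database, collection, and field names ("qc", "records", "service", "improved", "measuredAt") and the match criterion are assumptions for the example.
import static com.mongodb.client.model.Accumulators.sum;
import static com.mongodb.client.model.Aggregates.group;
import static com.mongodb.client.model.Aggregates.match;
import static com.mongodb.client.model.Filters.gte;

import java.util.Arrays;

import org.bson.Document;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;

public class ServiceReport {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> records = client.getDatabase("qc").getCollection("records");

            // Group by service: count every matching document, and count the ones
            // flagged as improved. The match criteria can change on every request.
            records.aggregate(Arrays.asList(
                match(gte("measuredAt", "2016-01-01")),
                group("$service",
                    sum("total", 1),
                    sum("improved", new Document("$cond",
                        Arrays.asList(new Document("$eq", Arrays.asList("$improved", true)), 1, 0))))))
                .forEach(doc -> System.out.println(doc.toJson()));
        }
    }
}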
I did not know about the MongoDB/Hadoop integration. It seems to keep MongoDB as an easy-to-use storage unit that feeds data to a Hadoop cluster and gets results back from it, using the more advanced MapReduce framework from Hadoop (more phases, better use of a cluster of Hadoop nodes).
I would follow MongoDB's guidelines for counting stuff.
See MongoDB's documentation page on pre-aggregated reports.
Hadoop is good for batch processing, which you probably don't need for these counting use cases?
See this list for other typical Hadoop use cases: link.
And here's a resource for typical Mongo + Hadoop use cases: link.