Best way to look up elements in a GemFire Region

I have Regions in GemFire with a large number of records.
I need to look up elements in those Regions for validation purposes. The lookup happens for every item we scan, and there can be more than 10000 items.
What would be an efficient way to look up elements in the Regions?
Please suggest.

Vikas-
There are several ways in which you can look up, or fetch multiple elements from a GemFire Region.
As you may know, a GemFire Region indirectly implements java.util.Map, and so provides all the basic Map operations, such as get(key):value, in addition to several operations that are not available in Map, like getAll(Collection keys):Map.
get(key):value is not going to be the most "efficient" method for looking up multiple items at once, but getAll(..) allows you to pass in a Collection of keys for all the values you want returned. Of course, you have to know the keys of all the values you want in advance, so...
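For example, a bulk lookup with getAll(..) might look like this (the key values are placeholders):
Region people = cache.getRegion("/People");
// Fetch the values for a known set of keys in a single call; returns a Map of key -> value.
Map results = people.getAll(Arrays.asList("jonDoe", "janeDoe"));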
You can obtain GemFire's QueryService from the Region by calling region.getRegionService().getQueryService(). The QueryService allows you to write GemFire Queries with OQL (or Object Query Language). See GemFire's User Guide on Querying for more details.
The advantage of using OQL over getAll(keys) is, of course, you do not need to know the keys of all the values you might need to validate up front. If the validation logic is based on some criteria that matches the values that need to be evaluated, you can express this criteria in the OQL Query Predicate.
For example...
SELECT * FROM /People p WHERE p.age >= 21;
To use the GemFire QueryService to run the query above, you would...
Region people = cache.getRegion("/People");
...
QueryService queryService = people.getRegionService().getQueryService();
Query query = queryService.newQuery("SELECT * FROM /People p WHERE p.age >= $1");
SelectResults<Person> results = (SelectResults<Person>) query.execute(new Object[] { 21 });
// process (e.g. validate) the results
OQL Queries can be parameterized, with arguments passed to the Query.execute(args:Object[]) method as shown above. When appropriate Indexes are added to your GemFire Regions, the performance of your Queries can improve dramatically. See the GemFire User Guide on creating Indexes.
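For instance, an Index on the age field used by the query above could be created programmatically with the same QueryService (the index name here is arbitrary, createIndex(..) throws checked exceptions you will need to handle, and Indexes can also be declared in cache.xml or created with gfsh):
// Index the field referenced in the query predicate (p.age).
queryService.createIndex("personAgeIdx", "p.age", "/People p");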
Finally, with GemFire PARTITION Regions especially, where your Region data is partitioned, or "sharded", and distributed across the nodes (GemFire Servers) in the cluster that host the Region of interest (e.g. /People), you can combine querying with GemFire's Function Execution service to query the data locally, on the node where the data actually exists (i.e. the shard/bucket of the PARTITION Region containing a subset of the data), rather than bringing the data to you. You can even encapsulate the "validation" logic in the GemFire Function you write.
You will need to use the RegionFunctionContext along with the PartitionRegionHelper to get the local data set of the Region to query. Read the Javadoc of PartitionRegionHelper as it shows the particular example you are looking for in this case.
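As a rough sketch (the class name, Region, and query are illustrative, and the package names assume a recent GemFire/Apache Geode release), such a Function might look like this:
import org.apache.geode.cache.Region;
import org.apache.geode.cache.execute.Function;
import org.apache.geode.cache.execute.FunctionContext;
import org.apache.geode.cache.execute.RegionFunctionContext;
import org.apache.geode.cache.partition.PartitionRegionHelper;
import org.apache.geode.cache.query.Query;
import org.apache.geode.cache.query.QueryService;
import org.apache.geode.cache.query.SelectResults;

public class ValidatePeopleFunction implements Function {

  @Override
  public void execute(FunctionContext context) {
    RegionFunctionContext regionContext = (RegionFunctionContext) context;
    // Restrict processing to the data hosted locally by this member.
    Region localData = PartitionRegionHelper.getLocalDataForContext(regionContext);
    try {
      QueryService queryService = localData.getRegionService().getQueryService();
      Query query = queryService.newQuery("SELECT * FROM /People p WHERE p.age >= 21");
      // Passing the RegionFunctionContext scopes the query to the local buckets.
      SelectResults results = (SelectResults) query.execute(regionContext);
      // ... apply your validation logic to 'results' here ...
      context.getResultSender().lastResult(results.size());
    }
    catch (Exception e) {
      throw new RuntimeException(e);
    }
  }

  @Override
  public String getId() {
    return getClass().getSimpleName();
  }
}
You would register the Function on the servers and invoke it with FunctionService.onRegion(people).execute(..) from the client.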
Spring Data GemFire can help with many of these concerns...
For Querying, you can use the SD Repository abstraction and extension provided in SDG (a small repository sketch follows below).
For Function Execution, you can use SD GemFire's Function Execution annotation support.
Be careful though: using the SD Repository abstraction inside a Function context will not automatically limit the query to the "local" data set of the PARTITION Region. SD Repositories always work on the entire data set of the "logical" Region, even when that data is distributed across the nodes of the cluster in a partitioned (sharded) setup.
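For illustration, a minimal Spring Data GemFire repository might look like the following (the entity, its fields, and the Region name are assumptions, and the exact annotation packages vary slightly between SDG versions):
import java.util.List;
import org.springframework.data.annotation.Id;
import org.springframework.data.gemfire.mapping.annotation.Region;
import org.springframework.data.repository.CrudRepository;

// Person.java - the domain type mapped to the /People Region
@Region("People")
public class Person {
  @Id
  private String name;
  private int age;
  // constructors, getters and setters omitted for brevity
}

// PersonRepository.java - derived query equivalent to: SELECT * FROM /People p WHERE p.age >= $1
public interface PersonRepository extends CrudRepository<Person, String> {
  List<Person> findByAgeGreaterThanEqual(int age);
}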
You should definitely familiarize yourself with GemFire Partitioned Regions.
In summary...
The approach you choose above really depends on several factors, such as, but not limited to:
How you organized the data in the first place (e.g. PARTITION vs. REPLICATE, which refers to the Region's DataPolicy).
How amenable your validation logic is to supplying "criteria" to, say, an OQL Query Predicate to "SELECT" only the Region data you want to validate. Additionally, efficiency might be further improved by applying appropriate Indexing.
How many nodes are in the cluster and how distributed your data is, in which case a Function might be the most advantageous approach, i.e. bring the logic to your data rather than the data to your logic. The latter involves selecting the matching data on the nodes where it resides, which could mean several network hops to the nodes containing the data (depending on your topology and configuration, e.g. "single-hop access"), as well as serializing the data to send over the wire, thereby increasing the load on your network, and so on.
Depending on your use case, other factors to consider are your expiration/eviction policies (e.g. whether data has been overflowed to disk), the frequency of the needed validations based on how often the data changes, etc.
Most of the time, it is better to validate the data on the way in and catch errors early. Naturally, as data is updated, you may also need to perform subsequent validations, but that is no substitute for validating as early as possible.
There are many factors to consider and the optimal approach is not always apparent, so test and make sure your optimizations and overall approach have the desired effect.
Hope this helps!
Regards,
-John

Set up the PDX serializer and use the query service to get your element. "Select element from /region where id=xxx". This will return your element field without deserializing the record. Make sure that id is indexed.
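A hedged sketch of that approach on the client side (the Region name, field name, locator address, and domain package are placeholders, and the package names assume a recent GemFire/Apache Geode release):
ClientCache cache = new ClientCacheFactory()
    .addPoolLocator("localhost", 10334)
    .setPdxSerializer(new ReflectionBasedAutoSerializer("com.example.model.*"))
    .setPdxReadSerialized(true) // lets queries read individual fields without deserializing whole values
    .create();

// Project out just the field you need; with an index on id this stays cheap.
SelectResults results = (SelectResults) cache.getQueryService()
    .newQuery("SELECT e.id FROM /region e WHERE e.id = $1")
    .execute(new Object[] { "xxx" });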
There are other ways to validate quickly if your inbound data is streaming rather than being looked up by a client, such as using the Function Service.

Related

Did anyone write custom Affinity function?

I want all nodes in a cluster to have an equal share of the data load. With the default affinity function this is not happening.
As of now, we have 3 nodes. We use the group ID as the affinity key, and we have 3 group IDs (1, 2 and 3), and we limit cache partitions to the group IDs. Overall, nodes = group IDs = cache partitions, so that each node has an equal number of partitions.
Would it be okay to write a custom affinity function? What would we lose by doing so? Has anyone written a custom affinity function?
The affinity function doesn't guarantee an even distribution across all nodes. It's statistical... and three values isn't really enough to make sure the data is "fairly" distributed.
So, yes, writing a new affinity function would work. The downsides are that you need to make it fast (it's called a lot) and that you'd be hard-coding it to your current node topology. What happens when you choose to add a new node? What happens when a node fails? Also, you'd potentially be putting all your data into three partitions, which makes it harder to scale out (one of the main advantages of Ignite's architecture).
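If you do go down that path, one hedged sketch (assuming the affinity key is the integer group ID with values 1 to 3, as in the question) is to extend Ignite's RendezvousAffinityFunction and override partition(), leaving node assignment to the parent class:
import org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction;

public class GroupAffinityFunction extends RendezvousAffinityFunction {

  public GroupAffinityFunction() {
    super(false, 3); // do not exclude neighbors; 3 partitions in total
  }

  @Override
  public int partition(Object key) {
    if (key instanceof Integer) {
      // Map group IDs 1, 2 and 3 onto partitions 0, 1 and 2.
      return (((Integer) key) - 1) % partitions();
    }
    return super.partition(key);
  }
}
You would register it with CacheConfiguration.setAffinity(new GroupAffinityFunction()), but the caveats above (hard-coded topology, only three partitions) still apply.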
As an alternative, I'd look at your data model. Splitting your data into three chunks is too coarse for things to work automatically.

Efficiently perform COUNT DISTINCT with Spark on CSVs?

I have a large volume of data, and I'm looking to efficiently (i.e., using a relatively small Spark cluster) perform COUNT and DISTINCT operations on one of the columns.
If I do what seems obvious, ie load the data into a dataframe:
df = spark.read.format("csv").load("s3://somebucket/loadsofcsvdata/*")
df.createOrReplaceTempView("someview")
and then attempt to run a query:
domains = sqlContext.sql("""SELECT domain, COUNT(id) FROM someview GROUP BY domain""")
domains.show(1000)
my cluster just crashes and burns - throwing out of memory exceptions or otherwise hanging/crashing/not completing the operation.
I'm guessing that somewhere along the way there's some sort of join that blows one of the executors' memory?
What's the ideal method for performing an operation like this, when the source data is at massive scale and the target data isn't (the list of domains in the above query is relatively short, and should easily fit in memory)?
related info available at this question: What should be the optimal value for spark.sql.shuffle.partitions or how do we increase partitions when using Spark SQL?
I would suggest tuning your executor settings. In particular, setting the following parameters correctly can provide a dramatic improvement in performance.
spark.executor.instances
spark.executor.memory
spark.yarn.executor.memoryOverhead
spark.executor.cores
In your case, I would also suggest tuning the number of shuffle partitions, in particular bumping the following parameter up from its default of 200 to a higher value, as required.
spark.sql.shuffle.partitions
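These settings are usually supplied to spark-submit via --conf, but as a rough sketch they can also be set when building the session (shown here with Spark's Java API for illustration; the application name and values are placeholders to be sized against your own cluster and data):
import org.apache.spark.sql.SparkSession;

public class ShuffleTuningExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("domain-counts")
        .config("spark.executor.instances", "10")
        .config("spark.executor.memory", "8g")
        .config("spark.yarn.executor.memoryOverhead", "1024")
        .config("spark.executor.cores", "4")
        .config("spark.sql.shuffle.partitions", "1000") // default is 200
        .getOrCreate();

    // ... then run the same CSV load and GROUP BY query as above;
    // the GROUP BY triggers a shuffle, whose parallelism is controlled by the setting above.
    spark.stop();
  }
}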

Is SQL Server Express 14 appropriate to use for conducting calculations over tables?

I am building a factory QC fixture that measures, analyzes, and stores data on the physical dimensions of products leaving a factory. The raw data for each measured product starts off as a table with 5 columns and up to 86000 rows. To get useful information, this table must undergo some processing. This data is collected in LabVIEW, but stored in an SQL Server database. I want to ask whether it's best to pass the data to the server and process it there (via a stored procedure), or to process it outside the server and then add it in.
Let me tell you about the processing to be done:
To get meaningful information from the raw data, each record in the table needs to be passed into a function that calculates parameters of interest. The function also uses other records (I'll call them secondary records) from the raw table. The contents of the record originally passed into the function dictate what secondary records the function uses to perform calculations. The function also utilizes some trigonometric operators. My concern is that SQL will be very slow or buggy when doing calculations over a big table. I am not sure if this sort of task is something SQL is efficient at doing, or if I'm better off trying to get it done through the GPU using CUDA.
Hopefully I'm clear enough on what I need, please let me know if not.
Thanks in advance for your answers!
Generally we need SQL Server to help us sort, search, index, and update data, and to share the data with multiple users (according to their privileges) at the same time. I see no work for SQL Server in the task you've described. It looks like no one needs any of those 86000 rows before they've been processed.

applying business rules at the database level

I'm working on a project in which we will need to determine certain types of statuses for a large body of people, stored in a database. The business rules for determining these statuses are fairly complex and may change.
For example,
if a person is part of group X
and (if they have attribute O) has either attribute P or attribute Q,
or (if they don't have attribute O) has attribute P but not Q,
and don't have attribute R,
and aren't part of group Y (unless they also are part of group Z),
then status A is true.
Multiply by several dozen statuses and possibly hundreds of groups and attributes. The people, groups, and attributes are all in the database.
Though this will be consumed by a Java app, we also want to be able to run reports directly against the database, so it would be best if the set of computed statuses were available at the data level.
Our current design plan, then, is to have a table or view that consists of a set of boolean flags (hasStatusA? hasStatusB? hasStatusC?) for each person. This way, if I want to query for everyone who has status C, I don't have to know all of the rules for computing status C; I just check the flag.
(Note that, in real life, the flags will have more meaningful names: isEligibleForReview?, isPastDueForReview?, etc.).
So a) is this a reasonable approach, and b) if so, what's the best way to compute those flags?
Some options we're considering for computing flags:
Make the set of flags a view, and calculate the flag values from the underlying data in real time using SQL or PL/SQL (this is an Oracle DB). This way the values are always accurate, but performance may suffer, and the rules would have to be maintained by a developer.
Make the set of flags consist of static data, and use some type of rules engine to keep those flags up-to-date as the underlying data changes. This way the rules can be maintained more easily, but the flags could potentially be inaccurate at a given point in time. (If we go with this approach, is there a rules engine that can easily manipulate data within a database in this way?)
In a case like this I suggest applying Ward Cunningham's question: ask yourself "What's the simplest thing that could possibly work?"
In this case, the simplest thing might be to come up with a view that looks at the data as it exists and does the calculations and computations to produce all the fields you care about. Now, load up your database and try it out. Is it fast enough? If so, good - you did the simplest possible thing and it worked out fine. If it's NOT fast enough, good - the first attempt didn't work, but you've got the rules mapped out in the view code. Now you can go on to try the next iteration of "the simplest thing" - perhaps you write a background task that watches for inserts and updates and then jumps in to recompute the flags. If that works, fine and dandy. If not, go to the next iteration...and so on.
Share and enjoy.
I would advise against making the statuses column names; rather, use a status ID and value, such as a customer status table with columns ID and Value.
I would have two methods for updating statuses. The first is a stored procedure that either has all the logic or calls separate stored procs to figure out each status; you could make this dynamic by having a function for each status evaluation, with the one stored proc calling each function. The second method is to have whatever stored proc(s) update user info call a stored proc that updates all of the user's statuses based on the current data. These two methods give you real-time updates for the data that changed, and if you add a new status, you can call the first method to update all statuses with the new logic.
Hopefully you have one point of updates to the user data, such as a user update stored proc, and you can put the status update stored proc call in that procedure. This would also save having to schedule a task every n seconds to update statuses.
An option I'd consider would be for each flag to be backed by a deterministic function that returns the up-to-date value given the relevant data.
The function might not perform well enough, however, if you're calling it for many rows at a time (e.g. for reporting). So, if you're on Oracle 11g, you can solve this by adding virtual columns (search for "virtual column") to the relevant tables based on the function. The Result Cache feature should improve the performance of the function as well.

Main-Memory Secondary-Memory Objects

I have a situation where I want to do some DB-related operations in a Java application (e.g. on Eclipse). I use MySQL as a RDBMS and Hibernate as an ORM provider.
I retrieve all records using embedded SQL in Java:
// Define connections ...etc
ResultSet result = myStmt.executeQuery("SELECT * FROM employees");
// iterator
I retrieve all records using Hibernate ORM / JPQL:
// Connections,Entity Manager....etc
List result = em.createQuery("SELECT emp FROM Employees emp").getResultList();
// iterator
I know that the RDBMS is located in secondary memory (disk). The question is: when I get both results back, where are the employees actually, in secondary memory (SM) or in main memory (MM)?
In the end I want to have two object populations for further testing, one operating on SM and one on MM. How is this possible?
Thanks
Frank
Your Java objects are real Java objects; they are in (to use your term) MM, at least for a while. The beauty of the Hibernate/JPA programming model is that while in MM you can pretty much treat the objects as if they were any other Java objects, make a few changes to them, etc. Then, at some agreed time, Hibernate's persistence mechanism gets them back to SM (disk).
You will need to read up on the implications of Sessions and Transactions in order to understand when the transitions between MM and SM occur, and also very importantly, what happens if two users want to work with the same data at the same time.
Maybe start here
It is also possible to create objects in MM that are (at least for now) not related to any data on disk (these are "transient" objects), and also to "disconnect" (detach) data in memory from what's on disk.
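A hedged sketch of those states (the Employee entity, its fields, and the persistence unit name "my-unit" are assumptions for illustration):
import javax.persistence.Entity;
import javax.persistence.EntityManager;
import javax.persistence.EntityManagerFactory;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;
import javax.persistence.Persistence;

@Entity
class Employee {
  @Id @GeneratedValue Long id;
  String name;
  Employee() {}
  Employee(String name) { this.name = name; }
}

public class MemoryStatesExample {
  public static void main(String[] args) {
    EntityManagerFactory emf = Persistence.createEntityManagerFactory("my-unit");
    EntityManager em = emf.createEntityManager();

    Employee emp = new Employee("Frank"); // transient: exists only in main memory (MM)

    em.getTransaction().begin();
    em.persist(emp);                      // managed: scheduled for the database
    em.getTransaction().commit();         // flushed: the row now also exists on disk (SM)

    em.detach(emp);                       // detached: still in MM, changes no longer tracked
    emp.name = "Frank B.";                // this change is NOT written to the database

    em.close();
    emf.close();
  }
}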
My bottom line here is that Hibernate/JPA does remove much grunt work from persistence coding, but it cannot hide the complexity of scale: as your data volumes increase, your data model's complexity grows, and your users' actions contend for the same data, you need to do serious thinking. Hibernate allows you to achieve good things, but it can't do that thinking for you; you have to make careful choices as your problem domain gets more complex.