Gemfire Query interface - alternatives to gfsh/pulse - gemfire

Are there any thick-client alternatives to Pulse / Gfsh to query the regions of GemFire? Though Pulse is good, it's not usable the way SQL Developer or Toad is for testing/querying.

Unfortunately, none that I know of, sorry.
However, an alternative approach would be to use Spring Data GemFire Repositories (additional details here) to write/express your (OQL) queries, and then write automated [JUnit] tests to test your queries defined in your application Repository interface.
For example, I can define an interface extension of either SDC's [Crud]Repository or SDG's GemfireRepository interface and declare my application queries following certain conventions (a specification of the query criteria defined by the interface method signature). That is, I do not need to write the actual queries.
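A minimal sketch of what that might look like, assuming a hypothetical Customer domain class mapped to a "Customers" Region (exact package paths vary by Spring Data GemFire version):

```java
import java.util.List;

import org.springframework.data.annotation.Id;
import org.springframework.data.gemfire.mapping.Region;
import org.springframework.data.gemfire.repository.GemfireRepository;

// Hypothetical domain class stored in the "Customers" Region.
@Region("Customers")
public class Customer {

    @Id
    private Long id;

    private String lastName;
    private String city;

    // constructors, getters and setters omitted for brevity
}

// The OQL is derived from the method names; nothing is written by hand.
interface CustomerRepository extends GemfireRepository<Customer, Long> {

    List<Customer> findByLastName(String lastName);

    List<Customer> findByLastNameAndCity(String lastName, String city);
}
```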
Then, it is a relatively simple matter to define tests to exercise your application's queries.
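And a rough JUnit sketch of such a test, assuming a Spring test context that bootstraps the cache, Region, and the CustomerRepository above (the Customer constructor and getter shown here are hypothetical):

```java
import static org.junit.Assert.assertEquals;

import java.util.List;

import org.junit.Test;
import org.junit.runner.RunWith;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.test.context.ContextConfiguration;
import org.springframework.test.context.junit4.SpringJUnit4ClassRunner;

@RunWith(SpringJUnit4ClassRunner.class)
@ContextConfiguration // assumes config defining the cache, Region and repositories
public class CustomerRepositoryTest {

    @Autowired
    private CustomerRepository customerRepository;

    @Test
    public void findByLastNameReturnsOnlyMatchingCustomers() {
        // Assumes a Customer(id, lastName, city) constructor exists.
        customerRepository.save(new Customer(1L, "Blum", "Portland"));
        customerRepository.save(new Customer(2L, "Doe", "Seattle"));

        List<Customer> results = customerRepository.findByLastName("Blum");

        assertEquals(1, results.size());
        assertEquals("Blum", results.get(0).getLastName());
    }
}
```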
You can even express more complex queries (like Equi-Joins on 2 or more collocated PRs). However, beware of the query limitations involving PRs in particular, as well as in general.
More information on querying PRs can be found here, and specifically involving Equi-Join Queries on PRs.
I have a hard time imagining any tool successfully enabling this sort of practical querying, since querying 2 collocated PRs (or a PR with any other Region type, e.g. REPLICATE or LOCAL) in an Equi-Join (OQL) Query must be performed inside a GemFire Function.
Anyway, I know this was not exactly what you were looking for, since you probably just need something quick to test the validity of your query results in addition to analyzing the performance (like Explain Plan), but this at least increases your test coverage in an automated, repeatable fashion.
Of course, this is all a moot point if you are just looking to perform analysis on the data outside an application.
Cheers,
John

Related

Does the performance of an in-memory database depend on whether it is written in C++ or Java?

We are trying to find an in-memory database with index support that we can use for our application.
We are looking at Aerospike, Apache Ignite, Geode, and VoltDB.
There is not much to distinguish them, and every one claims to be fast and to have great community support.
Out of these, Aerospike and VoltDB are C/C++ based, and Apache Ignite and Geode are Java based.
Considering there is little to choose between the databases in terms of performance, and it is tough to test which DB will work better for us in production, I was trying to find out whether the performance of an in-memory database also depends on whether it is Java based or C/C++ based. Considering that garbage collection issues are quite frequent and it is tough to properly tune GC for your use case (which may change after some time), is it true that the Java-based DBs will be at a disadvantage?
Thanks
You can't really conclude that one DB is faster than another just because it is written in language X vs. language Y. A database is a very complex product with many features. Some queries may be faster in one DB, other queries in another.
The only way to find out is to test your specific use case.
For an in-memory DB that maintains consistency like Geode does (i.e. makes synchronous replication to other nodes before releasing the client thread), your network is going to be a bigger concern than the HotSpot compiler. Still, here are two points of input to get you to the point where language is irrelevant:
1) If you are doing lots of creates/updates relative to reads: use off-heap memory on the server. This minimizes GC pauses.
2) Use Geode's serialization mapping between C/C++ and Java objects to avoid JNI. Specifically, use the DataSerializer: http://gemfire.docs.pivotal.io/geode/developing/data_serialization/gemfire_data_serialization.html
If you plan to use queries extensively rather than gets/puts, use the PdxSerializer: http://gemfire.docs.pivotal.io/geode/developing/data_serialization/use_pdx_serializer.html (a rough sketch of the PDX approach follows below).
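As an illustration only: one way to use PDX is to have the domain class implement PdxSerializable, so queries can read individual fields on the server without deserializing whole objects (a custom PdxSerializer or ReflectionBasedAutoSerializer achieves the same without touching the domain class; package names differ between GemFire, shown here, and Apache Geode):

```java
import com.gemstone.gemfire.pdx.PdxReader;
import com.gemstone.gemfire.pdx.PdxSerializable;
import com.gemstone.gemfire.pdx.PdxWriter;

// Illustrative only: a trade object serialized as PDX so OQL queries can
// access fields without fully deserializing the object on the server.
public class Trade implements PdxSerializable {

    private String symbol;
    private int quantity;
    private double price;

    public Trade() {
        // no-arg constructor required for deserialization
    }

    @Override
    public void toData(PdxWriter writer) {
        writer.writeString("symbol", symbol);
        writer.writeInt("quantity", quantity);
        writer.writeDouble("price", price);
    }

    @Override
    public void fromData(PdxReader reader) {
        symbol = reader.readString("symbol");
        quantity = reader.readInt("quantity");
        price = reader.readDouble("price");
    }
}
```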
I guess I'm going to be the contrarian.
All else being equal, compiled code is faster than the JVM, and there's no garbage collection that you have to employ tactics to avoid.
Having been written in C/C++, eXtremeDB (my company's product) is able to avoid using the C run-time memory management altogether. Managing the memory area entirely within the database software enables the use of highly efficient & purpose-specific memory managers, and eliminates the potential for memory leaks (from the whole system point of view, e.g. if 200GB is set aside for the in-memory database, it will never exceed 200GB). eXtremeDB is not unique in this regard; other in-memory DBMS written in C/C++ are also able to avoid the C run-time malloc/free or C++ new/delete. So please don't ding me for making a product pitch, I'm not. I'm pointing out a capability that is possible with a C/C++ implementation that may not be available with a JVM.
The other answerers are correct that a crappy implementation of a SQL execution plan for a given query can overwhelm any advantage of compiled code vs. the JVM, but at some point you've got to have confidence that your DBMS vendor knows what they are doing (and is interested in improving their product if a plan is demonstrably inefficient/wrong). If you're not using SQL, then the goodness/badness of a SQL optimizer is not part of the equation, and it really comes down to how well the database system's index methods are written, and the availability of different types of indexes for different search requirements (e.g. a hash index will generally be better than a b-tree for exact match lookup, but a hash index can't support partial key (wildcard) search or ordered retrieval).
There are some public (independent, audited) benchmarks you can look to. We have participated in a few STAC-M3 audits, though only one other DBMS has done so (the DBMSs you listed specifically have not).

Do dplyr functions on a database tbl execute locally or remotely?

I've been using dplyr for a bit locally and I've found it a very powerful tool. One thing that gets showcased in a lot of the intro talks I've found is how you can use it to operate on a database table "to only work with the data you want" via its aggregation functions, summarize, mutate, etc. I understand how it translates those into SQL statements, but not so much other operations.
For example, if I wanted to work on a database table as a tbl, and I wanted to run a function on the result of my pipeline through do(), such as glm, would glm be transported to the database somehow to be run there, or is the data necessarily downloaded (in whatever reduced form) and then glm is run locally?
Depending on the size of the table in question, this is an important distinction. Thanks!
Any R analyses, such as calls to glm(), are run locally. As @joran commented above, the databases vignette, the introductory documentation, and the many other resources you can find on using dplyr are useful in learning how certain operations are converted to SQL and executed on the DB system. I believe you can introduce bottlenecks by putting R-specific analyses in the middle of a chain of operations, when finishing the DB-capable operations first might be more efficient.

Database Type Agnostic Select Query Encapsulation class

I am upgrading a webapp that will be using two different database types: the existing MySQL database, which is tightly integrated with the current systems, and a MongoDB database for the extended functionality. The new functionality will also rely pretty heavily on the MySQL database for environmental variables such as information on the current user, content, etc.
Although I know I can just assemble the queries independently, it got me thinking of a way that might make the construction of queries much simpler (only for easier legibility while building; once it's finished, I'd convert back to hard-coded queries). It would entail an encapsulation object that would contain:
what data is being selected (including functionally derived data)
source (including joined data; I know that joins are not a good idea for non-relational DBs, but it would be nice to have the facility just in case, and it can be rewritten into two queries later for performance)
where and having conditions (stored as their own object types so they can be processed later, potentially including other select queries that can be interpreted by whatever db is using it)
orders
groupings
limits
This data can then be passed to an interface adapter that can build and execute the query, returning it in an array, or object or whatever is desired.
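Purely as an illustration of the shape such an encapsulation could take (all names here are invented; the idea translates to any language), a sketch might look like this:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of a backend-agnostic query description. The object
// captures *what* is being asked for; a per-database adapter decides *how*.
public class QuerySpec {

    public static class Condition {
        public final String field;
        public final String operator; // e.g. "=", ">", "IN"
        public final Object value;    // could itself be another QuerySpec (sub-select)

        public Condition(String field, String operator, Object value) {
            this.field = field;
            this.operator = operator;
            this.value = value;
        }
    }

    public final List<String> select = new ArrayList<>();     // incl. derived expressions
    public final List<String> from = new ArrayList<>();       // sources, incl. joins
    public final List<Condition> where = new ArrayList<>();
    public final List<Condition> having = new ArrayList<>();
    public final List<String> groupBy = new ArrayList<>();
    public final List<String> orderBy = new ArrayList<>();
    public Integer limit;
}

// Each data store (MySQL, MongoDB, ...) implements its own adapter
// that builds and executes a native query from the same description.
interface QueryAdapter {
    List<Object> execute(QuerySpec spec);
}
```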
Although this sounds good, I have no idea if any code like this exists. If so, can anybody point it out to me? If not, are there any resources on similar projects undertaken that might allow me to continue the work and build a basic version?
I know this would be a complicated library, but I have been working on this update for the last few days, and constantly switching back and forth has been getting me muddled at times and allowing mistakes to occur.
I would study things like the SQL grammar: http://www.h2database.com/html/grammar.html
It gives you an idea of how queries should be constructed.
You can study existing libraries around LINQ (C#): https://code.google.com/p/linqbridge/
Maybe even check out this link about FQL (Facebook's query language): https://code.google.com/p/mockfacebook/issues/list?q=label:fql
As you already know, this is a hard problem, and it will be a big challenge to make it run efficiently. Maybe consider moving all the data from MySQL and Mongo to a third data store that has a copy of everything, and then running queries against that? Replicating all writes to something like Redis or Elasticsearch and then writing your queries there?
Either way, good luck!

Benefits of stored procedures vs. other forms of grabbing data from a database [duplicate]

Possible Duplicate:
What are the pros and cons to keeping SQL in Stored Procs versus Code
Just curious about the advantages and disadvantages of using a stored procedure vs. other forms of getting data from a database. What is the preferred method to ensure speed, accuracy, and security (we don't want SQL injection!)?
(Should I post this question to another Stack Exchange site?)
As per the answer to all database questions: 'it depends'. However, stored procedures definitely help in terms of speed because of plan caching (although properly parameterized SQL will benefit from that too). Accuracy is no different: an incorrect query is incorrect whether it's in a stored procedure or not. And in terms of security, they can offer a useful way of limiting access for users, seeing as you don't need to give them direct access to the underlying tables; you can just allow them to execute the stored procedures that you want. There are, however, many many questions on this topic and I'd advise you to search a bit and find out some more.
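As a rough JDBC illustration of the two approaches mentioned above (table and procedure names are made up), both keep user input out of the SQL text, which is what defeats injection:

```java
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class CustomerDao {

    private final Connection connection;

    public CustomerDao(Connection connection) {
        this.connection = connection;
    }

    // Parameterized SQL: the value is bound, never concatenated into the statement,
    // and the plan can be cached and reused.
    public ResultSet findByEmailWithSql(String email) throws SQLException {
        PreparedStatement ps =
            connection.prepareStatement("SELECT id, name FROM customers WHERE email = ?");
        ps.setString(1, email);
        return ps.executeQuery();
    }

    // The same lookup through a (hypothetical) stored procedure; the caller only
    // needs EXECUTE rights on the procedure, not direct access to the table.
    public ResultSet findByEmailWithProc(String email) throws SQLException {
        CallableStatement cs = connection.prepareCall("{call find_customer_by_email(?)}");
        cs.setString(1, email);
        return cs.executeQuery();
    }
}
```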
There are several questions on Stack Overflow about this problem. I really don't think you'll get a "right" answer here; both can work out very well, and both can work horribly. I think if you are using Java then the general pattern is to use an ORM framework like Hibernate/JPA. This can be completely safe from SQL injection attacks as long as you use the framework correctly. My experience with .NET developers is that they are more likely to use stored-procedure-backed persistence, but that seems to be more open than it was before. Both NHibernate and other MS technologies seem to be gaining popularity.
My personal view is that in general an ORM will save you some time and a lot of verbose coding, since it can automatically generate much of the SQL you use in a typical CRUD-type system. To gain this you will likely give up a little performance and some flexibility. If your system is low to medium volume (tens of thousands of requests per day) then an ORM will be just fine for you. If you start getting into the millions of requests per day then you may need something a little more bare-metal like straight SQL or stored procedures. Note that an ORM doesn't prevent you from going more directly to the DB; it's just not normally what you would use.
One final note is that I think ORM persistence makes an application much more testable. If you use stored procedures for much of your persistence then you are almost bound to start getting a bunch of business logic in them. To test them you have to actually persist data and interact with the DB, which makes testing slow and brittle. Using an ORM framework you can either avoid most of this testing or use an in-memory DB when you really want to test persistence.
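As a hedged sketch of that CRUD generation, a JPA entity plus a couple of EntityManager calls is often all the code a simple create/read takes (entity and field names here are invented):

```java
import javax.persistence.Entity;
import javax.persistence.EntityManager;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;
import javax.persistence.TypedQuery;

// Hypothetical entity; the ORM maps it to a table and generates the SQL.
@Entity
public class Account {

    @Id
    @GeneratedValue
    private Long id;

    private String email;

    protected Account() {} // no-arg constructor required by JPA

    public Account(String email) {
        this.email = email;
    }
}

class AccountService {

    // Typical CRUD with no hand-written SQL; the provider (e.g. Hibernate)
    // generates the INSERT and SELECT statements from the mapping, and the
    // named parameter is bound rather than concatenated.
    void example(EntityManager em) {
        em.persist(new Account("jane@example.com"));

        TypedQuery<Account> query = em.createQuery(
                "SELECT a FROM Account a WHERE a.email = :email", Account.class);
        query.setParameter("email", "jane@example.com");
        Account found = query.getSingleResult();
    }
}
```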
See:
Stored Procedures and ORM's
Manual DAL & BLL vs. ORM
This may be better on the Programmers SE, but I'll answer here.
CRUD stored procedures used to be, and sometimes still are, the best practice for data persistence and retrieval on a SQL DBMS. Every such DBMS has stored procedures, so you're practically guaranteed to be able to use this solution regardless of the coding language and DBMS, and code which uses the solution can be pointed to any DB that has the proper stored procs and it'll work with minimal code changes (there are some syntax changes required when calling SPs in different DBMSes; often these are integrated into a language's library support for accessing SPs on a particular DBMS). Perhaps the biggest advantage is centralized access to the table data; you can lock the tables themselves down like Fort Knox, and dispense access rights for the SPs as necessary to more limited user accounts.
However, they have some drawbacks. First off, SPs are difficult to TDD, because the tools don't really exist within database IDEs; you have to create tests in other code that exercise the SPs (and so the test must set up the DB with the test data that is expected). From a technical standpoint, such a test is not and cannot be a "unit test", which is a small, narrow test of a small, narrow area of functionality, which has no side effects (such as reading/writing to the file system). Also, SPs are one more layer that has to be changed when making a needed change to functionality. Adding a new field to a query result requires changing the table, the retrieval source code, and the SP. Adding a new way to search for records of a particular type requires the statement to be created and tested, then encapsulated in a SP, and the corresponding method created on the DAO.
The new best practice where available, IMO, is a library called an object-relational mapper or ORM. An ORM abstracts the actual data layer, so what you're asking for becomes the code objects themselves, and you query for them based on properties of those objects, not based on table data. These queries are almost always code-configurable, and are translated into the DBMS's flavor of SQL based on one or more "mappings" that you define between the object model and the data model (objects of type A are persisted as records in table B, where this property C is written to field D).
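In JPA-annotation terms (one of several mapping styles), that last sentence looks roughly like the following, with invented names:

```java
import java.math.BigDecimal;

import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Table;

// "Objects of type Order are persisted as records in table SALES_ORDER,
//  where the property total is written to field ORDER_TOTAL."
@Entity
@Table(name = "SALES_ORDER")
public class Order {

    @Id
    private Long id;

    @Column(name = "ORDER_TOTAL")
    private BigDecimal total;
}
```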
The advantages are more flexibility within the code actually looking for data in the form of these code objects. The criteria of a query is usually able to be customized in-code; if a new query is needed that has a different WHERE clause, you just write the query, and the ORM will translate it into the new SQL statement. Because the ORM is the only place where SQL is actually used (and most ORMs use system stored procs to execute parameterized query strings where available) injection attacks are virtually impossible. Lastly, depending on the language and the ORM, queries can be compiler-checked; in .NET, a library called Linq is available that provides a SQL-ish keyword syntax, that is then converted into method calls that are given to a "query provider" that can translate those method calls into the data store's native query language. This also allows queries to be tested in-code; you can verify that the query used will produce the desired results given an in-memory collection of objects that stands in for the actual DBMS.
One disadvantage of an ORM is that the library is usually language-specific; Hibernate is available in Java, NHibernate (and L2E and L2SQL) in .NET, and a few similar libraries like Pork in PHP, but if you're coding in an older or more esoteric language there's simply nothing of the sort available. Another is that security becomes a little trickier; most ORMs require direct access to the tables in order to query and update them. A few will tolerate being pointed to a view for retrieval and SPs for updating (allowing segregation of view/SP and table security and the ability to restrict the retrievable fields), but now you're mixing the worst of both worlds; you still have to define mappings, but now you also have code in the data layer. The easiest way to overcome this is to implement your security elsewhere; force applications to get data using a web service, which provides the data using the ORM and has specific, limited "front doors". Also, many ORMs have some performance problems when used in certain ways; most are designed to "lazy-load" data, where data is retrieved the moment it's actually needed and not before, which increases up-front performance when you don't need every record you asked for. However, when you DO need every record you asked for, this creates extra round trips. You have to structure queries in specific ways to get around this expected use-case behavior.
Which is better? You have to decide. I can tell you now that using an ORM is MUCH easier to set up and get working correctly than SPs, and it's much easier to make (and limit the scope of) changes to the schema and to queries. In the modern development house, where the priority is to make it work first, and then make it perform well and/or be secure against intrusion, that's a HUGE plus. In most cases where you think security is an issue, it really isn't, and when security really is an issue, putting the solution in the DB layer is usually the wrong place, because the DBMS is the very last line of defense against intrusion; if the DBMS itself has to be counted on to stop something unwanted from happening, you have failed to do so (or even encouraged it to happen) in many layers of software and firmware above it.

Anything similar to SqlSoup for Scala?

Which existing Scala database API is most similar to SqlSoup for Python (part of SqlAlchemy)? What I see in SqlSoup: a convenient and largely portable database API where I don't have to specify schemas and all the types are inferred via reflection, yet I don't have to write raw SQL expressions. Also preferable is the fact that it's part of a more complete database package that does support "everything else" (schema specifications, ORM, etc.), and they share many of the same query abstractions. I imagine that Scala 2.9's Dynamic type may come in handy here. Thanks in advance.
The most similar is SQLAlchemy 0.6 (http://www.sqlalchemy.org/news.html), which supports Jython. That means you can use SQLAlchemy on the JVM and call it from Java or Scala. Check this for more details: http://www.rexx.com/~dkuhlman/jython_course_03.html#calling-jython-from-java
You will likely need to write some interface code in Jython.
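A rough sketch of what that interface code might look like from the JVM side, using Jython's embedded interpreter; the SqlSoup import and the zxJDBC connection URL inside the strings are illustrative assumptions, not a tested setup:

```java
import org.python.core.PyObject;
import org.python.util.PythonInterpreter;

public class SqlSoupBridge {

    public static void main(String[] args) {
        PythonInterpreter py = new PythonInterpreter();

        // Run SQLAlchemy/SqlSoup inside the embedded Jython interpreter.
        // (Assumes SQLAlchemy and a zxJDBC-backed dialect are on the Jython path.)
        py.exec("from sqlalchemy.ext.sqlsoup import SqlSoup");
        py.exec("db = SqlSoup('mysql+zxjdbc://user:pass@localhost/test')");
        py.exec("users = db.users.all()");

        // Pull the result back into Java/Scala as a generic PyObject.
        PyObject users = py.get("users");
        System.out.println(users);
    }
}
```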
The inefficiencies of SQLAlchemy come from the impedance mismatch between SQL and object-oriented thinking. Now that you have chosen a non-object-oriented language, maybe it is time to move away from SQLAlchemy clones and work on a thread pool to give you non-blocking access to SQL databases. Actors work really well when you break down the problem into lots of small, simple tasks, and SqlSoup seems too heavy for this.
Maybe you would also benefit from a memcache in front of your SQL database. Imagine that you need to process an SQL request through 7 steps to get the data in the form that you want. If you save all the intermediate results in memcache, you may be able to reduce the number of times that you hit the SQL DB. Actors lend themselves to a loosely coupled design where you can replace an actor or insert two in the place of one.