Main-Memory Secondary-Memory Objects - main-memory-database

I have a situation where I want to do some DB-related operations in a Java application (e.g. in Eclipse). I use MySQL as an RDBMS and Hibernate as an ORM provider.
I retrieve all records using embedded SQL in Java:
// Define connections, etc.
ResultSet result = myStmt.executeQuery("SELECT * FROM employees");
// iterator
I retrieve all records using Hibernate ORM / JPQL:
// Connections, EntityManager, etc.
List result = em.createQuery("SELECT emp FROM Employees emp").getResultList();
// iterator
I know that the RDBMS is located in secondary memory (disk). The question is: when I get both results back, where are the employees actually? In secondary memory (SM) or in main memory (MM)?
In the end I want two object populations for further testing, one operating in SM and one in MM. How is this possible?
Thanks
Frank

Your Java objects are real Java objects; they are in (to use your term) MM, at least for a while. The beauty of the Hibernate/JPA programming model is that while they are in MM you can pretty much treat them like any other Java object, make a few changes to them, and so on. Then, at some agreed time, Hibernate's persistence mechanism gets them back to SM (disk).
You will need to read up on the implications of Sessions and Transactions in order to understand when the transitions between MM and SM occur, and also very importantly, what happens if two users want to work with the same data at the same time.
Maybe start here
It is also possible to create objects in MM that are (at least for now) not related to any data on disk; these are "transient" objects. You can also "disconnect" data in memory from what's on disk (Hibernate calls such objects "detached").
My bottom line here is that Hibernate/JPA does remove much grunt work from persistence coding, but it cannot hide the complexity of scale. As your data volumes increase, your data model's complexity grows, and your users' actions contend for data, you need to do serious thinking. Hibernate allows you to achieve good things, but it can't do that thinking for you; you have to make careful choices as your problem domain gets more complex.

Related

Best way to lookup elements in GemFire Region

I have Regions in GemFire with a large number of records.
I need to look up elements in those Regions for validation purposes. The lookup happens for every item we scan; there can be more than 10000 items.
What would be an efficient way to look up elements in Regions?
Please suggest.
Vikas-
There are several ways in which you can look up, or fetch multiple elements from a GemFire Region.
As you can see, a GemFire Region indirectly implements java.util.Map, and so provides all the basic Map operations, such as get(key):value, in addition to several other operations that are not available in Map like getAll(Collection keys):Map.
However, get(key):value is not going to be the most "efficient" method for looking up multiple items at once. getAll(..) allows you to pass in a Collection of keys for all the values you want returned. Of course, you have to know the keys of all the values you want in advance, so...
You can obtain GemFire's QueryService from the Region by calling region.getRegionService().getQueryService(). The QueryService allows you to write GemFire Queries with OQL (or Object Query Language). See GemFire's User Guide on Querying for more details.
The advantage of using OQL over getAll(keys) is, of course, that you do not need to know the keys of all the values you might need to validate up front. If the validation logic is based on some criteria that match the values that need to be evaluated, you can express these criteria in the OQL Query Predicate.
For example...
SELECT * FROM /People p WHERE p.age >= 21;
To use the GemFire QueryService to run the query above, you would...
Region people = cache.getRegion("/People");
...
QueryService queryService = people.getRegionService().getQueryService();
// $1 is bound to the first element of the argument array passed to execute(..)
Query query = queryService.newQuery("SELECT * FROM /People p WHERE p.age >= $1");
SelectResults<Person> results = (SelectResults<Person>) query.execute(new Object[] { 21 });
// process (e.g. validate) the results
OQL Queries can be parameterized and arguments passed to the Query.execute(args:Object[]) method as shown above. When the appropriate Indexes are added to your GemFire Regions, then the performance of your Queries can improve dramatically. See the GemFire User Guide on creating Indexes.
Finally, with GemFire PARTITION Regions especially, where your Region data is partitioned (or "sharded") and distributed across the nodes (GemFire Servers) in the cluster that host the Region of interest (e.g. /People), you can combine querying with GemFire's Function Execution service. This lets you query the data locally on the node where it actually exists (e.g. the shard/bucket of the PARTITION Region containing a subset of the data), rather than bringing the data to you. You can even encapsulate the "validation" logic in the GemFire Function you write.
You will need to use the RegionFunctionContext along with the PartitionRegionHelper to get the local data set of the Region to query. Read the Javadoc of PartitionRegionHelper as it shows the particular example you are looking for in this case.
Spring Data GemFire can help with many of these concerns...
For Querying, you can use the SD Repository abstraction and extension provided in SDG.
For Function Execution, you can use SD GemFire's Function Execution annotation support.
Be careful though: using the SD Repository abstraction inside a Function context is not going to limit the query to just the "local" data set of the PARTITION Region. SD Repos always work on the entire data set of the "logical" Region, where the data is necessarily distributed across the nodes in a cluster in a partitioned (sharded) setup.
You should definitely familiarize yourself with GemFire Partitioned Regions.
In summary...
The approach you choose above really depends on several factors, such as, but not limited to:
How you organized the data in the first place (e.g. PARTITION vs. REPLICATE, which refers to the Region's DataPolicy).
How amenable your validation logic is to supplying "criteria" to, say, an OQL Query Predicate to "SELECT" only the Region data you want to validate. Additionally, efficiency might be further improved by applying appropriate Indexing.
How many nodes are in the cluster and how distributed your data is, in which case a Function might be the most advantageous approach, i.e. bring the logic to your data rather than the data to your logic. The latter involves selecting the matching data on the nodes where the data resides (which could involve several network hops to the nodes containing the data, depending on your topology and configuration, i.e. "single-hop access", etc.), serializing the data to send over the wire (thereby increasing the saturation of your network), and so on.
Depending on your use case, other factors to consider are your expiration/eviction policies (e.g. whether data has been overflowed to disk), the frequency of the needed validations based on how often the data changes, etc.
Most of the time, it is better to validate the data on the way in and catch errors early. Naturally, as data is updated, you may also need to perform subsequent validations, but that is no substitute for verifying as early as possible.
There are many factors to consider and the optimal approach is not always apparent, so test and make sure your optimizations and overall approach have the desired effect.
Hope this helps!
Regards,
-John
Set up the PDX serializer and use the query service to get your element. "Select element from /region where id=xxx". This will return your element field without deserializing the record. Make sure that id is indexed.
There are other ways to validate quickly if your inbound data is streaming rather than a client lookup, such as the Function Service.

Merge identical databases into one

We have 15 databases of 75 tables with an average of a million rows, all with the same schema but different data. We have now been given the requirement by the client to bring all 15 into one database, with each set of data filtered by the user’s login.
The changes to the application have been completed to do the filtering. We are now left with the task of merging all databases into one.
The issue is conflicting PKs and FKs: they are all of type int, so we will have 15 rows with a PK id of 1.
One idea is to use .NET and the DBML to insert the records as new records into the new database, letting LINQ deal with the PKs and FKs and using code to deal with duplicate data.
What other ways are there to do this?
It's never a trivial job to integrate databases when the records don't have unique primary keys in all databases. A few weeks ago I built a similar integration script for which I decided to use Entity Framework.
First the good news. With EF's DbContext API it's ridiculously easy to insert a complete object graph and let EF take care of all newly generated primary keys and foreign keys. The reason this is so easy is that when an object's state is changed to Added, all of its related objects become Added as well, and EF figures out the right order of inserts. This is truly great! It let me build the core of the copy routine in a few hours, which would have taken many days if I had done it in T-SQL, for example. The latter is much more error-prone too.
Of course life isn't that easy. Now the bad news:
1) This takes tons of machine resources. Of course I used a new context instance for each copy step, but I still had to run the program on a machine with a decent processor and a fair amount of internal memory. The exact specifications don't matter; the message is: test with the largest databases and see what kind of beast you need. If the memory consumption can't be managed by any machine at your disposal, you have to split the routine into smaller chunks, but that will take more programming.
2) The object graph that's changed to Added must be divergent. By this I mean that there should only be 1-n associations starting from the root. The reason is that EF will really mark all objects as Added. So if somewhere in the graph a few branches refer back to the same object (because there is an n-1 association), these "new" objects will be multiplied, because EF doesn't know their identity. An example of this could be Company -< Customer -< Order >- OrderType: when there are only 2 order types, inserting one root company with 10 customers with 10 orders each will create 100 order type records instead of 2.
So the hard part is to find paths in your class structure that are as divergent as possible. This won't always be possible. In that case, you'll have to add the leaves of the converging paths first. In the example: first insert the order types. When a new company is inserted, you first load the existing order types into the context and then add the company. Now link the new orders to the existing order types. This can only be done if you can match objects by natural keys (in this example: the order type names), but usually this is possible.
3) You must take care not to insert multiple copies of master data. Suppose the order types in the previous example are the same in all databases (although their primary keys may differ!). The order types from the source database should not be reinserted in the target database. Moreover, you must fix the references in the source data so they point to the correct records in the target database (again by matching on natural key).
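The natural-key matching described above can be sketched outside of EF. This is only a minimal in-memory illustration in plain Java, not the EF routine from this answer: the class `MasterDataMerger`, its methods, and the order type names are all hypothetical. It shows master data being inserted once per natural key while every source primary key is mapped to the matching target primary key, which is what you then use to fix the foreign keys.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: merge master data from several source databases by natural key. */
final class MasterDataMerger {
    // Target state: natural key (e.g. order type name) -> primary key in the target database.
    private final Map<String, Integer> targetIdByName = new LinkedHashMap<>();
    private int nextTargetId = 1;

    /**
     * Takes the source rows (source primary key -> natural key) and returns a map from
     * source primary key to target primary key. A row is inserted only if its natural
     * key is not yet present in the target.
     */
    Map<Integer, Integer> merge(Map<Integer, String> sourceRows) {
        Map<Integer, Integer> sourceToTarget = new HashMap<>();
        for (Map.Entry<Integer, String> row : sourceRows.entrySet()) {
            Integer targetId = targetIdByName.get(row.getValue());
            if (targetId == null) {
                // First time this natural key is seen: "insert" it with a fresh key.
                targetId = nextTargetId++;
                targetIdByName.put(row.getValue(), targetId);
            }
            sourceToTarget.put(row.getKey(), targetId);
        }
        return sourceToTarget;
    }

    int targetRowCount() {
        return targetIdByName.size();
    }

    public static void main(String[] args) {
        MasterDataMerger merger = new MasterDataMerger();
        // Two source databases hold the same order types under different primary keys.
        Map<Integer, Integer> db1 = merger.merge(Map.of(1, "Standard", 2, "Express"));
        Map<Integer, Integer> db2 = merger.merge(Map.of(1, "Express", 2, "Standard"));
        System.out.println("db1 remap: " + db1);
        System.out.println("db2 remap: " + db2);
        System.out.println("target rows: " + merger.targetRowCount());
    }
}
```

The returned map is what a copy step would consult when rewriting foreign keys in dependent rows (orders, in the example) before inserting them into the target.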
So although it wasn't trivial, it was doable, and the job was done in a relatively short time. I'm sure that the alternatives (T-SQL, Integration Services, BIDS, if doable at all) would have taken more time or would have been buggier. And the problem with bugs in this area is that they may only become apparent much later.
I later found out that the issues I describe under 2) are related to fetching the source objects with AsNoTracking. See this interesting post: Entity Framework 6 - use my getHashCode(). I used AsNoTracking because it performs better and it reduces memory consumption.

Super column vs serialization vs 2 lookups in Cassandra

We have:
users, each of which has events, each of which has several properties (time, type etc.). Our basic use case is to fetch all events of a given user in a given time-span.
We've been considering the following alternatives in Cassandra for the Events column-family. All alternatives share: key=user_id (UUID), column_name = event_time
Alternative 1: column_value = serialized object of event properties. Will need to read/write all the properties every time (not a problem), but might also be difficult to debug (can't use Cassandra command-line client easily)
Alternative 2: column is actually a super column, sub-columns are separate properties. Means reading all events(?) every time (possible, though sub-optimal). Any other cons?
Alternative 3: column_value is a row-key to another CF, where the event properties are stored. Means maintaining two tables -> complicates calls + reads/writes are slower(?).
Anything we're missing? Any standard best-practice here?
Alternative 1: Why go to Cassandra if you are going to store a serialized object? MongoDB or a similar product would perform better at this task if I understand it right (I have never actually tried a document-based NoSQL store, so correct me if I'm wrong on this one). Anyway, I tried this alternative once in MySQL 6 years ago and it is still painful to maintain today.
Alternative 2: Sorry, I haven't had to play with super columns yet. I would use this only if I frequently had to show a lot of information about many users (i.e. much more than just their username and a few qualifiers) and their respective events in one query. It could also make a query based on a given time-span a bit tricky if there are conditions on the user itself too, since a user's row is likely to have event columns that fit in the span and other columns that don't.
Alternative 3: Would definitely be my choice in most cases. You are not likely to write events and create a user in the same transaction, so there is no worry about consistency. Use the username itself as a standard event column (don't forget to index it) so your calls will be pretty fast. More on this type of data model at http://www.datastax.com/docs/0.8/ddl/index.
Yes, it's a two-call read, but these are two different families of data anyway.
As for best practices, the field is kind of new; I'm not sure there are any widely approved ones yet.

Association properties, nightmare performance (Entity Framework)

I have a fairly large EF4 model, using POCO code gen. I've got lots of instances where I select a single entity from whichever table by its ID.
However on some tables, this takes 2 minutes or more, where on most tables it takes less than a second. I'm out of ideas as to where to look now, because I can't see any reason. It's always the same tables that cause problems, but I can query them directly against the database without problems, so it must be somewhere in Entity Framework territory that the problem is coming from.
The line is quite innocuous:
Dim newProd As New Product
newProd.ShippingSize = Entities.ShippingSizes.Single(Function(ss) ss.Id = id)
id is simply an integer passed in from the UI, Id on my entity is the primary key, which is indexed on the database
Entities is a freshly created instance of my entity framework datacontext
This is not the first query being executed against the Context, it is the first query against this EntitySet though
I have re-indexed all tables having seen posts suggesting that a corrupt index could cause slow access, that hasn't made any difference
The exact same line of code against other tables runs almost instantly, it's only certain tables
This particular table is tiny - it only has 4 things in it
Any suggestions as to where to even start?
--edit - I'd oversimplified the code in the question to the point where the problem disappeared!
Where to start?
Print or log the actual SQL string that's being sent to the database.
Execute that literal string on the server and measure its performance.
Use your server's EXPLAIN plan system to see what the server's actually doing.
Compare the raw SQL performance to your EF performance.
That should tell you whether you have a database problem or an EF problem.
Seems like this is a function of the POCO template's Fixup behaviour in combination with lazy loading.
Because the entity has already been loaded via Single, subsequent operations seem to be happening in memory rather than against the database. The Fixup method by default makes Contains() calls, which is where everything grinds to a halt while 10s of thousands of items get retrieved, initialised as proxies, and evaluated in memory.
I tried changing this Contains() to a Where(Function(x) x.Id = id).Count > 0 (will do logically the same thing, but trying to force a quick DB operation instead of the slow in-memory one). The query was still performed in-memory, and just as slow.
I switched from POCO to the standard EntityGenerator, and this problem just disappeared with no other changes. Say what you will about patterns/practices, but this is a nasty problem to have - I didn't spot this until I switched from fakes and small test databases to a full size database. Entity Generator saves the day for now.

Design Question - Put hundreds of Yes/No switches in columns, rows, or other?

We are porting an old application that used a hierarchical database to a relational web app, and are trying to figure out the best way to port configuration switches (Y/N values).
Our old system had 256 distinct switches (per client) that were each stored as a bit in one of 8 32-bit data fields. Each client would typically have ~100 switches set. To read or set a switch, we'd use bitwise arithmetic using a #define value. For example:
if (a_switchbank4 & E_SHOW_SALARY_ON_CHECKS) //If true, print salary on check
We were debating which approach to use to store switches in our new relational (MS-SQL) database:
1- Put each switch in its own field
Pros: fast and easy read/write/access - 1 row per client
Cons: seems kludgey, need to change schema every time we add a switch
2- Create a row per switch per client
Pros: unlimited switches, no schema changes necessary w/ new switches
Cons: slightly more arduous to pull data, lose intellisense w/o extra work
3- Maintain bit fields
Pros: same code can be leveraged, smaller XML data transmissions between machines
Cons: doesn't make any sense to our developers, hard to debug, too easy to use wrong 'switch bank' field for comparison
I'm leaning towards #1 ... any thoughts?
It depends on a few factors such as:
How many switches are set for each client
How many switches are actually used
How often switches are added
If I had to guess (and I would be guessing) I'd say what you really want are tags. One table has clients, with a unique ID for each, another has tags (the tag name and a unique ID) and a third has client ID / tag ID pairs, to indicate which clients have which tags.
This differs from your solution #2 in that tags are only present for the clients where that switch is true. In other words, rather than storing a client ID, a switch ID, and a boolean you store just a client ID and a switch ID, but only for the clients with that switch set.
This takes up about one third of the space of solution #2, but the real advantage is over solutions #1 and #3: indexing. If you want to find out things like which clients have switches 7, 45, and 130 set but not 86 or 14, you can do that efficiently with a single index on the tag table, but there's no practical way to do it with the other solutions.
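To make the "pair is present only when the switch is on" idea concrete, here is a minimal in-memory sketch in Java. It is an analogy for the client ID / tag ID pair table, not database code: the class `ClientTags` and its method names are hypothetical, and in the real schema the lookup would be a SQL query backed by the index on the pair table.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical in-memory stand-in for the client ID / tag ID pair table. */
final class ClientTags {
    // One entry per (client ID, tag ID) pair; a missing pair means the switch is off.
    private final Map<Integer, Set<Integer>> tagsByClient = new HashMap<>();

    void set(int clientId, int tagId) {
        tagsByClient.computeIfAbsent(clientId, k -> new HashSet<>()).add(tagId);
    }

    boolean isSet(int clientId, int tagId) {
        return tagsByClient.getOrDefault(clientId, Set.of()).contains(tagId);
    }

    /** Clients that have every tag in 'required' and no tag in 'excluded', in ascending ID order. */
    List<Integer> clientsWith(Set<Integer> required, Set<Integer> excluded) {
        Set<Integer> matches = new TreeSet<>();
        for (Map.Entry<Integer, Set<Integer>> entry : tagsByClient.entrySet()) {
            Set<Integer> tags = entry.getValue();
            if (tags.containsAll(required) && Collections.disjoint(tags, excluded)) {
                matches.add(entry.getKey());
            }
        }
        return new ArrayList<>(matches);
    }

    public static void main(String[] args) {
        ClientTags tags = new ClientTags();
        for (int tag : new int[] {7, 45, 130}) tags.set(1, tag);
        for (int tag : new int[] {7, 45, 130, 86}) tags.set(2, tag);
        tags.set(3, 7);
        // The example from the text: switches 7, 45 and 130 set, but not 86 or 14.
        System.out.println(tags.clientsWith(Set.of(7, 45, 130), Set.of(86, 14)));
    }
}
```

Note that a client with no switches set never appears in the structure at all, which mirrors the space saving claimed for the tag table.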
You could think about using database views to give you the best of each solution.
For example store the data as one row per switch, but use a view that pivots the switches (rows) into columns where this is more convenient.
I would go with option #2, one row per flag.
However, I'd also consider a mix of #1 and #2. I don't know your app, but if some switches are related, you could group those into tables where you have multiple columns of switches. You could group them based on use or type. You could, and would probably still have a generic table with one switch per row, for ones that don't fit into the groups.
Remember too that if you change the method, you may have a lot of application code to change that relies on the existing method of storing the data. Whether you should change the method may depend on exactly how hard it will be and how many hours it will take to change everything associated with it. I agree with Markus' solution, but you do need to consider exactly how hard the refactoring is going to be and whether your project can afford the time. The refactoring book I've been reading suggests that you maintain both for a set time period, with triggers to keep them in sync, while you start fixing all the references. Then on a set date you drop the original (and the triggers) from the database. This allows you to use the new method going forward, but gives you the flexibility that nothing will break before you get it fixed, so you can roll out the change before all references are fixed. It requires discipline, however, as it is easy not to get rid of the legacy code and columns because everything is working and you are afraid to. If you are in the midst of a total redesign where everything will be tested thoroughly and you have the time built into the project, then go ahead and change everything all at once.
I'd also lean toward option 1, but would also consider an option 4 in some scenarios.
4- Store in dictionary of name value pairs. Serialize to database.
I would recommend option 2. It's relatively straightforward to turn a list of tags/rows into a hash in the code, which makes it fairly easy to check variables. Having a table with 256+ columns seems like a nightmare.
One problem with option #2 is that having a crosstab query is a pain:
Client S1 S2 S3 S4 S5 ...
A X X
B X X X
But there are usually methods for doing that in a database-specific way.
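If no database-specific pivot is available, the crosstab can also be built in application code from the one-row-per-switch data. The following is a rough Java sketch under that assumption; the class `Crosstab`, the `{client, switch}` row shape, and the use of "-" for an unset switch are all made up for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical helper that pivots (client, switch) rows into a text crosstab. */
final class Crosstab {
    /** Each row is {client, switch}. Returns a header line plus one line per client. */
    static String pivot(List<String[]> rows) {
        SortedSet<String> switches = new TreeSet<>();
        Map<String, Set<String>> switchesByClient = new TreeMap<>();
        for (String[] row : rows) {
            switches.add(row[1]);
            switchesByClient.computeIfAbsent(row[0], k -> new TreeSet<>()).add(row[1]);
        }
        StringBuilder out = new StringBuilder("Client");
        for (String s : switches) {
            out.append(' ').append(s);
        }
        out.append('\n');
        for (Map.Entry<String, Set<String>> entry : switchesByClient.entrySet()) {
            out.append(entry.getKey());
            for (String s : switches) {
                // "X" marks a switch that is set for this client, "-" one that is not.
                out.append(' ').append(entry.getValue().contains(s) ? "X" : "-");
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(pivot(List.of(
                new String[] {"A", "S1"},
                new String[] {"A", "S3"},
                new String[] {"B", "S2"})));
    }
}
```

Doing the pivot in code keeps the storage schema at one row per switch, at the cost of loading the rows for the clients you want to display.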