memcache as NHibernate second level cache - nhibernate

I have a question about second level caching with NHibernate and memcache. Suppose the following configuration:
Website A uses DB_A. Data from table X is being cached.
Website B uses DB_B. Data from table X is being cached.
Both web apps share a single memcache server.
Now, table X in DB_A and DB_B, while having the same schema, have different data, so the row with PK = 1 in DB_A will NOT be the same data as the row with PK = 1 in DB_B.
My question is: will each application clobber the other's data, or is the second-level caching smart enough to create cache keys that don't overlap across databases?

I'm not sure if you'll end up with overlapping, overwritten data; you'll need to check what cache keys are being used. However, here are some interesting reads that you might find helpful:
http://ayende.com/blog/3976/nhibernate-2nd-level-cache
http://ayende.com/blog/3112/nhibernate-and-the-second-level-cache-tips
http://ayende.com/blog/1708/nhibernate-caching-the-secong-level-cache-space-is-shared
The last one is probably of most use to you. The author did something similar to what you are attempting, except you're making your life easier by (somehow) having no primary key conflicts.

Related

Best Practice for SQL Replication / Load Managing

I'm currently running an Ubuntu server with MariaDB on it. It serves all SQL requests for a website (with a good amount of traffic).
A few times a day we import large CSV files into the database to update our data. The problem is that those CSV imports hammer the DB (an import takes around 15 minutes).
The import seems to use only 1 core out of 4, but the website (or rather its SQL requests during that time) still gets ridiculously slow. So my question is: what can I do here so that the website is affected as little as possible, ideally not at all?
I was considering database replication to a different server, but I expect that to use the same amount of resources during the import, so no real benefit there, I guess?
The other thing I considered is to have two SQL database servers and, during an import, switch all requests to the other server. I would basically do each import twice: first on server 1 (while server 2 serves the site), and once that's done, the website is switched to server 1 and the import is done on server 2. While that would work, it seems like quite an effort for an imperfect solution (e.g. how are requests handled during the switch from server 1 to server 2, and so on).
So what solutions exist here, preferably somewhat affordable ones?
All ideas and hints are welcome.
Thanks in advance
Best Regards
Menax
Is the import replacing an entire table? If so, load it into a separate table, then swap it into place. That is essentially zero downtime; the only pause is during the RENAME TABLE. For details, see http://mysql.rjweb.org/doc.php/deletebig or possibly http://mysql.rjweb.org/doc.php/staging_table
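A minimal sketch of that swap (MariaDB/MySQL), assuming the import fully replaces a hypothetical products table; the CSV path and field options are placeholders:
-- Build the replacement offline so the live table keeps serving reads.
CREATE TABLE products_new LIKE products;
-- Load the CSV into the new table (path and options are placeholders).
LOAD DATA INFILE '/path/to/import.csv'
INTO TABLE products_new
FIELDS TERMINATED BY ','
IGNORE 1 LINES;
-- Atomic swap; keep the old copy around briefly in case something went wrong.
RENAME TABLE products TO products_old, products_new TO products;
DROP TABLE products_old;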
If the import is doing something else, please provide details.
One connection uses one core, no more.
More (from Comments)
SELECT id, marchants_id
FROM products
WHERE links LIKE '%https://www.merchant.com/productsite_5'
LIMIT 1
That is hard to optimize because of the leading wildcard in the LIKE. Is that really what you need? As it stands, that query must scan the table.
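If the leading % can be dropped (i.e. the stored links always start with the merchant URL), an ordinary index becomes usable. A sketch, assuming links is a string column; if it is a long TEXT column a prefix index such as links(191) may be required:
ALTER TABLE products ADD INDEX idx_links (links);
-- The anchored form can then use the index instead of scanning the table:
SELECT id, marchants_id
FROM products
WHERE links LIKE 'https://www.merchant.com/productsite_5%'
LIMIT 1;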
SELECT id, price
FROM price_history
WHERE product_id = 5
ORDER BY id DESC
LIMIT 1
That would benefit from INDEX(product_id, id, price) -- in that order. With that index, the query will be as close to instantaneous as possible.
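For example, that index could be added like this (table and column names taken from the query above):
ALTER TABLE price_history
ADD INDEX idx_product_latest (product_id, id, price);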
Please provide the rest of the transaction with the Update and Insert, plus SHOW CREATE TABLE. There is quite possibly a way to "batch" the actions rather than doing one product price at a time. This may speed it up 10-fold.
Flip-flopping between two servers -- only if the data is readonly. If you are otherwise modifying the tables, that would be a big nightmare.
For completely replacing a table:
CREATE a new TABLE
Populate it
RENAME TABLE to swap the new table into place.
(But I still don't understand your processing well enough to say whether this is best. When you say "switch the live database", are you referring to a server (computer), a DATABASE (schema), or just one TABLE?)

DB schema for updating downstream sources?

I want a table to be sync-able by a web API.
For example,
GET /projects?sequence_latest=2113&limit=10
[{"state":"updated", "id":12, "sequence":2116},
 {"state":"deleted", "id":511, "sequence":2115},
 {"state":"created", "id":601, "sequence":2114}]
What is a good schema to achieve this?
I intend this for PostgreSQL with the Django ORM, which uses surrogate keys. The presence of an ORM may rule out answers that rely on UNIONs.
I can come up with only half-solutions.
I could have a modified_time column, but that cannot convey deletions.
I could have a table for storing deleted IDs; when returning 10 new/updated rows, I could return all the deleted rows between them. But this works only when the latest change is an insert/update and there is a moderate number of deleted rows.
I could set a deleted flag on the row and null out the rest, but it's kind of bad schema design to make all columns nullable.
I could have another table that stores the ID, a modification sequence number and a state (new, updated, deleted), but it's another table to maintain, and assigning sequence numbers causes contention; imagine n concurrent requests all querying for the latest ID.
If you're using an ORM you want simple(ish) and if you're serving the data via an API you want quick.
To go through your suggested options:
Correct, so this doesn't help you. You could have a deleted flag in your main table though.
This seems quite a random way of doing it and breaks your insistence that there be no UNION queries.
Not sure why you would need to NULL the rest of the columns here? What benefit does this bring?
I would strongly advise against having a table that has a modification sequence number. Either this means that you're performing a lot of analytic queries in order to find out the most recent state or you're updating the same rows multiple times and maintaining a table with the same PK as your normal one. At that point you might as well have a deleted flag in your main table.
Essentially the design of your API gives you one easy option: you should have everything in the same table because all data is being returned through the same method. I would follow your point 2 and Wolph's suggestion and have a deleted_on column in your table, making it look like:
create table my_table (
id ... primary key
, <other_columns>
, created_on date
, modified_on date
, deleted_on date
);
I wouldn't even bother updating all the other columns to be NULL. If you want to ensure that you return no data for deleted rows, create a view on top of your table that nulls out the data where the deleted_on column is populated. Then have your API access the table only through the view.
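A sketch of such a view, using the columns from the table definition above (the view name my_table_api is just a placeholder):
create view my_table_api as
select id
     , case when deleted_on is null then created_on end as created_on
     , case when deleted_on is null then modified_on end as modified_on
     , deleted_on
from my_table;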
If you are really, really worried about space and the volume of records, and will perform regular database maintenance to ensure that both are controlled, then maybe go with option 4: create a second table that holds the state of each ID in your main table and actually delete the data from your main table. You can then do a LEFT OUTER JOIN to the main table to get the data; when there is no data, that ID has been deleted. Honestly, this is overkill until you know whether you will definitely require it.
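Roughly, a query for that layout might look like the following (the my_table_state table is hypothetical; the cut-off sequence 2113 matches the example request above):
select s.id
     , s.sequence
     , s.state
     , t.created_on
     , t.modified_on
from my_table_state as s
left outer join my_table as t on t.id = s.id
where s.sequence > 2113
order by s.sequence desc
limit 10;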
You don't mention why you're using a web API for data transfers; but if you're going to be transferring a lot of data, or using this for internal systems only, it might be worth using a lower-level transfer mechanism.

How to model data for a CouchDB geocoder

I am working on a CouchDB-based geocoding application using a large national dataset that is supplied relationally. There are some 250 million records split over 9 tables (the ER diagram can be viewed at http://bit.ly/1dlgZBt). I am quite new to NoSQL document databases, and CouchDB in particular, and am considering how to model this. I have currently loaded the data into a CouchDB database per table, with a type field indicating which kind of record it is. The _id attribute is set to the primary key for tables [A] and [C]; for everything else it is auto-generated by Couch. I plan on setting up Lucene with Couch for indexing and full text search. The X and Y point coordinates are all stored in table [A], but to find these I will need to search using data in [Table E], [Tables B, C & D combined] and/or [Table I], with the option of filtering results based on data in [Table F].
My original intention was to create a single CouchDB database which would combine all of these tables into a single structure with [Table A] as the root and all related tables nested under this. I would then build my various search indexes on this and also setup a spatial index using GeoCouch for reverse geocoding. However I have read articles that suggest view collation as an alternative approach.
An important factor here I guess is reads vs writes. The plan is that this data will never be updated, only read. Data is released every quarter at which time the existing DB would be blown away and a new DB created.
I would welcome any suggestions for how best to setup and organise this from any experienced Couch or related document database users.
Many thanks in advance for any assistance.
guygrange,
While I am far from an expert in document database design, the key thing to recognize about document DBs is that everything is about making your queries fast by keeping all of the necessary information in a single document. Hence, you need to look at your queries and how you expect to access this data. For example, I can easily imagine a geocoding application not needing access to everything in each table for its most frequent queries. So, to save on bandwidth, you would make a main document that has the information you most frequently care about, along with a key for the rest of the appropriate data. Then you could fetch the remaining data with that key and merge the dictionaries for easy management in your client code.
Anon,
Andrew

Merge identical databases into one

We have 15 databases of 75 tables with an average of a million rows, all with the same schema but different data. We have now been given the requirement by the client to bring all 15 into one database, with each set of data filtered by the user's login.
The changes to the application have been completed to do the filtering. We are now left with the task of merging all databases into one.
The issue is conflicting PKs and FKs: since the PKs and FKs are of type int, we will have 15 PK IDs of 1.
One idea is to use .NET and the DBML to insert the records as new records into the new database, letting LINQ deal with the PKs and FKs and using code to deal with duplicate data.
What other ways are there to do this?
It's never a trivial job to integrate databases when the records don't have unique primary keys in all databases. A few weeks ago I built a similar integration script for which I decided to use Entity Framework.
First the good news. With EF's DbContext API it's ridiculously easy to insert a complete object graph and have EF take care of all newly generated primary keys and foreign keys. The reason this is so easy is that when an object's state is changed to Added, all of the objects attached to it become Added as well, and EF figures out the right order of inserts. This is truly great! It let me build the core of the copy routine in a few hours, which would have taken many days had I done it in T-SQL, for example. The latter is also much, much more error-prone.
Of course life isn't that easy. Now the bad news:
This takes tons of machine resources. Of course I used a new context instance for each copy step, but I still had to execute the program on a machine with a decent processor and a fair amount of memory. The exact specifications don't matter; the message is: test with the largest databases and see what kind of beast you need. If the memory consumption can't be handled by any machine at your disposal, you'll have to split the routine into smaller chunks, but that will take more programming.
The object graph that's changed to Added must be divergent. By this I mean that there should only be 1-n associations starting from the root. The reason is that EF will really mark all objects as Added. So if somewhere in the graph a few branches refer back to the same object (because there is an n-1 association), these "new" objects will be multiplied, because EF doesn't know their identity. An example of this could be Company -< Customer -< Order >- OrderType: when there are only 2 order types, inserting one root company with 10 customers with 10 orders each will create 100 order type records instead of 2.
So the hard part is to find paths through your class structure that are divergent as much as possible. This won't always be possible. When it isn't, you'll have to add the leaves of the converging paths first. In the example: first insert the order types. When a new company is inserted, you first load the existing order types into the context and then add the company. Now link the new orders to the existing order types. This can only be done if you can match objects by natural keys (in this example: the order type names), but usually this is possible.
You must take care not to insert multiple copies of master data. Suppose the order types in the previous example are the same in all databases (although their primary keys may differ!). The order types from the source database should not be re-inserted into the target database. Moreover, you must fix the references in the source data so that they point to the correct records in the target database (again by matching on natural keys).
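In SQL terms, the natural-key remapping amounts to something like the sketch below (table and column names are hypothetical, and both databases are assumed to sit on the same SQL Server instance; the answer above does this through EF rather than raw SQL):
-- Point each source order at the matching target order type, matched by name.
UPDATE src
SET src.OrderTypeId = tgt.Id
FROM SourceDb.dbo.Orders AS src
JOIN SourceDb.dbo.OrderTypes AS st ON st.Id = src.OrderTypeId
JOIN TargetDb.dbo.OrderTypes AS tgt ON tgt.Name = st.Name;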
So although it wasn't trivial, it was doable and the job was done in a relatively short time. I'm sure that other alternatives (T-SQL, Integration Services, BIDS, if doable at all) would have taken more time or would have been more buggy. And the problem with bugs in this area is that they may only become apparent much later.
I later found out that the issues I describe under 2) are related to fetching the source objects with AsNoTracking. See this interesting post: Entity Framework 6 - use my getHashCode(). I used AsNoTracking because it performs better and it reduces memory consumption.

SQL Identity Column out of step

We have a set of databases that have a table defined with an identity column as the primary key. As a subset of these are replicated to other servers, a seeding system was created so that they could never clash: each database uses a different starting seed with an increment of 50.
In this way the table on DB1 would generate 30001, 30051, etc., while Database 2 would generate 30002, 30052, and so on.
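For illustration, the per-database definition would look roughly like this (table and columns are hypothetical):
-- DB1: IDENTITY(30001, 50) generates 30001, 30051, 30101, ...
-- DB2 would use IDENTITY(30002, 50) and generate 30002, 30052, ...
CREATE TABLE Orders (
    ID INT IDENTITY(30001, 50) PRIMARY KEY,
    CustomerName NVARCHAR(100) NOT NULL
);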
I am looking at adding another database into this system (it is split for scaling/load purposes) and have discovered that the identities have got out of sync on one or two of the databases - i.e. database 3, which should have numbers ending in 3, doesn't anymore. The seeding and increments are still correct according to the table design.
I am obviously going to have to work around this problem somehow (probably by setting a high initial value), but can anyone tell me what would cause them to get out of sync like this? From a query on the DB I can see the sequence went as follows: 32403,32453, 32456, 32474, 32524, 32574 and has continued in increments of 50 ever since it went wrong.
As far as I am aware no bulk-inserts or DTS or anything like that has put new data into these tables.
Second (bonus) question - how do I reset the identity so that it goes back to what I want it to actually be?
EDIT:
I know the design is in principle a bit ropey - I didn't ask for criticism of it, I just wondered how it could have got out of sync. I inherited this system and changing the column to a GUID - whilst undoubtedly the best theoretical solution - is probably not going to happen. The system evolved from a single DB to multiple DBs when the load got too large (a few hundred GBs currently). Each ID in this table will be referenced in many other places - sometimes a few hundred thousand times each (multiplied by about 40,000 for each item). Updating all those will not be happening ;-)
Replication = GUID column.
To set the value of the next ID to be 1000:
DBCC CHECKIDENT (orders, RESEED, 999)
If you want to actually use Primary Keys for some meaningful purpose other than uniquely identify a row in a table, then it's not an Identity Column, and you need to assign them some other explicit way.
If you want to merge rows from multiple tables, then you are violating the intent of Identity, which is for one table. (A GUID column will use values that are unique enough to solve this problem. But you still can't impute a meaningful purpose to them.)
Perhaps somebody used:
SET IDENTITY_INSERT {tablename} ON
INSERT INTO {tablename} (ID, ...)
VALUES (32456, ....)
SET IDENTITY_INSERT {tablename} OFF
Or perhaps they used DBCC CHECKIDENT to change the identity. In any case, you can use the same to set it back.
It's too risky to rely on this kind of identity strategy, since it's (obviously) possible for it to get out of sync and wreck everything.
With replication, you really need to identify your data with GUIDs. It will probably be easier for you to migrate your data to a schema that uses GUIDs for PKs than to try and hack your way around IDENTITY issues.
To address your question directly,
Why it got out of sync may be interesting to discuss, but the only result you could draw from the answer would be how to prevent it in the future, which is a bad course of action. You will continue to have these and bigger problems unless you deal with the design, which has a fatal flaw.
How to set the existing values right is also (IMHO) an invalid question, because you need to do something other than set the values right - it won't solve your problem.
This isn't to disparage you, it's to help you the best way I can think of. Changing the design is less work both short term and long term. Not to change the design is the pathway to FAIL.
This doesn't really answer your core question, but one possibility to address the design would be to switch to a hi/lo algorithm. It wouldn't require changing the column away from an int, so it shouldn't be nearly as much work as changing to a GUID.
Hi/lo is used by the NHibernate ORM, but I couldn't find much documentation on it.
Basically, the way hi/lo works is that you have one central place where you keep track of your hi value: one table in one of the databases that every instance of your insert application can see. Then you need some kind of service (object, web service, whatever) that lives somewhat longer than a single entity insert. When this service starts up, it goes to the hi table, grabs the current value, then increments the value in that table. Use a read committed lock to do this so that you won't get any concurrency issues with other instances of the service. Now you use this service to get your next ID value. Internally it starts at the number it got from the DB and, each time it passes a value out, increments by 1, keeping track of the current value and the "range" it's allowed to pass out. A simplistic example would be this:
Service 1 gets 100 from the hi_value table in the DB and increments the DB value to 200.
Service 1 gets a request for a new ID and passes out 100.
Another instance of the service, service 2 (another thread, another middle-tier worker machine, etc.), spins up, gets 200 from the DB, and increments the DB value to 300.
Service 2 gets a request for a new ID and passes out 200.
Service 1 gets a request for a new ID and passes out 101.
If any of these instances ever needs to pass out more than 100 values before dying, it goes back to the DB, gets the current value, increments it, and starts over. Obviously there's some art to this: how big should your range be, and so on.
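A minimal T-SQL sketch of the reservation step (the HiValue table is hypothetical; the block size of 100 matches the example above):
-- One-time setup: the shared table that every instance can see.
CREATE TABLE HiValue (CurrentHi INT NOT NULL);
INSERT INTO HiValue (CurrentHi) VALUES (100);
-- Each service instance reserves a block of 100 ids in one atomic statement.
DECLARE @newHi INT;
UPDATE HiValue
SET @newHi = CurrentHi = CurrentHi + 100;
-- This instance may now hand out ids @newHi - 100 ... @newHi - 1 from memory.
SELECT @newHi - 100 AS BlockStart, @newHi - 1 AS BlockEnd;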
A very simple variation on this is a single table in one of your DBs that just contains the nextId value, basically reproducing Oracle's sequence concept by hand.