Servlet design, concurrent access to field - oop

I have rather general question, please advice.
I have a servlet.
This servlet has private field.
Private field is a kind of metadata stuff (public class Metadata{//bla-bla-bla}).
When GET request is processed, this metadata is used to perform some operation.
I want to implement POST method in the same servlet. User uploads file and Metadata field is updated.
The problem: concurrent access to this private field with Metadata object shared among sereval web-threads using one servlet instance. POST method operaton (Update Metadata object) can lead to Metadata inconsistent state and concurrent GET request can be failed.
The question: what is the best way to update Metadata object while GET requests are running?
Dummy solution:
During each GET request,, at the very beginning
Synchonize Metadata object and clone it in one block, then release it.
Concurrent GET requests work with clone verstion of Metadata object which is consistent.
During each POST request.
Synchonize Metadata object and update its fields.
Release Metadata object.
Please advice or critisize.

Using synchronized methods set and get in the Metadata class is fine but may slower your web app in case you have multiple readers and (much) less writers:
Java synchronized keyword is used to acquire a exclusive lock on an
object. When a thread acquires a lock of an object either for reading
or writing, other threads must wait until the lock on that object is
released. Think of a scenerio that there are many reader threads that reads a shared
data frequently and only one writer thread that updates shared data.
It’s not necessary to exclusively lock access to shared data while
reading because multiple read operations can be done in parallel
unless there is a write operation.
(Excerpt from that nice post)
So using a multiple read single write strategy may be better in term of performance in some cases as explained also in the same Java5 ReadWriteLock interface doc:
A read-write lock allows for a greater level of concurrency in
accessing shared data than that permitted by a mutual exclusion lock.
It exploits the fact that while only a single thread at a time (a
writer thread) can modify the shared data, in many cases any number of
threads can concurrently read the data (hence reader threads). In
theory, the increase in concurrency permitted by the use of a
read-write lock will lead to performance improvements over the use of
a mutual exclusion lock. In practice this increase in concurrency will
only be fully realized on a multi-processor, and then only if the
access patterns for the shared data are suitable.
Whether or not a read-write lock will improve performance over the use
of a mutual exclusion lock depends on the frequency that the data is
read compared to being modified, the duration of the read and write
operations, and the contention for the data - that is, the number of
threads that will try to read or write the data at the same time. For
example, a collection that is initially populated with data and
thereafter infrequently modified, while being frequently searched
(such as a directory of some kind) is an ideal candidate for the use
of a read-write lock. However, if updates become frequent then the
data spends most of its time being exclusively locked and there is
little, if any increase in concurrency. Further, if the read
operations are too short the overhead of the read-write lock
implementation (which is inherently more complex than a mutual
exclusion lock) can dominate the execution cost, particularly as many
read-write lock implementations still serialize all threads through a
small section of code. Ultimately, only profiling and measurement will
establish whether the use of a read-write lock is suitable for your
application.
A ready to use implementation is the ReentrantReadWriteLock.
Take a look at the previous post for a nice tutorial on how to use it.

Related

Distributed lock with TTL

When we have a distributed lock with TTL, it is possible that lock will expire because of TTL config and the process which had that lock has not finished computation and it will continue to manipulate the object for which it acquired lock as it doesn't know that lock has already expired. How can we avoid that scenario?
The solution you are looking for is called "fencing token". Long story short - every mutation command/operation should include the token and the executor should check if the token is still valid.
The token is just a number, call it "term", every time a new lock is issued, the term get increased.
The executor has a simple logic, it never accepts commands with an old term.
Arguably, this is the only option to truly avoid any lock related issues. Stuff like having timestamps, or explicit lock releases - all of them are inherently prone to various race issues.
Another pointer I recommend to look at - Red Lock algorithm; and the issues it has - more on this here: https://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html

Multiple data insertions using async writing with Apache Geode

We have Apache Geode connected to Postgres using an AEQ + AsyncCacheListener configured to write data to Postgres. During async write, we submit the list of events that we want to persist and it asynchronously inserts those events. Let's say I have two client applications which calls processEvents for async writing and both have some events in common which violate some key. But, after client calls processEvents, control is immediately returned to client. In such cases how will client know if some issue occurred? What are the best practices to tackle this?
What do you mean by the events in common "violate some key"? Like a primary or foreign key constraint, or some other database constraint perhaps (e.g. uniqueness, non-null values, etc)?
Handling a conflict depends on the importance and nature of the data being inserted, or written to the backend (Postgres) database from Geode and its significance to the application, from a requirements and business logic POV.
If 2 (or more) client applications are writing to the same cache/database entries/records, then certainly some type of collision will eventually occur, and how it is handled will depend on the data and the type of operation performed on the data.
In general, handling the violation closer to where and when the violation occurs (e.g. inside the AsyncEventListener itself) maybe preferable or ideal, since then you should have most of the necessary information (e.g. DataAccessException, events, additional capabilities to query the DB) to deal with the situation.
Inside the AEQ Listener, you could employ different strategies depending on the data and operation as determined by the application:
First update wins (enforced by optimistic locking)
Perform a merge
Log [failed] event(s)
Overwrite value(s) (last update wins).
...
You could employ Geode to conflate events stored in the AEQ for the same key, which should minimize collisions/conflicts.
If the client (as in "client" in a client/server topology) needs to be informed, then you could write the failed events to another Region where a client registers a CQ to be notified when entries are written to this (failed events) Region. The client-side handler associated the CQ could then take the appropriate action, such as notifying the end-user, refreshing and then retrying the operation, and so on.
Given the async nature of the initial write, then you can only respond asynchronously once the violation occurs. This is not unlike in a Reactive world (namely with onSuccess/onFailure event handlers).
So, in this situation, I don't think there really is a "best practice" per-say, rather only "recommendations". For example, handling the situation as near to the actual occurrence of the violation as possible, since handling the violation usually involves having the necessary information readily available to make the best possible, informed decision on the right course of action.
Sometimes you can automate the recovery, other times you might need manual intervention. Most definitely, do not guess. Clearly document your application/systems (configured) behavior when it can handle a situation and when it cannot.
I don't think there is a general, 1 size fits all solution in this case.
I hope this gives you some ideas to think about.

Trident or Storm topology that writes on Redis

I have a problem with a topology. I try to explain the workflow...
I have a source that emits ~500k tuples every 2 minutes, these tuples must be read by a spout and processed exatly once like a single object (i think a batch in trident).
After that, a bolt/function/what else?...must appends a timestamp and save the tuples into Redis.
I tried to implement a Trident topology with a Function that save all the tuples into Redis using a Jedis object (Redis library for Java) into this Function class, but when i deploy i receive a NotSerializable Exception on this object.
My question is.How can i implement a Function that writes on Redis this batch of tuples? Reading on the web i cannot found any example that writes from a function to Redis or any example using State object in Trident (probably i have to use it...)
My simple topology:
TridentTopology topology = new TridentTopology();
topology.newStream("myStream", new mySpout()).each(new Fields("field1", "field2"), new myFunction("redis_ip", "6379"));
Thanks in advance
(replying about state in general since the specific issue related to Redis seems solved in other comments)
The concepts of DB updates in Storm becomes clearer when we keep in mind that Storm reads from distributed (or "partitioned") data sources (through Storm "spouts"), processes streams of data on many nodes in parallel, optionally perform calculations on those streams of data (called "aggregations") and saves the results to distributed data stores (called "states"). Aggregation is a very broad term that just means "computing stuff": for example computing the minimum value over a stream is seen in Storm as an aggregation of the previously known minimum value with the new values currently processed in some node of the cluster.
With the concepts of aggregations and partition in mind, we can have a look at the two main primitives in Storm that allow to save something in a state: partitionPersist and persistentAggregate, the first one runs at the level of each cluster node without coordination with the other partitions and feels a bit like talking to the DB through a DAO, while the second one involves "repartitioning" the tuples (i.e. re-distributing them across the cluster, typically along some groupby logic), doing some calculation (an "aggregate") before reading/saving something to DB and it feels a bit like talking to a HashMap rather than a DB (Storm calls the DB a "MapState" in that case, or a "Snapshot" if there's only one key in the map).
One more thing to have in mind is that the exactly once semantic of Storm is not achieved by processing each tuple exactly once: this would be too brittle since there are potentially several read/write operations per tuple defined in our topology, we want to avoid 2-phase commits for scalability reasons and at large scale, network partitions become more likely. Rather, Storm will typically continue replaying the tuples until he's sure they have been completely successfully processed at least once. The important relationship of this to state updates is that Storm gives us primitive (OpaqueMap) that allows idempotent state update so that those replays do not corrupt previously stored data. For example, if we are summing up the numbers [1,2,3,4,5], the resulting thing saved in DB will always be 15 even if they are replayed and processed in the "sum" operation several times due to some transient failure. OpaqueMap has a slight impact on the format used to save data in DB. Note that those replay and opaque logic are only present if we tell Storm to act like that, but we usually do.
If you're interested in reading more, I posted 2 blog articles here on the subject.
http://svendvanderveken.wordpress.com/2013/07/30/scalable-real-time-state-update-with-storm/
http://svendvanderveken.wordpress.com/2014/02/05/error-handling-in-storm-trident-topologies/
One last thing: as hinted by the replay stuff above, Storm is a very asynchronous mechanism by nature: we typically have some data producer that post event in a queueing system (e,g. Kafka or 0MQ) and Storm reads from there. As a result, assigning a timestamp from within storm as suggested in the question may or may not have the desired effect: this timestamp will reflect the "latest successful processing time", not the data ingestion time, and of course it will not be identical in case of replayed tuples.
Have you tried trident-state for redis. There is a code on github that does it already:
https://github.com/kstyrc/trident-redis.
Let me know if this answers your question or not.

Acquiring Locks when updating a Redis key/value

I'm using AcquireLock method from ServiceStack Redis when updating and getting the key/value like this:
public virtual void Set(string key, T entity)
{
using (var client = ClientManager.GetClient())
{
using (client.AcquireLock(key + ":locked", DefaultLockingTimeout, DefaultLockExpire))
{
client.Set(key, entity);
}
}
}
I've extended AcqurieLock method to accept extra parameter for expiration of the lock key. So I'm wondering that if I need AcquireLock at all or not? My class uses AcquireLock in every operation like Get<>, GetAll<>, ExpireAt, SetAll<>, etc..
But this approach doesn't work everytime. For example, if the operating in the lock throws an exception, then the key remains locked. For this situation I've added DefaultLockExpire parameter to AcquireLock method to expire the "locked" key.
Is there any better solution, or when do we need acquiring locks like "lock" blocks in multi-thread programming.
As The Real Bill answer has said, you don't need locks for Redis itself. What the ServiceStack client offers in terms of locking is not for Redis, but for your application. In a C# application, you can lock things locally with lock(obj) so that something cannot happen concurrently (only one thread can access the locked section at a time), but that only works if you have one webserver. If you want to prevent something happening concurrently, you need a locking mechanism living outside of the webserver. Redis is a good fit for this.
We have a case where it is checked if a customer has a shopping cart already and if not, create it. Between checking and creating it, there's a time where another request could have also found out that cart doesn't exist and might also proceed to create one. That's a classical case for locking but a simple lock wouldn't work here as the request may have arrived from an entirely different web-server. So for this, we use the ServiceStack Redis client (with some abstraction) to lock using Redis and only allow one request at a time to enter the "create a cart" section.
So to answer your actual question: no, you don't need a lock for getting/setting values to Redis.
I wouldn't use locks for get/set operations. Redis will do those actions atomically, so there is no chance of it getting "changed underneath you" when setting or getting. I've built systems where hundreds of clients are updating/operating on values concurrently and never needed a lock to do those operations (especially an expire).
I don't know how Service Stack redis implements the locking it has so I can't say why it is failing. However, I'm not sure I'd trust it given there is no true locking needed on the Redis side for data operations. Redis is single-threaded so locking there doesn't make sense.
If you are doing complex operations where you get a value, operate on things based on it, then update it after a while and can't have the value change in the meantime I'd recommend reading and groking http://redis.io/topics/transactions to see if what you want is what Redis is good for, whether your code needs refactored to eliminate the problem, or at the least find a better way to do it.
For example, SETNX may be the route you need to get what you want, but without details I can't say it will work.
As #JulianR says, the locking in ServiceStack.Redis is only for application-level distributed locks (i.e. to replace using a DB or an empty .lock file on a distributed file system) and it only works against other ServiceStack.Redis clients in other process using the same key/API to acquire the lock.
You would never need to do this for normal Redis operations since they're all atomic. If you want to ensure a combination of redis operations happen atomically than you would combine them within a Redis Transaction or alternatively you can execute them within a server-side Lua script - both allow atomic execution of batch operations.

Good ways to decouple GUIs from SOAP/WS-API update/write calls?

Let's assume we have some configuration GUI that in its current form uses direct DB transactions to submit new configurations for more than one configurable component in a consistent manner.
Now let's move the data (DB) stuff behind some SOAP/WS API. The GUI has no direct DB access anymore. The transactional behaviour must remain, but the API should NOT be designed to explcitly accommodate the GUI form submissions. In fact, I don't even know how the new GUI will work or how the user input will be structured. Therefore I need to provide something like WS-AtomicTransaction on the API server side. However, there are (at least) two caveats:
The GUI is written in PHP: I don't think there is any WS-Transaction support in PHP available.
I don't want to keep DB transactions open on the server side while waiting for additional client requests.
Solutions I can think of:
using Camel's aggregation. However, that would make things more complicated in at least two ways:
You cannot use DB row ids of newly inserted rows in the subsequent calls inside the same transaction. You need to use some sort of symbolic back-referencing because there would be no communication between client and server while processing the aggregated messages.
call replies would not be immediate (or the immediate and separate reply to each single call would only be some sort of a stub, ie. not containing any useful information beyond "your message has been attached to TX xyz" -- if that's at all possible in the Camel aggregation case).
the two disadvantages of the previous solution make me think of request batches where possibly the WS standards provide means for referencing call results in subsequent calls inside the batch transaction. Is there any such thing already available? Maybe even as a PHP client?
trying to eliminate lock contention in the database by carefully using row-level locks etc. However, when inserting new elements, my guess is that usually pages and index pages need to be locked by the DB.
maybe some server-side persistence layer using optimistic locking? But again, that would not return any DB IDs back to the client before the final commit if DB writes would be postponed until the commit (don't know if that's possible at all).
What do YOU think?
Transactions are a powerful tool and we easily get into a thinking pattern in which we see every problem as a nail we hit with this big hammer. I can relate to your confusion because I've experienced it myself. Unfortunately I have no better advice for you than to try not think in terms of transactions but of atomic API calls.
When I think in terms of transactions, my thought pattern usually goes like this:
start transaction
read (repeat as required)
update (repeat as required)
commit/roll back
It takes some time to realize that we overuse this pattern. Actual conflicts are rare and there are many other ways of dealing with them. Here is a commonly used one in APIs
read and send data to client (atomic API call)
update data (on the client)
send original + updates back to the server (atomic API call)
start transaction (on server)
read
compare with original from client
if not same, return error (client should retry)
if same, update
commit
The last six points are part of the implementation of the API call.
Ferenc Mihaly
http://theamiableapi.com