Multithreading and Synchronization - synchronized

I have been trying to understand the real reason for using synchronization in multithreaded code.
We know that if multiple threads access a common shared resource at the same time, problems such as deadlock and race conditions can arise. But if we synchronize the code that multiple threads call, only one thread can access the resource at a time while the other threads wait in a queue. If that is the case, isn't it as good as a single-threaded application without synchronization? What performance gain do we get by synchronizing multithreaded code?
Just an example to compare the two scenarios:
1. We process 1000 records in a single-threaded model. Assuming it takes 1 second to process a single record, the whole job takes 1000 seconds.
2. We process 1000 records in a multithreaded model with the process method synchronized, again assuming 1 second per record, and say 10 threads are spawned. Here as well, whenever one thread gets access to the synchronized method, the rest of the threads wait in a queue, so in total it still takes 1000 seconds.
I would be really satisfied and relieved if someone could help me understand this basic point. Thanks.
Edit:
I haven't mentioned the programming language: it's Java.
Just to understand the impact of synchronization, with and without it, consider the code below (a Spring Batch example):
package com.dbas.core;

import java.util.List;

import org.springframework.batch.item.ItemReader;

public class NextReader implements ItemReader<String> {

    private List<String> itemList;

    public NextReader(ListBean listBean) {
        itemList = listBean.getItemList();
    }

    public synchronized String read() {
        if (!itemList.isEmpty()) {
            return itemList.remove(0);
        }
        return null;
    }
}
Do we have to synchronize the above code? If not, the instance variable "itemList" will be shared across multiple threads. If it is shared, will the item retrieval above work properly?
There is a Processor that gets called after read() to process the items. Is it advisable to synchronize the above code for multiple threads, or will it work without any issue without synchronization?
Thanks.

Of course, if the time you hold the mutex is very long compared to the rest of the code, then you lose the benefits of multithreading. As an extreme case, consider an application where a thread holds the mutex forever: you get a single-threaded application!
For this reason, software developers usually design the code to hold the lock for the shortest time possible. See, for example, the double-checked locking pattern: http://en.wikipedia.org/wiki/Double-checked_locking
For more complex situations, there are data structures that give good performance even when several threads read and write data. See, for example, the RCU (read-copy-update) structure, also implemented in the Linux kernel: http://en.wikipedia.org/wiki/Read-copy-update
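The double-checked locking pattern mentioned above can be sketched in Java as follows (the class name is illustrative). Since Java 5 the field must be volatile for this to be correct; the lock is only taken on the rare first-initialization path, so the common read path stays lock-free:

```java
class LazyHolder {
    // volatile is required: it guarantees that a thread seeing a non-null
    // reference also sees the fully constructed object.
    private static volatile LazyHolder instance;

    private LazyHolder() { }

    static LazyHolder getInstance() {
        LazyHolder result = instance;          // first (unsynchronized) check
        if (result == null) {
            synchronized (LazyHolder.class) {
                result = instance;             // second check, under the lock
                if (result == null) {
                    instance = result = new LazyHolder();
                }
            }
        }
        return result;
    }
}
```

Once initialized, every subsequent call returns without ever entering the synchronized block.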

You are right that if the entire process is synchronized, it would take the same time no matter how many threads there are. (In reality it would take longer the more threads there are, due to the overhead of context switching, among other things.)
The key, then, is not to synchronize the entire process method in such a case.
Ideally, processing one message would be totally independent of processing another message; in that case you could in theory process 1000 messages concurrently, and the job would take one thousandth of the time, given you have 1000 processors at your disposal.
In practice, you end up somewhere in between. You might use many small locks, each covering code and data that are independent of each other. Or you might keep all the data for each message independent, requiring no synchronization for that part; but once the main processing is done, you need to insert the result into a shared array, and there you need a lock around access to that shared array.
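As a rough sketch of that middle ground (class and method names are made up for illustration), the per-record work below runs unlocked in a thread pool, and only the brief insertion into the shared result list is synchronized:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class ParallelProcessing {
    static List<String> processAll(List<String> records) throws InterruptedException {
        List<String> results = new ArrayList<>();        // shared, guarded below
        ExecutorService pool = Executors.newFixedThreadPool(10);
        for (String rec : records) {
            pool.submit(() -> {
                // Independent work: no lock needed here (stand-in for real processing).
                String processed = rec.toUpperCase();
                // Lock held only for the brief insertion into the shared list.
                synchronized (results) {
                    results.add(processed);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return results;
    }
}
```

Because the lock covers only the cheap `add`, the expensive part of each record still runs in parallel.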

Synchronization in multithreaded code is used to enable safe access to shared state from different threads. Depending on the language and hardware implementation details, accessing shared state from many different threads has the following hazards:
Data corruption, when two threads concurrently attempt to read/write the same memory location.
Visibility side effects, when one thread changes the state of a shared resource but the changes do not become apparent to other threads immediately (or even at all) without proper synchronization. This can also happen due to compiler optimizations (instruction reordering).
Having said that, your question is vague: you mention the word "synchronized" but not any specific programming language. In Java, the keyword synchronized in various contexts means an implicit monitor/lock, and too many locks can be bad for performance; there are use cases where fine-grained locking or non-blocking algorithms / CAS strategies offer better performance. The topic is very broad and you'll have to be more specific.
Edit: In the scenario you describe, if all work is completely serial and all state is shared, then there would be little or no benefit in a multithreaded implementation. However, such extremes are rare; often a portion of the task can run in parallel, and then you can get a measurable performance improvement. Amdahl's law can be used to find the theoretical maximum performance benefit when trying to parallelize a task.
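Amdahl's law itself is a one-liner; here is a minimal sketch with illustrative names:

```java
class Amdahl {
    // Amdahl's law: with a fraction p of the work parallelizable across n
    // workers, the maximum overall speedup is 1 / ((1 - p) + p / n).
    static double speedup(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }
}
```

For example, if 90% of the task is parallelizable, 10 threads give a speedup of only about 5.3x, not 10x: the serial 10% dominates as you add threads.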
edit:
Regarding your edit: since I happen to have used Spring Batch, I can tell you for sure that if you are using a thread pool to read items from the list, you have to synchronize, and you can do it in several ways; below are two of them:
public class NextReader implements ItemReader<String> {

    private List<String> itemList;

    public NextReader(ListBean listBean) {
        itemList = listBean.getItemList();
    }

    public synchronized String read() {
        if (!itemList.isEmpty()) {
            return itemList.remove(0);
        }
        return null;
    }
}
or, without any lock, using an AtomicInteger as a cursor:
public class NextReader implements ItemReader<String> {

    private List<String> itemList;

    private AtomicInteger current = new AtomicInteger(0);

    public NextReader(ListBean listBean) {
        itemList = listBean.getItemList();
    }

    public String read() {
        int index = this.current.getAndIncrement();
        if (index < itemList.size()) {
            return itemList.get(index);
        }
        return null;
    }
}
They both have the same effect: each lets multiple threads read from itemList in a thread-safe manner.
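As a quick check of the synchronized variant, here is a standalone sketch (the Spring Batch ItemReader and ListBean types are omitted, and the class names are illustrative) that drains a list from ten threads and confirms each item is handed out exactly once:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class ReaderDemo {
    // Standalone version of the synchronized reader, with the Spring Batch
    // interfaces left out so it can run on its own.
    static class SafeReader {
        private final List<String> itemList;

        SafeReader(List<String> items) {
            this.itemList = new ArrayList<>(items);
        }

        synchronized String read() {
            return itemList.isEmpty() ? null : itemList.remove(0);
        }
    }

    static List<String> drain(List<String> items, int threads) throws InterruptedException {
        SafeReader reader = new SafeReader(items);
        List<String> seen = Collections.synchronizedList(new ArrayList<>());
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                String item;
                while ((item = reader.read()) != null) {  // each item handed out once
                    seen.add(item);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return seen;
    }
}
```

Without the synchronized keyword on read(), two threads could remove the same element or corrupt the list's internal state.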

Related

Coroutines, understanding suspend

I'm trying to understand a passage in Hands-On Design Patterns with Kotlin, Chapter 8, Threads and Coroutines.
Why is it that when we rewrite the function as suspend, "we can serve 20 times more users, all thanks to the smart way Kotlin has rewritten our code"?
fun profile(id: String): Profile {
    val bio = fetchBioOverHttp(id)       // takes 1s
    val picture = fetchPictureFromDb(id) // takes 100ms
    val friends = fetchFriendsFromDb(id) // takes 500ms
    return Profile(bio, picture)
}
I've attached the two relevant pages, but basically it says "if we have a thread pool of 10 threads, the first 10 requests will get into the pool and the 11th will get stuck until the first one finishes. This means we can serve three users simultaneously, and the fourth one will wait until the first one gets his/her results."
I think I understand this point. 3 threads execute the three methods in parallel, then another 3, then another 3, which gives us 9 threads actively executing code. The 10th thread executes the first fetchBioOverHttp method, and we're out of threads until thread #1 finishes its fetchBioOverHttp call.
However, how does rewriting these methods as suspend methods result in serving 20 times more users? I guess I'm not understanding the path of execution here.
To be honest, I don't like this example.
The author meant that after the rewrite, httpCall() doesn't wait for the result: it schedules the processing in the background, registers a callback, and then returns immediately. The caller thread is freed and can start handling another request while the first one is being processed. Using this technique we can process multiple requests even with a single thread.
I don't like this explanation, because it ignores how coroutines really work internally. Instead, it tries to compare them to something the reader may be familiar with: asynchronous callback-based APIs. Normally this is good, as it helps understanding. However, in this case the problem is that in most cases coroutines internally... create a thread pool and use it to schedule blocking IO operations. Therefore both provided solutions are pretty much the same, and the main difference is that we created a pool of 10 threads whereas coroutines use 64 threads by default.
The Kotlin compiler does not cut the function in two. There is still a single function, with a lot of additional code inside. I agree it can be interpreted as two functions calling each other, but that is not what the compiler does. If that wasn't explained in the book, I think it is misleading.
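The schedule-a-callback-and-return idea described above can be sketched in plain Java with CompletableFuture. This is only an analogy to what a suspend function does, not what the Kotlin compiler actually generates, and the fetch bodies here are placeholders rather than real HTTP/DB calls:

```java
import java.util.concurrent.CompletableFuture;

class AsyncProfile {
    // Each fetch is started asynchronously; the calling thread returns
    // immediately and the continuation runs when both results arrive.
    static CompletableFuture<String> profile(String id) {
        CompletableFuture<String> bio =
                CompletableFuture.supplyAsync(() -> "bio-" + id);     // placeholder fetch
        CompletableFuture<String> picture =
                CompletableFuture.supplyAsync(() -> "pic-" + id);     // placeholder fetch
        // Combine the two results once both callbacks have fired.
        return bio.thenCombine(picture, (b, p) -> b + "/" + p);
    }
}
```

While those futures are pending, the thread that called profile() is free to serve other requests, which is the throughput gain the book is describing.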

Synchronized collection that blocks on every method

I have a collection that is commonly used between different threads. In one thread I need to add items, remove items, retrieve items and iterate over the list of items. What I am looking for is a collection that blocks access to any of its read/write/remove methods whenever any of these methods are already being called. So if one thread retrieves an item, another thread has to wait until the reading has completed before it can remove an item from the collection.
Kotlin doesn't appear to provide this. However, I could create a wrapper class that provides the synchronization I'm looking for. Java does appear to offer the Collections.synchronizedList method, but from what I read this really blocks calls only on a single method, meaning that no two threads can remove an item at the same time but one can remove while the other reads an item (which is what I am trying to avoid).
Are there any other solutions?
A wrapper such as the one returned by synchronizedList synchronizes calls to every method, using the wrapper itself as the lock. So one thread would be blocked from calling get(), say, while another thread is currently calling add(). (This is what the question seems to ask for.)
However, as the docs to that method point out, this does nothing to protect sequences of calls, such as you might use when iterating through a collection. If another thread changes the collection in between your calls to next(), then anything could happen. (This is what I think the question is really about!)
To handle that safely, your options include:
Manual synchronization. Surround each sequence of calls to the collection in a synchronized block that synchronises on the collection, e.g.:
val list = Collections.synchronizedList(mutableListOf<String>())
// …
synchronized (list) {
    for (i in list) {
        // …
    }
}
This is straightforward, and relatively easy to do if the collection is under your control. But if you miss any sequences, then you could get unexpected behaviour. Also, you'll need to keep your sequences short, to avoid holding the lock for an extended time and affecting performance.
Use a concurrent collection implementation which provides primitives letting you do all the processing you need in a single call, avoiding iteration and other sequences.
For maps, Java provides very good support with its ConcurrentMap interface, and high-performance implementations such as ConcurrentHashMap. These have methods allowing you to iterate, update single or multiple mappings, search, reduce, and many other whole-map operations in a single call, avoiding any concurrency problems.
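For instance, ConcurrentHashMap.merge performs a read-modify-write as a single atomic call, so many threads can update the map concurrently without any external lock (the word-count class here is just an illustration):

```java
import java.util.concurrent.ConcurrentHashMap;

class WordCount {
    // Each merge() call is atomic: the lookup, the increment, and the store
    // happen as one operation, with no check-then-act race.
    static ConcurrentHashMap<String, Integer> count(String[] words) {
        ConcurrentHashMap<String, Integer> counts = new ConcurrentHashMap<>();
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);   // read-modify-write in one call
        }
        return counts;
    }
}
```

Compare this with a plain HashMap, where the equivalent get-then-put sequence would need a synchronized block around both calls.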
For sets (as per this question) you can use a ConcurrentSkipListSet, or you can create one from a ConcurrentHashMap with newKeySet().
For lists (as per this question), there are fewer options. (I think concurrent lists are much less commonly needed.) If you don't need random access, ConcurrentLinkedQueue may suffice. Or if modification is much less common than iteration, CopyOnWriteArrayList could work.
There are many other concurrent classes in the java.util.concurrent package, so it's well worth looking through to see if any of those is a better match for your particular case.
If you have specialised requirements, you could write your own collection implementation which supports them. Obviously this is more work, and only worthwhile if none of the above approaches does what you want.
In general, I think it's well worth stepping back and seeing whether iteration is really needed. Historically, in imperative languages all the way from FORTRAN through BASIC and C up to Java, the for loop has traditionally been the tool of choice (sometimes the only structure) for operating on collections of data — and for those of us who grew up on those languages, it's what we reach for instinctively. But the functional programming paradigm provides alternative tools, and so in languages like Kotlin which provide some of them, it's good to stop and ask ourselves “What am I ultimately trying to achieve here?” (Often what we want is actually to update all entries, or map to a new structure, or search for an element, or find the maximum — all of which have better approaches in Kotlin than low-level iteration.)
After all, if you can tell the compiler what you want to do, instead of how to do it, then your program is likely to be shorter and easier to read and maintain, freeing you to think about more important things!
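The question is about Kotlin, but the same declarative shift exists in Java's streams too; for example, "find the length of the longest string" without a manual loop (names here are illustrative):

```java
import java.util.List;
import java.util.Optional;

class Declarative {
    // "Find the maximum" expressed as what we want, not how to loop for it.
    static Optional<Integer> longest(List<String> items) {
        return items.stream()
                .map(String::length)        // map each string to its length
                .max(Integer::compare);     // then take the maximum, if any
    }
}
```

Operations expressed this way are also easier to make safe or parallel later (e.g. switching to parallelStream), because the intent is separated from the iteration mechanics.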

Do we need to lock the immutable list in kotlin?

var list = listOf("one", "two", "three")

fun One() {
    list.forEach { result ->
        // Does something here
    }
}

fun Two() {
    list = listOf("four", "five", "six")
}
Can function One() and Two() run simultaneously? Do they need to be protected by locks?
No, you don't need to lock the variable. Even if One() is still running while you change the variable, its forEach keeps iterating over the first list. What could happen is that the assignment in Two() completes before forEach is called, but forEach will loop over either one list or the other, and will not switch lists mid-way because of the assignment.
If you had a println(result) in your forEach, your program would output either
one
two
three
or
four
five
six
depending on whether the assignment happens first or the forEach starts first.
What will NOT happen is something like
one
two
five
six
Can function One() and Two() run simultaneously?
There are two ways that that could happen:
One of those functions could call the other.  This could happen directly (where the code represented by // Does something here in One()⁽¹⁾ explicitly calls Two()), or indirectly (it could call something else which ends up calling Two() — or maybe the list property has a custom setter which does something that calls One()).
One thread could be running One() while a different thread is running Two().  This could happen if your program launches a new thread directly, or a library or framework could do so.  For example, GUI frameworks tend to have one thread for dispatching events, and others for doing work that could take time; and web server frameworks tend to use different threads for servicing different requests.
If neither of those could apply, then there would be no opportunity for the functions to run simultaneously.
Do they need to be protected by locks?
If there's any possibility of them being run on multiple threads, then yes, they need to be protected somehow.
99.999% of the time, the code would do exactly what you'd expect; you'd either see the old list or the new one.  However, there's a tiny but non-zero chance that it would behave strangely — anything from giving slightly wrong results to crashing.  (The risk depends on things like the OS, CPU/cache topology, and how heavily loaded the system is.)
Explaining exactly why is hard, though, because at a low level the Java Virtual Machine⁽²⁾ does an awful lot of stuff that you don't see.  In particular, to improve performance it can re-order operations within certain limits, as long as the end result is the same — as seen from that thread.  Things may look very different from other threads — which can make it really hard to reason about multi-threaded code!
Let me try to describe one possible scenario…
Suppose Thread A is running One() on one CPU core, and Thread B is running Two() on another core, and that each core has its own cache memory.⁽³⁾
Thread B will create a List instance (holding references to strings from the constant pool), and assign it to the list property; both the object and the property are likely to be written to its cache first.  Those cache lines will then get flushed back to main memory — but there's no guarantee about when, nor about the order in which that happens.  Suppose the list reference gets flushed first; at that point, main memory will have the new list reference pointing to a fresh area of memory where the new object will go — but since the new object itself hasn't been flushed yet, who knows what's there now?
So if Thread A starts running One() at that precise moment, it will get the new list reference⁽⁴⁾, but when it tries to iterate through the list, it won't see the new strings.  It might see the initial (empty) state of the list object before it was constructed, or part-way through construction⁽⁵⁾.  (I don't know whether it's possible for it to see any of the values that were in those memory locations before the list was created; if so, those might represent an entirely different type of object, or even not a valid object at all, which would be likely to cause an exception or error of some kind.)
In any case, if multiple threads are involved, it's possible for one to see list holding neither the original list nor the new one.
So, if you want your code to be robust and not fail occasionally⁽⁶⁾, then you have to protect against such concurrency issues.
Using @Synchronized and @Volatile is traditional, as is using explicit locks.  (In this particular case, I think that making list volatile would fix the problem.)
But those low-level constructs are fiddly and hard to use well; luckily, in many situations there are better options.  The example in this question has been simplified too much to judge what might work well (that's the down-side of minimal examples!), but work queues, actors, executors, latches, semaphores, and of course Kotlin's coroutines are all useful abstractions for handling concurrency more safely.
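In Kotlin/JVM terms, the volatile fix mentioned above boils down to the following Java sketch (Kotlin's @Volatile compiles to the same thing; the class and method names are illustrative). The volatile write establishes a happens-before edge, so a reader that sees the new reference is guaranteed to see the fully constructed list:

```java
import java.util.List;

class SharedList {
    // volatile: a reader that observes the new reference also observes
    // every write that happened before the assignment, i.e. a fully
    // built list - never a partially constructed one.
    private volatile List<String> list = List.of("one", "two", "three");

    void replace() {
        list = List.of("four", "five", "six");   // safe publication via volatile
    }

    List<String> snapshot() {
        return list;   // always the old list or the new one, never a mix
    }
}
```

This fixes visibility only; if the functions also needed to coordinate (e.g. read-modify-write on the list), a lock or one of the higher-level abstractions below would still be required.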
Ultimately, concurrency is a hard topic, with a lot of gotchas and things that don't behave as you'd expect.
There are many sources of further information, such as:
These other questions cover some of the issues.
Chapter 17: Threads And Locks from the Java Language Specification is the ultimate reference on how the JVM behaves.  In particular, it describes what's needed to ensure a happens-before relationship that will ensure full visibility.
Oracle has a tutorial on concurrency in Java; much of this applies to Kotlin too.
The java.util.concurrent package has many useful classes, and its summary discusses some of these issues.
Concurrent Programming In Java: Design Principles And Patterns by Doug Lea was at one time the best guide to handling concurrency, and these excerpts discuss the Java memory model.
Wikipedia also covers the Java memory model
(1) According to Kotlin coding conventions, function names should start with a lower-case letter; that makes them easier to distinguish from class/object names.
(2) In this answer I'm assuming Kotlin/JVM.  Similar risks likely apply to other platforms too, though the details differ.
(3) This is of course a simplification; there may be multiple levels of caching, some of which may be shared between cores/processors; and some systems have hardware which tries to ensure that the caches are consistent…
(4) References themselves are atomic, so a thread will either see the old reference or the new one — it can't see a bit-pattern comprising parts of the old and new ones, pointing somewhere completely random.  So that's one problem we don't have!
(5) Although the reference is immutable, the object gets mutated during construction, so it might be in an inconsistent state.
(6) And the more heavily loaded your system is, the more likely it is for concurrency issues to occur, which means that things will probably fail at the worst possible time!

Which is faster, creating processes or Threads? And why?

I just want to understand which is faster, a thread or a process, and why.
All the information I find is about the difference in weight.
In the overwhelming majority of cases, we can assume that creating a process takes much more time than creating a new thread in an existing process. Creating a process (on the JVM) requires at least:
Class loading and verification.*
Linking.*
Class initialization.*
Static member initialization.*
Follow the link; there you will find a lot of detailed information about process loading, and you will understand that it is a very cumbersome procedure.
Creating a new thread, by contrast, requires only an operating-system call.

How to fill Gtk::TreeModelColumn with a large dataset without locking up the application

I need to fill a large dataset (maybe not so large: several thousand entries) into a Gtk::TreeModelColumn. How do I do that without locking up the application? Is it safe to put the processing into a separate thread? Which parts of the application do I have to protect with a lock then? Is it only the Gtk::TreeModelColumn class, or the Gtk::TreeView widget it is placed in, or maybe even the surrounding frame or window?
There are two general approaches you could take. (Disclaimer: I've tried to provide example code, but I rarely use gtkmm - I'm much more familiar with GTK in C. The principles remain the same, however.)
One is to use an idle function, which runs whenever nothing else is happening in your GUI. For best results, do a small amount of work in the idle function, such as adding one item to your tree view. If you return true from the idle function, it is called again whenever more processing time is available. If you return false, it is not called again. The good part about idle functions is that you don't have to lock anything. So you can define your idle function like this:
bool fill_column(Gtk::TreeModelColumn* column)
{
    // add an item to column
    return !column_is_full();
}
Then start the process like this:
Glib::signal_idle().connect(sigc::bind(&fill_column, column));
The other approach is to use threads. In the C API, this would involve gdk_threads_enter() and friends, but I gather that the proper way to do that in gtkmm, is to use Glib::Dispatcher. I haven't used it before, but here is an example of it. However, you can also still use the C API with gtkmm, as pointed out here.