I've been reading up a lot about transactional memory lately. There is a bit of hype around TM, so a lot of people are enthusiastic about it, and it does provide solutions for painful problems with locking, but you regularly also see complaints:
You can't do I/O
You have to write your atomic sections so they can run several times (be careful with your local variables!)
Software transactional memory offers poor performance
[Insert your pet peeve here]
I understand these concerns: more often than not, you find articles about STMs that only run on particular hardware supporting some really nifty atomic operation (like LL/SC), or that have to be supported by some imaginary compiler, or that require all memory accesses to be transactional, or that introduce monad-style type constraints, etc. And above all: these are real problems.
This has led me to ask myself: what speaks against local use of transactional memory as a replacement for locks? Would this already bring enough value, or must transactional memory be used all over the place if used at all?
Yes, some of the problems you mention can be real ones now, but things evolve.
As with any new technology, first there is hype, then the technology reveals some unresolved problems, and then some of those problems get solved while others are not. The result is one more option for solving your problems, and for some of them this technology will be the best adapted.
I would say that you can use STM for the parts of your application that can live with the constraints of the current state of the art: parts that don't mind a loss of efficiency, for example.
Communication between transactional and non-transactional parts is the big problem. There are STMs that are lock-aware, so they can interact in a consistent way with non-transactional parts.
I/O is also possible, but it makes your transaction irrevocable, that is, it cannot be aborted. That means only one transaction can perform I/O at a time. Alternatively, you can do the I/O after the top-level transaction has succeeded, in the non-transactional world, as you would today.
Most library-based STM systems force the user to distinguish between transactional and non-transactional data, so yes, you need to understand exactly what that means. On the other hand, compilers can deduce which accesses must be transactional; the problem is that they can be too conservative, losing the efficiency we gain by managing the different kinds of variables explicitly. It is the same as having static, local, and dynamic variables: you need to know the constraints each one carries in order to write a correct program.
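Go (used elsewhere on this page) has no STM, but a compare-and-swap retry loop gives a feel for both constraints mentioned above: the body may run several times, and I/O belongs after the commit. A minimal, hypothetical sketch, not a real STM:

    package main

    import (
        "fmt"
        "sync"
        "sync/atomic"
    )

    // counter is the only "transactional" datum here; everything else is
    // ordinary non-transactional state, mirroring the distinction above.
    var counter atomic.Int64

    // increment retries until its read-modify-write commits, the way an STM
    // runtime re-runs an aborted transaction. The body may execute several
    // times, so it must be free of side effects (no I/O in here!).
    func increment() {
        for {
            old := counter.Load()
            if counter.CompareAndSwap(old, old+1) {
                return // committed
            }
            // conflict: another goroutine got there first, retry
        }
    }

    func main() {
        var wg sync.WaitGroup
        for i := 0; i < 100; i++ {
            wg.Add(1)
            go func() { defer wg.Done(); increment() }()
        }
        wg.Wait()
        fmt.Println("final:", counter.Load()) // I/O only after everything committed
    }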
I've been reading up a lot about transactional memory lately.
You might also be interested in this podcast on software transactional memory, which also introduces STM using an analogy based on garbage collection:
The paper is about an analogy between garbage collection and transactional memory. In addition to seeing the beauty of the analogy, the discussion also serves as a good introduction to transactional memory (which was mentioned in the Goetz/Holmes episode) and, to some extent, to garbage collection.
If you use transactional memory as a replacement for locks, all the code that used to execute with the lock held may now be rolled back and re-executed when a conflict is detected. Thus the code that was previously using locks must be transactional, and will have all the same drawbacks (and benefits).
So, you could possibly restrict the influence of TM to only those parts of the code that hold locks, right? In that scenario, every piece of code that can be called while a lock is held must support TM. How much of your program does not hold locks and is never called by code that holds locks?
I was wondering how Google, Facebook, etc. deal with hardware errors like memory corruption, calculation errors in the CPU, etc. Given the increasing density (and shrinking size) of circuits, it seems the frequency of hardware errors is going up, not down. Also, big providers like Google and Facebook have so many machines that memory corruption must be an everyday occurrence, so I am wondering what kind of policy they have in this regard.

After all, most algorithms assume that the underlying hardware operates correctly and that data doesn't change in memory. If it does, all bets are essentially off. The corruption could affect not just the specific data hit by the error; it could conceivably spread to other computations. For instance, if the error affects a locking/synchronization protocol it could cause data hazards between threads or nodes, or corrupt a database, violating invariants assumed elsewhere and causing other nodes that discover the corruption to fail. I have seen this in practice, where erroneous data in a database (an invalid timestamp in a configuration-related row) caused a whole system to fail because the application validated the timestamp whenever reading the row!

Hopefully, most of the time an error will simply result in a node crashing, perhaps even before committing any data (for instance, if an operating system structure is corrupted). But since the errors occur essentially at random, they could occur anywhere and live on without being noticed.

It must be somewhat challenging. I am also thinking that big providers must occasionally see errors and stack traces in their logs that cannot be explained through code inspection/analysis, because the situation simply can't happen if the code executed "as written". This can often be quite hard to conclude, though, so a lot of investigation may be spent on an error before it is finally concluded that it must have been a hardware error.

Of course this is not limited to big service providers, since these errors can occur everywhere. But big service providers are much more exposed to them, and it would make sense for them to have a policy in this area.

I can see different ways it could be addressed:
1) Pragmatic: repair errors as you go along. Often the repair is simply to reboot a machine. In cases where customer data is corrupted and someone complains, fix that.

2) Hardening of the code running on individual nodes. I don't know what techniques could be used, but one example is calculating certain results twice and comparing them before committing. This will of course incur overhead, and the comparison logic can itself be subject to corruption, but the risk is presumably quite low since it requires an error in that specific area. This logic could be duplicated as well.
3) Different nodes running in lock-step, comparisons being done between nodes before results are allowed to be committed.
4) Large-scale architectural initiatives to reduce damage from a localized error. Making sure to compare DB content with previous backup(s) to detect bit rot (before blindly making another backup of the current data) etc. Various integrity checks in place. Resiliency in other nodes in case of corrupted data (not relying too strongly on invariants holding etc.). Essentially "being liberal in what you accept".
There might be other things I haven't thought of, which is my reason for asking the question :)
At least the memory content must be reliable:
https://en.wikipedia.org/wiki/ECC_memory
There are also other error detection/correction codes used at various levels (checksums, hashes, etc.).
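As a rough illustration of the checksum idea (a sketch in Go, with invented names, not any provider's actual mechanism): store a hash alongside the data when writing, and verify it when reading, so silent corruption is detected instead of propagated.

    package main

    import (
        "bytes"
        "crypto/sha256"
        "fmt"
    )

    // record pairs a payload with the checksum computed when it was written.
    type record struct {
        payload  []byte
        checksum [sha256.Size]byte
    }

    func store(payload []byte) record {
        return record{payload: payload, checksum: sha256.Sum256(payload)}
    }

    // load re-hashes the payload, so silent corruption is reported instead of
    // being handed to the rest of the system.
    func load(r record) ([]byte, error) {
        sum := sha256.Sum256(r.payload)
        if !bytes.Equal(sum[:], r.checksum[:]) {
            return nil, fmt.Errorf("checksum mismatch: data corrupted")
        }
        return r.payload, nil
    }

    func main() {
        r := store([]byte("configuration row"))
        r.payload[0] ^= 0x01 // simulate a flipped bit
        if _, err := load(r); err != nil {
            fmt.Println(err) // detected, not silently trusted
        }
    }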
I'm trying to ensure that my concurrent program is free of the following:
deadlock
livelock
starvation
I found the following tool
http://blog.golang.org/race-detector
I tried compiling and running with -race enabled and did not see any complaints.
Does anyone know whether this checks for all the above issues? Does the absence of complaints in the output mean the program is free of them?
Deadlock, livelock and starvation cannot be eliminated by testing alone and they are not detected by the Go race detector. Deadlocks in Go programs will be detected at runtime, but then it's usually too late. Livelocks (non-terminating busy loops) will not be detected unless they also cause a deadlock.
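For example, this minimal program (just a sketch) deadlocks, and the Go runtime detects it, but only once nothing can make progress anymore:

    package main

    func main() {
        ch := make(chan int) // unbuffered: a send blocks until someone receives
        ch <- 1              // no other goroutine exists, so main blocks forever
        // The runtime aborts with "fatal error: all goroutines are asleep -
        // deadlock!" -- but only once the entire program is stuck, which in
        // production is usually too late.
    }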
Thread starvation is similar to livelock in that imbalanced busy-ness in an application causes some activities to be stymied and never make the expected progress. An example is the famous 'Wot no Chickens?' by Peter Welch.
The race detector itself is limited in its usefulness because some race conditions depend on environment and the conditions that cause a particular race may be absent during the testing phase, so the race detector will miss them.
If all this sounds rather bleak, there is a body of theoretical work that can help a lot. The premise is that these four dynamic problems are best addressed via design strategies and language features, rather than by testing. As a simple example, the Occam programming language (which is quite like Go in its concurrency model) has a parallel usage rule enforced by the compiler that eliminates race conditions. This imposes a restriction on the programmer: aliases for mutable state (i.e. pointers) are not allowed.
Thread starvation in Go (and Occam) should likewise not be as much a problem as in Java because the concurrency model is better designed. Unless you abuse select, it won't be a problem.
Deadlock is best addressed by theoretically-based design patterns. For example, Martin & Welch published A Design Strategy for Deadlock-Free Concurrent Systems, which described principally the client-server strategy and the i/o-par strategy. This is aimed at Occam programs but applies to Go as well. The client-server strategy is simple: describe your Go-routine network as a set of communicating servers and their clients; ensure that there are no loops in the network graph => deadlock is eliminated. I/o-par is a way to form rings and toruses of Go-routines such that there will not be a deadlock within the structure.
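To make the client-server strategy concrete, here is a hedged Go sketch (all names invented): the server only ever answers, clients only ever ask, so the wait-for graph has no cycle.

    package main

    import "fmt"

    // Each request carries its own reply channel; answering on it is the only
    // blocking action the server ever performs.
    type request struct {
        n     int
        reply chan int
    }

    // server is a pure server in the Martin & Welch sense: it issues no
    // requests of its own, so it can never take part in a waiting cycle.
    func server(requests chan request) {
        for req := range requests {
            req.reply <- req.n * req.n
        }
    }

    func main() {
        requests := make(chan request)
        go server(requests)

        // Clients only talk "downward" to the server; the resulting network
        // graph is acyclic, which is what rules out deadlock.
        for i := 1; i <= 3; i++ {
            reply := make(chan int)
            requests <- request{n: i, reply: reply}
            fmt.Println(i, "squared:", <-reply)
        }
        close(requests)
    }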
IMHO the race detector checks nothing from your list. It checks racy writes to memory. (Sidenote: Goroutine deadlocks are detected by the runtime.)
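For instance, this toy program has a data race that go run -race will report, while exhibiting none of deadlock, livelock, or starvation:

    package main

    import "fmt"

    func main() {
        counter := 0
        done := make(chan struct{})
        go func() {
            counter++ // unsynchronized write...
            close(done)
        }()
        counter++ // ...racing with this one
        <-done
        fmt.Println(counter)
        // `go run -race .` prints a "WARNING: DATA RACE" report for the two
        // writes, while a plain `go run` may appear to work every time.
    }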
I understand that the meaning of atomic was explained in What's the difference between the atomic and nonatomic attributes?, but what I want to know is:
Q. Are there any side effects, besides performance issues, in using atomic properties everywhere?
It seems the answer is no, since iPhones are quite fast nowadays. So why are so many people still using nonatomic?
Even atomic does not guarantee thread safety, but it's still better than nothing, right?
Even atomic does not guarantee thread safety, but it's still better than nothing, right?
Wrong. Having written some really complex concurrent programs, I recommend exactly the opposite. You should reserve atomic for when it truly makes sense to use -- and you may not fully understand this until you write concurrent programs without any use of atomic. If I am writing a multithreaded program, I don't want programming errors masked (e.g. race conditions). I want concurrency issues loud and obvious. This way, they are easier to identify, reproduce, and correct.
The belief that some thread safety is better than none is flawed. The program is either threadsafe, or it is not. Using atomic can make those aspects of your programs more resistant to issues related to concurrency, but that doesn't buy you much. Sure, there will likely be fewer crashes, but the program is still undisputedly incorrect, and it will still blow up in mysterious ways. My advice: If you aren't going to take the time to learn and write correct concurrent programs, just keep them single threaded (if that sounds kind of harsh: it's not meant to be harsh - it will save you from a lot of headaches). Multithreading and concurrency are huge, complicated subjects - it takes a long time to learn to write truly correct, long-lived programs in many domains.
Of course, atomic can be used to achieve threadsafety in some cases -- but making every access atomic guarantees nothing for thread safety. As well, it's highly unusual (statistically) that atomic properties alone will make a class truly threadsafe, particularly as complexity of the class increases; it is more probable that a class with one ivar is truly safe using atomics only, versus a class with 5 ivars. atomic properties are a feature I use very very rarely (again, some pretty large codebases and concurrent programs). It's practically a corner case if atomics are what makes a class truly thread safe.
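The point is language-independent. Here is a small Go analogue (hypothetical names, just a sketch): both fields below are individually atomic, the moral equivalent of declaring every property atomic, yet the invariant that x equals y still breaks.

    package main

    import (
        "fmt"
        "sync/atomic"
    )

    // point is meant to keep the invariant "x always equals y". Each field is
    // updated atomically -- the analogue of making every property atomic.
    type point struct {
        x, y atomic.Int64
    }

    // moveDiagonally does two individually atomic stores. A reader running
    // between them sees x != y: every single access was atomic, but the
    // object as a whole is not thread-safe.
    func (p *point) moveDiagonally(d int64) {
        p.x.Add(d)
        // <-- a concurrent reader here observes a broken invariant
        p.y.Add(d)
    }

    func main() {
        var p point
        done := make(chan struct{})
        go func() { p.moveDiagonally(1); close(done) }()
        fmt.Println(p.x.Load(), p.y.Load()) // may print "0 0", "1 0", or "1 1"
        <-done
    }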
Performance and execution complexity are the primary reasons to avoid them. Compared to nonatomic accesses, and given how frequent and simple variable accesses are, the cost of atomic adds up very fast. That is, atomic accesses introduce a lot of execution complexity relative to the task they perform.
Spin locks are one way atomic properties are implemented. So, would you want a synchronization primitive such as a spin lock or mutex implicitly surrounding every get and set, knowing it does not guarantee thread safety? I certainly don't! Making every property access in your implementations atomic can consume a ton of CPU time. You should use it only when you have an explicit reason to do so (as dasblinkenlicht also mentions). Implementation detail: some accesses do not actually require spin locks to uphold the guarantees of atomic; it depends on several things, such as the architecture and the size of the variable.
So to answer your question "any side-effect?" in a TL;DR format: Performance is the primary reason as you noted, while the applicability of what atomic guarantees and how that is useful for you is very narrow at your level of abstraction (oft misunderstood), and it masks real bugs.
You should not pay for what you do not use. Unlike plugged-in computers, where CPU cycles cost you only time, CPU cycles on a mobile device cost you both time and battery. If your application is single-threaded, there is no reason to use atomic, because the locking and unlocking operations would be a waste of time and battery. The battery matters more than the time: while the latency added by the extra operations may be invisible to your end user, the cycles spent will reduce how long the device runs on a single charge, a measure a lot of users consider very important.
The btr and bts instructions are simple, and they can lock a shared resource.
Why does the instruction cmpxchg exist? What's the difference between these instructions?
IIRC (it's been a while) lock btr is more expensive than cmpxchg, which was designed to automatically lock the bus for atomicity and to do so as quickly as possible. (Specifically, lock INSTR holds the bus lock for the entire instruction cycle, and does full invalidation, but the microcode for cmpxchg locks and invalidates only when absolutely needed so as to be the fastest possible synchronization primitive.)
(Edit: it also enables fancier (user-)lock-free strategies, per this message.)
CMPXCHG [memaddr], reg compares a memory location to EAX (or AX, or AL); if they are the same, it writes the source operand to the memory location. This can obviously be used in the same way as XCHG, but it can be used in another very interesting way as well, for lock-free synchronization.

Suppose you have a process that updates a shared data structure. To ensure atomicity, it generates a private updated copy of the data structure; when it is finished, it atomically updates the single pointer which used to point to the old data structure so that it now points to the new data structure.

The straightforward way of doing this will be useful if there's some possibility of the process failing, and it gives you atomicity. But we can modify this procedure only a little bit to allow multiple simultaneous updates while ensuring correctness.

The process simply atomically compares the pointer to the value it had when it started its work, and, if it is unchanged, makes the pointer point to the new data structure. If some other process has updated the data structure in the meantime, the comparison will fail and the exchange will not happen. In this case, the process must start over from the newly-updated data structure.
(This is essentially a primitive form of Software Transactional Memory.)
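The scheme the quote describes maps directly onto a compare-and-swap on a pointer. A hedged Go sketch (the config type is made up for illustration):

    package main

    import (
        "fmt"
        "sync/atomic"
    )

    // config stands in for the shared data structure from the quote.
    type config struct {
        timeoutMs int
    }

    var current atomic.Pointer[config]

    // update builds a private copy and publishes it with one CAS on the
    // pointer; if another writer got in first, it starts over from their
    // version, exactly as described above.
    func update(change func(config) config) {
        for {
            old := current.Load()
            fresh := change(*old) // private updated copy
            if current.CompareAndSwap(old, &fresh) {
                return // pointer swapped; the update became visible atomically
            }
        }
    }

    func main() {
        current.Store(&config{timeoutMs: 100})
        update(func(c config) config { c.timeoutMs *= 2; return c })
        fmt.Println(current.Load().timeoutMs) // 200
    }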
BTR and BTS work at the bit level, whereas CMPXCHG works on a wider data type (generally 32, 64, or 128 bits at once). They also function differently; the Intel developer manuals give a good summary of how they work. It may also help to note that certain processors may have implemented BTR and BTS poorly (since they are not so widely used), making CMPXCHG the better option for high-performance locks.
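As a rough software analogue (a sketch, not the hardware instructions themselves): a BTS-style lock means "atomically set a bit and learn whether it was already set", which can be emulated with the compare-and-swap that CMPXCHG provides.

    package main

    import (
        "fmt"
        "runtime"
        "sync/atomic"
    )

    // spinLock does in software what a lock bts loop does in hardware:
    // atomically set the bit and learn whether it was already set.
    type spinLock struct{ state int32 }

    func (l *spinLock) Lock() {
        for !atomic.CompareAndSwapInt32(&l.state, 0, 1) {
            runtime.Gosched() // spin politely until the bit is clear
        }
    }

    func (l *spinLock) Unlock() {
        atomic.StoreInt32(&l.state, 0)
    }

    func main() {
        var l spinLock
        n := 0
        done := make(chan struct{})
        go func() {
            l.Lock()
            n++
            l.Unlock()
            close(done)
        }()
        l.Lock()
        n++
        l.Unlock()
        <-done
        fmt.Println("n =", n) // always 2: the lock serializes the increments
    }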
Transactional programming is, in this day and age, a staple of modern development. Concurrency and fault tolerance are critical to an application's longevity and, rightly so, transactional logic has become easy to implement. As applications grow, though, transactional code tends to become more and more burdensome for the scalability of the application, and when you bridge into distributed transactions and mirrored data sets the issues become very complicated. I'm curious where the point is, in data size or application complexity, at which transactions frequently start becoming the source of issues (causing timeouts, deadlocks, performance problems in mission-critical code, etc.) that are more bothersome to fix, troubleshoot, or work around than designing a data model that is more fault-tolerant in itself, or using other means to ensure data integrity. Also, what design patterns serve to minimize these impacts or make standard transactional logic obsolete or a non-issue?
--
EDIT: We've got some answers of reasonable quality so far, but I think I'll post an answer myself to bring up some of the things I've heard about to try to inspire some additional creativity; most of the responses I'm getting are pessimistic views of the problem.
Another important note is that not all deadlocks are a result of poorly coded procedures; sometimes mission-critical operations depend on similar resources in different orders, or complex joins in different queries step on each other. This is an issue that can sometimes seem unavoidable, but I've been part of reworking workflows to facilitate an execution order that is less likely to cause one.
I think no design pattern can solve this issue by itself. Good database design, good stored procedure programming, and especially learning how to keep your transactions short will ease most of the problems.
There is no 100% guaranteed method of not having problems though.
In basically every case I've seen in my career though, deadlocks and slowdowns were solved by fixing the stored procedures:
making sure all tables are accessed in the same order prevents deadlocks (see the lock-ordering sketch after this list)
fixing indexes and statistics makes everything faster (hence diminishes the chance of deadlock)
sometimes there was no real need for transactions; it just "looked" like there was
sometimes transactions could be eliminated by turning multi-statement stored procedures into single-statement ones
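The first bullet is the classic lock-ordering discipline, and it applies outside SQL too. A hedged Go sketch, with two mutexes standing in for two tables:

    package main

    import "sync"

    // Two shared "tables". The rule from the list above: every piece of code
    // that needs both must lock them in the same fixed order (accounts before
    // orders, say). Two goroutines taking them in opposite orders is the
    // textbook deadlock.
    var (
        accounts sync.Mutex
        orders   sync.Mutex
    )

    // transfer follows the global order: accounts first, then orders.
    func transfer() {
        accounts.Lock()
        defer accounts.Unlock()
        orders.Lock()
        defer orders.Unlock()
        // ... touch both tables ...
    }

    // report must follow the SAME order, even though it conceptually "reads
    // orders first"; taking orders.Lock() before accounts.Lock() here would
    // reintroduce the deadlock.
    func report() {
        accounts.Lock()
        defer accounts.Unlock()
        orders.Lock()
        defer orders.Unlock()
        // ... read both tables ...
    }

    func main() {
        var wg sync.WaitGroup
        wg.Add(2)
        go func() { defer wg.Done(); transfer() }()
        go func() { defer wg.Done(); report() }()
        wg.Wait()
    }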
The use of shared resources is wrong in the long run, because by reusing an existing environment you create more and more possible states; just look at the busy beavers :) The way Erlang goes is the right way to produce fault-tolerant and easily verifiable systems.
But transactional memory is essential for many applications in widespread use. Consider a bank with its millions of customers, for example: you can't just copy the data for the sake of efficiency.
I think monads are a cool way to handle the difficult problem of changing state.
One approach I've heard of is to use a versioned, insert-only model where no updates ever occur. During selects, the version is used to pick only the latest rows. One downside I know of with this approach is that the database can get rather large very quickly.
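A toy in-memory stand-in for such a table (names invented, just to pin down the read/write rules of the insert-only versioned model):

    package main

    import "fmt"

    // row is never updated in place: every change is a new insert with a
    // higher version, as in the insert-only model described above.
    type row struct {
        id      int
        version int
        value   string
    }

    var table []row // grows forever: the size downside mentioned above

    func insert(id int, value string) {
        v := 0
        for _, r := range table {
            if r.id == id && r.version > v {
                v = r.version
            }
        }
        table = append(table, row{id: id, version: v + 1, value: value})
    }

    // selectLatest is the read side: pick only the highest version per id.
    func selectLatest(id int) (string, bool) {
        best, val := -1, ""
        for _, r := range table {
            if r.id == id && r.version > best {
                best, val = r.version, r.value
            }
        }
        return val, best >= 0
    }

    func main() {
        insert(1, "draft")
        insert(1, "final")
        v, _ := selectLatest(1)
        fmt.Println(v) // "final": older versions remain but are never read
    }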
I also know that some solutions, such as FogBugz, don't use enforced foreign keys, which I believe would also help mitigate some of these problems, because the SQL query plan can lock linked tables during selects or updates even if no data in them is changing. If a highly contended table gets locked this way, it can increase the chance of a deadlock or timeout.
I don't know much about these approaches though since I've never used them, so I assume there are pros and cons to each that I'm not aware of, as well as some other techniques I've never heard about.
I've also been looking into some of the material from Carlo Pescio's recent post, which I've unfortunately not had enough time to do justice, but the material seems very interesting.
If you are talking 'cloud computing' here, the answer would be to localize each transaction to the place where it happens in the cloud.
There is no need for the entire cloud to be consistent, as that would kill performance (as you noted). Simply keep track of what changed and where, and handle multiple small transactions as the changes propagate through the system.
The situation where user A updates record R and user B at the other end of the cloud does not see it (yet) is the same as the one where user A has not yet made the change in today's strict-transactional environment. This could lead to discrepancies in an update-heavy system, so systems should be architected to rely on updates as little as possible, moving toward aggregating data and pulling out the aggregates only when an exact figure is critical (i.e. moving the consistency requirement from write time to critical-read time).
Well, just my POV. It's hard to conceive of a system that is application-agnostic in this case.
Try to make changes at the database level using the smallest possible number of instructions.

The general rule is to lock a resource for the shortest possible time. Using T-SQL, PL/SQL, Java on Oracle, or any similar approach, you can reduce the time each transaction holds a lock on a shared resource. In fact, transactions in the database are optimized with row-level locks, multi-versioning, and other kinds of intelligent techniques. If you can perform the transaction in the database, you also save the network latency, apart from other layers like ODBC/JDBC/OLE DB.

Sometimes programmers try to obtain the good things of a database (it is transactional, parallel, distributed, ...) but keep a cache of the data. Then they need to manually add back some of the database's features.