Difference between cmpxchg and btr/bts - locking

The btr/bts instructions are simple and, with a lock prefix, can protect a shared resource.
Why does the cmpxchg instruction exist? What's the difference between these instructions?

IIRC (it's been a while) lock btr is more expensive than cmpxchg, which was designed to automatically lock the bus for atomicity and to do so as quickly as possible. (Specifically, lock INSTR holds the bus lock for the entire instruction cycle, and does full invalidation, but the microcode for cmpxchg locks and invalidates only when absolutely needed so as to be the fastest possible synchronization primitive.)
(Edit: it also enables fancier (user-)lock-free strategies, per this message.)
CMPXCHG [memaddr], reg compares a memory location to EAX (or AX, or
AL); if they are the same, it writes the source operand to the memory
location. This can obviously be used in the same way as XCHG, but it
can be used in another very interesting way as well, for lock-free
synchronization.
Suppose you have a process that updates a shared data structure. To
ensure atomicity, it generates a private updated copy of the data
structure; when it is finished, it atomically updates a single pointer
which used to point to the old data structure so that it now points to
the new data structure.
The straightforward way of doing this will be useful if there's some
possibility of the process failing, and it gives you atomicity. But we
can modify this procedure only a little bit to allow multiple
simultaneous updates while ensuring correctness.
The process simply atomically compares the pointer to the value it had
when it started its work, and if it is unchanged, makes the pointer point to the new
data structure. If some other process has updated the data structure
in the mean time, the comparison will fail and the exchange will not
happen. In this case, the process must start over from the
newly-updated data structure.
(This is essentially a primitive form of Software Transactional Memory.)
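A minimal sketch of that pattern using C11 atomics (the structure and names are invented for illustration; on x86 the compare-and-swap typically compiles to lock cmpxchg):

    #include <stdatomic.h>
    #include <stdlib.h>

    typedef struct data { int value; /* ... */ } data_t;

    /* Pointer to the current version of the shared structure;
     * assumed to be initialized to a valid object elsewhere. */
    static _Atomic(data_t *) current;

    void update(int new_value) {
        data_t *old, *mine = malloc(sizeof *mine);   /* error handling omitted */
        do {
            old = atomic_load(&current);
            *mine = *old;              /* private copy of the old version */
            mine->value = new_value;   /* apply the update to the copy    */
            /* Publish only if nobody replaced the pointer in the meantime;
             * if the CAS fails, reload and redo the work. */
        } while (!atomic_compare_exchange_weak(&current, &old, mine));
        /* Reclaiming 'old' safely needs a scheme such as RCU or hazard
         * pointers, which this sketch deliberately omits. */
    }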

BTR and BTS work at the bit level, whereas CMPXCHG works on a wider data type (generally 32, 64 or 128 bits at once). They also function differently; the Intel developer manuals give a good summary of how they work. It may also help to note that certain processors may have implemented BTR and BTS poorly (due to them not being so widely utilised), making CMPXCHG the better option for high-performance locks.
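For comparison, here is what a bit-granular lock looks like when written against C11 atomics (illustrative only; whether the compiler emits lock bts/btr, or lock or/and, for these operations depends on the target and compiler):

    #include <stdatomic.h>

    /* A spinlock stored in a single bit of a word, the kind of thing
     * lock bts / lock btr implement directly at the instruction level. */
    typedef struct { atomic_uint bits; } bitlock_t;

    static void bitlock_acquire(bitlock_t *l) {
        /* atomic_fetch_or returns the previous value; if bit 0 was
         * already set, someone else holds the lock, so spin. */
        while (atomic_fetch_or_explicit(&l->bits, 1u, memory_order_acquire) & 1u)
            ;
    }

    static void bitlock_release(bitlock_t *l) {
        atomic_fetch_and_explicit(&l->bits, ~1u, memory_order_release);
    }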

Can Elixir / Erlang copy a process, including its memory?

I'm considering solving a problem using Elixir, mainly because of the ability to spawn large numbers of processes cheaply.
In my scenario, I'd want to create several "original" processes, which load specific, immutable data into memory, then make copies of those processes as needed. The copies would all use the same base data, but do different, read-only tasks with it; e.g., imagine that one "original" has the text of "War and Peace" in memory, and each copy of that original does a different kind of analysis on the text.
My questions:
Is it possible to copy an existing process, memory contents and all, in Elixir / the Erlang VM?
If so, does each copy consume as much memory as the original, or can they share memory, as Unix processes do with the "copy on write" strategy? (And in this case, there would be no subsequent writes.)
There is no built-in way to copy processes. The easiest way to do it is to start the "original" process and the "copies" and send all the relevant data in messages to the copies. Processes don't share data, so there is no more efficient way of doing it. Putting the data in ETS tables only partially helps with sharing, as data in ETS tables is copied to the process when it is used; however, you don't need to have all the data in the process heap.
An Erlang process has no process-specific data apart from what's stored in variables (and the process dictionary), so to make a copy of the memory of a process, just spawn a new process passing all relevant variables as arguments to the function.
In general, memory is not shared between processes; everything is copied. The exceptions are ETS tables (though data is copied from ETS tables when processes read it), and binaries larger than 64 bytes. If you store "War and Peace" in a binary, and send it to each worker process (or pass it along when you spawn those worker processes), then the processes would share the memory, only copying it if they wanted to modify the binary. See the chapter on binaries in the Erlang efficiency guide for more details.
You are thinking of Erlang/Elixir processes as similar to Unix processes. They aren't at all, I really wish they had a different name, because they really aren't either threads or processes in the standard Unix sense. It took me some time to wrap my head around the differences.
You have to throw out all your preconceived ideas about processes; they are all wrong. Eprocesses have the following characteristics.
They are cheap and fast. Use lots; there are always more.
They share no resources[1]. (Even writing to stdout is a message to another Eprocess.)
IPC (messages) is very fast, with relatively low overhead compared to standard Unix IPC.
What I would try in your case is to create a server that manages the data and have each analysis worker message the server for the data chunks it needs. It's perfectly acceptable to have an Eprocess be more or less a manager of shared memory.
To me the most useful way to think of Eprocesses is as objects with their own thread of execution.
[1] Well, there is the ETS table, but it's best to think of them as not sharing resources until you absolutely have to.

What is cyclic and acyclic communication?

So I've already searched if there was a question like this posted before, but I wasn't able to find the answer I liked.
I've been working with some PLCs and variable frequency drives lately and thought it was about time I finally found out what cyclic and non-cyclic communication is.
So correct me if I'm wrong, but when I think of cyclic data, I think of data that is continuously being updated and is able to be sent/sampled to other devices. With relation to what I'm doing, I'm thinking that the variable frequency drive is able to update information such as speed and frequency that can be sampled/read from a PLC. This is what I would consider cyclic communication, something that is always updating a certain type of information that can be sent as data.
So I might be completely wrong with this assumption, and that leaves me with the question of what exactly would be considered non-cyclic or acyclic communication.
Any help?
Forenote: This is mostly a programming-based site, and while your question does have an answer within the context of programming, I happen to know that in your industrial application, the importance of cyclic vs acyclic tends to be very hardware/protocol specific, and is really more of a networking problem than a programming one.
Cyclic data is not simply "continuous" data. In industry, it refers to data delivered on a guaranteed (or at least highly predictable) schedule. If the data stream were to violate the schedule, it could have disastrous consequences (a VFD misses its shutdown command by a fraction of a second, and you lose your arm!).
Acyclic data is still reliable for machine control; it is just delivered in a less deterministic way (on the order of milliseconds, sometimes up to several seconds). When accessing a single VFD with a single PLC, you will probably never notice this bursting behavior, and in fact, you may perceive smoother and quicker data transmissions. From the hardware interface perspective, acyclic data transfer does not provide as strong a guarantee about whether or when one machine will respond to the request of another.
Both forms of data transfer deliver data at speeds much faster than humans can deal with, but in certain applications they will each have their own consequences.
Cyclic networks usually must take the form of master/slave, where only one device is allowed to speak at a time, and answers are always returned, even if just to confirm that the message was received. Cyclic networks usually do not allow as many devices on the same wire, and often they will pass larger amounts of data at slower rates.
Acyclic networks might be thought of as a bit more chaotic, but since they skip handshaking formalities, they can often cheat more devices onto the network and get higher speeds all at the same time. This comes at the cost of occasional data collisions/bottlenecks, and sometimes requests for critical data are simply ignored/lost with no indication of failure or success from the target (in that case the sender will likely sit waiting for a message it will not get, which often then triggers process watchdogs that shut down the system).
From a programmer perspective, not much is different between these two transmission types.
What will usually dictate the choice:
how many devices are running on the wire (sometimes this forces the answer right away)
how sensitive/volatile the data they want to share is (how useful are messages if they are a little late)
how much data they might be required to send at any given time (shifting demands on a network that already produces race conditions can be hard to anticipate/avoid if you don't see it coming beforehand).
Hope that helps :)

Basic Queue Optimizations

How would one optimize a queue for the typical:
access / store
memory usage
I'm not sure of any way to reduce memory besides trying to run a compression algorithm on it, but that would cost quite a deal of store time as a tradeoff - one would have to recompress everything, I think.
As such I'm thinking of the typical linked list with pointers... a circular queue?
Any ideas?
Thanks
Edit: regardless of what is above, how does one make the fastest / least memory-intensive basic queue structure?
Linked lists are actually not very typical (except in functional languages or when newbies mistakenly think that a linked list is faster than a dynamic array). A dynamic circular buffer is more typical. The growing (and, optionally, shrinking) works slightly differently than in a dynamic array: if the "data holding part" crosses the end of the array, the data should be copied to the new space in such a way that it remains contiguous (simply extending the array would create a gap in the middle of the data).
As usual, it has some advantages and some drawbacks.
Drawbacks:
slightly more complicated implementation
not suitable for lock-free synchronization
Advantages:
more compact: in the worst case (when it just grew or is just about to shrink but hasn't yet) it has a space overhead of about 100%; a singly linked list almost always has an overhead of 100% or more (unless the data elements are larger than a pointer), and a doubly linked list is even worse.
cache efficient: reading happens close to the previous read, writing happens close to the previous write. So cache misses are rare, and when they do occur, they read data that is mostly relevant (or in the case of writing: they get a cache line that will probably be written to again soon). In a linked list, locality is poor and about half of every fetched cache line is wasted on overhead (pointers to other nodes).
Usually these advantages outweigh the drawbacks.
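For illustration, a minimal sketch of such a growable circular buffer queue in C (the names and the int payload are arbitrary, and error handling is kept minimal):

    #include <stdlib.h>
    #include <string.h>

    typedef struct {
        int    *data;
        size_t  cap;    /* allocated slots                */
        size_t  head;   /* index of the oldest element    */
        size_t  len;    /* number of stored elements      */
    } queue_t;

    static int queue_init(queue_t *q, size_t cap) {
        q->data = malloc(cap * sizeof *q->data);
        if (!q->data) return -1;
        q->cap = cap; q->head = 0; q->len = 0;
        return 0;
    }

    static int queue_grow(queue_t *q) {
        size_t newcap = q->cap * 2;
        int *p = malloc(newcap * sizeof *p);
        if (!p) return -1;
        /* Copy so the data stays contiguous in the new buffer:
         * first the part from head to the end of the old array,
         * then the part that wrapped around to the start. */
        size_t first = q->cap - q->head;
        if (first > q->len) first = q->len;
        memcpy(p, q->data + q->head, first * sizeof *p);
        memcpy(p + first, q->data, (q->len - first) * sizeof *p);
        free(q->data);
        q->data = p; q->cap = newcap; q->head = 0;
        return 0;
    }

    static int enqueue(queue_t *q, int v) {
        if (q->len == q->cap && queue_grow(q) != 0) return -1;
        q->data[(q->head + q->len) % q->cap] = v;
        q->len++;
        return 0;
    }

    static int dequeue(queue_t *q, int *out) {
        if (q->len == 0) return -1;
        *out = q->data[q->head];
        q->head = (q->head + 1) % q->cap;
        q->len--;
        return 0;
    }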

File IO for MPI-FORTRAN

I have a FORTRAN MPI code to solve a flow field.
At the start I want to read data from file and distribute it to the participating processes.
The data consists of several 3-D arrays (velocities in space x, y, z).
Every process stores only a part of the array.
So if every process is going to read the file (the easiest way, I think), it is not going to work, as each process will only store the first part of the file, corresponding to the number of arrays that the process can hold.
Can MPI_Bcast work for 3-D arrays? But then things become complex.
Or is there an easier way?
You have, broadly speaking, 2 or 3 choices, depending on your platform.
One process reads the input data and sends (parts of) it to the other processes. I wouldn't usually use broadcast for this since it is a collective operation and all processes have to take part. I'd usually just send the necessary information to each process. If it is convenient (and not a memory issue) you could certainly broadcast all the input data to all the processes; it's just not a pattern of operation that I use or see much.
All processes read the data that they require. This may involve a process reading an entire input file and only storing those parts it requires. But if you have very large input files you can write routines to read only the necessary part into each process's memory space. This approach may involve processes competing for disk access, which is only slow in a relative sense: if you are running large-scale and long-running parallel computations, waiting a few seconds while all the processes get their data is not much of an overhead.
If you have a parallel file system then you can use MPI's parallel I/O routines so that each process reads only those parts of the input data that it requires.
The canonical way to handle such an I/O pattern in MPI is either to
Read the data on rank 0, then use MPI_Scatter to distribute it (see the sketch after this list). Or, if memory is tight, do this blockwise, or use 1-to-1 communication rather than MPI_Scatter.
Use MPI-I/O, and have each rank read its own subset of the data file (to be useful, this of course requires a file format where you can figure out the boundaries without first reading through the entire file).
For extreme scalability, one can combine the two approaches: a subset of processes (say, sqrt(N) as a rough rule of thumb) use MPI I/O, and each MPI process sends data to its own I/O process.
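A minimal sketch of the first option, using the MPI C API (the Fortran bindings are analogous); the file name, payload layout and chunk size are invented for illustration, and a real flow-field file would need its actual record structure:

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        const int local_n = 1000;               /* points per process (assumed) */
        double *full  = NULL;
        double *local = malloc(local_n * sizeof *local);

        if (rank == 0) {
            /* Rank 0 reads the whole (hypothetical) velocity file. */
            full = malloc((size_t)local_n * nprocs * sizeof *full);
            FILE *f = fopen("velocity_u.bin", "rb");
            fread(full, sizeof *full, (size_t)local_n * nprocs, f);
            fclose(f);
        }

        /* Each rank receives its own contiguous block of the global array. */
        MPI_Scatter(full, local_n, MPI_DOUBLE,
                    local, local_n, MPI_DOUBLE,
                    0, MPI_COMM_WORLD);

        free(full);
        free(local);
        MPI_Finalize();
        return 0;
    }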
If you are running your code on fewer than 1000 cores with a good file system (e.g. Lustre) then just use Fortran I/O where each rank opens the file and reads the data it needs (skipping the rest). Yes, it takes a few minutes, but you're only reading the file once during startup.
MPI I/O (binary only) is non-trivial, and usually you are better off using higher-level libraries such as HDF5 or Parallel NetCDF. Performance will depend on how the data is read (contiguous vs non-contiguous and so on). The following links may be helpful ...
http://www.osc.edu/supercomputing/training/pario/parallel-io-nov04.pdf
https://support.scinet.utoronto.ca/wiki/images/0/01/Parallel_io_course.pdf

Are we asking too much of transactional memory?

I've been reading up a lot about transactional memory lately. There is a bit of hype around TM, so a lot of people are enthusiastic about it, and it does provide solutions for painful problems with locking, but you regularly also see complaints:
You can't do I/O
You have to write your atomic sections so they can run several times (be careful with your local variables!)
Software transactional memory offers poor performance
[Insert your pet peeve here]
I understand these concerns: more often than not, you find articles about STMs that only run on some particular hardware that supports some really nifty atomic operation (like LL/SC), or it has to be supported by some imaginary compiler, or it requires that all accesses to memory be transactional, it introduces type constraints monad-style, etc. And above all: these are real problems.
This has led me to ask myself: what speaks against local use of transactional memory as a replacement for locks? Would this already bring enough value, or must transactional memory be used all over the place if used at all?
Yes, some of the problems you mention can be real ones now, but things evolve.
As with any new technology, first there is hype, then the technology shows that there are some unresolved problems, and then some of these problems are solved and others not. The result is another option for solving your problems, one that is better adapted to some of them than others.
I would say that you can use STM for the parts of your application that can live with the constraints of the current state of the art; parts that don't mind a loss of efficiency, for example.
Communication between transactional and non-transactional parts is the big problem. There are STMs that are lock-aware, so they can interact in a consistent way with non-transactional parts.
I/O is also possible, but your transaction becomes irrevocable, that is, it cannot be aborted. That means only one transaction can perform I/O at a time. You can also do the I/O once the top-level transaction has succeeded, in the non-transactional world, as you would now.
Most library-based STM systems force the user to distinguish between transactional and non-transactional data, so yes, you need to understand exactly what that means. On the other hand, compilers can deduce which accesses must be transactional, the problem being that they can be too conservative, losing the efficiency we can get by managing the different kinds of variables explicitly. This is the same as having static, local and dynamic variables: you need to know the constraints of each to write a correct program.
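As a small illustration of the compiler-managed approach, here is a sketch using GCC's experimental transactional-memory support (-fgnu-tm); the variable and function names are made up, and how well this performs depends entirely on the libitm implementation:

    /* compile with: gcc -fgnu-tm -c account.c */
    static long balance;            /* shared data, invented for illustration */

    void deposit(long amount) {
        /* The compiler instruments every memory access inside the block;
         * the transaction is retried automatically if it conflicts. */
        __transaction_atomic {
            balance += amount;
        }
    }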
I've been reading up a lot about transactional memory lately.
You might also be interested in this podcast on software transactional memory, which also introduces STM using an analogy based on garbage collection:
The paper is about an analogy between garbage collection and transactional memory.
In addition to seeing the beauty of the analogy, the discussion also serves as a good
introduction to transactional memory (which was mentioned in the Goetz/Holmes episode)
and - to some extent - to garbage collection.
If you use transactional memory as a replacement for locks, all the code that used to execute with the lock held can be rolled back and re-executed before it completes. Thus the code that was previously using locks must be transactional, and will have all the same drawbacks (and benefits).
So, you could possibly restrict the influence of TM to only those parts of the code that hold locks, right? Every piece of code that can be called while a lock is held must support TM, in that scenario. How much of your program does not hold locks and is never called by code that holds locks?