Provably correct marshalling? - serialization

Do there exist serialization (marshalling) schemes for data structures that can be formally proven to be correct?
I am agnostic to the particular programming language; it could be OCaml/Haskell, C++, Java or something else, as long as the data to be serialized can be assumed to be properly typed.
Perhaps as a way to reformulate/clarify my question: what I am interested in is whether there exist known standard encoding schemes for writing data structures to disk which can be proved to have 100% fidelity, in the sense that the deserialized data is exactly the same as the original.
As a simplifying assumption, there is no complication of pointers/references. The input is “pure data”, for lack of a better way to say it.

It's a slightly vague question, but I'll have a go.
Heterogeneous Environments
Serialisation's job is to take data in the memory of one computer program, convert it to some sort of standardised representation, and convert that back to data in the memory of, quite possibly, another computer program on a completely different sort of computer. That does open up some interesting possibilities.
For example, the representation of a floating point value on many computers is IEEE754. But it's not wholly universal; historically companies like Cray and IBM used alternate formats, and so there exists the possibility that a value when deserialised on those machines might not be exactly the same value as was serialised in the first place. Generally no one cares, because the differences are numerically very small.
This shows up in some serialisation technologies; ASN.1's own wire formats for floats are either a text representation or its own binary format that is not IEEE754. The idea behind the text representation is that it can convey any floating point value, with no constraints. In contrast, a binary format often has limits on precision, maximum value, etc.
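To see what "fidelity" means even on a single IEEE754 machine, here is a minimal C++ sketch (standard library only, my own illustration rather than anything from a real serialisation library) that checks whether a double survives a round trip through a decimal text encoding; emitting max_digits10 significant digits is what makes the round trip exact:

#include <cassert>
#include <iomanip>
#include <limits>
#include <sstream>

// Round-trip a double through a decimal text representation with a given
// number of significant digits, and report whether the value survived.
bool roundtrips(double v, int digits) {
    std::ostringstream out;
    out << std::setprecision(digits) << v;
    std::istringstream in(out.str());
    double back = 0.0;
    in >> back;
    return back == v;
}

int main() {
    const double v = 0.1 + 0.2;  // not exactly representable in binary
    // max_digits10 (17 for an IEEE754 double) guarantees an exact round trip.
    assert(roundtrips(v, std::numeric_limits<double>::max_digits10));
    // With fewer digits (say 6) the reconstructed value may differ slightly.
}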
Text is another potential problem area; serialised unicode strings sent to another computer that doesn't support unicode will likely result in the deserialised string being different to the original.
Similarly with platforms that don't support 64-bit integers, etc. Java is very annoying - historically it had no unsigned integers, so handling 64-bit unsigned values received from, say, a C++ program is a nuisance.
Conclusion - It's a Logical Impossibility
So in some senses, for heterogeneous environments, there can be no serialisation technologies formally proven to reproduce identical values, because the destination machine is of a different architecture, and its representation may well be different, or limited in some way.
Homogeneous Environments
Serialisation used to convey data from a computer program on one computer to exactly the same program on an identical computer (i.e. a homogeneous environment) ought to produce exactly the same values on deserialisation. AFAIK there are no formally proven serialisation technologies. If there's serialisation built into the Ada language (I don't know), the Greenhills Ada compiler is formally proven. Boost for C++ is heavily peer reviewed and has a serialisation library, so that comes close, especially if used on top of Greenhills' formally proven C++ compiler. Some of the commercial ASN.1 tools / libraries are very mature and highly trusted.
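Peer review and testing are not formal proof, of course, but the round-trip property itself is easy to spot-check. A minimal sketch using Boost.Serialization (assuming the Boost headers and library are available; the Point type is just an example):

#include <boost/archive/text_iarchive.hpp>
#include <boost/archive/text_oarchive.hpp>
#include <cassert>
#include <sstream>

// An example type with an intrusive serialize() member, the usual
// Boost.Serialization pattern.
struct Point {
    int x = 0, y = 0;
    template <class Archive>
    void serialize(Archive& ar, const unsigned int /*version*/) {
        ar & x;
        ar & y;
    }
};

int main() {
    Point original{3, 7};
    std::stringstream ss;
    {
        boost::archive::text_oarchive oa(ss);  // serialise to text
        oa << original;
    }
    Point restored;
    {
        boost::archive::text_iarchive ia(ss);  // deserialise
        ia >> restored;
    }
    assert(restored.x == original.x && restored.y == original.y);
}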
What is it that is Formally Proven?
In that last para I have touched on a difficulty with your question; formal proof is perhaps only of value if the entire software development stack (libraries, compiler, CPU) and your application source code are themselves formally proven. Otherwise you could have perfect source code for a serialisation library being compiled by a junk compiler, linked against rubbish libraries, running on a shonky CPU; it's not going to work.
So, when one is talking about "formally proven" one is generally talking about the whole system, not just an individual component. A component part that is, by itself, formally proven to meet its specification is a good aid to achieving a proven system, but it does not magically confer "correctness" on the whole system all on its own. Every other component needs to meet its specification too.
And what we've seen historically is that, quite often, CPUs don't really do what their data sheet says they do. Some will take shortcuts in floating point arithmetic in the interests of completing instructions in a single cycle in preference to achieving a numerically perfect result.
Sorry for the rambling answer, but I hope that's of interest and help.

Related

What are motivations behind compiling to byte-code?

I'm working on my own toy programming language. For now I'm interpreting the source language from the AST, and I'm wondering what advantages compiling to byte-code and then interpreting it could provide.
For now I have three things in mind:
Traversing the syntax tree hundreds of times may be slower than running instructions in an array, especially if the array supports O(1) random access (i.e. jumping 10 instructions up or down).
In a typed execution environment, I have some run-time costs because my AST is typed, and I'm constantly traversing it (i.e. I have 10 types of nodes and I need to check which type I'm on in order to execute). Maybe compiling to an untyped byte-code could help improve this, since after type-checking and compiling I would have untyped values and code.
Compiling to byte-code may provide better portability.
Are my points correct? What are some other motivations behind compiling to bytecode?
Speed is the main reason; interpreting ASTs is just too slow in practice.
Another reason to use bytecode is that it can be trivially serialized (stored on disk), so that you can distribute it. This is what Java does.
The point of generating byte code (or any other "easily interpreted" form such as threaded code) is essentially performance.
For an AST interpreter to decide what to do next, it needs to traverse the tree, inspect nodes, determine the type of nodes, check the type of any operands, verify legality, and decide which special case of the AST-designated operator applies (it says "+", but does it mean 16-bit add or string concatenation?), before it finally performs some action.
If one takes the final action and generates some kind of easily interpreted structure, then at "execution" time the interpreter can focus simply on performing actions without all that checking/special-case determination.
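To make that concrete, here is a deliberately tiny C++ sketch (the opcodes and layout are invented for illustration, not any real VM's): once the front end has resolved types and operations, the interpreter's inner loop is just a switch over pre-decoded instructions, with no tree walking and no per-node type checks.

#include <cstdint>
#include <iostream>
#include <vector>

// Toy bytecode: each instruction is an opcode plus one immediate operand.
// All type checking has already happened at "compile" time, so the loop
// below only performs actions.
enum class Op : std::uint8_t { Push, Add, Mul, Print, Halt };

struct Instr { Op op; int imm; };

void run(const std::vector<Instr>& code) {
    std::vector<int> stack;
    for (std::size_t pc = 0; pc < code.size(); ++pc) {
        const Instr& in = code[pc];
        switch (in.op) {
            case Op::Push: stack.push_back(in.imm); break;
            case Op::Add:  { int r = stack.back(); stack.pop_back(); stack.back() += r; break; }
            case Op::Mul:  { int r = stack.back(); stack.pop_back(); stack.back() *= r; break; }
            case Op::Print: std::cout << stack.back() << '\n'; break;
            case Op::Halt: return;
        }
    }
}

int main() {
    // (2 + 3) * 4
    run({{Op::Push, 2}, {Op::Push, 3}, {Op::Add, 0},
         {Op::Push, 4}, {Op::Mul, 0}, {Op::Print, 0}, {Op::Halt, 0}});
}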
Another recent excuse is that if you generate byte code for any of a number of well-known virtual machines (JVM, MSIL, Parrot, etc.) you don't even have to code the interpreter. For the JVM and MSIL, you also get the benefit of the JIT compilers associated with them, and with careful design of your language, compatibility with huge libraries, which are the real attraction of Java and C#.

Matching a virtual machine design with its primary programming language

As background for a side project, I've been reading about different virtual machine designs, with the JVM of course getting the most press. I've also looked at BEAM (Erlang), GHC's RTS (kind of but not quite a VM) and some of the JavaScript implementations. Python also has a bytecode interpreter that I know exists, but have not read much about.
What I have not found is a good explanation of why particular virtual machine design choices are made for a particular language. I'm particularly interested in design choices that would fit with concurrent and/or very dynamic (Ruby, JavaScript, Lisp) languages.
Edit: In response to a comment asking for specificity, here is an example. The JVM uses a stack machine rather than a register machine, which was very controversial when Java was first introduced. It turned out that the engineers who designed the JVM had done so intending platform portability, and converting a stack machine back into a register machine was easier and more efficient than overcoming an impedance mismatch where there were too many or too few virtual registers.
Here's another example: for Haskell, the paper to look at is Implementing lazy functional languages on stock hardware: the Spineless Tagless G-machine. This is very different from any other type of VM I know about. And in point of fact GHC (the premier implementation of Haskell) does not run it live, but uses it as an intermediate step in compilation. Peyton-Jones lists no fewer than 8 other virtual machines that didn't work. I would like to understand why some VMs succeed where others fail.
I'll answer your question from a different tack: what is a VM? A VM is just a specification for an "interpreter" of a lower-level language than the source language. Here I'm using the black-box meaning of the word "interpreter". I don't care how a VM gets implemented (as a bytecode interpreter, a JIT compiler, whatever). When phrased that way, from a design point of view the VM isn't the interesting thing; the low-level language is.
The ideal VM language will do two things. One, it will make it easy to compile the source language into it. And two, it will also make it easy to interpret on the target platform(s) (where again the interpreter could be implemented very naively or could be some really sophisticated JIT like Hotspot or V8).
Obviously there's a tension between those two desirable properties, but they do more or less form two end points on a line through the design space of all possible VMs. (Or, perhaps some more complicated shape than a line because this isn't a flat Euclidean space, but you get the idea). If you build your VM language far outside of that line then it won't be very useful. That's what constrains VM design: putting it somewhere into that ideal line.
That line is also why high level VMs tend to be very language specific while low level VMs are more language agnostic but don't provide many services. A high level VM is by its nature close to the source language which makes it far from other, different source languages. A low level VM is by its nature close to the target platform thus close to the platform end of the ideal lines for many languages but that low level VM will also be pretty far from the "easy to compile to" end of the ideal line of most source languages.
Now, more broadly, conceptually any compiler can be seen as a series of transformations from the source language to intermediate forms that themselves can be seen as languages for VMs. VMs for the intermediate languages may never be built, but they could be. A compiler eventually emits the final form. And that final form will itself be a language for a VM. We might call that VM "JVM", "V8"...or we might call that VM "x86", "ARM", etc.
Hope that helps.
One of the techniques of deriving a VM is to just go down the compilation chain, transforming your source language into more and more low level intermediate languages. Once you spot a low level enough language suitable for a flat representation (i.e., the one which can be serialised into a sequence of "instructions"), this is pretty much your VM. And your VM interpreter or JIT compiler would just continue your transformations chain from the point you selected for a serialisation.
Some serialisation techniques are very common - e.g., using a pseudo-stack representation for expression trees (like in .NET CLR, which is not a "real" stack machine at all). Otherwise you may want to use an SSA-form for serialisation, as in LLVM, or simply a 3-address VM with an infinite number of registers (as in Dalvik). It does not really matter which way you take, since it is only a serialisation and it would be de-serialised later to carry on with your normal way of compilation.
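As a toy illustration of those two flat styles (the encodings here are invented for this sketch, not any real VM's), here is the same statement, x = a*b + c, rendered both ways in C++:

#include <cassert>
#include <vector>

// 1) Pseudo-stack form (CLR/JVM style): operands flow through an
//    implicit evaluation stack.
int eval_stack_form(int a, int b, int c) {
    std::vector<int> st;
    st.push_back(a);                                        // push a
    st.push_back(b);                                        // push b
    { int r = st.back(); st.pop_back(); st.back() *= r; }   // mul
    st.push_back(c);                                        // push c
    { int r = st.back(); st.pop_back(); st.back() += r; }   // add
    return st.back();                                       // store x
}

// 2) Three-address form with "infinite" virtual registers (LLVM/Dalvik style):
int eval_three_address_form(int a, int b, int c) {
    int t1 = a * b;
    int t2 = t1 + c;
    return t2;
}

int main() {
    assert(eval_stack_form(2, 3, 4) == eval_three_address_form(2, 3, 4));
}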
It is a bit of a different story if you intend to interpret your VM code immediately instead of compiling it. There is currently no consensus on what kind of VMs are better suited for interpretation. Both stack-based (or, I'd dare to say, Forth-based) VMs and register-based ones have proven to be efficient.
I found this book to be helpful. It discusses many of the points you are asking about. (Note: I'm not in any way affiliated with Amazon, nor am I promoting Amazon; it was just the easiest place to link from.)
http://www.amazon.com/dp/1852339691/

Choosing CPU architecture for LLVM/CLANG

I am designing a TTL serial computer, and I am struggling to choose an architecture more suitable for an LLVM compiler backend (I want to be able to run any C++ software there). There will be no MMU, no multiplication/division, no hardware stack, no interrupts.
I have 2 main options:
1) 8-bit memory, 8-bit ALU, 8-bit registers (~12-16). Memory address width is 24 bits. So I will need to use 3 registers as the IP and 3 registers for any memory location.
Needless to say, any address calculations would be pure pain to implement in the compiler.
2) 24-bit memory, 24-bit ALU, 24-bit registers (~6-8). Flat memory, nice. The drawbacks are that, due to the serial nature of the design, each operation would take 3 times more clocks, even if we are operating on booleans; 24-bit memory data width is expensive; and it's harder to implement in hardware in general.
The question is: do you think implementing all C++ features on this 8-bit, stackless hardware is possible, or do I need more complex hardware to get generated code of reasonable quality & speed?
I second the suggestion to use LCC. I used it in this homebrew 16-bit RISC project: http://fpgacpu.org/xsoc/cc.html .
I don't think it should make much difference whether you build the 8-bit variant and use 3 add-with-carries to increment IP, or the 24-bit variant and do the whole thing in hardware. You can hide the difference in your assembler.
If you look at my article above, or an even simpler CPU here: http://fpgacpu.org/papers/soc-gr0040-paper.pdf you will see you really don't need that many operators / instructions to cover the integer C repertoire. In fact there is an lcc utility (ops) to print the min operator set for a given machine.
For more information see my article on porting lcc to a new machine here: http://www.fpgacpu.org/usenet/lcc.html
Once I had ported lcc, I wrote an assembler, and it synthesized a larger repertoire of instructions from the basic ones. For example, my machine had load-byte-unsigned but not load-byte-signed, so I emitted this sequence:
lbs rd,imm(rs) ->
lbu rd,imm(rs)   // load the byte zero-extended
lea r1,0x80
xor rd,r1        // flip bit 7 ...
sub rd,r1        // ... then subtract 0x80, which sign-extends the byte
So I think you can get by with this min cover of operations:
registers
load register with constant
load rd = *rs
store *rs1 = rs2
+ - (w/ w/o carry) // actually can do + with - and ^
>> 1 // << 1 is just +
& ^ // (synthesize ~ from ^, | from & and ^; see the sketch after this list)
jump-and-link rd,rs // rd = pc, pc = rs
skip-z/nz/n/nn rs // skip next insn on rs==0, !=0, <0, >=0
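A quick sketch of the "synthesize the rest in software" remark from the list above, checked over a few values (plain C++, nothing machine-specific):

#include <cassert>
#include <cstdint>

// With only AND and XOR in hardware, NOT and OR can be synthesized:
std::uint32_t not_(std::uint32_t a) { return a ^ 0xFFFFFFFFu; }                    // ~a == a ^ all-ones
std::uint32_t or_(std::uint32_t a, std::uint32_t b) { return (a & b) ^ (a ^ b); }  // a|b from & and ^

int main() {
    for (std::uint32_t a : {0u, 1u, 0xF0F0F0F0u, 0xFFFFFFFFu})
        for (std::uint32_t b : {0u, 7u, 0x0FF00FF0u, 0xFFFFFFFFu}) {
            assert(not_(a) == ~a);
            assert(or_(a, b) == (a | b));
        }
}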
Even simpler is to have no registers (or equivalently blur registers with memory -- all registers have a memory address).
Set aside a register for SP, and write the function prolog/epilog handler in the compiler and you won't have to worry about stack instructions. There's just code to store each of the callee save registers, adjust the SP by the frame size, and so forth.
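Purely as an illustration of what such a compiler-generated prolog/epilog might look like, here is a C++ sketch of an emitter; the register names (r7 standing in for SP), mnemonics and word-addressed offsets are all invented for this example.

#include <cstdio>

// Emit a software prolog/epilog when there is no hardware stack.
void emit_prolog(int frame_words, const int* callee_saved, int n) {
    std::printf("  sub  r7,%d\n", frame_words);               // grow the software stack
    for (int i = 0; i < n; ++i)                                // spill callee-saved registers
        std::printf("  store %d(r7),r%d\n", i, callee_saved[i]);
}

void emit_epilog(int frame_words, const int* callee_saved, int n) {
    for (int i = 0; i < n; ++i)                                // restore callee-saved registers
        std::printf("  load  r%d,%d(r7)\n", callee_saved[i], i);
    std::printf("  add  r7,%d\n", frame_words);                // pop the frame
}

int main() {
    const int saved[] = {4, 5, 6};
    emit_prolog(8, saved, 3);
    std::printf("  ...function body...\n");
    emit_epilog(8, saved, 3);
}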
Interrupts (and return from interrupts) are straightforward. All you need to do is force a jump-and-link instruction into the instruction register. If you choose the bit pattern for that to be something like 0, and put the right address into the source register rs (especially if it is r0), it can be done with a flip-flop reset input or an extra force-to-0 AND gate. I use a similar trick in the second paper above.
Interesting project. I see a TTL / 7400 contest is underway, and I was wondering myself how simple a machine you could get away with, and whether it would be cheating to add a 32 KB or 128 KB async SRAM to the machine to hold the code and data.
Anyway, happy hacking!
p.s.
1) You will want to decide how large each integral type is. You can certainly make char, short, int, long, long long, etc. the same size, one 24b word, if you wish, although it won't be compliant with the minimum representation ranges.
2) And although I focused on lcc here, you were asking about C++. I recommend pursuing C first. Once you have things figured out for C, including the *, /, % operators in software, etc., it should be more tractable to move to full-blown C++, whether in LLVM or GCC. The difference between C and C++ is "only" the extra vtables and RTTI tables and code sequences (entirely built up out of the primitive C integer operator repertoire) required to handle virtual function calls, pointer-to-member dereference, dynamic casts, static constructors, exception handling, etc.
IMHO, it is possible for a C compiler. I am not sure about C++, though.
LLVM/Clang could be a hard choice for an 8-bit computer.
Instead, first try lcc, then llvm/etc. HTH.
Bill Buzbee succeeded in retargeting the lcc compiler for his Magic-1 (known as homebrewcpu).
Although the hardware design and construction of Magic-1 usually gets the most attention, the largest part of the project (by far) has been developing/porting the software. To this end, I've had to write an assembler and linker from scratch, retarget a C compiler, write and port the standard C libraries, write a simplified operating system and then port a more sophisticated one. It's been a challenge, but a fun one. I suppose I'm somewhat twisted, but I happen to enjoy debugging difficult problems. And, when the bug you're trying to track down could involve one or more of: hardware design flaw, loose or broken wire, loose or bad TTL chip, assembler bug, linker bug, compiler bug, C runtime library bug, or finally a bug in the program in question, there's a lot of opportunity for fun. Oh, and I also don't have the luxury of blaming the bugs on anyone else.
I'm continually amazed that the damn thing runs at all, much less runs as well as it does.
In my opinion, stackless hardware is already poorly suited for C and C++ code. If you have nested function calls, you will need to emulate a stack in software anyway, which of course is much slower.
When going the stackless route, you will probably allocate most of your variables as 'static', and have no re-entrant functions. In this case, 6502-style addressing modes can be effective. You could for example have these addressing modes:
Immediate address (24bit) as part of opcode
Immediate address (24bit) plus index register (8bit)
Indirect access: immediate 24bit address to memory, which contains the actual address
Indirect access: 24 bit address to memory, 8 bit index register added to value from memory.
The address modes outlined above would allow efficient access to arrays, structures and objects allocated at a constant address (static allocation). They would be less efficient (but still usable) for dynamically and stack-allocated objects.
You would also get some benefit from your serial design: the 24-bit + 8-bit addition does not usually need to take 24 cycles, because you can short-circuit the addition once the carry is 0.
Instead of mapping the IP to registers directly, you could allow changing it only through goto/branch instructions, using the same address modes as above. Jumps to dynamically computed addresses are quite rare, so it makes more sense to give the whole 24-bit address directly in the opcode.
I think that if you design the CPU carefully, you can use many C++ features quite efficiently. However, do not expect that any random C++ code would run fast on such a limited CPU.
The implementation is certainly possible, but I doubt it will be usable (at least for C++ code). As was already noted, the first problem is the lack of a stack. Next, much of C++ relies heavily on dynamic memory allocation, and C++'s "internal" structures are quite big.
So, as it seems to me, it will be better, if you:
Get rid of C++ requirement (or at least, limit yourself to some subset)
Use 24 bits, not 8 bits for everything (for registers as well)
Add hardware stack
You are not going to be able to run "any" C++ code there - for example fork(), system(), etc., or anything that clearly relies on interrupts. You can get a long way there, sure.
Now do you mean any programs that can/have been written in C++ or are you limiting yourself to the language only and not the libraries that are commonly associated with C/C++? The language itself is a much easier rule to live with.
I think the easier question/answer is: why not just try? What have you tried so far? It could be argued that the x86 is an 8-bit machine, with no regard for alignment and many 8-bit instructions. The msp430 was ported to llvm to show how easily and quickly it could be done. I would like to see that platform with better support (not where my strengths lie, otherwise I would be doing it): a 16-bit platform, no MMU. It does have a stack and interrupts, sure, but you don't have to use them, and if you remove library rules then what is left that needs an interrupt?
I would look at llvm, but note that the documentation showing how easy it is to port is dated and wrong, and you basically have to figure it out on your own from the compiler sources. lcc has a book and is known for that, but is not optimized. Its sources don't compile well on modern computers; you always have to go backwards in time to use it, and any time I go near it I give up after an evening of just trying to build it as is. vbcc is simple, clean, documented, and not unfriendly to smaller processors. Is it C++? I don't remember. Of all of them it is the easiest to get a compiler up and running, though. Of all of them LLVM is the most attractive and most useful when all is said and done. Don't go near gcc or even think of it; it's held together inside with duct tape and bailing wire.
Have you invented your instruction set yet? Do you have a simulator and assembler yet? Look up lsasim at github to find my instruction set. You can write an llvm backend for mine as practice for yours... grin... (my vbcc backend is horrible, I need to start over)...
You have to have some idea of how the high level will be implemented, but you really have to start with an instruction set, an instruction set simulator and an assembler of some sort. Then start hand-converting C/C++ code into assembly for your instruction set; that should pretty quickly get you through "can I do this without a stack?", etc. In the process, define your calling convention and implement more C/C++ code by hand using it. THEN dig into a compiler and make a back end. I think you should consider vbcc as a stepping stone, then head for LLVM if it looks like it (the ISA) will work.

decompilation resources and theory

There must be a million books and papers on the theory and techniques of building compilers. Are there any resources on doing the reverse? I'm not interested in any particular HW platform. I'm looking for good books/research papers that examine the subject and its difficulties in depth.
I've worked on an AS3 and Java decompiler and I can assure you that everything I've learned in regards to decompilation is straight from compiler theory. Intermediate representations, data flow analysis, term rewriting, and other related concepts can all be found in the dragon book.
I've written about decompilers for dynamic languages here and for Python specifically.
Note though this is for dynamic languages with custom (high-level) VMs.
Decompilation is really a misnomer. Decompilers compile object code into a source representation. In many ways they are easier to write than traditional compilers - the 'source' code is already syntax checked and usually very precisely formatted.
They build up a symbol table (of addresses) and construct a target language representation of the application. The usual difficulty is that the original compiler has, to a greater or lesser degree, optimised the original application by removing common sub-expressions, hoisting constant code out of loops, and applying many other similar techniques. These are often not possible to represent in the target language.
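A tiny, hypothetical illustration of that last point (my own example, not from any particular decompiler): after common sub-expression elimination, the decompiler can only recover the compiler's temporary, not the way the programmer originally expressed the computation.

#include <cassert>

// What the programmer wrote:
int f(int a, int b) {
    int x = (a + b) * 2;
    int y = (a + b) * 3;
    return x + y;
}

// What the object code effectively contains after the compiler has removed
// the common sub-expression, and therefore the best a decompiler can recover:
int f_decompiled(int a, int b) {
    int tmp = a + b;            // compiler-introduced temporary
    return tmp * 2 + tmp * 3;   // x and y are gone entirely
}

int main() {
    assert(f(3, 4) == f_decompiled(3, 4));  // same behaviour, different shape
}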
In cases where the source is for a well defined VM, then often this optimisation is left to the JIT compiler and the resulting decompiled code is very readable - in many cases almost identical to the original. Compilers of this type often leave some or all of the symbols in the object code allowing these to be recovered. Others include line numbers to help with debugging and troubleshooting. These all help to recover the original code.
As a counter, there are code obfuscators that deliberately perform transformations that prevent simple restoration of the original source: scrambling names, changing the sequence in which code is generated (without changing its resulting meaning), and introducing constructs for which there is no source-language equivalent.

assembly language and optimization [closed]

How can programming in assembly help in achieving optimization?
The most likely way programming in assembly can improve your code is by improving you: teaching you more about what is happening at a low level and getting the discipline of optimization can help you make good decisions in higher-level languages.
As far as actually helping one program: as others have noted it's rarely worth it. It's just possible you can use it as a kind of advanced profile-driven optimization: try many variations until you find one that's best on your particular problem.
To start with this: write a program in C or C++ or whatever compiled language you normally use, fire up your debugger, and disassemble a small but nontrivial function, and have a think about why the compiler did what it did. Then try writing a small bit of inline assembler yourself. On modern systems assembly is mostly easily embedded within C rather than done from scratch.
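Any small function will do as a starting point; something like the following sketch (the file name and compiler flags are just one common way to get at the assembly):

#include <cstddef>

// A small but nontrivial function to study. Compile with optimizations
// (e.g. g++ -O2 -S sum.cpp, or use your debugger's disassembly view) and
// look at how the compiler handled the loop and the accumulator - it may
// well unroll or vectorize it.
long sum(const int* p, std::size_t n) {
    long total = 0;
    for (std::size_t i = 0; i < n; ++i)
        total += p[i];
    return total;
}

int main() {
    const int data[] = {1, 2, 3, 4, 5};
    return static_cast<int>(sum(data, 5));  // returns 15
}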
Or alternatively, get a teeny machine like a PIC and make it flash a LED...
These days, you have to be very good at assembly to beat the compiler.
I can do it any day of the week, but only by viewing the compiler's output first.
And then, if it gains more than a couple of percentage points I'd be surprised.
These days, I only program in assembly when I'm doing something the compiler can't do.
In principle, you can write highly-optimized code in assembly because the compiler is limited to specific, general-purpose optimizations that should apply to many programs, while you can be creative and use your knowledge of this particular program.
To take a simple example, back when I was new to this business compilers were very limited in their ability to optimize register usage. You know that to perform any sort of arithmetic or logical operation, the CPU must generally load one of the values into a register, then perform the operation on the other, then save the result? Like to add two numbers together -- and I'll use a pseudo-assembler here because I don't know what assembly languages you know and I've forgotten most of the details myself -- you'd write something like this:
LOAD A,value1
ADD A,value2
STORE A,destination
Compilers used to generate the loads for every operation. So if your C program said:
x=x+y;
z=z+x;
The compiler would generate something like:
LOAD A,x
ADD A,y
STORE A,x
LOAD A,z
ADD A,x
STORE A,z
But a human could observe that by the time we get to the second statement, register A already contains x, and addition is commutative, so we could optimize this to:
LOAD A,x
ADD A,y
STORE A,x
ADD A,z
STORE A,z
Et cetera. One could go through all sorts of tiny micro-optimizations like this. I used to do that all the time back when I was young and the world was green.
But over the years compilers have gotten much smarter, and CPUs have gotten more powerful so the micro-optimizations don't matter as much.
Thus, I haven't written any assembly language code in, wow, probably 15 years. I used to read the assembly generated by the compiler when debugging, sometimes it would give a clue to a subtle problem, but I haven't done that in years now either.
I don't think compilers are even written in assembly any more. Instead, you write the first draft of the compiler in a high level language on some other computer, i.e. you write a cross-compiler to get yourself off the ground.
I suspect the only real use of assembly today is for extremely constrained environments, embedded systems and that sort of thing; and for programs that have to deal intimately with the hardware, like device drivers.
I'd be interested to hear if there are any assembly programmers on this forum who care to tell us why they are assembly programmers.
Programming in assembly won't, in and of itself, optimize your code. The main thing about assembly is that it allows you to have very low-level access and to choose exactly what instructions the processor executes.
Since you won't have some compiler generating the assembly for you, you can perform code optimizations when you write the program yourself, if you know how.
So, you think you are smarter than the gcc optimizing compiler?
If not, then fuhgeddaboudit (learning assembly for the sake of getting better at optimization). That would be akin to learning the Scheme language for the sake of getting better at recursion :)
In general, the compiler will do a fairly good job at generating optimal code. There are, however, cases where writing your own assembly can result in even more optimized (in terms of space and/or speed) code.
Typically, this happens when there is something that you know about the target system that the compiler doesn't. Compilers are designed to work on a variety of systems; if you want to take advantage of something unique to your target system, sometimes you have to go in and do it yourself. Here's an example. A few months ago, I was writing some code for a MIPS-based embedded system. There are many different types of MIPS CPUs, and some support certain opcodes that others do not. My compiler would generate MIPS code using the set of assembly operations that all MIPS architectures support. However, I knew that my chip could do more. I had a subroutine that needed to count the number of leading zeroes in a 32-bit number. The compiler synthesized this into a loop that took about 10 lines of assembly to do. I re-wrote it in one line by using the CLZ opcode that was designed to do just this. I knew that my chip supported the opcode but the compiler didn't. Admittedly, situations like this aren't very common; when they do pop up, however, it's nice to have enough of a background in assembly to take advantage of them.
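As a rough illustration of the shape of that difference (this is not the original MIPS code, just a generic C++ rendering that leans on GCC/Clang's __builtin_clz):

#include <cstdint>

// Portable fallback: count leading zeroes with a shift loop, roughly the
// shape of what a generic compiler might synthesize.
int clz_loop(std::uint32_t x) {
    if (x == 0) return 32;
    int n = 0;
    while ((x & 0x80000000u) == 0) {
        x <<= 1;
        ++n;
    }
    return n;
}

// On a CPU with a count-leading-zeroes instruction, a single builtin
// (or one inline-asm instruction) does the same job. GCC/Clang:
int clz_builtin(std::uint32_t x) {
    return x ? __builtin_clz(x) : 32;  // __builtin_clz is undefined for 0
}

int main() {
    for (std::uint32_t v : {1u, 2u, 0x80u, 0x12345678u, 0xFFFFFFFFu})
        if (clz_loop(v) != clz_builtin(v)) return 1;  // sanity check
    return 0;
}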
Sometimes one will need to perform a task which maps particularly well onto some CPU instructions, but does not fit well into any high-level-language constructs. For example, on many processors one may easily perform extended-precision arithmetic using something like:
add r0,r4    // low word: plain add, sets the carry
addc r1,r5   // add-with-carry propagates into the next word
addc r2,r6
addc r3,r7   // high word
This will regard r3:r2:r1:r0 and r7:r6:r5:r4 as numbers four words long, adding the second to the first. Four nice easy instructions, and anyone who understands assembly would know what they do. I know of no way to perform the same task in C without it not only generating bigger and slower object code, but also being an incomprehensible mess of source code.
A somewhat more extreme but specialized real-world example: Given two arrays array1[0..63] and array2[0..63], compute array1[0]*array2[0] + array1[1]*array2[1] + array1[2]*array2[2] ... + array1[63]*array2[63]. On a DSP I used, the computation could be done in machine code in about 75 machine cycles (about 67 of which are a repeating MAC instruction). There's no way C code could come anywhere close.
About the only time I can think of using assembly language for optimizing code is when you need something very specific, like a GPIO on a microcontroller toggling between high and low exactly every 9 clock cycles. That's too short a time to manage with an interrupt, and higher-level-language compilers don't normally offer this kind of control over the instruction stream.
Typically you wouldn't program in assembly. You would program in C, and then look at the generated assembly to see what optimizations (or not) the C compiler made automatically. Adjusting your C code (to allow for better vectorization, for example) will allow the compiler to rearrange the code better, which will give you optimized assembly.
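For example (a hedged sketch; __restrict is a widely supported but non-standard extension), a small change such as telling the compiler two arrays cannot alias is often enough to let it vectorize a loop, which you can confirm by re-reading the generated assembly:

#include <cstddef>

// Naive form: the compiler must assume dst and src might alias, which can
// block vectorization.
void scale(float* dst, const float* src, std::size_t n, float k) {
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = src[i] * k;
}

// Promising the compiler that the arrays do not overlap often lets it emit
// SIMD code; compare the generated assembly of the two versions to check.
void scale_restrict(float* __restrict dst, const float* __restrict src,
                    std::size_t n, float k) {
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = src[i] * k;
}

int main() {
    float in[8] = {1, 2, 3, 4, 5, 6, 7, 8}, out[8];
    scale(out, in, 8, 2.0f);
    scale_restrict(out, in, 8, 2.0f);
    return 0;
}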
More likely than being able to beat the compiler at writing assembly code, knowing how typical tasks translate to assembly may help you write better high-level-language code.
Typically you do not resort to assembly for optimization purposes. Where that is possible, usually someone will already have provided the essential code ready for you to call, for example in the form of a linear algebra library.
Likewise, assembly offers direct access to the processor (e.g. for atomicity, time measurement, I/O), but the important accesses will already have been made available to your high-level language.
Compilers do a good job of generating assembler.
However, there's a bad reason why hand-written assembler is faster. Since it's harder to write, you write less of it.
It would be nice if programmers could discipline themselves to get the same job done in minimal code, regardless of language.
When writing assembly, or even just the straight raw bytes the assembler outputs, you can write programs that use hardware-specific features or make something that is otherwise specified very carefully.
There might be really high benefits if your program does the optimized part far more often than it does anything else. Always set up benchmarks before attempting optimizations.
The downside is that your hand-written assembly works on fewer kinds of hardware. It may even end up being limited to one hardware model and revision!
It's rare that you ever can or need to write assembly routines, because commonly written software must work on almost every piece of hardware you find (and on your kitten's, too).
There's one interesting application if you know assembly. You can then write programs that produce assembly routines. Though it's mostly only fun unless you keep it really small so you can port it easily.
Read the Graphics Programming Black Book by Michael Abrash
In most modern applications, it can't help to any significant degree.
Inter-Process Communication Affects Application Response Time explains why algorithms are unlikely to be bottlenecks. (But always profile - never guess.)
In general, programming in assembly will increase time-to-market, bug density, and maintenance costs. Instead, strive for simplicity and readability in your code.
As poolie mentioned, the main benefit of learning assembly today is a deeper understanding of software and hardware. From that perspective, there's quite a bit of information on Steve Gibson's site.
If you understand why there is sometimes a need to write asm, you will appreciate its strengths and its costs (headaches for you).