Do any microprocessors today use Scoreboarding or Tomasulo's algorithm? [closed] - hardware

I've researched a bit and I found out about the Intel Pentium Pro, AMD K7, and IBM PowerPC, but these are pretty old. I'm not able to find any information about current-day processors that use these mechanisms for dynamic scheduling.

Every modern out-of-order (OoO) exec CPU uses Tomasulo's algorithm for register renaming. The basic idea of renaming onto more physical registers, in a kind of SSA dependency analysis, hasn't changed.
Modern Intel CPUs like Skylake have evolved some since Pentium Pro (e.g. renaming onto a physical register file instead of holding data right in the ROB), but PPro and the rest of the P6 family are direct ancestors of the Sandybridge family. See https://www.realworldtech.com/sandy-bridge/ for a discussion of the first member of that new family and, if you're curious about CPU internals, a much more in-depth look at it. See also https://agner.org/optimize/, but Agner's microarch guide focuses more on how to optimize for it, e.g. noting that register renaming isn't a bottleneck on modern CPUs: rename width matches issue width, and the same register can be renamed 4 times in an issue group of 4 instructions.
Advancements in managing the RAT (register alias table) include Nehalem introducing fast recovery for branch mispredictions: snapshot the RAT on branches so you can restore to that point when you detect a mispredict, instead of draining earlier un-executed uops before starting recovery.
Also mov-elimination and xor-zeroing elimination: they're handled at register-rename time instead of needing a back-end uop to write the register. (For xor-zeroing, presumably there's a physical zero register, and zeroing idioms point the architectural register at that physical zero. See "What is the best way to set a register to zero in x86 assembly: xor, mov or and?" and "Can x86's MOV really be 'free'? Why can't I reproduce this at all?".)
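To make the renaming idea concrete, here's a minimal sketch (not modeled on any real microarchitecture; all sizes and names are invented): a register alias table renames each destination onto a fresh physical register, and snapshots itself at branches for the fast-recovery scheme described above.

```cpp
#include <array>
#include <stack>
#include <vector>

constexpr int kArchRegs = 16;   // architectural registers (illustrative)
constexpr int kPhysRegs = 64;   // physical register file  (illustrative)

struct Renamer {
    std::array<int, kArchRegs> rat{};                  // arch reg -> phys reg
    std::vector<int> free_list;                        // unallocated phys regs
    std::stack<std::array<int, kArchRegs>> snapshots;  // saved RATs for branch recovery

    Renamer() {
        for (int a = 0; a < kArchRegs; ++a) rat[a] = a;          // identity map at reset
        for (int p = kArchRegs; p < kPhysRegs; ++p) free_list.push_back(p);
    }

    // Rename one instruction "dst = op src1, src2": sources read the current
    // mapping; the destination gets a fresh physical register, which is what
    // removes WAR and WAW hazards. Assumes a free register is available
    // (real hardware stalls the front-end otherwise).
    void rename(int dst, int src1, int src2,
                int& p_dst, int& p_src1, int& p_src2) {
        p_src1 = rat[src1];
        p_src2 = rat[src2];
        p_dst  = free_list.back();
        free_list.pop_back();
        rat[dst] = p_dst;
    }

    // Fast branch recovery: snapshot the RAT at each branch...
    void on_branch() { snapshots.push(rat); }
    // ...and restore it directly on a mispredict, instead of draining
    // earlier un-executed uops first.
    void on_mispredict() { rat = snapshots.top(); snapshots.pop(); }
};
```

Real hardware also has to reclaim physical registers once the last reader has retired; that bookkeeping is omitted here.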
If you're going to do OoO exec at all, you might as well go all-in, so AFAIK nothing modern does just scoreboarding instead of register renaming. (The exception is in-order cores that scoreboard loads, so a cache-miss load doesn't stall the pipeline until a later instruction actually reads the load's target register.)
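A sketch of that parenthetical case, under the same illustrative assumptions: an in-order core keeps one "pending" bit per register, set when a load issues, so execution stalls only when a later instruction actually reads a register whose load hasn't returned yet.

```cpp
#include <bitset>

constexpr int kRegs = 32;
std::bitset<kRegs> pending;  // bit set while a load into that register is in flight

void issue_load(int dst)     { pending[dst] = true;  }  // load sent to memory; keep executing
void load_writeback(int dst) { pending[dst] = false; }  // data came back from cache/DRAM

// The only stall condition: a source operand is still waiting on a load.
bool must_stall(int src1, int src2) {
    return pending[src1] || pending[src2];
}
```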
There are still in-order execution cores that do neither, leaving instruction scheduling / software pipelining up to compilers and humans, a.k.a. statically scheduled. This is not rare: widely used budget smartphone chips use cores like the ARM Cortex-A53. Most programs bottleneck on memory, and even an in-order core can allow some memory-level parallelism, especially with a store buffer.
Sometimes energy per computation is more important than performance.

Tomasulo's algorithm dates back to 1967. It's quite old and several modifications and improvements have been made to it. Also, new dynamic scheduling methods have been developed.
Check out http://adusan.blogspot.com.au/2010/11/differences-between-tomasulos-algorithm.html
Likewise, pure scoreboarding is not used anymore, at least not in mainstream architectures, but its core concept serves as a building block of modern dynamic-scheduling techniques.
It is fair to say that although neither is used as-is anymore, some of their features live on in modern dynamic scheduling and out-of-order execution techniques.

Related

How to write sensor libraries from scratch [closed]

Can someone explain to me how I can write a sensor library from scratch? I read the datasheet and some Arduino libraries, but I did not understand how they were written.
It's not a trivial task to write a library for embedded projects. Most of the time, it's almost impossible to write a completely generic one that satisfies everyone's needs.
Don't let Arduino library examples fool you. Most of them are not designed and optimized for real-world applications with strict timing constraints. They are useful when reading that sensor is the only thing your embedded system does. You can also use them sequentially in a main loop when blocking read operations are not a concern.
Complex embedded applications don't fit this scheme. Most of the time you need to execute more than one task concurrently, and you use interrupts and DMA to handle your sensor streams. Sometimes you need an RTOS. Timing constraints can be satisfied by using the advanced capabilities of the STM32 timer modules.
Connecting timers, DMAs, interrupts, and communication (or GPIO) modules together so that they work in harmony is not easy (add an RTOS to that, if you use one), and it's almost impossible to generalize. Here is a list of examples that come to mind (a hypothetical API sketch follows the list):
You need to allocate channels for DMA usage. Your library must be aware of other libraries' channel usage to avoid conflicts.
TIM modules are not all the same. They may have different numbers of I/O pins, and some specific peripherals (like the ADC) can be triggered by some TIM modules but not others. There are constraints if you want to chain them; you can't just take one timer and connect it to any other one.
The library user may want to use DMA or interrupts, maybe even an RTOS. You need to provide different API calls for all possible situations.
If you use an RTOS, you must consider the different flavors. Although the concepts are similar across RTOSes, their approaches to those concepts are not the same.
HW pin allocation is a problem. In Arduino libraries, the user just says "use pins 1, 2, 3 for the SPI". You can't do that in a serious application: you need to use pins that are connected to the right hardware modules, while also avoiding conflicts with other modules.
Devices like the STM32 have a clock tree, which affects the clocks of each peripheral module. Your library must be aware of the clock frequency of the module it uses. Low-power modes can change these settings and break a library that isn't flexible enough for such changes. Some communication modules have more complicated timing settings; the CAN bus module, for example, needs a complex calculation for both bit rate and bit sampling position.
[And probably many more reasons...]
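The list above translates fairly directly into API design. Here is a hypothetical interface sketch (every name is invented for illustration, and only the interface is shown) that makes the caller own the contested resources instead of hard-coding them, Arduino-style:

```cpp
#include <cstdint>

enum class XferMode { Blocking, Interrupt, Dma };

struct SensorConfig {
    uint8_t  spi_instance;   // which SPI peripheral; pins follow the HW mapping
    uint8_t  dma_channel;    // caller allocates the channel, avoiding conflicts
    uint32_t bus_clock_hz;   // library needs the peripheral clock to derive baud rates
    XferMode mode;           // blocking, interrupt-driven, or DMA-driven
    void   (*on_complete)(const uint8_t* data, uint16_t len);  // fires from ISR context
};

// Returns false if the requested resources conflict or are unavailable.
bool sensor_init(const SensorConfig& cfg);

// Non-blocking in Interrupt/Dma modes; the callback signals completion.
bool sensor_read_async(uint8_t* buf, uint16_t len);
```

Even this tiny sketch shows the cost of generality: the caller must understand DMA channels, clock trees, and ISR context just to fill in the config struct.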
This is probably why the uC vendors provide offline configuration and code-generation tools, like CubeMX for the STM32s. Personally, I don't like them and I don't use them, but I must admit that I still use the CubeMX GUI to determine HW pin allocations, even though I don't use the code it generates.
It's not all hopeless if you only want to create libraries for your own use and your own programming style, because then you can define the constraints precisely from the start. I think creating libraries is easier in C++ than in C. While working on different projects, you slowly create and accumulate your own code snippets, and with some experience these can evolve into easily configurable libraries. But don't expect someone else to benefit from them as much as you do.

About embedded firmware development [closed]

In the past few days I have learned how important the RTOS layer on top of the embedded hardware is.
My question is: is there any bifurcation between a device driver written in C and flashed directly onto a microcontroller, and a Linux device driver?
This question is a little broad, but a correspondingly broad answer can be given.
The broadness comes from the fact that "embedded hardware" is not a precise term. That hardware ranges from 4-bit or 8-pin microcontrollers up to big CPUs which have many points in common with the processors typically used in Linux machines (desktops and servers). Linux itself can be tailored to the point that it no longer resembles a normal operating system.
Anyway, a few generally acceptable statements can be made. Linux, in its plain version, is not a real-time operating system; with the term RTOS, the "real-time" part is implied. So this is one bifurcation. But the most important thing, I think, is that embedded firmware tries to address the hardware and the task to be done without anything else added. A Linux OS, instead, is general-purpose: it offers a lot of services and functionality that in many cases are not needed and only add cost, lower performance, and more complication.
Often, in a small or medium embedded system, there is not even a "driver": the hardware and the application talk directly to each other. Of course, when the hardware is (more or less) standard (a USB port, an Ethernet controller, a serial port), the programming framework can provide ready-to-use software that is sometimes called a "driver"; but very often it is not really a driver, just a library with a set of functions to initialize the device and exchange data. The application uses those library routines to manage the device directly. The OS layer is not present; or, if the programmer wants to use an RTOS, he must check that the two don't conflict.
A Linux driver is not targeted at the application, but at the kernel. And the application seldom talks to the driver directly; it instead uses a uniform language (typically the "file system idiom") to talk to the kernel, which in turn calls the driver on behalf of the application.
A simple example I know very well is a serial port. Under Linux you open a file (maybe /dev/ttyS0), use some ioctl calls and the like to set it up, and then start to read and write to the file. You don't even care that there is a driver in the middle, and the driver was written without knowledge of the application; the driver only interacts with the kernel.
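For illustration, the whole Linux-side interaction fits in a few lines of standard POSIX calls; no knowledge of the underlying UART hardware appears anywhere:

```cpp
#include <fcntl.h>
#include <termios.h>
#include <unistd.h>

int main() {
    int fd = open("/dev/ttyS0", O_RDWR | O_NOCTTY);  // the driver hides behind a file
    if (fd < 0) return 1;

    termios tio{};
    tcgetattr(fd, &tio);                  // fetch current settings
    cfsetispeed(&tio, B115200);           // 115200 baud in both directions
    cfsetospeed(&tio, B115200);
    tio.c_cflag |= CS8 | CLOCAL | CREAD;  // 8 data bits, ignore modem control lines
    tcsetattr(fd, TCSANOW, &tio);         // apply immediately

    char buf[64];
    write(fd, "hello\n", 6);              // plain file I/O from here on
    read(fd, buf, sizeof buf);
    close(fd);
}
```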
In many embedded cases, instead, you set up the serial port by writing directly to the hardware registers; you then write two interrupt routines which read from and write to the serial port, getting and putting data from/into RAM buffers. The application reads and writes data directly to those buffers. Special events (or not-so-special ones) can be signaled directly from the interrupt handlers to the application. Sometimes I implement the serial protocol (checksums, packets, sequences) directly in the interrupt routine. It is faster and simpler, and uses fewer resources. But clearly this piece of software is no longer a "driver" in the common sense.
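Here is a bare-metal counterpart as a sketch. The register names and addresses are made up, since they differ per chip, but the shape is the point: the ISR is the producer, and the application consumes a RAM ring buffer without ever touching the hardware.

```cpp
#include <cstdint>

// Hypothetical memory-mapped UART registers; real addresses come from the datasheet.
#define UART_SR (*(volatile uint32_t*)0x40011000u)  // status register
#define UART_DR (*(volatile uint32_t*)0x40011004u)  // data register
#define RX_NOT_EMPTY 0x20u                          // "byte received" status flag

static volatile uint8_t  rx_buf[256];
static volatile uint32_t rx_head, rx_tail;  // free-running indices, masked on use

// Interrupt routine: runs on each received byte and fills the RAM buffer.
extern "C" void UART_IRQHandler() {
    if (UART_SR & RX_NOT_EMPTY)
        rx_buf[rx_head++ & 0xFFu] = (uint8_t)UART_DR;  // no overflow check in this sketch
}

// Application side: reads the buffer directly, never the hardware.
bool uart_getc(uint8_t* c) {
    if (rx_head == rx_tail) return false;  // buffer empty
    *c = rx_buf[rx_tail++ & 0xFFu];
    return true;
}
```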
Hope this answer explains at least a part of the whole picture, which is very large.

Where to learn more about low-level programming? e.g device drivers [closed]

From http://www.altdevblogaday.com/2011/08/06/demise-low-level-programmer/:
"When I started programming many of the elements we take for granted now, did not exist. There was no DirectX and not many compatible libs were available for the free compilers of the day. So I had to write my own code for most basic programs, keyboard handlers, mouse handlers, video memory accessors, rasterizers, texture mappers, blitters… the programs I wrote then were 100% my own code and I had to be able to handle anything and everything."
I am looking for advice on how to learn more about low-level programming: for example, writing a keyboard/mouse driver, a VGA driver, or a basic function like malloc.
What is the best way to approach this? I am an undergraduate CS student and have had courses on computer architecture/assembly where we wrote a simple pipelined processor in VHDL on a DE0-Nano FPGA. I have decent C programming knowledge.
It seems difficult to me to learn that stuff on a modern x86 computer. Are microcontrollers better (like an Arduino board)? Or an FPGA (though my DE0-Nano seems limited)? Or maybe buying an old computer like a Commodore 64, to learn the way people did back then?
And are there any resources, like books, on this subject?
I would recommend a microcontroller approach: MSP430, AVR, or one of the many ARMs. You can get an MSP430 or ARM LaunchPad from TI for about 10-20 bucks, and you can find STM32-based Discovery boards for 10-20 bucks as well. The STM32F0 (Cortex-M0) is simple and basic; the STM32F4 is loaded with stuff like caches and an FPU, but it is still a microcontroller.
I think the microcontroller approach will take you back to the roots, to very basic stuff. The hardware usually doesn't have video or hard drives or anything, but you do learn how to master the tools (compiler, linker, etc.) and how to read the manuals.
You could also fast-forward to Linux-capable hardware like the Raspberry Pi or BeagleBone. At all levels, but particularly the Linux-capable ones, in addition to the manuals for the video chips, USB chips, etc., you will also want to use the Linux source code to learn from. Understand that Linux drivers are often very generic and often contain code to drive a whole family, or the whole history, of hardware from one vendor, so a bunch of the code may not match anything in the manual; but some will, and that is where you will get over the hump of understanding what the manual is saying. That doesn't mean you have to rewrite Linux in order to use this information.
Start small and take it one step at a time. PCIe and video alone, from scratch, is a major task; if you start there you are quite likely to fail unless you already have the knowledge and experience (and if you did, you wouldn't be here asking).
Another approach is Bochs, or something else that emulates an old 8086 DOS system. If you can find some good old DOS/8086 manuals, you can poke at registers, CGA mode, VGA mode, etc. The disk controllers can still be difficult, though.
There are plenty of ARM, AVR, MSP430, MIPS, etc. instruction set simulators (and x86 ones, but I wouldn't go there for this until later) that avoid the cost of hardware, and also the cost of damaging hardware and having to buy more (it happens even with lots of experience). Even better is writing your own instruction set simulator; study VBA and some NDS simulators, working your way up to QEMU.
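If the "write your own simulator" route appeals to you, the core is just a fetch/decode/execute loop. A minimal sketch for an invented 16-bit toy ISA (every encoding here is made up for illustration):

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    uint16_t mem[] = {   // tiny program in an invented encoding
        0x1205,          // LOADI r2, 5
        0x1303,          // LOADI r3, 3
        0x2123,          // ADD   r1, r2, r3
        0x0000,          // HALT
    };
    uint16_t reg[16] = {};
    uint16_t pc = 0;

    for (;;) {
        uint16_t insn = mem[pc++];   // fetch
        uint16_t op   = insn >> 12;  // decode: top nibble is the opcode
        if (op == 0x0) break;        // HALT
        if (op == 0x1)               // LOADI rD, imm8
            reg[(insn >> 8) & 0xF] = insn & 0xFF;
        if (op == 0x2)               // ADD rD, rA, rB
            reg[(insn >> 8) & 0xF] = reg[(insn >> 4) & 0xF] + reg[insn & 0xF];
    }
    std::printf("r1 = %u\n", reg[1]);  // prints: r1 = 8
}
```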

What are the options for a real-time operating system for the ARM Cortex architecture? [closed]

I am looking for an RTOS for the ARM Cortex-M/R series (developing in C++).
Can someone recommend a good RTOS for the ARM Cortex-M or R series?
Thank you.
Getting an answer of any value would require someone to have objectively evaluated all of them, and that is unlikely.
Popularity and suitability are not necessarily the same thing. You should select the RTOS that has the features your application needs, works with your development tools, and has a licensing model and costs that meet your needs and budget.
The tool-chain you use is a definite consideration: kernel-aware debugging and start-up projects are both helpful in successful development. Some debugger/RTOS combinations may even allow thread-level breakpoints and debugging.
Keil's MDK-ARM includes a simple RTOS with priority based pre-emptive scheduling and inter-process communication as well as a selection of middleware such as a file system, and TCP/IP, CAN and USB stacks included at no extra cost (unless you want the source code).
IAR offer integrations with a number of RTOS products for use with EWB. Their ARM EWB page lists the RTOSes with built-in and vendor plug-in support.
Personally I have used Keil RTX but switched to Segger embOS because at the time RTX was not as mature on Cortex-M and caused me a few problems. Measured context switch times for RTX were however faster than embOS. It is worth noting that IAR's EWB integrates with embOS so that would probably be the simpler route if you have not already invested in a tool-chain. I have also evaluated FreeRTOS (identical to OpenRTOS but with different licensing and support models) on Cortex M, but found its API to be a little less sophisticated and complete than embOS, and with significantly slower context switch times.
embOS has similar middleware support to RTX, but at additional cost. However I managed to hook in an alternative open source file system and processor vendor supplied USB stack without any problems in both embOS and RTX, so middleware support may not be critical in all cases.
Another option is Micro C/OS-II. It has middleware support, again at additional cost. Its scheduler is a little more primitive than most others, requiring that every thread have a distinct priority level, so it does not support round-robin/time-slice scheduling, which is often useful for non-real-time background tasks. It is popular largely through the associated book that describes the kernel implementation in detail. The newer Micro C/OS-III overcomes the scheduler limitations.
At the other extreme, eCos is a complete RTOS solution with high-end features that make it suitable for many applications where you might otherwise choose, say, Linux, but need real-time support and a small footprint.
The point is that you can probably take it as read that an RTOS supports pre-emptive scheduling and IPC, and has a reasonable performance level (although I mentioned varying context-switch times, the range was between 5 and 15 µs at 72 MHz on an STM32F1xx). So I would look at things like maturity (how long the RTOS has been available for your target; you might even look at release notes to see how quickly it reached maturity and what problems there may have been), tool integration, whether the API suits your needs and intended software architecture, middleware support either from the vendor or from third parties, and licensing (can you afford it, and can you legally deploy it in the manner you intend?).
With respect to using C++, most RTOSes present a C API (even eCos, which is written in C++). This is not really a problem, since C code is interoperable with C++ at the binary level; however, you can usefully leverage the power of C++ to make the RTOS choice less critical. What I have done is define a C++ RTOS class library that presents a generic API providing the facilities I need; for example, I have classes such as cTask, cMutex, cInterrupt, cTimer, cSemaphore, etc. The application code is written to this API, and the class library is implemented for any number of RTOSes. This way the application code can be ported with little or no change to a number of targets and RTOSes, because the class library acts as an abstraction layer. I have successfully implemented this class library for Wind River VxWorks, Segger embOS, Keil RTX, and even Linux and Windows for simulation and prototyping.
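As a sketch of that pattern (the class name echoes the answer; the exact API is invented for illustration), the application codes against a generic header while each RTOS gets its own implementation file. The POSIX backend below is the "Linux for simulation" case:

```cpp
#include <pthread.h>

// --- Generic interface the application codes against (e.g. cMutex.h) ---
class cMutex {
public:
    cMutex();
    ~cMutex();
    void Lock();
    void Unlock();
private:
    pthread_mutex_t m_;  // a real port would hide the RTOS-specific type (e.g. via pimpl)
};

// --- One backend among several (e.g. cMutex_posix.cpp). An embOS or RTX
// --- port implements the same four members with that kernel's calls.
cMutex::cMutex()      { pthread_mutex_init(&m_, nullptr); }
cMutex::~cMutex()     { pthread_mutex_destroy(&m_); }
void cMutex::Lock()   { pthread_mutex_lock(&m_); }
void cMutex::Unlock() { pthread_mutex_unlock(&m_); }
```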
Some vendors do provide C++ wrappers for their RTOS, such as Accelerated Technology's Nucleus C++ for the Nucleus RTOS, but that does not necessarily provide the abstraction you might need to change the RTOS without changing the application code.
One thing to be aware of with C++ development in an RTOS is that most RTOS libraries are initialised in main() while C++ invokes constructors for static global objects before main() is called. It is common for some RTOS calls to be invalid before RTOS initialisation, and this can cause problems - especially as it differs between RTOSes. One solution is to modify your C runtime start-up code so that the RTOS initialisation is invoked before the static constructors, and main(), but after the basic C runtime environment is established. Another solution is simply to avoid RTOS calls in static objects.
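The pitfall in miniature, with invented shim names standing in for real kernel calls; the lazy-creation variant defers the kernel call until after main() has initialised the RTOS:

```cpp
// Invented stand-ins for real RTOS calls, stubbed so the sketch is self-contained.
static void* rtos_queue_create(int /*depth*/)          { return nullptr; }
static void  rtos_queue_post(void* /*q*/, int /*msg*/) {}

class cQueue {
public:
    cQueue() {
        // BAD in many RTOSes: for static objects this constructor runs
        // before main(), i.e. before the kernel is initialised:
        // handle_ = rtos_queue_create(16);
    }
    void Post(int msg) {
        if (!handle_)                          // lazy creation: first use happens
            handle_ = rtos_queue_create(16);   // safely after RTOS init in main()
        rtos_queue_post(handle_, msg);
    }
private:
    void* handle_ = nullptr;
};

cQueue g_log_queue;  // global object: its constructor runs before main()
```

Lazy creation has its own caveats (it must not race between threads), which is why moving RTOS initialisation into the C runtime start-up code, as described above, is the cleaner fix.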

Why would you need to know about each processor in particular? [closed]

I'm curious to understand the motivation behind the fine-grained, per-virtual-processor detail that the Windows 8 Task Manager seems to focus on.
Here's a screenshot [Task Manager per-processor heatmap omitted]: I know this setup could only exist in a non-standard, costly, high-end server environment (1 TB of RAM!), but what is the use of a heatmap? Or of setting processor affinity [affinity dialog screenshot omitted]?
What I'm asking is: under what circumstances would a developer care that specific processor X is being used more than processor Y (instead of just knowing that a single non-multithreaded process is maxing out a core, which would be better shown as a process heatmap rather than a processor heatmap), or care whether a process runs on this or that processor (which I can't expect a human to guess better than an auto-balancing algorithm)?
In most cases, it doesn't matter, and the heatmap does nothing more than look cool.
Big servers, though, are different. Some systems have a NUMA (Non-Uniform Memory Access) architecture, in which some processor cores can access some chunks of memory faster than other cores can. In these cases, adjusting the process affinity to keep the process on the cores with faster memory access might prove useful. Also, if a processor has per-core caches (as many do), there can be a performance cost when a thread jumps from one core to another. The Windows scheduler should do a good job avoiding such switches, but I could imagine some strange workloads where you might need to force it.
These settings could also be useful if you want to limit the number of cores an application is using (say to keep some other cores free for another dedicated task.) It might also be useful if you're running a stress test and you are trying to determine if you have a bad CPU core. It also could work around BIOS/firmware bugs such as the bugs related to high-performance timers that plagued many multi-core CPUs from a few years back.
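For reference, everything the Task Manager affinity dialog does is also available programmatically; a minimal example using the documented Win32 call:

```cpp
#include <windows.h>
#include <cstdio>

int main() {
    DWORD_PTR mask = 0x3;  // one bit per logical processor: cores 0 and 1
    if (SetProcessAffinityMask(GetCurrentProcess(), mask))
        std::printf("process pinned to cores 0 and 1\n");
    else
        std::printf("SetProcessAffinityMask failed: %lu\n", GetLastError());
}
```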
I can't give you a good use case for this heatmap (except that it looks super awesome), but I can tell you a sad story about how we used CPU affinity to fix something.
We were automating some older version of MS Office to do batch processing of Word documents, and Word was occasionally crashing. After a while of troubleshooting and desperation, we tried setting the Word process's affinity to just one CPU, to reduce concurrency and hence the likelihood of race conditions. It worked: Word stopped crashing.
One possible scenario would be a server that is running multiple VMs where each client is paying to have access to their VM.
The administrator may set the processor affinities so that each VM has guaranteed access to X number of cores (and would charge the client appropriately).
Now, suppose that the administrator notices that the cores assigned to ABC Company Inc.'s VMs are registering highly on the heatmap. This would be a perfect opportunity to upsell ABC Company Inc and get them to pay for more cores.
Both the administrator and ABC Company Inc. win: the administrator makes more money, and ABC Company Inc. experiences better performance.
In this way, the heatmap can function as a decision-support system which helps ABC Company Inc. decide whether their needs merit more cores, and helps the administrator target their advertising to the customers that can benefit.