The meaning of SUP bit in page tables - process

One of the page table entry attributes is the SUP bit.
I read in a couple of documents that:
"if the SUP is set then only a process in kernel mode can access that page. Versus if it is not set then a process in user mode can access it."
I find this statement confusing, because a process in kernel mode can be a process within which a user program is running or a kernel program (user process vs system process). So which does the statement refer to? Or is it both as long as the process is currently executing in kernel mode?
If this statement also refers to processes within which a user program is running (user processes), then, since memory access can only be made once the process switches into kernel mode, there would seem to be no need for the SUP bit.
My guess is that the SUP bit is meant to say that this page may be accessed only by system processes (excluding user processes running in kernel mode), but I am not sure, as I don't know how the kernel code is stored in memory, whether it is paged, and how.

Whenever you have any doubt about the workings of an Intel CPU, consult the Intel Manuals, not any random internet page1.
Paging access rights are described in Section 4.6.
The CPU distinguishes between the privilege of an address and the privilege of an access (to an address); each is either user-mode or supervisor-mode (supervisor-mode being the more privileged).
Access mode
Every access to a linear address is either a supervisor-mode access or a user-mode access. For all instruction
fetches and most data accesses, this distinction is determined by the current privilege level (CPL): accesses made
while CPL < 3 are supervisor-mode accesses, while accesses made while CPL = 3 are user-mode accesses.
Some operations implicitly access system data structures with linear addresses; the resulting accesses to those
data structures are supervisor-mode accesses regardless of CPL.
[...]
All these accesses are called implicit supervisor-mode
accesses regardless of CPL. Other accesses made while CPL < 3 are called explicit supervisor-mode accesses.
So when a program accesses a memory location, its CPL determines the access mode: user programs run at CPL = 3, so they only perform user-mode accesses.
The kernel instead performs supervisor-mode accesses, as it runs at CPL = 0.
Address mode
Access rights are also controlled by the mode of a linear address as specified by the paging-structure entries
controlling the translation of the linear address. If the U/S flag (bit 2) is 0 in AT LEAST ONE of the paging-structure
entries, the address is a supervisor-mode address. Otherwise, the address is a user-mode address.
The SUP bit, formally known as U/S, then determines the mode of an address.
Since it is present in the PDE too (not only in the PTE), the more restrictive setting wins: a U/S flag of 0 (supervisor) in any one entry, at any level of the walk, suffices to make the address a supervisor-mode address.
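As a rough illustration (not actual kernel or CPU code, and simplified to a two-level walk), the effective mode of an address can be thought of as the AND of the U/S bits along its translation:

```c
#include <stdbool.h>
#include <stdint.h>

#define PG_US (1u << 2)   /* U/S flag, bit 2: 1 = user, 0 = supervisor */

/* An address is a user-mode address only if U/S is 1 in EVERY
 * paging-structure entry used to translate it (here just PDE and PTE).
 * If any entry has U/S = 0, the address is a supervisor-mode address,
 * and a CPL = 3 access to it faults. */
static bool is_user_mode_address(uint64_t pde, uint64_t pte)
{
    return (pde & PG_US) && (pte & PG_US);
}
```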
Access rights
A user-mode access to a supervisor-mode address is always forbidden, and any attempt generates an exception.
An access to a same-mode address2 and an access to a lower-mode address3 are generally permitted; they are not equivalent4, though, and there are a variety of flags that alter the behavior of the CPU5.
The idea is that supervisor-mode accesses can do whatever they want, so to reduce the attack surface available to exploiters there are a few mechanisms to lower the privileges of an access.
1 Including this one.
2 User-mode access to user-mode address, supervisor-mode access to supervisor-mode address.
3 Supervisor-mode access to user-mode address.
4 Supervisor accesses can write to read-only pages.
5 For instance, the CR0.WP flag, when set, forbids supervisor-mode writes to read-only pages; the IA32_EFER.NXE bit enables the XD bit, so instruction fetches from a page with XD set are disallowed.

It's just a check for whether the CPU is in ring 0 or not. The CPU doesn't really know about processes: it doesn't matter how you got into ring 0, just that the CPU is currently executing kernel code (i.e. it could run privileged instructions).
See Margaret's more detailed answer for the full details.
And yes, all access to memory, even inside the kernel, goes through virtual addresses. Kernels don't disable paging temporarily to access a specific physical address. Note that Linux (and many other kernels) keep kernel pages locked in memory and never swap them out to disk, but those pages are still paged, i.e. still mapped through the page tables.

Related

How do operating systems isolate processes from each other?

Assuming the CPU is in protected mode:
When a ring-0 kernel sets up a ring-3 userspace process, which CPU-level data structure does it have to modify to indicate which virtual address space this specific process can access?
Does it just set the privilege bits of all other memory segments in the Global Descriptor Table to (ring) 0?
Each process has its own set of page tables. On x86 that means a page directory pointing to some page tables; the physical address of the page directory is held in the CR3 register. Every set of page tables has the kernel mapped (with kernel permissions), so when you make a system call the kernel can access its own pages, while user processes cannot.
When you do a context switch, you change the address in the CR3 register to point at the page tables of the process that is about to execute. Because each process has a different set of page tables, each gets a different view of memory. To make sure that no two processes have access to the same physical memory, you should have some kind of physical memory manager, which can be queried for a brand-new area of memory that is not yet mapped in any other page table.
So as long as each process struct keeps track of its own page table structure, the only CPU-level data structure you have to modify is the CR3 register.
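A minimal sketch of that CR3 switch (hypothetical, assuming a 32-bit x86 kernel, GCC inline assembly, and a made-up process struct):

```c
#include <stdint.h>

/* Hypothetical per-process structure: each process keeps the physical
 * address of its own page directory. */
struct process {
    uint32_t page_directory_phys;   /* value to load into CR3 */
    /* ... saved registers, kernel stack pointer, etc. ... */
};

/* On a context switch, pointing CR3 at the next process's page directory
 * is what gives that process its own view of memory. Only ring-0 code can
 * execute this instruction. */
static inline void switch_address_space(struct process *next)
{
    __asm__ volatile("mov %0, %%cr3"
                     :
                     : "r"(next->page_directory_phys)
                     : "memory");
}
```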
It appears that the Global Descriptor Table (GDT) provides a segmentation mechanism that can be used in conjunction with paging, but it is now considered legacy.
Once the kernel has loaded the page directory address into the CR3 control register, the ring-3 process is restricted to the linear memory defined by the paging structures. CR3 can only be changed from ring 0:
In protected mode, the 2 CPL bits in the CS register indicate which ring/privilege level the CPU is on.
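For illustration (a small sketch using GCC inline assembly, not taken from any particular kernel), the CPL can be read back from the low two bits of CS:

```c
#include <stdint.h>

/* Read CS and mask off its two low bits, which hold the CPL:
 * 0 = kernel (ring 0), 3 = user (ring 3). */
static inline unsigned int current_privilege_level(void)
{
    uint16_t cs;
    __asm__ volatile("mov %%cs, %0" : "=r"(cs));
    return cs & 0x3;
}
```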
More here:
https://forum.osdev.org/viewtopic.php?f=1&t=31835
https://wiki.osdev.org/Paging
https://sites.google.com/site/masumzh/articles/x86-architecture-basics/x86-architecture-basics
https://en.wikipedia.org/wiki/X86_memory_segmentation
https://software.intel.com/en-us/download/intel-64-and-ia-32-architectures-sdm-combined-volumes-1-2a-2b-2c-2d-3a-3b-3c-3d-and-4

Can VMs on Google Compute detect when they've been migrated?

Is it possible to notify an application running on a Google Compute VM when the VM migrates to different hardware?
I'm a developer for an application (HMMER) that makes heavy use of vector instructions (SSE/AVX/AVX-512). The version I'm working on probes its hardware at startup to determine which vector instructions are available and picks the best set.
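(This is not HMMER's actual code, but a minimal sketch of that kind of startup probing and dispatch, assuming GCC/Clang's __builtin_cpu_supports and made-up kernel functions:)

```c
#include <stdio.h>

/* Hypothetical dispatch: pick the widest vector implementation that the
 * CPU we are running on right now reports via CPUID. */
typedef void (*kernel_fn)(void);

void kernel_sse(void)    { puts("using SSE path"); }
void kernel_avx2(void)   { puts("using AVX2 path"); }
void kernel_avx512(void) { puts("using AVX-512 path"); }

kernel_fn select_kernel(void)
{
    __builtin_cpu_init();                    /* populate CPU feature info */
    if (__builtin_cpu_supports("avx512f"))
        return kernel_avx512;
    if (__builtin_cpu_supports("avx2"))
        return kernel_avx2;
    return kernel_sse;                       /* SSE2 is baseline on x86-64 */
}

int main(void)
{
    kernel_fn run = select_kernel();         /* probed once, at startup */
    run();
    return 0;
}
```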
We've been looking at running our program on Google Compute and other cloud engines, and one concern is that, if a VM migrates from one physical machine to another while running our program, the new machine might support different instructions, causing our program to either crash or execute more slowly than it could.
Is there a way to notify applications running on a Google Compute VM when the VM migrates? The only relevant information I've found is that you can set a VM to perform a shutdown/reboot sequence when it migrates, which would kill any currently-executing programs but would at least let the user know that they needed to restart the program.
We ensure that your VM instances never live migrate between physical machines in a way that would cause your programs to crash the way you describe.
However, for your use case you probably want to specify a minimum CPU platform version. You can use this to ensure that e.g. your instance has the new Skylake AVX instructions available. See the documentation on Specifying the Minimum CPU Platform for further details.
As per the Live Migration docs:
Live migration does not change any attributes or properties of the VM
itself. The live migration process just transfers a running VM from
one host machine to another. All VM properties and attributes remain
unchanged, including things like internal and external IP addresses,
instance metadata, block storage data and volumes, OS and application
state, network settings, network connections, and so on.
Google does provide a few controls to set the instance availability policies, which also let you control aspects of live migration. Here they also mention what you can look for to determine when live migration has taken place.
Live migrate
By default, standard instances are set to live migrate, where Google
Compute Engine automatically migrates your instance away from an
infrastructure maintenance event, and your instance remains running
during the migration. Your instance might experience a short period of
decreased performance, although generally most instances should not
notice any difference. This is ideal for instances that require
constant uptime, and can tolerate a short period of decreased
performance.
When Google Compute Engine migrates your instance, it reports a system
event that is published to the list of zone operations. You can review
this event by performing a gcloud compute operations list --zones ZONE
request or by viewing the list of operations in the Google Cloud
Platform Console, or through an API request. The event will appear
with the following text:
compute.instances.migrateOnHostMaintenance
In addition, you can detect directly on the VM when a maintenance event is about to happen.
Getting Live Migration Notices
The metadata server provides information about an instance's
scheduling options and settings, through the scheduling/
directory and the maintenance-event attribute. You can use these
attributes to learn about a virtual machine instance's scheduling
options, and use this metadata to notify you when a maintenance event
is about to happen through the maintenance-event attribute. By
default, all virtual machine instances are set to live migrate so the
metadata server will receive maintenance event notices before a VM
instance is live migrated. If you opted to have your VM instance
terminated during maintenance, then Compute Engine will automatically
terminate and optionally restart your VM instance if the
automaticRestart attribute is set. To learn more about maintenance
events and instance behavior during the events, read about scheduling
options and settings.
You can learn when a maintenance event will happen by querying the
maintenance-event attribute periodically. The value of this
attribute will change 60 seconds before a maintenance event starts,
giving your application code a way to trigger any tasks you want to
perform prior to a maintenance event, such as backing up data or
updating logs. Compute Engine also offers a sample Python script
to demonstrate how to check for maintenance event notices.
You can use the maintenance-event attribute with the waiting for
updates feature to notify your scripts and applications when a
maintenance event is about to start and end. This lets you automate
any actions that you might want to run before or after the event. The
following Python sample provides an example of how you might implement
these two features together.
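(The Python sample referred to above is not reproduced here. As a rough, hypothetical sketch of the same idea in C, polling the maintenance-event metadata attribute with libcurl; the endpoint and header come from the metadata-server documentation, everything else is an assumption:)

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <curl/curl.h>

/* Copy (a prefix of) the response body into a small buffer. */
static size_t save_body(char *data, size_t size, size_t nmemb, void *userp)
{
    char *buf = userp;
    size_t n = size * nmemb;
    if (n > 127)
        n = 127;
    memcpy(buf, data, n);
    buf[n] = '\0';
    return size * nmemb;
}

int main(void)
{
    const char *url = "http://metadata.google.internal/computeMetadata/v1/"
                      "instance/maintenance-event";
    char value[128] = "";

    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *curl = curl_easy_init();
    struct curl_slist *hdrs = curl_slist_append(NULL, "Metadata-Flavor: Google");

    curl_easy_setopt(curl, CURLOPT_URL, url);
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, hdrs);
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, save_body);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, value);

    for (;;) {                      /* naive polling loop */
        if (curl_easy_perform(curl) == CURLE_OK &&
            value[0] != '\0' && strcmp(value, "NONE") != 0)
            printf("maintenance event pending: %s\n", value);
        sleep(1);                   /* the value changes ~60 s before the event */
    }
}
```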
You can also choose to terminate and optionally restart your instance.
Terminate and (optionally) restart
If you do not want your instance to live migrate, you can choose to
terminate and optionally restart your instance. With this option,
Google Compute Engine will signal your instance to shut down, wait for
a short period of time for your instance to shut down cleanly,
terminate the instance, and restart it away from the maintenance
event. This option is ideal for instances that demand constant,
maximum performance, and your overall application is built to handle
instance failures or reboots.
Look at the Setting availability policies section for more details on how to configure this.
If you use an instance with a GPU or a preemptible instance, be aware that live migration is not supported:
Live migration and GPUs
Instances with GPUs attached cannot be live migrated. They must be set
to terminate and optionally restart. Compute Engine offers a 60 minute
notice before a VM instance with a GPU attached is terminated. To
learn more about these maintenance event notices, read Getting live
migration notices.
To learn more about handling host maintenance with GPUs, read
Handling host maintenance on the GPUs documentation.
Live migration for preemptible instances
You cannot configure a preemptible instance to live migrate. The
maintenance behavior for preemptible instances is always set to
TERMINATE by default, and you cannot change this option. It is also
not possible to set the automatic restart option for preemptible
instances.
As Ramesh mentioned, you can specify a minimum CPU platform to ensure you are only migrated to hosts that have at least the minimum CPU platform you specified. At a high level:
In summary, when you specify a minimum CPU platform:
- Compute Engine always uses the minimum CPU platform where available.
- If the minimum CPU platform is not available, or the minimum CPU platform is older than the zone default and a newer CPU platform is available for the same price, Compute Engine uses the newer platform.
- If the minimum CPU platform is not available in the specified zone and there are no newer platforms available without extra cost, the server returns a 400 error indicating that the CPU is unavailable.

What would cause a LBC-managed region access to hang?

I'm trying to write a Compact Flash driver for the RouterBoard 800, for FreeBSD, and I'm running into problems. The CF slot is managed by the Local Bus Controller (LBC) of the CPU (MPC8544E), using the User Programmable Machine (UPM) module, and any access to the memory region where the CF is located hangs the thread (the CPU can still be interrupted, but the thread never continues). Even dummy accesses, made while programming or reading the UPM, hang. Now, the question is: what would cause the thread to hang when accessing the UPM-managed region, even on a dummy access, which should not actually assert the bus?
I know the CF card and slot themselves work, because the kernel itself boots from the card, loaded by the RouterBoard boot loader.
For posterity, the MPC8544E (and probably most of the MPC85xx series) has the concept of Local Access Windows (LAWs). If an address does not fall within any of the 10 windows that have been created, the access is dropped on the floor, with no exception thrown and no garbage data returned. This is true for all address regions, including external RAM.

What is application state?

This is a very general question. I am a bit confused by the term state. I would like to know what people mean by the "state of an application". Why do they call a web server "stateless" and a database "stateful"?
How is the state of an application (in a VM) transferred when the VM's memory is moved from one machine to another during live migration?
Is transferring the memory, caches and register values of a system enough to transfer the state of the running application?
You've definitely asked a mouthful -- it's unfortunate that the word state is used in so many different contexts, but each one is a valid use of the word.
State of an application
An application's state is roughly the entire contents of its memory. This can be a difficult concept to get behind until you've seen something like Erlang's server loops, which explicitly pass all the state of the application in a variable from one invocation of the function to the next. In more "normal" programming languages, the "state" of the program is all its global variables, static variables, objects allocated on the heap, objects allocated on the stack, registers, open file descriptors and file offsets, open network sockets and associated kernel buffers, and so forth.
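A toy analogue (in C, hypothetical) of that Erlang-style loop, where all of the application's state is carried explicitly in one value passed from each invocation to the next:

```c
#include <stdio.h>

/* All of this toy server's state lives in one struct; nothing is hidden in
 * globals. Whatever is in this struct (plus the stack and registers of the
 * loop itself) IS the application's state. */
struct state {
    long requests_handled;
    long bytes_served;
};

static struct state handle_one_request(struct state s, long request_size)
{
    s.requests_handled += 1;
    s.bytes_served += request_size;
    return s;               /* the next iteration receives the updated state */
}

int main(void)
{
    struct state s = {0, 0};
    for (long i = 0; i < 3; i++)
        s = handle_one_request(s, 100 * (i + 1));
    printf("handled %ld requests, %ld bytes\n",
           s.requests_handled, s.bytes_served);
    return 0;
}
```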
You can actually save that state and resume execution of the process elsewhere. The BLCR checkpoint tools for Linux do exactly this. (Though it is an extremely uncommon task to perform.)
State of a protocol
The state of a protocol is a different sort of meaning -- the statelessness of HTTP requests means that every web browser communication with webservers essentially starts over, from scratch -- every cookie is re-transmitted in both directions to try to "fake" some amount of a "session" for the user's sake. The servers don't hold any resources open for any given client across requests -- each one starts from scratch.
Networked filesystems might also be stateless (earlier versions of NFS) or stateful (newer versions of NFS). The earlier versions assumed every individual packet of reading, writing, or metadata control would be committed as it arrived, and every time a specific byte was needed from a file, it would be re-requested. This allowed the servers to be very simple -- they would do what the client packets told them to do and no effort was required to bring servers and clients back to consistency if a server rebooted or routers disappeared. However, this was bad for performance -- every client requested static data hundreds or thousands of times each day. So newer versions of NFS allowed some amount of data caching on the clients, and persistent file handles between servers and clients, and the servers had to keep track of the state of the clients that were connected -- and vice versa: the clients also had to know what promises they had made to the servers.
A stateful firewall will keep track of active TCP sessions. It knows which sessions the system administrators want to allow through, so it looks for those initial packets specifically. Once the session is set up, it then tracks the established connections as entities in their own rights. (This was a real advancement upon previous stateless firewalls which considered packets in isolation -- the rulesets on previous firewalls were much more permissive to achieve the same levels of functionality, but allowed through far too many malicious packets that pretended a session was already active.)
An application's state is simply where the program currently is in its execution together with the memory stored for the application. The web is "stateless," meaning that every time you reload a page, no information remains from the previous version of the page. All information must be resent from the server in order to display the page.
Technically, browsers get around the statelessness of the web by utilizing techniques like caching and cookies.
Application state is a data repository available to all classes. Application state is stored in memory on the server and is faster than storing and retrieving information in a database. Unlike session state, which is specific to a single user session, application state applies to all users and sessions. Therefore, application state is a useful place to store small amounts of often-used data that does not change from one user to another.
Resource: http://msdn.microsoft.com/en-us/library/ms178594.aspx
Is transferring the memory, caches and register values of a system enough to transfer the state of the running application?
Does the application have a file open, positioned at byte 225? If so, that file is part of the application's state because the next byte written should go to position 226.
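For example (a small hypothetical sketch using POSIX calls), the file offset the kernel maintains for an open descriptor is part of that state:

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Hypothetical file name. After the first write, the descriptor's offset
     * is part of the process's state: the next write lands right after it. */
    int fd = open("data.log", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return 1;

    write(fd, "225 bytes already written...", 28);   /* offset is now 28 */
    printf("current offset: %ld\n", (long)lseek(fd, 0, SEEK_CUR));
    write(fd, "next byte goes here", 19);            /* continues at offset 28 */

    close(fd);
    return 0;
}
```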
Has the application authenticated itself to a secure server with a time-based key? Then that connection is part of the application's state, because if the application were to be suspended for 24 hours after saving memory, cache, and register values, when it resumes it will no longer have a valid connection to the secure server because it will have timed out.
Things which make an application stateful are easy to overlook.

Is it possible to "wake up" linux kernel process from user space without system call?

I'm trying to modify a kernel module that manages a special hardware.
The user-space process performs 2 ioctl() system calls per millisecond to talk with the module. This doesn't meet my real-time requirements, because the 2 syscalls sometimes take too long to execute and overrun my time slot.
I know that with mmap I could share a memory area, and this is great, but how can I synchronize the data exchange with the module without ioctl()?