How do operating systems isolate processes from each other?

Assuming the CPU is in protected mode:
When a ring-0 kernel sets up a ring-3 userspace process, which CPU-level data structure does it have to modify to indicate which virtual address space this specific process can access?
Does it just set the privilege bits of all other memory segments in the Global Descriptor Table to (ring) 0?

Each process will have its own set of page tables. On x86 that means a page directory pointing to some page tables. The physical address of the page directory is held in the CR3 register. Every set of page tables will have the kernel mapped (with kernel permissions), so that when you do a system call the kernel can access its own pages; user processes can't access those pages. When you do a context switch, you change the address in the CR3 register to point at the page tables of the process that will be executed. Because each process has a different set of page tables, they will each have a different view of memory. To make sure that no two processes have access to the same physical memory, you should have some kind of physical memory manager that can be queried for a brand-new area of memory that is not yet mapped in any other page table.
So as long as each process struct keeps track of its own page table structure, the only CPU-level state you will have to modify is the CR3 register.
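As a rough illustration, here is a minimal sketch of that idea in C. The process struct and its page_directory_phys field are hypothetical, not taken from any particular kernel:

#include <stdint.h>

/* Hypothetical per-process bookkeeping: each process remembers the
   physical address of its own top-level page directory. */
struct process {
    uintptr_t page_directory_phys;   /* value to load into CR3 */
    /* ... saved registers, kernel stack pointer, etc. ... */
};

/* Switch the CPU to the address space of `next`. On 32-bit x86 this is a
   single privileged MOV to CR3; it also flushes the non-global TLB entries,
   so the new process's mappings take effect immediately. */
static inline void switch_address_space(struct process *next)
{
    asm volatile("mov %0, %%cr3" : : "r"(next->page_directory_phys) : "memory");
}

Loading CR3 is a privileged operation, so only ring-0 code can perform this switch.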

It appears that the Global Descriptor Table (GDT) provides a segmentation mechanism that can be used in conjunction with Paging, but is now considered legacy.
By loading the page directory address into the CR3 control register, the Ring 3 process is restricted to the linear memory defined by the paging mechanism. CR3 can only be changed from Ring 0:
In protected mode, the 2 CPL bits in the CS register indicate which ring/privilege level the CPU is on.
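As an illustration, a minimal sketch of reading the CPL from CS (a generic helper, not from any particular codebase):

#include <stdint.h>

/* The low two bits of the CS selector are the Current Privilege Level:
   0 in the kernel (ring 0), 3 in user space (ring 3). */
static inline unsigned int current_cpl(void)
{
    uint16_t cs;
    asm volatile("mov %%cs, %0" : "=r"(cs));
    return cs & 0x3;
}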
More here:
https://forum.osdev.org/viewtopic.php?f=1&t=31835
https://wiki.osdev.org/Paging
https://sites.google.com/site/masumzh/articles/x86-architecture-basics/x86-architecture-basics
https://en.wikipedia.org/wiki/X86_memory_segmentation
https://software.intel.com/en-us/download/intel-64-and-ia-32-architectures-sdm-combined-volumes-1-2a-2b-2c-2d-3a-3b-3c-3d-and-4

Related

Customizing L2 Cache Sharing between Cores

I am trying to create a multi-chip multi-processor design where the L2 Caches are private to each chip. For example I am trying to create the following configuration:
2 Chips each containing 2 CPU Cores
Each Chip has 2 CPU Cores (each having its own L1 Cache) and a single L2 Cache shared between the two CPUs
Finally I will have the Main Memory shared between the 2 Chips
I am using the MOESI_CMP_directory protocol to generate the design, and I am using garnet2.0 to create the topology. What I have understood is that all 4 CPUs share the two L2 caches, but I want the L2 cache to be private to each chip. Is there any way to do that in gem5?
Additional Info:
I checked the memory addresses and the accessed caches through RubyNetwork to confirm that L1-Cache0 accesses L2-Cache0 as well as L2-Cache1. It seems the protocol is working correctly, because the L2 cache, being the last-level cache, is shared. But I was wondering if I could make some customization so that L1-Cache0/1 requests only go to L2-Cache0 and not L2-Cache1.
I think I know how to resolve this. There are two files which will need modification for this:
src/mem/ruby/protocol/MESI_Two_Level-L1Cache.sm - In this file the coherence messages from the L1 cache are sent through "actions". The function that controls the mapping, i.e. which L2 cache node receives the coherence request, is "mapAddressToRange". This function is passed certain parameters in the .sm file and can be modified.
src/mem/ruby/slicc_interface/RubySlicc_ComponentMapping.hh - This file contains the implementation of the function "mapAddressToRange", and we can make modifications here as per our requirements.

The meaning of SUP bit in page tables

One of the page table entry attributes is the SUP bit.
I read in a couple of documents that:
"if the SUP is set then only a process in kernel mode can access that page. Versus if it is not set then a process in user mode can access it."
I find this statement confusing, because a process in kernel mode can be either a process within which a user program is running or a kernel program (user process vs. system process). So which does the statement refer to? Or is it both, as long as the process is currently executing in kernel mode?
If this statement also refers to processes within which a user program is running (user processes), and we already know that such memory accesses can only be made once the process switches into kernel mode, then there is no need to have the SUP bit.
My guess is that the SUP bit is meant to say that this page is accessed only by system processes (excluding user processes running in kernel mode), but I am not sure, as I don't have knowledge about how the kernel code is stored in memory, whether it is paged, and how.
Whenever you have any doubt about the workings of an Intel CPU, consult the manuals, not any random internet page1.
Paging access rights are described in Section 4.6.
The CPU distinguishes between the privilege of an address and the privilege of an access (to an address); each privilege is either user-mode or supervisor-mode (where supervisor-mode is intended to be the more privileged).
Access mode
Every access to a linear address is either a supervisor-mode access or a user-mode access. For all instruction
fetches and most data accesses, this distinction is determined by the current privilege level (CPL): accesses made
while CPL < 3 are supervisor-mode accesses, while accesses made while CPL = 3 are user-mode accesses.
Some operations implicitly access system data structures with linear addresses; the resulting accesses to those
data structures are supervisor-mode accesses regardless of CPL.
[...]
All these accesses are called implicit supervisor-mode
accesses regardless of CPL. Other accesses made while CPL < 3 are called explicit supervisor-mode accesses.
So when a program accesses a memory location, its CPL determines the access mode: user programs run at CPL = 3, thus they only perform user-mode accesses.
The kernel instead performs supervisor-mode accesses, as it runs at CPL = 0.
Address mode
Access rights are also controlled by the mode of a linear address as specified by the paging-structure entries
controlling the translation of the linear address. If the U/S flag (bit 2) is 0 in AT LEAST ONE of the paging-structure
entries, the address is a supervisor-mode address. Otherwise, the address is a user-mode address.
The SUP bit, formally known as U/S, then determines the mode of an address.
Since it is present in the PDE too (not only in the PTE), the idea is to take the more restrictive setting: a U/S flag cleared to 0 in any one entry along the translation suffices to make the address a supervisor-mode one.
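For concreteness, a minimal sketch of that rule for classic 32-bit, non-PAE paging; the bit positions below are the standard x86 PDE/PTE flag layout:

#include <stdbool.h>
#include <stdint.h>

#define PG_PRESENT (1u << 0)   /* P:   entry is valid */
#define PG_RW      (1u << 1)   /* R/W: 1 = writable */
#define PG_US      (1u << 2)   /* U/S: 1 = user, 0 = supervisor-only */

/* An address is a user-mode address only if U/S is set in *both* the PDE
   and the PTE; a 0 at either level makes it a supervisor-mode address. */
static bool is_user_mode_address(uint32_t pde, uint32_t pte)
{
    return (pde & PG_US) && (pte & PG_US);
}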
Access rights
A user-mode access to a supervisor-mode address is always forbidden and an exception will be generated on attempt.
An access to a same-mode address2 and an access to a lower-mode address3 are generally permitted; they are not equal4, though, and there are a variety of flags altering the behavior of the CPU5.
The idea is that supervisor-mode accesses can do whatever they want, and to reduce the attack surface available to exploiters there are a few mechanisms to lower the privileges of an access.
1 Including this one.
2 User-mode access to user-mode address, supervisor-mode access to supervisor-mode address.
3 Supervisor-mode access to user-mode address.
4 Supervisor accesses can write to read-only pages.
5 For instance, the CR0.WP flag disables write accesses to read-only pages for supervisor accesses, and the NXE bit disables fetching from a page with XD set.
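As an illustration of footnote 5, a minimal ring-0 sketch that turns on CR0.WP (bit 16) using inline assembly; a generic example, not taken from any particular kernel:

/* With CR0.WP = 1 the CPU honors the read-only bit of page-table entries
   even for supervisor-mode writes, which would otherwise be allowed to
   write to read-only pages. */
static inline void enable_cr0_wp(void)
{
    unsigned long cr0;
    asm volatile("mov %%cr0, %0" : "=r"(cr0));
    cr0 |= 1UL << 16;                      /* set CR0.WP */
    asm volatile("mov %0, %%cr0" : : "r"(cr0) : "memory");
}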
It's just a check for whether the CPU is in ring 0 or not. The CPU doesn't really know about processes: it doesn't matter how you got into ring 0, just that the CPU is currently executing kernel code (i.e. could run privileged instructions).
See Margaret's more detailed answer for the full details.
And yes, all access to memory, even inside the kernel, is done by mapping it to a virtual address. Kernels don't disable paging temporarily to access a specific physical address. Note that Linux (and many other kernels) keep kernel pages locked into memory and don't swap them out to disk, but they are still paged.

Migrate memory pages of a running process

On a NUMA machine, is it possible to migrate memory pages of a running process to one node?
P.S.: I know taskset can change the affinity at runtime, but there's no documentation that says how the already allocated memory pages are affected.
numactl only works when creating a process, as far as I know.
There is such a call in the libnuma library (numactl package, since 2003): http://linux.die.net/man/3/numa
void numa_tonode_memory(void *start, size_t size, int node);
numa_tonode_memory() puts memory on a specific node.
It may be implemented with the mbind call with the MPOL_MF_MOVE option: http://man7.org/linux/man-pages/man2/mbind.2.html
mbind - set memory policy for a memory range
If MPOL_MF_MOVE is specified in flags, then the kernel will attempt
to move all the existing pages in the memory range so that they
follow the policy.
https://www.kernel.org/doc/Documentation/vm/page_migration
Page migration allows a process to manually relocate the node on which its
pages are located through the MF_MOVE and MF_MOVE_ALL options while setting
a new memory policy via mbind().
Or with move_pages: http://man7.org/linux/man-pages/man2/move_pages.2.html
"move_pages - move individual pages of a process to another node"

Why does the address space need to be preserved when switching the CPU from one process to another?

I have read in the Galvin book that switching the CPU from one process to another requires preserving the address space of the current process. Why does this address space need to be preserved?
By address space I think you are trying to ask why the page table of a process needs to be saved when there is a context switch.
Well, imagine that when a process is context-switched, virtual page 100 is mapped to physical page 400. This information is saved in the page table corresponding to this process. If this table is not saved when it is context-switched, then the next time this process is scheduled to run, how will we know where virtual page 100 is mapped in physical memory? Saving the page table gives you this information about the virtual-to-physical address mappings.
In reality, what happens is that when a context switch takes place, a register on x86 (CR3) holds a pointer to the page table; it is made to point to the new process's table on a context switch, so that the virtual-to-physical mappings of the new process are available when we do address translations.

Is it possible to "wake up" linux kernel process from user space without system call?

I'm trying to modify a kernel module that manages a special piece of hardware.
The user-space process performs 2 ioctl() system calls per millisecond to talk with the module. This doesn't meet my real-time requirements, because the two syscalls sometimes take too long to execute and overrun my time slot.
I know that with mmap I could share a memory area, and this is great, but how can I synchronize the data exchange with the module without ioctl()?