I have an assignment with the following prompt:
The page size for a virtual memory system is 8KB.
The instruction TLB is direct-mapped with 2 sets and each block contains one translation.
(I don't believe this is relevant for the following 3 questions, as there are two more questions about the TLB.)
The number of bits in a virtual address is 20.
The number of bits in a physical address is 15.
(1) What is the number of virtual pages?
I think I have this one figured out.
Page size = 8 * 2^10 = 2^13 = 8192 bytes, so the offset is 13 bits.
Virtual page number = 20 - 13 = 7 bits
Virtual pages = 2^7 pages
(2) What is the number of physical pages?
Here's where I'm a little confused. I think I'm supposed to add the valid, dirty, and reference bits to the physical page number (which is 2 bits, from 15 - 13). However, 5 bits * 2^7 entries = 640 bits, which seems incredibly small.
(3) How many bits are used in the virtual address for the page offset?
Answered above, it appears to be 13 bits.
Could anyone point me in the right direction? Thanks!
The valid, dirty, and reference bits live in a page table entry but are not part of the address bits. Therefore, using your results, there are 2^2 = 4 physical pages.
Yes, this does seem small, but realize that there is only 2^15 = 32K bytes of physical memory.
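To sanity-check the arithmetic, here is a tiny C sketch of the three calculations (the variable names are mine, not from the assignment):

#include <stdio.h>

int main(void)
{
    unsigned page_offset_bits = 13;                            /* log2(8 KB page) */
    unsigned virtual_pages  = 1u << (20 - page_offset_bits);   /* 2^7 = 128 */
    unsigned physical_pages = 1u << (15 - page_offset_bits);   /* 2^2 = 4   */
    unsigned physical_bytes = 1u << 15;                        /* 32 KB of physical memory */

    printf("virtual pages: %u, physical pages: %u, physical memory: %u bytes\n",
           virtual_pages, physical_pages, physical_bytes);
    return 0;
}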
I was doing some research over the weekend on some blockchain dev in the Solana blockchain and came across a construct called Compact-u16. The definition of this in the documentation says the following: "A compact-u16 is a multi-byte encoding of 16 bits. The first byte contains the lower 7 bits of the value in its lower 7 bits. If the value is above 0x7f, the high bit is set and the next 7 bits of the value are placed into the lower 7 bits of a second byte. If the value is above 0x3fff, the high bit is set and the remaining 2 bits of the value are placed into the lower 2 bits of a third byte.".
I have been coding for 30+ years. Maybe I'm just old school on this, but why is there a construct to store 16 bits of data in 3 bytes? This is just vastly inefficient from my standpoint. Is there a reason for this? On further research, I found a doc related to assembly instruction pointers, which referenced 7 instruction pointers that are useful for caching values when context switching in and out of the processor stack. But this construct is used for a web app platform. Like, literally, there is no reason that I have been able to find that justifies using 3 bytes to store 16 bits of data. If the developers wanted an elegant bit-mapping solution to compress space, why not just use a semaphore? Why create a brand-new construct that increases the storage requirements for the data by 33%?
What am I missing?
I had some similar confusion when reading the compact-u16 description. Based on the code for parsing them in the solana Python module, I believe they're doing something conceptually similar to UTF-8: storing the number in 1-3 bytes depending on its size.
Basically, instead of each byte holding 8 bits of the number, each byte holds 7 bits of the number plus a flag (the most significant bit) that indicates whether the number continues in the next byte. For the largest numbers they need an extra byte, but numbers less than 128 need only one byte. Since Solana seems to use these for storing the length of arrays, if array lengths are commonly less than 128 then they end up with fewer total bytes to transfer across all transactions.
Some examples I worked out for myself:
hex | compact-u16
--------+------------
0x0000 | [0x00]
0x0001 | [0x01]
0x007f | [0x7f]
0x0080 | [0x80 0x01]
0x3fff | [0xff 0x7f]
0x4000 | [0x80 0x80 0x01]
0xc000 | [0x80 0x80 0x03]
0xffff | [0xff 0xff 0x03]
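For concreteness, here is a minimal C sketch of an encoder that reproduces the table above (the function name is my own invention, not Solana's actual API):

#include <stdint.h>
#include <stddef.h>

/* Encode a 16-bit value as a compact-u16: 1-3 bytes, 7 value bits per byte,
   with the high bit of each byte flagging that another byte follows.
   Returns the number of bytes written to out. */
size_t encode_compact_u16(uint16_t value, uint8_t out[3])
{
    size_t n = 0;
    while (value > 0x7f) {
        out[n++] = (uint8_t)((value & 0x7f) | 0x80);  /* 7 bits + continuation flag */
        value >>= 7;
    }
    out[n++] = (uint8_t)value;                        /* final byte, high bit clear */
    return n;
}

For example, encode_compact_u16(0x0080, buf) writes [0x80 0x01], matching the table. Any array length below 128 costs a single byte, which is presumably the point: most lengths are small, so the average transaction shrinks even though the worst case takes 3 bytes.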
The machine has a 32-bit virtual address space and a 30-bit physical address space, with a virtual memory page size of 8 KB. The translation look-aside buffer (TLB) has 64 total entries and is 4-way set associative. The L1 data cache is 16 KB, 2-way set associative, with a 64-byte block/line size, and is physically tagged.
Can someone explain how I could calculate the tag bits in the TLB?
And how many tag bits will be required for each L1 cache block?
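No answer was posted here, but for what it's worth, here is one way to work through the numbers as a C sketch, assuming the standard set-associative address split (tag = address bits minus index bits minus offset bits):

#include <stdio.h>

/* log2 for exact powers of two */
static unsigned lg2(unsigned x)
{
    unsigned b = 0;
    while ((1u << b) < x)
        b++;
    return b;
}

int main(void)
{
    unsigned va_bits = 32, pa_bits = 30;

    unsigned page_offset = lg2(8 * 1024);        /* 13-bit page offset */
    unsigned vpn_bits = va_bits - page_offset;   /* 19-bit virtual page number */

    unsigned tlb_sets  = 64 / 4;                 /* 64 entries, 4-way -> 16 sets */
    unsigned tlb_index = lg2(tlb_sets);          /* 4 index bits */
    unsigned tlb_tag   = vpn_bits - tlb_index;   /* 19 - 4 = 15 tag bits */

    unsigned block_offset = lg2(64);                   /* 6 block-offset bits */
    unsigned cache_sets   = (16 * 1024) / (64 * 2);    /* 128 sets */
    unsigned cache_index  = lg2(cache_sets);           /* 7 index bits */
    unsigned cache_tag    = pa_bits - cache_index - block_offset;  /* 30 - 7 - 6 = 17 */

    printf("TLB tag: %u bits, L1 tag: %u bits\n", tlb_tag, cache_tag);
    return 0;
}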
I am a novice in Xilinx HLS. I am following the tutorial ug871-vivado-high-level-synthesis-tutorial.pdf (page 77).
The code is
#define N 32

/* din_t and dout_t are typedefs that come from the tutorial's header file */
void array_io (dout_t d_o[N], din_t d_i[N])
{
    //..do something
}
After synthesis, I got a report showing the address port width.
I am confused: how has the width of the address port been automatically sized to match the number of addresses that must be accessed (5 bits for 32 addresses)?
Please help.
From UG871, it seems that the size of the array is from 0 to 16 samples, hence you need 32 addresses to access all values (see Figure 69). I guess that the number N is limited somewhere to be at most 32 (or to be exactly 16). This means that Vivado knows this limitation and generates only as many address bits as are needed. Most synthesis tools check the constraints on size and optimize unnecessary code away.
When you synthesize a function you also create some registers to store the variables. It means that the address you put as input is the address of the data that you are currently writing to d_o or reading from d_i.
In your case, where N=32, you have 32 different variables (in both input and output). To address 32 different variables you need 32 different bit combinations (to point to a specific one, without ambiguity). With 5 bits you have 2^5 = 32 different combinations: the minimum number of bits needed to address all your data.
For instance, whether those 32 variables are chars or doubles, the address port is still 5 bits wide: the number of address bits is INDEPENDENT of the size of the data (i.e. they can be int, float, char, short, double, arbitrary precision and so on).
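In other words, the tool computes ceil(log2(N)) address bits. A one-function C illustration of that rule (my own sketch, not Vivado's code):

/* Minimum address width for n distinct locations: ceil(log2(n)). */
unsigned addr_bits(unsigned n)
{
    unsigned bits = 0;
    while ((1u << bits) < n)
        bits++;
    return bits;  /* addr_bits(32) == 5, addr_bits(16) == 4 */
}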
I'm currently working on a project that involves a lot of bit-level manipulation of data, such as comparison, masking, and shifting. Essentially I need to search through chunks of bitstreams between 8 KB and 32 KB long for bit patterns between 20 and 40 bytes long.
Does anyone know of general resources for optimizing for such operations in CUDA?
There have been at least a couple of questions on SO about how to do text searches with CUDA, that is, finding instances of short byte-strings in long byte-strings. That is similar to what you want to do: a byte-string search is much like a bit-string search where the number of bits in the pattern can only be a multiple of 8, and the algorithm only checks for matches every 8 bits. Search SO for CUDA string searching or matching, and see if you can find them.
I don't know of any general resources for this, but I would try something like this:
Start by preparing 8 versions of each search bit-string, each shifted by a different number of bits. Also prepare start and end masks (a C sketch of this preparation step appears below):
start
01111111
00111111
...
00000001
end
10000000
11000000
...
11111110
Then, essentially, perform byte-string searches with the different bit-strings and masks.
If you're using a device with compute capability >= 2.0, store the shifted bit-strings in global memory. The start and end masks can probably just be constants in your program.
Then, for each byte position, launch 8 threads, each of which checks a different one of the 8 shifted bit-strings against the long bit-string (which you now treat as a byte-string). In each block, launch enough threads to check, for instance, 32 byte positions, so that the total number of threads per block becomes 32 * 8 = 256. The L1 cache should be able to hold the shifted bit-strings for each block, so you get good performance.
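To make the preparation step concrete, here is a host-side C sketch that builds the 8 shifted copies and the masks (MAXPAT and the function name are made up for this example):

#include <stdint.h>
#include <stddef.h>

#define MAXPAT 40   /* longest pattern in bytes, per the question */

/* Build the 8 bit-shifted copies of an n-byte pattern; each copy is n+1
   bytes so the bits shifted out of the last byte have somewhere to go.
   start_mask[s]/end_mask[s] select the valid bits of the first/last byte.
   For s == 0 no masking is needed (the pattern is byte-aligned). */
void prepare_patterns(const uint8_t *pat, size_t n,
                      uint8_t shifted[8][MAXPAT + 1],
                      uint8_t start_mask[8], uint8_t end_mask[8])
{
    for (unsigned s = 0; s < 8; s++) {
        uint8_t carry = 0;
        for (size_t i = 0; i < n; i++) {
            shifted[s][i] = carry | (uint8_t)(pat[i] >> s);
            carry = (uint8_t)(pat[i] << (8 - s));          /* bits spilling into the next byte */
        }
        shifted[s][n] = carry;
        start_mask[s] = (uint8_t)(0xffu >> s);             /* e.g. s=1 -> 01111111 */
        end_mask[s]   = (uint8_t)(0xffu << ((8 - s) % 8)); /* e.g. s=1 -> 10000000 */
    }
}

Each thread would then compare its shifted copy byte-by-byte against the text, applying the start mask to the first byte and the end mask to the last.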
Any comparison table available?
The basic change for the 430X architecture was to introduce a 20-bit address range to allow addressing outside the 64K available on the original 430 devices. There is a new set of instructions that operate on the 20-bit addresses in parallel with the old-style 16-bit instructions, e.g.:
CALL ; takes a 16 bit address
CALLA ; takes a 20 bit address
PUSH ; Push the bottom 16 bits of a register onto the stack
PUSHA ; Push the full 20 bit register
Existing code compiled for a 430-based processor will run within the bottom 64K address space of the 430X processor. In the development tools (IAR, and probably Rowley) you can specify a memory model so that the longer function calls and other 430X-specific instructions are not generated, if you ensure that your code does not cross the 64K boundary.
Wikipedia's usually good for this sort of thing. It looks like the point is to increase the address space to 1 MB on the 430X from 64K on the regular 430.
http://en.wikipedia.org/wiki/MSP430#MSP430X_20-bit_extension
The MSP430X extension has only a 20-bit address space, so CALLA takes only a 20-bit address.