Relationship between addressability/address space - system

A question was posed: if a computer has n-bit addressability and uses m bits to access a memory location, what is the size of the memory in bytes? I am confused about the relationship between the two; I believe you would multiply m * n to get the total bytes of memory.

2^m * n bits
With m address bits one can address 2^m distinct locations. Each location holds n bits (that is the addressability), so the total memory is 2^m * n bits, or 2^m * n / 8 bytes.
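A minimal sketch of the relationship, taking n as the number of bits per addressable location and m as the width of an address (as in the question):

```python
def memory_size_bytes(n_addressability_bits, m_address_bits):
    # m address bits -> 2^m addressable locations,
    # each location holds n bits; 8 bits per byte.
    total_bits = (2 ** m_address_bits) * n_addressability_bits
    return total_bits // 8

# e.g. byte (8-bit) addressability with a 16-bit address -> 64 KiB
print(memory_size_bytes(8, 16))  # 65536
```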

Related

What is the bias exponent?

My question concerns the IEEE 754 Standard
So I'm going through the steps necessary to convert denary (decimal) numbers into floating-point numbers (IEEE 754 standard), but I don't understand the purpose of determining the biased exponent. I can't get my head around that step, what it is exactly, and why it's done.
Could anyone explain what this is? Please keep in mind that I have just started a computer science conversion master's, so I won't completely understand certain choices of terminology!
If you think it's very long to explain, please point me in the right direction!
The exponent in an IEEE 32-bit floating point number is 8 bits.
In most cases where we want to store a signed quantity in an 8-bit field, we use a signed representation. 0x80 is -128, 0xFF is -1, 0x00 is 0, up to 0x7F is 127.
But that's not how the exponent is represented. The exponent field is treated as an unsigned 8-bit number that is 127 too large: look at the unsigned value in the exponent field and subtract 127 to get the actual exponent. So 0x00 represents -127, and 0x7F represents 0.
For 64-bit floats, the field is 11 bits, with a bias of 1023, but it works the same.
A floating point number could be represented (but is not) as a sign bit s, an exponent field e, and a mantissa field m, where e is a signed integer and m is an unsigned fraction. The value of that number would then be computed as (-1)^s · 2^e · m. But this would not allow representing important special cases.
Note that one could increase the exponent by ±n and shift the mantissa right by ±n bits without changing the value of the number. This allows nearly all numbers to have their exponent adjusted so that the mantissa starts with a 1 (one exception is of course 0, a special FP number). Numbers in this form are called normalized FP numbers, and since the mantissa now always starts with a 1, the leading 1 does not have to be stored in memory; the saved bit is used to increase the precision of the FP number. Thus, no mantissa m is stored but a mantissa field mf.
But how is 0 now represented? And what about FP numbers that already have the maximum or minimum exponent and, due to their normalization, cannot be made larger or smaller? And what about the "not a number" values (NaNs) that are e.g. the result of 0/0?
Here comes the idea of a biased exponent: if half of the maximum exponent value is added to the exponent, one gets the biased exponent be. To compute the value of the FP number, this bias of course has to be subtracted again. All normalized FP numbers now have 0 < be < (all 1), so the special biased exponents 0 and (all 1) can be reserved for special purposes.
be = 0, mf = 0: Exact 0.
be = 0, mf ≠ 0: A denormalized number, i.e. mf is the real mantissa that does not have a leading 1.
be = (all 1), mf = 0: Infinity
be = (all 1), mf ≠ 0: Not a number (NaN)
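One way to see the bias and the special cases in action is to pull the fields out of a float's bit pattern (a sketch using Python's struct module; the 32-bit layout is sign : 1, exponent : 8, mantissa : 23):

```python
import struct

def decode_float32(x):
    # Reinterpret the float's 32 bits as an unsigned integer.
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    be = (bits >> 23) & 0xFF      # biased exponent field
    mf = bits & 0x7FFFFF          # mantissa field (without the hidden 1)
    return sign, be, be - 127, mf

print(decode_float32(1.0))            # (0, 127, 0, 0): actual exponent 0
print(decode_float32(0.5))            # (0, 126, -1, 0): actual exponent -1
print(decode_float32(float("inf")))   # be = 255 (all 1), mf = 0
```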

Encoding - Efficiently send sparse boolean array

I have a 256 x 256 boolean array. This array is constantly changing, and set bits are practically randomly distributed.
I need to send a current list of the set bits to many clients as they request them.
The following numbers are approximations.
If I send the coordinates for each set bit:
set bits data transfer (bytes)
0 0
100 200
300 600
500 1000
1000 2000
If I send the distance (scanning from left to right) to the next set bit:
set bits data transfer (bytes)
0 0
100 256
300 300
500 500
1000 1000
The typical number of bits that are set in this sparse array is around 300-500, so the second solution is better.
Is there a way I can do better than this without much added processing overhead?
Since you say "practically randomly distributed", let's assume that each location is a Bernoulli trial with probability p, where p is chosen to match the fill rate you expect. You can think of the length of a "run" (your option 2) as the number of Bernoulli trials needed to get a success; this number of trials follows the geometric distribution (with probability p). http://en.wikipedia.org/wiki/Geometric_distribution
What you've done so far in option #2 is to recognize the maximum length of a run for each value of p and reserve enough bits to send any of them. Note that this maximum length is only probabilistic, and the scheme will fail if you get REALLY unlucky and all your set bits are clustered at the beginning and end.
As @Mike Dunlavey recommends in the comments, Huffman coding, or some other form of entropy coding, can redistribute the bits spent according to the frequency of each run length. That is, short runs are much more common, so use fewer bits to send those lengths. The theoretical limit for this encoding efficiency is the "entropy" of the distribution, which you can look up on that Wikipedia page and evaluate for different probabilities. In your case, this entropy ranges from 7.5 bits per run (for 1000 entries) to 10.8 bits per run (for 100).
Actually, this means you can't do much better than you're currently doing for the 1000-entry case: 8 bits = 1 byte per value. For the case of 100 entries, you're currently spending 20.5 bits per run instead of the theoretically possible 10.8, so that end has the highest chance for improvement. And in the case of 300, I think you haven't reserved enough bits to represent these sequences: the entropy comes out to 9.23 bits per run, and you're currently sending 8. You will find many cases where the gap between set bits exceeds 256, which will overflow your representation.
All of this, of course, assumes that things really are random. If they're not, you need a different entropy calculation. You can always compute the entropy right out of your data with a histogram, and decide if it's worth pursuing a more complicated option.
Finally, also note that real-life entropy coders only approximate the entropy. Huffman coding, for example, has to assign an integer number of bits to each run length. Arithmetic coding can assign fractional bits.
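The geometric-distribution entropy figures above can be checked numerically (a sketch; p is the fill rate of the 256 x 256 array):

```python
import math

def geometric_entropy_bits(p):
    # Entropy of the geometric distribution, in bits per run:
    # H(p) = (-(1-p)*log2(1-p) - p*log2(p)) / p
    q = 1.0 - p
    return (-q * math.log2(q) - p * math.log2(p)) / p

for set_bits in (100, 300, 500, 1000):
    p = set_bits / (256 * 256)
    print(set_bits, round(geometric_entropy_bits(p), 2))
```

This reproduces roughly 10.8 bits per run at 100 set bits down to about 7.5 at 1000, matching the figures quoted above.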

I/O Disk Drive Calculations

So I am studying for an upcoming exam; one of the questions involves calculating various disk drive properties. I have spent a fair while researching sample questions and formulas, but because I'm a bit unsure about what I have come up with, I was wondering if you could help confirm my formulas and answers.
Information Provided:
Rotation Speed = 6000 RPM
Surfaces = 6
Sector Size = 512 bytes
Sectors / Track = 500 (average)
Tracks / Surface = 1,000
Average Seek Time = 8ms
One Track Seek Time = 0.4 ms
Maximum Seek Time = 10ms
Questions:
Calculate the following
(i) The capacity of the disk
(ii) The maximum transfer rate for a single track
(iii) Calculate the amount of cylinder skew needed (in sectors)
(iv) The Maximum transfer rate (in bytes) across cylinders (with cylinder skew)
My Answers:
(i) Sector Size x Sectors per Track x Tracks per Surface x No. of surfaces
512 x 500 x 1000 x 6 = 1,536,000,000 bytes
(ii) Sectors per Track x Sector Size x Rotations per second
500 x 512 x (6000/60) = 25,600,000 bytes per sec
(iii) (Track-to-track seek time / Time for one rotation) x Sectors per Track + 4
(0.4 ms / 10 ms) x 500 + 4 = 24
(iv) Really unsure about this one to be honest, any tips or help would be much appreciated.
I'm fairly sure a similar question will appear on my paper, so it really would be a great help if any of you could confirm my formulas and derived answers for this sample question. Also, if anyone could provide a bit of help on that last question, it would be great.
Thanks.
(iv) The Maximum transfer rate (in bytes) across cylinders (with cylinder skew)
500 sectors/track (1 revolution = 500 sectors) x 512 bytes/sector x 6 surfaces (reading across all 6 heads at once).
One rotation therefore yields 1,536,000 bytes across the 6 heads.
You are doing 6000 RPM, which is 6000/60 = 100 rotations per second,
so 153,600,000 bytes per second (divide by one million: 153.6 MB/s).
It takes 1/100th of a second, or 10 ms, to read a track,
and then you need a 0.4 ms shift of the heads to read the next track.
10.0/10.4 gives you a 96.2 percent effective read rate when moving the heads perfectly,
so you would be able to read at 96.2% of 153.6, or about 147.7 MB/s, optimally after the first seek,
where 1 MB = 1,000,000 bytes.
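The four calculations can be reproduced in a few lines (a sketch using the numbers from the question; the +4 settle margin in the skew formula is taken from the answer above):

```python
SECTOR_SIZE = 512            # bytes
SECTORS_PER_TRACK = 500
TRACKS_PER_SURFACE = 1000
SURFACES = 6
RPM = 6000
TRACK_SEEK_MS = 0.4

rotations_per_sec = RPM / 60                 # 100 rotations/s
rotation_ms = 1000 / rotations_per_sec       # 10 ms per rotation

# (i) capacity
capacity = SECTOR_SIZE * SECTORS_PER_TRACK * TRACKS_PER_SURFACE * SURFACES
# (ii) single-track transfer rate
track_rate = SECTORS_PER_TRACK * SECTOR_SIZE * rotations_per_sec
# (iii) cylinder skew in sectors, plus the 4-sector settle margin
skew = TRACK_SEEK_MS / rotation_ms * SECTORS_PER_TRACK + 4
# (iv) rate across cylinders: 6 heads, derated by the track-to-track seek
cylinder_rate = track_rate * SURFACES * rotation_ms / (rotation_ms + TRACK_SEEK_MS)

print(capacity)               # 1536000000
print(track_rate)             # 25600000.0
print(skew)                   # 24.0
print(round(cylinder_rate))   # about 147.7 MB/s
```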

What's the best way to store a bi-dimensional sparse array (2D sparse matrix)? How much space will it take in VoltDB?

Question one: Are there specialized databases for storing dense and sparse matrices? I googled but didn't find any...
The matrix in question is huge (10^5 by 10^5) but it's sparse, which means that most of its values are zeros and I only need to store the non-zero values. So I thought of making a table like this:
2D Matrix
---------------
X Y val
---------------
1 2 4.2
5 1 91.0
9 3 139.1
And so on: 3 columns, two for the coordinates and a third for the value of that cell in the sparse matrix. Question 2: Is this the best way to store a sparse matrix? I also considered MongoDB, but it seems that making one document per cell of the matrix would be too much overhead. Table-oriented databases are slow, but I can use VoltDB :) Side note: I thought of a Redis Hash but can't make it bi-dimensional (I found a way to serialize 2D matrices to 1D, so that I can store them in a Redis Hash or even a List).
Question 3: How many bytes per row will VoltDB use? The coordinates will be integers ranging from 0 to 10^5, maybe more; the values of the cells will be floats.
Regarding Question 3, based on your example, the X and Y columns could be the INTEGER datatype in VoltDB, which is 4 bytes. The value column could be a FLOAT datatype, which is 8 bytes.
Each record would therefore be 16 bytes, so the nominal size in memory would be 16 bytes * row count. In general, you add 30% for overhead, and then 1GB per server for heap size to determine the overall memory needed. See the references below for more detail.
You will probably want to index this table, so assuming you wanted a compound index of (x,y), the size would be as follows:
Tree index:
(sum-of-column-sizes + 8 + 32) * rowcount
Hash index:
(((2 * rowcount) + 1) * 8) + ((sum-of-column-sizes + 32) * rowcount)
sum-of-column-sizes for (x,y) is 8 bytes.
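Plugging the formulas above into a quick sizing calculation (a sketch; the 30% overhead figure comes from the Planning Guide, and rowcount is whatever your non-zero cell count turns out to be):

```python
def voltdb_size_estimate(rowcount, row_bytes=16, key_bytes=8, overhead=0.30):
    # Nominal table size (16 bytes/row) plus the general 30% overhead.
    table_bytes = row_bytes * rowcount * (1 + overhead)
    # Index formulas quoted above, for a compound (x, y) key of 8 bytes.
    tree_index = (key_bytes + 8 + 32) * rowcount
    hash_index = ((2 * rowcount) + 1) * 8 + (key_bytes + 32) * rowcount
    return table_bytes, tree_index, hash_index

# e.g. 1,000,000 non-zero cells
table, tree, hashed = voltdb_size_estimate(1_000_000)
print(round(table), tree, hashed)  # 20800000 48000000 56000008
```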
References:
The available datatypes are listed in Appendix A of Using VoltDB:
http://community.voltdb.com/docs/UsingVoltDB/ddlref_createtable#TabDatatypes
Guidelines and formulas for estimating the memory size are in the VoltDB Planning Guide:
http://community.voltdb.com/docs/PlanningGuide/ChapMemoryRecs
The two most relevant questions are 1) how sparse? and 2) how do you want to use the data?
First, I'm assuming you want to read/write/process the data from within the database. If not, then there are many sparse matrix encodings that could even be packed into a blob and optionally compressed.
Assuming your data is fairly sparse, and assuming you want to use the data within the database, the x,y,value tuple storage is probably best. Chapter 4 of the VoltDB Planning Guide covers estimating memory usage and sizing your hardware.
http://community.voltdb.com/docs/PlanningGuide/ChapMemoryRecs
The short answer is that tables with numeric data are packed pretty tight. With SMALLINT coordinates you've got 12 bytes of real data per row (short, short, double); note that coordinates up to 10^5 would need INTEGER, giving 16 bytes as in the previous answer. You should see an average of just over 1 byte beyond that in overhead per row. You'll also need to add in the size of any indexes. The documentation describes the worst case; I think for an index on the two coordinate columns, the storage per key will be close to 28 bytes, including overhead.

Shared Memory Bank Conflicts in CUDA: How memory is aligned to banks

As far as my understanding goes, shared memory is divided into banks, and simultaneous accesses by multiple threads to different words in the same bank cause a conflict (while accesses to the same word are broadcast).
At the moment I allocate a fairly large array which conceptually represents several pairs of two matrices:
__shared__ float A[34*N];
Where N is the number of pairs, the first 16 floats of a pair are one matrix, and the following 18 floats are the second.
The thing is, access to the first matrix is conflict-free, but access to the second one has conflicts. These conflicts are unavoidable; however, my thinking is that because the second matrix is 18 floats long, all subsequent matrices will be misaligned with respect to the banks, and therefore more conflicts than necessary will occur.
Is this true, if so how can I avoid it?
Every time I allocate shared memory, does it start at a new bank? So potentially could I do
__shared__ float Apair1[34];
__shared__ float Apair2[34];
...
Any ideas?
Thanks
If your pairs of matrices are stored contiguously, and if you are accessing the elements linearly by thread index, then you will not have shared memory bank conflicts.
In other words if you have:
A[0]  <- mat1 element1
A[1]  <- mat1 element2
A[2]  <- mat1 element3
...
A[15] <- mat1 element16
A[16] <- mat2 element1
A[17] <- mat2 element2
...
A[33] <- mat2 element18
And you access this using:
float element;
element = A[pairindex * 34 + matindex * 16 + threadIdx.x];
Then adjacent threads are accessing adjacent elements in the matrix and you do not have conflicts.
In response to your comments (below) it does seem that you are mistaken in your understanding. It is true that there are 16 banks (in current generations, 32 in the next generation, Fermi) but consecutive 32-bit words reside in consecutive banks, i.e. the address space is interleaved across the banks. This means that provided you always have an array index that can be decomposed to x + threadIdx.x (where x is not dependent on threadIdx.x, or at least is constant across groups of 16 threads) you will not have bank conflicts.
When you access the matrices further along the array, you still access them in a contiguous chunk and hence you will not have bank conflicts. It is only when you start accessing non-adjacent elements that you will have bank conflicts.
The reduction sample in the SDK illustrates bank conflicts very well by building from a naive implementation to an optimised implementation, possibly worth taking a look.
Banks are set up such that each successive 32-bit word is in the next bank. So, if you declare an array of 4-byte floats, each subsequent float in the array will be in the next bank (modulo 16 or 32, depending on your architecture). I'll assume you're on compute capability 1.x, so you have 16 banks.
If you have arrays of 18 and 16, things can be funny. You can avoid bank conflicts in the 16x16 array by declaring it like
__shared__ float sixteen[16][16+1];
which avoids bank conflicts when accessing transpose elements using threadIdx.x (as I assume you're doing if you're getting conflicts). When accessing elements in, say, the first column of an unpadded 16x16 matrix, they all reside in the same bank. What you want is for each of them to be in a successive bank, and the padding does this for you. You treat the array exactly as you would before, as sixteen[row][column], or similarly for a flattened matrix, as sixteen[row*(16+1)+column] if you want.
For the 18x18 case, when accessing in the transpose, you're moving at an even stride (18 mod 16 = 2), which causes conflicts. The answer again is to pad by 1.
__shared__ float eighteens[18][18+1];
So now, when you access in the transpose (say, elements in the first column), successive threads advance through the banks with stride (18+1) % 16 = 3, hitting banks 3, 6, 9, 12, 15, 2, 5, 8, etc., so you should get no conflicts.
The particular alignment shift due to having a matrix of size 18 isn't the problem, because the starting point of the array makes no difference; it's only the order in which you access it that matters. If you want to flatten the arrays I've proposed above and merge them into one, that's fine, as long as you access them in a similar fashion.
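The bank arithmetic is easy to sanity-check outside CUDA (a sketch in Python, assuming 16 banks of 32-bit words as on compute capability 1.x):

```python
NUM_BANKS = 16

def column_banks(row_stride, rows=16, col=0):
    # Bank hit by each thread when thread i reads element [i][col]
    # of a row-major float array with the given row stride.
    return [(i * row_stride + col) % NUM_BANKS for i in range(rows)]

# Unpadded 16x16: every thread hits the same bank -> a 16-way conflict.
print(column_banks(16))          # [0, 0, 0, ..., 0]
# Padded 16x17: threads hit 16 distinct banks -> conflict-free.
print(sorted(column_banks(17)))  # [0, 1, 2, ..., 15]
# Padded 18x19: stride 19 mod 16 = 3, and gcd(3, 16) = 1 -> no conflicts.
print(len(set(column_banks(19))))  # 16 distinct banks
```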