How much memory does mergesort use?

Currently I am implementing a standard merge sort that requires O(n) auxiliary space.
My RAM is 8 GB. A text file of 1 million numbers (7.8 MB) sorts fine, but with a text file of 2 million numbers (15.6 MB) the program crashes with a segmentation fault.
My question: is there a way to calculate the maximum number of integers I can sort, and is my RAM in any way related to that maximum?
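The question doesn't include code, so this is only a guess, but note that 2 million ints is only about 8 MB, far below 8 GB of RAM, while the default stack size on many systems is around 8 MB. A segfault at exactly that size therefore suggests the arrays live on the stack. A minimal sketch of heap-allocating the input and the O(n) merge buffer instead (the element count is a placeholder):

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    size_t n = 2000000;                    /* placeholder element count */
    int *data = malloc(n * sizeof *data);  /* heap: limited by RAM, not stack */
    int *temp = malloc(n * sizeof *temp);  /* O(n) merge buffer, also on heap */
    if (!data || !temp) {
        /* this, not a segfault, is what hitting the RAM limit looks like */
        fprintf(stderr, "out of memory\n");
        return 1;
    }
    /* ... read the numbers into data, run merge sort using temp ... */
    free(temp);
    free(data);
    return 0;
}

With heap allocation the limit really is tied to available RAM: roughly free memory divided by 8 bytes per integer for a sort that keeps the input plus one O(n) buffer, and malloc reports failure instead of crashing.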

Mersenne Primes processing

I took an interest in Mersenne primes (https://www.mersenne.org/).
Great Internet Mersenne Prime Search (GIMPS) is doing the research in this field.
These are prime numbers, but they are very large and rare.
The 49th Mersenne prime is 22 million digits long. It is unbelievable that one number can be 22 million digits long.
I tried and could get up to the 8th Mersenne prime, which is 10 digits long and under 2 billion.
I am using Postgres BIGINT, which supports integers up to 19 digits long (about 9.2 × 10^18).
So, if I am processing 1 billion rows at a time, it would take me about 9 billion iterations.
I can further use the NUMERIC data type, which supports up to 131072 digits before the decimal point and 16383 digits after it. Of course I need to work with integers only; I do not need the precision.
Another alternative is Postgres's CHARACTER VARYING, which stores up to a billion characters, but it cannot be used for calculations.
What Postgres provides is enough for any practical needs.
My question is how the guys at GIMPS are calculating such large numbers. Are they storing these numbers in any database? Which database supports such large numbers? Am I out of sync with the progress made in the database world?
I know they have huge processing power; Curtis Cooper has mentioned that 700 servers are being used to discover and verify the numbers.
Exactly how much storage is needed? What language is being used?
Just curiosity. Does this sound like I am out of a job?
thanks
bb23850
Mersenne numbers are very easy to calculate. They are always one less than a power of 2:
select n, cast(power(cast(2 as numeric), n) - 1 as numeric(1000,0))
from generate_series(1, 100, 1) gs(n)
order by n;
The challenge is determining whether or not the resulting number is prime. Mersenne knew that n needs to be prime for the corresponding Mersenne number 2^n - 1 to be prime.
As fast as computers are, once the number has more than a couple dozen digits, an exhaustive search of all factors is not feasible. You can see from the above code that an exhaustive search becomes infeasible long before the 100th Mersenne number.
To determine whether such a number is prime, a lot of mathematics is used -- some of it invented for, or inspired by, this particular problem (for Mersenne numbers specifically, the Lucas-Lehmer test). I'm pretty sure it would be quite hard to implement any of those primality tests in a relational database.
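For a sense of what that test looks like, here is a sketch of Lucas-Lehmer in C. It only handles exponents up to 63 because it uses 128-bit intermediates (a GCC/Clang extension); GIMPS runs the same recurrence with FFT-based big-number multiplication on numbers millions of digits long:

#include <stdio.h>
#include <stdint.h>

/* Lucas-Lehmer test for M_p = 2^p - 1, with p an odd prime.
   s starts at 4; after p - 2 squarings mod M_p, the residue is 0
   if and only if M_p is prime. */
static int lucas_lehmer(unsigned p) {
    uint64_t m = (1ULL << p) - 1;        /* M_p, so p must be <= 63 here */
    unsigned __int128 s = 4;
    for (unsigned i = 0; i < p - 2; i++)
        s = (s * s + (m - 2)) % m;       /* s := s^2 - 2 (mod M_p) */
    return s == 0;
}

int main(void) {
    /* exponents of early Mersenne primes, plus 11 and 23 as counterexamples */
    unsigned exps[] = {3, 5, 7, 11, 13, 17, 19, 23, 31, 61};
    for (size_t i = 0; i < sizeof exps / sizeof *exps; i++)
        printf("M_%u is %s\n", exps[i],
               lucas_lehmer(exps[i]) ? "prime" : "composite");
    return 0;
}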

OpenCL (AMD GCN) global memory access pattern for vectorized data: strided vs. contiguous

I want to improve OpenCL kernel performance, and I'd like to clarify how memory transactions work and which memory access pattern is really better (and why).
The kernel is fed with vectors of 8 integers defined as an array, int v[8]; that means the entire vector must be loaded into GPRs before any computation is done. So, I believe the bottleneck of this code is the initial data load.
First, some theory basics.
The target HW is a Radeon RX 480/580, which has a 256-bit GDDR5 memory bus on which a burst read/write transaction has 8-word granularity; hence, one memory transaction reads 2048 bits, or 256 bytes. That, I believe, is what CL_DEVICE_MEM_BASE_ADDR_ALIGN refers to:
Alignment (bits) of base address: 2048.
Thus, my first question: what is the physical meaning of the 128-byte cache line? Does it keep the portion of data fetched by a single burst read but not actually requested? What happens with the rest if we request, say, 32 or 64 bytes, so that the leftover exceeds the cache line size? (I suppose it will just be discarded -- but then which part: head, tail...?)
Now back to my kernel. I think that cache does not play a significant role in my case, because one burst reads 64 integers, so one memory transaction can theoretically feed 8 work items at once, there is no extra data to read, and memory access is always coalesced.
But still, I can place my data with two different access patterns:
1) contiguous
a[i] = v[get_global_id(0) * get_global_size(0) + i];
(which is actually performed as)
*(int8*)a = *(int8*)v;
2) interleaved
a[i] = v[get_global_id(0) + i * get_global_size(0)];
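For illustration, here is roughly what the two patterns look like as complete kernels. This is only a sketch: the kernel names are made up, the out[] sum is a stand-in for the real computation, and it assumes the contiguous layout packs each work item's vector into 8 adjacent ints:

__kernel void load_contiguous(__global const int *in, __global int *out) {
    int v[8];
    size_t gid = get_global_id(0);
    for (int i = 0; i < 8; i++)
        v[i] = in[gid * 8 + i];    /* each work item reads 8 adjacent ints */
    out[gid] = v[0] + v[1] + v[2] + v[3] + v[4] + v[5] + v[6] + v[7];
}

__kernel void load_interleaved(__global const int *in, __global int *out) {
    int v[8];
    size_t gid = get_global_id(0);
    size_t gsz = get_global_size(0);
    for (int i = 0; i < 8; i++)
        v[i] = in[gid + i * gsz];  /* adjacent work items read adjacent ints */
    out[gid] = v[0] + v[1] + v[2] + v[3] + v[4] + v[5] + v[6] + v[7];
}

In the interleaved layout, adjacent work items (SIMD lanes) touch adjacent addresses at each step; in the contiguous layout, each work item reads a 32-byte chunk of its own.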
I expect that in my case the contiguous pattern would be faster because, as said above, one memory transaction can completely feed 8 work items with data. However, I do not know how the scheduler in the compute unit physically works: does it need all data to be ready for all SIMD lanes, or would just the first portion, for 4 parallel SIMD elements, be enough? Nevertheless, I suppose it is smart enough to fully supply at least one CU with data first, since CUs can execute command flows independently.
In the second case, by contrast, we need to perform 8 * global_size / 64 transactions to get a complete vector.
So, my second question: is my assumption right?
Now, the practice.
Actually, I split the entire task into two kernels because one part has less register pressure than the other and can therefore employ more work items. So first I played with the pattern in which the data is stored in the transition between the kernels (using vload8/vstore8 or casting to int8 gives the same result), and the result was somewhat strange: the kernel that reads data in the contiguous way works about 10% faster (both in CodeXL and by OS time measurement), but the kernel that stores data contiguously performs surprisingly slower. The overall time for the two kernels is then roughly the same. To my mind, both should behave at least the same way -- either both slower or both faster -- but these inverse results seemed inexplicable.
And my third question is: can anyone explain such a result? Or am I doing something wrong? (Or completely wrong?)
Well, this does not really answer all my questions, but some information found in the vastness of the internet put things together in a clearer way, at least for me (unlike the abovementioned AMD Optimization Guide, which seems unclear and sometimes confusing):
«the hardware performs some coalescing, but it's complicated...
memory accesses in a warp do not necessarily have to be contiguous, but it does matter how many 32 byte global memory segments (and 128 byte l1 cache segments) they fall into. the memory controller can load 1, 2 or 4 of those 32 byte segments in a single transaction, but that's read through the cache in 128 byte cache lines.
thus, if every lane in a warp loads a random word in a 128 byte range, then there is no penalty; it's 1 transaction and the reading is at full efficiency. but, if every lane in a warp loads 4 bytes with a stride of 128 bytes, then this is very bad: 4096 bytes are loaded but only 128 are used, resulting in ~3% efficiency.»
So, in my case it does not really matter how the data is read/stored, as long as it is always contiguous, but the order in which the parts of the vectors are loaded may affect the subsequent command flow (re)scheduling by the compiler.
I can also imagine that the newer GCN architecture can do cached/coalesced writes, which is why my results differ from those suggested by that Optimization Guide.
Have a look at chapter 2.1 of the AMD OpenCL Optimization Guide. It focuses mostly on older-generation cards, but the GCN architecture did not change completely, so it should still apply to your device (Polaris).
In general, AMD cards have multiple memory controllers, to which memory requests are distributed in every clock cycle. If you, for example, access your values in column-major instead of row-major order, your performance will be worse because the requests are sent to the same memory controller. (By column-major I mean that a column of your matrix is accessed together by all the work items executed in the current clock cycle; this is what you refer to as coalesced vs. interleaved.) If all work items access values within the same row in a single clock cycle (meaning coalesced access), those requests should be distributed to different memory controllers rather than the same one.
Regarding alignment and cache line sizes, I'm wondering whether this really helps improve performance. If I were in your situation, I would try to see whether I could optimize the algorithm itself, or whether I access the values often enough that it would make sense to copy them to local memory. But then again, it is hard to tell without any knowledge of what your kernels execute.
Best Regards,
Michael

Solitaire: storing guaranteed wins cheaply

Given a list of deals of Klondike Solitaire that are known to be winnable, is there a way to store a reasonable number of deals (say 10,000+) in a reasonable amount of space (say 5 MB) and retrieve them on command? (These numbers are arbitrary.)
I thought of using a pseudo-random generator where a given seed would generate a decimal string of digits, each pair of digits representing a card and the pair's index representing its position in the deal. In this case, you would only have to store the seed and the PRNG code.
The only cons I can think of are that (a) the number of possible deals is 52!, so the number of possible seeds would also have to be at least 52!, which would be monstrous to store in the higher number range, and (b) the generated string cannot repeat a two-digit pair (though repeats could simply be skipped during deck construction).
Given no prior information, the theoretical limit on how compactly you can represent an ordered deck of cards is 226 bits, since log2(52!) ≈ 225.58. Even the simple naive 6-bits-per-card encoding is only 312 bits, so you probably won't gain much by being clever.
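A throwaway sketch to check that figure (link with -lm):

#include <stdio.h>
#include <math.h>

int main(void) {
    /* log2(52!) = log2(2) + log2(3) + ... + log2(52) */
    double bits = 0.0;
    for (int k = 2; k <= 52; k++)
        bits += log2((double)k);
    /* prints: log2(52!) = 225.58, so 226 bits are needed */
    printf("log2(52!) = %.2f, so %.0f bits are needed\n", bits, ceil(bits));
    return 0;
}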
If you're willing to sacrifice a large part of the state space, you could use a 32- or 64-bit PRNG to generate the decks, and then reproduce each deck from its 32- or 64-bit initial PRNG state. But that limits you to 2^64 different decks out of the possible 2^225+.
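A sketch of that idea, assuming an arbitrary xorshift64 generator and a simple modulo shuffle (both are placeholder choices, and the modulo step has a tiny bias a real implementation might care about):

#include <stdio.h>
#include <stdint.h>

/* Any fixed 64-bit PRNG plus a Fisher-Yates shuffle turns one stored
   seed back into a full 52-card deal. */
static uint64_t xorshift64(uint64_t *s) {
    uint64_t x = *s;                 /* seed must be nonzero */
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return *s = x;
}

void deck_from_seed(uint64_t seed, int deck[52]) {
    for (int i = 0; i < 52; i++)
        deck[i] = i;                 /* 0..51 encode the 52 cards */
    for (int i = 51; i > 0; i--) {   /* Fisher-Yates shuffle */
        int j = (int)(xorshift64(&seed) % (uint64_t)(i + 1));
        int t = deck[i]; deck[i] = deck[j]; deck[j] = t;
    }
}

int main(void) {
    int deck[52];
    deck_from_seed(12345, deck);     /* the seed is all you need to store */
    for (int i = 0; i < 52; i++)
        printf("%d ", deck[i]);
    printf("\n");
    return 0;
}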
If you are asking hypothetically, I would say that you would need about 0.39 MB to store 10,000 deals. You need 6 bits to represent each card (assuming you number them 1-52), and you need to record their order, so 6 * 52 = 312 bits per deal. Multiply that by the number of deals, 312 * 10,000 = 3,120,000 bits, which is 390,000 bytes, or about 0.39 MB.

Find first one million prime numbers

I want to get the first one million prime numbers.
I know the way of finding small prime numbers. My problem is, how can I store such large numbers in simple data types such as long, int, etc?
Well, the millionth prime is less than 16 million, and with the amount of memory in today's computers an ordinary C array of 16 million booleans (you can use 1 byte for each) isn't that large...
So allocate your large array, fill it with trues, treat the first element as representing the integer 2 (i.e. index + 2 is the represented value), and implement the skip-n/set-false version of the standard sieve. Count the trues as you go, and when you get to 1 million you can stop.
There are other ways, but this has the merit of being simple, as the sketch below shows.
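A minimal sketch of that approach (the 16,000,000 bound works because the millionth prime is 15,485,863):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Sieve of Eratosthenes where index i represents the integer i + 2. */
int main(void) {
    const int LIMIT = 16000000;
    char *sieve = malloc(LIMIT);
    if (!sieve) return 1;
    memset(sieve, 1, LIMIT);              /* fill it with trues */

    int count = 0;
    for (int i = 0; i < LIMIT; i++) {
        if (!sieve[i]) continue;          /* already crossed out */
        long long p = i + 2LL;            /* the prime this index represents */
        if (++count == 1000000) {
            printf("prime #1000000 = %lld\n", p);
            break;
        }
        /* cross out p*p, p*p + p, ... (smaller multiples are already done) */
        for (long long j = p * p - 2; j < LIMIT; j += p)
            sieve[j] = 0;
    }
    free(sieve);
    return 0;
}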
You can allocate an array of 1,000,000 integers -- it is only four megabytes, a small amount by today's standards. Prime #1,000,000 should fit in a 32-bit integer (prime #500,000 is under 8,000,000, so 2,000,000,000 should be more than enough of a range for the first 1,000,000 primes).
You are more likely to encounter issues with time than with space in this computation. Remember that you can stop testing candidate divisors once you reach the square root of the candidate, and that you can use the primes found so far as your candidate divisors, as sketched below.
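A sketch of that trial-division approach:

#include <stdio.h>

#define N 1000000

static int primes[N];   /* 4 MB: the primes found so far double as divisors */

int main(void) {
    int count = 0;
    for (int cand = 2; count < N; cand++) {
        int is_prime = 1;
        /* test only prime divisors up to sqrt(cand) */
        for (int i = 0; i < count && primes[i] * primes[i] <= cand; i++) {
            if (cand % primes[i] == 0) { is_prime = 0; break; }
        }
        if (is_prime)
            primes[count++] = cand;
    }
    printf("prime #%d = %d\n", N, primes[N - 1]);
    return 0;
}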

What is the VInt in Lucene?

I want to know: what is the VInt in Lucene?
I read this article, but I don't understand what it is or where Lucene uses it.
Why doesn't Lucene use a simple integer or big integer?
Thanks.
VInt is extremely space efficient. It can theoretically save up to 75% of the space (one byte instead of four for small values).
In Lucene, many of the structures are lists of integers: for example, the list of documents containing a given term, and the positions (and offsets) of the terms within documents, among others. These lists form the bulk of the Lucene data.
Think of Lucene indices for millions of documents that need tens of GBs of space. Shrinking that space by more than half reduces the disk space requirements, but while saving disk space may not be a big win, given that disk space is cheap, the real gain comes from reduced disk I/O: reading VInt data takes less I/O than reading full-width integers, which automatically translates to better performance.
VInt refers to Lucene's variable-width integer encoding scheme. It encodes integers in one or more bytes, using only the low seven bits of each byte. The high bit is set on every byte except the last, which is how the length is encoded.
For your first question, the Lucene file format documentation (https://lucene.apache.org/core/3_0_3/fileformats.html) says:
A variable-length format for positive integers is defined where the high-order bit of each byte indicates whether more bytes remain to be read. The low-order seven bits are appended as increasingly more significant bits in the resulting integer value. Thus values from zero to 127 may be stored in a single byte, values from 128 to 16,383 may be stored in two bytes, and so on.
So, to store a list of n integers you would normally need, e.g., 4 * n bytes, but with VInt every number under 128 is stored in only 1 byte (and so on), saving a lot of memory.
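A sketch of the scheme in C, following the description above (the function names here are made up; Lucene's own implementation lives in its Java index I/O classes):

#include <stdio.h>
#include <stdint.h>

/* Encode: 7 payload bits per byte, least significant group first;
   high bit = 1 means "more bytes follow". Returns bytes written. */
int write_vint(uint32_t value, unsigned char *buf) {
    int n = 0;
    while (value >= 0x80) {
        buf[n++] = (unsigned char)(value & 0x7F) | 0x80;
        value >>= 7;
    }
    buf[n++] = (unsigned char)value;    /* last byte: high bit clear */
    return n;
}

/* Decode: accumulate the low 7 bits of each byte as increasingly
   significant bits until a byte with the high bit clear is seen. */
uint32_t read_vint(const unsigned char *buf, int *consumed) {
    uint32_t value = 0;
    int shift = 0, n = 0;
    unsigned char b;
    do {
        b = buf[n++];
        value |= (uint32_t)(b & 0x7F) << shift;
        shift += 7;
    } while (b & 0x80);
    *consumed = n;
    return value;
}

int main(void) {
    unsigned char buf[5];
    int used;
    int len = write_vint(16383, buf);   /* 16,383 fits in two bytes */
    printf("encoded in %d bytes, decoded back to %u\n",
           len, read_vint(buf, &used));
    return 0;
}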
VInt provides a compressed representation of integers, and Shashikant's answer already explains the requirements for and benefits of compression in Lucene.