hive running beyond physical memory limits - hive

I normally resolve this by allocating more memory, however sometimes I find mys elf allocating quite a lot and I wonder if there are other alternatives I can use to solve this, is there a way to make the map jobs smaller, perhaps in aggregate they use same memory, but not all the time
I am mostly trying to learn. I can provide an example but I am not asking about a query or a specific instance, more in general

Related

Is it better better to open or to read large matrices in Julia?

I'm in the process of switching over to Julia from other programming languages and one of the things that Julia will let you hang yourself on is memory. I think this is likely a good thing, a programming language where you actually have to think about some amount of memory management forces the coder to write more efficient code. This would be in contrast to something like R where you can seemingly load datasets that are larger than the allocated memory. Of course, you can't actually do that, so I wonder how does R get around that problem?
Part of what I've done in other programming languages is work on large tabular datasets, often converted over to a R dataframe or a matrix. I think the way this is handled in Julia is to stream data in wherever possible, so my main question is this:
Is it better to use readline("my_file.txt") to access data or is it better to use open("my_file.txt", "w")? If possible, wouldn't it be better to access a large dataset all at once for speed? Or would it be better to always stream data?
I hope this makes sense. Any further resources would be greatly appreciated.
I'm not an extensive user of Julia's data-ecosystem packages, but CSV.jl offers the Chunks and Rows alternatives to File, and these might let you process the files incrementally.
While it may not be relevant to your use case, the mechanisms mentioned in #Przemyslaw Szufel's answer are used other places as well. Two I'm familiar with are the TiffImages.jl and NRRD.jl packages, both I/O packages mostly for loading image data into Julia. With these, you can load terabyte-sized datasets on a laptop. There may be more packages that use the same mechanism, and many package maintainers would probably be grateful to receive a pull request that supports optional memory-mapping when applicable.
In R you cannot have a data frame larger than memory. There is no magical buffering mechanism. However, when running R-based analytics you could use a disk.frame package for that.
Similarly, in Julia if you want to process data frames larger than memory you need to use am appropriate package. The most reasonable and natural option in Julia ecosystem is JuliaDB.
If you want to do something more low-level solution have a look at:
Mmap that provides Memory-mapped I/O that exactly solves the issue of conveniently handling data too large to fit into memory
SharedArrays that offers a disk mapped array with implementation based on Mmap.
In conclusion, if your data is data frame based - try JuliaDB, otherwise have a look at Mmap and SharedArrays (look at the filename parameter)

When to use VK_IMAGE_LAYOUT_GENERAL

It isn't clear to me when it's a good idea to use VK_IMAGE_LAYOUT_GENERAL as opposed to transitioning to the optimal layout for whatever action I'm about to perform. Currently, my policy is to always transition to the optimal layout.
But VK_IMAGE_LAYOUT_GENERAL exists. Maybe I should be using it when I'm only going to use a given layout for a short period of time.
For example, right now, I'm writing code to generate mipmaps using vkCmdBlitImage. As I loop through the sub-resources performing the vkCmdBlitImage commands, should I transition to VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL as I scale down into a mip, then transition to VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL when I'll be the source for the next mip before finally transitioning to VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL when I'm all done? It seems like a lot of transitioning, and maybe generating the mips in VK_IMAGE_LAYOUT_GENERAL is better.
I appreciate the answer might be to measure, but it's hard to measure on all my target GPUs (especially because I haven't got anything running on Android yet) so if anyone has any decent rule of thumb to apply it would be much appreciated.
FWIW, I'm writing Vulkan code that will run on desktop GPUs and Android, but I'm mainly concerned about performance on the latter.
You would use it when:
You are lazy
You need to map the memory to host (unless you can use PREINITIALIZED)
When you use the image as multiple incompatible attachments and you have no choice
For Store Images
( 5. Other cases when you would switch layouts too much (and you don't even need barriers) relatively to the work done on the images. Measurement needed to confirm GENERAL is better in that case. Most likely a premature optimalization even then.
)
PS: You could transition all the mip-maps together to TRANSFER_DST by a single command beforehand and then only the one you need to SRC. With a decent HDD, it should be even best to already have them stored with mip-maps, if that's a option (and perhaps even have a better quality using some sophisticated algorithm).
PS2: Too bad, there's not a mip-map creation command. The cmdBlit most likely does it anyway under the hood for Images smaller than half resolution....
If you read from mipmap[n] image for creating the mipmap[n+1] image then you should use the transfer image flags if you want your code to run on all Vulkan implementations and get the most performance across all implementations as the flags may be used by the GPU to optimize the image for reads or writes.
So if you want to go cross-vendor only use VK_IMAGE_LAYOUT_GENERAL for setting up the descriptor that uses the final image and not image reads or writes.
If you don't want to use that many transitions you may copy from a buffer instead of an image, though you obviously wouldn't get the format conversion, scaling and filtering that vkCmdBlitImage does for you for free.
Also don't forget to check if the target format actually supports the BLIT_SRC or BLIT_DST bits. This is independent of whether you use the transfer or general layout for copies.

Load table mostly or partly into working memory for querying

I have a PostgreSQL read replica with the sole purpose to do some aggregate queries. Currently, a lot of I/O is going on in order to do the aggregates, even though there's a lot of memory available and since the instance serves only this purpose I was wondering if it is possible:
To simply tell PostgreSQL to cache most of the content of the table in order to speed up aggregate queries. Is that possible?
Per Craig's comment:
The pg_prewarm module provides a convenient way to load relation data into either the operating system buffer cache or the PostgreSQL buffer cache.
http://www.postgresql.org/docs/9.4/static/pgprewarm.html
Don't forget pg_prewarm is an extension, so you'll have to add it with:
create extension pg_prewarm;
.. a lot of I/O is going on in order to do the aggregates, even though there's a lot of memory available ..
You might also want to research some of the memory config options (http://www.postgresql.org/docs/9.4/static/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-MEMORY)

Is there any performance difference between creating an NSFileHandle for a large versus a small file?

This question strikes me as almost silly, but I just want to sanity check myself. For a variety of reasons, I'm welding together a bunch of files into a single megafile before packing this as a resource in my iOS app. I'm then using NSFileHandle to open the file, seek to the right place, and read out just the bytes I want.
Is there any performance difference between doing it this way and reading loose files? Or, supposing I could choose to use just one monolithic megafile, versus, say, 10 medium-sized (but still joined) files, is there any performance difference between "opening" the large versus a smaller file?
Since I know exactly where to seek to, and I'm reading just the bytes I want, I don't see how there could be a difference. But, hey -- Stranger things have proved to be. Thanks in advance!
There could be a difference if it was an extremely large number of files. Every open file uses up resources in memory (file handles, and the like), and on some storage devices, a file will take up an entire block even if it doesn't fill it. That can lead to wasted space in extreme cases. But in practice, it probably won't be a problem. To know for sure, you can profile your code and see if it's faster one way vs. the other, and see what sort of space it takes up on a typical device.

Scattered-write speed versus scattered-read speed on modern Intel or AMD CPUs?

I'm thinking of optimizing a program via taking a linear array and writing each element to a arbitrary location (random-like from the perspective of the CPU) in another array. I am only doing simple writes and not reading the elements back.
I understand that a scatted read for a classical CPU can be quite slow as each access will cause a cache miss and thus a processor wait. But I was thinking that a scattered write could technically be fast because the processor isn't waiting for a result, thus it may not have to wait for the transaction to complete.
I am unfortunately unfamiliar with all the details of the classical CPU memory architecture and thus there may be some complications that may cause this also to be quite slow.
Has anyone tried this?
(I should say that I am trying to invert a problem I have. I currently have an linear array from which I am read arbitrary values -- a scattered read -- and it is incredibly slow because of all the cache misses. My thoughts are that I can invert this operation into a scattered write for a significant speed benefit.)
In general you pay a high penalty for scattered writes to addresses which are not already in cache, since you have to load and store an entire cache line for each write, hence FSB and DRAM bandwidth requirements will be much higher than for sequential writes. And of course you'll incur a cache miss on every write (a couple of hundred cycles typically on modern CPUs), and there will be no help from any automatic prefetch mechanism.
I must admit, this sounds kind of hardcore. But I take the risk and answer anyway.
Is it possible to divide the input array into pages, and read/scan each page multiple times. Every pass through the page, you only process (or output) the data that belongs in a limited amount of pages. This way you only get cache-misses at the start of each input page loop.