Memory limit in Azure Data Lake Analytics - azure-data-lake

I have implemented a custom extractor for NetCDF files that loads the variables into arrays in memory before outputting them. Some of these arrays can be quite big, so I wonder what the memory limit in ADLA is. Is there a maximum amount of memory you can allocate?

Each vertex has 6GB available. Keep in mind that this memory is shared between the OS, the U-SQL runtime, and the user code running on the vertex.

In addition to Saveen's reply: please note that a row can contain at most 4 MB of data, so your SqlArray will also be limited by the maximum row size once you return it from your extractor.
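A common way to stay under that limit is to split an oversized variable across several rows and reassemble it downstream. U-SQL extractors are written in C#, so the Python snippet below is only a language-agnostic sketch of the chunking idea; the sizes and helper name are illustrative and not part of any ADLA API.

import struct

MAX_ROW_BYTES = 4 * 1024 * 1024                     # the 4 MB row limit mentioned above
ELEMENT_SIZE = struct.calcsize("d")                 # assume 8-byte doubles from a NetCDF variable
CHUNK_LEN = (MAX_ROW_BYTES // 2) // ELEMENT_SIZE    # leave headroom for other columns and overhead

def chunk_variable(values):
    # Yield slices of the variable that each fit comfortably into one row.
    for start in range(0, len(values), CHUNK_LEN):
        yield values[start:start + CHUNK_LEN]

Each chunk would then be emitted as its own row, together with an index column so the pieces can be reassembled in a later script.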

Related

Why can't I store un-serialized data structure on disk the same way I can store them in memory?

Firstly, I am assuming that data structures, like a hash map for example, can only be stored in memory but not on disk unless they are serialized. I want to understand why not.
What is holding us back from dumping the block of memory that stores the data structure directly onto disk, without any modifications?
Something like JSON could be thought of as a "serialized" Python dictionary. We can very well store JSON in files, so why not a dict?
You may ask: how would you represent non-string values like bools/objects on disk? I can argue "the same way you store them in memory". Am I missing something here?
Naming a few problems (a short sketch of the first two follows below):
Big endian vs. little endian makes reading data from disk depend on the architecture of the CPU, so if you just dumped it, you won't be able to read it again on a different device.
Items are not contiguous in memory: a list (or dictionary), for example, only contains pointers to things that exist "somewhere" in memory. You can only dump contiguous memory; otherwise you are only storing the locations in memory that the data happened to occupy, which won't be the same when you load the program again.
The way structures are laid out in memory can change between two compiled versions of the same program, so if you just recompile your application, you may get different layouts for structures in memory, and you have just lost your data.
Different versions of the same application may wish to update the shape of the structures to allow extra functionality; this won't be possible if the data shape on disk is the same as in memory. (This is one of the reasons why you shouldn't be using pickle for portable data storage, despite it using a memory serializer.)
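As a minimal illustration of the first two points (nothing here is specific to any particular serialization library, and the values are just examples):

import json
import struct

# 1. Byte order: the same integer has different raw bytes depending on endianness,
#    so a raw memory dump written on one machine may be unreadable on another.
print(struct.pack("<I", 1))   # b'\x01\x00\x00\x00' (little endian)
print(struct.pack(">I", 1))   # b'\x00\x00\x00\x01' (big endian)

# 2. Pointers: a list does not contain its elements, only references to objects that
#    live elsewhere on the heap (in CPython, id() is that address). Dumping the list's
#    memory would store addresses that mean nothing in the next run of the program.
data = ["hello", 42]
print([hex(id(x)) for x in data])

# A serializer such as json (or pickle) instead walks the structure and writes the
# values themselves, which is why the result survives a round trip to disk.
print(json.dumps({"hello": 42}))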

Optimize memory usage of very large HashMap

I need to preprocess data from OpenStreetMap. The first step is to store a bunch of nodes (more than 200 million) from an unprocessed.pbf file (Europe, ~21 GB). Therefore I'm using a HashMap. After importing the data into the map, my program checks each node to see whether it fulfills some conditions. If not, the node is removed from the map. Afterwards each remaining node in the map is written into a new processed.pbf file.
The problem is that this program uses more than 100 GB of RAM. I want to optimize the memory usage.
I've read that I should adjust the initial capacity and load factor of the HashMap if many entries are used. Now I'm asking myself which are the best values for those two parameters.
I've also seen that the memory load when using the Oracle JDK (1.8) JVM rises more slowly than when using the OpenJDK JVM (1.8). Are there some settings I can use for the OpenJDK JVM to minimize memory usage?
Thanks for your help.
There will be a lot of collisions in the HashMap if you don't provide an initial size and load factor.
Generally, for the default load factor of 0.75, we provide an
initial size = ((number of entries) / loadFactor) + 1
This increases the efficiency of the code, as the HashMap has more space to store the data, which reduces the collisions that occur inside the HashMap while searching for a key.
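A small worked example of that formula, using the roughly 200 million entries from the question (the arithmetic is shown in Python; the commented constructor call is the standard java.util.HashMap API):

num_entries = 200_000_000          # node count from the question
load_factor = 0.75                 # HashMap's default load factor

initial_capacity = int(num_entries / load_factor) + 1
print(initial_capacity)            # 266666667
# In Java: new HashMap<>(266_666_667, 0.75f)
# Sizing the map up front avoids the repeated rehashing/resizing that happens when
# the map has to grow from its default capacity of 16 during the bulk insert.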

ParaView: Must Data Be Loaded Into One Node?

I have a very large data set: 512x512x512 cells. Loading the mesh into memory takes over 120 GB, and the memory on a single node is a problem for me. I am wondering if ParaView can load the data into memory on multiple nodes so that there is more memory available in total?
Thanks.
Generally speaking, yes. ParaView can be run in parallel (http://www.paraview.org/Wiki/Users_Guide_Client-Server_Visualization) to distribute the data across nodes.
What kind of file format is this? Depending on the file format, the reader can either read partitioned data directly on the processes, or it will read on a single node and you will then have to redistribute the data using filters.
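As a minimal pvpython sketch of that client/server setup (assuming a pvserver started under MPI, e.g. mpirun -np 8 pvserver; the host name and file path are hypothetical, and on recent ParaView versions the D3 filter may be superseded by a redistribution filter):

from paraview.simple import Connect, OpenDataFile, D3, Show, Render

Connect("cluster-head-node", 11111)            # attach the client to the parallel pvserver
reader = OpenDataFile("/path/to/volume_512.vtk")

# If the reader cannot partition the file itself, redistribute the mesh across the
# MPI ranks with the D3 filter so each node only holds a piece of the 512^3 grid.
redistributed = D3(Input=reader)

Show(redistributed)
Render()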

Why a 500MB Redis dump.rdb file takes about 5.0GB memory?

Actually, I have 3 Redis instances and I put them together into this 500 MB+ dump.rdb. The Redis server can read this dump.rdb and everything seems to be OK. Then I noticed that redis-server uses more than 5.0 GB of memory. I don't know why.
Is there anything wrong with my file? My db has about 3 million keys, and the value for each key is a list containing about 80 integers.
I used this METHOD to put the 3 instances together.
PS: Another dump.rdb with the same size and the same key-value structure takes only 1 GB of memory.
My data looks like keyNum -> {num1, num2, num3, ...}. All numbers are between 1 and 4,000,000. So should I use a List to store them? For now, I use lpush(k, v). Does this approach cost too much memory?
The ratio of memory to dump size depends on the data types Redis uses internally.
For small objects (hashes, lists and sorted sets), Redis uses ziplists to encode data. For small sets made of integers, Redis uses intsets. Ziplists and intsets are stored on disk in the same format as they are stored in memory, so you'd expect a 1:1 ratio if your data uses these encodings.
For larger objects, the in-memory representation is completely different from the on-disk representation. The on-disk format is compressed, doesn't have pointers, doesn't have to deal with memory fragmentation. So, if your objects are large, a 10:1 memory to disk ratio is normal and expected.
If you want to know which objects eat up memory, use redis-rdb-tools to profile your data (disclaimer: I am the author of this tool). From there, follow the memory optimization notes on redis.io, as well as the memory optimization wiki entry on redis-rdb-tools.
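To check which encoding your lists actually use, a small redis-py sketch like the one below can help (assuming a local Redis instance; the key name is hypothetical, and the exact configuration parameter names vary between Redis versions):

import redis

r = redis.Redis(host="localhost", port=6379)

# Reproduce the question's layout: one key holding a list of ~80 integers.
r.delete("keyNum:1")
r.lpush("keyNum:1", *range(1, 81))

# If the list is small enough, Redis keeps it in a compact encoding
# (ziplist/quicklist/listpack depending on version) instead of a pointer-heavy one.
print(r.object("encoding", "keyNum:1"))

# The thresholds that control when Redis falls back to the expensive encoding.
print(r.config_get("list-max-*"))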
There may be more to it, but I believe Redis compresses the dump files.

How to calculate the size of an Erlang process in memory?

I have a 'worker' process which I am going to assign to a job. Before I spawn hundreds of processes of this type I would like to know the memory consumption figures for it.
I know that I should sum all the elements which are stored in the process' loop data (all tuples, atoms, lists, etc) and the actual process memory footprint.
As I understand, before doing that I have to know the actual size of a {tuple|atom|list|process} itself.
Given a certain data structure which is stored in the process' memory how can I calculate the overall size of the process in memory?
erlang:process_info/2 will give you the amount of memory, in bytes, that a process occupies. For example:
1> erlang:process_info(whereis(code_server), memory).
{memory,284208}
Note that binaries are not included, since they are not located on the process heap; you have to count their size manually.
Did you read the Erlang Efficiency Guide on Memory?