.NET CLR Memory "Bytes in all heaps" is much lower than "Gen 0 heap size" - performancecounter

I am looking at performance counters for my ASP.NET 4 application (Workflow Service):
.NET CLR Memory -- # Bytes in all Heaps : 44,420,488
.NET CLR Memory -- Gen 0 heap size : 311,665,568
.NET CLR Memory -- Gen 1 heap size : 17,723,080
.NET CLR Memory -- Gen 2 heap size : 25,956,920
.NET CLR Memory -- Large Object Heap size : 740,488
Description on "# Bytes in all Heaps" counter
This counter is the sum of four other counters; Gen 0 Heap Size; Gen 1 Heap Size; Gen 2 Heap Size and the Large Object Heap Size. This counter indicates the current memory allocated in bytes on the GC Heaps.
Notice that it says "sum of four other counters", not just "sum of bytes allocated on the four heaps", which would make more sense, since there are 0 bytes in the Gen 0 heap right after a Gen 0 garbage collection.
I noticed that in my case # Bytes in all Heaps is precisely Gen 1 + Gen 2 + Large Object Heap. Is this a bug? Or am I misreading the numbers?
Windows 7 Enterprise, Performance Monitor 6.1.7601

According to this, all Heaps = Gen 1 + Gen 2 + Large Object Heap (no Gen 0).
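For what it's worth, the arithmetic can be checked directly; here is a quick Python sketch using only the counter values posted in the question:
gen0 = 311665568  # Gen 0 heap size
gen1 = 17723080   # Gen 1 heap size
gen2 = 25956920   # Gen 2 heap size
loh  = 740488     # Large Object Heap size
print(gen1 + gen2 + loh)         # 44420488 -- exactly the reported "# Bytes in all Heaps"
print(gen0 + gen1 + gen2 + loh)  # 356086056 -- nowhere near the reported total
So whatever the counter description says, the reported total excludes Gen 0.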

Related

OpenCL Maximum Size of Private Memory per Work Item

I have an AMD RX 570 4G.
OpenCL tells me that I can use a maximum of 256 work groups and 256 work items per group...
Let's say I use all 256 work groups with 256 work items in each of them.
Now, what is the maximum size of private memory per work item?
Is private memory equal to total VRAM (4 GB) divided by total work items (256x256)?
Or is it equal to cache? If so, how?
VRAM is represented in OpenCL as global memory.
Private memory is initially allocated from the register file. Your RX 570 is from AMD's Polaris architecture, a.k.a. GCN 4, where each compute unit (64 shader processors) has access to 256 vector (SIMD) registers (64 x 32 bits wide) and 512 32-bit scalar registers. That works out to about 66 KiB per CU, but it's not as simple as just quoting that total.
A workgroup will always be scheduled on a single compute unit, so if you assign it 256 work items, then it will have to perform every vector instruction 4 times in sequence (64 x 4 = 256) and the vector registers will (simplifying slightly) effectively have to be treated as 64 256-entry registers.
Scalar registers are used for data and calculations which are identical on each work item, e.g. incrementing a loop counter, holding buffer base pointers, etc.
Private memory will usually spill to global memory if you use more than will fit in the register file, so performance simply drops.
So essentially, on GCN, your optimal workgroup size is usually 64. Use as little private memory as possible; definitely aim for less than half of the available register file, so that more than one workgroup can be scheduled per CU and memory-access latency can be papered over. Otherwise your shader cores will spend a lot of time just waiting for data to arrive or be written out.
Cache is used for OpenCL local and constant memory spaces. (Constant will again spill to global if you try to use too much. The size of local memory can be checked via the OpenCL API and again is divided among workgroups scheduled on the same compute unit, so if you use more than half, only one group can run on a CU, etc.)
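To make the register-file arithmetic above concrete, here is a quick Python check (the figures are the ones quoted in this answer, not an official spec sheet):
vector_bytes = 256 * 64 * 4   # 256 vector registers x 64 lanes x 4 bytes each = 65536 B
scalar_bytes = 512 * 4        # 512 scalar registers x 4 bytes each = 2048 B
print((vector_bytes + scalar_bytes) / 1024)   # 66.0 -> roughly 66 KiB per compute unit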
I don't know where you're getting a limit of 256 workgroups from; the limit is essentially set by whether the GPU uses 32-bit or 64-bit addressing, and most applications won't get close to 4 billion work items even in the 32-bit case.
Private memory space is registers on the GPU die (0 cycle access latency) and not related to the amount of VRAM (global memory space) at all. The amount of private memory depends on the device (private memory per compute unit).
I don't know the private memory size for the RX 570, but for older HD7000 series GPUs it is 256 kB per CU. If you have a work group size of 256, you get 1 kB per work item, which is equal to 256 float variables.
Cache size determines the size of local and constant memory space.
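If you want to see the relevant limits for your own device, a minimal sketch along these lines (assuming pyopencl is installed) will print them. Note there is no standard OpenCL query for private memory size, since it comes out of the register file as described above:
import pyopencl as cl  # assumed to be installed
for platform in cl.get_platforms():
    for dev in platform.get_devices():
        print(dev.name)
        print("  compute units:      ", dev.max_compute_units)
        print("  max work-group size:", dev.max_work_group_size)
        print("  local mem size (B): ", dev.local_mem_size)
        print("  global mem size (B):", dev.global_mem_size)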

Loading a large set of images kills the process

Loading 1500 images of size (1000, 1000, 3) breaks the code; the process is killed ("kill 9") without any further error. Memory used before this line of code is 16% of total system memory. The total size of the images directory is 7.1 GB.
X = np.asarray(images).astype('float64')
y = np.asarray(labels).astype('float64')
system spec is:
OS: macOS Catalina
processor: 2.2 GHz 6-Core Intel Core i7
memory: 16 GB 2400 MHz DDR4
Update:
getting the below error while running the code on 32 vCPUs, 120 GB memory.
MemoryError: Unable to allocate 14.1 GiB for an array with shape (1200, 1024, 1024, 3) and data type float32
You would have to provide some more info/details for an exact answer, but assuming this is a memory error (incredibly likely), the size of the images on disk does not represent the size they will occupy in memory, so that figure is mostly irrelevant. In practically every case the images in memory will occupy a lot more space, due to pointers, the objects that are needed, and so on. Intuitively I would say that 16 GB of RAM is nowhere near enough to load 7 GB of images. It's impossible to tell you exactly how much you would need, but from experience I would say you'd have to bump it up to 64 GB. If you are using Keras, I would suggest looking into the DirectoryIterator.
Edit:
As Cris Luengo pointed out, I missed the fact that you stated the size of the images.
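For reference, a back-of-the-envelope check in Python using only the numbers from the question shows how far the in-memory size is from the 7.1 GB on disk:
n, h, w, c = 1500, 1000, 1000, 3
print(n * h * w * c * 8 / 2**30)           # ~33.5 GiB as float64 (8 bytes per element) -- far beyond 16 GB of RAM
print(1200 * 1024 * 1024 * 3 * 4 / 2**30)  # ~14.06 GiB as float32, matching the 14.1 GiB in the update's error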

Yarn memory allocation for spark streaming

When we use spark on yarn for non-streaming apps, we generally get the allocated memory to match the number of executors times memory per executor. When doing streaming apps, the allocated memory is immediately pushed to the limit (total memory) as shown in the yarn console.
With this set of parameters
--driver-memory 2g --num-executors 32 --executor-memory 500m
total memory 90G, memory used 85.88G
total vcores 64, vcores used 33
you would expect a basis of 32 * 1 GB (500m + overhead) + driver memory, or around 34 GB, and 33 vcores (32 workers + 1 driver)
Questions:
Is the 64 total vcores due to the requirement of 2-core pairs for the streaming connection and processing?
How did the estimated 34 GB get pushed to 85.88 GB? Is it always true that with streaming apps, YARN gives it all it has?
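For clarity, the expectation stated in the question amounts to the following (a sketch of the estimate only, using the question's own figures, not of what YARN actually allocated):
executors = 32
per_executor_gb = 1   # 500m executor memory + overhead, rounded up to roughly 1 GB per container
driver_gb = 2
print(executors * per_executor_gb + driver_gb)  # 34 GB expected, vs 85.88 GB shown as used
print(executors + 1)                            # 33 vcores expected (32 workers + 1 driver), which matches "vcores used"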

Does Neo4j calculate JVM heap on Ubuntu?

In the neo4j-wrapper.conf file I see this:
# Java Heap Size: by default the Java heap size is dynamically
# calculated based on available system resources.
# Uncomment these lines to set specific initial and maximum
# heap size in MB.
#wrapper.java.initmemory=512
#wrapper.java.maxmemory=512
Does that mean that I should not worry about -Xms and -Xmx?
I've read elsewhere that -XX:ParallelGCThreads=4 -XX:+UseNUMA -XX:+UseConcMarkSweepGC would be good.
Should I add that on my Intel® Core™ i7-4770 Quad-Core Haswell machine with 32 GB DDR3 RAM and 2 x 240 GB 6 Gb/s SSDs (software RAID 1)?
I would still configure it manually.
Set both to 12 GB and use the remaining 16 GB for memory mapping in neo4j.properties. Try to match it to your store file sizes.
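Concretely, in neo4j-wrapper.conf that would look something like the lines below (12 GB expressed in MB, as the comments in the file indicate; adjust to your store sizes):
wrapper.java.initmemory=12288
wrapper.java.maxmemory=12288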

What is the limit on automatic storage on OS X in Obj-C, and why do I need to use malloc instead of a normal auto array for large arrays?

I came across a strange error today, and I still don't understand it:
long long N = 2000;
long long N2 = N*N;
long long *s = malloc(sizeof(long long)*N2); // create array
// populate it
for (long long k = 1; k <= 55; k++) {
doesn't produce any errors, but
long long N = 2000;
long long N2 = N*N;
long long s[4000000]; // create array
// populate it
for (long long k = 1; k <= 55; k++) {
gives me a code=2 EXC_BAD_ACCESS on the for line before assigning 1 to k (according to the debugger), as if there was no space left to allocate another 8-byte variable. This code is at the beginning of a method; no other variables have been assigned or allocated. I'm guessing that I simply can't allocate a 4000000-element long long array to the stack, but somehow I can allocate it to the dynamic heap. Could someone please explain what's going on, what the limits are, etc.? This is Objective-C on a Mac running Mountain Lion, 2GB RAM. A long long is 8 bytes wide, so the array should only be 32MB; I can't see why this should be an issue.
Thank you!
(By the way, if the details look familiar, it's because this is the beginning of my solver for Project Euler's Problem 149. I've avoided mentioning any details of the solution here, as I've solved the problem already.)
Your first example allocates memory from the heap, which is in the “data segment”, and your second allocates memory on the stack, which is in the “stack segment”. Each of these has a different size limit.
The default stack segment size limit, according to Technical Q&A QA1419, is 8 MiB. You can double-check this by running ulimit -a in a terminal:
:; ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
file size (blocks, -f) unlimited
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 256
pipe size (512 bytes, -p) 1
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 709
virtual memory (kbytes, -v) unlimited
As you can see, the stack size is limited to 8192 KiB = 8 MiB.
The Technical Q&A I linked above describes some ways to increase the stack size limit. The maximum to which you can increase it without running as root is 64 MiB.
If you create threads, each thread gets its own stack. According to the Q&A, you can set a thread's stack size up to 1 GiB if you use the NSThread API.
Auto locals are allocated on the stack; according to this technical note, the default stack size for an OS X process's main thread is 8 MB, and less for additional threads. You can try the linker option or setrlimit solutions given in the note, but the C tradition is to use the heap for any large allocations.
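If you want to check the limit programmatically rather than with ulimit, here is a small sketch, shown in Python just for illustration (the resource module wraps the same getrlimit/setrlimit calls):
import resource
soft, hard = resource.getrlimit(resource.RLIMIT_STACK)  # values are reported in bytes
print(soft / 1024)  # 8192.0 -> the default 8 MiB soft limit, same as "ulimit -s"
print(hard)         # hard limit, i.e. how far a non-root process may raise it (per the Q&A above, 64 MiB)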