For example, with HotSpot: I disabled its compiled mode, and I assumed the bytecode of classes
would be present in memory exactly as the opcodes appear in the class file.
But it seems I am wrong. Some experts told me that there are some
transformation processes applied when loading bytecode into memory.
Could anybody give me more details about this issue?
Thank you a lot!
You can get some hints by looking at the documentation of an API that forces the JVM to transform the internal representation back into the official class file format:
http://docs.oracle.com/javase/7/docs/api/java/lang/instrument/Instrumentation.html#retransformClasses(java.lang.Class...)
The initial class file bytes represent the bytes passed to ClassLoader.defineClass or redefineClasses (before any transformations were applied), however they might not exactly match them. The constant pool might not have the same layout or contents. The constant pool may have more or fewer entries. Constant pool entries may be in a different order; however, constant pool indices in the bytecodes of methods will correspond. Some attributes may not be present. Where order is not meaningful, for example the order of methods, order might not be preserved.
From this documentation you can conclude that instructions accessing the constant pool may look different (at the very least, they may use different indices), and that you cannot assume that methods are placed in contiguous memory. This does not imply that these are the only transformations, but all others can be reversed if needed, at least in a JVM that supports Instrumentation.
While running the code, the JVM might replace instructions with specialized VM-internal instructions to optimize further execution. If you are curious what kinds of instructions a JVM might have, you can run Oracle's HotSpot engine with the arguments
-XX:+UnlockDiagnosticVMOptions -XX:+PrintInterpreter
Then it will print the table of all instructions and their associated native code as used by the interpreter. This table also contains the specialized instructions; e.g., on my machine with JDK 1.7 I see about 30 non-standard bytecode instructions in the range 203 to 231.
Khronos just released their new memory model extension, but there has not yet been any informal discussion, example implementation, etc., so I am confused about the basic details.
https://www.khronos.org/blog/vulkan-has-just-become-the-worlds-first-graphics-api-with-a-formal-memory-model.-so-what-is-a-memory-model-and-why-should-i-care
https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#memory-model
What do these new extensions try to solve exactly? Are they trying to solve synchronization problems at the language level (say, to remove onerous mutexes in your C++ code), or is it a new and complex set of features to give you more control over how the GPU deals with sync internally?
(Speculative question) Would it be a good idea to learn and incorporate this new model in the general case or would this model only apply to certain multi-threaded patterns and potentially add overhead?
Most developers won't need to know about the memory model in detail, or use the extensions. In the same way that most C++ developers don't need to be intimately familiar with the C++ memory model (and this isn't just because of x86, it's because most programs don't need anything beyond using standard library mutexes appropriately).
But the memory model allows specifying Vulkan's memory coherence and synchronization primitives with a lot less ambiguity -- and in some cases, additional clarity and consistency. For the most part the definitions didn't actually change: code that was data-race-free before continues to be data-race-free. For a few developers doing advanced or fine-grained synchronization, the additional precision and clarity allows them to know exactly how to make their programs data-race-free without using expensive overly-strong synchronization.
Finally, in building the model the group found a few things that were poorly-designed or broken previously (many of them going all the way back into OpenGL). They've been able to now say precisely what those things do, whether or not they're still useful, and build replacements that do what was actually intended.
The extension advertises that these changes are available, but even more than that, once the extension is final instead of provisional, it will mean that the implementation has been validated to actually conform to the memory model.
Among other things, it enables the same kind of fine grained memory ordering guarantees for atomic operations that are described for C++ here. I would venture to say that many, perhaps even most, PC C++ developers don't really know much about this because the x86 architecture basically makes the memory ordering superfluous.
However, GPUs are not x86 architecture and compute operations executed massively parallel on GPU shader cores probably can, and in some cases must use explicit ordering guarantees to be valid when working against shared data sets.
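To get a feel for what these fine-grained ordering guarantees look like, here is a minimal release/acquire sketch in C11 (whose stdatomic.h mirrors the C++ memory_order enumeration discussed above); the variable names are made up for illustration:

#include <stdatomic.h>
#include <stdbool.h>

int payload;                 /* plain data, published via the flag */
atomic_bool ready = false;

void producer(void)
{
    payload = 42;            /* 1: write the data */
    atomic_store_explicit(&ready, true, memory_order_release);  /* 2: publish it */
}

bool consumer(int *out)
{
    if (atomic_load_explicit(&ready, memory_order_acquire)) {
        *out = payload;      /* acquire pairs with release: guaranteed to see 42 */
        return true;
    }
    return false;            /* not published yet */
}

On x86, both the release store and the acquire load compile to plain moves anyway, which is exactly why the topic is easy to overlook there.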
This video is a good presentation on atomics and ordering as it applies to C++.
This is related to, but NOT the same as, frame pointer omission. Is there any risk?
I am trying to follow this old (but still relevant) article:
http://blogs.msdn.com/b/larryosterman/archive/2007/03/12/fpo.aspx
Larry (the author) writes:
machines got sufficiently faster since 1995 that the performance
improvements that were achieved by FPO weren't sufficient to counter
the pain in debugging and analysis that FPO caused
However, in the discussion further down the page, one user writes:
Disabling FPO can have both serious code size and performance impact.
Tail call optimizations have to be disabled when a frame pointer is
present, leading to much greater stack usage in affected paths. Small
functions are also disproportionately affected by prolog/epilog code.
Third, although there are still six registers available with a frame
pointer on X86, only three of them are nonvolatile with respect to
nested calls: EBX, ESI, and EDI. Opening up a fourth register can drop
out a bunch of spill code.
I have a couple of questions.
1. Spill code == register spillage?
2. Is the author correct that FPO is generally considered a pain and that the gain does not outweigh the pain?
3. Is FPO still relevant today on the x64 architecture, since there are a LOT more registers to play with?
4. Do you use FPO? What for (if yes), and does it make a difference to you?
Finally, in this article:
http://www.altdevblogaday.com/2012/05/24/x64-abi-intro-to-the-windows-x64-calling-convention/
The author says
[with respect to the Windows x64 calling convention] .....
All parameters have space reserved on the stack, even the ones passed in registers. In fact, there’s stack space for 4 parameters
even if your function doesn’t have any params. Those parameters are 8
bytes so that’s at least 32 bytes on the stack for every function
(every function actually has at least 48 bytes on the stack…I’ll
explain that another time). This stack area is called the home space.
There are a few reasons behind this home space:
If the registers need to be used for something else, the called function can store the data in the home space without moving the stack
pointer.
It keeps the stack structure easy to determine. That's very handy for debugging, and perhaps necessary for x64's stack metadata (another
point I'll come back to another time). ...... The compiler can use it
for whatever it wants, and an optimized build will likely make great
use of it.
Wouldn't an optimized build optimize the excess allocation away?
1. Spill code == register spillage?
Almost. Strictly speaking, spill code is the code added by the compiler to implement a register spill; the spill itself is the decision to tag a live range as not able to be placed in a register.
2. Is the author correct that FPO is generally considered a pain and that the gain does not outweigh the pain?
The author is probably correct that on modern processor architectures, the set of functions where FPO will generate a significant performance gain is smaller than in the past. Yet FPO does make code smaller, reducing cache pressure, and it does reduce register pressure. These can be important in some settings. It also speeds up prolog and epilog code by a few instructions. The point about debuggers not working well without the frame pointer is noteworthy: it means core dumps are less useful for post-mortems on production-optimized code. You'd never want to use FPO during development, except for final testing.
3. Is FPO still relevant today on the x64 architecture, since there are a LOT more registers to play with?
Modern processors are so various and complex that you just about never know what's "relevant" until you try it and measure.
4. Do you use FPO? What for (if yes), and does it make a difference to you?
I have written a medium-size C library (20K SLOC) where it made a small (~5%) difference in overall run time under gcc. This was a native-language extension to a scripting language that had to compile under both gcc and Visual C; using it would have split the build path. I decided 5% was not worth it for the purpose the extension served. But if it had been a dynamic fluid simulation to predict the weather, 5% could have been worth many millions of dollars, and the decision would have been different.
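For reference, the gcc switches that control this (MSVC's equivalent is /Oy, as far as I know) look like this:

gcc -O2 -fomit-frame-pointer foo.c      # allow FPO
gcc -O2 -fno-omit-frame-pointer foo.c   # always keep the frame pointer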
5. Wouldn't an optimized build optimize the excess allocation away?
That's entirely up to the compiler and optimizer designer. It looks from the MS documentation here that MS has defined the ABI to require home space for all data, even if its whole lifetime is spent in a register.
1) When the compiler needs to use a register and doesn't have any unused ones, it has to emit code to save some register's value on the stack and restore it later. That is spill code.
2) FPO was a pain back when unwinding was primarily done by walking the stack. Nowadays standard unwind ABIs exist anyway (e.g. to enable exception handling), so the information already exists and is organized more efficiently (away from the hot code); there's no pain. Sure, there would be some pain if you wrote all your machine code by hand, but that's not the typical use case.
3) Typical x86_64 ABIs don't use frame pointers at all (except when absolutely necessary, like for variable-length arrays in C).
4) I'm not a compiler. My compiler doesn't generate frame pointers.
Optimize excess away) Not sure what your question is. The space consumption for the home area isn't a problem. The benefit of not having to adjust any stack pointers is a big advantage, since you need a lot less code. The same goes for the red zone just beyond the stack frame, which allows leaf code to use a lot of memory without ever needing any stack pointer gymnastics.
Problem
Currently I have an Elasticsearch cluster that is running out of file descriptors. Checking the Elasticsearch setup documentation page, I saw that it is recommended to set the number of file descriptors on the machine to 32K or even 64K, and digging a bit through search results I found people who set this limit even higher (128K or unlimited).
The exception I'm getting is a common symptom of file descriptor exhaustion:
Caused by: org.apache.lucene.store.LockReleaseFailedException: Cannot forcefully unlock a NativeFSLock which is held by another indexer component
Question
Is there an equation for the number of file descriptors we should expect to be required by Elasticsearch / Lucene, based on the number of indexes, shards, replicas and / or documents? Or even for the number of files across all Elasticsearch indexes?
I wouldn't like to set it by trial and error, and an unlimited number of file descriptors isn't possible in my situation.
I know very little about Elasticsearch, but will try to answer this from a Lucene perspective.
I'm afraid there's no easy way of finding out how many descriptors you really need.
First, this depends on Directory implementation (which itself depends on the underlying OS, if you use FSDirectory.open(File)).
Secondly, it also depends on your merge policy (which may depend on Lucene version, unless elasticsearch overrides it).
Finally, it can even depend on various exotic circumstances, such as garbage collection behaviour (if certain bits depend on finalizers to free resources). We even had an instance of Lucene which was leaking file descriptors until we manually switched -d64 mode on.
That said, I would recommend setting up a monitoring script that gathers some stats over a week or so, and coming up with a range that fits your typical usage. Add some headroom for unexpected cases.
P.S. I am struggling to imagine a case these days where file descriptors would be a genuine problem. Is this a C10K problem? Can you elaborate on this?
I have this molecular dynamics program that writes atom positions and velocities to a file at every n steps of the simulation. The actual writing is taking something like 90% of the running time! (Checked by eliminating the writes.) So I desperately need to optimize that.
I see that some Fortran compilers have an extension to change the write buffer size (called I/O block size) and the "number of blocks" at the OPEN statement, but it appears that gfortran doesn't. Also, I read somewhere that gfortran uses an 8192-byte write buffer.
I even tried to do an FSTAT (right after opening; is that right?) to see what block size and number of blocks it is using, but it returns -1 for both. (I'm compiling for 64-bit Windows.)
Isn't there a way to enlarge the write buffer for a file in gfortran? Will it be different compiling for Linux than for Windows?
I'd really, really rather stay in Fortran, but as a desperate measure, isn't there a way to do this by adding some C routine?
thanks!
IanH's question is key. Unformatted I/O is MUCH faster than formatted; the conversion from base 2 to base 10 is very CPU-intensive. If you don't need the values to be human-readable, use unformatted I/O. If you want to be able to read the values in another language, use access='stream'.
Another approach would be to add your own buffering. Replace the write statement with a call to a subroutine. Have that subroutine store values and write only when it has received M values. You'll also have to have a "flush" call to the subroutine to cause it to write the last values, if they are fewer than M.
If C I/O under gcc is faster, you could mix Fortran and C with Fortran's ISO_C_BINDING: https://stackoverflow.com/questions/tagged/fortran-iso-c-binding. There are examples of the use of the ISO C binding in the gfortran manual under "Mixed Language Programming".
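To make that concrete, here is a hedged C sketch of such a routine: it uses the standard setvbuf to install a large stdio buffer and could be called from Fortran via ISO_C_BINDING. The function names are made up for illustration, and the buffer size is just a starting point to be tuned by measurement:

#include <stdio.h>
#include <stdlib.h>

static FILE *out;
static char *buf;

/* Open the output file and install a big fully-buffered stdio buffer,
   so the OS sees far fewer, larger writes. */
int buffered_open(const char *path, size_t bufsize)
{
    out = fopen(path, "wb");
    if (out == NULL) return -1;
    buf = malloc(bufsize);
    if (buf == NULL || setvbuf(out, buf, _IOFBF, bufsize) != 0) return -1;
    return 0;
}

/* Write n doubles in raw binary (the analogue of unformatted/stream output). */
int buffered_write(const double *data, size_t n)
{
    return fwrite(data, sizeof(double), n, out) == n ? 0 : -1;
}

int buffered_close(void)
{
    int rc = fclose(out);   /* flushes whatever is left in the buffer */
    free(buf);
    return rc;
}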
If you spend 90% of your runtime writing coords/vels every n timesteps, the obvious quick fix would be to write data less often, say every 100*n timesteps. But I'm sure you already thought of that yourself.
But yes, gfortran has a fixed 8 KB buffer, whose size cannot be changed except by modifying the libgfortran source and rebuilding it. The reason for the buffering is to amortize the syscall overhead; (simplistic) tests on Linux showed that 8 KB is sufficient and that going beyond that is far into diminishing-returns territory. That being said, if you have substantiated claims that bigger buffers are useful for some I/O patterns and/or OSes, there's no reason why the buffer can't be made larger in a future release.
As for your performance issues: as already mentioned, unformatted is a lot faster than formatted I/O. Additionally, gfortran has rather high per-I/O-statement overhead. You can amortize that by writing arrays (or array sections) rather than individual elements (this matters mostly for unformatted I/O; for formatted I/O there is so much other work to do that it doesn't help as much).
I am thinking that if the cost of I/O is comparable to, or even larger than, the effort of the simulation, then it probably isn't such a good idea to store all these data to disk in the first place. It is better to do whatever processing you intend directly during the simulation, instead of saving lots of intermediate data and then reading it back in again to do the processing.
Moreover, MD is an inherently highly parallelizable problem, and with I/O you will severely cripple the efficiency of parallelization! I would avoid I/O whenever possible.
For individual trajectories, normally you just need to store the initial condition of each trajectory, along with its key statistics or important snapshots at a small number of time values. When you need one specific trajectory plotted, you can regenerate the exact same trajectory, or section of a trajectory, from the initial condition or the closest snapshot, at a cost similar to reading it from disk.
I am working on embedded software projects in the automotive domain. In one of my projects, the application software consumes almost 99% of the RAM. The actual RAM size available is 12 KB; we use the TMS470R1B1 Titan F05 microcontroller. I have done some optimisation, like finding unused messages in the software and deleting them, but it still hasn't reduced RAM usage enough. Could you please suggest some good ways to reduce RAM through software optimisation?
Unlike speed optimisation, RAM optimisation might be something that requires "a little bit here, a little bit there" all through the code. On the other hand, there may turn out to be some "low hanging fruit".
Arrays and Lookup Tables
Arrays and look-up tables can be good "low-hanging fruit". If you can get a memory map from the linker, check that for large items in RAM.
Check for look-up tables that haven't used the const declaration properly, which puts them in RAM instead of ROM. Especially look out for look-up tables of pointers, which need the const on the correct side of the *, or may need two const declarations. E.g.:
const my_struct_t * param_lookup[] = {...}; // Table is in RAM!
my_struct_t * const param_lookup[] = {...}; // In ROM
const char * const strings[] = {...}; // Two const may be needed; also in ROM
Stack and heap
Perhaps your linker config reserves large amounts of RAM for heap and stack, larger than necessary for your application.
If you don't use heap, you can possibly eliminate that.
If you measure your stack usage and it's well under the allocation, you may be able to reduce the allocation. For ARM processors, there can be several stacks, for several of the operating modes, and you may find that the stacks allocated for the exception or interrupt operating modes are larger than needed.
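One common way to measure stack usage is to "paint" the stack region with a known pattern at boot and later look for the high-water mark. A minimal sketch, assuming a bare-metal build where the linker script exports the stack bounds (the symbol names below are illustrative, not standard):

#include <stdint.h>

#define STACK_FILL 0xA5A5A5A5u

extern uint32_t __stack_start[];  /* lowest address of the stack region  */
extern uint32_t __stack_end[];    /* highest address (the initial SP)    */

/* Call very early at boot, before the stack has grown deep: fill the
   currently unused part of the stack region with a known pattern. */
void stack_paint(void)
{
    uint32_t *p = __stack_start;
    while (p < (uint32_t *)&p)    /* stop below the current stack frame */
        *p++ = STACK_FILL;
}

/* Later (e.g. after stress testing), count how many words were never
   overwritten; that is your unused stack headroom. */
uint32_t stack_unused_bytes(void)
{
    const uint32_t *p = __stack_start;
    uint32_t n = 0;
    while (p < __stack_end && *p == STACK_FILL) {
        n += sizeof(uint32_t);
        ++p;
    }
    return n;
}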
Other
If you've checked for the easy savings, and still need more, you might need to go through your code and save "here a little, there a little". You can check things like:
Global vs local variables
Check for unnecessary use of static or global variables, where a local variable (on the stack) can be used instead. I've seen code that needed a small temporary array in a function, which was declared static, evidently because "it would take too much stack space". If this happens enough times in the code, making such variables local again would actually save total memory usage overall. It might require an increase in the stack size, but will save more memory in reduced global/static variables. (As a side benefit, the functions are more likely to be re-entrant and thread-safe.)
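As a tiny illustration of the trade-off (the function names and buffer size are made up):

#include <stdint.h>

void format_message_bad(void)
{
    static uint8_t scratch[64];   /* occupies 64 bytes of RAM permanently */
    /* ... fill and use scratch ... */
    (void)scratch;
}

void format_message_good(void)
{
    uint8_t scratch[64];          /* lives on the stack only during the call */
    /* ... fill and use scratch ... */
    (void)scratch;
}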
Smaller variables
Variables that can be smaller, e.g. int16_t (short) or int8_t (char) instead of int32_t (int).
Enum variable size
The size of an enum variable may be bigger than necessary. I can't remember what ARM compilers typically do, but some compilers I've used in the past made enum variables 2 bytes by default, even though the enum definition really only required 1 byte to store its range. Check your compiler settings.
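A workaround sketch, if the compiler has no "short enums" setting (gcc has -fshort-enums, though it changes the ABI): keep the enum for the named constants, but store the value in an explicitly sized integer. The names here are illustrative:

#include <stdint.h>

enum motor_state { MOTOR_OFF, MOTOR_STARTING, MOTOR_RUNNING, MOTOR_FAULT };

/* 1 byte, instead of whatever sizeof(enum motor_state) happens to be */
uint8_t motor_state_now = MOTOR_OFF;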
Algorithm implementation
Rework your algorithms. Some algorithms have a range of possible implementations with a speed/memory trade-off. E.g., AES encryption can use on-the-fly key calculation, which means you don't have to keep the entire expanded key in memory. That saves memory, but it's slower.
Deleting unused string literals won't have any effect on RAM usage because they aren't stored in RAM but in ROM. The same goes for code.
What you need to do is cut back on actual variables and possibly the size of your stack/stacks. I'd look for arrays that can be resized and for unused variables. Also, it's best to avoid dynamic allocation, because of the danger of memory fragmentation.
Aside from that, you'll want to make sure that constant data such as lookup tables are stored in ROM. This can usually be achieved with the const keyword.
Make sure the linker produces a MAP file - it will show you where the RAM is used. Sometimes you can find things like string literals/constants that are kept in RAM. Sometimes you'll find there are unused arrays/variables put there by someone else.
If you have the linker map file, it's also easy to attack first the modules which are using the most RAM.
Here are the tricks I've used on the Cell:
Start with the obvious: squeeze 32-bit words into 16s where possible, rearrange structures to eliminate padding, and cut down on slack in any arrays. If you've got any arrays of more than eight structures, it's worth using bitfields to pack them down tighter (see the sketch after this list).
Do away with dynamic memory allocation and use static pools. A constant memory footprint is much easier to optimize and you'll be sure of having no leaks.
Scope local allocations tightly, so that they don't stay on the stack longer than they have to. Some compilers are very bad at recognizing when you're done with a variable, and will leave it on the stack until the function returns. This can be bad with large objects in outer functions, which then tie up stack memory they don't need to while the outer function calls deeper into the tree.
alloca() doesn't clean up until a function returns, so it can waste stack longer than you expect.
Enable function body and constant merging in the compiler, so that if it sees eight different consts with the same value, it'll put just one in the text segment and alias them with the linker.
Optimize executable code for size. If you've got a hard realtime deadline, you know exactly how fast your code needs to run, so if you've any spare performance you can make speed/size tradeoffs until you hit that point. Roll loops, pull common code into functions, etc. In some cases you may actually get a space improvement by inlining some functions, if the prolog/epilog overhead is larger than the function body.
The last one is only relevant on architectures that store code in RAM, I guess.
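Here is a minimal sketch of the bitfield packing mentioned in the first point; the field widths are made up for illustration:

#include <stdint.h>
#include <stdio.h>

struct sample_loose {
    uint8_t  flag;       /* one bit of information in a full byte */
    uint8_t  mode;       /* 3 possible values: 2 bits would do    */
    uint16_t count;      /* never exceeds 1000: 10 bits would do  */
};

struct sample_packed {
    uint16_t flag  : 1;
    uint16_t mode  : 2;
    uint16_t count : 10; /* all three fields share one 16-bit word */
};

int main(void)
{
    printf("%u -> %u bytes per element\n",
           (unsigned)sizeof(struct sample_loose),
           (unsigned)sizeof(struct sample_packed));
    return 0;
}

In an array of a few hundred such elements, halving the element size adds up quickly.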
With respect to functions, the following are ways to optimise RAM usage:
Make sure that the number of parameters passed to a function is carefully analysed. On ARM architectures, as per the AAPCS (ARM Architecture Procedure Call Standard), a maximum of 4 parameters can be passed using registers; the rest of the parameters are pushed onto the stack.
Also consider using a global rather than passing the data to a function that is most frequently called with the same parameter.
The deeper the function calls, the heavier the use of the stack. Use a static analysis tool to find the worst-case function call path, and look for ways to reduce it. When function A calls function B, B calls C, C calls D, and D in turn calls E, going ever deeper, the registers cannot be used at all levels to pass the parameters, so the stack will be used.
Look for opportunities to club two parameters into one wherever applicable; remember that all the registers on ARM are 32-bit, so further packing is also possible.
void abc(bool a, bool b, uint16_t c, uint32_t d, uint8_t e) // five parameters: uses registers and the stack
void abc(uint8_t ab, uint16_t c, uint32_t d, uint8_t e) // first 2 params clubbed into one byte, so all 4 parameters can be passed using registers
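A hypothetical sketch of the clubbing itself (the macro names are made up): the two flags are packed into one byte at the call site and unpacked inside the callee:

#include <stdbool.h>
#include <stdint.h>

#define PACK_FLAGS(a, b)  ((uint8_t)(((a) ? 1u : 0u) | ((b) ? 2u : 0u)))
#define FLAG_A(ab)        (((ab) & 1u) != 0u)
#define FLAG_B(ab)        (((ab) & 2u) != 0u)

void abc(uint8_t ab, uint16_t c, uint32_t d, uint8_t e)
{
    bool a = FLAG_A(ab);  /* unpack inside the callee */
    bool b = FLAG_B(ab);
    (void)a; (void)b; (void)c; (void)d; (void)e;
}

/* Call site: abc(PACK_FLAGS(true, false), 100u, 1000u, 7u); */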
Have another look at nested interrupts. In any architecture there are scratch registers and preserved registers; preserved registers must be saved before servicing the interrupt. In the case of nested interrupts, huge stack space can be needed to back up the preserved registers to and from the stack.
If objects such as structures are passed to a function by value, a large amount of data is pushed (depending on the struct size), which will eat up stack space easily. This can be changed to pass by reference, as sketched below.
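A minimal pass-by-reference sketch (the struct and function names are illustrative):

#include <stdint.h>

typedef struct {
    uint32_t samples[32]; /* 128 bytes of payload */
    uint16_t count;
} sensor_block_t;

/* By value: the whole struct is copied onto the stack at each call. */
uint32_t sum_by_value(sensor_block_t blk);

/* By reference: only a 4-byte pointer is passed (const, since we only read). */
uint32_t sum_by_ref(const sensor_block_t *blk)
{
    uint32_t total = 0;
    for (uint16_t i = 0; i < blk->count; ++i)
        total += blk->samples[i];
    return total;
}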
Adding to the previous answers.
If you are running your program from RAM for faster execution, you can create a user-defined section containing all the initialization routines that you are sure won't run more than once after your system boots up. After all the initialization functions have executed, you can reuse that region for the heap.
The same can be applied to data sections which are identified as no longer needed after a certain stage in your program.