I have a 64-bit system. I'm declaring my variables as 64-bit variables, expecting my code to run faster. When I call functions such as 'String.IndexOf("J", X)', it fails because X is a Long but the function expects a 32-bit value as the start index.
Is there any way to pass a 64-bit variable without converting it down to a 32-bit one?
You've got the wrong idea about 64-bit code. The arguments to the String.IndexOf() method do not change; the second argument is still an Integer. The only type in .NET that changes size is IntPtr.
This is all very much by design. A 64-bit processor does not execute code faster when you let it manipulate 64-bit integral values. Quite the contrary, it makes it run slower. Processor speed is throttled in large part by the size of the caches. The CPU caches are important because they help the processor avoid having to read or write data from RAM, which is very slow compared to the speed of the processor. A worst-case processor stall from not having data in the L1, L2 or L3 caches can be 200 cycles.
The cache size is fixed. Using 64-bit variables makes the cache half as effective.
You also make your code slower by using Long with Option Strict Off. That requires the compiler to emit a conversion to cast the Long to an Integer. It is not visible in your code, but it certainly is performed; you can see it when you look at the IL with ildasm.exe or a decompiler like ILSpy or Reflector.
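The cache point is easy to see with a back-of-the-envelope check. Here is a minimal sketch, written in C rather than VB.NET purely for the sizeof arithmetic; the element count and the 32 KB L1 size are assumptions for illustration, not taken from the question:
#include <stdio.h>
#include <stdint.h>

#define COUNT 8192                     /* arbitrary element count, just for illustration */

/* The same logical data stored as 32-bit and as 64-bit integers. */
static int32_t narrow[COUNT];
static int64_t wide[COUNT];

int main(void)
{
    /* Twice the bytes per element means half as many elements fit in a
       cache of fixed size: a 32 KB L1 data cache holds all 8192 of the
       32-bit values, but only half of the 64-bit ones. */
    printf("32-bit array: %zu bytes\n", sizeof narrow);   /* 32768 */
    printf("64-bit array: %zu bytes\n", sizeof wide);     /* 65536 */
    return 0;
}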
I am wondering what the possible pros and cons of switching from a 32-bit to a 64-bit VM are. Most of the answers I have encountered suggest that with a VM you don't need to worry about the switch and you get extra addressable memory, but I am afraid there is more to the problem.
Here is what I came up with:
Pros:
substantially more addressable memory
Cons:
whatever uses pointers under the hood will roughly double its memory usage - I know VMs apply optimisations (such as compressed pointers) to avoid this problem, but I guess it's safe to assume that the same logic under a 64-bit VM will potentially take more memory (see the sketch after this question)
all deterministic outputs relying on memory hashes are likely to change
Anything more?
I am happy to take a concrete example like the JVM.
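For the pointer-doubling point, here is a minimal C sketch outside of any VM (the node struct is hypothetical); built once as a 32-bit and once as a 64-bit binary, it shows how pointer-heavy data grows even though the payload does not:
#include <stdio.h>

/* A hypothetical linked-list node: one word of payload, one pointer.
   On a typical 32-bit build this is 8 bytes; on a 64-bit build the wider
   pointer plus alignment padding pushes it to 16 bytes. */
struct node {
    int          value;
    struct node *next;
};

int main(void)
{
    printf("sizeof(void *)      = %zu\n", sizeof(void *));
    printf("sizeof(struct node) = %zu\n", sizeof(struct node));
    return 0;
}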
Now there is an iPhone coming with a 64-bit architecture. long becomes 64 bits (while int remains 32 bits), and everywhere NSInteger has been used is now a long, and so 64 bits, not 32. And Twitter has quite a few people saying "I'm glad I've used NSInteger everywhere, not int".
If you need to store a value that doesn't exceed 32 bits (for example in a loop which only loops 25 times), why should a long be used, given that the upper 32 bits are going to be empty?
If the program has worked on 32-bit integers, then what benefit does using 64-bits for integers provide, when it uses up more memory?
Also there will be situations where using a 64-bit integer gives a different result than using a 32-bit integer. So if you use NSInteger then something may work on an iPhone 5S but not on an older device, whereas if int or long is used explicitly then the result will be the same on any device.
If you need to store a value that doesn't exceed 32 bits... why should a long be used?
If you can really make that guarantee, then there is absolutely no reason to prefer a 64-bit type over a 32-bit one. For simple operations like bounded loops, counters, and general arithmetic, 32-bit integers suffice. But for more complex operations, especially those required of high-performance applications - such as those that perform audio or image processing - the increase in the amount of data the processor can handle in 64-bit mode is significant.
If the program has worked on 32-bit integers, then what benefit does using 64-bits for integers provide, when it uses up more memory?
You make using more memory sound like a bad thing. Doubling the size of pointers means far more memory can be addressed, and the more memory that can be addressed, the less time the OS spends paging code and data in and out. In addition, twice as many data lanes on the processor bus means twice as much data can be moved in a single transfer, and the wider registers mean a 64-bit value can be held and operated on in a single register instead of two. This can translate into a real speedup for applications that work on large values or big blocks of data, though it is far from an automatic doubling of speed for most applications.
Also there will be situations where using a 64-bit integer gives a different result than using a 32-bit integer? ...
Yes, but not in the way you'd think. 32-bit data types and operations are relatively stable in terms of their sizes, as are 64-bit operations on 32-bit hosts (where they are mostly simulated in software, or handled by special hardware or opcodes). You cannot make nearly as many guarantees on a 64-bit architecture, because different compilers implement different 64-bit data models (see LP64, SILP64, and LLP64). Practically, this means casting a 64-bit type to a 32-bit type - say a pointer to an int - can lose information, whereas casting between two data types that are guaranteed to be 64 bits - a pointer and a long on LP64 - is acceptable. ARM is usually compiled using LP64 (all ints are 32-bit, all longs are 64-bit). Again, most developers should not be affected by the switch, but when you start dealing with arbitrarily large numbers that you try to store in integers, truncation and overflow become an issue.
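A minimal C sketch of the data-model point (nothing here is specific to the question; compile it once for a 32-bit target and once for an LP64 target such as arm64 to see the difference):
#include <stdio.h>

int main(void)
{
    /* On an LP64 target (e.g. arm64) long and pointers are 8 bytes while
       int stays at 4; on a typical 32-bit target all three are 4 bytes. */
    printf("int     : %zu bytes\n", sizeof(int));
    printf("long    : %zu bytes\n", sizeof(long));
    printf("pointer : %zu bytes\n", sizeof(void *));
    return 0;
}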
For that reason, I'd recommend using NSUInteger and NSInteger in public interfaces and in APIs where there is no inherent bounds checking or overflow guard. For example, a TableView requests an NSUInteger amount of data not because it's worried about 32-bit versus 64-bit data structures, but because it can make no guarantees about the architecture it will be compiled for. Apple's attempt to make architecture-independent data types is actually a bit of a luxury, considering how little work you have to do to get your code to compile and "just work" on both architectures.
The internal storage for NSInteger is one of several possible backing types, which is why you can use it everywhere and not worry about the underlying width - that is the whole point of it.
Apple takes care of 32-bit versus 64-bit compatibility by selecting the proper underlying type at compile time using the __LP64__ macro:
#if __LP64__
typedef long NSInteger;
typedef unsigned long NSUInteger;
#else
typedef int NSInteger;
typedef unsigned int NSUInteger;
#endif
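One practical consequence of the width changing underneath you is format strings. The usual idiom is to cast to long and print with %ld so the same code is correct on both architectures; here is a generic illustration in plain C that reproduces the typedef so it is self-contained:
#include <stdio.h>

#if __LP64__
typedef long NSInteger;
#else
typedef int NSInteger;
#endif

int main(void)
{
    NSInteger count = 42;
    /* Cast to long and use %ld so the format specifier matches whether
       NSInteger is really an int or a long on this target. */
    printf("count = %ld\n", (long)count);
    return 0;
}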
Is there a way to convert from OMF 16 bit object file format to COFF 32 bit object file format?
I seriously doubt one exists. Code designed to run in a 16-bit environment is binary-incompatible with 32-bit mode. For example, there is an operand-size override prefix that tells the CPU to flip the operand size for the following instruction: in 16-bit mode that prefix is needed to use 32-bit operands, yet the very same byte is needed to use 16-bit operands in 32-bit mode.
Whether a series of opcodes is to be interpreted as 16-bit or 32-bit is specified in the segment descriptor.
Anyway, if you have 16-bit code that you'd like to use in 32-bit mode and it has no OS dependencies, you can do so by disassembling it with IDA and then reassembling it with a 32-bit assembler. Of course, only if that's permitted by its license (although this could be fair use, but IANAL).
If the code is also tied to the underlying OS, this could be a lot more difficult, and could require rewriting significant portions of the code.
Presumably the OMF16 code targets 16-bit x86 real mode or 286 protected mode? That being the case, the object file format is not really your issue; the code itself is entirely incompatible, since it uses different register sizes and a different addressing scheme.
Moreover, if the code is targeted at DOS, Win16 or OS/2 (i.e. the systems that used OMF16), then retargeting it to a 32-bit platform is not just a case of converting the object file format.
You need to rebuild from source, which, given the tags on the question, is either C or C++. Either that, or you have a significant reverse-engineering task on your hands!
I've searched on the net, and found these links:
The first one is a collection of tools:
http://sourceware.org/binutils/
The second one is a tool I think you need:
http://sourceware.org/binutils/docs/binutils/objcopy.html
They do not work in all cases (see bazsi77 above), so just test it.
I have been reading Wikipedia's article on the K programming language and this is what I saw:
The small size of the interpreter and compact syntax of the language makes it possible for K applications to fit entirely within the level 1 cache of the processor.
I am intrigued. How is it possible to have a whole program in the L1 cache? Say the CPU has 256 KB of L1 cache, my program is far smaller than that, and it needs very little memory (say, just for the call stack and such). Say it doesn't need any libraries (although if a program is for an OS it would need to include kernel32.dll or whatever). And doesn't the OS automatically allocate some minimal memory for any program (for the executable code, the stack and the heap)?
Thank you.
I think what they're saying is not that the entire program fits in L1 cache, but that all the code that runs most of the time fits in the L1 cache.
Yes, the OS allocates lots of other structures, but those are hit rarely enough to not matter.
Of course, this is all speculation -- I know nothing about the 'K' language.
I believe they are speaking to the advantage that the main executing code will fit in the L1 cache, regardless of the memory allocated to the program. Once the K application is loaded, memory that is allocated but never touched doesn't matter for performance (i.e. for the benefit of running entirely out of the L1 cache).
The interpreter runs as a normal program managed by the OS. The interpreted program runs within the memory space of the interpreter, in the data segment. Many K programs may easily fit into the L1 cache completely, even though the entire interpreter may not. The main interpreter loop will probably fit though.
You are confusing all of the program's code with its most frequently executed code.
For the interpreted languages the interpreter core is certainly among the most frequently executed code. Having most frequently executed code in cache speeds up execution the same way as having most frequently accessed data in cache does.
The key part is "most frequently" - it's not necessary to have all the code/data cached to see a significant acceleration.
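The same effect is easy to demonstrate on the data side. Here is a rough C sketch (the sizes are assumptions, and the measured gap depends heavily on the compiler and the hardware prefetcher) that performs the same number of additions over a small working set and a large one:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Sum 'len' ints over and over until 'total' additions have been done,
   so both runs do the same amount of work and only the working-set size
   (and therefore the cache behaviour) differs. */
static long long sum_repeatedly(const int *data, size_t len, size_t total)
{
    long long sum = 0;
    for (size_t done = 0; done < total; done += len)
        for (size_t i = 0; i < len; i++)
            sum += data[i];
    return sum;
}

static double time_run(size_t len, size_t total)
{
    int *data = malloc(len * sizeof *data);
    for (size_t i = 0; i < len; i++)
        data[i] = (int)i;

    clock_t start = clock();
    volatile long long sink = sum_repeatedly(data, len, total);
    clock_t end = clock();
    (void)sink;

    free(data);
    return (double)(end - start) / CLOCKS_PER_SEC;
}

int main(void)
{
    const size_t total = 1u << 28;            /* same number of additions each run        */
    const size_t small = 4 * 1024;            /* 16 KB of ints: fits in L1                 */
    const size_t large = 16 * 1024 * 1024;    /* 64 MB of ints: far bigger than any cache  */

    printf("small working set: %.2f s\n", time_run(small, total));
    printf("large working set: %.2f s\n", time_run(large, total));
    return 0;
}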
I was doing some benchmarks of code performance on Windows Mobile devices, and noticed that some algorithms were doing significantly better on some hosts and significantly worse on others, even after taking the difference in clock speeds into account.
The statistics for reference (all results are generated from the same binary, compiled by Visual Studio 2005 targeting ARMv4):
Intel XScale PXA270
Algorithm A: 22642 ms
Algorithm B: 29271 ms
ARM1136EJ-S core (embedded in a MSM7201A chip)
Algorithm A: 24874 ms
Algorithm B: 29504 ms
ARM926EJ-S core (embedded in an OMAP 850 chip)
Algorithm A: 70215 ms
Algorithm B: 31652 ms (!)
I checked out floating point as a possible cause: while algorithm B does use floating-point code, it does not use it in the inner loop, and none of the cores seem to have an FPU anyway.
So my question is: what mechanism may be causing this difference, preferably with suggestions on how to fix/avoid the bottleneck in question?
Thanks in advance.
One possible cause is that the 926 has a shorter pipeline (5 cycles vs. 8 cycles for the 1136, iirc), so branch mispredictions are less costly on the 926.
That said, there are a lot of architectural differences between those processors, too many to say for sure why you see this effect without knowing something about the instructions that you're actually executing.
Clock speed is only one factor. Bus width and latency are as big, if not bigger, factors, and cache is another. So is the speed of the media the program is run from, if it runs from media rather than memory.
Is this test using any shared libraries at any point, or is it all internal code? Fetching shared libraries from media will vary from platform to platform (even if it is, say, the same SD card).
Is this the same algorithm compiled separately for each platform, or the same binary? You can and will see some compiler-induced variation as well: 50% faster or slower can easily come from the same compiler on the same platform just by varying compiler settings. If possible you want to execute the same binary, and ensure that no shared libraries are used in the loop under test. If it is not the same binary, disassemble the loop under test for each platform and ensure that there are no variations other than register selection.
From the data you have presented it's difficult to pinpoint the exact problem, but we can share some prior experience:
Check the cache settings (whether all the processors have the same cache configuration).
You need to check both the D-cache and the I-cache.
For analysis, break down your code further - not just per algorithm but at the block level - and try to identify the block that causes the bottleneck. Once you have found that block, look at its compiled assembly; it may help.
Looks like the problem is in cache settings or something memory-related (maybe I-Cache "overflow").
Pipeline stalls and branch mispredictions usually give less significant differences.
You can try to count the basic operations executed in each algorithm (a rough instrumentation sketch follows after this list), for example:
number of "easy" arithmetical/bitwise ops (+-|^&) and shifts by constant
number of shifts by variable
number of multiplications
number of "hard" arithmetics operations (divides, floating point ops)
number of aligned memory reads (32bit)
number of byte memory reads (8-bit) (these are slower than 32-bit reads)
number of aligned memory writes (32bit)
number of byte memory writes (8bit)
number of branches
something else, don't remember more :)
That will tell you what the 926 does much more slowly. After this you can check the suspicious blocks by making their use of those operations more or less intensive, and you'll get the answer.
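A hedged sketch of what such counting could look like by hand (every name here is made up for illustration; it is not the questioner's code):
#include <stdio.h>
#include <stdint.h>

/* Hypothetical per-category counters, bumped by hand in the code paths
   under suspicion; compare the totals between algorithm A and B. */
static uint64_t n_mul, n_byte_read, n_branch;

#define COUNT(c) ((c)++)

/* A made-up stand-in for one block of the algorithm under test. */
static int process_block(const unsigned char *buf, int len)
{
    int acc = 0;
    for (int i = 0; i < len; i++) {
        COUNT(n_branch);        /* the loop branch        */
        COUNT(n_byte_read);     /* an 8-bit memory read   */
        acc += buf[i] * 31;
        COUNT(n_mul);           /* one multiplication     */
    }
    return acc;
}

int main(void)
{
    unsigned char buf[256];
    for (int i = 0; i < 256; i++)
        buf[i] = (unsigned char)i;

    process_block(buf, 256);

    printf("multiplies : %llu\n", (unsigned long long)n_mul);
    printf("byte reads : %llu\n", (unsigned long long)n_byte_read);
    printf("branches   : %llu\n", (unsigned long long)n_branch);
    return 0;
}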
Furthermore, it's much better to enable assembly listing generation in VS and use that (rather than your high-level source code) as the basis for this research.
P.S.: maybe the problem is in the OS/software/firmware? Did you test on a clean system? Is the OS the same on all devices?