code running very slowly after importing ansi c into iphone project - objective-c

I have an ANSI C code that is about 10,000 lines long that I am trying to use in an iPhone project. When I compile the code with gcc on the command line, I type the following:
gcc -o myprog -O3 myprog.c
This program reads in large jpeg files and does some fancy processing on them, so I call it with the following
./myprog mypic.jpg
and from the command line, this takes take about 0.1 seconds.
I'm trying to import this code into an iPhone project but I'm not entirely sure how. I was able to get it to compile and run successfully by renaming myprog.c to myprog.h and then calling the functions in the C code from within a generic NSObject class. I added the O3 optimization to the project's Other C Flags. However, when I do this, the code on the simulator takes about 2 seconds to run and on the iPhone about 7 seconds to run which renders an unacceptable user experience.
Any tips on on hoe to get this going would be much appreciated.

It's hard to say for sure where the slowness comes from, or if there is any way around it, but right off the bat you've done something wrong.
You shouldn't have renamed a .c file to a .h file and included it. You should have written a .h (header) file that had the function, variable, and type declarations declared:
myprog.h:
#ifndef MYPROG_H_
#define MYPROG_H_
struct thing {
int a;
int b;
};
extern int woof;
int foo(void * buf, int size);
#endif /* MYPROG_H_ */
Then you should compile the .c file to an object file (or library) and link the main program against that. If you were to have included the .h file that was really just a renamed .c file into more than one source code files it may have resulted in having multiple versions of some data and code in your program.
You'll probably also want to go through and separate out any code in myprog.c that you won't be using in your iPhone program. I'll bet that there is plenty.
As far as why the program is slowing down, this could have to do with myprog being written to make use of some resources that aren't available on the iPhone. The first thing that comes to mind is large amounts of RAM, since many desktop applications are written as though available RAM is infinite, and I could see how some .jpg manipulation code could be written this way. The way to get around this would be to try to rework the algorithm so that it did not load as much of the picture at one time while working on it.
The second thing that come is floating point code. Floating point operations are common in image manipulation code, but often either not available or severely limited in embedded systems. In the case of iPhones they are available, but according to something I heard, their performance is noticeably hampered if you compile your code to thumb rather than regular ARM code. (I've never developed for an iPhone or its particular processor so I don't know for sure, but it is worth looking into).
Another place where things could be slowing down would be if there were some sort of translation between Objective C objects and C structures that you have somehow introduced and is happening a lot more often than it should need to. There are probably other slow downs that could happen because of this, but you might be able to test this theory out by creating a objective C program for your desktop that uses the myprog.c code in a manner similar to the iPhone program's use of it.
Another thing you probably should look into is profiling your iPhone program. Profiling determines (or only helps to determine, in some cases) where the program is spending its time. Knowing this doesn't necessarily tell you that the code that runs the most is bad or that anything about it could be improved, but it does tell you where to look. And sometimes you may look at the results and immediately know that some function that you thought was only going to be called once at the beginning of the program is actually being called repeatedly, which highly suggests that some improvement can be made.
I'm sure that a little searching will turn up how to go about this.

Related

Ways to make a D program faster

I'm working on a very demanding project (actually an interpreter), exclusively written in D, and I'm wondering what type of optimizations would generally be recommended. The project makes heavy use of GC, classes, asssociative arrays, and pretty much anything.
Regarding compilation, I've already experimented both with DMD and LDC flags and LDC with -flto=full -O3 -Os -boundscheck=off seems to be making a difference.
However, as rudimentary as this may sound, I would like you to suggest anything that comes to your mind that could help speed up the performance, related or not to the D language. (I'm sure I'm missing several things).
Compiler flags: I would add -mcpu=native if the program will be running on your machine. Not sure what effect -Os has in addition to -O3.
Profiling has been mentioned in comments. Personally under Linux I have a script which dumps a process's stack trace and I do that a few times to get an idea of where it's getting hung up on.
Not sure what you mean by GS.
Since you mentioned classes: in D, methods are virtual by default; virtual methods add indirections and are not inlineable. Make sure only those methods that must be virtual are. See if you can rewrite your program using a form of polymorphism that doesn't involve indirections, such as using template metaprogramming.
Since you mentioned associative arrays: these make heavy use of the GC; to speed them up, switch to a third-party library that works on top of std.allocator, such as https://github.com/dlang-community/containers
If some parts of your code are parallelizable, std.parallelism is a good tool for this.
Since you mentioned that the project is an interpreter: there are many avenues for optimizing them, up to JIT/AOT compilation. Perhaps you could link to an existing library such as LLVM or libjit.

How to find the size of a reg in verilog?

I was wondering if there were a way to compute the size of a reg in Verilog. I researched it quite a bit, and found $size(a), but it's only in SystemVerilog, and it won't work in my verilog program.
Does anyone know an alternative for this??
I also wanted to ask as a side note; I'm having some trouble with my test bench in the sense that when I update a value in the file, that change is not taken in consideration when I simulate. I've been told I might have been using an old test bench but the one I am continuously simulating is the only one available in this project.
EDIT:
To give you an idea of what's the problem: in my code there is a "start" signal and when it is set to 1, the operation starts. Otherwise, it stays idle. I began writing the test bench with start=0, tested it and simulated it, then edited the test bench by setting start to 1. But when I simulate it, the start signal remains 0 in the waveform. I tried to check whether I was using another test bench, but it is the only test bench I am using in this project.
Given that I was on a deadline, I worked on the code so that it would adapt to the "frozen" test bench. I am getting now all the results I want, but I wanted to test some other features of my code, so I created a new project and copy pasted the code in new files (including the same test bench). But when I ran a simulation, the waveform displayed wrong results (even though I was using the exact same code in all modules and test bench). Any idea why?
Any help would be appreciated :)
There is a standardised way to do this, but it requires you to use the VPI, which I don't think you get on Modelsim's student edition. In short, you have to write C code, and dynamically link it to the simulator. In the C code, you can get object properties using routines such as vpi_get. Useful properites might be vpiSize, which is what you want, vpiLeftRange, vpiRightRange, and so on.
Having said all that, Verilog is essentially a static language, and objects have to be declared with a static width using constant expressions. Having a run-time method to determine an object's size is therefore of pretty limited value (since you should already know it), and may not solve whatever problem you actually have. Your question would make more sense for VHDL (and SystemVerilog?), which are much more dynamic.
Note on Icarus: the developers have pushed lots of SystemVerilog stuff back into the main language. If you take advantge of this you may find that your code is not portable.
Second part of your question: you need to be specific on what your problem actually is.

Why all programs are divided into 200 basic blocks by Valgrind?

Why all programs are divided into 200 basic blocks by Valgrind? And how to divided?
First Question
It's been some time since I've worked on a Valgrind tool (even longer than this question is old), but in case anyone is still interested, here's what I've dredged up from memory:
First, a distinction: a super block is a bit different from a basic block. Valgrind uses super blocks, not basic blocks. A super block may exit at any point, but a basic block will only ever exit by running off its end.
Valgrind doesn't divide a program into 200 super blocks. I'm pretty sure that it instead breaks programs up into super blocks of no more than 200 IRStatements (which may or may not translate directly into instructions).
The reason for this I'm pretty sure is for efficiency of the translator: at least with current versions of Valgrind I'm reasonably sure it doesn't translate your entire program up front. Translating the program into its IR format is time consuming and resource intensive, so the translator seeks to only translate as much of the program as it needs to. It does this by only translating code as it gets executed for the first time.
Second Question
Now, as to your second question... I'm not entirely sure what you're asking. If you're asking, "How does Valgrind decide how to divide up the program?", then the answer is that it decides similarly to a compiler. It starts converting the program into super blocks, and starts a new super block whenever it reaches the block limit size or detects that there is an entry point into the block from elsewhere (super blocks and basic blocks can only have one entry point).
If you instead meant, "Can I change the size of an IRSB super block?", then yes, there is an option you can pass back to Valgrind in your tools initialization code to tell it what size super blocks you want (although I don't recall if you can increase this to an arbitrary size). None of this is documented online, and only sparsely documented in the files themselves. You can take a look at the source to the other tools to see how they pass configuration options to Valgrind during initialization. That should at least give you a good idea on which headers to look at to figure out what option you need to pass back to Valgrind.

Simulating multiple instances of an embedded processor

I'm working on a project which will entail multiple devices, each with an embedded (ARM) processor, communicating. One development approach which I have found useful in the past with projects that only entailed a single embedded processor was develop the code using Visual Studio, divided into three portions:
Main application code (in unmanaged C/C++ [see note])
I/O-simulating code (C/C++) that runs under Visual Studio
Embedded I/O code (C), which Visual Studio is instructed not to build, runs on the target system. Previously this code was for the PIC; for most future projects I'm migrating to the ARM.
Feeding the embedded compiler/linker the code from parts 1 and 3 yields a hex file that can run on the target system. Running parts 1 and 2 together yields code which can run on the PC, with the benefit of better debugging tools and more precise control over I/O behavior (e.g. I can make the simulation code introduce certain types of random hiccups more easily than I can induce controlled hiccups on real hardware).
Target code is written in C, but the simulation environment uses C++ so as to simulate I/O registers. For example, I have a PortArray data structure; the header file for the embedded compiler includes a line like unsigned char LATA # 0xF89; and my header file for simulation includes #define LATA _IOBIT(f89,1) which in turn invokes a macro that accesses a suitable property of an I/O object, so a statement like LATA |= 4; will read the simulated latch, "or" the read value with 4, and write the new value. To make this work, the target code has to compile under C++ as well as under C, but this mostly isn't a problem. The biggest annoyance is probably with enum types (which behave as integers in C, but have to be coaxed to do so in C++).
Previously, I've used two approaches to making the simulation interactive:
Compile and link a DLL with target-application and simulation code, and have VB code in the same project which interacts with it.
Compile the target-application code and some simulation code to an EXE with instance of Visual Studio, and use a second instance of Visual Studio for the simulation-UI. Have the two programs communicate via TCP, so nearly all "real" I/O logic is in the simulation program. For example, the aforementioned `LATA |= 4;` would send a "read port 0xF89" command to the TCP port, get the response, process the received value, and send a "write port 0xF89" command with the result.
I've found the latter approach to run a tiny bit slower than the former in some cases, but it seems much more convenient for debugging, since I can suspend execution of the unmanaged simulation code while the simulation UI remains responsive. Indeed, for simulating a single target device at a time, I think the latter approach works extremely well. My question is how I should best go about simulating a plurality of target devices (e.g. 16 of them).
The difficulty I have is figuring out how to make each simulated instance get its own set of global variables. If I were to compile to an EXE and run one instance of the EXE for each simulated target device, that would work, but I don't know any practical way to maintain debugger support while doing that. Another approach would be to arrange the target code so that everything would compile as one module joined together via #include. For simulation purposes, everything could then be wrapped into a single C++ class, with global variables turning into class-instance variables. That would be a bit more object-oriented, but I really don't like the idea of forcing all the application code to live in one compiled and linked module.
What would perhaps be ideal would be if the code could load multiple instances of the DLL, each with its own set of global variables. I have no idea how to do that, however, nor do I know how to make things interact with the debugger. I don't think it's really necessary that all simulated target devices actually execute code simultaneously; it would be perfectly acceptable for simulation instances to use cooperative multitasking. If there were some way of finding out what range of memory holds the global variables, it might be possible to have the 'task-switch' method swap out all of the global variables used by the previously-running instance and swap in the contents applicable to the instance being switched in. Although I'd know how to do that in an embedded context, though, I'd have no idea how to do that on the PC.
Edit
My questions would be:
Is there any nicer way to allow simulation logic to be paused and examined in VS2010 debugger, while keeping a responsive UI for the simulator front-end, than running the simulator front end and the simulator logic in separate instances of VS2010, if the simulation logic must be written in C and the simulation front end in managed code? For example, is there a way to tell the debugger that when a breakpoint is hit, some or all other threads should be allowed to keep running while the thread that had hit the breakpoint sits paused?
If the bulk of the simulation logic must be source-code compatible with an embedded system written in C (so that the same source files can be compiled and run for simulation purposes under VS2010, and then compiled by the embedded-systems compiler for use in real hardware), is there any way to have the VS2010 debugger interact with multiple simulated instances of the embedded device? Assume performance is not likely to be an issue, but the number of instances will be large enough that creating a separate project for each instance would be likely be annoying in the absence of any way to automate the process. I can think of three somewhat-workable approaches, but don't know how to make any of them work really nicely. There's also an approach which would be better if it's possible, but I don't know how to make it work.
Wrap all the simulation code within a single C++ class, such that what would be global variables in the target system become class members. I'm leaning toward this approach, but it would seem to require everything to be compiled as a single module, which would annoyingly affect the design of the target system code. Is there any nice way to have code access class instance members as though they were globals, without requiring all functions using such instances to be members of the same module?
Compile a separate DLL for each simulated instance (so that e.g. if I want to run up to 16 instances, I would include 16 DLL's in the project, all sharing the same source files). This could work, but every change to the project configuration would have to be repeated 16 times. Really ugly.
Compile the simulation logic to an EXE, and run an appropriate number of instances of that EXE. This could work, but I don't know of any convenient way to do things like set a breakpoint common to all instances. Is it possible to have multiple running instances of an EXE attached to a single debugger instance?
Load multiple instances of a DLL in such a way that each instance gets its own global variables, while still being accessible in the debugger. This would be nicest if it were possible, but I don't know any way to do so. Is it possible? How? I've never used AppDomains, but my intuition would suggest that might be useful here.
If I use one VS2010 instance for the front-end, and another for the simulation logic, is there any way to arrange things so that starting code in one will automatically launch the code in the other?
I'm not particularly committed to any single simulation approach; while it might be nice to know if there's some way of slightly improving the above, I'd also like to know of any other alternative approaches that could work even better.
I would think that you'd still have to run 16 copies of your main application code, but that your TCP-based I/O simulator could keep a different set of registers/state for each TCP connection that comes in.
Instead of a bunch of global variables, put them into a single structure that encompasses the I/O state of a single device. Either spawn off a new thread for each socket, or just keep a list of active sockets and dedicate a single instance of the state structure for each socket.
the simulators I have seen that handle multiple instances of the instruction set/processor are designed that way. There is a structure usually that contains a complete set of registers, and a new pointer or an array of these structures are used to multiply them into multiple instances of the processor.

Using open source SNES emulator code to turn a rom file into a self-contained executable game

Would it be possible to take the source code from a SNES emulator (or any other game system emulator for that matter) and a game ROM for the system, and somehow create a single self-contained executable that lets you play that particular ROM without needing either the individual rom or the emulator itself to play? Would it be difficult, assuming you've already got the rom and the emulator source code to work with?
It shouldn't be too difficult if you have the emulator source code. You can use a method that is often used to store images in c source files.
Basically, what you need to do is create a char * variable in a header file, and store the contents of the rom file in that variable. You may want to write a script to automate this for you.
Then, you will need to alter the source code so that instead of reading the rom in from a file, it uses the in memory version of the rom, stored in your variable and included from your header file.
It may require a little bit of work if you need to emulate file pointers and such, or you may be lucky and find that the rom loading function just loads the whole file in at once. In this case it would probably be as simple as replacing the file load function with a function to return your pointer.
However, be careful for licensing issues. If the emulator is licensed under the GPL, you may not be legally allowed to store a proprietary file in the executable, so it would be worth checking that, especially before you release / distribute it (if you plan to do so).
Yes, more than possible, been done many times. Google: static binary translation. Graham Toal has a good howto paper on the subject, should show up early in the hits. There may be some code out there I may have left some code out there.
Completely removing the rom may be a bit more work than you think, but not using an emulator, definitely possible. Actually, both requirements are possible and you may be surprised how many of the handheld console games or set top box games are translated and not emulated. Esp platforms like those from Nintendo where there isnt enough processing power to emulate in real time.
You need a good emulator as a reference and/or write your own emulator as a reference. Then you need to write a disassembler, then you have that disassembler generate C code (please dont try to translate directly to another target, I made that mistake once, C is portable and the compilers will take care of a lot of dead code elimination for you). So an instruction of a make believe instruction set might be:
add r0,r0,#2
And that may translate into:
//add r0,r0,#2
r0=r0+2;
do_zflag(r0);
do_nflag(r0);
It looks like the SNES is related to the 6502 which is what Asteroids used, which is the translation I have been working on off and on for a while now as a hobby. The emulator you are using is probably written and tuned for runtime performance and may be difficult at best to use as a reference and to check in lock step with the translated code. The 6502 is nice because compared to say the z80 there really are not that many instructions. As with any variable word length instruction set the disassembler is your first big hurdle. Do not think linearly, think execution order, think like an emulator, you cannot linearly translate instructions from zero to N or N down to zero. You have to follow all the possible execution paths, marking bytes in the rom as being the first byte of an instruction, and not the first byte of an instruction. Some bytes you can decode as data and if you choose mark those, otherwise assume all other bytes are data or fill. Figuring out what to do with this data to get rid of the rom is the problem with getting rid of the rom. Some code addresses data directly others use register indirect meaning at translation time you have no idea where that data is or how much of it there is. Once you have marked all the starting bytes for instructions then it is a trivial task to walk the rom from zero to N disassembling and or translating.
Good luck, enjoy, it is well worth the experience.