I have lex/yacc code which captures some data after parsing a file. That file is in a specific format. Consider this format:
File format:
ABC
Something something
ABC
Something something
....
....
The lex/yacc code is sequential right now. Is it possible to make it multithreaded for a single file by dividing the file into chunks separated by ABC?
Where to start?
I shall be happy to share more details, if needed.
The problem is that you don't know where the delimiters are until you lexically analyse the file. Unless the delimiters are contextually unambiguous (which won't be the case if your syntax includes things like comments or quoted strings), you can't start the analysis at a random point in the file without knowing the context of that point. So in common cases, finding the delimiters requires a linear scan.
You could hand a chunk of the file to a separate chunk parser thread every time you identify the chunk's endpoints. That will require scanning the chunk twice (once for the initial delimiter scan and a second time for the precise parse), but it might still be a win if parsing is relatively slow. You could avoid the second scan by retaining the token stream in memory and passing the parser thread the list of tokens, but it may well turn out that the memory management and other bookkeeping for the token lists are as expensive as the lexical scan, or more so. That will depend on a lot of factors, so experimentation and benchmarking in your particular use case will be necessary.
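Here is a rough sketch (in D, using std.parallelism) of that two-pass idea; parseChunk and the input file name are placeholders, and it assumes the ABC delimiter can only occur at the start of a line and that the real scanner/parser is re-entrant so several instances can run concurrently:

import std.file : readText;
import std.parallelism : parallel;

// Placeholder for handing one chunk to a re-entrant scanner/parser instance.
void parseChunk(const(char)[] chunk)
{
    // ... run the real lexer/parser over `chunk` here ...
}

void main()
{
    auto text = readText("input.txt");   // hypothetical input file

    // Pass 1: cheap linear scan that only looks for "ABC" at the start of
    // a line and records the chunk boundaries.
    const(char)[][] chunks;
    size_t start = 0;
    foreach (i; 0 .. text.length)
    {
        immutable atLineStart = i == 0 || text[i - 1] == '\n';
        if (atLineStart && i + 3 <= text.length && text[i .. i + 3] == "ABC" && i != start)
        {
            chunks ~= text[start .. i];
            start = i;
        }
    }
    chunks ~= text[start .. $];

    // Pass 2: parse the chunks concurrently.
    foreach (chunk; parallel(chunks))
        parseChunk(chunk);
}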
If the delimiters are completely unambiguous and the chunks are relatively short, you could divide the file into pages of some convenient size and give each thread its own pages to analyse. When a thread is assigned a page, it will have to skip initial data until it finds the first delimiter; when it reaches the end of the page, it will have to continue scanning the rest of the current chunk in the next page. So there will be a bit of duplicate scanning, but if the chunks are short relative to the page size, it won't be too bad. Again, benchmarking your particular use case will be useful. And, again, you can only do this if you can unambiguously recognise a delimiter without knowing anything about the lexical context.
The various headers of a wav-file contain file-length information. Consider the case where I generate a wav file without knowing how long it is going to be, and possibly without the ability to alter the header after I have finished (i.e. when writing to a pipe). What should I write into these fields?
Either way this isn't an ideal situation. But, if there's absolutely no way to edit the file, I'd recommend writing 0xFFFFFFFF, that is, the maximum possible value that can be assigned to the Subchunk2Size field of a standard wav header (albeit somewhat of a hack). Doing so will allow the whole file to be read/played by practically all players.
Some players rely solely on this field to calculate the audio's length (so they know when to loop, how far to allow seeking, etc.), so saying the file is longer than it actually is will "trick" the player into processing the entire file (although, depending on the player, an error may occur once it reaches the end of the audio).
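For reference, here is a minimal sketch of such a streaming header (16-bit stereo PCM at 44.1 kHz is assumed, the host is assumed to be little-endian, and the RIFF ChunkSize gets the same 0xFFFFFFFF placeholder as Subchunk2Size, since neither can be known up front):

import std.stdio : stdout;

// Canonical 44-byte PCM WAV header with placeholder size fields, written
// as-is on a little-endian host (WAV fields are little-endian).
align(1) struct WavHeader
{
align(1):
    char[4] riff = "RIFF";
    uint    chunkSize = 0xFFFFFFFF;     // unknown: would be file size - 8
    char[4] wave = "WAVE";
    char[4] fmt_ = "fmt ";
    uint    subchunk1Size = 16;         // PCM format block size
    ushort  audioFormat = 1;            // 1 = uncompressed PCM
    ushort  numChannels = 2;
    uint    sampleRate = 44_100;
    uint    byteRate = 44_100 * 2 * 2;  // sampleRate * channels * bytesPerSample
    ushort  blockAlign = 2 * 2;         // channels * bytesPerSample
    ushort  bitsPerSample = 16;
    char[4] data = "data";
    uint    subchunk2Size = 0xFFFFFFFF; // unknown: number of payload bytes
}

void main()
{
    WavHeader hdr;
    stdout.rawWrite((&hdr)[0 .. 1]);    // write the 44-byte header...
    // ...then stream the raw PCM samples after it.
}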
I'm writing a directory tree scanning function in D that tries to combine tools such as grep and file and conditionally grep for things in a file only if it doesn't match a set of magic bytes indicating file types such as ELF, images, etc.
What is the best approach to making such exclusion logic run as fast as possible with regard to minimizing file I/O? I typically don't want to read in the whole file if I only need to read some magic bytes at the beginning. However, to keep the code general for the future (some magics may lie at the end or somewhere other than the beginning), it would be nice if I could use an mmap-like interface to lazily fetch data from the disk only when it is read. The array interface also simplifies my algorithms.
Is D's std.mmfile the best option in this case?
Update: According to this post I guess mmap is advised: http://forum.dlang.org/thread/dlrwzrydzjusjlowavuc#forum.dlang.org
If I only need read-access as an array (opIndex) are there any cons to using std.mmfile over std.stdio.File or std.file?
If you want to lazily read a file with Phobos, you pretty much have three options:
1. Use std.stdio.File's byLine and read a line at a time.
2. Use std.stdio.File's byChunk and read a particular number of bytes at a time.
3. Use std.mmfile.MmFile and operate on the file as an array, taking advantage of mmap underneath the hood to avoid reading in the whole file.
I fully expect that #3 is going to be the fastest (profiling could prove differently, but I'd be very surprised given how fantastic mmap is). It's also probably the easiest to use, because you get an array to operate on. The only problem with MmFile that I'm aware of is that it's a class when it should arguably be a ref-counted struct so that it would clean itself up when you were done. Right now, if you don't want to wait for the GC to clean it up, you'd have to manually call unmap on it or use destroy to destroy it without freeing its memory (though destroy should be used with caution). There may be some sort of downside to using mmap (which would then naturally mean that there was a downside to using MmFile), but I'm not aware of any.
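A minimal sketch of option #3 along those lines, using a hypothetical ELF magic check and cleaning the mapping up eagerly rather than waiting for the GC:

import std.mmfile : MmFile;

// Returns true if the file starts with the ELF magic bytes.
bool looksLikeElf(string path)
{
    auto mm = new MmFile(path);   // read-only mapping
    scope(exit) destroy(mm);      // unmap now instead of waiting for the GC

    if (mm.length < 4)
        return false;

    // opIndex gives byte-wise access; only the touched pages are faulted in.
    return mm[0] == 0x7F && mm[1] == 'E' && mm[2] == 'L' && mm[3] == 'F';
}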
In the future, we're going to end up with some range-based streaming I/O stuff, which might be closer to what you need without actually using mmap, but that hasn't been completed yet, and mmap is so incredibly cool that there's a good chance that it would still be better to use MmFile.
You can combine seek and rawRead of std.stdio.File to do what you want.
You can then do a rawRead for only the first few bytes:
import std.stdio : File;

File file = File(path);                      // the file being checked
ubyte[1024] buff;
ubyte[] magic = file.rawRead(buff[0 .. 4]);  // only the first 4 bytes are read
// check magic ...
Then, depending on the OS's caching/read-ahead strategy, this can be nearly as fast as MmFile; however, multiple seeks will ruin the read-ahead behavior.
In the sample installation and configuration instructions, it is suggested that OpenGrok requires two staging areas, the rationale being that one area is an index-regeneration work area and the other is a production area, and the two are rotated with every index regeneration.
Is that really necessary? Can I have only one area instead of two?
I'm looking for an answer that is specific to opengrok, and not a general list of race conditions one might encounter.
Strictly speaking, this is not necessary. In fact, I am pretty sure the overwhelming majority of deployments run without a staging area.
That said, you need to decide if you are comfortable with a window of inconsistency that could result in some failed/imprecise searches. Let's assume that the source was updated (e.g. via git pull in the case of Git) and the indexer has not finished processing the new changes yet. Thus, the index still contains data reflecting the old state of the source. Let's say the changes applied to the source removed a file. Now if someone initiates a search that matches the contents of the removed file, the search will probably end with an error. This is probably the better alternative; consider the case when a more subtle change is made to a file, such as the removal or addition of a couple of lines of code. In that case the symbol definitions will be off, so the search results will bring you to the wrong line of code. Or, for a not-so-subtle change, when e.g. a function definition is removed from a file, the search results for references to this function will contain invalid places.
The length of the inconsistency window stems from the indexing time that is largely dependent on 2 things, at least currently:
size of the changes applied to the source
size of the source directory tree
The first is relevant because of history processing. The more incoming history changes (e.g. changesets in Git), the more work the indexer will have to do to generate history cache and/or history fields for the index (assuming history handling is on).
The second is relevant because the indexer traverses the whole source directory tree to find out which files have changed, which might incur lots of syscalls and potentially lots of I/O. At least until https://github.com/oracle/opengrok/issues/3077 is implemented, and even that will help only source code management systems based on changesets.
I'm creating a VB.NET WinForms application that will take in user-given strings, parse them, and print out labels with variable information. The given string will be used in all the labels, but the variable part of the string will change with each label.
My question is: is it better to parse the strings one time, then store those values in arrays, or to parse the string each time a label is printed? Which will perform better? Which is better practice? What is the proper way to test something like this?
If saving data in memory is done for performance, my preference would be not to do it until I knew the parsing was actually a performance problem, which I would detect by random-pausing.
Generally my rule of thumb is: the less data structure, the better.
If you have data structure you have to worry about it getting stale, and any time you change the program you have more code to modify and put bugs into.
Besides, parsing does not need to be slow, especially compared to whatever else you're doing.
Parsing is definitely heavy stuff.
In this particular instance, if you do not have any memory constraints, then you should parse once and store the results to speed up your application, especially if it is going to be something that is used by people :)
A long time ago, I had to test a program generating a postscript file image. One quick way to figure out if the program was producing the correct, expected output was to do an md5 of the result to compare against the md5 of a "known good" output I checked beforehand.
Unfortunately, the generated PostScript contains the current time within the file. This time is, of course, different depending on when the test runs, therefore changing the md5 of the result even if the expected output is obtained. As a fix, I just stripped off the date with sed.
This is a nice and simple scenario. We are not always so lucky. For example, now I am programming a writer program, which creates a big fat RDF file containing a bunch of anonymous nodes and uuids. It is basically impossible to check the functionality of the whole program with a simple md5, and the only way would be to read the file with a reader and then validate the output through this reader. As you probably realize, this opens a new can of worms: first, you have to write a reader (which can be time-consuming); second, you are assuming the reader is functionally correct and at the same time in sync with the writer. If both the reader and the writer are in sync, but on incorrect assumptions, the reader will say "no problem", but the file format is actually wrong.
This is a general issue when you have to perform functional testing of a file format, and the file format is not completely reproducible through the input you provide. How do you deal with this case?
In the past I have used a third party application to validate such output (preferably converting it into some other format which can be mechanically verified). The use of a third party ensures that my assumptions are at least shared by others, if not strictly correct. At the very least this approach can be used to verify syntax. Semantic correctness will probably require the creation of a consumer for the test data which will likely always be prone to the "incorrect assumptions" pitfall you mention.
Is the randomness always in the same places? I.e. is most of the file fixed but there are some parts that always change? If so, you might be able to take several outputs and use a programmatic diff to determine the nondeterministic parts. Once those are known, you could use the information to derive a mask and then do a comparison (md5 or just a straight compare). Think about pre-processing the file to remove (or overwrite with deterministic data) the parts that are non-deterministic.
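A rough sketch of that masking idea (the file names are made up, and it assumes two known-good reference outputs of the same length as the output under test):

import std.digest.md : md5Of;
import std.file : read;

void main()
{
    // Two known-good outputs: any byte position where they disagree is
    // treated as non-deterministic.
    auto ref1 = cast(ubyte[]) read("good1.out");
    auto ref2 = cast(ubyte[]) read("good2.out");
    auto test = cast(ubyte[]) read("test.out");
    assert(ref1.length == ref2.length && ref1.length == test.length);

    // Overwrite the non-deterministic positions with a fixed value before
    // hashing, so only the stable parts take part in the comparison.
    foreach (i; 0 .. ref1.length)
        if (ref1[i] != ref2[i])
        {
            ref1[i] = 0;
            test[i] = 0;
        }

    assert(md5Of(ref1) == md5Of(test), "output differs in a deterministic region");
}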
If the whole file is non-deterministic then you'll have to come up with a different solution. I did testing of MPEG-2 decoders which are non-deterministic. In that case we were able to do a PSNR and fail if it was above some threshold. That may or may not work depending on your data but something similar might be possible.