I am working on a project that will copy files to a database every time something is added to a specific directory. The program works fine when I'm testing with a small set of data, but I was wondering if someone could explain how the FileSystemWatcher.Created event works.
My main concern is when I use this on a larger scale the program may slow down when it handles 100,000+ files.
If this is an issue, could anyone explain whether there is some sort of workaround to polling the original folder, let's call it "C:\folder", such as polling a temp folder instead?
I have not tested the watcher with 100,000 files. However, in most cases you should not have so many files in a folder awaiting processing. I recommend a structure like:
C:\folder
C:\folder\processing
C:\folder\archive
C:\folder\error
As soon as you begin working on a given file, move it into processing. If you process it successfully, move it from processing into archive. If there is an error while processing, move it into error instead.
This will make it easier for you to keep the files organized and diagnose problems that occur in production.
With that file structure, you will not run into issues with large numbers of files in the folder you are watching, unless you receive files in incredibly large bursts compared to the speed with which they can be moved into the processing state.
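If it helps, here is a minimal polling sketch of that workflow in Python (the folder names match the structure above; process_file() is a placeholder for whatever copies the file into your database):

import shutil
import time
from pathlib import Path

WATCH = Path(r"C:\folder")          # incoming files land here
PROCESSING = WATCH / "processing"
ARCHIVE = WATCH / "archive"
ERROR = WATCH / "error"

def process_file(path: Path) -> None:
    ...  # placeholder: copy the file into the database

def poll_once() -> None:
    # Only consider files sitting directly in the watch folder.
    for entry in WATCH.iterdir():
        if not entry.is_file():
            continue
        # Claim the file first, so a crash mid-processing is visible.
        claimed = PROCESSING / entry.name
        shutil.move(str(entry), str(claimed))
        try:
            process_file(claimed)
            shutil.move(str(claimed), str(ARCHIVE / claimed.name))
        except Exception:
            shutil.move(str(claimed), str(ERROR / claimed.name))

if __name__ == "__main__":
    for d in (PROCESSING, ARCHIVE, ERROR):
        d.mkdir(exist_ok=True)
    while True:
        poll_once()
        time.sleep(1)  # simple polling; an event-based watcher works just as well

The same state transitions apply if you keep FileSystemWatcher and do the moves from its Created handler.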
This is going to be a question with a lot of hypotheticals, but it's been on my mind for a while now and I finally want to get some perspectives on how to tackle this "issue". For the sake of the question, I'll make up an example of how the program I want to build would work at a conceptual level, without too many specifics.
The Problem
I want to create a program to keep track of miscellaneous info for files and folders. This miscellaneous info can be anything from comments and authors to more specific info like the original source of the file (a URL, for example), categories, tags, and more. All of this info is tracked in an SQLite database.
Now... how would you create a connection from the file (or folder) to the database? Whatever file is added to the program should continue to operate independently of the program, meaning you should be able to edit, copy, move, rename or do anything else with the file that you would usually do with your OS of choice, even deleting it.
You should even be able to archive it, zip it, upload it somewhere or do other things that temporarily or permanently remove the file from your system, without losing the connection to the database. The program itself never actually touches the files, except to generate a new entry in the database, but obviously there should be some kind of reference in the file to a database entry in the program.
Yes, I know that if you delete the file, you would have a dead entry in the database. For now, just treat this as an unfortunate reality that can't be solved unless you incorporate the file more closely into the program.
Possible solutions and why I decided against them
Reference inside Filename
Probably the most obvious choice, you could just have a reference inside the filename to point to a database entry, for example by including the id at the start of the filename:
#1 my-example-file.txt
#12814 this-is-one-of-many-files.txt
Obviously, that goes against what I established earlier, as you would be restricted from freely renaming the file. You would always have to keep in mind not to mess with the id inside the filename, or else the connection to your program is broken. Unfortunately, this is the best bet I currently have, but I would like to avoid this approach if possible.
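If you do end up going this route, at least parsing the id back out is simple. A hypothetical Python helper:

import re

# Matches filenames like "#12814 this-is-one-of-many-files.txt".
ID_PREFIX = re.compile(r"^#(\d+)\s+(.*)$")

def split_id(filename: str):
    """Return (database_id, display_name), or (None, filename) if untagged."""
    m = ID_PREFIX.match(filename)
    if m:
        return int(m.group(1)), m.group(2)
    return None, filename

print(split_id("#12814 this-is-one-of-many-files.txt"))
# -> (12814, 'this-is-one-of-many-files.txt')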
Alternate Data Streams (ADS)
A pretty cool feature I recently discovered that's available on NTFS file systems, ADS allows you to store different streams of data for your files, to grossly simplify it. You could attach a data stream to your file that saves the id for the database entry in the program, and a regular user would never be able to mess directly with that.
However, since this is a feature reserved for specific file systems, there are some ugly side effects to ADS, as you can easily lose that part of the file by:
moving/copying it to a file system that doesn't support ADS, such as the file systems most often used in removable drives
uploading it to a cloud then later downloading it
moving it to another OS that might not support ADS or treats it in an unexpected way
zipping it
Thus I can't really rely on ADS either.
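For completeness, this is roughly what the ADS approach looks like in Python on a Windows/NTFS volume (the dbid stream name is invented for this example):

# Windows/NTFS only: an alternate data stream is addressed as "name:stream".
def write_db_id(path: str, db_id: int) -> None:
    with open(path + ":dbid", "w") as stream:
        stream.write(str(db_id))

def read_db_id(path: str):
    try:
        with open(path + ":dbid") as stream:
            return int(stream.read())
    except (OSError, ValueError):
        return None  # stream missing or stripped (copied off NTFS, zipped, ...)

write_db_id("my-example-file.txt", 12814)
print(read_db_id("my-example-file.txt"))  # 12814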
I have a multitude of macros that break information down into multiple files, e.g. for each row, create a separate worksheet; or for each row, create a .docx and a .pdf document.
Now, to test these macros, I always need to move outside my folders synced to OneDrive/SharePoint, because whenever a new file is created in a synced location, Office takes its damned time doing synchronisation work, considerably slowing down the macro execution.
This is equally, or even more so, a problem in production, where the macro is run on a much larger sample and by other users, so I have to train them to move the file out of the shared location (dedicated to collaboration) to their own drive.
Is there a way to defer these actions until after the macro execution (besides disabling the OneDrive app)? This is causing me issues during development, as I am used to the file being autosaved and to having my own version control. It is equally important during testing, when I change a lot of the code.
I am working with multiple processes that write to the same directory.
I have a directory dir1/
My process creates a file a.txt under dir1/. However, the other process creates a-temp1.txt and renames it to a.txt. I don't have control over the other process, since that code comes from a library. Can I prevent a-temp1.txt from being renamed?
There's nothing you can do that the other process can't undo. Your best hope (other than changing your program to work sanely) is that the other process doesn't try too hard to do the rename. That is, it tries the simple approach and gives up if that fails.
In particular, you can set the UF_IMMUTABLE flag on either file, and that will prevent one from being renamed to replace the other. You can set the flag using chflags(). Using Cocoa, you could also use [someURL setResourceValue:@YES forKey:NSURLIsUserImmutableKey error:NULL].
Keep in mind that you won't be able to change the file in any other way, either, until that flag is removed. If the other process is determined to rename the file, it has permission to remove the flag just like your process does.
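In Python terms (os.chflags() is only available on BSD-derived systems such as macOS), a sketch of the same idea, with a.txt standing in for your file:

import os
import stat

def set_immutable(path: str, immutable: bool) -> None:
    """Toggle UF_IMMUTABLE, the same flag as `chflags uchg`."""
    flags = os.stat(path).st_flags
    if immutable:
        flags |= stat.UF_IMMUTABLE
    else:
        flags &= ~stat.UF_IMMUTABLE
    os.chflags(path, flags)

set_immutable("a.txt", True)   # renames over a.txt now fail with EPERM
# ... do your work ...
set_immutable("a.txt", False)  # clear it again, or you can't modify the file either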
Also keep in mind that a system such as this is inherently race-prone.
You really ought to use separate names for the files, or separate directories, or ditch that library that doesn't give you the control you need.
Set the user immutable flag with chflags(..., UF_IMMUTABLE), i.e. uchg in the chflags shell utility. This will keep the other process from changing your file unless it takes action to clear the bit. Of course, I don't know how the other process will react to you putting things in its way, but that wasn't the question.
You can use chflags() on an HFS+ (Mac OS X) file system to set the UF_APPEND attribute. (Do a man 2 chflags.) That will permit appending to the file, but not deleting or renaming, even by the same user.
You can, but it unlikely will solve your problem. I strongly suspect this is an X-Y problem, and almost certainly the correct solution is to redesign some part of this system entirely, probably by changing your file names, using unique temporary files, moving to another directory, or reworking the usage of the library (libraries only do what callers tell them to do; and libraries are just code anyway). You shouldn't try to defeat another process; you're all working for the same user.
All that said, sure, you can prevent your own userid from renaming over the file. Just deny yourself permission. You can change the mode of the file itself:
chmod 400 a.txt
That says that you can read the file but may not write it. However, if you already have an open file handle, you may continue to use it (so you can keep writing to the file, even though another process running as the same user may not).
Similarly, you may change permissions on the directory:
chmod 500 .
This would prevent the rename because file names are kept in the directory.
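A quick Python illustration of both points, assuming a non-root Unix user and an arbitrary file name:

import os

with open("a.txt", "w") as f:
    os.chmod("a.txt", 0o400)   # equivalent of `chmod 400 a.txt`
    f.write("still works\n")   # the already-open handle keeps its write access

try:
    open("a.txt", "w")         # but a fresh open for writing is now denied
except PermissionError as e:
    print("new writers blocked:", e)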
We're using the Zend_Pdf module in SugarCRM to merge PDF invoices that our system generates. I have been able to successfully merge a number of PDFs (around 10 to 30 in my tests), but we're getting memory errors when we try to merge larger numbers of PDF files. The error looks something like this:
[30-Jan-2012 14:10:20] PHP Fatal error: Allowed memory size of 268435456 bytes exhausted at /usr/local/src/php-5.3.8/Zend/zend_operators.c:1265 (tried to allocate 68134 bytes) in /srv/www/htdocs/sugar6_mf/Zend/Pdf/Element/Object/Stream.php on line 442
The above error was generated when we tried to merge 457 PDF files - that's files, not pages. Eventually we're going to need to merge 5,000 and more at a time.
Can anyone offer any help/advice on how to address this?
If needed, ask, and I'll post the code on how the merged pdf is being generated.
Thanks.
I should preface this answer by saying that I know nothing about SugarCRM - my response is based solely on my knowledge of Zend_Pdf.
If my understanding is correct, you have a PHP script (hopefully not running inside Apache considering the length of time it will take to process 5,000 files) that is taking multiple PDF files as input using the Zend_Pdf::load() method and then iterating through the pages of each PDF object and adding them to one target instance of Zend_Pdf, which you are then writing to a file using the save() method.
Using this approach, even if you unset() each of the source PDF objects after you've added their pages to the target PDF object, you'll still need enough memory to store the entire output file. If you blew through the 256MB limit with only 457 files, then I'm guessing your input PDF files are probably about 500KB each, so your output file is going to be absolutely huge, and you are still going to end up running out of memory.
My advice would be to ditch this method entirely and use pdftk instead, which you could invoke using the exec() function. I'm sure there's a limit to the size of the arguments you can provide to exec(), so it will probably be a multi-step process with several intermediate files, but ultimately I think this will be a faster, more robust solution.
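A rough Python sketch of that multi-step batching (the same idea works from PHP with exec(); the chunk size of 100 is a guess at a safe argument count):

import os
import subprocess
import tempfile

CHUNK = 100  # files per pdftk run, to stay under the argument-length limit

def merge_pdfs(inputs, output):
    """Merge many PDFs with pdftk, going through intermediate files as needed."""
    with tempfile.TemporaryDirectory() as tmpdir:
        generation = 0
        while len(inputs) > CHUNK:
            intermediates = []
            for i in range(0, len(inputs), CHUNK):
                part = os.path.join(tmpdir, f"part-{generation}-{i}.pdf")
                subprocess.run(["pdftk", *inputs[i:i + CHUNK],
                                "cat", "output", part], check=True)
                intermediates.append(part)
            inputs = intermediates  # merge the intermediates in the next pass
            generation += 1
        subprocess.run(["pdftk", *inputs, "cat", "output", output], check=True)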
And just to re-iterate an earlier point, I would not run this process within Apache. I would set up a cron job that runs at the appropriate intervals and drops the output file into a secure area on your web/file server.
OK, I'm working on a project right now and I need to create a graphic library.
The game I'm experimenting with is an RPG; this project is expected to contain many big graphic files to use and I would prefer not to load everything into memory at once, like I've done before with other smaller projects.
So, does anyone have experience with libraries such as this one? Here's what I've come up with:
Have graphic library files and paths in an XML file
Each entry in the XML file would be designated "PERMANENT" or "TEMPORARY", with PERMANENT meaning that once loaded it stays in memory and won't be cleared (like menu graphics)
The library that the XML file loads into would have a CLEAR command, that clears out all non-PERMANENT graphics
I have experience with throwing everything into memory at startup and then running the program under the assumption that all necessary graphics are currently in memory. Are there any other considerations I might need to think about?
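To make the idea concrete, here is a rough Python sketch of such a manifest and loader (the XML format, the names and the load_graphic() stub are all invented):

import xml.etree.ElementTree as ET

MANIFEST = """
<assets>
  <image id="menu_bg" path="gfx/menu_bg.png" lifetime="PERMANENT"/>
  <image id="forest"  path="gfx/forest.png"  lifetime="TEMPORARY"/>
</assets>
"""

def load_graphic(path: str):
    return f"<graphic {path}>"  # stand-in for engine-specific image loading

class AssetLibrary:
    def __init__(self, manifest_xml: str):
        self.entries = {}  # id -> (path, is_permanent)
        self.loaded = {}   # id -> in-memory graphic
        for node in ET.fromstring(manifest_xml):
            permanent = node.get("lifetime") == "PERMANENT"
            self.entries[node.get("id")] = (node.get("path"), permanent)

    def get(self, asset_id: str):
        if asset_id not in self.loaded:
            path, _ = self.entries[asset_id]
            self.loaded[asset_id] = load_graphic(path)  # lazy load on first use
        return self.loaded[asset_id]

    def clear(self):
        """The CLEAR command: drop everything not marked PERMANENT."""
        self.loaded = {k: v for k, v in self.loaded.items()
                       if self.entries[k][1]}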
Ideally everything would be temporary, and you would have a sensible evict function that chooses the right objects to victimize (based on access patterns) when your program decides it needs more memory.
There'll be some minimum amount of RAM your game needs to run, otherwise stuff will be constantly swapping, but this approach does mean you're not dumping objects marked TEMPORARY that you'll just need to reload next frame because they're currently in use.
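One common shape for that evict function is a least-recently-used cache. A minimal Python sketch, sizing by item count rather than bytes for brevity:

from collections import OrderedDict

class LRUCache:
    """Evicts the least-recently-used entry once over budget."""

    def __init__(self, max_items: int):
        self.max_items = max_items
        self.items = OrderedDict()

    def get(self, key, loader):
        if key in self.items:
            self.items.move_to_end(key)         # mark as recently used
        else:
            self.items[key] = loader(key)       # load on first access
            while len(self.items) > self.max_items:
                self.items.popitem(last=False)  # evict the coldest entry
        return self.items[key]

cache = LRUCache(max_items=2)
cache.get("forest", lambda k: f"<graphic {k}>")
cache.get("castle", lambda k: f"<graphic {k}>")
cache.get("forest", lambda k: f"<graphic {k}>")  # refreshes "forest"
cache.get("cave",   lambda k: f"<graphic {k}>")  # evicts "castle", not "forest"

Anything marked PERMANENT would simply live outside the cache.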