Send very large file (>> 2 GB) via browser - WCF

I have a task to do: I need to build a WCF service that allows a client to import a file into a database using the server backend. To do this, I need to communicate to the server the settings and the events needed to start and configure the import, and most importantly the file to import itself. Now the problem is that these files can be extremely large (much bigger than 2 GB), so it's not possible to send them via the browser as they are. The only thing that comes to mind is to split these files into pieces and send them to the server one by one.
I have another requirement as well: I need to be 100% sure that these files are not corrupted, so I need to implement some sort of policy for detecting and, possibly, recovering from errors.
Do you know of an API or DLL that can help me achieve these goals, or is it better to write the code myself? And in that case, what would be the optimal size of the packets?
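For illustration, the split-and-verify idea might look like the sketch below. This is the client side only, in Python for brevity; upload_chunk is a hypothetical call into your WCF service, and the 4 MB chunk size is just a starting point to tune against your network and server limits.

import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # assumption: tune against your network and IIS limits

def iter_chunks(path, chunk_size=CHUNK_SIZE):
    # Yield (index, data, digest) for each chunk; the server re-hashes each
    # chunk and compares digests to detect corruption in transit.
    with open(path, "rb") as f:
        index = 0
        while True:
            data = f.read(chunk_size)
            if not data:
                break
            yield index, data, hashlib.sha256(data).hexdigest()
            index += 1

def upload_file(path, upload_chunk, max_retries=3):
    for index, data, digest in iter_chunks(path):
        for _ in range(max_retries):
            if upload_chunk(index, data, digest):  # True = server verified the hash
                break
        else:
            raise IOError("chunk %d still corrupt after %d retries" % (index, max_retries))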


BizTalk 2010 - Using external source for credentials

On my BizTalk server I use several different credentials to connect to internal and external systems. There is an upcoming task to change the passwords for a lot of systems and I'm searching for a solution to simplify this task on my BizTalk server.
Is there a way I could adjust the File/FTP adapters to read this information from an XML file, so that I can change it in the XML file alone and everything gets updated? Or is there an alternative I could use, such as PowerShell?
Has someone else had this task as well?
I'd rather not create a custom adapter, but if there is no alternative I will go for that one. Using dynamic credentials for the send port can be solved with an orchestration, but I need this for the receive port as well.
You can export the bindings of all your applications. All the passwords for the FTP and File adapters will be masked out with a series of * (asterisks).
You could then edit your bindings down to just those ports you want to update, replace the masked-out passwords with the correct passwords, and import them whenever you want the passwords changed.
Unfortunately, unless you have already prepared tokenised binding files, the above is a manual effort.
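If you do go the tokenised route, a rough sketch of the patching step (in Python; the element and port names here are assumptions, so match them to the structure of your actual exported bindings):

import xml.etree.ElementTree as ET

def patch_bindings(binding_path, passwords, out_path):
    # Exported bindings mask passwords as a run of asterisks; swap in the
    # real values from a separate lookup keyed by port name (hypothetical).
    tree = ET.parse(binding_path)
    for port in tree.iter("SendPort"):
        name = port.get("Name")
        for elem in port.iter():
            if elem.text and set(elem.text) == {"*"} and name in passwords:
                elem.text = passwords[name]
    tree.write(out_path)

patch_bindings("bindings.xml", {"FtpSendPort": "new-secret"}, "bindings.patched.xml")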
I was going to recommend that you take a look at Enterprise Single Sign-On, but on second thought, I think you probably just need to 'bite the bullet' and make the change in the various adapters.
ESSO would be beneficial if you have a single Adapter with multiple endpoints/credentials, but I infer from your question that isn't the case (i.e. you're not just using a single adapter). I also don't think re-writing the adapters to include functionality to read usernames/passwords from file is feasible IMHO - just changing the passwords would be much faster, by an order of weeks or months ;-)
One option that is available to you, however, depending on which direction the adapter is being used: if you need to change credentials on send adapters, you should consider setting usernames/passwords at runtime via the various adapter property schemas (see http://msdn.microsoft.com/en-us/library/aa560564.aspx for the FTP adapter properties, for example). You could then easily create an encoding send pipeline component that reads an XML file containing credentials and updates the message context properties accordingly; the message would then be sent with the appropriate credentials to the required endpoint.
There is also the option of using ESSO as your (encrypted) config store instead of XML files, a database, etc. Richard Seroter has a really good post on this from way back in 2007 (it's still perfectly valid, though).

Configure multiple servers and scale

I have been given a task to configure 1000s of servers with some simple data. Let's say I need to log in to each server (Linux or Windows) and set up the NTP server. I need to come up with some kind of automation framework using Perl. I have some ideas and want to get more.
Here is my thought process:
a) Since there are 1000s of servers, the framework should definitely be able to read in a CSV file, so that all inputs can be provided at once as opposed to one at a time.
b) Since there are so many servers, I have to find a way to do things in parallel. I can't go server by server in a sequential way.
c) I should have some output file that shows the results for all the servers: which ones were successfully configured and which ones failed. That way I can compare the input and output files and generate a report.
Should I consider anything else in my framework?
How can I do parallel processing using Perl?
Even if you want to stick with Perl, it looks like there are already some alternatives available that would keep you from implementing another framework from scratch.
Check out the comments from http://my.opera.com/cstrep/blog/2010/05/14/puppet-fabric-and-a-perl-alternative for a couple options.
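For a sense of the read-CSV / fan-out / collect-results shape the question describes, here's a minimal sketch (Python for brevity; Parallel::ForkManager from CPAN gives you the same pattern in Perl). configure_ntp is a stub standing in for your real login-and-configure step, and the CSV column names are assumptions:

import csv
from concurrent.futures import ThreadPoolExecutor, as_completed

def configure_ntp(host, ntp_server):
    # placeholder: ssh/WinRM into the host and set its NTP server
    return True

with open("servers.csv") as f:
    jobs = list(csv.DictReader(f))  # assumed columns: host, ntp_server

results = {}
with ThreadPoolExecutor(max_workers=50) as pool:
    futures = {pool.submit(configure_ntp, j["host"], j["ntp_server"]): j["host"]
               for j in jobs}
    for fut in as_completed(futures):
        host = futures[fut]
        try:
            results[host] = "ok" if fut.result() else "failed"
        except Exception as exc:
            results[host] = "failed: %s" % exc

# The output file pairs with the input file for the final report.
with open("results.csv", "w", newline="") as f:
    csv.writer(f).writerows(sorted(results.items()))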

Need suggestions on which option is more efficient for storing data on an iPad

This is my first time working on a big project for a client, so I was not sure how to solve this problem. I have come up with two different ideas, but I need a professional's opinion on which one is better :)
Situation:
There is an application which runs on different clients' iPads. Application data is stored in a giant XML file. This XML file is shared among all clients via a server, so the server has a centralised copy and each client has its own copy. Once a client makes changes to its XML copy, it updates the server copy, and the other clients then update their copies from the updated server copy.
Now, only one client can make changes at a time. To enforce this, I have logic by which a client must get ownership from the server before it starts editing the XML, and the server will only allow one client to edit at a time.
Now on the client side I have to come up with logic to update my client copy and upload it to the server. There are two options.
Option 1:
I can directly manipulate the XML file using the GDataXML parser and upload that copy to the server. For persistence I can save the client copy in the iPad's Documents directory.
Option 2:
I can read the XML file and create a Core Data representation for local storage. Whenever I update data inside Core Data, I will change the XML file too and then upload that file to the server. Double the work, but I guess better persistence.
Now, which one is more robust and advisable? Personally I was planning to go with option 2, because it seems more robust since I am persisting application data in Core Data. Option 1 seems like less work, but I don't know how good the persistence will be.
Sorry for the lengthy question, and thanks for any input.
There are a number of factors which would influence selecting the second option over the first.
How big is the XML file? If you need to work with very large documents, you may need to incrementally parse the XML (SAX) into Core Data. This will let you access the document's contents without loading it all into memory at once.
Do you need to run complex queries on the data? If so, you may be better off using Core Data fetch predicates rather than XPath or XSLT.
Are you already using Core Data? Depending on how the XML data is structured, it might be simpler overall to import the data into your existing persistent store.
Otherwise, you can probably make do with parsing the entire document and either traversing the resulting tree or querying it with XPath.
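To make the incremental-parsing suggestion concrete, here is the shape of a SAX-style handler (sketched in Python; on iOS the event-driven equivalent is NSXMLParser, and the record element name is invented for the example):

import xml.sax

class RecordHandler(xml.sax.ContentHandler):
    def __init__(self, on_record):
        self.on_record = on_record   # called once per record, e.g. to insert into the local store
        self.in_record = False
        self.buf = []
    def startElement(self, name, attrs):
        if name == "record":
            self.in_record, self.buf = True, []
    def characters(self, content):
        if self.in_record:
            self.buf.append(content)
    def endElement(self, name):
        if name == "record":
            self.on_record("".join(self.buf).strip())
            self.in_record = False

# Only one record is held in memory at a time, however large the file is.
xml.sax.parse("data.xml", RecordHandler(print))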
If you need to create an object graph based on what you get from the server and show it to the user (which you most probably need to do), you should stick with the second option, since it allows easy and robust data persistence.
If you do not need to present the user with any data from the XML file, you can, of course, store it in the Documents directory.
So, if this is a client application and it has at least some visual representation of the data from the XML file, you should use Core Data.
If you want to update the data regularly, then use Core Data.

How can I speed up a batch processing job in ColdFusion?

Every once in a while I am fed a large data file that my client uploads and that needs to be processed through CFML. The problem is that if I put the processing on a CF page, it runs into a timeout issue after 120 seconds. I was able to move the processing code to a CFC, where it seems not to have the timeout issue. However, sometime during the processing it causes ColdFusion to crash, and it has to be restarted. There are a number of database queries (5 or more, a mixture of updates and selects) required for each of the 8,000+ lines of the file, as well as other logic provided by me in the form of CFML.
My question is: what would be the best way to go through this file? One caveat: I am not able to move the file to the database server and process it entirely with the DB. That said, would it be more efficient to pass each line to a stored procedure that takes care of everything? It would still be a lot of calls to the database, but nothing compared to what I have now. Also, what would be the best way to provide feedback to the user about how much of the file has been processed?
Edit:
I'm running CF 6.1
I just did a similar thing and use CF often for data parsing.
1) Maintain a file upload table (parent table). For every file you upload, you should keep a record of the file and what status it is in (uploaded, processed, unprocessed).
2) A temp table to store all the rows of the data file (child table). Import the entire data file into a temporary table; attempting to do it all in memory will inevitably lead to errors. Each row in this table will link to a file upload table entry above.
3) Maintain a processing status - for each row of the data file you bring in, set a processed/unprocessed flag. This way, if the job breaks, you can start from where you left off. As you run through each line, mark it as processed.
4) Transactions - use cftransaction if possible to commit all of it at once, or at least one line at a time (with your 5 queries). That way, if something goes boom, you don't have a row of data that is half computed/processed/updated/tested.
5) Once you're done processing, set the file name entry in the table from step 1 to processed.
By using the approach above, if something fails you can restart where it left off, or at least have a clearer path of where to start investigating, or in the worst case clean up your data. You will also have a clear way of displaying to the user the status of the current upload: how far along it is, and where it left off if there was an error. A minimal sketch of this bookkeeping is shown below.
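Here is that sketch (Python/SQLite purely to illustrate the two tables and the resume logic; in CF this would be your datasource plus cfquery/cftransaction calls, and handle_line stands in for your per-line business logic):

import sqlite3

db = sqlite3.connect("uploads.db")
db.executescript("""
CREATE TABLE IF NOT EXISTS file_upload (
    id INTEGER PRIMARY KEY, name TEXT, status TEXT DEFAULT 'uploaded');
CREATE TABLE IF NOT EXISTS file_row (
    id INTEGER PRIMARY KEY, file_id INTEGER REFERENCES file_upload(id),
    line TEXT, status TEXT DEFAULT 'unprocessed');
""")

def process_pending(file_id, handle_line):
    # Only unprocessed rows are picked up, so a crashed run resumes cleanly.
    rows = db.execute("SELECT id, line FROM file_row "
                      "WHERE file_id = ? AND status = 'unprocessed'", (file_id,))
    for row_id, line in rows.fetchall():
        with db:  # one transaction per line: the work and the flag commit together
            handle_line(line)  # your 5 queries / business logic
            db.execute("UPDATE file_row SET status = 'processed' WHERE id = ?", (row_id,))
    with db:
        db.execute("UPDATE file_upload SET status = 'processed' WHERE id = ?", (file_id,))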
If you have any questions, let me know.
Other thoughts:
You can increase timeouts, give the VM more memory, and move to 64-bit, but all of those will only increase the capacity of your system so much. It's a good idea to set these per call and to do it in conjunction with the above.
Java has some neat file processing libraries that are available as CFCs. If you run into a lot of issues with speed, you can use one of those to read the file into a variable and then into the database.
If you are playing with XML, do not use ColdFusion's XML parsing. It works well for smaller files but has fits when things get bigger. There are several CFCs out there (check RIAForge, etc.) that wrap some excellent Java libraries for parsing XML data. You can then create a cfquery manually, if need be, with that data.
It's hard to tell without more info, but from what you have said I'll throw out three ideas.
The first thing is that with so many database operations, it's possible you are generating too much debugging output. Make sure that under Debug Output Settings in the Administrator, the following settings are turned off:
Enable Robust Exception Information
Enable AJAX Debug Log Window
Request Debugging Output
The second thing I would do is look at those DB queries and make sure they are optimized. Make sure selects are happening against indexes, etc.
The third thing I would suspect is that keeping the whole file in memory is probably suboptimal.
I would try looping through the file line by line using file looping:
<cfloop file="#VARIABLES.filePath#" index="VARIABLES.line">
<!--- Code to go here --->
</cfloop>
Have you tried an event gateway? I believe those threads are not subject to the same timeout settings as page request threads.
SQL Server Integration Services (SSIS) is the recommended tool for complex ETL (Extract, Transform, and Load) work, which is what this sounds like. (It can be configured to access files on other servers.) The question might be: can you work up an interface between ColdFusion and SSIS?
If you can, upgrade to CF8 and take advantage of cfloop file="", which will give you greater speed, and the file will not be put in memory (which is probably the cause of the crashing).
Depending on the situation you are encountering you could also use cfthread to speed up processing.
Currently, an event gateway is the only way to get around the timeout limits of an HTTP request cycle. CF does not have a way to process CF pages offline; that is, there is no command-line invocation (one of my biggest gripes about CF - very little offline processing).
Your best bet is to use an Event Gateway or rewrite your parsing logic in straight Java.
I had to do the same thing. Ben Nadel has written a bunch of great articles on using Java file I/O to let you read and write files more speedily.
It really helped improve the performance of our CSV importing application.

How to reliably handle files uploaded periodically by an external agent?

It's a very common scenario: some process wants to drop a file on a server every 30 minutes or so. Simple, right? Well, I can think of a bunch of ways this could go wrong.
For instance, processing a file may take more or less than 30 minutes, so it's possible for a new file to arrive before I'm done with the previous one. I don't want the source system to overwrite a file that I'm still processing.
On the other hand, the files are large, so it takes a few minutes to finish uploading them. I don't want to start processing a partial file. The files are just transferred with FTP or SFTP (my preference), so OS-level locking isn't an option.
Finally, I do need to keep the files around for a while, in case I need to manually inspect one of them (for debugging) or reprocess one.
I've seen a lot of ad-hoc approaches to shuffling upload files around, swapping filenames, using datestamps, touching "indicator" files to assist in synchronization, and so on. What I haven't seen yet is a comprehensive "algorithm" for processing files that addresses concurrency, consistency, and completeness.
So, I'd like to tap into the wisdom of crowds here. Has anyone seen a really bulletproof way to juggle batch data files so they're never processed too early, never overwritten before done, and safely kept after processing?
The key is to do the initial juggling at the sending end. All the sender needs to do is:
Store the file with a unique filename.
As soon as the file has been sent, move it to a subdirectory called e.g. completed.
Assuming there is only a single receiver process, all the receiver needs to do is:
Periodically scan the completed directory for any files.
As soon as a file appears in completed, move it to a subdirectory called e.g. processed, and start working on it from there.
Optionally delete it when finished.
On any sane filesystem, file moves are atomic provided they occur within the same filesystem/volume. So there are no race conditions.
Multiple Receivers
If processing could take longer than the period between files being delivered, you'll build up a backlog unless you have multiple receiver processes. So, how to handle the multiple-receiver case?
Simple: each receiver process operates exactly as before. The key is that we attempt to move a file to processed before working on it. That, together with the fact that same-filesystem file moves are atomic, means that even if multiple receivers see the same file in completed and try to move it, only one will succeed. All you need to do is check the return value of rename(), or whatever OS call you use to perform the move, and only proceed with processing if it succeeded. If the move failed, some other receiver got there first, so just go back and scan the completed directory again.
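A minimal sketch of that claim-by-rename loop (Python; the directory names and 30-second poll interval are just examples):

import os, time

COMPLETED, PROCESSED = "completed", "processed"

def receiver_loop(process):
    while True:
        for name in os.listdir(COMPLETED):
            src = os.path.join(COMPLETED, name)
            dst = os.path.join(PROCESSED, name)
            try:
                os.rename(src, dst)   # atomic claim: succeeds for exactly one receiver
            except OSError:
                continue              # another receiver got there first
            process(dst)              # safe to work on it now; keep or delete afterwards
        time.sleep(30)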
If the OS supports it, use file system hooks to intercept open and close file operations, with something like Dazuko. Other operating systems may let you know about file operations in another way; for example, Novell Open Enterprise Server lets you define epochs and read the list of files modified during an epoch.
Just realized that on Linux you can use the inotify subsystem, or the utilities from the inotify-tools package.
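For example, driving inotifywait from inotify-tools out of a small Python wrapper replaces polling with event-driven wakeups (the completed directory name matches the scheme above):

import subprocess

proc = subprocess.Popen(
    ["inotifywait", "-m", "-e", "moved_to", "--format", "%f", "completed"],
    stdout=subprocess.PIPE, text=True)
for line in proc.stdout:
    print("file ready:", line.strip())  # hand off to the receiver logic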
File transfer is one of the classics of system integration. I'd recommend you get the Enterprise Integration Patterns book to build your own answer to these questions; to some extent, the answer depends on the technologies and platforms you are using for endpoint implementation and for file transfer. It's a quite comprehensive collection of workable patterns, and fairly well written.