Transferring large files with Apache NMS

What is currently considered state of the art, so to speak, for transferring large files over Apache NMS (using ActiveMQ)? Putting the whole content into a StreamMessage? The naming there is a bit misleading, though, since the file isn't actually streamed over JMS: the entire content resides in memory (or on disk?) and is sent all at once. With files larger than 100 MB I ran into problems:
Apache.NMS.NMSException : Unable to write data to the transport connection: An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full.
BlobMessage is not supported in NMS, so I really see no option but to split the file into chunks, re-assemble them on the other side, and so on.
Thank you,
Cristian.
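For reference, a chunked approach along the lines mentioned in the question might look roughly like this (a sketch only: connection and queue setup are omitted, the chunk size and property names are made up, and the consumer would reassemble the chunks using the same properties):

using System;
using System.IO;
using Apache.NMS;

class ChunkedSender
{
    const int ChunkSize = 512 * 1024; // 512 KB per message (arbitrary)

    static void SendFile(ISession session, IMessageProducer producer, string path)
    {
        string transferId = Guid.NewGuid().ToString();
        using (FileStream file = File.OpenRead(path))
        {
            byte[] buffer = new byte[ChunkSize];
            int read, index = 0;
            while ((read = file.Read(buffer, 0, buffer.Length)) > 0)
            {
                byte[] chunk = new byte[read];
                Array.Copy(buffer, chunk, read);

                IBytesMessage msg = session.CreateBytesMessage(chunk);
                // Custom properties the consumer uses to reassemble the file.
                msg.Properties.SetString("transferId", transferId);
                msg.Properties.SetInt("chunkIndex", index++);
                msg.Properties.SetBool("lastChunk", file.Position == file.Length);
                producer.Send(msg);
            }
        }
    }
}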

How about using GZIPInputStream?
For example:
GZIPInputStream inputStream = new GZIPInputStream(new ByteArrayInputStream(gzipped));
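Since Apache.NMS is .NET, the rough equivalent there would be System.IO.Compression.GZipStream; a minimal sketch for inflating a gzipped payload received over NMS:

using System.IO;
using System.IO.Compression;

static byte[] Decompress(byte[] gzipped)
{
    using (var input = new GZipStream(new MemoryStream(gzipped), CompressionMode.Decompress))
    using (var output = new MemoryStream())
    {
        input.CopyTo(output); // inflate the compressed bytes
        return output.ToArray();
    }
}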

Related

How does file reading (streaming) really work in Mule?

I am trying to understand how streaming works in Mule 4.4.
I am reading a large file and using 'Repeatable file store stream' as the streaming strategy, with
'In memory size' = 128 KB
The file is 24 MB, and for the sake of argument let's say 1000 records is equivalent to 128 KB,
so about 1000 records will be kept in memory and the rest will be written to the file store by Mule.
Here's the flow:
At stage #1 we read the file.
At stage #2 we log the payload - so I am assuming initially 128 KB worth of data is logged, and internally Mule will move the rest of the data from the file store into memory so it can also be written to the log.
Question: does the heap memory usage grow from 128 KB to 24 MB?
I am assuming no, but I need confirmation.
At stage #3 we use a Transform Message script to create a JSON payload.
So what happens here:
Is the JSON payload now entirely in memory (say all 24 MB)?
What has happened to the stream?
I am really struggling to understand how streaming is beneficial if the data ends up in memory during transformation.
Thanks
It really depends on how each component works, but usually logging means loading the full payload into memory. Having said that, logging 'big' payloads is considered bad practice and you should avoid doing it in the first place. Even logging a few KB is really not a good idea; logs are not intended to be used that way. Logging, like any computational operation, has a cost in processing and resource usage. I have several times seen people cause out-of-memory errors or performance issues because of excessive logging.
The case with the Transform component is different. In some cases it is able to benefit from streaming, depending on the format used and the script. Sequential access to records is required for streaming to work. If you try indexed access into the 24 MB payload (for example payload[14203]), it will probably load the entire payload into memory. Also, referencing the payload more than once in a step may fail: streamed records are consumed as they are read, so it is not possible to use them twice.
Streaming for DataWeave needs to be enabled (it is not the default) by using the property streaming=true.
You can find more details in the documentation for DataWeave Streaming and Mule Streaming.
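As a rough illustration of that last point (the path and format here are made up, and the exact attribute depends on the connector), the streaming reader property can be set on the source via the MIME type, for example:

<file:read path="records.csv" outputMimeType="application/csv; streaming=true" />

The transform script then needs to consume the records sequentially (for example with map) rather than by index for the stream to remain a stream.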

Apache NiFi S3 PutObject stuck

Sorry if this is a dumb question; I am very new to NiFi.
I have set up a process group to dump SQL query results to CSV and then upload them to S3. It worked fine with small queries, but appears to be stuck with larger files.
The input queue to the PutS3Object processor has a limit of 1 GB, but the file it is trying to put is almost 2 GB. I have set the multipart parameters in the S3 processor to 100 MB, but it is still stuck.
So my theory is that PutS3Object needs the complete file before it starts uploading. Is this correct? Is there no way to get it uploading in a "streaming" manner? Or do I just have to increase the input queue size?
Or am I on the wrong track and there is something else holding this all up?
The screenshot suggests that the large file is in PutS3Object's input queue, and PutS3Object is actively working on it (from the 1 thread indicator in the top-right of the processor box).
As it turns out, there were no errors, just a delay from processing a large file.
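For reference, the settings involved here are (names as they appear in NiFi, if I recall correctly): on the connection, Back Pressure Data Size Threshold (the 1 GB queue limit mentioned above) and Back Pressure Object Threshold; on PutS3Object, Multipart Threshold and Multipart Part Size (the 100 MB value above). As far as I understand, back pressure throttles upstream processors rather than rejecting a flowfile, which is why the 2 GB file could still sit in the queue and be processed.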

SQL FILESTREAM and Connection Pooling

I am currently enhancing a product to support web delivery of large file-content.
I would like to store it in the database, and whether I choose FILESTREAM or BLOB, the following question still holds.
My WCF method will return a stream, meaning that the file stream will remain open while the content is read by the client. If the connection is slow, then the stream could be open for some time.
Question: Connection pooling assumes that connections are held exclusively and only for a short period of time. Am I correct in assuming that, given a connection pool of finite size, there could be a contention problem if slow network connections are used to download files?
Under this assumption, I really want to use FILESTREAM, and open the file directly from the file-system, rather than the SQL connection. However, if the database is remote, I will have no choice but to pull the content from the SQL connection (until I have a local cache of the file anyway).
I realise I have other options, such as to server-buffer the stream, however that will have implications as well. I wish at this time, to discuss only the issues relating to returning a stream obtained from a DB connection.
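To make the concern concrete, here is a sketch of the pattern in question: a WCF operation returning a Stream backed directly by a SqlDataReader (the table, column, and connection-string names are made up). The pooled SqlConnection stays checked out until the returned stream, and with it the reader, is disposed, which with a slow client can be a long time.

using System.Data;
using System.Data.SqlClient;
using System.IO;

public class FileService
{
    private readonly string connectionString = "..."; // placeholder connection string

    public Stream GetFileContent(int fileId)
    {
        var conn = new SqlConnection(connectionString);
        conn.Open();
        var cmd = new SqlCommand("SELECT Content FROM Files WHERE Id = @id", conn);
        cmd.Parameters.AddWithValue("@id", fileId);

        // SequentialAccess streams the BLOB instead of buffering it;
        // CloseConnection ties the connection's lifetime to the reader.
        var reader = cmd.ExecuteReader(CommandBehavior.SequentialAccess | CommandBehavior.CloseConnection);
        reader.Read();

        // In real code the reader and connection would be wrapped so they are
        // disposed together with this stream once WCF has finished sending it.
        return reader.GetStream(0); // available in .NET 4.5+
    }
}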

WCF very slow in LAN file transfer

I have a service with a method that sends a file from the client to the service. I notice that when I run the client and the service on the same machine, and the file I want to send is also on the local machine, everything works very fast.
However, if the client and the service are on the same machine but the file is on another computer, the transfer is very slow.
If I copy the file from one computer to the other, the speed is fast, so the problem does not seem to be the bandwidth.
I have tried both the TCP and basicHttp bindings, but the results are the same.
The problem also occurs when I try to send the file while the client is on another computer.
Thanks.
EDIT: If I open the Task Manager on the computer that runs the client, the Network tab shows network usage of about 0.5%. Why?
WCF is not the optimal method for transmitting large files, because it has a lot of layers whose overhead adds up and delays file transmission. Moreover, you may not have written the WCF service to continuously read chunks of bytes and write them to the response. You might be doing a File.ReadAll and then returning the whole string, which causes a large synchronous read on the server, a lot of memory allocation, and then the large string being written to the WCF buffer, which in turn writes to the IIS buffer, and so on.
The best way to transmit large files is to use an HttpHandler. You can just use Response.TransmitFile to transfer the file, and IIS will transmit it in the most optimal way. Otherwise you can always read 8 KB at a time, write it to the Response stream, and call Flush after every 8 KB write.
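For illustration, a minimal handler along those lines might look like this (the path is a placeholder; in practice it would come from the request):

using System.Web;

public class FileDownloadHandler : IHttpHandler
{
    public bool IsReusable { get { return true; } }

    public void ProcessRequest(HttpContext context)
    {
        string path = context.Server.MapPath("~/files/largefile.bin"); // placeholder

        context.Response.ContentType = "application/octet-stream";
        // TransmitFile lets IIS stream the file to the client without
        // buffering the whole thing in managed memory.
        context.Response.TransmitFile(path);
    }
}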
If you cannot go for HttpHandler for any weird reason, can you show me the WCF code?
Another thing: you might be expecting performance that is simply not possible when IIS is in the picture. First you should measure how long it takes IIS to transmit the file if you just host it directly on a website and download it with WebClient.DownloadString.
Also, how are you downloading - via a browser, or via client-side code? Client-side code can be suboptimal as well if you try to transmit the whole file in one shot and hold it in a string. For example, WebClient.DownloadString would be the worst approach.

RSync single (archive) file that changes every time

I am working on an open source backup utility that backs up files and transfers them to various external locations such as Amazon S3, Rackspace Cloud Files, Dropbox, and remote servers through FTP/SFTP/SCP protocols.
Now, I have received a feature request for doing incremental backups (in case the backups that are made are large and become expensive to transfer and store). I have been looking around and someone mentioned the rsync utility. I performed some tests with this but am unsure whether this is suitable, so would like to hear from anyone that has some experience with rsync.
Let me give you a quick rundown of what happens when a backup is made. Basically it'll start dumping databases such as MySQL, PostgreSQL, MongoDB, Redis. It might take a few regular files (like images) from the file system. Once everything is in place, it'll bundle it all in a single .tar (additionally it'll compress and encrypt it using gzip and openssl).
Once that's all done, we have a single file that looks like this:
mybackup.tar.gz.enc
Now I want to transfer this file to a remote location. The goal is to reduce bandwidth and storage cost. Let's assume this little backup package is about 1 GB in size. We use rsync to transfer it to a remote location and then remove the local copy of the backup. Tomorrow a new backup file is generated; it turns out a lot more data has been added in the past 24 hours, so the new mybackup.tar.gz.enc comes to about 1.2 GB.
Now, my question is: Is it possible to transfer just the 200MB that got added in the past 24 hours? I tried the following command:
rsync -vhP --append mybackup.tar.gz.enc backups/mybackup.tar.gz.enc
The result:
mybackup.tar.gz.enc 1.20G 100% 36.69MB/s 0:00:46 (xfer#1, to-check=0/1)
sent 200.01M bytes
received 849.40K bytes
8.14M bytes/sec
total size is 1.20G
speedup is 2.01
Looking at the sent 200.01M bytes I'd say the "appending" of the data worked properly. What I'm wondering now is whether it transferred the whole 1.2GB in order to figure out how much and what to append to the existing backup, or did it really only transfer the 200MB? Because if it transferred the whole 1.2GB then I don't see how it's much different from using the scp utility on single large files.
Also, if what I'm trying to accomplish is at all possible, what flags do you recommend? If it's not possible with rsync, is there any utility you can recommend to use instead?
Any feedback is much appreciated!
The nature of gzip is such that small changes in the source file can result in very large changes to the resultant compressed file - gzip will make its own decisions each time about the best way to compress the data that you give it.
Some versions of gzip have the --rsyncable switch which sets the block size that gzip works at to the same as rsync's, which results in a slightly less efficient compression (in most cases) but limits the changes to the output file to the same area of the output file as the changes in the source file.
If that's not available to you, then it's typically best to rsync the uncompressed file (using rsync's own compression if bandwidth is a consideration) and compress at the end (if disk space is a consideration). Obviously this depends on the specifics of your use case.
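For example, something along these lines (paths are placeholders, and --rsyncable requires a gzip build that includes it):

tar -c /path/to/backup | gzip --rsyncable > mybackup.tar.gz
rsync -vhP mybackup.tar.gz user@remote:backups/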
It sent only what it says it sent - only transferring the changed parts is one of the major features of rsync. It uses some rather clever checksumming algorithms (and it sends those checksums over the network, but this is negligible - several orders of magnitude less data than transferring the file itself; in your case, I'd assume that's the .01 in 200.01M) and only transfers those parts it needs.
Note also that there already are quite powerful backup tools based on rsync - namely, Duplicity. Depending on the license of your code, it may be worthwhile to see how they do this.
Note that since rsync 3.0.0, --append no longer verifies the data that already exists on the receiver, so it WILL BREAK your file contents if any of the existing data has changed; --append-verify restores the old checking behaviour.