What is the best practice for downloading large CSV files from S3 in Java?

I'm trying to get a large CSV file from S3, but the download fails with "java.net.SocketException: Connection reset", which is probably due to the InputStream simply being open for too long (the download often takes more than an hour, since I am doing multiple time-consuming processes on the streamed content). This is how I currently parse the file:
// Stream the gzipped object straight from S3 and parse it on the fly
InputStream inputStream = new GZIPInputStream(s3Client.getObject("bucket", "key").getObjectContent());
Reader decoder = new InputStreamReader(inputStream, Charset.defaultCharset());
BufferedReader reader = new BufferedReader(decoder);
CSVParser csvParser = new CSVParser(reader, CSVFormat.DEFAULT);
CSVRecord nextRecord = csvParser.iterator().next();
...
I know I have to split the download into multiple short getObject calls with a defined offset on the GetObjectRequest, but I'm wondering how to define this offset for a CSV, since I need complete lines.
Do I have to ditch the parser library, parse each line into an object myself, and keep a count of the bytes read so I can use it as the offset for the next batch? That doesn't seem very robust to me. Is there a best-practice way to achieve "batch downloading" of CSV records?
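For reference, this is roughly what a ranged request would look like with the v1 AWS SDK (bucket, key, and chunk size are placeholders). The range refers to the raw bytes stored in S3, so it will usually cut a CSV record in half, and since the object is gzip-compressed, an arbitrary range can't even be decompressed on its own:
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.S3Object;

class RangedGetSketch {
    // Sketch only: fetch the first 5 MB of the stored object. The range is in
    // raw (compressed) bytes, so it can end in the middle of a CSV record.
    static S3Object firstChunk(AmazonS3 s3Client) {
        GetObjectRequest request = new GetObjectRequest("bucket", "key")
                .withRange(0, 5L * 1024 * 1024 - 1);
        return s3Client.getObject(request);
    }
}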

I decided to simply use the dedicated getObject(GetObjectRequest getObjectRequest, File destinationFile) method to copy the entire CSV to a temporary file on disk. This closes the HTTP connection as soon as possible and lets me read the InputStream from the local file with no problems. It doesn't answer the question of how best to download in batches, but it's a nice and simple workaround.
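A minimal sketch of that workaround, assuming the v1 AWS SDK and Commons CSV as in the original snippet (bucket, key, and the temp-file handling are placeholders):
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.util.zip.GZIPInputStream;

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.GetObjectRequest;

class DownloadThenParseSketch {
    static void process(AmazonS3 s3Client) throws Exception {
        // Download the whole object to a local temp file first, so the HTTP
        // connection is closed quickly and parsing can take as long as it needs.
        File tempFile = File.createTempFile("large-csv", ".gz");
        try {
            s3Client.getObject(new GetObjectRequest("bucket", "key"), tempFile);
            try (CSVParser csvParser = new CSVParser(
                    new BufferedReader(new InputStreamReader(
                            new GZIPInputStream(new FileInputStream(tempFile)),
                            Charset.defaultCharset())),
                    CSVFormat.DEFAULT)) {
                for (CSVRecord record : csvParser) {
                    // time-consuming per-record processing goes here
                }
            }
        } finally {
            tempFile.delete();
        }
    }
}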

Related

Rename filename.ext.crswap to filename.ext rather than copying

When performing this sequence:
Obtain a handle to a new file via window.showSaveFilePicker, say filename.ext
Obtain a writable file stream from the handle
Write some content into the file using the stream
Close the stream to signal completion
the File System API writes to filename.ext.crswap and, on close, copies filename.ext.crswap to filename.ext.
Is there a reason that filename.ext.crswap is not rather renamed to filename.ext?
The reason for this behavior is to avoid partial writes:
"User agents try to ensure that no partial writes happen, i.e. the file represented by fileHandle will either contain its old contents or it will contain whatever data was written through stream up until the stream has been closed."—Spec.

JSR 352: How do you write to an MVS dataset from a Java Batch program?

I need to write to a non-VSAM dataset on the mainframe. I know that we need to use the ZFile library to do it, and I found how to do it here.
I am running my Java batch job in WebSphere Liberty on z/OS. How do I specify the dataset? Can I give the dataset name directly, like this?
dsnFile = new ZFile("X.Y.Z", "wb,type=record,noseek");
I am able to write it to a text file on the server itself using Java's FileWriter, but I don't know how to access an MVS dataset.
I am relatively new to the world of z/OS and the mainframe.
It sounds like you might be asking more generally how to use the ZFile API on WebSphere Liberty on z/OS.
Have you tried something like:
String pdsName = ZFile.getSlashSlashQuotedDSN("X.Y.Z");
ZFile zfile = new ZFile(pdsName, ...options...);
As far as batch-specific use cases go, you will likely have to differentiate between writing to a new file created for the first time on an original execution and appending to an already-existing one on a restart.
You also might find some useful snippets in this doctorbatch.io repo, along with the original link you posted.
For reference, I'll copy/paste from the ZFile Javadoc:
ZFile dd = new ZFile("//DD:MYDD", "r");
Opens the DD named MYDD for reading
ZFile dsn = new ZFile("//'SYS1.HELP(ACCOUNT)'", "rt");
Opens the member ACCOUNT from the PDS SYS1.HELP for reading text records
ZFile dsn = new ZFile("//SEQ", "wb,type=record,recfm=fb,lrecl=80,noseek");
Opens the data set {MVS_USER}.SEQ for sequential binary writing. Note that ",noseek" should be specified with "type=record" if access is sequential, since performance is greatly improved.
One final note: a couple of other useful ZFile helper methods are bpxwdyn() and getFullyQualifiedDSN().
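Putting the question's dataset name together with the Javadoc options above, a rough sketch of writing fixed-length records might look like the following; the record content and the IBM-1047 (EBCDIC) code page are assumptions, and recfm/lrecl should match the target dataset:
import com.ibm.jzos.ZFile;

public class DatasetWriteSketch {
    public static void main(String[] args) throws Exception {
        // Resolve the dataset name from the question into //'...' form
        String dsn = ZFile.getSlashSlashQuotedDSN("X.Y.Z");
        // Open for sequential record-mode writing, 80-byte fixed records
        ZFile zfile = new ZFile(dsn, "wb,type=record,recfm=fb,lrecl=80,noseek");
        try {
            // Pad each record to the LRECL and encode it in EBCDIC (assumed code page)
            byte[] record = String.format("%-80s", "HELLO FROM JSR 352").getBytes("IBM-1047");
            zfile.write(record);
        } finally {
            zfile.close();
        }
    }
}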

Using a text file as Spark streaming source for testing purpose

I want to write a test for my Spark Streaming application that consumes a Flume source.
http://mkuthan.github.io/blog/2015/03/01/spark-unit-testing/ suggests using ManualClock but for the moment reading a file and verifying outputs would be enough for me.
So I wish to use:
JavaStreamingContext streamingContext = ...
JavaDStream<String> stream = streamingContext.textFileStream(dataDirectory);
stream.print();
streamingContext.awaitTermination();
streamingContext.start();
Unfortunately it does not print anything.
I tried:
dataDirectory = "hdfs://node:port/absolute/path/on/hdfs/"
dataDirectory = "file://C:\\absolute\\path\\on\\windows\\"
adding the text file in the directory BEFORE the program begins
adding the text file in the directory WHILE the program runs
Nothing works.
Any suggestion to read from text file?
Thanks,
Martin
The order of start and await is indeed inverted.
In addition to that, the easiest way to pass data to your Spark Streaming application for testing is a QueueDStream. It's a mutable queue of RDDs of arbitrary data. This means that you can create the data programmatically or load it from disk into an RDD and pass that to your Spark Streaming code.
E.g., to avoid the timing issues you faced with the file-based source, you could try this:
import scala.collection.mutable.Queue
import org.apache.spark.rdd.RDD

val rdd = sparkContext.textFile(...)
val rddQueue: Queue[RDD[String]] = Queue()
rddQueue += rdd
val dstream = streamingContext.queueStream(rddQueue)
doMyStuffWithDstream(dstream)
streamingContext.start()
streamingContext.awaitTermination()
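Since the question's code uses the Java API, here is a rough Java equivalent of the same queue-based idea; the master URL, batch interval, and input path are placeholders for a local test:
import java.util.LinkedList;
import java.util.Queue;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class QueueStreamTestSketch {
    public static void main(String[] args) throws Exception {
        JavaStreamingContext streamingContext =
                new JavaStreamingContext("local[2]", "queue-stream-test", Durations.seconds(1));
        JavaSparkContext sparkContext = streamingContext.sparkContext();

        // Load the test data into an RDD and feed it to the stream via a queue
        Queue<JavaRDD<String>> rddQueue = new LinkedList<>();
        rddQueue.add(sparkContext.textFile("src/test/resources/input.txt"));

        JavaDStream<String> stream = streamingContext.queueStream(rddQueue);
        stream.print();

        streamingContext.start();            // start first...
        streamingContext.awaitTermination(); // ...then block
    }
}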
I am so stupid; I had inverted the calls to start() and awaitTermination().
If you want to do the same, you should read from HDFS and add the file WHILE the program runs.

What's the principle of uploading a file in android-async-http?

I had a question when using android-async-http. After reading the source code, I learned how to add a File or InputStream as a parameter to RequestParams. The RequestParams object is then passed to an AsyncHttpClient, which uses it for get/put/post... Like this:
String url = ...;
File file = ...;
ResponseHandlerInterface respHandler = ...;
AsyncHttpClient client = new AsyncHttpClient();
RequestParams params = new RequestParams();
params.add("upload_file", file);
client.get(url, params, respHandler);
As we all know, files of any type are essentially just bits, so when they are sent over the network they have to be turned into a byte stream. But I didn't find any code doing this conversion. So I wonder how android-async-http accomplishes this, or did I miss something when reading the source code?
I think I found how android-async-http handles files/InputStreams. Uploading a file goes through the put(...)/post(...) calls, not get(...). If you look at the overloaded put(...)/post(...) methods, you will find paramsToEntity(RequestParams, ResponseHandlerInterface), which returns an HttpEntity; HttpPost/HttpPut then call setEntity(HttpEntity) with it. Because the get(...) overloads don't support uploading files, you won't find any upload-related code in them.
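To make that concrete, a minimal upload sketch might look like the following; the URL and parameter name are placeholders, and the exact Header import depends on the library version (older releases use org.apache.http.Header):
import java.io.File;
import java.io.FileNotFoundException;

import com.loopj.android.http.AsyncHttpClient;
import com.loopj.android.http.AsyncHttpResponseHandler;
import com.loopj.android.http.RequestParams;

import cz.msebera.android.httpclient.Header;

public class UploadSketch {
    public void upload(File file) throws FileNotFoundException {
        AsyncHttpClient client = new AsyncHttpClient();
        RequestParams params = new RequestParams();
        params.put("upload_file", file); // file parameters go through put(), not add()

        // post() routes the params through paramsToEntity(), producing the HttpEntity
        client.post("https://example.com/upload", params, new AsyncHttpResponseHandler() {
            @Override
            public void onSuccess(int statusCode, Header[] headers, byte[] responseBody) {
                // the file was streamed to the server as part of the request entity
            }

            @Override
            public void onFailure(int statusCode, Header[] headers, byte[] responseBody, Throwable error) {
                // handle the failed upload
            }
        });
    }
}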

How to download a Byte Array?

So, using VB.NET, I retrieve from my server the byte data for a file that the user wishes to download. I always know what the filename and extension are, but what I don't know is how to start the download of the byte data in the proper file format. How do I go about doing this?
EDIT: Just to clarify, I already retrieve the data in byte format in code; I just need to serve it for download as the proper file type, which is also known. I'm keeping the URL to the file hidden at all times so it's never exposed.
If you want to download the file directly to the hard drive, the easiest solution is to use WebClient.DownloadFile. The MSDN page contains a nice example.
If you want to put the file into a byte array instead of a file on disk, use WebClient.DownloadData instead:
Dim myWebClient As New WebClient()
Dim myByteArray = myWebClient.DownloadData("http://...")
Again, a larger example can be found on the MSDN page.
If you want your program to stay responsive while downloading, check out the asynchronous versions of those methods.
EDIT: I'm still having a hard time understanding your situation, but if you already have a byte array and just want to write it to the disk, you can use File.WriteAllBytes:
File.WriteAllBytes("C:\my\path\myfile.bin", myByteArray)
Okay, I figured it out. Using BinaryWrite with the other Response functions like AddHeader and ContentType, I got it to work. GetMimeType is a function I made. Code below:
Response.Clear()
Response.AddHeader("Content-Disposition", "attachment; filename=" + FileName)
Response.ContentType = GetMimeType(FileName)
Response.BinaryWrite(data)
Response.Flush()
Response.End()
Thanks to those who tried to help!