Subtleties in Reading/Writing Binary Data with Java - file-io

One phenomenon I've noticed with Java file reads using a byte array buffer is that, just like with C's fread(), if I don't account for the length of the final read and the total size of the data is not a multiple of the buffer size, excess garbage data can end up in the output file. When performing binary I/O this way, some copied files came out somewhat corrupted.
The garbage values are presumably values left in the buffer from a previous pass that were never overwritten, since the final read did not fill the whole buffer.
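For illustration, here is a minimal sketch of the pattern that produces the corruption (the file names are placeholders): the whole buffer is written on every iteration, so whatever stale bytes remain after a partial final read end up in the output file.
import java.io.*;

public class BrokenCopy {
    public static void main(String[] args) throws IOException {
        try (InputStream in = new FileInputStream("input.bin");    // placeholder paths
             OutputStream out = new FileOutputStream("output.bin")) {
            byte[] buffer = new byte[512];
            while (in.read(buffer) != -1) {
                // Bug: always writes all 512 bytes, even when the last read
                // returned fewer, so stale buffer contents leak into the copy.
                out.write(buffer);
            }
        }
    }
}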
While looking over various tutorials, all the methods of reading binary data I found were similar to the code below:
InputStream inputStream = new FileInputStream("prev_font.ttf");
OutputStream outputStream = new FileOutputStream("font.ttf");
byte[] buffer = new byte[512];
int read;
while ((read = inputStream.read(buffer)) != -1)
{
    outputStream.write(buffer, 0, read);
}
outputStream.close();
inputStream.close();
But while reading from an input stream for a file packaged in a JAR, I couldn't make a proper copy of the file; the output would be an invalid file of that type.
Since I was quite new to JAR access, I couldn't pinpoint whether the issue was with my resource file pathing or something else, so it took quite a bit of time to realize what was going on.
All the code I came across seemed to be missing a vital detail: the write should cover not the entire buffer, but only the number of bytes actually read:
InputStream inputStream = new FileInputStream("prev_font.ttf");
OutputStream outputStream = new FileOutputStream("font.ttf");
byte[] dataBuffer = new byte[512];
int read;
while ((read = inputStream.read(dataBuffer)) != -1)
{
    outputStream.write(dataBuffer, 0, read);
}
outputStream.close();
inputStream.close();
Now that's all fine, but why was something so major not mentioned in any of the tutorials? Did I simply look at bad tutorials, or is Java supposed to handle the overflow reads itself and my implementation was off somehow? It was simply unexpected.
Please correct me if any of my statements were wrong, and kindly provide alternative solutions to handling the issue if there are any.

There isn't much difference between the two code blocks you've provided, apart from minor typos that keep them from compiling. The buffer is not corrupted by read(); the output file is corrupted only if the number of bytes actually read is not passed to write() on each iteration of the loop. Keep in mind that read() may return fewer bytes than the buffer length at any point, not just on the final read, which is especially common for streams coming from a JAR or the network.
To copy a file, say src to dst, just use try-with-resources and the built-in transferTo:
Path src = Path.of("prev_font.ttf");
Path dst = Path.of("font.ttf");
try (InputStream in = Files.newInputStream(src);
     OutputStream out = Files.newOutputStream(dst)) {
    in.transferTo(out);
}
Or call one of the built-in methods of Files:
Files.copy(src, dst);
// or
Files.copy(src, dst, StandardCopyOption.REPLACE_EXISTING);
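Since the question mentions reading a file packaged inside a JAR, here is a hedged sketch of that case (the resource path /fonts/prev_font.ttf is a placeholder; adjust it to wherever the file sits on the classpath). Files.copy consumes the stream fully and handles partial reads internally, so no manual buffer loop is needed.
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class JarResourceCopy {
    public static void main(String[] args) throws IOException {
        // Placeholder resource path inside the JAR.
        try (InputStream in = JarResourceCopy.class.getResourceAsStream("/fonts/prev_font.ttf")) {
            if (in == null) {
                throw new IOException("Resource not found on the classpath");
            }
            Files.copy(in, Path.of("font.ttf"), StandardCopyOption.REPLACE_EXISTING);
        }
    }
}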

Related

How to read all bytes of a file in winrt with ReadBufferAsync?

I have an array of objects that each needs to load itself from binary file data. I create an array of these objects and then call an AsyncAction for each of them that starts it reading in its data. Trouble is, they are not loading entirely - they tend to get only part of the data from the files. How can I make sure that the whole thing is read? Here is an outline of the code: first I enumerate the folder contents to get a StorageFile for each file it contains. Then, in a for loop, each receiving object is created and passed the next StorageFile, and it creates its own Buffer and DataReader to handle the read. m_file_bytes is a std::vector.
m_buffer = co_await FileIO::ReadBufferAsync(nextFile);
m_data_reader = winrt::Windows::Storage::Streams::DataReader::FromBuffer(m_buffer);
m_file_bytes.resize(m_buffer.Length());
m_data_reader.ReadBytes(m_file_bytes);
My thought was that since the buffer and reader are class members of the object they would not go out of scope and could finish their work uninterrupted as the next objects were asked to load themselves in separate AsyncActions. But the DataReader only gets maybe half of the file data or less. What can be done to make sure it completes? Thanks for any insights.
[Update] Perhaps what is going on is that the file system can handle only one read task at a time, and by starting all these async reads each one is interrupting the previous one? But there must be a way to progressively read a folder full of files.
[Update] I think I have it working by adopting the principle of concentric loops: the idea is not to proceed to the next load until the previous one has completed. I think (someone can correct me if I'm wrong) that the file system cannot do simultaneous reads. If there is an accepted and secure example of how to do this I would still love to hear about it, so I'm not answering my own question.
#include <wrl.h>
#include <robuffer.h>

uint8_t* GetBufferData(winrt::Windows::Storage::Streams::IBuffer& buffer)
{
    ::IUnknown* unknown = winrt::get_unknown(buffer);
    ::Microsoft::WRL::ComPtr<::Windows::Storage::Streams::IBufferByteAccess> bufferByteAccess;
    HRESULT hr = unknown->QueryInterface(__uuidof(::Windows::Storage::Streams::IBufferByteAccess), &bufferByteAccess);
    if (FAILED(hr))
        return nullptr;
    byte* bytes = nullptr;
    bufferByteAccess->Buffer(&bytes);
    return bytes;
}
https://learn.microsoft.com/en-us/cpp/cppcx/obtaining-pointers-to-data-buffers-c-cx?view=vs-2017
https://learn.microsoft.com/en-us/windows/uwp/xbox-live/storage-platform/connected-storage/connected-storage-using-buffers

How to read byte data from a StorageFile in cppwinrt?

A 2012 answer at StackOverflow (“How do I read a binary file in a Windows Store app”) suggests this method of reading byte data from a StorageFile in a Windows Store app:
IBuffer buffer = await FileIO.ReadBufferAsync(theStorageFile);
byte[] bytes = buffer.ToArray();
That looks simple enough. As I am working in cppwinrt I have translated that to the following, within the same IAsyncAction that produced a vector of StorageFiles. First I obtain a StorageFile from the VectorView using theFilesVector.GetAt(index);
//Then this line compiles without error:
IBuffer buffer = co_await FileIO::ReadBufferAsync(theStorageFile);
//But I can’t find a way to make the buffer call work.
byte[] bytes = buffer.ToArray();
“byte[]” can’t work, to begin with, so I change that to byte*, but then
the error is “class ‘winrt::Windows::Storage::Streams::IBuffer’ has no member ‘ToArray’”
And indeed Intellisense lists no such member for IBuffer. Yet IBuffer was specified as the return type for ReadBufferAsync. It appears the above sample code cannot function as it stands.
In the documentation for FileIO I find it recommended to use DataReader to read from the buffer, which in cppwinrt should look like
DataReader dataReader = DataReader::FromBuffer(buffer);
That compiles. It should then be possible to read bytes with the following DataReader method, which is fortunately supplied in the UWP docs in cppwinrt form:
void ReadBytes(Byte[] value) const;
However, that does not compile because the type Byte is not recognized in cppwinrt. If I create a byte array instead:
byte* fileBytes = new byte(buffer.Length());
that is not accepted. The error is
‘No suitable constructor exists to convert from “byte*” to “winrt::array_view<uint8_t>”’
uint8_t is of course a byte, so let’s try
uint8_t fileBytes = new uint8_t(buffer.Length());
That is wrong - clearly we really need to create a winrt::array_view. Yet a 2015 Reddit post says that array_view “died” and I’m not sure how to declare one, or if it will help. That original one-line method for reading bytes from a buffer is looking so beautiful in retrospect. This is a long post, but can anyone suggest the best current method for simply reading raw bytes from a StorageFile reference in cppwinrt? It would be so fine if there were simply GetFileBytes() and GetFileBytesAsync() methods on StorageFile.
---Update: here's a step forward. I found a comment from Kenny Kerr last year explaining that array_view should not be declared directly, but that std::vector or std::array can be used instead. And that is accepted as an argument for the ReadBytes method of DataReader:
std::vector<unsigned char>fileBytes;
dataReader.ReadBytes(fileBytes);
Only trouble now is that the std::vector is receiving no bytes, even though the size of the referenced file is correctly returned in buffer.Length() as 167,513 bytes. That seems to suggest the buffer is good, so I'm not sure why the ReadBytes method applied to that buffer would produce no data.
Update #2: Kenny suggests reserving space in the vector, which is something I had tried, this way:
m_file_bytes.reserve(buffer.Length());
But it didn't make a difference. Here is a sample of the code as it now stands, using DataReader.
buffer = co_await FileIO::ReadBufferAsync(nextFile);
dataReader = DataReader::FromBuffer(buffer);
//The following line may be needed, but crashes
//co_await dataReader.LoadAsync(buffer.Length());
if (buffer.Length())
{
    m_file_bytes.reserve(buffer.Length());
    dataReader.ReadBytes(m_file_bytes);
}
The crash, btw, is
throw hresult_error(result, hresult_error::from_abi);
Is it confirmed, then, that the original 2012 solution quoted above cannot work in today's world? But of course there must be some way to read bytes from a file, so I'm just missing something that may be obvious to another.
Final (I think) update: Kenny's suggestion that the vector needs a size has hit the mark. If the vector is first prepared with m_file_bytes.assign(buffer.Length(),0) then it does get filled with file data. Now my only worry is that I don't really understand the way IAsyncAction is working and maybe could have trouble looping this asynchronously, but we'll see.
The array_view bridges the gap between Windows APIs and C++ array types. In this example, the ReadBytes method expects the caller to provide some array that it can copy bytes into. The array_view forwards a pointer to the caller's array as well as its size. In this case, you're passing an empty vector. Try resizing the vector before calling ReadBytes.
When you know how many bytes to expect (in this case 2 bytes), this worked for me:
std::vector<unsigned char> fileBytes;
fileBytes.resize(2);
DataReader reader = DataReader::FromBuffer(buffer);
reader.ReadBytes(fileBytes);
cout << fileBytes[0] << endl;
cout << fileBytes[1] << endl;

iText throws ClassCastException: PdfNumber cannot be cast to PdfLiteral

I am using iText v5.5.1 to read a PDF and extract plain text from it:
pdfReader = new PdfReader(new CloseShieldInputStream(is));
pdfParser = new PdfReaderContentParser(pdfReader);
int maxPageNumber = pdfReader.getNumberOfPages();
int pageNumber = 1;
StringBuilder sb = new StringBuilder();
while (pageNumber <= maxPageNumber) {
    // A fresh strategy per page, otherwise getText() re-appends the text of earlier pages.
    SimpleTextExtractionStrategy extractionStrategy = new SimpleTextExtractionStrategy();
    pdfParser.processContent(pageNumber, extractionStrategy);
    sb.append(extractionStrategy.getText());
    pageNumber++;
}
On one PDF file the following exception is thrown:
java.lang.ClassCastException: com.itextpdf.text.pdf.PdfNumber cannot be cast to com.itextpdf.text.pdf.PdfLiteral
at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.processContent(PdfContentStreamProcessor.java:382)
at com.itextpdf.text.pdf.parser.PdfReaderContentParser.processContent(PdfReaderContentParser.java:80)
That PDF file seems to be broken, but maybe its contents still make sense...
Indeed
That PDF file seems to be broken
The content streams of all pages look like this:
/GS1 gs
q
595.00 0 0
It looks like they all are cut off early, as the last line is not a complete operation. This certainly can make a parser hiccup, as iText does here.
Furthermore, the content should be longer: even the compressed stream is a bit larger than the length of this output, which indicates streams broken at the byte level.
Looking at the bytes of the PDF file, one cannot help but notice that even inside binary streams the codes 13 and 10 only occur together, and that the cross-reference offset values are less than the actual positions.
So I assume that this PDF has been transmitted using a transport method that handled it as textual data, in particular replacing any kind of assumed line break (CR, LF, or CR LF) with the CR LF now omnipresent in the file (CR = Carriage Return = 13; LF = Line Feed = 10). Such replacements automatically break any compressed data stream, like the content streams in your file.
Unfortunately, though...
but maybe its contents still make sense
Not much. There is one big image associated with each page. Considering the small size of the content streams and the large image sizes, I would assume that the PDF only contains scanned pages. But the images are also broken, due to the replacements mentioned above.
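If you want to check a suspect file for this kind of text-mode damage yourself, here is a minimal plain-Java sketch of the symptom described above: in an intact PDF, compressed streams normally contain plenty of lone CR (13) and lone LF (10) bytes, so if virtually every occurrence is a CR LF pair, the file has most likely been rewritten in text mode.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class CrLfCheck {
    public static void main(String[] args) throws IOException {
        byte[] data = Files.readAllBytes(Path.of(args[0]));
        long pairs = 0, loneCr = 0, loneLf = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == 0x0D) {
                if (i + 1 < data.length && data[i + 1] == 0x0A) { pairs++; i++; }
                else loneCr++;
            } else if (data[i] == 0x0A) {
                loneLf++;
            }
        }
        // Lone counts near zero alongside many pairs suggest a text-mode transfer mangled the file.
        System.out.printf("CR LF pairs: %d, lone CR: %d, lone LF: %d%n", pairs, loneCr, loneLf);
    }
}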
This isn't the best solution, but I had this exact problem and unfortunately can't share the exact PDFs I was having issues with.
I made a fork of itextpdf that catches the ClassCastException and just skips PdfObjects that it takes issue with. It prints to System.out what the text contained and what type itextpdf thinks it was. I haven't been able to map this out to some systemic problem with my PDFs (someone smarter than me will need to do that), and this exception only happens once in a blue moon. Anyway, in case it helps anyone, this fork at least doesn't crash your code, lets you parse the majority of your PDFs, and gives you a bit of info on what types of bytestrings seem to give itextpdf indigestion.
https://github.com/njhwang/itextpdf
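If maintaining a fork is not an option, a caller-side sketch of the same idea is to catch the ClassCastException per page, reusing pdfReader and pdfParser from the question's snippet, so a single broken content stream doesn't abort the whole extraction. Whether the surviving pages yield useful text still depends on how badly the file is damaged.
StringBuilder sb = new StringBuilder();
int maxPageNumber = pdfReader.getNumberOfPages();
for (int pageNumber = 1; pageNumber <= maxPageNumber; pageNumber++) {
    try {
        SimpleTextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
        pdfParser.processContent(pageNumber, strategy);
        sb.append(strategy.getText());
    } catch (ClassCastException e) {
        // Broken content stream on this page; skip it and keep going.
        System.err.println("Skipping page " + pageNumber + ": " + e.getMessage());
    }
}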

How to minimize serialized data - binary serialization has huge overhead

I'd like to reduce the message size when sending serialized integer over the network.
In the code below buff.Length is 256 - too great an overhead to be efficient!
How can it be reduced to the minimum (4 bytes + minimal overhead)?
int val = RollDice(6);
// Should 'memoryStream' be allocated each time!
MemoryStream memoryStream = new MemoryStream();
BinaryFormatter formatter = new BinaryFormatter();
formatter.Serialize(memoryStream, val);
byte[] buff = memoryStream.GetBuffer();
Thanks in advance,
--- KostaZ
Have a look at protobuf.net...it is a very good serialization lib (you can get it on NuGet). Also, ideally you should be using a "using" statement around your memory stream.
To respond to the comment below: the most efficient method depends on your use case. If you know exactly what you need to serialize and don't need a general-purpose serializer, then you could write your own binary formatter, which might have no overhead at all (there is some detail here: custom formatters).
This link has a comparison of the BinaryFormatter and protobuf.net for your reference.

Hacky Sql Compact Workaround

So, I'm trying to use ADO.NET to stream file data stored in an image column in a SQL Compact database.
To do this, I wrote a DataReaderStream class that takes a data reader, opened for sequential access, and represents it as a stream, redirecting calls to Read(...) on the stream to IDataReader.GetBytes(...).
One "weird" aspect of IDataReader.GetBytes(...), when compared to the Stream class, is that GetBytes requires the client to increment an offset and pass that in each time it's called. It does this even though access is sequential, and it's not possible to read "backwards" in the data reader stream.
The SqlCeDataReader implementation of IDataReader enforces this by incrementing an internal counter that identifies the total number of bytes it has returned. If you pass in a number either less than or greater than that number, the method will throw an InvalidOperationException.
The problem with this, however, is that there is a bug in the SqlCeDataReader implementation that causes it to set the internal counter to the wrong value. This results in subsequent calls to Read on my stream throwing exceptions when they shouldn't be.
I found some information about the bug on this MSDN thread.
I was able to come up with a disgusting, horribly hacky workaround, that basically uses reflection to update the field in the class to the correct value.
The code looks like this:
public override int Read(byte[] buffer, int offset, int count)
{
    m_length = m_length ?? m_dr.GetBytes(0, 0, null, offset, count);
    if (m_fieldOffSet < m_length)
    {
        var bytesRead = m_dr.GetBytes(0, m_fieldOffSet, buffer, offset, count);
        m_fieldOffSet += bytesRead;
        if (m_dr is SqlCeDataReader)
        {
            // BEGIN HACK
            // This is a horrible HACK.
            m_field = m_field ?? typeof(SqlCeDataReader).GetField("sequentialUnitsRead", BindingFlags.NonPublic | BindingFlags.Instance);
            var length = (long)m_field.GetValue(m_dr);
            if (length != m_fieldOffSet)
            {
                m_field.SetValue(m_dr, m_fieldOffSet);
            }
            // END HACK
        }
        return (int)bytesRead;
    }
    else
    {
        return 0;
    }
}
For obvious reasons, I would prefer to not use this.
However, I do not want to buffer the entire contents of the blob in memory either.
Does any one know of a way I can get streaming data out of a SQL Compact database without having to resort to such horrible code?
I contacted Microsoft (through the SQL Compact Blog) and they confirmed the bug, and suggested I use OLEDB as a workaround. So, I'll try that and see if that works for me.
Actually, I decided to fix the problem by just not storing blobs in the database to begin with.
This eliminates the problem (I can stream data from a file), and also fixes some issues I might have run into with Sql Compact's 4 GB size limit.