Uncompressing a Gzip format? - gzip

I am facing a problem with Gzip uncompressing.
The situation is this: I have some UTF-8 text that is compressed with PHP's gzdeflate() function and then stored in a BLOB column in MySQL.
I then retrieved the blob and tried to uncompress it with Java's GZIPInputStream, but it throws an error saying that the data is not in GZIP format.
I also tried Java's Inflater, but then I get "DataFormatException: incorrect header check". The Inflater code is below.
// rs is the ResultSet
// blobAsBytes is the byte array
while (rs.next()) {
    blob = rs.getBlob("old_text");
    int blobLength = (int) blob.length();
    blobAsBytes = blob.getBytes(1, blobLength);
}
Inflater decompresser = new Inflater();
decompresser.setInput(blobAsBytes);
byte[] result = new byte[100];
int resultLength = decompresser.inflate(result); // this is the line where the exception is occurring.
decompresser.end();
// Decode the bytes into a String
String outputString = new String(result, 0, resultLength, "UTF-8");
System.out.println(outputString);
I have to do this using Java and get all the text back that is stored in the database.
Can someone please help me with this.

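Use gzencode(), not gzdeflate(). gzdeflate() does not produce the gzip format; it produces raw deflate data (RFC 1951) with no header or trailer. gzencode() does produce the gzip format (RFC 1952). The PHP functions are horribly and confusingly named.
Alternatively, keep gzdeflate() and construct java.util.zip.Inflater with nowrap set to true, i.e. new Inflater(true). That decodes raw deflate data on the Java end. The default Inflater expects a zlib header, which is why it reports "incorrect header check". Note also that a single inflate() call into a 100-byte buffer returns at most 100 bytes, so loop until finished() returns true to get all the text back.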
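To see the framing difference the answer above describes without a PHP or Java setup, here is a minimal Python sketch. It assumes that raw deflate produced with a negative wbits value is equivalent to what gzdeflate() emits; the sample text and the try_decode helper are made up for illustration.

```python
import gzip
import zlib

text = "UTF-8 text, e.g. h\u00e9llo w\u00f6rld".encode("utf-8")

# Raw deflate (RFC 1951): negative wbits means no header and no trailer,
# assumed here to match the framing of PHP's gzdeflate().
compressor = zlib.compressobj(wbits=-15)
raw = compressor.compress(text) + compressor.flush()


def try_decode(label, decode):
    """Run one decoder over the raw stream; return its output or its error."""
    try:
        return label, decode(raw)
    except (OSError, zlib.error) as exc:
        return label, exc


results = [
    # Expects the 1f 8b gzip magic bytes, like GZIPInputStream in Java.
    try_decode("gzip decoder", gzip.decompress),
    # Expects a zlib header, like the default Inflater in Java.
    try_decode("zlib decoder", zlib.decompress),
    # Expects no wrapper at all, like new Inflater(true) in Java.
    try_decode("raw decoder", lambda data: zlib.decompress(data, wbits=-15)),
]

for label, outcome in results:
    print(label, "->", outcome)
```

Only the raw decoder should return the original bytes; the other two fail on the missing header, which mirrors the two errors reported in the question.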

Related

Convert html content to PDF Byte Array with kotlin

val sanitizedHTML = Jsoup.clean(html, whitelist)
val textRenderer = ITextRenderer()
val outputStream = ByteArrayOutputStream()
textRenderer.setDocumentFromString(sanitizedHTML)
textRenderer.layout()
textRenderer.createPDF(outputStream)
textRenderer.finishPDF()
return Base64.getDecoder().decode(outputStream.toByteArray())
I would like to generate a PDF from HTML content and, rather than saving it as a file, upload it to a server that expects a ByteArray.
I tried the above, using jsoup to clean the HTML and ITextRenderer to generate the PDF, but I keep receiving an error about invalid Base64 character 25. Could someone tell me what I am doing wrong here?
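return Base64.getDecoder().decode(outputStream.toByteArray())
This line was incorrect. The output stream already holds raw PDF bytes, not Base64 text, so there is nothing to decode. (Character 25 in the error is hexadecimal for '%', the first byte of the "%PDF" header.) If I remove the Base64 decoding and return outputStream.toByteArray() directly, it works well.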
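As a rough illustration of why a Base64 decoder rejects PDF output, here is a small Python sketch. The pdf_bytes value is a made-up stand-in for real renderer output, and strict validation is switched on so that Python's decoder rejects non-alphabet bytes instead of skipping them.

```python
import base64
import binascii

# Stand-in for real renderer output; every PDF file starts with "%PDF".
pdf_bytes = b"%PDF-1.7\n(remaining bytes omitted)"

# '%' is byte 0x25 and is not part of the Base64 alphabet.
first_byte_hex = format(pdf_bytes[0], "x")

try:
    base64.b64decode(pdf_bytes, validate=True)
    decoded_ok = True
except binascii.Error:
    decoded_ok = False

print("first byte (hex):", first_byte_hex)
print("raw PDF accepted as Base64:", decoded_ok)

# Base64 only makes sense as a pair: encode for text transport,
# then decode on the receiving side.
round_trip = base64.b64decode(base64.b64encode(pdf_bytes), validate=True)
```

If the server really expects raw bytes, no Base64 step is needed on either side.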

Convert compress functions from Python to Kotlin

I have functions in Python for compression and decompression of a string (bytearray):
import zlib

def compress_message(data):
    # wbits=31 (16 + 15) requests a gzip wrapper with a 32 KiB window
    compressor = zlib.compressobj(-1, zlib.DEFLATED, 31, 8, zlib.Z_DEFAULT_STRATEGY)
    compressed_data = compressor.compress(data)
    compressed_data += compressor.flush()
    return compressed_data

def decompress_message(compressed_data):
    return zlib.decompress(compressed_data, wbits=31)
I need to convert these functions to Kotlin so that I can use them in my mobile app. So far I have tried this:
fun String.zlibCompress(): ByteArray {
    val input = this.toByteArray(charset("UTF-8"))
    val output = ByteArray(input.size * 4)
    val compressor = Deflater().apply {
        setLevel(-1)
        setInput(input)
        finish()
    }
    val compressedDataLength: Int = compressor.deflate(output)
    return output.copyOfRange(0, compressedDataLength)
}
However, it gives totally different results. For example, for the string abcdefghijklmnouprstuvwxyz:
Python: 1f8b080000000000000a4b4c4a4e494d4bcfc8cccacec9cdcb2f2d282a2e292d2bafa8ac0200c197b2d21a000000
Kotlin: 789c4b4c4a4e494d4bcfc8cccacec9cdcb2f2d282a2e292d2bafa8ac020090b30b24
Is there any way to modify the Kotlin code so that it gives the same result as Python?
Thanks for your replies. <3
The 31 parameter in the Python code is requesting a gzip stream, not a zlib stream. In Kotlin, you are generating a zlib stream. (zlib is described in RFC 1950, gzip in RFC 1952.)
It does not appear that Java's Deflater (spelled wrong) class has that option. It does have a nowrap option that gives raw deflate compressed data, around which you can construct your own gzip wrapper, using the RFC to see how.
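By the way, the results are not "totally different". You have gzip and zlib wrappers around exactly the same raw deflate compressed data: 4b4c...0200. The gzip output adds a 10-byte header and an 8-byte trailer (CRC-32 and length); the zlib output adds a 2-byte header and a 4-byte Adler-32 trailer.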
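The "construct your own gzip wrapper" step can be sketched as follows. This is only an outline in Python, assuming the minimal RFC 1952 header with no optional fields; gzip_wrap is a hypothetical helper name, and the header values (zero timestamp, OS set to 255 for "unknown") are illustrative choices, so the header bytes will not match the Python output in the question byte for byte.

```python
import gzip
import struct
import zlib


def gzip_wrap(plain: bytes) -> bytes:
    """Hypothetical helper: raw deflate body inside a hand-built gzip wrapper."""
    compressor = zlib.compressobj(wbits=-15)  # raw deflate, like nowrap = true
    body = compressor.compress(plain) + compressor.flush()
    # ID1, ID2, CM=8 (deflate), FLG=0, MTIME=0 (4 bytes), XFL=0, OS=255 (unknown)
    header = bytes([0x1F, 0x8B, 8, 0, 0, 0, 0, 0, 0, 255])
    # Trailer: CRC-32 and length (mod 2**32) of the uncompressed data, little-endian
    trailer = struct.pack("<II", zlib.crc32(plain), len(plain) & 0xFFFFFFFF)
    return header + body + trailer


text = b"abcdefghijklmnouprstuvwxyz"
wrapped = gzip_wrap(text)

# A gzip decoder accepts the result, including wbits=31 as used in the question.
via_gzip = gzip.decompress(wrapped)
via_zlib31 = zlib.decompress(wrapped, wbits=31)

# The Kotlin output from the question is a zlib stream around a raw deflate body.
kotlin_zlib = bytes.fromhex(
    "789c4b4c4a4e494d4bcfc8cccacec9cdcb2f2d282a2e292d2bafa8ac020090b30b24"
)
kotlin_body = kotlin_zlib[2:-4]  # strip the 2-byte header and 4-byte Adler-32
from_kotlin = zlib.decompress(kotlin_body, wbits=-15)

print(via_gzip == text, via_zlib31 == text, from_kotlin == text)
```

The same layout carries over to Kotlin with Deflater(level, true) for the body and java.util.zip.CRC32 for the trailer. On the JVM, java.util.zip.GZIPOutputStream also emits gzip framing, which may be simpler than building the wrapper by hand.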

newtonsoft SerializeXmlNode trailing nulls

I am creating an XmlDoc in C# and using Newtonsoft to serialize to JSON. It works, but I am getting a bunch of what appear to be "NUL"'s at the end of the JSON. No idea why. Anyone seen this before?
CODE:
XmlDocument xmlDoc = BuildTranslationXML(allTrans, applicationName, language);
// Convert the xml doc to json
// the conversion inserts \" instead of using a single quote, so we need to replace it
string charToReplace = "\"";
string jsonText = JsonConvert.SerializeXmlNode(xmlDoc);
// json to a stream
MemoryStream memoryStream = new MemoryStream();
TextWriter tw = new StreamWriter(memoryStream);
tw.Write(jsonText);
tw.Flush();
tw.Close();
// output the stream as a file
string fileName = string.Format("{0}_{1}.json", applicationName, language);
return File(memoryStream.GetBuffer(), "text/json", fileName);
The file is served up to the calling web page and the browser prompts the user to save the file. When opening the file, it displays the correct JSON but also has all the trailing nulls.
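The GetBuffer() method returns the entire internal buffer of the MemoryStream, including the unused capacity beyond the data that was written, which is where the trailing NULs come from. Use ToArray() instead to get just the part of that internal array that actually holds the written JSON.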

iTextSharp: Convert PdfObject to PdfStream

I am attempting to pull some font streams out of a pdf file (legality is not an issue, as my company has paid for the rights to display these documents in their original manner - and this requires a conversion which requires the extraction of the fonts).
Now, I had been using MUTool - but it also extracts the images in the pdf as well with no method for bypassing them and some of these contain 10s of thousands of images. So, I took to the web for answers and have come to the following solution:
I get all of the fonts into a font dictionary and then I attempt to convert them into PdfStreams (for flatedecode and then writing to files) using the following code:
PdfDictionary tg = (PdfDictionary)PdfReader.GetPdfObject((PdfObject)cItem.pObj);
PdfName type = (PdfName)PdfReader.GetPdfObject(tg.Get(PdfName.SUBTYPE));
try
{
    int xrefIdx = ((PRIndirectReference)((PdfObject)cItem.pObj)).Number;
    PdfObject pdfObj = (PdfObject)reader.GetPdfObject(xrefIdx);
    PdfStream str = (PdfStream)(pdfObj);
    byte[] bytes = PdfReader.GetStreamBytesRaw((PRStream)str);
}
catch { }
But, when I get to PdfStream str = (PdfStream)(pdfObj); I get the error below:
Unable to cast object of type 'iTextSharp.text.pdf.PdfDictionary'
to type 'iTextSharp.text.pdf.PdfStream'.
Now, I know that PdfDictionary derives from (extends) PdfObject so I am uncertain as to what I am doing incorrectly here. Someone please help - I either need advice on patching this code, or if entirely incorrect, either code to extract the stream properly or direction to a place with said code.
Thank you.
EDIT
My revised code is here:
public static void GetStreams(PdfReader pdf)
{
    int page_count = pdf.NumberOfPages;
    for (int i = 1; i <= page_count; i++)
    {
        PdfDictionary pg = pdf.GetPageN(i);
        // Assumed missing line: res was used below without being declared
        PdfDictionary res = (PdfDictionary)PdfReader.GetPdfObject(pg.Get(PdfName.RESOURCES));
        PdfDictionary fObj = (PdfDictionary)PdfReader.GetPdfObject(res.Get(PdfName.FONT));
        if (fObj != null)
        {
            foreach (PdfName name in fObj.Keys)
            {
                PdfObject obj = fObj.Get(name);
                if (obj.IsIndirect())
                {
                    PdfDictionary tg = (PdfDictionary)PdfReader.GetPdfObject(obj);
                    PdfName type = (PdfName)PdfReader.GetPdfObject(tg.Get(PdfName.SUBTYPE));
                    int xrefIdx = ((PRIndirectReference)obj).Number;
                    PdfObject pdfObj = pdf.GetPdfObject(xrefIdx);
                    if (pdfObj != null && pdfObj.IsStream())
                    {
                        PdfStream str = (PdfStream)(pdfObj);
                        byte[] bytes = PdfReader.GetStreamBytesRaw((PRStream)str);
                    }
                }
            }
        }
    }
}
However, I am still receiving the same error - so I am assuming that this is an incorrect method of retrieving font streams. The same document has had fonts extracted using muTool successfully - so I know the problem is me and not the pdf.
There are at least two things wrong in your code:
You cast an object to a stream without first performing this check: if (pdfObj != null && pdfObj.IsStream()) { // cast to stream } Since the error message says that you're trying to cast a dictionary to a stream, I'm 99% sure that the second part of the check will return false, whereas pdfObj.IsDictionary() probably returns true.
You are reading the stream through PdfReader, yet you cast the object to a PdfStream instead of to a PRStream. PdfStream is the object we use to create PDFs; PRStream is the object used when we inspect PDFs using PdfReader.
You should fix this problem first.
Now for your general question. If you read ISO-32000-1, you'll discover that a font is defined using a font dictionary. If the font is embedded (fully or partly), the font dictionary will refer to a stream. This stream can contain the full font information, but most of the time, you'll only get a subset of the glyphs (because that's best practice when creating a PDF).
Take a look at the example ListFontFiles from my book "iText in Action" to get a first impression of how fonts are organized inside a PDF. You'll need to combine this example with ISO-32000-1 to find more info about the difference between FONTFILE, FONTFILE2 and FONTFILE3.
I've also written an example that replaces an unembedded font with a font file: EmbedFontPostFacto. This example serves as an introduction to explain how difficult font replacement is.
Please go to http://tinyurl.com/iiacsCH16 if you need the C# version of the book samples.

Play Framework 2, Rest Services and gzip decompression

I'm facing what seems to be a charset issue in Play when decompressing gzip content from REST services. When I run the code snippet below, an error is thrown saying "Malformed JSON. Illegal character ((CTRL-CHAR, code 31))":
val url: String = "https://api.stackexchange.com/2.0/info?site=stackoverflow"
Async {
  WS.url(url)
    .withHeaders("Accept-Encoding" -> "gzip, deflate")
    .get()
    .map { response =>
      Ok("Response: " + (response.json \ "items"))
    }
}
At first I thought it was a problem in the StackExchange API itself, but I tried a similar service that also uses gzip compression, and the same error happens. It's hard to fix the code because I don't even know where the "Illegal character" is. Is there something missing, or is it actually a bug in Play?
The clue I can provide is that the first byte of a gzip stream is 31 (0x1f). So you probably need to do something else to cause the gzip stream to be decompressed.
By the way, I recommend that you not accept deflate encoding, just gzip.
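The clue about byte 31 can be checked with a short Python sketch. The payload is made up rather than real API output, and parse_body is a hypothetical helper that looks for the gzip magic bytes before parsing.

```python
import gzip
import json

payload = {"items": [{"total_questions": 1}]}  # made-up body, not real API output
compressed = gzip.compress(json.dumps(payload).encode("utf-8"))

# Every gzip stream starts with the magic bytes 0x1f 0x8b; 0x1f is decimal 31,
# the control character the JSON parser complains about.
first_byte = compressed[0]


def parse_body(body: bytes):
    """Decompress when the gzip magic is present, then parse as JSON."""
    if body[:2] == b"\x1f\x8b":
        body = gzip.decompress(body)
    return json.loads(body.decode("utf-8"))


# Feeding the still-compressed bytes to the JSON parser fails.
try:
    json.loads(compressed)
    parsed_without_decompressing = True
except ValueError:
    parsed_without_decompressing = False

print(first_byte, parsed_without_decompressing, parse_body(compressed))
```

In practice the HTTP client normally handles this step once compression support is enabled, so a manual check like this is mainly useful for diagnosing the error.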
Here is how it can be done with Play 2.3
// set Http compression: https://www.playframework.com/documentation/2.3.x/ScalaWS
val clientConfig = new DefaultWSClientConfig()
val secureDefaults: AsyncHttpClientConfig = new NingAsyncHttpClientConfigBuilder(clientConfig).build()
val builder = new AsyncHttpClientConfig.Builder(secureDefaults)
builder.setCompressionEnabled(true)
val secureDefaultsWithSpecificOptions: AsyncHttpClientConfig = builder.build()
implicit val implicitClient = new NingWSClient(secureDefaultsWithSpecificOptions)
val response = WS.clientUrl("http://host/endpoint/item").withHeaders(("Accept-Encoding", "gzip")).get()