I'm working with the KBA TREC collection. A generic TREC collection contains many documents, which can be news articles, forum posts, etc.
When I download a document, its name has the form filename.sc.xz.gpg:
gpg: the file was encrypted, so I decrypted it with the appropriate key;
xz: the file was compressed, so I decompressed it;
sc: the file is serialized, and I must still deserialize it.
The problem is the deserialization step. The documentation says that the data was serialized with Thrift.
My question is: how can I use Thrift to deserialize my file filename.sc?
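For concreteness, here is a minimal sketch (in C#, using the classic synchronous Thrift runtime) of what I understand the deserialization loop should look like. It assumes classes generated from the corpus's .thrift definition (StreamItem is that hypothetical generated class) and that the .sc file is a sequence of binary-serialized items concatenated together:

using System.IO;
using Thrift.Protocol;
using Thrift.Transport;

// StreamItem is assumed to be generated with `thrift --gen csharp`
// from the corpus's .thrift definition.
using (var file = File.OpenRead("filename.sc"))
using (var transport = new TStreamTransport(file, null))
{
    var protocol = new TBinaryProtocol(transport);
    while (file.Position < file.Length)   // items are simply concatenated
    {
        var item = new StreamItem();      // hypothetical generated class
        item.Read(protocol);
        // ... process the deserialized item ...
    }
}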
I am trying to reproduce the solution shown here, but with no luck.
Basically, Ivan Kuckir managed to decompress a PDF 1.6 xref stream by decrypting it first and then decompressing it. That stream, like mine, belongs to an encrypted PDF file.
One issue, however, is that the PDF 1.6 spec states on p. 83 that "The cross-reference stream must NOT be encrypted, nor may any strings appearing in the cross-reference stream dictionary. It must not have a Filter entry that specifies a Crypt filter (see 3.3.9, “Crypt Filter”)." What I understand from this is that, like the cross-reference tables before them, xref streams must not be encrypted.
When I try to inflate the stream, the zlib DLL crashes. It also crashes when I decrypt first and then inflate... Has anyone managed to reproduce Ivan Kuckir's solution? Thanks.
P.S. I tried to ask the question in the above thread but for some reason it was deleted by the admin...
This is the link to the object: https://drive.google.com/file/d/1DwOf3zarg9p_B8DNZ2gZdaBr43NKDWR3/view?usp=sharing
I replaced the stream characters with a hex string for safe pasting.
So, as you read in the spec, xref streams are not encrypted. You therefore don't need to decrypt the stream itself, nor any strings in the xref stream dictionary. What you do need to take into account when decoding the stream are the /Filter and /DecodeParms entries.
Most of the time an xref stream uses a /FlateDecode filter together with predictor parameters that exploit the regular structure of an xref stream for better compression. Have a look at sections 7.4.4.1 and 7.4.4.4 of the PDF specification.
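To make that concrete, here is a minimal sketch of decoding such a stream, assuming /FlateDecode with /DecodeParms << /Predictor 12 /Columns n >> and that every row uses the PNG Up filter (type 2), which is what most writers emit. Note that .NET's DeflateStream expects raw deflate data, so the two-byte zlib header is skipped:

using System;
using System.IO;
using System.IO.Compression;

// raw: the bytes between `stream` and `endstream`.
// columns: the /Columns value from /DecodeParms.
static byte[] DecodeXrefStream(byte[] raw, int columns)
{
    byte[] inflated;
    using (var input = new MemoryStream(raw, 2, raw.Length - 2)) // skip zlib header
    using (var deflate = new DeflateStream(input, CompressionMode.Decompress))
    using (var output = new MemoryStream())
    {
        deflate.CopyTo(output);
        inflated = output.ToArray();
    }

    int rowLength = columns + 1;           // each row starts with a filter-type byte
    int rows = inflated.Length / rowLength;
    var result = new byte[rows * columns];
    var previous = new byte[columns];      // previous reconstructed row (all zeros at start)

    for (int r = 0; r < rows; r++)
    {
        if (inflated[r * rowLength] != 2)  // 2 = PNG "Up" filter
            throw new NotSupportedException("Only the Up filter is handled in this sketch.");
        for (int c = 0; c < columns; c++)
        {
            byte value = (byte)(inflated[r * rowLength + 1 + c] + previous[c]);
            result[r * columns + c] = value;
            previous[c] = value;
        }
    }
    return result;
}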
I have an API endpoint for uploading large files, streaming them directly to the database. I use ASP.NET Core's IFormFeature to do this, calling IFormFile.OpenReadStream() to get a Stream that I pass to SqlClient for streaming.
I want to enforce a maximum file size to avoid abuse. I know IFormFile has a Length property, but I assume that is based on Content-Length or similar and cannot be trusted (please correct me if I'm wrong, but AFAIK the only way to be 100% sure about the file size is to actually read the data; the client could send an incorrect Content-Length).
I must therefore ensure that when the stream is read, no more than IFormFile.Length bytes are read (ideally it should throw if it encounters additional bytes). I have not found a way to do this. Is this possible, or is there perhaps a better way to ensure the server doesn't read enormous amounts of data from clients sending incorrect Content-Length headers?
(It should go without saying that this must not entail reading the entire file into memory.)
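One way to get this behavior (a sketch of my own, not a framework API; the MaxLengthStream name is made up) is to wrap the upload in a pass-through stream that counts what is read and throws as soon as the limit is exceeded, then hand the wrapper to SqlClient:

using System;
using System.IO;

public sealed class MaxLengthStream : Stream
{
    private readonly Stream _inner;
    private readonly long _maxBytes;
    private long _read;

    public MaxLengthStream(Stream inner, long maxBytes)
    {
        _inner = inner;
        _maxBytes = maxBytes;
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        int n = _inner.Read(buffer, offset, count);
        _read += n;
        if (_read > _maxBytes)
            throw new InvalidDataException("Upload exceeds the declared length.");
        return n;
    }

    public override bool CanRead => true;
    public override bool CanSeek => false;
    public override bool CanWrite => false;
    public override long Length => throw new NotSupportedException();
    public override long Position { get => _read; set => throw new NotSupportedException(); }
    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();
}

// Usage: var limited = new MaxLengthStream(formFile.OpenReadStream(), formFile.Length);

Since the wrapper only counts bytes as they pass through, nothing is buffered in memory.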
Every CMIS document has:
a contentStream (for instance a video, as a binary)
a contentStreamFileName (for instance myvideo.ogv)
(well except CMIS documents that have a null content stream)
Paragraph 2.1.4.3.3 of the CMIS 1.1 specification says that contentStreamFileName is NOT updatable.
So, when a CMIS client wants to rename myvideo.ogv to cinematon.ogv, how should it proceed?
Is there anything more efficient than downloading and re-uploading the same binary under a different name?
The binary can be several gigabytes.
A generic CMIS client cannot rename a content stream without replacing the content (with the same content).
There is no unified definition among repositories of how the content stream filename is handled; that's why the property is read-only.
Some repositories allow changing the filename, but how to do it is repository-specific.
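For what it's worth, here is a sketch of that replace-with-the-same-content round trip using Apache Chemistry DotCMIS (session and document lookup assumed). Note that the bytes still travel through the client, so for a multi-gigabyte binary this is no cheaper than a download followed by an upload:

using DotCMIS.Client;
using DotCMIS.Data.Impl;

// doc is an IDocument obtained from the session, e.g. session.GetObjectByPath(...)
static void RenameContentStream(IDocument doc, string newName)
{
    var existing = doc.GetContentStream();       // streams the current binary
    var replacement = new ContentStream
    {
        FileName = newName,                      // e.g. "cinematon.ogv"
        MimeType = existing.MimeType,
        Length = existing.Length,
        Stream = existing.Stream,                // same content, passed through
    };
    doc.SetContentStream(replacement, true);     // overwrite = true
}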
I have been using the ZipPackage class in .NET for some time, and I really like its simple and intuitive API. When reading from an entry I call entry.GetStream() and read from that stream. When writing/updating an entry I call entry.GetStream(FileAccess.ReadWrite) and write to that stream. Very simple and useful, because I can hand the reading/writing over to some other code that doesn't need to know where the Stream originally comes from.
Now, since the ZipPackage API doesn't support entry properties such as LastModified, I have been looking into other zip APIs such as DotNetZip. But I'm a bit confused about how to use it. For instance, to read from an entry I first have to extract the entire entry into a MemoryStream, seek to the beginning, and hand this stream over to my other code. And to write to an entry I have to supply a stream that the ZipEntry itself can read from. This seems very backwards to me. Am I using this API the wrong way?
Isn't it possible for the ZipEntry to deliver the file straight from the disk where it is stored, decompressing it as the reader reads it? Does it really need to be fully extracted into memory first? I'm no expert, but it seems wrong to me.
Using the DotNetZip library does not require you to read the entire zip file into a memory stream. When you instantiate a ZipFile as shown below, the library only reads the zip file headers, which contain properties such as the last-modified time. From those headers it constructs a list of all entries in the zip:
using (Ionic.Zip.ZipFile zipFile = Ionic.Zip.ZipFile.Read(this.FileAbsolutePath))
{
...
}
It's up to you to then extract zip entries, either to a stream, to the file system, etc. In the example below, I'm using the string indexer on zipFile to get an entry named SomeFile.txt. The matching ZipEntry object is then extracted to a memory stream:
MemoryStream memStr = new MemoryStream();
zipFile["SomeFile.txt"].Extract(memStr); // or extract straight to another stream, e.g. Response.OutputStream
memStr.Position = 0; // rewind before handing the stream to a reader
Zip entries must be read into the .NET process space in order to be decompressed; there's no way to bypass that by going straight to the filesystem. This is similar to how a shell zip extractor works: the Windows shell extension for 7-Zip, or Windows' built-in Compressed Folders, has to read entries into memory and then write them to the file system before you can open an entry.
Okay, I'm answering this myself because I found the answers. There are apparently methods for both of the things I wanted in DotNetZip: for opening a read stream -> myZipEntry.OpenReader(), and for opening a write stream -> myZipFile.UpdateEntry(e, (fn, obj) => Serialize(obj)). This works fine.
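To spell that out, a small sketch of both calls (file and entry names are made up):

using System.IO;
using Ionic.Zip;

using (var zip = ZipFile.Read("archive.zip"))
{
    // Read an entry as a stream, decompressed on the fly:
    using (Stream reader = zip["SomeFile.txt"].OpenReader())
    using (var textReader = new StreamReader(reader))
    {
        string contents = textReader.ReadToEnd();
    }

    // Update an entry by letting DotNetZip pull the data from your code:
    zip.UpdateEntry("SomeFile.txt", (entryName, stream) =>
    {
        var writer = new StreamWriter(stream);
        writer.Write("new contents");
        writer.Flush(); // leave the underlying stream open for DotNetZip
    });
    zip.Save();
}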
I'm consuming web services that stream a PDF file to my iOS device. I use SOAP messages to interact with the web services, and NSXMLParser:foundCharacters() after the stream is complete; I want to get the content of the PDF file out of the streamed XML file created in the first step. The data I get is encoded in base64, and I have the methods to decode the content back. For a small file, the easiest approach is to read/collect all the content with NSXMLParser:foundCharacters from the streamed file and call the base64 decoding method once I have the whole data in parser:didEndElement.
(The above approach works fine; I tested it for this case and produced a correct PDF file out of it.)
Now my question is: what is the best approach (optimizing memory/speed) to read/decode/write in order to produce the final PDF from a big streamed file?
Is there any code available, or any thoughts on how to accomplish this in Objective-C?
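To illustrate the shape of the streaming approach (sketched in C# for concreteness; the same structure applies with an incremental base64 decoder in Objective-C): decode the data in chunks whose length is a multiple of 4 characters and append the decoded bytes to the output file as you go, so the whole document is never held in memory.

using System.IO;
using System.Security.Cryptography;

// payload.b64 is a hypothetical file holding just the base64 text
// pulled out of the streamed XML.
using (var encoded = File.OpenRead("payload.b64"))
using (var decoder = new CryptoStream(encoded, new FromBase64Transform(), CryptoStreamMode.Read))
using (var pdf = File.Create("out.pdf"))
{
    decoder.CopyTo(pdf); // decodes incrementally; memory use stays constant
}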