JPEG 2000 JP2: how do I discover/access the containers in the file?

I'm especially interested to find out if there's a common way to access any "multiple-size" (e.g. 2k/4k) image data which might exist.
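For orientation, the JP2 container is a sequence of length-prefixed "boxes", some of which (superboxes such as 'jp2h') contain further boxes. Below is a minimal, hedged sketch of walking that structure, based on the box layout in ISO/IEC 15444-1; the file name is a placeholder. Note that 2k/4k-style multi-resolution access in JPEG 2000 normally comes from decoding fewer wavelet resolution levels of the 'jp2c' codestream, not from separately stored images.

```python
# Walk the box ("container") structure of a JP2 file.
# Each box starts with a 4-byte big-endian length and a 4-byte type;
# a length of 1 means a 64-bit extended length follows, and 0 means
# the box runs to the end of the file.
import struct

SUPERBOXES = {b"jp2h", b"res ", b"uinf"}  # superboxes whose payload is more boxes

def walk_boxes(data, start, end, depth=0):
    pos = start
    while pos + 8 <= end:
        lbox, tbox = struct.unpack(">I4s", data[pos:pos + 8])
        header = 8
        if lbox == 1:                      # 64-bit extended length follows
            lbox = struct.unpack(">Q", data[pos + 8:pos + 16])[0]
            header = 16
        elif lbox == 0:                    # box extends to the end of the file
            lbox = end - pos
        if lbox < header:                  # corrupt box; stop rather than loop
            break
        print("  " * depth + f"{tbox.decode('latin-1')}  ({lbox} bytes)")
        if tbox in SUPERBOXES:
            walk_boxes(data, pos + header, pos + lbox, depth + 1)
        pos += lbox

with open("image.jp2", "rb") as f:         # placeholder file name
    data = f.read()
walk_boxes(data, 0, len(data))
```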

Related

Inserting an entire PDF into another by raw text manipulation

I need to include a PDF in another PDF that is being created by text manipulation, not through a package. (In particular, I'm using LiveCode, which is well suited to generating the information I need and can easily do text manipulation.)
Once included, I will be adding additional objects (primarily text, but also a few small squares).
I only need to be able to access the included PDF by page and area, such as (200,200) to (400,400) of page 5; I don't need any access to its objects.
Simply appending to the PDF won't do the job, as I'll actually be including multiple source PDFs in a single output PDF along with my additions.
I would like to simply make the original PDF an indirect object in the output PDF, and then refer to and use it. In particular, I would like to avoid having to "disassemble" the source PDF into components to build a new cross-reference table.
Can this be done? Or do I need to make new absolute references for each object in every dictionary, and for every reference to them? (I only need to be able to refer to regions and pages, not the actual objects.)
Something that could be used on a one-time basis to convert an entire multi-page PDF would also be a usable (but inferior) solution.
I've found that search engines aren't yielding usable results, as they are swamped with solutions for individual products rather than the PDF format itself.
First of all, PDFs in general are not text data; they are binary. They may look textual because they contain identifiers built from the ASCII values of words, but treating them as text, unless you and your tools are extremely cautious, is a sure way to damage them.
But even assuming such caution, unless your input PDFs are internally of a very simple and similar structure, creating code that can merge them and manipulate their content is, complexity-wise, essentially akin to creating a generic PDF library/package.
I would like to simply make the original PDF an indirect object in the output PDF, and then refer to and use it.
Putting each of them into one indirect object would work if you needed them merely as unchanged attachments. But you want to change them.
In particular, I would like to avoid having to "disassemble" the source PDF into components to build a new cross-reference table.
You will at least have to parse ("disassemble") the objects related to the pages you want to manipulate, add the manipulated versions of them, and add cross references for the changed objects.
And you only mention cross reference tables. Don't forget that for a general solution you also have to be able to handle cross reference streams and object streams.
Or do I need to make new absolute references for each object in every dictionary, and for every reference to them? (I only need to be able to refer to regions and pages, not the actual objects.)
If you really want to merge the source PDFs into a target one, you'll indeed need to renumber the objects from most of the source PDFs.
If a portable collection (aka portfolio) of the source PDFs would suffice as the target, you might not need to do that. In that case you merely have to apply the changes you want to each source PDF (by means of incremental updates, if you prefer), and then combine all those manipulated sources into a result portfolio.
I've found that search engines aren't yielding usable results
The cause most likely is that you underestimate the complexity of the PDF format. Combining and manipulating arbitrary existing PDFs usually requires you to use a third-party library or to create the equivalent of such a library yourself.
Merely manipulating existing PDFs is a bit easier, and so is combining PDFs into a portfolio. Nonetheless, even in those cases you should study the PDF specification quite a bit.
Restricting yourself to string manipulation makes the task much more complex - I'd say impossible for generic PDFs, and daring even for PDFs of simple and similar construction.
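To give a sense of what the third-party-library route looks like, here is a minimal, hedged sketch using the pypdf package; file names are placeholders, and pypdf does the object renumbering and cross-reference rebuilding discussed above for you.

```python
# Merge several source PDFs and stamp extra content onto one page.
# Requires: pip install pypdf
from pypdf import PdfReader, PdfWriter

writer = PdfWriter()
for name in ("source1.pdf", "source2.pdf"):   # placeholder inputs
    writer.append(name)                       # copies and renumbers all objects

# Stamp an overlay (your added text/squares, prepared as a one-page PDF)
# onto page 5 of the combined document; assumes it has at least 5 pages.
overlay = PdfReader("overlay.pdf").pages[0]   # placeholder overlay file
writer.pages[4].merge_page(overlay)

with open("combined.pdf", "wb") as f:
    writer.write(f)
```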

Who uses the `UserUnit` property in PDF?

I am working on a PDF library which uses the UserUnit property in some computations. I would like to test the library on real data but I cannot find any real-world PDF documents with non-standard UserUnit. What are the use cases for this property?
You only need UserUnit if you want to specify a MediaBox in excess of (IIRC) 14,400 points (200 inches) in either dimension.
Note that this is an Acrobat limitation; it can't handle MediaBox values larger than that, though other applications can. If you want a MediaBox exceeding that limit and you expect your file to be viewable in Acrobat, you need to set UserUnit.
I do have real-world files which set UserUnit (very, very few), but I cannot share them as they are customer files. I'm told that these are used for architectural plans; apparently some regulatory bodies require all the plans to be on a single 'sheet', and the only way to do that with (for example) a multi-story building is to have a very large 'sheet'.
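To illustrate the arithmetic: UserUnit scales the default user space unit of 1/72 inch, so a 2880 x 2160 point MediaBox with UserUnit = 10 describes a 400 x 300 inch sheet while staying under the 14,400-point limit. Here is a minimal, hedged sketch using the third-party pypdf package; the output file name is a placeholder.

```python
# Create a page whose effective size is 400 x 300 inches via UserUnit.
# Requires: pip install pypdf
from pypdf import PdfWriter
from pypdf.generic import NameObject, NumberObject

writer = PdfWriter()
# Width/height are in default user space units (points):
# 2880 * 10 / 72 = 400 inches, 2160 * 10 / 72 = 300 inches.
page = writer.add_blank_page(width=2880, height=2160)
page[NameObject("/UserUnit")] = NumberObject(10)   # 1 unit = 10/72 inch

with open("big_sheet.pdf", "wb") as f:
    writer.write(f)
```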

"Data Repository" software solution

I am trying to find a software solution that will allow our group to easily upload datasets (scriptable and/or through some UI), tag those datasets, retrieve them, control access to them, search the tags, and search the file names/attributes/metadata (e.g. file creation date). The datasets can be anything: CSV files, image (binary) datasets, texts, server logs, folders within folders of images, zip files of CSV data. We will need to store GBs to potentially PBs of data, and a single file can range from a few KB to hundreds of GB. We also need a usable API to retrieve these datasets programmatically.
We just want a centralized place to find information, and we want to be able to answer a question such as "Hey, do you know if we have any lightning strike datasets?" If there is a file/folder/zip file tagged with "lightning", a search should pull back that dataset.
A possible solution would be something like Dataverse, DSpace, Fedora Commons, or CKAN. However, those seem to be really geared towards academia, publications, or small datasets. On top of that, they remove any complex folder structure that might exist (e.g. Folder1 --> subFolder1 --> subFolder2). I also question the scalability of having 10 million 100 KB files within one of these systems.
A filesystem share would allow us to simply store whatever we want, but I don't know of a reasonable way to enable tagging of the data.
It is almost as if I am looking for a combination of the two. Does someone know of a tool, preferably open source, that would be able to do something like this?
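For illustration, here is a minimal sketch of the "filesystem share plus tag index" combination: files stay on the share in whatever folder structure exists, and a small SQLite database maps tags to paths. The table, tag, and path names are placeholders.

```python
# A tiny tag index over a plain file share.
import sqlite3

db = sqlite3.connect("tag_index.db")
db.execute("CREATE TABLE IF NOT EXISTS tags (tag TEXT, path TEXT)")
db.execute("CREATE INDEX IF NOT EXISTS idx_tag ON tags (tag)")

# Tag a dataset (a file, folder, or zip) that lives on the share.
db.execute("INSERT INTO tags VALUES (?, ?)",
           ("lightning", "//share/weather/2014/strikes.zip"))
db.commit()

# "Do we have any lightning strike datasets?"
for (path,) in db.execute("SELECT path FROM tags WHERE tag = ?", ("lightning",)):
    print(path)
```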
From what you have described so far, DSpace does seem to be a good fit.
With the following examples I want to address the concerns you raised:
Scalability
Here's an example of a multi-terabyte item:
https://ore.exeter.ac.uk/repository/handle/10871/14881
Complex structure
Dryad is based on DSpace and uses a more complex data model, with data files, data packages and the original publication each being represented as separate objects:
http://datadryad.org/resource/doi:10.5061/dryad.322vn
If that's what you want, you can also start your project off the Dryad codebase, since this one is open source as well:
https://github.com/datadryad/dryad-repo

Provide an example of why it is not advisable to store images in Core Data?

This question has been asked many times, and I have read many users saying that it is not advisable to store images in a DB, in particular within Core Data. But they all seem to omit the reason why. Even the Apple documentation states this, everybody points in that direction, and every discussion ends the same way: "well, you can, but storing the path is better".
Opinions aside, I would like a concrete example of why it is not a good solution.
Let me explain better: I have a strong background in building web applications. A concrete example from that point of view would be: do not store images in a DB, but rather the paths to them, because the web server can serve the files and apply all of its caching machinery.
But in a desktop environment, and especially in an iOS application, what are the downsides of storing images in Core Data using SQLite, provided that:
there is a separate entity holding the images; they are not an attribute of the main entity.
There also seems to be a limit of 100 KB for images. Why? What happens with images of 110, 120, ... 200 KB, etc.?
Thanks!
There's nothing special about what Core Data normally does here. It's just using an SQLite database. You can put large blobs of data into it, but it just doesn't scale all that well. You can read more about it here: Internal Versus External BLOBs in SQLite.
That said, Core Data has support for external blobs, which in Core Data terminology is called "stored in external record file" (iOS 5.0 and later). Again, there's nothing magic about it: it just stores the large pieces of data in the file system, separately from the SQLite DB itself. The benefit is that Core Data manages all of this for you.
When you're in Xcode, there'll be a checkbox called Allows External Storage that you can check for Binary Data properties.
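To make the trade-off concrete outside of Core Data, here is a minimal, hedged sketch in plain SQLite contrasting the two strategies; the schema and file names are illustrative only.

```python
# Internal blob vs. path-only storage in SQLite.
import sqlite3

db = sqlite3.connect("images.db")
db.execute("CREATE TABLE IF NOT EXISTS images "
           "(id INTEGER PRIMARY KEY, data BLOB, path TEXT)")

with open("photo.jpg", "rb") as f:    # placeholder image file
    blob = f.read()

# Internal: the bytes live inside the database file. Fine for small
# images, but large rows bloat the file and slow table scans.
db.execute("INSERT INTO images (data) VALUES (?)", (blob,))

# External: only the path is stored; the filesystem serves the bytes.
# This is essentially what "Allows External Storage" automates.
db.execute("INSERT INTO images (path) VALUES (?)", ("photo.jpg",))
db.commit()
```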
The filesystem and the APIs surrounding it are (just like a web server) optimized to serve files of any size and to apply caching where appropriate.
Core Data is optimized for handling an object graph with tiny pieces of data, like integers and short strings.
Also, there are a number of other issues that tend to creep up on you, like periodically vacuuming the SQLite database Core Data uses; otherwise it can only grow, never shrink.
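For reference, vacuuming is a single SQLite statement; a minimal sketch follows (the store path is a placeholder, and with Core Data you would normally trigger this through the framework's store options rather than by opening the file yourself).

```python
import sqlite3

db = sqlite3.connect("Store.sqlite")  # placeholder path to the SQLite store
db.execute("VACUUM")  # rebuilds the file so freed blob space is returned to the OS
db.close()
```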
Leonardo,
With Lion/iOS 5, Core Data started handling file system storage of large BLOBs for you.
The choice is really determined by how many images you are going to have open. If you have many, then you should keep them in the DB. Why? Because you only have a modest number of file descriptors, one of which is used for each open image stored in the file system.
That said, there is still a reason to manage the files yourself. If your BLOBs are really big, say 2+ MB, you will want to map them into memory and not just read them in; see the sketch after this answer. (When memory warnings come, this lets the OS automatically purge them from your resident memory. This is a very good thing.) Even so, you still have the limited-number-of-file-descriptors problem.
Andrew
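To illustrate the memory-mapping idea in a platform-neutral way, here is a minimal sketch using Python's mmap module; on iOS the equivalent would be mmap(2) or NSData's mapped-read options. The file name is a placeholder. Mapping lets the OS page the data in and out instead of keeping a full copy resident in your process.

```python
import mmap

with open("big_image.dat", "rb") as f:                 # placeholder blob file
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        header = m[:16]       # only the pages you touch become resident
        print(len(m), header)
```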

Embed a serial number in a PDF file?

To prevent the casual distribution of a PDF document, is there any way to do so, such as embedding a serial number in the file?
My idea is to embed an ID bound to the user, making it possible to find out who distributed the file.
I know this doesn't prevent distribution, but it may discourage casual distribution to a certain degree.
Any other solutions are also welcome.
Thanks!
The common way is placing metadata, but that can easily be removed.
So let's look for hideouts (most of them low-level):
Non-marking text
Text under overlapping objects
Objects from older revisions (not noticed by the reader, but still present, carrying redundant information)
Marks in streams between BX and EX operators (weird information from the reader's point of view)
Information before the %PDF-x.y header
Information after %%EOF
Substituted names for some elements (like font names)
Steganography
Manipulation of the fonts used
Whitespace patterns
Images with steganography
My favorites are steganography and a BX-EX block within a stream; with proper compression and/or encryption it is hard to find (if you don't know where it is). To make searching harder, wrap some normal blocks in BX-EX as well.
Some of these hideouts are easy to remove, some harder, but a determined attacker will be able to find and sanitize them all. Think about copy-pasting the text or printing through a PDF printer.
You can render transparent text. You can write text outside the media box of a page. You can add a custom document property. There are plenty of ways to do this.
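As one example of the custom-property route, here is a minimal, hedged sketch using the third-party pypdf package; the file names and the property name are placeholders, and anything stored this way is trivial for a determined user to strip.

```python
# Stamp a per-user serial number into the document Info dictionary.
# Requires: pip install pypdf
from pypdf import PdfWriter

writer = PdfWriter()
writer.append("original.pdf")                              # placeholder input
writer.add_metadata({"/SerialNumber": "USER-0042-7f3a"})   # placeholder custom property

with open("tagged.pdf", "wb") as f:
    writer.write(f)
```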
Why not add a digital ID to the documents?