Explain Bitstream's Importance in DSpace - dataformat

Can someone explain to me, in a brief and concise way:
What is a bitstream?
What is a bitstream format?
Why are bitstreams and bitstream formats important?
What is Bitstream Format Conversion?
Why do we have to implement Bitstream Format Conversion?
Thanks.

See https://wiki.duraspace.org/display/DSDOC4x/Functional+Overview#FunctionalOverview-Supportedfiletypes
DSpace can accommodate any type of uploaded file. While DSpace is most known for hosting text based
materials including scholarly communication and electronic theses and dissertations (ETDs), there are many
stakeholders in the community who use DSpace for multimedia, data and learning objects. While some
restrictions apply, DSpace can even serve as a store for HTML Archives.
Files that have been uploaded to DSpace are often referred to as "Bitstreams". The reason for this is mainly
historic and tracks back to the technical implementation. After ingestion, files in DSpace are stored on the file
system as a stream of bits without the file extension.
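To make the bitstream / bitstream format distinction concrete, here is a minimal sketch of the idea, not DSpace's actual code. The `BitstreamRecord` fields, the `ingest` function and the choice of a SHA-256 digest as the internal ID are all assumptions made for illustration. The point is only that the bytes are stored under an opaque name with no extension, so the format has to be recorded separately as metadata.

```python
import hashlib
import mimetypes
from dataclasses import dataclass
from pathlib import Path


@dataclass
class BitstreamRecord:
    """Metadata kept next to the raw bytes (hypothetical schema, not DSpace's)."""
    internal_id: str    # name of the stored file; carries no extension
    original_name: str  # file name the uploader used
    format_mime: str    # the "bitstream format", recorded separately from the bits
    checksum: str       # lets the store check later that the bits are unchanged


def ingest(source: Path, assetstore: Path) -> BitstreamRecord:
    """Copy a file into the store as an extension-less stream of bits."""
    data = source.read_bytes()
    digest = hashlib.sha256(data).hexdigest()
    # Guess the format from the original name; fall back to "unknown binary".
    guessed, _encoding = mimetypes.guess_type(source.name)
    (assetstore / digest).write_bytes(data)
    return BitstreamRecord(
        internal_id=digest,
        original_name=source.name,
        format_mime=guessed or "application/octet-stream",
        checksum=digest,
    )
```

Because the stored file no longer says what it is, the recorded format is what tells the repository how to serve the file today and which files would need converting if that format ever became obsolete.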

Related

How to create a PDF/E file with Video for long-term archiving system?

I am trying to create a PDF/A file for a long-term archiving system. My PDF file contains a video. According to Wikipedia, PDF/A doesn't accept video content, PDF/X is for graphics exchange, and PDF/E is for dynamic content like videos. I am trying to create a PDF/E file with Acrobat Pro, but at the end I get a failure message saying that Acrobat cannot create the PDF/E file. Acrobat Pro's Preflight tool fails in the same way during the conversion. Is there any other way to create a PDF/E file with a video inside, or is it even possible to create a long-term archival PDF with a video inside?
The short answer is no: there is currently no archival standard for PDFs with embedded video. However, there is little evidence that standard PDF is going to become obsolete any time soon, so I think the best strategy may be simply to wait for the standards to catch up.
If you're seeking a little more reassurance, you could check if other PDF renderers support your files. If one or more of the open source implementations have no problem with it, then there's very little reason to suppose the content is at risk.
If you really want to protect against problems with PDF, you could diversify somewhat and make an HTML+MP4 version of the same content. That would ensure that even if the PDF strategy fails, there's something else to fall back on. Having some or all of the content of one item available in alternate formats also makes any future format conversions easier to validate.

How to convert InDesign IDML to Tiff?

I have a requirement to take idml files provided by a client, twiddle them a bit to fill in some placeholders and generate a TIFF file. This needs to happen automatically and I have InDesign Server at my disposal.
I have the first part down. I have also found how to connect to InDesign Server via SOAP and convert IDML files to hi-res PDF or low-res JPG (which implies a few other options).
I am at a bit of a loss as to how to take it the rest of the way to generate a TIFF file; the Adobe forums have not been much help. It is my impression that this sort of thing is exactly why the IDML format was introduced, so I'm assuming there's decent support out there for it. The best I've been able to come up with so far is to go from IDML to PDF (or SVG) via InDesign Server, then to PNG via the Inkscape command line, then to TIFF via System.Drawing, but that seems horribly contrived and fault-prone (and I have no idea how I'm going to handle multiple pages).
Any ideas?
I don't believe there is a way to export to TIFF via InDesign Server. However, I did find this post on the Adobe Forums that suggests using Photoshop to render the TIFF after exporting a PDF from IDS. Maybe that would be an option? Otherwise, maybe you could use one of the formats that you CAN export to (e.g. JPG, PDF, EPS).
Hope this helps!
For reference, I ended up using Ghostscript to achieve the results.
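The exact Ghostscript invocation was not posted, so the following is only a sketch of one way the PDF-to-TIFF step could look. The function names, the `page-%03d.tif` naming scheme, the 300 dpi default and the choice of the `tiff24nc` (24-bit RGB) output device are assumptions. It also assumes the executable is called `gs`, as on Linux and macOS; on Windows it is typically `gswin64c`.

```python
import subprocess
from pathlib import Path
from typing import List


def ghostscript_tiff_command(pdf: Path, out_dir: Path, dpi: int = 300) -> List[str]:
    """Build a Ghostscript command that renders every page of a PDF to TIFF."""
    return [
        "gs",
        "-dSAFER",             # restrict file access while interpreting the PDF
        "-dBATCH",             # exit when the input has been processed
        "-dNOPAUSE",           # do not wait for a keypress between pages
        "-sDEVICE=tiff24nc",   # 24-bit RGB TIFF output device
        f"-r{dpi}",            # rendering resolution in dots per inch
        # %03d in the output name makes Ghostscript write one file per page.
        f"-sOutputFile={out_dir / 'page-%03d.tif'}",
        str(pdf),
    ]


def pdf_to_tiff(pdf: Path, out_dir: Path, dpi: int = 300) -> List[Path]:
    """Run Ghostscript and return the TIFF files it produced, in page order."""
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(ghostscript_tiff_command(pdf, out_dir, dpi), check=True)
    return sorted(out_dir.glob("page-*.tif"))
```

The `%03d` placeholder is what takes care of multi-page documents: each page of the PDF exported from InDesign Server ends up in its own numbered TIFF.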

HowTo extract embedded OCR data from a PDF?

I have PDF files with embedded OCR data (I have already OCR'd them), so they are searchable. Now I want to extract this OCR data because I want to put it in my tomcat6 search server. For this, I need the plain OCR data.
So my question is: is it possible to extract this embedded OCR data from the PDF files?
It would be nice to get files with coordinates. But it would also be sufficient to get plaintext files.
You should be able to do this with iText or iTextSharp. iTextSharp has essentially no documentation, however, and a good number of its functions are not equivalent to those found in iText.
PDFSharp does not support iref streams. Those are pretty much the only comprehensive open-source solutions. If you do not mind paying, vista solutions may have something for you; they mostly handle workflow, but they have some pretty extensive PDF libraries as well.

Haskell: parsing PDF

What I need is to read a PDF, make some transformations (generate TOC bookmarks), and write it back.
I found this http://hackage.haskell.org/package/HPDF , but it only mentions generating pdf, not the parsing (although I could have missed it)
Haskell is chosen purely for (self)educational purposes.
There are a few tools for PDF manipulation, though they seem to bias towards generation, rather than parsing:
http://johnmacfarlane.net/pandoc/
Pandoc is a great cross-markup library, but doesn't support PDF parsing (it does support PDF generation from a variety of formats).
There's also:
http://hackage.haskell.org/package/HsHaruPDF
http://hackage.haskell.org/package/pdf2line -- tool for extracting text from pdf
http://hackage.haskell.org/package/HPDF -- another pdf generation library
I'm not sure we have a good parsing tool yet.
Also as a learning exercise, I started a PDF parsing library in Haskell, but it's incomplete and has been languishing a bit from lack of attention. I'd be happy to share it with you, and would love feedback, improvements, etc. It's not currently hosted on hackage, but if you're interested in working with an incomplete implementation, let me know and I'll ask some colleagues for advice on getting it up there.
Here's a haskell binding to parts of xpdf:
http://hackage.haskell.org/package/pdf2line
Check out the pdf-toolbox library. Its support for generating PDF files is low-level, but powerful enough for your task.
Here is an example of how to change the title of an existing PDF file using the incremental update feature.
Another package to consider is rakhana which is also on hackage.

Are there any services out there that will let me convert an URL to PDF and let the user download the result?

I have a lot of different sites written in PHP (Drupal), and more and more often clients ask me to create PDFs of various lists, product descriptions and so on. I've been using dompdf and other PDF libraries, but they are a pain to use and have very limited functionality.
Are there any services out there that'll let me generate a PDF file from a URL and let the user download the result? That would definitely save my day :)
Best regards,
Thomas
If you are trying to convert HTML to PDF, then there are a couple of services out there that can do that for you (search), but off the top of my head, a2ps does a pretty OK job. The basic idea is that if you can generate PostScript from your source, then creating a PDF is not an issue.
If you are looking for a more full-featured library, then iText can do it (it is Java, though, and not free for commercial use).