Issues decompressing (FlateDecode) audio data in XFDF from Adobe Acrobat - pdf

I'm trying to decompress AIFF audio data from an XFDF sound annotation, and play the audio in the browser. Any time I export a sound annotation from Adobe Acrobat and decompress the audio data it results in seemingly garbage data.
The code below works for compressed audio starting with 789C header, but a few compressed streams start with 4889 or 78DA and fail to play in the browser. After numerous tests with Acrobat, exported sound annotations in XFDF start with 4889.
// Two audio examples, one working and one failing.
// Can't paste it all due to char limit.
const compressedHexString_working = `789C3C9C059C13D7FBF5473293F55DBCB8BBBBBB53DC8ABB142FEE5AB4 ...`;
const compressedHexString_failing = `48892C97077C8FE716C7CFF9532EB7418D5E24768492DABB0862146DAF ...`;
// Convert hex string into ArrayBuffer.
const compressedByteArray = new Uint8Array(
compressedHexString_working
.match(/[a-fA-F0-9]{2}/g)
.map((str: string) => parseInt(str, 16))
);
const { default: zlib } = await import('pako');
const audioData = await zlib.inflate(compressedByteArray);
// Parse header bytes to determine audio file.
const fileHeader = new TextDecoder().decode(audioData).substring(0, 4);
console.log('Sound File Header:', fileHeader);
The console.log usually prints ID3, RIFF, or FORM when decompression is successful. But when decompressing the failing example, the result is a garbage value: �D�C.
Am I doing the decompression incorrectly? Or am I missing a step?

Related

Google Script - Read PDF file, recognised as text/html

Do you have any idea how to read PDF files, which mimetype is text/html?
I have tried the snippet below, but OCR doesn't work, resulting in this issue "API call to drive.files.insert failed with error: OCR is not supported for files of type text/html"
function extractTextFromPDF(pdfID) {
// PDF File URL
// You can also pull PDFs from Google Drive
var url = "https://drive.google.com/file/d/"+pdfID
var blob = UrlFetchApp.fetch(url).getBlob();
var resource = {
title: blob.getName(),
mimeType: blob.getContentType(),
};
// Enable the Advanced Drive API Service
var file = Drive.Files.insert(resource, blob, { ocr: true, ocrLanguage: 'en' });
// Extract Text from PDF file
var doc = DocumentApp.openById(file.id);
var text = doc.getBody().getText();
return text;
}
Also, I have tried to convert files to any other format like .csv .css or text, but when did it the text is horrible, long HTML, with content encrypted I think. I considered splitting data from extracted HTML, but unfortunately, content is not there or is encrypted somehow.
What I want to do is to print the text from this wired pdf, so I can later write it to Google Sheets. Do you have any idea how I can read this file?
File
I am attaching a pdf here, so you can see what I am fighting with.
https://drive.google.com/file/d/1HXQk6PU9hzBb26EwoFQ0840W6ZihDUIX/view?usp=sharing
I used your sample file, see how I did it below:
function myFunction() {
var pdfFile = DriveApp.getFilesByName("222-1522118.pdf").next();
var blob = pdfFile.getBlob();
// Get the text from pdf
var filetext = pdfToText( blob, {keepTextfile: false} );
console.log(filetext)
}
Output:
I used Mogsdad's library pdfToText
Reference: Get text from PDF in Google

Adobe PDF Embed API Save Content To Base64

Using Adobe PDF Embed API, you can register a callback:
this.adobeDCView = new window.AdobeDC.View(config);
this.adobeDCView.registerCallback(
window.AdobeDC.View.Enum.CallbackType.SAVE_API, (metaData, content, options) => {
})
Content is according to the docs here: https://www.adobe.io/apis/documentcloud/dcsdk/docs.html?view=view
content: The ArrayBuffer of file content
When I debug this content using chrome inspector, it shows me that content is a Int8Array.
Normally when we upload a pdf file, the user selects a file and we read as dataURI and get base64 and push that to AWS. So I need to convert this PDF's data (Int8Array) to Base64, so I can also push it to AWS.
Everything I have found online uses UInt8Array to base64, and I don't understand how to go from Int8Array to UInt8Array. I would think you can just add 128 to the signed int to get a ratio between 0-256, but this doesn't seem to work.
I have tried using this:
let decoder = new TextDecoder('utf8');
let b64 = btoa(decoder.decode(content));
console.log(b64);
But I get this error:
ERROR DOMException: Failed to execute 'btoa' on 'Window': The string to be encoded contains characters outside of the Latin1 range.
Please help me figure out how to go from Int8Array to Base64.
I use the function in this answer.
For Embed API, use the "content" parameter from the save callback as the input to the function.
You can see a working example at this CodePen. The functional part is below.
adobeDCView.registerCallback(
AdobeDC.View.Enum.CallbackType.SAVE_API,
function (metaData, content, options) {
/* Add your custom save implementation here...and based on that resolve or reject response in given format */
var base64PDF = arrayBufferToBase64(content);
var fileURL = "data:application/pdf;base64," + base64PDF;
$("#submitButton").attr("href", fileURL);
/* End save code */
return new Promise((resolve, reject) => {
resolve({
code: AdobeDC.View.Enum.ApiResponseCode.SUCCESS,
data: {
/* Updated file metadata after successful save operation */
metaData: { fileName: urlToPDF.split("/").slice(-1)[0] }
}
});
});
},
saveOptions
);

WebRTC video/audio streams out of sync (MediaStream -> MediaRecorder -> MediaSource -> Video Element)

I am taking a MediaStream and merging two separate tracks (video and audio) using a canvas and the WebAudio API. The MediaStream itself does not seem to fall out of sync, but after reading it into a MediaRecorder and buffering it into a video element the audio will always seem to play much earlier than the video Here's the code that seems to have the issue:
let stream = new MediaStream();
// Get the mixed sources drawn to the canvas
this.canvas.captureStream().getVideoTracks().forEach(track => {
stream.addTrack(track);
});
// Add mixed audio tracks to the stream
// https://stackoverflow.com/questions/42138545/webrtc-mix-local-and-remote-audio-steams-and-record
this.audioMixer.dest.stream.getAudioTracks().forEach(track => {
stream.addTrack(track);
});
// stream = stream;
let mediaRecorder = new MediaRecorder(stream, { mimeType: 'video/webm;codecs=opus,vp8' });
let mediaSource = new MediaSource();
let video = document.createElement('video');
video.src = URL.createObjectURL(mediaSource);
document.body.appendChild(video);
video.controls = true;
video.autoplay = true;
// Source open
mediaSource.onsourceopen = () => {
let sourceBuffer = mediaSource.addSourceBuffer(mediaRecorder.mimeType);
mediaRecorder.ondataavailable = (event) => {
if (event.data.size > 0) {
const reader = new FileReader();
reader.readAsArrayBuffer(event.data);
reader.onloadend = () => {
sourceBuffer.appendBuffer(reader.result);
console.log(mediaSource.sourceBuffers);
console.log(event.data);
}
}
}
mediaRecorder.start(1000);
}
AudioMixer.js
export default class AudioMixer {
constructor() {
// Initialize an audio context
this.audioContext = new AudioContext();
// Destination outputs one track of mixed audio
this.dest = this.audioContext.createMediaStreamDestination();
// Array of current streams in mixer
this.sources = [];
}
// Add an audio stream to the mixer
addStream(id, stream) {
// Get the audio tracks from the stream and add them to the mixer
let sources = stream.getAudioTracks().map(track => this.audioContext.createMediaStreamSource(new MediaStream([track])));
sources.forEach(source => {
// Add it to the current sources being mixed
this.sources.push(source);
source.connect(this.dest);
// Connect to analyser to update volume slider
let analyser = this.audioContext.createAnalyser();
source.connect(analyser);
...
});
}
// Remove all current sources from the mixer
flushAll() {
this.sources.forEach(source => {
source.disconnect(this.dest);
});
this.sources = [];
}
// Clean up the audio context for the mixer
cleanup() {
this.audioContext.close();
}
}
I assume it has to do with how the data is pushed into the MediaSource buffer but I'm not sure. What am I doing that de-syncs the stream?
A late reply to an old post, but it might help someone ...
I had exactly the same problem: I have a video stream, which should be supplemented by an audio stream. In the audio stream short sounds (AudioBuffer) are played from time to time. The whole thing is recorded via MediaRecorder.
Everything works fine on Chrome. But on Chrome for Android, all sounds were played back in quick succession. The "when" parameter for "play()" was ignored on Android. (audiocontext.currentTime continued to increase over time ... - that was not the point).
My solution is similar to Jacob's comment Sep 2 '18 at 7:41:
I created and connected a sine wave oscillator with inaudible 48,000 Hz, which played permanently in the audio stream during recording. Apparently this leads to the proper time progress.
An RTP endpoint that is emitting multiple related RTP streams that
require synchronization at the other endpoint(s) MUST use the same
RTCP CNAME for all streams that are to be synchronized. This
requires a short-term persistent RTCP CNAME that is common across
several RTP streams, and potentially across several related RTP
sessions. A common example of such use occurs when lip-syncing audio
and video streams in a multimedia session, where a single participant
has to use the same RTCP CNAME for its audio RTP session and for its
video RTP session. Another example might be to synchronize the
layers of a layered audio codec, where the same RTCP CNAME has to be
used for each layer.
https://datatracker.ietf.org/doc/html/rfc6222#page-2
There is a bug in Chrome, that plays buffered media stream audio with 44100KHz, even when it's encoded with 48000 (which leads to gaps and video desync). All other browsers seem to play it fine. You can choose to change codec to the one which supports 44.1KHz encoding or play a file from web link as a source (this way Chrome can play it correctly)

What is a blob URL and why it is used?

I am having trouble with blob URLs.
I was searching for src of a video tag on YouTube and I found that the video src was like:
src="blob:https://video_url"
I opened the blob URL that was in src of the video, but it gave an error. I can't open the link, but it was working with the src tag. How is this possible?
I have a few questions:
What is a blob URL?
Why it is used?
Can I make my own blob URL on a server?
Any additional details about blob URLs would be helpful as well.
Blob URLs (ref W3C, official name) or Object-URLs (ref. MDN and method name) are used with a Blob or a File object.
src="blob:https://crap.crap" I opened the blob url that was in src of
video it gave a error and i can't open but was working with the src
tag how it is possible?
Blob URLs can only be generated internally by the browser. URL.createObjectURL() will create a special reference to the Blob or File object which later can be released using URL.revokeObjectURL(). These URLs can only be used locally in the single instance of the browser and in the same session (ie. the life of the page/document).
What is blob url?
Why it is used?
Blob URL/Object URL is a pseudo protocol to allow Blob and File objects to be used as URL source for things like images, download links for binary data and so forth.
For example, you can not hand an Image object raw byte-data as it would not know what to do with it. It requires for example images (which are binary data) to be loaded via URLs. This applies to anything that require an URL as source. Instead of uploading the binary data, then serve it back via an URL it is better to use an extra local step to be able to access the data directly without going via a server.
It is also a better alternative to Data-URI which are strings encoded as Base-64. The problem with Data-URI is that each char takes two bytes in JavaScript. On top of that a 33% is added due to the Base-64 encoding. Blobs are pure binary byte-arrays which does not have any significant overhead as Data-URI does, which makes them faster and smaller to handle.
Can i make my own blob url on a server?
No, Blob URLs/Object URLs can only be made internally in the browser. You can make Blobs and get File object via the File Reader API, although BLOB just means Binary Large OBject and is stored as byte-arrays. A client can request the data to be sent as either ArrayBuffer or as a Blob. The server should send the data as pure binary data. Databases often uses Blob to describe binary objects as well, and in essence we are talking basically about byte-arrays.
if you have then Additional detail
You need to encapsulate the binary data as a BLOB object, then use URL.createObjectURL() to generate a local URL for it:
var blob = new Blob([arrayBufferWithPNG], {type: "image/png"}),
url = URL.createObjectURL(blob),
img = new Image();
img.onload = function() {
URL.revokeObjectURL(this.src); // clean-up memory
document.body.appendChild(this); // add image to DOM
}
img.src = url; // can now "stream" the bytes
This Javascript function supports to show the difference between the Blob File API and the Data API to download a JSON file in the client browser:
/**
* Save a text as file using HTML <a> temporary element and Blob
* #author Loreto Parisi
*/
var saveAsFile = function(fileName, fileContents) {
if (typeof(Blob) != 'undefined') { // Alternative 1: using Blob
var textFileAsBlob = new Blob([fileContents], {type: 'text/plain'});
var downloadLink = document.createElement("a");
downloadLink.download = fileName;
if (window.webkitURL != null) {
downloadLink.href = window.webkitURL.createObjectURL(textFileAsBlob);
} else {
downloadLink.href = window.URL.createObjectURL(textFileAsBlob);
downloadLink.onclick = document.body.removeChild(event.target);
downloadLink.style.display = "none";
document.body.appendChild(downloadLink);
}
downloadLink.click();
} else { // Alternative 2: using Data
var pp = document.createElement('a');
pp.setAttribute('href', 'data:text/plain;charset=utf-8,' +
encodeURIComponent(fileContents));
pp.setAttribute('download', fileName);
pp.onclick = document.body.removeChild(event.target);
pp.click();
}
} // saveAsFile
/* Example */
var jsonObject = {"name": "John", "age": 30, "car": null};
saveAsFile('out.json', JSON.stringify(jsonObject, null, 2));
The function is called like saveAsFile('out.json', jsonString);. It will create a ByteStream immediately recognized by the browser that will download the generated file directly using the File API URL.createObjectURL.
In the else, it is possible to see the same result obtained via the href element plus the Data API, but this has several limitations that the Blob API has not.
I have modified working solution to handle both the case.. when video is uploaded and when image is uploaded .. hope it will help some.
HTML
<input type="file" id="fileInput">
<div> duration: <span id='sp'></span><div>
Javascript
var fileEl = document.querySelector("input");
fileEl.onchange = function(e) {
var file = e.target.files[0]; // selected file
if (!file) {
console.log("nothing here");
return;
}
console.log(file);
console.log('file.size-' + file.size);
console.log('file.type-' + file.type);
console.log('file.acutalName-' + file.name);
let start = performance.now();
var mime = file.type, // store mime for later
rd = new FileReader(); // create a FileReader
if (/video/.test(mime)) {
rd.onload = function(e) { // when file has read:
var blob = new Blob([e.target.result], {
type: mime
}), // create a blob of buffer
url = (URL || webkitURL).createObjectURL(blob), // create o-URL of blob
video = document.createElement("video"); // create video element
//console.log(blob);
video.preload = "metadata"; // preload setting
video.addEventListener("loadedmetadata", function() { // when enough data loads
console.log('video.duration-' + video.duration);
console.log('video.videoHeight-' + video.videoHeight);
console.log('video.videoWidth-' + video.videoWidth);
//document.querySelector("div")
// .innerHTML = "Duration: " + video.duration + "s" + " <br>Height: " + video.videoHeight; // show duration
(URL || webkitURL).revokeObjectURL(url); // clean up
console.log(start - performance.now());
// ... continue from here ...
});
video.src = url; // start video load
};
} else if (/image/.test(mime)) {
rd.onload = function(e) {
var blob = new Blob([e.target.result], {
type: mime
}),
url = URL.createObjectURL(blob),
img = new Image();
img.onload = function() {
console.log('iamge');
console.dir('this.height-' + this.height);
console.dir('this.width-' + this.width);
URL.revokeObjectURL(this.src); // clean-up memory
console.log(start - performance.now()); // add image to DOM
}
img.src = url;
};
}
var chunk = file.slice(0, 1024 * 1024 * 10); // .5MB
rd.readAsArrayBuffer(chunk); // read file object
};
jsFiddle Url
https://jsfiddle.net/PratapDessai/0sp3b159/
The OP asks:
What is blob URL? Why is it used?
Blob is just byte sequence. Browsers recognize Blobs as byte streams. It is used to get byte stream from source.
According to Mozilla's documentation
A Blob object represents a file-like object of immutable, raw data. Blobs represent data that isn't necessarily in a JavaScript-native format. The File interface is based on Blob, inheriting blob functionality and expanding it to support files on the user's system.
The OP asks:
Can i make my own blob url on a server?
Yes you can there are several ways to do so for example try http://php.net/manual/en/function.ibase-blob-echo.php
Read more here:
https://developer.mozilla.org/en-US/docs/Web/API/Blob
http://www.w3.org/TR/FileAPI/#dfn-Blob
https://url.spec.whatwg.org/#urls
blob urls are used for showing files that the user uploaded, but they are many other purposes, like that it could be used for secure file showing, like how it is a little difficult to get a YouTube video as a video file without downloading an extension. But, they are probably more answers. My research is mostly just me using Inspect to try to get a YouTube video and an online article.
Another use case of blob urls is to load resources from the server, apply hacks and then tell the browser to interpret them.
One such example would be to load template files or even scss files.
Here is the scss example:
<script type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/sass.js/0.11.1/sass.sync.min.js"></script>
function loadCSS(text) {
const head = document.getElementsByTagName('head')[0]
const style = document.createElement('link')
const css = new Blob([text], {type: 'text/css'})
style.href = window.URL.createObjectURL(css)
style.type = 'text/css'
style.rel = 'stylesheet'
head.append(style)
}
fetch('/style.scss').then(res => res.text()).then(sass => {
Sass.compile(sass, ({text}) => loadCSS(text))
})
Now you could swap out Sass.compile for any kind of transformation function you like.
Blob urls keeps your DOM structure clean this way.
I'm sure by now you have your answers, so this is just one more thing you can do with it.

Can I read PDF or Word Docs with Node.js?

I can't find any packages to do this. I know PHP has a ton of libraries for PDFs (like http://www.fpdf.org/) but anything for Node?
textract is a great lib that supports PDFs, Doc, Docx, etc.
Looks like there's a few for pdf, but I didn't find any for Word.
CPU bound processing like that isn't really Node's strong point anyway (i.e. you get no additional benefits using node to do it over any other language). A pragmatic approach would be to find a good tool and utilise it from Node.
I have heard good things around the office about docsplit http://documentcloud.github.com/docsplit/
While it's not Node, you could easily invoke it from Node with http://nodejs.org/docs/latest/api/all.html#child_process.exec
You can easily convert one into another, or use for example a .doc template to generate a .pdf file, but you will probably want to use an existing web service for this task.
This can be done using the services of Livedocx for example
To use this service from node, see node-livedocx (Disclaimer: I am the author of this node module)
I would suggest looking into unoconv for your initial conversion, this uses LibreOffice or OpenOffice for the actual conversion. Which adds some overhead.
I'd setup a few workers with all the necessities setup, and use a request/response queue for handling the conversion... (may want to look into kue or zmq)
In general this is a CPU bound and heavy task that should be offloaded... Pandoc and others specifically mention .docx, not .doc so they may or may not be options as well.
Note: I know this question is old, just wanted to provide a current answer for others coming across this.
you can use pdf-text for pdf files. it will extract text from a pdf into an array of text 'chunks'. Useful for doing fuzzy parsing on structured pdf text.
var pdfText = require('pdf-text')
var pathToPdf = __dirname + "/info.pdf"
pdfText(pathToPdf, function(err, chunks) {
//chunks is an array of strings
//loosely corresponding to text objects within the pdf
//for a more concrete example, view the test file in this repo
})
var fs = require('fs')
var buffer = fs.readFileSync(pathToPdf)
pdfText(buffer, function(err, chunks) {
console.log(chunks)
})
for docx files you can use mammoth, it will extract text from .docx files.
var mammoth = require("mammoth");
mammoth.extractRawText({path: "./doc.docx"})
.then(function(result){
var text = result.value; // The raw text
console.log(text);
var messages = result.messages;
})
.done();
I hope this will help.
For parsing pdf files you can use pdf2json node module
It allows you to convert pdf file to json as well as to raw text data.
Another good option if you only need to convert from Word documents is Mammoth.js.
Mammoth is designed to convert .docx documents, such as those created
by Microsoft Word, and convert them to HTML. Mammoth aims to produce
simple and clean HTML by using semantic information in the document,
and ignoring other details. For instance, Mammoth converts any
paragraph with the style Heading 1 to h1 elements, rather than
attempting to exactly copy the styling (font, text size, colour, etc.)
of the heading.
There's a large mismatch between the structure used by .docx and the
structure of HTML, meaning that the conversion is unlikely to be
perfect for more complicated documents. Mammoth works best if you only
use styles to semantically mark up your document.
Here is an example showing how to download and extract text from a PDF using PDF.js:
import _ from 'lodash';
import superagent from 'superagent';
import pdf from 'pdfjs-dist';
const url = 'http://unec.edu.az/application/uploads/2014/12/pdf-sample.pdf';
const main = async () => {
const response = await superagent.get(url).buffer();
const data = response.body;
const doc = await pdf.getDocument({ data });
for (const i of _.range(doc.numPages)) {
const page = await doc.getPage(i + 1);
const content = await page.getTextContent();
for (const { str } of content.items) {
console.log(str);
}
}
};
main().catch(error => console.error(error));
You can use Aspose.Words Cloud SDK for Node.js to extract text from DOC/DOCX,Open Office, and PDF. It's paid API but the free plan provides 150 free monthly API calls.
P.S: I'm developer evangelist at Aspose.
const { WordsApi, ConvertDocumentRequest } = require("asposewordscloud");
const fs = require('fs');
// Get Customer ID and Customer Key from https://dashboard.aspose.cloud/
wordsApi = new WordsApi("xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx", "xxxxxxxxxxxxxxxxxxxx");
const request = new ConvertDocumentRequest({
format: "txt",
document: fs.createReadStream("C:/Temp/02_pages.pdf"),
});
const outputFile = "C:/Temp/ConvertPDFtotxt.txt";
wordsApi.convertDocument(request).then((result) => {
console.log(result.response.statusCode);
console.log(result.body.byteLength);
fs.writeFileSync(outputFile, result.body);
}).catch(function(err) {
// Deal with an error
console.log(err);
});