AWS s3 in Rust: Get and store a file - Invalid file header when opening

What I want to do: Download an S3 file (pdf) in a lambda and extract its text, using Rust.
The Error:
ERROR PDF error: Invalid file header
I checked the pdf file in the bucket, downloaded it from the console and everything looks correct, so something is breaking in the way I store the file.
How I am doing it:
let config = aws_config::load_from_env().await;
let client = s3::Client::new(&config);
// Get uploaded object in raw bucket (serde derived the json)
let key = event.records.get(0).unwrap().s3.object.key.clone();
let key = key.replace('+', " ");
let key = percent_encoding::percent_decode_str(&key).decode_utf8().unwrap().to_string();
let content = client
    .get_object()
    .bucket(raw_bucket_name)
    .key(&key)
    // .response_content_type("application/pdf") // this did not make any difference
    .send()
    .await?;
let mut bytes = content.body.into_async_read();
let file = tempfile::NamedTempFile::new()?;
let path = file.into_temp_path();
let mut file = tokio::fs::File::create(&path).await?;
tokio::io::copy(&mut bytes, &mut file).await?;
let content = pdf_extract::extract_text(path)?; // this line breaks
Versions:
tokio = { version = "1", features = ["macros"] }
aws-sdk-s3 = "0.21.0"
aws-config = "0.51.0"
pdf-extract = "0.6.4"
I feel like I have misunderstood how to store the byte stream, but e.g. https://stackoverflow.com/a/62003659/4986655 does it the same way, as far as I can see.
Any help or pointers on what the issue might be or how to debug this are very welcome.
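One way to narrow this down is to look at the first bytes of what was actually written to disk. A PDF file normally starts with the marker %PDF-, so if the stored file is empty or begins with something else, the problem is in the download or write step rather than in pdf-extract. The following is a minimal sketch using only the standard library; the helper names looks_like_pdf and inspect_file are hypothetical and not part of any crate mentioned above.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Returns true if `bytes` begin with the PDF magic marker `%PDF-`.
fn looks_like_pdf(bytes: &[u8]) -> bool {
    bytes.starts_with(b"%PDF-")
}

/// Reads the whole file at `path` and reports its size in bytes
/// together with whether it starts with a PDF header.
fn inspect_file(path: &Path) -> io::Result<(usize, bool)> {
    let bytes = fs::read(path)?;
    Ok((bytes.len(), looks_like_pdf(&bytes)))
}
```

Calling inspect_file(&path) right before pdf_extract::extract_text and logging the result would show whether the temporary file is empty, shorter than the object in the bucket, or not a PDF at all.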

Related

Why such a simple BufWriter operation didn't work

The following code is very simple: open a file for writing, create a BufWriter on it, and write a string.
The program reports no errors and returns Ok(10), but the file is empty.
use tokio::io::AsyncWriteExt; // brings `write` into scope

#[tokio::test]
async fn save_file_async() {
    let path = "./hello.txt";
    let inner = tokio::fs::OpenOptions::new()
        .create(true)
        .write(true)
        //.truncate(true)
        .open(path)
        .await
        .unwrap();
    let mut writer = tokio::io::BufWriter::new(inner);
    println!(
        "{} bytes wrote",
        writer.write("1234567890".as_bytes()).await.unwrap()
    );
}

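You need an explicit flush. tokio's BufWriter discards any buffered data when it is dropped, so call flush before the writer goes out of scope: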
writer.flush().await.unwrap();
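The buffering itself can be observed without tokio. The sketch below uses the standard library's synchronous std::io::BufWriter over an in-memory Vec<u8> as a stand-in, which is an assumption made for illustration (the code above uses tokio's async writer and a real file). It shows that a successful write only places the bytes in the buffer, and that they reach the underlying writer after flush.

```rust
use std::io::{BufWriter, Write};

/// Writes 10 bytes through a BufWriter and returns how many bytes the
/// underlying Vec holds before and after the explicit flush.
fn buffered_then_flushed() -> (usize, usize) {
    let mut writer = BufWriter::new(Vec::new());

    // The write succeeds, but the data only lands in BufWriter's buffer.
    writer.write_all(b"1234567890").unwrap();
    let before = writer.get_ref().len();

    // flush pushes the buffered bytes into the underlying writer.
    writer.flush().unwrap();
    let after = writer.get_ref().len();

    (before, after)
}
```

One difference to keep in mind: the standard library's BufWriter attempts a flush when it is dropped, whereas tokio's does not, which is why the missing flush goes unnoticed in synchronous code but leaves an empty file here.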

Can't figure out how to send a signed POST request to OKEx

I want to send a signed POST request to OKEx (see the Authentication Docs and the POST Request Docs).
I always get back an "invalid sign" error.
I successfully sent a signed GET request. For a POST, you also need to include the body in the signature. If I do that, none of my signatures are valid anymore. I have already verified that my signature matches the one produced by their official Python SDK (that is why I wrote the JSON by hand: Python puts spaces in the JSON). I am new to Rust, so I am hoping I am missing something obvious.
OKEx client implementations in other languages: https://github.com/okcoin-okex/open-api-v3-sdk
/// [dependencies]
/// hmac="0.7.1"
/// reqwest = "0.9.18"
/// chrono = "0.4.6"
/// base64="0.10.1"
/// sha2="0.8.0"
use reqwest::header::{HeaderMap, HeaderValue, CONTENT_TYPE};
use chrono::prelude::{Utc, SecondsFormat};
use hmac::{Hmac, Mac};
use sha2::{Sha256};

static API_KEY: &'static str = "<insert your key!>";
static API_SECRET: &'static str = "<insert your secret!>";
static PASSPHRASE: &'static str = "<insert your passphrase!>";

fn main() {
    let timestamp = Utc::now().to_rfc3339_opts(SecondsFormat::Millis, true);
    let method = "POST";
    let request_path = "/api/spot/v3/orders";
    let body_str = "{\"type\": \"market\", \"side\": \"sell\", \"instrument_id\": \"ETH-USDT\", \"size\": \"0.001\"}";

    // String to sign: timestamp + method + request path + body
    let mut signature_content = String::new();
    signature_content.push_str(&timestamp);
    signature_content.push_str(method);
    signature_content.push_str(request_path);
    signature_content.push_str(&body_str);

    type HmacSha256 = Hmac<Sha256>;
    let mut mac = HmacSha256::new_varkey(API_SECRET.as_bytes()).unwrap();
    mac.input(signature_content.as_bytes());
    let signature = mac.result().code();
    let base64_signature = base64::encode(&signature);

    let mut header_map = HeaderMap::new();
    header_map.insert("OK-ACCESS-KEY", HeaderValue::from_str(API_KEY).unwrap());
    header_map.insert("OK-ACCESS-SIGN", HeaderValue::from_str(&base64_signature).unwrap());
    header_map.insert("OK-ACCESS-TIMESTAMP", HeaderValue::from_str(&timestamp).unwrap());
    header_map.insert("OK-ACCESS-PASSPHRASE", HeaderValue::from_str(PASSPHRASE).unwrap());
    header_map.insert(CONTENT_TYPE, HeaderValue::from_static("application/json; charset=UTF-8"));

    let client = reqwest::Client::new();
    let mut complete_url = String::from("https://okex.com");
    complete_url.push_str(request_path);
    let res = client
        .post(complete_url.as_str())
        .headers(header_map)
        .body(body_str)
        .send().unwrap().text();
    println!("{:#?}", res);
}
This currently returns an "Invalid Sign" error, but it should return a successful HTTP status code (if the account has enough funds).

The solution was to use "https://www.okex.com" instead of "https://okex.com". The latter produces the "Invalid Sign" error, but only for POST requests. The issue was therefore not Rust-related.
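For reference, the string that the question's code signs is a plain concatenation of timestamp, method, request path, and body. The sketch below isolates that step using only the standard library; the function name prehash is hypothetical, and the HMAC-SHA256 and Base64 steps from the question are deliberately left out.

```rust
/// Builds the string to sign in the order used by the question's code:
/// timestamp + method + request path + body (empty body for a GET).
fn prehash(timestamp: &str, method: &str, request_path: &str, body: &str) -> String {
    let mut s = String::with_capacity(
        timestamp.len() + method.len() + request_path.len() + body.len(),
    );
    s.push_str(timestamp);
    s.push_str(method);
    s.push_str(request_path);
    s.push_str(body);
    s
}
```

Two things follow from this layout. The body passed here has to be byte-for-byte the same string that is sent in the request, which is why whitespace in the JSON matters. The host name is not part of the signed string, so two clients can produce identical signatures while sending to different hosts.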

how to apply password on the zip file or on the csv in nodejs or javascript

var csvString = ['rest','test','age'];
var fileName_CSV = "Report_1.csv";
var fileName_ZIP = "Report_1.zip";
var blob = new Blob(dd, {type: "application/zip"});
var zip = new JSZip();
zip.file(fileName_CSV,csvString),{type:"blob"};
var content = zip.generate({type:"blob"});
saveAs(content,fileName_ZIP);
I have JSON data that I converted to CSV format. I created the CSV file in memory and zipped it, and now I want to apply a password to it, so that anyone who opens the zip and tries to open the CSV is asked for a user-defined password. I want to use either JavaScript or Node.js for this. Please help.

The minizip-asm.js package supports creating zip archives with passwords.
https://www.npmjs.com/package/minizip-asm.js
From the docs:
npm install minizip-asm.js
Example Usage:
var Minizip = require('minizip-asm.js');
var fs = require("fs");
var csvString = new Buffer("Abc~~~");
var mz = new Minizip();
mz.append("Report_1.csv", csvString, {password: "insert-password"});
fs.writeFileSync("Report_1.zip", new Buffer(mz.zip()));

AWS S3 filenaming when using MediaConvert

I am currently uploading files to Amazon S3 using MediaConvert via Lambda functions. Part of my processing is to create thumbnail images from uploaded videos. To do this I am using the AmazonMediaConvertClient and creating a job request.
However, the generated files have a suffix of 0000000 applied to them, which from what I can gather refers to the frame captured.
I do not want this suffix on the filename. Is there any way to ensure that the filename created from a video thumbnail is exactly what I specify, with no suffix?
var jpgOutput = new Output
{
    NameModifier = $"-Medium",
    ContainerSettings = new ContainerSettings { Container = ContainerType.RAW },
    Extension = "jpg",
    VideoDescription = new VideoDescription
    {
        CodecSettings = new VideoCodecSettings()
        {
            Codec = VideoCodec.FRAME_CAPTURE,
            FrameCaptureSettings = new FrameCaptureSettings()
            {
                MaxCaptures = 1, Quality = 100
            }
        },
        Height = thumbnail.Height,
        Width = thumbnail.Width
    }
};
In the code snippet above the file is created as 1-Medium.000000.jpg but I want 1-Medium.jpg

Get pdf-attachments from Gmail as text

I searched around the web and Stack Overflow but didn't find a solution. What I am trying to do is the following: I receive certain attachments by mail that I would like to have as plain text for further processing. My script looks like this:
function MyFunction() {
  var threads = GmailApp.search('label:templabel');
  var messages = GmailApp.getMessagesForThreads(threads);
  for (i = 0; i < messages.length; ++i)
  {
    j = messages[i].length;
    var messageBody = messages[i][0].getBody();
    var messageSubject = messages[i][0].getSubject();
    var attach = messages[i][0].getAttachments();
    var attachcontent = attach.getContentAsString();
    GmailApp.sendEmail("mail", messageSubject, "", {htmlBody: attachcontent});
  }
}
Unfortunately this doesn't work. Does anybody here have an idea how I can do this? Is it even possible?
Thank you very much in advance.
Best, Phil

Edit: Updated for DriveApp, as DocsList is deprecated.
I suggest breaking this down into two problems. The first is how to get a pdf attachment from an email, the second is how to convert that pdf to text.
As you've found out, getContentAsString() does not magically change a pdf attachment to plain text or html. We need to do something a little more complicated.
First, we'll get the attachment as a Blob, a utility class used by several Services to exchange data.
var blob = attachments[0].getAs(MimeType.PDF);
So with the second problem separated out, and maintaining the assumption that we're interested in only the first attachment of the first message of each thread labeled templabel, here is how myFunction() looks:
/**
* Get messages labeled 'templabel', and send myself the text contents of
* pdf attachments in new emails.
*/
function myFunction() {
var threads = GmailApp.search('label:templabel');
var threadsMessages = GmailApp.getMessagesForThreads(threads);
for (var thread = 0; thread < threadsMessages.length; ++thread) {
var message = threadsMessages[thread][0];
var messageBody = message.getBody();
var messageSubject = message.getSubject();
var attachments = message.getAttachments();
var blob = attachments[0].getAs(MimeType.PDF);
var filetext = pdfToText( blob, {keepTextfile: false} );
GmailApp.sendEmail(Session.getActiveUser().getEmail(), messageSubject, filetext);
}
}
We're relying on a helper function, pdfToText(), to convert our pdf blob into text, which we'll then send to ourselves as a plain text email. This helper function has a variety of options; by setting keepTextfile: false, we've elected to just have it return the text content of the PDF file to us, and leave no residual files in our Drive.
pdfToText()
This utility is available as a gist. Several examples are provided there.
A previous answer indicated that it was possible to use the Drive API's insert method to perform OCR, but it didn't provide code details. With the introduction of Advanced Google Services, the Drive API is easily accessible from Google Apps Script. You do need to switch on and enable the Drive API from the editor, under Resources > Advanced Google Services.
pdfToText() uses the Drive service to generate a Google Doc from the content of the PDF file. Unfortunately, this contains the "pictures" of each page in the document - not much we can do about that. It then uses the regular DocumentService to extract the document body as plain text.
/**
* See gist: https://gist.github.com/mogsdad/e6795e438615d252584f
*
* Convert pdf file (blob) to a text file on Drive, using built-in OCR.
* By default, the text file will be placed in the root folder, with the same
* name as source pdf (but extension 'txt'). Options:
* keepPdf (boolean, default false) Keep a copy of the original PDF file.
* keepGdoc (boolean, default false) Keep a copy of the OCR Google Doc file.
* keepTextfile (boolean, default true) Keep a copy of the text file.
* path (string, default blank) Folder path to store file(s) in.
* ocrLanguage (ISO 639-1 code) Default 'en'.
* textResult (boolean, default false) If true and keepTextfile true, return
* string of text content. If keepTextfile
* is false, text content is returned without
* regard to this option. Otherwise, return
* id of textfile.
*
* @param {blob} pdfFile Blob containing pdf file
* @param {object} options (Optional) Object specifying handling details
*
* @returns {string} id of text file (default) or text content
*/
function pdfToText ( pdfFile, options ) {
// Ensure Advanced Drive Service is enabled
try {
Drive.Files.list();
}
catch (e) {
throw new Error( "To use pdfToText(), first enable 'Drive API' in Resources > Advanced Google Services." );
}
// Set default options
options = options || {};
options.keepTextfile = options.hasOwnProperty("keepTextfile") ? options.keepTextfile : true;
// Prepare resource object for file creation
var parents = [];
if (options.path) {
parents.push( getDriveFolderFromPath (options.path) );
}
var pdfName = pdfFile.getName();
var resource = {
title: pdfName,
mimeType: pdfFile.getContentType(),
parents: parents
};
// Save PDF to Drive, if requested
if (options.keepPdf) {
var file = Drive.Files.insert(resource, pdfFile);
}
// Save PDF as GDOC
resource.title = pdfName.replace(/pdf$/, 'gdoc');
var insertOpts = {
ocr: true,
ocrLanguage: options.ocrLanguage || 'en'
}
var gdocFile = Drive.Files.insert(resource, pdfFile, insertOpts);
// Get text from GDOC
var gdocDoc = DocumentApp.openById(gdocFile.id);
var text = gdocDoc.getBody().getText();
// We're done using the Gdoc. Unless requested to keepGdoc, delete it.
if (!options.keepGdoc) {
Drive.Files.remove(gdocFile.id);
}
// Save text file, if requested
if (options.keepTextfile) {
resource.title = pdfName.replace(/pdf$/, 'txt');
resource.mimeType = MimeType.PLAIN_TEXT;
var textBlob = Utilities.newBlob(text, MimeType.PLAIN_TEXT, resource.title);
var textFile = Drive.Files.insert(resource, textBlob);
}
// Return result of conversion
if (!options.keepTextfile || options.textResult) {
return text;
}
else {
return textFile.id
}
}
The conversion to DriveApp is helped with this utility from Bruce McPherson:
// From: http://ramblings.mcpher.com/Home/excelquirks/gooscript/driveapppathfolder
function getDriveFolderFromPath (path) {
return (path || "/").split("/").reduce ( function(prev,current) {
if (prev && current) {
var fldrs = prev.getFoldersByName(current);
return fldrs.hasNext() ? fldrs.next() : null;
}
else {
return current ? null : prev;
}
},DriveApp.getRootFolder());
}
