I'm using node.js (express) on Heroku, where the slug size is limited to 300MB.
In order to keep my slug small, I'd like to use git-lfs to track my Express app's public folder.
That way all my assets (images, videos, ...) are uploaded to an LFS store (say AWS S3) and git-lfs leaves a pointer file (with, presumably, the S3 URL in it?).
I'd like Express to redirect to the remote S3 file when serving files from the public folder.
My problem is I don't know how to retrieve the URL from the pointer file's content...
app.use('/public/:pointerfile', function (req, res, next) {
  var file = req.params.pointerfile;
  fs.readFile('public/' + file, function (er, data) {
    if (er) return next(er);

    var url = retrieveUrl(data); // <-- HELP ME HERE with the retrieveUrl function
    res.redirect(url);
  });
});
Also, won't it be too expensive to have Express read and parse potentially every file in public/*? Maybe I could cache the URL once it has been parsed?
Actually, the pointer file doesn't contain any URL information (as can be seen in the link you provided, or here); it just holds the oid (Object ID) of the blob, which is simply its SHA-256 hash.
You can, however, achieve what you're looking for by using the oid with the LFS API, which lets you download specific oids via a batch request.
You can tell which endpoint is used to store your blobs from .git/config, which can accept non-default lfsurl entries such as:
[remote "origin"]
url = https://...
fetch = +refs/heads/*:refs/remotes/origin/*
lfsurl = "https://..."
or a separate
[lfs]
url = "https://..."
If there's no lfsurl tag then you're using GitHub's endpoint (which may in turn redirect to S3):
Git remote: https://git-server.com/user/repo.git
Git LFS endpoint: https://git-server.com/user/repo.git/info/lfs

Git remote: git@git-server.com:user/repo.git
Git LFS endpoint: https://git-server.com/user/repo.git/info/lfs
But you should work against it and not S3 directly, as GitHub's redirect response will probably contain some authentication information as well.
Check the batch response doc to see the response structure - you will basically need to parse the relevant parts and make your own call to retrieve the blobs (which is what git lfs would've done in your stead during checkout).
A typical response (taken from the doc I referenced) would look something like:
{
  "_links": {
    "download": {
      "href": "https://storage-server.com/OID",
      "header": {
        "Authorization": "Basic ..."
      }
    }
  }
}
So you would GET https://storage-server.com/OID with whatever headers were returned from the batch response. The last step is to rename the blob that was returned (its name will typically be just the oid, since git lfs uses checksum-based storage); the pointer file holds the original resource's name, so just rename the blob to that.
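To make the flow concrete, here is a rough sketch of what a retrieveUrl helper could look like, assuming Node.js with node-fetch and a public repo (a private repo would also need an Authorization header on the batch request). The helper names are mine, and the response handling covers both the _links shape quoted above and the actions shape newer LFS servers return:

const fs = require('fs');
const fetch = require('node-fetch');

// A git-lfs pointer file contains lines like:
//   oid sha256:<64-char hash>
//   size <bytes>
function parsePointer(text) {
  const oid = /oid sha256:([a-f0-9]{64})/.exec(text)[1];
  const size = parseInt(/size (\d+)/.exec(text)[1], 10);
  return { oid, size };
}

// Ask the LFS batch endpoint where the blob for this pointer can be fetched.
async function retrieveUrl(pointerText, lfsEndpoint) {
  const { oid, size } = parsePointer(pointerText.toString());

  const res = await fetch(lfsEndpoint + '/objects/batch', {
    method: 'POST',
    headers: {
      'Accept': 'application/vnd.git-lfs+json',
      'Content-Type': 'application/vnd.git-lfs+json'
    },
    body: JSON.stringify({ operation: 'download', objects: [{ oid, size }] })
  });
  const body = await res.json();

  // Older servers answer with _links.download, newer ones with actions.download.
  const obj = body.objects ? body.objects[0] : body;
  const download = (obj.actions && obj.actions.download) || obj._links.download;

  return download; // { href, header } -- GET href with those headers, or redirect to it
}

Since the href that comes back is usually a short-lived signed URL, you would redirect to it rather than cache it for long.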
I've finally made a middleware for this: express-lfs with a demo here: https://expresslfs.herokuapp.com
There you can download a 400MB file as proof.
See usage here: https://github.com/goodenough/express-lfs#usage
PS: Thanks to @fundeldman for the good advice in his answer ;)
Goal
Using googleapis with Firebase Functions: get a JWT token so Firebase Functions can use a service account with domain-wide delegation to authorize G Suite APIs like Directory and Drive.
Question
What goes in path.join()?
What is __dirname?
What is 'jwt.keys.json'?
From this example:
https://github.com/googleapis/google-api-nodejs-client/blob/master/samples/jwt.js
// Create a new JWT client using the key file downloaded from the Google Developer Console
const auth = new google.auth.GoogleAuth({
  keyFile: path.join(__dirname, 'jwt.keys.json'), // <---- WHAT GOES IN path.join()
  scopes: 'https://www.googleapis.com/auth/drive.readonly',
});
Error
When I run
const auth = new google.auth.GoogleAuth({
  keyFile: path.join(__dirname, "TEST"), // <-- __dirname == /srv/ at runtime
  scopes: 'https://www.googleapis.com/auth/drive.readonly',
});
From the GCP Logs I get this error:
Error: ENOENT: no such file or directory, open '/srv/TEST'
Obviously TEST isn't valid, but what is '/srv/'?
What is the keyFile: a file path? A credential?
Another Example
https://github.com/googleapis/google-api-nodejs-client#service-to-service-authentication
I found documentation here:
https://googleapis.dev/nodejs/google-auth-library/5.10.1/classes/JWT.html
If you do not want to include a file, you can use key, keyId, and email to submit credentials when requesting authorization.
You seem to have a lot of questions around how this works. I would strongly encourage you to read the basics of Google authentication.
JWT is short for JSON Web Token. It is a standard defining a secure way to transmit information between parties in JSON format. In your code, "jwt" is a class containing a keys property. There are a ton of JWT libraries, including some popular packages for Node/Express frameworks.
__dirname // In Node this is the absolute path of the directory containing the currently executing file.
path.join is a method that joins different path segments into one path.
Here you are taking the absolute path and concatenating some piece of information to the end of the path. I am not certain what is contained in jwt.keys.json but that is what is being appended to the end of the absolute path in this case.
Without knowing your project structure or what you are pointing to it's not really possible to say what is and is not a valid path in your project.
keyFile is a key in an object (as denoted by the {key: value} format) passed to google.auth. As seen in the sample code you referenced, the script takes the google.auth library and calls a method to construct an object from the information you provide, so that it abstracts away other elements of the authentication process for you. You are giving it two pieces of information: 1) the location of the keyFile, which presumably contains the credentials, and 2) the scope, or set of permissions, you are allowing. In the example it is read-only access to Drive.
EDIT: keyFile is the private key file that the calling service uses to sign the JWT.
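As the google-auth-library docs you linked point out, you can also skip the key file entirely and pass the credentials in directly. A minimal sketch, assuming the client_email and private_key from the downloaded JSON key are supplied via environment variables (the variable names and the impersonated user are placeholders), with domain-wide delegation via subject:

const { google } = require('googleapis');

// Credentials taken from the service account JSON key, supplied via env vars
// instead of a keyFile on disk (variable names are placeholders).
const auth = new google.auth.JWT({
  email: process.env.SERVICE_ACCOUNT_EMAIL,        // "client_email" field of the JSON key
  key: process.env.SERVICE_ACCOUNT_PRIVATE_KEY,    // "private_key" field of the JSON key
  scopes: ['https://www.googleapis.com/auth/drive.readonly'],
  subject: 'someone@your-gsuite-domain.com'        // user to impersonate (domain-wide delegation)
});

async function listFiles() {
  const drive = google.drive({ version: 'v3', auth });
  const res = await drive.files.list({ pageSize: 5 });
  console.log(res.data.files);
}

listFiles().catch(console.error);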
My code uses the AWS Javascript SDK to upload to S3 directly from a browser. Before the upload happens, my server sends it a value to use for 'Authorization'.
But I see no way in the AWS.S3.upload() method where I can add this header.
I know that underneath the .upload() method, AWS.S3.ManagedUpload is used but that likewise doesn't seem to return a Request object anywhere for me to add the header.
It works successfully in my dev environment when I hardcode my credentials in the S3() object, but I can't do that in production.
How can I get the Authorization header into the upload() call?
Client Side
This post explains how to post from an HTML form with a pre-generated signature:
How do you upload files directly to S3 over SSL?
Server Side
When you initialise the S3 client, you can pass the access key and secret:
const s3 = new AWS.S3({
  apiVersion: '2006-03-01',
  accessKeyId: '[value]',
  secretAccessKey: '[value]'
});

const params = {};
s3.upload(params, function (err, data) {
  console.log(err, data);
});
Reference: https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/S3.html
Alternatively, if you are running this code inside AWS services such as EC2, Lambda, ECS, etc., you can assign an IAM role to the service you are using and grant the required S3 permissions to that role.
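With a role attached, the SDK resolves credentials on its own, so the client can be constructed without any keys. A quick sketch (the bucket and key names are placeholders):

const AWS = require('aws-sdk');

// No explicit credentials: the SDK picks them up from the attached IAM role.
const s3 = new AWS.S3({ apiVersion: '2006-03-01' });

s3.upload(
  { Bucket: 'my-bucket', Key: 'example.txt', Body: 'hello' }, // placeholders
  function (err, data) {
    console.log(err, data);
  }
);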
I suggest that you use presigned URLs.
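For example, the server (which holds the AWS credentials) can sign a short-lived PUT URL and hand it to the browser, which then uploads directly to S3 without ever seeing the keys. A sketch with placeholder bucket and key names:

const AWS = require('aws-sdk');
const s3 = new AWS.S3({ apiVersion: '2006-03-01' });

// Server side: generate a short-lived URL the browser can PUT the file to.
function getUploadUrl(key, callback) {
  s3.getSignedUrl('putObject', {
    Bucket: 'my-upload-bucket', // placeholder
    Key: key,
    Expires: 300                // URL is valid for 5 minutes
  }, callback);
}

// Browser side: upload straight to S3 with the signed URL, e.g.
//   fetch(signedUrl, { method: 'PUT', body: file })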
I have an API with a file upload endpoint using Rails and ActiveStorage, with S3 as the file server. I would like to upload directly to S3 from my client app, but the code provided in the Active Storage docs only shows how to do that with JavaScript: https://edgeguides.rubyonrails.org/active_storage_overview.html#direct-uploads
Since I am sending a POST request with the file to the Rails API directly, there is no place I can run JS.
Is there a way to use direct uploads with Rails API-only apps?
In order to solve a similar issue I followed the approach proposed in the AWS documentation.
The simple concept is that for each file that I want to upload, I do the following workflow:
1. Request an S3 presigned_url/public_url pair from my server.
2. Send the file to S3 via POST/PUT (depending on the kind of presigned URL you choose).
3. Once I get the 200 (OK) from the S3 upload, send a new request to my server for the resource I'm trying to update, including the public URL in the params for that resource.
e.g.:
1. GET myserver.com/api/signed_url?filename=<safe_file_name>
   1.1. Replies with
        {
          presigned_url: "https://bucket-name.s3.us-west-1.amazonaws.com/uploads/1bb275c5-0199-41fe-ac40-133601f5efb0?x-amz-acl=public-read...",
          public_url: "https://bucket-name.s3.us-west-1.amazonaws.com/uploads/1bb275c5-0199-41fe-ac40-133601f5efb0"
        }
2. PUT <presigned_url>, data: <file_to_upload>, cache: false, processData: false
   2.1. Wait for 200 (OK) from the S3 direct upload
3. POST myserver.com/api/document, data: { name: 'new file', document_url: <public_url> }
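The same flow can be driven from any HTTP client; here it is sketched in browser JavaScript for concreteness (the /api/signed_url and /api/document endpoints are the ones from the example above, and error handling is mostly omitted):

async function uploadDocument(file) {
  // 1. Ask the server for a presigned_url / public_url pair
  const res = await fetch(`/api/signed_url?filename=${encodeURIComponent(file.name)}`);
  const { presigned_url, public_url } = await res.json();

  // 2. Upload the file straight to S3 using the presigned URL
  const upload = await fetch(presigned_url, { method: 'PUT', body: file });
  if (!upload.ok) throw new Error('S3 upload failed');

  // 3. Tell the API about the uploaded resource, passing the public URL
  await fetch('/api/document', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ name: file.name, document_url: public_url })
  });
}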
I have a SPA with a Redux client and an Express web API. One of the use cases is uploading a single file from the browser to the Express server. Express uses the multer middleware to decode the file upload and place it into an array on the req object. Everything works as expected when running on localhost.
However when the app is deployed to AWS, it does not function as expected. Deployment pushes the express api to an AWS Lambda function, and the redux client static assets are served by Cloudfront CDN. In that environment, the uploaded file does make it to the express server, is handled by multer, and the file does end up as the first (and only) item in the req.files array where it is expected to be.
The problem is that the file contains the wrong bytes. For example when I upload a sample image that is 2795 bytes in length, the file that ends up being decoded by multer is 4903 bytes in length. Other images I have tried always end up becoming larger by approximately the same factor by the time multer decodes and puts them into the req.files array. As a result, the files are corrupted and are not displaying as images.
The file is uploaded like so:
<input type="file" name="files" onChange={this.onUploadFileSelected} />
...
onUploadFileSelected = (e) => {
const file = e.target.files[0]
var formData = new FormData()
formData.append("files", file)
axios.post('to the url', formData, { withCredentials: true })
.then(handleSuccessResponse).catch(handleFailResponse)
}
I have tried setting up multer with both MemoryStorage and DiskStorage. Both work, both on localhost and in the aws lambda, however both exhibit the same behavior -- the file is a larger size and corrupted in the store.
I have also tried setting up multer both as a global middleware (via app.use) and as a route-specific middleware on the upload route (via routes.post('the url', multerMiddleware, controller.uploadAction)). Again, both exhibit the same behavior. The multer middleware is configured like so:
const multerMiddleware = multer({ /* optionally set dest: '/tmp' */ })
  .array('files')
One difference is that on localhost, both the client and Express are served over http, whereas in AWS, both the client and Express are served over https. I don't believe this makes a difference, but I have so far been unable to test either running localhost over https or running in AWS over http.
Another peculiar thing I noticed is that when the multer middleware is present, other middlewares do not seem to function as expected. Rather than the next() function moving flow down to the controller action, other middlewares completely exit before the controller action is invoked, and when the controller invocation exits, control does not flow back into the middleware after the next() call. When the multer middleware is removed, the other middlewares function as expected. However, this observation is on localhost, where the entire end-to-end use case does function as expected.
What could be messing up the uploaded image file payload when deployed to the cloud, but not on localhost? Could it really be https making the difference?
Update 1
When I upload this file (11228 bytes)
Here is the HAR chrome is giving me for the local (expected) file upload:
"postData": {
"mimeType": "multipart/form-data; boundary=----WebKitFormBoundaryC4EJZBZQum3qcnTL",
"text": "------WebKitFormBoundaryC4EJZBZQum3qcnTL\r\nContent-Disposition: form-data; name=\"files\"; filename=\"danludwig.png\"\r\nContent-Type: image/png\r\n\r\n\r\n------WebKitFormBoundaryC4EJZBZQum3qcnTL--\r\n"
}
Here is the HAR chrome is giving me for the aws (corrupted) file upload:
"postData": {
"mimeType": "multipart/form-data; boundary=----WebKitFormBoundaryoTlutFBxvC57UR10",
"text": "------WebKitFormBoundaryoTlutFBxvC57UR10\r\nContent-Disposition: form-data; name=\"files\"; filename=\"danludwig.png\"\r\nContent-Type: image/png\r\n\r\n\r\n------WebKitFormBoundaryoTlutFBxvC57UR10--\r\n"
}
The corrupted image file that is saved is 19369 bytes in length.
Update 2
I created a text file with the text hello world that is 11 bytes long and uploaded it. It does NOT become corrupted in AWS. This is the case whether I upload it with a txt or png suffix; it ends up as 11 bytes in length when persisted.
Update 3
Tried uploading with a much larger text file (12132 bytes long) and had the same result as in update 2 -- the file is persisted intact, not corrupted.
Potential answers:
Found this https://forums.aws.amazon.com/thread.jspa?threadID=252327
API Gateway does not natively support multipart form data. It is possible to configure binary passthrough to then handle this multipart data in your integration (your backend integration or Lambda function).
It seems that you may need another approach if you are using API Gateway events in AWS to trigger the lambda that hosts your express server.
Or, you could configure API Gateway to work with binary payloads per https://stackoverflow.com/a/41770688/304832
Or, upload directly from your client to a signed s3 url (or a public one) and use that to trigger another lambda event.
Until we get a chance to try out different API Gateway settings, we found a temporary workaround: using FileReader to convert the file to a base64 text string, then submitting that. The upload does not seem to have any issues as long as the payload is text.
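A rough sketch of that workaround on the client, reusing the axios call and placeholder URL from the question (the JSON field names are mine; the server would decode the base64 back into bytes, e.g. with Buffer.from(data, 'base64')):

function uploadAsBase64(file) {
  return new Promise((resolve, reject) => {
    const reader = new FileReader();
    reader.onload = () => {
      // reader.result looks like "data:image/png;base64,iVBORw0..."
      const base64 = reader.result.split(',')[1];
      axios.post('to the url', {
        filename: file.name,
        contentType: file.type,
        data: base64          // text payload passes through API Gateway untouched
      }, { withCredentials: true }).then(resolve).catch(reject);
    };
    reader.onerror = reject;
    reader.readAsDataURL(file);
  });
}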
I have a Rails app that uses the AWS CLI to sync a bunch of content and config with my S3 bucket, like so:
aws s3 sync --acl 'public-read' #{some_path} s3://#{bucket_path}
Now I am looking for an easy way to mark everything that was just updated in the sync as invalidated or expired for CloudFront.
I am wondering if there is some way to use the --cache-control flag that the AWS CLI provides to make this happen, so that instead of invalidating CloudFront, I just mark the files as expired and CloudFront is forced to fetch fresh data from the bucket.
I am aware of the CloudFront POST API to mark files for invalidation, but that means I would have to detect what changed in the last sync and then make the API call. I might have anywhere from one file to thousands of files syncing. Not a pleasant prospect. But if I have to go this route, how would I go about detecting changes, without parsing the s3 sync console output of course?
Or any other ideas?
Thanks!
You cannot use the --cache-control option that aws cli provides to invalidate files in CloudFront. The --cache-control option maps directly to the Cache-Control header and CloudFront caches the headers along with the file, so if you change a header you must also invalidate to tell CloudFront to pull in the changed headers.
If you want to use the aws cli, then you must parse the output of the sync command and then use the aws cloudfront cli.
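A rough sketch of that approach in Node.js, using the SDK's createInvalidation instead of the aws cloudfront CLI (the idea is the same): run the sync, scrape the "upload: ... to s3://bucket/key" lines it prints, and invalidate just those paths. The bucket name and distribution ID are placeholders, and the line-per-file output format of aws s3 sync is an assumption here:

const { execSync } = require('child_process');
const AWS = require('aws-sdk');

const cloudfront = new AWS.CloudFront();

// Run the sync and capture its console output.
const output = execSync(
  'aws s3 sync --acl public-read ./public s3://my-bucket'
).toString();

// Turn "upload: public/a.css to s3://my-bucket/a.css" into "/a.css".
const paths = output.split('\n')
  .filter(line => line.startsWith('upload:'))
  .map(line => line.split(' to s3://my-bucket')[1])
  .filter(Boolean);

if (paths.length > 0) {
  cloudfront.createInvalidation({
    DistributionId: 'EDFDVBD6EXAMPLE', // placeholder
    InvalidationBatch: {
      CallerReference: Date.now().toString(),
      Paths: { Quantity: paths.length, Items: paths }
    }
  }, (err, data) => console.log(err || data));
}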
Or, you can use s3cmd from s3tools.org. This program provides the --cf-invalidate option to invalidate the uploaded files in CloudFront, and a sync command to synchronize a directory tree to S3.
s3cmd sync --cf-invalidate <local path> s3://<bucket name>
Read the s3cmd usage page for more details.
What about using the brand new AWS Lambda? Basically, it executes custom code whenever an event is triggered in AWS (in your case, a file is synchronized in S3).
Whenever you synchronize a file you get an event similar to:
{
  "Records": [
    {
      "eventVersion": "2.0",
      // ...
      "s3": {
        "s3SchemaVersion": "1.0",
        // ...
        "object": {
          "key": "hello.txt",
          "size": 4,
          "eTag": "1234"
        }
      }
    }
  ]
}
Thus, you can check the name of the file that has changed and invalidate it in CloudFront. You receive one event for every file that has changed.
I have created a script that invalidates a path in CloudFront whenever an update occurs in S3, which might be a good starting point if you decide to use this approach. It is written in JavaScript (Node.js) as it is the language used by Lambda.
var aws = require('aws-sdk'),
    s3 = new aws.S3({ apiVersion: '2006-03-01' }),
    cloudfront = new aws.CloudFront();

exports.handler = function (event, context) {
  var filePath = '/' + event.Records[0].s3.object.key,
      invalidateParams = {
        DistributionId: '1234',
        InvalidationBatch: {
          CallerReference: '1',
          Paths: {
            Quantity: 1,
            Items: [filePath]
          }
        }
      };

  console.log('Invalidating file ' + filePath);

  cloudfront.createInvalidation(invalidateParams, function (err, data) {
    if (err) {
      console.log(err, err.stack); // an error occurred
    } else {
      console.log(data); // successful response
    }
    // Finish only after the invalidation call has returned; ending the
    // function earlier could freeze it before the request is sent.
    context.done(null, '');
  });
};
For more info you can check Lambda's and CloudFront's API documentation.
Note however that the service is still in preview and is subject to change.
The AWS CLI tool can output JSON. Collect the JSON results, then submit an invalidation request per the link you included in your post. To make it really simple you could use a gem like CloudFront Invalidator, which will take a list of paths to invalidate.