What is the best approach to handle file upload in GraphQL?

I'm looking for a way to handle file upload in my backend powered by Prisma (Graphcool). However, I am a beginner, it looks very intimidating, and I don't know anything about how file upload works. What is the best approach to do this? Can I do it using Prisma? I have read about Amazon S3 buckets, but that looks like a complicated approach to begin with.

There are many ways to accomplish this. I will lay out a few solutions and resources and you should have something to get you going.
Two common approaches differ mainly in how the uploaded files are persisted: uploading directly to the server file system, or uploading to a cloud service, typically S3.
For most use cases the second option, uploading to a cloud service, will be superior due to ease of scaling, ease of backing up data in something like S3, and additional security features such as signed URLs. One more neat thing about using S3 in particular is that you can take advantage of AWS Athena, described (by AWS) as
"an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL".
Note that the examples below all leverage the awesome work by Jayden Seric, who has done tons of work on uploads using GraphQL.
Upload to FS, API and React example: uploads files to the Node server file system.
Middleware for apollo-upload-server: a small abstraction on top of apollo-upload-server that simplifies server development. This example also shows how to integrate with S3.
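To make the first option (persisting to the server file system) a bit more concrete, here is a minimal sketch. It is not taken from the linked examples, which are Node-based; it is written in Python purely for illustration, and `save_upload`, `CHUNK_SIZE`, and the 10 MiB default limit are all hypothetical choices. The idea is simply to copy the incoming stream to disk in chunks under a generated name, so that a client-supplied file name is never trusted as a path.

```python
import uuid
from pathlib import Path
from typing import BinaryIO

CHUNK_SIZE = 64 * 1024  # hypothetical: read the upload in 64 KiB pieces


def save_upload(stream: BinaryIO, upload_dir: Path,
                max_bytes: int = 10 * 1024 * 1024) -> Path:
    """Copy an uploaded stream into upload_dir under a generated name."""
    upload_dir.mkdir(parents=True, exist_ok=True)
    # A random name avoids collisions and ignores any client-supplied path.
    target = upload_dir / uuid.uuid4().hex
    written = 0
    try:
        with target.open("wb") as out:
            while chunk := stream.read(CHUNK_SIZE):
                written += len(chunk)
                if written > max_bytes:
                    raise ValueError("upload exceeds size limit")
                out.write(chunk)
    except Exception:
        # Do not leave a partial file behind if the copy fails.
        target.unlink(missing_ok=True)
        raise
    return target
```

In a real server you would store the returned path (or just the generated name) next to the original file name in your database, so the file can be found again later.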

Related

File conversion in AWS

I am trying to find the most efficient way to process files in AWS.
Read a JSON, XML, or CSV file from an S3 bucket
Map it to another type of JSON, XML, or CSV
Save it to an S3 bucket
Right now we are using Java with AWS Lambda functions, but we end up writing lots of code.
AWS Glue looks good, but my experience with MS BizTalk has been even better.
Is there any service that can help me with this?
There are many options available within AWS for reading one file format and writing it out as another in an S3 bucket. Below are some options:
A) AWS SDK for pandas (formerly AWS Data Wrangler), which is an open source Python library from AWS ProServe. You can run it from a Lambda function or from any other server where the SDK can be installed. It provides several out-of-the-box connectors for reading and writing data from various sources and sinks. This option may be used if the volumes are low.
B) AWS Glue, using either Spark or Python, which is a serverless data integration service. It also provides a drag-and-drop option in Glue Studio to generate data pipelines using many out-of-the-box transformations. You can control the processing window by choosing the desired number of Data Processing Units (DPUs). It also has Glue Workflows for orchestration.
C) EMR, which is a petabyte-scale AWS service that you can use for high-volume distributed data processing, machine learning, and interactive analytics using open source frameworks like Apache Spark.
Which option to choose depends on the use cases you are trying to solve and on your requirements. Other factors such as volume of data, processing window, low-code/no-code options, and cost will help decide which option to leverage.
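Whichever service you pick, the "map it to another type" step itself can be quite small. As a rough sketch (not tied to any of the options above), the following uses only the Python standard library to turn CSV text into JSON while renaming columns. The function name `csv_to_json` and the `field_map` parameter are hypothetical, and reading from and writing to S3 is deliberately left out; in a Lambda you would wrap this with your SDK calls.

```python
import csv
import io
import json
from typing import Mapping


def csv_to_json(csv_text: str, field_map: Mapping[str, str]) -> str:
    """Convert CSV text to a JSON array, renaming columns via field_map.

    Columns that are not listed in field_map keep their original name.
    All values stay strings, exactly as the csv module reads them.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    records = [
        {field_map.get(name, name): value for name, value in row.items()}
        for row in reader
    ]
    return json.dumps(records, indent=2)
```

For example, `csv_to_json("id,name\n1,Ada\n", {"name": "full_name"})` yields a JSON array with one object whose keys are `id` and `full_name`.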

AWS S3 ETL tool options

Trying to get a handle on what I would use to schedule and run jobs to move data into S3, run scripts on it, and move it around S3 afterward.
My requirement is to be able to ingest from APIs and also directly from databases. Some formats to ingest will be XML, and others could be flat files. The raw files need to be joined, transformed, and turned into a format that graphs can be produced from.
What is AWS Glue like as an ETL tool? My specific question is: can you see the finished pipelines, showing the data sources and processing parts, in a graphical view once they are created?
I have used Azure Data Factory, and it had a graphical UI to view and monitor the pipelines, which I found quite useful. Just wondering if AWS Glue has a similar thing.
If not, would NiFi on AWS S3 be a good way to do this?
Thanks
If you are looking for the best GUI, I would recommend NiFi. It is commonly used with S3 and has many connectors out of the box for other data sources. It becomes even more interesting if you want to do things outside of the AWS cloud.
That being said, I would think that Glue will also get the job done.
Running Data Factory when you have a heavy AWS footprint feels like an anti-pattern.
Full disclosure: I have not worked with Glue or Data Factory, and I work for Cloudera, the driving force behind NiFi.
I'm currently using AWS Glue to extract data from a DB into S3, manipulate the data, and save it back to Redshift/S3 or send it via API to my client. The AWS Glue GUI is not that good: you won't see a diagram of your flow, and sometimes you will need other tools such as Step Functions or Airflow to orchestrate your jobs. Also, for most of my jobs I have to use PySpark because the AWS Glue methods are too limited.
As for monitoring, you can see whether there was an error, how much CPU and memory your job consumed, and the S3 bytes read/written. If you want additional information, you need to use a logger or print to send it to the logs.

How to track Amazon AWS S3 bucket downloads using Google Analytics Measurement Protocol?

I'm using AWS S3 as my CDN to store files. Often these are directly linked from places all over the world. I'd like to track the file downloads in the S3 bucket using Google Analytics. It appears Google Analytics Measurement Protocol may be able to do this. But since I'm new to both the AWS environment and GAMP, I was hoping I'm not the first to ever do this. Anyone know of a way this can be accomplished?
I doubt this is possible without you doing extra work on top.
You could create a proxy site that, when hit, records an event to Google Analytics and then redirects to the download page/bucket.
You could also maybe have some script/job/etc scrape events from the AWS dashboards and write them to Google Analytics, although this would probably be less than real-time.
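To illustrate the proxy idea, here is a rough sketch of the two pieces such a proxy would need: building the analytics hit and building the redirect. It assumes the GA4 flavour of the Measurement Protocol (a JSON POST to `/mp/collect` with `measurement_id` and `api_secret` query parameters), which is my understanding of the current API, so check it against Google's documentation before relying on it. The event name `s3_download`, the `file_key` parameter, and both function names are made up for this example. Actually sending the request (for instance with `urllib.request`) is left to the caller.

```python
import json
from urllib.parse import quote, urlencode

# Assumed GA4 Measurement Protocol endpoint; verify against Google's docs.
GA_ENDPOINT = "https://www.google-analytics.com/mp/collect"


def build_download_hit(measurement_id: str, api_secret: str,
                       client_id: str, file_key: str):
    """Return (url, body) for a POST that records one download event."""
    query = urlencode({"measurement_id": measurement_id,
                       "api_secret": api_secret})
    body = json.dumps({
        "client_id": client_id,
        # "s3_download" and "file_key" are hypothetical custom names.
        "events": [{"name": "s3_download", "params": {"file_key": file_key}}],
    }).encode("utf-8")
    return f"{GA_ENDPOINT}?{query}", body


def download_redirect(bucket_base_url: str, file_key: str):
    """Return (status, headers) redirecting the client to the real file."""
    location = bucket_base_url.rstrip("/") + "/" + quote(file_key)
    return 302, {"Location": location}
```

The proxy handler would call `build_download_hit`, POST the body to the returned URL, and then answer the browser with the status and headers from `download_redirect`. Links that point straight at the bucket would bypass the proxy, so this only counts downloads that go through it.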
You can turn on logging for the buckets you care about, then download the little logfile fragments that Amazon delivers and feed them into an off-the-shelf analytics package such as Webalizer, if you're willing to spend the time and effort to build a pipeline and massage the data so that it fits.
I've written about how to do that here:
https://www.expatsoftware.com/articles/2007/11/roll-your-own-web-stats-for-amazon-s3.html
If you just want the reports today, there are a handful of 3rd party services built around doing this for you, so if you have ~$10/month to spend that's probably the best solution.
S3stat (https://www.s3stat.com/) is my suggestion. But then it should be since it's also my product.

Reuploading to EdgeCast or Amazon S3?

I am working on a project in which we are planning to use EdgeCast to store our data. I am concerned about it, because the client wants to upload the image to our server first, and then use curl to upload it to EdgeCast. In this case our servers will be "proxying" the request, doubling the time needed for uploads.
What would you suggest? And is direct uploading risky?
PS the reason I mentioned S3 is because of its similarity to EdgeCast. Hence I assume the same principle will apply.
Yep - Martin's right - when users would otherwise get direct access to storage, it is usually a good idea to have a proxy. EdgeCast supports rsync, which will automatically sync content between your server and the EdgeCast storage account. Or you can use our "customer origin" reverse proxy feature, which lets our network pull content automatically from your servers as it's requested by the public. Feel free to contact us at sales#edgecast.com with questions.
Having your server in between the end user and the storage is probably a good idea. Whenever I give users direct access to storage, with FTP or SSH, it tends to get really messy. A place where you can upload files that then become accessible from the web gets used for all sorts of things.
With your server in between, you can organise the uploaded files into some rational structure, a folder per date for instance, and perhaps also enforce strict naming of the files themselves, avoiding URL encoding problems etc.
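As a sketch of what "a folder per date plus strict naming" might look like on the server, the following builds a storage path from the upload date and a sanitised file name. The function name `build_storage_key` and the exact set of allowed characters are my own assumptions; adjust them to whatever your storage backend accepts.

```python
import re
from datetime import date

# Anything outside lowercase letters, digits, dot, underscore, hyphen
# is collapsed into a single hyphen.
_UNSAFE = re.compile(r"[^a-z0-9._-]+")


def build_storage_key(original_name: str, uploaded_on: date) -> str:
    """Return 'YYYY/MM/DD/<safe-name>' for an uploaded file."""
    # Keep only the last path component, whether the client used / or \.
    base = original_name.replace("\\", "/").rsplit("/", 1)[-1]
    safe = _UNSAFE.sub("-", base.lower()).strip("-.")
    if not safe:
        safe = "file"  # fallback when nothing usable is left
    return f"{uploaded_on:%Y/%m/%d}/{safe}"
```

Note that this does not handle two uploads with the same name on the same day; adding a short random prefix or a counter would take care of that.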
There is no reason to be concerned about EdgeCast. In my opinion, it always does its best to serve its customers well, so that their websites are as fast as possible and also secure and well optimized. For a full comparison of EdgeCast vs Amazon, see http://jodihost.com/2014_edgecast_vs_amazon.php

Storing files (videos/images/music) in CouchDB/Cloudant vs CDN (CloudFront)?

I am new to CouchDB/Cloudant and CDN (CloudFront).
I am about to build an application using CouchDB as database.
This web application will handle a lot of files.
I know that CouchDB can store files in the database as attachments. But then I have heard about leveraging CDN to store and distribute the files all over the world.
My questions:
How does storing files in CouchDB compare to using a CDN (CloudFront)?
How does Cloudant's service compare to a CDN (CloudFront)?
Is Google storage also a CDN?
What is the difference between Amazon CloudFront and S3?
Do I have to choose to store files either in CouchDB/Cloudant or CDN, or could/should I actually combine them?
What are best practices for storing files when using CouchDB?
Some of these questions are based on your specific implementation, but here's a generalization (not in any particular order):
Unless Cloudant is mirrored on numerous servers around the world (effectively a CDN in its own right, just sans static files), a true CDN would probably have better response time. This depends mostly on how you use Cloudant (e.g., you might get good response times, but if you load the entire file into memory before outputting it, you're losing the CDN battle).
CouchDB has to process more data server-side before it can output an attachment.
CloudFront (and CDNs in general) are optimized for the fastest possible response time with the closest server.
S3 is only storage; CloudFront can use that storage as its origin and caches the content on many edge servers, serving each request from the one closest to the user requesting that content.
Yes, you have to choose between Cloudant and the CDN; one stores the files in the filesystem verbatim, the other stores them in the filesystem within the database.
I don't know the answer to some of these (e.g., how CouchDB actually handles attachment storage at a low level, or its best practices); however, this should give you enough of an idea to at least start thinking about which suits your needs best.