What is the easiest way to upload data from S3 to Redshift? - amazon-s3

I'm looking for a simple way to load data from S3 to Redshift.
I've tried AWS Glue and Kinesis Firehose, without success.
EDIT:
It's not the best way to do it, but AWS Glue is working for now. I'll revisit the COPY command to try to get better results!
Thanks guys!

The simplest solution is to use the COPY command, e.g.
create table my_table(...);
copy my_table
from 's3://my_bucket/my_prefix/data.txt'
iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>'
region 'us-west-2';
By default, the data file should contain pipe-separated plain text columns. There are plenty more options: JSON, Parquet, using a manifest file to load from multiple files, etc.
UNLOAD is the reverse command (dumping a table to S3).
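If your files are JSON rather than pipe-delimited text, the same COPY works with a format option. A minimal sketch in Python, assuming psycopg2 can reach the cluster endpoint; the host, credentials, table, bucket and role ARN below are placeholders, not values from the question:
# Hedged sketch: run a JSON COPY against Redshift via psycopg2.
# All connection details and ARNs are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-west-2.redshift.amazonaws.com",  # placeholder
    port=5439,
    dbname="dev",
    user="awsuser",
    password="REPLACE_ME",
)

copy_json = """
    copy my_table
    from 's3://my_bucket/my_prefix/data.json'
    iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>'
    region 'us-west-2'
    format as json 'auto';
"""

with conn, conn.cursor() as cur:
    cur.execute(copy_json)  # 'auto' matches JSON keys to column names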

Related

Pyspark write a DataFrame to csv files in S3 with a custom name

I am writing files to an S3 bucket with code such as the following:
df.write.format('csv').option('header','true').mode("append").save("s3://filepath")
This outputs to the S3 bucket as several files as desired, but each part has a long file name such as:
part-00019-tid-5505901395380134908-d8fa632e-bae4-4c7b-9f29-c34e9a344680-236-1-c000.csv
Is there a way to write this as a custom file name, preferably in the PySpark write function? Such as:
part-00019-my-output.csv
You can't do that with Spark alone. The long random suffixes are there to guarantee uniqueness, so nothing gets overwritten when many executors write files to the same location at once.
You'd have to use the AWS SDK to rename those files afterwards.
P.S.: If you want a single CSV file, you can use coalesce, but the file name is still not deterministic.
df.coalesce(1).write.format('csv')...
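If you do need custom names, a rename on S3 is really a copy followed by a delete. A rough boto3 sketch (the bucket, prefix and naming scheme are illustrative assumptions, and it only handles the first 1000 keys returned by the listing):
# Hedged sketch: "rename" Spark part files on S3 with boto3 (copy + delete).
# Bucket name, prefix and target file names are placeholders.
import boto3

s3 = boto3.client("s3")
bucket = "my-bucket"   # placeholder
prefix = "output/"     # placeholder: the prefix you passed to df.write.save()

resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
csv_keys = [obj["Key"] for obj in resp.get("Contents", [])
            if obj["Key"].endswith(".csv")]

for i, key in enumerate(sorted(csv_keys)):
    new_key = f"{prefix}part-{i:05d}-my-output.csv"  # desired custom name
    s3.copy_object(Bucket=bucket, Key=new_key,
                   CopySource={"Bucket": bucket, "Key": key})
    s3.delete_object(Bucket=bucket, Key=key)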

Redshift UNLOAD JSON

Is there any way to directly export a JSON file to S3 from Redshift using UNLOAD? I'm not seeing anything in the documentation (Redshift UNLOAD documentation), but maybe I'm missing something.
The COPY command supports loading JSON, so I'm surprised that there's no JSON option for UNLOAD.
For more context:
I'm loading data from one Redshift instance to another, but the respective tables have different column orders, and I need to respect the column order in the destination. Seems like the best way to do this is to work with a serialization format that doesn't care about column order.
No, unloading to a JSON file is not possible with the UNLOAD command in Redshift.
How to dump data from Redshift to JSON | DevelByte claims there is a workaround. I haven't tried it, but it might give you an idea.
Unloading as JSON is now supported by Redshift.
Ref:
https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD_command_examples.html#unload-examples-json
Example from the AWS documentation:
unload ('select * from venue')
to 's3://mybucket/unload/'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
JSON;
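For the cross-cluster use case described above, the attraction of JSON is that COPY's 'auto' option matches keys to column names, so the destination column order no longer matters. A hedged Python sketch using psycopg2; both cluster endpoints, the credentials, bucket and role ARN are placeholders:
# Hedged sketch: UNLOAD a table as JSON from one Redshift cluster, then COPY
# it into another cluster whose table has a different column order.
# All endpoints, credentials, buckets and IAM roles are placeholders.
import psycopg2

def run(conn_kwargs, sql):
    with psycopg2.connect(**conn_kwargs) as conn, conn.cursor() as cur:
        cur.execute(sql)

source = dict(host="source-cluster.example.us-west-2.redshift.amazonaws.com",
              port=5439, dbname="dev", user="awsuser", password="REPLACE_ME")
target = dict(host="target-cluster.example.us-west-2.redshift.amazonaws.com",
              port=5439, dbname="dev", user="awsuser", password="REPLACE_ME")

run(source, """
    unload ('select * from venue')
    to 's3://mybucket/unload/venue_'
    iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
    json;
""")

run(target, """
    copy venue
    from 's3://mybucket/unload/venue_'
    iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
    format as json 'auto';  -- keys matched by name, not position
""")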

How to view what is being copied in SQL

I have JSON data in an Amazon Web Services (AWS) S3 bucket. I am trying to copy it into a database (AWS Redshift).
I am using the following command:
COPY mytable FROM 's3://bucket/somedata'
iam_role 'arn:aws:iam::12345678:role/MyRole';
I suspect the bucket's data is being copied along with some additional metadata, and that this metadata is causing my COPY command to fail.
Can you tell me whether it is possible to see the data that is being copied somehow?
Thanks in advance!
If your COPY command fails, check the stl_load_errors system table. Its raw_line column shows the raw data that caused the failure, and other columns provide more details about the error.
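A quick way to pull the most recent errors (a sketch with psycopg2; the connection details are placeholders) is to query that table directly:
# Hedged sketch: inspect recent COPY failures via the stl_load_errors table.
# Connection parameters are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-west-2.redshift.amazonaws.com",  # placeholder
    port=5439, dbname="dev", user="awsuser", password="REPLACE_ME")

query = """
    select starttime, filename, line_number, colname, err_reason, raw_line
    from stl_load_errors
    order by starttime desc
    limit 10;
"""

with conn, conn.cursor() as cur:
    cur.execute(query)
    for row in cur.fetchall():
        print(row)  # raw_line is the input row Redshift rejected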

How to load multiple huge csv (with different columns) into AWS S3

I have around 50 csv files, each with a different structure and close to 1000 columns. I am using DictReader to merge the csv files locally, but it is taking too much time: my approach was to merge 1.csv and 2.csv into 12.csv, then merge 12.csv with 3.csv, and so on. This is clearly not the right approach.
import csv

for filename in inputs:
    with open(filename, "r", newline="") as f_in:
        reader = csv.DictReader(f_in)  # uses the field names in this file
Since I have to finally upload this huge single csv to AWS, I was thinking about a better AWS based solution. Any suggestions on how I can import these multiple different structure csv and merge it in AWS?
Launch an EMR cluster and merge the files with Apache Spark; this gives you complete control over the schema (see this answer for an example).
Alternatively, you can try your luck and see how AWS Glue handles the multiple schemas when you create a crawler.
In both cases you should copy your data to S3 first.
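If you go the Spark route, one way to reconcile the differing schemas is to read each file and union the DataFrames by column name, letting Spark fill missing columns with nulls. A sketch, assuming Spark 3.1+ (where unionByName accepts allowMissingColumns) and placeholder paths:
# Hedged sketch: merge CSV files with different columns in PySpark by
# unioning on column names; missing columns become nulls.
# Requires Spark 3.1+; the S3 paths are placeholders.
from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("merge-csvs").getOrCreate()

paths = [
    "s3://my-bucket/input/1.csv",   # placeholder inputs
    "s3://my-bucket/input/2.csv",
    "s3://my-bucket/input/3.csv",
]

frames = [spark.read.option("header", "true").csv(p) for p in paths]

merged = reduce(lambda a, b: a.unionByName(b, allowMissingColumns=True), frames)

merged.write.option("header", "true").csv("s3://my-bucket/merged/")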

loading a pg_dump off of s3 into redshift

I'm trying to load a complete database dump into Redshift. Is there a single command to restore the data from a pg_dump living on s3 into Redshift? If not, what are the best steps for tackling this?
Thanks
If you have a non-compressed pg_dump, this should be possible using a psql command (you may need to edit the dump manually to get syntax Redshift accepts, depending on your versions and the options set).
However, this is a very inefficient and slow way to load Redshift, and I do not recommend it. If your tables are large it could take days or weeks!
What you need to do is this:
1. Create the target tables on Redshift based on the source tables, taking sort keys and distribution style into account.
2. Unload your Postgres source tables into CSV files using the Postgres "copy" command.
3. If the source CSV files are very big (e.g. more than, say, 100 MB), consider splitting them into separate files, as they will load faster (Redshift will parallelize the load).
4. gzip the CSV files (recommended but not essential).
5. Upload these CSV files to S3, with a separate folder per table.
6. Load the data into Redshift from S3 using the Redshift COPY command (see the sketch below).
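The last step could look roughly like this (a sketch only; the table name, bucket layout, connection details and role ARN are placeholders, and the gzip/csv options correspond to the steps above):
# Hedged sketch: COPY gzipped CSV files from S3 into Redshift via psycopg2.
# Connection details, table name, bucket prefix and IAM role are placeholders.
import psycopg2

copy_sql = """
    copy my_table
    from 's3://my_bucket/my_table/'   -- one folder (prefix) per table
    iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>'
    region 'us-west-2'
    gzip
    csv;
"""

conn = psycopg2.connect(
    host="my-cluster.abc123.us-west-2.redshift.amazonaws.com",  # placeholder
    port=5439, dbname="dev", user="awsuser", password="REPLACE_ME")

with conn, conn.cursor() as cur:
    cur.execute(copy_sql)  # Redshift loads the files under the prefix in parallel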