loading a pg_dump off of s3 into redshift

I'm trying to load a complete database dump into Redshift. Is there a single command to restore the data from a pg_dump living on s3 into Redshift? If not, what are the best steps for tackling this?
Thanks

If you have a non-compressed pg_dump, this should be possible using a psql command (you may need to edit it manually to get the right syntax, depending on your versions and the options set).
However, this is a very inefficient and slow way to load Redshift, and I do not recommend it. If your tables are large, it could take days or weeks!
What you need to do is this:
Create target tables on Redshift based upon the source tables, but considering sort keys and distribution.
Unload your Postgres source tables into CSV files using the Postgres COPY command.
If the source CSV files are very big (e.g. more than, say, 100 MB), consider splitting these into separate files, as they will load faster (Redshift will parallelize).
Gzip the CSV files (recommended but not essential).
Upload these CSV files to S3, with a separate folder per table.
Load the data into Redshift from S3 by using the Redshift COPY command (a sketch of that final command follows below).
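For that last step, here is a minimal sketch of what the Redshift COPY might look like, run from Python with psycopg2. The cluster endpoint, credentials, table name, S3 prefix, and IAM role below are all placeholders, and the CSV GZIP options assume you compressed the files as suggested above.

import psycopg2  # assumes the psycopg2 driver is installed

# Placeholder connection details -- replace with your cluster endpoint and credentials.
conn = psycopg2.connect(
    host="your-cluster.xxxx.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="yourdb",
    user="youruser",
    password="yourpassword",
)

# COPY loads every gzipped CSV under the per-table S3 prefix, in parallel across slices.
copy_sql = """
    COPY your_target_table
    FROM 's3://your-bucket/your_target_table/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/YourRedshiftRole'
    CSV GZIP;
"""

with conn, conn.cursor() as cur:
    cur.execute(copy_sql)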


Which file format should I use that supports appending?

Currently we use the ORC file format to store incoming traffic in S3 for fraud detection analysis.
We chose ORC for the following reasons:
compression
and the ability to query the data using Athena
Problem:
ORC files are read-only, and we want to update the file contents constantly, every 20 minutes, which implies we need to:
download the ORC files from S3,
read the file,
write to the end of the file,
and finally upload it back to S3.
This was not a problem at first, but the data grows significantly, by about 2 GB every day, and it is a highly costly process to download ~10 GB of files, read them, append to them, and upload them again.
Question:
Is there another file format that offers appends/inserts and can still be queried by Athena?
From this article it sounds like Avro is such a format, but I am not sure
whether Athena can be used to query it,
or whether there are any other issues.
Note: my skill level with big data technologies is beginner.
If your table is not partitioned, you can simply copy (aws s3 cp) your new ORC files to the table's target S3 path and they will be available instantly for querying via Athena.
If your table is partitioned, you can copy new files to the paths corresponding to your specific partitions. After copying the new files into a partition, you need to add or update that partition in Athena's metastore.
For example, if your table is partitioned by date, then you need to run this query to ensure your partition gets added/updated:
alter table dataset.tablename add if not exists
partition (date = YYYYMMDD)
location 's3://your-bucket/path_to_table/date=YYYYMMDD/'
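As a rough illustration (not from the original answer), the copy-then-register flow could be driven from Python with boto3; the bucket, table, database, partition value, and Athena result location below are placeholders, and the quoted partition value assumes a string-typed partition column:

import boto3  # assumes boto3 is configured with credentials that can access S3 and Athena

s3 = boto3.client("s3")
athena = boto3.client("athena")

date = "20240101"  # placeholder partition value

# 1. Copy the new ORC file into the partition's path under the table location.
s3.copy_object(
    Bucket="your-bucket",
    CopySource={"Bucket": "your-bucket", "Key": "incoming/new_file.orc"},
    Key=f"path_to_table/date={date}/new_file.orc",
)

# 2. Add (or refresh) the partition in Athena's metastore so the new data is queryable.
athena.start_query_execution(
    QueryString=(
        "ALTER TABLE dataset.tablename ADD IF NOT EXISTS "
        f"PARTITION (date = '{date}') "
        f"LOCATION 's3://your-bucket/path_to_table/date={date}/'"
    ),
    QueryExecutionContext={"Database": "dataset"},
    ResultConfiguration={"OutputLocation": "s3://your-bucket/athena-results/"},
)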

creating a single parquet file in s3 pyspark job

I have written a PySpark program that reads data from Cassandra and writes it to AWS S3. Before writing to S3 I have to do repartition(1) or coalesce(1), as this creates one single file; otherwise it creates multiple Parquet files in S3.
Using repartition(1) or coalesce(1) has a performance cost, and I feel that creating one big partition is not a good option with huge data.
What are the ways to create one single file in S3 without compromising on performance?
coalesce(1) or repartition(1) will put all your data on 1 partition (with a shuffle step when you use repartition, compared to coalesce). In that case, only 1 worker has to write all your data, which is the reason why you have performance issues - you already figured that out.
That is the only way you can use Spark to write 1 file on S3. Currently, there is no other way using just Spark.
Using Python (or Scala), you can do some other things. For example, you write all your files with Spark without changing the number of partitions, and then:
you acquire your files with Python
you concatenate your files into one
you upload that one file to S3.
This works well for CSV, but not so well for non-sequential file types such as Parquet (a sketch of the approach follows below).
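A minimal sketch of that concatenate-outside-Spark approach, assuming the job wrote headerless CSV part files to S3 and that boto3 is available; the bucket, prefixes, and output key are placeholders:

import shutil
import boto3

s3 = boto3.client("s3")
bucket = "your-bucket"        # placeholder
prefix = "output/my_job/"     # prefix where Spark wrote the part-*.csv files

# List the CSV part files the Spark job produced (one per partition).
# Note: list_objects_v2 returns at most 1000 keys, which this sketch assumes is enough.
parts = [
    obj["Key"]
    for obj in s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get("Contents", [])
    if obj["Key"].endswith(".csv")
]

# Stream the parts into one local file, then upload that single file back to S3.
with open("merged.csv", "wb") as out:
    for key in sorted(parts):
        body = s3.get_object(Bucket=bucket, Key=key)["Body"]
        shutil.copyfileobj(body, out)

s3.upload_file("merged.csv", bucket, "output/merged.csv")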

Is there any performance-wise better option for exporting data to local than Redshift unload via s3?

I'm working on a Spring project that needs to export Redshift table data into a single local CSV file. The current approach is to:
Execute a Redshift UNLOAD to write the data across multiple files to S3 via JDBC
Download those files from S3 to local storage
Join them together into one single CSV file
UNLOAD (
'SELECT DISTINCT #{#TYPE_ID}
FROM target_audience
WHERE #{#TYPE_ID} is not null
AND #{#TYPE_ID} != \'\'
GROUP BY #{#TYPE_ID}'
)
TO '#{#s3basepath}#{#s3jobpath}target_audience#{#unique}_'
credentials 'aws_access_key_id=#{#accesskey};aws_secret_access_key=#{#secretkey}'
DELIMITER AS ',' ESCAPE GZIP ;
The above approach has been fine and all, but I think the overall performance could be improved by, for example, skipping the S3 part and getting the data directly from Redshift to local storage.
After searching through online resources, I found that you can export data from Redshift directly through psql, or perform SELECT queries and move the result data myself. But neither option can top Redshift UNLOAD's performance with parallel writing.
So is there any way I can mimic UNLOAD's parallel writing to achieve the same performance without having to go through S3?
You can avoid the need to join files together by using UNLOAD with the PARALLEL OFF parameter. It will output only one file.
This will, however, create multiple files if the file size exceeds 6.2 GB.
See: UNLOAD - Amazon Redshift
It is doubtful that you would get better performance by running psql, but if performance is important to you then you can certainly test the various methods.
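For illustration, here is a simplified version of the question's UNLOAD with PARALLEL OFF added, executed from Python with psycopg2. The endpoint, credentials, IAM role, and S3 prefix are placeholders, and the query is trimmed down from the original:

import psycopg2  # assumes network access to the Redshift endpoint

# PARALLEL OFF makes Redshift write a single output file
# (it will still split the output if it grows past about 6.2 GB).
unload_sql = """
    UNLOAD ('SELECT DISTINCT type_id FROM target_audience WHERE type_id IS NOT NULL')
    TO 's3://your-bucket/your-prefix/target_audience_'
    IAM_ROLE 'arn:aws:iam::123456789012:role/YourRedshiftRole'
    DELIMITER AS ',' ESCAPE GZIP
    PARALLEL OFF;
"""

conn = psycopg2.connect(
    host="your-cluster.xxxx.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="yourdb", user="youruser", password="yourpassword",
)
conn.autocommit = True  # run the UNLOAD outside an explicit transaction
with conn.cursor() as cur:
    cur.execute(unload_sql)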
We do exactly what you are trying to do here. In our performance comparison it turned out to be almost the same, or in some cases even better, for our use case. It is also easier to program and debug, as there is practically only one step.
# replace user/password, host, region and dbname appropriately in the given command
psql postgresql://user:password@xxx1.xxxx.us-region-1.redshift.amazonaws.com:5439/dbname?sslmode=require -c "select C1,C2 from sch1.tab1" > ABC.csv
This enables us to avoid 3 steps:
Unload using JDBC
Download the exported data from S3
Decompress the gzip file (we used gzip to save network input/output).
On the other hand, it also saves some cost (S3 storage, though that is negligible).
By the way, from pgsql 9.0 onwards, sslcompression is on by default.

How to load multiple huge csv (with different columns) into AWS S3

I have around 50 CSV files, each with a different structure. Each CSV file has close to 1000 columns. I am using DictReader to merge the CSV files locally, but it is taking too much time. The approach was to merge 1.csv and 2.csv to create 12.csv, then merge 12.csv with 3.csv, and so on. This is not the right approach.
import csv

for filename in inputs:
    with open(filename, "r", newline="") as f_in:
        reader = csv.DictReader(f_in)  # Uses the field names in this file
Since I finally have to upload this huge single CSV to AWS, I was thinking about a better AWS-based solution. Any suggestions on how I can import these multiple CSVs with different structures and merge them in AWS?
Launch an EMR cluster and merge the files with Apache Spark. This gives you complete control over the schema. This answer might help for example.
Alternatively, you can also try your luck and see how AWS Glue handles the multiple schemas when you create a crawler.
In either case, you should copy your data to S3 first.
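A hedged sketch of the Spark merge on EMR, assuming Spark 3.1+ for unionByName(allowMissingColumns=True); the input paths, header option, and output location are placeholders:

from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("merge-csvs").getOrCreate()

# Placeholder input paths -- 50 CSV files, each with a different set of ~1000 columns.
paths = [f"s3://your-bucket/input/{i}.csv" for i in range(1, 51)]

frames = [spark.read.option("header", "true").csv(p) for p in paths]

# unionByName with allowMissingColumns (Spark 3.1+) aligns columns by name and
# fills columns that are missing from a given file with nulls.
merged = reduce(
    lambda a, b: a.unionByName(b, allowMissingColumns=True),
    frames,
)

merged.write.mode("overwrite").option("header", "true").csv("s3://your-bucket/merged/")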

Incrementally add data to Parquet tables in S3

I would like to keep a copy of my log data in Parquet on S3 for ad hoc analytics. I mainly work with Parquet through Spark, and that only seems to offer operations to read and write whole tables via SQLContext.parquetFile() and SQLContext.saveAsParquetFile().
Is there any way to add data to an existing Parquet table without writing a whole new copy of it, particularly when it is stored in S3?
I know I can create separate tables for the updates, and in Spark I can form the union of the corresponding DataFrames at query time, but I have my doubts about the scalability of that.
I can use something other than Spark if needed.
The way to append to a Parquet table is to use SaveMode.Append:
`yourDataFrame.write.mode(SaveMode.Append).parquet("/your/file")`
You don't need to union DataFrames after creating them separately; just supply all the paths relevant to your query to parquetFile(paths) and get one DataFrame, just as the signature of the Parquet read method, sqlContext.parquetFile(paths: String*), suggests.
Under the hood, in the new ParquetRelation2, all the .parquet files from all the folders you supply, as well as all the _common_metadata and _metadata files, are collected into a single list and regarded equally.
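For reference, here is a rough PySpark equivalent of both points under a current Spark version, where DataFrameWriter.mode("append") plays the role of SaveMode.Append and spark.read.parquet replaces sqlContext.parquetFile; the S3 paths and the JSON log source below are assumptions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-append").getOrCreate()

# Append new log rows to the existing Parquet data on S3 instead of rewriting it all;
# "append" in PySpark corresponds to SaveMode.Append in the Scala API.
new_logs = spark.read.json("s3://your-bucket/incoming/2015-05-01/")  # placeholder source
new_logs.write.mode("append").parquet("s3://your-bucket/logs_parquet/")

# Reading several Parquet paths as one DataFrame, with no union needed
# (the modern equivalent of sqlContext.parquetFile(paths: String*)).
combined = spark.read.parquet(
    "s3://your-bucket/logs_parquet/",
    "s3://your-bucket/logs_parquet_updates/",
)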