PgSQL - Export select query data directly to Amazon S3 with headers

I have this requirement where I need to export the report data directly to CSV, since getting the array/query response, then building the CSV, and then uploading the final CSV to Amazon takes time. Is there a way I can create the CSV directly with Redshift PostgreSQL?
PgSQL - Export select query data directly to Amazon S3 servers with headers
Here is my version of PgSQL: PostgreSQL 8.0.2 on Amazon Redshift.
Thanks

You can use the UNLOAD statement to save results to an S3 bucket. Keep in mind that this will create multiple files (at least one per compute node).
You will have to download all the files, combine them locally, sort them (if needed), then add column headers and upload the result back to S3.
Using an EC2 instance for this shouldn't take long; the connection between EC2 and S3 is quite fast.
In my experience, the quickest method is to use shell commands:
# run the UNLOAD query on Redshift
export PGPASSWORD='__your__redshift__pass__'
psql \
-h __your__redshift__host__ \
-p __your__redshift__port__ \
-U __your__redshift__user__ \
__your__redshift__database__name__ \
-c "UNLOAD __rest__of__query__"
# download all the results
s3cmd get s3://path_to_files_on_s3/bucket/files_prefix*
# merge all the files into one
cat files_prefix* > files_prefix_merged
# sort merged file by a given column (if needed)
sort -n -k2 files_prefix_merged > files_prefix_sorted
# add column names to destination file
echo -e "column 1 name\tcolumn 2 name\tcolumn 3 name" > files_prefix_finished
# add merged and sorted file into destination file
cat files_prefix_sorted >> files_prefix_finished
# upload destination file to s3
s3cmd put files_prefix_finished s3://path_to_files_on_s3/bucket/...
# cleanup
s3cmd del s3://path_to_files_on_s3/bucket/files_prefix*
rm files_prefix* files_prefix_merged files_prefix_sorted files_prefix_finished
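For reference, here is a hedged sketch of what the UNLOAD statement behind the __rest__of__query__ placeholder might look like; the table, columns, bucket prefix and credentials below are illustrative placeholders, not values from the question:
# sketch only: expanding __rest__of__query__ into a full UNLOAD statement
psql \
-h __your__redshift__host__ \
-p __your__redshift__port__ \
-U __your__redshift__user__ \
__your__redshift__database__name__ \
-c "UNLOAD ('SELECT col1, col2, col3 FROM my_report_table')
    TO 's3://path_to_files_on_s3/bucket/files_prefix'
    CREDENTIALS 'aws_access_key_id=__key__;aws_secret_access_key=__secret__'
    DELIMITER AS ','
    ALLOWOVERWRITE"
Options such as GZIP or PARALLEL OFF can be appended to compress the output or reduce the number of files Redshift writes.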

Related

Query runs successfully and fetches empty result from user defined bucket, scope, and collection

I have set up a local couchbase one node cluster environment on Ubuntu.
The query runs and fetches results from the default bucket after importing all the JSON documents in the zip folder into the default bucket using the cbdocloader command.
Command:
/opt/couchbase/bin/cbdocloader -c localhost:8091 -u Administrator -p 10i-0113 -b mybucket -m 100 -d Downloads/JSONs_List20211229-20220123T140145Z-001.zip
The query runs but fetches an empty result from the user-defined bucket, scope, and collection, and I can't find the reason for this, although I have successfully imported the JSON documents using the command below:
/opt/couchbase/bin/cbimport json -c localhost:8091 -u Administrator -p 10i-0113 -b One_bucket -f lines -d file://'fileset__e53c883b-bc30-42cb-b4f7-969998c91e3d.json' -t 2 -g %type%::%id% --scope-collection-exp Raw.%type%
My guess is that when I try to create the index, it creates an index on the default bucket, and I cannot find a way to create an index on my custom bucket.
Please assist
I have fixed it :). I was not getting any results when I queried the collection because there was no index created on it.
Creating the index fixed the issue.
CREATE PRIMARY INDEX ON default:onebucket.rawscope.fileset
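To double-check that the collection is now queryable from the shell, something like the following should return a non-zero count (this assumes cbq is available in the server's bin directory and reuses the credentials from the commands above; the query itself is only an illustrative check):
# hypothetical sanity check against the newly indexed collection
/opt/couchbase/bin/cbq -e http://localhost:8091 -u Administrator -p '10i-0113' \
-s "SELECT COUNT(*) AS docs FROM default:onebucket.rawscope.fileset;"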

How to see output from executors on Amazon EMR?

I am running the following code on AWS EMR:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("PythonPi") \
    .getOrCreate()
sc = spark.sparkContext

def f(_):
    print("executor running")  # <= I can not find this output
    return 1

from operator import add
output = sc.parallelize(range(1, 3), 2).map(f).reduce(add)
print(output)  # <= I found this output
spark.stop()
I am writing logs to S3 (the Log URI is s3://brand17-logs/).
I can see the output from the master node here:
s3://brand17-logs/j-20H1NGEP519IG/containers/application_1618292556240_0001/container_1618292556240_0001_01_000001/stdout.gz
Where can I see the output from the executor nodes?
I see this output when running locally.
You are almost there; you were browsing the right log files.
The general convention for the stored logs is this: inside the containers path there are one or more application_<id> directories, and within each of them there are multiple container directories. The first container (the one ending in ..._000001, as in your path above) belongs to the driver, and the rest belong to the executors.
I have no official documentation where this is stated, but I have seen it in all my clusters.
So if you browse to the other container directories, you will be able to see the executor log files.
Having said that, it is very painful to browse through so many executors and search for a log message.
How I personally look at logs from an EMR cluster:
Log in to an EC2 instance that has enough access to download the files from S3 where the EMR logs are saved.
Navigate to the right path on the instance.
mkdir -p /tmp/debug-log/ && cd /tmp/debug-log/
Download all the files from S3 in a recursive manner.
aws s3 cp --recursive s3://your-bucket-name/cluster-id/ .
In your case, it would be
aws s3 cp --recursive s3://brand17-logs/j-20H1NGEP519IG/ .
Uncompress the log files:
find . -type f -exec gunzip {} \;
Now that all the compressed files are uncompressed, we can do a recursive grep like below:
grep -inR "message-that-i-am-looking-for"
The flags given to grep mean the following:
i -> case insensitive
n -> will display the file and line number where the message is present
R -> search it in a recursive manner.
Open the exact file pointed to by the grep output in vi and look at the more relevant log lines in that file.
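Putting the steps above together for this particular question, a sketch might look like the following (the bucket and cluster id come from the question, and the grep pattern is just the text printed inside the executor function):
mkdir -p /tmp/debug-log/ && cd /tmp/debug-log/
aws s3 cp --recursive s3://brand17-logs/j-20H1NGEP519IG/containers/ .
find . -type f -name '*.gz' -exec gunzip {} \;
# the executor output should appear in containers other than ..._000001
grep -inR "executor running" .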
More readings can be found here:
View Log Files
access spark log

impala shell command to export a parquet file as a csv

I have some Parquet files stored in HDFS that I want to convert to CSV files first and then export to a remote file using ssh.
I don't know whether it is possible or simple to do by writing a Spark job (I know we can convert Parquet to CSV just by using spark.read.parquet and then calling spark.write on the same DataFrame as a CSV file), but I really want to do it with an impala-shell request.
So, I thought about something like this:
hdfs dfs -cat my-file.parquet | ssh myserver.com 'cat > /path/to/my-file.csv'
Can you please help me with this request?
Thank you!
Example without Kerberos:
impala-shell -i servername:portname -B -q 'select * from table' -o filename '--output_delimiter=\001'
I could explain it all, but it is late; here is a link that shows how to do that, as well as how to get the header if you want it: http://beginnershadoop.com/2019/10/02/impala-export-to-csv/
You can do that in multiple ways.
One approach could be as in the example below.
With impala-shell you can run a query and pipe to ssh to write the output in a remote machine.
$ impala-shell --quiet --delimited --print_header --output_delimiter=',' -q 'USE fun; SELECT * FROM games' | ssh remoteuser@ip.address.of.remote.machine "cat > /home/..../query.csv"
This command switches from the default database to the fun database and runs a query on it.
You can change --output_delimiter (for example to '\t'), include or omit --print_header, and adjust other options.
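Adapting the above to the original question, a hedged sketch might look like this (the table name, Impala daemon address, remote user, and destination path are placeholders, and it assumes the Parquet files are already exposed as an Impala table):
# placeholders throughout; -B gives delimited output and --print_header adds the CSV header row
impala-shell -i servername:portname -B --print_header --output_delimiter=',' \
-q 'SELECT * FROM my_parquet_table' \
| ssh remoteuser@myserver.com 'cat > /path/to/my-file.csv'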

Compare (not sync) the contents of a local folder and an AWS S3 bucket

I need to compare the contents of a local folder with an AWS S3 bucket so that, where there are differences, a script is executed on the local files.
The idea is that local files (pictures) get encrypted and uploaded to S3. Once the upload has occurred, I delete the encrypted copies of the pictures to save space. The next day, new files get added to the local folder. I need to check between the local folder and the S3 bucket which pictures have already been encrypted and uploaded, so that I only encrypt the newly added pictures rather than all of them all over again. I have a script that does exactly this between two local folders, but I'm struggling to adapt it so that the comparison is performed between a local folder and an S3 bucket.
Thank you to anyone who can help.
Here is the actual script I am currently using for my picture sorting, encryption and back up to S3:
#!/bin/bash
perl /volume1/Synology/scripts/Exiftool/exiftool '-createdate
perl /volume1/Synology/scripts/Exiftool/exiftool '-model=camera model missing' -r -if '(not $model)' -overwrite_original -r /volume1/photo/"input"/ --ext .DS_Store -i "@eaDir"
perl /volume1/Synology/scripts/Exiftool/exiftool '-Directory
cd /volume1/Synology/Pictures/"Pictures Glacier back up"/"Compressed encrypted pics for Glacier"/post_2016/ && (cd /volume1/Synology/Pictures/Pictures/post_2016/; find . -type d ! -name .) | xargs -i mkdir -p "{}"
while IFS= read -r file; do /usr/bin/gpg --encrypt -r xxx@yyy.com /volume1/Synology/Pictures/Pictures/post_2016/**///$(basename "$file" .gpg); done < <(comm -23 <(find /volume1/Synology/Pictures/Pictures/post_2016 -type f -printf '%f.gpg\n'|sort) <(find /volume1/Synology/Pictures/"Pictures Glacier back up"/"Compressed encrypted pics for Glacier"/post_2016 -type f -printf '%f\n'|sort))
rsync -zarv --exclude=@eaDir --include="*/" --include="*.gpg" --exclude="*" /volume1/Synology/Pictures/Pictures/post_2016/ /volume1/Synology/Pictures/"Pictures Glacier back up"/"Compressed encrypted pics for Glacier"/post_2016/
find /volume1/Synology/Pictures/Pictures/post_2016/ -name "*.gpg" -type f -delete
/usr/bin/aws s3 sync /volume1/Synology/Pictures/"Pictures Glacier back up"/"Compressed encrypted pics for Glacier"/post_2016/ s3://xyz/Pictures/post_2016/ --exclude "*" --include "*.gpg" --sse
It would be inefficient to continually compare the local and remote folders, especially as the quantity of objects increases.
A better flow would be:
Unencrypted files are added to a local folder
Each file is:
Copied to another folder in an encrypted state
Once that action is confirmed, the original file is then deleted
Files in the encrypted local folder are copied to S3
Once that action is confirmed, the source file is then deleted
The AWS Command-Line Interface (CLI) has an aws s3 sync command that makes it easy to copy new/modified files to an Amazon S3 bucket, but this could be slow if you have thousands of files.
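A minimal sketch of that flow, assuming GnuPG and the AWS CLI are configured as in the question's script; the directory names, recipient address, and bucket path are placeholders:
#!/bin/bash
# hedged sketch of the flow above; paths, recipient and bucket are placeholders
SRC=/volume1/photo/new_pictures           # unencrypted originals arrive here
ENC=/volume1/photo/encrypted_outbox       # encrypted copies wait here for upload
BUCKET=s3://xyz/Pictures/post_2016/

# 1. encrypt each new file into the outbox; delete the original only if gpg succeeded
for f in "$SRC"/*; do
    [ -f "$f" ] || continue
    if /usr/bin/gpg --encrypt -r xxx@yyy.com --output "$ENC/$(basename "$f").gpg" "$f"; then
        rm "$f"
    fi
done

# 2. copy the encrypted files to S3; delete the local encrypted copies only if the sync succeeded
if /usr/bin/aws s3 sync "$ENC/" "$BUCKET" --exclude "*" --include "*.gpg" --sse; then
    find "$ENC" -name "*.gpg" -type f -delete
fi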

Not able to download multiple dynamoDB tables by using dynamodump

I am not able to download multiple DynamoDB tables using dynamodump:
$ python dynamodump.py -m backup -r us-east-1 -s 'DEV_*'
INFO:root:Found 0 table(s) in DynamoDB host to backup:
INFO:root:Backup of table(s) DEV_* completed!
But I'm able to download if I give a single table name, or "*" (which downloads all DynamoDB tables).
I have followed the procedure in the link below:
https://github.com/bchew/dynamodump
Can anyone suggest how to download multiple DynamoDB tables with a specific pattern (like QA_* / DEV_* / PROD_* / TEST_*)?
# list the table names, keep the ones matching the pattern, and back up each table
for i in $(aws dynamodb list-tables | jq -r '.TableNames[]' | grep '^QA_');
do
echo "======= Starting backup of $i $(date) =========="
python dynamodump.py -m backup -r us-east-1 -s "$i"
done
The above script will work if you want to back up multiple DynamoDB tables. Prior to running the script, you have to install jq (https://stedolan.github.io/jq/download/).