Getting results of a pig script on a remote cluster - apache-pig

Is there a way to get the results of a pig script run on a remote cluster directly without STORE-ing them and retrieving them separately?

You can use Pig parameters to run your scripts. For example:
example.pig
A = LOAD '$PATH_TO_FOLDER_WITH_DATA' AS (f1:int, f2:int, f3:int);
-- do something with your data to produce a result alias, e.g. C
STORE C INTO '$OUTPUT_PATH';
Then you can run the script, passing each parameter as a name=value pair with -p:
pig -p PATH_TO_FOLDER_WITH_DATA=/path/to/the/data -p OUTPUT_PATH=/path/to/the/output example.pig
To automate it in Bash:
storelocal.sh
#!/bin/bash
# $1 = input path for the Pig script, $2 = local output file
PATH_TO_HDFS_OUT=/tmp/pig_output_$$   # intermediate HDFS output directory (adjust as needed)
pig -p PATH_TO_FOLDER_WITH_DATA="$1" -p OUTPUT_PATH="$PATH_TO_HDFS_OUT" example.pig
hdfs dfs -getmerge "$PATH_TO_HDFS_OUT" "$2"
And you can run it with: ./storelocal.sh /path/to/the/input/data /path/to/the/local/output
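If you only need to look at the result (or redirect it to a local file) without a STORE at all, a minimal sketch is to let DUMP write the relation to stdout; the input path and schema here are placeholders, and keep in mind that DUMP emits Pig's tuple formatting rather than a plain delimited file:
pig -e "A = LOAD '/hdfs/path/to/data' AS (f1:int, f2:int, f3:int); DUMP A;" > results.txt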

Related

Export Query Result as CSV file from Docker PostgreSQL container to local machine

I'm not sure if this is possible or if I'm doing something wrong, since I'm still pretty new to Docker. Basically, I want to export a query result from inside a PostgreSQL docker container as a csv file to my local machine.
This is how far I got. First, I run my PostgreSQL docker container with this command:
sudo docker run --rm --name pg-docker -e POSTGRES_PASSWORD=something -d -p 5432:5432 -v $HOME/docker/volumes/postgres:/var/lib/postgresql/data postgres
Then I access the docker container with docker exec to run the PostgreSQL command that should copy the query result to a csv file at a specified location, like this:
\copy (select id,value from test) to 'test_1.csv' with csv;
I thought that would export the query result as a csv file named test_1.csv on my local machine, but I couldn't find the file anywhere. I also checked both of these directories: $HOME/docker/volumes/postgres and /var/lib/postgresql/data.
You can export the data to STDOUT and pipe the result into a file on the client machine:
docker exec -it -u database_user_name container_name \
psql -d database_name -c "COPY (SELECT * FROM table) TO STDOUT CSV" > output.csv
-c tells psql to execute the given SQL statement once the connection is established.
So your command should look like this:
docker exec -it -u postgres pg-docker \
psql -d yourdb -c "COPY (SELECT * FROM test) TO STDOUT CSV" > test_1.csv
The /var/lib/postgresql/data directory is where the database server stores its data files. It isn't a directory that you need to manipulate directly, and it isn't where your exported file will end up.
Paths like test_1.csv are relative to the working directory. The default directory when you enter the postgres container with docker exec is /, so that's where your file should be. You can also switch to another directory with cd before running psql:
root@b9e5a0572207:/# cd /some/other/path/
root@b9e5a0572207:/some/other/path# psql -U postgres
... or you can provide an absolute path:
\copy (select id,value from test) to '/some/other/path/test_1.csv' with csv;
You can use docker cp to transfer a file from the container to the host:
docker cp pg-docker:/some/other/path/test_1.csv /tmp
... or you can create a volume if this is something you do often.
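If you take the volume route, a minimal sketch is to add a second bind mount when starting the container (the $HOME/exports and /exports paths are just illustrative names):
sudo docker run --rm --name pg-docker -e POSTGRES_PASSWORD=something -d -p 5432:5432 -v $HOME/docker/volumes/postgres:/var/lib/postgresql/data -v $HOME/exports:/exports postgres
Then, inside psql, \copy (select id,value from test) to '/exports/test_1.csv' with csv; writes the file so it appears under $HOME/exports on the host (you may need to adjust the directory's permissions).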

impala shell command to export a parquet file as a csv

I have some parquet files stored in HDFS that I want to convert to csv files first and then export to a remote file using ssh.
I don't know if it's possible or simple to do by writing a Spark job (I know we can convert parquet to csv just by using spark.read.parquet and then writing the same DataFrame out as csv with spark.write). But I really want to do it with an impala-shell request.
So I thought about something like this:
hdfs dfs -cat my-file.parquet | ssh myserver.com 'cat > /path/to/my-file.csv'
Can you please help me with this request?
Thank you!
Example without kerberos:
impala-shell -i servername:port -B -q 'select * from table' -o filename --output_delimiter='\001'
I could explain it all, but it's late; here is a link that walks you through it, including the header if you want it: http://beginnershadoop.com/2019/10/02/impala-export-to-csv/
You can do that in multiple ways.
One approach could be as in the example below.
With impala-shell you can run a query and pipe the output to ssh to write it to a file on a remote machine.
$ impala-shell --quiet --delimited --print_header --output_delimiter=',' -q 'USE fun; SELECT * FROM games' | ssh remoteuser@ip.address.of.remote.machine "cat > /home/..../query.csv"
This command switches from the default database to the fun database and runs a query on it.
You can change --output_delimiter (for example to '\t'), include or omit --print_header, and adjust other options as needed.
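For instance, a tab-delimited export with a header row written to a local file could look like this (the output path is a placeholder; the table is the same one as above):
impala-shell --quiet --delimited --print_header --output_delimiter='\t' -q 'SELECT * FROM fun.games' -o /tmp/games.tsv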

Run Bash script on GCP Dataproc

I want to run a shell script on Dataproc which will execute my Pig scripts with arguments. These arguments are always dynamic and are calculated by the shell script.
Currently these scripts run on AWS with the help of script-runner.jar. I am not sure how to move this to Dataproc. Is there anything similar available for Dataproc?
Or will I have to change all my scripts and calculate the arguments in Pig with the help of pig sh or pig fs?
As Aniket mentions, pig sh would itself be considered the script-runner for Dataproc jobs; instead of having to turn your wrapper script into a Pig script in itself, just use Pig to bootstrap any bash script you want to run. For example, suppose you have an arbitrary bash script hello.sh:
gsutil cp hello.sh gs://${BUCKET}/hello.sh
gcloud dataproc jobs submit pig --cluster ${CLUSTER} \
-e 'fs -cp -f gs://${BUCKET}/hello.sh file:///tmp/hello.sh; sh chmod 750 /tmp/hello.sh; sh /tmp/hello.sh'
The pig fs command uses Hadoop paths, so to copy your script from GCS you must copy it to a destination specified as file:/// to make sure it lands on the local filesystem instead of HDFS; the sh commands afterwards reference the local filesystem automatically, so you don't use file:/// there.
Alternatively, you can take advantage of the way --jars works to automatically stage a file into the temporary directory created just for your Pig job rather than explicitly copying from GCS into a local directory; you simply specify your shell script itself as a --jars argument:
gcloud dataproc jobs submit pig --cluster ${CLUSTER} \
--jars hello.sh \
-e 'sh chmod 750 ${PWD}/hello.sh; sh ${PWD}/hello.sh'
Or:
gcloud dataproc jobs submit pig --cluster ${CLUSTER} \
--jars gs://${BUCKET}/hello.sh \
-e 'sh chmod 750 ${PWD}/hello.sh; sh ${PWD}/hello.sh'
In these cases, the script would only temporarily be downloaded into a directory that looks like /tmp/59bc732cd0b542b5b9dcc63f112aeca3 and which only exists for the lifetime of the pig job.
There is no shell job in Dataproc at the moment. As an alternative, you can use a pig job with the sh command that forks your shell script, which can then (again) run your pig job. (You can use pyspark similarly if you prefer Python.)
For example:
# cat a.sh
HELLO=hello
pig -e "sh echo $HELLO"
# pig -e "sh $PWD/a.sh"
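Tying this back to the dynamically calculated arguments in the question, here is a hedged sketch of what such a wrapper could look like; the script path and parameter name are illustrative, and the Pig script is assumed to have been staged locally first (e.g. with fs -cp as shown above):
#!/bin/bash
# Sketch only: compute an argument in bash, then hand it to the real Pig script via -p.
RUN_DATE=$(date +%Y-%m-%d)
pig -p RUN_DATE="$RUN_DATE" -f /tmp/job.pig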

Shell script to copy data from remote server to Google Cloud Storage using Cron

I want to sync my server data to Google Cloud Storage automatically using a shell script. I don't know how to write the script. Every time I need to use:
gsutil -m rsync -d -r [Source] gs://[Bucket-name]
If anyone knows the answer please help me!
To automate the sync process, use a cron job:
Create a script to run with cron: $ nano backup.sh
Paste your gsutil command into the script: gsutil -m rsync -d -r [Source_PATH] gs://bucket-name
Make the script executable: $ chmod +x backup.sh
Based on your use case, put the shell script (backup.sh) in one of these folders: /etc/cron.hourly, /etc/cron.daily, /etc/cron.weekly, or /etc/cron.monthly.
If you want to run the script at a specific time instead, go to the terminal and type: $ crontab -e
Then schedule the script with cron as often as you want, for example at midnight: 00 00 * * * /path/to/your/backup.sh
In case you are using Windows on your local server, the commands will be the same as above, but make sure to use Windows paths instead.
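Putting the first few steps together, backup.sh can be as small as the sketch below on Linux (the source path, bucket name, and log file location are placeholders):
#!/bin/bash
# Sketch only: sync the source directory to the bucket and append all output to a log file.
gsutil -m rsync -d -r /path/to/source gs://your-bucket-name >> "$HOME/gcs-backup.log" 2>&1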

How to start pig with -t ColumnMapKeyPrune on aws emr

In my Pig script I want the file name with each record for some further processing, so I used the -tagFile option. After using -tagFile, the columns were getting misaligned, so I used the command below to get only the required columns, after referring to this blog: http://www.webopius.com/content/764/resolved-apache-pig-with-tagsource-tagfile-option-generates-incorrect-columns
pig -x mapreduce -t ColumnMapKeyPrune
Now I want to run the script on AWS EMR, but I am not sure how to enable this -t ColumnMapKeyPrune option on EMR Pig.
I am using the AWS CLI to create the cluster and submit jobs.
Any pointers on how to enable -t ColumnMapKeyPrune on EMR Pig?
I got the solution. I needed to add the line below in the Pig script:
set pig.optimizer.rules.disabled 'ColumnMapKeyPrune';
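With that set line at the top of the script, no extra -t flag is needed on the command line. For completeness, here is a hedged sketch of submitting such a script as a Pig step with the AWS CLI; the cluster ID, bucket, and script name are placeholders:
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps Type=PIG,Name="Pig step",ActionOnFailure=CONTINUE,Args=[-f,s3://my-bucket/scripts/myscript.pig]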