Running GeoMesa HBase on AWS S3, how do I ingest / export remotely - amazon-emr

I am running GeoMesa-HBase on an EMR cluster, set up as described here. I'm able to SSH into the master node and ingest / export from there. How would I ingest / export the data remotely, for example from a Lambda function (preferably a Python solution)? Right now, for the ingest part, I'm running a Lambda function that just sends a shell command via SSH:
import paramiko

c = paramiko.SSHClient()
c.set_missing_host_key_policy(paramiko.AutoAddPolicy())
c.connect(hostname=host, username="ec2-user", pkey=k)  # k is a loaded paramiko private key
c.exec_command("geomesa-hbase ingest <file_to_ingest_on_S3> ...")
But I imagine I should be able to ingest / export remotely without using SSH. I've been looking for a solution for days but have had no luck so far.

You can ingest or export remotely just by running GeoMesa code on a remote box. This could mean installing the command-line tools, or using the GeoTools API in a processing framework of your choice. GeoServer is typically used for interactive (not bulk) querying.
There isn't any out-of-the-box solution for ingest/export via AWS lambdas, but you could create a docker image with the GeoMesa command-line tools and invoke that.
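For the Lambda route, a rough sketch of a container-based handler might look like the following. It assumes a hypothetical image that has the GeoMesa command-line tools on the PATH and the HBase/S3 configuration baked in; the catalog name is a placeholder, and the flags shown are the usual catalog/converter/spec options, so check them against your GeoMesa version.

# Sketch of a Lambda handler for a container image bundling the GeoMesa CLI
# (hypothetical image; catalog and event fields are placeholders).
import subprocess


def handler(event, context):
    # e.g. event = {"converter": "my-converter", "spec": "my-sft", "files": ["s3a://bucket/file.csv"]}
    cmd = [
        "geomesa-hbase", "ingest",
        "--catalog", "my_catalog",          # placeholder catalog table
        "--converter", event["converter"],
        "--spec", event["spec"],
    ] + event["files"]
    # Run the CLI and surface its output in the Lambda logs
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(result.stderr)
    return {"status": "ok", "stdout": result.stdout[-1000:]}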
Also note that the command-line tools support ingest and export via a map/reduce job, which lets you run a distributed process from your local install.

Related

How do I access files generated during Cloud Run function execution?

I'm running a very simple program that takes screenshots of a page using Selenium in Cloud Run. I know that Cloud Run is stateless and that I cannot access the screenshot after the program finishes executing, but I wanted to know where/how I can access these files right after the screenshot is taken and read them, so I can store a reference to them in my Cloud Storage bucket too.
You have several solutions:
Store the screenshots locally and then upload them to Cloud Storage (you can script this yourself with the client libraries; see the sketch after this list). A useful improvement is to build a tar (optionally gzipped as well) so that you upload only one file, which is faster.
Use the Cloud Run second-generation execution environment and mount a bucket into your Cloud Run instance with GCS FUSE. That way, a file written directly to the mounted directory ends up in Cloud Storage. Despite the good tutorial, this solution requires solid container skills.
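For the first solution, a minimal sketch, assuming the google-cloud-storage client library, a Chrome/chromedriver setup in the image, and a hypothetical bucket name, could look like this:

# Minimal sketch: save the Selenium screenshot to /tmp (writable, in-memory on
# Cloud Run), then upload it to Cloud Storage. Bucket name and paths are placeholders.
from google.cloud import storage
from selenium import webdriver


def capture_and_upload(url: str, bucket_name: str = "my-screenshots-bucket") -> str:
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    options.add_argument("--no-sandbox")
    driver = webdriver.Chrome(options=options)   # assumes Chrome/chromedriver in the image
    try:
        driver.get(url)
        local_path = "/tmp/screenshot.png"
        driver.save_screenshot(local_path)
    finally:
        driver.quit()

    # Upload the local file and return its Cloud Storage URI for later reference
    blob = storage.Client().bucket(bucket_name).blob("screenshots/screenshot.png")
    blob.upload_from_filename(local_path)
    return f"gs://{bucket_name}/{blob.name}"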

What is the best approach to sync data from an AWS S3 bucket to Azure Data Lake Gen 2

Currently, I download csv files from AWS S3 to my local computer using:
aws s3 sync s3://<cloud_source> c:/<local_destination> --profile aws_profile. Now, I would like to use the same process to sync the files from AWS to Azure Data Lake Storage Gen2 (one-way sync) on a daily basis. [Note: I only have read/download permissions for the S3 data source.]
I thought about 5 potential paths to solving this problem:
Use AWS CLI commands within Azure. I'm not entirely sure how to do that without running an Azure VM. Also, I would like to have my AWS profile credentials persist.
Use Python's subprocess library to run AWS CLI commands. I run into similar issues as option 1, namely a) maintaining a persistent install of AWS CLI, b) passing AWS profile credentials, and c) running without an Azure VM.
Use Python's Boto3 library to access AWS services. In the past, it appears that Boto3 didn't support the AWS sync command. So, developers like #raydel-miranda developed their own. [see Sync two buckets through boto3]. However, it now appears that there is a DataSync class for Boto3. [see DataSync | Boto3 Docs 1.17.27 documentation]. Would I still need to run this in an Azure VM or could I use Azure Data Factory?
Use Azure Data Factory to copy data from AWS S3 bucket. [see Copy data from Amazon Simple Storage Service by using Azure Data Factory] My concern would be that I would want to sync rather than copy. I believe Azure Data Factory has functionality to check if a file already exists, but what if the file has been deleted from AWS S3 data source?
Use an Azure Data Science Virtual Machine to: a) install the AWS CLI, b) create my AWS profile to store the access credentials, and c) run the aws s3 sync... command.
Any tips, suggestions, or ideas on automating this process are greatly appreciated.
Adding one more to the list :)
6. Please also look into the AzCopy option: https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-s3?toc=/azure/storage/blobs/toc.json
I am not aware of any tool that actually syncs the data; more or less all of them do a copy, so I think you will have to implement the sync logic yourself. A couple of quick thoughts:
#3) You can run this from a batch service, and you can initiate that from Azure Data Factory. Also, since we are talking about Python, you can run it from Azure Databricks (see the sketch below).
#4) ADF does not have any sync logic for deleted files. You can implement that using the Get Metadata activity: https://learn.microsoft.com/en-us/azure/data-factory/control-flow-get-metadata-activity
AzReplicate is another option, especially for very large containers: https://learn.microsoft.com/en-us/samples/azure/azreplicate/azreplicate/
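If you do end up scripting the copy yourself (for example from Azure Databricks or a batch service, as in #3), a rough Python sketch could look like the following. It only copies, it does not handle deletions, and the storage account, container, and credential values are placeholders:

# Rough sketch: stream objects from S3 into ADLS Gen2 (copy only, no delete handling).
# Bucket, account, container names and credentials are placeholders.
import boto3
from azure.storage.filedatalake import DataLakeServiceClient

s3 = boto3.client("s3")  # uses the configured AWS profile / role
adls = DataLakeServiceClient(
    account_url="https://<storage_account>.dfs.core.windows.net",
    credential="<account_key_or_sas>",
)
fs = adls.get_file_system_client("my-container")

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="<cloud_source>"):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        body = s3.get_object(Bucket="<cloud_source>", Key=key)["Body"].read()
        # Mirror the S3 key as the ADLS path
        fs.get_file_client(key).upload_data(body, overwrite=True)
        print(f"copied {key}")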

backup distributed cache data to cloud storage

I want to back up the Redis data to a Google Cloud Storage bucket as a flat file. Is there any existing utility to do that?
Although I do not fully agree with the idea of backing up cache data to the cloud, I was wondering whether there is an existing utility rather than reinventing the wheel.
If you are using Cloud Memorystore for Redis, you can refer to the export documentation. Notice that you can simply use the following gcloud command:
gcloud redis instances export gs://[BUCKET_NAME]/[FILE_NAME].rdb [INSTANCE_ID] --region=[REGION] --project=[PROJECT_ID]
or use the Export operation from the Cloud Console.
If you manage your own instance (e.g. a Redis instance hosted on a Compute Engine instance), you can use the SAVE or BGSAVE (preferred) command to take a snapshot of the instance and then upload the resulting .rdb file to Google Cloud Storage using any of the available methods. The most convenient is probably gsutil (notice that it requires an installation procedure), used in a similar fashion to:
gsutil cp path/to/your-file.rdb gs://[DESTINATION_BUCKET_NAME]/
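If you want to script that self-managed path instead of running gsutil by hand, a rough sketch with redis-py and the google-cloud-storage client could be the following; the host, dump path, and bucket name are placeholders:

# Rough sketch: trigger a background save, wait for it to finish,
# then upload the resulting dump.rdb to Cloud Storage.
# Host, dump path and bucket name are placeholders.
import time

import redis
from google.cloud import storage

r = redis.Redis(host="localhost", port=6379)
before = r.lastsave()
r.bgsave()                        # asynchronous snapshot
while r.lastsave() == before:     # poll until the snapshot timestamp changes
    time.sleep(1)

bucket = storage.Client().bucket("my-redis-backups")
bucket.blob("backups/dump.rdb").upload_from_filename("/var/lib/redis/dump.rdb")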

Export data from QlikSense cloud to AWS S3 bucket

I am trying to create a pipeline to export QlikSense app data to an AWS S3 bucket, but I am not sure if I can do it directly.
Two options I tried:
1. Use the export API to export the data as a qvf to local disk, then connect to S3 via a Python script and push the data.
2. Store the data as csv locally using a QlikSense script and then push it to S3.
Basically my ideal solution would be use a single python script to connect to Qliksense, read the data, convert to csv and export to S3.
Any ideas/code/approach would be helpful.
QlikSense is in general a tool to consume and display data, but when there is a need to export,
STORE INTO vFile (text);
is an easy option. This way you will get a csv file saved on a drive. If you are running the Python script on the same machine, it is easy, since you can use an EXECUTE statement to run a Python script that stores the file to S3 using the boto3 library (see the sketch after these options).
Otherwise, just use a network drive instead, and then have a cron job push the data from there to S3 via Python.
The last option is to map the S3 bucket on the QlikSense server. There are a few tools for that, like TnTDrive.
So much depends on your infrastructure.
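For the boto3 upload step mentioned above (whether triggered from an EXECUTE statement or from cron), a minimal sketch could look like this; the local path, bucket, and key are placeholders:

# Minimal sketch: push the csv produced by the STORE statement to S3.
# Local path, bucket and key are placeholders; credentials come from the
# usual boto3 configuration (profile, environment variables or instance role).
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename=r"C:\exports\my_table.csv",   # file written by the STORE statement
    Bucket="my-qlik-exports",
    Key="qliksense/my_table.csv",
)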

Automate Cross region copying tables in aws redshift

I have tables in a cluster in region-1 and I want to copy some of those tables to another cluster in another region (region-2).
Until now I have used Matillion, and for that I have followed these steps:
Copy data to s3 from cluster-a.
Load this data from s3 to cluster-b.
Since Matillion is a bit costly for me, I want an alternative solution for this.
I have heard about the CLI, Lambda, and the API, but I have no idea how I should use them. I go through this procedure on a weekly basis and want to automate it.
The AWS Command-Line Interface (CLI) is not relevant for this use-case, because it is used to control AWS services (e.g. launch an Amazon Redshift database, change security settings). The commands to import/export data to/from Amazon Redshift must be issued to Redshift directly via SQL.
To copy some tables to an Amazon Redshift instance in another region:
Use an UNLOAD command in Cluster A to export data from Redshift to an Amazon S3 bucket
Use a COPY command in Cluster B to load data from S3 into Redshift, using the REGION parameter to specify the source region
You will therefore need separate SQL connections to each cluster. Any client that can connect to Redshift and issue SQL will suffice. For example, you could use the standard psql tool (preferably version 8.0.2), since Redshift is based on PostgreSQL 8.0.2.
See: Connect to Your Cluster by Using the psql Tool
So, your script would be something like:
psql -h clusterA -U username -d mydatabase -c 'UNLOAD...'
psql -h clusterB -U username -d mydatabase -c 'COPY...'
You could run this from AWS Lambda, but Lambda functions are limited to a maximum execution time (15 minutes), and your script might exceed that limit. Instead, you could run a regular cron job on some machine.
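If you would rather drive both steps from Python (for example from that cron job), a rough sketch with psycopg2 could look like the following; the endpoints, credentials, table, bucket, and IAM roles are all placeholders:

# Rough sketch: UNLOAD from cluster A to S3, then COPY into cluster B,
# specifying REGION on the COPY since the bucket lives in cluster A's region.
# Endpoints, credentials, table, bucket and IAM roles are placeholders.
import psycopg2

UNLOAD_SQL = """
    UNLOAD ('SELECT * FROM my_table')
    TO 's3://my-transfer-bucket/my_table/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftUnloadRole'
    ALLOWOVERWRITE;
"""

COPY_SQL = """
    COPY my_table
    FROM 's3://my-transfer-bucket/my_table/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    REGION 'us-east-1';
"""


def run(host: str, sql: str) -> None:
    # One SQL connection per cluster; the context managers commit on success
    with psycopg2.connect(host=host, port=5439, dbname="mydatabase",
                          user="username", password="password") as conn:
        with conn.cursor() as cur:
            cur.execute(sql)


run("cluster-a.example.redshift.amazonaws.com", UNLOAD_SQL)
run("cluster-b.example.redshift.amazonaws.com", COPY_SQL)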