I have seen docs on how to import data but how do I do backups?
Is there tooling, is there an API, or is it safe to copy the files on the file system?
exporting data is done via COPY TO. Exports can go to the local filesystem or amazon s3.
every node containing data will export in parallel. you can export to a network attached storage or s3 or simply copy the exported files to another host using scp.
Related
I want to backup the REDIS data on google storage bucket as flat file, is there any existing utility to do that?
Although, I do not fully agree to idea of backing up of cache data on cloud. I was wondering if there is any existing utility rather than reinventing the wheel.
If you are using Cloud Memorystore for Redis you can simply refer to the following documentation. Notice that you can simply use the following gcloud command:
gcloud redis instances export gs://[BUCKET_NAME]/[FILE_NAME].rdb [INSTANCE_ID] --region=[REGION] --project=[PROJECT_ID]
or use the Export operation from the Cloud Console.
If you manage your own instance (e.g. you have the Redis instance hosted on a Compute Engine Instance) you could simply use the SAVE or BGSAVE (preferred) commands to take a snapshot of the instance and then upload the .rdb file to Google Cloud Storage using any of the available methods, from which I think the most convenient one would be gsutil (notice that it will require the following installation procedure) in a similar fashion to:
gsutil cp path/to/your-file.rdb gs://[DESTINATION_BUCKET_NAME]/
I am trying to create a pipeline to export Qliksense app data to AWS S3 bucket, but not sure if I can do it directly.
Two options I tried:
Use export API to export data as qvf to local disk, then connect to S3 via python script and push the data.
2.Store the data as csv using qliksense script locally and then push to S3.
Basically my ideal solution would be use a single python script to connect to Qliksense, read the data, convert to csv and export to S3.
Any ideas/code/approach would be helpful.
Qliksense is a tool to consume data and display in general, but when there is need
STORE INTO vFile (text);
is easy option. This way you will get csv file saved on drive. If you are running python script on the same machine is easy as you can EXECUTE statement which can run python script and store it to s3 using boto3 library.
Otherwise just use network drive instead and then in cron put data from there via python to s3.
Last option is that on qliksense server you can map s3 bucket. There are few tools for that like TnTDrive.
So much depends on your infrastructure.
I have an S3 bucket and the URL of a large file. I would like to store the content located at the URL in the S3 bucket.
I could download the file to my local machine and then upload it to S3 with Cloudberry or Jungledisk or whatever. However, if the file is large, this may take a long time because the file must be transferred twice, and my network connection is much slower than Amazon's.
If I have a lot of data to store in S3, I can start an EC2 instance, retrieve the files to the instance with curl or wget, and then push the data from the EC2 instance to S3. This works, but it's a lot of steps if I just want to archive one file.
Any suggestions?
You can stream the file directly from the source to S3.
If you are using node, you can use streaming-s3.
Can csv files from the AWS S3 bucket be configured to go straight into ML or do the files need to land somewhere and then the CSV files have to get ingested using MCLP?
Assuming you have CSV files in the S3 Bucket and that one row in the CSV file is to be inserted as a single XML record...that wasn't clear in your question, but is the most common use case. If your plan is to just pull the files in and persist them as CSV files, there are undocumented XQuery functions that could be used to access the S3 bucket and pull the files in off that. Anyway, the MLCP documents are very helpful in understanding this very versatile and powerful tool.
According to the documentation (https://developer.marklogic.com/products/mlcp) the supported data sources are:
Local filesystem
HDFS
MarkLogic Archive
Another MarkLogic Database
You could potentially mount the S3 Bucket to a local filesystem on EC2 to bypass the need to make the files accessible to MLCP. Google's your friend if that's important. I personally haven't seen a production-stable method for that, but it's been a long time since I've tried.
Regardless, you need to make those files available on a supported source, most likely a filesystem location in this case, where MLCP can be run and can reach the files. I suppose that's what you meant by having the files land somewhere. MLCP can process delimited files in import mode. The documentation is very good for understanding all the options.
I need to access some data during the map stage. It is a static file, from which I need to read some data.
I have uploaded the data file to S3.
How can I access that data while running my job in EMR?
If I just specify the file path as:
s3n://<bucket-name>/path
in the code, will that work ?
Thanks
S3n:// url is for Hadoop to read the s3 files. If you want to read the s3 file in your map program, either you need to use a library that handles s3:// URL format - such as jets3t - https://jets3t.s3.amazonaws.com/toolkit/toolkit.html - or access S3 objects via HTTP.
A quick search for an example program brought up this link.
https://gist.github.com/lucastex/917988
You can also access the S3 object through HTTP or HTTPS. This may need making the object public or configuring additional security. Then you can access it using the HTTP url package supported natively by java.
Another good option is to use s3dist copy as a bootstrap step to copy the S3 file to HDFS before your Map step starts. http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
What I ended up doing:
1) Wrote a small script that copies my file from s3 to the cluster
hadoop fs -copyToLocal s3n://$SOURCE_S3_BUCKET/path/file.txt $DESTINATION_DIR_ON_HOST
2) Created bootstrap step for my EMR Job, that runs the script in 1).
This approach doesn't require to make the S3 data public.