HBase: Retention policy for HBase Export (amazon-s3)

We are using HBase 1.2.3. I am trying to configure HBase backup functionality (the Export functionality in version 1.2.3).
I am able to successfully export tables to S3, both full and incremental backups.
On S3, all the files go into the default root/base folder, and a mapping file (not sure in what format) goes inside the specified folder.
Two questions:
How can I set a retention policy to keep backups for x days (a lifecycle-rule sketch follows these questions)? I wrote custom code to delete files/folders under a specific folder, but how do I determine which block files belong to which table, and whether they are from a full or an incremental backup?
Can we change the way HBase stores backup files? When we take a backup on the file system, it stores the backup files under the same folder. Can we achieve the same result on S3?
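For the retention part, one possible approach (a sketch only; the bucket name and the hbase-backups/ prefix are placeholders, and it does not distinguish full exports from incremental ones) is to let S3 itself enforce the x-day window with a lifecycle expiration rule instead of custom deletion code:
# Expire everything under the backup prefix after 30 days; S3 deletes the objects for you.
aws s3api put-bucket-lifecycle-configuration \
  --bucket my-backup-bucket \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "expire-hbase-exports",
      "Filter": {"Prefix": "hbase-backups/"},
      "Status": "Enabled",
      "Expiration": {"Days": 30}
    }]
  }'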

Related

How can I back up a Memgraph database?

I'm running Memgraph within Ubuntu WSL. I want to make a backup of my database, but I'm having trouble locating the database files.
I've found a question that addresses the Memgraph Platform, but I need a solution for WSL.
While running, Memgraph generates several different files in its data directory. This is the location where Memgraph saves all permanent data. The default data directory is /var/lib/memgraph.
If you want to trigger creating a snapshot of the current database state, run the following query in mgconsole or Memgraph Lab:
CREATE SNAPSHOT;
Creating a backup of a Memgraph instance consists of simply copying the data directory. However, this is impossible without an additional step, because durability files can be deleted when a triggering event occurs (for example, when the number of snapshots exceeds the maximum allowed number).
To disable this behavior, you can use the following query in mgconsole or Memgraph Lab:
LOCK DATA DIRECTORY;
If you are using Linux to run Memgraph, here are the steps for copying files:
Start your Memgraph instance.
Open a new Linux terminal and check the location of the permanent data directory:
grep -A 1 'permanent data' /etc/memgraph/memgraph.conf
Copy a file from the snapshot directory to the backup folder, e.g.:
cp /var/lib/memgraph/snapshots/20220325125308366007_timestamp_3380 ~/backup/
To allow the deletion of the files again, run the following query in mgconsole or Memgraph Lab:
UNLOCK DATA DIRECTORY;
Memgraph will then delete the files that should have been deleted earlier and will allow any future deletion of files contained in the data directory.
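Putting the steps together, a minimal backup sketch might look like the following (assuming mgconsole is on the PATH and reads queries from standard input, that the data directory is the default /var/lib/memgraph, and that the wal subdirectory name matches your installation; adjust paths as needed):
# Prevent Memgraph from deleting durability files while they are being copied.
echo "LOCK DATA DIRECTORY;" | mgconsole

# Copy the durability files to the backup folder.
mkdir -p ~/backup
cp -r /var/lib/memgraph/snapshots ~/backup/
cp -r /var/lib/memgraph/wal ~/backup/

# Allow deletions again.
echo "UNLOCK DATA DIRECTORY;" | mgconsole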

Can GNU make figure out file dependencies in AWS S3 buckets?

The source directory contains numerous large image and video files.
These files need to be uploaded to an AWS S3 bucket with the aws s3 cp command. For example, as part of this build process, I copy my image file my_image.jpg to the S3 bucket like this: aws s3 cp my_image.jpg s3://mybucket.mydomain.com/
I have no problem doing this copy to AWS manually. And I can script it too. But I want to use the makefile to upload my image file my_image.jpg iff the same-named file in my S3 bucket is older than the one in my source directory.
Generally make is very good at this kind of dependency checking based on file dates. However, is there a way I can tell make to get the file dates from files in S3 buckets and use that to determine if dependencies need to be rebuilt or not?
The AWS CLI has an s3 sync command that can take care of a fair amount of this for you. From the documentation:
An s3 object will require copying if:
the sizes of the two s3 objects differ,
the last modified time of the source is newer than the last modified time of the destination,
or the s3 object does not exist under the specified bucket and prefix destination.
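For example, a sync limited to the media types in question might look like this (the bucket name is taken from the question; the source directory and file extensions are illustrative):
# Only new or changed files are uploaded; unchanged ones are skipped based on size and modification time.
aws s3 sync ./source-directory s3://mybucket.mydomain.com/ --exclude "*" --include "*.jpg" --include "*.mp4"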
I think you'll need to make S3 look like a file system for this to work. On Linux it is common to use FUSE to build adapters like that, and there are several projects that present S3 as a local filesystem. I haven't tried any of them, but it seems like the way to go.
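As a rough sketch of that approach, here is what mounting the bucket with s3fs-fuse could look like (I am naming s3fs-fuse only as an example of such a project; the mount point and credentials file are placeholders). Once mounted, make can compare timestamps on the mounted path like any other file:
# Mount the bucket as a local directory (credentials in ~/.passwd-s3fs as ACCESS_KEY:SECRET_KEY).
mkdir -p /mnt/s3bucket
s3fs mybucket.mydomain.com /mnt/s3bucket -o passwd_file=${HOME}/.passwd-s3fs
# A make rule can then treat /mnt/s3bucket/my_image.jpg as an ordinary target or prerequisite.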

Add jsonserde.jar to EMR Hive permanently

We know that
add jar s3://elasticmapreduce/samples/hive-ads/libs/jsonserde.jar;
is only effective during the current session.
Is there a way to add jar to Hive permanently and globally so that the jar will be available during the lifecycle of the cluster?
UPDATE:
I figured out a way: download the jar using the AWS CLI (aws s3 cp s3://elasticmapreduce/samples/hive-ads/libs/jsonserde.jar .), then copy the jar to /usr/lib/hive/lib on every node of the EMR cluster.
Is there a better way to do this?
Insert your ADD JAR commands in your .hiverc file and start Hive:
add jar yourjarName.jar
1. What is the .hiverc file?
It is a file that is executed when you launch the Hive shell, making it an ideal place to add any Hive configuration/customization you want applied when the shell starts. This could be:
Setting column headers to be visible in query results
Making the current database name part of the hive prompt
Adding any jars or files
Registering UDFs
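For example, a .hiverc covering the items above could be created like this (a sketch: the conf path and the jar location vary by distribution, so adjust them; the two set options are standard Hive CLI settings):
# Write a .hiverc into the Hive conf directory (path varies; see the CDH location below).
cat <<'EOF' | sudo tee /etc/hive/conf/.hiverc
set hive.cli.print.header=true;
set hive.cli.print.current.db=true;
add jar /usr/lib/hive/lib/jsonserde.jar;
EOF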
2. .hiverc file location
The file is loaded from the hive conf directory.
I have the CDH4.2 distribution and the location is:
/etc/hive/conf.cloudera.hive1
If the file does not exist, you can create it. It needs to be deployed to every node from where you might launch the Hive shell.
Reference: http://hadooped.blogspot.in/2013/08/hive-hiverc-file.html

Where do I put .mdf and .ldf files to share an SQL script through git

I am attempting to share a file that builds and populates an SQL database through git, but it won't create the DB on my team members' machines because the .mdf and .ldf files are located on my machine. How can I rectify this?
If you want to share a SQL script, you don't have to share the database with it!
What is generally done (best practice) is to keep the script needed to create the database (and, if needed, populate it with static/test data) in git, and then each user runs that script to build the database.
Git is there to keep track of your source code and the changes made to it; you shouldn't put any generated files in it, and .mdf/.ldf files are typically among the things that should not be in your repository. For generated files within your folder, there are ways to configure Git to ignore them, as shown below.
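For example (assuming the data files sit inside the repository folder), two lines appended to .gitignore are enough:
# Keep SQL Server data and log files out of the repository.
printf '%s\n' '*.mdf' '*.ldf' >> .gitignore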
The value of Git is in recording differences between files; if all you want is to distribute the files themselves, Git is definitely not the right tool. Put those files on a shared folder (NAS), on Dropbox, on a USB key, or whatever.
However, if you really want to do this (a bad idea), I suppose you can add the files to your repository and either configure SQL Server to find them there or create a symbolic link.

Sync with S3 using s3cmd, but don't re-download files that have only changed name

I'm syncing a bunch of files between my computer and Amazon S3. Say a couple of the files change name, but their content is still the same. Do I have to have the local file removed by s3cmd and then the "new" file re-downloaded, just because it has a new name? Or is there any other way of checking for changes? I would like s3cmd to, in that case, simply change the name of the local file in accordance with the new name on the server.
s3cmd upstream (the master branch at github.com/s3tools/s3cmd) and 1.5.0-rc1, the latest published version, can figure this out, provided you used a recent version with the --preserve option to put the files into S3 in the first place, so that the md5sum of each file was stored. Using the md5sums, it knows that you already have a duplicate (even if renamed) file locally and won't re-download it; instead it will do a local copy (or hardlink) from the existing file system name to the name from S3.
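As an illustration (the bucket and paths are placeholders), the round trip would look roughly like this, with the rename handled locally on the way back down:
# Upload with a recent s3cmd so the md5sum metadata described above is stored.
s3cmd sync --preserve ~/myfiles/ s3://mybucket/myfiles/
# Later, syncing down satisfies a renamed-but-identical object with a local copy or hardlink
# instead of a re-download.
s3cmd sync s3://mybucket/myfiles/ ~/myfiles/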