How to import an HBase snapshot from S3

I exported an HBase snapshot to S3 with this command:
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot my-snapshot -copy-to s3://my-buckets/tests -mappers 16
How can I import that snapshot from S3 back into HBase?
I have read many posts about exporting a snapshot to another cluster, but I could not find how to import one from S3.
In other words, how can I create a new table from a snapshot stored in S3?
Environment
EC2 instances
CDH 5.14.1
HBase 1.2

You can use the same ExportSnapshot tool to import the snapshot: pass the S3 location (as an s3a:// path) to -copy-from and your cluster's HBase root directory on HDFS to -copy-to.
Example:
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot my-snapshot -copy-from s3a://my-buckets/tests -copy-to hdfs://<name-node>:8020/hbase -mappers 16
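The question also asks how to turn the copied snapshot into a new table. Once ExportSnapshot has copied the snapshot into the cluster's HBase root directory, it can be cloned into a table from the HBase shell with clone_snapshot. Below is a minimal sketch of both steps, written in Python only to make the argument order explicit. The name-node host, the table name my_new_table, and the helper names are assumptions for illustration and are not part of the original answer.

```python
import shlex


def build_import_command(snapshot, s3_path, hdfs_root, mappers=16):
    """Return the argv that copies a snapshot from S3 into the cluster."""
    return [
        "hbase", "org.apache.hadoop.hbase.snapshot.ExportSnapshot",
        "-snapshot", snapshot,
        "-copy-from", s3_path,
        "-copy-to", hdfs_root,
        "-mappers", str(mappers),
    ]


def build_clone_statement(snapshot, new_table):
    """Return the HBase shell statement that creates a table from a snapshot."""
    return "clone_snapshot '{}', '{}'".format(snapshot, new_table)


if __name__ == "__main__":
    # Hypothetical name-node address; replace with your own.
    argv = build_import_command(
        "my-snapshot", "s3a://my-buckets/tests",
        "hdfs://namenode.example:8020/hbase")
    print(" ".join(shlex.quote(arg) for arg in argv))
    # Run the printed statement inside `hbase shell` after the copy finishes.
    print(build_clone_statement("my-snapshot", "my_new_table"))
```

The first printed line is the import command and the second is the statement to run in the HBase shell afterwards.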

Related

Hadoop distcp to S3

I am using the Hadoop distcp command to move data from HDFS to S3. Since upgrading from CDH to CDP, the -update option behaves differently. Previously, -update would copy files that had the same name and size but different content. Now it skips a file if the name and size are the same. Is there any way to get the old update behavior in Cloudera CDP?
hadoop distcp -pu -update -delete hdfspath s3bucket

AWS GLUE Not able to write Delta lake in s3

I am working on AWS Glue and created an ETL job for upserts. I have an S3 bucket with my CSV file in a folder. I am reading the file from S3 and want to write it back to S3 as a Delta Lake table (Parquet files) using this code:
from delta import *
from pyspark.sql.session import SparkSession
spark = SparkSession.builder \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()
inputDF = spark.read.format("csv").option("header", "true").load('s3://demonidhi/superstore/')
print(inputDF)
# Write data as DELTA TABLE
inputDF.write.format("delta").mode("overwrite").save("s3a://demonidhi/current/")
# Generate MANIFEST file for Athena/Catalog
deltaTable = DeltaTable.forPath(spark, "s3a://demonidhi/current/")
I am using the Delta jar file 'delta-core_2.11-0.6.1.jar', which is in an S3 bucket folder. I gave its path in both the Python library path and the Dependent jars path when creating my job.
The reading part of the code works fine, but the writing and manifest steps fail with an error that I cannot see in the Glue terminal. I have tried several different approaches but cannot figure out how to resolve this. Any help would be appreciated.
Using the .config() notation on the Spark session will not work in Glue, because the abstraction that Glue uses (the GlueContext) will override those parameters.
What you can do instead is provide the config as a parameter to the job itself, with the key --conf and the following value (the additional settings are chained inside the value with embedded --conf flags):
spark.delta.logStore.class=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
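As a sketch of how that single --conf parameter could be assembled, the helper below joins several Spark settings into one value, placing an embedded " --conf " between them as in the answer above. The helper name and the idea of using the result as a job's default arguments are assumptions for illustration.

```python
def build_glue_conf_argument(settings):
    """Join Spark settings into one value stored under the "--conf" key.

    The first setting follows the key directly; each later setting is
    prefixed with an embedded " --conf " inside the same value.
    """
    pairs = ["{}={}".format(key, value) for key, value in settings.items()]
    return {"--conf": " --conf ".join(pairs)}


# Settings taken from the answer above, in the same order.
delta_settings = {
    "spark.delta.logStore.class":
        "org.apache.spark.sql.delta.storage.S3SingleDriverLogStore",
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
    "spark.sql.catalog.spark_catalog":
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
}

default_arguments = build_glue_conf_argument(delta_settings)
print(default_arguments["--conf"])
```

The resulting dictionary has the key/value shape of a Glue job parameter, so it could be entered in the job's parameter settings or merged into the job's default arguments.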

Port data from HDFS/S3 to local FS and load in Java

I have a Spark job running on an EMR cluster that writes out a DataFrame to HDFS (which is then copied to S3 with s3-dist-cp). The data size isn't big (2 GB when saved as Parquet). The data in S3 is then copied to a local filesystem (an EC2 instance running Linux) and loaded into a Java application.
It turns out I cannot keep the data in Parquet format, because Parquet was designed for HDFS and cannot be used on a local FS (if I am wrong, please point me to a resource on how to read Parquet files on a local FS).
What other format can I use instead? Would Avro be compact enough, or would it blow up the size of the data by packing the schema with each row of the DataFrame?
You can use Parquet on a local filesystem. To see an example in action, download the parquet-mr library, build it with the local profile (mvn -P local install should do it, provided that you have thrift and protoc installed), then run the following to print the contents of your Parquet file:
java -jar parquet-tools/target/parquet-tools-1.10.0.jar cat /path/to/your-file.parquet
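A Parquet file is an ordinary file and does not need HDFS to exist. As a quick sanity check after copying from S3, the sketch below verifies that a local file begins and ends with the 4-byte Parquet magic marker PAR1. It uses only the Python standard library, the function name is hypothetical, and it does not decode any data.

```python
import os

PARQUET_MAGIC = b"PAR1"


def looks_like_parquet(path):
    """Return True if the file starts and ends with the Parquet magic bytes."""
    # A file too small to hold both markers cannot be a Parquet file.
    if os.path.getsize(path) < 2 * len(PARQUET_MAGIC):
        return False
    with open(path, "rb") as handle:
        head = handle.read(len(PARQUET_MAGIC))
        # Jump to the last four bytes of the file.
        handle.seek(-len(PARQUET_MAGIC), os.SEEK_END)
        tail = handle.read(len(PARQUET_MAGIC))
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC
```

A call such as looks_like_parquet("/path/to/your-file.parquet") only confirms that the copy looks intact at both ends. Reading the rows still requires a Parquet reader such as parquet-tools or the parquet-mr library in Java.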

Flink on EMR cannot access S3 bucket from "flink run" command

I'm prototyping the use of AWS EMR for a Flink-based system that we're planning to deploy. My cluster has the following versions:
Release label: emr-5.10.0
Hadoop distribution: Amazon 2.7.3
Applications: Flink 1.3.2
Both the Amazon EMR documentation for Flink and the Apache Flink documentation mention using S3 resources directly as an integrated file system with the s3://<bucket>/<file> pattern. I have verified that all the correct permissions are set, and I can use the AWS CLI to copy S3 resources to the master node with no problem, but attempting to start a Flink job using a jar from S3 does not work.
I am executing the following step:
JAR location : command-runner.jar
Main class : None
Arguments : flink run -m yarn-cluster -yid application_1513333002475_0001 s3://mybucket/myapp.jar
Action on failure: Continue
The step always fails with
JAR file does not exist: s3://mybucket/myapp.jar
I have spoken to AWS support, and they suggested having a previous step copy the S3 file to the local Master node and then referencing it with a local path. While this would obviously work, I would rather get the native S3 integration working.
I have also tried using the s3a filesystem and get the same result.
You need to download your jar from S3 so that it is available on the local filesystem:
aws s3 cp s3://mybucket/myapp.jar myapp.jar
and then run it with:
flink run -m yarn-cluster myapp.jar
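The two commands above could also be expressed as EMR steps that run through command-runner.jar, mirroring the step shown in the question. The sketch below only builds the step definitions as plain dictionaries in the shape used by the EMR API's step configuration. The step names, the local path /home/hadoop/myapp.jar, and the helper name are assumptions.

```python
def command_step(name, args, action_on_failure="CONTINUE"):
    """Build one EMR step that runs a command through command-runner.jar."""
    return {
        "Name": name,
        "ActionOnFailure": action_on_failure,
        "HadoopJarStep": {"Jar": "command-runner.jar", "Args": list(args)},
    }


local_jar = "/home/hadoop/myapp.jar"  # assumed location on the master node

steps = [
    # Step 1: copy the application jar from S3 to the master node.
    command_step("Download jar",
                 ["aws", "s3", "cp", "s3://mybucket/myapp.jar", local_jar]),
    # Step 2: submit the job using the local copy of the jar.
    command_step("Run Flink job",
                 ["flink", "run", "-m", "yarn-cluster", local_jar]),
]
```

A list like this could be passed as the Steps argument of boto3's add_job_flow_steps call, so the download always runs before the job is submitted.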

How to install sqoop in Amazon EMR?

I've created a cluster in Amazon EMR using emr-4.0.0, with Hadoop distribution Amazon 2.6.0 and Hive 1.0.0. I need to install Sqoop so that I can move data between Hive and Redshift. What are the steps to install Sqoop on an EMR cluster?
Note that in EMR 4.0.0 hadoop fs -copyToLocal will throw errors.
Use aws s3 cp instead.
To be more specific than Amal:
Download the latest version of Sqoop and upload it to an S3 location. I am using sqoop-1.4.4.bin__hadoop-2.0.4-alpha and it seems to work just fine with EMR 4.0.0.
Download the JDBC connector JAR for Redshift and upload it to the same S3 location. This page might help.
Upload a script similar to the one below to S3:
#!/bin/bash
# Install Sqoop and the Redshift JDBC connector.
# Store this script in S3 and run it as a bootstrap action.
bucket_location='s3://your-sqoop-jars-location/'
sqoop_jar='sqoop-1.4.4.bin__hadoop-2.0.4-alpha'
sqoop_jar_gz="$sqoop_jar.tar.gz"
redshift_jar='RedshiftJDBC41-1.1.7.1007.jar'
cd /home/hadoop
# Fetch and unpack the Sqoop tarball
aws s3 cp "$bucket_location$sqoop_jar_gz" .
tar -xzf "$sqoop_jar_gz"
# Fetch the Redshift JDBC driver and copy it into Sqoop's lib directory
aws s3 cp "$bucket_location$redshift_jar" .
cp "$redshift_jar" "$sqoop_jar/lib/"
Set SQOOP_HOME and add $SQOOP_HOME/bin to the PATH to be able to call sqoop from anywhere. These entries should be made in /etc/bashrc. Otherwise you will have to use the full path, in this case: /home/hadoop/sqoop-1.4.4.bin__hadoop-2.0.4-alpha/bin/sqoop
I am using Java to programmatically launch my EMR cluster. To configure bootstrap actions in Java I create a BootstrapActionConfigFactory:
import com.amazonaws.services.elasticmapreduce.model.BootstrapActionConfig;
import com.amazonaws.services.elasticmapreduce.model.ScriptBootstrapActionConfig;
// Config is the author's own configuration class (not shown here).
public final class BootstrapActionConfigFactory {
    private static final String bucket = Config.getBootstrapBucket();
    // make class non-instantiable
    private BootstrapActionConfigFactory() {
    }
    /**
     * Adds an install Sqoop step to the job that corresponds to the version set in the Config class.
     */
    public static BootstrapActionConfig newInstallSqoopBootstrapActionConfig() {
        return newInstallSqoopBootstrapActionConfig(Config.getHadoopVersion().charAt(0));
    }
    /**
     * Adds an install Sqoop step to the job that corresponds to the version specified in the parameter.
     *
     * @param hadoopVersion the main version number for Hadoop. E.g.: 1, 2
     */
    public static BootstrapActionConfig newInstallSqoopBootstrapActionConfig(char hadoopVersion) {
        return new BootstrapActionConfig().withName("Install Sqoop")
            .withScriptBootstrapAction(
                new ScriptBootstrapActionConfig().withPath("s3://" + bucket + "/sqoop-tools/hadoop" + hadoopVersion + "/bootstrap-sqoop-emr4.sh"));
    }
}
Then when creating the job:
Job job = new Job(Region.getRegion(Regions.US_EAST_1));
job.addBootstrapAction(BootstrapActionConfigFactory.newInstallSqoopBootstrapActionConfig());
Download the Sqoop tarball and keep it in an S3 bucket. Create a bootstrap script that does the following:
1. Download the Sqoop tarball to the required instances.
2. Extract the tarball.
3. Set SQOOP_HOME and add $SQOOP_HOME/bin to the PATH. These entries should be made in /etc/bashrc.
4. Add the required connector jars to Sqoop's lib directory.
Keep this script in S3 and point to it in the bootstrap actions.
Note that starting with emr-4.4.0, AWS added support for Sqoop 1.4.6 on EMR clusters. It can be installed with a couple of clicks during cluster setup, so manual installation is no longer needed.
References:
https://aws.amazon.com/blogs/aws/amazon-emr-4-4-0-sqoop-hcatalog-java-8-and-more/
http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-sqoop.html