I'm trying to load an entire table from a MySQL Database (RDS) to S3, through AWS Glue.
I have already configured the RDS connection and created the Glue table with a crawler.
Now I have to run an ETL job to load the table from RDS to S3. However, after correctly following the AWS documentation procedure, I find some files inside the S3 directory, but none of them is a JSON file (with headers and data) as I requested in the Glue job.
Where am I wrong?
Thank you.
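For comparison, a minimal AWS Glue Scala script that reads the crawled catalog table and writes it to S3 as JSON might look like the sketch below; the database, table, and output path are placeholder names, not the ones from the actual job.
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.JsonOptions
import org.apache.spark.SparkContext

object ExportTableToJson {
  def main(args: Array[String]): Unit = {
    val glueContext = new GlueContext(new SparkContext())

    // Read the table that the crawler registered in the Glue Data Catalog
    // ("mydb" and "mytable" are placeholders).
    val dyf = glueContext
      .getCatalogSource(database = "mydb", tableName = "mytable")
      .getDynamicFrame()

    // Write the DynamicFrame to S3 as JSON ("s3://my-bucket/output/" is a placeholder).
    glueContext
      .getSinkWithFormat(
        connectionType = "s3",
        options = JsonOptions("""{"path": "s3://my-bucket/output/"}"""),
        format = "json")
      .writeDynamicFrame(dyf)
  }
}
If the sink format in the generated job is left at a different value (for example Parquet or the Spark default), the files that land in S3 will not be JSON, so the sink format is the first thing to check.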
Related
We want to check BigQuery performance on external Parquet files. These Parquet files are stored on AWS S3. Without transferring the files to GCP, is it possible to write a BigQuery query that can run on the dataset of Parquet files stored in AWS S3?
No, this is not possible. BigQuery supports "external tables", where the data exists as files in Google Cloud Storage, but no other cloud file store is supported, including AWS S3.
You will need to either copy/move the files from S3 to Cloud Storage and then use BigQuery on them, or use a similar AWS service such as Athena to query the files in situ on S3.
You can use the BigQuery Data Transfer Service for Amazon S3, which allows you to automatically schedule and manage recurring load jobs from Amazon S3 into BigQuery, and supports loading data in Parquet format. In this link you will find the documentation on how to set up an Amazon S3 data transfer.
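As a rough illustration only, a Scala sketch that creates such a transfer configuration with the BigQuery Data Transfer client library might look like the following; the project, dataset, table, bucket, and in particular the parameter keys are assumptions on my part, so verify them against the linked documentation.
import com.google.cloud.bigquery.datatransfer.v1.{CreateTransferConfigRequest, DataTransferServiceClient, ProjectName, TransferConfig}
import com.google.protobuf.{Struct, Value}

def str(s: String): Value = Value.newBuilder().setStringValue(s).build()

val client = DataTransferServiceClient.create()

// Parameter keys below are assumptions; check the Amazon S3 transfer docs.
val params = Struct.newBuilder()
  .putFields("data_path", str("s3://my-bucket/path/*.parquet"))
  .putFields("destination_table_name_template", str("my_table"))
  .putFields("file_format", str("PARQUET"))
  .putFields("access_key_id", str(sys.env("AWS_ACCESS_KEY_ID")))
  .putFields("secret_access_key", str(sys.env("AWS_SECRET_ACCESS_KEY")))
  .build()

val transferConfig = TransferConfig.newBuilder()
  .setDisplayName("s3-parquet-to-bq")     // placeholder name
  .setDataSourceId("amazon_s3")
  .setDestinationDatasetId("my_dataset")  // placeholder dataset
  .setParams(params)
  .build()

val request = CreateTransferConfigRequest.newBuilder()
  .setParent(ProjectName.of("my-project").toString)  // placeholder project
  .setTransferConfig(transferConfig)
  .build()

client.createTransferConfig(request)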
We have a requirement to copy files within a Spark job (running on a Hadoop cluster spun up by EMR) to the respective S3 bucket.
As of now, we are using the Hadoop FileSystem API (FileUtil.copy) to copy or move files between two different file systems.
import org.apache.hadoop.fs.FileUtil
// deleteSource = true, so the source file is removed after the copy
val config = spark.sparkContext.hadoopConfiguration
FileUtil.copy(sourceFileSystem, sourceFile, destinationFileSystem, targetLocation, true, config)
This method works as required but is not efficient: it streams each file, and the execution time depends on the size of the file and the number of files to be copied.
For another similar requirement, moving files between two folders of the same S3 bucket, we use the com.amazonaws.services.s3 package as below.
import com.amazonaws.services.s3.AmazonS3URI
val uri1 = new AmazonS3URI(sourcePath)
val uri2 = new AmazonS3URI(targetPath)
// Server-side copy between two S3 locations; no data is streamed through the client.
s3Client.copyObject(uri1.getBucket, uri1.getKey, uri2.getBucket, uri2.getKey)
The above package only has methods to copy/move between two S3 locations. My requirement is to copy files between HDFS (on a cluster spun up by EMR) and an S3 bucket.
Can anyone suggest a better way, or any AWS S3 API that can be used in Spark Scala, for moving files between HDFS and an S3 bucket?
We had a similar scenario and we ended up using S3DistCp.
S3DistCp is an extension of DistCp that is optimized to work with AWS, particularly S3. You can use S3DistCp to copy data between Amazon S3 buckets or from HDFS to Amazon S3. S3DistCp is more scalable and efficient for copying large numbers of objects in parallel across buckets and across AWS accounts.
You can find more details here.
You can refer to this sample Java code for the same here.
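If the copy has to be kicked off from code rather than as a predefined step, a hedged sketch using the AWS Java SDK from Scala to submit an s3-dist-cp step could look like this; the region, cluster ID, and paths are placeholders, and it assumes the EMR release provides s3-dist-cp through command-runner.jar.
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder
import com.amazonaws.services.elasticmapreduce.model.{AddJobFlowStepsRequest, HadoopJarStepConfig, StepConfig}

val emr = AmazonElasticMapReduceClientBuilder.standard().withRegion("us-east-1").build()

// command-runner.jar lets EMR run s3-dist-cp with plain arguments.
val s3DistCpStep = new HadoopJarStepConfig()
  .withJar("command-runner.jar")
  .withArgs("s3-dist-cp", "--src=hdfs:///data/output", "--dest=s3://my-bucket/data/output")

val step = new StepConfig()
  .withName("hdfs-to-s3-copy")
  .withActionOnFailure("CONTINUE")
  .withHadoopJarStep(s3DistCpStep)

// "j-XXXXXXXXXXXXX" is a placeholder for the EMR cluster (job flow) id.
emr.addJobFlowSteps(new AddJobFlowStepsRequest().withJobFlowId("j-XXXXXXXXXXXXX").withSteps(step))
The copy then runs as a distributed step on the cluster itself, so no data passes through the submitting application.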
Hope this helps!
I am trying to migrate my Hive metadata to Glue. While migrating the Delta table, when I provide the same DBFS path, I get an error: "Cannot create table: The associated location is not empty."
When I try to create the same Delta table on the S3 location, it works properly.
Is there a way to find the S3 location for the DBFS path the database points to?
First, configure the Databricks Runtime to use the AWS Glue Data Catalog as its metastore, and then migrate the Delta table.
Every Databricks deployment has a central Hive metastore accessible by all clusters to persist table metadata. Instead of using the Databricks Hive metastore, you have the option to use an existing external Hive metastore instance or the AWS Glue Catalog.
External Apache Hive Metastore
Using AWS Glue Data Catalog as the Metastore for Databricks Runtime
Databricks File System (DBFS) is a distributed file system mounted into a Databricks workspace and available on Databricks clusters. DBFS is an abstraction on top of scalable object storage and offers the following benefits:
Allows you to mount storage objects so that you can seamlessly access data without requiring credentials.
Allows you to interact with object storage using directory and file semantics instead of storage URLs.
Persists files to object storage, so you won’t lose data after you terminate a cluster.
Is there a way to find the S3 location for the DBFS path the database points to?
You can access an AWS S3 bucket by mounting it using DBFS or by using the APIs directly.
Reference: "Databricks - Amazon S3"
Hope this helps.
I'd like to upload a Hive external table to AWS Redshift directly via command line. I don't want to use the Data Pipeline. Do I have to upload the table to S3 first and then copy it to Redshift? Is there any way to do it directly?
You can load Redshift directly from remote hosts using SSH or, if you're using their EMR version of Hadoop, you can load directly from the HDFS file system.
http://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html
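For illustration, a hedged Scala sketch that issues such a COPY over JDBC from an EMR cluster's HDFS is below; the cluster ID, table, path, IAM role, and connection details are all placeholders, and COPY from EMR needs the one-time host setup described in the linked COPY documentation.
import java.sql.DriverManager

// Placeholder Redshift connection details.
val url = "jdbc:redshift://my-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/mydb"
val conn = DriverManager.getConnection(url, "myuser", "mypassword")
try {
  val stmt = conn.createStatement()
  // COPY straight from the EMR cluster's HDFS; the cluster id, path, and role are placeholders.
  stmt.execute(
    """COPY my_table
      |FROM 'emr://j-XXXXXXXXXXXXX/user/hive/warehouse/my_table/part-*'
      |IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
      |DELIMITER '\t'""".stripMargin)
} finally {
  conn.close()
}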
I'm running Hive on EMR and need to copy some files to all EMR instances.
One way, as I understand it, is to copy the files to the local file system on each node; the other is to copy the files to HDFS. However, I haven't found a simple way to copy straight from S3 to HDFS.
What is the best way to go about this?
The best way to do this is to use Hadoop's distcp command. Example (on one of the cluster nodes):
% ${HADOOP_HOME}/bin/hadoop distcp s3n://mybucket/myfile /root/myfile
This would copy a file called myfile from an S3 bucket named mybucket to /root/myfile in HDFS. Note that this example assumes you are using the S3 file system in "native" mode; this means that Hadoop sees each object in S3 as a file. If you use S3 in block mode instead, you would replace s3n with s3 in the example above. For more info about the differences between native S3 and block mode, as well as an elaboration on the example above, see http://wiki.apache.org/hadoop/AmazonS3.
I found that distcp is a very powerful tool. In addition to being able to use it to copy a large amount of files in and out of S3, you can also perform fast cluster-to-cluster copies with large data sets. Instead of pushing all the data through a single node, distcp uses multiple nodes in parallel to perform the transfer. This makes distcp considerably faster when transferring large amounts of data, compared to the alternative of copying everything to the local file system as an intermediary.
Now Amazon itself has a wrapper implemented over distcp, namely s3distcp.
S3DistCp is an extension of DistCp that is optimized to work with Amazon Web Services (AWS), particularly Amazon Simple Storage Service (Amazon S3). You use S3DistCp by adding it as a step in a job flow. Using S3DistCp, you can efficiently copy large amounts of data from Amazon S3 into HDFS where it can be processed by subsequent steps in your Amazon Elastic MapReduce (Amazon EMR) job flow. You can also use S3DistCp to copy data between Amazon S3 buckets or from HDFS to Amazon S3.
Example: Copy log files from Amazon S3 to HDFS
The following example illustrates how to copy log files stored in an Amazon S3 bucket into HDFS. In this example, the --srcPattern option is used to limit the data copied to the daemon logs.
elastic-mapreduce --jobflow j-3GY8JC4179IOJ --jar \
s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar \
--args '--src,s3://myawsbucket/logs/j-3GY8JC4179IOJ/node/,\
--dest,hdfs:///output,\
--srcPattern,.*daemons.*-hadoop-.*'
Note that, according to Amazon, at http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/FileSystemConfig.html "Amazon Elastic MapReduce - File System Configuration", the S3 Block FileSystem is deprecated and its URI prefix is now s3bfs://, and they specifically discourage using it since "it can trigger a race condition that might cause your job flow to fail".
According to the same page, HDFS is now a 'first-class' file system under S3, although it is ephemeral (it goes away when the Hadoop job ends).