I am working on AWS Glue and created an ETL job for upserts. I have an S3 bucket where I have my CSV file in a folder. I am reading the file from S3 and want to write it back to S3 as a Delta Lake table (Parquet files) using this code:
from delta.tables import DeltaTable
from pyspark.sql.session import SparkSession

spark = SparkSession.builder \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

inputDF = spark.read.format("csv").option("header", "true").load('s3://demonidhi/superstore/')
print(inputDF)

# Write data as a DELTA TABLE
inputDF.write.format("delta").mode("overwrite").save("s3a://demonidhi/current/")

# Generate MANIFEST file for Athena/Catalog
deltaTable = DeltaTable.forPath(spark, "s3a://demonidhi/current/")
deltaTable.generate("symlink_format_manifest")
I am using a Delta jar file named 'delta-core_2.11-0.6.1.jar', which is in an S3 bucket folder; I gave its path in the Python library path and in the Dependent jars path while creating my job.
Up to the reading part the code works just fine, but after that the writing and manifest generation fail with an error that I am not able to see in the Glue terminal. I tried several different approaches but could not figure out how to resolve this. Any help would be appreciated.
Using the spark.config() notation will not work in Glue, because the abstraction Glue uses (the GlueContext) will override those parameters.
What you can do instead is provide the config as a parameter to the job itself, with the key --conf and the value spark.delta.logStore.class=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
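For reference, here is a rough sketch of passing those parameters when creating the job with boto3 instead of the console. This is only a hedged example: the job name, IAM role, script location and jar path below are placeholders, not values taken from the question.

import boto3

glue = boto3.client("glue")

# Chain the extra Spark settings into a single --conf job parameter,
# which is the workaround described above.
delta_conf = (
    "spark.delta.logStore.class=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore"
    " --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension"
    " --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
)

glue.create_job(
    Name="delta-upsert-job",        # placeholder job name
    Role="MyGlueServiceRole",       # placeholder IAM role
    Command={
        "Name": "glueetl",
        "PythonVersion": "3",
        "ScriptLocation": "s3://my-bucket/scripts/delta_job.py",  # placeholder script path
    },
    DefaultArguments={
        "--conf": delta_conf,
        # Make the Delta jar available to the job; the path is a placeholder.
        "--extra-jars": "s3://my-bucket/jars/delta-core_2.11-0.6.1.jar",
    },
)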
I'm trying to load a .csv file to BQ using the console. It has a size of 45 MB. I see that using "upload" I can only load up to 10 MB. I don't have access to Drive and don't have access to run bq load from the command line on my local machine, as permission is denied.
Is there any workaround for this? It would be a great help. Thanks!
You can upload the file to a Google Cloud Storage bucket, then copy its "gs://" storage URL. Then in the console, you can Create Table, select source "Google Cloud Storage", and paste your URL.
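If you later get programmatic access, the same flow can also be done from Python. This is only a hedged sketch; the bucket, object, project, dataset and table names are placeholders:

from google.cloud import bigquery, storage

# Upload the local CSV to a GCS bucket (names are placeholders).
storage.Client().bucket("my-bucket").blob("uploads/data.csv").upload_from_filename("data.csv")

# Load the CSV into BigQuery straight from the gs:// URI.
client = bigquery.Client()
load_job = client.load_table_from_uri(
    "gs://my-bucket/uploads/data.csv",
    "my-project.my_dataset.my_table",
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
    ),
)
load_job.result()  # wait for the load to finish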
I was able to upload a file greater than the 10 MB limit by following this tutorial.
In order to execute the python script, you just need to install the bigquery lib in your virtualenv.
pip install google-cloud-bigquery
If you do not have a dataset created, you just need to run the following command from the Cloud Console to create a new dataset.
$ bq mk pythoncsv
#Dataset 'healthy-pager-276023:pythoncsv' successfully created.
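If you prefer to stay in Python, creating the dataset with the client library should be equivalent; a small sketch, reusing the project and dataset IDs from the command above:

from google.cloud import bigquery

client = bigquery.Client()
# exists_ok=True makes the call a no-op if the dataset is already there.
client.create_dataset("healthy-pager-276023.pythoncsv", exists_ok=True)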
After creating your dataset successfully, just fire the Python script to upload your CSV.
My final solution is this Python script:
from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set table_id to the ID of the table to create.
# JUST FOLLOW THIS PATTERN: <projectid>.<datasetname>.<tablename>
table_id = "healthy-pager-276023.pythoncsv.table_name"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV, skip_leading_rows=1, autodetect=True,
)

path_to_file_name = "massdata.csv"  # <-- PATH TO CSV TO IMPORT

with open(path_to_file_name, "rb") as source_file:
    job = client.load_table_from_file(source_file, table_id, job_config=job_config)

job.result()  # Waits for the job to complete.

table = client.get_table(table_id)  # Make an API request.
print("Loaded {} rows and {} columns to {}".format(table.num_rows, len(table.schema), table_id))
I have a Spark job running on an EMR cluster that writes out a DataFrame to HDFS (which is then s3-dist-cp-ed to S3). The data size isn't big (2 GB when saved as Parquet). The data in S3 is then copied to a local filesystem (an EC2 instance running Linux) and then loaded into a Java application.
It turns out I cannot have the data in parquet format because parquet has been designed for HDFS and cannot be used in local FS (if I am wrong, please point me to a resource on how to read parquet files on local FS).
What other format can I use to address this? Would Avro be compact enough and not blow up the size of data by packing the schema with each row of the dataframe?
You can use Parquet on a local filesystem. To see an example in action, download the parquet-mr library from here, build it with the local profile (mvn -P local install should do it, provided that you have thrift and protoc installed), then issue the following to see the contents of your parquet file:
java -jar parquet-tools/target/parquet-tools-1.10.0.jar cat /path/to/your-file.parquet
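As a quick sanity check that Parquet also reads fine from a local filesystem in Python, assuming pandas with the pyarrow (or fastparquet) engine is installed; the file path is a placeholder:

import pandas as pd

# Read a Parquet file straight off the local filesystem.
df = pd.read_parquet("/path/to/your-file.parquet")
print(df.head())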
I need to copy a zipped file from one AWS S3 folder to another and would like to make that a scheduled AWS Glue job. I cannot find an example for such a simple task. Please help if you know the answer. Maybe the answer is in AWS Lambda, or other AWS tools.
Thank you very much!
You can do this, and there may be a reason to use AWS Glue: if you have chained Glue jobs and glue_job_#2 is triggered on the successful completion of glue_job_#1.
The simple Python script below moves a file from one S3 folder (source) to another folder (target) using the boto3 library, and optionally deletes the original copy in the source directory.
import boto3

bucketname = "my-unique-bucket-name"
s3 = boto3.resource('s3')
my_bucket = s3.Bucket(bucketname)

source = "path/to/folder1"
target = "path/to/folder2"

for obj in my_bucket.objects.filter(Prefix=source):
    source_filename = (obj.key).split('/')[-1]
    copy_source = {
        'Bucket': bucketname,
        'Key': obj.key
    }
    target_filename = "{}/{}".format(target, source_filename)
    s3.meta.client.copy(copy_source, bucketname, target_filename)
    # Uncomment the line below if you wish to delete the original source file
    # s3.Object(bucketname, obj.key).delete()
Reference: Boto3 Docs on S3 Client Copy
Note: I would use f-strings for generating the target_filename, but f-strings are only supported in >= Python3.6 and I believe the default AWS Glue Python interpreter is still 2.7.
Reference: PEP on f-strings
I think you can do it with Glue, but wouldn't it be easier to use the CLI?
You can do the following:
aws s3 sync s3://bucket_1 s3://bucket_2
You could do this with Glue but it's not the right tool for the job.
Far simpler would be to have a Lambda job triggered by an S3 created-object event. There's even a tutorial in the AWS docs on doing (almost) this exact thing.
http://docs.aws.amazon.com/lambda/latest/dg/with-s3-example.html
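A minimal sketch of such a Lambda handler; the target bucket and prefix are placeholders you would replace:

import boto3
from urllib.parse import unquote_plus

s3 = boto3.client("s3")

def handler(event, context):
    # The S3 "object created" event carries the bucket and key of the new object.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])  # keys arrive URL-encoded
        # Copy it to the target location (placeholder bucket/prefix).
        s3.copy_object(
            CopySource={"Bucket": bucket, "Key": key},
            Bucket="my-target-bucket",
            Key="copied/" + key,
        )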
We ended up using Databricks to do everything.
Glue is not ready. It returns error messages that make no sense. We created tickets and waited for five days, and still got no reply.
The S3 API lets you issue a COPY command (really a PUT with a header to indicate the source URL) to copy objects within or between buckets. It's regularly used to fake rename()s, but you could initiate the call yourself from anything.
There is no need to download any data; within the same S3 region the copy has a bandwidth of about 6-10 MB/s.
The AWS CLI cp command can do this.
You can do that by downloading your zip file from S3 to a tmp/ directory and then re-uploading it to S3.
import boto3

s3 = boto3.resource('s3')
Download file to local spark directory tmp:
s3.Bucket(bucket_name).download_file(DATA_DIR+file,'tmp/'+file)
Upload file from local spark directory tmp:
s3.meta.client.upload_file('tmp/'+file,bucket_name,TARGET_DIR+file)
Now you can write a Python shell job in Glue to do it. Just set the Type in the Glue job creation wizard to Python Shell. You can run a normal Python script in it.
Nothing like Glue is required. I believe AWS Data Pipeline is the best option: just use its command-line option. Scheduled runs are also possible. I have already tried this and it worked successfully.
I've created a cluster in Amazon EMR using emr-4.0.0, Hadoop distribution Amazon 2.6.0, and Hive 1.0.0. I need to install Sqoop so that I can communicate between Hive and Redshift. What are the steps to install Sqoop in an EMR cluster? Please provide the steps. Thank you!
Note that in EMR 4.0.0 hadoop fs -copyToLocal will throw errors.
Use aws s3 cp instead.
To be more specific than Amal:
Download the latest version of Sqoop and upload it to an S3 location. I am using sqoop-1.4.4.bin__hadoop-2.0.4-alpha and it seems to work just fine with EMR 4.0.0.
Download the JAR connector for Redshift and upload it to the same S3 location. This page might help.
Upload a script similar to the one below to S3
#!/bin/bash
# Install sqoop and the Redshift JDBC connector. Store in S3 and load
# as a bootstrap step.
bucket_location='s3://your-sqoop-jars-location/'
sqoop_jar='sqoop-1.4.4.bin__hadoop-2.0.4-alpha'
sqoop_jar_gz=$sqoop_jar.tar.gz
redshift_jar='RedshiftJDBC41-1.1.7.1007.jar'
cd /home/hadoop
aws s3 cp $bucket_location$sqoop_jar_gz .
tar -xzf $sqoop_jar_gz
aws s3 cp $bucket_location$redshift_jar .
cp $redshift_jar $sqoop_jar/lib/
Set SQOOP_HOME and add SQOOP_HOME to the PATH to be able to call sqoop from anywhere. These entries should be made in /etc/bashrc. Otherwise you will have to use the full path, in this case: /home/hadoop/sqoop-1.4.4.bin__hadoop-2.0.4-alpha/bin/sqoop
I am using Java to programatically launch my EMR cluster. To configure bootstrap steps in Java I create a BootstrapActionConfigFactory:
public final class BootstrapActionConfigFactory {

    private static final String bucket = Config.getBootstrapBucket();

    // make class non-instantiable
    private BootstrapActionConfigFactory() {
    }

    /**
     * Adds an install Sqoop step to the job that corresponds to the version set in the Config class.
     */
    public static BootstrapActionConfig newInstallSqoopBootstrapActionConfig() {
        return newInstallSqoopBootstrapActionConfig(Config.getHadoopVersion().charAt(0));
    }

    /**
     * Adds an install Sqoop step to the job that corresponds to the version specified in the parameter.
     *
     * @param hadoopVersion the main version number for Hadoop. E.g.: 1, 2
     */
    public static BootstrapActionConfig newInstallSqoopBootstrapActionConfig(char hadoopVersion) {
        return new BootstrapActionConfig().withName("Install Sqoop")
                .withScriptBootstrapAction(
                        new ScriptBootstrapActionConfig().withPath("s3://" + bucket + "/sqoop-tools/hadoop" + hadoopVersion + "/bootstrap-sqoop-emr4.sh"));
    }
}
Then when creating the job:
Job job = new Job(Region.getRegion(Regions.US_EAST_1));
job.addBootstrapAction(BootstrapActionConfigFactory.newInstallSqoopBootstrapActionConfig());
Download the tarball of Sqoop and keep it in an S3 bucket. Create a bootstrap script that performs the following activities:
Download the Sqoop tarball to the required instances
Extract the tarball
Set SQOOP_HOME and add SQOOP_HOME to the PATH. These entries should be made in /etc/bashrc
Add the required connector jars to the lib of Sqoop.
Keep this script in S3 and point to this script in the bootstrap actions.
Note that from EMR 4.4.0, AWS added support for Sqoop 1.4.6 to the EMR cluster. Installation is done with a couple of clicks during setup. There is no need for manual installation.
References:
https://aws.amazon.com/blogs/aws/amazon-emr-4-4-0-sqoop-hcatalog-java-8-and-more/
http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-sqoop.html
I want to use LZO compression on my Elastic Map Reduce job's output that is being stored on S3, but it is not clear if the files are automatically indexed so that future jobs run on this data will split the files into multiple tasks.
For example, if my output is a bunch of lines of TSV data, in a 1GB LZO file, will a future map job only create 1 task, or something like (1GB/blockSize) tasks (i.e. the behavior of when files were not compressed, or if there was a LZO index file in the directory)?
Edit: If this is not done automatically, what is recommended for getting my output to be LZO-indexed? Do the indexing before uploading the file to S3?
Short answer to my first question: AWS does not do automatic indexing. I've confirmed this with my own job, and also read the same from Andrew@AWS on their forum.
Here's how you can do the indexing:
To index some LZO files, you'll need to use my own Jar built from the Twitter hadoop-lzo project. You'll need to build the Jar somewhere, then upload it to Amazon S3, if you want to index directly with EMR.
On a side note, Cloudera has good instructions on all the steps for setting this up on your own cluster. I did this on my local cluster, which allowed me to build the Jar and upload it to S3. You can probably find a pre-built Jar on the net if you don't want to build it yourself.
When outputting your data from your Hadoop job, make sure you use the LzopCodec and not the LzoCodec, otherwise the files are not indexable (at least based on my experience). Example Java code (same idea carries over to Streaming API):
import com.hadoop.compression.lzo.LzopCodec;

TextOutputFormat.setCompressOutput(job, true);
TextOutputFormat.setOutputCompressorClass(job, LzopCodec.class);
Once your hadoop-lzo Jar is on S3, and your Hadoop job has outputted .lzo files, run your indexer on the output directory (the instructions below assume you have an EMR job/cluster running):
elastic-mapreduce -j <existingJobId> \
--jar s3n://<yourBucketName>/hadoop-lzo-0.4.17-SNAPSHOT.jar \
--args com.hadoop.compression.lzo.DistributedLzoIndexer \
--args s3://<yourBucketName>/output/myLzoJobResults \
--step-name "Lzo file indexer Jar"
Then when you're using the data in a future job, be sure to specify that the input is in LZO format, otherwise the splitting won't occur. Example Java code:
import com.hadoop.mapreduce.LzoTextInputFormat;
job.setInputFormatClass(LzoTextInputFormat.class);