How to read parquet files from tar.bz2 using pyspark on EMR Notebook?

I'm trying to read in parquet (snappy) files in a tar.bz2 format from an S3 bucket, but receive this error:
"java.io.IOException: Could not read footer for file"
when using:
df = spark.read.parquet("s3://bucket/folder/filename.tar.bz2")
I have downloaded one of the tar.bz2 files and verified that the archive does contain valid parquet files.
I also looked into using S3 Select via the boto3 package, but it looks like BZIP2 is currently only supported for CSV and JSON files, not parquet. I have also attempted to extract the files and upload them back to S3 like this:
sc.install_pypi_package("boto3")
import boto3
from io import BytesIO
import tarfile
s3 = boto3.client('s3')
obj = s3.get_object(Bucket=bucket,Key='folder/filename.tar.bz2')
wholefile = obj['Body'].read()
fileobj = BytesIO(wholefile)
tar = tarfile.open(fileobj=fileobj)
s3.upload_fileobj(Fileobj=tar, Bucket=bucket, Key='folder/filename.parquet')
However, this simply yields an "Interpreter died:" error message without any additional information (possibly due to insufficient memory in the cluster?). Is there a step I'm missing to decompress the tar archive?
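A minimal sketch of extracting the parquet members and uploading them individually looks like this (the output prefix is a placeholder, and this still buffers the whole archive in driver memory):
import tarfile
from io import BytesIO
import boto3

s3 = boto3.client('s3')
bucket = 'bucket'  # placeholder bucket name from the question
obj = s3.get_object(Bucket=bucket, Key='folder/filename.tar.bz2')
fileobj = BytesIO(obj['Body'].read())

# 'r:bz2' tells tarfile to decompress the bzip2 layer before reading members
with tarfile.open(fileobj=fileobj, mode='r:bz2') as tar:
    for member in tar.getmembers():
        if member.isfile() and member.name.endswith('.parquet'):
            extracted = tar.extractfile(member)  # file-like object for this member
            # 'folder/extracted/' is a placeholder output prefix
            s3.upload_fileobj(
                Fileobj=extracted,
                Bucket=bucket,
                Key='folder/extracted/' + member.name.split('/')[-1],
            )

# Spark can then read the uncompressed parquet files directly:
# df = spark.read.parquet("s3://bucket/folder/extracted/")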

Related

How to upload large files from local pc to DBFS?

I am trying to learn Spark SQL in Databricks and want to work with the Yelp dataset; however, the file is too large to upload to DBFS from the UI. Thanks, Filip
There are several approaches to that:
Use the Databricks CLI's dbfs command to upload local data to DBFS.
Download the dataset directly from the notebook, for example with %sh wget URL, and unpack the archive to DBFS (either by using /dbfs/path/... as the destination, or by using the dbutils.fs.cp command to copy files from the driver node to DBFS; a short sketch follows this list).
Upload the files to AWS S3, Azure Data Lake Storage, Google Cloud Storage, or similar, and access the data there.
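As a minimal sketch of the second option above, run from a notebook cell (the URL and file name are placeholders):
# In one cell, download the archive to the driver's local disk, e.g.:
#   %sh wget -O /tmp/yelp_dataset.tgz https://example.com/yelp_dataset.tgz

# Then copy it from the driver's local disk to DBFS
dbutils.fs.cp("file:/tmp/yelp_dataset.tgz", "dbfs:/FileStore/tables/yelp_dataset.tgz")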
Upload the file you want to load in Databricks to Google Drive
from urllib.request import urlopen
from shutil import copyfileobj
my_url = 'paste your url here'
my_filename = 'give your filename'
file_path = '/FileStore/tables' # location at which you want to move the downloaded file
# Download the file from Google Drive to the Databricks driver node
with urlopen(my_url) as in_stream, open(my_filename, 'wb') as out_file:
    copyfileobj(in_stream, out_file)
# check where the file has been downloaded
# in my case it is
display(dbutils.fs.ls('file:/databricks/driver'))
# move the file to the desired location
# dbutils.fs.mv(downloaded_location, desired_location)
dbutils.fs.mv("file:/databricks/driver/" + my_filename, file_path)
I hope this helps

Cannot Locate the file in AWS EMR Notebook

I have been trying to use some .txt and .csv files in an EMR Notebook, but I cannot locate them.
I am trying to read via:
with open('file.txt', 'r') as f:
    notes = f.read()
Things I tried:
Uploaded the file using the JupyterHub UI. I can see the file, but I can't read it from that path. I also checked the file using the JupyterHub terminal.
Tried to read from S3 (many people report getting it working this way):
with open('s3://<repo>/file.txt', 'r') as f:
Copied the file to HDFS on the master node (in the cluster) using both hdfs dfs and hadoop fs; the file is present in both locations.
However, I have no clue how I can reach the file in EMR Notebook.
Any ideas?
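A plain built-in open() cannot resolve s3:// paths; a minimal sketch that fetches the object with boto3 from inside the notebook (reusing the placeholder bucket name <repo> from above) would be:
import boto3

s3 = boto3.client('s3')
# <repo> is the placeholder bucket name used above
obj = s3.get_object(Bucket='<repo>', Key='file.txt')
notes = obj['Body'].read().decode('utf-8')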

AWS GLUE Not able to write Delta lake in s3

I am working on AWS Glue and created an ETL job for upserts. I have an S3 bucket with my CSV file in a folder. I am reading the file from S3 and want to write it back to S3 as a Delta Lake table (parquet files) using this code:
from delta import *
from pyspark.sql.session import SparkSession
spark = SparkSession.builder \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()
inputDF = spark.read.format("csv").option("header", "true").load('s3://demonidhi/superstore/')
print(inputDF)
# Write data as DELTA TABLE
inputDF.write.format("delta").mode("overwrite").save("s3a://demonidhi/current/")
# Generate MANIFEST file for Athena/Catalog
deltaTable = DeltaTable.forPath(spark, "s3a://demonidhi/current/")
I am using a Delta jar file named 'delta-core_2.11-0.6.1.jar', which sits in an S3 bucket folder; I gave its path in the Python library path and in the Dependent jars path while creating my job.
Up to the reading step the code works just fine, but the write and manifest steps fail with an error that I cannot see in the Glue console. I have tried several different approaches, but I am not able to figure out how to resolve this. Any help would be appreciated.
Using the spark.config() notation will not work in Glue, because the abstraction that Glue uses (the GlueContext) will override those parameters.
What you can do instead is provide the configuration as a parameter to the job itself, with the key --conf and the value:
spark.delta.logStore.class=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
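If the job is defined through boto3 rather than the console, the same key/value pair can go into DefaultArguments; the job name, role, script location, and jar path below are placeholders, and only the --conf and --extra-jars entries matter here:
import boto3

glue = boto3.client('glue')

glue.create_job(
    Name='delta-upsert-job',      # placeholder job name
    Role='MyGlueServiceRole',     # placeholder IAM role
    GlueVersion='2.0',
    Command={
        'Name': 'glueetl',
        'ScriptLocation': 's3://demonidhi/scripts/job.py',  # placeholder script path
        'PythonVersion': '3',
    },
    DefaultArguments={
        '--conf': 'spark.delta.logStore.class=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore'
                  ' --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension'
                  ' --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog',
        '--extra-jars': 's3://demonidhi/jars/delta-core_2.11-0.6.1.jar',  # placeholder jar path
    },
)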

Not able to read HDF5 file present in S3 in sagemaker notebook instance

My directory structure looks like this: bucket-name/training/file.hdf5
I tried reading this file in sagemaker notebook instance by this code cell:
bucket='bucket-name'
data_key = 'training/file.hdf5'
data_location = 's3://{}/{}'.format(bucket, data_key)
hf = h5py.File(data_location, 'r')
But it gives me error:
Unable to open file (unable to open file: name = 's3://bucket-name/training/file.hdf5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)
I have also tried pd.read_hdf(data_location) but was not successful.
Reading a csv file into a dataframe from the same key doesn't throw an error.
Any help is appreciated. Thanks
Thanks for asking the question here!
Your file is on the remote storage service Amazon S3. The string data_location is not the name of a local file, so your data reader cannot open it. In order to read the S3 file, you have two options:
Use a library that can read files from S3. It seems that h5py can do that if you specify driver='ros3'.
Alternatively, you can bring the file from S3 to your machine and then read it locally. For example, use the AWS CLI to copy it down with aws s3 cp s3://<your bucket>/<your file on s3> /home/ec2-user/SageMaker/ and then h5py.File('/home/ec2-user/SageMaker/your-file-name.hdf5', 'r') should work.
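A minimal sketch of both options; the URL, bucket, and local path are placeholders, and the ros3 driver requires an h5py/HDF5 build with read-only S3 support (plus credential keywords for private buckets):
import h5py

# Option 1: open the object in place with the read-only S3 (ros3) driver
hf = h5py.File(
    'https://bucket-name.s3.amazonaws.com/training/file.hdf5',  # placeholder URL
    'r',
    driver='ros3',
)

# Option 2: copy the object to local disk first, then open it as a normal file, e.g. after
#   aws s3 cp s3://bucket-name/training/file.hdf5 /home/ec2-user/SageMaker/
hf = h5py.File('/home/ec2-user/SageMaker/file.hdf5', 'r')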

Could we use AWS Glue just copy a file from one S3 folder to another S3 folder?

I need to copy a zipped file from one AWS S3 folder to another and would like to make that a scheduled AWS Glue job. I cannot find an example of such a simple task. Please help if you know the answer. Maybe the answer is in AWS Lambda or another AWS tool.
Thank you very much!
You can do this, and there may be a reason to use AWS Glue: if you have chained Glue jobs and glue_job_#2 is triggered on the successful completion of glue_job_#1.
The simple Python script below moves a file from one S3 folder (source) to another folder (target) using the boto3 library, and optionally deletes the original copy in the source directory.
import boto3
bucketname = "my-unique-bucket-name"
s3 = boto3.resource('s3')
my_bucket = s3.Bucket(bucketname)
source = "path/to/folder1"
target = "path/to/folder2"
for obj in my_bucket.objects.filter(Prefix=source):
    source_filename = (obj.key).split('/')[-1]
    copy_source = {
        'Bucket': bucketname,
        'Key': obj.key
    }
    target_filename = "{}/{}".format(target, source_filename)
    s3.meta.client.copy(copy_source, bucketname, target_filename)
    # Uncomment the line below if you wish to delete the original source file
    # s3.Object(bucketname, obj.key).delete()
Reference: Boto3 Docs on S3 Client Copy
Note: I would use f-strings for generating the target_filename, but f-strings are only supported in Python 3.6+ and I believe the default AWS Glue Python interpreter is still 2.7.
Reference: PEP on f-strings
I think you can do it with Glue, but wouldn't it be easier to use the CLI?
You can do the following:
aws s3 sync s3://bucket_1 s3://bucket_2
You could do this with Glue but it's not the right tool for the job.
Far simpler would be to have a Lambda job triggered by a S3 created-object event. There's even a tutorial on AWS Docs on doing (almost) this exact thing.
http://docs.aws.amazon.com/lambda/latest/dg/with-s3-example.html
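A minimal sketch of such a handler; the destination bucket and key prefix are placeholders:
import boto3
from urllib.parse import unquote_plus

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # Copy every newly created object into another bucket/prefix
    for record in event['Records']:
        src_bucket = record['s3']['bucket']['name']
        src_key = unquote_plus(record['s3']['object']['key'])
        s3.copy_object(
            Bucket='destination-bucket',   # placeholder target bucket
            Key='copied/' + src_key,       # placeholder target prefix
            CopySource={'Bucket': src_bucket, 'Key': src_key},
        )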
We ended up using Databricks to do everything.
Glue is not ready. It returns error messages that make no sense. We created tickets and waited for five days, and still got no reply.
The S3 API lets you issue a COPY command (really a PUT with a header indicating the source URL) to copy objects within or between buckets. It's regularly used to fake rename()s, but you could initiate the call yourself from anything.
There is no need to download any data; within the same S3 region the copy runs at a bandwidth of about 6-10 MB/s.
The AWS CLI cp command can do this.
You can do that by downloading your zip file from S3 to a local tmp/ directory and then re-uploading it to S3.
s3 = boto3.resource('s3')
Download file to local spark directory tmp:
s3.Bucket(bucket_name).download_file(DATA_DIR+file,'tmp/'+file)
Upload file from local spark directory tmp:
s3.meta.client.upload_file('tmp/'+file,bucket_name,TARGET_DIR+file)
Now you can write a Python shell job in Glue to do this. Just set Type to Python Shell in the Glue job creation wizard; you can run a normal Python script in it.
Nothing else is required. I believe AWS Data Pipeline is the best option here. Just use the command-line option; scheduled runs are also possible. I already tried it and it worked successfully.