Unable to unzip gz file transferred to HDFS via Flume spool directory - gzip

I am using the spooldir source to move .gz files from a spool directory to HDFS.
I am using the following config:
==========================
a1.channels = ch-1
a1.sources = src-1
a1.sinks = k1
a1.channels.ch-1.type = memory
a1.channels.ch-1.capacity = 1000
a1.channels.ch-1.transactionCapacity = 100
a1.sources.src-1.type = spooldir
a1.sources.src-1.channels = ch-1
a1.sources.src-1.spoolDir = /path_to/flumeSpool
a1.sources.src-1.deserializer=org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder
a1.sources.src-1.basenameHeader=true
a1.sources.src-1.deserializer.maxBlobLength=400000000
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = ch-1
a1.sinks.k1.hdfs.path = hdfs://{namenode}:8020/path_to_hdfs
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.rollInterval =100
a1.sinks.k1.hdfs.rollCount=0
a1.sinks.k1.hdfs.rollSize=0
a1.sinks.k1.hdfs.fileType = CompressedStream
a1.sinks.k1.hdfs.codeC=gzip
a1.sinks.k1.hdfs.callTimeout=120000
========================================
The file does get transferred to HDFS, but the sink appends a time_in_millis.gz suffix to the name.
Also, when I copy the file out of HDFS via the terminal and try to gunzip it, it shows unknown characters, so I am not sure what is going on.
I would like to keep the same filename after the transfer to HDFS.
I would like to be able to unzip the file and read its content.
Can someone help?
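One likely cause (an assumption, not confirmed in the question) is double compression: the spooled files are already gzipped, BlobDeserializer passes their raw bytes through, and CompressedStream with codeC=gzip compresses them a second time, so gunzip only strips the outer layer and leaves the inner gzip bytes looking like garbage. A minimal sketch of a sink that writes the bytes unchanged and reuses the original filename via the basename header could look like this (the HDFS sink still appends a counter to the file name to avoid collisions):
# sketch only: write events as-is and name files after the spooled source file
a1.sources.src-1.basenameHeader = true
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.filePrefix = %{basename}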

Related

Transfer Files from one s3 Folder to Another s3 Folder using Boto3 Python

I am trying to copy files from one S3 bucket to another, with some modifications to the destination path.
The original script is below:
import boto3
import os

old_bucket_name = 'XT01-sample-data'
old_prefix = 'Test/'
new_bucket_name = 'XT02-sample-data2'
new_prefix = old_bucket_name + '/' + old_prefix

s3 = boto3.resource('s3')
ENCRYPTION = os.environ.get('SERVER_SIDE_ENCRYPTION', 'AES256')
STORAGE_CLASS = os.environ.get('STORAGE_CLASS', 'INTELLIGENT_TIERING')
old_bucket = s3.Bucket(old_bucket_name)
new_bucket = s3.Bucket(new_bucket_name)
extra_args = {
    'ServerSideEncryption': ENCRYPTION,
    'StorageClass': STORAGE_CLASS
}

for obj in old_bucket.objects.filter(Prefix=old_prefix):
    old_source = {'Bucket': old_bucket_name,
                  'Key': obj.key}
    # replace the prefix
    new_key = obj.key.replace(old_prefix, new_prefix, 1)
    new_obj = new_bucket.Object(new_key)
    print("Object old ", obj)
    print("new_key ", new_key)
    print("new_obj ", new_obj)
    new_obj.copy(old_source, ExtraArgs=extra_args)

print("Starting Deletion Loop")
bucket = s3.Bucket(old_bucket_name)
bucket.objects.filter(Prefix=old_prefix).delete()
The above script copies the files from bucket XT01-sample-data, folder Test/,
to the new bucket XT02-sample-data2 under the new path XT01-sample-data/Test/.
The ask is now to modify the script so that a timestamp is added to the destination path and all files from one source folder land under the same timestamp.
Eg:
We have below files in source bucket at various folders
XT01-sample-data/Test1/Test1.1/File1.csv
XT01-sample-data/Test1/Test1.1/File2.csv
XT01-sample-data/Test1/Test1.1/File3.csv
XT01-sample-data/Test1/Test1.2/File1.2.1.csv
XT01-sample-data/Test1/Test1.2/File1.2.2.csv
XT01-sample-data/Test1/Test1.3/File1.3.csv
XT01-sample-data/Test1/Test2/Test2.1/File2.1.csv
The expected output is that all files from one subfolder are placed under one timestamp.
Not all files should share a single timestamp; there should be a level of segregation based on a millisecond-level (Unix) timestamp.
For Files under folder Test1/Test1.1
XT01-sample-data/Test1/2020/01/23/0000/File1.csv
XT01-sample-data/Test1/2020/01/23/0000/File2.csv
XT01-sample-data/Test1/2020/01/23/0000/File3.csv
For files under folder Test1/Test1.2
XT01-sample-data/Test1/2020/01/23/0003/File1.2.1.csv
XT01-sample-data/Test1/2020/01/23/0003/File1.2.2.csv
For files under folder Test1/Test1.3
XT01-sample-data/Test1/2020/01/23/0004/File1.3.csv
For files under folder Test1/Test2/Test2.1
XT01-sample-data/Test1/2020/01/23/0005/File2.1.csv
I was able to resolve this by making an entry in a DynamoDB table.
Logic:
As soon as the first file of a folder is encountered, an entry is made in the DynamoDB table.
When the next file of that folder is encountered, a check is made in the DynamoDB table before copying the data, and the same date already present in DynamoDB is reused.
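A minimal sketch of that logic, assuming a hypothetical DynamoDB table named folder_timestamps with a string partition key folder:
import time
import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('folder_timestamps')  # hypothetical table name

def timestamp_for_folder(folder):
    """Return the stored timestamp for a source folder, creating it on first sight."""
    resp = table.get_item(Key={'folder': folder})
    item = resp.get('Item')
    if item:
        # folder already seen: reuse the same timestamp so all its files land together
        return item['ts']
    ts = str(int(time.time() * 1000))  # millisecond Unix timestamp
    table.put_item(Item={'folder': folder, 'ts': ts})
    return ts
The returned value can then be spliced into new_key in place of the static prefix when copying each object.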

Loading local file using Cloud Functions to BigQuery returns OSError: [Errno 30] Read-only file system

I am trying to load a Dataframe into BigQuery. I do this as follows:
# Prepare temp file to stream from local file
temp_file = table_name + '-' + str(timestamp_in_ms())
df.to_csv(temp_file, index=None, header=True)
# Define job_config
job_config = bigquery.LoadJobConfig()
job_config.schema = schema
job_config.skip_leading_rows = 1
job_config.source_format = bigquery.SourceFormat.CSV
# Create job to load data into table
with open(temp_file, "r+b") as source_file:
    load_job = client.load_table_from_file(source_file, dataset_ref.table(table_name), job_config=job_config)
This works fine in local development; however, when I deploy the Cloud Function it returns the following error:
OSError: [Errno 30] Read-only file system: '{temp_file}'
This happens on the line with open(temp_file, "r+b") as source_file:
Why can it not access local files in the Cloud Function's temporary storage? What went wrong?
You probably didn't write to the /tmp folder.
Local Disk
Cloud Functions provides access to a local disk mount point (/tmp), known as a "tmpfs" volume, in which data written to the volume is stored in memory. There is no specific fee associated with this; however, writing data to the /tmp mountpoint will consume memory resources provisioned for the function.
As explained on: https://cloud.google.com/functions/pricing
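A minimal adjustment of the question's snippet, sketched as a function so it is self-contained (int(time.time() * 1000) stands in for the question's timestamp_in_ms helper; df, client, dataset_ref, table_name and schema are passed in as in the question):
import os
import time

from google.cloud import bigquery

def load_df_to_bigquery(df, client, dataset_ref, table_name, schema):
    """Write the DataFrame to a CSV under /tmp (the only writable path) and load it."""
    temp_file = os.path.join('/tmp', table_name + '-' + str(int(time.time() * 1000)))
    df.to_csv(temp_file, index=None, header=True)

    job_config = bigquery.LoadJobConfig()
    job_config.schema = schema
    job_config.skip_leading_rows = 1
    job_config.source_format = bigquery.SourceFormat.CSV

    # "rb" is enough here; the load job only reads the file
    with open(temp_file, "rb") as source_file:
        return client.load_table_from_file(
            source_file, dataset_ref.table(table_name), job_config=job_config
        )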

Unzip a file to s3

I am looking for a simple way to extract a zip/gzip present in an S3 bucket to the same bucket location and delete the parent zip/gzip file after extraction.
I have been unable to achieve this with any of the APIs so far.
I have tried native boto, pyfilesystem (fs), and s3fs.
The source and destination links seem to be an issue for these functions.
(Using Python 2.x/3.x with Boto 2.x.)
I see there is an API for node.js (unzip-to-s3) to do this job, but none for Python.
A couple of implementations I can think of:
1. A simple API to extract the zip file within the same bucket.
2. Use S3 as a filesystem and manipulate the data.
3. Use a data pipeline to achieve this.
4. Transfer the zip to EC2, extract it, and copy it back to S3.
Option 4 would be the least preferred, to minimise the architecture overhead of an EC2 add-on.
I need support in getting this feature implemented, with integration with Lambda at a later stage. Any pointers to these implementations are greatly appreciated.
Thanks in Advance,
Sundar.
You could try https://www.cloudzipinc.com/, which unzips/expands several different formats of archives from S3 into a destination in your bucket. I used it to unzip components of a digital catalog into S3.
I have solved this by using an EC2 instance:
copy the S3 files to a local directory on EC2, extract them,
and copy that directory back to the S3 bucket.
Sample code to unzip to a local directory on an EC2 instance:
import os
import tarfile
import zipfile

def s3Unzip(srcBucket, dst_dir):
    '''
    function to decompress the s3 bucket contents to the local machine
    Args:
        srcBucket (string): source bucket name
        dst_dir (string): destination location in the local/ec2 file system
    Returns:
        None
    '''
    s3 = s3Conn  # existing boto connection
    bucket = s3.lookup(srcBucket)
    try:
        os.mkdir(dst_dir)
        print("local directories created")
    except Exception:
        logger_s3.warning("Exception in creating local directories to extract zip file / folder already existing")
    for key in bucket:
        # download each object into the destination directory
        path = os.path.join(dst_dir, key.name)
        key.get_contents_to_filename(path)
        # pick the right archive opener based on the extension
        if path.endswith('.zip'):
            opener, mode = zipfile.ZipFile, 'r'
        elif path.endswith('.tar.gz') or path.endswith('.tgz'):
            opener, mode = tarfile.open, 'r:gz'
        elif path.endswith('.tar.bz2') or path.endswith('.tbz'):
            opener, mode = tarfile.open, 'r:bz2'
        else:
            raise ValueError('unsupported format')
        cwd = os.getcwd()
        os.chdir(dst_dir)
        try:
            archive = opener(path, mode)
            try:
                archive.extractall()
            finally:
                archive.close()
            logger_s3.info('(%s) extracted successfully to %s' % (key, dst_dir))
        except Exception:
            logger_s3.error('failed to extract (%s) to %s' % (key, dst_dir))
        os.chdir(cwd)
    s3.close()
Sample code to upload to a MySQL instance:
Use the "LOAD DATA LOCAL INFILE" query to upload to MySQL directly.
def upload(file_path, timeformat):
    '''
    function to upload csv file data to mysql rds
    Args:
        file_path (list): local csv file paths
        timeformat (string): datetime format string passed to str_to_date
    Returns:
        None
    '''
    for file in file_path:
        try:
            con = connect()
            cursor = con.cursor()
            # load the csv straight into the table; @datetime is parsed with the given format
            qry = """LOAD DATA LOCAL INFILE '%s' INTO TABLE xxxx FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' (col1, col2, col3, @datetime, col4) SET datetime = str_to_date(@datetime, '%s');""" % (file, timeformat)
            cursor.execute(qry)
            con.commit()
            logger_rds.info("Loading file: " + file)
        except Exception:
            logger_rds.error("Exception in uploading " + file)
            # rollback in case there is any error
            con.rollback()
        cursor.close()
        # disconnect from server
        con.close()
Lambda function:
You can use a Lambda function to read the zipped files into a buffer, gzip the individual files, and re-upload them to S3. Then you can either archive the original files or delete them using boto.
You can also set an event-based trigger that runs the Lambda automatically every time a new zipped file lands in S3. Here's a full tutorial for exactly that: https://betterprogramming.pub/unzip-and-gzip-incoming-s3-files-with-aws-lambda-f7bccf0099c9
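A minimal sketch of such a Lambda, assuming an S3 ObjectCreated trigger and boto3 (the handler name and output key layout are illustrative):
import gzip
import io
import zipfile

import boto3

s3 = boto3.client('s3')

def handler(event, context):
    # bucket and key of the newly uploaded zip come from the S3 event record
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']

    # read the whole archive into memory
    body = s3.get_object(Bucket=bucket, Key=key)['Body'].read()
    with zipfile.ZipFile(io.BytesIO(body)) as archive:
        for name in archive.namelist():
            # gzip each member and upload it next to the original archive
            data = gzip.compress(archive.read(name))
            s3.put_object(Bucket=bucket, Key=name + '.gz', Body=data)

    # optionally remove the parent archive once everything is re-uploaded
    s3.delete_object(Bucket=bucket, Key=key)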

Use flume to stream data to S3

I am trying Flume for something very simple: I would like to push the content of my log files to S3. I was able to create a Flume agent that reads the content from an Apache access log file and uses a logger sink. Now I am trying to find a solution where I can replace the logger sink with an "S3 sink". (I know this does not exist by default.)
I am looking for some pointers to point me down the correct path. Below is the test properties file I am currently using.
a1.sources=src1
a1.sinks=sink1
a1.channels=ch1
#source configuration
a1.sources.src1.type=exec
a1.sources.src1.command=tail -f /var/log/apache2/access.log
#sink configuration
a1.sinks.sink1.type=logger
#channel configuration
a1.channels.ch1.type=memory
a1.channels.ch1.capacity=1000
a1.channels.ch1.transactionCapacity=100
#links
a1.sources.src1.channels=ch1
a1.sinks.sink1.channel=ch1
Hadoop's filesystem layer can talk to S3, so you can use the HDFS sink; just point hdfs.path at your bucket as shown below. Don't forget to replace <AWS.ACCESS.KEY> and <AWS.SECRET.KEY> with your credentials.
agent.sinks.s3hdfs.type = hdfs
agent.sinks.s3hdfs.hdfs.path = s3n://<AWS.ACCESS.KEY>:<AWS.SECRET.KEY>@<bucket.name>/prefix/
agent.sinks.s3hdfs.hdfs.fileType = DataStream
agent.sinks.s3hdfs.hdfs.filePrefix = FilePrefix
agent.sinks.s3hdfs.hdfs.writeFormat = Text
agent.sinks.s3hdfs.hdfs.rollCount = 0
# roll at a 64 MB file size
agent.sinks.s3hdfs.hdfs.rollSize = 67108864
agent.sinks.s3hdfs.hdfs.batchSize = 10000
agent.sinks.s3hdfs.hdfs.rollInterval = 0
This makes sense, but can a rollSize of this value be accompanied by the following?
agent_messaging.sinks.AWSS3.hdfs.round = true
agent_messaging.sinks.AWSS3.hdfs.roundValue = 30
agent_messaging.sinks.AWSS3.hdfs.roundUnit = minute

Lua - How can I save data from server on my pc?

How can I read data from my server and save it on my PC?
a=io.open(path.."/datafile","wb")
a:write("nonsense")
a:close()
Is it done the same way, or in another way?
I want to read this file from my server and save it to my PC, but how can I do that?
I hope someone can help me.
It is not completely clear what you are trying to do. If you want to copy a file from one machine to another, the following is a way to do it. Note that it works by reading the whole file content into memory before copying it to the destination, so it is not suitable for really huge files, say >~100MB (YMMV).
local SOURCE_PATH = "my/source/path/datafile.txt"
local DESTINATION_PATH = "another/path/datafile.txt"
local fh = assert( io.open( SOURCE_PATH, "rb" ) )
local content = fh:read "*all"
fh:close()
local fh_out = assert( io.open( DESTINATION_PATH, "wb" ) )
fh_out:write( content )
fh_out:close()
EDIT
Following a suggestion by @lhf, here is a version that can cope with huge files. It reads and then writes the file in small chunks:
local SOURCE_PATH = "my/source/path/datafile.txt"
local DESTINATION_PATH = "another/path/datafile.txt"
local BUFFER_SIZE = 4096 -- in bytes
local fh = assert( io.open( SOURCE_PATH, "rb" ) )
local fh_out = assert( io.open( DESTINATION_PATH, "wb" ) )
local data = fh:read( BUFFER_SIZE )
while data do
    fh_out:write( data )
    data = fh:read( BUFFER_SIZE )
end
fh:close()
fh_out:close()