Use flume to stream data to S3 - apache

I am trying Flume for something very simple, where I would like to push content from my log files to S3. I was able to create a Flume agent that reads content from an Apache access log file and uses a logger sink. Now I am trying to find a solution where I can replace the logger sink with an "S3 sink". (I know this does not exist by default.)
I am looking for some pointers to set me on the right path. Below is the test properties file I am currently using.
a1.sources=src1
a1.sinks=sink1
a1.channels=ch1
#source configuration
a1.sources.src1.type=exec
a1.sources.src1.command=tail -f /var/log/apache2/access.log
#sink configuration
a1.sinks.sink1.type=logger
#channel configuration
a1.channels.ch1.type=memory
a1.channels.ch1.capacity=1000
a1.channels.ch1.transactionCapacity=100
#links
a1.sources.src1.channels=ch1
a1.sinks.sink1.channel=ch1

The HDFS sink can write to S3 through Hadoop's S3 filesystem support, so you can use the HDFS sink and simply point the hdfs path at your bucket in the following way. Don't forget to replace AWS_ACCESS_KEY and AWS_SECRET_KEY with your own values.
agent.sinks.s3hdfs.type = hdfs
agent.sinks.s3hdfs.hdfs.path = s3n://<AWS.ACCESS.KEY>:<AWS.SECRET.KEY>@<bucket.name>/prefix/
agent.sinks.s3hdfs.hdfs.fileType = DataStream
agent.sinks.s3hdfs.hdfs.filePrefix = FilePrefix
agent.sinks.s3hdfs.hdfs.writeFormat = Text
agent.sinks.s3hdfs.hdfs.rollCount = 0
#roll at a 64 MB file size (inline comments are not valid in a properties value)
agent.sinks.s3hdfs.hdfs.rollSize = 67108864
agent.sinks.s3hdfs.hdfs.batchSize = 10000
agent.sinks.s3hdfs.hdfs.rollInterval = 0
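For reference, a minimal sketch of how this could be wired into the agent from the question (agent a1, channel ch1, replacing the logger sink), assuming the Hadoop S3 filesystem classes and the required AWS connector jars are on Flume's classpath:
a1.sinks.sink1.type = hdfs
a1.sinks.sink1.channel = ch1
a1.sinks.sink1.hdfs.path = s3n://<AWS.ACCESS.KEY>:<AWS.SECRET.KEY>@<bucket.name>/prefix/
a1.sinks.sink1.hdfs.fileType = DataStream
a1.sinks.sink1.hdfs.writeFormat = Text
a1.sinks.sink1.hdfs.rollCount = 0
a1.sinks.sink1.hdfs.rollSize = 67108864
a1.sinks.sink1.hdfs.rollInterval = 0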

This makes sense, but can a rollSize of this value be combined with the following settings?
agent_messaging.sinks.AWSS3.hdfs.round = true
agent_messaging.sinks.AWSS3.hdfs.roundValue = 30
agent_messaging.sinks.AWSS3.hdfs.roundUnit = minute

Related

Synapse Spark: Python logging to log file in Azure Data Lake Storage

I am working in Synapse Spark and building a logger function to handle error logging. I intend to push the logs to an existing log file (data.log) located in AzureDataLakeStorageAccount/Container/Folder/.
In addition to the root logger, I have added a StreamHandler and am trying to set up a FileHandler to manage the log file write-out.
The log file path I am specifying is in this format: 'abfss:/container#storageaccountname.dfs.core.windows.net/Data/logging/data.log'
When I run the below code, I get the following error:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/var/hadoop/tmp/nm-local-dir/usercache/trusted-service-user/appcache/application_<number/container_number/abfss:/container#storageaccountname.dfs.core.windows.net/Data/logging/data.log'
The default mount path is getting prefixed to the ADLS file path.
Here is the code:
import logging
import sys

def init_logger(name: str, logging_level: int = logging.DEBUG) -> logging.Logger:
    _log_format = "%(levelname)s %(asctime)s %(name)s: %(message)s"
    _date_format = "%Y-%m-%d %I:%M:%S %p %z"
    _formatter = logging.Formatter(fmt=_log_format, datefmt=_date_format)
    _root_logger = logging.getLogger()
    _logger = logging.getLogger(name)
    _logger.setLevel(logging_level)
    # Root and Stream Handler
    if _root_logger.handlers:
        for handler in _root_logger.handlers:
            handler.setFormatter(_formatter)
        _logger.setLevel(logging_level)
    else:
        _handler = logging.StreamHandler(sys.stderr)
        _handler.setLevel(logging_level)
        _handler.setFormatter(_formatter)
        _logger.addHandler(_handler)
    # File Handler for the ADLS log file (LogFilepath is set elsewhere to the abfss path above)
    __handler = logging.FileHandler(LogFilepath, 'a')
    __handler.setLevel(logging_level)
    __handler.setFormatter(_formatter)
    _logger.addHandler(__handler)
    return _logger
To address the mount path prefix I added a series of '../' to move up levels, but even with this I end up with a solitary '/' prefixed to my ADLS path.
I have not found any online assistance or article where this has been implemented in an Azure Data Lake setup. Any assistance will be appreciated.
Looking at the error, the file path seems to be the issue here. Instead of giving the abfss path, you can try giving a path like /synfs/{jobid}/<mount point>/filepath.
Use the below code for that.
jobid=mssparkutils.env.getJobId()
LogFilepath='/synfs/'+jobid+'/<mount_point>/filename.log'
print(LogFilepath)
The above path might work for you. If not, you can try writing to the ADLS file with open() as an alternative way of logging.
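For illustration, a minimal sketch of that suggestion, assuming the ADLS container is already mounted in the Synapse workspace (<mount_point> is a placeholder, as above) and that the mounted /synfs path accepts normal file I/O:
from notebookutils import mssparkutils

jobid = mssparkutils.env.getJobId()
LogFilepath = '/synfs/' + jobid + '/<mount_point>/data.log'

# The same LogFilepath can be handed to the FileHandler inside init_logger above,
# or the file can be appended to directly with open(), as suggested:
with open(LogFilepath, 'a') as log_file:
    log_file.write('appended a line to the ADLS log file\n')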

Change UTF 16 encoding to UTF 8 encoding for files in AWS S3

My main goal is to have AWS Glue move files stored in S3 to a database in RDS. My current issue is that the files I receive are UTF-16 LE encoded, and AWS Glue will only process text files with UTF-8 encoding. See (https://docs.aws.amazon.com/glue/latest/dg/glue-dg.pdf, pg. 5 footnote). On my local machine, Python can easily change the encoding with this method:
from pathlib import Path
path = Path('file_path')
path.write_text(path.read_text(encoding="utf16"), encoding="utf8")
I attempted to implement this in a Glue job as such:
bucketname = "bucket_name"
s3 = boto3.resource('s3')
my_bucket = s3.Bucket(bucketname)
subfolder_path = "folder1/folder2"
file_filter = "folder2/file_header_identifier"
for obj in my_bucket.objects.filter(Prefix=file_filter):
filename = (obj.key).split('/')[-1]
file_path = Path("s3://{}/{}/{}".format(bucketname, subfolder_path, filename))
file_path.write_text(file_path.read_text(encoding="utf16"), encoding="utf8")
I'm not getting an error in Glue, but it is not changing the text encoding of my file. When I try something similar in Lambda (probably the wiser service for this), I get an error that s3 has no Bucket attribute. I'd prefer to keep all of this ETL work in Glue for convenience.
I'm very new to AWS so any advice is welcomed.
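As a possible way forward (a sketch, not tested in Glue): pathlib only operates on the local filesystem and does not understand s3:// URLs, so the re-encoding has to go through the S3 API itself, e.g. reading the object bytes, decoding them as UTF-16, and writing them back as UTF-8. Bucket and prefix names below are the placeholders from the question.
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('bucket_name')

for obj in bucket.objects.filter(Prefix='folder1/folder2/'):
    raw = obj.get()['Body'].read()        # original UTF-16 LE bytes
    text = raw.decode('utf-16')           # use 'utf-16-le' if there is no BOM
    bucket.put_object(Key=obj.key, Body=text.encode('utf-8'))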

Google Cloud Data Transfer to a GCS subfolder

I am trying to transfer data from an AWS S3 bucket (e.g. s3://mySrcBkt) to a GCS location (a folder under a bucket, e.g. gs://myDestBkt/myDestination). I could not find that option in the interface, as it only has provision for specifying a bucket and not a subfolder. Nor could I find a similar provision in the storagetransfer API. Here is my code snippet:
String SOURCE_BUCKET = .... ;
String ACCESS_KEY = .....;
String SECRET_ACCESS_KEY = .....;
String DESTINATION_BUCKET = .......;
String STATUS = "ENABLED";
TransferJob transferJob =
    new TransferJob()
        .setName(NAME)
        .setDescription(DESCRIPTION)
        .setProjectId(PROJECT)
        .setTransferSpec(
            new TransferSpec()
                .setObjectConditions(new ObjectConditions()
                    .setIncludePrefixes(includePrefixes))
                .setTransferOptions(new TransferOptions()
                    .setDeleteObjectsFromSourceAfterTransfer(false)
                    .setOverwriteObjectsAlreadyExistingInSink(false)
                    .setDeleteObjectsUniqueInSink(false))
                .setAwsS3DataSource(
                    new AwsS3Data()
                        .setBucketName(SOURCE_BUCKET)
                        .setAwsAccessKey(
                            new AwsAccessKey()
                                .setAccessKeyId(ACCESS_KEY)
                                .setSecretAccessKey(SECRET_ACCESS_KEY)))
                .setGcsDataSink(
                    new GcsData()
                        .setBucketName(DESTINATION_BUCKET)))
        .setSchedule(
            new Schedule()
                .setScheduleStartDate(date)
                .setScheduleEndDate(date)
                .setStartTimeOfDay(time))
        .setStatus(STATUS);
Unfortunately I could not find anywhere to specify the destination folder for this transfer. I know gsutil rsync offers something similar, but scale and data integrity are concerns. Can anyone guide me to a way or workaround to achieve this goal?
As the bucket, and not a subdirectory, is the only available option for the data transfer destination, the workaround for this scenario would be doing the transfer to your bucket and then running an rsync operation between the bucket and the subdirectory. Just keep in mind that you should first run gsutil -m rsync -r -d -n to verify what it will do, as you could otherwise delete data accidentally.
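If you would rather script that second step than run gsutil by hand, here is a hedged sketch with the google-cloud-storage Python client (bucket and folder names are taken from the question; the copy-then-delete approach is an assumption, so test it on non-critical data first):
from google.cloud import storage

def move_into_subfolder(bucket_name, dest_prefix):
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    for blob in client.list_blobs(bucket_name):
        if blob.name.startswith(dest_prefix + '/'):
            continue  # already under the destination "folder"
        bucket.copy_blob(blob, bucket, new_name=dest_prefix + '/' + blob.name)
        blob.delete()  # remove the original so objects exist only under the prefix

move_into_subfolder('myDestBkt', 'myDestination')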

Unzip a file to s3

I am looking for a simple way to extract a zip/gzip present in an S3 bucket to the same bucket location and delete the parent zip/gzip file after extraction.
I am unable to achieve this with any of the APIs currently.
I have tried native boto, pyfilesystem (fs), and s3fs.
The source and destination links seem to be an issue for these functions.
(Using Python 2.x/3.x and Boto 2.x)
I see there is a Node.js package (unzip-to-s3) for this job, but none for Python.
A couple of implementations I can think of:
A simple API to extract the zip file within the same bucket.
Use s3 as a filesystem and manipulate data
Use a data pipeline to achieve this
Transfer the zip to EC2, extract it, and copy it back to S3.
Option 4 would be the least preferred, to minimise the architectural overhead of adding EC2.
I need support in getting this implemented, with integration to Lambda at a later stage. Any pointers to these implementations are greatly appreciated.
Thanks in Advance,
Sundar.
You could try https://www.cloudzipinc.com/, which unzips/expands several different archive formats from S3 into a destination in your bucket. I used it to unzip components of a digital catalog into S3.
I have solved this by using an EC2 instance: copy the S3 files to a local directory on the EC2 instance, extract them, and copy that directory back to the S3 bucket.
Sample code to unzip to a local directory on the EC2 instance:
import os
import tarfile
import zipfile

def s3Unzip(srcBucket, dst_dir):
    '''
    function to decompress the s3 bucket contents to the local machine
    Args:
        srcBucket (string): source bucket name
        dst_dir (string): destination location in the local/ec2 local file system
    Returns:
        None
    '''
    s3 = s3Conn                      # existing boto S3 connection, created elsewhere
    bucket = s3.lookup(srcBucket)
    try:
        os.mkdir(dst_dir)
        print("local directories created")
    except OSError:
        logger_s3.warning("Exception in creating local directories to extract zip file / folder already existing")
    for key in bucket:
        path = os.path.join(dst_dir, key.name)
        key.get_contents_to_filename(path)
        if path.endswith('.zip'):
            opener, mode = zipfile.ZipFile, 'r'
        elif path.endswith('.tar.gz') or path.endswith('.tgz'):
            opener, mode = tarfile.open, 'r:gz'
        elif path.endswith('.tar.bz2') or path.endswith('.tbz'):
            opener, mode = tarfile.open, 'r:bz2'
        else:
            raise ValueError('unsupported format')
        cwd = os.getcwd()
        os.chdir(dst_dir)
        try:
            file = opener(path, mode)
            try:
                file.extractall()
            finally:
                file.close()
            logger_s3.info('(%s) extracted successfully to %s' % (key, dst_dir))
        except Exception:
            logger_s3.error('failed to extract (%s) to %s' % (key, dst_dir))
        os.chdir(cwd)
    s3.close()
Sample code to upload to a MySQL instance
Use the "LOAD DATA LOCAL INFILE" query to load the CSV data into MySQL directly.
def upload(file_path, timeformat):
    '''
    function to upload csv file data to mysql rds
    Args:
        file_path (list): list of local csv file paths
        timeformat (string): datetime format string passed to str_to_date
    Returns:
        None
    '''
    for file in file_path:
        con = connect()              # connect() returns a MySQL connection (defined elsewhere)
        cursor = con.cursor()
        try:
            qry = """LOAD DATA LOCAL INFILE '%s' INTO TABLE xxxx FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' (col1, col2, col3, @datetime, col4) set datetime = str_to_date(@datetime,'%s');""" % (file, timeformat)
            cursor.execute(qry)
            con.commit()
            logger_rds.info("Loading file: " + file)
        except Exception:
            logger_rds.error("Exception in uploading " + file)
            # Rollback in case there is any error
            con.rollback()
        cursor.close()
        # disconnect from server
        con.close()
Lambda function:
You can use a Lambda function where you read zipped files into the buffer, gzip the individual files, and reupload them to S3. Then you can either archive the original files or delete them using boto.
You can also set an event-based trigger that runs the Lambda automatically every time a new zipped file lands in S3. Here's a full tutorial for exactly this: https://betterprogramming.pub/unzip-and-gzip-incoming-s3-files-with-aws-lambda-f7bccf0099c9
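For reference, a minimal sketch of such a Lambda handler, assuming an S3 event trigger, an archive small enough to unzip in memory, and object keys without special characters; the gzip-and-reupload behaviour follows the description above rather than the linked tutorial verbatim:
import gzip
import io
import zipfile

import boto3

s3 = boto3.client('s3')

def handler(event, context):
    record = event['Records'][0]['s3']
    bucket = record['bucket']['name']
    key = record['object']['key']

    # Read the uploaded zip into a buffer, gzip each member, and reupload it.
    body = s3.get_object(Bucket=bucket, Key=key)['Body'].read()
    with zipfile.ZipFile(io.BytesIO(body)) as archive:
        for name in archive.namelist():
            s3.put_object(Bucket=bucket,
                          Key=name + '.gz',
                          Body=gzip.compress(archive.read(name)))

    # Delete the parent archive once extraction succeeds (or archive it instead).
    s3.delete_object(Bucket=bucket, Key=key)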

Unable to unzip gz file transferred to HDFS via Flume spool directory

I am using a spooldir source to move .gz files from a spool directory to HDFS.
I am using the following config:
==========================
a1.channels = ch-1
a1.sources = src-1
a1.sinks = k1
a1.channels.ch-1.type = memory
a1.channels.ch-1.capacity = 1000
a1.channels.ch-1.transactionCapacity = 100
a1.sources.src-1.type = spooldir
a1.sources.src-1.channels = ch-1
a1.sources.src-1.spoolDir = /path_to/flumeSpool
a1.sources.src-1.deserializer=org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder
a1.sources.src-1.basenameHeader=true
a1.sources.src-1.deserializer.maxBlobLength=400000000
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = ch-1
a1.sinks.k1.hdfs.path = hdfs://{namenode}:8020/path_to_hdfs
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.rollInterval =100
a1.sinks.k1.hdfs.rollCount=0
a1.sinks.k1.hdfs.rollSize=0
a1.sinks.k1.hdfs.fileType = CompressedStream
a1.sinks.k1.hdfs.codeC=gzip
a1.sinks.k1.hdfs.callTimeout=120000
========================================
The file does get transferred to HDFS, but the sink appends a time-in-millis counter and a .gz extension to the name.
Also, when I copy the file out of HDFS and try to gunzip it, it shows unknown characters, so I am not sure what is going on.
I would like to maintain the same filename after the transfer to HDFS.
I would like to be able to unzip the file and read its content.
Can someone help?