Flume not writing correctly to Amazon S3 (weird characters)

My flume config:
agent.sinks = s3hdfs
agent.sources = MySpooler
agent.channels = channel
agent.sinks.s3hdfs.type = hdfs
agent.sinks.s3hdfs.hdfs.path = s3a://mybucket/test
agent.sinks.s3hdfs.hdfs.filePrefix = FilePrefix
agent.sinks.s3hdfs.channel = channel
agent.sinks.s3hdfs.hdfs.useLocalTimeStamp = true
agent.sources.MySpooler.channels = channel
agent.sources.MySpooler.type = spooldir
agent.sources.MySpooler.spoolDir = /flume_to_aws
agent.sources.MySpooler.fileHeader = true
agent.channels.channel.type = memory
agent.channels.channel.capacity = 100
Now I add a file to the /flume_to_aws folder with the following text content:
Oracle and SQL Server
After it was uploaded to S3, I downloaded the file and opened it, and it shows the following content:
SEQ!org.apache.hadoop.io.LongWritable"org.apache.hadoop.io.BytesWritable
Œúg ÊC•ý¤ïM·T.C ! †"û­þ Oracle and SQL ServerÿÿÿÿŒúg ÊC•ý¤ïM·T.C
Why is the file not uploaded with only the text "Oracle and SQL Server"?

Problem solved. I found this question on Stack Overflow here.
Flume was generating files in its default binary (SequenceFile) format instead of text format.
So I added the following lines:
agent.sinks.s3hdfs.hdfs.writeFormat = Text
agent.sinks.s3hdfs.hdfs.fileType = DataStream
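For reference, a minimal sketch of the complete sink block with both settings applied (the bucket, path, and names are the placeholders from the question):
agent.sinks.s3hdfs.type = hdfs
agent.sinks.s3hdfs.hdfs.path = s3a://mybucket/test
agent.sinks.s3hdfs.hdfs.filePrefix = FilePrefix
agent.sinks.s3hdfs.hdfs.useLocalTimeStamp = true
# write events as plain text instead of the default binary SequenceFile
agent.sinks.s3hdfs.hdfs.writeFormat = Text
agent.sinks.s3hdfs.hdfs.fileType = DataStream
agent.sinks.s3hdfs.channel = channel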

Related

Azure runbook PowerShell script content is not importing properly in Terraform (Azure automation account)

I have created an Azure automation account using Terraform. I have saved my existing runbook PowerShell script files locally. I successfully uploaded all the script files at once while creating the automation account with the code below:
resource "azurerm_automation_runbook" "example" {
for_each = fileset("Azure_Runbooks/", "*")
name = split(".", each.key)[0]
location = var.location
resource_group_name = var.resource_group
automation_account_name = azurerm_automation_account.example.name
log_verbose = var.log_verbose
log_progress = var.log_progress
runbook_type = var.runbooktype
content = each.value
}
After running terraform apply, all the script files are uploaded to the automation account, but the content of the PowerShell scripts is not. I checked the runbooks in the automation account and there is no content inside them; I am seeing only the name of the file.
Can someone please help me with the above issue?
You are assuming that fileset(path, pattern) returns the contents of each file as each.value, but that is not the case: each.value is just the file name.
You need something like:
resource "azurerm_automation_runbook" "example" {
for_each = fileset("Azure_Runbooks/", "*")
name = split(".", each.key)[0]
location = var.location
resource_group_name = var.resource_group
automation_account_name = azurerm_automation_account.example.name
log_verbose = var.log_verbose
log_progress = var.log_progress
runbook_type = var.runbooktype
content = file(format("%s%s", "Azure_Runbooks/", each.value)
}
I hope this helps.
I have fixed the issue with the correct code:
resource "azurerm_automation_runbook" "example" {
for_each = fileset("Azure_Runbooks/", "*")
name = split(".", each.key)[0]
location = var.location
resource_group_name = var.resource_group
automation_account_name = azurerm_automation_account.example.name
log_verbose = var.log_verbose
log_progress = var.log_progress
runbook_type = var.runbooktype
content = file(format("%s%s" , "Azure_Runbooks/" , each.key))
}
Thanks @YoungGova for your help.
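Side note (not from the original answers): the format() call can also be written with plain Terraform string interpolation, which some find easier to read. A minimal sketch, assuming the same Azure_Runbooks/ directory layout as above:
# equivalent to file(format("%s%s", "Azure_Runbooks/", each.key))
content = file("Azure_Runbooks/${each.key}")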

How to upload a large file (~100 MB) to Azure blob storage using the Python SDK?

I am using the latest Azure Storage SDK (azure-storage-blob-12.7.1). It works fine for smaller files but throws exceptions for larger files (> 30 MB).
azure.core.exceptions.ServiceResponseError: ('Connection aborted.',
timeout('The write operation timed out'))
from azure.storage.blob import BlobServiceClient, PublicAccess, BlobProperties, ContainerClient

def upload(file):
    settings = read_settings()
    connection_string = settings['connection_string']
    container_client = ContainerClient.from_connection_string(connection_string, 'backup')
    blob_client = container_client.get_blob_client(file)
    with open(file, "rb") as data:
        blob_client.upload_blob(data)
        print(f'{file} uploaded to blob storage')

upload('crashes.csv')
It seems everything works for me with your code when I tried uploading a ~180 MB .txt file. But since uploading small files works for you, uploading your big file in small parts could be a workaround. Try the code below:
from azure.storage.blob import BlobClient

storage_connection_string = ''
container_name = ''
dest_file_name = ''
local_file_path = ''

blob_client = BlobClient.from_connection_string(storage_connection_string, container_name, dest_file_name)

# upload 4 MB for each request
chunk_size = 4 * 1024 * 1024

if blob_client.exists():
    blob_client.delete_blob()

blob_client.create_append_blob()

with open(local_file_path, "rb") as stream:
    while True:
        read_data = stream.read(chunk_size)
        if not read_data:
            print('uploaded')
            break
        blob_client.append_block(read_data)
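Another option, not from the original answer but a sketch that should work with azure-storage-blob 12.x: keep a single upload_blob call on a block blob and tune the client so the SDK splits the file into small staged blocks, keeping each individual request short. The max_single_put_size, max_block_size, and max_concurrency arguments below are my assumptions about the SDK's tuning knobs (reusing the connection variables from the snippet above); verify them against the docs for your version.
from azure.storage.blob import BlobClient

# assumed tuning parameters for azure-storage-blob 12.x; verify for your version
blob_client = BlobClient.from_connection_string(
    storage_connection_string,
    container_name,
    dest_file_name,
    max_single_put_size=4 * 1024 * 1024,  # anything larger is uploaded in staged blocks
    max_block_size=4 * 1024 * 1024,       # 4 MB per block, i.e. per HTTP request
)

with open(local_file_path, "rb") as data:
    # several small block requests instead of one large PUT
    blob_client.upload_blob(data, overwrite=True, max_concurrency=4)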

COPY INTO snowflake table not loading data - No errors

As part of the Snowflake WebUI Essentials course, I'm trying to load data from 'WEIGHT.TXT' in an AWS S3 bucket into a Snowflake table.
select * from weight_ingest
> Result: 0 rows
list @S3TESTBKT/W
> Result:1
> s3://my-s3-tstbkt/WEIGHT.txt 509814 6e66e0c954a0dfe2c5d9638004a98912 Tue, 17 Dec 2019 14:52:52 GMT
COPY INTO WEIGHT_INGEST
FROM @S3TESTBKT/W
FILES = 'WEIGHT.TXT'
FILE_FORMAT = (FORMAT_NAME=USDA_FILE_FORMAT)
> Result: Copy executed with 0 files processed.
Can someone please help me resolve this? Thanks in advance.
Further Information:
S3 Object URL: https://my-s3-tstbkt.s3.amazonaws.com/WEIGHT.txt (I'm able to open the file contents in a browser)
Path to file: s3://my-s3-tstbkt/WEIGHT.txt
File Format Definition:
ALTER FILE FORMAT "USDA_NUTRIENT_STDREF"."PUBLIC".USDA_FILE_FORMAT
SET COMPRESSION = 'AUTO'
FIELD_DELIMITER = '^'
RECORD_DELIMITER = '\n'
SKIP_HEADER = 0
FIELD_OPTIONALLY_ENCLOSED_BY = 'NONE'
TRIM_SPACE = FALSE
ERROR_ON_COLUMN_COUNT_MISMATCH = TRUE
ESCAPE = 'NONE'
ESCAPE_UNENCLOSED_FIELD = '\134'
DATE_FORMAT = 'AUTO'
TIMESTAMP_FORMAT = 'AUTO'
NULL_IF = ('\\N');
Stage Definition:
ALTER STAGE "USDA_NUTRIENT_STDREF"."PUBLIC"."S3TESTBKT"
SET URL = 's3://my-s3-tstbkt';
I believe the issue is with your COPY command. Try the following steps:
Execute the LIST command to get the list of files:
list @S3TESTBKT
If your source file appears there, make sure the path and file name in your COPY command match the LIST output exactly. FILES takes a parenthesized list, and S3 object names are case-sensitive, so reference 'WEIGHT.txt' (as listed), not 'WEIGHT.TXT':
COPY INTO WEIGHT_INGEST
FROM @S3TESTBKT/
FILES = ('WEIGHT.txt')
FILE_FORMAT = (FORMAT_NAME = USDA_FILE_FORMAT);
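If the copy still reports 0 files processed, one way to check what Snowflake sees (a sketch, not part of the original answer) is to run the COPY with VALIDATION_MODE, which parses the staged file without loading it and reports any errors:
COPY INTO WEIGHT_INGEST
  FROM @S3TESTBKT/
  FILES = ('WEIGHT.txt')
  FILE_FORMAT = (FORMAT_NAME = USDA_FILE_FORMAT)
  VALIDATION_MODE = RETURN_ERRORS;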

Trac HTML notifications

I am setting up Trac on Windows, using a Bitnami installer. It is the newest stable version, 1.2.3. I have a lot of things set up, including notifications, but I want HTML notifications. The plain-text emails look weird.
I did add the TracHtmlNotificationPlugin. Before doing that, I was not getting emails with default_format.email set to text/html.
Now I get the emails but they are still in plain text.
This is my trac.ini notification section. Let me know if I am missing something.
[notification]
admit_domains = domain.com
ambiguous_char_width = single
batch_subject_template = ${prefix} Batch modify: ${tickets_descr}
default_format.email = text/html
email_sender = HtmlNotificationSmtpEmailSender
ignore_domains =
message_id_hash = md5
mime_encoding = base64
sendmail_path =
smtp_always_bcc =
smtp_always_cc =
smtp_default_domain =
smtp_enabled = enabled
smtp_from = trac@domain.com
smtp_from_author = disabled
smtp_from_name =
smtp_password =
smtp_port = 25
smtp_replyto =
smtp_server = smtp.domain.com
smtp_subject_prefix =
smtp_user =
ticket_subject_template = ${prefix} #${ticket.id}: ${summary}
use_public_cc = disabled
use_short_addr = disabled
use_tls = disabled
I have my domain replaced in the real file.
Like I said, I get emails now, just not HTML emails.
Edit:
I changed the setting back and now I get:
Trac[web_ui] ERROR: Failure sending notification on change to ticket #7: KeyError: 'class'
Edit 2:
Fixed the error by putting the htmlnotification_ticket.html file (from the plugin) into the templates directory.
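For anyone hitting the same KeyError, a rough sketch of that fix on Windows (the paths below are hypothetical placeholders; use the locations of your own plugin checkout and Trac environment):
:: <plugin-source> and <trac-env> are placeholders for your actual paths
copy <plugin-source>\htmlnotification_ticket.html <trac-env>\templates\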

Unable to ingest log data from Flume into HDFS (Hadoop)

I am using the following configuration to push data from a log file to HDFS.
agent.channels.memory-channel.type = memory
agent.channels.memory-channel.capacity=5000
agent.sources.tail-source.type = exec
agent.sources.tail-source.command = tail -F /home/training/Downloads/log.txt
agent.sources.tail-source.channels = memory-channel
agent.sinks.log-sink.channel = memory-channel
agent.sinks.log-sink.type = logger
agent.sinks.hdfs-sink.channel = memory-channel
agent.sinks.hdfs-sink.type = hdfs
agent.sinks.hdfs-sink.batchSize=10
agent.sinks.hdfs-sink.hdfs.path = hdfs://localhost:8020/user/flume/data/log.txt
agent.sinks.hdfs-sink.hdfs.fileType = DataStream
agent.sinks.hdfs-sink.hdfs.writeFormat = Text
agent.channels = memory-channel
agent.sources = tail-source
agent.sinks = log-sink hdfs-sink
I get no error message, but I am still not able to find the output in HDFS.
On interrupting the agent, I can see a sink interruption exception and some of the data from that log file.
I am running the following command:
flume-ng agent --conf /etc/flume-ng/conf/ --conf-file /etc/flume-ng/conf/flume.conf -Dflume.root.logger=DEBUG,console -n agent;
I had a similar issue. In my case it is now working. Below is the conf file:
#Exec Source
execAgent.sources=e
execAgent.channels=memchannel
execAgent.sinks=HDFS
#channels
execAgent.channels.memchannel.type=file
execAgent.channels.memchannel.capacity = 20000
execAgent.channels.memchannel.transactionCapacity = 1000
#Define Source
execAgent.sources.e.type=org.apache.flume.source.ExecSource
execAgent.sources.e.channels=memchannel
execAgent.sources.e.shell=/bin/bash -c
execAgent.sources.e.fileHeader=false
execAgent.sources.e.fileSuffix=.txt
execAgent.sources.e.command=cat /home/sample.txt
#Define Sink
execAgent.sinks.HDFS.type=hdfs
execAgent.sinks.HDFS.hdfs.path=hdfs://localhost:8020/user/flume/
execAgent.sinks.HDFS.hdfs.fileType=DataStream
execAgent.sinks.HDFS.hdfs.writeFormat=Text
execAgent.sinks.HDFS.hdfs.batchSize=1000
execAgent.sinks.HDFS.hdfs.rollSize=268435
execAgent.sinks.HDFS.hdfs.rollInterval=0
#Bind Source Sink Channel
execAgent.sources.e.channels=memchannel
execAgent.sinks.HDFS.channel=memchannel
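To start the agent with this configuration (the config file path below is an assumption; the agent name must match execAgent):
flume-ng agent --conf /etc/flume-ng/conf/ --conf-file /etc/flume-ng/conf/exec-agent.conf -Dflume.root.logger=DEBUG,console -n execAgent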
I suggest using the prefix configuration when placing files in HDFS:
agent.sinks.hdfs-sink.hdfs.filePrefix = log.out
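One more thing worth checking, not part of the original answers: the HDFS sink keeps an open temporary file (with a .tmp suffix by default) until one of its roll conditions fires, so data may not appear in a final file until a roll happens or the agent shuts down. A sketch of explicit roll settings for the asker's sink name:
# roll a new file every 30 seconds or 1 MB, and disable event-count rolling
agent.sinks.hdfs-sink.hdfs.rollInterval = 30
agent.sinks.hdfs-sink.hdfs.rollSize = 1048576
agent.sinks.hdfs-sink.hdfs.rollCount = 0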
@bhavesh - Are you sure the log file (agent.sources.tail-source.command = tail -F /home/training/Downloads/log.txt) keeps appending data? Since you have used the tail command with -F, only newly appended data (within the file) will be dumped into HDFS.