Flume not closing all files when adding them successively - amazon-s3

Here is my Flume conf:
agent.sinks = s3hdfs
agent.sources = MySpooler
agent.channels = channel
agent.sinks.s3hdfs.type = hdfs
agent.sinks.s3hdfs.hdfs.path = s3a://testbucket/test
agent.sinks.s3hdfs.hdfs.filePrefix = FilePrefix
agent.sinks.s3hdfs.hdfs.writeFormat = Text
agent.sinks.s3hdfs.hdfs.fileType = DataStream
agent.sinks.s3hdfs.channel = channel
agent.sinks.s3hdfs.hdfs.useLocalTimeStamp = true
agent.sinks.s3hdfs.hdfs.rollInterval = 0
agent.sinks.s3hdfs.hdfs.rollSize = 0
agent.sinks.s3hdfs.hdfs.rollCount = 0
agent.sinks.s3hdfs.hdfs.idleTimeout = 15
agent.sources.MySpooler.channels = channel
agent.sources.MySpooler.type = spooldir
agent.sources.MySpooler.spoolDir = /flume_to_aws
agent.sources.MySpooler.fileHeader = false
agent.sources.MySpooler.deserializer.maxLineLength = 110000
agent.channels.channel.type = memory
agent.channels.channel.capacity = 100000000
When I add a file to /flume_to_aws and wait, it is uploaded to Amazon S3 and the file is closed normally.
[root@de flume_to_aws]# cp /tmp_flume/globalterrorismdb_0522dist.00001.csv .
log:
06 Feb 2023 14:02:11,802 INFO [hdfs-s3hdfs-roll-timer-0] (org.apache.flume.sink.hdfs.BucketWriter.doClose:438) - Closing s3a://testbucket/test/FilePrefix.1675699321675.tmp
06 Feb 2023 14:02:13,599 INFO [hdfs-s3hdfs-call-runner-4] (org.apache.flume.sink.hdfs.BucketWriter$7.call:681) - Renaming s3a://testbucket/test/FilePrefix.1675699321675.tmp to s3a://testbucket/test/FilePrefix.1675699321675
But when I add several files without waiting, it does not upload all of them.
For example:
[root@de flume_to_aws]# cp /tmp_flume/globalterrorismdb_0522dist.00001.csv .
[root@de flume_to_aws]# cp /tmp_flume/globalterrorismdb_0522dist.00002.csv .
[root@de flume_to_aws]# cp /tmp_flume/globalterrorismdb_0522dist.00003.csv .
log (only one file shown):
06 Feb 2023 14:02:27,842 INFO [hdfs-s3hdfs-roll-timer-0] (org.apache.flume.sink.hdfs.BucketWriter.doClose:438) - Closing s3a://testbucket/test/FilePrefix.1675699338165.tmp
06 Feb 2023 14:02:31,411 INFO [hdfs-s3hdfs-call-runner-0] (org.apache.flume.sink.hdfs.BucketWriter$7.call:681) - Renaming s3a://testbucket/test/FilePrefix.1675699338165.tmp to s3a://testbucket/test/FilePrefix.1675699338165
In S3 I only see one file. Why does this happen?

I misunderstood the concept.
Actually, it is working fine. Flume does something called a "roll". Those 3 files were rolled together, mainly because of these 3 parameters:
agent.sinks.s3hdfs.hdfs.rollInterval = 0
agent.sinks.s3hdfs.hdfs.rollSize = 0
agent.sinks.s3hdfs.hdfs.rollCount = 0
Since there is no interval to roll on (rollInterval), no size to roll on (rollSize) and no event count to roll on (rollCount), it rolls those files together and stores them as a single file in S3 after the timeout (agent.sinks.s3hdfs.hdfs.idleTimeout = 15).
In my case, I am now using agent.sinks.s3hdfs.hdfs.rollSize = 2097152, so it rolls when the file reaches 2 MB. The sizes of those three files are:
[root@de flume_to_aws]# du -sk /tmp_flume/globalterrorismdb_0522dist.00001.csv
1532 /tmp_flume/globalterrorismdb_0522dist.00001.csv
[root@de flume_to_aws]# du -sk /tmp_flume/globalterrorismdb_0522dist.00002.csv
1040 /tmp_flume/globalterrorismdb_0522dist.00002.csv
[root@de flume_to_aws]# du -sk /tmp_flume/globalterrorismdb_0522dist.00003.csv
908 /tmp_flume/globalterrorismdb_0522dist.00003.csv
1532 KB + 1040 KB + 908 KB = 3480 KB (about 3.4 MB)
As I am setting it to roll at 2 MB, it stores 2 files in S3.
As we can see, the sizes of the files in S3 match the sum above:
2 MB + 1.4 MB = 3.4 MB
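For reference, the roll-related lines after the change look like this (a sketch of my own config; everything else stays the same):
agent.sinks.s3hdfs.hdfs.rollInterval = 0
agent.sinks.s3hdfs.hdfs.rollCount = 0
agent.sinks.s3hdfs.hdfs.rollSize = 2097152
agent.sinks.s3hdfs.hdfs.idleTimeout = 15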
I just learned this, so please leave feedback if something is wrong.

Related

Reading an EXCEL file from bucket into Bigquery

I'm trying to upload this data from my bucket into BigQuery, but it's complaining.
My file is an Excel file.
ID          A    B  C           D              E       F         Value1  Value2
333344      ALG  A  RETAIL      OTHER          YIPP    Jun 2019  123     4
34563       ALG  A  NON-RETAIL  OTHER          TATS    Mar 2019  124     0
7777777777  -    E  RETAIL      NASAL          KHPO    Jul 2020  1,448   0
7777777777  -    E  RETAIL      SEVERE ASTHMA  PZIFER  Oct 2019  1,493   162
From Python I load the file as follows:
from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

table_id = "project.dataset.my_table"

job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField('ID', 'STRING'),
        bigquery.SchemaField('A', 'STRING'),
        bigquery.SchemaField('B', 'STRING'),
        bigquery.SchemaField('C', 'STRING'),
        bigquery.SchemaField('D', 'STRING'),
        bigquery.SchemaField('E', 'STRING'),
        bigquery.SchemaField('F', 'STRING'),
        bigquery.SchemaField('Value1', 'STRING'),
        bigquery.SchemaField('Value2', 'STRING'),
    ],
    skip_leading_rows=1
)

uri = "gs://bucket/folder/file.xlsx"

load_job = client.load_table_from_uri(
    uri, table_id, job_config=job_config
)  # Make an API request.

load_job.result()  # Wait for the job to complete.

table = client.get_table(table_id)
print("Loaded {} rows to table {}".format(table.num_rows, table_id))
I am getting the following error, and it's complaining about a line that isn't even there:
BadRequest: 400 Error while reading data, error message: CSV table references column position 8, but line starting at position:660 contains only 1 columns.
I thought the problem was the data types, as I had selected ID, Value1 and Value2 as INTEGER and F as TIMESTAMP, so now I'm trying everything as STRING, and I still get the error.
My file is only 4 lines in this test I'm doing.
Excel files are not supported by BigQuery.
A few workaround solutions:
Upload a CSV version of your file into your bucket (a simple bq load command will do, cf here),
Read the Excel file with Pandas in your Python script and insert the rows into BQ with the to_gbq() function (see the sketch after this list),
Upload your Excel file to your Google Drive, make a spreadsheet version out of it and create an external table linked to that spreadsheet.
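A minimal sketch of the second workaround, assuming pandas, openpyxl and pandas-gbq are installed (plus gcsfs if you read straight from the gs:// path); the bucket, project and table names are the ones from the question:
import pandas as pd

# Read the Excel file (from GCS via gcsfs, or from a local copy of the file).
df = pd.read_excel("gs://bucket/folder/file.xlsx")

# Push the rows to BigQuery; replaces the table if it already exists.
df.to_gbq("dataset.my_table", project_id="project", if_exists="replace")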
Try specifying field_delimiter in the LoadJobConfig.
Your input file seems to be TSV, so you need to set the field delimiter to '\t', like this:
job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField('ID', 'STRING'),
        bigquery.SchemaField('A', 'STRING'),
        bigquery.SchemaField('B', 'STRING'),
        bigquery.SchemaField('C', 'STRING'),
        bigquery.SchemaField('D', 'STRING'),
        bigquery.SchemaField('E', 'STRING'),
        bigquery.SchemaField('F', 'STRING'),
        bigquery.SchemaField('Value1', 'STRING'),
        bigquery.SchemaField('Value2', 'STRING'),
    ],
    skip_leading_rows=1,
    field_delimiter='\t'
)

kannel receiving delivery issue

I'm using the following configuration for Kannel on 3 gateways. Each gateway has 2 sessions, one for sending and the other for receiving; the configuration below is for receiving delivery status.
I had no issue with any of them until 3 months ago, when one of them stopped being able to pull the status. At the same time, the same gateway is connected to more than 6000 clients with no issue at all.
The provider asked me to change to register_dlr=1.
Any idea?
interface-version = 34
host = xx.xx.xx.xx
port = 0
receive-port = 8899
smsc-username = user
smsc-password = pass
system-type = VMA
source-addr-ton = 5
source-addr-npi = 1
dest-addr-ton = 0
dest-addr-npi = 0
keepalive = 600
reconnect-delay = 3
enquire-link-interval = 30
esm-class = 0
msg-id-type = 0x01

Reading, parsing and storing .txt files contents in Torch tensors efficiently

I have a huge number of .txt files (maybe around 10 million), each having the same number of rows/columns. They are actually single-channel images, and the pixel values are separated by a space. Here's the code I've written to do the work, but it's very slow. I wonder if someone can suggest a more optimized/efficient way of doing this:
require 'torch'

f = assert(io.open(txtFilePath, 'r'))
local tempTensor = torch.Tensor(1, 64, 64):fill(0)
local i = 1
for line in f:lines() do
    local l = line:split(' ')
    for key, val in ipairs(l) do
        tempTensor[{1, i, key}] = tonumber(val)
    end
    i = i + 1
end
f:close()
In brief, change your source files if it is possible.
The only thing I can suggest is to use binary data instead of txt as the source.
You are relying on slow, string-based operations: f:lines(), line:split(' ') and tonumber(val). All of them work on strings.
As I understand it, you have got a file like this:
0 10 20
11 18 22
....
So, change your source into binary like this:
<0><10><20><11><18><22> ...
where each <n> is a single byte holding the pixel value (for example, 18 decimal is 0x12 in hex).
To read it back:
fid = io.open(sup_filename, "rb")
while true do
    local bytes = fid:read(1)
    if bytes == nil then break end -- EOF
    local st = string.byte(bytes)  -- read(1) returns a one-character string; take its byte value
    print(st)
end
fid:close()
https://www.lua.org/pil/21.2.2.html
It would be dramatically faster.
Maybe using regular expressions (instead of :split() and lines()) could help, but I doubt it.
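As an illustration of the conversion step, here is a rough sketch in Python (the conversion can be done in any language; the file names are made up, and it assumes every pixel value fits in 0-255):
# Convert one space-separated text image into raw bytes, one unsigned byte per pixel.
with open("image.txt") as src, open("image.bin", "wb") as dst:
    for line in src:
        dst.write(bytes(int(v) for v in line.split()))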

Aerospike connection errors

We run Aerospike server 3.5.15-1 on Ubuntu 14.04 and periodically get server connection errors from the PHP clients ([-1]Unable to connect to server). The PHP client version is 3.4.1. We run the PHP 5.3 clients from a separate server node. Connections are created from php-fpm.
There are no corresponding errors in the server logs, and the server didn't have to be restarted, so the problem seems to be on the client side.
This application creates up to 400 simultaneous connections to Aerospike. We use an r3.xlarge EC2 instance and the server has plenty of available resources.
We followed the Aerospike tuning documentation and tried updating proto-fd and the recommended OS parameters on the server, but it didn't help:
proto-fd-max 100000
proto-fd-idle-ms 15000
That's how we initialize and use Aerospike:
$opts = array(Aerospike::OPT_CONNECT_TIMEOUT => 1250,Aerospike::OPT_WRITE_TIMEOUT => 5000);
$this->db = new Aerospike($config, false, $opts);
//set key
$aero_key = $this->db->initKey($this->keyspace, $this->table, $key);
$aero_value = array("value" => $value);
$status = $this->db->put($aero_key, $aero_value, $ttl, $options);
//get key
$aero_key = $this->db->initKey($this->keyspace, $this->table, $key);
$status = $this->db->get($aero_key, $result);
Aerospike server stats before the disconnect:
Aug 27 2015 19:32:50 GMT: INFO (info): (thr_info.c::4828) trans_in_progress: wr 0 prox 0 wait 0 ::: q 0 ::: bq 0 ::: iq 0 ::: dq 0 : fds - proto (237, 16073516, 16073279) : hb (0, 0, 0) : fab (16, 16, 0)
Aug 27 2015 19:33:00 GMT: INFO (info): (thr_info.c::4828) trans_in_progress: wr 0 prox 0 wait 0 ::: q 0 ::: bq 0 ::: iq 0 ::: dq 0 : fds - proto (334, 16076516, 16076182) : hb (0, 0, 0) : fab (16, 16, 0)
Aug 27 2015 19:33:10 GMT: INFO (info): (thr_info.c::4828) trans_in_progress: wr 0 prox 0 wait 0 ::: q 0 ::: bq 0 ::: iq 1 ::: dq 0 : fds - proto (288, 16079478, 16079190) : hb (0, 0, 0) : fab (16, 16, 0)
Aug 27 2015 19:33:20 GMT: INFO (info): (thr_info.c::4828) trans_in_progress: wr 0 prox 0 wait 0 ::: q 0 ::: bq 0 ::: iq 0 ::: dq 0 : fds - proto (131, 16082477, 16082346) : hb (0, 0, 0) : fab (16, 16, 0)
Aug 27 2015 19:33:30 GMT: INFO (info): (thr_info.c::4828) trans_in_progress: wr 0 prox 0 wait 0 ::: q 0 ::: bq 0 ::: iq 0 ::: dq 0 : fds - proto (348, 16084665, 16084317) : hb (0, 0, 0)
From the log segment, we can see that there are around 300 client connections open on the node at any one time, well under the 100000 limit in proto-fd-max.
If you are using multicast for heartbeats (and I think you are), the heartbeats of 0 are fine.
I expect that you have already looked at this, but are you able to check network connectivity between the client and server at the time of the failure? I know that under normal conditions, the client and the server happily coexist, but at the time of the failure, do you see any basic connectivity problems?
Do you happen to have other applications installed on the client machine? Do they have any similar failures, possibly at the time of the Aerospike client problems?
Do you have the client installed on more than one server? Do you maybe only see the connectivity errors on one of the servers?
I know you have already been looking at this, so I apologize if I am covering topics that you have already reviewed.
Thank you for your time,
-DM

Difficulty connecting to an ntp time server

I am trying to use the following piece of code to connect to a time server and obtain the time, but have had no luck:
Dim ntpServer As String = "time.windows.com"
Dim ntpData(47) As Byte
Dim addresses = Dns.GetHostEntry(ntpServer).AddressList
Dim EndP As IPEndPoint = New IPEndPoint(addresses(0), 123)
Dim soc As Socket = New Socket(AddressFamily.InterNetwork, _
SocketType.Dgram, ProtocolType.Udp)
soc.Connect(EndP)
soc.Send(ntpData)
soc.Receive(ntpData)
soc.Close()
Tracing through the program, I can't get past the following line of code: soc.Receive(ntpData). What am I doing wrong?
Thanks
You need to provide some basic information to the server:
ntpData(0) = 27
ntpData(0) contains a section called firstByteBits.
This section needs to be set before sending the data to query for a reply.
First byte is
0 1 2 3 4 5 6 7
+-+-+-+-+-+-+-+-+
|LI | VN |Mode |
LI = leap indicator (0 in sent data)
VN = version number (3, bits 3 and 4 set)
Mode = mode (client mode = 3, bits 6 and 7 set)
00011011 = 27 = 0x1B
And possibly use a better NTP server. The time.windows.com:123 server pool is known to be slow, sometimes not responding for a while, and of low accuracy. Better: pool.ntp.org:123 (but please read what's written on pool.ntp.org about regular use).
See e.g. RFC 5905 for more details.
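For comparison, a rough sketch of the same request in Python (not the poster's VB.NET; it assumes pool.ntp.org is reachable over UDP port 123):
import socket
import struct
from datetime import datetime, timezone

NTP_DELTA = 2208988800  # seconds between the NTP epoch (1900) and the Unix epoch (1970)

request = bytearray(48)
request[0] = 0x1B  # LI = 0, VN = 3, Mode = 3 (client), i.e. the 27 from above

with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
    sock.settimeout(5)
    sock.sendto(request, ("pool.ntp.org", 123))
    response, _ = sock.recvfrom(48)

# Transmit timestamp: seconds since 1900 in bytes 40-43, big-endian.
seconds = struct.unpack("!I", response[40:44])[0]
print(datetime.fromtimestamp(seconds - NTP_DELTA, timezone.utc))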