extract/filter syslog-ng log on linux - syslog-ng

I have configured syslog-ng to receive logs from another machine. The logs arrive every minute but contain unrequired messages; how can I filter them out of the raw data?
Example:
date=2021-06-01 time=10:01:01 ABC="1" cde=2 Xyz="aaa" name=UK
date=2021-06-01 time=10:01:02 ABC="3" cde=5 name=USA
date=2021-06-01 time=10:01:03 ABC="4" cde=2
The output of syslog-ng needs to be as below:
2020-06-01/data-20200601.log:
date=2021-06-01 time=10:01:01 ABC="1" cde=2 Xyz="aaa" name=UK
date=2021-06-01 time=10:01:02 ABC="3" cde=5 XyZ="" name=USA
date=2021-06-01 time=10:01:03 ABC="4" cde=2 XyZ="" name=""
That is, filter based on KEY=, and if a value is missing, the KEY= should still be logged with an empty "" (so a missing value won't shift the remaining fields to the left), so I can filter later as per my needs:
I tried to parse this with awk and sed, but the log file generated by syslog-ng ("data-20200601.log") is around 10 GB, and it took me a long time to get this output:
2021-06-01,10:01:01,1,2,aaa,UK
2021-06-01,10:01:02,3,5,,USA
2021-06-01,10:01:03,4,,,,

syslog-ng has a parser called kv-parser() that would extract all such key=value parts into syslog-ng name-value pairs.
log {
    source(some_source);
    parser { kv-parser(); };
    destination { file("this_is_where_all_logs_go" template("${name} ${ABC}")); };
};
In the template section, as you can see, you can reference the extracted name-value pairs, using the normal syslog-ng syntax.
You can even format a series of name-value pairs into JSON or other structured formats using $(format-json), $(format-welf), etc.
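To get output like the example above (every expected key always present, with "" when the value is missing), one possible approach is to reference each key explicitly in the template, since a syslog-ng name-value pair with no value simply expands to an empty string. A minimal sketch, assuming the keys from the example; the source name (s_net) and the output path are hypothetical:

# hypothetical network source and output path; kv-parser() extracts key=value pairs
log {
    source(s_net);
    parser { kv-parser(); };
    destination {
        file(
            "/var/log/data-${YEAR}${MONTH}${DAY}.log"
            template("date=${date} time=${time} ABC=\"${ABC}\" cde=${cde} Xyz=\"${Xyz}\" name=\"${name}\"\n")
        );
    };
};

Because every key is spelled out in the template, a key that is missing from a given message still appears as KEY="", which keeps the columns aligned for later filtering.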

Related

Nextflow input how to declare tuple in tuple

I am working with a nextflow workflow that, at a certain stage, groups a series of files by their sample id using groupTuple(), resulting in a channel that looks like this:
[sample_id, [file_A, file_B, ... , file_N]]
[sample_id, [file_A, file_B, ... , file_N]]
...
[sample_id, [file_A, file_B, ... , file_N]]
Note that this is the same channel structure that you get from .fromFilePairs().
I want to use these channel items in a process in such a way that, for each item, the process reads the sample_id from the first field and all the files from the inner tuple at once.
The nextflow documentation is somewhat cryptic about this, and it is hard to find how to declare this type of input in a channel, so I thought I'd create a question on stack overflow and then answer it myself for anyone who will ever be looking for this answer.
How does one declare the inner tuple in the input section of a nextflow process?
In the example given above, my inner tuple contains items of only one type (files). I can therefore pass the whole second term of the tuple (i.e. the inner tuple) as a single input item under the file() qualifier. Like this:
input:
tuple \
val(sample_id), \
file(inner_tuple) \
from Input_channel
This will ensure that the tuple contents are read as files (one by one), the same way as performing .collect() on a channel of files, in the sense that all files will then be available in the Nextflow work directory where the process is executed.
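For completeness, a minimal DSL1-style sketch of a process consuming such a channel (the process name, output name and script body here are hypothetical):

process merge_sample_files {
    input:
    tuple val(sample_id), file(inner_tuple) from Input_channel

    output:
    file("${sample_id}.merged.txt") into Merged_channel

    script:
    """
    # all files of the inner tuple are staged into the task directory,
    # so the input variable expands to a space-separated list of them
    cat ${inner_tuple} > ${sample_id}.merged.txt
    """
}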
The question is how you come up with sample_id, but in case they just have different file extensions you might use something like this:
all_files = Channel.fromPath("/path/to/your/files/*")
all_files.map { it -> [it.simpleName, it] }
.groupTuple()
.set { grouped_files }
The path qualifier (previously the file qualifier) can be used to stage a single (file) value or a collection of (file) values into the process execution directory. The note at the bottom of the multiple input files section in the docs also mentions:
The normal file input constructs introduced in the input of files
section are valid for collections of multiple files as well.
This means, you can use a script variable, e.g.:
input:
tuple val(sample_id), path(my_files)
In which case, the variable will hold the list of files (preserving the original filenames). You could use it directly to refer to all of the files in the list, or, you could access specific (file) elements (if you need them) using square bracket (slice) notation.
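A short sketch of what that can look like inside the script block (DSL2 syntax; the process name is hypothetical):

process show_files {
    input:
    tuple val(sample_id), path(my_files)

    script:
    """
    # the variable expands to all staged files, space-separated
    echo "${sample_id}: ${my_files}"

    # individual elements can be accessed with slice notation
    echo "first file: ${my_files[0]}"
    """
}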
This is the syntax you will want most of the time. However, if you need predictable filenames, or if you need to deal with files that have identical filenames, you may need a different approach:
Alternatively, you could specify a target filename, e.g.:
input:
tuple val(sample_id), path('my_file')
In the case where a single file is received by the process, the file would be staged with the target filename. However, when a collection of files is received by the process, the filename will be appended with a numerical suffix representing its ordinal position in the list. For example:
process test {
    tag { sample_id }
    debug true
    stageInMode 'rellink'

    input:
    tuple val(sample_id), path('fastq')

    """
    echo "${sample_id}:"
    ls -g --time-style=+"" fastq*
    """
}

workflow {
    readgroups = Channel.fromFilePairs( '*_{1,2}.fastq' )
    test( readgroups )
}
Results:
$ touch {foo,bar,baz}_{1,2}.fastq
$ nextflow run .
N E X T F L O W ~ version 22.04.4
Launching `./main.nf` [scruffy_caravaggio] DSL2 - revision: 87a80d6d50
executor > local (3)
[65/66f860] process > test (bar) [100%] 3 of 3 ✔
baz:
lrwxrwxrwx 1 users 20 fastq1 -> ../../../baz_1.fastq
lrwxrwxrwx 1 users 20 fastq2 -> ../../../baz_2.fastq
foo:
lrwxrwxrwx 1 users 20 fastq1 -> ../../../foo_1.fastq
lrwxrwxrwx 1 users 20 fastq2 -> ../../../foo_2.fastq
bar:
lrwxrwxrwx 1 users 20 fastq1 -> ../../../bar_1.fastq
lrwxrwxrwx 1 users 20 fastq2 -> ../../../bar_2.fastq
Note that the names of staged files can be controlled using the * and ? wildcards. See the links above for a table that shows how the wildcards are replaced depending on the cardinality of the input collection.

check if nextflow channel is empty

I am trying to figure out how to check if a channel is empty or not.
For instance, I have two processes. The first process runs only if a combination of parameters/flags is set; if so, it also checks whether its input file from another process (received via a channel) is not empty, and then creates a new input file for a second process (to eventually replace the default one). As a simplified example:
.....
.....
// create the channel here to force nextflow to wait for the first process
_chNewInputForProcessTwo = Channel.create()
process processOne {
    when:
    params.conditionOne && params.conditionTwo

    input:
    file inputFile from _channelUpstreamProcess

    output:
    file("my.output.file") into _chNewInputForProcessTwo

    script:
    """
    # check if we need to produce new input for second process (i.e., input file not empty)
    if [ -s ${inputFile} ]
    then
        <super_command_to_generate_new_fancy_input_for_second_process> > "my.output.file"
    else
        echo "No need to create new input"
    fi
    """
}
// and here I would like to check if new input was generated or leave the "default" one
_chInputProcessTwo = Channel.from(_chNewInputForProcessTwo).ifEmpty(Channel.value(params.defaultInputProcessTwo))
process secondProcess {
input:
file inputFile from _chInputProcessTwo
......
......
etc.
When I try running with this approach, it fails because the channel _chNewInputForProcessTwo contains DataflowQueue(queue=[]) and is therefore not actually considered empty.
I've tried several things based on the documentation and the threads on Google Groups and on Gitter: trying to set it to empty (but then it complains I am trying to use the channel twice), putting create().close(), etc.
Is there a clean/reasonable way to do this? I could do it using a value channel and have the first process output some string on the stdout to be picked up and checked by the second process, but that seems pretty dirty to me.
Any suggestions/feedback is appreciated. Thank you in advance!
Marius
Best to avoid trying to check if the channel is empty. If your channel could be empty and you need a default value in your channel, you can use the ifEmpty operator to supply one. Note that a single value is implicitly a value channel. I think all you need is:
myDefaultInputFile = file(params.defaultInputProcessTwo)
chInputProcessTwo = chNewInputForProcessTwo.ifEmpty(myDefaultInputFile)
Also, calling Channel.create() is usually unnecessary.
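So the second process from the question could then simply read from that channel; a minimal sketch (DSL1 syntax, with a placeholder script body):

process secondProcess {
    input:
    file inputFile from chInputProcessTwo

    script:
    """
    echo "running on ${inputFile}"
    """
}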

BigQuery Could not parse 'null' as int for field

I tried to load CSV files into a BigQuery table. There are columns whose type is INTEGER, but some missing values are null. So when I use the bq load command, I get the following error:
Could not parse 'null' as int for field
So I am wondering what the best solution is to deal with this. Do I have to reprocess the data first for bq load to work?
You'll need to transform the data in order to end up with the expected schema and data. Instead of INTEGER, specify the column as having type STRING. Load the CSV file into a table that you don't plan to use long-term, e.g. YourTempTable. In the BigQuery UI, click "Show Options", then select a destination table with the table name that you want. Now run the query:
#standardSQL
SELECT * REPLACE(SAFE_CAST(x AS INT64) AS x)
FROM YourTempTable;
This will convert the string values to integers where 'null' is treated as null.
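If you prefer the command line over the UI, the same approach might look roughly like this sketch (the dataset, the final table name, and the second column name are made up; the question does not give a schema):

# load into a throwaway table, typing the problematic column as STRING
bq load --source_format=CSV --skip_leading_rows=1 \
    mydataset.YourTempTable ./data.csv x:STRING,label:STRING

# cast and write the cleaned rows into the final table
bq query --use_legacy_sql=false \
    --destination_table=mydataset.YourFinalTable \
    'SELECT * REPLACE(SAFE_CAST(x AS INT64) AS x) FROM mydataset.YourTempTable'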
Please try with the job config setting:
job_config.null_marker = 'NULL'
configuration.load.nullMarker (string): [Optional] Specifies a string that represents a null value in a CSV file. For example, if you specify "\N", BigQuery interprets "\N" as a null value when loading a CSV file. The default value is the empty string. If you set this property to a custom value, BigQuery throws an error if an empty string is present for all data types except for STRING and BYTE. For STRING and BYTE columns, BigQuery interprets the empty string as an empty value.
https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.load
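With the Python client library, the whole load might look something like this sketch (the table and file names are made up, and it assumes the table already exists with the desired schema; the important part is null_marker):

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    # must match the literal text used in the CSV; the error message in the
    # question suggests lowercase "null"
    null_marker="null",
)

with open("data.csv", "rb") as f:
    load_job = client.load_table_from_file(
        f, "mydataset.mytable", job_config=job_config
    )
load_job.result()  # wait for the load job to finish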
The BigQuery Console has its limitations and doesn't allow you to specify a null marker while loading data from a CSV. However, it can easily be done using the BigQuery command-line tool's bq load command. We can use the --null_marker flag to specify the marker, which is simply null in this case.
bq load --source_format=CSV \
--null_marker=null \
--skip_leading_rows=1 \
dataset.table_name \
./data.csv \
./schema.json
Setting the null_marker as null does the trick here. You can omit the schema.json part if the table is already present with a valid schema. --skip_leading_rows=1 is used because my first row was a header.
You can learn more about the bq load command in the BigQuery documentation.
The load command, however, lets you create and load a table in a single go. The schema needs to be specified in a JSON file in the below format:
[
  {
    "description": "[DESCRIPTION]",
    "name": "[NAME]",
    "type": "[TYPE]",
    "mode": "[MODE]"
  },
  {
    "description": "[DESCRIPTION]",
    "name": "[NAME]",
    "type": "[TYPE]",
    "mode": "[MODE]"
  }
]
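Filled in for a hypothetical two-column table matching the question (an INTEGER column that may be missing, plus a STRING column), that could look like:

[
  {
    "description": "numeric value that may be null in the CSV",
    "name": "value",
    "type": "INTEGER",
    "mode": "NULLABLE"
  },
  {
    "description": "free-text label",
    "name": "label",
    "type": "STRING",
    "mode": "NULLABLE"
  }
]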

How can I use the value of mp2t.af.pcr as a Tshark field?

I have a wireshark capture that contains an RTP multicast stream (plus some other incidental data).
Using a Tshark command like the following, I can produce a CSV of the RTP timestamp compared with the packet capture time:
tshark.exe -r "capture.pcap" -Eseparator=, -Tfields -e rtp.timestamp -e frame.time_epoch -d udp.port==5000,rtp
This decodes the UDP packets as RTP, and successfully prints out the two fields as expected.
Now, my question: The payload of the RTP stream is an MPEG2 Transport Stream, and I also want to print the PCR value (if there is one) alongside the packet and RTP timestamps.
In wireshark, I can see the PCR being decoded correctly, however using a command like the following:
tshark.exe -r "HBO HD CZ.pcap" -Eseparator=,-Tfields -e rtp.timestamp -e frame.time_epoch -e mp2t.af.pcr -d udp.port==5000,mp2t
...only prints out a "1" if there is a PCR oresent, not the actual value. I have also checked the .pcr_flag to confirm that these two are not exchanged, but still I see the same result.
The documentation seems to call mp2t.af.pcr a "Label", does this mean that Tshark is not able to use it as a field? Is there a way to generate a CSV with these values?
(What part of the documentation calls it a "Label"? That's a somewhat odd description of a named field.)
The problem is that the value that Wireshark displays after "base(XXX)*300 + ext(YYY)" is calculated and displayed, but the field itself isn't given an integral type and is instead given a type that doesn't have a value. Arguably, it should be an FT_UINT64 field and should be given a value, so that you can filter on it and can print the value in TShark.
Please file an enhancement request for this on the Wireshark Bugzilla.

A redis command ERR:wrong number of arguments

I am using hiredis to pass a command to redis-server.
My code:
redisContext* c = redisConnect("127.0.0.1", 6379);
char y[15]={"pointx"};
strcat(y," 2");
redisReply* reply= (redisReply*)redisCommand(c,"set %s",y);
printf("%s\n", reply->str);
The output is "ERR wrong number of arguments for 'set' command".
However, it works when I change the code like this:
redisContext* c = redisConnect("127.0.0.1", 6379);
char y[15]={"pointx"};
char x[5] = {"2"};
redisReply* reply= (redisReply*)redisCommand(c,"set %s %s",y,x);
printf("%s\n", reply->str);
The output is "OK".
why??
The Redis server does not parse the command built with redisCommand. The server only accepts Redis protocol, with already delimited parameters.
Parsing therefore occurs in hiredis, and it applies only to the format string, in one step. For performance reasons, hiredis avoids multiple formatting passes (or a recursive implementation), so expansion of the parameters is not done before parsing, but while parsing is ongoing - contrary to what you might expect.
Imagine your objects are very big (say several MB), you would not want them to be parsed at each query. This is why hiredis only parses the format string and not the parameters.
In your first example, hiredis parses a format string with a single %s parameter; it builds a message whose SET command carries only one argument, and Redis receives:
$ netcat -l -p 6379
*2
$3
set
$8
pointx 2
which is an ill-formed SET command (only one parameter).
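For completeness, a small sketch of the two safe ways to pass arguments with hiredis: one %s per argument (as in your second example), or the binary-safe %b specifier, which takes an explicit length (error handling kept minimal):

#include <stdio.h>
#include <string.h>
#include <hiredis/hiredis.h>

int main(void) {
    redisContext* c = redisConnect("127.0.0.1", 6379);
    if (c == NULL || c->err) {
        fprintf(stderr, "connection error\n");
        return 1;
    }

    const char* key = "pointx";
    const char* val = "2";

    /* each %s becomes exactly one protocol argument */
    redisReply* reply = (redisReply*)redisCommand(c, "SET %s %s", key, val);
    printf("%s\n", reply->str);               /* OK */
    freeReplyObject(reply);

    /* %b is binary-safe: the length is passed explicitly, so the value
       may even contain spaces or NUL bytes without being split */
    reply = (redisReply*)redisCommand(c, "SET %b %b",
                                      key, strlen(key), val, strlen(val));
    printf("%s\n", reply->str);
    freeReplyObject(reply);

    redisFree(c);
    return 0;
}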