CloudWatch Logs Insights: extract JSON as field - amazon-cloudwatch

I am trying to extract a JSON field which can be either null or an array.
Example logs:
04 Jun 2020 09:48:00,741 [32m[INFO] [m 4277a4fa-13fe-49f9-8348-9c515c988481 Class1: Method1: {"property1":"property1Value","property2":["string1", "string2"] , "property3": "property3Value" }
04 Jun 2020 09:48:00,741 [32m[INFO] [m 4277a4fa-13fe-49f9-8348-9c515c988481 Class1: Method1: {"property1":"property1Value","property2":null , "property3": "property3Value" }
Currently I am able to write a parse pattern which extracts property2 only when it is an array:
| parse "*property2*]*" as blah1, property2, blah2
Is there a way I can also extract null here?
Or is there a way to just convert @message to a JSON object?

You can use (?:case1|case2) to match either case1 or case2.
For your example: "property2":(?:null|\[(?<property2>.*?)])
This gives:
for input "property2":["string1", "string2"] (your first log line):
"property2": [[ ""string1", "string2"" ]]
for input "property2":null (your second log line):
"property2": [[ null ]]
You can test it at http://grokdebug.herokuapp.com/
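If it helps to sanity-check the alternation outside of Grok, the same pattern works in any ordinary regex engine. Here is a sketch in Python against the two example log bodies; note that Python spells the named group (?P<name>) where Grok uses (?<name>):

```python
import re

# The same (?:null|\[...\]) alternation, with a named capture group
pattern = r'"property2":(?:null|\[(?P<property2>.*?)\])'

log1 = '{"property1":"v1","property2":["string1", "string2"] , "property3": "v3"}'
log2 = '{"property1":"v1","property2":null , "property3": "v3"}'

m1 = re.search(pattern, log1)
m2 = re.search(pattern, log2)
print(m1.group('property2'))  # "string1", "string2"
print(m2.group('property2'))  # None (the null branch captures nothing)
```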

Related

Get array from 1 to number of columns of CSV in Nextflow

One of my processes outputs a CSV file. I want to create an array channel from 1 to the number of columns. For example:
My output
my_out_ch.view() -> test.csv
Assume test.csv has 11 columns. Now I want to create a channel which gives me:
1,2,3,4,5,6,7,8,9,10,11
How could I get this? I have tried with splitText operator as below without luck:
my_out_ch.splitText(by:1,limit:1)
But it only gives me the column names. There is a parameter elem; I am not sure if elem could give me the array, and I am also not sure how to use it. Any help?
You could use the splitCsv operator to parse the CSV file. Then create an intRange using the map operator. Either call collect() to emit a java.util.ArrayList or call join() to emit a string. For example:
params.input_tsv = 'test.tsv'
Channel.fromPath( params.input_tsv )
    | splitCsv( sep: '\t', limit: 1 )
    | map { (1..it.size()).join(',') }
    | view()
Results:
1,2,3,4,5,6,7,8,9,10,11
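The same idea can be sketched in plain Python, which may help clarify what each Nextflow operator is doing (the TSV content below is made up for illustration):

```python
import csv
import io

# Eleven tab-separated column names, standing in for the first line of test.tsv
tsv = 'c1\tc2\tc3\tc4\tc5\tc6\tc7\tc8\tc9\tc10\tc11\n'

# splitCsv( sep: '\t', limit: 1 ) parses just the first row...
first_row = next(csv.reader(io.StringIO(tsv), delimiter='\t'))

# ...and map { (1..it.size()).join(',') } turns its length into "1,2,...,n"
result = ','.join(str(i) for i in range(1, len(first_row) + 1))
print(result)  # 1,2,3,4,5,6,7,8,9,10,11
```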

Nextflow: structured inputs with files

I have an array of structure data similar to:
- name: foobar
sex: male
fastqs:
- r1: /path/to/foobar_R1.fastq.gz
r2: /path/to/foobar_R2.fastq.gz
- r1: /path/to/more/foobar_R1.fastq.gz
r2: /path/to/more/foobar_R2.fastq.gz
- name: bazquux
sex: female
fastqs:
- r1: /path/to/bazquux_R1.fastq.gz
r2: /path/to/bazquux_R2.fastq.gz
Note that fastqs come in pairs, and the number of pairs per "sample" may be variable.
I want to write a process in nextflow that processes one sample at a time.
In order for the Nextflow executor to properly marshal the files, they must somehow be typed as path (or file). Thus typed, the executor will copy the files to the compute node for processing. Simply typing the file paths as val will treat the paths as strings and no files will be copied.
A trivial example of a path input from the docs:
process foo {
    input:
    path x from '/some/data/file.txt'

    """
    your_command --in $x
    """
}
How should I go about declaring the process input so that the files are properly marshaled to the compute node? So far I haven't found any examples in the docs for how to handle structured inputs.
Your structured data looks a lot like YAML. If you can include a top-level object so that your file looks something like this:
samples:
  - name: foobar
    sex: male
    fastqs:
      - r1: ./path/to/foobar_R1.fastq.gz
        r2: ./path/to/foobar_R2.fastq.gz
      - r1: ./path/to/more/foobar_R1.fastq.gz
        r2: ./path/to/more/foobar_R2.fastq.gz
  - name: bazquux
    sex: female
    fastqs:
      - r1: ./path/to/bazquux_R1.fastq.gz
        r2: ./path/to/bazquux_R2.fastq.gz
Then, we can use Nextflow's -params-file option to load the params when we run our workflow. We can access the top-level object from the params, which gives us a list that we can use to create a Channel using the fromList factory method. The following example uses the new DSL 2:
process test_proc {
    tag { sample_name }
    debug true
    stageInMode 'rellink'

    input:
    tuple val(sample_name), val(sex), path(fastqs)

    """
    echo "${sample_name},${sex}:"
    ls -g *.fastq.gz
    """
}

workflow {
    Channel.fromList( params.samples )
        | flatMap { rec ->
            rec.fastqs.collect { rg ->
                readgroup = tuple( file(rg.r1), file(rg.r2) )
                tuple( rec.name, rec.sex, readgroup )
            }
        }
        | test_proc
}
Results:
$ mkdir -p ./path/to/more
$ touch ./path/to/foobar_R{1,2}.fastq.gz
$ touch ./path/to/more/foobar_R{1,2}.fastq.gz
$ touch ./path/to/bazquux_R{1,2}.fastq.gz
$ nextflow run main.nf -params-file file.yaml
N E X T F L O W ~ version 22.04.0
Launching `main.nf` [desperate_colden] DSL2 - revision: 391a9a3b3a
executor > local (3)
[ed/61c5c3] process > test_proc (foobar) [100%] 3 of 3 ✔
foobar,male:
lrwxrwxrwx 1 users 35 Oct 14 13:56 foobar_R1.fastq.gz -> ../../../path/to/foobar_R1.fastq.gz
lrwxrwxrwx 1 users 35 Oct 14 13:56 foobar_R2.fastq.gz -> ../../../path/to/foobar_R2.fastq.gz
bazquux,female:
lrwxrwxrwx 1 users 36 Oct 14 13:56 bazquux_R1.fastq.gz -> ../../../path/to/bazquux_R1.fastq.gz
lrwxrwxrwx 1 users 36 Oct 14 13:56 bazquux_R2.fastq.gz -> ../../../path/to/bazquux_R2.fastq.gz
foobar,male:
lrwxrwxrwx 1 users 40 Oct 14 13:56 foobar_R1.fastq.gz -> ../../../path/to/more/foobar_R1.fastq.gz
lrwxrwxrwx 1 users 40 Oct 14 13:56 foobar_R2.fastq.gz -> ../../../path/to/more/foobar_R2.fastq.gz
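The flatMap step above emits one (name, sex, readgroup) tuple per pair of FASTQ files, which is why test_proc runs three times for two samples. The equivalent record expansion can be sketched in plain Python, using the sample data from the YAML:

```python
samples = [
    {'name': 'foobar', 'sex': 'male',
     'fastqs': [{'r1': './path/to/foobar_R1.fastq.gz', 'r2': './path/to/foobar_R2.fastq.gz'},
                {'r1': './path/to/more/foobar_R1.fastq.gz', 'r2': './path/to/more/foobar_R2.fastq.gz'}]},
    {'name': 'bazquux', 'sex': 'female',
     'fastqs': [{'r1': './path/to/bazquux_R1.fastq.gz', 'r2': './path/to/bazquux_R2.fastq.gz'}]},
]

# One tuple per readgroup: the flattened stream the process consumes
tuples = [(rec['name'], rec['sex'], (rg['r1'], rg['r2']))
          for rec in samples for rg in rec['fastqs']]
print(len(tuples))  # 3
```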
As requested, here's a solution that runs per sample. The problem we have is that we cannot simply feed in a list of lists using the path qualifier (since an ArrayList is not a valid path value). We could flatten() the list of file pairs, but this makes it difficult to access each of the file pairs if we need them. You may not necessarily need the file pair relationship but assuming you do, I think the right solution is to feed the R1 and R2 files in separately (i.e. using a path qualifier for R1 and another path qualifier for R2). The following example introspects the instance type to (re-)create the list of readgroups. We can use the stageAs option to localize the files into progressively indexed subdirectories, since some files in the YAML have identical names.
process test_proc {
    tag { sample_name }
    debug true
    stageInMode 'rellink'

    input:
    tuple val(sample_name), val(sex), path(r1, stageAs:'*/*'), path(r2, stageAs:'*/*')

    script:
    if( [r1, r2].every { it instanceof List } )
        readgroups = [r1, r2].transpose()
    else if( [r1, r2].every { it instanceof Path } )
        readgroups = [ [r1, r2] ]
    else
        error "Invalid readgroup configuration"

    read_pairs = readgroups.collect { r1, r2 -> "${r1},${r2}" }

    """
    echo "${sample_name},${sex}:"
    echo ${read_pairs.join(' ')}
    ls -g */*.fastq.gz
    """
}

workflow {
    Channel.fromList( params.samples )
        | map { rec ->
            def r1 = rec.fastqs.r1.collect { file(it) }
            def r2 = rec.fastqs.r2.collect { file(it) }
            tuple( rec.name, rec.sex, r1, r2 )
        }
        | test_proc
}
Results:
$ nextflow run main.nf -params-file file.yaml
N E X T F L O W ~ version 22.04.0
Launching `main.nf` [berserk_sanger] DSL2 - revision: 2f317a8cee
executor > local (2)
[93/6345c9] process > test_proc (bazquux) [100%] 2 of 2 ✔
foobar,male:
1/foobar_R1.fastq.gz,1/foobar_R2.fastq.gz 2/foobar_R1.fastq.gz,2/foobar_R2.fastq.gz
lrwxrwxrwx 1 users 38 Oct 19 13:43 1/foobar_R1.fastq.gz -> ../../../../path/to/foobar_R1.fastq.gz
lrwxrwxrwx 1 users 38 Oct 19 13:43 1/foobar_R2.fastq.gz -> ../../../../path/to/foobar_R2.fastq.gz
lrwxrwxrwx 1 users 43 Oct 19 13:43 2/foobar_R1.fastq.gz -> ../../../../path/to/more/foobar_R1.fastq.gz
lrwxrwxrwx 1 users 43 Oct 19 13:43 2/foobar_R2.fastq.gz -> ../../../../path/to/more/foobar_R2.fastq.gz
bazquux,female:
1/bazquux_R1.fastq.gz,1/bazquux_R2.fastq.gz
lrwxrwxrwx 1 users 39 Oct 19 13:43 1/bazquux_R1.fastq.gz -> ../../../../path/to/bazquux_R1.fastq.gz
lrwxrwxrwx 1 users 39 Oct 19 13:43 1/bazquux_R2.fastq.gz -> ../../../../path/to/bazquux_R2.fastq.gz
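The key move in the script block is Groovy's transpose(), which pairs the i-th R1 with the i-th R2. In Python terms it is just zip, as this small sketch shows (file names taken from the staged output above):

```python
r1 = ['1/foobar_R1.fastq.gz', '2/foobar_R1.fastq.gz']
r2 = ['1/foobar_R2.fastq.gz', '2/foobar_R2.fastq.gz']

# Groovy's [r1, r2].transpose() pairs lists element-wise, like zip
readgroups = list(zip(r1, r2))
read_pairs = ['{},{}'.format(a, b) for a, b in readgroups]
print(' '.join(read_pairs))
# 1/foobar_R1.fastq.gz,1/foobar_R2.fastq.gz 2/foobar_R1.fastq.gz,2/foobar_R2.fastq.gz
```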

How to remove rows in Pandas DataFrame that are partial duplicates?

I have a DataFrame of scraped tweets, and I am trying to remove the rows of tweets that are partial duplicates.
Below is a simplified DataFrame with the same issue. Notice how the first and the last tweets have everything in common except the attached URL ending; I need a way to drop partial duplicates like this and keep only the latest instance.
data = {
    'Tweets': [' The Interstate is closed www.txdot.com/closed',
               'The project is complete www.txdot.com/news',
               'The Interstate is closed www.txdot.com/news'],
    'Date': ['Mon Aug 03 20:48:42', 'Mon Aug 03 20:15:42', 'Mon Aug 03 20:01:42']
}
df = pd.DataFrame(data)
I've tried dropping duplicates with the drop_duplicates method below, but there doesn't seem to be an argument to accomplish this.
df.drop_duplicates(subset=['Tweets'])
Any ideas how to accomplish this?
You can write a regex that identifies each tweet by the main URL portion, ignoring whatever follows the forward slash:
df['Tweets'].replace(r'(www\.\w+\.com)/(\w+)', r'\1', regex=True).drop_duplicates()
Yields
0 The Interstate is closed www.txdot.com
1 The project is complete www.txdot.com
Name: Tweets, dtype: object
We can then pass the surviving index back to .loc to select the full rows:
df.loc[df['Tweets'].replace(r'(www\.\w+\.com)/(\w+)', r'\1', regex=True).drop_duplicates().index]
Tweets Date
0 The Interstate is closed www.txdot.com/closed Mon Aug 03 20:48:42
1 The project is complete www.txdot.com/news Mon Aug 03 20:15:42
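A variation on the same idea, in case it is useful: normalize the tweets (stripping surrounding whitespace as well as the URL path) into a key, then use duplicated() as a boolean mask. keep='first' keeps the most recent instance here because the rows are sorted newest-first:

```python
import pandas as pd

data = {
    'Tweets': [' The Interstate is closed www.txdot.com/closed',
               'The project is complete www.txdot.com/news',
               'The Interstate is closed www.txdot.com/news'],
    'Date': ['Mon Aug 03 20:48:42', 'Mon Aug 03 20:15:42', 'Mon Aug 03 20:01:42'],
}
df = pd.DataFrame(data)

# Normalized key: surrounding whitespace and the URL path are removed
key = df['Tweets'].str.strip().str.replace(r'(www\.\w+\.com)/\w+', r'\1', regex=True)

# Boolean mask: True for every repeat after the first occurrence
deduped = df[~key.duplicated(keep='first')]
print(deduped)
```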

How to construct a data frame from raw data from a CSV file

I am currently learning the python environment to process sensor data.
I have a board with 32 sensors reading temperature. At the following link, you can find an extract of the raw data: https://5e86ea3db5a86.htmlsave.net/
I am trying to construct a data frame grouped by date from my CSV file using pandas (see the potential structure of the table: https://docs.google.com/spreadsheets/d/1zpDI7tp4nSn8-Hm3T_xd4Xz7MV6VDGcWGxwNO-8S0-s/edit?usp=sharing).
So far, I have read the data file in pandas and deleted all the unnamed columns. I am struggling with the creation of a sensor ID column, which should contain the 32 sensor IDs, and the temperature column.
How should I loop through this CSV file to create 3 columns (date, sensor ID and temperature)?
Thanks for the help
It looks like the first item in each line is the date, then there are pairs of sensor ID and value, then a blank value that we can exclude. If so, the following should work; if not, try to modify the code to your purposes.
import pandas as pd

data = []
with open('filename.txt', 'r') as f:
    for line in f:
        # the if excludes empty strings
        parts = [part for part in line.split(',') if part]
        # this gets the date in a format that pandas can recognize;
        # you can omit the replace operations if not needed
        sensor_date = parts[0].strip().replace('[', '').replace(']', '')
        # the rest of the list are the pairings of sensor and reading
        sensor_readings = parts[1:]
        # list slicing iterates over even and odd elements:
        # ::2 means every second item starting at zero (the evens),
        # 1::2 means every second item starting at one (the odds)
        for sensor, reading in zip(sensor_readings[::2], sensor_readings[1::2]):
            data.append({'sensor_date': sensor_date,
                         'sensor': sensor,
                         'reading': reading})

pd.DataFrame(data)
Using your sample data, I got the following:
=== Output: ===
Out[64]:
sensor_date sensor reading
0 Tue Jul 02 16:35:22.782 2019 28C037080B000089 16.8750
1 Tue Jul 02 16:35:22.782 2019 284846080B000062 17.0000
2 Tue Jul 02 16:35:22.782 2019 28A4BA070B00002B 16.8750
3 Tue Jul 02 16:35:22.782 2019 28D4E3070B0000D5 16.9375
4 Tue Jul 02 16:35:22.782 2019 28A21E080B00002F 17.0000
.. ... ... ...
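As a possible follow-up (not part of the original answer): the reading and date columns come out as strings, and converting them makes grouping by date straightforward. The format string below assumes timestamps shaped like the sample output:

```python
import pandas as pd

# Two rows in the shape produced by the loop above
data = [
    {'sensor_date': 'Tue Jul 02 16:35:22.782 2019',
     'sensor': '28C037080B000089', 'reading': ' 16.8750'},
    {'sensor_date': 'Tue Jul 02 16:35:22.782 2019',
     'sensor': '284846080B000062', 'reading': ' 17.0000'},
]
df = pd.DataFrame(data)

# Convert the strings to a float and a real datetime
df['reading'] = df['reading'].astype(float)
df['sensor_date'] = pd.to_datetime(df['sensor_date'],
                                   format='%a %b %d %H:%M:%S.%f %Y')
print(df.dtypes)
```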

How to get a formatted date and time string from `now`?

I'm using "Red Programming Language" version "0.6.4" on Windows and making a command line application.
I don't know much Red language and I don't understand many things. I did go over "work in progress" docs at (https://doc.red-lang.org/en/) before asking here.
I need to get a date and time string formatted as yyyymmdd_hhmm.
I've started with code like this:
Red []
dt: to string! now/year
print dt
which gives me 2019, but I also need the month, day and time to obtain something like 20190608_2146.
I tried also:
Red []
dt: to string! now/precise
print dt
which gives me 8-Jun-2019/21:47:51.299-07:00, but again what I need is 20190608_2147.
Question:
How to modify the code above to obtain something like 20190608_2147 from now?
Thank you.
I have written a script for Rebol and Red called 'Form Date' that will format dates/times in a similar fashion to strftime. The Red version is here.
do %form-date.red
probe form-date now "%Y%m%d_%H%M"
print first spec-of :form-date
Within the script are individual snippets of code used for formatting the various components of a date! value.
You don't need the script for your specific example though, you can extract and join the various components thus:
date: now
rejoin [ ; reduce-join
form date/year
pad/left/with date/month 2 #"0"
pad/left/with date/day 2 #"0"
"_"
pad/left/with date/hour 2 #"0"
pad/left/with date/minute 2 #"0"
]
As the above solution has some problems under Rebol2, here is a variation that works the same way in both Rebol and Red:
date: now
rejoin [
date/year
next form 100 + date/month
next form 100 + date/day
"_"
next form 100 + date/time/hour
next form 100 + date/time/minute
]
Here is another way:
rejoin [
now/year
either 10 > x: (now/month) [join "0" x][x]
either 10 > x: (now/day) [join "0" x][x]
"_"
either 10 > x: first (now/time) [join "0" x][x]
either 10 > x: second (now/time) [join "0" x][x]
]
Red has pad, so rgchris's answer above is good. However, there is no need for the intermediate date: now assignment as rgchris has done:
rejoin [
now/year
pad/left/with now/month 2 #"0"
pad/left/with now/day 2 #"0"
"_"
pad/left/with first (now/time) 2 #"0"
pad/left/with second (now/time) 2 #"0"
]