awk - collapsing files with same date + getting sum of column

This is my question - awk - collapsing files with same date + getting sum of column - but there might be a better way of wording it, e.g. aggregating...
1/
this is my output:
> awk -F' {2,}' 'BEGIN{FS=OFS=","} $5 =="1927" {cnt++} ENDFILE{print FILENAME, (cnt>0&&cnt?cnt:"0"); cnt=0}' /log/msg/sched/kevin_nbr_deletion/2023-01-31_*_table.log | head
/log/msg/sched/xxyyzz/2023-01-31_06:05:00_eNB_600010_table.log,0
/log/msg/sched/xxyyzz/2023-01-31_06:05:00_eNB_600012_table.log,0
/log/msg/sched/xxyyzz/2023-01-31_06:05:00_eNB_600014_table.log,0
/log/msg/sched/xxyyzz/2023-01-31_06:05:00_eNB_600016_table.log,0
/log/msg/sched/xxyyzz/2023-01-31_06:05:00_eNB_600018_table.log,0
/log/msg/sched/xxyyzz/2023-01-31_06:05:00_eNB_600020_table.log,0
/log/msg/sched/xxyyzz/2023-01-31_06:05:00_eNB_600024_table.log,0
/log/msg/sched/xxyyzz/2023-01-31_06:05:00_eNB_600026_table.log,0
/log/msg/sched/xxyyzz/2023-01-31_06:05:00_eNB_600028_table.log,0
/log/msg/sched/xxyyzz/2023-01-31_06:05:00_eNB_618100_table.log,0
2/
And from this I want to collapse all these files into one and show the sum of the times the pattern is matched.
> awk -F' {2,}' 'BEGIN{FS=OFS=","} $5 =="1927" {cnt++} ENDFILE{print FILENAME, (cnt>0&&cnt?cnt:"0"); cnt=0}' /log/msg/sched/xxyyzz/2023-01-31_*_table.log | head | awk -F, '{print substr($1,35,10)}'
2023-01-31
2023-01-31
2023-01-31
2023-01-31
2023-01-31
2023-01-31
2023-01-31
2023-01-31
2023-01-31
2023-01-31
3/
> awk -F' {2,}' 'BEGIN{FS=OFS=","} $5 =="1927" {cnt++} ENDFILE{print FILENAME, (cnt>0&&cnt?cnt:"0"); cnt=0}' /log/msg/sched/xxyyzz/2023-01-31_*_table.log | head | awk -F, '{sum+=$2} END{print sum}'
0
>
How do I combine 2 and 3 together to get this output:
2023-01-31,0
Or better still, if I change the wildcard to pick up more files, then I want output like this:
2023-01-30,0
2023-01-31,0
2023-02-01,0
Or even combine 1, 2 and 3 into one.

To simulate OP's current awk output:
$ cat awk.out
/log/msg/sched/xxyyzz/2023-01-31_06:05:00_eNB_600010_table.log,0
/log/msg/sched/xxyyzz/2023-01-31_06:05:00_eNB_600012_table.log,0
/log/msg/sched/xxyyzz/2023-01-31_06:05:00_eNB_600014_table.log,0
/log/msg/sched/xxyyzz/2023-01-31_06:05:00_eNB_600016_table.log,0
/log/msg/sched/xxyyzz/2023-01-31_06:05:00_eNB_600018_table.log,0
/log/msg/sched/xxyyzz/2023-01-31_06:05:00_eNB_600020_table.log,0
/log/msg/sched/xxyyzz/2023-01-31_06:05:00_eNB_600024_table.log,0
/log/msg/sched/xxyyzz/2023-01-31_06:05:00_eNB_600026_table.log,0
/log/msg/sched/xxyyzz/2023-01-31_06:05:00_eNB_600028_table.log,0
/log/msg/sched/xxyyzz/2023-01-31_06:05:00_eNB_618100_table.log,0
One simple fix for the current code:
cat awk.out |
awk '
BEGIN { FS=OFS="," }
{
    date=substr($1,23,10)   # OP will need to review the start position (23 vs 35 vs something else) based on actual data; otherwise this could be expanded to parse the filename based on delimiters
    datesums[date]+=$2
}
END {
    for (i in datesums)
        print i,datesums[i]
}
'
This generates:
2023-01-31,0
Of course, this should be pulled into the main awk script, and the hardcoded substr() could be replaced with something a bit more dynamic; one idea:
awk -F' {2,}' '
BEGIN { FS=OFS="," }
$5 == "1927" { cnt++ }
ENDFILE {
    if (match(FILENAME,/[0-9]{4}-[0-9]{2}-[0-9]{2}/)) {   # pull the date out of the file name rather than out of the last data record
        date=substr(FILENAME,RSTART,RLENGTH)
        datesums[date]+=(cnt+0)
    }
    cnt=0
}
END {
    for (i in datesums)
        print i,datesums[i]
}
' /log/msg/sched/kevin_nbr_deletion/2023-01-31_*_table.log
NOTE: without actual input files I'm not going to try to address the dual delimiter definitions (-F' {2,}' vs FS=OFS=",")
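To sanity-check the dynamic date extraction on its own, the same match()/substr() idea can be run against one of the file names shown above (a standalone sketch; it assumes the date always appears in the path in YYYY-MM-DD form and that the awk in use supports {n} regex intervals, as GNU awk does):
echo '/log/msg/sched/xxyyzz/2023-01-31_06:05:00_eNB_600010_table.log,0' |
awk -F, '{ if (match($1, /[0-9]{4}-[0-9]{2}-[0-9]{2}/)) print substr($1, RSTART, RLENGTH) }'
which should print 2023-01-31.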

Related

Count rows and columns for multiple CSV files and make new file

I have multiple large comma-separated CSV files in a directory. But, as a toy example:
one.csv has 3 rows, 2 columns
two.csv has 4 rows, 5 columns
This is what the files look like -
# one.csv
a b
1 1 3
2 2 2
3 3 1
# two.csv
c d e f g
1 4 1 1 4 1
2 3 2 2 3 2
3 2 3 3 2 3
4 1 4 4 1 4
The goal is to make a new .txt or .csv that gives the rows and columns for each:
one 3 2
two 4 5
To get the rows and columns (and dump it into a file) for a single file
$ awk -F "," '{print NF}' *.csv | sort | uniq -c > dims.txt
But I'm not understanding the syntax to get counts for multiple files.
What I've tried
$ awk '{for (i=1; i<=2; i++) -F "," '{print NF}' *.csv$i | sort | uniq -c}'
With any awk, you could try the following awk program.
awk '
FNR==1{
  if(cols && rows){
    print file,rows,cols
  }
  rows=cols=file=""
  file=FILENAME
  sub(/\..*/,"",file)
  cols=NF
  next
}
{
  rows=(FNR-1)
}
END{
  if(cols && rows){
    print file,rows,cols
  }
}
' one.csv two.csv
Explanation: a detailed explanation of the above solution.
awk '                      ##Starting awk program from here.
FNR==1{                    ##Checking condition: if this is the first line of a file then do the following.
  if(cols && rows){        ##Checking if cols AND rows are NOT NULL then do the following.
    print file,rows,cols   ##Printing file, rows and cols variables here.
  }
  rows=cols=file=""        ##Nullifying rows, cols and file here.
  file=FILENAME            ##Setting FILENAME value to file here.
  sub(/\..*/,"",file)      ##Removing everything from the dot to the end of the value in file.
  cols=NF                  ##Setting NF value to cols here.
  next                     ##next will skip all further statements from here.
}
{
  rows=(FNR-1)             ##Setting FNR-1 value to rows here.
}
END{                       ##Starting END block of this program from here.
  if(cols && rows){        ##Checking if cols AND rows are NOT NULL then do the following.
    print file,rows,cols   ##Printing file, rows and cols variables here.
  }
}
' one.csv two.csv          ##Mentioning Input_file names here.
Using gnu awk you can do this in a single awk:
awk -F, 'ENDFILE {
print gensub(/\.[^.]+$/, "", "1", FILENAME), FNR-1, NF-1
}' one.csv two.csv > dims.txt
cat dims.txt
one 3 2
two 4 5
You will need to iterate over all CSVs, printing the name and the dimensions for each file:
for i in *.csv; do awk -F "," 'END{print FILENAME, NR, NF}' "$i"; done > dims.txt
If you want to avoid awk, you can also do it with wc -l for the lines and grep -o "CSV-separator" | wc -l for the fields.
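That non-awk route might look roughly like this (a sketch; it assumes comma-separated files with no embedded commas, and it counts every line, including any header):
for f in *.csv; do
    rows=$(wc -l < "$f")                                      # number of lines
    cols=$(( $(head -n 1 "$f" | grep -o ',' | wc -l) + 1 ))   # commas in the first line, plus one
    printf '%s %s %s\n' "${f%.csv}" "$rows" "$cols"
done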
I would harness GNU AWK's ENDFILE for this task as follows. Let the content of one.csv be
1,3
2,2
3,1
and two.csv be
4,1,1,4,1
3,2,2,3,2
2,3,3,2,3
1,4,4,1,4
then
awk 'BEGIN{FS=","}ENDFILE{print FILENAME, FNR, NF}' one.csv two.csv
output
one.csv 3 2
two.csv 4 5
Explanation: ENDFILE is executed after processing every file. I set FS to , assuming that fields are ,-separated and there is no , inside a field. FILENAME, FNR and NF are built-in GNU AWK variables: FNR is the number of the current row in the file, i.e. in ENDFILE the number of the last row, and NF is the number of fields (again, of the last row). If you have files with headers use FNR-1; if you have rows prepended with a row number use NF-1.
edit: changed NR to FNR
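For the OP's original files, which have a header row and a leading row-number column, the adjusted call would presumably be (untested, and assuming the real files are comma-separated as described):
awk 'BEGIN{FS=","}ENDFILE{print FILENAME, FNR-1, NF-1}' one.csv two.csv
which should report 3 2 for one.csv and 4 5 for two.csv.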
Without GNU awk you can use the shell plus POSIX awk this way:
for fn in *.csv; do
    cols=$(awk '{print NF; exit}' "$fn")
    rows=$(awk 'END{print NR-1}' "$fn")
    printf "%s %s %s\n" "${fn%.csv}" "$rows" "$cols"
done
Prints:
one 3 2
two 4 5

Get unique string occurrence and display it

I'm not really good with awk, so here is what I just did to count the number of occurrences in one row:
The input.txt has this:
18 18 21 21 21 21 18 21
I just want to display the unique number that occur above. So, here is my code:
input="input.txt"
output=$(fmt -1 "$input" | sort | uniq | awk '{printf $1","}')
echo "$output"
The output:
18,21,
I got the result correctly, but there is that comma , at the end; how do I remove it? Also, is there a simpler or cleaner method without using fmt?
The expected output:
18,21
Edit: to remove the comma, I use this:
sed 's/,$//'
and it's working, but is there a simpler way to do this without using fmt?
Could you please try the following.
awk '
BEGIN{ OFS="," }
{
  for(i=1;i<=NF;i++){
    if(!arr[$i]++){
      val=(val?val OFS:"")$i
    }
  }
  print val
  val=""
}' Input_file
Explanation: a detailed explanation of the above.
awk '                        ##Starting awk program from here.
BEGIN{ OFS="," }             ##Setting output field separator as comma here.
{
  for(i=1;i<=NF;i++){        ##Traversing through all fields of the current line here.
    if(!arr[$i]++){          ##Checking if arr does NOT already contain the current field.
      val=(val?val OFS:"")$i ##Creating val and appending values to it, to print all values at the end.
    }
  }
  print val                  ##Printing val here.
  val=""                     ##Nullifying val here.
}' Input_file                ##Mentioning Input_file name here.
Here is an alternative way in gnu awk:
awk -v RS='[[:blank:]]+' '!seen[$1]++{s=s (s!=""?",":"") $1} END{print s}' file.txt
18,21
Newer versions of perl have uniq in the standard library (List::Util). Otherwise, you'll have to manually write the logic (How do I print unique elements in Perl array?) or use https://metacpan.org/pod/List::MoreUtils
perl -MList::Util=uniq -lane 'print join ",", uniq @F'
perl -lane 'print join ",", grep { !$seen{$_}++ } @F'
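For completeness, the List::MoreUtils route mentioned above would presumably be the same one-liner with the other module name (assuming that module is installed):
perl -MList::MoreUtils=uniq -lane 'print join ",", uniq @F'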
With ruby
ruby -ane 'puts $F.uniq * ","'
I had a similar problem today and solved it using echo, tr, sort, and sed. The code for your example is below:
echo -n "18 18 21 21 21 21 18 21" | tr -s ' ' '\n' | sort -u | tr '\n' ',' | sed 's/,$//'
output: 18,21
echo '18 18 21 21 21 21 18 21' |
mawk 'END { print "\n" } !___[$_]-- && $!NF= NF < ++__\
? ","$_ : $_' ORS= RS='[\t- ]+'
18,21
or if you don't mind chaining up 2 awks:
mawk '!__[$_]--' ORS=',' RS='[\t- ]+' | mawk 7 RS=',$'
18,21

How to count values between empty cells

I'm facing one problem which is bigger than me. I have 18 relatively large text files (ca 30k lines each) and I need to count (i.e. sum) the values between the empty cells in the second column. Here is a simple example of my file:
Metabolism
line_1 10.2
line_2 10.1
line_3 10.3
TCA_cycle
line_4 10.7
line_5 10.8
Pyruvate_metab
line_6 100.8
In reality, I have circa 500 description lines (Metabolism, TCA_cycle, etc.) and the number of lines per block ranges from zero to a few hundred.
I would like to sum the values for each block (a block starts with a description and the corresponding lines are always below it), e.g.
Metabolism 30.6
line_1 10.2
line_2 10.1
line_3 10.3
TCA_cycle 21.5
line_4 10.7
line_5 10.8
Pyruvate_metab 100.8
line_6 100.8
Or just
30.6
21.5
100.8
It won't be a problem if results will be printed line by line into an additional file... Or another alternative way.
There is one tricky thing: descriptions without any lines of numbers below them.
Transport
line_1000 100.1
line_1001 100.2
Cell_signal
Motility
Processing
Translation
line_1002 500.1
line_1003 200.2
And even for those lines I would like to get a 0 value.
Transport 200.3
line_1000 100.1
line_1001 100.2
Cell_signal 0
Motility 0
Processing 0
Translation 700.3
line_1002 500.1
line_1003 200.2
The rest of the file looks the same and it's consistent: 2 columns, tab separators, descriptions in the first column, values in the second, no spaces (only underscores).
Actually I have no experience with more sophisticated coding so I really don't know how to solve it in the command line. I've already tried some Excel ways but it was painful and unsuccessful.
With tac and any awk (processing the file bottom-up means each block's sum is already known when its description line is reached; the second tac restores the original order):
tac file | awk 'NF==2{sum+=$2; print; next} {print $1 "\t" sum; sum=0}' | tac
With two improvements proposed by kvantour and Ed Morton. See the comments.
tac file | awk '($NF+0==$NF){sum+=$2; print; next} {print $1 "\t" sum+0; sum=0}' | tac
See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
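As a quick check, if file contained just the second sample above (the one with empty description blocks), the improved pipeline would be expected to print the desired output, with a tab between each description and its sum:
tac file | awk '($NF+0==$NF){sum+=$2; print; next} {print $1 "\t" sum+0; sum=0}' | tac
Transport	200.3
line_1000 100.1
line_1001 100.2
Cell_signal	0
Motility	0
Processing	0
Translation	700.3
line_1002 500.1
line_1003 200.2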
Could you please try the following, written and tested in GNU awk with the shown samples.
awk '
FNR==NR{
  if($0!~/line/){ a[$0]; prev=$0 }
  else          { a[prev]+=$NF }
  next
}
!/line/{
  $0=$0 OFS (a[$0]?a[$0]:0)
}
1' Input_file Input_file
OR, in case you want the output in a better-looking form, add column -t to the above command as follows:
awk '
FNR==NR{
  if($0!~/line/){ a[$0]; prev=$0 }
  else          { a[prev]+=$NF }
  next
}
!/line/{
  $0=$0 OFS (a[$0]?a[$0]:0)
}
1' Input_file Input_file | column -t
Explanation: a detailed explanation of the above code.
awk '                                ##Starting awk program from here.
FNR==NR{                             ##Checking FNR==NR, which will be TRUE when Input_file is being read the first time.
  if($0!~/line/){ a[$0]; prev=$0 }   ##If the line does NOT contain the string line: create an index for it in a and save it in prev.
  else          { a[prev]+=$NF }     ##Else (a data line): keep adding the last field value into array a at index prev.
  next                               ##next will skip all further statements from here.
}
!/line/{                             ##On the second pass, if the current line does not have the line keyword in it then do the following.
  $0=$0 OFS (a[$0]?a[$0]:0)          ##Re-creating the current line as its current value, then OFS (space by default), then either the value of a[$0] or 0 if that sum is empty.
}
1                                    ##Printing the current line here.
' Input_file Input_file              ##Mentioning Input_file names here.
In plain awk:
awk '{
  if (NF == 1) {
    if (blockname)
      printf("%s\t%.2f\n%s", blockname, sum, lines)
    blockname = $0
    sum = 0
    lines = ""
  } else if (NF == 2) {
    sum += $2
    lines = lines $0 "\n"
  }
  next
}
END { printf("%s\t%.2f\n%s", blockname, sum, lines) }
' input.txt

awk multiple row and printing results

I would like to print some specific parts of the results with awk, after matching multiple patterns.
What I have is (filetest):
A : 1
B : 2
I expect to have:
1 - B : 2
So, only the result of the first row, then the whole second row.
The dash was added by me.
I have this:
awk -F': ' '$1 ~ /A|B/ { printf "%s", $2 "-" }' filetest
Result:
1 -2 -
And I cannot get the full second row without failing to show just the result of the first one:
awk -F': ' '$1 ~ /A|B/ { printf "%s", $2 "$1" }' filetest
Result:
1 - A 2 - B
Is there any way to print, on the same line, exactly the columns/rows that I need with awk?
In my case R1C2 - R2C1: R2C2?
Thanks!
This will do what you are expecting:
awk -F: '/^A/{printf "%s -", $2}/^B/{print}' filetest
$ awk -F: 'NR%2 {printf "%s - ", $2; next}1' filetest
1 - B : 2
You can try this
awk -F: 'NR%2==1{a=$2; } NR%2==0{print a " - " $0}' file
output
1 - B : 2
I'd probably go with @jas's answer as it's clear, simple, and not coupled to your data values, but just to show an alternative approach:
$ awk '{printf "%s", (NR%2 ? $3 " - " : $0 ORS)}' file
1 - B : 2
Tried on GNU awk:
awk -F':' 'NR==1{s=$2;next}{FS="";s=s" - "$0;print s}' filetest

awk not printing header in output file

The awk below seems to work great, with one issue: the header lines do not print in the output. I have been staring at this for a while with no luck. What am I missing? Thank you :).
awk
awk 'NR==FNR{for (i=1;i<=NF;i++) a[$i];next} FNR==1 || ($7 in a)' /home/panels/file1 test.txt |
awk '{split($2,a,"-"); print a[1] "\t" $0}' |
sort |
cut -f2- > /home/panels/test_filtered.vcf
test.txt (used in the awk to give the filtered output; only a small portion of the data is shown, but the tab-delimited format is visible)
Chr Start End Ref Alt
chr1 949608 949608 G A
current output (has no header)
chr1 949608 949608 G A
desired output (has header)
Chr Start End Ref Alt
chr1 949608 949608 G A
It looks like the header is going to sort, and getting mixed in with your data. A simple solution is to do:
... | { read line; echo $line; sort; } |
to prevent the first line from going to sort.
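Spelled out against the pipeline in the question, that might look something like this (a sketch; IFS= read -r and printf are used instead of a bare read/echo so the tab-separated header passes through untouched):
awk 'NR==FNR{for (i=1;i<=NF;i++) a[$i];next} FNR==1 || ($7 in a)' /home/panels/file1 test.txt |
awk '{split($2,a,"-"); print a[1] "\t" $0}' |
{ IFS= read -r line; printf '%s\n' "$line"; sort; } |
cut -f2- > /home/panels/test_filtered.vcf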
You can combine your scripts, move the sort into awk, and handle the header this way.
$ awk 'NR==FNR{for(i=1;i<=NF;i++)a[$i]; next}
FNR==1{print "dummy\t" $0; next}
$7 in a{split($2,b,"-"); print b[1] "\t" $0 | "sort" }' file1 file2 |
cut -f2-