Enumerate lines with same ID in awk - awk

I'm using awk to process the following sample of data:
id,desc
168048,Prod_A
217215,Prod_C
217215,Prod_B
168050,Prod_A
168050,Prod_F
168050,Prod_B
What I'm trying to do is to create a column 'item' enumerating the lines within the same 'id':
id,desc,item
168048,Prod_A,#1
217215,Prod_C,#1
217215,Prod_B,#2
168050,Prod_A,#1
168050,Prod_F,#2
168050,Prod_B,#3
Here's what I've tried:
BEGIN {
FS = ","
a = 1
}
NR != 1 {
if (id != $1) {
id = $1
printf "%s,%s\n", $0, "#"a
}
else {
printf "%s,%s\n", $0, "#"a++
}
}
But it messes up the numbering:
168048,Prod_A,#1
217215,Prod_C,#1
217215,Prod_B,#1
168050,Prod_A,#2
168050,Prod_F,#2
168050,Prod_B,#3
Could someone give me some hints?
P.S. The line order doesn't matter

$ awk -F, 'NR>1{print $0,"#"++c[$1]}' OFS=, file
168048,Prod_A,#1
217215,Prod_C,#1
217215,Prod_B,#2
168050,Prod_A,#1
168050,Prod_F,#2
168050,Prod_B,#3
How it works
-F,
This sets the field separator on input to a comma.
NR>1{...}
This limits the commands in braces to lines other than the first, that is, the one with the header.
print $0,"#"++c[$1]
This prints the line followed by # and a count of the number of times that we have seen the first column.
Associative array c keeps a count of the number of times that an id has been seen. For every line, we increment by 1 the count for id $1. ++ increments. Because ++ precedes c[$1], the increment is done before the value is printed.
OFS=,
This sets the field separator on output to a comma.
Printing a new header as well
$ awk -F, 'NR==1{print $0,"item"} NR>1{print $0,"#"++c[$1]}' OFS=, file
id,desc,item
168048,Prod_A,#1
217215,Prod_C,#1
217215,Prod_B,#2
168050,Prod_A,#1
168050,Prod_F,#2
168050,Prod_B,#3
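If equal ids are always adjacent, as in the sample, the approach from the question can be kept with one change: restart the counter whenever the id changes. A minimal sketch under that assumption; the function name enumerate_adjacent and the variables prev and n are made up for illustration:

```shell
# Hypothetical helper: number rows within each run of identical ids.
# Assumes comma-separated input with a header row and equal ids on adjacent lines.
enumerate_adjacent() {
  awk -F, -v OFS=, '
    NR == 1 { print $0, "item"; next }            # extend the header
    NR == 2 || $1 != prev { n = 0; prev = $1 }    # new id: restart the count
    { print $0, "#" (++n) }                       # number the row within its id
  ' "$@"
}
```

Unlike the c[$1] array, this keeps no per-id state, but it gives wrong numbers if the same id shows up again later in the file.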

Related

How to replace all escape sequences with non-escaped equivalent with unix utilities (sed/tr/awk)

I'm processing a Wireshark config file (dfilter_buttons) for display filters and would like to print out the filter of a given name. The content of file is like:
Sample input
"TRUE","test","sip contains \x22Hello, world\x5cx22\x22",""
And the resulting output should have the escape sequences replaced, so I can use them later in my script:
Desired output
sip contains "Hello, world\x22"
My first pass is like this:
Current parser
filter_name=test
awk -v filter_name="$filter_name" 'BEGIN {FS="\",\""} ($2 == filter_name) {print $3}' "$config_file"
And my output is this:
Current output
sip contains \x22Hello, world\x5cx22\x22
I know I can handle these exact two escape sequences by piping to sed and matching those exact two sequences, but is there a generic way to substitute all escape sequences? Future filters I build may use more escape sequences than just " and \, and I would like to handle those scenarios too.
Using gnu-awk you can do this using the split, substr and strtonum functions:
awk -F '","' -v filt='test' '$2 == filt {n = split($3, subj, /\\x[0-9a-fA-F]{2}/, seps); for (i=1; i<n; ++i) printf "%s%c", subj[i], strtonum("0" substr(seps[i], 2)); print subj[i]}' file
sip contains "Hello, world\x22"
A more readable form:
awk -F '","' -v filt='test' '
$2 == filt {
n = split($3, subj, /\\x[0-9a-fA-F]{2}/, seps)
for (i=1; i<n; ++i)
printf "%s%c", subj[i], strtonum("0" substr(seps[i], 2))
print subj[i]
}' file
Explanation:
Using -F '","' we split input using delimiter ","
$2 == filt we filter input for $2 == "test" condition
Using /\\x[0-9a-fA-F]{2}/ as regex (that matches 2 digit hex strings) we split $3 and save split tokens into array subj and matched separators into array seps
Using substr we remove the first char, i.e. \, and prepend 0, turning e.g. \x22 into 0x22
Using strtonum we convert that hex string to the equivalent ASCII code number
Using %c in printf we print corresponding ascii character
Last for loop joins $3 back using subj and seps array elements
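If gawk is not available, the same idea can be sketched with POSIX awk features only. This is an illustration under the same assumption (escapes are always \x followed by exactly two hex digits); decode_hex_escapes and hexval are hypothetical names, and hexval stands in for strtonum by looking each digit up in a string:

```shell
# Hypothetical helper: decode \xHH escapes on each input line (POSIX awk only).
decode_hex_escapes() {
  awk '
    # Convert a hex string such as "5c" to its numeric value.
    function hexval(h,    i, n) {
      h = tolower(h); n = 0
      for (i = 1; i <= length(h); i++)
        n = n * 16 + index("0123456789abcdef", substr(h, i, 1)) - 1
      return n
    }
    {
      out = ""; s = $0
      # Consume the line left to right so decoded characters are never rescanned.
      while (match(s, /\\x[0-9a-fA-F][0-9a-fA-F]/)) {
        out = out substr(s, 1, RSTART - 1) sprintf("%c", hexval(substr(s, RSTART + 2, 2)))
        s = substr(s, RSTART + RLENGTH)
      }
      print out s
    }'
}
```

For example, printf '%s\n' 'sip contains \x22Hello, world\x5cx22\x22' | decode_hex_escapes prints sip contains "Hello, world\x22". The regex spells out the two hex digits instead of using {2} because some older awk implementations do not support interval expressions.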
Using GNU awk for FPAT, gensub(), strtonum(), and the 3rd arg to match():
$ cat tst.awk
BEGIN { FPAT="([^,]*)|(\"[^\"]*\")"; OFS="," }
$2 == ("\"" filter_name "\"") {
gsub(/^"|"$/,"",$3)
while ( match($3,/(\\x[0-9a-fA-F]{2})(.*)/,a) ) {
printf "%s%c", substr($3,1,RSTART-1), strtonum(gensub(/./,0,1,a[1]))
$3 = a[2]
}
print $3
}
$ awk -v filter_name='test' -f tst.awk file
sip contains "Hello, world\x22"
The above assumes your escape sequences are always \x followed by exactly 2 hex digits. It isolates every \xHH string in the input, replaces \ with 0 in that string so that strtonum() can then convert the string to a number, then uses %c in the printf formatting string to convert that number to a character.
Note that GNU awk has a debugger (see https://www.gnu.org/software/gawk/manual/gawk.html#Debugger), so if you're ever not sure what any part of a program does you can just run it in the debugger (-D) and trace it. For example, in the following I plant a breakpoint to tell awk to stop at line 1 of the script (b 1), then start running (r), and then step (s) through the script, printing the value of $3 (p $3) at each line so I can see how it changes after the gsub():
$ awk -D -v filter_name='test' -f tst.awk file
gawk> b 1
Breakpoint 1 set at file `tst.awk', line 1
gawk> r
Starting program:
Stopping in BEGIN ...
Breakpoint 1, main() at `tst.awk':1
1 BEGIN { FPAT="([^,]*)|(\"[^\"]*\")"; OFS="," }
gawk> p $3
$3 = uninitialized field
gawk> s
Stopping in Rule ...
2 $2 == "\"" filter_name "\"" {
gawk> p $3
$3 = "\"sip contains \\x22Hello, world\\x5cx22\\x22\""
gawk> s
3 gsub(/^"|"$/,"",$3)
gawk> p $3
$3 = "\"sip contains \\x22Hello, world\\x5cx22\\x22\""
gawk> s
4 while ( match($3,/(\\x[0-9a-fA-F]{2})(.*)/,a) ) {
gawk> p $3
$3 = "sip contains \\x22Hello, world\\x5cx22\\x22"

Extract columns by matching, rename, and assign value using AWK

I have a tab delimited csv file containing summary statistics for object lengths:
sampled. objs. obj. min. len. obj. mean. len. obj. max. len. obj. std.
50 22 60 95 5
I want to extract the minimum and maximum lengths by finding the columns whose headers are obj. min. len. and obj. max. len., and then create a new comma-delimited csv file with new column headers, to get the result
object_minimum,object_maximum
22,95
I first print the new headers. Then I tried retrieving the indices of the match and then extracting from the second row using these indices:
#!/bin/awk -f
BEGIN {
cols="object_minimum:object_maximum"
FS="\t"
RS="\n"
col_count=split(cols, col_arr, ":");
for(i=1; i<=col_count; i++) printf col_arr[i] ((i==col_count) ? "\n" : ",");
}
{
for (i=1; i<=NF; i++) {
if(index($i,"obj. min. len.") !=0) {
data["object_minimum"]=i;
}
if(index($i,"obj. max. len.") !=0) {
data["object_maximum"]=i;
}
}
}
END NR==1 {
for (j=1; j<=col_count; j++) printf NF==data[j] ((i==col_count) ? "\n" : ",");
}
There could be more columns and in a different order so it is necessary to do the matching to find the position, and also I may have to select for more columns by changing cols and looking for more matches. I execute by running
awk -f awk_script.awk original.csv > new.csv
With awk:
awk 'BEGIN {FS="\t"; OFS=","}
NR==1 {for (i=1; i<=NF; i++){f[$i] = i}} # fill array with header
NR> 1 {print $(f["obj. min. len."]), $(f["obj. max. len."])}' file
Output:
22,95
Source: https://unix.stackexchange.com/a/359699/74329
See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
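Since the question mentions that more columns may need to be selected and renamed later, the header lookup can be driven by a mapping passed in from the shell. A rough sketch only: the function name extract_columns and the "new_name=old header" mapping syntax (pairs joined by ":") are invented here for illustration, and it assumes header names contain neither ":" nor "=":

```shell
# Hypothetical helper: pick tab-separated columns by header name and rename them.
# $1 is a mapping like "new1=old header 1:new2=old header 2"; other args are input files.
extract_columns() {
  map=$1; shift
  awk -v map="$map" '
    BEGIN {
      FS = "\t"; OFS = ","
      n = split(map, pairs, ":")
      for (i = 1; i <= n; i++) {
        eq = index(pairs[i], "=")
        newname[i] = substr(pairs[i], 1, eq - 1)
        oldname[i] = substr(pairs[i], eq + 1)
      }
    }
    NR == 1 {
      for (i = 1; i <= NF; i++) pos[$i] = i      # header name -> field number
      for (i = 1; i <= n; i++) printf "%s%s", newname[i], (i < n ? OFS : ORS)
      next
    }
    {
      # Note: a header that was not found yields $0 here, so real use needs a check.
      for (i = 1; i <= n; i++) printf "%s%s", $(pos[oldname[i]]), (i < n ? OFS : ORS)
    }' "$@"
}
```

It would be called as extract_columns 'object_minimum=obj. min. len.:object_maximum=obj. max. len.' original.csv > new.csv.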
Here is one working prototype; add formatting and error checking as needed:
$ awk -F'\t' -v OFS=, '
NR==1 {for(i=1;i<=NF;i++)
if($i=="obj. min. len.") min=i;
else if($i=="obj. max. len.") max=i;
print "min","max"}
NR==2 {print $min,$max; exit}' file
min,max
22,95
Could you please try the following, based only on your shown samples; it was written and tested in GNU awk. It uses an awk variable named sep="###", which can be changed as needed.
awk -v sep="###" '
BEGIN{
OFS=","
}
FNR==1{
while(match($0,/ +obj\./)){
val=substr($0,RSTART,RLENGTH)
sub(/^ +/,"",val)
line=(line?line:"")substr($0,1,RSTART-1)sep val
$0=substr($0,RSTART+RLENGTH)
}
if(substr($0,RSTART+RLENGTH)!=""){
line=line substr($0,RSTART+RLENGTH)
}
num=split(line,arr,sep)
for(i=1;i<=num;i++){
if(arr[i]=="obj. min. len."){ min=i }
if(arr[i]=="obj. max. len."){ max=i }
}
print "object_minimum,object_maximum"
next
}
{
print $min,$max
}
' Input_file
Logical explanation: The first block works on the very first line of Input_file. It uses awk's match function to look for the regex  +obj\. in the current line, and builds up a variable (line) holding the text before each match followed by the matched value. Once all occurrences of the regex have been found, that variable (the first line with ### separators inserted, assuming ### is NOT present in your Input_file; otherwise change sep to something else) is split into an array. Finally, the loop goes through all elements of that array: if an element is obj. min. len. it sets the min variable to that index (which is the field number for the rest of the lines), and if it is obj. max. len. it sets max. After the first line has been processed, the corresponding fields are printed for every other line with $min,$max.

Analysing two files using awk with if condition

I have two files. First contains names, numbers and days for all samples
sam_name.csv
Number,Day,Sample
171386,0,38_171386_D0_2-1.raw
171386,0,38_171386_D0_2-2.raw
171386,2,30_171386_D2_1-1.raw
171386,2,30_171386_D2_1-2.raw
171386,-1,40_171386_D-1_1-1.raw
171386,-1,40_171386_D-1_1-2.raw
The second includes information about batches (last column)
sam_batch.csv
Number,Day,Quar,Code,M.F,Status,Batch
171386,0,1,x,F,C,1
171386,1,1,x,F,C,2
171386,2,1,x,F,C,5
171386,-1,1,x,F,C,6
I would like to get the information about batches (using two condition number and day) and add it to the first file. I have used awk command to do that, but I am getting results only at one-time point (-1).
Here is my command:
awk -F"," 'NR==FNR{number[$1]=$1;day[$1]=$2;batch[$1]=$7; next}{if($1==number[$1] && $2==day[$1]){print $0 "," number[$1] "," day[$1] "," batch[$1]}}' sam_batch.csv sam_name.csv
Here are my results (the sam_name file, then the number and day from sam_batch, just to check that the condition is working, and the batch number, which is the value I need):
Number,Day,Sample,Number,Day, Batch
171386,-1,40_171386_D-1_1-1.raw,171386,-1,6
171386,-1,40_171386_D-1_1-2.raw,171386,-1,6
175618,-1,08_175618_D-1_1-1.raw,175618,-1,2
Here I corrected your AWK code:
awk -F"," 'NR==FNR{
number_day = $1 FS $2
batch[number_day]=$7
next
}
{
number_day = $1 FS $2
print $0 "," batch[number_day]
}' sam_batch.csv sam_name.csv
Output:
Number,Day,Sample,Batch
171386,0,38_171386_D0_2-1.raw,1
171386,0,38_171386_D0_2-2.raw,1
171386,2,30_171386_D2_1-1.raw,5
171386,2,30_171386_D2_1-2.raw,5
171386,-1,40_171386_D-1_1-1.raw,6
171386,-1,40_171386_D-1_1-2.raw,6
(No need for double-checking if you understand how the script works.)
Here's another AWK solution (my original answer):
awk -v "b=sam_batch.csv" 'BEGIN {
FS=OFS=","
while(( getline line < b) > 0) {
n = split(line,a)
nd = a[1] FS a[2]
nd2b[nd] = a[n]
}
}
{ print $1,$2,$3,nd2b[$1 FS $2] }' sam_name.csv
Both solutions parse file sam_batch.csv at the beginning to form a dictionary of (number, day) -> batch. Then they parse sam_name.csv, printing out the first three fields together with the "Batch" from the other file.
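Neither version says what should happen when a (number, day) pair from sam_name.csv has no entry in sam_batch.csv; both print an empty batch in that case. A small sketch that makes the miss visible, assuming a placeholder such as NA is acceptable (the function name add_batch and the NA marker are choices made for this example, not part of the answers above):

```shell
# Hypothetical helper: append the batch to each sample row, keyed on (number, day).
# $1 is the batch file, $2 the sample file; both comma-separated with a header row.
add_batch() {
  awk -F, -v OFS=, '
    NR == FNR { batch[$1 FS $2] = $NF; next }    # first file: (number, day) -> last column
    {
      key = $1 FS $2
      print $0, ((key in batch) ? batch[key] : "NA")
    }
  ' "$1" "$2"
}
```

Because the header rows share the key Number,Day, the header of the sample file picks up the word Batch without special handling. Testing with (key in batch) also avoids creating empty array entries for missing keys.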

awk Joining n fields with delimiter

How can I use awk to join various fields, given that I don't know how many of them I have? For example, given the input string
aaa/bbb/ccc/ddd/eee
I use -F'/' as delimiter, do some manipulation on aaa, bbb, ccc, ddd, eee (altering, removing...) and I want to join it back to print something like
AAA/bbb/ddd/e
Thanks
... given that I don't know how many of them I have?
Ah, but you do know how many you have. Or you will soon, if you keep reading :-)
Before giving you a record to process, awk will set the NF variable to the number of fields in that record, and you can use for loops to process them (comments aren't part of the script, I've just put them there to explain):
$ echo pax/is/a/love/god | awk -F/ '{
gsub (/god/,"dog",$5); # pax,is,a,love,dog
$4 = ""; # pax,is,a,,dog
$6 = $5; # pax,is,a,,dog,dog
$5 = "rabid"; # pax,is,a,,rabid,dog
printf "%s", $1; # output "pax"
for (i = 2; i <= NF; i++) { # output ".<field>"
if ($i != "") { # but only for non-blank fields (skip $4)
printf ".%s", $i;
}
}
printf "\n"; # finish line
}'
pax.is.a.rabid.dog
This shows manipulation of the values, as well as insertion and deletion.
The following will show you how to process each field and do some example manipulations on them.
The only caveat of using the output field separator OFS is that "deleted" fields will still have delimiters as shown in the output below; however it makes the code much simpler if you can live with that.
awk '
BEGIN{FS=OFS="/"}
{
for(i=1;i<=NF;i++){
if($i == "aaa")
$i=toupper($i)
else if($i ~ /c/)
$i=""
else if($i ~ /^eee$/)
$i="e"
}
}1' <<<'aaa/bbb/ccc/ddd/eee'
Output
AAA/bbb//ddd/e
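If the doubled delimiter is not acceptable, the fields can be joined by hand and the empty ones skipped, which gives the AAA/bbb/ddd/e asked for in the question. A minimal sketch, assuming an empty field always means "deleted"; the function name join_nonempty and the three example edits are only for illustration:

```shell
# Hypothetical helper: edit some fields, then rejoin only the non-empty ones with "/".
join_nonempty() {
  awk -F/ '{
    $1 = toupper($1)          # alter a field
    $3 = ""                   # mark a field as deleted
    $5 = substr($5, 1, 1)     # shorten a field
    out = ""; sep = ""
    for (i = 1; i <= NF; i++)
      if ($i != "") { out = out sep $i; sep = "/" }
    print out
  }'
}
```

Running echo aaa/bbb/ccc/ddd/eee | join_nonempty prints AAA/bbb/ddd/e. The cost compared with the OFS version is that fields which were empty in the input are dropped as well.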
This might work for you:
echo "aaa/bbb/ccc/ddd/eee" |
awk 'BEGIN{FS=OFS="/"}{sub(/../,"",$4);NF=4;print}'
aaa/bbb/ccc/d
To delete fields not at the end use a function to shuffle the values:
echo "aaa/bbb/ccc/ddd/eee" |
awk 'func d(n){for(x=n;x<=NF-1;x++){y=x+1;$x=$y}NF--};BEGIN{FS=OFS="/"}{d(2);print}'
aaa/ccc/ddd/eee
Deletes the second field.
awk -F'/' '{ # I would suggest adding the fields to an array, like:
n = split($0, a, "/")
# Now manipulate your elements in the array (e.g. delete a[3]),
# then finally print the ones that are left, in field order:
output = ""; sep = ""
for (i = 1; i <= n; i++)
if (i in a) { output = output sep a[i]; sep = "/" }
print output
}' INPUTFILE
Doing it this way you can delete elements as well. Note deleting an item can be done like delete array[index].

Use Awk to Print every character as its own column?

I am in need of reorganizing a large CSV file. The first column, which is currently a 6 digit number needs to be split up, using commas as the field separator.
For example, I need this:
022250,10:50 AM,274,22,50
022255,11:55 AM,275,22,55
turned into this:
0,2,2,2,5,0,10:50 AM,274,22,50
0,2,2,2,5,5,11:55 AM,275,22,55
Let me know what you think!
Thanks!
It's a lot shorter in perl:
perl -F, -ane '$,=","; print split("",$F[0]), @F[1..$#F]' <file>
Since you don't know perl, a quick explanation. -F, indicates the input field separator is the comma (like awk). -a activates auto-split (into the array @F), -n implicitly wraps the code in a while (<>) { ... } loop, which reads input line-by-line. -e indicates the next argument is the script to run. $, is the output field separator (it gets set on every iteration of the loop this way, but oh well). split has an obvious purpose, and you can see how the array is indexed/sliced. print, when given lists as arguments like this, uses the output field separator and prints all their elements.
In awk:
awk -F, '{n=split($1,a,""); for (i=1;i<=n;i++) {printf("%s,",a[i])}; for (i=2;i<NF;i++) {printf("%s,",$i)}; print $NF}' <file>
I think this might work. The split function (at least in the version I am running) splits the value into individual characters if the third parameter is an empty string.
BEGIN{ FS="," }
{
n = split( $1, a, "" );
for ( i = 1; i <= n; i++ )
printf("%s,", a[i] );
sep = "";
for ( i = 2; i <= NF; i++ )
{
printf( "%s%s", sep, $i );
sep = ",";
}
printf("\n");
}
here's another way in awk
$ awk -F"," '{gsub(".",",&",$1);sub("^,","",$1)}1' OFS="," file
0,2,2,2,5,0,10:50 AM,274,22,50
0,2,2,2,5,5,11:55 AM,275,22,55
Here's a variation on a theme. One thing to note is it prints the remaining fields without using a loop. Another is that since you're looping over the characters in the first field anyway, why not just do it without using the null-delimiter feature of split() (which may not be present in some versions of AWK):
awk -F, 'BEGIN{OFS=","} {len=length($1); for (i=1;i<len; i++) {printf "%s,", substr($1,i,1)}; printf "%s", substr($1,len,1);$1=""; print $0}' filename
As a script:
BEGIN {FS = OFS = ","}
{
len = length($1);
for (i=1; i<len; i++)
{printf "%s,", substr($1, i, 1)};
printf "%s", substr($1, len, 1)
$1 = "";
print $0
}
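Another way to write the same thing, sketched here as an alternative rather than a correction, is to build the comma-separated characters in a variable and assign the result back to $1, so that a plain print with OFS does the rest. The function name explode_first_field and the variable expanded are made up for this example:

```shell
# Hypothetical helper: replace the first CSV field with its characters, comma-separated.
explode_first_field() {
  awk 'BEGIN { FS = OFS = "," }
  {
    expanded = ""
    for (i = 1; i <= length($1); i++)
      expanded = expanded (i > 1 ? OFS : "") substr($1, i, 1)
    $1 = expanded     # assigning to a field rebuilds $0 using OFS
    print
  }'
}
```

This relies only on substr() and length(), so it does not depend on split() accepting an empty separator.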