awk: extract data from a column by name rather than position - awk

I have a text file that is comma delimited. The first line is a list of field names, and subsequent lines contain data. I'll get new versions of the file, and I want to extract all the values from a particular column by name rather than by column number. (I.e. the column I want may be in different positions in different versions of the file.)
For example, here are two files:
foo,bar,interesting,junk
1,2,gold,ramjet
2,25,diamonds,superfluous
and
foo,bar,baz,interesting,junk,morejunk
5,3,smurf,platinum,garbage,scrap
6,2.5,mushroom,sodium,liverwurst,eew
I'd like a single script that will go through multiple files, extracting the minerals in the "interesting" column. :-)
What I've got so far is something that works on ONE file, but I know that awk is more elegant than this. How do I clean this up and make it work on multiple files at once?
BEGIN {
FS=",";
}
NR == 1 {
for(i=1; i<=NF; i++) {
if($i=="interesting") {
col=i;
}
}
}
NR > 1 {
print $col;
}

You're pretty darn close already. Just use FNR instead of NR, for "File NR".
#!/usr/bin/awk -f
BEGIN { FS="," }
FNR==1 {
for (col=1;col<=NF;col++)
if ($col=="interesting")
next
}
{ print $col }
Or if you like:
#!/usr/bin/awk -f
BEGIN { FS="," }
FNR==1 { for (col=1;$col!="interesting";col++); next }
{ print $col }
Or if you prefer one-liners:
$ awk -F, -v txt="interesting" 'FNR==1{for(c=1;$c!=txt;c++);next} {print $c}' file1 file2
Of course, be careful that you actually have the specified column, or you may find yourself in an endless loop. You can probably figure out the extra condition that saves you from that risk.
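The "extra condition" mentioned above could be sketched as follows. This is only one way to do it: the wrapper name extract_col and the skip flag are made up for illustration and are not part of the answer. The loop stops once it runs past the last field, and files that lack the column are skipped with a warning on stderr.

```shell
# Hypothetical wrapper: extract_col NAME FILE...
extract_col() {
    name=$1; shift
    awk -F, -v txt="$name" '
    FNR == 1 {
        # Stop at the matching header, or once c runs past the last field.
        for (c = 1; c <= NF && $c != txt; c++);
        skip = (c > NF)              # true when this file lacks the column
        if (skip)
            print "no column \"" txt "\" in " FILENAME | "cat 1>&2"
        next
    }
    !skip { print $c }' "$@"
}
```

It would be called as extract_col interesting file1 file2.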
Note that in awk, you only need to terminate commands with semicolons if they are followed by another command. Thus, you would do this:
command1; command2
But you can drop the semicolon if you separate commands with newlines:
command1
command2

Do it this way:
$ cat tst.awk
BEGIN { FS=OFS="," }
FNR==1 { for (i=1;i<=NF;i++) f[$i]=i; next }
{ print $(f["interesting"]) }
$ awk -f tst.awk file1 file2
gold
diamonds
platinum
sodium
Creating an array that maps each field name to its field number is always the best approach when it's applicable. It keeps every part of the code simple and decoupled from the rest of the code, and it sets you up for doing other things like changing the order of the fields when you output the results, e.g.:
$ cat tst.awk
BEGIN { FS=OFS="," }
FNR==1 { for (i=1;i<=NF;i++) f[$i]=i; next }
{ print $(f["junk"]), $(f["interesting"]), $(f["bar"]) }
$ awk -f tst.awk file1 file2
ramjet,gold,2
superfluous,diamonds,25
garbage,platinum,3
liverwurst,sodium,2.5
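Building on the name-to-field-number array, the list of wanted columns can also be passed in from the command line. The sketch below is an illustration, not part of the answer: the wrapper pick_cols and the variables cols and want are assumed names. It also clears f[] at each header line so names from an earlier file cannot leak into the next one, and it prints an empty cell for a name that a file does not have (with the original script, a missing name would make $(f[name]) evaluate to $0, the whole line).

```shell
# Hypothetical wrapper: pick_cols NAME1,NAME2,... FILE...
pick_cols() {
    cols=$1; shift
    awk -v cols="$cols" '
    BEGIN { FS = OFS = ","; n = split(cols, want, ",") }
    FNR == 1 {
        split("", f)                          # forget the previous header
        for (i = 1; i <= NF; i++) f[$i] = i   # name -> field number
        next
    }
    {
        out = ""
        for (k = 1; k <= n; k++) {
            # A name missing from this header yields an empty cell.
            val = (want[k] in f) ? $(f[want[k]]) : ""
            out = out (k > 1 ? OFS : "") val
        }
        print out
    }' "$@"
}
```

For the two sample files, pick_cols junk,interesting,bar file1 file2 should give the same four lines as the script above.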

Related

Truncation of strings after running awk script

I have this code
BEGIN { FS=OFS=";" }
{ key = $(NF-1) }
NR == FNR {
for (i=1; i<(NF-1); i++) {
if ( !seen[key,$i]++ ) {
map[key] = (key in map ? map[key] OFS : "") $i
}
}
next
}
{ print $0 map[key] }
I run the script like this
awk -f tst.awk 2.txt 1.txt
I have two text files
1.txt
AA;BB;
2.txt
CC;DD;BB;AA;
I am trying to generate this output in 3.txt
AA;BB;CC;DD;
but that is not possible with this script, because it returns only AA;BB;
Logic: the script above just uses literal strings as array indices in a hash lookup, so it doesn't care what characters are in the input. As for the sample output: if 2.txt contains fields that are also in 1.txt (for example BB;AA;), they need to be merged into a single row, i.e. AA;BB;CC;DD;. Ordering does not matter; BB;AA;DD;CC; would be just as good. The only requirement is to avoid duplicates, which my script already does.
Could you please try the following. As per the OP's comment, both files have only one line, so this uses the paste command to join the two files and then processes the result with awk.
paste -d';' 1.txt 2.txt |
awk '
BEGIN{
FS=OFS=";"
}
{
for(i=1;i<=NF;i++){
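if($i!="" && !seen[$i]++){ val=(val?val OFS:"")$i }   # skip the empty fields left by each trailing ";"
}
print val OFS   # restore the trailing ";" shown in the desired output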
delete seen
val=""
}'
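If calling paste is not convenient, the same de-duplication can be sketched in awk alone. This is an alternative technique, not the answer's method, and the wrapper name merge_unique is made up: it collects every non-empty field it has not seen before, across all lines of all input files, and prints the result once at the end with a trailing ";" as in the desired output.

```shell
# Hypothetical wrapper: merge_unique FILE...
merge_unique() {
    awk '
    BEGIN { FS = OFS = ";" }
    {
        for (i = 1; i <= NF; i++)
            if ($i != "" && !seen[$i]++)   # skip empties and repeats
                out = out $i OFS
    }
    END { print out }' "$@"
}
```

With the sample files, merge_unique 1.txt 2.txt should print AA;BB;CC;DD;.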

Do specific task for some lines, copy all other lines

I have a set of specific lines in a file where I would like to make some changes, and I want to just copy all other lines. I imagine the code should look something like this
awk -v imin=5 -v imax=10 -v shift=5.54545 '{
(NR==5){ print $1+5,$2; }
(NR==7){ print $1+shift,$2; }
((NR>imin)&&(NR<imax)){ print $1,$2,$3+shift; }
(NR == EVERY_OTHER_LINE){ print $0; }
}' input_data.dat
But I don't know how to do this (NR == EVERY_OTHER_LINE), meaning every line except the ones handled above.
The best I found is here, but it is not really what I want.
https://unix.stackexchange.com/questions/563455/awk-print-all-remaining-lines
I would follow the following approach:
(NR==5){ print $1+5,$2; next }
(NR==7){ print $1+shift,$2; next }
((NR>imin) && (NR<imax)){ print $1,$2,$3+shift; next}
1;
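The next statement makes awk skip the remaining rules for a line that has already been handled, so no special line is printed a second time by a later rule.
This is, however, a bit convoluted, so the following method might be better for this particular case (note that here, if a line matches several rules, the last matching rule wins):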
{line=$0}
(NR==5) { line=$1+5 OFS $2 }
(NR==7) { line=$1+shift OFS $2 }
((NR>imin)&&(NR<imax)){ line = $1 OFS $2 OFS $3+shift }
{print line}
Of course, if records 5 and 7 have only 2 fields, and the records between imin and imax (with imin>7) have 3 fields, then it is even easier:
(NR==5){ $1+=5 }
(NR==7){ $1+=shift }
(NR>imin)&&(NR<imax){ $3+=shift }
1
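Put together as a complete command, the first variant (the one using next) might look like the sketch below. The wrapper name adjust_lines and its argument order are assumptions made for illustration; imin, imax and shift are the variable names from the question.

```shell
# Hypothetical wrapper: adjust_lines IMIN IMAX SHIFT FILE
adjust_lines() {
    awk -v imin="$1" -v imax="$2" -v shift="$3" '
    NR == 5                { print $1 + 5, $2; next }
    NR == 7                { print $1 + shift, $2; next }
    NR > imin && NR < imax { print $1, $2, $3 + shift; next }
    1   # every line not handled above is printed unchanged
    ' "$4"
}
```

For the values in the question this would be called as adjust_lines 5 10 5.54545 input_data.dat. Because each special rule ends with next, line 7 is handled only by its own rule even though it also lies between imin and imax.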

Split multiple column with awk

I need to split a file with multiple columns that looks like this:
TCONS_00000001 q1:Ovary1.13|Ovary1.13.1|100|32.599877 q2:Ovary2.16|Ovary2.16.1|100|88.36
TCONS_00000002 q1:Ovary1.19|Ovary1.19.1|100|12.876644 q2:Ovary2.15|Ovary2.15.1|100|365.44
TCONS_00000003 q1:Ovary1.19|Ovary1.19.2|0|0.000000 q2:Ovary2.19|Ovary2.19.1|100|64.567
Output needed:
TCONS_00000001 Ovary1.13.1 32.599877 Ovary2.16.1 88.36
TCONS_00000002 Ovary1.19.1 12.876644 Ovary2.15.1 365.44
TCONS_00000003 Ovary1.19.2 0.000000 Ovary2.19.1 64.567
My attempt:
awk 'BEGIN {OFS=FS="\t"}{split($2,two,"|");split($3,thr,"|");print $1,two[2],two[4],thr[2],thr[4]}' in.file
Problem:
I have many more columns to split like 2 and 3, and I would like to find a shorter solution than splitting every column one by one.
While Sundeep's answer is great, if you plan to apply the same action to a whole set of columns, I suggest putting it in a function and calling it on each column.
I would write the awk script as below
#!/usr/bin/awk -f
function split_args(record) {
n=split(record,split_array,"[:|]")
return (split_array[3]"\t"split_array[n])
}
BEGIN { FS=OFS="\t" }
{
for (i=2;i<=NF;i++) {
$i=split_args($i)
}
print
}
and invoke it as
awk -f script.awk inputfile
An ugly command-line version of it would be
awk 'function split_args(record) {
n=split(record,split_array,"[:|]")
return (split_array[3]"\t"split_array[n])
}
BEGIN { FS=OFS="\t" }
{
for (i=2;i<=NF;i++) {
$i=split_args($i)
}
print
}
' newfile
$ # borrowing simplicity from @Inian's answer ;)
$ awk 'BEGIN{FS=OFS="\t"}
{for(i=2; i<=NF; i++){split($i,a,/[:|]/); $i=a[3] "\t" a[5]}} 1' ip.txt
TCONS_00000001 Ovary1.13.1 32.599877 Ovary2.16.1 88.36
TCONS_00000002 Ovary1.19.1 12.876644 Ovary2.15.1 365.44
TCONS_00000003 Ovary1.19.2 0.000000 Ovary2.19.1 64.567
$ # previous solution which leaves tab character at end
$ awk -F'\t' '{printf "%s\t",$1;
for(i=2; i<=NF; i++){split($i,a,/[:|]/); printf "%s\t%s\t",a[3],a[5]};
print ""}' ip.txt
TCONS_00000001 Ovary1.13.1 32.599877 Ovary2.16.1 88.36
TCONS_00000002 Ovary1.19.1 12.876644 Ovary2.15.1 365.44
TCONS_00000003 Ovary1.19.2 0.000000 Ovary2.19.1 64.567

How to split - awk

I was wondering whether I can split each item on '=' and build two lists, one from the left-hand sides and one from the right-hand sides, with the first list joined by '|' and the second by ','. I have tried but failed because the number of items is not fixed. Even C16 can show up, in which case there will be 16 items in the input.
Can you give me any hint?
Input
C1=34,C2=35,C3="99"
Output
C1|C2|C3#34,35,"99"
You can pass multiple characters as the delimiter when using -F. The command could look like this:
awk -F'[,=]' '{printf "%s|%s|%s#%s,%s,%s\n", $1,$3,$5,$2,$4,$6}' input.txt
I'm using , and = as the delimiter. This makes it simple to access individual values and reassemble them using printf.
If the number of columns is unknown, you need to loop over the columns: first over the odd columns, which are the names, then over the even columns, which are the values. I suggest putting it into a script:
test.awk
BEGIN {
FS="[,=]"
}
{
for(i=1;i<=NF;i+=2){
if(i>=NF-1){
fmt="%s"
}else{
fmt="%s|"
}
printf fmt,$i
}
printf "#"
for(i=2;i<=NF;i+=2){
if(i>=NF-1){
fmt="%s"
}else{
fmt="%s,"
}
printf fmt,$i
}
printf "\n"   # end the output line so each input line yields its own line
}
Then execute it like this:
awk -f test.awk input.txt
awk -F'[=,]' '
{
for (i=1;i<=NF;i+=2) {
printf "%s%s", $i, (i<(NF-1)?"|":"#")
}
for (i=2;i<=NF;i+=2) {
printf "%s%s", $i, (i<NF?",":ORS)
}
}
' file
C1|C2|C3#34,35,"99"
awk '{sub(/C1=34,C2=35,C3="99"/,"C1|C2|C3#34,35,\"99\"")}1' file
C1|C2|C3#34,35,"99"
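The looping answers above walk the fields twice. As a variation (a sketch, not taken from any of the answers; the wrapper name split_pairs is made up), both lists can be built in a single pass by splitting only on ',' and cutting each item at its first '='. Unlike splitting on both characters, this keeps any further '=' inside a value intact.

```shell
# Hypothetical wrapper: split_pairs FILE...
split_pairs() {
    awk -F, '{
        names = vals = ""
        for (i = 1; i <= NF; i++) {
            eq = index($i, "=")                # position of the first "="
            names = names (i > 1 ? "|" : "") substr($i, 1, eq - 1)
            vals  = vals  (i > 1 ? "," : "") substr($i, eq + 1)
        }
        print names "#" vals
    }' "$@"
}
```

It would be called as split_pairs input.txt and works for any number of items per line.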

Convert rows into columns using awk

Not all columns (and data) are present for all records, so whenever fields are missing, they should be replaced with nulls.
My Input format:
.set 1000
EMP_NAME="Rob"
EMP_DES="Developer"
EMP_DEP="Sales"
EMP_DOJ="20-10-2010"
EMR_MGR="Jack"
.set 1001
EMP_NAME="Koster"
EMP_DEP="Promotions"
EMP_DOJ="20-10-2011"
.set 1002
EMP_NAME="Boua"
EMP_DES="TA"
EMR_MGR="James"
My desired output Format:
Rob~Developer~Sales~20-10-2010~Jack
Koster~~Promotions~20-10-2011~
Boua~TA~~~James
I tried the below:
awk 'NR>1{printf "%s"(/^\.set/?RS:"~"),a} {a=substr($0,index($0,"=")+1)} END {print a}' $line
This is printing:
Rob~Developer~Sales~20-10-2010~Jack
Koster~Promotions~20-10-2011~
Boua~TA~James~
This awk script produces the desired output:
BEGIN { FS = "[=\"]+"; OFS = "~" }
/\.set/ { ++records; next }
NR > 1 { f[records,$1] = $2 }
END {
for (i = 1; i <= records; ++i) {
print f[i,"EMP_NAME"], f[i,"EMP_DES"], f[i,"EMP_DEP"], f[i,"EMP_DOJ"], f[i,"EMR_MGR"]
}
}
A two-dimensional array is used to store all of the values that are defined for each record.
After the whole file has been processed, the loop goes through each row of the array and prints all of the values. Elements that are undefined evaluate to an empty string.
Specifying the elements explicitly allows you to control the order in which they are printed. Using print rather than printf allows you to make correct use of the OFS variable, which has been set to ~, as well as ORS, which is a newline character by default.
Thanks to @Ed for his helpful comments that pointed out some flaws in my original script.
Output:
Rob~Developer~Sales~20-10-2010~Jack
Koster~~Promotions~20-10-2011~
Boua~TA~~~James
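The "two-dimensional array" in the script above is really an ordinary awk array whose index is the two subscripts joined by the built-in SUBSEP string. The small demonstration below (the function name subsep_demo is made up) shows that, and also that an element which was never assigned reads as an empty string, which is why missing fields come out as empty cells.

```shell
# Hypothetical demo function; it takes no input.
subsep_demo() {
    awk 'BEGIN {
        f[1, "EMP_NAME"] = "Rob"
        key = 1 SUBSEP "EMP_NAME"        # the index awk actually stores
        found = (key in f)
        print found                      # 1: same element as f[1,"EMP_NAME"]
        print "[" f[2, "EMP_DES"] "]"    # never assigned, so nothing between the brackets
    }'
}
```

Running subsep_demo prints 1 on the first line and [] on the second.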
$ cat tst.awk
BEGIN{ FS="[=\"]+"; OFS="~" }
/\.set/ { ++numRecs; next }
{ name2val[numRecs,$1] = $2 }
!seen[$1]++ { names[++numNames] = $1 }
END {
for (recNr=1; recNr<=numRecs; recNr++)
for (nameNr=1; nameNr<=numNames; nameNr++)
printf "%s%s", name2val[recNr,names[nameNr]], (nameNr<numNames?OFS:ORS)
}
$ awk -f tst.awk file
Rob~Developer~Sales~20-10-2010~Jack
Koster~~Promotions~20-10-2011~
Boua~TA~~~James
If you want a pre-defined order of fields in your output, rather than deriving it on the fly from the order in which the names are first seen in the input, just populate the names[] array explicitly in the BEGIN section. If, in addition, you don't want to hold the whole file in memory, print each record as soon as it is complete:
$ cat tst.awk
BEGIN{
FS="[=\"]+"; OFS="~";
numNames=split("EMP_NAME EMP_DES EMP_DEP EMP_DOJ EMR_MGR",names,/ /)
}
function prtName2val( nameNr, i) {
if ( length(name2val) ) {
for (nameNr=1; nameNr<=numNames; nameNr++)
printf "%s%s", name2val[names[nameNr]], (nameNr<numNames?OFS:ORS)
delete name2val
}
}
/\.set/ { prtName2val(); next }
{ name2val[$1] = $2 }
END { prtName2val() }
$ awk -f tst.awk file
Rob~Developer~Sales~20-10-2010~Jack
Koster~~Promotions~20-10-2011~
Boua~TA~~~James
The above uses GNU awk for length(name2val) and delete name2val. If you don't have that, use for (i in name2val) { do stuff; break } and split("",name2val) instead.
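Spelled out, the portable variant described in the previous sentence might look like this. It is a sketch of that suggestion, wrapped in a shell function whose name rows_to_cols is made up; the local variable any is also an assumption, used to record whether the array holds anything.

```shell
# Hypothetical wrapper: rows_to_cols FILE...
rows_to_cols() {
    awk '
    BEGIN {
        FS = "[=\"]+"; OFS = "~"
        numNames = split("EMP_NAME EMP_DES EMP_DEP EMP_DOJ EMR_MGR", names, " ")
    }
    function prtName2val(   nameNr, i, any) {
        for (i in name2val) { any = 1; break }   # portable "is it non-empty?"
        if (any) {
            for (nameNr = 1; nameNr <= numNames; nameNr++)
                printf "%s%s", name2val[names[nameNr]], (nameNr < numNames ? OFS : ORS)
            split("", name2val)                  # portable way to empty the array
        }
    }
    /^\.set/ { prtName2val(); next }
    { name2val[$1] = $2 }
    END { prtName2val() }' "$@"
}
```

Called as rows_to_cols file, it should produce the same three lines as the GNU awk version.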
This is all I can suggest:
awk '{ t = $0; sub(/^[^"]*"/, "", t); gsub(/"[^"]*"/, "~", t); sub(/".*/, "", t); print t }' file
Or sed:
sed -re 's|^[^"]*"||; s|"[^"]*"|~|g; s|".*||' file
Output:
Rob~Developer~Sales~20-10-2010~Jack~Koster~Promotions~20-10-2011~Boua~TA~James