awk output first two columns then the minimum value out of the third and fourth columns - awk

I have a tab delimited file like so:
col1 col2 col3 col4
a 5 y:3.2 z:5.1
b 7 r:4.1 t:2.2
c 8 e:9.1 u:3.2
d 10 o:5.2 w:1.1
For each row, I want to output the values in the first and second columns, and the smallest number out of the two values in the third and fourth columns.
col1 col2 min
a 5 3.2
b 7 2.2
c 8 3.2
d 10 1.1
My poor attempt:
awk -F'\t' '{min = ($3 < $4) ? $3 : $4; print $1, $2, min}'
One reason it's incorrect is that the values in the third and fourth columns aren't numbers but strings.
I don't know how to extract the number from the third and fourth columns; the number is always after the colon.

awk to the rescue!
$ awk -F'[ *:]' 'NR==1{print $1,$2,"min";next} {print $1,$2, $4<$6?$4:$6}' file
col1 col2 min
a 5 3.2
b 7 2.2
c 8 3.2
d 10 1.1
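Here the colon is treated as just another field separator, so the two numbers land in $4 and $6. If you would rather keep the tab field separator from your attempt, another option (a minimal sketch using awk's split() function) is to split the third and fourth fields on the colon yourself:
awk -F'\t' 'NR==1 { print $1, $2, "min"; next }      # reprint the header with a "min" column
{
  split($3, a, ":")                                  # a[2] is the number after the colon in $3
  split($4, b, ":")                                  # b[2] is the number after the colon in $4
  print $1, $2, (a[2]+0 < b[2]+0 ? a[2] : b[2])      # +0 forces a numeric comparison
}' file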

Related

How do I print starting from a certain row of output with awk? [duplicate]

I have millions of records in my file. What I need to do is print columns 1396 to 1400 for a specific number of rows, and, if possible, get the result into Excel or Notepad.
I tried this command:
awk '{print $1396,$1397,$1398,$1399,$1400}' file_name
But this runs for every row.
You need a condition to specify which rows to apply the action to:
awk '<<condition goes here>> {print $1396,$1397,$1398,$1399,$1400}' file_name
For example, to do this only for rows 50 to 100:
awk 'NR >= 50 && NR <= 100 {print $1396,$1397,$1398,$1399,$1400}' file_name
(Depending on what you want to do, you can also have much more complicated selection patterns than this.)
Here's a simpler example for testing:
awk 'NR >= 3 && NR <= 5 {print $2, $3}'
If I run this on an input file containing
1 2 3 4
2 3 4 5
3 a b 6
4 c d 7
5 e f 8
6 7 8 9
I get the output
a b
c d
e f
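If you also want the result in Excel or Notepad, one option (a sketch; out.csv is a placeholder name) is to write the selected columns as comma-separated values and open the resulting file there:
awk -v OFS=',' 'NR >= 50 && NR <= 100 {print $1396, $1397, $1398, $1399, $1400}' file_name > out.csv   # out.csv is a placeholder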

Non-grep method to remove lines from a file where a string appears in another file

I know that there are a few similar questions to this that have previously been answered, but I haven't managed to find exactly what I want (and have tried variants of proposed solutions). Hopefully this is an easy question.
I have a tab-separated file (file.txt) with 10 columns and about half a million lines, which in simplified form looks like this:
ID Col1 Col2 Col3
a 4 2 8
b 5 6 1
c 8 4 1
d 3 5 9
e 8 5 2
I'd like to remove all the lines where, say, "b" and "d" appear in the first (ID) column. The output that I want is:
ID Col1 Col2 Col3
a 4 2 8
c 8 4 1
e 8 5 2
It is important that the order of the IDs is maintained in my output file.
In reality, there are about 100,000 lines that I want to remove. I therefore have a reference file (referencefile.txt) that lists all the IDs that I want removed from file.txt. In this example, the reference file would simply contain "b" and "d" on successive lines.
I am using grep at the moment, and while it works, it is proving painfully slow.
grep -v -f referencefile.txt file.txt
Is there a way of using awk or sed (or anything else for that matter) to speed up the process?
Many thanks.
AB
Using awk:
awk 'FNR>1 && ($1 == "b" || $1 == "d"){ next } 1' infile
# OR
awk 'FNR>1 && $1 ~ /^([bd])$/{ next } 1' infile
# To exclude lines from infile whose first field
# appears in the id list from id_lists
awk 'FNR==NR{ids[$1];next}FNR>1 && ($1 in ids){next}1' id_lists infile
# To keep only lines from infile whose first field
# appears in the id list from id_lists (plus the header)
awk 'FNR==NR{ids[$1];next}FNR==1 || ($1 in ids)' id_lists infile
Test Results:
Input
$ cat infile
ID Col1 Col2 Col3
a 4 2 8
b 5 6 1
c 8 4 1
d 3 5 9
e 8 5 2
Output
$ awk 'FNR>1 && $1 ~ /^([bd])$/{ next } 1' infile
ID Col1 Col2 Col3
a 4 2 8
c 8 4 1
e 8 5 2
$ awk 'FNR>1 && ($1 == "b" || $1 == "d"){ next } 1' infile
ID Col1 Col2 Col3
a 4 2 8
c 8 4 1
e 8 5 2
but "b" and "d" were for illustrative purposes, and I actually have
about 100,000 IDs that I need to remove. So I want all those IDs
listed in a separate file (referencefile.txt)
If you have a file with the list of IDs, like the one below, then:
To exclude the listed IDs:
$ cat id_lists
a
b
$ awk 'FNR==NR{ids[$1];next}FNR>1 && ($1 in ids){next}1' id_lists infile
ID Col1 Col2 Col3
c 8 4 1
d 3 5 9
e 8 5 2
To include the listed IDs:
$ awk 'FNR==NR{ids[$1];next}FNR==1 || ($1 in ids)' id_lists infile
ID Col1 Col2 Col3
a 4 2 8
b 5 6 1
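Applied to the file names in the question, the exclude variant would be (a sketch; filtered.txt is a placeholder output name):
awk 'FNR==NR{ids[$1];next}FNR>1 && ($1 in ids){next}1' referencefile.txt file.txt > filtered.txt   # filtered.txt is a placeholder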
There are ways of speeding up grep itself.
I'd suggest:
-F: treat the patterns in -f referencefile.txt as fixed strings rather than regexes.
-w: match whole words only.
Possibly LC_ALL=C: use the LC_ALL environment variable to tell grep to use ASCII rather than UTF-8.
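Putting those suggestions together, the command would look something like this (a sketch based on the file names above):
LC_ALL=C grep -v -F -w -f referencefile.txt file.txt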

If two columns from different files are equal, replace the third column with awk

I am looking for a way to replace a column in a file, if two ID columns match.
I have file A.txt
c a b ID
1 0.01 5 1
2 0.1 6 2
3 2 3
and file B.txt
ID a b
1 10 15
2 20 16
3 30 12
4 40 14
The output I'm looking for is:
file A.txt
ID a b
1 0.01 5
2 0.1 6
3 30 2
Using awk, I can find which IDs from the two files match:
awk 'NR==FNR{a[$1];next}$1 in a' B.txt A.txt
But how do I add the replacement? Thank you for any suggestions.
awk solution:
awk 'NR==FNR{ if(NR>1) a[$1]=$2; next }
FNR>1 && $1 in a && NF<3{ f=$2; $2=a[$1]; $3=f }1' B.txt A.txt | column -t
if(NR>1) a[$1]=$2; - captures the column values from file B.txt, skipping the header line (NR>1)
FNR>1 && $1 in a && NF<3 - applies when the IDs match and a line from A.txt has fewer than 3 fields
The output:
ID a b
1 0.01 5
2 0.1 6
3 30 2
Adapted to your new data format
awk '
# Load file b reference
FNR==NR && NR > 1 {ColB[$1]=$2; next}
# treat file A
{
# set the missing field if the ID is known in file B (and not on the 1st line)
if ( NF < 4 && ( $NF in ColB) && FNR > 1) $0 = $NF FS ColB[$NF] FS $2
# print result (in any case)
print
}
# order of the files is mandatory
' B.txt A.txt
Self-documented.
This assumes that it is only the second field that is missing, as in your sample.

Find the ratio among columns

I have some input files of the following format:
File1.txt File2.txt File3.txt
1 2 1 6 1 20
2 3 2 9 2 21
3 7 3 14 3 28
Now I need to output a single new file using awk with three columns: the first column remains the same, and it is the same across the three files (just an ordinal number).
For the 2nd and 3rd columns of this newly created file, however, I need the values of the 2nd column of the second file divided by the values of the 2nd column of the 1st file, and the values of the 2nd column of the third file divided by the values of the 2nd column of the first file. In other words, the 2nd columns of the 2nd and 3rd files divided by the 2nd column of the first file.
e.g.:
Result.txt
1 3 10
2 3 7
3 2 4
Use a multidimensional array to store the values:
awk 'FNR==NR {a[$1]=$2; next}
{b[$1,ARGIND]=$2/a[$1]}
END {for (i in a)
print i,b[i,2],b[i,3]
}' f1 f2 f3
Test
$ awk 'FNR==NR {a[$1]=$2; next} {b[$1,ARGIND]=$2/a[$1]} END {for (i in a) print i,b[i,2],b[i,3]}' f1 f2 f3
1 3 10
2 3 7
3 2 4
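Note that ARGIND is a GNU awk (gawk) extension. For other awks, a minimal sketch that tracks the file index by hand would be:
awk 'FNR==1 {fileno++}                      # count which input file we are reading
     fileno==1 {a[$1]=$2; next}             # first file: remember the divisor
     {b[$1,fileno]=$2/a[$1]}                # later files: store the ratio
     END {for (i in a) print i, b[i,2], b[i,3]}' f1 f2 f3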

print a line from every 5 elements of a column

I am looking for a way to select a column (e.g. the eighth column) of a data file and write the first five numbers of that column in one row, the next five numbers in a second row, and so on.
I have been testing with awk and printf without success.
The awk way to do this is to alternate between OFS and ORS as the output separator, chosen with the modulus operator:
$ seq 1 20 | awk '{printf "%s", $1 (NR % 5 ? OFS : ORS)}'
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
Change $1 to $8 for the eighth column, for example, and NR % 5 to NR % 10 for rows of 10 instead of 5. The seq command just generates a single column of numbers from 1 to 20, used for demonstration.
I also find using xargs useful for this kind of thing:
$ seq 1 20 | awk '{print $1}' | xargs -n5
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
The awk isn't necessary for the example, as seq only produces a single column; for your question, though, change $1 to $8 to select only the eighth column of your input. With this approach you could also swap out awk for cut, as shown below.
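For example, if the input file is tab-separated, cut can pull out the eighth column directly (a sketch; yourfile is a placeholder name):
cut -f8 yourfile | xargs -n5    # cut defaults to tab as its field delimiter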
This will also produce the requested format:
seq 1 20 | awk '{printf("%s ", $1); if (NR % 5 == 0) printf("\n")}'
where $1 indicates the column number, which can be changed when passing a file to the awk command.