Find repeat in one column then subtract value in another column - awk

My input file columns are:
a Otu1 w 4
b Otu1 x 1
c Otu2 y 12424
d Otu3 z 1756
For each repeated value in the second column, I want to subtract the fourth-column values of the later rows from that of the first row. My desired output would be:
a Otu1 w 3
c Otu2 y 12424
d Otu3 z 1756
I have tried the following awk script on a small file with two columns:
a 3
a 1
b 4
awk '$1 in a{print $1, a[$1]-$2} {a[$1]=$2}' small_input_file
which gives me only the difference:
a 2
How can I modify this script for my input file with four columns?
Thanks.

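A double-scan (two-pass) algorithm doesn't care how many records share a key or whether they are consecutive. Note that file{,} is bash brace expansion for "file file", so awk reads the input twice: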
$ awk 'NR==FNR {a[$2]=$2 in a?a[$2]-$4:$4; next}
!b[$2]++ {print $1,$2,$3,a[$2]}' file{,}
a Otu1 w 3
c Otu2 y 12424
d Otu3 z 1756
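The same two-pass idea can be written without the bash-specific file{,} expansion and with the first-pass logic spelled out. This is a minimal sketch, assuming the sample data from the question; the temporary file and the array names diff and seen are illustrative and not part of the original answer.

```shell
# Sample data from the question, written to a temporary file (illustrative).
input=$(mktemp)
printf '%s\n' 'a Otu1 w 4' 'b Otu1 x 1' 'c Otu2 y 12424' 'd Otu3 z 1756' > "$input"

# Pass 1 (NR==FNR): per key, keep the first value minus all later values.
# Pass 2: print the first record of each key with the accumulated result.
two_pass=$(awk '
NR == FNR {
    if ($2 in diff) diff[$2] -= $4
    else            diff[$2]  = $4
    next
}
!seen[$2]++ { print $1, $2, $3, diff[$2] }
' "$input" "$input")

printf '%s\n' "$two_pass"
rm -f "$input"
```

Passing the file name twice is what makes the second pass possible, so this approach does not work on piped input.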

Here is a single pass that outputs in awk default order:
$ awk '{
if($2 in a) # current $2 met before
b[$2]-=$4 # subtract $4
else { # first time meet current $2
a[$2]=$0 # store record to a var
b[$2]=$4 # and $4 to another, key with $2
}
}
END { # after processing
for(i in a) { # iterate all stored records
sub(/[^ ]+$/,b[i],a[i]) # replace the last space separated string with the count
print a[i] # output
}
}' file
The output order appears random because for (i in a) iterates in an unspecified order:
d Otu3 z 1756
a Otu1 w 3
c Otu2 y 12424
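If the input order matters, a single pass can still preserve it by recording the order in which keys first appear. The following is a sketch under that assumption, not the answerer's code; the arrays order, head, and total are hypothetical names.

```shell
# Sample data from the question, written to a temporary file (illustrative).
input=$(mktemp)
printf '%s\n' 'a Otu1 w 4' 'b Otu1 x 1' 'c Otu2 y 12424' 'd Otu3 z 1756' > "$input"

ordered=$(awk '
!($2 in total) {                 # first time this key is seen
    order[++n] = $2              # remember arrival order
    head[$2]   = $1 OFS $2 OFS $3
    total[$2]  = $4
    next
}
{ total[$2] -= $4 }              # later occurrences: subtract $4
END {
    for (i = 1; i <= n; i++)     # replay keys in first-seen order
        print head[order[i]], total[order[i]]
}' "$input")

printf '%s\n' "$ordered"
rm -f "$input"
```

Iterating over a numerically indexed array in the END block avoids relying on the order of for (i in a).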

Related

how to keep newline(s) when selecting a given column with awk

Suppose I have a file like this (disclaimer: this is not fixed; I can have more than 7 rows and more than 4 columns):
R H A 23
S E A 45
T E A 34
U   A 35
Y T A 35
O E A 353
J G B 23
I want the output to contain the second column wherever the third column is A, while keeping the blank (newline or whitespace) where the second column is empty.
output should be:
HEE TE
I tried this:
awk '{if ($3=="A") print $2}' file | awk 'BEGIN{ORS = ""}{print $1}'
But this gives:
HEETE%
Which has a weird % and is missing the space.
You may use this gnu-awk solution using FIELDWIDTHS:
awk 'BEGIN{ FIELDWIDTHS = "1 1 1 1 1 1 *" } $5 == "A" {s = s $3}
END {print s}' file
HEE TE
gawk splits each record using the widths listed in the FIELDWIDTHS variable.
1 1 1 1 1 1 * means that each of the first 6 columns is a single character wide and the remaining text goes into the 7th column. Since there is a space after each value, $2, $4, and $6 each hold a single space, and $1, $3, and $5 hold the values from the input.
$5 == "A" {s = s $3}: Here we check whether $5 is A and, if so, append the value of $3 to a variable s. In the END block we just print s.
Without fixed-width parsing, awk would treat the A in the 4th row as $2.
Alternatively, if we let the trailing space be part of each column value, use:
awk '
BEGIN{ FIELDWIDTHS = "2 2 2 *" }
$3 == "A " {s = s substr($2,1,1)}
END {print s}
' file
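Where gawk's FIELDWIDTHS is not available, the same fixed-position logic can be sketched with substr(), which is part of POSIX awk. This assumes, as in the sample, that every column is one character wide and separated by exactly one space, and that the fourth row has a blank second column ("U", three spaces, "A").

```shell
# Character 3 is the second column, character 5 is the third column.
picked=$(printf '%s\n' 'R H A 23' 'S E A 45' 'T E A 34' 'U   A 35' \
                       'Y T A 35' 'O E A 353' 'J G B 23' |
awk 'substr($0, 5, 1) == "A" { s = s substr($0, 3, 1) } END { print s }')

printf '%s\n' "$picked"
```

Because the blank second column is copied as a space, the gap between the two groups of letters is preserved.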

Compare columns in 2 files, append data for shared items and print the non shared items of the first file

I have found questions similar to mine, but none helped with my specific problem (and I'm not quite sure whether there actually is an easy solution).
I have two files:
file1:
a b c
d e f
g h i
file2:
a b x y z
d e x
f h i
Desired Output:
a b c x y z
d e f x
g h i
So, I want all the rows and columns from file 1 and additionally, if there is a match of the first two columns in file 2, I want to append the rest of those columns (from file 2) to the ones in file 1 and write it in a new file.
I have tried with awk, but so far I have only managed to append the columns for the rows that have a match; the other rows (in my example the "g h i" row) are not printed.
Another issue seems to be that the rows in file 2 do not always have the same number of columns.
Does anyone have an idea how to solve this?
Thank you!
Here is another awk solution:
awk '{k=$1 FS $2}
NR==FNR {sub(k,"",$0); a[k]=$0; next}
k in a {$0 = $0 a[k]}1' file2 file1
a b c x y z
d e f x
g h i
Note the order of the files: file2 must be read first.
Use the following approach:
awk 'FNR==NR{k=$1$2; $1=$2=""; a[k]=$0; next}
{ if($1$2 in a){print $0a[$1$2] } else print $0}' file2 file1 | tr -s ' '
The output:
a b c x y z
d e f x
g h i
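FNR==NR - true only while the first file on the command line (file2) is being read.
k=$1$2; - k is the key of the associative array a, which stores every column of file2 except the first two (those form the key). For example, for the first line of file2 the entry a["ab"] holds "x y z", preceded by the blanks left by the two emptied fields; tr -s ' ' squeezes those at the end. Because $1 and $2 are concatenated without a separator, pairs such as "a bc" and "ab c" would map to the same key; using $1 FS $2 as the key avoids that.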
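The key-and-append logic described above can also be sketched without tr and without regex matching on the key. This is one possible variant, assuming the sample files from the question; the temporary directory and the names extra, rest, and line are illustrative.

```shell
# Sample files from the question (illustrative temporary copies).
dir=$(mktemp -d)
printf '%s\n' 'a b c' 'd e f' 'g h i'     > "$dir/file1"
printf '%s\n' 'a b x y z' 'd e x' 'f h i' > "$dir/file2"

merged=$(awk '
NR == FNR {                          # first file on the command line: file2
    rest = ""
    for (i = 3; i <= NF; i++)        # collect everything after the key columns
        rest = rest OFS $i
    extra[$1 SUBSEP $2] = rest       # SUBSEP keeps "a bc" and "ab c" distinct
    next
}
{                                    # second file: file1
    line = $0
    if (($1 SUBSEP $2) in extra)
        line = line extra[$1 SUBSEP $2]
    print line                       # printed whether or not a match exists
}' "$dir/file2" "$dir/file1")

printf '%s\n' "$merged"
rm -rf "$dir"
```

Printing every line of file1 unconditionally is what keeps the unmatched "g h i" row in the output.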

How to print all columns after matching on key field

How can I join all fields of each row from both files after matching on a key field? How to generalize this one-liner if the number of fields is unknown in f2?
f2:
a 1 2
b 3 4
c 5 6
f3:
10 a x y z
11 g x y z
12 j x y z
observed:
a 10 x y z
a1 10 x y z
Desired:
a 1 2 10 x y z
These are my best attempts but are incorrect:
awk 'FNR==NR{a[$1]=$2;next} ($2 in a) {print a[$2],$0}' f2.txt f3.txt > f4.txt
awk 'FNR==NR{a[$1]=$2$3;next} ($2 in a) {print a[$2],$0}' f2.txt f3.txt > f4.txt
awk 'NR==FNR{a[$1]=$0;next} ($2 in a){print a[$2],$1,$3,$4,$5}' f2.txt f3.txt > f4.txt
The third command above saves the whole line as the value with column 1 as the key; when reading the second file, it checks whether column 2 is in array a and, if it is, prints a[$2] followed by the remaining columns.
A shorter way (the disadvantage of this command is that there is one extra space between 10 and x):
awk 'NR==FNR{a[$1]=$0;next} ($2 in a){second=$2; $2="";print a[second],$0}' f2.txt f3.txt > f4.txt
This replaces $2 of the second file with an empty string and prints the whole line $0.
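The extra space can be avoided by rebuilding the second file's line without its key field instead of blanking that field. A minimal sketch, assuming the sample files from the question; the temporary copies and the names line and rest are illustrative.

```shell
# Sample files from the question (illustrative temporary copies).
dir=$(mktemp -d)
printf '%s\n' 'a 1 2' 'b 3 4' 'c 5 6'              > "$dir/f2.txt"
printf '%s\n' '10 a x y z' '11 g x y z' '12 j x y z' > "$dir/f3.txt"

joined=$(awk '
NR == FNR { line[$1] = $0; next }    # f2: whole line, keyed by column 1
$2 in line {                         # f3: key is column 2
    rest = $1
    for (i = 3; i <= NF; i++)        # copy every field except the key
        rest = rest OFS $i
    print line[$2], rest
}' "$dir/f2.txt" "$dir/f3.txt")

printf '%s\n' "$joined"
rm -rf "$dir"
```

Neither file needs a fixed number of fields, since both sides are handled by whole-line storage or a loop up to NF.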
If your files are sorted on the keys, as in your example, join is the tool for this task:
join -1 1 -2 2 f2.txt f3.txt
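As a quick check of the join approach, the sketch below runs it on temporary copies of the sample files (the directory and file paths are illustrative). It assumes both inputs are already sorted on their join fields, which join requires.

```shell
# Sample files from the question (illustrative temporary copies).
dir=$(mktemp -d)
printf '%s\n' 'a 1 2' 'b 3 4' 'c 5 6'              > "$dir/f2.txt"
printf '%s\n' '10 a x y z' '11 g x y z' '12 j x y z' > "$dir/f3.txt"

# Join field 1 of the first file with field 2 of the second file.
join_out=$(join -1 1 -2 2 "$dir/f2.txt" "$dir/f3.txt")

printf '%s\n' "$join_out"
rm -rf "$dir"
```

join prints the key first, then the remaining fields of the first file, then the remaining fields of the second file.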
@mxttgen31, try:
awk 'FNR==NR{Q=$2;$2="";A[Q]=$0;next} ($1 in A){print $0,A[$1]}' f3 f2
Explanation of above command as follows:
awk 'FNR==NR{ ##### FNR and NR both hold the current line number. The only difference is that,
##### when awk reads multiple files, FNR is RESET at the start of each file, while NR keeps
##### increasing until all files have been read. So this condition is TRUE only while the
##### first Input_file (which is f3 here) is being read.
Q=$2; ##### Save the value of the second field in variable Q.
$2=""; ##### Set the second field to an empty string.
A[Q]=$0; ##### Store the modified line in array A, indexed by Q (the original second field).
next} ##### next (an awk built-in keyword) skips all remaining statements and moves on to the next line.
($1 in A){print $0,A[$1]} ##### Reached only while the second Input_file (f2) is being read: if $1 is an index of array A, print the current line ($0) of f2 followed by the value stored in A for $1.
' f3 f2 ##### The Input_files.

Print the 1st and every nth column of a text file using awk

I have a txt file that contains a total of 10177 columns and approximately 450,000 rows. The information is separated by tabs. I am trying to trim the file down using awk so that it only prints columns 1-3, the 5th, and every 14th column after the fifth one.
My file has a format that looks like:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ... 10177
A B C D E F G H I J K L M N O P Q R S T ...
X Y X Y X Y X Y X Y X Y X Y X Y X Y X Y ...
I am hoping to generate an output txt file (also separated with tab) that contains:
1 2 3 5 18 ...
A B C E R ...
X Y X X Y ...
The current awk code I have looks like (I am using cygwin to use the code):
$ awk -F"\t" '{OFS="\t"} { for(i=5;i<10177;i+=14) printf ($i) }' test2.txt > test3.txt
But the result I am getting shows something like:
123518...ABCER...XYXXY...
When opened with excel program, the results are all mashed into 1 single cell.
In addition, when I try to include code
for (i=0;i<=3;i++) printf "%s ",$i
in the awk to get the first 3 columns, it just prints out the original input document together with the mashed result. I am not familiar with awk, so I am not sure what causes this issue.
Awk field numbers, strings, and array indices all start at 1, not 0, so when you do:
for (i=0;i<=3;i++) printf "%s ",$i
the first iteration prints $0 which is the whole record.
You're on the right track with:
$ awk -F"\t" '{OFS="\t"} { for(i=5;i<10177;i+=14) printf ($i) }' test2.txt > test3.txt
but never call printf with input data as its only argument. printf then treats the data as a format string with no arguments (rather than using a plain "%s" format with your data as the argument), and it will fail cryptically if/when your input data contains formatting characters like %s or %d. So always use printf "%s", $i, never printf $i. Note also that printf adds no separator or newline by itself, which is why your output is mashed together.
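To illustrate the safe form, here is a small sketch with a made-up input line whose first field contains a literal %d. With an explicit "%s" format the data is printed as-is; the "|" separator is only there to make the field boundary visible.

```shell
# The shell printf emits: 100%d<TAB>ok
safe=$(printf '100%%d\tok\n' |
awk -F'\t' '{ printf "%s|%s\n", $1, $2 }')   # data passed as arguments, never as the format

printf '%s\n' "$safe"
```

Passing $1 as the format instead would make awk interpret the %d inside the data, with implementation-dependent results.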
The problem you're having with excel, I would guess, is you're trying to double click on the file and hoping excel knows what to do with it (it won't, unlike if this was a CSV). You can import tab-separated files into excel after it's opened though - google that.
You want something like:
awk '
BEGIN { FS=OFS="\t" }
{
for (i=1; i<=3; i++) {
printf "%s%s", (i>1?OFS:""), $i
}
for (i=5; i<=NF; i+=14) {
printf "%s%s", OFS, $i
}
print ""
}
' file
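A parameterized variant of the same loop can make the starting column and the step adjustable from the command line. This is a sketch, not the answerer's code: the variable names start, step, and line are assumptions, and the 20-column test row is generated only for illustration.

```shell
# Build one tab-separated row holding the numbers 1..20, then trim it.
trimmed=$(awk 'BEGIN { for (i = 1; i <= 20; i++) printf "%s%s", i, (i < 20 ? "\t" : "\n") }' |
awk -v start=5 -v step=14 '
BEGIN { FS = OFS = "\t" }
{
    line = $1 OFS $2 OFS $3              # columns 1-3
    for (i = start; i <= NF; i += step)  # then start, start+step, ...
        line = line OFS $i
    print line
}')

printf '%s\n' "$trimmed"
```

Building the line in a variable and printing it once keeps the separators and the trailing newline in one place.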
I highly recommend the book Effective Awk Programming, 4th Edition, by Arnold Robbins.
In awk, using the conditional operator in the increment of the for loop:
$ awk 'BEGIN { FS=OFS="\t" }
{
for(i=1; i<=NF; i+=( i<3 ? 1 : ( i==3 ? 2 : 14 )))
printf "%s%s", $i, ( (i+14)>NF ? ORS : OFS)
}' file
1 2 3 5 19
A B C E S
X Y X X X
In the for if i<3 increment by one, if i==3 increment by two to get to 5 and after that by 14.
I would be tempted to solve the problem along the following lines, generating the list of fields in the shell. I think you'll find you save time by not iterating in awk.
$ cols="$( { printf '%s\n' 1 2 3; seq 5 14 10177; } | sed 's/^/$/' | paste -sd, - )"
$ awk -F'\t' -v OFS='\t' "{print $cols}" test.txt

awk keep only the first and last value of a comma-separated field

Hi, I am trying to keep only the first and last value of a comma-separated field in my data. My input data looks like this:
a 1 y 1,2,4,3,6,2,1
b 2 y 3,56,3,2,1
c 3 n 4,3,2,1,4
I just want to keep the first and last value in the 4th column of my data, so that it looks like this:
a 1 y 1,1
b 2 y 3,1
c 3 n 4,4
Can anyone help me with how to do this? Thank you
Try this:
awk -F, -vOFS=, '{print $1,$NF}' input.txt
-F, input field separator
-vOFS=, output field separator
$1 the 1st field
$NF the last field
Try this awk command:
awk '{size = split($4,numbers,",")} {print $1" "$2" "$3" "numbers[1]","numbers[size]}'
This splits the fourth field into an array, saves the size as size, prints the first 3 fields, then the first and last elements of the numbers array.
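The split-based idea can also rewrite the fourth field in place, so the other fields never need to be listed. A minimal sketch, assuming whitespace-separated input as in the question; the extra row "d 4 n 7" is made up to show that a single value is kept as both first and last.

```shell
trimmed_field=$(printf '%s\n' 'a 1 y 1,2,4,3,6,2,1' 'b 2 y 3,56,3,2,1' \
                              'c 3 n 4,3,2,1,4' 'd 4 n 7' |
awk '{
    n = split($4, v, ",")     # n is the number of comma-separated values
    $4 = v[1] "," v[n]        # keep only the first and the last
} 1')                         # the bare 1 prints the rebuilt record

printf '%s\n' "$trimmed_field"
```

Assigning to $4 makes awk rebuild the record with OFS, so runs of spaces or tabs in the input become single spaces in the output.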
awk -F, '{ printf "%s,%s\n", $1, $NF}' should do the job!
If your other fields can contain commas:
$ awk '{sub(/,.*,/,",",$NF)}1' file
a 1 y 1,1
b 2 y 3,1
c 3 n 4,4
If not:
$ awk '{sub(/,.*,/,",")}1' file
a 1 y 1,1
b 2 y 3,1
c 3 n 4,4