Replace strings in a column with matching values from another file using awk

I'm having a little issue trying to use awk to replace some strings in a column using another file as a reference for the replacement.
I want the strings in the third column of my File2 to be replaced by the strings in the second column of File1 when they match the string of the first column of File1.
Here are the files and the desired outcome to be more clear.
File1
AAA XZA
AAB XSZ
AAC XWQ
BAA XCD
File2
ADZ-4 128720 AAA 451351351 5135 jhgt 215
SZQ-2 036051 AAB 55654 grt
KFD-9 036266 AAC
ODS-10 036267 AAA 57321
POS-11 036268 AAC 8435435 764 frd
desired output :
ADZ-4 128720 XZA 451351351 5135 jhgt 215
SZQ-2 036051 XSZ 55654 grt
KFD-9 036266 XWQ
ODS-10 036267 XZA 57321
POS-11 036268 XWQ 8435435 764 frd
I tried the following command line.
awk 'FNR==NR{a[$1]=$2;next} {if ($3 in a){$3=a[1]}; print $0}' File1 File2
but I'm pretty sure I'm doing something wrong in the second pair of curly braces, since it prints out a file with the third column removed.
If I only had a few, I would happily use sed, but I have 500+ substitutions to do...
Any help would be appreciated and if you can explain so I can learn from my mistake, I would be immensely grateful.

You didn't reference the associative array in the right way. Please change:
...{if ($3 in a){$3=a[1]}; print $0...
into:
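...{if ($3 in a){$3=a[$3]}; print $0...
The keys of your array a are AAA, AAB, ... rather than 1, 2, 3, ..., so a[1] does not exist and evaluates to an empty string. That is why the third column came out blank.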
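For reference, here is the corrected command run end to end against the sample files from the question. This is only a sketch: the scratch directory and the `result` variable are my own scaffolding, not part of the original command. Note that assigning to $3 makes awk rebuild the line with single spaces, which is harmless here because the input is already single-space separated.

```shell
# Recreate the sample files from the question in a scratch directory.
tmp=$(mktemp -d)
cat > "$tmp/File1" <<'EOF'
AAA XZA
AAB XSZ
AAC XWQ
BAA XCD
EOF
cat > "$tmp/File2" <<'EOF'
ADZ-4 128720 AAA 451351351 5135 jhgt 215
SZQ-2 036051 AAB 55654 grt
KFD-9 036266 AAC
ODS-10 036267 AAA 57321
POS-11 036268 AAC 8435435 764 frd
EOF

# First file: build the lookup table a[key] = replacement.
# Second file: swap $3 only when it is a key of a, then print every line.
result=$(awk 'FNR==NR { a[$1] = $2; next }
              $3 in a { $3 = a[$3] }
              { print }' "$tmp/File1" "$tmp/File2")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Lines of File2 whose third column has no match in File1 are printed unchanged.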

Related

How can I remove a string after a specific character ONLY in a column/field in awk or bash?

I have a file with tab-delimited fields (or columns) like this one below:
cat abc_table.txt
a b c
1 11;qqw 213
2 22 222
3 333;rs2 83838
I would like to remove everything after the ";" on only the second field.
I have tried with
awk 'BEGIN{FS=OFS="\t"} NR>=1 && sub (/;[*]/,"",$2){print $0}' abc_table.txt
but it does not seem to work.
I also tried with sed:
sed 's/;.*//g' abc_table.txt
but it erases also the strings in the third field:
a b c
1 11
2 22 222
3 333
The desired output is:
a b c
1 11 213
2 22 222
3 333 83838
If someone could help me, I would be very grateful!
You need to simply correct your regex.
awk '{sub(/;.*/,"",$2)} 1' Input_file
If your Input_file is tab-delimited, try:
awk 'BEGIN{FS=OFS="\t"} {sub(/;.*/,"",$2)} 1' Input_file
The problem in the OP's regex: ;[*] looks for a ; followed by a literal * character in the 2nd field, which never occurs, so nothing is substituted. Using ;.* instead matches from the first ; to the end of the 2nd field, and sub() replaces that match with an empty string.
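To see the difference between the two patterns side by side, here is a small sketch using the sample table from the question. The variable names `data`, `broken` and `fixed` are mine, and the table is built with printf so the tabs are explicit.

```shell
# Tab-delimited sample from the question.
data=$(printf 'a\tb\tc\n1\t11;qqw\t213\n2\t22\t222\n3\t333;rs2\t83838\n')

# ;[*] needs a literal "*" right after the ";", so nothing is substituted.
broken=$(printf '%s\n' "$data" | awk 'BEGIN{FS=OFS="\t"} {sub(/;[*]/,"",$2)} 1')

# ;.* matches the ";" and everything after it, but only inside field 2.
fixed=$(printf '%s\n' "$data" | awk 'BEGIN{FS=OFS="\t"} {sub(/;.*/,"",$2)} 1')

printf '%s\n' "$fixed"
```

With the first pattern the table comes back unchanged; with the second, only the second column is trimmed and the third column is left alone.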
An alternative solution using gnu sed:
sed -E 's/(^[^\t]*\t+[^;]*);[^\t]*/\1/' file
a b c
1 11 213
2 22 222
3 333 83838
This might work for you (GNU sed):
sed 's/[^\t]*/&\n/2;s/;[^\t]*\n//;s/\n//' file
Append a unique marker e.g. newline, to the end of field 2.
Remove everything from the first ; in that field up to and including the newline marker (the match cannot cross a tab, so it stays inside field 2).
Remove the newline if any.
N.B. This method can be extended for selective or all fields e.g. same removal but for the first and third fields:
sed 's/[^\t]*/&\n/1;s//&\n/3;s/;[^\t]*\n//g;s/\n//g' file
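For comparison, the same selective removal could probably be done in awk by looping over a list of field numbers. This is a sketch under my own assumptions: `cols` is a hypothetical variable holding the comma-separated fields to clean, and the one-line input is made up for illustration.

```shell
# Strip ";..." from fields 1 and 3 only; field 2 is left untouched.
result=$(printf 'x;1\ty;2\tz;3\n' |
  awk -v cols='1,3' 'BEGIN { FS = OFS = "\t"; n = split(cols, c, ",") }
       { for (k = 1; k <= n; k++) sub(/;.*/, "", $(c[k])) }
       1')

printf '%s\n' "$result"
```

Changing `cols` to '2' gives the single-field behaviour asked for in the question.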

How to replace substrings in a column of file with strings from another file using awk?

I have two files, and want to use awk to replace a substring in one column of a file with the matching string from another file.
f1:
1a1 aaa 777
3_3 ccc 6b6
3.3 ddd 666
f2:
b5g9aaa8y
5_6ccc9.
output:
1a1 b5g9aaa8y 777
3_3 5_6ccc9. 6b6
I think I can do this within two steps:
make an intersection dictionary file mapping each substring to its string
use awk (sub) to finish it
However, is there a one-line awk command that checks whether the substring occurs in a string and then does the replacement?
Sorry, I should have explained it more clearly.
The string format and length in file2 are not fixed.
file1 and file2 do not have the same number of records. file2 is a subset of file1; only the strings in file2 need to be output.
Assume there are no multiple hits.
EDIT2: Since the OP has changed the samples and added the complete conditions, I am adding this solution.
awk 'FNR==NR{a[$2]=$1;b[$2]=$3;next} {for(i in a){if(index($0,i)){print a[i],$0,b[i];delete a[i];break}}}' Input_file1 Input_file2
Or, the same solution in non-one-liner form:
awk '
FNR==NR{
  a[$2]=$1
  b[$2]=$3
  next
}
{
  for(i in a){
    if(index($0,i)){
      print a[i],$0,b[i]
      delete a[i]
      break
    }
  }
}' Input_file1 Input_file2
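As a quick check, here is that approach run against the f1 and f2 samples from the question. The scratch files and the `result` variable are my own scaffolding, so treat it as a sketch. index() does a plain substring test, so characters such as . or _ in the keys are not interpreted as regex metacharacters.

```shell
tmp=$(mktemp -d)
printf '%s\n' '1a1 aaa 777' '3_3 ccc 6b6' '3.3 ddd 666' > "$tmp/f1"
printf '%s\n' 'b5g9aaa8y' '5_6ccc9.' > "$tmp/f2"

result=$(awk '
  FNR==NR { a[$2] = $1; b[$2] = $3; next }   # f1: key both arrays on column 2
  {
    for (i in a)
      if (index($0, i)) {        # does this f2 line contain key i?
        print a[i], $0, b[i]     # f1 column 1, whole f2 line, f1 column 3
        delete a[i]              # each key is used at most once
        break
      }
  }' "$tmp/f1" "$tmp/f2")

printf '%s\n' "$result"
rm -rf "$tmp"
```

The ddd record never appears in the output because no line of f2 contains it, which matches the requirement that only the strings in file2 are printed.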
EDIT: As per @sjsam's comment, in case the range for substr may vary, then going by the samples provided one could try the following too. It assumes that only the alphabetic characters of each line of Input_file2 should be used as the index, with everything else removed (which the OP confirmed in the comments too).
awk 'FNR==NR{val=$0;gsub(/[^a-zA-Z]+/,"");a[$0]=val;next} {$2=$2 in a?a[$2]:$2} 1' Input_file2 Input_file1
Could you please try the following.
awk 'FNR==NR{a[substr($0,3,3)]=$0;next} {$2=$2 in a?a[$2]:$2} 1' Input_file2 Input_file1
Output will be as follows.
111 33aaa8 777
333 56ccc9 666

unix - compare columns of two files

I have two files. First file is masterlist of IDS. Second file is normal input file.
I'm trying to print only the records of the input whose ID (column 3) is NOT in the masterlist (column 1).
sample:
masterlist.txt
111
222
333
input.txt
col1,col2,col3,col4,col5,col6
abc,abc,111,xyz,xyz,xyz
abc,abc,222,xyz,xyz,xyz
abc,abc,333,xyz,xyz,xyz
abc,abc,444,xyz,xyz,xyz
desired output:
col3,col4,col5,col6
abc,abc,444,xyz,xyz,xyz
I have come up with this code so far but I'm not getting the correct output.
awk -F\| '!b{a[$0]; next}$3 in a {true; next} {print $3","$4","$11","$12}' masterlist.txt b=1 input.txt
Could you please try the following awk and let us know if it helps you.
awk 'FNR==NR{a[$1];next} !($3 in a)' masterlist.txt FS="," input.txt
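Here is a sketch of that command run on the sample files, with scratch files and a `result` variable added by me. The FS="," placed between the two file names only takes effect when awk starts reading input.txt, so masterlist.txt is still split on whitespace.

```shell
tmp=$(mktemp -d)
printf '%s\n' 111 222 333 > "$tmp/masterlist.txt"
printf '%s\n' \
  'col1,col2,col3,col4,col5,col6' \
  'abc,abc,111,xyz,xyz,xyz' \
  'abc,abc,222,xyz,xyz,xyz' \
  'abc,abc,333,xyz,xyz,xyz' \
  'abc,abc,444,xyz,xyz,xyz' > "$tmp/input.txt"

# Pass 1 stores every master ID as a key; pass 2 prints lines whose $3 is not a key.
result=$(awk 'FNR==NR { a[$1]; next } !($3 in a)' \
  "$tmp/masterlist.txt" FS="," "$tmp/input.txt")

printf '%s\n' "$result"
rm -rf "$tmp"
```

The header line is printed unchanged, because its third field (col3) is not in the master list either.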

Print rows that has numbers in it

This is my data; I have more than 1000 rows. How can I get only the records whose Num column contains nothing but digits?
Records | Num
123 | 7 Y1 91
7834 | 7PQ34-102
AB12AC|87 BWE 67
5690278| 80505312
7ER| 998
Output has to be
7ER| 998
5690278| 80505312
I'm new to Linux programming, so any help would be very useful to me. Thanks all.
I would use awk:
awk -F'[[:space:]]*[|][[:space:]]*' '$2 ~ /^[[:digit:]]+$/' file
If you want to print the number of lines deleted as you've been asking in comments, you may use this:
awk -F'[[:space:]]*[|][[:space:]]*' '
{
if($2~/^[[:digit:]]+$/){print}else{c++}
}
END{printf "%d lines deleted\n", c}' file
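A sketch of the counting variant on the sample data from the question follows. I have written it with plain [ \t] and [0-9] bracket expressions instead of the POSIX classes, on the assumption that some older awk builds do not support [[:space:]] and [[:digit:]]. The `dropped` counter and the shell variables are my own names.

```shell
data='Records | Num
123 | 7 Y1 91
7834 | 7PQ34-102
AB12AC|87 BWE 67
5690278| 80505312
7ER| 998'

result=$(printf '%s\n' "$data" |
  awk -F'[ \t]*[|][ \t]*' '
    $2 ~ /^[0-9]+$/ { print; next }   # field 2 is digits only: keep the line
    { dropped++ }                     # anything else, header included, is counted
    END { printf "%d lines deleted\n", dropped }')

printf '%s\n' "$result"
```

The header row is counted among the deleted lines, since Num is not a number. Trailing whitespace after the second field would also make a line fail the test.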
A short and simple GNU awk (gawk) script to filter lines with numbers in the second column (field), assuming a one-word field (e.g. 1234, or 12AB):
awk -F'|' '$2 ~ /\y[0-9]+\y/' file
We use the GNU extension for regexp operators, i.e. \y for matching the word boundary. Other than that, pretty straightforward: we split fields on | and look for isolated digits in the second field.
Edit: Since the question has been updated, and now explicitly allows for multiple words in the second field (e.g. 12 AB, 12-34, 12 34), to get lines with numbers and separators only in the second field:
awk -F'|' '$2 ~ /^[- 0-9]+$/' file
Alternatively, if we say only letters are forbidden in the second field, we can use:
awk -F'|' '$2 ~ /^[^a-zA-Z]+$/' file

awk to handle unformatted input

I would like to know how to handle the situation below: the sample input is delimited by spaces and I want to format it as comma-separated output.
All the text in a line up until the first field starting with a digit should be considered as a single field in the output. In the sample data, there are always 3 numeric fields at the end of a line; in the real data, there are 14 such fields.
Input.txt
mmm 4394850 4465411 2579770
xxx yyy 2155419 2178791 1516446
aaa bbb (incl. ccc) 14291585 14438704 6106341
U.U.(W) 6789781 6882021 5940226
nnn 7335050 7534302 2963345
Have tried the command below, but I know it is incomplete:
awk 'BEGIN {FS =" "; OFS = ","} {print $1,$2,$3,$4,$5,$6} ' Input.txt
Desired output:
mmm,4394850,4465411,2579770
xxx yyy,2155419,2178791,1516446
aaa bbb (incl. ccc),14291585,14438704,6106341
U.U.(W),6789781,6882021,5940226
nnn,7335050,7534302,2963345
With GNU awk for gensub():
$ awk '{match($0,/[0-9 ]+$/); print substr($0,1,RSTART-1) gensub(/ /,",","g",substr($0,RSTART,RLENGTH))}' file
mmm,4394850,4465411,2579770
xxx yyy,2155419,2178791,1516446
aaa bbb (incl. ccc),14291585,14438704,6106341
U.U.(W),6789781,6882021,5940226
nnn,7335050,7534302,2963345
With other awks, save the 2nd substr() output in a variable and use gsub():
awk '{match($0,/[0-9 ]+$/); digs=substr($0,RSTART,RLENGTH); gsub(/ /,",",digs); print substr($0,1,RSTART-1) digs}' file
Assuming that it's the last 3 columns that are numerical (as in your example):
awk '{for(i=1;i<=NF;++i)printf "%s%s",$i,(i<NF-3?OFS:(i<NF?",":ORS))}' file
Basically print each field followed by a space, comma or newline depending on the field number.
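Since the question mentions that the real data has 14 numeric fields rather than 3, the count can be passed in as a variable. This is a sketch: `n` is a name I introduced, set to 3 here to match the sample data.

```shell
data='mmm 4394850 4465411 2579770
xxx yyy 2155419 2178791 1516446
aaa bbb (incl. ccc) 14291585 14438704 6106341
U.U.(W) 6789781 6882021 5940226
nnn 7335050 7534302 2963345'

# Fields before the last n+1 are followed by a space (OFS), the label's last
# word and the numeric fields by a comma, and the final field by a newline (ORS).
result=$(printf '%s\n' "$data" |
  awk -v n=3 '{
    for (i = 1; i <= NF; ++i)
      printf "%s%s", $i, (i < NF - n ? OFS : (i < NF ? "," : ORS))
  }')

printf '%s\n' "$result"
```

For the real data this would be run with -v n=14. Runs of several spaces inside the text part are collapsed to one, because the line is rebuilt field by field.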
Another awk (GNU awk again, since it uses gensub()):
awk '$0=gensub(/ ([0-9]+)/,",\\1","g")' file
mmm,4394850,4465411,2579770
xxx yyy,2155419,2178791,1516446
aaa bbb (incl. ccc),14291585,14438704,6106341
U.U.(W),6789781,6882021,5940226
nnn,7335050,7534302,2963345