Using AWK for best match replace

I have two files:
operators.txt # includes Country_code and Country_name
49 Germany
43 Austria
32 Belgium
33 France
traffic.txt # MSISDN and VLR_address (includes Country_code prefix)
123456789 491234567
123456788 432569874
123456787 333256987
123456789 431238523
I need to replace the VLR_address in traffic.txt file with Country_name from the first file.
The following awk command does that:
awk 'NR==FNR{a[$1]=$2;next} {print $1,a[$2]}' <(cat operators.txt) <(cat traffic.txt|awk '{print $1,substr($2,1,2)}')
123456789 Germany
123456788 Austria
123456787 France
123456789 Austria
but how to do it in case the operators file is:
49 Germany
43 Austria
32 Belgium
33 France
355 Albania
1246 Barbados
1 USA
when Country_code is not fixed length and in some cases the best (longest) match should apply, e.g.
124612345 shall be Barbados
122018523 shall be USA

The sample input/output you provided isn't adequate to test with, as it doesn't include the cases you later described as problematic. But if we modify it to include a representation of those later statements:
$ head operators.txt traffic.txt
==> operators.txt <==
49 Germany
43 Austria
32 Belgium
33 France
1 USA
355 Albania
1246 Barbados
==> traffic.txt <==
123456789 491234567
123456788 432569874
123456787 333256987
123456789 431238523
foo 124612345
bar 122018523
then this may be what you want:
$ cat tst.sh
#!/usr/bin/env bash
awk '
NR==FNR {                          # first file: operators, pre-sorted longest-code-first
    keys[++numKeys] = $1
    map[$1] = $2
    next
}
{
    for (keyNr=1; keyNr<=numKeys; keyNr++) {
        key = keys[keyNr]
        if ( index($2,key) == 1 ) {    # does the VLR_address start with this code?
            $2 = map[key]
            break
        }
    }
    print
}
' <(sort -k1,1rn operators.txt) traffic.txt
$ ./tst.sh
123456789 Germany
123456788 Austria
123456787 France
123456789 Austria
foo Barbados
bar USA
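The reverse numeric sort is what makes the first hit the best hit: whenever one code is a prefix of another, the longer code is numerically larger (assuming no code has a leading zero), so it is tried first:
$ sort -k1,1rn operators.txt
1246 Barbados
355 Albania
49 Germany
43 Austria
33 France
32 Belgium
1 USA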

You obviously need to try a substring of the correct length; and because for (key in array) visits keys in no particular order, keep the longest match instead of stopping at the first one.
awk 'NR==FNR { a[$2] = $1; next }
{ best = 0
  for (name in a) {
    p = a[name]
    if (length(p) > best && $2 ~ "^" p) { best = length(p); hit = name }
  }
  if (best) $2 = hit
} 1' operators.txt traffic.txt
Notice how Awk itself is perfectly capable of reading files without the help of cat. You also nearly never need to pipe one Awk script into another; just refactor to put all the logic in one script.
I swapped the key and the value in the NR==FNR block, but that is more of a stylistic change.
And, as always, the final 1 is a shorthand idiom for printing all lines.
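For example, all three of these print every line of a file; the last two let Awk open the file itself, and the last uses the bare 1 idiom:
$ cat file | awk '{ print }'
$ awk '{ print }' file
$ awk 1 file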
Perhaps as an optimization, pull the prefixes into a regular expression so that you can simply match on them all in one go, instead of looping over them.
awk 'NR==FNR { a[$1]=$2; regex = regex "|" $1; next }
FNR == 1 { regex = "^(" substr(regex, 2) ")" }   # trim the leading "|"
match($2, regex) { $2 = a[substr($2, 1, RLENGTH)] } 1' operators.txt traffic.txt
The use of match() to pull out the length of the matched substring is arguably a complication; I wish Awk would provide this information for a normal regex match without a separate dedicated function. Note that this relies on POSIX leftmost-longest alternation: ^(1|1246) prefers the longer alternative, which is exactly the best-match behaviour you want here.
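For example (GNU awk; because alternation is leftmost-longest, the four-character prefix wins over the one-character one):
$ gawk 'BEGIN { if (match("124612345", /^(1|355|1246)/)) print RSTART, RLENGTH }'
1 4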

Extract text between patterns in new files

I'm trying to analyze a file with the following structure:
AAAAA
123
456
789
AAAAA
555
777
999
777
The idea is to detect the 'AAAAA' pattern and extract the two following lines. After this is done, I would like to append the next 'AAAAA' pattern and its two following lines, so the final file will look something like this:
AAAAA
123
456
AAAAA
555
777
Taking into account that the last block will not end with the 'AAAAA' pattern.
Any idea how this can be done? I've used sed but I don't know how to select the number of lines to be retained after the pattern...
For example with AWK:
awk '/'$AAAAA'/,/'$AAAAA'/' INPUTFILE.txt
But this will only extract all the text between the two AAAAA.
Thanks
With sed
sed -n '/AAAAA/{N;N;p}' file.txt
with smart counters (a match arms a 3-line countdown; the pattern n&&n-- stays true, printing the current line, until the counter reaches zero)
$ awk '/AAAAA/{n=3} n&&n--' file
AAAAA
123
456
AAAAA
555
777
The grep command has a flag that prints lines after each match. For example:
grep -A 2 AAAAA file
Unless I misunderstood, this should match your requirements, and is much simpler than the awk scripts. (Note that GNU grep prints a -- separator line between non-adjacent groups of matches; --no-group-separator suppresses it.)
You may try this awk, which remembers the line number of the match plus two and prints while NR is inside that window:
awk '$1 == "AAAAA" {n = NR+2} NR <= n' file
AAAAA
123
456
AAAAA
555
777
just cheat:
gawk 'BEGIN { FS = "AAAAA\n"; RS = "^$" } END { for (x = 2; x <= NF; x++) print $x }' file
No one says fields must be split by spaces, or that records must be single lines. Setting RS = "^$" (a GNU awk idiom; mawk accepts a regex RS too) reads the whole file as one record, and by design of FS every field after $1 then contains one block of the lines that followed a marker.
In this example, $2 will hold 12 bytes, like this:
1 2 3 \n 4 5 6 \n 7 8 9 \n # spaced out for readability
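A quick run against the sample input (each field keeps its trailing newline, so print leaves a blank line between blocks):
$ gawk 'BEGIN { FS = "AAAAA\n"; RS = "^$" } END { for (x = 2; x <= NF; x++) print $x }' file
123
456
789

555
777
999
777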

count, groupby with sed, or awk

I want to perform two different sort-and-count operations on a file, based on each line's content.
I need to take the first column of a .tsv file.
I would like to group lines that start with three digits, keeping only the first three digits; for everything else, just sort and count occurrences of the whole value in the first column.
Sample data:
687/878 9
890987 4
01a 55
1b 8743917
890a 34
abcdee 987
dfeqfe fkdjald
890897 34213
6878853 834
32fasd 53891
abcdee 8794371
abd 873
result:
687 2
890 3
01a 1
1b 1
32fasd 1
abd 1
dfeqfe 1
abcdee 2
I would also appreciate a solution that takes into account a sample input like
687/878 9
890987 4
01a 55
1b 8743917
890a 34
abcdee 987
dfeqfe 545
890897 34213
6878853 834
(632)fasd 53891
(88)abcdee 8794371
abd 873
so the first column may have values containing (, ), #, ', all kinds of characters.
The output will have two columns: the first with the values extracted, and the second with the new count of those values extracted from the source file.
Again, the preferred output format is tsv.
So I need to extract all values that start with
^\d\d\d and, for those first three digits, sort and count unique values;
then, in a second pass, do the same for each line that does not start with 3 digits, but this time keep the whole column value and sort and count by it.
What I have tried:
| sort | uniq -c | sort -nr for the lines that do start with ^\d\d\d, and
the same for those that do not match that regex. But is there a more elegant way using either sed or awk?
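For reference, the two pipelines you describe could look like this (just a sketch, assuming the first column is tab-separated and the input is in file; note that uniq -c puts the count first, so the columns still need swapping afterwards):
$ cut -f1 file | grep -E '^[0-9]{3}' | cut -c1-3 | sort | uniq -c | sort -nr
$ cut -f1 file | grep -Ev '^[0-9]{3}' | sort | uniq -c | sort -nr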
$ cat tst.awk
BEGIN { FS=OFS="\t" }
{ cnt[/^[0-9]{3}/ ? substr($1,1,3) : $1]++ }
END {
    for (key in cnt) {
        print (key !~ /^[0-9]{3}/), cnt[key], key, cnt[key]
    }
}
$ awk -f tst.awk file | sort -k1,2n | cut -f3-
687 1
890 2
abcdee 1
You can try Perl:
$ cat nefijaka.txt
687 878 9
890987 4
890a 34
abcdee 987
$ perl -lne ' /^(\d{3})|(\S+)/; $x=$1?$1:$2; $kv{$x}++; END { print "$_\t$kv{$_}" for (sort keys %kv) } ' nefijaka.txt
687 1
890 2
abcdee 1
$
You can pipe it to sort to get the values sorted:
$ perl -lne ' /^(\d{3})|(\S+)/; $x=$1?$1:$2; $kv{$x}++; END { print "$_\t$kv{$_}" for (sort keys %kv) } ' nefijaka.txt | sort -k2 -nr
890 2
abcdee 1
687 1
EDIT1:
$ cat nefijaka.txt2
687 878 9
890987 4
890a 34
abcdee 987
a word and then 23
$ perl -lne ' /^(\d{3})|(.+?\t)/; $x=$1?$1:$2; $x=~s/\t//g; $kv{$x}++; END { print "$_\t$kv{$_}" for (sort keys %kv) } ' nefijaka.txt2
687 1
890 2
a word and then 1
abcdee 1
$

Sed replace nth column of multiple tsv files without header

Here are multiple tsv files, where I want to add 'XX' characters only in the second column (everywhere except in the header) and save it to this same file.
Input:
$ ls
file1.tsv file2.tsv file3.tsv
$ head -n 4 file1.tsv
a b c
James England 25
Brian France 41
Maria France 18
Output wanted:
a b c
James X1_England 25
Brian X1_France 41
Maria X1_France 18
I tried this, but the result is not kept in the file, and a simple redirection won't work:
# this works, but doesn't save the changes
i=1
for f in *tsv
do awk '{if (NR!=1) print $2}' $f | sed "s|^|X${i}_|"
i=$((i+1))
done
# adding '-i' option to sed: this throws an error but would be perfect (sed no input files error)
i=1
for f in *tsv
do awk '{if (NR!=1) print $2}' $f | sed -i "s|^|X${i}_|"
i=$((i+1))
done
Some help would be appreciated.
The second column is particularly easy because you simply replace the first occurrence of the separator.
for file in *.tsv; do
    sed -i '2,$s/\t/\tX1_/' "$file"
done
If your sed doesn't recognize the \t escape, use a literal tab (in many shells, you type it with Ctrl-V Tab). On *BSD (and hence macOS) you need -i ''.
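If you want the prefix to follow a per-file counter as in your own attempt (X1_, X2_, ...), a minimal sketch along the same lines (GNU sed assumed):
i=1
for file in *.tsv; do
    sed -i "2,\$s/\t/\tX${i}_/" "$file"   # the address $ is escaped so the shell doesn't expand $s
    i=$((i+1))
done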
AWK solution (the -i inplace extension requires GNU awk 4.1+):
awk -i inplace 'BEGIN { FS=OFS="\t" } NR!=1 { $2 = "X1_" $2 } 1' file1.tsv
Input:
a b c
James England 25
Brian France 41
Maria France 18
Output:
a b c
James X1_England 25
Brian X1_France 41
Maria X1_France 18

extracting data from a column based on another column

I have some files as shown below. I would like to extract the values of $5 based on $1.
file1
sam 60.2 143 40.4 19.8
mathew 107.9 144 35.6 72.3
baby 48.1 145 17.8 30.3
rehna 47.2 146 21.2 26.0
sam 69.9 147 .0 69.9
file2
baby 58.9 503 47.5 11.4
daisy 20.8 504 20.4 .4
arch 61.1 505 12.3 48.8
sam 106.6 506 101.6 5.0
rehna 73.5 507 35.9 37.6
sam 92.0 508 61.1 30.9
I used the following code to extract $5.
awk '$1 == "rehna" { print $5 }' *
awk '$1 == "sam" { print $5 }' *
I would like to get the output as shown below
rehna sam
26.0 19.8
37.6 69.9
5.0
30.9
How do I achieve this? Your suggestions would be appreciated!
The simplest is probably to paste the results together:
#!/bin/bash
function myawk {
    awk -v name="$1" 'BEGIN { print name } $1 == name { print $5 }' file1 file2
}
paste <(myawk rehna) <(myawk sam)
Running this produces the results you requested (with TAB as the separator character). See paste documentation for other options.
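With the two sample files that is:
rehna   sam
26.0    19.8
37.6    69.9
        5.0
        30.9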
Update: peak's answer has since wrapped this approach in a function, in the spirit of DRY. If you want more background information, read on.
Assuming Bash, Ksh, or Zsh as the shell:
printf '%s\t%s\n' 'rehna' 'sam'
paste \
<(awk '$1 == "rehna" { print $5 }' *) \
<(awk '$1 == "sam" { print $5 }' *)
The above produces tab-separated output.
paste is a POSIX utility that outputs corresponding lines from its input files, by default separated with tabs; e.g., paste fileA fileB yields:
<line 1 from fileA>\t<line 1 from fileB>
<line 2 from fileA>\t<line 2 from fileB>
...
If any input file runs out of lines, it supplies empty lines.
In the case at hand, the respective outputs from the awk commands are used as input files, using process substitution (<(...)).

AWK - how to selectively modify txt file

I would like to print only the 2nd field (when it matches regex1) of each record:
awk '$2 ~ /regex1/'
BUT ONLY for records that are between regex2 and regex3:
awk '/regex2/,/regex3/'
Other records, which are not between regex2 and regex3, shall be printed normally (all fields).
Any ideas how to put it together?
quick sample of input and output:
input
parrot milana 3 ukraine
dog husky 1 poland
cat husky 5 france
elephant malamut 5 belgium
bird husky 5 turkey
output:
parrot milana 3 ukraine
dog husky 1 poland
husky
elephant malamut 5 belgium
bird husky 5 turkey
Show the entire input, but between /dog/ and /elephant/ (the boundary records themselves stay unchanged) show only the 2nd field, and only when it matches the regex /husky/.
I hope this is useful...
This:
awk '/regex2/,/regex3/'
is shorthand for
awk '/regex2/{f=1} f; /regex3/{f=0}'
The shorthand version IMHO should NEVER be used, as its brevity isn't worth the difficulty it introduces when you try to build on it with other criteria, e.g. not printing the start line and/or not printing the end line and/or introducing other REs to match within the range, as you're doing now.
Given that, you're starting with this script:
awk '/dog/{f=1} f; /elephant/{f=0}'
and you want to only print the lines where you find "husky" so it's the simple, obvious tweak:
awk '/dog/{f=1} f && /husky/; /elephant/{f=0}'
EDIT: in response to changed requirements, and using a tab-separated file:
$ cat file
parrot milana 3 ukraine
dog husky 1 poland
cat husky 5 france
elephant malamut 5 belgium
bird husky 5 turkey
$ awk '
BEGIN { FS=OFS="\t" }
/elephant/ { f=0 }
{
    if (f) {
        if ($2 == "husky") {
            print "", $2
        }
    }
    else {
        print
    }
}
/dog/ { f=1 }
' file
parrot milana 3 ukraine
dog husky 1 poland
husky
elephant malamut 5 belgium
bird husky 5 turkey
You can write it more briefly:
$ awk '
BEGIN{ FS=OFS="\t" }
/elephant/ {f=0}
f && /husky/ { print "", $2 }
!f
/dog/ {f=1}
' file
parrot milana 3 ukraine
dog husky 1 poland
husky
elephant malamut 5 belgium
bird husky 5 turkey
but I think the if-else syntax is clearest and easiest to modify for newcomers to awk. If you want different output formatting, look up "printf" in the manual.
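For instance, to right-align the extracted field in a fixed-width column instead of printing a leading tab (purely as an illustration), the print in the brief version could become:
f && /husky/ { printf "%10s\n", $2 }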
infile:
$ cat input
parrot milana 3 ukraine
dog husky 1 poland
cat husky 5 france
elephant malamut 5 belgium
bird husky 5 turkey
command:
$ awk '/dog/{m=1} $2 ~ /husky/ && m{print $2} !m{print} /elephant/{m=0}' input
parrot milana 3 ukraine
husky
husky
bird husky 5 turkey
There are some ambiguities with your question, but this should do it:
awk '/regex2/ {inside=1}
/regex3/ {inside=0}
$2 ~ /regex1/ && inside {print $2}
!inside {print}' input_file
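Note one subtlety with this version: because /regex2/ {inside=1} runs before the other rules, the start-boundary line itself is already "inside" and gets reduced to its 2nd field. Moving that rule to the end, as the answers above do with /dog/ {f=1}, keeps both boundary lines intact:
awk '/regex3/ {inside=0}
     $2 ~ /regex1/ && inside {print $2}
     !inside {print}
     /regex2/ {inside=1}' input_file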