Most efficient way to gsub strings in awk where strings come from a separate file - awk

I have a tab-separated file called cities that looks like this:
Washington Washington N 3322 +Geo+Cap+US
Munich München N 3842 +Geo+DE
Paris Paris N 4948 +Geo+Cap+FR
I have a text file called countries.txt which looks like this:
US
DE
IT
I'm reading this file into a Bash variable and sending it to an awk program like this:
#!/usr/bin/env bash
countrylist=$(<countries.txt)
awk -v countrylist="$countrylist" -f countries.awk cities
And I have an awk file which should split the countrylist variable into an array, then process the cities file in such a way that we replace "+"VALUE with "" in $5 only if VALUE is in the countries array.
{
FS = "\t"; OFS = "\t";
split(countrylist, countries, /\n/)
# now gsub efficiently every country in $5
# but only if it's in the array
# i.e. replace "+US" with "" but not
# "+FR"
}
I am stuck in this last bit because I don't know how to check if $5 has a value from the array countries and to remove it only then.
Many thanks in advance!
[Edit]
The output should be tab-delimited:
Washington Washington N 3322 +Geo+Cap
Munich München N 3842 +Geo
Paris Paris N 4948 +Geo+Cap+FR

Could you please try following, if I understood your requirement correctly.
awk 'FNR==NR{a[$0]=$0;next} {for(i in a){if(index($5,a[i])){gsub(a[i],"",$5)}}} 1' countries.txt cities
A non-one-liner form of the code is as follows (you could set FS and OFS to \t in case your Input_file is TAB delimited):
awk '
FNR==NR{
a[$0]=$0
next
}
{
for(i in a){
if(index($5,a[i])){
gsub(a[i],"",$5)
}
}
}
1
' countries.txt cities
Output will be as follows.
Washington Washington N 3322 +Geo+Cap+
Munich München N 3842 +Geo+
Paris Paris N 4948 +Geo+Cap+FR

This is the awk way of doing it:
$ awk '
BEGIN {
    FS=OFS="\t"                  # delimiters
}
NR==FNR {                        # process countries file
    countries[$0]                # hash the countries to an array
    next                         # skip to the next record of the countries file
}
{
    n=split($5,city,"+")         # split the 5th column by +
    if(city[n] in countries)     # search the last part in countries
        sub(city[n] "$","",$5)   # if found, remove it from the 5th column
}1' countries.txt cities         # output and mind the order of files
Output (with actual tabs in data):
Washington Washington N 3322 +Geo+Cap+
Munich München N 3842 +Geo+
Paris Paris N 4948 +Geo+Cap+FR
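Both outputs above still carry a trailing "+", while the question asks for "+US" to disappear as a whole. A minimal sketch of one way to get there, assuming the 5th column is always a run of "+TAG" items: rebuild the column from the tags that are not in the country list, which costs one hash lookup per tag instead of one gsub per country. The helper name strip_countries and the array names are made up for illustration.

```shell
# strip_countries is a hypothetical helper name, not something from the thread.
strip_countries() {   # usage: strip_countries countries.txt cities
    awk '
    BEGIN { FS = OFS = "\t" }
    NR == FNR { countries[$0]; next }   # first file: remember each country code
    {
        n = split($5, tag, /[+]/)       # "+Geo+Cap+US" -> "", "Geo", "Cap", "US"
        kept = ""
        for (i = 2; i <= n; i++)        # tag[1] is the empty text before the first "+"
            if (!(tag[i] in countries))
                kept = kept "+" tag[i]
        $5 = kept                       # put back only the tags we kept
        print
    }' "$1" "$2"
}

# Recreate the sample files from the question and run the helper.
printf 'US\nDE\nIT\n' > countries.txt
printf 'Washington\tWashington\tN\t3322\t+Geo+Cap+US\nMunich\tMünchen\tN\t3842\t+Geo+DE\nParis\tParis\tN\t4948\t+Geo+Cap+FR\n' > cities
strip_countries countries.txt cities
```

With the sample data this should leave +Geo+Cap, +Geo and +Geo+Cap+FR in the 5th column. Unlike the last-tag check above, it also drops a country code that is not the final tag.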

Related

substr & index - extract from string $2 until the last string of the line

I have multiple files, and from a specific line in each I want to extract everything from string $2 until the last string of the line, i.e. until the end of the line (there could be one more string after $2, or there could be several).
So I thought of using awk, substr and index, but I don't know how to write the index part so that it prints until the end of the line.
EXAMPLE
input:
DATA USA CALIFORNIA
DATA CANADA NORTH Quebec city
DATA AMERICA Washington DC
output
USA CALIFORNIA
CANADA NORTH Quebec city
AMERICA Washington DC
Code:
awk '{num=NR; var=substr($2, index($2, " ")+1, NF)}'
But this doesn't work.
Any help would be more than appreciated!
Thank you in advance
When you have one space between the words, use
cut -sd' ' -f2- inputfile
You can do $1 = "" to remove the first field, then print the updated line.
awk '{$1 = ""; print}'
The above will however print a space at the start of each line. If you want to remove that space:
awk '{$1 = ""; $0=substr($0, 2); print}'
From your examples, it seems that you only want to remove $1. In that case you could use sed to remove it:
sed -E 's/[^[:space:]]+[[:space:]]//'
[^[:space:]]+ - match 1 or more non whitespace characters
[[:space:]] - followed by a whitespace character
Substitute with an empty string (the ending //).
You're not printing anything from your code, so that might be why "this doesn't work". Assuming that's not the problem, please edit your question to tell us what the problem is you need help with as "this doesn't work" is famously the worst possible problem statement when asking for help with software or anything else in life in general.
Having said that, regarding index($2, " ") - that's trying to find a space within a field when fields are separated by spaces so obviously that can never succeed. ITYM index($0, " ") and then substr($2... would be substr($0.... I'm not sure what you were thinking by having NF (the number of fields in the line) at the end of the substr() - maybe you meant length() (the number of chars in the line) but that'd also be wrong (and unnecessary) since that'd be more chars than are left after the substr() and just going til the end of the string as you want is substr()s default behavior anyway.
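A quick way to check that claim on the first sample line, assuming the default FS:

```shell
# index() of a blank inside $2 versus inside the whole record
echo 'DATA USA CALIFORNIA' | awk '{ print index($2, " "), index($0, " ") }'
```

This prints 0 5: there is no blank inside $2, while the first blank of the record sits at position 5, so the text after the first field starts at position 6.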
To fix your existing code try this:
$ awk '{num=NR; var=substr($0, index($0, " ")+1); print var}' file
USA CALIFORNIA
CANADA NORTH Quebec city
AMERICA Washington DC
or more robustly in case of most regexp FS and most input values:
$ awk '{num=NR; var=substr($0, match($0, FS)+1); print var}' file
USA CALIFORNIA
CANADA NORTH Quebec city
AMERICA Washington DC
The above and all other answers so far would fail if you're using the default FS and your input starts with blanks since given input like:
<blank>X<blank>Y
$1 is X and $2 is Y so if you want to print from Y on then you can't just delete whatever's before the first blank as $1 comes AFTER any leading whitespace when the default FS is used.
You also can't rely on using index() since it only matches strings while a multi-char FS is a regexp, nor can you rely on using match() since a single-char FS is a literal character.
So a robust solution to extract from string $2 until the last string of the line would have to handle:
FS being a blank to handle leading/trailing spaces and match any white space between fields,
Any other single-char FS as a literal character.
Any multi-char FS as a regexp.
FS being null in which case there's just 1 field, $1.
Let us know if you actually need that.
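For the first of those cases only (the default FS, with possible leading blanks), a minimal sketch could look like the following. It does not attempt the other three cases, and the helper name from_second_field is made up for illustration.

```shell
# from_second_field is a hypothetical helper name; it assumes the default FS.
from_second_field() {
    # drop optional leading blanks, the first field, and the blanks after it
    awk '{ sub(/^[ \t]*[^ \t]+[ \t]+/, ""); print }' "$@"
}

printf '  X Y\nDATA USA CALIFORNIA\n' | from_second_field
```

For the two input lines above this should print Y and USA CALIFORNIA. A line with only one field has nothing to strip and is printed unchanged.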
Your code
awk '{num=NR; var=substr($2, index($2, " ")+1, NF)}'
has three issues: first, you store what substr returns in a variable but never print it; second, you assume the desired length is the number of fields, which is not the case; third, $2 is USA, CANADA and AMERICA respectively, whereas you also want the fields that follow.
After repairing your code it becomes
awk '{num=NR; var=substr($0, index($0, " ")+1); print var}' file.txt
which for
DATA USA CALIFORNIA
DATA CANADA NORTH Quebec city
DATA AMERICA Washington DC
gives output
USA CALIFORNIA
CANADA NORTH Quebec city
AMERICA Washington DC
That being said, if you do not have to use index at any price, you might use the sub function to get the desired output with more concise code, in the following way
awk '{sub(/^[^ ]+ /,"");print}' file.txt
It replaces the start of the string (^) followed by one or more (+) non-space characters ([^ ]) and a space with the empty string, i.e. deletes it, then prints the changed line.
(tested in GNU Awk 5.0.1)
mawk NF=NF FS='^[ \t]*[^ \t]+[ \t]+' OFS=
USA CALIFORNIA
CANADA NORTH Quebec city
AMERICA Washington DC
Modifying the input:
$ cat input
DATA USA CALIFORNIA # 2x spaces between DATA and USA
DATA CANADA NORTH Quebec city
DATA AMERICA Washington , DC # starts with 1x tab
DATA South Korea Seoul # starts with 2x spaces
Another awk variation:
awk '
{ print substr($0,index($0,$2)) } # strip off 1st field by finding starting point of 2nd field
' input
# or as a one-liner sans comments
awk '{ print substr($0,index($0,$2)) }' input
NOTES:
for this particular example there's no need for the num and var variables so they've been removed
this assumes the 2nd field is not a substring of the 1st field; if the 2nd field is any of D / A / T / DA / AT / TA / DAT / ATA / DATA then index() will match on the same string in the 1st field (DATA) which means we'll fail to strip off the 1st field
This generates:
USA CALIFORNIA
CANADA NORTH Quebec city
AMERICA Washington , DC
South Korea Seoul
Addressing the scenario where the 2nd field could be a substring of the 1st field ...
Sample input:
$ cat input2
DATA DA USA CALIFORNIA # DA is substring of DATA
DATA A CANADA NORTH Quebec city # A is substring of DATA
DATA TA AMERICA Washington, DC # TA is substring of DATA
DATA DAT South Korea Seoul # DAT is substring of DATA
A couple awk ideas:
awk '
{ match($0,$1) # find 1st field location
$0=substr($0,RSTART+RLENGTH) # strip off 1st field; this will not strip off any field separators so we need to ...
print substr($0,index($0,$1)) # find start of new 1st field (old 2nd field)
}
' input2
# replace match()/substr() with substr()/index()/length()
awk '
{ $0=substr($0,index($0,$1)+length($1)) # find 1st field location and length; this strips off 1st field but not the trailing field separator(s) so we still need to ...
print substr($0,index($0,$1)) # find start of new 1st field (old 2nd field)
}
' input2
These both generate:
DA USA CALIFORNIA
A CANADA NORTH Quebec city
TA AMERICA Washington, DC
DAT South Korea Seoul

Awk not printing what is wanted

my attempt:
awk '$4 != "AZ" && max<$6 || NR==1{ max=$6; data=$0 } END{ print data }' USA.txt
I am trying to print the row that
does NOT have "AZ" in the 4th column
and the greatest value in the 6th column
The file has 6 columns:
firstname lastname town/city state-abv. zipcode score
Shellstrop Eleanor Phoenix AZ 85023 -2920765
Shellstrop Donna Tarantula_Springs NV 89047 -5920765
Mendoza Jason Jacksonville FL 32205 -4123794
Mendoza Douglas Jacksonville FL 32209 -3193274
Peleaz Steven Jacksonville FL 32203 -3123794
Based on your attempts, please try the following awk code. It checks whether the 4th field is NOT AZ; if so, it compares the previous value of max with the current $6 and keeps the greater of the two in max. The END block then prints that value.
awk -v max="" '$4!="AZ"{max=(max>$6?max:$6)} END{print max}' Input_file
To print complete row for a maximum value found would be:
awk -v max="" '$4!="AZ"{max=(max>$6?max:$6);arr[$6]=$0} END{print arr[max]}' Input_file
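One reason the original attempt can print the wrong row is operator precedence: the condition is read as ($4 != "AZ" && max<$6) || NR==1, so the first line always seeds max and data, even when it is an AZ row. A minimal sketch that avoids seeding from line 1 is below. It assumes the score column holds plain integers and skips any line whose 6th field is not one (such as a header line, if the file has one). The helper name best_non_az is made up for illustration.

```shell
# best_non_az is a hypothetical helper name, not something from the thread.
best_non_az() {   # usage: best_non_az USA.txt
    awk '
    $4 != "AZ" && $6 ~ /^[-+]?[0-9]+$/ {        # non-AZ rows with an integer score
        if (!seen || $6 + 0 > max) {            # first candidate, or a new maximum
            max = $6 + 0; row = $0; seen = 1
        }
    }
    END { if (seen) print row }' "$1"
}

# Recreate the sample data from the question and run the helper.
printf '%s\n' \
  'firstname lastname town/city state-abv. zipcode score' \
  'Shellstrop Eleanor Phoenix AZ 85023 -2920765' \
  'Shellstrop Donna Tarantula_Springs NV 89047 -5920765' \
  'Mendoza Jason Jacksonville FL 32205 -4123794' \
  'Mendoza Douglas Jacksonville FL 32209 -3193274' \
  'Peleaz Steven Jacksonville FL 32203 -3123794' > USA.txt
best_non_az USA.txt
```

On the sample data this should print the Peleaz Steven row, whose score of -3123794 is the greatest among the non-AZ rows.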

reordering the columns in a csv file + awk + keeping the comma delimiter

this is my file:
$ cat temp
country,latitude,longitude,name,code
AU,-25.274398,133.775136,Australia,61
CN,35.86166,104.195397,China,86
DE,51.165691,10.451526,Germany,49
FR,46.227638,2.213749,France,33
NZ,-40.900557,174.885971,New Zealand,64
WS,-13.759029,-172.104629,Samoa,685
CH,46.818188,8.227512,Switzerland,41
US,37.09024,-95.712891,United States,1
VU,-15.376706,166.959158,Vanuatu,678
I want to reorder the columns like below, but I want to keep the comma delimiter rather than get a space delimiter. How do I do this?
$ awk -F"," '{ print $5,$4,$1,$2,$3 }' temp
code name country latitude longitude
61 Australia AU -25.274398 133.775136
86 China CN 35.86166 104.195397
49 Germany DE 51.165691 10.451526
33 France FR 46.227638 2.213749
64 New Zealand NZ -40.900557 174.885971
685 Samoa WS -13.759029 -172.104629
41 Switzerland CH 46.818188 8.227512
1 United States US 37.09024 -95.712891
678 Vanuatu VU -15.376706 166.959158
The OFS variable also needs to be set if you don't want the output field separator to be a space character (the default).
$ awk 'BEGIN{FS=OFS=","}{ print $5,$4,$1,$2,$3 }' temp
code,name,country,latitude,longitude
61,Australia,AU,-25.274398,133.775136
86,China,CN,35.86166,104.195397
49,Germany,DE,51.165691,10.451526
33,France,FR,46.227638,2.213749
64,New Zealand,NZ,-40.900557,174.885971
685,Samoa,WS,-13.759029,-172.104629
41,Switzerland,CH,46.818188,8.227512
1,United States,US,37.09024,-95.712891
678,Vanuatu,VU,-15.376706,166.959158

adding common names in the column - awk

Is it possible to print the unique names in the 1st column, joining their values from the 2nd column as shown below? Thanks in advance!
input
tony singapore
johnny germany
johnny singapore
output
tony singapore
johnny germany;singapore
try this one-liner:
awk '{a[$1]=$1 in a?a[$1]";"$2:$2}END{for(x in a)print x, a[x]}' file
$ awk '{name2vals[$1] = name2vals[$1] sep[$1] $2; sep[$1] = ";"} END { for (name in name2vals) print name, name2vals[name]}' file
johnny germany;singapore
tony singapore
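The order in which for (x in a) visits keys is unspecified in awk, which is why johnny comes out before tony here while the desired output lists tony first. A minimal sketch that keeps first-seen order, assuming that order matters; the helper name merge_names and the array names are made up for illustration.

```shell
# merge_names is a hypothetical helper name, not something from the thread.
merge_names() {
    awk '
    !($1 in vals) { order[++n] = $1; vals[$1] = $2; next }   # first time this name appears
    { vals[$1] = vals[$1] ";" $2 }                           # later rows: append the value
    END { for (i = 1; i <= n; i++) print order[i], vals[order[i]] }' "$@"
}

printf 'tony singapore\njohnny germany\njohnny singapore\n' | merge_names
```

For the sample input this should print tony singapore followed by johnny germany;singapore.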
Here is a cryptic sed variant:
Content of script.sed
$ cat script.sed
:a # Create a label called a
$!N # If not last line, append the line to pattern space
s/^(([^ ]+ ).*)\n\2/\1;/ # If first column is same append second column to it separated by ;
ta # If the last substitution was successful loop back
P # Print up to the first \n of the current pattern space
D # Delete from current pattern space, up to the \n character
Execution:
$ cat file
tony singapore
johnny germany
johnny singapore
$ sed -rf script.sed file
tony singapore
johnny germany;singapore

awk how to remove duplicates in fields only if previous fields are the same

I am trying to remove duplicates from fields (and replace them with blanks) only if the previous fields are the same. For example:
Sample input:
France Paris Museum of Fine Arts blabala
France Paris Museum of Fine Arts blajlk
France Paris Yet another museum lqmsjdf
France Paris Museum of National History mlqskjf
France Bordeaux Museum of Fine Arts qsfsqf
France Bordeaux City Hall lmqjflqsk
France Bordeaux City Hall lqkjfqlskjflqskfj
Spain Madrid Museum of Fine Arts lqksjfh
Spain Madrid Museum of Fine Arts qlmfjlqsjf
Spain Barcelona City Hall nvqjvvnqk
Spain Barcelona Museum of Fine Arts lmkqjflqksfj
Desired output:
France Paris Museum of Fine Arts blabala
blajlk
Yet another museum lqmsjdf
Museum of National History mlqskjf
Bordeaux Museum of Fine Arts qsfsqf
City Hall lmqjflqsk
lqkjfqlskjflqskfj
Spain Madrid Museum of Fine Arts lqksjfh
qlmfjlqsjf
Barcelona City Hall nvqjvvnqk
Museum of Fine Arts lmkqjflqksfj
Thank you very much in advance for any kind of help.
Give this a try:
awk -F '\t' 'BEGIN {OFS=FS} {if ($1 == prev1) $1 = ""; else prev1 = $1; if ($2 == prev2) $2 = ""; else prev2 = $2; if ($3 == prev3) $3 = ""; else prev3 = $3; print}' inputfile
Here is a shorter version that works for any number of fields (the last field is always printed):
awk -F '\t' 'BEGIN {OFS=FS} {for (i=1; i<=NF-1;i++) if ($i == prev[i]) $i = ""; else prev[i] = $i; print}' inputfile
The output won't be aligned for on-screen use, but there will be the correct number of tabs.
The output will look like this:
field1 TAB field2 TAB field3 TAB field4
TAB TAB TAB field4
TAB TAB field3 TAB field4
TAB field2 TAB field3 TAB field4
etc.
If you need columns aligned, that is also possible.
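The question asks for a field to be blanked only when the fields before it are unchanged as well. The one-liners above compare each column with the previous row independently, so a city that repeats under a different country would also be blanked. A minimal sketch of a prefix-aware variant, assuming tab-separated input; the helper name blank_repeated_prefix is made up for illustration.

```shell
# blank_repeated_prefix is a hypothetical helper name, not something from the thread.
blank_repeated_prefix() {
    awk -F '\t' 'BEGIN { OFS = FS }
    {
        same = 1                        # are we still inside the shared leading fields?
        for (i = 1; i < NF; i++) {      # the last field is always kept
            cur = $i
            if (same && cur == prev[i]) $i = ""
            else same = 0               # first difference: keep this field and the rest
            prev[i] = cur               # remember the real value, not the blank
        }
        print
    }' "$@"
}

printf 'France\tParis\tMuseum\ta\nFrance\tParis\tMuseum\tb\nFrance\tBordeaux\tMuseum\tc\nSpain\tBordeaux\tMuseum\td\n' |
    blank_repeated_prefix
```

In this small made-up input, the third row keeps Museum because the city changed, and the fourth row keeps Bordeaux because the country changed.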
Edit:
This version allows you to specify the fields to deduplicate:
#!/usr/bin/awk -f
BEGIN {
    FS="\t"; OFS=FS
    deduplist=ARGV[1]
    ARGV[1]=""
    split(deduplist,tmp," ")
    for (i in tmp) dedup[tmp[i]]=1
}
{
    for (i=1; i<=NF;i++)
        if (i in dedup) {
            if ($i == prev[i])
                $i = ""
            else
                prev[i] = $i
        }
    # prevent printing lines that are completely blank because
    # it's an exact duplicate of the preceding line and all fields
    # are being deduplicated
    if ($0 !~ /^[[:blank:]]*$/)
        print
}
Run it like this: ./script.awk "2 3" inputfile to deduplicate fields 2 and 3.
Try this Perl one-liner:
perl -F"\t" -nae '@O=@F;if(!$x){$x=1}else{for($i=0;$i<=$#S;$i++){$F[$i]=""if($S[$i] eq "" || $S[$i] eq $F[$i])}};print join "\t",@F;@S=@O;'
I've assumed the fields are tab separated.