awk how to remove duplicates in fields only if previous fields are the same - awk

I am trying to remove duplicates from fields (and replace them with blanks) only if the previous fields are the same. For example:
Sample input:
France Paris Museum of Fine Arts blabala
France Paris Museum of Fine Arts blajlk
France Paris Yet another museum lqmsjdf
France Paris Museum of National History mlqskjf
France Bordeaux Museum of Fine Arts qsfsqf
France Bordeaux City Hall lmqjflqsk
France Bordeaux City Hall lqkjfqlskjflqskfj
Spain Madrid Museum of Fine Arts lqksjfh
Spain Madrid Museum of Fine Arts qlmfjlqsjf
Spain Barcelona City Hall nvqjvvnqk
Spain Barcelona Museum of Fine Arts lmkqjflqksfj
Desired output:
France Paris Museum of Fine Arts blabala
blajlk
Yet another museum lqmsjdf
Museum of National History mlqskjf
Bordeaux Museum of Fine Arts qsfsqf
City Hall lmqjflqsk
lqkjfqlskjflqskfj
Spain Madrid Museum of Fine Arts lqksjfh
qlmfjlqsjf
Barcelona City Hall nvqjvvnqk
Museum of Fine Arts lmkqjflqksfj
Thank you much in advance for any kind of help.

Give this a try:
awk -F '\t' 'BEGIN {OFS=FS} {if ($1 == prev1) $1 = ""; else prev1 = $1; if ($2 == prev2) $2 = ""; else prev2 = $2; if ($3 == prev3) $3 = ""; else prev3 = $3; print}' inputfile
Here is a shorter version that works for any number of fields (the last field is always printed):
awk -F '\t' 'BEGIN {OFS=FS} {for (i=1; i<=NF-1;i++) if ($i == prev[i]) $i = ""; else prev[i] = $i; print}' inputfile
The output won't be aligned for on-screen use, but there will be the correct number of tabs.
The output will look like this:
field1 TAB field2 TAB field3 TAB field4
TAB TAB TAB field4
TAB TAB field3 TAB field4
TAB field2 TAB field3 TAB field4
etc.
If you need columns aligned, that is also possible.
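Note that the one-liners above compare each field only with the same field on the previous line. If a field should be blanked only while every field to its left also matched the previous line (which is how I read "only if previous fields are the same"), a minimal sketch could look like the following. The function name dedup_hier and the variable names are made up for illustration, and tab-separated input is assumed.

```shell
# Hypothetical helper: blank a field only while all fields to its left
# were also identical to the previous line. The last field is always kept.
dedup_hier() {
  awk -F '\t' 'BEGIN { OFS = FS }
  {
    same = 1                      # stays 1 while the leading fields match the previous line
    for (i = 1; i < NF; i++) {
      cur = $i                    # remember the value before it is possibly blanked
      if (same && cur == prev[i])
        $i = ""
      else
        same = 0                  # first difference: keep this and all later fields
      prev[i] = cur
    }
    print
  }' "$@"
}

# usage: dedup_hier inputfile
```

With this variant, a line such as "France, Bordeaux, Museum of Fine Arts" that directly follows "France, Paris, Museum of Fine Arts" keeps its third field, because the second field changed.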
Edit:
This version allows you to specify the fields to deduplicate:
#!/usr/bin/awk -f
BEGIN {
FS="\t"; OFS=FS
deduplist=ARGV[1]
ARGV[1]=""
split(deduplist,tmp," ")
for (i in tmp) dedup[tmp[i]]=1
}
{
for (i=1; i<=NF;i++)
if (i in dedup) {
if ($i == prev[i])
$i = ""
else
prev[i] = $i
}
# prevent printing lines that are completely blank because
# it's an exact duplicate of the preceding line and all fields
# are being deduplicated
if ($0 !~ /^[[:blank:]]*$/)
print
}
Run it like this: ./script.awk "2 3" inputfile to deduplicate fields 2 and 3.

Try this Perl one-liner:
perl -F"\t" -nae '@O=@F;if(!$x){$x=1}else{for($i=0;$i<=$#S;$i++){$F[$i]=""if($S[$i] eq "" || $S[$i] eq $F[$i])}};print join "\t",@F;@S=@O;'
I've assumed the fields are tab separated.

Related

substr & index - extract from string $2 until the last string of the line

I have multiple files from which I want from a specific line to extract from string $2 until the last string of line or until the end of line (it could be one more string after $2, or it could be more)
So what I thought of is using awk with substr and index, but I don't know how to write the index part so that it prints until the end of the line or until the last string of the line.
EXAMPLE
input:
DATA USA CALIFORNIA
DATA CANADA NORTH Quebec city
DATA AMERICA Washington DC
output
USA CALIFORNIA
CANADA NORTH Quebec city
AMERICA Washington DC
Code:
awk '{num=NR; var=substr($2, index($2, " ")+1, NF)}'
But this doesn't work.
Any help would be more than appreciated!
Thank you in advance
When you have one space between the words, use
cut -sd' ' -f2- inputfile
You can do $1 = "" to remove the first field, then print the updated line.
awk '{$1 = ""; print}'
The above will however print a space at the start of each line. If you want to remove that space:
awk '{$1 = ""; $0=substr($0, 2); print}'
From your examples, it seems that you only want to remove $1. In that case you could use sed to remove it:
sed -E 's/[^[:space:]]+[[:space:]]//'
[^[:space:]]+ - match 1 or more non whitespace characters
[[:space:]] - followed by a whitespace character
Substitute with an empty string (the ending //).
You're not printing anything from your code, so that might be why "this doesn't work". Assuming that's not the problem, please edit your question to tell us what the problem is you need help with as "this doesn't work" is famously the worst possible problem statement when asking for help with software or anything else in life in general.
Having said that, regarding index($2, " ") - that's trying to find a space within a field when fields are separated by spaces, so obviously that can never succeed. I think you meant index($0, " "), and then substr($2... would be substr($0.... I'm not sure what you were thinking by having NF (the number of fields in the line) at the end of the substr() - maybe you meant length() (the number of chars in the line), but that'd also be wrong (and unnecessary) since that'd be more chars than are left after the substr(), and just going until the end of the string as you want is substr()'s default behavior anyway.
To fix your existing code try this:
$ awk '{num=NR; var=substr($0, index($0, " ")+1); print var}' file
USA CALIFORNIA
CANADA NORTH Quebec city
AMERICA Washington DC
or more robustly in case of most regexp FS and most input values:
$ awk '{num=NR; var=substr($0, match($0, FS)+1); print var}' file
USA CALIFORNIA
CANADA NORTH Quebec city
AMERICA Washington DC
The above and all other answers so far would fail if you're using the default FS and your input starts with blanks since given input like:
<blank>X<blank>Y
$1 is X and $2 is Y so if you want to print from Y on then you can't just delete whatever's before the first blank as $1 comes AFTER any leading whitespace when the default FS is used.
You also can't rely on using index() since it only matches strings while a multi-char FS is a regexp, nor can you rely on using match() since a single-char FS is a literal character.
So a robust solution to extract from string $2 until the last string of the line would have to handle:
FS being a blank to handle leading/trailing spaces and match any white space between fields,
Any other single-char FS as a literal character.
Any multi-char FS as a regexp.
FS being null in which case there's just 1 field, $1.
Let us know if you actually need that.
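As a partial illustration of the first case in that list only (the default FS, where leading blanks are skipped and fields are separated by runs of spaces or tabs), here is a minimal sketch. The function name strip_first_field is made up for this example, and it does not handle any other FS.

```shell
# Hypothetical helper for the default-FS case only:
# remove optional leading blanks, the first field, and the blanks after it.
strip_first_field() {
  awk '{ sub(/^[ \t]*[^ \t]+[ \t]*/, "") } 1' "$@"
}

# usage: strip_first_field file
```

A line that has only one field ends up empty, which matches "print from $2 on" when there is no $2.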
Your code
awk '{num=NR; var=substr($2, index($2, " ")+1, NF)}'
has three issues: first, you store what substr returns in a variable but do not print it; second, you assume the desired length is the number of fields, which is not the case; third, $2 is USA, CANADA, and AMERICA respectively, whilst you also want the fields that follow.
After repairing your code, it becomes
awk '{num=NR; var=substr($0, index($0, " ")+1); print var}' file.txt
which for
DATA USA CALIFORNIA
DATA CANADA NORTH Quebec city
DATA AMERICA Washington DC
gives output
USA CALIFORNIA
CANADA NORTH Quebec city
AMERICA Washington DC
That being said, if you do not have to use index at any price, you might use the sub function to get the desired output with more concise code, in the following way:
awk '{sub(/^[^ ]+ /,"");print}' file.txt
It replaces the start of the string (^) followed by one or more (+) non-space characters ([^ ]) and a space with an empty string, i.e. deletes them, then prints the changed line.
(tested in GNU Awk 5.0.1)
mawk NF=NF FS='^[ \t]*[^ \t]+[ \t]+' OFS=
USA CALIFORNIA
CANADA NORTH Quebec city
AMERICA Washington DC
Modifying the input:
$ cat input
DATA USA CALIFORNIA # 2x spaces between DATA and USA
DATA CANADA NORTH Quebec city
DATA AMERICA Washington , DC # starts with 1x tab
DATA South Korea Seoul # starts with 2x spaces
Another awk variation:
awk '
{ print substr($0,index($0,$2)) } # strip off 1st field by finding starting point of 2nd field
' input
# or as a one-liner sans comments
awk '{ print substr($0,index($0,$2)) }' input
NOTES:
for this particular example there's no need for the num and var variables so they've been removed
this assumes the 2nd field is not a substring of the 1st field; if the 2nd field is any of D / A / T / DA / AT / TA / DAT / ATA / DATA then index() will match on the same string in the 1st field (DATA) which means we'll fail to strip off the 1st field
This generates:
USA CALIFORNIA
CANADA NORTH Quebec city
AMERICA Washington , DC
South Korea Seoul
Addressing the scenario where the 2nd field could be a substring of the 1st field ...
Sample input:
$ cat input2
DATA DA USA CALIFORNIA # DA is substring of DATA
DATA A CANADA NORTH Quebec city # A is substring of DATA
DATA TA AMERICA Washington, DC # TA is substring of DATA
DATA DAT South Korea Seoul # DAT is substring of DATA
A couple awk ideas:
awk '
{ match($0,$1) # find 1st field location
$0=substr($0,RSTART+RLENGTH) # strip off 1st field; this will not strip off any field separators so we need to ...
print substr($0,index($0,$1)) # find start of new 1st field (old 2nd field)
}
' input2
# replace match()/substr() with substr()/index()/length()
awk '
{ $0=substr($0,index($0,$1)+length($1)) # find 1st field location and length; this strips off 1st field but not the trailing field separator(s) so we still need to ...
print substr($0,index($0,$1)) # find start of new 1st field (old 2nd field)
}
' input2
These both generate:
DA USA CALIFORNIA
A CANADA NORTH Quebec city
TA AMERICA Washington, DC
DAT South Korea Seoul

Join of two files introduces extraneous newline

Update: I figured out the reason for the extraneous newline. I created file1 and file2 on a Windows machine. Windows adds <cr><newline> to the end of each line. So, for example, the first record in file1 is not this:
Bill <tab> 25 <newline>
Instead, it is this:
Bill <tab> 25 <cr><newline>
So when I set a[Bill] to $2 I am actually setting it to $2<cr>.
I used a hex editor and removed all of the <cr> symbols in file1 and file2. Now the AWK program works as desired.
I have seen the SO posts on using AWK to do a natural join of two files. I took one of the solutions and am trying to get it to work. Alas, I have been unsuccessful. I am hoping you can tell me what I am doing wrong.
Note: I appreciate other solutions, but what I really want is to understand why my AWK program doesn't work (i.e., why/how an extraneous newline is being introduced).
I want to do a join of these two files:
file1 (name, tab, age):
Bill 25
John 24
Mary 21
file2 (name, tab, marital-status)
Bill divorced
Glenn married
John married
Mary single
When joined, I expect to see this (name, tab, age, tab, marital-status):
Bill 25 divorced
John 24 married
Mary 21 single
Notice that file2 has a person named Glenn, but file1 doesn't. No record in file1 joins to it.
My AWK program almost produces that result. But, for reasons I don't understand, the marital-status value is on the next line:
Bill 25
divorced
John 24
married
Mary 21
single
Here is my AWK program:
awk 'BEGIN { OFS = "\t" }
NR == FNR { a[$1] = ($1 in a? a[$1] OFS : "")$2; next }
$1 in a { $0 = $0 OFS a[$1]; delete a[$1]; print }' file2 file1 > joined_file1_file2
You may try this awk solution:
awk 'BEGIN {FS=OFS="\t"} {sub(/\r$/, "")}
FNR == NR {m[$1]=$2; next} {print $0, m[$1]}' file2 file1
Bill 25 divorced
John 24 married
Mary 21 single
Here:
Using sub(/\r$/, "") to remove any DOS line ending
If $1 doesn't exist in mapping m then m[$1] will be an empty string so we can simplify awk processing
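If you only want rows whose name appears in both files (an inner join, as in the expected output of the question), a hedged variant of the same idea could look like this. The function name join_by_name and the array name status are made up for illustration; tab-separated input with an optional trailing CR on each line is assumed.

```shell
# Hypothetical helper: inner join on the first column, tolerant of CRLF input.
# usage: join_by_name file2 file1
join_by_name() {
  awk 'BEGIN { FS = OFS = "\t" }
       { sub(/\r$/, "") }                       # drop a trailing CR (DOS/Windows line ending)
       FNR == NR { status[$1] = $2; next }      # first file: name -> marital status
       $1 in status { print $0, status[$1] }    # second file: print only names seen in the first
  ' "$1" "$2"
}
```

Because sub() changes $0, awk re-splits the fields, so $2 no longer carries the CR that pushed the joined value onto the next line.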

Most efficient way to gsub strings in awk where strings come from a separate file

I have a tab-separated file called cities that looks like this:
Washington Washington N 3322 +Geo+Cap+US
Munich München N 3842 +Geo+DE
Paris Paris N 4948 +Geo+Cap+FR
I have a text file called countries.txt which looks like this:
US
DE
IT
I'm reading this file into a Bash variable and sending it to an awk program like this:
#!/usr/bin/env bash
countrylist=$(<countries.txt)
awk -v countrylist="$countrylist" -f countries.awk cities
And I have an awk file which should split the countrylist variable into an array, then process the cities file in such a way that we replace "+"VALUE with "" in $5 only if VALUE is in the countries array.
{
FS = "\t"; OFS = "\t";
split(countrylist, countries, /\n/)
# now gsub efficiently every country in $5
# but only if it's in the array
# i.e. replace "+US" with "" but not
# "+FR"
}
I am stuck in this last bit because I don't know how to check if $5 has a value from the array countries and to remove it only then.
Many thanks in advance!
[Edit]
The output should be tab-delimited:
Washington Washington N 3322 +Geo+Cap
Munich München N 3842 +Geo
Paris Paris N 4948 +Geo+Cap+FR
Could you please try following, if I understood your requirement correctly.
awk 'FNR==NR{a[$0]=$0;next} {for(i in a){if(index($5,a[i])){gsub(a[i],"",$5)}}} 1' countries.txt cities
A multi-line form of the same code is as follows (you could set FS and OFS to \t in case your Input_file is TAB delimited):
awk '
FNR==NR{
a[$0]=$0
next
}
{
for(i in a){
if(index($5,a[i])){
gsub(a[i],"",$5)
}
}
}
1
' countries.txt cities
Output will be as follows.
Washington Washington N 3322 +Geo+Cap+
Munich München N 3842 +Geo+
Paris Paris N 4948 +Geo+Cap+FR
This is the awk way of doing it:
$ awk '
BEGIN {
    FS=OFS="\t"                    # delimiters
}
NR==FNR {                          # process countries file
    countries[$0]                  # hash the countries to an array
    next                           # skip to the next country while there are countries left
}
{
    n=split($5,city,"+")           # split the 5th column by +
    if(city[n] in countries)       # search the last part in countries
        sub(city[n] "$","",$5)     # if found, remove it from the 5th column
}1' countries.txt cities           # output and mind the order of files
Output (with actual tabs in data):
Washington Washington N 3322 +Geo+Cap+
Munich München N 3842 +Geo+
Paris Paris N 4948 +Geo+Cap+FR
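Both outputs above keep a trailing "+" where the country code was removed, while the desired output in the question drops the "+" as well. A minimal sketch that rebuilds $5 from the tags that are not in the country list follows. The function name strip_countries and the array names are made up for illustration, and it assumes every row has five tab-separated fields and every tag in $5 is preceded by "+", as in the sample.

```shell
# Hypothetical helper: remove "+CODE" from the 5th field for every CODE
# listed in the countries file, wherever it appears in the field.
# usage: strip_countries countries.txt cities
strip_countries() {
  awk 'BEGIN { FS = OFS = "\t" }
       NR == FNR { drop[$0]; next }            # first file: codes to remove
       {
         n = split($5, part, /[+]/)            # part[1] is empty because $5 starts with "+"
         out = ""
         for (i = 2; i <= n; i++)
           if (!(part[i] in drop))
             out = out "+" part[i]             # keep tags that are not listed
         $5 = out
       } 1' "$1" "$2"
}
```

Splitting once per line and testing membership with "in" avoids looping over the whole country list for every record, which should matter when the list is long.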

How to use Awk to create a new field but retain the original field?

Can this be done in Awk?
FILE_IN (Input file)
ID_Number|Title|Name
65765765|The Cat Sat on the Mat|Dennis Smith
65765799|The Dog Sat on the Catshelf|David Jones
65765797|The Horse Sat on the Sofa|Jeff Jones
FILE_OUT (Desired Results)
ID_Number|Title|Nickname|Name
65765765|The Cat Sat on the Mat|Cat Sat|Dennis Smith
65765799|The Dog Sat on the Catshelf|Dog|David Jones
65765797|The Horse Sat on the Sofa||Jeff Jones
Logic to apply:
IF Title contains “ Cat Sat ” OR " cat sat " THEN Nickname = “Cat Sat” #same titlecase/text as was found#
IF Title contains “ Dog ” OR " dog " THEN Nickname = “Dog”
Also, is this task possible with Sed?
This might work for you (GNU sed):
sed -i '1s/|/&Nickname&/2;1b;s/|.*\b\(Cat\|Dog\)\b.*|/&\u\1|/I;t;s/|.*|/&|/' file
Insert the column Nickname into the headings. If the second column contains either the word Cat or Dog insert a third column with the matching word in it. Otherwise insert a blank third column.
Another awk (GNU awk, since it uses the three-argument form of match()):
$ awk 'BEGIN{FS=OFS="|"}
{delete a;
match($2,"([Cc]at [Ss]at|[Dd]og)",a);
$NF=(NR==1?"Nickname":a[1]) OFS $NF}1' file
ID_Number|Title|Nickname|Name
65765765|The Cat Sat on the Mat|Cat Sat|Dennis Smith
65765799|The Dog Sat on the Catshelf|Dog|David Jones
65765797|The Horse Sat on the Sofa||Jeff Jones
You could try this with GNU awk:
awk -F"|" -v OFS="|" 'NR==1{$2 = $2 OFS "Nickname"}
NR>1{if($0 ~ /\s*[Cc]at [Ss]at\s+/) n="Cat Sat"; else if($0 ~ /\s*[dD]og\s+/)n="Dog";
else n=""; $2 = $2 OFS n} 1' file
-F"|" and -v OFS="|" specify the input and output field delimiters, respectively.
NR==1 To handle header case.
NR>1 To handle data case.
With the same logic, you could use this more compact code:
awk -F"|" -v OFS="|" 'NR==1{$2 = $2 OFS "Nickname"}
NR>1{n=($0 ~ /\s*[Cc]at [Ss]at\s+/) ? "Cat Sat" : ($0 ~ /\s*[dD]og\s+/) ? "Dog" : ""; $2 = $2 OFS n} 1' file
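If the nickname should keep exactly the text that was found in the title (the question asks for the same case as found), one option that avoids GNU extensions is match() with RSTART and RLENGTH. This is only a sketch: the function name add_nickname is made up, and it assumes the words must be delimited by spaces (or the start or end of the title), which is how I read the quoted “ Cat Sat ” in the question.

```shell
# Hypothetical helper: insert a Nickname column after Title, copying the
# matched text from the title as-is.
# usage: add_nickname file
add_nickname() {
  awk 'BEGIN { FS = OFS = "|" }
       NR == 1 { $2 = $2 OFS "Nickname"; print; next }   # header line
       {
         nick = ""
         t = " " $2 " "                                  # pad so a match at either end still has spaces around it
         if (match(t, / ([Cc]at [Ss]at|[Dd]og) /))
           nick = substr(t, RSTART + 1, RLENGTH - 2)     # strip the two surrounding spaces
         $2 = $2 OFS nick
         print
       }' "$@"
}
```

With the sample file, "Catshelf" is not picked up because it is not followed by " Sat ", and the Horse row gets an empty Nickname.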

reordering the columns in a csv file + awk + keeping the comma delimiter

this is my file:
$ cat temp
country,latitude,longitude,name,code
AU,-25.274398,133.775136,Australia,61
CN,35.86166,104.195397,China,86
DE,51.165691,10.451526,Germany,49
FR,46.227638,2.213749,France,33
NZ,-40.900557,174.885971,New Zealand,64
WS,-13.759029,-172.104629,Samoa,685
CH,46.818188,8.227512,Switzerland,41
US,37.09024,-95.712891,United States,1
VU,-15.376706,166.959158,Vanuatu,678
I want to reorder the columns as shown below, but I want to keep the comma delimiter instead of getting a space delimiter. How do I do this?
$ awk -F"," '{ print $5,$4,$1,$2,$3 }' temp
code name country latitude longitude
61 Australia AU -25.274398 133.775136
86 China CN 35.86166 104.195397
49 Germany DE 51.165691 10.451526
33 France FR 46.227638 2.213749
64 New Zealand NZ -40.900557 174.885971
685 Samoa WS -13.759029 -172.104629
41 Switzerland CH 46.818188 8.227512
1 United States US 37.09024 -95.712891
678 Vanuatu VU -15.376706 166.959158
The OFS variable also needs to be set if you don't want the output field separator to be a space character (the default).
$ awk 'BEGIN{FS=OFS=","}{ print $5,$4,$1,$2,$3 }' temp
code,name,country,latitude,longitude
61,Australia,AU,-25.274398,133.775136
86,China,CN,35.86166,104.195397
49,Germany,DE,51.165691,10.451526
33,France,FR,46.227638,2.213749
64,New Zealand,NZ,-40.900557,174.885971
685,Samoa,WS,-13.759029,-172.104629
41,Switzerland,CH,46.818188,8.227512
1,United States,US,37.09024,-95.712891
678,Vanuatu,VU,-15.376706,166.959158
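If you would rather name the columns than count them, a minimal sketch along the same lines is shown below. The function name reorder_csv and the variable names are made up for illustration. It assumes the first line is a header, that no field contains a quoted comma, and that every requested name exists in the header (an unknown name would print the whole line in that position).

```shell
# Hypothetical helper: print the columns of a simple CSV in the order given
# by a comma-separated list of header names.
# usage: reorder_csv "code,name,country,latitude,longitude" temp
reorder_csv() {
  order=$1
  shift
  awk -v order="$order" '
    BEGIN { FS = OFS = ","; n = split(order, want, ",") }
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i }    # header name -> column number
    {
      line = ""
      for (j = 1; j <= n; j++)
        line = line (j > 1 ? OFS : "") $(col[want[j]])
      print line                                         # the header is reordered too
    }' "$@"
}
```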