Keep only the last duplicate line

Keep only the last duplicate line - awk

I have this data, how to remove the first duplicate ($1$2$3) using awk
785016 AGTCGCGTCCGT 142
785031 CGGCGTCGACTA 705
785031 CGGCGTCGACTA 705 CACTCCCCTGGAG
848841 GCTCAGTCAAAC 1595
848841 GCTCAGTCAAAC 1595 matched
848847 CAAATCGAGATC 1672
880844 TGCCGACGACAT 520
880844 TGCCGACGACAT 520 GTGTTCCGATCAG
880851 GACGACAACGTC 582
the expected output is
785016 AGTCGCGTCCGT 142
785031 CGGCGTCGACTA 705 CACTCCCCTGGAG
848841 GCTCAGTCAAAC 1595 matched
848847 CAAATCGAGATC 1672
880844 TGCCGACGACAT 520 GTGTTCCGATCAG
880851 GACGACAACGTC 582

With tac and awk:
tac file | awk '!a[$1,$2,$3]++' | tac
Output:
785016 AGTCGCGTCCGT 142
785031 CGGCGTCGACTA 705 CACTCCCCTGGAG
848841 GCTCAGTCAAAC 1595 matched
848847 CAAATCGAGATC 1672
880844 TGCCGACGACAT 520 GTGTTCCGATCAG
880851 GACGACAACGTC 582
See: man tac

Related

awk getting good distribution of random integer values between 2 inputs

How to get a good distribution of random integer values between 2 inputs using awk?.
I'm trying the below
$ awk -v min=200 -v max=500 ' BEGIN { srand();for(i=0;i<10;i++) print int(min+rand()*100*(max/min)) } '
407
406
360
334
264
365
303
417
249
225
$
Is there a better way in awk

Sorry to inform you that your code is not even correct. Try with min=max=10.
Something like this will work. You can verify the uniformity as well.
$ awk -v min=200 -v max=210 ' BEGIN{srand();
for(i=0;i<10000;i++) a[int(min+rand()*(max-min))]++;
for(k in a) print k,a[k]}' | sort
200 1045
201 966
202 1014
203 1016
204 985
205 1010
206 988
207 1027
208 986
209 963
Note also that min is inclusive but max is not.

output matching column from multiple input in awk

Assumes there are some data from these two input which I only want, which is "A" from inputA.txt and "B" from inputB.txt
==> inputA.txt <==
A 10214027 6369158
A 10214028 6369263
A 10214029 6369321
A 10214030 6369713
A 10214031 6370146
A 10214032 6370553
A 10214033 6370917
A 10214034 6371322
A 10214035 6371735
A 10214036 6372136
So I only want the data with A's
==> inputB.txt <==
B 50015214 5116941
B 50015215 5116767
B 50015216 5116577
B 50015217 5116409
B 50015218 5116221
B 50015219 5116044
B 50015220 5115845
B 50015221 5115676
B 50015222 5115512
B 50015223 5115326
Same goes here, only want B's
and I've built the script, but it's been doubled due to using multiple inputs.
#!/bin/awk -f
BEGIN{
printf "Column 1\tColumn 2\tColumn 3"
}
/^A/{
c=substr($2,1,4)
d=substr($2,5,3)
e=substr($3,1,4)
f=substr($3,5,3)
}
{
printf "%4.1f %4.1f %4.1f %4.1f\n",c,d,e,f > "outputA.txt"
}
/^B/{
c=substr($2,1,4)
d=substr($2,5,3)
e=substr($3,1,4)
f=substr($3,5,3)
}
{
printf "%4.1f %4.1f %4.1f %4.1f\n",c,d,e,f > "outputB.txt"
}
Let me know your thought on this.
Expected output
==> outputA.txt <==
Column 1 Column 2 Column 3 Column 4
1021 4027 6369 158
1021 4028 6369 263
1021 4029 6369 321
1021 4030 6369 713
1021 4031 6370 146
1021 4032 6370 553
1021 4033 6370 917
1021 4034 6371 322
1021 4035 6371 735
1021 4036 6372 136
==> outputB.txt <==
Column 1 Column 2 Column 3 Column 4
5001 5214 5116 941
5001 5215 5116 767
5001 5216 5116 577
5001 5217 5116 409
5001 5218 5116 221
5001 5219 5116 044
5001 5220 5115 845
5001 5221 5115 676
5001 5222 5115 512
5001 5223 5115 326

With GNU awk and FIELDWIDTHS:
awk 'BEGIN{FIELDWIDTHS="1 1 4 4 1 4 3"}
{out="output" $1 ".txt"}
FNR==1{print "Column 1 Column 2 Column 3 Column 4" >out}
{print $3,$4,$6,$7 >out}' inputA.txt inputB.txt
Use FIELDWIDTHS to split current row to seven columns. out contains name of new file. If first row of current file is reached print header to new file. For every row print four columns to new file.
See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR

Could you please try following.
awk '
FNR==1{
sub(/[a-z]+/,"",FILENAME)
file="output"FILENAME".txt"
print "Column 1 Column 2 Column 3 Column 4" > (file)
}
{
print substr($0,3,4),substr($0,7,4),substr($0,12,4),substr($0,16,3) > (file)
}
' inputA inputB
Explanation:
awk ' ##Starting awk program here.
FNR==1{ ##Checking condition if FNR==1, line number is 1 then do following.
sub(/[a-z]+/,"",FILENAME) ##Substituting all small letters from file name with NULL.
file="output"FILENAME".txt" ##Creating variable file whose value is string output FILENAME and .txt
print "Column 1 Column 2 Column 3 Column 4" > (file) ##Printing headers to output file.
}
{
print substr($0,3,4),substr($0,7,4),substr($0,12,4),substr($0,16,3) > (file) ##Printing substrings values as per OP need to output files.
}
' inputA inputB ##Mentioning multiple Input_file names here.

You don't need substr here. Empty out the first field, insert a space after every four digits, force awk to reparse fields and then print:
awk '$1=="A"{
$1=""
gsub(/[0-9]{4}/,"& ")
$1=$1
print
}' inputA.txt
Its output:
1021 4027 6369 158
1021 4028 6369 263
1021 4029 6369 321
1021 4030 6369 713
1021 4031 6370 146
1021 4032 6370 553
1021 4033 6370 917
1021 4034 6371 322
1021 4035 6371 735
1021 4036 6372 136
Obviously this works with only one input but I believe referring to other answers you can tweak it to work with multiple files

just keep it simple :
${...input_data...} |
{m,g,n}awk 'gsub(" ....", "& ")^_'
A 1021 4027 6369 158
A 1021 4028 6369 263
A 1021 4029 6369 321
A 1021 4030 6369 713
A 1021 4031 6370 146
A 1021 4032 6370 553
A 1021 4033 6370 917
A 1021 4034 6371 322
A 1021 4035 6371 735
A 1021 4036 6372 136
B 5001 5214 5116 941
B 5001 5215 5116 767
B 5001 5216 5116 577
B 5001 5217 5116 409
B 5001 5218 5116 221
B 5001 5219 5116 044
B 5001 5220 5115 845
B 5001 5221 5115 676
B 5001 5222 5115 512
B 5001 5223 5115 326

conditional awk statement to create a new field with additive value

Question
How would I use awk to create a new field that has $2+consistent value?
I am planning to cycle through a list of values but I wouldn't mind using a one liner for each command
PseudoCode
awk '$1 == Bob {$4 = $2 + 400}' file
Sample Data
Philip 13 2
Bob 152 8
Bob 4561 2
Bob 234 36
Bob 98 12
Rey 147 152
Rey 15 1547
Expected Output
Philip 13 2
Bob 152 8 408
Bob 4561 2 402
Bob 234 36 436
Bob 98 12 412
Rey 147 152
Rey 15 1547

just quote Bob, also you want to add third field not second
$ awk '$1=="Bob" {$4=$3+400}1' file | column -t
Philip 13 2
Bob 152 8 408
Bob 4561 2 402
Bob 234 36 436
Bob 98 12 412
Rey 147 152
Rey 15 1547

Here , check if $1 is equal to Bob and , reconstruct the record ($0) by appending $2 FS 400 in to $0. Here FS is the field separator used between 3rd and 4th fields. 1 in the end means tell awk to take the default action which is print.
awk '$1=="Bob"{$0=$0 FS $2 + 400}1' file
Philip 13 2
Bob 152 8 552
Bob 4561 2 4961
Bob 234 36 634
Bob 98 12 498
Rey 147 152
Rey 15 1547
Or , if you want to keep name(Bob) as variable
awk -vname="Bob" '$1==name{$0=$0 FS $2 + 400}1' file

1st solutiuon: Could you please try following too once. I am using here NF and NF+1 awk's out of the box variables. Where $NF denotes value of last column of current line and $(NF+1) will create an additional column if condition of st field stringBob` is found is TRUE.
awk '{$(NF+1)=$1=="Bob"?400+$NF:""} 1' OFS="\t" Input_file
2nd solution: In case we don't want to create a new field and simply want to print the values as per condition then try following, this should be more faster I believe.
awk 'BEGIN{OFS="\t"}{$1=$1;print $0,$1=="Bob"?400+$NF:""}' Input_file
Output will be as follows.
Philip 13 2
Bob 152 8 408
Bob 4561 2 402
Bob 234 36 436
Bob 98 12 412
Rey 147 152
Rey 15 1547
Explanation: Adding explanation for above code now.
awk ' ##Starting awk program here.
{
$(NF+1)=$1=="Bob"?400+$NF:"" ##Creating a new last field here whose value will be depending upon condition check.
##its checking condition if 1st field is having Bob string in it then add 400 value to last field value or make it NULL.
}
1 ##awk works on method of condition then action. so by mentioning 1 making condition TRUE here and NO action defined so by default print of current line will happen.
' OFS="\t" Input_file ##Setting OFS as TAB here where OFS ois output field separator and mentioning Input_file name here.

exchange columns based on some conditions

I have a text file with 5 columns. If the number of the 5th column is less than the 3rd column, replace the 4th and 5th column as 2nd and 3rd column. If the number of the 5th column is greater than 3rd column, leave that line as same.
1EAD A 396 B 311
1F3B A 118 B 171
1F5V A 179 B 171
1G73 A 162 C 121
1BS0 E 138 G 230
Desired output
1EAD B 311 A 396
1F3B A 118 B 171
1F5V B 171 A 179
1G73 C 121 A 162
1BS0 E 138 G 230

$ awk '{ if ($5 >= $3) print $0; else print $1"\t"$4"\t"$5"\t"$2"\t"$3; }' foo.txt

awk print multiple column file into single column

My file looks like that:
315
717
461 737
304
440
148 206 264 322 380 438 496
801
495
355
249 989
768
946
I want to print all those columns in a single column file (one long first column).
If I try to
awk '{print $1}'> new_file;
awk '{print $2}' >> new_file
There are white spaces in between. How to solve this thing?

Perhaps a bit cryptic:
awk '1' RS='[[:space:]]+' inputfile
It says "print every record, treating any whitespace as record separators".

You can simply use something like:
awk '{ for (i=1; i<=NF; i++) print $i }' file
For each line, iterate through columns, and print each column in a new line.

You don't need as much as sed for this: just translate spaces to newlines
tr ' ' '\n' < file
tr is purely a filter, so you have to redirect a file into it.

A perl solution:
perl -pe 's/\s+(?=\S)/\n/g' infile
Output:
315
717
461
737
304
440
148
206
264
322
380
438
496
801
495
355
249
989
768
946

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Keep only the last duplicate line - awk

With tac and awk: tac file | awk '!a[$1,$2,$3]++' | tac Output: 785016 AGTCGCGTCCGT 142 785031 CGGCGTCGACTA 705 CACTCCCCTGGAG 848841 GCTCAGTCAAAC 1595 matched 848847 CAAATCGAGATC 1672 880844 TGCCGACGACAT 520 GTGTTCCGATCAG 880851 GACGACAACGTC 582 See: man tac

Related

awk getting good distribution of random integer values between 2 inputs

output matching column from multiple input in awk

conditional awk statement to create a new field with additive value

exchange columns based on some conditions

awk print multiple column file into single column

Categories

Resources