Delete repetitions in text file by using awk

I have a fragment of text file (this text file is huge):
114303 SOL1443
114311 SOL679
114316 SOL679
114432 SOL1156
114561 SOL122
114574 SOL2000
114952 SOL3018
115597 SOL609
115864 SOL2385
115993 SOL3448
SOL2 61571
SOL3 87990
SOL4 96242
SOL5 6329
SOL5 16550
SOL9 84894
SOL9 84911
SOL12 91985
SOL15 85816
I need to write a script which will delete lines that have a duplicate SOL number. It doesn't matter whether the SOL is in the first or in the second column.
For example, in the text I have:
115993 SOL269
SOL269 84911
12373 SOL269
So my script should delete the second and third lines:
SOL269 84911
12373 SOL269
I know that in awk I can use
awk '!seen[$0]++' data.txt
to delete duplicate lines, but that only deletes lines which are identical in every column.
Please help me!

You need to extract the SOL identifier and group the lines of the file based on it. The command below uses the match() function to find the pattern SOL followed by digits in the current line, and substr() with RSTART/RLENGTH to store the matched text in the variable sol.
With the value in the variable, the usual !unique[sol]++ logic prints only the first line containing each identifier.
awk 'match($0, /SOL[[:digit:]]+/){ sol = substr($0, RSTART, RLENGTH); } !unique[sol]++'
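One caveat, as a hedge: if a line contains no SOL token at all, sol keeps its value from the previous line and the line could be dropped as a false duplicate. A sketch that falls back to the whole line as the key in that case:
awk '{ key = match($0, /SOL[[:digit:]]+/) ? substr($0, RSTART, RLENGTH) : $0 } !unique[key]++' data.txt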
Not saying perl is any better than the above, but you can do
perl -ne '/(SOL\d+)/; print unless $unique{$1}++' file

As your SOL field is not always at the same place, you first have to find it.
awk '{
  # take everything from where "SOL" starts
  end = substr($0, index($0, "SOL"))
  # cut at the next space, if any, to isolate the SOL token
  sp = index(end, " ")
  sol = sp ? substr(end, 1, sp - 1) : end
}
!seen[sol]++
' data.txt

You can do this with the same idea as your awk command (just some preprocessing to select the column to use in the seen array):
awk '{if($1 ~ /^SOL/){sol_kw=$1}else{sol_kw=$2}}!seen[sol_kw]++' <file>
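Equivalently, the column choice can be inlined in the array subscript; a compact sketch of the same idea:
awk '!seen[($1 ~ /^SOL/) ? $1 : $2]++' <file>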

Related

Print filenames & line number with number of fields greater than 'x'

I am running Ubuntu Linux. I need to print the filenames & line numbers of lines containing more than 7 columns. There are several hundred thousand files.
I am able to print the number of columns per file using awk. However, the output I am after is something like
file1.csv-463 which is to suggest file1.csv has more than 7 fields on line 463. I am using the awk command awk -F"," '{print NF}' * to print the number of fields across all files.
Please could I request help?
If you have GNU awk, try the following code. It simply checks whether NF is greater than 7; if so, it prints that file's name along with the line number, and nextfile moves straight on to the next Input_file, which saves time because the rest of the file need not be read.
awk -F',' 'NF>7{print FILENAME,FNR;nextfile}' *.csv
The above prints only the very first match per file; to get/print all matched lines, drop nextfile:
awk -F',' 'NF>7{print FILENAME,FNR}' *.csv
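If you want the exact file1.csv-463 output format from the question, a small variation of the same command does it:
awk -F',' 'NF>7{print FILENAME"-"FNR}' *.csv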
This might work for you (GNU sed):
sed -Ens 's/\S+/&/8;T;F;=;p' *.csv | paste - - -
s/\S+/&/8 succeeds only if the line has at least eight fields; if it fails, T branches to the end and the line is skipped.
Otherwise, output the file name (F), the line number (=) and print the current line (p).
Feed the output into a paste command which joins each group of three lines into one.
N.B. The -s option resets the line numbers for each file, without it, it will number each line for the entire input.
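One hedge for the "several hundred thousand files" part: a glob like *.csv can overflow the shell's argument-list limit. Letting find batch the files avoids that (a sketch, assuming the files sit in the current directory):
find . -maxdepth 1 -name '*.csv' -exec awk -F',' 'NF>7{print FILENAME"-"FNR; nextfile}' {} +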

Select first and last column using regex or linux command

I have a text file that looks something like this...
("oo" (set CANDRA-E-O 0) "ऊ")
("o" (set CANDRA-E-O ?ऑ) "ओ")
("oa" "ऑ")
("au" "औ")
I need to extract the first and last columns like:
"oo", "ऊ"
"o", "ओ"
"oa", "ऑ"
"au", "औ"
I have managed to extract the first column. But not sure how to select the second column.
\ {2}\(\".+\"\
With your shown samples/attempts, please try the following awk command, written and tested in GNU awk.
awk -v FPAT='"[^"]*"' '{for(i=1;i<=NF;i++){printf("%s%s",$i,i==NF?ORS:OFS)}}' Input_file
Explanation: The key is setting FPAT to '"[^"]*"', which defines what a field is (rather than what separates fields): each field runs from one " to the next ". Then the main program goes through all fields of each line and prints them, printing a newline after the last field of a line and OFS between fields (to get all of one line's values onto a single line).
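Since the expected output joins only the first and last quoted tokens with a comma, a small variation of the same FPAT idea (still GNU awk, as FPAT is gawk-specific) would be:
awk -v FPAT='"[^"]*"' '{print $1", "$NF}' Input_file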
With this awk solution:
awk -v OFS="," '{sub(/^\(/,"",$1);sub(/\)$/,"",$NF);print $1, $NF}' file
"oo","ऊ"
"o","ओ"
"oa","ऑ"
"au","औ"
With the first sub() we remove the opening parenthesis ( from the first field.
Likewise, the second sub() removes the closing parenthesis ) from the last field.
We print the two fields separated by a comma: OFS=","

How to filter the OTU by counts with AWK?

I am trying to filter out all the singletons from a FASTA file.
Here is my input file:
>OTU1;size=3;
ATTCCCCGGGGGGG
>OTU2;size=1;
ATCCGGGACTGATC
>OTU3;size=5;
GAACTATCGGGTAA
>OTU4;size=1;
AATTGGCCATCT
The expected output is:
>OTU1;size=3;
ATTCCCCGGGGGGG
>OTU3;size=5;
GAACTATCGGGTAA
I've tried
awk -F'>' '{if($1>=2) {print $0}' input.fasta > ouput.fasta
but this removes all the headers for each OTU.
Could anyone help me out?
Could you please try the following. It keeps any record whose size is at least 2, i.e. it drops the singletons:
awk -F'[=;]' '/^>/{flag=""} $3>=2{flag=1} flag' Input_file
$ awk '/>/{f=($0 ~ /=1;/)} !f' file
>OTU1;size=3;
ATTCCCCGGGGGGG
>OTU3;size=5;
GAACTATCGGGTAA
awk -v FS='[;=]' 'prev_sz>=2 && !/size/{print prev RS $0} /size/{prev=$0;prev_sz=$(NF-1)}'
>OTU1;size=3;
ATTCCCCGGGGGGG
>OTU3;size=5;
GAACTATCGGGTAA
Store the size from each header line in the prev_sz variable and the whole line in prev. On the following sequence line, check whether prev_sz >= 2; if so, print the previous (header) line and the current line. RS (the record separator, a newline) is used to print the new line between them.
While all the above methods work, they are limited by the fact that the input always has to look the same, i.e. the sequence name in your FASTA file needs to have the form:
>NAME;size=value;
A few solutions can handle slightly more elaborate sequence names, but none handle the fully generic case, i.e.
>NAME;label1=value1;label2=value2;STRING;label3=value3;
Print sequences where label xxx matches value vvv:
awk '/>/{f = ($0 ~ /;xxx=vvv;/)} f' file.fasta
Print sequences where label xxx has a numeric value bigger than q:
awk -v label="xxx" -v limit=q \
'BEGIN{ ere = ";" label "=" }
/>/{ f = 0; if (match($0, ere)) { value = 0 + substr($0, RSTART + RLENGTH); f = (value > limit) } }
f' <file>
In the above, ere is the regular expression we try to match; we use it to find the location of the value attached to label xxx. The extracted substring can have non-numeric characters after the value, but adding 0 to it converts it to a number, dropping everything from the first non-numeric character on (i.e. 3;label4=value4; is converted to 3). We check whether the value is bigger than our limit, and print the sequence based on that result.
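As a usage sketch, the original singleton filter from this question falls out of the generic version with label="size" and limit=1:
awk -v label="size" -v limit=1 'BEGIN{ere=";" label "="} />/{f=0; if (match($0, ere)) f = (0 + substr($0, RSTART + RLENGTH) > limit)} f' input.fasta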

check for value in csv file then print line with awk / sed

Is it possible to parse a .csv file and look for the 13th entry containing a particular value?
So data for example would be
10,1,a,bhd,5,7,10,,,8,9,3,19,0
I only want to extract lines which have a value of 3 in the 13th field, if that makes sense.
I tried it with a bash while loop using cut etc., but it was messy.
Not sure if there is an awk/sed method.
Thanks in advance.
This is beginner level awk.
awk -F, '$13==3' file
-F, sets the field separator to a comma, and $13 is the 13th field's value. For each line, if $13==3 evaluates true, the line is printed.
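Since the question also mentions sed: the same filter can be written as a regex that skips 12 comma-terminated fields and then requires the 13th to be exactly 3. A sketch, assuming GNU sed for -E:
sed -nE '/^([^,]*,){12}3(,|$)/p' file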

Keep only the line that is latest in the file and is a duplicate based on two fields

This is related to the questions
awk - Remove line if field is duplicate
sed/awk + regex delete duplicate lines where first field matches (ip address)
I have a file like this:
FOO,BAR,100,200,300
BAZ,TAZ,500,600,800
FOO,BAR,900,1000,1000
HERE,THERE,1000,200,100
FOO,BAR,100,10000,200
BAZ,TAZ,100,40,500
The duplicates are determined by the first two fields. In addition, the more "recent" record (lower in the file / higher line number) is the one that should be retained.
What is an awk script that will output:
BAZ,TAZ,100,40,500
FOO,BAR,100,10000,200
HERE,THERE,1000,200,100
Output order is not so important.
Explanation of awk syntax would be great.
This is easy in awk: we just need to feed an array whose key combines the 1st and 2nd columns, with the rest of the line as the value; later lines overwrite earlier ones, so the latest record for each key wins:
$ awk -F, '{a[$1","$2]=$3","$4","$5}END{for(i in a)print i,a[i]}' OFS=, file.txt
BAZ,TAZ,100,40,500
HERE,THERE,1000,200,100
FOO,BAR,100,10000,200
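Note that for (i in a) visits keys in an unspecified order, which is fine here since output order is not important. If you did want the surviving (latest) records in file order, a two-pass sketch reading the file twice works:
awk -F, 'NR==FNR{last[$1 FS $2]=FNR; next} FNR==last[$1 FS $2]' file.txt file.txt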
This might work for you (tac and GNU sort):
tac file | sort -sut, -k1,2
tac reverses the file so the most recent record for each key comes first; sort -u -t, -k1,2 then keeps only the first record seen for each key (fields 1 and 2), and -s makes the sort stable so that record is the one retained.