I have a text file that looks something like this...
("oo" (set CANDRA-E-O 0) "ऊ")
("o" (set CANDRA-E-O ?ऑ) "ओ")
("oa" "ऑ")
("au" "औ")
I need to extract the first and last columns like:
"oo", "ऊ"
"o", "ओ"
"oa", "ऑ"
"au", "औ"
I have managed to extract the first column (see the regex below), but I am not sure how to select the second column.
\ {2}\(\".+\"\
With the samples and attempts you have shown, please try the following awk command, written and tested in GNU awk.
awk -v FPAT='"[^"]*"' '{for(i=1;i<=NF;i++){printf("%s%s",$i,i==NF?ORS:OFS)}}' Input_file
Explanation: setting FPAT to '"[^"]*"' defines a field by regex rather than by a separator: each field runs from one " to the next ". The main program then loops over every field of each line and prints it, emitting a newline (ORS) after the last field of a line and a space (OFS) between fields, so all of a line's quoted values end up on a single output line.
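Run on the sample input, this should print the quoted fields space-separated (OFS defaults to a single space):
"oo" "ऊ"
"o" "ओ"
"oa" "ऑ"
"au" "औ"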
With this awk solution:
awk -v OFS="," '{sub(/^\(/,"",$1);sub(/\)$/,"",$NF);print $1, $NF}' file
"oo","ऊ"
"o","ओ"
"oa","ऑ"
"au","औ"
The first sub() removes the opening parenthesis ( from the first field.
The second sub() likewise removes the closing parenthesis ) from the last field.
Then we print the two fields separated by a comma: OFS=","
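As a side note, the two approaches can be combined to produce exactly the comma-space format asked for (GNU awk required, because of FPAT), which should print:
awk -v FPAT='"[^"]*"' -v OFS=', ' '{print $1, $NF}' file
"oo", "ऊ"
"o", "ओ"
"oa", "ऑ"
"au", "औ"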
I have a large text file that looks like
some random : demo text for
illustration, can be long
and : some more
here is : another
one
I want an output like
some random : demo text for illustration, can be long
and : some more
here is : another one
I tried some strange, obviously faulty regex like %s/\w*\n/ /g but can't really get my head around it.
With the samples you have shown, please try the following awk code. It sets RS (the record separator) to the empty string, which puts awk into paragraph mode. This is based on your shown samples only.
awk -v RS="" '{$1=$1} 1' Input_file
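The same one-liner, spelled out with comments:
awk -v RS="" '   # RS="": paragraph mode, a record is a block of non-blank lines
  { $1 = $1 }    # touching a field rebuilds $0, joining all fields with OFS (a space)
  1              # print the rebuilt record
' Input_file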
Adding further solutions in case someone is looking for a printf-based approach with awk. The first solution above should be preferred IMHO; these are offered only as alternatives.
2nd solution: checks whether lines start with letters and, only then, joins them onto the previous line.
awk '{printf("%s%s",$0~/^[a-zA-Z]/?(FNR>1 && prev~/^[a-zA-Z]/?OFS:""):ORS,$0);prev=$0} END{print ""}' Input_file
3rd solution: Note: this will only work if the lines that start a record contain a colon, as in the shown samples.
awk '{printf("%s%s",$0~/:/?(FNR>1?ORS:""):OFS,$0)} END{print ""}' Input_file
Explanation: this uses awk's printf function with a condition: if the current line contains : and the line number is greater than 1, print ORS first (starting a new output line); for the very first line print nothing. If the line does not contain :, print OFS first, so the line is appended to the previous output line. The END block prints a final newline.
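The same logic, reformatted with comments:
awk '{
  # A line containing ":" starts a new record: print ORS (a newline) first,
  # except before the very first line. Otherwise print OFS (a space) so the
  # line is appended to the previous output line.
  printf("%s%s", $0 ~ /:/ ? (FNR > 1 ? ORS : "") : OFS, $0)
}
END { print "" }   # terminate the last output line
' Input_file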
I have a pipe delimited file like this
OLD|123432
NEW|232322
OLD|1234452
NEW|232324
OLD|656966
NEW|232325
I am trying to create a new file where I merge rows based on the value in the first column (OLD/NEW). The first column of the output file should have the new number and the second column the old number.
Output
232322|123432
232324|1234452
232325|656966
I looked at the answer here: How to merge every two lines into one from the command line?. I know it is not the exact solution, but I used it as a starting point and tried to make it work for this problem, but it throws a syntax error.
awk -F "|" 'NR%2{OFS = "|" printf "%s ",$0;next;}1'
You may use this awk:
awk 'BEGIN {FS=OFS="|"} $1 == "NEW" {print $2, old} $1 == "OLD" {old = $2}' file
232322|123432
232324|1234452
232325|656966
$0 holds the whole line; since the field separator is a pipe, the number you want is in the second column, $2.
If you want to use the remainder NR%2, another option is to store the value of the second column in a variable, for example v:
awk 'BEGIN{FS=OFS="|"} NR%2{v=$2;next;}{print $2,v}' file
Output
232322|123432
232324|1234452
232325|656966
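For reference, the first one-liner with comments:
awk '
  BEGIN { FS = OFS = "|" }        # read and write pipe-delimited fields
  $1 == "NEW" { print $2, old }   # on a NEW line, emit new|old
  $1 == "OLD" { old = $2 }        # remember the OLD number for the next NEW line
' file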
I have a file that looks like this
ON,111111,TEN000812,Super,7483747483,767,Free
ON,262762,BOB747474,SuperMan,4347374,676,Free
ON,454644,FRED84848,Super Man,65757,555,Free
I need to match the values in the fourth column exactly as they are written. So if I am searching for "Super" I need it to return the line with "Super" only.
ON,111111,TEN000812,Super,7483747483,767,Free
Likewise, if I'm looking for "Super Man" I need that exact line returned.
ON,454644,FRED84848,Super Man,65757,555,Free
I have tried using grep, but grep will match all instances that contain Super. So if I do this:
grep -i "Super" file.txt
It returns all lines, because they all contain "Super"
ON,111111,TEN000812,Super,7483747483,767,Free
ON,262762,BOB747474,SuperMan,4347374,676,Free
ON,454644,FRED84848,Super Man,65757,555,Free
I have also tried awk, and I believe I'm close, but when I do:
awk '$4==Super' file.txt
I still get output like this:
ON,111111,TEN000812,Super,7483747483,767,Free
ON,262762,BOB747474,SuperMan,4347374,676,Free
I have been at this for hours, and any help would be greatly appreciated at this point.
You were close, or I should say very close: just set the field delimiter to a comma in your solution and you are all set.
awk 'BEGIN{FS=","} $4=="Super"' Input_file
Also, one more thing about the OP's attempt: when comparing the 4th field with a string value, the string must be wrapped in " (otherwise awk treats Super as an uninitialized variable).
OR, in case you want to pass the value to be compared as an awk variable, try the following:
awk -v value="Super" 'BEGIN{FS=","} $4==value' Input_file
You are quite close actually; you can try:
awk -F, '$4=="Super" {print}' file.txt
I find this form easier to grasp, though it is slightly longer than @RavinderSingh13's.
-F sets the field separator, in this case a comma.
Next you have a condition followed by an action.
The condition checks whether the fourth field is exactly the string Super.
If it is, the action prints the line.
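If you would rather stay with grep, here is a sketch that anchors the fourth comma-separated field (assuming fields never contain embedded commas and the target field is not the last one):
grep -E '^([^,]*,){3}Super,' file.txt
ON,111111,TEN000812,Super,7483747483,767,Free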
I have a fragment of a text file (the whole file is huge):
114303 SOL1443
114311 SOL679
114316 SOL679
114432 SOL1156
114561 SOL122
114574 SOL2000
114952 SOL3018
115597 SOL609
115864 SOL2385
115993 SOL3448
SOL2 61571
SOL3 87990
SOL4 96242
SOL5 6329
SOL5 16550
SOL9 84894
SOL9 84911
SOL12 91985
SOL15 85816
I need to write a script which deletes lines that have a duplicate SOL number. It doesn't matter whether the SOL is in the first or in the second column.
For example, if the text contains
115993 SOL269
SOL269 84911
12373 SOL269
then my script should delete the second and third lines:
SOL269 84911
12373 SOL269
I know that in awk I can use
awk '!seen[$0]++' data.txt
to delete duplicate lines, but it only deletes lines that are identical in every column.
Please help me!
You need to extract the SOL value and group the contents of the file by it. The command below uses the match() function to find the pattern SOL followed by digits in the current line and stores the matched substring in the variable sol.
Now, with the value in that variable, the condition !unique[sol]++ lists only the first line seen for each SOL token.
awk 'match($0, /SOL[[:digit:]]+/){ sol = substr($0, RSTART, RLENGTH); } !unique[sol]++'
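The same command, spelled out:
awk '
  match($0, /SOL[[:digit:]]+/) {        # locate the SOL token anywhere on the line
    sol = substr($0, RSTART, RLENGTH)   # extract it via RSTART/RLENGTH set by match()
  }
  !unique[sol]++                        # print only the first line for each token
' file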
Not saying perl is any better than the above, but you can do
perl -ne '/(SOL\d+)/; print unless $unique{$1}++' file
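Note that if a line contained no SOL token at all, $1 would keep its value from the previous successful match; a variant that avoids relying on that:
perl -ne 'print unless /(SOL\d+)/ && $unique{$1}++' file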
As your SOL field is not always at the same place, you first have to find it.
awk '{
  end = substr($0, index($0, "SOL"))             # from "SOL" to the end of the line
  sol = substr(end, 1, index(end " ", " ") - 1)  # the SOL token itself
}
!seen[sol]++
' data.txt
You can do this with the same idea as your awk command (just some preprocessing to select the column to use in the seen array):
awk '{if($1 ~ /^SOL/){sol_kw=$1}else{sol_kw=$2}}!seen[sol_kw]++' <file>
I am trying to filter out all the singletons from a fasta file.
Here is my input file:
>OTU1;size=3;
ATTCCCCGGGGGGG
>OTU2;size=1;
ATCCGGGACTGATC
>OTU3;size=5;
GAACTATCGGGTAA
>OTU4;size=1;
AATTGGCCATCT
The expected output is:
>OTU1;size=3;
ATTCCCCGGGGGGG
>OTU3;size=5;
GAACTATCGGGTAA
I've tried
awk -F'>' '{if($1>=2) {print $0}' input.fasta > ouput.fasta
but this removes all the headers for each OTU.
Could anyone help me out?
Could you please try the following:
awk -F'[=;]' '/^>/{flag=""} $3>=2{flag=1} flag' Input_file
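The same one-liner with comments:
awk -F'[=;]' '
  /^>/    { flag = "" }   # reset the flag at every header line
  $3 >= 2 { flag = 1 }    # with FS="[=;]", $3 is the size value on header lines
  flag                    # while set, print the header and its sequence lines
' Input_file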
awk '/>/{f=/=1;/}!f' file
>OTU1;size=3;
ATTCCCCGGGGGGG
>OTU3;size=5;
GAACTATCGGGTAA
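Spelled out:
awk '
  />/ { f = /=1;/ }   # on each header, f becomes 1 if the line contains "=1;" (a singleton)
  !f                  # print headers and sequence lines while f is 0
' file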
awk -v FS='[;=]' 'prev_sz>=2 && !/size/{print prev RS $0} /size/{prev=$0;prev_sz=$(NF-1)}'
>OTU1;size=3;
ATTCCCCGGGGGGG
>OTU3;size=5;
GAACTATCGGGTAA
Store the size from each header line in the prev_sz variable and the whole line in prev. On the following sequence line, check whether the stored size is >= 2; if so, print the previous (header) line and the current line. RS (a newline by default) separates them in the output.
While all the above methods work, they are limited by the fact that the input always has to look the same. That is, the sequence name in your fasta file needs to have the form:
>NAME;size=value;
A few solutions can handle slightly more flexible sequence names, but none handle the case where things get a bit more generic, i.e.
>NAME;label1=value1;label2=value2;STRING;label3=value3;
Print sequence where label xxx matches value vvv:
awk '/>/{f = /;xxx=vvv;/}f' file.fasta
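For example, with the sample file above, printing the record whose header contains ;size=3;:
awk '/>/{f = /;size=3;/}f' file.fasta
>OTU1;size=3;
ATTCCCCGGGGGGG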
Print sequence where label xxx has a numeric value p bigger than q:
awk -v label="xxx" -v limit=q \
'BEGIN{ere=";" label "="}
/>/{ f=0; if (match($0, ere)) { value = 0+substr($0, RSTART+RLENGTH); f = (value>limit) } }
f' <file>
In the above, ere is the regular expression we try to match; it locates the value attached to label xxx. The extracted substring may have non-numeric characters after the value, but adding 0 converts it to a number, discarding everything from the first non-numeric character on (e.g. 3;label4=value4; becomes 3). We then check whether the value is bigger than our limit and print the sequence based on that result.
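For instance, applied to the sample file to keep every sequence whose size is greater than 2 (which matches the expected output above):
awk -v label="size" -v limit=2 \
'BEGIN{ere=";" label "="}
/>/{ f=0; if (match($0, ere)) { value = 0+substr($0, RSTART+RLENGTH); f = (value>limit) } }
f' input.fasta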