Delete everything before and after a string using sed - awk

I have data in a file in below form :--
The symbol ":=" was substituted for "QUERY" to continue.
SQL> _id : MS
itm : 4
it : NO
------
PL/SQL procedure successfully completed.
I want to delete everything before _id and after ------ .
To delete everything before _id I used following sed
sed '/_id/,$!d' 1.txt
It deletes the row before _id but doesnt delete SQL> .
Similarly I used below sed to delete all everything after ------ but it doesnt delete the rows below it
sed 's/\------.*/------/' 1.txt
Can someone help me where I am doing wrong ? What I need is :--
_id : MS
itm : 4
it : NO

Using sed
Try:
$ sed -n '/_id/,/------/{ s/.*_id/_id/; /------/q; p}' 1.txt
_id : MS
itm : 4
it : NO
How it works:
-n
This tells sed not to print unless we explicitly ask it to.
/_id/,/------/{...}
This selects a range of lines that start with a line containing _id and end with a line containing ------. For those lines the commands in curly braces are executed.
s/.*_id/_id/
On the line that contains _id, this removes everything before _id.
/------/q
On the line that contains ------, this tells sed to quit.
p
For lines that reach this command, this tells sed to print the line.
Using awk
$ awk -v RS='------' '/_id/{sub(/.*_id/, "_id"); print}' ORS="" 1.txt
_id : MS
itm : 4
it : NO
How it works:
-v RS='------'
This set ------ as the record separator.
/_id/{sub(/.*_id/, "_id"); print}
For any records that contain _id, we remove everything before _id and print the current record.
ORS=""
This sets the output record separator to the empty string.

Related

Count b or B in even lines

I need count the number of times in the even lines of the file.txt the letter 'b' or 'B' appears, e.g. for the file.txt like:
everyB or gbnBra
uitiakB and kanapB bodddB
Kanbalis astroBominus
I got the first part but I need to count these b or B letters and I do not know how to count them together
awk '!(NR%2)' file.txt
$ awk '!(NR%2){print gsub(/[bB]/,"")}' file
4
Could you please try following, one more approach with awk written on mobile will try it in few mins should work but.
awk -F'[bB]' 'NR%2 == 0{print (NF ? NF - 1 : 0)}' Input_file
Thanks to #Ed sir for solving zero matches found line problem in comments.
In a single awk:
awk '!(NR%2){gsub(/[^Bb]/,"");print length}' file.txt
gsub(/[^Bb]/,"") deletes every character in the line the line except for B and b.
print length prints the length of the resulting string.
awk '!(NR%2)' file.txt | tr -cd 'Bb' | wc -c
Explanation:
awk '!(NR%2)' file.txt : keep only even lines from file.txt
tr -cd 'Bb' : keep only B and b characters
wc -c : count characters
Example:
With file bellow, the result is 4.
everyB or gbnBra
uitiakB and kanapB bodddB
Kanbalis astroBominus
Here is another way
$ sed -n '2~2s/[^bB]//gp' file | wc -c

Sed/Awk: how to find and remove two lines if a pattern in the first line is being repeated; bash

I am processing text file(s) with thousands of records per file. Each record is made up of two lines: a header that starts with ">" and followed by a line with a long string of characters "-AGTCNR". The header has 10 fields separated by "|" whose first field is a unique identifier to each record e.g ">KEN096-15" and a record is termed duplicate if it has same identifier. Here is how a simple record look like:
>ACML500-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_-2
----TAAGATTTTGACTTCTTCCCCCATCATCAAGAAGAATTGT-------
>ACRJP458-10|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
-----------TCCCTTTAATACTAGGAGCCCCTGACATAGCCTTTCCTAAATAAT-----
>ASILO303-17|Dip|gs-Par|sp-Par vid|subsp-NA|co
-------TAAGATTCTGATTACTCCCCCCCTCTCTAACTCTTCTTCTTCTATAGTAGATG
>ASILO326-17|Dip|gs-Goe|sp-Goe par|subsp-NA|c
TAAGATTTTGATTATTACCCCCTTCATTAACCAGGAACAGGATGA---------------
>CLT100-09|Lep|gs-Col|sp-Col elg|subsp-NA|co-Buru
AACATTATATTTGGAATTT-------GATCAGGAATAGTCGGAACTTCTCTGAA------
>PMANL2431-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_
----ATGCCTATTATAATTGGAGGATTTGGAAAACCTTTAATATT----CCGAAT
>STBOD057-09|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
ATCTAATATTGCACATAGAGGAACCTCNGTATTTTTTCTCTCCATCT------TTAG
>TBBUT582-11|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
-----CCCCCTCATTAACATTACTAAGTTGAAAATGGAGCAGGAACAGGATGA
>TBBUT583-11|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
TAAGATTTTGACTCATTAA----------------AATGGAGCAGGAACAGGATGA
>AFBTB001-09|Col|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N
TAAGCTCCATCC-------------TAGAAAGAGGGG---------GGGTGA
>PMANL2431-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_
----ATGCCTATTAGGAAATTGATTAGTACCTTTAATATT----CCGAAT---
>AFBTB003-09|Col|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N
TAAGATTTTGACTTCTGC------CATGAGAAAGA-------------AGGGTGA
>AFBTB002-09|Cole|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N
-------TCTTCTGCTCAT-------GGGGCAGGAACAGGG----------TGA
>ACRJP458-10|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
-----------TCCCTTTAATACTAGGAGCCCCTTTCCT----TAAATAAT-----
Now I am triying to delete repeats, like duplicate records of "ACRJP458-10" and "PMANL2431-12".
Using a bash script I have extracted the unique identifiers and stored repeated ones in a variable "$duplicate_headers". Currently, I am trying to find any repeated instances of their two-line records and deleting them as follows:
for i in "$#"
do
unset duplicate_headers
duplicate_headers=`grep ">" $1 | awk 'BEGIN { FS="|"}; {print $1 "\n"; }' | sort | uniq -d`
for header in `echo -e "${duplicate_headers}"`
do
sed -i "/^.*\b${header}\b.*$/,+1 2d" $i
#sed -i "s/^.*\b${header}\b.*$//,+1 2g" $i
#sed -i "/^.*\b${header}\b.*$/{$!N; s/.*//2g; }" $i
done
done
The final result (with thousands of records in mind) will look like:
>ACML500-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_-2
----TAAGATTTTGACTTCTTCCCCCATCATCAAGAAGAATTGT-------
>ACRJP458-10|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
-----------TCCCTTTAATACTAGGAGCCCCTGACATAGCCTTTCCTAAATAAT-----
>ASILO303-17|Dip|gs-Par|sp-Par vid|subsp-NA|co
-------TAAGATTCTGATTACTCCCCCCCTCTCTAACTCTTCTTCTTCTATAGTAGATG
>ASILO326-17|Dip|gs-Goe|sp-Goe par|subsp-NA|c
TAAGATTTTGATTATTACCCCCTTCATTAACCAGGAACAGGATGA---------------
>CLT100-09|Lep|gs-Col|sp-Col elg|subsp-NA|co-Buru
AACATTATATTTGGAATTT-------GATCAGGAATAGTCGGAACTTCTCTGAA------
>PMANL2431-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_
----ATGCCTATTATAATTGGAGGATTTGGAAAACCTTTAATATT----CCGAAT
>STBOD057-09|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
ATCTAATATTGCACATAGAGGAACCTCNGTATTTTTTCTCTCCATCT------TTAG
>TBBUT582-11|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
-----CCCCCTCATTAACATTACTAAGTTGAAAATGGAGCAGGAACAGGATGA
>TBBUT583-11|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
TAAGATTTTGACTCATTAA----------------AATGGAGCAGGAACAGGATGA
>AFBTB001-09|Col|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N
TAAGCTCCATCC-------------TAGAAAGAGGGG---------GGGTGA
>AFBTB003-09|Col|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N
TAAGATTTTGACTTCTGC------CATGAGAAAGA-------------AGGGTGA
>AFBTB002-09|Cole|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N
-------TCTTCTGCTCAT-------GGGGCAGGAACAGGG----------TGA
$ awk -F'[|]' 'NR%2{f=seen[$1]++} !f' file
>ACML500-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_-2
----TAAGATTTTGACTTCTTCCCCCATCATCAAGAAGAATTGT-------
>ACRJP458-10|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
-----------TCCCTTTAATACTAGGAGCCCCTGACATAGCCTTTCCTAAATAAT-----
>ASILO303-17|Dip|gs-Par|sp-Par vid|subsp-NA|co
-------TAAGATTCTGATTACTCCCCCCCTCTCTAACTCTTCTTCTTCTATAGTAGATG
>ASILO326-17|Dip|gs-Goe|sp-Goe par|subsp-NA|c
TAAGATTTTGATTATTACCCCCTTCATTAACCAGGAACAGGATGA---------------
>CLT100-09|Lep|gs-Col|sp-Col elg|subsp-NA|co-Buru
AACATTATATTTGGAATTT-------GATCAGGAATAGTCGGAACTTCTCTGAA------
>PMANL2431-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_
----ATGCCTATTATAATTGGAGGATTTGGAAAACCTTTAATATT----CCGAAT
>STBOD057-09|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
ATCTAATATTGCACATAGAGGAACCTCNGTATTTTTTCTCTCCATCT------TTAG
>TBBUT582-11|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
-----CCCCCTCATTAACATTACTAAGTTGAAAATGGAGCAGGAACAGGATGA
>TBBUT583-11|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
TAAGATTTTGACTCATTAA----------------AATGGAGCAGGAACAGGATGA
>AFBTB001-09|Col|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N
TAAGCTCCATCC-------------TAGAAAGAGGGG---------GGGTGA
>AFBTB003-09|Col|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N
TAAGATTTTGACTTCTGC------CATGAGAAAGA-------------AGGGTGA
>AFBTB002-09|Cole|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N
-------TCTTCTGCTCAT-------GGGGCAGGAACAGGG----------TGA
To run it on multiple files at once would be this to remove duplicates across all files:
awk -F'[|]' 'FNR%2{f=seen[$1]++} !f' *
or this to only remove duplicates within each file:
awk -F'[|]' 'FNR==1{delete seen} FNR%2{f=seen[$1]++} !f' *

How do I check if input is one word or 2 separated by delimiter "-"

I need help with the following ksh script:
ExpResult=`echo "$LoadString" | awk -F"-" '{print NF}'=2`
MinExp=`echo "$ExpResult" | tr -s " " | sed 's/^[ ]//g'| cut -d"-" -f1`
MaxExp=`echo "$ExpResult" | tr -s " " | sed 's/^[ ]//g'| cut -d"-" -f2`
I can get an input as two options : "50-100" or "50" (for example)
I have two questions:
How do I check if the input is "one word" or two words separated by delimiter "-"?
If the input is two words, how can I separate them?
Rather than call an external program to parse your input, you can use the internal case statement to validate input and parameter expansion features to convert your input, i.e.
# set a copy/paste value for $1
set -- 50-10
case "$1" in
*-* )
range="$1"
min="${range%-*}"
max="${range#*-}"
;;
* )
singleNum="$1"
;;
esac
echo min=$min ... max=$max
output
min=50 ... max=100
Try for non-pair
unset min max
set -- other values
case ...
echo min= ... max= ... singleNum=$singleNum
output
min= ... max= ... singleNum=other
Hopefully the case processing is self-explanatory, but the parameter expansion may require a little explanation.
The statement
min=${range%-*}
says remove from the right side of the expanded value (50-100) anything starting at the last - until the end of the string. This leaves the value 50 remaining.
The reverse happens with
max=${range#*-}
Says remove from the left side of the expanded value anything up to the first - char. This leaves the 100.
As there is only one - char in this string, you don't need to worry about the other versions of ${var##*-} which says remove all from the left until the last match of -, and the reverse ${var%%-*} , remove all from right (backwards) until the very first - char.
The fanatical minimalists will remind us that this can be done without a temporary variable, i.e.
min=${1%-*} ; max=${1#*-}
And the one-line fantasists can be satisfied with
case "$1" in *-* ) range="$1";min="${range%-*}";max="${range#*-}";;* ) singleNum="$1";;esac; echo min=$min ... max=$max .,, singleNum=$singleNum
:-)
IHTH
you can try this;
LoadString=$1
MinExp=`echo "$LoadString" | awk -F"-" '{if (NF==2) print $1}`
MaxExp=`echo "$LoadString" | awk -F"-" '{if (NF==2) print $2}`
echo $MinExp
echo $MaxExp
eg:
user#host:/tmp/test$ ksh test.ksh 50-100
50
100

awk: how to delete first and last value on entire column

I have a data that is comprised of several columns. On one column I would like to delete two commas that are each located in beginning and the end of entire column. My data looks something like this:
a ,3,4,3,2,
b ,3,4,5,1,
c ,1,5,2,4,5,
d ,3,6,24,62,3,54,
Can someone teach me how to delete the first and last commas on this data? I would appreciate it.
$ awk '{gsub(/^,|,$/,"",$NF)}1' file
a 3,4,3,2
b 3,4,5,1
c 1,5,2,4,5
d 3,6,24,62,3,54
awk '{sub(/,/,"",$0); print substr($0,0,length($0)-1)}' input.txt
Output:
a 3,4,3,2,
b 3,4,5,1,
c 1,5,2,4,5,
d 3,6,24,62,3,54
You can do it with sed too:
sed -e 's/,//' -e 's/,$//' file
That says "substitue the first comma on the line with nothing" and then "substitute a comma followed by end of line with nothing".
If you want it to write a new file, do this:
sed -e 's/,//' -e 's/,$//' file > newfile.txt

awk: changing OFS without looping though variables

I'm working on an awk one-liner to substitute commas to tabs in a file ( and swap \\N for missing values in preparation for MySQL select into).
The following link http://www.unix.com/unix-for-dummies-questions-and-answers/211941-awk-output-field-separator.html (at the bottom) suggest the following approach to avoid looping through the variables:
echo a b c d | awk '{gsub(OFS,";")}1'
head -n1 flatfile.tab | awk -F $'\t' '{for(j=1;j<=NF;j++){gsub(" +","\\N",$j)}gsub(OFS,",")}1'
Clearly, the trailing 1 (can be a number, char) triggers the printing of the entire record. Could you please explain why this is working?
SO also has Print all Fields with AWK separated by OFS , but in that post it seems unclear why this is working.
Thanks.
Awk evaluates 1 or any number other than 0 as a true-statement. Since, true statements without the action statements part are equal to { print $0 }. It prints the line.
For example:
$ echo "hello" | awk '1'
hello
$ echo "hello" | awk '0'
$