How to use grep and sed simultaneously using pipe - awk

I have 2 files
File 1
TRINITY_DN10039_c1_g1_i1 216 Brassica rapa
TRINITY_DN10270_c0_g1_i1 233 Pan paniscus
TRINITY_DN10323_c0_g1_i2 209 Corynebacterium aurimucosum ATCC 700975
.
.
TRINITY_DN10462_c0_g1_i1 257 Helwingia himalaica
TRINITY_DN10596_c0_g1_i1 205 Homo sapiens
TRINITY_DN10673_c0_g2_i2 323 Anaerococcus prevotii DSM 20548
File 2
TRINITY_DN9856_c0_g1_i1 len=467 path=[0:0-466]
GATGCGGGCCAATATGAATGTGAGATTACTAATGAATTGGGGACTAAAAA
TRINITY_DN9842_c0_g1_i1 len=208 path=[0:0-207]
AAGTAATTTTATATCACTTGTTACATCGCAATTCGTGAGTTAAACTTAAT
.
.
TRINITY_DN9897_c0_g1_i1 len=407 path=[0:0-406]
AACTTTATTAACTTGTTGTACATATTTATTAATGCAAATACATATAGAG
TRINITY_DN9803_c0_g1_i1 len=795 path=[0:0-794]
AACTAAGACAAACTTCGCGGAGCAGTTAGAAAATATTACAAGAGATTTG
I want to delete two lines in file2 (the matching line and the line after it) wherever the line matches a first-column word from file1. My attempt:
awk '{print $1}' file1 | sed '/here_i_want_to_insert_output_of_pipe/{N;d;}' file2

If the first field contains no regex-special characters, such as . or / or [ or ( or \, your idea is actually not that bad:
sed "$(cut -d' ' -f1 file1 | sed 's#.*#/&/{N;d}#')" file2
cut -d' ' -f1 file1 - extracts the first field from file1
| sed 's#.*#/&/{N;d}#' - turns each field into a sed command:
.* - matches the whole line, i.e. one first-field value from file1
/&/{N;d} - the & is substituted with the whole matched text, so each field becomes /<first field>/{N;d}
The generated script is then wrapped in sed "<here>" file2.
A not-so-well-known feature: you can use another delimiter character for /regex/ with the syntax \<char>regex<char>, like \!regex!. Below I use ~:
sed "$(cut -d' ' -f1 file1 | sed 's#.*#\\~&~{N;d}#')" file2
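As a minimal sketch of the generated-program idea (the AAA/BBB keys and /tmp file names below are invented for the demo):

```shell
# Two toy files: keys in f1, header+sequence pairs in f2.
printf 'AAA 1\nBBB 2\n' > /tmp/f1.demo
printf 'AAA x\nseq1\nCCC y\nseq2\n' > /tmp/f2.demo

# The inner pipeline turns every first field into a sed command /key/{N;d}.
prog=$(cut -d' ' -f1 /tmp/f1.demo | sed 's#.*#/&/{N;d}#')
printf '%s\n' "$prog"      # /AAA/{N;d} and /BBB/{N;d}, one per line

# Feeding the generated program back to sed deletes each matching pair.
sed "$prog" /tmp/f2.demo   # only the CCC pair survives
rm -f /tmp/f1.demo /tmp/f2.demo
```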
If you do have special characters in the first field, and you don't care about preserving the order: you can fold each pair of lines in file2 into a single line with some magic separator (I chose ! below), sort that and sort file1, and then just join them. The -v2 option makes join output the unpairable lines from the second file, i.e. the lines that did not match. After that, restore the newline by replacing the magic separator ! with a newline:
join -v2 <(cut -d' ' -f1 file1 | sort) <(sed 'N;s/\n/!/' file2 | sort -k1) |
tr '!' '\n'
If the output needs to be sorted as in file2, you can number lines in file2 and re-sort the output on line numbers:
join -11 -22 -v2 <(cut -d' ' -f1 file1 | sort) <(sed 'N;s/\n/!/' file2 | nl -w1 | sort -k2) |
sort -k2 | cut -d' ' -f1,3- | tr '!' '\n'
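The same join -v2 trick on a toy pair of files (keys and sequences invented for the demo; this needs bash or another shell with process substitution):

```shell
# keys.demo holds the first-column words; data.demo holds header+sequence pairs.
printf 'AAA 1\n' > /tmp/keys.demo
printf 'AAA h1\nseq1\nBBB h2\nseq2\n' > /tmp/data.demo

# Fold each pair into one line on '!', drop pairs whose key appears in
# keys.demo (join -v2 keeps only unpairable lines), then unfold again.
join -v2 <(cut -d' ' -f1 /tmp/keys.demo | sort) \
         <(sed 'N;s/\n/!/' /tmp/data.demo | sort -k1,1) |
tr '!' '\n'   # only the BBB pair remains
rm -f /tmp/keys.demo /tmp/data.demo
```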
Tested on repl

I would do something like this with a single awk, unless file1 is really, really huge:
awk 'NR==FNR{a[$1]++; next}counter{counter--}$1 in a{counter=2}!counter' <file1> <file2>
Input :
file1
TRINITY_DN10039_c1_g1_i1 216 Brassica rapa
TRINITY_DN10270_c0_g1_i1 233 Pan paniscus
TRINITY_DN10323_c0_g1_i2 209 Corynebacterium aurimucosum ATCC 700975
hello
TRINITY_DN10462_c0_g1_i1 257 Helwingia himalaica
TRINITY_DN10596_c0_g1_i1 205 Homo sapiens
TRINITY_DN10673_c0_g2_i2 323 Anaerococcus prevotii DSM 20548
file2 :
TRINITY_DN9856_c0_g1_i1 len=467 path=[0:0-466]
GATGCGGGCCAATATGAATGTGAGATTACTAATGAATTGGGGACTAAAAA
TRINITY_DN9842_c0_g1_i1 len=208 path=[0:0-207]
AAGTAATTTTATATCACTTGTTACATCGCAATTCGTGAGTTAAACTTAAT
TRINITY_DN9897_c0_g1_i1 len=407 path=[0:0-406]
AACTTTATTAACTTGTTGTACATATTTATTAATGCAAATACATATAGAG
hello
world
TRINITY_DN9803_c0_g1_i1 len=795 path=[0:0-794]
AACTAAGACAAACTTCGCGGAGCAGTTAGAAAATATTACAAGAGATTTG
Output :
TRINITY_DN9856_c0_g1_i1 len=467 path=[0:0-466]
GATGCGGGCCAATATGAATGTGAGATTACTAATGAATTGGGGACTAAAAA
TRINITY_DN9842_c0_g1_i1 len=208 path=[0:0-207]
AAGTAATTTTATATCACTTGTTACATCGCAATTCGTGAGTTAAACTTAAT
TRINITY_DN9897_c0_g1_i1 len=407 path=[0:0-406]
AACTTTATTAACTTGTTGTACATATTTATTAATGCAAATACATATAGAG
TRINITY_DN9803_c0_g1_i1 len=795 path=[0:0-794]
AACTAAGACAAACTTCGCGGAGCAGTTAGAAAATATTACAAGAGATTTG

I would do this with process substitution like so:
while read -r -d '' line; do
sed -i "/^${line}/{N;d;}" file2
done < <(awk '{printf "%s\0", $1}' file1 | sed 's|[][\\/.*^$]|\\&|g')
The reason for delimiting by null bytes rather than newlines is that null bytes cannot occur in the data, so the loop stays robust even if a field contains whitespace or other unusual characters.
Edit:
Updated to quote special characters with \ so sed won't malfunction.
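The quoting step on its own, for anyone wondering what that sed does (the sample string here is made up):

```shell
# Every metacharacter in the class ][\/.*^$ gets a backslash in front,
# so the field can later be used safely as a sed address.
printf 'TRINITY.DN1[x]\n' | sed 's|[][\\/.*^$]|\\&|g'
# TRINITY\.DN1\[x\]
```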

Related

Find the second word delimited by space or comma then insert strings before and after

I have a file containing lines like TABLE schema.table, and I want to put strings around the table name to build a command like MARK string REJECT.
The file contains many lines such as
TABLE SCHEMA.MYTAB, etc. etc....
or
TABLE SCHEMA.MYTAB , etc. etc....
The result is
MARK SCHEMA.MYTAB REJECT
..etc
I have
grep TABLE dirx/myfile.txt | awk -F, '{print $1}' | awk '{print $2}' | sed -e 's/^/MARK /' |sed -e 's/$/ REJECT/'
It works, but can this be tidier? I think I can combine the awk and sed into single commands, but I'm not sure how.
Maybe:
awk '/^TABLE/ {gsub(/,.*$/, ""); print "MARK " $2 " REJECT"}' dirx/myfile.txt
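For instance, fed both comma variants from the question (file contents inlined here for the demo):

```shell
# The gsub strips everything from the first comma, so $2 is the bare
# table name whether or not a space precedes the comma.
printf 'TABLE SCHEMA.MYTAB, etc. etc....\nother line\nTABLE SCHEMA.OTHER , etc. etc....\n' |
awk '/^TABLE/ {gsub(/,.*$/, ""); print "MARK " $2 " REJECT"}'
# MARK SCHEMA.MYTAB REJECT
# MARK SCHEMA.OTHER REJECT
```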

Only output line if value in specific column is unique

Input:
line1 a gh
line2 a dd
line3 c dd
line4 a gg
line5 b ef
Desired output:
line3 c dd
line5 b ef
That is, I want to output a line only if no other line has the same value in column 2. I thought I could do this with a combination of sort (e.g. sort -k2,2 input) and uniq, but it appears that uniq can only skip columns from the left (-f avoids comparing the first N fields). Surely there is some straightforward way to do this with awk or something.
You can do this as a two-pass awk script:
awk 'NR==FNR{a[$2]++;next} a[$2]<2' file file
This runs through the file once incrementing a counter in an array whose key is the second field of each line, then runs through a second time printing only those lines whose counter is less than 2.
You'd need multiple reads of the file because at any point during the first read, you can't possibly know whether there will be another instance of the second field of that line later in the file.
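Run against the sample input (NR==FNR is only true while the first copy of the file is being read, since FNR resets for each file):

```shell
printf 'line1 a gh\nline2 a dd\nline3 c dd\nline4 a gg\nline5 b ef\n' > /tmp/uniq.demo
# First pass counts column-2 values; second pass prints rows counted once.
awk 'NR==FNR{a[$2]++;next} a[$2]<2' /tmp/uniq.demo /tmp/uniq.demo
# line3 c dd
# line5 b ef
rm -f /tmp/uniq.demo
```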
Here is a one pass awk solution:
awk '{a1[$2]++;a2[$2]=$0} END{for (a in a1) if (a1[a]==1) print a2[a]}' file
The original order of the file will be lost however.
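For example, piping the sample input through it (output sorted here, because the traversal order of for (a in a1) is unspecified):

```shell
# a1 counts occurrences of column 2; a2 remembers the last full line per value.
printf 'line1 a gh\nline2 a dd\nline3 c dd\nline4 a gg\nline5 b ef\n' |
awk '{a1[$2]++;a2[$2]=$0} END{for (a in a1) if (a1[a]==1) print a2[a]}' |
sort
# line3 c dd
# line5 b ef
```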
You can combine awk, grep, sort and uniq for a quick one-liner:
grep -v "^[^ ]* $(awk '{print $2}' input.txt | sort | uniq -d) " input.txt
Edit: escape the matched values so they are not interpreted as regexes (avoiding problems with \+ and backreferences):
grep -v "^[^ ]* $(awk '{print $2}' input.txt | sort | uniq -d | sed 's/[^+0-9]/\\&/g') " input.txt
An alternative to awk, to demonstrate that this can still be done with sort and uniq (there is the -u option for this); however, setting up the right format requires some juggling (the decorate/do stuff/undecorate pattern).
$ paste file <(cut -d' ' -f2 file) | sort -k2 | uniq -uf3 | cut -f1
line5 b ef
line3 c dd
As a side effect, you lose the original order, which can also be recovered if you add line numbers...

Delete multiple strings/characters in a file

I have curl output similar to the sample below, and I'm working on a sed/awk script to eliminate the unwanted strings.
File
{id":"54bef907-d17e-4633-88be-49fa738b092d","name":"AA","description","name":"AAxxxxxx","enabled":true}
{id":"20000000000000000000000000000000","name":"BB","description","name":"BBxxxxxx","enabled":true}
{id":"542ndf07-d19e-2233-87gf-49fa738b092d","name":"AA","description","name":"CCxxxxxx","enabled":true}
{id":"20000000000000000000000000000000","name":"BB","description","name":"DDxxxxxx","enabled":true}
......
I like to modify this file and retain similar below,
AA AAxxxxxx
BB BBxxxxxx
AA CCxxxxxx
BB DDxxxxxx
AA n.....
BB n.....
Is there a way I could remove the words/commas/colons in between, so that I retain only these values?
Try this awk
curl your_command | awk -F\" '{print $(NF-9),$(NF-3)}'
Or:
curl your_command | awk -F\" '{print $7,$13}'
A semantic approach using perl:
curl your_command | perl -lane '/"name":"(\w+)".*"name":"(\w+)"/;print $1." ".$2'
For any number of name occurrences:
curl your_command | perl -lane 'printf $_." " for ( $_ =~ /"name":"(\w+)"/g);print ""'
This might work for you (GNU sed):
sed -r 's/.*("name":")([^"]*)".*\1([^"]*)".*/\2 \3/p;d' file
This extracts the fields following the two name keys and prints them if successful.
Alternatively, using simple pattern matching:
sed -r 's/.*:.*:"([^"]*)".*:"([^"]*)".*:.*/\1 \2/p;d' file
In this particular case, you could do
awk -F ":|," '{print $4,$7}' file2 |tr -d '"'
and get
AA AAxxxxxx
BB BBxxxxxx
AA CCxxxxxx
BB DDxxxxxx
Here, the field separator is either : or ,; we print the fourth and seventh fields (because all lines have the wanted entries in these two fields), and finally we use tr to delete the double quotes, because you don't want to keep them.
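Checking the field positions on one of the sample lines (copied from the question): splitting on : or , makes "AA" field 4 and "AAxxxxxx" field 7.

```shell
# Fields after splitting on : or , are
# 1={id"  2="…id…"  3="name"  4="AA"  5="description"  6="name"  7="AAxxxxxx" …
printf '{id":"54bef907-d17e-4633-88be-49fa738b092d","name":"AA","description","name":"AAxxxxxx","enabled":true}\n' |
awk -F ":|," '{print $4,$7}' | tr -d '"'
# AA AAxxxxxx
```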

list 3rd column of a file with spaces only

for listing 3rd column I am using
awk '{print $3}' inputfile.txt
and its output looks like
abc
xyz
lmn
pqr
But I need output like
abc xyz lmn pqr
How can I get this?
This might work for you (GNU sed):
sed -r 's/((\S*)\s){3}.*/\2/;1h;1!H;$!d;x;y/\n/ /' file
or more easily:
cut -d' ' -f3 file | paste -sd' '
print will always append a newline (actually, it will use ORS value). If you want more control, you can use printf:
awk '{printf "%s ", $3}'
This will also print an extra space character at the end, but for most use-cases this extra space is harmless.
Transliterate linefeeds into spaces
... | tr '\n' ' '
Use the awk Output Record Separator variable.
awk -v ORS=' ' '{print $3}' inputfile.txt
Avoiding adding a space to the beginning or end of the line:
awk '{printf "%s%s", fs, $3; fs=FS} END{print ""}' file
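Comparing the two printf variants on a small input (the sample rows are invented):

```shell
# Plain version: leaves a trailing space at the end of the line.
printf '1 2 abc\n3 4 xyz\n5 6 lmn\n' | awk '{printf "%s ", $3}'; echo

# fs starts out empty, so the separator is only printed from the second
# field onwards: no leading or trailing space.
printf '1 2 abc\n3 4 xyz\n5 6 lmn\n' |
awk '{printf "%s%s", fs, $3; fs=FS} END{print ""}'
# abc xyz lmn
```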

How to awk pattern over two consecutive lines?

I am trying to do something which I guess could be done very easily, but I can't seem to find the answer. I want to use awk to pick out lines between two patterns, but I also want the pattern to match two consecutive lines. I have tried to find the solution on the Internet, but perhaps I did not search for the right keywords. An example will describe this better.
Suppose I have the following file called test:
aaaa
bbbb
SOME CONTENT 1
ddddd
fffff
aaaa
cccc
SOME CONTENT 2
ccccc
fffff
For example, let's say I would like to find "SOME CONTENT 1".
Then I would use awk like this:
cat test | awk '/aaa*/ { show=1} show; /fff*/ {show=0}'
But that is not what I want. I want somehow to enter the pattern:
aaaa*\nbbbb*
And the same for the end pattern. Any suggestions on how to do this?
You can use this:
awk '/aaa*/ {f=1} /bbb*/ && f {show=1} show; /fff*/ {show=f=0}' file
bbbb
SOME CONTENT 1
ddddd
fffff
If pattern1 is aaa* then set flag f
If pattern2 is bbb* and flag f is true, then set the show flag
If you also need to print pattern1 (the aaa* line), try:
awk '/aaa*/ {f=$0} /bbb*/ && f {show=1;$0=f RS $0} show; /fff*/ {show=f=0}' file
aaaa
bbbb
SOME CONTENT 1
ddddd
fffff
If every record ends with fffff, and GNU awk is available, you could do something like this:
$ awk '/aaa*\nbbbb*/' RS='fffff' file
aaaa
bbbb
SOME CONTENT 1
ddddd
Or if you want just SOME CONTENT 1 to be visible, you can do:
$ awk -F $'\n' '/aaa*\nbbbb*/{print $3}' RS='fffff' file
SOME CONTENT 1
I searched for two patterns and checked that they were consecutive using line numbers; having the line numbers lets sed insert a line between them, i.e. right after the first line/pattern.
awk '$0 ~ "Encryption" {print NR} $0 ~ "Bit Rates:1" {print NR}' /tmp/mainscan | while read line1; do read line2; echo "$(($line2 - 1)) $line1"; done > /tmp/this
while read line
do
    pato=$(echo $line | cut -f1 -d' ')
    patt=$(echo $line | cut -f2 -d' ')
    if [[ "$pato" = "$patt" ]]; then
        inspat=$((patt + 1))
        sed -i "${inspat}iESSID:##" /tmp/mainscan
        sed -i 's/##/""/g' /tmp/mainscan
    fi
done < /tmp/this