Join between two patterns - awk

I have been searching for an answer to this but have not found anything that works. Here is what I am trying to accomplish: in a file I have lines that begin with a specific pattern, and sometimes there is a line between them and other times there is not. I am trying to join the line between the patterns to the first pattern line. Example below:
Current output:
Name: Doe John
Some Random String
Mailing Address: 1234 Street Any Town, USA
Note: The "Some Random String" line sometimes does not exist so the join would not be needed
Desired output:
Name: Doe John Some Random String
Mailing Address: 1234 Street Any Town, USA
I have tried sed and awk answers I have found on the net but cannot wrap my head around how to make this work. My sed and awk skills are very basic at this point so I don't quite understand some of the solutions even when explained.
Thanks for any help or a point to documentation that talks about what I am trying to accomplish.

Please try the following, written and tested with the shown samples in GNU awk.
awk '{printf("%s%s",FNR>1 && $0~/^Mailing/?ORS:OFS,$0)} END{print ""}' Input_file
Or, if you want to start a new line for both the Name and the Mailing lines, try the following.
awk '
{
printf("%s%s",FNR>1 && ($0~/^Mailing/ || $0 ~/Name:/)?ORS:OFS,$0)
}
END{
print ""
}
' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
{
printf("%s%s",FNR>1 && ($0~/^Mailing/ || $0 ~/Name:/)?ORS:OFS,$0)
##Using printf to print two strings. The 1st is either a newline or a space:
##if the line number is greater than 1 AND the line either starts with Mailing or contains Name:,
##print ORS (newline), otherwise print OFS (space). The 2nd string is the current line.
}
END{ ##Starting END block of this program from here.
print "" ##Printing new line here.
}
' Input_file ##Mentioning Input_file name here.
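For comparison, here is a minimal sketch of the same join that buffers each record instead of choosing a separator per line, which avoids the leading space the printf version emits before the first line. It assumes records start with Name: or Mailing Address:, as in the question; join_records is a hypothetical helper name, not something from the answer above.

```shell
# join_records: hypothetical helper, reads lines on stdin
join_records() {
  awk '
    /^(Name|Mailing Address):/ {   # a line that starts a new record
      if (buf != "") print buf     # flush the previous record, if any
      buf = $0
      next
    }
    { buf = (buf == "" ? $0 : buf " " $0) }   # continuation line: append with one space
    END { if (buf != "") print buf }          # flush the last record
  '
}

out=$(printf '%s\n' 'Name: Doe John' 'Some Random String' \
  'Mailing Address: 1234 Street Any Town, USA' | join_records)
printf '%s\n' "$out"
```

When the in-between line is missing, nothing is appended and the two pattern lines are printed unchanged.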

Another awk where you define the specific patterns:
$ awk '
BEGIN {
p["Name"] # define the specific patterns that start the record
p["Mailing Address"]
}
{
printf "%s%s",(split($0,t,":")>1&&(t[1] in p)&&NR>1?ORS:""),$0
}
END {
print "" # printf never adds a trailing newline, so terminate the last line here
}' file
Output on slightly modified data (the extra spaces come from your data; I didn't trim them):
Name: Doe John Some Random String
Mailing Address: 1234 Street Any Town, USA Using: but not specific pattern

How about a GNU sed solution:
sed '
/^Name:/{ ;# if the line starts with "Name:" enter the block
N ;# read the next line and append to the pattern space
:l1 ;# define a label "l1"
/\nMailing Address:/! {N; s/\n//; b l1} ;# if the next line does not start with "Mailing Address:"
;# then append next line, remove newline and goto label "l1"
}' file

This might work for you (GNU sed):
sed '/Name:/{:a;N;/Mailing Address:/!s/\s*\n\s*/ /;$!ta}' file
If a line contains Name: keep appending lines and replacing white space either side of the newline by a space, until the end of file or a line containing Mailing Address:.

Related

How to find and match an exact string in a column using AWK?

I'm having trouble on matching an exact string that I want to find in a file using awk.
I have the file called "sup_groups.txt" that contains:
(the structure is: "group_name:pw:group_id:user1<,user2>...")
adm:x:4:syslog,adm1
admins:x:1006:adm2,adm12,manuel
ssl-cert:x:122:postgres
ala2:x:1009:aceto,salvemini
conda:x:1011:giovannelli,galise,aceto,caputo,haymele,salvemini,scala,adm2,adm12
adm1Group:x:1022:adm2,adm1,adm3
docker:x:998:manuel
Now I want to extract the records whose user list contains the user "adm1" and print the first column (the group name). But there is also a user called "adm12", so when I do this:
awk -F: '$4 ~ "adm1" {print $1}' sup_groups.txt
the output is:
adm
admins
conda
adm1Group
The command of course also prints the records that contain the string "adm12", but I don't want those lines because I'm only interested in the user "adm1".
So, how can I change this command so that it prints just lines 1 and 6 (excluding 2 and 5)?
Thank you so much, and sorry for my bad English.
EDIT: Thank you for the answers, you gave me inspiration for a solution. I think this might work as well as your solutions while being simpler:
awk -F: '$4 ~ "adm,|adm1$|:adm1," {print $1}' sup_groups.txt
basically I'm using ORs covering all the cases and excluding the "adm12"
let me know if you think this is correct
1st solution: Using the split function of awk. With your shown samples, please try the following awk code.
awk -F':' '
{
num=split($4,arr,",")
for(i=1;i<=num;i++){
if(arr[i]=="adm1"){
print
}
}
}
' Input_file
Explanation: Adding detailed explanation for above.
awk -F':' ' ##Starting awk program from here setting field separator as : here.
{
num=split($4,arr,",") ##Using split to split 4th field into array arr with delimiter of ,
for(i=1;i<=num;i++){ ##Running for loop till value of num(total elements of array arr).
if(arr[i]=="adm1"){ ##Checking condition if arr[i] value is equal to adm1 then do following.
print ##printing current line here.
}
}
}
' Input_file ##Mentioning Input_file name here.
2nd solution: Using regex and conditions in awk.
awk -F':' '$4~/^adm1,/ || $4~/,adm1,/ || $4~/,adm1$/' Input_file
Or, if the 4th field may hold a single user with no comma at all, try the following:
awk -F':' '$4~/^adm1,/ || $4~/,adm1,/ || $4~/,adm1$/ || $4=="adm1"' Input_file
Explanation: Set the field separator to : and check whether the 4th field starts with adm1, (^adm1,), contains ,adm1, or ends with ,adm1 (,adm1$); if any of these match, print that line.
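The alternatives above can also be folded into one anchored regex. This is only a sketch of that idea using the sample data from the question; groups_with_adm1 is a hypothetical helper name, and it prints the group name ($1) as the question asked rather than the whole line.

```shell
# groups_with_adm1: hypothetical helper, reads group records on stdin
groups_with_adm1() {
  # (^|,) matches the start of the user list or a preceding comma,
  # (,|$) matches a following comma or the end of the list
  awk -F: '$4 ~ /(^|,)adm1(,|$)/ { print $1 }'
}

sample='adm:x:4:syslog,adm1
admins:x:1006:adm2,adm12,manuel
ssl-cert:x:122:postgres
ala2:x:1009:aceto,salvemini
conda:x:1011:giovannelli,galise,aceto,caputo,haymele,salvemini,scala,adm2,adm12
adm1Group:x:1022:adm2,adm1,adm3
docker:x:998:manuel'

out=$(printf '%s\n' "$sample" | groups_with_adm1)
printf '%s\n' "$out"
```

The same pattern also covers a list that consists of adm1 alone, so the extra $4=="adm1" test is not needed in this form.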
This should do the trick:
$ awk -F: '"," $4 "," ~ ",adm1," { print $1 }' file
The idea is to wrap the user list in commas so that every entry is delimited by a comma on both sides. So instead of searching for adm1 you search for ,adm1,
So if your list looks like:
adm2,adm12,manuel
and, by adding commas, you convert it to:
,adm2,adm12,manuel,
you can always search for ,adm1, and get only exact matches.
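A hedged variation on the comma-wrapping trick: pass the user name in with -v and compare with index(), which does a literal substring search, so regex metacharacters in a user name would not matter. members_of is a hypothetical helper name and the data is a subset of the sample from the question.

```shell
# members_of USER: hypothetical helper, reads group records on stdin
members_of() {
  # wrap both the list and the wanted user in commas, then search literally
  awk -F: -v user="$1" 'index("," $4 ",", "," user ",") { print $1 }'
}

sample='adm:x:4:syslog,adm1
admins:x:1006:adm2,adm12,manuel
conda:x:1011:giovannelli,galise,aceto,caputo,haymele,salvemini,scala,adm2,adm12
adm1Group:x:1022:adm2,adm1,adm3
docker:x:998:manuel'

out=$(printf '%s\n' "$sample" | members_of adm1)
printf '%s\n' "$out"
```

Calling members_of adm12 on the same data would select admins and conda instead.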
Once you set up FS per the task requirements, the main body becomes just:
NF = !_ < NF
or even more straightforward:
{m,n,g}awk -- --NF
=
{m,g}awk 'NF=!_<NF' OFS= FS=':[^:]*:[^:]*:[^:]*[^[:alpha:]]?adm[0-9]+.*$'
adm
admins
conda
adm1Group

Search through a markdown using sed

The Problem
I have multiple Markdown files in a folder, formatted like so...
# Cool Project
* Random Text
* Other information
TODO: This is a task
TODO: This is another task
And I've written a script that pulls out all the strings that start with TODO from all the files...
ag TODO: ~/myfolder/journal | sed 's/\(^.*:\)\(.*\)/TODO:\2 /g' | sed ''/TODO:/s//`printf "\033[35mTODO:\033[0m"`/'' | sed ''s/![a-zA-Z0-9]*/$(printf "\033[31;1m&\033[0m")/''
and this gives me an output like this
TODO: This is a task
TODO: This is another task
I was wondering if it would be possible to look backward from the pattern using sed to identify and pick up the line that starts with /^# / and append it to the end of the string... something like this
TODO: This is a task # Cool Project
TODO: This is another task # Cool Project
Using sed:
sed -n '/^#/h;/^TODO/{G;s/\n/ /p}' file
Search for lines beginning with # and copy them to the hold space (h). Then, when a line begins with TODO, append the hold space to the pattern space (G), replace the newline with a space, and print the result.
You could do this in a single awk. With your shown samples, please try the following, written and tested with GNU awk.
awk '/^# /{val=$0;next} /^TODO/{print $0,val}' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
/^# /{ ##If the line starts with a hash followed by a space, do the following.
val=$0 ##Store the current line in val.
next ##Skip the remaining rules for this line.
}
/^TODO/{ ##Checking condition if line starts with TODO then do following.
print $0,val ##Printing current line and val here.
}
' Input_file ##Mentioning Input_file name here.
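Since the question mentions several Markdown files, here is a sketch of the same awk idea run over more than one file, resetting the remembered heading at the start of each file so a TODO that appears before any heading does not inherit the previous file's title. The file names and the second file's contents are made up for illustration.

```shell
tmpdir=$(mktemp -d)
printf '%s\n' '# Cool Project' '* Random Text' 'TODO: This is a task' > "$tmpdir/a.md"
printf '%s\n' 'TODO: orphan task' '# Other Project' 'TODO: Write docs' > "$tmpdir/b.md"

out=$(awk '
  FNR == 1 { val = "" }                             # new file: forget the old heading
  /^# /    { val = $0; next }                       # remember the most recent heading
  /^TODO/  { print (val == "" ? $0 : $0 " " val) }  # append the heading when there is one
' "$tmpdir/a.md" "$tmpdir/b.md")

printf '%s\n' "$out"
rm -r "$tmpdir"
```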

Extract each word immediately preceded by an asterisk

I'm a computer science student and we were asked to use sed to extract words from the output of the lpoptions -l command, which looks like this:
PageSize/Page Size: Letter *A4 11x17 A3 A5 B5 Env10 EnvC5 EnvDL EnvISOB5 EnvMonarch Executive Legal
Resolution/Resolution: *default 150x150dpi 300x300dpi 600x600dpi 1200x600dpi 1200x1200dpi 2400x600dpi 2400x1200dpi 2400x2400dpi
InputSlot/Media Source: Default Tray1 *Tray2 Tray3 Manual
Duplex/Double-Sided Printing: DuplexNoTumble DuplexTumble *None
PreFilter/GhostScript pre-filtering: EmbedFonts Level1 Level2 *No
I need to get only the words preceded by a *, but I can't find how to do it with sed, I already did it using cut which is easier but I want to know it with sed.
I expect :
A4
default
Tray2
None
No
and I had tried :
sed -E 's/.*\*=(\S+).*/\1/'
but it didn't do anything.
With any POSIX sed (assuming there is always at least one non-space character following the asterisk):
sed 's/.*\*\([^[:space:]]*\).*/\1/'
With GNU sed it'd be:
sed -E 's/.*\*(\S+).*/\1/'
Given your sample they both output:
A4
default
Tray2
None
No
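One caveat with both commands: a line that contains no asterisk does not match the substitution, so sed prints it unchanged. A minimal sketch that prints only the extracted words uses -n together with the p flag. The ColorModel line below is a made-up example of an option without a marked default; it is not from the question.

```shell
out=$(printf '%s\n' \
  'PageSize/Page Size: Letter *A4 11x17 A3' \
  'ColorModel/Color Mode: Gray RGB' \
  'Duplex/Double-Sided Printing: DuplexNoTumble DuplexTumble *None' |
  sed -n 's/.*\*\([^[:space:]]*\).*/\1/p')   # -n plus p: print only lines where the substitution succeeded

printf '%s\n' "$out"
```

Because the leading .* is greedy, a line with more than one asterisk would yield the word after the last one.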
Please try the following, in case you are OK with an awk solution.
awk '{for(i=1;i<=NF;i++){if($i~/^\*/){sub(/^\*/,"",$i);print $i}}}' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
{
for(i=1;i<=NF;i++){ ##Looping through each field of the current line.
if($i~/^\*/){ ##If the current field starts with *, do the following.
sub(/^\*/,"",$i) ##Removing the leading * from the current field.
print $i ##Printing current field value here.
}
}
}
' Input_file ##Mentioning Input_file name here.

Replace a letter with another from the last word from the last two lines of a text file

How could I possibly replace a character with another, selecting the last word from the last two lines of a text file in shell, using only a single command? In my case, replacing every occurrence of a with E from the last word only.
Like, from a text file containing this:
tree;apple;another
mango.banana.half
monkey.shelf.karma
to this:
tree;apple;another
mango.banana.hElf
monkey.shelf.kErmE
I tried using sed -n 'tail -2 'mytext.txt' -r 's/[a]+/E/*$//' but it doesn't work (my error: sed expression #1, char 10: unknown option to 's).
Please try the following tac + awk solution, based only on the OP's samples.
tac Input_file |
awk 'FNR<=2{if(/;/){FS=OFS=";"};if(/\./){FS=OFS="."};$0=$0;gsub(/a/,"E",$NF)} 1' |
tac
Output with shown samples is:
tree;apple;another
mango.banana.hElf
monkey.shelf.kErmE
NOTE: Change gsub to sub in case you want to substitute only very first occurrence of character a in last field.
This might work for you (GNU sed):
sed -E 'N;${:a;s/a([^a.]*)$/E\1/mg;ta};P;D' file
Open a two line window throughout the length of the file by using the N to append the next line to the previous and the P and D commands to print then delete the first of these. Thus at the end of the file, signified by the $ address the last two lines will be present in the pattern space.
Using the m multiline flag on the substitution command, as well as the g global flag and a loop between :a and ta, replace any a in the last word (delimited by .) by an E.
Thus the first pass of the substitution command will replace the a in half and the last a in karma. The next pass will match nothing in the penultimate line and replace the a in karmE. The third pass will match nothing, so the ta command will not branch and the last two lines will be printed with the required changes.
If you want to use Sed, here's a solution:
tac input_file | sed -E '1,2{h;s/.*[^a-zA-Z]([a-zA-Z]+)/\1/;s/a/E/;x;s/(.*[^a-zA-Z]).*/\1/;G;s/\n//}' | tac
One small detail: in your question you say you want to replace a letter, but then you transform karma into kErmE, so which is it? If you meant to write kErma, then the command above will work; if you meant to write kErmE, then you have to change it just a bit: the s/a/E/ should become s/a/E/g.
With tac+perl
$ tac ip.txt | perl -pe 's/\w+\W*$/$&=~tr|a|E|r/e if $.<=2' | tac
tree;apple;another
mango.banana.hElf
monkey.shelf.kErmE
\w+\W*$ match last word in the line, \W* allows any possible trailing non-word characters to be matched as well. Change \w and \W accordingly if numbers and underscores shouldn't be considered as word characters - for ex: [a-zA-Z]+[^a-zA-Z]*$
$&=~tr|a|E|r change all a to E only for the matched portion
e flag to enable use of Perl code in replacement section instead of string
To do it in one command, you can slurp the entire input as single string (assuming this'll fit available memory):
perl -0777 -pe 's/\w+\W*$(?=(\n.*)?\n\z)/$&=~tr|a|E|r/gme'
Using GNU awk for the 4th arg to split(), since according to the comments on another solution the field delimiter is any sequence of non-alphanumeric characters:
$ gawk '
BEGIN {
pc=2 # previous counter, ie how many are affected
}
{
for(i=pc;i>=1;i--) # buffer to p hash, a FIFO
if(i==pc && (i in p)) # when full, output
print p[i]
else if(i in p) # and keep filling
p[i+1]=p[i] # above could be done using mod also
p[1]=$0
}
END {
for(i=pc;i>=1;i--) {
n=split(p[i],t,/[^a-zA-Z0-9\r]+/,seps) # split on non alnum
gsub(/a/,"E",t[n]) # replace
for(j=1;j<=n;j++) {
p[i]=(j==1?"":p[i] seps[j-1]) t[j] # pack it up
}
print p[i] # output
}
}' file
Output:
tree;apple;another
mango.banana.hElf
monkey.shelf.kErmE
Would this help you? With GNU awk:
$ cat file
tree;apple;another
mango.banana.half
monkey.shelf.karma
$ tac file | awk 'NR<=2{s=gensub(/(.*)([.;])(.*)$/,"\\3",1);gsub(/a/,"E",s); print gensub(/(.*)([.;])(.*)$/,"\\1\\2",1) s;next}1' | tac
tree;apple;another
mango.banana.hElf
monkey.shelf.kErmE
A more readable version:
$ tac file | awk 'NR<=2{
s=gensub(/(.*)([.;])(.*)$/,"\\3",1);
gsub(/a/,"E",s);
print gensub(/(.*)([.;])(.*)$/,"\\1\\2",1) s;
next
}1' | tac
With GNU awk you can set FS to match both separators, then use gsub for the replacement in $3, the third field, if NR>1 (this assumes a three-line file with three fields per line, as in the sample):
awk -v FS=";|[.]" 'NR>1 {gsub("a", "E",$3)}1' OFS="." file
tree;apple;another
mango.banana.hElf
monkey.shelf.kErmE
With GNU awk for the 3rd arg to match() and gensub():
$ awk -v n=2 '
NR>n { print p[NR%n] }
{ p[NR%n] = $0 }
END {
for (i=1; i<=n; i++) {
line = p[(NR+i)%n] # oldest buffered line first, whatever the total line count
match(line,/(.*[^[:alnum:]])(.*)/,a)
print a[1] gensub(/a/,"E","g",a[2])
}
}
' file
tree;apple;another
mango.banana.hElf
monkey.shelf.kErmE
or with any awk:
awk -v n=2 '
NR>n { print p[NR%n] }
{ p[NR%n] = $0 }
END {
for (i=1; i<=n; i++) {
line = p[(NR+i)%n] # oldest buffered line first, whatever the total line count
match(line,/.*[^[:alnum:]]/)
lastWord = substr(line,1+RLENGTH)
gsub(/a/,"E",lastWord)
print substr(line,1,RLENGTH) lastWord
}
}
' file
If you want to do it for the last 50 lines of a file instead of the last 2 lines just change -v n=2 to -v n=50.
The above assumes there are at least n lines in your input.
You can let sed repeat changing an a into E only for the last word with a label.
tac mytext.txt| sed -r ':a; 1,2s/a(\w*)$/E\1/; ta' | tac

use awk to split one file into several small files by pattern

I have read this post about using awk to split one file into several files:
and I am interested in one of the solutions provided by Pramod and jaypal singh:
awk '/^>chr/ {OUT=substr($0,2) ".fa"}; {print >> OUT; close(OUT)}' Input_File
Because I still cannot add comments, I am asking here.
If the input is
>chr22
asdgasge
asegaseg
>chr1
aweharhaerh
agse
>chr14
gasegaseg
How come it will result in three files:
chr22.fasta
chr1.fasta
chr14.fasta
As an example, in chr22.fasta:
>chr22
asdgasge
asegaseg
I understand the first part
/^>chr/ {OUT=substr($0,2) ".fa"};
and these commands:
/^>chr/ substr() close() >>
But I don't understand how awk splits the input with the second part:
{print >> OUT; close(OUT)}
Could anyone explain more details about this command? Thanks a lot!
Please go through the following and let me know if this helps you.
awk ' ##Starting awk program here.
/^>chr/{ ##If the line starts with the string >chr, do the following.
OUT=substr($0,2) ".fa" ##Set OUT to the current line from its 2nd character to the end, with .fa appended.
}
{
print >> OUT ##Append the current line to the file named by OUT.
close(OUT) ##Close that output file after each write, which avoids a "too many open files" error.
}' Input_File ##Mentioning Input_file name here.
You can also look up the functions used here in the awk man page, for example:
substr(s, i [, n]) Returns the at most n-character substring of s starting at i. If n is omitted, the rest of s is used.
The part you are asking questions about is a bit uncomfortable:
{ print $0 >> OUT; close(OUT) }
With this part, the awk program does the following for every line it processes:
Open the file OUT
Move the file pointer to the end of the file OUT
Append the line $0 followed by ORS to the file OUT
Close the file OUT
Why is this uncomfortable? Mainly because of the structure of your files. You should close a file only when you have finished writing to it, not after every write. Currently, if you have a fasta record of 100 lines, the file is opened and closed 100 times.
A better approach would be:
awk '/^>chr/{close(OUT); OUT=substr($0,2)".fasta" }
{print > OUT }
END {close(OUT)}' Input_File
Here we only open the file the first time we write to it and we close it when we don't need it anymore.
Note: the END statement is not really needed, since awk closes any open files when it exits.
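To see the splitting end to end, here is a small sketch that runs this approach on the sample input from the question inside a temporary directory. The workdir variable, the dir awk variable and the input.txt name are only for illustration, and the guard on out is an addition that skips any lines appearing before the first >chr header.

```shell
workdir=$(mktemp -d)
printf '%s\n' '>chr22' 'asdgasge' 'asegaseg' '>chr1' 'aweharhaerh' 'agse' \
  '>chr14' 'gasegaseg' > "$workdir/input.txt"

awk -v dir="$workdir" '
  /^>chr/ {
    if (out != "") close(out)             # close the previous output file, if any
    out = dir "/" substr($0, 2) ".fasta"  # header without the leading > plus .fasta
  }
  out != "" { print > out }               # every line goes to the current output file
' "$workdir/input.txt"

ls "$workdir"
cat "$workdir/chr22.fasta"
```

The header line itself is written too, because the second rule also runs for it once out has been set.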