Bash how to split file on empty line with awk - awk

I have a text file (A.in) and I want to split it into multiple files. The split should occur everytime an empty line is found. The filenames should be progressive (A1.in, A2.in, ..)
I found this answer that suggests using awk, but I can't make it work with my desired naming convention
awk -v RS="" '{print $0 > $1".txt"}' file
I also found other answers telling me to use the command csplit -l but I can't make it match empty lines, I tried matching the pattern '' but I am not that familiar with regex and I get the following
bash-3.2$ csplit A.in ""
csplit: : unrecognised pattern
Input file:
A.in
4
RURDDD
6
RRULDD
KKKKKK
26
RRRULU
Desired output:
A1.in
4
RURDDD
A2.in
6
RRULDD
KKKKKK
A3.in
26
RRRULU

Another fix for the awk:
$ awk -v RS="" '{
split(FILENAME,a,".") # separate name and extension
f=a[1] NR "." a[2] # form the filename, use NR as number
print > f # output to file
close(f) # in case there are MANY to avoid running out f fds
}' A.in

In any normal case, the following script should work:
awk 'BEGIN{RS=""}{ print > ("A" NR ".in") }' file
The reason why this might fail is most likely due to some CRLF terminations (See here and here).
As mentioned by James, making it a bit more robust as:
awk 'BEGIN{RS=""}{ f = "A" NR ".in"; print > f; close(f) }' file
If you want to use csplit, the following will do the trick:
csplit --suppress-matched -f "A" -b "%0.2d.in" A.in '/^$/' '{*}'
See man csplit for understanding the above.

Input file content:
$ cat A.in
4
RURDDD
6
RRULDD
KKKKKK
26
RRRULU
AWK file content:
BEGIN{
n=1
}
{
if(NF!=0){
print $0 >> "A"n".in"
}else{
n++
}
}
Execution:
awk -f ctrl.awk A.in
Output:
$ cat A1.in
4
RURDDD
$ cat A2.in
6
RRULDD
KKKKKK
$ cat A3.in
26
RRRULU
PS: One-liner execution without AWK file:
awk 'BEGIN{n=1}{if(NF!=0){print $0 >> "A"n".in"}else{n++}}' A.in

Related

How can I print only lines that are immediately preceeded by an empty line in a file using sed?

I have a text file with the following structure:
bla1
bla2
bla3
bla4
bla5
So you can see that some lines of text are preceeded by an empty line.
I understand that sed has the concept of two buffers, a pattern space buffer and a hold space buffer, so I'm guessing these need to come in to play here, but I'm unclear how to specify them to accomplish what I need.
In my contrived example above, I'd expect to see the following lines outputted:
bla3
bla5
sed is for doing s/old/new on individual lines, that is all. Any time you start talking about buffers or doing anything related to multi-lines comparisons you're using the wrong tool.
You could do this with awk:
$ awk -v RS= -F'\n' 'NR>1{print $1}' file
bla3
bla5
but it would fail to print the first non-empty line if the first line(s) in the file were empty so this may be what you want if you want lines of all space chars considered to be empty lines:
$ awk 'NF && !p{print} {p=NF}' file
bla3
bla5
and this otherwise:
$ awk '($0!="") && (p==""){print} {p=$0}' file
bla3
bla5
All of the above will work even if there are multiple empty lines preceding any given non-empty line.
To see the difference between the 3 approaches (which you won't see given the sample input in the question):
PS1> printf '\nfoo\n \nbar\n\netc\n' | cat -E
$
foo$
$
bar$
$
etc$
PS1> printf '\nfoo\n \nbar\n\netc\n' | awk -v RS= -F'\n' 'NR>1{print $1}'
etc
PS1> printf '\nfoo\n \nbar\n\netc\n' | awk 'NF && !p{print} {p=NF}'
foo
bar
etc
PS1> printf '\nfoo\n \nbar\n\netc\n' | awk '($0!="") && (p==""){print} {p=$0}'
foo
etc
You can use the hold buffer easily to print the line before the blank like this:
sed -n -e '/^$/{x; p;}' -e h input
But I don't see an easy way to use it for your use case. For your case, instead of using the hold buffer, you could do:
sed -n -e '/^$/ba' -e d -e :a -e n -e p input
But I would do this with awk.
awk 'NR!=1{print $1}' RS= FS=\\n input-file
awk 'p;{p=/^$/}' file
above command does these for each line:
if p is 1, print line;
if line is empty, set p to 1.
if lines consisting of one or more spaces are also considered empty:
awk 'p;{p=!NF}' file
to print non-empty lines each coming right after an empty line, you can use this:
awk 'p*!(p=/^$/)' file
if p is 1 and this line is not empty (1*!(0) = 1*1 = 1), print this line;
otherwise (1*!(1) = 1*0 = 0, 0*anything = 0), don't print anything.
note that this one may not work with all awks, a portable version of this would look like:
awk 'p*(/./);{p=/^$/}' file
if lines consisting of one or more spaces are also considered empty:
awk 'p*NF;{p=!NF}' file
see them online here, and here.
If sed/awk is not mandatory, you can do it with grep:
grep -A 1 '^$' input.txt | grep -v -E '^$|--'
You can use sed to match a range of lines and do sub-matches inside the matches, like so:
# - use the "-n" option to omit printing of lines
# - match lines between a blank line (/^$/) and a non-blank one (/^./),
# then print only the line that contains at least a character,
# i.e, the non-blank line.
sed -ne '
/^$/,/^./ {
/^./{ p; }
}' input.txt
tested by gnu sed, your data in 'a':
$ sed -nE '/^$/{N;s/\n(.+)/\1/p}' a
bla3
bla5
add -i option precedes -n to real editing

How to avoid the return of system command with awk?

When I use awk with system command like this :
awk 'BEGIN{ if ( system("wc -l file_1") == 0 ) {print "something"} }' text.txt >> file_1
the result of system command is writen in my file file_1 :
0 file_1
something
How to avoid that? or just to redirect the output?
You appear to be under the impression that the output of the system() function includes the stdout of the command it runs. It does not.
If you want to test only for the existence of a non-zero-sized file, you might do it using the test command (on POSIX systems):
awk '
BEGIN{
if ( system("test -s file_1") ) { # a return value of 0 is "false" to awk
print "something"
}
}' text.txt >> file_1

Exact string match in awk

I have a file test.txt with the next lines
1997 100 500 2010TJ
2010TJXML 16 20 59
I'm using the next awk line to get information only about string 2010TJ
awk -v var="2010TJ" '$0 ~ var {print $0}' test.txt
But the code print the two lines. I want to know how to get the line containing the exact string
1997 100 500 2010TJ
the string can be placed in any column of the file.
Several options:
Use a gawk word boundary (not POSIX awk...):
$ gawk '/\<2010TJ\>/' file
An actual space or tab or what is separating the columns:
$ awk '/^2010TJ /' file
Or compare the field directly to the string:
$ awk '$1=="2010TJ"' file
You can loop over the fields to test each field if you wish:
$ awk '{for (i=1;i<=NF;i++) if ($i=="2010TJ") {print; next}}' file
Or, given your example of setting a variable, those same using a variable:
$ gawk -v s=2010TJ '$0~"\\<" s "\\>"'
$ awk -v s=2010TJ '$0~"^" s " "'
$ awk -v s=2010TJ '$1==s'
Note the first is a little different than the second and third. The first is the standalone string 2010TJ anywhere in $0; the second and third is a string that starts with that string.
Try this (for testing only column 1) :
awk '$1 == "2010TJ" {print $0}' test.txt
or grep like (all columns) :
gawk '/\<2010TJ\>/ {print $0}' test.txt
Note
\< \> is word boundarys
another awk with word boundary
awk '/\y2010TJ\y/' file
note \y matches either beginning or end of a word.

match two pattern in log file and output as table

I need to match two patterns in a log file and need to get the output as a table if possible. The log file has several lines with the words I want to match, here is an example of the log file:
Seed for random set to: uuzTjCqMVRk=
--out /home/ALL/ADRL.GLND.FET-EnhA
--max-shift False
--min-shift False
p-value = 0.542
Seed for random set to: P2+shGCxj70=
--out /home/ALL/BLD.CD14.MONO-EnhA
--max-shift False
--min-shift False
p-value = 0.737
I would like to get an output like this (tab delimited to export as text file):
Group Pvalue
ADRL.GLND.FET-EnhA 0.542
BLD.CD14.MONO-EnhA 0.737
I would like to do it in bash if it is possible
EDIT:
This is what I have tried:
grep 'out' file.log | awk '{print $0}' > file1.txt
grep 'p-value' file.log | awk '{print $0}' > file2.txt
paste -d"\t" file1.txt file2.txt > pval.txt
$ awk -F'[/ ]' -v OFS='\t' 'BEGIN{print "Group","Pvalue"} (NR%7)==2{g=$NF} (NR%7)==6{print g, $NF}' file
Group Pvalue
ADRL.GLND.FET-EnhA 0.542
BLD.CD14.MONO-EnhA 0.737
or if you prefer:
$ awk -F'[/ ]+' -v OFS='\t' 'BEGIN{print "Group","Pvalue"} $2=="--out"{g=$NF} $1=="p-value"{print g, $NF}' file
Group Pvalue
ADRL.GLND.FET-EnhA 0.542
BLD.CD14.MONO-EnhA 0.737
Zero error-checking:
awk '/--out/ { sub(".*/","",$2);printf "%s\t",$2; } /p-value = / { print $3; }' < file.log
If a line has --out, prints the base name of the path followed by a tab. If a line has p-value =, prints the number and a newline.
awk is nice, in this case, because you can modify the lines you match. Thinking in terms of grep, you'd have to deploy additional tools (like sed) to get the parts you wanted, then reassemble them into a useful form. Your use of grep and paste is valiant, and with tweaking would work, at the cost of many more processes and deployed tools.
You could do this in one bigger block of awk pattern matching, which would be more bullet-proof. I'll leave as exercise for the reader.

extract and truncate characters in unix

I am using -cut -c12-16 command in an awk script but it is not working or may be I am not writing properly. The characters between 12 and 16 are variable and I want to take them out of the line from a file which starts with 999999.
Using awk:
$ awk '/^YYYYYY/ { print substr($0,1,12) substr($0,17); next }1' file
YYYYYY9999990519
Update:
$ cat file
YYYYYY651006178045E46178D
YYYYYY6510617ESTN5258534
YYYYYY999999621409112ET0
YYYYYY99999949234091EA201
$ awk '/^YYYYYY999999/ { print substr($0,1,12) substr($0,17); next }1' file
YYYYYY651006178045E46178D
YYYYYY6510617ESTN5258534
YYYYYY99999909112ET0
YYYYYY9999994091EA201
Through GNU awk,
$ echo 'YYYYYY99999920120519' | awk '/^YYYYYY999999/{$0=gensub(/^(.{12}).{4}/,"\\1","g")}1'
YYYYYY9999990519