extract and truncate characters in unix - awk

I am using the cut -c12-16 command in an awk script, but it is not working, or maybe I am not writing it properly. The characters in positions 12 to 16 vary, and I want to remove them from each line of a file that starts with 999999.

Using awk:
$ awk '/^YYYYYY/ { print substr($0,1,12) substr($0,17); next }1' file
YYYYYY9999990519
Update:
$ cat file
YYYYYY651006178045E46178D
YYYYYY6510617ESTN5258534
YYYYYY999999621409112ET0
YYYYYY99999949234091EA201
$ awk '/^YYYYYY999999/ { print substr($0,1,12) substr($0,17); next }1' file
YYYYYY651006178045E46178D
YYYYYY6510617ESTN5258534
YYYYYY99999909112ET0
YYYYYY9999994091EA201

Through GNU awk,
$ echo 'YYYYYY99999920120519' | awk '/^YYYYYY999999/{$0=gensub(/^(.{12}).{4}/,"\\1","g")}1'
YYYYYY9999990519
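The same cut can be parameterised so that the prefix and the character range are not hard-coded. This is a minimal sketch in plain POSIX awk; the variable names `prefix`, `start` and `len` and the sample file are assumptions for illustration, not part of the answers above.

```shell
# Hypothetical sample data: only the second line carries the prefix.
tmp=$(mktemp)
printf '%s\n' 'YYYYYY6510617ESTN5258534' 'YYYYYY99999920120519' > "$tmp"

# Drop `len` characters beginning at position `start` on lines that begin with `prefix`.
out=$(awk -v prefix='YYYYYY999999' -v start=13 -v len=4 '
    index($0, prefix) == 1 {        # literal prefix test, no regex involved
        $0 = substr($0, 1, start - 1) substr($0, start + len)
    }
    1                               # print every line, changed or not
' "$tmp")
printf '%s\n' "$out"
```

With `start=13` and `len=4` this reproduces the `substr($0,1,12) substr($0,17)` split used above.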

Related

capture last line of file as integer variable and use in awk command

I am trying to capture the last line of a file as a variable for use in an awk command.
Here is an example of the file (the end of it) :
cat file.txt
....
phylum:Chlorophyta 1
phylum:Mucoromycota 1
column 6:
superkingdom:Eukaryota 99
column 7:
99
I want to use that '99' as an integer in an awk command, saving it as a variable,
tail -n1 file.txt
99
e.g.
div=$(tail -n1 file.txt)
echo $div
99
To be used in a 2nd file (conf.txt), to divide the numbers in the 2nd field:
cat conf.txt
Class 88
Family 78
Genus 44
Species 23
BUT, when I try to use the $div variable in the awk command (using -v flag as suggested here and elsewhere with awk when taking a variable) I get this error:
awk -v a=$div '{print $2/a}' conf.txt
awk: can't open file {print $2/a}
source line number 1
But when saving 99 as a variable directly on the command line, it works just fine:
num=99
awk -v a=$num '{print $2/a}' conf.txt
0.888889
0.787879
0.444444
0.232323
Are there extra spaces/characters in the capture from tail -1? I am missing something simple, but fundamental.
Ultimately, I don't even want to save it as a separate variable first if I don't have to; instead, I would just capture that last-line number (99) and put it directly into an awk command, e.g.:
awk '{print $2/[tail -1 file.txt]}' conf.txt
This is pseudocode (in the brackets), but it is ultimately what I'd want.
Thanks for any help!
There's a space at the beginning of the last line, so the command is becoming
awk -v a= 99 '{print $2/a}' conf.txt
This is setting a to an empty string, treating 99 as the awk script, and the rest as filenames.
Remove the spaces from $div.
div=${div// /}
Use quotes as a habit in the shell.
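To make both suggestions concrete, here is a small sketch that trims the whitespace at capture time and quotes the expansion. The file contents are made up to mirror the question, and `tr -d '[:space:]'` is just one of several ways to trim.

```shell
dir=$(mktemp -d)
printf '%s\n' 'column 7:' '   99' > "$dir/file.txt"     # last line has leading blanks
printf '%s\n' 'Class 88' 'Family 78' > "$dir/conf.txt"

# Trim all whitespace from the captured line, then quote the expansion.
div=$(tail -n1 "$dir/file.txt" | tr -d '[:space:]')
out=$(awk -v a="$div" '{ print $2 / a }' "$dir/conf.txt")
printf '%s\n' "$out"

# The capture can also be inlined, which avoids the intermediate variable.
inline=$(awk -v a="$(tail -n1 "$dir/file.txt")" '{ print $2 / a }' "$dir/conf.txt")
```

The inlined form works without trimming because the quoted value reaches awk as a single argument.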
Given:
cat file
blah blah
 99
The command n=$(tail -n1 file) produces leading spaces in front of the 99:
n=$(tail -n1 file)
printf "\"%s\"\n" "$n"
" 99"
This bug bites especially when you think you are checking the value of $n but leave off the quotes, because the shell strips the leading spaces before invoking echo.
Consider:
echo $n # no quotes - leading spaces stripped
99
echo "$n" # preserve whitespace...
 99
Now if you try and pass that argument without quotes to awk, the space has meaning to the shell and screws up how the command is interpreted:
awk -v n=$n 'BEGIN{printf "\"%s\", %s\n", n, n+1}'
awk: fatal: cannot open file `BEGIN{printf "\"%s\", %s\n", n, n+1}' for reading: No such file or directory
vs:
awk -v n="$n" 'BEGIN{printf "\"%s\", %s\n", n, n+1}'
" 99", 100
If you want awk to replace the use of tail, use the FNR==NR idiom to test whether the current file is the first file, and $1==$1+0 to test whether awk interprets what it sees as a number:
awk 'FNR==NR {n=$1+0==$1 ? $1+0 : n; next} # n ends up being the last number seen
$2==$2+0{print $2/n}
' file conf.txt
0.888889
0.787879
0.444444
0.232323
Rather than have the shell call a command to get the last line of file.txt, save it in a shell variable, and then pass that value to awk as an awk variable, just use one call to awk:
$ awk 'NR==FNR{n=$1; next} {print $2/n}' file.txt conf.txt
0.888889
0.787879
0.444444
0.232323
Enabling debug mode and running the awk command:
$ set -x
$ awk -v a=$div '{print $2/a}' conf.txt
+ awk -v a= 99 '{print $2/a}' conf.txt
awk: fatal: cannot open file `{print $2/a}' for reading: No such file or directory
Of interest:
-v a= - define awk variable a as being empty
99 - awk code/script
'{print $2/a}' - first file passed to awk script, and the source of the error message
As others have pointed out you can get around the error by wrapping $div in double quotes:
$ awk -v a="$div" '{print $2/a}' conf.txt
+ awk -v 'a= 99' '{print $2/a}' conf.txt
0.888889
0.787879
0.444444
0.232323
Of interest:
-v 'a= 99' - define awk variable a as the string ' 99'
in this case awk ignores the leading spaces because the rest of the value can be interpreted as a number
'{print $2/a}' - awk code/script
conf.txt - file passed to awk script
Barmar and dawg have addressed stripping the blanks from div and using awk for the entire process, respectively.
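The remark about awk ignoring the spaces can be checked in isolation. In this quick sketch (the variable name `a` follows the question, the rest is illustrative), the string value keeps its leading blank while arithmetic sees a plain number.

```shell
out=$(awk -v a=' 99' 'BEGIN {
    # %s shows the string value, a + 0 forces the numeric value
    printf "[%s] length=%d a+0=%d\n", a, length(a), a + 0
}')
printf '%s\n' "$out"
```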

How to delete top and last non empty lines of the file

I want to delete the first and last non-empty lines of the file.
Example:
cat test.txt
//blank_line
abc
def
xyz
//blank_line
qwe
mnp
//blank_line
Then output should be:
def
xyz
//blank_line
qwe
I have tried with commands
sed "$(awk '/./{line=NR} END{print line}' test.txt)d" test.txt
to remove the last non-empty line. Here there are two commands, (1) sed and (2) awk, but I want to do it with a single command.
Reading the whole file in memory at once with GNU sed for -E and -z:
$ sed -Ez 's/^\s*\S+\n//; s/\n\s*\S+\s*$/\n/' test.txt
def
xyz
qwe
or with GNU awk for multi-char RS:
$ awk -v RS='^$' '{gsub(/^\s*\S+\n|\n\S+\s*$/,"")} 1' test.txt
def
xyz
qwe
Both GNU tools accept \s and \S as shorthand for [[:space:]] and [^[:space:]] respectively and GNU sed accepts the non-POSIX-sed-standard \n as meaning newline.
This is a double pass method:
awk '(NR==FNR) { if(NF) {t=FNR;if(!h) h=FNR}; next}
(h<FNR && FNR<t)' file file
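The integers h and t keep track of the head and the tail, i.e. the line numbers of the first and last non-empty lines. With if(NF), lines that contain only blanks also count as empty. To be stricter and treat only zero-length lines as empty, replace if(NF) with if(length($0)).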
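For readability, the same two-pass idea can be spelled out with longer names. This is a sketch on the question's sample data; `first` and `last` are illustrative names standing in for h and t.

```shell
f=$(mktemp)
printf '%s\n' '' 'abc' 'def' 'xyz' '' 'qwe' 'mnp' '' > "$f"

out=$(awk '
    NR == FNR {                       # first pass: only record line numbers
        if (NF) {                     # line has at least one non-blank field
            if (!first) first = FNR   # first non-empty line
            last = FNR                # keeps advancing to the last non-empty line
        }
        next
    }
    FNR > first && FNR < last         # second pass: print strictly between them
' "$f" "$f")
printf '%s\n' "$out"
```

Reading the file twice keeps memory use constant, at the cost of needing a real file rather than a pipe.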
This one reads everything into memory and does a simple replace at the end:
$ awk '{b=b RS $0}
END{ sub(/^[[:blank:]\n]*[^\n]+\n/,"",b);
sub(/\n[^\n]+[[:blank:]\n]*$/,"",b);
print b }' file
A single-pass, fast and relatively memory-efficient approach utilising a buffer:
awk 'f {
if(NF) {
printf "%s",buf
buf=""
}
buf=(buf $0 ORS)
next
}
NF {
f=1
}' file
Here is a golfed version of @kvantour's solution:
$ awk 'NR==(n=FNR){e=!NF?e:n;b=!b?e:b}b<n&&n<e' file{,}
This might work for you (GNU sed):
sed -E '0,/\S/d;H;$!d;x;s/.(.*)\n.*\S.*/\1/' file
Use a range to delete up to and including the first line containing a non-space character. Then copy the remainder of the file into the hold space and, at the end of the file, use a substitution to remove the last line containing a non-space character and any empty lines that follow it.
Alternative:
sed '0,/\S/d' file | tac | sed '0,/\S/d'| tac

Bash how to split file on empty line with awk

I have a text file (A.in) and I want to split it into multiple files. The split should occur every time an empty line is found. The filenames should be progressive (A1.in, A2.in, ..)
I found this answer that suggests using awk, but I can't make it work with my desired naming convention
awk -v RS="" '{print $0 > $1".txt"}' file
I also found other answers telling me to use the command csplit -l but I can't make it match empty lines, I tried matching the pattern '' but I am not that familiar with regex and I get the following
bash-3.2$ csplit A.in ""
csplit: : unrecognised pattern
Input file:
A.in
4
RURDDD
6
RRULDD
KKKKKK
26
RRRULU
Desired output:
A1.in
4
RURDDD
A2.in
6
RRULDD
KKKKKK
A3.in
26
RRRULU
Another fix for the awk:
$ awk -v RS="" '{
split(FILENAME,a,".") # separate name and extension
f=a[1] NR "." a[2] # form the filename, use NR as number
print > f # output to file
close(f) # in case there are MANY files, to avoid running out of fds
}' A.in
In any normal case, the following script should work:
awk 'BEGIN{RS=""}{ print > ("A" NR ".in") }' file
The reason why this might fail is most likely due to some CRLF terminations (See here and here).
As mentioned by James, making it a bit more robust as:
awk 'BEGIN{RS=""}{ f = "A" NR ".in"; print > f; close(f) }' file
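If CRLF line endings are the culprit, one option is to strip the carriage returns before awk sees the data. This is a hedged sketch that assumes the A<n>.in naming from the question and uses `tr -d '\r'` as the clean-up step; the temporary directory and the `dir` variable are only there to keep the example self-contained.

```shell
dir=$(mktemp -d)
# Sample input with Windows line endings and one empty separator line.
printf '4\r\nRURDDD\r\n\r\n6\r\nRRULDD\r\n' > "$dir/A.in"

tr -d '\r' < "$dir/A.in" | awk -v dir="$dir" 'BEGIN { RS = "" } {
    f = dir "/A" NR ".in"   # A1.in, A2.in, ...
    print > f
    close(f)                # keep the number of open files small
}'
ls "$dir"
```

Without the `tr` step, the separator line would contain a lone carriage return, so paragraph mode would not see it as empty.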
If you want to use csplit, the following will do the trick:
csplit --suppress-matched -f "A" -b "%0.2d.in" A.in '/^$/' '{*}'
See man csplit for understanding the above.
Input file content:
$ cat A.in
4
RURDDD
6
RRULDD
KKKKKK
26
RRRULU
AWK file content:
BEGIN{
n=1
}
{
if(NF!=0){
print $0 >> "A"n".in"
}else{
n++
}
}
Execution:
awk -f ctrl.awk A.in
Output:
$ cat A1.in
4
RURDDD
$ cat A2.in
6
RRULDD
KKKKKK
$ cat A3.in
26
RRRULU
PS: One-liner execution without AWK file:
awk 'BEGIN{n=1}{if(NF!=0){print $0 >> "A"n".in"}else{n++}}' A.in

Why does awk not filter the first column in the first line of my files?

I've got a file with following records:
depots/import/HDN1YYAA_15102018.txt;1;CAB001
depots/import/HDN1YYAA_20102018.txt;2;CLI001
depots/import/HDN1YYAA_20102018.txt;32;CLI001
depots/import/HDN1YYAA_25102018.txt;1;CAB001
depots/import/HDN1YYAA_50102018.txt;1;CAB001
depots/import/HDN1YYAA_65102018.txt;1;CAB001
depots/import/HDN1YYAA_80102018.txt;2;CLI001
depots/import/HDN1YYAA_93102018.txt;2;CLI001
When I execute following oneliner awk:
cat lignes_en_erreur.txt | awk 'FS=";"{ if(NR==1){print $1}}END {}'
the output is not the expected:
depots/import/HDN1YYAA_15102018.txt;1;CAB001
whereas I am supposed to get only the first column.
If I run it through all the records:
cat lignes_en_erreur.txt | awk 'FS=";"{ if(NR>0){print $1}}END {}'
then it will start filtering only after the second line and I get the following output:
depots/import/HDN1YYAA_15102018.txt;1;CAB001
depots/import/HDN1YYAA_20102018.txt
depots/import/HDN1YYAA_20102018.txt
depots/import/HDN1YYAA_25102018.txt
depots/import/HDN1YYAA_50102018.txt
depots/import/HDN1YYAA_65102018.txt
depots/import/HDN1YYAA_80102018.txt
depots/import/HDN1YYAA_93102018.txt
Does anybody know why awk is skipping only the first line?
I tried deleting the first record, but the behaviour is the same: it skips the first line.
First, it should be
awk 'BEGIN{FS=";"}{ if(NR==1){print $1}}END {}' filename
You can omit the END block if it is empty:
awk 'BEGIN{FS=";"}{ if(NR==1){print $1}}' filename
You can use the -F command line argument to set the field delimiter:
awk -F';' '{if(NR==1){print $1}}' filename
Furthermore, awk programs consist of a sequence of CONDITION [{ACTIONS}] elements, you can omit the if:
awk -F';' 'NR==1 {print $1}' filename
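The reason the original misbehaves: FS=";" in pattern position is an assignment that awk evaluates for every record, after that record has already been split into fields. The first line is therefore split with the default separator, and the new FS only takes effect from the second line on. A small sketch with made-up two-line input shows the difference:

```shell
f=$(mktemp)
printf '%s\n' 'a;1;x' 'b;2;y' > "$f"

# Assignment used as a pattern: record 1 is already split when FS changes.
late=$(awk 'FS=";" { print $1 }' "$f")

# Separator set before any input is read.
early=$(awk -F';' '{ print $1 }' "$f")

printf 'late:\n%s\nearly:\n%s\n' "$late" "$early"
```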
You need to specify delimiter in either BEGIN block or as a command-line option:
awk 'BEGIN{FS=";"}{ if(NR==1){print $1}}'
awk -F ';' '{ if(NR==1){print $1}}'
cut might be better suited here. For all lines:
$ cut -d';' -f1 file
To skip the first line:
$ sed 1d file | cut -d';' -f1
To get the first line only:
$ sed 1q file | cut -d';' -f1
However, at this point it's better to switch to awk.
If you have a large file and are only interested in the first line, it's better to exit early:
$ awk -F';' '{print $1; exit}' file

Exact string match in awk

I have a file test.txt with the next lines
1997 100 500 2010TJ
2010TJXML 16 20 59
I'm using the next awk line to get information only about string 2010TJ
awk -v var="2010TJ" '$0 ~ var {print $0}' test.txt
But the code print the two lines. I want to know how to get the line containing the exact string
1997 100 500 2010TJ
the string can be placed in any column of the file.
Several options:
Use a gawk word boundary (not POSIX awk...):
$ gawk '/\<2010TJ\>/' file
An actual space or tab or what is separating the columns:
$ awk '/^2010TJ /' file
Or compare the field directly to the string:
$ awk '$1=="2010TJ"' file
You can loop over the fields to test each field if you wish:
$ awk '{for (i=1;i<=NF;i++) if ($i=="2010TJ") {print; next}}' file
Or, given your example of setting a variable, those same using a variable:
$ gawk -v s=2010TJ '$0~"\\<" s "\\>"' file
$ awk -v s=2010TJ '$0~"^" s " "' file
$ awk -v s=2010TJ '$1==s' file
Note that the first is a little different from the second and third. The first matches the standalone string 2010TJ anywhere in $0; the second and third match only when that string is the first column of the line.
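Since the string may sit in any column, the field loop can also be combined with a variable. This sketch uses the question's data; the variable name `s` follows the commands above, and because `==` compares the field with the string directly, regex metacharacters in `s` need no escaping.

```shell
f=$(mktemp)
printf '%s\n' '1997 100 500 2010TJ' '2010TJXML 16 20 59' > "$f"

out=$(awk -v s='2010TJ' '{
    for (i = 1; i <= NF; i++)
        if ($i == s) { print; next }   # whole-field match in any column
}' "$f")
printf '%s\n' "$out"
```

One caveat: if both the field and `s` look like numbers, awk compares them numerically, so 10 and 10.0 would be treated as equal.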
Try this (for testing only column 1) :
awk '$1 == "2010TJ" {print $0}' test.txt
or grep like (all columns) :
gawk '/\<2010TJ\>/ {print $0}' test.txt
Note
\< and \> are word boundaries
Another option, using GNU awk's \y word boundary:
awk '/\y2010TJ\y/' file
Note that \y matches at either the beginning or the end of a word; it is a gawk extension.
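Where neither \< \> nor \y is available, the boundary can be approximated in POSIX awk by requiring a non-word character or the line edge on either side. This is a sketch, assuming "word character" means ASCII letters, digits and underscore; the third sample line is made up to show a non-space delimiter.

```shell
f=$(mktemp)
printf '%s\n' '1997 100 500 2010TJ' '2010TJXML 16 20 59' 'x,2010TJ,y' > "$f"

out=$(awk -v s='2010TJ' '
    # s is used as a regex here, so metacharacters in it would be interpreted
    $0 ~ ("(^|[^A-Za-z0-9_])" s "([^A-Za-z0-9_]|$)")
' "$f")
printf '%s\n' "$out"
```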