Get the line number of the last line with non-blank characters - awk

I have a file which has the following content:
10 tiny toes
tree
this is that tree
5 funny 0
The file ends with lines that contain only spaces. I want to get the line number of the last line that contains non-blank characters. How do I do that in sed?

This is easily done with awk,
awk 'NF{c=FNR}END{print c}' file
With sed it is trickier. You can use the = command, but it prints the line number to standard output rather than putting it in the pattern space, so you cannot manipulate it within sed. If you want to use sed, you'll have to pipe its output to another sed or to tail:
sed -n '/^[[:blank:]]*$/!=' file | tail -1
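Both commands can be checked against the sample from the question. This is a minimal sketch: the temporary file and the variable names last_awk and last_sed are illustrative only, and the sample is padded with a few whitespace-only lines as the question describes.

```shell
# Recreate the sample file, followed by whitespace-only and empty lines.
file=$(mktemp)
printf '%s\n' '10 tiny toes' 'tree' 'this is that tree' '5 funny 0' '   ' '' ' ' > "$file"

# awk: remember the number of every line that has at least one field.
last_awk=$(awk 'NF{c=FNR}END{print c}' "$file")

# sed: print the number of every non-blank line, keep only the last one.
last_sed=$(sed -n '/^[[:blank:]]*$/!=' "$file" | tail -n 1)

echo "awk: $last_awk  sed: $last_sed"
rm -f "$file"
```

Both variables should hold 4 for this sample.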

You can use the following pseudo-code (it gives the right line number only if the blank lines all come after the last non-blank line):
Replace all spaces by the empty string
Remove all <beginning_of_line><end_of_line> (this removes the lines that contained only spaces)
Count the number of remaining lines in your file
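The three steps above could be sketched as follows. This is an illustration under assumptions: [[:blank:]] is used so that tabs are stripped as well as spaces, the temporary file and the variable name count are made up, and, as in the sample, no blank line precedes the last non-blank one.

```shell
file=$(mktemp)
printf '%s\n' '10 tiny toes' 'tree' 'this is that tree' '5 funny 0' '   ' ' ' > "$file"

# Step 1: strip blanks. Step 2: delete lines that are now empty.
# Step 3: count what is left (tr removes the padding some wc versions add).
count=$(sed 's/[[:blank:]]//g' "$file" | sed '/^$/d' | wc -l | tr -d ' ')

echo "$count"
rm -f "$file"
```

For the sample this prints 4, which here coincides with the line number of the last non-blank line.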

It's tough to work with line numbers in sed. The = command prints the current line number, but only to standard output, so the script itself cannot use it. You could use an external tool to generate line numbers and do something like:
nl -s ' ' -n ln -ba input | sed -n 's/^\(......\)...*/\1/p' | sed -n '$p'
but if you're going to do that you might as well just use awk.

This might work for you (GNU sed):
sed -n '/\S/=' file | sed -n '$p'
For every line that contains a non-whitespace character, print its line number. Pipe this output to a second invocation of sed and print only the last line.
Alternative:
grep -n '\S' file | sed -n '$s/:.*//p'


replace strings between two patterns

I would like to replace (using sed/awk/tr) all the strings between CleanAgrobacterium and _gene by ZZZ in my file A.nwk:
(((CleanAgrobacterium_fabrum_str__C58_DE0068_Scaffold_Proteins_gene-FS783_RS12830:0,CleanAgrobacterium_fabrum_str__C58_DE0067_Scaffold_Proteins_gene-FS653_RS12825:0):0.056789,(CleanAgrobacterium_fabrum_GV2260_Complete_Genome_Proteins_gene-EML4058_RS17445:0,(CleanAgrobacterium_fabrum_1D1416_Chromosome_Proteins_gene-NQG32_RS17500:0,(CleanAgrobacterium_fabrum_PDC82_Contig_Proteins_gene-BLT49_RS14090:0,(CleanAgrobacterium_fabrum_N3394_Scaffold_Proteins_gene-G6L76_RS17395:0,(CleanAgrobacterium_fabrum_12D13_Complete_Genome_Proteins_gene-At12D13_RS18010:0,(CleanAgrobacterium_fabrum_Bi46_Contig_Proteins_gene-LQ162_RS02700:0,(CleanAgrobacterium_fabrum_ARqua1_Scaffold_Proteins_gene-HI842_RS18310:0,(CleanAgrobacterium_fabrum_N4094_Scaffold_Proteins_gene-G6L42_RS17400:0,(CleanAgrobacterium_fabrum_GV3101__pMP90_Complete_Genome_Proteins_gene-EML485_RS17435:0,(CleanAgrobacterium_fabrum_Kin001_Complete_Genome_Proteins_gene-FY134_RS17775:0,(CleanAgrobacterium_fabrum_LBA645_Complete_Genome_Proteins_gene-KXJ62_RS17445:0,(CleanAgrobacterium_fabrum_Di1525a_Scaffold_Proteins_gene-G6L89_RS17735:0,(CleanAgrobacterium_fabrum_NFIX02_Scaffold_Proteins_gene-BLR22_RS16795:0,(CleanAgrobacterium_fabrum_Arqua_Contig_Proteins_gene-EXN51_RS19140:0,(CleanAgrobacterium_fabrum_str__J-07_J-07_Scaffold_Proteins_gene-AGR8A_RS20015:0,CleanAgrobacterium_fabrum_1D132_Complete_Genome_Proteins_gene-At1D132_RS18580:0):0):0):0):0):0):0):0):0):0):0):0):0):0):0):0):0,(CleanAgrobacterium_fabrum_EHA105_Complete_Genome_Proteins_gene-EML540_RS17455:0,(CleanAgrobacterium_fabrum_RIT-As-3_Contig_Proteins_gene-ORG40_RS11815:0,(CleanAgrobacterium_fabrum_2788_Contig_Proteins_gene-G6L39_RS17590:0,(CleanAgrobacterium_fabrum_BG5_Complete_Genome_Proteins_gene-F3P66_RS17495:0,(CleanAgrobacterium_fabrum_Bi05_Contig_Proteins_gene-LQV40_RS07170:0,(CleanAgrobacterium_fabrum_str__C58_C58_Complete_Genome_Proteins_gene-ATU_RS17440:0,CleanAgrobacterium_fabrum_NFIX01_Scaffold_Proteins_gene-BMY00_RS16800:0):0):0):0):0):0):0);
sed "/CleanAgrobacterium/,/gene-/d" A.nwk
Instead of using a range, you could make the pattern more specific to the example data: match one or more alphanumeric characters, - or _ in between with [[:alnum:]_-]\+ (\+ is a GNU sed extension) and replace each match with zzz:
sed "s/CleanAgrobacterium[[:alnum:]_-]\+_gene/zzz/g" A.nwk
Output
(((zzz-FS783_RS12830:0,zzz-FS653_RS12825:0):0.056789,(zzz-EML4058_RS17445:0,(zzz-NQG32_RS17500:0,(zzz-BLT49_RS14090:0,(zzz-G6L76_RS17395:0,(zzz-At12D13_RS18010:0,(zzz-LQ162_RS02700:0,(zzz-HI842_RS18310:0,(zzz-G6L42_RS17400:0,(zzz-EML485_RS17435:0,(zzz-FY134_RS17775:0,(zzz-KXJ62_RS17445:0,(zzz-G6L89_RS17735:0,(zzz-BLR22_RS16795:0,(zzz-EXN51_RS19140:0,(zzz-AGR8A_RS20015:0,zzz-At1D132_RS18580:0):0):0):0):0):0):0):0):0):0):0):0):0):0):0):0):0,(zzz-EML540_RS17455:0,(zzz-ORG40_RS11815:0,(zzz-G6L39_RS17590:0,(zzz-F3P66_RS17495:0,(zzz-LQV40_RS07170:0,(zzz-ATU_RS17440:0,zzz-BMY00_RS16800:0):0):0):0):0):0):0);
This replaces all the text between CleanAgrobacterium and _gene by ZZZ:
sed -E 's/(CleanAgrobacterium).*(_gene)/\1ZZZ\2/g' A.nwk
But the result is probably not what you would expect, because .* is greedy and matches up to the last _gene on the line. I assume you want non-greedy matching of the text in between (.*?). For that, use perl:
perl -pe 's/(CleanAgrobacterium).*?(_gene)/\1ZZZ\2/g' A.nwk
This might work for you (GNU sed):
sed -E 's/CleanAgrobacterium/&\n/g
s/gene-/\n&/g
s/(CleanAgrobacterium)\n[^\n]*\n(gene-)/\1ZZZ\2/g
s/\n//g' file
Append a newline to CleanAgrobacterium and prepend a newline to gene-.
Replace everything that is not a newline between the desired words.
Remove any introduced newlines.
N.B. This does not cater for matches on separate lines. In this case use something like:
sed -E 'H;1h;$!d;x
s/\n/###NEWLINE%%%/g
s/CleanAgrobacterium/&\n/g
s/gene-/\n&/g
s/(CleanAgrobacterium)\n[^\n]*\n(gene-)/\1ZZZ\2/g
s/\n//g
s/###NEWLINE%%%/\n/g' file
This slurps the whole file into memory, replaces all newlines by a unique string, then applies the first solution and tidies up afterwards.
Try this:
sed 's/gene-/gene-\n/g' < A.nwk | sed 's/CleanAgrobacterium.*gene-/CleanAgrobacteriumZZZgene-/g' | sed -n ':a;N;$!ba;s/\n//g;p' > output.txt
Works with GNU sed 4.9 on Linux.
Yet another sed solution. It replaces THIS with THAT between START and END (placeholders for readability; the command below uses your actual strings), turning "fooSTARTTHISENDfooSTARTTHISENDfoo"
into "fooSTARTTHATENDfooSTARTTHATENDfoo".
$ sed -E 's/(CleanAgrobacterium)([^_]|_(_|g(_|e(_|n_)))*([^_g]|g([^_e]|e([^_n]|n[^_e]))))*(_(_|g(_|e(_|n_)))*(g(e?|en))?)?(_gene)/\1ZZZ_gene/g' file
The solution is non-greedy. It captures (CleanAgrobacterium) as \1 and requires the text in between,
([^_]|_(_|g(_|e(_|n_)))*([^_g]|g([^_e]|e([^_n]|n[^_e]))))*(_(_|g(_|e(_|n_)))*(g(e?|en))?)?
(which cannot contain _gene), to be followed by (_gene). Because of the nested groups, (_gene) is the fifteenth group, beyond \9, the highest backreference available, so the replacement spells out _gene literally. You could use the same expression in, for example, GNU awk's gensub(), which also supports backreferences:
$ gawk '{print gensub(/(CleanAgrobacterium)([^_]|_(_|g(_|e(_|n_)))*([^_g]|g([^_e]|e([^_n]|n[^_e]))))*(_(_|g(_|e(_|n_)))*(g(e?|en))?)?(_gene)/,"\\1ZZZ_gene","g",$0)}' file

How to delete the "0"-row for multiple files in a folder?

Each file name starts with "input". One of the files looks like this:
0.0005
lii_bk_new
traj_new.xyz
0
73001
146300
I want to delete the lines that contain only '0'; the expected output is:
0.0005
lii_bk_new
traj_new.xyz
73001
146300
I have tried with
sed -i 's/^0\n//g' input_*
and
grep -RiIl '^0\n' input_* | xargs sed -i 's/^0\n//g'
but neither works.
Please give some suggestions.
Could you please try changing your attempted code to the following, and run it on a single Input_file first.
sed 's/^0$//' Input_file
Or, as per the OP's comment, to also delete the resulting empty lines:
sed 's/^0$//;/^$/d' Input_file
I have intentionally left out the -i option here: first test this on a single file, and only if the output looks good run it with -i on multiple files.
The problem in your attempt was the \n in the sed regex: newline is the line separator, so it is never part of the line that sed is matching. Use $ instead to anchor the end of the line, so that the pattern matches lines consisting of exactly 0.
If you want to keep a backup of each file (provided you have enough space on the file system), you can use sed's -i.bak option, which saves a copy of each file before editing it. This isn't necessary, but it is the safer choice.
$ sed '/^0$/d' file
0.0005
lii_bk_new
traj_new.xyz
73001
146300
In your regexp you were confusing \n (the literal LineFeed character which will not be present in the string sed is analyzing since sed reads one \n-separated line at a time) with $ (the end-of-string regexp metacharacter which represents end-of-line when the string being parsed is a line as is done with sed by default).
The other mistake in your script was replacing 0 with null in the matching line instead of just deleting the matching line.
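Putting the pieces together for several files, here is a minimal sketch that works in a scratch directory so nothing real is touched. The directory, the second file's contents and the variable names are made up for illustration; -i.bak is used because that spelling is accepted by both GNU and BSD sed.

```shell
dir=$(mktemp -d)
printf '%s\n' 0.0005 lii_bk_new traj_new.xyz 0 73001 146300 > "$dir/input_1"
printf '%s\n' 0 10 100 0 > "$dir/input_2"

# Delete every line that is exactly "0"; each original is kept as <name>.bak.
sed -i.bak '/^0$/d' "$dir"/input_*

result1=$(cat "$dir/input_1")
result2=$(cat "$dir/input_2")
set -- "$dir"/*.bak          # count the backup files via the glob
backups=$#

printf '%s\n' "$result1" '--' "$result2" "backups: $backups"
rm -rf "$dir"
```

Lines such as 0.0005 or 100 survive because the pattern is anchored at both ends. Note that a second run of the same glob would also match the .bak files.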
Please give some suggestions.
I would use GNU awk -i inplace for that following way:
awk -i inplace '!/^0$/' input_*
This simply keeps all lines that do not match ^0$, i.e. (start of line)0(end of line). If you want to know more about -i inplace I suggest reading this tutorial.

How can I search for a dot and a number in sed or awk and prefix the number with a leading zero?

I am trying to modify the name of a large number of files, all of them with the following structure:
4.A.1 Introduction to foo.txt
2.C.3 Lectures on bar.pdf
3.D.6 Processes on baz.mp4
5.A.8 History of foo.txt
And I want to add a leading zero to the last digit:
4.A.01 Introduction to foo.txt
2.C.03 Lectures on bar.pdf
3.D.06 Processes on baz.mp4
5.A.08 History of foo.txt
At first I am trying to get the new names with sed (FreeBSD implementation):
ls | sed 's/\.[0-9]/0&/'
But I get the zero before the .
Note: replacing the second dot would also work. I am also open to use awk.
While it may have worked for you here, in general slicing and dicing ls output is fragile, whether using sed or awk or anything else. Fortunately one can accomplish this robustly in plain old POSIX sh using globbing and fancy-pants parameter expansions:
for f in [[:digit:]].[[:alpha:]].[[:digit:]]\ ?*; do
    # $f = "[[:digit:]].[[:alpha:]].[[:digit:]] ?*" if no files match.
    if [ "$f" != '[[:digit:]].[[:alpha:]].[[:digit:]] ?*' ]; then
        tail=${f#*.*.}    # filename sans "1.A." prefix
        head=${f%"$tail"} # the "1.A." prefix
        mv "$f" "${head}0${tail}"
    fi
done
(EDIT: Filter out filenames that don't match desired format.)
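To see what the two parameter expansions do without renaming anything, they can be tried on plain strings. In this sketch pad_name is a hypothetical helper introduced only for illustration; it prints the new name instead of calling mv.

```shell
# pad_name (hypothetical helper): print the zero-padded name for one file name.
pad_name() {
    f=$1
    tail=${f#*.*.}      # drop the shortest prefix matching "*.*."  -> "1 Introduction to foo.txt"
    head=${f%"$tail"}   # drop that tail again, leaving the prefix   -> "4.A."
    printf '%s\n' "${head}0${tail}"
}

pad_name '4.A.1 Introduction to foo.txt'
pad_name '3.D.6 Processes on baz.mp4'
```

This should print 4.A.01 Introduction to foo.txt and 3.D.06 Processes on baz.mp4. Because # removes the shortest matching prefix, later dots in the name (such as the one before the extension) are left alone.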
This pipeline should work for you:
ls | sed 's/\.\([0-9]\)/.0\1/'
The sed command captures the digit and writes it back preceded by a 0.
Here, \1 references the first (and in this case only) capture group - the parenthesized expression.
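The substitution can be tried on literal names rather than on ls output. The sketch below also shows a stricter, anchored pattern; that variant and the name notes.2021.txt are assumptions added for illustration, not part of the answer above.

```shell
names='4.A.1 Introduction to foo.txt
notes.2021.txt'

# Pattern from the answer: pads the first ".<digit>" found anywhere in the name.
loose=$(printf '%s\n' "$names" | sed 's/\.\([0-9]\)/.0\1/')

# Anchored variant (assumption): only touches names shaped like "4.A.1 ...".
# In the replacement, \10 is backreference 1 followed by a literal 0.
strict=$(printf '%s\n' "$names" | sed 's/^\([0-9]\.[[:alpha:]]\.\)\([0-9] \)/\10\2/')

printf '%s\n' "$loose" '--' "$strict"
```

Both rewrite the first name to 4.A.01 Introduction to foo.txt, but the loose pattern also turns notes.2021.txt into notes.02021.txt, while the anchored one leaves it unchanged.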
I am also open to use awk.
Let file.txt content be:
4.A.1 Introduction to foo.txt
2.C.3 Lectures on bar.pdf
3.D.6 Processes on baz.mp4
5.A.8 History of foo.txt
then
awk 'BEGIN{FS=OFS="."}{$3="0" $3;print}' file.txt
outputs
4.A.01 Introduction to foo.txt
2.C.03 Lectures on bar.pdf
3.D.06 Processes on baz.mp4
5.A.08 History of foo.txt
Explanation: I set dot (.) as both the field separator and the output field separator, then for every line I add a leading 0 to the third column ($3) by concatenating 0 and that column. Finally I print the altered line.
(tested in GNU Awk 5.0.1)
This might work for you (GNU sed):
sed 's/^\S*\./&0/' file
This inserts 0 after the last . in the first run of non-whitespace characters on each line.
In case it helps somebody else, as an alternative to @costaparas's answer:
ls | sed -E -e 's/^([0-9][.][A-Z][.])/\10/' > files
Then, to create the script that renames the files:
cat files | awk '{printf "mv \"%s\" \"%s\"\n", $0, $0}' | sed 's/\.0/\./' > movefiles.sh

Convert data format using awk?

There is a file which contains data in a 'n*1' format:
1
2
3
4
5
6
Is there any way to convert it to a 'n*3' format like:
1,2,3
4,5,6
via awk, rather than using a for loop?
I really have no idea how to do this. Any help or keyword is appreciated.
Using awk
$ awk '{printf "%s%s",$0,(NR%3==0?ORS:",")}' File
1,2,3
4,5,6
The command printf "%s%s",$0,(NR%3==0?ORS:",") tells awk to print two strings. The first is $0 which is the current line. The second string is NR%3==0?ORS:"," which is either ORS the output record separator (if the line number is a multiple of three) or else , for all other line numbers.
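The same separator-selection idea can be generalised to any row width and to input whose length is not a multiple of that width. This is a sketch, not the answer's own code: the awk variable n and the five-line input are assumptions made for illustration.

```shell
# n (hypothetical parameter): number of input lines to join per output row.
out=$(printf '%s\n' 1 2 3 4 5 |
    awk -v n=3 '
        # Before each value emit "," inside a row, a newline between rows,
        # and nothing before the very first value.
        { printf "%s%s", ((NR - 1) % n ? "," : (NR > 1 ? ORS : "")), $0 }
        # Terminate the last row, whether it is complete or not.
        END { if (NR) printf "%s", ORS }
    ')

printf '%s\n' "$out"
```

With five input lines this prints 1,2,3 followed by 4,5, with no trailing comma on the short row. Emitting the separator before each value rather than after it is what avoids the dangling comma.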
Using sed
$ sed 'N;N;s/\n/,/g' File
1,2,3
4,5,6
By default, sed reads in each line from the file one by one. N tells sed to read in another line, appending the line to the current one, separated by a newline. N;N tells sed to do that twice so that we have a total of three lines in the pattern space. s/\n/,/g tells sed to replace those two separator newlines with commas. The result is then printed.
The above assumes that we are using GNU sed. With minor modifications, this can be made to work with BSD/OSX sed.
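One commonly used way to avoid depending on how a particular sed treats N at end of input is to guard each N with $!, so that it is skipped on the last line. The following is a sketch of that idiom on a five-line input chosen for illustration; it is not necessarily the modification the answer had in mind.

```shell
# $!N appends the next line only when the current line is not the last one,
# so a short final group is still joined and printed.
out=$(printf '%s\n' 1 2 3 4 5 | sed '$!N;$!N;s/\n/,/g')

printf '%s\n' "$out"
```

This prints 1,2,3 and then 4,5.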
The simplest one is the paste command:
paste -d, - - - <file
The output:
1,2,3
4,5,6
The following may help as well.
xargs -n3 < Input_file | sed 's/ /,/g'
Try this:
awk 'NR%3==0{print;next}{printf "%s,",$0}' file
or decomposed:
NR%3==0 # condition: the line number is a multiple of 3
{print;next} # then print the line (with a newline) and skip to the next line
{printf "%s,",$0} # otherwise print the current line followed by a comma, without a newline
$ awk '{ORS=(NR%3?",":RS)}1' file
1,2,3
4,5,6

Replace chars after column X

Let's say my data looks like this
iqwertyuiop
and I want to replace every letter i after column 3 with a Z, so my output would look like this
iqwertyuZop
How can I do this with sed or awk?
It's not clear what you mean by "column" but maybe this is what you want using GNU awk for gensub():
$ echo iqwertyuiop | awk '{print substr($0,1,3) gensub(/i/,"Z","g",substr($0,4))}'
iqwertyuZop
Perl is handy for this: you can assign to a substring
$ echo "iiiiii" | perl -pe 'substr($_,3) =~ s/i/Z/g'
iiiZZZ
This would totally be ideal for the tr command, if only you didn't have the requirement that the first 3 characters remain untouched.
However, if you are okay using some bash tricks plus cut and paste, you can split the file into two parts and paste them back together afterwards:
paste -d'\0' <(cut -c-3 foo) <(cut -c4- foo | tr i Z)
The above uses paste to rejoin together the two parts of the file that get split with cut. The second section is piped through tr to translate i's to Z's.
(1) Here's a short-and-simple way to accomplish the task using GNU sed:
sed -r -e ':a;s/^(...)([^i]*)i/\1\2Z/g;ta'
This entails looping (t), and so would not be as efficient as non-looping approaches. The above can also be written using escaped parentheses instead of unescaped ones, so there is no real need for the -r option. Other implementations of sed should (in principle) be up to the task as well, but YMMV.
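The loop can be watched on two short inputs. This sketch assumes GNU sed (for the one-line :a;...;ta form) and introduces a shell variable keep, which is not in the answer, for the number of leading characters to protect.

```shell
keep=3   # hypothetical parameter: leading characters that must stay untouched

# Each pass rewrites the first i found after the protected prefix;
# t jumps back to :a for as long as a substitution succeeded.
out=$(printf '%s\n' iqwertyuiop iiiiii |
    sed -E ":a;s/^(.{$keep})([^i]*)i/\1\2Z/;ta")

printf '%s\n' "$out"
```

This prints iqwertyuZop and iiiZZZ. The second input needs three passes, one per replaced character, which is the inefficiency mentioned above.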
(2) It's easy enough to use "old awk" as well:
awk '{s=substr($0,4);gsub(/i/,"Z",s); print substr($0,1,3) s}'
The most intuitive way would be to use awk:
awk 'BEGIN{FS="";OFS=FS}{for(i=4;i<=NF;i++){if($i=="i"){$i="Z"}}}1' file
FS="" splits the input string into fields character by character. We iterate through characters/fields 4 to the end and replace i with Z.
The final 1 evaluates to true and makes awk print the modified input line.
With sed it does not look very intuitive, but it is still possible:
sed -r '
h # Backup the current line in hold buffer
s/.{3}// # Remove the first three characters
s/i/Z/g # Replace all i by Z
G # Append the contents of the hold buffer to the pattern buffer (this adds a newline between them)
s/(.*)\n(.{3}).*/\2\1/ # Remove that newline ^^^ and assemble the result
' file