How can I search for a dot and a number in sed or awk and prefix the number with a leading zero? - awk

I am trying to rename a large number of files, all of which have the following structure:
4.A.1 Introduction to foo.txt
2.C.3 Lectures on bar.pdf
3.D.6 Processes on baz.mp4
5.A.8 History of foo.txt
And I want to add a leading zero to the last digit:
4.A.01 Introduction to foo.txt
2.C.03 Lectures on bar.pdf
3.D.06 Processes on baz.mp4
5.A.08 History of foo.txt
As a first step, I am trying to generate the new names with sed (FreeBSD implementation):
ls | sed 's/\.[0-9]/0&/'
But this puts the zero before the dot (e.g. 4.A0.1 instead of 4.A.01).
Note: replacing the second dot would also work. I am also open to use awk.

While it may have worked for you here, in general slicing and dicing ls output is fragile, whether using sed or awk or anything else. Fortunately one can accomplish this robustly in plain old POSIX sh using globbing and fancy-pants parameter expansions:
for f in [[:digit:]].[[:alpha:]].[[:digit:]]\ ?*; do
    # $f = "[[:digit:]].[[:alpha:]].[[:digit:]] ?*" if no files match.
    if [ "$f" != '[[:digit:]].[[:alpha:]].[[:digit:]] ?*' ]; then
        tail=${f#*.*.}     # filename sans "1.A." prefix
        head=${f%"$tail"}  # the "1.A." prefix
        mv "$f" "${head}0${tail}"
    fi
done
(EDIT: Filter out filenames that don't match desired format.)
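To see what the two parameter expansions do before renaming anything, here is a small dry run on one of the sample names. It is only a sketch: the variable new is introduced here for illustration, and nothing is moved.

```shell
# Dry run: show how the name is split and reassembled (no mv involved).
f='4.A.1 Introduction to foo.txt'
tail=${f#*.*.}          # strip the shortest prefix matching "*.*." -> "1 Introduction to foo.txt"
head=${f%"$tail"}       # whatever precedes $tail                   -> "4.A."
new="${head}0${tail}"   # hypothetical helper variable for this sketch
printf '%s\n' "$new"    # prints: 4.A.01 Introduction to foo.txt
```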

This pipeline should work for you:
ls | sed 's/\.\([0-9]\)/.0\1/'
The sed command matches the first dot that is followed by a digit, captures that digit, and writes it back with a 0 in front of it.
Here, \1 references the first (and in this case only) capture group - the parenthesized expression.
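A quick way to check the substitution without involving ls is to feed it the sample names with printf. The function name add_zero is made up for this sketch.

```shell
# Hypothetical wrapper around the sed command from this answer.
add_zero() {
    sed 's/\.\([0-9]\)/.0\1/'
}

printf '%s\n' '4.A.1 Introduction to foo.txt' '3.D.6 Processes on baz.mp4' | add_zero
# prints:
# 4.A.01 Introduction to foo.txt
# 3.D.06 Processes on baz.mp4
```

Because there is no g flag, only the first dot-digit pair on each line is changed, so an extension such as .mp4 is left alone.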

I am also open to use awk.
Let file.txt content be:
4.A.1 Introduction to foo.txt
2.C.3 Lectures on bar.pdf
3.D.6 Processes on baz.mp4
5.A.8 History of foo.txt
then
awk 'BEGIN{FS=OFS="."}{$3="0" $3;print}' file.txt
outputs
4.A.01 Introduction to foo.txt
2.C.03 Lectures on bar.pdf
3.D.06 Processes on baz.mp4
5.A.08 History of foo.txt
Explanation: I set the dot (.) as both the field separator and the output field separator, then for every line I add a leading 0 to the third field ($3) by concatenating 0 with it. Finally I print the altered line.
(tested in GNU Awk 5.0.1)

This might work for you (GNU sed):
sed 's/^\S*\./&0/' file
This appends 0 after the last . in the first run of non-whitespace characters in each line.
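\S is a GNU extension, and the question mentions FreeBSD sed. Here is a sketch of the same idea with a POSIX bracket expression, which I would expect to work in both implementations (the function name is hypothetical):

```shell
add_zero_posix() {
    # [^[:space:]]* stands in for GNU's \S*
    sed 's/^[^[:space:]]*\./&0/'
}

printf '%s\n' '5.A.8 History of foo.txt' | add_zero_posix
# prints: 5.A.08 History of foo.txt
```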

In case it helps somebody else, as an alternative to @costaparas's answer:
ls | sed -E -e 's/^([0-9][.][A-Z][.])/\10/' > files
To then create the script that renames the files:
cat files | awk '{printf "mv \"%s\" \"%s\"\n", $0, $0}' | sed 's/\.0/\./' > movefiles.sh

Related

Delete everything before first pattern match with sed/awk

Let's say I have a line looking like this:
/Users/random/354765478/Tests/StoreTests/Base64Tests.swift
In this example, I would like to get the result:
Tests/StoreTests/Base64Tests.swift
How can I do this, i.e. delete everything before the first pattern match (either Sources or Tests), using sed or awk?
I am using sed 's/^.*\(Tests.*\).*$/\1/' right now but it's failing:
echo '/Users/random/354765478/Tests/StoreTests/Base64Tests.swift' | sed 's/^.*\(Tests\)/\1/'
Tests.swift
Here's another example using Sources (which seems to work):
echo '/Users/random/741672469/Sources/Store/StoreDataSource.swift' | sed 's/^.*\(Sources\)/\1/'
Sources/Store/StoreDataSource.swift
I would like to get everything before the first, and not the last Sources or Tests pattern match.
Any help would be appreciated!
How can I do if I want to get everything before the first pattern match (either Sources or Tests).
Easier to use a grep -o here:
grep -Eo '(Sources|Tests)/.*' file
Tests/StoreTests/Base64Tests.swift
Sources/Store/StoreDataSource.swift
# where input file is
cat file
/Users/random/354765478/Tests/StoreTests/Base64Tests.swift
/Users/random/741672469/Sources/Store/StoreDataSource.swift
Breakdown:
The regex pattern (Sources|Tests)/.* matches any text that starts with Sources/ or Tests/ and runs to the end of the line.
-E: enables extended regex mode
-o: prints only matched text instead of full line
Alternatively you may use this awk as well:
awk 'match($0, /(Sources|Tests)\/.*/) {
print substr($0, RSTART)
}' file
Tests/StoreTests/Base64Tests.swift
Sources/Store/StoreDataSource.swift
Or this sed:
sed -E 's~.*/((Sources|Tests)/.*)~\1~' file
Tests/StoreTests/Base64Tests.swift
Sources/Store/StoreDataSource.swift
With your shown samples, please try the following GNU grep. It looks for the very first match of /Sources or /Tests and prints everything from that string to the end of the line.
grep -oP '^.*?\/\K(Sources|Tests)\/.*' Input_file
Using sed
$ sed -E 's~([^/]*/)+((Tests|Sources).*)~\2~' input_file
Tests/StoreTests/Base64Tests.swift
would like to get everything before the first, and not the last
Sources or Tests pattern match.
The first thing is to understand why this happens. You are using
sed 's/^.*\(Tests.*\).*$/\1/'
Observe that * is greedy, i.e. it matches as much as possible, so the leading .* always runs up to the last Tests. A non-greedy .* would stop at the first Tests, but sed does not support non-greedy quantifiers. If you are on Linux there is a good chance that you have the perl command, which does. Let file.txt content be
/Users/random/354765478/Tests/StoreTests/Base64Tests.swift
then
perl -p -e 's/^.*?(Tests.*)$/\1/' file.txt
gives output
Tests/StoreTests/Base64Tests.swift
Explanation: -p -e engages a sed-like mode. Changes made to the regular expression: parentheses no longer need escaping, the first .* (greedy) became .*? (non-greedy), and the last .* was dropped as superfluous (observe that the capturing group always extends to the end of the line).
(tested in perl 5, version 30, subversion 0)
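The greedy behaviour is easy to reproduce. The sketch below contrasts the greedy sed substitution with awk's match(), which reports the leftmost match and so needs no non-greedy operator. The variable names are made up for the example.

```shell
path='/Users/random/354765478/Tests/StoreTests/Base64Tests.swift'

# Greedy: .* swallows everything up to the LAST "Tests".
greedy=$(printf '%s\n' "$path" | sed 's/^.*\(Tests\)/\1/')

# Leftmost match: RSTART is the position of the FIRST "Tests".
first=$(printf '%s\n' "$path" | awk 'match($0, /Tests/) { print substr($0, RSTART) }')

printf '%s\n' "$greedy" "$first"
# prints:
# Tests.swift
# Tests/StoreTests/Base64Tests.swift
```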

Git URL - Pull out substring via Shell (awk & sed)?

I have got the following URL:
https://xcg5847#git.rz.bankenit.de/scm/smat/sma-mes-test.git
I need to pull out sma-mes-test and smat:
git config --local remote.origin.url|sed -n 's#.*/\([^.]*\)\.git#\1#p'
sma-mes-test
This works. But I also need the project name, which is smat
I am not really familiar with complex regex and sed; I found the command above in another post here. Does anyone know how I can extract the smat value as well?
With your shown samples, please try the following awk code. In short: it sets the field separators to / and .git for all lines, and the main program prints the 3rd-last and 2nd-last fields of the line.
your_git_command | awk -F'/|\\.git' '{print $(NF-2),$(NF-1)}'
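To see why the fields of interest are NF-2 and NF-1 rather than NF-1 and NF, it can help to print every field. Here is a sketch using the URL from the question (the function name is hypothetical). Because the line ends in .git, the last field is empty:

```shell
show_fields() {
    # Same separators as above: a slash or the literal string ".git"
    awk -F'/|\\.git' '{ for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i }'
}

printf '%s\n' 'https://xcg5847#git.rz.bankenit.de/scm/smat/sma-mes-test.git' | show_fields
# prints:
# 1=[https:]
# 2=[]
# 3=[xcg5847#git.rz.bankenit.de]
# 4=[scm]
# 5=[smat]
# 6=[sma-mes-test]
# 7=[]
```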
Your sed is pretty close. You can just extend it to capture 2 values and print them:
git config --local remote.origin.url |
sed -E 's~.*/([^/]+)/([^.]+)\.git$~\1 \2~'
smat sma-mes-test
If you want to populate shell variable using these 2 values then use this read command in bash:
read v1 v2 < <(git config --local remote.origin.url |
sed -E 's~.*/([^/]+)/([^.]+)\.git$~\1 \2~')
# check variable values
declare -p v1 v2
declare -- v1="smat"
declare -- v2="sma-mes-test"
Using sed
$ sed -E 's#.*/([^/]*)/#\1 #' input_file
smat sma-mes-test.git
I would harness GNU AWK for this task following way, let file.txt content be
https://xcg5847#git.rz.bankenit.de/scm/smat/sma-mes-test.git
then
awk 'BEGIN{FS="/"}{sub(/\.git$/,"",$NF);print $(NF-1),$NF}' file.txt
gives output
smat sma-mes-test
Explanation: I tell GNU AWK that the field separator is the slash character, then I remove .git (observe that . is escaped to mean a literal dot) at the end ($) of the last field ($NF). Then I print the second-to-last field ($(NF-1)) and the last field ($NF), separated by a space, which is the default output field separator. If you wish to use another character for that purpose, set OFS (output field separator) in BEGIN. If you want to know more about NF, read 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
(tested in gawk 4.2.1)
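As a sketch of the OFS remark, here is the same program with a comma as the output separator. The comma and the variable name pair are arbitrary choices for illustration.

```shell
pair=$(printf '%s\n' 'https://xcg5847#git.rz.bankenit.de/scm/smat/sma-mes-test.git' |
    awk 'BEGIN { FS = "/"; OFS = "," } { sub(/\.git$/, "", $NF); print $(NF-1), $NF }')

printf '%s\n' "$pair"
# prints: smat,sma-mes-test
```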
Why not sed 's!.*/\(.*/.*\)!\1!'?
string=$(git config --local remote.origin.url | tail -c 22)   # last 21 characters plus the trailing newline
var1=$(echo "${string}" | cut -d'/' -f1)
var2=$(echo "${string}" | cut -d'/' -f2 | sed 's#\.git##')
If you have multiple urls with variable lengths, this will not work, but if you only have the one, it will.
var1=smat
var2=sma-mes-test
If I did have something variable, personally I would replace all of the forward slashes with newlines, write the result to a file, and then print the last and second-to-last lines with ed, which would give me the last two segments of the url.
Regular expressions literally give me a migraine, but as long as I can get everything on its own line, I can quite easily bypass the need for them entirely.
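In the same regex-free spirit, POSIX parameter expansion can peel off the last two path segments without any external tool. This is only a sketch and assumes the URL always ends in project/repo.git; the variable names are made up.

```shell
url='https://xcg5847#git.rz.bankenit.de/scm/smat/sma-mes-test.git'

repo=${url##*/}        # drop everything up to the last slash -> sma-mes-test.git
repo=${repo%.git}      # drop the .git suffix                 -> sma-mes-test
rest=${url%/*}         # drop the last segment                -> .../scm/smat
project=${rest##*/}    # keep what follows the last slash     -> smat

printf '%s %s\n' "$project" "$repo"
# prints: smat sma-mes-test
```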

Get the line number of the last line with non-blank characters

I have a file which has the following content:
10 tiny toes
tree
this is that tree
5 funny 0
There are spaces at the end of the file. I want to get the line number of the last line of the file that contains characters. How do I do that in sed?
This is easily done with awk,
awk 'NF{c=FNR}END{print c}' file
With sed it is trickier. You can use the = command, but it prints the line number to standard output rather than into the pattern space, so you cannot manipulate it. If you want to use sed, you'll have to pipe its output to another sed or use tail:
sed -n '/^[[:blank:]]*$/!=' file | tail -1
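Both commands can be checked against a throwaway copy of the sample, here with two blank lines appended (one of them containing only spaces). The temporary file and the variable names are assumptions of this sketch.

```shell
tmp=$(mktemp)
printf '%s\n' '10 tiny toes' 'tree' 'this is that tree' '5 funny 0' '   ' '' > "$tmp"

# awk: remember the number of every line that has at least one field
last_awk=$(awk 'NF { c = FNR } END { print c }' "$tmp")
# sed: print the number of every non-blank line, keep the last one
last_sed=$(sed -n '/^[[:blank:]]*$/!=' "$tmp" | tail -n 1)

printf 'awk: %s, sed: %s\n' "$last_awk" "$last_sed"   # prints: awk: 4, sed: 4
rm -f "$tmp"
```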
You can use following pseudo-code:
Replace all spaces by empty string
Remove all <beginning_of_line><end_of_line> (the lines, only containing spaces, will be removed like this)
Count the number of remaining lines in your file
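The pseudo-code translates fairly directly into a pipeline. Here is a sketch (the function name is hypothetical). Note that it counts the non-blank lines, which equals the number of the last non-blank line only when no blank lines come before it:

```shell
count_nonblank() {
    # 1. delete spaces and tabs, 2. drop the lines that are now empty, 3. count the rest
    sed 's/[[:blank:]]//g' "$1" | sed '/^$/d' | wc -l
}
```

Usage would be count_nonblank file, which should report 4 for the sample in the question.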
It's tough to work with line numbers in sed. The = command prints them, but only to standard output, so sed itself cannot process them any further. You could use an external tool to generate line numbers and do something like:
nl -s ' ' -n ln -ba input | sed -n 's/^\(......\)...*/\1/p' | sed -n '$p'
but if you're going to do that you might as well just use awk.
This might work for you (GNU sed):
sed -n '/\S/=' file | sed -n '$p'
For all lines that contain a non-whitespace character, print the line number. Pipe this output to a second invocation of sed and print only the last line.
Alternative:
grep -n '\S' file | sed -n '$s/:.*//p'

Using grep-awk and sed in one-row-command result in a "No such file or directory" error

..And I know why:
I have a xml document with lots of information inside. I need to extract what I need and eventually print them on a new file.
The XML (well, part of it; the rows just keep repeating):
<module classname="org.openas2.processor.receiver.AS2DirectoryPollingModule"
outboxdir="%home%/../../../home/samba/user/Outbound/toMartha/"
errordir="%home%/../../../home/samba/user/Outbound/toMartha/error"
sentdir="%home%/../../../home/samba/user/data/Sent/Martha"
interval="600"
defaults="sender.name=me_myself, receiver.name=Martha"
sendfilename="true"
mimetype="application/standard"/>
<module classname="org.openas2.processor.receiver.AS2DirectoryPollingModule"
outboxdir="%home%/../../../home/samba/user/Outbound/toJosh/"
errordir="%home%/../../../home/samba/user/Outbound/toJosh/error"
sentdir="%home%/../../../home/samba/user/data/Sent/Josh"
interval="600"
defaults="sender.name=me_myself, receiver.name=Josh"
sendfilename="true"
mimetype="application/standard"/>
<module classname="org.openas2.processor.receiver.AS2DirectoryPollingModule"
outboxdir="%home%/../../../home/samba/user/Outbound/toPamela/"
errordir="%home%/../../../home/samba/user/Outbound/toPamela/error"
interval="600"
defaults="sender.name=me_myself, receiver.name=Pamela"
sendfilename="true"
mimetype="application/standard"/>
I need to extract the folder after "Outbound" and clean it from quotes or slashes.
Also, I need to exclude the "/error" so I get only 1 result for each of them.
My command is:
grep -o -v "/error" "Outbound/" config.xml | awk -F"Outbound/" '{print $2}' | sed -e "s/\/\"//g" > /tmp/sync_users
The error is: grep: Outbound/: No such file or directory, which of course means that I'm giving grep too many arguments (?). If I remove the -v "/error" it works, but it also prints the names with "/error".
Can someone help me?
EDIT:
As some pointed out in their example (thanks for the time you put in), I'd need to extract these words based on the sample above:
toMartha
toJosh
toPamela
It could be interesting to use sed in this case:
sed -e '\#/Outbound/#!d' -e '\#/error"$#d' -e 's#.*/Outbound/##;s#/\{0,1\}"$##' Config.xml
An awk version, assuming (for the final print) that the folder is always exactly one level below Outbound, as shown:
awk -F '/' '$0 !~ /\/Outbound\// || /\/error"$/ {next} {print $(NF-1)}' Config.xml
Lose the grep altogether:
$ awk '/outboxdir/{gsub(/^.+Outbound\/|\/" *\r?$/,""); print}' file
toMartha
toJosh
toPamela
/outboxdir/ only processes records that contain outboxdir
gsub removes the unwanted parts of the record
the pattern also strips trailing spaces and a CR at the end of the record, to cope with files that originated on Windows
To give grep multiple patterns, they have to be separated by newlines or specified with repeated pattern options (-e, -f, ...). However, -v inverts the match as a whole; you can't invert only one pattern.
For what you're after you can use PCRE (-P argument) for the lookaround ability:
grep -o -P '(?<=Outbound\/)[^\/]+(?!.*\/error)' config.xml
The regex tries to
match something that is not a slash, at least once: the [^\/]+
preceded by Outbound/: the positive lookbehind (?<=Outbound\/)
and not followed by something ending with /error: the negative lookahead (?!.*\/error)
With your first sample input:
$ grep -o -P '(?<=Outbound\/)[^\/]+(?!.*\/error)' test.txt
toMartha
toJosh
toPamela
How about:
grep -i "outbound" your_file | awk -F"Outbound/" '{print $2}' | sed -e 's/error//' -e 's/\/\"//' | uniq
Should work :)
You can use match in gawk with a capturing group in the regex:
awk 'match($0, /^.*\/Outbound\/([^\/]+)\/([^\/]*)\/?"$/, a){
    if(a[2]!="error"){print a[1]}
}' config.xml
you get,
toMartha
toJosh
toPamela
grep can accept multiple patterns with the -e option (aka --regexp, even though it can be used with --fixed-strings too, go figure). However, -v (--invert-match) applies to all of the patterns as a group.
Another solution would be to chain two calls to grep:
grep -v "/error" config.xml | grep "Outbound/" | awk -F"Outbound/" '{print $2}' | sed -e "s/\/\"//g"
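A quick way to see that -v negates the whole pattern set, and that chaining does what the question wants, is to run both on a three-line excerpt of the sample. The temporary file and the variable names are assumptions of this sketch.

```shell
sample=$(mktemp)
cat > "$sample" <<'EOF'
outboxdir="%home%/../../../home/samba/user/Outbound/toMartha/"
errordir="%home%/../../../home/samba/user/Outbound/toMartha/error"
sentdir="%home%/../../../home/samba/user/data/Sent/Martha"
EOF

# -v with two -e patterns drops every line that matches EITHER pattern,
# so only the sentdir line survives.
inverted=$(grep -v -e '/error' -e 'Outbound/' "$sample")

# Chaining applies the exclusion first and the inclusion second.
chained=$(grep -v '/error' "$sample" | grep 'Outbound/' |
    awk -F'Outbound/' '{print $2}' | sed 's/\/"//g')

printf '%s\n' "$inverted" "$chained"
rm -f "$sample"
```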

awk to transpose lines of a text file

A .csv file that has lines like this:
20111205 010016287,1.236220,1.236440
It needs to read like this:
20111205 01:00:16.287,1.236220,1.236440
How do I do this in awk? Experimenting, I got this far. I need to do it in two passes I think. One sub to read the date&time field, and the next to change it.
awk -F, '{print;x=$1;sub(/.*=/,"",$1);}' data.csv
Use that awk command:
echo "20111205 010016287,1.236220,1.236440" | \
awk -F[\ \,] '{printf "%s %s:%s:%s.%s,%s,%s\n", \
$1,substr($2,1,2),substr($2,3,2),substr($2,5,2),substr($2,7,3),$3,$4}'
Explanation:
-F[\ \,]: sets the delimiter to space and ,
printf "%s %s:%s:%s.%s,%s,%s\n": format the output
substr($2,1,2), substr($2,3,2), ...: cut the second field ($2) into the desired pieces
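The substr() calls can be tried in isolation on just the time field. Here is a sketch (format_time is a made-up name):

```shell
format_time() {
    # hhmmssmmm -> hh:mm:ss.mmm ; substr(s, start, length) counts from 1
    awk '{ printf "%s:%s:%s.%s\n", substr($0,1,2), substr($0,3,2), substr($0,5,2), substr($0,7,3) }'
}

printf '%s\n' '010016287' | format_time
# prints: 01:00:16.287
```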
Or use that sed command:
echo "20111205 010016287,1.236220,1.236440" | \
sed 's/\([0-9]\{8\}\) \([0-9]\{2\}\)\([0-9]\{2\}\)\([0-9]\{2\}\)\([0-9]\{3\}\)/\1 \2:\3:\4.\5/g'
Explanation:
[0-9]\{8\}: first match a 8-digit pattern and save it as \1
[0-9]\{2\}...: after a space match 3 times a 2-digit pattern and save them to \2, \3 and \4
[0-9]\{3\}: and at last match 3-digit pattern and save it as \5
\1 \2:\3:\4.\5: format the output
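The same substitution is easier to read with extended regular expressions, where the groups and intervals need no backslashes. This is a sketch, assuming a sed that accepts -E (recent GNU and BSD versions do); the function name is hypothetical:

```shell
fix_timestamp() {
    sed -E 's/([0-9]{8}) ([0-9]{2})([0-9]{2})([0-9]{2})([0-9]{3})/\1 \2:\3:\4.\5/'
}

printf '%s\n' '20111205 010016287,1.236220,1.236440' | fix_timestamp
# prints: 20111205 01:00:16.287,1.236220,1.236440
```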
sed is better suited to this job since it's a simple substitution on single lines:
$ sed -r 's/( ..)(..)(..)/\1:\2:\3./' file
20111205 01:00:16.287,1.236220,1.236440
but if you prefer here's GNU awk with gensub():
$ awk '{print gensub(/( ..)(..)(..)/,"\\1:\\2:\\3.","")}' file
20111205 01:00:16.287,1.236220,1.236440