How to extract the final word of a sentence - awk

For a given text file I'd like to extract the final word in every sentence to a space-delimited text file. It would be acceptable to have a few errors for words like Mr. and Dr., so I don't need to try to achieve that level of precision.
I was thinking I could do this with Sed and Awk, but it's been too long since I've worked with them and I don't remember where to begin. Help?
(Output example: for the previous two paragraphs, I'd like to see this:)
file Mr Dr precision begin Help

Using this regex:
([[:alpha:]]+)[.!?]
Grep can do this:
$ echo "$txt" | grep -o -E '([[:alpha:]]+)[.!?]'
file.
Mr.
Dr.
precision.
begin.
Help?
Then if you want only the words, a second time through:
$ echo "$txt" | grep -o -E '([[:alpha:]]+)[.!?]' | grep -o -E '[[:alpha:]]+'
file
Mr
Dr
precision
begin
Help
In awk, same regex:
$ echo "$txt" | awk '/[[:alpha:]]+[.!?]/{for(i=1;i<=NF;i++) if($i~/[[:alpha:]]+[.!?]/) print $i}'
Perl, same regex, allows capture groups and maybe a little more direct syntax:
$ echo "$txt" | perl -ne 'print "$1 " while /([[:alpha:]]+)[.!?]/g'
file Mr Dr precision begin Help
And with Perl, it is easier to refine the regex to be more discriminating about the words captured:
echo "$txt" | perl -ne 'print "$1 " while /([[:alpha:]]+)(?=[.!?](?:(?:\s+[[:upper:]])|(?:\s*\z)))/g'
file precision begin Help

gawk:
$ gawk -v ORS=' ' -v RS='[.?!]' '{print $NF}' w.txt
file Mr Dr precision begin Help
(Note that plain awk does not support assigning a regular expression to RS.)
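If only a POSIX awk is available, a rough equivalent (a minimal sketch, assuming the same w.txt; it simply treats every ., ? and ! as a record boundary) lets tr do the splitting first:
$ tr '.?!' '\n\n\n' < w.txt | awk 'NF {printf "%s ", $NF} END {print ""}'
file Mr Dr precision begin Help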

This might work for you (GNU sed):
sed -r 's/^[^.?!]*\b(\w+)[.?!]/\1\n/;/\n/!d;P;D' file
This gives one word per line; pipe through paste for a single line:
sed -r 's/^[^.?!]*\b(\w+)[.?!]/\1\n/;/\n/!d;P;D' file | paste -sd' '
For another solution just using sed:
sed -r 'H;$!d;x;s/\n//g;s/\b(\w+)[.?!]/\n\1\n/g;/\n/!d;s/[^\n]*\n([^\n]*)\n/ \1/g;s/.//' file

Easy in Perl:
perl -ne 'print "$1 " while /(\w+)[.!?]/g'
-n reads the input line by line.
\w matches a "word character".
\w+ matches one or more word characters.
[.!?] matches any of the sentence-end markers.
/g stands for "globally" - it remembers where the last match occurred and tries to match after it.
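Putting those pieces together, a variant (a sketch built only from the constructs above) that collects the words and prints them as a single space-delimited line with a trailing newline:
perl -ne 'push @w, $1 while /(\w+)[.!?]/g; END { print "@w\n" }' file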

How can I search for a dot and a number in sed or awk and prefix the number with a leading zero?

I am trying to modify the names of a large number of files, all of them with the following structure:
4.A.1 Introduction to foo.txt
2.C.3 Lectures on bar.pdf
3.D.6 Processes on baz.mp4
5.A.8 History of foo.txt
And I want to add a leading zero to the last digit:
4.A.01 Introduction to foo.txt
2.C.03 Lectures on bar.pdf
3.D.06 Processes on baz.mp4
5.A.08 History of foo.txt
At first I am trying to get the new names with sed (FreeBSD implementation):
ls | sed 's/\.[0-9]/0&/'
But that puts the zero before the dot.
Note: replacing the second dot would also work. I am also open to using awk.
While it may have worked for you here, in general slicing and dicing ls output is fragile, whether using sed or awk or anything else. Fortunately one can accomplish this robustly in plain old POSIX sh using globbing and fancy-pants parameter expansions:
for f in [[:digit:]].[[:alpha:]].[[:digit:]]\ ?*; do
    # $f = "[[:digit:]].[[:alpha:]].[[:digit:]] ?*" if no files match.
    if [ "$f" != '[[:digit:]].[[:alpha:]].[[:digit:]] ?*' ]; then
        tail=${f#*.*.}       # filename sans "1.A." prefix
        head=${f%"$tail"}    # the "1.A." prefix
        mv "$f" "${head}0${tail}"
    fi
done
(EDIT: Filter out filenames that don't match desired format.)
This pipeline should work for you:
ls | sed 's/\.\([0-9]\)/.0\1/'
The sed command captures the digit and replaces it with itself preceded by a 0.
Here, \1 references the first (and in this case only) capture group - the parenthesized expression.
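To actually rename the files rather than just print the new names, the same substitution can drive mv (a sketch; test it with echo mv first, since it touches everything in the current directory):
for f in *; do
    new=$(printf '%s\n' "$f" | sed 's/\.\([0-9]\)/.0\1/')
    [ "$f" = "$new" ] || mv "$f" "$new"   # skip names that didn't change
done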
I am also open to using awk.
Let file.txt content be:
4.A.1 Introduction to foo.txt
2.C.3 Lectures on bar.pdf
3.D.6 Processes on baz.mp4
5.A.8 History of foo.txt
then
awk 'BEGIN{FS=OFS="."}{$3="0" $3;print}' file.txt
outputs
4.A.01 Introduction to foo.txt
2.C.03 Lectures on bar.pdf
3.D.06 Processes on baz.mp4
5.A.08 History of foo.txt
Explanation: I set the dot (.) as both the field separator and the output field separator, then for every line I add a leading 0 to the third column ($3) by concatenating 0 with that column. Finally, I print the altered line.
(tested in GNU Awk 5.0.1)
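If some entries might already be zero-padded, a slightly defensive variant (an untested sketch along the same lines, assuming a space always follows the digit as in the sample) pads only when the third field starts with a single digit:
awk 'BEGIN{FS=OFS="."} $3 ~ /^[0-9] / {$3="0" $3} 1' file.txt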
This might work for you (GNU sed):
sed 's/^\S*\./&0/' file
This appends a 0 after the last . in the first string of non-whitespace characters on each line.
In case it helps somebody else, here is an alternative to @costaparas's answer:
ls | sed -E -e 's/^([0-9][.][A-Z][.])/\10/' > files
To then create the script that moves the files:
cat files | awk '{printf "mv \"%s\" \"%s\"\n", $0, $0}' | sed 's/\.0/\./' > movefiles.sh
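Then inspect movefiles.sh and, if the mv commands look right, run it:
sh movefiles.sh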

Using grep, awk and sed in a one-line command results in a "No such file or directory" error

...and I know why:
I have an XML document with lots of information inside. I need to extract what I need and eventually print it to a new file.
The XML (well, part of it; the rows just keep repeating):
<module classname="org.openas2.processor.receiver.AS2DirectoryPollingModule"
outboxdir="%home%/../../../home/samba/user/Outbound/toMartha/"
errordir="%home%/../../../home/samba/user/Outbound/toMartha/error"
sentdir="%home%/../../../home/samba/user/data/Sent/Martha"
interval="600"
defaults="sender.name=me_myself, receiver.name=Martha"
sendfilename="true"
mimetype="application/standard"/>
<module classname="org.openas2.processor.receiver.AS2DirectoryPollingModule"
outboxdir="%home%/../../../home/samba/user/Outbound/toJosh/"
errordir="%home%/../../../home/samba/user/Outbound/toJosh/error"
sentdir="%home%/../../../home/samba/user/data/Sent/Josh"
interval="600"
defaults="sender.name=me_myself, receiver.name=Josh"
sendfilename="true"
mimetype="application/standard"/>
<module classname="org.openas2.processor.receiver.AS2DirectoryPollingModule"
outboxdir="%home%/../../../home/samba/user/Outbound/toPamela/"
errordir="%home%/../../../home/samba/user/Outbound/toPamela/error"
interval="600"
defaults="sender.name=me_myself, receiver.name=Pamela"
sendfilename="true"
mimetype="application/standard"/>
I need to extract the folder name after "Outbound" and strip it of quotes and slashes.
Also, I need to exclude the "/error" entries so I get only one result for each of them.
My command is:
grep -o -v "/error" "Outbound/" config.xml | awk -F"Outbound/" '{print $2}' | sed -e "s/\/\"//g" > /tmp/sync_users
The error is: grep: Outbound/: No such file or directory, which of course means that I'm giving grep too many arguments (?). If I remove the -v "/error" it works, but it also prints the names with "/error".
Can someone help me?
EDIT:
As some pointed out in their example (thanks for the time you put in), I'd need to extract these words based on the sample above:
toMartha
toJosh
toPamela
It could be interesting to use sed in this case:
sed -e '\#/Outbound/#!d' -e '\#/error"$#d' -e 's#.*/Outbound/##;s#/\{0,1\}"$##' Config.xml
An awk version, assuming (for the last print) that your folder is always one level below Outbound, as shown:
awk -F '/' '$0 !~ /\/Outbound\// || /\/error"$/ {next} {print $(NF-1)}' Config.xml
Lose the grep altogether:
$ awk '/outboxdir/{gsub(/^.+Outbound\/|\/" *\r?$/,""); print}' file
toMartha
toJosh
toPamela
/outboxdir/ only processes records that contain outboxdir
gsub removes the unwanted parts of the record
space removal at the end of the record and a CRLF fix for Windows-originated files are included
To give grep multiple patterns, they have to be separated by newlines or given via multiple pattern options (-e, -f, ...). However, -v inverts the match as a whole; you can't invert just one pattern.
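For example (a minimal sketch of the syntax only, using strings from the sample config):
grep -e 'Outbound/' -e 'interval=' config.xml    # lines matching either pattern
grep -v -e '/error' -e 'sentdir' config.xml      # -v drops lines matching either pattern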
For what you're after you can use PCRE (-P argument) for the lookaround ability:
grep -o -P '(?<=Outbound\/)[^\/]+(?!.*\/error)' config.xml
The regex tries to:
match something that is not a slash, at least once: the [^\/]+
preceded by Outbound/: the positive lookbehind (?<=Outbound\/)
and not followed by something ending with /error: the negative lookahead (?!.*\/error)
With your first sample input:
$ grep -o -P '(?<=Outbound\/)[^\/]+(?!.*\/error)' test.txt
toMartha
toJosh
toPamela
How about:
grep -i "outbound" your_file | awk -F"Outbound/" '{print $2}' | sed -e 's/error//' -e 's/\/\"//' | uniq
Should work :)
You can use match in gawk and a capturing group in the regex:
awk 'match($0, /^.*\/Outbound\/([^\/]+)\/([^\/]*)\/?"$/, a){
    if (a[2] != "error") { print a[1] }
}' config.xml
You get:
toMartha
toJosh
toPamela
grep can accept multiple patterns with the -e option (aka --regexp, even though it can be used with --fixed-strings too, go figure). However, -v (--invert-match) applies to all of the patterns as a group.
Another solution would be to chain two calls to grep:
grep -v "/error" config.xml | grep "Outbound/" | awk -F"Outbound/" '{print $2}' | sed -e "s/\/\"//g"

grep strings which contain only 4 or 6 characters in bash

I want to print only the words on a line which contain exactly 4 or 6 characters, in bash. I tried a couple of things which didn't work. Can someone let me know how to do this?
To capture words that are either four or six characters long:
$ echo four fives sixsix | grep -Eow '\w{4}|\w{6}'
four
sixsix
-w tells grep to match only complete words
-o tells grep to print only the matches and not their context.
-E tells grep to use extended regular expressions so that we don't have to type so many backslashes
\w{4} matches words that are four characters long while \w{6} matches words that are six characters long. | is logical-or.
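Note that \w inside an ERE is a GNU extension; if your grep has -o and -w but not \w, the equivalent bracket expression can be spelled out (a sketch behaving the same for the sample above):
$ echo four fives sixsix | grep -Eow '[[:alnum:]_]{4}|[[:alnum:]_]{6}'
four
sixsix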
In case you don't have the GNU version (especially the -o option of grep):
sed version (POSIX compliant):
sed -e 's/[[:blank:]]\{1,\}/ /g;s/.*/ & /' -e ':cycle' -e 's/ [^ ]\{1,3\} / /g;s/ [^ ]\{7,\} / /g;s/ [^ ]\{5\} / /g;t cycle' -e 's/^ *//;s/ *$//;/^ *$/d' YourFile
awk version:
awk '{for (i=1; i<=NF; i++) {l=length($i); if (l==4 || l==6) print $i}}' YourFile

How to work with literal square bracket using awk and foreach iterations

I have a file named mapstring. Because of the [ character in my patterns, my script is not working. Please help me find a solution to this.
Content of mapstring
BC1 bc1
BC2 bc2
BAD_BIT[0] badl0
BAD_BIT[1] badlleftnr
I am working with the following script to replace patterns in the file testfile.
Content of script
foreach cel (`cat mapstring | awk '{print $1}'`)
    echo $cel
    grep -wq $cel testfile
    if ( $status == 0 ) then
        set var2 = `grep -w $cel rajeshmap | awk '{print $2}'`
        sed -i "s% ${cel} % ${var2} %g" testfile
    endif
end
Content of testfile
rajesh jain BAD_BIT[0] 1234 BAD_BIT[1000]
jain rajesh DA[0] snps
raj jain CLK stm
That's because square brackets are reserved in sed's basic regex syntax.
You'll have to escape them (and any other special characters, in fact) using backslashes (e.g. \[) before using them later in your script; this can itself be done with sed, e.g.:
sed -re 's/(\[|\])/\\\1/g'
(note that using extended regexes in sed (-r) can make this easier).
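For instance, run against one of the mapstring keys (a quick check of the idea):
$ printf '%s\n' 'BAD_BIT[0]' | sed -re 's/(\[|\])/\\\1/g'
BAD_BIT\[0\]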
Your script is rather inefficient anyhow. You can simply get rid of csh entirely (along with the useless cat and the other stylistic problems), and do this with two connected sed scripts.
sed 's/[][*\\%]/\\&/g;s/\([^ ]*\) *\(.*\)/s%\1%\2%g/' mapstring |
sed -i -f - testfile
This is assuming your sed can accept a script on standard input (-f -) and that your sed dialect does not understand any additional special characters which need to be escaped.
#!/bin/ksh
# or sh
sed 's/[[\\$^&.+*]/\\&/g' mapstring | while read -r OldCel NewCel
do
    echo ${OldCel}
    sed -i "/${OldCel}/ {
        s/.*/ & /;s% ${OldCel} % ${NewCel} %g;s/.\\(.*\\)./\\1/
    }" testfile
done
Pre-escape your cel values for the sed manipulation (you could add other special characters if they occur, depending on the directives given to sed, such as { or ().
Try something like this (I cannot test it; no GNU sed is available here).
Following @tripleee's good remark, this needs a different shell than the one used in the question; the script has been adapted accordingly.

How to remove comments from a file using "grep"?

I have an SQL file from which I need to remove all the comments:
-- Sql comment line
How can I achieve this in Linux using grep or another tool?
The grep tool has a -v option which reverses the sense of the filter. For example:
grep -v pax people
will give you all lines in the people file that don't contain pax.
An example is:
grep -v '^ *-- ' oldfile >newfile
which gets rid of lines with only white space preceding a comment marker. It won't however change lines like:
select blah blah -- comment here.
If you wanted to do that, you would use something like sed:
sed -e 's/ --.*$//' oldfile >newfile
which edits each line removing any characters from " --" to the end of the line.
Keep in mind you need to be careful with finding the string " --" in real SQL statements, like the contrived:
select ' -- ' | colm from blah blah blah
If you have these, you're better off creating/using an SQL parser rather than a simple text modification tool.
A transcript of the sed in operation:
pax$ echo '
...> this is a line with -- on it.
...> this is not
...> and -- this is again' | sed -e 's/ --.*$//'
this is a line with
this is not
and
For the grep:
pax$ echo '
-- this line starts with it.
this line does not
and -- this one is not at the start' | grep -v '^ *-- '
this line does not
and -- this one is not at the start
You can use the sed command sed -i '/--/d' <filename>, though note that it deletes any line containing --, even one with SQL before the comment.
Try using sed in the shell:
sed -e 's/--.*$//' sql.filename