Convert capitalized words only into lower case to cancel capitalization of nouns - awk

Given:
$ cat input
Hello
Welcome
strIng
North Korea
USA
U.K.
I want to obtain:
$ cat output
hello
welcome
strIng
North Korea
USA
U.K.
How can I convert capitalized words* only to lower case?
*: first letter is capitalized.
Note: I am looking for a command which cancels the capitalization of nouns while not attacking acronyms and weird words.

Something like this will cover the sample input, but I'm not sure it's comprehensive of all the other implied conditions:
$ awk '/^[A-Z][^A-Z]+$/{$1=tolower(substr($1,1,1)) substr($1,2)}1' file
hello
welcome
strIng
North Korea
USA
U.K.
If the first character is upper case and no subsequent character is, convert the first character to lower case.

A sed solution:
sed '/^[A-Z][^A-Z]*$/ {
y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/
}' input.txt > output.txt
Tested and confirmed to work for your example. Modify it to handle accented characters if needed; the y command only transliterates the characters you list explicitly.
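A note on GNU sed: its \l replacement escape lower-cases the character that follows in the replacement, which avoids hand-typing y/// alphabets (and the typos they invite). A sketch assuming GNU sed:

```shell
# GNU sed only: \l lower-cases the next replacement character.
# Lines that are a single capitalized word get their first letter
# lower-cased; acronyms and mixed-case words fall through untouched.
printf 'Hello\nWelcome\nstrIng\nNorth Korea\nUSA\nU.K.\n' |
  sed -E 's/^([A-Z][^A-Z]*)$/\l\1/'
```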

How to get the string which is less than 4 using SED or AWK or GREP

I'm trying to get strings of less than 4 (0-3) characters, which might include some special characters too.
The issue here is that I'm not really sure which special characters are involved.
The input can contain names of any length, with some special characters; I'm not sure which ones are included.
Sample input data is as below:
r#nger
d!nger
'iterr
4#e
c#nuidig
c#niting
c^neres
sample
Sample Output should be like this
r#n
d!n
'it
4#e
c#n
c#n
c^n
sam
I have tried the commands below. Both work, but both have flaws: apart from the 0-3 character strings, I'm also getting 1-character outputs, which is incorrect.
For example just c, which doesn't appear in the input by itself.
grep -iE '^[a-z0-9\.-+?$_,#]{0,3}$'
sed -n '/^.\{0,3\}$/p'
grep uid: file.csv | awk {'print $2'} | sed -En 's/^([^[:space:]]{3}).*/\1/p' | sort -f > output
Sample Output from above
r#n
d!n
'it
4#e
c#n
c
c
sam
s
I'm thinking that there might be some special character after the first character which is making it break and only printing the first character.
Can someone please suggest how to get this working as expected?
Thanks.
To get the output you posted from the input you posted is just:
$ cut -c1-3 file
r#n
d!n
'it
4#e
c#n
c#n
c^n
sam
If that's not all you need then edit your question to more clearly state your requirements and provide more truly representative sample input/output including cases where this doesn't work.
You may use this grep with -o and -E options:
grep -oE '^[^[:blank:]]{1,3}' file
r#n
d!n
'it
4#e
c#n
c#n
c^n
sam
Regex ^[^[:blank:]]{1,3} matches and outputs 1 to 3 non-blank characters (no spaces or tabs) anchored at the start of the line.
Using awk:
awk '{print (length($0)<3) ? $0 : substr($0,1,3)}' src.dat
Output:
r#n
d!n
'it
4#e
c#n
c#n
c^n
sam
1
11
-1
.
Contents of src.dat:
r#nger
d!nger
'iterr
4#e
c#nuidig
c#niting
c^neres
sample
1
11
-1
.
sed 's/.//4g' file
Delete every character from the 4th onward, until there are no more. This requires GNU sed, whose manual says:
Note: the POSIX standard does not specify what should happen when you mix the g and number modifiers, and currently there is no widely agreed upon meaning across sed implementations. For GNU sed, the interaction is defined to be: ignore matches before the numberth, and then match and replace all matches from the numberth on.
Also: grep -o '^...' file
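A quick illustration of the number+g combination (GNU sed assumed): characters before the 4th match survive, and lines shorter than 4 characters pass through unchanged.

```shell
# GNU sed: replace (here: delete) from the 4th match onward.
# "c#nuidig" loses everything after "c#n"; "sam" has no 4th char,
# so it is left alone.
printf 'c#nuidig\nsam\n' | sed 's/.//4g'
```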

Identifying lines ending in capital letters with AWK and using them as record separators

This is a smaller representative version of a data file that I need to parse and divide into chunks with awk based on the roman numeral of each chunk.
I
Apple
II
Banana
III
Mango
IV
Durian
Lemon
IV
Papaya
V
This seemed like an easy task with awk, so I tried
gawk -v RS="[A-Z]+$" '{print $0}' blah.txt to use lines that end with one or more capital letters (thus indicating lines with Roman Numerals) as record separators.
Surprisingly the program outputted the entire data file. Where did I go wrong?
Even more surprisingly, if I place an exit after the print statement, it still prints the entire file (indicating that the whole file is considered as one record)
I am using GNU AWK 4.1.3 on a Linux Mint machine.
NOTE: The specific use case I have in mind is to extract an arbitrary Shakespearean sonnet by number from the text file at http://www.gutenberg.org/cache/epub/1041/pg1041.txt (after removing the boilerplate header and footer data )
The $ is the culprit - GNU awk treats the entire file as a single string for the purpose of matching a RS regular expression, and $ thus only matches at the end of the file (This is noted in the manual). Try replacing it with \>, which matches end of word, not end of string (And \< to match the start of a word, so that only things like I and IV are matched):
$ awk -v RS='\\<[A-Z]+\\>' '{print $0}' input.txt
Apple
Banana
Mango
Durian
Lemon
Papaya
You'll have to deal with all the extra newlines and whitespace, of course.
Given the input on http://www.gutenberg.org/cache/epub/1041/pg1041.txt, it looks like you can just print the 12th, 14th, 16th, ... paragraph to get the output that you want. Setting the record separator to an empty string and printing the desired record is enough to do that.
For example, to print the first sonnet:
$ awk -v RS='' -v sonnet=1 'NR == 10 + 2 * sonnet' file
From fairest creatures we desire increase,
That thereby beauty's rose might never die,
But as the riper should by time decease,
His tender heir might bear his memory:
But thou, contracted to thine own bright eyes,
Feed'st thy light's flame with self-substantial fuel,
Making a famine where abundance lies,
Thy self thy foe, to thy sweet self too cruel:
Thou that art now the world's fresh ornament,
And only herald to the gaudy spring,
Within thine own bud buriest thy content,
And tender churl mak'st waste in niggarding:
Pity the world, or else this glutton be,
To eat the world's due, by the grave and thee.

Mining dictionary for sed search strings

For fun I was mining the dictionary for words that sed could use to modify strings. Example:
sed settee <<< better
sed statement <<< dated
Outputs:
beer
demented
These sed swords must be at least 5 letters long, and begin with s, then another letter, which can appear only 3 times, with at least one other letter between the first and second instances, and with the third instance as the final letter.
I used sed to generate a word list, and it seems to work:
d=/usr/share/dict/american-english
sed -n '/^s\([a-z]\)\(.*\1\)\{2\}$/{
/^s\([a-z]\)\(.*\1\)\{3\}$/!{/^s\([a-z]\)\1/!p}}' $d |
xargs echo
Output:
sanatoria sanitaria sarcomata savanna secede secrete secretive segregate selective selvedge sentence sentience sentimentalize septette sequence serenade serene serpentine serviceable serviette settee severance severe sewerage sextette stateliest statement stealthiest stoutest straightest straightjacket straitjacket strategist streetlight stretchiest strictest structuralist
But that sed code runs three passes through each line, which seems excessively long and kludgy. How can that code be simplified, while still outputting the same word list?
grep or awk answers would also be OK.
awk to the rescue!
The code is cleaner with awk and reads like the spec: split the word on its second character; three instances of that character split the word into 4 segments, of which the 2nd must have at least one character and the last must be empty.
$ awk '/^s/{n=split($1,a,substr($1,2,1));
if(n==4 && length(a[2])>0 && a[4]=="") print}' /usr/share/dict/american-english | xargs
sanatoria sanitaria sarcomata savanna secede secrete secretive
segregate selective selvedge sentence sentience sentimentalize
septette sequence serenade serene serpentine serviceable serviette
settee severance severe sewerage sextette stateliest statement
stealthiest stoutest straightest straightjacket straitjacket strategist
streetlight stretchiest strictest structuralist
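To see the split logic in action, a quick check with one accepted and one rejected word:

```shell
# "settee": splitting on its 2nd char "e" gives "s","tt","","" -> 4
# parts, middle non-empty, last empty -> accepted.
# "sample": splitting on "a" gives only 2 parts -> rejected.
printf 'settee\nsample\n' |
  awk '/^s/{n=split($1,a,substr($1,2,1));
       if(n==4 && length(a[2])>0 && a[4]=="") print}'
```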
Very cool idea. I think you're more restrictive than necessary:
sed -nE '/^s(.)[^\1]+\1[^\1]*\1g?$/p'
seems to work fine; note though that a backreference has no special meaning inside a bracket expression, so [^\1] really excludes the literal characters \ and 1, which makes this laxer than it looks. It generated 518 words for me. I only have the /usr/share/dict/words dictionary file though.
sabadilla sabakha sabana sabbatia sabdariffa sacatra saccharilla
saccharogalactorrhea saccharorrhea saccharosuria saccharuria sacralgia
sacraria sacrcraria sacrocoxalgia sadhaka sadhana sahara saintpaulia
salaceta salada salagrama salamandra saltarella salutatoria
...
stuntist subbureau sucuriu sucuruju sulphurou surucucu
syenite-porphyry symphyseotomy symphysiotomy symphysotomy symphysy
symphytically syndactyly synonymity synonymously synonymy
syzygetically syzygy
An interesting find is:
$ sed snow-nodding <<< now-or-never
noddior-never
A speedy pcregrep method, (.025 seconds user time):
d=/usr/share/dict/american-english
pcregrep '^s(.)((?!\1).)+\1((?!\1).)*\1$' $d | xargs echo
Output:
sanatoria sanitaria sarcomata savanna secede secrete secretive segregate selective selvedge sentence sentience sentimentalize septette sequence serenade serene serpentine serviceable serviette settee severance severe sewerage sextette stateliest statement stealthiest stoutest straightest straightjacket straitjacket strategist streetlight stretchiest strictest structuralist
Code inspired by: Regex: Match everything except backreference
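If pcregrep is not installed, the same pattern should work with GNU grep's -P option, where grep was built with PCRE support (a sketch on a few sample words, not the full dictionary):

```shell
# grep -P (PCRE): the lookahead (?!\1) forbids further copies of the
# captured second letter between the three required occurrences.
printf 'settee\nstatement\nsevere\nsample\n' |
  grep -P '^s(.)((?!\1).)+\1((?!\1).)*\1$'
```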

Using awk to extract lines between patterns

I am trying to use awk to extract lines between two patterns, but as the second pattern involves multiple $ signs I have not managed to escape the $ properly.
input is:
Name1
iii
iii
$$$$
Name2
ii
ooo
ppp
$$$$
Name3
pp
oo
ii
uu
$$$$
desired output
Name2
ii
ooo
ppp
$$$$
I tried something like this:
awk 'BEGIN {RS="\\$\\$\\$\\$\n"; FS="\n"} $1==Name2 {print $0; print "$$$$"}' inputfile
I also tried something like
awk '/^Name2/,/\\$\\$\\$\\$\\/' input
I tried many different ways of escaping the $ but I got it wrong; either nothing is printed or the entire file is printed.
Many thanks for any suggestions.
You don't have to use patterns if you're looking for a literal match:
awk '$0=="Name2",$0=="$$$$"' file
will extract the lines between the two literals. Or a combination, if the first match should be a pattern:
awk '/Name2/,$0=="$$$$"' file
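For example, against the sample input the literal range prints just the Name2 chunk (the range form is standard awk, so any POSIX awk should do):

```shell
# Range: from the line equal to "Name2" through the next "$$$$" line.
# String comparison, so no $ escaping is needed at all.
printf 'Name1\niii\niii\n$$$$\nName2\nii\nooo\nppp\n$$$$\nName3\npp\n$$$$\n' |
  awk '$0=="Name2",$0=="$$$$"'
```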
Awk solution:
awk '/^Name2/{ f=1 }f; $1=="$$$$"{ f=0 }' file
The f variable is a flag indicating whether the current line may be printed.
The output:
Name2
ii
ooo
ppp
$$$$
Instead of using $$$$ as the record separator, you may use the blank lines between records, i.e. split the input into paragraphs:
awk 'BEGIN{RS=""}/Name2/' file
RS="" is a special value. From the awk man page:
If RS is set to the null string, then records are separated by blank lines. When RS is set to the null string, the newline character always acts as a field separator, in addition to whatever value FS may have.
While the above code is fine for your example, you get into trouble with keys like Name20, so a regex match might not be the right approach. A string comparison would probably be a better fit:
awk 'BEGIN{RS="";FS="\n"} $1 == "Name2"' file
I'm explicitly setting FS="\n" to avoid splitting within single lines.
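To see why the exact comparison matters, a contrived input with a hypothetical Name20 record (not part of the original sample): the regex form /Name2/ would also print that record, while the string comparison does not.

```shell
# Paragraph mode: RS="" splits on blank lines; with FS="\n", $1 is the
# record's first line. Only the record starting exactly "Name2" prints.
printf 'Name2\nii\n\nName20\nxx\n' |
  awk 'BEGIN{RS="";FS="\n"} $1 == "Name2"'
```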

Printing lines with duplicate words

I am trying to print all lines that contain the same word twice or more.
E.g. with this input file:
cat dog cat
dog cat deer
apple peanut banana apple
car bus train plane
car train car train
Output should be
cat dog cat
apple peanut banana apple
car train car train
I have tried this code and it works but I think there must be a shorter way.
awk '{ a=0;for(i=1;i<=NF;i++){for(j=i+1;j<=NF;j++){if($i==$j)a=1} } if( a==1 ) print $0}'
Later I want to find all such duplicate words and delete all the duplicate entries except for 1st occurrence.
So input:
cat dog cat lion cat
dog cat deer
apple peanut banana apple
car bus train plane
car train car train
Desired output:
cat dog lion
dog cat deer
apple peanut banana
car bus train plane
car train
You can use this GNU sed command:
sed -rn '/(\b\w+\b).*\b\1\b/ p' yourfile
-r activates extended REs and -n deactivates the implicit printing of every line
the p command then prints only lines that match the preceding RE (inside the slashes):
\b\w+\b matches a word: a nonempty sequence of word characters (\w) between word boundaries (\b); these are GNU extensions
such a word is "stored" in \1 for later reuse, due to the use of parentheses
then we try to match this word with \b\1\b again with something optional (.*) between those two places.
and that is the whole trick: match something, put it in parentheses so you can reuse it in the same re with \1
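Applied to the question's first two sample lines, only the one with a repeated word is printed (a quick check; \b and \w are GNU sed extensions):

```shell
# "cat dog cat" repeats "cat"; "dog cat deer" repeats nothing.
printf 'cat dog cat\ndog cat deer\n' |
  sed -rn '/(\b\w+\b).*\b\1\b/ p'
```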
To answer the second part of the question, deleting the doubled words after the first, but print all lines (modifying only the lines with doubled words), you could use some sed s magic:
sed -r ':A s/(.*)(\b\w+\b)(.*)\b\2\b(.*)/\1\2\3\4/g; t A ;'
here we use again the backreference trick.
but we have to account for the things before, between and after our doubled words; thus we have a \2 in the matching part of the s command and the other backreferences in the replacement part.
notice that only the \2 has no parens in the matching part and we use all groups in the replacement, thus we effectively deleted the second word of the pair.
for more repetitions of the word we need loop:
:A is a label
t A jumps to the label if a replacement was done in the last s command
this builds a "while loop" around the s to delete the other repetitions, too
Here's a solution for printing only lines that contain duplicate words.
awk '{
delete seen
for (i=1;i<=NF;++i) {
if (seen[$i]) { print ; next }
seen[$i] = 1
}
}'
Here's a solution for deleting duplicate words after the first.
awk '{
delete seen
for (i=1;i<=NF;++i) {
if (seen[$i]) { continue }
printf("%s ", $i);
seen[$i] = 1
}
print "";
}'
Re your comment...
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. — Jamie Zawinski, 1997
With egrep you can use a so called back reference:
egrep '(\b\w+\b).*\b\1\b' file
(\b\w+\b) matches a word at word boundaries in capturing group 1. \1 references that matched word in the pattern.
I'll show solutions in Perl as it is probably the most flexible tool for text parsing, especially when it comes to regular expressions.
Detecting Duplicates
perl -ne 'print if m{\b(\S+)\b.*?(\b\1\b)}g' file
where
-n causes Perl to execute the expression passed via -e for each input line;
\b matches word boundaries;
\S+ matches one or more non-space characters;
.*? is a non-greedy match for zero or more characters;
\1 is a backreference to the first group, i.e. the word \S+;
g globally matches the pattern repeatedly in the string.
Removing Duplicates
perl -pe '1 while (s/\b(\S+)\b.*?\K(\s\1\b)//g)' file
where
-p causes Perl to print the line ($_), like sed;
1 while loop runs as long as the substitution replaces something;
\K keeps the part matching the previous expression;
Duplicate words (\s\1\b) are replaced with empty string (//g).
Why Perl?
Perl regular expressions are known to be very flexible, and regular expressions in Perl are actually more than just regular expressions. For example, you can embed Perl code into the substitution using the /e modifier. You can use the /x modifier, which allows you to write regular expressions in a more readable format and even use Perl comments in them, e.g.:
perl -pe '1 while (
s/ # Begins substitution: s/pattern/replacement/flags
\b (\S+) \b # A word
.*? # Ungreedy pattern for any number of characters
\K # Keep everything that matched the previous patterns
( # Group for the duplicate word:
\s # - space
\1 # - backreference to the word
\b # - word boundary
)
//xg
)' file
As you should have noticed, the \K anchor is very convenient, but is not available in many popular tools including awk, bash, and sed.