Mining dictionary for sed search strings - awk

For fun I was mining the dictionary for words that sed could use to modify strings. Example:
sed settee <<< better
sed statement <<< dated
Outputs:
beer
demented
These sed swords must be at least 5 letters long, and begin with s, then another letter, which can appear only 3 times, with at least one other letter between the first and second instances, and with the third instance as the final letter.
I used sed to generate a word list, and it seems to work:
d=/usr/share/dict/american-english
sed -n '/^s\([a-z]\)\(.*\1\)\{2\}$/{
/^s\([a-z]\)\(.*\1\)\{3\}$/!{/^s\([a-z]\)\1/!p}}' $d |
xargs echo
Output:
sanatoria sanitaria sarcomata savanna secede secrete secretive segregate selective selvedge sentence sentience sentimentalize septette sequence serenade serene serpentine serviceable serviette settee severance severe sewerage sextette stateliest statement stealthiest stoutest straightest straightjacket straitjacket strategist streetlight stretchiest strictest structuralist
But that sed code runs three passes through each line, which seems excessively long and kludgy. How can that code be simplified, while still outputting the same word list?
grep or awk answers would also be OK.

awk to the rescue!
code is cleaner with awk and reads as the spec: split the word based on the second char, three instances of the char will split the word into 4 segments; 2nd one should have at least one char and the last one should be empty.
$ awk '/^s/{n=split($1,a,substr($1,2,1));
if(n==4 && length(a[2])>0 && a[4]=="") print}' /usr/share/dict/american-english | xargs
sanatoria sanitaria sarcomata savanna secede secrete secretive
segregate selective selvedge sentence sentience sentimentalize
septette sequence serenade serene serpentine serviceable serviette
settee severance severe sewerage sextette stateliest statement
stealthiest stoutest straightest straightjacket straitjacket strategist
streetlight stretchiest strictest structuralist

very cool idea. I think you're more restrictive than necessary
sed -nE '/^s(.)[^\1]+\1[^\1]*\1g?$/p'
seems to work fine. It generated 518 words for me. I only have /usr/share/dict/words dictionary file though.
sabadilla sabakha sabana sabbatia sabdariffa sacatra saccharilla
saccharogalactorrhea saccharorrhea saccharosuria saccharuria sacralgia
sacraria sacrcraria sacrocoxalgia sadhaka sadhana sahara saintpaulia
salaceta salada salagrama salamandra saltarella salutatoria
...
stuntist subbureau sucuriu sucuruju sulphurou surucucu
syenite-porphyry symphyseotomy symphysiotomy symphysotomy symphysy
symphytically syndactyly synonymity synonymously synonymy
syzygetically syzygy
an interesting find is
$ sed snow-nodding <<< now-or-never
noddior-never

A speedy pcregrep method, (.025 seconds user time):
d=/usr/share/dict/american-english
pcregrep '^s(.)((?!\1).)+\1((?!\1).)*\1$' $d | xargs echo
Output:
sanatoria sanitaria sarcomata savanna secede secrete secretive segregate selective selvedge sentence sentience sentimentalize septette sequence serenade serene serpentine serviceable serviette settee severance severe sewerage sextette stateliest statement stealthiest stoutest straightest straightjacket straitjacket strategist streetlight stretchiest strictest structuralist
Code inspired by: Regex: Match everything except backreference

Related

How can I search for a dot an a number in sed or awk and prefix the number with a leading zero?

I am trying to modify the name of a large number of files, all of them with the following structure:
4.A.1 Introduction to foo.txt
2.C.3 Lectures on bar.pdf
3.D.6 Processes on baz.mp4
5.A.8 History of foo.txt
And I want to add a leading zero to the last digit:
4.A.01 Introduction to foo.txt
2.C.03 Lectures on bar.pdf
3.D.06 Processes on baz.mp4
5.A.08 History of foo.txt
At first I am trying to get the new names with sed (FreeBSD implementation):
ls | sed 's/\.[0-9]/0&/'
But I get the zero before the .
Note: replacing the second dot would also work. I am also open to use awk.
While it may have worked for you here, in general slicing and dicing ls output is fragile, whether using sed or awk or anything else. Fortunately one can accomplish this robustly in plain old POSIX sh using globbing and fancy-pants parameter expansions:
for f in [[:digit:]].[[:alpha:]].[[:digit:]]\ ?*; do
# $f = "[[:digit:]].[[:alpha:]].[[:digit:]] ?*" if no files match.
if [ "$f" != '[[:digit:]].[[:alpha:]].[[:digit:]] ?*' ]; then
tail=${f#*.*.} # filename sans "1.A." prefix
head=${f%"$tail"} # the "1.A." prefix
mv "$f" "${head}0${tail}"
fi
done
(EDIT: Filter out filenames that don't match desired format.)
This pipeline should work for you:
ls | sed 's/\.\([0-9]\)/.0\1/'
The sed command here will capture the digit and replace it with a preceding 0.
Here, \1 references the first (and in this case only) capture group - the parenthesized expression.
I am also open to use awk.
Let file.txt content be:
4.A.1 Introduction to foo.txt
2.C.3 Lectures on bar.pdf
3.D.6 Processes on baz.mp4
5.A.8 History of foo.txt
then
awk 'BEGIN{FS=OFS="."}{$3="0" $3;print}' file.txt
outputs
4.A.01 Introduction to foo.txt
2.C.03 Lectures on bar.pdf
3.D.06 Processes on baz.mp4
5.A.08 History of foo.txt
Explanation: I set dot (.) as both field seperator and output field seperator, then for every line I add leading 0 to third column ($3) by concatenating 0 and said column. Finally I print such altered line.
(tested in GNU Awk 5.0.1)
This might work for you (GNU sed):
sed 's/^\S*\./&0/' file
This appends 0 after the last . in the first string of non-empty characters in each line.
In case it helps somebody else, as an alternative to #costaparas answer:
ls | sed -E -e 's/^([0-9][.][A-Z][.])/\10/' > files
To then create the script the files:
cat files | awk '{printf "mv \"%s\" \"%s\"\n", $0, $0}' | sed 's/\.0/\./' > movefiles.sh

Get the line number of the last line with non-blank characters

I have a file which has the following content:
10 tiny toes
tree
this is that tree
5 funny 0
There are spaces at the end of the file. I want to get the line number of the last row of a file (that has characters). How do I do that in SED?
This is easily done with awk,
awk 'NF{c=FNR}END{print c}' file
With sed it is more tricky. You can use the = operator but this will print the line-number to standard out and not in the pattern space. So you cannot manipulate it. If you want to use sed, you'll have to pipe it to another or use tail:
sed -n '/^[[:blank:]]*$/!=' file | tail -1
You can use following pseudo-code:
Replace all spaces by empty string
Remove all <beginning_of_line><end_of_line> (the lines, only containing spaces, will be removed like this)
Count the number of remaining lines in your file
It's tough to count line numbers in sed. Some versions of sed give you the = operator, but it's not standard. You could use an external tool to generate line numbers and do something like:
nl -s ' ' -n ln -ba input | sed -n 's/^\(......\)...*/\1/p' | sed -n '$p'
but if you're going to do that you might as well just use awk.
This might work for you (GNU sed):
sed -n '/\S/=' file | sed -n '$p'
For all lines that contain a non white space character, print a line number. Pipe this output to second invocation of sed and print only the last line.
Alternative:
grep -n '\S' file | sed -n '$s/:.*//p'

Replace character except between pattern using grep -o or sed (or others)

In the following file I want to replace all the ; by , with the exception that, when there is a string (delimited with two "), it should not replace the ; inside it.
Example:
Input
A;B;C;D
5cc0714b9b69581f14f6427f;5cc0714b9b69581f14f6428e;1;"5cc0714b9b69581f14f6427f;16a4fba8d13";xpto;
5cc0723b9b69581f14f64285;5cc0723b9b69581f14f64294;2;"5cc0723b9b69581f14f64285;16a4fbe3855";xpto;
5cc072579b69581f14f6428a;5cc072579b69581f14f64299;3;"5cc072579b69581f14f6428a;16a4fbea632";xpto;
output
A,B,C,D
5cc0714b9b69581f14f6427f,5cc0714b9b69581f14f6428e,1,"5cc0714b9b69581f14f6427f;16a4fba8d13",xpto,
5cc0723b9b69581f14f64285,5cc0723b9b69581f14f64294,2,"5cc0723b9b69581f14f64285;16a4fbe3855",xpto,
5cc072579b69581f14f6428a,5cc072579b69581f14f64299,3,"5cc072579b69581f14f6428a;16a4fbea632",xpto,
For sed I have: sed 's/;/,/g' input.txt > output.txt but this would replace everything.
The regex for the " delimited string: \".*;.*\" .
(A regex for hexadecimal would be better -- something like: [0-9a-fA-F]+)
My problem is combining it all to make a grep -o / sed that replaces everything except for that pattern.
The file size is in the order of two digit Gb (max 99Gb), so performance is important. Relevant.
Any ideas are appreciated.
sed is for doing simple s/old/new on individual strings. grep is for doing g/re/p. You're not trying to do either of those tasks so you shouldn't be considering either of those tools. That leaves the other standard UNIX tool for manipulating text - awk.
You have a ;-separated CSV that you want to make ,-separated. That's simply:
$ awk -v FPAT='[^;]*|"[^"]+"' -v OFS=',' '{$1=$1}1' file
A,B,C,D
5cc0714b9b69581f14f6427f,5cc0714b9b69581f14f6428e,1,"5cc0714b9b69581f14f6427f;16a4fba8d13",xpto,
5cc0723b9b69581f14f64285,5cc0723b9b69581f14f64294,2,"5cc0723b9b69581f14f64285;16a4fbe3855",xpto,
5cc072579b69581f14f6428a,5cc072579b69581f14f64299,3,"5cc072579b69581f14f6428a;16a4fbea632",xpto,
The above uses GNU awk for FPAT. See What's the most robust way to efficiently parse CSV using awk? for more details on parsing CSVs with awk.
If I get correctly your requirements, one option would be to make a three pass thing.
From your comment about hex, I'll consider nothing like # will come in the input so you can do (using GNU sed) :
sed -E 's/("[^"]+);([^"]+")/\1#\2/g' original > transformed
sed -i 's/;/,/g' transformed
sed -i 's/#/;/g' transformed
The idea being to replace the ; when within quotes by something else and write it to a new file, then replace all ; by , and then set back the ; in place within the same file (-i flag of sed).
The three pass can be combined in a single command with
sed -E 's/("[^"]+);([^"]+")/\1#\2/g;s/;/,/g;s/#/;/g' original > transformed
That said, there's probably a bunch of csv parser witch already handle quoted fields that you can probably use in the final use case as I bet this is just an intermediary step for something else later in the chain.
From Ed Morton's comment: if you do it in one pass, you can use \n as replacement separator as there can't be a newline in the text considered line by line.
This might work for you (GNU sed):
sed -E ':a;s/^([^"]*("[^"]*"[^"]*)*"[^";]*);/\1\n/;ta;y/;/,/;y/\n/;/' file
Replace ;'s inside double quotes with newlines, transpose ;'s to ,'s and then transpose newlines to ;'s.

awk to remove 5th column from N column with fixed delimiter

I have file with Nth columns
I want to remove the 5th column from last of Nth columns
Delimiter is "|"
I tested with simple example as shown below:
bash-3.2$ echo "1|2|3|4|5|6|7|8" | nawk -F\| '{print $(NF-4)}'
4
Expecting result:
1|2|3|5|6|7|8
How should I change my command to get the desired output?
If I understand you correctly, you want to use something like this:
sed -E 's/\|[^|]*((\|[^|]*){4})$/\1/'
This matches a pipe character \| followed by any number of non-pipe characters [^|]*, then captures 4 more of the same pattern ((\|[^|]*){4}). The $ at the end matches the end of the line. The first part of the match (i.e. the fifth field from the end) is dropped.
Testing it out:
$ sed -E 's/\|[^|]*((\|[^|]*){4})$/\1/' <<<"1|2|3|4|5|6|7"
1|2|4|5|6|7
You could achieve the same thing using GNU awk with gensub but I think that sed is the right tool for the job in this case.
If your version of sed doesn't support extended regex syntax with -E, you can modify it slightly:
sed 's/|[^|]*\(\(|[^|]*\)\{4\}\)$/\1/'
In basic mode, pipes are interpreted literally but parentheses for capture groups and curly brcneed to be escaped.
AWK is your friend :
Sample Input
A|B|C|D|E|F|G|H|I
A|B|C|D|E|F|G|H|I|A
A|B|C|D|E|F|G|H|I|F|E|D|O|R|Q|U|I
A|B|C|D|E|F|G|H|I|E|O|Q
A|B|C|D|E|F|G|H|I|X
A|B|C|D|E|F|G|H|I|J|K|L
Script
awk 'BEGIN{FS="|";OFS="|"}
{$(NF-5)="";sub(/\|\|/,"|");print}' file
Sample Output
A|B|C|E|F|G|H|I
A|B|C|D|F|G|H|I|A
A|B|C|D|E|F|G|H|I|F|E|O|R|Q|U|I
A|B|C|D|E|F|H|I|E|O|Q
A|B|C|D|F|G|H|I|X
A|B|C|D|E|F|H|I|J|K|L
What we did here
As you are aware awk's has special variables to store each field in the record, which ranges from $1,$2 upto $(NF)
To exclude the 5th from the last column is as simple as
Emptying the colume ie $(NF-5)=""
Removing from the record, the consecutive | formed by the above step ie do sub(/\|\|/,"|")
another alternative, using #sjsam's input file
$ rev file | cut -d'|' --complement -f6 | rev
A|B|C|E|F|G|H|I
A|B|C|D|F|G|H|I|A
A|B|C|D|E|F|G|H|I|F|E|O|R|Q|U|I
A|B|C|D|E|F|H|I|E|O|Q
A|B|C|D|F|G|H|I|X
A|B|C|D|E|F|H|I|J|K|L
not sure you want the 5'th from the last or 6th. But it's easy to adjust.
Thanks for the help and guidance.
Below is what I tested:
bash-3.2$ echo "1|2|3|4|5|6|7|8|9" | nawk 'BEGIN{FS="|";OFS="|"} {$(NF-4)="!";print}' | sed 's/|!//'
Output: 1|2|3|4|6|7|8|9
Further tested on the file that I have extracted from system and so it worked fine.

Replace chars after column X

Lets say my data looks like this
iqwertyuiop
and I want to replace all the letters i after column 3 with a Z.. so my output would look like this
iqwertyuZop
How can I do this with sed or awk?
It's not clear what you mean by "column" but maybe this is what you want using GNU awk for gensub():
$ echo iqwertyuiop | awk '{print substr($0,1,3) gensub(/i/,"Z","g",substr($0,4))}'
iqwertyuZop
Perl is handy for this: you can assign to a substring
$ echo "iiiiii" | perl -pe 'substr($_,3) =~ s/i/Z/g'
iiiZZZ
This would totally be ideal for the tr command, if only you didn't have the requirement that the first 3 characters remain untouched.
However, if you are okay using some bash tricks plus cut and paste, you can split the file into two parts and paste them back together afterwords:
paste -d'\0' <(cut -c-3 foo) <(cut -c4- foo | tr i Z)
The above uses paste to rejoin together the two parts of the file that get split with cut. The second section is piped through tr to translate i's to Z's.
(1) Here's a short-and-simple way to accomplish the task using GNU sed:
sed -r -e ':a;s/^(...)([^i]*)i/\1\2Z/g;ta'
This entails looping (t), and so would not be as efficient as non-looping approaches. The above can also be written using escaped parentheses instead of unescaped characters, and so there is no real need for the -r option. Other implementations of sed should (in principle) be up to the task as well, but your MMV.
(2) It's easy enough to use "old awk" as well:
awk '{s=substr($0,4);gsub(/i/,"Z",s); print substr($0,1,3) s}'
The most intuitive way would be to use awk:
awk 'BEGIN{FS="";OFS=FS}{for(i=4;i<=NF;i++){if($i=="i"){$i="Z"}}}1' file
FS="" splits the input string by characters into fields. We iterate trough character/field 4 to end and replace i by Z.
The final 1 evaluates to true and make awk print the modified input line.
With sed it looks not very intutive but still it is possible:
sed -r '
h # Backup the current line in hold buffer
s/.{3}// # Remove the first three characters
s/i/Z/g # Replace all i by Z
G # Append the contents of the hold buffer to the pattern buffer (this adds a newline between them)
s/(.*)\n(.{3}).*/\2\1/ # Remove that newline ^^^ and assemble the result
' file