How to get strings shorter than 4 characters using SED or AWK or GREP - awk

I'm trying to get strings of fewer than 4 characters (0 to 3), which may include some special characters too.
The issue is that I'm not sure exactly which special characters are involved.
The input can contain names of any length, with special characters I can't list in advance.
Sample input data is as below:
r#nger
d!nger
'iterr
4#e
c#nuidig
c#niting
c^neres
sample
Sample Output should be like this
r#n
d!n
'it
4#e
c#n
c#n
c^n
sam
I have tried the commands below. They both work, but both have a flaw: apart from the 0 to 3 character strings, I'm also getting outputs of only 1 character, which is incorrect.
For example just c, which does not appear by itself in the input.
grep -iE '^[a-z0-9\.-+?$_,#]{0,3}$'
sed -n '/^.\{0,3\}$/p'
grep uid: file.csv | awk {'print $2'} | sed -En 's/^([^[:space:]]{3}).*/\1/p' | sort -f > output
Sample Output from above
r#n
d!n
'it
4#e
c#n
c
c
sam
s
I suspect there is some special character after the first character that breaks the match, so only the first character is printed.
Can someone please suggest how to get this working as expected?
Thanks,

To get the output you posted from the input you posted is just:
$ cut -c1-3 file
r#n
d!n
'it
4#e
c#n
c#n
c^n
sam
If that's not all you need then edit your question to more clearly state your requirements and provide more truly representative sample input/output including cases where this doesn't work.

You may use this grep with -o and -E options:
grep -oE '^[^[:blank:]]{1,3}' file
r#n
d!n
'it
4#e
c#n
c#n
c^n
sam
Regex ^[^[:blank:]]{1,3} matches and outputs 1 to 3 non-whitespace characters from start position.
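As a quick check of how the {1,3} quantifier treats short lines, here is a minimal sketch. The three sample lines are made up for illustration and are not from the question.

```shell
# Hypothetical sample: a long word, a two-character word, and a line with a space.
sample=$(printf '%s\n' 'c#nuidig' 'ab' 'x y')

# -o prints only the matched part; {1,3} takes up to three leading non-blank characters.
result=$(printf '%s\n' "$sample" | grep -oE '^[^[:blank:]]{1,3}')
printf '%s\n' "$result"
```

Lines shorter than three characters are kept whole, and matching stops at the first space or tab.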

Using awk:
awk '{print (length($0)<=3 ? $0 : substr($0,1,3))}' src.dat
Output:
r#n
d!n
'it
4#e
c#n
c#n
c^n
sam
1
11
-1
.
Contents of src.dat:
r#nger
d!nger
'iterr
4#e
c#nuidig
c#niting
c^neres
sample
1
11
-1
.
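awk numbers string positions from 1, and as far as I know implementations differ on what a start position of 0 means, so a sketch that starts at position 1 is the safer form. substr also returns the whole string when it is shorter than the requested length, so the length test is optional. The sample lines are taken from src.dat above.

```shell
# A few lines from src.dat, including entries shorter than three characters.
sample=$(printf '%s\n' 'c#nuidig' 'sample' '11' '.')

# substr(s, 1, 3) yields at most the first three characters of s.
result=$(printf '%s\n' "$sample" | awk '{ print substr($0, 1, 3) }')
printf '%s\n' "$result"
```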

sed 's/.//4g' file
This deletes every character from the 4th onward. It needs GNU sed, whose manual says:
Note: the POSIX standard does not specify what should happen when you mix the g and number modifiers, and currently there is no widely agreed upon meaning across sed implementations. For GNU sed, the interaction is defined to be: ignore matches before the numberth, and then match and replace all matches from the numberth on.
Also: grep -o '^...' file
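If GNU sed is not available, the same truncation can be sketched with a capture group, which POSIX basic regular expressions support. This is an alternative to the 4g form rather than something taken from the answer. Lines with fewer than three characters do not match and pass through unchanged.

```shell
# Sample lines from the question plus one hypothetical two-character line.
sample=$(printf '%s\n' 'r#nger' "'iterr" '4#e' 'ab')

# Capture the first three characters and drop the rest of the line.
result=$(printf '%s\n' "$sample" | sed 's/^\(.\{3\}\).*/\1/')
printf '%s\n' "$result"
```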

Related

grep -v multiple line same time

I would like to filter out the lines containing "pattern" and the 5 lines that follow each of them.
Something like grep -v -A 5 'pattern' myfile.txt, with this output:
other
other
other
other
other
other
I'm interested in linux shell solutions, grep, awk, sed...
Thx
myfile.txt:
other
other
other
pattern
follow1
follow2
follow3
follow4
follow5
other
other
other
pattern
follow1
follow2
follow3
follow4
follow5
other
other
other
other
other
other
You can use awk:
awk '/pattern/{c=5;next} !(c&&c--)' file
Basically, c counts down the lines that still have to be skipped, and a line is printed only when c is 0. * (see below) Note: awk automatically initializes c to 0 on its first use.
When the word pattern is found, we set c to 5, which makes c--<=0 false for the next 5 lines, so awk does not print them.
* We could basically use c--<=0 to check whether c is less than or equal to 0. But when there are very many lines between occurrences of the word pattern, c keeps decreasing and could eventually overflow. To avoid that, oguz ismail suggested implementing the check like this:
!(c&&c--)
This checks whether c is truthy (greater than zero) and only then decrements it. c will never drop below 0 and therefore cannot overflow. Negating the check with !(...) makes awk print the correct lines.
Side-note: Normally you would use the word regexp if you mean a regular expression, not pattern.
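To see the countdown in action, here is a small sketch on a made-up input (the names keep1, f1 and so on are placeholders, not from the question):

```shell
# One matching line followed by five lines to drop, surrounded by lines to keep.
input=$(printf '%s\n' keep1 pattern f1 f2 f3 f4 f5 keep2 keep3)

# c is set to 5 on a match; each following line is skipped while c counts down to 0.
result=$(printf '%s\n' "$input" | awk '/pattern/{c=5; next} !(c && c--)')
printf '%s\n' "$result"
```

Only keep1, keep2 and keep3 survive, because c reaches 0 exactly after f5.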
With GNU sed (should be okay as Linux is mentioned by OP)
sed '/pattern/,+5d' ip.txt
which deletes the lines matching the given regex and 5 lines that follow
I did it using this:
head -$(wc -l myfile.txt | awk '{print $1-5 }') myfile.txt | grep -v "whatever"
which means:
wc -l myfile.txt : how many lines (but it also shows the filename)
awk '{print $1}' : only show the amount of lines
awk '{print $1-5 }' : we don't want the last five lines
head ... : show the first ... lines (which means, leave out the last five)
grep -v "..." : this part you know :-)

Mining dictionary for sed search strings

For fun I was mining the dictionary for words that sed could use to modify strings. Example:
sed settee <<< better
sed statement <<< dated
Outputs:
beer
demented
These sed swords must be at least 5 letters long, and begin with s, then another letter, which can appear only 3 times, with at least one other letter between the first and second instances, and with the third instance as the final letter.
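The reason such words work is that sed's s command accepts almost any character as its delimiter, so the letter right after s plays the role usually given to /. A small sketch of the equivalence, using the two words from the example:

```shell
# "settee" is parsed as: s, delimiter e, pattern "tt", empty replacement.
a=$(printf 'better\n' | sed settee)
b=$(printf 'better\n' | sed 's/tt//')

# "statement" is parsed as: s, delimiter t, pattern "a", replacement "emen".
c=$(printf 'dated\n' | sed statement)
d=$(printf 'dated\n' | sed 's/a/emen/')

printf '%s %s %s %s\n' "$a" "$b" "$c" "$d"
```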
I used sed to generate a word list, and it seems to work:
d=/usr/share/dict/american-english
sed -n '/^s\([a-z]\)\(.*\1\)\{2\}$/{
/^s\([a-z]\)\(.*\1\)\{3\}$/!{/^s\([a-z]\)\1/!p}}' $d |
xargs echo
Output:
sanatoria sanitaria sarcomata savanna secede secrete secretive segregate selective selvedge sentence sentience sentimentalize septette sequence serenade serene serpentine serviceable serviette settee severance severe sewerage sextette stateliest statement stealthiest stoutest straightest straightjacket straitjacket strategist streetlight stretchiest strictest structuralist
But that sed code runs three passes through each line, which seems excessively long and kludgy. How can that code be simplified, while still outputting the same word list?
grep or awk answers would also be OK.
awk to the rescue!
code is cleaner with awk and reads as the spec: split the word based on the second char, three instances of the char will split the word into 4 segments; 2nd one should have at least one char and the last one should be empty.
$ awk '/^s/{n=split($1,a,substr($1,2,1));
if(n==4 && length(a[2])>0 && a[4]=="") print}' /usr/share/dict/american-english | xargs
sanatoria sanitaria sarcomata savanna secede secrete secretive
segregate selective selvedge sentence sentience sentimentalize
septette sequence serenade serene serpentine serviceable serviette
settee severance severe sewerage sextette stateliest statement
stealthiest stoutest straightest straightjacket straitjacket strategist
streetlight stretchiest strictest structuralist
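To make the split logic easier to follow, here is the same test run on a handful of words instead of the dictionary. The word list is picked by hand for illustration; seethe, sense and sassy are there as cases that should be rejected.

```shell
words=$(printf '%s\n' settee statement seethe sense sassy)

# Split each word on its second letter: exactly three occurrences give 4 pieces,
# piece 2 must be non-empty (a gap between the first two occurrences),
# and piece 4 must be empty (the third occurrence ends the word).
result=$(printf '%s\n' "$words" | awk '/^s/ {
    n = split($1, a, substr($1, 2, 1))
    if (n == 4 && length(a[2]) > 0 && a[4] == "") print
}')
printf '%s\n' "$result"
```

seethe fails because its first two e's are adjacent, sense because it has only two e's, and sassy because the a appears once.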
very cool idea. I think you're more restrictive than necessary
sed -nE '/^s(.)[^\1]+\1[^\1]*\1g?$/p'
seems to work fine. It generated 518 words for me. I only have /usr/share/dict/words dictionary file though.
sabadilla sabakha sabana sabbatia sabdariffa sacatra saccharilla
saccharogalactorrhea saccharorrhea saccharosuria saccharuria sacralgia
sacraria sacrcraria sacrocoxalgia sadhaka sadhana sahara saintpaulia
salaceta salada salagrama salamandra saltarella salutatoria
...
stuntist subbureau sucuriu sucuruju sulphurou surucucu
syenite-porphyry symphyseotomy symphysiotomy symphysotomy symphysy
symphytically syndactyly synonymity synonymously synonymy
syzygetically syzygy
an interesting find is
$ sed snow-nodding <<< now-or-never
noddior-never
A speedy pcregrep method, (.025 seconds user time):
d=/usr/share/dict/american-english
pcregrep '^s(.)((?!\1).)+\1((?!\1).)*\1$' $d | xargs echo
Output:
sanatoria sanitaria sarcomata savanna secede secrete secretive segregate selective selvedge sentence sentience sentimentalize septette sequence serenade serene serpentine serviceable serviette settee severance severe sewerage sextette stateliest statement stealthiest stoutest straightest straightjacket straitjacket strategist streetlight stretchiest strictest structuralist
Code inspired by: Regex: Match everything except backreference

awk to remove 5th column from N column with fixed delimiter

I have a file with N columns.
I want to remove the 5th column counting from the last column.
The delimiter is "|".
I tested with simple example as shown below:
bash-3.2$ echo "1|2|3|4|5|6|7|8" | nawk -F\| '{print $(NF-4)}'
4
Expecting result:
1|2|3|5|6|7|8
How should I change my command to get the desired output?
If I understand you correctly, you want to use something like this:
sed -E 's/\|[^|]*((\|[^|]*){4})$/\1/'
This matches a pipe character \| followed by any number of non-pipe characters [^|]*, then captures 4 more of the same pattern ((\|[^|]*){4}). The $ at the end matches the end of the line. The first part of the match (i.e. the fifth field from the end) is dropped.
Testing it out:
$ sed -E 's/\|[^|]*((\|[^|]*){4})$/\1/' <<<"1|2|3|4|5|6|7"
1|2|4|5|6|7
You could achieve the same thing using GNU awk with gensub but I think that sed is the right tool for the job in this case.
If your version of sed doesn't support extended regex syntax with -E, you can modify it slightly:
sed 's/|[^|]*\(\(|[^|]*\)\{4\}\)$/\1/'
In basic mode, pipes are interpreted literally but parentheses for capture groups and curly braces need to be escaped.
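For an awk-only route that does not rely on gensub, one possible sketch is to rebuild the line while skipping the unwanted field. The variable k and the loop are my own illustration and not part of the answer above. The sketch assumes every line has at least k fields.

```shell
line='1|2|3|4|5|6|7|8'

result=$(printf '%s\n' "$line" | awk -F'|' -v k=5 '{
    drop = NF - k + 1                 # index of the k-th field counted from the end
    out = ""; sep = ""
    for (i = 1; i <= NF; i++) {
        if (i == drop) continue       # skip the unwanted field
        out = out sep $i
        sep = "|"
    }
    print out
}')
printf '%s\n' "$result"
```

Building the output string by hand avoids assigning to NF, which not every awk handles the same way.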
AWK is your friend :
Sample Input
A|B|C|D|E|F|G|H|I
A|B|C|D|E|F|G|H|I|A
A|B|C|D|E|F|G|H|I|F|E|D|O|R|Q|U|I
A|B|C|D|E|F|G|H|I|E|O|Q
A|B|C|D|E|F|G|H|I|X
A|B|C|D|E|F|G|H|I|J|K|L
Script
awk 'BEGIN{FS="|";OFS="|"}
{$(NF-5)="";sub(/\|\|/,"|");print}' file
Sample Output
A|B|C|E|F|G|H|I
A|B|C|D|F|G|H|I|A
A|B|C|D|E|F|G|H|I|F|E|O|R|Q|U|I
A|B|C|D|E|F|H|I|E|O|Q
A|B|C|D|F|G|H|I|X
A|B|C|D|E|F|H|I|J|K|L
What we did here
As you are aware, awk has special variables that store each field of the record, ranging from $1, $2 up to $(NF).
Excluding the column counted from the last one is as simple as:
Emptying the column, i.e. $(NF-5)=""
Removing from the record the consecutive | left behind by the step above, i.e. sub(/\|\|/,"|")
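One thing worth checking before relying on this: sub replaces the first || it finds in the record, so a field that was already empty further to the left gets collapsed instead of the one just emptied. A sketch of both cases, where the second input line is hypothetical:

```shell
script='BEGIN{FS="|";OFS="|"} {$(NF-5)="";sub(/\|\|/,"|");print}'

# Line from the sample input: D is emptied and the doubled | is collapsed.
normal=$(printf '%s\n' 'A|B|C|D|E|F|G|H|I' | awk "$script")

# Hypothetical line whose second field is already empty: the wrong || is collapsed.
edge=$(printf '%s\n' 'A||C|D|E|F|G|H|I' | awk "$script")

printf '%s\n%s\n' "$normal" "$edge"
```

If the real data can contain empty fields, a placeholder that cannot occur in the data, or a field-by-field rebuild, is a safer choice.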
Another alternative, using @sjsam's input file:
$ rev file | cut -d'|' --complement -f6 | rev
A|B|C|E|F|G|H|I
A|B|C|D|F|G|H|I|A
A|B|C|D|E|F|G|H|I|F|E|O|R|Q|U|I
A|B|C|D|E|F|H|I|E|O|Q
A|B|C|D|F|G|H|I|X
A|B|C|D|E|F|H|I|J|K|L
I'm not sure whether you want the 5th from the last or the 6th, but it's easy to adjust.
Thanks for the help and guidance.
Below is what I tested:
bash-3.2$ echo "1|2|3|4|5|6|7|8|9" | nawk 'BEGIN{FS="|";OFS="|"} {$(NF-4)="!";print}' | sed 's/|!//'
Output: 1|2|3|4|6|7|8|9
I also tested it on the file extracted from my system, and it worked fine.

awk script to parse case with information in two fields of file

I have an awk parser that works for all data inputs but one, and I am having trouble with it. The problem is that, in the rules below, steps 1 and 2 come from $2 (NC_000013.10:g.20763686_20763687delinsA) while steps 3 and 4 come from $1 (NM_004004.5:c.34_35delGGinsT).
Parse Rules:
The header is skipped and
4 zeros after the NC_ (not always the case) and the digits before the .
g. ### (before underscore) _### (# after the _)
letters after the "del" until the “ins”
letters after the "ins"
Desired output:
13 20763686 20763687 GG T
Input:
Input Variant Errors Chromosomal Variant Coding Variant(s)
NM_004004.5:c.34_35delGGinsT NC_000013.10:g.20763686_20763687delinsA NM_004004.5:c.34_35delinsT XM_005266354.1:c.34_35delinsT XM_005266355.1:c.34_35delinsT XM_005266356.1:c.34_35delinsT
My attempt:
awk 'NR>1 {split($2,a,"[_.>]");b=substr(a[4],1,length(a[4]-1));print a[2]+0,b,b,substr(a[4],length(a[4])),a[5]}' OFS="\t" out_position.txt > out_parse.txt
I think that in this case, you're better off using a regular expression. This sed one-liner produces the desired output:
$ sed -nr 's/.*del([A-Z]+)ins([A-Z]+).*NC_0{4}([0-9]+).*g\.([0-9]+)_([0-9]+).*/\3\t\4\t\5\t\1\t\2/p' file
13 20763686 20763687 GG T
It's not going to win any beauty awards but hopefully it's fairly clear what's going on. The parts in the parentheses are captured and used in the output, separated by tab characters.
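Since the question asked for awk, here is one way the same extraction might look there. This is a sketch under assumptions: the chromosomal variant is in $2, the input variant in $1 always contains both del and ins, and the variable names are mine. The data row is shortened to the two fields that matter.

```shell
input=$(printf '%s\n' \
  'Input Variant Errors Chromosomal Variant Coding Variant(s)' \
  'NM_004004.5:c.34_35delGGinsT NC_000013.10:g.20763686_20763687delinsA')

result=$(printf '%s\n' "$input" | awk -v OFS='\t' 'NR > 1 {
    split($2, nc, /[_.:]/)     # NC 000013 10 g 20763686 20763687delinsA
    d = index($1, "del")       # position where "del" starts in the input variant
    i = index($1, "ins")       # position where "ins" starts in the input variant
    # Adding 0 strips the leading zeros and the trailing "delinsA" text.
    print nc[2] + 0, nc[5], nc[6] + 0, substr($1, d + 3, i - d - 3), substr($1, i + 3)
}')
printf '%s\n' "$result"
```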

grep line not containing the list of character

My line can contain alphanumeric characters or any of these characters: , ; / - _
I want to print any line that contains a character outside this list, even if it also contains listed characters.
I tried this
egrep '^[[:alnum:]]|\-|\_|\/|\:\,$'
It didn't help me
Example :-
cat fc
2
a
A
-
_
/
?
egrep -nv '^[[:alnum:]]|\-|\_|\/$' fc
7:?
But when I inserted it in the middle of the string:
cat fc_v1
hello
hello1
helloA
h#llo
h?llo
egrep -nv '^[[:alnum:]]|\-|\_|\/$' fc_v1
but it doesn't work. How can I make it pick up the characters, wherever they appear in the line?
The expectation was to print
h#llo
h?llo
since they contain characters outside the list.
Fortunately I was able to achieve it with:
egrep -n '[^_[:alnum:]|-|_|/]' fc_v1
However, the actual requirement does not seem to work with the above syntax:
cat fc_v2
/a/?hello -sec=sys,rw, root=c;g,nosuid
/h/hello02 -sec=sys,rw,root=c,nosuid
/h/#hello_ -sec=sys,rw,root=c,nosuid
/h/helloA -sec=sys,rw,root=c,nosuid
egrep -n '[^_[:alnum:]|/|_|-|\,|\=]' fc_v2
1:/a/?hello -sec=sys,rw, root=c;g,nosuid
2:/h/hello02 -sec=sys,rw,root=c,nosuid
3:/h/#hello_ -sec=sys,rw,root=c,nosuid
4:/h/helloA -sec=sys,rw,root=c,nosuid
It prints all the lines. Expected are only lines 1 and 3.
it sounds like you might be looking for:
$ grep '[^-[:alnum:]_/:,]' file
h#llo
h?llo
but it's a guess since you didn't post expected output for your sample input.
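Two details of bracket expressions explain the earlier attempts: inside [...] a | is just a literal pipe rather than alternation, and a - is only literal when it comes first or last. A sketch on both inputs from the question follows. For fc_v2 the allowed set is extended with space, = and ;, which is my assumption about what the real requirement permits.

```shell
simple=$(printf '%s\n' hello hello1 helloA 'h#llo' 'h?llo')
# Print lines containing at least one character outside the allowed set.
result=$(printf '%s\n' "$simple" | grep '[^-[:alnum:]_/:,]')

mounts=$(printf '%s\n' \
  '/a/?hello -sec=sys,rw, root=c;g,nosuid' \
  '/h/hello02 -sec=sys,rw,root=c,nosuid' \
  '/h/#hello_ -sec=sys,rw,root=c,nosuid' \
  '/h/helloA -sec=sys,rw,root=c,nosuid')
# Assumed allowed set for fc_v2: alphanumerics plus - _ / , ; = and space.
flagged=$(printf '%s\n' "$mounts" | grep -n '[^-[:alnum:]_/,;= ]')

printf '%s\n' "$result" "$flagged"
```

With that set, only lines 1 and 3 of fc_v2 are reported, because of the ? and the #.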