Is there a way to tell sed to output only captured groups?
For example, given the input:
This is a sample 123 text and some 987 numbers
And pattern:
/([\d]+)/
Could I get only 123 and 987 output in the way formatted by back references?
The key to getting this to work is to tell sed to exclude what you don't want to be output as well as specifying what you do want. This technique depends on knowing how many matches you're looking for. The grep command below works for an unspecified number of matches.
string='This is a sample 123 text and some 987 numbers'
echo "$string" | sed -rn 's/[^[:digit:]]*([[:digit:]]+)[^[:digit:]]+([[:digit:]]+)[^[:digit:]]*/\1 \2/p'
This says:
don't default to printing each line (-n)
exclude zero or more non-digits
include one or more digits
exclude one or more non-digits
include one or more digits
exclude zero or more non-digits
print the substitution (p) (on one line)
In general, in sed you capture groups using parentheses and output what you capture using a back reference:
echo "foobarbaz" | sed 's/^foo\(.*\)baz$/\1/'
will output "bar". If you use -r (-E for OS X) for extended regex, you don't need to escape the parentheses:
echo "foobarbaz" | sed -r 's/^foo(.*)baz$/\1/'
There can be up to 9 capture groups and their back references. The back references are numbered in the order the groups appear, but they can be used in any order and can be repeated:
echo "foobarbaz" | sed -r 's/^foo(.*)b(.)z$/\2 \1 \2/'
outputs "a bar a".
If you have GNU grep:
echo "$string" | grep -Po '\d+'
It may also work in BSD, including OS X:
echo "$string" | grep -Eo '\d+'
These commands will match any number of digit sequences. The output will be on multiple lines.
or variations such as:
echo "$string" | grep -Po '(?<=\D )(\d+)'
The -P option enables Perl Compatible Regular Expressions. See man 3 pcrepattern or man 3 pcresyntax.
Sed has up to nine remembered patterns but you need to use escaped parentheses to remember portions of the regular expression.
See here for examples and more detail
you can use grep
grep -Eow "[0-9]+" file
run(s) of digits
This answer works with any count of digit groups. Example:
$ echo 'Num123that456are7899900contained0018166intext' \
| sed -En 's/[^0-9]*([0-9]{1,})[^0-9]*/\1 /gp'
123 456 7899900 0018166
Expanded answer.
Is there any way to tell sed to output only captured groups?
Yes. replace all text by the capture group:
$ echo 'Number 123 inside text' \
| sed 's/[^0-9]*\([0-9]\{1,\}\)[^0-9]*/\1/'
123
s/[^0-9]* # several non-digits
\([0-9]\{1,\}\) # followed by one or more digits
[^0-9]* # and followed by more non-digits.
/\1/ # gets replaced only by the digits.
Or with extended syntax (less backquotes and allow the use of +):
$ echo 'Number 123 in text' \
| sed -E 's/[^0-9]*([0-9]+)[^0-9]*/\1/'
123
To avoid printing the original text when there is no number, use:
$ echo 'Number xxx in text' \
| sed -En 's/[^0-9]*([0-9]+)[^0-9]*/\1/p'
(-n) Do not print the input by default.
(/p) print only if a replacement was done.
And to match several numbers (and also print them):
$ echo 'N 123 in 456 text' \
| sed -En 's/[^0-9]*([0-9]+)[^0-9]*/\1 /gp'
123 456
That works for any count of digit runs:
$ str='Test Num(s) 123 456 7899900 contained as0018166df in text'
$ echo "$str" \
| sed -En 's/[^0-9]*([0-9]{1,})[^0-9]*/\1 /gp'
123 456 7899900 0018166
Which is very similar to the grep command:
$ str='Test Num(s) 123 456 7899900 contained as0018166df in text'
$ echo "$str" | grep -Po '\d+'
123
456
7899900
0018166
About \d
and pattern: /([\d]+)/
Sed does not recognize the '\d' (shortcut) syntax. The ascii equivalent used above [0-9] is not exactly equivalent. The only alternative solution is to use a character class: '[[:digit:]]`.
The selected answer use such "character classes" to build a solution:
$ str='This is a sample 123 text and some 987 numbers'
$ echo "$str" | sed -rn 's/[^[:digit:]]*([[:digit:]]+)[^[:digit:]]+([[:digit:]]+)[^[:digit:]]*/\1 \2/p'
That solution only works for (exactly) two runs of digits.
Of course, as the answer is being executed inside the shell, we can define a couple of variables to make such answer shorter:
$ str='This is a sample 123 text and some 987 numbers'
$ d=[[:digit:]] D=[^[:digit:]]
$ echo "$str" | sed -rn "s/$D*($d+)$D+($d+)$D*/\1 \2/p"
But, as has been already explained, using a s/…/…/gp command is better:
$ str='This is 75577 a sam33ple 123 text and some 987 numbers'
$ d=[[:digit:]] D=[^[:digit:]]
$ echo "$str" | sed -rn "s/$D*($d+)$D*/\1 /gp"
75577 33 123 987
That will cover both repeated runs of digits and writing a short(er) command.
Give up and use Perl
Since sed does not cut it, let's just throw the towel and use Perl, at least it is LSB while grep GNU extensions are not :-)
Print the entire matching part, no matching groups or lookbehind needed:
cat <<EOS | perl -lane 'print m/\d+/g'
a1 b2
a34 b56
EOS
Output:
12
3456
Single match per line, often structured data fields:
cat <<EOS | perl -lape 's/.*?a(\d+).*/$1/g'
a1 b2
a34 b56
EOS
Output:
1
34
With lookbehind:
cat <<EOS | perl -lane 'print m/(?<=a)(\d+)/'
a1 b2
a34 b56
EOS
Multiple fields:
cat <<EOS | perl -lape 's/.*?a(\d+).*?b(\d+).*/$1 $2/g'
a1 c0 b2 c0
a34 c0 b56 c0
EOS
Output:
1 2
34 56
Multiple matches per line, often unstructured data:
cat <<EOS | perl -lape 's/.*?a(\d+)|.*/$1 /g'
a1 b2
a34 b56 a78 b90
EOS
Output:
1
34 78
With lookbehind:
cat EOS<< | perl -lane 'print m/(?<=a)(\d+)/g'
a1 b2
a34 b56 a78 b90
EOS
Output:
1
3478
I believe the pattern given in the question was by way of example only, and the goal was to match any pattern.
If you have a sed with the GNU extension allowing insertion of a newline in the pattern space, one suggestion is:
> set string = "This is a sample 123 text and some 987 numbers"
>
> set pattern = "[0-9][0-9]*"
> echo $string | sed "s/$pattern/\n&\n/g" | sed -n "/$pattern/p"
123
987
> set pattern = "[a-z][a-z]*"
> echo $string | sed "s/$pattern/\n&\n/g" | sed -n "/$pattern/p"
his
is
a
sample
text
and
some
numbers
These examples are with tcsh (yes, I know its the wrong shell) with CYGWIN. (Edit: For bash, remove set, and the spaces around =.)
Try
sed -n -e "/[0-9]/s/^[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\).*$/\1 \2 \3 \4 \5 \6 \7 \8 \9/p"
I got this under cygwin:
$ (echo "asdf"; \
echo "1234"; \
echo "asdf1234adsf1234asdf"; \
echo "1m2m3m4m5m6m7m8m9m0m1m2m3m4m5m6m7m8m9") | \
sed -n -e "/[0-9]/s/^[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\).*$/\1 \2 \3 \4 \5 \6 \7 \8 \9/p"
1234
1234 1234
1 2 3 4 5 6 7 8 9
$
You need include whole line to print group, which you're doing at the second command but you don't need to group the first wildcard. This will work as well:
echo "/home/me/myfile-99" | sed -r 's/.*myfile-(.*)$/\1/'
It's not what the OP asked for (capturing groups) but you can extract the numbers using:
S='This is a sample 123 text and some 987 numbers'
echo "$S" | sed 's/ /\n/g' | sed -r '/([0-9]+)/ !d'
Gives the following:
123
987
I want to give a simpler example on "output only captured groups with sed"
I have /home/me/myfile-99 and wish to output the serial number of the file: 99
My first try, which didn't work was:
echo "/home/me/myfile-99" | sed -r 's/myfile-(.*)$/\1/'
# output: /home/me/99
To make this work, we need to capture the unwanted portion in capture group as well:
echo "/home/me/myfile-99" | sed -r 's/^(.*)myfile-(.*)$/\2/'
# output: 99
*) Note that sed doesn't have \d
You can use ripgrep, which also seems to be a sed replacement for simple substitutions, like this
rg '(\d+)' -or '$1'
where ripgrep uses -o or --only matching and -r or --replace to output only the first capture group with $1 (quoted to be avoid intepretation as a variable by the shell) two times due to two matches.
Related
Consider the following data in a file file.txt:
$
$
$
FORCE 10 30 40
* 1 5 4
FORCE 11 20 22
* 2 3 0
FORCE 19 25 10
* 16 12 8
.
.
.
I want to print every 2nd element of the third column, starting from line 4, resulting in:
30
20
25
I have tried:
cat file.txt | sed 's/\|/ /' | awk 'NR%2==4 {print $3}'
However, this is not resulting in anything being printed and no errors generated either.
You might use awk checking that the row number > 3 and then check for an even row number with NR%2==0.
Note that you don't have to use cat
awk 'NR > 3 && NR%2==0 {
print $3
}' file.txt
Output
30
20
25
Using sed
$ sed -En '4~2s/([^ \t]*[ \t]+){2}([^ \t]*).*/\2/p' input_file
30
20
25
I have tried:
cat file.txt | sed 's/\|/ /' | awk 'NR%2==4 {print $3}'
However, this is not resulting in anything being printed and no errors
generated either.
You do not need cat whilst using GNU sed as it can read file on its' own, in this case it would be sed 's/\|/ /' file.txt.
You should consider if you need that part at all, your sample input does not have pipe character at all, so it would do nothing to it. You might also drop that part if lines holding values you want to print do not have that character.
Output is empty as NR%2==4 does never hold, remainder of division by x is always smaller than x (in particular case of %2 only 2 values are possible: 0 and 1)
This might work for you (GNU sed):
sed -nE '4~2s/^((\S+)\s*){3}.*/\2/p' file
Turn off implicit printing by setting the -n option and reduce back slashes in regexps by turning on -E.
From the fourth line and then every second line thereafter, capture the third column and print it.
N.B. The \2 represents the last inhabitant of that back reference which in conjunction with the {3} means the above.
Alternative:
sed -n '4,${s/^\(\(\S\+\)\s*\)\{3\}.*/\2/p;n}' file
I have a text file with a bunch of data. I was able to extract exactly what I want using sed; but
I need to replaced only the specific pattern I searched without losing the other content from the file.
Im using the following sed command; but I need to the replacement; but dont know how to do it.
cat file.txt | sed -rn '/([a-z0-9]{2}\s){6}/p' > output.txt
The sed searches for the following pattern: ## ## ## ## ## ##, but I want to replace that pattern like this: ######-######.
cat file.txt | sed -rn '/([a-z0-9]{2}\s){6}/p' > output.txt
Output:
1 | ec eb b8 7b e3 c0 47
9 | 90 20 c2 f6 3d c0 1/1/1
25 | 00 fd 45 3d a7 80 31
Desired Output:
1 | ecebb8-7be3c0 47
9 | 9020c2-f63dc0 1/1/1
25 | 00fd45-3da780 31
Thanks
With your shown samples please try following awk program.
awk '
BEGIN{ FS=OFS="|" }
{
$2=substr($2,1,3) substr($2,5,2) substr($2,8,2)"-" substr($2,11,2) substr($2,14,2) substr($2,17,2) substr($2,19)
}
1
' Input_file
Explanation: Simple explanation would be, setting FS and OFS as | for awk program. Then in 2nd field using awk's substr function keeping only needed values as per shown samples of OP. Where substr function works on method of printing specific indexes/position number values(eg: from which value to which value you need to print). Then saving required values into 2nd field itself and printing current line then.
With awk:
awk '{$3=$3$4$5"-"$6$7$8; print $1"\t",$2,$3,$NF}' file
1 | ecebb8-7be3c0 47
9 | 9020c2-f63dc0 1/1/1
25 | 00fd45-3da780 31
This might work for you (GNU sed):
sed -E 's/ (\S\S) (\S\S) (\S\S)/ \1\2\3/;s//-\1\2\3/' file
Pattern match 3 spaced pairs twice, removing the spaces between the 2nd and 3rd pairs and replacing the first space in the 2nd match by -.
If you want to extract specific substrings, you'll need to write a more specific regex to pull out exactly those.
sed -rn '/([a-z0-9]{2})\s([a-z0-9]{2})\s([a-z0-9]{2})\s([a-z0-9]{2})\s([a-z0-9]{2})\s([a-z0-9]{2})\s/\1\2\3-\4\5\6/' file.txt > output.txt
Notice also how easy it is to avoid the useless cat.
\s is generally not portable; the POSIX equivalent is [[:space:]].
test file contains
$ cat test
i-d119c118,vol-37905322,,,2015-07-29T03:50:32.511Z,General Purpose SSD,15
i-2278b42e,vol-c90539cc,,,2014-11-12T04:27:22.618Z,General Purpose SSD,10
script output:
$ for instance_id in $(cut -d"," -f1 test); do python getattrib.py get $instance_id | cut -d"'" -f2; done
10.10.0.68
10.10.0.96
inserting variable using sed yields following result, note the same IP address
$ insert=( `for instance_id in $(cut -d"," -f1 test); do python getattrib.py get $instance_id | cut -d"'" -f2; done` )
$ sed "s|$|,${insert}|" test
i-d119c118,vol-37905322,,,2015-07-29T03:50:32.511Z,General Purpose SSD,15,10.10.0.68
i-2278b42e,vol-c90539cc,,,2014-11-12T04:27:22.618Z,General Purpose SSD,10,10.10.0.68
but i am looking for output as below:
10.10.0.68,i-d119c118,vol-37905322,,,2015-07-29T03:50:32.511Z,General Purpose SSD,15
10.10.0.96,i-2278b42e,vol-c90539cc,,,2014-11-12T04:27:22.618Z,General Purpose SSD,10
use start delimiter ^ instead of end $ and adapt the ,
sed "s/^/${insert},/" test
but your sed and value retreiving need to be into the loop, not after or taking all result as value
example in loop:
for instance_id in $(cut -d"," -f1 test)
do
insert="$( python getattrib.py get ${instance_id} | cut -d"'" -f2 )"
sed -e "/^${instance_id}/ !d" -e "s|$|,${insert}|" test
done
insert=( `for instance_id in $(cut -d"," -f1 test); do python getattrib.py get $instance_id | cut -d"'" -f2; done` )
the insert variable is an array holding 2 elements
sed "s|$|,${insert}|" test
${insert} only retrieves the first element -- it is implicitly ${insert[0]}
I would rewrite that like this, to read the file line-by-line:
while IFS=, read -ra fields; do
ip=$( python getattrib.py get "${fields[0]}" | cut -d"'" -f2 )
printf "%s" "$ip"
printf ",%s" "${fields[#]}"
echo
done < test
I have a curl output generated similar below, Im working on a SED/AWK script to eliminate unwanted strings.
File
{id":"54bef907-d17e-4633-88be-49fa738b092d","name":"AA","description","name":"AAxxxxxx","enabled":true}
{id":"20000000000000000000000000000000","name":"BB","description","name":"BBxxxxxx","enabled":true}
{id":"542ndf07-d19e-2233-87gf-49fa738b092d","name":"AA","description","name":"CCxxxxxx","enabled":true}
{id":"20000000000000000000000000000000","name":"BB","description","name":"DDxxxxxx","enabled":true}
......
I like to modify this file and retain similar below,
AA AAxxxxxx
BB BBxxxxxx
AA CCxxxxxx
BB DDxxxxxx
AA n.....
BB n.....
Is there a way I could remove word/commas/semicolons in-between so I can only retain these values?
Try this awk
curl your_command | awk -F\" '{print $(NF-9),$(NF-3)}'
Or:
curl your_command | awk -F\" '{print $7,$13}'
A semantic approach ussing perl:
curl your_command | perl -lane '/"name":"(\w+)".*"name":"(\w+)"/;print $1." ".$2'
For any number of name ocurrences:
curl your_command | perl -lane 'printf $_." " for ( $_ =~ /"name":"(\w+)"/g);print ""'
This might work for you (GNU sed):
sed -r 's/.*("name":")([^"]*)".*\1([^"]*)".*/\2 \3/p;d' file
This extracts the fields following the two name keys and prints them if successful.
Alternatively, on simply pattern matching:
sed -r 's/.*:.*:"([^"]*)".*:"([^"]*)".*:.*/\1 \2/p;d' file
In this particular case, you could do
awk -F ":|," '{print $4,$7}' file2 |tr -d '"'
and get
AA AAxxxxxx
BB BBxxxxxx
AA CCxxxxxx
BB DDxxxxxx
Here, the field separator is either : or ,, we print the fourth and seventh field (because all lines have the entries in these two fields) and finally, we use tr to delete the " because you don't want to have it.
I have a document containing several lines of text.
Example(not actual):
*Prepare 42 Locked delete from table where type='test' and user_id='099'and number='+66719919*
I want to be able to search for user_id where ever it occurs in the document (which does not follow a pattern) and have the output as:
user_id=009
OR
009
Please how do I achieve this using awk?
Thanks.
awk '/user_id/{for(i=1;i<=NF;i++){if($i~/user_id/){split($i,a,"=");print a[2]}}}' your_file
tested:
> echo "*Prepare type='test' and user_id='099' and number='+66719919*" | awk '/user_id/{for(i=1;i<=NF;i++){if($i~/user_id/){split($i,a,"=");print a[2]}}}'
'099'
another one:
> echo "*Prepare type='test' and user_id='099' and number='+66719919*" | awk '/user_id/{for(i=1;i<=NF;i++){if($i~/user_id/){ print $i}}}'
user_id='099'
You could also use grep:
grep -o "user_id='\?[0-9]*'\?"
Append tr to remove the quotes:
grep -o "user_id='\?[0-9]*'\?" | tr -d \'