sed cut grep linux command output - awk

i have a string, i want to cut all occurrences from matching until first comma: example
[{"value":1,"btata":"15","Id":"17","","url":"","time":"222"{"value":1,"secId":"16","Id":"19","time":"20218 22status":""}
I want to get Id:17 Id:19
I have been able to get Id using sed -e 's/Id/_/g' -e 's/[^_]//g' -e 's/_/Id /g' but couldn't match until comma.

You can do it with sed but it requires two expressions. Essentially you need to remove all '"' characters and then split the input on ',' by replacing them with '\n'. The second expression simply locates the lines beginning with Id, e.g.
sed 's/"//g;s/,/\n/g' | sed -n /^Id/p
Example Use/Output
$ echo '[{value:1,btata:15,Id:17,,url:,time:222{value:1,secId:16,Id:19,time:20218 22status:}' |
sed 's/"//g;s/,/\n/g' | sed -n /^Id/p
Id:17
Id:19
(note: this all comes with the caveat that you should not process json with shell commands. Using a json validating tool like jq is recommended -- though this doesn't appear to be valid json either)

Related

replace strings between two patterns

I would like to replace (using sed/awk/tr) all the strings between CleanAgrobacterium and _gene by ZZZ in my file A.nwk:
(((CleanAgrobacterium_fabrum_str__C58_DE0068_Scaffold_Proteins_gene-FS783_RS12830:0,CleanAgrobacterium_fabrum_str__C58_DE0067_Scaffold_Proteins_gene-FS653_RS12825:0):0.056789,(CleanAgrobacterium_fabrum_GV2260_Complete_Genome_Proteins_gene-EML4058_RS17445:0,(CleanAgrobacterium_fabrum_1D1416_Chromosome_Proteins_gene-NQG32_RS17500:0,(CleanAgrobacterium_fabrum_PDC82_Contig_Proteins_gene-BLT49_RS14090:0,(CleanAgrobacterium_fabrum_N3394_Scaffold_Proteins_gene-G6L76_RS17395:0,(CleanAgrobacterium_fabrum_12D13_Complete_Genome_Proteins_gene-At12D13_RS18010:0,(CleanAgrobacterium_fabrum_Bi46_Contig_Proteins_gene-LQ162_RS02700:0,(CleanAgrobacterium_fabrum_ARqua1_Scaffold_Proteins_gene-HI842_RS18310:0,(CleanAgrobacterium_fabrum_N4094_Scaffold_Proteins_gene-G6L42_RS17400:0,(CleanAgrobacterium_fabrum_GV3101__pMP90_Complete_Genome_Proteins_gene-EML485_RS17435:0,(CleanAgrobacterium_fabrum_Kin001_Complete_Genome_Proteins_gene-FY134_RS17775:0,(CleanAgrobacterium_fabrum_LBA645_Complete_Genome_Proteins_gene-KXJ62_RS17445:0,(CleanAgrobacterium_fabrum_Di1525a_Scaffold_Proteins_gene-G6L89_RS17735:0,(CleanAgrobacterium_fabrum_NFIX02_Scaffold_Proteins_gene-BLR22_RS16795:0,(CleanAgrobacterium_fabrum_Arqua_Contig_Proteins_gene-EXN51_RS19140:0,(CleanAgrobacterium_fabrum_str__J-07_J-07_Scaffold_Proteins_gene-AGR8A_RS20015:0,CleanAgrobacterium_fabrum_1D132_Complete_Genome_Proteins_gene-At1D132_RS18580:0):0):0):0):0):0):0):0):0):0):0):0):0):0):0):0):0,(CleanAgrobacterium_fabrum_EHA105_Complete_Genome_Proteins_gene-EML540_RS17455:0,(CleanAgrobacterium_fabrum_RIT-As-3_Contig_Proteins_gene-ORG40_RS11815:0,(CleanAgrobacterium_fabrum_2788_Contig_Proteins_gene-G6L39_RS17590:0,(CleanAgrobacterium_fabrum_BG5_Complete_Genome_Proteins_gene-F3P66_RS17495:0,(CleanAgrobacterium_fabrum_Bi05_Contig_Proteins_gene-LQV40_RS07170:0,(CleanAgrobacterium_fabrum_str__C58_C58_Complete_Genome_Proteins_gene-ATU_RS17440:0,CleanAgrobacterium_fabrum_NFIX01_Scaffold_Proteins_gene-BMY00_RS16800:0):0):0):0):0):0):0);
sed "/CleanAgrobacterium/,/gene-/d" A.nwk
Instead of using a range, you could make the pattern more specific for the example data matching 1 or more alphanumeric chars or - or _ in between using [[:alnum:]_-]\+ and replace the match(es) with zzz
sed "s/CleanAgrobacterium[[:alnum:]_-]\+_gene/zzz/g" A.nwk
Output
(((zzz-FS783_RS12830:0,zzz-FS653_RS12825:0):0.056789,(zzz-EML4058_RS17445:0,(zzz-NQG32_RS17500:0,(zzz-BLT49_RS14090:0,(zzz-G6L76_RS17395:0,(zzz-At12D13_RS18010:0,(zzz-LQ162_RS02700:0,(zzz-HI842_RS18310:0,(zzz-G6L42_RS17400:0,(zzz-EML485_RS17435:0,(zzz-FY134_RS17775:0,(zzz-KXJ62_RS17445:0,(zzz-G6L89_RS17735:0,(zzz-BLR22_RS16795:0,(zzz-EXN51_RS19140:0,(zzz-AGR8A_RS20015:0,zzz-At1D132_RS18580:0):0):0):0):0):0):0):0):0):0):0):0):0):0):0):0):0,(zzz-EML540_RS17455:0,(zzz-ORG40_RS11815:0,(zzz-G6L39_RS17590:0,(zzz-F3P66_RS17495:0,(zzz-LQV40_RS07170:0,(zzz-ATU_RS17440:0,zzz-BMY00_RS16800:0):0):0):0):0):0):0);
This replaces all the text between CleanAgrobacterium and _gene by ZZZ:
sed -E 's/(CleanAgrobacterium).*(_gene)/\1ZZZ\2/g' A.nwk
But the result is probably not what you would expect. I assume you want ungreedy matching of the text in-between (.*). For that, use perl:
perl -pe 's/(CleanAgrobacterium).*(_gene)/\1ZZZ\2/g' A.nwk
This might work for you (GNU sed):
sed -E 's/CleanAgrobacterium/&\n/g
s/gene-/\n&/g
s/(CleanAgrobacterium)\n[^\n]*\n(gene-)/\1ZZZ\2/g
s/\n//g' file
Append a newline to CleanAgrobacterium and prepend a newline to gene-.
Replace everything that is not a newline between the desired words.
Remove any introduced newlines.
N.B. This does not cater for matches on separate lines. In this case use something like:
sed -E 'H;1h;$!d;x
s/\n/###NEWLINE%%%/g
s/CleanAgrobacterium/&\n/g
s/gene-/\n&/g
s/(CleanAgrobacterium)\n[^\n]*\n(gene-)/\1ZZZ\2/g
s/\n//g
s/###NEWLINE%%%/\n/g' file
This slurps the whole file into memory, replaces all newlines by a unique string, then applies the first solution and tidies up afterwards.
try this:
sed 's/gene-/gene-\n/g' < A.nwk | sed 's/CleanAgrobacterium.*gene-/CleanAgrobacteriumZZZgene-/g' | sed -n ':a;N;$!ba;s/\n//g;p' > output.txt
works with GNU Sed 4.9 using Linux .
Yet another sed solution. It replaces all THIS with THAT (with your samples in reality but more readable here) between START and END in "fooSTARTTHISENDfooSTARTTHISENDfoo"
and outputs "fooSTARTTHATENDfooSTARTTHATENDfoo".
$ sed -E 's/(CleanAgrobacterium)([^_]|_(_|g(_|e(_|n_)))*([^_g]|g([^_e]|e([^_n]|n[^_e]))))*(_(_|g(_|e(_|n_)))*(g(e?|en))?)?(_gene)/\1ZZZ\2/g' file
The solution is non-greedy and relies on regex capturing groups (CleanAgrobacterium)and (_gene), their backreferences \1, \2 and what is between them
([^_]|_(_|g(_|e(_|n_)))*([^_g]|g([^_e]|e([^_n]|n[^_e]))))*(_(_|g(_|e(_|n_)))*(g(e?|en))?)?
(not _gene) getting replaced by ZZZ. You could use it in, for example; GNU awk's gensub() which supports backreferencing:
$ gawk '{print gensub(/(CleanAgrobacterium)([^_]|_(_|g(_|e(_|n_)))*([^_g]|g([^_e]|e([^_n]|n[^_e]))))*(_(_|g(_|e(_|n_)))*(g(e?|en))?)?(_gene)/,"\\1ZZZ\\2","g",$0)}' file

Get the line number of the last line with non-blank characters

I have a file which has the following content:
10 tiny toes
tree
this is that tree
5 funny 0
There are spaces at the end of the file. I want to get the line number of the last row of a file (that has characters). How do I do that in SED?
This is easily done with awk,
awk 'NF{c=FNR}END{print c}' file
With sed it is more tricky. You can use the = operator but this will print the line-number to standard out and not in the pattern space. So you cannot manipulate it. If you want to use sed, you'll have to pipe it to another or use tail:
sed -n '/^[[:blank:]]*$/!=' file | tail -1
You can use following pseudo-code:
Replace all spaces by empty string
Remove all <beginning_of_line><end_of_line> (the lines, only containing spaces, will be removed like this)
Count the number of remaining lines in your file
It's tough to count line numbers in sed. Some versions of sed give you the = operator, but it's not standard. You could use an external tool to generate line numbers and do something like:
nl -s ' ' -n ln -ba input | sed -n 's/^\(......\)...*/\1/p' | sed -n '$p'
but if you're going to do that you might as well just use awk.
This might work for you (GNU sed):
sed -n '/\S/=' file | sed -n '$p'
For all lines that contain a non white space character, print a line number. Pipe this output to second invocation of sed and print only the last line.
Alternative:
grep -n '\S' file | sed -n '$s/:.*//p'

I need to extract I'd from a Google drive urls with sed, gawk or grep

URLs:
1. https://docs.google.com/uc?id=0B3X9GlR6EmbnQ0FtZmJJUXEyRTA&export=download
2. https://drive.google.com/open?id=1TkLq5C7NzzmbRjd7VGRhauNT9Vaap-Py
3. https://drive.google.com/drive/folders/1TkLq5C7NzzmbRjd7VGRhauNT9Vaap-Py?usp=sharing
I need a single regex for these all urls.
This is what I tried to use but didn't get expected results.
sed -E 's/.*\(folders\)?\(id\)?=?\/?(.*)&?.*/\1/'
Expected results:
0B3X9GlR6EmbnQ0FtZmJJUXEyRTA
1TkLq5C7NzzmbRjd7VGRhauNT9Vaap-Py
With your own code updated:
$ cat file
1. https://docs.google.com/uc?id=0B3X9GlR6EmbnQ0FtZmJJUXEyRTA&export=download
2. https://drive.google.com/open?id=1TkLq5C7NzzmbRjd7VGRhauNT9Vaap-Py
3. https://drive.google.com/drive/folders/1TkLq5C7NzzmbRjd7VGRhauNT9Vaap-Py?usp=sharing
$ sed -E 's#.*(folders/|id=)([^?&]+).*#\2#' file
0B3X9GlR6EmbnQ0FtZmJJUXEyRTA
1TkLq5C7NzzmbRjd7VGRhauNT9Vaap-Py
1TkLq5C7NzzmbRjd7VGRhauNT9Vaap-Py
$ sed -E 's#.*(folders/|id=)([^?&]+).*#\2#' file | uniq
0B3X9GlR6EmbnQ0FtZmJJUXEyRTA
1TkLq5C7NzzmbRjd7VGRhauNT9Vaap-Py
And yours updated to sed -E 's#.*(folders/|id=)(.*)(\?|&|$).*#\2#' would work on GNU sed.
You are using -E, so no need to escape group quotes (), and | means OR.
When matching literal ?, you need to escape it.
And the separator of sed can change to other character, which is # here.
Note uniq will only remove adjacent duplicates, if there're duplicates in different places, change it to sort -u instead.
A GNU grep solution :
$ grep -Poi '(id=|folders/)\K[a-z0-9_-]*' file
0B3X9GlR6EmbnQ0FtZmJJUXEyRTA
1TkLq5C7NzzmbRjd7VGRhauNT9Vaap-Py
1TkLq5C7NzzmbRjd7VGRhauNT9Vaap-Py
Also these two give same results, but are more accurate than above shorter sed one:
sed -E 's#.*(folders/|id=)([A-Za-z0-9_-]*).*#\2#'
sed -E 's#.*(folders/|id=)([[:alnum:]_-]*).*#\2#'
Btw, + means one or more occurances, * means zero or more.
A GNU awk version (removes duplicates at the same time):
awk 'match($0,".*(folders/|id=)([A-Za-z0-9_-]+)",m){if(!a[m[2]]++)print m[2]}' file
Could you please try following.
awk 'match($0,/uc\?id=[^&]*|folders\/[^?]*/){value=substr($0,RSTART,RLENGTH);gsub(/.*=|.*\//,"",value);print value}' Input_file
Try this:
sed -E 's/.*(id=|folders\/)([^&?/]*).*/\2/' file
Explanations:
.*(id=|folders\/): after any characters(.*) followed by id= or folders/
([^&?/]*): search and capture any characters except &, ? and /
\2: using backreference, matching string is replaced with the second captured text([^&?/]*)
Edit:
To remove duplicate url, just pipe the command to sort then to uniq(because uniq just removes adjacent duplicate lines, you may want to sort the list before):
sed -E 's/.*(id=|folders\/)([^&?/]*).*/\2/' file | sort | uniq
As #Tiw suggests in edit, you can also pipe to a single command by using sort with the -u flag:
sed -E 's/.*(id=|folders\/)([^&?/]*).*/\2/' file | sort -u
Using Perl
$ cat rohit.txt
1. https://docs.google.com/uc?id=0B3X9GlR6EmbnQ0FtZmJJUXEyRTA&export=download
2. https://drive.google.com/open?id=1TkLq5C7NzzmbRjd7VGRhauNT9Vaap-Py
3. https://drive.google.com/drive/folders/1TkLq5C7NzzmbRjd7VGRhauNT9Vaap-Py?usp=sharing
$ perl -lne ' s/.*\/.*..\/(.*)$/$1/g; s/(.*id=)//g; /(.+?)(&|\?|$)/ and print $1 ' rohit.txt
0B3X9GlR6EmbnQ0FtZmJJUXEyRTA
1TkLq5C7NzzmbRjd7VGRhauNT9Vaap-Py
1TkLq5C7NzzmbRjd7VGRhauNT9Vaap-Py
$

insert new line after grep match with -z parametr (awk/sed/tr/gawk/anything)

I need to use -z paramater with grep, which enables me to find a pattern divided to multiple lines.
grep -zPo myregex
However, it then prints
abc
instead of
a
b
c
as results
I know it is because of -z parametr, but i need to somehow insert the new lines between matches back, at results.
I tried to do so with sed, tr and awk such as
grep -zPo myregex | sed -e 's/$/\n/'
but it doesn't work, awk managed just to insert new line at the end of whole output. Someone adviced me to try it with gawk but I wasn't able to find any way to do so so far.
grep -zPo /tcp /etc/services | \
tr '\0' '\12'
This example demonstrates searching for the pattern in a file, and returning only the pattern (in this case the simple "/tcp" pattern). Then due to the -z option, grep is separating them by nulls (\0). So tr the nulls to returns (\n).

Grep: Pulling multiple columns based off pattern

I'm trying to parse a log file by pulling two columns (timestamp and url) where the file format is:
1470700748 foo="narf1" url="http://narf2.com" bar="narf3"
The column names are not guaranteed to be in the same order, except for the timestamp.
Getting the timetamp is easy enough:
grep -Eo '^[^ ]+' test.txt or
sed 's/ .*//' test.txt
I never been able to pull the url right, nor have I been able to pull them both at the same time.
sed -n 's/.*url="\(.*\)".*/\1/p' test.txt
The above works when there are no empty lines, so I'm also working on combining the sed command with:
sed -e /^$/d test.txt
most of the other SO posts dealt with fixed column orders and I wasn't able to get them working. I tried many various permutations of grep, sed, awk, and cut.
has anyone done something similar? based on 1470700748 foo="narf1" url="narf2" bar="narf3", I am trying to get:
1470700748 http://narf2.com
here you go...
$ grep -oP '^[0-9]+|(?<=url=")[^"]+' file | xargs
1470700748 http://narf2.com
$ sed -E -n 's/([^ ]+).* url="([^"]+).*/\1 \2/p' file
1470700748 http://narf2.com