Is there a way to replace all commas except those in quotes, using sed or awk?

# Director, Movie Title, Year, Comment
Ethan Coen, No Country for Old Men, 2007, none
Ethan Coen, "O Brother, Where Art Thou?", 2000, none
As in this example, the commas outside the quotes have to be replaced with |.
Ethan Coen| "O Brother, Where Art Thou?"| 2000| none
I did try this command
sed -e 's/,(?=(?:[^"]*"[^"]*")*[^"]*$)/|/g'
to first match those strings, but I am getting RE error: repetition-operator operand invalid.
I don't know how to convert the regex; I'm totally new to shell and regex and would be grateful for any help.

You may try this GNU awk command, which uses FPAT to split fields with a custom regex:
awk -v OFS='|' -v FPAT=' *"[^"]*"|[^",]+' '{$1=$1} 1' file
# Director| Movie Title| Year| Comment
Ethan Coen| No Country for Old Men| 2007| none
Ethan Coen| "O Brother, Where Art Thou?"| 2000| none
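The `$1=$1` assignment is what forces awk to rebuild the record with the new OFS; without it, the unmodified line would be printed as-is. A minimal illustration of that effect (plain comma splitting here, no FPAT involved):

```shell
# Reassigning any field makes awk rejoin $0 using OFS.
echo 'a,b,c' | awk -F',' -v OFS='|' '1'          # record untouched: prints a,b,c
echo 'a,b,c' | awk -F',' -v OFS='|' '{$1=$1} 1'  # record rebuilt:   prints a|b|c
```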

The proper way to deal with CSV is to use a proper csv parser.
Given:
$ cat file
# Director, Movie Title, Year, Comment
Ethan Coen, No Country for Old Men, 2007, none
Ethan Coen, "O Brother, Where Art Thou?", 2000, none
First issue: your example has a two-character delimiter ', ' rather than a single comma ','. That will throw off most csv parsers unless they support multi-character delimiters. csvkit, for example, does not. (The place where it fails is on quoted fields: a csv parser looking for ," as the start of a quoted field fails when it sees , " instead.)
The lightest-weight but full-featured csv parser commonly available at the command line is Ruby's.
With ruby, you can do:
$ ruby -rcsv -ne 'puts (CSV.parse $_, col_sep:", ").join("|")' file
# Director|Movie Title|Year|Comment
Ethan Coen|No Country for Old Men|2007|none
Ethan Coen|O Brother, Where Art Thou?|2000|none
If you want the replacement to also be '| ' vs the single character delimiter of '|' you can do:
$ ruby -rcsv -ne 'puts (CSV.parse $_, col_sep:", ").join("| ")' file
# Director| Movie Title| Year| Comment
Ethan Coen| No Country for Old Men| 2007| none
Ethan Coen| O Brother, Where Art Thou?| 2000| none
Note that O Brother, Where Art Thou? is no longer quoted since the ', ' is no longer a delimiter.
To be even more proper, you would use the csv module to re-encode the result back into a proper RFC 4180 compliant file.
Suppose you wanted to fix the ', ' into a compliant ',' and maintain the quoted fields. Our single line of Ruby does not do that.
Instead:
$ ruby -rcsv -ne 'out=(CSV.parse $_, col_sep:", ").map do |row|
row.to_csv(:col_sep=>",")
end
puts out' file
# Director,Movie Title,Year,Comment
Ethan Coen,No Country for Old Men,2007,none
Ethan Coen,"O Brother, Where Art Thou?",2000,none
Or into a '|':
$ ruby -rcsv -ne 'out=(CSV.parse $_, col_sep:", ").map do |row|
row.to_csv(:col_sep=>"|")
end
puts out' file
# Director|Movie Title|Year|Comment
Ethan Coen|No Country for Old Men|2007|none
Ethan Coen|O Brother, Where Art Thou?|2000|none

This might work for you (GNU sed):
sed -E ':a;s/^([^"]*("[^",]*"[^"]*)*"[^",]*),/\1\n/;ta;y/,\n/|,/' file
This replaces each , inside quotes with a newline, then uses y to translate every remaining , to | and every newline back to ,.
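A quick check of that two-step trick on the sample line (GNU sed assumed; the y command transliterates character-for-character, so both swaps happen in one pass):

```shell
echo 'Ethan Coen, "O Brother, Where Art Thou?", 2000, none' |
sed -E ':a;s/^([^"]*("[^",]*"[^"]*)*"[^",]*),/\1\n/;ta;y/,\n/|,/'
# Ethan Coen| "O Brother, Where Art Thou?"| 2000| none
```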

You may find perl better suited to handling complex regular expressions like this, when you need to avoid matching a pattern in specific context(s):
perl -i -pe 's/"[^"]*"(*SKIP)(*F)|,/|/g' file.txt
Or, if you need to match across lines,
perl -0777 -i -pe 's/"[^"]*"(*SKIP)(*F)|,/|/g' file.txt
Here,
-0777 slurps the whole file into a single string (without it, the file is treated as a list of lines, each of which is fed to the regex engine separately)
-i makes changes to the input file "in place"
"[^"]*"(*SKIP)(*F)|, - the regex that matches ", then any zero or more chars other than " and then a " and then skips the match and goes on searching for a match from the position where it failed, or (|) matches a , in any other context
| is the replacement
g - replaces all occurrences.
If your file is UTF8, and you need to make some manipulation with Unicode text, you would need to add -CSD -Mutf8, say, before -i.

Related

How to get the string which is less than 4 characters using SED or AWK or GREP

I'm trying to get strings which are less than 4 (0-3) characters long, and which might include some special characters too.
The issue here is I'm not really sure which special characters are involved.
The input can contain names of any length with some special characters; I'm not sure which ones are included.
Sample Input data is as below
r#nger
d!nger
'iterr
4#e
c#nuidig
c#niting
c^neres
sample
Sample Output should be like this
r#n
d!n
'it
4#e
c#n
c#n
c^n
sam
I have tried the below, which both work, but both have flaws: apart from the 0-3 character strings I'm also getting 1-character outputs, which is incorrect.
Like just c, which I don't have in the input by itself.
grep -iE '^[a-z0-9\.-+?$_,#]{0,3}$'
sed -n '/^.\{0,3\}$/p'
grep uid: file.csv | awk {'print $2'} | sed -En 's/^([^[:space:]]{3}).*/\1/p' | sort -f > output
Sample Output from above
r#n
d!n
'it
4#e
c#n
c
c
sam
s
I'm thinking that there might be some special character after the first character which is making it break and only printing the first character.
Can someone please suggest how to get this working as expected?
Thanks,
To get the output you posted from the input you posted is just:
$ cut -c1-3 file
r#n
d!n
'it
4#e
c#n
c#n
c^n
sam
If that's not all you need then edit your question to more clearly state your requirements and provide more truly representative sample input/output including cases where this doesn't work.
You may use this grep with -o and -E options:
grep -oE '^[^[:blank:]]{1,3}' file
r#n
d!n
'it
4#e
c#n
c#n
c^n
sam
Regex ^[^[:blank:]]{1,3} matches and outputs 1 to 3 non-whitespace characters from start position.
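The -o flag is what limits the output to the matched text itself rather than the whole matching line:

```shell
echo 'c#nuidig' | grep -oE '^[^[:blank:]]{1,3}'   # prints c#n
echo 'c#nuidig' | grep -E  '^[^[:blank:]]{1,3}'   # prints the whole line: c#nuidig
```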
Using awk:
awk '{print (length($0)<=3) ? $0 : substr($0,1,3)}' src.dat
Output:
r#n
d!n
'it
4#e
c#n
c#n
c^n
sam
1
11
-1
.
Contents of src.dat:
r#nger
d!nger
'iterr
4#e
c#nuidig
c#niting
c^neres
sample
1
11
-1
.
sed 's/.//4g' file
Delete every character starting at the 4th until there aren't any more. This requires GNU sed, whose manual says:
Note: the POSIX standard does not specify what should happen when you mix the g and number modifiers, and currently there is no widely agreed upon meaning across sed implementations. For GNU sed, the interaction is defined to be: ignore matches before the numberth, and then match and replace all matches from the numberth on.
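So with GNU sed, the 4g modifier combination keeps the first three characters and deletes everything from the 4th match onward:

```shell
echo 'c#nuidig' | sed 's/.//4g'   # prints c#n
echo 'sam'      | sed 's/.//4g'   # shorter lines pass through unchanged: prints sam
```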
Also: grep -o '^...' file

Only print first and second word of each line to output with sed

I want to clean up a pattern file for later use, so only the first and second word (or number) are relevant.
I have this:
pattern.txt
# This is a test pattern
some_variable one # placeholder which replaces a variable
some_other_var 2 # other variable to replace
# Some random comment in between
different_var "hello" # this will also replace a placeholder but with a string
# And after some empty lines:
var_after_newlines 18 # some variable after a lot of newlines
{{hello}} " this is just a string surrounded by space "
{bello} "this is just a string"#and this is a comment
cello "#string with a comment in it"#and a comment
To which I apply:
sed -nE '/^\s*#/d;/^\s*$/d;s/^\s*([^\s]+)\s+([^\s]+).*$/\1 \2/p' pattern.txt > output.txt
it should clean out comment lines starting with # -> works
it should clean out empty lines (or lines with whitespace characters) -> works
it should replace every line with its first and second word separated by one (1) space character -> doesn't work. Compare:
output.txt
Expectation:
some_variable one
some_other_var 2
different_var "hello"
var_after_newlines 18
{{hello}} " this is just a string surrounded by space "
{bello} "this is just a string"
cello "#string with a comment in it"
Reality:
different_var "hello" # thi
var_after_newline
{{hello}} " thi
{bello} "thi
cello "#
What am I missing?
EDIT:
As @Ed Morton pointed out, it would make sense to also include the following cases: strings with spaces, strings with spaces before and after the quotation marks, comments within strings, and comments right after the closing quotation mark. The accepted answer's sed solution works fine with all of these.
Completely based on your shown samples only, this could be easily done with awk. Written and tested with GNU awk, should work with any awk.
awk '{sub(/\r$/,"")} NF && !/^#/{print $1,$2}' Input_file
Explanation: simply checking 2 conditions here. 1st: NF, which makes sure the line is NOT empty. 2nd: the line does NOT start with #. When both hold, print the 1st and 2nd columns of the current line.
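Running it over a few representative lines shows the two guards in action (the comment line and the empty line are dropped, and the trailing comment on the kept line is cut off):

```shell
printf '%s\n' '# a comment' '' 'some_variable one # placeholder' |
awk '{sub(/\r$/,"")} NF && !/^#/{print $1,$2}'
# some_variable one
```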
With sed: Please try following in GNU sed.
sed -E 's/\r$//;/^#/d;/^\s*$/d;s/^ +//;s/([^ ]*) +([^ ]*).*/\1 \2/' Input_file
OR as per Ed sir's comments use following:
sed -E 's/\r$//; /^#/d; /^\s*$/d; s/^\s+//; s/(\S*)\s+(\S*).*/\1 \2/' Input_file
Sample output is as follows for both above solutions:
some_variable one
some_other_var 2
different_var "hello"
var_after_newlines 18
In GNU sed
sed -E '/^\s*(#.*)?$/d; s/^\s*(\S+)\s+(\S+).*/\1 \2/' pattern.txt
Update after the comments:
sed -E '/^\s*(#.*)?$/d; s/^\s*(\S+)\s+("[^"]*"|\S+).*/\1 \2/' pattern.txt
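The added "[^"]*"|\S+ alternative lets the second field be an entire quoted string (spaces and embedded # included) instead of stopping at the first whitespace. For example, on one of the tricky lines from the question (GNU sed assumed):

```shell
echo 'cello "#string with a comment in it"#and a comment' |
sed -E '/^\s*(#.*)?$/d; s/^\s*(\S+)\s+("[^"]*"|\S+).*/\1 \2/'
# cello "#string with a comment in it"
```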
Version that should work with most any sed:
$ sed 's/^[[:space:]]*//; s/#.*//; /^$/d; s/^\([^[:space:]]\{1,\}\)[[:space:]]\{1,\}\([^[:space:]]\{1,\}\).*/\1 \2/' pattern.txt
some_variable one
some_other_var 2
different_var "hello"
var_after_newlines 18

Printing lines with duplicate words

I am trying to print all line that can contain same word twice or more
E.g. with this input file:
cat dog cat
dog cat deer
apple peanut banana apple
car bus train plane
car train car train
Output should be
cat dog cat
apple peanut banana apple
car train car train
I have tried this code and it works but I think there must be a shorter way.
awk '{ a=0;for(i=1;i<=NF;i++){for(j=i+1;j<=NF;j++){if($i==$j)a=1} } if( a==1 ) print $0}'
Later I want to find all such duplicate words and delete all the duplicate entries except for 1st occurrence.
So input:
cat dog cat lion cat
dog cat deer
apple peanut banana apple
car bus train plane
car train car train
Desired output:
cat dog lion
dog cat deer
apple peanut banana
car bus train plane
car train
You can use this GNU sed command:
sed -rn '/(\b\w+\b).*\b\1\b/ p' yourfile
-r activates extended regexes and -n deactivates the implicit printing of every line
the p command then prints only lines that match the preceding re (inside the slashes):
\b\w+\b matches a word: a nonempty sequence of word characters (\w) between word boundaries (\b); these are GNU extensions
such a word is "stored" in \1 for later reuse, due to the use of parentheses
then we try to match this word again with \b\1\b, allowing anything (.*) between the two occurrences.
and that is the whole trick: match something, put it in parentheses so you can reuse it in the same re with \1
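Applying it to a few of the sample lines keeps only those with a repeated word:

```shell
printf '%s\n' 'cat dog cat' 'dog cat deer' 'car train car train' |
sed -rn '/(\b\w+\b).*\b\1\b/ p'
# cat dog cat
# car train car train
```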
To answer the second part of the question, deleting the doubled words after the first, but print all lines (modifying only the lines with doubled words), you could use some sed s magic:
sed -r ':A;s/(.*)(\b\w+\b)(.*)\b\2\b(.*)/\1\2\3\4/g;t A'
here we use again the backreference trick.
but we have to account for the things before, between and after our doubled words, thus we have a \2 in the matching part of the s command and the other backreferences in the replacement part.
notice that only the \2 has no parens in the matching part and we use all groups in the replacement, thus we effectively deleted the second word of the pair.
for more repetitions of the word we need loop:
:A is a label
t A jumps to the label if a replacement was done in the last s command
this builds a "while loop" around the s to delete the other repetitions, too
Here's a solution for printing only lines that contain duplicate words.
awk '{
delete seen
for (i=1;i<=NF;++i) {
if (seen[$i]) { print ; next }
seen[$i] = 1
}
}'
Here's a solution for deleting duplicate words after the first.
awk '{
delete seen
for (i=1;i<=NF;++i) {
if (seen[$i]) { continue }
printf("%s ", $i);
seen[$i] = 1
}
print "";
}'
Re your comment...
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. — Jamie Zawinski, 1997
With egrep you can use a so called back reference:
egrep '(\b\w+\b).*\b\1\b' file
(\b\w+\b) matches a word at word boundaries in capturing group 1. \1 references that matched word in the pattern.
I'll show solutions in Perl as it is probably the most flexible tool for text parsing, especially when it comes to regular expressions.
Detecting Duplicates
perl -ne 'print if m{\b(\S+)\b.*?(\b\1\b)}g' file
where
-n causes Perl to execute the expression passed via -e for each input line;
\b matches word boundaries;
\S+ matches one or more non-space characters;
.*? is a non-greedy match for zero or more characters;
\1 is a backreference to the first group, i.e. the word \S+;
g globally matches the pattern repeatedly in the string.
Removing Duplicates
perl -pe '1 while (s/\b(\S+)\b.*?\K(\s\1\b)//g)' file
where
-p causes Perl to print the line ($_), like sed;
1 while loop runs as long as the substitution replaces something;
\K keeps the part matching the previous expression;
Duplicate words (\s\1\b) are replaced with empty string (//g).
Why Perl?
Perl regular expressions are known to be very flexible, and regular expressions in Perl are actually more than just regular expressions. For example, you can embed Perl code into the substitution using the /e modifier. You can use the /x modifier that allows to write regular expressions in a more readable format and even use Perl comments in it, e.g.:
perl -pe '1 while (
s/ # Begins substitution: s/pattern/replacement/flags
\b (\S+) \b # A word
.*? # Ungreedy pattern for any number of characters
\K # Keep everything that matched the previous patterns
( # Group for the duplicate word:
\s # - space
\1 # - backreference to the word
\b # - word boundary
)
//xg
)' file
As you should have noticed, the \K anchor is very convenient, but is not available in many popular tools including awk, bash, and sed.

Replace chars after column X

Lets say my data looks like this
iqwertyuiop
and I want to replace all the letters i after column 3 with a Z.. so my output would look like this
iqwertyuZop
How can I do this with sed or awk?
It's not clear what you mean by "column" but maybe this is what you want using GNU awk for gensub():
$ echo iqwertyuiop | awk '{print substr($0,1,3) gensub(/i/,"Z","g",substr($0,4))}'
iqwertyuZop
Perl is handy for this: you can assign to a substring
$ echo "iiiiii" | perl -pe 'substr($_,3) =~ s/i/Z/g'
iiiZZZ
This would totally be ideal for the tr command, if only you didn't have the requirement that the first 3 characters remain untouched.
However, if you are okay using some bash tricks plus cut and paste, you can split the file into two parts and paste them back together afterwards:
paste -d'\0' <(cut -c-3 foo) <(cut -c4- foo | tr i Z)
The above uses paste to rejoin together the two parts of the file that get split with cut. The second section is piped through tr to translate i's to Z's.
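On the sample input the round trip looks like this (bash process substitution assumed; foo holds the sample line):

```shell
echo 'iqwertyuiop' > foo                               # sample input file
paste -d'\0' <(cut -c-3 foo) <(cut -c4- foo | tr i Z)  # prints iqwertyuZop
```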
(1) Here's a short-and-simple way to accomplish the task using GNU sed:
sed -r -e ':a;s/^(...)([^i]*)i/\1\2Z/g;ta'
This entails looping (t), and so would not be as efficient as non-looping approaches. The above can also be written using escaped parentheses instead of unescaped ones, so there is no real need for the -r option. Other implementations of sed should (in principle) be up to the task as well, but your mileage may vary.
(2) It's easy enough to use "old awk" as well:
awk '{s=substr($0,4);gsub(/i/,"Z",s); print substr($0,1,3) s}'
The most intuitive way would be to use awk:
awk 'BEGIN{FS="";OFS=FS}{for(i=4;i<=NF;i++){if($i=="i"){$i="Z"}}}1' file
FS="" splits the input string into one field per character. We iterate through fields 4 to NF (characters 4 to the end) and replace every i with Z.
The final 1 evaluates to true and makes awk print the modified input line.
With sed it looks less intuitive, but it is still possible:
sed -r '
h # Backup the current line in hold buffer
s/.{3}// # Remove the first three characters
s/i/Z/g # Replace all i by Z
G # Append the contents of the hold buffer to the pattern buffer (this adds a newline between them)
s/(.*)\n(.{3}).*/\2\1/ # Remove that newline ^^^ and assemble the result
' file

How to replace pipe instead of comma in a csv file

I want to convert a csv file from comma separated to pipe (|) separated. But some fields in the csv file contain commas too.
My file
$ cat a.txt
"a","b","c,test","new","abc"
Expecting:
a|b|c,test|new|abc
This sed command will do:
sed 's/","/\|/g; s/"//g' File
Replace all "," patterns with |. This leaves a " at either end of the line, which the second substitution removes.
Sample:
AMD$ cat File
"a","b","c,test","new","abc"
AMD$ sed 's/","/\|/g; s/"//g' File
a|b|c,test|new|abc
sed ':cycle
s/^\(\("[^"]*"[|]\{0,1\}\)*\),/\1|/
t cycle' YourFile
A recursive POSIX version.
The [|]\{0,1\} is a shortcut assuming there is no "foo"|, or "foo",, pattern (empty fields are "").
Another assumption: there are no double quotes inside quoted strings (not even escaped ones).
CSV can be tricky to get right by hand. I'd use a language with a proper CSV parser. For example, with ruby:
$ ruby -rcsv -ne 'puts CSV.generate_line(CSV.parse_line($_), :col_sep=>"|")' a.txt
a|b|c,test|new|abc
That loops over the lines of the file, parses it into an array using the defaults (comma separated, double quotes as the quote character), then generates a new CSV string using pipe as the separator. If a field were to contain a pipe character, that field would be quoted.
This parser cannot handle embedded newlines in a quoted field. Perl's Text::CSV can.
$ awk -F'","' -v OFS='|' '{$1=$1; gsub(/"/,"")} 1' a.txt
a|b|c,test|new|abc
You can use perl in the following way:
cat a.txt | perl -ne 's/^"//; s/"$//; @items = split /","/; print join("|", @items);'