How to replace commas with pipes in a CSV file - awk

I want to convert a CSV file from comma-separated to pipe (|) separated. However, some fields in the file contain commas themselves.
My file
$ cat a.txt
"a","b","c,test","new","abc"
Expecting:
a|b|c,test|new|abc

This sed command will do:
sed 's/","/\|/g; s/"//g' File
Replace all "," patterns with |. That leaves a " at each end of the line, which the second substitution removes.
Sample:
AMD$ cat File
"a","b","c,test","new","abc"
AMD$ sed 's/","/\|/g; s/"//g' File
a|b|c,test|new|abc
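To see the intermediate step, run only the first substitution on the same sample line; the stray quotes at the ends are what s/"//g then removes:

```shell
# First substitution only: "," -> |, leaving a quote at each end
printf '"a","b","c,test","new","abc"\n' | sed 's/","/|/g'
# "a|b|c,test|new|abc"
```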

sed ':cycle
s/^\(\("[^"]*"[|]\{0,1\}\)*\),/\1|/
t cycle' YourFile
A recursive POSIX version.
The [|]\{0,1\} is a shortcut that assumes there is no "foo"| or "foo",, in the input (empty fields are "").
Another assumption: there are no double quotes inside the quoted strings (not even escaped ones).
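For example, on the sample line the loop replaces one top-level comma per pass. Note that, unlike the earlier answer, this version keeps the surrounding quotes:

```shell
printf '"a","b","c,test","new","abc"\n' |
sed ':cycle
s/^\(\("[^"]*"[|]\{0,1\}\)*\),/\1|/
t cycle'
# "a"|"b"|"c,test"|"new"|"abc"
```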

CSV can be tricky to get right by hand. I'd use a language with a proper CSV parser. For example, with ruby:
$ ruby -rcsv -ne 'puts CSV.generate_line(CSV.parse_line($_), :col_sep=>"|")' a.txt
a|b|c,test|new|abc
That loops over the lines of the file, parses each into an array using the defaults (comma as the separator, double quote as the quote character), then generates a new CSV line using pipe as the separator. If a field contained a pipe character, that field would be quoted.
Because it reads line by line, this one-liner cannot handle embedded newlines in a quoted field. Perl's Text::CSV can.
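If embedded newlines are a concern, one workaround (a sketch, not part of the original answer) is to let ruby's CSV parser read the whole input at once instead of going line by line, so quoted fields may span lines:

```shell
# Parse the whole file at once; row.to_csv re-emits each row with | as separator
ruby -rcsv -e 'CSV.parse(STDIN.read).each { |row| print row.to_csv(col_sep: "|") }' < a.txt
# a|b|c,test|new|abc
```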

$ awk -F'","' -v OFS='|' '{$1=$1; gsub(/"/,"")} 1' a.txt
a|b|c,test|new|abc
Using "," as the field separator keeps commas inside quoted fields intact; $1=$1 rebuilds the record with the | output separator, and gsub() strips the quotes left at the ends.

You can use perl in the following way:
perl -ne 's/^"//; s/"$//; @items = split /","/; print join("|", @items);' a.txt

Related

Replace character except between pattern using grep -o or sed (or others)

In the following file I want to replace every ; with ,, except when the ; occurs inside a string (delimited by two " characters), in which case it should be left alone.
Example:
Input
A;B;C;D
5cc0714b9b69581f14f6427f;5cc0714b9b69581f14f6428e;1;"5cc0714b9b69581f14f6427f;16a4fba8d13";xpto;
5cc0723b9b69581f14f64285;5cc0723b9b69581f14f64294;2;"5cc0723b9b69581f14f64285;16a4fbe3855";xpto;
5cc072579b69581f14f6428a;5cc072579b69581f14f64299;3;"5cc072579b69581f14f6428a;16a4fbea632";xpto;
output
A,B,C,D
5cc0714b9b69581f14f6427f,5cc0714b9b69581f14f6428e,1,"5cc0714b9b69581f14f6427f;16a4fba8d13",xpto,
5cc0723b9b69581f14f64285,5cc0723b9b69581f14f64294,2,"5cc0723b9b69581f14f64285;16a4fbe3855",xpto,
5cc072579b69581f14f6428a,5cc072579b69581f14f64299,3,"5cc072579b69581f14f6428a;16a4fbea632",xpto,
For sed I have: sed 's/;/,/g' input.txt > output.txt but this would replace everything.
The regex for the " delimited string: \".*;.*\" .
(A regex for hexadecimal would be better -- something like: [0-9a-fA-F]+)
My problem is combining it all into a grep -o or sed command that replaces everything except that pattern.
The file size is on the order of tens of gigabytes (up to 99 GB), so performance is important.
Any ideas are appreciated.
sed is for doing simple s/old/new on individual strings. grep is for doing g/re/p. You're not trying to do either of those tasks so you shouldn't be considering either of those tools. That leaves the other standard UNIX tool for manipulating text - awk.
You have a ;-separated CSV that you want to make ,-separated. That's simply:
$ awk -v FPAT='[^;]*|"[^"]+"' -v OFS=',' '{$1=$1}1' file
A,B,C,D
5cc0714b9b69581f14f6427f,5cc0714b9b69581f14f6428e,1,"5cc0714b9b69581f14f6427f;16a4fba8d13",xpto,
5cc0723b9b69581f14f64285,5cc0723b9b69581f14f64294,2,"5cc0723b9b69581f14f64285;16a4fbe3855",xpto,
5cc072579b69581f14f6428a,5cc072579b69581f14f64299,3,"5cc072579b69581f14f6428a;16a4fbea632",xpto,
The above uses GNU awk for FPAT. See What's the most robust way to efficiently parse CSV using awk? for more details on parsing CSVs with awk.
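FPAT is specific to GNU awk. For other awks, a rough sketch of the same idea uses match() to consume either a quoted field or a run of non-semicolons (this assumes no escaped quotes inside fields):

```shell
printf '1;"abc;def";xpto;\n' |
awk '{
  out = ""
  while (match($0, /^("[^"]*"|[^;]*)/)) {
    out = out substr($0, 1, RLENGTH)     # take the field (possibly empty)
    $0 = substr($0, RLENGTH + 1)
    if ($0 == "") break                  # no separator left: done
    $0 = substr($0, 2)                   # drop the ; separator
    out = out ","
  }
  print out
}'
# 1,"abc;def",xpto,
```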
If I understand your requirements correctly, one option would be a three-pass approach.
From your comment about hex, I'll assume nothing like # appears in the input, so you can do (using GNU sed):
sed -E 's/("[^"]+);([^"]+")/\1#\2/g' original > transformed
sed -i 's/;/,/g' transformed
sed -i 's/#/;/g' transformed
The idea is to replace the ; within quotes by a placeholder and write the result to a new file, then replace all remaining ; by ,, and finally restore the ; inside quotes, editing the same file in place (sed's -i flag).
The three passes can be combined into a single command:
sed -E 's/("[^"]+);([^"]+")/\1#\2/g;s/;/,/g;s/#/;/g' original > transformed
That said, there are CSV parsers which already handle quoted fields, and one of those is probably a better fit for the final use case, as I bet this is just an intermediate step for something later in the chain.
From Ed Morton's comment: if you do it in one pass, you can use \n as the placeholder separator, since the text, processed line by line, cannot contain a newline.
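A sketch of that one-pass variant (GNU sed; like the # version, it assumes each quoted field contains at most one ;):

```shell
# \n is a safe placeholder because a line cannot already contain a newline
printf '1;"abc;def";xpto;\n' |
sed -E 's/("[^"]+);([^"]+")/\1\n\2/g; s/;/,/g; s/\n/;/g'
# 1,"abc;def",xpto,
```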
This might work for you (GNU sed):
sed -E ':a;s/^([^"]*("[^"]*"[^"]*)*"[^";]*);/\1\n/;ta;y/;/,/;y/\n/;/' file
Replace ;'s inside double quotes with newlines, transpose ;'s to ,'s and then transpose newlines to ;'s.
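Unlike the single-placeholder substitution, the loop handles several ;'s within one quoted field. For example:

```shell
printf '1;"a;b;c";x;\n' |
sed -E ':a;s/^([^"]*("[^"]*"[^"]*)*"[^";]*);/\1\n/;ta;y/;/,/;y/\n/;/'
# 1,"a;b;c",x,
```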

Replace chars after column X

Let's say my data looks like this:
iqwertyuiop
and I want to replace all the letters i after column 3 with a Z, so my output would look like this:
iqwertyuZop
How can I do this with sed or awk?
It's not clear what you mean by "column" but maybe this is what you want using GNU awk for gensub():
$ echo iqwertyuiop | awk '{print substr($0,1,3) gensub(/i/,"Z","g",substr($0,4))}'
iqwertyuZop
Perl is handy for this: you can assign to a substring
$ echo "iiiiii" | perl -pe 'substr($_,3) =~ s/i/Z/g'
iiiZZZ
This would totally be ideal for the tr command, if only you didn't have the requirement that the first 3 characters remain untouched.
However, if you are okay using some bash tricks plus cut and paste, you can split the file into two parts and paste them back together afterwards:
paste -d'\0' <(cut -c-3 foo) <(cut -c4- foo | tr i Z)
The above uses paste to rejoin together the two parts of the file that get split with cut. The second section is piped through tr to translate i's to Z's.
(1) Here's a short-and-simple way to accomplish the task using GNU sed:
sed -r -e ':a;s/^(...)([^i]*)i/\1\2Z/g;ta'
This entails looping (t), and so would not be as efficient as non-looping approaches. It can also be written with escaped parentheses instead of unescaped ones, in which case there is no real need for the -r option. Other implementations of sed should (in principle) be up to the task as well, but your mileage may vary.
(2) It's easy enough to use "old awk" as well:
awk '{s=substr($0,4);gsub(/i/,"Z",s); print substr($0,1,3) s}'
The most intuitive way would be to use awk:
awk 'BEGIN{FS="";OFS=FS}{for(i=4;i<=NF;i++){if($i=="i"){$i="Z"}}}1' file
FS="" splits the input into one character per field (a GNU awk feature; POSIX leaves a null FS unspecified). We iterate through fields 4 to NF and replace i with Z.
The final 1 evaluates to true and makes awk print the modified input line.
With sed it looks less intuitive, but it is still possible:
sed -r '
h # Backup the current line in hold buffer
s/.{3}// # Remove the first three characters
s/i/Z/g # Replace all i by Z
G # Append the contents of the hold buffer to the pattern buffer (this adds a newline between them)
s/(.*)\n(.{3}).*/\2\1/ # Remove that newline ^^^ and assemble the result
' file

gawk system-command ignoring backslashes

Consider a file with a list of strings,
string1
string2
...
abc\de
When using gawk's system command to execute a shell command, in this case printing the strings,
cat file | gawk '{system("echo " $0)}'
the last string comes out as abcde. ($0 denotes the whole record; here that is just the one string.)
Is this a limitation of gawk's system command, of not being able to output gawk variables unformatted?
Expanding on Mussé Redi's answer, observe that, in the following, the backslash does not print:
$ echo 'abc\de' | gawk '{system("echo " $0)}'
abcde
However, here, the backslash will print:
$ echo 'abc\de' | gawk '{system("echo \"" $0 "\"")}'
abc\de
The difference is that the latter command passes $0 to the shell with double-quotes around it. The double-quotes change how the shell processes the backslash.
The exact behavior will change from one shell to another.
To print while avoiding all the shell vagaries, a simple solution is:
$ echo 'abc\de' | gawk '{print $0}'
abc\de
In Bash we use a double backslash to denote an actual backslash; the function of a single backslash is to escape the next character. Hence the system command is not doing any formatting at all; the shell is.
The solution to this problem is to write an awk function that preformats backslashes into double backslashes before the string is passed to the system command.
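A minimal sketch of that idea (shellsafe is a made-up name; spaces and other shell metacharacters would still need proper quoting on top of this):

```shell
printf 'abc\\de\n' | gawk '
function shellsafe(s) {
  gsub(/\\/, "&&", s)   # double each backslash; & is the whole match
  return s
}
{ system("echo " shellsafe($0)) }'
# abc\de
```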

How to convert a text file with one line of comma separated text to a file with each comma separated field on a new line

I'm using an old Sun Solaris 2.5.1. I have several files, each with 1 line of comma-separated data fields, that I need converted so that each field is on its own line. The number of fields varies from file to file.
i.e.
#cat recipe1
299834,43399,PRODUCT NAME HERE,4.02,344,MORE TEXT HERE,3423 etc...
NEEDS TO BE:
#cat recipe1
299834
43399
PRODUCT NAME HERE
4.02
344
MORE TEXT HERE
3423
etc...
Perhaps:
tr ',' '\n' < in > out
So you just mean convert commas to newlines?
sed 's/,/\n/g' in.txt > out.txt
It looks like old Solaris sed doesn't support the special \n sequence in the replacement. Try it with a literal newline character instead:
sed 's/,/\
/g' in.txt > out.txt
Edit: updated to add the backslash, as suggested by @Ed Morton
You can use this awk:
awk '$1=$1' RS=, recipe1
But it will fail for a field that is 0 (the assignment $1=$1 then evaluates as false, so the record is not printed), so better use:
awk '{$1=$1}1' RS=, recipe1
299834
43399
PRODUCT NAME HERE
4.02
344
MORE TEXT HERE
3423
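The difference between the two is easy to demonstrate: with a 0 record, $1=$1 assigns a value that tests false, so the pattern-only version drops it:

```shell
printf 'a,0,b\n' | awk '$1=$1' RS=,     # the 0 record is lost
# a
# b
printf 'a,0,b\n' | awk '{$1=$1}1' RS=,  # all records printed
# a
# 0
# b
```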
perl -pi -e 's/,/\n/g' your_file
PS: This command does an in-place replacement, so there is no need to redirect the output to a new file; the existing file itself is modified.

Awk: Using invert match to a string and then substitute characters

I want to extract the lines that don't contain # and delete the " and ; characters in the output.
My input FILE looks like this:
# ;string"1"
# string"2";
string"3";
Can use grep and tr to get wanted output:
grep -v '#' FILE | tr -d ';"'
string3
However I want to use awk.
I can extract invert match awk '!/#/' FILE, but how can I use sub to delete ", ; in the same awk command?
You can use gsub for global substitution:
awk '!/#/{gsub(/[";]/,"",$0);print}'
The following transcript shows this in action, it delivers the same results as your grep/tr pipeline:
pax> echo '# ;string"1"
# string"2";
string"3";' | awk '!/#/{gsub(/[";]/,"",$0);print}{}'
string3
Note that the final {} may not be necessary in some implementations of awk but it's there to stop output of non-matching lines in those implementations (usually older ones) that do it automatically for lines matching none of the rules.
Use gsub instead which would replace all matches not just one:
awk '/#/{next}{gsub(/[";]/,"")}1' file
Output:
string3
Skipping the third parameter to gsub makes it process $0 by default.
/#/{next} makes it skip lines containing #
1 makes it print $0
Another awk version
awk -F"[\";]" '{$1=$1} !/^#/' OFS= file
string3
awk '{gsub(/[";]/,x)} !/^#/' file
string3
The x is an unset variable, which evaluates to the empty string. "" could also have been used, but x saves one character :)
If you want to give sed a chance:
sed -n '/^[^#]/s/[";]//gp' file
string3