sed replace text between comma - awk

I have CSV files in which f needs to be changed to 0 and t to 1, but only when the value sits between commas, for every single CSV where it matches. From:
,t,t,f,f,a,t,f,t,f,f,t,f,
tftf
to:
,1,1,0,0,a,1,0,1,0,0,1,0,
tftf
This works, but I want to know a better way that could reduce the time the replacement takes:
for i in 1 2 3 4 5 6
do
echo "converting tables for mariaDB"
find ./ -type f -name "*.csv" -print0 | xargs -0 sed -i 's/,t,/,1,/g'
find ./ -type f -name "*.csv" -print0 | xargs -0 sed -i 's/,f,/,0,/g'
echo "$i time(s) changed "
done
I expect that one single command can change the line.

Could you please try the following. It is not a perfect solution, but it is the simplest one to use in case you don't have a recent gawk where the -i inplace edit option is present.
for file in *.csv
do
    awk '{gsub(/,t,/,",1,");gsub(/,f,/,",0,");gsub(/,t,/,",1,");gsub(/,f,/,",0,")} 1' "$file" > temp && mv temp "$file"
done
OR
for file in *.csv
do
    awk -v t_val="1" -v f_val="0" 'BEGIN{FS=OFS=","}{for(i=2;i<NF;i++){$i=($i=="t"?t_val:$i=="f"?f_val:$i)}} 1' "$file" > temp && mv temp "$file"
done
2nd solution: using a recent gawk where we can save the edit into the input file itself.
gawk -i inplace '{gsub(/,t,/,",1,");gsub(/,f,/,",0,");gsub(/,t,/,",1,");gsub(/,f,/,",0,")} 1' *.csv
OR
gawk -i inplace -v t_val="1" -v f_val="0" 'BEGIN{FS=OFS=","}{for(i=2;i<NF;i++){$i=($i=="t"?t_val:$i=="f"?f_val:$i)}} 1' Input_file

The main problem, in this case, is that regular-expression matches cannot overlap when substituting with sed 's/ere/str/g' or awk '{gsub(ere,str,$0)}'. This comment nicely explains how you can circumvent this in sed using the t<label> command, which means: if a change happened to the pattern space, branch to <label>. The comment shows a generic way of doing it. The awk alternative to this rule would be:
$ awk '{while (match($0,ere)) gsub(ere,str)} 1'
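Applied to the OP's data, the t<label> trick could look like this (a sketch, GNU sed syntax; the label name a is arbitrary). The ta command branches back to :a whenever a substitution changed the pattern space, so both substitutions are re-run until nothing matches anymore, which picks up the matches that overlapped on the previous pass:
$ sed ':a; s/,t,/,1,/g; s/,f,/,0,/g; ta' file
,1,1,0,0,a,1,0,1,0,0,1,0,
tftf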
An alternative sed solution in the case of the OP's example could use the following idea:
duplicate all commas. Since we are searching for strings of the form ",t,", this duplication avoids overlap when using the s command.
since no overlap is possible anymore, replace all ",f," with ",0," and all ",t," with ",1,".
we can now revert all duplicated commas again. As no overlap is allowed, a sequence like ,,,, is nicely converted to ,, and not to ,
In POSIX sed this looks like:
$ sed -e 's/,/,,/g' -e 's/,f,/,0,/g' \
-e 's/,t,/,1,/g' -e 's/,,/,/g' file > file.tmp
$ mv file.tmp file
With GNU sed we can do it in one go:
$ sed -i 's/,/,,/g;s/,f,/,0,/g;s/,t,/,1,/g;s/,,/,/g' file
With awk, this would look like:
$ awk 'BEGIN{FS=",";OFS=FS FS}
{$1=$1;gsub(/,f,/,",0,");gsub(/,t,/,",1,");gsub(OFS,FS)}1' file > file.tmp
$ mv file.tmp file
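To run the one-pass GNU sed command over every CSV file, as in the original find loop, something like this should work (a sketch combining the commands above):
$ find . -type f -name '*.csv' -print0 | xargs -0 sed -i 's/,/,,/g;s/,f,/,0,/g;s/,t,/,1,/g;s/,,/,/g'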


I need to extract IDs from Google Drive URLs with sed, gawk or grep

URLs:
1. https://docs.google.com/uc?id=0B3X9GlR6EmbnQ0FtZmJJUXEyRTA&export=download
2. https://drive.google.com/open?id=1TkLq5C7NzzmbRjd7VGRhauNT9Vaap-Py
3. https://drive.google.com/drive/folders/1TkLq5C7NzzmbRjd7VGRhauNT9Vaap-Py?usp=sharing
I need a single regex that works for all of these URLs.
This is what I tried, but it didn't give the expected results:
sed -E 's/.*\(folders\)?\(id\)?=?\/?(.*)&?.*/\1/'
Expected results:
0B3X9GlR6EmbnQ0FtZmJJUXEyRTA
1TkLq5C7NzzmbRjd7VGRhauNT9Vaap-Py
With your own code updated:
$ cat file
1. https://docs.google.com/uc?id=0B3X9GlR6EmbnQ0FtZmJJUXEyRTA&export=download
2. https://drive.google.com/open?id=1TkLq5C7NzzmbRjd7VGRhauNT9Vaap-Py
3. https://drive.google.com/drive/folders/1TkLq5C7NzzmbRjd7VGRhauNT9Vaap-Py?usp=sharing
$ sed -E 's#.*(folders/|id=)([^?&]+).*#\2#' file
0B3X9GlR6EmbnQ0FtZmJJUXEyRTA
1TkLq5C7NzzmbRjd7VGRhauNT9Vaap-Py
1TkLq5C7NzzmbRjd7VGRhauNT9Vaap-Py
$ sed -E 's#.*(folders/|id=)([^?&]+).*#\2#' file | uniq
0B3X9GlR6EmbnQ0FtZmJJUXEyRTA
1TkLq5C7NzzmbRjd7VGRhauNT9Vaap-Py
And your own version updated to sed -E 's#.*(folders/|id=)(.*)(\?|&|$).*#\2#' would work on GNU sed.
You are using -E, so there is no need to escape the group parentheses (), and | means OR.
When matching a literal ?, you need to escape it.
And sed's delimiter can be changed to another character, # here.
Note that uniq only removes adjacent duplicates; if there are duplicates in different places, change it to sort -u instead, as shown below.
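For example, replacing uniq with sort -u in the pipeline above (a sketch):
$ sed -E 's#.*(folders/|id=)([^?&]+).*#\2#' file | sort -u
0B3X9GlR6EmbnQ0FtZmJJUXEyRTA
1TkLq5C7NzzmbRjd7VGRhauNT9Vaap-Py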
A GNU grep solution :
$ grep -Poi '(id=|folders/)\K[a-z0-9_-]*' file
0B3X9GlR6EmbnQ0FtZmJJUXEyRTA
1TkLq5C7NzzmbRjd7VGRhauNT9Vaap-Py
1TkLq5C7NzzmbRjd7VGRhauNT9Vaap-Py
Also, these two give the same results, but are more precise than the shorter sed command above:
sed -E 's#.*(folders/|id=)([A-Za-z0-9_-]*).*#\2#'
sed -E 's#.*(folders/|id=)([[:alnum:]_-]*).*#\2#'
Btw, + means one or more occurrences, * means zero or more.
A GNU awk version (removes duplicates at the same time):
awk 'match($0,".*(folders/|id=)([A-Za-z0-9_-]+)",m){if(!a[m[2]]++)print m[2]}' file
Could you please try the following:
awk 'match($0,/\?id=[^&]*|folders\/[^?]*/){value=substr($0,RSTART,RLENGTH);gsub(/.*=|.*\//,"",value);print value}' Input_file
Try this:
sed -E 's/.*(id=|folders\/)([^&?/]*).*/\2/' file
Explanations:
.*(id=|folders\/): any characters (.*) followed by id= or folders/
([^&?/]*): search and capture any characters except &, ? and /
\2: using a backreference, the matched string is replaced by the second captured text ([^&?/]*)
Edit:
To remove duplicate URLs, just pipe the command to sort and then to uniq (because uniq only removes adjacent duplicate lines, you may want to sort the list first):
sed -E 's/.*(id=|folders\/)([^&?/]*).*/\2/' file | sort | uniq
As @Tiw suggests in the edit, you can also do it with a single command by using sort with the -u flag:
sed -E 's/.*(id=|folders\/)([^&?/]*).*/\2/' file | sort -u
Using Perl
$ cat rohit.txt
1. https://docs.google.com/uc?id=0B3X9GlR6EmbnQ0FtZmJJUXEyRTA&export=download
2. https://drive.google.com/open?id=1TkLq5C7NzzmbRjd7VGRhauNT9Vaap-Py
3. https://drive.google.com/drive/folders/1TkLq5C7NzzmbRjd7VGRhauNT9Vaap-Py?usp=sharing
$ perl -lne ' s/.*\/.*..\/(.*)$/$1/g; s/(.*id=)//g; /(.+?)(&|\?|$)/ and print $1 ' rohit.txt
0B3X9GlR6EmbnQ0FtZmJJUXEyRTA
1TkLq5C7NzzmbRjd7VGRhauNT9Vaap-Py
1TkLq5C7NzzmbRjd7VGRhauNT9Vaap-Py
$

Using grep-awk and sed in one-row-command result in a "No such file or directory" error

...And I know why:
I have an XML document with lots of information inside. I need to extract what I need and eventually print it to a new file.
The XML (well, part of it; the rows just keep repeating):
<module classname="org.openas2.processor.receiver.AS2DirectoryPollingModule"
outboxdir="%home%/../../../home/samba/user/Outbound/toMartha/"
errordir="%home%/../../../home/samba/user/Outbound/toMartha/error"
sentdir="%home%/../../../home/samba/user/data/Sent/Martha"
interval="600"
defaults="sender.name=me_myself, receiver.name=Martha"
sendfilename="true"
mimetype="application/standard"/>
<module classname="org.openas2.processor.receiver.AS2DirectoryPollingModule"
outboxdir="%home%/../../../home/samba/user/Outbound/toJosh/"
errordir="%home%/../../../home/samba/user/Outbound/toJosh/error"
sentdir="%home%/../../../home/samba/user/data/Sent/Josh"
interval="600"
defaults="sender.name=me_myself, receiver.name=Josh"
sendfilename="true"
mimetype="application/standard"/>
<module classname="org.openas2.processor.receiver.AS2DirectoryPollingModule"
outboxdir="%home%/../../../home/samba/user/Outbound/toPamela/"
errordir="%home%/../../../home/samba/user/Outbound/toPamela/error"
interval="600"
defaults="sender.name=me_myself, receiver.name=Pamela"
sendfilename="true"
mimetype="application/standard"/>
I need to extract the folder after "Outbound" and clean it from quotes or slashes.
Also, I need to exclude the "/error" so I get only 1 result for each of them.
My command is:
grep -o -v "/error" "Outbound/" config.xml | awk -F"Outbound/" '{print $2}' | sed -e "s/\/\"//g" > /tmp/sync_users
The error is grep: Outbound/: No such file or directory, which of course means that I'm giving grep too many arguments(?). If I remove the -v "/error" it works, but it also prints the names with "/error".
Can someone help me?
EDIT:
As some pointed out in their example (thanks for the time you put in), I'd need to extract these words based on the sample above:
toMartha
toJosh
toPamela
It could be interesting to use sed in this case:
sed -e '\#/Outbound/#!d' -e '\#/error"$#d' -e 's#.*/Outbound/##;s#/\{0,1\}"$##' Config.xml
An awk version, assuming (for the last print) that your line is always one folder below Outbound, as shown:
awk -F '/' '$0 !~ /\/Outbound\// || /\/error"$/ {next} {print $(NF-1)}' Config.xml
Lose the grep altogether:
$ awk '/outboxdir/{gsub(/^.+Outbound\/|\/" *\r?$/,""); print}' file
toMartha
toJosh
toPamela
/outboxdir/ only processes records that contain outboxdir
gsub removes the unwanted parts of the record
space removal at the end of the record and a CRLF fix for Windows-originated files are included
To give grep multiple patterns, they have to be separated by newlines or specified with multiple pattern options (-e, -f, ...). However, -v inverts the match as a whole; you can't invert only one pattern.
For what you're after you can use PCRE (-P argument) for the lookaround ability:
grep -o -P '(?<=Outbound\/)[^\/]+(?!.*\/error)' config.xml
The regex tries to:
match something that is not a slash, at least once: the [^\/]+
preceded by Outbound/: the positive lookbehind (?<=Outbound\/)
and not followed by something ending with /error: the negative lookahead (?!.*\/error)
With your first sample input:
$ grep -o -P '(?<=Outbound\/)[^\/]+(?!.*\/error)' test.txt
toMartha
toJosh
toPamela
How about:
grep -i "outbound" your_file | awk -F"Outbound/" '{print $2}' | sed -e 's/error//' -e 's/\/\"//' | uniq
Should work :)
You can use match in gawk and a capturing group in the regex:
awk 'match($0, /^.*\/Outbound\/([^\/]+)\/([^\/]*)\/?"$/, a){
if(a[2]!="error"){print a[1]}
}' config.xml
You get:
toMartha
toJosh
toPamela
grep can accept multiple patterns with the -e option (aka --regexp, even though it can be used with --fixed-strings too, go figure). However, -v (--invert-match) applies to all of the patterns as a group.
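For instance (a sketch against the sample config.xml): every -e pattern is an alternative, and -v inverts the whole set rather than any single pattern:
$ grep -e "outboxdir" -e "errordir" config.xml    # lines matching either pattern
$ grep -v -e "outboxdir" -e "errordir" config.xml # lines matching neither pattern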
Another solution would be to chain two calls to grep:
grep -v "/error" config.xml | grep "Outbound/" | awk -F"Outbound/" '{print $2}' | sed -e "s/\/\"//g"
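Run against the sample above, that pipeline should print (untested sketch):
toMartha
toJosh
toPamela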

awk print overwrite strings

I have a problem using awk in the terminal.
I need to move many files as a group from the current directory to another one, and I have the list of the necessary files in a text file, as:
filename.txt
file1
file2
file3
...
I usually type:
paste filename.txt | awk '{print "mv "$1" ../dir/"}' | sh
and it executes:
mv file1 ../dir/
mv file2 ../dir/
mv file3 ../dir/
It usually works, but now the command changes its behaviour: awk overwrites the start of the line with the last string ../dir/, restarting the printing from the initial position and producing:
../dire1 ../dir/
../dire2 ../dir/
../dire3 ../dir/
and of course it cannot be executed.
What's happened?
How do I solve it?
Your input file contains carriage returns (\r, aka control-M). Run dos2unix on it before running a UNIX tool on it, or strip the \r with tr, as sketched at the end of this answer.
idk what you're using paste for though, and you should not be using awk for this at all anyway; it's just a job for a simple shell script, e.g. remove the echo once you've tested this:
$ < file xargs -n 1 -I {} echo mv "{}" "../dir"
mv file1 ../dir
mv file2 ../dir
mv file3 ../dir
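If dos2unix is not available, stripping the carriage returns with tr first should work too (a sketch; filename.clean is just a temporary name):
$ tr -d '\r' < filename.txt > filename.clean && mv filename.clean filename.txt
$ < filename.txt xargs -n 1 -I {} mv "{}" "../dir"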

Inserting characters into specified fields in large files

Here is my question. I have several hundred files that are too large to edit with the vi editor. I'm looking for a possible awk or sed command to manipulate my files. Bit of a rookie. I have a simplified file:
001|1|3|053412|16|1234|||
001|21|4|123618|15|88|||
The files were created with the fourth field in the wrong format.
The fourth field should be 05:34:12 reflecting HH:MM:SS. The time values are correct, I just need to insert the : in the appropriate places.
How do I insert the colons after the second character and the fourth characters in the fourth field? I cannot do it by character count since the values before and after the fourth field are variable.
gawk to the rescue!
$ awk -F\| -v OFS=\| '{$4=gensub(/(..)(..)(..)/,"\\1:\\2:\\3","g",$4)}1' file
001|1|3|05:34:12|16|1234|||
001|21|4|12:36:18|15|88|||
otherwise you can do the same with substr($4,1,2)":"... (see the sketch below)
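A minimal sketch of that substr variant, doing the same field surgery without gensub:
$ awk 'BEGIN{FS=OFS="|"} {$4=substr($4,1,2)":"substr($4,3,2)":"substr($4,5,2)} 1' file
001|1|3|05:34:12|16|1234|||
001|21|4|12:36:18|15|88|||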
With GNU awk for gensub() and inplace editing
awk -i inplace 'BEGIN{FS=OFS="|"} {$4=gensub(/(..)(..)/,"\\1:\\2:",1,$4)} 1' *
Similarly with GNU sed for EREs and inplace editing:
sed -i -E 's/(([^|]*\|){3}..)(..)/\1:\3:/' *
e.g.:
$ awk 'BEGIN{FS=OFS="|"} {$4=gensub(/(..)(..)/,"\\1:\\2:",1,$4)} 1' file
001|1|3|05:34:12|16|1234|||
001|21|4|12:36:18|15|88|||
$ sed -E 's/(([^|]*\|){3}..)(..)/\1:\3:/' file
001|1|3|05:34:12|16|1234|||
001|21|4|12:36:18|15|88|||
With sed:
$ sed -r 's/([^|]*\|[^|]*\|[^|]*\|)([0-9]{2})([0-9]{2})([0-9]{2})/\1\2:\3:\4/' file
001|1|3|05:34:12|16|1234|||
001|21|4|12:36:18|15|88|||
This might work for you (GNU sed):
sed -r 's/^(([^|]*\|){3})(..)(..)/\1\3:\4:/' file
Use back references to group the first three fields and the following two pairs. Then format the 4th field as required.
Try this awk!
awk -F"|" -v OFS="|" '{r=split($4,T,"");for(i=2;i<=r;i+=2){if(i!=r)T[i]=T[i]":"}tmp="";for(i=1;i<=r;i++){tmp=tmp T[i]}$4=tmp;}1' file
001|1|3|05:34:12|16|1234|||
001|21|4|12:36:18|15|88|||
Longer, with explanation:
BEGIN{
  FS=OFS="|";                          # Field separator and output field separator
}
{
  tmp="";
  r=split($4,time_field,"");           # Chunk the field into pieces (one character each)
  for(i=2;i<=r;i+=2)                   # Loop two by two
  {
    if(i!=r)
    {
      time_field[i]=time_field[i]":";  # Add ":"
    }
  }
  for(i=1;i<=r;i++)                    # Loop over again to rebuild
  {
    tmp=tmp time_field[i];
  }
  $4=tmp;                              # Rebuild the field
  print
}
How you could use it in bash: Save it as whatever.awk
while IFS='' read -r file
do
    awk -f whatever.awk "$file" > "$file.out"   # one output file per input file
done < list_of_files_to_edit.txt
If you want to edit the files in place, you can add the -i option to Kenavoz's sed command: sed -ri ...
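Combined, the in-place variant would look something like this (a sketch; GNU sed assumed, and the bare * glob assumes the directory contains only the data files, as in the gawk -i inplace example above):
$ sed -ri 's/^(([^|]*\|){3})(..)(..)/\1\3:\4:/' *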

Shell script: How to split line?

here's my scenario:
my input file looks like:
/tmp/abc.txt
/tmp/cde.txt
/tmp/xyz/123.txt
and I'd like to obtain the following output in 2 files:
first file
/tmp/
/tmp/
/tmp/xyz/
second file
abc.txt
cde.txt
123.txt
thanks a lot
Here it is all in one single awk:
awk -F\/ -vOFS=\/ '{print $NF > "file2";$NF="";print > "file1"}' input
cat file1
/tmp/
/tmp/
/tmp/xyz/
cat file2
abc.txt
cde.txt
123.txt
Here we set the input and output field separators to /
Then print the last field $NF to file2
Set the last field to empty, then print the rest to file1
I realize you already have an answer, but you might be interested in the following two commands:
basename
dirname
If they're available on your system, you'll be able to get what you want just piping through these:
cat input | xargs -l dirname > file1
cat input | xargs -l basename > file2
Enjoy!
Edit: Fixed per quantdev's comment. Good catch!
Through grep,
grep -o '.*/' file > file1.txt
grep -o '[^/]*$' file > file2.txt
.*/ matches all the characters from the start up to the last / symbol.
[^/]*$ matches any character except /, zero or more times. $ asserts that we are at the end of the line.
The awk solution is probably the best, but here is a pure sed solution:
#n sed script to get dir and file paths
h
s/.*\/\(.*\.txt\)/\1/
w file2
g
s/\(.*\/\).*\.txt/\1/
w file1
Note how we hold the buffer with h, and how we use the write (w) command to produce the output files. There are many other ways to do it with sed, but I like this one for using multiple different commands.
To use it :
> sed -f sed_script testfile
Here is another one-liner that uses tee:
cat f1.txt | tee >(xargs -n 1 dirname >> f2.txt) >(xargs -n 1 basename >> f3.txt) &>/dev/random