How do I write sed commands to generate an awk file.
Here is my problem:
For example, I have a text file, A.txt which contains a word on each line.
app#
#ple
#ol#
The # refers when the word starts/ ends/ starts and ends. For example, app# shows that the word starts with 'app'. #ple shows that the word ends with 'ple'. #ol# shows that the word has 'ol' in the middle of the word.
I have to generate an awk file from sed commands which reads in another file, B.txt (which contains a word on each line) and increments the variable start, end, middle.
How do I write sed commands whereby for each line in the text file, A.txt, it will generate an awk code ie.
{ {if ($1 ~/^app/)
{start++;}
}
For example, if I input the other file, B.txt with these words into the awk script,
application
people
bold
cold
The output would be; start = 1, end = 1, middle = 2.
I'd use ed over sed for this, actually.
A quick script that creates A.awk from A.txt and runs it on B.txt:
#!/bin/sh
ed -s A.txt <<'EOF'
1,$ s!^#\(.*\)#$!$0 ~ /.+\1.+/ { middle++ }!
1,$ s!^#\(.*\)!$0 ~ /\1$/ { end++ }!
1,$ s!^\(.*\)#!$0 ~ /^\1/ { start++ }!
0 a
#!/usr/bin/awk -f
BEGIN { start = end = middle = 0 }
.
$ a
END { printf "start = %d, end = %d, middle = %d\n", start, end, middle }
.
w A.awk
EOF
# awk -f A.awk B.txt would work too, but this demonstrates a self-contained awk script
chmod +x A.awk
./A.awk B.txt
Running it:
$ ./translate.sh
start = 1, end = 1, middle = 2
$ cat A.awk
#!/usr/bin/awk -f
BEGIN { start = end = middle = 0 }
$0 ~ /^app/ { start++ }
$0 ~ /ple$/ { end++ }
$0 ~ /.+ol.+/ { middle++ }
END { printf "start = %d, end = %d, middle = %d\n", start, end, middle }
Note: This assumes that the middle patterns shouldn't match at the start or end of a line.
But here's a attempt using sed to create A.awk, putting all the sed commands in a file, as trying to this as a one-liner using -e and getting all the escaping right is not something I feel up to at the moment:
Contents of makeA.sed:
s!^#\(.*\)#$!$0 ~ /.+\1.+/ { middle++ }!
s!^#\(.*\)!$0 ~ /\1$/ { end++ }!
s!^\(.*\)#!$0 ~ /^\1/ { start++ }!
1 i\
#!/usr/bin/awk -f\
BEGIN { start = end = middle = 0 }
$ a\
END { printf "start = %d, end = %d, middle = %d\\n", start, end, middle }
Running it:
$ sed -f makeA.sed A.txt > A.awk
$ awk -f A.awk B.txt
start = 1, end = 1, middle = 2
Off the top of my head, and not tested:
/\(.*\)#$/s//{if ($1 ~ /^\1/) start++; next}/
/#\(.*\)$/s//{if ($1 ~ /\1$/) end++; next}/
/\(.*\)/s//{if ($1 ~ /\1/) middle++; next}/
The construct \(.*\) matches any text and saves it in a back-reference, then \1 recalls the back-reference. The empty pattern following the s command refers back to the pattern that matched the line. The next prevents the third pattern from matching after one of the other two has already matched.
Related
I have a sparse matrix ("matrix.csv") with 10k rows and 4 columns (1st column is "user", and the rest columns are called "slots" and contain 0s or 1s), like this:
user1,0,1,0,0
user2,0,1,0,1
user3,1,0,0,0
Some of the slots that contain a "0" should be changed to contain a "1".
I have another file ("slots2change.csv") that tells me which slots should be changed, like this:
user1,3
user3,2
user3,4
So for user1, I need to change slot3 to contain a "1" instead of a "0", and for user3 I should change slot2 and slot4 to contain a "1" instead of a "0", and so on.
Expected result:
user1,0,1,1,0
user2,0,1,0,1
user3,1,1,0,1
How can I achieve this using awk or sed?
Looking at this post: awk or sed change field of file based on another input file, a user proposed an answer that is valid if the "slots2change.csv" file do not contain the same user in diferent rows, which is not the case in here.
The solution proposed was:
awk 'BEGIN{FS=OFS=","}
NR==FNR{arr[$1]=$2;next}
NR!=FNR {for (i in arr)
if ($1 == i) {
F=arr[i] + 1
$F=1
}
print
}
' slots2change.csv matrix.csv
But that answer doesn't apply in the case where the "slots2change.csv" file contain the same user in different rows, as is now the case.
Any ideas?
Using GNU awk for arrays of arrays:
$ cat tst.awk
BEGIN { FS=OFS="," }
NR == FNR {
users2slots[$1][$2]
next
}
$1 in users2slots {
for ( slot in users2slots[$1] ) {
$(slot+1) = 1
}
}
{ print }
$ awk -f tst.awk slots2change.csv matrix.csv
user1,0,1,1,0
user2,0,1,0,1
user3,1,1,0,1
or using any awk:
$ cat tst.awk
BEGIN { FS=OFS="," }
NR == FNR {
if ( !seen[$0]++ ) {
users2slots[$1] = ($1 in users2slots ? users2slots[$1] FS : "") $2
}
next
}
$1 in users2slots {
split(users2slots[$1],slots)
for ( idx in slots ) {
slot = slots[idx]
$(slot+1) = 1
}
}
{ print }
$ awk -f tst.awk slots2change.csv matrix.csv
user1,0,1,1,0
user2,0,1,0,1
user3,1,1,0,1
Using sed
while IFS="," read -r user slot; do
sed -Ei "/$user/{s/(([^,]*,){$slot})[^,]*/\11/}" matrix.csv
done < slots2change.csv
$ cat matrix.csv
user1,0,1,1,0
user2,0,1,0,1
user3,1,1,0,1
If the order in which the users are outputted doesn't matter then you could do something like this:
awk '
BEGIN { FS = OFS = "," }
FNR == NR {
fieldsCount[$1] = NF
for (i = 1; i <= NF; i++ )
matrix[$1,i] = $i
next
}
{ matrix[$1,$2+1] = 1 }
END {
for ( id in fieldsCount ) {
nf = fieldsCount[id]
for (i = 1; i <= nf; i++)
printf "%s%s", matrix[id,i], (i < nf ? OFS : ORS)
}
}
' matrix.csv slots2change.csv
user1,0,1,1,0
user2,0,1,0,1
user3,1,1,0,1
This might work for you (GNU sed):
sed -E 's#(.*),(.*)#/^\1/s/,[01]/,1/\2#' fileChanges | sed -f - fileCsv
Create a sed script from the file containing the changes and apply it to the intended file.
The solution above, manufactures a match and substitution for each line in the file changes. This is then piped through to second invocation of sed which applies the sed script to the csv file.
I have this kind of file :
>AX-89948491-minus
CTAACACATTTAGTAGATT
>AX-89940152-plus
cgtcattcagggcaggtggggcaaaA
>AX-89922107-plus
TTATAACTTGTGTATGCTCTCAGGCT
When the lines start by ">" and include "minus" , I need to reverse (rev) and translate (tr) the next following lines. I should get :
>AX-89948491-minus
AATCTACTAAATGTGTTAG
>AX-89940152-plus
cgtcattcagggcaggtggggcaaaA
>AX-89922107-plus
TTATAACTTGTGTATGCTCTCAGGCT
I would like to go with awk. I tried that but it does not work..
awk '{if(NR%2==1~/"plus"/){print;getline;print} else if (NR%2==1~/"minus"/){system("echo "$0" | rev | tr ATCGatcg TAGCtagc")} else {print;getline;print}}' file
Any help?
This gnu-awk should work for you:
awk '
p {
cmd = "rev <<< \047" $0 "\047 | tr ATCGatcg TAGCtagc"
if ((cmd |& getline var) > 0)
$0 = var
}
{
p = /^>/ && /-minus/
} 1' file
>AX-89948491-minus
AATCTACTAAATGTGTTAG
>AX-89940152-plus
cgtcattcagggcaggtggggcaaaA
>AX-89922107-plus
TTATAACTTGTGTATGCTCTCAGGCT
Awk is a tool to manipulate text, not a tool to sequence calls to other tools. The latter is what a shell is for. There are times when you need to call other tools from awk but not when it's simple text manipulation like reversing and translating characters in a string as you want to do.
Using any awk in any shell on every Unix box without spawning a subshell once per target input line to call other Unix tools (including the non-POSIX-defined rev which won't exist on some Unix boxes):
$ cat tst.awk
BEGIN {
split("ATCGatcg TAGCtagc",tmp)
for (i=1; i<=length(tmp[1]); i++) {
tr[substr(tmp[1],i,1)] = substr(tmp[2],i,1)
}
}
f {
out = ""
for (i=1; i<=length($0); i++) {
char = substr($0,i,1)
out = (char in tr ? tr[char] : char) out
}
$0 = out
f = 0
}
/^>.*minus/ { f=1 }
{ print }
$ awk -f tst.awk file
>AX-89948491-minus
AATCTACTAAATGTGTTAG
>AX-89940152-plus
cgtcattcagggcaggtggggcaaaA
>AX-89922107-plus
TTATAACTTGTGTATGCTCTCAGGCT
I'd use perl, as it has builtin reverse and tr functions:
perl -lpe '
if (/^>/) {$rev = /minus/; next}
if ($rev) {$_ = reverse; tr/ATCGatcg/TAGCtagc/}
' file
>AX-89948491-minus
AATCTACTAAATGTGTTAG
>AX-89940152-plus
cgtcattcagggcaggtggggcaaaA
>AX-89922107-plus
TTATAACTTGTGTATGCTCTCAGGCT
I want to merge some rows in a file so that the lines should contain 22 fields seperated by ~.
Input file looks like this.
200269~7414~0027001~VALTD~OM3500~963~~~~716~423~2523~Y~UN~~2423~223~~~~A~200423
2269~744~2701~VALD~3500~93~~~~76~423~223~Y~
UN~~243~223~~~~A~200123
209~7414~7001~VALD~OM30~963~~~
~76~23~2523~Y~UN~~223~223~~~~A~123
and So on
First line looks fine. 2nd and 3rd line needs to be merged so that it becomes a line with 22 fields. 4th,5th and 6th line should be merged and so on.
Expected output:
200269~7414~0027001~VALTD~OM3500~963~~~~716~423~2523~Y~UN~~2423~223~~~~A~200423
2269~744~2701~VALD~3500~93~~~~76~423~223~Y~UN~~243~223~~~~A~200123
209~7414~7001~VALD~OM30~963~~~~76~23~2523~Y~UN~~223~223~~~~A~123
The file has 10 GB data but the code I wrote (used while loop) is taking too much time to execute . How to solve this problem using awk/sed command?
Code Used:
IFS=$'\n'
set -f
while read line
do
count_tild=`echo $line | grep -o '~' | wc -l`
if [ $count_tild == 21 ]
then
echo $line
else
checkLine
fi
done < file.txt
function checkLine
{
current_line=$line
read line1
next_line=$line1
new_line=`echo "$current_line$next_line"`
count_tild_mod=`echo $new_line | grep -o '~' | wc -l`
if [ $count_tild_mod == 21 ]
then
echo "$new_line"
else
line=$new_line
checkLine
fi
}
Using only the shell for this is slow, error-prone, and frustrating. Try Awk instead.
awk -F '~' 'NF==1 { next } # Hack; see below
NF<22 {
for(i=1; i<=NF; i++) f[++a]=$i }
a==22 {
for(i=1; i<=a; ++i) printf "%s%s", f[i], (i==22 ? "\n" : "~")
a=0 }
NF==22
END {
if(a) for(i=1; i<=a; i++) printf "%s%s", f[i], (i==a ? "\n" : "~") }' file.txt>file.new
This assumes that consecutive lines with too few fields will always add up to exactly 22 when you merge them. You might want to check this assumption (or perhaps accept this answer and ask a new question with more and better details). Or maybe just add something like
a>22 {
print FILENAME ":" FNR ": Too many fields " a >"/dev/stderr"
exit 1 }
The NF==1 block is a hack to bypass the weirdness of the completely empty line 5 in your sample.
Your attempt contained multiple errors and inefficiencies; for a start, try http://shellcheck.net/ to diagnose many of them.
$ cat tst.awk
BEGIN { FS="~" }
{
sub(/^[0-9]+\./,"")
gsub(/[[:space:]]+/,"")
$0 = prev $0
if ( NF == 22 ) {
print ++cnt "." $0
prev = ""
}
else {
prev = $0
}
}
$ awk -f tst.awk file
1.200269~7414~0027001~VALTD~OM3500~963~~~~716~423~2523~Y~UN~~2423~223~~~~A~200423
2.2269~744~2701~VALD~3500~93~~~~76~423~223~Y~UN~~243~223~~~~A~200123
3.209~7414~7001~VALD~OM30~963~~~~76~23~2523~Y~UN~~223~223~~~~A~123
The assumption above is that you never have more than 22 fields on 1 line nor do you exceed 22 in any concatenation of the contiguous lines that are each less than 22 fields, just like you show in your sample input.
You can try this awk
awk '
BEGIN {
FS=OFS="~"
}
{
while(NF<22) {
if(NF==0)
break
a=$0
getline
$0=a$0
}
if(NF!=0)
print
}
' infile
or this sed
sed -E '
:A
s/((.*~){21})([^~]*)/\1\3/
tB
N
bA
:B
s/\n//g
' infile
With this script every field is printed out according to the longest word of the current file, but needs to have a line break every file. How can this be achieved?
awk 'BEGIN{ORS="\n"}FNR=NR{a[i++]=$0; if(length($0) > length(max)) max=$0;l=length(max)} END{ for(j=1; j<=i;j++) printf("%-"(l+1)"s,",a[j-1])}' file1 file2 >outfile
file1
HELLO
WORLD
SOUTH IS
WARM
NORTH IS
COLD
file2
HELLO
WORLD
SOUTH
WARM
NORTH
COLD
output
HELLO ,WORLD ,SOUTH IS ,WARM ,NORTH IS ,COLD
HELLO ,WORLD ,SOUTH ,WARM ,NORTH ,COLD
It's not entirely clear what you are asking for, but perhaps you just want:
FNR==1 {print "\n"}
Which will print a newline whenever it starts reading the first line of a file. Make sure this pattern/action is before any others so that the newline prints before any other action prints anything for the first line of the current file. (This does not appear to apply in your case, since no such action exists.)
Took me some time, got it solved with this script.
awk '{ NR>1 && FNR==1 ? l=length($0) && a[i++]= "\n" $0 : a[i++]=$0 }
{if(NR>1 && FNR==1) for(e=(i-c);e<=(i-1);e++) b[e]=d ;c=FNR; d=l }
{ if( length($0) > l) l=length($0)+1 }
END{for(e=(i-c+1);e<=i;e++) b[e]=d; for(j=1;j<=i;j++) printf("%-"b[j]"s,",a[j-1] )}' infiles* >outfile
#!/usr/bin/awk -f
function beginfile (file) {
split("", a)
max = 0
delim = ""
}
function endfile (file) {
for (i = 1; i <= lines; i++) {
printf "%s%-*s", delim, max, a[i]
delim = " ,"
}
printf "\n"
}
FILENAME != _oldfilename \
{
if (_oldfilename != "")
endfile(_oldfilename)
_oldfilename = FILENAME
beginfile(FILENAME)
}
END { endfile(FILENAME) }
{
len = length($0)
if (len > max) {
max = len
}
a[FNR] = $0
lines = FNR
}
To run it:
chmod u+x filename
./filename file1 file2
Note that in gawk you can do delete a instead of split("", a). GAWK 4 has builtin BEGINFILE and ENDFILE.
I am converting a CSV file into a table format, and I wrote an AWK script and saved it as my.awk. Here is the my script:
#AWK for test
awk -F , '
BEGIN {
aa = 0;
}
{
hdng = "fname,lname,salary,city";
l1 = length($1);
l13 = length($13);
if ((l1 > 2) && (l13 == 0)) {
fname = substr($1, 2, 1);
l1 = length($3) - 4;
lname = substr($3, l1, 4);
processor = substr($1, 2);
#printf("%s,%s,%s,%s\n", fname, lname, salary, $0);
}
if ($0 ~ ",,,,")
aa++
else if ($0 ~ ",fname")
printf("%s\n", hdng);
else if ((l1 > 2) && (l13 == 0)) {
a++;
}
else {
perf = $11;
if (perf ~/^[0-9\.\" ]+$/)
type = "num"
else
type = "char";
if (type == "num")
printf("Mr%s,%s,%s,%s,,N,N,,\n", $0,fname,lname, city);
}
}
END {
} ' < life.csv > life_out.csv*
How can I run this script on a Unix server? I tried to run this my.awk file by using this command:
awk -f my.awk life.csv
The file you give is a shell script, not an awk program. So, try sh my.awk.
If you want to use awk -f my.awk life.csv > life_out.cs, then remove awk -F , ' and the last line from the file and add FS="," in BEGIN.
If you put #!/bin/awk -f on the first line of your AWK script it is easier. Plus editors like Vim and ... will recognize the file as an AWK script and you can colorize. :)
#!/bin/awk -f
BEGIN {} # Begin section
{} # Loop section
END{} # End section
Change the file to be executable by running:
chmod ugo+x ./awk-script
and you can then call your AWK script like this:
`$ echo "something" | ./awk-script`
Put the part from BEGIN....END{} inside a file and name it like my.awk.
And then execute it like below:
awk -f my.awk life.csv >output.txt
Also I see a field separator as ,. You can add that in the begin block of the .awk file as FS=","