Add file name as a new column with awk - awk

First of all, the existing questions didn't solve my problem, which is why I am asking again.
I have two text files. temp.txt:
adam 12
george 15
thomas 20
and demo.txt
mark 8
richard 11
james 18
I want to combine them and add a 3rd column as their file names without extension, like this:
adam 12 temp
george 15 temp
thomas 20 temp
mark 8 demo
richard 11 demo
james 18 demo
I used this script:
for i in $(ls); do name=$(basename -s .txt $i)| awk '{OFS="\t";print $0, $name} ' $i; done
But it yields the following table:
mark 8 mark 8
richard 11 richard 11
james 18 james 18
adam 12 adam 12
george 15 george 15
thomas 20 thomas 20
I don't understand why it gives the name variable as the whole table.
Thanks in advance.

Awk has no access to Bash's variables, or vice versa. Inside the Awk script, name is undefined, so $name gets interpreted as $0.
Also, don't use ls in scripts, and quote your shell variables.
Finally, the assignment of name does not print anything, so piping its output to Awk makes no sense.
for i in ./*; do
name=$(basename -s .txt "$i")
awk -v name="$name" '{OFS="\t";print $0, name}' "$i"
done
As such, the basename calculation could easily be performed natively in Awk, but I leave that as an exercise. (Hint: sub(regex, "", FILENAME))
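A minimal sketch of why the original script repeated the whole line, using a one-line stand-in for temp.txt (the variable names unset_case, still_whole and set_case are only for illustration). In Awk, $ is the field operator rather than variable expansion, so $name evaluates name as a number (0 for an unset variable or a non-numeric string) and returns that field.

```shell
# An unset awk variable is "" (numerically 0), so $name is $0: the whole line.
unset_case=$(printf 'adam 12\n' | awk '{ print $name }')

# Passing the value with -v is not enough if the $ stays:
# "temp" is non-numeric, converts to 0, and $name is still $0.
still_whole=$(printf 'adam 12\n' | awk -v name="temp" '{ print $name }')

# The bare identifier reads the string that -v assigned.
set_case=$(printf 'adam 12\n' | awk -v name="temp" '{ print $0, name }')

printf '%s\n' "$unset_case" "$still_whole" "$set_case"
```

The first two lines of output are the unchanged input line; only the third has the file name appended.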

awk has a FILENAME variable whose value is the path of the file being processed, and a FNR variable whose value is the current line number in the file;
so, at FNR == 1 you can process FILENAME and store the result in a variable that you'll use afterwards:
awk -v OFS='\t' '
FNR == 1 {
basename = FILENAME
sub(".*/", "", basename) # strip from the start up to the last "/"
sub(/\.[^.]*$/, "", basename) # strip from the last "." up to the end
}
{ print $0, basename }
' ./path/temp.txt ./path/demo.txt
adam 12 temp
george 15 temp
thomas 20 temp
mark 8 demo
richard 11 demo
james 18 demo

Using BASH:
for i in temp.txt demo.txt ; do while read -r a b ; do printf "%s\t%s\t%s\n" "$a" "$b" "${i%%.*}" ; done <"$i" ; done
Output:
adam 12 temp
george 15 temp
thomas 20 temp
mark 8 demo
richard 11 demo
james 18 demo
For each source file read each line and use printf to output tab-delimited columns including the current source file name without extension via bash parameter expansion.
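One detail worth knowing about the parameter expansion used here: %% removes the longest suffix matching the pattern, while a single % removes the shortest. For temp.txt and demo.txt the two give the same result, but they differ when a name contains more than one dot. A small sketch with a hypothetical file name:

```shell
f="sample.v2.txt"     # hypothetical name with two dots, not from the question
longest="${f%%.*}"    # strips from the FIRST dot to the end  -> sample
shortest="${f%.*}"    # strips from the LAST dot to the end   -> sample.v2
printf '%s\n' "$longest" "$shortest"
```

If only the final extension should be dropped, ${i%.*} is probably the safer choice.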

First, you need to unmask $name which is inside the single quotes, so does not get replaced by the filename from the shell. After you do that, you need to add double quotes around $name so that awk sees that as a string:
for i in $(ls); do name=$(basename -s .txt $i); awk '{OFS="\t";print $0, "'$name'"} ' $i; done

Related

How to extract multiple strings with single regex expression in Awk

I have the following strings:
Mike has XXX cats and XXXXX dogs.
MikehasXXXcatsandXXXXXdogs
I would like to replace Xs with the digits corresponding to the number of Xs:
I tried:
awk '{ match($0, /[X]+/);
a = length(substr($0, RSTART, RLENGTH));
gsub(/[X]+/, a) }1'
But it captures only the first match.
Expected output:
Mike has 3 cats and 5 dogs.
Mikehas3catsand5dogs
With your shown samples, please try the following. Written and tested in GNU awk (it should work in any awk).
awk '{for(i=1;i<=NF;i++){if($i~/^X+$/){$i=gsub(/X/,"&",$i)}}} 1' Input_file
Sample output will be:
Mike has 3 cats and 5 dogs.
Explanation: Loop through all the (space-delimited) fields and check whether the field consists only of X characters from start to end. If so, substitute every X with itself; gsub returns the number of substitutions made, which is the count of Xs, and that count is saved into the current field. The trailing 1 prints the current line.
NOTE: As per Ed's comment (under the question), in case your fields may contain characters other than X, try the following instead. It tests for any X in the field, so a field such as XXX456 is replaced entirely by its X count, 3:
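The trick relies on gsub's return value. A small sketch, assuming a one-field input made up for illustration: replacing each X with & (the matched text) leaves the record unchanged, and gsub returns how many replacements it made.

```shell
# gsub returns the number of substitutions; "&" puts each match back unchanged.
n=$(printf 'XXXXX\n' | awk '{ n = gsub(/X/, "&"); print n, $0 }')
printf '%s\n' "$n"
```

This prints the count followed by the untouched record, which is why assigning the return value to $i turns the field into its own length.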
awk '{for(i=1;i<=NF;i++){if($i~/X/){$i=gsub(/X/,"&",$i)}}} 1' Input_file
EDIT: Since the OP's samples have changed, adding this solution here, written and tested with GNU awk.
awk -v RS='X+' '{ORS=(RT ? gsub(/./,"",RT) : "")} 1' Input_file
OR
awk -v RS='X+' '{ORS=(RT ? length(RT) : "")} 1' Input_file
Output will be as follows for above code:
Mike has 3 cats and 5 dogs.
Mikehas3catsand5dogs
Another awk:
$ awk '{for(i=1;i<=NF;i++) if($i~/^X+$/) $i=length($i)}1' file
Mike has 3 cats and 5 dogs.
$ awk '{while( match($0,/X+/) ) $0=substr($0,1,RSTART-1) RLENGTH substr($0,RSTART+RLENGTH)} 1' file
Mike has 3 cats and 5 dogs.
Mikehas3catsand5dogs
If Perl is okay:
$ perl -pe 's/X+/length $&/ge' ip.txt
Mike has 3 cats and 5 dogs.
Mikehas3catsand5dogs
The e flag allows Perl code in replacement section. $& will have the matched portion.
Here's the cleanest awk-based solution I can think of:
{mawk/mawk2/gawk} 'BEGIN { FS = "^$" } /X/ {
while(match($0, /[X]+/)) { sub(/[X]+/, RLENGTH) } } 1'
The downside of this is having to use the regex engine twice for every replacement. The upside is that it avoids a bunch of substr() ops.

awk find out how many times columns two and three equal specific word

Let's say I have a names.txt file with the following
Bob Billy Billy
Bob Billy Joe
Bob Billy Billy
Joe Billy Billy
Using awk, I want to find out how many times $2 = Billy while $3 = Billy. In this case my desired output would be 3.
Also, I'm testing this on a mac if that matters.
You first need to test $2==$3 then test that one of those equals "Billy". Increment a counter and then print the result at the end:
$ awk '$2==$3 && $2=="Billy"{cnt++} END{print cnt+0}' names.txt
3
Or, you could almost write just what you said:
$ awk '$2=="Billy" && $3=="Billy" {cnt++} END{print cnt+0}' names.txt
3
And if you want to use a variable so you don't need to type it several times:
$ awk -v name='Billy' '$2==name && $3==name {cnt++}
END{printf "Found \"%s\" %d times\n", name, cnt+0}' names.txt
Found "Billy" 3 times
Or, you could collect them all up and report what was found:
$ awk '{cnts[$2 "," $3]++}
END{for (e in cnts) print e ": " cnts[e]}' names.txt
Billy,Billy: 3
Billy,Joe: 1
You may also consider using grep to do that:
$ grep -c "\sBilly\sBilly" names.txt
3
-c: print a count of matching lines
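Since the question mentions a Mac, here is a hedged variant that avoids \s, which is an extension rather than part of POSIX regular expressions, and that also anchors the match so that only the second and third columns are tested. It is a sketch that recreates the sample names.txt in a temporary directory; the tmp and count variables are only for illustration.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'Bob Billy Billy' 'Bob Billy Joe' \
              'Bob Billy Billy' 'Joe Billy Billy' > "$tmp/names.txt"

# Field 1 is any run of non-blanks; fields 2 and 3 must both be Billy.
count=$(grep -Ec '^[^[:space:]]+[[:space:]]+Billy[[:space:]]+Billy([[:space:]]|$)' "$tmp/names.txt")
printf '%s\n' "$count"

rm -rf "$tmp"
```

The POSIX character class [[:space:]] should behave the same in both BSD and GNU grep.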

Print every second consecutive field in two columns - awk

Assume the following file
#zvview.exe
#begin Present/3
77191.0000 189.320100 0 0 3 0111110 16 1
-8.072430+6-8.072430+6 77190 0 1 37111110 16 2
37 2 111110 16 3
8.115068+6 0.000000+0 8.500000+6 6.390560-2 9.000000+6 6.803440-1111110 16 4
9.500000+6 1.685009+0 1.000000+7 2.582780+0 1.050000+7 3.260540+0111110 16 5
37 2 111110 16 18
What I would like to do is print, in two columns, the fields starting at line 6. Selecting the lines can be done using NR. The tricky part is the following: every second field should go in the same column (odd-numbered fields in the first column, even-numbered fields in the second), with an E added before the exponent sign, so that the output file looks like this:
8.115068E+6 0.000000E+0
8.500000E+6 6.390560E-2
9.000000E+6 6.803440E-1
9.500000E+6 1.685009E+0
1.000000E+7 2.582780E+0
1.050000E+7 3.260540E+0
As the output shows, I want to keep only the first 10 characters of $6.
How is it possible to do it in awk?
You can do it all in awk, but it is perhaps easier with the Unix toolset:
$ sed -n '6,7p' file | cut -c2-66 | tr ' ' '\n' | pr -2ats' '
8.115068+6 0.000000+0
8.500000+6 6.390560-2
9.000000+6 6.803440-1
9.500000+6 1.685009+0
1.000000+7 2.582780+0
1.050000+7 3.260540+0
Here is an awk-only solution for comparison:
$ awk 'NR>=6 && NR<=7{$6=substr($6,1,10);
for(i=1;i<=6;i+=2) {f[++c]=$i;s[c]=$(i+1)}}
END{for(i=1;i<=c;i++) print f[i],s[i]}' file
8.115068+6 0.000000+0
8.500000+6 6.390560-2
9.000000+6 6.803440-1
9.500000+6 1.685009+0
1.000000+7 2.582780+0
1.050000+7 3.260540+0
Perhaps a shorter version:
$ awk 'NR>=6 && NR<=7{$6=substr($6,1,10);
for(i=1;i<=6;i+=2) print $i FS $(i+1)}' file
8.115068+6 0.000000+0
8.500000+6 6.390560-2
9.000000+6 6.803440-1
9.500000+6 1.685009+0
1.000000+7 2.582780+0
1.050000+7 3.260540+0
To convert the format to standard scientific notation, you can pipe the result to
sed, or embed something similar in the awk script (using gsub):
... | sed 's/[+-]/E&/g'
8.115068E+6 0.000000E+0
8.500000E+6 6.390560E-2
9.000000E+6 6.803440E-1
9.500000E+6 1.685009E+0
1.000000E+7 2.582780E+0
1.050000E+7 3.260540E+0
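A sketch of the embedded variant, under the same assumptions as the awk answers above (data on lines 6 and 7, six whitespace-separated fields, and only the first 10 characters of $6 wanted). The function name to_sci is made up for illustration. It uses sub anchored at the end of the field rather than a global gsub, so that a leading minus sign on the mantissa, as on line 4 of the sample, would not get an E as well.

```shell
to_sci() {
    awk 'NR >= 6 && NR <= 7 {
        $6 = substr($6, 1, 10)             # drop the record tag glued onto field 6
        for (i = 1; i <= 6; i++)
            sub(/[+-][0-9]+$/, "E&", $i)   # put E before the exponent sign only
        for (i = 1; i <= 6; i += 2)
            print $i, $(i + 1)             # two fields per output line
    }' "$@"
}
```

Usage would be to_sci file, or the function can read from standard input when no file is given.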
With GNU awk for FIELDWIDTHS:
$ cat tst.awk
BEGIN { FIELDWIDTHS="9 2 9 2 9 2 9 2 9 2 9 2" }
NR>5 && NR<8 {
for (i=1;i<NF;i+=4) {
print $i "E" $(i+1), $(i+2) "E" $(i+3)
}
}
$ awk -f tst.awk file
8.115068E+6 0.000000E+0
8.500000E+6 6.390560E-2
9.000000E+6 6.803440E-1
9.500000E+6 1.685009E+0
1.000000E+7 2.582780E+0
1.050000E+7 3.260540E+0
If you really want to get rid of the leading blanks then there's various ways to do it (simplest being gsub(/ /,"",$<field number>) on the relevant fields) but I left them in because the above allows your output to line up properly if/when your numbers start with a -, like they do on line 4 of your sample input.
If you don't have GNU awk, get it as you're missing a LOT of extremely useful functionality.
I tried to combine @karafka's answer with substr, and the following does the trick:
awk 'NR>=6 && NR<=7{$6=substr($6,1,10);for(i=1;i<=6;i+=2) print substr($i,1,8) "E" substr($i,9) FS substr($(i+1),1,8) "E" substr($(i+1),9)}' file
and the output is
8.115068E+6 0.000000E+0
8.500000E+6 6.390560E-2
9.000000E+6 6.803440E-1
9.500000E+6 1.685009E+0
1.000000E+7 2.582780E+0
1.050000E+7 3.260540E+0

Cut column from multiple files with the same name in different directories and paste into one

I have multiple files with the same name (3pGtoA_freq.txt), but all located in different directories.
Each file looks like this:
pos 5pG>A
1 0.162421557770395
2 0.0989643268124281
3 0.0804131316857248
4 0.0616563298066399
5 0.0577551761714493
6 0.0582450832072617
7 0.0393129770992366
8 0.037037037037037
9 0.0301016419077404
10 0.0327510917030568
11 0.0301598837209302
12 0.0309050772626932
13 0.0262089331856774
14 0.0254612546125461
15 0.0226130653266332
16 0.0206971677559913
17 0.0181280059193489
18 0.0243993993993994
19 0.0181347150259067
20 0.0224429727740986
21 0.0175690211545357
22 0.0183916336098089
23 0.0196078431372549
24 0.0187983781791375
25 0.0173192771084337
I want to cut column 2 from each file and paste column by column in one file
I tried running:
for s in results_Sample_*_hg19/results_MapDamage_Sample_*/results_Sample_*_bwa_LongSeed_sorted_hg19_noPCR/3pGtoA_freq.txt; do awk '{print $2}' $s >> /home/users/istolarek/aDNA/3pGtoA_all; done
but it's not pasting the columns next to each other.
Also I wanted to name each column by the '*', which is the only string that changes in path.
Any help with that?
for i in $(find your_file_dir -name 3pGtoA_freq.txt);do awk '{print $2>>"NewFile"}' $i; done
I would do this by processing all files in parallel in awk:
awk 'BEGIN{printf "pos ";
for(i=1;i<ARGC;++i)
printf "%-19s",gensub("^results_Sample_","",1,gensub("_hg19.*","",1,ARGV[i]));
printf "\n";
while(getline<ARGV[1]){
printf "%-4s%-19s",$1,$2;
for(i=2;i<ARGC;++i){
getline<ARGV[i];
printf "%-19s",$2}
printf "\n"}}{exit}' \
results_Sample_*_hg19/results_MapDamage_Sample_*/results_Sample_*_bwa_LongSeed_sorted_hg19_noPCR/3pGtoA_freq.txt
If your awk doesn't have gensub (I'm using cygwin), you can remove the first four lines (printf-printf); headers won't be printed in that case.
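An alternative sketch that avoids both gensub and getline, so it should run in any POSIX awk. It reads the files one after another and appends each file's second column to a per-row string. It assumes that every file has the same number of rows, that the first line of each file is a header, and that paths are given relative to the directory holding the results_Sample_*_hg19 directories; the function name paste_col2 is made up for illustration.

```shell
paste_col2() {
    awk '
        FNR == 1 {                              # header line of each input file
            label = FILENAME
            sub(/^results_Sample_/, "", label)  # keep only the sample part
            sub(/_hg19.*/, "", label)
            head = head "\t" label
            next
        }
        {
            pos[FNR] = $1                       # row label from column 1
            row[FNR] = row[FNR] "\t" $2         # append this file column 2
            if (FNR > max) max = FNR
        }
        END {
            print "pos" head
            for (i = 2; i <= max; i++) print pos[i] row[i]
        }
    ' "$@"
}
```

It would be called with the same glob as in the question, for example paste_col2 results_Sample_*_hg19/results_MapDamage_Sample_*/results_Sample_*_bwa_LongSeed_sorted_hg19_noPCR/3pGtoA_freq.txt, and produces tab-separated output.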

delete line if string is matched and the next line contains another string

got an annoying text manipulation problem, i need to delete a line in a file if it contains a string but only if the next line also contains another string. for example, i have these lines:
john paul
george
john paul
12
john paul
i want to delete any line containing 'john paul' if it is immediately followed by a line that contains 'george', so it would return:
george
john paul
12
john paul
not sure how to grep or sed this. if anyone could lend a hand that'd be great!
This might work for you (GNU sed):
sed '/john paul/{$!N;/\n.*george/!P;D}' file
If the line contains john paul read the next line and if it contains george don't print the first line.
N.B. If the line containing george contains john paul it will be checked also.
awk 'NR > 1 && !(/george/ && p ~ /john paul/) { print p } { p = $0 } END { print }' file
Output:
george
john paul
12
john paul
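The same one-line-lookback idea can be sketched with the two patterns passed in as parameters. The function name drop_before and the awk variables first, second and prev are made up for illustration; the patterns are treated as regular expressions, and awk -v interprets backslash escapes in the values, so patterns containing backslashes would need extra care.

```shell
drop_before() {
    # $1: regex for the line to drop, $2: regex the following line must match
    first=$1
    second=$2
    shift 2
    awk -v first="$first" -v second="$second" '
        # print the previous line unless it matched "first"
        # and the current line matches "second"
        NR > 1 && !(prev ~ first && $0 ~ second) { print prev }
        { prev = $0 }
        END { if (NR) print prev }              # the last line is always kept
    ' "$@"
}
```

For the sample in the question, drop_before 'john paul' george file should give the same four lines as above.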
This awk should do:
cat file
john paul
george
john paul
12
john paul
hans
george
awk 'f~/john paul/ && /george/ {f=$0;next} NR>1 {print f} {f=$0} END {print}' file
george
john paul
12
john paul
hans
george
This will only delete the name above george if it is john paul.
Here is a more general version:
If the current line matches a pattern and the previous line was exactly "john paul", do nothing; otherwise, print the previous line. (Change the ^[a-zA-Z]+$ part to george if you only want george to be detected.)
awk 'NR>1 && !(/^[a-zA-Z]+$/ && previous ~/^john paul$/){print previous}{previous=$0}END{print}'
In your example:
$> echo 'john paul
george
john paul
12
john paul' |awk 'NR>1 && !(/^[a-zA-Z]+$/ && previous ~/^john paul$/){print previous}{previous=$0}END{print}'
george
john paul
12
john paul
If the following line contains anything other than letters (numbers, for example), the previous line is printed; otherwise it is not:
$> echo 'john paul
george 234
john paul
auie
john paul' |awk 'NR>1 && !(/^[a-zA-Z]+$/ && previous ~/^john paul$/){print previous}{previous=$0}END{print}'
john paul
george 234
auie
john paul
The sed solution is short: two commands and lots of comments ;)
/john paul/ {
# read the next line and append to pattern space
N
# and then if we find "george" in that next line,
# only retain the last line in the pattern space
s/.*\n\(.*george\)/\1/
# and finally print the pattern space,
# as we don't use the -n option
}
You put the above in some sedscript file and then run:
sed -f sedscript your_input_file
Just to throw some Perl into the mix:
perl -ne 'print $p unless /george/ && $p =~ /john paul/; $p = $_ }{ print $p' file
Print the previous line, unless the current line matches /george/ and the previous line $p matched /john paul/. Set $p to the value of the previous line. }{ effectively creates an END block, so the last line is also printed after the file has been read.
You might have to change the \r\n to \n or to \r, other than that this should work:
<?php
$string = "john paul
george
john paul
12
john paul";
$string = preg_replace("#john paul\r\n(george)#i",'$1',$string);
echo $string;
?>
You could also read the file into the variable and then overwrite the file with the result afterwards.
With GNU awk for multi-char RS:
$ gawk -vRS='^$' '{gsub(/john paul\ngeorge/,"george")}1' file
george
john paul
12
john paul
or if there's more on each line than your sample input shows just change the RE to suit and use gensub():
$ gawk -vRS='^$' '{$0 = gensub(/[^\n]*john paul[^\n]*\n([^\n]*george[^\n]*)/,"\\1","g")}1' file
george
john paul
12
john paul