How to find the position of a word in a list of strings to use in awk? - awk

Morning guys,
I often have files that I want to grep+awk but that have a lot of fields.
I'm interested in one in particular (so I'd like to awk '{print $i}'), but how can I know what position (i.e. "i" here) my column is at, other than counting it manually?
With files of around 70 fields, I'd be saving a lot of time! :)
Thanks a lot,
[Edit]
Following Ian McGowan's suggestion, I'll look for the column number in the file's header:
head -1 myFile | awk '{for (i=1; i<=NF; i++) printf("%3d=%s\n", i, $i); }' | grep -i <the_column_Im_looking_for>
Thanks everyone,
Will1v

I was looking for a sample in your question, but here is an example:
$ cat > file
this is
the sample
$ awk '{
    for (i=1; i<=NF; i++)
        if ($i == "sample")
            print NR, i
}' file
2 2

I do this all the time when trying to profile some large text delimited file.
$ head -4 myfile
4A 1 321 537 513.30
4B 0.00
8 592 846 905.66
9B2 39 887 658.77
Transpose or pivot by looping over the columns/fields:
$ awk '{ for (i=1; i<=NF; i++) printf("%4d %3d=%s\n", NR, i, $i); }' < myfile
1 1=4A
1 2=1
1 3=321
1 4=537
1 5=513.30
2 1=4B
2 2=0.00
3 1=8
3 2=592
3 3=846
3 4=905.66
4 1=9B2
4 2=39
4 3=887
4 4=658.77
You can change the format mask to something like printf("row=%-4d col=%-3d:%s\n", NR, i, $i); and then grep for just the data you care about to find out its column, or, if you already know the column, grep for col=44 to get the 44th column.
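For example, using the myfile sample above to pull out just the fourth column:
$ awk '{ for (i=1; i<=NF; i++) printf("row=%-4d col=%-3d:%s\n", NR, i, $i); }' < myfile | grep 'col=4 '
row=1    col=4  :537
row=3    col=4  :905.66
row=4    col=4  :658.77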

xargs -n1 will print the columns one per line, so you can do this:
head -1 file | xargs -n1 | grep -n "column_name"
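For example, with the myfile sample from the answer above, you can find which field holds 321 (the number before the colon is the field position):
$ head -1 myfile | xargs -n1 | grep -n "321"
3:321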

Related

Count number of elements that match one file with another using AWK

First of all, thank you for your help. I have the file letter.txt:
A
B
C
And I have the file number.txt
B 10
D 20
A 15
C 18
E 23
A 12
B 14
I want to count how many times each letter in letter.txt appears in number.txt, so the output will be:
We have found 2 A
We have found 2 B
We have found 1 C
Total letter found: 5
I know I can do it using this code, but I want to do it generally with any file.
cat number.txt | awk 'BEGIN {A=0;B=0;C=0;count=0}; {count++};{if ($1 == "A")A++};{if ($1 == "B")B++};{if ($1 == "C")C++}END{print "We have found" A "A\n" "We have found" B "B\n" "We have found" C "C"}'
You basically want to do an inner join (easy enough to google) and group by the join key and return the count for each group.
awk 'NR==FNR { count[$1] = 0; next }
     $1 in count { ++count[$1]; ++total }
     END { for (k in count)
               print "We have found", count[k], k
           print "Total", total, "letters" }' letter.txt number.txt
All of this should be easy to find in a basic Awk tutorial, but in brief, the line number within the file FNR is equal to the overall line number NR when you are reading the first input file. We initialize count to contain the keys we want to look for. If we fall through, we are reading the second file; if we see a key we want, we increase its count. When we are done, report what we found.
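With the sample letter.txt and number.txt this should print (the for-in traversal order is unspecified, so the letter lines may appear in any order):
We have found 2 A
We have found 2 B
We have found 1 C
Total 5 letters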
Consider starting with:
$ join letter.txt <(cut -d' ' -f1 number.txt | sort) | uniq -c
2 A
2 B
1 C
Then:
$ join letter.txt <(cut -d' ' -f1 number.txt | sort) | uniq -c |
awk '
{ print "We have found", $1, $2; tot+=$1 }
END { print "Total letter found:", tot+0 }
'
We have found 2 A
We have found 2 B
We have found 1 C
Total letter found: 5
although in reality I'd probably just do it all in awk; I just wanted to show an alternative.
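A rough sketch of that all-awk route, along the lines of the NR==FNR answer above; the output wording matches the question, though the for-in order is not guaranteed:
awk 'NR==FNR { want[$1]; next }
     $1 in want { count[$1]++; total++ }
     END {
         for (k in want) print "We have found", count[k]+0, k
         print "Total letter found:", total+0
     }' letter.txt number.txt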
I don't know if you need awk; to me it's easier (but slower to run, as you can read in the comments) to use grep -c:
cat file1 | while read -r line; do
    c=$(grep -c "$line" file2 | sed 's/ //g')
    echo "We have found $c $line"
done
It's a loop where $c is the count obtained with grep -c, and sed removes any spaces from the grep -c output.
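Assuming file1 is letter.txt and file2 is number.txt from the question, the loop should print:
We have found 2 A
We have found 2 B
We have found 1 C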
grep and coreutils can also do this:
grep -f letter.txt number.txt | cut -d' ' -f1 | sort | uniq -c
Output:
2 A
2 B
1 C

Sort a file preserving the header in the first position with bash

When sorting a file, I am not preserving the header in its position:
file_1.tsv
Gene Number
a 3
u 7
b 9
sort -k1,1 file_1.tsv
Result:
a 3
b 9
Gene Number
u 7
So I am trying this code:
sed '1d' file_1.tsv | sort -k1,1 > file_1_sorted.tsv
first='head -1 file_1.tsv'
sed '1 "$first"' file_1_sorted.tsv
What I did was remove the header and sort the rest of the file, and then try to add the header back. But I am not able to perform this last part, so I would like to know how I can copy the header of the original file and insert it as the first row of the new file without replacing its actual first row.
You can do this as well:
{ head -1; sort; } < file_1.tsv
** Update **
For macOS:
{ IFS= read -r header; printf '%s\n' "$header" ; sort; } < file_1.tsv
A simpler awk:
$ awk 'NR==1{print; next} {print | "sort"}' file
$ head -1 file; tail -n +2 file | sort
Output:
Gene Number
a 3
b 9
u 7
Could you please try the following:
awk '
FNR==1{
  first=$0
  next
}
{
  val=(val?val ORS:"")$0
}
END{
  print first
  print val | "sort"
}
' Input_file
Logical explanation:
Check the condition FNR==1 to see whether it is the first line; if so, save its value in a variable and move on to the next line with next.
Then keep appending each subsequent line's value to another variable, separated by newlines, until the last line.
Finally, the END block, which executes once Input_file has been read completely, prints the saved first line and pipes the rest of the lines to the sort command.
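Run against file_1.tsv from the question, it should print:
Gene Number
a 3
b 9
u 7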
This will work using any awk, sort, and cut in any shell on every UNIX box; it will work whether the input is coming from a pipe (when you can't read it twice) or from a file (when you can), and it doesn't involve awk spawning a subshell:
awk -v OFS='\t' '{print (NR>1), $0}' file | sort -k1,1n -k2,2 | cut -f2-
The above uses awk to stick a 0 at the front of the header line and a 1 in front of the rest so you can sort by that number, then by whatever other field(s) you want to sort on, and then remove the added field again with cut. Here it is in stages:
$ awk -v OFS='\t' '{print (NR>1), $0}' file
0 Gene Number
1 a 3
1 u 7
1 b 9
$ awk -v OFS='\t' '{print (NR>1), $0}' file | sort -k1,1n -k2,2
0 Gene Number
1 a 3
1 b 9
1 u 7
$ awk -v OFS='\t' '{print (NR>1), $0}' file | sort -k1,1n -k2,2 | cut -f2-
Gene Number
a 3
b 9
u 7

Print every second consecutive field in two columns - awk

Assume the following file
#zvview.exe
#begin Present/3
77191.0000 189.320100 0 0 3 0111110 16 1
-8.072430+6-8.072430+6 77190 0 1 37111110 16 2
37 2 111110 16 3
8.115068+6 0.000000+0 8.500000+6 6.390560-2 9.000000+6 6.803440-1111110 16 4
9.500000+6 1.685009+0 1.000000+7 2.582780+0 1.050000+7 3.260540+0111110 16 5
37 2 111110 16 18
What I would like to do is print, in two columns, the fields on lines 6 and 7. This can be done using NR. The tricky part is the following: each pair of fields should go on its own row (so the odd fields form one column and the even fields the other), and an E should be added before the exponent sign, so that the output file will look like this:
8.115068E+6 0.000000E+0
8.500000E+6 6.390560E-2
9.000000E+6 6.803440E-1
9.500000E+6 1.685009E+0
1.000000E+7 2.582780E+0
1.050000E+7 3.260540E+0
From the output file you can see that in $6 I want to keep only the first 10 characters (length($6)=10).
How is it possible to do it in awk?
You can do it all in awk, but it's perhaps easier with the Unix toolset:
$ sed -n '6,7p' file | cut -c2-66 | tr ' ' '\n' | pr -2ats' '
8.115068+6 0.000000+0
8.500000+6 6.390560-2
9.000000+6 6.803440-1
9.500000+6 1.685009+0
1.000000+7 2.582780+0
1.050000+7 3.260540+0
Here is an awk-only solution for comparison:
$ awk 'NR>=6 && NR<=7{$6=substr($6,1,10);
for(i=1;i<=6;i+=2) {f[++c]=$i;s[c]=$(i+1)}}
END{for(i=1;i<=c;i++) print f[i],s[i]}' file
8.115068+6 0.000000+0
8.500000+6 6.390560-2
9.000000+6 6.803440-1
9.500000+6 1.685009+0
1.000000+7 2.582780+0
1.050000+7 3.260540+0
Perhaps a shorter version:
$ awk 'NR>=6 && NR<=7{$6=substr($6,1,10);
for(i=1;i<=6;i+=2) print $i FS $(i+1)}' file
8.115068+6 0.000000+0
8.500000+6 6.390560-2
9.000000+6 6.803440-1
9.500000+6 1.685009+0
1.000000+7 2.582780+0
1.050000+7 3.260540+0
To convert the format to standard scientific notation, you can pipe the result to sed, or embed something similar in the awk script (using gsub).
... | sed 's/[+-]/E&/g'
8.115068E+6 0.000000E+0
8.500000E+6 6.390560E-2
9.000000E+6 6.803440E-1
9.500000E+6 1.685009E+0
1.000000E+7 2.582780E+0
1.050000E+7 3.260540E+0
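A sketch of doing the same substitution inside awk instead of piping to sed, building on the shorter version above (like the sed command, it assumes none of the printed fields starts with a sign):
awk 'NR>=6 && NR<=7 {
    $6 = substr($6,1,10)
    for (i=1; i<=6; i+=2) {
        pair = $i FS $(i+1)
        gsub(/[+-]/, "E&", pair)   # insert an E before each exponent sign
        print pair
    }
}' file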
With GNU awk for FIELDWIDTHS:
$ cat tst.awk
BEGIN { FIELDWIDTHS="9 2 9 2 9 2 9 2 9 2 9 2" }
NR>5 && NR<8 {
for (i=1;i<NF;i+=4) {
print $i "E" $(i+1), $(i+2) "E" $(i+3)
}
}
$ awk -f tst.awk file
8.115068E+6 0.000000E+0
8.500000E+6 6.390560E-2
9.000000E+6 6.803440E-1
9.500000E+6 1.685009E+0
1.000000E+7 2.582780E+0
1.050000E+7 3.260540E+0
If you really want to get rid of the leading blanks then there are various ways to do it (the simplest being gsub(/ /,"",$<field number>) on the relevant fields), but I left them in because the above allows your output to line up properly if/when your numbers start with a -, like they do on line 4 of your sample input.
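If you do want them stripped, a minimal sketch of that tweak to the same GNU awk script:
BEGIN { FIELDWIDTHS="9 2 9 2 9 2 9 2 9 2 9 2" }
NR>5 && NR<8 {
    for (i=1;i<NF;i+=4) {
        gsub(/ /,"",$i); gsub(/ /,"",$(i+2))   # drop the leading blanks from the mantissas
        print $i "E" $(i+1), $(i+2) "E" $(i+3)
    }
}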
If you don't have GNU awk, get it as you're missing a LOT of extremely useful functionality.
I tried to combine @karafka's answer with substr, and the following does the trick:
awk 'NR>=6 && NR<=7{$6=substr($6,1,10);for(i=1;i<=6;i+=2) print substr($i,1,8) "E" substr($i,9) FS substr($(i+1),1,8) "E" substr($(i+1),9)}' file
and the output is
8.115068E+6 0.000000E+0
8.500000E+6 6.390560E-2
9.000000E+6 6.803440E-1
9.500000E+6 1.685009E+0
1.000000E+7 2.582780E+0
1.050000E+7 3.260540E+0

Cut column from multiple files with the same name in different directories and paste into one

I have multiple files with the same name (3pGtoA_freq.txt), but all located in different directories.
Each file looks like this:
pos 5pG>A
1 0.162421557770395
2 0.0989643268124281
3 0.0804131316857248
4 0.0616563298066399
5 0.0577551761714493
6 0.0582450832072617
7 0.0393129770992366
8 0.037037037037037
9 0.0301016419077404
10 0.0327510917030568
11 0.0301598837209302
12 0.0309050772626932
13 0.0262089331856774
14 0.0254612546125461
15 0.0226130653266332
16 0.0206971677559913
17 0.0181280059193489
18 0.0243993993993994
19 0.0181347150259067
20 0.0224429727740986
21 0.0175690211545357
22 0.0183916336098089
23 0.0196078431372549
24 0.0187983781791375
25 0.0173192771084337
I want to cut column 2 from each file and paste the columns side by side into one file.
I tried running:
for s in results_Sample_*_hg19/results_MapDamage_Sample_*/results_Sample_*_bwa_LongSeed_sorted_hg19_noPCR/3pGtoA_freq.txt; do awk '{print $2}' $s >> /home/users/istolarek/aDNA/3pGtoA_all; done
but it's not pasting the columns next to each other.
Also, I wanted to name each column after the '*' part, which is the only string that changes in the path.
Any help with that?
for i in $(find your_file_dir -name 3pGtoA_freq.txt); do awk '{print $2 >> "NewFile"}' "$i"; done
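As a hedged alternative sketch, paste can put the value columns side by side (assuming bash; this does not do the per-column '*' naming):
files=(results_Sample_*_hg19/results_MapDamage_Sample_*/results_Sample_*_bwa_LongSeed_sorted_hg19_noPCR/3pGtoA_freq.txt)
# paste joins matching lines from every file; keep the first pos column and each file's second field
paste "${files[@]}" | awk '{ printf "%s", $1; for (i=2; i<=NF; i+=2) printf "\t%s", $i; print "" }'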
I would do this by processing all files in parallel in awk:
awk 'BEGIN{printf "pos ";
           for(i=1;i<ARGC;++i)
               printf "%-19s",gensub("^results_Sample_","",1,gensub("_hg19.*","",1,ARGV[i]));
           printf "\n";
           while(getline<ARGV[1]){
               printf "%-4s%-19s",$1,$2;
               for(i=2;i<ARGC;++i){
                   getline<ARGV[i];
                   printf "%-19s",$2}
               printf "\n"}}{exit}' \
results_Sample_*_hg19/results_MapDamage_Sample_*/results_Sample_*_bwa_LongSeed_sorted_hg19_noPCR/3pGtoA_freq.txt
If your awk doesn't have gensub (I'm using cygwin), you can remove the first four lines (printf-printf); headers won't be printed in that case.

Extracting block of data from a file

I have a problem, which surely can be solved with an awk one-liner.
I want to split an existing data file, which consists of blocks of data, into separate files.
The datafile has the following form:
1 1
1 2
1 3
2 1
2 2
2 3
3 1
3 2
3 3
And I want to store every single block of data in a separate file, named - for example - "1.dat", "2.dat", "3.dat", ...
The problem is that the blocks don't have a fixed number of lines; they are just delimited by two "new lines".
Thanks in advance,
Jürgen
This should get you started:
awk '{ print > (++i ".dat") }' RS= file.txt
If by two "new lines" you mean two newline characters:
awk '{ print > (++i ".dat") }' RS="\n\n" file.txt
See how the results differ? Setting a null RS (i.e. the first example) is probably what you're looking for.
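A quick way to see how the null-RS (paragraph mode) split groups the sample data, each block becoming one record of six fields:
$ awk -v RS= '{ print "record", NR, "has", NF, "fields" }' file.txt
record 1 has 6 fields
record 2 has 6 fields
record 3 has 6 fields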
Another approach:
awk 'NF != 0 {print > ($1 ".dat")}' file.txt