How to restrict an area in a file with awk? - awk

I have a long text file and I need to run computations on a table that sits inside it, so I am trying to restrict the area and print only the table I need. The area I care about looks like:
Sums of squares of residuals for separate curves, including only individual weights
Curve No. of obs. Sum of squares
1 82 0.20971070
2 7200 13659.50038631
3 7443 15389.87972458
4 5843 10510.37305696
5 290 49918.40634886
6 1376 49974.57509390
7 694 8340.44771461
8 545 2476.43037281
9 349 1425.69687357
1111 1111 0101110 01110 11001 01111 11110 0 1 1 0.100D-02
UNWEIGHTED OBSERVATIONAL EQUATIONS
No. Curve Input Param. Correction Output Param. Standard Deviation
9 0 39.6398000000 0.0796573846 39.7194573846 0.6864389887
I tried this, but the whole file is printed:
/Curve/ { in_f_format=0; next }
/UNWEIGHTED/ { in_f_format=1; next }
{print}
Desired output:
1 82 0.20971070
2 7200 13659.50038631
3 7443 15389.87972458
4 5843 10510.37305696
5 290 49918.40634886
6 1376 49974.57509390
7 694 8340.44771461
8 545 2476.43037281
9 349 1425.69687357

Update: according to your desired output, you can use this:
awk '/Curve/ { in_f_format=1; next } /^[[:space:]]*$/ { in_f_format=0; next } in_f_format'
If you only want the content between the two patterns, changing your code to this would work:
/Curve/ { in_f_format=1; next }
/UNWEIGHTED/ { in_f_format=0; next }
in_f_format {print}
The expressions before the blocks are conditions (patterns); when a condition evaluates to true, the block after it is executed.
A block without a condition is executed for every record (unless skipped by next or something similar).
Additionally, a condition without a block has {print} implied, so the block can be omitted here.
For example, file with the content you provided:
$ awk '/Curve/ { in_f_format=1; next } /UNWEIGHTED/ { in_f_format=0; next } in_f_format' file
1 82 0.20971070
2 7200 13659.50038631
3 7443 15389.87972458
4 5843 10510.37305696
5 290 49918.40634886
6 1376 49974.57509390
7 694 8340.44771461
8 545 2476.43037281
9 349 1425.69687357
1111 1111 0101110 01110 11001 01111 11110 0 1 1 0.100D-02
Another example, printing from the Curve title line up to (but not including) the empty line:
$ awk '/Curve/ { in_f_format=1; } /^[[:space:]]*$/ { in_f_format=0; next } in_f_format' file
Curve No. of obs. Sum of squares
1 82 0.20971070
2 7200 13659.50038631
3 7443 15389.87972458
4 5843 10510.37305696
5 290 49918.40634886
6 1376 49974.57509390
7 694 8340.44771461
8 545 2476.43037281
9 349 1425.69687357
Unassigned variables default to 0 or the empty string, both of which evaluate to false.
The [[:space:]]* also matches lines that contain only whitespace characters; if you want a strictly empty line, use /^$/, where ^ matches the beginning of the line and $ the end.
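A quick way to see the difference, using a throwaway three-line input whose middle line holds only spaces:

```shell
# A line containing only spaces matches /^[[:space:]]*$/ but not /^$/
printf 'a\n   \nb\n' | awk '/^[[:space:]]*$/ { n++ } END { print n+0 }'   # prints 1
printf 'a\n   \nb\n' | awk '/^$/ { n++ } END { print n+0 }'               # prints 0
```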

Related

how to format a large txt file to bed

I am trying to format CpG methylation calls from the R package "methylKit" into simple BED format. Since it is a large file, I cannot do it in Excel. I also tried SeqMonk, but it does not allow me to export the data in the format I want. Linux awk/sed might be a good option, but I am new to them as well. Basically, I need to trim the "chr" column, add a "stop" column, convert "F" to "+" and "R" to "-", and keep freqC with 2 decimal places. Can you please help?
From:
chrBase chr base strand coverage freqC freqT
chr1.339 chr1 339 F 7 0.00 100.00
chr1.183 chr1 183 R 4 0.00 100.00
chr1.192 chr1 192 R 6 0.00 100.00
chr1.340 chr1 340 R 5 40.00 60.00
chr1.10007 chr1 10007 F 13 53.85 46.15
chr1.10317 chr1 10317 F 8 0.00 100.00
chr1.10346 chr1 10346 F 9 88.89 11.11
chr1.10349 chr1 10349 F 9 88.89 11.11
To:
chr start stop freqc Coverage strand
1 67678 67679 0 3 -
1 67701 67702 0 3 -
1 67724 67725 0 3 -
1 67746 67747 0 3 -
1 67768 67769 0.333333 3 -
1 159446 159447 0 3 +
1 162652 162653 0 3 +
1 167767 167768 0.666667 3 +
1 167789 167790 0.666667 3 +
1 167797 167798 0 3 +
This should do what you actually want, producing a BED6 file with the methylation percentage in the score column:
$ cat foo.awk
BEGIN { OFS="\t" }
NR > 1 {
    if ($4 == "F") {
        strand = "+"
    } else {
        strand = "-"
    }
    # gsub() returns the number of substitutions, not the new value,
    # so strip the "chr" prefix in place and print the modified field
    gsub("chr", "", $2)
    print $2, $3-1, $3, $1, $6, strand, $5
}
That would then be run with:
awk -f foo.awk input.txt > output.bed
The additional column 7 is the coverage, since genome viewers will only display a single score column:
1 338 339 chr1.339 0.00 + 7
1 182 183 chr1.183 0.00 - 4
1 191 192 chr1.192 0.00 - 6
1 339 340 chr1.340 40.00 - 5
1 10006 10007 chr1.10007 53.85 + 13
1 10316 10317 chr1.10317 0.00 + 8
1 10345 10346 chr1.10346 88.89 + 9
1 10348 10349 chr1.10349 88.89 + 9
You can tweak that further as needed.
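One common tweak (an assumption about your downstream tools, not something the question requires): many genome browsers and tools expect coordinate-sorted BED, which the conventional sort invocation produces:

```shell
# Sort BED lexically by chromosome, then numerically by start coordinate
sort -k1,1 -k2,2n output.bed > output.sorted.bed
```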
It is not entirely clear what exact sequence you want, since your "From" data does not correspond to what you show as your "To" results. But if what you are showing is the general format change, then in the "From" data you want to:
discard field 1,
retrieve the "chr" value from the end of field 2,
if the 4th field is "F" make it "+" else if it is "R" make it "-" otherwise leave it unchanged,
use the 3rd field as "start" and 3rd + 1 as "stop" (adjust whether to add or subtract 1 as needed to get the desired "start" and "stop" values),
print 6th field as "freqc",
output 5th field as "Coverage", and finally
output modified 4th field as "strand"
If that is your goal, then with your from data in the file named from, you can do something like the following:
awk '
BEGIN { OFS="\t"; print "chr","start","stop","freqc","Coverage","strand" }
FNR > 1 {
    match($2, /[[:digit:]]+$/, arr)
    if ($4 == "F")
        $4 = "+"
    else if ($4 == "R")
        $4 = "-"
    print arr[0], $3, $3 + 1, $6, $5, $4
}
' from
Explanation: the BEGIN rule is run before awk starts processing records (lines) from the file. Above, it simply sets the Output Field Separator to a tab and prints the heading.
The condition (pattern) FNR > 1 on the second rule processes the from file from the 2nd record (line) on, skipping the heading row. FNR is awk's way of saying File Record Number (even though it looks like the N and R are backwards).
match($2, /[[:digit:]]+$/, arr) (a GNU awk extension in its three-argument form) captures the trailing digits of the second field in the first element of arr (i.e. arr[0]), and (not relevant here) sets the RSTART and RLENGTH internal variables telling you which character the first digit starts on and how many digits there are.
The if and else if statements do the "F" to "+" and "R" to "-" change. And, finally, the print statement just prints the modified values and unchanged fields in the order specified above.
Example Output
Running the above on your original "From" data will produce:
chr start stop freqc Coverage strand
1 339 340 0.00 7 +
1 183 184 0.00 4 -
1 192 193 0.00 6 -
1 340 341 40.00 5 -
1 10007 10008 53.85 13 +
1 10317 10318 0.00 8 +
1 10346 10347 88.89 9 +
1 10349 10350 88.89 9 +
Let me know if this is close to what you explained in your question, and if not, drop a comment below.
The GNU Awk User's Guide is a great gawk/awk reference.

count, groupby with sed, or awk

I want to perform two different sort-and-count operations on a file, based on each line's content.
I need to take the first column of a .tsv file.
For lines that start with three digits, I would like to group by and keep only the first three digits; for everything else, just sort and count the occurrences of the whole value in the first column.
Sample data:
687/878 9
890987 4
01a 55
1b 8743917
890a 34
abcdee 987
dfeqfe fkdjald
890897 34213
6878853 834
32fasd 53891
abcdee 8794371
abd 873
result:
687 2
890 3
01a 1
1b 1
32fasd 1
abd 1
dfeqfe 1
abcdee 2
I would also appreciate a solution that takes into account a sample input like
687/878 9
890987 4
01a 55
1b 8743917
890a 34
abcdee 987
dfeqfe 545
890897 34213
6878853 834
(632)fasd 53891
(88)abcdee 8794371
abd 873
so the first column may have values containing (, ), #, ', all kinds of characters.
The output will have two columns: the first with the extracted values, and the second with the count of each extracted value from the source file.
Again, the preferred output format is TSV.
So I need to extract all values that start with ^\d\d\d and, for those first three digits, sort and count unique values; then, in a second pass, do the same for each line that does not start with 3 digits, but this time keep the whole column value and sort and count by it.
What I have tried: | sort | uniq -c | sort -nr for the lines that do start with ^\d\d\d, and the same for those that do not match the regex. But is there a more elegant way using either sed or awk?
$ cat tst.awk
BEGIN { FS=OFS="\t" }
{ cnt[/^[0-9]{3}/ ? substr($1,1,3) : $1]++ }
END {
    for (key in cnt) {
        print (key !~ /^[0-9]{3}/), cnt[key], key, cnt[key]
    }
}
$ awk -f tst.awk file | sort -k1,2n | cut -f3-
687 1
890 2
abcdee 1
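The print-extra-keys-then-cut pipeline above is the classic decorate-sort-undecorate idiom: awk prepends the fields sort should order by, and cut strips them afterwards. A minimal sketch of the same idiom with made-up data:

```shell
# Decorate each line with a sort key, sort on it, then strip it off
printf 'banana\napple\ncherry\n' |
awk '{ print length($0) "\t" $0 }' |   # decorate: key = line length
sort -k1,1n |                          # sort numerically on the key
cut -f2-                               # undecorate: drop the key column
# prints: apple, banana, cherry (shortest first)
```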
You can try Perl
$ cat nefijaka.txt
687 878 9
890987 4
890a 34
abcdee 987
$ perl -lne ' /^(\d{3})|(\S+)/; $x=$1?$1:$2; $kv{$x}++; END { print "$_\t$kv{$_}" for (sort keys %kv) } ' nefijaka.txt
687 1
890 2
abcdee 1
$
You can pipe it to sort and get the values sorted..
$ perl -lne ' /^(\d{3})|(\S+)/; $x=$1?$1:$2; $kv{$x}++; END { print "$_\t$kv{$_}" for (sort keys %kv) } ' nefijaka.txt | sort -k2 -nr
890 2
abcdee 1
687 1
EDIT1:
$ cat nefijaka.txt2
687 878 9
890987 4
890a 34
abcdee 987
a word and then 23
$ perl -lne ' /^(\d{3})|(.+?\t)/; $x=$1?$1:$2; $x=~s/\t//g; $kv{$x}++; END { print "$_\t$kv{$_}" for (sort keys %kv) } ' nefijaka.txt2
687 1
890 2
a word and then 1
abcdee 1
$

Select rows after a certain condition

Select the rows after the minimum value found.
Input file
22 101 5
23 102 5
24 103 5
25 104 23
26 105 25
27 106 21
28 107 20
29 108 8
30 109 6
31 110 7
To figure out my problem, I tried to subtract consecutive values of column 3 (as a new column 4) and print the lines after the minimum value found in column 4; in this case, after row 7:
awk '{$4 = $3 - prev3; prev3 = $3; print $0}' file
22 101 5
23 102 5 0
24 103 5 0
25 104 2 18
26 105 2 2
27 106 2 -4
28 107 2 -1
29 108 8 -12
30 109 6 -2
31 110 7 1
Desired Output
29 108 8
30 109 6
31 110 7
I believe there is a better and easier way to get the same output.
Thanks in advance
You need to process the same file twice:
Find out the line number of the min value
Print the line and the lines after it
Like this:
awk 'NR==FNR{v=$3-prev3;prev3=$3;if(NR==2||v<m){m=v;ln=NR};next}FNR>=ln' file file
Explanation:
# This condition is true as long as we process the file the first time
NR==FNR {
# Your calculation
v=$3-prev3
prev3=$3
# On row 2 (NR==2) we initialize m and ln unconditionally.
# Otherwise, check whether v is the new minimum and update m and ln.
if(NR==2 || v<m){
# Set m and ln when v is the new minimum
m=v
ln=NR
}
next # Skip the conditional below
}
# This condition is only evaluated when we parse the file
# the second time (because of the "next" statement above).
# When the line number is greater than or equal to "ln", print it.
# (print is the default action)
FNR>=ln
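The same NR==FNR two-pass idiom works for any compute-on-pass-1, filter-on-pass-2 task; a tiny sketch on a throwaway file:

```shell
# Pass 1 (NR==FNR): find the maximum of column 1, then skip to the next line.
# Pass 2: print only the lines carrying that maximum.
printf '3\n7\n5\n7\n' > nums.txt
awk 'NR==FNR { if ($1 > m) m = $1; next } $1 == m' nums.txt nums.txt
# prints the two lines containing 7
```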

How to insert two lines for every data frame using awk?

I have repeating data as follows
....
4 4 4 66 79 169 150 0 40928 40938 40923 40921 40789 40000 40498
5 4 3 16 22 247 0 40168 40911 40944 40205 40000 40562
6 4 4 17 154 93 309 0 40930 40919 40903 40917 40852 40000 40419
7 3 2 233 311 0 40936 40932 40874 40000 40807
....
This data is made up of 115 data blocks, and each data block has 4000 lines in that format.
Here, I want to put two new lines (the number of lines per data block, 4000, and an empty line) at the beginning of each data block, so it looks like:
4000
1 4 4 244 263 704 952 0 40936 40930 40934 40921 40820 40000 40570
2 4 4 215 172 305 33 0 40945 40942 40937 40580 40687 40000 40410
3 4 4 344 279 377 1945 0 40933 40915 40907 40921 40839 40000 40437
4 4 4 66 79 169 150 0 40928 40938 40923 40921 40789 40000 40498
...
3999 2 2 4079 4081 0 40873 40873 40746 40000 40634
4000 1 1 4080 0 40873 40923 40000 40345
4000
1 4 4 244 263 704 952 0 40936 40930 40934 40921 40820 40000 40570
2 4 4 215 172 305 33 0 40945 40942 40937 40580 40687 40000 40410
3 4 4 344 279 377 1945 0 40933 40915 40907 40921 40839 40000 40437
4 4 4 66 79 169 150 0 40928 40938 40923 40921 40789 40000 40498
...
Can I do this with awk or any other Unix command?
My solution is more general, since the blocks can be of unequal length, as long as the 1st field counter restarts to denote the beginning of a new block:
% cat mark_blocks
$1<count { print count; print ""
           for (i=1; i<=count; i++) print l[i] }
# executed for each line
{ l[$1] = $0; count = $1 }
END { print count; print ""
      for (i=1; i<=count; i++) print l[i] }
% awk -f mark_blocks your_data > marked_data
%
The working is simple: awk accumulates lines in memory and prints the header lines plus the accumulated data when it reaches a new block or EOF.
The (modest) trick is that the output action must take place before the usual per-line bookkeeping.
A simple one-liner using awk can serve the purpose:
awk 'NR%4000==1{print "4000\n"} {print $0}' file
What it does:
print $0 prints every line.
NR%4000==1 matches the first line of each 4000-line block (lines 1, 4001, 8001, ...). When it matches, print "4000\n" emits a 4000 followed by an extra newline, that is, the two new lines.
NR is the Number of Records, which is effectively the number of lines read so far.
A simple test, inserting the header every 5 lines instead:
awk 'NR%5==1{print "4000\n"} {print $0}'
output:
4000
1
2
3
4
5
4000
6
7
8
9
10
4000
11
12
13
14
15
4000
16
17
18
19
20
4000
You can do it all in bash :
cat $FILE | ( let countmax=4000; let count=countmax; while read lin ; do if [ $count == $countmax ]; then let count=0; echo -e "$countmax\n" ; fi ; echo $lin ; let count=count+1 ; done )
Here we assume you are reading this data from $FILE. Then all we are doing is reading from the file and piping it into our little bash script.
The bash script reads lines one by one (with the while read lin) and increments the counter count for each line. At the start, or whenever the counter count reaches the value countmax (set to 4000), it prints out the 2 lines you asked for.

compare between two columns and subtract them

My question:
I have one file:
344 0
465 1
729 2
777 3
676 4
862 5
766 0
937 1
980 2
837 3
936 5
I need to compare each pair sharing the same value in column two (zero with zero, one with one, and so on). If a value in column two exists twice, subtract the pair: 766-344, 937-465, and so on. If it exists only once (like the value 4), do nothing. The output:
422
472
251
060
074
I also need to add an index, for example:
1 422
2 472
3 251
4 060
5 074
Finally, I need to add this code as part of a Tcl script, or as a function of a Tcl program.
I have a Tcl script containing awk functions like this:
set awkCBR0 {
    {
        if ($1 == "r" && $6 == 280) {
            print $2, i >> "cbr0.q";
            i += 1;
        }
    }
}
exec rm -f cbr0.q
exec touch cbr0.q
exec awk $awkCBR0 cbr.trq
Thanks
Try this:
awk 'a[$2]{printf "%d %03d\n",++x,$1-a[$2];next}{a[$2]=$1}' file
Output
$ awk 'a[$2]{printf "%d %03d\n",++x,$1-a[$2];next}{a[$2]=$1}' file
1 422
2 472
3 251
4 060
5 074
I will leave it to you to add it to your Tcl function.