I have 400 tab-delimited text files with 6 million rows in each file. Below is the format of the files:
### input.txt
col1 col2 col3 col4 col5
ID1 str1 234 cond1 0
ID1 str2 567 cond1 0
ID1 str3 789 cond1 1
ID1 str4 123 cond1 1
### file1.txt
col1 col2 col3 col4 col5
ID2 str1 235 cond1 0
ID2 str2 567 cond2 3
ID2 str3 789 cond1 3
ID2 str4 123 cond2 0
### file2.txt
col1 col2 col3 col4 col5
ID3 str1 235 cond1 0
ID3 str2 567 cond2 4
ID3 str3 789 cond1 1
I am trying to append the values from $1 of file1..filen as a new column ($6) in input.txt, under these conditions:
1. Use columns $2 and $3 as the key.
2. If the key from input.txt is found in file1..filen and that file's $5>=2, append that file's $1 to $6 in input.txt.
Code:
awk -F "\t" -v OFS="\t" '!c {
c=$0"\tcol6";
next
}
NR==FNR {
a[$2$3]=$0 "\t";
next
}
{
if ($5>=2) {
a[$2$3]=a[$2$3] $1 ","
}
}
END {
print c;
for (i in a) {
print a[i]
}
}' input.txt file1..filen.txt
The output from the above code is as expected:
Output.txt
col1 col2 col3 col4 col5 col6
ID1 str2 567 cond1 0 ID2,ID3,
ID1 str4 123 cond1 1
ID1 str1 234 cond1 0
ID1 str3 789 cond1 1 ID2,
However, the code is very slow: it has to stream every row of 400 files with 6 million rows each, which takes from several hours to a few days. Could someone suggest a better way to reduce the processing time, in awk or with another tool? Any help would save a lot of time.
Update: here are more realistic sample files and a revised version of the code.
input.txt
Sam string POS Zyg QUAL
WSS 1 125 hom 4973.77
WSS 1 810 hom 3548.77
WSS 1 389 hom 62.74
WSS 1 689 hom 4.12
file1.txt
Sam string POS Zyg QUAL
AC0 1 478 hom 8.64
AC0 1 583 het 37.77
AC0 1 588 het 37.77
AC0 1 619 hom 92.03
file2.txt
Sam string POS zyg QUAL
AC1 1 619 hom 89.03
AC1 1 746 hom 17.86
AC1 1 810 het 2680.77
AC1 1 849 het 200.77
awk -F "\t" -v OFS="\t" '!c {
c=$0"\tcol6";
next
}
NR==FNR {
a[$2$3]=$0 "\t";
next
}
{
if ( ($5>=2) && (FNR > 1) ) {
if ( $2$3 in a ) {
a[$2$3]=a[$2$3] $1 ",";
} else {
print $0 > "Errors.txt";
}
}
}
END {
print c;
for (i in a) {
print a[i]
}
}' input.txt file*
For the above input files it prints the below output:
AC0,AC1,
WSS 1 389 hom 62.74
AC1,
WSS 1 810 hom 3548.77 AC1,
WSS 1 689 hom 4.12
WSS 1 1250 hom 4973.77
It still prints stray $1 values from file1 and file2 (the AC0,AC1, and AC1, lines above) on rows of their own.
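One way to cut the runtime (a sketch, not benchmarked; it assumes the $2/$3 key and tab delimiters shown above): pre-filter each of the 400 files down to bare key/ID pairs, which is cheap and easy to run per file in parallel, then do a single merge pass that only reads input.txt and the much smaller pairs file:
# Pass 1: reduce each data file to "key1 TAB key2 TAB ID" for rows with $5>=2.
# Each file is independent, so these jobs can be spread across cores or machines.
for f in file*.txt; do
    awk -F'\t' 'FNR>1 && $5>=2 { print $2 "\t" $3 "\t" $1 }' "$f"
done > pairs.txt
# Pass 2: one hash lookup per input.txt row. (Assumes pairs.txt is non-empty,
# otherwise NR==FNR would wrongly match input.txt.)
awk -F'\t' -v OFS='\t' '
NR==FNR { ids[$1 FS $2] = ids[$1 FS $2] $3 ","; next }
FNR==1  { print $0, "col6"; next }
        { print $0, ids[$2 FS $3] }
' pairs.txt input.txt
Unlike for (i in a), this also preserves the row order of input.txt.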
Related
I have input.txt like so:
237 #
0 2 3 4 0. ABC
ABC
DEF
# 237
0 1 4 7 2 0.
0 3 8 9 1 0. GHI
XYZ
(a) If a row contains the symbol #, then the output should contain a newline/blank line at that point.
(b) If a row starts with 0 and contains 0., then the fields from that leading 0 up to (but not including) the terminating 0. should be printed.
The following script accomplishes (b):
awk '{
    for (i=1; i<NF; i++)
        if ($i == "0")
            arr[NR] = $i
        else if ($i == "0.")
            break
        else
            arr[NR] = arr[NR] " " $i
}
$1 == "0" { print arr[NR] }
' input.txt > output.txt
so that the output is:
0 2 3 4
0 1 4 7 2
0 3 8 9 1
How can (a) be accomplished so that the output is:
// <----Starting newline
0 2 3 4
0 1 4 7 2
0 3 8 9 1
Try adding if ($0 ~ /#/) {print ""}, so:
awk '{
    for (i=1; i<NF; i++)
        if ($i == "0")
            arr[NR] = $i
        else if ($i == "0.")
            break
        else
            arr[NR] = arr[NR] " " $i
    if ($0 ~ /#/) { print "" }
}
$1 == "0" { print arr[NR] }
' input.txt > output.txt
Is this what you're trying to do?
$ awk '/#/{print ""} /^0/ && sub(/ 0\..*/,"")' file
0 2 3 4
0 1 4 7 2
0 3 8 9 1
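In case that one-liner is too dense, here is the same logic spelled out with comments (a sketch; the behavior should be identical for the sample input):
awk '
/#/  { print "" }                      # (a) emit a blank line for any row containing #
/^0/ { if (sub(/ 0\..*/, "")) print }  # (b) drop everything from " 0." on, then print
' file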
I have
A 34 missense fixed
A 33 synonymous fixed
B 12 synonymous var
B 34 missense fixed
B 34 UTR fixed
B 45 missense var
TRI 4 synonymous var
TRI 4 intronic var
3 3 synonymous fixed
I want to output the counts of the combinations missense && fixed, missense && var, synonymous && fixed, and synonymous && var for each element in $1:
missensefixed missensevar synonymousfixed synonymousvar
A 1 0 1 0
B 1 1 0 1
TRI 0 0 0 1
3 0 0 1 0
I can do this with four individual commands, selecting each combination and concatenating the outputs:
awk -F'\t' '($3~/missense/ && $4~/fixed/)' file | awk -F'\t' '{count[$1"\t"$3"\t"$4]++} END {for (word in count) print word"\t"count[word]}' > out
But I would like to do this for all combinations at once. I've tried some variations of the following but haven't been able to make it work:
awk print a[i] -v delim=":" -v string='missense:synonymous:fixed:var' 'BEGIN {n = split(string, a, delim); for (i = 1; i <= n-2; ++i) {count[xxxx}++}} END ;for (word in count) print word"\t"count[word]}
You may use this awk with multiple arrays to hold different counts:
awk -v OFS='\t' '
{keys[$1]}
/missense fixed/ {++mf[$1]}
/missense var/ {++mv[$1]}
/synonymous fixed/ {++sf[$1]}
/synonymous var/ {++sv[$1]}
END {
print "-\tmissensefixed\tmissensevar\tsynonymousfixed\tsynonymousvar"
for (i in keys)
print i, mf[i]+0, mv[i]+0, sf[i]+0, sv[i]+0
}
' file | column -t
- missensefixed missensevar synonymousfixed synonymousvar
A 1 0 1 0
B 1 1 0 1
TRI 0 0 0 1
3 0 0 1 0
I have used column -t for tabular output only.
GNU awk supports arrays of arrays, so if it is your awk you can count your records with something as simple as num[$1][$3$4]++. The most complex part is the final human-friendly printing:
$ cat foo.awk
{ num[$1][$3 $4]++ }
END {
    printf(" missensefixed missensevar synonymousfixed synonymousvar\n")
    for (r in num)
        printf("%3s%14d%12d%16d%14d\n", r, num[r]["missensefixed"],
               num[r]["missensevar"], num[r]["synonymousfixed"],
               num[r]["synonymousvar"])
}
$ awk -f foo.awk data.txt
missensefixed missensevar synonymousfixed synonymousvar
A 1 0 1 0
B 1 1 0 1
TRI 0 0 0 1
3 0 0 1 0
Using any awk in any shell on every Unix box with an assist from column to convert the tab-separated awk output to a visually tabular display if you want it:
$ cat tst.awk
BEGIN {
OFS = "\t"
numTags = split("missensefixed missensevar synonymousfixed synonymousvar",tags)
}
{
keys[$1]
cnt[$1,$3 $4]++
}
END {
for (tagNr=1; tagNr<=numTags; tagNr++) {
tag = tags[tagNr]
printf "%s%s", OFS, tag
}
print ""
for (key in keys) {
printf "%s", key
for (tagNr=1; tagNr<=numTags; tagNr++) {
tag = tags[tagNr]
val = cnt[key,tag]
printf "%s%d", OFS, val
}
print ""
}
}
$ awk -f tst.awk file
missensefixed missensevar synonymousfixed synonymousvar
A 1 0 1 0
B 1 1 0 1
TRI 0 0 0 1
3 0 0 1 0
$ awk -f tst.awk file | column -s$'\t' -t
missensefixed missensevar synonymousfixed synonymousvar
A 1 0 1 0
B 1 1 0 1
TRI 0 0 0 1
3 0 0 1 0
I'd highly recommend you always give every column a header string, though, so it doesn't make further processing of the data harder (e.g. reading it into Excel and sorting on headers). So if I were you I'd add printf "key" (or something else that more accurately identifies that column's contents) as the first line of the END section, i.e. on a line immediately before the first for loop, so the first column gets a header too:
$ awk -f tst.awk file | column -s$'\t' -t
key missensefixed missensevar synonymousfixed synonymousvar
A 1 0 1 0
B 1 1 0 1
TRI 0 0 0 1
3 0 0 1 0
I would like to leave the first four columns empty, then add the filename without its extension in the last four columns. I have files named file.frq and so on. Later I will apply this to 200 files in a loop.
input
CHR POS REF ALT AF HOM Het Number of animals
1 94980034 C T 0 0 0 5
1 94980057 C T 0 0 0 5
Desired output
file file file file
CHR POS REF ALT AF HOM Het Number of animals
1 94980034 C T 0 0 0 5
1 94980057 C T 0 0 0 5
I tried this from Add file name and empty column to existing file in awk
awk '{$0=(NR==1? " \t"" \t"" \t"" \t":FILENAME"\t") "\t" $0}7' file2.frq
But it gave me this:
CHR POS REF ALT AF HOM Het Number of animals
file2.frq 1 94980034 C T 0 0 0 5
file2.frq 1 94980057 C T 0 0 0 5
file2.frq 1 94980062 G C 0 0 0 5
and I also tried this
awk -v OFS="\t" '{print FILENAME, $1=" ",$2=" ",$3=" ", $4=" ",$5 - end}' file2.frq
but it gave me this
CHR POS REF ALT AF HOM Het Number of animals
file2.frq 1 94980034 C T 0 0 0 5
file2.frq 1 94980057 C T 0 0 0 5
any help will be appreciated!
Assuming your input is tab-separated like your desired output:
awk '
BEGIN { FS=OFS="\t" }
NR==1 {
orig = $0
fname = FILENAME
sub(/\.[^.]*$/,"",fname)
$1=$2=$3=$4 = ""
$5=$6=$7=$8 = fname
print
$0 = orig
}
1' file.txt
file file file file
CHR POS REF ALT AF HOM Het Number of animals
1 94980034 C T 0 0 0 5
1 94980057 C T 0 0 0 5
To see it in table format:
$ awk '
BEGIN { FS=OFS="\t" }
NR==1 {
orig = $0
fname = FILENAME
sub(/\.[^.]*$/,"",fname)
$1=$2=$3=$4 = ""
$5=$6=$7=$8 = fname
print
$0 = orig
}
1' file.txt | column -s$'\t' -t
file file file file
CHR POS REF ALT AF HOM Het Number of animals
1 94980034 C T 0 0 0 5
1 94980057 C T 0 0 0 5
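The question mentions applying this to 200 files in a loop; here is a minimal sketch of that (the *.frq glob and the out/ directory with one output file per input are my assumptions):
mkdir -p out
for f in *.frq; do
    awk '
    BEGIN { FS=OFS="\t" }
    NR==1 {
        orig = $0
        fname = FILENAME
        sub(/\.[^.]*$/,"",fname)   # strip the extension
        $1=$2=$3=$4 = ""
        $5=$6=$7=$8 = fname
        print
        $0 = orig
    }
    1' "$f" > "out/${f%.frq}.txt"
done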
I have data in the following format:
ID Date X1 X2 X3
1 01/01/00 1 2 3
1 01/02/00 7 8 5
2 01/03/00 9 7 1
2 01/04/00 1 4 5
I would like to group measurements into new rows according to ID, so I end up with:
ID Date X1 X2 X3 Date X1_2 X2_2 X3_2
1 01/01/00 1 2 3 01/02/00 7 8 5
2 01/03/00 9 7 1 01/04/00 1 4 5
etc.
I have as many as 20 observations for a given ID.
So far I have tried the technique given by http://gadgetsytecnologia.com/da622c17d34e6f13e/awk-transpose-childids-column-into-row.html
The code I have tried so far is:
awk -F, OFS = '\t' 'NR >1 {a[$1] = a[$1]; a[$2] = a[$2]; a[$3] = a[$3];a[$4] = a[$4]; a[$5] = a[$5] OFS $5} END {print "ID,Date,X1,X2,X3,Date_2,X1_2, X2_2 X3_2'\t' for (ID in a) print a[$1:$5] }' file.txt
The file is a tab delimited file. I don't know how to manipulate the data, or to account for the fact that there will be more than two observations per person.
Just keep track of what was the previous first field. If it changes, print the stored line:
awk 'NR==1 {print; next} # print header
prev && $1!=prev {print prev, line; line=""} # print on different $1
{prev=$1; $1=""; line=line $0} # store data and remove $1
END {print prev, line}' file # print trailing line
If you have tab-separated fields, just add -F"\t".
Test
$ awk 'NR==1 {print; next} prev && $1!=prev {print prev, line; line=""} {prev=$1; $1=""; line=line $0} END {print prev, line}' a
ID Date X1 X2 X3
1 01/01/00 1 2 3 01/02/00 7 8 5
2 01/03/00 9 7 1 01/04/00 1 4 5
You can try this (gnu-awk solution):
gawk '
NR == 1 {
N = NF;
MAX = NF-1;
for(i=1; i<=NF; i++){ #store columns names
names[i]=$i;
}
next;
}
{
for(i=2; i<=N; i++){
a[$1][length(a[$1])+1] = $i; #store records for each id
}
if(length(a[$1])>MAX){
MAX = length(a[$1]);
}
}
END{
firstline = names[1];
for(i=1; i<=MAX; i++){ #print first line
column = int((i-1)%(N-1))+2
count = int((i-1)/(N-1));
firstline=firstline OFS names[column];
if(count>0){
firstline=firstline"_"count
}
}
print firstline
for(id in a){ #print each record in store
line = id;
for(i=1; i<=length(a[id]); i++){
line=line OFS a[id][i];
}
print line;
}
}
' input
input
ID Date X1 X2 X3
1 01/01/00 1 2 3
1 01/02/00 7 8 5
2 01/03/00 9 7 1
2 01/04/00 1 4 5
1 01/03/00 72 28 25
you get
ID Date X1 X2 X3 Date_1 X1_1 X2_1 X3_1 Date_2 X1_2 X2_2 X3_2
1 01/01/00 1 2 3 01/02/00 7 8 5 01/03/00 72 28 25
2 01/03/00 9 7 1 01/04/00 1 4 5
I have some text files as follows
293 800 J A 0 0 162
294 801 J R - 0 0 67
295 802 J P - 0 0 56
298 805 J G S S- 0 0 22
313 820 J R T 4 S- 0 0 152
I would like to print column4 if column5 is empty.
desired output
>filename
ARP
I used the following code. But this code prints only the filenames.
awk '{
    if (FNR == 1) print ">" FILENAME
    if ($5 == "") {
        printf $4
    }
}
END { printf "\n" }' *.txt
Here's one way using GNU awk:
awk 'BEGIN { FIELDWIDTHS="5 4 2 3 3 2 7 4 3" } FNR==1 { print ">" FILENAME } $5 == " " { sub(/ $/, "", $4); printf $4 } END { printf "\n" }' file.txt
Result:
>file.txt
ARP
This is not an elegant solution by any means and it is specific to this file.
You can do something like this
cut -c1-15 yourtext | awk '$5 {print $4}'
where 15 is the number of characters including column 5.
I strongly agree with steve's suggestion to use a better format for your files, or at least to put a dummy/error value in place of blank columns.
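If you do add dummy values, here is a sketch of that conversion using GNU awk's FIELDWIDTHS (the widths are copied from the answer above and are specific to this file; NA as the placeholder is my assumption):
gawk '
BEGIN { FIELDWIDTHS = "5 4 2 3 3 2 7 4 3"; OFS = "\t" }
{
    for (i = 1; i <= NF; i++) {
        gsub(/^ +| +$/, "", $i)   # trim the fixed-width padding
        if ($i == "") $i = "NA"   # placeholder instead of a blank column
    }
    print                         # record is rebuilt tab-separated
}' your_file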
awk '{if(substr($0,15,1)~/ /)printf("%s",$4);}' your_file
tested below:
> cat temp
293 800 J A 0 0 162
294 801 J R - 0 0 67
295 802 J P - 0 0 56
298 805 J G S S- 0 0 22
313 820 J R T 4 S- 0 0 152
> awk '{if(substr($0,15,1)~/ /)printf("%s",$4);}' temp
ARP>
This is a starting point assuming the variations in column numbers stay the same.
awk '$5 !="" && NF<=8 {printf $4}END{print "\n"}' data.txt
yields
ARP
You can graft on the parts to display the filename.
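For example (a sketch of that grafting, reusing the FNR==1 trick from the question so it also covers multiple files):
awk '
FNR == 1 { printf "%s>%s\n", (NR > 1 ? "\n" : ""), FILENAME }  # per-file header
$5 != "" && NF <= 8 { printf "%s", $4 }                        # same row test as above
END { print "" }                                               # terminating newline
' *.txt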