Related
I have multi columns file and i want to extract some info in column 71.
I want to extract using tags which the value can be anything, for example i want to just extract AC=* ; AF=* , where the value can be anything .
I found similar question and gave it a try but it didn't work
Extract columns with values matching a specific pattern
Column 71 looks like this:
AC=14511;AC_AFR=382;AC_AMR=1177;AC_Adj=14343;AC_EAS=5;AC_FIN=427;AC_Het=11813;AC_Hom=1265;AC_NFE=11027;AC_OTH=97;AC_SAS=1228;AF=0.137;AN=106198;AN_AFR=8190;AN_AMR=10424;AN_Adj=99264;AN_EAS=7068;AN_FIN=6414;AN_NFE=51090;AN_OTH=658;AN_SAS=15420;BaseQRankSum=1.73;ClippingRankSum=-1.460e-01;DB;DP=1268322;FS=0.000;GQ_MEAN=190.24;GQ_STDDEV=319.67;Het_AFR=358;Het_AMR=1049;Het_EAS=5;Het_FIN=399;Het_NFE=8799;Het_OTH=83;Het_SAS=1120;Hom_AFR=12;Hom_AMR=64;Hom_EAS=0;Hom_FIN=14;Hom_NFE=1114;Hom_OTH=7;Hom_SAS=54;InbreedingCoeff=0.0478;MQ=60.00;MQ0=0;MQRankSum=0.037;NCC=270;POSITIVE_TRAIN_SITE;QD=21.41;ReadPosRankSum=0.212;VQSLOD=4.79;culprit=MQ;DP_HIST=30|3209|1539|1494|30007|7938|4130|2038|1310|612|334|185|97|60|31|25|9|11|7|33,0|66|339|1048|2096|2665|2626|1832|1210|584|323|179|89|54|31|22|7|9|4|15;GQ_HIST=84|66|56|82|3299|568|617|403|250|319|436|310|28566|2937|827|834|451|186|217|12591,15|15|13|16|25|11|22|28|18|38|52|31|65|76|39|83|93|65|97|12397;CSQ=T|ENSG00000186868|ENST00000334239|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS11502.1|ENSP00000334886|TAU_HUMAN|B4DSE3_HUMAN|UPI0000000C16||||2/8||ENST00000334239.8:c.134-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000570299|Transcript|intron_variant&non_coding_transcript_variant||||||rs754512|1||1|MAPT|HGNC|6893|processed_transcript||||||||||2/6||ENST00000570299.1:n.262-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000340799|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS45716.1|ENSP00000340438|TAU_HUMAN||UPI000004EEE6||||3/10||ENST00000340799.5:c.221-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000262410|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS11501.1|ENSP00000262410|TAU_HUMAN||UPI0000EE80B7||||4/13||ENST00000262410.5:c.308-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000446361|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS11500.1|ENSP00000408975|TAU_HUMAN||UPI000004EEE5||||2/9||ENST00000446361.3:c.134-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000574436|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS11499.1|ENSP00000460965|TAU_HUMAN||UPI000002D754||||3/10||ENST00000574436.1:c.308-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000571987|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS11501.1|ENSP00000458742|TAU_HUMAN||UPI0000EE80B7||||3/12||ENST00000571987.1:c.308-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000415613|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS45715.1|ENSP00000410838|TAU_HUMAN||UPI0001AE66E9||||3/13||ENST00000415613.2:c.308-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000571311|Transcript|intron_variant&NMD_transcript_variant||||||rs754512|1||1|MAPT|HGNC|6893|nonsense_mediated_decay|||ENSP00000460048||I3L2Z2_HUMAN|UPI00025A2E6E||||4/4||ENST00000571311.1:c.*176-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000535772|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS56033.1|ENSP00000443028|TAU_HUMAN|B4DSE3_HUMAN|UPI000004EEE4||||4/10||ENST00000535772.1:c.308-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000576518|Transcript|stop_gained|5499|7|3|K/*|Aag/Tag|rs754512|1||1|MAPT|HGNC|6893|protein_coding|||ENSP00000458621||I3L170_HUMAN&B4DSE3_HUMAN|UPI0001639A7C|||1/7|||ENST00000576518.1:c.7A>T|ENSP00000458621.1:p.Lys3Ter|T:0.1171|||||||||15792962|||||POSITION:0.00682261208576998&ANN_ORF:-255.6993&MAX_ORF:-255.6993|PHYLOCSF_WEAK|ANC_ALLELE|LC,T|ENSG00000186868|ENST00000420682|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS45716.1|ENSP00000413056|TAU_HUMAN||UPI000004EEE6||||2/9||ENST00000420682.2:c.221-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000572440|Transcript|non_coding_transcript_exon_variant&non_coding_transcript_variant|2790|||||rs754512|1||1|MAPT|HGNC|6893|retained_intron|||||||||1/1|||ENST00000572440.1:n.2790A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000351559|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS11499.1|ENSP00000303214|TAU_HUMAN||UPI000002D754||||4/11||ENST00000351559.5:c.308-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000344290|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding|YES|CCDS45715.1|ENSP00000340820|TAU_HUMAN||UPI0001AE66E9||||4/14||ENST00000344290.5:c.308-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000347967|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding|||ENSP00000302706|TAU_HUMAN|B4DSE3_HUMAN|UPI0000173D91||||4/10||ENST00000347967.5:c.32-100A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000431008|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS56033.1|ENSP00000389250|TAU_HUMAN|B4DSE3_HUMAN|UPI000004EEE4||||3/9||ENST00000431008.3:c.308-94A>T||T:0.1171|||||||||15792962||||||||
The code that i tried:
awk '{
for (i = 1; i <= NF; i++) {
if ($i ~ /AC|AF/) {
printf "%s %s ", $i, $(i + 1)
}
}
print ""
}'
I keep getting syntax error.
output wanted :
AC=14511;AF=0.137
Whenever you have name=value pairs, it's usually simplest to first create an array that maps names to values (n2v[] below) and then you can just access the values by their names.
$ cat file
AC=1;AC_AFR=2;AF=3 AC=4;AC_AFR=5;AF=6
$ cat tst.awk
{
delete n2v
split($2,tmp,/[;=]/)
for (i=1; i in tmp; i+=2) {
n2v[tmp[i]] = tmp[i+1]
}
prt("AC")
prt("AF")
}
function prt(name) { print name, "=", n2v[name] }
$ awk -f tst.awk file
AC = 4
AF = 6
Just change $2 to $71 for your real input.
Something like this should do it (in Gnu awk due to switch):
$ awk '{split($71,a,";");for(i in a )if(a[i]~/^AF/) print a[i]}' foo
AF=0.137
You split the field $71 by ;s, loop thru the array you split to looking for desired match. For multiple matches use switch:
$ awk '{
split($0,a,";");
for(i in a )
switch(a[i]) {
case /^AF=/:
b=b a[i] OFS;
break;
case /^AC=/:
b=b a[i] OFS;
break
}
sub(/.$/,"\n",b);
printf b
}' foo
AC=14511 AF=0.137
EDIT: Now it buffers output to a variable and prints it in the end. You can control the separator with OFS.
aNumber|bNumber|startDate|timeZone|duration|currencyType|cost|
22677512549|778|2014-07-02 10:16:35.000|NULL|NULL|localCurrency|0.00|
22675557361|76457227|2014-07-02 10:16:38.000|NULL|NULL|localCurrency|10.00|
22677521277|778|2014-07-02 10:16:42.000|NULL|NULL|localCurrency|0.00|
22676099496|77250331|2014-07-02 10:16:42.000|NULL|NULL|localCurrency|1.00|
22667222160|22667262389|2014-07-02 10:16:43.000|NULL|NULL|localCurrency|10.00|
22665799922|70110055|2014-07-02 10:16:45.000|NULL|NULL|localCurrency|20.00|
22676239633|433|2014-07-02 10:16:48.000|NULL|NULL|localCurrency|0.00|
22677277255|76919167|2014-07-02 10:16:51.000|NULL|NULL|localCurrency|1.00|
This is the input (sample of million of line) i have in csv file.
I want to sum up duration based on date.
My concern is i want to sum up first 1000000 lines
the awk program i'm using is:
test.awk
BEGIN { FS = "|" }
NR>1 && NR<=1000000
FNR == 1{ next }
{
sub(/ .*/,"",$3)
key=sprintf("%10s",$3)
duration[key] += $5 } END {
printf "%-10s %16s,"dAccused","Duration"
for (i in duration) {
printf "%-4s %16.2f i,duration[i]
}}
i run my script as
$awk -f test.awk 'file'
The input i have doesn't condsidered my condition NR>1 && NR<=1000000
ANY SUGGESTION? PLEASE!
You're looking for this:
BEGIN { FS = "|" }
1 < NR && NR <= 1000000 {
sub(/ .*/, "", $3)
key = sprintf("%10s",$3)
duration[key] += $5
}
END {
printf "%-10s %16s\n", "dAccused", "Duration"
for (i in duration) {
printf "%-4s %16.2f i,duration[i]
}
}
A lot of errors become obvious with proper indentation.
The reason you saw 1,000,000 lines was due to this:
NR>1 && NR<=1000000
That is a condition with no action block. The default action is to print the current record if the condition is true. That's why you see a lot of awk one-liners end with the number 1
You didn't post any expected output and your duration field is always NULL so it's still not clear what you really want output, but this is probably the right approach:
$ cat tst.awk
BEGIN { FS = "|" }
NR==1 { for (i=1;i<NF;i++) f[$i] = i; next }
{
sub(/ .*/,"",$(f["startDate"]))
sum[$(f["startDate"])] += $(f["duration"])
}
NR==1000000 { exit }
END { for (date in sum) print date, sum[date] }
$ awk -f tst.awk file
2014-07-02 0
Instead of discarding your header line, it uses it to create an array f[] that maps the field names to their order in each line so instead of having to hard-code that duration is field 4 (or whatever) you just reference it as $(f["duration"]).
Any time your input file has a header line, don't discard it - use it so your script is not coupled to the order of fields in your input file.
If we have an input:
TargetIDs,CPD,Value,SMILES
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
95,CPD-3333333,-1,c1ccccc1N
Now we would like to separate the duplicates and non-duplicates based on the fourth column (smiles)
duplicate:
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
non-duplicate
95,CPD-3333333,-1,c1ccccc1N
Now the following attempt could do separate the duplicate without any problem. However, the first occurrence of the duplicate will still be included into the non-duplicate file.
BEGIN { FS = ","; f1="a"; f2="b"}
{
# Keep count of the fields in fourth column
count[$4]++;
# Save the line the first time we encounter a unique field
if (count[$4] == 1)
first[$4] = $0;
# If we encounter the field for the second time, print the
# previously saved line
if (count[$4] == 2)
print first[$4] > f1 ;
# From the second time onward. always print because the field is
# duplicated
if (count[$4] > 1)
print > f1;
if (count[$4] == 1) #if (count[$4] - count[$4] == 0) <= change to this doesn't work
print first[$4] > f2;
duplicate output results from the attempt:
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
non-duplicate output results from the attempt
TargetIDs,CPD,Value,SMILES
95,CPD-3333333,-1,c1ccccc1N
95,CPD-1111111,-2,c1ccccc1
May I know if any guru might have comments/solutions? Thanks.
I would do this:
awk '
NR==FNR {count[$2] = $1; next}
FNR==1 {FS=","; next}
{
output = (count[$NF] == 1 ? "nondup" : "dup")
print > output
}
' <(cut -d, -f4 input | sort | uniq -c) input
The process substitution will pre-process the file and perform a count on the 4th column. Then, you can process the file and decide if that line is "duplicated".
All in awk: Ed Morton shows a way to collect the data in a single pass. Here's a 2 pass solution that's virtually identical to my example above
awk -F, '
NR==FNR {count[$NF]++; next}
FNR==1 {next}
{
output = (count[$NF] == 1 ? "nondup" : "dup")
print > output
}
' input input
Yes, the input file is given twice.
$ cat tst.awk
BEGIN{ FS="," }
NR>1 {
if (cnt[$4]++) {
dups[$4] = nonDups[$4] dups[$4] $0 ORS
delete nonDups[$4]
}
else {
nonDups[$4] = $0 ORS
}
}
END {
print "Duplicates:"
for (key in dups) {
printf "%s", dups[key]
}
print "\nNon Duplicates:"
for (key in nonDups) {
printf "%s", nonDups[key]
}
}
$ awk -f tst.awk file
Duplicates:
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
Non Duplicates:
95,CPD-3333333,-1,c1ccccc1N
This solution only works if the duplicates are grouped together.
awk -F, '
function fout( f, i) {
f = (cnt > 1) ? "dups" : "nondups"
for (i = 1; i <= cnt; ++i)
print lines[i] > f
}
NR > 1 && $4 != lastkey { fout(); cnt = 0 }
{ lastkey = $4; lines[++cnt] = $0 }
END { fout() }
' file
Little late
My version in awk
awk -F, 'NR>1{a[$0":"$4];b[$4]++}
END{d="\n\nnondupe";e="dupe"
for(i in a){split(i,c,":");b[c[2]]==1?d=d"\n"i:e=e"\n"i} print e d}' file
Another built similar to glenn jackmans but all in awk
awk -F, 'function r(f) {while((getline <f)>0)a[$4]++;close(f)}
BEGIN{r(ARGV[1])}{output=(a[$4] == 1 ? "nondup" : "dup");print >output} ' file
Input File1: file1.txt
MH=919767,918975
DL=919922
HR=919891,919394,919812
KR=919999,918888
Input File2: file2.txt
aec,919922783456,a5,b3,,,asf
abc,918975583456,a1,b1,,,abf
aeci,919998546783,a2,b4,,,wsf
Output File
aec,919922783456,a5,b3,DL,,asf
abc,918975583456,a1,b1,MH,,abf
aeci,919998546783,a2,b4,NOMATCH,,wsf
Notes
Need to compare phone number (Input file2.txt - 2nd field - initial 6 digit only) within Input file1.txt - 2nd field with "=" separted). If there is match in intial 6 digit of phone number, then OUTPUT should contain 2 digit code from file (Input file1) into output in 5th field
File1.txt is having single code (for example MH) for mupltiple phone number intials.
If you have GNU awk, try the following. Run like:
awk -f script.awk file1.txt file2.txt
Contents of script.awk:
BEGIN {
FS="[=,]"
OFS=","
}
FNR==NR {
for(i=2;i<=NF;i++) {
a[$1][$i]
}
next
}
{
$5 = "NOMATCH"
for(j in a) {
for (k in a[j]) {
if (substr($2,0,6) == k) {
$5 = j
}
}
}
}1
Alternatively, here's the one-liner:
awk -F "[=,]" 'FNR==NR { for(i=2;i<=NF;i++) a[$1][$i]; next } { $5 = "NOMATCH"; for(j in a) for (k in a[j]) if (substr($2,0,6) == k) $5 = j }1' OFS=, file1.txt file2.txt
Results:
aec,919922783456,a5,b3,DL,,asf
abc,918975583456,a1,b1,MH,,abf
aeci,919998546783,a2,b4,NOMATCH,,wsf
If you have an 'old' awk, try the following. Run like:
awk -f script.awk file1.txt file2.txt
Contents of script.awk:
BEGIN {
# set the field separator to either an equals sign or a comma
FS="[=,]"
# set the output field separator to a comma
OFS=","
}
# for the first file in the arguments list
FNR==NR {
# loop through all the fields, starting at field two
for(i=2;i<=NF;i++) {
# add field one and each field to a pseudo-multidimensional array
a[$1,$i]
}
# skip processing the rest of the code
next
}
# for the second file in the arguments list
{
# set the default value for field 5
$5 = "NOMATCH"
# loop though the array
for(j in a) {
# split the array keys into another array
split(j,b,SUBSEP)
# if the first six digits of field two equal the value stored in this array
if (substr($2,0,6) == b[2]) {
# assign field five
$5 = b[1]
}
}
# return true, therefore print by default
}1
Alternatively, here's the one-liner:
awk -F "[=,]" 'FNR==NR { for(i=2;i<=NF;i++) a[$1,$i]; next } { $5 = "NOMATCH"; for(j in a) { split(j,b,SUBSEP); if (substr($2,0,6) == b[2]) $5 = b[1] } }1' OFS=, file1.txt file2.txt
Results:
aec,919922783456,a5,b3,DL,,asf
abc,918975583456,a1,b1,MH,,abf
aeci,919998546783,a2,b4,NOMATCH,,wsf
Try something like:
awk '
NR==FNR{
for(i=2; i<=NF; i++) A[$i]=$1
next
}
{
$5="NOMATCH"
for(i in A) if ($2~"^" i) $5=A[i]
}
1
' FS='[=,]' file1 FS=, OFS=, file2
With this script every field is printed out according to the longest word of the current file, but needs to have a line break every file. How can this be achieved?
awk 'BEGIN{ORS="\n"}FNR=NR{a[i++]=$0; if(length($0) > length(max)) max=$0;l=length(max)} END{ for(j=1; j<=i;j++) printf("%-"(l+1)"s,",a[j-1])}' file1 file2 >outfile
file1
HELLO
WORLD
SOUTH IS
WARM
NORTH IS
COLD
file2
HELLO
WORLD
SOUTH
WARM
NORTH
COLD
output
HELLO ,WORLD ,SOUTH IS ,WARM ,NORTH IS ,COLD
HELLO ,WORLD ,SOUTH ,WARM ,NORTH ,COLD
It's not entirely clear what you are asking for, but perhaps you just want:
FNR==1 {print "\n"}
Which will print a newline whenever it starts reading the first line of a file. Make sure this pattern/action is before any others so that the newline prints before any other action prints anything for the first line of the current file. (This does not appear to apply in your case, since no such action exists.)
Took me some time, got it solved with this script.
awk '{ NR>1 && FNR==1 ? l=length($0) && a[i++]= "\n" $0 : a[i++]=$0 }
{if(NR>1 && FNR==1) for(e=(i-c);e<=(i-1);e++) b[e]=d ;c=FNR; d=l }
{ if( length($0) > l) l=length($0)+1 }
END{for(e=(i-c+1);e<=i;e++) b[e]=d; for(j=1;j<=i;j++) printf("%-"b[j]"s,",a[j-1] )}' infiles* >outfile
#!/usr/bin/awk -f
function beginfile (file) {
split("", a)
max = 0
delim = ""
}
function endfile (file) {
for (i = 1; i <= lines; i++) {
printf "%s%-*s", delim, max, a[i]
delim = " ,"
}
printf "\n"
}
FILENAME != _oldfilename \
{
if (_oldfilename != "")
endfile(_oldfilename)
_oldfilename = FILENAME
beginfile(FILENAME)
}
END { endfile(FILENAME) }
{
len = length($0)
if (len > max) {
max = len
}
a[FNR] = $0
lines = FNR
}
To run it:
chmod u+x filename
./filename file1 file2
Note that in gawk you can do delete a instead of split("", a). GAWK 4 has builtin BEGINFILE and ENDFILE.