summarizing a text file in awk

I have a file of character sequences, and I would like to split each sequence into 3-character classes from beginning to end, and then get the count of each class. Here is a small example of the sequences for 2 IDs.
>ID1
ATGTCCAAGGGGATCCTGCAGGTGCATCCTCCGATCTGCGACTGCCCGGGCTGCCGAATA
TCCTCCCCGGTGAACCGGGGGCGGCTGGCAGACAAGAGGACAGTCGCCCTGCCTGCCGCC
>ID2
ATGAAACTTTCACCTGCGCTCCCGGGAACAGTTTCTGCTCGGACTCCTGATCGTTCACCT
CCCTGTTTTCCCGACAGCGAGGACTGTCTTTTCCAACCCGACATGGATGTGCTCCCAATG
ACCTGCCCGCCACCACCAGTTCCAAAGTTTGCACTCCTTAAGGATTATAGGCCTTCAGCT
Here is a small example of the output for ID1. I want the same output for every ID in the input file (the lines of characters belonging to each ID follow its header line). The counts for the next ID come right after the first, and so on.
ID1_3nt count
ATG 1
TCC 3
AAG 2
GGG 2
ATC 2
CTG 3
CAG 1
GTG 2
CAT 1
CCT 2
CCG 3
TGC 3
GAC 2
GGC 1
CGA 1
ATA 1
AAC 1
CGG 2
GCA 1
AGG 1
GCC 3
ACA 1
GTC 1
I tried this code:
awk '{i=0; printf ">%s\n",$2; while(i<=length($1)) {printf "%s\n", substr($1,i,3);i+=3}} /,substr,/ {count++}' | awk 'END { printf(" ID_3nt: %d",count)}
but it did not return what I want. Do you know how to improve it?
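For reference, a minimal POSIX-awk sketch of the substr() loop the question is reaching for (substr() is 1-based, so the index must start at 1 and step by 3; the input here is illustrative):

```shell
printf '>ID1\nATGTCCATG\n' |
awk '
/^>/ { id = substr($0, 2); next }            # remember the current ID
{
    # walk the line 3 characters at a time and count each class
    for (p = 1; p + 2 <= length($0); p += 3)
        cnt[substr($0, p, 3)]++
}
END {
    print id "_3nt count"
    for (c in cnt) print c, cnt[c]           # order is unspecified
}'
```

For ATGTCCATG this counts ATG twice and TCC once; a full solution also has to reset cnt at each new ID, which the answers below handle.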

How about this patsplit()-based implementation?
#! /usr/bin/awk -f

# initialize globally scoped vars...
function init() {
    split("", idx)   # index of each class (for ordering)
    split("", cls)   # class names, by index
    split("", cnt)   # occurrence count, by index
    sz = 0           # number of distinct classes for this ID
}

# process a sequence record
function proc(    i, n, x) {
    # split into 3-character chunks
    n = patsplit($0, a, /.../)
    for (i = 1; i <= n; ++i) {
        x = a[i]
        if (x in idx) {
            # class already seen: just increment its count
            ++cnt[idx[x]]
        } else {
            # new class: index it in
            cls[sz] = x
            cnt[sz] = 1
            idx[x] = sz++
        }
    }
}

# spit out the class summary
function flush(    i) {
    if (!sz)
        return
    for (i = 0; i < sz; ++i)
        print cls[i], cnt[i]
    init()
}

BEGIN {
    init()
}

/^>ID/ {
    flush()
    sub(/^>/, "")
    print $0 "_3nt count"
    next
}

{
    # we could have inlined proc(), but a function
    # gives us locally scoped variables
    proc()
}

END {
    flush()
}
(Note that patsplit() is a GNU awk extension, so this needs gawk.)

$ cat tst.awk
sub(/^>/,"") { if (NR>1) prt(); name=$0; next }
{ rec = rec $0 }
END { prt() }

function prt(    cnt, class) {
    while ( rec != "" ) {
        cnt[substr(rec,1,3)]++
        rec = substr(rec,4)
    }
    print name "_3nt count"
    for (class in cnt) {
        print class, cnt[class]
    }
}
$ awk -f tst.awk file
ID1_3nt count
ACA 1
AAC 1
CGA 1
CAT 1
GTG 2
CAG 1
GGG 2
CCG 3
CCT 2
GCA 1
ATA 1
GAC 2
AAG 2
GCC 3
ATC 2
TCC 3
CGG 2
CTG 3
GTC 1
AGG 1
GGC 1
TGC 3
ATG 1
ID2_3nt count
AAA 1
CCC 3
ACA 1
GTG 1
TTT 2
TGT 2
GTT 2
ACC 1
CCG 2
CTC 3
CCT 4
GCA 1
AAG 2
GAC 3
TCA 3
AGC 1
ACT 1
CGT 1
CGG 1
CTT 3
TAT 1
CAA 1
GAG 1
GAT 3
GGA 1
AGG 1
TGC 1
CCA 5
TTC 1
GCT 2
TCT 1
GCG 1
ATG 3
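A side note on the listings above: `for (class in cnt)` visits indices in an unspecified order, which is why the counts come out shuffled. With GNU awk you can request sorted traversal via the `PROCINFO["sorted_in"]` control (a gawk-only feature); a minimal sketch:

```shell
printf 'b\na\nb\n' |
gawk '
{ cnt[$1]++ }
END {
    # gawk extension: iterate indices in ascending string order
    PROCINFO["sorted_in"] = "@ind_str_asc"
    for (k in cnt) print k, cnt[k]
}'
```

Dropping those two words into the `prt()` function above would emit each ID's classes alphabetically.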

Related

Using awk to count number of row group

I have a data set: (file.txt)
X Y
1 a
2 b
3 c
10 d
11 e
12 f
15 g
20 h
25 i
30 j
35 k
40 l
41 m
42 n
43 o
46 p
I want to add two columns, Up10 and Down10:
Up10: the count of rows with X values from (X-10) up to (X).
Down10: the count of rows with X values from (X) up to (X+10).
For example:
X Y Up10 Down10
35 k 3 5
For Up10: 35-10 = 25, so the rows X=25, X=30, X=35 qualify; total = 3 rows.
For Down10: 35+10 = 45, so the rows X=35, X=40, X=41, X=42, X=43 qualify; total = 5 rows.
Desired Output:
X Y Up10 Down10
1 a 1 5
2 b 2 5
3 c 3 4
10 d 4 5
11 e 5 4
12 f 5 3
15 g 4 3
20 h 5 3
25 i 3 3
30 j 3 3
35 k 3 5
40 l 3 5
41 m 3 4
42 n 4 3
43 o 5 2
46 p 5 1
This is Pierre François's solution (thanks again, @Pierre François):
awk '
BEGIN { OFS="\t"; print "X\tY\tUp10\tDown10" }
(NR == FNR) && (FNR > 1) { a[$1] = $1 + 0 }
(NR > FNR) && (FNR > 1) {
    up = 0;   upl   = $1 - 10
    down = 0; downl = $1 + 10
    for (i in a) {
        i += 0   # tricky: convert i to integer
        if ((i >= upl) && (i <= $1))   { up++ }
        if ((i >= $1) && (i <= downl)) { down++ }
    }
    print $1, $2, up, down
}
' file.txt file.txt > file-2.txt
But when I use this command on 13 GB of data, it takes too long.
I have also tried this approach on the 13 GB data:
awk 'BEGIN{ FS=OFS="\t" }
NR==FNR{a[NR]=$1;next} {x=y=FNR;while(--x in a&&$1-10<a[x]){} while(++y in a&&$1+10>a[y]){} print $0,FNR-x,y-FNR}
' file.txt file.txt > file-2.txt
When file-2.txt reaches 1.1 GB the command appears frozen. I have waited several hours, but it never finishes and no final output file appears.
Note: I am working on Google Cloud, machine type e2-highmem-8 (8 vCPUs, 64 GB memory).
A single-pass awk that keeps a sliding window of the 10 most recent records and uses it to count the ups and downs. For symmetry's sake there should be deletes in the END block, but I guess a few extra array elements in memory aren't going to make a difference:
$ awk '
BEGIN {
    FS=OFS="\t"
}
NR==1 {
    print $1,$2,"Up10","Down10"
}
NR>1 {
    a[NR]=$1
    b[NR]=$2
    for(i=NR-9;i<=NR;i++) {
        if(a[i]>=a[NR]-10&&i>=2)
            up[NR]++
        if(a[i]<=a[NR-9]+10&&i>=2)
            down[NR-9]++
    }
}
NR>10 {
    print a[NR-9],b[NR-9],up[NR-9],down[NR-9]
    delete a[NR-9]
    delete b[NR-9]
    delete up[NR-9]
    delete down[NR-9]
}
END {
    for(nr=NR+1;nr<=NR+9;nr++) {
        for(i=nr-9;i<=nr;i++)
            if(a[i]<=a[nr-9]+10&&i>=2&&i<=NR)
                down[nr-9]++
        print a[nr-9],b[nr-9],up[nr-9],down[nr-9]
    }
}' file
Output:
X Y Up10 Down10
1 a 1 5
2 b 2 5
...
35 k 3 5
...
43 o 5 2
46 p 5 1
Another single-pass approach with a sliding window:
awk '
NR == 1 { next }   # skip the header
NR == 2 { min = max = cur = 1; X[cur] = $1; Y[cur] = $2; next }
{
    X[++max] = $1; Y[max] = $2
    if (X[cur] >= $1 - 10) next
    for (; X[cur] + 10 < X[max]; ++cur) {
        for (; X[min] < X[cur] - 10; ++min) {
            delete X[min]
            delete Y[min]
        }
        print X[cur], Y[cur], cur - min + 1, max - cur
    }
}
END {
    for (; cur <= max; ++cur) {
        for (; X[min] < X[cur] - 10; ++min);
        for (i = max; i > cur && X[cur] + 10 < X[i]; --i);
        print X[cur], Y[cur], cur - min + 1, i - cur + 1
    }
}
' file
The script assumes the X column is ordered numerically.
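If the data is not already ordered on X, a sort step before the awk pass keeps that assumption intact. A sketch (file names are illustrative): keep the header line, then sort the body numerically on the first column.

```shell
# Keep the header, sort the body numerically on X (column 1)
{ head -n 1 file.txt; tail -n +2 file.txt | sort -k1,1n; } > file.sorted.txt
```

For multi-gigabyte inputs, GNU sort spills to temporary files on disk, so this stays within memory limits where an in-awk array would not.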

replacing associative array indexes with their value using awk or sed

I would like to replace column values of ref using key value pairs from id
cat id:
[1] a 8-23
[2] g 8-21
[3] d 8-13
cat ref:
a 1 2
b 3 4
c 5 3
d 1 2
e 3 1
f 1 2
g 2 3
desired output
8-23 1 2
b 3 4
c 5 3
8-13 1 2
e 3 1
f 1 2
8-21 2 3
I assume it would be best done using awk.
cat replace.awk
BEGIN { OFS="\t" }
NR==FNR {
    a[$2]=$3; next
}
$1 in !{!a[#]} {
    print $0
}
Not sure what I need to change?
$1 in !{!a[#]} is not awk syntax. You just need $1 in a:
BEGIN { OFS="\t" }
NR==FNR {
    a[$2] = $3
    next
}
{
    $1 = ($1 in a) ? a[$1] : $1
    print
}
To force $0 to be rebuilt with the new OFS, this version always assigns to $1.
print with no arguments prints $0.
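Putting it together, the answer's script can be exercised end-to-end as a self-contained sketch (file names `id` and `ref` as in the question). Note that because $1 is always assigned, every output line is rebuilt with OFS, i.e. tab-separated:

```shell
printf '[1] a 8-23\n[2] g 8-21\n[3] d 8-13\n' > id
printf 'a 1 2\nb 3 4\nc 5 3\nd 1 2\ne 3 1\nf 1 2\ng 2 3\n' > ref

awk '
BEGIN { OFS = "\t" }
NR == FNR {           # first file: build the key -> value map
    a[$2] = $3
    next
}
{                     # second file: swap $1 if it has a mapping
    $1 = ($1 in a) ? a[$1] : $1
    print
}' id ref
```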

Is there an efficient way to do a vertical lookup kind of task in AWK using multiple files?

I am struggling a lot with the following task, which I am currently trying to accomplish using AWK. I am not very familiar with AWK, so I am not even sure if AWK is the best tool for this. If this is better solved with Python, please let me know (but I know even less Python).
I need to prepare an input file for an analysis which is based on collecting gene expression P-values of multiple species from different files. For each species there are multiple treatment files.
In brief: I need to collect P-values linked to sequence IDs from multiple files and put them in a single file ordered by Orthogroup. For each Orthogroup I only need to keep the lowest P-value per species treatment file.
Orthogroupfile: A list of all orthogroups: one orthogroup per line; every column is a sequenceID, and the 1st column is the orthogroupID.
OG0052916: TRINITY_TN_DN99904_c0_g1 TRINITY_AG_DN38054_c0_g1 TRINITY_AG_DN41618_c0_g1 TRINITY_AG_DN47300_c0_g1
OG0001002: TRINITY_AG_DN119624_c0_g1 TRINITY_AG_DN161549_c0_g1 TRINITY_AG_DN60596_c0_g1 TRINITY_MB_DN61252_c1_g1 TRINITY_SE_DN51134_c2_g1 TRINITY_SL_DN27816_c0_g1 TRINITY_SL_DN76945_c4_g1 TRINITY_SL_DN77747_c0_g1 TRINITY_SL_DN77747_c1_g1 TRINITY_TN_DN52316_c0_g1
OG0002002: TRINITY_AG_DN56841_c0_g1 TRINITY_MB_DN200880_c1_g1 TRINITY_SE_DN45370_c1_g1 TRINITY_SE_DN53999_c0_g1 TRINITY_SL_DN16333_c0_g1 TRINITY_SL_DN65991_c0_g1 TRINITY_TN_DN180200_c0_g1 TRINITY_TN_DN48658_c0_g1
OG0052920: TRINITY_TN_DN99983_c0_g1 TRINITY_AG_DN12345_c0_g1
Speciesfile: For each species I have a separate file summarising differential gene expression data, but for every species I have multiple treatments and thus multiple species treatment files. What matters to me are the P-value (10th column) and the sequence ID (1st column). Each species in the analysis has such a file; the two-letter code in the sequence IDs is a species code ("AG", "TN", "SE", "SL", "MB").
Speciesfile treatment 1 e.g. AG.txt:
AG.txt:
TRINITY_AG_DN38054_c0_g1 0.364813449
TRINITY_AG_DN41618_c0_g1 0.000130019
TRINITY_AG_DN47300_c0_g1 0.000195804
TRINITY_AG_DN119624_c0_g1 0.067
TRINITY_AG_DN161549_c0_g1 0.00036
TRINITY_AG_DN60596_c0_g1 0.023
TRINITY_AG_DN12345_c0_g1 NA
TRINITY_AG_DN56841_c0_g1 0.034
Speciesfile treatment 2 e.g. AA.txt:
TRINITY_AG_DN38054_c0_g1 3.364813449e-07
TRINITY_AG_DN41618_c0_g1 6.000130019e-03
TRINITY_AG_DN47300_c0_g1 8.000195804e-02
TRINITY_AG_DN119624_c0_g1 5.067e-05
TRINITY_AG_DN161549_c0_g1 5.00036e-06
TRINITY_AG_DN60596_c0_g1 4.023e-7
TRINITY_AG_DN12345_c0_g1 0.03
TRINITY_AG_DN56841_c0_g1 2.034e-2
Speciesfile treatment 1 e.g. TN.txt:
TRINITY_TN_DN99904_c0_g1 0.005
TRINITY_TN_DN99983_c0_g1 0.063
TRINITY_TN_DN180200_c0_g1 0.0326
TRINITY_TN_DN48658_c0_g1 0.02762
TRINITY_TN_DN52316_c0_g1 0.000737267
speciesfile treatment 2 e.g. TA.txt
TRINITY_TN_DN99904_c0_g1 6.005e-4
TRINITY_TN_DN99983_c0_g1 9.063e-03
TRINITY_TN_DN180200_c0_g1 1.0326e-1
TRINITY_TN_DN48658_c0_g1 3.02762e-09
TRINITY_TN_DN52316_c0_g1 2.000737267e-10
MB.txt:
TRINITY_MB_DN61252_c1_g1 0.0004378
TRINITY_MB_DN200880_c1_g1 0.00007281
SE.txt:
TRINITY_SE_DN51134_c2_g1 0.0007367
TRINITY_SE_DN53999_c0_g1 0.00376
TRINITY_SE_DN45370_c1_g1 0.00067356
The output file that I need summarises information from the different species, with an Orthogroup on each line. I am only interested in the P-values.
First column: Orthogroup ID
Second column: lowest P-value for all genes of sp1 in this Orthogroup (e.g. "AG"; this is species-treatment-file dependent)
Third column: total nr. of genes of sp1 in this Orthogroup (this will be the same for different treatments of the same species)
Fourth column: total number of genes of sp1 in the cluster (this can always be the same as the third column)
The next three columns repeat the same for the next species, and so on. NA if there are no genes of that species in that orthogroup.
Example output.txt, which includes the P-value information for all different species "AG", "MB", "TN", "SE" and "SL":
Group AG-Pvalue AG-nGenes AG-ClusterSize MB-Pvalue MB-nGenes MB-ClusterSize SE-Pvalue SE-nGenes SE-ClusterSize TN-Pvalue TN-nGenes TN-ClusterSize AA-Pvalue AA-nGenes AA-ClusterSize TA-Pvalue TA-nGenes TA-ClusterSize
OG0052916 0.000130019 3 3 NA NA NA NA NA NA 0.005 1 1 3.364813449e-07 3 3 6.005e-4 1 1
OG0002002 0.034 1 1 0.00007281 1 1 0.00067356 3 3 0.02762 2 2 2.034e-2 1 1 3.02762e-09 2 2
OG0001002: 0.00036 3 3 0.0004378 1 1 0.0007367 1 1 0.000737267 1 1 5.067e-05 3 3 2.000737267e-10 1 1
OG0052920: NA NA NA NA NA NA NA NA NA 0.063 1 1 0.03 1 1 9.063e-03 1 1
"Next-Orthogroup" "lowest P-value of the diet treatment per species" "nr of genes of this species in this orthogroup"
I realise this problem consists of 3 different sub-problems:
1. a simple vertical look-up
2. an if-then choice: if there are multiple genes in an Orthogroup, copy the lowest P-value
3. calculating the number of genes per species per Orthogroup.
I wanted to tackle these one by one, but failed already at the first step:
awk 'NR==FNR{a[$0];next} $1 in a {print $10}' Orthogroups1.txt TN.txt
Check all columns of file 1 for occurrence in file 2 and print the 10th column.
If anyone could help me with the above? Even if it is just a direction, thank you so much!
The following awk script performs these steps, based on the question (assuming the latest post captures all requirements):
Load the lookup tables AG.txt, MB.txt, ... (BEGIN block)
Read the main data file, and find the min and count per group/species.
Print the output (END block)
awk '
BEGIN {
    # Load all XX.txt files
    n_species = split("AG,MB,TN,SE,SL", species, ",")
    for (s in species) {
        sfile = species[s] ".txt"
        nn = 0
        while ( (getline < sfile) > 0 ) { v[$1] = $2; nn++ }
        print "Loaded:", sfile, nn > "/dev/stderr"
    }
}
{
    g = $1   # Group
    # Calculate count, min per group
    for (i = 2; i <= NF; i++) {
        id = $i
        split(id, parts, "_")
        ss = parts[2]   # Species
        val = v[id]
        if ( val ) {
            if ( !vcount[g, ss] || val < vmin[g, ss] ) vmin[g, ss] = val
            vcount[g, ss]++
            group[g]++
            # print "SET", id, g, ss, val, vmin[g, ss], vcount[g, ss]
        }
    }
}
END {
    # Header line
    printf "%s", "group"
    for (s in species) {
        ss = species[s]
        printf " %s-PValue %s-nGenees %s-ClusterSize", ss, ss, ss
    }
    printf "\n"
    # Data lines
    ng = 0
    for (g in group) {
        ng++
        printf "%s", g
        for (s in species) {
            ss = species[s]
            # print "GET", g, ss, vmin[g, ss], vcount[g, ss], "X"
            s_min = vmin[g, ss]
            s_count = vcount[g, ss]
            s_cs = vcount[g, ss]
            if ( !s_count ) { s_count = s_min = s_cs = "NA" }
            printf " %s %s %s", s_min, s_count, s_cs
        }
        printf "\n"
    }
    print "Groups:", ng > "/dev/stderr"
}' < data.txt
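One robustness note on the BEGIN-block loader (my observation, not part of the original answer): `getline < file` returns -1 on a read error and the loop simply never runs, so a missing XX.txt is silently treated as empty. Checking the return value and calling close() makes that visible; a sketch with an illustrative file name:

```shell
awk '
BEGIN {
    sfile = "missing-species.txt"   # illustrative file name
    # getline returns 1 per record, 0 at EOF, -1 on error
    while ((r = (getline line < sfile)) > 0) nn++
    if (r < 0) print "cannot read " sfile > "/dev/stderr"
    close(sfile)                    # allow the file to be re-read later
    print "Loaded:", sfile, nn + 0
}'
```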
Upgraded Answer, to address additional data files, per additional information from OP:
Invoke with list of species/treatments, the indicator 'DATA=1', and the data file.
script.awk AG.txt MB.txt SE.txt TN.txt AA.txt TA.txt DATA=1 data.txt
script.awk
#! /usr/bin/awk -f
!DATA {
    # Calculate key from file name
    if ( FNR == 1 ) {
        ncol++
        k = gensub("(.*/)?([^/]+)\\.([^/]*)$", "\\2", 1, FILENAME)
        cols[ncol] = k
    }
    v[k, $1] = $2
    # Track keys
    ccount[k]++
    next
}
{
    g = $1   # Group
    # Calculate count, min per group
    for (i = 2; i <= NF; i++) {
        id = $i
        split(id, parts, "_")
        for (k in cols) {
            ss = cols[k]
            val = v[ss, id]
            if ( !val ) continue
            if ( !vcount[g, ss] || val < vmin[g, ss] ) vmin[g, ss] = val
            vcount[g, ss]++
            gcount[g, ss]++
            group[g]++
            # print "SET", id, g, ss, val, vmin[g, ss], vcount[g, ss]
        }
    }
}
END {
    # Header line
    printf "%s", "group"
    for (k in cols) {
        ss = cols[k]
        printf " %s-PValue %s-nGenees %s-ClusterSize", ss, ss, ss
    }
    printf "\n"
    # Data lines
    ng = 0
    for (g in group) {
        ng++
        printf "%s", g
        for (k in cols) {
            ss = cols[k]
            s_min = vmin[g, ss]
            s_count = vcount[g, ss]
            s_cs = gcount[g, ss]
            # print "GET", g, ss, vmin[g, ss], vcount[g, ss], "X"
            if ( !s_count ) { s_count = s_min = s_cs = "NA" }
            printf " %s %s %s", s_min, s_count, s_cs
            # printf " %s %d %d", vmin[g, ss] ? vmin[g, ss] : "NA", vcount[g, ss], vcount[g, ss]
        }
        printf "\n"
    }
    for (k in cols) {
        ss = cols[k]
        print "Col:", ss, ccount[ss] > "/dev/stderr"
    }
    print "Groups:", ng > "/dev/stderr"
}
Output for:
awk -f ./script-spcomp.awk AG.txt MB.txt SE.txt TN.txt AA.txt TA.txt DATA=1 data.txt
group AG-PValue AG-nGenees AG-ClusterSize MB-PValue MB-nGenees MB-ClusterSize SE-PValue SE-nGenees SE-ClusterSize TN-PValue TN-nGenees TN-ClusterSize AA-PValue AA-nGenees AA-ClusterSize TA-PValue TA-nGenees TA-ClusterSize
OG0052920: NA NA NA NA NA NA NA NA NA 0.063 1 1 NA NA NA 9.063e-03 1 1
OG0052916: 0.000130019 3 3 NA NA NA NA NA NA 0.005 1 1 3.364813449e-07 3 3 6.005e-4 1 1
OG0002002: 0.034 1 1 0.00007281 1 1 0.00067356 2 2 0.02762 2 2 2.034e-2 1 1 3.02762e-09 2 2
OG0001002: 0.00036 3 3 0.0004378 1 1 0.0007367 1 1 0.000737267 1 1 4.023e-7 3 3 2.000737267e-10 1 1
When running with a reordered list of columns, the output is:
awk -f script-spcomp.awk TA.txt SE.txt MB.txt AG.txt AA.txt TN.txt DATA=1 ortho.txt
group TA-PValue TA-nGenees TA-ClusterSize SE-PValue SE-nGenees SE-ClusterSize MB-PValue MB-nGenees MB-ClusterSize AG-PValue AG-nGenees AG-ClusterSize AA-PValue AA-nGenees AA-ClusterSize TN-PValue TN-nGenees TN-ClusterSize
OG0052920: 9.063e-03 1 1 NA NA NA NA NA NA NA NA NA NA NA NA 0.063 1 1
OG0052916: 6.005e-4 1 1 NA NA NA NA NA NA 0.000130019 3 3 3.364813449e-07 3 3 0.005 1 1
OG0002002: 3.02762e-09 2 2 0.00067356 2 2 0.00007281 1 1 0.034 1 1 2.034e-2 1 1 0.02762 2 2
OG0001002: 2.000737267e-10 1 1 0.0007367 1 1 0.0004378 1 1 0.00036 3 3 4.023e-7 3 3 0.000737267 1 1
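One caveat worth flagging (my observation, not part of the original answer): the script prints columns with `for (k in cols)`, and the visiting order of `in` loops is unspecified in POSIX awk, so the column order is not guaranteed to match the invocation order on every awk. Since `cols` is indexed 1..ncol, a plain counting loop pins the order down; a minimal sketch:

```shell
awk '
BEGIN {
    # simulate the per-file key registration done in the !DATA block
    cols[++ncol] = "AG"; cols[++ncol] = "MB"; cols[++ncol] = "TN"
    # a counting loop preserves registration order in every awk flavor
    for (k = 1; k <= ncol; k++)
        printf "%s%s", cols[k], (k < ncol ? " " : "\n")
}'
```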

Awk code with associative arrays -- array doesn't seem populated, but no error

Question: Why does it seem that date_list[d] and isin_list[i] are not getting populated, in the code segment below?
AWK Code (on GNU-AWK on a Win-7 machine)
BEGIN { FS = "," }   # This SEBI data set has comma-separated fields (NSE snapshots are pipe-separated)

# UPDATE the lists for DATE ($10), firm_ISIN ($9), EXCHANGE ($12), and FII_ID ($5).
( $17 ~ /_EQ\>/ ) {
    if (date[$10]++ == 0) date_list[d++] = $10   # Dates appear in order in raw data
    if (isin[$9]++ == 0) isin_list[i++] = $9     # ISINs appear out of order in raw data
    print $10, date[$10], $9, isin[$9], date_list[d], d, isin_list[i], i
}
input data
49290,C198962542782200306,6/30/2003,433581,F5811773991200306,S5405611832200306,B5086397478200306,NESTLE INDIA LTD.,INE239A01016,6/27/2003,1,E9035083824200306,REG_DL_STLD_02,591.13,5655,3342840.15,REG_DL_INSTR_EQ,REG_DL_DLAY_P,DL_RPT_TYPE_N,DL_AMDMNT_DEL_00
49291,C198962542782200306,6/30/2003,433563,F6292896459200306,S6344227311200306,B6110521493200306,GRASIM INDUSTRIES LTD.,INE047A01013,6/27/2003,1,E9035083824200306,REG_DL_STLD_02,495.33,3700,1832721,REG_DL_INSTR_EQ,REG_DL_DLAY_P,DL_RPT_TYPE_N,DL_AMDMNT_DEL_00
49292,C198962542782200306,6/30/2003,433681,F6513202607200306,S1724027402200306,B6372023178200306,HDFC BANK LTD,INE040A01018,6/26/2003,1,E745964372424200306,REG_DL_STLD_02,242,2600,629200,REG_DL_INSTR_EQ,REG_DL_DLAY_D,DL_RPT_TYPE_N,DL_AMDMNT_DEL_00
49293,C7885768925200306,6/30/2003,48128,F4406661052200306,S7376401565200306,B4576522576200306,Maruti Udyog Limited,INE585B01010,6/28/2003,3,E912851176274200306,REG_DL_STLD_04,125,44600,5575000,REG_DL_INSTR_EQ,REG_DL_DLAY_P,DL_RPT_TYPE_N,DL_AMDMNT_DEL_00
49294,C7885768925200306,6/30/2003,48129,F4500260787200306,S1312094035200306,B4576522576200306,Maruti Udyog Limited,INE585B01010,6/28/2003,4,E912851176274200306,REG_DL_STLD_04,125,445600,55700000,REG_DL_INSTR_EQ,REG_DL_DLAY_P,DL_RPT_TYPE_N,DL_AMDMNT_DEL_00
49295,C7885768925200306,6/30/2003,48130,F6425024637200306,S2872499118200306,B4576522576200306,Maruti Udyog Limited,INE585B01010,6/28/2003,3,E912851176274200306,REG_DL_STLD_04,125,48000,6000000,REG_DL_INSTR_EU,REG_DL_DLAY_P,DL_RPT_TYPE_N,DL_AMDMNT_DEL_00
output that I am getting
6/27/2003 1 INE239A01016 1 1 1
6/27/2003 2 INE047A01013 1 1 2
6/26/2003 1 INE040A01018 1 2 3
6/28/2003 1 INE585B01010 1 3 4
6/28/2003 2 INE585B01010 2 3 4
Expected output
As far as I can tell, the print is correctly printing out (i) $10 (the date), (ii) date[$10], the count for each date, (iii) $9 (the firm ID, called ISIN), (iv) isin[$9], the count for each ISIN, (v) d (the index of date_list, i.e. the number of unique dates), and (vi) i (the index of isin_list, i.e. the number of unique ISINs). I should also get two more columns -- columns 5 and 7 below -- for date_list[d] and isin_list[i], which should have values that look like $10 and $9.
6/27/2003 1 INE239A01016 1 6/27/2003 1 INE239A01016 1
6/27/2003 2 INE047A01013 1 6/27/2003 1 INE047A01013 2
6/26/2003 1 INE040A01018 1 6/26/2003 2 INE040A01018 3
6/28/2003 1 INE585B01010 1 6/28/2003 3 INE585B01010 4
6/28/2003 2 INE585B01010 2 6/28/2003 3 INE585B01010 4
actual code I now use is
{ if (date[$10]++ == 0) date_list[d++] = $10;
  if (isin[$9]++ == 0) isin_list[i++] = $9; }
( $11 ~ /1|2|3|5|9|1[24]/ ) { ++BNR[$10,$9,$12,$5] }
END {
    for (u = 0; u < d; u++)
        for (v = 0; v < i; v++) {
            if (BNR[date_list[u],isin_list[v]] > 0)
                BR = BNR[date_list[u],isin_list[v]]
            print(date_list[u], isin_list[v], BR)
        }
}
Thanks a lot to everyone.
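For what it's worth, the two missing columns follow from the post-increment in the question's own code: after `date_list[d++] = $10` the stored entry lives at index d-1, so `date_list[d]` is always unset and prints as an empty string. A minimal demonstration (illustrative values):

```shell
awk 'BEGIN {
    if (seen["6/27/2003"]++ == 0) date_list[d++] = "6/27/2003"
    # d is now 1, but the entry was stored at index 0: date_list[d] is empty
    print "d=" d, "list[d]=(" date_list[d] ")", "list[d-1]=(" date_list[d-1] ")"
}'
```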

transpose column and rows using gawk

I am trying to transpose a really long file and I am concerned that it will not be transposed entirely.
My data looks something like this:
Thisisalongstring12345678 1 AB abc 937 4.320194
Thisisalongstring12345678 1 AB efg 549 0.767828
Thisisalongstring12345678 1 AB hi 346 -4.903441
Thisisalongstring12345678 1 AB jk 193 7.317946
I want my data to look like this:
Thisisalongstring12345678 Thisisalongstring12345678 Thisisalongstring12345678 Thisisalongstring12345678
1 1 1 1
AB AB AB AB
abc efg hi jk
937 549 346 193
4.320194 0.767828 -4.903441 7.317946
Would the length of the first string prove to be an issue? My file is much longer than this, approx. 2000 lines long. Also, is it possible to change the name of the first string to Thisis234, and then transpose?
I don't see why it would not be - unless you don't have enough memory. Try the below and see if you run into problems.
Input:
$ cat inf.txt
a b c d
1 2 3 4
. , + -
A B C D
Awk program:
$ cat mkt.sh
awk '
{
    for(c = 1; c <= NF; c++) {
        a[c, NR] = $c
    }
    if(max_nf < NF) {
        max_nf = NF
    }
}
END {
    for(r = 1; r <= NR; r++) {
        for(c = 1; c <= max_nf; c++) {
            printf("%s ", a[r, c])
        }
        print ""
    }
}
' inf.txt
Run:
$ ./mkt.sh
a 1 . A
b 2 , B
c 3 + C
d 4 - D
Credits:
http://www.chemie.fu-berlin.de/chemnet/use/info/gawk/gawk_12.html#SEC121
Hope this helps.
This can be done with the rs BSD command:
http://www.unix.com/man-page/freebsd/1/rs/
Check out the -T option.
I tried icyrock.com's answer, but found that I had to change:
for(r = 1; r <= NR; r++) {
for(c = 1; c <= max_nf; c++) {
to
for(r = 1; r <= max_nf; r++) {
for(c = 1; c <= NR; c++) {
to get the NR columns and max_nf rows. So icyrock's code becomes:
$ cat mkt.sh
awk '
{
    for(c = 1; c <= NF; c++) {
        a[c, NR] = $c
    }
    if(max_nf < NF) {
        max_nf = NF
    }
}
END {
    for(r = 1; r <= max_nf; r++) {
        for(c = 1; c <= NR; c++) {
            printf("%s ", a[r, c])
        }
        print ""
    }
}
' inf.txt
If you don't do that and use an asymmetrical input, like:
a b c d
1 2 3 4
. , + -
You get:
a 1 .
b 2 ,
c 3 +
i.e. still 3 rows and 4 columns (the last of which is blank).
For @ScubaFishi's and @icyrock's code: the "if (max_nf < NF)" check seems unnecessary. I deleted it, and the code works just fine.