awk '{for (i = 1; i <= NF; i++) {gsub(/[^[:alnum:]]/, " "); print tolower($i)": "NR | "sort -V | uniq";}}' input.txt
With the above code, I get this output:
line1: 2
line1: 3
line1: 5
line2: 1
line2: 2
line3: 10
I want it like this:
line1: 2, 3, 5
line2: 1, 2
line3: 10
How can I achieve this?
Use gawk's array features to collect the line numbers for each word:
awk '{
    gsub(/[^[:alnum:]]/, " ")    # turn punctuation into spaces; fields are re-split
    for (i = 1; i <= NF; i++)
        arr[tolower($i)] = arr[tolower($i)] NR ", "
}
END {
    for (x in arr)
        print x": "substr(arr[x], 1, length(arr[x])-2)
}' input.txt | sort
Note that this includes duplicate line numbers if a word appears multiple times on the same line.
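If you want each line number recorded only once per word, a seen array can skip the repeats; a minimal sketch along the same lines:
awk '{
    gsub(/[^[:alnum:]]/, " ")
    for (i = 1; i <= NF; i++) {
        w = tolower($i)
        if (!((w, NR) in seen)) {    # record each (word, line) pair only once
            seen[w, NR] = 1
            arr[w] = arr[w] NR ", "
        }
    }
}
END {
    for (x in arr)
        print x": "substr(arr[x], 1, length(arr[x])-2)
}' input.txt | sort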
Using perl:
#!/usr/bin/perl
while (<>) {
    if ( /(\w+):\s*(\d+)/ ) {    # extract the parts
        $arr{lc($1)}{$2}++;      # count them
    }
}
for my $k (sort keys %arr) {     # print keys sorted alphabetically
    print "$k: ";
    $lines = $arr{$k};
    print join(", ", sort {$a<=>$b} keys %$lines), "\n";    # line numbers sorted numerically
}
This solution removes duplicates and sorts the line numbers. Is this what you needed?
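For example, feeding it the intermediate output from the question (the script saved under the hypothetical name dedupe.pl):
$ perl dedupe.pl <<'EOF'
line1: 2
line1: 3
line1: 5
line2: 1
line2: 2
line3: 10
EOF
line1: 2, 3, 5
line2: 1, 2
line3: 10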
Related
I have a comma-delimited file. I want to count the number of non-empty rows in each column (the column count/length) and report each count with its header name.
Example Dataset:
ID, IB, IM, IZ
0.05, 0.02, 0.01, 0.09
0.06, 0.01, , 0.08
0.02, 0.06,
Column ID: 3
Column IB: 3
Column IM: 1
Column IZ: 2
I have tried quite a few options:
I can split these columns into separate files and then count the number of lines in each file using the wc -l file_name command.
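A rough sketch of that attempt (assuming four columns and an input named file.csv; the file names are illustrative):
for i in 1 2 3 4; do
  cut -d, -f"$i" file.csv > "col_$i.txt"   # one file per column
  wc -l "col_$i.txt"                       # counts the header and empty cells too
done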
This approach is very close to what I am interested in, but I am still unable to get the header names. Any help will be highly appreciated.
I would use GNU AWK for this task in the following way. Let file.txt content be
ID, IB, IM, IZ
0.05, 0.02, 0.01, 0.09
0.06, 0.01, , 0.08
0.02, 0.06,
then
awk 'BEGIN{FS=","}NR==1{split($0,names);next}{for(i=1;i<=NF;i+=1){counts[i]+=$i~/[^[:space:]]/}}END{for(i=1;i<=length(names);i+=1){print "Column",names[i]": "counts[i]}}' file.txt
output
Column ID: 3
Column IB: 3
Column IM: 1
Column IZ: 2
Explanation: I inform GNU AWK that , is the field separator. When processing the 1st record, I split the whole line ($0) into the array names, so ID becomes names[1], IB becomes names[2], IM becomes names[3], and so on; after doing that, go to the next line. For every other line, iterate over the columns with a for loop and increase counts[i] (where i is the column number) by the result of "does this column contain a non-whitespace character?", which is 0 for false and 1 for true. In other words, increase by 1 if a non-whitespace character is found, else by 0. After processing all lines, iterate over names and print each name with the corresponding value of counts.
(tested in gawk 4.2.1)
With GNU awk:
awk -F'[[:space:]]*,[[:space:]]*' 'NR == 1 {header = $0; next} \
{for(i = 1; i <= NF; i++) n[i] += ($i ~ /\S/)} \
END {$0 = header; for(i = 1; i <= NF; i++) print "Column " $i ": " n[i]}' file
Column ID: 3
Column IB: 3
Column IM: 1
Column IZ: 2
Using any awk in any shell on every Unix box:
$ cat tst.awk
BEGIN { FS=" *, *" }            # commas with optional surrounding blanks
NR == 1 {
    numCols = split($0,tags)    # save the header names
    next
}
{
    for ( i=1; i<=NF; i++ ) {
        if ( $i ~ /./ ) {       # count only non-empty cells
            cnt[i]++
        }
    }
}
END {
    for ( i=1; i<=numCols; i++ ) {
        printf "Column %s:%d\n", tags[i], cnt[i]
    }
}
$ awk -f tst.awk file
Column ID:3
Column IB:3
Column IM:1
Column IZ:2
I have the following format of data:
1 3
1.723608 0.8490000
1.743011 0.8390000
1.835833 0.7830000
2 5
1.751377 0.8350000
1.907603 0.7330000
1.780053 0.8190000
1.601427 0.9020000
1.950540 0.6970000
3 2
1.993951 0.6610000
1.796519 0.8090000
4 4
1.734961 0.8430000
1.840741 0.7800000
1.818444 0.7950000
1.810717 0.7980000
5 1
2.037940 0.6150000
6 7
1.738221 0.8330000
1.767678 0.8260000
1.788517 0.8140000
2.223586 0.4070000
1.667492 0.8760000
2.039232 0.6130000
1.758823 0.8300000
...
Data consists of data blocks. Each data block has the same format as follows:
The very first line is the header line. The header line contains the ID number and the total number of lines of each data block. For example, the first data block's ID is 1, and the total number of lines is 3. For the third data block, ID is 3, and the total number of lines is 2. All data blocks have this header line.
Then, the "real data" follows. As I explained, the number of lines of "real data" is designated in the second integer of the header line.
Accordingly, the total number of lines for each data block is number_of_lines+1. In this example, the total number of lines for data block 1 is 4, and data block 2 takes 6 lines...
This format repeats for up to 10000 data blocks in my current data, and I can provide this 10000 as an input variable to the bash or awk script. I know the total number of data blocks.
Now, what I wish to do is get the average of each of the two data columns and print it together with the data block's ID number and total number of lines. Each output line will have:
ID_number number_of_lines average_of_column_1 average_of_column_2
using 5 spaces between columns, with 6 decimal places. The result will have 10000 lines, and each line will have the ID, the number of lines, and the averages of columns 1 and 2 for that data block. The result for this example will look like
1 3 1.767484 0.823666
2 5 1.798200 0.797200
3 2 1.895235 0.735000
...
I know how to get the average of a simple data column in awk and bash; that has been answered on Stack Overflow many times. For example, I really favor
awk '{ total += $2; count++ } END { print total/count }' data.txt
So, I wish to do this using awk or bash. But I have no clue how to even start computing this kind of average over multiple repeating data blocks, each with a different number of lines.
I tried an awk-based approach, following
Awk average of n data in each column
and
https://www.unix.com/shell-programming-and-scripting/135829-partial-average-column-awk.html
But I'm not sure how I can use NR or FNR to average data when each data block has a different number of lines.
You may try this awk:
awk -v OFS='\t' '$2 ~ /\./ {s1 += $1; s2 += $2; next} {if (id) {print id, num, s1/num, s2/num; s1=s2=0} id=$1; num=$2} END {print id, num, s1/num, s2/num}' file
1 3 1.76748 0.823667
2 5 1.7982 0.7972
3 2 1.89524 0.735
4 4 1.80122 0.804
5 1 2.03794 0.615
6 7 1.85479 0.742714
Use OFMT to get fixed-precision decimal numbers like this (OFMT is standard awk, not just gawk):
awk -v OFMT="%.6f" -v OFS='\t' '$2 ~ /\./ {s1 += $1; s2 += $2; next} {if (id) {print id, num, s1/num, s2/num; s1=s2=0} id=$1; num=$2} END {print id, num, s1/num, s2/num}' file
1 3 1.767484 0.823667
2 5 1.798200 0.797200
3 2 1.895235 0.735000
4 4 1.801216 0.804000
5 1 2.037940 0.615000
6 7 1.854793 0.742714
An expanded form:
awk -v OFMT='%.6f' -v OFS='\t' '
$2 ~ /\./ {
s1 += $1
s2 += $2
next
}
{
if (id) {
print id, num, s1/num, s2/num
s1 = s2 = 0
}
id = $1
num = $2
}
END {
print id, num, s1/num, s2/num
}' file
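If you want exactly 5 spaces between the columns, as the question asks, the same program works with OFS set to five spaces instead of a tab:
awk -v OFMT="%.6f" -v OFS='     ' '$2 ~ /\./ {s1 += $1; s2 += $2; next} {if (id) {print id, num, s1/num, s2/num; s1=s2=0} id=$1; num=$2} END {print id, num, s1/num, s2/num}' file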
And yet another one:
awk -v num_blocks=10000 '
BEGIN {
OFS = "\t"
OFMT = "%.6f"
}
num_lines == 0 {
id = $1
num_lines = $2
sum1 = sum2 = 0
next
}
lines_read < num_lines {
sum1 += $1
sum2 += $2
lines_read++
}
lines_read >= num_lines {
print id, num_lines,
sum1 / num_lines,
sum2 / num_lines
num_lines = lines_read = 0
num_blocks--;
}
num_blocks <= 0 {
exit
}' file
You could try
awk -v qnt=none 'qnt == "none" {id = $1; qnt = $2; s1 = s2 = line = 0;next}{s1 += $1; s2 += $2; ++line} line == qnt{printf "%d %d %.6f %.6f\n", id, qnt, s1/qnt, s2/qnt; qnt="none"}'
The above is expanded as follows:
qnt == "none"
{
id = $1;
qnt = $2;
s1 = s2 = line = 0;
next
}
{
s1 += $1;
s2 += $2;
++line
}
line == qnt
{
printf "%d %d %.6f %.6f\n", id, qnt, s1/qnt, s2/qnt;
qnt="none"
}
When a new data block starts (or at the very beginning), record the header info.
Otherwise, add to the sums, and print the result when we are done with all lines in the block.
I am working on a variant call format (VCF) file, and here is what I am trying to do:
Input:
1 877803 838425 GC G
1 878077 966631 C CCACGG
Output:
1 877803 838425 C -
1 878077 966631 - CACGG
In summary, I am trying to delete the first letter of each string, writing - when nothing is left.
And here is my code:
awk 'BEGIN { OFS="\t" } /#/ {next}
{
m = split($4, a, //)
n = split($5, b, //)
x = "-"
delete y
if (m>n){
for (i = n+1; i <= m; i++) {
y = sprintf("%s", a[i])
}
print $1, $2, $3, y, x
}
else if (n>m){
for (j = m+1; i <= n; i++) {
y = sprintf("%s", b[j]) ## Problem here
}
print $1, $2, $3, x, y
}
}' input.vcf > output.vcf
But I am getting the following error at line 15, not even at line 9:
awk: cmd. line:15: (FILENAME=input.vcf FNR=1) fatal: attempt to use array y in a scalar context
I don't know how to concatenate array elements into one string using awk.
I will be very happy if you guys help me.
Merry X-Mas!
You may try this awk:
awk -v OFS="\t" 'function trim(s) { return (length(s) == 1 ? "-" : substr(s, 2)); } {$4 = trim($4); $5 = trim($5)} 1' file
1 877803 838425 C -
1 878077 966631 - CACGG
More readable form:
awk -v OFS="\t" 'function trim(s) {
return (length(s) == 1 ? "-" : substr(s, 2))
}
{
$4 = trim($4)
$5 = trim($5)
} 1' file
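As for the error itself: delete y tells awk that y is an array, so the later scalar assignment y = sprintf(...) fails. To concatenate array elements into one string, build it up in a loop; a minimal sketch with a hard-coded sample value (gawk splits on the null separator character by character):
awk 'BEGIN {
    n = split("CCACGG", b, "")    # b[1]="C", b[2]="C", ..., n=6 (gawk)
    y = ""
    for (i = 2; i <= n; i++)      # append elements 2..n one by one
        y = y b[i]
    print y                       # prints CACGG
}'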
You can use awk's substr function to process the 4th and 5th space delimited fields:
awk '{ substr($4,2)==""?$4="-":$4=substr($4,2);substr($5,2)==""?$5="-":$5=substr($5,2)}1' file
If the string from position 2 onwards in field 4 is empty, set field 4 to "-"; otherwise set field 4 to the substring from position 2 to the end of the field. Do the same with field 5. Lines, modified or not, are printed by the shorthand 1.
I'm a very recent command-line user, so I need some help splitting a text file by columns using awk. The difficulty for me is that I want the ith filename to be the text from the 1st row of the ith column.
This is what I had in mind:
awk '{for(i = 2; i <= NF; i++){name= ??FNR == 1 $i?? ;print $1, $i > name}}' myfile.txt
But I don't know how to set the name variable...
Input: myfile.txt
'ID' 'sample_1' 'sample_2' ...
'id_1' 1 2 ...
'id_2' 2 3 ...
Expected output:
sample_1.txt:
'ID' 'sample_1'
'id_1' 1
'id_2' 2
sample_2.txt:
'ID' 'sample_2'
'id_1' 2
'id_2' 3
Thanks
You should keep the column headers in an array:
awk 'NR==1 {
    for (i=2; i<=NF; ++i) {
        fnames[i] = gensub(/\x27/, "", "g", $i)    # strip the quotes to build file names
        print $1, $i > (fnames[i] ".txt")
    }
    next
}
{
    for (i=2; i<=NF; ++i)
        print $1, $i > (fnames[i] ".txt")
}' myfile.txt
\x27 is the single quote in hex-escaped form.
gensub(/\x27/, "", "g", $i) removes the single quotes from the column headers to name the output files as you wanted.
You can try this awk:
awk -F'\t' ' # tab as field separator
{
for ( i = 2 ; i <= NF ; i++ ) { # for each record loop from field 2 to last field
if ( NR == 1 ) { # if first record
a[i] = $i # keep each field in array a
gsub ( /^'\''|'\''$/ , "" , a[i] ) # remove quote at start and end in array a
}
print $1 FS $i > (a[i]".txt")          # print needed fields to the corresponding file
}
}' myfile.txt
I have data which looks like this
1 3
1 2
1 9
5 4
4 6
5 6
5 8
5 9
4 2
I would like the output to be
1 3,2,9
5 4,6,8,9
4 6,2
This is just sample data but my original one has lots more values.
So this worked. It basically creates a hash table, using the first column as the key and appending the second column of each line to the value:
awk '{line=""; for (i = 2; i <= NF; i++) line = line $i ", "; table[$1] = table[$1] line} END {for (key in table) print key " => " substr(table[key], 1, length(table[key])-2)}' trial.txt
OUTPUT
4 => 6, 2
5 => 4, 6, 8, 9
1 => 3, 2, 9
I'd write
awk -v OFS=, '
{
    key = $1
    $1 = ""                           # clearing $1 makes awk rebuild $0 with OFS, giving ",value"
    values[key] = values[key] $0
}
END {
    for (key in values) {
        sub(/^,/, "", values[key])    # drop the leading comma
        print key " " values[key]
    }
}
' file
If you want only the unique values for each key (requires GNU awk for arrays of arrays):
gawk -v OFS=, '
{ for (i=2; i<=NF; i++) values[$1][$i] = i }
END {
for (key in values) {
printf "%s ", key
sep = ""
for (val in values[key]) {
printf "%s%s", sep, val
sep = ","
}
print ""
}
}
' file
Or perl:
perl -lane '
    $key = shift @F;                  # the first field is the key
    $values{$key}{$_} = 1 for @F;     # remaining fields become hash keys, so duplicates collapse
} END {
    $, = " ";
    print $_, join(",", keys %{$values{$_}}) for keys %values;
' file
If you're not concerned with the order of the keys, I think this is the idiomatic awk solution:
$ awk '{a[$1]=($1 in a?a[$1]",":"") $2}
END{for(k in a) print k,a[k]}' file |
column -t
4 6,2
5 4,6,8,9
1 3,2,9