How can I subtract its mean from each column using awk? - awk

I have a file such as the following (but with thousands of rows and hundreds of columns)
1 2 1
1 2 2
3 2 3
3 2 6
How can I subtract its mean from each column/field using awk, in order to obtain something like this?
-1 0 -2
-1 0 -1
1 0 0
1 0 3
Thank you very much for your help.
The closest solution I found, http://www.unix.com/shell-programming-scripting/102293-normalize-dataset-awk.html, does not seem to do the job "element by element". Of course it performs a different operation, but the general idea is the same: "perform an operation on each column using a value calculated from that column".

With awk in two passes:
awk '
NR==FNR {
    for (i=1; i<=NF; i++) {
        a[i] += $i
    }
    next
}
{
    for (y=1; y<=NF; y++) {
        printf "%2d ", $y -= (a[y]/(NR-FNR))
    }
    print ""
}' file file
With awk in one pass:
awk '{
    for (i=1; i<=NF; i++) {
        a[i] += $i;
        b[NR,i] = $i
    }
}
END {
    for (i=1; i<=NR; i++) {
        for (j=1; j<=NF; j++) {
            printf "%2d ", b[i,j] -= (a[j]/NR)
        }
        print ""
    }
}' file
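Both versions format with "%2d ", which truncates the result to a whole number; that happens to be harmless here because every column mean in the sample is an integer. If your real data has non-integer means, a sketch of the one-pass version using "%g " instead keeps the fractional part (only the printf format changes):
awk '{
    for (i=1; i<=NF; i++) { a[i]+=$i; b[NR,i]=$i }
}
END {
    for (i=1; i<=NR; i++) {
        for (j=1; j<=NF; j++) printf "%g ", b[i,j]-a[j]/NR
        print ""
    }
}' file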

A numpy alternative:
import sys, numpy as np
a = np.array([line.split() for line in open(sys.argv[1])], dtype=float)
for row in a - np.mean(a, axis=0):
    print(' '.join(map(str, row)))
Usage: python script.py inputFile

Stored each of the first 2 blocks of lines in arrays

I've been doing this in Google Sheets, but it takes a long time, so I'd like to handle it with awk instead.
input.txt
Column 1
2
2
2
4
4
Column 2
562
564
119
215
12
Range
13455,13457
13161
11409
13285,13277-13269
11409
I've tried this script to rearrange the values:
awk '/Column 1/' RS= input.txt
(as referred to in How can I set the grep after context to be "until the next blank line"?)
But it seems to take only one matched block.
The values should be combined according to their respective lines.
Result:
562Value2#13455
562Value2#13457
564Value2#13161
119Value2#11409
215Value4#13285
215Value4#13277-13269
12Value4#11409
It should be something like that: a comma in the Range block means the values from Column 1 and Column 2 are repeated for each comma-separated entry. For example:
Range :
13455,13457
Result :
562Value2#13455
562Value2#13457
I don't know what sorting has to do with it, but it seems like this is what you're looking for:
$ cat tst.awk
BEGIN { FS=","; recNr=1; print "Result:" }
!NF { ++recNr; lineNr=0; next }
{ ++lineNr }
lineNr == 1 { next }
recNr == 1 { a[lineNr] = $0 }
recNr == 2 { b[lineNr] = $0 }
recNr == 3 {
    for (i=1; i<=NF; i++) {
        print b[lineNr] "Value" a[lineNr] "#" $i
    }
}
$ awk -f tst.awk input.txt
Result:
562Value2#13455
562Value2#13457
564Value2#13161
119Value2#11409
215Value4#13285
215Value4#13277-13269
12Value4#11409
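Note that the script relies on blank lines separating the three blocks (that is what the !NF rule counts). If your input.txt really is laid out that way, awk's paragraph mode (RS="") can do the same job by reading each block as a single record; this is only a sketch of that idea, under the same blank-line assumption:
awk 'BEGIN { RS=""; FS="\n" }
NR==1 { for (i=2; i<=NF; i++) a[i-1] = $i; next }   # values from the Column 1 block
NR==2 { for (i=2; i<=NF; i++) b[i-1] = $i; next }   # values from the Column 2 block
NR==3 {                                             # the Range block
    print "Result:"
    for (i=2; i<=NF; i++) {
        n = split($i, v, ",")
        for (j=1; j<=n; j++) print b[i-1] "Value" a[i-1] "#" v[j]
    }
}' input.txt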

How to swap lines with awk with only a single pass and limited memory use?

In a previous post, this answer by user2138595 was shown; though beautiful, the problem is that it reads the input file twice.
I wish to make a GNU awk script that reads the input only once.
cat swap_line.awk
BEGIN {
    if (init > end) {
        exit 1;
    }
    flag = 1;
    memory_init = "";
    memory = ""
}
{
    if (NR != init && NR != end) {
        if (flag == 1) {
            print $0;
        } else {
            memory = memory $0 "\n";
        }
    } else if (end == init) {
        print $0;
    } else if (NR == init) {
        flag = 0;
        memory_init = $0;
    } else {
        # NR == end
        print $0;
        printf("%s", memory);
        print memory_init;
        flag = 1;
    }
}
END {
    # if end is greater than the number of lines of the file
    if (flag == 0) {
        printf("%s", memory);
        print memory_init;
    }
}
The script works well:
cat input
1
2
3
4
5
awk -v init=2 -v end=4 -f swap_line.awk input
1
4
3
2
5
awk -v init=2 -v end=2 -f swap_line.awk input
1
2
3
4
5
awk -v init=2 -v end=8 -f swap_line.awk input
1
3
4
5
2
QUESTION
How could I write the script in a better way? I don't like using the memory variable, because it can cause problems with large files: for example, if the input file has 10 million lines and I want to swap line 1 with line 10 million, I end up storing 9,999,998 lines in the memory variable.
#JoseRicardoBustosM. it is impossible to do it in one pass in awk without saving the lines from the init to one before the end line in memory. Just think about the impossibility of getting a line N lines ahead of what you've already read to miraculously show up in place of the current line. The best solution for this is definitely a simple 2-pass approach of saving the lines in the first pass and using them in the 2nd. I am including all solutions that involve grep-ing in advance or using a getline loop in the "2"-pass approach bucket.
FWIW here's the way I'd really do it (this IS a 2-pass approach):
$ cat swap_line.awk
BEGIN { ARGV[ARGC]=ARGV[ARGC-1]; ARGC++ }
NR==FNR { if (NR==end) tl=$0; next }
FNR==init { hd=$0; $0=tl; nr=NR-FNR; if (nr<end) next }
FNR==end { $0=hd }
FNR==nr { if (nr<end) $0 = $0 ORS hd }
{ print }
$ awk -v init=2 -v end=4 -f swap_line.awk input
1
4
3
2
5
$ awk -v init=2 -v end=2 -f swap_line.awk input
1
2
3
4
5
$ awk -v init=2 -v end=8 -f swap_line.awk input
1
3
4
5
2
Note that if you didn't have that very specific requirement for how to handle an "end" that's past the end of the file then the solution would simply be:
$ cat swap_line.awk
BEGIN { ARGV[ARGC]=ARGV[ARGC-1]; ARGC++ }
NR==FNR { if (NR==end) tl=$0; next }
FNR==init { hd=$0; $0=tl }
FNR==end { $0=hd }
{ print }
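As a quick check, with the same input as above this simpler version should give the same result, assuming init and end both fall within the file:
$ awk -v init=2 -v end=4 -f swap_line.awk input
1
4
3
2
5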
and if you really want something to think about (again, just for the sunny day cases):
$ cat swap_line.awk
NR==init { hd=$0; while ((getline<FILENAME)>0 && ++c<end); }
NR==end { $0=hd }
{ print }
$ awk -v init=2 -v end=4 -f swap_line.awk input
1
4
3
2
5
I would still consider that last one as a "2"-pass approach and I wouldn't do it if I didn't fully understand all the caveats listed at http://awk.info/?tip/getline.
I think you are working too hard. This makes no attempt to deal with extreme cases (e.g., if end is greater than the number of lines, the initial line will not be printed, but that can easily be handled in an END block), because handling the edge cases obscures the idea. Namely: print until you reach the line you want swapped out, then store data in a temporary file, then print the line to swap, the stored data, and the initial line, and finally print the rest of the file:
$ cat swap.sh
#!/bin/sh
trap 'rm -f $T1' 0
T1=$(mktemp)
awk '
NR<init { print; next; }
NR==init { f = $0; next; }
NR<end { print > t1; next; }
NR==end { print; system("cat "t1); print f; next; }
1
' init=${1?} end=${2?} t1=$T1
$ yes | sed 10q | nl -ba | ./swap.sh 4 8
1 y
2 y
3 y
8 y
5 y
6 y
7 y
4 y
9 y
10 y
I agree that 2 passes are required. The first pass can be done with tools designed specifically for the task:
# $init and $end have been defined
endline=$( tail -n "+$end" file | head -n 1 )
awk -v init="$init" -v end="$end" -v endline="$endline" '
NR == init {saved = $0; $0 = endline}
NR == end {$0 = saved}
{print}
' file
Hide the details away in a function:
swap_lines () {
awk -v init="$1" \
-v end="$2" \
-v endline="$(tail -n "+$2" "$3" | head -n 1)" \
'
NR == init {saved = $0; $0 = endline}
NR == end {$0 = saved}
1
' "$3"
}
seq 5 > file
swap_lines 2 4 file
1
4
3
2
5

How to detect the last line in awk before END?

I'm trying to concatenate the String values and print them, but if the last lines are Strings and there is no change of type after them, the final concatenation is never printed:
input.txt:
String 1
String 2
Number 5
Number 2
String 3
String 3
awk:
awk '
BEGIN { tot=0; ant_t=""; }
{
    t = $1; val = $2;
    # if string, concatenate its value
    if (t == "String") {
        tot += val;
        nx = 1;
    } else {
        nx = 0;
    }
    # if the type changed, add tot to res
    if (t != "String" && ant_t == "String") {
        res = res tot;
        tot = 0;
    }
    ant_t = t;
    # if string, go to the next record
    if (nx == 1) {
        next;
    }
    res = res "\n" val;
}
END { print res; }' input.txt
Current output:
3
5
2
Expected output:
3
5
2
6
How can I detect whether awk is reading the last line, so that if there is no change of type it can still print the final total?
awk reads line by line, hence it cannot determine whether it is reading the last line. The END block is useful for performing actions once the end of the file has been reached.
To do what you expect:
awk '/String/{sum+=$2} /Number/{if(sum) print sum; sum=0; print $2} END{if(sum) print sum}'
will produce output as
3
5
2
6
What it does:
/String/ selects lines that match String, and likewise /Number/ selects Number lines.
sum+=$2 accumulates the values of the String lines. When a Number line occurs, the sum is printed and reset to zero.
Like this maybe:
awk -v lines="$(wc -l < /etc/hosts)" 'NR==lines{print "LAST"};1' /etc/hosts
I am pre-calculating the number of lines (using wc) and passing that into awk as a variable called lines, if that is unclear.
Just change last line to:
END { print res; print tot;}'
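One caveat: if the input happens to end with Number lines, tot is 0 at the END block and this prints a stray 0. A guarded sketch that reuses the script's own ant_t variable avoids that:
END { print res; if (ant_t == "String") print tot; }'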
awk '$1~"String"{x+=$2;y=1}$1~"Number"{if (y){print x;x=0;y=0;}print $2}END{if(y) print x}' file
Explanation
y is used as a boolean; in the END block I check whether the last pattern was a String and, if so, print the sum.
You could actually use x itself as the boolean, as nu11p01n73R does, which is smarter.
Test
$ cat file
String 1
String 2
Number 5
Number 2
String 3
String 3
$ awk '$1~"String"{x+=$2;y=1}$1~"Number"{if (y){print x;x=0;y=0;}print $2}END{if(y) print x}' file
3
5
2
6

find common elements in >2 files

I have three files as shown below
file1.txt
"aba" 0 0
"aba" 0 0 1
"abc" 0 1
"abd" 1 1
"xxx" 0 0
file2.txt
"xyz" 0 0
"aba" 0 0 0 0
"aba" 0 0 0 1
"xxx" 0 0
"abc" 1 1
file3.txt
"xyx" 0 0
"aba" 0 0
"aba" 0 1 0
"xxx" 0 0 0 1
"abc" 1 1
I want to find the common elements in all three files based on the first two columns. To find common elements in two files I have used something like
awk 'FNR==NR{a[$1,$2]++;next}a[$1,$2]' file1.txt file2.txt
But how can we find the common elements across all the files when there are more than 2 input files?
Can anyone help?
With the current awk solution, the output ignores the rows whose key columns are duplicated and gives only
"xxx" 0 0
If we assume the output comes from file1.txt, the expected output is:
"aba" 0 0
"aba" 0 0 1
"xxx" 0 0
i.e. it should include the rows with duplicate key columns as well.
Try the following solution, generalized for N files. It saves the keys of the first file in a hash with a value of 1, and for each hit from the following files that value is incremented. At the end I check whether the value of each key equals the number of files processed, and print only those that match.
awk '
FNR == NR { arr[$1,$2] = 1; next }
{ if ( arr[$1,$2] ) { arr[$1,$2]++ } }
END {
    for ( key in arr ) {
        if ( arr[key] != ARGC - 1 ) { continue }
        split( key, key_arr, SUBSEP )
        printf "%s %s\n", key_arr[1], key_arr[2]
    }
}
' file{1..3}
It yields:
"xxx" 0
"aba" 0
EDIT to add a version that prints the whole line (see comments). I've added another array with the same key in which I save the line, and I use it in the printf call. I've left the old code commented.
awk '
## FNR == NR { arr[$1,$2] = 1; next }
FNR == NR { arr[$1,$2] = 1; line[$1,$2] = $0; next }
{ if ( arr[$1,$2] ) { arr[$1,$2]++ } }
END {
    for ( key in arr ) {
        if ( arr[key] != ARGC - 1 ) { continue }
        ## split( key, key_arr, SUBSEP )
        ## printf "%s %s\n", key_arr[1], key_arr[2]
        printf "%s\n", line[ key ]
    }
}
' file{1..3}
NEW EDIT (see comments) to add a version that handles multiple lines with the same key. Basically I join all entries instead of saving only one, changing line[$1,$2] = $0 to line[$1,$2] = line[$1,$2] ( line[$1,$2] ? SUBSEP : "" ) $0. At printing time I reverse the process, splitting on the separator (the SUBSEP variable) and printing each entry.
awk '
FNR == NR {
    arr[$1,$2] = 1
    line[$1,$2] = line[$1,$2] ( line[$1,$2] ? SUBSEP : "" ) $0
    next
}
FNR == 1 { delete found }
{ if ( arr[$1,$2] && ! found[$1,$2] ) { arr[$1,$2]++; found[$1,$2] = 1 } }
END {
    num_files = ARGC - 1
    for ( key in arr ) {
        if ( arr[key] < num_files ) { continue }
        split( line[ key ], line_arr, SUBSEP )
        for ( i = 1; i <= length( line_arr ); i++ ) {
            printf "%s\n", line_arr[ i ]
        }
    }
}
' file{1..3}
With new data edited in question, it yields:
"xxx" 0 0
"aba" 0 0
"aba" 0 0 1
This Python script will list the common lines among all files:
import sys
from functools import reduce

l = []
for fname in sys.argv[1:]:
    fields = set()
    for line in open(fname):
        fields.add(" ".join(line.split()[0:2]))
    l.append(fields)
commonFields = reduce(lambda s1, s2: s1 & s2, l)
for fname in sys.argv[1:]:
    print("Common lines in", fname)
    for line in open(fname):
        for fields in commonFields:
            if fields in line:
                print(line, end="")
                break
Usage: python script.py file1 file2 file3 ...
For three files, all you need is:
awk 'FNR==NR { a[$1,$2]; next} ($1,$2) in a' file1.txt file2.txt file3.txt
The FNR==NR block returns true only for the first file in the arguments list. The next statement in this block forces a skip over the remainder of the code. Therefore, ($1,$2) in a is performed for all files in the arguments list excluding the first one. To process more files in the way you have, all you need to do is list them.
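For example, a fourth file (a hypothetical file4.txt) would just be appended to the list:
awk 'FNR==NR { a[$1,$2]; next} ($1,$2) in a' file1.txt file2.txt file3.txt file4.txt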
If you need more powerful globbing on the command line, use extglob. You can turn it on with shopt -s extglob, and turn it off with shopt -u extglob. For example:
awk 'FNR==NR { a[$1,$2]; next} ($1,$2) in a' file1.txt !(file1.txt)
If you have hard to find files, use find. For example:
awk 'FNR==NR { a[$1,$2]; next} ($1,$2) in a' file1.txt $(find /path/to/files -type f -name "*[23].txt")
I assume you're looking for brace expansion to cover 'N' files. For example:
awk 'FNR==NR { a[$1,$2]; next} ($1,$2) in a' file1.txt file{2,3}.txt

Awk count frequency

Hey, I want to count how often a value appears in a certain column with awk.
An example dataset is
2 5 8
1 3 7
8 5 9
and I want to count the frequency of the value 5 in the second column. This is what I tried, but it didn't work:
{
total = 0;
for(i=1;i<=NF;i++)
{
if(i==2)
{if($i==5) {total++;}
}
printf("%s ", total);
}
}
How about the following:
awk '{ if ($2==5) count++ } END { print count }'
awk 'NR == 1 {ind = 0} $2 == 5 {ind++} END {print ind}' testdata.txt
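If you want the frequency of every distinct value in the second column rather than just 5, the usual approach is an associative array keyed on $2 (a minimal sketch):
awk '{ count[$2]++ } END { for (v in count) print v, count[v] }' testdata.txt
For the sample data this prints "5 2" and "3 1" (the order in which for (v in count) visits the keys is unspecified).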