Create new columns in specific positions by row-wise summing other columns - awk

I would like to transform this file by adding two columns, each holding the result of summing a value to another column. I would like these new columns to be located next to the corresponding summed column.
A B C
2000 33000 2
2000 153000 1
2000 178000 1
2000 225000 1
2000 252000 1
I would like to get the following data
A A1 B B1 C
2000 2999 33000 33999 2
2000 2999 153000 153999 1
2000 2999 178000 178999 1
2000 2999 225000 225999 1
2000 2999 252000 252999 1
I have found how to sum a column: awk '{$2 += 999; print $0}' myFile, but this transforms the second column instead of creating a new one. In addition, I am not sure how to insert the new columns at the desired positions.
Thanks!

awk '{
    # shift all fields one position right, note - from the back!
    # (assigning $(NF+1) extends the record automatically)
    for (i = NF; i >= 1; --i) {
        $(i + 1) = $i
    }
    # the new second column is the old first column plus 999
    $2 += 999
    # print it
    print
}
' myfile
And similarly for the 4th column.

Sample-specific answer: could you please try the following, written and tested with the shown samples in GNU awk.
awk '
FNR==1{
    $1=$1 OFS "A1"
    $2=$2 OFS "B1"
    print
    next
}
{
    $1=$1 OFS $1+999
    $2=$2 OFS $2+999
}
1
' Input_file | column -t
Generic solution: adding a more generic solution, where we need not write logic for each field: just give the field numbers, comma-separated, in the variable fieldsChange, and even the headings will be taken care of. The variable valAdd holds the value to add in the new columns.
awk -v valAdd="999" -v fieldsChange="1,2" '
BEGIN{
    num=split(fieldsChange,arr,",")
    for(i=1;i<=num;i++){ value[arr[i]] }
}
FNR==1{
    for(i=1;i<=NF;i++){
        if(i in value){ $i=$i OFS $i"1" }
    }
}
FNR>1{
    for(i=1;i<=NF;i++){
        if(i in value){ $i=$i OFS $i+valAdd }
    }
}
1
' Input_file | column -t

Related

how to keep newline(s) when selecting a given column with awk

Suppose I have a file like this (disclaimer: this is not fixed; I can have more than 7 rows and more than 4 columns):
R H A 23
S E A 45
T E A 34
U   A 35
Y T A 35
O E A 353
J G B 23
I want to select the second column if the third column is A, but keep the newline or whitespace character.
output should be:
HEE TE
I tried this:
awk '{if ($3=="A") print $2}' file | awk 'BEGIN{ORS = ""}{print $1}'
But this gives:
HEETE%
Which has a weird % and is missing the space.
You may use this gnu-awk solution using FIELDWIDTHS:
awk 'BEGIN{ FIELDWIDTHS = "1 1 1 1 1 1 *" } $5 == "A" {s = s $3}
END {print s}' file
HEE TE
awk splits each record using the width values provided in the FIELDWIDTHS variable.
1 1 1 1 1 1 * means each of the first 6 columns is a single character wide, and the remaining text fills the 7th column. Since you have a space after each value, $2, $4, $6 will each hold a single space, and $1, $3, $5 will hold the values from the input.
$5 == "A" {s = s $3}: here we check whether $5 is A, and if that condition is true we keep appending the value of $3 to a variable s. In the END block we just print variable s.
Without fixed-width parsing, awk would treat the A in the 4th row as $2.
Or, if we make the spaces part of the column values, then use:
awk '
BEGIN{ FIELDWIDTHS = "2 2 2 *" }
$3 == "A " {s = s substr($2,1,1)}
END {print s}
' file

How do you count the number of rows that meet a numerical condition in AWK?

I have a 2-column text file, sorted on column 2 (numbers, ascending), that I am trying to summarise by counting the number of lines that fall within a set region; the region size is 1000. In essence, the text file will be read, and if the number in column 2 lies between 0 and 1000, the output file will get a new line that tallies this up; the second line of the output file will have the 1000-2000 region, and so on until the end of the file is read.
Unfortunately the code I have been passed misses the first output line (0-1000) and doesn't do the maths correctly... I think it is ignoring the first line of the INPUT file? I don't know how easy it is to change, or whether a more elegant way of writing it is available...
From my understanding the AWK command says
let x=0 and y=1000
if $2 >=0 && $2 < y, then +1 to x
print when y is reached
repeat for y+1000
but my first region of 1000 is missing
INPUT FILE: sorted & tab-delimited
aaaaa 675
aaaaa 678
aaaaa 989
aaaaa 1001
aaaaa 1500
aaaaa 2020
...
awk -F'\t' 'BEGIN{x=0;y=1000;}{
if ($2 >= 0 && $2 < y) {x=x+1;}
else {OFS="\t"; $2=y; $3=y+1000; $4=x; print$1,$2,$3,$4; x=0; y=y+1000}
}' INput.txt > OUTput.txt
So, I was expecting:
aaaaa 0 1000 3
aaaaa 1000 2000 2
aaaaa 2000 3000 1
...
but what I am getting is
aaaaa 1000 2000 3
aaaaa 2000 3000 1
aaaaa 3000 4000 0
...
which isn't correct given the input files.
(... denotes the rest of the file)
In addition to @JamesBrown's answer, here is a working version:
awk '
BEGIN {
    FS=OFS="\t"
}
{
    while(c<$2) {
        if(c)
            print $1,c-1000,c,n
        n=0
        c+=1000
    }
    n++
}
END {
    print $1,c-1000,c,n
}' file
Given your sample, its output is:
aaaaa 0 1000 3
aaaaa 1000 2000 2
aaaaa 2000 3000 1
Let's add some debugging and see:
$ cat foo.awk
BEGIN {
    OFS="\t" # moved
    x=0
    y=1000
}
{
    printf "DEBUG NR=%d $2=%d y=%d\n",NR,$2,y > "/dev/stderr" # added
    if ($2 >= 0 && $2 < y)
        x=x+1
    else {
        $2=y
        $3=y+1000
        $4=x
        print $1,$2,$3,$4
        x=0
        y=y+1000
    }
}
Run it:
$ awk -f foo.awk file
DEBUG NR=1 $2=675 y=1000
DEBUG NR=2 $2=678 y=1000
DEBUG NR=3 $2=989 y=1000
DEBUG NR=4 $2=1001 y=1000
aaaaa 1000 2000 3
DEBUG NR=5 $2=1500 y=2000 # if (1500 >= 0 && 1500 < 2000) {x=x+1} ie no print
DEBUG NR=6 $2=2020 y=2000
aaaaa 2000 3000 1
In awk, most of the time, you can convert an if statement into a pattern, which makes the script easier to understand and at the same time more concise. My approach to this problem is in a script called count.awk:
BEGIN {
    threshold = 1000
    FS = OFS = "\t"
}
$2 > threshold {
    print first, threshold - 1000, threshold, count
    threshold += 1000
    count = 0
}
{
    first = $1
    count++
}
END {
    print first, threshold - 1000, threshold, count
}
Notes
The BEGIN pattern is easy: Here I declare the threshold and delimiters
For those lines whose value in the second column steps over the threshold (pattern: $2 > threshold), I print out the count so far for the previous lines, adjust the threshold, and reset the count
For every line, I save the value of the first column, then count. It is important that this block is positioned after the $2 > threshold block or the count would be off by one
At the end, I also print out the tally for the last batch
Invoking the script
awk -f count.awk INput.txt > OUTput.txt

How to find which part of an OR condition is met when you have 40 conditions in Unix

I have a file with 40 fields, each of which should have a particular length. I put an OR condition as below to check whether the requirement is met, and print something if any field's length is more than required. But I want to know and print exactly which field is longer than required.
command:
awk -F "|" 'length ($1) > 10 || length ($2) > 30 || length ($3) > 50 || length ($4) > 15 ||...|| length ($40) > 55' /path/filename
Your existing code will not test any of the conditions after the first one that evaluates true, due to short-circuiting. If you want to check them all, it's better to keep the size requirements in a variable and loop through all the fields. One example:
$ awk -F'|' -v size="10|30|50..." '
BEGIN{ split(size,s) }
{
    c=sep=""
    for(i=1;i<=NF;i++)
        if(length($i)>s[i]) { c=c sep i; sep=FS }
    if(c) print $0,c
}' file
No need to write so many field conditions manually. Since you haven't shown us the expected output, the following code is written based on your statements.
awk -F"|" '{for(i=1;i<=NF;i++){if(length($i)>40){print i,$i" having more than 40 length"}}}' Input_file
The above will print the field number and the field's value for every field whose length is more than 40.
EDIT: Adding an example of the same; let's say the following is the Input_file.
cat Input_file
vbrwvkjrwvbrwvbrwv123|vwkjrbvrwnbvrwvkbvkjrwbvbwvwbvrwbvrwbvvbjbvhjrwv|rwvirwvhbrwvbrwvbrwvbhrwbvhjrwbvhjrwbvjrwbvhjwbvhjvbrwvbrwhjvb
123|wwd|wfwcwc
awk -F"|" '{for(i=1;i<=NF;i++){if(length($i)>40){print i,$i" having more than 40 length"}}}' file3499
2 vwkjrbvrwnbvrwvkbvkjrwbvbwvwbvrwbvrwbvvbjbvhjrwv having more than 40 length
3 rwvirwvhbrwvbrwvbrwvbhrwbvhjrwbvhjrwbvjrwbvhjwbvhjvbrwvbrwhjvb having more than 40 length
This is basically the same as karakfa's answer, just ... more whitespacey
awk -F "|" '
BEGIN {
max[1] = 10
max[2] = 30
max[3] = 50
max[4] = 15
# ...
max[40] = 55
}
{
for (i=1; i<=NF; i++) {
if (length($i) > max[i]) {
printf "Error: line %d, column %d is wider than %d\n", NR, i, max[i]
}
}
}
' file

subtracting values in one column based on another column

I have input file as follows
100A 2000
100B 150
100C 800
100A 1000
100B 100
100C 300
I want to subtract the values in column 2 for each unique value in column 1,
so the output should look like
100A 1000
100B 50
100C 500
I have tried
awk '{if(!a[$1])a[$1]=$2; else a[$1]=$2-a[$1]}END{ for(i in a)print i" " a[i]}' file
but the output is:
100A 0
100B 0
100C 0
please advise
So many (slight) variations on the same theme.
awk '
!($1 in a) {a[$1]=$2; next}
{a[$1]-=$2}
END {for (i in a) printf "%s %d\n",i,a[i]}
' input.txt
Stack it up as a one-liner if you like.
Remember that an awk script consists of multiple condition { statement } pairs, so you can sometimes express your requirements more elegantly than with an if..else. (Not saying that this is the case here - this is a simple enough awk script that it probably doesn't matter, unless you're a purist. :] )
Also, beware of testing for values the way you've done in the if condition in your question. Note that a[$1] both tests whether the value at that array index is non-zero and causes the index to exist with a null value if it didn't previously exist. If you want to check for index existence, use $1 in a.
Update based on a comment on your question...
If you want to subtract the last from the first entry, ignoring the ones in between, then you need to keep a record of both your firsts and your lasts. Something like this might suffice.
awk '
!($1 in a){a[$1]=$2;next}
{b[$1]=$2}
END {for(i in b)if(i in a)print i,a[i]-b[i]}
' input.txt
Note that, as Ed mentioned, this produces output in random order. If you want the output ordered, you'll need an additional array to keep track of the order. For example, this will use the order in which items are first seen:
awk '
!($1 in a) {
    a[$1]=$2;
    o[++n]=$1;
    next
}
{
    b[$1]=$2
}
END {
    for (n=1;n<=length(o);n++)
        print o[n],a[o[n]]-b[o[n]]
}
' input.txt
Note that the length() function being used to determine the number of elements in an array is not universal amongst dialects of awk, but it does work in both gawk and one-true-awk (used in FreeBSD and others).
This awk one-liner does the job:
awk '{if($1 in a)a[$1]=a[$1]-$2;else a[$1]=$2}
END{for(x in a) print x, a[x]}' file
In awk, using the conditional operator for value placing/subtraction to keep it tight:
$ awk '{ a[$1]+=($1 in a?-$2:$2) } END{ for(i in a)print i, a[i] }' file
100A 1000
100B 50
100C 500
Explained:
{
    a[$1]+=($1 in a?-$2:$2)   # if $1 in a already, subtract from it
                              # otherwise add value to it
}
END {
    for(i in a)               # go thru all a
        print i, a[i]         # and print keys and values
}
Given the sample input you provided, all you need is:
$ awk '$1 in a{print $1, a[$1]-$2} {a[$1]=$2}' file
100A 1000
100B 50
100C 500
If that's not all you need then provide more truly representative sample input/output that includes the cases where that's not good enough.
You can use this awk:
awk 'a[$1]{a[$1]=a[$1]-$2; next} {a[$1]=$2} END{for(v in a){print v, a[v]}}' file

awk to Count Sum and Unique improve command

Would like to print, based on the 2nd column: the count of line items, the sum of the 3rd column, and the number of unique first-column values. Having around 100 InputTest files, and not sorted...
Am using the below 3 commands to achieve the desired output; would like to know the simplest way...
InputTest*.txt
abc,xx,5,sss
abc,yy,10,sss
def,xx,15,sss
def,yy,20,sss
abc,xx,5,sss
abc,yy,10,sss
def,xx,15,sss
def,yy,20,sss
ghi,zz,10,sss
Step#1:
cat InputTest*.txt | awk -F, '{key=$2;++a[key];b[key]=b[key]+$3} END {for(i in a) print i","a[i]","b[i]}'
Op#1
xx,4,40
yy,4,60
zz,1,10
Step#2
awk -F ',' '{print $1,$2}' InputTest*.txt | sort | uniq >Op_UniqTest2.txt
Op#2
abc xx
abc yy
def xx
def yy
ghi zz
Step#3
awk '{print $2}' Op_UniqTest2.txt | sort | uniq -c
Op#3
2 xx
2 yy
1 zz
Desired Output:
xx,4,40,2
yy,4,60,2
zz,1,10,1
Looking for suggestions !!!
BEGIN { FS = OFS = "," }
{ ++lines[$2]; if (!seen[$2,$1]++) ++diff[$2]; count[$2]+=$3 }
END { for(i in lines) print i, lines[i], count[i], diff[i] }
lines tracks the number of occurrences of each value in column 2
seen records unique combinations of the second and first column, incrementing diff[$2] whenever a unique combination is found. The ++ after seen[$2,$1] means that the condition will only be true the first time the combination is found, as the value of seen[$2,$1] will be increased to 1 and !seen[$2,$1] will be false.
count keeps a total of the third column
$ awk -f avn.awk file
xx,4,40,2
yy,4,60,2
zz,1,10,1
Using awk:
$ awk '
BEGIN { FS = OFS = "," }
{ keys[$2]++; sum[$2]+=$3 } !seen[$1,$2]++ { count[$2]++ }
END { for(key in keys) print key, keys[key], sum[key], count[key] }
' file
xx,4,40,2
yy,4,60,2
zz,1,10,1
Set the input and output field separators to , in the BEGIN block. We use the array keys to identify and count the keys. The sum array keeps the sum for each key. count allows us to keep track of the number of unique column-1 values for each column-2 value.