Merge two files depending on range using awk

I have a file (let's call it main.txt) where the 1st column contains some numbers (2, 4, 8, 15).
2 OtherData
4 OtherData
8 OtherData
15 OtherData
I also have another file, mapping.txt. I want to compare each value from main.txt (2, 4, 8, 15) against the first two columns of the mapping file.
The 1st column is a minimum allowed value, the 2nd is a maximum.
1 4 1stType
5 9 2ndType
10 14 3rdType
15 99 4thType
100 1000 5thType
How can I get a result like this using awk?
2 OtherData 1stType # 1 <= 2 <= 4
4 OtherData 1stType # 1 <= 4 <= 4
8 OtherData 2ndType # 5 <= 8 <= 9
15 OtherData 4thType # 15 <= 15 <= 99

You could try the following, written and tested with the shown samples in GNU awk.
awk '
FNR==NR{                  # first file (mapping.txt): record each range
  ++count
  start[count]=$1         # range minimum
  end[count]=$2           # range maximum
  value[count]=$NF        # type label
  next
}
{                         # second file (main.txt): scan the ranges for a match
  for(i=1;i<=count;i++){
    if($1>=start[i] && $1<=end[i]){
      print $0,value[i]
    }
  }
}
' mapping.txt main.txt | column -t
Output will be as follows.
2 OtherData 1stType
4 OtherData 1stType
8 OtherData 2ndType
15 OtherData 4thType

A shorter awk solution that loops through each range and stores the mapping in an array, keyed by every integer in the range (practical when the values are integers and the ranges are small):
awk 'NR == FNR {
  for (i = $1; i <= $2; i++)
    map[i] = $3
  next
}
$1 in map {
  print $0, map[$1]
}' mapping.txt main.txt
2 OtherData 1stType
4 OtherData 1stType
8 OtherData 2ndType
15 OtherData 4thType
Alternative awk:
awk 'NR == FNR {
  map[$1,$2] = $3
  next
}
{
  for (i in map) {
    split(i, a, SUBSEP)    # recover min and max from the combined key
    if ($1 >= a[1] && $1 <= a[2]) {
      print $0, map[i]
      next
    }
  }
}' mapping.txt main.txt
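The solutions above either scan every range per line or pre-expand the ranges. If mapping.txt is sorted by its first column and the ranges do not overlap, you can stop scanning as soon as the range minimums pass the value; a minimal sketch under that assumption:
awk 'NR == FNR {
  start[++n] = $1; end[n] = $2; label[n] = $3
  next
}
{
  # ranges are sorted by start: once start[i] exceeds $1, nothing later can match
  for (i = 1; i <= n && $1 >= start[i]; i++)
    if ($1 <= end[i]) {
      print $0, label[i]
      break
    }
}' mapping.txt main.txt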

Related

Successive averaging of repeating data but different number of lines

I have the following format of data:
1 3
1.723608 0.8490000
1.743011 0.8390000
1.835833 0.7830000
2 5
1.751377 0.8350000
1.907603 0.7330000
1.780053 0.8190000
1.601427 0.9020000
1.950540 0.6970000
3 2
1.993951 0.6610000
1.796519 0.8090000
4 4
1.734961 0.8430000
1.840741 0.7800000
1.818444 0.7950000
1.810717 0.7980000
5 1
2.037940 0.6150000
6 7
1.738221 0.8330000
1.767678 0.8260000
1.788517 0.8140000
2.223586 0.4070000
1.667492 0.8760000
2.039232 0.6130000
1.758823 0.8300000
...
Data consists of data blocks. Each data block has the same format as follows:
The very first line is the header line. The header line contains the ID number and the total number of lines of each data block. For example, the first data block's ID is 1, and the total number of lines is 3. For the third data block, ID is 3, and the total number of lines is 2. All data blocks have this header line.
Then, the "real data" follows. As I explained, the number of lines of "real data" is designated in the second integer of the header line.
Accordingly, the total number of lines for each data block will be number_of_lines+1. In this example, the total number of lines for data block 1 is 4, and data block 2 costs 6 lines...
This format repeats all the way up to 10000 number of data blocks in my current data, but I can provide this 10000 as a variable in the bash or awk script as an input value. I know the total number of data blocks.
Now, what I want to do is compute the average of each of the two data columns and print it together with the data block's ID number and total number of lines. The output text will have:
ID_number number_of_lines average_of_column_1 average_of_column_2
with 5 spaces between columns and 6 decimal places. The result will have 10000 lines; each line will have the ID, the number of lines, and the averages of data columns 1 and 2 for that data block. The result for this example will look like:
1 3 1.767484 0.823666
2 5 1.798200 0.797200
3 2 1.895235 0.735000
...
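As a sanity check: block 1's column-1 average is (1.723608 + 1.743011 + 1.835833) / 3 = 5.302452 / 3 = 1.767484, and its column-2 average is (0.849 + 0.839 + 0.783) / 3 = 0.8236667, shown truncated as 0.823666 above (a %.6f format rounds it to 0.823667).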
I know how to get the average of a simple data column in awk and bash; this has been answered on Stack Overflow many times. For example, I like using
awk '{ total += $2; count++ } END { print total/count }' data.txt
So, I wish to do this using awk or bash, but I have no clue how to approach, or even start on, this kind of averaging over multiple repeating data blocks with a different number of lines in each block.
I have been trying an awk-based approach, following
Awk average of n data in each column
and
https://www.unix.com/shell-programming-and-scripting/135829-partial-average-column-awk.html
But I'm not sure how to use NR or FNR to average data when each data block has a varying number of lines.
You may try this awk:
awk -v OFS='\t' '$2 ~ /\./ {s1 += $1; s2 += $2; next} {if (id) {print id, num, s1/num, s2/num; s1=s2=0} id=$1; num=$2} END {print id, num, s1/num, s2/num}' file
1 3 1.76748 0.823667
2 5 1.7982 0.7972
3 2 1.89524 0.735
4 4 1.80122 0.804
5 1 2.03794 0.615
6 7 1.85479 0.742714
To get fixed-width decimal numbers, set OFMT, awk's output format for numbers printed with print, like this:
awk -v OFMT="%.6f" -v OFS='\t' '$2 ~ /\./ {s1 += $1; s2 += $2; next} {if (id) {print id, num, s1/num, s2/num; s1=s2=0} id=$1; num=$2} END {print id, num, s1/num, s2/num}' file
1 3 1.767484 0.823667
2 5 1.798200 0.797200
3 2 1.895235 0.735000
4 4 1.801216 0.804000
5 1 2.037940 0.615000
6 7 1.854793 0.742714
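OFMT applies to any non-integer number that print converts to text, so its effect is easy to check in isolation:
awk 'BEGIN { OFMT = "%.6f"; print 1/3 }'
0.333333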
An expanded form:
awk -v OFMT='%.6f' -v OFS='\t' '
$2 ~ /\./ {                # data rows: column 2 contains a decimal point
  s1 += $1
  s2 += $2
  next
}
{                          # header row: flush the previous block, then record it
  if (id) {
    print id, num, s1/num, s2/num
    s1 = s2 = 0
  }
  id = $1
  num = $2
}
END {                      # flush the final block
  print id, num, s1/num, s2/num
}' file
And yet another one:
awk -v num_blocks=10000 '
BEGIN {
  OFS = "\t"
  OFMT = "%.6f"
}
num_lines == 0 {
  id = $1
  num_lines = $2
  sum1 = sum2 = 0
  next
}
lines_read < num_lines {
  sum1 += $1
  sum2 += $2
  lines_read++
}
lines_read >= num_lines {
  print id, num_lines, sum1 / num_lines, sum2 / num_lines
  num_lines = lines_read = 0
  num_blocks--
}
num_blocks <= 0 {
  exit
}' file
You could try
awk -v qnt=none 'qnt == "none" {id = $1; qnt = $2; s1 = s2 = line = 0; next} {s1 += $1; s2 += $2; ++line} line == qnt {printf "%d %d %.6f %.6f\n", id, qnt, s1/qnt, s2/qnt; qnt="none"}' file
The above expands as follows (note that in awk an action's opening brace must stay on the same line as its pattern, otherwise the pattern and the block become two separate rules):
qnt == "none" {     # at the start, or after a block is done: read the header
  id = $1
  qnt = $2
  s1 = s2 = line = 0
  next
}
{                   # data line: accumulate sums and count lines
  s1 += $1
  s2 += $2
  ++line
}
line == qnt {       # block complete: print averages, rearm for the next header
  printf "%d %d %.6f %.6f\n", id, qnt, s1/qnt, s2/qnt
  qnt = "none"
}
After a data block is processed (or at the very beginning), the first rule records the header info. Otherwise, the middle rule adds to the sums, and the last rule prints the result once all lines of the block have been read.
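With the sample data this prints:
1 3 1.767484 0.823667
2 5 1.798200 0.797200
3 2 1.895235 0.735000
4 4 1.801216 0.804000
5 1 2.037940 0.615000
6 7 1.854793 0.742714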

Using awk, how to average numbers in column between two strings in a text file

A text file contains multiple tab-delimited columns between header strings; an example is below.
Code 1 (3)
5 10 7 1 1
6 10 9 1 1
7 10 10 1 1
Code 2 (2)
9 11 3 1 3
10 8 5 2 1
Code 3 (1)
12 10 2 1 1
Code 4 (2)
14 8 1 1 3
15 8 7 5 1
I would like to average the numbers in the third column for each code block. The example below is what the output should look like.
8.67
4
2
4
Attempt 1
awk '$3~/^[[:digit:]]/ {i++; sum+=$3; print $3} $3!~/[[:digit:]]/ {print sum/i; sum=0;i=0}' in.txt
Returned fatal: division by zero attempted.
Attempt 2
awk -v OFS='\t' '/^Code/ { if (NR > 1) {i++; sum+=$3;} {print sum/i;}}' in.txt
Returned another division by zero error.
Attempt 3
awk -v OFS='\t' '/^Code/ { if (NR > 1) { print s/i; s=0; i=0; } else { s += $3; i += 1; }}' in.txt
Returned 1 value: 0.
Attempt 4
awk -v OFS='\t' '/^Code/ {
if (NR > 1)
i++
print sum += $3/i
}
END {
i++
print sum += $3/i
}'
Returned:
0
0
0
0.3
I am not sure where that last number is coming from, but this has been the closest solution so far. I am getting a number for each block, but not the average.
You could try the following:
awk '
/^Code/{            # header line: flush the previous block, if any
  if(value){
    print sum/value
  }
  sum=value=""
  next
}
{
  sum+=$3           # accumulate the third column
  value++
}
END{                # flush the final block
  if(value){
    print sum/value
  }
}
' Input_file
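With the sample input this prints (awk's default %.6g number format shows 8.66667 rather than the rounded 8.67 from the question):
8.66667
4
2
4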

replacing associative array indexes with their value using awk or sed

I would like to replace the first-column values of ref using the key-value pairs from id.
cat id:
[1] a 8-23
[2] g 8-21
[3] d 8-13
cat ref:
a 1 2
b 3 4
c 5 3
d 1 2
e 3 1
f 1 2
g 2 3
desired output
8-23 1 2
b 3 4
c 5 3
8-13 1 2
e 3 1
f 1 2
8-21 2 3
I assume it would be best done using awk.
cat replace.awk
BEGIN { OFS="t" }
NR==FNR {
a[$2]=$3; next
}
$1 in !{!a[#]} {
print $0
}
Not sure what I need to change?
$1 in !{!a[#]} is not awk syntax. You just need $1 in a:
BEGIN { OFS = "\t" }
NR == FNR {
  a[$2] = $3
  next
}
{
  $1 = ($1 in a) ? a[$1] : $1
  print
}
To force $0 to be rebuilt with the new OFS, this version always assigns to $1, even when there is no replacement; a bare print prints $0.
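Save the script as replace.awk and run it with the id file first:
awk -f replace.awk id ref
This prints the desired output, tab-separated, since every record is rebuilt by the assignment to $1.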

How to add numbers from files to computation?

I need to get results of this formula - a column of numbers
{x = ($1-T1)/Fi; print (x-int(x))}
from inputs file1
4 4
8 4
7 78
45 2
file2
0.2
3
2
1
From these files there should be 4 outputs.
$1 is the first column of file1; T1 is the first line of the first column of file1 (the number 4) - it is always this number. Fi, where i = 1, 2, 3, 4, are the numbers from the second file. So I need a loop over i from 1 to 4: compute the term once with F1=0.2, produce the second output with F2=3, the third with F3=2, and the last with F4=1. How do I express T1 and Fi this way, and how do I write the loop?
awk 'FNR == NR { F[++n] = $1; next } FNR == 1 { T1 = $1 } { for (i = 1; i <= n; ++i) { x = ($1 - T1)/F[i]; print x - int(x) >"output" FNR} }' file2 file1
This gives more than 4 outputs. What is wrong?
>"output" FNR is problematic: how awk parses a concatenation after the redirection operator differs between implementations, so enclose the output name expression in parentheses, > ("output" FNR).
Note also that FNR == 1 { T1 = $1 } would set T1 to 0.2 from file2's first line were it not for the next in the preceding block, which skips that rule while file2 is read.
Here's how I'd do it:
awk '
NR == 1   { t1 = $1 }            # T1 = first value of file1
NR == FNR { f[NR] = $1; next }   # remember column 1 of file1
{
  fn = "output" FNR              # one output file per line of file2 (per Fi)
  for (i = 1; (i in f); i++) {   # iterate file1 rows in order
    x = (f[i] - t1) / $1
    print x - int(x) > fn
  }
  close(fn)
}
' file1 file2

Remove data using AWK

I have a file of values that I wish to plot using gnuplot. The problem is that there are some values that I wish to remove.
Here is an example of my data:
1 52
2 3
3 0
4 4
5 1
6 1
7 1
8 0
9 0
I want to remove any row in which the right column is 0, so the data above would end up looking like this:
1 52
2 3
4 4
5 1
6 1
7 1
Let's just check field 2:
awk '$2' file
If the 2nd field has a true value, that is, not 0 and not empty, the condition is true. In that case, awk performs its default action, print $0, which prints the current line.
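Note that this truthiness test also keeps rows whose 2nd field is a non-numeric string, and drops rows where it is empty or numerically zero (0.0 included). A quick demonstration:
printf '1 52\n3 0\n4 0.0\n5 x\n' | awk '$2'
1 52
5 x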
If you prefer the test spelled out, either of these is equivalent:
awk '$2 == 0 { next; } { print; }' file
awk '{ if ($2 == 0) { next; } else { print; } }' file