Merging multiple dataframes with summarisation of variables - awk

I am trying to use the awk command for the first time and I have to merge multiple files (inputfile_1, inputfile_2, ...) that all look like this:
X.IID NUM SUM AVG S_SUM
X001 2 0 0.000 0.000
X002 2 1 0.138 0.276
X003 2 1 0.138 0.276
X004 2 0 0.000 0.000
X005 2 1 0.138 0.276
I have to merge those files by X.IID and, for each X.IID, sum the values of NUM and AVG into two new variables. I tried to use this awk command:
awk '{iid[FNR]=$1;num_sum[FNR]+=$2;avg_sum[FNR]+=$4} END{for(i=1;i<=FNR;i++)print iid[i],num_sum[i],avg_sum[i]}' inputfile_*.csv > outputfile.csv
For some reason some X.IID values repeat in the output file. Example:
X.IID X0 X0.1
X001 50 -1.297
X001 50 -1.297
X002 50 -0.004
X003 50 -0.105
X003 50 -0.105
X003 50 -0.105
I could just remove the repeated observations as they do not differ from each other, but I would like some reassurance that the values calculated are correct.
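For what it's worth, here is a minimal sketch of an alternative that keys the arrays on the X.IID value itself instead of the line number (FNR), so rows are matched by identifier rather than by position. It assumes whitespace-separated files that each start with a header row, despite the .csv extension:
awk 'FNR == 1 { next }                      # skip each file's header row
     { num[$1] += $2; avg[$1] += $4 }       # accumulate sums keyed by X.IID
     END { for (id in num) print id, num[id], avg[id] }' inputfile_*.csv > outputfile.csv
Note that for (id in num) iterates in an unspecified order, so you may want to pipe the result through sort.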

Related

Append an empty output column in the dataset by comparing conditions defined to the input columns using pandas

Temperature Humidity Moisture observation
0 51 29.5 0
1 51 29.5 188
2 50 29.5 0
3 50 29.5 350
4 50 29.5 0
This is the dataset with input columns (temperature, humidity and moisture) and an empty output column (observation).
I need to fill in the observation column based on thresholds such as
temperature = 40
humidity = 50
moisture = 150
If the values of all three input columns are less than the thresholds above, the observation column should show "yes".
If they are greater, it should show "no".
What is the best way to do this?
You can do this:
import numpy as np

df['observation'] = np.where(((df['Temperature']<=40) & (df['Humidity']<=50) & (df['Moisture']<150)) , 'yes', 'no')
which gives:
Temperature Humidity Moisture observation
0 51 29.5 0 no
1 51 29.5 188 no
2 50 29.5 0 no
3 50 29.5 350 no
4 50 29.5 0 no
Note that your conditions are ambiguous: did you mean that all three must be satisfied at the same time, or that any one of them is satisfied?
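If "any" was the intent, a sketch of the same np.where call with & swapped for |:
df['observation'] = np.where((df['Temperature']<=40) | (df['Humidity']<=50) | (df['Moisture']<150), 'yes', 'no')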

Adding columns from two file with awk

I have the following problem:
Let's say I have two files with the following structure:
1 17.650 0.001 0.000E+00
1 17.660 0.002 0.000E+00
1 17.670 0.003 0.000E+00
1 17.680 0.004 0.000E+00
1 17.690 0.001 0.000E+00
1 17.700 0.000 0.000E+00
1 17.710 0.004 0.000E+00
1 17.720 0.089 0.000E+00
1 17.730 0.011 0.000E+00
1 17.740 0.000 0.000E+00
1 17.750 0.032 0.000E+00
1 17.760 0.100 0.000E+00
1 17.770 0.020 0.000E+00
1 17.780 0.002 0.000E+00
2 -20.000 0.001 0.000E+00
2 -19.990 0.002 0.000E+00
2 -19.980 0.003 0.000E+00
2 -19.970 0.004 0.000E+00
2 -19.960 0.001 0.000E+00
2 -19.950 0.000 0.000E+00
2 -19.940 0.004 0.000E+00
2 -19.930 0.089 0.000E+00
2 -19.920 0.011 0.000E+00
2 -19.910 0.000 0.000E+00
2 -19.900 0.032 0.000E+00
2 -19.890 0.100 0.000E+00
2 -19.880 0.020 0.000E+00
2 -19.870 0.002 0.000E+00
The first two columns are identical in both files; what differs is the 3rd and 4th columns. The above is a sample of these files. The blank line is essential and appears throughout the files, separating the data into "blocks". It must be preserved!
Using the following command:
awk '{a[FNR]=$1; b[FNR]=$2; s[FNR]+=$3} END{for (i=1; i<=FNR; i++) print a[i], b[i], s[i]}' file1 file2 > file-result
I am trying to create a file where columns 1 and 2 are identical to the ones in the original files and the 3rd column is the sum of the 3rd column in file1 and file2.
This command works if there is no blank line. With the blank line I get the following:
1 17.650 0.001
1 17.660 0.002
1 17.670 0.003
1 17.680 0.004
1 17.690 0.001
1 17.700 0.000
1 17.710 0.004
1 17.720 0.089
1 17.730 0.011
1 17.740 0.000
1 17.750 0.032
1 17.760 0.100
1 17.770 0.020
1 17.780 0.002
0
2 -20.000 0.001
2 -19.990 0.002
2 -19.980 0.003
2 -19.970 0.004
2 -19.960 0.001
2 -19.950 0.000
2 -19.940 0.004
2 -19.930 0.089
2 -19.920 0.011
2 -19.910 0.000
2 -19.900 0.032
2 -19.890 0.100
2 -19.880 0.020
2 -19.870 0.002
(Please note that in the above I have not written the actual sums in column 3, but you get the idea.)
How can I make sure that the 0 doesn't appear on the blank line? I can't figure it out.
Note that if the first two columns are really identical you don't need to store them in arrays; store only the 3rd column of the first file.
The solution to your problem is simple: when processing the second file, test whether the line is blank and, if it is, print it; otherwise print the modified line. The next statement makes all this quite easy and clean.
awk 'NR == FNR {s[NR]=$3; next}
/^[[:space:]]*$/ {print; next}
{print $1, $2, s[FNR]+$3}' file1 file2 > file-result
The first block runs only on lines of the first file (NR == FNR is true only then). It stores the 3rd field in array s, indexed by the line number. The next statement moves immediately to the next line and prevents the other two blocks from running on lines of the first file.
The second block thus runs only on lines of the second file, and only if they are blank (^[[:space:]]*$ matches 0 or more spaces between the beginning (^) and end ($) of the line). The block prints a blank line as is, and the next statement again moves immediately to the next line, preventing the last block from running. Note that if your awk supports the \s operator you can replace [[:space:]] with \s. Note also that the test could equally be NF == 0 (NF is the number of fields of the current record).
So the 3rd and last block runs only on non-blank lines of the second file. It simply prints the first two fields and the sum of the third fields of the two files (taken from $3 and the s array).
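For reference, the same program using the NF == 0 test mentioned above in place of the regex; the behaviour is identical:
awk 'NR == FNR {s[NR]=$3; next}
     NF == 0   {print; next}
     {print $1, $2, s[FNR]+$3}' file1 file2 > file-result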

Get value of variable quantile per group

I have data that is categorized in groups, with a given quantile percentage per group. I want to create a threshold for each group that separates the values within the group based on that quantile. So if one group has q=0.8, I want the lowest 80% of values given 1, and the upper 20% of values given 0.
Given data like this, I want objects 1, 2 and 5 to get result 1 and the other three result 0. In total my data consists of 7,000,000 rows with 14,000 groups. I tried doing this with groupby.quantile, but that requires a constant quantile measure, whereas my data has a different one for each group.
Setup:
import numpy as np
import pandas as pd

num = 7_000_000
grp_num = 14_000
qua = np.around(np.random.uniform(size=grp_num), 2)
df = pd.DataFrame({
    "Group": np.random.randint(low=0, high=grp_num, size=num),
    "Quantile": 0.0,
    "Value": np.random.randint(low=100, high=300, size=num)
}).sort_values("Group").reset_index(0, drop=True)

def func(grp):
    grp["Quantile"] = qua[grp.Group]
    return grp

df = df.groupby("Group").apply(func)
Answer (this is basically a for loop over the groups, so for performance you can try applying numba to it):
def func2(grp):
    return grp.Value < grp.Value.quantile(grp.Quantile.iloc[0])

df["result"] = df.groupby("Group").apply(func2).reset_index(0, drop=True)
print(df)
Outputs:
Group Quantile Value result
0 0 0.33 156 1
1 0 0.33 259 0
2 0 0.33 166 1
3 0 0.33 183 0
4 0 0.33 111 1
... ... ... ... ...
6999995 13999 0.83 194 1
6999996 13999 0.83 227 1
6999997 13999 0.83 215 1
6999998 13999 0.83 103 1
6999999 13999 0.83 115 1
[7000000 rows x 4 columns]
CPU times: user 14.2 s, sys: 362 ms, total: 14.6 s
Wall time: 14.7 s
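A possible alternative, as an untested sketch (thr is just an illustrative name): compute a single threshold per group with one groupby.apply, then map it back onto the rows so the comparison itself is fully vectorized:
thr = df.groupby("Group").apply(lambda g: g["Value"].quantile(g["Quantile"].iloc[0]))
df["result"] = df["Value"] < df["Group"].map(thr)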

Searching File2 with 3 columns from File 1 with awk

Does anyone know how to print "Not Found" if there is no match, such that the print output will always contain the same number of lines as File 1?
To be more specific, I have two files with four columns:
File 1:
1 800 800 0.51
2 801 801 0.01
3 802 802 0.01
4 803 803 0.23
File 2:
1 800 800 0.55
2 801 801 0.09
3 802 802 0.88
4 803 804 0.24
This is what I am using now:
$ awk 'NR==FNR{a[$1,$2,$3];next}($1,$2,$3) in a{print $4}' file1.txt file2.txt
This generates the following output:
0.55
0.09
0.88
However, I want to get this:
0.55
0.09
0.88
Not Found
Could you please help?
Sorry if this is presented in a confusing manner; I have little experience with awk and am confused myself.
In a separate issue, I want to end up having a file that has the data from File 2 added on to File1, like so:
1 800 800 0.51 0.55
2 801 801 0.01 0.09
3 802 802 0.01 0.88
4 803 803 0.23 Not Found
I was going to generate the file as before (let's call it file2-matches.txt), then use the paste command:
paste -d"\t" file1.txt file2-matches.txt > output.txt
But considering I have to do this matching for over 100 files, is there any more efficient way to do this that you can suggest?
Add an else clause:
$ awk 'NR==FNR{a[$1,$2,$3];next} {if (($1,$2,$3) in a) {print $4} else {print "not found"}}' f1 f2
0.55
0.09
0.88
not found
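As for the second part, the lookup and the join can be done in a single awk pass, with no intermediate file and no paste. A sketch (note the file order is reversed: file2 is read first, its 4th column stored keyed by the first three fields, so the output follows file1 line by line):
awk 'NR==FNR {a[$1,$2,$3]=$4; next}
     {print $0, (($1,$2,$3) in a ? a[$1,$2,$3] : "Not Found")}' file2.txt file1.txt > output.txt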

obtain averages of field 2 after grouping by field 1 with awk

I have a file with two fields containing numbers that I have sorted numerically based on field 1. The numbers in field 1 range from 1 to 200000 and the numbers in field 2 between 0 and 1. I want to obtain averages for both field 1 and field 2 in batches (based on rows).
Here is an example when specifying batches of 4 rows. The input:
1 0.12
1 0.34
2 0.45
2 0.40
50 0.60
301 0.12
899 0.13
1003 0.14
1300 0.56
1699 0.43
2100 0.25
2500 0.56
The output would be:
1.5 0.327
563.25 0.247
1899.75 0.45
Here you go:
awk -v n=4 '{s1 += $1; s2 += $2; if (++i % n == 0) { print s1/n, s2/n; s1=s2=0; } }'
Explanation:
Initialize n=4, the size of the batches
Collect the sums: sum of 1st column in s1, the 2nd in s2
Increment counter i by 1 (default initial value is 0, no need to set it)
If i is divisible by n with no remainder, then we print the averages, and reset the sum variables
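One caveat: if the total number of rows is not a multiple of n, the command above silently drops the trailing rows. A sketch that also averages the final partial batch:
awk -v n=4 '{s1 += $1; s2 += $2; if (++i % n == 0) { print s1/n, s2/n; s1=s2=0 } }
            END { if (i % n) print s1/(i%n), s2/(i%n) }'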