How to extract files from a merged file - awk

I want to separate a merged file into two files. The file:
file.dat
i =100
1 2 3
i =1
-1 -2 -3
i =101
1 2 3
i =102
1 2 3
i =103
1 2 3
i =2
-1 -2 -3
....
The mixed indices are
1, 2, 3, 4, ..., 99
and
100, 101, 102, 103, ..., 200.
The two index ranges are interleaved, but in no fixed pattern.
The data
1 2 3
and
-1 -2 -3
just denote the data block in each step.
Could you give an idea to separate the merged file into two files with respect to the indices?

If you just want the data blocks appended to two different files, depending on which group of indexes it belongs to, this should work:
# separate.awk
{
    if ($1 == "i") {
        split($2, a, "=");
        i = a[2];
    }
    if (i < 100)
        print > "1-99.dat";
    else
        print > "100-200.dat"
}
$ awk -f separate.awk file.dat
$ cat 1-99.dat
i =1
-1 -2 -3
i =2
-1 -2 -3
$ cat 100-200.dat
i =100
1 2 3
i =101
1 2 3
i =102
1 2 3
i =103
1 2 3
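A small robustness note (my addition, not part of the original answer): i comes out of split() as a string. POSIX awk still compares it numerically in i < 100, but you can force the point with i+0, and the whole thing also fits on one line:
awk '$1 == "i" {split($2, a, "="); i = a[2]} {print > (i+0 < 100 ? "1-99.dat" : "100-200.dat")}' file.dat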

This awk should do it for you:
awk -F= '/=/{f="a.txt";if($2>99)f="b.txt";next} {print >f}' file.dat
First, it sets the field separator to =. If a line contains an equals sign, it sets the name of the output file to either "a.txt" or "b.txt", depending on the number after the equals sign, and skips to the next line with next. On subsequent records it simply writes to the file last selected.
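Because of that next, the i = index lines themselves are not copied to the output files. If you want them kept, as in the previous answer, a minimal variation (my sketch) is to drop next and fold the file choice into one expression:
awk -F= '/=/{f = ($2 > 99 ? "b.txt" : "a.txt")} {print > f}' file.dat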

Related

awk: Print $1 with varying number of additional fields on the same line

I have an input file with ~100 lines and ~100 fields per line. Each field represents either a positive or negative value. I wish to print $1 followed by only the positive or negative fields in each line. The number of positive or negative fields per line is random.
sample input
0 x 9 8 7 -1 -2 -3
2 x 7 6 -2 -3 -4 -5
4 x 4 3 2 1 -6 -7
desired output
positive
0 9 8 7
2 7 6
4 4 3 2 1
negative
0 -1 -2 -3
2 -2 -3 -4 -5
4 -6 -7
context and attempt
The outputs above print $1, followed by either the positive or the negative values from the remaining fields, on the same line as $1.
The current code I tried (for positive values, starting on line 6 in my input):
awk 'NR>5{for(i=3; i<=NF; i++) if ( $i > 0 ) print $1, $i}' input > output
This works fine, except that it prints output like:
0 9
0 8
0 7
2 7
2 6
4 4
4 3
4 2
4 1
I have also tried:
awk 'BEGIN {ORS="\t"} NR>5 {print $1} {for(i=3;i<=NF;i++) if ( $i > 0 && i <= NF) {print $i}}' input > output
but then I never move to a new line in the output. If I change ORS back to \n via some 'else if (i = NF) {ORS=...}' condition, then it prints all field outputs for each i on a new line, as if the BEGIN statement had no effect.
question
How can I tell awk to print $1, then print all other output from the same input line onto the same output line, then advance 1 new line in the output and repeat the process for the next input line?
Thank you.
response to Tiw's answer
I tried to execute this in a loop for my two files:
for j in 1 2; do
    positive=ofile.p0
    negative=ofile.m0
    awk 'NR>5{
        printf $1>"positive";
        printf $1>"negative";
        for(i=3;i<=NF;i++)
            if($i~/[-+]?[0-9]+/)
                if ($i>0) printf OFS $i>"positive";
                else if($i<0) printf OFS $i>"negative";
        print "">"positive";
        print "">"negative";
    }' ofile.0$j
    mv positive $positive$j
    mv negative $negative$j
done
but it hangs. Edit: Tiw's answer has been updated to use "%s" in printf; it works with that change.
Try this:
awk 'NF>5{printf "%s",$1>"positive";printf "%s",$1>"negative"; for(i=2;i<=NF;i++) if($i~/^[-+]?[0-9]+$/) if ($i>0) printf "%s",OFS $i>"positive"; else if($i<0) printf "%s",OFS $i>"negative"; print "">"positive";print "">"negative";}' input
With a file named input:
0 x 9 8 7 -1 -2 -3
2 x 7 6 -2 -3 -4 -5
4 x 4 3 2 1 -6 -7
It will create two files,
one positive:
0 9 8 7
2 7 6
4 4 3 2 1
one negative:
0 -1 -2 -3
2 -2 -3 -4 -5
4 -6 -7
Put in multiple lines for better readability:
awk 'NF>5{
printf "%s",$1>"positive";
printf "%s",$1>"negative";
for(i=2;i<=NF;i++)
if($i~/^[-+]?[0-9]+$/) ## Another and better way is $i == $i + 0
if ($i>0) printf "%s",OFS $i>"positive";
else if($i<0) printf "%s",OFS $i>"negative";
print "">"positive";
print "">"negative";
}' input
It's quite straightforward so I guess it's easy for you to understand.
Note I didn't use {} to enclose the body after the for and the ifs, since each has only one statement in it, so the braces can be omitted.
print appends the output record separator (a newline \n by default) at the end; printf doesn't.
Also, NR means Number of Records, i.e. the line number. I changed it to NF, which means Number of Fields, since I think that is what you wanted.
if($i~/^[-+]?[0-9]+$/) tests that the field is a number.
If the field can never be empty, then $i==$i+0 is a better test.
To also require that the field is neither 0 nor empty, use $i && ($i==$i+0).
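A quick way to see that numeric test in action (a throwaway illustration of mine, not from the original answer):
$ echo '7 x -3 0' | awk '{for(i=1;i<=NF;i++) print $i, ($i==$i+0 ? "numeric" : "not numeric")}'
7 numeric
x not numeric
-3 numeric
0 numeric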
The first thing you need to do is check whether the field is a number; only then does testing its sign make sense. In awk, you can check whether a variable is a number by adding zero to it and comparing the result with the original value.
For positive numbers you do this:
awk '{for(i=1;i<=NF;++i) if ($i+0 == $i && $i >= 0) printf "%s%s", $i, OFS; printf "%s", ORS}' file
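(The "%s" format matters: passing a field directly as the printf format string misbehaves if the data ever contains a % character.) This one-liner prints $1 only because $1 happens to be a non-negative number itself; a sketch of the negative counterpart (my assumption of what's wanted) has to print $1 explicitly:
awk '{printf "%s", $1; for(i=2;i<=NF;++i) if ($i+0 == $i && $i < 0) printf "%s%s", OFS, $i; printf "%s", ORS}' file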
If Perl is an option,
Input:
$ cat blaisem.txt
0 x 9 8 7 -1 -2 -3
2 x 7 6 -2 -3 -4 -5
4 x 4 3 2 1 -6 -7
$
+ve and -ve separate runs
$ perl -ne ' @p=/(\S+)(?<=\d)/g;print "$p[0] "; for(@p[1..$#p]) { print "$_ " if $_ >=0 } print "\n" ' blaisem.txt
0 9 8 7
2 7 6
4 4 3 2 1
$ perl -ne ' @p=/(\S+)(?<=\d)/g;print "$p[0] "; for(@p[1..$#p]) { print "$_ " if $_ < 0 } print "\n" ' blaisem.txt
0 -1 -2 -3
2 -2 -3 -4 -5
4 -6 -7
$
+ve and -ve in one script
$ perl -ne ' open(POS,">>pos.txt"); open(NEG,">>neg.txt"); @p=/(\S+)(?<=\d)/g;
print POS "$p[0] "; print NEG "$p[0] ";
for(@p[1..$#p]) { print NEG "$_ " if $_ < 0; print POS "$_ " if $_>=0 }
print POS "\n"; print NEG "\n" ' blaisem.txt
$ cat pos.txt
0 9 8 7
2 7 6
4 4 3 2 1
$ cat neg.txt
0 -1 -2 -3
2 -2 -3 -4 -5
4 -6 -7
$
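In case the regex looks opaque, a gloss (mine, not the original author's): /(\S+)(?<=\d)/g collects every whitespace-separated token whose last character is a digit, which is what silently drops the x column:
$ perl -e ' @p = "0 x 9 -1" =~ /(\S+)(?<=\d)/g; print "@p\n" '
0 9 -1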

Awk - Conditionally print an element from a certain row, based on the condition of a different element in a different row

Say I have a lot of files with a consistent number of columns and rows, and a sample one looks like this:
1 2 3
4 5 6
7 8 9
I want to print column 3 of row 2, but only if column 3 of row 3 == 4 (in this sample it is 9). I'm using this logic as a means to determine whether the file is valid for my use case, and to extract the relevant field if it is.
My attempt, based on other answers to people asking how to isolate certain rows, was this: awk 'BEGIN{FNR=3} $3=="4"{FNR=2;print $2}'
So you are looking for something like this?
awk 'FNR==2{ x = $3 }FNR==3 && $3=="4"{ print x }' file.txt
cat file.txt
1 2 3
4 5 6
7 8 4
Output:
6
cat file.txt
1 2 3
4 5 6
7 8 9
Output:
Nothing since column 3 of row 3 is 9
awk 'FNR==3 && $3==4{print p} {p=$3}' *
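This terser variant works because p always holds the previous record's third column, so on row 3 it still has row 2's value. With the first file.txt above, for instance:
$ awk 'FNR==3 && $3==4{print p} {p=$3}' file.txt
6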
Here's another, which doesn't care about the order in which the records appear. In the OP the problem is to print a value (v) from the 2nd record based on a tested value (t) in the 3rd record; this solution also allows the test value to appear in an earlier record than the value to be printed:
$ awk '
FNR==2 {        # record holding the value to print
    v=$3
    f=1         # flag: value v has been read
}
FNR==3 {        # record holding the value to test
    t=$3
    g=1         # flag: test value has been read
}
f && g {        # once both values have been acquired,
    if(t==4)    # run the test
        print v # output
    exit        # and exit
}' file
6
Record order reversed (FNR values changed in the code):
$ cat file2
1 2 3
7 8 4 # records
4 5 6 # reversed
$ awk 'FNR==3{v=$3;f=1}FNR==2{t=$3;g=1}f&&g{if(t==4)print v;exit}' file2
6
Separate flags f and g are used, rather than testing v and t themselves, in case either value happens to be empty ("").

How to compare pairs of columns in awk?

I have the following dataset from a pairwise analysis (the first row is just the sample ids):
A B A C
1 1 1 0
1 2 1 1
1 0 1 2
I wish to compare the values of field 1 and field 2, then field 3 and field 4, and print the row number (NR) every time I see a 1 and 2 combination in the pair I am examining.
For example for pairs A and B, I would want the output:
A B 2
For pairs A and C, I would want the output:
A C 3
I would want to proceed row by row so I would likely need the code to include:
for i in {1..3}; do
    awk 'NR=="'${i}'" {code}'
done
But I have no idea how to proceed in a pairwise fashion (i.e. compare field 1 and field 2 and then field 3 and field 4 etc...).
How can I do this?
It's hard to say with such a minimal example but this MAY be what you want:
$ cat tst.awk
FNR==1 {
    for (i=1; i<=NF; i++) {
        name[i] = $i
    }
    next
}
{
    for (i=1; i<NF; i+=2) {
        if ( ($i == 1) && ($(i+1) == 2) ) {
            print name[i], name[i+1], NR-1
        }
    }
}
$ awk -f tst.awk file
A B 2
A C 3
You certainly should only run the script once; there's no need to run awk once per input line. It isn't yet entirely clear how you want multiple matches printed; however, if you're working a line at a time, then the output presumably comes a line at a time.
Working on that basis, then:
awk 'NR == 1 { for (i = 1; i < NF; i += 2)
                   { cols[(i+1)/2,1] = $i; cols[(i+1)/2,2] = $(i+1); }
               next
             }
     { for (i = 1; i < NF; i += 2)
           { if ($i == 1 && $(i+1) == 2)
                 print cols[(i+1)/2,1], cols[(i+1)/2,2], NR - 1
           }
     }'
The NR == 1 block captures the headings so they can be used in the main printing code; there are plenty of other ways to store the information too. The other block looks at each data line, checks whether pairs of fields contain 1 2, and prints the heading pair plus the data row number on a match. Because NF will be an even number but the loop counts over the odd indices, the < comparison is correct. Often in awk you use for (i = 1; i <= NF; i++) with a single increment, and then <= is required for correct behaviour.
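As a quick sanity check of the index arithmetic (my illustration, not part of the original answer): with NF = 4 the loop visits i = 1 and i = 3, and (i+1)/2 maps those to pair numbers 1 and 2.
$ awk 'BEGIN { n = 4; for (i = 1; i < n; i += 2) print i, (i+1)/2 }'
1 1
3 2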
For your minimal data set, this produces:
A B 2
A C 3
For this larger data set:
A B A C
1 1 1 0
1 2 1 1
1 0 1 2
1 2 4 2
5 3 1 9
7 0 3 2
1 2 1 0
9 0 1 2
1 2 3 2
the code produces:
A B 2
A C 3
A B 4
A B 7
A C 8
A B 9

how to output data from multiple files in side by side columns in one file via awk?

I have 30 files, called UE1.dat, UE2.dat, ..., each with 4 columns. An example of their column structure is given below for UE1.dat and UE2.dat.
UE1.dat
1 4 2 1
2 2 3 3
3 2 4 4
4 4 4 2
UE2.dat
2 6 8 7
4 4 9 6
7 1 1 2
9 3 3 3
So, I have tried the following code:
for((i=1;i<=30;i++)); do awk 'NR$i {printf $1",";next} 1; END {print ""}' UE$i.dat; done > UE_all.dat
to get only the first column from every file and write the columns side by side into a single file. The desired output is given below.
1 2
2 4
3 7
4 9
But unfortunately, the code arranges them in rows. Can you give a hint?
Thank you in advance!
In awk you can do it this way:
1) Put this code in a file named output_data_from_multiple_files.awk:
BEGIN {
    # All the input files are processed in one run.
    # filenumber counts the input files.
    filenumber = 0
}
FNR == 1 {
    # FNR == 1 means we are starting a new input file.
    ++filenumber
}
{
    # FNR is the record number in the current input file.
    # Append the value of the first column to the corresponding
    # output line; skip the separator the first time around so the
    # lines don't start with a space.
    output[FNR] = (FNR in output) ? output[FNR] " " $1 : $1
}
END {
    # print the output
    for (i = 1; i <= FNR; i++)
        printf("%s\n", output[i])
}
2) Run awk -f output_data_from_multiple_files.awk UE*
All the files are handled in a single execution of awk. FNR is the record number in the current input file; filenumber counts the files processed. The first-column values read from the input files are concatenated in the output array.
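Assuming just the two sample files above, a run looks like this:
$ awk -f output_data_from_multiple_files.awk UE1.dat UE2.dat
1 2
2 4
3 7
4 9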
Concatenate all of the columns into one file with an awk associative array:
# use a wildcard to get all the files (could also use a for-loop)
# add each new row to the array using the line number as an index
# at the end of reading all files, go through each index (1-4 in
# your example) and print the index followed by the concatenated rows;
# for-in traversal order is unspecified, hence the pipe to sort
awk '{a[FNR] = a[FNR]" "$0}END{ for (i in a) print i, a[i] | "sort -k1n"}' allfiles*
I'd probably go with something like the following, using Perl rather than awk because I prefer its handling of data structures. In this case we use a two-dimensional array, insert the first column of each file into a new column of the array, then print the whole thing.
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;

my $num_files = 2;
my @rows;
my $max = 0;

for my $filenum ( 1..$num_files ) {
    open ( my $input, "<", "UE${filenum}.dat" ) or die $!;
    my $count = 0;
    while ( <$input> ) {
        my @fields = split;
        push ( @{$rows[$filenum]}, $fields[0] );
        $count++;
    }
    close ( $input );
    if ( $count > $max ) { $max = $count };
}
print Dumper \@rows;    # debugging output; remove if unwanted
for ( 1..$max ) {
    foreach my $filenum ( 1..$num_files ) {
        print shift @{$rows[$filenum]} || '', " ";
    }
    print "\n";
}
My solution is this:
gawk 'BEGINFILE{f++}{print FNR,f,$1}' UE* | sort -nk 1,2 | cut -d" " -f3 | xargs -L $(ls UE*.dat | wc -l)
This is how I got to it... I number the lines and files using gawk, then sort primarily by line number and secondarily by file number using sort, and finally strip the line and file numbers with cut. So...
gawk 'BEGINFILE{f++}{print FNR,f,$1}' UE*
1 1 1 # line 1 file 1 is 1
2 1 2 # line 2 file 1 is 2
3 1 3 # line 3 file 1 is 3
4 1 4 # line 4 file 1 is 4
1 2 2 # line 1 file 2 is 2
2 2 4 # line 2 file 2 is 4
3 2 7 # line 3 file 2 is 7
4 2 9 # line 4 file 2 is 9
Then use sort to put the first line of file 1 before the first line of file 2, ..., the first line of file n, then the second line of file 1, the second line of file 2, and so on. Then get the third column:
gawk 'BEGINFILE{f++}{print FNR,f,$1}' UE* | sort -nk 1,2 | cut -d" " -f3
1
2
2
4
3
7
4
9
And then put them back together with xargs
gawk 'BEGINFILE{f++}{print FNR,f,$1}' UE* | sort -nk 1,2 | cut -d" " -f3 | xargs -L2
1 2
2 4
3 7
4 9
The -L2 at the end must match the number of files, i.e. -L30 in your case.

extracting data from a file with awk

I have a data set like below
first 0 1
first 1 2
first 2 3
second 0 1
second 1 2
second 2 3
third 0 1
third 1 2
third 2 3
I need to process this file and extract the third-column values for first, second and third, storing them in different files.
The output files should contain:
1
2
3
This is pretty straightforward: awk '{print $3>$1}' file, i.e. print the third field and redirect the output to a file whose name is the first field.
Demo:
$ ls
file
$ awk '{print $3>$1}' file
$ ls
file first second third
$ cat first
1
2
3
$ cat second
1
2
3
$ cat third
1
2
3
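One caveat worth adding (my note, not part of the original answer): each distinct value of $1 keeps an output file open, and some awk implementations limit the number of simultaneously open files. With many keys, close each file after writing, and append so earlier output isn't truncated:
awk '{print $3 >> $1; close($1)}' file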