merge similar files in awk

merge similar files in awk - awk

I have the following files
File A
Kmax Event File - Text Format
1 6 1000
1 4143 9256 13645 16426 20490
49 4144 8820 14751 16529 20505
45 4146 8308 12303 16912 22715
75 4139 9049 14408 16447 20480
23 4137 8449 13223 16511 20498
22 4142 8795 14955 16615 20493
File B
Kmax Event File - Text Format
1 6 1000
42 4143 9203 13401 16475 20480
44 4140 8251 12302 16932 21872
849 6283 8455 12301 16415 20673
18 4148 8238 12757 16597 20484
19 4144 8268 12306 17110 21868
50 4134 8331 12663 16501 20606
988 5682 8296 12306 16577 20592
61 4147 8330 12307 16945 22497
0 4138 8333 12310 16871 22749
File C, File D, ... and all those files have exact the same format. In addition the file name of each file is the following : run, run%1, run%2, run%3, run%4 etc. The file number could reach even up to 30, run%30 that is.
What I'd like to do is to merge the files in the following way
Kmax Event File - Text Format
1 6 1000
1 4143 9256 13645 16426 20490
49 4144 8820 14751 16529 20505
45 4146 8308 12303 16912 22715
75 4139 9049 14408 16447 20480
23 4137 8449 13223 16511 20498
22 4142 8795 14955 16615 2049
42 4143 9203 13401 16475 20480
44 4140 8251 12302 16932 21872
849 6283 8455 12301 16415 20673
18 4148 8238 12757 16597 20484
19 4144 8268 12306 17110 21868
50 4134 8331 12663 16501 20606
988 5682 8296 12306 16577 20592
61 4147 8330 12307 16945 22497
0 4138 8333 12310 16871 22749
I believe I can do it using
awk '{i=$1;sub(i,x);A[i]=A[i]$0} FILENAME==ARGV[ARGC-1]{print i A[i]}'
but in this way the two first lines of the second file will be present. In addition I don't know if the above line will work. The problem is that I will need to merge many files at the same time. Any idea to merge those almost identical files?

Using grouping braces in the shell
{ cat run; sed '1,2d' run%*; } > c

You can use cat and tail:
cat A > C && tail -n +3 B >> C
This will merge file A and B in a new file named C.
Using awk:
awk 'FNR==NR{print; next} FNR>2' A B > C
If you have more than one file to merge into one, you can list them next to A B in the awk version, e.g A B D. C in awk version is the output file containing merged data.
In cat and tail version you can repeat tail part of the code for other files, e.g
cat A > C && tail -n +3 B >> C && tail -n +3 D >> C
or create some kind of loop to iterate over files.

Print all lines from the first file (NR==FNR) and only line 3 and on from the rest of the files (FNR>2):
awk 'NR==FNR||FNR>2' run*

Related

Using awk to select rows with a specific value in column greater than x

I tried to use awk to select all rows with a value greater than 98 in the third column. In the output, only lines between 98 - 98.99... were selected and lines with a value more than 98.99 not.
I would like to extract all lines with a value greater than 98 including 99, 100 and so on.
Here my code and my input format:
for i in *input.file; do awk '$3>98' $i >{i/input./output.}; done
A chr11 98.80 83 1 0 2 84
B chr7 95.45 22 1 0 40 61
C chr7 88.89 27 0 1 46 72
D chr6 100.00 20 0 0 1 20
Expected Output
A chr11 98.80 83 1 0 2 84
D chr6 100.00 20 0 0 1 20

Okay, if you have a series of files, *input.file and you want to select those lines where $3 > 98 and then write the values to the same prefix, but with output.file as the rest of the filename, you can use:
awk '$3 > 98 {
match (FILENAME,/input.file$/)
print $0 > substr(FILENAME,1,RSTART-1) "output.file"
}' *input.file
Which uses match to find the index where input.file begins and then uses substr to get the part of the filename before that and appends "output.file" to the substring for the final output filename.
match() sets the RSTART value to the index where input.file begins in the current filename which is then used by substr truncate the current filename at that index. See GNU awk String Functions for complete details.
For exmaple, if you had input files:
$ ls -1 *input.file
v1input.file
v2input.file
Both with your example content:
$ cat v1input.file
A chr11 98.80 83 1 0 2 84
B chr7 95.45 22 1 0 40 61
C chr7 88.89 27 0 1 46 72
D chr6 100.00 20 0 0 1 20
Running the awk command above would results in two output files:
$ ls -1 *output.file
v1output.file
v2output.file
Containing the records where the third-field was greater than 98:
$ cat v1output.file
A chr11 98.80 83 1 0 2 84
D chr6 100.00 20 0 0 1 20

Changing the field separator of awk to newline

The -F option lets you specify the field separator for awk, but using '\n' as the line separator doesn't work, that is, it doesn't make $1 the first line of the input, $2 the second line, and so on.
I suspect that this is because awk looks for the field separator within each line. Is there a way to get around this with awk, or some other Linux command? Basically, I want to separate my input by newline characters and put them into an Excel file.
I'm still warming up to Linux and shell scripts, which is the reason for my lack of creativity with this problem.
Thank you!

You may require to overwrite the input record separator (RS), which default is newline.
See my example below,
$ cat test.txt
a
b
c
d
$ awk 'BEGIN{ RS = "" ; FS = "\n" }{print $1,$2,$3,$4}' test.txt
a b c d

Note that you can change both the input and output record separator so you can do something like this to achieve a similar result to the accepted answer.
cat test.txt
a
b
c
d
$ awk -v ORS=" " '{print $1}' test.txt
a b c d

one can simplify it to just the following, with a minor caveat of extra trailing space without trailing newline :
% echo "a\nb\nc\nd"
a
b
c
d
% echo "a\nb\nc\nd" | mawk 8 ORS=' '
a b c d %
To rectify that, plus handling the edge case of no trailing newline from input, one can modify it to :
% echo -n "a\nb\nc\nd" | mawk 'NF-=_==$NF' FS='\n' RS='^$' | odview
0000000 543301729 174334051
a b c d \n
141 040 142 040 143 040 144 012
a sp b sp c sp d nl
97 32 98 32 99 32 100 10
61 20 62 20 63 20 64 0a
0000010
% echo "a\nb\nc\nd" | mawk 'NF -= (_==$NF)' FS='\n' RS='^$' | odview
0000000 543301729 174334051
a b c d \n
141 040 142 040 143 040 144 012
a sp b sp c sp d nl
97 32 98 32 99 32 100 10
61 20 62 20 63 20 64 0a
0000010

How to insert two lines for every data frame using awk?

I have repeating data as follows
....
4 4 4 66 79 169 150 0 40928 40938 40923 40921 40789 40000 40498
5 4 3 16 22 247 0 40168 40911 40944 40205 40000 40562
6 4 4 17 154 93 309 0 40930 40919 40903 40917 40852 40000 40419
7 3 2 233 311 0 40936 40932 40874 40000 40807
....
This data is made up of 115 data blocks, and each data block have 4000 lines like that format.
Here, I hope to put two new lines (number of line per data block = 4000 and empty line) at the begining of each data blocks, so it looks
4000
1 4 4 244 263 704 952 0 40936 40930 40934 40921 40820 40000 40570
2 4 4 215 172 305 33 0 40945 40942 40937 40580 40687 40000 40410
3 4 4 344 279 377 1945 0 40933 40915 40907 40921 40839 40000 40437
4 4 4 66 79 169 150 0 40928 40938 40923 40921 40789 40000 40498
...
3999 2 2 4079 4081 0 40873 40873 40746 40000 40634
4000 1 1 4080 0 40873 40923 40000 40345
4000
1 4 4 244 263 704 952 0 40936 40930 40934 40921 40820 40000 40570
2 4 4 215 172 305 33 0 40945 40942 40937 40580 40687 40000 40410
3 4 4 344 279 377 1945 0 40933 40915 40907 40921 40839 40000 40437
4 4 4 66 79 169 150 0 40928 40938 40923 40921 40789 40000 40498
...
Can I do this with awk or any other unix command?

My solution is more general, since the blocks can be of non-equal lenght as long as you restart the 1st field counter to denote the beginning of a new block
% cat mark_blocks
$1<count { print count; print "";
for(i=1;i<=count;i++) print l[i]; }
# executed for each line
{ l[$1] = $0; count=$1}
END { print count; print "";
for(i=1;i<=count;i++) print l[i]; }
% awk -f mark_blocks your_data > marked_data
%
The working is simple, awk accumulates lines in memory and it prints the header lines and the accumulated data when it reaches a new block or EOF.
The (modest) trick is that the output action must take place before we do the usual stuff we do for each line.

A simple one liner using awk can do the purpose.
awk 'NR%4000==1{print "4000\n"} {print$0}' file
what it does.
print $0 prints every line.
NR%4000==1 selects the 4000th line. When it occures it prints a 4000 and a newline \n, that is two new lines.
NR Number of records, which is effectivly number of lines reads so far.
simple test.
inserts 4000 at 5th line
awk 'NR%5==1{print "4000\n"} {print$0}'
output:
4000
1
2
3
4
5
4000
6
7
8
9
10
4000
11
12
13
14
15
4000
16
17
18
19
20
4000

You can do it all in bash :
cat $FILE | ( let countmax=4000; let count=countmax; while read lin ; do if [ $count == $countmax ]; then let count=0; echo -e "$countmax\n" ; fi ; echo $lin ; let count=count+1 ; done )
Here we assume you are reading this data from $FILE . Then all we are doing is reading from the file and piping it into our little bash script.
The bash script reads lines one by one (with the while read lin) , and increments the counter countfor each line. When starting or when the counter count reaches the value countmax (set to 4000) , then it prints out the 2 lines you asked for.

Join multiple files in gawk

I have a large number of files (around 500). Each file contain two columns. The first column is same for every file. I want to join all the files into a single file using gawk.
For example,
File 1
a 123
b 221
c 904
File 2
a 298
b 230
c 102
and so on. I want a final file like as below:
Final file
a 123 298
b 221 230
c 904 102
I have found scripts that can join two files, but I need to join multiple files.

For given sample files:
$ head f*
==> f1 <==
a 123
b 221
c 904
==> f2 <==
a 298
b 230
c 102
==> f3 <==
a 500
b 600
c 700
Method 1:
$ awk '{a[FNR]=((a[FNR])?a[FNR]FS$2:$0)}END{for(i=1;i<=FNR;i++) print a[i]}' f*
a 123 298 500
b 221 230 600
c 904 102 700
Method 2: (Will probably be faster as your are not loading 500 files in memory)
Using paste and awk together. (Assuming first column is same and present in all files). Doing paste f* will give you the following result:
$ paste f*
a 123 a 298 a 500
b 221 b 230 b 600
c 904 c 102 c 700
Pipe that to awk to remove extra columns.
$ paste f* | awk '{printf "%s ",$1;for(i=2;i<=NF;i+=2) printf "%s%s",$i,(i==NF?RS:FS)}'
a 123 298 500
b 221 230 600
c 904 102 700
You can re-direct the output to another file.

I have encountered this problem very frequently.
I strongly encourage you to check into the getline function in gawk.
getline var < filename
is the command syntax and can be used to solve your problem.
I would suggest utilizing another language that solves this problem much more easily. Typically I invest about 5 lines of code to solve this standard problem.
j=1;
j=getline x < "filename";
if(j==0) {
break;
}
... (Commands involving x such as split and print).

You could try something like :
$ ls
f1.txt f2.txt f3.txt
$ awk '($0 !~ /#/){a[$1]=a[$1]" "$2} END {for(i in a){print i""a[i]}}' *.txt
a 123 298 299
b 221 230 231
c 904 102 103

awk 'FNR==NR{arr[$1]=$2; next;}{printf "%s%s%s%s%s",$1,OFS,arr[$1],OFS,$2; print"";}' file1 file2
based on this

compare a text file with another files

I have a file named file.txt as shown below
12 2
15 7
134 8
154 12
155 16
167 6
175 45
45 65
812 54
I have another five files named A.txt, B.txt, C.txt, D.txt, E.txt. The contents of these files are shown below.
A.txt
45
134
B.txt
15
812
155
C.txt
12
154
D.txt
175
E.txt
167
I need to check, which file contains the values of first column of file.txt exists and print the name of the file as third column.
Output:-
12 2 C
15 7 B
134 8 A
154 12 C
155 16 B
167 6 E
175 45 D
45 65 A
812 54 B

This should work:
One-liner:
awk 'FILENAME != "file.txt"{ a[$1]=FILENAME; next } $1 in a { $3=a[$1]; sub(/\..*/,"",$3) }1' {A..E}.txt file.txt
Formatted with comments:
awk '
#Check if the filename is not of the main file
FILENAME != "file.txt" {
#Create a hash. Store column 1 values of look up files as key and assign filename as values
a[$1]=FILENAME
#Skip the rest of the action
next
}
#Check the first column of main file is a key in the hash
$1 in a {
#If the key exists, assign the value of the key (which is filename) as Column 3 of main file
$3=a[$1]
#Using sub function, strip the extension of the file name as desired in your output
sub(/\..*/,"",$3)
#1 is a non-zero value forcing awk to print. {A..E} is brace expansion of your files.
}1' {A..E}.txt file.txt
Note: The main file needs to be passed at the end.
Test:
[jaypal:~/Temp] awk 'FILENAME != "file.txt"{ a[$1]=FILENAME; next } $1 in a { $3=a[$1]; sub(/\..*/,"",$3) ; printf "%-5s%-5s%-5s\n",$1,$2,$3}' {A..E}.txt file.txt
12 2 C
15 7 B
134 8 A
154 12 C
155 16 B
167 6 E
175 45 D
45 65 A
812 54 B

#! /usr/bin/awk -f
FILENAME == "file.txt" {
a[FNR] = $0;
c=FNR;
}
FILENAME != "file.txt" {
split(FILENAME, name, ".");
k[$1] = name[1];
}
END {
for (line = 1; line <= c; line++) {
split(a[line], seg, FS);
print a[line], k[seg[1]];
}
}
# $ awk -f script.awk *.txt

This solution does not preserve the order
join <(sort file.txt) \
<(awk '
FNR==1 {filename = substr(FILENAME, 1, length(FILENAME)-4)}
{print $1, filename}
' [ABCDE].txt |
sort) |
column -t
12 2 C
134 8 A
15 7 B
154 12 C
155 16 B
167 6 E
175 45 D
45 65 A
812 54 B

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

merge similar files in awk - awk

Using grouping braces in the shell { cat run; sed '1,2d' run%*; } > c

Print all lines from the first file (NR==FNR) and only line 3 and on from the rest of the files (FNR>2): awk 'NR==FNR||FNR>2' run*

Related

Using awk to select rows with a specific value in column greater than x

Changing the field separator of awk to newline

How to insert two lines for every data frame using awk?

Join multiple files in gawk

compare a text file with another files

Categories

Resources