I have two files:
File 1 (sep = tab):
A1 bla blo bli 23
A1 bla blo bli 21
A1 bla blo bli 28
B2 bla blo bli 32
B2 bla blo bli 31
B2 bla blo bli 35
File 2 (sep = ;):
fli;flo;A1;flu;flc
fli;flo;A2;flu;flc
fli;flo;B1;flu;flc
fli;flo;B2;flu;flc
And I am trying to append the values of each matching key from File 1 to File 2, like this:
fli;flo;A1;flu;flc;23;21;28
fli;flo;A2;flu;flc;
fli;flo;B1;flu;flc;
fli;flo;B2;flu;flc;32;31;35
Do you have an awk command that could do that?
Thanks in advance
awk 'BEGIN{OFS=";"}
NR==FNR { a[$1,++b[$1]] = $NF; next }
{ s=$0 OFS; for(i=1;i<=b[$3];++i) s = s (i>1?OFS:"") a[$3,i]; print s }' FS="\t" file1 FS=";" file2
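Spelled out over several lines, and rebuilt against the sample layout above (this sketch assumes the key in file1 is $1 with the value in its last field, and the key in file2 is the third ;-separated field):

```shell
cd "$(mktemp -d)"
# Recreate the question's sample files (tabs in file1, ';' in file2).
printf 'A1\tbla\tblo\tbli\t23\nA1\tbla\tblo\tbli\t21\nA1\tbla\tblo\tbli\t28\n'  > file1
printf 'B2\tbla\tblo\tbli\t32\nB2\tbla\tblo\tbli\t31\nB2\tbla\tblo\tbli\t35\n' >> file1
printf 'fli;flo;A1;flu;flc\nfli;flo;A2;flu;flc\nfli;flo;B1;flu;flc\nfli;flo;B2;flu;flc\n' > file2

awk '
BEGIN { OFS=";" }
NR==FNR {                        # first pass: file1
    a[$1, ++b[$1]] = $NF         # store each value under its key, in order
    next
}
{                                # second pass: file2, keyed on $3
    s = $0 OFS
    for (i=1; i<=b[$3]; i++)     # append every stored value for this key
        s = s (i>1 ? OFS : "") a[$3, i]
    print s
}' FS='\t' file1 FS=';' file2
```

Rows of file2 with no match in file1 keep a single trailing ';', as in the expected output.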
I need to compare columns 1 and 2 of my file1.txt and file2.txt. If both columns match, print the matching row of file2.txt (with its value); where a row in file1.txt is not present in file2.txt, also print that missing row, with "0" as its value in the third column.
# file1.txt #
AA ZZ
JB CX
CX YZ
BB XX
SU BY
DA XZ
IB KK
XY IK
TY AB
# file2.txt #
AA ZZ 222
JB CX 345
BB XX 3145
DA XZ 876
IB KK 234
XY IK 897
Expected output
# output.txt #
AA ZZ 222
JB CX 345
CX YZ 0
BB XX 3145
SU BY 0
DA XZ 876
IB KK 234
XY IK 897
TY AB 0
I tried this code, but couldn't figure out how to add the rows that did not match, with "0" as their value:
awk 'BEGIN { while ((getline <"file2.txt") > 0) {REC[$1]=$0}}{print REC[$1]}' < file1.txt > output.txt
With your shown samples, could you please try the following.
awk '
FNR==NR{
arr[$1 OFS $2]
next
}
(($1 OFS $2) in arr){
print
arr1[$1 OFS $2]
}
END{
for(i in arr){
if(!(i in arr1)){
print i,0
}
}
}
' file1.txt file2.txt
Explanation: a detailed, annotated version of the above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking FNR==NR condition which will be TRUE when file1.txt is being read.
arr[$1 OFS $2] ##Creating array with 1st and 2nd field here.
next ##next will skip all further statements from here.
}
(($1 OFS $2) in arr){ ##Checking condition if 1st and 2nd field of file2.txt is present in arr then do following.
print ##Print the current line here.
arr1[$1 OFS $2] ##Creating array arr1 with index of 1st and 2nd fields here.
}
END{ ##Starting END block of this program from here.
for(i in arr){ ##Traversing through arr all elements from here.
if(!(i in arr1)){ ##Checking if an element/key is NOT present in arr1 then do following.
print i,0 ##Printing index and 0 here.
}
}
}
' file1.txt file2.txt ##Mentioning Input_file names here.
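One subtlety worth calling out in the answer above: the bare arr[$1 OFS $2] and arr1[$1 OFS $2] statements are not no-ops. In awk, merely referencing an array element creates it, which is what makes the later "in" membership tests work. A minimal demonstration:

```shell
# Referencing a["k"] as a bare statement creates the (empty) element,
# so a later membership test with "in" succeeds.
awk 'BEGIN {
    a["k"]                                    # bare reference: creates a["k"]
    print (("k" in a)     ? "created" : "missing")
    print (("other" in a) ? "created" : "missing")
}'
# prints: created
#         missing
```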
You may try this awk:
awk '
FNR == NR {
map[$1,$2] = $3
next
}
{
print $1, $2, (($1,$2) in map ? map[$1,$2] : 0)
}' file2 file1
AA ZZ 222
JB CX 345
CX YZ 0
BB XX 3145
SU BY 0
DA XZ 876
IB KK 234
XY IK 897
TY AB 0
$ awk '
{ key = $1 FS $2 }
NR==FNR { map[key]=$3; next }
{ print $0, map[key]+0 }
' file2.txt file1.txt
AA ZZ 222
JB CX 345
CX YZ 0
BB XX 3145
SU BY 0
DA XZ 876
IB KK 234
XY IK 897
TY AB 0
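The map[key]+0 in the last answer handles the missing-row case without any explicit test: looking up an absent key yields the empty string, and adding 0 coerces that to the number 0, which is exactly the filler value the question asks for. A quick demonstration of the coercion:

```shell
# An absent array element is the empty string; "" + 0 coerces to 0.
awk 'BEGIN {
    print "[" m["absent"] "]"    # shows the empty string
    print m["absent"] + 0        # coerced to the number 0
}'
# prints: []
#         0
```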
I would like to leave the first four columns empty, then add the filename without its extension in the last four columns. I have files named like file.frq, and so on. Later I will apply this to 200 files in a loop.
input
CHR POS REF ALT AF HOM Het Number of animals
1 94980034 C T 0 0 0 5
1 94980057 C T 0 0 0 5
Desired output
file file file file
CHR POS REF ALT AF HOM Het Number of animals
1 94980034 C T 0 0 0 5
1 94980057 C T 0 0 0 5
I tried this, from Add file name and empty column to existing file in awk:
awk '{$0=(NR==1? " \t"" \t"" \t"" \t":FILENAME"\t") "\t" $0}7' file2.frq
But it gave me this:
CHR POS REF ALT AF HOM Het Number of animals
file2.frq 1 94980034 C T 0 0 0 5
file2.frq 1 94980057 C T 0 0 0 5
file2.frq 1 94980062 G C 0 0 0 5
and I also tried this
awk -v OFS="\t" '{print FILENAME, $1=" ",$2=" ",$3=" ", $4=" ",$5 - end}' file2.frq
but it gave me this
CHR POS REF ALT AF HOM Het Number of animals
file2.frq 1 94980034 C T 0 0 0 5
file2.frq 1 94980057 C T 0 0 0 5
any help will be appreciated!
Assuming your input is tab-separated like your desired output:
awk '
BEGIN { FS=OFS="\t" }
NR==1 {
orig = $0
fname = FILENAME
sub(/\.[^.]*$/,"",fname)
$1=$2=$3=$4 = ""
$5=$6=$7=$8 = fname
print
$0 = orig
}
1' file.txt
file file file file
CHR POS REF ALT AF HOM Het Number of animals
1 94980034 C T 0 0 0 5
1 94980057 C T 0 0 0 5
To see it in table format:
$ awk '
BEGIN { FS=OFS="\t" }
NR==1 {
orig = $0
fname = FILENAME
sub(/\.[^.]*$/,"",fname)
$1=$2=$3=$4 = ""
$5=$6=$7=$8 = fname
print
$0 = orig
}
1' file.txt | column -s$'\t' -t
file file file file
CHR POS REF ALT AF HOM Het Number of animals
1 94980034 C T 0 0 0 5
1 94980057 C T 0 0 0 5
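Since the question mentions applying this to 200 files in a loop, here is one hedged way to wrap the answer in a shell loop. It is a sketch: it assumes the files all end in .frq, that rewriting each file in place via a temporary file is acceptable, and it additionally strips any directory part from FILENAME before dropping the extension.

```shell
# Build one small sample .frq file in a scratch directory (hypothetical
# data mirroring the question; a real run would start from existing files).
dir=$(mktemp -d)
printf 'CHR\tPOS\tREF\tALT\tAF\tHOM\tHet\tNumber of animals\n' >  "$dir/file2.frq"
printf '1\t94980034\tC\tT\t0\t0\t0\t5\n'                       >> "$dir/file2.frq"

for f in "$dir"/*.frq; do
    awk '
    BEGIN { FS=OFS="\t" }
    NR==1 {
        orig = $0
        fname = FILENAME
        sub(/^.*\//,"",fname)       # strip any directory part
        sub(/\.[^.]*$/,"",fname)    # strip the extension
        $1=$2=$3=$4 = ""            # four empty columns...
        $5=$6=$7=$8 = fname         # ...then the bare filename, four times
        print                       # emit the new first line
        $0 = orig                   # restore the real header
    }
    1' "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done

head -2 "$dir/file2.frq"
```

The first line of each rewritten file is now four empty tab-separated fields followed by the extension-less name repeated four times, with the original header and data untouched below it.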
I have a file with multiple lines in the following form:
name1 a1 b3 c6 a3 b4 c9
name2 a7 b8 c7 a9 b10 c13
name3 a12 b9 c8
name4 a4 b34 c19 a7 b2 c10 a3 b5 c67
I need to break the lines after the letters repeat (i.e. after each a,b,c), but have the original name (field 1) retained:
name1 a1 b3 c6
name1 a3 b4 c9
name2 a7 b8 c7
name2 a9 b10 c13
name3 a12 b9 c8
name4 a4 b34 c19
name4 a7 b2 c10
name4 a3 b5 c67
I tried something along the lines of:
awk -F"\t" '{ for (i=2;i<=NF;i++) print $1"\t"$i }' file
but the i++ puts every single field on its own line; is there a way to group them?
Thank you.
@starter5: Try:
awk 'BEGIN{V["a"];V["b"];V["c"]} /name/{R=$0;next} {Q=$0;gsub(/[[:digit:]]/,"",Q)} (Q in V){if(!W[Q]++){A++}} $0{if(A==1 && $0 && R){$0=R OFS $0};printf("%s %s",$0,(A==3?"\n":OFS));;if(A==3){A="";delete W}}' RS='[ +|\n]' Input_file
The following is the non-one-liner form of the same solution.
awk 'BEGIN{
V["a"];
V["b"];
V["c"]
}
/name/{
R=$0;
next
}
{
Q=$0;
gsub(/[[:digit:]]/,"",Q)
}
(Q in V){
if(!W[Q]++){
A++
}
}
$0 {
if(A==1 && $0 && R){
$0=R OFS $0
};
printf("%s %s",$0,(A==3?"\n":OFS));;
if(A==3) {
A="";
delete W
}
}
' RS='[ +|\n]' Input_file
Now let's say we have the following Input_file (where I changed the last line), to test what happens when a, b, c do not come in sequence: the line will NOT be broken until all three of them have been seen. Have a look and let me know.
cat Input_file
name1 a1 b3 c6 a3 b4 c9
name2 a7 b8 c7 a9 b10 c13
name3 a12 b9 c8
name4 a4 b34 a19 a7 b2 c10 a3 b5 c67
Output will be as follows.
name1 a1 b3 c6
name1 a3 b4 c9
name2 a7 b8 c7
name2 a9 b10 c13
name3 a12 b9 c8
name4 a4 b34 a19 a7 b2 c10
name4 a3 b5 c67
{ # for any record
printf $1 # print name
c=substr($2,1,1); # first letter of group
printf OFS $2 # first part of first group
for(i=3; i<=NF; i++) { # for all the rest fields
if(index($i,c) != 1) # if next group has not started
printf OFS $i # print this part on same line
else # otherwise
printf ORS $1 OFS $i # print name and this part on next line
} # done for all fields
printf ORS # move to next line
} # done for this record
This does not work if some letter repeats within a group. For example, it won't work for a3 b5 a4 c6 a5 b6 a0 b9 where groups of a b a c are present.
This can be run like:
awk '{ printf $1; c=substr($2,1,1); printf OFS $2; for(i=3;i<=NF;i++) if(index($i,c)!=1) printf OFS $i; else printf ORS $1 OFS $i; printf ORS}' file
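To see that approach end-to-end, here is a self-contained run against the question's sample (a sketch; note that printf $1 uses the data as the printf format string, so this assumes the names contain no % characters):

```shell
cd "$(mktemp -d)"
printf 'name1 a1 b3 c6 a3 b4 c9\nname2 a7 b8 c7 a9 b10 c13\n'    > file
printf 'name3 a12 b9 c8\nname4 a4 b34 c19 a7 b2 c10 a3 b5 c67\n' >> file

# Break a line whenever a field starts with the same letter as field 2,
# repeating the name (field 1) at the start of each emitted line.
awk '{
    printf $1
    c = substr($2,1,1)
    printf OFS $2
    for (i=3; i<=NF; i++)
        if (index($i,c) != 1) printf OFS $i
        else                  printf ORS $1 OFS $i
    printf ORS
}' file
```

This prints the eight grouped lines shown in the question's desired output.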
I need to break the lines after the letters repeat (i.e. after each
a,b,c), but have the original name (field 1) retained:
Input
$ cat file
name1 a1 b3 c6 a3 b4 c9
name2 a7 b8 c7 a9 b10 c13
name3 a12 b9 c8
name4 a4 b34 c19 a7 b2 c10 a3 b5 c67
Output
$ awk 'function _p(){print $1,s; s=""; split("",p)}{for(i=2; i<=NF; i++){ c=substr($i,1,1);if(c in p)_p(); s = (s?s OFS:"") $i; p[c] }_p()}' file
name1 a1 b3 c6
name1 a3 b4 c9
name2 a7 b8 c7
name2 a9 b10 c13
name3 a12 b9 c8
name4 a4 b34 c19
name4 a7 b2 c10
name4 a3 b5 c67
A more readable version:
awk '
function _p()
{
print $1,s;
s="";
split("",p)
}
{
for(i=2; i<=NF; i++)
{
c=substr($i,1,1);
if(c in p)_p();
s = (s?s OFS:"") $i;
p[c]
}
_p()
}
' file
OR
$ awk 'function _p(){print $1,s; s=p=""}{for(i=2; i<=NF; i++){ c=substr($i,1,1); if(c==p)_p(); s = (s?s OFS:"") $i; if(!p)p=c }_p()}' file
name1 a1 b3 c6
name1 a3 b4 c9
name2 a7 b8 c7
name2 a9 b10 c13
name3 a12 b9 c8
name4 a4 b34 c19
name4 a7 b2 c10
name4 a3 b5 c67
A more readable version:
awk '
function _p()
{
print $1,s;
s=p=""
}
{
for(i=2; i<=NF; i++)
{
c=substr($i,1,1);
if(c==p)_p();
s = (s?s OFS:"") $i;
if(!p)p=c
}
_p()
}' file
I have 400 tab-delimited text files with 6 million rows in each file. Below is the format of the files:
### input.txt
col1 col2 col3 col4 col5
ID1 str1 234 cond1 0
ID1 str2 567 cond1 0
ID1 str3 789 cond1 1
ID1 str4 123 cond1 1
### file1.txt
col1 col2 col3 col4 col5
ID2 str1 235 cond1 0
ID2 str2 567 cond2 3
ID2 str3 789 cond1 3
ID2 str4 123 cond2 0
### file2.txt
col1 col2 col3 col4 col5
ID3 str1 235 cond1 0
ID3 str2 567 cond2 4
ID3 str3 789 cond1 1
I am trying to append the values in $1 from the rest of the files (file1..filen) as a new $6 in input.txt, using these conditions:
1. columns $2 and $3 form the key
2. if the key is found in file1...filen and that line has $5>=2, append that line's $1 to $6 in the input file.
Code:
awk -F "\t" -v OFS="\t" '!c {
c=$0"\tcol6";
next
}
NR==FNR {
a[$2$3]=$0 "\t";
next
}
{
if ($5>=2) {
a[$2$3]=a[$2$3] $1 ","
}
}
END {
print c;
for (i in a) {
print a[i]
}
}' input.txt file1..filen.txt
The output from the above code is as expected:
Output.txt
col1 col2 col3 col4 col5 col6
ID1 str2 567 cond1 0 ID2,ID3,
ID1 str4 123 cond1 1
ID1 str1 234 cond1 0
ID1 str3 789 cond1 1 ID2,
However, the problem is that the code is very slow, as it has to check each key of input.txt against 400 files with 6 million rows each; this takes several hours to a few days. Could someone suggest a better way to reduce the processing time, in awk or with other scripts?
Any help would really save a lot of time.
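It is hard to make a single pass over roughly 2.4 billion lines fast, but one hedged sketch that should cut the constant factor: load only input.txt up front, reject non-qualifying lines of the big files as early as possible, and keep the output in input order. This rebuilds the same logic on the question's small samples; the column positions are assumed as shown above.

```shell
cd "$(mktemp -d)"
# Recreate the small samples from the question.
printf 'col1\tcol2\tcol3\tcol4\tcol5\nID1\tstr1\t234\tcond1\t0\nID1\tstr2\t567\tcond1\t0\nID1\tstr3\t789\tcond1\t1\nID1\tstr4\t123\tcond1\t1\n' > input.txt
printf 'col1\tcol2\tcol3\tcol4\tcol5\nID2\tstr1\t235\tcond1\t0\nID2\tstr2\t567\tcond2\t3\nID2\tstr3\t789\tcond1\t3\nID2\tstr4\t123\tcond2\t0\n' > file1.txt
printf 'col1\tcol2\tcol3\tcol4\tcol5\nID3\tstr1\t235\tcond1\t0\nID3\tstr2\t567\tcond2\t4\nID3\tstr3\t789\tcond1\t1\n' > file2.txt

awk -F '\t' -v OFS='\t' '
NR==FNR {                         # first file: input.txt
    if (FNR==1) { hdr = $0 OFS "col6"; next }
    row[$2 FS $3] = $0            # full row, keyed on $2/$3
    order[++n] = $2 FS $3         # remember input order
    next
}
FNR==1  { next }                  # skip each data file header
$5 < 2  { next }                  # cheapest reject first
($2 FS $3) in row {               # only keys that exist in input.txt
    add[$2 FS $3] = add[$2 FS $3] $1 ","
}
END {
    print hdr
    for (i=1; i<=n; i++) print row[order[i]], add[order[i]]
}' input.txt file*.txt
```

With 400 files this is still one sequential scan; if wall-clock time matters more than simplicity, the per-file work is independent, so the files could also be processed in parallel (one awk per file emitting key-to-IDs pairs) and the partial results merged in a final pass.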
input.txt
Sam string POS Zyg QUAL
WSS 1 125 hom 4973.77
WSS 1 810 hom 3548.77
WSS 1 389 hom 62.74
WSS 1 689 hom 4.12
file1.txt
Sam string POS Zyg QUAL
AC0 1 478 hom 8.64
AC0 1 583 het 37.77
AC0 1 588 het 37.77
AC0 1 619 hom 92.03
file2.txt
Sam string POS zyg QUAL
AC1 1 619 hom 89.03
AC1 1 746 hom 17.86
AC1 1 810 het 2680.77
AC1 1 849 het 200.77
awk -F "\t" -v OFS="\t" '!c {
c=$0"\tcol6";
next
}
NR==FNR {
a[$2$3]=$0 "\t";
next
}
{
if ( ($5>=2) && (FNR > 1) ) {
if ( $2$3 in a ) {
a[$2$3]=a[$2$3] $1 ",";
} else {
print $0 > "Errors.txt";
}
}
}
END {
print c;
for (i in a) {
print a[i]
}
}' input.txt file*
For the above input files it prints the below output:
AC0,AC1,
WSS 1 389 hom 62.74
AC1,
WSS 1 810 hom 3548.77 AC1,
WSS 1 689 hom 4.12
WSS 1 1250 hom 4973.77
It still prints the values in $1 from file1 and file2 on separate lines instead of appending them to the matching rows.
I have data in the following format:
ID Date X1 X2 X3
1 01/01/00 1 2 3
1 01/02/00 7 8 5
2 01/03/00 9 7 1
2 01/04/00 1 4 5
I would like to group measurements into new rows according to ID, so I end up with:
ID Date X1 X2 X3 Date X1_2 X2_2 X3_2
1 01/01/00 1 2 3 01/02/00 7 8 5
2 01/03/00 9 7 1 01/04/00 1 4 5
etc.
I have as many as 20 observations for a given ID.
So far I have tried the technique given by http://gadgetsytecnologia.com/da622c17d34e6f13e/awk-transpose-childids-column-into-row.html
The code I have tried so far is:
awk -F, OFS = '\t' 'NR >1 {a[$1] = a[$1]; a[$2] = a[$2]; a[$3] = a[$3];a[$4] = a[$4]; a[$5] = a[$5] OFS $5} END {print "ID,Date,X1,X2,X3,Date_2,X1_2, X2_2 X3_2'\t' for (ID in a) print a[$1:$5] }' file.txt
The file is tab-delimited. I don't know how to manipulate the data, or how to account for the fact that there will be more than two observations per person.
Just keep track of what was the previous first field. If it changes, print the stored line:
awk 'NR==1 {print; next} # print header
prev && $1!=prev {print prev, line; line=""} # print on different $1
{prev=$1; $1=""; line=line $0} # store data and remove $1
END {print prev, line}' file # print trailing line
If you have tab-separated fields, just add -F"\t".
Test
$ awk 'NR==1 {print; next} prev && $1!=prev {print prev, line; line=""} {prev=$1; $1=""; line=line $0} END {print prev, line}' a
ID Date X1 X2 X3
1 01/01/00 1 2 3 01/02/00 7 8 5
2 01/03/00 9 7 1 01/04/00 1 4 5
You can try this (gnu-awk solution, as it uses arrays of arrays):
gawk '
NR == 1 {
N = NF;
MAX = NF-1;
for(i=1; i<=NF; i++){ #store columns names
names[i]=$i;
}
next;
}
{
for(i=2; i<=N; i++){
a[$1][length(a[$1])+1] = $i; #store records for each id
}
if(length(a[$1])>MAX){
MAX = length(a[$1]);
}
}
END{
firstline = names[1];
for(i=1; i<=MAX; i++){ #print first line
column = int((i-1)%(N-1))+2
count = int((i-1)/(N-1));
firstline=firstline OFS names[column];
if(count>0){
firstline=firstline"_"count
}
}
print firstline
for(id in a){ #print each record in store
line = id;
for(i=1; i<=length(a[id]); i++){
line=line OFS a[id][i];
}
print line;
}
}
' input
input
ID Date X1 X2 X3
1 01/01/00 1 2 3
1 01/02/00 7 8 5
2 01/03/00 9 7 1
2 01/04/00 1 4 5
1 01/03/00 72 28 25
you get
ID Date X1 X2 X3 Date_1 X1_1 X2_1 X3_1 Date_2 X1_2 X2_2 X3_2
1 01/01/00 1 2 3 01/02/00 7 8 5 01/03/00 72 28 25
2 01/03/00 9 7 1 01/04/00 1 4 5
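For completeness, a portable variant of the same grouping that needs no gawk arrays-of-arrays: accumulate each ID's trailing fields into one string and remember first-seen order. This is a sketch; it reproduces the wide rows even when an ID reappears later in the file, but not the numbered Date_2/X1_2 header columns.

```shell
cd "$(mktemp -d)"
printf 'ID Date X1 X2 X3\n1 01/01/00 1 2 3\n1 01/02/00 7 8 5\n'    > input
printf '2 01/03/00 9 7 1\n2 01/04/00 1 4 5\n1 01/03/00 72 28 25\n' >> input

awk '
NR==1 { print; next }                 # pass the header through
{
    id = $1
    if (!(id in row)) order[++n] = id # remember first-seen order
    $1 = ""                           # drop the ID; $0 is rebuilt with OFS
    row[id] = row[id] $0              # append this observation to its ID
}
END {
    for (i=1; i<=n; i++) print order[i] row[order[i]]
}' input
```

On the extended sample above this yields one wide row per ID, with ID 1's third observation appended to its row even though it appears after ID 2's lines.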