Problems making an awk between range of columns - awk

I have a file called sds:
$ head sds
2557 386 fs://name/user/hive/ware/doc1/do_fact/date=20190313/fact=6
2593 393 fs://name/user/hive/ware/toc1/do_gas/idi_centr=6372/mes=20
2594 343 fs://name/user/hive/ware/dac2/do_gas2/idi_centr=6354/mes=21
349 307 fs://name/user/hive/ware/tec2/do_des/mes=25
340 332 fs://name/user/hive/ware/dc1/venta/year=2018/month=12
I want to delete /user/hive/ware and replace $7 with 1 if it matches /_1$/, and with 2 otherwise, using awk.
The code that I used was:
awk -F"/" '{ if ($7 ~ /_1$/)
print $1"//"$3"/1/"$7-$NF
else
print $1"//"$3"/2/"$7-$NF}' sds
but the result is wrong.
I would like an output like:
2557 386 fs://name/1/do_fact/date=20190313/fact=6
2593 393 fs://name/1/do_gas/idi_centr=6372/mes=20
2594 343 fs://name/2/do_gas2/idi_centr=6354/mes=21
349 307 fs://name/2/do_des/mes=25
340 332 fs://name/1/venta/year=2018/month=12

$ awk 'BEGIN{FS=OFS="/"} {sub("/user/hive/ware",""); $4=($4~/1$/ ? 1 : 2)} 1' file
2557 386 fs://name/1/do_fact/date=20190313/fact=6
2593 393 fs://name/1/do_gas/idi_centr=6372/mes=20
2594 343 fs://name/2/do_gas2/idi_centr=6354/mes=21
349 307 fs://name/2/do_des/mes=25
340 332 fs://name/1/venta/year=2018/month=12
Or, if you don't REALLY want to remove the string /user/hive/ware and instead want to remove the 4th through 6th fields no matter their value:
$ awk 'BEGIN{FS=OFS="/"} {$4=$5=$6="\n"; sub(/(\/\n){3}/,""); $4=($4~/1$/ ? 1 : 2)} 1' file
2557 386 fs://name/1/do_fact/date=20190313/fact=6
2593 393 fs://name/1/do_gas/idi_centr=6372/mes=20
2594 343 fs://name/2/do_gas2/idi_centr=6354/mes=21
349 307 fs://name/2/do_des/mes=25
340 332 fs://name/1/venta/year=2018/month=12
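As for why the attempt in the question fails: in awk, $7-$NF is a numeric subtraction of two fields, not a range of columns, so it prints a number rather than the rest of the path. A range has to be built with a loop. A minimal sketch of that approach, assuming the same /-separated layout as the sample; rewrite_paths is a hypothetical helper name, and the flag test uses /1$/ as in the commands above:

```shell
# Hypothetical helper: rebuild each line from fields 1 and 3, a 1/2 flag
# derived from field 7, and then fields 8..NF.
rewrite_paths() {
    awk -F'/' '{
        flag = ($7 ~ /1$/) ? 1 : 2      # names ending in 1 become 1, all others 2
        out = $1 "//" $3 "/" flag
        for (i = 8; i <= NF; i++)       # append the remaining fields one by one
            out = out "/" $i
        print out
    }' "$@"
}

# Usage: rewrite_paths sds
```

Unlike the sub() versions, this depends only on field positions, so it would also work if the three removed directories had other names.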

with sed
$ sed -E 's_/user/hive/ware/[^/]+(.)/_/\1/_' file
2557 386 fs://name/1/do_fact/date=20190313/fact=6
2593 393 fs://name/1/do_gas/idi_centr=6372/mes=20
2594 343 fs://name/2/do_gas2/idi_centr=6354/mes=21
349 307 fs://name/2/do_des/mes=25
340 332 fs://name/1/venta/year=2018/month=12
Note that it's not really a conditional replacement: it just keeps the last character of the directory name.

You can use awk and its gsub function to perform replacement on the selected columns.
awk 'BEGIN{FS=OFS="/"}{gsub("user/hive/ware/","");gsub(/^[^12]+/,"",$4)}1' inputfile
2557 386 fs://name/1/do_fact/date=20190313/fact=6
2593 393 fs://name/1/do_gas/idi_centr=6372/mes=20
2594 343 fs://name/2/do_gas2/idi_centr=6354/mes=21
349 307 fs://name/2/do_des/mes=25
340 332 fs://name/1/venta/year=2018/month=12

Related

cat specific columns in 3 files

I have 3 files such as :
file1_file:
scaffold_159 625 YP_009345712 0.284 447 289 9 96675 95377 196 625 6.963E-38 158
scaffold_159 625 YP_009345714 0.284 447 289 9 96675 95377 196 625 6.963E-38 158
IDBA_scaffold_24562 625 YP_009345713 0.464 56 20 2 2549 2686 10 65 7.513E-03 37
file2_file:
scaffold_159 625 YP_009345717 0.284 447 289 9 96675 95377 196 625 6.963E-38 158
scaffold_159 625 YP_009345718 0.284 447 289 9 96675 95377 196 625 6.963E-38 158
IDBA_scaffold_24562 625 YP_009345719 0.464 56 20 2 2549 2686 10 65 7.513E-03 37
file3_file:
scaffold_159 625 YP_009345711 0.284 447 289 9 96675 95377 196 625 6.963E-38 158
scaffold_159 625 YP_009345723 0.284 447 289 9 96675 95377 196 625 6.963E-38 158
IDBA_scaffold_24562 625 YP_009345721 0.464 56 20 2 2549 2686 10 65 7.513E-03 37
and I would like to get only the third column of the 3 files in a single new_file.txt.
Here I should get :
YP_009345712
YP_009345714
YP_009345713
YP_009345717
YP_009345718
YP_009345719
YP_009345711
YP_009345723
YP_009345721
From now I tried:
cat file_names.txt | while read line; do cat /path1/${line}/path2/${line}_file > new_file.txt; done
in file_names.txt I have :
file1
file2
file3
but I do not know how to extract only the third column...
Ps: the 3 files are not in the same directory :
/path1/file1/path2/file1_file
/path1/file2/path2/file2_file
/path1/file3/path2/file3_file
EDIT: After chatting with the OP, I learned that the files can be in different locations. In that case, could you please try the following, which assumes file_names.txt lists one name per line. I have not tested it yet.
file_name=$(awk '{val=(val?val OFS:"")"/path1/" $0 "/path2/" $0 "_file"} END{print val}' file_names.txt)
awk '{print $3}' $file_name
OR
awk '{print $3}' $(awk '{val=(val?val OFS:"")"/path1/" $0 "/path2/" $0 "_file"} END{print val}' file_names.txt)
You could use awk here.
awk '{print $3}' /complete_path/file1 /complete_path/file2 /complete_path/file3
I think it can be simpler with
$ sed 's|.*|"/path1/&/path2/&_file"|' file_names.txt | xargs awk '{print $3}'
awk will be called only once.
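The loop from the question can also be kept with two small changes: run awk instead of cat, and redirect once after done, because > inside the loop truncates new_file.txt on every iteration. A rough sketch, assuming one name per line in the list; third_columns and its second argument (the base directory, /path1 in the question) are hypothetical names chosen for illustration:

```shell
# Hypothetical helper: print column 3 of every file named in a list.
# $1 = list of names, one per line; $2 = base directory (/path1 in the question)
third_columns() {
    while IFS= read -r name; do
        [ -n "$name" ] || continue      # skip blank lines
        awk '{print $3}' "$2/$name/path2/${name}_file"
    done < "$1"
}

# Usage: third_columns file_names.txt /path1 > new_file.txt
```

This starts one awk process per file, so the xargs variant is likely faster for hundreds of files, but the loop copes with names that contain spaces.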
So you have a file fnames.txt with hundreds of strings:
str1
str2
str3
str4
...
and each string represents a file located in
/path1/${str}/path2/${str}_file
where ${str} is a value from file fnames.txt.
Now you want to read the third column, from the third file only:
$ str="$(awk '(NR==3){print; exit}' fnames.txt)"
$ file="/path1/${str}/path2/${str}_file"
$ awk '{print $3}' "$file" > new_file.txt
Always remember the KISS principle

print matched column from multiple inputs and concatenate the output

A.lst
1091 1991 43.5 -30.1 -11.4
1091 1993 -11.2 -28.5 -2.7
1091 1997 35.8 -13.2 -4.5
1091 2003 -26.8 -23.9 0.6
1091 2007 23.8 64.8 3.5
1091 2008 -45.8 70.7 -6.0
1100 1967 24.5 -25.6 -12.7
1100 1971 -935.0 9.3 52.0
1100 1972 -388.8 59.1 20.4
1100 1974 17.7 48.9 3.0
B.lst
1091 1991 295 1
1091 1993 293 3
1091 1997 296 7
1091 2003 287 13
1091 2007 283 17
1091 2008 282 18
1100 1967 1419 3
1100 1971 56 7
1031 2023 283 17
1021 2238 282 18
1140 1327 1419 3
1150 3971 56 7
1100 1972 55 8
1100 1974 1445 10
I am using this awk script (taken from an earlier answer; I can't remember where, but I'll credit the author once I find it):
NR==FNR{
a[$1,$2]=$1; next
}
{
s=SUBSEP
k=$3 s $4
}k in a{ print $0 }
but I have no idea how to combine the output.
It should print only the matching lines of A.lst, with columns $3 and $4 from B.lst appended:
1091 1991 43.5 -30.1 -11.4 295 1
1091 1993 -11.2 -28.5 -2.7 293 3
1091 1997 35.8 -13.2 -4.5 296 7
1091 2003 -26.8 -23.9 0.6 287 13
1091 2007 23.8 64.8 3.5 283 17
1091 2008 -45.8 70.7 -6.0 282 18
1100 1967 24.5 -25.6 -12.7 1419 3
1100 1971 -935.0 9.3 52.0 56 7
1100 1972 -388.8 59.1 20.4 55 8
1100 1974 17.7 48.9 3.0 1445 10
Could you please try the following:
awk 'FNR==NR{array[$1,$2]=$3 OFS $4;next} (($1,$2) in array){print $0,array[$1,$2]}' file_B file_A
Here is the same solution in a non-one-liner form:
awk '
FNR==NR{
array[$1,$2]=$3 OFS $4
next
}
(($1,$2) in array){
print $0,array[$1,$2]
}
' file_B file_A
Explanation of the above code:
awk ' ##Starting awk program here.
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when file_B is being read.
array[$1,$2]=$3 OFS $4 ##Creating an array named array whose index is $1,$2 and value is $3 OFS $4.
next ##Using next will skip all further statements from here.
} ##Closing FNR==NR condition BLOCK here.
(($1,$2) in array){ ##Checking condition if $1,$2 is present in array then do following.
print $0,array[$1,$2] ##Printing current line and then value of array with index of $1,$2
}
' file_B file_A ##Mentioning Input_file names here.

MPI_sendrecv changes loop index with -O1 flag [duplicate]

Despite having written long, heavily parallelized codes with complicated sends and receives over three-dimensional arrays, this simple code with a two-dimensional array of integers has me at my wits' end. I combed Stack Overflow for possible solutions and found one that slightly resembles the issue I am having:
Boost.MPI: What's received isn't what was sent!
However, the solutions there point to the looping segment of the code as the culprit for overwriting sections of memory, while this case seems to act even stranger. Maybe it is a careless oversight of some simple detail on my part. The problem is with the code below:
program main
implicit none
include 'mpif.h'
integer :: i, j
integer :: counter, offset
integer :: rank, ierr, stVal
integer, dimension(10, 10) :: passMat, prntMat !! passMat CONTAINS VALUES TO BE PASSED TO prntMat
call MPI_INIT(ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
counter = 0
offset = (rank + 1)*300
do j = 1, 10
do i = 1, 10
prntMat(i, j) = 10 !! prntMat OF BOTH RANKS CONTAIN 10
passMat(i, j) = offset + counter !! passMat OF rank=0 CONTAINS 300..399 AND rank=1 CONTAINS 600..699
counter = counter + 1
end do
end do
if (rank == 1) then
call MPI_SEND(passMat(1:10, 1:10), 100, MPI_INTEGER, 0, 1, MPI_COMM_WORLD, ierr) !! SEND passMat OF rank=1 to rank=0
else
call MPI_RECV(prntMat(1:10, 1:10), 100, MPI_INTEGER, 1, 1, MPI_COMM_WORLD, stVal, ierr)
do i = 1, 10
print *, prntMat(:, i)
end do
end if
call MPI_FINALIZE(ierr)
end program main
When I compile the code with mpif90 with no flags and run it on my machine with mpirun -np 2, I get the following output with wrong values in the first four indices of the array:
0 0 400 0 604 605 606 607 608 609
610 611 612 613 614 615 616 617 618 619
620 621 622 623 624 625 626 627 628 629
630 631 632 633 634 635 636 637 638 639
640 641 642 643 644 645 646 647 648 649
650 651 652 653 654 655 656 657 658 659
660 661 662 663 664 665 666 667 668 669
670 671 672 673 674 675 676 677 678 679
680 681 682 683 684 685 686 687 688 689
690 691 692 693 694 695 696 697 698 699
However, when I compile it with the same compiler but with the -O3 flag on, I get the correct output:
600 601 602 603 604 605 606 607 608 609
610 611 612 613 614 615 616 617 618 619
620 621 622 623 624 625 626 627 628 629
630 631 632 633 634 635 636 637 638 639
640 641 642 643 644 645 646 647 648 649
650 651 652 653 654 655 656 657 658 659
660 661 662 663 664 665 666 667 668 669
670 671 672 673 674 675 676 677 678 679
680 681 682 683 684 685 686 687 688 689
690 691 692 693 694 695 696 697 698 699
This error is machine dependent. This issue turns up only on my system running Ubuntu 14.04.2, using OpenMPI 1.6.5
I tried this on other systems running RedHat and CentOS and the code ran well with and without the -O3 flag. Curiously those machines use an older version of OpenMPI - 1.4
I am guessing that the -O3 flag is performing some odd optimization that is modifying the manner in which arrays are being passed between the processes.
I also tried other versions of array allocation. The above code uses explicit shape arrays. With assumed shape and allocated arrays I am receiving equally, if not more bizarre results, with some of them seg-faulting. I tried using Valgrind to trace the origin of these seg-faults, but I still haven't gotten the hang of getting Valgrind to not give false positives when running with MPI programs.
I believe that resolving the difference in behavior of the above code will help me understand the tantrums of my other codes as well.
Any help would be greatly appreciated! This code has really gotten me questioning if all the other MPI codes I wrote are sound at all.
Using the Fortran 90 interface to MPI reveals a mismatch in your call to MPI_RECV
call MPI_RECV(prntMat(1:10, 1:10), 100, MPI_INTEGER, 1, 1, MPI_COMM_WORLD, stVal, ierr)
1
Error: There is no specific subroutine for the generic ‘mpi_recv’ at (1)
This is because the status variable stVal is an integer scalar, rather than an array of MPI_STATUS_SIZE. The F77 interface (include 'mpif.h') to MPI_RECV is:
INCLUDE 'mpif.h'
MPI_RECV(BUF, COUNT, DATATYPE, SOURCE, TAG, COMM, STATUS, IERROR)
<type> BUF(*)
INTEGER COUNT, DATATYPE, SOURCE, TAG, COMM
INTEGER STATUS(MPI_STATUS_SIZE), IERROR
Changing
integer :: rank, ierr, stVal
to
integer :: rank, ierr, stVal(mpi_status_size)
produces a program that works as expected, tested with gfortran 5.1 and OpenMPI 1.8.5.
Using the F90 interface (use mpi vs include "mpif.h") lets the compiler detect the mismatched arguments at compile time rather than producing confusing runtime problems.

Compare two files according to first column and print whole line

I will ask my question with an example. I have 2 files:
File1-
TR100013|c0_g1
TR100013|c0_g2
TR10009|c0_g1
TR10009|c0_g2
File2-
TR100013|c0_g1 AT1G01360.1 78.79 165 35 0 301 795 19 183 2E-089 272
TR100013|c0_g2 AT1G01360.1 78.79 165 35 0 301 795 19 183 2E-089 272
TR10009|c0_g1 AT1G16240.3 77.42 62 14 0 261 76 113 174 4E-025 95.9
TR10009|c0_g2 AT1G16240.2 69.17 120 37 0 1007 648 113 232 2E-050 171
TR29295|c0_g1 AT1G22540.1 69.19 172 53 2 6 521 34 200 2E-053 180
TR49005|c5_g1 AT5G24530.1 69.21 302 90 1 909 13 39 340 5E-157 446
Expected Output :
TR100013|c0_g1 AT1G01360.1 78.79 165 35 0 301 795 19 183 2E-089 272
TR100013|c0_g2 AT1G01360.1 78.79 165 35 0 301 795 19 183 2E-089 272
TR10009|c0_g1 AT1G16240.3 77.42 62 14 0 261 76 113 174 4E-025 95.9
TR10009|c0_g2 AT1G16240.2 69.17 120 37 0 1007 648 113 232 2E-050 171
I want to compare two files. If the first column is same in both files, then print the whole line of second file which is common in both files.
Using awk:
awk 'NR==FNR{a[$1]++;next};a[$1]' file1 file2
grep can do the same, as long as the keys do not also occur in other columns of file2:
grep -wf file1 file2
-w matches whole words only.
-f reads the patterns from the given file.
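If the comparison has to be restricted strictly to the first column, a variant of the awk command can test key membership with the in operator, which also avoids creating empty array entries for keys that are not found. A small sketch under that assumption; match_first_column is a hypothetical helper name:

```shell
# Hypothetical helper: print the lines of the second file whose first
# column is listed in the first file.
match_first_column() {
    # First file: remember each key. Second file: print when column 1 is a known key.
    awk 'NR == FNR { keys[$1]; next }
         $1 in keys' "$1" "$2"
}

# Usage: match_first_column file1 file2
```

A key that shows up only in a later column of the second file is ignored here, whereas grep -wf would print that line.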

How to take the 10 last values from a file in awk

I have a file that contains this data:
345
234
232
454
343
676
887
324
342
355
657
786
343
879
088
342
121
345
534
657
767
I need to cut the last 10 values and put them in another file:
786
343
879
088
342
121
345
534
657
767
No need to use awk, just tail the data:
tail -n 10 input.txt > example.txt
If you really wanted to use awk, you have to keep track of the 10 last lines ($0) and print them in END. Though this would be overkill.
This is what the tail command is for:
$ tail -10 file
786
343
879
088
342
121
345
534
657
767
To store the output in a new file use the redirection operator:
$ tail -10 file > new_file
However if you really want to do it with awk then the brute force approach is to store each line in an array and print the last 10 elements at the end of file:
$ awk '{a[NR]=$0}END{for(i=(NR-9);i<=NR;i++)print a[i]}' file
786
343
879
088
342
121
345
534
657
767
Again, to store the output in a new file use the redirection operator:
$ awk '{a[NR]=$0}END{for(i=(NR-9);i<=NR;i++)print a[i]}' file > new_file
The previous method is inefficient because the array grows as large as the file being read. A much better approach is to use the modulus operator to keep an array of only 10 elements holding the last 10 lines read:
$ awk '{a[NR%10]=$0}END{for(i=(NR+1)%10;j++<10;i=(i+1)%10) print a[i]}' file
786
343
879
088
342
121
345
534
657
767
This can be generalized to the last n lines like so (e.g. n=3):
$ awk '{a[NR%n]=$0}END{for(i=(NR+1)%n;j++<n;i=(i+1)%n) print a[i]}' n=3 file
534
657
767
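One case the ring-buffer commands above do not cover is a file with fewer than n lines, where they would print empty lines for the unused slots. A minimal sketch that guards against this, assuming n is a positive integer; last_n is a hypothetical helper name:

```shell
# Hypothetical helper: print the last n lines of a file using a ring buffer.
# $1 = n (positive integer), $2 = file
last_n() {
    awk -v n="$1" '
        { buf[NR % n] = $0 }                        # slot reused every n lines
        END {
            start = (NR > n) ? NR - n + 1 : 1       # first line still in the buffer
            for (i = start; i <= NR; i++)
                print buf[i % n]
        }' "$2"
}

# Usage: last_n 10 file > new_file
```

Looping over line numbers rather than buffer slots keeps the output in file order without a separate counter.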
Code for GNU sed:
get the last 10 lines
sed ':a $q;N;11,$D;ba' file
get all but the last 10 lines
sed ':a $d;N;2,10ba;P;D' file