I want to extract information from a large file based on multiple conditions (on columns of that same file) as well as on pattern matching against another, smaller file. Following is the script I used:
awk 'BEGIN{FS=OFS="\t"}NR==FNR{a[$0]++;next}$1 in a {print $2,$4,$5}' file2.txt file1.txt >output.txt
Now I want to add a condition to the same awk script so that it ONLY prints lines where the 4th column (a single character from A, T, G, C) matches the 5th column (also a single character from A, T, G, C); both columns are in file1.txt.
In other words, I want to merge the following script with the one above:
awk '$4 " "==$5{print $2,$4,$5}' file1.txt
Following is the representation of file1.txt:
SNP Name Sample ID GC Score Allele1 - Forward Allele2 - Forward
ARS-BFGL-BAC-10172 834269752 0.9374 A G
ARS-BFGL-BAC-1020 834269752 0.9568 A A
ARS-BFGL-BAC-10245 834269752 0.7996 C C
ARS-BFGL-BAC-10345 834269752 0.9604 A C
ARS-BFGL-BAC-10365 834269752 0.5296 G G
ARS-BFGL-BAC-10591 834269752 0.4384 A A
ARS-BFGL-BAC-10793 834269752 0.9549 C C
ARS-BFGL-BAC-10867 834269752 0.9400 G G
ARS-BFGL-BAC-10951 834269752 0.5453 T T
Following is the representation of file2.txt
ARS-BFGL-BAC-10172
ARS-BFGL-BAC-1020
ARS-BFGL-BAC-10245
ARS-BFGL-BAC-10345
ARS-BFGL-BAC-10365
ARS-BFGL-BAC-10591
ARS-BFGL-BAC-10793
ARS-BFGL-BAC-10867
ARS-BFGL-BAC-10951
Output should be:
834269752 A A
834269752 C C
834269752 G G
834269752 A A
834269752 C C
834269752 G G
834269752 T T
You can simply use boolean logic, and from your input file it seems you can get away with "normal" input field splitting, which will allow you to get rid of that space in the comparison:
awk 'BEGIN{OFS="\t"}
NR==FNR{a[$0]++;next}
($1 in a) && ($4==$5) {print $2,$4,$5}' file2.txt file1.txt > output.txt
As an example, here is my test file2.txt:
ARS-BFGL-BAC-1020
ARS-BFGL-BAC-10172
And here is the result of the command above:
834269752 A A
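For a quick check of the combined condition, here is a small sketch that rebuilds a few of the sample rows from the question as tab-separated temporary files and runs the same command on them. The temporary directory and the result variable are made up for this illustration only.

```shell
# Build small tab-separated stand-ins for file1.txt and file2.txt (hypothetical temp files).
tmpdir=$(mktemp -d)
printf 'ARS-BFGL-BAC-10172\t834269752\t0.9374\tA\tG\n'  > "$tmpdir/file1.txt"
printf 'ARS-BFGL-BAC-1020\t834269752\t0.9568\tA\tA\n'  >> "$tmpdir/file1.txt"
printf 'ARS-BFGL-BAC-10245\t834269752\t0.7996\tC\tC\n' >> "$tmpdir/file1.txt"
printf 'ARS-BFGL-BAC-1020\nARS-BFGL-BAC-10172\n' > "$tmpdir/file2.txt"

# Keep a row only if its SNP name is listed in file2 AND both alleles agree.
result=$(awk 'BEGIN{OFS="\t"}
NR==FNR{a[$0]++;next}
($1 in a) && ($4==$5) {print $2,$4,$5}' "$tmpdir/file2.txt" "$tmpdir/file1.txt")

printf '%s\n' "$result"
rm -rf "$tmpdir"
```

Only the ARS-BFGL-BAC-1020 row passes both tests here: 10172 is listed but has A/G, and 10245 has matching alleles but is not in the list.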
I have the following sample text:
a b c
x_y_
d e f
x_y_
g h i
x_y_
k l m
x_y_
I need it to be formatted as follows:
x_y_ a b c
x_y_ d e f
x_y_ g h i
x_y_ k l m
Using sed, awk or something else in bash, how do we accomplish this?
Another awk:
$ awk 'NR%2==0{print $0,p}{p=$0}' file
Output:
x_y_ a b c
x_y_ d e f
x_y_ g h i
x_y_ k l m
Explained:
$ awk '
NR%2==0 { # on every even numbered record
print $0,p # output current record and previous
}{
p=$0 # buffer record for next round
}' file
Update:
In case of an odd number of records (added mostly due to peer pressure :), you need to deal with the left-over line, x y z in this example:
$ awk 'NR%2==0{print $0,p}{p=$0}END{if(NR%2)print}' file
Output:
...
x_y_ g h i
x_y_ k l m
x y z
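To see the odd-count case end to end, the sketch below feeds five lines through the same logic. The input and swapped variables are made up for the demo, and the END block prints the buffered p explicitly instead of relying on $0 still being set in END, which is an assumption-free variation rather than the exact command above.

```shell
# Five lines: two complete pairs plus one unpaired trailing line.
input=$(printf 'a b c\nx_y_\nd e f\nx_y_\nx y z\n')

# Even records: print the current line followed by the buffered previous one.
# END: an odd record count means the last buffered line had no partner, so print it alone.
swapped=$(printf '%s\n' "$input" | awk 'NR%2==0{print $0,p}{p=$0}END{if(NR%2)print p}')

printf '%s\n' "$swapped"
```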
With sed:
sed -E 'N;s/(.*)\n(.*)/\2 \1/g' sample.txt
A short pipeline:
tac file | paste -d ' ' - - | tac
$ awk 'NR%2{s=$0; next} {print $0, s}' file
x_y_ a b c
x_y_ d e f
x_y_ g h i
x_y_ k l m
1st solution: Could you please try the following, written and tested with GNU awk.
awk -v RS="" -v FS="\n" '{for(i=2;i<=NF;i+=2){printf("%s\n",$i OFS $(i-1))}}' Input_file
Or (with print):
awk -v RS="" -v FS="\n" '{for(i=2;i<=NF;i+=2){print $i,$(i-1)}}' Input_file
2nd solution: If the line number is evenly divisible by 2, print the current line followed by the previous one. If the total number of lines in Input_file is odd, the END block also prints the last remaining line (by checking whether the prev variable is still set).
awk 'prev && FNR%2==0{print $0 OFS prev;prev="";next} {prev=$0} END{if(prev){print prev}}' Input_file
Output will be as follows.
x_y_ a b c
x_y_ d e f
x_y_ g h i
x_y_ k l m
This might work for you (GNU sed):
sed '1~2{h;d};G;s/\n/ /' file
Save odd-numbered lines in the hold space, append them to the even-numbered lines, and replace the newline with a space.
Another variation:
sed -n 'h;n;G;s/\n/ /p' file
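The hold-space steps can be traced one command at a time. This is only a sketch of the second variation, split into separate -e expressions so each step can be described; the input and joined variables are made up for the demo.

```shell
input=$(printf 'a b c\nx_y_\nd e f\nx_y_\n')

# h  copy the current (odd) line into the hold space
# n  read the next (even) line into the pattern space; -n suppresses auto-printing
# G  append a newline plus the hold space to the pattern space
# s  replace that newline with a space, and p prints the joined line
joined=$(printf '%s\n' "$input" | sed -n -e 'h' -e 'n' -e 'G' -e 's/\n/ /p')

printf '%s\n' "$joined"
```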
There are many more ways to achieve this, as can be seen by answers above.
How about this:
parallel -N2 echo "{2} {1}" :::: file
See here for parallel.
Big question:
I want a list of the unique combinations between two fields in a data frame.
Example data:
A B
C D
E F
B A
C F
E F
I would like to get 4 unique combinations: AB, CD, EF, and CF. Since AB and BA contain the same components but in a different order, I only want one copy (it is a mutual relationship, so BA is the same thing as AB).
Attempt:
So far I have tried sorting and keeping unique lines:
sort file | uniq
but of course that produces 5 combinations:
A B
C D
E F
B A
C F
I do not know how to approach AB/BA being considered the same. Any suggestions on how to do this?
The idiomatic awk approach is to order the index parts:
$ awk '!seen[$1>$2 ? $1 FS $2 : $2 FS $1]++' file
A B
C D
E F
C F
Another awk approach:
awk '!a[$1,$2] && !a[$2,$1]++' file
In awk:
$ awk '($1$2 in a){next}{a[$1$2];a[$2$1]}1' file
A B
C D
E F
C F
Explained:
($1$2 in a) { next } # if duplicate in hash, next record
{ a[$1$2]; a[$2$1] } 1 # hash reverse also and output
It works for single char fields. If you want to use it for longer strings, add FS between fields, like a[$1 FS $2] etc. (thanks @EdMorton).
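To illustrate why the separator matters for longer fields, here is a small sketch with made-up input where two different pairs concatenate to the same string. The variable names (input, naive, fixed, and k inside awk) are only for this demo.

```shell
# "ab c" and "a bc" are different pairs, but both concatenate to "abc".
input=$(printf 'ab c\na bc\nc ab\n')

# Plain concatenation: "a bc" is wrongly dropped as a duplicate of "ab c".
naive=$(printf '%s\n' "$input" | awk '($1$2 in a){next}{a[$1$2];a[$2$1]}1')

# With FS between the fields the keys stay distinct; only the true reversal "c ab" is dropped.
fixed=$(printf '%s\n' "$input" | awk '{k=$1 FS $2} (k in a){next} {a[k];a[$2 FS $1]} 1')

printf 'naive:\n%s\nfixed:\n%s\n' "$naive" "$fixed"
```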
I have a text file in the following format; the letters are IDs separated by spaces.
OG1: A B C D E
OG2: C F G D R
OG3: A D F F F
I would like to randomly extract one id from each group as
OG1: E
OG2: D
OG3: A
I tried using
shuf -n 1 data.txt
which gives me
OG2: C F G D R
awk to the rescue!
$ awk -v seed=$RANDOM 'BEGIN{srand(seed)} {print $1,$(rand()*(NF-1)+2)}' file
OG1: D
OG2: F
OG3: F
To skip a certain letter, you can change the main block to
... {while ("C" == (r = $(rand()*(NF-1)+2))); print $1,r}' file
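A complete, runnable form of the skip-a-letter idea might look like the sketch below. It is not the exact command from this answer: it uses a do/while loop and an explicit int() for readability, and the input and picked variables are made up for the demo. It also assumes every group contains at least one ID other than the skipped letter, otherwise the loop would never end.

```shell
input=$(printf 'OG1: A B C D E\nOG2: C F G D R\nOG3: A D F F F\n')

picked=$(printf '%s\n' "$input" | awk -v seed="$RANDOM" '
BEGIN { srand(seed) }
{
    # int(rand()*(NF-1))+2 is a uniform integer in 2..NF, i.e. any ID field.
    # Redraw until the chosen ID is not the excluded letter "C".
    do { r = $(int(rand()*(NF-1))+2) } while (r == "C")
    print $1, r
}')

printf '%s\n' "$picked"
```

The output changes from run to run, but each line keeps its group label and never ends in C.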
perl -lane 'print "$F[0] ".$F[rand($#F)+1]' data.txt
Explanation:
These command-line options are used:
-n loop around each line of the input file
-l removes newlines before processing, and adds them back in afterwards
-a autosplit mode – split input lines into the @F array. Defaults to splitting on whitespace.
-e execute the perl code
@F is the array of words in each line, indexed starting with $F[0]
$#F is the index of the last element of @F (one less than the number of words)
output:
OG1: A
OG2: F
OG3: F
I'm trying to fetch columns B to D from a tab-delimited file "FILE". The simple awk code I use fetches the data, but unfortunately puts the output in a single column and drops the identifiers (shown below).
Any suggestions please.
CODE
awk '{for(i=2;i<=4;++i)print $i}' FILE
FILE
A B C D E F G
1_at 10.8435630935 10.8559287854 8.6666141543 8.820310681 9.9024050571 8.613199083 11.9807771094
2_at 4.7615531106 4.5209119307 11.2467919586 8.8105151099 7.1831990104 11.0645055836 4.3726598561
3_at 6.0025262754 5.4058080843 3.2475272982 3.1869728585 3.5654989547
OUTPUT OBTAINED
B
C
D
10.8435630935
10.8559287854
8.6666141543
4.7615531106
4.5209119307
11.2467919586
6.0025262754
5.4058080843
3.2475272982
Why don't you directly use cut?
$ cut -d$'\t' -f2-4 < file
B C D
10.8435630935 10.8559287854 8.6666141543
4.7615531106 4.5209119307 11.2467919586
6.0025262754 5.4058080843 3.2475272982
With awk you would need printf, because print appends a newline after every call:
awk -F"\t" '{for(i=2;i<=4;++i) printf "%s%s", $i, (i==4?RS:FS)}' file
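As a quick check of the printf approach, the sketch below writes the header and the first data row (trimmed to five fields) into a temporary tab-separated file and extracts fields 2 to 4. The tmpfile and cols names are made up for this demo.

```shell
# Hypothetical temp file holding a header and one data row, tab-separated.
tmpfile=$(mktemp)
printf 'A\tB\tC\tD\tE\n' > "$tmpfile"
printf '1_at\t10.8435630935\t10.8559287854\t8.6666141543\t8.820310681\n' >> "$tmpfile"

# Print fields 2..4; FS (a tab) follows fields 2 and 3, RS (a newline) follows field 4.
cols=$(awk -F'\t' '{for(i=2;i<=4;++i) printf "%s%s", $i, (i==4?RS:FS)}' "$tmpfile")

printf '%s\n' "$cols"
rm -f "$tmpfile"
```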
I need to split a single column of data in a large file into two columns as follows:
A
B            B A
C   ---->    D C
D            F E
E            H G
F
G
H
Is there an easy way of doing it with unix shell commands and/or small shell script? awk?
$ awk 'NR%2{s=$0;next} {print $0,s}' file
B A
D C
F E
H G
You can use the following awk script:
awk 'NR % 2 != 0 {cache=$0}; NR % 2 == 0 {print $0, cache}' data.txt
Output:
B A
D C
F E
H G
It caches each odd line and, on each even line, prints that line followed by the cached one, separated by a space.
I know this is tagged awk, but I just can't stop myself from posting a sed solution, since the question left it open for "easy way . . . with unix shell commands":
$ sed -n 'h;n;G;s/\n/ /g;p' data.txt
B A
D C
F E
H G