Selecting specific lines using awk

I have a file with lines like these
1 1000034 G C 0.4 12
2 1000435 C G 0.1 52
3 0092943 A T 0.2 5
4 0092241 G A 0.3 34
etc.
Columns 3 and 4 only contain the characters A, G, C, T.
I need to print lines that DO NOT contain both G and C in columns 3 and 4.
What I've tried so far in awk is
awk ' { if ($3!="G" && $4!="C") print }' file
but this also excludes lines with, for example, G and A in columns 3 and 4. I only want to exclude lines with G and C in columns 3 and 4, respectively.
I prefer to use awk for this problem.

One way:
awk '!($3=="G" && $4=="C")' file
This matches the G-and-C combination and negates it, printing everything else.
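Equivalently, by De Morgan's law, !(A && B) is the same as !A || !B, so the negation can be pushed onto the individual comparisons. Here is a sketch of the same filter written that way, which also shows why the original attempt failed: it negated both comparisons but kept the &&.
awk '$3!="G" || $4!="C"' file
Both versions keep a line like "4 0092241 G A" (because $4 is not C) and drop only lines where $3 is G and $4 is C at the same time.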

How do I print starting from a certain row of output with awk?

I have millions of records in my file. What I need to do is print columns 1396 to 1400 for a specific number of rows, ideally in a form I can open in Excel or Notepad.
I tried this command:
awk '{print $1396,$1397,$1398,$1399,$1400}' file_name
But this runs for every row.
You need a condition to specify which rows to apply the action to:
awk '<<condition goes here>> {print $1396,$1397,$1398,$1399,$1400}' file_name
For example, to do this only for rows 50 to 100:
awk 'NR >= 50 && NR <= 100 {print $1396,$1397,$1398,$1399,$1400}' file_name
(Depending on what you want to do, you can also have much more complicated selection patterns than this.)
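For instance, here is a sketch that combines the row range with a field test (the $1 value is invented purely for illustration):
awk 'NR >= 50 && NR <= 100 && $1 == "chr1" {print $1396,$1397,$1398,$1399,$1400}' file_name
This prints the five columns only for rows 50 to 100 whose first field is chr1.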
Here's a simpler example for testing:
awk 'NR >= 3 && NR <= 5 {print $2, $3}'
If I run this on an input file containing
1 2 3 4
2 3 4 5
3 a b 6
4 c d 7
5 e f 8
6 7 8 9
I get the output
a b
c d
e f
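Since the question mentions Excel: one option (a sketch, assuming the selected fields contain no commas themselves) is to set the output field separator to a comma and save the result as a .csv file that Excel opens directly:
awk -v OFS=',' 'NR >= 50 && NR <= 100 {print $1396,$1397,$1398,$1399,$1400}' file_name > subset.csv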

Awk - Conditionally print an element from a certain row, based on the condition of a different element in a different row

Say I have a lot of files with a consistent number of columns and rows, and a sample one looks like this:
1 2 3
4 5 6
7 8 9
I want to print column 3 of row 2, but only if column 3 of row 3 == 4 (in this case it is 9). I'm using this logic as a means to determine whether the file is valid for my use case, and to extract the relevant field if it is.
My attempt, based on other answers to people asking how to isolate certain rows, was this: awk 'BEGIN{FNR=3} $3=="4"{FNR=2;print $2}'
So you are looking for something like this?
awk 'FNR==2{ x = $3 }FNR==3 && $3=="4"{ print x }' file.txt
cat file.txt
1 2 3
4 5 6
7 8 4
Output:
6
cat file.txt
1 2 3
4 5 6
7 8 9
Output:
Nothing, since column 3 of row 3 is 9.
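The idea: stash row 2's third field in x while reading the file, then print it only if row 3 passes the test. As a sketch, the same pattern can be parameterized with variables (the names vrow, trow and tval are made up here):
awk -v vrow=2 -v trow=3 -v tval=4 'FNR==vrow{x=$3} FNR==trow && $3==tval{print x}' file.txt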
awk 'FNR==3 && $3==4{print p} {p=$3}' *
This works because rules run in order: on row 3 the first rule still sees p holding row 2's $3, and only afterwards does {p=$3} update it.
Here's another, which doesn't care about the order in which the records appear. In the OP the problem was to print a value (v) from the 2nd record based on the tested value (t) in the 3rd record. This solution allows the test value to appear in an earlier record than the value to be printed:
$ awk '
FNR==2 {        # record containing the value to print
    v=$3
    f=1         # flag indicating the value v has been read
}
FNR==3 {        # record containing the value to test
    t=$3
    g=1         # test value read indicator
}
f && g {        # once the value and test value are acquired,
    if(t==4)    # run the test
        print v # output
    exit        # and exit
}' file
6
Record order reversed (FNR values changed in the code):
$ cat file2
1 2 3
7 8 4 # records
4 5 6 # reversed
$ awk 'FNR==3{v=$3;f=1}FNR==2{t=$3;g=1}f&&g{if(t==4)print v;exit}' file2
6
Separate flags f and g are used instead of testing v and t directly, in case either value happens to be empty ("").

Filter rows with duplicates or triplicates++ by matching key and screening columns

I'm getting stuck with duplicate / triplicate filtering complexity. A solution preferably in awk, but it could also use sort -u or uniq etc.
I want to filter rows that have either unique or exactly duplicated/triplicated etc. values in the first three columns. The whole line should be printed, including the fourth column, which shouldn't be matched against anything. Consider this tab-separated table:
Edit: $2 and $3 values don't have to be compared within one row. As recommended, I changed $3 values to 2xx.
name value1 value2 anyval
a 1 21 first
b 2 22 second
b 2 22 third
c 3 23 fourth
c 3 28 fifth
d 4 24 sixth
d 4 24 seventh
e 4 25 eighth
e 4 25 ninth
f 7 27 tenth
f 7 27 eleventh
f 7 27 twelfth
f 7 27 thirteenth
g 11 210 fourteenth
g 10 210 fifteenth
Line 1 is unique and should be printed.
Lines 2 + 3 contain exact duplicate values, one of them should be printed.
Lines 4 + 5 contain different value in col 3 and should be kicked out.
Lines 6 + 7 are duplicates, but they should be kicked out because lines 8 + 9 contain the same value in col 2.
Same for lines 8 + 9.
One of the lines 10 to 13 should be printed.
Desired output:
a 1 21 first
b 2 22 second
f 7 27 tenth
... or any other one of the b and f lines.
What I've tried so far, without success:
awk '!seen[$1]++ && !seen[$2]'
prints the first line for each distinct col 1 value:
a 1 21 first
b 2 22 second
c 3 23 fourth
d 4 24 sixth
e 4 25 eighth
f 7 27 tenth
awk '!seen[$1]++ && !seen[$2]++'
prints
a 1 21 first
b 2 22 second
c 3 23 fourth
d 4 24 sixth
f 7 27 tenth
Consequently, I thought awk should print the desired result with:
awk '!seen[$1]++ && !seen[$2]++ && !seen[$3]++'
But the output is empty.
A different try: print dups in col 1, then repeat the same procedure for col 2 and col 3. This doesn't work because there are duplicates in col 2:
awk -F'\t' '{print $1}' file.txt | sort | uniq -d | grep -F -f - file.txt
first prints the duplicates in col 1 (without "a", which I could add back with cat later on):
b 2 22 second
b 2 22 third
c 3 23 fourth
c 3 28 fifth
d 4 24 sixth
d 4 24 seventh
e 4 25 eighth
e 4 25 ninth
f 7 27 tenth
f 7 27 eleventh
f 7 27 twelfth
f 7 27 thirteenth
But again, I'm getting stuck with repetitive values (e.g. 4) spanning multiple columns.
I think the solution could be to define col1 singlets and multiplets and screen for repetitive values in all other columns, but that's causing massive stack overflow in my brain.
I'm not 100% clear on the requirements, but you can filter the records in stages...
$ awk '!a[$1,$2,$3]++{print $0,$2}' file |
uniq -uf4 |
cut -d' ' -f1-4
a 1 1 first
b 2 2 second
f 7 7 tenth
First, awk filters out duplicate entries based on the first three fields and appends the second field as a key for the next stage; uniq -uf4 skips the first four fields, so it compares only that appended key and removes every copy of duplicated keys; cut then gets rid of the extra key field.
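To see what the intermediate stream looks like before the uniq and cut stages (and therefore what uniq -uf4 actually compares), run just the first command on its own:
$ awk '!a[$1,$2,$3]++{print $0,$2}' file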
UPDATE
For requiring both the $2 and $3 fields to be unique, we have to go back to awk:
$ awk '!a[$1,$2,$3]++ { f2[$2]++; f3[$3]++; line[$2,$3]=$0 }
       END { for (i in f2)
                 for (j in f3)
                     if ((i,j) in line && f2[i]*f3[j]==1) print line[i,j] }' file |
  sort
a 1 1 first
b 2 2 second
f 7 7 tenth
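The END block works because f2[i]*f3[j]==1 only when both counts are exactly 1, i.e. the $2 value and the $3 value of a kept line each occur once among the deduplicated rows. As a sketch, the same idea can be indexed by row number, which may read more directly (same assumptions as the answer above, including no header line in the file):
awk '!a[$1,$2,$3]++ { c2[$2]++; c3[$3]++; row[NR]=$0; k2[NR]=$2; k3[NR]=$3 }
     END { for (i in row) if (c2[k2[i]]==1 && c3[k3[i]]==1) print row[i] }' file
Iteration order over row is unspecified in awk, hence the trailing sort in the answer above if stable output is needed.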

How to do a join using awk

Here is my Input file
Identifier Relation
A 1
A 2
A 3
B 2
B 3
C 1
C 2
C 3
I want to join this file to itself based on the "Relation" field.
Sample Output file
A 1 C 1
A 2 B 2
A 2 C 2
B 2 C 2
A 3 B 3
A 3 C 3
B 3 C 3
I used the following awk script:
awk 'NR==FNR {a[NR]=$0; next} { for (k in a) if (a[k]~$2) print a[k],$0}' input input > output
However, I had to do another awk step to delete lines which joined with themselves, i.e., A 1 A 1; B 2 B 2; etc.
The second issue is that it prints both directions of the join, so
A 1 C 1 is printed along with C 1 A 1 on another line.
Both these lines display the same relation and I would NOT like to include both. I want to see just one or the other, i.e., "A 1 C 1" or "C 1 A 1", not both.
Any suggestions/directions are highly appreciated.
An alternative solution using awk, with help from join and sort:
$ join -j 2 <(sort -k2 -k1,1 file){,} |
    awk '$2!=$3 && !($3 FS $2 in a){a[$2 FS $3]; print $2,$1,$3,$1}'
A 1 C 1
A 2 B 2
A 2 C 2
B 2 C 2
A 3 B 3
A 3 C 3
B 3 C 3
Create the cross product, then eliminate the diagonal and one of each symmetric pair.
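One caveat (an assumption based on the input shown in the question): the header line "Identifier Relation" would also take part in the join. A sketch that strips it first with tail:
$ join -j 2 <(tail -n +2 file | sort -k2 -k1,1){,} |
    awk '$2!=$3 && !($3 FS $2 in a){a[$2 FS $3]; print $2,$1,$3,$1}'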
Here is an awk-only solution:
awk 'NR>1{ar[$2]=(ar[$2]$1);}\
END{ for(key in ar){\
for(i=1; i<length(ar[key]); i++) {\
for(j=i+1; j<length(ar[key])+1; j++) {\
print substr(ar[key],i,1), key, substr(ar[key],j,1), key;\
}\
}\
}}' infile
Each number in the second column of the input serves as the key of an awk array. The value of the corresponding array element is the concatenation of the first-column letters (e.g., ar[2]="ABC").
Then we build all two-letter combinations of each sequence (e.g., "ABC" gives "AB", "AC" and "BC").
Output:
A 1 C 1
A 2 B 2
A 2 C 2
B 2 C 2
A 3 B 3
A 3 C 3
B 3 C 3
Note:
If a number occurs only once, no output is generated for that number.
The order of output depends on the order of input (no sorting of letters!). That is, if the first data line were C 1, then ar[1]="CA" and the first output line would be C 1 A 1.
The first line of input (the header) is skipped due to NR>1.
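Note also that the substr() approach assumes single-character identifiers. Here is a sketch for multi-character identifiers, collecting them into a comma-separated list per key and splitting it back in END (the comma as internal separator is an assumption; any character not occurring in the identifiers would do):
awk 'NR>1 { ar[$2] = ar[$2] (ar[$2] ? "," : "") $1 }
     END {
         for (key in ar) {
             n = split(ar[key], ids, ",")
             for (i = 1; i < n; i++)
                 for (j = i+1; j <= n; j++)
                     print ids[i], key, ids[j], key
         }
     }' infile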
There is surely a solution with awk only, but I'm going to propose a solution using awk and sort because I think it's quite simple and does not require storing the entire file content in awk variables. The idea is as follows:
rewrite the input file so that the "relation" field is first (A 1 -> 1 A)
use sort -n to put together all lines with same "relation"
use awk to combine consecutive lines having the same "relation"
That would translate to something like:
awk '{print $2 " " $1}' input | sort -n |
awk '{if ($1==lastsel) printf " "; else if (lastsel) printf "\n"; lastsel=$1; printf "%s %s", $2, $1} END{if (lastsel) printf "\n"}'
A 1 C 1
A 2 B 2 C 2
A 3 B 3 C 3
EDIT: If you want only one i-j relation per line:
awk '{print $2 " " $1}' input | sort -n |
awk '$1!=rel{rel=$1;item=$2;next;} {printf "%s %s %s %s\n", item, rel, $2, $1;}'
A 1 C 1
A 2 B 2
A 2 C 2
A 3 B 3
A 3 C 3
Note the following limitations with this solution:
In case a given n has only one entry, nothing will be output (no output such as D 1)
All relations always have the lexicographically first item in the first column (e.g. A 1 C 1 but never B 1 C 1)
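If every pair per relation is wanted (the desired output includes B 2 C 2, which the EDIT above never emits because it only pairs the first item with each later one), here is a sketch that keeps all items seen so far for the current relation and pairs each new item with them:
awk '{print $2 " " $1}' input | sort -n |
awk '$1!=rel {rel=$1; n=0} {for (i=1; i<=n; i++) printf "%s %s %s %s\n", item[i], rel, $2, $1; item[++n]=$2}'
A header line, if present, forms its own single-entry group and so produces no pairs.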

awk: delete first and last entry of comma-separated field

I have a 4 column data that looks something like the following:
a 1 g 1,2,3,4,5,6,7
b 2 g 3,5,3,2,6,4,3,2
c 3 g 5,2,6,3,4
d 4 g 1,5,3,6,4,7
I am trying to delete the first two numbers and the last two numbers in the fourth column on every line, so the output looks like the following:
a 1 g 3,4,5
b 2 g 3,2,6,4
c 3 g 6
d 4 g 3,6
Can someone help me out? I would appreciate it.
You can use this:
$ awk '{n=split($4, a, ","); for (i=3; i<=n-2; i++) t=t""a[i](i==n-2?"":","); print $1, $2, $3, t; t=""}' file
a 1 g 3,4,5
b 2 g 3,2,6,4
c 3 g 6
d 4 g 3,6
Explanation
n=split($4, a, ",") slices the 4th field into pieces, using the comma as delimiter. As split() returns the number of pieces we got, we store it in n to work with it later on.
for (i=3; i<=n-2; i++) t=t""a[i](i==n-2?"":",") builds the new fourth field in t, looping over the middle slices and re-inserting commas between them.
print $1, $2, $3, t; t="" prints the new output and blanks the variable t for the next line.
This will work for your posted sample input:
$ awk '{gsub(/^([^,]+,){2}|(,[^,]+){2}$/,"",$NF)}1' file
a 1 g 3,4,5
b 2 g 3,2,6,4
c 3 g 6
d 4 g 3,6
If you have cases where there are fewer than 4 commas in your 4th field then update your question to show how those should be handled.
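For instance, a sketch that simply passes such short lists through untouched (an assumed behavior, since the question doesn't specify one):
$ awk 'split($NF,a,",")>4 {gsub(/^([^,]+,){2}|(,[^,]+){2}$/,"",$NF)} 1' file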
This uses bash array manipulation. It may be a little ... gnarly:
while read -a fields; do                      # read the fields for each line
    IFS=, read -a values <<< "${fields[3]}"   # split the last field on comma
    new=("${values[@]:2:${#values[@]}-4}")    # drop the first 2 and last 2 values
    fields[3]=$(IFS=,; echo "${new[*]}")      # join the new list on comma
    printf "%s\t" "${fields[@]}"; echo        # print the new line
done <<END
done <<END
a 1 g 1,2,3,4,5,6,7
b 2 g 3,5,3,2,6,4,3,2
c 3 g 5,2,6,3,4
d 4 g 1,5,3,6,4,7
END
a 1 g 3,4,5
b 2 g 3,2,6,4
c 3 g 6
d 4 g 3,6