How to print fields for repeated key column in one line - awk

I'd like to transform a table in such a way that for duplicated
values in column #2 it would have corresponding values from column #1.
I.e. something like that...
MZ00024296 AC148152.3_FG005
MZ00047079 AC148152.3_FG006
MZ00028122 AC148152.3_FG008
MZ00032922 AC148152.3_FG008
MZ00048218 AC148152.3_FG008
MZ00024680 AC148167.6_FG001
MZ00013456 AC149475.2_FG003
to
AC148152.3_FG005 MZ00024296
AC148152.3_FG006 MZ00047079
AC148152.3_FG008 MZ00028122|MZ00032922|MZ00048218
AC148167.6_FG001 MZ00024680
AC149475.2_FG003 MZ00013456
As I need it to computations in R I tried to use:
x=aggregate(mz_grmz,by=list(mz_grmz[,2]),FUN=paste(mz_grmz[,1],sep="|"))
but it don't work (wrong function)
Error in match.fun(FUN) :
'paste(mz_grmz[, 1], sep = "|")' is not a function, character or symbol
I also remind myself about unstack() function, but it isn't what I need.
I tried to do it using awk, based on my base knowledge I reworked code given here:
site1
#! /bin/sh
for y do
awk -v FS="\t" '{
for (x=1;x<=NR;x++) {
if (NR>2 && x=x+1) {
print $2"\t"x
}
else {print NR}
}
}' $y > $y.2
done
unfortunately it doesn't work, it's only produce enormous file with field #2 and some numbers.
I suppose it is easy task, but it is above my skills right now.
Could somebody give me a hint? Maybe just function to use in aggregate in R.
Thanks

You could do it in awk like this:
awk '
{
if ($2 in a)
a[$2] = a[$2] "|" $1
else
a[$2] = $1
}
END {
for (i in a)
print i, a[i]
}' INFILE > OUTFILE

to keep the output as same as the text in your question (empty lines etc..):
awk '{if($0 &&($2 in a))a[$2]=a[$2]"|"$1;else if ($0) a[$2]=$1;}\
END{for(x in a){print x,a[x];print ""}}' inputFile
test:
kent$ echo "MZ00024296 AC148152.3_FG005
MZ00047079 AC148152.3_FG006
MZ00028122 AC148152.3_FG008
MZ00032922 AC148152.3_FG008
MZ00048218 AC148152.3_FG008
MZ00024680 AC148167.6_FG001
MZ00013456 AC149475.2_FG003"|awk '{if($0 &&($2 in a))a[$2]=a[$2]"|"$1;else if ($0) a[$2]=$1;}END{for(x in a){print x,a[x];print ""}}'
AC149475.2_FG003 MZ00013456
AC148152.3_FG005 MZ00024296
AC148152.3_FG006 MZ00047079
AC148152.3_FG008 MZ00028122|MZ00032922|MZ00048218
AC148167.6_FG001 MZ00024680

This GNU sed solution might work for you:
sed -r '1{h;d};H;${x;s/(\S+)\s+(\S+)/\2\t\1/g;:a;s/(\S+\t)([^\n]*)(\n+)\1([^\n]*)\n*/\1\2|\4\3/;ta;p};d' input_file
Explanation: Use the extended regex option-r to make regex's more readable. Read the whole file into the hold space (HS). Then on end-of-file, switch to the HS and firstly swap and tab separate fields. Then compare the first fields in adjacent lines and if they match, tag the second field from the second record to the first line separated by a |. Repeated until no further adjacent lines have duplicate first fields then print the file out.

Related

How to check if a string contains at least one letter different from 4 using bash or awk

How to check that a sequence has at least one letter that is not A, U, C, G characters using awk or bash?
Can it be done without the typical for loop?
Example of sequence:
AUVGAU
I give this as input I should get it back given that it has V
The input file looks something like this, so I think awk would be better.
>7A0E_1|
AUVGAU
>7A0E_2|
GUCAU
Expected output
>7A0E_1|
AUVGAU
Here is what I tried:
awk '!/^>/ {next}; {getline s}; s !~ /AUGC/ { print $0 "\n" s }' sample
But obviously /AUGC/ is not right... can someone help me with this regex?
I think awk is the tool if you want to conditionally output the > line if the next record does not contain [AUCG]. You can do that with:
awk '/^>/ {rec=$0; next} /[^AUGC]/ {printf "%s\n%s\n", rec, $0}' sample
In your case that results in:
$ awk '/^>/ {rec=$0; next} /[^AUGC]/ {printf "%s\n%s\n", rec, $0}' sample
>7A0E_1|
AUVGAU
(note: you can use print rec; print instead of printf, but printf above reduced the output to a single call)
Where you ran into trouble was forgetting to save the current record that began with > and then using getline -- which wasn't needed at all.
How to check that a sequence has at least one letter that is not A, U, C, G characters using awk(...)? Can it be done without the typical for loop?
Yes, GNU AWK can do that. Let file.txt content be
AUVGAU
AUCG
(empty line is intentional) then
awk 'BEGIN{FPAT="[^AUCG]"}{print NF>=1}' file.txt
output
1
0
0
Explanation: both solutions count number of characters which are not one of: A, U, C, G, any other character is treated as constituing field and number of fields (NF) is then checked (>=1). Note that this solution does redefine what is field and if that is problem you might use patsplit instead
awk '{patsplit($0,arr,"[^AUCG]");print length(arr)>=1}' file.txt
(tested in gawk 4.2.1)

Can I delete a field in awk?

This is test.txt:
0x01,0xDF,0x93,0x65,0xF8
0x01,0xB0,0x01,0x03,0x02,0x00,0x64,0x06,0x01,0xB0
0x01,0xB2,0x00,0x76
If I run
awk -F, 'BEGIN{OFS=","}{$2="";print $0}' test.txt
the result is:
0x01,,0x93,0x65,0xF8
0x01,,0x01,0x03,0x02,0x00,0x64,0x06,0x01,0xB0
0x01,,0x00,0x76
The $2 wasn't deleted, it just became empty.
I hope, when printing $0, that the result is:
0x01,0x93,0x65,0xF8
0x01,0x01,0x03,0x02,0x00,0x64,0x06,0x01,0xB0
0x01,0x00,0x76
All the existing solutions are good though this is actually a tailor made job for cut:
cut -d, -f 1,3- file
0x01,0x93,0x65,0xF8
0x01,0x01,0x03,0x02,0x00,0x64,0x06,0x01,0xB0
0x01,0x00,0x76
If you want to remove 3rd field then use:
cut -d, -f 1,2,4- file
To remove 4th field use:
cut -d, -f 1-3,5- file
I believe simplest would be to use sub function to replace first occurrence of continuous ,,(which are getting created after you made 2nd field NULL) with single ,. But this assumes that you don't have any commas in between field values.
awk 'BEGIN{FS=OFS=","}{$2="";sub(/,,/,",");print $0}' Input_file
2nd solution: OR you could use match function to catch regex from first comma to next comma's occurrence and get before and after line of matched string.
awk '
match($0,/,[^,]*,/){
print substr($0,1,RSTART-1)","substr($0,RSTART+RLENGTH)
}' Input_file
It's a bit heavy-handed, but this moves each field after field 2 down a place, and then changes NF so the unwanted field is not present:
$ awk -F, -v OFS=, '{ for (i = 2; i < NF; i++) $i = $(i+1); NF--; print }' test.txt
0x01,0x93,0x65,0xF8
0x01,0x01,0x03,0x02,0x00,0x64,0x06,0x01
0x01,0x00,0x76
$
Tested with both GNU Awk 4.1.3 and BSD Awk ("awk version 20070501" on macOS Mojave 10.14.6 — don't ask; it frustrates me too, but sometimes employers are not very good at forward thinking). Setting NF may or may not work on older versions of Awk — I was a little surprised it did work, but the surprise was a pleasant one, for a change.
If Awk is not an absolute requirement, and the input is indeed as trivial as in your example, sed might be a simpler solution.
sed 's/,[^,]*//' test.txt
This is especially elegant if you want to remove the second field. A more generic approach to remove, the nth field would require you to put in a regex which matches the first n - 1 followed by the nth, then replace that with just the the first n - 1.
So for n = 4 you'd have
sed 's/\([^,]*,[^,]*,[^,]*,\)[^,]*,/\1/' test.txt
or more generally, if your sed dialect understands braces for specifying repetitions
sed 's/\(\([^,]*,\)\{3\}\)[^,]*,/\1/' test.txt
Some sed dialects allow you to lose all those pesky backslashes with an option like -r or -E but again, this is not universally supported or portable.
In case it's not obvious, [^,] matches a single character which is not (newline or) comma; and \1 recalls the text from first parenthesized match (back reference; \2 recalls the second, etc).
Also, this is completely unsuitable for escaped or quoted fields (though I'm not saying it can't be done). Every comma acts as a field separator, no matter what.
With GNU sed you can add a number modifier to substitute nth match of non-comma characters followed by comma:
sed -E 's/[^,]*,//2' file
Using awk in a regex-free way, with the option to choose which line will be deleted:
awk '{ col = 2; n = split($0,arr,","); line = ""; for (i = 1; i <= n; i++) line = line ( i == col ? "" : ( line == "" ? "" : "," ) arr[i] ); print line }' test.txt
Step by step:
{
col = 2 # defines which column will be deleted
n = split($0,arr,",") # each line is split into an array
# n is the number of elements in the array
line = "" # this will be the new line
for (i = 1; i <= n; i++) # roaming through all elements in the array
line = line ( i == col ? "" : ( line == "" ? "" : "," ) arr[i] )
# appends a comma (except if line is still empty)
# and the current array element to the line (except when on the selected column)
print line # prints line
}
Another solution:
You can just pipe the output to another sed and squeeze the delimiters.
$ awk -F, 'BEGIN{OFS=","}{$2=""}1 ' edward.txt | sed 's/,,/,/g'
0x01,0x93,0x65,0xF8
0x01,0x01,0x03,0x02,0x00,0x64,0x06,0x01,0xB0
0x01,0x00,0x76
$
Commenting on the first solution of #RavinderSingh13 using sub() function:
awk 'BEGIN{FS=OFS=","}{$2="";sub(/,,/,",");print $0}' Input_file
The gnu-awk manual: https://www.gnu.org/software/gawk/manual/html_node/Changing-Fields.html
It is important to note that making an assignment to an existing field changes the value of $0 but does not change the value of NF, even when you assign the empty string to a field." (4.4 Changing the Contents of a Field)
So, following the first solution of RavinderSingh13 but without using, in this case,sub() "The field is still there; it just has an empty value, delimited by the two colons":
awk 'BEGIN {FS=OFS=","} {$2="";print $0}' file
0x01,,0x93,0x65,0xF8
0x01,,0x01,0x03,0x02,0x00,0x64,0x06,0x01,0xB0
0x01,,0x00,0x76
My solution:
awk -F, '
{
regex = "^"$1","$2
sub(regex, $1, $0);
print $0;
}'
or one line code:
awk -F, '{regex="^"$1","$2;sub(regex, $1, $0);print $0;}' test.txt
I found that OFS="," was not necessary
I would do it following way, let file.txt content be:
0x01,0xDF,0x93,0x65,0xF8
0x01,0xB0,0x01,0x03,0x02,0x00,0x64,0x06,0x01,0xB0
0x01,0xB2,0x00,0x76
then
awk 'BEGIN{FS=",";OFS=""}{for(i=2;i<=NF;i+=1){$i="," $i};$2="";print}' file.txt
output
0x01,0x93,0x65,0xF8
0x01,0x01,0x03,0x02,0x00,0x64,0x06,0x01,0xB0
0x01,0x00,0x76
Explanation: I set OFS to nothing (empty string), then for 2nd and following column I add , at start. Finally I set what is now comma and value to nothing. Keep in mind this solution would need rework if you wish to remove 1st column.

How do I obtain a specific row with the cut command?

Background
I have a file, named yeet.d, that looks like this
JET_FUEL = /steel/beams
ABC_DEF = /michael/jackson
....50 rows later....
SHIA_LEBEOUF = /just/do/it
....73 rows later....
GIVE_FOOD = /very/hungry
NEVER_GONNA = /give/you/up
I am familiar with the f and d options of the cut command. The f option allows you to specify which column(s) to extract from, while the d option allows you to specify what the delimiters.
Problem
I want this output returned using the cut command.
/just/do/it
From what I know, this is part of the command I want to enter:
cut -f1 -d= yeet.d
Given that I want the values to the right of the equals sign, with the equals sign as the delimiter. However this would return:
/steel/beams
/michael/jackson
....50 rows later....
/just/do/it
....73 rows later....
/very/hungry
/give/you/up
Which is more than what I want.
Question
How do I use the cut command to return only /just/do/it and nothing else from the situation above? This is different from How to get second last field from a cut command because I want to select a row within a large file, not just near from the end or the beginning.
This looks like it would be easier to express with awk...
# awk -v _s="${_string}" '$3 == _s {print $3}' "${_path}"
## Above could be more _scriptable_ form of bellow example
awk -v _search="/just/do/it" '$3 == _search {print $3}' <<'EOF'
JET_FULE = /steal/beams
SHIA_LEBEOUF = /just/do/it
NEVER_GONNA = /give/you/up
EOF
## Either way, output should be similar to
## /just/do/it
-v _something="Some Thing" bit allows for passing Bash variables to awk
$3 == _search bit tells awk to match only when column 3 is equal to the search string
To search for a sub-string within a line one can use $0 ~ _search
{print $3} bit tells awk to print column 3 for any matches
And the <<'EOF' bit tells Bash to not expand anything within the opening and closing EOF tags
... however, the above will still output duplicate matches, eg. if yeet.d somehow contained...
JET_FULE = /steal/beams
SHIA_LEBEOUF = /just/do/it
NEVER_GONNA = /give/you/up
AGAIN = /just/do/it
... there'd be two /just/do/it lines outputed by awk.
Quickest way around that would be to pipe | to head -1, but the better way would be to tell awk to exit after it's been told to print...
_string='/just/do/it'
_path='yeet.d'
awk -v _s="${_string}" '$3 == _s {print $3; exit}' "${_path}"
... though that now assumes that only the first match is wanted, obtaining the nth is possible though currently outside the scope of the question as of last time read.
Updates
To trip awk on the first column while printing the third column and exiting after the first match may look like...
_string='SHIA_LEBEOUF'
_path='yeet.d'
awk -v _s="${_string}" '$1 == _s {print $3; exit}' "${_path}"
... and generalize even further...
_string='^SHIA_LEBEOUF '
_path='yeet.d'
awk -v _s="${_string}" '$0 ~ _s {print $3; exit}' "${_path}"
... because awk totally gets regular expressions, mostly.
It depends on how you want to identify the desired line.
You could identify it by the line number. In this case you can use sed
cut -f2 -d= yeet.d | sed '53q;d'
This extracts the 53th line.
Or you could identify it by a keyword. In this case use grep
cut -f2 -d= yeet.d | grep just
This extracts all lines containing the word just.

Split rows to multiple line based on comma : one liner solution

I want to split the following format to unique lines
Input:
17:79412041:C:T,CGGATGTCAT
17:79412059:C:G,T
17:79412138:G:A,C
17:79412192:C:G,T,A
Desired output
17:79412041:C:T
17:79412041:C:CGGATGTCAT
17:79412059:C:G
17:79412059:C:T
17:79412138:G:A
17:79412138:G:C
17:79412192:C:G
17:79412192:C:T
17:79412192:C:A
Basically split the input to unique rows or firstID:secondID:thirdID:FourthID. Here multiple row may have firstID:secondID:thirdID may be common and the FourthID is the one it make each raw unique(that was seperated by "," in the input).
Thanks in advance
Shams
awk one-liner
$ awk -F":" '{gsub(/,/,":"); a=$1FS$2FS$3; for(i=4; i<=NF; i++) print a FS $i;}' f1
17:79412041:C:T
17:79412041:C:CGGATGTCAT
17:79412059:C:G
17:79412059:C:T
17:79412138:G:A
17:79412138:G:C
17:79412192:C:G
17:79412192:C:T
17:79412192:C:A
We are first replacing all , with : to keep a common delimiter i.e. :
We are then traversing from 4th field to end and printing each field by prefixing first three fields.
This one-liner here:
$ awk -F':' '{ split($4,a,","); for (i in a) { print $1":"$2":"$3":"a[i] } }' data.txt
Produces:
17:79412041:C:T
17:79412041:C:CGGATGTCAT
17:79412059:C:G
17:79412059:C:T
17:79412138:G:A
17:79412138:G:C
17:79412192:C:G
17:79412192:C:T
17:79412192:C:A
Explanation:
split(string, array, delimiter)
splits the string by the delimiter, and saves the pieces into the array.
The for-in loop simply prints every piece in the array with the first three entries.
The -F':' part defines the top-level delimiter.
another awk, should work for any number of fields
$ awk -F: '{split($NF,a,","); for(i in a) {sub($NF"$",a[i]); print}}' file
Following awk + gsub of it may help you on same too:
awk -F":" '{gsub(",",ORS $1 OFS $2 OFS $3 "&");gsub(/,/,":")} 1' OFS=":" Input_file
This might work for you (GNU sed):
sed 's/^\(\(.*:\)[^:,]*\),/\1\n\2/;P;D' file
Insert a newline and the key for each comma in a line.
An alternative using a loop and syntactic sugar:
sed -r ':a;s/^((.*:)[^:,]*),/\1\n\2/;ta' file

AWK - get value between two strings over multiple lines

input.txt:
>block1
111111111111111111111
>block2
222222222222222222222
>block3
333333333333333333333
AWK command:
awk '/>block2.*>/' input.txt
Expected output
222222222222222222222
However, AWK is returning nothing. What am I misunderstanding?
Thanks!
If you want to print the line after the line containing >block2, then you could use:
awk '/^>block2$/ { nr=NR+1 } NR == nr { print }'
Track the record number plus 1 when you find the match; when the current record number matches the remembered one, print the current record.
If you want all the lines between the line >block2 and >block3, then you'd use:
awk '/^>block2$/,/^>block3/ {if ($0 !~ /^>block[23]$/) print }'
For all lines between the two markers, if the line doesn't match either marker, print it. The output is the same with the sample data file.
another awk
$ awk 'c&&c--; /^>block2/{c=1}' file
222222222222222222222
c specifies how many lines you want to print after the match. If you want the text between two markers
$ awk '/^>block3/{exit} s; /^>block2/{s=1}' file
222222222222222222222
if there are multiple instances and you want them all, just change exit to s=0
You probably meant:
$ awk '/>/{f=/^>block2$/;next} f' file
222222222222222222222