parse and concatenate phrases - awk

I have a file dump which has the records of individuals:
.....Detail....account=xxxxx,......state=yyyyy,....
.....Detail....account=aaaaa,......state=bbbbb,....
What would be a way to extract the two values concatenated together using awk, sed or grep?
Would it be possible in a single pass on the command line?
Expected output (the delimiter does not matter):
xxxxx-yyyyy
aaaaa-bbbbb

awk -F'[=,]' '{print $2"-"$4}' file
xxxxx-yyyyy
aaaaa-bbbbb

The details about the input data are a bit vague, but the following sed filter will probably have the desired effect, and could most likely be tweaked if not:
s/.*account=\([^,]*\).*state=\([^,]*\),.*/\1-\2/
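For reference, here is how that filter behaves when applied, unmodified, to the two sample records from the question:

```shell
printf '%s\n' \
  '.....Detail....account=xxxxx,......state=yyyyy,....' \
  '.....Detail....account=aaaaa,......state=bbbbb,....' |
sed 's/.*account=\([^,]*\).*state=\([^,]*\),.*/\1-\2/'
# -> xxxxx-yyyyy
# -> aaaaa-bbbbb
```

The first greedy `.*` swallows everything up to `account=`, each `\([^,]*\)` captures a value up to the next comma, and the trailing `.*` discards the rest of the line.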

@IUnknown: I believe the ..... (dots) in your Input_file are part of the data. Could you please try the following and let me know if it helps.
awk '{for(i=1;i<=NF;i++){if($i ~ /=/){split($i, A,"=");Q=Q?Q"-"A[2]:A[2]}};print Q;Q=""}' Input_file
It assumes you only want the part after the = in every field that contains one. Let me know if this helps you.


awk - store first occurrence based on cell

I have a file (around 10k entries) with following format:
text1;text2;text3;lat;long
A;B;C;55.01;12.01
A;B;C;n/a;n/a
D;E;F;56.011;13.099
D;E;F;56.01;13.01
D;E;F;n/a;n/a
I;B;C;n/a;n/a
containing duplicates, some without coordinates (n/a), others with mildly contradicting LAT;LONG coordinates.
I only want to store the first unique value of [$1;$2;$3;$4;$5] as output, so the desired output should look like:
text1;text2;text3;lat;long
A;B;C;55.01;12.01
D;E;F;56.011;13.099
I;B;C;n/a;n/a
I'd assume that I want to create an array, but I struggle with the proper formatting of it... so any help is appreciated!
I'm glad you have it working, but personally, I would suggest something a little more along the lines of:
$ awk -F";" '!seen[$1,$2,$3] {print; seen[$1,$2,$3]=1}' file
Example Use/Output
With your data in file, you could then do:
$ awk -F";" '!seen[$1,$2,$3] {print; seen[$1,$2,$3]=1}' file
text1;text2;text3;lat;long
A;B;C;55.01;12.01
D;E;F;56.011;13.099
I;B;C;n/a;n/a
You can shorten it to roughly your example. It simply checks whether the unique index built from the first three fields has been seen yet, and relies on awk's default print action to output the first record for each unique combination:
$ awk -F";" '!seen[$1,$2,$3]++' file
However, using the joined fields $1,$2,$3 as the index is about the only way you can ensure uniqueness.
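A small illustration of that point (the two-row input here is made up): concatenating the fields without a separator can merge distinct keys, whereas the comma form joins them with awk's SUBSEP and keeps them apart.

```shell
# Without a separator, keys "A"+"BC" and "AB"+"C" both become "ABC":
printf 'A;BC;x\nAB;C;x\n' | awk -F';' '!seen[$1 $2]++'
# -> A;BC;x            (second line wrongly dropped)

# With the comma, awk joins the fields with SUBSEP, so the keys stay distinct:
printf 'A;BC;x\nAB;C;x\n' | awk -F';' '!seen[$1,$2]++'
# -> A;BC;x
# -> AB;C;x
```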
If yours works as you say, then it is certainly shorter. Let me know if you have further questions.
Found it by giving up on creating arrays and instead creating a new $1 out of $1,$2,$3. The other solution is indeed more elegant; here is the command I came up with after merging the fields in the file (and setting them as the new $1), which, it turns out, I didn't have to do:
awk -F';' '!seen[($1)]++' file1.csv > file2.csv

Finding sequence in data

I want to use awk to find sequences of a pattern in DNA data, but I cannot figure out how to do it. I have a text file "test.txt" which contains a lot of data, and I want to be able to match any sequence that starts with ATG and ends with TAA, TGA or TAG, and print them.
For instance, if my text file has the data below, I want to find all the matching sequences and output them as shown.
AGACGCCGGAAGGTCCGAACATCGGCCTTATTTCGTCGCTCTCTTGCTTTGCTCGAATAAACGAGTTTGGCTTTATCGAATCTCCGTACCGTAAGGTCGAAAACGGCCGGGTCATTGAGTACGTGAAAGTACAAAATGG
GTCCGCGAATTTTTCGGTTCGTCTCAGCTTTCGCAGTTTATGGATCAGACGAACCCGCTCTCTGAAATTACTCATAAACGCAGGCTCTCGGCGCTCGGGCCCGGCGGACTCTCGCGGGAGCGTGCAGGTTTCGAAGTTC
GGATGATATCGACCATCTCGGCAATCGACGCGTTCGGGCCGTAGGCGAACTGCTCGAAAATCAATTCCGAATCGGGCTTGAGCGAATGGAGCGGGCCATCAAGGAAAAAATGTCTATCCAGCAGGATATGCAAACGACG
AAAGTATGTTTTTCGATCCGCGCCGATTCGACCTCTCAAGAGTCGGAAGGCTTAAATTCAATATCAAAATGGGACGCCCCGAGCGCGACCGTATAGACGATCCGCTGCTTGCGCCGATGGATTTCATCGACGTTGTGAA
ATGAGACCGGGCGATCCGCCGACTGTGCCAACCGCCTACCGGCTTCTGG
Print out matches:
ATGATATCGACCATCTCGGCAATCGACGCGTTCGGGCCGTAG
ATGTTTTTCGATCCGCGCCGATTCGACCTCTCAAGAGTCGGAAGGCTTAA
I tried something like this, but it only displays the rows that start with ATG; it doesn't actually solve my problem:
awk '/^ATG/{print $0}' test.txt
assuming the records are not spanning multiple lines
$ grep -oP 'ATG.*?T(AA|AG|GA)' file
ATGGATCAGACGAACCCGCTCTCTGA
ATGATATCGACCATCTCGGCAATCGACGCGTTCGGGCCGTAG
ATGTTTTTCGATCCGCGCCGATTCGACCTCTCAAGAGTCGGAAGGCTTAA
ATGGGACGCCCCGAGCGCGACCGTATAG
ATGGATTTCATCGACGTTGTGA
This is a non-greedy match, which requires the -P (PCRE) switch, so each match ends at the first stop codon rather than the longest possible one.
Could you please try the following.
awk 'match($0,/ATG.*TAA|ATG.*TGA|ATG.*TAG/){print substr($0,RSTART,RLENGTH)}' Input_file
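Note that awk's match() is leftmost-longest, so the alternation above yields at most one, greedy, match per line. Where grep -P isn't available, a plain POSIX awk loop can emulate the non-greedy behaviour by taking the first stop codon after each ATG. A sketch, fed one truncated sample line for illustration:

```shell
printf '%s\n' 'GGATGATATCGACCATCTCGGCAATCGACGCGTTCGGGCCGTAGGCGAACTG' |
awk '{
  s = $0
  while (match(s, /ATG/)) {
    tail = substr(s, RSTART + 3)            # text after this ATG
    if (!match(tail, /TAA|TGA|TAG/)) break  # no stop codon left on the line
    print "ATG" substr(tail, 1, RSTART + 2)
    s = substr(tail, RSTART + 3)            # resume after the stop codon
  }
}'
# -> ATGATATCGACCATCTCGGCAATCGACGCGTTCGGGCCGTAG
```

Replace the printf with the real test.txt to run it over the whole file.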

awk lines in file between header and footer strings

I'm trying to parse out all of the lines in between different headers and footers to different files using an awk script in a for loop. For example, I have a file with a list of mismatches with sample-name headers (compiled.csv) that looks like this:
19-T00,,,,,,,,,,,,,,,,
1557,WT,,,,,,,,,,,,,,,
6,109-G->A,110-G->A,,,,,,,,,,,,,,
3,183-G->A,,,,,,,,,,,,,,,
19-T10,,,,,,,,,,,,,,,,
642,WT,,,,,,,,,,,,,,,
206,24->G,,,,,,,,,,,,,,,
19-T21,,,,,,,,,,,,,,,,
464,24->G,,,,,,,,,,,,,,,
19-TSpl,,,,,,,,,,,,,,,,
2219,24->G,,,,,,,,,,,,,,,
20-T00,,,,,,,,,,,,,,,,,,
...
...
My goal for the lines above would be to pass all the lines from 19-T00 through 2219,24->G,,,,,,,,,,,,,,, into a sample output file called sample-19.csv.
The sample names all share the pattern [0-9][0-9]-T*. And my approach to doing this first was based on creating an array with all 20 sample names (i.e. 19, 20, 21...). I am trying to execute the following loop, and output files are created but they are blank.
for i in {0..19}
do a="$i"
b=`echo $i+1 | bc`
header="${array[$a]}-T"; footer="${array[$b]}-T"
name=`echo $header | cut -d"-" -f1`
awk -F, -v start="$header" -v finish="$footer" '/^start*/,/^finish*/' compiled.csv >"sample-"$name".csv"
done
If I do this manually with the one-liner:
awk '/^19-T*/,/^20-T*/' compiled.csv >sample-19.csv
it works fine. So I think there may be a problem in the variable passing, but I don't know how to fix it.
I know there are some other threads discussing the header-footer approach using awk, but I just think my syntax needs some help. If anyone has any advice by way of more experienced eyes, it would be much appreciated. Let me know if anything isn't clear.
Thanks,
Matt
All you need is something like this (untested):
awk '
/^[0-9][0-9]-T00,/ {
    close(out)
    out = "sample-" $0
    sub(/-T00.*/, ".csv", out)
}
{ print > out }
' compiled.csv
If you're ever again considering processing text with a shell loop make sure to read why-is-using-a-shell-loop-to-process-text-considered-bad-practice first
using awk
awk --posix '/[0-9]{2}-T00/{split($0,a,"-"); name=a[1]} {print $0 > ("sample-" name ".csv")}' file
Output will be two files, "sample-19.csv" and "sample-20.csv", for your sample contents.
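As for the variable passing in the original loop: values passed with -v become awk variables, not regex literals, so /^start*/ matches the literal text "start" and the ranges never trigger. Dynamic strings can be compared with index() instead. A sketch, using a trimmed-down sample of the question's data (the header/footer values are taken from the question; note that, like the manual one-liner, the range includes the footer line):

```shell
printf '%s\n' '19-T00,' '1557,WT,' '2219,24->G,' '20-T00,' '999,WT,' |
awk -v start="19-T" -v finish="20-T" \
    'index($0, start) == 1, index($0, finish) == 1'
# -> 19-T00,
# -> 1557,WT,
# -> 2219,24->G,
# -> 20-T00,
```

index() == 1 means "the line begins with this string", which mirrors the intent of the anchored /^…/ patterns.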

Awk equivalent of MySQL's substring_index

I have strings with delimited fields, but a different number of fields in each, eg:
this/that
this/that/theother
this/that/theother/stuff
I want to retrieve the last two fields in each case, ie:
this/that
that/theother
theother/stuff
This is easy in MySQL with the substring_index function, and I see this thread explains how to do it in PHP.
Can someone help me achieve the same with awk on the command line? Thanks!
echo 'this/that/theother/stuff' | awk -F/ '{print $(NF-1) "/" $(NF)}'
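For a variable number of trailing fields, a short loop generalizes this; the -v n parameter is my addition (n=2 reproduces the request):

```shell
printf '%s\n' 'this/that' 'this/that/theother' 'this/that/theother/stuff' |
awk -F/ -v n=2 '{
  out = $NF
  for (i = NF - 1; i > NF - n && i > 0; i--)
    out = $i "/" out                 # prepend earlier fields, slash-separated
  print out
}'
# -> this/that
# -> that/theother
# -> theother/stuff
```

The i > 0 guard also handles lines with fewer than n fields by printing whatever is there.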

Filter Records in a file based on first column value through AWK/SED

I have a file with the following records:
a,1
a,1,2
a,1,2,3
b,4
b,4,5
b,4,5,6
I want the output like this:
a,1,2,3
b,4,5,6
It's really unclear what you are trying to do here. It's even less clear what you have tried so far (good StackOverflow questions usually involve some code)! You've read the FAQ, right?
If your input is in a file called input_file.csv, then the following awk program will give you the output you have said you want. Whether it will work for your real data is anyone's guess.
% awk -F',' '{
lines[$1] = $0
}
END {
for (line in lines) {
print lines[line]
}
}' input_file.csv
I offer no explanation as to what this simple script does, but here is a handy reference for awk.
Thanks for your appreciation!
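One caution about that script: for (key in array) iterates in an unspecified order, so the output may not follow input order; the script also keeps the last line per key, which happens to be the longest here only because the input is sorted that way. A sketch that records first-seen key order explicitly:

```shell
printf '%s\n' 'a,1' 'a,1,2' 'a,1,2,3' 'b,4' 'b,4,5' 'b,4,5,6' |
awk -F',' '
  !($1 in lines) { order[++n] = $1 }   # remember first-seen key order
  { lines[$1] = $0 }                   # keep the last record for each key
  END { for (i = 1; i <= n; i++) print lines[order[i]] }
'
# -> a,1,2,3
# -> b,4,5,6
```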
As requested: this matches any line at least seven characters long, which happens to select only the full records here.
awk '/......./' input
a,1,2,3
b,4,5,6