Separate lines with keys and store in different files - awk

How can I extract every line that contains a hexadecimal key and DEBUG from a text file, and store each key's lines in a different file, where the key appears in the format "[ uid key]"?
I.e., ignore any line that is not DEBUG.
in.txt:
[ uid 28fd4583833] DEBUG web.Action
[ uid 39fd5697944] DEBUG test.Action
[ uid 56866969445] DEBUG test2.Action
[ uid 76696944556] INFO test4.Action
[ uid 39fd5697944] DEBUG test7.Action
[ uid 85483e10256] DEBUG testing.Action
The output files are named as "out" + i + ".txt", where i = 1, 2, 3, 4.
i.e.
out1.txt:
[ uid 28fd4583833] DEBUG web.Action
out2.txt:
[ uid 39fd5697944] DEBUG test.Action
[ uid 39fd5697944] DEBUG test7.Action
out3.txt:
[ uid 56866969445] DEBUG test2.Action
out4.txt:
[ uid 85483e10256] DEBUG testing.Action
I tried:
awk 'match($0, /uid ([^]]+)/, a) && /DEBUG/ {print > (a[1] ".txt")}' in.txt

If you are willing to change the output file names to include the keys (frankly, this seems more useful than a one-up counter in the names), you can do:
awk '/DEBUG/{print > ("out-" $3 ".txt")}' FS='[][ ]*' in.txt
This will put all lines that match the string DEBUG with key 85483e10256 into the file out-85483e10256.txt, etc.
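For the sample in.txt above, that should leave you with one file per DEBUG key (the INFO line is skipped):
$ ls out-*.txt
out-28fd4583833.txt  out-39fd5697944.txt  out-56866969445.txt  out-85483e10256.txt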
If you do want to keep the one-up counter, you could do:
awk '/DEBUG/{if( ! a[$3] ) a[$3] = ++counter;
print > ("out" a[$3] ".txt")}' FS='[][ ]*' in.txt
Basically, the idea is to use the regex [][ ]* as the field separator, which matches a run of square brackets or spaces. This way, $1 is the (empty) text preceding the initial [, $2 is the string uid, and $3 is the key. This should correctly extract the key even from lines with slightly different whitespace. An associative array records which keys have already been seen, so the counter is only incremented for new keys. But it really is cleaner just to use the key in the output file name.
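To see how that field separator carves up a sample line (a quick illustration):
$ echo '[ uid 28fd4583833] DEBUG web.Action' | awk -F'[][ ]*' '{print "$2=" $2, "$3=" $3, "$4=" $4}'
$2=uid $3=28fd4583833 $4=DEBUG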

Using GNU sort for -s (to guarantee retaining input line order for every key value) and any awk:
$ sort -sk3,3 in.txt |
awk '$4!="DEBUG"{next} $3!=prev{close(out); out="out"(++i)".txt"; prev=$3} {print > out}'
$ head out*.txt
==> out1.txt <==
[ uid 28fd4583833] DEBUG web.Action
==> out2.txt <==
[ uid 39fd5697944] DEBUG test.Action
[ uid 39fd5697944] DEBUG test7.Action
==> out3.txt <==
[ uid 56866969445] DEBUG test2.Action
==> out4.txt <==
[ uid 85483e10256] DEBUG testing.Action
If you don't have GNU sort then you can apply the DSU (Decorate/Sort/Undecorate) idiom using any sort:
$ awk -v OFS='\t' '{print NR, $0}' in.txt | sort -k4,4 -k1,1n | cut -f2- |
awk '$4!="DEBUG"{next} $3!=prev{close(out); out="out"(++i)".txt"; prev=$3} {print > out}'
Note that with the above, only sort has to handle all of the input at once, and it is designed to use demand paging, etc. to handle extremely large amounts of input. The awk script processes one line at a time, keeps almost nothing in memory, and has only one output file open at a time, so this approach is far more likely to succeed for large files than one that stores a lot in memory in awk, or has many output files open concurrently.
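For reference, with the sample input the decorate step should produce this stream for sort (line number, tab, original line), where the key lands in sort field 4:
$ awk -v OFS='\t' '{print NR, $0}' in.txt
1	[ uid 28fd4583833] DEBUG web.Action
2	[ uid 39fd5697944] DEBUG test.Action
3	[ uid 56866969445] DEBUG test2.Action
4	[ uid 76696944556] INFO test4.Action
5	[ uid 39fd5697944] DEBUG test7.Action
6	[ uid 85483e10256] DEBUG testing.Action
After sorting, cut -f2- strips the leading number to restore the original lines.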

If your file format is consistent as you show, you can just do:
awk '
$4!="DEBUG" { next }
!f[$3] { f[$3]=++i }
{ print > ("out" f[$3] ".txt") }
' in.txt

1st solution: using GNU awk, try the following single awk program, which uses GNU awk's PROCINFO["sorted_in"] feature.
awk '
BEGIN{
PROCINFO["sorted_in"] = "#ind_num_asc"
}
!/DEBUG/{ next }
match($0,/uid [a-zA-Z0-9]+/){
ind=substr($0,RSTART,RLENGTH)
arr[ind]=(arr[ind]?arr[ind] ORS:"") $0
}
END{
for(i in arr){
outputFile=("out"++count".txt")
print arr[i] > (outputFile)
close(outputFile)
}
}
' Input_file
2nd solution: works with any awk; with your shown samples, please try the following. Replace Input_file with your actual file's name. GNU sort is used here with the -s option to keep the input order stable while sorting.
awk '
!/DEBUG/{ next }
match($0,/uid [0-9a-zA-Z]+/){
print substr($0,RSTART,RLENGTH)";"$0
}' Input_file |
sort -sk2n |
cut -d';' -f2- |
awk '
match($0,/uid [0-9a-zA-Z]+/){
if(prev!=substr($0,RSTART,RLENGTH)){
count++
close(outputFile)
}
outputFile="out"count".txt"
print > (outputFile)
prev=substr($0,RSTART,RLENGTH)
}
'
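For reference, the first awk stage above should emit this decorated stream for the sample input (key, semicolon, original line) before sort and cut:
uid 28fd4583833;[ uid 28fd4583833] DEBUG web.Action
uid 39fd5697944;[ uid 39fd5697944] DEBUG test.Action
uid 56866969445;[ uid 56866969445] DEBUG test2.Action
uid 39fd5697944;[ uid 39fd5697944] DEBUG test7.Action
uid 85483e10256;[ uid 85483e10256] DEBUG testing.Action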
1st solution's explanation: a detailed, commented version of the 1st solution:
awk ' ##Start the awk program from here.
BEGIN{ ##Start the BEGIN section.
PROCINFO["sorted_in"] = "#ind_str_asc" ##Make for(... in ...) traverse array indices in ascending string order (the indices here are strings like "uid 28fd4583833").
}
!/DEBUG/{ next } ##If a line does not contain DEBUG, skip to the next line.
match($0,/uid [a-zA-Z0-9]+/){ ##Use the match function to find "uid", a space, and an alphanumeric key.
ind=substr($0,RSTART,RLENGTH) ##Save the matched substring in ind.
arr[ind]=(arr[ind]?arr[ind] ORS:"") $0 ##Append the current line to arr[ind], ORS-separated.
}
END{ ##Start the END block.
for(i in arr){ ##Traverse the array arr.
outputFile=("out"++count".txt") ##Build the output file name as the OP requires.
print arr[i] > (outputFile) ##Print the current array element to outputFile.
close(outputFile) ##Close the output file to avoid a "too many open files" error.
}
}
' Input_file ##Mention the input file name here.
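As a minimal side illustration of how PROCINFO["sorted_in"] controls for(... in ...) traversal order (using two sample keys from the input):
$ gawk 'BEGIN{PROCINFO["sorted_in"]="#ind_str_asc"
arr["uid 39fd5697944"]; arr["uid 28fd4583833"]
for(i in arr) print i}'
uid 28fd4583833
uid 39fd5697944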

A relatively portable awk-based solution with these highlights:
output rows do not truncate the leading double space
output filenames follow the stabilized input row order without the need to pre-sort rows, post-sort rows, or use GNU gawk-specific features
tested and confirmed working on:
gawk 5.1.1, including the -ce flag,
mawk 1.3.4,
mawk 1.9.9.6, and
macOS nawk 20200816
————————————————————————————————
# gawk profile, created Thu May 19 12:10:56 2022
BEGIN {
____ = "test_72297811_" # opt. filename prefix
OFS = FS = "^ [[] uid "
_+=_ = gsub("\\^|[[][]]", _, OFS)
_*= _--
} NF *= / DEBUG / {
print >> (__[___ = substr($NF,_~_,_)] ?__[___]:\
__[___]= ____ "out" length(__) ".txt" )
} END {
for (_ in __) { close(__[_]) } }
————————————————————————————————
==> test_72297811_out1.txt <==
[ uid 28fd4583833] DEBUG web.Action
==> test_72297811_out2.txt <==
[ uid 39fd5697944] DEBUG test.Action
[ uid 39fd5697944] DEBUG test7.Action
==> test_72297811_out3.txt <==
[ uid 56866969445] DEBUG test2.Action
==> test_72297811_out4.txt <==
[ uid 85483e10256] DEBUG testing.Action

Related

Removing lines which match a specific pattern from another file

I've got two files (I only show the beginning of each file):
patterns.txt
m64071_201130_104452/13
m64071_201130_104452/26
m64071_201130_104452/46
m64071_201130_104452/49
m64071_201130_104452/113
m64071_201130_104452/147
myfile.txt
>m64071_201130_104452/13/ccs
ACAGTCGAGCG
>m64071_201130_104452/16/ccs
ACAGTCGAGCG
>m64071_201130_104452/20/ccs
CAGTCGAGCGC
>m64071_201130_104452/22/ccs
CACACATCTCG
>m64071_201130_104452/26/ccs
TAGACAATGTA
I should get an output like this:
>m64071_201130_104452/13/ccs
ACAGTCGAGCG
>m64071_201130_104452/26/ccs
TAGACAATGTA
I want to create a new file containing the lines of myfile.txt that match the lines in patterns.txt. I need to keep the ACTG letters associated with the pattern in question. I use:
for i in $(cat patterns.txt); do
grep -A 1 $i myfile.txt; done > my_newfile.txt
It works, but creating the new file is very slow... The files I work on are pretty large, but not huge (14M for patterns.txt and 700M for myfile.txt).
I also tried grep -v, because I have another file which contains the patterns of myfile.txt that are not present in patterns.txt, but I hit the same slow-file-creation problem.
Any solution would be appreciated.
With your shown samples, please try the following, written and tested in GNU awk.
awk '
FNR==NR{
arr[$0]
next
}
/^>/{
found=0
match($0,/.*\//)
if((substr($0,RSTART+1,RLENGTH-2)) in arr){
print
found=1
}
next
}
found
' patterns.txt myfile.txt
Explanation: a detailed, commented version of the above.
awk ' ##Start the awk program from here.
FNR==NR{ ##This condition is TRUE while the first file, patterns.txt, is being read.
arr[$0] ##Create an array entry indexed by the current line.
next ##Skip all further statements for this line.
}
/^>/{ ##If the line starts with >, do the following.
found=0 ##Reset found.
match($0,/.*\//) ##Match from the start of the line through the last / (greedy).
if((substr($0,RSTART+1,RLENGTH-2)) in arr){ ##If the matched text, minus the leading > and trailing /, is present in arr:
print ##Print the current line.
found=1 ##Set found to 1.
}
next ##Skip all further statements for this line.
}
found ##Print the line if found is set (this prints the sequence line after a matched header).
' patterns.txt myfile.txt ##Mention the input file names here.
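To see what the match()/substr() pair extracts from a header line (a quick check):
$ echo '>m64071_201130_104452/13/ccs' |
awk '{match($0,/.*\//); print substr($0,RSTART+1,RLENGTH-2)}'
m64071_201130_104452/13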
Another awk:
$ awk -F/ ' # / delimiter
NR==FNR {
a[$1,$2] # hash patterns to a
next
}
{
if( tf=((substr($1,2),$2) in a) ) # if first part found in hash
print # output and store found result in var tf
if(getline && tf) # read next record and if previous record was found
print # output
}' patterns myfile
Output:
>m64071_201130_104452/13/ccs
ACAGTCGAGCG
>m64071_201130_104452/26/ccs
TAGACAATGTA
Edit: To output the ones not found:
$ awk -F/ ' # / delimiter
NR==FNR {
a[$1,$2] # hash patterns to a
next
}
{
if( tf=((substr($1,2),$2) in a) ) { # if first part found in hash
getline # consume the next record too
next
}
print # otherwise output
}' patterns myfile
Output:
>m64071_201130_104452/16/ccs
ACAGTCGAGCG
>m64071_201130_104452/20/ccs
CAGTCGAGCGC
>m64071_201130_104452/22/ccs
CACACATCTCG

How to fetch a particular string using a sed command

I have an input string like below:
VAL:1|b:2|c:3|VAL:<har:919876543210#abc.com>; tag=vy6r5BpcvQ|VAl:1234|name:mnp|VAL:91987654321
Like this, there are more than 1000 rows.
I want to fetch the value of the first parameter, i.e., the a field and d field, but for the d field I want only har:919876543210#abc.com.
I tried like this:
cat $filename | grep -v Orig |sed -e 's/['a:','d:']//g' |awk -F'|' -v OFS=',' '{print $1 "," $4}' >> $NGW_DATA_FILE
The output I got is below:
1,<har919876543210#abc.com>; tag=vy6r5BpcvQ
I want it like this,
1,har:919876543210#abc.com
Where did I make the mistake and how do I solve it?
EDIT: As per the OP's change of input and the OP's comments, adding the following now.
awk '
BEGIN{ FS="|"; OFS="," }
{
sub(/[^:]*:/,"",$1)
gsub(/^[^<]*|; .*/,"",$4)
gsub(/^<|>$/,"",$4)
print $1,$4
}' Input_file
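Running that against the shown sample line should print (using a here-string for a quick check):
$ awk 'BEGIN{FS="|";OFS=","}{sub(/[^:]*:/,"",$1); gsub(/^[^<]*|; .*/,"",$4); gsub(/^<|>$/,"",$4); print $1,$4}' <<< 'VAL:1|b:2|c:3|VAL:<har:919876543210#abc.com>; tag=vy6r5BpcvQ|VAl:1234|name:mnp|VAL:91987654321'
1,har:919876543210#abc.com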
With the shown samples, please try the following, written and tested with the shown samples in GNU awk.
awk '
BEGIN{
FS="|"
OFS=","
}
{
val=""
for(i=1;i<=NF;i++){
split($i,arr,":")
if(arr[1]=="a" || arr[1]=="d"){
gsub(/^[^:]*:|; .*/,"",$i)
gsub(/^<|>$/,"",$i)
val=(val?val OFS:"")$i
}
}
print val
}
' Input_file
Explanation: a detailed, commented version of the above.
awk ' ##Start the awk program from here.
BEGIN{ ##Start the BEGIN section.
FS="|" ##Set FS to a pipe.
OFS="," ##Set OFS to a comma.
}
{
val="" ##Empty val (to avoid carrying a value over from the previous line).
for(i=1;i<=NF;i++){ ##Traverse all fields.
split($i,arr,":") ##Split the current field into arr on :
if(arr[1]=="a" || arr[1]=="d"){ ##If the first element of arr is either a OR d:
gsub(/^[^:]*:|; .*/,"",$i) ##Strip everything up to the first colon, and everything from "; " onward, from $i.
val=(val?val OFS:"")$i ##Append the current field value to val, OFS-separated.
}
}
print val ##Print val.
}
' Input_file ##Mention the input file name here.
You may also try this AWK script:
cat file
VAL:1|b:2|c:3|VAL:<har:919876543210#abc.com>; tag=vy6r5BpcvQ|VAl:1234|name:mnp|VAL:91987654321
awk -F '[|;]' '{
s=""
for (i=1; i<=NF; ++i)
if ($i ~ /^VAL:/) {
gsub(/^[^:]+:|[<>]*/, "", $i)
s = (s == "" ? "" : s "," ) $i
}
print s
}' file
1,har:919876543210#abc.com
You can do the same thing with sed rather easily using Extended Regex, two capture groups and two back-references, e.g.
sed -E 's/^[^:]*:(\w+)[^<]*[<]([^>]+).*$/\1,\2/'
Explanation
's/find/replace/' standard substitution, where the find is;
^[^:]*: from the beginning skip through the first ':', then
(\w+) capture one or more word characters ([a-zA-Z0-9_]), then
[^<]*[<] consume zero or more characters not a '<', then the '<', then
([^>]+) capture everything not a '>', and
.*$ discard all remaining chars in line, then the replace is
\1,\2 reinsert the captured groups separated by a comma.
Example Use/Output
$ echo 'a:1|b:2|c:3|d:<har:919876543210#abc.com>; tag=vy6r5BpcvQ|' |
sed -E 's/^[^:]*:(\w+)[^<]*[<]([^>]+).*$/\1,\2/'
1,har:919876543210#abc.com

Extract sequence from list of data into separate line

sample.txt has tab-separated columns, and the third column holds semicolon-separated values in which any dash-separated number sequence needs to be expanded into one value per line.
cat sample.txt
2 2627 588;577
2 2629 566
2 2685 568-564
2 2771 573
2 2773 597
2 2779 533
2 2799 558
2 6919 726;740-742;777
2 7295 761;771-772
Please note that some lines may have an inverted sequence, such as 568-564.
Using a previous script, I managed to split on the semicolons, but failed to expand the dash-separated sequences:
#!/bin/sh
awk -F"\t" '{print $1}' $1 >> $2 &&
awk -F"\t" '{print $2}' $1 >> $2 &&
awk -F"\t" '{print $3}' $1 >> $2 &&
sed -i "s/^M//;s/;\r//g" $2
#!/bin/awk -f
BEGIN { FS=";"; recNr=1}
!NF { ++recNr; lineNr=0; next }
{ ++lineNr }
lineNr == 1 { next }
recNr == 1 { a[lineNr] = $0 }
recNr == 2 { b[lineNr] = $0 }
recNr == 3 {
for (i=1; i<=NF; i++) {
print a[lineNr] "," b[lineNr] "," $i
}
}
Expected
2,2627,588
2,2627,577
2,2629,566
2,2685,564
2,2685,565
2,2685,566
2,2685,567
2,2685,568
2,2771,573
2,2773,597
2,2779,533
2,2799,558
2,6919,726
2,6919,740
2,6919,741
2,6919,742
2,6919,777
2,7295,761
2,7295,771
2,7295,772
Could you please try the following (detailed explanation added below).
awk '
BEGIN{
OFS=","
}
{
num=split($NF,array,";")
for(i=1;i<=num;i++){
if(array[i]~/-/){
split(array[i],array2,"-")
to=array2[1]>array2[2]?array2[1]:array2[2]
from=array2[1]<array2[2]?array2[1]:array2[2]
while(from<=to){
print $1,$2,from++
}
}
else{
print $1,$2,array[i]
}
from=to=""
}
}
' Input_file
Explanation: a detailed, commented version of the above code.
awk ' ##Start the awk program from here.
BEGIN{ ##Start the BEGIN section.
OFS="," ##Set OFS to a comma.
}
{
num=split($NF,array,";") ##Split the last field of the line into array on semicolons.
for(i=1;i<=num;i++){ ##Loop from 1 to num, the number of elements created in the previous step.
if(array[i]~/-/){ ##If the element with index i contains a dash, do the following.
split(array[i],array2,"-") ##Split it into array2 on the dash.
to=array2[1]>array2[2]?array2[1]:array2[2] ##to gets the larger of the two elements of array2.
from=array2[1]<array2[2]?array2[1]:array2[2] ##from gets the smaller of the two.
while(from<=to){ ##Loop from the value of from up to the value of to.
print $1,$2,from++ ##Print the 1st and 2nd fields with the value of from, then increment from.
}
}
else{ ##The else part of the if condition.
print $1,$2,array[i] ##Print the 1st and 2nd fields along with the element itself.
}
from=to="" ##Empty the variables from and to.
}
}
' Input_file ##Mention the input file name here.
A link explaining the ?: conditional expression, as suggested in James' comments:
https://www.gnu.org/software/gawk/manual/html_node/Conditional-Exp.html
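As a small illustration, the ternary min/max normalization handles an inverted range like 568-564 as follows (a quick check):
$ awk 'BEGIN{split("568-564",array2,"-")
to=array2[1]>array2[2]?array2[1]:array2[2]
from=array2[1]<array2[2]?array2[1]:array2[2]
print from, to}'
564 568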
For the shown sample, the output will be as follows.
2,2627,588
2,2627,577
2,2629,566
2,2685,564
2,2685,565
2,2685,566
2,2685,567
2,2685,568
2,2771,573
2,2773,597
2,2779,533
2,2799,558
2,6919,726
2,6919,740
2,6919,741
2,6919,742
2,6919,777
2,7295,761
2,7295,771
2,7295,772
$ awk '
BEGIN {
FS="( +|;)" # input field separator is space or ;
OFS="," # output fs is comma
}
{
for(i=3;i<=NF;i++) { # from the 3rd field to the end
n=split($i,t,"-") # split on - if any. below loop from smaller to greater
if(n) # in case of empty fields
for(j=(t[1]<t[n]?t[1]:t[n]); j<=(t[1]<t[n]?t[n]:t[1]);j++)
print $1,$2,j # output
}
}' file
Output
2,2627,588
2,2627,577
2,2629,566
2,2685,564 <─┐
2,2685,565 │
2,2685,566 ├─ inverted input range 568-564, printed from smaller to greater
2,2685,567 │
2,2685,568 <─┘
2,2771,573
2,2773,597
2,2779,533
2,2799,558
2,6919,726
2,6919,740
2,6919,741
2,6919,742
2,6919,777
2,7295,761
2,7295,771
2,7295,772
Tested on GNU awk, mawk, Busybox awk and awk version 20121220.

How to remove dupes within lines of delimited text

What's a smart and easy way to remove dupes (not necessarily consecutive) within delimited items on a line?
BEFORE:
apple,banana,apple,cherry,cherry
delta,epsilon,delta,epsilon
apple pie,delta,delta
AFTER:
apple,banana,cherry
delta,epsilon
apple pie,delta
Should work on a Mac. Allow unicode. Any shell method/language/command. Dupes are not necessarily consecutive.
Note: this question is a variation of How to remove dupes from blocks of text -- which is for blocks of text separated with blank lines.
awk -F, '{ for(i=1;i<=NF;i++) if( split($0,t,$i)>2 ) sub($i",","") }1' file
banana,apple,cherry
delta,epsilon
apple pie,delta
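The trick in the awk version above is that split($0, t, $i) uses the field's value as the separator, so its return value is one more than the number of occurrences of $i in the line; a quick check:
$ echo 'apple,banana,apple,cherry,cherry' | awk '{print split($0,t,"apple")}'
3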
sed version (using GNU sed's -r for extended regex):
sed -r 's/(.+)(.*),\1/\1\2,/g;s/,$//' file
apple,banana,cherry
delta,epsilon
apple pie,delta
Just Code.
$ awk 'BEGIN { FS=OFS="," }
{
delete seen
sep=""
for (i=1;i<=NF;i++) {
if (!seen[$i]++) {
printf "%s%s", sep, $i
sep = OFS
}
}
print ""
}' file
apple,banana,cherry
delta,epsilon
apple pie,delta

awk Joining n fields with delimiter

How can I use awk to join various fields, given that I don't know how many of them I have? For example, given the input string
aaa/bbb/ccc/ddd/eee
I use -F'/' as the delimiter, do some manipulation on aaa, bbb, ccc, ddd, eee (altering, removing...), and I want to join it back to print something like
AAA/bbb/ddd/e
Thanks
... given that I don't know how many of them I have?
Ah, but you do know how many you have. Or you will soon, if you keep reading :-)
Before giving you a record to process, awk will set the NF variable to the number of fields in that record, and you can use for loops to process them (comments aren't part of the script, I've just put them there to explain):
$ echo pax/is/a/love/god | awk -F/ '{
gsub (/god/,"dog",$5); # pax,is,a,love,dog
$4 = ""; # pax,is,a,,dog
$6 = $5; # pax,is,a,,dog,dog
$5 = "rabid"; # pax,is,a,,rabid,dog
printf $1; # output "pax"
for (i = 2; i <= NF; i++) { # output ".<field>"
if ($i != "") { # but only for non-blank fields (skip $4)
printf "."$i;
}
}
printf "\n"; # finish line
}'
pax.is.a.rabid.dog
This shows manipulation of the values, as well as insertion and deletion.
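If you just want to see NF by itself for the example string, a quick check:
$ echo aaa/bbb/ccc/ddd/eee | awk -F/ '{print NF}'
5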
The following will show you how to process each field and do some example manipulations on them.
The only caveat of using the output field separator OFS is that "deleted" fields will still have delimiters as shown in the output below; however it makes the code much simpler if you can live with that.
awk '
BEGIN{FS=OFS="/"}
{
for(i=1;i<=NF;i++){
if($i == "aaa")
$i=toupper($i)
else if($i ~ /c/)
$i=""
else if($i ~ /^eee$/)
$i="e"
}
}1' <<<'aaa/bbb/ccc/ddd/eee'
Output
AAA/bbb//ddd/e
This might work for you:
echo "aaa/bbb/ccc/ddd/eee" |
awk 'BEGIN{FS=OFS="/"}{sub(/../,"",$4);NF=4;print}'
aaa/bbb/ccc/d
To delete fields not at the end use a function to shuffle the values:
echo "aaa/bbb/ccc/ddd/eee" |
awk 'func d(n){for(x=n;x<=NF-1;x++){y=x+1;$x=$y}NF--};BEGIN{FS=OFS="/"}{d(2);print}'
aaa/ccc/ddd/eee
Deletes the second field.
awk -F'/' '{
# First add the fields to an array:
for (i=1; i<=NF; i++) { a[i] = $i }
# Now manipulate your elements in the array
# (you can delete one with: delete a[index]),
# then finally print them:
n = asorti(a, dest)                      # GNU awk: sorted indices into dest
output = ""
for (i=1; i<=n; i++) { output = output a[dest[i]] "/" }
print gensub("/$", "", "g", output)      # strip the trailing slash (GNU awk)
delete a                                 # reset for the next record
}' INPUTFILE
Doing it this way you can delete elements as well; note that deleting an item is done with delete array[index].
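For example (a sketch; GNU awk for asorti/gensub, and deleting field 3 is just a hypothetical manipulation):
$ echo 'aaa/bbb/ccc/ddd/eee' |
gawk -F'/' '{
for (i=1; i<=NF; i++) a[i] = $i
delete a[3]                       # hypothetical manipulation: drop the third field
n = asorti(a, dest)               # sorted indices of the surviving elements
output = ""
for (i=1; i<=n; i++) output = output a[dest[i]] "/"
print gensub("/$","","g",output)  # strip the trailing slash
}'
aaa/bbb/ddd/eee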