How to sort file contents by paragraph - awk

I have a text file containing the lines below.
Number: "472"
displayname: "jack holi"
Number: "392"
displayname: "david"
Number: "376"
displayname: "joly"
Number: "481"
displayname: "john doe"
....
How do I sort them in ascending order by number to get output like the following?
Number: "376"
displayname: "joly"
Number: "392"
displayname: "david"
Number: "472"
displayname: "jack holi"
Number: "481"
displayname: "john doe"

If you are still looking for an awk solution (GNU awk, because of the array sorting), you can use this script:
script.awk
BEGIN {
    ORS = RS = "\n\n"
    FS = "[\n:]"
    PROCINFO["sorted_in"] = "#ind_num_asc"
}
{
    gsub(/"/, "", $2)
    so[$2 + 0] = $0
}
END { for (k in so) print so[k] }
Use it like this: awk -f script.awk yourfile
Explanation
The record separator RS is set to two newlines, so that Number and displayname become members of the same record.
The field separator FS is set to newline or :, so the labels and their values land in fields $1 through $4: Number in $1, its value in $2, displayname in $3, and its value in $4.
Each record is stored in the array so under the numeric key taken from $2; so is then traversed in ascending numeric index order (#ind_num_asc).
Everything is printed only at the end, in sorted order.
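As a side note, RS="" (awk's true paragraph mode) is slightly more forgiving than RS="\n\n": it skips leading blank lines and treats runs of blank lines as a single separator. A minimal variant of the same script using it (GNU awk as above):
awk 'BEGIN { RS = ""; ORS = "\n\n"; FS = "[\n:]"
             PROCINFO["sorted_in"] = "#ind_num_asc" }
     { gsub(/"/, "", $2); so[$2 + 0] = $0 }    # strip quotes, key on the number
     END { for (k in so) print so[k] }' yourfile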

Perl to the rescue!
perl -e 'BEGIN { $/ = "" }
print for map $_->[1],
sort { $a->[0] <=> $b->[0] }
map [ /Number: "(\d+)"/, $_ ],
<>;' -- input.txt
The BEGIN block turns on paragraph mode: the diamond operator reads the file in paragraphs, i.e. blocks of text separated by empty lines.
It uses a Schwartzian Transform: it maps each block to a [number, block] pair, sorts the pairs by the numbers, and maps them back to the blocks, now in the correct order.
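The same decorate/sort/undecorate idea also works as a plain shell pipeline with any awk and sort. A minimal sketch, assuming blank-line-separated records as above and no tab characters in the data:
awk -F'"' '
    /^$/       { next }                          # skip the blank separator lines
    /^Number:/ { num = $2; line1 = $0; next }    # decorate: remember the number
               { print num "\t" line1 "\t" $0 }  # one tab-joined line per record
' input.txt |
sort -n |                                        # sort by the leading number
awk -F'\t' '{ print $2 ORS $3 ORS }'             # undecorate back into paragraphs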

Here's a slightly different take... read two lines at a time from your input file with GNU Parallel and put them together on a single line, sort them, then split the lines up again:
parallel -L2 -ra input.txt echo | sort -n | perl -pe 's/" /"\n/; $_.="\n"'
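If GNU Parallel isn't available, paste can do the same two-lines-into-one packing; a sketch assuming tab-free input, with grep . dropping any blank separator lines first:
grep . input.txt | paste - - | sort -t'"' -k2,2n | tr '\t' '\n'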

For earlier versions of gawk that don't have PROCINFO for array scanning order, you can do:
awk 'function cmp(i1, v1, i2, v2) {
         return (i1 - i2)
     }
     BEGIN { ORS = RS = "\n\n" }
     {
         s = $2
         gsub(/"/, "", s)
         arr[s] = $0
     }
     END {
         asorti(arr, so, "cmp")
         for (k in so)
             print arr[so[k]]
     }' file

Related

Separate lines with keys and store in different files

How can I take every DEBUG line in a text file, group the lines by their hexadecimal key (the key appears in the format "[ uid key]"), and store each group in a different file? That is, any line that is not DEBUG should be ignored.
in.txt:
[ uid 28fd4583833] DEBUG web.Action
[ uid 39fd5697944] DEBUG test.Action
[ uid 56866969445] DEBUG test2.Action
[ uid 76696944556] INFO test4.Action
[ uid 39fd5697944] DEBUG test7.Action
[ uid 85483e10256] DEBUG testing.Action
The output files are named as "out" + i + ".txt", where i = 1, 2, 3, 4.
i.e.
out1.txt:
[ uid 28fd4583833] DEBUG web.Action
out2.txt:
[ uid 39fd5697944] DEBUG test.Action
[ uid 39fd5697944] DEBUG test7.Action
out3.txt:
[ uid 56866969445] DEBUG test2.Action
out4.txt:
[ uid 85483e10256] DEBUG testing.Action
I tried:
awk 'match($0, /uid ([^]]+)/, a) && /DEBUG/ {print > (a[1] ".txt")}' in.txt
If you are willing to change the output file names to include the keys (frankly, this seems more useful than a one-up counter in the names), you can do:
awk '/DEBUG/{print > ("out-" $3 ".txt")}' FS='[][ ]*' in.txt
This will put all lines that match the string DEBUG with key 85483e10256 into the file out-85483e10256.txt, etc.
If you do want to keep the one-up counter, you could do:
awk '/DEBUG/{if( ! a[$3] ) a[$3] = ++counter;
print > ("out" a[$3] ".txt")}' FS='[][ ]*' in.txt
Basically, the idea is to use the regex [][ ]* as the field separator, which matches a run of square brackets or spaces. This way, $1 is the text preceding the initial [, $2 is the string uid, and $3 is the key. This will (should!) correctly extract the key even from lines with slightly different whitespace. The associative array remembers which keys have already been seen and assigns each new key its one-up counter value. But it really is cleaner just to use the key in the output file name.
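To see how that splitting plays out on a sample line, here is a quick field dump (purely illustrative):
$ echo '[ uid 28fd4583833] DEBUG web.Action' |
    awk -F'[][ ]*' '{ for (i = 1; i <= NF; i++) printf "$%d = <%s>\n", i, $i }'
$1 = <>
$2 = <uid>
$3 = <28fd4583833>
$4 = <DEBUG>
$5 = <web.Action>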
Using GNU sort for -s (to guarantee retaining input line order for every key value) and any awk:
$ sort -sk3,3 in.txt |
awk '$4!="DEBUG"{next} $3!=prev{close(out); out="out"(++i)".txt"; prev=$3} {print > out}'
$ head out*.txt
==> out1.txt <==
[ uid 28fd4583833] DEBUG web.Action
==> out2.txt <==
[ uid 39fd5697944] DEBUG test.Action
[ uid 39fd5697944] DEBUG test7.Action
==> out3.txt <==
[ uid 56866969445] DEBUG test2.Action
==> out4.txt <==
[ uid 85483e10256] DEBUG testing.Action
If you don't have GNU sort then you can apply the DSU (Decorate/Sort/Undecorate) idiom using any sort:
$ awk -v OFS='\t' '{print NR, $0}' in.txt | sort -k4,4 -k1,1n | cut -f2- |
awk '$4!="DEBUG"{next} $3!=prev{close(out); out="out"(++i)".txt"; prev=$3} {print > out}'
Note that with the above, only sort has to handle all of the input in memory, and it is designed to use demand paging etc. to cope with extremely large amounts of input. The awk script processes one line at a time, keeps almost nothing in memory, and has only one output file open at a time, so this approach is far more likely to succeed for large files than one that stores a lot in memory inside awk or keeps many output files open concurrently.
If your file format is consistent as you show, you can just do:
awk '
$4!="DEBUG" { next }
!f[$3] { f[$3]=++i }
{ print > ("out" f[$3] ".txt") }
' in.txt
1st solution: Using GNU awk, try the following single awk program, which uses GNU awk's PROCINFO["sorted_in"] mechanism.
awk '
BEGIN {
    PROCINFO["sorted_in"] = "#ind_num_asc"
}
!/DEBUG/ { next }
match($0, /uid [a-zA-Z0-9]+/) {
    ind = substr($0, RSTART, RLENGTH)
    arr[ind] = (arr[ind] ? arr[ind] ORS : "") $0
}
END {
    for (i in arr) {
        outputFile = ("out" ++count ".txt")
        print arr[i] > (outputFile)
        close(outputFile)
    }
}
' Input_file
2nd solution: with any awk (plus GNU sort, whose -s option keeps the order stable while sorting), please try the following with your shown samples. Change Input_file to your actual file's name.
awk '
!/DEBUG/ { next }
match($0, /uid [0-9a-zA-Z]+/) {
    print substr($0, RSTART, RLENGTH) ";" $0
}' Input_file |
sort -sk2n |
cut -d';' -f2- |
awk '
match($0, /uid [0-9a-zA-Z]+/) {
    if (prev != substr($0, RSTART, RLENGTH)) {
        count++
        close(outputFile)
    }
    outputFile = "out" count ".txt"
    print > (outputFile)
    prev = substr($0, RSTART, RLENGTH)
}
'
1st solution's explanation: here is a detailed, line-by-line commentary on the 1st solution:
awk '                                           ##Start the awk program here.
BEGIN {                                         ##Start the BEGIN section here.
    PROCINFO["sorted_in"] = "#ind_num_asc"      ##Make array traversal visit indices in ascending numeric order.
}
!/DEBUG/ { next }                               ##If a line does not contain DEBUG, skip to the next line.
match($0, /uid [a-zA-Z0-9]+/) {                 ##Match "uid", a space and the alphanumeric key.
    ind = substr($0, RSTART, RLENGTH)           ##ind holds the substring matched by match().
    arr[ind] = (arr[ind] ? arr[ind] ORS : "") $0  ##Append the current line to arr under index ind.
}
END {                                           ##Start the END block here.
    for (i in arr) {                            ##Traverse the array arr.
        outputFile = ("out" ++count ".txt")     ##Build the output file name as required.
        print arr[i] > (outputFile)             ##Print the current array element into outputFile.
        close(outputFile)                       ##Close the output file to avoid a "too many open files" error.
    }
}
' Input_file                                    ##Mention the Input_file name here.
A relatively portable awk-based solution with these highlights:
output rows keep their leading double space intact
output filenames follow stable input row order, with no need to pre-sort rows, post-sort rows, or use GNU gawk-specific features
tested and confirmed working on
gawk 5.1.1, including -ce flag,
mawk 1.3.4,
mawk 1.9.9.6, and
macOS nawk 20200816
————————————————————————————————
# gawk profile, created Thu May 19 12:10:56 2022
BEGIN {
____ = "test_72297811_" # opt. filename prefix
OFS = FS = "^ [[] uid "
_+=_ = gsub("\\^|[[][]]", _, OFS)
_*= _--
} NF *= / DEBUG / {
print >> (__[___ = substr($NF,_~_,_)] ?__[___]:\
__[___]= ____ "out" length(__) ".txt" )
} END {
for (_ in __) { close(__[_]) } }
————————————————————————————————
==> test_72297811_out1.txt <==
[ uid 28fd4583833] DEBUG web.Action
==> test_72297811_out2.txt <==
[ uid 39fd5697944] DEBUG test.Action
[ uid 39fd5697944] DEBUG test7.Action
==> test_72297811_out3.txt <==
[ uid 56866969445] DEBUG test2.Action
==> test_72297811_out4.txt <==
[ uid 85483e10256] DEBUG testing.Action

Awk printing each line in file separately

I am making a script that takes a list of zone records and values and puts them in a DNS server.
The formatting of the wanted output is just for Ansible; the problem is that I can't operate on each line separately with awk.
When I don't specify an NR, it prints all the items on the same line.
When I do specify an NR, it prints either nothing or only the specified line (i.e. with NR==1 it prints only the first line).
My objective is to iterate over all lines and print each one in the format I want, with a newline at the end of each line.
bash_script
#! /bin/bash
read -p "File: " file
zone=`awk 'NR==1 { print $2}' ${file}`
echo "${zone}:" >> /etc/ansible/roles/create_dns/defaults/main/${zone}.yml
lines=`wc -l < ${file}`
for line_num in $(seq 1 $lines)
do
echo $line_num
echo `awk 'NR==$line_num {print " - { zone: \x27" $2"\x27, record: \x27" $1"\x27, value: \x27" $3"\x27 }\\n"}' ${file}` >> /etc/ansible/roles/create_dns/defaults/main/${zone}.yml
done
$file
ansible ben.com 10.0.0.10
test ben.com 10.0.0.110
example ben.com 10.0.0.120
Wanted output:
ben.com:
- { zone: 'ben.com', record: 'ansible', value: '10.0.0.10' }
- { zone: 'ben.com', record: 'test', value: '10.0.0.110' }
- { zone: 'ben.com', record: 'example', value: '10.0.0.120' }
Output i get:
ben.com:
You can use this single awk for this:
read -p "File: " file
awk '{printf "\t- { zone: \047%s\047, record: \047%s\047, value: \047%s\047 }\n", $2, $1, $3 > $2}' "$file"
cat ben.com
- { zone: 'ben.com', record: 'ansible', value: '10.0.0.10' }
- { zone: 'ben.com', record: 'test', value: '10.0.0.110' }
- { zone: 'ben.com', record: 'example', value: '10.0.0.120' }
With your shown samples, please try the following solution. It is generic: you can list any number of column headings in the split call in the BEGIN section, and each heading is prefixed to the corresponding field/column value (e.g. zone, record, etc.). If you have fewer headings than fields/columns in the Input_file, adjust the loop condition from i<=NF as needed to control how many columns are fetched.
read -p "File: " file
awk -v s1='\047' 'BEGIN{OFS=", ";split("zone:,record:,value:",headings,",")} {for(i=1;i<=NF;i++){$i=headings[i]" " s1 $i s1};$0=" - { " $0" }"} 1' "$file"
Here is the same solution in non-one-liner form:
awk -v s1="'" '
BEGIN {
    OFS = ", "
    split("zone:,record:,value:", headings, ",")
}
{
    for (i = 1; i <= NF; i++) {
        $i = headings[i] " " s1 $i s1
    }
    $0 = " - { " $0 " }"
}
1
' "$file"

Getting awk to print a line with a keyword, but only within a range

I am using FreeBSD's geom command to gather information about partitions on my storage devices and filter it using awk. Specifically, I'm trying to extract two lines from the Providers section of the output: Mediasize, and type.
This is what the unfiltered output looks like:
$ geom part list da0
Geom name: da0
modified: false
state: OK
fwheads: 255
fwsectors: 63
last: 120845263
first: 40
entries: 128
scheme: GPT
Providers:
1. Name: da0p1
Mediasize: 61872754688 (58G)
Sectorsize: 512
Stripesize: 0
Stripeoffset: 20480
Mode: r0w0e0
efimedia: HD(1,GPT,1b5fe285-3be5-11ea-8179-b827ebb30e4e,0x28,0x733f3a8)
rawuuid: 1b5fe285-3be5-11ea-8179-b827ebb30e4e
rawtype: 516e7cb6-6ecf-11d6-8ff8-00022d09712b
label: (null)
length: 61872754688
offset: 20480
type: freebsd-ufs
index: 1
end: 120845263
start: 40
Consumers:
1. Name: da0
Mediasize: 61872793600 (58G)
Sectorsize: 512
Mode: r0w0e0
I can use this awk one-liner to get Mediasize and type, but it returns the Mediasize from both the Providers and the Consumers sections, since the search string appears in both:
$ geom part list da0 | awk '/Mediasize:/ { print $2 } /[ ]+type:/ { print $2 }'
61872754688
freebsd-ufs
61872793600
I can use this command to limit the output to only the lines that fall between Providers: and Consumers:
$ geom part list da0 | awk '/Providers:/,/Consumers:/'
Providers:
1. Name: da0p1
Mediasize: 61872754688 (58G)
Sectorsize: 512
Stripesize: 0
Stripeoffset: 20480
Mode: r0w0e0
efimedia: HD(1,GPT,1b5fe285-3be5-11ea-8179-b827ebb30e4e,0x28,0x733f3a8)
rawuuid: 1b5fe285-3be5-11ea-8179-b827ebb30e4e
rawtype: 516e7cb6-6ecf-11d6-8ff8-00022d09712b
label: (null)
length: 61872754688
offset: 20480
type: freebsd-ufs
index: 1
end: 120845263
start: 40
Consumers:
What I'm struggling with is how to combine the two into an awk one-liner, to print Mediasize:, but only from the Providers: section.
I've tried this, but it gives me errors:
$ geom part list da0 | awk '/Providers:/,/Consumers:/ { /Mediasize:/ { print $2 } /[ ]+type:/ { print $2 } }'
awk: syntax error at source line 1
context is
/Providers:/,/Consumers:/ { /Mediasize:/ >>> { <<<
awk: illegal statement at source line 1
awk: syntax error at source line 1
Piping the output of one awk program to another gets me what I want, but it seems like a kludge.
$ geom part list da0 | awk '/Providers:/,/Consumers:/' | awk '/Mediasize:/ { print $2 } /[ ]+type:/ { print $2 }'
61872754688
freebsd-ufs
Ideally, I'd like to get the output from a single awk one-liner.
Ways I can think of (ordered from most elegant to least elegant) include:
1) Somehow fixing awk '/Providers:/,/Consumers:/ { /Mediasize:/ { print $2 } /[ ]+type:/ { print $2 } }'
2) Exiting prematurely once the Consumers: keyword is encountered.
3) Using a flag to toggle printing off once the Consumers: keyword is encountered.
I can get #3 to work, with a flag and a ternary operator, but it seems less than elegant:
$ geom part list da0 | awk '/Mediasize:/ { print (++flag==1)?$2:"" } /[ ]type:/ { print (flag==1)?$2:"" }'
61872754688
freebsd-ufs
Any ideas on how I might get solution #1 or #2 to work, or perhaps another solution I am overlooking?
Untested:
/Mediasize/ { print $2 }
/type/ { print $2 }
/Consumers/ { exit }
You could use a flag, for example:
awk '/Providers/ {f=1; next} f && /Mediasize/ {print $2; f=0}'
This can be read as: after matching Providers, find the first Mediasize and print its second field.
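For what it's worth, the flag approach combines with the early-exit idea into a single call; a sketch along the same lines (untested against real geom output):
geom part list da0 | awk '/Providers:/ { f=1; next }
                          /Consumers:/ { exit }
                          f && /Mediasize:/         { print $2 }
                          f && /^[[:space:]]*type:/ { print $2 }'
The anchored type pattern avoids accidentally matching the rawtype: line.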
For those interested in the final outcome, I was able to put user448810's answer to work and get the output I wanted.
The command:
geom part list mmcsd0 | awk 'BEGIN { printf "{" } /Name/ { printf "%s\n \"%s\": { ", (++count==1)?"":",", $3 } /Mediasize/ { printf "\"size\": %s, ", $2 } / type:/ { printf "\"type\": \"%s\" }", $2 } /Consumers/ { exit } END { printf "\n}\n" }'
The output:
{
"mmcsd0s1": { "size": 52383744, "type": "fat32lba" },
"mmcsd0s2": { "size": 31052026368, "type": "freebsd" }
}
How beautiful!

Can't replace string to multi-lined string with sed

I'm trying to use sed to replace a fixed placeholder ("replaceMe") in a text with multi-line text.
My bash script goes as follows:
content=$(awk '{print $5}' < data.txt | sort | uniq)
target=$(cat install.sh)
text=$(sed "s/replaceMe/$content/" <<< "$target")
echo "${text}"
If content contains only one line, the replacement works, but if it contains several lines I get:
sed: ... unterminated `s' command
I have read about fetching multi-line content, but I couldn't find anything about inserting a multi-line string.
You'll have more problems than that depending on the contents of data.txt since sed doesn't understand literal strings (see Is it possible to escape regex metacharacters reliably with sed). Just use awk which does:
text="$( awk -v old='replaceMe' '
NR==FNR {
if ( !seen[$5]++ ) {
new = (NR>1 ? new ORS : "") $5
}
next
}
s = index($0,old) { $0 = substr($0,1,s-1) new substr($0,s+length(old)) }
{ print }
' data.txt install.sh )"

gsub for substituting translations not working

I have a dictionary dict with records separated by ":" and data fields by new lines, for example:
:one
1
:two
2
:three
3
:four
4
Now I want awk to substitute all occurrences of each record in the input
file, eg
onetwotwotwoone
two
threetwoone
four
My first awk script looked like this and works just fine:
BEGIN { RS = ":" ; FS = "\n" }
NR == FNR {
    rep[$1] = $2
    next
}
{
    for (key in rep)
        gsub(key, rep[key])
    print
}
giving me:
12221
2
321
4
Unfortunately, another dict file contains characters that are special in regular expressions, so I have to escape them in my script. But after moving key and rep[key] into variables (which can then have their escape characters substituted), the script only substitutes the second record of the dict. Why? And how do I solve it?
Here's the current second part of the script:
{
for (key in rep)
orig=key
trans=rep[key]
gsub(/[\]\[^$.*?+{}\\()|]/, "\\\\&", orig)
gsub(orig,trans)
print
}
All scripts are run by awk -f translate.awk dict input
Thanks in advance!
Your fundamental problem is using strings in regexp and backreference contexts when you don't want them, and then trying to escape the metacharacters in your strings to disable the behavior you are enabling by using them in those contexts. If you want strings, use them in string contexts; that's all.
You don't want this:
gsub(regexp,backreference-enabled-string)
You want something more like this:
index(...,string) substr(string)
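A tiny illustration of the difference between the two contexts, using a key that happens to contain a regexp metacharacter (illustrative only):
$ echo 'abc' | awk '{ print match($0, "a.c"), index($0, "a.c") }'
1 0
match() treats a.c as a regexp, so the . matches the b and it reports a hit at position 1; index() looks for the literal three characters a.c and finds nothing.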
I think this is what you're trying to do:
$ cat tst.awk
BEGIN { FS = ":" }
NR == FNR {
    if ( NR % 2 ) {
        key = $2
    }
    else {
        rep[key] = $0
    }
    next
}
{
    for ( key in rep ) {
        head = ""
        tail = $0
        while ( start = index(tail, key) ) {
            head = head substr(tail, 1, start-1) rep[key]
            tail = substr(tail, start+length(key))
        }
        $0 = head tail
    }
    print
}
$ awk -f tst.awk dict file
12221
2
321
4
Never mind, sorry for asking....
Just some missing curly braces...?!
{
    for (key in rep) {
        orig = key
        trans = rep[key]
        gsub(/[\]\[^$.*?+{}\\()|]/, "\\\\&", orig)
        gsub(orig, trans)
    }
    print
}
works like a charm.