awk/sed replace multiple newlines in the record except end of record - awk

I have a file where:
the field delimiter is \x01
the record delimiter is \n
Some records contain extra newlines that I need to remove; however, I don't want to remove the legitimate newline at the end of each record. I have tried this with awk:
awk -F '\x01' 'NF < 87 {getline s; $0 = $0 s} 1' infile > outfile
But this only works when a record contains a single extra newline (besides the end-of-record newline). It does not work when a record contains several.
Note: the record contains 87 fields.
What am I doing wrong here?
Example of file:
PL^ANov-21^A29-11-2021^A0^A00^A00^A0000000
test^A00000000
Test^A^A^A^A
PL^ANov-21^A29-11-2021^A0^A00^A00^A0000000
test^A00000000
Test^A^A^A^A
SL^ANov-21^A30-11-2021^AB^A0000^A1234567^A00000
test^A12102120^A00000^A00^A^A
NOTE: The file contains 11 fields; field separate \x01; record separator \n
Expected result:
PL^ANov-21^A29-11-2021^A0^A00^A00^A0000000test^A00000000 Test^A^A^A^A
PL^ANov-21^A29-11-2021^A0^A00^A00^A0000000test^A00000000 Test^A^A^A^A
SL^ANov-21^A30-11-2021^AB^A0000^A1234567^A00000test^A12102120^A00000^A00^A^A
Note: I need to preserve the field delimiter (\x01) and record delimiter (\n)
Thank you very much in advance for looking into this.
The file always contains 87 fields;
The field delimiter is '\x01', but when viewed in Linux it is represented as '^A'
Some lines contain newlines - I need to remove them, but I don't want to remove the legitimate newlines at the end of each line.
The newline appears twice in the first and second records and once in the third record - these are the newlines I want to remove.
In the examples/expected results there are 11 delimiters "x01" represented as "^A",
I expect to have 3 records and not 6, i.e.:
First record:
test^A00000000 should be joined to the previous line
Test^A^A^A^A should be joined to the first line as well
forming one record:
PL^ANov-21^A29-11-2021^A0^A00^A00^A0000000test^A00000000 Test^A^A^A^A
Second record
test^A00000000 should be joined to the previous line
Test^A^A^A^A should be joined to that previous line as well
forming one record:
PL^ANov-21^A29-11-2021^A0^A00^A00^A0000000test^A00000000 Test^A^A^A^A
Third record:
test^A12102120^A00000^A00^A^A should be joined to the previous line
forming one record:
SL^ANov-21^A30-11-2021^AB^A0000^A1234567^A00000test^A12102120^A00000^A00^A^A
Note:
The awk example provided above works when there is one unwanted newline in the record, but not when there are multiple newlines.
Thank you so very much. It works perfectly. Thank you for explaining it so well to me too.

This might work for you (GNU sed):
sed ':a;N;s/\x01/&/87;Ta;s/\n//g' file
Gather up lines until there are 87 separators, remove any newlines and print the result.

What's wrong with your attempt is that you concatenate two lines, print the result and move on to the next line. NF is then reset to the field count of that next line. As all your lines have fewer than 87 fields, the NF < 87 condition is useless: your script would work the same without it.
Try this awk script:
$ awk -F'\x01' -vn=87 -vi=0 '
{printf("%s", $0); i+=NF; if(i==n) {i=0; print "";} else i-=1;}' file
Here, we use the real \x01 field separator and the NF fields count. Variable i counts the number of already printed fields. We first print the current line without the trailing newline (printf("%s", $0)). Then we update our i fields counter. If it is equal to n we reset it and print a newline. Else we decrement it such that we do not count the last field of this line and the first of the next as 2 separate fields.
Demo with n=12 instead of 87 and your own input file (with \x01 field separators):
$ awk -F'\x01' -vn=12 -vi=0 '
{printf("%s", $0); i+=NF; if(i==n) {i=0; print "";} else i-=1;}' file |
sed 's/\x01/|/g'
PL|Nov-21|29-11-2021|0|00|00|0000000test|00000000 Test||||
PL|Nov-21|29-11-2021|0|00|00|0000000test|00000000 Test||||
SL|Nov-21|30-11-2021|B|0000|1234567|00000test|12102120|00000|00||
The sed command shows the result with the \x01 replaced by | for easier viewing.
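The same idea can also be sketched by buffering physical lines until the buffer holds a full record. This is only an illustration, not the answerer's script: the function name join_records and the 4-field sample are made up for the example, and the separator is written as the octal escape \001, which denotes the same byte as \x01.

```shell
# join_records N: glue physical lines together until N fields are collected.
join_records() {
  awk -F'\001' -v n="$1" '
    {
      buf = buf $0                        # append the line without its newline
      if (split(buf, parts, FS) >= n) {   # full record collected?
        print buf
        buf = ""
      }
    }
    END { if (buf != "") print buf }      # flush an incomplete last record
  '
}

# Hypothetical sample: 4 fields per record, the second record is broken over three lines.
result=$(printf 'a\001b\001c\001d\ne\001f\ng\001h\n\001i\n' | join_records 4 | tr '\001' '|')
printf '%s\n' "$result"
```

As with any field-counting approach, a stray newline inside the last field cannot be detected, because the record already looks complete at that point.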

Related

Bash command/script to split line on a certain character

I would like to split the below data to the expected output:
Raw Data:
931096|376601|1|ART|AT-2151780724|2151780724|2|102809198|I|CGM44I|MIL3VF03|52576377.3600|PENDING|MO|PEND-INFO|Pend ACS4R|N|N|N|N|N|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|N|NULL|NULL|N|system|NULL|NULL|52576377.3600|1317720|system|2020-02-13 02:00:42|0
931097|375789|1|AYT|AT-2151509210|2151509210|7|102614605|A|CTHGMI|OZF19|444006.6400|APPROVED|NULL|APPROVED|Approved|N|N|N|N|N|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|N|NULL|NULL|N|kg17718|NULL|NULL|0.0000|1317722|system|2020-02-13 02:00:43|0931098|375979|1|AHT|AT-2151780726|2151780726|2|102809199|I|CGMI|MILaesLF11|26312.0000|PENDING|MO|PEND-INFO|Pend ACRES|N|N|N|N|N|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|N|NULL|NULL|N|system|NULL|NULL|26312.0000|1317721|system|2020-02-13 02:00:43|0
931099|376572|1|AT|AT-2151399812|2151399812|5|102673999|I|CG2rMI|WEL44LF15|60991.6956|PENDING|MO|PEND-INFO|Pend ACERS|N|N|N|N|N|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|N|NULL|NULL|N|system|NULL|NULL|0.0000|1317723|system|2020-02-13 02:00:45|0
Expected Output:
931096|376601|1|ART|AT-2151780724|2151780724|2|102809198|I|CGM44I|MIL3VF03|52576377.3600|PENDING|MO|PEND-INFO|Pend ACS4R|N|N|N|N|N|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|N|NULL|NULL|N|system|NULL|NULL|52576377.3600|1317720|system|2020-02-13 02:00:42|0
931097|375789|1|AYT|AT-2151509210|2151509210|7|102614605|A|CTHGMI|OZF19|444006.6400|APPROVED|NULL|APPROVED|Approved|N|N|N|N|N|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|N|NULL|NULL|N|kg17718|NULL|NULL|0.0000|1317722|system|2020-02-13 02:00:43|0
931098|375979|1|AHT|AT-2151780726|2151780726|2|102809199|I|CGMI|MILaesLF11|26312.0000|PENDING|MO|PEND-INFO|Pend ACRES|N|N|N|N|N|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|N|NULL|NULL|N|system|NULL|NULL|26312.0000|1317721|system|2020-02-13 02:00:43|0
931099|376572|1|AT|AT-2151399812|2151399812|5|102673999|I|CG2rMI|WEL44LF15|60991.6956|PENDING|MO|PEND-INFO|Pend ACERS|N|N|N|N|N|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|N|NULL|NULL|N|system|NULL|NULL|0.0000|1317723|system|2020-02-13 02:00:45|0
Basically the \n character is getting lost sometimes in the data and the lines are getting merged. Sometimes more than 1 line gets merged as well (even the opposite happens but we can get to that later).
The data always has 43 columns | separated. The last but one column(42nd) always is a timestamp and the last column is usually 0 or 1.
Trying for the below approach:
If cols > 43
Split 44th column to add \n and print the remaining.
Repeat process until cols=43
echo "${curr}" | awk -F\| ' { if(NF > 43) {for(i=43;i<NF;i++) "sed '${NR}s/\(^0\)/\1\n/p' $i" }}' filename
less complex
awk 'BEGIN {FS=OFS="|"}
NF>43 {for(i=43;i<NF;i+=42) {t=$i; $i=substr(t,1,1) ORS substr(t,2)}}1' file
931096|376601|1|ART|AT-2151780724|2151780724|2|102809198|I|CGM44I|MIL3VF03|52576377.3600|PENDING|MO|PEND-INFO|Pend ACS4R|N|N|N|N|N|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|N|NULL|NULL|N|system|NULL|NULL|52576377.3600|1317720|system|2020-02-13 02:00:42|0
931097|375789|1|AYT|AT-2151509210|2151509210|7|102614605|A|CTHGMI|OZF19|444006.6400|APPROVED|NULL|APPROVED|Approved|N|N|N|N|N|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|N|NULL|NULL|N|kg17718|NULL|NULL|0.0000|1317722|system|2020-02-13 02:00:43|0
931098|375979|1|AHT|AT-2151780726|2151780726|2|102809199|I|CGMI|MILaesLF11|26312.0000|PENDING|MO|PEND-INFO|Pend ACRES|N|N|N|N|N|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|N|NULL|NULL|N|system|NULL|NULL|26312.0000|1317721|system|2020-02-13 02:00:43|0
931099|376572|1|AT|AT-2151399812|2151399812|5|102673999|I|CG2rMI|WEL44LF15|60991.6956|PENDING|MO|PEND-INFO|Pend ACERS|N|N|N|N|N|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|N|NULL|NULL|N|system|NULL|NULL|0.0000|1317723|system|2020-02-13 02:00:45|0
following your spec, slightly amended:
If cols > 43, split the 43rd column to add \n and print the remaining. Repeat the process until the end.
The usual way with sed: write a regex that matches 42 | characters with anything in between and a digit. Then insert a newline after the matched string.
sed 's/[0-9]\{6\}\(|[^|]*\)\{41\}|[0-9]/&\n/g ; s/\n$//'
#                                               ^^^^^^^ - remove the leftover newline
#                                       ^ - the matched string
#                                 ^^^^^ - trailing digit
#                                ^ - 42nd pipe character
#                ^^^^^^^^^^^^^^^^ - 41 fields with anything in between
#      ^^^^^^^^^^ - leading 6 digits
tested on repl
Or maybe match 42 pipes with anything in front and a digit:
sed 's/\([^|]*|\)\{42\}[0-9]/&\n/g ; s/\n$//'
Or match a character after 42 pipes and a digit and insert a newline in between:
sed 's/\(\([^|]*|\)\{42\}[0-9]\)\(.\)/\1\n\3/g'
Could you please try the following. It inserts a newline after every timestamp|0 pair, so it also handles lines in which more than two records were merged.
awk '
{
  # put a newline after every "timestamp|0" pair
  gsub(/[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}\|0/,"&\n")
  # drop the newline added after the last pair; print supplies its own
  sub(/\n$/,"")
}
1
' Input_file
You don't trust the source of the data. Maybe it will add another | and the number of columns is wrong.
Another approach is to assume that you can trust the timestamp field.
So try to split the line when the field after the timestamp has more than one character (and split after the first).
sed -E 's/([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}\|.)(.)/\1\n\2/g' file
This might work for you (GNU sed):
sed 's/[^|]*/\n&/44;s/\(|.\)\([^|]*|\)\n/\1\n\2/;P;D' file
If there is a 44th field, insert a newline before it. Then remove that newline and insert it following the first character of the 43rd field. Print the first line, delete the first line and repeat.
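For comparison, here is a field-counting sketch in plain awk. It is an illustration under assumptions, not one of the answers above: the function name split_records is made up, the sample uses 4 columns instead of 43, and the last column is assumed to be exactly one character long (the question says it is usually 0 or 1).

```shell
# split_records N: re-insert the lost newline after every N-th column.
split_records() {
  awk -v n="$1" 'BEGIN { FS = OFS = "|" }
  {
    out = ""; col = 0
    for (i = 1; i <= NF; i++) {
      col++
      if (col == n && i < NF) {
        # the last column is fused with the first column of the next record
        out = out substr($i, 1, 1) "\n" substr($i, 2)
        col = 1
      } else {
        out = out $i
      }
      if (i < NF) out = out OFS
    }
    print out
  }'
}

# Hypothetical sample: three records merged on the first line, one intact record on the second.
result=$(printf 'a|b|c|0d|e|f|1g|h|i|0\nx|y|z|0\n' | split_records 4)
printf '%s\n' "$result"
```

Counting columns instead of matching the timestamp means the sketch inherits the weakness mentioned above: one unexpected | in the data shifts every later split point on that line.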

Printing out a particular row based on condition in another row

Apologies if this is really basic stuff, but I just started with awk.
I have an input file that I'm piping into awk, like the one below. The format never changes.
name: Jim
gender: male
age: 40
name: Joe
gender: female
age: 36
name: frank
gender: Male
age: 40
I'm trying to list all names where age is 40
I can find them like so
awk '$2 == "40" {print $2 }'
but I can't figure out how to print the name.
Could you please try the following (I am driving right now, so I couldn't test it).
awk '/^age/{if($NF==40){print val};val="";next} /^name/{val=$0}' Input_file
Explanation: the /^name/ condition checks whether a line starts with name and, if so, stores that line in the variable val. The /^age/ condition checks whether a line starts with age; if that line's last field equals 40, it prints the value of val. In either case val is then reset and the remaining rules are skipped for that line.
With gnu awk, setting the Record Separator (RS) to nothing makes it work with blocks.
awk -v RS="" '/age: 40/ {print $2}' file
Jim
frank
Some shorter awk versions of suspectus and RavinderSingh13 post
awk '/^name/{n=$2} /^age/ && $NF==40 {print n}' file
awk '/^name/{n=$2} /^age: 40/ {print n}' file
Jim
frank
If line starts with name, store the name in n
If line starts with age and age is 40, print n
Awk knows the concepts of records and fields.
Files are split in records where consecutive records are split by the record separator RS. Each record is split in fields, where consecutive fields are split by the field separator FS.
By default, the record separator RS is set to be the <newline> character (\n) and thus each record is a line. The record separator has the following definition:
RS:
The first character of the string value of RS shall be the input record separator; a <newline> by default. If RS contains more than one character, the results are unspecified. If RS is null, then records are separated by sequences consisting of a <newline> plus one or more blank lines, leading or trailing blank lines shall not result in empty records at the beginning or end of the input, and a <newline> shall always be a field separator, no matter what the value of FS is.
So with the file format you give, we can define the records based on RS="".
Based on this, we can immediately list all records that contain the line age: 40
$ awk 'BEGIN{RS="";ORS="\n\n"}/age: 40/' file
There are a couple of problems with the above line:
What if we have a person who is 400 years old? That record will be listed too, because the line age: 400 also matches the pattern /age: 40/.
What if we have a record with a typo stating age:40 or age : 40
What if our record has a line stating wage: 40 USD/min
To solve most of these problems, it is easier to work with well-defined fields in the record and build the key-value-pairs per record:
key value
---------------
name => Jim
gender => male
age => 40
and then, we can use this to select the requested information:
$ awk 'BEGIN{RS="";FS="\n"}
# build the record
{ delete rec;
  for(i=1;i<=NF;++i) {
    # find the first ":" and select key and value as substrings
    j=index($i,":"); key=substr($i,1,j-1); value=substr($i,j+1)
    # remove potential spaces from front and back
    gsub(/^[[:blank:]]+|[[:blank:]]+$/,"",key)
    gsub(/^[[:blank:]]+|[[:blank:]]+$/,"",value)
    # store key-value pair
    rec[key] = value
  }
}
# select requested information and print
(rec["age"] == 40) { print rec["name"] }' file
This is not a one-liner, but it is robust. Furthermore, this method is fairly flexible and adaptable to make selections based on a more complex logic.
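The pitfalls listed above (age: 400, age:40, age : 40, wage: 40) can be checked with a small line-oriented variant. This is a sketch under assumptions, not part of the answer: the function name names_with_age and the sample records are made up, and unlike the record-based script it does not need blank lines between blocks.

```shell
# names_with_age AGE: print the name of every block whose age equals AGE.
names_with_age() {
  awk -v want="$1" -F':' '
    {
      key = $1
      val = substr($0, length($1) + 2)       # everything after the first ":"
      gsub(/^[ \t]+|[ \t]+$/, "", key)       # trim both ends
      gsub(/^[ \t]+|[ \t]+$/, "", val)
    }
    key == "name" { name = val }
    key == "age" && val ~ /^[0-9]+$/ && val + 0 == want + 0 { print name }
  '
}

# Hypothetical input covering the problem cases.
result=$(printf 'name: Jim\nage: 40\n\nname: Old\nage: 400\n\nname: Pay\nwage: 40 USD/min\nage : 41\n\nname: frank\nage:40\n' | names_with_age 40)
printf '%s\n' "$result"
```

Comparing the trimmed key for equality, instead of matching a regex against the whole record, is what keeps wage and 400 from matching.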
If you are not averse to using grep and the format is always the same:
cat filename | grep -B2 "age: 40" | grep -oP "(?<=name: ).*"
Jim
frank
awk -F':' '/^name/{name=$2} \
/^age/{if ($NF==40)print name}' input_file

gawk to create first column based on part of second column

I have a 2 column tsv that I need to insert a new first column using part of the value in column 2.
What I have:
fastq/D0110.L001_R1_001.fastq fastq/D0110.L001_R2_001.fastq
fastq/D0206.L001_R1_001.fastq fastq/D0206.L001_R2_001.fastq
fastq/D0208.L001_R1_001.fastq fastq/D0208.L001_R2_001.fastq
What I want:
D0110 fastq/D0110.L001_R1_001.fastq fastq/D0110.L001_R2_001.fastq
D0206 fastq/D0206.L001_R1_001.fastq fastq/D0206.L001_R2_001.fastq
D0208 fastq/D0208.L001_R1_001.fastq fastq/D0208.L001_R2_001.fastq
I want to pull everything between "fastq/" and the first period and print that as the new first column.
$ awk -F'[/.]' '{printf "%s\t%s\n",$2,$0}' file
D0110 fastq/D0110.L001_R1_001.fastq fastq/D0110.L001_R2_001.fastq
D0206 fastq/D0206.L001_R1_001.fastq fastq/D0206.L001_R2_001.fastq
D0208 fastq/D0208.L001_R1_001.fastq fastq/D0208.L001_R2_001.fastq
How it works
awk implicitly loops over all input lines.
-F'[/.]'
This tells awk to use any occurrence of / or . as a field separator. This means that, for your input, the string you are looking for will be the second field.
printf "%s\t%s\n",$2,$0
This tells awk to print the second field ($2), followed by a tab (\t), followed by the input line ($0), followed by a newline character (\n)
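Since the question asks for a value taken from column 2, here is a sketch that derives the new column from the second tab-separated field explicitly. The function name add_id_column is made up for the illustration, and it assumes the input really is tab-separated, as the question states.

```shell
# add_id_column: prepend the part of column 2 between the last "/" and the first "."
add_id_column() {
  awk 'BEGIN { FS = OFS = "\t" }
  {
    id = $2
    sub(/^.*\//, "", id)   # drop the directory part
    sub(/\..*$/, "", id)   # drop everything from the first period on
    print id, $0
  }'
}

result=$(printf 'fastq/D0110.L001_R1_001.fastq\tfastq/D0110.L001_R2_001.fastq\n' | add_id_column)
printf '%s\n' "$result"
```

Working on a copy of $2 leaves $0 untouched, so the original line is printed exactly as it was read.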

Print lines containing the same second field for more than 3 times in a text file

Here is what I am doing.
The text file is comma separated and has three fields,
and I want to extract all the lines whose second field occurs
more than three times.
Text file (filename is "text"):
11,keyword1,content1
4,keyword1,content3
5,keyword1,content2
6,keyword2,content5
6,keyword2,content5
7,keyword1,content4
8,keyword1,content2
1,keyword1,content2
My command is like below. cat the whole text file inside awk and grep with the second field of each line and count the number of the line.
If the number of the line is greater than 2, print the whole line.
The command:
awk -F "," '{ "cat text | grep "$2 " | wc -l" | getline var; if ( 2 < var ) print $0}' text
However, the command output contains only the first three consecutive lines,
instead of also printing the last three lines containing "keyword1", which occurs six times in the text.
Result:
11,keyword1,content1
4,keyword1,content3
5,keyword1,content2
My expected result:
11,keyword1,content1
4,keyword1,content3
5,keyword1,content2
7,keyword1,content4
8,keyword1,content2
1,keyword1,content2
Can somebody tell me what I am doing wrong?
It is relatively straight-forward to make just two passes over the file. In the first pass, you count the number of occurrences of each value in column 2. In the second pass, you print out the rows where the value in column 2 occurs more than your threshold value of 3 times.
awk -F, 'FNR == NR { count[$2]++ }
FNR != NR { if (count[$2] > 3) print }' text text
The first line of code handles the first pass; it counts the occurrences of each different value of the second column.
The second line of code handles the second pass; if the value in column 2 was counted more than 3 times, print the whole line.
This doesn't work if the input is only available on a pipe rather than as a file (so you can't make two passes over the data). Then you have to work much harder.
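One way to handle piped input is to buffer the lines and decide in an END block. This is a minimal sketch, assuming the whole input fits in memory; the function name filter_frequent is made up for the illustration.

```shell
# filter_frequent MIN: print lines whose second field occurs more than MIN times.
filter_frequent() {
  awk -F, -v min="$1" '
    { line[NR] = $0; key[NR] = $2; count[$2]++ }   # remember every line and count its key
    END {
      for (i = 1; i <= NR; i++)                    # replay the lines in input order
        if (count[key[i]] > min) print line[i]
    }'
}

result=$(printf '%s\n' 11,keyword1,content1 4,keyword1,content3 5,keyword1,content2 \
  6,keyword2,content5 6,keyword2,content5 7,keyword1,content4 \
  8,keyword1,content2 1,keyword1,content2 | filter_frequent 3)
printf '%s\n' "$result"
```

The trade-off against the two-pass version is memory: every input line is held in the line array until the end of the input.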

Awk: printing undetermined number of columns

I have a file that contains a number of fields separated by tab. I am trying to print all columns except the first one but want to print them all in only one column with AWK. The format of the file is
col 1 col 2 ... col n
There are at least 2 columns in one row.
Sample
2012029754 901749095
2012028240 901744459 258789
2012024782 901735922
2012026032 901738573 257784
2012027260 901742004
2003062290 901738925 257813 257822
2012026806 901741040
2012024252 901733947 257493
2012024365 901733700
2012030848 901751693 260720 260956 264843 264844
So I want to tell awk to print column 2 to column n for n greater than 2 without printing blank lines when there is no info in column n of that row, all in one column like the following.
901749095
901744459
258789
901735922
901738573
257784
901742004
901738925
257813
257822
901741040
901733947
257493
901733700
901751693
260720
260956
264843
264844
This is the first time I am using awk, so bear with me. I wrote this from command line which works:
awk '{i=2;
while ($i ~ /[0-9]+/)
{
printf "%s\n", $i
i++
}
}' bth.data
This is more a request for approval than a question: is this the right way of doing something like this in AWK, or is there a better/shorter way of doing it?
Note that the actual input file could be millions of lines.
Thanks
Is this what you want as output?
awk '{for(i=2; i<=NF; i++) print $i}' bth.data
gives
901749095
901744459
258789
901735922
901738573
257784
901742004
901738925
257813
257822
901741040
901733947
257493
901733700
901751693
260720
260956
264843
264844
NF is one of several pre-defined awk variables. It holds the number of fields on the current input line. For instance, it is useful if you always want to print the last field of a line: print $NF. It is equally useful if you want to iterate through all or part of the fields on a given line, up to the end of the line.
Seems like awk is the wrong tool. I would do:
cut -f 2- < bth.data | tr -s '\t' '\n'
Note that with -s, this avoids printing blank lines as stated in the original problem.
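The two approaches can be compared on a small sample. This is only a sketch: the variable names are made up, and the sample is the first two rows of the question's data, assumed to be tab-separated.

```shell
# Two rows from the question, tab-separated.
sample=$(printf '2012029754\t901749095\n2012028240\t901744459\t258789\n')

# awk: print fields 2..NF, one per line.
with_awk=$(printf '%s\n' "$sample" | awk '{ for (i = 2; i <= NF; i++) print $i }')

# cut + tr: drop the first column, then turn the remaining tabs into newlines.
with_cut=$(printf '%s\n' "$sample" | cut -f 2- | tr -s '\t' '\n')

printf '%s\n' "$with_awk"
```

One difference to keep in mind: cut prints a line that contains no tab at all unchanged (unless cut -s is used), while the awk loop prints nothing for a one-field line. The question guarantees at least two columns per row, so both give the same result here.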