Bash command/script to split line on a certain character - awk

I would like to split the below data to the expected output:
Raw Data:
931096|376601|1|ART|AT-2151780724|2151780724|2|102809198|I|CGM44I|MIL3VF03|52576377.3600|PENDING|MO|PEND-INFO|Pend ACS4R|N|N|N|N|N|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|N|NULL|NULL|N|system|NULL|NULL|52576377.3600|1317720|system|2020-02-13 02:00:42|0
931097|375789|1|AYT|AT-2151509210|2151509210|7|102614605|A|CTHGMI|OZF19|444006.6400|APPROVED|NULL|APPROVED|Approved|N|N|N|N|N|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|N|NULL|NULL|N|kg17718|NULL|NULL|0.0000|1317722|system|2020-02-13 02:00:43|0931098|375979|1|AHT|AT-2151780726|2151780726|2|102809199|I|CGMI|MILaesLF11|26312.0000|PENDING|MO|PEND-INFO|Pend ACRES|N|N|N|N|N|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|N|NULL|NULL|N|system|NULL|NULL|26312.0000|1317721|system|2020-02-13 02:00:43|0
931099|376572|1|AT|AT-2151399812|2151399812|5|102673999|I|CG2rMI|WEL44LF15|60991.6956|PENDING|MO|PEND-INFO|Pend ACERS|N|N|N|N|N|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|N|NULL|NULL|N|system|NULL|NULL|0.0000|1317723|system|2020-02-13 02:00:45|0
Expected Output:
931096|376601|1|ART|AT-2151780724|2151780724|2|102809198|I|CGM44I|MIL3VF03|52576377.3600|PENDING|MO|PEND-INFO|Pend ACS4R|N|N|N|N|N|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|N|NULL|NULL|N|system|NULL|NULL|52576377.3600|1317720|system|2020-02-13 02:00:42|0
931097|375789|1|AYT|AT-2151509210|2151509210|7|102614605|A|CTHGMI|OZF19|444006.6400|APPROVED|NULL|APPROVED|Approved|N|N|N|N|N|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|N|NULL|NULL|N|kg17718|NULL|NULL|0.0000|1317722|system|2020-02-13 02:00:43|0
931098|375979|1|AHT|AT-2151780726|2151780726|2|102809199|I|CGMI|MILaesLF11|26312.0000|PENDING|MO|PEND-INFO|Pend ACRES|N|N|N|N|N|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|N|NULL|NULL|N|system|NULL|NULL|26312.0000|1317721|system|2020-02-13 02:00:43|0
931099|376572|1|AT|AT-2151399812|2151399812|5|102673999|I|CG2rMI|WEL44LF15|60991.6956|PENDING|MO|PEND-INFO|Pend ACERS|N|N|N|N|N|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|N|NULL|NULL|N|system|NULL|NULL|0.0000|1317723|system|2020-02-13 02:00:45|0
Basically the \n character is getting lost sometimes in the data and the lines are getting merged. Sometimes more than 1 line gets merged as well (even the opposite happens but we can get to that later).
The data always has 43 columns | separated. The last but one column(42nd) always is a timestamp and the last column is usually 0 or 1.
Trying for the below approach:
If cols > 43
Split 44th column to add \n and print the remaining.
Repeat process until cols=43
echo "${curr}" | awk -F\| ' { if(NF > 43) {for(i=43;i<NF;i++) "sed '${NR}s/\(^0\)/\1\n/p' $i" }}' filename

less complex
awk 'BEGIN {FS=OFS="|"}
NF>43 {for(i=43;i<NF;i+=42) {t=$i; $i=substr(t,1,1) ORS substr(t,2)}}1' file
931096|376601|1|ART|AT-2151780724|2151780724|2|102809198|I|CGM44I|MIL3VF03|52576377.3600|PENDING|MO|PEND-INFO|Pend ACS4R|N|N|N|N|N|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|N|NULL|NULL|N|system|NULL|NULL|52576377.3600|1317720|system|2020-02-13 02:00:42|0
931097|375789|1|AYT|AT-2151509210|2151509210|7|102614605|A|CTHGMI|OZF19|444006.6400|APPROVED|NULL|APPROVED|Approved|N|N|N|N|N|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|N|NULL|NULL|N|kg17718|NULL|NULL|0.0000|1317722|system|2020-02-13 02:00:43|0
931098|375979|1|AHT|AT-2151780726|2151780726|2|102809199|I|CGMI|MILaesLF11|26312.0000|PENDING|MO|PEND-INFO|Pend ACRES|N|N|N|N|N|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|N|NULL|NULL|N|system|NULL|NULL|26312.0000|1317721|system|2020-02-13 02:00:43|0
931099|376572|1|AT|AT-2151399812|2151399812|5|102673999|I|CG2rMI|WEL44LF15|60991.6956|PENDING|MO|PEND-INFO|Pend ACERS|N|N|N|N|N|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|N|NULL|NULL|N|system|NULL|NULL|0.0000|1317723|system|2020-02-13 02:00:45|0
This follows your spec, with two corrections:
If cols > 43, split the 43rd column (not the 44th) to add
\n and print the remaining. Repeat the process until the end (not until cols=43).
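To check the outcome of a repair like this, one could report every line whose column count is still off. This is only a minimal sketch: the function name report_bad_lines is made up for illustration, and 43 is the column count stated in the question.

```shell
# Hypothetical helper: list lines whose number of |-separated columns
# differs from the expected count (default 43, as in the question).
report_bad_lines() {
  awk -F'|' -v n="${1:-43}" 'NF != n { printf "line %d: %d columns\n", NR, NF }'
}
```

Something like `awk ... file | report_bad_lines` should print nothing once every line has 43 columns.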

The usual way with sed: write a regex that matches 42 | characters with anything in between, followed by a digit. Then insert a newline after the matched string.
sed 's/[0-9]\{6\}\(|[^|]*\)\{41\}|[0-9]/&\n/g ; s/\n$//'
#                                               ^^^^^^^ - remove the leftover newline
#                                       ^ - the matched string
#                                 ^^^^^ - trailing digit
#                                ^ - 42nd pipe character
#                ^^^^^^^^^^^^^^^^ - 41 fields, each preceded by a pipe
#      ^^^^^^^^^^ - leading 6 digits
tested on repl
Or maybe match 42 pipes with anything in front and a digit:
sed 's/\([^|]*|\)\{42\}[0-9]/&\n/g ; s/\n$//'
Or match a character after 42 pipes and a digit and insert a newline in between:
sed 's/\(\([^|]*|\)\{42\}[0-9]\)\(.\)/\1\n\3/g'

Could you please try the following, written for the shown samples. It inserts a newline after every timestamp|0 pair, so it also handles lines in which more than two records were merged.
awk '
{
  # add a newline after each "timestamp|0" ...
  gsub(/[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}\|0/,"&\n")
  # ... then drop the one left over at the end of the line
  sub(/\n$/,"")
}
1
' Input_file

Perhaps you don't trust the source of the data: it might add an extra | so that the number of columns is wrong.
Another approach is to assume that you can trust the timestamp field.
So split the line whenever the field after the timestamp has more than one character (splitting after the first).
sed -E 's/([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}\|.)(.)/\1\n\2/g' file
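The same idea can be sketched in plain awk, which avoids relying on \n in a sed replacement (not every sed accepts that). The function name split_after_timestamp is hypothetical, and the sketch assumes the last column is always exactly one character.

```shell
# Hypothetical helper: cut a line after every "timestamp|X" that is
# followed by more text, where X is the single-character last column.
split_after_timestamp() {
  awk '
  {
    line = $0
    # timestamp, a literal pipe, then one character for the last column
    re = "[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] [0-9][0-9]:[0-9][0-9]:[0-9][0-9][|]."
    # keep cutting while a match is found that does not end the line
    while (match(line, re) && RSTART + RLENGTH <= length(line)) {
      print substr(line, 1, RSTART + RLENGTH - 1)
      line = substr(line, RSTART + RLENGTH)
    }
    print line
  }'
}
```

It reads standard input, so it would be used as `split_after_timestamp < file`.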

This might work for you (GNU sed):
sed 's/[^|]*/\n&/44;s/\(|.\)\([^|]*|\)\n/\1\n\2/;P;D' file
If there is a 44th field, insert a newline before it. Then remove that newline and insert it following the first character of the 43rd field. Print the first line, delete the first line and repeat.
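For comparison, the field-counting idea can also be sketched in awk with the column count as a parameter. The name split_by_fields is an assumption for illustration; the sketch also assumes the last column is a single character and that the count is at least 2.

```shell
# Hypothetical helper: $1 is the number of columns per record (default 43).
split_by_fields() {
  awk -F'|' -v n="${1:-43}" '
  {
    out = ""
    for (i = 1; i <= NF; i++) {
      f = $i
      # fields n, 2n-1, 3n-2, ... hold a fused "last column + first column",
      # unless the field is the genuine end of the line
      if (i > 1 && (i - 1) % (n - 1) == 0 && i < NF)
        f = substr(f, 1, 1) "\n" substr(f, 2)
      out = out (i > 1 ? "|" : "") f
    }
    print out
  }'
}
```

With the data from the question this would be called as `split_by_fields 43 < file`.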

Related

Get the value after a specific symbol in each column of the table

let's consider the following table
#frame x1:1 y1:5 z2:3 m1:13 n3:35
1 130.31 23.2 44.1 32.7 54.3
....
....
I want to get the value present after the colon(:) symbol in each column. Thus the outcome will be from 2nd column 1, 3rd column 5, 4th column 3, 5th column 13, and 6th column 35.
I would use GNU AWK for this task in the following way. Let the content of file.txt be
#frame x1:1 y1:5 z2:3 m1:13 n3:35
1 130.31 23.2 44.1 32.7 54.3
....
....
then
awk 'BEGIN{FPAT=":[^[:space:]]+"}NF{for(i=1;i<=NF;i+=1){$i=substr($i,2)};print}' file.txt
gives output
1 5 3 13 35
Explanation: through the FPAT variable I tell GNU AWK to treat a : followed by one or more (+) non-whitespace characters ([^[:space:]]) as a column. Then, for each line that has at least one such column (NF), I iterate over the columns with a for loop and replace the content of each with the value starting at its 2nd character, i.e. I discard the leading :. When that is done I print the line (the column contents separated by space characters).
(tested in gawk 4.2.1)
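FPAT is specific to GNU AWK. If another awk has to be used, a rough equivalent could look like the sketch below; the function name values_after_colon is made up, and it assumes the columns are separated by whitespace.

```shell
# Hypothetical helper: for every line containing a colon, print what
# follows the first colon in each whitespace-separated column.
values_after_colon() {
  awk '/:/ {
    out = ""; sep = ""
    for (i = 1; i <= NF; i++) {
      p = index($i, ":")            # position of the first colon, 0 if none
      if (p) { out = out sep substr($i, p + 1); sep = " " }
    }
    print out
  }'
}
```

Called as `values_after_colon < file.txt`, it skips lines without any colon.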
This might work for you (GNU sed):
sed -En 's/\S+://gp' file
Delete any non-whitespace characters followed by a colon, globally and print the result only if there is a match.
Or if you only want the values following :, then:
sed -En '/:/{s/[^:]*//;s/\S*://gp}' file

awk/sed replace multiple newlines in the record except end of record

I have a file where:
the field delimiter is \x01
the record delimiter is \n
Some lines contain multiple newlines. I need to remove them, but I don't want to remove the legitimate newline at the end of each line. I have tried this with awk:
awk -F '\x01' 'NF < 87 {getline s; $0 = $0 s} 1' infile > outfile
But this is only working when the line contains one newline in the record (except end of line newline). This does not work for multiple newlines.
Note: the record contains 87 fields.
What am I doing wrong here?
Example of file:
PL^ANov-21^A29-11-2021^A0^A00^A00^A0000000
test^A00000000
Test^A^A^A^A
PL^ANov-21^A29-11-2021^A0^A00^A00^A0000000
test^A00000000
Test^A^A^A^A
SL^ANov-21^A30-11-2021^AB^A0000^A1234567^A00000
test^A12102120^A00000^A00^A^A
NOTE: Each record in this example contains 11 field separators; field separator \x01; record separator \n
Expected result:
PL^ANov-21^A29-11-2021^A0^A00^A00^A0000000test^A00000000 Test^A^A^A^A
PL^ANov-21^A29-11-2021^A0^A00^A00^A0000000test^A00000000 Test^A^A^A^A
SL^ANov-21^A30-11-2021^AB^A0000^A1234567^A00000test^A12102120^A00000^A00^A^A
Note: I need to preserve the field delimiter (\x01) and record delimiter (\n)
Thank you very much in advance for looking into this.
The file always contains 87 fields;
The field delimiter is '\x01', but when viewing in Linux it is represented as '^A'
Some lines contain newlines - I need to remove them, but I don't want to remove the legitimate newlines at the end of each line.
The newline appears twice in the first and second records and once in the third record - these are the newlines I want to remove.
In the examples/expected results there are 11 delimiters "x01" represented as "^A",
I expect to have 3 records and not 6, i.e.:
First record:
test^A00000000 should be joined to the previous line
Test^A^A^A^A should be joined to the first line as well
forming one record:
PL^ANov-21^A29-11-2021^A0^A00^A00^A0000000test^A00000000 Test^A^A^A^A
Second record
test^A00000000 should be joined to the previous line
Test^A^A^A^A should be joined to that previous line as well
forming one record:
PL^ANov-21^A29-11-2021^A0^A00^A00^A0000000test^A00000000 Test^A^A^A^A
Third record:
test^A12102120^A00000^A00^A^A should be joined to the previous line
forming one record:
SL^ANov-21^A30-11-2021^AB^A0000^A1234567^A00000test^A12102120^A00000^A00^A^A
Note:
The awk example provided above works when there is one unwanted newline in the record, but not when there are several.
Thank you so very much. It works perfectly. Thank you for explaining it so well to me too.
This might work for you (GNU sed):
sed ':a;N;s/\x01/&/87;Ta;s/\n//g' file
Gather up lines until there are 87 separators, remove any newlines and print the result. (If 87 is the number of fields rather than the number of separators, the count should be 86.)
What's wrong with your attempt is that you concatenate two lines, print the result and move to the next line. NF is then reset to the next fields count. As all your lines have less than 87 fields the NF < 87 condition is useless, your script would work the same without it.
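Staying close to the original attempt, the single getline could be turned into a loop that keeps appending lines until the record is complete. This is a sketch only: the name join_until is hypothetical, the field count is passed as an argument, and the separator byte is produced with printf instead of relying on awk understanding \x01.

```shell
# Hypothetical helper: $1 is the expected number of fields per record.
join_until() {
  awk -F"$(printf '\001')" -v n="$1" '
  {
    # assigning to $0 re-splits the record, so NF is updated on each pass
    while (NF < n && (getline nxt) > 0) $0 = $0 nxt
    print
  }'
}
```

For the real data this would be `join_until 87 < infile > outfile`.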
Try this awk script:
$ awk -F'\x01' -vn=87 -vi=0 '
{printf("%s", $0); i+=NF; if(i==n) {i=0; print "";} else i-=1;}' file
Here, we use the real \x01 field separator and the NF fields count. Variable i counts the number of already printed fields. We first print the current line without the trailing newline (printf("%s", $0)). Then we update our i fields counter. If it is equal to n we reset it and print a newline. Else we decrement it such that we do not count the last field of this line and the first of the next as 2 separate fields.
Demo with n=12 instead of 87 and your own input file (with \x01 field separators):
$ awk -F'\x01' -vn=12 -vi=0 '
{printf("%s", $0); i+=NF; if(i==n) {i=0; print "";} else i-=1;}' file |
sed 's/\x01/|/g'
PL|Nov-21|29-11-2021|0|00|00|0000000test|00000000 Test||||
PL|Nov-21|29-11-2021|0|00|00|0000000test|00000000 Test||||
SL|Nov-21|30-11-2021|B|0000|1234567|00000test|12102120|00000|00||
The sed command shows the result with the \x01 replaced by | for easier viewing.

split based on the last dot and create a new column with the last part of the string

I have a file with 2 columns. In the first column there are several strings (IDs) and in the second, values. The strings contain a variable number of dots. I would like to split these strings at the last dot. I found in the forum how to remove the last part after the last dot, but I don't want to remove it. I would like to create a new column with the last part of the strings, using a bash command (e.g. awk).
Example of strings:
5_8S_A.3-C_1.A 50
6_FS_B.L.3-O_1.A 20
H.YU-201.D 80
UI-LP.56.2011.A 10
Example of output:
5_8S_A.3-C_1 A 50
6_FS_B.L.3-O_1 A 20
H.YU-201 D 80
UI-LP.56.2011 A 10
I tried to solve it by using the following command but it works if I have just 1 dot in the string:
awk -F' ' '{{split($1, arr, "."); print arr[1] "\t" arr[2] "\t" $2}}' file.txt
You may use this sed:
sed -E 's/^([[:blank:]]*[^[:blank:]]+)\.([^[:blank:]]+)/\1 \2/' file
5_8S_A.3-C_1 A 50
6_FS_B.L.3-O_1 A 20
H.YU-201 D 80
UI-LP.56.2011 A 10
Details:
^: Start
([[:blank:]]*[^[:blank:]]+): Capture group #1 to match 0 or more whitespace characters followed by 1+ non-whitespace characters.
\.: Match a dot. Since the preceding pattern is greedy, this is the last dot of the first field
([^[:blank:]]+): Capture group #2 to match 1+ non-whitespace characters
\1 \2: Replacement to place a space between capture value #1 and capture value #2
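The "last dot" idea can also be expressed in awk with match(), anchoring the pattern at the end of the first field. A minimal sketch, with the function name split_last_dot chosen only for illustration:

```shell
# Hypothetical helper: split the first field at its last dot and
# print the two parts followed by the rest of the line.
split_last_dot() {
  awk '{
    id = $1
    # "\.[^.]*$" can only match at the last dot, since [^.]* cannot cross a dot
    if (match(id, /\.[^.]*$/))
      $1 = substr(id, 1, RSTART - 1) OFS substr(id, RSTART + 1)
    print
  }'
}
```

Usage would be `split_last_dot < file`; lines whose first field has no dot are printed unchanged.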
Assumptions:
each line consists of two (white) space delimited fields
first field contains at least one period (.)
Sticking with OP's desire (?) to use awk:
awk '
{ n=split($1,arr,".")                # split first field on period (".")
  pfx=""
  for (i=1;i<n;i++) {                # print all but the nth array entry
      printf "%s%s",pfx,arr[i]
      pfx="."}
  print "\t" arr[n] "\t" $2}         # print last array entry and last field of line
' file.txt
Removing comments and reducing to a one-liner:
awk '{n=split($1,arr,"."); pfx=""; for (i=1;i<n;i++) {printf "%s%s",pfx,arr[i]; pfx="."}; print "\t" arr[n] "\t" $2}' file.txt
This generates:
5_8S_A.3-C_1 A 50
6_FS_B.L.3-O_1 A 20
H.YU-201 D 80
UI-LP.56.2011 A 10
With your shown samples, here is one more variant of rev + awk solution.
rev Input_file | awk '{sub(/\./,OFS)} 1' | rev
Explanation: rev prints each line reversed (from last character to first). Its output is piped to awk, which replaces the first dot (which is the last dot of the original line, at least for the OP's shown samples) with a space and prints every line. That output is piped to rev again to restore the original character order.
$ sed 's/\.\([^.]*$\)/\t\1/' file
5_8S_A.3-C_1 A 50
6_FS_B.L.3-O_1 A 20
H.YU-201 D 80
UI-LP.56.2011 A 10

Adding a decimal point to an integer with awk or sed

So, I have csv files to use with hledger, and last field of every row is the amount for that line transaction.
Lines are in the following format:
date1, date2, description, amount
With the amount format any length between 4 and 6 digits; now for some reason all amounts are missing the period before the last two digits.
Now: 1000
Should be: 10.00
Now: 25452
Should be: 254.52
How to add a '.' before the last two digits of all lines, preferably with sed/awk?
So the input file is:
16.12.2005,18.12.2005,ATM,2000
17.12.2005,18.12.2005,utility,12523
18.12.2005,20.12.2005,salary,459023
desired output
16.12.2005,18.12.2005,ATM,20.00
17.12.2005,18.12.2005,utility,125.23
18.12.2005,20.12.2005,salary,4590.23
Thanks
You could try:
awk -F , '{printf "%s,%s,%s,%.2f\n", $1, $2, $3, $4/100.0}' file
You should always add a sample of your input file and of the output you want in your question.
For the input you provide, you need to define what should happen when the description field contains a , and whether amounts of less than 100 are possible.
Depending on your answer, the code may or may not need to be adapted.
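One way to sidestep both questions is to touch only the last field and to pad short amounts with zeros. The sketch below assumes the amount is a non-negative integer in the last comma-separated field; the name add_decimal_point is made up.

```shell
# Hypothetical helper: insert a "." before the last two digits of the
# last comma-separated field, padding amounts shorter than 3 digits.
add_decimal_point() {
  awk 'BEGIN { FS = OFS = "," }
  NF {
    amt = $NF
    while (length(amt) < 3) amt = "0" amt      # "5" becomes "005"
    $NF = substr(amt, 1, length(amt) - 2) "." substr(amt, length(amt) - 1)
  }
  { print }'
}
```

Because only $NF is changed, extra commas in the description are carried through as they are.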
sed 's/..$/.&/'
You can also use cut utility to get the desired output. In your case, you always want to add '.' before the last two digits. So essentially it can be thought as something like this:
Step 1: Get all the characters from the beginning till the last 2 characters.
Step 2: Get the last 2 characters from the end.
Step 3: Concatenate them with the character that you want ('.' in this case).
The corresponding command for each of the step is the following:
$ a='17.12.2005,18.12.2005,utility,12523'
$ b=`echo $a | rev | cut -c3- | rev`
$ c=`echo $a | rev | cut -c1-2 | rev`
$ echo $b"."$c
This would produce the output
17.12.2005,18.12.2005,utility,125.23
16.12.2005,18.12.2005,ATM,20.00
17.12.2005,18.12.2005,utility,125.23
18.12.2005,20.12.2005,salary,4590.23
awk -F, '{sub(/..$/,".&")}1' file

Printing out a particular row based on condition in another row

Apologies if this is really basic stuff, but I just started with awk.
So I have an input file I'm piping into awk like below. The format never changes.
name: Jim
gender: male
age: 40
name: Joe
gender: female
age: 36
name: frank
gender: Male
age: 40
I'm trying to list all names where age is 40
I can find them like so
awk '$2 == "40" {print $2 }'
but can't figure out how to print the name.
Could you please try the following (I am driving right now, so I couldn't test it).
awk '/^age/{if($NF==40){print val};val="";next} /^name/{val=$0}' Input_file
Explanation: if a line starts with name, the whole line is stored in the variable val. If a line starts with age, the script checks whether its last field equals 40 and, if so, prints val; either way val is then reset.
Using GNU awk, setting the record separator (RS) to the empty string makes it work on blocks (this relies on the records being separated by blank lines).
awk -v RS="" '/age: 40/ {print $2}' file
Jim
frank
Some shorter awk versions of suspectus and RavinderSingh13 post
awk '/^name/{n=$2} /^age/ && $NF==40 {print n}' file
awk '/^name/{n=$2} /^age: 40/ {print n}' file
Jim
frank
If a line starts with name, store the name in n.
If a line starts with age and the age is 40, print n.
Awk knows the concepts of records and fields.
Files are split in records where consecutive records are split by the record separator RS. Each record is split in fields, where consecutive fields are split by the field separator FS.
By default, the record separator RS is set to be the <newline> character (\n) and thus each record is a line. The record separator has the following definition:
RS:
The first character of the string value of RS shall be the input record separator; a <newline> by default. If RS contains more than one character, the results are unspecified. If RS is null, then records are separated by sequences consisting of a <newline> plus one or more blank lines, leading or trailing blank lines shall not result in empty records at the beginning or end of the input, and a <newline> shall always be a field separator, no matter what the value of FS is.
So with the file format you give, we can define the records based on RS="".
Based on this, we can immediately list all records that contain the line age: 40
$ awk 'BEGIN{RS="";ORS="\n\n"}/age: 40/' file
There are a couple of problems with the above line:
What if we have a person who is 400 years old? They will be listed, because the line age: 400 contains the requested string age: 40.
What if we have a record with a typo stating age:40 or age : 40
What if our record has a line stating wage: 40 USD/min
To solve most of these problems, it is easier to work with well-defined fields in the record and build the key-value-pairs per record:
key value
---------------
name => Jim
gender => male
age => 40
and then, we can use this to select the requested information:
$ awk 'BEGIN{RS="";FS="\n"}
  # build the record
  { delete rec;
    for(i=1;i<=NF;++i) {
       # find the first ":" and select key and value as substrings
       j=index($i,":"); key=substr($i,1,j-1); value=substr($i,j+1)
       # remove potential spaces from front and back
       gsub(/(^[[:blank:]]*|[[:blank:]]*$)/,"",key)
       gsub(/(^[[:blank:]]*|[[:blank:]]*$)/,"",value)
       # store key-value pair
       rec[key] = value
    }
  }
  # select requested information and print
  (rec["age"] == 40) { print rec["name"] }' file
This is not a one-liner, but it is robust. Furthermore, this method is fairly flexible and adaptable to make selections based on a more complex logic.
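If the records are not separated by blank lines, the same key/value idea can be applied line by line. A sketch under that assumption, with the hypothetical function name names_with_age and the wanted age passed as an argument:

```shell
# Hypothetical helper: $1 is the age to look for. Each input line is
# treated as "key: value"; the most recent name is remembered.
names_with_age() {
  awk -v want="$1" '
  {
    p = index($0, ":")
    if (!p) next                               # not a key/value line
    key = substr($0, 1, p - 1); val = substr($0, p + 1)
    gsub(/^[ \t]+|[ \t]+$/, "", key)           # trim both ends
    gsub(/^[ \t]+|[ \t]+$/, "", val)
    if (key == "name") name = val
    else if (key == "age" && val + 0 == want + 0) print name
  }'
}
```

Since the key is compared exactly and the age numerically, lines such as wage: 40 or age: 400 should not produce a match, while age : 40 should.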
If you are not averse to using grep and the format is always the same:
cat filename | grep -B2 "age: 40" | grep -oP "(?<=name: ).*"
Jim
frank
awk -F':' '/^name/{name=$2} \
/^age/{if ($NF==40)print name}' input_file