How to join next line if certain character is there in AWK? - awk

In a text file (Windows) I have:
sometexthere"
nothingtodohere
yessomethinghere"
etc
Using AWK (in Ubuntu), how can I delete the double quote " at the end of a line, replace it with a colon (:), and join the next line?
so it looks like this:
sometexthere:nothingtodohere
yessomethinghere:etc

This way:
awk '1' RS='"\n' ORS=':' yourfile
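Just set the record separator (RS) to a double quote followed by a line break, and the output record separator (ORS) to the join character. A multi-character RS is treated as a regular expression by gawk and mawk; POSIX leaves that behaviour unspecified.
For DOS line breaks, just adjust the regex: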
awk '1' RS='"\r\n' ORS=':' yourfile
Note: what does that 1 mean?
Short answer: it is just a shortcut to avoid writing the print statement. In awk, when a condition matches and no action is given, the default action is to print the input line. Example:
$ echo "test" |awk '1'
test
That's because 1 is always true, so this expression is equivalent to:
$ echo "test"|awk '1==1'
test
$ echo "test"|awk '{if (1==1){print}}'
test

Related

awk: counting fields in a variable

Given a string like {running_db_nodes,[ejabberd#host002,ejabberd#host001]}, , how could the number of comma-delimited strings in square brackets be counted?
The useful substring can be extracted with gensub:
awk '/running_db_nodes/ {print gensub(/ {running_db_nodes,\[(.*)\]},/, "\\1", 1)}' .
A naive approach with NF gets fields from the original input string:
awk -F, '/running_db_nodes/ {nodes=gensub(/ {running_db_nodes,\[(.*)\]},/, "\\1", 1); print NF}'
How could the number of fields in a variable like nodes in the last example be extracted?
You can set your FS to characters [ and ], then split your $2 to an array and capture the count of elements returned from split():
echo "{running_db_nodes,[ejabberd#host002,ejabberd#host001]}," |
awk -F"[][]" '{print split($2,a,",")}'
2
Based on the sample shown, you could try the following awk code:
echo "{running_db_nodes,[ejabberd#host002,ejabberd#host001]}," |
awk '
{
gsub(/.*\[|\].*$/,"")
print gsub(/,/,"&")+1
}
'
Explanation:
gsub(/.*\[|\].*$/,""): globally removes everything from the start of the line up to and including [, and everything from ] to the end of the line.
print gsub(/,/,"&")+1: replaces each , with itself (just to count the commas), adds 1 to that count, and prints the result, which is the number of fields.
A naive approach with NF gets fields from the original input string
gensub does not change the string it works on. You might use sub (or gsub) instead, which modifies the string in place; when that string is $0, the relevant built-in variables such as NF are updated too:
echo "{running_db_nodes,[ejabberd#host002,ejabberd#host001]}" | awk 'BEGIN{FS=","}{sub(/^.*\[/,"");sub(/].*$/,"");print NF}'
gives output
2
Explanation: use sub to delete everything up to and including [, then ] and everything after it, then print the number of fields.
(tested in GNU Awk 5.0.1)
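To answer the question as literally asked (counting fields in a variable rather than in $0), a possible sketch is to copy the line into a variable, trim it with sub(), and let split() report the count. It uses no gensub, so it should not need GNU Awk; count_nodes and the variable names are hypothetical.

```shell
# count_nodes is a hypothetical helper name used only for this sketch.
count_nodes() {
    awk '/running_db_nodes/ {
        nodes = $0
        sub(/^.*\[/, "", nodes)           # drop everything up to and including "["
        sub(/\].*$/, "", nodes)           # drop "]" and everything after it
        print split(nodes, parts, ",")    # split() returns the number of pieces
    }'
}

echo "{running_db_nodes,[ejabberd#host002,ejabberd#host001]}," | count_nodes
```

Unlike counting commas and adding 1, split() returns 0 for an empty list such as [], and $0 and NF are left untouched.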

gawk - Delimit lines with custom character and no similar ending character

Let's say I have a file like so:
test.txt
one
two
three
I'd like to get the following output: one|two|three
And am currently using this command: gawk -v ORS='|' '{ print $0 }' test.txt
Which gives: one|two|three|
How can I print it so that the last | isn't there?
Here's one way to do it:
$ seq 1 | awk -v ORS= 'NR>1{print "|"} 1; END{print "\n"}'
1
$ seq 3 | awk -v ORS= 'NR>1{print "|"} 1; END{print "\n"}'
1|2|3
With paste:
$ seq 1 | paste -sd'|'
1
$ seq 3 | paste -sd'|'
1|2|3
Convert one column to one row with field separator:
awk '{$1=$1} 1' FS='\n' OFS='|' RS='' file
Or in another notation:
awk -v FS='\n' -v OFS='|' -v RS='' '{$1=$1} 1' file
Output:
one|two|three
See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
The awk solutions work great. Here is a tr + sed solution:
tr '\n' '|' < file | sed 's/|$//'
1|2|3
Just flatten it:
gawk/mawk 'BEGIN { FS = ORS; RS = "^[\n]*$"; OFS = "|"
} NF && ( $NF ? NF=NF : --NF )'
ASCII | = octal \174 = hex 0x7C. The reason for --NF is that, more often than not, the input ends with a trailing newline, which makes the field count one too many and results in
1|2|3|
Both NF=NF and --NF are similar concepts to $1=$1. Empty inputs, regardless of whether trailing new lines exist or not, would result in nothing printed.
In the OFS spot you can use any string you like as the delimiter, instead of being constrained by tr, which behaves inconsistently across implementations. For instance:
gtr '\012' '高' # UTF8 高 = \351\253\230 = xE9 xAB x98
With bsd-tr, \n gets replaced by the full Unicode character (1高2高3高), but gnu-tr keeps only the leading byte of the character, resulting in
1 \351 2 \351 . . .
For unicode equiv-classes, bsd-tr works as expected while gtr '[=高=]' '\v' results in
gtr: ?\230: equivalence class operand must be a single character
And if you attempt equiv-classes with an arbitrary non-ASCII byte, bsd-tr does nothing, while gnu-tr gladly obliges, even if that means slicing straight through valid UTF-8 characters:
g3bn 77138 | (g)tr '[=\224=]' '\v'
bsd-tr : 77138=Koyote 코요태 KYT✜ 高耀太
gnu-tr : 77138=Koyote ?
?
태 KYT✜ 高耀太
I would do it the following way, using GNU AWK. Let the content of test.txt be
one
two
three
then
awk '{printf NR==1?"%s":"|%s", $0}' test.txt
output
one|two|three
Explanation: if it is the first line, print its content without a trailing newline; otherwise print | followed by the line content, again without a trailing newline. Note that I assumed test.txt has no trailing newline; if that is not the case, test this solution before applying it.
(tested in gawk 5.0.1)
Also you can try this with awk:
awk '{ORS = (NR%3 ? "|" : RS)} 1' file
one|two|three
% is the modulo operator and NR%3 ? "|" : RS is a ternary expression.
See Ed Morton's explanation here: https://stackoverflow.com/a/55998710/14259465
With GNU sed, you can pass the -z option to make line breaks matchable, so all you need is to replace each newline except the last one at the end of the string:
sed -z 's/\n\(.\)/|\1/g' test.txt
perl -0pe 's/\n(?!\z)/|/g' test.txt
perl -pe 's/\n/|/g if !eof' test.txt
See the online demo.
Details:
s - substitution command
\n\(.\) - an LF char followed with any one char captured into Group 1 (so \n at the end of string won't get matched)
|\1 - a | char and the captured char
g - all occurrences.
The first perl command matches any LF char (\n) not at the end of string ((?!\z)) after slurping the whole file into a single string input (again, to make \n visible to the regex engine).
The second perl command replaces an LF char at the end of each line except the one at the end of file (eof).
To make the changes in place, add the -i option (mind that this is a GNU sed example):
sed -i -z 's/\n\(.\)/|\1/g' test.txt
perl -i -0pe 's/\n(?!\z)/|/g' test.txt
perl -i -pe 's/\n/|/g if !eof' test.txt
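The awk answers above share one idea: emit the delimiter before every line except the first, and decide separately how to terminate the output. A minimal sketch of that idea with a configurable delimiter follows; join_lines is a hypothetical helper name. Note that awk -v interprets backslash escapes in the value, so a delimiter containing backslashes would need extra care.

```shell
# join_lines is a hypothetical helper: join_lines DELIM [FILE...]
join_lines() {
    delim=$1
    shift
    awk -v d="$delim" '
        { printf "%s%s", (NR > 1 ? d : ""), $0 }   # delimiter goes before every line but the first
        END { if (NR) print "" }                    # finish with one newline unless the input was empty
    ' "$@"
}

printf 'one\ntwo\nthree\n' | join_lines '|'
```

Passing the data as a printf argument rather than as the format string keeps any % characters in the input from being interpreted.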

Count number of instances of a character and use it in a condition

I have a csv file (t.txt) in which some lines have 7 commas and some have 8
If the line has 8 commas, I want to remove the 2nd comma in that line.
Any suggestions? I have been trying
if [[awk -F "," '{print NF-1}' == 8]]; then sed 's/\,//2'; t.txt
Since you haven't provided sample input or expected output, I couldn't test against your data; please try the following:
echo "a,a,a,d,f,g,h,h,f" |
awk '
match($0,/,[^,]*/){
print substr($0,1,RSTART-1) substr($0,RSTART,RLENGTH) substr($0,RSTART+RLENGTH+1)
next
}
1'
a,aa,d,f,g,h,h,f
If you want to check the 8-comma condition too, set the field separator with -F',' (so that NF counts comma-separated fields) and replace the condition with match($0,/,[^,]*/) && NF==9 in the code above.
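Putting the comma-count condition and the removal together might look like the sketch below. It assumes plain comma-separated data with no quoted fields that contain commas; drop_second_comma is a hypothetical helper name.

```shell
# drop_second_comma is a hypothetical helper name for this sketch.
drop_second_comma() {
    awk -F',' '
        NF == 9 && match($0, /,[^,]*/) {    # 9 fields means 8 commas
            # keep everything through the second field, then skip the comma after it
            $0 = substr($0, 1, RSTART + RLENGTH - 1) substr($0, RSTART + RLENGTH + 1)
        }
        1                                   # print every line, changed or not
    ' "$@"
}

printf 'a,a,a,d,f,g,h,h,f\na,b,c,d,e,f,g,h\n' | drop_second_comma
```

Lines with 7 commas do not satisfy NF == 9 and pass through unchanged.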

How can I replace all middle characters with '*'?

I would like to replace middle of word with ****.
For example :
ifbewofiwfib
wofhwifwbif
iwjfhwi
owfhewifewifewiwei
fejnwfu
fehiw
wfebnueiwbfiefi
Should become :
if********ib
wo*******if
iw***wi
ow**************ei
fe***fu
fe*iw
wf***********fi
So far I managed to replace all but the first 2 chars with:
sed -e 's/./*/g3'
Or do it the long way:
grep -o '^..' file > start
cat file | sed 's:^..\(.*\)..$:\1:' | awk -F. '{for (i=1;i<=length($1);i++) a=a"*";$1=a;a=""}1' > stars
grep -o '..$' file > end
paste -d "" start stars > temp
paste -d "" temp end > final
I would use Awk for this, provided you have GNU Awk, which lets you set the field separator to an empty string (How to set the field separator to an empty string?).
This way, you can loop through the chars and replace the desired ones with "*". In this case, replace from the 3rd to the 3rd last:
$ awk 'BEGIN{FS=OFS=""}{for (i=3; i<=NF-2; i++) $i="*"} 1' file
if********ib
wo*******if
iw***wi
ow**************ei
fe***fu
fe*iw
wf***********fi
If perl is okay:
$ perl -pe 's/..\K.*(?=..)/"*" x length($&)/e' ip.txt
if********ib
wo*******if
iw***wi
ow**************ei
fe***fu
fe*iw
wf***********fi
..\K.*(?=..) to match characters other than first/last two characters
See regex lookarounds section for details
e modifier allows to use Perl code in replacement section
"*" x length($&) use length function and string repetition operator to get desired replacement string
You can do it with a repetitive substitution, e.g.:
sed -E ':a; s/^(..)([*]*)[^*](.*..)$/\1\2*\3/; ta'
Explanation
This works by repeating the substitution until no change happens, that is what the :a; ...; ta bit does. The substitution consists of 3 matched groups and a non-asterisk character:
(..) the start of the string.
([*]*) any already replaced characters.
[^*] the character to be replaced next.
(.*..) any remaining characters to replace and the end of the string.
Alternative GNU sed answer
You could also do this by using the hold space which might be simpler to read, e.g.:
h # save a copy to hold space
s/./*/g3 # replace all but 2 by *
G # append hold space to pattern space
s/^(..)([*]*)..\n.*(..)$/\1\2\3/ # reformat pattern space
Run it like this:
sed -Ef parse.sed input.txt
Output in both cases
if********ib
wo*******if
iw***wi
ow**************ei
fe***fu
fe*iw
wf***********fi
The following awk may help. It should work in any awk version.
awk '{len=length($0);for(i=3;i<=(len-2);i++){val=val "*"};print substr($0,1,2) val substr($0,len-1);val=""}' Input_file
Adding a non-one-liner form of the solution too:
awk '
{
len=length($0);
for(i=3;i<=(len-2);i++){
val=val "*"};
print substr($0,1,2) val substr($0,len-1);
val=""
}
' Input_file
Explanation of the above code:
awk '
{
len=length($0); ##Store the length of the current line in the variable len.
for(i=3;i<=(len-2);i++){ ##Loop from i=3 up to len-2, i.e. once per middle character.
val=val "*"}; ##Append one "*" to the variable val on each iteration.
print substr($0,1,2) val substr($0,len-1);##Print the first 2 chars, then val, then the last 2 chars of the current line.
val="" ##Reset val so that it starts out empty for the next line.
}
' Input_file ##Input_file is the name of the file to process.
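The substr-based approach can be generalised to keep a configurable number of characters at each end. The sketch below also guards against short lines: without a guard, a three-character line such as abc would have its first two and last two characters overlap and print as abbc. mask_middle and its KEEP argument are hypothetical names, and length() is assumed to count characters (true for ASCII input).

```shell
# mask_middle is a hypothetical helper: mask_middle KEEP [FILE...]
mask_middle() {
    keep=$1
    shift
    awk -v keep="$keep" '{
        len = length($0)
        if (len <= 2 * keep) { print; next }    # nothing in the middle to hide
        stars = ""
        for (i = 1; i <= len - 2 * keep; i++)   # one "*" per middle character
            stars = stars "*"
        print substr($0, 1, keep) stars substr($0, len - keep + 1)
    }' "$@"
}

printf 'ifbewofiwfib\nfehiw\nabc\n' | mask_middle 2
```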

awk: changing OFS without looping though variables

I'm working on an awk one-liner to replace commas with tabs in a file (and swap in \\N for missing values in preparation for MySQL select into).
The following link http://www.unix.com/unix-for-dummies-questions-and-answers/211941-awk-output-field-separator.html (at the bottom) suggests the following approach to avoid looping through the variables:
echo a b c d | awk '{gsub(OFS,";")}1'
head -n1 flatfile.tab | awk -F $'\t' '{for(j=1;j<=NF;j++){gsub(" +","\\N",$j)}gsub(OFS,",")}1'
Clearly, the trailing 1 (it can be any number or character) triggers the printing of the entire record. Could you please explain why this works?
SO also has Print all Fields with AWK separated by OFS , but in that post it seems unclear why this is working.
Thanks.
Awk evaluates 1, or any number other than 0, as true. A true pattern without an action part is equivalent to { print $0 }, so it prints the line.
For example:
$ echo "hello" | awk '1'
hello
$ echo "hello" | awk '0'
$
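For the comma-to-tab conversion itself, a common alternative to gsub(OFS, ...) is to set FS and OFS and then assign a field to itself, which makes awk rebuild $0 using OFS. A minimal sketch, assuming no quoted fields that contain commas; commas_to_tabs is a hypothetical helper name.

```shell
# commas_to_tabs is a hypothetical helper name for this sketch.
commas_to_tabs() {
    awk 'BEGIN { FS = ","; OFS = "\t" }
         { $1 = $1 }    # assigning to a field makes awk rebuild $0 with OFS
         1              # a bare true pattern: default action is print $0
    ' "$@"
}

printf 'a,b,c\n' | commas_to_tabs
```

The trailing 1 plays the same role as in the examples above: it is a pattern that is always true, with the default print action.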