How to remove spaces and specific characters in a string - awk

Below is an example of the input:
!{ID=34, ID2=35}
>
!{ID=99, ID2=23}
>
!{ID=18, ID2=87}
<
I am trying to produce the final result shown below. That is, I want to remove the spaces and the '{' and '}' characters, and check whether the next line is '>' or '<'.
In fact, the pattern above repeats throughout the input. I also need to parse the '>' and '<' characters, because I will put the parsed string (YES or NO) into a database.
ID=34,ID=35#YES#NO
ID=99,ID=23#YES#NO
ID=18,ID=87#NO#YES
So I thought I could replace the spaces with the 'sub' function, but the result shows:
1#YES#NO
Can you let me know what is wrong?
If possible, teach me how to remove '{' and '}' as well.
I would appreciate it if you could show me an awk file version instead of a one-liner.
BEGIN {
    VALUES = ""
    L_EXIST = "NO"
    R_EXIST = "NO"
}
/!/ {
    VALUES = gsub(" ", "", $0);
    getline;
    if ($1 == ">") L_EXIST = "YES";
    else if ($1 == "<") R_EXIST = "YES";
    print VALUES"#"L_EXIST"#"R_EXIST
}
END {
}

Given your sample input:
$ cat file
!{ID=34, ID2=35}
>
!{ID=99, ID2=23}
>
!{ID=18, ID2=87}
<
This script produces the desired output:
BEGIN { FS="[}{=, \n]+"; RS="!" }
NR > 1 { printf "ID=%d,ID=%d#%s\n", $3, $5, ($6==">"?"YES#NO":"NO#YES") }
The Field Separator is set to consume the braces, equals signs, commas, spaces, and newlines between the parts of the record that you're interested in (the newline matters because each record spans two lines). The Record Separator is set to !, so that each pair of lines is treated as a single record.
The first record is empty (the start of the first line, up to the first !), so we only process the ones after that. The output is constructed using printf, with a ternary to determine the last part (I assume that there are only two options, > or <).
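As for what was actually wrong in the posted script: gsub() modifies $0 in place and returns the number of substitutions it made, so VALUES = gsub(" ", "", $0) stores that count (hence the 1 in 1#YES#NO) rather than the cleaned-up line. For completeness, a minimal sketch of the getline-based file version with that fixed; it also resets the YES/NO flags for every record, which the original did not, and it keeps ID2 as in the input rather than renaming it to ID:
/^!/ {
    gsub(/[ {}!]/, "")                 # strip spaces, braces and the leading '!'; gsub edits $0 in place
    values = $0                        # save the cleaned line, not gsub's return value
    l_exist = r_exist = "NO"           # reset the flags for each record
    if ((getline line) > 0) {          # read the following '>' or '<' line
        if (line == ">") l_exist = "YES"
        else if (line == "<") r_exist = "YES"
    }
    print values "#" l_exist "#" r_exist
}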

Let's say you have this input:
input.txt
!{ID=34, ID2=35}
!{ID=36, ID2=37}
>
You can use the following awk command
awk -F'[!{}, ]' 'NR>1{yn="NO";if($1==">")yn="YES";print l"#"yn}{l=$3","$5}' input.txt
to produce this output:
ID=34,ID2=35#NO
ID=36,ID2=37#YES
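If an awk file is preferred here as well, the one-liner translates directly into a script; a sketch (the variable l remembers the ID fields of the previous data line):
BEGIN { FS = "[!{}, ]" }
NR > 1 {                      # from the second line on, report on the previous data line
    yn = "NO"
    if ($1 == ">") yn = "YES"
    print l "#" yn
}
{ l = $3 "," $5 }             # remember the ID fields of the current line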

Related

How can I store the length of a line into a var within an awk script?

I have this simple awk script with which I attempt to check the number of characters in the first line.
If the first line has more or less than 10 characters, I want to store the character count in a variable.
Somehow the first print statement works, but storing that result into a variable doesn't.
Please help.
I tried removing the dollar sign ("thelength=(length($0))")
and removing the parentheses ("thelength=length($0)"), but it doesn't print anything...
Thanks!
#!/bin/ksh
awk ' BEGIN {FS=";"}
{
    if (NR==1)
        if(length($0)!=10)
        {
            print(length($0))
            thelength=$(length($0))
            print "The length of the first line is: ",$thelength;
            exit 1;
        }
}
END { print "STOP" }' $1
Two issues here, both from mixing ksh and awk syntax ...
inside awk, $(...) is not a sub-shell call; $expr is a field reference, so thelength=$(length($0)) stores field number length($0) (most likely empty) rather than the length; use thelength=length($0)
awk variables do not take a leading $ when being referenced (that, too, would be treated as a field reference); use print ... ,thelength
So your code becomes:
#!/bin/ksh
awk ' BEGIN {FS=";"}
{
    if (NR==1)
        if(length($0)!=10)
        {
            print(length($0))
            thelength=length($0)
            print "The length of the first line is: ",thelength;
            exit 1;
        }
}
END { print "STOP" }' $1
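If the end goal is to have the length in a ksh variable rather than an awk one, a minimal sketch using command substitution (still assuming the file name arrives as $1):
#!/bin/ksh
# let awk print the length of the first line, then capture it in a ksh variable
len=$(awk 'NR==1 { print length($0); exit }' "$1")
print "The length of the first line is: $len"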

gsub for substituting translations not working

I have a dictionary dict with records separated by ":" and data fields by new lines, for example:
:one
1
:two
2
:three
3
:four
4
Now I want awk to substitute all occurrences of each record's key in the input file, e.g.:
onetwotwotwoone
two
threetwoone
four
My first awk script looked like this and works just fine:
BEGIN { RS = ":" ; FS = "\n" }
NR == FNR {
    rep[$1] = $2
    next
}
{
    for (key in rep)
        gsub(key, rep[key])
    print
}
giving me:
12221
2
321
4
Unfortunately, another dict file contains characters that are special in regular expressions, so I have to escape them in my script. But after moving key and rep[key] into variables (which can then have escape characters inserted), the script only substitutes the second record in the dict. Why? And how can I solve it?
Here's the current second part of the script:
{
for (key in rep)
orig=key
trans=rep[key]
gsub(/[\]\[^$.*?+{}\\()|]/, "\\\\&", orig)
gsub(orig,trans)
print
}
All scripts are run by awk -f translate.awk dict input
Thanks in advance!
Your fundamental problem is using strings in regexp and backreference contexts when you don't want them, and then trying to escape the metacharacters in your strings to disable the behavior that you're enabling by using them in those contexts. If you want strings, use them in string contexts, that's all.
You don't want this:
gsub(regexp,backreference-enabled-string)
You want something more like this:
index(...,string) substr(string)
I think this is what you're trying to do:
$ cat tst.awk
BEGIN { FS = ":" }
NR == FNR {
    if ( NR%2 ) {
        key = $2
    }
    else {
        rep[key] = $0
    }
    next
}
{
    for ( key in rep ) {
        head = ""
        tail = $0
        while ( start = index(tail,key) ) {
            head = head substr(tail,1,start-1) rep[key]
            tail = substr(tail,start+length(key))
        }
        $0 = head tail
    }
    print
}
$ awk -f tst.awk dict file
12221
2
321
4
Never mind, sorry for asking....
Just some missing braces...?!
{
    for (key in rep)
    {
        orig=key
        trans=rep[key]
        gsub(/[\]\[^$.*?+{}\\()|]/, "\\\\&", orig)
        gsub(orig,trans)
    }
    print
}
works like a charm.
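One caveat worth noting with this version: in gsub()'s replacement string, & stands for the matched text, so if a dict value can itself contain &, it needs escaping too. A sketch of the same loop with that extra step:
{
    for (key in rep)
    {
        orig = key
        trans = rep[key]
        gsub(/[\]\[^$.*?+{}\\()|]/, "\\\\&", orig)   # escape regexp metacharacters in the key
        gsub(/&/, "\\\\&", trans)                    # turn each & into \& so it stays a literal ampersand
        gsub(orig, trans)
    }
    print
}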

Awk: replace nth field with a blank value

I have the below file with 100s of entries, and I want to replace the 46th pipe-delimited field (the N) with a blank using an awk command on a Unix box. Does anyone know the best way to do this?
TESTENTRY1||||||N|Y|N|OFF||N||||N|L|N|0|N|0|N|N||||A|0||0||N|N|N|Y|N||0|N|N||0|||N|N|N|N|N
TESTENTRY2||||||N|Y|N|OFF||N||||N|L|N|0|N|0|N|N||||A|0||0||N|N|N|Y|N||0|N|N||0|||N|N|N|N|N
So it looks like the below:
TESTENTRY1||||||N|Y|N|OFF||N||||N|L|N|0|N|0|N|N||||A|0||0||N|N|N|Y|N||0|N|N||0|||N||N|N|N
TESTENTRY2||||||N|Y|N|OFF||N||||N|L|N|0|N|0|N|N||||A|0||0||N|N|N|Y|N||0|N|N||0|||N||N|N|N
$ awk 'BEGIN { FS=OFS="|" } { $46 = "" }1' nnn.txt
TESTENTRY1||||||N|Y|N|OFF||N||||N|L|N|0|N|0|N|N||||A|0||0||N|N|N|Y|N||0|N|N||0|||N||N|N|N
TESTENTRY2||||||N|Y|N|OFF||N||||N|L|N|0|N|0|N|N||||A|0||0||N|N|N|Y|N||0|N|N||0|||N||N|N|N
BEGIN { FS=OFS="|" } sets the input and output field separators to the vertical bar before the records are read.
{ $46 = "" } sets the 46th column to be empty in each record.
The trailing 1 prints the resulting record to the output.
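If the column to blank out ever varies, the field number can be passed in with -v instead of hard-coding $46; a small sketch using a hypothetical col variable:
awk -v col=46 'BEGIN { FS=OFS="|" } { $col = "" } 1' nnn.txt
Here $col is a field reference through a variable, so the same command blanks whichever column col names.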

Fields containing the field separator as text: how to apply awk correctly in this case?

I have a CSV-file similar to this test.csv file:
Header 1; Header 2; Header 3
A;B;US
C;D;US
E;F;US
G;H;FR
I;J;FR
K;L;FR
M;"String with ; semicolon";UK
N;"String without semicolon";UK
O;"String OK";
P;"String OK";
Now, I want to split this file based on header 3. So I want to end up with four separate CSV files, one for "US", "FR", "UK", and "".
With my very limited Linux command-line skills (sadly :-( ), I have been using this line until now:
awk -F\; 'NR>1{ fname="country_yearly_"$3".csv"; print >>(fname); close(fname);}' test.csv
Of course, the experienced command-line users among you will notice my problem: one field in my test.csv contains rows in which the semicolon I use as a separator also appears inside fields marked with quotation marks (I can't guarantee that for sure because of the millions of rows, but I'm happy with an answer that assumes it). So, sadly, I get an additional file named country_yearly_ semicolon".csv, which contains this row in my example.
In my attempt to solve this issue, I came across this question on SO. In particular, Thor's answer seems to contain the solution to my problem: replacing all semicolons inside strings. I adjusted his code accordingly as follows:
awk -F'"' -v OFS='' '
NF > 1 {
for(i=2; i<=NF; i+=2) {
gsub(";", "|", $i);
$i = FS $i FS; # reinsert the quotes
}
print
}' test.csv > test1.csv
Now, I get the following test1.csv file:
M;"String with | semicolon";UK
N;"String without semicolon";UK
O;"String OK";
P;"String OK";
As you can see, all rows that have quotation marks are shown, and my problem line is fixed as well, but a) I actually want all rows, not only the ones with quotation marks, and I can't figure out which part of his code limits the output to rows with quotation marks, and b) I think it would be more efficient if test.csv were just changed in place instead of sending the output to a new file, but I don't know how to do that either.
EDIT in response to Birei's answer:
Unfortunately, my minimal example was too simple. Here is an updated version:
Header 1; Header 2; Header 3; Header 4
A;B;US;
C;D;US;
E;F;US;
G;H;FR;
I;J;FR;
K;L;FR;
M;"String with ; semicolon";UK;"Yet another ; string"
N;"String without semicolon";UK; "No problem here"
O;"String OK";;"Fine"
P;"String OK";;"Not ; fine"
Note that my real data has roughly 100 columns and millions of rows, and the country column, ignoring semicolons in strings, is column 13. However, as far as I can see, I can't make use of the fact that it's column 13 unless I get rid of the semicolons in strings first.
To split the file, you might just do:
awk -v FS=";" '{ CSV_FILE = "country_yearly_" $NF ".csv" ; print > CSV_FILE }' test.csv
This always takes the last field to construct the file name.
In your example, only lines with quotation marks are printed due to the NF > 1 pattern. The following script will print all lines:
awk -F'"' -v OFS='' '
NF > 1 {
for(i=2; i<=NF; i+=2) {
gsub(";", "|", $i);
$i = FS $i FS; # reinsert the quotes
}
}
{
# print all lines
print
}' test.csv > test1.csv
To do what you want, you could change the line in the script and reprocess it:
awk -F'"' -v OFS='' '
# Save the original line
{ ORIGINAL_LINE = LINE = $0 }
# Replace the semicolon inside quotes by a dummy character
# and put the resulting line in the LINE variable
NF > 1 {
LINE = ""
for(i=2; i<=NF; i+=2) {
gsub(";", "|", $i)
LINE = LINE $(i-1) FS $i FS # reinsert the quotes
}
# Add the end of the line after the last quote
if ( $(i+1) ) { LINE = LINE $(i+1) }
}
{
# Put the semicolon-separated fields in a table
# (the semicolon inside quotes have been removed from LINE)
split( LINE, TABLE, /;/ )
# Build the file name -- TABLE[ 3 ] is the 3rd field
CSV_FILE = "country_yearly_" TABLE[ 3 ] ".csv"
# Save the line
print ORIGINAL_LINE > CSV_FILE
}'
You were close to a solution. I would use the last field to avoid the problem of fields with double quotes. Also, there is no need to close each file: awk closes them automatically when the script ends.
awk '
BEGIN {
    FS = OFS = ";";
}
FNR > 1 {
    fname = "country_yearly_" $NF ".csv";
    print >>fname;
}
' infile
Check output:
head country_yearly_*
That yields:
==> country_yearly_.csv <==
O;"String OK";
P;"String OK";
==> country_yearly_FR.csv <==
G;H;FR
I;J;FR
K;L;FR
==> country_yearly_UK.csv <==
M;"String with ; semicolon";UK
N;"String without semicolon";UK
==> country_yearly_US.csv <==
A;B;US
C;D;US
E;F;US
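As an aside, and assuming GNU awk is available: gawk can define fields by their content with FPAT, which copes with quoted separators directly, and gawk 4.1+ can also edit a file in place with -i inplace (the asker's question b). A sketch for the original three-column file, where the country is $3; note that empty-field handling with FPAT differs between gawk versions, so test it before trusting it:
gawk 'BEGIN { FPAT = "(\"[^\"]*\")|([^;]*)" }   # a field is either a quoted string or semicolon-free text
FNR > 1 {
    fname = "country_yearly_" $3 ".csv"
    print >> fname
}' test.csv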

Awk command to insert corresponding line numbers except for blank lines

I'm doing an assignment at the moment and the question that's stumped me is:
"Write an awk command to insert the corresponding line number before
each line in the text file above. The blank line should NOT be
numbered in this case."
I have an answer, but I'm struggling to find the explanation of what each component does.
The command is:
awk '{print (NF? ++a " " :"") $0}' <textfile.txt>
I know that NF is the number of fields, and that $0 refers to the whole input record. I tried playing around with the command to find what does what, but it always seems to have syntax errors whenever I omit something.
So, my question is: what does each component do? What does ++a do? What about the ? after NF, and the bit with the quotation marks?
Thanks in advance!
The construct ... ? ... : ... is a ternary if-else. So it's the same as:
if ( NF > 0 ) {
    ++a;
    print a " " $0;
} else {
    print $0;
}
a is a variable that is only incremented when a line with fields is found.
print (NF? ++a " " :"") $0
A ternary operator has been used in your solution. For a blank line, NF will always be 0. So, with
cond ? true_case : false_case
if NF is > 0 it prints ++a followed by a space, or else it prints "".
++a increments a before it is printed, so the first non-blank line gets 1, and the updated value is used for the next non-blank line.
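An equivalent spelling of the same idea, as a sketch: modify the record for non-blank lines and let a trailing 1 print every line. NF is nonzero only for lines with fields, so blank lines pass through unnumbered:
awk 'NF { $0 = ++a " " $0 } 1' textfile.txt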
awk 'BEGIN{count=1}{if($0~/^$/){print}else{print count,$0;count++}}' your_file
tested below:
> cat temp.cc
int main ()
{
}
> awk 'BEGIN{count=1}{if($0~/^$/){print}else{print count,$0;count++}}' temp.cc
1 int main ()
2 {
3 }
>