I have a csv that I'm translating using an awk script.
One of the fields has white space that needs to be removed; however, the function I am currently using is not working.
grnd_tack_number = gsub(/ /, "", $13)
input = 487 060210996314
desired output= 487060210996314
Current output = 5
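gsub modifies its third argument in place and returns the number of substitutions it made, not the resulting string, so assign the field to the variable first and then call gsub on that variable. I suggest debugging with this code sample: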
{
grnd_tack_number = $13;
print grnd_tack_number;
gsub(/ /, "", grnd_tack_number);
print grnd_tack_number;
}
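The difference between gsub's return value and its in-place effect can be seen in a minimal sketch. It assumes the sample value from the question arrives on standard input as the whole record, rather than as field 13 of the real CSV, so that the example stays self-contained.

```shell
# gsub() edits its third argument in place and returns the number of
# replacements, so its return value is a count, not the cleaned string.
result=$(printf '487 060210996314\n' | awk '{
    grnd_tack_number = $0                 # work on a copy of the input
    n = gsub(/ /, "", grnd_tack_number)   # n is the number of spaces removed
    print n ":" grnd_tack_number
}')
echo "$result"
```

The number before the colon is what the original assignment stored; the text after it is the value that was actually wanted.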
I have this simple awk script with which I attempt to check the number of characters in the first line.
If the first line has more or less than 10 characters, I want to store the number
of characters in a variable.
Somehow the first print statement works, but storing that result in a variable doesn't.
Please help.
I tried removing the dollar sign, "thelength=(length($0))",
and removing the parentheses, "thelength=length($0)", but it doesn't print anything...
Thanks!
#!/bin/ksh
awk ' BEGIN {FS=";"}
{
if (NR==1)
if(length($0)!=10)
{
print(length($0))
thelength=$(length($0))
print "The length of the first line is: ",$thelength;
exit 1;
}
}
END { print "STOP" }' $1
Two issues, both from carrying ksh syntax over into awk:
Inside awk, $(...) is not command substitution but a field reference, so $(length($0)) means "the field whose number is length($0)"; use thelength=length($0)
awk variables are referenced without a leading $, and $thelength would again be a field reference; use print ... ,thelength
So your code becomes:
#!/bin/ksh
awk ' BEGIN {FS=";"}
{
if (NR==1)
if(length($0)!=10)
{
print(length($0))
thelength=length($0)
print "The length of the first line is: ",thelength;
exit 1;
}
}
END { print "STOP" }' $1
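To make the role of the dollar sign concrete, here is a small sketch on a made-up input line (the sample text and the variable n are illustrative, not from the question). It shows that $n selects a field while n is just the variable's value.

```shell
# In awk, "$" is the field operator: $n means "field number n".
# With n = 2, $n is the second field, while n is the number 2 itself.
out=$(printf 'alpha;beta;gamma\n' | awk -F';' '{
    n = 2
    thelength = length($0)     # plain assignment, no $ and no sub-shell
    print $n, n, thelength
}')
echo "$out"
```

In the original script, $thelength therefore asked for a field far beyond the end of the record, which is empty, and that is why nothing was printed after the label.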
I have the file below with hundreds of entries, in which I want to replace the 46th field (an N) with a blank, using an awk command on a Unix box. Does anyone know the best way to do this?
TESTENTRY1||||||N|Y|N|OFF||N||||N|L|N|0|N|0|N|N||||A|0||0||N|N|N|Y|N||0|N|N||0|||N|N|N|N|N
TESTENTRY2||||||N|Y|N|OFF||N||||N|L|N|0|N|0|N|N||||A|0||0||N|N|N|Y|N||0|N|N||0|||N|N|N|N|N
So it looks like the below:
TESTENTRY1||||||N|Y|N|OFF||N||||N|L|N|0|N|0|N|N||||A|0||0||N|N|N|Y|N||0|N|N||0|||N||N|N|N
TESTENTRY2||||||N|Y|N|OFF||N||||N|L|N|0|N|0|N|N||||A|0||0||N|N|N|Y|N||0|N|N||0|||N||N|N|N
$ awk 'BEGIN { FS=OFS="|" } { $46 = "" }1' nnn.txt
TESTENTRY1||||||N|Y|N|OFF||N||||N|L|N|0|N|0|N|N||||A|0||0||N|N|N|Y|N||0|N|N||0|||N||N|N|N
TESTENTRY2||||||N|Y|N|OFF||N||||N|L|N|0|N|0|N|N||||A|0||0||N|N|N|Y|N||0|N|N||0|||N||N|N|N
BEGIN { FS=OFS="|" } sets the input and output field separators to the vertical bar before the records are read.
{ $46 = "" } sets the 46th column to be empty in each record.
The trailing 1 prints the resulting record to the output.
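Why OFS has to be set along with FS is easier to see on a shorter record. The following sketch uses a made-up four-field line and blanks the third field, once with and once without OFS.

```shell
# Assigning to a field makes awk rebuild the record, joining the fields with OFS.
with_ofs=$(printf 'a|b|c|d\n' | awk 'BEGIN { FS = OFS = "|" } { $3 = "" } 1')
# With OFS left at its default (a single space), the bars are lost on rebuild.
without_ofs=$(printf 'a|b|c|d\n' | awk -F'|' '{ $3 = "" } 1')
echo "$with_ofs"
echo "$without_ofs"
```

The first command keeps the vertical bars and leaves an empty third field; the second prints the fields separated by spaces instead.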
I've been playing around with this for an hour trying to work out how to embed the removal of quotes from a specific field using AWK.
Basically, the file encapsulates text in quotes, but I want to use the second field to name the file and split them based on the first field.
ID,Name,Value1,Value2,Value3
1,"AAA","DEF",1,2
1,"AAA","GGG",7,9
2,"BBB","DEF",1,2
2,"BBB","DEF",9,0
3,"CCC","AAA",1,1
What I want to get out are three files, all with the header row named:
AAA [1].csv
BBB [2].csv
CCC [3].csv
I have got it all working, except for the fact that I can't for the life of me work out how to remove the quotes around the filename!!
So, this command does everything, except that the file is named with quotes around $2; I need to do some kind of transformation on $2 before it goes into evname. In the actual file, I want to keep the encapsulating quotes.
awk -F, 'NR==1{h=$0;next}!($1 in files){evname=$2" ["$1"].csv";files[$1]=1;print h>evname}{print > evname}' DataExtract.csv
I've tried to push a gsub into this, but I'm struggling to work out exactly how this should look.
This is I think as close as I have got, but it is just calling everything "2" for $2, I'm not sure if this means I need to do an escape of $2 somehow in the gsub, but trying that doesn't seem to be working, so I'm at a loss as to what I'm doing wrong.
awk -F, 'NR==1{h=$0;next}!($1 in files){evname=gsub(""\","", $2)" - Event ID ["$1"].csv";files[$1]=1;print h>evname}{print > evname}' DataExtract.csv
Any help greatly appreciated.
Thanks in advance!!
Gannon
If I understand what you are attempting correctly, then
awk -F, 'NR==1{h=$0;next}!($1 in files){stub=$2;gsub(/"/, "", stub); evname=stub" ["$1"].csv";files[$1]=1;print h>evname}{print > evname}' DataExtract.csv
should work. That is
NR == 1 {
    h = $0;
    next
}
!($1 in files) {
    stub = $2                      # <-- this is the new bit: make a working copy
                                   # of $2 (so that $2 is unchanged and the line
                                   # is not rebuilt with changes for printing),
    gsub(/"/, "", stub)            # remove the quotes from it, and
    evname = stub " [" $1 "].csv"  # use it to assemble the filename.
    files[$1] = 1;
    print h > evname
}
{
    print > evname
}
You can, of course, use
evname = stub " - Event ID [" $1 "].csv"
or any other format after the substitution (this one seems to be what you tried to get in your second code snippet).
The gsub function returns the number of substitutions made, not the result of the substitutions; that is why evname=gsub(""\","", $2)" - Event ID ["$1"].csv" does not work.
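The reason for running gsub on a copy rather than on $2 itself can be sketched with one row of the sample data. The variable names direct and copy are only for illustration.

```shell
line='1,"AAA","DEF",1,2'
# Editing $2 directly: the quotes vanish from the record, and $0 is rebuilt
# with the default OFS (a space) in place of the commas.
direct=$(printf '%s\n' "$line" | awk -F, '{ gsub(/"/, "", $2); print }')
# Editing a copy: the name is cleaned while $0 stays exactly as it was read.
copy=$(printf '%s\n' "$line" | awk -F, '{ stub = $2; gsub(/"/, "", stub); print stub "|" $0 }')
echo "$direct"
echo "$copy"
```

Since the question asks for the encapsulating quotes to be kept inside the output files, the copy is the safer choice.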
Things are always clearer with a little white space:
awk -F, '
NR==1 { hdr=$0; next }
!seen[$1]++ {
evname = $2
gsub(/"/,"",evname)
outfile = evname " [" $1 "].csv"
print hdr > outfile
}
{ print > outfile }
' DataExtract.csv
Aside: It's pretty unusual for someone to WANT to create files with spaces in their names given the complexity that introduces in any later scripts you write to process them. You sure you want to do that?
P.S. Here's the gawk version (gensub is a GNU awk extension), as suggested by @JID below:
awk -F, '
NR==1 { hdr=$0; next }
!seen[$1]++ {
outfile = gensub(/"/,"","g",$2) " [" $1 "].csv"
print hdr > outfile
}
{ print > outfile }
' DataExtract.csv
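Apply the gsub to a copy of $2 before you make the assignment (calling gsub on $2 itself would also strip the quotes from the printed record and rebuild it with OFS):
awk -F, 'NR==1{h=$0;next}
!($1 in files){
name=$2; gsub("\"","",name); # Add this line
evname=name" ["$1"].csv";files[$1]=1;print...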
I have a CSV-file similar to this test.csv file:
Header 1; Header 2; Header 3
A;B;US
C;D;US
E;F;US
G;H;FR
I;J;FR
K;L;FR
M;"String with ; semicolon";UK
N;"String without semicolon";UK
O;"String OK";
P;"String OK";
Now, I want to split this file based on header 3. So I want to end up with four separate CSV files, one for "US", "FR", "UK", and "".
With my very limited Linux command line skills (sadly :-(), I have used this line until now:
awk -F\; 'NR>1{ fname="country_yearly_"$3".csv"; print >>(fname); close(fname);}' test.csv
Experienced command line users will spot my problem: some quoted fields in test.csv contain the semicolon that I also use as the separator. (I can't guarantee that such semicolons only occur inside quoted fields, because there are millions of rows, but I'm happy with an answer that assumes this.) So, sadly, I get an additional file named country_yearly_ semicolon".csv, which contains this row from my example.
In my venture to solve this issue, I came across this question on SO. In particular, Thor's answer seems to contain the solution of my problem by replacing all semicolons in strings. I adjusted his code accordingly as follows:
awk -F'"' -v OFS='' '
NF > 1 {
for(i=2; i<=NF; i+=2) {
gsub(";", "|", $i);
$i = FS $i FS; # reinsert the quotes
}
print
}' test.csv > test1.csv
Now, I get the following test1.csv file:
M;"String with | semicolon";UK
N;"String without semicolon";UK
O;"String OK";
P;"String OK";
As you can see, all rows that have quotation marks are shown, and my problem line is fixed as well, but: a) I actually want all rows, not only those with quotation marks, and I can't figure out which part of his code limits the output to rows with quotation marks; and b) I think it would be more efficient to change test.csv itself instead of sending the output to a new file, but I don't know how to do that either.
EDIT in response to Birei's answer:
Unfortunately, my minimal example was too simple. Here is an updated version:
Header 1; Header 2; Header 3; Header 4
A;B;US;
C;D;US;
E;F;US;
G;H;FR;
I;J;FR;
K;L;FR;
M;"String with ; semicolon";UK;"Yet another ; string"
N;"String without semicolon";UK; "No problem here"
O;"String OK";;"Fine"
P;"String OK";;"Not ; fine"
Note that my real data has roughly 100 columns and millions of rows and the country column, ignoring semicolons in strings, is column 13. However, as far as I see it I can't use the fact that it's column 13 if I don't get rid of the semicolons in strings first.
To split the file, you might just do:
awk -v FS=";" '{ CSV_FILE = "country_yearly_" $NF ".csv" ; print > CSV_FILE }' test.csv
This always takes the last field to construct the file name.
In your example, only lines with quotation marks are printed due to the NF > 1 pattern. The following script will print all lines:
awk -F'"' -v OFS='' '
NF > 1 {
    for(i=2; i<=NF; i+=2) {
        gsub(";", "|", $i);
        $i = FS $i FS; # reinsert the quotes
    }
}
{
    # print all lines
    print
}' test.csv > test1.csv
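On point b) of the question: POSIX awk has no in-place mode, so a common pattern is to write to a temporary file and move it over the original once awk has succeeded. The sketch below assumes a throwaway directory from mktemp and a two-line sample file, so that it does not touch any real data; the .tmp suffix is an arbitrary choice.

```shell
workdir=$(mktemp -d)                       # scratch directory for the demo
printf 'A;B;US\nM;"x ; y";UK\n' > "$workdir/test.csv"

# Write the converted data to a temporary file, then replace the original
# only if awk exited successfully.
awk -F'"' -v OFS='' '
NF > 1 {
    for (i = 2; i <= NF; i += 2) {
        gsub(";", "|", $i)
        $i = FS $i FS                      # reinsert the quotes
    }
}
{ print }' "$workdir/test.csv" > "$workdir/test.csv.tmp" &&
    mv "$workdir/test.csv.tmp" "$workdir/test.csv"

result=$(cat "$workdir/test.csv")
echo "$result"
rm -rf "$workdir"
```

This is not faster than writing a new file, since the data is still copied once, but it leaves a single file behind under the original name.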
To do what you want, you could change the line in the script and reprocess it:
awk -F'"' -v OFS='' '
# Save the original line
{ ORIGINAL_LINE = LINE = $0 }
# Replace the semicolons inside quotes by a dummy character
# and put the resulting line in the LINE variable
NF > 1 {
    LINE = ""
    for(i=2; i<=NF; i+=2) {
        gsub(";", "|", $i)
        LINE = LINE $(i-1) FS $i FS # reinsert the quotes
    }
    # Add the rest of the line after the last quote; when the quotes
    # are balanced, the loop ends with i == NF + 1, so that is $(i-1)
    if ( i - 1 <= NF ) { LINE = LINE $(i-1) }
}
{
    # Put the semicolon-separated fields in a table
    # (the semicolons inside quotes have been replaced in LINE)
    split( LINE, TABLE, /;/ )
    # Build the file name -- TABLE[ 3 ] is the 3rd field
    CSV_FILE = "country_yearly_" TABLE[ 3 ] ".csv"
    # Save the line
    print ORIGINAL_LINE > CSV_FILE
}' test.csv
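The idea behind splitting on the quote character can be checked on a single line from the updated example. This is a minimal sketch, assuming balanced quotes and no escaped quotes inside a field; the names masked, part and cols are illustrative.

```shell
line='M;"String with ; semicolon";UK;"Yet another ; string"'
country=$(printf '%s\n' "$line" | awk -F'"' '{
    masked = ""
    for (i = 1; i <= NF; i++) {
        part = $i                    # copy, so the record itself is not rebuilt
        if (i % 2 == 0) {            # even-numbered pieces lie between two quotes
            gsub(/;/, "|", part)     # hide the semicolons inside the quoted text
            part = "\"" part "\""    # put the quotes back
        }
        masked = masked part
    }
    split(masked, cols, /;/)         # now every remaining ";" is a real separator
    print cols[3]
}')
echo "$country"
```

For the real data, cols[13] would take the place of cols[3], since the question states that the country is column 13.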
You were close to a solution. I would use the last field to avoid the problem of fields with double quotes. Also, there is no need to close each file: awk closes any open output files automatically when it exits.
awk '
BEGIN {
FS = OFS = ";";
}
FNR > 1 {
fname = "country_yearly_" $NF ".csv";
print >>fname;
}
' infile
Check output:
head country_yearly_*
That yields:
==> country_yearly_.csv <==
O;"String OK";
P;"String OK";
==> country_yearly_FR.csv <==
G;H;FR
I;J;FR
K;L;FR
==> country_yearly_UK.csv <==
M;"String with ; semicolon";UK
N;"String without semicolon";UK
==> country_yearly_US.csv <==
A;B;US
C;D;US
E;F;US