Sed/awk for string to integer conversion of a CSV column in shell

I need the 7th column of a CSV file converted from a float to an integer. It's a huge file and I don't want to use a while read loop for the conversion. Are there any shortcuts with awk?
Input:
"xx","x","xxxxxx","xxx","xx","xx"," 00000001.0000"
"xx","x","xxxxxx","xxx","xx","xx"," 00000002.0000"
"xx","x","xxxxxx","xxx","xx","xx"," 00000005.0000"
"xx","x","xxxxxx","xxx","xx","xx"," 00000011.0000"
Output:
"xx","x","xxxxxx","xxx","xx","xx","1"
"xx","x","xxxxxx","xxx","xx","xx","2"
"xx","x","xxxxxx","xxx","xx","xx","5"
"xx","x","xxxxxx","xxx","xx","xx","11"
I tried these and they worked, but is there anything simpler?
awk 'BEGIN {FS=OFS="\",\""} {$7 = sprintf("%.0f", $7)} 1' $test > $test1
awk '{printf("%s\"\n", $0)}' $test1

With your shown samples, please try the following awk program.
awk -v s1="\"" 'BEGIN{FS=OFS=","} {gsub(/"/, "", $NF); $NF = s1 ($NF + 0) s1} 1' Input_file
Explanation: set FS and OFS to , then, for each line, strip the double quotes from the last field, add 0 to it so that awk converts it to a number (which drops the leading zeros and the trailing .0000), wrap the result in " again, and print the line. Assigning to $NF makes awk rebuild the line with OFS.

Another simple awk solution:
awk 'BEGIN {FS=OFS="\",\""} {$NF = $NF+0 "\""} 1' file
"xx","x","xxxxxx","xxx","xx","xx","1"
"xx","x","xxxxxx","xxx","xx","xx","2"
"xx","x","xxxxxx","xxx","xx","xx","5"
"xx","x","xxxxxx","xxx","xx","xx","11"

awk 'BEGIN{FS=OFS=","} {gsub(/"/, "", $7); $7="\"" $7+0 "\""; print}' file
Output:
"xx","x","xxxxxx","xxx","xx","xx","1"
"xx","x","xxxxxx","xxx","xx","xx","2"
"xx","x","xxxxxx","xxx","xx","xx","5"
"xx","x","xxxxxx","xxx","xx","xx","11"
gsub(/"/, "", $7): removes all " from $7
$7+0: forces a numeric conversion, which reduces the number in $7 to its minimal representation (no leading zeros, no trailing .0000)
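As a small sketch of what the +0 trick does and does not do: it yields an integer only when the fractional part is zero, as in the question's data. If the column could hold values such as 2.7, int() truncates and sprintf("%.0f", ...) rounds instead. The helper name show_conversions is made up for this illustration.

```shell
# show_conversions is a hypothetical helper, not part of the answers above.
# It prints three conversions of its argument: x+0, int(x), sprintf("%.0f", x).
show_conversions() {
    printf '%s\n' "$1" | awk '{ print $1 + 0, int($1), sprintf("%.0f", $1) }'
}

show_conversions ' 00000011.0000'   # whole number: all three agree
show_conversions ' 00000002.7000'   # fractional part: the three differ
```

The first call prints 11 11 11 and the second prints 2.7 2 3, so for data that is not guaranteed to be whole numbers, pick int() or sprintf() depending on whether truncation or rounding is wanted.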

Related

How to use awk to filter newline as record separator and field separator

I have the following file :
Field1
UNIX - System V
Field2
32 bit
Field3
No
Here the field separator is a double newline, and the record separator is also a double newline. I want the output to be:
Field1 UNIX - System V
Field2 32 bit
On writing the following command:
awk 'BEGIN{ FS="\n"; RS="\n\n"} {print $1 $2}' ctemp.txt
I am not getting my desired output.
$ awk 'NF{printf "%s%s", $0, ((++c)%2 ? OFS : ORS)}' file
Field1 UNIX - System V
Field2 32 bit
Field3 No
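To see why this one-liner pairs the lines, here is a sketch that feeds it the sample with a blank line after every entry (the blank lines are an assumption about the original file layout, based on the "double newline" description). NF skips empty lines, and the counter c alternates the terminator between OFS (a space) and ORS (a newline). The function name pair_lines is only for illustration.

```shell
# pair_lines is a hypothetical wrapper around the one-liner above.
pair_lines() {
    # NF is 0 on blank lines, so they are skipped.
    # Odd-numbered non-blank lines end with OFS, even-numbered ones with ORS.
    awk 'NF { printf "%s%s", $0, ((++c) % 2 ? OFS : ORS) }'
}

printf 'Field1\n\nUNIX - System V\n\nField2\n\n32 bit\n\nField3\n\nNo\n' | pair_lines
```

Because blank lines are simply ignored, the same command works whether or not the entries are separated by empty lines.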
1st solution: Try the following, written and tested with GNU awk on the provided samples.
awk -v FS="\n" -v RS="^$" '{for(i=1;i<=NF;i+=4){print $i,$(i+2)}}' Input_file
2nd solution: The above does NOT deal with leading spaces on lines. If you want to remove those spaces (like the one before UNIX - System V), try the following.
awk -v FS="\n" -v RS="^$" '
BEGIN{
  OFS="\t"
}
{
  for(i=1;i<=NF;i+=4){
    sub(/^ +/,"",$i)
    sub(/^ +/,"",$(i+2))
    print $i,$(i+2)
  }
}
' Input_file
3rd solution: This should work in a non-GNU awk too; written and tested with the samples provided by the OP.
awk '
value==""{
  value=$0
  next
}
NF && value{
  sub(/^ +/,"")
  print value,$0
  value=""
}
' Input_file

Filtering rows based on column values of csv file

I have a dataset with 1000 rows and 10 columns. Here is the sample dataset
A,B,C,D,E,F,
a,b,c,d,e,f,
g,h,i,j,k,l,
m,n,o,p,q,r,
s,t,u,v,w,x,
From this dataset I want to copy the rows whose value in column A is 'a' or 'm' to a new CSV file. I also want the header to be copied.
I have tried using awk. It copied all the rows but not the header.
awk '{$1~/a//m/ print}' inputfile.csv > outputfile.csv
How can I copy the header also into the new outputfile.csv?
Thanks in advance.
Assuming your header is on the first row, try the following:
awk 'BEGIN{FS=OFS=","} FNR==1{print;next} $1 ~ /^a$|^m$/' Input_file > outputfile.csv
Or, as per Cyrus sir's comment, the following:
awk 'BEGIN{FS=OFS=","} FNR==1{print;next} $1 ~ /^(a|m)$/' Input_file > outputfile.csv
Or, as per Ed sir's comment, the following:
awk -F, 'NR==1 || $1~/^[am]$/' Input_file > outputfile.csv
Corrections made to the OP's attempt:
Set FS and OFS to , since the lines are comma-delimited.
Added the FNR==1 condition, which matches the first line and prints it, since we want the header in the output file; next then skips the remaining rules for that line.
Used a stricter regex for the first field's condition: $1 ~ /^a$|^m$/
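A quick way to see why the anchors matter is to compare an unanchored and an anchored match on a small sample. The sample below is made up for illustration (it adds a row whose first field is am, which should not be selected).

```shell
# Hypothetical sample: the third data row starts with "am" and should be rejected.
sample='A,B,C
a,b,c
am,x,y
m,n,o'

# Unanchored: matches any first field that merely contains a or m.
loose=$(printf '%s\n' "$sample" | awk -F, 'NR==1 || $1 ~ /a|m/')

# Anchored: the first field must be exactly a or exactly m.
strict=$(printf '%s\n' "$sample" | awk -F, 'NR==1 || $1 ~ /^(a|m)$/')

printf 'loose:\n%s\nstrict:\n%s\n' "$loose" "$strict"
```

The loose version keeps the am,x,y row, while the strict version keeps only the header and the a and m rows.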
This might work for you (GNU sed):
sed '1b;/^[am],/!d' oldFile >newFile
Always print the first line and delete any other line that does not begin with a, or m,.
Alternative:
awk 'NR==1 || /^[am],/' oldFile >newFile
With awk. Set the field separator (FS) to , and output the current row if it is the first row or if its first column is exactly a or m.
awk 'NR==1 || $1=="a" || $1=="m"' FS=',' in.csv >out.csv
Output to out.csv:
A,B,C,D,E,F,
a,b,c,d,e,f,
m,n,o,p,q,r,
$ awk -F, 'BEGIN{split("a,m",tmp); for (i in tmp) tgts[tmp[i]]} NR==1 || $1 in tgts' file
A,B,C,D,E,F,
a,b,c,d,e,f,
m,n,o,p,q,r,
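The lookup-array approach above scales to longer lists of wanted values. As a sketch, the list could be passed in from the shell instead of being hard-coded; the function name filter_by_first_field and the awk variable targets are assumptions made for this example.

```shell
# filter_by_first_field and "targets" are hypothetical names for illustration.
filter_by_first_field() {
    awk -F, -v targets="$1" '
        BEGIN {
            n = split(targets, tmp, ",")           # tmp[1]="a", tmp[2]="m"
            for (i = 1; i <= n; i++) tgts[tmp[i]]  # referencing an element creates the key
        }
        NR == 1 || ($1 in tgts)                    # header, or first field is a wanted key
    '
}

printf 'A,B,C,D,E,F,\na,b,c,d,e,f,\ng,h,i,j,k,l,\nm,n,o,p,q,r,\n' |
    filter_by_first_field 'a,m'
```

The in operator tests for an exact key, so no regex anchoring is needed and values containing regex metacharacters are matched literally.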
It appears that awk's default delimiter is whitespace.
You can change the delimiter by setting the FS variable:
awk 'BEGIN { FS = "," } ; { print $2 }'

Awk editing with field delimiter

Imagine if you have a string like this
Amazon.com Inc.:181,37:184,22
and you do awk -F':' '{print $1 ":" $2 ":" $3}' then it will output the same thing.
But can you declare $2 in this example so it only outputs 181 and not ,37?
Thanks in advance!
You can change the field separator so that it contains either : or ,, using a bracket expression:
awk -F'[:,]' '{ print $2 }' file
If you are worried that , may appear in the first field (which will break this approach), you could use split:
awk -F: '{ split($2, a, /,/); print a[1] }' file
This splits the second field on the comma and then prints the first part. Any other fields containing a comma are unaffected.
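To make the difference between the two approaches concrete, here is a sketch that runs both on the question's string and on a made-up line with a comma in the name field. The function names by_bracket_fs and by_split and the Acme line are assumptions for illustration.

```shell
# Hypothetical wrappers around the two commands above.
by_bracket_fs() { printf '%s\n' "$1" | awk -F'[:,]' '{ print $2 }'; }
by_split()      { printf '%s\n' "$1" | awk -F: '{ split($2, a, /,/); print a[1] }'; }

line='Amazon.com Inc.:181,37:184,22'
by_bracket_fs "$line"    # prints 181
by_split "$line"         # prints 181

# Made-up line with a comma inside the first field:
tricky='Acme, Inc.:181,37:184,22'
by_bracket_fs "$tricky"  # prints " Inc." because the comma in the name splits the field
by_split "$tricky"       # still prints 181
```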

Convert single column into three comma separated columns using awk

I have a single long column and want to reformat it into three comma separated columns, as indicated below, using awk or any Unix tool.
Input:
Xaa
Ybb
Mdd
Tmmn
UUnx
THM
THSS
THEY
DDe
Output:
Xaa,Ybb,Mdd
Tmmn,UUnx,THM
THSS,THEY,DDe
$ awk '{printf "%s%s",$0,NR%3?",":"\n";}' file
Xaa,Ybb,Mdd
Tmmn,UUnx,THM
THSS,THEY,DDe
How it works
For every line of input, this prints the line followed by, depending on the line number, either a comma or a newline.
The key part is this ternary statement:
NR%3?",":"\n"
This takes the line number modulo 3. If that is non-zero, then it returns a comma. If it is zero, it returns a newline character.
Handling files that end before the final line is complete
The above assumes that the number of lines in the file is an integer multiple of three. If it isn't, then we probably want to ensure that the last line ends with a newline. This can be done, as Jonathan Leffler suggests, using:
awk '{printf "%s%s",$0,NR%3?",":"\n";} END { if (NR%3 != 0) print ""}' file
If the final line is short of three columns, the above code will leave a trailing comma on the line. This may or may not be a problem. If we do not want the final comma, then use:
awk 'NR==1{printf "%s",$0; next} {printf "%s%s",(NR-1)%3?",":"\n",$0;} END {print ""}' file
Jonathan Leffler offers this slightly simpler alternative to achieve the same goal:
awk '{ printf("%s%s", pad, $1); pad = (NR%3 == 0) ? "\n" : "," } END { print "" }'
Improved portability
To support platforms which don't use \n as the line terminator, Ed Morton suggests:
awk -v OFS=, '{ printf("%s%s", pad, $1); pad = (NR%3?OFS:ORS)} END { print "" }' file
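The pad-based variants above hard-code three columns. As a rough sketch, the column count can be passed in with -v so the same script handles any width; the function name columnize and the awk variable n are illustrative names, not something from the answers above.

```shell
# columnize is a hypothetical helper: $1 is the number of columns, input on stdin.
columnize() {
    awk -v n="$1" -v OFS=, '
        # Print the pending separator first, then the line, so nothing trails.
        { printf "%s%s", pad, $0; pad = (NR % n ? OFS : ORS) }
        # Terminate the last row, but only if there was any input at all.
        END { if (NR) print "" }
    '
}

# Seven lines: the last row is short and ends without a trailing comma.
printf '%s\n' Xaa Ybb Mdd Tmmn UUnx THM THSS | columnize 3
```

With the seven-line input the output is Xaa,Ybb,Mdd then Tmmn,UUnx,THM then THSS on its own line.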
There is a tool for this. Use pr
pr -3ats,
3 columns width, across, suppress header, comma as separator.
xargs -n3 < file | awk -v OFS="," '{$1=$1} 1'
xargs uses echo as default action, $1=$1 forces rebuild of $0.
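The effect of $1=$1 can be checked in isolation. In this sketch the input line stands in for one line of xargs -n3 output; without the assignment awk prints $0 untouched, so OFS is never applied.

```shell
# One line as xargs -n3 would emit it (space-separated).
before=$(printf 'Xaa Ybb Mdd\n' | awk -v OFS=, '1')               # $0 is not rebuilt
after=$(printf 'Xaa Ybb Mdd\n' | awk -v OFS=, '{ $1 = $1 } 1')    # $0 is rebuilt with OFS

printf 'without $1=$1: %s\nwith $1=$1:    %s\n' "$before" "$after"
```

The first result is Xaa Ybb Mdd and the second is Xaa,Ybb,Mdd.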
Using only awk I would go with this (which is similar to what was proposed by @jonathan-leffler and @John1024)
{
    sep = NR == 1 ? "" : \
          (NR-1)%3 ? "," : \
          "\n"
    # Pass the data as an argument, not as the format string, so a % in the input is safe
    printf "%s%s", sep, $0
}
END {
    printf "\n"
}

awk ternary operator, count fs with ,

How to make this command line:
awk -F "," '{NF>0?$NF:$0}'
to print the last field of a line if NF>0, otherwise print the whole line?
Working data
bogota
dept math, bogota
awk -F, '{ print ( NF ? $NF : $0 ) }' file
Actually, you don't need the ternary operator for this; just use:
awk -F, '{print $NF}' file
This prints the last field: if a line has more than one field, it prints the last one; if it has only one field, it prints that field, which is the whole line. On an empty line NF is 0, so $NF is $0 and the empty line is printed as-is.
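One detail worth checking on the working data: with -F, the last field of the second line keeps the space that follows the comma. As a sketch, assuming that leading space is unwanted, the separator can be widened to a comma followed by optional spaces.

```shell
data='bogota
dept math, bogota'

# Plain comma separator: the second line yields " bogota" (leading space kept).
plain=$(printf '%s\n' "$data" | awk -F, '{ print $NF }')

# Comma plus any following spaces as the separator: the space is consumed.
trimmed=$(printf '%s\n' "$data" | awk -F', *' '{ print $NF }')

printf 'plain:\n%s\ntrimmed:\n%s\n' "$plain" "$trimmed"
```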