awk to convert scientific notation in file to original integer in csv file

I am trying to use awk to convert all scientific notation to its original integer. The code below executes, but the scientific notation remains in the output.
awk
awk -F, '{
for (i=1;i<=NF;i++) if ($i+0 == $i && $i ~ /e/) $i = sprintf("%.0f", $i)
} 1' f.csv
f.csv
[Header],,
....
....
,,
[Data],,
ID,Number,Test
1,2.07E-11,a
2,2.07E-11,a
desired
[Header],,
....
....
,,
[Data],,
ID,Number,Test
1,207081000000,a
2,207081000000,a

A few issues with the current code and/or description:
trying to match E against e will always fail
input of 2.07E-11 becomes 0.0000000000207 (not 207081000000) and with a format of %.0f this would display as 0
even if the input is 2.07E+11 this should convert to 207000000000 and not 207081000000 (ie, where did the 81 come from? or is this a typo?)
output field delimiter has not been defined
why loop through all fields if only the 2nd field is of interest?
Modifying the input file:
$ cat f.csv
[Header],,
....
....
,,
[Data],,
ID,Number,Test
1,2.07E-11,a
2,2.07E+11,a
One awk idea:
awk '
BEGIN { FS=OFS="," }
{ if ($2+0 == $2 && $2 ~ /[eE]/)
      $2 = sprintf("%.0f", $2)
}
1' f.csv
This generates:
[Header],,
....
....
,,
[Data],,
ID,Number,Test
1,0,a
2,207000000000,a
Assuming the input may contain scientific values in different fields we can push our changes into OP's code ...
$ cat f.csv
[Header],,
....
....
,,
[Data],,
ID,Number,Test
1,2.07E-11,a
2,2.07E+11,a
3,skip_this_field,5.6789E+11
Updating OP's awk code:
awk '
BEGIN { FS=OFS="," }
{ for (i=1; i<=NF; i++)
      if ($i+0 == $i && $i ~ /[eE]/)
          $i = sprintf("%.0f", $i)
}
1' f.csv
This generates:
[Header],,
....
....
,,
[Data],,
ID,Number,Test
1,0,a
2,207000000000,a
3,skip_this_field,567890000000
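For a quick self-contained check of the field-looping version, a here-doc can stand in for f.csv:

```shell
# Self-contained check of the field-looping conversion; the here-doc
# stands in for f.csv. 2.07E-11 rounds to 0, 2.07E+11 to 207000000000.
out=$(awk '
BEGIN { FS=OFS="," }
{ for (i=1; i<=NF; i++)
      if ($i+0 == $i && $i ~ /[eE]/)
          $i = sprintf("%.0f", $i)
}
1' <<'EOF'
1,2.07E-11,a
2,2.07E+11,a
EOF
)
printf '%s\n' "$out"
```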

Related

Variable padding with printf and csv file

I'm aware that there are a lot of similar questions but I can't find an answer that relates to my particular use case, though I'll admit that this may be due to my lack of understanding more than anything else.
I am currently using the following format to print the contents of a csv file:
awk -F, 'BEGIN{printf "%-10s %-10s %-5s %-20s\n", "Col1","Col2","Col3","Col4"} {printf "%-10s %-10s %-5s %-20s\n",$1,$2,$3,$4}' test.csv
Instead of the columns being 10, 10, 5 and 20 I want them to use the variables $col1length, $col2length and so on. So if $col1length is 20 characters long then the width will format to being 20 characters wide. I'm new to this so I'm struggling to use answers from similar questions and change them to match my use case without breaking the command completely.
Edit: I'm not sure what else to add to address Ed Morton's request. My csv file contains data like the following:
longtextstring,databasename1,12345,2022-05-10T11:35:01.000Z
evenlongertextstring,databasename2,1234567,2022-05-10T11:21:02.000Z
shrtstring,db3,987,2022-05-10T10:15:11.000Z
anotheryveryveryverylongtextstring,dbname3,483,2022-04-17T01:45:09.000Z
I can get the length of the longest field in each column and have those saved to the variables named col1length to col2length. So I know that the length of the longest string in col1 is 34 characters.
Using the code above I can output with fixed width columns but I want to use the variables mentioned to dynamically change the column widths every time the code is run. I've attempted to solve the problem myself to no avail, this is why I am posting the question. I've tried to read similar questions and interpret those answers to solve my own issue but those attempts have resulted in errors such as:
$ awk -F, 'BEGIN{printf "%-*s %-10s %-5s %-20s\n", "Col1","Col2","Col3","Col4"} {printf "%-*s %-10s %-5s %-20s\n",$1,$2,$3,$4}' test.csv
awk: cmd. line:1: fatal: not enough arguments to satisfy format string
`%-*s %-10s %-5s %-20s
^ ran out for this one
When I try to replace the number with the variable I get the error:
$ awk -F, 'BEGIN{printf "%-$col1lengths %-10s %-5s %-20s\n", "Col1","Col2","Col3","Col4"} {printf "%-$col1lengths %-10s %-5s %-20s\n",$1,$2,$3,$4}' test.csv
awk: cmd. line:1: fatal: arg count with `$' must be > 0
The correct syntax to do what you were trying to do with your printf lines is either:
$ awk 'BEGIN{col1lengths=17; printf "%-*s %-5s\n", col1lengths, "Col1", "Col2"}'
Col1              Col2
or if you prefer:
$ awk 'BEGIN{col1lengths=17; printf "%-"col1lengths"s %-5s\n", "Col1", "Col2"}'
Col1              Col2
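Either form can also take the width from a shell variable via -v, which is what the question is driving at; a minimal sketch (the width 12 and the | markers are arbitrary, just to make the padding visible):

```shell
# Minimal sketch: pass a shell width variable into awk via -v and use it
# with %-*s for dynamic left-justified padding.
w=12
out=$(awk -v w="$w" 'BEGIN{printf "%-*s|%-*s|", w, "Col1", w, "Col2"}')
printf '%s\n' "$out"
```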
You didn't show the expected output yet so it's a guess but maybe this is what you're trying to do over all:
$ cat tst.awk
BEGIN {
    ARGV[ARGC++] = ARGV[1]
    FS = ","
}
NR == FNR {
    for ( i=1; i<=NF; i++ ) {
        wid = length($i)
        wids[i] = (wid > wids[i] ? wid : wids[i])
    }
    next
}
FNR == 1 {
    for ( i=1; i<=NF; i++ ) {
        hdr = "Col" i
        wid = length(hdr)
        wids[i] = (wid > wids[i] ? wid : wids[i])
        printf "%-*s%s", wids[i], hdr, (i<NF ? OFS : ORS)
    }
}
{
    for ( i=1; i<=NF; i++ ) {
        printf "%-*s%s", wids[i], $i, (i<NF ? OFS : ORS)
    }
}
$ awk -f tst.awk test.csv
Col1                               Col2          Col3    Col4
longtextstring                     databasename1 12345   2022-05-10T11:35:01.000Z
evenlongertextstring               databasename2 1234567 2022-05-10T11:21:02.000Z
shrtstring                         db3           987     2022-05-10T10:15:11.000Z
anotheryveryveryverylongtextstring dbname3       483     2022-04-17T01:45:09.000Z
but you may want to consider using column instead of having awk calculate the column widths:
$ awk -F',' 'NR==1{for (i=1;i<=NF;i++) printf "Col"i (i<NF?FS:ORS)} 1' test.csv |
column -s',' -o' ' -t
Col1                               Col2          Col3    Col4
longtextstring                     databasename1 12345   2022-05-10T11:35:01.000Z
evenlongertextstring               databasename2 1234567 2022-05-10T11:21:02.000Z
shrtstring                         db3           987     2022-05-10T10:15:11.000Z
anotheryveryveryverylongtextstring dbname3       483     2022-04-17T01:45:09.000Z

Awk column with pattern array

Is it possible to do this but use an actual array of strings where it says "array"
array=(cat
dog
mouse
fish
...)
awk -F "," '{ if ( $5!="array" ) { print $0; } }' file
I would like to use spaces in some of the strings in my array.
I would also like to be able to match partial matches, so "snow" in my array would match "snowman"
It should be case sensitive.
Example csv
s,dog,34
3,cat,4
1,african elephant,gd
A,African Elephant,33
H,snowman,8
8,indian elephant,3k
7,Fish,94
...
Example array
snow
dog
african elephant
Expected output
s,dog,34
H,snowman,8
1,african elephant,gd
Cyrus posted this, which works well, but it doesn't allow spaces in the array strings and won't match partial matches.
echo "${array[@]}" | awk 'FNR==NR{len=split($0,a," "); next} {for(i=1;i<=len;i++) {if(a[i]==$2){next}} print}' FS=',' - file
The brief approach using a single regexp for all array contents:
$ array=('snow' 'dog' 'african elephant')
$ printf '%s\n' "${array[@]}" | awk -F, 'NR==FNR{r=r s $0; s="|"; next} $2~r' - example.csv
s,dog,34
1,african elephant,gd
H,snowman,8
Or if you prefer string comparisons:
$ cat tst.sh
#!/bin/env bash
array=('snow' 'dog' 'african elephant')
printf '%s\n' "${array[@]}" |
awk -F',' '
NR==FNR {
    array[$0]
    next
}
{
    for (val in array) {
        if ( index($2,val) ) {  # or $2 ~ val for a regexp match
            print
            next
        }
    }
}
' - example.csv
$ ./tst.sh
s,dog,34
1,african elephant,gd
H,snowman,8
This prints every line of the csv file except those whose column 5 matches an element of the array:
echo "${array[@]}" | awk 'FNR==NR{len=split($0,a," "); next} {for(i=1;i<=len;i++) {if(a[i]==$5){next}} print}' FS=',' - file
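As a self-contained sanity check of the alternation-building approach from the first answer (a temp file stands in for example.csv; output order follows the file):

```shell
# Self-contained check of joining the array elements into one regexp
# ("snow|dog|african elephant"); a temp file stands in for example.csv.
tmp=$(mktemp)
cat >"$tmp" <<'EOF'
s,dog,34
3,cat,4
1,african elephant,gd
H,snowman,8
EOF
array=('snow' 'dog' 'african elephant')
out=$(printf '%s\n' "${array[@]}" |
  awk -F, 'NR==FNR{r=r s $0; s="|"; next} $2~r' - "$tmp")
rm -f "$tmp"
printf '%s\n' "$out"
```

Note "snowman" is kept via the partial regexp match against "snow", and "cat" is dropped.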

awk - AWK variable in if condition

How can a variable in an awk command be used to read a column of a file in an if condition?
For example, say we want to read column 2 of the sample file below, where fcolumn1 holds the value 2, startdate holds 2014-09-22 00:00:00, and enddate holds 2014-09-23 00:00:00.
abcd,2016-04-23 02:35:34,sdfsdfsd
sdsd,2016-04-22 02:35:34,sdfsdfsd
Below command works:
awk -v startdate="$startdate" -v enddate="$enddate" -F"," '
{
    if ($2 >= startdate && $2 < enddate)
    {
        print $2
    }
}'
Expectation is to make $2 as dynamic as below:
awk -v startdate="$startdate" -v enddate="$enddate" -v "fcolumn1=${fcolumn1}" -F"," '
{
    if (fcolumn1 != "")
    {
        if ($fcolumn1 >= startdate && $fcolumn1 < enddate)
        {
            print $fcolumn1
        }
    }
}'
First, the if block is superfluous since awk programs follow the following (simplified) structure:
CONDITION { ACTIONS } CONDITION {ACTIONS} ...
You can write the condition without the if statement:
awk '$2>=startdate && $2<enddate { print $2 }' file
If you want to make the actual column number configurable via a variable, note that you can address a column using a variable in awk, like this:
awk -v col=2 '{print $col}'
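Putting the two points together, a minimal self-contained sketch (the sample rows come from the question; the dates are adjusted here so that exactly one row falls inside the range):

```shell
# Minimal sketch: a dynamic column number passed via -v, used both in the
# condition and the action. Dates in yyyy-mm-dd HH:MM:SS format compare
# correctly as plain strings. The 2016 date range is made up for this test.
tmp=$(mktemp)
cat >"$tmp" <<'EOF'
abcd,2016-04-23 02:35:34,sdfsdfsd
sdsd,2016-04-22 02:35:34,sdfsdfsd
EOF
fcolumn1=2
startdate='2016-04-22 00:00:00'
enddate='2016-04-23 00:00:00'
out=$(awk -F, -v col="$fcolumn1" -v startdate="$startdate" -v enddate="$enddate" '
    col != "" && $col >= startdate && $col < enddate { print $col }' "$tmp")
rm -f "$tmp"
printf '%s\n' "$out"
```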

awk to print column headers if field has dot in it

In the tab-delimited file below I am trying to use awk to print the headers of the fields that contain a single . (dot). The other fields should not contain a ., and I am going to use another awk to detect their data type (either alpha or integer --- could be a decimal). The below seems close but is not working as expected. Thank you :).
file
Index HGMD Sanger Classification Pop
1 . . VUS .36
awk
awk -F'\t' '$2 && $3 ~ /./ && FNR == 1 {printf "dot detected in fields: ORS $0"}' file
Index HGMD Sanger Classification
desired output
dot detected in fields: HGMD, Sanger
Assuming you want the headers of the columns that have a single dot on any line (HGMD and Sanger here):
Index HGMD Sanger Classification Pop
1 . 2 VUS .36
1 . . VUS .36
One solution would be:
awk -F'\t' 'NR==1 { for (i=1; i<=NF; i++) headers[i] = $i }              # 1
    NR!=1 { for (i=1; i<=NF; i++) if ($i == ".") dots[i] = 1 }           # 2
    END { printf "Dots in fields: ";
        for (x in headers) if (dots[x]) printf "%s ", headers[x];        # 3
        printf "\n"
    }' file
(1) collect the headers from the first input line to array headers.
(2) On other input lines, compare the value to a single dot, and set the entry in array dots to record any found dots.
(3) Finally, print out headers of the columns with dots[i] set.
Output is Dots in fields: HGMD Sanger, i.e. they are only listed once.
The dot matches any character in a regex, so $3 ~ /./ in your snippet would be true if field 3 contained any character. Also, $2 && $3 ~ ... would first test field 2 for truthiness (an empty string is falsy), and then do the match on field 3.
Use an Awk as below
awk 'BEGIN{FS="\t"}NR==1{for(i=1;i<=NF;i++) header[i]=$i}{for(i=1;i<=NF;i++) { if (match($i,/^\.$/)) { print header[i] } } }' file
HGMD
Sanger
The idea is to get the header information from the first line hashed by index 1..n and when processing the actual lines, if the . is encountered, get the hashed value from the array and print it.
awk '
NR==1 { split($0,hdr); next }
{
    for (i=1; i<=NF; i++) {
        if ($i != ".") {
            delete hdr[i]
        }
    }
}
END {
    printf "dot detected in fields"
    for (i in hdr) {
        printf "%s %s", (c++ ? "," : ":"), hdr[i]
    }
    print ""
}
' file
dot detected in fields: HGMD, Sanger
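A self-contained run of the header-hashing idea, with printf supplying the tab delimiters that are hard to show inline:

```shell
# Self-contained run of the header-hashing approach: hash headers on
# line 1, then print the header of any later field equal to a lone dot.
# printf supplies the real tab delimiters.
out=$(printf 'Index\tHGMD\tSanger\tClassification\tPop\n1\t.\t.\tVUS\t.36\n' |
  awk 'BEGIN{FS="\t"}
       NR==1 { for (i=1; i<=NF; i++) header[i]=$i; next }
       { for (i=1; i<=NF; i++) if ($i == ".") print header[i] }')
printf '%s\n' "$out"
```

Note that .36 is not printed, since the string comparison only matches a field that is exactly one dot.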

How to split - awk

I was wondering if I can build two lists, one of the left-hand names and one of the right-hand values, after splitting each pair on '=', with the names then joined by '|' and the values by ','. I have tried but failed because the number of pairs is not fixed: even C16 can show up, which would mean 16 items in the input.
Can you give me any hint?
Input
C1=34,C2=35,C3="99"
Output
C1|C2|C3#34,35,"99"
You can pass multiple characters as the delimiter when using -F. The command could look like this:
awk -F'[,=]' '{printf "%s|%s|%s#%s,%s,%s\n", $1,$3,$5,$2,$4,$6}' input.txt
I'm using , and = as the delimiter. This makes it simple to access individual values and reassemble them using printf.
If the number of columns is unknown, you need to loop over the columns. First over the odd columns which are the names, then over the even columns which are the values. I suggest to put it into a script:
test.awk
BEGIN {
    FS = "[,=]"
}
{
    for (i=1; i<=NF; i+=2) {
        if (i >= NF-1) {
            fmt = "%s"
        } else {
            fmt = "%s|"
        }
        printf fmt, $i
    }
    printf "#"
    for (i=2; i<=NF; i+=2) {
        if (i >= NF-1) {
            fmt = "%s"
        } else {
            fmt = "%s,"
        }
        printf fmt, $i
    }
    printf "\n"
}
Then execute it like this:
awk -f test.awk input.txt
awk -F'[=,]' '
{
    for (i=1; i<=NF; i+=2) {
        printf "%s%s", $i, (i<(NF-1) ? "|" : "#")
    }
    for (i=2; i<=NF; i+=2) {
        printf "%s%s", $i, (i<NF ? "," : ORS)
    }
}
' file
C1|C2|C3#34,35,"99"
awk '{sub(/C1=34,C2=35,C3="99"/,"C1|C2|C3#34,35,\"99\"")}1' file
C1|C2|C3#34,35,"99"
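To see that the looping version really is agnostic to the number of pairs, here is a self-contained run (the second input line, with four pairs, is made up for the test):

```shell
# Check that the two-stride loops handle a varying number of pairs;
# the second input line (four pairs) is made up for this test.
out=$(printf '%s\n' 'C1=34,C2=35,C3="99"' 'A=1,B=2,C=3,D=4' |
  awk -F'[=,]' '{
      for (i=1; i<=NF; i+=2) printf "%s%s", $i, (i<(NF-1) ? "|" : "#")
      for (i=2; i<=NF; i+=2) printf "%s%s", $i, (i<NF ? "," : ORS)
  }')
printf '%s\n' "$out"
```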