I have a CSV file that contains values like these:
vm47,8,32794384Ki,16257320Ki
vm47,8,30223304245,15223080Ki
vm48,8,32794384Ki,16257312Ki
vm48,8,30223304245,15223072Ki
vm49,8,32794384Ki,16257320Ki
vm49,8,30223304245,15223080Ki
Columns 3 and 4 are memory values expressed either in bytes or kibibytes. The problem is that the "Ki" suffix appears inconsistently throughout the CSV file, particularly in column 3.
To make the file consistent, I need to convert everything to bytes: any value with a trailing "Ki" needs its numeric part multiplied by 1024, with the result replacing the original XXXXXKi value.
The reason I want to do it with awk is that I am already using awk to generate that CSV, but I am happy to do it with sed too.
This is my code so far, but it's obviously wrong, as it multiplies every value in columns 3 and 4 by 1024 whether or not it ends in "Ki". I am not sure how to tell awk "if you see Ki at the end, then multiply by 1024".
kubectl describe node --context=$context| sed -E '/Name:|cpu:|ephemeral-storage:|memory:/!d' | sed 's/\s//g' | awk '
BEGIN {FS = ":"; OFS = ","}
{record[$1] = $2}
$1 == "memory" {print record["Name"], record["cpu"], record["ephemeral-storage"], record["memory"]}
' | awk -F, '{print $1,$2,$3,$3*1024,$4,$4*1024}' >> describe_nodes.csv
Edit: I made a mistake: the factor should be 128, not 1024. (Strictly, 1 KiB = 1024 bytes; 128 = 1024/8 is the kibibit-to-byte factor, so presumably the suffixed values are really kibibits.)
"if you see Ki at the end, then multiply by 1024
You may use:
awk 'BEGIN{FS=OFS=","} $3 ~ /Ki$/ {$3 *= 1024} $4 ~ /Ki$/ {$4 *= 1024} 1' file
vm47,8,33581449216,16647495680
vm47,8,30223304245,15588433920
vm48,8,33581449216,16647487488
vm48,8,30223304245,15588425728
vm49,8,33581449216,16647495680
vm49,8,30223304245,15588433920
Or a bit shorter:
awk 'BEGIN{FS=OFS=","} {
  for (i=3; i<=4; ++i) ($i ~ /Ki$/) && ($i *= 1024)} 1' file
With your shown samples/attempts, please try the following awk code. A simple explanation: traverse the fields from the 3rd field onwards, and if a value ends in Ki (matched case-insensitively) multiply it by 128; then print all lines, edited or not.
awk 'BEGIN{FS=OFS=","} {for(i=3;i<=NF;i++){if($i~/[Kk][Ii]$/){$i *= 128}}} 1' Input_file
You could try numfmt:
$ numfmt -d, --field 3,4 --from=auto --to=none <<EOF
vm47,8,32794384Ki,16257320Ki
vm47,8,30223304245,15223080Ki
EOF
vm47,8,33581449216,16647495680
vm47,8,30223304245,15588433920
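To convert an existing file rather than a here-doc, redirect it in the same way (a sketch, assuming GNU coreutils; describe_nodes_bytes.csv is just a hypothetical output name, and --invalid=ignore keeps numfmt from aborting if a field is not a number):
$ numfmt -d, --field 3,4 --from=auto --to=none --invalid=ignore < describe_nodes.csv > describe_nodes_bytes.csv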
I would like to duplicate each line and print the values of columns 5 and 6 separately (i.e. transpose the values of columns 5 and 6 from columns to rows) for each line: the value of column 5 on the first copy of the line and the value of column 6 on the second.
Input File
08,1218864123180000,3201338573,VV,22,27
08,1218864264864000,3243738789,VV,15,23
08,1218864278580000,3244738513,VV,3,13
08,1218864310380000,3243938789,VV,15,23
08,1218864324180000,3244538513,VV,3,13
08,1218864334380000,3200538561,VV,22,27
Desired Output
08,1218864123180000,3201338573,VV,22
08,1218864123180000,3201338573,VV,27
08,1218864264864000,3243738789,VV,15
08,1218864264864000,3243738789,VV,23
08,1218864278580000,3244738513,VV,3
08,1218864278580000,3244738513,VV,13
08,1218864310380000,3243938789,VV,15
08,1218864310380000,3243938789,VV,23
08,1218864324180000,3244538513,VV,3
08,1218864324180000,3244538513,VV,13
08,1218864334380000,3200538561,VV,22
08,1218864334380000,3200538561,VV,27
I use this code to duplicate each line, but I can't figure out the condition on the values of columns 5 and 6:
awk '{print;print}' file
Thanks in advance
To repeatedly print the start of a line for each of the last N fields where N is 2 in this case:
$ awk -v n=2 '
    BEGIN { FS=OFS="," }
    {
        base = $0
        sub("("FS"[^"FS"]+){"n"}$", "", base)
        for (i=NF-n+1; i<=NF; i++) {
            print base, $i
        }
    }
' file
08,1218864123180000,3201338573,VV,22
08,1218864123180000,3201338573,VV,27
08,1218864264864000,3243738789,VV,15
08,1218864264864000,3243738789,VV,23
08,1218864278580000,3244738513,VV,3
08,1218864278580000,3244738513,VV,13
08,1218864310380000,3243938789,VV,15
08,1218864310380000,3243938789,VV,23
08,1218864324180000,3244538513,VV,3
08,1218864324180000,3244538513,VV,13
08,1218864334380000,3200538561,VV,22
08,1218864334380000,3200538561,VV,27
In this simple case, where the last field has to be removed and its value placed on the second output line, you can do
awk -F , -v OFS=, '{ x = $6; NF = 5; print; $5 = x; print }'
Here -F , and -v OFS=, will set the input and output field separators to a comma, respectively, and the code does
{
x = $6 # remember sixth field
NF = 5 # Set field number to 5, so the last one won't be printed
print # print those first five fields
$5 = x # replace value of fifth field with remembered value of sixth
print # print modified line
}
This approach can be extended to handle fields in the middle with a function like the one in the accepted answer of this question.
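For reference, here is a portable sketch of such a helper (the function name except is mine, not from that answer); it avoids the NF manipulation discussed in the edit below by building the output string explicitly:
awk -F , -v OFS=, '
# Hypothetical helper: return the current record with field n omitted.
function except(n,    i, s, sep) {
    s = sep = ""
    for (i = 1; i <= NF; i++)
        if (i != n) { s = s sep $i; sep = OFS }
    return s
}
{ print except(6); $5 = $6; print except(6) }' file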
EDIT: As Ed notes in the comments, writing to NF is not explicitly defined to trigger a rebuild of $0 (the whole-line record that print prints) in the POSIX standard. The above code works with GNU awk and mawk, but with BSD awk (as found on *BSD and probably Mac OS X) it fails to do anything.
So to be standards-compliant, we have to be a little more explicit and force awk to rebuild $0 from the modified field state. This can be done by assigning to any of the field variables $1...$NF, and it's common to use $1=$1 when this problem pops up in other contexts (for example: when only the field separator needs to be changed but not any of the data):
awk -F , -v OFS=, '{ x = $6; NF = 5; $1 = $1; print; $5 = x; print }'
I've tested this with GNU awk, mawk and BSD awk (which are all the awks I can lay my hands on), and I believe this to be covered by the awk bit in POSIX where it says "setting any other field causes the re-evaluation of $0" right at the top. Mind you, the spec could be more explicit on this point, and I'd be interested to test if more exotic awks behave the same way.
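As a quick illustration of the separator-changing use of $1 = $1 mentioned above:
$ echo 'a,b,c' | awk -F , -v OFS=';' '{ $1 = $1; print }'
a;b;c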
Could you please try the following (considering that your Input_file is always the same as shown, and that you need to print the first four fields every time, followed by each of the remaining fields one by one).
awk 'BEGIN{FS=OFS=","}{for(i=5;i<=NF;i++){print $1,$2,$3,$4,$i}}' Input_file
This might work for you (GNU awk):
awk '{print gensub(/((.*,).*),/,"\\1\n\\2",1)}' file
This replaces the last comma with a newline followed by a copy of everything up to and including the penultimate comma, so the final field ends up appended to its own copy of the leading fields.
I wrote a script for getting the MEAN and the STDEV from a data file.
Let's say the data file has this data:
1
2
3
4
5
The awk script looks like this
awk '{MEAN+=$1/5}END{print MEAN, STDEV=sqrt(($1-MEAN)**2/4)}' dat.dat>stat1.dat
but it gives me an incorrect value of STDEV=1. It should be 1.5811. Do you know what is wrong with my script? How could I improve it?
The problem with your script is that the END block runs only once, after the last line has been read, so $1 there is just the final value (5) and sqrt((5-3)**2/4) = 1. You can do the whole thing in one pass:
$ seq 5 | awk '{sum+=$1; sqsum+=$1^2}
END{mean=sum/NR;
print mean, sqrt((sqsum-NR*mean^2)/(NR-1))}'
3 1.58114
note that this uses the sample standard deviation definition (divide by N-1).
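The one-pass form works because of a standard identity: with $\bar{x} = \frac{1}{N}\sum_i x_i$,

\sum_{i=1}^{N} (x_i - \bar{x})^2
    = \sum_{i=1}^{N} x_i^2 - 2\bar{x}\sum_{i=1}^{N} x_i + N\bar{x}^2
    = \sum_{i=1}^{N} x_i^2 - N\bar{x}^2

which is exactly the (sqsum - NR*mean^2) term in the script above, divided by NR-1 and square-rooted.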
Could you please try the following and let me know if it helps (this should work on the provided data, and also if your actual file has more fields; note that it computes the mean and S.D. across the fields of each line, using the population S.D., i.e. dividing by NF).
awk '{sum="";for(i=1;i<=NF;i++){sum+=$i};mean=sum?sum/NF:0;sum="";for(j=1;j<=NF;j++){$j=($j-mean)*($j-mean);sum+=$j};print "Mean=",mean", S.D=",sqrt(sum/NF)}' Input_file
Adding a non-one-liner form of the solution as well.
awk '
{
  sum=""
  for(i=1;i<=NF;i++){ sum+=$i }
  mean=sum?sum/NF:0
  sum=""
  for(j=1;j<=NF;j++){
    $j=($j-mean)*($j-mean)
    sum+=$j
  }
  print "Mean=",mean", S.D=",sqrt(sum/NF)
}
' Input_file
EDIT: Adding code similar to the above, with one difference: a kind of exception handling, so that if the sum is zero it prints 0 instead.
awk '
{
  sum=""
  for(i=1;i<=NF;i++){ sum+=$i }
  mean=sum?sum/NF:0
  sum=""
  for(j=1;j<=NF;j++){
    $j=($j-mean)*($j-mean)
    sum+=$j
  }
  val=sum?sqrt(sum/NF):0
  print "Mean=",mean", S.D=",val
}
' Input_file
Even though the title and tag say awk, I wanted to add that calculating the mean and stdev for a column of data can be easily accomplished with datamash:
seq 1 5 | datamash mean 1 sstdev 1
3 1.5811388300842
It may be off-topic here (and I realize that programming simple tasks like this in awk can be a good learning opportunity), but I think datamash deserves some attention, especially for straightforward calculations such as this one. The documentation lists all the functions it can perform, with good examples for files with many columns. It is a fast and reliable alternative. Hope it helps!
Here is a two-pass streamable version:
parse.awk
# First-pass: sum the numbers
FNR == NR { sum += $1; next }
# After first pass: determine sample size (N) and mean
# Note: run only once because of the f flag
!f {
    N = NR-1       # Number of samples
    mean = sum/N   # The mean of the samples
    f = 1
}
# Second-pass: add the squares of the sample distance to mean
{ varsum += ($1 - mean)**2 }
END {
    # Sample standard deviation
    sstd = sqrt( varsum/(N-1) )
    print "Sample std: " sstd
}
Run it like this for a file (the Bash brace expansion file.dat{,} expands to file.dat file.dat, so the same file is read twice):
awk -f parse.awk file.dat{,}
Run it like this for streams:
awk -f parse.awk <(seq 5) <(seq 5)
Output in both cases:
Sample std: 1.58114
I have a file with the input text below (this is not the original file, just an example of the input text), and I want to replace every 2-letter string with the number 100. In this file the FS can be ":", "|" or " " (space); I have no choice but to treat all three of them as FS, and I want to preserve these field separators at their original positions in the output.
A:B C|D
AA:C EE G
BB|FF XX1 H
DD:MM:YY K
I have tried
awk -F"[:| ]" '{gsub(/[A-Z]{2}/,"100");print}'
but this does not seem to work. Please suggest.
Desired output:
A:B C|D
100:C 1000 G
100|100 1001 H
100:100:100 K
There is no functionality in POSIX awk to retain the strings that match the record separator defined by RS or the field-separator regexp defined by FS. Since in POSIX RS is just a string, there is no need for such functionality for records, and doing it for every string matching FS would be unnecessarily inefficient given how rarely it's needed.
With GNU awk, where RS can be a regexp rather than just a string, you can retain the string that matched RS with RT, but there is no functionality that retains the values that match FS, for the same efficiency reason that POSIX doesn't do it. Instead, GNU awk added a 4th arg to split() so you can retain the strings that match FS in an array yourself if you want them (seps[] below):
$ awk -v FS='[:| ]' '{
    split($0,flds,FS,seps)
    gsub(/[A-Z]{2}/,"100")
    for (i=1;i<=NF;i++) {
        printf "%s%s", $i, seps[i]
    }
    print ""
}' file
A:B C|D
100:C 100 G
100|100 1001 H
100:100:100 K
Look up split() in the GNU awk manual for more info.
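As a minimal illustration of that 4th argument (GNU awk only), using the first line of your input:
$ echo 'A:B C|D' | gawk '{
    n = split($0, flds, /[:| ]/, seps)
    for (i = 1; i <= n; i++)
        printf "flds[%d]=%s  seps[%d]=%s\n", i, flds[i], i, seps[i]
}'
flds[1]=A  seps[1]=:
flds[2]=B  seps[2]=
flds[3]=C  seps[3]=|
flds[4]=D  seps[4]=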
In this case:
sed 's/[A-Z]\{2\}/100/g' YourFile
awk '{gsub(/[A-Z]{2}/, "100"); print}' YourFile
There is no need for field splitting in this case: just change every group of two uppercase letters to "100", unless you have constraints beyond those in the OP (such as other elements in the string; in that case, specify what is possible and, ideally, add a sample of the expected result so it is unambiguous).
Note that your real data certainly has a lot more going on, so this code will certainly fail on input like ABC:DEF, changing it to 100C:100F, which is certainly not expected.
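If that is a concern, one possible guard (GNU awk only, and assuming a two-letter token never touches other word characters) is gawk's word-boundary operators; note that this would also leave XX1 alone rather than producing 1001, which may or may not be what the OP wants:
gawk '{ gsub(/\<[A-Z][A-Z]\>/, "100"); print }' YourFile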
In this case:
awk -F '[[:blank:]:|]+' '
{
    split($0, aS, /[^[:blank:]:|]+/)
    for (i=1; i<=NF; i++) {
        if ($i ~ /^[A-Z][A-Z]$/) $i = "100"
        printf("%s%s", $i, aS[i+1])
    }
    printf("\n")
}' YourFile
Give this sed one-liner a try:
kent$ sed -r 's/(^|[:| ])[A-Z][A-Z]([:| ]|$)/\1100\2/g' file
A:B C|D
100:C 100 G
100|FF XX1 H
100:MM:100 K
Note:
this searches for and replaces the pattern: exactly two [A-Z] between two delimiters (or the start/end of the line). Because each match consumes its surrounding delimiters, adjacent tokens such as FF right after BB| or MM between two colons are skipped in a single pass. If this is not exactly what you want, paste the desired output.
Your code seems to work just fine with my GNU awk:
A:B C|D
100:C 100 G # even the typo in this record got fixed.
100|100 1001 H
100:100:100 K
I'd say the problem is that the regex /[A-Z]{2}/ should be written /[A-Z][A-Z]/: interval expressions like {2} are not supported by every awk (older gawk needed --re-interval, and awks without interval support treat {2} as the literal characters {2}), so the explicit form is more portable.
I wanted to do some simple parsing of two files with IDs and corresponding numerical values, and I didn't want awk to print the numbers in scientific notation.
The file looks like this:
someid-1 860025 50.0401 4.00022
someid-2 384319 22.3614 1.78758
someid-3 52096 3.03118 0.242314
someid-4 43770 2.54674 0.203587
someid-5 33747 1.96355 0.156967
someid-6 20281 1.18004 0.0943328
someid-7 12231 0.711655 0.0568899
someid-8 10936 0.636306 0.0508665
someid-9 10224.8 0.594925 0.0475585
someid-10 10188.8 0.59283 0.047391
When I use print instead of printf:
awk 'BEGIN{FS=OFS="\t"} NR==FNR{x[$1]=$0;next} ($1 in x){split(x[$1],k,FS); print $1,k[2],k[3],k[4],$2,$3,$4}' OSCAo.txt dme_miRNA_PIWI_OSC.txt | sort -n -r -k 7 | head
I get this result:
dme-miR-iab-4-5p 0.333333 0.000016 0.000001 0.25 0.000605606 9.36543e-07
dme-miR-9c-5p 10987.300000 0.525413 0.048798 160.2 0.388072 0.000600137
dme-miR-9c-3p 731.986000 0.035003 0.003251 2.10714 0.00510439 7.89372e-06
dme-miR-9b-5p 30322.500000 1.450020 0.134670 595.067 1.4415 0.00222922
dme-miR-9b-3p 2628.280000 0.125684 0.011673 48 0.116276 0.000179816
dme-miR-9a-3p 10.365000 0.000496 0.000046 0.25 0.000605606 9.36543e-07
dme-miR-999-5p 103.433000 0.004946 0.000459 0.0769231 0.00018634 2.88167e-07
dme-miR-999-3p 1513.790000 0.072389 0.006723 28 0.0678278 0.000104893
dme-miR-998-5p 514.000000 0.024579 0.002283 73 0.176837 0.000273471
dme-miR-998-3p 3529.000000 0.168756 0.015673 42 0.101742 0.000157339
Notice the scientific notation in the last column
I understand that printf with an appropriate format modifier can do the job, but the code becomes very lengthy. I have to write something like this:
awk 'BEGIN{FS=OFS="\t"} NR==FNR{x[$1]=$0;next} ($1 in x){split(x[$1],k,FS); printf "%s\t%3.6f\t%3.6f\t%3.6f\t%3.6f\t%3.6f\t%3.6f\n", $1,k[2],k[3],k[4],$2,$3,$4}' file1.txt file2.txt > fileout.txt
This becomes clumsy when I have to parse fileout with another similarly structured file.
Is there any way to specify a default numerical output format, such that any string is printed as a string but all numbers follow a particular format?
I think you misinterpreted the meaning of %3.6f. The first number, before the decimal point, is the total field width, not the "number of digits before the decimal point" (see printf(3)).
So you should use %10.6f instead. It can be tested easily in bash:
$ printf "%3.6f\n%3.6f\n%3.6f" 123.456 12.345 1.234
123.456000
12.345000
1.234000
$ printf "%10.6f\n%10.6f\n%10.6f" 123.456 12.345 1.234
123.456000
 12.345000
  1.234000
You can see that the latter aligns on the decimal point properly.
As sidharth c nadhan mentioned, you can use awk's internal OFMT variable (see awk(1)). An example:
$ awk 'BEGIN{print 123.456; print 12.345; print 1.234}'
123.456
12.345
1.234
$ awk -vOFMT=%10.6f 'BEGIN{print 123.456; print 12.345; print 1.234}'
123.456000
 12.345000
  1.234000
As I can see in your example, the longest number can be 123456.1234567, so the format %15.7f covers them all and makes a nice-looking table.
But unfortunately it will not work if the number has no decimal point in it, or even if it does but the value is integral (i.e. ends with .0):
$ awk -vOFMT=%15.7f 'BEGIN{print 123.456;print 123;print 123.0;print 0.0+123.0}'
    123.4560000
123
123
123
I even tried gawk's strtonum() function, but integral values are still printed as plain integers rather than with OFMT. See:
awk -vOFMT=%15.7f -vCONVFMT=%15.7f 'BEGIN{print 123.456; print strtonum(123); print strtonum(123.0)}'
It has the same output as before.
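That is expected: when print converts a number to a string, OFMT is only used for values that are not exact integers; integral values are printed as if with %d. A quick check:
$ awk -v OFMT=%15.7f 'BEGIN{ x = 123.0; print x; print x + 0.5 }'
123
    123.5000000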
So I think you have to use printf anyway. The script can be a little bit shorter and a bit more configurable:
awk -vf='\t'%15.7f 'NR==FNR{x[$1]=sprintf("%s"f f f,$1,$2,$3,$4);next}$1 in x{printf("%s"f f f"\n",x[$1],$2,$3,$4)}' file1.txt file2.txt
The script will not work properly if there are duplicate IDs in the first file. If that does not happen, then the two conditions can be changed and the ;next can be left off.
awk 'NR==FNR{x[$1]=$0;next} ($1 in x){split(x[$1],k,FS); printf "%s\t%9s\t%9s\t%9s\t%9s\t%9s\t%9s\n", $1,k[2],k[3],k[4],$2,$3,$4}' file1.txt file2.txt > fileout.txt