EDIT: Thanks all of you. Python solution worked lightning-fast :)
I have a file that looks like this:
132,658,165,3216,8,798,651
but it's MUCH larger (~ 600 kB). There are no newlines, except one at the end of file.
And now, I have to sum all values that are there. I expect the final result to be quite big, but if I'd sum it in C++, I possess a bignum library, so it shouldn't be a problem.
How should I do that, and in what language / program? C++, Python, Bash?
Penguin Sed, "Awk"
sed -e 's/,/\n/g' tmp.txt | awk 'BEGIN {total=0} {total += $1} END {print total}'
Assumptions
Your file is tmp.txt (you can edit this obviously)
Awk can handle numbers that large
Python
sum(map(int,open('file.dat').readline().split(',')))
The language doesn't matter, so long as you have a bignum library. A rough pseudo-code solution would be:
str = ""
sum = 0
while input
    get character from input
    if character is a digit
        append character to back of str
    else if str is not empty
        convert str to number
        add number to sum
        str = ""
if str is not empty
    convert str to number
    add number to sum
output sum
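The pseudo-code above might look like this in Python. This is only a minimal sketch: the function name `sum_csv_stream` and the chunk size are my own choices for illustration, and it assumes the values are non-negative integers. It reads the input in chunks, so the whole file never has to sit in memory as one string, and Python's integers are arbitrary precision, so no separate bignum library is needed.

```python
import io

DIGITS = "0123456789"

def sum_csv_stream(stream, chunk_size=65536):
    """Sum comma-separated non-negative integers read from a text stream."""
    total = 0
    digits = []                      # characters of the number being read
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:                # end of input
            break
        for ch in chunk:
            if ch in DIGITS:
                digits.append(ch)
            elif digits:             # ',' or newline ends the current number
                total += int("".join(digits))
                digits = []
    if digits:                       # last number when no separator follows it
        total += int("".join(digits))
    return total

# The sample line from the question, fed in through an in-memory stream.
print(sum_csv_stream(io.StringIO("132,658,165,3216,8,798,651\n")))  # 5628
```

For a real file, something like `with open("file.dat") as f: print(sum_csv_stream(f))` should do the same job.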
If all of the numbers are smaller than (2**64)/600000 (which still has 14 digits), an 8 byte datatype like "long long" in C will be enough. The program is pretty straight-forward, use the language of your choice.
Since it's expensive to treat that large an input as a whole, I suggest you take a look at this post. It explains how to write a generator for string splitting. It's in C#, but it is well suited for crunching through that kind of input.
If you are worried that the total sum will not fit in an integer (say 32-bit), you can just as easily implement a bignum yourself, especially if you only need integers and addition. Just carry the overflow out of bit 31 into the next dword and keep adding.
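To make the carry idea concrete, here is a rough sketch of addition on 32-bit "dwords". Python is used only for illustration, since it already has big integers built in; the names `add_small` and `to_int` are hypothetical, and the number is stored as a little-endian list of 32-bit limbs.

```python
MASK32 = 0xFFFFFFFF

def add_small(limbs, value):
    """Add a non-negative value to a little-endian list of 32-bit limbs, in place."""
    carry = value
    i = 0
    while carry:
        if i == len(limbs):
            limbs.append(0)          # grow the number by one dword
        s = limbs[i] + carry
        limbs[i] = s & MASK32        # keep the low 32 bits in this dword
        carry = s >> 32              # whatever overflowed goes to the next dword
        i += 1

def to_int(limbs):
    """Convert the limbs back to a Python int, only to check the result."""
    return sum(limb << (32 * i) for i, limb in enumerate(limbs))

acc = [0]
for n in [4000000000, 4000000000, 4000000000]:
    add_small(acc, n)
print(acc, to_int(acc))
```

In C or C++ the same loop would use a 64-bit temporary for `s` so that the overflow out of bit 31 is not lost.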
If precision isn't important, just accumulate the result in a double. That should give you plenty of range.
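A small illustration of what "if precision isn't important" means here: an IEEE 754 double represents integers exactly only up to 2**53, so beyond that small addends can be lost. The helper name `sum_as_float` is made up for this sketch.

```python
def sum_as_float(values):
    """Accumulate the values in a double, as the suggestion above describes."""
    total = 0.0
    for v in values:
        total += v
    return total

values = [2**53, 1, 1]
print(sum(values))           # exact integer arithmetic
print(sum_as_float(values))  # the two 1s are rounded away
```

The range of a double is more than enough for this file; the only thing given up is exactness of the low digits.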
http://www.koders.com/csharp/fid881E3E70CC37E480545A0C37C98BC8C208B06723.aspx?s=datatable#L12
A fast C# CSV parser. I've seen it crunch through a few thousand 1 MB files rather quickly; I have it running as part of a service that consumes about 6000 files a month.
No need to reinvent a fast wheel.
Python can handle big integers.
tr "," "\n" < file | any old script for summing
Ruby is convenient, since it automatically handles big numbers. I can't remember if Awk does arbitrary-precision arithmetic, but if so, you could use
awk 'BEGIN {RS="," ; sum = 0 }
{sum += $1 }
END { print sum }' < file
Related
I am facing a case where I need to transform a string to an int equivalent with gawk5.
This transformation must be deterministic.
My first, naive, approach is to convert each letter of the string to its equivalent position in the latin alphabet and then concat the results back into a string.
For example:
my_string = "AB"
A = 1
B = 2
my_int=12
However, this has several downsides:
Very long strings may generate an integer that goes beyond maximum integer size.
What to do in case of special characters, symbols, etc. ?
This requires me to hold a table of each character position in the alphabet.
So, basically, it's a no go.
What is a good and robust method to generate an integer from a string with gawk5 ?
PS: Some will comment that gawk may not be the tool for that, and they may be right and I am aware of that. But this is for a personal project that should include only awk if possible ;)
If your string contains only ASCII characters, no newlines, and if you use GNU awk, the following simply converts each character into its 3-digit ASCII code:
$ echo "abc" | awk -vFS= '
BEGIN {for(i=0;i<128;i++) c[sprintf("%c",i)]=i}
{for(i=1;i<=NF;i++) printf("%03d",c[$i])}'
097098099
Of course this expands the string by a factor of 3, which can be sub-optimal. If you know that your string contains only ASCII characters in the 32-127 range you can reduce this factor to 2:
$ echo "abc" | awk -vFS= '
BEGIN {for(i=32;i<128;i++) c[sprintf("%c",i)]=i-32}
{for(i=1;i<=NF;i++) printf("%02d",c[$i])}'
656667
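If you want to cross-check the awk output against another implementation, the 3-digit encoding can be sketched in Python as below. This is not part of the awk solution; `encode_ascii` and `decode_ascii` are hypothetical helper names, and the result is kept as a string of digits, like the awk output, so leading zeros are preserved.

```python
def encode_ascii(text):
    """Replace each ASCII character by its zero-padded 3-digit code."""
    if not text.isascii():
        raise ValueError("only ASCII input is handled in this sketch")
    return "".join(f"{ord(ch):03d}" for ch in text)

def decode_ascii(digits):
    """Inverse of encode_ascii: read the digits back three at a time."""
    return "".join(chr(int(digits[i:i + 3])) for i in range(0, len(digits), 3))

print(encode_ascii("abc"))                # 097098099
print(decode_ascii(encode_ascii("abc")))  # abc
```

Because every character always takes exactly three digits, the mapping is reversible, which is what makes it deterministic and collision-free.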
I've always contended that you should never use a range expression like:
/start/,/end/
in awk because although it makes the trivial case where you only want to print matching text including the start and end lines slightly briefer than the alternative*:
/start/{f=1} f{print; if (/end/) f=0}
when you want to tweak it even slightly to do anything else, it requires a complete re-write or results in duplicated or otherwise undesirable code. e.g. if you want to print the matching text excluding the range delimiters using the second form above you'd just tweak it to move the components around:
f{if (/end/) f=0; else print} /start/{f=1}
but if you started with /start/,/end/ you'd need to abandon that approach in favor of what I just posted or you'd have to write something like:
/start/,/end/{ if (!/start|end/) print }
i.e. duplicate the conditions which is undesirable.
Then I saw a question posted that required identifying the LAST end in a file and where a range expression was used in the solution and I thought it seemed like that might have some value (see https://stackoverflow.com/a/21145009/1745001).
Now, though, I'm back to thinking that it's just not worth bothering with range expressions at all and a solution that doesn't use range expressions would have worked just as well for that case.
So - does anyone have an example where a range expression actually adds noticeable value to a solution?
*I used to use:
/start/{f=1} f; /end/{f=0}
but too many times I found I had to do something additional when f is true and /end/ is found (or to put it another way ONLY do something when /end/ is found IF f were true) so now I just try to stick to the slightly less brief but much more robust and extensible:
/start/{f=1} f{print; if (/end/) f=0}
Interesting. I also often start with a range expression and then later on switch to using a variable..
I think a situation where this could be useful, aside from the pure range-only situations is if you want to print a match, but only if it lies in a certain range. Also because it is immediately obvious what it does. For example:
awk '/start/,/end/{if(/ppp/)print}' file
with this input:
start
dfgd gd
ppp 1
gfdg
fd gfd
end
ppp 2
ppp 3
start
ppp 4
ppp 5
end
ppp 6
ppp 7
gfdgdgd
will produce:
ppp 1
ppp 4
ppp 5
--
One could of course also use:
awk '/start/{f=1} /ppp/ && f; /end/{f=0}' file
But it is longer and somewhat less readable..
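For comparison, the behaviour of both awk commands can be spelled out with an explicit flag. This is a sketch in Python, not awk; the function name `in_range_matches` and its parameters are made up for the example.

```python
import re

def in_range_matches(lines, start=r"start", end=r"end", want=r"ppp"):
    """Return lines matching `want` that lie between a `start` and an `end` line."""
    inside = False
    out = []
    for line in lines:
        if not inside and re.search(start, line):
            inside = True            # range opens on the start line itself
        if inside:
            if re.search(want, line):
                out.append(line)
            if re.search(end, line):
                inside = False       # range closes after the end line
    return out

sample = """start
dfgd gd
ppp 1
gfdg
fd gfd
end
ppp 2
ppp 3
start
ppp 4
ppp 5
end
ppp 6
ppp 7
gfdgdgd""".splitlines()

print(in_range_matches(sample))  # ['ppp 1', 'ppp 4', 'ppp 5']
```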
While you are right that the /start/,/end/ range expression can easily be reimplemented with a conditional, it has many interesting use-cases where it is used on its own. As you observe, it might have little value for processing tabular data, the main but not only use case of awk.
So - does anyone have an example where a range expression actually adds noticeable value to a solution?
In the mentioned use-cases, the range expression improves legibility. Here are a few examples where the range expression accurately selects the text to be processed. These are only a handful of examples, but there are countless similar applications, demonstrating the incredible versatility of awk.
Filter logs within a time range
Assuming each log line starts with an ISO timestamp, the filter below selects all events in a given range of 1 hour:
awk '/^2015-06-30T12:00:00Z/,/^2015-06-30T13:00:00Z/'
Extract a document from a file
awk '/---- begin file.data ----/,/---- end file.data ----/'
This can be used to bundle resources with shell scripts (with cat), to extract parts of GPG-signed messages (prepared with --clearsign) or more generally of MIME-messages.
Process LaTeX files
The range pattern can be used to match LaTeX environments, so for instance we can select the abstracts of all articles in our directory:
awk '/begin{abstract}/,/end{abstract}/' *.tex
or all the theorems, to prepare a theorem database!
awk '/begin{theorem}/,/end{theorem}/' *.tex
or write a linter ensuring that theorems do not contain citations (if we regard this as bad style):
awk '
/begin{theorem}/,/end{theorem}/ { if(/\\cite{/) { c+= 1 } }
END { printf("There were %d bad-style citations.\n", c) }
'
or preprocess tables, etc.
I have a DCL script that creates a .txt file that looks something like this
something,somethingelse,00000004
somethingdifferent,somethingelse1,00000002
anotherline,line,00000015
I need to sort the file by the 3rd column highest to lowest
ex:
anotherline,line,00000015
something,somethingelse,00000004
somethingdifferent,somethingelse1,00000002
Is it best to use the sort command? If so, everything I've seen requires a position number; how can this be done if each line has a different start position?
If sort is a bad way to handle this, is there something else, or can I somehow handle this while writing the lines to the file?
I've only been working with VMS/DCL for a few weeks now, so I'm not familiar with all of the commands yet.
Thanks!
As you already noticed, the VMS sort expects fields with a fixed start position within a record. You can not specify a field by a separator. If you want to use the VMS sort you have to make sure your third field starts at the same column, for all records. In other words, you have to pad preceding fields. If you have control on how the file is created, this may work for you. If you don't or you don't know how big the string in front of the sort field will be, this may not be a workaround. Maybe changing the order of the fields is an option.
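The padding idea can be sketched as follows. Python is used here only to show the logic (in practice you would do the padding in whatever writes the file, e.g. your DCL script); `pad_fields` is a hypothetical helper, and it assumes every record has the same number of comma-separated fields.

```python
def pad_fields(lines, sep=","):
    """Pad every field with spaces so each column starts at a fixed position."""
    rows = [line.rstrip("\n").split(sep) for line in lines if line.strip()]
    if not rows:
        return []
    # Widest entry in each column; assumes all rows have the same field count.
    widths = [max(len(row[i]) for row in rows) for i in range(len(rows[0]))]
    return [sep.join(f.ljust(w) for f, w in zip(row, widths)) for row in rows]

sample = [
    "something,somethingelse,00000004",
    "somethingdifferent,somethingelse1,00000002",
    "anotherline,line,00000015",
]
padded = pad_fields(sample)
for line in padded:
    print(line)
```

After padding, the third field starts at the same column in every record, so a position-based sort key can be used.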
On the other hand, you may find GNV installed on your system. Then you can try to use its sort, which is a GNU style sort. That is, $ mcr gnv$gnu:[bin]sort -t, -k3 -r x.txt may get you the wanted results.
VMS Sort is indeed not really equipped for this.
Reformatting as you did is about the only way.
If you do not have access to GNV sort on the OpenVMS system, then perhaps you have, or can install, Perl? It is somewhat easier to install.
In perl there are of course many ways.
For example using an anonymous sort function ( $a is first arg, $b second; <> reads all input )
$ perl -e "print sort { 0+(split /,/,$b)[2] <=> 0+(split /,/,$a)[2]} <>" x.x
where the 0 + forces numeric evaluation. For (fixed length?) string compare use:
$ perl -e "print sort { (split /,/,$b)[2] cmp (split /,/,$a)[2]} <>" x.x
hth,
Hein.
I have made a program which takes a 1000 digit number as input.
It is fixed, so I put this input into the code file itself.
I would obviously be storing it as Integer type, but how do I do it?
I have tried the program by having 1000 digits in the same line. I know this is the worst possible code format! But it works.
How can I assign this number to the variable and split it across lines? I read somewhere something about eos? Ruby, end of what?
I was thinking that something similar to comments could be used here.
Help will be appreciated.
the basic idea is to make this work:
a=3847981438917489137897491412341234
983745893289572395725258923745897232
instead of something like this:
a=3847981438917489137897491412341234983745893289572395725258923745897232
Haskell doesn't have a way to split (non-String) literals across multiple lines. Since Strings are an exception, we can shoehorn in other literals by parsing a multiline String:
v = read
    "32456\
    \23857\
    \23545" :: Integer
Alternately, you can use list syntax if you think it's prettier:
v = read . concat $
    ["32456"
    ,"24357"
    ,"23476"
    ] :: Integer
The price you pay for this is that some work will be done (once) at runtime, namely, the parsing (e.g. read).
It's been quite a while since I last used D Programming Language, and now I'm using it for some project that involves scientific calculations.
I have a bunch of floating point data, but when I print them using writefln, I get results like: 4.62593E-172 which is a zero! How do I use string formatting % stuff to print such things as 0?
Right now I'm using a hack:
if( abs(a) < 0.0000001 )
    writefln(0);
else
    writefln(a);
it does the job, but I want to do it using the formatting operations, if possible.
UPDATE
Someone suggested writefln("%.3f", a), but the problem with it is that it prints needless extra zeros, i.e. 0 becomes 0.000 and 1.2 becomes 1.200.
Can I make it also remove the trailing zeros?
Short answer: This can't be done with printf format specifiers.
Since D uses the same formatting as C99's vsprintf(), you find your answer in this thread: Avoid trailing zeroes in printf()
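The usual workaround is to format with a fixed precision first and then strip the trailing zeros from the resulting string. Here is that idea as a rough sketch, written in Python rather than D purely for illustration; `format_compact` is a hypothetical helper name and the default of 3 decimal places is an assumption.

```python
def format_compact(x, places=3):
    """Format x with a fixed number of decimals, then drop trailing zeros."""
    text = f"{x:.{places}f}"
    if "." in text:
        # "1.200" -> "1.2", "0.000" -> "0." -> "0"
        text = text.rstrip("0").rstrip(".")
    # Tiny negative values would otherwise come out as "-0".
    return "0" if text == "-0" else text

print(format_compact(4.62593e-172))  # 0
print(format_compact(1.2))           # 1.2
print(format_compact(100.0))         # 100
```

The same two steps (format into a buffer, then trim the string) carry over to any language with printf-style formatting.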
Try something like
writefln("%.3f", a);
Federico's answer should work; for more information, check the format specifiers section.
I see you are currently using Phobos, however what you are trying to do is supported in Tango.
Stdout.formatln("{:f2}", 1.2);
will print "1.20"