How do I create a sub-array in awk? - awk

Given a list like:
Dog bone
Cat catnip
Human ipad
Dog collar
Dog collar
Cat collar
Human car
Human laptop
Cat catnip
Human ipad
How can I get results like this, using awk:
Dog bone 1
Dog collar 2
Cat catnip 2
Cat collar 1
Human car 1
Human laptop 1
Human ipad 2
Do I need a sub-array? It seems to me like I need an array of "owners" which is populated by arrays of "things."
I'd like to use awk to do this, as this is a sub-script of another program in awk, and for now, I'd rather not create a separate program.
By the way, I can already do it using sort and grep -c, and a few other pipes, but I really won't be able to do that on gigantic data files, as it would be too slow. Awk is generally much faster for this kind of thing, I'm told.
Thanks,
Kevin
EDIT: Be aware that the columns are actually not next to each other like this in the real file; they are more like column $8 and $11. I say this because I suppose if they were next to each other I could incorporate an awk regex like ~/Dog\ Collar/ or something. But I won't have that option. -thanks!

awk does not have multi-dimensional arrays, but you can manage by constructing 2D-ish array keys:
awk '{count[$1 " " $2]++} END {for (key in count) print key, count[key]}' | sort
which, from your input, outputs
Cat catnip 2
Cat collar 1
Dog bone 1
Dog collar 2
Human car 1
Human ipad 2
Human laptop 1
Here, I use a space to separate the key values. If your data contains spaces, you can use some other character that does not appear in your input. I typically use array[$a FS $b] when I have a specific single-character field separator, since that separator is guaranteed not to appear inside the field values.
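POSIX awk also has built-in sugar for exactly this: writing count[$1, $2] joins the subscripts with the built-in SUBSEP variable (by default "\034", a control character unlikely to occur in text data). A sketch of the same counting using that form, with split() to recover the fields afterwards:
awk '{ count[$1, $2]++ }
     END { for (k in count) { split(k, f, SUBSEP); print f[1], f[2], count[k] } }' file | sort
The same works for non-adjacent columns, e.g. count[$8, $11] for the real file mentioned in the edit.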

GNU Awk's (i, j) subscripts are just this kind of key concatenation under the hood, although recent gawk versions (4.0 and later) also support true arrays of arrays with the a[i][j] syntax.
I'd recommend learning Perl, which will feel fairly familiar to you if you like awk, and which supports true Lists of Lists. In general, Perl will take you much further than awk.
Re your comment:
I'm not trying to be superior. I understand you asked how to accomplish a task with a specific tool, awk. I did give a link to the documentation for simulating multi-dimensional arrays in awk. But awk doesn't do that task well, and it was effectively replaced by Perl nearly 20 years ago.
If you ask how to cross a lake on a bicycle, and I tell you it'll be easier in a boat, I don't think that's unreasonable. If I tell you it'll be easier to first build a bridge, or first invent a Star Trek transporter, then that would be unreasonable.

Related

How to use awk to read a dictionary and replace words in a file?

We have a source file ("source-A") that looks like this:
The container of white spirit was made of aluminium.
We will use an aromatic method to analyse properties of white spirit.
No one drank white spirit at stag night.
Many people think that a potato crisp is savoury, but some would rather eat mashed potato.
...
more sentences
Each sentence in "source-A" is on its own line and terminates with a newline (\n)
We have a dictionary/conversion file ("converse-B") that looks like this:
aluminium<tab>aluminum
analyse<tab>analyze
white spirit<tab>mineral spirits
stag night<tab>bachelor party
savoury<tab>savory
potato crisp<tab>potato chip
mashed potato<tab>mashed potatoes
"converse-B" is a two column, tab delimited file.
Each equivalence map (term-on-left<tab>term-on-right) is on its own line and terminates with a newline (\n)
How to read "converse-B", and replace terms in "source-A" where a term in "converse-B" column-1 is replaced with the term in column-2, and then write to an output file ("output-C")?
For example, the "output-C" would look like this:
The container of mineral spirits was made of aluminum.
We will use an aromatic method to analyze properties of mineral spirits.
No one drank mineral spirits at bachelor party.
Many people think that a potato chip is savory, but some would rather eat mashed potatoes.
The tricky part is the term potato.
If a "simple" awk solution cannot handle a singular term (potato) and a plural term (potatoes), we'll use a manual substitution method. The awk solution can skip that use case.
In other words, an awk solution can stipulate that it only works for an unambiguous word or a term composed of space separated, unambiguous words.
An awk solution will get us to a 90% completion rate; we'll do the remaining 10% manually.
sed probably suits better here, since this is only phrase/word replacement. Note that if the same words appear in multiple phrases, it's first come, first served; so change your dictionary order accordingly.
$ sed -f <(sed -E 's_(.+)\t(.+)_s/\1/\2/g_' dict) content
The container of mineral spirits was made of aluminum.
We will use an aromatic method to analyze properties of mineral spirits.
No one drank mineral spirits at bachelor party.
Many people think that a potato chip is savory, but some would rather eat mashed potatoes.
...
more sentences
The inner sed converts the dictionary entries into sed substitution expressions, and the main sed applies them (fed in via process substitution and -f) to the content.
NB: A production-quality script should take care of letter case and word boundaries to eliminate unwanted substring substitutions; both are ignored here.
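If it has to be awk after all, here is a minimal sketch of the same idea (file names as in the question), with the same caveats about case and word boundaries:
awk -F'\t' '
    NR == FNR { map[$1] = $2; next }                    # first file: load the dictionary
    { for (term in map) gsub(term, map[term]); print }  # then replace naively in each line
' converse-B source-A > output-C
Note that gsub() treats each term as a regular expression, and the order of a for (term in map) loop is unspecified, so unlike the sed version you cannot control which phrase wins when entries overlap.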

Difference between print, put and say?

In Perl 6, what is the difference between print, put and say?
I can see how print 5 is different, but put 5 and say 5 look the same.
put $a is like print $a.Str ~ "\n"
say $a is like print $a.gist ~ "\n"
put is more computer readable.
say is more human readable.
put 1 .. 8 # 1 2 3 4 5 6 7 8
say 1 .. 8 # 1..8
Learn more about .gist here.
———
More accurately, put and say append the value of the nl-out attribute of the output filehandle, which by default is \n. You can override it, though. Thanks Brad Gilbert for pointing that out.
Handy Perl 6 FAQ: How and why do say, put and print differ?
The most obvious difference is that say and put append a newline at the end of the output, and print does not.
But there's another difference: print and put convert their arguments to strings by calling the Str method on each item passed, while say uses the gist method instead. The gist method, which you can also write for your own classes, is intended to create a Str for human interpretation. So it is free to leave out information about the object deemed unimportant for understanding the essence of the object.
...
So, say is optimized for casual human interpretation, dd is optimized for casual debugging output and print and put are more generally suitable for producing output.
...

How can I make nawk evaluate two if conditions as one?

For the data set:
12345 78945
12345 45678
I need the output to be
12345 45678
Long story short, sometimes two values are representing the same object due to certain practices instituted above my pay grade. So with a list of known value-pairs that represent the same object, I need awk to filter these from the output. In the above situation 12345 and 78945 represent the same object, so it should be filtered out. The remaining rows are errors that should be brought to my attention.
My attempted code
cat data | nawk '(($1!="12345")&&($2!="78945"))'
produces an empty set as output. So either I'm committing a logical error in my mind, or a syntactical one whereby nawk evaluates each condition individually rather than as one combined test, thus filtering out both lines since both fail the first condition.
I'm sure it's just my unfamiliarity with how nawk resolves such things. Thanks in advance for any assistance. And for reasons, this has to be done in nawk.
There is no line in your sample data for which $1!="12345" is true, so no condition of the form that && anything else can ever be true. Think about it. This has nothing to do with awk - it's simple boolean logic.
Try any of these instead, whichever you feel is clearer:
nawk '($1!="12345") || ($2!="78945")' data
nawk '!(($1=="12345") && ($2=="78945"))' data
nawk '($1=="12345") && ($2=="78945"){next} 1' data
nawk '($1" "$2) != "12345 78945"' data
nawk '!/^[ \t]*12345[ \t]+78945([ \t]|$)/' data
Also google UUOC to understand why I got rid of cat data |.
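If there are many such known value-pairs, you could keep them in a file (call it pairs - a hypothetical name - with one known pair per line, in the same two-column layout as the data) and have nawk load them first:
nawk 'NR==FNR { ok[$1 FS $2]; next } !(($1 FS $2) in ok)' pairs data
The first file populates an array of known-equivalent pairs; any data line whose two fields match a known pair is filtered out, and every remaining line is printed for your attention.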

Is a /start/,/end/ range expression ever useful in awk?

I've always contended that you should never use a range expression like:
/start/,/end/
in awk because although it makes the trivial case where you only want to print matching text including the start and end lines slightly briefer than the alternative*:
/start/{f=1} f{print; if (/end/) f=0}
when you want to tweak it even slightly to do anything else, it requires a complete re-write or results in duplicated or otherwise undesirable code. e.g. if you want to print the matching text excluding the range delimiters, using the second form above you'd just tweak it to move the components around:
f{if (/end/) f=0; else print} /start/{f=1}
but if you started with /start/,/end/ you'd need to abandon that approach in favor of what I just posted or you'd have to write something like:
/start/,/end/{ if (!/start|end/) print }
i.e. duplicate the conditions which is undesirable.
Then I saw a question posted that required identifying the LAST end in a file and where a range expression was used in the solution and I thought it seemed like that might have some value (see https://stackoverflow.com/a/21145009/1745001).
Now, though, I'm back to thinking that it's just not worth bothering with range expressions at all and a solution that doesn't use range expressions would have worked just as well for that case.
So - does anyone have an example where a range expression actually adds noticeable value to a solution?
*I used to use:
/start/{f=1} f; /end/{f=0}
but too many times I found I had to do something additional when f is true and /end/ is found (or, to put it another way, ONLY do something when /end/ is found IF f is true), so now I just try to stick to the slightly less brief but much more robust and extensible:
/start/{f=1} f{print; if (/end/) f=0}
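For instance (a hypothetical tweak), printing a separator after each block only touches the /end/ branch, where the range form would need restructuring:
/start/{f=1} f{print; if (/end/) {f=0; print "----"}}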
Interesting. I also often start with a range expression and then later on switch to using a variable.
I think a situation where this could be useful, aside from the pure range-only situations, is if you want to print a match, but only if it lies in a certain range. It also helps that it is immediately obvious what it does. For example:
awk '/start/,/end/{if(/ppp/)print}' file
with this input:
start
dfgd gd
ppp 1
gfdg
fd gfd
end
ppp 2
ppp 3
start
ppp 4
ppp 5
end
ppp 6
ppp 7
gfdgdgd
will produce:
ppp 1
ppp 4
ppp 5
--
One could of course also use:
awk '/start/{f=1} /ppp/ && f; /end/{f=0}' file
But it is longer and somewhat less readable.
While you are right that the /start/,/end/ range expression can easily be reimplemented with a conditional, it has many interesting use-cases where it is used on its own. As you observe, it might have little value for processing tabular data, the main but not the only use case of awk.
So - does anyone have an example where a range expression actually adds noticeable value to a solution?
In the mentioned use-cases, the range expression improves legibility. Here are a few examples, where the range expression accurately selects the text to be processed. These are only a handful of examples, but there are countless similar applications, demonstrating the incredible versatility of awk.
Filter logs within a time range
Assuming each log line starts with an ISO timestamp, the filter below selects all events in a given range of 1 hour:
awk '/^2015-06-30T12:00:00Z/,/^2015-06-30T13:00:00Z/'
Extract a document from a file
awk '/---- begin file.data ----/,/---- end file.data ----/'
This can be used to bundle resources with shell scripts (with cat), to extract parts of GPG-signed messages (prepared with --clearsign) or more generally of MIME-messages.
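For instance, to pull out the signed body of a clearsigned message, up to the start of the signature block (the marker lines shown are the standard ones gpg --clearsign emits):
awk '/^-----BEGIN PGP SIGNED MESSAGE-----$/,/^-----BEGIN PGP SIGNATURE-----$/' message.asc
with message.asc standing in for whatever the clearsigned file is called.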
Process LaTeX files
The range pattern can be used to match LaTeX environments, so for instance we can select the abstracts of all articles in our directory:
awk '/begin{abstract}/,/end{abstract}/' *.tex
or all the theorems, to prepare a theorem database!
awk '/begin{theorem}/,/end{theorem}/' *.tex
or write a linter ensuring that theorems do not contain citations (if we regard this as bad style):
awk '
/begin{theorem}/,/end{theorem}/ { if(/\\cite{/) { c+= 1 } }
END { printf("There were %d bad-style citations.\n", c) }
'
or preprocess tables, etc.
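As a sketch of that last use-case (with the hypothetical counting rule that each data row is one &-separated line), counting the data rows across all tabular environments:
awk '/begin{tabular}/,/end{tabular}/ { if (/&/) n++ } END { printf("%d table rows\n", n) }' *.tex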

Summing values in one-line comma delimited file

EDIT: Thanks all of you. Python solution worked lightning-fast :)
I have a file that looks like this:
132,658,165,3216,8,798,651
but it's MUCH larger (~ 600 kB). There are no newlines, except one at the end of file.
And now, I have to sum all the values that are there. I expect the final result to be quite big, but if I were to sum it in C++, I have a bignum library, so it shouldn't be a problem.
How should I do that, and in what language / program? C++, Python, Bash?
Penguin Sed, "Awk"
sed -e 's/,/\n/g' tmp.txt | awk 'BEGIN {total=0} {total += $1} END {print total}'
Assumptions
Your file is tmp.txt (you can edit this obviously)
Awk can handle numbers that large
Python
sum(map(int,open('file.dat').readline().split(',')))
The language doesn't matter, so long as you have a bignum library. A rough pseudo-code solution would be:
str = ""
sum = 0
while input
get character from input
if character is not ','
append character to back of str
else
convert str to number
add number to sum
str = ""
output sum
If all of the numbers are smaller than (2**64)/600000 (which still has 14 digits), an 8-byte datatype like "long long" in C will be enough. The program is pretty straightforward; use the language of your choice.
Since it's expensive to treat that large an input as a whole, I suggest you take a look at this post. It explains how to write a generator for string splitting. It's in C#, but it's well suited for crunching through that kind of input.
If you are worried that the total sum will not fit in an integer (say 32-bit), you can just as easily implement a bignum yourself, especially if you only need integers and addition. Just carry bit 31 into the next dword and keep adding.
If precision isn't important, just accumulate the result in a double. That should give you plenty of range.
http://www.koders.com/csharp/fid881E3E70CC37E480545A0C37C98BC8C208B06723.aspx?s=datatable#L12
A fast C# CSV parser. I've seen it crunch through a few thousand 1 MB files rather quickly; I have it running as part of a service that consumes about 6000 files a month.
No need to reinvent a fast wheel.
Python can handle the big integers.
tr "," "\n" < file | any old script for summing
Ruby is convenient, since it automatically handles big numbers. I can't remember if Awk does arbitrary-precision arithmetic, but if so, you could use:
awk 'BEGIN {RS="," ; sum = 0 }
{sum += $1 }
END { print sum }' < file