llvm opt -O3 fail (?) - optimization

I need to identify integer variables which behave like boolean variables, that is, they can only have the values 0 or 1.
For that purpose, I modified the llvm bitcode to add an equivalent instruction to:
int tmp = someVar*(someVar-1);
I was hoping that the aggressive -O3 optimizations would identify tmp as the constant value 0. Here is a C version of the code I used:
int should_expand(char *s)
{
int tmp = 0;
int ret = 0;
char *p = s;
if (p && *p == '&')
{
ret = 1;
}
tmp = ret * (ret - 1);
return tmp;
}
When I examine the *.ll file I see that almighty clang 6.0.0 failed
to realize tmp is actually 0:
define i32 @should_expand(i8* readonly %s) local_unnamed_addr #0 {
entry:
%tobool = icmp eq i8* %s, null
br i1 %tobool, label %if.end, label %land.lhs.true
land.lhs.true: ; preds = %entry
%tmp = load i8, i8* %s, align 1, !tbaa !2
%cmp = icmp eq i8 %tmp, 38
%spec.select = zext i1 %cmp to i32
br label %if.end
if.end: ; preds = %land.lhs.true, %entry
%ret.0 = phi i32 [ 0, %entry ], [ %spec.select, %land.lhs.true ]
%sub = add nsw i32 %ret.0, -1
%tmp1 = sub nsw i32 0, %ret.0
%mul = and i32 %sub, %tmp1
ret i32 %mul
}
Does that make sense? Are there any external static analyzers I can use that interoperate smoothly with clang? Or is there any other trick I can use?
Thanks a lot!
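As a sanity check on the IR above, here is a small Python sketch (not part of the original post) that models the arithmetic with 32-bit wraparound. The names mul_form and and_form are made up for illustration: the first is the C expression, the second is what the emitted add/sub/and sequence computes.

```python
MASK = 0xFFFFFFFF  # model 32-bit wraparound arithmetic

def mul_form(ret):
    """tmp = ret * (ret - 1), truncated to 32 bits."""
    return (ret * (ret - 1)) & MASK

def and_form(ret):
    """What the emitted IR computes: (ret + -1) & (0 - ret)."""
    sub = (ret - 1) & MASK   # %sub  = add nsw i32 %ret.0, -1
    neg = (0 - ret) & MASK   # %tmp1 = sub nsw i32 0, %ret.0
    return sub & neg         # %mul  = and i32 %sub, %tmp1

for ret in (0, 1, 2):
    print(ret, mul_form(ret), and_form(ret))
```

Both forms give 0 for ret = 0 and ret = 1. For ret = 2 they differ (2 versus 0), so the and form is only equivalent when ret is 0 or 1. That suggests the optimizer had already worked out the range of %ret.0 and simply did not fold the final and.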

LLVM optimization passes break recursive code

I have a problem regarding some LLVM optimization passes, which modify the compilation output, so that it is not functional anymore.
This is the input source code of a Fibonacci algorithm:
f<int> fib(int n) {
if n <= 1 { return 1; }
return fib(n - 1) + fib(n - 2);
}
f<int> main() {
printf("Result: %d", fib(46));
return 0;
}
Without optimization, my compiler emits the following IR code, which works perfectly fine:
; ModuleID = 'Module'
source_filename = "Module"
@0 = private unnamed_addr constant [11 x i8] c"Result: %d\00", align 1
declare i32 @printf(i8*, ...)
define i32 @"fib(int)"(i32 %0) {
entry:
%n = alloca i32, align 4
store i32 %0, i32* %n, align 4
%result = alloca i32, align 4
%1 = load i32, i32* %n, align 4
%le = icmp sle i32 %1, 1
br i1 %le, label %then, label %end
then: ; preds = %entry
ret i32 1
br label %end
end: ; preds = %then, %entry
%2 = load i32, i32* %n, align 4
%sub = sub i32 %2, 1
%3 = call i32 @"fib(int)"(i32 %sub)
%4 = load i32, i32* %n, align 4
%sub1 = sub i32 %4, 2
%5 = call i32 @"fib(int)"(i32 %sub1)
%add = add i32 %3, %5
ret i32 %add
}
define i32 @main() {
main_entry:
%result = alloca i32, align 4
%0 = call i32 @"fib(int)"(i32 46)
%1 = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([11 x i8], [11 x i8]* @0, i32 0, i32 0), i32 %0)
ret i32 0
}
Then I applied a few optimization passes to it. Here is my chain of passes:
fpm->add(llvm::createDeadCodeEliminationPass());
fpm->add(llvm::createLoopDeletionPass());
fpm->add(llvm::createDeadStoreEliminationPass());
fpm->add(llvm::createGVNPass());
fpm->add(llvm::createPromoteMemoryToRegisterPass());
fpm->add(llvm::createInstructionCombiningPass());
fpm->add(llvm::createReassociatePass());
fpm->add(llvm::createCFGSimplificationPass()); // Breaks recursion
fpm->add(llvm::createCorrelatedValuePropagationPass());
fpm->add(llvm::createLoopSimplifyPass());
With these optimization passes enabled, I get the following IR code:
; ModuleID = 'Module'
source_filename = "Module"
@0 = private unnamed_addr constant [11 x i8] c"Result: %d\00", align 1
declare i32 @printf(i8*, ...)
define i32 @"fib(int)"(i32 %0) {
entry:
%le = icmp slt i32 %0, 2
%sub = add i32 %0, -1
%1 = call i32 @"fib(int)"(i32 %sub)
%sub1 = add i32 %0, -2
%2 = call i32 @"fib(int)"(i32 %sub1)
%add = add i32 %2, %1
ret i32 %add
}
define i32 @main() {
main_entry:
%0 = call i32 @"fib(int)"(i32 46)
%1 = call i32 (i8*, ...) @printf(i8* noundef nonnull dereferenceable(1) getelementptr inbounds ([11 x i8], [11 x i8]* @0, i64 0, i64 0), i32 %0)
ret i32 0
}
Obviously, this code produces a stack overflow, because the recursion anchor is gone. It seems like the CFGSimplificationPass merges blocks incorrectly or eliminates the if body although it is relevant. When I remove the createCFGSimplificationPass line, the optimizations work and the resulting executable runs fine.
Now my question: What am I doing wrong? Or is this maybe a bug in LLVM?
Thanks for your help!

then: ; preds = %entry
ret i32 1
br label %end
A block can't have two terminators, so this IR is invalid. This causes the optimizations to misbehave as they can't tell which one is the intended terminator.
To more easily catch errors like this in the future, you should use llvm::verifyModule to verify the IR you generate before you run additional passes on it.
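To make the rule concrete, here is a toy Python sketch of the "exactly one terminator, and it comes last" check. It is only an illustration over plain strings, not LLVM's verifier and not its API: the function name terminator_errors and the short TERMINATORS list are assumptions made for this example.

```python
TERMINATORS = ("ret", "br", "switch", "unreachable")  # small subset, for illustration

def terminator_errors(block_name, instructions):
    """Return a list of problems with where terminators sit in one block.

    `instructions` is a list of non-empty instruction strings in block order.
    """
    if not instructions:
        return [f"{block_name}: empty block"]
    errors = []
    last = len(instructions) - 1
    for pos, inst in enumerate(instructions):
        opcode = inst.split()[0]  # first token, e.g. "ret" or "%le"
        if opcode in TERMINATORS and pos != last:
            errors.append(f"{block_name}: terminator '{inst}' is not the last instruction")
        if pos == last and opcode not in TERMINATORS:
            errors.append(f"{block_name}: block does not end in a terminator")
    return errors

# The 'then' block from the question: a ret followed by a br.
print(terminator_errors("then", ["ret i32 1", "br label %end"]))
# The same block with the stray br removed.
print(terminator_errors("then", ["ret i32 1"]))
```

The first call reports the ret that is followed by another instruction; the second returns an empty list. In a real compiler the same effect comes from not emitting the br once the block already has a terminator.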

How to get the dynamic assigned heap address and malloc size with llvm pass instrumentation at runtime?

I want to traverse the basic blocks to get the malloc size argument and the returned address at runtime.
I insert a printf() call at every malloc() call site in the IR, hoping that it will print the malloc size at runtime.
In the example, the size is inst.getOperand(0); its value comes from scanf().
for (auto &BB : F) {
for (auto Inst = BB.begin(); Inst != BB.end(); Inst++) {
Instruction &inst = *Inst;
if(CallInst* call_inst = dyn_cast<CallInst>(&inst)) {
Function* fn = call_inst->getCalledFunction();
if(fn && fn->getName() == "malloc"){ // fn is null for indirect calls
/* do something to get heap address and malloc size*/
// for example
/* declare printf function */
IRBuilder<> builder(call_inst);
std::vector<llvm::Type *> putsArgs;
putsArgs.push_back(builder.getInt8Ty()->getPointerTo());
llvm::ArrayRef<llvm::Type*> argsRef(putsArgs);
/* declare a variable and assign, then puts args */
llvm::FunctionType *putsType =
llvm::FunctionType::get(builder.getInt64Ty(), argsRef, true);
llvm::Constant *putsFunc = M.getOrInsertFunction("printf", putsType);
Value *allocDeclrInt;
Value *RightValue = IntegerType::get(64, inst.getOperand(0));
StoreInst store=builder.CreateStore(RightValue,allocDeclrInt, false);
LoadInst *a = builder.CreateLoad(allocDeclrInt);
Value *intFormat = builder.CreateGlobalStringPtr("%d");
std::vector<llvm::Value *> values;
values.clear();
values.push_back(intFormat);
values.push_back(a);
//puts size
builder.CreateCall(putsFunc, values);
}
}
}
}
My test.c file contains:
int a=0;
scanf("%d",&a);
p1=(char*)malloc(a*sizeof(char));
The IR language:
%conv = sext i32 %29 to i64, !dbg !81
%a.size = alloca i32, !dbg !82
store i32 10, i32* %a.size, !dbg !82
%30 = load i32, i32* %a.size, !dbg !82
%31 = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([3 x i8], [3 x i8]* @0, i32 0, i32 0), i32 %30), !dbg !82
%32 = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([17 x i8], [17 x i8]* @1, i32 0, i32 0)), !dbg !82
%call1 = call i8* @malloc(i64 %conv), !dbg !82
Can I get the assigned size and heap address at runtime?

malloc() itself selects its address at runtime (and some implementations guarantee that the return value will vary each time the program is run), so if you want to get the heap address, you have to replace it with your own implementation of malloc.
Getting at the malloc size is easier: If callInst->getArgOperand(0) is a ConstantInt you have the size. If not you might be able to fold it, but that's perhaps beyond your interest?

How to implement the MapReduce example in Erlang efficiently?

I am trying to compare the performance of concurrent programming languages such as Haskell, Go and Erlang. The following Go code calculates the sum of squares below, repeating the calculation R times:
1^2 + 2^2 + 3^2 + ... + 1024^2
package main
import "fmt"
func mapper(in chan int, out chan int) {
for v := range in {out <- v*v}
}
func reducer(in1, in2 chan int, out chan int) {
for i1 := range in1 {i2 := <- in2; out <- i1 + i2}
}
func main() {
const N = 1024 // calculate sum of squares up to N; N must be power of 2
const R = 10 // number of repetitions to fill the "pipe"
var r [N*2]chan int
for i := range r {r[i] = make(chan int)}
var m [N]chan int
for i := range m {m[i] = make(chan int)}
for i := 0; i < N; i++ {go mapper(m[i], r[i + N])}
for i := 1; i < N; i++ {go reducer(r[i * 2], r[i *2 + 1], r[i])}
go func () {
for j := 0; j < R; j++ {
for i := 0; i < N; i++ {m[i] <- i + 1}
}
} ()
for j := 0; j < R; j++ {
<- r[1]
}
}
The following code is the MapReduce solution in Erlang. I am a newbie to Erlang. I would like to compare performance among Go, Haskell and Erlang. My question is how to optimize this Erlang code. I compile this code by using erlc -W mr.erl and run the code by using erl -noshell -s mr start -s init stop -extra 1024 1024. Are there any special compile and execution options available for optimizations? I really appreciate any help you can provide.
-module(mr).
-export([start/0, create/2, doreduce/2, domap/1, repeat/3]).
start()->
[Num_arg|Repeat] = init:get_plain_arguments(),
N = list_to_integer(Num_arg),
[R_arg|_] = Repeat,
R = list_to_integer(R_arg),
create(R, N).
create(R, Num) when is_integer(Num), Num > 0 ->
Reducers = [spawn(?MODULE, doreduce, [Index, self()]) || Index <- lists:seq(1, 2*Num - 1)],
Mappers = [spawn(?MODULE, domap, [In]) || In <- lists:seq(1, Num)],
reducer_connect(Num-1, Reducers, self()),
mapper_connect(Num, Num, Reducers, Mappers),
repeat(R, Num, Mappers).
repeat(0, Num, Mappers)->
send_message(Num, Mappers),
receive
{result, V}->
%io:format("Repeat: ~p ~p ~n", [0, V])
true
end;
repeat(R, Num, Mappers)->
send_message(Num, Mappers),
receive
{result, V}->
%io:format("Got: ~p ~p ~n", [R, V])
true
end,
repeat(R-1, Num, Mappers).
send_message(1, Mappers)->
D = lists:nth (1, Mappers),
D ! {mapper, 1};
send_message(Num, Mappers)->
D = lists:nth (Num, Mappers),
D ! {mapper, Num},
send_message(Num-1, Mappers).
reducer_connect(1, RList, Root)->
Parent = lists:nth(1, RList),
Child1 = lists:nth(2, RList),
Child2 = lists:nth(3, RList),
Child1 ! {connect, Parent},
Child2 ! {connect, Parent},
Parent !{connect, Root};
reducer_connect(Index, RList, Root)->
Parent = lists:nth(Index, RList),
Child1 = lists:nth(Index*2, RList),
Child2 = lists:nth(Index*2+1, RList),
Child1 ! {connect, Parent},
Child2 ! {connect, Parent},
reducer_connect(Index-1, RList, Root).
mapper_connect(1, Num, RList, MList)->
R = lists:nth(Num, RList),
M = lists:nth(1, MList),
M ! {connect, R};
mapper_connect(Index, Num, RList, MList) when is_integer(Index), Index > 0 ->
R = lists:nth(Num + (Index-1), RList),
M = lists:nth(Index, MList),
M ! {connect, R},
mapper_connect(Index-1, Num, RList, MList).
doreduce(Index, CurId)->
receive
{connect, Parent}->
doreduce(Index, Parent, 0, 0, CurId)
end.
doreduce(Index, To, Val1, Val2, Root)->
receive
{map, Val} ->
if Index rem 2 == 0 ->
To ! {reduce1, Val},
doreduce(Index, To, 0, 0, Root);
true->
To ! {reduce2, Val},
doreduce(Index, To, 0, 0, Root)
end;
{reduce1, V1} when Val2 > 0, Val1 == 0 ->
if Index == 1 ->% root node
Root !{result, Val2 + V1},
doreduce(Index, To, 0, 0, Root);
Index rem 2 == 0 ->
To ! {reduce1, V1+Val2},
doreduce(Index, To, 0, 0, Root);
true->
To ! {reduce2, V1+Val2},
doreduce(Index, To, 0, 0, Root)
end;
{reduce2, V2} when Val1 > 0, Val2 == 0 ->
if Index == 1 ->% root node
Root !{result, Val1 + V2},
doreduce(Index, To, 0, 0, Root);
Index rem 2 == 0 ->
To ! {reduce1, V2+Val1},
doreduce(Index, To, 0, 0, Root);
true->
To ! {reduce2, V2+Val1},
doreduce(Index, To, 0, 0, Root)
end;
{reduce1, V1} when Val1 == 0, Val2 == 0 ->
doreduce(Index, To, V1, 0, Root);
{reduce2, V2} when Val1 == 0, Val2 == 0 ->
doreduce(Index, To, 0, V2, Root);
true->
true
end.
domap(Index)->
receive
{connect, ReduceId}->
domap(Index, ReduceId)
end.
domap(Index, To)->
receive
{mapper, V}->
To !{map, V*V},
domap(Index, To);
true->
true
end.

Although this is not a good task for Erlang at all, there is a quite simple solution:
-module(mr).
-export([start/1, start/2]).
start([R, N]) ->
Result = start(list_to_integer(R), list_to_integer(N)),
io:format("~B x ~B~n", [length(Result), hd(Result)]).
start(R, N) ->
Self = self(),
Reducer = start(Self, R, 1, N),
[ receive {Reducer, Result} -> Result end || _ <- lists:seq(1, R) ].
start(Parent, R, N, N) ->
spawn_link(fun() -> mapper(Parent, R, N) end);
start(Parent, R, From, To) ->
spawn_link(fun() -> reducer(Parent, R, From, To) end).
mapper(Parent, R, N) ->
[ Parent ! {self(), N*N} || _ <- lists:seq(1, R) ].
reducer(Parent, R, From, To) ->
Self = self(),
Middle = ( From + To ) div 2,
A = start(Self, R, From, Middle),
B = start(Self, R, Middle + 1, To),
[ Parent ! {Self, receive {A, X} -> receive {B, Y} -> X+Y end end}
|| _ <- lists:seq(1, R) ].
You can run it using
$ erlc -W mr.erl
$ time erl -noshell -run mr start 1024 1024 -s init stop
1024 x 358438400
real 0m2.162s
user 0m4.177s
sys 0m0.151s
But most of that time is VM start and graceful stop overhead:
$ time erl -noshell -run mr start 1024 1024 -s erlang halt
1024 x 358438400
real 0m1.172s
user 0m4.110s
sys 0m0.150s
$ erl
1> timer:tc(fun() -> mr:start(1024,1024) end).
{978453,
[358438400,358438400,358438400,358438400,358438400,
358438400,358438400,358438400,358438400,358438400,358438400,
358438400,358438400,358438400,358438400,358438400,358438400,
358438400,358438400,358438400,358438400,358438400,358438400,
358438400,358438400,358438400,358438400|...]}
Keep in mind it is more like an elegant solution than an efficient one. An efficient solution should balance reduction tree branching with communication overhead.
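For readers who do not read Erlang, the shape of the reduction tree above can be sketched sequentially in Python. This is only an illustration of the range splitting, with no processes or message passing; the function name tree_sum_of_squares is made up for this example.

```python
def tree_sum_of_squares(lo, hi):
    """Sum n*n for lo <= n <= hi by halving the range, like the
    reducer/mapper tree above (but sequentially, in one process)."""
    if lo == hi:                 # leaf: plays the role of a mapper
        return lo * lo
    middle = (lo + hi) // 2      # same split as `( From + To ) div 2`
    left = tree_sum_of_squares(lo, middle)
    right = tree_sum_of_squares(middle + 1, hi)
    return left + right          # plays the role of a reducer

print(tree_sum_of_squares(1, 1024))
```

For N = 1024 this prints 358438400, the same value the Erlang program reports. The recursion depth is about log2(N), which is also the depth of the process tree in the Erlang version.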

Generate all combinations of a char array inside of a CUDA device kernel

I need help please. I started to program a common brute forcer / password guesser with CUDA (2.3 / 3.0beta).
I tried different ways to generate all possible plain text "candidates" of a defined ASCII char set.
In this sample code I want to generate all 74^4 possible combinations (and just output the result back to host/stdout).
$ ./combinations
Total number of combinations : 29986576
Maximum output length : 4
ASCII charset length : 74
ASCII charset : 0x30 - 0x7a
"0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy"
CUDA code (compiled with 2.3 and 3.0b - sm_10) - combinations.cu:
#include <stdio.h>
#include <cuda.h>
__device__ uchar4 charset_global = {0x30, 0x30, 0x30, 0x30};
__shared__ __device__ uchar4 charset[128];
__global__ void combo_kernel(uchar4 * result_d, unsigned int N)
{
int totalThreads = blockDim.x * gridDim.x ;
int tasksPerThread = (N % totalThreads) == 0 ? N / totalThreads : N/totalThreads + 1;
int myThreadIdx = blockIdx.x * blockDim.x + threadIdx.x ;
int endIdx = myThreadIdx + totalThreads * tasksPerThread ;
if( endIdx > N) endIdx = N;
const unsigned int m = 74 + 0x30;
for(int idx = myThreadIdx ; idx < endIdx ; idx += totalThreads) {
charset[threadIdx.x].x = charset_global.x;
charset[threadIdx.x].y = charset_global.y;
charset[threadIdx.x].z = charset_global.z;
charset[threadIdx.x].w = charset_global.w;
__threadfence();
if(charset[threadIdx.x].x < m) {
charset[threadIdx.x].x++;
} else if(charset[threadIdx.x].y < m) {
charset[threadIdx.x].x = 0x30; // = 0
charset[threadIdx.x].y++;
} else if(charset[threadIdx.x].z < m) {
charset[threadIdx.x].y = 0x30; // = 0
charset[threadIdx.x].z++;
} else if(charset[threadIdx.x].w < m) {
charset[threadIdx.x].z = 0x30;
charset[threadIdx.x].w++;; // = 0
}
charset_global.x = charset[threadIdx.x].x;
charset_global.y = charset[threadIdx.x].y;
charset_global.z = charset[threadIdx.x].z;
charset_global.w = charset[threadIdx.x].w;
result_d[idx].x = charset_global.x;
result_d[idx].y = charset_global.y;
result_d[idx].z = charset_global.z;
result_d[idx].w = charset_global.w;
}
}
#define BLOCKS 65535
#define THREADS 128
int main(int argc, char **argv)
{
const int ascii_chars = 74;
const int max_len = 4;
const unsigned int N = pow((float)ascii_chars, max_len);
size_t size = N * sizeof(uchar4);
uchar4 *result_d, *result_h;
result_h = (uchar4 *)malloc(size );
cudaMalloc((void **)&result_d, size );
cudaMemset(result_d, 0, size);
printf("Total number of combinations\t: %d\n\n", N);
printf("Maximum output length\t: %d\n", max_len);
printf("ASCII charset length\t: %d\n\n", ascii_chars);
printf("ASCII charset\t: 0x30 - 0x%02x\n ", 0x30 + ascii_chars);
for(int i=0; i < ascii_chars; i++)
printf("%c",i + 0x30);
printf("\n\n");
combo_kernel <<< BLOCKS, THREADS >>> (result_d, N);
cudaThreadSynchronize();
printf("CUDA kernel done\n");
printf("hit key to continue...\n");
getchar();
cudaMemcpy(result_h, result_d, size, cudaMemcpyDeviceToHost);
for (unsigned int i=0; i<N; i++)
printf("result[%06u]\t%c%c%c%c\n",i, result_h[i].x, result_h[i].y, result_h[i].z, result_h[i].w);
free(result_h);
cudaFree(result_d);
}
The code should compile without any problems, but the output is not what I expected.
On emulation mode:
CUDA kernel done
hit key to continue...
result[000000] 1000
...
result[000128] 5000
On release mode:
CUDA kernel done
hit key to continue...
result[000000] 1000
...
result[012288] 5000
I also tried __threadfence() and/or __syncthreads() on different lines of the code, also without success.
P.S. If possible, I want to generate everything inside the kernel function. I also tried pre-generating the possible plain-text candidates in the host main function and copying them to the device with memcpy, but this works only with a very limited charset size (because of limited device memory).
Any idea about the output, and why the values repeat (even with __threadfence() or __syncthreads())?
Is there any other method to generate plain-text candidates inside a CUDA kernel quickly :-) (~75^8)?
thanks a million
greets jan

Incidentally, your loop bound is overly complex. You don't need to do all that work to compute endIdx; instead you can do the following, which makes the code simpler:
for(int idx = myThreadIdx ; idx < N ; idx += totalThreads)

Let's see:
When filling your charset array, __syncthreads() will be sufficient, as you are not interested in writes to global memory (more on this later).
Your if statements are not correctly resetting your loop iterators:
If z < m, then both x == m and y == m, and both must be reset to 0.
Similarly for w.
Each thread is responsible for writing one set of 4 characters in charset, but every thread writes the same 4 values. No thread does any independent work.
You are writing each thread's results to global memory without atomics, which is unsafe. There is no guarantee that the results won't be immediately clobbered by another thread before they are read back.
You are reading the results of computation back from global memory immediately after writing them to global memory. It's unclear why you are doing this, and it is very unsafe.
Finally, there is no reliable way in CUDA to do a synchronization between all blocks, which seems to be what you are hoping for. Calling __threadfence only applies to blocks currently executing on the device, which can be a subset of all blocks that should run for a kernel call. Thus it doesn't work as a synchronization primitive.
It's probably easier to calculate initial values of x, y, z and w for each thread. Then each thread can start looping from its initial values until it has performed tasksPerThread iterations. Writing the values out can probably proceed more or less as you have it now.
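One way to compute those per-thread starting values is to treat the linear index as a base-74 number. The Python sketch below shows the arithmetic only; it is not CUDA code, and the names candidate, CHARSET_START, CHARSET_LEN and LENGTH are assumptions made for this example (the constants are taken from the question). The same divide and modulo steps would carry over to a kernel, with idx coming from the thread index.

```python
CHARSET_START = 0x30   # '0', first character of the charset in the question
CHARSET_LEN = 74       # number of characters in the charset
LENGTH = 4             # characters per candidate

def candidate(idx):
    """Map a linear index to a 4-character candidate (x varies fastest)."""
    chars = []
    for _ in range(LENGTH):
        idx, digit = divmod(idx, CHARSET_LEN)  # peel off one base-74 digit
        chars.append(chr(CHARSET_START + digit))
    return "".join(chars)

print(candidate(0), candidate(1), candidate(74), candidate(74**4 - 1))
```

This prints 0000 1000 0100 yyyy. Because every index maps to its own candidate, threads no longer need to share a counter in global memory, so no fences or atomics are needed for the generation step.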
EDIT: Here is a simple test program to demonstrate the logic errors in your loop iteration:
int m = 2;
int x = 0, y = 0, z = 0, w = 0;
for (int i = 0; i < m * m * m * m; i++)
{
printf("x: %d y: %d z: %d w: %d\n", x, y, z, w);
if(x < m) {
x++;
} else if(y < m) {
x = 0; // = 0
y++;
} else if(z < m) {
y = 0; // = 0
z++;
} else if(w < m) {
z = 0;
w++;; // = 0
}
}
The output of which is this:
x: 0 y: 0 z: 0 w: 0
x: 1 y: 0 z: 0 w: 0
x: 2 y: 0 z: 0 w: 0
x: 0 y: 1 z: 0 w: 0
x: 1 y: 1 z: 0 w: 0
x: 2 y: 1 z: 0 w: 0
x: 0 y: 2 z: 0 w: 0
x: 1 y: 2 z: 0 w: 0
x: 2 y: 2 z: 0 w: 0
x: 2 y: 0 z: 1 w: 0
x: 0 y: 1 z: 1 w: 0
x: 1 y: 1 z: 1 w: 0
x: 2 y: 1 z: 1 w: 0
x: 0 y: 2 z: 1 w: 0
x: 1 y: 2 z: 1 w: 0
x: 2 y: 2 z: 1 w: 0
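For comparison, here is a sketch of the same counter with the carry handled correctly, written in Python rather than C to keep it short. It assumes each digit should take the values 0 to m-1 (the test program above lets them run to m), and the helper name step is made up for this example.

```python
def step(digits, m):
    """Advance a little-endian list of digits (each in 0..m-1) by one.
    Returns False when the counter wraps around to all zeros."""
    for pos in range(len(digits)):
        if digits[pos] < m - 1:
            digits[pos] += 1
            return True
        digits[pos] = 0   # carry: reset this digit and move on to the next one
    return False

m = 2
digits = [0, 0, 0, 0]     # x, y, z, w
seen = []
while True:
    seen.append(tuple(digits))
    if not step(digits, m):
        break
print(len(seen), len(set(seen)))
```

This prints 16 16: all m^4 combinations are visited exactly once, because every digit below the one being incremented is reset during the carry.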

Code Golf: Automata

I made the ultimate laugh generator using these rules. Can you implement it in your favorite language in a clever manner?
Rules:
On every iteration, the following transformations occur.
H -> AH
A -> HA
AA -> HA
HH -> AH
AAH -> HA
HAA -> AH
n = 0 | H
n = 1 | AH
n = 2 | HAAH
n = 3 | AHAH
n = 4 | HAAHHAAH
n = 5 | AHAHHA
n = 6 | HAAHHAAHHA
n = 7 | AHAHHAAHHA
n = 8 | HAAHHAAHHAAHHA
n = 9 | AHAHHAAHAHHA
n = ...
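For checking submissions, here is an un-golfed reference sketch in Python 3. It assumes the rules are applied left to right and that the longest matching rule wins at each position, which is one reading of the rules that reproduces the table above; the names step and laugh are made up for this example.

```python
import re

# Longest alternatives first: Python tries alternatives from left to right.
RULE = re.compile(r"HAA|HH|H|AAH|AA|A")

def step(s):
    """One iteration: a match starting with H becomes AH, one starting with A becomes HA."""
    return RULE.sub(lambda m: "AH" if m.group(0)[0] == "H" else "HA", s)

def laugh(n):
    """Return the string after n iterations, starting from 'H'."""
    s = "H"
    for _ in range(n):
        s = step(s)
    return s

for n in range(10):
    print(f"n = {n} | {laugh(n)}")
```

The loop prints the ten rows of the table, from n = 0 through n = 9.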

Lex/Flex
69 characters. In the text here, I changed tabs to 8 spaces so it would look right, but all those consecutive spaces should be tabs, and the tabs are important, so it comes out to 69 characters.
#include <stdio.h>
%%
HAA|HH|H printf("AH");
AAH|AA|A printf("HA");
For what it's worth, the generated lex.yy.c is 42736 characters, but I don't think that really counts. I can (and soon will) write a pure-C version that will be much shorter and do the same thing, but I feel that should probably be a separate entry.
EDIT:
Here's a more legit Lex/Flex entry (302 characters):
char*c,*t;
#define s(a) t=c?realloc(c,strlen(c)+3):calloc(3,1);if(t)c=t,strcat(c,#a);
%%
free(c);c=NULL;
HAA|HH|H s(AH)
AAH|AA|A s(HA)
%%
int main(void){c=calloc(2,1);if(!c)return 1;*c='H';for(int n=0;n<10;n++)printf("n = %d | %s\n",n,c),yy_scan_string(c),yylex();return 0;}int yywrap(){return 1;}
This does multiple iterations (unlike the last one, which only did one iteration, and had to be manually seeded each time, but produced the correct results) and has the advantage of being extremely horrific-looking code. I use a function macro, the stringizing operator, and two global variables. If you want an even messier version that doesn't even check for malloc() failure, it looks like this (282 characters):
char*c,*t;
#define s(a) t=c?realloc(c,strlen(c)+3):calloc(3,1);c=t;strcat(c,#a);
%%
free(c);c=NULL;
HAA|HH|H s(AH)
AAH|AA|A s(HA)
%%
int main(void){c=calloc(2,1);*c='H';for(int n=0;n<10;n++)printf("n = %d | %s\n",n,c),yy_scan_string(c),yylex();return 0;}int yywrap(){return 1;}
An even worse version could be concocted where c is an array on the stack, and we just give it a MAX_BUFFER_SIZE of some sort, but I feel that's taking this too far.
...Just kidding. 207 characters if we take the "99 characters will always be enough" mindset:
char c[99]="H";
%%
c[0]=0;
HAA|HH|H strcat(c, "AH");
AAH|AA|A strcat(c, "HA");
%%
int main(void){for(int n=0;n<10;n++)printf("n = %d | %s\n",n,c),yy_scan_string(c),yylex();return 0;}int yywrap(){return 1;}
My preference is for the one that works best (i.e. the first one that can iterate until memory runs out and checks its errors), but this is code golf.
To compile the first one, type:
flex golf.l
gcc -ll lex.yy.c
(If you have lex instead of flex, just change flex to lex. They should be compatible.)
To compile the others, type:
flex golf.l
gcc -std=c99 lex.yy.c
Or else GCC will whine about ‘for’ loop initial declaration used outside C99 mode and other crap.
Pure C answer coming up.

MATLAB (v7.8.0):
73 characters (not including formatting characters used to make it look readable)
This script ("haha.m") assumes you have already defined the variable n:
s = 'H';
for i = 1:n,
s = regexprep(s,'(H)(H|AA)?|(A)(AH)?','${[137-$1 $1]}');
end
...and here's the one-line version:
s='H';for i=1:n,s = regexprep(s,'(H)(H|AA)?|(A)(AH)?','${[137-$1 $1]}');end
Test:
>> for n=0:10, haha; disp([num2str(n) ': ' s]); end
0: H
1: AH
2: HAAH
3: AHAH
4: HAAHHAAH
5: AHAHHA
6: HAAHHAAHHA
7: AHAHHAAHHA
8: HAAHHAAHHAAHHA
9: AHAHHAAHAHHA
10: HAAHHAAHHAHAAHHA

A simple translation to Haskell:
grammar = iterate step
where
step ('H':'A':'A':xs) = 'A':'H':step xs
step ('A':'A':'H':xs) = 'H':'A':step xs
step ('A':'A':xs) = 'H':'A':step xs
step ('H':'H':xs) = 'A':'H':step xs
step ('H':xs) = 'A':'H':step xs
step ('A':xs) = 'H':'A':step xs
step [] = []
And a shorter version (122 chars, optimized down to three derivation rules + base case):
grammar=iterate s where{i 'H'='A';i 'A'='H';s(n:'A':m:x)|n/=m=m:n:s x;s(n:m:x)|n==m=(i n):n:s x;s(n:x)=(i n):n:s x;s[]=[]}
And a translation to C++ (182 chars, only does one iteration, invoke with initial state on the command line):
#include<cstdio>
#define o putchar
int main(int,char**v){char*p=v[1];while(*p){p[1]==65&&~*p&p[2]?o(p[2]),o(*p),p+=3:*p==p[1]?o(137-*p++),o(*p++),p:(o(137-*p),o(*p++),p);}return 0;}

Javascript:
120 stripping whitespace and I'm leaving it alone now!
function f(n,s){s='H';while(n--){s=s.replace(/HAA|AAH|HH?|AA?/g,function(a){return a.match(/^H/)?'AH':'HA'});};return s}
Expanded:
function f(n,s)
{
s = 'H';
while (n--)
{
s = s.replace(/HAA|AAH|HH?|AA?/g, function(a) { return a.match(/^H/) ? 'AH' : 'HA' } );
};
return s
}
that replacer is expensive!

Here's a C# example, coming in at 321 bytes if I reduce whitespace to one space between each item.
Edit: In response to @Johannes Rössel's comment, I removed generics from the solution to eke out a few more bytes.
Edit: Another change, got rid of all temporary variables.
public static String E(String i)
{
return new Regex("HAA|AAH|HH|AA|A|H").Replace(i,
m => (String)new Hashtable {
{ "H", "AH" },
{ "A", "HA" },
{ "AA", "HA" },
{ "HH", "AH" },
{ "AAH", "HA" },
{ "HAA", "AH" }
}[m.Value]);
}
The rewritten solution with less whitespace, that still compiles, is 158 characters:
return new Regex("HAA|AAH|HH|AA|A|H").Replace(i,m =>(String)new Hashtable{{"H","AH"},{"A","HA"},{"AA","HA"},{"HH","AH"},{"AAH","HA"},{"HAA","AH"}}[m.Value]);
For a complete source code solution for Visual Studio 2008, a subversion repository with the necessary code, including unit tests, is available below.
Repository is here, username and password are both 'guest', without the quotes.

Ruby
This code golf is not very well specified -- I assumed that function returning n-th iteration string is best way to solve it. It has 80 characters.
def f n
a='h'
n.times{a.gsub!(/(h(h|aa)?)|(a(ah?)?)/){$1.nil?? "ha":"ah"}}
a
end
Code printing out n first strings (71 characters):
a='h';n.times{puts a.gsub!(/(h(h|aa)?)|(a(ah?)?)/){$1.nil?? "ha":"ah"}}

Erlang
241 bytes and ready to run:
> erl -noshell -s g i -s init stop
AHAHHAAHAHHA
-module(g).
-export([i/0]).
c("HAA"++T)->"AH"++c(T);
c("AAH"++T)->"HA"++c(T);
c("HH"++T)->"AH"++c(T);
c("AA"++T)->"HA"++c(T);
c("A"++T)->"HA"++c(T);
c("H"++T)->"AH"++c(T);
c([])->[].
i(0,L)->L;
i(N,L)->i(N-1,c(L)).
i()->io:format(i(9,"H")).
Could probably be improved.

Perl 168 characters.
(not counting unnecessary newlines)
perl -E'
($s,%m)=qw[H H AH A HA AA HA HH AH AAH HA HAA AH];
sub p{say qq[n = $_[0] | $_[1]]};p(0,$s);
for(1..9){$s=~s/(H(AA|H)?|A(AH?)?)/$m{$1}/g;p($_,$s)}
say q[n = ...]'
De-obfuscated:
use strict;
use warnings;
use 5.010;
my $str = 'H';
my %map = (
H => 'AH',
A => 'HA',
AA => 'HA',
HH => 'AH',
AAH => 'HA',
HAA => 'AH'
);
sub prn{
my( $n, $str ) = @_;
say "n = $n | $str"
}
prn( 0, $str );
for my $i ( 1..9 ){
$str =~ s(
(
H(?:AA|H)? # HAA | HH | H
|
A(?:AH?)? # AAH | AA | A
)
){
$map{$1}
}xge;
prn( $i, $str );
}
say 'n = ...';
Perl 150 characters.
(not counting unnecessary newlines)
perl -E'
$s="H";
sub p{say qq[n = $_[0] | $_[1]]};p(0,$s);
for(1..9){$s=~s/(?|(H)(?:AA|H)?|(A)(?:AH?)?)/("H"eq$1?"A":"H").$1/eg;p($_,$s)}
say q[n = ...]'
De-obfuscated
#! /usr/bin/env perl
use strict;
use warnings;
use 5.010;
my $str = 'H';
sub prn{
my( $n, $str ) = @_;
say "n = $n | $str"
}
prn( 0, $str );
for my $i ( 1..9 ){
$str =~ s{(?|
(H)(?:AA|H)? # HAA | HH | H
|
(A)(?:AH?)? # AAH | AA | A
)}{
( 'H' eq $1 ?'A' :'H' ).$1
}egx;
prn( $i, $str );
}
say 'n = ...';

Python (150 bytes)
import re
N = 10
s = "H"
for n in range(N):
print "n = %d |"% n, s
s = re.sub("(HAA|HH|H)|AAH|AA|A", lambda m: m.group(1) and "AH" or "HA",s)
Output
n = 0 | H
n = 1 | AH
n = 2 | HAAH
n = 3 | AHAH
n = 4 | HAAHHAAH
n = 5 | AHAHHA
n = 6 | HAAHHAAHHA
n = 7 | AHAHHAAHHA
n = 8 | HAAHHAAHHAAHHA
n = 9 | AHAHHAAHAHHA

Here is a very simple C++ version:
#include <iostream>
#include <sstream>
using namespace std;
#define LINES 10
#define put(t) s << t; cout << t
#define r1(o,a,c0) \
if(c[0]==c0) {put(o); s.unget(); s.unget(); a; continue;}
#define r2(o,a,c0,c1) \
if(c[0]==c0 && c[1]==c1) {put(o); s.unget(); a; continue;}
#define r3(o,a,c0,c1,c2) \
if(c[0]==c0 && c[1]==c1 && c[2]==c2) {put(o); a; continue;}
int main() {
char c[3];
stringstream s;
put("H\n\n");
for(int i=2;i<LINES*2;) {
s.read(c,3);
r3("AH",,'H','A','A');
r3("HA",,'A','A','H');
r2("AH",,'H','H');
r2("HA",,'A','A');
r1("HA",,'A');
r1("AH",,'H');
r1("\n",i++,'\n');
}
}
It's not exactly code-golf (it could be made a lot shorter), but it works. Change LINES to however many lines you want printed (note: it will not work for 0). It will print output like this:
H
AH
HAAH
AHAH
HAAHHAAH
AHAHHA
HAAHHAAHHA
AHAHHAAHHA
HAAHHAAHHAAHHA
AHAHHAAHAHHA

ANSI C99
Coming in at a brutal 306 characters:
#include <stdio.h>
#include <string.h>
char s[99]="H",t[99]={0};int main(){for(int n=0;n<10;n++){int i=0,j=strlen(s);printf("n = %u | %s\n",n,s);strcpy(t,s);s[0]=0;for(;i<j;){if(t[i++]=='H'){t[i]=='H'?i++:t[i+1]=='A'?i+=2:1;strcat(s,"AH");}else{t[i]=='A'?i+=1+(t[i+1]=='H'):1;strcat(s,"HA");}}}return 0;}
There are too many nested ifs and conditional operators for me to effectively reduce this with macros. Believe me, I tried. Readable version:
#include <stdio.h>
#include <string.h>
char s[99] = "H", t[99] = {0};
int main()
{
for(int n = 0; n < 10; n++)
{
int i = 0, j = strlen(s);
printf("n = %u | %s\n", n, s);
strcpy(t, s);
s[0] = 0;
/*
* This was originally just a while() loop.
* I tried to make it shorter by making it a for() loop.
* I failed.
* I kept the for() loop because it looked uglier than a while() loop.
* This is code golf.
*/
for(;i<j;)
{
if(t[i++] == 'H' )
{
// t[i] == 'H' ? i++ : t[i+1] == 'A' ? i+=2 : 1;
// Oh, ternary ?:, how do I love thee?
if(t[i] == 'H')
i++;
else if(t[i+1] == 'A')
i+= 2;
strcat(s, "AH");
}
else
{
// t[i] == 'A' ? i += 1 + (t[i + 1] == 'H') : 1;
if(t[i] == 'A')
if(t[++i] == 'H')
i++;
strcat(s, "HA");
}
}
}
return 0;
}
I may be able to make a shorter version with strncmp() in the future, but who knows? We'll see what happens.

In python:
def l(s):
H=['HAA','HH','H','AAH','AA','A']
L=['AH']*3+['HA']*3
for i in [3,2,1]:
if s[:i] in H: return L[H.index(s[:i])]+l(s[i:])
return s
def a(n,s='H'):
return s*(n<1)or a(n-1,l(s))
for i in xrange(0,10):
print '%d: %s'%(i,a(i))
First attempt: 198 char of code, I'm sure it can get smaller :D

REBOL, 150 characters. Unfortunately REBOL is not a language conducive to code golf, but 150 characters ain't too shabby, as Adam Sandler says.
This assumes the loop variable m has already been defined.
s: "H" r: "" z:[some[["HAA"|"HH"|"H"](append r "AH")|["AAH"|"AA"|"A"](append r "HA")]to end]repeat n m[clear r parse s z print["n =" n "|" s: copy r]]
And here it is with better layout:
s: "H"
r: ""
z: [
some [
[ "HAA" | "HH" | "H" ] (append r "AH")
| [ "AAH" | "AA" | "A" ] (append r "HA")
]
to end
]
repeat n m [
clear r
parse s z
print ["n =" n "|" s: copy r]
]

F#: 184 chars
Seems to map pretty cleanly to F#:
type grammar = H | A
let rec laugh = function
| 0,l -> l
| n,l ->
let rec loop = function
|H::A::A::x|H::H::x|H::x->A::H::loop x
|A::A::H::x|A::A::x|A::x->H::A::loop x
|x->x
laugh(n-1,loop l)
Here's a run in fsi:
> [for a in 0 .. 9 -> a, laugh(a, [H])] |> Seq.iter (fun (a, b) -> printfn "n = %i: %A" a b);;
n = 0: [H]
n = 1: [A; H]
n = 2: [H; A; A; H]
n = 3: [A; H; A; H]
n = 4: [H; A; A; H; H; A; A; H]
n = 5: [A; H; A; H; H; A]
n = 6: [H; A; A; H; H; A; A; H; H; A]
n = 7: [A; H; A; H; H; A; A; H; H; A]
n = 8: [H; A; A; H; H; A; A; H; H; A; A; H; H; A]
n = 9: [A; H; A; H; H; A; A; H; A; H; H; A]
