Fast single dispatch to get around multiple dispatch at runtime

When type inference falters (::Any in the @code_warntype printout), my understanding is that function calls are dynamically dispatched. In other words, at run time, the arguments' types are checked to find the specialization (MethodInstance) for the concrete argument types. Needing to do this at run time instead of compile time incurs performance costs.
(EDIT: originally, I said "multiple dispatch finds the fitting method" between the type-checking and specialization-finding, but I don't actually know if this part happens at runtime. It seems that it only needs to happen if no valid specialization exists and one needs to be compiled.)
In cases where only one argument's concrete type needs to be checked, is it possible to do a faster dynamic single dispatch instead, like in some sort of lookup table of specializations? I just can't find a way to access and call MethodInstances as if they were functions.
When it comes to altering dispatch or specialization, I thought of invoke and @nospecialize. invoke looks like it might skip right to a specified method, but checking multiple argument types and finding the specialization must still happen. @nospecialize doesn't skip any part of the dispatch process; it just results in different specializations.
EDIT: A minimal example with comments that hopefully describe what I'm talking about.
struct Foo end
struct Bar end
# want to dispatch only on 1st argument
# still want to specialize on 2nd argument
baz(::Foo, ::Integer) = 1
baz(::Foo, ::AbstractFloat) = 1.0
baz(::Bar, ::Integer) = 1im
baz(::Bar, ::AbstractFloat) = 1.0im
x = Any[Foo(), Bar(), Foo()]
# run test1(x, 1) or test1(x, 1.0)
function test1(x, second)
    # first::Any in @code_warntype printout
    for first in x
        # first::Any requires dynamic dispatch of baz
        println(baz(first, second))
        # Is it possible to dispatch baz only on first, given
        # the concrete types of the other arguments (second)?
    end
end

The easiest way to do what you ask is to simply not dispatch on the second argument (by not putting a type annotation on the second argument specific enough to trigger dispatch), and instead specialize with an if statement within your function. For example:
struct Foo end
struct Bar end
# Note lack of type annotation on the second argument.
# We could also write `baz(::Foo, n::Number)` for the same effect in this case,
# but type annotations have no performance benefit in Julia if you're not
# dispatching on them anyway.
function baz(::Foo, n)
    if isa(n, Integer)
        1
    elseif isa(n, AbstractFloat)
        1.0
    else
        error("unsupported type")
    end
end
function baz(::Bar, n)
    if isa(n, Integer)
        1im
    elseif isa(n, AbstractFloat)
        1.0im
    else
        error("unsupported type")
    end
end
Now, this will do what you want:
julia> x = Any[Foo(), Bar(), Foo()]
3-element Vector{Any}:
Foo()
Bar()
Foo()
julia> test1(x, 1)
1
0 + 1im
1
julia> test1(x, 1.0)
1.0
0.0 + 1.0im
1.0
Since this effectively hand-picks only two cases to specialize on, out of all possible types, I could imagine scenarios where this sort of technique has performance benefits. Of course, it goes without saying in Julia that it would generally be even better to find and eliminate the source of the type instability in the first place, if at all possible.
However, it is critically important in the context of this question to point out that even though we have eliminated dispatch on the second argument of the function, these baz functions may still have poor performance if the first argument (i.e., the one you are dispatching on) is type-unstable, as is the case in the question as written because of the use of an Array{Any}.
Instead, try to use an array with at least some type constraint. Ex:
julia> function test2(x, second)
           s = 1 + 1im
           for first in x
               s += baz(first, second)
           end
           s
       end
test2 (generic function with 1 method)
julia> using BenchmarkTools
julia> x = Any[Foo(), Bar(), Foo()];
julia> #benchmark test2($x, 1)
BenchmarkTools.Trial: 10000 samples with 998 evaluations.
Range (min … max): 13.845 ns … 71.554 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 13.869 ns ┊ GC (median): 0.00%
Time (mean ± σ): 15.397 ns ± 3.821 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█▅ ▃ ▄ ▄ ▄ ▄ ▃ ▁
██▇▆█▇██▄█▇▇▄▃▁▁██▁▃▃▁▁▃██▃▁▃▁▁▄▃▃▃▆▆▅▆▆▅▅▄▁▁▄▃▃▃▁▃▁▄▁▁▃▄▄█ █
13.8 ns Histogram: log(frequency) by time 30.2 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> x = Union{Foo,Bar}[Foo(), Bar(), Foo()];
julia> #benchmark test2($x, 1)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
Range (min … max): 4.654 ns … 62.311 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 4.707 ns ┊ GC (median): 0.00%
Time (mean ± σ): 5.471 ns ± 1.714 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█▂▂▃▄ ▃ ▄▁ ▄▂ ▅▁ ▁▄ ▁
███████▁▁██▁▁▁▁██▁▁▁▁▁▁██▁▁▁▄▁▃▁▃▁▁▁▁▃▁▁▁▁▃▁▃▃▁▁▁▁▃▁▁▁▁▁██ █
4.65 ns Histogram: log(frequency) by time 10.2 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
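As a cross-language aside (my addition, not part of the original answer): the "lookup table of specializations keyed on one argument's concrete type" that the question imagines is essentially what Python's functools.singledispatch provides, a dict from type to implementation, with dispatch on the first argument only:

```python
from functools import singledispatch

class Foo: pass
class Bar: pass

# baz dispatches only on the type of its first argument;
# the second argument is handled inside each implementation,
# mirroring the if/isa approach from the Julia answer above.
@singledispatch
def baz(first, second):
    raise TypeError(f"unsupported type {type(first)}")

@baz.register
def _(first: Foo, second):
    return 1 if isinstance(second, int) else 1.0

@baz.register
def _(first: Bar, second):
    return 1j if isinstance(second, int) else 1.0j

results = [baz(first, 1) for first in [Foo(), Bar(), Foo()]]
# results == [1, 1j, 1]
```

Under the hood, baz.registry is the type-to-function table; lookups are cached per concrete type, which is roughly the "fast dynamic single dispatch" the question asks about.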


How to search for rows that contain a special word case in-sensitively in a specific column in DataFrames Julia?

I was trying to find a specific dataset within RDatasets, and since this package provides 763 datasets with specific names, it's hard to tell whether a given dataset exists. For example, I knew there was a dataset about foods, but I didn't know its exact name. So I searched for it to find out whether RDatasets provides it:
using RDatasets, DataFrames
# This line recalls information on all the provided datasets within the package
Rdatasets = RDatasets.datasets();
Rdatasets[occursin.("food", Rdatasets.Title), :]
# 0×5 DataFrame
# Row │ Package Dataset Title Rows Columns
# │ String15 String31 String Int64 Int64
# ─────┴────────────────────────────────────────────
# Then I searched for "Food"
Rdatasets[occursin.("Food", Rdatasets.Title), :]
# 1×5 DataFrame
# Row │ Package Dataset Title Rows Columns
# │ String15 String31 String Int64 Int64
# ─────┼─────────────────────────────────────────────────────────────────────────
# 1 │ Ecdat BudgetFood Budget Share of Food for Spanish… 23972 6
But it took me two tries, and I might well have given up on further searching. How can I find the rows of the Rdatasets DataFrame whose Title column contains the word "food" case-insensitively (if there are any)?
RegEx is everyone's friend! Even if you were looking for Iris in the Dataset column, you'd be in trouble, because the names are provided case-sensitively. One option is to lower/uppercase the contents of the preferred column using lowercase.(df.columnname) and then search for the word in the corresponding case (but this can fail in extreme cases; check this answer). Alternatively, you can use a regex and let it decide about the occurrence of the letters. The latter helps you find a specific word within a specific column of any dataframe, whatever its notation: maybe you have a dataframe that contains iris written as "iRiS" in some column, and then it wouldn't be efficient to search for Iris, iris, iRis, etc. until one works:
Rdatasets[occursin.(r"(?i)food", Rdatasets.Title), :]
# 1×5 DataFrame
# Row │ Package Dataset Title Rows Columns
# │ String15 String31 String Int64 Int64
# ─────┼─────────────────────────────────────────────────────────────────────────
# 1 │ Ecdat BudgetFood Budget Share of Food for Spanish… 23972 6
In the above, I used a regex to search for any spelling of the word food within the Title column of the Rdatasets DataFrame. The (?i) flag turns case-insensitivity on, and the r"" is:
help?> r""
@r_str -> Regex
  Construct a regex, such as r"^[a-z]*$", without interpolation and unescaping (except for quotation
  mark " which still has to be escaped). The regex also accepts one or more flags, listed after the
  ending quote, to change its behaviour:
• i enables case-insensitive matching
• m treats the ^ and $ tokens as matching the start and end of individual lines, as opposed to the
whole string.
• s allows the . modifier to match newlines.
• x enables "comment mode": whitespace is enabled except when escaped with \, and # is treated as
starting a comment.
• a disables UCP mode (enables ASCII mode). By default \B, \b, \D, \d, \S, \s, \W, \w, etc. match
based on Unicode character properties. With this option, these sequences only match ASCII
characters.
See Regex if interpolation is needed.
Examples
≡≡≡≡≡≡≡≡≡≡
julia> match(r"a+.*b+.*?d$"ism, "Goodbye,\nOh, angry,\nBad world\n")
RegexMatch("angry,\nBad world")
This regex has the first three flags enabled.
Note that things can get more complicated when special characters ($, #, etc.) occur, and in those cases converting the content to uppercase or the opposite wouldn't be helpful. So using a regex is the safest option.
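For what it's worth (a Python analogue I'm adding, not part of the question), the same inline-flag pattern works with pandas: Series.str.contains accepts a regex, or equivalently a case=False flag:

```python
import pandas as pd

# A toy stand-in for the Rdatasets table from the question
df = pd.DataFrame({
    "Dataset": ["BudgetFood", "iris"],
    "Title": ["Budget Share of Food for Spanish Households",
              "Edgar Anderson's Iris Data"],
})

# Inline (?i) flag, as in the Julia answer
hits = df[df["Title"].str.contains(r"(?i)food", regex=True)]
# Equivalently: df[df["Title"].str.contains("food", case=False)]
```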
Comparison
Comparison against Przemyslaw's proposal:
julia> @benchmark $Rdatasets[occursin.(r"(?i)food", $Rdatasets.Title), :]
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 120.000 μs … 583.000 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 125.200 μs ┊ GC (median): 0.00%
Time (mean ± σ): 135.361 μs ± 27.258 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▅█▆▅▄▃▃▂▂▁▃▃▂▁▁▁▂▁▁▁ ▁
███████████████████████▇▇▆▆▆▆▆▇▆▅▆▆▅▅▅▆▆▆▅▆▅▆▆▅▅▅▄▄▅▃▅▆▆▄▅▅▄▅ █
120 μs Histogram: log(frequency) by time 265 μs <
Memory estimate: 5.81 KiB, allocs estimate: 27.
julia> @benchmark $Rdatasets[occursin.(Unicode.normalize("food",casefold=true), Unicode.normalize.($Rdatasets.Title,casefold=true)),:]
BenchmarkTools.Trial: 4393 samples with 1 evaluation.
Range (min … max): 984.700 μs … 10.233 ms ┊ GC (min … max): 0.00% … 88.07%
Time (median): 1.064 ms ┊ GC (median): 0.00%
Time (mean ± σ): 1.132 ms ± 463.506 μs ┊ GC (mean ± σ): 2.15% ± 4.81%
▂▅█▅▂
▂▃██████▇▅▅▅▄▄▄▃▄▄▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▂▂▂▁▂▂▂▁▁▂ ▃
985 μs Histogram: frequency by time 1.76 ms <
Memory estimate: 268.03 KiB, allocs estimate: 3085.
To sum up:

|                         | Time        | Memory                    | Garbage collection |
|-------------------------|-------------|---------------------------|--------------------|
| using regex             | ~135.361 μs | ~5.81 KiB, allocs: 27     | 0%                 |
| using Unicode.normalize | ~1.132 ms   | ~268.03 KiB, allocs: 3085 | 2.15% ± 4.81%      |
According to Julia documentation, Unicode.normalize("", casefold=true) is recommended to perform case-insensitive comparison.
Hence you want:
julia> Rdatasets[occursin.(Unicode.normalize("food",casefold=true), Unicode.normalize.(Rdatasets.Title,casefold=true)),:]
1×5 DataFrame
Row │ Package Dataset Title Rows Columns
│ String15 String31 String Int64 Int64
─────┼─────────────────────────────────────────────────────────────────────────
1 │ Ecdat BudgetFood Budget Share of Food for Spanish… 23972 6
Example when this matters over lowercase (two ways to write "foot" in German):
julia> Unicode.normalize("der Fuß",casefold=true) ==
Unicode.normalize("der Fuss",casefold=true)
true
julia> lowercase("der Fuß") == lowercase("der Fuss")
false
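Python draws the same distinction, for what it's worth: str.casefold() (the analogue of Unicode.normalize(..., casefold=true)) performs full Unicode case folding, so ß folds to ss, while str.lower() only does simple lowercasing:

```python
# casefold() performs full Unicode case folding; lower() does not
assert "der Fuß".casefold() == "der Fuss".casefold()  # both become "der fuss"
assert "der Fuß".lower() != "der Fuss".lower()        # "der fuß" vs "der fuss"
```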

Julia: CSV.write very memory inefficient?

I noticed that when saving large dataframes as CSVs the memory allocations are an order of magnitude higher than the size of the dataframe in memory (or the size of the CSV file on disk), at least by a factor of 10. Why is this the case? And is there a way to prevent this? Ie is there a way to save a dataframe to disk without using (much) more memory than the actual dataframe?
In the example below I generate a dataframe with one integer column and 10m rows. It weighs 76MB but writing the CSV allocates 1.35GB.
using DataFrames, CSV
function generate_df(n::Int64)
    DataFrame!(a = 1:n)
end
julia> @time tmp = generate_df(10000000);
0.671053 seconds (2.45 M allocations: 199.961 MiB)
julia> Base.summarysize(tmp) / 1024 / 1024
76.29454803466797
julia> @time CSV.write("~/tmp/test.csv", tmp)
3.199506 seconds (60.11 M allocations: 1.351 GiB)
What you see is not related to CSV.write, but to the fact that DataFrame is type-unstable. This means that it will allocate when iterating rows and accessing their contents. Here is an example:
julia> df = DataFrame(a=1:10000000);
julia> f(x) = x[1]
f (generic function with 1 method)
julia> @time sum(f, eachrow(df)) # after compilation
0.960045 seconds (40.07 M allocations: 613.918 MiB, 4.18% gc time)
50000005000000
This is a deliberate design decision to avoid unacceptable compilation times for very wide data frames (which are common in practice in certain fields of application). Now, this is the way to reduce allocations:
julia> @time CSV.write("test.csv", df) # after compilation
1.976654 seconds (60.00 M allocations: 1.345 GiB, 5.64% gc time)
"test.csv"
julia> @time CSV.write("test.csv", Tables.columntable(df)) # after compilation
0.439597 seconds (36 allocations: 4.002 MiB)
"test.csv"
(this will work OK if the table is narrow, for wide tables it might hit compilation time issues)
This is one of the patterns often encountered in Julia (even Julia itself works this way: the args field of Expr is a Vector{Any}). Often you are OK with type-unstable code when you do not care about performance (but want to avoid excessive compilation latency), and it is easy to switch to a type-stable mode where compilation time does not matter and type stability does.
Python Pandas:
import pandas as pd
df = pd.DataFrame({'a': range(10_000_000)})
%time df.to_csv("test_py.csv", index=False)
memory consumption (measured in Task Manager): 135 MB (before writing) -> 151 MB (during writing), Wall time: 8.39 s
Julia:
using DataFrames, CSV
df = DataFrame(a=1:10_000_000)
@time CSV.write("test_jl.csv", df)
@time CSV.write("test_jl.csv", df)
memory consumption: 284 MB (before writing) -> 332 MB (after 1st writing),
2.196639 seconds (51.42 M allocations: 1.270 GiB, 7.49% gc time)
2nd execution (no compilation required anymore): -> 357 MB,
1.701374 seconds (50.00 M allocations: 1.196 GiB, 6.26% gc time)
The memory increase of Python and Julia during the writing of the CSV file is similar (~15 MB). In Julia, I observe a significant memory increase after execution of the first write command, probably due to caching of compiled code.
Note that allocations != memory requirement. Even though 1.2 GB memory is allocated in total, only 15 MB is used at the same time (max value).
Regarding performance, Julia is nearly 4 times faster than Python, even including compilation time.

Is there a way to use numpy to apply a function to a 2D array without a loop

I'm trying to convert a list of quaternions to their corresponding orientation matrix using the Transforms3d python package.
Each quaternion is a 4 element list/array of the inputs and using the transforms3d.quaternions.quat2mat(q) function it returns the 3x3 orientation matrix.
I have a list of some 10K-100K quaternions that need converting (nx4 array) and while it's easy enough to do this with a loop, I think it could be quicker if there was some way of vectorising the process.
Some searching suggested I could simply do something like np.vectorize() but I'm struggling to make that work. A list comprehension works fine, but I guess the numpy vector solution would be much quicker.
orientations = np.array([[ 0.6594993 , -0.06402525, -0.74797227, -0.03871606],
[ 0.78091967, -0.15961452, -0.44240183, -0.41105753]])
rotMatrix = [quat2mat(orient) for orient in orientations]
vfunc = np.vectorize(quat2mat, signature='(m,n)->()')
vfunc(orientations)
Unfortunately I can't even get the numpy version to run, both with and without the signature (which is possibly wrong).
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    vfunc(aa)
  File "c:\wpy64-3740\python-3.7.4.amd64\lib\site-packages\numpy\lib\function_base.py", line 2091, in __call__
    return self._vectorize_call(func=func, args=vargs)
  File "c:\wpy64-3740\python-3.7.4.amd64\lib\site-packages\numpy\lib\function_base.py", line 2157, in _vectorize_call
    res = self._vectorize_call_with_signature(func, args)
  File "c:\wpy64-3740\python-3.7.4.amd64\lib\site-packages\numpy\lib\function_base.py", line 2198, in _vectorize_call_with_signature
    results = func(*(arg[index] for arg in args))
  File "c:\wpy64-3740\python-3.7.4.amd64\lib\site-packages\transforms3d\quaternions.py", line 133, in quat2mat
    w, x, y, z = q
ValueError: not enough values to unpack (expected 4, got 2)
As was suggested, the best way to improve performance was to vectorise quat2mat, and the results (%timeit) support that:
quat2mat() in loop for 2000 quaternions:
17.3 ms ± 482 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
vectorised quat2mat_array() for 2000 quaternions:
1.11 ms ± 16.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Should have just done that first instead of messing with np.vectorize()! Thanks for the re-focus!
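A vectorized conversion along these lines can be sketched in plain NumPy. Note this is my own sketch: quat2mat_array is a hypothetical name, and the body follows the standard unit-quaternion-to-rotation-matrix formula (quaternions are normalized first, as transforms3d's quat2mat does):

```python
import numpy as np

def quat2mat_array(quats):
    """Convert an (n, 4) array of [w, x, y, z] quaternions to (n, 3, 3) rotation matrices."""
    q = np.asarray(quats, dtype=float)
    q = q / np.linalg.norm(q, axis=1, keepdims=True)  # normalize each quaternion
    w, x, y, z = q[:, 0], q[:, 1], q[:, 2], q[:, 3]
    m = np.empty((q.shape[0], 3, 3))
    m[:, 0, 0] = 1 - 2 * (y * y + z * z)
    m[:, 0, 1] = 2 * (x * y - z * w)
    m[:, 0, 2] = 2 * (x * z + y * w)
    m[:, 1, 0] = 2 * (x * y + z * w)
    m[:, 1, 1] = 1 - 2 * (x * x + z * z)
    m[:, 1, 2] = 2 * (y * z - x * w)
    m[:, 2, 0] = 2 * (x * z - y * w)
    m[:, 2, 1] = 2 * (y * z + x * w)
    m[:, 2, 2] = 1 - 2 * (x * x + y * y)
    return m

# Sanity checks: the identity quaternion maps to the identity matrix,
# and [0, 0, 0, 1] is a 180° rotation about z.
mats = quat2mat_array([[1.0, 0.0, 0.0, 0.0],
                       [0.0, 0.0, 0.0, 1.0]])
```

All the work happens in whole-array operations, which is why it beats the per-row loop by an order of magnitude for large n.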

pandas performance: columns selection

I've observed today that selecting two or more columns of Data frame may be much slower than selecting only one.
If I use loc or iloc to choose more than one column, passing a list of column names or indexes, performance drops about 100 times compared with selecting a single column, or with multi-column selection via an iloc slice (no list passed).
examples:
df = pd.DataFrame(np.random.randn(10**7,10), columns=list('abcdefghij'))
One column selection:
%%timeit -n 100
df['b']
3.17 µs ± 147 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit -n 100
df.iloc[:,1]
66.7 µs ± 5.95 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit -n 100
df.loc[:,'b']
44.2 µs ± 10.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Two columns selection:
%%timeit -n 10
df[['b', 'c']]
96.4 ms ± 788 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit -n 10
df.loc[:,['b', 'c']]
99.4 ms ± 4.44 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit -n 10
df.iloc[:,[1,2]]
97.6 ms ± 1.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Only this selection works as expected:
[EDIT]
%%timeit -n 100
df.iloc[:,1:3]
103 µs ± 17.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
What are the differences in mechanisms, and why are they so big?
[EDIT]:
As @run-out pointed out, pd.Series seems to be processed much faster than pd.DataFrame. Does anyone know why this is the case?
On the other hand, it does not explain the difference between df.iloc[:,[1,2]] and df.iloc[:,1:3].
Pandas works with single rows or columns as a pandas.Series, which would be faster than working within the DataFrame architecture.
Pandas works with pandas.Series when you ask for:
%%timeit -n 10
df['b']
2.31 µs ± 1.59 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
However, I can call a DataFrame for the same column by putting it in a list. Then you get:
%%timeit -n 10
df[['b']]
90.7 ms ± 1.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
You can see from the above that it's the Series that is outperforming the DataFrame.
Here is how Pandas is working with column 'b'.
type(df['b'])
pandas.core.series.Series
type(df[['b']])
pandas.core.frame.DataFrame
EDIT:
I'm expanding on my answer, as the OP wants to dig deeper into why the speed advantage of pd.Series over pd.DataFrame is so large, and also because this is a great question for expanding my/our understanding of how the underlying technology works. Those with more expertise, please chime in.
First let's start with numpy, as it's a building block of pandas. According to Wes McKinney, author of pandas, in Python for Data Analysis, the performance pickup of numpy over Python:
This is based partly on performance differences having to do with the cache hierarchy of the CPU; operations accessing contiguous blocks of memory (e.g., summing the rows of a C order array) will generally be the fastest because the memory subsystem will buffer the appropriate blocks of memory into the ultrafast L1 or L2 CPU cache.
Let's see the speed difference for this example. Let's make a numpy array from column 'b' of the dataframe.
a = np.array(df['b'])
And now do the performance test:
%%timeit -n 10
a
The results are:
32.5 ns ± 28.2 ns per loop (mean ± std. dev. of 7 runs, 10 loops each)
That's a serious pick up in performance over the pd.series time of 2.31 µs.
The other main reason for performance pickup is that numpy indexing goes straight into NumPy C extensions, but there is a lot of python stuff going on when you index into a Series, and this is a lot slower. (read this article)
Let's look at the question of why does:
df.iloc[:,1:3]
drastically outperform:
df.iloc[:,[1,2]]
It's interesting to note that .loc has the same effect with performance as .iloc in this scenario.
Our first big clue that something is not right is in the following code:
df.iloc[:,1:3] is df.iloc[:,[1,2]]
False
These give the same result, but are different objects. I've done a deep dive to try to find out what the difference is. I was unable to find reference to this on the internet or in my library of books.
Looking at the source code, we can start to see some difference. I refer to indexing.py.
In the Class _iLocIndexer we can find some extra work being done by pandas for list in an iloc slice.
Right away, we run into these two differences when checking input:
if isinstance(key, slice):
    return
vs.
elif is_list_like_indexer(key):
    # check that the key does not exceed the maximum size of the index
    arr = np.array(key)
    l = len(self.obj._get_axis(axis))
    if len(arr) and (arr.max() >= l or arr.min() < -l):
        raise IndexError("positional indexers are out-of-bounds")
Could this alone be cause enough for the reduced performance? I don't know.
Although .loc is slightly different, it also suffers a performance hit when using a list of values. Looking in indexing.py, see def _getitem_axis(self, key, axis=None): in class _LocIndexer(_LocationIndexer):
The code section for is_list_like_indexer(key) that handles list inputs is quite long including a lot of overhead. It contains the note:
# convert various list-like indexers
# to a list of keys
# we will use the *values* of the object
# and NOT the index if its a PandasObject
Certainly there is enough additional overhead in dealing with a list of values or integers, versus direct slices, to cause delays in processing.
The rest of the code is past my pay grade. If anyone can have a look and chime in, it would be most welcome.
I found that this is probably rooted in numpy.
numpy has two kinds of indexing:
basic indexing, like a[1:3]
advanced indexing, like a[[1,2]]
according to the documentation,
Advanced indexing always returns a copy of the data (contrast with
basic slicing that returns a view).
So if you check
a=df.values
%timeit -n2 a[:,0:3]
%timeit -n2 a[:,[0,1,2]]
you get
The slowest run took 5.06 times longer than the fastest. This could mean that an intermediate result is being cached.
1.57 µs ± 1.3 µs per loop (mean ± std. dev. of 7 runs, 2 loops each)
188 ms ± 2.17 ms per loop (mean ± std. dev. of 7 runs, 2 loops each)
quite similar behavior to the pandas DataFrame.
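The copy-versus-view distinction quoted from the numpy docs can be checked directly (a small sketch of my own): writes through a basic slice are visible in the parent array, while advanced indexing produces an independent copy that numpy must allocate and fill, which is where the extra time goes:

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

view = a[:, 1:3]     # basic slicing: a view into a's memory, no data copied
copy = a[:, [1, 2]]  # advanced indexing: a freshly allocated copy

view[0, 0] = 99      # writes through to a
copy[0, 1] = -1      # does NOT affect a

# a[0, 1] is now 99 (modified via the view);
# a[0, 2] is still 2 (the copy is independent)
```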
I can only recommend using the cudf library; it is basically pandas ported to Nvidia GPUs. It is extremely fast, as most operations are highly parallelized, contrary to pandas.
You need to have an Nvidia GPU available though, starting from the GTX 10xx generation.
Column slicing is extremely fast; I'll come back with benchmarks when I have time.

Possible to constant propagate if statement without dispatch?

Julia (v0.5) does not constant propagate the following, leading to poor performance:
julia> g(::Int) = true
g (generic function with 1 method)
julia> f(x) = g(x) ? 1 : 1.0
f (generic function with 1 method)
julia> @code_warntype f(1)
Variables:
  #self#::#f
  x::Int64
Body:
begin
    unless $(QuoteNode(true)) goto 3
    return 1
    3:
    return 1.0
end::Union{Float64,Int64}
Instead, I have to do the following:
julia> g(::Int) = Val{true}
g (generic function with 1 method)
julia> f_(::Type{Val{true}}) = 1
f_ (generic function with 1 method)
julia> f_(::Type{Val{false}}) = 1.0
f_ (generic function with 2 methods)
julia> f(x) = f_(g(x))
f (generic function with 1 method)
Although this works, it requires defining an additional function, which creates additional compile-time overhead. Is there an existing solution that works on v0.5, without this overhead?
As you note in the issue you posted, LLVM does indeed do this constant propagation so there's no branch at runtime. The issue is simply that the method is still type-unstable since inference isn't the one doing the constant propagation and dead code elimination.
Another possible workaround for this sort of type-inference issue is a generated function. It's very heavy-handed, though, and is likely to have even more compile-time overhead.
Indeed, if you look at the first-run times, you can see that while the generated function needs fewer allocations, it still takes about 2-3x longer. Both lower to the exact same LLVM/native code, so the only consideration here is compile-time and complexity. And in that regard, the generated function loses handily. I think your current workaround is as good as it gets for now.
julia> g(::Int) = Val{true}
f_(::Type{Val{true}}) = 1
f_(::Type{Val{false}}) = 1.0
f(x) = f_(g(x))
f (generic function with 1 method)
julia> @time f(1)
0.002720 seconds (521 allocations: 30.203 KB)
julia> g′(::Type{Int}) = true
       @generated f′(x) = g′(x) ? 1 : 1.0
f′ (generic function with 1 method)
julia> @time f′(1)
0.007655 seconds (351 allocations: 21.125 KB)