I am trying to generate a random hexadecimal string for a blockchain wallet application but I am not sure how I would do it in Julia. In Python, I would do something like
import os
from ecdsa import SigningKey, VerifyingKey, SECP256k1
seed = os.urandom(SECP256k1.baselen)
You can use randstring specifying the allowed characters:
julia> using Random
# default length is 8 characters
julia> randstring(['0':'9'; 'a':'f'])
"e7e15070"
# other custom lengths
julia> randstring(['0':'9'; 'a':'f'], 4)
"9a6a"
julia> randstring(['0':'9'; 'a':'f'], 12)
"df398edb1937"
Alternatively you can use rand + bytes2hex:
julia> bytes2hex(rand(UInt8, 4))
"9fae741a"
rand and randstring use by default a MersenneTwister pseudorandom number generator. If you want to use different streams of random numbers you can use RandomDevice instead:
julia> randstring(RandomDevice(), ['0':'9'; 'a':'f'])
"6f4499e3"
julia> bytes2hex(rand(RandomDevice(), UInt8, 4))
"4d1c035c"
As stated in the documentation of the standard library:
Julia also provides the RandomDevice RNG type, which is a wrapper over the OS provided entropy.
Note that the word "entropy" is abused here. The OS will - with high likelihood - not directly return entropy, it will return random bytes from it's own seeded pseudo random number generator.
Related
Unpacking binary data in Python:
import struct
bytesarray = "01234567".encode('utf-8')
# Return a new Struct object which writes and reads binary data according to the format string.
s = struct.Struct('=BI3s')
s = s.unpack(bytesarray) # Output: (48, 875770417, b'567')
Does Raku have a similar function to Python's Struct? How can I unpack binary data according to a format string in Raku?
There's the experimental unpack
use experimental :pack;
my $bytearray = "01234567".encode('utf-8');
say $bytearray.unpack("A1 L H");
It's not exactly the same, though; this outputs "(0 875770417 35)". You can tweak your way through it a bit, maybe.
There's also an implementation of Perl's pack / unpack in P5pack
I encountered the following issue making some tests to demonstrate the usefulness of a pure pyarrow UDF in pyspark as compared to always going through pandas.
import awkward
import numpy
import pandas
import pyarrow
counts = numpy.random.randint(0,20,size=200000)
content = numpy.random.normal(size=counts.sum())
test_jagged = awkward.JaggedArray.fromcounts(counts, content)
test_arrow = awkward.toarrow(test_jagged)
def awk_arrow(col):
jagged = awkward.fromarrow(col)
jagged2 = jagged**2
return awkward.toarrow(jagged2)
def pds_arrow(col):
pds = col.to_pandas()
pds2 = pds**2
return pyarrow.Array.from_pandas(pds2)
out1 = awk_arrow(test_arrow)
out2 = pds_arrow(test_arrow)
out3 = awkward.fromarrow(out1)
out4 = awkward.fromarrow(out2)
type(out3)
type(out4)
yields
<class 'awkward.array.jagged.JaggedArray'>
<class 'awkward.array.masked.BitMaskedArray'>
and
out3 == out4
yields (at the end of the stack trace):
AttributeError: no column named 'reshape'
looking at the arrays:
print(out3);print();print(out4);
[[0.00736072240594475 0.055560612050914775 0.4094101942882973 ... 2.4428454924678533 0.07220045904440388 3.627270394986972] [0.16496227597707766 0.44899025266849046 1.314602433843517 ... 0.07384558862546337 0.5655043672418324 4.647396184088295] [0.04356259421421215 1.8983172440218923 0.10442121937532822 0.7222467989756899 0.03199694383894229 0.954281670741488] ... [0.23437909336737087 2.3050822727237272 0.10325064534860394 0.685018355096147] [0.8678765133108529 0.007214659054089928 0.3674379091794599 0.1891573101427716 2.1412651888713317 0.1461282900111415] [0.3315468986268042 2.7520115602119772 1.3905787720409803 ... 4.476255451581318 0.7237199572195625 0.8820112289563018]]
[[0.00736072240594475 0.055560612050914775 0.4094101942882973 ... 2.4428454924678533 0.07220045904440388 3.627270394986972] [0.16496227597707766 0.44899025266849046 1.314602433843517 ... 0.07384558862546337 0.5655043672418324 4.647396184088295] [0.04356259421421215 1.8983172440218923 0.10442121937532822 0.7222467989756899 0.03199694383894229 0.954281670741488] ... [0.23437909336737087 2.3050822727237272 0.10325064534860394 0.685018355096147] [0.8678765133108529 0.007214659054089928 0.3674379091794599 0.1891573101427716 2.1412651888713317 0.1461282900111415] [0.3315468986268042 2.7520115602119772 1.3905787720409803 ... 4.476255451581318 0.7237199572195625 0.8820112289563018]]
You can see the contents and shape of the arrays are the same, but they're not comparable to each other at face value, which is very counter intuitive. Is there a good reason for dense jagged structures with no Nulls to be represented as a BitMaskedArray?
All data in Arrow are nullable (at every level), and they use bit masks (as opposed to byte masks) to specify which elements are valid. The specification allows columns of entirely valid data to not write the bitmask, but not every writer takes advantage of that freedom. Quite often, you see unnecessary bitmasks.
When it encounters a bitmask, such as here, awkward inserts a BitMaskedArray.
It could be changed to check to see if the mask is unnecessary and skip that step, though that adds an operation that scales with the size of the dataset (though likely insignificant in most cases—bitmasks are 8 times faster to check than bytemasks). It's also a little complicated: the last byte may be incomplete if the length of the dataset is not a multiple of 8. One would need to check these bits individually, but the rest of the mask could be checked in bulk. (Maybe even cast as int64 to check 64 flags at a time.)
I've been struggling to find a way to get this calc that works for a dask workflow.
I have code that uses np.random.mulivariate_normal function and while many of these types are available to us on dask array it seems this one it not. Sooo.... I attempted to create my own based on an example provided in the dask documentation.
Here is my attempt which is giving errors that I am having difficulty understanding. I also provided random input variables to make it easy to replicate:
import numpy as np
from dask.distributed import Client
import dask.array as da
def mvn(mu, sigma, n, blocksize):
chunks = ((blocksize,) * (n // blocksize),
(blocksize,) * (n // blocksize))
name = 'mvn' # unique identifier
dsk = {(name, i, j): (np.random.multivariate_normal(mu,sigma, blocksize))
if i == j else
(np.zeros, (blocksize, blocksize))
for i in range(n // blocksize)
for j in range(n // blocksize)}
dtype = np.random.multivariate_normal(0).dtype # take dtype default from numpy
return da.Array(dsk, name, chunks, dtype)
n = 10000
A = da.random.normal(0, 1, size=(n,n), chunks=(1000, 1000))
sigma = da.dot(A,A.transpose())
mu = 4.0*da.ones(n, chunks = 1000)
R = da.numpy.random.mvn(mu, sigma, n, chunks=(100))
Any suggestions or am I so far off the mark here that I should abandon all hope? Thanks!
If you have a cluster to run this on, you can use my answer from this post, copied here for refrence:
An work arround for now, is to use a cholesky decomposition. Note that any covariance matrix C can be expressed as C=G*G'. It then follows that x = G'*y is correlated as specified in C if y is standard normal (see this excellent post on StackExchange Mathematic). In code:
Numpy
n_dim =4
size = 100000
A = np.random.randn(n_dim, n_dim)
covm = A.dot(A.T)
x= np.random.multivariate_normal(size=size, mean=np.zeros(len(covm)),cov=covm)
## verify numpys covariance is correct
np.cov(x, rowvar=False)
covm
Dask
## create covariance matrix
A = da.random.standard_normal(size=(n_dim, n_dim),chunks=(2,2))
covm = A.dot(A.T)
## get cholesky decomp
L = da.linalg.cholesky(covm, lower=True)
## drawn standard normal
sn= da.random.standard_normal(size=(size, n_dim),chunks=(100,100))
## correct for correlation
x =L.dot(sn.T)
x.shape
## verify
covm.compute()
da.cov(x, rowvar=True).compute()
This answer can be fleshed out, but I imagine you would have an easier time using dask's delayed, da.from_delayed and da.*stack.
One immediate problem I see with what you have: with np.random.multivariate_normal(mu,sigma, blocksize) you are directly calling the function, instead of making the spec. You probably wanted (np.random.multivariate_normal, mu,sigma, blocksize). This shows that working with raw dask dictionaries can be tricky!
Using scipy.interpolate.interp1d it is possible to pass in a (1080, 4) nd.array and compute an interpolation function for each 'row' in a single command:
spline = interp1d(np.arange(1,5), np.random.random(1080,4), kind='cubic')
I am getting slightly different interpolation results (off the knots) than some existing Fortran code. I believe this is because the SciPy source is using a b-spline and the Fortran code is using splines derived from numerical recipes.
I am attempting to perform the same interpolation using UnivariateSpline with s=0, so InterpolatedUnivariateSpline.
I am able to get this working if I pass the data row by row, i.e. using an iterator to step over all 1080 rows - this is highly inefficient.
Using:
spline = UnivariateSpline(np.arange(1,5).reshape(-1,1), np.random.random(1080,4), s=0, k=3)
I am seeing:
failed in converting 2nd argument `y' of dfitpack.fpcurf0 to C/Fortran array
I believe this is an issue getting the multi-dimensional array into Fitpack? Any insight in how to avoid an iterator? Additionally, any insight into a SciPy interpolation function that matches the one described in numerical recipes (section 3.3, p.120) - You have to type the page number, I can not direct link, it is a Flash viewer...
In older version of SciPy (I observed it in 0.14) the splines returned by interp1d were of relatively poor quality. In versions 0.19 and later, interp1d is consistent with other spline routines, and since it accepts vector inputs, I think that answers the question. Here is the comparison of three spline constructors: the latter two only take one row as input.
from scipy.interpolate import interp1d, UnivariateSpline, splrep, splev
x = np.arange(1, 5)
y = np.random.normal(size=(1080, 4))
spl1 = interp1d(x, y, kind='cubic')
spl2 = UnivariateSpline(x, y[123, :], s=0, k=3)
spl3 = splrep(x, y[123, :], s=0, k=3)
t = 2.345
print(spl1(t)[123], spl2(t), splev(t, spl3))
This prints (with my random numbers)
-0.333973049011 -0.333973049011 -0.333973049011
I have two computers with python 2.7.2 (MSC v.1500 32 bit (Intel)] on win32) and numpy 1.6.1.
But
numpy.mean(data)
returns
1.13595094681 on my old computer
and
1.13595104218 on my new computer
where
Data = [ 0.20227873 -0.02738848 0.59413314 0.88547146 1.26513398 1.21090782
1.62445402 1.80423951 1.58545554 1.26801944 1.22551131 1.16882968
1.19972098 1.41940248 1.75620842 1.28139281 0.91190684 0.83705413
1.19861531 1.30767155]
In both cases
s=0
for n in data[:20]:
s+=n
print s/20
gives
1.1359509334
Can anyone explain why and how to avoid?
Mads
If you want to avoid any differences between the two, then make them explicitly 32-bit or 64-bit float arrays. NumPy uses several other libraries that may be 32 or 64 bit. Note that rounding can occur in your print statements as well:
>>> import numpy as np
>>> a = [0.20227873, -0.02738848, 0.59413314, 0.88547146, 1.26513398,
1.21090782, 1.62445402, 1.80423951, 1.58545554, 1.26801944,
1.22551131, 1.16882968, 1.19972098, 1.41940248, 1.75620842,
1.28139281, 0.91190684, 0.83705413, 1.19861531, 1.30767155]
>>> x32 = np.array(a, np.float32)
>>> x64 = np.array(a, np.float64)
>>> x32.mean()
1.135951042175293
>>> x64.mean()
1.1359509335
>>> print x32.mean()
1.13595104218
>>> print x64.mean()
1.1359509335
Another point to note is that if you have lower level libraries (e.g., atlas, lapack) that are multi-threaded, then for large arrays, you may have differences in your result regardless, due to possible variable order of operations and floating point precision.
Also, you are at the limit of precision for 32 bit numbers:
>>> x32.sum()
22.719021
>>> np.array(sorted(x32)).sum()
22.719019
This is happening because you have Float32 arrays (single precision). With single precision, the operations are only accurate to 6 decimal place. Hence your results are the same up to the 6th decimal place (after the decimal point, rounding the last digit), but they are not accurate after that. Different architectures/machines/compilers will yield the different results after that. If you want the same results you should use higher precision arrays (e.g. Float64).