Finding smallest dtype to safely cast an array to - numpy

Let's say I want to find the smallest data type I can safely cast this array to, to save it as efficiently as possible. (The expected output is int8.)
arr = np.array([-101,125,6], dtype=np.int64)
The most logical solution seems something like
np.min_scalar_type(arr) # dtype('int64')
but that function doesn't work as expected for arrays. It just returns their original data type.
The next thing I tried is this:
np.promote_types(np.min_scalar_type(arr.min()), np.min_scalar_type(arr.max())) # dtype('int16')
but that still doesn't output the smallest possible data type.
What's a good way to achieve this?

Here's a working solution I wrote. It will only work for integers.
def smallest_dtype(arr):
arr_min = arr.min()
arr_max = arr.max()
for dtype_str in ["u1", "i1", "u2", "i2", "u4", "i4", "u8", "i8"]:
if (arr_min >= np.iinfo(np.dtype(dtype_str)).min) and (arr_max <= np.iinfo(np.dtype(dtype_str)).max):
return np.dtype(dtype_str)

This is close to your initial idea:
np.result_type(np.min_scalar_type(arr.min()), arr.max())
It will take the signed int8 from arr.min() if arr.max() fits inside of it.

Related

Numpy - how do I erase elements of an array if it is found in an other array

TLDR: I have 2 arrays indices = numpy.arange(9) and another that contains some of the numbers in indices (maybe none at all, maybe it'll contain [2,4,7]). The output I'd like for this example is [0,1,3,5,6,8]. What method can be used to achieve this?
Edit: I found a method which works somewhat: casting both arrays to a set then taking the difference of the two does give the correct result, but as a set, even if I pass this result to a numpy.array(). I'll update this if I find a solution for that.
Edit2: Casting the result of the subtraction to a list, then casting passing that to a numpy.array() resolved my issue.
I guess I posted this question a little prematurely, given that I found the solution for it myself, but maybe this'll be useful to somebody in future!
You can make use of boolean masking:-
indices[~numpy.isin(indices,[2,4,7])]
Explanation:-
we are using numpy.isin() method to find out the values exists or not in incides array and then using ~ so that this gives opposite result and finally we are passing this boolean mask to indices

TypeError: <class 'datetime.time'> is not convertible to datetime

The problem is somewhat simple. My objective is to compute the days difference between two dates, say A and B.
These are my attempts:
df['daydiff'] = df['A']-df['B']
df['daydiff'] = ((df['A']) - (df['B'])).dt.days
df['daydiff'] = (pd.to_datetime(df['A'])-pd.to_datetime(df['B'])).dt.days
These works for me before but for some reason, I'm keep getting this error this time:
TypeError: class 'datetime.time' is not convertible to datetime
When I export the df to excel, then the date works just fine. Any thoughts?
Use pd.Timestamp to handle the awkward differences in your formatted times.
df['A'] = df['A'].apply(pd.Timestamp) # will handle parsing
df['B'] = df['B'].apply(pd.Timestamp) # will handle parsing
df['day_diff'] = (df['A'] - df['B']).dt.days
Of course, if you don't want to change the format of the df['A'] and df['B'] within the DataFrame that you are outputting, you can do this in a one-liner.
df['day_diff'] = (df['A'].apply(pd.Timestamp) - df['B'].apply(pd.Timestamp)).dt.days
This will give you the days between as an integer.
When I applied the solution offered by emmet02, I got TypeError: Cannot convert input [00:00:00] of type as well. It's basically saying that the dataframe contains missing timestamp values which are represented as [00:00:00], and this value is rejected by pandas.Timestamp function.
To address this, simply apply a suitable missing-value strategy to clean your data set, before using
df.apply(pd.Timestamp)

How can I change column data type from float to string in Julia?

I am trying to get a column in a dataframe form float to string. I have tried
df = readtable("data.csv", coltypes = {String, String, String, String, String, Float64, Float64, String});
but I got complained
syntax: { } vector syntax is discontinued
I also have tried
dfB[:serial] = string(dfB[:serial])
but it didn't work either. So, I'd like to know what would be the proper approach to change column data type in Julia.
thx
On your first attempt, Julia tells you what the problem is - you can't make a vector with {}, you need to use []. Also, the name of the keyword argument should be eltypes rather than coltypes.
On the second try, you don't have a float, you have a Vector of floats. So to change the type you need to change the type of all elements. In Julia, elementwise operations on vectors are generalized by the 'dot' syntax, e.g. string.(collect(dfB[:serial])) . The collect is needed currently to cast the DataArray to a normal Array first – this will fail if the DataArray contains NAs. IMHO the DataFrames interface is still rather wonky, so expect a few headaches like this ATM.

What is the syntax to instantiate a structured dtype in numpy?

If I have a dtype like
foo = dtype([('chrom1', '<f4', (100,)), ('chrom2', '<f4', (13,))])
How can I create an instance of that dtype, as a scalar.
Background, in case There's A Better Way:
I want to efficiently represent arrays of scalars mapping directly to the bases in a genome, chromosome by chromosome. I don't want arrays of these genomic arrays, each one is simply a structured set of scalars that I want to reference by name/position, and be able to add/subtract/etc.
It appears that dtype.type() is maybe the path forward, but I haven't found useful documentation for correctly calling this function yet.
So suppose I have:
chrom1_array = numpy.arange(100)
chrom2_array = numpy.arange(13)
genomic_array = foo.type([chrom1_array, chrom2_array])
That last line isn't right, but hopefully it conveys what I'm currently attempting.
Is this a horrible idea? If so, what's the right idea? If not, what's the correct way to implement it?
This sort of works, but is terrible:
bar = np.zeros(1, dtype=[('chrom1', 'f4', 100), ('chrom2', 'f4', 13)])[0]
try this:
foo = np.dtype([('chrom1', '<f4', (100,)), ('chrom2', '<f4', (13,))])
t = np.zeros((), dtype=foo)

Convert an alphanumeric string to integer format

I need to store an alphanumeric string in an integer column on one of my models.
I have tried:
#result.each do |i|
hex_id = []
i["id"].split(//).each{|c| hex_id.push(c.hex)}
hex_id = hex_id.join
...
Model.create(:origin_id => hex_id)
...
end
When I run this in the console using puts hex_id in place of the create line, it returns the correct values, however the above code results in the origin_id being set to "2147483647" for every instance. An example string input is "t6gnk3pp86gg4sboh5oin5vr40" so that doesn't make any sense to me.
Can anyone tell me what is going wrong here or suggest a better way to store a string like the aforementioned example as a unique integer?
Thanks.
Answering by request form OP
It seems that the hex_id.join operation does not concatenate strings in this case but instead sums or performs binary complement of the hex values. The issue could also be that hex_id is an array of hex-es rather than a string, or char array. Nevertheless, what seems to happen is reaching the maximum positive value for the integer type 2147483647. Still, I was unable to find any documented effects on array.join applied on a hex array, it appears it is not concatenation of the elements.
On the other hand, the desired result 060003008600401100500050040 is too large to be recorded as an integer either. A better approach would be to keep it as a string, or use different algorithm for producing a number form the original string. Perhaps aggregating the hex values by an arithmetic operation will do better than join ?