I have a numpy array called "landuse" that's a series of numbers 1-3 representing different landuse categories. I want to convert this to a string based on a lookup table.
ids = [0,1,2,3]
lookup_table = ['None', 'Forest', 'Water', 'Urban']
First, let me explain why your loop isn't working. In Python, an assignment such as a = 1 takes the object 1 and binds it to the name a. When you do name = "Water", name forgets what it was pointing to before and now points to "Water", but that doesn't mean the object name previously pointed to gets replaced with "Water".
That's the problem; now for a fix. If you have your landuse as an array of integer codes, you can simply use a lookup table. The table needs to be big enough that you don't get an indexing error when you do lookup_table[landuse.max()]:
import numpy as np
landuse = np.array([1,2,3,1,2,4])
lookup_table = np.array(['None', 'Forest', 'Water', 'Urban', 'Other'])
landuse_title = lookup_table[landuse]
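The fancy indexing maps each integer code to its string in one vectorized step; printing the result of the example above gives:
>>> print(landuse_title)
['Forest' 'Water' 'Urban' 'Forest' 'Water' 'Other']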
And for the final part of your question: the numpy ndarray is a homogeneous data structure, meaning everything in the array must have the same data type. With that limitation in mind, it should be clear that you cannot take a row of integers and replace it with a row of strings. Numpy does have "flexible dtypes", which allow you to do something like:
>>> dt = np.dtype([('name', 'S4'), ('age', 'int'), ('height', 'float')])
>>> array = np.array([('Mark', 25, 70.5),('Ben',40,72.75)], dtype=dt)
>>> array
array([('Mark', 25, 70.5), ('Ben', 40, 72.75)],
      dtype=[('name', '|S4'), ('age', '<i4'), ('height', '<f8')])
>>> array.shape
(2,)
>>> array['name']
array(['Mark', 'Ben'],
      dtype='|S4')
We've created an array that holds a name, an age, and a height for each person, but notice that the shape of the array is (2,) because we have two "people" in the array. I'm not sure exactly what your needs are, but you could try using the flexible dtype to hold all the information in one array if that's what you need. Depending on my end goal, I often find it's easier to just use a few separate arrays, as sketched below, or a list of arrays. Hope that helps.
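As a minimal sketch of the separate-arrays alternative (the names and values here are made up for illustration, mirroring the structured-array example above):

import numpy as np

# hypothetical parallel arrays, one per field, kept aligned by position
names = np.array(['Mark', 'Ben'])
ages = np.array([25, 40])
heights = np.array([70.5, 72.75])

# a boolean mask computed on one array can index the others
print(names[ages > 30])   # ['Ben']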
I am not entirely clear what your question is, but it seems you could use a dictionary for this:
import numpy as np
landuse = np.array([1, 2, 3, 1, 2, 4], dtype=int)
a = {1: 'Forest', 2: 'Water'}
print([a.setdefault(i, 'Urban') for i in landuse])
which will emit a list containing the strings you are interested in:
['Forest', 'Water', 'Urban', 'Forest', 'Water', 'Urban']
If your objective is to have the final result in a numpy array of strings, you can do this:
name = np.array([a.setdefault(i, 'Urban') for i in landuse], dtype='|S10')
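One caveat: setdefault inserts any missing key into the dictionary as a side effect, so after the loop a also contains entries for 3 and 4. If you don't want the lookup to mutate a, dict.get does the same lookup without inserting:

print([a.get(i, 'Urban') for i in landuse])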
Related
How can I get a list of dtypes from a numpy structured array?
Create example structured array:
arr = np.array([[1.0, 2.0],[3.0, 4.0]])
dt = {'names':['ID', 'Ring'], 'formats':[np.double, np.double]}
arr.dtype = dt
>>> arr
array([[(1., 2.)],
       [(3., 4.)]], dtype=[('ID', '<f8'), ('Ring', '<f8')])
On one hand, it's easy to isolate the column names.
>>> arr.dtype.names
('ID', 'Ring')
However, ironically, none of the dtype attributes seem to reveal the individual dtypes.
I discovered that, despite not having dictionary methods like .items(), a dtype can still be indexed by field name, as in dtype['<column_name>'].
column_names = list(arr.dtype.names)
dtypes = [str(arr.dtype[n]) for n in column_names]
>>> dtypes
['float64', 'float64']
Or, as @hpaulj hinted, in one step:
>>> [str(v[0]) for v in arr.dtype.fields.values()]
['float64', 'float64']
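There is also dtype.descr, which returns (name, type-string) pairs in a single call; note that it gives the type strings ('<f8') rather than names like 'float64':

>>> arr.dtype.descr
[('ID', '<f8'), ('Ring', '<f8')]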
For the following:
d = np.array([[0,1,4,3,2],[10,18,4,7,5]])
print(d.shape)
Output is:
(2, 5)
This is expected.
But for this (note the different number of elements in the individual rows):
d = np.array([[0,1,4,3,2],[10,18,4,7]])
print(d.shape)
Output is:
(2,)
How to explain this behaviour?
Short answer: numpy parses it as an array of two objects: two lists.
Numpy is designed to process "rectangular" data. If you pass it non-rectangular (ragged) data, the np.array(..) function falls back on treating it as a list of objects.
Indeed, take a look at the dtype of the array here:
>>> d
array([list([0, 1, 4, 3, 2]), list([10, 18, 4, 7])], dtype=object)
It is a one-dimensional array that contains two items: two lists. These lists are stored as plain Python objects.
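Note that recent numpy versions no longer do this fallback silently: constructing a ragged array first triggered a VisibleDeprecationWarning and later became an error unless you ask for the object dtype explicitly. A minimal sketch:

import numpy as np

# explicit object dtype: each row is kept as an ordinary Python list
d = np.array([[0, 1, 4, 3, 2], [10, 18, 4, 7]], dtype=object)
print(d.shape)     # (2,)
print(type(d[0]))  # <class 'list'>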
I am trying to append a pandas dataframe based on two pre-existing columns in that dataframe. The issue I'm having is that the index of the pandas dataframe is in object format, not integer format. To make things more complicated, I only want to append a certain range of the dataframe, leaving the remaining cells in the new column as 'NaN'. In order to append over only a certain range of the dataframe, I will have to use a "for" loop.
Here is my question: How do I loop over a certain range of the dataframe when I have an object index?
My initial pandas dataframe is simply...
import pandas as pd
dates = ['2005Q4','2006Q1','2006Q2','2006Q3','2006Q4','2007Q1','2007Q2']
col1 = [ 5.9805, 6.2181, 6.3508, 6.7878, 6.6212, 6.4583, 6.4068 ]
col2 = [ 'NaN', -0.001054985938, -0.121731711952, 0.046275331889,
-0.017517211963, -0.023422842422, 0.009072170884 ]
data = pd.DataFrame(
{
'col1': col1,
'col2': col2
},
columns = [
'col1',
'col2'
],
index = dates
)
All I'm trying to do is something like this...
data['col3'] = 'NaN'
for i in range('2006Q1','2006Q4',1):
data['col3'][i] = data['col1'][i-1] +\
data['col2'][i]
Naively, I had hoped that python would be able to correlate the label in the index with the integer position associated with it. For example, if I define the index as given, python would know that '2005Q4' is index 0, '2006Q1' is index 1, etc. In this way, I could use the strings in the range() function and it would still know which integers I'm referring to. However, this appears not to be the case.
I need to avoid converting the objects into date format as well. It is important that I keep the index in the format 'YearQuarter', and I have yet to find a simple way of using pd.to_datetime that is able to do this.
Does anyone have any suggestions on how to loop over only a certain range of object-based indices in python?
.index() on a list returns the position of the item you're looking for, so you can recover the integer bounds from the labels. Try this tweak to your for loop:
for i in range(dates.index('2006Q1'),dates.index('2006Q4'),1):
Obviously, however, there are much more efficient ways to do this. .shift() shifts your entire column up or down by however many rows you want; your loop (col3[i] = col1[i-1] + col2[i]) becomes:
data['col3'] = data['col1'].shift(1) + data['col2']
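If you also need to restrict the assignment to the '2006Q1'..'2006Q4' rows while keeping the string index, here is a hedged sketch (it assumes those labels exist in the index, and it converts col2 to numeric first, since the column as built above contains the string 'NaN'):

import numpy as np
import pandas as pd

# make col2 numeric; the string 'NaN' becomes a real NaN
data['col2'] = pd.to_numeric(data['col2'], errors='coerce')

# translate the string labels into integer positions once
start = data.index.get_loc('2006Q1')
stop = data.index.get_loc('2006Q4') + 1   # +1 so '2006Q4' is included

data['col3'] = np.nan
col3 = data['col1'].shift(1) + data['col2']
data.iloc[start:stop, data.columns.get_loc('col3')] = col3.iloc[start:stop]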
I need to compare a bunch of numpy arrays with different dimensions, say:
a = np.array([1,2,3])
b = np.array([1,2,3],[4,5,6])
assert(a == b[0])
How can I do this if I do not know the shapes of a and b in advance, beyond the fact that
len(shape(a)) == len(shape(b)) - 1
and neither do I know which dimension to skip from b. I'd like to use np.index_exp, but that does not seem to help me ...
def compare_arrays(a,b,skip_row):
u = np.index_exp[ ... ]
assert(a[:] == b[u])
Edit
Or, to put it differently, I want to construct a slicing expression given the shape of the array and the dimension I want to drop. How do I dynamically create the np.index_exp if I know the number of dimensions and the positions where to put ":" and where to put "0"?
I was just looking at the code for apply_along_axis and apply_over_axes, studying how they construct indexing objects.
Lets make a 4d array:
In [355]: b=np.ones((2,3,4,3),int)
Make a list of slices (using list replication with *):
In [356]: ind=[slice(None)]*b.ndim
In [357]: b[ind].shape # same as b[:,:,:,:]
Out[357]: (2, 3, 4, 3)
In [358]: ind[2]=2 # replace one slice with index
In [359]: b[ind].shape # a slice, indexing on the third dim
Out[359]: (2, 3, 3)
Or with your example
In [361]: b = np.array([1,2,3],[4,5,6]) # missing []
...
TypeError: data type not understood
In [362]: b = np.array([[1,2,3],[4,5,6]])
In [366]: ind=[slice(None)]*b.ndim
In [367]: ind[0]=0
In [368]: a==b[ind]
Out[368]: array([ True, True, True], dtype=bool)
This indexing is basically the same as np.take, but the same idea can be extended to other cases.
I don't quite follow your questions about the use of :. Note that when building an indexing list I use slice(None). The interpreter translates all indexing : into slice objects: [start:stop:step] => slice(start, stop, step).
Usually you don't need to use a[:] == b[0]; a == b[0] is sufficient. With lists, alist[:] makes a copy; with arrays it does nothing extra (unless used on the left-hand side of an assignment, a[:] = ...).
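One caveat for newer numpy versions: indexing with a plain list, as in b[ind] above, is deprecated; the list has to be converted to a tuple first. A minimal sketch of the same idea, wrapped in a hypothetical helper:

import numpy as np

def slice_axis(b, axis, idx=0):
    # full slices on every axis, a single integer index on the chosen one
    ind = [slice(None)] * b.ndim
    ind[axis] = idx
    return b[tuple(ind)]   # newer numpy requires a tuple, not a list

a = np.array([1, 2, 3])
b = np.array([[1, 2, 3], [4, 5, 6]])
assert (a == slice_axis(b, axis=0)).all()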
I can convert a pandas string column to Categorical, but when I try to insert it as a new DataFrame column it seems to get converted right back to Series of str:
train['LocationNFactor'] = pd.Categorical.from_array(train['LocationNormalized'])
>>> type(pd.Categorical.from_array(train['LocationNormalized']))
<class 'pandas.core.categorical.Categorical'>
# however it got converted back to...
>>> type(train['LocationNFactor'][2])
<type 'str'>
>>> train['LocationNFactor'][2]
'Hampshire'
Guessing this is because Categorical doesn't map to any numpy dtype; so do I have to convert it to some int type, and thus lose the factor labels<->levels association?
What's the most elegant workaround to store the levels<->labels association and retain the ability to convert back? (just store as a dict like here, and manually convert when needed?)
I think Categorical is still not a first-class datatype for DataFrame, unlike R.
(Using pandas 0.10.1, numpy 1.6.2, python 2.7.3 - the latest macports versions of everything).
The only workaround for pandas pre-0.15 I found is as follows:
The column must be converted to a Categorical for the classifier, but numpy will immediately coerce the levels back to int, losing the factor information. So, store the factor in a global variable outside the dataframe:
train_LocationNFactor = pd.Categorical.from_array(train['LocationNormalized']) # default order: alphabetical
train['LocationNFactor'] = train_LocationNFactor.labels # insert in dataframe
[UPDATE: pandas 0.15+ added decent support for Categorical]
The labels<->levels is stored in the index object.
To convert an integer array to string array: index[integer_array]
To convert a string array to integer array: index.get_indexer(string_array)
Here is an example:
In [56]:
c = pd.Categorical.from_array(['a', 'b', 'c', 'd', 'e'])
idx = c.levels
In [57]:
idx[[1,2,1,2,3]]
Out[57]:
Index([b, c, b, c, d], dtype=object)
In [58]:
idx.get_indexer(["a","c","d","e","a"])
Out[58]:
array([0, 2, 3, 4, 0])
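For reference, on pandas 0.15+ the same round trip goes through the category dtype directly; a minimal sketch:

import pandas as pd

s = pd.Series(['a', 'b', 'c', 'a'], dtype='category')
codes = s.cat.codes            # integer codes: 0, 1, 2, 0
levels = s.cat.categories      # Index(['a', 'b', 'c'], dtype='object')
restored = levels[codes]       # back to the string labels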