How to generate pandas DataFrame column of Categorical from string column? - pandas

I can convert a pandas string column to Categorical, but when I try to insert it as a new DataFrame column it seems to get converted right back to Series of str:
train['LocationNFactor'] = pd.Categorical.from_array(train['LocationNormalized'])
>>> type(pd.Categorical.from_array(train['LocationNormalized']))
<class 'pandas.core.categorical.Categorical'>
# however it got converted back to...
>>> type(train['LocationNFactor'][2])
<type 'str'>
>>> train['LocationNFactor'][2]
'Hampshire'
Guessing this is because Categorical doesn't map to any numpy dtype; so do I have to convert it to some int type, and thus lose the factor labels<->levels association?
What's the most elegant workaround to store the levels<->labels association and retain the ability to convert back? (just store as a dict like here, and manually convert when needed?)
I think Categorical is still not a first-class datatype for DataFrame, unlike R.
(Using pandas 0.10.1, numpy 1.6.2, python 2.7.3 - the latest macports versions of everything).

The only workaround I found for pandas pre-0.15 is as follows:
the column must be converted to a Categorical for the classifier, but numpy will immediately coerce the levels back to int, losing the factor information
so store the factor in a global variable outside the dataframe
train_LocationNFactor = pd.Categorical.from_array(train['LocationNormalized'])  # default order: alphabetical
train['LocationNFactor'] = train_LocationNFactor.labels  # insert the integer codes in the dataframe
[UPDATE: pandas 0.15+ added decent support for Categorical]
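For reference, on pandas 0.15+ the same thing can be done directly inside the DataFrame via the category dtype; a minimal sketch (column names mirror the question, the data is made up):

import pandas as pd

train = pd.DataFrame({'LocationNormalized': ['Hampshire', 'London', 'Hampshire']})
# the column is stored as a real categorical dtype, no separate variable needed
train['LocationNFactor'] = train['LocationNormalized'].astype('category')

train['LocationNFactor'].cat.codes        # integer codes: 0, 1, 0
train['LocationNFactor'].cat.categories   # Index(['Hampshire', 'London'], dtype='object')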

The labels<->levels mapping is stored in the index object.
To convert an integer array to string array: index[integer_array]
To convert a string array to integer array: index.get_indexer(string_array)
Here is an example:
In [56]:
c = pd.Categorical.from_array(['a', 'b', 'c', 'd', 'e'])
idx = c.levels
In [57]:
idx[[1,2,1,2,3]]
Out[57]:
Index([b, c, b, c, d], dtype=object)
In [58]:
idx.get_indexer(["a","c","d","e","a"])
Out[58]:
array([0, 2, 3, 4, 0])
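Tying this back to the pre-0.15 workaround above (a sketch assuming the train_LocationNFactor variable and LocationNFactor column from that answer), the stored integer codes can be mapped back to strings the same way:

# levels is an Index, so indexing it with the integer codes restores the strings
restored = train_LocationNFactor.levels[train['LocationNFactor'].values]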

Related

Convert a float column with nan to int pandas

I am trying to convert a float pandas column with nans to int format, using apply.
I would like to use something like this:
df.col = df.col.apply(to_integer)
where the function to_integer is given by
def to_integer(x):
    if np.isnan(x):
        return np.NaN
    else:
        return int(x)
However, when I attempt to apply it, the column remains the same.
How could I achieve this without having to use the standard technique of dtypes?
You can't have NaN in an int column; NaN is a float (unless you use an object dtype, which is not a good idea since you'll lose most vectorized operations).
You can however use the new nullable integer type (which uses pd.NA).
Conversion can be done with convert_dtypes:
df = pd.DataFrame({'col': [1, 2, None]})
df = df.convert_dtypes()
# type(df.at[0, 'col'])
# numpy.int64
# type(df.at[2, 'col'])
# pandas._libs.missing.NAType
output:
col
0 1
1 2
2 <NA>
Not sure how you would achieve this without using dtypes. Sometimes when loading in data, you may have a column that contains mixed dtypes. Loading in a column with one dtype and attempting to turn it into mixed dtypes is not possible though (at least, not that I know of).
So I will echo what @mozway said and suggest you use nullable integer data types,
e.g.
df['col'] = df['col'].astype('Int64')
(note the capital I)
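A minimal sketch of that suggestion on made-up data (assuming a pandas version recent enough that astype('Int64') accepts NaN):

import pandas as pd

df = pd.DataFrame({'col': [1.0, 2.0, float('nan')]})
df['col'] = df['col'].astype('Int64')  # nullable integer dtype, note the capital I
print(df['col'])
# 0       1
# 1       2
# 2    <NA>
# Name: col, dtype: Int64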

How to get a list of dtypes from a numpy structured array?

How can I get a list of dtypes from a numpy structured array?
Create example structured array:
arr = np.array([[1.0, 2.0],[3.0, 4.0]])
dt = {'names':['ID', 'Ring'], 'formats':[np.double, np.double]}
arr.dtype = dt
>>> arr
array([[(1., 2.)],
[(3., 4.)]], dtype=[('ID', '<f8'), ('Ring', '<f8')])
On one hand, it's easy to isolate the column names.
>>> arr.dtype.names
('ID', 'Ring')
However, ironically, none of the dtype attributes seem to reveal the individual dtypes.
Discovered that, despite not having dictionary methods like .items(), you can still index the dtype with a field name: dtype['<column_name>'].
column_names = list(arr.dtype.names)
dtypes = [str(arr.dtype[n]) for n in column_names]
>>> dtypes
['float64', 'float64']
Or, as @hpaulj hinted, in one step:
>>> [str(v[0]) for v in arr.dtype.fields.values()]
['float64', 'float64']
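If the format strings ('<f8') are acceptable instead of the names ('float64'), another option (for simple, non-nested structured dtypes) is dtype.descr, which yields the (name, format) pairs in one call:

>>> arr.dtype.descr
[('ID', '<f8'), ('Ring', '<f8')]
>>> [fmt for name, fmt in arr.dtype.descr]
['<f8', '<f8']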

Why list of pd.Interval doesn't recognized by DataFrame automatically?

intervals = [pd.Interval(0, 0.1), pd.Interval(1, 5)]
pd.DataFrame({'d':intervals}).dtypes
Produces dtype as Object not Interval:
>>> d object
>>> dtype: object
But at the same time list of, for example, DateTimes is recognized on the fly:
datetimes = [pd.Timestamp('2013-01-02'), pd.Timestamp('2016-10-20')]
pd.DataFrame({'d':datetimes}).dtypes
>>> d datetime64[ns]
>>> dtype: object
Is the situation with intervals similar to a list of strings - the default dtype of the column in the DataFrame will be object as well, because the DataFrame doesn't 'know' whether we want to treat this column as objects (for dumping to disk, ...), as strings (for concatenation, ...), or even as elements of a category type? If so, what would the different use cases with intervals be? If not, what is the case here?
This is a bug in pandas: https://github.com/pandas-dev/pandas/issues/23563
For now, the cleanest workaround is to wrap the list with pd.array:
In [1]: import pandas as pd; pd.__version__
Out[1]: '0.24.2'
In [2]: intervals = [pd.Interval(0, 0.1), pd.Interval(1, 5)]
In [3]: pd.DataFrame({'d': pd.array(intervals)}).dtypes
Out[3]:
d interval[float64]
dtype: object
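If the object column already exists, one possible in-place conversion (a sketch, assuming a pandas version that ships pd.arrays.IntervalArray) is:

df = pd.DataFrame({'d': intervals})         # dtype: object
df['d'] = pd.arrays.IntervalArray(df['d'])  # dtype: interval[float64]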

dtype is ignored when using multilevel columns

When using pd.read_csv with multi-level columns (read with header=), pandas seems to ignore the dtype= keyword.
Is there a way to make pandas use the passed types?
I am reading large data sets from CSV and therefore try to read the data already in the correct format to save CPU and memory.
I tried passing a dict to dtype= with tuples as well as strings as keys. It seems that dtype expects strings. At least I observed that if I pass the level 0 keys, the types are assigned, but unfortunately that would mean that all columns with the same level 0 label would get the same type. In the example below, columns (A, int16) and (A, int32) would get type object and (B, float32) and (B, int16) would get float32.
import pandas as pd
df = pd.DataFrame({
    ('A', 'int16'): pd.Series([1, 2, 3, 4], dtype='int16'),
    ('A', 'int32'): pd.Series([132, 232, 332, 432], dtype='int32'),
    ('B', 'float32'): pd.Series([1.01, 1.02, 1.03, 1.04], dtype='float32'),
    ('B', 'int16'): pd.Series([21, 22, 23, 24], dtype='int16')})
print(df)
df.to_csv('test_df.csv')
print(df.dtypes)
# full column name tuples with level 0/1 labels don't work
df_new = pd.read_csv(
    'test_df.csv',
    header=list(range(2)),
    dtype={
        ('A', 'int16'): 'int16',
        ('A', 'int32'): 'int32'
    })
print(df_new.dtypes)
# using the level 0 labels for dtype= seems to work
df_new2 = pd.read_csv(
    'test_df.csv',
    header=list(range(2)),
    dtype={
        'A': 'object',
        'B': 'float32'
    })
print(df_new2.dtypes)
I'd expect print(df_new.dtypes) to output the same column types as the first print(df.dtypes), but read_csv does not seem to use the dtype= argument at all and infers the types, resulting in much more memory-intensive types.
Was I missing something?
Thank you in advance Jottbe
This is a bug that is also present in the current version of pandas. I filed a bug report here.
But also for the current version, there is a workaround. It works perfectly if the engine is switched to python:
df_new = pd.read_csv(
    'test_df.csv',
    header=list(range(2)),
    engine='python',
    dtype={
        ('A', 'int16'): 'int16',
        ('A', 'int32'): 'int32'
    })
print(df_new.dtypes)
The output is:
Unnamed: 0_level_0  Unnamed: 0_level_1      int64
A                   int16                   int16
                    int32                   int32
B                   float32               float64
                    int16                   int64
So the "A" columns are typed as specified in dtype=.
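If switching the engine is not an option, a possible fallback (a sketch; it parses with inferred types first, so it won't save memory during the read itself) is to cast after reading, since astype accepts the tuple column labels of a MultiIndex directly:

df_new = pd.read_csv('test_df.csv', header=list(range(2)))
df_new = df_new.astype({
    ('A', 'int16'): 'int16',
    ('A', 'int32'): 'int32',
    ('B', 'float32'): 'float32',
    ('B', 'int16'): 'int16'})
print(df_new.dtypes)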

Numpy Integer to String Based on Lookup Table

I have a numpy array called "landuse" that's a series of numbers 1-3 representing different landuse categories. I want to convert this to a string based on a lookup table.
ids = [0,1,2,3]
lookup_table = ['None', 'Forest', 'Water', 'Urban']
First, let me explain why your loop isn't working. In Python an assignment, i.e. a = 1, takes the object 1 and gives it the name a. When you do name = "Water", name forgets what it was pointing to before and now points to "Water", but that doesn't mean the previous object that was bound to name gets replaced with "Water".
That's the problem, and now for a fix. If you have your landuse as an array of integer codes you can just use a lookup table. The table should be big enough so you don't get an indexing error when you do lookup_table[landuse.max()]
import numpy as np
landuse = np.array([1,2,3,1,2,4])
lookup_table = np.array(['None', 'Forest', 'Water', 'Urban', 'Other'])
landuse_title = lookup_table[landuse]
And for the final part of your question, the numpy ndarray is a homogeneous data structure, meaning everything in the array must have the same data type. With that limitation in mind, it should be clear that you cannot take a row of integers and replace it with a row of strings. Numpy does have "flexible dtypes" which allow you to do something like:
>>> dt = np.dtype([('name', 'S4'), ('age', 'int'), ('height', 'float')])
>>> array = np.array([('Mark', 25, 70.5),('Ben',40,72.75)], dtype=dt)
>>> array
array([('Mark', 25, 70.5), ('Ben', 40, 72.75)],
dtype=[('name', '|S4'), ('age', '<i4'), ('height', '<f8')])
>>> array.shape
(2,)
>>> array['name']
array(['Mark', 'Ben'],
dtype='|S4')
We've created an array that holds a name, age and height for each person, but notice that the shape of the array is (2,) because we have two "people" in the array. I'm not sure exactly what your needs are, but you could try to use the flexible dtype to hold all the information in one array if that's what you need. Depending on what my end goal is, I often find it's easier to just use a few separate arrays, or a list of arrays. Hope that helps.
I am not entirely clear what your question is, but it seems you could use a dictionary for this:
import numpy as np
landuse = np.array([1, 2, 3, 1, 2, 4], dtype=int)
a = {1: 'Forest', 2: 'Water'}
print([a.setdefault(i, 'Urban') for i in landuse])
which will emit a list containing the strings you are interested in:
['Forest', 'Water', 'Urban', 'Forest', 'Water', 'Urban']
If your objective is to have the final result in a numpy array of strings, you can do this:
name = np.array([a.setdefault(i, 'Urban') for i in landuse], dtype='|S10')