How to get a list of dtypes from a numpy structured array? - numpy

How can I get a list of dtypes from a numpy structured array?
Create example structured array:
arr = np.array([[1.0, 2.0],[3.0, 4.0]])
dt = {'names':['ID', 'Ring'], 'formats':[np.double, np.double]}
arr.dtype = dt
>>> arr
array([[(1., 2.)],
       [(3., 4.)]], dtype=[('ID', '<f8'), ('Ring', '<f8')])
On one hand, it's easy to isolate the column names.
>>> arr.dtype.names
('ID', 'Ring')
However, ironically, none of the dtype attributes seem to reveal the individual dtypes.

It turns out that, despite the dtype not having dictionary methods like .items(), you can still index it by field name with dtype['<column_name>'].
column_names = list(arr.dtype.names)
dtypes = [str(arr.dtype[n]) for n in column_names]
>>> dtypes
['float64', 'float64']

Or, as @hpaulj hinted, in one step:
>>> [str(v[0]) for v in arr.dtype.fields.values()]
['float64', 'float64']
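For convenience (a small extra step, not part of the answer above), the names and per-field dtypes can also be combined into a single mapping, or read off dtype.descr:
>>> {n: str(arr.dtype[n]) for n in arr.dtype.names}
{'ID': 'float64', 'Ring': 'float64'}
>>> arr.dtype.descr
[('ID', '<f8'), ('Ring', '<f8')]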

Related

dtype definition for pandas dataframe with columns of VARCHAR or String

I want to get some data in a dictionary that need to go into a pandas dataframe.
The dataframe is later written in a PostgreSQL table using sqlalchemy, and I would like to get the right column types.
Hence, I specify the dtypes for the dataframe
dtypes = {"forretningshændelse": sqlalchemy.types.String(length=8),
"forretningsområde": sqlalchemy.types.String(length=40),
"forretningsproces": sqlalchemy.types.INTEGER(),
"id_namespace": sqlalchemy.types.String(length=100),
"id_lokalId": sqlalchemy.types.String(length=36),
"kommunekode": sqlalchemy.types.INTEGER(),
"registreringFra": sqlalchemy.types.DateTime()}
Later I use df = pd.DataFrame(item_lst, dtype=dtypes), where item_lst is a list of dictionaries.
Regardless of whether I use String(8), String(length=8), or VARCHAR(8) in the dtype definition, pd.DataFrame(item_lst, dtype=dtypes) always fails with object of type '(String or VARCHAR)' has no len().
How do I have to define the dtype to overcome this error?
Instead of forcing data types when the DataFrame is created, let pandas infer the data types (just df = pd.DataFrame(item_lst)) and then use your dtypes dict with to_sql() when you push your DataFrame to the database, like this:
from pprint import pprint
import pandas as pd
import sqlalchemy
engine = sqlalchemy.create_engine("sqlite://")
item_lst = [{"forretningshændelse": "foo"}]
df = pd.DataFrame(item_lst)
print(df.info())
"""
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 1 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   forretningshændelse  1 non-null      object
dtypes: object(1)
memory usage: 136.0+ bytes
None
"""
dtypes = {"forretningshændelse": sqlalchemy.types.String(length=8)}
df.to_sql("tbl", engine, index=False, dtype=dtypes)
insp = sqlalchemy.inspect(engine)
pprint(insp.get_columns("tbl"))
"""
[{'autoincrement': 'auto',
  'default': None,
  'name': 'forretningshændelse',
  'nullable': True,
  'primary_key': 0,
  'type': VARCHAR(length=8)}]
"""
I believe you are confusing the dtypes within the DataFrame with the dtypes on the SQL table itself.
You probably don't need to manually specify the datatypes in pandas itself but if you do, here's how.
Spoiler alert: the pandas.DataFrame documentation states that only a single dtype can be passed at construction time, so you will need some loops or per-column work to get different types.
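If you do want per-column dtypes inside pandas itself, one option (a sketch using the column names from the question; the exact pandas dtypes chosen here are an assumption) is to let pandas infer first and then cast column by column, since DataFrame.astype accepts a dict of pandas/numpy dtypes rather than SQLAlchemy types:
import pandas as pd

df = pd.DataFrame(item_lst)                                    # let pandas infer first
df["registreringFra"] = pd.to_datetime(df["registreringFra"])  # parse the datetime column
df = df.astype({"forretningsproces": "Int64",                  # nullable integer dtype
                "kommunekode": "Int64"})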
To solve your problem:
import pandas as pd
import sqlalchemy
engine = sqlalchemy.create_engine("connection_string")
df = pd.DataFrame(item_list)
dtypes = {"forretningshændelse": sqlalchemy.types.String(length=8),
"forretningsområde": sqlalchemy.types.String(40),
"forretningsproces": sqlalchemy.types.INTEGER(),
"id_namespace": sqlalchemy.types.String(100),
"id_lokalId": sqlalchemy.types.String(36),
"kommunekode": sqlalchemy.types.INTEGER(),
"registreringFra": sqlalchemy.types.DateTime()}
with engine.connect() as connection:
    df.to_sql("table_name", con=connection, if_exists="replace", dtype=dtypes)
Tip: avoid special characters in identifiers in general; they only make the code harder to maintain at some point :). I assumed you are creating a new SQL table rather than appending; otherwise the column types would already be defined on the table.
Happy Coding!

How to split a cell which contains nested array in a pandas DataFrame

I have a pandas DataFrame which contains 610 rows, and every row contains a nested list of coordinate pairs; it looks like this:
[1377778.4800000004, 6682395.377599999] is one coordinate pair.
I want to unnest every row, so instead of one row containing a list of coordinates I will have one row for every coordinate pair, i.e.:
I've tried s.apply(pd.Series).stack() from this question Split nested array values from Pandas Dataframe cell over multiple rows but unfortunately that didn't work.
Any ideas? Many thanks in advance!
Here is my new answer to your problem. I used reduce to flatten your nested array and then itertools.chain to turn everything into a 1-d list. After that I reshaped the list into a 2-d array, which lets you convert it to the dataframe you need. I tried to be as generic as possible. Please let me know if there are any problems.
# libraries
import operator
from functools import reduce
from itertools import chain

import numpy as np
import pandas as pd

# flatten the lists of lists using reduce, then turn everything into a 1-d list
# using itertools.chain
reduced_coordinates = list(chain.from_iterable(reduce(operator.concat,
                                                      geometry_list)))

# reshape the 1-d list of coordinates into a 2-d array and convert it to a dataframe
df = pd.DataFrame(np.reshape(reduced_coordinates, (-1, 2)))
df.columns = ['X', 'Y']
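For example, with a hypothetical stand-in for geometry_list (made-up values, just to show the shape of the result) the snippet above produces one row per coordinate pair:
# hypothetical sample: two geometries with nested coordinate pairs
geometry_list = [[[1377778.48, 6682395.3776], [6582395.3776, 2577778.48]],
                 [[1377779.0, 6682396.0]]]

reduced_coordinates = list(chain.from_iterable(reduce(operator.concat, geometry_list)))
df = pd.DataFrame(np.reshape(reduced_coordinates, (-1, 2)), columns=['X', 'Y'])
# df now has three rows, one per coordinate pair, with columns X and Y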
One thing you can do is use numpy. It allows you to perform a lot of list/array operations in a fast and efficient way, including "unnesting" (reshaping) lists. Then you only have to convert the result to a pandas dataframe.
For example,
import numpy as np
import pandas as pd

# your list
coordinate_list = [[[1377778.4800000004, 6682395.377599999], [6582395.377599999, 2577778.4800000004], [6582395.377599999, 2577778.4800000004]]]

# convert the list to an array
coordinate_array = np.array(coordinate_list)

# print the shape of the array
coordinate_array.shape

# reshape the array into pairs of coordinates
reshaped_array = np.reshape(coordinate_array, (3, 2))
df = pd.DataFrame(reshaped_array)
df.columns = ['X', 'Y']
The output will look like this. Let me know if there is something I am missing.
import pandas as pd
import numpy as np

data = np.arange(500).reshape([250, 2])
cols = ['coord']

new_data = []
for item in data:
    new_data.append([item])

df = pd.DataFrame(data=new_data, columns=cols)
print(df.head())

def expand(row):
    row['x'] = row.coord[0]
    row['y'] = row.coord[1]
    return row

df = df.apply(expand, axis=1)
df.drop(columns='coord', inplace=True)
print(df.head())
RESULT
    coord
0  [0, 1]
1  [2, 3]
2  [4, 5]
3  [6, 7]
4  [8, 9]

   x  y
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9
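As a side note (my addition, not part of the original answer), the apply step can be avoided on reasonably recent pandas by expanding the list column directly; a sketch assuming every coord entry holds exactly two values:
df = pd.DataFrame(data=new_data, columns=cols)
df[['x', 'y']] = pd.DataFrame(df['coord'].tolist(), index=df.index)
df = df.drop(columns='coord')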

Why isn't a list of pd.Interval recognized by DataFrame automatically?

intervals = [pd.Interval(0, 0.1), pd.Interval(1, 5)]
pd.DataFrame({'d':intervals}).dtypes
This produces dtype object, not Interval:
>>> d object
>>> dtype: object
But at the same time list of, for example, DateTimes is recognized on the fly:
datetimes = [pd.Timestamp('2013-01-02'), pd.Timestamp('2016-10-20')]
pd.DataFrame({'d':datetimes}).dtypes
>>> d datetime64[ns]
>>> dtype: object
Is the situation with intervals similar to a list of strings - the default dtype of the column in the DataFrame will be object as well, because the DataFrame doesn't 'know' whether we want to treat the column as generic objects (for dumping to disk, ...), as strings (for concatenation, ...), or even as elements of a category type? If so, what different use cases with intervals might there be? If not, what is the case here?
This is a bug in pandas: https://github.com/pandas-dev/pandas/issues/23563
For now, the cleanest workaround is to wrap the list with pd.array:
In [1]: import pandas as pd; pd.__version__
Out[1]: '0.24.2'
In [2]: intervals = [pd.Interval(0, 0.1), pd.Interval(1, 5)]
In [3]: pd.DataFrame({'d': pd.array(intervals)}).dtypes
Out[3]:
d interval[float64]
dtype: object

How to generate pandas DataFrame column of Categorical from string column?

I can convert a pandas string column to Categorical, but when I try to insert it as a new DataFrame column it seems to get converted right back to a Series of str:
train['LocationNFactor'] = pd.Categorical.from_array(train['LocationNormalized'])
>>> type(pd.Categorical.from_array(train['LocationNormalized']))
<class 'pandas.core.categorical.Categorical'>
# however it got converted back to...
>>> type(train['LocationNFactor'][2])
<type 'str'>
>>> train['LocationNFactor'][2]
'Hampshire'
Guessing this is because Categorical doesn't map to any numpy dtype; so do I have to convert it to some int type, and thus lose the factor labels<->levels association?
What's the most elegant workaround to store the levels<->labels association and retain the ability to convert back? (just store as a dict like here, and manually convert when needed?)
I think Categorical is still not a first-class datatype for DataFrame, unlike R.
(Using pandas 0.10.1, numpy 1.6.2, python 2.7.3 - the latest macports versions of everything).
The only workaround I found for pandas pre-0.15 is as follows: the column must be converted to a Categorical for the classifier, but numpy will immediately coerce the levels back to int, losing the factor information. So store the factor in a separate variable outside the dataframe:
train_LocationNFactor = pd.Categorical.from_array(train['LocationNormalized'])  # default order: alphabetical
train['LocationNFactor'] = train_LocationNFactor.labels  # insert the integer codes in the dataframe
[UPDATE: pandas 0.15+ added decent support for Categorical]
The labels<->levels mapping is stored in the index object.
To convert an integer array to string array: index[integer_array]
To convert a string array to integer array: index.get_indexer(string_array)
Here is an example:
In [56]:
c = pd.Categorical.from_array(['a', 'b', 'c', 'd', 'e'])
idx = c.levels
In [57]:
idx[[1,2,1,2,3]]
Out[57]:
Index([b, c, b, c, d], dtype=object)
In [58]:
idx.get_indexer(["a","c","d","e","a"])
Out[58]:
array([0, 2, 3, 4, 0])
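For reference, on pandas 0.15+ the same round trip is built in via the category dtype; a rough sketch of the modern API (sample values made up, only 'Hampshire' is taken from the question):
import pandas as pd

train = pd.DataFrame({'LocationNormalized': ['Hampshire', 'London', 'Hampshire']})
train['LocationNFactor'] = train['LocationNormalized'].astype('category')

codes = train['LocationNFactor'].cat.codes         # integer codes: 0, 1, 0
levels = train['LocationNFactor'].cat.categories   # Index(['Hampshire', 'London'])
restored = levels[codes.to_numpy()]                 # back to the string labels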

Numpy Integer to String Based on Lookup Table

I have a numpy array called "landuse" that's a series of numbers 1-3 representing different landuse categories. I want to convert this to a string based on a lookup table.
ids = [0,1,2,3]
lookup_table = ['None', 'Forest', 'Water', 'Urban']
First, let me explain why your loop isn't working. In Python an assignment, i.e. a = 1, takes the object 1 and gives it the name a. When you do name = "Water", name forgets what it was pointing to before and now points to "Water", but that doesn't mean the object previously assigned to name gets replaced with "Water".
That's the problem, and now for a fix. If you have your landuse as an array of integer codes you can just use a lookup table. The table should be big enough so you don't get an indexing error when you do lookup_table[landuse.max()]
import numpy as np
landuse = np.array([1,2,3,1,2,4])
lookup_table = np.array(['None', 'Forest', 'Water', 'Urban', 'Other'])
landuse_title = lookup_table[landuse]
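With the sample landuse above, landuse_title should come out roughly as follows (the exact dtype width depends on the longest label):
>>> landuse_title
array(['Forest', 'Water', 'Urban', 'Forest', 'Water', 'Other'], dtype='<U6')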
And for the final part of your question, the numpy ndarray is a homogeneous data structure, meaning everything in the array must have the same data type. With that limitation in mind, it should be clear that you cannot take a row of the integers and replace it with a row of strings. Numpy does have "flexible dtypes" which allow you to do something like:
>>> dt = np.dtype([('name', 'S4'), ('age', 'int'), ('height', 'float')])
>>> array = np.array([('Mark', 25, 70.5),('Ben',40,72.75)], dtype=dt)
>>> array
array([('Mark', 25, 70.5), ('Ben', 40, 72.75)],
      dtype=[('name', '|S4'), ('age', '<i4'), ('height', '<f8')])
>>> array.shape
(2,)
>>> array['name']
array(['Mark', 'Ben'],
      dtype='|S4')
We've created an array that holds a name, age and height for each person, but notice that the shape of the array is (2,) because we have two "people" in the array. I'm not sure exactly what your needs are, but you could try to use the flexible dtype to hold all the information in one array if that's what you need. Depending on my end goal, I often find it's easier to just use a few separate arrays, or a list of arrays. Hope that helps.
I am not entirely clear what your question is, but it seems you could use a dictionary for this:
import numpy as np

landuse = np.array([1, 2, 3, 1, 2, 4], dtype=int)
a = {1: 'Forest', 2: 'Water'}
print([a.setdefault(i, 'Urban') for i in landuse])
which will emit a list containing the strings you are interested in:
['Forest', 'Water', 'Urban', 'Forest', 'Water', 'Urban']
If your objective is to have the final result in a numpy array of strings, you can do this:
name = np.array([a.setdefault(i, 'Urban') for i in landuse], dtype='|S10')
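One caveat (my addition): on Python 3 the fixed-width '|S10' dtype stores byte strings, so the repr shows b'Forest' and so on; use dtype='<U10' if you want plain text strings. With the sample landuse above, the result looks roughly like:
>>> name
array([b'Forest', b'Water', b'Urban', b'Forest', b'Water', b'Urban'],
      dtype='|S10')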