Pytables table into pandas DataFrame

There is lots of information on how to read a CSV into a pandas DataFrame, but what I have is a PyTables table and I want a pandas DataFrame.
I've found how to store my pandas DataFrame in PyTables... then when I read it back, at this point it will have:
"kind = v._v_attrs.pandas_type"
I could write it out as CSV and re-read it, but that seems silly; that is what I am doing for now.
How should I be reading pytable objects into pandas?

import tables as pt
import pandas as pd
import numpy as np
# the content is junk but we don't care
grades = np.empty((10,), dtype=[('name', 'S20'), ('grade', 'u2')])
# write to a PyTables table
handle = pt.open_file('/tmp/test_pandas.h5', 'w')
handle.create_table('/', 'grades', grades)
print(handle.root.grades[:].dtype)  # it is a structured array
# load back as a DataFrame and check types
df = pd.DataFrame.from_records(handle.root.grades[:])
print(df.dtypes)
handle.close()
Beware that your u2 (unsigned 2-byte integer) will end up as an i8 (8-byte signed integer), and the strings will be objects, because pandas does not yet support the full range of dtypes that are available for NumPy arrays.
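If those widened types matter, you can cast the columns back after loading; a small sketch based on the example above (the decode step assumes the 'S20' bytes are ASCII/UTF-8):
df['grade'] = df['grade'].astype('uint16')   # restore the unsigned 2-byte integer
df['name'] = df['name'].str.decode('utf-8')  # 'S20' bytes -> Python strings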

The docs now include an excellent section on using the HDF5 store and there are some more advanced strategies discussed in the cookbook.
It's now relatively straightforward:
In [1]: from pandas import DataFrame, HDFStore
In [2]: store = HDFStore('store.h5')
In [3]: print(store)
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
Empty
In [4]: df = DataFrame([[1, 2], [3, 4]], columns=['A', 'B'])
In [5]: store['df'] = df
In [6]: store
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
/df frame (shape->[2,2])
And to retrieve from HDF5/pytables:
In [7]: store['df']  # store.get('df') is an equivalent
Out[7]:
   A  B
0  1  2
1  3  4
You can also query within a table.
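Querying only works for data stored in the table format; a minimal sketch (the key 'df_table' and the condition are just illustrative):
store.put('df_table', df, format='table', data_columns=True)  # table format supports queries
store.select('df_table', where='A > 1')                       # returns only the matching rows
store.close()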

dtype definition for pandas dataframe with columns of VARCHAR or String

I have some data in dictionaries that needs to go into a pandas DataFrame.
The DataFrame is later written to a PostgreSQL table using SQLAlchemy, and I would like to get the right column types.
Hence, I specify the dtypes for the DataFrame:
dtypes = {"forretningshændelse": sqlalchemy.types.String(length=8),
"forretningsområde": sqlalchemy.types.String(length=40),
"forretningsproces": sqlalchemy.types.INTEGER(),
"id_namespace": sqlalchemy.types.String(length=100),
"id_lokalId": sqlalchemy.types.String(length=36),
"kommunekode": sqlalchemy.types.INTEGER(),
"registreringFra": sqlalchemy.types.DateTime()}
Later I use df = pd.DataFrame(item_lst, dtype=dtypes), where item_lst is a list of dictionaries.
Regardless of whether I use String(8), String(length=8) or VARCHAR(8) in the dtype definition, pd.DataFrame(item_lst, dtype=dtypes) always fails with the error object of type 'String' has no len() (or 'VARCHAR', respectively).
How do I have to define the dtype to overcome this error?
Instead of forcing data types when the DataFrame is created, let pandas infer the data types (just df = pd.DataFrame(item_lst)) and then use your dtypes dict with to_sql() when you push your DataFrame to the database, like this:
from pprint import pprint
import pandas as pd
import sqlalchemy
engine = sqlalchemy.create_engine("sqlite://")
item_lst = [{"forretningshændelse": "foo"}]
df = pd.DataFrame(item_lst)
print(df.info())
"""
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 forretningshændelse 1 non-null object
dtypes: object(1)
memory usage: 136.0+ bytes
None
"""
dtypes = {"forretningshændelse": sqlalchemy.types.String(length=8)}
df.to_sql("tbl", engine, index=False, dtype=dtypes)
insp = sqlalchemy.inspect(engine)
pprint(insp.get_columns("tbl"))
"""
[{'autoincrement': 'auto',
'default': None,
'name': 'forretningshændelse',
'nullable': True,
'primary_key': 0,
'type': VARCHAR(length=8)}]
"""
I believe you are confusing the dtypes within the DataFrame with the dtypes on the SQL table itself.
You probably don't need to manually specify the datatypes in pandas itself but if you do, here's how.
Spoiler alert: the pandas.DataFrame documentation states that only a single dtype can be passed to the constructor, so you will need some loops or per-column work to get different types.
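For example, a per-column cast after construction is one way to do that; a minimal sketch using column names from the question (astype also accepts a column-to-dtype mapping):
import pandas as pd
df = pd.DataFrame(item_lst)  # let pandas infer dtypes first
df = df.astype({"kommunekode": "int64", "forretningsproces": "int64"})  # then cast per column
print(df.dtypes)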
To solve your problem:
import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine("connection_string")
df = pd.DataFrame(item_list)
dtypes = {"forretningshændelse": sqlalchemy.types.String(length=8),
          "forretningsområde": sqlalchemy.types.String(40),
          "forretningsproces": sqlalchemy.types.INTEGER(),
          "id_namespace": sqlalchemy.types.String(100),
          "id_lokalId": sqlalchemy.types.String(36),
          "kommunekode": sqlalchemy.types.INTEGER(),
          "registreringFra": sqlalchemy.types.DateTime()}
with engine.begin() as conn:  # begin() ensures the write is committed
    df.to_sql("table_name", con=conn, if_exists="replace", dtype=dtypes)
Tip: avoid using special characters in identifiers in general; it only makes maintaining the code harder at some point :). I assumed you're creating a new SQL table and not appending; otherwise the column types for the table would already be defined.
Happy Coding!

How would I access individual elements of this pandas dataFrame?

How would I access the individual elements of the DataFrame below?
More specifically, how would I retrieve/extract the string "CN112396173" at index 2 of the DataFrame?
Thanks
A more accurate description of your problem would be: "Getting the first word of a string column in pandas".
You can use data["PN"].str.split(expand=True)[0]. See the docs.
>>> import pandas as pd
>>> df = pd.DataFrame({"column": ["asdf abc cdf"]})
>>> series = df["column"].str.split(expand=True)[0]
>>> series
0    asdf
Name: 0, dtype: object
>>> series.to_list()
['asdf']
dtype: object is actually normal (in pandas, strings are 'objects').
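To then pull out the single value at index 2 (the "CN112396173" string in your frame), index into the resulting Series; a sketch assuming your column is named "PN" as above:
first_words = data["PN"].str.split(expand=True)[0]
value = first_words.iloc[2]  # positional lookup; use .loc[2] if 2 is an index label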

How to split a cell which contains nested array in a pandas DataFrame

I have a pandas DataFrame which contains 610 rows, and every row contains a nested list of coordinate pairs; it looks like this:
[1377778.4800000004, 6682395.377599999] is one coordinate pair.
I want to unnest every row, so instead of one row containing a list of coordinates I will have one row for every coordinate pair, i.e.:
I've tried s.apply(pd.Series).stack() from the question "Split nested array values from Pandas Dataframe cell over multiple rows", but unfortunately that didn't work.
Any ideas? Many thanks in advance!
Here is my new answer to your problem. I used reduce to flatten your nested array and then itertools.chain to turn everything into a 1-d list. After that I reshaped the list into a 2-d array, which allows you to convert it to the DataFrame that you need. I tried to be as generic as possible. Please let me know if there are any problems.
# libraries
import operator
from functools import reduce
from itertools import chain

import numpy as np
import pandas as pd

# flatten the lists of lists using reduce, then turn everything into a 1-d list
# using itertools.chain
reduced_coordinates = list(chain.from_iterable(reduce(operator.concat, geometry_list)))

# reshape the 1-d coordinate list to 2-d and convert it to a dataframe
df = pd.DataFrame(np.reshape(reduced_coordinates, (-1, 2)))
df.columns = ['X', 'Y']
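For example, with a small made-up geometry_list of two rows (each row being a list of coordinate pairs, as in the question), this gives one output row per pair:
geometry_list = [
    [[1377778.48, 6682395.38], [6582395.38, 2577778.48]],  # row 1: two coordinate pairs
    [[1000000.00, 2000000.00]],                            # row 2: one coordinate pair
]
reduced_coordinates = list(chain.from_iterable(reduce(operator.concat, geometry_list)))
df = pd.DataFrame(np.reshape(reduced_coordinates, (-1, 2)), columns=['X', 'Y'])
print(df)  # three rows, columns X and Y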
One thing you can do is use NumPy. It allows you to perform a lot of list/array operations in a fast and efficient way. This includes "unnesting" (reshaping) lists. Then you only have to convert to a pandas DataFrame.
For example,
import numpy as np
import pandas as pd

# your list
coordinate_list = [[[1377778.4800000004, 6682395.377599999], [6582395.377599999, 2577778.4800000004], [6582395.377599999, 2577778.4800000004]]]

# convert list to array
coordinate_array = np.array(coordinate_list)

# print shape of array
print(coordinate_array.shape)

# reshape array into pairs of coordinates
reshaped_array = np.reshape(coordinate_array, (3, 2))
df = pd.DataFrame(reshaped_array)
df.columns = ['X', 'Y']
The output will look like this. Let me know if there is something I am missing.
import pandas as pd
import numpy as np
data = np.arange(500).reshape([250, 2])
cols = ['coord']
new_data = []
for item in data:
    new_data.append([item])
df = pd.DataFrame(data=new_data, columns=cols)
print(df.head())
def expand(row):
    row['x'] = row.coord[0]
    row['y'] = row.coord[1]
    return row
df = df.apply(expand, axis=1)
df.drop(columns='coord', inplace=True)
print(df.head())
RESULT
    coord
0  [0, 1]
1  [2, 3]
2  [4, 5]
3  [6, 7]
4  [8, 9]
   x  y
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9

Save a pandas DataFrame in a group of h5py for later use

I want to append a pandas DataFrame object to an existing h5py file, whether as a subgroup or dataset, with all the index and header information. Is that possible? I tried the following:
import pandas as pd
import h5py
f = h5py.File('f.h5', 'r+')
df = pd.DataFrame([[1,2,3],[4,5,6]], columns=['A', 'B', 'C'], index=['X', 'Y'])
f['df'] = df
From another script, I would like to access f.h5, but the output of f['df'][()] is array([[1, 2, 3],[4, 5, 6]]), which doesn't contain the header information.
You can write to an existing hdf5 file directly with Pandas via pd.DataFrame.to_hdf() and read it back in with pd.read_hdf(). You just have to make sure to read and write with the same key.
To write to the h5 file:
existing_hdf5 = "f.h5"
df = pd.DataFrame([[1,2,3],[4,5,6]],
columns=['A', 'B', 'C'], index=['X', 'Y'])
df.to_hdf(existing_hdf5 , key='df')
Then you can read it back with:
df2 = pd.read_hdf(existing_hdf5, key='df')
print(df2)
   A  B  C
X  1  2  3
Y  4  5  6
Note that you can also make the DataFrame appendable by writing it with format="table", which requires the optional dependency PyTables.
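A minimal sketch of that append pattern (same file and key as above; both writes need the table format):
df.to_hdf(existing_hdf5, key='df', format='table')  # first write, in appendable table format
more_rows = pd.DataFrame([[7, 8, 9]], columns=['A', 'B', 'C'], index=['Z'])
more_rows.to_hdf(existing_hdf5, key='df', format='table', append=True)  # append extra rows
print(pd.read_hdf(existing_hdf5, key='df'))  # now contains rows X, Y and Z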

pandas calling an element from imported data

I have a csv file with dates.
import pandas as pd
spam=pd.read_csv('DATA.csv', parse_dates=[0], usecols=[0], header=None)
spam.shape is (n, 1).
How can I access an element as I do in NumPy (e.g. I have an array with A.shape => (n, 1); if I call A[5, 1] I get the element in the 5th row, 1st column)?
Numpy arrays index at zero, so you'll actually need A[4,0] to get the element on the 5th row of the 1st column.
But this is how you'd get the same behaviour with a pandas DataFrame.
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(np.random.randn(2,2)) # create a 2 by 2 DataFrame object
>>> df.ix[1,1]
-1.206712609725652
>>> df
          0         1
0 -0.281467  1.124922
1  0.580617 -1.206713
iloc is for integer positions only, whereas ix worked for both integers and labels; note that ix was deprecated and has since been removed, so in current pandas prefer iloc (or loc for label-based access).
>>> df.iloc[1,1]
-1.206712609725652
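Applied to the spam DataFrame from the question, a short sketch (note the zero-based offsets):
value = spam.iloc[4, 0]  # element in the 5th row, 1st column
value = spam.iat[4, 0]   # .iat is a faster accessor when you want a single scalar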