Pandas HDF5 append time series fails - pandas

Going through the documentation of pandas HDF5 usability (http://pandas.pydata.org/pandas-docs/stable/io.html#io-hdf5) the given example raises an error:
import pandas as pd
import numpy as np
store = pd.HDFStore('store.h5')
np.random.seed(1234)
index = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame(np.random.randn(8, 3), index=index)
store['df'] = df
df1 = df[0:4]
df2 = df[4:]
store.append('df', df1)
store.append('df', df2)
Traceback (most recent call last):
File "C:\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2885, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-225-ef7f2e059c6a>", line 1, in <module>
store.append('df', df1)
File "C:\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 919, in append
**kwargs)
File "C:\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 1252, in _write_to_group
raise ValueError('Can only append to Tables')
ValueError: Can only append to Tables
Has something changed here? Or am I doing something wrong?

store['df'] = df writes the frame in the default 'fixed' format, which does not support appending. Only the 'table' format can be appended to, so set it as the default at the beginning of your script:
pd.set_option('io.hdf.default_format','table')

Error while converting pandas dataframe to polars dataframe (pyarrow.lib.ArrowTypeError: Expected bytes, got a 'int' object)

I am converting pandas dataframe to polars dataframe but pyarrow throws error.
My code:
import polars as pl
import pandas as pd
if __name__ == "__main__":
    with open(r"test.xlsx", "rb") as f:
        excelfile = f.read()
    excelfile = pd.ExcelFile(excelfile)
    sheetnames = excelfile.sheet_names
    df = pd.concat(
        [
            pd.read_excel(
                excelfile, sheet_name=x, header=0)
            for x in sheetnames
        ], axis=0)
    df_pl = pl.from_pandas(df)
Error:
File "pyarrow\array.pxi", line 312, in pyarrow.lib.array
File "pyarrow\array.pxi", line 83, in pyarrow.lib._ndarray_to_array
File "pyarrow\error.pxi", line 122, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Expected bytes, got a 'int' object
I tried changing the pandas dataframe dtype to str and that solved the problem, but I don't want to change dtypes. Is it a bug in pyarrow, or am I missing something?
Edit: Polars 0.13.42 and later
Polars now has a read_excel function that will correctly handle this situation. read_excel is now the preferred way to read Excel files into Polars.
Note: to use read_excel, you will need to install xlsx2csv (which can be installed with pip).
Polars: prior to 0.13.42
I can replicate this result. It is due to a column in the original Excel file that contains both text and numbers.
For example, create a new Excel file with one column in which you type both numbers and text, save it, and run your code on that file. I get the following traceback:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/xxx/.virtualenvs/StackOverflow3.10/lib/python3.10/site-packages/polars/convert.py", line 299, in from_pandas
return DataFrame._from_pandas(df, rechunk=rechunk, nan_to_none=nan_to_none)
File "/home/xxx/.virtualenvs/StackOverflow3.10/lib/python3.10/site-packages/polars/internals/frame.py", line 454, in _from_pandas
pandas_to_pydf(
File "/home/xxx/.virtualenvs/StackOverflow3.10/lib/python3.10/site-packages/polars/internals/construction.py", line 485, in pandas_to_pydf
arrow_dict = {
File "/home/xxx/.virtualenvs/StackOverflow3.10/lib/python3.10/site-packages/polars/internals/construction.py", line 486, in <dictcomp>
str(col): _pandas_series_to_arrow(
File "/home/xxx/.virtualenvs/StackOverflow3.10/lib/python3.10/site-packages/polars/internals/construction.py", line 237, in _pandas_series_to_arrow
return pa.array(values, pa.large_utf8(), from_pandas=nan_to_none)
File "pyarrow/array.pxi", line 312, in pyarrow.lib.array
File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
File "pyarrow/error.pxi", line 122, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Expected bytes, got a 'int' object
There are several lengthy discussions on this issue, such as these:
to_parquet can't handle mixed type columns #21228
pyarrow.lib.ArrowTypeError: "Expected a string or bytes object, got a 'int' object" #349
This particular comment might be relevant, as you are concatenating the results of parsing multiple sheets in an Excel file. This may lead to conflicting dtypes for a column:
https://github.com/pandas-dev/pandas/issues/21228#issuecomment-419175116
How to approach this depends on your data and its use, so I can't recommend a blanket solution (such as fixing your source Excel file or changing the dtype to str).
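To find out which columns are affected before calling pl.from_pandas, one option is to count the distinct Python types in each object column. The following is a minimal sketch that uses only pandas. The helper name mixed_type_columns and the sample frames are made up for illustration, and it does not call Polars or pyarrow.

```python
import pandas as pd


def mixed_type_columns(df: pd.DataFrame) -> list:
    """Return names of object columns that hold more than one Python type.

    Hypothetical helper for illustration; missing values are ignored.
    """
    mixed = []
    for col in df.columns:
        if not pd.api.types.is_object_dtype(df[col]):
            continue
        # Map each non-null value to its Python type and count distinct types.
        n_types = df[col].dropna().map(type).nunique()
        if n_types > 1:
            mixed.append(col)
    return mixed


# Two made-up "sheets" whose column "code" was parsed differently
# (text in one sheet, numbers in the other).
sheet1 = pd.DataFrame({"code": ["a1", "b2"], "qty": [1, 2]})
sheet2 = pd.DataFrame({"code": [10, 20], "qty": [3, 4]})
df = pd.concat([sheet1, sheet2], axis=0, ignore_index=True)

print(mixed_type_columns(df))
```

Once the offending columns are known, you can decide per column whether to clean the source data or cast that column alone to str.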
I solved my problem by saving the pandas dataframe to a CSV file and then reading that file with Polars.
import os
import polars as pl
import pandas as pd
if __name__ == "__main__":
    with open(r"test.xlsx", "rb") as f:
        excelfile = f.read()
    excelfile = pd.ExcelFile(excelfile)
    sheetnames = excelfile.sheet_names
    df = pd.concat([pd.read_excel(excelfile, sheet_name=x, header=0)
                    for x in sheetnames
                    ], axis=0)
    df.to_csv("temp.csv", index=False)
    # read_csv loads eagerly; scan_csv is lazy and would still need the
    # file after it has been removed below
    df_pl = pl.read_csv("temp.csv")
    os.remove("temp.csv")

Problem after groupby (pandas), the grouped column is not accessible

I have a problem after groupby and receive this error message:
Traceback (most recent call last):
File "C:\Users\User\PycharmProjects\HashTag_Curso\venv\lib\site-packages\pandas\core\indexes\base.py", line 3080, in get_loc
return self._engine.get_loc(casted_key)
File "pandas\_libs\index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\hashtable_class_helper.pxi", line 4554, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas\_libs\hashtable_class_helper.pxi", line 4562, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Ano'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:/Users/User/PycharmProjects/Bibliotecas/Exemplo.py", line 11, in <module>
x = dfg['Ano']
File "C:\Users\User\PycharmProjects\HashTag_Curso\venv\lib\site-packages\pandas\core\frame.py", line 3024, in __getitem__
indexer = self.columns.get_loc(key)
File "C:\Users\User\PycharmProjects\HashTag_Curso\venv\lib\site-packages\pandas\core\indexes\base.py", line 3082, in get_loc
raise KeyError(key) from err
KeyError: 'Ano'
import pandas as pd
from matplotlib import pyplot as plt
import numpy as np
from astropy.stats import biweight_midcorrelation as bw_cor
df = pd.read_csv(r'Bases_dados\D_1_4M\Tudo/combined.csv').iloc[:100000]
df['Ano'] = df['Data decimal']//1
dfg = df.groupby(by=["Ano"]).mean()
print(dfg)
x = dfg['Ano']
y = dfg['Lances']
r = np.corrcoef(x, y)[0][1]
bwr = bw_cor(x, y)
print(bwr, r)
plt.scatter(x, y)
plt.show()
If I use
x = df['Ano']
y = df['Lances']
it works fine, but with dfg (grouped by 'Ano') I get the error above.
When I print(dfg), the column "Ano" appears normally.
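After groupby, 'Ano' is no longer a column: it becomes the index of the result, which is why dfg['Ano'] raises KeyError even though print(dfg) still shows it. You can either call reset_index() on the result or pass as_index=False to groupby to begin with: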
dfg = df.groupby(by="Ano", as_index=False).mean()
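Here is a small sketch of both options on made-up data (the values below are placeholders standing in for your CSV):

```python
import pandas as pd

# Toy data standing in for the question's CSV (values are made up).
df = pd.DataFrame({
    "Ano": [2000, 2000, 2001, 2001],
    "Lances": [10.0, 20.0, 30.0, 50.0],
})

# Default groupby: the key "Ano" becomes the index of the result,
# so it is not among the columns any more.
by_index = df.groupby("Ano").mean()
print("Ano" in by_index.columns)
print(by_index.index.name)

# Option 1: move the index back into a regular column.
option1 = df.groupby("Ano").mean().reset_index()

# Option 2: keep the key as a column from the start.
option2 = df.groupby("Ano", as_index=False).mean()

x = option2["Ano"]
y = option2["Lances"]
print(option2)
```

If you would rather keep the grouped frame as it is, dfg.index also gives you the 'Ano' values directly.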

TypeError: _any() missing 1 required keyword-only argument: 'where'

I am trying to read the file using pandas but it is showing me a type error. I am not able to discern why. Can someone help me?
Below is my code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#prepare the files
df = pd.read_csv("~/Downloads/Boston.csv") # for doing modifications
Traceback (most recent call last):
File "", line 1, in
df = pd.read_csv("~/Downloads/Boston.csv") # for doing modifications
File "/Users/nikhiladiga/opt/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 676, in parser_f
low_memory=_c_parser_defaults["low_memory"],
File "/Users/nikhiladiga/opt/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 454, in _read
iterator = kwds.get("iterator", False)
File "/Users/nikhiladiga/opt/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 1148, in read
names : iterable of names
File "/Users/nikhiladiga/opt/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 435, in __init__
d = {'col1': [1, 2], 'col2': [3, 4]}
File "/Users/nikhiladiga/opt/anaconda3/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 233, in init_dict
datelike_vals = maybe_infer_to_datetimelike(values)
TypeError: _any() missing 1 required keyword-only argument: 'where'
It could be that read_csv has trouble parsing your file without further hints.
Try passing additional keyword arguments such as sep or usecols.
Refer to documentation for more: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

Loading .txt file from Google Cloud Storage into a Pandas DF

I'm trying to load a .txt file from a GCS bucket into a pandas DataFrame via pd.read_csv. When I run this code on my local machine (sourcing the .txt file from a local directory), it works perfectly. However, when I run the code in a Cloud Function, accessing the same .txt file from a GCS bucket, I get 'TypeError: cannot use a string pattern on a bytes-like object'.
The only difference is that I'm accessing the .txt file via the GCS bucket, so it's a bucket object (Blob) instead of a normal file. Would I need to download the blob as a string or as a file-like object first before calling pd.read_csv? The code is below:
def stage1_cogs_vfc(data, context):
    from google.cloud import storage
    import pandas as pd
    import dask.dataframe as dd
    import io
    import numpy as np
    start_bucket = 'my_bucket'
    storage_client = storage.Client()
    source_bucket = storage_client.bucket(start_bucket)
    df = pd.DataFrame()
    file_path = 'gs://my_bucket/SCE_Var_Fact_Costs.txt'
    df = pd.read_csv(file_path,skiprows=12, encoding ='utf-8', error_bad_lines= False, warn_bad_lines= False , header = None ,sep = '\s+|\^+',engine='python')
Traceback (most recent call last):
File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 383, in run_background_function
_function_handler.invoke_user_function(event_object)
File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 217, in invoke_user_function
return call_user_function(request_or_event)
File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 214, in call_user_function
event_context.Context(**request_or_event.context))
File "/user_code/main.py", line 20, in stage1_cogs_vfc
df = pd.read_csv(file_path,skiprows=12, encoding ='utf-8', error_bad_lines= False, warn_bad_lines= False , header = None ,sep = '\s+|\^+',engine='python')
File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 702, in parser_f
return _read(filepath_or_buffer, kwds)
File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 429, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 895, in __init__
self._make_engine(self.engine)
File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1132, in _make_engine
self._engine = klass(self.f, **self.options)
File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 2238, in __init__
self.unnamed_cols) = self._infer_columns()
File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 2614, in _infer_columns
line = self._buffered_line()
File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 2689, in _buffered_line
return self._next_line()
File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 2791, in _next_line
next(self.data)
File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 2379, in _read
yield pat.split(line.strip())
TypeError: cannot use a string pattern on a bytes-like object
I found a similar situation here.
I also noticed that on the line:
source_bucket = storage_client.bucket(source_bucket)
you are using "source_bucket" as both the variable name and the argument. I would suggest renaming one of them.
However, I think you'd like to see this doc for any further question related to the API itself: Storage Client - Google Cloud Storage API
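On the question of downloading the blob first: the error message suggests that the regex separator is being applied to bytes rather than text. A possible workaround, sketched below without any GCS access, is to decode the downloaded bytes and hand read_csv a text buffer. The bytes literal is a made-up stand-in for whatever your blob download returns, and whether this matches your actual file layout is an assumption.

```python
import io

import pandas as pd

# Stand-in for the bytes a blob download would return; the content is
# made up for illustration and is not the real file from the question.
raw = b"header line to skip\n1 2^3\n4 5^6\n"

# Decode to text first, so the regex separator sees str rather than bytes.
text = raw.decode("utf-8")

df = pd.read_csv(
    io.StringIO(text),
    skiprows=1,
    header=None,
    sep=r"\s+|\^+",  # same separator as in the question, as a raw string
    engine="python",
)
print(df)
```

The key point is that the python engine receives str lines, so the compiled string pattern and the data have matching types.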
Building on the points from @K_immer, here is my updated code, which reads the file into a Dask dataframe:
def stage1_cogs_vfc(data, context):
    from google.cloud import storage
    import pandas as pd
    import dask.dataframe as dd
    import io
    import numpy as np
    import datetime as dt
    start_bucket = 'my_bucket'
    destination_path = 'gs://my_bucket/ddf-*_cogs_vfc.csv'
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(start_bucket)
    blob = bucket.get_blob('SCE_Var_Fact_Costs.txt')
    df0 = pd.DataFrame()
    file_path = 'gs://my_bucket/SCE_Var_Fact_Costs.txt'
    df0 = dd.read_csv(file_path,skiprows=12, dtype=object ,encoding ='utf-8', error_bad_lines= False, warn_bad_lines= False , header = None ,sep = '\s+|\^+',engine='python')
    df0 = df0.compute() # converts dask df to pandas df
    # then do your heavy ETL stuff here using pandas...

How to avoid set_index on a pre-sorted DataFrame constructed with from_delayed?

I am trying to get the expression df.resample('1T', how='mean').sum() to work in Dask, but I am running into an issue: Dask seems to need me to explicitly call set_index on the DataFrame before performing the resample. I get the error below:
>>> c.gather(df).compute()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/site-packages/distributed/client.py", line 1508, in gather
asynchronous=asynchronous)
File "/usr/local/lib/python2.7/site-packages/distributed/client.py", line 615, in sync
return sync(self.loop, func, *args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/distributed/utils.py", line 253, in sync
six.reraise(*error[0])
File "/usr/local/lib/python2.7/site-packages/distributed/utils.py", line 238, in f
result[0] = yield make_coro()
File "/usr/local/lib64/python2.7/site-packages/tornado/gen.py", line 1055, in run
value = future.result()
File "/usr/local/lib64/python2.7/site-packages/tornado/concurrent.py", line 238, in result
raise_exc_info(self._exc_info)
File "/usr/local/lib64/python2.7/site-packages/tornado/gen.py", line 1063, in run
yielded = self.gen.throw(*exc_info)
File "/usr/local/lib/python2.7/site-packages/distributed/client.py", line 1385, in _gather
traceback)
File "/usr/local/lib/python2.7/site-packages/dask/dataframe/core.py", line 1633, in resample
return _resample(self, rule, how=how, closed=closed, label=label)
File "/usr/local/lib/python2.7/site-packages/dask/dataframe/tseries/resample.py", line 33, in _resample
return getattr(resampler, how)()
File "/usr/local/lib/python2.7/site-packages/dask/dataframe/tseries/resample.py", line 151, in mean
return self._agg('mean')
File "/usr/local/lib/python2.7/site-packages/dask/dataframe/tseries/resample.py", line 126, in _agg
meta_r = self.obj._meta_nonempty.resample(self._rule, **self._kwargs)
File "/usr/local/lib64/python2.7/site-packages/pandas/core/generic.py", line 7104, in resample
base=base, key=on, level=level)
File "/usr/local/lib64/python2.7/site-packages/pandas/core/resample.py", line 1148, in resample
return tg._get_resampler(obj, kind=kind)
File "/usr/local/lib64/python2.7/site-packages/pandas/core/resample.py", line 1276, in _get_resampler
"but got an instance of %r" % type(ax).__name__)
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'Index'
Below is the python code which I am using. Since the pandas DFs being returned by my delayed objects were already timestamp indexed, my expectation was for Dask to infer/construct an index from those DFs' timestamp indices instead of me having to explicitly set one. Although, I am unsure how an explicit set_index can be called in this case (what are the arguments to be passed?). Setting a pd.DatetimeIndex on the meta dataframe (commented line as below) works. Is constructing the index by hand and feeding it to meta the only realistic way to do this? Am I missing something?
#! /usr/bin/env python
# Start dask scheduler and workers
# dask-scheduler &
# dask-worker --nthreads 1 --nprocs 6 --memory-limit 3GB localhost:8786 --local-directory /dev/shm &
from dask.distributed import Client
from dask.delayed import delayed
import pandas as pd
import numpy as np
import dask.dataframe as dd
import time
c = Client('127.0.0.1:8786')
def load(epoch):
    # 1525132800 - 1/5
    # 1527811200 - 1/6
    num_ts=100
    idx = []
    for ts in range(0, 86400, 15):
        idx.append(epoch + ts)
    d = np.random.rand(86400/15, num_ts)
    ts = []
    for i in range(0, num_ts):
        # tsname = "ts_%s_%s" % (i, epoch)
        tsname = "ts_%s" % (i)
        ts.append(tsname)
        gts.append(tsname)
    res = pd.DataFrame(index=idx, data=d, columns=ts, dtype=np.float64)
    res.index = pd.to_datetime(arg=res.index, unit='s')
    return res
gts = []
load(1525132800)
print time.time()
i = pd.DatetimeIndex(start=1525132800, freq='15S', end=1527811185, dtype='datetime64[s]')
# meta = pd.DataFrame(index=i, data=[], columns=gts, dtype=np.float64)
meta = pd.DataFrame(index=[], data=[], columns=gts, dtype=np.float64)
dfs = [delayed(load)(fn) for fn in range(1525132800, 1527811200, 86400)]
print time.time()
df = dd.from_delayed(dfs, meta, 'sorted')
print time.time()
df.npartitions
df.divisions
print time.time()
df = c.submit(dd.DataFrame.resample, df, rule='1T', how='mean')
print time.time()
#df = c.submit(dd.DataFrame.sum, df, axis=1)
print time.time()
c.gather(df).compute()
print time.time()
#c.gather(df).visualize(filename='/usr/share/nginx/html/svg/df4.svg')
Dask uses the meta of a data-frame to infer the data types before computing any of the chunks of data. In your case, your chunks contain datetime indexes, but the meta doesn't. The meta should be a zero-length version of the data:
meta = pd.DataFrame(index=i[:0], data=[], columns=gts, dtype=np.float64)
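To see the difference between the two meta frames without running Dask, here is a small pandas-only sketch. The column names and the short date range are placeholders, not the values from your script.

```python
import numpy as np
import pandas as pd

cols = ["ts_0", "ts_1"]  # placeholder column names

# An index of the same kind the partitions carry (shortened here).
full_index = pd.date_range("2018-05-01", periods=4, freq="15s")

# Slicing with [:0] keeps the index type but drops all rows.
meta = pd.DataFrame(index=full_index[:0], columns=cols, dtype=np.float64)

# For comparison: a meta built from an empty list, as in the question.
plain_meta = pd.DataFrame(index=[], columns=cols, dtype=np.float64)

print(type(meta.index).__name__, len(meta))
print(type(plain_meta.index).__name__, len(plain_meta))
```

Both frames are empty, but only the first one has a datetime index, which is what resample checks for when Dask builds its metadata.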