Convert Pandas DataFrame to bytes-like object - pandas

Hi I am trying to convert my df to binary and store it in a variable.
my_df:
df = pd.DataFrame({'A':[1,2,3],'B':[4,5,6]})
my code:
import io
towrite = io.BytesIO()
df.to_excel(towrite) # write to BytesIO buffer
towrite.seek(0) # reset pointer
I am getting AttributeError: '_io.BytesIO' object has no attribute 'write_cells'
Full Traceback:
AttributeError Traceback (most recent call last)
<ipython-input-25-be6ee9d9ede6> in <module>()
1 towrite = io.BytesIO()
----> 2 df.to_excel(towrite) # write to BytesIO buffer
3 towrite.seek(0) # reset pointer
4 encoded = base64.b64encode(towrite.read()) #
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py in to_excel(self, excel_writer, sheet_name, na_rep, float_format, columns, header, index, index_label, startrow, startcol, engine, merge_cells, encoding, inf_rep, verbose, freeze_panes)
1422 formatter.write(excel_writer, sheet_name=sheet_name, startrow=startrow,
1423 startcol=startcol, freeze_panes=freeze_panes,
-> 1424 engine=engine)
1425
1426 def to_stata(self, fname, convert_dates=None, write_index=True,
C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\formats\excel.py in write(self, writer, sheet_name, startrow, startcol, freeze_panes, engine)
624
625 formatted_cells = self.get_formatted_cells()
--> 626 writer.write_cells(formatted_cells, sheet_name,
627 startrow=startrow, startcol=startcol,
628 freeze_panes=freeze_panes)
AttributeError: '_io.BytesIO' object has no attribute 'write_cells'

I solved the issue by upgrading pandas to a newer version.
import io
towrite = io.BytesIO()
df.to_excel(towrite) # write to BytesIO buffer
towrite.seek(0)
print(towrite)
<_io.BytesIO object at 0x...>
print(type(towrite))
<class '_io.BytesIO'>
If you want to see the bytes-like object, use getvalue():
print(towrite.getvalue())
b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x00\x00!\x00<\xb
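The difference between reading from the buffer and calling getvalue() can be seen with a plain io.BytesIO, independent of pandas. This is a minimal sketch; the payload bytes are made up and only stand in for what to_excel would write.

```python
import io

buf = io.BytesIO()
buf.write(b"PK\x03\x04 example payload")  # stand-in for the Excel bytes

# After writing, the pointer sits at the end, so read() returns nothing.
at_end = buf.read()

# seek(0) rewinds the pointer; read() then returns everything from the start.
buf.seek(0)
after_seek = buf.read()

# getvalue() returns the whole contents regardless of the pointer position.
whole = buf.getvalue()
```

This is why the seek(0) call matters before passing the buffer to anything that calls read() on it.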

Pickle
Pickle reproduces a Pandas dataframe faithfully, but it is only suitable for internal use among trusted users. Do not unpickle data from untrusted sources, because loading a pickle can execute arbitrary code.
import pickle
# Export:
my_bytes = pickle.dumps(df, protocol=4)
# Import:
df_restored = pickle.loads(my_bytes)
This was tested with Pandas 1.1.2. Unfortunately it failed for a very large dataframe; what worked instead was pickling and parallel-compressing each column individually, then pickling the resulting list. Alternatively, you can pickle chunks of the large dataframe.
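Pickling in chunks could look roughly like the following. This is a sketch, not the code that was actually used for the large dataframe: the helper names dumps_in_chunks and loads_from_chunks and the chunk size are made up for illustration, and compression is left out.

```python
import pickle

import pandas as pd


def dumps_in_chunks(df, rows_per_chunk):
    """Pickle a DataFrame as a list of separately pickled row chunks."""
    parts = [
        pickle.dumps(df.iloc[start:start + rows_per_chunk], protocol=4)
        # max(..., 1) keeps one (empty) chunk for an empty DataFrame
        for start in range(0, max(len(df), 1), rows_per_chunk)
    ]
    # Pickle the list of byte strings so the result is a single bytes object.
    return pickle.dumps(parts, protocol=4)


def loads_from_chunks(blob):
    """Inverse of dumps_in_chunks; only use on data from a trusted source."""
    parts = pickle.loads(blob)
    return pd.concat([pickle.loads(part) for part in parts])


df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
my_bytes = dumps_in_chunks(df, rows_per_chunk=2)
df_restored = loads_from_chunks(my_bytes)
```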
CSV
If you must use a CSV representation:
df.to_csv(index=False).encode()
Note that various datatypes are lost when using CSV.
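To illustrate the datatype loss, here is a small round trip through CSV bytes. The column name when and the sample values are made up; the point is only that a datetime column does not come back as a datetime unless it is parsed again.

```python
import io

import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3],
    'when': pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-03']),
})

csv_bytes = df.to_csv(index=False).encode()
restored = pd.read_csv(io.BytesIO(csv_bytes))

# The integer column keeps its dtype, but the datetime column comes back
# as text unless it is parsed again on import.
int_dtype_kept = restored['A'].dtype == df['A'].dtype
datetime_dtype_kept = restored['when'].dtype == df['when'].dtype

# Asking read_csv to parse the column restores a datetime dtype.
reparsed = pd.read_csv(io.BytesIO(csv_bytes), parse_dates=['when'])
```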
Parquet
See this answer. Note that various datatypes are converted when using parquet.
Excel
Avoid it for the most part, because the .xlsx format caps a worksheet at 1,048,576 rows and 16,384 columns.

I needed to upload the file object to S3 via boto3, which didn't accept the bytes object produced from pandas. Building on the answer from Asclepius, I wrapped the bytes in a BytesIO, e.g.:
from io import BytesIO
data = BytesIO(df.to_csv(index=False).encode('utf-8'))
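A sketch of how such a buffer might be handed to boto3. upload_fileobj is the S3 client method that takes a file-like object; the helper name upload_df_as_csv and the bucket and key values are placeholders. The client is passed in as an argument so the function can be tried without AWS credentials.

```python
from io import BytesIO

import pandas as pd


def upload_df_as_csv(s3_client, df, bucket, key):
    """Serialize df to CSV bytes and upload them as a file-like object."""
    payload = df.to_csv(index=False).encode('utf-8')
    # A BytesIO created from bytes starts at position 0, so no seek(0)
    # is needed before handing it to the client.
    s3_client.upload_fileobj(BytesIO(payload), bucket, key)
    return len(payload)


# Hypothetical usage, assuming boto3 is installed and configured:
#   import boto3
#   upload_df_as_csv(boto3.client('s3'), df, 'my-bucket', 'exports/df.csv')
```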

Related

How to decode a .csv .gzip file containing tweets?

I'm trying to do a twitter sentiment analysis and my dataset is a couple of .csv.gzip files.
This is what I did to combine them all into one dataframe.
(I'm using Google Colab, in case that has anything to do with the error, the filename or something.)
apr_files = [file[9:] for file in csv_collection if re.search(r"04+", file)]
apr_files
Output:
['0428_UkraineCombinedTweetsDeduped.csv.gzip',
'0430_UkraineCombinedTweetsDeduped.csv.gzip',
'0401_UkraineCombinedTweetsDeduped.csv.gzip']
temp_list = []
for file in apr_files:
    print(f"Reading in {file}")
    # unzip and read in the csv file as a dataframe
    temp = pd.read_csv(file, compression="gzip", header=0, index_col=0)
    # append dataframe to temp list
    temp_list.append(temp)
Error:
Reading in 0428_UkraineCombinedTweetsDeduped.csv.gzip
Reading in 0430_UkraineCombinedTweetsDeduped.csv.gzip
/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py:2882: DtypeWarning: Columns (15) have mixed types.Specify dtype option on import or set low_memory=False.
exec(code_obj, self.user_global_ns, self.user_ns)
Reading in 0401_UkraineCombinedTweetsDeduped.csv.gzip
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-26-5cba3ca01b1e> in <module>()
3 print(f"Reading in {file}")
4 # unzip and read in the csv file as a dataframe
----> 5 tmp_df = pd.read_csv(file, compression="gzip", header=0, index_col=0)
6 # append dataframe to temp list
7 tmp_df_list.append(tmp_df)
8 frames
/usr/local/lib/python3.7/dist-packages/pandas/_libs/parsers.pyx in pandas._libs.parsers.raise_parser_error()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb8 in position 8048: invalid start byte
I assumed that this error might be because the tweets contain a variety of characters (such as emoji and non-English characters).
I then switched to Jupyter Notebook, and it worked fine there.
As of now, I don't know what the issue with Google Colab was.
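The cause on Colab was never identified here. One way to investigate a failure like this, offered only as a sketch, is to decompress the file by hand and look for the first byte sequence that is not valid UTF-8. The helper name find_bad_utf8 and the sample file are made up for illustration, and the whole file is read into memory, so this suits files of moderate size.

```python
import gzip
import os
import tempfile


def find_bad_utf8(path):
    """Return (offset, nearby_bytes) for the first invalid UTF-8 sequence
    in a gzip-compressed file, or None if the file decodes cleanly."""
    with gzip.open(path, 'rb') as fh:
        raw = fh.read()
    try:
        raw.decode('utf-8')
    except UnicodeDecodeError as err:
        return err.start, raw[max(err.start - 10, 0):err.start + 10]
    return None


# Build a small sample file containing one byte (0xb8) that cannot start
# a UTF-8 sequence, mirroring the error message above.
sample_path = os.path.join(tempfile.mkdtemp(), 'sample.csv.gzip')
with gzip.open(sample_path, 'wb') as fh:
    fh.write(b'id,text\n1,ok\n2,bad \xb8 byte\n')

result = find_bad_utf8(sample_path)
```

If the file turns out to use a different encoding, pd.read_csv accepts an encoding argument (for example encoding="latin-1"), and pandas 1.3 and later also accept encoding_errors="replace".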

Writing pandas dataframe to excel in dbfs azure databricks: OSError: [Errno 95] Operation not supported

I am trying to write a pandas dataframe to the local file system in azure databricks:
import pandas as pd
url = 'https://www.stats.govt.nz/assets/Uploads/Business-price-indexes/Business-price-indexes-March-2019-quarter/Download-data/business-price-indexes-march-2019-quarter-csv.csv'
data = pd.read_csv(url)
with pd.ExcelWriter(r'/dbfs/tmp/export.xlsx', engine="openpyxl") as writer:
    data.to_excel(writer)
Then I get the following error message:
OSError: [Errno 95] Operation not supported
---------------------------------------------------------------------------
OSError Traceback (most recent call last) in
3 data = pd.read_csv(url)
4 with pd.ExcelWriter(r'/dbfs/tmp/export.xlsx', engine="openpyxl") as writer:
----> 5 data.to_excel(writer)
/databricks/python/lib/python3.8/site-packages/pandas/io/excel/_base.py in __exit__(self, exc_type, exc_value, traceback)
892
893 def __exit__(self, exc_type, exc_value, traceback):
--> 894 self.close()
895
896 def close(self):
/databricks/python/lib/python3.8/site-packages/pandas/io/excel/_base.py in close(self)
896 def close(self):
897 """synonym for save, to make it more file-like"""
--> 898 content = self.save()
899 self.handles.close()
900 return content
I read in this post some limitations for mounted file systems: Pandas: Write to Excel not working in Databricks
But if I understood it correctly, the solution is to write to the local workspace file system, which is exactly what is not working for me.
My user is a workspace admin and I am using a standard cluster with the 10.4 Runtime.
I also verified that I can write a CSV file to the same location using DataFrame.to_csv.
What could be missing?
Databricks does not allow random write operations into DBFS, as indicated in the SO thread you refer to. Writing an .xlsx file involves such writes, which is presumably why to_csv succeeds at the same location while to_excel fails.
So, a workaround is to write the file to the local file system (file:/) first and then move it to the required location inside DBFS. You can use the following code:
import pandas as pd
url = 'https://www.stats.govt.nz/assets/Uploads/Business-price-indexes/Business-price-indexes-March-2019-quarter/Download-data/business-price-indexes-march-2019-quarter-csv.csv'
data = pd.read_csv(url)
with pd.ExcelWriter(r'export.xlsx', engine="openpyxl") as writer:
    # file will be written to /databricks/driver/ i.e., local file system
    data.to_excel(writer)
Note that dbutils.fs.ls("/databricks/driver/") interprets the path as dbfs:/databricks/driver/ (an absolute DBFS path), which does not exist.
/databricks/driver/ belongs to the local file system of the driver (DBFS itself is exposed there under /dbfs). The absolute path of /databricks/driver/ is file:/databricks/driver/. You can list the contents of this path using either of the following:
import os
print(os.listdir("/databricks/driver/"))
#OR
dbutils.fs.ls("file:/databricks/driver/")
So, take the file located at this path and move (or copy) it to your destination using the shutil library, as follows:
from shutil import move
move('/databricks/driver/export.xlsx','/dbfs/tmp/export.xlsx')
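The write-locally-then-move pattern can be wrapped in a small helper. This is a sketch using only the standard library; the function name write_then_move is made up, and using a /dbfs/... destination on Databricks is an assumption carried over from the steps above rather than something tested here.

```python
import os
import shutil
import tempfile


def write_then_move(write_fn, dest_path):
    """Call write_fn(local_path) to create a file on local disk, then
    move the finished file to dest_path."""
    tmp_dir = tempfile.mkdtemp()
    local_path = os.path.join(tmp_dir, os.path.basename(dest_path))
    try:
        write_fn(local_path)                # all writes happen on local disk
        shutil.move(local_path, dest_path)  # moved only once the file is complete
    finally:
        shutil.rmtree(tmp_dir, ignore_errors=True)
    return dest_path


# Hypothetical usage on Databricks, with data being the DataFrame from above:
#   write_then_move(lambda p: data.to_excel(p, engine="openpyxl"),
#                   '/dbfs/tmp/export.xlsx')
```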

Error while converting csv to parquet file using pandas

I would like to upload csv as parquet file to S3 bucket. Below is the code snippet.
df = pd.read_csv('right_csv.csv')
csv_buffer = BytesIO()
df.to_parquet(csv_buffer, compression='gzip', engine='fastparquet')
csv_buffer.seek(0)
Above is giving me an error: TypeError: expected str, bytes or os.PathLike object, not _io.BytesIO
How to make it work?
As per the documentation, io.BytesIO cannot be used when fastparquet is the engine; the auto or pyarrow engine has to be used instead. Quoting from the documentation:
The engine fastparquet does not accept file-like objects.
Below code works without any issues.
import io
f = io.BytesIO()
df.to_parquet(f, compression='gzip', engine='pyarrow')
f.seek(0)
As mentioned in the other answer, this is not supported. One workaround would be to save as parquet to a NamedTemporaryFile, then copy the content to a BytesIO buffer:
import io
import tempfile
with tempfile.NamedTemporaryFile() as tmp:
    df.to_parquet(tmp.name, compression='gzip', engine='fastparquet')
    with open(tmp.name, 'rb') as fh:
        buf = io.BytesIO(fh.read())

Dask array from_npy_stack misses info file

Action
Trying to create a Dask array from a stack of .npy files not written by Dask.
Problem
Dask from_npy_stack() expects an info file, which is normally created by to_npy_stack() function when creating .npy stack with Dask.
Attempts
I found this PR (https://github.com/dask/dask/pull/686) with a description of how the info file is created
def to_npy_info(dirname, dtype, chunks, axis):
    with open(os.path.join(dirname, 'info'), 'wb') as f:
        pickle.dump({'chunks': chunks, 'dtype': dtype, 'axis': axis}, f)
Question
How do I go about loading .npy stacks that are created outside of Dask?
Example
from pathlib import Path
import numpy as np
import dask.array as da
data_dir = Path('/home/tom/data/')
for i in range(3):
    data = np.zeros((2,2))
    np.save(data_dir.joinpath('{}.npy'.format(i)), data)
data = da.from_npy_stack('/home/tom/data')
Resulting in the following error:
---------------------------------------------------------------------------
IOError Traceback (most recent call last)
<ipython-input-94-54315c368240> in <module>()
9 np.save(data_dir.joinpath('{}.npy'.format(i)), data)
10
---> 11 data = da.from_npy_stack('/home/tom/data/')
/home/tom/vue/env/local/lib/python2.7/site-packages/dask/array/core.pyc in from_npy_stack(dirname, mmap_mode)
3722 Read data in memory map mode
3723 """
-> 3724 with open(os.path.join(dirname, 'info'), 'rb') as f:
3725 info = pickle.load(f)
3726
IOError: [Errno 2] No such file or directory: '/home/tom/data/info'
The function from_npy_stack is short and simple. I agree that it probably ought to take the metadata as an optional argument for cases such as yours, but you could simply reuse the lines of code that follow the loading of the "info" file, assuming you have the right values to pass in. Some of these values, i.e. dtype and the shape of each array (needed for making chunks), could presumably be obtained by looking at the first of the data files:
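import os
from itertools import product
import numpy as np
from dask.array import Array
# dirname, chunks, axis, dtype and mmap_mode must already be defined
name = 'from-npy-stack-%s' % dirname
keys = list(product([name], *[range(len(c)) for c in chunks]))
values = [(np.load, os.path.join(dirname, '%d.npy' % i), mmap_mode)
          for i in range(len(chunks[axis]))]
dsk = dict(zip(keys, values))
out = Array(dsk, name, chunks, dtype)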
Also, note that we are constructing the names of the files here, but you might want to get those by doing a listdir or glob.
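Another option is to generate the missing info file from the existing .npy files. This is a sketch under assumptions: it relies on the metadata layout quoted from the PR in the question (a pickled dict with chunks, dtype and axis), the files must be named 0.npy, 1.npy, ... and agree in dtype and in all dimensions other than axis, and the helper name write_npy_stack_info is made up. The layout may differ between Dask versions, so check it against the installed from_npy_stack source.

```python
import os
import pickle
import tempfile

import numpy as np


def write_npy_stack_info(dirname, axis=0):
    """Write an 'info' file describing 0.npy, 1.npy, ... in dirname."""
    n_files = len([f for f in os.listdir(dirname) if f.endswith('.npy')])
    # mmap_mode='r' memory-maps each file instead of loading its data.
    arrays = [np.load(os.path.join(dirname, '%d.npy' % i), mmap_mode='r')
              for i in range(n_files)]
    first = arrays[0]
    # One chunk per file along `axis`, a single chunk along every other axis.
    chunks = tuple(
        tuple(a.shape[axis] for a in arrays) if dim == axis else (size,)
        for dim, size in enumerate(first.shape)
    )
    info = {'chunks': chunks, 'dtype': first.dtype, 'axis': axis}
    with open(os.path.join(dirname, 'info'), 'wb') as f:
        pickle.dump(info, f)
    return info


# Same data as in the question, written to a temporary directory.
data_dir = tempfile.mkdtemp()
for i in range(3):
    np.save(os.path.join(data_dir, '%d.npy' % i), np.zeros((2, 2)))

info = write_npy_stack_info(data_dir)
```

With the info file in place, da.from_npy_stack(data_dir) should find the metadata it expects.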

Pandas Dataframe to RDD

Can I convert a Pandas DataFrame to RDD?
if isinstance(data2, pd.DataFrame):
    print 'is Dataframe'
else:
    print 'is NOT Dataframe'
is Dataframe
Here is the output when trying to use .rdd
dataRDD = data2.rdd
print dataRDD
AttributeError Traceback (most recent call last)
<ipython-input-56-7a9188b07317> in <module>()
----> 1 dataRDD = data2.rdd
2 print dataRDD
/usr/lib64/python2.7/site-packages/pandas/core/generic.pyc in __getattr__(self, name)
2148 return self[name]
2149 raise AttributeError("'%s' object has no attribute '%s'" %
-> 2150 (type(self).__name__, name))
2151
2152 def __setattr__(self, name, value):
AttributeError: 'DataFrame' object has no attribute 'rdd'
I would like to use a Pandas DataFrame rather than sqlContext, as I'm not sure whether all the functions in Pandas DataFrames are available in Spark. If this is not possible, can anyone provide an example of using a Spark DataFrame?
Can I convert a Pandas Dataframe to RDD?
Well, yes you can do it. Pandas Data Frames
pdDF = pd.DataFrame([("foo", 1), ("bar", 2)], columns=("k", "v"))
print pdDF
## k v
## 0 foo 1
## 1 bar 2
can be converted to Spark Data Frames
spDF = sqlContext.createDataFrame(pdDF)
spDF.show()
## +---+-+
## | k|v|
## +---+-+
## |foo|1|
## |bar|2|
## +---+-+
and after that you can easily access underlying RDD
spDF.rdd.first()
## Row(k=u'foo', v=1)
Still, I think you have the wrong idea here. A Pandas DataFrame is a local data structure. It is stored and processed locally on the driver. There is no data distribution or parallel processing, and it doesn't use RDDs (hence no rdd attribute). Unlike a Spark DataFrame, it provides random access capabilities.
A Spark DataFrame is a distributed data structure that uses RDDs behind the scenes. It can be accessed using either raw SQL (sqlContext.sql) or a SQL-like API (df.where(col("foo") == "bar").groupBy(col("bar")).agg(sum(col("foobar")))). There is no random access and it is immutable (no equivalent of Pandas inplace). Every transformation returns a new DataFrame.
If this is not possible, is there anyone that can provide an example of using Spark DF
Not really. It is far too broad a topic for SO. Spark has really good documentation, and Databricks provides some additional resources. For starters, you can check these:
Introducing DataFrames in Spark for Large Scale Data Science
Spark SQL and DataFrame Guide