Loading .txt file from Google Cloud Storage into a Pandas DF - pandas

I'm trying to load a .txt file from a GCS bucket into a pandas df via pd.read_csv. When I run this code on my local machine (sourcing the .txt file from a local directory), it works perfectly. However, when I run the code in a cloud function, accessing the same .txt file but from a GCS bucket, I get a 'TypeError: cannot use a string pattern on a bytes-like object'.
The only difference is that I'm accessing the .txt file via the GCS bucket, so it's a bucket object (Blob) instead of a normal file. Would I need to download the blob as a string or as a file-like object first before doing pd.read_csv? Code is below:
def stage1_cogs_vfc(data, context):
    from google.cloud import storage
    import pandas as pd
    import dask.dataframe as dd
    import io
    import numpy as np

    start_bucket = 'my_bucket'
    storage_client = storage.Client()
    source_bucket = storage_client.bucket(start_bucket)

    df = pd.DataFrame()
    file_path = 'gs://my_bucket/SCE_Var_Fact_Costs.txt'
    df = pd.read_csv(file_path, skiprows=12, encoding='utf-8', error_bad_lines=False,
                     warn_bad_lines=False, header=None, sep='\s+|\^+', engine='python')
Traceback (most recent call last):
  File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 383, in run_background_function
    _function_handler.invoke_user_function(event_object)
  File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 217, in invoke_user_function
    return call_user_function(request_or_event)
  File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 214, in call_user_function
    event_context.Context(**request_or_event.context))
  File "/user_code/main.py", line 20, in stage1_cogs_vfc
    df = pd.read_csv(file_path,skiprows=12, encoding ='utf-8', error_bad_lines= False, warn_bad_lines= False , header = None ,sep = '\s+|\^+',engine='python')
  File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 702, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 429, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 895, in __init__
    self._make_engine(self.engine)
  File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1132, in _make_engine
    self._engine = klass(self.f, **self.options)
  File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 2238, in __init__
    self.unnamed_cols) = self._infer_columns()
  File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 2614, in _infer_columns
    line = self._buffered_line()
  File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 2689, in _buffered_line
    return self._next_line()
  File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 2791, in _next_line
    next(self.data)
  File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 2379, in _read
    yield pat.split(line.strip())
TypeError: cannot use a string pattern on a bytes-like object

I found a similar situation here.
I also noticed that on the line:
source_bucket = storage_client.bucket(source_bucket)
you are using "source_bucket" as both the variable name and the parameter. I would suggest changing one of them.
In any case, I think you'll want to see this doc for any further questions related to the API itself: Storage Client - Google Cloud Storage API
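As for the "download first" part of the question: one straightforward option is to download the blob's contents, decode the bytes to text, and hand pandas a file-like object. A minimal sketch, reusing the bucket and file names from the question (installing gcsfs so that pd.read_csv can open gs:// paths directly is another common route):
import io
import pandas as pd
from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.bucket('my_bucket')
blob = bucket.blob('SCE_Var_Fact_Costs.txt')

# download_as_string() returns bytes; decode to str so the python engine's
# regex separator is applied to text rather than bytes
data = blob.download_as_string().decode('utf-8')
df = pd.read_csv(io.StringIO(data), skiprows=12, header=None,
                 sep=r'\s+|\^+', engine='python',
                 error_bad_lines=False, warn_bad_lines=False)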

Building on points from #K_immer, here is my updated code, which includes reading into a Dask df...
def stage1_cogs_vfc(data, context):
    from google.cloud import storage
    import pandas as pd
    import dask.dataframe as dd
    import io
    import numpy as np
    import datetime as dt

    start_bucket = 'my_bucket'
    destination_path = 'gs://my_bucket/ddf-*_cogs_vfc.csv'
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(start_bucket)
    blob = bucket.get_blob('SCE_Var_Fact_Costs.txt')

    file_path = 'gs://my_bucket/SCE_Var_Fact_Costs.txt'
    df0 = dd.read_csv(file_path, skiprows=12, dtype=object, encoding='utf-8',
                      error_bad_lines=False, warn_bad_lines=False, header=None,
                      sep='\s+|\^+', engine='python')
    df7 = df0.compute()  # converts the Dask df to a pandas df

    # then do your heavy ETL stuff here using pandas...
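Note that destination_path is defined above but never used; presumably the intent is to write the result back out to GCS. A hedged sketch of that last step, continuing the function body (assumes the gcsfs package is installed so Dask can write to gs:// paths):
    # one CSV is written per partition, matching the 'ddf-*' wildcard above
    df0.to_csv(destination_path, index=False)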


Error while converting pandas dataframe to polars dataframe (pyarrow.lib.ArrowTypeError: Expected bytes, got a 'int' object)

I am converting a pandas dataframe to a polars dataframe, but pyarrow throws an error.
My code:
import polars as pl
import pandas as pd

if __name__ == "__main__":
    with open(r"test.xlsx", "rb") as f:
        excelfile = f.read()
    excelfile = pd.ExcelFile(excelfile)
    sheetnames = excelfile.sheet_names
    df = pd.concat(
        [
            pd.read_excel(excelfile, sheet_name=x, header=0)
            for x in sheetnames
        ],
        axis=0,
    )
    df_pl = pl.from_pandas(df)
Error:
File "pyarrow\array.pxi", line 312, in pyarrow.lib.array
File "pyarrow\array.pxi", line 83, in pyarrow.lib._ndarray_to_array
File "pyarrow\error.pxi", line 122, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Expected bytes, got a 'int' object
I tried changing the pandas dataframe dtype to str and the problem is solved, but I don't want to change dtypes. Is it a bug in pyarrow, or am I missing something?
Edit: Polars 0.13.42 and later
Polars now has a read_excel function that will correctly handle this situation. read_excel is now the preferred way to read Excel files into Polars.
Note: to use read_excel, you will need to install xlsx2csv (which can be installed with pip).
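A minimal sketch of the newer approach, reading just the first sheet of the file from the question (xlsx2csv must be installed):
import polars as pl

# read_excel goes through xlsx2csv and then Polars' CSV reader,
# so each column ends up with a single consistent dtype
df_pl = pl.read_excel("test.xlsx")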
Polars: prior to 0.13.42
I can replicate this result. It is due to a column in the original Excel file that contains both text and numbers.
For example, create a new Excel file with one column in which you type both numbers and text, save it, and run your code on that file. I get the following traceback:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/xxx/.virtualenvs/StackOverflow3.10/lib/python3.10/site-packages/polars/convert.py", line 299, in from_pandas
return DataFrame._from_pandas(df, rechunk=rechunk, nan_to_none=nan_to_none)
File "/home/xxx/.virtualenvs/StackOverflow3.10/lib/python3.10/site-packages/polars/internals/frame.py", line 454, in _from_pandas
pandas_to_pydf(
File "/home/xxx/.virtualenvs/StackOverflow3.10/lib/python3.10/site-packages/polars/internals/construction.py", line 485, in pandas_to_pydf
arrow_dict = {
File "/home/xxx/.virtualenvs/StackOverflow3.10/lib/python3.10/site-packages/polars/internals/construction.py", line 486, in <dictcomp>
str(col): _pandas_series_to_arrow(
File "/home/xxx/.virtualenvs/StackOverflow3.10/lib/python3.10/site-packages/polars/internals/construction.py", line 237, in _pandas_series_to_arrow
return pa.array(values, pa.large_utf8(), from_pandas=nan_to_none)
File "pyarrow/array.pxi", line 312, in pyarrow.lib.array
File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
File "pyarrow/error.pxi", line 122, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Expected bytes, got a 'int' object
There are several lengthy discussions on this issue, such as these:
to_parquet can't handle mixed type columns #21228
pyarrow.lib.ArrowTypeError: "Expected a string or bytes object, got a 'int' object" #349
This particular comment might be relevant, as you are concatenating the results of parsing multiple sheets in an Excel file. This may lead to conflicting dtypes for a column:
https://github.com/pandas-dev/pandas/issues/21228#issuecomment-419175116
How to approach this depends on your data and its use, so I can't recommend a blanket solution (e.g., fixing your source Excel file, or casting the dtype to str).
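If you do go the casting route, you only need to coerce the offending column(s), not the whole frame. A sketch with a hypothetical column name:
# 'mixed_col' is a hypothetical column that contains both text and numbers
df["mixed_col"] = df["mixed_col"].astype(str)
df_pl = pl.from_pandas(df)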
My problem was solved by saving the pandas dataframe to CSV and then importing the CSV file into Polars.
import os
import polars as pl
import pandas as pd

if __name__ == "__main__":
    with open(r"test.xlsx", "rb") as f:
        excelfile = f.read()
    excelfile = pd.ExcelFile(excelfile)
    sheetnames = excelfile.sheet_names
    df = pd.concat(
        [
            pd.read_excel(excelfile, sheet_name=x, header=0)
            for x in sheetnames
        ],
        axis=0,
    )
    df.to_csv("temp.csv", index=False)
    # read eagerly (read_csv rather than the lazy scan_csv) so the temp file
    # can safely be deleted right afterwards
    df_pl = pl.read_csv("temp.csv")
    os.remove("temp.csv")

how to read csv file in pyspark?

I am trying to read a csv file using pyspark, but it shows an error.
Can you tell me the correct way to read a csv file?
python code:
from pyspark.sql import *
df = spark.read.csv("D:\Users\SPate233\Downloads\iMedical\query1.csv", inferSchema = True, header = True)
I also tried this:
sqlContext = SQLContext
df = sqlContext.load(source="com.databricks.spark.csv", header="true", path = "D:\Users\SPate233\Downloads\iMedical\query1.csv")
error:
Traceback (most recent call last):
File "<pyshell#18>", line 1, in <module>
df = spark.read.csv("D:\Users\SPate233\Downloads\iMedical\query1.csv", inferSchema = True, header = True)
NameError: name 'spark' is not defined
and
Traceback (most recent call last):
File "<pyshell#26>", line 1, in <module>
df = sqlContext.load(source="com.databricks.spark.csv", header="true", path = "D:\Users\SPate233\Downloads\iMedical\query1.csv")
AttributeError: type object 'SQLContext' has no attribute 'load'
First you need to create a SparkSession, like below:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("yarn").appName("MyApp").getOrCreate()
and your csv needs to be on HDFS; then you can use spark.read.csv:
df = spark.read.csv('/tmp/data.csv', header=True)
where /tmp/data.csv is on HDFS.
The simplest way to read a csv in pyspark is to use Databricks' spark-csv module.
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('file.csv')
You can also read the file as plain text and split each line on your separator:
reader = sc.textFile("file.csv").map(lambda line: line.split(","))
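If you need a DataFrame from that RDD, a rough follow-up sketch (assumes the SQLContext created above, and that the first line of the file is a header; purely illustrative):
header = reader.first()                      # first row holds the column names
rows = reader.filter(lambda r: r != header)  # drop the header row
df = rows.toDF(header)                       # build a DataFrame with those names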

How to avoid set_index on a pre-sorted DataFrame constructed with from_delayed?

I am trying to get the expression df.resample('1T', how='mean').sum() to work in Dask, but I'm running into an issue where it seems Dask needs me to explicitly call set_index on the DataFrame before resampling. I get the error below...
>>> c.gather(df).compute()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/site-packages/distributed/client.py", line 1508, in gather
asynchronous=asynchronous)
File "/usr/local/lib/python2.7/site-packages/distributed/client.py", line 615, in sync
return sync(self.loop, func, *args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/distributed/utils.py", line 253, in sync
six.reraise(*error[0])
File "/usr/local/lib/python2.7/site-packages/distributed/utils.py", line 238, in f
result[0] = yield make_coro()
File "/usr/local/lib64/python2.7/site-packages/tornado/gen.py", line 1055, in run
value = future.result()
File "/usr/local/lib64/python2.7/site-packages/tornado/concurrent.py", line 238, in result
raise_exc_info(self._exc_info)
File "/usr/local/lib64/python2.7/site-packages/tornado/gen.py", line 1063, in run
yielded = self.gen.throw(*exc_info)
File "/usr/local/lib/python2.7/site-packages/distributed/client.py", line 1385, in _gather
traceback)
File "/usr/local/lib/python2.7/site-packages/dask/dataframe/core.py", line 1633, in resample
return _resample(self, rule, how=how, closed=closed, label=label)
File "/usr/local/lib/python2.7/site-packages/dask/dataframe/tseries/resample.py", line 33, in _resample
return getattr(resampler, how)()
File "/usr/local/lib/python2.7/site-packages/dask/dataframe/tseries/resample.py", line 151, in mean
return self._agg('mean')
File "/usr/local/lib/python2.7/site-packages/dask/dataframe/tseries/resample.py", line 126, in _agg
meta_r = self.obj._meta_nonempty.resample(self._rule, **self._kwargs)
File "/usr/local/lib64/python2.7/site-packages/pandas/core/generic.py", line 7104, in resample
base=base, key=on, level=level)
File "/usr/local/lib64/python2.7/site-packages/pandas/core/resample.py", line 1148, in resample
return tg._get_resampler(obj, kind=kind)
File "/usr/local/lib64/python2.7/site-packages/pandas/core/resample.py", line 1276, in _get_resampler
"but got an instance of %r" % type(ax).__name__)
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'Index'
Below is the Python code I am using. Since the pandas DFs returned by my delayed objects were already timestamp-indexed, my expectation was that Dask would infer/construct an index from those DFs' timestamp indices instead of me having to set one explicitly. I am also unsure how an explicit set_index would be called in this case (what arguments should be passed?). Setting a pd.DatetimeIndex on the meta dataframe (the commented-out line below) works. Is constructing the index by hand and feeding it to meta the only realistic way to do this? Am I missing something?
#! /usr/bin/env python

# Start dask scheduler and workers
# dask-scheduler &
# dask-worker --nthreads 1 --nprocs 6 --memory-limit 3GB localhost:8786 --local-directory /dev/shm &

from dask.distributed import Client
from dask.delayed import delayed
import pandas as pd
import numpy as np
import dask.dataframe as dd
import time

c = Client('127.0.0.1:8786')

def load(epoch):
    # 1525132800 - 1/5
    # 1527811200 - 1/6
    num_ts = 100
    idx = []
    for ts in range(0, 86400, 15):
        idx.append(epoch + ts)
    d = np.random.rand(86400/15, num_ts)
    ts = []
    for i in range(0, num_ts):
        # tsname = "ts_%s_%s" % (i, epoch)
        tsname = "ts_%s" % (i)
        ts.append(tsname)
        gts.append(tsname)
    res = pd.DataFrame(index=idx, data=d, columns=ts, dtype=np.float64)
    res.index = pd.to_datetime(arg=res.index, unit='s')
    return res

gts = []
load(1525132800)
print time.time()

i = pd.DatetimeIndex(start=1525132800, freq='15S', end=1527811185, dtype='datetime64[s]')
# meta = pd.DataFrame(index=i, data=[], columns=gts, dtype=np.float64)
meta = pd.DataFrame(index=[], data=[], columns=gts, dtype=np.float64)

dfs = [delayed(load)(fn) for fn in range(1525132800, 1527811200, 86400)]
print time.time()

df = dd.from_delayed(dfs, meta, 'sorted')
print time.time()
df.npartitions
df.divisions
print time.time()

df = c.submit(dd.DataFrame.resample, df, rule='1T', how='mean')
print time.time()
#df = c.submit(dd.DataFrame.sum, df, axis=1)
print time.time()

c.gather(df).compute()
print time.time()
#c.gather(df).visualize(filename='/usr/share/nginx/html/svg/df4.svg')
Dask uses the meta of a data-frame to infer the data types before computing any of the chunks of data. In your case, your chunks contain datetime indexes, but the meta doesn't. The meta should be a zero-length version of the data:
meta = pd.DataFrame(index=i[:0], data=[], columns=gts, dtype=np.float64)
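An equivalent option is to build the meta from an empty slice of one real chunk, so the columns, dtypes and DatetimeIndex are guaranteed to match. A sketch using the load function from the question (note that load() also appends to gts as a side effect):
example = load(1525132800)
meta = example.iloc[:0]      # zero rows, but correct columns, dtypes and DatetimeIndex
df = dd.from_delayed(dfs, meta, 'sorted')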

Write Pandas Dataframe to CSV with a variable name in pathway

I've written a python script that takes in a file and matches some columns against another file. I would like to write the result to a csv named "[original file name]_matched". E.g. I have a bunch of files (xaa, xab, ...) and after running the script on each file I would also have (xaa_matched, xab_matched, etc.). This is what I've tried, based on this solution: Set File_Path for to_csv() in Pandas
import sys
import os
filename = sys.argv[1]
# some code
path = r'/Users/mdong/dataScience/movie_representation/fuzzy_match_dir/'
input_file.to_csv(os.path.join(path,'match_' + filename), index = False)
However, I get back this error
Traceback (most recent call last):
File "movie_matching.py", line 29, in <module>
input_file.to_csv(os.path.join(path,filename), index = False)
File "/Users/mdong/anaconda/lib/python3.6/site-packages/pandas/core/frame.py", line 1413, in to_csv
formatter.save()
File "/Users/mdong/anaconda/lib/python3.6/site-packages/pandas/io/formats/format.py", line 1568, in save
compression=self.compression)
File "/Users/mdong/anaconda/lib/python3.6/site-packages/pandas/io/common.py", line 382, in _get_handle
f = open(path_or_buf, mode, errors='replace')
FileNotFoundError: [Errno 2] No such file or directory: '/Users/mdong/dataScience/movie_representation/fuzzy_match_dir/fuzzy_match_dir/xaa.csv'
I'm not sure what's going wrong or how to troubleshoot it; any pointers would be appreciated!
I would use pathlib in this situation.
from pathlib import Path
p = Path('/Path/to/your/folder/')
input_file.to_csv(Path(p, 'match_' + filename + '.csv'), index=False)
I would also check that your filename variable is what you expect it to be, which you can do with pathlib as well:
>>> p = Path('/Path/To/Thing.csv')
>>> p.stem
'Thing'
>>> p.name
'Thing.csv'
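Putting those together, the "[original file name]_matched" naming from the question could look roughly like this (the folder path is illustrative):
from pathlib import Path

out_dir = Path('/Users/mdong/dataScience/movie_representation/fuzzy_match_dir/')
in_path = Path(filename)                               # e.g. '.../xaa.csv'
out_path = out_dir / (in_path.stem + '_matched.csv')   # e.g. 'xaa_matched.csv'
input_file.to_csv(out_path, index=False)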

Pandas HDF5 append time series fails

Going through the pandas HDF5 documentation (http://pandas.pydata.org/pandas-docs/stable/io.html#io-hdf5), the given example raises an error:
import pandas as pd
import numpy as np
store = pd.HDFStore('store.h5')
np.random.seed(1234)
index = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame(np.random.randn(8, 3), index=index)
store['df'] = df
df1 = df[0:4]
df2 = df[4:]
store.append('df', df1)
store.append('df', df2)
Traceback (most recent call last):
File "C:\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2885, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-225-ef7f2e059c6a>", line 1, in <module>
store.append('df', df1)
File "C:\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 919, in append
**kwargs)
File "C:\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 1252, in _write_to_group
raise ValueError('Can only append to Tables')
ValueError: Can only append to Tables
Has something changed here? Or am I doing something wrong?
You need to make the table format the default so that appending is possible. Set the following option at the beginning, because store['df'] = df currently writes the fixed format, which cannot be appended to:
pd.set_option('io.hdf.default_format','table')
Docs
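Alternatively, you can request the table format explicitly for just this key instead of changing the global default, something along these lines:
store.put('df', df1, format='table')   # write the key in (appendable) table format, replacing the fixed-format node
store.append('df', df2)                # appending now works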