compute() in dask not working - dataframe

I am trying a simple parallel computation in Dask. This is my code:
import time
import dask
import dask.distributed as distributed
import dask.dataframe as dd
import dask.delayed as delayed
from dask.distributed import Client, progress
client = Client('localhost:8786')
df = dd.read_csv('file.csv')
ddf = df.groupby(['col1'])[['col2']].sum()
ddf = ddf.compute()
print ddf
It looks fine according to the documentation, but when I run it I get this:
Traceback (most recent call last):
File "dask_prg1.py", line 17, in <module>
ddf = ddf.compute()
File "/usr/local/lib/python2.7/site-packages/dask/base.py", line 156, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "/usr/local/lib/python2.7/site-packages/dask/base.py", line 402, in compute
results = schedule(dsk, keys, **kwargs)
File "/usr/local/lib/python2.7/site-packages/distributed/client.py", line 2159, in get
direct=direct)
File "/usr/local/lib/python2.7/site-packages/distributed/client.py", line 1562, in gather
asynchronous=asynchronous)
File "/usr/local/lib/python2.7/site-packages/distributed/client.py", line 652, in sync
return sync(self.loop, func, *args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/distributed/utils.py", line 275, in sync
six.reraise(*error[0])
File "/usr/local/lib/python2.7/site-packages/distributed/utils.py", line 260, in f
result[0] = yield make_coro()
File "/usr/local/lib/python2.7/site-packages/tornado/gen.py", line 1099, in run
value = future.result()
File "/usr/local/lib/python2.7/site-packages/tornado/concurrent.py", line 260, in result
raise_exc_info(self._exc_info)
File "/usr/local/lib/python2.7/site-packages/tornado/gen.py", line 1107, in run
yielded = self.gen.throw(*exc_info)
File "/usr/local/lib/python2.7/site-packages/distributed/client.py", line 1439, in _gather
traceback)
File "/usr/local/lib/python2.7/site-packages/dask/bytes/core.py", line 122, in read_block_from_file
with lazy_file as f:
File "/usr/local/lib/python2.7/site-packages/dask/bytes/core.py", line 166, in __enter__
f = SeekableFile(self.fs.open(self.path, mode=mode))
File "/usr/local/lib/python2.7/site-packages/dask/bytes/local.py", line 58, in open
return open(self._normalize_path(path), mode=mode)
IOError: [Errno 2] No such file or directory: 'file.csv'
I do not understand what is wrong. Kindly help me with this. Thank you in advance.

You may wish to pass the absolute file path to read_csv. The reason is that you are giving the work of opening and reading the file to a dask worker, and you might not have started that worker with the same working directory as your script/session.
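For example, a minimal sketch of that fix, assuming the workers share a filesystem with the client (the address and file name are taken from the question):
import os
import dask.dataframe as dd
from dask.distributed import Client
client = Client('localhost:8786')
# resolve the path on the client first, so every worker opens the same
# absolute location regardless of its own working directory
csv_path = os.path.abspath('file.csv')
df = dd.read_csv(csv_path)
result = df.groupby(['col1'])[['col2']].sum().compute()
print(result)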

MissingDataError: exog contains inf or nans

Input:
import statsmodels.api as sm
import pandas as pd
# reading data from the csv
data = pd.read_csv('/Users/justkiddings/Desktop/Python/TM/TM.csv')
# defining the variables
x = data['FSP'].tolist()
y = data['RSP'].tolist()
# adding the constant term
x = sm.add_constant(x)
# performing the regression
# and fitting the model
result = sm.OLS(y, x).fit()
# printing the summary table
print(result.summary())
Output:
runfile('/Users/justkiddings/Desktop/Python/Code/untitled28.py', wdir='/Users/justkiddings/Desktop/Python/Code')
Traceback (most recent call last):
File "/Users/justkiddings/opt/anaconda3/lib/python3.9/site-packages/spyder_kernels/py3compat.py", line 356, in compat_exec
exec(code, globals, locals)
File "/Users/justkiddings/Desktop/Python/Code/untitled28.py", line 24, in <module>
result = sm.OLS(y, x).fit()
File "/Users/justkiddings/opt/anaconda3/lib/python3.9/site-packages/statsmodels/regression/linear_model.py", line 890, in __init__
super(OLS, self).__init__(endog, exog, missing=missing,
File "/Users/justkiddings/opt/anaconda3/lib/python3.9/site-packages/statsmodels/regression/linear_model.py", line 717, in __init__
super(WLS, self).__init__(endog, exog, missing=missing,
File "/Users/justkiddings/opt/anaconda3/lib/python3.9/site-packages/statsmodels/regression/linear_model.py", line 191, in __init__
super(RegressionModel, self).__init__(endog, exog, **kwargs)
File "/Users/justkiddings/opt/anaconda3/lib/python3.9/site-packages/statsmodels/base/model.py", line 267, in __init__
super().__init__(endog, exog, **kwargs)
File "/Users/justkiddings/opt/anaconda3/lib/python3.9/site-packages/statsmodels/base/model.py", line 92, in __init__
self.data = self._handle_data(endog, exog, missing, hasconst,
File "/Users/justkiddings/opt/anaconda3/lib/python3.9/site-packages/statsmodels/base/model.py", line 132, in _handle_data
data = handle_data(endog, exog, missing, hasconst, **kwargs)
File "/Users/justkiddings/opt/anaconda3/lib/python3.9/site-packages/statsmodels/base/data.py", line 673, in handle_data
return klass(endog, exog=exog, missing=missing, hasconst=hasconst,
File "/Users/justkiddings/opt/anaconda3/lib/python3.9/site-packages/statsmodels/base/data.py", line 86, in __init__
self._handle_constant(hasconst)
File "/Users/justkiddings/opt/anaconda3/lib/python3.9/site-packages/statsmodels/base/data.py", line 132, in _handle_constant
raise MissingDataError('exog contains inf or nans')
MissingDataError: exog contains inf or nans
Some of the data:
DATE,HOUR,STATION,CO,FSP,NO2,NOX,O3,RSP,SO2
1/1/2022,1,TUEN MUN,75,38,39,40,83,59,2
1/1/2022,2,TUEN MUN,72,35,29,30,90,61,2
1/1/2022,3,TUEN MUN,74,38,28,30,91,66,2
1/1/2022,4,TUEN MUN,76,39,31,32,79,61,2
1/1/2022,5,TUEN MUN,72,38,25,26,83,65,2
1/1/2022,6,TUEN MUN,74,37,24,25,86,60,2
I have removed the N.A. values from my dataset, and they have been converted into blanks (e.g. 3/1/2022,12,TUEN MUN,85,,53,70,59,,5). Why is there a MissingDataError? How do I fix it? Thanks.
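For context, pandas reads those blank fields in as NaN, and sm.OLS raises MissingDataError by default when its inputs contain NaN or inf. A minimal sketch of one common fix, using the missing='drop' option that statsmodels provides (the shortened file path is illustrative):
import pandas as pd
import statsmodels.api as sm
data = pd.read_csv('TM.csv')  # blank CSV fields arrive here as NaN
x = sm.add_constant(data['FSP'])
y = data['RSP']
# missing='drop' discards rows whose endog/exog contain NaN
# instead of raising MissingDataError
result = sm.OLS(y, x, missing='drop').fit()
print(result.summary())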

pandas 1.3.3 to_feather giving ArrowMemoryError

I have a dataset of around 270 MB, and I use the following to write it to a feather file:
df.reset_index().to_feather(feather_path)
This gives me an error:
File "C:\apps\Python\lib\site-packages\pandas\util\_decorators.py", line 207, in wrapper
return func(*args, **kwargs)
File "C:\apps\Python\lib\site-packages\pandas\core\frame.py", line 2519, in to_feather
to_feather(self, path, **kwargs)
File "C:\apps\Python\lib\site-packages\pandas\io\feather_format.py", line 87, in to_feather
feather.write_feather(df, handles.handle, **kwargs)
File "C:\apps\Python\lib\site-packages\pyarrow\feather.py", line 152, in write_feather
table = Table.from_pandas(df, preserve_index=False)
File "pyarrow\table.pxi", line 1553, in pyarrow.lib.Table.from_pandas
File "C:\apps\Python\lib\site-packages\pyarrow\pandas_compat.py", line 607, in dataframe_to_arrays
arrays[i] = maybe_fut.result()
File "C:\apps\Python\lib\concurrent\futures\_base.py", line 438, in result
return self.__get_result()
File "C:\apps\Python\lib\concurrent\futures\_base.py", line 390, in __get_result
raise self._exception
File "C:\apps\Python\lib\concurrent\futures\thread.py", line 52, in run
result = self.fn(*self.args, **self.kwargs)
File "C:\apps\Python\lib\site-packages\pyarrow\pandas_compat.py", line 575, in convert_column
result = pa.array(col, type=type_, from_pandas=True, safe=safe)
File "pyarrow\array.pxi", line 302, in pyarrow.lib.array
File "pyarrow\array.pxi", line 83, in pyarrow.lib._ndarray_to_array
File "pyarrow\error.pxi", line 114, in pyarrow.lib.check_status
pyarrow.lib.ArrowMemoryError: realloc of size 3221225472 failed
Note: this works well in PyCharm; there are no issues writing the feather file. But when the Python program is called in a Windows batch file like:
call python "myprogram.py"
and the batch file is scheduled as a task using Task Scheduler, it fails with the above memory error. The PyArrow version is 5.0.0, if that helps. Any ideas, please?

Reading keys from an .npz file with multiple workers in pytorch dataloader?

I have an .npz file in which I have stored a dictionary. The dictionary has some keys, and the values are numpy arrays. I want to read the dictionary in the __getitem__() method of my dataloader. When I set the dataloader's num_workers to 1, everything runs fine. But when I increase the number of workers, it throws the following error when reading the data from that npz file:
Traceback (most recent call last):
File "scripts/train.py", line 235, in <module>
train(args)
File "scripts/train.py", line 186, in train
solver(args.epoch, args.verbose)
File "/local-scratch/codebase/cap/lib/solver.py", line 174, in __call__
self._feed(self.dataloader["train"], "train", epoch_id)
File "/local-scratch/codebase/cap/lib/solver.py", line 366, in _feed
for data_dict in dataloader:
File "/local-scratch/anaconda3/envs/scanenv/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 819, in __next__
return self._process_data(data)
File "/local-scratch/anaconda3/envs/scanenv/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 846, in _process_data
data.reraise()
File "/local-scratch/anaconda3/envs/scanenv/lib/python3.6/site-packages/torch/_utils.py", line 369, in reraise
raise self.exc_type(msg)
zipfile.BadZipFile: Caught BadZipFile in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/local-scratch/anaconda3/envs/scanenv/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
data = fetcher.fetch(index)
File "/local-scratch/anaconda3/envs/scanenv/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/local-scratch/anaconda3/envs/scanenv/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/local-scratch/codebase/cap/lib/dataset.py", line 947, in __getitem__
other_bbox_feat = self.box_features['{}-{}_{}.{}'.format(scene_id, target_object_id, ann_id, object_id)]
File "/local-scratch/anaconda3/envs/scanenv/lib/python3.6/site-packages/numpy/lib/npyio.py", line 255, in __getitem__
pickle_kwargs=self.pickle_kwargs)
File "/local-scratch/anaconda3/envs/scanenv/lib/python3.6/site-packages/numpy/lib/format.py", line 763, in read_array
data = _read_bytes(fp, read_size, "array data")
File "/local-scratch/anaconda3/envs/scanenv/lib/python3.6/site-packages/numpy/lib/format.py", line 892, in _read_bytes
r = fp.read(size - len(data))
File "/local-scratch/anaconda3/envs/scanenv/lib/python3.6/zipfile.py", line 872, in read
data = self._read1(n)
File "/local-scratch/anaconda3/envs/scanenv/lib/python3.6/zipfile.py", line 962, in _read1
self._update_crc(data)
File "/local-scratch/anaconda3/envs/scanenv/lib/python3.6/zipfile.py", line 890, in _update_crc
raise BadZipFile("Bad CRC-32 for file %r" % self.name)
zipfile.BadZipFile: Bad CRC-32 for file 'scene0519_00-13_1.0.npy'
As far as I know, the pytorch dataloader uses multiprocessing for data loading. Perhaps the issue is with multiprocessing and .npz files. I really appreciate any help.
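One widely reported cause is that np.load on an .npz returns a lazy, zipfile-backed handle, and sharing one handle across forked worker processes can corrupt reads (hence the CRC error). A minimal sketch of the usual workaround, opening the archive lazily inside each worker process (the class and attribute names are illustrative):
import numpy as np
from torch.utils.data import Dataset
class NpzDataset(Dataset):
    def __init__(self, npz_path, keys):
        self.npz_path = npz_path
        self.keys = keys        # the dictionary keys to look up per item
        self._archive = None    # opened on first access, once per process
    def __len__(self):
        return len(self.keys)
    def __getitem__(self, idx):
        if self._archive is None:
            # each worker process gets its own zipfile-backed handle,
            # instead of inheriting a shared one from the parent
            self._archive = np.load(self.npz_path)
        return self._archive[self.keys[idx]]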

pandas.read_csv gives FileNotFound error inside a loop

pandas.read_csv works properly when used as a single statement, but it gives a FileNotFoundError when used inside a loop, even though the file exists.
for filename in os.listdir("./Datasets/pollution"):
    print(filename)  # to check which file is under processing
    df = pd.read_csv(filename, sep=",").head(1)
The lines above give the following error.
pollutionData184866.csv <----- The name of the file is printed properly.
Traceback (most recent call last):
File "/home/parnab/PycharmProjects/FinalYearProject/locationExtractor.py", line 13, in <module>
df = pd.read_csv(i, sep=",").head(1)
File "/usr/lib/python3.6/site-packages/pandas/io/parsers.py", line 646, in parser_f
return _read(filepath_or_buffer, kwds)
File "/usr/lib/python3.6/site-packages/pandas/io/parsers.py", line 389, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/usr/lib/python3.6/site-packages/pandas/io/parsers.py", line 730, in __init__
self._make_engine(self.engine)
File "/usr/lib/python3.6/site-packages/pandas/io/parsers.py", line 923, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/usr/lib/python3.6/site-packages/pandas/io/parsers.py", line 1390, in __init__
self._reader = _parser.TextReader(src, **kwds)
File "pandas/parser.pyx", line 373, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:4184)
File "pandas/parser.pyx", line 667, in pandas.parser.TextReader._setup_parser_source (pandas/parser.c:8449)
FileNotFoundError: File b'pollutionData184866.csv' does not exist
But when I do
filename = 'pollutionData184866.csv'
df = pd.read_csv(filename, sep=',')
it works fine. What am I doing wrong?
os.listdir("./Datasets/pollution") returns a list of file names without a path, and according to the path "./Datasets/pollution" you are parsing CSV files that are NOT in the current directory ".", so pandas cannot find them by bare name. Changing it to glob.glob('./Datasets/pollution/*.csv') should work, because glob.glob() returns a list of matching files/directories with the given path included.
Demo:
In [19]: os.listdir('d:/temp/.data/629509')
Out[19]:
['AAON_data.csv',
'AAON_data.png',
'AAPL_data.csv',
'AAPL_data.png',
'AAP_data.csv',
'AAP_data.png']
In [20]: glob.glob('d:/temp/.data/629509/*.csv')
Out[20]:
['d:/temp/.data/629509\\AAON_data.csv',
'd:/temp/.data/629509\\AAPL_data.csv',
'd:/temp/.data/629509\\AAP_data.csv']
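Applied to the loop from the question, a minimal sketch of the fix (the directory name is taken from the question):
import glob
import pandas as pd
# each glob match already includes the directory prefix,
# so read_csv can open the file directly
for filepath in glob.glob('./Datasets/pollution/*.csv'):
    print(filepath)
    df = pd.read_csv(filepath, sep=',').head(1)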

matplotlib pgf: OSError: No such file or directory in subprocess.py

I am trying to use matplotlib to create a pgf file for LaTeX:
from matplotlib.pyplot import subplots
from numpy import linspace
x = linspace(0, 100, 30)
fig, ax = subplots(figsize=(10, 6))
ax.scatter(x, x)
fig.tight_layout()
fig.savefig('/home/mark/dicp/python/figure.pgf')
But I get OSError: [Errno 2] No such file or directory:
Traceback (most recent call last):
File "visualize/latex_figs.py", line 32, in <module>
fig.savefig('/home/mark/dicp/python/figure.pgf')
File "/usr/local/lib/python2.7/dist-packages/matplotlib/figure.py", line 1421, in savefig
self.canvas.print_figure(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/matplotlib/backend_bases.py", line 2220, in print_figure
**kwargs)
File "/usr/local/lib/python2.7/dist-packages/matplotlib/backend_bases.py", line 1957, in print_pgf
return pgf.print_pgf(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/matplotlib/backends/backend_pgf.py", line 818, in print_pgf
self._print_pgf_to_fh(fh, *args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/matplotlib/backends/backend_pgf.py", line 797, in _print_pgf_to_fh
RendererPgf(self.figure, fh),
File "/usr/local/lib/python2.7/dist-packages/matplotlib/backends/backend_pgf.py", line 409, in __init__
self.latexManager = LatexManagerFactory.get_latex_manager()
File "/usr/local/lib/python2.7/dist-packages/matplotlib/backends/backend_pgf.py", line 223, in get_latex_manager
new_inst = LatexManager()
File "/usr/local/lib/python2.7/dist-packages/matplotlib/backends/backend_pgf.py", line 305, in __init__
cwd=self.tmpdir)
File "/usr/lib/python2.7/subprocess.py", line 679, in __init__
errread, errwrite)
File "/usr/lib/python2.7/subprocess.py", line 1249, in _execute_child
raise child_exception
OSError: [Errno 2] No such file or directory
It also generates this part of the output file:
%% [whole bunch of comments]
\begingroup%
\makeatletter%
\begin{pgfpicture}%
\pgfpathrectangle{\pgfpointorigin}{\pgfqpoint{10.000000in}{6.000000in}}%
\pgfusepath{use as bounding box}%
I do not understand what OSError: No such file or directory in subprocess.py has to do with anything... The file I am trying to save is writable. Am I misunderstanding something, or is this a bug I should report?
I also had this problem while trying to run the example scripts. The problem occurs where backend_pgf.py first tries to run the default LaTeX command; the PGF backend assumes that it should use xelatex by default. If the problem is the same for you as for me, then you have two options:
1. Add the key "pgf.texsystem": "pdflatex" (or lualatex, whatever) to your matplotlib.rcParams. For example, add the following snippet to the top of your script:
import matplotlib
pgf_with_rc_fonts = {"pgf.texsystem": "pdflatex"}
matplotlib.rcParams.update(pgf_with_rc_fonts)
2. Ensure that you have xelatex and that it is on your PATH, and use it as the default latex command (i.e., assuming you are on a Mac or Linux system, running which xelatex should return a path).
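As a quick sanity check for the second option, a small sketch (this uses Python 3's shutil.which, an assumption beyond the original answer, which mirrors the shell's which command):
import shutil
# prints the full path to xelatex if it is on PATH, or None otherwise
print(shutil.which('xelatex'))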