Python Script that used to work, is now getting automatically killed in Ubuntu - pandas

I was once able to run the Python script below on my Ubuntu machine without the memory errors I was getting on Windows.
import pandas as pd
import numpy as np
# create a pandas dataframe for each input file
dfs1 = pd.read_csv('s1.csv', encoding='utf-8', names=list(range(0,107)), dtype='string', na_filter=False)
dfs2 = pd.read_csv('s2.csv', encoding='utf-8', names=list(range(0,107)), dtype='string', na_filter=False)
dfr = pd.read_csv('r.csv', encoding='utf-8', names=list(range(0,107)), dtype='string', na_filter=False)
# combine them into one dataframe
dfs12r = pd.concat([dfs1, dfs2, dfr], ignore_index=True)  # without ignore_index the line numbers are not adjusted
# the bag of words (BOW) comes next
wordlist = []
for line in range(8052):
    for row in range(106):
        # print(line, row, dfs12r[row][line])
        if dfs12r[row][line] not in wordlist:
            wordlist.append(dfs12r[row][line])
wordlist.sort()
# print(wordlist)
print(len(wordlist))  # 12350
dfBOW = pd.DataFrame(np.zeros((len(dfs12r.index), len(wordlist))), dtype='int')
# create the dictionary
wordDict = dict.fromkeys(wordlist, 'default')
counter = 0
for word in wordlist:
    wordDict[word] = counter
    counter += 1
# print(wordDict)
# scan every word from dfs12r and add 1 to the respective cell in dfBOW
for line in range(8052):
    for row in range(107):
        dfBOW[wordDict[dfs12r[row][line]]][line] += 1
Unfortunately, probably after some automatic Ubuntu updates, I now get only the message "Killed" when I try to run the script, with no further explanation.
Through simple print statements I know that the script is interrupted inside the for loop at the end.
I understand that I should be able to make the script more memory efficient, but I am also hoping for guidance on how to get Ubuntu to run the same script again the way it used to. (Through the top command I can see that all of my memory, including the swap, is being used while inside this loop.)
Could paging have been disabled somehow after the updates? Any advice is welcome.
I still have 16GB of RAM, and use Ubuntu 20.04 (Specs are the same before and after the script stopped working). I use dual boot on the same SSD.
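On Linux, a bare "Killed" message usually means the kernel's out-of-memory killer ended the process, and the kernel log normally records it. Whether swap is still configured can be read from /proc/meminfo. The sketch below is one way to check that from Python; the helper names parse_meminfo and swap_summary are made up for this example, and it assumes the standard "Key:   value kB" layout of that file.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB for most keys)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip anything that is not "Key: value"
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def swap_summary(info):
    """Summarise swap state from a parsed meminfo dict."""
    total = info.get("SwapTotal", 0)
    free = info.get("SwapFree", 0)
    return {
        "swap_enabled": total > 0,
        "swap_total_kb": total,
        "swap_used_kb": total - free,
    }


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as fh:
            print(swap_summary(parse_meminfo(fh.read())))
    except FileNotFoundError:
        print("/proc/meminfo not found; this check only applies to Linux")
```

If swap_enabled comes back True, paging is still configured and the script is simply exhausting RAM plus swap.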
Below is the error I am getting from the same script on Windows:
Traceback (most recent call last):
File "D:\sharedfiles\Organised\WorkSpace\ptixiaki\github\ptixiaki\code\makingthedata\2.1 Approach (Same as 2 but turning all words to lowercase)\2.1_CSVtoDataframe\CSVtoBOW.py", line 60, in <module>
dfBOW[wordDict[dfs12r[row][line]]][line]+=1
File "D:\wandowsoftware\anaconda\envs\ptixiaki\lib\site-packages\pandas\core\series.py", line 1143, in __setitem__
self._maybe_update_cacher()
File "D:\wandowsoftware\anaconda\envs\ptixiaki\lib\site-packages\pandas\core\series.py", line 1279, in _maybe_update_cacher
ref._maybe_cache_changed(cacher[0], self, inplace=inplace)
File "D:\wandowsoftware\anaconda\envs\ptixiaki\lib\site-packages\pandas\core\frame.py", line 3950, in _maybe_cache_changed
self._mgr.iset(loc, arraylike, inplace=inplace)
File "D:\wandowsoftware\anaconda\envs\ptixiaki\lib\site-packages\pandas\core\internals\managers.py", line 1141, in iset
blk.delete(blk_locs)
File "D:\wandowsoftware\anaconda\envs\ptixiaki\lib\site-packages\pandas\core\internals\blocks.py", line 388, in delete
self.values = np.delete(self.values, loc, 0) # type: ignore[arg-type]
File "<__array_function__ internals>", line 5, in delete
File "D:\wandowsoftware\anaconda\envs\ptixiaki\lib\site-packages\numpy\lib\function_base.py", line 4555, in delete
new = arr[tuple(slobj)]
MemoryError: Unable to allocate 501. MiB for an array with shape (12234, 10736) and data type int32
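The traceback suggests that the chained assignment dfBOW[col][line] += 1 can make pandas copy a large block on each update, and the "not in wordlist" test scans a list for every cell. A minimal sketch of a lighter counting step follows, assuming the rows can be handled as plain lists of strings. It uses a set-based vocabulary and one Counter per row instead of a dense DataFrame; the function name bag_of_words and the sample rows are hypothetical, and this is not the original script's method.

```python
from collections import Counter


def bag_of_words(rows):
    """Return (vocabulary, counts) for an iterable of token lists.

    vocabulary maps each distinct token to a column index (sorted order);
    counts holds one Counter per row, keyed by column index, so only
    non-zero cells are stored.
    """
    rows = [list(row) for row in rows]
    distinct = {token for row in rows for token in row}  # set lookup, not list scan
    vocabulary = {token: i for i, token in enumerate(sorted(distinct))}
    counts = [Counter(vocabulary[token] for token in row) for row in rows]
    return vocabulary, counts


# Hypothetical sample rows; empty strings stand in for the padded cells.
rows = [["a", "b", "a", ""], ["b", "c", "", ""]]
vocabulary, counts = bag_of_words(rows)
print(len(vocabulary))
print(counts[0][vocabulary["a"]])
```

If a dense matrix is needed afterwards, the per-row counters can be written into a preallocated NumPy array once, rather than updating a DataFrame cell by cell.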

Related

Matplotlib.pyplot backend qt giving ValueError of image size

I am using Python 3.7. I would like to change my backend temporarily so that I can zoom in/out of my data. However, generating figures using qt backend gives me a ValueError, even if the figure is minimally simple.
Here is a minimum example, where I typed the following commands in the console one by one
[Input Commands]
import matplotlib.pyplot as plt
%matplotlib qt
plt.plot([1,2],[1,2])
[Output Error]
[<matplotlib.lines.Line2D at 0x1b2efda7648>]
Traceback (most recent call last):
File "C:\Users\USER_NAME\anaconda3\lib\site-packages\matplotlib\backends\backend_qt.py", line 455, in _draw_idle
self.draw()
File "C:\Users\USER_NAME\anaconda3\lib\site-packages\matplotlib\backends\backend_agg.py", line 431, in draw
self.renderer = self.get_renderer(cleared=True)
File "C:\Users\USER_NAME\anaconda3\lib\site-packages\matplotlib\backends\backend_agg.py", line 447, in get_renderer
self.renderer = RendererAgg(w, h, self.figure.dpi)
File "C:\Users\USER_NAME\anaconda3\lib\site-packages\matplotlib\backends\backend_agg.py", line 93, in __init__
self._renderer = _RendererAgg(int(width), int(height), dpi)
ValueError: Image size of 213120x159840 pixels is too large. It must be less than 2^16 in each direction.
I don't think this question is a duplicate of previous questions (here and here) because the error appears even for a very simple figure with no labels.
I thought I had accidentally overwritten some kind of settings file, so I tried reinstalling matplotlib with "conda install matplotlib" at the Anaconda prompt, but it didn't help.
Switching back to my default backend with "%matplotlib inline" generates figures normally.
How can I fix my qt backend?
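The Agg renderer in the traceback is created from the figure width, height, and dpi, so the pixel size is the figure size in inches multiplied by the dpi. The arithmetic below works backwards from the reported size. It assumes matplotlib's default figure size of 6.4 x 4.8 inches, which the question does not state; the function names are made up for this sketch.

```python
# Limit quoted in the error message: each dimension must be below 2**16 pixels.
MAX_PIXELS = 2 ** 16


def pixel_size(width_in, height_in, dpi):
    """Pixel dimensions of a figure of the given size (inches) at the given dpi."""
    return round(width_in * dpi), round(height_in * dpi)


def implied_dpi(width_px, height_px, width_in=6.4, height_in=4.8):
    """DPI implied by a pixel size, assuming the default 6.4 x 4.8 inch figure."""
    return width_px / width_in, height_px / height_in


dpi_w, dpi_h = implied_dpi(213120, 159840)
print(round(dpi_w), round(dpi_h))                   # implied dpi per direction
print(pixel_size(6.4, 4.8, 100))                    # size at an ordinary dpi of 100
print(max(pixel_size(6.4, 4.8, 100)) < MAX_PIXELS)  # within the Agg limit
```

Under that assumption both directions give the same dpi of about 33300, which suggests the figure dpi picked up with the qt backend is far too large, rather than anything in the plot itself. Printing plt.gcf().dpi after switching backends would confirm or rule this out.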

Pandas 0.24.0 breaks my pandas dataframe with special column identifiers

I had code that worked fine until I tried to run it on a coworker's machine, whereupon I discovered that while it worked using pandas 0.22.0, it broke on pandas 0.24.0. For the moment, we've solved this problem by downgrading their copy of pandas, but I would like to find a better solution if one exists.
The problem seems to be that I am creating a user-defined class to use as identifiers for my columns in the dataframe. When comparing two dataframes, pandas for some reason tries to call my column labels as functions, and then throws an exception because they aren't callable.
Here's some example code:
import pandas as pd
import numpy as np
class label(object):
    def __init__(self, var):
        self.var = var
    def __eq__(self, other):
        return self.var == other.var
df = pd.DataFrame(np.eye(5),columns=[label(ii) for ii in range(5)])
df == df
This produces the following stack trace:
Traceback (most recent call last):
File "<ipython-input-4-496e4ab3f9d9>", line 1, in <module>
df==df1
File "C:\...\site-packages\pandas\core\ops.py", line 2098, in f
return dispatch_to_series(self, other, func, str_rep)
File "C:\...\site-packages\pandas\core\ops.py", line 1157, in dispatch_to_series
new_data = expressions.evaluate(column_op, str_rep, left, right)
File "C:\...\site-packages\pandas\core\computation\expressions.py", line 208, in evaluate
return _evaluate(op, op_str, a, b, **eval_kwargs)
File "C:\...\site-packages\pandas\core\computation\expressions.py", line 68, in _evaluate_standard
return op(a, b)
File "C:\...\site-packages\pandas\core\ops.py", line 1135, in column_op
for i in range(len(a.columns))}
File "C:\...\site-packages\pandas\core\ops.py", line 1135, in <dictcomp>
for i in range(len(a.columns))}
File "C:\...\site-packages\pandas\core\ops.py", line 1739, in wrapper
name=res_name).rename(res_name)
File "C:\...\site-packages\pandas\core\series.py", line 3733, in rename
return super(Series, self).rename(index=index, **kwargs)
File "C:\...\site-packages\pandas\core\generic.py", line 1091, in rename
level=level)
File "C:\...\site-packages\pandas\core\internals\managers.py", line 171, in rename_axis
obj.set_axis(axis, _transform_index(self.axes[axis], mapper, level))
File "C:\...\site-packages\pandas\core\internals\managers.py", line 2004, in _transform_index
items = [func(x) for x in index]
TypeError: 'label' object is not callable
I've found I can fix the problem by making my class callable with a single argument and returning that argument, but that breaks .loc indexing, which will default to treating my objects as callables.
This problem only occurs when the custom objects are in the columns - the index can handle them just fine.
Is this a bug or a change in usage, and is there any way I can work around it without giving up my custom labels?

pandas memory error on large RAM machine but not on smaller RAM machine: same code, same data

I run the following on two of my machines:
import os, sqlite3
import pandas as pd
from feat_transform import filter_anevexp
db_path = r'C:\Users\timregan\Desktop\anondb_280718.sqlite3'
db = sqlite3.connect(db_path)
anevexp_df = filter_anevexp(db, 0)
On my laptop (with 8GB of RAM) this runs without issue (although the call out to filter_anevexp takes a few minutes). On my desktop (with 128GB of RAM) it fails in pandas with a memory error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\timregan\source\MentalHealth\code\preprocessing\feat_transform.py", line 171, in filter_anevexp
anevexp_df = anevexp_df[anevexp_df["user_id"].isin(df)].copy()
File "C:\Users\timregan\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\frame.py", line 2682, in __getitem__
return self._getitem_array(key)
File "C:\Users\timregan\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\frame.py", line 2724, in _getitem_array
return self._take(indexer, axis=0)
File "C:\Users\timregan\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\generic.py", line 2789, in _take
verify=True)
File "C:\Users\timregan\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\internals.py", line 4539, in take
axis=axis, allow_dups=True)
File "C:\Users\timregan\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\internals.py", line 4425, in reindex_indexer
for blk in self.blocks]
File "C:\Users\timregan\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\internals.py", line 4425, in <listcomp>
for blk in self.blocks]
File "C:\Users\timregan\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\internals.py", line 1258, in take_nd
allow_fill=True, fill_value=fill_value)
File "C:\Users\timregan\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\algorithms.py", line 1655, in take_nd
out = np.empty(out_shape, dtype=dtype)
MemoryError
Is there anything special I need to do to prevent errors (e.g. addressing errors) on machines with lots of memory?
N.B. I have not included the code in the filter_anevexp function because I am not interested in advice on how to reduce its memory footprint. I am interested in understanding why the same code running on the same data fails with a memory error on a 128GB RAM machine while it succeeds on an 8GB RAM machine.
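The traceback shows that the failing machine runs a 32-bit build of Python (note Python37-32 in the paths). A 32-bit process can address at most 4 GB of memory no matter how much RAM is installed, so the extra RAM does not help. Try reinstalling Python 3.7 as the 64-bit build instead of the 32-bit one you are currently using.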
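To confirm which build is running on each machine, a quick check like the following can be run with the interpreter in question. It relies only on the standard library; the function name interpreter_bits is made up for this example.

```python
import platform
import struct
import sys


def interpreter_bits():
    """Return the pointer width of the running interpreter in bits (32 or 64)."""
    return struct.calcsize("P") * 8


bits = interpreter_bits()
print(f"Python {platform.python_version()} is a {bits}-bit build")
# A second, independent indicator: sys.maxsize exceeds 2**32 only on 64-bit builds.
print("sys.maxsize > 2**32:", sys.maxsize > 2 ** 32)
```

If the laptop reports 64 and the desktop reports 32, the difference in behaviour comes from the interpreter build, not from the amount of RAM.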

Pandas Group Example Errors

I am trying to replicate an example from Wes McKinney's book on pandas. The code is below (it assumes all the names data files are in a folder called names):
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
years = range(1880, 2011)
pieces = []
columns = ['name', 'sex', 'births']
for year in years:
    path = 'names/yob%d.txt' % year
    frame = pd.read_csv(path, names=columns)
    frame['year'] = year
    pieces.append(frame)
names = pd.concat(pieces, ignore_index=True)
names
def get_tops(group):
    return group.sort_index(by='births', ascending=False)[:1000]
grouped = names.groupby(['year','sex'])
grouped.apply(get_tops)
I am using Pandas 0.10 and Python 2.7. The error I am seeing is this:
Traceback (most recent call last):
File "names.py", line 21, in <module>
grouped.apply(get_tops)
File "/usr/local/lib/python2.7/dist-packages/pandas-0.10.0-py2.7-linux-i686.egg/pandas/core/groupby.py", line 321, in apply
return self._python_apply_general(f)
File "/usr/local/lib/python2.7/dist-packages/pandas-0.10.0-py2.7-linux-i686.egg/pandas/core/groupby.py", line 324, in _python_apply_general
keys, values, mutated = self.grouper.apply(f, self.obj, self.axis)
File "/usr/local/lib/python2.7/dist-packages/pandas-0.10.0-py2.7-linux-i686.egg/pandas/core/groupby.py", line 585, in apply
values, mutated = splitter.fast_apply(f, group_keys)
File "/usr/local/lib/python2.7/dist-packages/pandas-0.10.0-py2.7-linux-i686.egg/pandas/core/groupby.py", line 2127, in fast_apply
results, mutated = lib.apply_frame_axis0(sdata, f, names, starts, ends)
File "reduce.pyx", line 421, in pandas.lib.apply_frame_axis0 (pandas/lib.c:24934)
File "/usr/local/lib/python2.7/dist-packages/pandas-0.10.0-py2.7-linux-i686.egg/pandas/core/frame.py", line 2028, in __setattr__
self[name] = value
File "/usr/local/lib/python2.7/dist-packages/pandas-0.10.0-py2.7-linux-i686.egg/pandas/core/frame.py", line 2043, in __setitem__
self._set_item(key, value)
File "/usr/local/lib/python2.7/dist-packages/pandas-0.10.0-py2.7-linux-i686.egg/pandas/core/frame.py", line 2078, in _set_item
value = self._sanitize_column(key, value)
File "/usr/local/lib/python2.7/dist-packages/pandas-0.10.0-py2.7-linux-i686.egg/pandas/core/frame.py", line 2112, in _sanitize_column
raise AssertionError('Length of values does not match '
AssertionError: Length of values does not match length of index
Any ideas?
I think this was a bug introduced in 0.10, namely issue #2605,
"AssertionError when using apply after GroupBy". It's since been fixed.
You can either wait for the 0.10.1 release, which shouldn't be too long from now, or you can upgrade to the development version (either via git or simply by downloading the zip of master.)
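For readers on a current pandas release rather than 0.10: sort_index(by=...) no longer exists there, and sort_values is its replacement. A minimal sketch of the same "top rows per year and sex" selection is shown below, using a few toy rows for illustration instead of the names files and taking the top 2 rather than the top 1000.

```python
import pandas as pd

# Toy rows for illustration only; the original script reads names/yob%d.txt.
names = pd.DataFrame({
    "name": ["Mary", "Anna", "Emma", "John", "William", "James"],
    "sex": ["F", "F", "F", "M", "M", "M"],
    "births": [7065, 2604, 2003, 9655, 9532, 5927],
    "year": [1880] * 6,
})

# Sort once by births, then keep the first n rows of each (year, sex) group.
top = (
    names.sort_values("births", ascending=False)
         .groupby(["year", "sex"])
         .head(2)
)
print(top)
```

Sorting once and calling head on the groupby avoids apply altogether, which also sidesteps the apply-related bug discussed above.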

Python Matplotlib errors with savefig (newbie).

All parts of Python on my computer were recently installed from the Enthought academic package, but I use PyScripter for editing and running code. I'm very early in my learning curve, and so could very well be overlooking some obvious things here.
When I try to create a plot and save it like so:
import matplotlib.pylab as pl
pl.hist(myEst, bins=20, range=(.1,.60))
pl.ylabel("Freq")
pl.xlabel("Success Probability")
pl.title('Histogram of Binomial Estimator')
pl.axis([0, 1, 0, 500])
pl.vlines (.34,0,500)
pl.savefig('TestHist.png')
pl.show()
I get these errors:
Traceback (most recent call last):
File "<editor selection>", line 9, in <module>
File "C:\Python27\lib\site-packages\matplotlib\figure.py", line 1172, in savefig
self.canvas.print_figure(*args, **kwargs)
File "C:\Python27\lib\site-packages\matplotlib\backends\backend_wxagg.py", line 100, in print_figure
FigureCanvasAgg.print_figure(self, filename, *args, **kwargs)
File "C:\Python27\lib\site-packages\matplotlib\backend_bases.py", line 2017, in print_figure
**kwargs)
File "C:\Python27\lib\site-packages\matplotlib\backends\backend_agg.py", line 450, in print_png
filename_or_obj = file(filename_or_obj, 'wb')
IOError: [Errno 13] Permission denied: 'TestHist.png'
If I take out the pl.savefig('TestHist.png') line everything works fine, and I can see the plot I want, but when that line is in there I get the errors.
I've checked my backend version using pl.get_backend(), it returns 'WXAgg', which according to documentation should be able to use .png format.
I've also tried including an explicit format='png' and format=png within the savefig command, but still get errors.
Can anyone give me advice on how to proceed, or another approach for saving a plot?
There's nothing wrong with your code. I just tested it locally on my machine. The issue is this error:
IOError: [Errno 13] Permission denied: 'TestHist.png'
You are most likely trying to save the file somewhere that the Python process doesn't have permission to access. What OS are you on? Where are you trying to save the file?
If it helps others, I made the silly mistake of not actually designating a file name, and as a result got the same error message, which led me to this question.
Here is the code that was generating the error:
plt.savefig('C:\\Users\\bwarn\\Canopy', format='png')
Here is the correction that resolved it (you'll see I designated the actual file name):
plt.savefig('C:\\Users\\bwarn\\Canopy\\myplot.png', format='png')
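The two causes mentioned so far (a location the process cannot write to, and a path that is really a directory) can be checked before calling savefig. The helper below is a sketch using only the standard library; the name check_save_target is made up for this example, and os.access is only an approximation of the real permissions on Windows.

```python
import os
import tempfile


def check_save_target(path):
    """Return a list of problems that would stop a file being written to path."""
    problems = []
    directory = os.path.dirname(os.path.abspath(path))
    if os.path.isdir(path):
        problems.append("path is an existing directory, not a file name")
    if not os.path.isdir(directory):
        problems.append("parent directory does not exist")
    elif not os.access(directory, os.W_OK):
        problems.append("parent directory is not writable")
    return problems


# A relative name such as 'TestHist.png' is resolved against this directory.
print("current working directory:", os.getcwd())
# A file name inside the temp directory should normally be fine.
print(check_save_target(os.path.join(tempfile.gettempdir(), "TestHist.png")))
# Passing a directory instead of a file name is reported as a problem.
print(check_save_target(tempfile.gettempdir()))
```

An empty list means neither of the two problems was detected for that path.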
The following worked for me when I was running a neural network on my Windows machine:
image_path = 'A:/DeepLearning/Padhai/MLFlow/images/%s.png' % (expt_id)
plt.savefig(image_path)
Otherwise, try putting 'r' in front of the path so that it is treated as a raw string.
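A short illustration of why the 'r' prefix matters for Windows paths: in a normal string literal, sequences such as \t and \n are escape codes, while a raw string keeps every backslash as written. The second path below is hypothetical and chosen only to show the pitfall.

```python
# The escaped form from the answer above and its raw-string equivalent.
escaped = 'C:\\Users\\bwarn\\Canopy\\myplot.png'
raw = r'C:\Users\bwarn\Canopy\myplot.png'
print(escaped == raw)

# Hypothetical path: without the r prefix, \t and \n become a tab and a newline.
broken = 'A:\temp\new.png'
fixed = r'A:\temp\new.png'
print('\t' in broken, '\n' in broken)
print('\t' in fixed, '\n' in fixed)

# Forward slashes also work on Windows and need no escaping.
forward = 'A:/temp/new.png'
print(forward)
```

Note that a plain literal such as 'C:\Users\...' is even rejected outright in Python 3, because \U starts a Unicode escape, so raw strings or forward slashes are the safer habit.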