Creating a NumPy matrix from a PySpark DataFrame

I have a PySpark DataFrame child with columns like:
lat1 lon1
80 70
65 75
I am trying to convert it into a NumPy matrix using IndexedRowMatrix as below:
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix
mat = IndexedRowMatrix(child.select('lat1', 'lon1').rdd.map(lambda row: IndexedRow(row[0], Vectors.dense(row[1:]))))
But it throws an error. I want to avoid converting to a pandas DataFrame to get the matrix.
Error:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 33.0 failed 4 times, most recent failure: Lost task 0.3 in stage 33.0 (TID 733, ebdp-avdc-d281p.sys.comcast.net, executor 16): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/data/02/yarn/nm/usercache/mbansa001c/appcache/application_1506130884691_56333/container_e48_1506130884691_56333_01_000017/pyspark.zip/pyspark/worker.py", line 174, in main
process()

You want to avoid pandas, but you try to convert to an RDD, which is severely suboptimal...
Anyway, assuming you can collect the selected columns of your child DataFrame (a reasonable assumption, since you aim to put them in a NumPy array), it can be done with plain NumPy:
import numpy as np
np.array(child.select('lat1', 'lon1').collect())
# array([[80, 70],
#        [65, 75]])
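If you really do need a distributed IndexedRowMatrix rather than a local array, note that IndexedRow expects a (row index, vector) pair, whereas row[0] in the question's snippet is a latitude, not an index. A minimal sketch under that assumption, generating explicit indices with zipWithIndex:
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

# Pair each Row with a generated long index, then build (index, dense vector) rows
mat = IndexedRowMatrix(
    child.select('lat1', 'lon1').rdd
         .zipWithIndex()
         .map(lambda x: IndexedRow(x[1], Vectors.dense(list(x[0]))))
)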

Related

PyODBC+Pandas+Read_SQL: Error: The cursor's connection has been closed

I am reading tables as SELECT * FROM TABLE (the sql variable below) from an ODBC data source via pyodbc and fetching/loading all the rows using pandas read_sql(). However, there are 200+ tables and some have 100,000 rows, so I have been using chunksize to read and load them into DataFrames to gain some read performance.
Below is a sample code:
def get_odbc_tables(dsn, uid, pwd):
    try:
        cnxn = pyodbc.connect('DSN={};UID={};PWD={}'.format(dsn, uid, pwd), autocommit=True)
        # Get data into a pandas DataFrame
        dfl = []
        df = pd.DataFrame()
        for chunk in pd.read_sql(sql, cnxn, chunksize=10000):
            dfl.append(chunk)
            df = pd.concat(dfl, ignore_index=True)
            records = json.loads(df.T.to_json()).values()
            print("Load to Target")
            ......
            cnxn.close()
    except Exception as e:
        print("Error: {}".format(str(e)))
        sys.exit(1)
However, I always get this error after pandas has read/processed the specified chunksize (10,000 rows) defined in read_sql and loaded them to the target:
Error: The cursor's connection has been closed
If chunksize is increased to 50,000, it errors out again with the same message once it has processed/loaded just 50,000 records, even though the source has more records than that. This also causes the program to fail.
C:\Program Files (x86)\Python\lib\site-packages\pandas\io\sql.py in _query_iterator(cursor, chunksize, columns, index_col, coerce_float, parse_dates)
1419 while True:
-> 1420 data = cursor.fetchmany(chunksize)
1421 if type(data) == tuple:
ProgrammingError: The cursor's connection has been closed.
During handling of the above exception, another exception occurred:
SystemExit Traceback (most recent call last)
<ipython-input-127-b106daee9737> in <module>()
Please suggest if there's any way to handle this. The source is an ODBC connection only, so I don't think I can create an SQLAlchemy engine for it.
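No answer is recorded in this thread, but the symptom (failing right after exactly one chunk) matches closing the connection inside the chunk loop, as in the code above: pd.read_sql with chunksize returns a lazy iterator, and fetching each subsequent chunk still needs the live connection. A minimal sketch of the safer structure, assuming the names from the question (with sql passed in explicitly and the load-to-target step elided):
import json
import sys
import pandas as pd
import pyodbc

def get_odbc_tables(dsn, uid, pwd, sql):
    try:
        cnxn = pyodbc.connect('DSN={};UID={};PWD={}'.format(dsn, uid, pwd), autocommit=True)
        dfl = []
        for chunk in pd.read_sql(sql, cnxn, chunksize=10000):
            dfl.append(chunk)  # consume the whole iterator before closing
        df = pd.concat(dfl, ignore_index=True)
        records = json.loads(df.T.to_json()).values()
        # ... load records to the target here ...
        cnxn.close()  # close only after all chunks have been fetched
    except Exception as e:
        print("Error: {}".format(str(e)))
        sys.exit(1)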

Getting an EOFError when calling pd.read_pickle

I had a quick question regarding a pandas DataFrame and the pd.read_pickle() function. Basically, I have a large but simple DataFrame (333 MB). When I run pd.read_pickle on the pickled file, I get an EOFError.
Is there any way around this issue? What might be causing this?
I saw the same EOFError when I created a pickle using:
df.to_pickle('path.pkl', compression='bz2')
and then tried to read with:
pandas.read_pickle('path.pkl')
I fixed the issue by supplying the compression on read:
pandas.read_pickle('path.pkl', compression='bz2')
According to the Pandas docs:
compression : {‘infer’, ‘gzip’, ‘bz2’, ‘zip’, ‘xz’, None}, default ‘infer’
string representing the compression to use in the output file. By default,
infers from the file extension in specified path.
Thus, simply changing the path from 'path.pkl' to 'path.bz2' also fixed the problem.
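For illustration, a minimal round trip relying on compression='infer' (the file name path.bz2 is just an example):
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})
# With a .bz2 extension, the default compression='infer' picks bz2
# symmetrically on write and read, so no explicit argument is needed.
df.to_pickle('path.bz2')
restored = pd.read_pickle('path.bz2')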
I can confirm the valuable comment of greg_data:
When I encountered this error I worked out that it was due to the
initial pickling not having completed correctly. The pickle file was
created, but not finished correctly. Seems to me this is the only
possible source of the EOFError in pickle, that the pickle is
malformed, i.e. not finished.
My error during pickling was:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-40-263240bbee7e> in <module>()
----> 1 main()
<ipython-input-38-9b3c6d782a2a> in main()
43 with open("/content/drive/MyDrive/{}.file".format(tm.id), "wb") as f:
---> 44 pickle.dump(tm, f, pickle.HIGHEST_PROTOCOL)
45
46 print('Coherence:', get_coherence(tm, token_lists, 'c_v'))
TypeError: can't pickle weakref objects
And when reading that pickle file that was obviously not finished during pickling, the reported error occurred:
pd.read_pickle(r'/content/drive/MyDrive/TEST_2021_06_01_10_23_02.file')
Error:
---------------------------------------------------------------------------
EOFError Traceback (most recent call last)
<ipython-input-41-460bdd0a2779> in <module>()
----> 1 object = pd.read_pickle(r'/content/drive/MyDrive/TEST_2021_06_01_10_23_02.file')
/usr/local/lib/python3.7/dist-packages/pandas/io/pickle.py in read_pickle(filepath_or_buffer, compression)
180 # We want to silence any warnings about, e.g. moved modules.
181 warnings.simplefilter("ignore", Warning)
--> 182 return pickle.load(f)
183 except excs_to_catch:
184 # e.g.
EOFError: Ran out of input

Silhouette Score function in sklearn giving unexpected error

I am trying to run k-means clustering on some data. My data frame is a pandas DataFrame with the following dimensions:
People_reduced.shape
Out[155]:
(417837, 13)
Now, while k-means runs fine, when I try to feed the KMeans cluster labels and the original data frame to sklearn's silhouette_score function, it throws a weird error.
Here is the code I used:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

kmeans = KMeans(n_clusters=2, init='k-means++', n_init=10, max_iter=20)
kmeans.fit(People_reduced.ix[:, 1:])
cluster_labels = kmeans.labels_
# The silhouette_score gives the average value for all the samples.
# This gives a perspective into the density and separation of the formed
# clusters
silhouette_avg = silhouette_score(People_reduced.ix[:, 1:].values, cluster_labels)
Error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-154-b392e118f64a> in <module>()
19 # This gives a perspective into the density and separation of the formed
20 # clusters
---> 21 silhouette_avg = silhouette_score(People_reduced.ix[:,1:].values,cluster_labels)
22 #silhouette_avg = silhouette_score(People_reduced.ix[:,1:], cluster_labels)
23
TypeError: 'list' object is not callable
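No answer is recorded in this thread, but since the traceback shows a plain Python list being called, one plausible (unconfirmed) cause is that the name silhouette_score was rebound to a list earlier in the notebook, much like the pd.rolling_mean case in the last question below. A toy reproduction with made-up data:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(100, 3)
cluster_labels = KMeans(n_clusters=2, n_init=10).fit(X).labels_
print(silhouette_score(X, cluster_labels))  # works: returns a float

silhouette_score = [0.1, 0.2]        # hypothetical accidental rebinding
silhouette_score(X, cluster_labels)  # TypeError: 'list' object is not callable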

Use of metadata with MultiIndex column DataFrame

I have produced some software that processes data for analysis and plotting. For each type of data, the data frames are produced in a module dedicated to that type.
Depending on the structure of the data, the data frame columns can be normal or MultiIndex.
I pass the data frames to a plotting function that produces plots of the numeric columns.
I would like to be able to "attach" a string to each "printable" column to be used as its plot label. This string will not be the same as the name of the column.
I can't figure out a good way to do this purely with a pandas DataFrame, and so far I don't have any other solution either.
I have seen posts about metadata, but I don't completely understand whether this functionality is supported or not. At least I can't get it to work, and frames with MultiIndex columns seem to complicate things further.
If it is not supported, is it still on the todo list?
From my reading I get the impression it has worked differently in different versions of pandas, and may even depend on whether Python 2 or 3 is used.
Is there a convenient way to accomplish what I require with pandas data frames? Is using _metadata for this advisable? If so, how?
I have looked around quite a bit, but the MultiIndex concern in particular seems not to be addressed anywhere.
This one seems to indicate that metadata should be supported, but is it for data frames? I need Series in a DataFrame:
Adding meta-information/metadata to pandas DataFrame
This one seems to be a similar question, but I tried the solution and it did not help me:
Propagate pandas series metadata through joins
Here is some experimentation I have done based on my understanding of the _metadata functionality. It seems to show that _metadata made no difference and that the attribute did not persist through a copy. It also shows that using MultiIndex is an even more "unsupported" case.
Python 2.7.9 (default, Dec 10 2014, 12:24:55) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> from numpy.random import randn # To get values for the test frames
>>> import platform # To print python version
>>> # A function to set labels of the columns
>>> def labelSetter(aDF) :
... DFtmp = aDF.copy() # Just to ensure it is a different dataframe
... for column in DFtmp.columns :
... DFtmp[column].myLab='This is '+column.__str__()
... DFtmp[column].notMyLab='This should not persist'
... return DFtmp
...
>>>
>>> print 'Pandas version: {}'.format(pd.version.version)
Pandas version: 0.15.2
>>>
>>> pd.Series._metadata.append('myLab');print pd.Series._metadata # now _metadata contains 'myLab'
['name', 'myLab']
>>>
>>> # Make dataframes normal columns and MultiIndex
>>> dfS=pd.DataFrame(randn(2, 6),columns=['a1','a2','a3','b1','b2','c1']);print dfS
a1 a2 a3 b1 b2 c1
0 -0.934869 -0.310979 0.362635 -0.994605 -0.880114 -1.663265
1 0.205341 -1.642080 -0.732969 -0.080109 -0.082483 -0.208360
>>>
>>> dfMI=pd.DataFrame(randn(2, 6),columns=[['a','a','a','b','b','c'],['a1','a2','a3','b1','b2','c1']]);print dfMI
a b c
a1 a2 a3 b1 b2 c1
0 -0.578399 0.478925 1.047342 -0.087225 1.905074 0.146105
1 0.640575 0.153328 -1.117847 1.043026 0.671220 -0.218550
>>>
>>> # Run the labelSetter function on the data frames
>>> dfSWlab=labelSetter(dfS)
>>> dfMIWlab=labelSetter(dfMI)
>>>
>>> print dfSWlab['a2'].myLab
This is a2
>>> # This worked
>>>
>>> print dfSWlab['a2'].notMyLab
This should not persist
>>> # 'notMyLab' has not been appended to _metadata but the label still persists.
>>>
>>> dfSWlabCopy=dfSWlab.copy() # make a copy to see if myLab persists.
>>>
>>> dfSWlabCopy['a2'].myLab
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\site-packages\pandas\core\generic.py", line 1942, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'Series' object has no attribute 'myLab'
>>> # 'myLab' was appended to _metadata but still did not persist the copy
>>>
>>> print dfMIWlab['a']['a2'].myLab
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\site-packages\pandas\core\generic.py", line 1942, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'Series' object has no attribute 'myLab'
>>> # For the MultiIndex data frame the 'myLab' is not accessible
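As a side note on newer pandas: one option I would try instead of _metadata is the DataFrame.attrs dictionary (available in pandas 1.0+, still marked experimental), which in recent versions generally survives copy(). The plot_labels key below is my own convention, not a pandas feature; MultiIndex columns work naturally because the column keys are tuples:
import pandas as pd
from numpy.random import randn

dfMI = pd.DataFrame(randn(2, 4),
                    columns=[['a', 'a', 'b', 'c'], ['a1', 'a2', 'b1', 'c1']])
# Frame-level metadata: map each column (a tuple under MultiIndex) to a label
dfMI.attrs['plot_labels'] = {col: 'This is {}'.format(col) for col in dfMI.columns}

dfCopy = dfMI.copy()
print(dfCopy.attrs['plot_labels'][('a', 'a2')])  # This is ('a', 'a2')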

'numpy.ndarray' object is not callable error

Hi, I am getting the following error:
'numpy.ndarray' object is not callable
when performing the calculation in the following manner
rolling_means = pd.rolling_mean(prices, 20, min_periods=20)
rolling_std = pd.rolling_std(prices, 20)
#print rolling_means.head(20)
upper_band = rolling_means + (rolling_std) * 2
lower_band = rolling_means - (rolling_std) * 2
I am not sure how to resolve this; can someone point me in the right direction?
The error TypeError: 'numpy.ndarray' object is not callable means that you tried to call a NumPy array as a function. We can reproduce the error like so in the REPL:
In [16]: import numpy as np
In [17]: np.array([1,2,3])()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/home/user/<ipython-input-17-1abf8f3c8162> in <module>()
----> 1 np.array([1,2,3])()
TypeError: 'numpy.ndarray' object is not callable
If we are to assume that the error is indeed coming from the snippet of code that you posted (something that you should check), then you must have reassigned either pd.rolling_mean or pd.rolling_std to a numpy array earlier in your code.
What I mean is something like this:
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: pd.rolling_mean(np.array([1,2,3]), 20, min_periods=5) # Works
Out[3]: array([ nan, nan, nan])
In [4]: pd.rolling_mean = np.array([1,2,3])
In [5]: pd.rolling_mean(np.array([1,2,3]), 20, min_periods=5) # Doesn't work anymore...
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/home/user/<ipython-input-5-f528129299b9> in <module>()
----> 1 pd.rolling_mean(np.array([1,2,3]), 20, min_periods=5) # Doesn't work anymore...
TypeError: 'numpy.ndarray' object is not callable
So, basically you need to search the rest of your codebase for pd.rolling_mean = ... and/or pd.rolling_std = ... to see where you may have overwritten them.
Also, if you'd like, you can put in reload(pd) just before your snippet, which should make it run by restoring the value of pd to what you originally imported it as, but I still highly recommend that you try to find where you may have reassigned the given functions.
For everyone with this problem in 2021: sometimes it happens when you create a numpy variable with the same name as one of your functions. Instead of calling the function, Python tries to call the numpy array as a function and you get the error. Just rename the numpy variable.
I met the same problem and solved it: one of my function parameters and one of my variables had the same name. Try giving them different names.
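To make the last two answers concrete, a minimal reproduction of that kind of shadowing (all names made up):
import numpy as np

def scores(x):
    return x * 2

print(scores(3))               # 6 -- the function is still callable

scores = np.array([1, 2, 3])   # the array now shadows the function
scores(3)                      # TypeError: 'numpy.ndarray' object is not callable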