How do I enable the REFS_OK flag in nditer in numpy in Python 3.3?

Does anyone know how one goes about enabling the REFS_OK flag in numpy? I cannot seem to find a clear explanation online.
My code is:
import sys
import string
import numpy as np
import pandas as pd
SNP_df = pd.read_csv('SNPs.txt',sep='\t',index_col = None ,header = None,nrows = 101)
output = open('100 SNPs.fa','a')
for i in SNP_df:
    data = SNP_df[i]
    data = np.array(data)
    for j in np.nditer(data):
        if j == 0:
            output.write(("\n>%s\n")%(str(data(j))))
        else:
            output.write(data(j))
I keep getting the error message: Iterator operand or requested dtype holds references, but the REFS_OK was not enabled.
I cannot work out how to enable the REFS_OK flag so the program can continue...

I have isolated the problem. There is no need to use np.nditer. The main problem was with me misinterpreting how Python would read iterator variables in a for loop. The corrected code is below.
import sys
import string
import fileinput
import numpy as np
import pandas as pd  # needed for pd.read_csv below
SNP_df = pd.read_csv('datafile.txt',sep='\t',index_col = None ,header = None,nrows = 5000)
output = open('outputFile.fa','a')
for i in range(1,51):
    data = SNP_df[i]
    data = np.array(data)
    # the first element of the column is written as the header line
    for j in range(0,1):
        output.write(("\n>%s\n")%(str(data[j])))
    # the remaining elements are written back to back
    for k in range(1,len(data)):
        output.write(str(data[k]))
output.close()
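The same idea can be sketched without NumPy at all. This is only an illustration under assumptions: the small DataFrame and the in-memory buffer are hypothetical stand-ins for the real SNP file and output file, and the record layout (header line, then the bases on one line) is a slight variation of the loop above.

```python
import io
import pandas as pd

# Hypothetical stand-in for the tab-separated SNP file: each column is one
# record, the first row is the name and the remaining rows are the bases.
SNP_df = pd.DataFrame({0: ["snp_a", "A", "C"], 1: ["snp_b", "G", "T"]})

output = io.StringIO()  # stands in for the real output file
for col in SNP_df.columns:
    data = SNP_df[col].astype(str).tolist()
    # header line from the first element, then the rest joined together
    output.write(">%s\n%s\n" % (data[0], "".join(data[1:])))

fasta_text = output.getvalue()
print(fasta_text)
```

Indexing with square brackets (data[0], data[1:]) is the key point; plain Python lists are enough once the column has been pulled out of the DataFrame.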

If you really want to enable the flag, I have a working example.
Python 2.7, numpy 1.14.2, pandas 0.22.0
import pandas as pd
import numpy as np
# get all data as a pandas DataFrame
data = pd.read_csv("./monthdata.csv")
print(data)
# get values as numpy array
data_ar = data.values # numpy.ndarray, every element is a row
for row in data_ar:
    print(row)
    sum = 0
    count = 0
    # the flag name is lowercase: "refs_ok"
    for month in np.nditer(row, flags=["refs_ok"], op_flags=["readwrite"]):
        print(month)
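For reference, here is a minimal self-contained sketch of the flag in action. The flag string NumPy accepts is the lowercase "refs_ok"; the sample array is a made-up stand-in for a pandas column of strings, which becomes an object-dtype array.

```python
import numpy as np

# Object-dtype array, similar to what a pandas column of strings becomes.
data = np.array(["rs123", "A", "C", "G"], dtype=object)

# Without the flag, constructing the iterator fails for object arrays.
try:
    np.nditer(data)
    raised = False
except TypeError:
    raised = True

# With flags=["refs_ok"], each item is a zero-dimensional array;
# .item() unwraps it to the underlying Python object.
values = [x.item() for x in np.nditer(data, flags=["refs_ok"])]
print(raised, values)
```

Note that nditer hands back zero-dimensional arrays rather than positions, so the loop variable cannot be used as an index into the original array.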

Related

Problem with merging multiple Excel files from Python

My dtype changes after I uncomment the foo lambda and the groupby, and I get: # we require a list, but not a 'str'.
What I want: when the value in the first column (Date, in my case) is the same across rows, the text from the third column should be appended after a ',' sign in my final output.
import os
import pandas as pd
import dateutil
from pandas import DataFrame
from datetime import datetime, timedelta
data_file_folder = r'.\Data'  # raw string so the backslash is not read as an escape
df = []
for file in os.listdir(data_file_folder):
    if file.endswith('.xlsx'):
        print('Loading File {0}...'.format(file))
        df.append(pd.read_excel(os.path.join(data_file_folder,file),sheet_name='Sheet1'))
df_master = pd.concat(df,axis=0)
df_master['Date'] = df_master['Date'].dt.date
#foo = lambda a: ", ".join(a)
#df_master = df_master.groupby(by='Date').agg({'Tweet': foo}).reset_index()
#df_master.to_excel(r'.\NewFolder\example.xlsx',index=False)
#df_master
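A minimal sketch of the intended result, on made-up data. The DataFrame below is a hypothetical stand-in for the concatenated sheets (the "User" column and the sample values are invented), and converting each value with str() before joining is a defensive assumption, not a confirmed diagnosis of the error above.

```python
import pandas as pd

# Hypothetical stand-in for the concatenated Excel sheets.
df_master = pd.DataFrame({
    "Date": pd.to_datetime(["2021-01-01", "2021-01-01", "2021-01-02"]),
    "User": ["a", "b", "c"],
    "Tweet": ["first", "second", "third"],
})
df_master["Date"] = df_master["Date"].dt.date

def join_text(values):
    # str() guards against non-string cells such as numbers
    return ", ".join(str(v) for v in values)

# One row per date, with all tweets of that date joined by ", ".
merged = df_master.groupby(by="Date").agg({"Tweet": join_text}).reset_index()
print(merged)
```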

Using sklearn with NumPy and images gives the error 'setting an array element with a sequence'

I am trying to create a simple image classification tool.
I would like the code below to work with classifying images. It works fine when it is a non image NumPy array.
#https://e2eml.school/images_to_numbers.html
import numpy as np
from sklearn.utils import Bunch
from PIL import Image
monkey = [1]
dog = [2]
example_animals = Bunch(data = np.array([monkey,dog]),target = np.array(['monkey','dog']))
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2) #with KMeans you get to pre specify the number of Clusters
KModel = kmeans.fit(example_animals.data) #fit a model using the training data , in this case original example animal data passed through
import pandas as pd
crosstab = pd.crosstab(example_animals.target,KModel.labels_)
print(crosstab)
I have looked into how to make an image into a NumPy array at https://e2eml.school/images_to_numbers.html
The code below, where I have converted the images to NumPy arrays, doesn't work.
When run, it raises the following error:
'setting an array element with a sequence'
#https://e2eml.school/images_to_numbers.html
import numpy as np
from sklearn.utils import Bunch
from PIL import Image
monkey = np.asarray(Image.open("monkey.jpg"))
dog = np.asarray(Image.open("dog.jpeg"))
example_animals = Bunch(data = np.array([monkey,dog]),target = np.array(['monkey','dog']))
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2) #with KMeans you get to pre specify the number of Clusters
KModel = kmeans.fit(example_animals.data) #fit a model using the training data , in this case original example animal data passed through
import pandas as pd
crosstab = pd.crosstab(example_animals.target,KModel.labels_)
print(crosstab)
I would appreciate any insight into how to fix the error 'setting an array element with a sequence' so that the images will be compatible with the sklearn processing.
You need to be sure that your images "monkey.jpg" and "dog.jpeg" have the same number of pixels; otherwise, you will have to resize them to the same size. Moreover, the data of your Bunch object needs to be of shape (n_samples, n_features) (see the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans.fit)
Also be aware that you are using an unsupervised learning model (KMeans), so the output of the model is a cluster label, not directly "monkey" or "dog".
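The shape requirement can be illustrated with plain NumPy. This is a sketch under assumptions: the random arrays below are hypothetical stand-ins for two decoded RGB images that have already been resized to the same height and width.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for two RGB images of identical size
# (height 220, width 180, 3 colour channels).
monkey = rng.integers(0, 256, size=(220, 180, 3), dtype=np.uint8)
dog = rng.integers(0, 256, size=(220, 180, 3), dtype=np.uint8)

# Flatten each image to one row, then stack the rows:
# the result has shape (n_samples, n_features).
data = np.stack([monkey.reshape(-1), dog.reshape(-1)])
print(data.shape)
```

If the two images had different sizes, the flattened rows would have different lengths and could not be stacked into a single 2-D array.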
I found the solution to the error 'setting an array element with a sequence'.
KMeans requires the data arrays being compared to be the same size.
This means that when importing pictures, each picture needs to be resized, converted into a NumPy array (a format that is compatible with KMeans), and finally flattened into a 1-dimensional array.
#https://e2eml.school/images_to_numbers.html
#https://machinelearningmastery.com/how-to-load-and-manipulate-images-for-deep-learning-in-python-with-pil-pillow/
import numpy as np
from matplotlib import pyplot as plt
from sklearn.utils import Bunch
from PIL import Image
from sklearn.cluster import KMeans
import pandas as pd
monkey = Image.open("monkey.jpg")
dog = Image.open("dog.jpeg")
#resize pictures
monkey1 = monkey.resize((180,220))
dog1 = dog.resize((180,220))
#make pictures into numpy array
monkey2 = np.asarray(monkey1)
dog2 = np.asarray(dog1)
#https://www.quora.com/How-do-I-convert-image-data-from-2D-array-to-1D-using-python
#make numpy array into 1 dimensional array
monkey3 = monkey2.reshape(-1)
dog3 = dog2.reshape(-1)
example_animals = Bunch(data = np.array([monkey3,dog3]),target = np.array(['monkey','dog']))
kmeans = KMeans(n_clusters=2) #with KMeans you get to pre specify the number of Clusters
KModel = kmeans.fit(example_animals.data) #fit a model using the training data , in this case original example food data passed through
crosstab = pd.crosstab(example_animals.target,KModel.labels_)
print(crosstab)

Convert R object(Dataframe) to Pandas Dataframe using rpy2

I am using rpy2 to get the comorbidity index of patients. I got the results, but I am not able to convert the output to a pandas DataFrame.
Below is the code:
# creating the DataFrame
import pandas as pd
data = {"person_id":[1,1,1,2,2,3],
        "dx_1":["F11","E40","","F32","C77","G10"],
        "dx_2":["F1P","E400","","F322","C737",""]}
df1 = pd.DataFrame(data)
# converting the pandas DataFrame to an R data frame using rpy2
import rpy2
from rpy2.robjects import pandas2ri
import rpy2.robjects.numpy2ri
from rpy2.robjects.packages import importr
r_dataframe = pandas2ri.py2ri(df1)
print(r_dataframe)
# loading the R 'comorbidity' package through rpy2
R = rpy2.robjects.r
DTW = importr('comorbidity')
# executing the comorbidity function using one column, icd_1
output = DTW.comorbidity(x = r_dataframe, id = "person_id", code = "icd_1",
                         score = "charlson", assign0 = False,
                         icd = "icd10")
print(output)
But I am not able to convert the output to a pandas DataFrame:
import rpy2, rpy2.robjects as robjects, rpy2.robjects.packages as rpackages
from rpy2.robjects.vectors import StrVector
#Converting data frames back and forth between rpy2 and pandas
from rpy2.robjects import r, pandas2ri
#convert output to pandas dataframe
pandas2ri.ri2py_dataframe(output)
I get the error below:
TypeError: Parameter 'categories' must be list-like, was
please help
Thanks in advance

Cannot store an array using dask

I am using the following code to create an array and store the results sequentially in HDF5 format. I was checking the dask documentation, which suggested using dask.store to store arrays generated in a function like mine. However, I receive an error: dask has no attribute store
My code:
import os
import numpy as np
import time
import concurrent.futures
import multiprocessing
from itertools import product
import h5py
import dask as da
def mean_py(array):
    start_time = time.time()
    x = array.shape[1]
    y = array.shape[2]
    values = np.empty((x,y), type(array[0][0][0]))
    for i in range(x):
        for j in range(y):
            values[i][j] = ((np.mean(array[:,i,j])))
    end_time = time.time()
    hours, rem = divmod(end_time-start_time, 3600)
    minutes, seconds = divmod(rem,60)
    print("{:0>2}:{:0>2}:{:05.2f}".format(int(hours), int(minutes), int(seconds)))
    print(f"{'.'*80}")
    return values
def generate_random_array():
    a = np.random.randn(120560400).reshape(10980,10980)
    return a
def generate_array(nums):
    for num in range(nums):
        a = generate_random_array()
        f = h5py.File('test_db.hdf5')
        d = f.require_dataset('/data', shape=a.shape, dtype=a.dtype)
        da.store(a, d)
start = time.time()
generate_array(8)
end = time.time()
print(f'\nTime complete: {end-start:.2f}s\n')
Should I use dask for such a task, or do you recommend storing the results using h5py directly?
Please Ignore the mean_py(array) function. It's for something I want to try out once the data has been produced.
As suggested in the comments, you're currently doing this
import dask as da
When you probably meant to do this
import dask.array as da
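A small sketch of how the call looks once the import is fixed. Two assumptions to flag: the NumPy array target below is only a stand-in for the h5py dataset (anything that supports slice assignment should work as a target), and, as far as I know, da.store expects dask arrays as sources, so the NumPy array is wrapped with da.from_array first.

```python
import numpy as np
import dask.array as da

# Small stand-in for the large random array in the question.
a = np.arange(16, dtype=float).reshape(4, 4)

# Stand-in for the h5py dataset: any target supporting slice assignment.
target = np.empty_like(a)

# Wrap the NumPy array as a dask array, then write it chunk by chunk.
x = da.from_array(a, chunks=(2, 2))
da.store(x, target)

print(np.array_equal(target, a))
```

For an array that already fits in memory as a NumPy array, writing it with h5py directly is also a reasonable choice; dask mainly helps when the data is produced or processed in chunks.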

PySpark - numpy | pandas | matplotlib

I have already done some research to find the answer to my question, but I didn't find anything...
How can I use these libraries in PySpark:
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('agg')
import matplotlib.pyplot as plt
Does anyone have an article that illustrates how to use them?
I'm trying to save a dataframe to make some graphs like this (I found this code here):
data1 = pd.sc.textFile("TEXT").flatMap { line => line.split("\n") }.distinct()
freqMap = {}
for line in data1:
    for item in line:
        if not item in freqMap:
            freqMap[item] = {}
        for other_item in line:
            if not other_item in freqMap:
                freqMap[other_item] = {}
            freqMap[item][other_item] = freqMap[item].get(other_item, 0) + 1
            freqMap[other_item][item] = freqMap[other_item].get(item, 0) + 1
df = data1[freqMap].fillna(0)
print(df)
plt.pcolormesh(df, edgecolors='black')
plt.yticks(np.arange(0.5, len(df.index), 1), df.index)
plt.xticks(np.arange(0.5, len(df.columns), 1), df.columns)
plt.savefig('plot.png')
Thanks!
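Not a full answer, but the table-building part can be sketched in plain pandas. Assumptions: the lines have already been brought to the driver as ordinary Python lists (for example with collect() on the RDD), the sample data is made up, and the counting is a simplified variant that counts each ordered pair once per line rather than reproducing the loop above exactly.

```python
from collections import defaultdict
import pandas as pd

# Hypothetical stand-in for lines collected from Spark, already split into items.
lines = [["a", "b"], ["a", "c"]]

# Count how often each pair of items appears on the same line.
freq = defaultdict(lambda: defaultdict(int))
for line in lines:
    for item in line:
        for other_item in line:
            freq[item][other_item] += 1

# Dict of dicts -> DataFrame; pairs that never co-occur become 0.
df = pd.DataFrame({k: dict(v) for k, v in freq.items()}).fillna(0)
print(df)
```

From there, df can be passed to plt.pcolormesh as in the plotting lines of the question; NumPy, pandas, and matplotlib run as normal Python libraries on the driver side.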