"ValueError: arrays must all be same length" - pandas

In a machine learning project, suppose I have 3 cat images and 2 dog images. when I make a dataframe for the training data.
#pre processing train data
filenames = os.listdir('/content/train')
categories = []
for filename in os.listdir('/content/train'):
category = filename.split('.')[0]
if category == 'dog' :
categories.append(1) #1 for dog and 0 for cat
else :
categories.append(0)
#make a dictonary
df = pd.DataFrame(
{
'filename' : filenames,
'category':categories
}
)
It gives an error because I haven't the same amount of dog, cat images.
ValueError Traceback (most recent call last)
<ipython-input-28-2d4e2440ba41> in <module>()
12 {
13 'filename' : filenames,
---> 14 'category' : categories
15 }
16 )
3 frames
/usr/local/lib/python3.6/dist-packages/pandas/core/internals/construction.py in extract_index(data)
395 lengths = list(set(raw_lengths))
396 if len(lengths) > 1:
--> 397 raise ValueError("arrays must all be same length")
398
399 if have_dicts:
ValueError: arrays must all be same length
Is there any way to fix it without adding any image to the training dataset?

Related

AttributeError: 'ArrayView' object has no attribute 'A1'

I have to import a processed h5ad file, but it seems that X has been passed as a numpy array instead of a numpy matrix. See below:
# Read the data
data_path = "/home/bbb5130/snOMICS/maria/msrna.h5ad"
adata = sn.pp.read_h5ad(data_path, pr_process="Yes")
adata
But the output was:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In [15], line 3
1 # Read the data
2 data_path = "/home/bbb5130/snOMICS/maria/msrna.h5ad"
----> 3 adata = sn.pp.read_h5ad(data_path, pr_process="Yes")
4 adata
File ~/miniconda3/envs/snOMICS/lib/python3.9/site-packages/scanet/preprocessing.py:54, in Preprocessing.read_h5ad(cls, filename, pr_process)
51 return sc.read_h5ad(filename)
52 else:
53 # initial preprocessing as it is required later
---> 54 return cls._intial(adata)
File ~/miniconda3/envs/snOMICS/lib/python3.9/site-packages/scanet/preprocessing.py:35, in Preprocessing._intial(adata)
33 adata.var['mt'] = adata.var_names.str.startswith('MT-')
34 mito_genes = adata.var_names.str.startswith('MT-')
---> 35 adata.obs['percent_mito'] = np.sum(adata[:, mito_genes].X, axis=1).A1 / np.sum(adata.X, axis=1).A1
36 sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, inplace=True)
37 sc.pp.filter_cells(adata, min_genes=0)
AttributeError: 'ArrayView' object has no attribute 'A1'
Is there anyway I can change the format, so the file can be read?
Thanks in advance.

TypeError: descriptor 'lower' for 'str' objects doesn't apply to a 'list' object

I wanna stemming my dataset. Before stemming, I did tokenize use nltk tokenize
You can see the output on the pic
Dataset
Col Values
But when i do stemming, it return error :
[Error][3]
TypeError Traceback (most recent call
last)
<ipython-input-102-7700a8e3235b> in <module>()
----> 1 df['Message'] = df['Message'].apply(stemmer.stem)
2 df = df[['Message', 'Category']]
3 df.head()
5 frames
/usr/local/lib/python3.7/dist-
packages/Sastrawi/Stemmer/Filter/TextNormalizer.py in
normalize_text(text)
2
3 def normalize_text(text):
----> 4 result = str.lower(text)
5 result = re.sub(r'[^a-z0-9 -]', ' ', result, flags =
re.IGNORECASE|re.MULTILINE)
6 result = re.sub(r'( +)', ' ', result, flags =
re.IGNORECASE|re.MULTILINE)
TypeError: descriptor 'lower' requires a 'str' object but received a
'list'
Hope all you guys can help me

Can pandas df have cell values of numpy array

I want to store Numpy arrays as values for cells in my Dataframe. Is there any way to do this?
Basically i have pixel data which is a (512,512) Numpy array that i want to save as the value for pixel_data column corresponding to its particular id in the ID column of my Dataframe. How can i do this?
Heres what i tried:
for f in train_files[:10]:
id_tmp = f.split('/')[4].split('.')[0]
first_dcm = pydicom.read_file(f)
img = first_dcm.pixel_array
window = get_windowing(first_dcm)
image = window_image(img, *window)
train.loc[train.Image == id_tmp, 'img_before_w'] = img
train.loc[train.Image == id_tmp, 'img_after_w'] = image
The error i got:
ValueError Traceback (most recent call last)
<ipython-input-47-32236f8c9ccc> in <module>
5 window = get_windowing(first_dcm)
6 image = window_image(img, *window)
----> 7 train.loc[train.Image == id_tmp, 'img_before_w'] = img
8 train.loc[train.Image == id_tmp, 'img_after_w'] = image
9
/opt/conda/lib/python3.6/site-packages/pandas/core/indexing.py in __setitem__(self, key, value)
203 key = com.apply_if_callable(key, self.obj)
204 indexer = self._get_setitem_indexer(key)
--> 205 self._setitem_with_indexer(indexer, value)
206
207 def _validate_key(self, key, axis: int):
/opt/conda/lib/python3.6/site-packages/pandas/core/indexing.py in _setitem_with_indexer(self, indexer, value)
525 if len(labels) != value.shape[1]:
526 raise ValueError(
--> 527 "Must have equal len keys and value "
528 "when setting with an ndarray"
529 )
ValueError: Must have equal len keys and value when setting with an ndarray
Taking sample dataframe as below:
train=pd.DataFrame({'Image':[1,2,3,2],'img_before_w':[np.nan, np.nan, np.nan,np.nan]})
print(train) gives
Image img_before_w
0 1 NaN
1 2 NaN
2 3 NaN
3 2 NaN
Now, for example, if you want to insert pixel data when train.Image == 2, then it can be achieved using below code:
mask = train.Image == 2 # contains True for desired rows
target_index=mask[mask==True].index # gives index of rows, wherever condition is met
train.loc[mask, 'img_before_w'] = pd.Series([[512,512]]*len(target_index), index=target_index) # inserts [512,512] array in rows wherever condition is met, in given column
Now, print(train) gives, desired output:
Image img_before_w
0 1 NaN
1 2 [512, 512]
2 3 NaN
3 2 [512, 512]

While removing html text from column, object of type 'float' has no len() error is occuring

I am using an amazon dataset to do sentiment analysis. Dataset content is
https://i.stack.imgur.com/qcKZp.png
dataset con be found on:
https://www.kaggle.com/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones
I am trying to remove html from Review column.
This is what I am doing. Note: dataset is assigned to df.
df_removedNoise = []
def removingHTML(text):
soup = BeautifulSoup(text, 'lxml').get_text()
return soup
def removingNoise(text):
html_removed = removingHTML(text)
return html_removed
for i in df["Reviews"]:
text = removingNoise(i)
df_removedNoise.append(text)
Even though Reviews column has object as a datatype, I am still getting an error like.
TypeError Traceback (most recent call last)
<ipython-input-83-3591f5d7a54f> in <module>
9
10 for i in df["Reviews"]:
---> 11 df_removedNoise.append(removingNoise(i))
<ipython-input-83-3591f5d7a54f> in removingNoise(text)
5
6 def removingNoise(text):
----> 7 html_removed = removingHTML(text)
8 return html_removed
9
<ipython-input-83-3591f5d7a54f> in removingHTML(text)
1 df_removedNoise = []
2 def removingHTML(text):
----> 3 soup = BeautifulSoup(text, 'lxml').get_text()
4 return soup
5
~/anaconda3/lib/python3.7/site-packages/bs4/__init__.py in __init__(self, markup, features, builder, parse_only, from_encoding, exclude_encodings, **kwargs)
244 if hasattr(markup, 'read'): # It's a file-type object.
245 markup = markup.read()
--> 246 elif len(markup) <= 256 and (
247 (isinstance(markup, bytes) and not b'<' in markup)
248 or (isinstance(markup, str) and not '<' in markup)
TypeError: object of type 'float' has no len()
Any help will be appreciated!
Check for NaN with df[df['Reviews'].isnull()], if you find any try to dropna first

Convert labels(int) into one-hot vectors for tensorflow

Kindly help me to resolve my error.Thanks
This is my python code:
shape of Y (199584, 1) and data type is int
num_labels = len(np.unique(Y))
simulated_labels = np.eye(num_labels)[Y] # One liner trick!
print simulated_labels
Error:
IndexError Traceback (most recent call last)
in ()
1 num_labels = len(np.unique(Y)) # unique labels 681
2 print num_labels
----> 3 simulated_labels = np.eye(num_labels)[Y] # One liner trick!
4 print simulated_labels
5
IndexError: index 1001 is out of bounds for axis 0 with size 681
You can use tf.one_hot (There are examples in the doc string)