SentenceTransformers throwing KeyError on pandas Series

I'm using the following simplified code:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embeddings = model.encode(sentences)
where sentences is a pandas Series containing the sentences I want to transform. I then get the following traceback:
embeddings = model.encode(sentences)
File "/anaconda/envs/topics/lib/python3.8/site-packages/sentence_transformers/SentenceTransformer.py", line 157, in encode
sentences_sorted = [sentences[idx] for idx in length_sorted_idx]
File "/anaconda/envs/topics/lib/python3.8/site-packages/sentence_transformers/SentenceTransformer.py", line 157, in <listcomp>
sentences_sorted = [sentences[idx] for idx in length_sorted_idx]
File "/anaconda/envs/topics/lib/python3.8/site-packages/pandas/core/series.py", line 942, in
__getitem__
return self._get_value(key)
File "/anaconda/envs/topics/lib/python3.8/site-packages/pandas/core/series.py", line 1051, in
_get_value
loc = self.index.get_loc(label)
File "/anaconda/envs/topics/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3363, in get_loc
raise KeyError(key) from err
KeyError: 144

The actual solution was to convert the pandas Series to a numpy array:
sentences_array = sentences.to_numpy()
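A minimal end-to-end sketch of that fix (the example sentences and index values are made up for illustration):

import pandas as pd
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
# A Series whose index is not a plain 0..n-1 range, e.g. after filtering a DataFrame
sentences = pd.Series(['first sentence', 'second sentence'], index=[10, 144])

# encode() indexes its input positionally (sentences[idx]), so a Series with a
# non-default index raises KeyError; pass a plain NumPy array or list instead.
embeddings = model.encode(sentences.to_numpy())
print(embeddings.shape)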

The embeddings are returned as an array or a tensor, so you can also request the output format explicitly:
embeddings = model.encode(sentences, convert_to_tensor=True)
or
embeddings = model.encode(sentences, convert_to_numpy=True)

Related

Error while converting pandas dataframe to polars dataframe (pyarrow.lib.ArrowTypeError: Expected bytes, got a 'int' object)

I am converting a pandas DataFrame to a Polars DataFrame, but pyarrow throws an error.
My code:
import polars as pl
import pandas as pd

if __name__ == "__main__":
    with open(r"test.xlsx", "rb") as f:
        excelfile = f.read()
    excelfile = pd.ExcelFile(excelfile)
    sheetnames = excelfile.sheet_names
    df = pd.concat(
        [
            pd.read_excel(excelfile, sheet_name=x, header=0)
            for x in sheetnames
        ], axis=0)
    df_pl = pl.from_pandas(df)
Error:
File "pyarrow\array.pxi", line 312, in pyarrow.lib.array
File "pyarrow\array.pxi", line 83, in pyarrow.lib._ndarray_to_array
File "pyarrow\error.pxi", line 122, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Expected bytes, got a 'int' object
I tried changing the pandas DataFrame dtype to str and the problem was solved, but I don't want to change dtypes. Is this a bug in pyarrow, or am I missing something?
Edit: Polars 0.13.42 and later
Polars now has a read_excel function that will correctly handle this situation. read_excel is now the preferred way to read Excel files into Polars.
Note: to use read_excel, you will need to install xlsx2csv (which can be installed with pip).
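For illustration, a minimal sketch of the read_excel route (the file name is the same placeholder as in the question, and I'm assuming sheet_id=0 still means "all sheets"):

import polars as pl

# Read a single sheet (sheet_id=1 is the first sheet) directly into a Polars DataFrame.
df_pl = pl.read_excel("test.xlsx", sheet_id=1)

# sheet_id=0 returns a dict mapping sheet names to DataFrames, which can be
# concatenated if the sheets share a schema.
sheets = pl.read_excel("test.xlsx", sheet_id=0)
df_all = pl.concat(list(sheets.values()))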
Polars: prior to 0.13.42
I can replicate this result. It is due to a column in the original Excel file that contains both text and numbers.
For example, create a new Excel file with one column in which you type both numbers and text, save it, and run your code on that file. I get the following traceback:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/xxx/.virtualenvs/StackOverflow3.10/lib/python3.10/site-packages/polars/convert.py", line 299, in from_pandas
return DataFrame._from_pandas(df, rechunk=rechunk, nan_to_none=nan_to_none)
File "/home/xxx/.virtualenvs/StackOverflow3.10/lib/python3.10/site-packages/polars/internals/frame.py", line 454, in _from_pandas
pandas_to_pydf(
File "/home/xxx/.virtualenvs/StackOverflow3.10/lib/python3.10/site-packages/polars/internals/construction.py", line 485, in pandas_to_pydf
arrow_dict = {
File "/home/xxx/.virtualenvs/StackOverflow3.10/lib/python3.10/site-packages/polars/internals/construction.py", line 486, in <dictcomp>
str(col): _pandas_series_to_arrow(
File "/home/xxx/.virtualenvs/StackOverflow3.10/lib/python3.10/site-packages/polars/internals/construction.py", line 237, in _pandas_series_to_arrow
return pa.array(values, pa.large_utf8(), from_pandas=nan_to_none)
File "pyarrow/array.pxi", line 312, in pyarrow.lib.array
File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
File "pyarrow/error.pxi", line 122, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Expected bytes, got a 'int' object
There are several lengthy discussions on this issue, such as these:
to_parquet can't handle mixed type columns #21228
pyarrow.lib.ArrowTypeError: "Expected a string or bytes object, got a 'int' object" #349
This particular comment might be relevant, as you are concatenating the results of parsing multiple sheets in an Excel file. This may lead to conflicting dtypes for a column:
https://github.com/pandas-dev/pandas/issues/21228#issuecomment-419175116
How to approach this depends on your data and its use, so I can't recommend a blanket solution (i.e., fixing your source Excel file, or changing the dtype to str).
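For reference, a minimal sketch of the change-the-dtype workaround mentioned above (the DataFrame and column name are made up for illustration):

import pandas as pd
import polars as pl

df = pd.DataFrame({"mixed_col": [1, "two", 3]})   # ints and strings in one column
# Casting the offending column to str gives pyarrow a single, consistent type.
df["mixed_col"] = df["mixed_col"].astype(str)
df_pl = pl.from_pandas(df)
print(df_pl)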
My problem was solved by saving the pandas DataFrame in CSV format and then importing the CSV file into Polars.
import os
import polars as pl
import pandas as pd

if __name__ == "__main__":
    with open(r"test.xlsx", "rb") as f:
        excelfile = f.read()
    excelfile = pd.ExcelFile(excelfile)
    sheetnames = excelfile.sheet_names
    df = pd.concat([pd.read_excel(excelfile, sheet_name=x, header=0)
                    for x in sheetnames
                    ], axis=0)
    df.to_csv("temp.csv", index=False)
    # read_csv (eager) rather than scan_csv (lazy), so the temporary
    # file can be deleted immediately after loading.
    df_pl = pl.read_csv("temp.csv")
    os.remove("temp.csv")

Can anyone tell me the solution to this ValueError in Matplotlib?

When I plot a random scatter plot, an error occurs. Can someone solve it?
The code is
# Date 27-06-2021
import time
import matplotlib.pyplot as plt
import numpy as np
import random as rd

def rd_color():
    random_number = rd.randint(0, 16777215)
    hex_number = str(hex(random_number))
    hex_number = '#' + hex_number[2:]
    return hex_number

arr1_for_x = np.linspace(10, 99, 1000)
arr1_for_y = np.random.uniform(10, 99, 1000)
# print(rd_color())
for i in range(1000):
    plt.scatter(arr1_for_x[i:i+1], arr1_for_y[i:i+1], s=5,
                linewidths=0, color=rd_color())
plt.show()
and the ValueError is
ValueError: 'color' kwarg must be a color or sequence of color specs. For a sequence of values to be color-mapped, use the 'c' argument instead.
Traceback (most recent call last):
File "C:\Users\amanr\AppData\Local\Programs\Python\Python39\lib\site-packages\matplotlib\axes\_axes.py", line 4289, in _parse_scatter_color_args
mcolors.to_rgba_array(kwcolor)
File "C:\Users\amanr\AppData\Local\Programs\Python\Python39\lib\site-packages\matplotlib\colors.py", line 367, in to_rgba_array
raise ValueError("Using a string of single character colors as "
ValueError: Using a string of single character colors as a color sequence is not supported. The colors can be passed as an explicit list instead.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "c:\Google Drive\tp\Programming\Python\tp2.py", line 24, in <module>
plt.scatter(arr1_for_x[i:i+1], arr1_for_y[i:i+1], s=5,
File "C:\Users\amanr\AppData\Local\Programs\Python\Python39\lib\site-packages\matplotlib\pyplot.py", line 3068, in scatter
__ret = gca().scatter(
File "C:\Users\amanr\AppData\Local\Programs\Python\Python39\lib\site-packages\matplotlib\__init__.py", line 1361, in inner
return func(ax, *map(sanitize_sequence, args), **kwargs)
File "C:\Users\amanr\AppData\Local\Programs\Python\Python39\lib\site-packages\matplotlib\axes\_axes.py", line 4516, in scatter
self._parse_scatter_color_args(
File "C:\Users\amanr\AppData\Local\Programs\Python\Python39\lib\site-packages\matplotlib\axes\_axes.py", line 4291, in _parse_scatter_color_args
raise ValueError(
ValueError: 'color' kwarg must be an color or sequence of color specs. For a sequence of values to be color-mapped, use the 'c' argument instead.
After this error I also tried c in place of color:
# Date 27-06-2021
import time
import matplotlib.pyplot as plt
import numpy as np
import random as rd

def rd_color():
    random_number = rd.randint(0, 16777215)
    hex_number = str(hex(random_number))
    hex_number = '#' + hex_number[2:]
    return str(hex_number)

arr1_for_x = np.linspace(10, 99, 1000)
arr1_for_y = np.random.uniform(10, 99, 1000)
# print(rd_color())
for i in range(1000):
    plt.scatter(arr1_for_x[i:i+1], arr1_for_y[i:i+1], s=5,
                linewidths=0, c=rd_color())
plt.show()
but again an error occurred; this time it is:
Traceback (most recent call last):
File "C:\Users\amanr\AppData\Local\Programs\Python\Python39\lib\site-packages\matplotlib\axes\_axes.py", line 4350, in _parse_scatter_color_args
colors = mcolors.to_rgba_array(c)
File "C:\Users\amanr\AppData\Local\Programs\Python\Python39\lib\site-packages\matplotlib\colors.py", line 367, in to_rgba_array
raise ValueError("Using a string of single character colors as "
ValueError: Using a string of single character colors as a color sequence is not supported. The colors can be passed as an explicit list instead.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "c:\Google Drive\tp\Programming\Python\tp2.py", line 22, in <module>
plt.scatter(arr1_for_x[i:i+1], arr1_for_y[i:i+1], s=5,
File "C:\Users\amanr\AppData\Local\Programs\Python\Python39\lib\site-packages\matplotlib\pyplot.py", line 3068, in scatter
__ret = gca().scatter(
File "C:\Users\amanr\AppData\Local\Programs\Python\Python39\lib\site-packages\matplotlib\__init__.py", line 1361, in inner
return func(ax, *map(sanitize_sequence, args), **kwargs)
File "C:\Users\amanr\AppData\Local\Programs\Python\Python39\lib\site-packages\matplotlib\axes\_axes.py", line 4516, in scatter
self._parse_scatter_color_args(
File "C:\Users\amanr\AppData\Local\Programs\Python\Python39\lib\site-packages\matplotlib\axes\_axes.py", line 4359, in _parse_scatter_color_args
raise ValueError(
ValueError: 'c' argument must be a color, a sequence of colors, or a sequence of numbers, not #814aa
Can anyone please spell out what is going wrong?
The way I read Matplotlib's Tutorial on Specifying Colours, the hex literals need to have exactly 6 or 3 "hex digits".
This will work:
for i in range(1000):
    myColor = rd_color()
    plt.scatter(arr1_for_x[i:i+1], arr1_for_y[i:i+1], s=5,
                linewidths=0, color=[myColor])
plt.show()
Define a variable for the color and pass it as a list to the color parameter; wrapping the single color string in a list keeps Matplotlib from interpreting it as a sequence of one-character color codes.
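Note that this fix still relies on rd_color() returning a full six-digit hex string; a randomly generated value such as '#814aa' (five digits) remains invalid even inside a list. A minimal sketch of a zero-padded variant (my addition, not part of the original answer):

import random as rd

def rd_color():
    # Zero-pad to six hex digits so small random numbers still form a valid '#rrggbb' color.
    return '#{:06x}'.format(rd.randint(0, 0xFFFFFF))

With this helper, color=rd_color() (or c=rd_color()) no longer produces short strings like the '#814aa' that triggered the ValueError.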

How to avoid set_index on a pre-sorted DataFrame constructed with from_delayed?

I am trying to get the expression df.resample('1T', how='mean').sum() to work in Dask, but I'm running into an issue where it seems Dask needs me to explicitly call set_index on the DataFrame before resampling. I get the error below...
>>> c.gather(df).compute()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/site-packages/distributed/client.py", line 1508, in gather
asynchronous=asynchronous)
File "/usr/local/lib/python2.7/site-packages/distributed/client.py", line 615, in sync
return sync(self.loop, func, *args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/distributed/utils.py", line 253, in sync
six.reraise(*error[0])
File "/usr/local/lib/python2.7/site-packages/distributed/utils.py", line 238, in f
result[0] = yield make_coro()
File "/usr/local/lib64/python2.7/site-packages/tornado/gen.py", line 1055, in run
value = future.result()
File "/usr/local/lib64/python2.7/site-packages/tornado/concurrent.py", line 238, in result
raise_exc_info(self._exc_info)
File "/usr/local/lib64/python2.7/site-packages/tornado/gen.py", line 1063, in run
yielded = self.gen.throw(*exc_info)
File "/usr/local/lib/python2.7/site-packages/distributed/client.py", line 1385, in _gather
traceback)
File "/usr/local/lib/python2.7/site-packages/dask/dataframe/core.py", line 1633, in resample
return _resample(self, rule, how=how, closed=closed, label=label)
File "/usr/local/lib/python2.7/site-packages/dask/dataframe/tseries/resample.py", line 33, in _resample
return getattr(resampler, how)()
File "/usr/local/lib/python2.7/site-packages/dask/dataframe/tseries/resample.py", line 151, in mean
return self._agg('mean')
File "/usr/local/lib/python2.7/site-packages/dask/dataframe/tseries/resample.py", line 126, in _agg
meta_r = self.obj._meta_nonempty.resample(self._rule, **self._kwargs)
File "/usr/local/lib64/python2.7/site-packages/pandas/core/generic.py", line 7104, in resample
base=base, key=on, level=level)
File "/usr/local/lib64/python2.7/site-packages/pandas/core/resample.py", line 1148, in resample
return tg._get_resampler(obj, kind=kind)
File "/usr/local/lib64/python2.7/site-packages/pandas/core/resample.py", line 1276, in _get_resampler
"but got an instance of %r" % type(ax).__name__)
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'Index'
Below is the Python code I am using. Since the pandas DataFrames returned by my delayed objects were already timestamp-indexed, my expectation was that Dask would infer/construct an index from those DataFrames' timestamp indices instead of me having to set one explicitly. I am also unsure how an explicit set_index would be called in this case (what arguments should be passed?). Setting a pd.DatetimeIndex on the meta DataFrame (the commented line below) works. Is constructing the index by hand and feeding it to meta the only realistic way to do this? Am I missing something?
#! /usr/bin/env python
# Start dask scheduler and workers
# dask-scheduler &
# dask-worker --nthreads 1 --nprocs 6 --memory-limit 3GB localhost:8786 --local-directory /dev/shm &
from dask.distributed import Client
from dask.delayed import delayed
import pandas as pd
import numpy as np
import dask.dataframe as dd
import time

c = Client('127.0.0.1:8786')

def load(epoch):
    # 1525132800 - 1/5
    # 1527811200 - 1/6
    num_ts = 100
    idx = []
    for ts in range(0, 86400, 15):
        idx.append(epoch + ts)
    d = np.random.rand(86400/15, num_ts)
    ts = []
    for i in range(0, num_ts):
        # tsname = "ts_%s_%s" % (i, epoch)
        tsname = "ts_%s" % (i)
        ts.append(tsname)
        gts.append(tsname)
    res = pd.DataFrame(index=idx, data=d, columns=ts, dtype=np.float64)
    res.index = pd.to_datetime(arg=res.index, unit='s')
    return res

gts = []
load(1525132800)
print time.time()
i = pd.DatetimeIndex(start=1525132800, freq='15S', end=1527811185, dtype='datetime64[s]')
# meta = pd.DataFrame(index=i, data=[], columns=gts, dtype=np.float64)
meta = pd.DataFrame(index=[], data=[], columns=gts, dtype=np.float64)
dfs = [delayed(load)(fn) for fn in range(1525132800, 1527811200, 86400)]
print time.time()
df = dd.from_delayed(dfs, meta, 'sorted')
print time.time()
df.npartitions
df.divisions
print time.time()
df = c.submit(dd.DataFrame.resample, df, rule='1T', how='mean')
print time.time()
#df = c.submit(dd.DataFrame.sum, df, axis=1)
print time.time()
c.gather(df).compute()
print time.time()
#c.gather(df).visualize(filename='/usr/share/nginx/html/svg/df4.svg')
Dask uses the meta of a DataFrame to infer the data types before computing any of the chunks of data. In your case, your chunks contain datetime indexes, but the meta doesn't. The meta should be a zero-length version of the data:
meta = pd.DataFrame(index=i[:0], data=[], columns=gts, dtype=np.float64)
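For comparison, a minimal self-contained sketch of the same idea with more recent pandas/Dask APIs (the column name, dates, and partition sizes are illustrative only):

import numpy as np
import pandas as pd
import dask.dataframe as dd
from dask import delayed

def load(day_start):
    # Each chunk already carries its own DatetimeIndex, as in the question.
    idx = pd.date_range(day_start, periods=4, freq='15s')
    return pd.DataFrame({'ts_0': np.random.rand(len(idx))}, index=idx)

# Zero-length meta that still has a DatetimeIndex, so Dask infers the index type correctly.
meta = pd.DataFrame({'ts_0': pd.Series(dtype=np.float64)}, index=pd.DatetimeIndex([]))

starts = pd.date_range('2018-05-01', periods=3, freq='D')
parts = [delayed(load)(d) for d in starts]
# Explicit divisions (partition boundaries) let resample run without a set_index call.
divisions = list(starts) + [starts[-1] + pd.Timedelta('45s')]
df = dd.from_delayed(parts, meta=meta, divisions=divisions)

print(df.resample('1min').mean().compute())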

while_loop in TensorFlow returns a TypeError

I am confused why the following code returns this error message:
Traceback (most recent call last):
File "/Users/Desktop/TestPython/tftest.py", line 46, in <module>
main(sys.argv[1:])
File "/Users/Desktop/TestPython/tftest.py", line 35, in main
result = tf.while_loop(Cond_f2, Body_f1, loop_vars=loopvars)
File "/Users/Desktop/HPC_LIB/TENSORFLOW/lib/python2.7/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2518, in while_loop
result = context.BuildLoop(cond, body, loop_vars, shape_invariants)
File "/Users/Desktop/HPC_LIB/TENSORFLOW/lib/python2.7/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2356, in BuildLoop
pred, body, original_loop_vars, loop_vars, shape_invariants)
File "/Users/Desktop/HPC_LIB/TENSORFLOW/lib/python2.7/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2292, in _BuildLoop
c = ops.convert_to_tensor(pred(*packed_vars))
File "/Users/Desktop/TestPython/tftest.py", line 18, in Cond_f2
boln = tf.less(tf.cast(tf.constant(ind), dtype=tf.int32), tf.cast(tf.constant(N), dtype=tf.int32))
File "/Users/Desktop/HPC_LIB/TENSORFLOW/lib/python2.7/site-packages/tensorflow/python/framework/constant_op.py", line 163, in constant
tensor_util.make_tensor_proto(value, dtype=dtype, shape=shape))
File "/Users/Desktop/HPC_LIB/TENSORFLOW/lib/python2.7/site-packages/tensorflow/python/framework/tensor_util.py", line 353, in make_tensor_proto
_AssertCompatible(values, dtype)
File "/Users/Desktop/HPC_LIB/TENSORFLOW/lib/python2.7/site-packages/tensorflow/python/framework/tensor_util.py", line 287, in _AssertCompatible
raise TypeError("List of Tensors when single Tensor expected")
TypeError: List of Tensors when single Tensor expected
I would appreciate it if someone could help me fix this error. Thanks!
from math import *
import numpy as np
import sys
import tensorflow as tf

def Body_f1(n, ind, N, T):
    # Compute trace
    a = tf.trace(tf.random_normal(0.0, 1.0, (n, n)))
    # Update trace
    a = tf.cast(a, dtype=T.dtype)
    T = tf.scatter_update(T, ind, a)
    # Update index
    ind = ind + 1
    return n, ind, N, T

def Cond_f2(n, ind, N, T):
    boln = tf.less(tf.cast(tf.constant(ind), dtype=tf.int32), tf.cast(tf.constant(N), dtype=tf.int32))
    return boln

def main(argv):
    # Open tensorflow session
    sess = tf.Session()
    # Parameters
    N = 10
    T = tf.zeros((N), dtype=tf.float64)
    n = 4
    ind = 0
    # While loop
    loopvars = [n, ind, N, T]
    result = tf.while_loop(Cond_f2, Body_f1, loop_vars=loopvars, shape_invariants=None,
                           parallel_iterations=1, back_prop=False, swap_memory=False, name=None)
    trace = result[3]
    trace = sess.run(trace)
    print trace
    print 'Done!'
    # Close tensorflow session
    if session == None:
        sess.close()

if __name__ == "__main__":
    main(sys.argv[1:])
Update: I have added the full error message. I am not sure why I get this error message. Does loop_vars expect a single tensor and not a list of tensors? I hope not.
tf.constant expects a non-Tensor value, like a Python list or a numpy array. You can get the same error by nesting tf.constant calls, as in tf.constant(tf.constant(5.)). Removing those calls fixes that first error. It's a very poor error message, so I would encourage you to file a bug on GitHub.
It also looks like the arguments to random_normal are a bit mixed up; keyword arguments are good for avoiding issues like that:
tf.random_normal(mean=0.0, stddev=1.0, shape=(n, n))
Finally scatter_update expects a variable. It looks like a TensorArray may be what you're looking for here (or one of the higher level looping constructs which use a TensorArray implicitly).
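A minimal sketch of how those suggestions fit together under the TF 1.x API (keeping the poster's function names; the TensorArray in place of scatter_update is my substitution, following the suggestion above):

import tensorflow as tf

def Cond_f2(n, ind, N, T):
    # ind and N arrive as tensors inside the loop; no tf.constant wrapping is needed.
    return tf.less(ind, N)

def Body_f1(n, ind, N, T):
    # Keyword arguments avoid mixing up random_normal's parameters.
    a = tf.trace(tf.random_normal(shape=tf.stack([n, n]), mean=0.0, stddev=1.0))
    T = T.write(ind, tf.cast(a, tf.float64))  # TensorArray write instead of scatter_update
    return n, ind + 1, N, T

N = 10
T = tf.TensorArray(dtype=tf.float64, size=N)
result = tf.while_loop(Cond_f2, Body_f1, loop_vars=[4, 0, N, T])

with tf.Session() as sess:
    print(sess.run(result[3].stack()))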

Running IJulia on Conda. Trying to plot Quandl data

A simple example using Google prices in DataFrame format. The Gadfly plot gives the following error: TypeError(u'There is no Line2D property "y"',). It also references matplotlib for some reason.
Here's the code:
using Quandl
using DataFrames

google = quandl("GOOG/NASDAQ_QQQ", format = "DataFrame")
date = google[1]
dt_str = Array(Any, length(date))
for i = 1:length(date)
    dt_str[i] = string(date[i]);
end
price = google[5]

using Gadfly
set_default_plot_size(20cm, 10cm)
p1 = plot(x=dt_str, y=price,
          Geom.point,
          Geom.smooth(method=:lm),
          Guide.xticks(ticks=[1:25]),
          Guide.yticks(ticks=[1:25]),
          Guide.xlabel("Date"),
          Guide.ylabel("Price"),
          Guide.title("Google: Close Price"))
LoadError: PyError (:PyObject_Call)
TypeError(u'There is no Line2D property "y"',)
File "C:\Anaconda2\lib\site-packages\matplotlib\pyplot.py", line 3154, in plot
ret = ax.plot(*args, **kwargs)
File "C:\Anaconda2\lib\site-packages\matplotlib\__init__.py", line 1811, in inner
return func(ax, *args, **kwargs)
File "C:\Anaconda2\lib\site-packages\matplotlib\axes\_axes.py", line 1424, in plot
for line in self._get_lines(*args, **kwargs):
File "C:\Anaconda2\lib\site-packages\matplotlib\axes\_base.py", line 395, in _grab_next_args
for seg in self._plot_args(remaining[:isplit], kwargs):
File "C:\Anaconda2\lib\site-packages\matplotlib\axes\_base.py", line 374, in _plot_args
seg = func(x[:, j % ncx], y[:, j % ncy], kw, kwargs)
File "C:\Anaconda2\lib\site-packages\matplotlib\axes\_base.py", line 281, in _makeline
self.set_lineprops(seg, **kwargs)
File "C:\Anaconda2\lib\site-packages\matplotlib\axes\_base.py", line 189, in set_lineprops
line.set(**kwargs)
File "C:\Anaconda2\lib\site-packages\matplotlib\artist.py", line 936, in set
(self.__class__.__name__, k))
while loading In[64], in expression starting on line 1
in getindex at C:\Users\yburkitbayev\.julia\v0.4\PyCall\src\PyCall.jl:239
The presence of the PyError would imply to me that the session in which this example was executed has loaded PyPlot prior to loading Gadfly. Both PyPlot and Gadfly export the plot function, so uses of plot in a session where both PyPlot and Gadfly have been loaded require the qualification of the function name with the package name (e.g. PyPlot.plot or Gadfly.plot).
Executing your example in a session where PyPlot has not been loaded, but Gadfly is loaded, produces a Gadfly plot without displaying the error message provided in your post.
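A minimal illustration of that qualification (the data here are random placeholders, not the Quandl series from the question):

using PyPlot
using Gadfly

x = collect(1:10)
y = rand(10)

# Both packages export plot, so qualify the call to make sure Gadfly's version runs.
p = Gadfly.plot(x=x, y=y, Geom.point)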