Vaex Dataframe - Groupby on a calculated field - throws error - vaex

I have the referenced vaex dataframe
The column "Amount_INR" is calculated using the other three attributes using the function:
def convert_curr(x,y,z):
c = CurrencyRates()
return c.convert(x, 'INR', y, z)
data_df_usd['Amount_INR'] = data_df_usd.apply(convert_curr,arguments=[data_df_usd.CURRENCY_CODE,data_df_usd.TOTAL_AMOUNT,data_df_usd.SUBSCRIPTION_START_DATE_DATE])
I'm trying to perform a groupby operation using the below code:
data_df_usd.groupby('CONTENTID', agg={'Revenue':vaex.agg.sum('Amount_INR')})
The code throws the below error:
RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/vaex/scopes.py", line 113, in evaluate
result = self[expression]
File "/usr/local/lib/python3.7/dist-packages/vaex/scopes.py", line 198, in __getitem__
raise KeyError("Unknown variables or column: %r" % (variable,))
**KeyError: "Unknown variables or column: 'lambda_function(CURRENCY_CODE, TOTAL_AMOUNT, SUBSCRIPTION_START_DATE_DATE)'"**
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/forex_python/converter.py", line 103, in convert
converted_amount = rate * amount
TypeError: can't multiply sequence by non-int of type 'float'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.7/multiprocessing/pool.py", line 121, in worker
result = (True, func(*args, **kwds))
File "/usr/local/lib/python3.7/dist-packages/vaex/expression.py", line 1616, in _apply
scalar_result = self.f(*[fix_type(k[i]) for k in args], **{key: value[i] for key, value in kwargs.items()})
File "<ipython-input-7-8cc933ccf57d>", line 3, in convert_curr
return c.convert(x, 'INR', y, z)
File "/usr/local/lib/python3.7/dist-packages/forex_python/converter.py", line 107, in convert
"convert requires amount parameter is of type Decimal when force_decimal=True")
forex_python.converter.DecimalFloatMismatchError: convert requires amount parameter is of type Decimal when force_decimal=True
"""
The above exception was the direct cause of the following exception:
DecimalFloatMismatchError Traceback (most recent call last)
<ipython-input-13-cc7b243be138> in <module>
----> 1 data_df_usd.groupby('CONTENTID', agg={'Revenue':vaex.agg.sum('Amount_INR')})

According to the error screenshot, this does not look like it is related to the groupby. Something is happening with the convert_curr function.
You get the error
TypeError; can't multiply sequence by non-int of type 'float'
See of you can evaluate the Amount_INR in the first place.

Related

typeerror '<' not supported between instances of 'float' and 'pandas._libs.interval.Interval'

I am sorting data within a dataframe, I am using the data from one column, binning it to intervals of 2 units and then I want a list of each of the intervals created. This is the code:
df = gpd.read_file(file)
final = df[df['i_h100']>=0]
final['H_bins']=pd.cut(x=final['i_h100'], bins=np.arange(0, 120+2, 2))
HBins = list(np.unique(final['H_bins']))
With most of my dataframes this works totally fine but occasionally I get the following error:
Traceback (most recent call last):
File "/var/folders/bg/5n2pqdj53xv1lm099dt8z2lw0000gn/T/ipykernel_2222/1548505304.py", line 1, in <cell line: 1>
HBins = list(np.unique(final['H_bins']))
File "<__array_function__ internals>", line 180, in unique
File "/Users/heatherkay/miniconda3/envs/gpd/lib/python3.9/site-packages/numpy/lib/arraysetops.py", line 272, in unique
ret = _unique1d(ar, return_index, return_inverse, return_counts)
File "/Users/heatherkay/miniconda3/envs/gpd/lib/python3.9/site-packages/numpy/lib/arraysetops.py", line 333, in _unique1d
ar.sort()
TypeError: '<' not supported between instances of 'float' and 'pandas._libs.interval.Interval'
I don't understand why, or how to resolve this.

trouble deleting specific columns in genfromtxt function

I made a python script which takes pdbqt files as input and returns a txt file. As all the lines doesn't have the same no. of columns its not able to read the files. How can I ignore those lines?
sample pdbqt and txt files
the code
from __future__ import division
import numpy as np
def function(filename):
data = np.genfromtxt(filename, dtype = float , usecols = (6, 7, 8), skip_footer=1)
import os
all_filenames = os.listdir()
import glob
all_filenames = glob.glob('*.pdbqt')
print(all_filenames)
for filename in all_filenames:
function(filename)
the error I am getting
Traceback (most recent call last):
File "cen7.py", line 45, in <module>
function(filename)
File "cen7.py", line 7, in function
data = np.genfromtxt(filename, dtype = float , usecols = (6, 7, 8), skip_footer=1)
File "/home/../.local/lib/python3.8/site-packages/numpy/lib/npyio.py", line 2261, in genfromtxt
raise ValueError(errmsg)
ValueError: Some errors were detected !
Line #3037 (got 4 columns instead of 3)
Line #6066 (got 4 columns instead of 3)
Line #9103 (got 4 columns instead of 3)
Line #12140 (got 4 columns instead of 3)
Line #15177 (got 4 columns instead of 3)
Let's make a sample csv:
In [75]: txt = """1,2,3,4
...: 5,6,7,8,9
...: """.splitlines()
This error is to be expected - the number of columns in the 2nd line is larger than previous:
In [76]: np.genfromtxt(txt, delimiter=',')
Traceback (most recent call last):
Input In [76] in <cell line: 1>
np.genfromtxt(txt, delimiter=',')
File /usr/local/lib/python3.8/dist-packages/numpy/lib/npyio.py:2261 in genfromtxt
raise ValueError(errmsg)
ValueError: Some errors were detected !
Line #2 (got 5 columns instead of 4)
I can avoid that with usecols. It isn't bothered by the extra columns in line 2:
In [77]: np.genfromtxt(txt, delimiter=',',usecols=(1,2,3))
Out[77]:
array([[2., 3., 4.],
[6., 7., 8.]])
But if the line is too short for the usecols, I get an error:
In [78]: np.genfromtxt(txt, delimiter=',',usecols=(2,3,4))
Traceback (most recent call last):
Input In [78] in <cell line: 1>
np.genfromtxt(txt, delimiter=',',usecols=(2,3,4))
File /usr/local/lib/python3.8/dist-packages/numpy/lib/npyio.py:2261 in genfromtxt
raise ValueError(errmsg)
ValueError: Some errors were detected !
Line #1 (got 4 columns instead of 3)
The wording of the error isn't quite right, but it is clear which line is the problem.
That should give you something to look for when scanning the problem lines in your csv.

Getting the mean of a column using Groupby

I am attempting to get the mean of the columns in a df but keeping getting this error using groupby
grouped_df = df.groupby('location')['number of orders'].mean()
print(grouped_df)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-5-8cc491c4c100> in <module>
----> 1 grouped_df = df.groupby('location')['number of orders'].mean()
2 print(grouped_df)
NameError: name 'df' is not defined
If I understand your comment correctly, your DataFrame is calleed df_dig. Accordingly
grouped_df = df_dig.groupby('location')['number of orders'].mean()

Tensorflow tf.split() list index out of range?

Here's the codes:
a = tf.constant([1,2,3,4])
b = tf.constant([4])
c = tf.split(a, tf.squeeze(b))
then, it turns out to be wrong:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/jeff/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/array_ops.py", line 1203, in split
num = size_splits_shape.dims[0]
IndexError: list index out of range
But why?
The docs state,
If num_or_size_splits is a tensor, size_splits, then splits value into len(size_splits) pieces. The shape of the i-th piece has the same size as the value except along dimension axis where the size is size_splits[i].
Note that size_splits needs to be slicable.
However when you squeeze(b), because it has only one element in your example, it returns a scalar that has no dimension. A scalar cannot be sliced :
b_ = tf.squeeze(b)
b_[0] # error
Hence your error.

df.Change[-1] producing errors.

I'm trying to slice the last value of the series Change from my dataframe df.
The dataframe looks something like this
Change
0 1.000000
1 0.917727
2 1.000000
3 0.914773
4 0.933182
5 0.936136
6 0.957500
14466949 1.998392
14466950 2.002413
14466951 1.998392
14466952 1.974266
14466953 1.966224
When I input the following code
df.Change[0]
df.Change[100]
df.Change[100000]
I'm getting an output, but when I'm input
df.Change[-1]
I'm getting the following error
Traceback (most recent call last):
File "<pyshell#188>", line 1, in <module>
df.Change[-1]
File "C:\Python27\lib\site-packages\pandas\core\series.py", line 601, in __getitem__
result = self.index.get_value(self, key)
File "C:\Python27\lib\site-packages\pandas\indexes\base.py", line 2139, in get_value
tz=getattr(series.dtype, 'tz', None))
File "pandas/index.pyx", line 105, in pandas.index.IndexEngine.get_value (pandas\index.c:3338)
File "pandas/index.pyx", line 113, in pandas.index.IndexEngine.get_value (pandas\index.c:3041)
File "pandas/index.pyx", line 151, in pandas.index.IndexEngine.get_loc (pandas\index.c:3898)
KeyError: -1
Pretty much any negative number I use for slicing is resulting in an error, and I'm not exactly sure why.
Thanks.
There are several ways to do this. What's happening is that pandas has no issues with df.Change[100] because 100 is in its index. -1 is not. You happen to have your index the same as if you were using ordinal positions. To explicitly get ordinal positions, use iloc.
df.Change.iloc[-1]
or
df.Change.values[-1]