I made a Python script which takes pdbqt files as input and returns a txt file. Since not all the lines have the same number of columns, it's not able to read the files. How can I ignore those lines?
Sample pdbqt and txt files
The code:
from __future__ import division
import numpy as np

def function(filename):
    data = np.genfromtxt(filename, dtype=float, usecols=(6, 7, 8), skip_footer=1)

import os
all_filenames = os.listdir()

import glob
all_filenames = glob.glob('*.pdbqt')
print(all_filenames)

for filename in all_filenames:
    function(filename)
The error I am getting:
Traceback (most recent call last):
File "cen7.py", line 45, in <module>
function(filename)
File "cen7.py", line 7, in function
data = np.genfromtxt(filename, dtype = float , usecols = (6, 7, 8), skip_footer=1)
File "/home/../.local/lib/python3.8/site-packages/numpy/lib/npyio.py", line 2261, in genfromtxt
raise ValueError(errmsg)
ValueError: Some errors were detected !
Line #3037 (got 4 columns instead of 3)
Line #6066 (got 4 columns instead of 3)
Line #9103 (got 4 columns instead of 3)
Line #12140 (got 4 columns instead of 3)
Line #15177 (got 4 columns instead of 3)
Let's make a sample csv:
In [75]: txt = """1,2,3,4
...: 5,6,7,8,9
...: """.splitlines()
This error is to be expected - the number of columns in the 2nd line is larger than in the previous one:
In [76]: np.genfromtxt(txt, delimiter=',')
Traceback (most recent call last):
Input In [76] in <cell line: 1>
np.genfromtxt(txt, delimiter=',')
File /usr/local/lib/python3.8/dist-packages/numpy/lib/npyio.py:2261 in genfromtxt
raise ValueError(errmsg)
ValueError: Some errors were detected !
Line #2 (got 5 columns instead of 4)
I can avoid that with usecols. It isn't bothered by the extra columns in line 2:
In [77]: np.genfromtxt(txt, delimiter=',',usecols=(1,2,3))
Out[77]:
array([[2., 3., 4.],
[6., 7., 8.]])
But if the line is too short for the usecols, I get an error:
In [78]: np.genfromtxt(txt, delimiter=',',usecols=(2,3,4))
Traceback (most recent call last):
Input In [78] in <cell line: 1>
np.genfromtxt(txt, delimiter=',',usecols=(2,3,4))
File /usr/local/lib/python3.8/dist-packages/numpy/lib/npyio.py:2261 in genfromtxt
raise ValueError(errmsg)
ValueError: Some errors were detected !
Line #1 (got 4 columns instead of 3)
The wording of the error isn't quite right, but it is clear which line is the problem.
That should give you something to look for when scanning the problem lines in your csv.
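Since the goal here is to ignore the offending lines, genfromtxt's invalid_raise flag is also worth trying. A minimal sketch (the filename is a placeholder; this assumes silently dropping those lines is acceptable for your pdbqt data):

import numpy as np

# With invalid_raise=False, genfromtxt emits a ConversionWarning and skips
# any line whose column count doesn't fit, instead of raising ValueError.
data = np.genfromtxt('example.pdbqt', dtype=float,
                     usecols=(6, 7, 8), skip_footer=1,
                     invalid_raise=False)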
Related
I am sorting data within a dataframe: I take the data from one column, bin it into intervals of 2 units, and then want a list of the intervals created. This is the code:
df = gpd.read_file(file)
final = df[df['i_h100']>=0]
final['H_bins']=pd.cut(x=final['i_h100'], bins=np.arange(0, 120+2, 2))
HBins = list(np.unique(final['H_bins']))
With most of my dataframes this works totally fine but occasionally I get the following error:
Traceback (most recent call last):
File "/var/folders/bg/5n2pqdj53xv1lm099dt8z2lw0000gn/T/ipykernel_2222/1548505304.py", line 1, in <cell line: 1>
HBins = list(np.unique(final['H_bins']))
File "<__array_function__ internals>", line 180, in unique
File "/Users/heatherkay/miniconda3/envs/gpd/lib/python3.9/site-packages/numpy/lib/arraysetops.py", line 272, in unique
ret = _unique1d(ar, return_index, return_inverse, return_counts)
File "/Users/heatherkay/miniconda3/envs/gpd/lib/python3.9/site-packages/numpy/lib/arraysetops.py", line 333, in _unique1d
ar.sort()
TypeError: '<' not supported between instances of 'float' and 'pandas._libs.interval.Interval'
I don't understand why, or how to resolve this.
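A likely cause, judging from the traceback: pd.cut labels any value that falls outside the bin edges as NaN, which is a float, and np.unique then tries to sort those floats against the Interval objects. A sketch with made-up values:

import numpy as np
import pandas as pd

s = pd.cut(pd.Series([1.0, 3.0, 500.0]), bins=np.arange(0, 120 + 2, 2))
# 500.0 lies above the last edge (120), so its label is NaN (a float), and
# np.unique(s) raises: '<' not supported between 'float' and 'Interval'

# Dropping the NaNs before collecting the intervals avoids the mixed sort:
HBins = list(s.dropna().unique())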
I have a vaex dataframe whose column "Amount_INR" is calculated from three other columns (CURRENCY_CODE, TOTAL_AMOUNT and SUBSCRIPTION_START_DATE_DATE) using the function:
from forex_python.converter import CurrencyRates

def convert_curr(x, y, z):
    # x: currency code, y: amount, z: date (see the apply call below)
    c = CurrencyRates()
    return c.convert(x, 'INR', y, z)
data_df_usd['Amount_INR'] = data_df_usd.apply(convert_curr,arguments=[data_df_usd.CURRENCY_CODE,data_df_usd.TOTAL_AMOUNT,data_df_usd.SUBSCRIPTION_START_DATE_DATE])
I'm trying to perform a groupby operation using the below code:
data_df_usd.groupby('CONTENTID', agg={'Revenue':vaex.agg.sum('Amount_INR')})
The code throws the below error:
RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/vaex/scopes.py", line 113, in evaluate
result = self[expression]
File "/usr/local/lib/python3.7/dist-packages/vaex/scopes.py", line 198, in __getitem__
raise KeyError("Unknown variables or column: %r" % (variable,))
KeyError: "Unknown variables or column: 'lambda_function(CURRENCY_CODE, TOTAL_AMOUNT, SUBSCRIPTION_START_DATE_DATE)'"
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/forex_python/converter.py", line 103, in convert
converted_amount = rate * amount
TypeError: can't multiply sequence by non-int of type 'float'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.7/multiprocessing/pool.py", line 121, in worker
result = (True, func(*args, **kwds))
File "/usr/local/lib/python3.7/dist-packages/vaex/expression.py", line 1616, in _apply
scalar_result = self.f(*[fix_type(k[i]) for k in args], **{key: value[i] for key, value in kwargs.items()})
File "<ipython-input-7-8cc933ccf57d>", line 3, in convert_curr
return c.convert(x, 'INR', y, z)
File "/usr/local/lib/python3.7/dist-packages/forex_python/converter.py", line 107, in convert
"convert requires amount parameter is of type Decimal when force_decimal=True")
forex_python.converter.DecimalFloatMismatchError: convert requires amount parameter is of type Decimal when force_decimal=True
"""
The above exception was the direct cause of the following exception:
DecimalFloatMismatchError Traceback (most recent call last)
<ipython-input-13-cc7b243be138> in <module>
----> 1 data_df_usd.groupby('CONTENTID', agg={'Revenue':vaex.agg.sum('Amount_INR')})
Judging from the error output, this does not look like it is related to the groupby. Something is happening with the convert_curr function.
You get the error
TypeError: can't multiply sequence by non-int of type 'float'
See if you can evaluate Amount_INR in the first place.
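For example, a sketch with illustrative values (evaluate materializes a vaex virtual column, so any failure in convert_curr surfaces without the multiprocessing indirection):

import datetime
from forex_python.converter import CurrencyRates

# Materialize a few rows of the virtual column:
print(data_df_usd.evaluate('Amount_INR', i1=0, i2=5))

# It can also help to call the conversion by hand on plain Python values
# (these are illustrative, not taken from the data):
c = CurrencyRates()
print(c.convert('USD', 'INR', 100.0, datetime.datetime(2022, 1, 3)))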
I have the following test code:
import pandas as pd
dt = pd.to_datetime('2021-11-07 01:00:00-0400').tz_convert('America/New_York')
pd.DataFrame({'datetime': dt,
              'value': [3, 4, 5]})
When using pandas version 1.1.5, this runs successfully. But under pandas version 1.2.5 or 1.3.4, it fails with the following error:
Traceback (most recent call last):
File "test.py", line 5, in <module>
'value': [3, 4, 5]})
File "venv/lib/python3.7/site-packages/pandas/core/frame.py", line 614, in __init__
mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
File "venv/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 465, in dict_to_mgr
arrays, data_names, index, columns, dtype=dtype, typ=typ, consolidate=copy
File "venv/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 124, in arrays_to_mgr
arrays = _homogenize(arrays, index, dtype)
File "venv/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 590, in _homogenize
val, index, dtype=dtype, copy=False, raise_cast_failure=False
File "venv/lib/python3.7/site-packages/pandas/core/construction.py", line 514, in sanitize_array
data = construct_1d_arraylike_from_scalar(data, len(index), dtype)
File "venv/lib/python3.7/site-packages/pandas/core/dtypes/cast.py", line 1907, in construct_1d_arraylike_from_scalar
subarr = cls._from_sequence([value] * length, dtype=dtype)
File "venv/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py", line 336, in _from_sequence
return cls._from_sequence_not_strict(scalars, dtype=dtype, copy=copy)
File "venv/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py", line 362, in _from_sequence_not_strict
ambiguous=ambiguous,
File "venv/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py", line 2098, in sequence_to_dt64ns
data.view("i8"), tz, ambiguous=ambiguous
File "pandas/_libs/tslibs/tzconversion.pyx", line 284, in pandas._libs.tslibs.tzconversion.tz_localize_to_utc
pytz.exceptions.AmbiguousTimeError: Cannot infer dst time from 2021-11-07 01:00:00, try using the 'ambiguous' argument
I am aware that Daylight Saving Time is happening on November 7. But this data looks explicit to me, and fully localized; why is pandas forgetting its timezone information, and why is it refusing to put it in a DataFrame? Is there some kind of workaround here?
Update:
I remembered that I'd actually filed a bug about this a few months ago, but it was only of somewhat academic interest to us until this week when we're starting to see actual DST-transition dates in production: https://github.com/pandas-dev/pandas/issues/42505
It's ambiguous because there are two instants with this wall-clock time: one with DST and one without:
# Timestamp('2021-11-07 01:00:00-0500', tz='America/New_York')
>>> pd.to_datetime('2021-11-07 01:00:00') \
.tz_localize('America/New_York', ambiguous=False).dst()
datetime.timedelta(0)
# Timestamp('2021-11-07 01:00:00-0400', tz='America/New_York')
>>> pd.to_datetime('2021-11-07 01:00:00') \
.tz_localize('America/New_York', ambiguous=True).dst()
datetime.timedelta(3600)
Workaround
dt = pd.to_datetime('2021-11-07 01:00:00-0400')
df = pd.DataFrame({'datetime': dt,
                   'value': [3, 4, 5]})
df['datetime'] = df['datetime'].dt.tz_convert('America/New_York')
I accepted @Corralien's answer, and I also wanted to show what workaround I finally decided to go with:
# Work around Pandas DST bug, see https://github.com/pandas-dev/pandas/issues/42505 and
# https://stackoverflow.com/questions/69846645/pandas-tells-me-non-ambiguous-time-is-ambiguous
max_len = max(len(x) if self.is_array(x) else 1 for x in data.values())
if max_len > 0 and self.is_scalar(data['datetime']):
    data['datetime'] = [data['datetime']] * max_len
df = pd.DataFrame(data)
The is_array() and is_scalar() functions check whether x is an instance of any of set, list, tuple, np.ndarray, pd.Series, pd.Index.
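A minimal sketch of those helpers (the originals weren't shown; in the real code they are methods, hence the self. calls above):

import numpy as np
import pandas as pd

ARRAY_TYPES = (set, list, tuple, np.ndarray, pd.Series, pd.Index)

def is_array(x):
    # Containers that should be treated as a column of values.
    return isinstance(x, ARRAY_TYPES)

def is_scalar(x):
    # Anything else is a single value to be broadcast.
    return not is_array(x)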
It's not perfect, but hopefully the duct tape will hold until this can be fixed in Pandas.
I've been running into this same error for two weeks, even though the code worked before. I'm not sure if I updated pandas as part of another library install and something changed there. I'm currently on version 0.23.4. The expected outcome is returning just the row with that identifier value.
In [42]: df.head()
Out[42]:
index Identifier ...
0 51384710 ...
1 74838J10 ...
2 80589M10 ...
3 67104410 ...
4 50241310 ...
[5 rows x 14 columns]
In [43]: df.loc[df.Identifier.isin(['51384710'])].head()
Traceback (most recent call last):
File "C:\anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2862, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-44-a3dbf43451ef>", line 1, in <module>
df.loc[df.Identifier.isin(['51384710'])].head()
File "C:\anaconda3\lib\site-packages\pandas\core\indexing.py", line 1478, in __getitem__
return self._getitem_axis(maybe_callable, axis=axis)
File "C:\anaconda3\lib\site-packages\pandas\core\indexing.py", line 1899, in _getitem_axis
raise ValueError('Cannot index with multidimensional key')
ValueError: Cannot index with multidimensional key
Fixed it. I'd done df.columns = [column_list] where column_list = [...], which caused df to be treated as if it had a MultiIndex, even though there was only one level. I removed the brackets from the df.columns assignment.
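For anyone hitting the same thing, a minimal reproduction of that mistake (column names are illustrative):

import pandas as pd

df = pd.DataFrame({'Identifier': ['51384710', '74838J10'], 'Value': [1, 2]})
column_list = ['Identifier', 'Value']

df.columns = [column_list]   # extra brackets: a list of lists...
print(df.columns)            # ...becomes a one-level MultiIndex

df.columns = column_list     # the fix: assign the flat list
print(df.columns)            # Index(['Identifier', 'Value'], dtype='object')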
Try changing
df.loc[df.Identifier.isin(['51384710'])].head()
to
df[df.Identifier.isin(['51384710'])].head()
I'm trying to slice the last value of the series Change from my dataframe df.
The dataframe looks something like this:
            Change
0         1.000000
1         0.917727
2         1.000000
3         0.914773
4         0.933182
5         0.936136
6         0.957500
...            ...
14466949  1.998392
14466950  2.002413
14466951  1.998392
14466952  1.974266
14466953  1.966224
When I input the following code
df.Change[0]
df.Change[100]
df.Change[100000]
I'm getting an output, but when I input
df.Change[-1]
I'm getting the following error
Traceback (most recent call last):
File "<pyshell#188>", line 1, in <module>
df.Change[-1]
File "C:\Python27\lib\site-packages\pandas\core\series.py", line 601, in __getitem__
result = self.index.get_value(self, key)
File "C:\Python27\lib\site-packages\pandas\indexes\base.py", line 2139, in get_value
tz=getattr(series.dtype, 'tz', None))
File "pandas/index.pyx", line 105, in pandas.index.IndexEngine.get_value (pandas\index.c:3338)
File "pandas/index.pyx", line 113, in pandas.index.IndexEngine.get_value (pandas\index.c:3041)
File "pandas/index.pyx", line 151, in pandas.index.IndexEngine.get_loc (pandas\index.c:3898)
KeyError: -1
Pretty much any negative number I use for slicing is resulting in an error, and I'm not exactly sure why.
Thanks.
There are several ways to do this. What's happening is that pandas has no issue with df.Change[100] because 100 is in its index, while -1 is not. Your index just happens to match the ordinal positions. To index by ordinal position explicitly, use iloc.
df.Change.iloc[-1]
or
df.Change.values[-1]
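To see the label-versus-position distinction in isolation, a small made-up series:

import pandas as pd

s = pd.Series([1.0, 0.9, 2.0])  # default RangeIndex: labels 0, 1, 2

s[2]           # 2.0 -- label lookup; the label 2 happens to exist
# s[-1]        # KeyError: -1 -- there is no label -1 in the index
s.iloc[-1]     # 2.0 -- positional lookup, negative indices work
s.values[-1]   # 2.0 -- the underlying numpy array, also positional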