How to create a char[] array - jython

I'm trying to allocate a Java char array from Jython which will be populated by a Java library. I want to do the equivalent of the following from Jython:
char[] charBuffer = new char[charCount];
I've read the documentation for the array and jarray modules (I think they're the same), but I'm not entirely sure which type code I want to use. The two documents seem slightly contradictory, but the newer array module documentation seems more accurate.
According to the Java documentation, a char is a "16-bit Unicode character" (2 bytes).
So if I check the following type codes:
>>> array.array('c').itemsize # C char, Python character
1
>>> array.array('b').itemsize # C signed char, Python int
1
>>> array.array('B').itemsize # C unsigned char, Python int
2
>>> array.array('u').itemsize # C Py_UNICODE, Python unicode character
4
>>> array.array('h').itemsize # C signed short, Python int
2
>>> array.array('H').itemsize # C unsigned short, Python int
4
It seems odd to me that the sizes of B and H are twice those of their signed counterparts b and h. Can I safely and reliably use the 16-bit B (unsigned char) or h (signed short) for a Java char? Or, if the array module is completely wrong for this, please let me know.

The short answer is: use 'c'
Under the hood, Jython does the work of converting data types for you.
You can verify this with some tests. The class java.nio.CharBuffer has a wrap() method that takes a char[] array. Observe that the Jython array type 'c' works, while everything else fails:
>>> import array
>>> from java.nio import CharBuffer
>>> array.array('c', 'Hello World')
array('c', 'Hello World')
>>> CharBuffer.wrap( array.array('c', 'Hello World') )
Hello World
>>> array.array('b','Hello World')
array('b', [72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100])
>>> CharBuffer.wrap( array.array('b', 'Hello World') )
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: wrap(): 1st arg can't be coerced to char[], java.lang.CharSequence
>>> array.array('u', u'Hello World')
array('u', u'Hello World')
>>> CharBuffer.wrap( array.array('u', u'Hello World') )
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: wrap(): 1st arg can't be coerced to char[], java.lang.CharSequence
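For the original use case of allocating an empty buffer of a given length (new char[charCount]), a minimal sketch using Jython's jarray module looks like this; charCount and the populating call are placeholders, not from the answer above:
from jarray import zeros
from java.nio import CharBuffer

charCount = 64                      # placeholder size
charBuffer = zeros(charCount, 'c')  # roughly: char[] charBuffer = new char[charCount]
# A Java method expecting char[] can now fill charBuffer in place, e.g.:
CharBuffer.wrap(charBuffer)         # wraps the same underlying char[]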

Related

trouble deleting specific columns in genfromtxt function

I made a Python script which takes pdbqt files as input and returns a txt file. Because not all the lines have the same number of columns, it isn't able to read the files. How can I ignore those lines?
sample pdbqt and txt files
the code
from __future__ import division
import numpy as np
def function(filename):
    data = np.genfromtxt(filename, dtype = float , usecols = (6, 7, 8), skip_footer=1)
import os
all_filenames = os.listdir()
import glob
all_filenames = glob.glob('*.pdbqt')
print(all_filenames)
for filename in all_filenames:
    function(filename)
the error I am getting
Traceback (most recent call last):
File "cen7.py", line 45, in <module>
function(filename)
File "cen7.py", line 7, in function
data = np.genfromtxt(filename, dtype = float , usecols = (6, 7, 8), skip_footer=1)
File "/home/../.local/lib/python3.8/site-packages/numpy/lib/npyio.py", line 2261, in genfromtxt
raise ValueError(errmsg)
ValueError: Some errors were detected !
Line #3037 (got 4 columns instead of 3)
Line #6066 (got 4 columns instead of 3)
Line #9103 (got 4 columns instead of 3)
Line #12140 (got 4 columns instead of 3)
Line #15177 (got 4 columns instead of 3)
Let's make a sample csv:
In [75]: txt = """1,2,3,4
...: 5,6,7,8,9
...: """.splitlines()
This error is to be expected: the 2nd line has more columns than the previous one:
In [76]: np.genfromtxt(txt, delimiter=',')
Traceback (most recent call last):
Input In [76] in <cell line: 1>
np.genfromtxt(txt, delimiter=',')
File /usr/local/lib/python3.8/dist-packages/numpy/lib/npyio.py:2261 in genfromtxt
raise ValueError(errmsg)
ValueError: Some errors were detected !
Line #2 (got 5 columns instead of 4)
I can avoid that with usecols. It isn't bothered by the extra columns in line 2:
In [77]: np.genfromtxt(txt, delimiter=',',usecols=(1,2,3))
Out[77]:
array([[2., 3., 4.],
       [6., 7., 8.]])
But if the line is too short for the usecols, I get an error:
In [78]: np.genfromtxt(txt, delimiter=',',usecols=(2,3,4))
Traceback (most recent call last):
Input In [78] in <cell line: 1>
np.genfromtxt(txt, delimiter=',',usecols=(2,3,4))
File /usr/local/lib/python3.8/dist-packages/numpy/lib/npyio.py:2261 in genfromtxt
raise ValueError(errmsg)
ValueError: Some errors were detected !
Line #1 (got 4 columns instead of 3)
The wording of the error isn't quite right, but it is clear which line is the problem.
That should give you something to look for when scanning the problem lines in your csv.
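Not from the answer above, but worth noting: if the goal is simply to skip the offending lines rather than track them down, genfromtxt also accepts invalid_raise=False, which drops rows with a mismatched column count and emits a warning instead of raising. A sketch of the question's function with that flag (untested against the actual pdbqt files):
import numpy as np

def function(filename):
    # Rows whose column count doesn't match are skipped with a ConversionWarning
    # instead of aborting the whole read.
    return np.genfromtxt(filename, dtype=float, usecols=(6, 7, 8),
                         skip_footer=1, invalid_raise=False)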

Pandas tells me non-ambiguous time is ambiguous

I have the following test code:
import pandas as pd
dt = pd.to_datetime('2021-11-07 01:00:00-0400').tz_convert('America/New_York')
pd.DataFrame({'datetime': dt,
              'value': [3, 4, 5]})
When using pandas version 1.1.5, this runs successfully. But under pandas version 1.2.5 or 1.3.4, it fails with the following error:
Traceback (most recent call last):
File "test.py", line 5, in <module>
'value': [3, 4, 5]})
File "venv/lib/python3.7/site-packages/pandas/core/frame.py", line 614, in __init__
mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
File "venv/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 465, in dict_to_mgr
arrays, data_names, index, columns, dtype=dtype, typ=typ, consolidate=copy
File "venv/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 124, in arrays_to_mgr
arrays = _homogenize(arrays, index, dtype)
File "venv/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 590, in _homogenize
val, index, dtype=dtype, copy=False, raise_cast_failure=False
File "venv/lib/python3.7/site-packages/pandas/core/construction.py", line 514, in sanitize_array
data = construct_1d_arraylike_from_scalar(data, len(index), dtype)
File "venv/lib/python3.7/site-packages/pandas/core/dtypes/cast.py", line 1907, in construct_1d_arraylike_from_scalar
subarr = cls._from_sequence([value] * length, dtype=dtype)
File "venv/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py", line 336, in _from_sequence
return cls._from_sequence_not_strict(scalars, dtype=dtype, copy=copy)
File "venv/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py", line 362, in _from_sequence_not_strict
ambiguous=ambiguous,
File "venv/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py", line 2098, in sequence_to_dt64ns
data.view("i8"), tz, ambiguous=ambiguous
File "pandas/_libs/tslibs/tzconversion.pyx", line 284, in pandas._libs.tslibs.tzconversion.tz_localize_to_utc
pytz.exceptions.AmbiguousTimeError: Cannot infer dst time from 2021-11-07 01:00:00, try using the 'ambiguous' argument
I am aware that Daylight Saving Time is happening on November 7. But this data looks explicit to me, and fully localized; why is pandas forgetting its timezone information, and why is it refusing to put it in a DataFrame? Is there some kind of workaround here?
Update:
I remembered that I'd actually filed a bug about this a few months ago, but it was only of somewhat academic interest to us until this week when we're starting to see actual DST-transition dates in production: https://github.com/pandas-dev/pandas/issues/42505
It's ambiguous because there are two possible instants with this wall-clock time: one with DST and one without:
# Timestamp('2021-11-07 01:00:00-0500', tz='America/New_York')
>>> pd.to_datetime('2021-11-07 01:00:00') \
...     .tz_localize('America/New_York', ambiguous=False).dst()
datetime.timedelta(0)
# Timestamp('2021-11-07 01:00:00-0400', tz='America/New_York')
>>> pd.to_datetime('2021-11-07 01:00:00') \
...     .tz_localize('America/New_York', ambiguous=True).dst()
datetime.timedelta(3600)
Workaround
dt = pd.to_datetime('2021-11-07 01:00:00-0400')
df = pd.DataFrame({'datetime': dt,
                   'value': [3, 4, 5]})
df['datetime'] = df['datetime'].dt.tz_convert('America/New_York')
I accepted @Corralien's answer, and I also wanted to show what workaround I finally decided to go with:
# Work around Pandas DST bug, see https://github.com/pandas-dev/pandas/issues/42505 and
# https://stackoverflow.com/questions/69846645/pandas-tells-me-non-ambiguous-time-is-ambiguous
max_len = max(len(x) if self.is_array(x) else 1 for x in data.values())
if max_len > 0 and self.is_scalar(data['datetime']):
    data['datetime'] = [data['datetime']] * max_len
df = pd.DataFrame(data)
The is_array() and is_scalar() functions check whether x is an instance of any of set, list, tuple, np.ndarray, pd.Series, pd.Index.
It's not perfect, but hopefully the duct tape will hold until this can be fixed in Pandas.
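For reference, a self-contained sketch of the same broadcast idea with the example data from the question (without the is_array/is_scalar helpers):
import pandas as pd

# Broadcasting the tz-aware scalar to a list keeps pandas from re-localizing
# it during column construction, which is where the AmbiguousTimeError is raised.
dt = pd.to_datetime('2021-11-07 01:00:00-0400').tz_convert('America/New_York')
values = [3, 4, 5]
df = pd.DataFrame({'datetime': [dt] * len(values), 'value': values})
print(df['datetime'].dtype)  # expected: datetime64[ns, America/New_York]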

NumPy: Check if field exists

I have a structured numpy array:
>>> import numpy
>>> a = numpy.zeros(1, dtype = [('field0', 'i2'), ('field1', 'f4')])
Then I start to retrieve some values. However, I do not know in advance whether my array contains a certain field. Therefore, if I try to access a non-existent field, I expectedly get an IndexError:
>>> a[0]['field0']
0
>>> a[0]['field2']
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IndexError: invalid index
I could of course go with try-except; however, this can potentially mask other errors, as the IndexError does not specify at which level I hit the non-existent index:
>>> try:
... a[9999]['field2']['subfield3']
... except IndexError:
... print('Some index does not exist')
...
Some index does not exist
I also tried to approach numpy arrays as lists, but this does not work:
>>> if 'field0' in a[0]:
... print('yes')
... else:
... print('no')
...
no
Therefore, question: Is there a way to check if a given field exists in a structured numpy array?
You could check .dtype.names or .dtype.fields:
>>> a.dtype.names
('field0', 'field1')
>>> 'field0' in a.dtype.names
True
>>> a.dtype.fields
mappingproxy({'field0': (dtype('int16'), 0), 'field1': (dtype('float32'), 2)})
>>> 'field0' in a.dtype.fields
True
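Note that dtype.names is None for a non-structured array, so a bare membership test would fail there; a small helper sketch that guards against that:
import numpy as np

def has_field(arr, name):
    # dtype.names is None for plain (non-structured) arrays.
    return arr.dtype.names is not None and name in arr.dtype.names

a = np.zeros(1, dtype=[('field0', 'i2'), ('field1', 'f4')])
print(has_field(a, 'field0'))        # True
print(has_field(a, 'field2'))        # False
print(has_field(np.zeros(3), 'f'))   # False, no fields at all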

How to implement aggregation functions for pandas groupby objects?

Here's the setup for this question:
import numpy as np
import pandas as pd
import collections as co
data = [['a', 1],
        ['a', 2],
        ['a', 3],
        ['a', 4],
        ['b', 5],
        ['b', 6],
        ['b', 7]]
varnames = tuple('PQ')
df = pd.DataFrame(co.OrderedDict([(varnames[i], [row[i] for row in data])
                                  for i in range(len(varnames))]))
gdf = df.groupby(df.ix[:, 0])
After evaluating the above, df looks like this:
>>> df
   P  Q
0  a  1
1  a  2
2  a  3
3  a  4
4  b  5
5  b  6
6  b  7
gdf is a DataFrameGroupBy object associated with df, where the groups are determined by the values in the first column of df.
Now, watch this:
>>> gdf.aggregate(sum)
    Q
P
a  10
b  18
...but repeating the same thing after replacing sum with a pass-through wrapper for it, bombs:
>>> mysum = lambda *a, **k: sum(*a, **k)
>>> mysum(range(10)) == sum(range(10))
True
>>> gdf.aggregate(mysum)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/yt/.virtualenvs/yte/lib/python2.7/site-packages/pandas/core/groupby.py", line 1699, in aggregate
result = self._aggregate_generic(arg, *args, **kwargs)
File "/home/yt/.virtualenvs/yte/lib/python2.7/site-packages/pandas/core/groupby.py", line 1757, in _aggregate_generic
return self._aggregate_item_by_item(func, *args, **kwargs)
File "/home/yt/.virtualenvs/yte/lib/python2.7/site-packages/pandas/core/groupby.py", line 1782, in _aggregate_item_by_item
result[item] = colg.aggregate(func, *args, **kwargs)
File "/home/yt/.virtualenvs/yte/lib/python2.7/site-packages/pandas/core/groupby.py", line 1426, in aggregate
result = self._aggregate_named(func_or_funcs, *args, **kwargs)
File "/home/yt/.virtualenvs/yte/lib/python2.7/site-packages/pandas/core/groupby.py", line 1508, in _aggregate_named
output = func(group, *args, **kwargs)
File "<stdin>", line 1, in <lambda>
TypeError: unsupported operand type(s) for +: 'int' and 'str'
Here's a subtler (though probably related) issue. Recall that the result of gdf.aggregate(sum) was a dataframe with a single column, Q. Now, note the result below contains two columns, P and Q:
>>> import random as rn
>>> gdf.aggregate(lambda *a, **k: rn.random())
          P         Q
P
a  0.344457  0.344457
b  0.990507  0.990507
I have not been able to find anything in the documentation that would explain:
why gdf.aggregate(mysum) should fail (IOW, does this failure agree with documented behavior, or is it a bug in pandas?)
why gdf.aggregate(lambda *a, **k: rn.random()) should produce a two-column output while gdf.aggregate(sum) produces a one-column output
what signatures (input and output) an aggregation function foo should have so that gdf.aggregate(foo) will return a table having only column Q (like the result of gdf.aggregate(sum))
Your problems all come down to the columns that are included in the GroupBy. I think you want to group by P and compute statistics on Q. To do that, use
gdf = df.groupby('P')
instead of your method. Then any aggregations will not include the P column.
The sum in your function is Python's built-in sum. GroupBy.sum() is written in Cython and only acts on numeric dtypes; that's why you get the error about adding ints to strs.
Your other two questions are related to that. You're passing two columns, P and Q, into gdf.agg, so you get two columns out for gdf.aggregate(lambda *a, **k: rn.random()). gdf.sum() ignores the string column.
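A short sketch of that fix with the question's data (a minimal illustration, not from the original answer):
import pandas as pd

df = pd.DataFrame({'P': list('aaaabbb'), 'Q': [1, 2, 3, 4, 5, 6, 7]})
gdf = df.groupby('P')          # group by column name, not by df.ix[:, 0]

mysum = lambda s: sum(s)       # each call now receives only a group's Q values
print(gdf.aggregate(mysum))    # single column Q, same shape as gdf.aggregate(sum)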

Regular expression for two underscores

What would be the regular expression to match a string with two underscores breaking up three series of characters? This could also be a SQL LIKE statement if that is possible.
Match:
b_06/18/2012_06:02:34 PM
y1289423_06/14/2011_03:06:35 AM
23479693_11/01/2011_06:12:55 PM
Not Match:
CCC Valuation_b_06/28/2012_05:57:20 PM
CCC Valuation_CCC Valuation_b_06/28/2012_05:57:20 PM
doc1_2.pdf
testdoc.txt
If all of the data is consistent with your examples, then this should work:
EDIT: Updated to match the whole line. This will preclude any substring matches in the invalid list. However, it assumes that the OP does not want to match any substrings.
^[a-z0-9]+_[0-9\/]+_[A-Z0-9:\s]+$
For example, in python:
>>> import re
>>> s = 'b_06/18/2012_06:02:34 PM'
>>> pattern = '^[a-z0-9]+_[0-9\/]+_[A-Z0-9:\s]+$'
>>> m = re.match(pattern, s)
>>> m.group(0)
'b_06/18/2012_06:02:34 PM' # <======== matches from valid list
>>> s = 'CCC Valuation_CCC Valuation_b_06/28/2012_05:57:20 PM'
>>> m = re.match(pattern, s)
>>> m.group(0)
Traceback (most recent call last): # <======= does NOT match from invalid list
File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
^[^_]+_[0-9\/]+_[0-9:]+\s(AM|PM)$
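A quick way to sanity-check a candidate pattern against all of the example strings (a small sketch using the first pattern from the answer above):
import re

pattern = re.compile(r'^[a-z0-9]+_[0-9/]+_[A-Z0-9:\s]+$')
candidates = [
    'b_06/18/2012_06:02:34 PM',
    'y1289423_06/14/2011_03:06:35 AM',
    '23479693_11/01/2011_06:12:55 PM',
    'CCC Valuation_b_06/28/2012_05:57:20 PM',
    'CCC Valuation_CCC Valuation_b_06/28/2012_05:57:20 PM',
    'doc1_2.pdf',
    'testdoc.txt',
]
# Only the three strings with exactly two underscores should survive.
print([s for s in candidates if pattern.match(s)])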