Regular expression for two underscores - sql

What would be the regular expression to match a string in which two underscores separate three series of characters? A SQL 'LIKE' pattern would also work, if that is possible.
Match:
b_06/18/2012_06:02:34 PM
y1289423_06/14/2011_03:06:35 AM
23479693_11/01/2011_06:12:55 PM
Not Match:
CCC Valuation_b_06/28/2012_05:57:20 PM
CCC Valuation_CCC Valuation_b_06/28/2012_05:57:20 PM
doc1_2.pdf
testdoc.txt

If all of the data is consistent with your examples, then this should work:
EDIT: Updated to match the whole line. This will preclude any substring matches in the invalid list. However, it assumes that OP does not want to match any substrings.
^[a-z0-9]+_[0-9\/]+_[A-Z0-9:\s]+$
For example, in Python:
>>> import re
>>> s = 'b_06/18/2012_06:02:34 PM'
>>> pattern = '^[a-z0-9]+_[0-9\/]+_[A-Z0-9:\s]+$'
>>> m = re.match(pattern, s)
>>> m.group(0)
'b_06/18/2012_06:02:34 PM' # <======== matches from valid list
>>> s = 'CCC Valuation_CCC Valuation_b_06/28/2012_05:57:20 PM'
>>> m = re.match(pattern, s)
>>> m.group(0)
Traceback (most recent call last): # <======= does NOT match from invalid list
File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'

^[^_]+_[0-9\/]+_[0-9:]+\s(AM|PM)$
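As a quick sanity check in Python (a minimal sketch using the re module; only the pattern above is being exercised):
import re
pattern = r'^[^_]+_[0-9\/]+_[0-9:]+\s(AM|PM)$'
print(bool(re.match(pattern, 'b_06/18/2012_06:02:34 PM')))                # True  (valid list)
print(bool(re.match(pattern, 'CCC Valuation_b_06/28/2012_05:57:20 PM')))  # False (invalid list)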

Renaming Pandas Column Names to String Datatype

I have a df with column names that do not appear to be of the typical string datatype. I am trying to rename these columns so that they all get the same name. I have tried this and end up with an error. Here is my df with its column names:
dfap.columns
Out[169]: Index(['month', 'plant_name', 0, 1, 2, 3, 4], dtype='object')
Here is my attempt at renaming columns 2, 3, 4, 5, 6 (i.e. 2:7):
dfap.columns[2:7] = [ 'Adj_Prod']
Traceback (most recent call last):
File "<ipython-input-175-ebec554a2fd1>", line 1, in <module>
dfap.columns[2:7] = [ 'Adj_Prod']
File "C:\Users\U321103\Anaconda3\envs\Maps2\lib\site-packages\pandas\core\indexes\base.py", line 4585, in __setitem__
raise TypeError("Index does not support mutable operations")
TypeError: Index does not support mutable operations
Thank you,
You can't rename only some columns using that method.
You can do:
tempcols = list(dfap.columns)   # copy into a plain list, since a pandas Index is immutable
tempcols[2:7] = newcols
dfap.columns = tempcols
Of course you'll want newcols to be the same length as the slice you're replacing. In your example you're only assigning a list of length 1.
Alternatively, you could do:
dfap.rename(columns=dict_of_name_changes, inplace=True)
The dict needs for the key to be the existing name and the value to be the new name. In this method you can rename as few columns as you want.
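For example, assuming the integer-named columns 0 through 4 should all become 'Adj_Prod' (a minimal sketch; the mapping here is only illustrative):
# map every integer-named column to the same new name
dfap.rename(columns={i: 'Adj_Prod' for i in range(5)}, inplace=True)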
You could use rename(columns=...) with a lambda function to handle the renaming logic.
df.rename(columns=lambda x: x if type(x)!=int else 'Adj_Prod')
Result
Columns: [month, plant_name, Adj_Prod, Adj_Prod, Adj_Prod, Adj_Prod, Adj_Prod]

Why does Series.min(skipna=True) throw an error caused by an NA value?

I work with timestamps (having mixed DST values). Tried in Pandas 1.0.0:
import numpy as np
import pandas as pd

s = pd.Series(
    [pd.Timestamp('2020-02-01 11:35:44+01'),
     np.nan,  # same result with pd.Timestamp('nat')
     pd.Timestamp('2019-04-13 12:10:20+02')])
Asking for min() or max() fails:
s.min(), s.max() # same result with s.min(skipna=True)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Anaconda\lib\site-packages\pandas\core\generic.py", line 11216, in stat_func
f, name, axis=axis, skipna=skipna, numeric_only=numeric_only
File "C:\Anaconda\lib\site-packages\pandas\core\series.py", line 3892, in _reduce
return op(delegate, skipna=skipna, **kwds)
File "C:\Anaconda\lib\site-packages\pandas\core\nanops.py", line 125, in f
result = alt(values, axis=axis, skipna=skipna, **kwds)
File "C:\Anaconda\lib\site-packages\pandas\core\nanops.py", line 837, in reduction
result = getattr(values, meth)(axis)
File "C:\Anaconda\lib\site-packages\numpy\core\_methods.py", line 34, in _amin
return umr_minimum(a, axis, None, out, keepdims, initial, where)
TypeError: '<=' not supported between instances of 'Timestamp' and 'float'
Workaround:
s.loc[s.notna()].min(), s.loc[s.notna()].max()
(Timestamp('2019-04-13 12:10:20+0200', tz='pytz.FixedOffset(120)'), Timestamp('2020-02-01 11:35:44+0100', tz='pytz.FixedOffset(60)'))
What am I missing here? Is it a bug?
I think the problem here is that pandas stores a Series with mixed timezones as object dtype, so max and min fail here.
s = pd.Series(
    [pd.Timestamp('2020-02-01 11:35:44+01'),
     np.nan,  # same result with pd.Timestamp('nat')
     pd.Timestamp('2019-04-13 12:10:20+02')])
print (s)
0    2020-02-01 11:35:44+01:00
1                          NaN
2    2019-04-13 12:10:20+02:00
dtype: object
So if you convert to datetimes (though not with mixed timezones), it works well:
print (pd.to_datetime(s, utc=True))
0   2020-02-01 10:35:44+00:00
1                         NaT
2   2019-04-13 10:10:20+00:00
dtype: datetime64[ns, UTC]
print (pd.to_datetime(s, utc=True).max())
2020-02-01 10:35:44+00:00
Another possible solution, if you need to keep the different timezones, is:
print (s.dropna().max())
2020-02-01 11:35:44+01:00
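If you take the UTC route but still want the result expressed with a particular offset, you can convert it back afterwards (a minimal sketch; 'Europe/Paris' is only an example zone, not something from the question):
print (pd.to_datetime(s, utc=True).max().tz_convert('Europe/Paris'))
2020-02-01 11:35:44+01:00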

NumPy: Check if field exists

I have a structured numpy array:
>>> import numpy
>>> a = numpy.zeros(1, dtype = [('field0', 'i2'), ('field1', 'f4')])
Then I start to retrieve some values. However, I do not know in advance whether my array contains a certain field. Therefore, if I try to access a non-existent field, I get an IndexError, as expected:
>>> a[0]['field0']
0
>>> a[0]['field2']
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IndexError: invalid index
I could of course go with try-except; however, this can potentially mask other errors, as the IndexError does not specify at which level I hit the non-existent index:
>>> try:
...     a[9999]['field2']['subfield3']
... except IndexError:
...     print('Some index does not exist')
...
Some index does not exist
I also tried to approach numpy arrays as lists, but this does not work:
>>> if 'field0' in a[0]:
...     print('yes')
... else:
...     print('no')
...
no
Therefore, question: Is there a way to check if a given field exists in a structured numpy array?
You could check .dtype.names or .dtype.fields:
>>> a.dtype.names
('field0', 'field1')
>>> 'field0' in a.dtype.names
True
>>> a.dtype.fields
mappingproxy({'field0': (dtype('int16'), 0), 'field1': (dtype('float32'), 2)})
>>> 'field0' in a.dtype.fields
True
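If you need this check in several places, you could wrap it in a small helper (a minimal sketch; has_field is a made-up name, and the extra None check covers plain, non-structured arrays whose dtype.names is None):
>>> def has_field(arr, name):
...     # dtype.names is None for non-structured arrays
...     return arr.dtype.names is not None and name in arr.dtype.names
...
>>> has_field(a, 'field0')
True
>>> has_field(a, 'field2')
False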

How to implement aggregation functions for pandas groupby objects?

Here's the setup for this question:
import numpy as np
import pandas as pd
import collections as co
data = [['a', 1],
        ['a', 2],
        ['a', 3],
        ['a', 4],
        ['b', 5],
        ['b', 6],
        ['b', 7]]
varnames = tuple('PQ')
df = pd.DataFrame(co.OrderedDict([(varnames[i], [row[i] for row in data])
                                  for i in range(len(varnames))]))
gdf = df.groupby(df.ix[:, 0])
After evaluating the above, df looks like this:
>>> df
   P  Q
0  a  1
1  a  2
2  a  3
3  a  4
4  b  5
5  b  6
6  b  7
gdf is a DataFrameGroupBy object associated with df, where the groups are determined by the values in the first column of df.
Now, watch this:
>>> gdf.aggregate(sum)
    Q
P
a  10
b  18
...but repeating the same thing after replacing sum with a pass-through wrapper for it, bombs:
>>> mysum = lambda *a, **k: sum(*a, **k)
>>> mysum(range(10)) == sum(range(10))
True
>>> gdf.aggregate(mysum)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/yt/.virtualenvs/yte/lib/python2.7/site-packages/pandas/core/groupby.py", line 1699, in aggregate
result = self._aggregate_generic(arg, *args, **kwargs)
File "/home/yt/.virtualenvs/yte/lib/python2.7/site-packages/pandas/core/groupby.py", line 1757, in _aggregate_generic
return self._aggregate_item_by_item(func, *args, **kwargs)
File "/home/yt/.virtualenvs/yte/lib/python2.7/site-packages/pandas/core/groupby.py", line 1782, in _aggregate_item_by_item
result[item] = colg.aggregate(func, *args, **kwargs)
File "/home/yt/.virtualenvs/yte/lib/python2.7/site-packages/pandas/core/groupby.py", line 1426, in aggregate
result = self._aggregate_named(func_or_funcs, *args, **kwargs)
File "/home/yt/.virtualenvs/yte/lib/python2.7/site-packages/pandas/core/groupby.py", line 1508, in _aggregate_named
output = func(group, *args, **kwargs)
File "<stdin>", line 1, in <lambda>
TypeError: unsupported operand type(s) for +: 'int' and 'str'
Here's a subtler (though probably related) issue. Recall that the result of gdf.aggregate(sum) was a dataframe with a single column, Q. Now, note the result below contains two columns, P and Q:
>>> import random as rn
>>> gdf.aggregate(lambda *a, **k: rn.random())
          P         Q
P
a  0.344457  0.344457
b  0.990507  0.990507
I have not been able to find anything in the documentation that would explain:
1. Why should gdf.aggregate(mysum) fail? (IOW, does this failure agree with documented behavior, or is it a bug in pandas?)
2. Why should gdf.aggregate(lambda *a, **k: rn.random()) produce a two-column output while gdf.aggregate(sum) produces a one-column output?
3. What signatures (input and output) should an aggregation function foo have so that gdf.aggregate(foo) will return a table having only column Q (like the result of gdf.aggregate(sum))?
Your problems all come down to the columns that are included in the GroupBy. I think you want to group by P and compute statistics on Q. To do that, use
gdf = df.groupby('P')
instead of your method. Then any aggregations will not include the P column.
The sum in your function is Python's sum. Groupby.sum() is written in Cython and only acts on numeric dtypes. That's why you get the error about adding ints to strs.
Your other two questions are related to that. You're feeding two columns, P and Q, into gdf.agg, so you get two columns out of gdf.aggregate(lambda *a, **k: rn.random()). gdf.sum() ignores the string column.
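To make that concrete (a minimal sketch under the question's setup; gdf2 is just an illustrative name, and mysum is the same pass-through wrapper defined above):
>>> gdf2 = df.groupby('P')   # group by column name, so P itself is not passed to the aggregator
>>> gdf2.aggregate(mysum)
    Q
P
a  10
b  18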

How to create a char[] array

I'm trying to allocate a Java char array from Jython, which will then be populated by a Java library. I want the equivalent of this, but from Jython:
char[] charBuffer = new char[charCount];
I've read the documentation for the array and jarray modules (I think they're the same), but I'm not entirely sure which type code I want to use. The two documents seem slightly contradictory, but the newer array module's documentation seems more correct.
According to the Java documentation, a char is a "16-bit Unicode character" (2 bytes).
So if I check the following type codes:
>>> array.array('c').itemsize # C char, Python character
1
>>> array.array('b').itemsize # C signed char, Python int
1
>>> array.array('B').itemsize # C unsigned char, Python int
2
>>> array.array('u').itemsize # C Py_UNICODE, Python unicode character
4
>>> array.array('h').itemsize # C signed short, Python int
2
>>> array.array('H').itemsize # C unsigned short Python int
4
It seems odd to me that the sizes of B and H are twice the sizes of their signed counterparts b and h. Can I safely and reliably use the 16-bit B (unsigned char) or h (signed short int) for a Java char? Or, if using the array module is completely wrong for this, please let me know.
The short answer is: use 'c'
Under the hood, jython is doing the work of converting data types for you.
You can verify with some tests. There is a class java.nio.CharBuffer with a method wrap() that takes a char[] array. Observe that jython array type 'c' works, while everything else fails:
>>> import array
>>> from java.nio import CharBuffer
>>> array.array('c', 'Hello World')
array('c', 'Hello World')
>>> CharBuffer.wrap( array.array('c', 'Hello World') )
Hello World
>>> array.array('b','Hello World')
array('b', [72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100])
>>> CharBuffer.wrap( array.array('b', 'Hello World') )
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: wrap(): 1st arg can't be coerced to char[], java.lang.CharSequence
>>> array.array('u', u'Hello World')
array('u', u'Hello World')
>>> CharBuffer.wrap( array.array('u', u'Hello World') )
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: wrap(): 1st arg can't be coerced to char[], java.lang.CharSequence
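If you also need to pre-allocate the buffer itself, the equivalent of char[] charBuffer = new char[charCount] should be the zeros() function from Jython's jarray module (a minimal sketch; charCount is a hypothetical size):
>>> from jarray import zeros
>>> charCount = 16                       # hypothetical buffer size
>>> charBuffer = zeros(charCount, 'c')   # same as: char[] charBuffer = new char[charCount]
>>> len(charBuffer)
16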