Regex to not match a group of characters - regex-negation

I want a regex that matches a number string of length 5, but it should not match when a given group of 2 digits appears in the middle of the string.
E.g. if my sample string is 12345, then 23 and 24 in that position should make the string not match, but 21, 22 and 25 should still match.
Should match:
12145
12245
12567
Should not match:
12345
12456
Can anyone please help me?

import re

# a digit, then assert the next two digits are not 23 or 24, then four more digits
pat = r'\d(?!23|24)\d{4}'
rgx = re.compile(pat)
for x in ('12145', '12245', '12345', '12456', '12567'):
    print x, rgx.match(x)
result
12145 <_sre.SRE_Match object at 0x00AB01E0>
12245 <_sre.SRE_Match object at 0x00AB01E0>
12345 None
12456 None
12567 <_sre.SRE_Match object at 0x00AB01E0>
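A note on how this works: the first \d consumes digit 1, the lookahead (?!23|24) then asserts that digits 2-3 are not 23 or 24, and \d{4} consumes the remaining four digits. Also, re.match only anchors at the start of the string, so a longer input such as 123456 would still match on its first five digits. If the string must be exactly five digits, a minimal sketch (my addition, in Python 3 syntax) using re.fullmatch:
import re

pat = r'\d(?!23|24)\d{4}'
for x in ('12145', '12345', '123456'):
    # fullmatch anchors at both ends, so the six-digit input is rejected
    print(x, re.fullmatch(pat, x))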
EDIT 1
If you need something a little more complex, that is to say excluding successions of 2 digits and successions of 3 digits starting at the second digit:
import re

tu2 = (23, 0, 24, 78, 2, 85, 84, 32)  # excluded 2-digit successions
tu3 = (333, 899)                       # excluded 3-digit successions
pat = r'\d(?!%s|%s)\d{4}' % ('|'.join(map('{:02}'.format, tu2)),
                             '|'.join(map('{:03}'.format, tu3)))
print pat, '\n'

rgx = re.compile(pat)
for x in ('12145', '12245', '12345', '12456', '12567',
          '278455', '13358', '13337', '18907', '18995'):
    print x, rgx.match(x)
result
\d(?!23|00|24|78|02|85|84|32|333|899)\d{4}
12145 <_sre.SRE_Match object at 0x011DB528>
12245 <_sre.SRE_Match object at 0x011DB528>
12345 None
12456 None
12567 <_sre.SRE_Match object at 0x011DB528>
278455 None
13358 <_sre.SRE_Match object at 0x011DB528>
13337 None
18907 <_sre.SRE_Match object at 0x011DB528>
18995 None
EDIT 2
Showing what can be done, with exclusions of several lengths at several starting positions:
import re
from collections import defaultdict

# each condition is (starting position, length in digits, excluded values)
conditions = ((2, 2, (23, 0, 24, 78, 2)),
              (1, 2, (1, 79)),
              (2, 3, (333, 899, 8)),
              (1, 4, (3333, 3353)))

# group the zero-padded exclusions by starting position
conds = defaultdict(list)
for d, L, tu in conditions:
    conds[d].extend(map(('{:0%s}' % L).format, tu))

# build one negative lookahead per starting position
pat = ''.join(r'(?!\d{%s}(?:%s))' % (k - 1, '|'.join(conds[k]))
              for k in conds)
pat += r'\d{5}'
print '\npat:\n', pat, '\n'

rgx = re.compile(pat)
for x in ('01655', '02655', '79443', '78443',
          '12145', '12245', '12345', '12456', '12567',
          '278455', '70245', '13338', '13348',
          '18997', '18905', '60089', '60189',
          '33330', '33530', '33730'):
    print x, rgx.match(x)
result
pat:
(?!\d{0}(?:01|79|3333|3353))(?!\d{1}(?:23|00|24|78|02|333|899|008))\d{5}
01655 None
02655 <_sre.SRE_Match object at 0x011DB528>
79443 None
78443 <_sre.SRE_Match object at 0x011DB528>
12145 <_sre.SRE_Match object at 0x011DB528>
12245 <_sre.SRE_Match object at 0x011DB528>
12345 None
12456 None
12567 <_sre.SRE_Match object at 0x011DB528>
278455 None
70245 None
13338 None
13348 <_sre.SRE_Match object at 0x011DB528>
18997 None
18905 <_sre.SRE_Match object at 0x011DB528>
60089 None
60189 <_sre.SRE_Match object at 0x011DB528>
33330 None
33530 None
33730 <_sre.SRE_Match object at 0x011DB528>

Related

Merging GeoDataFrames - TypeError: float() argument must be a string or a number, not 'Point'

I have a dataframe with a column holding shapely Points, and another dataframe with a column of Polygons.
df.head()
hash number street unit \
2024459 283e04eca5c4932a SN AVENIDA DOUTOR SEVERIANO DE ALMEIDA NaN
2024460 1a92a1c3cba7941a 485 AVENIDA DOUTOR SEVERIANO DE ALMEIDA NaN
2024461 837341c45de519a3 475 AVENIDA DOUTOR SEVERIANO DE ALMEIDA NaN
city district region postcode id geometry
2024459 Jaguari NaN RS 97760-000 NaN POINT (-54.69445 -29.49421)
2024460 Jaguari NaN RS 97760-000 NaN POINT (-54.69445 -29.49421)
2024461 Jaguari NaN RS 97760-000 NaN POINT (-54.69445 -29.49421)
poly_df.head()
centroids geometry
0 POINT (-29.31067315122428 -54.64176359828149) POLYGON ((-54.64069 -29.31161, -54.64069 -29.3...
1 POINT (-29.31067315122428 -54.63961783106958) POLYGON ((-54.63854 -29.31161, -54.63854 -29.3...
2 POINT (-29.31067315122428 -54.637472063857665) POLYGON ((-54.63640 -29.31161, -54.63640 -29.3...
I'm checking if the Point belongs to the Polygon and inserting the Point object into the cell of the second dataframe. However, I'm getting the following error:
Traceback (most recent call last):
  File "/tmp/ipykernel_4771/1967309101.py", line 1, in <module>
    df.loc[idx, 'centroids'] = poly_mun.loc[ix, 'centroids']
  File ".local/lib/python3.8/site-packages/pandas/core/indexing.py", line 692, in __setitem__
    iloc._setitem_with_indexer(indexer, value, self.name)
  File ".local/lib/python3.8/site-packages/pandas/core/indexing.py", line 1599, in _setitem_with_indexer
    self.obj[key] = infer_fill_value(value)
  File ".local/lib/python3.8/site-packages/pandas/core/dtypes/missing.py", line 516, in infer_fill_value
    val = np.array(val, copy=False)
TypeError: float() argument must be a string or a number, not 'Point'
I'm using the following command line:
df.loc[idx, 'centroids'] = poly_df.loc[ix, 'centroids']
I have already tried .at as well.
Thanks
You can't create a new column in pandas with a shapely geometry using loc:
In [1]: import pandas as pd, shapely.geometry
In [2]: df = pd.DataFrame({'mycol': [1, 2, 3]})
In [3]: df.loc[0, "centroid"] = shapely.geometry.Point([0, 0])
/Users/mikedelgado/opt/miniconda3/envs/rhodium-env/lib/python3.10/site-packages/pandas/core/indexing.py:1642: ShapelyDeprecationWarning: The array interface is deprecated and will no longer work in Shapely 2.0. Convert the '.coords' to a numpy array instead.
self.obj[key] = infer_fill_value(value)
/Users/mikedelgado/opt/miniconda3/envs/rhodium-env/lib/python3.10/site-packages/pandas/core/dtypes/missing.py:550: FutureWarning: The input object of type 'Point' is an array-like implementing one of the corresponding protocols (`__array__`, `__array_interface__` or `__array_struct__`); but not a sequence (or 0-D). In the future, this object will be coerced as if it was first converted using `np.array(obj)`. To retain the old behaviour, you have to either modify the type 'Point', or assign to an empty array created with `np.empty(correct_shape, dtype=object)`.
val = np.array(val, copy=False)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Input In [3], in <cell line: 1>()
----> 1 df.loc[0, "centroid"] = shapely.geometry.Point([0, 0])
File ~/opt/miniconda3/envs/rhodium-env/lib/python3.10/site-packages/pandas/core/indexing.py:716, in _LocationIndexer.__setitem__(self, key, value)
713 self._has_valid_setitem_indexer(key)
715 iloc = self if self.name == "iloc" else self.obj.iloc
--> 716 iloc._setitem_with_indexer(indexer, value, self.name)
File ~/opt/miniconda3/envs/rhodium-env/lib/python3.10/site-packages/pandas/core/indexing.py:1642, in _iLocIndexer._setitem_with_indexer(self, indexer, value, name)
1639 self.obj[key] = empty_value
1641 else:
-> 1642 self.obj[key] = infer_fill_value(value)
1644 new_indexer = convert_from_missing_indexer_tuple(
1645 indexer, self.obj.axes
1646 )
1647 self._setitem_with_indexer(new_indexer, value, name)
File ~/opt/miniconda3/envs/rhodium-env/lib/python3.10/site-packages/pandas/core/dtypes/missing.py:550, in infer_fill_value(val)
548 if not is_list_like(val):
549 val = [val]
--> 550 val = np.array(val, copy=False)
551 if needs_i8_conversion(val.dtype):
552 return np.array("NaT", dtype=val.dtype)
TypeError: float() argument must be a string or a real number, not 'Point'
Essentially, pandas doesn't know how to interpret a point object, and so creates a float column with NaNs, and then can't handle the point. This might get fixed in the future, but you're best off explicitly defining the column as object dtype:
In [27]: df['centroid'] = None
In [28]: df['centroid'] = df['centroid'].astype(object)
In [29]: df
Out[29]:
mycol centroid
0 1 None
1 2 None
2 3 None
In [30]: df.loc[0, "centroid"] = shapely.geometry.Point([0, 0])
/Users/mikedelgado/opt/miniconda3/envs/rhodium-env/lib/python3.10/site-packages/pandas/core/internals/managers.py:304: ShapelyDeprecationWarning: The array interface is deprecated and will no longer work in Shapely 2.0. Convert the '.coords' to a numpy array instead.
applied = getattr(b, f)(**kwargs)
In [31]: df
Out[31]:
mycol centroid
0 1 POINT (0 0)
1 2 None
2 3 None
That said, joining two GeoDataFrames with polygons and points based on whether the points are in the polygons certainly sounds like a job for geopandas.sjoin:
union = gpd.sjoin(polygon_df, points_df, op='contains')
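One caveat: in recent geopandas releases (0.10+) the op keyword is deprecated in favor of predicate. A minimal sketch with toy data (the frame and column names here are mine, standing in for poly_df and df from the question):
import geopandas as gpd
from shapely.geometry import Point, Polygon

polygon_df = gpd.GeoDataFrame({'poly_id': [1]},
                              geometry=[Polygon([(0, 0), (0, 2), (2, 2), (2, 0)])])
points_df = gpd.GeoDataFrame({'pt_id': [10, 20]},
                             geometry=[Point(1, 1), Point(5, 5)])

# keep polygon/point pairs where the polygon contains the point;
# only Point(1, 1) survives the join here
union = gpd.sjoin(polygon_df, points_df, predicate='contains')
print(union)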

A contain masking operation

Suppose such a Series:
In [8]: arr = pd.Series(['testing', 'the', 'masking']); arr
Out[8]:
0 testing
1 the
2 masking
dtype: object
Masking is handy
In [10]: arr == 'testing'
Out[10]:
0 True
1 False
2 False
dtype: bool
To check whether 't' is in the individual strings, a nested iteration has to be applied:
In [11]: [ u for u in arr if 't' in u]
Out[11]: ['testing', 'the']
Is it possible to get it done with something like
arr contains 't'
It is possible:
arr[arr.str.contains('t')]
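As a self-contained snippet (a sketch of mine, reusing the question's data):
import pandas as pd

arr = pd.Series(['testing', 'the', 'masking'])
# str.contains builds the boolean mask element-wise, no Python loop needed
print(arr[arr.str.contains('t')])
# 0    testing
# 1        the
# dtype: object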

Return count for specific value in pandas .value_counts()?

Assume running pandas' dataframe['prod_code'].value_counts() and storing result as 'df'. The operation outputs:
125011    90300
762       72816
None      55512
7156      14892
75162      8825
How would I extract the count for None? I'd expect the result to be 55512.
I've tried
>>> df.loc[df.index.isin(['None'])]
>>> Series([], Name: prod_code, dtype: int64)
and also
>>> df.loc['None']
>>> KeyError: 'the label [None] is not in the [index]'
It seems you need None, not string 'None':
df.loc[df.index.isin([None])]
df.loc[None]
EDIT:
If you need to check where NaN is in the index:
print (s1.loc[np.nan])
#or
print (df[pd.isnull(df.index)])
Sample:
s = pd.Series(['90300', '90300', '8825', '8825', '8825', None, np.nan])
s1 = s.value_counts(dropna=False)
print (s1)
8825 3
90300 2
NaN 2
dtype: int64
print (s1[pd.isnull(s1.index)])
NaN 2
dtype: int64
print (s1.loc[np.nan])
2
print (s1.loc[None])
2
EDIT 1:
For stripping whitespace:
s = pd.Series(['90300', '90300', '8825', '8825', '8825', 'None ', np.nan])
print (s)
0 90300
1 90300
2 8825
3 8825
4 8825
5 None
6 NaN
dtype: object
s1 = s.value_counts()
print (s1)
8825 3
90300 2
None 1
dtype: int64
s1.index = s1.index.str.strip()
print (s1.loc['None'])
1
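A related option (my own sketch, not part of the original answer) is to strip at the source, before counting:
import numpy as np
import pandas as pd

s = pd.Series(['90300', '90300', '8825', '8825', '8825', 'None ', np.nan])
# strip whitespace element-wise; NaN entries pass through untouched
s1 = s.str.strip().value_counts()
print (s1.loc['None'])
1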
A couple of things:
pd.Series([None] * 2 + [1] * 3).value_counts() automatically drops the None.
pd.Series([None] * 2 + [1] * 3).value_counts(dropna=False) converts the None to np.NaN.
That tells me that your None is a string. But since df.loc['None'] didn't work, I suspect your string has whitespace around it.
Try:
df.filter(regex='None', axis=0)
Or:
df.index = df.index.to_series().str.strip().combine_first(df.index.to_series())
df.loc['None']
All that said, I was curious how to reference np.NaN in the index:
s = pd.Series([1, 2], [0, np.nan])
s.iloc[s.index.get_loc(np.nan)]
2

After rename column get keyerror

I have df:
df = pd.DataFrame({'a': [7, 8, 9],
                   'b': [1, 3, 5],
                   'c': [5, 3, 6]})
print (df)
a b c
0 7 1 5
1 8 3 3
2 9 5 6
Then rename the first column like this:
df.columns.values[0] = 'f'
All seems very nice:
print (df)
f b c
0 7 1 5
1 8 3 3
2 9 5 6
print (df.columns)
Index(['f', 'b', 'c'], dtype='object')
print (df.columns.values)
['f' 'b' 'c']
Selecting b works fine:
print (df['b'])
0 1
1 3
2 5
Name: b, dtype: int64
But selecting a returns column f:
print (df['a'])
0 7
1 8
2 9
Name: f, dtype: int64
And selecting f raises a KeyError:
print (df['f'])
#KeyError: 'f'
print (df.info())
#KeyError: 'f'
What is the problem? Can somebody explain it? Or is it a bug?
You aren't expected to alter the values attribute.
Try df.columns.values = ['a', 'b', 'c'] and you get:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-61-e7e440adc404> in <module>()
----> 1 df.columns.values = ['a', 'b', 'c']
AttributeError: can't set attribute
That's because pandas detects that you are trying to set the attribute and stops you.
However, it can't stop you from changing the underlying values object itself.
When you use rename, pandas follows up with a bunch of clean up stuff. I've pasted the source below.
Ultimately, what you've done is alter the values without initiating the cleanup. You can initiate it yourself with a follow-up call to _data.rename_axis (an example can be seen in the source below). This will force the cleanup to run, and then you can access ['f']:
df._data = df._data.rename_axis(lambda x: x, 0, True)
df['f']
0 7
1 8
2 9
Name: f, dtype: int64
Moral of the story: probably not a great idea to rename a column this way.
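For reference, the supported way to rename (standard pandas API) sidesteps all of this, because rename performs the internal cleanup itself:
import pandas as pd

df = pd.DataFrame({'a': [7, 8, 9],
                   'b': [1, 3, 5],
                   'c': [5, 3, 6]})
# rename returns a new frame with the label changed and caches rebuilt
df = df.rename(columns={'a': 'f'})
df['f']
0    7
1    8
2    9
Name: f, dtype: int64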
but this story gets weirder
This is fine
df = pd.DataFrame({'a': [7, 8, 9],
                   'b': [1, 3, 5],
                   'c': [5, 3, 6]})
df.columns.values[0] = 'f'
df['f']
0 7
1 8
2 9
Name: f, dtype: int64
This is not fine
df = pd.DataFrame({'a': [7, 8, 9],
                   'b': [1, 3, 5],
                   'c': [5, 3, 6]})
print(df)
df.columns.values[0] = 'f'
df['f']
KeyError:
Turns out, we can modify the values attribute before ever displaying df, and the initialization apparently runs on the first display. But if you display it before changing the values attribute, it will error out.
weirder still
df = pd.DataFrame({'a': [7, 8, 9],
                   'b': [1, 3, 5],
                   'c': [5, 3, 6]})
print(df)
df.columns.values[0] = 'f'
df['f'] = 1
df['f']
   f  f
0  7  1
1  8  1
2  9  1
As if we didn't already know that this was a bad idea...
source for rename
def rename(self, *args, **kwargs):
    axes, kwargs = self._construct_axes_from_arguments(args, kwargs)
    copy = kwargs.pop('copy', True)
    inplace = kwargs.pop('inplace', False)

    if kwargs:
        raise TypeError('rename() got an unexpected keyword '
                        'argument "{0}"'.format(list(kwargs.keys())[0]))

    if com._count_not_none(*axes.values()) == 0:
        raise TypeError('must pass an index to rename')

    # renamer function if passed a dict
    def _get_rename_function(mapper):
        if isinstance(mapper, (dict, ABCSeries)):
            def f(x):
                if x in mapper:
                    return mapper[x]
                else:
                    return x
        else:
            f = mapper
        return f

    self._consolidate_inplace()
    result = self if inplace else self.copy(deep=copy)

    # start in the axis order to eliminate too many copies
    for axis in lrange(self._AXIS_LEN):
        v = axes.get(self._AXIS_NAMES[axis])
        if v is None:
            continue
        f = _get_rename_function(v)

        baxis = self._get_block_manager_axis(axis)
        result._data = result._data.rename_axis(f, axis=baxis, copy=copy)
        result._clear_item_cache()

    if inplace:
        self._update_inplace(result._data)
    else:
        return result.__finalize__(self)

Pandas timeseries indexing fails when the index is hierarchical

I tried the following code snippet.
In [84]:
import pandas as pd
from datetime import datetime

rng = [datetime(2017,1,13), datetime(2017,1,14), datetime(2017,2,15), datetime(2017,2,16)]
s = pd.Series([1,2,3,4], index=rng)
s['2017/1']
Out[84]:
2017-01-13 1
2017-01-14 2
dtype: int64
As I expected, I could retrieve only those items belonging to January by specifying the string only up to the month, as in s['2017/1'].
Next, I tried a slightly extended version of the above code, where a hierarchical index was used instead:
import pandas as pd
from datetime import datetime

rng1 = [datetime(2017,1,1), datetime(2017,1,1), datetime(2017,2,1), datetime(2017,2,1)]
rng2 = [datetime(2017,1,13), datetime(2017,1,14), datetime(2017,2,15), datetime(2017,2,16)]
midx = pd.MultiIndex.from_arrays([rng1, rng2])
s = pd.Series([1,2,3,4], index=midx)
s['2017/1']
The above code snippet, however, generates an error:
TypeError: unorderable types: int() > slice()
Would you give me some help?
It seems it is more complicated.
Partial string indexing on a DatetimeIndex that is part of a MultiIndex was implemented for DataFrame in pandas 0.18.
So if you use:
rng1 = [pd.Timestamp(2017,5,1), pd.Timestamp(2017,5,1),
        pd.Timestamp(2017,6,1), pd.Timestamp(2017,6,1)]
rng2 = (pd.date_range('2017-01-13', periods=2).tolist() +
        pd.date_range('2017-02-15', periods=2).tolist())
s = pd.Series([1,2,3,4], index=[rng1, rng2])
print (s)
2017-05-01  2017-01-13    1
            2017-01-14    2
2017-06-01  2017-02-15    3
            2017-02-16    4
dtype: int64
Then the following works for me:
print (s.to_frame().loc[pd.IndexSlice[:, '2017/1'],:].squeeze())
2017-05-01 2017-01-13 1
2017-01-14 2
Name: 0, dtype: int64
print (s.loc['2017/6'])
2017-06-01 2017-02-15 3
2017-02-16 4
dtype: int64
But this returns an empty Series:
print (s.loc[pd.IndexSlice[:, '2017/2']])
Series([], dtype: int64)
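As a workaround (my own sketch, not from the original answer), filtering the second index level explicitly with get_level_values avoids the partial-string limitation:
import pandas as pd

rng1 = [pd.Timestamp(2017,5,1)] * 2 + [pd.Timestamp(2017,6,1)] * 2
rng2 = (pd.date_range('2017-01-13', periods=2).tolist() +
        pd.date_range('2017-02-15', periods=2).tolist())
s = pd.Series([1, 2, 3, 4], index=[rng1, rng2])

# boolean mask over the second level: keep only February 2017 rows
lvl1 = s.index.get_level_values(1)
print (s[(lvl1.year == 2017) & (lvl1.month == 2)])
2017-06-01  2017-02-15    3
            2017-02-16    4
dtype: int64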