convert polygon to multi polygon - pandas

I have a DataFrame as below. If an FID contains more than one polygon, I need to create a MultiPolygon for that FID. In my DataFrame, FID 978 contains two polygons, so it should be converted to a MultiPolygon; otherwise it stays a Polygon.
|FID|Geometry|
|---|--------|
|975|POLYGON ((-94.2019149289999 32.171910245, -94.201889847 32.171799917, -94.2019145369999 32.1719110220001, -94.2019344619999 32.171974117, -94.2019149289999 32.171910245))|
|976|POLYGON ((-94.204485668 32.175813341, -94.2045721649999 32.1758854190001, -94.2044856639999 32.1758124690001, -94.204358881 32.1757171630001, -94.204485668 32.175813341))|
|978|POLYGON ((-94.30755277 32.402906479, -94.321881945 32.4028226820001, -94.321896361 32.4035500580001, -94.3074214489999 32.4037557020001, -94.3075504129999 32.4029064600001, -94.30755277 32.402906479))|
|978|POLYGON ((-94.30755277 32.402906479, -94.307552779 32.4005399370001, -94.307558688 32.4005401040001, -94.30755277 32.402906479))|
I am using the following function to convert polygons to MultiPolygons:
def polygon_to_multipolygon(wkt):
    list_polygons = [wkt.loads(poly) for poly in wkt]
    return shapely.geometry.MultiPolygon(list_polygons)
It looks like the polygons are not being converted to MultiPolygons.

from shapely.wkt import loads
from shapely.ops import unary_union
# Convert from wkt ...
# I think there's a better way to do this with Geopandas, this is pure pandas.
df.Geometry = df.Geometry.apply(loads)
# Use groupby and unary_union to combine Polygons.
df = df.groupby('FID', as_index=False)['Geometry'].apply(unary_union)
print(df)
# Let's print out the multi-polygon to verify
print(df.iat[2,1])
Output:
FID Geometry
0 975 POLYGON ((-94.2019149289999 32.171910245, -94....
1 976 POLYGON ((-94.204485668 32.175813341, -94.2045...
2 978 (POLYGON ((-94.30755277900001 32.4005399370001...
MULTIPOLYGON (((-94.30755277900001 32.4005399370001, -94.307558688 32.4005401040001, -94.30755277 32.402906479, -94.30755277900001 32.4005399370001)), ((-94.321881945 32.4028226820001, -94.321896361 32.4035500580001, -94.3074214489999 32.4037557020001, -94.3075504129999 32.4029064600001, -94.30755277 32.402906479, -94.321881945 32.4028226820001)))
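If geopandas is available, the groupby/unary_union step can also be done in one call with dissolve. This is only a rough sketch of that route (starting again from the original frame of WKT strings; geopandas is an extra dependency the question doesn't require):
import geopandas as gpd
from shapely import wkt

# Parse the WKT strings and promote the frame to a GeoDataFrame.
gdf = gpd.GeoDataFrame(
    df.assign(Geometry=df['Geometry'].apply(wkt.loads)),
    geometry='Geometry',
)

# dissolve() groups by FID and unions each group's geometries,
# yielding a MultiPolygon wherever an FID has more than one polygon.
result = gdf.dissolve(by='FID')
print(result)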

I edited your function to return the MultiPolygon as WKT and to check whether there is more than one polygon in the group.
import pandas as pd
from shapely import wkt, geometry

df = pd.DataFrame({
    'FID': [975, 976, 978, 978],
    'Geometry': [
        'POLYGON ((-94.2019149289999 32.171910245, -94.201889847 32.171799917, -94.2019145369999 32.1719110220001, -94.2019344619999 32.171974117, -94.2019149289999 32.171910245))',
        'POLYGON ((-94.204485668 32.175813341, -94.2045721649999 32.1758854190001, -94.2044856639999 32.1758124690001, -94.204358881 32.1757171630001, -94.204485668 32.175813341))',
        'POLYGON ((-94.30755277 32.402906479, -94.321881945 32.4028226820001, -94.321896361 32.4035500580001, -94.3074214489999 32.4037557020001, -94.3075504129999 32.4029064600001, -94.30755277 32.402906479))',
        'POLYGON ((-94.30755277 32.402906479, -94.307552779 32.4005399370001, -94.307558688 32.4005401040001, -94.30755277 32.402906479))',
    ],
})

def to_multipolygon(polygons):
    return (
        geometry.MultiPolygon([wkt.loads(polygon) for polygon in polygons]).wkt
        if len(polygons) > 1
        else polygons.iloc[0]
    )
result = df.groupby('FID')['Geometry'].apply(lambda x: to_multipolygon(x))
print(result)
Output
FID
975 POLYGON ((-94.2019149289999 32.171910245, -94....
976 POLYGON ((-94.204485668 32.175813341, -94.2045...
978 MULTIPOLYGON (((-94.30755277 32.402906479, -94...
Name: Geometry, dtype: object


alternative way to define a function inside a class method [closed]

I have the following class:
class Analysis():
    def __init__(self, file_dir):
        self.path = file_dir  # file path directory

    def getData(self):
        return pd.read_csv(self.path)  # create a pandas dataframe

    def getStd(self):
        return self.getData().loc['1':'5'].apply(lambda x: x.std())  # calculate the standard deviation of all columns

    def getHighlight(self):
        # a function to highlight the df based on the given condition
        def highlight(x):
            c1 = 'background-color:red'
            c2 = 'background-color:yellow'
            c3 = 'background-color:green'
            # rows over which the highlighting function should apply
            r = ['1', '2', '3', '4', '5']
            # first boolean mask for selecting the df elements
            m1 = (x.loc[r]>x.loc['USL']) | (x.loc[r]<x.loc['LSL'])
            # second boolean mask for selecting the df elements
            m2 = (x.loc[r]==x.loc['USL']) | (x.loc[r]==x.loc['LSL'])
            # DataFrame with same index and column names as original, filled with empty strings
            df1 = pd.DataFrame('', index=x.index, columns=x.columns)
            # modify values of df1 columns by boolean mask
            df1.loc[r, :] = np.select([m1, m2], [c1, c2], default=c3)
            return df1
        # apply the highlight function on the df to get it highlighted
        return self.getData().style.apply(highlight, axis=None)
The getData method returns the df like this:
my_analysis = Analysis(path_to_file)
my_analysis.getData()
A-A A-B A-C A-D A-E
Tg 0.37 10.24 5.02 0.63 20.30
USL 0.39 10.26 5.04 0.65 20.32
LSL 0.35 10.22 5.00 0.63 20.28
1 0.35 10.23 5.05 0.65 20.45
2 0.36 10.19 5.07 0.67 20.25
3 0.34 10.25 5.03 0.66 20.33
4 0.35 10.20 5.08 0.69 20.22
5 0.33 10.17 5.05 0.62 20.40
Max 0.36 10.25 5.08 0.69 20.45
Min 0.33 10.17 5.03 0.62 20.22
The getHighlight method has an inner function which is applied to the df in order to highlight the df elements based on the given masks, and it would output something like this:
my_analysis.getHighlight()
My question is what is the pythonic or elegant way of defining the inner function inside the class method?
Disclaimer: the following remarks represent my opinion about the topic of pythonic code.
Avoid Inner Functions
You should avoid inner functions at all costs. Sometimes they're necessary, but most of the time they're an indication that you might want to refactor your code.
Avoid re-reading multiple times
I would also avoid calling pd.read_csv every time I want to perform some operation on the data. Unless there's a good reason to read the file over and over again, it's more performant to read it once and store it in a class attribute or property.
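As a minimal sketch of that "read once" advice, assuming Python 3.8+ (for functools.cached_property) and mirroring the getData/getStd logic from the question:
from functools import cached_property
import pandas as pd

class Analysis:
    def __init__(self, file_dir):
        self.path = file_dir

    @cached_property
    def data(self) -> pd.DataFrame:
        # pd.read_csv runs only on first access; the result is cached on the instance.
        return pd.read_csv(self.path)

    def get_std(self) -> pd.Series:
        return self.data.loc['1':'5'].std()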
PEP-8 Naming Conventions
Another important thing to consider, if you're trying to make your code more pythonic, is to try to follow the PEP-8 naming conventions, unless you're working on legacy code that does not follow PEP-8.
Class Overkill
Finally, I think that creating a class for what you're doing seems a little overkill. Most of your methods are simply transformations that could easily be converted to functions. Aside from making your code less complex, it would improve its reusability.
How I would write the Analysis class
from __future__ import absolute_import, annotations

from pathlib import Path
from typing import Any, Collection, Iterable, Type, Union

import numpy as np
import pandas as pd
from pandas.core.dtypes.dtypes import ExtensionDtype  # type: ignore

# Custom types for type hinting
Axes = Collection[Any]
NpDtype = Union[
    str, np.dtype, Type[Union[str, float, int, complex, bool, object]]
]
Dtype = Union["ExtensionDtype", NpDtype]
# Auxiliary functions
def is_iterable_not_string(iterable: Any) -> bool:
    """Return True, if `iterable` is an iterable object, and not a string.

    Parameters
    ----------
    iterable: Any
        The object to check whether it's an iterable except for strings,
        or not.

    Returns
    -------
    bool
        True, if object is iterable, but not a string.
        Otherwise, if object isn't an iterable, or if it's a string, return
        False.

    Examples
    --------
    >>> import numpy as np
    >>> import pandas as pd
    >>> class FakeIterable(int):
    ...     def __iter__(self): pass
    >>> print(is_iterable_not_string('abcde'))
    False
    >>> print(is_iterable_not_string(bytes(12345)))
    False
    >>> print(is_iterable_not_string(12345))
    False
    >>> print(is_iterable_not_string(123.45))
    False
    >>> print(is_iterable_not_string(type))
    False
    >>> print(is_iterable_not_string(list))  # Type list isn't iterable
    False
    >>> print(is_iterable_not_string(object))
    False
    >>> print(is_iterable_not_string(None))
    False
    >>> print(is_iterable_not_string(list()))  # Empty list is still iterable
    True
    >>> # `FakeIterable` has a method `__iter__`, therefore it's considered
    >>> # iterable, even though it isn't.
    >>> print(is_iterable_not_string(FakeIterable(10)))
    True
    >>> print(is_iterable_not_string(list('abcde')))
    True
    >>> print(is_iterable_not_string(tuple('abcde')))
    True
    >>> print(is_iterable_not_string(set('abcde')))
    True
    >>> print(is_iterable_not_string(np.array(list('abcdef'))))
    True
    >>> print(is_iterable_not_string({col: [1, 2, 3, 4] for col in 'abcde'}))
    True
    >>> print(is_iterable_not_string(
    ...     pd.DataFrame({col: [1, 2, 3, 4] for col in 'abcde'}))
    ... )
    True
    >>> print(is_iterable_not_string(pd.DataFrame()))
    True

    Notes
    -----
    In python, any object that contains a method called `__iter__` is
    considered an “iterable”. This means that you can, in theory, fake an
    “iterable” object, by creating a method called `__iter__` that doesn't
    contain any real implementation. For a concrete case, see the examples
    section.

    Python common iterable objects:
    - strings
    - bytes
    - lists
    - tuples
    - sets
    - dictionaries

    Python common non-iterable objects:
    - integers
    - floats
    - None
    - types
    - objects
    """
    return (not isinstance(iterable, (bytes, str))
            and isinstance(iterable, Iterable))
def prepare_dict(data: dict) -> dict:
    """Transform non-iterable dictionary values into lists.

    Parameters
    ----------
    data : dict
        The dictionary to convert non-iterable values into lists.

    Returns
    -------
    dict
        Dictionary with non-iterable values converted to lists.

    Examples
    --------
    >>> import pandas as pd
    >>> d = {'a': '1', 'b': 2}
    >>> prepare_dict(d)
    {'a': ['1'], 'b': [2]}
    >>> pd.DataFrame(d)  # doctest: +ELLIPSIS
    Traceback (most recent call last):
    ...
    ValueError: If using all scalar values, you must pass an index
    >>> pd.DataFrame(prepare_dict(d))
       a  b
    0  1  2

    Notes
    -----
    Use this function to prepare dictionaries, before calling
    `pandas.DataFrame`, to make sure all values have the correct format.
    """
    return {
        key: value if is_iterable_not_string(value) else [value]
        for key, value in data.items()
    }
def check_dict_value_lens(data: dict) -> bool:
    """Check whether all values from dictionary have the same length.

    Parameters
    ----------
    data : dict
        The dictionary to check the values lengths.

    Returns
    -------
    bool
        True, if all `data` values have the same length. False otherwise.
    """
    min_len = min(map(lambda value: len(value), data.values()))
    return all(len(value) == min_len for value in data.values())
def read_file(path: Path | str, **kwargs: Any) -> pd.DataFrame:
    """
    Read a DataFrame from a file.

    Supported file types are:
    - `.csv`
    - `.xlsx`, `.xls`, `.xlsm`, `.xlsb` (Excel files)
    - `.json`
    - `.parquet`
    - `.feather`
    - `.html`

    Parameters
    ----------
    path : Path | str
        The path to the file.
    kwargs : Any
        Keyword arguments to pass to pandas io functions.

    Returns
    -------
    pd.DataFrame
        The DataFrame read from the file.

    Raises
    ------
    ValueError
        If the file type is not supported.
    FileNotFoundError
        If the file doesn't exist.
    """
    _path = Path(path)
    path = str(path)
    if not _path.is_file():
        raise FileNotFoundError(f"File {path} does not exist.")
    if _path.suffix in [".csv", ".txt"]:
        return pd.read_csv(path, **kwargs)
    if ".xls" in _path.suffix:
        return pd.read_excel(path, **kwargs)
    if _path.suffix == ".json":
        return pd.read_json(path, **kwargs)
    if _path.suffix == ".pickle":
        return pd.read_pickle(path, **kwargs)
    if _path.suffix == ".html":
        return pd.read_html(path, **kwargs)
    if _path.suffix == ".feather":
        return pd.read_feather(path, **kwargs)
    if _path.suffix in [".parquet", ".pq"]:
        return pd.read_parquet(path, **kwargs)
    raise ValueError(f"File {path} has an unknown extension.")
def highlight(df: pd.DataFrame) -> pd.DataFrame:
    """Highlight a DataFrame.

    Parameters
    ----------
    df : pd.DataFrame
        The DataFrame to highlight. Required indexes:
        - ["USL", "LSL", "1", "2", "3", "4", "5"]

    Returns
    -------
    pd.DataFrame
        The DataFrame with highlighted rows.
    """
    # The dataframe cells background colors.
    c1: str = "background-color:red"
    c2: str = "background-color:yellow"
    c3: str = "background-color:green"
    # Rows over which the highlighting function should apply
    rows: list[str] = ["1", "2", "3", "4", "5"]
    # First boolean mask for selecting the df elements
    m1 = (df.loc[rows] > df.loc["USL"]) | (df.loc[rows] < df.loc["LSL"])
    # Second boolean mask for selecting the df elements
    m2 = (df.loc[rows] == df.loc["USL"]) | (df.loc[rows] == df.loc["LSL"])
    # DataFrame with the same index and column names as the original,
    # but filled with empty strings.
    df_highlight = pd.DataFrame("", index=df.index, columns=df.columns)
    # Change values of df_highlight columns by boolean mask
    df_highlight.loc[rows, :] = np.select(
        [m1, m2], [c1, c2], default=c3
    )
    return df_highlight
class Analysis:
    """
    Read a dataframe, and help performing some analysis in the data.

    Parameters
    ----------
    path_or_data : str | Path | pd.DataFrame
        The path to a file, or a dataframe to analyze.

    Attributes
    ----------
    _data : pd.DataFrame
        The data read from the file.
    _path : str | Path
        The path to the file.

    Examples
    --------
    >>> data = {
    ...     'A-A': [
    ...         0.37, 0.39, 0.35, 0.35, 0.36, 0.34, 0.35, 0.33, 0.36, 0.33,
    ...     ],
    ...     'A-B': [
    ...         10.24, 10.26, 10.22, 10.23, 10.19, 10.25, 10.2, 10.17, 10.25,
    ...         10.17,
    ...     ],
    ...     'A-C': [
    ...         5.02, 5.04, 5.0, 5.05, 5.07, 5.03, 5.08, 5.05, 5.08, 5.03,
    ...     ],
    ...     'A-D': [
    ...         0.63, 0.65, 0.63, 0.65, 0.67, 0.66, 0.69, 0.62, 0.69, 0.62,
    ...     ],
    ...     'A-E': [
    ...         20.3, 20.32, 20.28, 20.45, 20.25, 20.33, 20.22, 20.4,
    ...         20.45, 20.22,
    ...     ],
    ... }
    >>> index = ['Tg', 'USL', 'LSL', '1', '2', '3', '4', '5', 'Max', 'Min']
    >>> analysis = Analysis.from_dict(data, index=index)
    >>> analysis.get_std()
    A-A    0.011402
    A-B    0.031937
    A-C    0.019494
    A-D    0.025884
    A-E    0.097211
    dtype: float64
    """

    _path: Path | str | None = None
    _data: pd.DataFrame | None = None

    @property
    def path(self) -> str | Path:
        """Get the path to the file.

        Returns
        -------
        str | Path
            The path to the file.

        Raises
        ------
        ValueError
            If `_path` is `None`.
        """
        if self._path is None:
            raise ValueError("Path not set.")
        return str(self._path)

    @path.setter
    def path(self, path: str | Path):
        """Set the path of the file to analyze.

        Parameters
        ----------
        path : str | Path
            The path of the file to analyze.
            Path should point to a `.csv` file.

        Raises
        ------
        FileNotFoundError
            If the path is not found.
        """
        _path = Path(path)
        if _path.is_file():
            self._path = str(path)
        else:
            raise FileNotFoundError(f"Path {path} does not exist.")

    @property
    def data(self) -> pd.DataFrame:
        """Dataframe read from `path`.

        Returns
        -------
        pd.DataFrame
            The dataframe read from `path`.
        """
        if self._data is None:
            self._data = self.get_data()
        return self._data

    @data.setter
    def data(self, data: pd.DataFrame):
        """Set the data to analyze.

        Parameters
        ----------
        data : pd.DataFrame
            The data to analyze.
        """
        self._data = data

    def __init__(self, path_or_data: str | Path | pd.DataFrame):
        """Initialize the Analyzer.

        Parameters
        ----------
        path_or_data : str | Path | pd.DataFrame
            The path to a file, or a dataframe to analyze.

        Raises
        ------
        ValueError
            If `path_or_data` is not a `str`, `Path`, or `pd.DataFrame`.
        """
        if isinstance(path_or_data, (str, Path)):
            self.path = path_or_data
        elif isinstance(path_or_data, pd.DataFrame):
            self.data = path_or_data
        else:
            raise ValueError(f"Invalid type {type(path_or_data)}.")

    def get_data(self) -> pd.DataFrame:
        """Read the data from the file.

        Returns
        -------
        pd.DataFrame
            The dataframe read from the `path` property.
        """
        return read_file(self.path)

    def get_std(self) -> pd.Series:
        """Calculate the standard deviation of every column.

        Returns
        -------
        pd.Series
            The standard deviation of every column.
        """
        return self.data.loc["1":"5"].apply(lambda x: x.std())  # type: ignore

    def highlight_frame(
        self, round_values: int | None = None
    ) -> pd.io.formats.style.Styler:  # type: ignore
        """Highlight dataframe, based on some condition.

        Parameters
        ----------
        round_values: int | None
            If defined, sets the precision of the Styler object with the
            highlighted dataframe.

        Returns
        -------
        pd.io.formats.style.Styler
            The Styler object with the highlighted dataframe.
        """
        highlight_df = self.data.style.apply(highlight, axis=None)
        if isinstance(round_values, int) and round_values >= 0:
            return highlight_df.format(precision=round_values)
        return highlight_df

    @classmethod
    def from_dict(
        cls,
        data: dict,
        index: Axes | None = None,
        columns: Axes | None = None,
        dtype: Dtype | None = None,
    ) -> Analysis:
        """Create an Analysis object from a dictionary.

        Parameters
        ----------
        data : dict
            The dictionary to create the Analysis object from.
        index : Index or array-like
            Index to use for resulting frame. Defaults to RangeIndex, if
            no indexing information part of input data and no index provided.
        columns : Index or array-like
            Column labels to use for resulting frame when data doesn't have
            them, defaulting to RangeIndex(0, 1, 2, ..., n).
            If data contains column labels, will perform column selection
            instead.
        dtype : dtype, default None
            Data type to force. Only a single dtype allowed. If None, infer.

        Returns
        -------
        Analysis
            An instance of the `Analysis` class.

        Raises
        ------
        ValueError
            If dictionary values have different lengths.
        """
        data = prepare_dict(data)
        if check_dict_value_lens(data):
            return cls(
                pd.DataFrame(data, index=index, columns=columns, dtype=dtype)
            )
        raise ValueError(
            f"Dictionary values don't have the same lengths.\nData: {data}"
        )


if __name__ == "__main__":
    import doctest

    doctest.testmod()
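A short usage sketch of the class above (the file name is a placeholder, and it assumes a CSV shaped like the table in the question):
analysis = Analysis("measurements.csv")
print(analysis.get_std())
styled = analysis.highlight_frame(round_values=2)  # a pandas Styler, e.g. for notebook display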

Merging GeoDataFrames - TypeError: float() argument must be a string or a number, not 'Point'

I have a dataframe in which one of the columns holds a Series of shapely Points, and another dataframe in which I have a Series of Polygons.
df.head()
hash number street unit \
2024459 283e04eca5c4932a SN AVENIDA DOUTOR SEVERIANO DE ALMEIDA NaN
2024460 1a92a1c3cba7941a 485 AVENIDA DOUTOR SEVERIANO DE ALMEIDA NaN
2024461 837341c45de519a3 475 AVENIDA DOUTOR SEVERIANO DE ALMEIDA NaN
city district region postcode id geometry
2024459 Jaguari NaN RS 97760-000 NaN POINT (-54.69445 -29.49421)
2024460 Jaguari NaN RS 97760-000 NaN POINT (-54.69445 -29.49421)
2024461 Jaguari NaN RS 97760-000 NaN POINT (-54.69445 -29.49421)
poly_df.head()
centroids geometry
0 POINT (-29.31067315122428 -54.64176359828149) POLYGON ((-54.64069 -29.31161, -54.64069 -29.3...
1 POINT (-29.31067315122428 -54.63961783106958) POLYGON ((-54.63854 -29.31161, -54.63854 -29.3...
2 POINT (-29.31067315122428 -54.637472063857665) POLYGON ((-54.63640 -29.31161, -54.63640 -29.3...
I'm checking whether the Point belongs to the Polygon and inserting the Point object into the corresponding cell of the other dataframe. However, I'm getting the following error:
Traceback (most recent call last):
File "/tmp/ipykernel_4771/1967309101.py", line 1, in <module>
df.loc[idx, 'centroids'] = poly_mun.loc[ix, 'centroids']
File ".local/lib/python3.8/site-packages/pandas/core/indexing.py", line 692, in __setitem__
iloc._setitem_with_indexer(indexer, value, self.name)
File ".local/lib/python3.8/site-packages/pandas/core/indexing.py", line 1599, in _setitem_with_indexer
self.obj[key] = infer_fill_value(value)
File ".local/lib/python3.8/site-packages/pandas/core/dtypes/missing.py", line 516, in infer_fill_value
val = np.array(val, copy=False)
TypeError: float() argument must be a string or a number, not 'Point'
I'm using the following line:
df.loc[idx, 'centroids'] = poly_df.loc[ix, 'centroids']
I have already tried .at as well.
Thanks
You can't create a new column in pandas with a shapely geometry using loc:
In [1]: import pandas as pd, shapely.geometry
In [2]: df = pd.DataFrame({'mycol': [1, 2, 3]})
In [3]: df.loc[0, "centroid"] = shapely.geometry.Point([0, 0])
/Users/mikedelgado/opt/miniconda3/envs/rhodium-env/lib/python3.10/site-packages/pandas/core/indexing.py:1642: ShapelyDeprecationWarning: The array interface is deprecated and will no longer work in Shapely 2.0. Convert the '.coords' to a numpy array instead.
self.obj[key] = infer_fill_value(value)
/Users/mikedelgado/opt/miniconda3/envs/rhodium-env/lib/python3.10/site-packages/pandas/core/dtypes/missing.py:550: FutureWarning: The input object of type 'Point' is an array-like implementing one of the corresponding protocols (`__array__`, `__array_interface__` or `__array_struct__`); but not a sequence (or 0-D). In the future, this object will be coerced as if it was first converted using `np.array(obj)`. To retain the old behaviour, you have to either modify the type 'Point', or assign to an empty array created with `np.empty(correct_shape, dtype=object)`.
val = np.array(val, copy=False)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Input In [3], in <cell line: 1>()
----> 1 df.loc[0, "centroid"] = shapely.geometry.Point([0, 0])
File ~/opt/miniconda3/envs/rhodium-env/lib/python3.10/site-packages/pandas/core/indexing.py:716, in _LocationIndexer.__setitem__(self, key, value)
713 self._has_valid_setitem_indexer(key)
715 iloc = self if self.name == "iloc" else self.obj.iloc
--> 716 iloc._setitem_with_indexer(indexer, value, self.name)
File ~/opt/miniconda3/envs/rhodium-env/lib/python3.10/site-packages/pandas/core/indexing.py:1642, in _iLocIndexer._setitem_with_indexer(self, indexer, value, name)
1639 self.obj[key] = empty_value
1641 else:
-> 1642 self.obj[key] = infer_fill_value(value)
1644 new_indexer = convert_from_missing_indexer_tuple(
1645 indexer, self.obj.axes
1646 )
1647 self._setitem_with_indexer(new_indexer, value, name)
File ~/opt/miniconda3/envs/rhodium-env/lib/python3.10/site-packages/pandas/core/dtypes/missing.py:550, in infer_fill_value(val)
548 if not is_list_like(val):
549 val = [val]
--> 550 val = np.array(val, copy=False)
551 if needs_i8_conversion(val.dtype):
552 return np.array("NaT", dtype=val.dtype)
TypeError: float() argument must be a string or a real number, not 'Point'
Essentially, pandas doesn't know how to interpret a point object, and so creates a float column with NaNs, and then can't handle the point. This might get fixed in the future, but you're best off explicitly defining the column as object dtype:
In [27]: df['centroid'] = None
In [28]: df['centroid'] = df['centroid'].astype(object)
In [29]: df
Out[29]:
mycol centroid
0 1 None
1 2 None
2 3 None
In [30]: df.loc[0, "centroid"] = shapely.geometry.Point([0, 0])
/Users/mikedelgado/opt/miniconda3/envs/rhodium-env/lib/python3.10/site-packages/pandas/core/internals/managers.py:304: ShapelyDeprecationWarning: The array interface is deprecated and will no longer work in Shapely 2.0. Convert the '.coords' to a numpy array instead.
applied = getattr(b, f)(**kwargs)
In [31]: df
Out[31]:
mycol centroid
0 1 POINT (0 0)
1 2 None
2 3 None
That said, joining two GeoDataFrames with polygons and points based on whether the points are in the polygons certainly sounds like a job for geopandas.sjoin:
union = gpd.sjoin(polygon_df, points_df, op='contains')
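A rough sketch of that join, assuming both frames hold shapely geometries in the same CRS (EPSG:4326 is only a guess here) and using the newer predicate keyword (geopandas >= 0.10; older versions use op as above):
import geopandas as gpd

# Promote both plain DataFrames to GeoDataFrames with an explicit CRS.
points_gdf = gpd.GeoDataFrame(df, geometry='geometry', crs='EPSG:4326')
polys_gdf = gpd.GeoDataFrame(poly_df, geometry='geometry', crs='EPSG:4326')

# One row per point; the matching polygon's columns (including 'centroids')
# are attached wherever a point falls inside a polygon.
joined = gpd.sjoin(points_gdf, polys_gdf, how='left', predicate='within')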

input must be an array, list, tuple or scalar pyproj

I have a DataFrame in which I am trying to convert the eastings/northings to longs/lats. My df looks like this:
import pandas as pd
import numpy as np
import pyproj
Postcode Eastings Northings
0 AB101AB 394235 806529
1 AB101AF 394181 806429
2 AB101AG 394230 806469
3 AB101AH 394371 806359
4 AB101AL 394296 806581
I am using a well-known code block to convert the eastings and northings to longs/lats and add those longs/lats as new columns to the df:
def proj_transform(df):
    bng = pyproj.Proj("+init=EPSG:27700")
    wgs84 = pyproj.Proj("+init=EPSG:4326")
    lats = pd.Series()
    lons = pd.Series()
    for idx, val in enumerate(df['Eastings']):
        lon, lat = pyproj.transform(bng, wgs84, df['Eastings'][idx], df['Northings'][idx])
        lats.set_value(idx, lat)
        lons.set_value(idx, lon)
    df['lat'] = lats
    df['lon'] = lons
    return df
df_transform = proj_transform(my_df)
However, I keep getting the following error, "input must be an array, list, tuple or scalar". Does anyone have any insight into where I am going wrong here?
This is the fastest method:
https://gis.stackexchange.com/a/334307/144357
from pyproj import Transformer
trans = Transformer.from_crs(
"EPSG:27700",
"EPSG:4326",
always_xy=True,
)
xx, yy = trans.transform(my_df["Eastings"].values, my_df["Northings"].values)
my_df["X"] = xx
my_df["Y"] = yy
Also helpful for reference:
https://pyproj4.github.io/pyproj/stable/gotchas.html#upgrading-to-pyproj-2-from-pyproj-1
https://pyproj4.github.io/pyproj/stable/gotchas.html#init-auth-auth-code-should-be-replaced-with-auth-auth-code
https://pyproj4.github.io/pyproj/stable/gotchas.html#axis-order-changes-in-proj-6
You can use DataFrame.apply with axis=1 and change the function like this:
def proj_transform(x):
    e = x['Eastings']
    n = x['Northings']
    bng = pyproj.Proj("+init=EPSG:27700")
    wgs84 = pyproj.Proj("+init=EPSG:4326")
    lon, lat = pyproj.transform(bng, wgs84, e, n)
    return pd.Series([lon, lat])

my_df[['lon', 'lat']] = my_df.apply(proj_transform, axis=1)
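If you prefer the row-wise apply but want to avoid rebuilding the projection objects on every row, a small sketch (pyproj 2+; column names taken from the question) creates the Transformer once and reuses it:
from pyproj import Transformer
import pandas as pd

# Built once; constructing Proj/Transformer objects per row is the slow part.
transformer = Transformer.from_crs("EPSG:27700", "EPSG:4326", always_xy=True)

def proj_transform_row(row):
    # always_xy=True: input is (easting, northing), output is (lon, lat).
    lon, lat = transformer.transform(row['Eastings'], row['Northings'])
    return pd.Series({'lon': lon, 'lat': lat})

my_df[['lon', 'lat']] = my_df.apply(proj_transform_row, axis=1)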

ValueError: np.nan is an invalid document, expected byte or unicode string

I am trying to perform sentiment analysis on Uber reviews. I have used Naive Bayes from sklearn to perform the sentiment analysis, and I used training data from Kaggle on reviews.
The test data, however, is in an xlsx sheet, so I used pandas to create a data frame:
import pandas as pd
test=pd.read_excel("uber.xlsx",sep="\t",encoding="ISO-8859-1");
test.head(3)
As it returned a Series of dtype: object, I transformed it to a list using this:
test_text = []
for comments in comments_t:
    test_text.append(comments)
My code for classifying text based on training data:
# Training Phase
from sklearn.naive_bayes import BernoulliNB
classifier = BernoulliNB().fit(train_documents,labels)
def sentiment(word):
    return classifier.predict(count_vectorizer.transform([word]))
but while predicting it returns this ValueError:
/anaconda3/lib/python3.7/site-packages/sklearn/feature_extraction/text.py in transform(self, raw_documents)
1084
1085 # use the same matrix-building strategy as fit_transform
-> 1086 _, X = self._count_vocab(raw_documents, fixed_vocab=True)
1087 if self.binary:
1088 X.data.fill(1)
/anaconda3/lib/python3.7/site-packages/sklearn/feature_extraction/text.py in _count_vocab(self, raw_documents, fixed_vocab)
940 for doc in raw_documents:
941 feature_counter = {}
--> 942 for feature in analyze(doc):
943 try:
944 feature_idx = vocabulary[feature]
/anaconda3/lib/python3.7/site-packages/sklearn/feature_extraction/text.py in <lambda>(doc)
326 tokenize)
327 return lambda doc: self._word_ngrams(
--> 328 tokenize(preprocess(self.decode(doc))), stop_words)
329
330 else:
/anaconda3/lib/python3.7/site-packages/sklearn/feature_extraction/text.py in decode(self, doc)
141
142 if doc is np.nan:
--> 143 raise ValueError("np.nan is an invalid document, expected byte or "
144 "unicode string.")
145
ValueError: np.nan is an invalid document, expected byte or unicode string.
I tried to solve it according to this:
https://stackoverflow.com/questions/39303912/tfidfvectorizer-in-scikit-learn-valueerror-np-nan-is-an-invalid-document
The data that I found on Kaggle for Uber is https://www.kaggle.com/purvank/uber-rider-reviews-dataset/downloads/Uber_Ride_Reviews.csv/2
Now, coming to your problem:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
df = pd.read_csv('Uber_Ride_Reviews.csv')
df.head()
Out[7]:
ride_review ... sentiment
0 I completed running New York Marathon requeste... ... 0
1 My appointment time auto repairs required earl... ... 0
2 Whether I using Uber ride service Uber Eats or... ... 0
3 Why hard understand I trying retrieve Uber cab... ... 0
4 I South Beach FL I staying major hotel ordered... ... 0
df.columns
Out[8]: Index(['ride_review', 'ride_rating', 'sentiment'], dtype='object')
vect = CountVectorizer()
vect1 = vect.fit_transform(df['ride_review'])
classifier = BernoulliNB()
classifier.fit(vect1,df['sentiment'])
# predicting a new comment gives this output
new_test_ = vect.transform(['uber ride is very good'])
classifier.predict(new_test_)
Out[5]: array([0], dtype=int64)

# but when applying your function `sentiment` you are only passing the word; you need to
# pass the classifier as well as the CountVectorizer instance
def sentiment(word, classifier, vect):
    return classifier.predict(vect.transform([word]))

# calling the above function for a new sentiment
sentiment('uber ride is very good', classifier, vect)
O/p --> Out[10]: array([0], dtype=int64)
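Coming back to the original np.nan error: CountVectorizer.transform expects every document to be a string, so a single missing cell in the review column raises exactly that ValueError. A minimal cleanup sketch (the column name ride_review comes from the Kaggle file above; adjust it to whatever your xlsx actually uses):
import pandas as pd

test = pd.read_excel('uber.xlsx')

# Drop rows with a missing review and force the rest to str,
# so CountVectorizer.transform never sees np.nan.
test = test.dropna(subset=['ride_review'])
test_text = test['ride_review'].astype(str).tolist()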

Convert geojson to geopandas dataframe

I am using the OpenRouteService API and I am trying to convert the GeoJSON result of the GET Directions service to a GeoPandas dataframe and, in the end, store it as a spatial PostGIS table. My code so far is:
import pandas as pd
import geopandas as gpd
import sqlalchemy as sa
import openrouteservice

def getroute(lon1, lat1, lon2, lat2):
    coords = ((lon1, lat1), (lon2, lat2))
    params_route = {'profile': 'foot-walking',
                    'coordinates': coords,
                    'format_out': 'geojson',
                    'geometry': 'true', 'geometry_simplify': 'true',
                    'geometry_format': 'geojson',
                    'instructions': 'false',
                    }
    geometry = client.directions(**params_route)['features']
    print geometry
    return geometry
# Creating SQLAlchemy's engine to use
client = openrouteservice.Client(key='myapikey')
lon1=8.34234
lat1=48.23424
lon2=8.34423
lat2=48.26424
myroutes = getroute(lon1, lat1, lon2, lat2)
print myroutes
print type(myroutes)
myroutes = gpd.GeoDataFrame(myroutes)
print myroutes
engine = sa.create_engine('postgresql+psycopg2://username:password@host/database', encoding = 'utf-8')
with engine.connect() as conn, conn.begin():
    # Note use of regular Pandas `to_sql()` method.
    myroutes['geometry'].to_sql('contents', con=conn, schema='schema', if_exists='replace', index=False)
However, I can't seem to get past the GeoJSON structure and store it. Can anyone help me? The resulting error is:
sqlalchemy.exc.ProgrammingError: (psycopg2.ProgrammingError) can't adapt type 'dict' [SQL: 'INSERT INTO paa.contents (geometry) VALUES (%(geometry)s)'] [parameters: {'geometry': {u'type': u'LineString', u'coordinates': [[8.344268, 48.233826], [8.344147, 48.233507], [8.344098, 48.233435], [8.343945, 48.233136], [8.343853, 48.233047], [8.34332, 48.232736], [8.343098, 48.232473], [8.342861, 48.232307], [8.342711, 48.23224], [8.342328, 48.232159], [8.342045, 48.23209], [8.341843, 48.232035], [8.341711, 48.231946], [8.341092, 48.232163], [8.340386, 48.232388], [8.34028, 48.23245], [8.339983, 48.23274], [8.339451, 48.23315], [8.3393, 48.233316], [8.339219, 48.233457], [8.339185, 48.233646], [8.339372, 48.234329], [8.339367, 48.234539], [8.339262, 48.234685], [8.338886, 48.234971], [8.338431, 48.235181], [8.338327, 48.23528], [8.338234, 48.235495], [8.338176, 48.235798], [8.338105, 48.235955], [8.337919, 48.236102], [8.33725, 48.236483], [8.336922, 48.236771], [8.336726, 48.237039], [8.336421, 48.237391], [8.33621, 48.237641], [8.336115, 48.237759], [8.335913, 48.237947], [8.335782, 48.23804], [8.335572, 48.238146], [8.335367, 48.238292], [8.335175, 48.238458], [8.335038, 48.238638], [8.335097, 48.238674], [8.335049, 48.238932], [8.335044, 48.239155], [8.334709, 48.239726], [8.334583, 48.239904], [8.33455, 48.240095], [8.334344, 48.240506], [8.334089, 48.240776], [8.334175, 48.240817], [8.334326, 48.240799], [8.334562, 48.240779], [8.335146, 48.240961], [8.335056, 48.241105], [8.334592, 48.241447], [8.334338, 48.241616], [8.333982, 48.241818], [8.333449, 48.242185], [8.333166, 48.242623], [8.333047, 48.242774], [8.33289, 48.242884], [8.332437, 48.243097], [8.332313, 48.243212], [8.332203, 48.2434], [8.332093, 48.243811], [8.331966, 48.244102], [8.331775, 48.244413], [8.331649, 48.244575], [8.331717, 48.24471], [8.331836, 48.244822], [8.332961, 48.245226], [8.33325, 48.245292], [8.333439, 48.245365], [8.333781, 48.245519], [8.334241, 48.245794], [8.334417, 48.245979], [8.333901, 48.246311], [8.33362, 48.246637], [8.33304, 48.246836], [8.332729, 48.247071], [8.332437, 48.247353], [8.332278, 48.247583], [8.332271, 48.247685], [8.332345, 48.247923], [8.332441, 48.248093], [8.332291, 48.248137], [8.331258, 48.248526], [8.330556, 48.248909], [8.329865, 48.249228], [8.329128, 48.249545], [8.328832, 48.249737], [8.328606, 48.249949], [8.328412, 48.250198], [8.328342, 48.250322], [8.328084, 48.250757], [8.327975, 48.25103], [8.32782, 48.251499], [8.327715, 48.251941], [8.327707, 48.252051], [8.327735, 48.252168], [8.327871, 48.252433], [8.328022, 48.252827], [8.328051, 48.252982], [8.328067, 48.253367], [8.328094, 48.253482], [8.328188, 48.253678], [8.328516, 48.253748], [8.329388, 48.253956], [8.329619, 48.25405], [8.32993, 48.254114], [8.330179, 48.254184], [8.330565, 48.254448], [8.33078, 48.254627], [8.330909, 48.254812], [8.331049, 48.255072], [8.331165, 48.255189], [8.331417, 48.25535], [8.331592, 48.255536], [8.331745, 48.255884], [8.331778, 48.256163], [8.331733, 48.256781], [8.331604, 48.257332], [8.332141, 48.257903], [8.332452, 48.258317], [8.332688, 48.258781], [8.332668, 48.259148], [8.332765, 48.259448], [8.33286, 48.259582], [8.333589, 48.259789], [8.333881, 48.259898], [8.334074, 48.259932], [8.334615, 48.260141], [8.334832, 48.260261], [8.335546, 48.260712], [8.335655, 48.260829], [8.335753, 48.260994], [8.335783, 48.261319], [8.33623, 48.261624], [8.337095, 48.261891], [8.337525, 48.262004], [8.33783, 48.262411], [8.337898, 48.262441], [8.337994, 48.262433], [8.338356, 48.26232], 
[8.338735, 48.262012], [8.339091, 48.261771], [8.339439, 48.261581], [8.339604, 48.261778], [8.339748, 48.261829], [8.339754, 48.26183], [8.33996, 48.262052], [8.340984, 48.262661], [8.341287, 48.262828], [8.341604, 48.262945], [8.342296, 48.263073], [8.343026, 48.263176], [8.343188, 48.263176], [8.343387, 48.263132], [8.3438, 48.262989], [8.343999, 48.26297], [8.344228, 48.263014], [8.344626, 48.263142], [8.344987, 48.263166], [8.345244, 48.263242], [8.344865, 48.263233], [8.344067, 48.263207], [8.343897, 48.263233], [8.343478, 48.263529], [8.343433, 48.263552]]}}]
Finally got it:
geometry = client.directions(**params_route)['routes'][0]
geometry = pd.DataFrame({k : pd.Series(v) for k, v in geometry.iteritems()})
geometry = geometry[:-2]
geometry['coordinates'] = geometry['geometry'].apply(Point)
geometry['myline'] = 1
geometry = gpd.GeoDataFrame(geometry, geometry='coordinates')
geometry = geometry.groupby('myline')['geometry'].apply(lambda x: LineString(x.tolist()))
geometry = gpd.GeoDataFrame(geometry, geometry='geometry')
myroute = LineString(geometry['geometry'].iloc[0]).wkb_hex
# Update table:
insert_query = """UPDATE schema.contents SET geom = ST_GeomFromWKB(%(geometry)s::geometry, 4326) WHERE id='1'"""
engine.execute(insert_query, geometry=myroute)
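For reference, newer geopandas can shortcut most of this: a GeoJSON FeatureCollection loads directly with GeoDataFrame.from_features, and to_postgis writes it to PostGIS (geopandas >= 0.8 plus the GeoAlchemy2 package). A sketch, assuming route_geojson is the FeatureCollection returned with format_out='geojson' and that the connection string and table names are placeholders:
import geopandas as gpd
import sqlalchemy as sa

gdf = gpd.GeoDataFrame.from_features(route_geojson['features'], crs='EPSG:4326')

engine = sa.create_engine('postgresql+psycopg2://username:password@host/database')
# Writes the geometry as a PostGIS geometry column; requires GeoAlchemy2.
gdf.to_postgis('contents', engine, schema='schema', if_exists='replace')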