I have two GeoPandas dataframes. The schema of the first looks like this:
inventoryid object
dsuid object
basinquantum object
reservoir object
geometry geometry
crs_epsg object
buffer_dist float64
buffer geometry
The schema of the second dataframe looks like this:
API12 object
geometry geometry
Basin object
Since the first dataframe has two geometry columns, I am setting the active geometry to the buffer column:
wells1=wells1.set_geometry("buffer")
Then I perform an intersection operation:
res_intersection = gpd.overlay(wells2,wells1,how='intersection')
Although the geometry column is present, I am still getting an error:
"['geometry'] not found in axis"
Creating a reproducible section of code that matches your description, this does fail in my environment, although with a different error when pygeos is enabled: GEOSException: IllegalArgumentException: Argument must be Polygonal or LinearRing.
I will work on this further and, if necessary, raise a bug against geopandas. I will try against the main branch, not just v0.11, and investigate other versions of pygeos and GEOS.
The code below does work with rtree (pygeos disabled):
import geopandas as gpd
# disable pygeos so the shapely/rtree backend is used
pygeos = False
gpd._compat.set_use_pygeos(pygeos)
world = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))
cities = gpd.read_file(gpd.datasets.get_path("naturalearth_cities"))
# construct geodataframes logically equivalent to question
# two geometry columns
wells1 = cities.iloc[[11, 36, 50, 113, 158, 161, 172, 199]].copy()
wells1["buffer"] = (
    wells1.to_crs(wells1.estimate_utm_crs()).buffer(1.5 * 10**5).to_crs(wells1.crs)
)
wells1 = wells1.set_geometry("buffer")
wells2 = world.iloc[[43, 54, 55, 58, 82, 129, 130]].copy()
gdf_i = gpd.overlay(wells2, wells1, how="intersection")
gdf_i.explore(height=300, width=500)
output
Environment
gpd.show_versions()
SYSTEM INFO
python : 3.10.6 (main, Aug 30 2022, 05:11:14) [Clang 13.0.0 (clang-1300.0.29.30)]
machine : macOS-11.6.8-x86_64-i386-64bit
GEOS, GDAL, PROJ INFO
GEOS : 3.11.0
GEOS lib : /usr/local/Cellar/geos/3.11.0/lib/libgeos_c.dylib
GDAL : 3.4.1
PROJ : 9.1.0
PYTHON DEPENDENCIES
geopandas : 0.11.1
numpy : 1.23.3
pandas : 1.4.4
pyproj : 3.4.0
shapely : 1.8.4
fiona : 1.8.21
geoalchemy2: None
geopy : None
matplotlib : 3.5.3
mapclassify: 2.4.3
pygeos : 0.13
pyogrio : None
psycopg2 : None
pyarrow : None
rtree : 1.0.0
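Separately, a possible workaround to try (a sketch using the column names from your question, not something I have verified): keep a single geometry column named geometry in the buffered frame before calling overlay, so no other geometry-typed column is left for the internal column handling to trip over.
# hedged sketch: drop the point geometry and promote the buffer column
wells1_poly = (
    wells1.drop(columns="geometry")
          .rename(columns={"buffer": "geometry"})
          .set_geometry("geometry")
)
res_intersection = gpd.overlay(wells2, wells1_poly, how="intersection")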
Related
I'm trying to convert a Pandas dataframe to a Pyspark dataframe, and getting the following pyarrow-related error:
import pandas as pd
import numpy as np
data = np.random.rand(1000000, 10)
pdf = pd.DataFrame(data, columns=list("abcdefghij"))
df = spark.createDataFrame(pdf)
/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/session.py:714: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true; however, failed by the reason below:
'JavaPackage' object is not callable
Attempting non-optimization as 'spark.sql.execution.arrow.fallback.enabled' is set to true.
I've tried different versions of pyarrow (0.10.0, 0.14.1, 0.15.1 and more) but with the same result. How can I debug this?
I had the same issue; changing the cluster setting to emr-5.30.1 and the arrow version to 0.14.1 resolved it.
Can you try upgrading your pyspark to >= 3.0.0? I had the above error with all versions of arrow, but bumping to the newer pyspark fixed it for me.
There is a version conflict between older versions of Spark (e.g. 2.4.x) and newer versions of arrow.
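As a quick sanity check (a sketch that assumes the SparkSession named spark from the question), confirm which versions are actually loaded; on Spark >= 3.0 the Arrow optimization also has a newer config key:
import pyspark
import pyarrow

# Spark 2.4.x only works with older pyarrow releases; Spark >= 3.0 supports newer ones
print(pyspark.__version__)
print(pyarrow.__version__)

# on Spark >= 3.0 the Arrow toggle is spark.sql.execution.arrow.pyspark.enabled
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
df = spark.createDataFrame(pdf)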
I wrote the following code to illustrate my problem. I'm using Python 3.6 with pandas==0.25.3.
import pandas as pd
from enum import Enum, IntEnum

class BookType(Enum):
    DRAMA = 5
    ROMAN = 3

class AuthorType(IntEnum):
    UNKNOWN = 0
    GROUP = 1
    MAN = 2

def print_num_type(df: pd.DataFrame, col_name: str, enum_type: Enum) -> int:
    counts = df[col_name].value_counts()
    val = counts[enum_type]
    print('value counts:', counts)
    print(f'Found "{val}" of type {enum_type}')

d = {'title': ['Charly Morry', 'James', 'Watson', 'Marry L.'],
     'isbn': [21412412, 334764712, 12471021, 124141111],
     'book_type': [BookType.DRAMA, BookType.ROMAN, BookType.ROMAN, BookType.ROMAN],
     'author_type': [AuthorType.UNKNOWN, AuthorType.UNKNOWN, AuthorType.MAN, AuthorType.UNKNOWN]}

df = pd.DataFrame(data=d)
df.set_index(['title', 'isbn'], inplace=True)
df['book_type'] = df['book_type'].astype('category')
df['author_type'] = df['author_type'].astype('category')

print(df)
print(df.dtypes)

print_num_type(df, 'book_type', BookType.DRAMA)
print_num_type(df, 'author_type', AuthorType.UNKNOWN)
My pandas.DataFrame consists of two columns (book_type and author_type) of type categorical. Furthermore, book_type is a class inheriting from Enum and author_type from IntEnum. When calling print_num_type(df, 'book_type', BookType.DRAMA), everything works as expected and the number of books of this type is printed, whereas print_num_type(df, 'author_type', AuthorType.UNKNOWN) raises the error:
Traceback (most recent call last):
File "C:\Users\User\AppData\Local\Programs\Python\Python36-32\lib\abc.py", line 182, in __instancecheck__
if subclass in cls._abc_cache:
File "C:\Users\User\AppData\Local\Programs\Python\Python36-32\lib\_weakrefset.py", line 72, in __contains__
wr = ref(item)
RecursionError: maximum recursion depth exceeded while calling a Python object
Exception ignored in: 'pandas._libs.lib.c_is_list_like'
Traceback (most recent call last):
File "C:\Users\User\AppData\Local\Programs\Python\Python36-32\lib\abc.py", line 182, in __instancecheck__
if subclass in cls._abc_cache:
File "C:\Users\User\AppData\Local\Programs\Python\Python36-32\lib\_weakrefset.py", line 72, in __contains__
wr = ref(item)
RecursionError: maximum recursion depth exceeded while calling a Python object
What am I doing wrong here?
Is there a workaround to fix this error? I can't change the IntEnum base of AuthorType, since it is provided by another library.
Thanks in advance!
See answer here
The main idea is that since x.value_counts() (counts in your function) is itself a pandas Series, it's best to use .iat or .iloc when indexing into it; e.g. see the iat docs.
I think the easiest solution is to just use (x==0).sum(), or in your syntax:
val = (df[col_name]==enum_type).sum()
I put a minimal working example in the comments under your question so you can reproduce the problem/fix easily with the "x" notation.
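Applied to the function from the question, a minimal sketch of that suggestion (names reused from the question; I have not re-run it on pandas 0.25.3):
def print_num_type(df: pd.DataFrame, col_name: str, enum_type: Enum) -> int:
    # count matching rows directly instead of indexing value_counts() with the
    # enum member, which is what triggers the recursive isinstance check
    val = (df[col_name] == enum_type).sum()
    print(f'Found "{val}" of type {enum_type}')
    return val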
What version of Pandas are you using? I realized after reproducing the error that upgrading Pandas (now on pandas-1.4.2) fixes the error, and the value_counts()[0] worked as expected.
run pip install --upgrade pandas
I'm trying to plot some graphs using the latest version of PyCharm as a Python IDE.
As the interpreter, I'm using Anaconda with Python 3.4.3-0.
Using conda install, I have installed the new versions of pandas (0.17.0), seaborn (0.6.0), numpy (1.10.1), matplotlib (1.4.3), and ipython (4.0.1).
Inside the nesarc_pds.csv I have this:
IDNUM,S1Q2I
39191,1
39787,1
40082,1
40189,1
40226,1
40637,1
41306,1
41627,1
41710,1
42113,1
42120,1
42720,1
42909,1
43092,1
7,2
15,2
25,2
40,2
46,2
49,2
57,2
63,2
68,2
100,2
104,2
116,2
125,2
136,2
137,2
145,2
168,2
3787,9
6554,9
7616,9
11686,9
12431,9
14889,9
17694,9
19440,9
20141,9
21540,9
22476,9
24207,9
25762,9
29045,9
29731,9
So, that being said, this is my code:
import pandas as pd
import numpy
import seaborn as snb
import matplotlib.pyplot as plt
data = pd.read_csv("nesarc_pds.csv", low_memory=False)
#converting variable to numeric
pd.to_numeric(data["S1Q2I"], errors='coerce')
#setting a new dataset...
sub1=data[(data["S1Q2I"]==1) & (data["S3BQ1A5"]==1)]
sub2 = sub1.copy()
#setting the missing data 9 = unknown into NaN
sub2["S1Q2I"] = sub2["S1Q2I"].replace(9, numpy.nan)
#setting date to categorical type
sub2["S1Q2I"] = sub2["S1Q2I"].astype('category')
#plotting
snb.countplot(x="S1Q2I", data=sub2)
plt.xlabel("blablabla")
plt.title("lalala")
And then this is the error:
Traceback (most recent call last):
File "C:/Users/LPForGE_1/PycharmProjects/guido/haha.py", line 49, in <module>
snb.countplot(x="S1Q2I", data=sub2)
File "C:\Anaconda3\lib\site-packages\seaborn\categorical.py", line 2544, in countplot
errcolor)
File "C:\Anaconda3\lib\site-packages\seaborn\categorical.py", line 1263, in __init__
self.establish_colors(color, palette, saturation)
File "C:\Anaconda3\lib\site-packages\seaborn\categorical.py", line 300, in establish_colors
l = min(light_vals) * .6
ValueError: min() arg is an empty sequence
Any help would be really appreciated. I've pretty much exhausted my ideas trying to understand how to solve this.
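One thing that stands out (a guess on my part, not a confirmed diagnosis): pd.to_numeric returns a new Series rather than converting the column in place, so the conversion above is discarded; if the later filter then selects no rows, countplot has no levels to build a palette from and fails with exactly this min() error. A minimal sketch of assigning the conversion back:
# hedged sketch: keep the result of the numeric conversion before filtering
data["S1Q2I"] = pd.to_numeric(data["S1Q2I"], errors='coerce')
sub1 = data[(data["S1Q2I"] == 1) & (data["S3BQ1A5"] == 1)]
sub2 = sub1.copy()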
I have written some software that processes data for analysis and plotting. For each type of data, the data frames are produced in a module dedicated to that type.
Depending on the structure of the data, the data frame columns can be normal or MultiIndex.
I pass the data frames to a plotting function that produces plots of the columns that are numeric.
I would like to be able to "attach" a string to each of the "printable" columns to be used as its plot label. This string will not be the same as the name of the column.
I can't figure out a good way to do this purely with a pandas DataFrame, and so far I don't have any other solution either.
I have seen posts about metadata, but I don't completely understand whether this functionality is supported or not. At least I can't get it to work, and using frames with MultiIndex columns seems to complicate things further.
If it is not supported, is it still on the to-do list?
From my reading I get the impression it has worked differently in different versions of pandas and even depends on whether Python 2 or 3 is used.
I would like to know if there is a convenient way to accomplish this with pandas data frames. Is using _metadata for this advisable? If so, how?
I have looked around quite a bit, but the MultiIndex concern in particular does not seem to be addressed anywhere.
This one seems to indicate that metadata should be supported, but is it for data frames? I need it for Series within a DataFrame.
Adding meta-information/metadata to pandas DataFrame
This one seems to be a similar question, but I tried the solution and it did not help me.
Propagate pandas series metadata through joins
Here is some experimentation I have done based on my understanding of the _metadata functionality. It seems to indicate that _metadata did not make any difference and that the attribute did not persist across a copy. It also shows that using MultiIndex columns is an even more "unsupported" case.
Python 2.7.9 (default, Dec 10 2014, 12:24:55) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> from numpy.random import randn # To get values for the test frames
>>> import platform # To print python version
>>> # A function to set labels of the columns
>>> def labelSetter(aDF) :
...     DFtmp = aDF.copy() # Just to ensure it is a different dataframe
...     for column in DFtmp.columns :
...         DFtmp[column].myLab='This is '+column.__str__()
...         DFtmp[column].notMyLab='This should not persist'
...     return DFtmp
...
>>>
>>> print 'Pandas version: {}'.format(pd.version.version)
Pandas version: 0.15.2
>>>
>>> pd.Series._metadata.append('myLab');print pd.Series._metadata # now _metadata contains 'myLab'
['name', 'myLab']
>>>
>>> # Make dataframes normal columns and MultiIndex
>>> dfS=pd.DataFrame(randn(2, 6),columns=['a1','a2','a3','b1','b2','c1']);print dfS
a1 a2 a3 b1 b2 c1
0 -0.934869 -0.310979 0.362635 -0.994605 -0.880114 -1.663265
1 0.205341 -1.642080 -0.732969 -0.080109 -0.082483 -0.208360
>>>
>>> dfMI=pd.DataFrame(randn(2, 6),columns=[['a','a','a','b','b','c'],['a1','a2','a3','b1','b2','c1']]);print dfMI
a b c
a1 a2 a3 b1 b2 c1
0 -0.578399 0.478925 1.047342 -0.087225 1.905074 0.146105
1 0.640575 0.153328 -1.117847 1.043026 0.671220 -0.218550
>>>
>>> # Run the labelSetter function on the data frames
>>> dfSWlab=labelSetter(dfS)
>>> dfMIWlab=labelSetter(dfMI)
>>>
>>> print dfSWlab['a2'].myLab
This is a2
>>> # This worked
>>>
>>> print dfSWlab['a2'].notMyLab
This should not persist
>>> # 'notMyLab' has not been appended to _metadata but the label still persists.
>>>
>>> dfSWlabCopy=dfSWlab.copy() # make a copy to see if myLab persists.
>>>
>>> dfSWlabCopy['a2'].myLab
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\site-packages\pandas\core\generic.py", line 1942, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'Series' object has no attribute 'myLab'
>>> # 'myLab' was appended to _metadata but still did not persist the copy
>>>
>>> print dfMIWlab['a']['a2'].myLab
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\site-packages\pandas\core\generic.py", line 1942, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'Series' object has no attribute 'myLab'
>>> # For the MultiIndex data frame the 'myLab' is not accessible
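For what it's worth, a minimal sketch of a workaround that avoids _metadata altogether (my own suggestion, not something the experiments above establish): keep the labels in an ordinary dict keyed by column, which also works for MultiIndex columns because the keys are plain hashable tuples.
import pandas as pd
from numpy.random import randn

dfMI = pd.DataFrame(randn(2, 6),
                    columns=[['a', 'a', 'a', 'b', 'b', 'c'],
                             ['a1', 'a2', 'a3', 'b1', 'b2', 'c1']])

# labels carried next to the frame instead of on the Series objects
labels = {column: 'This is ' + str(column) for column in dfMI.columns}

def plot_label(column):
    # fall back to the column name itself when no label was registered
    return labels.get(column, str(column))

print(plot_label(('a', 'a2')))  # This is ('a', 'a2')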
I am trying to migrate some code from using ElementTree to using lxml.etree and have encountered an error early on:
>>> import lxml.etree as ET
>>> main = ET.Element("main")
>>> another = ET.Element("another", foo="bar")
>>> main.attrib.update(another.attrib)
Traceback (most recent call last):
File "<pyshell#4>", line 1, in <module>
main.attrib.update(another.attrib)
File "lxml.etree.pyx", line 2153, in lxml.etree._Attrib.update
(src/lxml/lxml.etree.c:46972)
ValueError: too many values to unpack (expected 2)
But I am able to update using the following:
>>> main.attrib.update({'foo': 'bar'})
Is this a bug in lxml (version 2.3) or am I just missing something obvious?
I'm getting the same error; I don't think it's only a 2.3 issue.
Workaround:
main.attrib.update(dict(another.attrib))
# or, more efficiently, if it has many attributes:
main.attrib.update(another.attrib.iteritems())
UPDATE
lxml.etree._Attrib.update accepts a dict or an iterable (source). Although _Attrib has a dict-like interface, it is not a dict instance.
In [3]: type(another.attrib)
Out[3]: lxml.etree._Attrib
In [4]: isinstance(another.attrib, dict)
Out[4]: False
So update tries to iterate the argument directly as key, value pairs. Maybe it's done for performance; only the lxml author knows.
Ways to change this in lxml:
Subclass dict.
Check for hasattr(sequence_or_dict, 'items').
I'm not familiar with Cython and don't know which is better.
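On Python 3 (where iteritems() no longer exists), a minimal sketch of the same workaround, using only what the question already imports:
import lxml.etree as ET

main = ET.Element("main")
another = ET.Element("another", foo="bar")

# materialise the attributes as a real dict, or pass the (key, value) pairs
main.attrib.update(dict(another.attrib))
# main.attrib.update(another.attrib.items())  # equivalent

print(ET.tostring(main))  # b'<main foo="bar"/>'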