geopandas series fillna results in float - pandas

As seen below, why does fillna result in a float instead of a Point?
In [4]: import numpy as np
In [5]: import geopandas as gpd
In [8]: import shapely
In [9]: df_tmp = gpd.GeoDataFrame([['a', np.NaN], ['b', shapely.geometry.Point(35, 70)]], columns=['id', 'geometry'])
In [10]: df_tmp
Out[10]:
id geometry
0 a NaN
1 b POINT (35 70)
In [11]: df_tmp.geometry.fillna(shapely.geometry.Point(90, 0))
Out[11]:
0 90
1 POINT (35 70)
Name: geometry, dtype: object
version info:
In [12]: gpd.__version__
Out[12]: '0.5.0'
In [13]: shapely.__version__
Out[13]: '1.6.4.post2'
In [14]: np.__version__
Out[14]: '1.16.4'

Check the pandas documentation for fillna:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
A shapely geometry may not be a valid fill value for that method.
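As a workaround in this situation you can skip fillna and replace the missing geometries yourself. A minimal sketch, reusing the column and fill point from the example above:
import numpy as np
import geopandas as gpd
from shapely.geometry import Point
from shapely.geometry.base import BaseGeometry
df_tmp = gpd.GeoDataFrame([['a', np.nan], ['b', Point(35, 70)]], columns=['id', 'geometry'])
fill_point = Point(90, 0)
# Replace anything that is not a geometry with the fill point.
df_tmp['geometry'] = [g if isinstance(g, BaseGeometry) else fill_point for g in df_tmp['geometry']]
print(df_tmp)
This sidesteps fillna's handling of the Point argument entirely.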

Related

pandas DataFrame value_counts on column that stores DataFrame

I am trying to use value_counts() on a pandas DataFrame column that stores another DataFrame.
Is there a possibility to get the value_counts() function working (or something similar), without having to transform my DataFrames to Strings or Hashes or something like that?
I've tried counting the inner DataFrames, which breaks completely, and then with arrays, where it also seems unable to make the correct comparison:
# importing pandas
import pandas as pd
import numpy as np
# Creating arrays
ar1 = np.array([11,22])
ar2 = np.array([11,22])
ar3 = np.array([33,44])
df = pd.DataFrame([
['0', ar1],
['1', ar2],
['2', ar3]
], columns=['str', 'ars'])
print(df["ars"].value_counts())
Expected:
[11, 22] 2
[33, 44] 1
Actual:
[11, 22] 1
[11, 22] 1
[33, 44] 1
# importing pandas
import pandas as pd
import numpy as np
# Creating DataFrames
df1 = pd.DataFrame({'col1': [11], 'col2': [22]})
df2 = pd.DataFrame({'col1': [11], 'col2': [22]})
df3 = pd.DataFrame({'col1': [33], 'col2': [44]})
df = pd.DataFrame([
['0', df1],
['1', df2],
['2', df3]
], columns=['str', 'dfs'])
print(df["dfs"].value_counts())
Expected:
{} 2
{} 1
Actual:
BREAKS COMPLETELY
How can I achieve a count of complex values in a DataFrame?
I'm honestly confused how either of those managed to run without raising an exception.
Neither np.ndarray nor pd.DataFrame is hashable, and as far as I understand, hashing is necessary for value_counts.
Case in point: neither of your examples can be translated to its DataFrame.value_counts equivalent, because underneath it does df.groupby(["ars"], dropna=True).grouper.size(), which requires hashing.
>>> df.value_counts(["ars"])
TypeError: unhashable type: 'numpy.ndarray'
Overall, I would not count on any .value_counts method working on non-hashable columns.
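One possible workaround, sketched under the assumption that the inner values are flat arrays you want compared element-wise, is to map each array to a hashable tuple before counting:
import numpy as np
import pandas as pd
ar1 = np.array([11, 22])
ar2 = np.array([11, 22])
ar3 = np.array([33, 44])
df = pd.DataFrame([['0', ar1], ['1', ar2], ['2', ar3]], columns=['str', 'ars'])
# Tuples are hashable, so equal arrays end up in the same group.
print(df['ars'].map(tuple).value_counts())
# (11, 22)    2
# (33, 44)    1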

Replace NaN values of pandas.DataFrame based on values of other columns (according to formula)

Demo dataframe:
import pandas as pd
df = pd.DataFrame({'a': [1,None,3], 'b': [5,10,15]})
I want to replace every NaN in a with the square of the corresponding value in b, and then set that b to NaN (shift the NaN values while applying an operation to them).
Desired result:
a b
1 5
100 NaN
3 15
How is it possible with pandas?
You can get the rows you want to change using df['a'].isnull(). Then you can use that to update the columns with loc.
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [1, None, 3], 'b': [5, 10, 15]})
change = df['a'].isnull()
df.loc[change, ['a', 'b']] = [df.loc[change, 'b']**2, np.NaN]
print(df)
Note that the change variable is only to keep from repeating df['a'].isnull() on both sides of the assignment. You could replace it with that expression to do this in one line, but I think that looks cluttered.
Result:
a b
0 1.0 5.0
1 100.0 NaN
2 3.0 15.0
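If assigning a list on the right-hand side feels fragile, a sketch that updates the two columns one at a time (same data as above) is:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [1, None, 3], 'b': [5, 10, 15]})
df['b'] = df['b'].astype(float)             # allow NaN in b
mask = df['a'].isnull()
df.loc[mask, 'a'] = df.loc[mask, 'b'] ** 2  # fill a with b squared
df.loc[mask, 'b'] = np.nan                  # then blank out b
print(df)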

pandas different column names same data

How do I use the same data storage and change only the column names?
If I do something like this:
In [30]: import pandas as pd
In [31]: import numpy as np
In [32]: df = pd.DataFrame(np.zeros((2,2)))
In [33]: df_new = pd.DataFrame(df)
In [34]: df[0][0]=5
In [35]: df_new
Out[35]:
0 1
0 5.0 0.0
1 0.0 0.0
In [36]: df_new.columns=["a", "b"]
In [37]: df_new.columns
Out[37]: Index(['a', 'b'], dtype='object')
In [38]: df.columns
Out[38]: Index(['a', 'b'], dtype='object')
The column names change for both dataframes. Using DataFrame.rename with inplace=True likewise changes the columns of both dataframes.
You should use pandas.DataFrame.copy() to create a copy of an existing dataframe.
For your code,
df_new = df.copy()
instead of
df_new = pd.DataFrame(df)
will do the trick.
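A quick check of the difference, as a minimal sketch based on the example above:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.zeros((2, 2)))
df_new = df.copy()          # independent copy of the data and the axes
df_new.columns = ['a', 'b']
print(df.columns)           # still the original integer labels 0 and 1
print(df_new.columns)       # Index(['a', 'b'], dtype='object')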

Drop categories from Dask DataFrame? [duplicate]

I am trying to use dask instead of pandas since I have a 2.6 GB csv file.
I load it and want to drop a column, but it seems that neither the drop method
df.drop('column') nor slicing df[:, :-1]
is implemented yet. Is this the case, or am I just missing something?
We implemented the drop method in this PR. This is available as of dask 0.7.0.
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'x': [1, 2, 3], 'y': [3, 2, 1]})
In [3]: import dask.dataframe as dd
In [4]: ddf = dd.from_pandas(df, npartitions=2)
In [5]: ddf.drop('y', axis=1).compute()
Out[5]:
x
0 1
1 2
2 3
Previously one could also select the columns to keep by name, though of course this is less attractive if you have many columns.
In [6]: ddf[['x']].compute()
Out[6]:
x
0 1
1 2
2 3
This should work, where columns is the list of column names you want to drop:
columns = ['y']  # hypothetical: the columns you want to drop
print(ddf.shape)
ddf = ddf.drop(columns, axis=1)
print(ddf.shape)

How to call unique() on dask DataFrame

How do I call unique on a dask DataFrame ?
I get the following error if I try to call it the same way as for a regular pandas dataframe:
In [27]: len(np.unique(ddf[['col1','col2']].values))
AttributeError Traceback (most recent call last)
<ipython-input-27-34c0d3097aab> in <module>()
----> 1 len(np.unique(ddf[['col1','col2']].values))
/dir/anaconda2/lib/python2.7/site-packages/dask/dataframe/core.pyc in __getattr__(self, key)
1924 return self._constructor_sliced(merge(self.dask, dsk), name,
1925 meta, self.divisions)
-> 1926 raise AttributeError("'DataFrame' object has no attribute %r" % key)
1927
1928 def __dir__(self):
AttributeError: 'DataFrame' object has no attribute 'values'
For both Pandas and Dask.dataframe you should use the drop_duplicates method:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'x': [1, 1, 2], 'y': [10, 10, 20]})
In [3]: df.drop_duplicates()
Out[3]:
x y
0 1 10
2 2 20
In [4]: import dask.dataframe as dd
In [5]: ddf = dd.from_pandas(df, npartitions=2)
In [6]: ddf.drop_duplicates().compute()
Out[6]:
x y
0 1 10
2 2 20
This works with dask==2022.11.1, where symbol is the column you want the unique values of:
ddf.symbol.unique().compute()
I'm not too familiar with Dask, but they appear to have a subset of Pandas functionality, and that subset doesn't seem to include the DataFrame.values attribute.
http://dask.pydata.org/en/latest/dataframe-api.html
You could try this:
sum(ddf[['col1','col2']].apply(pd.Series.nunique, axis=0))
I don't know how it fares performance-wise, but it should provide you with the value (total number of distinct values in col1 and col2 from the ddf DataFrame).
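If you only need the number of distinct values, a dask-native sketch (col1 and col2 are the column names assumed from the question) is to call Series.nunique per column:
import dask.dataframe as dd
import pandas as pd
df = pd.DataFrame({'col1': [1, 1, 2], 'col2': [10, 10, 20]})
ddf = dd.from_pandas(df, npartitions=2)
# nunique builds a lazy graph; compute() runs it and returns a plain int.
total = ddf['col1'].nunique().compute() + ddf['col2'].nunique().compute()
print(total)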