pandas different column names same data

How do I use the same data storage and change only the columns?
If I do something like this:
In [30]: import pandas as pd
In [31]: import numpy as np
In [32]: df = pd.DataFrame(np.zeros((2,2)))
In [33]: df_new = pd.DataFrame(df)
In [34]: df[0][0]=5
In [35]: df_new
Out[35]:
     0    1
0  5.0  0.0
1  0.0  0.0
In [36]: df_new.columns=["a", "b"]
In [37]: df_new.columns
Out[37]: Index(['a', 'b'], dtype='object')
In [38]: df.columns
Out[38]: Index(['a', 'b'], dtype='object')
This changes the columns for both dataframes. Using DataFrame.rename with inplace=True likewise changes the columns of both dataframes.

You should use pandas.DataFrame.copy() to create a copy of an existing dataframe.
For your code,
df_new = df.copy()
instead of
df_new = pd.DataFrame(df)
will do the trick.
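A minimal sketch of the difference, using the same toy frame as above: after copy(), modifying the original no longer shows up in the new frame, and renaming the columns of one does not touch the other.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.zeros((2, 2)))
df_new = df.copy()            # independent copy of the data and the axes
df.loc[0, 0] = 5
print(df_new)                 # still all zeros
df_new.columns = ["a", "b"]
print(df.columns)             # still the original integer columns 0 and 1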

Related

pandas DataFrame value_counts on column that stores DataFrame

I am trying to use value_counts() on a pandas DataFrame column that stores another DataFrame.
Is there a way to get value_counts() (or something similar) working without having to convert my DataFrames to strings, hashes, or the like?
I've tried counting the inner DataFrames, which breaks completely, and I also tried with arrays, where the comparison does not seem to be made correctly either:
# importing pandas and numpy
import pandas as pd
import numpy as np
# creating arrays
ar1 = np.array([11, 22])
ar2 = np.array([11, 22])
ar3 = np.array([33, 44])
df = pd.DataFrame([
    ['0', ar1],
    ['1', ar2],
    ['2', ar3]
], columns=['str', 'ars'])
print(df["ars"].value_counts())
Expected:
[11, 22] 2
[33, 44] 1
Actual:
[11, 22] 1
[11, 22] 1
[33, 44] 1
# importing pandas and numpy
import pandas as pd
import numpy as np
# creating DataFrames
df1 = pd.DataFrame({'col1': [11], 'col2': [22]})
df2 = pd.DataFrame({'col1': [11], 'col2': [22]})
df3 = pd.DataFrame({'col1': [33], 'col2': [44]})
df = pd.DataFrame([
    ['0', df1],
    ['1', df2],
    ['2', df3]
], columns=['str', 'dfs'])
print(df["dfs"].value_counts())
Expected:
{} 2
{} 1
Actual:
BREAKS COMPLETELY
How can I achieve the count of complex values in a DataFrame?
I'm honestly confused how either of those managed to run without raising an exception.
Neither np.array nor pd.DataFrame is hashable, and as far as I understand, hashing is necessary for value_counts.
Case in point: neither of your examples can be translated to its DataFrame.value_counts equivalent, because underneath it is doing df.groupby(["ars"], dropna=True).grouper.size(), which requires hashing.
>>> df.value_counts(["ars"])
TypeError: unhashable type: 'numpy.ndarray'
Overall, I would not count on any .value_counts method working on non-hashable columns.
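As a sketch of one possible workaround (my own suggestion, not something value_counts supports directly): map each array to a hashable representation such as a tuple, then count those.
counts = df["ars"].apply(tuple).value_counts()
print(counts)
# (11, 22)    2
# (33, 44)    1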

Running df.apply, dask and pd.get_dummies together

I have multiple categorical columns with millions of distinct values in them. So, I am using dask and pd.get_dummies to convert these categorical columns into bit vectors, like this:
import pandas as pd
import numpy as np
import scipy.sparse
import dask.dataframe as dd
import multiprocessing
train_set = pd.read_csv('train_set.csv')
def convert_into_one_hot(col1, col2):
    return pd.get_dummies(train_set, columns=[col1, col2], sparse=True)

ddata = dd.from_pandas(train_set, npartitions=2 * multiprocessing.cpu_count()).map_partitions(
    lambda df: df.apply(lambda row: convert_into_one_hot(row.col1, row.col2), axis=1)
).compute(scheduler='processes')
But, I get this error:
ValueError: Metadata inference failed in `lambda`.
You have supplied a custom function and Dask is unable to determine the type of output that that function returns.
To resolve this please provide a meta= keyword.
The docstring of the Dask function you ran should have more information.
Original error is below:
------------------------
KeyError("None of [Index(['foo'], dtype='object')] are in the [columns]")
What am I doing wrong here? Thanks.
EDIT:
A small example to reproduce the error. Hope it helps to understand the problem.
def convert_into_one_hot(x, y):
    return pd.get_dummies(df, columns=[x, y], sparse=True)

d = {'col1': ['a', 'b'], 'col2': ['c', 'd']}
df = pd.DataFrame(data=d)
dd.from_pandas(df, npartitions=2 * multiprocessing.cpu_count()).map_partitions(
    lambda df: df.apply(lambda row: convert_into_one_hot(row.col1, row.col2), axis=1)
).compute(scheduler='processes')
I think you could run into problems if you try to use get_dummies within partitions. There is a dask version of get_dummies, and it should work as follows:
import pandas as pd
import dask.dataframe as dd
import multiprocessing as mp
d = {'col1': ['a', 'b'], 'col2': ['c', 'd']}
df = pd.DataFrame(data=d)
Pandas
pd.get_dummies(df, columns=["col1", "col2"], sparse=True)
Dask
ddf = dd.from_pandas(df, npartitions=2 * mp.cpu_count())
# you need to convert the column dtypes to category
dummies_cols = ["col1", "col2"]
ddf[dummies_cols] = ddf[dummies_cols].categorize()
dd.get_dummies(ddf, columns=["col1", "col2"], sparse=True)
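For completeness, a small usage sketch based on the toy ddf above: the dask result is lazy, so call .compute() to materialize the one-hot encoded pandas DataFrame.
result = dd.get_dummies(ddf, columns=dummies_cols, sparse=True).compute()
print(result.head())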

geopandas series fillna results in float

As seen below, why does fillna result in a float instead of a Point?
In [4]: import numpy as np
In [5]: import geopandas as gpd
In [8]: import shapely
In [9]: df_tmp = gpd.GeoDataFrame([['a', np.NaN], ['b', shapely.geometry.Point(35, 70)]], columns=['id', 'geometry'])
In [10]: df_tmp
Out[10]:
  id       geometry
0  a            NaN
1  b  POINT (35 70)
In [11]: df_tmp.geometry.fillna(shapely.geometry.Point(90, 0))
Out[11]:
0 90
1 POINT (35 70)
Name: geometry, dtype: object
version info:
In [12]: gpd.__version__
Out[12]: '0.5.0'
In [13]: shapely.__version__
Out[13]: '1.6.4.post2'
In [14]: np.__version__
Out[14]: '1.16.4'
Check the documentation for fillna in pandas.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
A shapely geometry may not be a valid argument for that function.
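As a sketch of a possible workaround (my own suggestion, not verified against geopandas 0.5.0): replace the missing entries explicitly instead of going through fillna.
from shapely.geometry import Point
fill_value = Point(90, 0)
# replace missing geometry entries (None or NaN) row by row with the fill geometry
df_tmp['geometry'] = [
    fill_value if (geom is None or (isinstance(geom, float) and np.isnan(geom))) else geom
    for geom in df_tmp['geometry']
]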

Drop categories from Dask DataFrame? [duplicate]

I am trying to use dask instead of pandas since I have a 2.6 GB CSV file.
I load it and I want to drop a column, but it seems that neither the drop method
df.drop('column') nor slicing df[:, :-1]
is implemented yet. Is this the case, or am I just missing something?
We implemented the drop method in this PR. This is available as of dask 0.7.0.
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'x': [1, 2, 3], 'y': [3, 2, 1]})
In [3]: import dask.dataframe as dd
In [4]: ddf = dd.from_pandas(df, npartitions=2)
In [5]: ddf.drop('y', axis=1).compute()
Out[5]:
   x
0  1
1  2
2  3
Previously one could also have used slicing with column names; though of course this can be less attractive if you have many columns.
In [6]: ddf[['x']].compute()
Out[6]:
   x
0  1
1  2
2  3
This should also work, where columns is a list of the column names to drop:
columns = ['y']   # names of the columns to drop
print(ddf.shape)  # note: the row count in a dask shape is lazy (a Delayed object)
ddf = ddf.drop(columns, axis=1)
print(ddf.shape)

Pandas dataframes and PyCharm IntelliSense

When I create new dataframes from old ones, using concat or merge, PyCharm IntelliSense stops working for the resulting dataframe unless I explicitly pass it to a DataFrame constructor:
import pandas as pd
d1 = {1: [1, 2, 3], 2: [11, 22, 33]}
d2 = {1: [4], 2: [5]}
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)
df3 = pd.concat([df1, df2], axis=0)
df3_ = pd.DataFrame(pd.concat([df1, df2], axis=0))
In the above example, df3 and df3_ are the "same" dataframe, but IntelliSense only works on df3_. Am I doing something wrong? How can I avoid always having to call the DataFrame constructor and still get IntelliSense out of PyCharm?
The answer is to use a type hint (a PEP 484 type comment) like this:
df3 = pd.concat([df1, df2], axis=0)  # type: pd.DataFrame
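On Python 3.6+, a variable annotation is an equivalent sketch that PyCharm also understands (pd being the alias imported above):
df3: pd.DataFrame = pd.concat([df1, df2], axis=0)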