Pandas dataframe without copy

How can I avoid taking a copy of the dictionary supplied when creating a Pandas DataFrame?
>>> a = np.arange(10)
>>> b = np.arange(10.0)
>>> df1 = pd.DataFrame(a)
>>> a[0] = 100
>>> df1
     0
0  100
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
>>> d = {'a':a, 'b':b}
>>> df2 = pd.DataFrame(d)
>>> a[1] = 200
>>> d
{'a': array([100, 200, 2, 3, 4, 5, 6, 7, 8, 9]), 'b': array([ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])}
>>> df2
     a  b
0  100  0
1    1  1
2    2  2
3    3  3
4    4  4
5    5  5
6    6  6
7    7  7
8    8  8
9    9  9
If I create the DataFrame from just a, then changes to a are reflected in df1 (and vice versa).
Is there any way of making this work when supplying a dictionary?

It is possible to initialize a DataFrame without copying the data. To understand how, you need to understand the BlockManager, which is the underlying data structure used by DataFrame. It tries to group data of the same dtype together and hold their memory in a single block -- it does not function as a column of columns, contrary to what the documentation suggests. If the data is already provided as a single block, for example if you initialize from a 2-D matrix:
import numpy as np
import pandas as pd
a = np.zeros((100, 20))
a.flags['WRITEABLE'] = False          # make the source array read-only
df = pd.DataFrame(a, copy=False)      # ask pandas not to copy the block
# the frame's data still points at `a`, so it should inherit the read-only flag
assert not df.values.flags.writeable
... then the DataFrame will usually just reference the ndarray.
However, this will not work if you are starting with multiple arrays or have heterogeneous types.
In that case, you can monkey-patch the BlockManager to force it not to consolidate same-typed data columns.
And if you initialize your DataFrame with non-numpy arrays, pandas will copy the data immediately.
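A quick way to see whether a frame actually shares memory with its source is np.shares_memory; a minimal sketch (exact behaviour depends on your pandas version):
import numpy as np
import pandas as pd
a = np.zeros((100, 20))
# single homogeneous 2-D block: pandas can simply reference `a`
df = pd.DataFrame(a, copy=False)
print(np.shares_memory(a, df.values))            # typically True
# dict of 1-D arrays: the data is copied into new blocks at construction
df2 = pd.DataFrame({'x': a[:, 0], 'y': a[:, 1]})
print(np.shares_memory(a, df2['x'].to_numpy()))  # typically False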

There is no way to 'share' a dict and have the frame update based on changes to the dict. The copy argument is not relevant for a dict; the data is always copied, because it is transformed into an ndarray.
However, there is a way to get this type of dynamic behavior in a limited way.
In [9]: arr = np.array(np.random.rand(5,2))
In [10]: df = DataFrame(arr)
In [11]: arr[0,0] = 0
In [12]: df
Out[12]:
          0         1
0  0.000000  0.192056
1  0.847185  0.609028
2  0.833997  0.422521
3  0.937638  0.711856
4  0.047569  0.033282
Thus a passed ndarray will, at construction time, be a view onto the underlying numpy array. Depending on how you operate on the DataFrame, you could trigger a copy (e.g. if you assign a new column, or change a column's dtype). This will also only work for a single-dtype frame.
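A small sketch of that copy-triggering behaviour (hedged: exactly which operations break the view varies across pandas versions):
import numpy as np
import pandas as pd
arr = np.random.rand(5, 2)
df = pd.DataFrame(arr)
arr[0, 0] = 0.5
print(df.iloc[0, 0])               # typically 0.5: the frame still views `arr`
df[0] = df[0].astype('float32')    # changing a column's dtype allocates a new array
arr[0, 0] = -1.0
print(df.iloc[0, 0])               # typically still 0.5: column 0 no longer shares memory with `arr`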

Related

Numpy vs Pandas axis

Why does axis differ in Numpy vs Pandas?
Example:
If I want to get rid of a column in Pandas I could do this:
df.drop("column", axis = 1, inplace = True)
Here, we are using axis = 1 to drop a column (vertically in a DF).
In Numpy, if I want to sum a matrix A vertically I would use:
A.sum(axis = 0)
Here I use axis = 0.
axis isn't used that often in pandas. A dataframe has 2 dimensions, which are often treated quite differently. In drop, the axis definition is well documented and actually corresponds to the numpy usage.
Make a simple array and data frame:
In [180]: x = np.arange(9).reshape(3,3)
In [181]: df = pd.DataFrame(x)
In [182]: df
Out[182]:
   0  1  2
0  0  1  2
1  3  4  5
2  6  7  8
Delete a row from the array, or a column:
In [183]: np.delete(x, 1, 0)
Out[183]:
array([[0, 1, 2],
       [6, 7, 8]])
In [184]: np.delete(x, 1, 1)
Out[184]:
array([[0, 2],
       [3, 5],
       [6, 8]])
Drop does the same thing for the same axis:
In [185]: df.drop(1, axis=0)
Out[185]:
   0  1  2
0  0  1  2
2  6  7  8
In [186]: df.drop(1, axis=1)
Out[186]:
   0  2
0  0  2
1  3  5
2  6  8
In sum, the definitions are the same as well:
In [188]: x.sum(axis=0)
Out[188]: array([ 9, 12, 15])
In [189]: df.sum(axis=0)
Out[189]:
0 9
1 12
2 15
dtype: int64
In [190]: x.sum(axis=1)
Out[190]: array([ 3, 12, 21])
In [191]: df.sum(axis=1)
Out[191]:
0 3
1 12
2 21
dtype: int64
The pandas sums are Series, which are the pandas equivalent of a 1d array.
Visualizing what axis does with reduction operations like sum is a bit tricky - especially with 2d arrays. Is the axis kept or removed? It can help to think about axis for 1d arrays (the only axis is removed), or 3d arrays, where one axis is removed leaving two.
When you get rid of a column, its name is picked from axis 1, which is the horizontal axis. When you sum along axis 0, you sum vertically.
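To make the keep-or-remove question concrete, here is a small sketch with a 3-D array (the axis you sum over is the one that disappears):
import numpy as np
y = np.arange(24).reshape(2, 3, 4)
print(y.sum(axis=0).shape)   # (3, 4) -- axis 0 removed
print(y.sum(axis=1).shape)   # (2, 4) -- axis 1 removed
print(y.sum(axis=2).shape)   # (2, 3) -- axis 2 removed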

Access elements of pandas series

I have a dataframe and I want to extract the frequency of 0/1 in a particular column.
df=pd.DataFrame({'A':[0,0,1,0,1]})
df
Out[6]:
   A
0  0
1  0
2  1
3  0
4  1
Calculating the frequency of occurrence of 0/1s -
df['A'].value_counts()
Out[8]:
0 3
1 2
Name: A, dtype: int64
type(df['A'].value_counts())
Out[9]: pandas.core.series.Series
How can I extract the frequency of 0s and 1s into, let's suppose, two variables, ones and zeros, as -
zeros=3, ones=2
I think it would be a bit more flexible to return a dictionary:
In [234]: df['A'].value_counts().to_dict()
Out[234]: {0: 3, 1: 2}
or
In [236]: d = df['A'].astype(str).replace(['0','1'], ['zeros','ones']).value_counts().to_dict()
In [237]: d
Out[237]: {'ones': 2, 'zeros': 3}
In [238]: d['ones']
Out[238]: 2
In [239]: d['zeros']
Out[239]: 3
You can also access it directly:
In [3]: df['A'].value_counts().loc[0]
Out[3]: 3
In [4]: df['A'].value_counts().loc[1]
Out[4]: 2
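If you literally want two scalar variables, you can also pull them straight out of the value_counts result; a small sketch (.get guards against a value that never occurs):
counts = df['A'].value_counts()
zeros = counts.get(0, 0)   # 3
ones = counts.get(1, 0)    # 2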
Another way to solve this issue is to use the collections library and the Counter() class in it.
import collections
c = collections.Counter(df['A'])
c
Out[31]: Counter({0: 3, 1: 2})
count_0s = c[0]  # returns 3
count_1s = c[1]  # returns 2

Seaborn Violin Plot from Pandas Dataframe, each column its own separate violin plot

I have Pandas Dataframe with structure:
   A  B
0  1  1
1  2  1
2  3  4
3  3  7
4  6  8
How do I generate a Seaborn Violin plot with each column as its own separate violin plot for side-by-side comparison?
seaborn (at least, version 0.8.1; not sure if this is new) supports what you want without messing around with your dataframe at all:
import pandas as pd
import seaborn as sns
df = pd.DataFrame({'A': [1, 2, 3, 3, 6], 'B': [1, 1, 4, 7, 8]})
sns.violinplot(data=df)
(Note that you do need to set data=df; if you just pass df in as the first positional argument, which is equivalent to setting x=df, seaborn seems to concatenate the columns together and make a single violin plot of all of the data.)
You can first reshape with melt to turn the columns into a groups column, and then use seaborn.violinplot:
#old version of pandas
#df = pd.melt(df, var_name='groups', value_name='vals')
df = df.melt(var_name='groups', value_name='vals')
print (df)
  groups  vals
0      A     1
1      A     2
2      A     3
3      A     3
4      A     6
5      B     1
6      B     1
7      B     4
8      B     7
9      B     8
ax = sns.violinplot(x="groups", y="vals", data=df)

Check whether a column in a dataframe is an integer or not, and perform operation

Check whether a column in a dataframe is an integer or not, and if it is an integer, it must be multiplied by 10
import numpy as np
import pandas as pd
df = pd.DataFrame(....)
#function to check and multiply if a column is integer
def xtimes(x):
    for col in x:
        if type(x[col]) == np.int64:
            return x[col]*10
        else:
            return x[col]
#using apply to apply that function on df
df.apply(xtimes).head(10)
I am getting an error like ('GP', 'occurred at index school')
You could use select_dtypes to get numeric columns and then multiply.
In [1284]: df[df.select_dtypes(include=['int', 'int64', np.number]).columns] *= 10
You could have your specific check list for include=[... np.int64, ..., etc]
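Put together as a small self-contained sketch (the column names here are made up for illustration; np.integer is used as a generic selector for all integer widths):
import numpy as np
import pandas as pd
df = pd.DataFrame({'school': ['GP', 'MS'], 'age': [15, 16], 'score': [1.5, 2.5]})
int_cols = df.select_dtypes(include=[np.integer]).columns   # only the integer columns
df[int_cols] *= 10
print(df)
#   school  age  score
# 0     GP  150    1.5
# 1     MS  160    2.5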
You can use the dtypes attribute and loc.
df.loc[:, df.dtypes <= np.integer] *= 10
Explanation
pd.DataFrame.dtypes returns a pd.Series of numpy dtype objects. We can use the comparison operators to determine subdtype status. See this document for the numpy.dtype hierarchy.
Demo
Consider the dataframe df
df = pd.DataFrame([
    [1, 2, 3, 4, 5, 6],
    [1, 2, 3, 4, 5, 6]
]).astype(pd.Series([np.int32, np.int16, np.int64, float, object, str]))
df
   0  1  2    3  4  5
0  1  2  3  4.0  5  6
1  1  2  3  4.0  5  6
The dtypes are
df.dtypes
0 int32
1 int16
2 int64
3 float64
4 object
5 object
dtype: object
We'd like to change columns 0, 1, and 2
Conveniently
df.dtypes <= np.integer
0 True
1 True
2 True
3 False
4 False
5 False
dtype: bool
And that is what enables us to use this within a loc assignment.
df.loc[:, df.dtypes <= np.integer] *= 10
df
    0   1   2    3  4  5
0  10  20  30  4.0  5  6
1  10  20  30  4.0  5  6

pyspark's flatMap in pandas

Is there an operation in pandas that does the same as flatMap in pyspark?
flatMap example:
>>> rdd = sc.parallelize([2, 3, 4])
>>> sorted(rdd.flatMap(lambda x: range(1, x)).collect())
[1, 1, 1, 2, 2, 3]
So far I can think of apply followed by itertools.chain, but I am wondering if there is a one-step solution.
There's a hack. I often do something like
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'x': [[1, 2], [3, 4, 5]]})
In [3]: df['x'].apply(pd.Series).unstack().reset_index(drop=True)
Out[3]:
0 1
1 3
2 2
3 4
4 NaN
5 5
dtype: float64
The introduction of NaN is because the intermediate object creates a MultiIndex, but for a lot of things you can just drop that:
In [4]: df['x'].apply(pd.Series).unstack().reset_index(drop=True).dropna()
Out[4]:
0 1
1 3
2 2
3 4
5 5
dtype: float64
This trick uses all pandas code, so I would expect it to be reasonably efficient, though it might not like things like very different sized lists.
There are three steps to solve this question.
import pandas as pd
df = pd.DataFrame({'x': [[1, 2], [3, 4, 5]]})
df_new = df['x'].apply(pd.Series).unstack().reset_index().dropna()
df_new[['level_1', 0]]
Since July 2019, Pandas has offered pd.Series.explode to unnest frames. Here's a possible implementation of pd.Series.flatmap based on explode and map. Why map?
flatmap operations should be a subset of map, not apply. Check this thread for map/applymap/apply details: Difference between map, applymap and apply methods in Pandas.
import pandas as pd
from typing import Callable

def flatmap(
        self,
        func: Callable[[pd.Series], pd.Series],
        ignore_index: bool = False):
    return self.map(func).explode(ignore_index)

pd.Series.flatmap = flatmap
# example
df = pd.DataFrame([(x,y) for x,y in zip(range(1,6),range(6,16))], columns=['A','B'])
print(df.head(5))
# A B
# 0 1 6
# 1 2 7
# 2 3 8
# 3 4 9
# 4 5 10
print(df.A.flatmap(range,False))
# 0 NaN
# 1 0
# 2 0
# 2 1
# 3 0
# 3 1
# 3 2
# 4 0
# 4 1
# 4 2
# 4 3
# Name: A, dtype: object
print(df.A.flatmap(range,True))
# 0 0
# 1 0
# 2 1
# 3 0
# 4 1
# 5 2
# 6 0
# 7 1
# 8 2
# 9 3
# 10 0
# 11 1
# 12 2
# 13 3
# 14 4
# Name: A, dtype: object
As you can see, the main issue is the indexing. You could ignore it and just reset, but then you're better off using NumPy or plain lists, as indexing is one of pandas' key selling points. If you do not care about indexing at all, you could reuse the idea of the solution above, changing pd.Series.map to pd.DataFrame.applymap and pd.Series.explode to pd.DataFrame.explode, and forcing ignore_index=True.
I suspect that the answer is "no, not efficiently."
Pandas isn't built for nested data like this. I suspect that the case you're considering in Pandas looks a bit like the following:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'x': [[1, 2], [3, 4, 5]]})
In [3]: df
Out[3]:
x
0 [1, 2]
1 [3, 4, 5]
And that you want something like the following
x
0 1
0 2
1 3
1 4
1 5
It is far more typical to normalize your data in Python before you send it to Pandas. If Pandas did do this then it would probably only be able to operate at slow Python speeds rather than fast C speeds.
Generally one does a bit of munging of data before one uses tabular computation.
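For example, a sketch of that pre-normalization step in plain Python (hypothetical data; the source row index is repeated for each element):
import pandas as pd
data = {'x': [[1, 2], [3, 4, 5]]}
# flatten in plain Python, keeping track of which row each value came from
rows = [(i, v) for i, lst in enumerate(data['x']) for v in lst]
df = pd.DataFrame(rows, columns=['idx', 'x']).set_index('idx')
print(df)
#      x
# idx
# 0    1
# 0    2
# 1    3
# 1    4
# 1    5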