Convert all entries in a pandas column of lists to just the first entry of each list [duplicate] - pandas

I have a Pandas DataFrame with a column containing list objects:
A
0 [1,2]
1 [3,4]
2 [8,9]
3 [2,6]
How can I access the first element of each list and save it into a new column of the DataFrame, to get a result like this?
A new_col
0 [1,2] 1
1 [3,4] 3
2 [8,9] 8
3 [2,6] 2
I know this could be done by iterating over each row, but is there any "pythonic" way?

As always, remember that storing non-scalar objects in frames is generally disfavoured, and should really only be used as a temporary intermediate step.
That said, you can use the .str accessor even though it's not a column of strings:
>>> df = pd.DataFrame({"A": [[1,2],[3,4],[8,9],[2,6]]})
>>> df["new_col"] = df["A"].str[0]
>>> df
A new_col
0 [1, 2] 1
1 [3, 4] 3
2 [8, 9] 8
3 [2, 6] 2
>>> df["new_col"]
0 1
1 3
2 8
3 2
Name: new_col, dtype: int64

You can use map with a lambda function:
df.loc[:, 'new_col'] = df.A.map(lambda x: x[0])

Use apply with x[0]:
df['new_col'] = df.A.apply(lambda x: x[0])
print(df)
A new_col
0 [1, 2] 1
1 [3, 4] 3
2 [8, 9] 8
3 [2, 6] 2

You can use the str.get method:
df['A'].str.get(0)

You can just use a conditional list comprehension which takes the first value of any iterable or else uses None for that item. List comprehensions are very Pythonic.
df['new_col'] = [val[0] if hasattr(val, '__iter__') else None for val in df["A"]]
>>> df
A new_col
0 [1, 2] 1
1 [3, 4] 3
2 [8, 9] 8
3 [2, 6] 2
Timings
df = pd.concat([df] * 10000)
%timeit df['new_col'] = [val[0] if hasattr(val, '__iter__') else None for val in df["A"]]
100 loops, best of 3: 13.2 ms per loop
%timeit df["new_col"] = df["A"].str[0]
100 loops, best of 3: 15.3 ms per loop
%timeit df['new_col'] = df.A.apply(lambda x: x[0])
100 loops, best of 3: 12.1 ms per loop
%timeit df.A.map(lambda x: x[0])
100 loops, best of 3: 11.1 ms per loop
Removing the safety check that ensures an iterable:
%timeit df['new_col'] = [val[0] for val in df["A"]]
100 loops, best of 3: 7.38 ms per loop
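One nice property of the .str[0] approach is that it degrades gracefully when a list is empty or missing, whereas the plain comprehension raises. A small sketch (made-up data for illustration):
s = pd.Series([[1, 2], [], None])
s.str[0]  # -> 1, NaN, NaN: empty lists and missing values become NaN
[val[0] for val in s]  # raises IndexError on the empty list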

In pandas, how can I convert a column of a DataFrame into dtype object?
Or better yet, into a factor? (For those who speak R, in Python, how do I as.factor()?)
Also, what's the difference between pandas.Factor and pandas.Categorical?
You can use the astype method to cast a Series (one column):
df['col_name'] = df['col_name'].astype(object)
Or the entire DataFrame:
df = df.astype(object)
Update
Since version 0.15, you can use the category datatype in a Series/column:
df['col_name'] = df['col_name'].astype('category')
Note: pd.Factor was deprecated and has since been removed in favor of pd.Categorical.
There's also the pd.factorize function:
# uses the df with column b defined in the next answer
In [150]: pd.factorize(df.b)
Out[150]: (array([0, 1, 0, 1, 2]), array(['yes', 'no', 'absent'], dtype=object))
In [152]: df['c'] = pd.factorize(df.b)[0]
In [153]: df
Out[153]:
a b c
0 1 yes 0
1 2 no 1
2 3 yes 0
3 4 no 1
4 5 absent 2
Factor and Categorical are the same, as far as I know. I think it was initially called Factor, and then changed to Categorical. To convert to Categorical maybe you can use pandas.Categorical.from_array, something like this:
In [27]: df = pd.DataFrame({'a' : [1, 2, 3, 4, 5], 'b' : ['yes', 'no', 'yes', 'no', 'absent']})
In [28]: df
Out[28]:
a b
0 1 yes
1 2 no
2 3 yes
3 4 no
4 5 absent
In [29]: df['c'] = pd.Categorical.from_array(df.b).labels
In [30]: df
Out[30]:
a b c
0 1 yes 2
1 2 no 1
2 3 yes 2
3 4 no 1
4 5 absent 0
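Note that pd.Categorical.from_array was later deprecated and eventually removed. In modern pandas the same integer codes are exposed through the codes attribute; a sketch of the equivalent:
df['c'] = pd.Categorical(df.b).codes  # same values as .labels above
# or via the category dtype:
df['c'] = df.b.astype('category').cat.codes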

Numpy vs Pandas axis

Why does axis differ in NumPy vs Pandas?
Example:
If I want to get rid of a column in Pandas I can do this:
df.drop("column", axis = 1, inplace = True)
Here, we are using axis = 1 to drop a column (vertically in a DF).
In Numpy, if I want to sum a matrix A vertically I would use:
A.sum(axis = 0)
Here I use axis = 0.
axis isn't used that often in pandas. A dataframe has 2 dimensions, which are often treated quite differently. In drop the axis definition is well documented, and actually corresponds to the numpy usage.
Make a simple array and data frame:
In [180]: x = np.arange(9).reshape(3,3)
In [181]: df = pd.DataFrame(x)
In [182]: df
Out[182]:
0 1 2
0 0 1 2
1 3 4 5
2 6 7 8
Delete a row from the array, or a column:
In [183]: np.delete(x, 1, 0)
Out[183]:
array([[0, 1, 2],
[6, 7, 8]])
In [184]: np.delete(x, 1, 1)
Out[184]:
array([[0, 2],
[3, 5],
[6, 8]])
Drop does the same thing for the same axis:
In [185]: df.drop(1, axis=0)
Out[185]:
0 1 2
0 0 1 2
2 6 7 8
In [186]: df.drop(1, axis=1)
Out[186]:
0 2
0 0 2
1 3 5
2 6 8
For sum, the definitions are the same as well:
In [188]: x.sum(axis=0)
Out[188]: array([ 9, 12, 15])
In [189]: df.sum(axis=0)
Out[189]:
0 9
1 12
2 15
dtype: int64
In [190]: x.sum(axis=1)
Out[190]: array([ 3, 12, 21])
In [191]: df.sum(axis=1)
Out[191]:
0 3
1 12
2 21
dtype: int64
The pandas sums are Series, which are the pandas equivalent of a 1d array.
Visualizing what axis does with reduction operations like sum is a bit tricky - especially with 2d arrays. Is the axis kept or removed? It can help to think about axis for 1d arrays (the only axis is removed), or 3d arrays, where one axis is removed leaving two.
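For example, with a 3d array (a quick sketch), the shape shows exactly which axis a sum removes:
In [192]: y = np.arange(24).reshape(2, 3, 4)
In [193]: y.sum(axis=0).shape  # axis 0 removed
Out[193]: (3, 4)
In [194]: y.sum(axis=1).shape  # axis 1 removed
Out[194]: (2, 4)
In [195]: y.sum(axis=2).shape  # axis 2 removed
Out[195]: (2, 3)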
When you drop a column, the label is looked up along axis 1, the horizontal axis. When you sum along axis 0, you sum down the rows, vertically.

How to add values to a multiindexed column dataframe

My dataframe is
a b
1 2 1 2
0 0.281045 0.975469 -0.538213 -0.180008
1 0.128696 1.875480 0.247637 -0.047927
I want to insert this matrix at columns (a, 3) and (b, 3):
[[1, 1],
[1, 1]]
so that the result looks like:
a b
1 2 3 1 2 3
0 0.281045 0.975469 1. -0.538213 -0.180008 1.
1 0.128696 1.875480 1. 0.247637 -0.047927 1.
It seems like there is no decent way to add values to a MultiIndex DataFrame. Here is the code I tried:
df[:,:,3] = [[1, 1],
[1, 1]]
But it didn't work...
You can create a new DataFrame with a MultiIndex and then join it to the data with DataFrame.join, sorting the MultiIndex afterwards:
arr = np.array([[1, 1],[1, 1]])
df1 = pd.DataFrame(arr,
                   index=df.index,
                   columns=pd.MultiIndex.from_product([df.columns.levels[0], [3]]))
df = df.join(df1).sort_index(axis=1)
print(df)
a b
1 2 3 1 2 3
0 0.281045 0.975469 1 -0.538213 -0.180008 1
1 0.128696 1.875480 1 0.247637 -0.047927 1
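Alternatively, assuming the same df, you can assign the new columns directly with tuple keys and sort afterwards; a minimal sketch:
df[('a', 3)] = 1
df[('b', 3)] = 1
df = df.sort_index(axis=1)
The scalar 1 broadcasts down each new column, which gives the same result as the join.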

Access elements of pandas series

I have a dataframe and I want to extract the frequency of 0/1 in a particular column.
df=pd.DataFrame({'A':[0,0,1,0,1]})
df
Out[6]:
A
0 0
1 0
2 1
3 0
4 1
Calculating the frequency of occurrence of 0/1s -
df['A'].value_counts()
Out[8]:
0 3
1 2
Name: A, dtype: int64
type(df['A'].value_counts())
Out[9]: pandas.core.series.Series
How can I extract the frequency of 0s and 1s into, let's say, two variables, zeros and ones, as -
zeros=3, ones=2
I think it would be a bit more flexible to return a dictionary:
In [234]: df['A'].value_counts().to_dict()
Out[234]: {0: 3, 1: 2}
or
In [236]: d = df['A'].astype(str).replace(['0','1'], ['zeros','ones']).value_counts().to_dict()
In [237]: d
Out[237]: {'ones': 2, 'zeros': 3}
In [238]: d['ones']
Out[238]: 2
In [239]: d['zeros']
Out[239]: 3
You can also access it directly:
In [3]: df['A'].value_counts().loc[0]
Out[3]: 3
In [4]: df['A'].value_counts().loc[1]
Out[4]: 2
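If you literally want the two variables, one option is (a small sketch; Series.get supplies a default in case a value never occurs):
counts = df['A'].value_counts()
zeros = counts.get(0, 0)  # 3
ones = counts.get(1, 0)  # 2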
Another way to solve this is to use the collections library and its Counter class.
import collections
c = collections.Counter(df['A'])
c
Out[31]: Counter({0: 3, 1: 2})
count_0s = c[0]  # Returns 3
count_1s = c[1]  # Returns 2

pyspark's flatMap in pandas

Is there an operation in pandas that does the same as flatMap in pyspark?
flatMap example:
>>> rdd = sc.parallelize([2, 3, 4])
>>> sorted(rdd.flatMap(lambda x: range(1, x)).collect())
[1, 1, 1, 2, 2, 3]
So far I can think of apply followed by itertools.chain, but I am wondering if there is a one-step solution.
There's a hack. I often do something like
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'x': [[1, 2], [3, 4, 5]]})
In [3]: df['x'].apply(pd.Series).unstack().reset_index(drop=True)
Out[3]:
0 1
1 3
2 2
3 4
4 NaN
5 5
dtype: float64
The NaN appears because apply(pd.Series) pads the shorter list with NaN to make a rectangular frame, but for a lot of things you can just drop it:
In [4]: df['x'].apply(pd.Series).unstack().reset_index(drop=True).dropna()
Out[4]:
0 1
1 3
2 2
3 4
5 5
dtype: float64
This trick uses all pandas code, so I would expect it to be reasonably efficient, though it might not like things like very different sized lists.
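For what it's worth, since pandas 0.25 the same flattening is a one-liner with Series.explode, which keeps the original index just like the output above; a sketch:
In [5]: df['x'].explode()
Out[5]:
0    1
0    2
1    3
1    4
1    5
Name: x, dtype: object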
There are three steps to solve this:
import pandas as pd
df = pd.DataFrame({'x': [[1, 2], [3, 4, 5]]})
df_new = df['x'].apply(pd.Series).unstack().reset_index().dropna()
df_new[['level_1', 0]]
Since July 2019 (pandas 0.25), Pandas offers pd.Series.explode to unnest frames. Here's a possible implementation of pd.Series.flatmap based on explode and map. Why?
flatmap operations should be a subset of map, not apply. Check this thread for map/applymap/apply details: Difference between map, applymap and apply methods in Pandas.
import pandas as pd
from typing import Callable
def flatmap(
        self,
        func: Callable[[pd.Series], pd.Series],
        ignore_index: bool = False):
    return self.map(func).explode(ignore_index=ignore_index)
pd.Series.flatmap = flatmap
# example
df = pd.DataFrame([(x,y) for x,y in zip(range(1,6),range(6,16))], columns=['A','B'])
print(df.head(5))
# A B
# 0 1 6
# 1 2 7
# 2 3 8
# 3 4 9
# 4 5 10
print(df.A.flatmap(range,False))
# 0    0
# 1    0
# 1    1
# 2    0
# 2    1
# 2    2
# 3    0
# 3    1
# 3    2
# 3    3
# 4    0
# 4    1
# 4    2
# 4    3
# 4    4
# Name: A, dtype: object
print(df.A.flatmap(range,True))
# 0 0
# 1 0
# 2 1
# 3 0
# 4 1
# 5 2
# 6 0
# 7 1
# 8 2
# 9 3
# 10 0
# 11 1
# 12 2
# 13 3
# 14 4
# Name: A, dtype: object
As you can see, the main issue is the indexing. You could ignore it and just reset, but then you're better off using NumPy or plain lists, as indexing is one of pandas' key selling points. If you do not care about indexing at all, you could reuse the idea of the solution above, changing pd.Series.map to pd.DataFrame.applymap and pd.Series.explode to pd.DataFrame.explode, forcing ignore_index=True.
I suspect that the answer is "no, not efficiently."
Pandas isn't built for nested data like this. I suspect that the case you're considering in Pandas looks a bit like the following:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'x': [[1, 2], [3, 4, 5]]})
In [3]: df
Out[3]:
x
0 [1, 2]
1 [3, 4, 5]
And that you want something like the following
x
0 1
0 2
1 3
1 4
1 5
It is far more typical to normalize your data in Python before you send it to Pandas. If Pandas did do this then it would probably only be able to operate at slow Python speeds rather than fast C speeds.
Generally one does a bit of munging of data before one uses tabular computation.
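For example, a sketch of that kind of pre-normalization, flattening the nested data from above with a plain comprehension before handing anything to pandas:
import pandas as pd

nested = [[1, 2], [3, 4, 5]]
# flatten in plain Python, keeping the originating row label
flat = [(i, v) for i, row in enumerate(nested) for v in row]
df = pd.DataFrame(flat, columns=['row', 'x']).set_index('row')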