Pandas | How to effectively filter a column - dataframe

I'm looking for a way to quickly and effectively filter through a dataframe column and remove values that don't meet a condition.
Say I have a column with the numbers 4, 5, and 10. I want to filter the column and replace any number above 7 with 0. How would I go about this?

You're talking about two separate things: filtering and value replacement. Both have their uses and end up similar in nature, but for filtering I'll point you to this great answer.
Let's say our data frame is called df and looks like:
    A   B
1   4  10
2   4   2
3  10   1
4   5   9
5  10   3
Column A fits your statement of a column only having values 4, 5, 10. If you wanted to replace numbers above 7 with 0, this would do it:
df["A"] = [0 if x > 7 else x for x in df["A"]]
If you read through the right-hand side, it cleanly explains what it is doing. It helps to include parentheses to separate the "what to do" from the "what you're doing it over":
df["A"] = [(0 if x > 7 else x) for x in df["A"]]
If you want to do a manipulation over multiple columns, then utilizing zip allows you to do it easily. For example, if you want the sum of columns A and B then:
df["sum"] = [x[0] + x[1] for x in zip(df["A"], df["B"])]
Take care when you overwrite data - this removes information. It's a good practice to have the transformed data in other columns so you can trace back when something inevitably goes wonky.
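For example, a minimal sketch of that practice (the column name "A_capped" is my own invention, not from the question):
df["A_capped"] = [(0 if x > 7 else x) for x in df["A"]]  # df["A"] stays intact for tracing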

There are many options. One possibility for if/then logic is np.where:
import pandas as pd
import numpy as np
df = pd.DataFrame({'x': [1, 200, 4, 5, 6, 11],
                   'y': [4, 5, 10, 24, 4, 3]})
df['y'] = np.where(df['y'] > 7, 0, df['y'])
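For completeness, the same replacement can also be written as a boolean-mask assignment; a small sketch using the same df:
df.loc[df['y'] > 7, 'y'] = 0  # set matching rows of 'y' to 0 in place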

Related

Does Pandas have a resample method without dependency on a datetime index?

I have a series that I want to apply an external function to in subsets/chunks of three. Although the actual external function is more complex, for the sake of an example, let's just assume my external function takes an ndarray of integers and returns the sum of all values. So for example:
series = pd.Series([1,1,1,1,1,1,1,1,1])
# Some pandas magic similar to:
result = series.resample(3).apply(myFunction)
# where 3 just represents every 3 values and
# result == pd.Series([3,3,3])
I looked at combining Series.resample and Series.apply, as hinted at by the pseudocode above, but it appears resample depends on a datetime index. Any ideas on how I can effectively downsample by applying an external function like this without a datetime index? Or do you just recommend creating a temporary datetime index, doing this, and then reverting to the original index?
pandas.DataFrame.groupby would do the trick here. What you need is a repeated index to specify the subsets/chunks.
Create chunks
import numpy as np
import pandas as pd

series = pd.Series([1, 1, 1, 1, 1, 1, 1, 1, 1])

n = 3
repeat_idx = np.repeat(np.arange(0, len(series), n), n)[:len(series)]
print(repeat_idx)
[0 0 0 3 3 3 6 6 6]
Groupby
def myFunction(l):
    output = 0
    for item in l:
        output += item
    return output

result = series.groupby(repeat_idx).apply(myFunction)
print(result)
0 3
3 3
6 3
The solution also works when the chunk size doesn't evenly divide the length of the series:
n = 4
repeat_idx = np.repeat(np.arange(0, len(series), n), n)[:len(series)]
print(repeat_idx)
[0 0 0 0 4 4 4 4 8]
result = series.groupby(repeat_idx).apply(myFunction)
print(result)
0 4
4 4
8 1
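As an aside (my own sketch, not part of the original answer), the same chunks can be labeled with integer division, which avoids the repeat-and-truncate step; the group labels become 0, 1, 2, ... instead of 0, n, 2n, ..., but the chunks are identical:
chunk_idx = np.arange(len(series)) // n  # e.g. [0 0 0 0 1 1 1 1 2] for n = 4
result = series.groupby(chunk_idx).apply(myFunction)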

Plotting by groupby and average

I have a dataframe with multiple columns and rows. One column, say 'name', has several rows of names, with the same name used multiple times. Other columns, say 'x', 'y', 'z', 'zz', have values. I want to group by name, get the mean of each column (x, y, z, zz) for each name, and then plot it on a bar chart.
pandas.DataFrame.groupby is an important data-wrangling tool. Let's first make a dummy pandas data frame.
df = pd.DataFrame({"name": ["John", "Sansa", "Bran", "John", "Sansa", "Bran"],
"x": [2, 3, 4, 5, 6, 7],
"y": [5, -3, 10, 34, 1, 54],
"z": [10.6, 99.9, 546.23, 34.12, 65.04, -74.29]})
>>> df
    name  x   y       z
0   John  2   5   10.60
1  Sansa  3  -3   99.90
2   Bran  4  10  546.23
3   John  5  34   34.12
4  Sansa  6   1   65.04
5   Bran  7  54  -74.29
We can use the column label to group the data (here the label is "name"). Explicitly spelling out the by= keyword can be omitted (cf. df.groupby("name")).
df.groupby(by = "name").mean().plot(kind = "bar")
which gives us a nice bar graph.
Transposing the groupby result with .T (as also suggested by anky) yields a different visualization. We can also pass a dictionary as the by parameter to determine the groups; by can also be a function, a pandas Series, or an ndarray.
df.groupby(by = {1: "Sansa", 2: "Bran"}).mean().T.plot(kind = "bar")
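If you only want some of the columns in the chart, you can select them before aggregating; a minimal sketch (the column choice here is arbitrary):
df.groupby("name")[["x", "y"]].mean().plot(kind="bar")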

How do I assign index values to a level of Multi index Data Frame?

I have a MultiIndex data frame. However, I want to change its first level to a certain list of index values. Suppose its first level is initially [2,4,1]; I want to change it to [1,2,100]. What is the simplest way to achieve this? My current approach would involve reset_index, changing the column values, and set_index again.
One way is to create a dictionary mapping the old values to the replacement values, iterate through the index as tuples replacing the values, and assign the new index back to the DataFrame:
new_vals = {2: 1, 4: 2, 1: 100}
df.index = pd.MultiIndex.from_tuples([(new_vals[tup[0]], tup[1]) for tup in df.index.to_list()])
(This assumes your MultiIndex has only 2 levels, for every additional level that you want to keep you'd need to add tup[2] etc into the list comprehension.)
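An alternative that handles any number of levels (my own sketch, assuming a reasonably recent pandas) is MultiIndex.set_levels, which rewrites the level values while leaving the codes untouched:
# new_vals is the same old-value -> new-value mapping as above
df.index = df.index.set_levels([new_vals[v] for v in df.index.levels[0]], level=0)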
Use df.reindex()
data.reindex([1,2,100])
Use rename:
Setup
import pandas as pd
index = pd.MultiIndex.from_tuples([(e, i) for i, e in enumerate([2, 4, 1])])
df = pd.DataFrame([1, 2, 3], index=index)
print(df)
Output (of setup)
0
2 0 1
4 1 2
1 2 3
Code
new_index = [1, 2, 100]
new_vals = dict(zip(df.index.levels[0], new_index))
print(df.rename(new_vals, level=0))
Output
0
1 0 1
2 1 2
100 2 3

using pd.DataFrame.apply to create multiple columns

My first question here!
I'm having some trouble figuring out what I'm doing wrong here, trying to append columns to an existing pd.DataFrame object. Specifically, my original dataframe has n-many columns, and I want to use apply to append an additional 2n-many columns to it. The problem seems to be that doing this via apply() doesn't work, in that if I try to append more than n-many columns, it falls over. This doesn't make sense to me, and I was hoping somebody could either shed some light on to why I'm seeing this behaviour, or suggest a better approach.
For example,
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10, 2))

def this_works(x):
    return 5 * x

def this_fails(x):
    return np.append(5 * x, 5 * x)

df.apply(this_works, 1)  # Two columns of output, as expected
df.apply(this_fails, 1)  # Unexpected failure...
Any ideas? I know there are other ways to create the data columns, this approach just seemed very natural to me and I'm quite confused by the behaviour.
SOLVED! CT Zhu's solution below takes care of this; my error arose from not properly returning a pd.Series object in the above.
Are you trying to do a few different calculations on your df and put the resulting vectors together in one larger DataFrame, like in this example?:
In [39]:
print df
0 1
0 0.718003 0.241216
1 0.580015 0.981128
2 0.477645 0.463892
3 0.948728 0.653823
4 0.056659 0.366104
5 0.273700 0.062131
6 0.151237 0.479318
7 0.425353 0.076771
8 0.317731 0.029182
9 0.543537 0.589783
In [40]:
print df.apply(lambda x: pd.Series(np.hstack((x*5, x*6))), axis=1)
0 1 2 3
0 3.590014 1.206081 4.308017 1.447297
1 2.900074 4.905639 3.480088 5.886767
2 2.388223 2.319461 2.865867 2.783353
3 4.743640 3.269114 5.692369 3.922937
4 0.283293 1.830520 0.339951 2.196624
5 1.368502 0.310656 1.642203 0.372787
6 0.756187 2.396592 0.907424 2.875910
7 2.126764 0.383853 2.552117 0.460624
8 1.588656 0.145909 1.906387 0.175091
9 2.717685 2.948917 3.261222 3.538701
FYI, in this trivial case you can just do 5 * df!
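More generally (my sketch, not from the answer), the same four columns can be built without apply at all by concatenating the two scaled frames side by side:
out = pd.concat([df * 5, df * 6], axis=1)
out.columns = range(out.shape[1])  # relabel as 0..3 to mimic the apply output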
I think the issue here is that np.append flattens the Series:
In [11]: np.append(df[0], df[0])
Out[11]:
array([ 0.33145275, 0.14964056, 0.86268119, 0.17311983, 0.29618537,
0.48831228, 0.64937305, 0.03353709, 0.42883925, 0.99592229,
0.33145275, 0.14964056, 0.86268119, 0.17311983, 0.29618537,
0.48831228, 0.64937305, 0.03353709, 0.42883925, 0.99592229])
what you want is for it to create four columns (isn't it?). The axis=1 means you are doing this row-wise (i.e. x is the row, which is a Series)...
In general you want apply to return either:
a single value, or
a Series (with unique index).
That said, I kinda thought the following might work (to get four columns):
In [21]: df.apply((lambda x: pd.concat([x[0] * 5, x[0] * 5], axis=1)), axis=1)
TypeError: ('cannot concatenate a non-NDFrame object', u'occurred at index 0')
In [22]: df.apply(lambda x: np.array([1, 2, 3, 4]), axis=1)
ValueError: Shape of passed values is (10,), indices imply (10, 2)
In [23]: df.apply(lambda x: pd.Series([1, 2, 3, 4]), axis=1) # works
Maybe I expected the first to raise about non-unique index... but I was surprised that the second failed.
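A postscript not in the original answers: newer pandas versions (0.23+) added a result_type parameter to apply that expands array-like returns into columns, so the question's this_fails works with one extra argument; a sketch:
df.apply(this_fails, axis=1, result_type="expand")  # four columns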

numpy, sums of subsets with no iterations [duplicate]

I have a massive data array (500k rows) that looks like:
id value score
1 20 20
1 10 30
1 15 0
2 12 4
2 3 8
2 56 9
3 6 18
...
As you can see, there is a non-unique ID column to the left, and various scores in the 3rd column.
I'm looking to quickly add up all of the scores, grouped by IDs. In SQL this would look like SELECT sum(score) FROM table GROUP BY id
With NumPy I've tried iterating through each ID, truncating the table by each ID, and then summing the score up for that table.
table_trunc = table[(table == id).any(1)]
score = sum(table_trunc[:,2])
Unfortunately I'm finding the first command to be dog-slow. Is there any more efficient way to do this?
You can use np.bincount():
import numpy as np
ids = [1, 1, 1, 2, 2, 2, 3]
data = [20, 30, 0, 4, 8, 9, 18]
print(np.bincount(ids, weights=data))
The output is [ 0. 50. 21. 18.], meaning the sum for id==0 is 0, the sum for id==1 is 50, and so on.
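If the zero-count slots are unwanted, one way (my own sketch) to pair the sums with the ids actually present:
sums = np.bincount(ids, weights=data)
present = np.unique(ids)
print(dict(zip(present, sums[present])))  # {1: 50.0, 2: 21.0, 3: 18.0}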
I noticed the numpy tag, but in case you don't mind using pandas (or if you read in these data using this module), this task becomes a one-liner:
import pandas as pd
df = pd.DataFrame({'id': [1,1,1,2,2,2,3], 'score': [20,30,0,4,8,9,18]})
So your dataframe would look like this:
id score
0 1 20
1 1 30
2 1 0
3 2 4
4 2 8
5 2 9
6 3 18
Now you can use the functions groupby() and sum():
df.groupby(['id'], sort=False).sum()
which gives you the desired output:
score
id
1 50
2 21
3 18
By default the result would be sorted by id, so I pass the flag sort=False, which might improve speed for huge dataframes.
You can try using boolean operations:
import numpy as np

ids = np.array([1, 1, 1, 2, 2, 2, 3])
data = np.array([20, 30, 0, 4, 8, 9, 18])
[((ids == i) * data).sum() for i in np.unique(ids)]
This may be a bit more effective than using np.any, but it will clearly have trouble if you have a very large number of unique ids along with a large overall data table.
If you're looking only for the sum, you probably want to go with bincount. If you also need other grouping operations like product, mean, std, etc., have a look at https://github.com/ml31415/numpy-groupies . It has the fastest python/numpy grouping operations around; see the speed comparison there.
Your sum operation there would look like:
from numpy_groupies import aggregate
res = aggregate(id, score)
The numpy_indexed package has vectorized functionality to perform this operation efficiently, in addition to many related operations of this kind:
import numpy_indexed as npi
npi.group_by(id).sum(score)
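If I read the numpy_indexed API correctly, the reduction returns a pair of arrays (the unique ids and the per-id sums), so a usage sketch would be:
unique_ids, sums = npi.group_by(id).sum(score)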
You can use a for loop and numba:
import numpy as np
from numba import njit

@njit
def wbcnt(b, w, k):
    bins = np.arange(k)
    bins = bins * 0
    for i in range(len(b)):
        bins[b[i]] += w[i]
    return bins
Using @HYRY's variables:
ids = [1, 1, 1, 2, 2, 2, 3]
data = [20, 30, 0, 4, 8, 9, 18]
Then:
wbcnt(ids, data, 4)
array([ 0, 50, 21, 18])
Timing
%timeit wbcnt(ids, data, 4)
%timeit np.bincount(ids, weights=data)
1000000 loops, best of 3: 1.99 µs per loop
100000 loops, best of 3: 2.57 µs per loop
Maybe using itertools.groupby, you can group on the ID and then iterate over the grouped data.
(The data must be sorted according to the group-by key, in this case the ID.)
>>> import itertools
>>> data = [(1, 20, 20), (1, 10, 30), (1, 15, 0), (2, 12, 4), (2, 3, 0)]
>>> groups = itertools.groupby(data, lambda x: x[0])
>>> for i in groups:
...     for y in i:
...         if isinstance(y, int):
...             print(y)
...         else:
...             for p in y:
...                 print('-', p)
Output:
1
- (1, 20, 20)
- (1, 10, 30)
- (1, 15, 0)
2
- (2, 12, 4)
- (2, 3, 0)
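Since the question ultimately wants per-id sums, here is a hedged completion of this approach over the same (already sorted) data:
>>> sums = {key: sum(row[2] for row in rows)
...         for key, rows in itertools.groupby(data, lambda x: x[0])}
>>> sums
{1: 50, 2: 4}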