Apply a function to each column of a dataframe - pandas

I have a dataframe of numbers going from 1 to 13 (each number is a location). As the index, I have set a timeline representing timesteps of 2 min over 24 h (720 rows). Each column represents a single person, so I have columns of locations over 24 h in 2-min timesteps.
I am trying to convert these numbers to binary (if it's a 13, I want a 1, and otherwise a 0). But when I try to apply the function I get an error:
The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Here's the code:
import pandas as pd
from datetime import timedelta
df = pd.read_csv("dataset_belgium/all_patterns_2MINS.csv", encoding="utf-8")
df = df.transpose()
df.reset_index(drop=True, inplace=True)
timeline = []
for timestep in range(len(df.index)):
    time = timedelta(seconds=timestep*2*60)
    time = str(time)
    timeline.append(time)
tl = pd.DataFrame(timeline)
tl.columns = ['timeline']
df=df.join(tl, how='left')
df = df.set_index('timeline')
#df.drop(['0:00:00'])
def to_binary(element):
    if element == 13:
        element = 1
    else:
        element = 0
    return element
binary_df = df.apply(to_binary)
Also, I would like to eliminate the first row, the one with index '0:00:00', since it doesn't contain numbers from 1 to 13.
Thanks in advance!

As you say in the title, you apply the function to each column of the data frame. So what you call element within the function is actually a whole column (a Series). The comparison element == 13 therefore produces a whole Series of booleans, and the if statement cannot decide whether such a Series counts as true or false; that is exactly what the "truth value of a Series is ambiguous" error means. One straightforward solution would be to use a for loop:
def to_binary(column):
    for element in column:
        if element == 13:
            element = 1
        else:
            element = 0
    return column
However, this would still not solve the more basic issue that the function doesn't actually change anything with lasting effect, because it only reassigns the local variable element and never writes back into the column.
An easy alternative approach is to use the pandas replace method, which allows you to explicitly replace arbitrary values with other ones:
df.replace([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13],
           [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
           inplace=True)
To delete the first row, you can use df = df[1:].
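As a side note (my own sketch, not part of the answer above), the same conversion can also be written as a single vectorized comparison, since comparing the whole frame against 13 yields a boolean mask that can be cast to integers. This assumes df already holds only the location columns with numeric values 1 to 13:

binary_df = (df == 13).astype(int)   # True -> 1, False -> 0
binary_df = binary_df.iloc[1:]       # drop the first row ('0:00:00'), as asked in the question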

Related

I need to return a value from a dataframe cell as a variable, not a series

I have the following issue:
When I use the .loc function it returns a Series, not a single value without an index.
I need to do some math operations with the selected cells. The code that I am using is:
import pandas as pd
data = [[82,1], [30, 2], [3.7, 3]]
df = pd.DataFrame(data, columns = ['Ah-Step', 'State'])
df['Ah-Step'].loc[df['State']==2]+ df['Ah-Step'].loc[df['State']==3]
.values[0] will do what OP wants.
Assuming one wants to obtain the value 30, the following will do the job
value = df.loc[df['State'] == 2, 'Ah-Step'].values[0]
print(value)
[Out]: 30.0
So, in OP's specific case, the operation 30+3.7 could be done as follows
df.loc[df['State'] == 2, 'Ah-Step'].values[0] + df['Ah-Step'].loc[df['State']==3].values[0]
[Out]: 33.7
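A related option (my own addition, not from the answer above) is Series.item(), which returns the scalar directly and raises an error if the selection does not contain exactly one element:

# assuming the same df as above
ah_2 = df.loc[df['State'] == 2, 'Ah-Step'].item()   # 30.0
ah_3 = df.loc[df['State'] == 3, 'Ah-Step'].item()   # 3.7
print(ah_2 + ah_3)                                   # 33.7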

What does the [1] do when using .where()?

I'm practicing on a Data Cleaning Kaggle exercise.
In the parsing-dates example I can't figure out what the [1] does at the end of the indices line.
Thanks!
# Finding indices corresponding to rows in different date format
indices = np.where([date_lengths == 24])[1]
print('Indices with corrupted data:', indices)
earthquakes.loc[indices]
As described in the documentation, numpy.where called with a single argument is equivalent to calling np.asarray([date_lengths == 24]).nonzero().
numpy.nonzero returns a tuple with as many items as the input array has dimensions, containing the indices of the non-zero values.
>>> np.nonzero([1,0,2,0])
(array([0, 2]),)
Indexing with [1] gets the second element (i.e. the second dimension), but since the input was wrapped in [...], this is equivalent to doing:
np.where(date_lengths == 24)[0]
>>> np.nonzero([1,0,2,0])[0]
array([0, 2])
It is an artefact of the extra [] around the condition. For example:
a = np.arange(10)
Finding, for example, the indices where a > 3 can be done like this:
np.where(a > 3)
gives as output a tuple with one array
(array([4, 5, 6, 7, 8, 9]),)
So the indices can be obtained as
indices = np.where(a > 3)[0]
In your case, the condition is between [], which is unnecessary, but still works.
np.where([a > 3])
returns a tuple in which the first array holds only zeros (the indices along the extra dimension added by the brackets) and the second array is the array of indices you want
(array([0, 0, 0, 0, 0, 0]), array([4, 5, 6, 7, 8, 9]))
so the indices are obtained as
indices = np.where([a > 3])[1]
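To tie this back to the original snippet, here is a minimal sketch (date_lengths is made-up data standing in for the Kaggle example) showing that the wrapped and unwrapped forms select the same rows:

import numpy as np

# hypothetical lengths of the date strings; only the 24-character ones are in the other format
date_lengths = np.array([10, 24, 10, 24, 10])

indices_wrapped = np.where([date_lengths == 24])[1]   # needs [1] because of the extra [...]
indices_plain = np.where(date_lengths == 24)[0]       # simpler, equivalent form

print(indices_wrapped)   # [1 3]
print(indices_plain)     # [1 3]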

Does Pandas have a resample method without dependency on a datetime index?

I have a series that I want to apply an external function to in subsets/chunks of three. Although the actual external function is more complex, for the sake of an example, let's just assume my external function takes an ndarray of integers and returns the sum of all values. So for example:
series = pd.Series([1,1,1,1,1,1,1,1,1])
# Some pandas magic similar to:
result = series.resample(3).apply(myFunction)
# where 3 just represents every 3 values and
# result == pd.Series([3,3,3])
I looked at combining Series.resample and Series.apply as hinted at by the pseudocode above, but it appears resample depends on a datetime index. Any ideas on how I can effectively downsample by applying an external function like this without a datetime index? Or do you just recommend creating a temporary datetime index to do this and then reverting to the original index?
pandas groupby would do the trick here. What you need is a repeated index that specifies the subsets/chunks.
Create chunks
n = 3
repeat_idx = np.repeat(np.arange(0,len(series), n), n)[:len(series)]
print(repeat_idx)
array([0, 0, 0, 3, 3, 3, 6, 6, 6])
Groupby
def myFunction(l):
    output = 0
    for item in l:
        output += item
    return output
series = pd.Series([1,1,1,1,1,1,1,1,1])
result = series.groupby(repeat_idx).apply(myFunction)
print(result)
0 3
3 3
6 3
The solution will also work for chunk sizes that don't evenly divide the length of the series:
n = 4
repeat_idx = np.repeat(np.arange(0,len(series), n), n)[:len(series)]
print(repeat_idx)
array([0, 0, 0, 0, 4, 4, 4, 4, 8])
result = series.groupby(repeat_idx).apply(myFunction)
print(result)
0 4
4 4
8 1
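A slightly more compact variant (my own sketch, not part of the answer above) builds the grouping key with integer division instead of np.repeat, which avoids the trailing slice:

import numpy as np
import pandas as pd

series = pd.Series([1, 1, 1, 1, 1, 1, 1, 1, 1])
n = 3

# every block of n consecutive positions gets the same label 0, 1, 2, ...
result = series.groupby(np.arange(len(series)) // n).apply(myFunction)   # or .sum() for this toy function
print(result)
# 0    3
# 1    3
# 2    3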

Assert an integer is in list on pandas series

I have a DataFrame with two pandas Series as follows:
value accepted_values
0 1 [1, 2, 3, 4]
1 2 [5, 6, 7, 8]
I would like to efficiently check if the value is in accepted_values using pandas methods.
I already know I can do something like the following, but I'm interested in a faster approach if there is one (it took around 27 seconds on a 1-million-row DataFrame):
import pandas as pd
df = pd.DataFrame({"value":[1, 2], "accepted_values": [[1,2,3,4], [5, 6, 7, 8]]})
def check_first_in_second(values: pd.Series):
    return values[0] in values[1]

are_in_accepted_values = df[["value", "accepted_values"]].apply(
    check_first_in_second, axis=1
)

if not are_in_accepted_values.all():
    raise AssertionError("Not all value in accepted_values")
I think if you create a DataFrame from the list column you can compare it with DataFrame.eq and test whether at least one value per row matches with DataFrame.any:
df1 = pd.DataFrame(df["accepted_values"].tolist(), index=df.index)
are_in_accepted_values = df1.eq(df["value"]).any(axis=1).all()
Another idea:
are_in_accepted_values = all(v in a for v, a in df[["value", "accepted_values"]].to_numpy())
I found a little optimisation to your second idea. Using a bit more numpy than pandas makes it faster (more than 3x, tested with time.perf_counter()).
values = df["value"].values
accepted_values = df["accepted_values"].values
are_in_accepted_values = all(s in e for s, e in np.column_stack([values, accepted_values]))
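For completeness, a plain-Python sketch of the same check (my addition, not from the answers above) simply zips the two columns, which avoids building the intermediate object array:

are_in_accepted_values = all(
    v in a for v, a in zip(df["value"], df["accepted_values"])
)
if not are_in_accepted_values:
    raise AssertionError("Not all value in accepted_values")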

Find subgroups of a numpy array

I have a numpy array like this one:
A = ([249, 250, 3016, 3017, 5679, 5680, 8257, 8258,
10756, 10757, 13178, 13179, 15531, 15532, 17824, 17825,
20058, 20059, 22239, 22240, 24373, 24374, 26455, 26456,
28491, 28492, 30493, 30494, 32452, 32453, 34377, 34378,
36264, 36265, 38118, 38119, 39939, 39940, 41736, 41737,
43501, 43502, 45237, 45238, 46950, 46951, 48637, 48638])
I would like to write a small script that finds subgroups of values in the array for which the difference is smaller than a certain threshold, let's say 3, and returns the highest value of each subgroup. In the case of array A the output should be:
A_out = [250, 3017, 5680, 8258, 10757, 13179, ...]
Is there a numpy function for that?
Here's a vectorized Numpy approach.
First, the data (in a numpy array) and the threshold:
In [41]: A = np.array([249, 250, 3016, 3017, 5679, 5680, 8257, 8258,
10756, 10757, 13178, 13179, 15531, 15532, 17824, 17825,
20058, 20059, 22239, 22240, 24373, 24374, 26455, 26456,
28491, 28492, 30493, 30494, 32452, 32453, 34377, 34378,
36264, 36265, 38118, 38119, 39939, 39940, 41736, 41737,
43501, 43502, 45237, 45238, 46950, 46951, 48637, 48638])
In [42]: threshold = 3
The following produces the array delta. It is almost the same as delta = np.diff(A), but I want to include one more value that is greater than the threshold at the end of delta.
In [43]: delta = np.hstack((np.diff(A), threshold + 1))
Now the group maxima are simply A[delta > threshold]:
In [46]: A[delta > threshold]
Out[46]:
array([ 250, 3017, 5680, 8258, 10757, 13179, 15532, 17825, 20059,
22240, 24374, 26456, 28492, 30494, 32453, 34378, 36265, 38119,
39940, 41737, 43502, 45238, 46951, 48638])
Or, if you want, A[delta >= threshold]. That gives the same result for this example:
In [47]: A[delta >= threshold]
Out[47]:
array([ 250, 3017, 5680, 8258, 10757, 13179, 15532, 17825, 20059,
22240, 24374, 26456, 28492, 30494, 32453, 34378, 36265, 38119,
39940, 41737, 43502, 45238, 46951, 48638])
There is a case where this answer differs from #DrV's answer. From your description, it isn't clear to me how a set of values such as 1, 2, 3, 4, 5, 6 should be handled. The consecutive differences are all 1, but the difference between the first and last is 5. The numpy calculation above will treat these as a single group. #DrV's answer will create two groups.
Interpretation 1: The value of an item in a group must not differ more than 3 units from that of the first item of the group
This is one of the things where NumPy's capabilities are at their limits. As you will have to iterate through the list, I suggest a pure Python approach:
first_of_group = A[0]
previous = A[0]
group_lasts = []
for a in A[1:]:
    # if this item no longer belongs to the group
    if abs(a - first_of_group) > 3:
        group_lasts.append(previous)
        first_of_group = a
    previous = a
# add the last item separately, because it is always the last of its group
group_lasts.append(a)
Now you have the group lasts in group_lasts.
Using any NumPy array functionality here does not seem to provide much help.
Interpretation 2: The value of an item in a group must not differ more than 3 units from the previous item
This is easier, as we can easily form a list of group breaks as in Warren Weckesser's answer. Here NumPy helps a lot.
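A minimal sketch of that NumPy approach (my own illustration, assuming A is the NumPy array from above): mark the positions where the gap to the next element exceeds the threshold, treat the last element as a break as well, and select those elements as the group lasts.

import numpy as np

threshold = 3
# True wherever a group ends: the jump to the next value exceeds the threshold;
# the final True closes the last group
breaks = np.hstack((np.diff(A) > threshold, True))
group_lasts = A[breaks]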