In [93]: a = np.arange(24).reshape(2, 3, 4)
In [94]: a[0, 1, ::2]
Out[94]: array([4, 6])
Can someone explain what '::2' means here?
Thanks!
::2 means: in this dimension, take every element with an even index (starting from 0 and counting by 2).
In other words: get the elements at a[0, 1, 0] and a[0, 1, 2] and put them into the same array.
Each index position (you have 3 in this example) is indexable and "sliceable". You have probably seen slices like [start:stop] before with ordinary lists; slices can also take a third value, the "step".
So [a:b:c] means [startPosition:endPosition:step], where endPosition is not included.
Having ::2 means start=0, end=the end of that dimension, step=2.
That dimension has length 4 (see your reshape line), so the indices it selects are 0 and 2 (1 and 3 are skipped, and 3 is the last element).
0 0 0 => 0
0 0 1 => 1
0 0 2 => 2
0 0 3 => 3
0 1 0 => 4 -> (0, 1, 0) is selected by the slice
0 1 1 => 5
0 1 2 => 6 -> (0, 1, 2) is selected by the slice
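For example, writing the defaults out explicitly gives exactly the same result, since that dimension has length 4:
In [95]: a[0, 1, 0:4:2]
Out[95]: array([4, 6])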
I have two arrays (i and j) that are exactly the same. I shuffle them with a specified random seed.
import numpy as np
np.random.seed(42)
i = np.array([0, 1, 2, 3, 4, 5, 6, 7])
j = np.array([0, 1, 2, 3, 4, 5, 6, 7])
np.random.shuffle(i)
np.random.shuffle(j)
print(i, j)
# [1 5 0 7 2 4 3 6] [3 7 0 4 5 2 1 6]
They were supposed to be the same after shuffling, but that is not the case.
Do you have any ideas about how to get the same results (like the example below) after shuffling?
# [1 5 0 7 2 4 3 6] [1 5 0 7 2 4 3 6]
Many thanks in advance!
Calling seed() sets the state of a global random number generator. Each call to shuffle continues from that same global generator, so the results are different, as they should be. If you want them to be the same, reset the seed before each call to shuffle.
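For example, a minimal sketch of that fix, re-seeding immediately before the second shuffle (the expected output matches the i from the question):

import numpy as np

i = np.array([0, 1, 2, 3, 4, 5, 6, 7])
j = np.array([0, 1, 2, 3, 4, 5, 6, 7])

np.random.seed(42)
np.random.shuffle(i)
np.random.seed(42)  # reset the global generator to the same state
np.random.shuffle(j)
print(i, j)
# [1 5 0 7 2 4 3 6] [1 5 0 7 2 4 3 6]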
Why does axis differ in NumPy vs Pandas?
Example:
If I want to get rid of a column in Pandas I could do this:
df.drop("column", axis = 1, inplace = True)
Here, we are using axis = 1 to drop a column (which runs vertically in a DF).
In Numpy, if I want to sum a matrix A vertically I would use:
A.sum(axis = 0)
Here I use axis = 0.
axis isn't used that often in pandas. A dataframe has 2 dimensions, which are often treated quite differently. In drop, the axis definition is well documented and actually corresponds to the numpy usage.
Make a simple array and data frame:
In [180]: x = np.arange(9).reshape(3,3)
In [181]: df = pd.DataFrame(x)
In [182]: df
Out[182]:
   0  1  2
0  0  1  2
1  3  4  5
2  6  7  8
Delete a row from the array, or a column:
In [183]: np.delete(x, 1, 0)
Out[183]:
array([[0, 1, 2],
[6, 7, 8]])
In [184]: np.delete(x, 1, 1)
Out[184]:
array([[0, 2],
[3, 5],
[6, 8]])
Drop does the same thing for the same axis:
In [185]: df.drop(1, axis=0)
Out[185]:
   0  1  2
0  0  1  2
2  6  7  8
In [186]: df.drop(1, axis=1)
Out[186]:
   0  2
0  0  2
1  3  5
2  6  8
For sum, the definitions are the same as well:
In [188]: x.sum(axis=0)
Out[188]: array([ 9, 12, 15])
In [189]: df.sum(axis=0)
Out[189]:
0     9
1    12
2    15
dtype: int64
In [190]: x.sum(axis=1)
Out[190]: array([ 3, 12, 21])
In [191]: df.sum(axis=1)
Out[191]:
0     3
1    12
2    21
dtype: int64
The pandas sums are Series, which are the pandas equivalent of a 1d array.
Visualizing what axis does with reduction operations like sum is a bit tricky - especially with 2d arrays. Is the axis kept or removed? It can help to think about axis for 1d arrays (the only axis is removed), or 3d arrays, where one axis is removed leaving two.
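For example, a quick sketch of how the reduced axis disappears from the shape:

import numpy as np

y = np.arange(24).reshape(2, 3, 4)
print(y.sum(axis=0).shape)  # (3, 4) - the first axis is removed
print(y.sum(axis=2).shape)  # (2, 3) - the last axis is removed
print(np.array([1, 2, 3]).sum(axis=0))  # 6 - the only axis is removed, leaving a scalar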
When you drop a column, its name is looked up along axis 1, the horizontal axis. When you sum along axis 0, you sum vertically, down the rows.
I need to break up a column in a DataFrame that currently collects multiple values (someone else's Excel sheet, unfortunately) for a categorical data field that can have multiple values.
As you can see below, the column has 15 category codes, listed in the column header.
Original DataFrame
I want to split the column based on the category codes seen in the column header ['Pamphlet'], and then transform the values collected for each record in the original column so they map to their respective new columns as a 1 for checked and a 0 for unchecked, instead of the raw values [1,2,4,5].
This is the code to split on the , between values, but I still need to put the results into the new columns, which I need to set up by splitting the ['Pamphlet'] column header [15: 1) OSA\n2) Nutrition\n3) Activity\n4) etc.]:
df_old['Pamphlets'].str.split(pat = ',', n = -1, expand = True)
Shape of desired DataFrame
If I could just get an outline of what the best approach is, and whether it is even possible to do this within Pandas. Thanks.
You need to go through your columns one by one and divide the headers, then create a new dataframe for each column made up of split columns, then join all that back to the original dataframe. It's a bit messy but doable.
You need to use a function and some loops to go through the columns.
First let's define the dataframe. (It would be much appreciated if, in future questions, you could supply a reproducible dataframe and any other required data.)
import pandas as pd

data = {
    "1) Mail\n2) Email \n3) At PAC/TPAC": [2, 1, 3, 2, 3, 1, 3, 2, 3, 1],
    "1) ACC\n2) IM \n3) PT\n4) Smoking, \n5) Cessation": [5, 1, 4, 4, 2, 5, 1, 4, 3, 2],
}
df_full = pd.DataFrame(data)
print(df_full)
1) Mail\n2) Email \n3) At PAC/TPAC 1) ACC\n2) IM \n3) PT\n4) Smoking, \n5) Cessation
0 2 5
1 1 1
2 3 4
3 2 4
4 3 2
5 1 5
6 3 1
7 2 4
8 3 3
9 1 2
We will go through the dataframe column by column using a function. For now, let's build the logic manually for the first column; afterwards we'll turn it into a function.
First, let's grab the first column.
s_col = df_full.iloc[:, 0]
print(s_col)
0    2
1    1
2    3
3    2
4    3
5    1
6    3
7    2
8    3
9    1
Name: 1) Mail\n2) Email \n3) At PAC/TPAC, dtype: int64
Split the header into individual pieces.
col = s_col.name.split("\n")
print(col)
['1) Mail', '2) Email ', '3) At PAC/TPAC']
Clean up any leading or trailing white space.
col = [x.strip() for x in col]
print(col)
['1) Mail', '2) Email', '3) At PAC/TPAC']
Create a new dataframe from series and column heads.
data = {col[x]: s_col.to_list() for x in range(len(col))}
df = pd.DataFrame(data)
print(df)
   1) Mail  2) Email  3) At PAC/TPAC
0        2         2               2
1        1         1               1
2        3         3               3
3        2         2               2
4        3         3               3
5        1         1               1
6        3         3               3
7        2         2               2
8        3         3               3
9        1         1               1
Create a copy to make changes to the values.
df_res = df.copy()
Go through the column headers, get the first number, then filter and apply bool.
for col in df.columns:
    value = pd.to_numeric(col[0])
    df_res.loc[df[col] == value, col] = 1
    df_res.loc[df[col] != value, col] = 0
print(df_res)
   1) Mail  2) Email  3) At PAC/TPAC
0        0         1               0
1        1         0               0
2        0         0               1
3        0         1               0
4        0         0               1
5        1         0               0
6        0         0               1
7        0         1               0
8        0         0               1
9        1         0               0
Now we have split a column into its components and assigned a 0/1 flag to each.
Let's step back and make the above a function so we can use it for each column in the original dataframe.
def split_column(s_col):
    # Split the header into individual pieces.
    col = s_col.name.split("\n")
    # Clean up any leading or trailing white space.
    col = [x.strip() for x in col]
    # Create a new dataframe from the series values and the new column heads.
    data = {col[x]: s_col.to_list() for x in range(len(col))}
    df = pd.DataFrame(data)
    # Create a copy to make changes to the values.
    df_res = df.copy()
    # Go through the column headers, get the leading number, then filter and apply a 0/1 flag.
    for col in df.columns:
        value = pd.to_numeric(col[0])
        df_res.loc[df[col] == value, col] = 1
        df_res.loc[df[col] != value, col] = 0
    return df_res
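One caveat, since the real sheet has 15 codes: pd.to_numeric(col[0]) only reads the first character of the header, so it would break on two-digit option numbers like "12) ...". A small tweak (my suggestion, not part of the original answer) that reads the whole leading number instead:

for col in df.columns:
    # parse everything before the ")" instead of just the first character
    value = pd.to_numeric(col.split(")")[0])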
Now for the last step. Let's create a loop to go through the columns in the original dataframe, call the function to split each column, and then concat it to the original dataframe less the columns that were split.
for c in df_full.columns:
    # Call the function to get the split columns in a new dataframe.
    df_split = split_column(df_full[c])
    # Join it with the original full dataframe, dropping the current column.
    df_full = pd.concat([df_full.loc[:, ~df_full.columns.isin([c])], df_split], axis=1)
print(df_full)
   1) Mail  2) Email  3) At PAC/TPAC  1) ACC  2) IM  3) PT  4) Smoking,  5) Cessation
0        0         1               0       0      0      0            0             1
1        1         0               0       1      0      0            0             0
2        0         0               1       0      0      0            1             0
3        0         1               0       0      0      0            1             0
4        0         0               1       0      1      0            0             0
5        1         0               0       0      0      0            0             1
6        0         0               1       1      0      0            0             0
7        0         1               0       0      0      0            1             0
8        0         0               1       0      0      1            0             0
9        1         0               0       0      1      0            0             0
Here is the full code...
import pandas as pd

data = {
    "1) Mail\n2) Email \n3) At PAC/TPAC": [2, 1, 3, 2, 3, 1, 3, 2, 3, 1],
    "1) ACC\n2) IM \n3) PT\n4) Smoking, \n5) Cessation": [5, 1, 4, 4, 2, 5, 1, 4, 3, 2],
}
df_full = pd.DataFrame(data)

def split_column(s_col):
    # Split the header into individual pieces.
    col = s_col.name.split("\n")
    # Clean up any leading or trailing white space.
    col = [x.strip() for x in col]
    # Create a new dataframe from the series values and the new column heads.
    data = {col[x]: s_col.to_list() for x in range(len(col))}
    df = pd.DataFrame(data)
    # Create a copy to make changes to the values.
    df_res = df.copy()
    # Go through the column headers, get the leading number, then filter and apply a 0/1 flag.
    for col in df.columns:
        value = pd.to_numeric(col[0])
        df_res.loc[df[col] == value, col] = 1
        df_res.loc[df[col] != value, col] = 0
    return df_res

for c in df_full.columns:
    # Call the function to get the split columns in a new dataframe.
    df_split = split_column(df_full[c])
    # Join it with the original full dataframe, dropping the current column.
    df_full = pd.concat([df_full.loc[:, ~df_full.columns.isin([c])], df_split], axis=1)
print(df_full)
My first question here! I'm looking for help with vectorizing an operation on a pandas dataframe. I can simplify the problem down to a dataframe with three columns: a column whose values will be updated, and two columns that hold iteration numbers, which are not the same between the two columns.
What I'd like to do is: for each place where the first iteration column changes value, refer to the corresponding value of the other iteration column (at that same index), and then fill a value (zero) into the update column, but only for the rows in which the second iteration column still has that same value. Hopefully this example will explain a bit better:
import pandas as pd

df = pd.DataFrame()
df['update_col'] = [1, 2, 3, 4, 5, 6, 7, 8, 9]
df['iter2'] = [0, 1, 1, 2, 2, 3, 3, 4, 4]
df['iter1'] = [0, 0, 1, 1, 1, 2, 2, 2, 2]
print(df)
#  update_col  iter2  iter1
0           1      0      0
1           2      1      0
2           3      1      1
3           4      2      1
4           5      2      1
5           6      3      2
6           7      3      2
7           8      4      2
8           9      4      2
So basically, I want to do the following:
1. Reference the iter1 column, and detect when it changes (i.e. goes from 0 to 1, or from 1 to 2).
2. Look at the iter2 column at that index.
3. Change the values in the "update column" to zero for all rows, starting from the index in step 2, until iter2 is incremented to a new value.
So the output would look like the following:
#  update_col  iter2  iter1
0           1      0      0
1           2      1      0
2           0      1      1
3           4      2      1
4           5      2      1
5           0      3      2
6           0      3      2
7           8      4      2
8           9      4      2
I think a properly constructed groupby could be a solution, but I am still a rookie at using it effectively.
I am currently achieving what I want with a complicated for loop, but it makes the run time extremely long for the size and number of dataframes I have to apply this to. I think another solution could be a map or replace operation, but the complicating caveat is that I don't want to update all of the values for that value of iter2, only the values from that index of iter1 until the last of those values in iter2.
Any help or insight is greatly appreciated!
This might not be a huge improvement on the loop you already have defined, but I think it eliminates the need to use a nested loop at least:
import pandas as pd
# creating data frame
df = pd.DataFrame()
df['update_col'] = [1, 2, 3, 4, 5, 6, 7, 8, 9]
df['iter2'] = [0, 1, 1, 2, 2, 3, 3, 4, 4]
df['iter1'] = [0, 0, 1, 1, 1, 2, 2, 2, 2]
# computing difference from prev element by creating a shifted col
# and subtracting from the original.
# (you could also use a rolling window function for this)
df['change1'] = df['iter1'] - df['iter1'].shift(1)
df['change2'] = df['iter2'] - df['iter2'].shift(1)
# creating boolean cols to flag if iter1 or iter2 have changed
df['start'] = df['change1'] == 1
df['stop'] = df['change2'] == 1
# list to store result: if True, you update value to 0
res = [False] * len(df['start'])
for i in range(len(df['start'])):
    if df['start'][i]:
        # print('start detected')
        res[i] = True
    elif i > 1 and (not df['stop'][i]) and res[i-1]:
        # print('continuation detected')
        res[i] = True
    # print(f'set res[{i}] to ', res[i])
df['update_to_zero'] = res
Which results in this df:
   update_col  iter2  iter1  change1  change2  start   stop  update_to_zero
0           1      0      0      NaN      NaN  False  False           False
1           2      1      0      0.0      1.0  False   True           False
2           3      1      1      1.0      0.0   True  False            True
3           4      2      1      0.0      1.0  False   True           False
4           5      2      1      0.0      0.0  False  False           False
5           6      3      2      1.0      1.0   True   True            True
6           7      3      2      0.0      0.0  False  False            True
7           8      4      2      0.0      1.0  False   True           False
8           9      4      2      0.0      0.0  False  False           False
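Note that update_to_zero is just a flag column; to actually zero out update_col, you can apply it as a boolean mask, for example:

df.loc[df['update_to_zero'], 'update_col'] = 0

And for completeness, a fully vectorized sketch of the same idea (my own variant, assuming iter2 never decreases): label each run of equal iter2 values as a group, flag the rows where iter1 increments, and carry the flag forward to the end of each run with a grouped cumulative maximum.

import pandas as pd

df = pd.DataFrame({
    'update_col': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'iter2':      [0, 1, 1, 2, 2, 3, 3, 4, 4],
    'iter1':      [0, 0, 1, 1, 1, 2, 2, 2, 2],
})

# label each run of equal iter2 values
group = df['iter2'].ne(df['iter2'].shift()).cumsum()
# flag the rows where iter1 just incremented
started = df['iter1'].diff().eq(1)
# carry the flag to the end of each iter2 run
mask = started.astype(int).groupby(group).cummax().astype(bool)
df.loc[mask, 'update_col'] = 0
print(df['update_col'].tolist())  # [1, 2, 0, 4, 5, 0, 0, 8, 9]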
Hope this helps!
I have to create an id in a pandas df where a counter resets itself.
My data looks like
counter
0
0
1
1
1
2
0
1
1
My desired output looks like
counter  id
0        0
0        0
1        1
1        1
1        1
2        2
0        3
1        4
1        4
I have tried the following, which does not help. Any help will be appreciated.
df['id'] = df.groupby(df.counter.tolist(), sort=False).ngroup()
Check diff and cumsum
df['id'] = df['counter'].diff().ne(0).cumsum() - 1
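Step by step on the sample counter [0, 0, 1, 1, 1, 2, 0, 1, 1]:

df['counter'].diff()        # NaN, 0, 1, 0, 0, 1, -2, 1, 0
df['counter'].diff().ne(0)  # True wherever the value changed (the leading NaN also counts)
df['counter'].diff().ne(0).cumsum() - 1  # 0, 0, 1, 1, 1, 2, 3, 4, 4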
Another way, using itertools.groupby:
from itertools import groupby
sum([[y] * len(list(g)) for y, (_, g) in enumerate(groupby(df.counter))], [])
Out[46]: [0, 0, 1, 1, 1, 2, 3, 4, 4]
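To attach this as the id column:

df['id'] = sum([[y] * len(list(g)) for y, (_, g) in enumerate(groupby(df.counter))], [])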