Since each Series in my data frame holds one-element tuples, I need to convert each element into a single number. Basically I have something like this:
price_table['Col1'].apply(lambda x: x[0])
But I actually need to do this for each column. x itself is a tuple, but it has only one number inside, so I return x[0] to get its value as a float instead of a tuple.
In R I would pass axis = c(1,2), but here putting 2 numbers in axis doesn't work:
price_table.apply(lambda x: x[0],axis = 1)
TypeError: <lambda>() got an unexpected keyword argument 'axis'
Is there any way to apply this simple function to each element in the data frame?
Thanks in advance.
For me the following works well:
price_table['Col1'].apply(lambda x: x[0], 1)
I do not pass axis at all. Series.apply has no axis parameter; unknown keyword arguments are forwarded to the function itself, which is why your lambda complained about axis. The extra positional 1 here is simply absorbed by Series.apply's convert_dtype parameter, so it changes nothing.
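If you want the unpacking applied to every column at once, DataFrame.applymap applies a function element-wise to the whole frame, so no axis argument is involved at all. A minimal sketch with invented column names and data (on pandas ≥ 2.1 the same method is called DataFrame.map):

```python
import pandas as pd

# Toy frame where every cell is a one-element tuple (illustrative data)
price_table = pd.DataFrame({
    'Col1': [(1.5,), (2.0,)],
    'Col2': [(3.25,), (4.75,)],
})

# applymap runs the function on each element of the DataFrame
unpacked = price_table.applymap(lambda x: x[0])
print(unpacked.dtypes)  # every column is now float64
```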
So I have a dataframe column made of arrays that I have already sliced using indexes from other columns: I keep the part of each array determined by a position and an index. position is a list of tuples giving start and end points (which can coincide), and the index is a single value; both are stored in columns as well. The code is the following:
df['relative_cells_array'] = df.apply(
    lambda x: x['cells_array'][:, x['position'][x['relative_track']][0]:
                                  x['position'][x['relative_track']][1] + 1]
    if x['relative_track'] <= len(x['position']) else np.nan,
    axis=1
)
This works. The problem comes when I use other arrays that have been modified; in this case the array uses spatial binomial weights to interpolate values. Because of the standardization, dividing by the number of neighbouring cells turns the original array values into floats. I convert the array to integer and print it, but I still get an error; other attempts gave me an error about the tuple instead (position is a list of tuples). But why did it work before?
The code for this is the following:
df['relative_cells_array_weighted1'] = df.apply(
    lambda x: [[int(y) for y in sublist]
               for sublist in x['cells_weighted1'][:, x['position'][x['relative_track']][0]:
                                                      x['position'][x['relative_track']][1] + 1]]
    if x['relative_track'] <= len(x['position']) else np.nan,
    axis=1
)
df['relative_average_weighted1_cell_reading'] = df['relative_cells_array_weighted1'].apply(lambda x: [num for sublist in x for num in sublist])
This is the error: TypeError: 'float' object is not iterable
And after making some further changes I got the tuple error instead (I don't remember the exact changes; I was iterating with ChatGPT).
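One likely culprit for the 'float' object is not iterable error is the else np.nan branch of the first apply: rows where relative_track falls out of range end up holding a bare float (nan), and the flattening apply then tries to iterate over it. A minimal sketch of a guard, using a simplified stand-in for the nested-list column (data invented, column names taken from the question):

```python
import numpy as np
import pandas as pd

# Stand-in column: one row holds a list of lists, the other fell
# through to the np.nan branch and holds a plain float
df = pd.DataFrame({'relative_cells_array_weighted1': [[[1, 2], [3, 4]], np.nan]})

# Only flatten real lists; pass the nan rows through untouched
df['relative_average_weighted1_cell_reading'] = df['relative_cells_array_weighted1'].apply(
    lambda x: [num for sublist in x for num in sublist] if isinstance(x, list) else np.nan
)
print(df['relative_average_weighted1_cell_reading'].tolist())
```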
I have a dataFrame with 28 columns (features) and 600 rows (instances). I want to select all rows, but only columns 0–11 and 16–27; that is, I don't want columns 12–15.
I wrote the following code, but it doesn't work and throws a syntax error at the : in 0:12 and 16:. Can someone help me understand why?
X = df.iloc[:,[0:12,16:]]
I know there are other ways for selecting these rows, but I am curious to learn why this one does not work, and how I should write it to work (if there is a way).
For now, I have written it as:
X = df.iloc[:,0:12]
X = X + df.iloc[:,16:]
This seems to return an incorrect result: I have already treated the NaN values of df, but when I use this code, X contains lots of NaNs!
Thanks for your feedback in advance.
You can use np.r_ to concatenate the slices (np.r_ needs explicit endpoints, so the open-ended 16: must be written out as 16:28):
x = df.iloc[:, np.r_[0:12, 16:28]]
iloc has these allowed inputs (from the docs):
An integer, e.g. 5.
A list or array of integers, e.g. [4, 3, 0].
A slice object with ints, e.g. 1:7.
A boolean array.
A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above). This is useful in method chains, when you don’t have a reference to the calling object, but would like to base your selection on some value.
What you're passing to iloc in X = df.iloc[:,[0:12,16:]] is not a list of integers or a slice of ints, but an attempt at a list of slice objects. Slice syntax like 0:12 is only legal directly inside square-bracket subscripting, not inside a list literal, which is why Python reports a syntax error at the colons. You need to convert those slices to a single array of integers, and a convenient way to do that is numpy's np.r_ helper:
X = df.iloc[:, np.r_[0:12, 16:28]]
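As a side note on why the X = X + df.iloc[:,16:] workaround produced NaNs: + on two DataFrames is element-wise addition aligned on row and column labels, and the two slices share no columns, so every cell of the result is missing on one side. To glue the blocks side by side instead, pd.concat along axis=1 is the usual tool. A small sketch with invented data:

```python
import numpy as np
import pandas as pd

# Toy frame with 28 columns, as in the question (values invented)
df = pd.DataFrame(np.arange(5 * 28).reshape(5, 28))

# '+' aligns on column labels; the label sets are disjoint, so every cell is NaN
bad = df.iloc[:, 0:12] + df.iloc[:, 16:]
print(bad.isna().all().all())  # True

# concat stitches the two column blocks together instead of adding them
X = pd.concat([df.iloc[:, 0:12], df.iloc[:, 16:]], axis=1)
print(X.shape)  # (5, 24)
```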
I am trying to build a random forest classifier to determine the 'type' of an object based on different attributes. I am having trouble understanding iloc and separating the predictors from the classification. If the 50th column is the 'type' column, I am wondering why the iloc (commented out) line does not work, but the line y = dataset["type"] does. I have attached the code below. Thank you!
X = dataset.iloc[:, 0:50].values
y = dataset["type"]
#y = dataset.iloc[:,50].values
Let's assume that the first column in your dataframe is named 0 and the following columns are named consecutively, like the result of the following lines:
last_col = 50
tab = pd.DataFrame([[x for x in range(last_col)] for c in range(10)])
Now please try tab.iloc[:, 0:50] - it will work, because you used a slice to select column positions.
But if you try tab.iloc[:, 50] - it will not work, because there is no column at position 50: with 50 columns the valid positions run from 0 to 49. So if "type" is the 50th column, its zero-based position is 49, and dataset.iloc[:, 49].values should be the equivalent of dataset["type"] (note that dataset.iloc[:, 0:50] would then include the "type" column as well; 0:49 excludes it).
Slicing and selecting a column by its position are just a bit different. From the pandas documentation:
.iloc[] is primarily integer position based (from 0 to length-1 of the axis)
I hope this helps.
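The difference is easy to see in a runnable sketch (assuming, as above, a frame with exactly 50 columns, so valid positions are 0–49): slices past the end are clipped like ordinary Python list slices, while a single out-of-range position raises an IndexError:

```python
import pandas as pd

# 10 rows x 50 columns, positions 0..49
tab = pd.DataFrame([[x for x in range(50)] for _ in range(10)])

print(tab.iloc[:, 0:50].shape)   # (10, 50) -- a slice works
print(tab.iloc[:, 0:999].shape)  # (10, 50) -- even an oversized slice is clipped

try:
    tab.iloc[:, 50]              # a single position must actually exist
except IndexError as exc:
    print('IndexError:', exc)
```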
Normally, a relatively long dataframe like
df = pd.DataFrame(np.random.randint(0,10,(100,2)))
df
will display a truncated form in jupyter notebook like
With the head, the tail, an ellipsis in between, and the row/column count at the end.
However, after style.apply
def highlight_max(x):
    return ['background-color: yellow' if v == x.max() else '' for v in x]
df.style.apply(highlight_max)
we get all rows displayed.
Is it possible to still display the truncated form of dataframe after style.apply?
Something simple like this?
def display_df(dataframe, function):
    display(dataframe.head().style.apply(function))
    display(dataframe.tail().style.apply(function))
    print(f'{dataframe.shape[0]} rows x {dataframe.shape[1]} columns')
display_df(df, highlight_max)
Output:
**** EDIT ****
def display_df(dataframe, function):
    display(pd.concat([dataframe.iloc[:5, :],
                       pd.DataFrame(index=['...'], columns=dataframe.columns),
                       dataframe.iloc[-5:, :]]).style.apply(function))
    print(f'{dataframe.shape[0]} rows x {dataframe.shape[1]} columns')
display_df(df, highlight_max)
Output:
The jupyter preview is basically something like this:
def display_df(dataframe):
    display(pd.concat([dataframe.iloc[:5, :],
                       pd.DataFrame(index=['...'], columns=dataframe.columns, data={0: '...', 1: '...'}),
                       dataframe.iloc[-5:, :]]))
but if you try to apply a style to that you get an error (TypeError: '>=' not supported between instances of 'int' and 'str'), because it tries to compare and highlight the string values '...'
You can subset the dataframe first and style afterwards; this gives you control over what you display each time. (The Styler object returned by df.style.apply has no head, tail or sample methods of its own, so the subsetting has to happen on the DataFrame.)
output = df.head(10).style.apply(highlight_max)  # 10 -> number of rows to display
output
If you want to see more varied data you can also use sample, which picks random rows:
df.sample(10).style.apply(highlight_max)
Note that highlight_max then only sees the subset, so the highlighted maximum is the maximum of the displayed rows.
I'm interested in performing Linear interpolation using the scipy.interpolate library. The dataset looks somewhat like this:
DATAFRAME for interpolation between X, Y for different RUNs
I'd like to use this interpolated function to find the missing Y from this dataset:
DATAFRAME to use the interpolation function
The number of runs shown here is just 3, but I'm running on a dataset that will run into thousands of runs, so I'd appreciate advice on how to build the interpolation iteratively over the runs. My attempt so far:
from scipy.interpolate import interp1d
InterpolatedFunction = {}
for RUNNumber in range(TotalRuns):
    InterpolatedFunction[RUNNumber] = interp1d(X, Y)
As I understand it, you want a separate interpolation function defined for each run. Then you want to apply these functions to a second dataframe. I defined a dataframe df with columns ['X', 'Y', 'RUN'], and a second dataframe, new_df with columns ['X', 'Y_interpolation', 'RUN'].
interpolating_functions = dict()
for run_number in range(1, max_runs):
    run_data = df[df['RUN'] == run_number][['X', 'Y']]
    interpolating_functions[run_number] = interp1d(run_data['X'], run_data['Y'])
Now that we have interpolating functions for each run, we can use them to fill in the 'Y_interpolation' column in a new dataframe. This can be done using the apply function, which takes a function and applies it to each row in a dataframe. So let's define an interpolate function that will take a row of this new df and use the X value and the run number to calculate an interpolated Y value.
def interpolate(row):
    int_func = interpolating_functions[row['RUN']]
    # _call_linear expects and returns an array; calling int_func(row['X'])
    # directly would also work and avoids the private method
    interp_y = int_func._call_linear([row['X']])
    return interp_y[0]
Now we just use apply and our defined interpolate function.
new_df['Y_interpolation'] = new_df.apply(interpolate, axis=1)
I'm using pandas version 0.20.3, and this gives me a new_df with the Y_interpolation column filled in for every run.
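Putting the pieces together, here is a self-contained sketch of the whole approach with invented toy data (three known points per run, and a second frame whose Y values are to be filled in); it calls the interp1d object directly, which is the public equivalent of _call_linear:

```python
import pandas as pd
from scipy.interpolate import interp1d

# Known (X, Y) samples per RUN (invented numbers)
df = pd.DataFrame({
    'RUN': [1, 1, 1, 2, 2, 2],
    'X':   [0.0, 1.0, 2.0, 0.0, 1.0, 2.0],
    'Y':   [0.0, 10.0, 20.0, 5.0, 6.0, 7.0],
})
# X values whose Y we want interpolated
new_df = pd.DataFrame({'RUN': [1, 2], 'X': [0.5, 1.5]})

# One interpolating function per run
interpolating_functions = {}
for run_number in df['RUN'].unique():
    run_data = df[df['RUN'] == run_number]
    interpolating_functions[run_number] = interp1d(run_data['X'], run_data['Y'])

# Look up the right function per row and evaluate it at that row's X
new_df['Y_interpolation'] = new_df.apply(
    lambda row: float(interpolating_functions[row['RUN']](row['X'])), axis=1
)
print(new_df)
```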