pandas: applying a function to DataFrame columns is very slow

   os  hour  day
0  13    14    0
1  19    14    0
2  13    14    0
3  13    14    0
4  13    14    0
Here is my DataFrame, and I just want to get a new column which is str(os)+'_'+str(hour)+'_'+str(day). I used apply to build it, but it is very slow.
Is there a high-performance way to do this?
I also tried converting the DataFrame to an array and processing it row by row; that seems slow too.
The DataFrame has nearly two hundred million rows.

Not sure what code you are using, but you can try
df.astype(str).apply('_'.join, axis=1)
0    13_14_0
1    19_14_0
2    13_14_0
3    13_14_0
4    13_14_0

Why does Pandas df.mode() return a zero before the actual modal value?

When I run df.mode() on the DataFrame below, I get a leading zero before the expected output. Why is that?
df
sample    1   2   3   4   5   6   7   8   9  10
zone run
2    5   14  12  22  23  24  22  23  22  23  23
print(df.iloc[:, 3:10].mode(axis=1))
gives
          0
zone run
2    5   23
expecting
zone run
2    5   23
From the pd.Series.mode documentation:
Return the mode(s) of the dataset. Always returns Series even if only one value is returned.
So this is by design. A Series must have an index, and by default it starts counting from 0. This keeps the return type stable regardless of whether there is a single mode or several values tied for the mode.
So if you take a slice where values are tied for the mode, the result carries the labels 0, ..., N-1 for the N values tied for the mode (modal values in sorted order).
df.iloc[:, 4:7]
# sample    5   6   7
# zone run
# 2    5   24  22  23

df.iloc[:, 4:7].mode(axis=1)
#           0   1   2   <- 3 values tied for the mode, so 3 labels
# zone run
# 2    5   22  23  24
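If you only want a single modal value per row, selecting column 0 of that result gives a Series keyed by the original index (a usage sketch, not part of the answer above):

df.iloc[:, 3:10].mode(axis=1)[0]
# zone  run
# 2     5      23
# Name: 0, dtype: int64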
My thinking is that df.mode returns a DataFrame. By default, if no column names are given, a DataFrame labels its columns by position, so 0 is allocated because that is where pandas/Python starts counting.
Because it is a DataFrame, the only way to change that column name, which in this case is a positional label, is the .rename(columns=...) method. Hence, to get what you need you will have to:
df.iloc[:, 3:10].agg('mode', axis=1).reset_index().rename(columns={0: ''})
   zone  run
0     2    5  23

iterrows() over 2 columns and save results in one column

In my data frame I want to iterrows() over two columns but save the result in one column. For example, df is
  x    y
  5   10
 30  445
 70   32
expected output is
 points  sequence
      5         1
     10         2
     30         1
    445         2
I know about iterrows(), but it saves the output in two different columns. How can I get the expected output, and is there any way to generate the sequence number according to a condition? Any help will be appreciated.
First, never use iterrows, because it is really slow.
If you want a 1, 2 sequence by number of columns, convert the values to a numpy array with DataFrame.to_numpy, flatten with numpy.ravel, then build the sequence with numpy.tile:
df = pd.DataFrame({'points': df.to_numpy().ravel(),
                   'sequence': np.tile([1, 2], len(df))})
print(df)
   points  sequence
0       5         1
1      10         2
2      30         1
3     445         2
4      70         1
5      32         2
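As a small generalization (my addition, not from the answer above): with more than two columns, the sequence can be derived from the column count instead of hard-coding [1, 2]:

df = pd.DataFrame({'points': df.to_numpy().ravel(),
                   'sequence': np.tile(np.arange(1, df.shape[1] + 1), len(df))})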
Do it this way:
>>> pd.DataFrame([(value, position + 1)
...               for _, row in df.iterrows()
...               for position, value in enumerate(row)],
...              columns=['points', 'sequence'])
   points  sequence
0       5         1
1      10         2
2      30         1
3     445         2
4      70         1
5      32         2

How to run assembled sample data

I have a pandas DataFrame assembled from various samples that I picked randomly. Now I want to run it 10,000 times and get the mean values of the columns ['MP_Learning'] and ['LCC_saving'] for each row.
How should I write the code?
I tried
output = np.mean(df)
but it didn't work.
   PC  EL  MP_Learning  LCC_saving
0   1   0           24          95
1   1   1           35          67
2   1   2           12          23
3   1   3           23          45
4   2   0           36          67
5   2   1           74          10
6   2   2           80          23
np.random.seed()
output = []
for i in range(10000):
    output = np.mean(df)
output
You did not post your entire code, so I don't know where the data come from; however, I replicated something similar and here is the solution. In your loop, you are supposed to append to output rather than overwrite it. Use only one of the two lines inside the for loop below, unless you need them both.
import pandas as pd
import numpy as np

df = pd.DataFrame([[1, 0, 24, 95],
                   [1, 1, 35, 67],
                   [1, 2, 12, 23],
                   [1, 3, 23, 45],
                   [2, 0, 36, 67],
                   [2, 1, 74, 10],
                   [2, 2, 80, 23]],
                  columns=["PC", "EL", "MP_Learning", "LCC_saving"],
                  index=[0, 1, 2, 3, 4, 5, 6]).T

output = []
for i in range(10000):
    # Use the line below to get the mean of both columns
    output.append(np.mean([df.loc["MP_Learning"], df.loc["LCC_saving"]]))
    # Use the line below to get the mean of one column
    output.append(np.mean(df.loc["MP_Learning"]))
print(output)
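As an aside (an assumption about the goal, not part of the answer above): if all that is needed are the two column means, pandas computes them directly without any loop. Using the transposed df built above:

# transpose back, then take the column means of MP_Learning and LCC_saving
df.T[["MP_Learning", "LCC_saving"]].mean()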

How to get the mode of a column in pandas when several values tie for the mode

I have a data frame and I'd like to get the mode of a specific column.
I'm using:
freq_mode = df.mode()['my_col'][0]
However, I get the error:
ValueError: ('The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()', 'occurred at index my_col')
I'm guessing it's because I have several values tied for the mode.
Any of the modes will do, it doesn't matter which. How can I use any() to get any one of the existing modes?
For me, your code works fine with sample data.
If you need to select the first value of the Series returned by mode, use:
freq_mode = df['my_col'].mode().iat[0]
We can demonstrate with one column:
df = pd.DataFrame({"A": [14, 4, 5, 4, 1, 5],
                   "B": [5, 2, 54, 3, 2, 7],
                   "C": [20, 20, 7, 3, 8, 7],
                   "train_label": [7, 7, 6, 6, 6, 7]})
X = df['train_label'].mode()
print(X)
DataFrame
    A   B   C  train_label
0  14   5  20            7
1   4   2  20            7
2   5  54   7            6
3   4   3   3            6
4   1   2   8            6
5   5   7   7            7
Output
0    6
1    7
dtype: int64
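Since any of the tied modes is acceptable here, the first answer's .iat[0] works on this example as well (a usage sketch; mode returns the tied values in sorted order):

X.iat[0]
# 6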

rolling sum of a column in pandas dataframe at variable intervals

I have a list of index numbers that represent index locations for a DF: list_index = [2, 7, 12].
I want to sum a single column in the DF by rolling through each number in list_index and totaling the counts between the index points (restarting the count at 0 after each index point). Here is a mini example.
The desired output is in the OUTPUT column, which increments every time there is another 1 in COL 1 and RESTARTS the count at 0 at the location after the number in list_index.
I was able to get it to work with a loop, but there are millions of rows in the DF and the loop takes a while to run. It seems like I need a lambda function with a sum, but I need to pass the start and end points of the index.
Something like lambda x: x.rolling(start_index, end_index).sum()? Can anyone help me out with this?
You can try a cumulative sum and retrieve only the information related to the 1 values; a rolling sum with different intervals is not possible:
a = df['col'].eq(1).cumsum()
df['output'] = a - a.mask(df['col'].eq(1)).ffill().fillna(0).astype(int)
Out:
    col  output
0     0       0
1     1       1
2     1       2
3     0       0
4     1       1
5     1       2
6     1       3
7     0       0
8     0       0
9     0       0
10    0       0
11    1       1
12    1       2
13    0       0
14    0       0
15    1       1
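For reference, a self-contained version of the snippet above, with the sample column reconstructed from the output shown (my reconstruction, not the asker's actual data):

import numpy as np
import pandas as pd

df = pd.DataFrame({'col': [0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1]})

# running count of consecutive 1s, restarting at 0 after every 0
a = df['col'].eq(1).cumsum()
df['output'] = a - a.mask(df['col'].eq(1)).ffill().fillna(0).astype(int)
print(df)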