How to run assembled sample data - pandas

I have a pandas DataFrame assembled from various samples that I randomly picked. Now, I want to run 10,000 iterations and get mean values of the columns ['MP_Learning'] and ['LCC_saving'] for each row.
How should I write the code?
I tried
output=np.mean(df), but it didn't work.
   PC  EL  MP_Learning  LCC_saving
0   1   0           24          95
1   1   1           35          67
2   1   2           12          23
3   1   3           23          45
4   2   0           36          67
5   2   1           74          10
6   2   2           80          23
np.random.seed()
output = []
for i in range(10000):
    output = np.mean(df)
output

You did not post your entire code, so I don't know where the data come from. However, I replicated something similar, and here is the solution. In your loop, though, you are supposed to append to output. Use only one of the two lines inside the for loop, unless you need them both.
import pandas as pd
import numpy as np

df = pd.DataFrame([[1, 0, 24, 95],
                   [1, 1, 35, 67],
                   [1, 2, 12, 23],
                   [1, 3, 23, 45],
                   [2, 0, 36, 67],
                   [2, 1, 74, 10],
                   [2, 2, 80, 23]],
                  columns=["PC", "EL", "MP_Learning", "LCC_saving"],
                  index=[0, 1, 2, 3, 4, 5, 6]
                  ).T

output = []
for i in range(10000):
    # Use the line below to get the mean of both columns together
    output.append(np.mean([df.loc["MP_Learning"], df.loc["LCC_saving"]]))
    # Use the line below to get the mean of one column
    output.append(np.mean(df.loc["MP_Learning"]))
print(output)
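Note that df never changes inside the loop above, so every iteration appends the same value; the loop only pays off if the random sampling step that builds df is repeated inside it. If you just want the two column means once, you can also skip the transpose and the loop entirely; a minimal sketch using the data from the question:

import pandas as pd

df = pd.DataFrame([[1, 0, 24, 95], [1, 1, 35, 67], [1, 2, 12, 23], [1, 3, 23, 45],
                   [2, 0, 36, 67], [2, 1, 74, 10], [2, 2, 80, 23]],
                  columns=["PC", "EL", "MP_Learning", "LCC_saving"])

# per-column means, computed once
print(df[["MP_Learning", "LCC_saving"]].mean())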

Trying to convert column to be row indexes, set_index error

data_new.set_index('Usual Mode of Transport to Work')

I am running this in a Jupyter notebook. I am trying to convert a column to be the row index; however, the values show up as NaN. How do I resolve it? Thanks. I'm a beginner in Python.
Let's start with a toy dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,5,size=(5, 4)), columns=list('ABCD'))
print(df)
   A  B  C  D
0  3  1  2  1
1  2  2  3  4
2  2  4  4  1
3  1  0  3  2
4  1  2  4  0
Now, let's set column A as the index
df.set_index('A')
   B  C  D
A
3  1  2  1
2  2  3  4
2  4  4  1
1  0  3  2
1  2  4  0
This sets the index correctly but doesn't save the newly indexed dataframe in the original dataframe variable, i.e., df. So when you check the value of df, you will still find the original dataframe.
To save the new indexing, you can do one of the following:
df = df.set_index('A')
or
df.set_index('A', inplace=True)
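For completeness, here is a minimal sketch (with a tiny made-up frame) showing that it is the reassignment, not the set_index call itself, that persists the new index:

import pandas as pd

df = pd.DataFrame({'A': [3, 2, 1], 'B': [1, 2, 3]})
df.set_index('A')       # returns a new frame; df itself is unchanged
print(df.index)         # still the default RangeIndex

df = df.set_index('A')  # reassigning is what keeps the new index
print(df.index)         # Index([3, 2, 1], dtype='int64', name='A')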
Coming to the NaN values, I believe it has something to do with using a Jupyter notebook. Since Jupyter allows jumping between cells, execution does not necessarily follow the linear order of traditional scripts, which can get confusing. You can use the "Variable View" in Jupyter to cross-check that you are passing the value you intend to. I hope this helps you figure out the NaN issue.

Concatenate/Append many dataframes in pandas

I have a list of dataframes df1 to df20 that are being created from a loop, and I need to concatenate all of them in one go. These dataframes are dynamic, and there can be anywhere between 1 and 20 of them depending on the loop that generates them in my code.
So, I was trying to create an empty list first, add these dataframe names to it (in a loop from 1 to 20, for example), and use this list in pd.concat(df_list) as below:
df_list = []
for i in range(1, 21):
    df_list.append(f'df{i}')
pd.concat(df_list)
The above code creates a list of dataframe names, but in the form of strings with quotes, as below, and I'm unable to concatenate the dataframes using pd.concat(df_list) since it considers all the dataframe names to be string elements:
print(df_list)
['df1', 'df2', 'df3', 'df4', 'df5', 'df6', 'df7', 'df8', 'df9', 'df10', 'df11', 'df12', 'df13', 'df14', 'df15', 'df16', 'df17', 'df18','df19','df20']
I would appreciate it if anyone could help me get this concatenation of dataframes working.
I think if I could add the dataframe names without quotes, like df_list=[df0,df1,df2...], then pd.concat could work; otherwise, please let me know if there is a better alternative to get this done. Thanks!
UPDATE
As per the commented suggestions, I've created a simple loop to create multiple dataframes and then tried to append the "names of these dataframes" to an empty list in the same loop where the dataframes are created. But the output is not what I am expecting.
mylist = []
for x in range(1, 4):
    globals()[f"df{x}"] = pd.DataFrame(np.random.randint(99, size=(3, 3)), columns=['AA', 'BB', 'CC'])
    mylist.append(globals()[f"df{x}"])
The above code creates 3 dataframes (df1, df2 and df3), and the empty list does get appended to, but with the contents of the dataframes, as shown below:
print(mylist)
[ AA BB CC
0 57 92 50
1 33 47 28
2 82 77 46, AA BB CC
0 18 8 75
1 1 15 52
2 4 69 38, AA BB CC
0 19 24 31
1 24 52 62
2 50 8 63]
But my desired output is not the contents of the dataframes; it is the names of the dataframes themselves, like below:
print(mylist)
[df1,df2,df3]
I would appreciate it if anyone could show me how to get this done. I think there must be some simple way to do this.
That's because you're effectively appending strings to your list. If you have named variables df1 to df20, you can access them by using locals() (or globals(), depending on where your named variables are and on whether you are concatenating the dataframes inside a function or not). Here is an example:
df1 = 0
df2 = 1
df3 = 2
df_list = []
for i in range(1, 4):
    df_list.append(locals()[f'df{i}'])
>>> df_list
[0, 1, 2]
EDIT: I think what you want to do is the following:
import pandas as pd
import numpy as np
mylist = []
for x in range(1, 4):
    df = pd.DataFrame(np.random.randint(99, size=(3, 3)), columns=['AA', 'BB', 'CC'])
    mylist.append(df)
dfs = pd.concat(mylist)
Note that printing mylist is never going to show you something along the lines of mylist = [df1, df2, df3], even if you hardcode those names; printing always shows the entire content of all the variables inside your list. If for some reason you don't know how many dataframes you're going to concatenate, just implement a while loop that breaks when you want to stop creating dataframes.
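If you do want to keep track of the names, a dict keyed by name avoids globals() entirely; a minimal sketch (the name frames is just an illustration):

import numpy as np
import pandas as pd

frames = {}
for i in range(1, 4):
    # keep the frames in a dict instead of creating df1, df2, ... variables
    frames[f"df{i}"] = pd.DataFrame(np.random.randint(99, size=(3, 3)),
                                    columns=['AA', 'BB', 'CC'])

print(list(frames))               # ['df1', 'df2', 'df3'] -- the "names"
dfs = pd.concat(frames.values())  # concatenate all of them in one go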
Consider another example
# create a list of 100 dataframes (df0 to df99)
mylist = []
for x in range(100):
    df = pd.DataFrame(np.random.randint(99, size=(3, 3)), columns=['AA', 'BB', 'CC'])
    mylist.append(df)
concat_range = input("Range of dataframes to concatenate (0-100): ")
i, j = concat_range.split(" ")
dfs = pd.concat(mylist[int(i) : int(j)])
# further operations on dfs
Now, let's say I am the user and I want to concatenate df5 to df32. Since the slice end is exclusive, this selects df5 through df31, i.e. 27 frames.
>>> Range of dataframes to concatenate (0-100): 5 32
>>> dfs
AA BB CC
0 28 37 36
1 34 18 14
2 39 41 97
0 44 66 76
1 57 16 3
.. .. .. ..
1 43 87 74
2 67 70 73
0 40 60 57
1 23 63 70
2 96 24 31
[81 rows x 3 columns]
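Note the repeating 0, 1, 2 index in the result: pd.concat keeps each frame's own index. If you want a clean sequential index instead, pd.concat takes ignore_index; a sketch:

# produces a fresh 0..n-1 index instead of repeating each frame's index
dfs = pd.concat(mylist[int(i):int(j)], ignore_index=True)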

Using apply for multiple columns

I need to create 2 new columns based on 2 existing columns. I am trying to do it using a single apply call instead of 2 separate apply calls.
The initial df, for example, is as follows:
   ID1  ID2
0    1   11
1    2   12
2    3   13
3    4   14
4    5   15
5    6   16
6    7   17
7    8   18
8    9   19
9   10   20
Next I try to create 2 new columns using the below method:
def funct(row):
    list1 = row.values
    print(list1[0])
    return row

df[['s1','s2']] = df[['ID1',"ID2"]].apply(lambda row: funct(row))
The issue is that I want to access the values individually, which I am unable to do. Here I tried converting to a list, but when I do list1[0] I get
1
11
How do I access 1 and 11 above? How should I index into the values when I send two series together using apply?
NOTE: funct() just returns the same row for now, because I still don't know how to access the values in order to do something with them.
Add the parameter axis=1 to your apply call, like this:
import pandas as pd
from io import StringIO
s = """
,ID1,ID2
0,1,11
1,2,12
2,3,13
3,4,14
4,5,15
5,6,16
6,7,17
7,8,18
8,9,19
9,10,20
"""
df = pd.read_csv(StringIO(s),index_col=0)
def funct(row):
    # return row
    # updated answer: with axis=1, each row is a Series with the column labels
    return pd.Series([row.ID1 + 100, row.ID2 + 20])

df[['s1','s2']] = df[['ID1',"ID2"]].apply(funct, axis=1)
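As an aside, for simple arithmetic like this, vectorized column assignments avoid apply entirely and are much faster; a sketch of the same +100/+20 transformation:

# vectorized alternative to the apply-based version above
df['s1'] = df['ID1'] + 100
df['s2'] = df['ID2'] + 20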

iterrows() of 2 columns and save results in one column

In my dataframe I want to iterrows() over two columns but save the result in one column. For example, df is
x y
5 10
30 445
70 32
expected output is
points sequence
5 1
10 2
30 1
445 2
I know about iterrows(), but it saves the output in two different columns. How can I get the expected output, and is there any way to generate the sequence number according to a condition? Any help will be appreciated.
First, never use iterrows, because it is really slow.
If you want a 1, 2 sequence across the columns, convert the values to a numpy array with DataFrame.to_numpy, flatten with numpy.ravel, then use numpy.tile for the sequence:
df = pd.DataFrame({'points': df.to_numpy().ravel(),
                   'sequence': np.tile([1, 2], len(df))})
print(df)
   points  sequence
0       5         1
1      10         2
2      30         1
3     445         2
4      70         1
5      32         2
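If the number of columns isn't fixed at two, the same idea generalizes; a sketch (df as in the question):

import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [5, 30, 70], 'y': [10, 445, 32]})

# 1..n sequence for any number of columns n
out = pd.DataFrame({'points': df.to_numpy().ravel(),
                    'sequence': np.tile(np.arange(1, df.shape[1] + 1), len(df))})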
You can also do it with iterrows itself, unpacking each row's values and numbering them:
>>> pd.DataFrame([(v, j + 1) for _, row in df.iterrows() for j, v in enumerate(row)],
...              columns=['points', 'sequence'])
   points  sequence
0       5         1
1      10         2
2      30         1
3     445         2
4      70         1
5      32         2

pandas applying function to columns array is very slow

   os  hour  day
0  13    14    0
1  19    14    0
2  13    14    0
3  13    14    0
4  13    14    0
Here is my dataframe, and I just want to get a new column which is str(os)+'_'+str(hour)+'_'+str(day). I use an apply function to process the dataframe, but it is very slow.
Is there any high-performance method to achieve this?
I also tried converting the df to an array and processing every row, but that seems slow too.
There are nearly two hundred million rows in the dataframe.
Not sure what code you are using, but you can try
df.astype(str).apply('_'.join, axis = 1)
0 13_14_0
1 19_14_0
2 13_14_0
3 13_14_0
4 13_14_0
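At two hundred million rows, even this runs a Python call per row; vectorized string concatenation is usually much faster. A sketch, assuming the three columns from the question:

# elementwise string concatenation, no per-row Python function calls
new_col = (df['os'].astype(str) + '_'
           + df['hour'].astype(str) + '_'
           + df['day'].astype(str))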