IPython Notebook cell multiple outputs - pandas

I am running this cell in IPython Notebook:
# salaries and teams are pandas DataFrames
salaries.head()
teams.head()
The result is that I only get the output of the teams DataFrame rather than both salaries and teams. If I just run salaries.head() I get the result for the salaries DataFrame, but when I run both statements I only see the output of teams.head(). How can I correct this?

Have you tried the display function?
from IPython.display import display
display(salaries.head())
display(teams.head())

An easier way:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
It saves you from having to repeatedly type display().
Say the cell contains this:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
a = 1
b = 2
a
b
Then the output will be:
Out[1]: 1
Out[1]: 2
If we use IPython.display.display:
from IPython.display import display
a = 1
b = 2
display(a)
display(b)
The output is:
1
2
So the same thing, but without the Out[n] part.
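If you want "all" to be the default in every session, the same setting can live in your IPython configuration file (a sketch; assumes the default profile location):
# in ~/.ipython/profile_default/ipython_config.py
c.InteractiveShell.ast_node_interactivity = "all"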

IPython Notebook shows only the value of the last expression in a cell. The easiest solution for your case is to use two cells.
If you really need only one cell you could do a hack like this:
class A:
    def _repr_html_(self):
        return salaries.head()._repr_html_() + '<br>' + teams.head()._repr_html_()

A()
If you need this often, make it a function:
def show_two_heads(df1, df2, n=5):
    class A:
        def _repr_html_(self):
            return df1.head(n)._repr_html_() + '<br>' + df2.head(n)._repr_html_()
    return A()
Usage:
show_two_heads(salaries, teams)
A version for more than two heads:
def show_many_heads(*dfs, n=5):
    class A:
        def _repr_html_(self):
            return '<br>'.join(df.head(n)._repr_html_() for df in dfs)
    return A()
Usage:
show_many_heads(salaries, teams, df1, df2)
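Along the same lines, you could skip the throwaway class and lean on IPython's built-in HTML wrapper (a sketch; it uses the same private _repr_html_ hook, so the usual caveats apply):
from IPython.display import HTML

def show_heads(*dfs, n=5):
    # concatenate each DataFrame head's HTML table, separated by line breaks
    return HTML('<br>'.join(df.head(n)._repr_html_() for df in dfs))

show_heads(salaries, teams)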

Enumerating all the solutions:
sys.displayhook(value), which IPython/Jupyter hooks into. Note this behaves slightly differently from calling display, as it includes the Out[n] text. This works fine in regular Python too!
display(value), as in this answer
get_ipython().ast_node_interactivity = 'all'. This is similar to but better than the approach taken by this answer.
Comparing these in an interactive session:
In [1]: import sys
In [2]: display(1) # appears without Out
...: sys.displayhook(2) # appears with Out
...: 3 # missing
...: 4 # appears with Out
1
Out[2]: 2
Out[2]: 4
In [3]: get_ipython().ast_node_interactivity = 'all'
In [4]: display(1) # appears without Out
...: sys.displayhook(2) # appears with Out
...: 3 # appears with Out (different to above)
...: 4 # appears with Out
1
Out[4]: 2
Out[4]: 3
Out[4]: 4
Note that the behavior in Jupyter is exactly the same as it is in IPython.
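If you want to undo the setting within the same session, the default value is 'last_expr' (a one-liner, same mechanism as above):
get_ipython().ast_node_interactivity = 'last_expr'  # back to showing only the last expression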

Alternatively, print the first result and let the last expression echo on its own:
print(salaries.head())
teams.head()

This works because the print() function forces output, whereas bare expressions only show the value of the last one.
For instance,
salaries.head()
teams.head()
outputs only the result of teams.head(),
while,
print(salaries.head())
print(teams.head())
outputs both results.
So, basically, use the print() function.

Related

Why can't we use the expand argument of split() inside apply() in pandas?

# Split single column into two columns use apply()
df[['First Name', 'Last Name']] = df["Student_details"].apply(lambda x: pd.Series(str(x).split(",")))
print(df)
1- Why, when I change the code to .apply(lambda x: str(x).split(",", expand=True)), do I get the error "expand is an invalid argument to split function"?
2- Why do I have to use pd.Series(), although the default return value of str.split() is a Series?
3- How does pd.Series() return a Series, while it returns a DataFrame here?
I tried to write expand and use it normally, but it didn't work.
here is the DF
import pandas as pd
import numpy as np
technologies = {
    'Student_details': ["Pramodh_Roy", "Leena_Singh", "James_William", "Addem_Smith"],
    'Courses': ["Spark", "PySpark", "Pandas", "Hadoop"],
    'Fee': [25000, 20000, 22000, 25000]
}
df = pd.DataFrame(technologies)
print(df)
df[['First Name', 'Last Name']] = df["Student_details"].str.split("_", expand=True)
I don't get what you want... is it about the solution above? Or do you really want to know why #1 throws an error?
EDIT 1:
The expand parameter does not exist on the split method you are calling there, because that is Python's built-in str type. The expand parameter was written for the split method of a pandas Series (via the .str accessor).
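You can see the difference directly (a minimal sketch; the exact TypeError wording may vary by Python version):
"Pramodh_Roy".split("_", expand=True)              # TypeError: built-in str.split has no expand
df["Student_details"].str.split("_", expand=True)  # pandas Series.str.split returns a DataFrame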
EDIT 2:
Re your third question: as you can see in my suggestion, I'm not even using the pd.Series function, though of course df["Student_details"] is a Series. The key in my answer is that the expand parameter returns a DataFrame with as many columns as the split results require. So if one of the names were "a_b_c_d", I would get a DataFrame with four columns in total.
EDIT 3:
pd.Series converts the result to a Series. The result of the split function is a list, which in turn is cast to a Series. None of which is necessary in my eyes: the code I wrote saves time, both in execution and in understanding (I hope).
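To make that concrete, a minimal sketch of what happens inside the lambda for a single row:
parts = "Pramodh_Roy".split("_")  # plain Python list: ['Pramodh', 'Roy']
pd.Series(parts)                  # wraps the list in a Series
When the function passed to apply returns a Series for every row, apply stacks them and hands back a DataFrame, which is why the two-column assignment works.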
FYI, this does work:
technologies = {
    'Student_details': ["Pramodh_Roy", "Leena_Singh", "James_William", "Addem_Smith"],
    'Courses': ["Spark", "PySpark", "Pandas", "Hadoop"],
    'Fee': [25000, 20000, 22000, 25000]
}
df = pd.DataFrame(technologies)
df[['First Name', 'Last Name']] = df["Student_details"].apply(lambda x: pd.Series(str(x).split("_")))
print(df)
Output:
  Student_details  Courses    Fee First Name Last Name
0     Pramodh_Roy    Spark  25000    Pramodh       Roy
1     Leena_Singh  PySpark  20000      Leena     Singh
2   James_William   Pandas  22000      James   William
3     Addem_Smith   Hadoop  25000      Addem     Smith

Indexing lists in a Pandas dataframe column based on variable length

I've got a column in a Pandas dataframe comprised of variable-length lists and I'm trying to find an efficient way of extracting elements conditional on list length. Consider this minimal reproducible example:
t = pd.DataFrame({'a': [['1234', 'abc', '444'],
                        ['5678'],
                        ['2468', 'def']]})
Say I want to extract the 2nd element (where relevant) into a new column, and use NaN otherwise. I was able to get it in a very inefficient way:
_ = []
for index, row in t.iterrows():
    if len(row['a']) > 1:
        _.append(row['a'][1])
    else:
        _.append(np.nan)
t['element_two'] = _
I also made an attempt with np.where(), but I'm not specifying the second argument correctly:
np.where(t['a'].str.len() > 1, lambda x: x['a'][1], np.nan)
Corrections and tips to other solutions would be greatly appreciated! I'm coming from R where I take vectorization for granted.
I'm on pandas 0.25.3 and numpy 1.18.1.
Use the str accessor:
n = 2
t['second'] = t['a'].str[n-1]
print(t)
                  a second
0  [1234, abc, 444]    abc
1            [5678]    NaN
2       [2468, def]    def
While not incredibly efficient, apply is at least clean:
t['a'].apply(lambda _: np.nan if len(_)<2 else _[1])
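If you would rather avoid apply, a plain list comprehension expresses the same condition (a sketch, same semantics as the loop above):
t['element_two'] = [x[1] if len(x) > 1 else np.nan for x in t['a']]
That said, the .str[n-1] accessor above is usually the cleanest choice, since it already yields NaN when the list is too short.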

When does pandas pass by reference vs. by value when passing a DataFrame to a function?

def dropdf_copy(df):
    df = df.drop('y', axis=1)

def dropdf_inplace(df):
    df.drop('y', axis=1, inplace=True)

def changecell(df):
    df['y'][0] = 99
x = pd.DataFrame({'x': [1,2],'y': [20,31]})
x
Out[204]:
x y
0 1 20
1 2 31
dropdf_copy(x)
x
Out[206]:
x y
0 1 20
1 2 31
changecell(x)
x
Out[208]:
x y
0 1 99
1 2 31
In the above example dropdf_copy() doesn't modify the original dataframe x, while changecell() modifies x. I know that if I make a minor change to changecell() it won't change x.
def changecell(df):
    df = df.copy()
    df['y'][0] = 99
I don't think it's very elegant to include df = df.copy() in every function I write.
Questions
1) Under what circumstances does pandas change the original dataframe, and when does it not? Can someone give me a clear, generalizable rule? I know it may have something to do with mutability vs. immutability, but that is not clearly explained on Stack Overflow.
2) Does numpy behave similarly, or is it different? What about other Python objects?
PS: I have done research on Stack Overflow but couldn't find a clear, generalizable rule for this problem.
Python passes arguments by object reference (sometimes described as pass by reference). The original object is left unchanged only if the function makes an explicit copy, either by rebinding the name through assignment or by calling copy().
Example with explicit copy :
# 1. Assignment
def dropdf_copy1(df):
    df = df.drop('y', axis=1)

# 2. copy()
def dropdf_copy2(df):
    df = df.copy()
    df.drop('y', axis=1, inplace=True)
If no explicit copy is made, the original object passed in is changed:
def dropdf_inplace(df):
    df.drop('y', axis=1, inplace=True)
This has nothing to do with pandas. It's a question of local/global variables and mutable values. In dropdf_copy, you rebind df as a local variable.
The same with lists:
def global_(l):
    l[0] = 1

def local_(l):
    l = l + [0]
In the second function, it would be the same if you wrote:
def local_(l):
    l2 = l + [0]
so you don't affect l.
Here is a Python Tutor example which shows what happens.
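You can also watch the difference with id() (a minimal sketch):
def mutate(l):
    l[0] = 1      # changes the object the caller also sees
    print(id(l))  # same id as outside

def rebind(l):
    l = l + [0]   # builds a new list and rebinds the local name only
    print(id(l))  # different id; the caller's list is untouched

nums = [9, 9]
print(id(nums))
mutate(nums)  # prints the same id; nums is now [1, 9]
rebind(nums)  # prints a different id; nums is unchanged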

There are three problems (load database, loop, and append series)

This problem turned out to be more difficult than I thought when I started.
I want to refer to a particular column content from the SQLite database, make it into a Series, and then combine it into a single data frame.
I have tried it like this but failed:
import pandas as pd
from pandas import Series, DataFrame
import sqlite3
con = sqlite3.connect("C:/Users/Kun/Documents/Dashin/data.db")  # my sqldb
tmplist = ['A003060', 'A003070']  # the db contains those tables; I call only two for practice
for i in tmplist:
    tmpSeries = pd.Series([])
    listSeries = pd.read_sql("SELECT * FROM %s" % i, con, index_col=None)['Close'].head(5)
    tmpSeries2 = tmpSeries.append(listSeries)
    print(tmpSeries2)
That code only prints the Series one after another, like this:
0 7150.0
1 6770.0
2 7450.0
3 7240.0
4 6710.0
dtype: float64
0 14950.0
1 15500.0
2 15000.0
3 14800.0
4 14500.0
What I want to do is like this:
   A003060  A003070
0   7150.0  14950.0
1   6770.0  15500.0
2   7450.0  15000.0
3   7240.0  14800.0
4   6710.0  14500.0
I asked a similar question before and got an answer, but that question used predefined variables. Here I must use a loop because I have to deal with a series of large databases. I have already made another attempt using dataframe.append and transpose(), but failed. I would appreciate some small hints. Thank you.
To append pandas series using for loop
I think you can create a list, append the data to it, and finally use concat:
dfs = []
for i in tmplist:
    listSeries = pd.read_sql("SELECT * FROM %s" % i, con, index_col=None)['Close'].head(5)
    dfs.append(listSeries)

df = pd.concat(dfs, axis=1, keys=tmplist)
print(df)
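Equivalently, since pd.concat also accepts a dict mapping labels to Series, the loop can collapse into a comprehension (a sketch, assuming the same con and tmplist as above):
df = pd.concat(
    {name: pd.read_sql("SELECT * FROM %s" % name, con, index_col=None)['Close'].head(5)
     for name in tmplist},
    axis=1,
)
print(df)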

Pandas fill cells in a column with NaN values, derive the value from other cells in the row

I have a dataframe:
   a  b    c
0  1  2    3
1  1  1    1
2  3  7  NaN
3  2  3    5
...
I want to fill column "c" in place (update the values) where the values are NaN, using a machine learning algorithm.
I don't know how to do it in place. Sample code:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
df=pd.DataFrame([range(3), [1, 5, np.NaN], [2, 2, np.NaN], [4,5,9], [2,5,7]],columns=['a','b','c'])
x = []
y = []
for row in df.iterrows():
    index, data = row
    if not pd.isnull(data['c']):
        x.append(data[['a', 'b']].tolist())
        y.append(data['c'])
model = LinearRegression()
model.fit(x, y)
# this line does not do it in place
df[~df.c.notnull()].assign(c=lambda x: model.predict(x[['a', 'b']]))
But this gives me a copy of the dataframe. The only option I have left is a for loop, but I don't want to do that. I think there should be a more Pythonic way of doing it with pandas. Can someone please help? Or is there any other way of doing this?
You'll have to do something like:
df.loc[pd.isnull(df['c']), 'c'] = _result of model_
This modifies the dataframe df directly.
This way you first filter the dataframe to keep the slice you want to modify (pd.isnull(df['c'])), then from that slice you select the column you want to modify (c).
On the right-hand side of the equals sign, it expects an array / list / series with the same number of rows as the filtered dataframe (in your sample code, the two NaN rows).
You may have to adjust depending on what your model returns exactly.
EDIT
You probably need to do something like this:
df['pred'] = model.predict(df[['a', 'b']])
df.loc[pd.isnull(df['c']), 'c'] = df.loc[pd.isnull(df['c']), 'pred']
Note that a significant part of the issue comes from the way you are using scikit-learn in your example: you need to pass the whole dataset to the model when you predict.
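Putting it together without iterrows, a sketch of the whole fit-and-fill step on the same df (columns a, b, c):
mask = df['c'].isnull()
model = LinearRegression()
model.fit(df.loc[~mask, ['a', 'b']], df.loc[~mask, 'c'])     # train on the complete rows
df.loc[mask, 'c'] = model.predict(df.loc[mask, ['a', 'b']])  # fill only the NaN cells, in place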
The simplest way is to transpose first, then forward fill / backward fill at your convenience:
df.T.ffill().bfill().T
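Be aware this is a different technique from the regression above: after transposing, ffill/bfill run along what were the rows, so each NaN in c is simply copied from a neighbouring column in the same row. On the original sample frame, before any filling (a quick check):
print(df.T.ffill().bfill().T)
#      a    b    c
# 0  0.0  1.0  2.0
# 1  1.0  5.0  5.0   <- c copied from b
# 2  2.0  2.0  2.0   <- c copied from b
# 3  4.0  5.0  9.0
# 4  2.0  5.0  7.0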