Create new column in dataframe by passing existing pandas column values as argument to API call - pandas

I have written a function, get_lyrics (below), and I want to pass it the Song_Title and Singer_Name values from each row of an existing dataframe and store the results in a new column.
My attempt to create the df['Lyrics'] column (below) raises this error, and I have no idea why:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Calling the function directly with get_lyrics(test_song_name, test_song_author) works; it returns a very long string.
import lyricsgenius as lg
import pandas as pd
genius = lg.Genius(access_token=token)
test_song_name = "My Heart Will Go On"
test_song_author = "Celine Dion"
def get_lyrics(Song_Title, Singer_Name):
    song = genius.search_song(Song_Title, Singer_Name)
    return song.lyrics
get_lyrics(test_song_name, test_song_author)
df['Lyrics'] = df.apply(
    get_lyrics(
        df["Song_Title"], df["Singer_Name"]
    )
)

To apply a function to each row, use apply() with axis=1:
df['Lyrics'] = df.apply(lambda row: get_lyrics(row["Song_Title"], row["Singer_Name"]), axis=1)
Or call the API directly inside the lambda, in one line:
df['Lyrics'] = df.apply(lambda row: genius.search_song(row["Song_Title"], row["Singer_Name"]).lyrics, axis=1)
If you don't want a lambda, define a function that takes the row and pass that function to apply:
def get_lyrics(row):
    song = genius.search_song(row["Song_Title"], row["Singer_Name"])
    return song.lyrics
df['Lyrics'] = df.apply(get_lyrics, axis=1)
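The row-wise pattern above can be tried without an API token. The following is a minimal sketch: fake_get_lyrics is a hypothetical stand-in for get_lyrics, and the two-row dataframe is invented for illustration. It also shows a zip-based list comprehension, which produces the same column without apply.

```python
import pandas as pd

def fake_get_lyrics(song_title, singer_name):
    # Hypothetical stand-in for get_lyrics so the example runs offline.
    return f"lyrics of {song_title} by {singer_name}"

# Invented sample data with the column names used in the question.
df = pd.DataFrame({
    "Song_Title": ["My Heart Will Go On", "Hello"],
    "Singer_Name": ["Celine Dion", "Adele"],
})

# Row-wise apply: the lambda receives one row (a Series) at a time.
df["Lyrics"] = df.apply(
    lambda row: fake_get_lyrics(row["Song_Title"], row["Singer_Name"]),
    axis=1,
)

# Same result without apply: iterate over the paired column values.
df["Lyrics_zip"] = [
    fake_get_lyrics(title, singer)
    for title, singer in zip(df["Song_Title"], df["Singer_Name"])
]

print(df)
```

In both variants the function is called once per row with scalar strings, which is what the original get_lyrics expects.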

I found this page: https://www.codeforests.com/2020/07/18/pass-multiple-columns-to-lambda/
It has two working solutions. The first is identical to the one posted by @Ynjxsjmh:
df["Lyrics"] = df.apply(
    lambda x: get_lyrics(x["Song_Title"], x["Singer_Name"]), axis=1
)
The second one first selects a subset of the dataframe columns and then unpacks each row with *x to feed the values into get_lyrics as positional arguments:
df["Lyrics"] = df[["Song_Title", "Singer_Name"]].apply(
    lambda x: get_lyrics(*x), axis=1
)
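One practical caveat when calling a search API per row: if a search finds nothing, accessing .lyrics on the result will fail. The sketch below assumes the search function may return None for an unknown song (check this against the lyricsgenius documentation for your version). FakeSong, fake_search_song, and get_lyrics_safe are hypothetical names used only for illustration.

```python
import pandas as pd

class FakeSong:
    # Hypothetical stand-in for the object returned by a song search.
    def __init__(self, lyrics):
        self.lyrics = lyrics

# Invented lookup table that plays the role of the remote API.
KNOWN_SONGS = {
    ("My Heart Will Go On", "Celine Dion"): FakeSong("placeholder lyrics"),
}

def fake_search_song(title, artist):
    # Assumption: an unsuccessful search returns None.
    return KNOWN_SONGS.get((title, artist))

def get_lyrics_safe(title, artist, search=fake_search_song):
    # Return the lyrics, or None when the search found nothing.
    song = search(title, artist)
    return song.lyrics if song is not None else None

df = pd.DataFrame({
    "Song_Title": ["My Heart Will Go On", "No Such Song"],
    "Singer_Name": ["Celine Dion", "Nobody"],
})

# Select the two columns, then unpack each row into positional arguments.
df["Lyrics"] = df[["Song_Title", "Singer_Name"]].apply(
    lambda x: get_lyrics_safe(*x), axis=1
)
print(df)
```

Rows without a match end up as missing values in the new column instead of raising an AttributeError partway through the apply.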

Related

Empty cells when using an apply function

I am trying to calculate a value for a new column from one column or another, based on which one has data available. This is the code I have right now. It doesn't seem to notice when there is no data present and always goes to the "else" statement. My dataframe is an imported Excel file. Thanks for any advice!
def create_sulfide_col(row):
    if row["Sulphate-S(HCL Leachable)_%S"] is None:
        val = row["Total-S_%S"] - row["Sulphate-S(HCL Leachable)_%S"]
    else:
        val = ["Total-S_%S"]- df["Sulphate-S_%S"]
    return val
df["Sulphide-S(calc)-C_%S"] = df.apply(lambda row: create_sulfide_col(row), axis='columns')
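This can be done with numpy.where. Empty cells read from Excel are stored as NaN, not None, which is why the "is None" test never matches; use isna() or notna() instead. The condition below uses notna() so that the HCL-leachable column is subtracted only where it actually has data, and the other sulphate column is used otherwise:
import numpy as np
df['newcol'] = np.where(df["Sulphate-S(HCL Leachable)_%S"].notna(), df["Total-S_%S"] - df["Sulphate-S(HCL Leachable)_%S"], df["Total-S_%S"] - df["Sulphate-S_%S"])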
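The difference between None and NaN can be checked on a small example. This is a sketch with invented numbers, and it assumes the intent is to subtract the HCL-leachable value whenever it is present and to fall back to the other sulphate column otherwise.

```python
import numpy as np
import pandas as pd

# Invented sample data; NaN marks the empty Excel cells.
df = pd.DataFrame({
    "Total-S_%S": [5.0, 4.0, 3.0],
    "Sulphate-S(HCL Leachable)_%S": [1.0, np.nan, 0.5],
    "Sulphate-S_%S": [np.nan, 1.5, 0.7],
})

leach = df["Sulphate-S(HCL Leachable)_%S"]

# A missing cell is NaN, so an "is None" test is False even for empty cells.
is_none = leach.iloc[1] is None
print(is_none)

# Vectorised choice: use the leachable value where present, else the fallback.
df["Sulphide-S(calc)-C_%S"] = np.where(
    leach.notna(),
    df["Total-S_%S"] - leach,
    df["Total-S_%S"] - df["Sulphate-S_%S"],
)

# Equivalent formulation: fill the gaps first, then subtract once.
alt = df["Total-S_%S"] - leach.fillna(df["Sulphate-S_%S"])
print(df)
```

Both formulations avoid the row-wise apply entirely, which is usually faster on large sheets.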

Streamlit - Applying value_counts / groupby to column selected on run time

I am trying to apply the value_counts method to a DataFrame based on columns selected dynamically in a Streamlit app.
This is what I am trying to do:
if st.checkbox("Select Columns To Show"):
    all_columns = df.columns.tolist()
    selected_columns = st.multiselect("Select", all_columns)
    new_df = df[selected_columns]
    st.dataframe(new_df)
The above lets me select columns and displays the data for the selected columns. I am trying to see how I could apply the value_counts/groupby methods to this output in the Streamlit app.
If I try the following:
st.table(new_df.value_counts())
I get this error:
AttributeError: 'DataFrame' object has no attribute 'value_counts'
I believe the issue lies in passing a list of columns to the dataframe. Selecting a single column with df[col_name] returns a pandas.Series, which has a value_counts method. Selecting a list of columns returns a pandas.DataFrame, which has no value_counts method in pandas versions before 1.1 (DataFrame.value_counts was added in 1.1.0).
Can you try st.table(new_df[col_name].value_counts())?
I think the error occurs because value_counts() is defined on a Series and not on a DataFrame.
You can try the approach from "Converting .value_counts output to dataframe".
If you want to apply it to a single column:
def value_counts_df(df, col):
    """
    Returns pd.value_counts() as a DataFrame

    Parameters
    ----------
    df : Pandas Dataframe
        Dataframe on which to run value_counts(), must have column `col`.
    col : str
        Name of column in `df` for which to generate counts

    Returns
    -------
    Pandas Dataframe
        Returned dataframe will have a single column named "count" which contains the value_counts()
        for each unique value of df[col]. The index name of this dataframe is `col`.

    Example
    -------
    >>> value_counts_df(pd.DataFrame({'a':[1, 1, 2, 2, 2]}), 'a')
       count
    a
    2      3
    1      2
    """
    df = pd.DataFrame(df[col].value_counts())
    df.index.name = col
    df.columns = ['count']
    return df
val_count_single = value_counts_df(new_df, selected_col)
If you want to apply it to all object columns in the dataframe:
def valueCountDF(df, object_cols):
    c = df[object_cols].apply(lambda x: x.value_counts(dropna=False)).T.stack().astype(int)
    p = (df[object_cols].apply(lambda x: x.value_counts(normalize=True,
                                                        dropna=False)).T.stack() * 100).round(2)
    cp = pd.concat([c,p], axis=1, keys=["Count", "Percentage %"])
    return cp
val_count_df_cols = valueCountDF(df, selected_columns)
Finally, you can use st.table or st.dataframe to show the resulting dataframe in your Streamlit app.
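A shorter route to a displayable counts table is sketched below. The sample data and column names are invented, and the Streamlit call is left as a comment so the snippet runs on its own. It counts one selected column with Series.value_counts and counts combinations of several selected columns with groupby(...).size().

```python
import pandas as pd

# Invented sample standing in for new_df = df[selected_columns].
new_df = pd.DataFrame({
    "city": ["A", "B", "A", "A"],
    "kind": ["x", "y", "x", "y"],
})

# Single column: Series.value_counts, reshaped into a two-column DataFrame.
single_counts = (
    new_df["city"]
    .value_counts()
    .rename_axis("city")
    .reset_index(name="count")
)

# Several columns: count each combination of values with groupby + size.
selected_columns = ["city", "kind"]
combo_counts = (
    new_df.groupby(selected_columns)
    .size()
    .reset_index(name="count")
)

print(single_counts)
print(combo_counts)
# In the Streamlit app, either frame could be shown with st.table(...) or st.dataframe(...).
```

The groupby form does not rely on DataFrame.value_counts, so it should also work on older pandas versions.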

How do I create multiple new columns, and populate columns depending on values in 2 other columns using pandas/python?

I want to populate 10 columns with the numbers 1-16 depending on the values in 2 other columns. I can start by providing the column header or create new columns (does not matter to me).
I tried to create a function that iterates over the numbers 1-10 and then assigns a value to the z variable depending on the values of b and y.
Then I want to apply this function to each row in my dataframe.
import pandas as pd
import numpy as np
data = pd.read_csv('Nuc.csv')
def write_Pcolumns(df):
    """populates a column in the given dataframe, df, based on the values in two other columns in the same dataframe"""
    #create string of numbers for each nucleotide position
    positions = ('1','2','3','4','5','6','7','8','9','10')
    a = "Po "
    x = "O.Po "
    #for each position create a variable for the nucleotide in the sequence (Po) and opposite to the sequence(o. Po)
    for each in positions:
        b = a + each
        y = x + each
        z = 'P' + each
        #assign a value to z based on the nucleotide identities in the sequence and opposite position
        if df[b] == 'A' and df[y]=='A':
            df[z]==1
        elif df[b] == 'A' and df[y]=='C':
            df[z]==2
        elif df[b] == 'A' and df[y]=='G':
            df[z]==3
        elif df[b] == 'A' and df[y]=='T':
            df[z]==4
        ...
        elif df[b] == 'T' and df[y]=='G':
            df[z]==15
        else:
            df[z]==16
    return(df)
data.apply(write_Pcolumns(data), axis=1)
I get the following error message:
The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
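This happens because write_Pcolumns(data) is called once with the whole dataframe, so df[b] == 'A' compares an entire column and returns a Series of booleans, not a single boolean. An if statement cannot decide the truth of a whole Series, hence the error. Note also that df[z]==1 is a comparison, not an assignment (that would be df[z] = 1).
See the related question "Pandas error when using if-else to create new column: The truth value of a Series is ambiguous".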
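One way to avoid the if/elif chain altogether is a lookup table keyed on the pair of nucleotides. The sketch below assumes the elided branches follow the same A, C, G, T ordering as the visible ones, so that (A, A) is 1, (A, C) is 2, and so on up to (T, G) at 15, with every other pair falling through to 16. The sample data is invented and covers only two positions instead of ten.

```python
import pandas as pd

BASES = "ACGT"

# Assumed numbering: pairs are enumerated in A, C, G, T order, starting at 1.
PAIR_CODE = {
    (seq, opp): 4 * i + j + 1
    for i, seq in enumerate(BASES)
    for j, opp in enumerate(BASES)
}

# Invented sample with two positions; the question uses positions 1 to 10.
data = pd.DataFrame({
    "Po 1": ["A", "A", "T", "T"],
    "O.Po 1": ["A", "C", "G", "N"],
    "Po 2": ["C", "G", "A", "T"],
    "O.Po 2": ["A", "T", "T", "T"],
})

for pos in range(1, 3):
    pairs = zip(data[f"Po {pos}"], data[f"O.Po {pos}"])
    # Unknown pairs get 16, mirroring the final else branch in the question.
    data[f"P{pos}"] = [PAIR_CODE.get(pair, 16) for pair in pairs]

print(data)
```

Each new column is built from scalar pairs, so no Series ever ends up inside an if statement.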

pandas groupby transform: multiple functions applied at the same time with custom names

As the title suggests, I want to be able to do the following (best explained with some code) [pandas 0.20.1 is mandatory]:
import pandas as pd
import numpy as np
a = pd.DataFrame(np.random.rand(10, 4), columns=[['a','a','b','b'], ['alfa','beta','alfa','beta',]])
def as_is(x):
    return x
def power_2(x):
    return x**2
# desired result
a.transform([as_is, power_2])
The problem is that the functions could be more complex than this, and I would then lose the "naming" feature, because pandas.DataFrame.transform only allows a list to be passed, whereas a dictionary would have been most convenient.
Going back to basics, I got to this:
dict_funct= {'as_is': as_is, 'power_2': power_2}
def wrapper(x):
    return pd.concat({k: x.apply(v) for k,v in dict_funct.items()}, axis=1)
a.groupby(level=[0,1], axis=1).apply(wrapper)
But the output DataFrame is all NaN, presumably due to the ordering of the multi-index columns. Is there any way I can fix this?
If you need the dict, remove the axis parameter from concat so that it uses the default (axis=0); it is then necessary to add group_keys=False to groupby and to call unstack on the result:
def wrapper(x):
    return pd.concat({k: x.apply(v) for k,v in dict_funct.items()})
a.groupby(level=[0,1], axis=1, group_keys=False).apply(wrapper).unstack(0)
A similar solution using transform:
def wrapper(x):
    return pd.concat({k: x.transform(v) for k,v in dict_funct.items()})
a.groupby(level=[0,1], axis=1, group_keys=False).apply(wrapper).unstack(0)
Another solution is simply to build the list of functions from the dict with a list comprehension:
a.transform([v for k, v in dict_funct.items()])
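If the functions operate column by column, the custom names can also be attached without groupby. The sketch below is an alternative, not the approach from the answer above: it applies each function to the whole frame, lets pd.concat use the dict keys as an extra column level, and then moves that level to the end. It uses fixed numbers instead of random ones, and it has not been checked against pandas 0.20.1 specifically.

```python
import numpy as np
import pandas as pd

cols = pd.MultiIndex.from_arrays(
    [["a", "a", "b", "b"], ["alfa", "beta", "alfa", "beta"]]
)
# Fixed values (0..7) so the result is easy to verify by eye.
a = pd.DataFrame(np.arange(8.0).reshape(2, 4), columns=cols)

def as_is(x):
    return x

def power_2(x):
    return x ** 2

dict_funct = {"as_is": as_is, "power_2": power_2}

# The dict keys become the outermost column level of the concatenated frame.
named = pd.concat(
    {name: a.apply(func) for name, func in dict_funct.items()}, axis=1
)

# Move the name level to the end and group the columns by their original labels.
named = named.reorder_levels([1, 2, 0], axis=1).sort_index(axis=1)
print(named)
```

The resulting columns are three-level tuples such as ('a', 'beta', 'power_2'), so the name chosen in the dict survives regardless of how the function itself is defined.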

pandas apply() with and without lambda

What is the rule when a function is called through pandas apply() via a lambda versus not? Examples are below. Without the lambda, the entire Series (df[column name]) is apparently passed to the "test" function, which throws an error when it tries to do a boolean operation on a Series.
If the same function is called via a lambda, it works: apply iterates over the rows, passes each one as "x", and x[column name] returns a single value for that column in the current row.
It's like the lambda is removing a dimension. Does anyone have an explanation, or can anyone point to the specific doc on this? Thanks.
Example 1 with lambda, works OK
print("probPredDF columns:", probPredDF.columns)
def test( x, y):
    if x==y:
        r = 'equal'
    else:
        r = 'not equal'
    return r
probPredDF.apply( lambda x: test( x['yTest'], x[ 'yPred']), axis=1 ).head()
Example 1 output
probPredDF columns: Index([0, 1, 'yPred', 'yTest'], dtype='object')
Out[215]:
0 equal
1 equal
2 equal
3 equal
4 equal
dtype: object
Example 2 without lambda, throws boolean operation on series error
print("probPredDF columns:", probPredDF.columns)
def test( x, y):
    if x==y:
        r = 'equal'
    else:
        r = 'not equal'
    return r
probPredDF.apply( test( probPredDF['yTest'], probPredDF[ 'yPred']), axis=1 ).head()
Example 2 output
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
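There is nothing magic about a lambda. A lambda is just an anonymous function defined inline. DataFrame.apply expects a callable and, with axis=1, calls it once per row with that row as its single argument. In Example 2, test(probPredDF['yTest'], probPredDF['yPred']) is evaluated immediately with two whole Series, before apply ever runs, so x==y produces a Series and the if statement raises the error. You can pass a named function wherever a lambda is expected, but it must also take a single parameter. You need to do something like this.
Define it as:
def wrapper(x):
    return test(x['yTest'], x['yPred'])
Use it as:
probPredDF.apply(wrapper, axis=1)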