Filling NaNs using apply lambda function to work with DASK dataframe - pandas

I am trying to figure out how to fill a new column with a value depending on which of several other columns is non-null, falling back to the next column's check when one is null, like so:
# row[x] == row[x] is False only when the value is NaN, so this labels the first non-null column
df['NewCol'] = df.apply(
    lambda row: 'Three' if row['VersionThree'] == row['VersionThree']
    else ('Two' if row['VersionTwo'] == row['VersionTwo']
          else ('Test' if row['VS'] == row['VS'] else '')),
    axis=1)
So the function works as it should, but now I am trying to figure out how to get it to run when I read my dataset in as a Dask DataFrame.
I tried to vectorize it and see if I could use numpy.where, like so:
df['NewCol'] = np.where(df['VersionThree'] == df['VersionThree'], ['Three'],
                np.where(df['VersionTwo'] == df['VersionTwo'], ['Two'],
                np.where(df['VS'] == df['VS'], ['Test'], np.nan)))
But it does not run and crashes. I would like the function to go through every row and check those three columns: if the first is non-null, write its label to NewCol; if it is null, check the next column in the chain; and if all three are null, place np.nan in that cell.
I am trying to use a Dask DataFrame
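A sketch of one way this could be done with Dask (assuming ddf is the Dask DataFrame and the column names match the question): push the row-wise logic into map_partitions so each pandas partition is handled with a vectorized np.select, mirroring the original lambda's fallback order. The default of '' matches the original lambda's else branch rather than np.nan.

import numpy as np

def add_newcol(pdf):
    # pdf is a plain pandas DataFrame (one Dask partition)
    conditions = [
        pdf['VersionThree'].notnull(),
        pdf['VersionTwo'].notnull(),
        pdf['VS'].notnull(),
    ]
    choices = ['Three', 'Two', 'Test']
    # first matching condition wins, reproducing the nested if/else order
    return pdf.assign(NewCol=np.select(conditions, choices, default=''))

# ddf is assumed to already be the Dask DataFrame read in elsewhere
ddf = ddf.map_partitions(add_newcol)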

Related

Cannot look up values in pandas df when the reference number has a leading 0

I'm trying to get the description from a dataframe by entering the item number. However, Python 3 does not support integer literals starting with 0.
In the code below, I'm looking up the row in a df whose style number equals '0400271' and returning its description. The code works only for numbers that don't start with 0; for numbers with a leading 0 it just returns an empty series.
joe.loc[joe['STYLE_NO'] == '0400271', 'DESCRIPTION']

# also separately tried (also returns an empty series):
value = '0400271'.zfill(7)
print(value)
print(type(value))
b = joe.query('STYLE_NO == @value')['DESCRIPTION']
b
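A likely cause (an assumption, since the question doesn't show how the data was loaded) is that STYLE_NO was parsed as an integer when the file was read, so the leading zero was dropped and the string '0400271' never matches. A minimal sketch of two possible fixes, with 'styles.csv' as a placeholder file name:

import pandas as pd

# Option 1: read the column as a string so leading zeros are preserved.
joe = pd.read_csv('styles.csv', dtype={'STYLE_NO': str})

# Option 2: normalise an already-numeric column back to zero-padded strings.
joe['STYLE_NO'] = joe['STYLE_NO'].astype(str).str.zfill(7)

desc = joe.loc[joe['STYLE_NO'] == '0400271', 'DESCRIPTION']
print(desc)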

Pandas rewriting function without calling apply

I need to create a derived field in a pandas dataframe based on the value in another field of the same dataframe. Like so:
def newfield(row):
    if row.col1 == 'x':
        return 'value is x'
    elif row.col1 == 'y':
        return 'value is y'
Then I call it:
df['newfield'] = df.apply(newfield, axis=1)
Is there a way to do it without 'apply'? I would also like to make it less verbose. np.where only allows two outcomes, but I have more than two conditions.
Yes, you can use np.select:
df['newfield'] = np.select([df['col1'] == 'x', df['col1'] == 'y'],
                           ['value is x', 'value is y'],
                           np.nan)
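A small self-contained demo of that pattern, using a made-up toy dataframe (the column name matches the question; everything else is illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['x', 'y', 'z']})

conditions = [df['col1'] == 'x', df['col1'] == 'y']
choices = ['value is x', 'value is y']

# Rows matching no condition fall back to the default (None here, which
# pandas treats as missing and which avoids forcing a dtype promotion).
df['newfield'] = np.select(conditions, choices, default=None)
print(df)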

Streamlit - Applying value_counts / groupby to column selected on run time

I am trying to apply the value_counts method to a DataFrame based on the columns selected dynamically in the Streamlit app.
This is what I am trying to do:
if st.checkbox("Select Columns To Show"):
    all_columns = df.columns.tolist()
    selected_columns = st.multiselect("Select", all_columns)
    new_df = df[selected_columns]
    st.dataframe(new_df)
The above lets me select columns and displays data for the selected columns. I am trying to see how I could apply the value_counts/groupby methods to this output in the Streamlit app.
If I try to do the below
st.table(new_df.value_counts())
I get the below error
AttributeError: 'DataFrame' object has no attribute 'value_counts'
I believe the issue lies in passing a list of columns to a dataframe. When you pass a single column in [] to a dataframe, you get back a pandas.Series object (which has the value_counts method). But when you pass a list of columns, you get back a pandas.DataFrame, which does not have a value_counts method in older pandas versions (DataFrame.value_counts was added in pandas 1.1).
Can you try st.table(new_df[col_name].value_counts())
I think the error is because value_counts() is applicable to a Series and not a DataFrame.
You can try Converting ".value_counts" output to dataframe
If you want to apply it to one single column:
def value_counts_df(df, col):
    """
    Returns pd.value_counts() as a DataFrame

    Parameters
    ----------
    df : Pandas Dataframe
        Dataframe on which to run value_counts(), must have column `col`.
    col : str
        Name of column in `df` for which to generate counts

    Returns
    -------
    Pandas Dataframe
        Returned dataframe will have a single column named "count" which contains the
        value_counts() for each unique value of df[col]. The index name of this dataframe is `col`.

    Example
    -------
    >>> value_counts_df(pd.DataFrame({'a':[1, 1, 2, 2, 2]}), 'a')
       count
    a
    2      3
    1      2
    """
    df = pd.DataFrame(df[col].value_counts())
    df.index.name = col
    df.columns = ['count']
    return df
val_count_single = value_counts_df(new_df, selected_col)
If you want to apply for all object columns in the dataframe
def valueCountDF(df, object_cols):
    c = df[object_cols].apply(lambda x: x.value_counts(dropna=False)).T.stack().astype(int)
    p = (df[object_cols].apply(lambda x: x.value_counts(normalize=True,
                                                        dropna=False)).T.stack() * 100).round(2)
    cp = pd.concat([c, p], axis=1, keys=["Count", "Percentage %"])
    return cp
val_count_df_cols = valueCountDF(df, selected_columns)
And finally, you can use st.table or st.dataframe to show the dataframe in your Streamlit app.
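For the groupby half of the question, a minimal sketch (assuming df and selected_columns exist as in the snippets above) that counts rows per unique combination of the chosen columns and displays the result:

import streamlit as st

if selected_columns:
    counts = (
        df.groupby(selected_columns)   # group by the dynamically chosen columns
          .size()                      # number of rows per group
          .reset_index(name='count')   # flatten back to a DataFrame for display
          .sort_values('count', ascending=False)
    )
    st.dataframe(counts)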

Pandas DF Select Column Contained By String

I am currently trying to figure out the inverse of df[df['A'].str.contains("hello")]. I would like to return all rows where the value in column 'A' is a substring of "hello".
If the values were "he", "llo" "l" in column 'A' they would return true.
If the values were "het", "lil", "elp" in column 'A' they would return false.
Is there a way to do this in a dataframe without iterating each row in the dataframe?
Currently using Python 2.7 due to working with ESRI ArcGIS 10.4 software constraints.
You can use apply() in pandas to iterate over each value of column A and evaluate whether it is a substring of 'hello':
def hello_check(row):
    return row in 'hello'

df['contains_hello'] = df['A'].apply(hello_check)
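If an apply-free approach is preferred, one alternative sketch (my own suggestion, not part of the answer above): since "hello" is short, enumerate all of its substrings once and use the vectorized isin, which avoids a Python-level call per row and works on Python 2.7 as well.

target = 'hello'
# every non-empty contiguous substring of 'hello'
substrings = {target[i:j] for i in range(len(target)) for j in range(i + 1, len(target) + 1)}

mask = df['A'].isin(substrings)   # True where the value is a substring of 'hello'
matching_rows = df[mask]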

Repeat elements in pandas dataframe so equal number of each unique element

I have a pandas dataframe with multiple different feature columns. I have one particular column which can take on a variety of integer values. I want to manipulate the dataframe so that there is an equal number of rows for each of these integer values.
Before:
df['key'] = [1,1,1,3,4,5,5]
After:
df['key'] = [1,1,1,3,3,3,4,4,4,5,5,5]
I want this to be applied to every key in the dataframe.
So here's an ugly way that I've coded up a solution, but I feel like it goes against the entire reason to use pandas dataframes.
# iterate over (key, count) pairs so the actual key value is used in the lookup
for key, count in data['key'].value_counts().items():
    if count == max(data['key'].value_counts()):
        pass
    else:
        scaling = (max(data['key'].value_counts()) // count) - 1
        data2 = pd.concat([data[data['key'] == key]] * scaling, ignore_index=True)
        data = pd.concat([data, data2], ignore_index=True)
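A more compact sketch of the same idea (an assumption about intent: upsample every key, with replacement, until each appears as often as the most frequent one):

import pandas as pd

max_count = df['key'].value_counts().max()

# Sample each key's rows with replacement up to the size of the largest group.
balanced = (
    df.groupby('key', group_keys=False)
      .apply(lambda g: g.sample(max_count, replace=True, random_state=0))
      .reset_index(drop=True)
)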