Pandas Series .loc() access error after appending

I have a multi-index pandas series as below. I want to add a new entry (new_series) to multi_df, calling it multi_df_appended. However I don't understand the change in behaviour between multi_df and multi_df_appended when I try to access a non-existing multi-index.
Below is the code that reproduces the problem. I want the penultimate line of code: multi_df_appended.loc['five', 'black', 'hard', 'square' ] to return an empty Series like it does with multi_df but instead I get the error given. What am I doing wrong here?
import pandas as pd

df = pd.DataFrame({'id': range(1, 9),
                   'code': ['one', 'one', 'two', 'three',
                            'two', 'three', 'one', 'two'],
                   'colour': ['black', 'white', 'white', 'white',
                              'black', 'black', 'white', 'white'],
                   'texture': ['soft', 'soft', 'hard', 'soft', 'hard',
                               'hard', 'hard', 'hard'],
                   'shape': ['round', 'triangular', 'triangular', 'triangular', 'square',
                             'triangular', 'round', 'triangular']
                   }, columns=['id', 'code', 'colour', 'texture', 'shape'])
multi_df = df.set_index(['code','colour','texture','shape']).sort_index()['id']
# try to access a non-existing multi-index combination:
multi_df.loc['five', 'black', 'hard', 'square' ]
Series([], dtype: int64) # returns an empty Series as desired/expected.
# append multi_df with a new row
new_series = pd.Series([9], index = [('four', 'black', 'hard', 'round')] )
multi_df_appended = multi_df.append(new_series)
# now try again to access a non-existing multi-index combination:
multi_df_appended.loc['five', 'black', 'hard', 'square' ]
error: 'MultiIndex lexsort depth 0, key was length 4' # now instead of the empty Series, I get an error!?

As @Jeff answered, appending leaves the index unsorted. If I call .sortlevel(0) after appending and then use .loc with an unknown key, the "lexsort depth" error no longer appears:
multi_df_appended_sorted = multi_df.append(new_series).sortlevel(0)
multi_df_appended_sorted.loc['five', 'black', 'hard', 'square' ]
Series([], dtype: int64)
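On recent pandas releases, Series.append (removed in pandas 2.0) and sortlevel (removed in pandas 1.0) are no longer available. The same idea can be sketched with pd.concat and sort_index. This is a minimal sketch on a shortened, hypothetical version of the data above, not the original answer's code. It only checks whether the index is sorted, because what .loc does for a missing key may differ between pandas versions.

```python
import pandas as pd

names = ["code", "colour", "texture", "shape"]

# Shortened stand-in for multi_df: a Series with a sorted 4-level MultiIndex.
multi_df = pd.Series(
    [1, 2, 3],
    index=pd.MultiIndex.from_tuples(
        [
            ("one", "black", "soft", "round"),
            ("three", "white", "soft", "triangular"),
            ("two", "white", "hard", "triangular"),
        ],
        names=names,
    ),
)

new_series = pd.Series(
    [9],
    index=pd.MultiIndex.from_tuples(
        [("four", "black", "hard", "round")], names=names
    ),
)

# Concatenating puts the new row at the end, so the index is no longer sorted.
appended = pd.concat([multi_df, new_series])
# sort_index() restores the lexicographic order that MultiIndex lookups rely on.
appended_sorted = appended.sort_index()

print(appended.index.is_monotonic_increasing)         # False
print(appended_sorted.index.is_monotonic_increasing)  # True

# A membership test avoids depending on how .loc reports a missing key.
key = ("five", "black", "hard", "square")
print(key in appended_sorted.index)                   # False
```

Sorting once after all rows have been added is usually cheaper than sorting after every append.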

Related

Highlight distinct cells based on a different cell in the same row in a multiindex pivot table

I have created a pivot table where the column headers have several levels. This is a simplified version:
import pandas as pd

index = ['Person 1', 'Person 2', 'Person 3']
columns = [
    ["condition 1", "condition 1", "condition 1", "condition 2", "condition 2", "condition 2"],
    ["Mean", "SD", "n", "Mean", "SD", "n"],
]
data = [
    [100, 10, 3, 200, 12, 5],
    [500, 20, 4, 750, 6, 6],
    [1000, 30, 5, None, None, None],
]
# pass index so the rows are labelled with the persons defined above
df = pd.DataFrame(data, index=index, columns=columns)
df
Now I would like to highlight the cells adjacent to SD if SD > 10. This is how it should look:
I found this answer but couldn't make it work for multiindices.
Thanks for any help.
Use Styler.apply with a custom function: select the SD columns with DataFrame.xs, then repeat the boolean mask across each condition's columns with DataFrame.reindex:
def highlight(x):
    c1 = 'background-color: red'
    # True where SD > 10, one column per condition
    mask = x.xs('SD', axis=1, level=1).gt(10)
    # DataFrame with the same index and column names as the original, filled with empty strings
    df1 = pd.DataFrame('', index=x.index, columns=x.columns)
    # set the style wherever the mask, broadcast over level 0 of the columns, is True
    return df1.mask(mask.reindex(x.columns, level=0, axis=1), c1)

df.style.apply(highlight, axis=None)
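To see which cells such a function would style without rendering anything, it can be called directly on the DataFrame. The sketch below rebuilds the example data and uses a hypothetical helper name, style_frame, with the same logic as the answer. It marks every cell of a condition group whose SD exceeds 10.

```python
import pandas as pd

index = ["Person 1", "Person 2", "Person 3"]
columns = pd.MultiIndex.from_arrays(
    [
        ["condition 1"] * 3 + ["condition 2"] * 3,
        ["Mean", "SD", "n"] * 2,
    ]
)
data = [
    [100, 10, 3, 200, 12, 5],
    [500, 20, 4, 750, 6, 6],
    [1000, 30, 5, None, None, None],
]
df = pd.DataFrame(data, index=index, columns=columns)


def style_frame(x):
    """Return a frame of CSS strings shaped like x (hypothetical helper)."""
    # One boolean column per condition: True where that condition's SD > 10.
    mask = x.xs("SD", axis=1, level=1).gt(10)
    styles = pd.DataFrame("", index=x.index, columns=x.columns)
    # Broadcast the per-condition mask over all columns of that condition.
    full_mask = mask.reindex(x.columns, level=0, axis=1)
    return styles.mask(full_mask, "background-color: red")


styles = style_frame(df)
print(styles)
```

Missing SD values compare as False, so Person 3's condition 2 cells stay unstyled.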

columns.values is not returning the strings

I have a dataframe with column name msg that has string values.
I am trying to get these values using:
df['msg'].values
But I am getting integers (probably the index of the dataframe) and not the text.
What am I doing wrong?
Say you have a pandas DataFrame with a column 'msg':
df = pd.DataFrame({'msg': ['red', 'orange', 'yellow', 'green', 'blue', 'purple']})
You can get just the string values with:
df['msg'].values --> ['red', 'orange', 'yellow', 'green', 'blue', 'purple']
To get the index instead:
df['msg'].index.to_list() --> [0, 1, 2, 3, 4, 5]
You can get individual string values by indexing. For the first string value:
df['msg'][0] --> 'red'
Or the last value:
df['msg'][5] --> 'purple'
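Note that df['msg'][0] looks up the index label 0, which matches the first row only when the DataFrame has the default integer index. A minimal sketch, assuming a hypothetical non-default index, that separates values, labels, and positional access:

```python
import pandas as pd

# Hypothetical index labels chosen to differ from the row positions.
df = pd.DataFrame(
    {"msg": ["red", "orange", "yellow"]},
    index=[10, 20, 30],
)

values = df["msg"].tolist()     # the strings stored in the column
labels = df.index.tolist()      # the index labels, not the data
first = df["msg"].iloc[0]       # first row by position
by_label = df["msg"].loc[20]    # row whose index label is 20

print(values)
print(labels)
print(first, by_label)
```

If the column itself prints as integers, checking df.dtypes and how the DataFrame was built is a reasonable next step.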

How to stack a pd.DataFrame until it becomes pd.Series?

I have the following pd.DataFrame:
df = pd.DataFrame(
data=[['dog', 'kg', 100, 241], ['cat', 'lbs', 300, 1]],
columns=['animal', 'unit', 0, 1],
).set_index(['animal', 'unit'])
df.columns = pd.MultiIndex.from_tuples(list(zip(*[['2019', '2018'], ['Apr', 'Oct']])))
and I would like to convert it to a flat 2D table with no MultiIndex on either the rows or the columns:
pd.DataFrame(
data=[
['dog', 'kg', 100, '2019', 'Apr'],
['dog', 'kg', 241, '2018', 'Oct'],
['cat', 'lbs', 300, '2019', 'Apr'],
['cat', 'lbs', 1, '2018', 'Oct']
],
columns=['animal', 'unit', 'value', 'year', 'month']
)
To achieve this, I use df.stack().stack(), which produces a pd.Series, and then I call .reset_index() on that Series to convert it to a DataFrame.
My question is - how do I avoid the second (or multiple more) stack()?
Is there a way to stack a pd.DataFrame until it becomes a pd.Series?
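One possible approach, offered as a sketch and not a confirmed answer from this thread, is to pass all column levels to a single stack call. The level names year and month are added here as an assumption so that reset_index produces readable column names. Depending on the pandas version, the row order may differ and a FutureWarning about the stack implementation may be shown.

```python
import pandas as pd

df = pd.DataFrame(
    data=[["dog", "kg", 100, 241], ["cat", "lbs", 300, 1]],
    columns=["animal", "unit", 0, 1],
).set_index(["animal", "unit"])
# Level names are an assumption added for this sketch.
df.columns = pd.MultiIndex.from_tuples(
    [("2019", "Apr"), ("2018", "Oct")], names=["year", "month"]
)

# Stack every column level in one call, however many levels there are.
all_levels = list(range(df.columns.nlevels))
stacked = df.stack(level=all_levels)

# The stacked result is a Series; name it and flatten the index into columns.
flat = stacked.rename("value").reset_index()
print(flat[["animal", "unit", "value", "year", "month"]])
```

Because all_levels is derived from df.columns.nlevels, the same call works for any number of column levels.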

How to turn a pandas pivot_table into a simple table

I have a simple DataFrame defined by:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'year': [2019, 2020, 2019, 2020, 2019, 2020, 2019, 2020],
    'name': ['Alice', 'Alice', 'Alice', 'Alice', 'Bob', 'Bob', 'Bob', 'Bob'],
    'sales': [100, 200, 300, 400, 500, 600, 700, 800]
})
This DataFrame is easily turned into a pivot table using pivot_table:
table = pd.pivot_table(
    df,
    index=['name'],
    columns=['year'],
    aggfunc=np.sum)
Now I need to turn this DataFrame into a simple JSON array. Unfortunately, the to_json method doesn't return a simple array:
table.reset_index().to_json(orient="records")
[
{
"["name",""]":"Alice",
"["sales",2019]":400,
"["sales",2020]":600
},
{
"["name",""]":"Bob",
"["sales",2019]":1200,
"["sales",2020]":1400}
]
How can I turn the table DataFrame into a simple (without multiindex) DataFrame?
[
{
"name":"Alice",
"2019":400,
"2020":600
},
{
"name":"Bob",
"2019":1200,
"2020":1400
}
]
You need to pass the values parameter to get rid of the MultiIndex. Also pass plain column names instead of single-element lists when pivoting on a single column:
table = pd.pivot_table(
    df,
    index='name',
    columns='year',
    values='sales',
    aggfunc=np.sum)
table.reset_index().to_json(orient="records")
'[{"name":"Alice","2019":400,"2020":600},{"name":"Bob","2019":1200,"2020":1400}]'
Adding another alternative if you like:
out = (df.groupby(['name','year'])['sales'].sum().unstack()
.reset_index().to_json(orient='records'))
'[{"name":"Alice","2019":400,"2020":600},{"name":"Bob","2019":1200,"2020":1400}]'
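If the pivot table has already been built with MultiIndex columns, as in the question, another option is to drop the outer column level afterwards. This is a sketch under that assumption; it uses the string "sum" as the aggregation and parses the JSON back with json.loads only to make the result easy to inspect.

```python
import json

import pandas as pd

df = pd.DataFrame({
    "year": [2019, 2020, 2019, 2020, 2019, 2020, 2019, 2020],
    "name": ["Alice", "Alice", "Alice", "Alice", "Bob", "Bob", "Bob", "Bob"],
    "sales": [100, 200, 300, 400, 500, 600, 700, 800],
})

# Without values=..., the columns become a MultiIndex: ('sales', 2019), ('sales', 2020).
table = pd.pivot_table(df, index=["name"], columns=["year"], aggfunc="sum")

# Drop the outer 'sales' level so only the years remain as column labels.
flat = table.droplevel(0, axis=1)
flat.columns.name = None  # remove the leftover 'year' label on the columns

records = json.loads(flat.reset_index().to_json(orient="records"))
print(records)
```

This keeps the original pivot_table call untouched, which can be convenient when the table is produced elsewhere.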

Pandas Groupby: return dict of rows

I would like to group my dataframe by one of the columns and then return a dictionary that has a list of all of the rows per column value. Is there a fast Pandas idiom for doing this?
Example:
test = pd.DataFrame({
'id': ['alice', 'bob', 'bob', 'charlie'],
'transaction_date': ['2020-01-01', '2020-01-01', '2020-01-02', '2020-01-02'],
'amount': [50.0, 10.0, 12.0, 13.0]
})
Desired output:
result = {
'alice': [Series(transaction_date='2020-01-01', amount=50.0)],
'bob': [Series(transaction_date='2020-01-01', amount=10.0), Series(transaction_date='2020-01-02', amount=12.0)],
'charlie': [Series(transaction_date='2020-01-02', amount=13.0)],
}
The following approaches do NOT work:
test.groupby('id').agg(list)
Returns a Dataframe where each column (amount and transaction_date) has a list of values, but that's not what I want. I want the result to be one list of rows / Pandas series per unique grouping column value ('id' value).
test.groupby('id').agg(list).to_dict():
{'amount': {'charlie': [13.0], 'bob': [10.0, 12.0], 'alice': [50.0]}, 'transaction_date': {'charlie': ['2020-01-02'], 'bob': ['2020-01-01', '2020-01-02'], 'alice': ['2020-01-01']}}
test.groupby('id').apply(list).to_dict():
{'charlie': ['amount', 'id', 'transaction_date'], 'bob': ['amount', 'id', 'transaction_date'], 'alice': ['amount', 'id', 'transaction_date']}
Use itertuples and zip:
import pandas as pd
test = pd.DataFrame({
'id': ['alice', 'bob', 'bob', 'charlie'],
'transaction_date': ['2020-01-01', '2020-01-01', '2020-01-02', '2020-01-02'],
'amount': [50.0, 10.0, 12.0, 13.0]
})
columns = ['transaction_date', 'amount']
grouped = (test
           .groupby('id')[columns]
           .apply(lambda x: list(x.itertuples(name='Series', index=False))))
print(dict(zip(grouped.index, grouped.values)))
{
'alice': [Series(transaction_date='2020-01-01', amount=50.0)],
'bob': [
Series(transaction_date='2020-01-01', amount=10.0),
Series(transaction_date='2020-01-02', amount=12.0)
],
'charlie': [Series(transaction_date='2020-01-02', amount=13.0)]
}
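The same dictionary can also be built by iterating over the groups directly, which avoids apply. This is a minimal sketch of that alternative; the namedtuple name Row is an arbitrary choice made for this example.

```python
import pandas as pd

test = pd.DataFrame({
    "id": ["alice", "bob", "bob", "charlie"],
    "transaction_date": ["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02"],
    "amount": [50.0, 10.0, 12.0, 13.0],
})

columns = ["transaction_date", "amount"]

# Iterating a groupby yields (key, sub-DataFrame) pairs, one per unique id.
result = {
    key: list(group[columns].itertuples(index=False, name="Row"))
    for key, group in test.groupby("id")
}

for key, rows in result.items():
    print(key, rows)
```

Each list element is a namedtuple, so fields can be read as attributes, for example result['bob'][0].amount.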