convert pandas dataframe to list and nest a dict? - pandas

I have a list:
l = [{'level': '1', 'rows': 2}, {'level': '2', 'rows': 3}]
I can conert to DataFrame, but how do I convert back?
frame = pd.DataFrame(l)

We have to_dict
frame.to_dict('r')
Out[67]: [{'level': '1', 'rows': 2}, {'level': '2', 'rows': 3}]

Related

How to stack a pd.DataFrame until it becomes pd.Series?

I have the following pd.DataFrame:
df = pd.DataFrame(
data=[['dog', 'kg', 100, 241], ['cat', 'lbs', 300, 1]],
columns=['animal', 'unit', 0, 1],
).set_index(['animal', 'unit'])
df.columns = pd.MultiIndex.from_tuples(list(zip(*[['2019', '2018'], ['Apr', 'Oct']])))
and I would like to convert it a 2D matrix with no multilevel indexes on index or column:
pd.DataFrame(
data=[
['dog', 'kg', 100, '2019', 'Apr'],
['dog', 'kg', 241, '2018', 'Oct'],
['cat', 'lbs', 300, '2019', 'Apr'],
['cat', 'lbs', 1, '2018', 'Oct']
],
columns=['animal', 'unit', 'value', 'year', 'month']
)
To achieve this, I use df.stack().stack() -> this becomes a pd.Series and then I do .reset_index() on these series t convert to DataFrame.
My question is - how do I avoid the second (or multiple more) stack()?
Is there a way to stack a pd.DataFrame until it becomes a pd.Series?

How to parse a nested column in a df column?

Is there a smart pythonic way to parse a nested column in a pandas dataframe like this one to 3 different columns? So for example the column could look like this:
col1
[{'name': 'amount', 'value': 1}, {'name': 'frequency', 'value': 2}, {'name': 'freq_unit', 'value': 'month'}]
[{'name': 'amount', 'value': 3}, {'name': 'frequency', 'value': 1}, {'name': 'freq_unit', 'value': 'month'}]
And the expected result should be these 3 columns:
amount frequency freq_unit
1 2 month
3 1 month
That's just level 1. I have the level 2: What if the elements in the list still have the same names (amount, frequency and freq_unit) but the order could change? Could the code in the answer deal with this?
col1
[{'name': 'amount', 'value': 1}, {'name': 'frequency', 'value': 2}, {'name': 'freq_unit', 'value': 'month'}]
[{'name': 'amount', 'value': 3}, {'name': 'freq_unit', 'value': 'month'}, {'name': 'frequency', 'value': 1}]
Code for reproduce the data. Really look forward to see how the community would solve this. Thank you
data = {'col1':[[{'name': 'amount', 'value': 1}, {'name': 'frequency', 'value': 2}, {'name': 'freq_unit', 'value': 'month'}],
[{'name': 'amount', 'value': 3}, {'name': 'frequency', 'value': 1}, {'name': 'freq_unit', 'value': 'month'}]]}
df = pd.DataFrame(data)
A combination of list comprehension, itertools.chain, and collections.defaultdict could help out here:
from itertools import chain
from collections import defaultdict
data = defaultdict(list)
phase1 = [[(data["name"], data["value"])
for data in entry]
for entry in df.col1
]
phase1 = chain.from_iterable(phase1)
for key, value in phase1:
data[key].append(value)
pd.DataFrame(data)
amount frequency freq_unit
0 1 2 month
1 3 1 month
The above is verbose: #piRSquared's comment is much simpler, with a list comprehension:
pd.DataFrame([{x["name"]: x["value"] for x in lst} for lst in df.col1])
Another idea, but very unnecessary, is to use a list comprehension, combined with Pandas' string methods:
outcome = [(df.col1.str[num].str["value"]
.rename(df.col1.str[num].str["name"][0])
)
for num in range(df.col1.str.len()[0])
]
pd.concat(outcome, axis = 'columns')
#piRsquared's solution is the simplest, in my opinion.
You can write a function that will parse each cell in your Series and return a properly formatted Series and use apply to tuck the iteration away:
>>> def custom_parser(record):
... clean_record = {rec["name"]: rec["value"] for rec in record}
... return pd.Series(clean_record)
>>> df["col1"].apply(custom_parser)
amount frequency freq_unit
0 1 2 month
1 3 1 month

How to apply str.split() on pandas column?

Using Simple Data:
df = pd.DataFrame({'ids': [0,1,2], 'value': ['2 4 10 0 14', '5 91 19 20 0', '1 1 1 2 44']})
I need to convert the column to array, so I use:
df.iloc[:,-1] = df.iloc[:,-1].apply(lambda x: str(x).split())
X = df.iloc[:, 1:]
X = np.array(X.values)
but the problem is the data is being nested and I just need a matrix (3,5). How to make this properly and fast for large data (avoid looping)?
As said in the comments by #anky, #ScottBoston. You can use string method split along with expand parameter and finally change to NumPy:
df.iloc[:, 1].str.split(expand=True).values
array([['2', '4', '10', '0', '14'],
['5', '91', '19', '20', '0'],
['1', '1', '1', '2', '44']], dtype=object)

Pandas Groupby: return dict of rows

I would like to group my dataframe by one of the columns and then return a dictionary that has a list of all of the rows per column value. Is there a fast Pandas idiom for doing this?
Example:
test = pd.DataFrame({
'id': ['alice', 'bob', 'bob', 'charlie'],
'transaction_date': ['2020-01-01', '2020-01-01', '2020-01-02', '2020-01-02'],
'amount': [50.0, 10.0, 12.0, 13.0]
})
Desired output:
result = {
'alice': [Series(transaction_date='2020-01-01', amount=50.0)],
'bob': [Series(transaction_date='2020-01-01', amount=10.0), Series(transaction_date='2020-01-02', amount=12.0)],
'charlie': [Series(transaction_date='2020-01-02', amount=53.0)],
}
The following approaches do NOT work:
test.groupby('id').agg(list)
Returns a Dataframe where each column (amount and transaction_date) has a list of values, but that's not what I want. I want the result to be one list of rows / Pandas series per unique grouping column value ('id' value).
test.groupby('id').agg(list).to_dict():
{'amount': {'charlie': [13.0], 'bob': [10.0, 12.0], 'alice': [50.0]}, 'transaction_date': {'charlie': ['2020-01-02'], 'bob': ['2020-01-01', '2020-01-02'], 'alice': ['2020-01-01']}}
test.groupby('id').apply(list).to_dict():
{'charlie': ['amount', 'id', 'transaction_date'], 'bob': ['amount', 'id', 'transaction_date'], 'alice': ['amount', 'id', 'transaction_date']}
Use itertuples and zip,
import pandas as pd
test = pd.DataFrame({
'id': ['alice', 'bob', 'bob', 'charlie'],
'transaction_date': ['2020-01-01', '2020-01-01', '2020-01-02', '2020-01-02'],
'amount': [50.0, 10.0, 12.0, 13.0]
})
columns = ['transaction_date', 'amount']
grouped = (test
.groupby('id')[columns]
.apply(lambda x: list(x.itertuples(name='Series', index=False))))
print(dict(zip(grouped.index, grouped.values)))
{
'alice': [Series(transaction_date='2020-01-01', amount=50.0)],
'bob': [
Series(transaction_date='2020-01-01', amount=10.0),
Series(transaction_date='2020-01-02', amount=12.0)
],
'charlie': [Series(transaction_date='2020-01-02', amount=13.0)]
}

pandas same attribute comparison

I have the following dataframe:
df = pd.DataFrame([{'name': 'a', 'label': 'false', 'score': 10},
{'name': 'a', 'label': 'true', 'score': 8},
{'name': 'c', 'label': 'false', 'score': 10},
{'name': 'c', 'label': 'true', 'score': 4},
{'name': 'd', 'label': 'false', 'score': 10},
{'name': 'd', 'label': 'true', 'score': 6},
])
I want to return names that have the "false" label score value higher than the score value of the "true" label with at least the double. In my example, it should return only the "c" name.
First you can pivot the data, and look at the ratio, filter what you want:
new_df = df.pivot(index='name',columns='label', values='score')
new_df[new_df['false'].div(new_df['true']).gt(2)]
output:
label false true
name
c 10 4
If you only want the label, you can do:
new_df.index[new_df['false'].div(new_df['true']).gt(2)].values
which gives
array(['c'], dtype=object)
Update: Since your data is result of orig_df.groupby().count(), you could instead do:
orig_df['label'].eq('true').groupby('name').mean()
and look at the rows with values <= 1/3.