Converting dataframe to dictionary in pyspark without using pandas - pandas

Following up this question and dataframes, I am trying to convert a dataframe into a dictionary. In pandas I was using this:
dictionary = df_2.unstack().to_dict(orient='index')
However, I need to convert this code to pyspark. Can anyone help me with this? As I understand from previous questions such as this I would indeed need to use pandas, but the dataframe is way too big for me to be able to do this. How can I solve this?
EDIT:
I have now tried the following approach:
dictionary_list = map(lambda row: row.asDict(), df_2.collect())
dictionary = {age['age']: age for age in dictionary_list}
(reference) but it is not yielding what it is supposed to.
In pandas, what I was obtaining was the following:

df2 is the dataframe from the previous post. You can do a pivot first, and then convert to dictionary as described in your linked post.
import pyspark.sql.functions as F
df3 = df2.groupBy('age').pivot('siblings').agg(F.first('count'))
list_persons = [row.asDict() for row in df3.collect()]
dict_persons = {person['age']: person for person in list_persons}
{15: {'age': 15, '0': 1.0, '1': None, '3': None}, 10: {'age': 10, '0': None, '1': None, '3': 1.0}, 14: {'age': 14, '0': None, '1': 1.0, '3': None}}
Or another way:
df4 = df3.fillna(float('nan')).groupBy().pivot('age').agg(F.first(F.struct(*df3.columns[1:])))
result_dict = eval(df4.select(F.to_json(F.struct(*df4.columns))).head()[0])
{'10': {'0': 'NaN', '1': 'NaN', '3': 1.0}, '14': {'0': 'NaN', '1': 1.0, '3': 'NaN'}, '15': {'0': 1.0, '1': 'NaN', '3': 'NaN'}}

Related

Create new data frame from unique values of certain columns [duplicate]

Say my data looks like this:
date,name,id,dept,sale1,sale2,sale3,total_sale
1/1/17,John,50,Sales,50.0,60.0,70.0,180.0
1/1/17,Mike,21,Engg,43.0,55.0,2.0,100.0
1/1/17,Jane,99,Tech,90.0,80.0,70.0,240.0
1/2/17,John,50,Sales,60.0,70.0,80.0,210.0
1/2/17,Mike,21,Engg,53.0,65.0,12.0,130.0
1/2/17,Jane,99,Tech,100.0,90.0,80.0,270.0
1/3/17,John,50,Sales,40.0,50.0,60.0,150.0
1/3/17,Mike,21,Engg,53.0,55.0,12.0,120.0
1/3/17,Jane,99,Tech,80.0,70.0,60.0,210.0
I want a new column average, which is the average of total_sale for each name,id,dept tuple
I tried
df.groupby(['name', 'id', 'dept'])['total_sale'].mean()
And this does return a series with the mean:
name id dept
Jane 99 Tech 240.000000
John 50 Sales 180.000000
Mike 21 Engg 116.666667
Name: total_sale, dtype: float64
but how would I reference the data? The series is a one dimensional one of shape (3,). Ideally I would like this put back into a dataframe with proper columns so I can reference properly by name/id/dept.
If you call .reset_index() on the series that you have, it will get you a dataframe like you want (each level of the index will be converted into a column):
df.groupby(['name', 'id', 'dept'])['total_sale'].mean().reset_index()
EDIT: to respond to the OP's comment, adding this column back to your original dataframe is a little trickier. You don't have the same number of rows as in the original dataframe, so you can't assign it as a new column yet. However, if you set the index the same, pandas is smart and will fill in the values properly for you. Try this:
cols = ['date','name','id','dept','sale1','sale2','sale3','total_sale']
data = [
['1/1/17', 'John', 50, 'Sales', 50.0, 60.0, 70.0, 180.0],
['1/1/17', 'Mike', 21, 'Engg', 43.0, 55.0, 2.0, 100.0],
['1/1/17', 'Jane', 99, 'Tech', 90.0, 80.0, 70.0, 240.0],
['1/2/17', 'John', 50, 'Sales', 60.0, 70.0, 80.0, 210.0],
['1/2/17', 'Mike', 21, 'Engg', 53.0, 65.0, 12.0, 130.0],
['1/2/17', 'Jane', 99, 'Tech', 100.0, 90.0, 80.0, 270.0],
['1/3/17', 'John', 50, 'Sales', 40.0, 50.0, 60.0, 150.0],
['1/3/17', 'Mike', 21, 'Engg', 53.0, 55.0, 12.0, 120.0],
['1/3/17', 'Jane', 99, 'Tech', 80.0, 70.0, 60.0, 210.0]
]
df = pd.DataFrame(data, columns=cols)
mean_col = df.groupby(['name', 'id', 'dept'])['total_sale'].mean() # don't reset the index!
df = df.set_index(['name', 'id', 'dept']) # make the same index here
df['mean_col'] = mean_col
df = df.reset_index() # to take the hierarchical index off again
Adding to_frame
df.groupby(['name', 'id', 'dept'])['total_sale'].mean().to_frame()
You are very close. You simply need to add a set of brackets around [['total_sale']] to tell python to select as a dataframe and not a series:
df.groupby(['name', 'id', 'dept'])[['total_sale']].mean()
If you want all columns:
df.groupby(['name', 'id', 'dept'], as_index=False).mean()[['name', 'id', 'dept', 'total_sale']]
The answer is in two lines of code:
The first line creates the hierarchical frame.
df_mean = df.groupby(['name', 'id', 'dept'])[['total_sale']].mean()
The second line converts it to a dataframe with four columns('name', 'id', 'dept', 'total_sale')
df_mean = df_mean.reset_index()

How to stack a pd.DataFrame until it becomes pd.Series?

I have the following pd.DataFrame:
df = pd.DataFrame(
data=[['dog', 'kg', 100, 241], ['cat', 'lbs', 300, 1]],
columns=['animal', 'unit', 0, 1],
).set_index(['animal', 'unit'])
df.columns = pd.MultiIndex.from_tuples(list(zip(*[['2019', '2018'], ['Apr', 'Oct']])))
and I would like to convert it a 2D matrix with no multilevel indexes on index or column:
pd.DataFrame(
data=[
['dog', 'kg', 100, '2019', 'Apr'],
['dog', 'kg', 241, '2018', 'Oct'],
['cat', 'lbs', 300, '2019', 'Apr'],
['cat', 'lbs', 1, '2018', 'Oct']
],
columns=['animal', 'unit', 'value', 'year', 'month']
)
To achieve this, I use df.stack().stack() -> this becomes a pd.Series and then I do .reset_index() on these series t convert to DataFrame.
My question is - how do I avoid the second (or multiple more) stack()?
Is there a way to stack a pd.DataFrame until it becomes a pd.Series?

Pandas Groupby: return dict of rows

I would like to group my dataframe by one of the columns and then return a dictionary that has a list of all of the rows per column value. Is there a fast Pandas idiom for doing this?
Example:
test = pd.DataFrame({
'id': ['alice', 'bob', 'bob', 'charlie'],
'transaction_date': ['2020-01-01', '2020-01-01', '2020-01-02', '2020-01-02'],
'amount': [50.0, 10.0, 12.0, 13.0]
})
Desired output:
result = {
'alice': [Series(transaction_date='2020-01-01', amount=50.0)],
'bob': [Series(transaction_date='2020-01-01', amount=10.0), Series(transaction_date='2020-01-02', amount=12.0)],
'charlie': [Series(transaction_date='2020-01-02', amount=53.0)],
}
The following approaches do NOT work:
test.groupby('id').agg(list)
Returns a Dataframe where each column (amount and transaction_date) has a list of values, but that's not what I want. I want the result to be one list of rows / Pandas series per unique grouping column value ('id' value).
test.groupby('id').agg(list).to_dict():
{'amount': {'charlie': [13.0], 'bob': [10.0, 12.0], 'alice': [50.0]}, 'transaction_date': {'charlie': ['2020-01-02'], 'bob': ['2020-01-01', '2020-01-02'], 'alice': ['2020-01-01']}}
test.groupby('id').apply(list).to_dict():
{'charlie': ['amount', 'id', 'transaction_date'], 'bob': ['amount', 'id', 'transaction_date'], 'alice': ['amount', 'id', 'transaction_date']}
Use itertuples and zip,
import pandas as pd
test = pd.DataFrame({
'id': ['alice', 'bob', 'bob', 'charlie'],
'transaction_date': ['2020-01-01', '2020-01-01', '2020-01-02', '2020-01-02'],
'amount': [50.0, 10.0, 12.0, 13.0]
})
columns = ['transaction_date', 'amount']
grouped = (test
.groupby('id')[columns]
.apply(lambda x: list(x.itertuples(name='Series', index=False))))
print(dict(zip(grouped.index, grouped.values)))
{
'alice': [Series(transaction_date='2020-01-01', amount=50.0)],
'bob': [
Series(transaction_date='2020-01-01', amount=10.0),
Series(transaction_date='2020-01-02', amount=12.0)
],
'charlie': [Series(transaction_date='2020-01-02', amount=13.0)]
}

populating nested dictionaries with rows from Pandas data frame

I'm trying to populate a dictionary of dictionaries with entries from a Pandas data frame in Python by iterating through the nested dictionary and populating the values of each sub-dictionary with entries from a row of a Pandas data frame.
Although there are as many sub-dictionaries as there are rows in the data frame, all dictionaries get populated with the data from the last row of the data frame, instead of using every row for every dictionary.
Here is a toy reproducible example.
import pandas as pd
# initialize an empty df
data = pd.DataFrame()
# populate data frame with entries
data['name'] = ['Joe Smith', 'Mary James', 'Charles Williams']
data['school'] = ["Jollywood Secondary", "Northgate Sixth From", "Brompton High"]
data['subjects'] = [['Maths', 'Art', 'Biology'], ['English', 'French', 'History'], ['Chemistry', 'Biology', 'English']]
# use dictionary comprehensions to set up main dictionary and sub-dictionary templates
# sub-dictionary
keys = ['name', 'school', 'subjects']
record = {key: None for key in keys}
# main dictionary
keys2 = ['cand1', 'cand2', 'cand3']
candidates = {key: record for key in keys2}
# as a result i get something like this
# {'cand1': {'name': None, 'school': None, 'subjects': None},
# 'cand2': {'name': None, 'school': None, 'subjects': None},
# 'cand3': {'name': None, 'school': None, 'subjects': None}}
# iterate through main dictionary and populate each sub-dict with row of df
for i, d in enumerate(candidates.items()):
d[1]['name'] = data['name'].iloc[i]
d[1]['school'] = data['school'].iloc[i]
d[1]['subjcts'] = data['subjects'].iloc[i]
# what i end up with is the last row entry in each sub-dictionary
#{'cand1': {'name': 'Charles Williams',
# 'school': 'Brompton High',
# 'subjects': None,
# 'subjcts': ['Chemistry', 'Biology', 'English']},
# 'cand2': {'name': 'Charles Williams',
# 'school': 'Brompton High',
# 'subjects': None,
# 'subjcts': ['Chemistry', 'Biology', 'English']},
# 'cand3': {'name': 'Charles Williams',
# 'school': 'Brompton High',
# 'subjects': None,
# 'subjcts': ['Chemistry', 'Biology', 'English']}}
How do I need to modify my code to get each dictionary populated with a different row from my data frame?
I did not work through your code to look for the bug, because the solution is a one-liner with the method to_dict.
Here is a minimal working example with your sample data.
import pandas as pd
# initialize an empty df
data = pd.DataFrame()
# populate data frame with entries
data['name'] = ['Joe Smith', 'Mary James', 'Charles Williams']
data['school'] = ["Jollywood Secondary", "Northgate Sixth From", "Brompton High"]
data['subjects'] = [['Maths', 'Art', 'Biology'], ['English', 'French', 'History'], ['Chemistry', 'Biology', 'English']]
# redefine index to match your keys
data.index = ['cand{}'.format(i) for i in range(1,len(data)+1)]
# convert to dict
data_dict = data.to_dict(orient='index')
print(data_dict)
This will look something like this
{'cand1': {
'name': 'Joe Smith',
'school': 'Jollywood Secondary',
'subjects': ['Maths', 'Art', 'Biology']},
'cand2': {
'name': 'Mary James',
'school': 'Northgate Sixth From',
'subjects': ['English', 'French', 'History']},
'cand3': {
'name': 'Charles Williams',
'school': 'Brompton High',
'subjects': ['Chemistry', 'Biology', 'English']}}
Consider avoiding the roundabout away of building dictionary as Pandas maintains various methods to render nested structures such as to_dict and to_json. Specifically, consider adding a new column, cand and set it as index for to_dict output:
data['cand'] = 'cand' + pd.Series((data.index.astype('int') + 1).astype('str'))
mydict = data.set_index('cand').to_dict(orient='index')
print(mydict)
{'cand1': {'name': 'Joe Smith', 'school': 'Jollywood Secondary',
'subjects': ['Maths', 'Art', 'Biology']},
'cand2': {'name': 'Mary James', 'school': 'Northgate Sixth From',
'subjects': ['English', 'French', 'History']},
'cand3': {'name': 'Charles Williams', 'school': 'Brompton High',
'subjects': ['Chemistry', 'Biology', 'English']}}

Convert pandas to dictionary defining the columns used fo the key values

There's the pandas dataframe 'test_df'. My aim is to convert it to a dictionary. Therefore I run this:
id Name Gender Age
0 1 'Peter' 'M' 32
1 2 'Lara' 'F' 45
Therefore I run this:
test_dict = test_df.set_index('id').T.to_dict()
The output is this:
{1: {'Name': 'Peter', 'Gender': 'M', 'Age': 32}, 2: {'Name': 'Lara', 'Gender': 'F', 'Age': 45}}
Now, I want to choose only the 'Name' and 'Gender' columns as the values of dictionary's keys. I'm trying to modify the above script into sth like this:
test_dict = test_df.set_index('id')['Name']['Gender'].T.to_dict()
with no success!
Any suggestion please?!
You was very close, use subset of columns [['Name','Gender']]:
test_dict = test_df.set_index('id')[['Name','Gender']].T.to_dict()
print (test_dict)
{1: {'Name': 'Peter', 'Gender': 'M'}, 2: {'Name': 'Lara', 'Gender': 'F'}}
Also T is not necessary, use parameter orient='index':
test_dict = test_df.set_index('id')[['Name','Gender']].to_dict(orient='index')
print (test_dict)
{1: {'Name': 'Peter', 'Gender': 'M'}, 2: {'Name': 'Lara', 'Gender': 'F'}}