Create a dataframe from MultiLCA results in Brightway2 - pandas

I am trying to create a pandas dataframe from the results of a MultiLCA calculation, using the methods as columns and the functional units as rows. I did find a sort of solution, but it is a bit cumbersome (I am not very good with dictionaries).
...
mlca = MultiLCA("my_calculation_setup")
dfresults = pd.DataFrame(mlca.results, columns=mlca.methods)
fu_names = []
for d in mlca.func_units:
    for key in d:
        fu_names.append(str(key))
dfresults['fu'] = fu_names
dfresults.set_index('fu', inplace=True)
Is there a more elegant way of doing this? The names are also very long, but that's a different story...

Your code seems relatively elegant to me. If you want to stick with str(key), then you could simplify it somewhat with a list comprehension:
mlca = MultiLCA("my_calculation_setup")
dfresults = pd.DataFrame(mlca.results, columns=mlca.methods)
dfresults['fu'] = [str(key) for demand in mlca.func_units for key in demand]
dfresults.set_index('fu', inplace=True)
Note that this only works if your demand dictionaries have one activity each. You could have situations where one demand dictionary would have two activities (like LCA({'foo': 1, 'bar': 2})), where this would fail because there would be too many elements in the fu list.
If you do know that you only have one activity per demand, then you can make a slightly nicer dataframe as follows:
mlca = MultiLCA("my_calculation_setup")
scores = pd.DataFrame(mlca.results, columns=mlca.methods)
as_activities = [
    (get_activity(key), amount)
    for dct in mlca.func_units
    for key, amount in dct.items()
]
nicer_fu = pd.DataFrame(
    [
        (x['database'], x['code'], x['name'], x['location'], x['unit'], y)
        for x, y in as_activities
    ],
    columns=('Database', 'Code', 'Name', 'Location', 'Unit', 'Amount')
)
nicer = pd.concat([nicer_fu, scores], axis=1)
However, in the general case dataframes are not a perfect match for calculation setups. When a demand dictionary has multiple activities, there is no nice way to "squish" this into one dimension or one row.
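If you do end up with multi-activity demands, one possible workaround (just a sketch, not part of the original answer) is to collapse each demand dictionary into a single string label and use that as the index, accepting that the per-activity amounts are no longer separate columns:
# Sketch: one row per functional unit, labelled by joining its "key: amount" pairs.
labels = [
    "; ".join("{}: {}".format(key, amount) for key, amount in demand.items())
    for demand in mlca.func_units
]
dfresults = pd.DataFrame(mlca.results, columns=mlca.methods, index=labels)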

Related

Pandas Map on some rows only?

Is there a way to use pandas map on only some rows and ignore all others?
Example DF :
import datetime as dt
import pandas as pd

df = pd.DataFrame({'ProductID': ['Playstation','Playstation','Playstation','Sony Playstation','Sony Playstation','Sony Playstation'],
                   'date': [dt.date(2022,11,1),dt.date(2022,11,5),dt.date(2022,11,1),dt.date(2022,11,10),dt.date(2022,11,15),dt.date(2022,11,1)],
                   'customerID': ['01','01','01','01','02','01'],
                   'brand': ['Cash','Cash','Game','Cash','Cash','Game'],
                   'gmv': [10,50,30,40,50,60]})
As you can see, I have similar products that, for some reason, sometimes appear as "Playstation" and sometimes as "Sony Playstation".
How can I do a pandas map to replace "Playstation" to "Sony Playstation"?
Take into account that this is an example with only 2 brands. In my DF I have several brands, so making a dict of them is not viable (and I might need to change many brands at once).
Can I apply map on a filtered df? I've tried to apply map on a partial DF:
Gift.loc[(Gift.brand == 'PlayStation')].brand.map({'PlayStation': 'Sony PlayStation'}, na_action='ignore')
If so, how do I move the new data to the original df?
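As a hedged sketch of one common pattern (using the ProductID column from the example DF above, since that is where the product names actually live), you can run the map on the filtered rows and assign the result back through the same .loc mask:
# Sketch: map only the matching rows and write the result back into the original df.
mask = df['ProductID'] == 'Playstation'
df.loc[mask, 'ProductID'] = df.loc[mask, 'ProductID'].map({'Playstation': 'Sony Playstation'})
For a handful of renames, df['ProductID'].replace({...}) achieves the same thing without building a mask, since replace leaves non-matching values untouched.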

How to look up the first row in a DF #1 that matches the values from DF #2?

I have a very large DF with bike ride data from the Chicago DIVVY system. It includes start/end data for each ride, including station ID and lat/lng information.
My Goal: find the station with the most "start" rides. Return the number of rides and the lat/long data for the station.
I can find the 15 busiest stations with:
df['startID'].value_counts().head(15)
This creates a pd.Series with the ID (as index) and the number of rides. It executes quite fast (<1 sec).
(After changing the series to a DF) What's the easiest / fastest way to add the lat/lng data to this df?
I've got a very kludgy and slow solution that takes the series, turns it into a DF, and then iterates over the DF, looking up the station ID in the big DF and returning the lat/lng values. (I put these in a dictionary, because I will plot them on a map later.)
points = {}
for index, row in stat_df.iterrows():
    id = row['start_station_id']
    lat_lng = bigData.loc[bigData['start_station_id'] == id].head(1)[['start_lat','start_lng']].values.tolist()
    points[id] = [row['count'], lat_lng[0]]
Although my list is short (15 stations/rows), this is REALLY slow (over 2 minutes!), since .loc finds all the rows in the main DF that match the station ID (thousands of rows) and then takes just the head row.
I've tried to use .merge() to match the station/frequency table with the big DF, but that does a one-to-many match, which results in a huge new DF, which isn't what I want.
This seems like a very basic goal, so I suspect there is a simple solution that eludes me.
Is this what you are trying to do, where df1 equals the list of highest starts and df2 is the df that has all the long/lats?
df1 = pd.DataFrame({
    'Location': ['Here', 'There', 'Over There'],
    'Count': [1000, 20000, 3000]
})
df2 = pd.DataFrame({
    'Location': ['Here', 'Here', 'Somewhere', 'There', 'Else where', 'Over There'],
    'Long_Lat': [10.123, 10.123, 21830.238, 10238.2318, 830.2139, 10223.123]
})
pd.merge(df1, df2.drop_duplicates())
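Applied to the column names from the question (these are assumptions taken from the asker's snippet, so adjust as needed), the same drop-duplicates-then-merge idea might look like:
# Sketch: keep one lat/lng pair per station, then merge onto the 15-row frequency table.
station_coords = bigData[['start_station_id', 'start_lat', 'start_lng']].drop_duplicates(
    subset='start_station_id'
)
points_df = stat_df.merge(station_coords, on='start_station_id', how='left')
This avoids the one-to-many blow-up because station_coords has exactly one row per station before the merge.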

What is the most efficient method for calculating per-row historical values in a large pandas dataframe?

Say I have two pandas dataframes (df_a & df_b), where each row represents a toy and features about that toy. Some pretend features:
Was_Sold (Y/N)
Color
Size_Group
Shape
Date_Made
Say df_a is relatively small (10s of thousands of rows) and df_b is relatively large (>1 million rows).
Then for every row in df_a, I want to:
Find all the toys from df_b with the same type as the one from df_a (e.g. the same color group)
The df_b toys must also be made before the given df_a toy
Then find the ratio of those sold (So count sold / count all matched)
What is the most efficient means to make those per-row calculations above?
The best I've come up with so far is something like the below.
(Note code might have an error or two as I'm rough typing from a different use case)
cols = ['Color', 'Size_Group', 'Shape']

# Run this calculation for multiple features
for col in cols:
    print(col + ' - Started')

    # Empty list to build up the calculation in
    ratio_list = []

    # Start the iteration
    for row in df_a.itertuples(index=False):
        # Relevant values from df_a
        relevant_val = getattr(row, col)
        created_date = row.Date_Made

        # df to keep the overall prior toy matches
        prior_toys = df_b[(df_b.Date_Made < created_date) & (df_b[col] == relevant_val)]
        prior_count = len(prior_toys)

        # Now find the ones that were sold
        prior_sold_count = len(prior_toys[prior_toys.Was_Sold == "Y"])

        # Now make the calculation and append to the list
        if prior_count == 0:
            ratio = 0
        else:
            ratio = prior_sold_count / prior_count
        ratio_list.append(ratio)

    # Store the calculation in the original df_a
    df_a[col + '_Prior_Sold_Ratio'] = ratio_list
    print(col + ' - Finished')
Using .itertuples() is useful, but this is still pretty slow. Is there a more efficient method or something I'm missing?
EDIT
Added the below script, which will emulate data for the above scenario:
import numpy as np
import pandas as pd

colors = ['red', 'green', 'yellow', 'blue']
sizes = ['small', 'medium', 'large']
shapes = ['round', 'square', 'triangle', 'rectangle']
sold = ['Y', 'N']

size_df_a = 200
size_df_b = 2000

date_start = pd.to_datetime('2015-01-01')
date_end = pd.to_datetime('2021-01-01')

def random_dates(start, end, n=10):
    start_u = start.value // 10**9
    end_u = end.value // 10**9
    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')

df_a = pd.DataFrame(
    {
        'Color': np.random.choice(colors, size_df_a),
        'Size_Group': np.random.choice(sizes, size_df_a),
        'Shape': np.random.choice(shapes, size_df_a),
        'Was_Sold': np.random.choice(sold, size_df_a),
        'Date_Made': random_dates(date_start, date_end, n=size_df_a)
    }
)

df_b = pd.DataFrame(
    {
        'Color': np.random.choice(colors, size_df_b),
        'Size_Group': np.random.choice(sizes, size_df_b),
        'Shape': np.random.choice(shapes, size_df_b),
        'Was_Sold': np.random.choice(sold, size_df_b),
        'Date_Made': random_dates(date_start, date_end, n=size_df_b)
    }
)
First of all, I think your computation would be much more efficient using a relational database and a SQL query. The filters can be done by indexing columns, performing a database join, applying some advanced filtering and counting the result. An optimized relational database can derive an efficient algorithm from a simple SQL query (hash-based row grouping, binary search, fast intersection of sets, etc.). Pandas is sadly not very good at performing advanced requests like this efficiently. It is also very slow to iterate over a pandas dataframe, although I am not sure this can be alleviated in this case using only pandas. Hopefully you can use some Numpy and Python tricks to (partially) implement what fast relational database engines would do.
Additionally, pure-Python object types are slow, especially (unicode) strings. Thus, converting column types to efficient ones in the first place can save a lot of time (and memory). For example, there is no need for the Was_Sold column to contain "Y"/"N" string objects: a boolean can be used instead. Thus let us convert it:
df_b.Was_Sold = df_b.Was_Sold == "Y"
Finally, the current algorithm has a bad complexity: O(Na * Nb), where Na is the number of rows in df_a and Nb the number of rows in df_b. This is not easy to improve, though, due to the non-trivial conditions. A first solution is to group df_b by the col column ahead of time, so as to avoid an expensive complete scan of df_b (previously done with df_b[col] == relevant_val). Then the dates within each precomputed group can be sorted so that a fast binary search can be performed later, and Numpy can count the boolean values efficiently (using np.sum).
Note that doing prior_toys['Was_Sold'] is a bit faster than prior_toys.Was_Sold.
Here is the resulting code:
cols = ['Color', 'Size_Group', 'Shape']

# Run this calculation for multiple features
for col in cols:
    print(col + ' - Started')

    # Empty list to build up the calculation in
    ratio_list = []

    # Split df_b by col and sort each (indexed) group by date
    colGroups = {grId: grDf.sort_values('Date_Made') for grId, grDf in df_b.groupby(col)}

    # Start the iteration
    for row in df_a.itertuples(index=False):
        # Relevant values from df_a
        relevant_val = getattr(row, col)
        created_date = row.Date_Made

        # df to keep the overall prior toy matches
        curColGroup = colGroups[relevant_val]
        prior_count = np.searchsorted(curColGroup['Date_Made'], created_date)
        prior_toys = curColGroup[:prior_count]

        # Now find the ones that were sold
        prior_sold_count = prior_toys['Was_Sold'].values.sum()

        # Now make the calculation and append to the list
        if prior_count == 0:
            ratio = 0
        else:
            ratio = prior_sold_count / prior_count
        ratio_list.append(ratio)

    # Store the calculation in the original df_a
    df_a[col + '_Prior_Sold_Ratio'] = ratio_list
    print(col + ' - Finished')
This is 5.5 times faster on my machine.
The iteration over the pandas dataframe is a major source of slowdown. Indeed, prior_toys['Was_Sold'] takes half the computation time because of the huge overhead of pandas internal function calls repeated Na times... Using Numba may help to reduce the cost of the slow iteration. Note that the complexity can be improved by doing more precomputation per group ahead of time (bringing it down to O(Na log Nb)); this should completely remove the overhead of prior_sold_count. The resulting program should be about 10 times faster than the original one.
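As a sketch of that last idea (my own assumption about what the extra precomputation could look like, not code from the answer above): a per-group cumulative sum of Was_Sold turns the per-row sum into a single array lookup. The names col, df_a, df_b and ratio_list are as in the code above.
# Sketch: per group, precompute sorted dates and a running count of sold toys.
# Assumes df_b.Was_Sold has already been converted to booleans as shown earlier.
precomp = {}
for grId, grDf in df_b.groupby(col):
    grDf = grDf.sort_values('Date_Made')
    precomp[grId] = (grDf['Date_Made'].values, np.cumsum(grDf['Was_Sold'].values))

ratio_list = []
for row in df_a.itertuples(index=False):
    dates, sold_cumsum = precomp[getattr(row, col)]
    prior_count = np.searchsorted(dates, np.datetime64(row.Date_Made))
    prior_sold_count = sold_cumsum[prior_count - 1] if prior_count > 0 else 0
    ratio_list.append(prior_sold_count / prior_count if prior_count else 0)
df_a[col + '_Prior_Sold_Ratio'] = ratio_list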

Extract info from each row of a dataframe without a loop

I have a large dataframe (~500,000 rows). Processing each row gives me a Counter object (a dictionary with object counts). The output I want is a new dataframe whose column headers are the objects being counted (the keys in the dictionary). I am looping over the rows, however it takes very long. I know that loops should be avoided in Pandas, any suggestion?
out_df = pd.DataFrame()
for row in input_df['text']:
    tokens = nltk.word_tokenize(row)
    pos = nltk.pos_tag(tokens)
    count = Counter(elem[1] for elem in pos)
    out_df = out_df.append(count, ignore_index=True)
For indication, Counter(elem[1] for elem in pos) looks like Counter({'NN': 8, 'VBZ': 2, 'DT': 3, 'IN': 4}).
Using append on a dataframe is quite inefficient, I believe (it has to reallocate memory for the entire dataframe each time).
DataFrames were meant for analyzing data and easily adding columns, but not rows.
So I think a better approach would be to create a list first (lists are mutable) and convert it to a dataframe at the end.
I'm not familiar with nltk so I can't actually test this but something along the following lines should work:
out_data = []
for row in input_df['text']:
    tokens = nltk.word_tokenize(row)
    pos = nltk.pos_tag(tokens)
    count = Counter(elem[1] for elem in pos)
    out_data.append(count)
out_df = pd.DataFrame(out_data)
You might want to add the following to remove any NaNs and convert the final counts to integers:
out_df = out_df.fillna(0).astype(int)
And delete the list after to free up the memory:
del out_data
I think you should use a vectorized solution, maybe: "Iterating through pandas objects is generally slow. In many cases, iterating manually over the rows is not needed and can be avoided (using) a vectorized solution: many operations can be performed using built-in methods or NumPy functions, (boolean) indexing." From https://towardsdatascience.com/you-dont-always-have-to-loop-through-rows-in-pandas-22a970b347ac
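For reference, here is a sketch of the same pipeline with the explicit loop pushed into Series.apply (apply is not truly vectorized, the NLTK tagging still runs row by row, but it keeps the dataframe construction in one step); input_df is the frame from the question:
from collections import Counter

import nltk
import pandas as pd

def pos_counts(text):
    # Tag one document and count its part-of-speech tags.
    return Counter(tag for _, tag in nltk.pos_tag(nltk.word_tokenize(text)))

# One Counter per row, expanded into columns; tags missing from a row become 0.
out_df = pd.DataFrame(input_df['text'].apply(pos_counts).tolist()).fillna(0).astype(int)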

Pandas groupby aggregate apply multiple functions to multiple columns

I have a dataframe and need to apply the same calculations to many columns; currently I'm doing it manually.
Any good and elegant way to do this?
import numpy as np
import pandas as pd

tt = pd.DataFrame(data={'Status': ['green','green','red','blue','red','yellow','black'],
                        'Group': ['A','A','B','C','A','B','C'],
                        'City': ['Toronto','Montreal','Vancouver','Toronto','Edmonton','Winnipeg','Windsor'],
                        'Sales': [13,6,16,8,4,3,1],
                        'Counts': [100,200,50,30,20,10,300]})
ss = tt.groupby('Group').agg({'Sales': ['count','mean',np.median],
                              'Counts': ['count','mean',np.median]})
ss.columns = ['_'.join(col).strip() for col in ss.columns.values]
So the result is a dataframe with the columns Sales_count, Sales_mean, Sales_median, Counts_count, Counts_mean and Counts_median.
How could I do this for many columns with the same calculations (count, mean, median for each column) if I have a very large dataframe?
In pandas, the agg operation takes one or more individual methods to be applied to the relevant columns and returns a summary of the outputs. In Python, lists hold multiple entities, so in this case I pass a list of functions into the aggregator. In your case, you were passing a dictionary, which means you had to handle each column individually, making it very manual. Happy to explain further if not clear.
ss = tt.groupby('Group').agg(['count','mean','median'])
ss.columns = ['_'.join(col).strip() for col in ss.columns.values]
ss
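If you only want these statistics for some of the columns (a usage sketch based on the example frame above), subset the groupby before aggregating:
# Aggregate only Sales and Counts, leaving non-numeric columns such as City out.
ss = tt.groupby('Group')[['Sales', 'Counts']].agg(['count', 'mean', 'median'])
ss.columns = ['_'.join(col) for col in ss.columns]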