Reshape a DataFrame based on column value, and pad missing slices with zeros - pandas

I have a Pandas DataFrame which looks like:
ID  order  other_column_1  other_column_x
A       0              10              20
A       1              11              21
A       2              12              22
B       0              31              41
B       2              33              43
I want to reshape it to a 3D matrix with shape (#IDs, #order, #other columns). For the example above, it should be of shape (2, 3, 2).
The order column holds the order of the 2nd dimension, so slice ['A', 0, :] should be [10, 20], ['A', 1, :] should be [11, 21], etc. The values of order are the same for every ID (0, 1, 2 in this case).
The trouble is that sometimes a slice is missing, e.g. for 'B', the slice for order 1 is missing, and I want to replace it with a slice padded with all 0's to keep the shape consistent.
I could pre-sort the whole DataFrame by ID and order, loop over each ID, insert the missing slices, and stack them together. However, the DataFrame is huge, so I would like to avoid a global sort and loop if possible.
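For reference, the example frame above can be reproduced with a small sketch like this (the values are taken straight from the table; only the constructor call is added here):
import pandas as pd

df = pd.DataFrame({
    'ID':             ['A', 'A', 'A', 'B', 'B'],
    'order':          [0, 1, 2, 0, 2],
    'other_column_1': [10, 11, 12, 31, 33],
    'other_column_x': [20, 21, 22, 41, 43],
})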

I came up with a way to do it (if you have enough PC memory to allocate) where you don't have to loop over the whole dataframe, although I couldn't test it with 10M rows because of the memory allocation. I tested it with 5M rows by 300 columns, and I show the results at the end of the answer.
The idea is to get all the combinations of the unique values of the first 2 columns as an index to build the first 2 dimensions of the 3D array.
After that you can merge the original dataframe with the dataframe containing index combinations to then fill all the missing values with 0.
Once the data is complete you can pass it to numpy and reshape it in 3 dimensions.
Code without comments:
# df = original dataframe
d1 = df.ID.unique()
d2 = df.order.unique()
df3 = pd.MultiIndex.from_product((d1, d2), names=['ID', 'order'])\
.to_frame().reset_index(drop=True)\
.merge(df, on=['ID', 'order'], how='left')\
.fillna(0)
np_3d_array = df3[df3.columns[2:]].to_numpy().reshape(d1.shape[0], d2.shape[0], df.columns[2:].shape[0])
Code with comments:
# df = original dataframe
# Get the unique IDs for the 1st dimension
d1 = df.ID.unique()
# Get the unique order values for the 2nd dimension
d2 = df.order.unique()
# Build the complete DF
df3 = (
    pd.MultiIndex.from_product((d1, d2), names=['ID', 'order'])  # all (ID, order) combinations for the 1st and 2nd dimensions, as an index
    .to_frame().reset_index(drop=True)                           # turn the MultiIndex into a DataFrame and reset the index
    .merge(df, on=['ID', 'order'], how='left')                   # merge the complete index with the original values
    .fillna(0)                                                   # fill missing values with 0
)
# get the complete data as a 2D array and reshape it into a 3D array
np_3d_array = df3[df3.columns[2:]].to_numpy().reshape(d1.shape[0], d2.shape[0], df.columns[2:].shape[0])
Test:
First I tried to test with 10M rows but I could not allocate the memory needed for that.
To test the code I created a dataframe with 6M rows x 300 columns (random float numbers) and dropped 1M rows to simulate the missing values.
Here is the code I used to test and the results.
Test code:
import random
import time
import pandas as pd
import numpy as np
# 100000 different IDs and 60 different order values;
# drop 1M rows to simulate missing rows
df_test = (
    pd.MultiIndex.from_product((range(100000), range(60)), names=['ID', 'order'])
    .to_frame().reset_index(drop=True)
    .drop(random.sample(range(6_000_000), k=1_000_000))
    .reset_index(drop=True)
)
# 5M rows random data by 298 columns
df_test2 = pd.DataFrame(np.random.random(size=(5_000_000, 298)))
df = df_test.merge(df_test2, left_index=True, right_index=True)
start = time.time()
d1 = df.ID.unique()
print(f'time 1st Dimension: {round(time.time()-start, 3)}')
d2 = df.order.unique()
print(f'time 2nd Dimension: {round(time.time()-start, 3)}')
df3 = pd.MultiIndex.from_product((d1, d2), names=['ID', 'order'])\
.to_frame().reset_index(drop=True)\
.merge(df, on=['ID', 'order'], how='left').fillna(0)
print(f'time merge: {round(time.time()-start, 3)}')
np_3d_array = df3[df3.columns[2:]].to_numpy().reshape(d1.shape[0], d2.shape[0], df.columns[2:].shape[0])
print(f'time ndarray: {round(time.time()-start, 3)}')
print(f'array shape: {np_3d_array.shape}')
print(f'array type: {type(np_3d_array)}')
Test Results:
time 1st Dimension: 0.035
time 2nd Dimension: 0.063
time merge: 47.202
time ndarray: 49.441
array shape: (100000, 60, 298)
array type: <class 'numpy.ndarray'>

An alternative without an explicit merge: set ID and order as the index and reindex against the full Cartesian product of the unique IDs and orders, which inserts the missing slices as NaN rows that can then be filled with 0:
ids = df.ID.unique()
orders = df.order.unique()
ar = (df.set_index(['ID', 'order'])
        .reindex(pd.MultiIndex.from_product((ids, orders)))  # add a row for every missing (ID, order) pair
        .fillna(0)
        .to_numpy()
        .reshape(len(ids), len(orders), len(df.columns[2:])))
print(ar)
print(ar.shape)
Output:
[[[10. 20.]
  [11. 21.]
  [12. 22.]]

 [[31. 41.]
  [ 0.  0.]
  [33. 43.]]]
(2, 3, 2)

Related

Transforming a dataframe of dict of dict specific format

I have this df dataset:
df = pd.DataFrame({'train': {'auc': [0.432, 0.543, 0.523],
'logloss': [0.123, 0.234, 0.345]},
'test': {'auc': [0.456, 0.567, 0.678],
'logloss': [0.321, 0.432, 0.543]}})
Where I'm trying to transform it into this:
And also considering that:
epochs always have the same order for every cell, but instead of only 3 epochs, it could reach 1,000 or 10,000.
The column names and axes could change. For example, another day the data could have f1 instead of logloss, or val instead of train. But no matter the names, in df each row will always be a metric name, and each column will always be a dataset name.
The number of columns and rows in df could change too. There are some models with 5 datasets and 7 metrics, for example (which would give a df with 5 columns and 7 rows).
The column names of the output table should be datasetname_metricname.
So I'm trying to build a generic transformation, while avoiding brute-force, hand-coded transformations. In case it's helpful, the source of df is:
df = pd.DataFrame(model_xgb.evals_result())
df.columns = ['train', 'test'] # This is the line that can change (and the metrics inside `model_xgb`)
Where model_xgb = xgboost.XGBClassifier(..), after calling model_xgb.fit(..).
Here's a generic way to get the result you've specified, irrespective of the number of epochs or the number or labels of rows and columns:
df2 = df.stack().apply(pd.Series)
df2.index = ['_'.join(reversed(x)) for x in df2.index]
df2 = df2.T.assign(epochs=range(1, len(df2.columns) + 1)).set_index('epochs').reset_index()
Output:
   epochs  train_auc  test_auc  train_logloss  test_logloss
0       1      0.432     0.456          0.123         0.321
1       2      0.543     0.567          0.234         0.432
2       3      0.523     0.678          0.345         0.543
Explanation:
Use stack() to convert the input dataframe to a series (of lists) with a multiindex that matches the desired column sequence in the question
Use apply(pd.Series) to convert the series of lists to a dataframe with each list converted to a row and with column count equal to the uniform length of the list values in the input series (in other words, equal to the number of epochs)
Create the desired column labels by joining the (reversed) multiindex tuples with _ as a separator, then use T to transpose the dataframe so these index labels (which are the desired column labels) become column labels
Use assign() to add a column named epochs enumerating the epochs beginning with 1
Use set_index() followed by reset_index() to make epochs the leftmost column.
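If it helps to see what steps 1 and 2 produce, here is a small illustrative sketch of the intermediate frame, using the same sample df (not part of the original answer):
import pandas as pd

df = pd.DataFrame({'train': {'auc': [0.432, 0.543, 0.523],
                             'logloss': [0.123, 0.234, 0.345]},
                   'test': {'auc': [0.456, 0.567, 0.678],
                            'logloss': [0.321, 0.432, 0.543]}})

df2 = df.stack().apply(pd.Series)  # one row per (metric, dataset) pair, one column per epoch
print(df2)
#                    0      1      2
# auc     train  0.432  0.543  0.523
#         test   0.456  0.567  0.678
# logloss train  0.123  0.234  0.345
#         test   0.321  0.432  0.543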
Try this:
df = pd.DataFrame({'train': {'auc': [0.432, 0.543, 0.523],
'logloss': [0.123, 0.234, 0.345]},
'test': {'auc': [0.456, 0.567, 0.678],
'logloss': [0.321, 0.432, 0.543]}})
de=df.explode(['train', 'test'])
df_out = de.set_index(de.groupby(level=0).cumcount()+1, append=True).unstack(0)
df_out.columns = df_out.columns.map('_'.join)
df_out = df_out.reset_index().rename(columns={'index':'epochs'})
print(df_out)
Output:
   epochs  train_auc  train_logloss  test_auc  test_logloss
0       1      0.432          0.123     0.456         0.321
1       2      0.543          0.234     0.567         0.432
2       3      0.523          0.345     0.678         0.543
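As a side note, the intermediate result of the explode step looks roughly like this (a sketch for illustration, not part of the answer): each list is expanded into one row per epoch, with the metric name repeated in the index.
de = df.explode(['train', 'test'])
print(de)
#          train   test
# auc      0.432  0.456
# auc      0.543  0.567
# auc      0.523  0.678
# logloss  0.123  0.321
# logloss  0.234  0.432
# logloss  0.345  0.543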

How do you speed up a score calculation based on two rows in a Pandas Dataframe?

TLDR: How can one adjust the for-loop for a faster execution time:
import numpy as np
import pandas as pd
import time
np.random.seed(0)
# Given a DataFrame df and a row_index
df = pd.DataFrame(np.random.randint(0, 3, size=(30000, 50)))
target_row_index = 5
start = time.time()
target_row = df.loc[target_row_index]
result = []
# Method 1: Optimize this for-loop
for row in df.iterrows():
    """
    Logic of calculating the variables check and score:
    if the values for a specific column are 2 for both rows (row/target_row), it should add 1 to the score
    if for one of the rows the value is 1 and for the other 2 for a specific column, it should subtract 1 from the score.
    """
    check = row[1] + target_row  # row[1] takes 30 microseconds per call
    score = np.sum(check == 4) - np.sum(check == 3)  # np.sum takes 47 microseconds per call
    result.append(score)
print(time.time()-start)
# Goal: Calculate the list result as efficient as possible
# Method 2: Optimize Apply
def add(a, b):
    check = a + b
    return np.sum(check == 4) - np.sum(check == 3)
start = time.time()
q = df.apply(lambda row : add(row, target_row), axis = 1)
print(time.time()-start)
So I have a dataframe with 30,000 rows and a target row in this dataframe with a given row index. Now I want to compare this row to all the other rows in the dataset by calculating a score. The score is calculated as follows:
if the values for a specific column are 2 for both rows, it should add 1 to the score
if for one of the rows the value is 1 and for the other 2 for a specific column, it should subtract 1 from the score.
The result is then the list of all the scores we just calculated.
As I need to execute this code quite often I would like to optimize it for performance.
Any help is very much appreciated.
I have already read Optimization when using Pandas; are there further resources you can recommend? Thanks.
If you're willing to convert your df to a NumPy array, NumPy has some really good vectorisation that helps. My code using NumPy is as below:
df = pd.DataFrame(np.random.randint(0, 3, size=(30000, 50)))
target_row_index = 5
start_time = time.time()
# Converting stuff to NumPy arrays
target_row = df.loc[target_row_index].to_numpy()
np_arr = df.to_numpy()
# Calculations
np_arr += target_row
check = np.sum(np_arr == 4, axis=1) - np.sum(np_arr == 3, axis=1)
result = list(check)
end_time = time.time()
print(end_time - start_time)
Your complete code (on Google Colab for me) outputs a time of 14.875332832336426 s, while the NumPy code above outputs a time of 0.018691539764404297 s, and of course, the result list is the same in both cases.
Note that in general, if your calculations are purely numerical, NumPy will virtually always be better than Pandas and a for loop. Pandas really shines through with strings and when you need the column and row names, but for pure numbers, NumPy is the way to go due to vectorisation.
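For comparison, a pandas-only vectorised variant is also possible. The sketch below (not part of the answer above, and assuming df and target_row as defined in the question) broadcasts the target row across every row without converting to NumPy; it is typically slower than the pure NumPy version but far faster than iterrows:
check = df + target_row  # a Series aligns on column labels, so the target row is added to every row
result = ((check == 4).sum(axis=1) - (check == 3).sum(axis=1)).tolist()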

Why does pandas.DataFrame.apply produce a Series instead of a DataFrame

I do not really understand why the following code returns a Series and not a DataFrame.
import pandas as pd

df = pd.DataFrame([[4, 9]] * 3, columns=["A", "B"])

def plus_2(x):
    y = []
    for i in range(0, len(x)):
        y.append(x[i] + 2)
    return y

df_row = df.apply(plus_2, axis=1)  # Applied to each row
df_row
While if I change to axis=0, it produces a DataFrame as expected:
import pandas as pd

df = pd.DataFrame([[4, 9]] * 3, columns=["A", "B"])

def plus_2(x):
    y = []
    for i in range(0, len(x)):
        y.append(x[i] + 2)
    return y

df_row = df.apply(plus_2, axis=0)  # Applied to each column
df_row
Here is the output:
In the first example, where you pass axis=1, the function is applied at the row level.
That means that for each row, the plus_2 function returns y, which is a list of two elements (but the list as a whole is a single object, so the result is a pd.Series).
Based on your example, 3 lists (of 2 elements each) are returned: one list per row.
You could expand this result and create two columns (each element of the list becomes a new column) by adding result_type="expand" to apply:
df_row = df.apply(lambda x: plus_2(x), axis=1, result_type="expand")
# output
   0   1
0  6  11
1  6  11
2  6  11
In the second approach you have axis=0, so the function is applied at the column level.
That means that for each column the plus_2 function returns y, so plus_2 is applied twice, separately for column A and column B. This is why it returns a DataFrame: the input is a DataFrame with columns A and B, plus_2 is applied to each column, and the results come back as columns A and B.
Based on your example, 2 lists (of 3 elements each) are returned: one list per column.
So the main difference between axis=1 and axis=0 is:
if you apply at the row level, apply will return:
[6, 11]
[6, 11]
[6, 11]
if you apply at the column level, apply will return:
[6, 6, 6]
[11, 11, 11]
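To see both cases side by side, a small self-contained check (a sketch of the same example) could look like this:
import pandas as pd

df = pd.DataFrame([[4, 9]] * 3, columns=["A", "B"])

def plus_2(x):
    return [v + 2 for v in x]

print(df.apply(plus_2, axis=1))   # one list per row    -> Series of lists
# 0    [6, 11]
# 1    [6, 11]
# 2    [6, 11]
# dtype: object

print(df.apply(plus_2, axis=0))   # one list per column -> DataFrame
#    A   B
# 0  6  11
# 1  6  11
# 2  6  11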

Calculate statistics on subset of a dataframe based on values in dataframe (latitude and longitude)

I am looking to calculate summary statistics on subsets of a dataframe, where each subset is defined relative to the values in the current row.
For example, I have a dataframe that has latitude and longitude and number of people.
df = pd.DataFrame({'latitude': [40.991919 , 40.992001 , 40.991602, 40.989903, 40.987759],
'longitude': [-106.049469, -106.048812, -106.048904, -106.049907, -106.048840],
'people': [1,2,3,4,5]})
I want to know the total number of people within 0.05 miles of each row. This can easily be computed with a loop, but as the dataset grows this becomes unusable.
Current/Sample:
from geopy.distance import distance

def distance_calc(row, focus_lat, focus_long):
    start = (row['latitude'], row['longitude'])
    stop = (focus_lat, focus_long)
    return distance(start, stop).miles

df['total_people_within_05'] = 0
df['total_rows_within_05'] = 0

for index, row in df.iterrows():
    focus_lat = df['latitude'][index]
    focus_long = df['longitude'][index]
    new_df = df.copy()
    new_df['distance'] = new_df.apply(lambda row: distance_calc(row, focus_lat, focus_long), axis=1)
    df.at[index, 'total_people_within_05'] = new_df.loc[new_df.distance <= .05]['people'].sum()
    df.at[index, 'total_rows_within_05'] = new_df.loc[new_df.distance <= .05].shape[0]
Is there any pythonic way to do this?
Cartesian product of the dataframe with itself to get all combinations. This will be expensive on larger datasets: it generates N^2 rows, so in this case 25 rows
calculate the distance for each of these combinations
filter with query() to the distances required
groupby() to get the total number of people; also generate a list of the indexes included in each total, to help with transparency
finally join() this back together and you have what you want
import pandas as pd
import geopy.distance as gd

df = pd.DataFrame({'latitude': [40.991919, 40.992001, 40.991602, 40.989903, 40.987759],
                   'longitude': [-106.049469, -106.048812, -106.048904, -106.049907, -106.048840],
                   'people': [1, 2, 3, 4, 5]})

df = df.join((df.reset_index().assign(foo=1).merge(df.reset_index().assign(foo=1), on="foo")
              .assign(distance=lambda dfa: dfa.apply(lambda r: gd.distance((r.latitude_x, r.longitude_x),
                                                                           (r.latitude_y, r.longitude_y)).miles, axis=1))
              .query("distance<=0.05")
              .rename(columns={"people_y": "nearby"})
              .groupby("index_x").agg({"nearby": "sum", "index_y": lambda x: list(x)})
              ))
print(df.to_markdown())
|    |   latitude |   longitude |   people |   nearby | index_y   |
|---:|-----------:|------------:|---------:|---------:|:----------|
|  0 |    40.9919 |    -106.049 |        1 |        6 | [0, 1, 2] |
|  1 |    40.992  |    -106.049 |        2 |        6 | [0, 1, 2] |
|  2 |    40.9916 |    -106.049 |        3 |        6 | [0, 1, 2] |
|  3 |    40.9899 |    -106.05  |        4 |        4 | [3]       |
|  4 |    40.9878 |    -106.049 |        5 |        5 | [4]       |
Update - use combinations instead of Cartesian product
It's been bugging me that a Cartesian product is a huge overhead, when all that is required is to calculate distances between valid combinations
make use of itertools.combinations() to make a list of valid combinations of indexes
calculate distances between this minimum set
filter down to only distances we're interested in
now build permutations of this smaller set to provide a simple join to actual data
join and aggregate
import itertools

# get distances between all valid combinations
dfd = (pd.DataFrame(list(itertools.combinations(df.index, 2)))
       .merge(df, left_on=0, right_index=True)
       .merge(df, left_on=1, right_index=True, suffixes=("_0", "_1"))
       .assign(distance=lambda dfa: dfa.apply(lambda r: gd.distance((r.latitude_0, r.longitude_0),
                                                                    (r.latitude_1, r.longitude_1)).miles, axis=1))
       .loc[:, [0, 1, "distance"]]
       # filter down to close proximities
       .query("distance <= 0.05")
       )
# build all valid permutations of close-by combinations
dfnppl = (pd.DataFrame(itertools.permutations(pd.concat([dfd[0], dfd[1]]).unique(), 2))
          .merge(df.loc[:, "people"], left_on=1, right_index=True)
          )
# bring it all together
df = (df.reset_index().rename(columns={"index": 0}).merge(dfnppl, on=0, suffixes=("", "_near"), how="left")
      .groupby(0).agg({**{c: "first" for c in df.columns}, **{"people_near": "sum"}})
      )
|   0 |   latitude |   longitude |   people |   people_near |
|----:|-----------:|------------:|---------:|--------------:|
|   0 |    40.9919 |    -106.049 |        1 |             5 |
|   1 |    40.992  |    -106.049 |        2 |             4 |
|   2 |    40.9916 |    -106.049 |        3 |             3 |
|   3 |    40.9899 |    -106.05  |        4 |             0 |
|   4 |    40.9878 |    -106.049 |        5 |             0 |

Pandas dataframe

I have a dataframe df with two columns, 'voltage' (v) and 'current' (I). I want to randomly select 5 values of 'voltage' from the file, save them in a 1D array like [v1, v2, v3, v4, v5], and save the corresponding values of current in another 1D array like [I1, I2, ..., I5]. Here is what I tried:
df=pd.read_csv(file,sep=",",header=None,usecols=[0,1],names=['voltage','current'])
#pick 5 random values of voltage and save it in np array
V= np.array( df['voltage'].sample(n=5))
How to do the same with the corresponding values of I at selected values of V?
I think you need:
arr = df.sample(n=5).values
a = arr[:, 0]
b = arr[:, 1]
While jezrael's answer does provide the desired output, the answer to your question would be:
V= df['voltage'].sample(n=5)
I = df.loc[V.index,'current']
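For completeness, a small sketch (not from either answer) that keeps the two arrays aligned by sampling the rows once and splitting afterwards:
sample = df.sample(n=5)
V = sample['voltage'].to_numpy()
I = sample['current'].to_numpy()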