Transforming a dataframe built from a dict of dicts into a specific format - pandas

I have this df dataset:
df = pd.DataFrame({'train': {'auc': [0.432, 0.543, 0.523],
'logloss': [0.123, 0.234, 0.345]},
'test': {'auc': [0.456, 0.567, 0.678],
'logloss': [0.321, 0.432, 0.543]}})
Where I'm trying to transform it into a table with one row per epoch and one column per dataset/metric pair, like this:
epochs  train_auc  test_auc  train_logloss  test_logloss
1       0.432      0.456     0.123          0.321
2       0.543      0.567     0.234          0.432
3       0.523      0.678     0.345          0.543
And also considering that:
epochs always have the same order for every cell, but instead of only 3 epochs, there could be 1,000 or 10,000.
The column names and row labels could change. For example, another day the data could have f1 instead of logloss, or val instead of train. But no matter the names, in df each row will always be a metric name and each column will always be a dataset name.
The number of columns and rows in df could change too. Some models have 5 datasets and 7 metrics, for example (which would give a df with 5 columns and 7 rows).
The column names of the output table should be datasetname_metricname.
So I'm trying to build a generic transformation, avoiding brute-force, hard-coded steps. Just in case it's helpful, the df source is:
df = pd.DataFrame(model_xgb.evals_result())
df.columns = ['train', 'test'] # This is the line that can change (and the metrics inside `model_xgb`)
Where model_xgb = xgboost.XGBClassifier(..), after model_xgb.fit(..) has been called.

Here's a generic way to get the result you've specified, irrespective of the number of epochs or the number or labels of rows and columns:
df2 = df.stack().apply(pd.Series)
df2.index = ['_'.join(reversed(x)) for x in df2.index]
df2 = df2.T.assign(epochs=range(1, len(df2.columns) + 1)).set_index('epochs').reset_index()
Output:
epochs train_auc test_auc train_logloss test_logloss
0 1 0.432 0.456 0.123 0.321
1 2 0.543 0.567 0.234 0.432
2 3 0.523 0.678 0.345 0.543
Explanation:
Use stack() to convert the input dataframe to a series (of lists) with a multiindex that matches the desired column sequence in the question
Use apply(pd.Series) to convert the series of lists to a dataframe with each list converted to a row and with column count equal to the uniform length of the list values in the input series (in other words, equal to the number of epochs)
Create the desired column labels by joining each reversed MultiIndex tuple (giving datasetname_metricname) with _ as a separator, then use T to transpose the dataframe so these index labels (which are the desired column labels) become column labels
Use assign() to add a column named epochs enumerating the epochs beginning with 1
Use set_index() followed by reset_index() to make epochs the leftmost column.
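For reuse, the same steps can be wrapped in a small helper; a minimal sketch (the function name is just illustrative):
import pandas as pd

def evals_to_wide(df):
    # one row per (metric, dataset) pair, one column per epoch
    wide = df.stack().apply(pd.Series)
    # build datasetname_metricname labels from the (metric, dataset) MultiIndex
    wide.index = ['_'.join(reversed(ix)) for ix in wide.index]
    # transpose so epochs run down the rows, then enumerate them from 1
    wide = wide.T
    wide.insert(0, 'epochs', list(range(1, len(wide) + 1)))
    return wide.reset_index(drop=True)
Calling evals_to_wide(df) on the example input should reproduce df2 for any number of datasets, metrics, or epochs.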

Try this:
df = pd.DataFrame({'train': {'auc': [0.432, 0.543, 0.523],
'logloss': [0.123, 0.234, 0.345]},
'test': {'auc': [0.456, 0.567, 0.678],
'logloss': [0.321, 0.432, 0.543]}})
de = df.explode(['train', 'test'])  # one row per (metric, epoch) value
df_out = de.set_index(de.groupby(level=0).cumcount() + 1, append=True).unstack(0)  # epoch number as index, metric level into the columns
df_out.columns = df_out.columns.map('_'.join)  # flatten to datasetname_metricname
df_out = df_out.reset_index().rename(columns={'index': 'epochs'})
print(df_out)
Output:
epochs train_auc train_logloss test_auc test_logloss
0 1 0.432 0.123 0.456 0.321
1 2 0.543 0.234 0.567 0.432
2 3 0.523 0.345 0.678 0.543
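If the dataset columns aren't known in advance, the explode call can be made generic by passing all columns explicitly (multi-column explode needs pandas >= 1.3); the rest of the answer stays the same:
de = df.explode(list(df.columns))  # explode every dataset column, whatever it is named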

Related

Reshape a DataFrame based on column value, and pad missing slices with zeros

I have a Pandas DataFrame which looks like:
ID   order   other_column_1   other_column_x
A    0       10               20
A    1       11               21
A    2       12               22
B    0       31               41
B    2       33               43
I want to reshape it to a 3D matrix with shape (#IDs, #order, #other columns). For the example above, it should be of shape (2, 3, 2).
The order column holds the order of the 2nd dimension, so slice ['A', 0, :] should be [10, 20] and ['A', 1, :] [11, 21] etc. The values of order are identical for all ID (0, 1, 2 in this case).
Trouble is, sometimes a slice is missing: e.g. for 'B', the slice with order '1' is missing, and I want to pad it with a slice of all 0's to keep the shape consistent.
I thought of pre-sorting the whole DataFrame by ID and order, looping over each ID, inserting the missing slices, and stacking them together. However, the DataFrame is huge, so I'd like to avoid a global sort and loop if possible.
I came up with a way to do it (if you have enough PC memory to allocate) where you don't have to loop over the whole dataframe, although I couldn't test it with 10M rows because of memory allocation. I tested it with 5M rows by 300 columns and I will show the results at the end of the answer.
The idea is to get all the combinations of the unique values of the first 2 columns as an index to build the first 2 dimensions of the 3D array.
After that you can merge the original dataframe with the dataframe containing index combinations to then fill all the missing values with 0.
Once the data is complete you can pass it to numpy and reshape it in 3 dimensions.
Code without comments:
# df = original dataframe
d1 = df.ID.unique()
d2 = df.order.unique()
df3 = pd.MultiIndex.from_product((d1, d2), names=['ID', 'order'])\
.to_frame().reset_index(drop=True)\
.merge(df, on=['ID', 'order'], how='left')\
.fillna(0)
np_3d_array = df3[df3.columns[2:]].to_numpy().reshape(d1.shape[0], d2.shape[0], df.columns[2:].shape[0])
Code with comments:
# df = original dataframe
# Get unique id for 1st dimension
d1 = df.ID.unique()
# Get unique order fpr 2nd dimension
d2 = df.order.unique()
# Get complete DF
df3 = (pd.MultiIndex.from_product((d1, d2), names=['ID', 'order'])  # all (ID, order) combinations of the 1st and 2nd dimensions as an index
       .to_frame().reset_index(drop=True)                           # get a dataframe from the MultiIndex and reset the index
       .merge(df, on=['ID', 'order'], how='left')                   # merge the complete dimensions with the original values
       .fillna(0))                                                  # fill missing values with 0
# get complete data as 2D array and reshape as 3D array
np_3d_array = df3[df3.columns[2:]].to_numpy().reshape(d1.shape[0], d2.shape[0], df.columns[2:].shape[0])
Test:
First I tried to test with 10M rows but I could not allocate the memory needed for that.
To test the code I created a dataframe with 6M rows x 300 columns (random float numbers) and dropped 1M rows to simulate the missing values.
Here is the code I used to test and the results.
Test code:
import random
import time
import pandas as pd
import numpy as np
# 100000 diff. ID and 60 diff. order
df_test = (pd.MultiIndex.from_product((range(100000), range(60)), names=['ID', 'order'])
           .to_frame().reset_index(drop=True)
           .drop(random.sample(range(6_000_000), k=1_000_000))  # drop 1M rows to simulate missing rows
           .reset_index(drop=True))
# 5M rows random data by 298 columns
df_test2 = pd.DataFrame(np.random.random(size=(5_000_000, 298)))
df = df_test.merge(df_test2, left_index=True, right_index=True)
start = time.time()
d1 = df.ID.unique()
print(f'time 1st Dimension: {round(time.time()-start, 3)}')
d2 = df.order.unique()
print(f'time 2nd Dimension: {round(time.time()-start, 3)}')
df3 = pd.MultiIndex.from_product((d1, d2), names=['ID', 'order'])\
.to_frame().reset_index(drop=True)\
.merge(df, on=['ID', 'order'], how='left').fillna(0)
print(f'time merge: {round(time.time()-start, 3)}')
np_3d_array = df3[df3.columns[2:]].to_numpy().reshape(d1.shape[0], d2.shape[0], df.columns[2:].shape[0])
print(f'time ndarray: {round(time.time()-start, 3)}')
print(f'array shape: {np_3d_array.shape}')
print(f'array type: {type(np_3d_array)}')
Test Results:
time 1st Dimension: 0.035
time 2nd Dimension: 0.063
time merge: 47.202
time ndarray: 49.441
array shape: (100000, 60, 298)
array type: <class 'numpy.ndarray'>
ids = df.ID.unique()
orders = df.order.unique()
ar = (df.set_index(['ID', 'order'])                          # index by the two dimensions
      .reindex(pd.MultiIndex.from_product((ids, orders)))    # add the missing (ID, order) combinations as NaN rows
      .fillna(0)                                             # pad the missing slices with zeros
      .to_numpy()
      .reshape(len(ids), len(orders), len(df.columns[2:])))  # (#IDs, #orders, #other columns)
print(ar)
print(ar.shape)
Output:
[[[10. 20.]
[11. 21.]
[12. 22.]]
[[31. 41.]
[ 0. 0.]
[33. 43.]]]
(2, 3, 2)
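As a side note, reindex can do the zero padding itself via its fill_value argument, which skips the separate fillna step (assuming the original data contains no other NaNs):
ar = (df.set_index(['ID', 'order'])
      .reindex(pd.MultiIndex.from_product((ids, orders)), fill_value=0)
      .to_numpy()
      .reshape(len(ids), len(orders), df.shape[1] - 2))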

How can I create a column from 2 related columns of lists in Python?

sampleID   testnames                       results
23939332   [32131, 34343, 35566]           [NEGATIVE, 0.234, 3.331]
32332323   [34343, 96958, 39550, 88088]    [0.312, 0.008, 0.1, 0.2]
The table above is what I have, and the one below is what I want to achieve:
sampleID   32131      34343   39550   88088   96958   35566
23939332   NEGATIVE   0.234   NaN     NaN     NaN     3.331
32332323   NaN        0.312   0.1     0.2     0.008   NaN
So I need to create columns of unique values from the testnames column and fill the cells with the corresponding values from the results column.
Consider this a sample from a very large dataset (table).
Here is a commented solution:
(df.set_index(['sampleID'])              # keep sampleID out of the expansion
 .apply(pd.Series.explode)               # expand testnames and results
 .reset_index()                          # reset the index
 .groupby(['sampleID', 'testnames'])     # group by the future index and columns
 .first()                                # set the expected shape
 .unstack())                             # pivot testnames into columns
It gives the result you expected, though with a different column order:
results
testnames 32131 34343 35566 39550 88088 96958
sampleID
23939332 NEGATIVE 0.234 3.331 NaN NaN NaN
32332323 NaN 0.312 NaN 0.1 0.2 0.008
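On pandas >= 1.3, where explode accepts multiple columns, essentially the same reshape can be sketched with pivot (use pivot_table(..., aggfunc='first') instead if a sampleID can repeat a testname):
exploded = df.explode(['testnames', 'results'])  # one row per (sampleID, testname, result)
wide = exploded.pivot(index='sampleID', columns='testnames', values='results')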
Let's see how it does on generated data:
def build_df(n_samples, n_tests_per_sample, n_test_types):
    df = pd.DataFrame(columns=['sampleID', 'testnames', 'results'])
    test_types = np.random.choice(range(0, 100000), size=n_test_types, replace=False)
    for i in range(n_samples):
        testnames = list(np.random.choice(test_types, size=n_tests_per_sample))
        results = list(np.random.random(size=n_tests_per_sample))
        # note: DataFrame.append was removed in pandas 2.0; collecting the dicts in a
        # list and building the frame once at the end is both compatible and much faster
        df = df.append({'sampleID': i, 'testnames': testnames, 'results': results}, ignore_index=True)
    return df
def reshape(df):
    df2 = (df.set_index(['sampleID'])          # keep the sampleID out of the expansion
           .apply(pd.Series.explode)           # expand testnames and results
           .reset_index()                      # reset the index
           .groupby(['sampleID', 'testnames']) # group by the future index and columns
           .first()                            # set the expected shape
           .unstack())
    return df2
%time df = build_df(60000, 10, 100)
# Wall time: 9min 48s (yes, it was ugly)
%time df2 = reshape(df)
# Wall time: 1.01 s
reshape() breaks when n_test_types becomes too large, with ValueError: Unstacked DataFrame is too big, causing int32 overflow.

My Dataframe contains 500 columns, but I only want to pick out 27 columns in a new Dataframe. How do I do that?

My Dataframe contains 500 columns, but I only want to pick out 27 columns in a new Dataframe.
How do I do that?
I used query(), but got:
TypeError: query() takes from 2 to 3 positional arguments but 27 were given
If you want to select the columns based on their name, you can do the following:
df_new = df[["colA", "colB", "colC", ...]]
or use the "filter" function:
df_new = df.filter(["colA", "colB", "colC", ..])
In case that your column selection is based on the index of columns:
df_new = df.iloc[:, 0:27] # if columns are consecutive
df_new = df.iloc[:, [0,2,10,..]] # if columns are not consecutive (the numbers refer to the column indices)
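If the 27 columns share a naming pattern, df.filter can also select them by substring or regular expression (the patterns below are purely illustrative):
df_new = df.filter(like="col")            # every column whose name contains "col"
df_new = df.filter(regex=r"^col[A-C]$")   # only colA, colB and colC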

Series.replace cannot use dict-like to_replace and non-None value [duplicate]

I've got a pandas DataFrame filled mostly with real numbers, but there are a few NaN values in it as well.
How can I replace the nans with averages of columns where they are?
This question is very similar to this one: numpy array: replace nan values with average of columns but, unfortunately, the solution given there doesn't work for a pandas DataFrame.
You can simply use DataFrame.fillna to fill the NaNs directly:
In [27]: df
Out[27]:
A B C
0 -0.166919 0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 NaN -2.027325 1.533582
4 NaN NaN 0.461821
5 -0.788073 NaN NaN
6 -0.916080 -0.612343 NaN
7 -0.887858 1.033826 NaN
8 1.948430 1.025011 -2.982224
9 0.019698 -0.795876 -0.046431
In [28]: df.mean()
Out[28]:
A -0.151121
B -0.231291
C -0.530307
dtype: float64
In [29]: df.fillna(df.mean())
Out[29]:
A B C
0 -0.166919 0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 -0.151121 -2.027325 1.533582
4 -0.151121 -0.231291 0.461821
5 -0.788073 -0.231291 -0.530307
6 -0.916080 -0.612343 -0.530307
7 -0.887858 1.033826 -0.530307
8 1.948430 1.025011 -2.982224
9 0.019698 -0.795876 -0.046431
The docstring of fillna says that value should be a scalar or a dict, however, it seems to work with a Series as well. If you want to pass a dict, you could use df.mean().to_dict().
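For completeness, the dict variant mentioned above is a one-liner as well:
df.fillna(df.mean().to_dict())  # same fill values, passed as a column -> mean mapping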
Try:
sub2['income'].fillna((sub2['income'].mean()), inplace=True)
In [16]: df = DataFrame(np.random.randn(10,3))
In [17]: df.iloc[3:5,0] = np.nan
In [18]: df.iloc[4:6,1] = np.nan
In [19]: df.iloc[5:8,2] = np.nan
In [20]: df
Out[20]:
0 1 2
0 1.148272 0.227366 -2.368136
1 -0.820823 1.071471 -0.784713
2 0.157913 0.602857 0.665034
3 NaN -0.985188 -0.324136
4 NaN NaN 0.238512
5 0.769657 NaN NaN
6 0.141951 0.326064 NaN
7 -1.694475 -0.523440 NaN
8 0.352556 -0.551487 -1.639298
9 -2.067324 -0.492617 -1.675794
In [22]: df.mean()
Out[22]:
0 -0.251534
1 -0.040622
2 -0.841219
dtype: float64
Apply, per column, the mean of that column and fill:
In [23]: df.apply(lambda x: x.fillna(x.mean()),axis=0)
Out[23]:
0 1 2
0 1.148272 0.227366 -2.368136
1 -0.820823 1.071471 -0.784713
2 0.157913 0.602857 0.665034
3 -0.251534 -0.985188 -0.324136
4 -0.251534 -0.040622 0.238512
5 0.769657 -0.040622 -0.841219
6 0.141951 0.326064 -0.841219
7 -1.694475 -0.523440 -0.841219
8 0.352556 -0.551487 -1.639298
9 -2.067324 -0.492617 -1.675794
Although the code below does the job, its performance takes a big hit when you deal with a DataFrame of 100k records or more:
df.fillna(df.mean())
In my experience, one should replace NaN values (be it with the mean or the median) only where it is required, rather than applying fillna() all over the DataFrame.
I had a DataFrame with 20 variables, and only 4 of them required NaN treatment (replacement). I tried the above code (Code 1), along with a slightly modified version of it (Code 2), where I ran it selectively, i.e. only on the variables which had NaN values.
#------------------------------------------------
#----(Code 1) Treatment on overall DataFrame-----
df.fillna(df.mean())
#------------------------------------------------
#----(Code 2) Selective Treatment----------------
for i in df.columns[df.isnull().any(axis=0)]:  # apply only on columns with NaN values
    df[i].fillna(df[i].mean(), inplace=True)
#---df.isnull().any(axis=0) gives a True/False flag (a Boolean Series),
#---which, when applied on df.columns[], identifies the columns with NaN values
Below is the performance I observed as I kept increasing the number of records in the DataFrame.
DataFrame with ~100k records
Code 1: 22.06 Seconds
Code 2: 0.03 Seconds
DataFrame with ~200k records
Code 1: 180.06 Seconds
Code 2: 0.06 Seconds
DataFrame with ~1.6 Million records
Code 1: code kept running endlessly
Code 2: 0.40 Seconds
DataFrame with ~13 Million records
Code 1: --did not even try, after seeing performance on 1.6 Mn records--
Code 2: 3.20 Seconds
Apologies for the long answer! Hope this helps!
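A middle ground that stays vectorized while only touching the affected columns might look like this (a sketch, not benchmarked here):
nan_cols = df.columns[df.isnull().any(axis=0)]           # only the columns that actually contain NaN
df[nan_cols] = df[nan_cols].fillna(df[nan_cols].mean())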
If you want to impute missing values with the mean, going column by column, then this will only impute with the mean of that column. It might be a little more readable.
sub2['income'] = sub2['income'].fillna((sub2['income'].mean()))
import numpy as np
import pandas as pd
# To read data from csv file
Dataset = pd.read_csv('Data.csv')
X = Dataset.iloc[:, :-1].values
# To calculate mean use imputer class
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
Directly use df.fillna(df.mean()) to fill all the null values with the column means.
If you want to fill the null values of a single column with the mean of that column, you can assign the result back to it. Suppose the column name is Item_Weight; here we fill the null values of df['Item_Weight'] with its mean:
df['Item_Weight'] = df['Item_Weight'].fillna((df['Item_Weight'].mean()))
If you want to fill null values with some string instead (here Outlet_Size is the column name):
df.Outlet_Size = df.Outlet_Size.fillna('Missing')
Pandas: How to replace NaN (nan) values with the average (mean), median or other statistics of one column
Say your DataFrame is df and you have one column called nr_items. This is: df['nr_items']
If you want to replace the NaN values of your column df['nr_items'] with the mean of the column:
Use method .fillna():
mean_value=df['nr_items'].mean()
df['nr_item_ave']=df['nr_items'].fillna(mean_value)
I have created a new df column called nr_item_ave to store the new column with the NaN values replaced by the mean value of the column.
You should be careful when using the mean: if you have outliers, it is more advisable to use the median.
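For example, the same pattern with the median instead of the mean (reusing the nr_items column from above) would be:
median_value = df['nr_items'].median()
df['nr_item_ave'] = df['nr_items'].fillna(median_value)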
Another option besides those above is:
df = df.groupby(df.columns, axis = 1).transform(lambda x: x.fillna(x.mean()))
It's less elegant than previous responses for mean, but it could be shorter if you desire to replace nulls by some other column function.
Using the sklearn library's SimpleImputer class:
from sklearn.impute import SimpleImputer
missingvalues = SimpleImputer(missing_values=np.nan, strategy='mean')  # SimpleImputer always works column-wise; it has no axis argument
missingvalues = missingvalues.fit(x[:, 1:3])
x[:, 1:3] = missingvalues.transform(x[:, 1:3])
Note: in recent versions the missing_values parameter changed to np.nan from 'NaN'.
I use this method to fill missing values with the average of a column:
fill_mean = lambda col : col.fillna(col.mean())
df = df.apply(fill_mean, axis = 0)
You can also use value_counts to get the most frequent values. This would work on different datatypes.
df = df.apply(lambda x:x.fillna(x.value_counts().index[0]))
Here is the value_counts api reference.
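A closely related one-liner, assuming no column is entirely NaN, uses DataFrame.mode, whose first row holds the most frequent value per column:
df = df.fillna(df.mode().iloc[0])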

Create a sequence row from a filtered series

I am trying to create a row that has columns from t0 to t(n).
I have a complete data frame (df) that stores the full set of data, and a data series (df_t) of specific time markers I am interested in.
What I want is to create a row that has the time marker as t0 and then the previous [sequence_length] rows from the complete data frame.
def t_data(df, df_t, col_names, sequence_length):
    df_ret = pd.DataFrame()
    for i in range(sequence_length):
        col_names_seq = [col_name + "_" + str(i) for col_name in col_names]
        df_ret[col_names_seq] = df[df.shift(i)["time"].isin(df_t)][col_names]
    return df_ret
Running:
t_data(df, df_t, ["close"], 3)
I get:
close_0 close_1 close_2
1110 1.32080 NaN NaN
2316 1.30490 NaN NaN
2549 1.30290 NaN NaN
The obvious line at issue is:
df[df.shift(i)["time"].isin(df_t)][col_names]
I have tried several ways but can't seem to select data surrounding a subset.
Sample (df):
time open close high low volume EMA21 EMA13 EMA9
20 2005-01-10 04:10:00 1.3071 1.3074 1.3075 1.3070 32.0 1.306624 1.306790 1.306887
21 2005-01-10 04:15:00 1.3074 1.3073 1.3075 1.3073 16.0 1.306685 1.306863 1.306969
22 2005-01-10 04:20:00 1.3073 1.3072 1.3074 1.3072 35.0 1.306732 1.306911 1.307015
Sample (df_t):
1110 2005-01-13 23:00:00
2316 2005-01-18 03:30:00
2549 2005-01-18 22:55:00
Name: time, dtype: datetime64[ns]
I don’t have data but hope this drawing helps:
# assumes import pandas as pd and import numpy as np
def t_data(df, df_T, n):
    # Get the indexes of the original df that match the values of df_T
    indexs = df.reset_index().merge(df_T, how="inner")['index'].tolist()
    # Create a new index list where we will store the index - n values
    newIndex = []
    # Create a list of values to subtract from each index
    toSub = np.arange(n)
    # Loop over the index values, subtract, and append to newIndex
    for i in indexs:
        for sub in toSub:
            v = i - sub
            newIndex.append(v)
    # Use iloc to get all the rows of the original df at the newIndex positions
    closedCosts = df.iloc[newIndex].reset_index(drop=True)["close"].values
    # Concat our data back to df_T (index reset so the rows line up positionally),
    # reshaping closedCosts into n columns
    df_final = pd.concat([df_T.reset_index(drop=True), pd.DataFrame(closedCosts.reshape(-1, n))], axis=1)
    # Return the final df
    return df_final
This should do what you're asking for. The easiest way is to figure out all the indexes that you would want from the original df, along with their corresponding closing values. Note: you will have to rename the columns after this, but all the values are there.
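With the sample df and df_t from the question, a call would look like this (column names in the result still need renaming, as noted above):
df_final = t_data(df, df_t, 3)  # one row per time marker, with its 3 most recent "close" values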