Reshape pandas dataframe and work with columns - pandas

I have this dataset:
dat = {'Block': ['blk_-105450231192318816', 'blk_-1076549517733373559', 'blk_-1187723472581877455', 'blk_-1385756122847916710', 'blk_-1470784088028862059'], 'Seq': ['13 13 13 15',' 15 13 13', '13 13 15', '13 13 15 13', '13'], 'Time' : ['1257712532.0 1257712532.0 1257712532.0 1257712532.0','1257712533.0 1257712534.0 1257712534.0','1257712533.0 1257712533.0 1257712533.0','1257712532.0 1257712532.0 1257712532.0 1257712534.0','1257712535.0']}
df = pd.DataFrame(data = dat)
Block is an id. Seq is an id. Time is a timestamp in Unix format.
I want to change columns or create new columns.
1) I need to join the Seq and Time columns by the index of the elements in the two columns.
2) After that I want to get the delta of the Time column (next element - previous element), with the first element set to zero.
And in the end, write to a file the rows from different blocks that have the same Seq-id.
I want to solve this problem with pandas methods.
I tried to solve it with dictionaries, but that way turned out to be complicated.
seq = df.Seq
dict_block = dict((key, []) for key in np.unique(df.Block))
for idx, row in enumerate(seq):
    block = df.Block[idx]
    dict_seq = dict((key, []) for key in np.unique(row.split(' ')))
    for idy, key in enumerate(row.split(' ')):
        item = df.Time[idx].split(' ')[idy]
        dict_seq[key].append(item)
    dict_block[block].append(dict_seq)
1) For example:
blk_-105450231192318816 :
13: 1257712532.0, 1257712532.0, 1257712532.0
15: 1257712532.0
2) For example:
blk_-105450231192318816 :
13: 0, (1257712532.0 - 1257712532.0) = 0, (1257712532.0 - 1257712532.0) = 0
15: 0
Output for dictionary try:
{'blk_-105450231192318816':
[{'13': ['1257712532.0', '1257712532.0','1257712532.0'],
'15': ['1257712532.0']}],
'blk_-1076549517733373559':
[{'13': ['1257712534.0', '1257712534.0'],
'15': ['1257712533.0']}],
'blk_-1187723472581877455':
[{'13': ['1257712533.0', '1257712533.0'],
'15': ['1257712533.0']}],
'blk_-1385756122847916710':
[{'13': ['1257712532.0',
'1257712532.0',
'1257712534.0'],
'15': ['1257712532.0']}],
'blk_-1470784088028862059':
[{'13': ['1257712535.0']}]}
Summary:
I want to solve the following points with pandas/numpy methods:
1) Group the columns
2) Get the delta of time (t1 - t0)
Waiting for your comment :)

Solution 1: Working with dicts
If you prefer working with dictionaries, you can use apply and custom methods where you do your tricks with the dictionaries.
df is the sample dataframe you provided. Here I've made two methods. I hope the code is clear enough to be understandable.
def grouping(x):
    """Make a dictionary combining the 'Seq' and 'Time' columns.
    'Seq' elements are the keys, 'Time' elements are the values. 'Time' elements
    corresponding to the same key are stored in a list.
    """
    # splitting the strings and making them numeric
    keys = list(map(int, x['Seq'].split()))
    times = list(map(float, x['Time'].split()))
    # building the result dictionary
    res = {}
    for i, k in enumerate(keys):
        try:
            res[k].append(times[i])
        except KeyError:
            res[k] = [times[i]]
    return res

def timediffs(x):
    """Make a dictionary starting from the 'GroupedSeq' column, which can
    be created with the grouping function.
    It contains the differences between consecutive times of each key.
    """
    ddt = x['GroupedSeq']
    res = {}
    # iterating over the dictionary to calculate the differences
    for k, v in ddt.items():
        res[k] = [0.0] + [t1 - t0 for t0, t1 in zip(v[:-1], v[1:])]
    return res

df['GroupedSeq'] = df.apply(grouping, axis=1)
df['difftimes'] = df.apply(timediffs, axis=1)
What apply does here is to apply the function to each row. The result is stored in a new column of the dataframe. Now df contains two new columns; you can drop the original 'Seq' and 'Time' columns if you wish by doing: df.drop(['Seq', 'Time'], axis=1, inplace=True). In the end, df looks like:
Block GroupedSeq difftimes
0 blk_-105450231192318816 {13: [1257712532.0, 1257712532.0, 1257712532.0... {13: [0.0, 0.0, 0.0], 15: [0.0]}
1 blk_-1076549517733373559 {15: [1257712533.0], 13: [1257712534.0, 125771... {15: [0.0], 13: [0.0, 0.0]}
2 blk_-1187723472581877455 {13: [1257712533.0, 1257712533.0], 15: [125771... {13: [0.0, 0.0], 15: [0.0]}
3 blk_-1385756122847916710 {13: [1257712532.0, 1257712532.0, 1257712534.0... {13: [0.0, 0.0, 2.0], 15: [0.0]}
4 blk_-1470784088028862059 {13: [1257712535.0]} {13: [0.0]}
As you can see, here pandas itself is used only to apply the custom methods, but inside those methods there is normal python code at work.
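For instance, a small usage sketch based on the output above, pulling the grouped times and deltas back out for one block (the dict keys are ints because of the map(int, ...) step):
row = df.loc[df['Block'] == 'blk_-105450231192318816'].iloc[0]
print(row['GroupedSeq'][13])   # [1257712532.0, 1257712532.0, 1257712532.0]
print(row['difftimes'][13])    # [0.0, 0.0, 0.0]
print(row['difftimes'][15])    # [0.0]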
Solution 2: No dictionaries, more Pandas
Pandas itself is not very useful if you are storing lists or dicts in the dataframe, so I propose an alternative solution without dictionaries. I use groupby in combination with apply to perform operations on selected rows based on their values.
groupby selects a subsample of the dataframe based on the values of one or more columns: all rows with the same values in those columns are grouped, and a method or action is performed on this subsample.
Again, df is the sample dataframe you provided.
df1 = df.copy()  # working on a copy, not really needed, but I wanted to preserve the original
# splitting the strings and making them numeric lists using apply
df1['Seq'] = df1['Seq'].apply(lambda x: list(map(int, x.split())))
df1['Time'] = df1['Time'].apply(lambda x: list(map(float, x.split())))
# for each index in 'Block', unnest the list in 'Seq', making it a secondary index
df2 = df1.groupby('Block').apply(lambda x: pd.DataFrame([[e] for e in x['Time'].iloc[0]], index=x['Seq'].tolist()))
# resetting the index and renaming the column names created by pandas
df2 = df2.reset_index().rename(columns={'level_1': 'Seq', 0: 'Time'})

# custom method to store the differences between times
def timediffs(x):
    x['tdiff'] = x['Time'].diff().fillna(0.0)
    return x

df3 = df2.groupby(['Block', 'Seq']).apply(timediffs)
The final df3 is:
Block Seq Time tdiff
0 blk_-105450231192318816 13 1.257713e+09 0.0
1 blk_-105450231192318816 13 1.257713e+09 0.0
2 blk_-105450231192318816 13 1.257713e+09 0.0
3 blk_-105450231192318816 15 1.257713e+09 0.0
4 blk_-1076549517733373559 15 1.257713e+09 0.0
5 blk_-1076549517733373559 13 1.257713e+09 0.0
6 blk_-1076549517733373559 13 1.257713e+09 0.0
7 blk_-1187723472581877455 13 1.257713e+09 0.0
8 blk_-1187723472581877455 13 1.257713e+09 0.0
9 blk_-1187723472581877455 15 1.257713e+09 0.0
10 blk_-1385756122847916710 13 1.257713e+09 0.0
11 blk_-1385756122847916710 13 1.257713e+09 0.0
12 blk_-1385756122847916710 15 1.257713e+09 0.0
13 blk_-1385756122847916710 13 1.257713e+09 2.0
14 blk_-1470784088028862059 13 1.257713e+09 0.0
As you can see, no dictionaries inside the dataframe. You have repetitions in columns 'Block' and 'Seq', but that's unavoidable.
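As for the last point of the question (writing rows from different blocks that share a Seq-id to a file), this long format makes it straightforward. A minimal sketch, assuming one CSV per Seq-id is acceptable (the file naming is made up):
# one file per Seq-id, each containing the matching rows from all blocks
for seq_id, grp in df3.groupby('Seq'):
    grp.to_csv('seq_{}.csv'.format(seq_id), index=False)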

Related

How to Exclude NaNs from Pandas Rolling, but not return NaN if there is one in the DataFrame

Currently I have the DataFrame seen below and I want to do a rolling average over the last 10 occurrences that have actual values, but to skip the NaNs
[Example DataFrame image]
The issue is that if I run df['AST_Hit'].rolling(10).mean(skipna=True).shift(1) I get the DataFrame below, which is not what I am looking for
[Example output DataFrame image]
I've tried using window and min_periods but that does not give me what I want, as I don't want the average over anything greater than 10.
Ideally I would like the DataFrame to be able to discard a NaN, but still check whether there are 10 values in that selection. From what I am describing I think I need some sort of max period equal to 10 as well as the min period equal to 10, but I could not find anything in the pandas documentation about setting a max period for rolling.
Maybe it would also be best if I just dropped any NaN rows. My DataFrame is much bigger than what is seen, so it isn't just those 3 rows that contain a NaN, but it may be the best course of action
Any help or tips is greatly appreciated.
This might help you:
import pandas as pd
import numpy as np
# create a sample DataFrame with non-numeric columns
np.random.seed(123)
df = pd.DataFrame({
    'Date': pd.date_range(start='2022-01-01', periods=100),
    'AST_Hit': np.random.randint(0, 10, size=100),
    'Other_Column': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'] * 10
})
df.iloc[3:6, 1] = np.nan
df.iloc[7, 0] = np.nan
df.iloc[10:15, 2] = np.nan
df.iloc[20:25, 1] = np.nan
df.iloc[30:40, 2] = np.nan
# compute rolling average over the last 10 non-null values
rolling_mask = df['AST_Hit'].notnull().rolling(window=10, min_periods=1).sum().eq(10)
result = df['AST_Hit'].rolling(window=10, min_periods=1).apply(lambda x: np.mean(x[rolling_mask]))
print(result)
which gives
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
...
95 5.6
96 5.1
97 4.7
98 4.2
99 3.9
Name: AST_Hit, Length: 100, dtype: float64
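If the goal is strictly a mean over the 10 most recent non-NaN values, a simpler sketch (assuming the NaN rows can just be skipped, as the question itself suggests) is to drop them before rolling and realign afterwards:
# roll over non-NaN values only, then realign to the original index;
# rows that were NaN stay NaN in the result
result_alt = (df['AST_Hit'].dropna()
              .rolling(window=10).mean()
              .reindex(df.index))
print(result_alt)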

Transforming a dataframe of dict of dict specific format

I have this df dataset:
df = pd.DataFrame({'train': {'auc': [0.432, 0.543, 0.523],
                             'logloss': [0.123, 0.234, 0.345]},
                   'test': {'auc': [0.456, 0.567, 0.678],
                            'logloss': [0.321, 0.432, 0.543]}})
Where I'm trying to transform it into this:
And also considering that:
Epochs always have the same order for every cell, but instead of only 3 epochs, there could be 1,000 or 10,000.
The column names and axes could change. For example, another day the data could have f1 instead of logloss, or val instead of train. But no matter the names, in df each row will always be a metric name, and each column will always be a dataset name.
The number of columns and rows in df could change too. There are some models with 5 datasets, and 7 metrics for example (which would give a df with 5 columns and 7 rows)
The column names of the output table should be datasetname_metricname.
So I'm trying to build a generic code transformation while avoiding brute-force transformations. In case it's helpful, the df source is:
df = pd.DataFrame(model_xgb.evals_result())
df.columns = ['train', 'test'] # This is the line that can change (and the metrics inside `model_xgb`)
where model_xgb = xgboost.XGBClassifier(..), after calling model_xgb.fit(..).
Here's a generic way to get the result you've specified, irrespective of the number of epochs or the number or labels of rows and columns:
df2 = df.stack().apply(pd.Series)
df2.index = ['_'.join(reversed(x)) for x in df2.index]
df2 = df2.T.assign(epochs=range(1, len(df2.columns) + 1)).set_index('epochs').reset_index()
Output:
epochs train_auc test_auc train_logloss test_logloss
0 1 0.432 0.456 0.123 0.321
1 2 0.543 0.567 0.234 0.432
2 3 0.523 0.678 0.345 0.543
Explanation:
Use stack() to convert the input dataframe to a series (of lists) with a multiindex that matches the desired column sequence in the question
Use apply(pd.Series) to convert the series of lists to a dataframe with each list converted to a row and with column count equal to the uniform length of the list values in the input series (in other words, equal to the number of epochs)
Create the desired column labels by joining each multiindex entry (reversed) with _ as a separator, then use T to transpose the dataframe so these index labels (which are the desired column labels) become column labels
Use assign() to add a column named epochs enumerating the epochs beginning with 1
Use set_index() followed by reset_index() to make epochs the leftmost column.
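To illustrate the genericity claim, the same three lines should work unchanged with other labels; a hypothetical example using val and f1 instead of test and logloss:
# hypothetical input with different dataset/metric names and only 2 epochs
other = pd.DataFrame({'train': {'auc': [0.7, 0.75], 'f1': [0.8, 0.85]},
                      'val': {'auc': [0.5, 0.55], 'f1': [0.6, 0.7]}})
out = other.stack().apply(pd.Series)
out.index = ['_'.join(reversed(x)) for x in out.index]
out = out.T.assign(epochs=range(1, len(out.columns) + 1)).set_index('epochs').reset_index()
# expected columns: epochs, train_auc, val_auc, train_f1, val_f1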
Try this:
df = pd.DataFrame({'train': {'auc': [0.432, 0.543, 0.523],
                             'logloss': [0.123, 0.234, 0.345]},
                   'test': {'auc': [0.456, 0.567, 0.678],
                            'logloss': [0.321, 0.432, 0.543]}})
de = df.explode(['train', 'test'])
df_out = de.set_index(de.groupby(level=0).cumcount()+1, append=True).unstack(0)
df_out.columns = df_out.columns.map('_'.join)
df_out = df_out.reset_index().rename(columns={'index':'epochs'})
print(df_out)
Output:
epochs train_auc train_logloss test_auc test_logloss
0 1 0.432 0.123 0.456 0.321
1 2 0.543 0.234 0.567 0.432
2 3 0.523 0.345 0.678 0.543

Comparing string values from sequential rows in pandas series

I am trying to count common string values in sequential rows of a pandas series using a user-defined function and to write the output into a new column. I figured out the individual steps, but when I put them together, I get a wrong result. Could you please tell me the best way to do this? I am a very beginner Pythonista!
My pandas df is:
df = pd.DataFrame({"Code": ['d7e', '8e0d', 'ft1', '176', 'trk', 'tr71']})
My string comparison loop is:
x = 'd7e'
y = '8e0d'
s = 0
for i in y:
    b = str(i)
    if b not in x:
        s += 0
    else:
        s += 1
print(s)
The right result for these particular strings is 2.
Note: when I wrap this in def func(x, y):, something happens to the s counter and it doesn't produce the right result. I think I need to reset it to 0 every time the loop runs.
Then, I use df.shift to specify the position of y and x in a series:
x = df["Code"]
y = df["Code"].shift(periods=-1, axis=0)
And finally, I use the df.apply() method to run the function:
df["R1SB"] = df.apply(func, axis=0)
and I get None values in my new column "R1SB"
My correct output would be:
"Code" "R1SB"
0 d7e None
1 8e0d 2
2 ft1 0
3 176 1
4 trk 0
5 tr71 2
Thank you for your help!
TRY:
df['R1SB'] = df.assign(temp=df.Code.shift(1)).apply(
    lambda x: np.nan
    if pd.isna(x['temp'])
    else sum(i in str(x['temp']) for i in str(x['Code'])),
    1,
)
OUTPUT:
Code R1SB
0 d7e NaN
1 8e0d 2.0
2 ft1 0.0
3 176 1.0
4 trk 0.0
5 tr71 2.0
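For reference, here is a hedged sketch of the approach the question attempted, with a small helper function (the names are made up) whose count starts from zero on every call, which is what the original func was missing:
def count_common(prev, cur):
    # the counter is local, so it restarts at 0 for every pair of rows
    if pd.isna(prev) or pd.isna(cur):
        return np.nan
    return sum(ch in str(prev) for ch in str(cur))

prev = df['Code'].shift(1)
df['R1SB'] = [count_common(p, c) for p, c in zip(prev, df['Code'])]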

Creating a base 100 Index from time series that begins with a number of NaNs

I have the following dataframe (time-series of returns truncated for succinctness):
import pandas as pd
import numpy as np
df = pd.DataFrame({'return':np.array([np.nan, np.nan, np.nan, 0.015, -0.024, 0.033, 0.021, 0.014, -0.092])})
I'm trying to start the index (i.e., "base-100") at the last NaN before the first return, while at the same time keeping the NaNs preceding the 100 value in place (thinking in terms of appending to an existing dataframe, and for graphing purposes).
I have only found a way to create said index when there are no NaNs in the return vector:
df['index'] = 100*np.exp(np.nan_to_num(df['return'].cumsum()))
Any ideas - thx in advance!
If your initial array is
zz = np.array([np.nan, np.nan, np.nan, 0.015, -0.024, 0.033, 0.021, 0.014, -0.092])
Then you can obtain your desired output like this (although there's probably a more optimized way to do it):
np.concatenate((zz[:np.argmax(np.isfinite(zz))],
                100 * np.exp(np.cumsum(zz[np.isfinite(zz)]))))
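And if you want the result back in the dataframe as a column, a small usage sketch:
zz = df['return'].to_numpy()
df['index'] = np.concatenate((zz[:np.argmax(np.isfinite(zz))],
                              100 * np.exp(np.cumsum(zz[np.isfinite(zz)]))))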
Use Series.isna, change order by indexing and get index of last NaN by Series.idxmax:
idx = df['return'].isna().iloc[::-1].idxmax()
Pass it to DataFrame.loc, replace the missing value and use the cumulative sum:
df['return'] = df.loc[idx:, 'return'].fillna(100).cumsum()
print (df)
return
0 NaN
1 NaN
2 100.000
3 100.015
4 99.991
5 100.024
6 100.045
7 100.059
8 99.967
You can use Series.isna with Series.cumsum and compare against the max, then replace the last NaN with Series.fillna and finally use the cumulative sum:
s = df['return'].isna().cumsum()
df['return'] = df['return'].mask(s.eq(s.max()), df['return'].fillna(100)).cumsum()
print (df)
return
0 NaN
1 NaN
2 100.000
3 100.015
4 99.991
5 100.024
6 100.045
7 100.059
8 99.967
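If you want to keep the exponential compounding from your original 100*np.exp(...cumsum()) formula rather than a plain cumulative sum, here is a sketch combining it with the same "last NaN" trick (assuming that last NaN should become the 100 base):
r = df['return']
base_pos = r.isna().iloc[::-1].idxmax()  # index of the last leading NaN
df['index'] = 100 * np.exp(r.loc[base_pos:].fillna(0).cumsum())  # earlier rows stay NaN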

How to use the diff() function in pandas but enter the difference values in a new column?

I have a dataframe df:
df
x-value
frame
1 15
2 20
3 19
How can I get:
df
x-value delta-x
frame
1 15 0
2 20 5
3 19 -1
Not to say there is anything wrong with what @Wen posted as a comment, but I want to post a more complete answer.
The Problem
There are 3 things going on that need to be addressed:
Calculating the values that are the differences from one row to the next.
Handling the fact that the "difference" will be one less value than the original length of the dataframe and we'll have to fill in a value for the missing bit.
Assigning the result to a new column of the dataframe.
Option #1
The most natural way to do the diff would be to use pd.Series.diff (as #Wen suggested). But in order to produce the stated results, which are integers, I recommend using the pd.Series.fillna parameter, downcast='infer'. Finally, I don't like editing the dataframe unless there is a need for it. So I use pd.DataFrame.assign to produce a new dataframe that is a copy of the old one with a new column associated.
df.assign(**{'delta-x': df['x-value'].diff().fillna(0, downcast='infer')})
x-value delta-x
frame
1 15 0
2 20 5
3 19 -1
Option #2
Similar to #1 but I'll use numpy.diff to preserve int type in addition to picking up some performance.
df.assign(**{'delta-x': np.append(0, np.diff(df['x-value'].values))})
x-value delta-x
frame
1 15 0
2 20 5
3 19 -1
Testing
from timeit import timeit

pir1 = lambda d: d.assign(**{'delta-x': d['x-value'].diff().fillna(0, downcast='infer')})
pir2 = lambda d: d.assign(**{'delta-x': np.append(0, np.diff(d['x-value'].values))})

res = pd.DataFrame(
    index=[10, 300, 1000, 3000, 10000, 30000],
    columns=['pir1', 'pir2'], dtype=float)

for i in res.index:
    d = pd.concat([df] * i, ignore_index=True)
    for j in res.columns:
        stmt = '{}(d)'.format(j)
        setp = 'from __main__ import d, {}'.format(j)
        res.at[i, j] = timeit(stmt, setp, number=1000)

res.plot(loglog=True)
res.div(res.min(1), 0)
pir1 pir2
10 2.069498 1.0
300 2.123017 1.0
1000 2.397373 1.0
3000 2.804214 1.0
10000 4.559525 1.0
30000 7.058344 1.0