iterating through a dataframe alternative of for loop - pandas

I have a very large DataFrame. I wrote a for loop over it, but it takes forever to run. Is there an alternative?
index  ids   year
0      1890  2001
1      2678  NaN
2      4780  NaN
3      9844  1999
The idea is to get an array of the ids of people who have NaN in the 'year' column. To do that, I replaced NaN with 0 and wrote this for loop:
df_nan = []
for i in range(0, len(df.index)):
    for j in range(0, len(df.columns)):
        if int(df.values[i, j]) == 0:
            df_nan.append(df.values[i, 0])
The for loop works, because I tried it on a smaller DataFrame, but I can't use it on the main one because it takes too long.

You can use filtering.
import numpy as np
import pandas as pd
df = pd.DataFrame({'ids': [1890, 2678, 4780, 9844], 'year': [2001, np.nan, np.nan, 1999]})
nan_rows = df[df['year'].isnull()]  # keep only the rows whose 'year' is NaN
ids = nan_rows['ids'].values
print(ids) # outputs: [2678 4780]
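The same filter can also be written in one step with .loc, which selects the rows and the column together. This is a minimal sketch that assumes the column names from the question; the NaN values are left as NaN, so the replacement with 0 is not needed:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"ids": [1890, 2678, 4780, 9844],
                   "year": [2001, np.nan, np.nan, 1999]})

# Boolean mask: True where 'year' is missing
mask = df["year"].isna()

# Select the 'ids' of the masked rows in a single vectorized step
nan_ids = df.loc[mask, "ids"].to_numpy()
print(nan_ids)
```

Because the mask is computed on the whole column at once, there is no Python-level loop over rows or columns.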

Related

Cleaner way to copy a slice from one slightly-differently-named Multiindex to another

I need to assign a slice of dfB to dfA. Each df has MultiIndex columns where level 0 has different names, while level 1 has the same names in both dataframes.
A simple copy of the slice doesn't work because the MultiIndexes are not identical.
I have a gross hack that temporarily renames the index of the slice (which is a Series) before it's assigned to the dataframe. Is there a cleaner way to do this? I am not able to rename level 0 of either dataframe.
import pandas as pd
idx = pd.IndexSlice  # shorthand for MultiIndex slicing
af = pd.DataFrame(index=range(4), columns=pd.MultiIndex.from_product([['mike'], ['age', 'weight']]))
bf = pd.DataFrame(index=range(4), columns=pd.MultiIndex.from_product([['foo'], ['age', 'weight']]))
for i in range(4):
    bf.loc[i, idx['foo', 'age']] = 10*i
    bf.loc[i, idx['foo', 'weight']] = "huge"
af.loc[1, idx['mike', 'age']] = bf.loc[1, idx['foo', 'age']] # slice returns an int
af.loc[3, idx['mike', :]] = bf.loc[1, idx['foo', :]] # slice returns a series
print("3 is not copied")
print(af) # nothing copied to 3
# nasty hack. Can I do better?
gross = af.loc[3, idx['mike', :]].index
hack = bf.loc[1, idx['foo', :]].copy()
hack.index = gross
af.loc[3, idx['mike', :]] = hack
print("\n3 IS copied")
print(af)
To make the index labels of the slice match, use rename:
print (bf.loc[1, idx['foo', :]].rename({'foo':'mike'}, level=0))
mike age 10
weight huge
Name: 1, dtype: object
af.loc[3, idx['mike', :]] = bf.loc[1, idx['foo', :]].rename({'foo':'mike'}, level=0)
print(af)
mike
age weight
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 10 huge
Or assign numpy array:
af.loc[3, idx['mike', :]] = bf.loc[1, idx['foo', :]].to_numpy()
print(af)
mike
age weight
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 10 huge
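Assigning a bare NumPy array relies on both frames having their level-1 columns in the same order. A minimal sketch of a small wrapper that matches on the level-1 labels first is shown here; copy_row is a hypothetical helper name, and the bf data (with its columns deliberately in a different order) is made up for illustration:

```python
import pandas as pd

idx = pd.IndexSlice
af = pd.DataFrame(index=range(4),
                  columns=pd.MultiIndex.from_product([["mike"], ["age", "weight"]]))
# Source frame whose level-1 columns are in a different order (illustrative data)
bf = pd.DataFrame({("foo", "weight"): ["huge"] * 4,
                   ("foo", "age"): [0, 10, 20, 30]})

def copy_row(dst, dst_key, dst_row, src, src_key, src_row):
    """Hypothetical helper: copy one row between level-0 groups, matched on level-1 labels."""
    values = src[src_key].loc[src_row]              # Series indexed by level-1 labels
    values = values.reindex(dst[dst_key].columns)   # reorder to the destination's columns
    dst.loc[dst_row, idx[dst_key, :]] = values.to_numpy()

copy_row(af, "mike", 3, bf, "foo", 1)
print(af)
```

Selecting src[src_key] drops level 0, so the reindex only has to line up the shared level-1 names.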

Pandas: Merging multiple dataframes efficiently

I have a situation where I need to merge multiple dataframes that I can do easily using the below code:
# Merge all the datasets together
df_prep1 = df_prep.merge(df_1,on='e_id',how='left')
df_prep2 = df_prep1.merge(df_2,on='e_id',how='left')
df_prep3 = df_prep2.merge(df_3,on='e_id',how='left')
df_prep4 = df_prep3.merge(df_4,on='e_id',how='left')
df_prep5 = df_prep4.merge(df_5,on='e_id',how='left')
df_prep6 = df_prep5.merge(df_6,on='e_id',how='left')
What I want to understand is whether there is a more efficient way to perform this merge, maybe using a helper function. If so, how could I achieve that?
You can use reduce from functools module to merge multiple dataframes:
from functools import reduce
dfs = [df_prep, df_1, df_2, df_3, df_4, df_5, df_6]  # df_prep first, so it is the left side of every merge
out = reduce(lambda dfl, dfr: pd.merge(dfl, dfr, on='e_id', how='left'), dfs)
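Since the question asks about a helper function, the reduce call can be wrapped in one. This is a minimal sketch; merge_all is a hypothetical name and the small frames are made-up stand-ins for the real data. The base frame is passed as the initial value of reduce, so it stays on the left of every merge:

```python
from functools import reduce

import pandas as pd

def merge_all(base, others, key="e_id"):
    """Left-merge every frame in `others` onto `base`, one after another."""
    return reduce(lambda left, right: left.merge(right, on=key, how="left"),
                  others, base)

# Illustrative stand-ins for df_prep, df_1, df_2
df_prep = pd.DataFrame({"e_id": [1, 2, 3]})
df_1 = pd.DataFrame({"e_id": [1, 2], "a": [10, 20]})
df_2 = pd.DataFrame({"e_id": [2, 3], "b": ["x", "y"]})

out = merge_all(df_prep, [df_1, df_2])
print(out)
```

This is a tidier way to write the same chain of merges, not a faster algorithm: each step still performs one pairwise merge.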
You can put all your dfs into a list, or pass them from a function, a loop, etc., and then have one main df that you merge everything onto.
You can start with an empty df and iterate through. In your case, since you are doing a left merge, it looks like your df_prep should already have all of the e_id values that you want. You'll need to decide what to do with any additional columns, e.g., you can have pandas add _x and _y after conflicting column names that you don't merge on, or rename them, etc. See this toy example:
main_df = pd.DataFrame({'e_id': [0, 1, 2, 3, 4]})
for x in range(3):
    dfx = pd.DataFrame({'e_id': [x], 'another_col' + str(x): [x * 10]})
    main_df = main_df.merge(dfx, on='e_id', how='left')
to get:
   e_id  another_col0  another_col1  another_col2
0     0           0.0           NaN           NaN
1     1           NaN          10.0           NaN
2     2           NaN           NaN          20.0
3     3           NaN           NaN           NaN
4     4           NaN           NaN           NaN
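The point about conflicting column names can be seen in a short sketch. The frames and the "score" column are made up for illustration; the suffixes argument of merge replaces the default _x and _y:

```python
import pandas as pd

left = pd.DataFrame({"e_id": [1, 2], "score": [0.5, 0.7]})
right = pd.DataFrame({"e_id": [1, 2], "score": [0.9, 0.1]})

# Default behaviour: the clashing 'score' columns get _x and _y appended
default = left.merge(right, on="e_id", how="left")

# Explicit suffixes make it clearer where each column came from
named = left.merge(right, on="e_id", how="left", suffixes=("_prep", "_new"))

print(default.columns.tolist())
print(named.columns.tolist())
```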

Creating a base 100 Index from time series that begins with a number of NaNs

I have the following dataframe (time-series of returns truncated for succinctness):
import pandas as pd
import numpy as np
df = pd.DataFrame({'return':np.array([np.nan, np.nan, np.nan, 0.015, -0.024, 0.033, 0.021, 0.014, -0.092])})
I'm trying to start the index (i.e., "base 100") at the last NaN before the first return, while keeping the NaNs that precede the 100 value in place (the result will be appended to the existing dataframe and used for graphing).
I have only found a way to create this index when there are no NaNs in the return vector:
df['index'] = 100*np.exp(np.nan_to_num(df['return'].cumsum()))
Any ideas - thx in advance!
If your initial array is
zz = np.array([np.nan, np.nan, np.nan, 0.015, -0.024, 0.033, 0.021, 0.014, -0.092])
Then you can obtain your desired output like this (although there's probably a more optimized way to do it):
np.concatenate((zz[:np.argmax(np.isfinite(zz))],
                100*np.exp(np.cumsum(zz[np.isfinite(zz)]))))
Use Series.isna, reverse the order by indexing, and get the index of the last NaN with Series.idxmax:
idx = df['return'].isna().iloc[::-1].idxmax()
Pass it to DataFrame.loc, replace the missing value and take the cumulative sum:
df['return'] = df.loc[idx:, 'return'].fillna(100).cumsum()
print (df)
return
0 NaN
1 NaN
2 100.000
3 100.015
4 99.991
5 100.024
6 100.045
7 100.059
8 99.967
You can also use Series.isna with Series.cumsum and compare with the max, then replace the last NaN with Series.fillna and finally take the cumulative sum:
s = df['return'].isna().cumsum()
df['return'] = df['return'].mask(s.eq(s.max()), df['return'].fillna(100)).cumsum()
print (df)
return
0 NaN
1 NaN
2 100.000
3 100.015
4 99.991
5 100.024
6 100.045
7 100.059
8 99.967
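For comparison, here is a minimal sketch that compounds the returns with the question's own formula, 100*exp(cumsum), while placing the 100 on the last leading NaN. It assumes the values are log returns (as the formula in the question implies) and that the only NaNs are the leading ones; first_pos, base_pos and log_growth are illustrative names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"return": [np.nan, np.nan, np.nan, 0.015, -0.024,
                              0.033, 0.021, 0.014, -0.092]})
r = df["return"]

# Position of the first real return, and of the row just before it (the last leading NaN)
first_pos = int(np.argmax(r.notna().to_numpy()))
base_pos = max(first_pos - 1, 0)

# Treat the base row as a zero return so the index is exactly 100 there
log_growth = r.iloc[base_pos:].fillna(0).cumsum()

df["index"] = np.nan
df.loc[df.index[base_pos:], "index"] = 100 * np.exp(log_growth.to_numpy())
print(df)
```

Rows before base_pos keep NaN in the new column, which is what the question asks for when graphing.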

Series.replace cannot use dict-like to_replace and non-None value [duplicate]

I've got a pandas DataFrame filled mostly with real numbers, but there are a few NaN values in it as well.
How can I replace the NaNs with the averages of the columns they are in?
This question is very similar to this one: numpy array: replace nan values with average of columns but, unfortunately, the solution given there doesn't work for a pandas DataFrame.
You can simply use DataFrame.fillna to fill the nan's directly:
In [27]: df
Out[27]:
A B C
0 -0.166919 0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 NaN -2.027325 1.533582
4 NaN NaN 0.461821
5 -0.788073 NaN NaN
6 -0.916080 -0.612343 NaN
7 -0.887858 1.033826 NaN
8 1.948430 1.025011 -2.982224
9 0.019698 -0.795876 -0.046431
In [28]: df.mean()
Out[28]:
A -0.151121
B -0.231291
C -0.530307
dtype: float64
In [29]: df.fillna(df.mean())
Out[29]:
A B C
0 -0.166919 0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 -0.151121 -2.027325 1.533582
4 -0.151121 -0.231291 0.461821
5 -0.788073 -0.231291 -0.530307
6 -0.916080 -0.612343 -0.530307
7 -0.887858 1.033826 -0.530307
8 1.948430 1.025011 -2.982224
9 0.019698 -0.795876 -0.046431
The docstring of fillna says that value should be a scalar or a dict, however, it seems to work with a Series as well. If you want to pass a dict, you could use df.mean().to_dict().
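When the frame also contains non-numeric columns, the same idea still works if the means are restricted to the numeric columns. This is a minimal sketch with made-up data; numeric_only=True is passed because, in recent pandas versions, DataFrame.mean may raise on string columns otherwise:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, np.nan, 3.0],
                   "B": [np.nan, 4.0, 6.0],
                   "label": ["x", None, "z"]})

# Column means of the numeric columns only: A -> 2.0, B -> 5.0
col_means = df.mean(numeric_only=True)

# fillna aligns the Series on column names; 'label' has no entry and is left untouched
filled = df.fillna(col_means)
print(filled)
```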
Try:
sub2.fillna({'income': sub2['income'].mean()}, inplace=True)  # fill on the frame itself; inplace on a column selection may not update sub2 in recent pandas
In [16]: df = DataFrame(np.random.randn(10,3))
In [17]: df.iloc[3:5,0] = np.nan
In [18]: df.iloc[4:6,1] = np.nan
In [19]: df.iloc[5:8,2] = np.nan
In [20]: df
Out[20]:
0 1 2
0 1.148272 0.227366 -2.368136
1 -0.820823 1.071471 -0.784713
2 0.157913 0.602857 0.665034
3 NaN -0.985188 -0.324136
4 NaN NaN 0.238512
5 0.769657 NaN NaN
6 0.141951 0.326064 NaN
7 -1.694475 -0.523440 NaN
8 0.352556 -0.551487 -1.639298
9 -2.067324 -0.492617 -1.675794
In [22]: df.mean()
Out[22]:
0 -0.251534
1 -0.040622
2 -0.841219
dtype: float64
Apply per column: fill each column's NaNs with the mean of that column.
In [23]: df.apply(lambda x: x.fillna(x.mean()),axis=0)
Out[23]:
0 1 2
0 1.148272 0.227366 -2.368136
1 -0.820823 1.071471 -0.784713
2 0.157913 0.602857 0.665034
3 -0.251534 -0.985188 -0.324136
4 -0.251534 -0.040622 0.238512
5 0.769657 -0.040622 -0.841219
6 0.141951 0.326064 -0.841219
7 -1.694475 -0.523440 -0.841219
8 0.352556 -0.551487 -1.639298
9 -2.067324 -0.492617 -1.675794
Although the code below does the job, its performance takes a big hit once the DataFrame has 100k records or more:
df.fillna(df.mean())
In my experience, one should replace NaN values (be it with the mean or the median) only where it is required, rather than applying fillna() across the whole DataFrame.
I had a DataFrame with 20 variables, and only 4 of them required NaN treatment (replacement). I tried the above code (Code 1) along with a slightly modified version of it (Code 2), which I ran selectively, i.e. only on the variables that had a NaN value.
#------------------------------------------------
#----(Code 1) Treatment on overall DataFrame-----
df.fillna(df.mean())
#------------------------------------------------
#----(Code 2) Selective Treatment----------------
for i in df.columns[df.isnull().any(axis=0)]:  #---Applying only on variables with NaN values
    df[i] = df[i].fillna(df[i].mean())  #---assign back instead of calling inplace on a column selection
#---df.isnull().any(axis=0) gives True/False flag (Boolean value series),
#---which when applied on df.columns[], helps identify variables with NaN values
Below is the performance I observed as I kept increasing the number of records in the DataFrame:
DataFrame with ~100k records
Code 1: 22.06 Seconds
Code 2: 0.03 Seconds
DataFrame with ~200k records
Code 1: 180.06 Seconds
Code 2: 0.06 Seconds
DataFrame with ~1.6 Million records
Code 1: code kept running endlessly
Code 2: 0.40 Seconds
DataFrame with ~13 Million records
Code 1: --did not even try, after seeing performance on 1.6 Mn records--
Code 2: 3.20 Seconds
Apologies for a long answer ! Hope this helps !
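The selective treatment can also be written without a Python loop. The sketch below uses made-up random data and only checks that both approaches give the same result; it makes no claim about timings, which depend on the data and the pandas version:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(1000, 5)), columns=list("abcde"))
df.loc[::10, "b"] = np.nan   # inject missing values into two columns only
df.loc[::7, "d"] = np.nan

# Code 1 equivalent: fill across the whole frame
whole = df.fillna(df.mean())

# Code 2 equivalent: restrict the work to the columns that actually contain NaN
selective = df.copy()
cols_with_nan = selective.columns[selective.isna().any()]
selective[cols_with_nan] = selective[cols_with_nan].fillna(selective[cols_with_nan].mean())

print(list(cols_with_nan))
print(np.allclose(whole.to_numpy(), selective.to_numpy()))
```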
If you want to impute missing values with mean and you want to go column by column, then this will only impute with the mean of that column. This might be a little more readable.
sub2['income'] = sub2['income'].fillna((sub2['income'].mean()))
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
# To read data from csv file
Dataset = pd.read_csv('Data.csv')
X = Dataset.iloc[:, :-1].values
# To calculate the mean, use the SimpleImputer class
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
Use df.fillna(df.mean()) directly to fill all null values with the column means.
If you want to fill the null values of a single column with that column's mean, you can use the following.
Suppose x = df['Item_Weight'], where Item_Weight is the column name.
Here we fill the null values of x with the mean of x and assign the result back to x:
df['Item_Weight'] = df['Item_Weight'].fillna((df['Item_Weight'].mean()))
If you want to fill null values with a string instead, use the following.
Here Outlet_Size is the column name:
df.Outlet_Size = df.Outlet_Size.fillna('Missing')
Pandas: How to replace NaN (nan) values with the average (mean), median or other statistics of one column
Say your DataFrame is df and you have one column called nr_items. This is: df['nr_items']
If you want to replace the NaN values of your column df['nr_items'] with the mean of the column:
Use method .fillna():
mean_value=df['nr_items'].mean()
df['nr_item_ave']=df['nr_items'].fillna(mean_value)
I have created a new df column called nr_item_ave to store the new column with the NaN values replaced by the mean value of the column.
Be careful when using the mean. If you have outliers, the median is usually a better choice.
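A small sketch of why outliers matter here, using made-up values for nr_items; the new column names are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"nr_items": [1.0, 2.0, 3.0, np.nan, 1000.0]})

# The single outlier (1000) drags the mean far away from the typical values
mean_value = df["nr_items"].mean()
median_value = df["nr_items"].median()

df["nr_item_ave"] = df["nr_items"].fillna(mean_value)
df["nr_item_med"] = df["nr_items"].fillna(median_value)
print(mean_value, median_value)
```

With these numbers the missing entry becomes 251.5 when filled with the mean and 2.5 when filled with the median.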
Another option besides those above is:
df = df.groupby(df.columns, axis = 1).transform(lambda x: x.fillna(x.mean()))
It's less elegant than the previous answers for the mean, but it can be shorter if you want to replace nulls with the result of some other column function.
Using the SimpleImputer class from sklearn.impute:
import numpy as np
from sklearn.impute import SimpleImputer
missingvalues = SimpleImputer(missing_values=np.nan, strategy='mean')  # SimpleImputer takes no axis argument; it imputes column by column
missingvalues = missingvalues.fit(x[:,1:3])
x[:,1:3] = missingvalues.transform(x[:,1:3])
Note: in recent versions, the value of the missing_values parameter changed from 'NaN' to np.nan.
I use this method to fill missing values by average of a column.
fill_mean = lambda col : col.fillna(col.mean())
df = df.apply(fill_mean, axis = 0)
You can also use value_counts to get the most frequent values. This would work on different datatypes.
df = df.apply(lambda x:x.fillna(x.value_counts().index[0]))
Here is the value_counts api reference.
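A minimal sketch of the most-frequent-value fill on mixed column types, with made-up data. It assumes every column has at least one non-null value (value_counts would be empty otherwise) and avoids ties for the top value, whose ordering is not something to rely on:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", None, "red"],
                   "size": [1.0, np.nan, 1.0, 2.0]})

# value_counts sorts by frequency, so index[0] is the most common non-null value
filled = df.apply(lambda col: col.fillna(col.value_counts().index[0]))
print(filled)
```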

Create a sequence row from a filtered series

I am trying to create a row that has columns from t0 to t(n).
I have a complete data frame (df) that stores the full set of data, and a series (df_t) of the specific time markers I am interested in.
What I want is to create a row that has the time marker as t0, followed by the previous [sequence_length] rows from the complete data frame.
def t_data(df, df_t, col_names, sequence_length):
    df_ret = pd.DataFrame()
    for i in range(sequence_length):
        col_names_seq = [col_name + "_" + str(i) for col_name in col_names]
        df_ret[col_names_seq] = df[df.shift(i)["time"].isin(df_t)][col_names]
    return df_ret
Running:
t_data(df, df_t, ["close"], 3)
I get:
close_0 close_1 close_2
1110 1.32080 NaN NaN
2316 1.30490 NaN NaN
2549 1.30290 NaN NaN
The line at issue is clearly:
df[df.shift(i)["time"].isin(df_t)][col_names]
I have tried several ways but can't seem to select the data surrounding a subset.
Sample (df):
time open close high low volume EMA21 EMA13 EMA9
20 2005-01-10 04:10:00 1.3071 1.3074 1.3075 1.3070 32.0 1.306624 1.306790 1.306887
21 2005-01-10 04:15:00 1.3074 1.3073 1.3075 1.3073 16.0 1.306685 1.306863 1.306969
22 2005-01-10 04:20:00 1.3073 1.3072 1.3074 1.3072 35.0 1.306732 1.306911 1.307015
Sample (df_t):
1110 2005-01-13 23:00:00
2316 2005-01-18 03:30:00
2549 2005-01-18 22:55:00
Name: time, dtype: datetime64[ns]
I don’t have data but hope this drawing helps:
def t_data(df, df_T, n):
    # Get the indexes of the original df that match the values of df_T
    indexs = df.reset_index().merge(df_T, how="inner")['index'].tolist()
    # create a new index list where we will store the index-n values
    newIndex = []
    # create the list of values to subtract from each index
    toSub = np.arange(n)
    # loop over the index values, subtract the offsets, and append to newIndex
    for i in indexs:
        for sub in toSub:
            v = i - sub
            newIndex.append(v)
    # Use iloc to get all the rows in the original df with the newIndex values that we want
    closedCosts = df.iloc[newIndex].reset_index(drop = True)["close"].values
    # concat our data back to df_T, and reshape closedCosts into n columns
    df_final = pd.concat([df_T, pd.DataFrame(closedCosts.reshape(-1, n))], axis= 1)
    # return final df
    return df_final
This should do what you're asking for. The easiest way to do this is to figure out all the indexes that you want from the original df, along with their corresponding closing values. Note: you will have to rename the columns after this, but all the values are there.
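Another way to look at the problem in the question: shifting the "time" mask selects rows with different index labels, so the assignment into df_ret cannot align and yields NaN. A minimal sketch that shifts the data instead and keeps the mask fixed is shown here. It assumes df has a default integer index in time order, and the toy data is made up:

```python
import pandas as pd

def t_data(df, df_t, col_names, sequence_length):
    """For each marker row, collect its value and the values of the previous rows."""
    mask = df["time"].isin(df_t)   # rows that are time markers
    parts = [
        # shift(i) moves the value from i rows earlier onto the marker row
        df.loc[:, col_names].shift(i).loc[mask].add_suffix(f"_{i}")
        for i in range(sequence_length)
    ]
    return pd.concat(parts, axis=1)

# Toy data: six 5-minute bars, with markers at rows 2 and 5
df = pd.DataFrame({
    "time": pd.date_range("2005-01-10 04:10", periods=6, freq="5min"),
    "close": [1.0, 1.1, 1.2, 1.3, 1.4, 1.5],
})
df_t = df.loc[[2, 5], "time"]

result = t_data(df, df_t, ["close"], 3)
print(result)
```

Markers that have fewer than sequence_length - 1 rows before them get NaN in the later columns, since there is no earlier data to shift in.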