The documentation for pandas.read_excel mentions something called 'roundtripping', under the description of the index_col parameter.
Missing values will be forward filled to allow roundtripping with to_excel for merged_cells=True.
I have never heard of this term before, and if I search for a definition, I can find one only in the context of finance. I have seen it referred to in the context of merging dataframes in Pandas, but I have not found a definition.
For context, this is the complete description of the index_col parameter:
index_col : int, list of int, default None
Column (0-indexed) to use as the row labels of the DataFrame. Pass
None if there is no such column. If a list is passed, those columns
will be combined into a MultiIndex. If a subset of data is selected
with usecols, index_col is based on the subset.
Missing values will be forward filled to allow roundtripping with
to_excel for merged_cells=True. To avoid forward filling the
missing values use set_index after reading the data instead of
index_col.
For a general idea of the meaning of roundtripping, have a look at the answers to this post on SE. Applied to your example, "allow roundtripping" is used to mean something like this:
facilitate an easy back-and-forth between the data in an Excel file
and the same data in a df. I.e. while maintaining the intended
structure throughout.
Example round trip
The usefulness of this idea is perhaps best seen if we start with a somewhat complex df with both index and columns as named MultiIndices (for the constructor, see pd.MultiIndex.from_product):
import pandas as pd
import numpy as np

df = pd.DataFrame(data=np.random.rand(4, 4),
                  columns=pd.MultiIndex.from_product([['A', 'B'], [1, 2]],
                                                     names=['col_0', 'col_1']),
                  index=pd.MultiIndex.from_product([[0, 1], [1, 2]],
                                                   names=['idx_0', 'idx_1']))
print(df)
col_0            A                   B
col_1            1         2         1         2
idx_0 idx_1
0     1      0.952749  0.447125  0.846409  0.699479
      2      0.297437  0.813798  0.396506  0.881103
1     1      0.581273  0.881735  0.692532  0.725254
      2      0.501324  0.956084  0.643990  0.423855
If we now use df.to_excel with the default for merge_cells (i.e. True) to write this data to an Excel file, we will end up with data as follows:
df.to_excel('file.xlsx')
Result (screenshot of the Excel file, showing the merged header and index cells):
Aesthetics aside, the structure here is very clear, and indeed, the same as the structure in our df. Take notice of the merged cells especially.
Now, let's suppose we want to retrieve this data again from the Excel file at some later point, and we use pd.read_excel with default parameters. Problematically, we will end up with a complete mess:
df_messy = pd.read_excel('file.xlsx')
print(df_messy)
  Unnamed: 0  col_0         A  Unnamed: 3         B  Unnamed: 5
0        NaN  col_1  1.000000    2.000000  1.000000    2.000000
1      idx_0  idx_1       NaN         NaN       NaN         NaN
2          0      1  0.952749    0.447125  0.846409    0.699479
3        NaN      2  0.297437    0.813798  0.396506    0.881103
4          1      1  0.581273    0.881735  0.692532    0.725254
5        NaN      2  0.501324    0.956084  0.643990    0.423855
Getting this data "back into shape" would be quite time-consuming. To avoid such a hassle, we can rely on the parameters index_col and header inside pd.read_excel:
df2 = pd.read_excel('file.xlsx', index_col=[0,1], header=[0,1])
print(df2)
col_0            A                   B
col_1            1         2         1         2
idx_0 idx_1
0     1      0.952749  0.447125  0.846409  0.699479
      2      0.297437  0.813798  0.396506  0.881103
1     1      0.581273  0.881735  0.692532  0.725254
      2      0.501324  0.956084  0.643990  0.423855
# check for equality with the original df
df.equals(df2)
# True
As you can see, we have made a "round trip" here, and index_col and header allow for it to have been smooth sailing!
Two final notes:
(minor) The docs for pd.read_excel contain a typo in the index_col section: it should read merge_cells=True, not merged_cells=True.
The description of the header parameter is missing a similar comment (or a reference to the comment under index_col), which is somewhat confusing. As we saw above, the two parameters behave in the same way here (for present purposes, at least).
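As a small addendum to the docstring passage quoted above ("To avoid forward filling the missing values use set_index after reading the data instead of index_col"), here is a sketch of that difference on a simpler frame, with only the row index as a MultiIndex; the file name is arbitrary:

import pandas as pd

# MultiIndex rows, single-level columns: merge_cells=True merges the idx_0 cells
df = pd.DataFrame({'val': range(4)},
                  index=pd.MultiIndex.from_product([[0, 1], [1, 2]],
                                                   names=['idx_0', 'idx_1']))
df.to_excel('simple.xlsx')

# index_col forward-fills the merged index cells, so the roundtrip is exact
a = pd.read_excel('simple.xlsx', index_col=[0, 1])
print(a.equals(df))  # True

# the alternative the docs mention: read first, then set_index; the merged cells stay NaN
b = pd.read_excel('simple.xlsx').set_index(['idx_0', 'idx_1'])
print(b.index.to_list())  # e.g. [(0.0, 1), (nan, 2), (1.0, 1), (nan, 2)]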
Related
I am not certain if this question is appropriate here, and apologies in advance if it is not.
I am a pandas maintainer, and recently I've been working on fixing bugs in pandas groupby when used with dropna=True and transform for the 1.5 release. For example, in pandas 1.4.2,
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 1, np.nan], 'b': [2, 3, 4]})
print(df.groupby('a', dropna=True).transform('sum'))
produces the incorrect (in particular, the last row) output
   b
0  5
1  5
2  5
While working on this, I've been wondering how useful the dropna argument is in groupby. For aggregations (e.g. df.groupby('a').sum()) and filters (e.g. df.groupby('a').head(2)), it seems to me it's always possible to drop the offending rows prior to the groupby (a quick check of this is sketched below). In addition, in my own use of pandas, if I have null values in the groupers then I want them in the groupby result. For transformations, where the resulting index should match that of the input, the value is instead filled with null. For the code block above, the output should be
     b
0  5.0
1  5.0
2  NaN
But I can't imagine this result ever being useful. In case it is, it also is not too difficult to accomplish:
result = df.groupby('a', dropna=False).transform('sum')
result.loc[df['a'].isnull()] = np.nan
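As a quick sanity check of the aggregation/filter claim above (a sketch, not part of the original report), dropping the null groupers up front gives the same result as dropna=True:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 1, np.nan], 'b': [2, 3, 4]})

# Aggregation: dropna=True vs. dropping the offending rows first
print(df.groupby('a', dropna=True).sum()
        .equals(df.dropna(subset=['a']).groupby('a', dropna=False).sum()))   # True

# Filter: same story for head
print(df.groupby('a', dropna=True).head(2)
        .equals(df.dropna(subset=['a']).groupby('a', dropna=False).head(2)))  # True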
If we were able to deprecate and then remove the dropna argument to groupby (i.e. groupby always behaves as if dropna=False), then this would help simplify a good part of the groupby code.
So I'd like to ask: are there examples where dropna=True is used and the operation would otherwise be hard to accomplish?
Thanks!
I have dataframes A of shape XxY with values and dataframes B of shape ZxY to be filled with statistics calculated from A.
As an example:
import numpy as np
import pandas as pd

A = pd.DataFrame(np.array(range(9)).reshape((3, 3)))
B = pd.DataFrame(np.array(range(6)).reshape((2, 3)))
Now I need to fill row 1 of B with quantile(0.5) of A columns where row 0 of B > 1 (else: np.nan). I need to use a function of the kind:
def mydef(df0, df1):
    df1.loc[1] = np.where(df1.loc[0] > 1,
                          df0.quantile(0.5),
                          np.nan)
    pass

mydef(A, B)
Now B is:
     0    1    2
0  0.0  1.0  2.0
1  NaN  NaN  5.0
It works perfectly for these mock dataframes and all my real ones apart from one.
For that one this error is raised:
ValueError: cannot set using a list-like indexer with a different length than the value
When I run the same code without calling a function, it doesn't raise any error.
Since I need to use a function, any suggestion?
I found the error. I erroneously had the same label twice in the index. Essentially my dataframe B was something like:
B = pd.DataFrame(np.array(range(9)).reshape((3,3)), index=[0,0,1])
so that calling the def:
def mydef(df0, df1):
    df1.loc[1] = np.where(df1.loc[0] > 1,
                          df0.quantile(0.5),
                          np.nan)
    pass
would cause the shapes of the np.where arguments not to match, I guess: with the label 0 duplicated, df1.loc[0] returns a two-row DataFrame rather than a Series.
Still not sure why the same code worked outside the def, though.
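To see the shape mismatch concretely, here is a minimal sketch reusing the mock A and the duplicated-index B from above:

import numpy as np
import pandas as pd

A = pd.DataFrame(np.arange(9).reshape(3, 3))
B = pd.DataFrame(np.arange(9).reshape(3, 3), index=[0, 0, 1])

# With the label 0 duplicated, .loc[0] returns a 2-row DataFrame, not a Series
print(B.loc[0].shape)  # (2, 3)

# np.where broadcasts the quantile Series against that 2-row condition ...
out = np.where(B.loc[0] > 1, A.quantile(0.5), np.nan)
print(out.shape)       # (2, 3)

# ... and assigning a (2, 3) array to the single row B.loc[1] raises the ValueError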
I have a simple dataframe as :
     1     2        3
1  NaN  like  dislike
2  Cow   dog    snail
After dropping the nan value the dataframe is :
     1    2      3
2  Cow  dog  snail
Now when I try the following code to print the values, it gives a KeyError:
for i in range(len(data)):
    print(data.loc[i, :])
Any help will be appreciated.
Please add the following line after dropping nan values:
data = data.reset_index(drop=True)
You're dropping a row with a specific index label, and after dropping the NA values that label no longer exists. In your specific case, the label 1 is dropped, so you can no longer iterate over the dataframe by position with range(len(...)). I highly recommend using iterrows instead, as it's better form. Example:
for index, value in mydataframe.iterrows():
    print(index, " ", value)
# 2
# 1      Cow
# 2      dog
# 3    snail
The value is of class 'pandas.core.series.Series' which in your case functions a lot like a dictionary. Pay attention to the column names which are exactly the ones in your example.
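For completeness, here is the reset_index route applied to a rebuilt version of the example; the column labels 1, 2, 3 are an assumption based on the output above:

import numpy as np
import pandas as pd

data = pd.DataFrame({1: [np.nan, 'Cow'], 2: ['like', 'dog'], 3: ['dislike', 'snail']},
                    index=[1, 2])
data = data.dropna()                # the row labelled 1 disappears
data = data.reset_index(drop=True)  # labels become 0, 1, ... again
for i in range(len(data)):
    print(data.loc[i, :])           # no KeyError now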
I have a pandas dataframe where some values are integers and other values are an array. I simply want to drop all of the rows that contain the array (object datatype I believe) in my "ORIGIN_AIRPORT_ID" column, but I have not been able to figure out how to do so after trying many methods.
Here is what the first 20 rows of my dataframe look like (shown in a screenshot). The values that show up like a list are the ones I want to remove. The dataset is a couple million rows, so I just need to write code that removes all of the array-like values in that specific dataframe column, if that makes sense.
This is what I tried:
df = df[df.origin_airport_ID.str.contains(',') == False]
You should consider next time giving us a data sample in text, instead of a figure. It's easier for us to test your example.
Original data:
    ITIN_ID             ORIGIN_AIRPORT_ID
0  20194146                         10397
1  20194147                         10397
2  20194148                         10397
3  20194149  [10397, 10398, 10399, 10400]
4  20194150                         10397
In your case, you can use the pd.to_numeric function:
df['ORIGIN_AIRPORT_ID'] = pd.to_numeric(df['ORIGIN_AIRPORT_ID'], errors='coerce')
It replaces every cell that cannot be converted into a number with NaN (Not a Number), so we get:
    ITIN_ID  ORIGIN_AIRPORT_ID
0  20194146            10397.0
1  20194147            10397.0
2  20194148            10397.0
3  20194149                NaN
4  20194150            10397.0
To remove these rows, now just use .dropna (with .astype('int') to get integer columns back):
df = df.dropna().astype('int')
which results in your desired DataFrame:
    ITIN_ID  ORIGIN_AIRPORT_ID
0  20194146              10397
1  20194147              10397
2  20194148              10397
4  20194150              10397
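For anyone who wants to reproduce this end to end, here is a rough reconstruction of the sample; representing the list-like cell as a string is an assumption about the raw data:

import pandas as pd

df = pd.DataFrame({
    'ITIN_ID': [20194146, 20194147, 20194148, 20194149, 20194150],
    'ORIGIN_AIRPORT_ID': [10397, 10397, 10397, '[10397, 10398, 10399, 10400]', 10397],
})

df['ORIGIN_AIRPORT_ID'] = pd.to_numeric(df['ORIGIN_AIRPORT_ID'], errors='coerce')
df = df.dropna().astype('int')
print(df)  # rows with unconvertible values are gone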
Let us say I have two dataframes: df1 and df2. Assume the following initial values.
import pandas as pd

df1 = pd.DataFrame({'ID': ['ASX-112', 'YTR-789', 'ASX-124', 'UYT-908', 'TYE=456', 'ERW-234', 'UUI-675', 'GHV-805', 'NMB-653', 'WSX-123'],
                    'Costperlb': [4515, 5856, 3313, 9909, 8980, 9088, 6765, 3456, 9012, 1237]})
df2 = df1[df1['Costperlb'] > 4560]
As you can see, df2 is a proper subset of df1 (it was created from df1 by imposing a condition on selection of rows).
I added a column to df2, which contains certain values based on a calculation. Let us call this df2['grade'].
df2['grade']=[1,4,3,5,1,1]
df1 and df2 contain one column named 'ID' which is guaranteed to be unique in each dataframe.
I want to:
Create a new column in df1 and initialize it to 0. Easy. df1['grade']=0.
Copy df2['grade'] values over to df1['grade'], ensuring that df1['ID']=df2['ID'] for each such copy.
The result should be the grade values for the corresponding IDs copied over.
Step 2 is what is perplexing me a bit. A naive df1['grade'] = df2['grade'].values obviously does not work, as the lengths of the two dataframes are different.
Now, if I think hard enough, I could possibly come up with a monstrosity like:
df1['grade'].loc[(df1['ID'].isin(df2)) & ...] but I am uncomfortable with doing that.
I am a newbie with Python, and furthermore, the indices of df1 are used elsewhere after this assignment, so I do not want to drop or reset the indices, as some of the solutions in the search results I found suggest.
I just want to find the rows in df1 where the 'ID' value matches an 'ID' value in df2, and then copy the 'grade' value over for that specific row. How do I do this?
Your code:
df1 = pd.DataFrame({'ID': ['ASX-112', 'YTR-789', 'ASX-124', 'UYT-908', 'TYE=456', 'ERW-234', 'UUI-675', 'GHV-805', 'NMB-653', 'WSX-123'],
                    'Costperlb': [4515, 5856, 3313, 9909, 8980, 9088, 6765, 3456, 9012, 1237]})
df2 = df1[df1['Costperlb'] > 4560]
df2['grade'] = [1, 4, 3, 5, 1, 1]
You can use merge with how="left". A left merge keeps every row of df1 in its original order, and since df1 has a default RangeIndex here, the indexing of df1 is effectively preserved (note that merge itself returns a new RangeIndex rather than reusing df1's):
new_df = df1.merge(df2[["ID","grade"]], on="ID", how="left")
new_df["grade"] = new_df["grade"].fillna(0)
new_df
Output:
        ID  Costperlb  grade
0  ASX-112       4515    0.0
1  YTR-789       5856    1.0
2  ASX-124       3313    0.0
3  UYT-908       9909    4.0
4  TYE=456       8980    3.0
5  ERW-234       9088    5.0
6  UUI-675       6765    1.0
7  GHV-805       3456    0.0
8  NMB-653       9012    1.0
9  WSX-123       1237    0.0
Here I called the merged dataframe new_df, but you can simply change it to df1.
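If keeping df1's own index completely untouched matters (as the question suggests), an alternative sketch is to assign straight into df1 with map instead of building a merged frame; this is a different route, not the approach above:

# Look up each ID of df1 in df2 and fill the misses with 0; df1's index is never touched
df1['grade'] = df1['ID'].map(df2.set_index('ID')['grade']).fillna(0)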
EDIT
If instead of 0 you want to replace the NaN with a string, try this:
new_df = df1.merge(df2[["ID","grade"]], on="ID", how="left")
new_df["grade"] = new_df["grade"].fillna("No transaction possible")
new_df
Output:
        ID  Costperlb                    grade
0  ASX-112       4515  No transaction possible
1  YTR-789       5856                        1
2  ASX-124       3313  No transaction possible
3  UYT-908       9909                        4
4  TYE=456       8980                        3
5  ERW-234       9088                        5
6  UUI-675       6765                        1
7  GHV-805       3456  No transaction possible
8  NMB-653       9012                        1
9  WSX-123       1237  No transaction possible