When I import an Excel file as a DataFrame in pandas and try to get rid of the first column, I'm unable to do it even though I give index=None. What am I missing?
I ran into this, too. You can make the 'friend' column the index instead. That gets rid of the original index that comes with pd.read_excel(). As @ALollz says, a DataFrame always has an index; you get to choose what's in it.
data.set_index('friend', inplace=True)
See examples in the documentation on DataFrame.set_index().
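A minimal sketch (the file name is hypothetical, and it assumes the sheet has a 'friend' column):

import pandas as pd

# Hypothetical file name; assumes the sheet has a 'friend' column.
data = pd.read_excel('friends.xlsx')
data.set_index('friend', inplace=True)

# Alternatively, consume the leftover first column as the index while reading:
data = pd.read_excel('friends.xlsx', index_col=0)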
I have a function that returns tuples. When I apply it to my pandas DataFrame using the apply() function, the results look this way.
The Date here is an index and I am not interested in it.
I want to create two new columns in a dataframe and set their values to the values you see in these tuples.
How do I do this?
I tried the following:
This errors out, citing a mismatch between expected and available values. It is seeing these tuples as a single entity, so the two columns I specified on the left-hand side are a problem. It's expecting only one.
And what I need is to break it down into two parts that can be used to set two different columns.
What's the correct way to achieve this?
Make your function return a pd.Series; it will be expanded into a DataFrame.
orders.apply(lambda x: pd.Series(myFunc(x)), axis=1)
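A self-contained sketch, with a hypothetical myFunc standing in for the original tuple-returning function:

import pandas as pd

orders = pd.DataFrame({'qty': [2, 3], 'price': [5.0, 4.0]})

def myFunc(row):
    # Returns a 2-tuple, like the function in the question.
    return row['qty'] * row['price'], row['qty'] - row['price']

# Returning a Series (here with the target column names as its index)
# makes apply() expand the result into two columns.
orders[['a', 'b']] = orders.apply(lambda x: pd.Series(myFunc(x), index=['a', 'b']), axis=1)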
Alternatively, use zip:
orders['a'], orders['b'] = zip(*orders['your_column'])
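With the same hypothetical orders and myFunc as above, and assuming the tuples were first stored in an intermediate column:

# apply() leaves plain tuples alone, so this is a Series of tuples.
orders['pair'] = orders.apply(myFunc, axis=1)
orders['a'], orders['b'] = zip(*orders['pair'])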
I tried to do this:
districtident.rename(columns={'leaid':'nces'},inplace=True)
and it failed:
Other things that didn't work:
districtident = districtident.rename(columns={'leaid':'nces'})
Renaming the column names of pandas dataframe is not working as expected - python
renaming columns in pandas doesn't do anything
Those weren't great either.
Here's an obvious appeal:
Alas, no.
Restarting the kernel didn't work either. The only thing that worked was:
districtident['nces'] = districtident['leaid']
districtident.drop(['leaid'], axis=1, inplace=True)
But that's really not the best approach, I feel, especially if I need to do this for a number of columns.
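For what it's worth, when rename() silently does nothing, one common cause is hidden whitespace in the labels, so the key 'leaid' never matches. A sketch under that assumption:

# repr() exposes stray spaces such as ' leaid'
print([repr(c) for c in districtident.columns])

# Strip the whitespace, then rename as many columns as needed in one call:
districtident.columns = districtident.columns.str.strip()
districtident = districtident.rename(columns={'leaid': 'nces'})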
I am trying to convert a PySpark DataFrame to pandas, so I simply write df1 = df.toPandas(), and I get the error "ValueError: ordinal must be >= 1". Unfortunately, I don't see any other useful information in the error message (it's quite long, so I cannot post it here).
If somebody has an idea of what could be wrong, that would be nice.
I have only seen this error when a PySpark DataFrame had multiple columns with the same name, but that is not the case this time.
Thanks in advance.
Edit: I have experimented and found that the problem appears only if I select certain specific columns. But I don't see what could be wrong with those columns.
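The message itself usually comes from Python's date handling (datetime.date.fromordinal raises "ordinal must be >= 1"), so malformed or out-of-range dates in a timestamp column are a plausible suspect. A diagnostic sketch that converts one column at a time to isolate the culprit:

# Try each column separately and report which ones fail to convert.
for c in df.columns:
    try:
        df.select(c).toPandas()
    except ValueError as exc:
        print(c, exc)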
I am reading an excel file into pandas using pd.ExcelFile.
It reads correctly and I can print the dataframe. But when I try to select a subset of columns like:
subdf = origdf[['CUTOMER_ID', 'ASSET_BAL']]
I get error:
KeyError: "['CUTOMER_ID' 'ASSET_BAL'] not in index"
Do I need to define some kind of index here? When I printed the df, I verified that the columns are there.
Ensure that the columns actually exist in the dataframe. For example, you have written CUTOMER and not CUSTOMER, which I assume is the correct name.
You can verify the column names by using list(origdf.columns.values).
And for when you don't have a typo problem, here is a solution:
Use loc instead:
subdf = origdf.loc[:, ['CUSTOMER_ID', 'ASSET_BAL']].values
(I'd be glad to learn why this one works, though.)
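A guess as to why: older pandas versions let .loc silently reindex when a list contained missing labels (filling NaN) instead of raising, while recent versions raise a KeyError there too. A forward-compatible way to select only the columns that actually exist:

# Keep only the requested labels that are really present.
wanted = ['CUSTOMER_ID', 'ASSET_BAL']
subdf = origdf[origdf.columns.intersection(wanted)]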
I noticed a mechanism of auto-inserting when selecting rows by index. To illustrate, I use the following code:
Then I have two questions (maybe they are the same):
Is there any documentation of this mechanism? (I have tried to find it but cannot, even in the long, long official docs.)
How can I avoid the auto-inserting? For example, I want the last line of code to return only the 'a' row.
Thank you very much in advance!
I have not seen any documentation. It looks like an unintended artifact. I can think of some clever things to do with it but I wouldn't trust it.
Workaround:
df1.loc[pd.Index([1, 'a']).intersection(df1.index), :]
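A self-contained illustration (the original setup code was not shown, so the frame here is hypothetical). Note that newer pandas versions raise a KeyError for list indexers with missing labels, which makes the intersection spelling the forward-compatible one:

import pandas as pd

# Hypothetical stand-in for the frame from the question.
df1 = pd.DataFrame({'x': [10, 20]}, index=[1, 'a'])

# 'b' is not in the index; intersecting first drops it instead of
# inserting a NaN row (old behavior) or raising (current behavior).
df1.loc[pd.Index([1, 'a', 'b']).intersection(df1.index), :]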