I started writing this question in another form, but found a solution in the meantime. I have a dataframe shaped 55k x 4. For a couple of hours I couldn't understand why I can't drop the rows I need to drop. I had something like this:
print(df.shape)
indexes_to_drop = list()
for row in df.itertuples(index=True):
    if some_complex_function(row[1]):
        indexes_to_drop.append(row[0])
print(len(indexes_to_drop))
df = df.drop(index=indexes_to_drop)
print(df.shape)
My output was like:
55000 x 4
2500
52500 x 4
However, once I displayed some rows from my df, I was still able to find rows I thought were deleted. Of course my first thought was to check some_complex_function. But I logged everything it did, and it was just fine.
So, I tried a couple of other ways of deleting rows by index, for example:
df = df.drop(df.index[ignore_indexes])
Still, the shape was fine, but not the rows.
Then I tried iterrows() instead of itertuples(). Same thing.
I thought maybe there was something wrong with the indexing. You know: index number vs index label. I tested my code on a small dataframe and everything worked like a charm.
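That label-vs-position distinction is easy to reproduce on a toy frame (made-up data): both `itertuples(index=True)` and `drop(index=...)` speak in labels, not positions, so they diverge as soon as the frame has been filtered.

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30, 40]})
df = df[df['a'] > 10]   # positions are now 0..2, but labels are 1, 2, 3

# itertuples(index=True) yields the index *label* as row[0]
labels = [row[0] for row in df.itertuples(index=True)]
print(labels)           # [1, 2, 3]

# drop(index=...) also works on labels, so label 1 is position 0 here
print(df.drop(index=[1]).shape)
```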
Then I realized I do some stuff with my df before I run the code above. So I reset the index first:
df.reset_index(inplace=True, drop=True)
The indexes changed, starting to count from 0, but my results were still wrong.
Then I tried this: df.drop(index=indexes_to_drop, inplace=True)
And BOOM it worked.
Right now I'm not looking for the solution, as I apparently found one. I'd like to know WHY dropping rows not "inplace" didn't work. I don't get that.
Cheers!
I think that in pandas, for a Series s, s[0:2] is equivalent to s.iloc[0:2]: in both cases two rows are returned. But recently I ran into trouble. After changing the index labels, slicing still gave the expected output, but s[0] raised an error and I don't know why.
I can try to explain this behavior a bit. In Pandas you can select by position or by label, and it's important to remember that every row/column always has both a positional index and a label. For columns the distinction is easy to see, because we usually give columns string names. The difference is also obvious when you slice explicitly with .iloc vs .loc. Finally, s[X] is indexing, while s[X:Y] is slicing, and the behaviour of the two is different.
Example:
df = pd.DataFrame({'a':[1,2,3], 'b': [3,3,4]})
df.iloc[:,0]
df.loc[:,'a']
both will return
0 1
1 2
2 3
Name: a, dtype: int64
Now, what happened in your case is that you overwrote the index names when you declared s.index = [11,12,13,14]. You can see that by inspecting the index before and after this change. Before, if you run s.index, you see that it is a RangeIndex(start=0, stop=4, step=1). After you change the index, it becomes Int64Index([11, 12, 13, 14], dtype='int64').
Why does this matter? Because although you overrode the labels of the index, the position of each one of them remains the same as before. So, when you call
s[0:2]
you are slicing by position (this section of the documentation explains that it's equivalent to .iloc). However, when you run
s[0]
Pandas thinks you want to select by label, so it starts looking for the label 0, which doesn't exist anymore, because you overrode it. Think of square-bracket selection in the context of selecting a dataframe column: you would write df["column"], asking for the column by its label; the same holds for a Series.
In summary, what happens with Series indexing is the following:
If you use string index labels and you index with a string, Pandas will look up the string label.
If you use string index labels and you index with an integer, Pandas will fall back to indexing by position (that's why the example in your comment works).
If you use integer index labels and you index with an integer, Pandas will always try to index by the label (that's why the first case fails: you have overridden the label 0).
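All three cases can be checked in a few lines (made-up values; the positional fallback for integer keys on string-labelled Series is deprecated in recent pandas versions):

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40])
s.index = [11, 12, 13, 14]     # overwrite the default RangeIndex labels

print(s[0:2])                  # slicing -> positional: returns the rows labelled 11 and 12
print(s[11])                   # integer labels + integer key -> label lookup, returns 10
try:
    s[0]                       # label 0 no longer exists
except KeyError:
    print("KeyError: 0")

t = pd.Series([1, 2], index=['a', 'b'])
print(t[0])                    # string labels + integer key -> positional fallback, returns 1
```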
Here are two articles explaining this bizarre behavior:
Indexing Best Practices in Pandas.series
Retrieving values in a Series by label or position
I am new to Python. Could someone help me understand why the first statement generates no error but the second one generates a KeyError (I do not even know what a KeyError is)?
P.S.: data is a DataFrame; 'StrategyCumulativePct' and 'BuyHold' are two columns in the DataFrame.
data[['StrategyCumulativePct', 'BuyHold']].plot()
data['StrategyCumulativePct', 'BuyHold'].plot()
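For context (a toy frame, not the asker's data): double brackets pass a list of column labels and return a DataFrame, while single brackets with a comma pass the single tuple ('StrategyCumulativePct', 'BuyHold'), which pandas then looks up as one column label and fails to find, hence the KeyError.

```python
import pandas as pd

data = pd.DataFrame({'StrategyCumulativePct': [0.1, 0.2],
                     'BuyHold': [0.3, 0.4]})

sub = data[['StrategyCumulativePct', 'BuyHold']]   # list of labels -> two-column DataFrame
print(sub.shape)

try:
    data['StrategyCumulativePct', 'BuyHold']       # tuple label -> no such column
except KeyError as e:
    print('KeyError:', e)
```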
On the other hand, may I ask why, when I have only written 20 lines of code, there can be errors pointing an arrow at line 2000 or 3000, which I never wrote? Thanks.
I am completely new to coding and started to experiment with Python and pandas. Quite an adventure, and I am learning a lot. I found a lot of solutions here on Stack already, but not for my latest quest.
With pandas I imported and edited a txt file so that I could export it as a csv file. But to import this csv file into another program, the header row needs to start on row number 20. So I actually need 19 empty rows above it.
Can somebody guide me in the right direction?
You can prepend an empty dataframe to your dataframe:
empty_data = [[''] * len(df.columns) for i in range(19)]
empty_df = pd.DataFrame(empty_data, columns=df.columns)
pd.concat((empty_df, df))
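One thing to watch out for: to_csv always writes the header before any data rows, so concatenating empty rows cannot move the header itself. If the goal is a CSV file whose header sits on line 20, a sketch that writes 19 blank lines first (the file name and data are placeholders):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})  # stand-in for the imported data

# 'out.csv' is a placeholder path
with open('out.csv', 'w', newline='') as f:
    f.write('\n' * 19)            # 19 empty rows
    df.to_csv(f, index=False)     # header lands on line 20
```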
I was analysing some developed code and found something like this:
val newDF = df.repartition(1).withColumn("name", lit("xyz")).orderBy(col("count").asc)
Later at a different module, this newDF was reused as below
newDF.repartition(1).write.format("csv").save(path/of/file)
Now my doubt is: since the same dataframe is repartitioned in two places, with an orderBy applied after the first repartition, won't the second repartition shuffle the data again and make the orderBy void?
I have a dataframe which I am doing some work on
d={'x':[2,8,4,-5,4,5,-3,5],'y':[-.12,.35,.3,.15,.4,-.5,.6,.57]}
df=pd.DataFrame(d)
df['x_even']=df['x']%2==0
subdf: get all rows where x is negative, then square x and multiply y by 100
subdf=df[df.x<0]
subdf['x']=subdf.x**2
subdf['y']=subdf.y*100
subdf's work is complete. I am not sure how I can incorporate these changes back into the master dataframe (df).
It looks like your current code gives you a SettingWithCopyWarning.
To avoid this you could do the following:
df.loc[df.x<0, 'y'] = df.loc[df.x<0, 'y']*100
df.loc[df.x<0, 'x'] = df.loc[df.x<0, 'x']**2
This changes your df without raising a warning, and there is no need to merge anything back.
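Applied to the question's sample data, a quick check of the two .loc assignments (the mask is computed once here, so the order of the two updates doesn't matter):

```python
import pandas as pd

d = {'x': [2, 8, 4, -5, 4, 5, -3, 5],
     'y': [-.12, .35, .3, .15, .4, -.5, .6, .57]}
df = pd.DataFrame(d)

mask = df.x < 0
df.loc[mask, 'y'] = df.loc[mask, 'y'] * 100
df.loc[mask, 'x'] = df.loc[mask, 'x'] ** 2

print(df.loc[mask, ['x', 'y']])   # rows 3 and 6 updated in place
```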
pd.merge(subdf, df, how='outer')
This does what I was asking for. Thanks for the tip, Primer.