I want to delete certain rows, removing both the ZIPCODE and AV_LAND values. For instance, I want to delete rows 1 and 2. How would I do that? In addition, I want to reset the index once I have deleted all the rows I don't need.
ZIPCODE AV_LAND
0 02108 2653506
1 02109 5559661
2 02110 11804931
3 02134 4333212
You can use drop:
df.drop([1, 2]).reset_index(drop=True)
Out:
ZIPCODE AV_LAND
0 02108 2653506
1 02134 4333212
This is not an in-place operation, so if you want to change the original DataFrame you need to assign it back: df = df.drop([1, 2]).reset_index(drop=True)
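Putting the answer together with the example data, a minimal runnable sketch (storing ZIPCODE as strings is an assumption here, so the leading zeros survive):

```python
import pandas as pd

# Rebuild the sample frame; ZIPCODE kept as strings to preserve leading zeros.
df = pd.DataFrame({
    'ZIPCODE': ['02108', '02109', '02110', '02134'],
    'AV_LAND': [2653506, 5559661, 11804931, 4333212],
})

# Drop the rows with index labels 1 and 2, then renumber from 0.
df = df.drop([1, 2]).reset_index(drop=True)
print(df)
```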
I have a dataframe with columns A (a continuous variable) and B (discrete, 1 or 0). The df is initially sorted by variable A.
I need to order the dataframe so that each set of X rows has Y rows with value 1 in column B and (X-Y) rows with 0 (when possible!), while variable A stays in descending order within each set. X and Y are input by the user.
Example:
X=4, Y=3
Rows 0-11 are ok, since the sets (0-3), (4-7) and (8-11) each have 3 rows with 1 in column B and only one row with 0, AND variable A is descending. However, rows 12-15 are not ok, since there are 2 rows with 1 (variable B) and two with 0. Row 17 would replace row 15 to make this set valid. There is no problem if the last rows have 0 in column B, since there aren't any left with value 1.
The code should be general enough to run on dataframes with different number of rows.
Any ideas?
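One possible greedy approach: split the rows into the B==1 and B==0 groups, each sorted by A descending, and stitch them back together Y ones and (X-Y) zeros at a time. This is only a sketch under those assumptions (column names A/B and the greedy chunking are mine, not from the thread):

```python
import pandas as pd

def interleave_sets(df, X, Y, a='A', b='B'):
    """Greedy sketch: build consecutive sets of X rows with Y ones and
    (X - Y) zeros in column b, keeping column a descending in each set."""
    ones = df[df[b] == 1].sort_values(a, ascending=False)
    zeros = df[df[b] == 0].sort_values(a, ascending=False)
    chunks, i, j = [], 0, 0
    while i < len(ones) or j < len(zeros):
        chunk = pd.concat([ones.iloc[i:i + Y], zeros.iloc[j:j + X - Y]])
        if chunk.empty:  # guard against degenerate inputs such as X == Y
            break
        chunks.append(chunk.sort_values(a, ascending=False))
        i, j = i + Y, j + X - Y
    return pd.concat(chunks).reset_index(drop=True)

df = pd.DataFrame({'A': [10, 9, 8, 7, 6, 5], 'B': [1, 1, 0, 1, 1, 0]})
out = interleave_sets(df, X=3, Y=2)
print(out)
```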
I have a dataframe with duplicated rows except for one column value. I want to drop the row whose value is None when the id is the same (not all the rows are duplicated):
a b
1 1 None
2 1 7
3 2 2
4 3 4
I need to drop the first row, where a is duplicated (1) and the value of b is None.
You can use duplicated and also check for None. That identifies the row you want to drop, so use ~ to invert the mask (keeping everything but that row) and get the expected result. EDIT: Passing keep=False marks all duplicates, so the order doesn't matter.
df[~((df['b'].isnull()) & (df.duplicated('a', keep=False)))]  # if None is a null value
or
df[~((df['b'] == 'None') & (df.duplicated('a', keep=False)))]  # if 'None' is a string
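A runnable sketch of the null-value variant, rebuilding the sample frame (None becomes NaN once the column is numeric):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 3], 'b': [None, 7, 2, 4]})

# keep=False flags every duplicated 'a' value, so this works regardless of
# whether the None row comes first or last among the duplicates.
cleaned = df[~(df['b'].isnull() & df.duplicated('a', keep=False))]
print(cleaned)
```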
I have the following:
C1 C2 C3
0 0 0 1
1 0 0 1
2 0 0 1
And I would like to get the corresponding column index value that has 1's, so the result
should be "C3".
I know how to do this by transposing the dataframe and then getting the index values, but this is not ideal for the data in my dataframes, and I wonder whether there might be a more efficient solution?
I save the result in a list, since more than one column could have values equal to 1. You can use DataFrame.loc.
If all column values must be 1, you can use:
df.loc[:,df.eq(1).all()].columns.tolist()
Output:
['C3']
If that isn't required, use:
df.loc[:,df.eq(1).any()].columns.tolist()
or, as suggested by @piRSquared, you can select directly from df.columns:
[*df.columns[df.eq(1).all()]]
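Both variants in a runnable sketch, rebuilding the sample frame:

```python
import pandas as pd

df = pd.DataFrame({'C1': [0, 0, 0], 'C2': [0, 0, 0], 'C3': [1, 1, 1]})

# Columns where every value equals 1:
all_ones = df.columns[df.eq(1).all()].tolist()
# Columns where at least one value equals 1:
any_ones = df.columns[df.eq(1).any()].tolist()
print(all_ones, any_ones)
```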
I wanted to create a DataFrame with 2 columns, one called 'id' and one called 'SalePrice':
submission = pd.DataFrame({'SalePrice':pre})
It looks like this
SalePrice
0 183242.025920
1 188796.451732
2 187878.763989
3 179789.672031
I know that I can name the index, but instead I need it as a normal column, on the same level as SalePrice. Does anyone know how to do that?
Try creating it with the DataFrame constructor:
submission = pd.DataFrame({'SalePrice': pre, 'id': np.arange(len(pre))})
Just use reset_index, as @Andy L. suggested. Here's the full code:
submission = pd.DataFrame({'SalePrice':[1,2,3,4]}).reset_index()
submission.rename(columns = {'index':'id'}, inplace=True)
print(submission)
The output:
id SalePrice
0 0 1
1 1 2
2 2 3
3 3 4
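If a recent pandas is available (the names parameter of reset_index was added in pandas 1.5; an assumption about your environment), the rename step can be folded into one call:

```python
import pandas as pd

# names= labels the former index column directly (requires pandas >= 1.5).
submission = pd.DataFrame({'SalePrice': [1, 2, 3, 4]}).reset_index(names='id')
print(submission)
```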
I have a DataFrame with the following structure.
df = pd.DataFrame({'tenant_id': [1,1,1,2,2,2,3,3,7,7], 'user_id': ['ab1', 'avc1', 'bc2', 'iuyt', 'fvg', 'fbh', 'bcv', 'bcb', 'yth', 'ytn'],
'text':['apple', 'ball', 'card', 'toy', 'sleep', 'happy', 'sad', 'be', 'u', 'pop']})
This gives the following output:
df = df[['tenant_id', 'user_id', 'text']]
tenant_id user_id text
1 ab1 apple
1 avc1 ball
1 bc2 card
2 iuyt toy
2 fvg sleep
2 fbh happy
3 bcv sad
3 bcb be
7 yth u
7 ytn pop
I would like to groupby on tenant_id and create a new column which is a random selection of strings from the user_id column.
Thus, I would like my output to look like the following:
tenant_id user_id text new_column
1 ab1 apple [ab1, bc2]
1 avc1 ball [ab1]
1 bc2 card [avc1]
2 iuyt toy [fvg, fbh]
2 fvg sleep [fbh]
2 fbh happy [fvg]
3 bcv sad [bcb]
3 bcb be [bcv]
7 yth u [pop]
7 ytn pop [u]
Here, random ids from the user_id column have been selected; these ids can repeat, as "fvg" is repeated for tenant_id=2. I would like a threshold of not more than ten ids. This data is just a sample with only 10 ids to start with, so in general any number much less than the total number of user_ids; in this case, say one less than the total user_ids that belong to a tenant.
I first tried figuring out how to select a random subset of varying length with
df.sample
new_column = df.user_id.sample(n=np.random.randint(1, 10))
I am kinda lost after this; assigning it to my df results in NaNs, probably because they are of variable lengths. Please help.
Thanks.
Per my comment:
Your 'new column' is not a new column, it's a new cell for a single row.
If you want to assign the result to a new column, you need to create a new column and apply the cell computation to it. Converting the sample to a list keeps each cell self-contained instead of aligning on the sampled index (which is what produced your NaNs):
df['new_column'] = df['user_id'].apply(lambda x: df.user_id.sample(n=np.random.randint(1, 10)).tolist())
It doesn't really matter which column you use for the apply, since the variable is not used in the computation.
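The apply above draws from the whole user_id column, not just the row's tenant. A hedged sketch that samples only within each tenant's pool; the size cap of min(10, group size - 1) is my reading of the question, not something the thread settled on:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'tenant_id': [1, 1, 1, 2, 2, 2, 3, 3, 7, 7],
    'user_id': ['ab1', 'avc1', 'bc2', 'iuyt', 'fvg', 'fbh', 'bcv', 'bcb', 'yth', 'ytn'],
    'text': ['apple', 'ball', 'card', 'toy', 'sleep', 'happy', 'sad', 'be', 'u', 'pop'],
})

# One pool of user_ids per tenant.
pools = df.groupby('tenant_id')['user_id'].agg(list)

def pick(tenant):
    pool = pools[tenant]
    # At most 10 ids, and fewer than the tenant's total where possible.
    upper = max(1, min(10, len(pool) - 1))
    k = np.random.randint(1, upper + 1)
    return list(np.random.choice(pool, size=k, replace=False))

df['new_column'] = df['tenant_id'].apply(pick)
print(df)
```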