Subtracting the 2nd column of the 2nd DataFrame from the 2nd column of the 1st DataFrame based on a condition - dataframe

I have 2 dataframes, each with 2 columns and a different number of rows. I want to subtract the 2nd column of my 2nd dataframe from the 2nd column of my 1st dataframe and store the result in another dataframe. NOTE: subtract only the rows whose values in column 1 match.
e.g. from df:
       col1     col2
29752  35023.0  40934.0
subtract the row of df2 with the same column-1 value:
     c1       c2
962  35023.0  40935.13
Here is my first Dataframe:
col1 col2
0 193431.0 40955.41
1 193432.0 40955.63
2 193433.0 40955.89
3 193434.0 40956.31
4 193435.0 40956.43
... ...
29752 35023.0 40934.89
29753 35024.0 40935.00
29754 35025.0 40935.13
29755 35026.0 40934.85
29756 35027.0 40935.18
Here is my 2nd dataframe:
c1 c2
0 194549.0 41561.89
1 194550.0 41563.96
2 194551.0 41563.93
3 194552.0 41562.75
4 194553.0 41561.22
.. ... ...
962 35027.0 41563.80
963 35026.0 41563.18
964 35025.0 41563.87
965 35024.0 41563.97
966 35023.0 41564.02

You can iterate over the rows of the first dataframe and, whenever col1 of the row equals the key column of df2 at the same index, subtract the column values. Note that df2's columns are named c1 and c2, and that the index must also exist in df2:
for i, row in df1.iterrows():
    if i in df2.index and row["col1"] == df2["c1"][i]:
        diff = row["col2"] - df2["c2"][i]
        print(i, diff)
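The loop above only finds matches that sit at the same index position in both frames. Since the two dataframes have different lengths and row orders, a merge on the key columns is more robust. A minimal sketch using a few rows from the question (the small frames here are illustrative):

```python
import pandas as pd

df1 = pd.DataFrame({"col1": [35023.0, 35024.0, 35025.0],
                    "col2": [40934.89, 40935.00, 40935.13]})
df2 = pd.DataFrame({"c1": [35025.0, 35023.0, 35024.0],
                    "c2": [41563.87, 41564.02, 41563.97]})

# align the rows on the key columns, regardless of index or row order
merged = pd.merge(df1, df2, left_on="col1", right_on="c1")

# subtract df2's value from df1's and keep the result in a new frame
result = merged[["col1"]].assign(diff=merged["col2"] - merged["c2"])
print(result)
```

Rows whose key appears in only one of the frames are dropped automatically by the default inner join.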

conditional filtering on rows rather than columns

Given a table
   col_0  col_1  col_2
0  a_00   a_01   a_02
1  a_10   nan    a_12
2  a_20   a_21   a_22
If I want to return all rows such that col_1 does not contain nan, it can easily be done with df[df['col_1'].notnull()], which returns
   col_0  col_1  col_2
0  a_00   a_01   a_02
2  a_20   a_21   a_22
If I would like to return all columns such that row 1 does not contain nan, what should I do? The following is the result that I want:
   col_0  col_2
0  a_00   a_02
1  a_10   a_12
2  a_20   a_22
I can transpose the dataframe, remove rows on the transposed dataframe, and transpose back, but that would become inefficient if the dataframe is huge. I also tried
df.loc[df.loc[0].notnull()]
but the code gives me an error. Any ideas?
You can use the pandas DataFrame.dropna() function for this.
case 1: drop all columns that contain nan values -
ex: df.dropna(axis = 1)
axis = 0 refers to the horizontal axis (rows) and axis = 1 refers to the vertical axis (columns).
case 2: drop columns based only on the first n rows -
ex: df[:n].dropna(axis = 1)
case 3: drop within a subset of columns -
ex: df[["col_1","col_2"]].dropna(axis = 1)
it will drop nan values within these two columns
note: if you want to make the change permanent, use inplace = True (df.dropna(axis = 1, inplace = True)) or assign the result to another variable (df2 = df.dropna(axis = 1)).
Boolean indexing with loc along columns axis
df.loc[:, df.iloc[1].notna()]
Result
col_0 col_2
0 a_00 a_02
1 a_10 a_12
2 a_20 a_22
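For completeness, here is the boolean-indexing answer run end to end on the example table, with np.nan standing in for the missing cell:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col_0": ["a_00", "a_10", "a_20"],
                   "col_1": ["a_01", np.nan, "a_21"],
                   "col_2": ["a_02", "a_12", "a_22"]})

# keep every column whose value in row 1 is not NaN
out = df.loc[:, df.iloc[1].notna()]
print(out.columns.tolist())  # ['col_0', 'col_2']
```

No transposing is needed: df.iloc[1].notna() is a boolean Series indexed by column name, which .loc accepts directly as a column selector.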

return column name of min greater than 0 pandas

I have a dataframe with one date column and the rest numeric columns, something like this:
date col1 col2 col3 col4
2020-1-30 0 1 2 3
2020-2-1 0 2 3 4
2020-2-2 0 2 2 5
I now want to find the name of the column with the minimum sum, considering only sums greater than 0. In the above case it should give me col2, because its sum (5) is the smallest of all the columns other than col1, which sums to 0. Appreciate any help with this.
I would use:
# get only numeric columns
df2 = df.select_dtypes('number')
# drop the columns with 0, compute the sum
# get index of min
out = df2.loc[:, df2.ne(0).all()].sum().idxmin()
If you want to ignore a column only if all values are 0, use any in place of all:
df2.loc[:, df2.ne(0).any()].sum().idxmin()
Output: 'col2'
all minima
# get only numeric columns
df2 = df.select_dtypes('number')
# drop the columns with 0, compute the sum
s = df2.loc[:, df2.ne(0).any()].sum()
# get all minimal
out = s[s.eq(s.min())].index.tolist()
Output:
['col2']
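Putting the pieces together on the question's data (the date column is kept as strings here, which is enough for select_dtypes to exclude it):

```python
import pandas as pd

df = pd.DataFrame({"date": ["2020-1-30", "2020-2-1", "2020-2-2"],
                   "col1": [0, 0, 0],
                   "col2": [1, 2, 2],
                   "col3": [2, 3, 2],
                   "col4": [3, 4, 5]})

df2 = df.select_dtypes("number")       # drop the non-numeric date column
s = df2.loc[:, df2.ne(0).any()].sum()  # sums of columns that are not all zero
print(s.idxmin())  # col2 (sum 5)
```

idxmin returns only the first minimum; the s[s.eq(s.min())] variant above is the one to use when ties are possible.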

Concatenate single row dataframe with multiple row dataframe

I have a dataframe with large number of columns but single row as df1:
Col1 Col2 Price Qty
A B 16 5
I have another dataframe as follows, df2:
Price Qty
8 2.5
16 5
6 1.5
I want to achieve the following:
Col1 Col2 Price Qty
A B 8 2.5
A B 16 5
A B 6 1.5
Essentially I am taking the row of df1 and repeating it while concatenating with df2, but bringing the Price and Qty columns from df2 to replace the ones originally present in df1.
I am not sure how to proceed with the above.
I believe the following approach will work:
import numpy as np
import pandas as pd

# first, repeat the single row of df1 as many times as there are rows in df2
df1 = pd.DataFrame(np.repeat(df1.values, len(df2.index), axis=0), columns=df1.columns)
# reset the indexes of both DataFrames just to be safe
df1.reset_index(inplace=True)
df2.reset_index(inplace=True)
# now merge the two DataFrames on the index,
# after dropping the Price and Qty columns from df1
df3 = pd.merge(df1.drop(['Price', 'Qty'], axis=1), df2, left_index=True, right_index=True)
# finally, drop the leftover index columns
df3.drop(['index_x', 'index_y'], inplace=True, axis=1)
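An equivalent shortcut is a cross merge (available since pandas 1.2), which repeats df1's row once per row of df2 without any manual index bookkeeping:

```python
import pandas as pd

df1 = pd.DataFrame({"Col1": ["A"], "Col2": ["B"], "Price": [16], "Qty": [5]})
df2 = pd.DataFrame({"Price": [8, 16, 6], "Qty": [2.5, 5, 1.5]})

# cross-join the non-price columns of df1 with every row of df2
df3 = df1.drop(columns=["Price", "Qty"]).merge(df2, how="cross")
print(df3)
```

Dropping Price and Qty from df1 before the merge ensures df2's versions of those columns appear in the result without suffixes.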

How to make a scatter plot from unique values of df column against index where they first appear?

I have a data frame df with the shape (100, 1)
point
0 1
1 12
2 13
3 1
4 1
5 12
...
I need to make a scatter plot of unique values from column 'point'.
I tried to drop duplicates, move the indexes of the unique values to a column called 'indeks', and then plot:
uniques = df.drop_duplicates(keep=False)
uniques.loc['indeks'] = uniques.index
and I get:
ValueError: cannot set a row with mismatched columns
Is there a smart way to plot only unique values where they first appear?
Use DataFrame.drop_duplicates with no parameters if you need only the first occurrence of each value, and drop the .loc when creating the new column:
uniques = df.drop_duplicates().copy()
uniques['indeks'] = uniques.index
print (uniques)
point indeks
0 1 0
1 12 1
2 13 2
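The resulting frame can then be scattered directly. A sketch of the full flow on a shortened version of the data (the plot call is shown commented out, since it assumes matplotlib is installed):

```python
import pandas as pd

df = pd.DataFrame({"point": [1, 12, 13, 1, 1, 12]})

# keep the first occurrence of each value; its index is where it first appears
uniques = df.drop_duplicates().copy()
uniques["indeks"] = uniques.index
print(uniques)

# scatter of each unique value against its first-appearance index:
# uniques.plot.scatter(x="indeks", y="point")
```

Note that keep=False (as in the question) removes every duplicated value entirely, whereas the default keep='first' retains the first occurrence, which is what the plot needs.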

Deleting/Selecting rows from pandas based on conditions on multiple columns

From a pandas dataframe, I need to delete specific rows based on a condition applied to two columns of the dataframe.
The dataframe is
0 1 2 3
0 -0.225730 -1.376075 0.187749 0.763307
1 0.031392 0.752496 -1.504769 -1.247581
2 -0.442992 -0.323782 -0.710859 -0.502574
3 -0.948055 -0.224910 -1.337001 3.328741
4 1.879985 -0.968238 1.229118 -1.044477
5 0.440025 -0.809856 -0.336522 0.787792
6 1.499040 0.195022 0.387194 0.952725
7 -0.923592 -1.394025 -0.623201 -0.738013
I need to delete the rows where the absolute difference between the first and second columns is less than a threshold t:
abs(column1.iloc[index]-column2.iloc[index]) < t
I have seen examples where conditions are applied individually on column values but did not find anything where a row is deleted based on a condition applied on multiple columns.
First select the columns by position with DataFrame.iloc, subtract, take Series.abs, compare against the threshold with the inverted operator (since we keep rows rather than delete them, < becomes >= or >), and filter by boolean indexing:
df = df[(df.iloc[:, 0]-df.iloc[:, 1]).abs() >= t]
If you need to select the columns by name, here 0 and 1:
df = df[(df[0]-df[1]).abs() >= t]
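Run on the first two rows of the question's frame with an illustrative threshold of t = 1.0 (the question leaves the value of t open, so this number is an assumption):

```python
import pandas as pd

df = pd.DataFrame({0: [-0.225730, 0.031392],
                   1: [-1.376075, 0.752496],
                   2: [0.187749, -1.504769],
                   3: [0.763307, -1.247581]})
t = 1.0  # illustrative threshold

# keep rows where |col 0 - col 1| >= t, i.e. delete the "close" rows
out = df[(df[0] - df[1]).abs() >= t]
print(out.index.tolist())
```

Row 0 survives (|-0.2257 - (-1.3761)| ≈ 1.15 >= 1.0) while row 1 is deleted (|0.0314 - 0.7525| ≈ 0.72 < 1.0).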