conditional filtering on rows rather than columns - pandas

Given a table

  col_0 col_1 col_2
0  a_00  a_01  a_02
1  a_10   nan  a_12
2  a_20  a_21  a_22
If I want to return all rows such that col_1 does not contain nan, it can easily be done with df[df['col_1'].notnull()], which returns

  col_0 col_1 col_2
0  a_00  a_01  a_02
2  a_20  a_21  a_22
If I would instead like to return all columns such that row 1 does not contain nan, what should I do? The following is the result that I want:

  col_0 col_2
0  a_00  a_02
1  a_10  a_12
2  a_20  a_22
I could transpose the dataframe, remove the rows on the transposed dataframe, and transpose it back, but that becomes inefficient when the dataframe is huge. I also tried
df.loc[df.loc[0].notnull()]
but the code gives me an error. Any ideas?

You can use the pandas DataFrame.dropna() function for this.
Case 1: drop every column that contains a nan value:
ex: df.dropna(axis=1)
axis=0 refers to the rows and axis=1 refers to the columns, so dropna(axis=1) drops columns containing nan.
Case 2: drop columns based only on the first n rows:
ex: df[:n].dropna(axis=1)
Case 3: drop within a subset of columns:
ex: df[["col_1", "col_2"]].dropna(axis=1)
This will drop nan values only within these two columns.
Note: if you want to make the change permanent, use inplace=True (df.dropna(axis=1, inplace=True)) or assign the result to another variable (df2 = df.dropna(axis=1)).
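
A minimal runnable sketch of the dropna approach on the question's frame; here df[:2].dropna(axis=1) looks only at rows 0 and 1, so col_1 is dropped because of the nan in row 1 (note the slice also limits which rows appear in the result):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col_0": ["a_00", "a_10", "a_20"],
    "col_1": ["a_01", np.nan, "a_21"],
    "col_2": ["a_02", "a_12", "a_22"],
})

print(df.dropna(axis=1))       # drops col_1: it contains a nan somewhere
print(df[:2].dropna(axis=1))   # decides using only the first 2 rows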

Boolean indexing with loc along the columns axis:
df.loc[:, df.iloc[1].notna()]
Result
col_0 col_2
0 a_00 a_02
1 a_10 a_12
2 a_20 a_22

Related

Subtracting the 2nd column of the 2nd DataFrame from the 2nd column of the 1st DataFrame based on conditions

I have 2 dataframes with 2 columns each and different numbers of rows. I want to subtract the 2nd column of my 2nd dataframe from the 2nd column of my 1st dataframe, and store the results in another dataframe. NOTE: subtract only the values whose entries in column 1 match.
e.g. this row of df:

          col1     col2
29752  35023.0  40934.0

has the matching row of df2 (same value in column 1) subtracted from it:

       c1        c2
962  35023.0  40935.13
Here is my first Dataframe:
col1 col2
0 193431.0 40955.41
1 193432.0 40955.63
2 193433.0 40955.89
3 193434.0 40956.31
4 193435.0 40956.43
...      ...       ...
29752 35023.0 40934.89
29753 35024.0 40935.00
29754 35025.0 40935.13
29755 35026.0 40934.85
29756 35027.0 40935.18
Here is my 2nd dataframe;
c1 c2
0 194549.0 41561.89
1 194550.0 41563.96
2 194551.0 41563.93
3 194552.0 41562.75
4 194553.0 41561.22
.. ... ...
962 35027.0 41563.80
963 35026.0 41563.18
964 35025.0 41563.87
965 35024.0 41563.97
966 35023.0 41564.02
You can iterate over each row of the first dataframe; if col1 of the row equals the value of c1 in df2 at the same index, subtract the corresponding c2 value. Note the second frame's columns are named c1 and c2, and the frames have different lengths, so guard against missing indices:

for i, row in df1.iterrows():
    # only compare where df2 actually has a row with this index
    if i in df2.index and row["col1"] == df2["c1"][i]:
        diff = row["col2"] - df2["c2"][i]
        print(i, diff)
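
Iterating only compares rows that happen to share an index label, and it is slow on frames this size. A vectorized sketch that matches rows by value with an inner merge instead (assuming the column names col1/col2 and c1/c2 from the question, and df1/df2 defined as above):

import pandas as pd

# align the two frames on their key columns, then subtract
merged = df1.merge(df2, left_on="col1", right_on="c1", how="inner")
merged["diff"] = merged["col2"] - merged["c2"]
print(merged[["col1", "diff"]])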

return column name of min greater than 0 pandas

I have a dataframe with one date column and the rest numeric columns, something like this:
date col1 col2 col3 col4
2020-1-30 0 1 2 3
2020-2-1 0 2 3 4
2020-2-2 0 2 2 5
I now want to find the name of the column with the minimum column sum, considering only sums greater than 0. So in the above case I want it to give me col2, because its sum (5) is the smallest of all columns other than col1, whose sum is 0. Appreciate any help with this.
I would use:
# get only numeric columns
df2 = df.select_dtypes('number')
# keep only the columns with no zeros, compute the sums,
# then take the name of the column with the minimal sum
out = df2.loc[:, df2.ne(0).all()].sum().idxmin()
If you want to ignore a column only if all values are 0, use any in place of all:
df2.loc[:, df2.ne(0).any()].sum().idxmin()
Output: 'col2'
all minima
# get only numeric columns
df2 = df.select_dtypes('number')
# keep the columns that have at least one nonzero value, compute the sums
s = df2.loc[:, df2.ne(0).any()].sum()
# get all columns attaining the minimum
out = s[s.eq(s.min())].index.tolist()
Output:
['col2']
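
A quick end-to-end check of the snippet above, rebuilding the question's frame (dates kept as plain strings for brevity):

import pandas as pd

df = pd.DataFrame({
    "date": ["2020-1-30", "2020-2-1", "2020-2-2"],
    "col1": [0, 0, 0],
    "col2": [1, 2, 2],
    "col3": [2, 3, 2],
    "col4": [3, 4, 5],
})

df2 = df.select_dtypes("number")
print(df2.loc[:, df2.ne(0).all()].sum().idxmin())  # -> col2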

Instead of appending value as a new column on the same row, pandas adds a new column AND new row

What I have below is an example of the type of concatenation that I am trying to do.
df = pd.DataFrame(np.array([1, 2, 3]).reshape((1, 3)), columns = ['col1', 'col2', 'col3'], index = ['a'])
df2 = pd.DataFrame() # already exists elsewhere in code
df2 = df2.append([df, pd.Series(1, name = 'label')])
The result I am hoping for is:
col1 col2 col3 label
a 1.0 2.0 3.0 1
but I get is
col1 col2 col3 0
a 1.0 2.0 3.0 NaN
0 NaN NaN NaN 1.0
I know that I'm joining these wrong, but I cannot seem to figure out how its done. Any advice?
This is because the series you are adding has an incompatible index. The original dataframe has ['a'] as the specified index and there is no index specified in the series. If you want to add a new column without specifying an index, the following will give you what you want:
df = pd.DataFrame(np.array([1, 2, 3]).reshape((1, 3)), columns = ['col1', 'col2', 'col3'], index = ['a'])
df2 = pd.DataFrame() # already exists elsewhere in code
df2 = df2.append([df]) # append the desired dataframe
df2['label'] = 1 # add a new column with the value 1 across all rows
print(df2.to_string())
col1 col2 col3 label
a 1 2 3 1
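
One caveat: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current pandas the append line above raises an AttributeError. A sketch of the same answer using pd.concat instead:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([1, 2, 3]).reshape((1, 3)),
                  columns=['col1', 'col2', 'col3'], index=['a'])

df2 = pd.concat([pd.DataFrame(), df])  # replaces df2.append([df])
df2['label'] = 1                       # add the column across all rows
print(df2)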

How to make a scatter plot from unique values of df column against index where they first appear?

I have a data frame df with the shape (100, 1)
point
0 1
1 12
2 13
3 1
4 1
5 12
...
I need to make a scatter plot of unique values from column 'point'.
I tried to drop duplicates and move indexes of unique values to a column called 'indeks', and then to plot:
uniques = df.drop_duplicates(keep=False)
uniques.loc['indeks'] = uniques.index
and I get:
ValueError: cannot set a row with mismatched columns
Is there a smart way to plot only unique values where they first appear?
Use DataFrame.drop_duplicates with no parameters if you need only the first occurrence of each value, and use plain column assignment instead of .loc for the new column:
uniques = df.drop_duplicates().copy()
uniques['indeks'] = uniques.index
print (uniques)
point indeks
0 1 0
1 12 1
2 13 2
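
To get the scatter plot the question actually asks for, a minimal matplotlib sketch on top of the uniques frame built above:

import matplotlib.pyplot as plt

# x = index where each value first appears, y = the unique value itself
plt.scatter(uniques['indeks'], uniques['point'])
plt.xlabel('indeks')
plt.ylabel('point')
plt.show()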

what is the simplest way to check for occurrence of character/substring in dataframe values?

Consider a pandas dataframe that has values such as 'a - b'. I would like to check for the occurrence of '-' anywhere across all values of the dataframe without looping through individual columns. Clearly, a check such as the following won't work:
if '-' in df.values
Any suggestions on how to check for this? Thanks.
I'd use stack() + .str.contains() in this case:
In [10]: df
Out[10]:
a b c
0 1 a - b w
1 2 c z
2 3 d 2 - 3
In [11]: df.stack().str.contains('-').any()
Out[11]: True
In [12]: df.stack().str.contains('-')
Out[12]:
0 a NaN
b True
c False
1 a NaN
b False
c False
2 a NaN
b False
c True
dtype: object
You can use replace to swap a regex match with something else, then check for equality:
df.replace('.*-.*', True, regex=True).eq(True)
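The result is a boolean mask with the same shape as df; to collapse it to a single True/False, chain .any().any() (one reduction per axis):

df.replace('.*-.*', True, regex=True).eq(True).any().any()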
One way is to flatten the values and use a list comprehension (this assumes all values are strings; cast with astype(str) first if they are not):
df = pd.DataFrame([['val1','a-b', 'val3'],['val4','3', 'val5']],columns=['col1','col2', 'col3'])
print(df)
Output:
col1 col2 col3
0 val1 a-b val3
1 val4 3 val5
Now, to search for -:
find_value = [val for val in df.values.flatten() if '-' in val]
print(find_value)
Output:
['a-b']
Using NumPy: np.core.defchararray.find(a, s) returns, for each element of a, the lowest index at which the substring s is found, or -1 if it is not present.
(np.core.defchararray.find(df.values.astype(str),'-') > -1).any()
returns True if '-' is present anywhere in df.
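
For completeness, a plain-pandas sketch of the same whole-frame check, casting everything to string first so non-string cells don't break str.contains (note that astype(str) would also turn negative numbers into strings containing '-'):

df.astype(str).stack().str.contains('-').any()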