How to make a scatter plot from unique values of df column against index where they first appear? - pandas

I have a data frame df with the shape (100, 1):
   point
0      1
1     12
2     13
3      1
4      1
5     12
...
I need to make a scatter plot of unique values from column 'point'.
I tried to drop duplicates and move indexes of unique values to a column called 'indeks', and then to plot:
uniques = df.drop_duplicates(keep=False)
uniques.loc['indeks'] = uniques.index
and I get:
ValueError: cannot set a row with mismatched columns
Is there a smart way to plot only unique values where they first appear?

Use DataFrame.drop_duplicates with no parameters if you only need the first occurrence of each unique value, and drop the .loc when creating the new column:
uniques = df.drop_duplicates().copy()
uniques['indeks'] = uniques.index
print (uniques)
point indeks
0 1 0
1 12 1
2 13 2
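With the indeks column in place, the scatter plot itself is one line. A minimal runnable sketch, using the sample values from the question (plotting assumes matplotlib is installed):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for interactive use
import pandas as pd

# sample data mirroring the question
df = pd.DataFrame({'point': [1, 12, 13, 1, 1, 12]})

# keep only the first occurrence of each value; its original index survives
uniques = df.drop_duplicates().copy()
uniques['indeks'] = uniques.index

# scatter each unique value against the index where it first appears
uniques.plot.scatter(x='indeks', y='point')
```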

pandas dataframe - how to find multiple column names with minimum values

I have a dataframe (small sample shown below, it has more columns), and I want to find the column names with the minimum values.
Right now, I have the following code to deal with it:
finaldf['min_pillar_score'] = finaldf.iloc[:, 2:9].idxmin(axis="columns")
This works fine, but does not return multiple values of column names in case there is more than one instance of minimum values. How can I change this to return multiple column names in case there is more than one instance of the minimum value?
Please note, I want row wise results, i.e. minimum column names for each row.
Thanks!
Try the code below and see whether the output format matches what you anticipated; it produces the intended result.
The result will be stored in mins.
# start with the single minimum column per row
mins = df.idxmin(axis="columns")
# then replace each entry with every column that ties for that row's minimum
for i, r in df.iterrows():
    mins[i] = list(r[r == r[mins[i]]].index)
The question "Get column name where value is something in pandas dataframe" might also be helpful.
EDIT: adding an image of the output and the full code context.
Assuming this input as df:
A B C D
0 5 8 9 5
1 0 0 1 7
2 6 9 2 4
3 5 2 4 2
4 4 7 7 9
You can use the underlying numpy array to get the DataFrame-wide minimum value, then compare all values to that minimum and get the columns that contain a match:
s = df.eq(df.to_numpy().min()).any()
list(s[s].index)
output: ['A', 'B']
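Note that the snippet above compares against the minimum of the whole DataFrame; since the question asks for row-wise results, a vectorized sketch (reusing the same sample df) compares each row to its own minimum instead:

```python
import pandas as pd

df = pd.DataFrame({'A': [5, 0, 6, 5, 4],
                   'B': [8, 0, 9, 2, 7],
                   'C': [9, 1, 2, 4, 7],
                   'D': [5, 7, 4, 2, 9]})

# boolean frame: True wherever a value equals its own row's minimum
is_min = df.eq(df.min(axis=1), axis=0)

# per row, collect the column names that tie for the minimum
mins = is_min.apply(lambda r: r[r].index.tolist(), axis=1)
# mins[0] -> ['A', 'D'], mins[1] -> ['A', 'B']
```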

How to get a value of a column from another df based on index? pandas

I have 2 data frames, and I'd like to pull data from the second data frame into the first, based on their index. The catch is that I do this iteratively, and the column index of the first df increases by one with each iteration, which causes an error.
example to that would be:
First df after first iteration:
0
440 7.691
Second df after first iteration (doesn't change after each iteration):
1
0 M
1 M
2 M
3 M
4 M
.. ..
440 B
441 M
442 M
When I run the code, I get the df I want:
df_with_label = first_df.join(self.second_df)
0 1
440 7.691 B
After the second iteration, my first df is now:
1
3 10.72
and when I run the same df_with_label = first_df.join(self.second_df), I'd like to get:
1 2
3 10.72 M
But I get the error:
ValueError: columns overlap but no suffix specified: Int64Index([1], dtype='int64')
I'm guessing the problem is that the column index of the first df is 1 after the second iteration, but I don't know how to fix it.
I'd like to keep the column index of the first df increasing.
The best solution would be to give the second column a different name, like:
1 class
3 10.72 M
Any idea how to fix it?
If I understood correctly, your second dataframe doesn't change between iterations, so you can just rename its column once and for all:
second_df.columns = ['colname']
This should resolve the naming conflict.
Try:
df_with_label = first_df.join(self.second_df, rsuffix = "_2")
The thing is, first_df and second_df both have a column named 1, so rsuffix appends "_2" to second_df's column name: "1" becomes "1_2". join matches on the indexes, and by default every column of both frames is kept, so you need to avoid naming conflicts.
REF
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html
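Both suggestions can be sketched end to end; the frames below are hypothetical stand-ins mirroring the question's second iteration:

```python
import pandas as pd

# first df after the second iteration: one column labeled 1
first_df = pd.DataFrame({1: [10.72]}, index=[3])
# second df: one column also labeled 1 (never changes)
second_df = pd.DataFrame({1: ['M', 'M', 'M', 'M']}, index=[0, 1, 2, 3])

# option 1: disambiguate the overlapping column with rsuffix
df_with_label = first_df.join(second_df, rsuffix="_2")

# option 2: rename the second df's column once, then join with no conflict
both = first_df.join(second_df.rename(columns={1: 'class'}))
```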

How to create new pandas column by vlookup-like procedure on another data-frame

I have a dataframe that looks like this. It will be used to map values using two categorical variables. Maybe converting this to a dictionary would be better.
The 2nd data-frame is very large. I want to take the values from the categorical variables to create a new attribute (column) based on the 1st data-frame.
For example...
A row with FICO_cat of (700,720] and OrigLTV_cat of (75,80] would receive a value of 5.
A row with FICO_cat of (700,720] and OrigLTV_cat of (85,90] would receive a value of 6.
Is there an efficient way to do this?
If your column labels are the FICO_cat values, and your Index is OrigLTV_cat, this should work:
Given a dataframe df:
780+ (740,780) (720,740)
(60,70) 3 3 3
(70,75) 4 5 4
(75,80) 3 1 2
Do:
df = df.unstack().reset_index()
df.rename(columns = {'level_0' : 'FICOCat', 'level_1' : 'OrigLTV', 0 : 'value'}, inplace = True)
Output:
FICOCat OrigLTV value
0 780+ (60,70) 3
1 780+ (70,75) 4
2 780+ (75,80) 3
3 (740,780) (60,70) 3
4 (740,780) (70,75) 5
5 (740,780) (75,80) 1
6 (720,740) (60,70) 3
7 (720,740) (70,75) 4
8 (720,740) (75,80) 2
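With the lookup table in long form, the vlookup-like step itself becomes a merge on the two categorical columns. A sketch, where big and its column names FICO_cat/OrigLTV_cat are hypothetical stand-ins for the large frame from the question:

```python
import pandas as pd

# long-form lookup table, as produced by unstack().reset_index() above
lookup = pd.DataFrame({'FICOCat': ['780+', '780+', '(740,780)'],
                       'OrigLTV': ['(60,70)', '(70,75)', '(70,75)'],
                       'value':   [3, 4, 5]})

# hypothetical large frame carrying the two categorical variables
big = pd.DataFrame({'FICO_cat':    ['780+', '(740,780)'],
                    'OrigLTV_cat': ['(70,75)', '(70,75)']})

# left-merge the value column onto the large frame
out = big.merge(lookup, how='left',
                left_on=['FICO_cat', 'OrigLTV_cat'],
                right_on=['FICOCat', 'OrigLTV'])
# out['value'] is now [4, 5]
```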

Slicing and Setting Values in Pandas, with a composite of position and labels

I want to set a value in a specific cell in a pandas dataFrame.
I know which position the row is in (I can even get the row by using df.iloc[i], for example), and I know the name of the column, but I can't work out how to select the cell so that I can set a value to it.
df.loc[i,'columnName']=val
won't work because I want the row in position i, not labelled with index i. Also
df.iloc[i, 'columnName'] = val
obviously doesn't like being given a column name. So, short of converting to a dict and back, how do I go about this? Help very much appreciated, as I can't find anything that helps me in the pandas documentation.
You can use ix to set a specific cell (note: .ix was deprecated in pandas 0.20 and removed in 1.0, so on modern versions prefer the get_loc approach at the end of this answer):
In [209]:
df = pd.DataFrame(np.random.randn(5,3), columns=list('abc'))
df
Out[209]:
a b c
0 1.366340 1.643899 -0.264142
1 0.052825 0.363385 0.024520
2 0.526718 -0.230459 1.481025
3 1.068833 -0.558976 0.812986
4 0.208232 0.405090 0.704971
In [210]:
df.ix[1,'b'] = 0
df
Out[210]:
a b c
0 1.366340 1.643899 -0.264142
1 0.052825 0.000000 0.024520
2 0.526718 -0.230459 1.481025
3 1.068833 -0.558976 0.812986
4 0.208232 0.405090 0.704971
You can also call iloc on the col of interest, though note this is chained assignment, which can raise SettingWithCopyWarning and does not work under pandas' copy-on-write:
In [211]:
df['b'].iloc[2] = 0
df
Out[211]:
a b c
0 1.366340 1.643899 -0.264142
1 0.052825 0.000000 0.024520
2 0.526718 0.000000 1.481025
3 1.068833 -0.558976 0.812986
4 0.208232 0.405090 0.704971
You can get the position of the column with get_loc:
df.iloc[i, df.columns.get_loc('columnName')] = val
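A self-contained sketch of that get_loc approach, which works on current pandas versions:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(15).reshape(5, 3), columns=list('abc'))

i = 1  # row position, not label
# translate the column label to a position so the whole lookup is positional
df.iloc[i, df.columns.get_loc('b')] = 0
# df.iloc[1] is now [3, 0, 5]
```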

grouping by column and then doing a boxplot by the index in pandas

I have a large dataframe which I would like to group by some column and examine graphically the distribution per group using a boxplot. I found that df.boxplot() will do it for each column of the dataframe and put it in one plot, just as I need.
The problem is that after a groupby operation, my data is all in one column with the group labels in the index, so I can't call boxplot on the result.
here is an example:
from numpy.random import rand
from pandas import DataFrame

df = DataFrame({'a': rand(10), 'b': [x % 2 for x in range(10)]})
df
a b
0 0.273548 0
1 0.378765 1
2 0.190848 0
3 0.646606 1
4 0.562591 0
5 0.409250 1
6 0.637074 0
7 0.946864 1
8 0.203656 0
9 0.276929 1
Now I want to group by column b and boxplot the distribution of both groups in one boxplot. How can I do that?
You can use the by argument of boxplot. Is that what you are looking for?
df.boxplot(column='a', by='b')
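If you do want to start from the grouped, single-column shape, another sketch is to pivot the group labels into columns first, which gives boxplot one column per group (plotting assumes matplotlib is installed):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for interactive use
import pandas as pd
from numpy.random import rand

df = pd.DataFrame({'a': rand(10), 'b': [x % 2 for x in range(10)]})

# spread column a into one column per value of b (NaN where a row is absent)
wide = df.pivot(columns='b', values='a')
wide.boxplot()
```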