I have a Pandas dataframe whose rows and columns share the same labels, the names of people: Alex, Bob, and Cynthia. Each cell holds the number of times the two people have met, and -1 marks a diagonal cell.
         Alex  Bob  Cynthia
Alex       -1    2        3
Bob         1   -1        2
Cynthia     2    2       -1
Is there any elegant way to get a set of numeric values that are in the cells? So, for this table I want values = {1, 2, 3}. So far I can think of only iterating over the whole table in a nested-loop fashion and putting everything in a set.
Is there any other way of getting this set?
You can try:
set(df.values.flatten())
And to exclude -1:
set(df.values.flatten()).difference([-1])
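Put together, a minimal runnable sketch using the table from the question:

```python
import pandas as pd

# The meeting-count matrix from the question; -1 marks the diagonal
names = ["Alex", "Bob", "Cynthia"]
df = pd.DataFrame(
    [[-1, 2, 3],
     [1, -1, 2],
     [2, 2, -1]],
    index=names, columns=names,
)

# Flatten all cells into a 1-D array, collect them in a set, drop the marker
values = set(df.values.flatten()).difference([-1])
# values == {1, 2, 3}
```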
I'm having a very tough time trying to figure out how to do this with python. I have the following table:
NAMES VALUE
john_1 1
john_2 2
john_3 3
bro_1 4
bro_2 5
bro_3 6
guy_1 7
guy_2 8
guy_3 9
And I would like to go to:
NAMES VALUE1 VALUE2 VALUE3
john 1 2 3
bro 4 5 6
guy 7 8 9
I have tried with pandas, so I first split the index (NAMES) and I can create the new columns but I have trouble indexing the values to the right column.
Can someone at least give me a direction where the solution to this problem is? I don't expect a full code (I know that this is not appreciated) but any help is welcome.
After splitting the NAMES column, use .pivot to reshape your DataFrame.
# Split Names and Pivot.
df['NAME_NBR'] = df['NAMES'].str.split('_').str.get(1)
df['NAMES'] = df['NAMES'].str.split('_').str.get(0)
df = df.pivot(index='NAMES', columns='NAME_NBR', values='VALUE')
# Rename columns and reset the index.
df.columns = ['VALUE{}'.format(c) for c in df.columns]
df.reset_index(inplace=True)
If you want to be slick, you can do the split in a single line:
df['NAMES'], df['NAME_NBR'] = zip(*[s.split('_') for s in df['NAMES']])
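A self-contained end-to-end sketch with the sample table from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "NAMES": ["john_1", "john_2", "john_3", "bro_1", "bro_2", "bro_3",
              "guy_1", "guy_2", "guy_3"],
    "VALUE": [1, 2, 3, 4, 5, 6, 7, 8, 9],
})

# Split NAMES into the base name and the numeric suffix
df["NAME_NBR"] = df["NAMES"].str.split("_").str.get(1)
df["NAMES"] = df["NAMES"].str.split("_").str.get(0)

# Pivot so each suffix becomes its own column, then tidy up
df = df.pivot(index="NAMES", columns="NAME_NBR", values="VALUE")
df.columns = ["VALUE{}".format(c) for c in df.columns]
df = df.reset_index()
# One row per base name (pivot sorts the index: bro, guy, john)
```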
I have a DataFrame with some "Non-leaves rows" in it. Is there any way to get plotly to ignore them, or a way to automatically remove them?
Here's a sample DataFrame:
       0    1      2      3
0  Alice  Bob
1  Alice  Bob  Carol  David
2  Alice  Bob  Chuck  Delia
3  Alice  Bob  Chuck   Ella
4  Alice  Bob  Frank
In this case, I get the error Non-leaves rows are not permitted in the dataframe because the 0th row is not a distinct leaf.
I've tried using df = df.replace(np.NaN, pd.NA).where(df.notnull(), None) to add the None to the empty values, but the error persists.
Is there any way to have the non-leaves ignored? If not, is there a simple way to prune them? My real dataset is several thousand rows.
One option is to reshape your dataframe into unique parent-child pairs. Here is one way:
import pandas as pd
import plotly.express as px

cols = df.columns
data = (
    pd.concat(
        [df[[i, j]].rename(columns={i: 'parents', j: 'childs'})
         for i, j in zip(cols[:-1], cols[1:])]
    )
    .drop_duplicates()
)
fig = px.sunburst(data, names='childs', parents='parents')
fig.show()
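For reference, here is the reshape step on a hypothetical hierarchy like the one in the question. The plotly call is left commented out so the snippet runs with pandas alone, and dropping child-less pairs via dropna is an addition for the short rows, not part of the answer above:

```python
import pandas as pd

# Hypothetical hierarchy; row 0 stops early, so its trailing cells are NaN
df = pd.DataFrame({
    0: ["Alice"] * 5,
    1: ["Bob"] * 5,
    2: [None, "Carol", "Chuck", "Chuck", "Frank"],
    3: [None, "David", "Delia", "Ella", None],
})

cols = df.columns
data = (
    pd.concat(
        [df[[i, j]].rename(columns={i: "parents", j: "childs"})
         for i, j in zip(cols[:-1], cols[1:])]
    )
    .drop_duplicates()
    .dropna(subset=["childs"])  # discard pairs with no child (short paths)
)
# data now holds 7 unique parent-child edges
# import plotly.express as px
# fig = px.sunburst(data, names="childs", parents="parents")
# fig.show()
```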
I have a dataframe (small sample shown below, it has more columns), and I want to find the column names with the minimum values.
Right now, I have the following code to deal with it:
finaldf['min_pillar_score'] = finaldf.iloc[:, 2:9].idxmin(axis="columns")
This works fine, but does not return multiple values of column names in case there is more than one instance of minimum values. How can I change this to return multiple column names in case there is more than one instance of the minimum value?
Please note, I want row wise results, i.e. minimum column names for each row.
Thanks!
Try the code below and see if the output is in the format you anticipated; it produces the intended result at least.
The result will be stored in mins.
mins = df.idxmin(axis="columns")
for i, r in df.iterrows():
    mins[i] = list(r[r == r[mins[i]]].index)
Get column name where value is something in pandas dataframe might be helpful also.
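For a concrete run, here is the snippet applied to a small frame with made-up values:

```python
import pandas as pd

df = pd.DataFrame({
    "A": [5, 0, 6, 5, 4],
    "B": [8, 0, 9, 2, 7],
    "C": [9, 1, 2, 4, 7],
    "D": [5, 7, 4, 2, 9],
})

mins = df.idxmin(axis="columns")
for i, r in df.iterrows():
    # Replace the single column name with every column hitting the row min
    mins[i] = list(r[r == r[mins[i]]].index)
# mins[0] == ['A', 'D'], mins[1] == ['A', 'B'], mins[3] == ['B', 'D']
```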
Assuming this input as df:
A B C D
0 5 8 9 5
1 0 0 1 7
2 6 9 2 4
3 5 2 4 2
4 4 7 7 9
You can use the underlying numpy array to get the min value, then compare the values to the min and get the columns that have a match:
s = df.eq(df.to_numpy().min()).any()
list(s[s].index)
output: ['A', 'B']
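Since the question asks for row-wise results, the same comparison idea can also be applied along axis=1. A sketch, using the input above:

```python
import pandas as pd

df = pd.DataFrame({
    "A": [5, 0, 6, 5, 4],
    "B": [8, 0, 9, 2, 7],
    "C": [9, 1, 2, 4, 7],
    "D": [5, 7, 4, 2, 9],
})

# Compare each row to its own minimum, then collect the matching columns
matches = df.eq(df.min(axis=1), axis=0)
row_mins = matches.apply(lambda r: list(df.columns[r]), axis=1)
# row_mins[0] == ['A', 'D'], row_mins[3] == ['B', 'D']
```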
I have googled this and found lots of questions on Stack Overflow. So suppose I have a dataframe like this:
A B
-----
1
2
4 4
The first 3 rows should be deleted. And suppose I have not 2 but 200 columns. How can I do that?
As per your request, first replace empty values with NaN:
df = df.replace(r'^\s*$', np.nan, regex=True)
df = df.dropna()
If you want to drop rows based on a specific column only, specify the column name in the subset argument of dropna.
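A minimal runnable sketch, assuming the blanks in the question are empty strings (the column names are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ["1", "2", "4"], "B": ["", "", "4"]})

# Turn whitespace-only cells into NaN, then drop any row containing NaN
df = df.replace(r"^\s*$", np.nan, regex=True)
df = df.dropna()
# Only the complete row ('4', '4') survives; with 200 columns the calls
# are identical, since dropna scans every column by default.

# To drop based on one specific column only:
# df = df.dropna(subset=["B"])
```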
This is my original data frame. I want to remove the duplicates across the columns 'head_x' and 'head_y', and across the columns 'cost_x' and 'cost_y'.
This is my code:
df=df.astype(str)
df.drop_duplicates(subset={'head_x','head_y'}, keep=False, inplace=True)
df.drop_duplicates(subset={'cost_x','cost_y'}, keep=False, inplace=True)
print(df)
This is the output dataframe; as you can see, the first row is a duplicate on both subsets. So why is this row still there?
I do not just want to remove the first row but all duplicates. This is another output where there is also a duplicate for Index/Node 6.
Take a look at the first 2 rows:
      head_x  cost_x  head_y  cost_y
Node
1          2       6       2       3
1          2       6       3       4
Start with head_x and head_y:
the values in the first row are 2 and 2,
the values in the second row are 2 and 3,
so these two pairs are different.
Then look at cost_x and cost_y:
the values in the first row are 6 and 3,
the values in the second row are 6 and 4,
so these two pairs are also different.
Conclusion: these two rows are not duplicates when both column subsets are taken into account.
df = df.astype(str)
df = df.drop_duplicates(subset=['head_x', 'head_y'], keep=False)
df = df.drop_duplicates(subset=['cost_x', 'cost_y'], keep=False)
Note that drop_duplicates with inplace=True returns None, so do not assign its result back to df; either assign the return value (as above) or use inplace=True without the assignment.
I assume that cost_x should be swapped with head_y; otherwise there are no duplicates.
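To verify the reasoning, here is a small reproduction of the two rows from the table above (the index name Node is kept):

```python
import pandas as pd

df = pd.DataFrame(
    {"head_x": [2, 2], "cost_x": [6, 6], "head_y": [2, 3], "cost_y": [3, 4]},
    index=pd.Index([1, 1], name="Node"),
).astype(str)

# keep=False removes only rows whose subset values repeat; here both the
# (head_x, head_y) pairs and the (cost_x, cost_y) pairs are distinct
out = df.drop_duplicates(subset=["head_x", "head_y"], keep=False)
out = out.drop_duplicates(subset=["cost_x", "cost_y"], keep=False)
# len(out) == 2: neither row is treated as a duplicate
```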