I have a Pandas dataframe whose rows and columns share the same labels, the names of people: Alex, Bob, and Cynthia. Each cell holds the number of times the two people have met, and -1 marks a diagonal cell.
         Alex  Bob  Cynthia
Alex       -1    2        3
Bob         1   -1        2
Cynthia     2    2       -1
Is there any elegant way to get a set of numeric values that are in the cells? So, for this table I want values = {1, 2, 3}. So far I can think of only iterating over the whole table in a nested-loop fashion and putting everything in a set.
Is there any other way of getting this set?
You can try:
set(df.values.flatten())
And to exclude -1:
set(df.values.flatten()).difference([-1])
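Put together, a minimal runnable sketch using the table from the question:

```python
import pandas as pd

# The meeting-count matrix from the question; -1 marks the diagonal
names = ["Alex", "Bob", "Cynthia"]
df = pd.DataFrame(
    [[-1, 2, 3],
     [1, -1, 2],
     [2, 2, -1]],
    index=names, columns=names,
)

# Flatten all cells into a 1-D array, collect them in a set, drop the marker
values = set(df.values.flatten()).difference([-1])
# values == {1, 2, 3}
```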
I'm having a very tough time trying to figure out how to do this with python. I have the following table:
NAMES VALUE
john_1 1
john_2 2
john_3 3
bro_1 4
bro_2 5
bro_3 6
guy_1 7
guy_2 8
guy_3 9
And I would like to go to:
NAMES VALUE1 VALUE2 VALUE3
john 1 2 3
bro 4 5 6
guy 7 8 9
I have tried with pandas, so I first split the index (NAMES) and I can create the new columns but I have trouble indexing the values to the right column.
Can someone at least give me a direction where the solution to this problem is? I don't expect a full code (I know that this is not appreciated) but any help is welcome.
After splitting the NAMES column, use .pivot to reshape your DataFrame.
# Split Names and Pivot.
df['NAME_NBR'] = df['NAMES'].str.split('_').str.get(1)
df['NAMES'] = df['NAMES'].str.split('_').str.get(0)
df = df.pivot(index='NAMES', columns='NAME_NBR', values='VALUE')
# Rename columns and reset the index.
df.columns = ['VALUE{}'.format(c) for c in df.columns]
df.reset_index(inplace=True)
If you want to be slick, you can do the split in a single line:
df['NAMES'], df['NAME_NBR'] = zip(*[s.split('_') for s in df['NAMES']])
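A self-contained end-to-end sketch with the sample table from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "NAMES": ["john_1", "john_2", "john_3", "bro_1", "bro_2", "bro_3",
              "guy_1", "guy_2", "guy_3"],
    "VALUE": [1, 2, 3, 4, 5, 6, 7, 8, 9],
})

# Split NAMES into the base name and the numeric suffix
df["NAME_NBR"] = df["NAMES"].str.split("_").str.get(1)
df["NAMES"] = df["NAMES"].str.split("_").str.get(0)

# Pivot so each suffix becomes its own column, then tidy up
df = df.pivot(index="NAMES", columns="NAME_NBR", values="VALUE")
df.columns = ["VALUE{}".format(c) for c in df.columns]
df = df.reset_index()
# One row per base name (pivot sorts the index: bro, guy, john)
```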
I have a DataFrame with some "Non-leaves rows" in it. Is there any way to get plotly to ignore them, or a way to automatically remove them?
Here's a sample DataFrame:
       0    1      2      3
0  Alice  Bob
1  Alice  Bob  Carol  David
2  Alice  Bob  Chuck  Delia
3  Alice  Bob  Chuck   Ella
4  Alice  Bob  Frank
In this case, I get the error Non-leaves rows are not permitted in the dataframe because the 0th row is not a distinct leaf.
I've tried using df = df.replace(np.NaN, pd.NA).where(df.notnull(), None) to add the None to the empty values, but the error persists.
Is there any way to have the non-leaves ignored? If not, is there a simple way to prune them? My real dataset is several thousand rows.
One option is to reshape your dataframe into unique parent-child pairs. Here is one way:
import pandas as pd
import plotly.express as px

cols = df.columns
data = (
    pd.concat(
        [df[[i, j]].rename(columns={i: 'parents', j: 'childs'})
         for i, j in zip(cols[:-1], cols[1:])]
    )
    .drop_duplicates()
)
fig = px.sunburst(data, names='childs', parents='parents')
fig.show()
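For reference, here is the reshape step on a hypothetical hierarchy like the one in the question. The plotly call is left commented out so the snippet runs with pandas alone, and dropping child-less pairs via dropna is an addition for the short rows, not part of the answer above:

```python
import pandas as pd

# Hypothetical hierarchy; row 0 stops early, so its trailing cells are NaN
df = pd.DataFrame({
    0: ["Alice"] * 5,
    1: ["Bob"] * 5,
    2: [None, "Carol", "Chuck", "Chuck", "Frank"],
    3: [None, "David", "Delia", "Ella", None],
})

cols = df.columns
data = (
    pd.concat(
        [df[[i, j]].rename(columns={i: "parents", j: "childs"})
         for i, j in zip(cols[:-1], cols[1:])]
    )
    .drop_duplicates()
    .dropna(subset=["childs"])  # discard pairs with no child (short paths)
)
# data now holds 7 unique parent-child edges
# import plotly.express as px
# fig = px.sunburst(data, names="childs", parents="parents")
# fig.show()
```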
I have a dataframe (small sample shown below, it has more columns), and I want to find the column names with the minimum values.
Right now, I have the following code to deal with it:
finaldf['min_pillar_score'] = finaldf.iloc[:, 2:9].idxmin(axis="columns")
This works fine, but does not return multiple values of column names in case there is more than one instance of minimum values. How can I change this to return multiple column names in case there is more than one instance of the minimum value?
Please note, I want row wise results, i.e. minimum column names for each row.
Thanks!
Try the code below and see if the output is in the format you anticipated; it produces the intended result at least.
The result will be stored in mins.
mins = df.idxmin(axis="columns")
for i, r in df.iterrows():
    mins[i] = list(r[r == r[mins[i]]].index)
Get column name where value is something in pandas dataframe might be helpful also.
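For a concrete run, here is the snippet applied to a small frame with made-up values:

```python
import pandas as pd

df = pd.DataFrame({
    "A": [5, 0, 6, 5, 4],
    "B": [8, 0, 9, 2, 7],
    "C": [9, 1, 2, 4, 7],
    "D": [5, 7, 4, 2, 9],
})

mins = df.idxmin(axis="columns")
for i, r in df.iterrows():
    # Replace the single column name with every column hitting the row min
    mins[i] = list(r[r == r[mins[i]]].index)
# mins[0] == ['A', 'D'], mins[1] == ['A', 'B'], mins[3] == ['B', 'D']
```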
Assuming this input as df:
A B C D
0 5 8 9 5
1 0 0 1 7
2 6 9 2 4
3 5 2 4 2
4 4 7 7 9
You can use the underlying numpy array to get the min value, then compare the values to the min and get the columns that have a match:
s = df.eq(df.to_numpy().min()).any()
list(s[s].index)
output: ['A', 'B']
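Since the question asks for row-wise results, the same comparison idea can also be applied along axis=1. A sketch, using the input above:

```python
import pandas as pd

df = pd.DataFrame({
    "A": [5, 0, 6, 5, 4],
    "B": [8, 0, 9, 2, 7],
    "C": [9, 1, 2, 4, 7],
    "D": [5, 7, 4, 2, 9],
})

# Compare each row to its own minimum, then collect the matching columns
matches = df.eq(df.min(axis=1), axis=0)
row_mins = matches.apply(lambda r: list(df.columns[r]), axis=1)
# row_mins[0] == ['A', 'D'], row_mins[3] == ['B', 'D']
```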
I have googled this and found lots of questions on Stack Overflow. So suppose I have a dataframe like this:
A B
-----
1
2
4 4
The first 3 rows should be deleted. And suppose I have not 2 but 200 columns. How can I do that?
As per your request, first replace empty values with NaN:
df = df.replace(r'^\s*$', np.nan, regex=True)
df = df.dropna()
If you want to drop rows based on a specific column only, specify the column name in the subset argument of dropna.
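A minimal runnable sketch, assuming the blanks in the question are empty strings (the column names are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ["1", "2", "4"], "B": ["", "", "4"]})

# Turn whitespace-only cells into NaN, then drop any row containing NaN
df = df.replace(r"^\s*$", np.nan, regex=True)
df = df.dropna()
# Only the complete row ('4', '4') survives; with 200 columns the calls
# are identical, since dropna scans every column by default.

# To drop based on one specific column only:
# df = df.dropna(subset=["B"])
```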
This is my original data frame. I want to remove the duplicates across the columns 'head_x' and 'head_y', and across the columns 'cost_x' and 'cost_y'.
This is my code:
df=df.astype(str)
df.drop_duplicates(subset={'head_x','head_y'}, keep=False, inplace=True)
df.drop_duplicates(subset={'cost_x','cost_y'}, keep=False, inplace=True)
print(df)
This is the output dataframe; as you can see, the first row is a duplicate on both subsets. So why is this row still there?
I do not just want to remove the first row but all duplicates. This is another output where there is also a duplicate for Index/Node 6.
Take a look at the first 2 rows:
      head_x  cost_x  head_y  cost_y
Node
1          2       6       2       3
1          2       6       3       4
Start with head_x and head_y:
the values in the first row are 2 and 2,
the values in the second row are 2 and 3,
so these two pairs are different.
Then look at cost_x and cost_y:
the values in the first row are 6 and 3,
the values in the second row are 6 and 4,
so these two pairs are also different.
Conclusion: these two rows are not duplicates when both column subsets are taken into account.
df = df.astype(str)
df = df.drop_duplicates(subset=['head_x', 'head_y'], keep=False)
df = df.drop_duplicates(subset=['cost_x', 'cost_y'], keep=False)
Note that drop_duplicates with inplace=True returns None, so do not assign its result back to df; either assign the return value (as above) or use inplace=True without the assignment.
I assume that cost_x should be swapped with head_y; otherwise there are no duplicates.
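To verify the reasoning, here is a small reproduction of the two rows from the table above (the index name Node is kept):

```python
import pandas as pd

df = pd.DataFrame(
    {"head_x": [2, 2], "cost_x": [6, 6], "head_y": [2, 3], "cost_y": [3, 4]},
    index=pd.Index([1, 1], name="Node"),
).astype(str)

# keep=False removes only rows whose subset values repeat; here both the
# (head_x, head_y) pairs and the (cost_x, cost_y) pairs are distinct
out = df.drop_duplicates(subset=["head_x", "head_y"], keep=False)
out = out.drop_duplicates(subset=["cost_x", "cost_y"], keep=False)
# len(out) == 2: neither row is treated as a duplicate
```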