Plot multiple boxplots from seaborn with hue [duplicate] - pandas

I have a dataframe with three columns:
df = pd.DataFrame(data={'Dose1': [1, 2, 3, 4, 5], 'Dose2': [6, 6, 4, 7, 4], 'SickOrNot': [True, False, True, True, False]})
The last column corresponds to whether or not a patient stayed sick and the first two columns are doses of two drugs administered to the patient. I want to create two pairs of boxplots (in seaborn) of the doses, using whether the patient was sick or not as a hue.
So, essentially, I want the x axis to have two sections (Dose 1, Dose 2), with each section containing two boxplots, so that my final four boxplots are Dose 1 of sick patients, Dose 1 of non-sick patients, Dose 2 of sick patients, and Dose 2 of non-sick patients.
What is the syntax that I would use to do this? I have tried setting hue to be 'sick or not' but I am very confused about what to set as my x and y values when calling sns.boxplot.

Reshape the data into long form such that each column is one variable and each row is one observation. In this case Dose1 and Dose2 should be combined into one column, e.g. Section.
melt() the data with SickOrNot as the identifier and Dose1 and Dose2 as the values. Then set SickOrNot as the plot's hue:
sns.boxplot(
    data=df.melt(id_vars=['SickOrNot'], value_vars=['Dose1', 'Dose2'],
                 var_name='Section', value_name='Dosage'),
    x='Section',
    y='Dosage',
    hue='SickOrNot',
)
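To see what the reshaping step produces before it is handed to seaborn, here is a minimal sketch that runs only the melt() call on the question's data. The variable name long_df is introduced here for illustration and is not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'Dose1': [1, 2, 3, 4, 5],
    'Dose2': [6, 6, 4, 7, 4],
    'SickOrNot': [True, False, True, True, False],
})

# Long form: one row per (patient, drug) observation.
# `long_df` is an illustrative name, not from the original answer.
long_df = df.melt(id_vars=['SickOrNot'], value_vars=['Dose1', 'Dose2'],
                  var_name='Section', value_name='Dosage')

print(long_df.shape)          # (10, 3)
print(list(long_df.columns))  # ['SickOrNot', 'Section', 'Dosage']
print(long_df.head())
```

Each of the 5 patients now contributes two rows, one per drug, which is why x='Section' yields two groups and hue='SickOrNot' splits each group in two.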


split number based df of one column into 2 columns based on white space [duplicate]

Following the docs https://pandas.pydata.org/docs/reference/api/pandas.Series.str.split.html, I want to split this one column of numbers into 2 columns based on default whitespace. However, the following doesn't appear to do anything.
self.data[0].str.split(expand=True)
The df is of shape (1,1), but I would like to split it into (1,2).
Output
0
0 1.28353e-02 3.24985e-02
Desired Output
0 1
0 1.28353e-02 3.24985e-02
PS: I don't want to specifically create columns A and B.
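One possible explanation, assuming column 0 holds strings and the result of the call was simply never assigned: str.split returns a new DataFrame and leaves the original untouched. The sketch below uses a hypothetical one-cell frame named data in place of self.data.

```python
import pandas as pd

# Hypothetical frame mirroring the question: one string cell holding two numbers.
data = pd.DataFrame({0: ['1.28353e-02 3.24985e-02']})

# str.split returns a new DataFrame; it does not modify `data` in place.
split = data[0].str.split(expand=True)

print(data.shape)   # (1, 1)
print(split.shape)  # (1, 2)

# Optionally convert the resulting strings to floats.
split = split.astype(float)
print(split)
```

The new columns are labelled 0 and 1 automatically, so no explicit names such as A and B are needed.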

How to compute the similarity between two text columns in dataframes with pyspark?

I have 2 data frames with different numbers of rows. Both of them have a text column. My aim is to compare them, compute a similarity ratio, and add this score to the final data set. The comparison is between title from df1 and headline from df2. The positions of these text rows are different.
df1
duration  title                 publish_start_date
129.33    Smuggler's Run fr...  2021-10-29T10:21:...
49.342    anchises. Founded...  2021-10-29T06:00:...
69.939    by Diego Angel in...  2021-10-29T00:33:...
102.60    Orange County sch...  2021-10-28T10:24:...
df2
DataSource  Post Id   headline
Linkedin    L1904055  in English versi...
Linkedin    F6955268  in other language...
Facebook    F1948698  Its combined edit...
Twitter     T7954991  Emma Raducanu: 10...
Basically, I am trying to find similarities between the 2 data sets row by row (on text). Is there any way to do this?
Number of rows in the final data set = number of rows in the first data set x number of rows in the second data set
What you are looking for is a cross join. This way each row in df1 gets joined with every row in df2, after which you can apply a function to compare similarities between them.
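The cross-join idea can be sketched as follows. This is a small pandas illustration rather than PySpark, so it only suits data that fits in memory; in PySpark the analogous join is DataFrame.crossJoin, and pyspark.sql.functions.levenshtein is one built-in distance that could be applied afterwards. The sample strings are hypothetical (the real titles and headlines are truncated in the question), and difflib.SequenceMatcher is an assumed choice of similarity metric.

```python
from difflib import SequenceMatcher

import pandas as pd

# Hypothetical stand-in data; the real text is truncated in the question.
df1 = pd.DataFrame({'title': ['orange county schools', 'tennis results']})
df2 = pd.DataFrame({'headline': ['orange county schools',
                                 'stock market news',
                                 'tennis']})

# Cross join: every row of df1 is paired with every row of df2.
pairs = df1.merge(df2, how='cross')

# Similarity ratio in [0, 1]; SequenceMatcher is an assumed metric.
pairs['similarity'] = [
    SequenceMatcher(None, t, h).ratio()
    for t, h in zip(pairs['title'], pairs['headline'])
]

print(len(pairs))  # 6 = 2 rows x 3 rows
print(pairs)
```

The row count of the result is the product of the two input row counts, which matches the size described in the question.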

Iterate two dataframes, compare and change a value in pandas or pyspark

I am trying to do an exercise in pandas.
I have two dataframes. I need to compare a few columns between both dataframes and change the value of one column in the first dataframe if the comparison is successful.
Dataframe 1:
Article Country Colour Buy
Pants Germany Red 0
Pull Poland Blue 0
Initially all my articles have the flag 'Buy' set to zero.
I have dataframe 2, which looks like this:
Article Origin Colour
Pull Poland Blue
Dress Italy Red
I want to check if the article, country/origin and colour columns match (so check whether I can find each article from dataframe 1 in dataframe 2) and, if so, I want to set the flag 'Buy' to 1.
I am trying to iterate through both dataframes with pyspark, but pyspark dataframes are not iterable.
I thought about doing it in pandas, but apparently it is bad practice to change values during iteration.
Which code in pyspark or pandas would work to do what I need to do?
Thanks!
merge with an indicator then map the values. Make sure to drop_duplicates on the merge keys in the right frame so the merge result is always the same length as the original, and rename so we don't repeat the same information after the merge. No need to have a pre-defined column of 0s.
df1 = df1.drop(columns='Buy')
df1 = df1.merge(df2.drop_duplicates().rename(columns={'Origin': 'Country'}),
                indicator='Buy', how='left')
df1['Buy'] = df1['Buy'].map({'left_only': 0, 'both': 1}).astype(int)
Article Country Colour Buy
0 Pants Germany Red 0
1 Pull Poland Blue 1
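To illustrate why the drop_duplicates step matters, here is a small sketch using the question's df1 (without the Buy column) and a hypothetical df2 in which the 'Pull' row appears twice. The names right, with_dupes and deduped are introduced only for this illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'Article': ['Pants', 'Pull'],
                    'Country': ['Germany', 'Poland'],
                    'Colour': ['Red', 'Blue']})
# Hypothetical right frame with a repeated row (not in the original question).
df2 = pd.DataFrame({'Article': ['Pull', 'Pull', 'Dress'],
                    'Origin': ['Poland', 'Poland', 'Italy'],
                    'Colour': ['Blue', 'Blue', 'Red']})

right = df2.rename(columns={'Origin': 'Country'})

# Without de-duplication the matching left row is repeated once per duplicate.
with_dupes = df1.merge(right, indicator='Buy', how='left')
# With de-duplication the result keeps the length of df1.
deduped = df1.merge(right.drop_duplicates(), indicator='Buy', how='left')

print(len(with_dupes))  # 3: the 'Pull' row is matched twice
print(len(deduped))     # 2: same length as df1

deduped['Buy'] = deduped['Buy'].map({'left_only': 0, 'both': 1}).astype(int)
print(deduped)
```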

counting each value in dataframe

I want to create a plot or graph from time series data.
My dataframe looks like this:
df.head()
I need to count values in df['status'] (there are 4 different values) and df['group_name'] (2 different values) for each day.
So I want to have a date index and a count of how many times each value from df['status'] appears, as well as each value from df['group_name']. It should return Series.
I used spam.groupby('date')['column'].value_counts().unstack().fillna(0).astype(int) and it is working as it should. Thank you all for the help.
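For readers who want to see that chain on concrete input, here is a minimal sketch. The frame below is hypothetical: the status values and dates are made up for illustration, since the original df.head() output is not shown.

```python
import pandas as pd

# Hypothetical data; the values are assumptions for illustration only.
df = pd.DataFrame({
    'date': pd.to_datetime(['2021-01-01', '2021-01-01',
                            '2021-01-01', '2021-01-02']),
    'status': ['open', 'open', 'closed', 'open'],
})

# Count each status per day, then pivot the status values into columns.
counts = (df.groupby('date')['status']
            .value_counts()
            .unstack()
            .fillna(0)
            .astype(int))
print(counts)
```

The result has one row per date and one column per distinct status; combinations that never occur are filled with 0 by fillna(0). The same chain can be repeated with 'group_name' in place of 'status'.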

Plot values in specific column range for a particular row in a pandas data frame

I have a 10 rows x 26 columns data set of a country region's oil production between 1990-2011. The first column designates the country region (e.g. Canada), the next 22 columns correspond to oil production between 1990 and 2010, and the last two columns have ratios of oil production in one year relative to another.
My goal is to simply plot the oil production as a function of time separately for each country (i.e. categorize by column 1 and discard the last two columns when plotting). What is the most efficient way to do this?
It seems like you want all of the columns in your data except the last two, so use df.iloc[:, :-2] to select them. You then want to transpose this data so that the dates are the rows and the countries are the columns (use .T). Finally, plot your data.
df.iloc[:, :-2].T.plot()
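A slightly fuller sketch, assuming the region names live in an ordinary column rather than in the index: in that case the column can first be moved into the index so that the transposed frame contains only numeric values. The column names ('Region', 'ratio_a', 'ratio_b') and the numbers below are hypothetical.

```python
import pandas as pd

# Hypothetical miniature of the data set; names and numbers are made up.
df = pd.DataFrame({
    'Region': ['Canada', 'Mexico'],
    '1990': [1.0, 2.0],
    '1991': [1.5, 2.5],
    'ratio_a': [0.1, 0.2],
    'ratio_b': [0.3, 0.4],
})

# Move the region names into the index, drop the last two (ratio) columns,
# then transpose so that years become rows and regions become columns.
by_year = df.set_index('Region').iloc[:, :-2].T

print(by_year.shape)          # (2, 2)
print(list(by_year.columns))  # ['Canada', 'Mexico']

# by_year.plot() would then draw one line per region (requires matplotlib).
```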