Plot values in specific column range for a particular row in a pandas data frame - pandas

I have a 10 rows x 26 columns data set of oil production by country region between 1990 and 2011. The first column designates the country region (e.g. Canada), the next 22 columns correspond to oil production for each year between 1990 and 2011, and the last two columns hold ratios of oil production in one year relative to another.
My goal is to simply plot the oil production as a function of time separately for each country (i.e. categorize by column 1 and discard the last two columns when plotting). What is the most efficient way to do this?

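It seems you want every column except the last two, so select them with df.iloc[:, :-2]. Then transpose the result (use .T) so that the years become the rows and the countries become the columns. Finally, plot the data. This assumes the country names are the index; if they are a regular column, set them as the index first (for example with set_index) so that the transposed frame holds only numeric values.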
df.iloc[:, :-2].T.plot()
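As a minimal sketch of the steps above, the snippet below builds a small made-up frame and reshapes it. The region names, years, values, and the two ratio column names are all hypothetical, and the helper names production_by_year and plot_production are introduced only for illustration. It assumes the region names sit in the first column rather than in the index.

```python
import pandas as pd

# Hypothetical data: names, years, and numbers are invented for illustration.
df = pd.DataFrame({
    "region": ["Canada", "Mexico"],
    "1990": [1.0, 2.0],
    "1991": [1.5, 2.5],
    "1992": [2.0, 3.0],
    "ratio_a": [2.0, 1.5],   # stand-ins for the two trailing ratio columns
    "ratio_b": [1.33, 1.2],
})


def production_by_year(frame):
    """Return a frame with years as rows and regions as columns."""
    # Move the first column (region names) into the index so that
    # only numeric columns remain.
    indexed = frame.set_index(frame.columns[0])
    # Drop the last two columns, then transpose.
    return indexed.iloc[:, :-2].T


def plot_production(frame):
    """Draw one line per region (requires matplotlib to be installed)."""
    return production_by_year(frame).plot()


wide = production_by_year(df)
print(wide)
```

Calling plot_production(df) would then draw one line per region with the years on the x axis.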

Related

Bar plot from two different datasets with different data range

I have the following datasets:
df1 = {'lower':[3.99,4.99,5.99,1700], 'percentile':[1,2,5,10,50,100]}
df2 = {'lower':[2.99,4.50,5,1850], 'percentile':[2,4,7,15,55,100]}
The data:
The percentile refers to the percentage of the data that corresponds to a particular price, e.g. 3.99 would represent 1% of the data, while all values under 5.99 would represent 5% of the data.
Each dataset has 100 entries because we are showing percentiles, but the prices differ between the two datasets.
What I have done so far:
What I need help with:
As you can see in the third graph, I can plot the two datasets overlaid, which is what I need, but I have not managed to change the legend or the odd x-axis tick values on that graph. The x axis does not show the percentile, or any other metric I might want to use there.
Any help?

How to compute the similarity between two text columns in dataframes with pyspark?

I have two data frames with different numbers of rows. Both of them have a text column. My aim is to compare these columns, compute a similarity ratio, and add that score to the final data set. The comparison is between title from df1 and headline from df2. The matching text rows sit at different positions in the two data frames.
df1
| duration | title                | publish_start_date   |
|----------|----------------------|----------------------|
| 129.33   | Smuggler's Run fr... | 2021-10-29T10:21:... |
| 49.342   | anchises. Founded... | 2021-10-29T06:00:... |
| 69.939   | by Diego Angel in... | 2021-10-29T00:33:... |
| 102.60   | Orange County sch... | 2021-10-28T10:24:... |
df2
| DataSource | Post Id  | headline             |
|------------|----------|----------------------|
| Linkedin   | L1904055 | in English versi...  |
| Linkedin   | F6955268 | in other language... |
| Facebook   | F1948698 | Its combined edit... |
| Twitter    | T7954991 | Emma Raducanu: 10... |
Basically, I am trying to find similarities between the two data sets row by row (on the text). Is there any way to do this?
number of rows in the final data set = number of rows in the first data set x number of rows in the second data set
What you are looking for is a cross join. Each row in df1 gets joined with every row in df2, after which you can apply a function to compute the similarity between the two text columns.
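To make the pairing logic concrete, here is a minimal sketch in plain Python rather than PySpark. The sample strings are invented stand-ins for the title and headline columns, and difflib.SequenceMatcher is just one possible similarity measure, not one named in the question. In PySpark itself the pairing would come from df1.crossJoin(df2), and a built-in such as pyspark.sql.functions.levenshtein could supply an edit distance to turn into a ratio.

```python
from difflib import SequenceMatcher
from itertools import product

# Hypothetical rows standing in for df1.title and df2.headline.
titles = ["Orange County schools reopen", "Emma Raducanu wins"]
headlines = [
    "Emma Raducanu: 10 facts",
    "Orange County school news",
    "Weather update",
]


def similarity(a, b):
    """Return a similarity ratio between 0 and 1 for two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Cross join: every title is paired with every headline,
# so the result has len(titles) * len(headlines) rows.
pairs = [(t, h, similarity(t, h)) for t, h in product(titles, headlines)]

for title, headline, score in pairs:
    print(f"{score:.2f}  {title!r} vs {headline!r}")
```

Because the result grows as the product of the two row counts, a cross join can become expensive for large inputs.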

Plot multiple boxplots from seaborn with hue [duplicate]

I have a dataframe with three columns
df=pd.DataFrame(data = {'Dose1': [1,2,3,4,5], 'Dose2': [6,6,4,7,4],'SickOrNot':[True,False,True,True,False]})
The last column corresponds to whether or not a patient stayed sick and the first two columns are doses of two drugs administered to the patient. I want to create two pairs of boxplots (in seaborn) of the doses, using whether the patient was sick or not as a hue.
So, essentially, I want the x axis to have two sections (Dose 1, Dose 2), with each section containing two boxplots, so that my final four boxplots are Dose 1 of sick patients, Dose 1 of non-sick patients, Dose 2 of sick patients, and Dose 2 of non-sick patients.
What is the syntax that I would use to do this? I have tried setting hue to be 'sick or not' but I am very confused about what to set as my x and y values when calling sns.boxplot.
Reshape the data into long form such that each column is one variable and each row is one observation. In this case Dose1 and Dose2 should be combined into one column, e.g. Section.
melt() the data with SickOrNot as the identifier and Dose1 and Dose2 as the values. Then set SickOrNot as the plot's hue:
sns.boxplot(
    data=df.melt(id_vars=['SickOrNot'], value_vars=['Dose1', 'Dose2'],
                 var_name='Section', value_name='Dosage'),
    x='Section',
    y='Dosage',
    hue='SickOrNot',
)
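To see what the reshaping step produces before it reaches seaborn, the sketch below runs the same melt() call on the question's data frame and prints the long-form result. Only pandas is needed here; the variable name long_df is introduced for illustration.

```python
import pandas as pd

df = pd.DataFrame(data={'Dose1': [1, 2, 3, 4, 5],
                        'Dose2': [6, 6, 4, 7, 4],
                        'SickOrNot': [True, False, True, True, False]})

# Stack Dose1 and Dose2 into a single 'Dosage' column, and record
# which original column each value came from in 'Section'.
long_df = df.melt(id_vars=['SickOrNot'], value_vars=['Dose1', 'Dose2'],
                  var_name='Section', value_name='Dosage')

# Five patients x two dose columns gives ten rows.
print(long_df)
```

Each row of long_df is now one observation (one dose for one patient), which is the layout that x='Section', y='Dosage', hue='SickOrNot' expects.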

Plot chart using dataframe columns

I have a data frame that lists the number of people affected by two different issues for each year from 1999 to 2005, as in the image below.
dataframe
Can I create a bar chart that compares the number of people affected by cancer and by heart disease for each year?

How to group by and sum several columns?

I have a big dataframe with several columns that contain strings, numbers, etc. I am trying to group by SCENARIO and then sum only the columns between 2020 and 2050. The only thing I have managed so far is to sum one column, as shown below, but I need to replace this '2050' with the columns between 2020 and 2050.
df1 = df.groupby(["SCENARIO"])['2050'].sum().sum(axis=0)
You are creating a subset of the df with only that single column. I can't tell what your dataset looks like from the information provided, but try:
df.groupby(["SCENARIO"]).sum()
This should sum all the rows of each column within each SCENARIO group.
Alternatively, select the columns you want to sum:
df.groupby(["SCENARIO"])[["column1","column2"]].sum()