Is there a boxplot() function where you can specify the min/max and quartile values as arguments instead of supplying a whole DataFrame? [duplicate] - matplotlib

This question already has answers here:
Is it possible to draw a matplotlib boxplot given the percentile values instead of the original inputs?
(4 answers)
How do you create a boxplot in seaborn with pre-calculated values for mean, median, percentile, etc?
(1 answer)
Closed last month.
I am working on some privacy-friendly data visualization and I am looking for a Python library with an implementation of boxplot where you can provide the values needed to build a boxplot (minimum, first quartile, median, third quartile, and maximum) instead of providing a whole dataset. Alternatively, if anyone has any ideas of how I could implement it from scratch using available tools and without providing full columns of data, that would also be very useful!
So far I have checked matplotlib and Seaborn, but I couldn't find this option in either.
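One possible approach, sketched here under assumptions, is matplotlib's lower-level Axes.bxp method, which draws boxes from a list of dictionaries of precomputed statistics rather than from raw data. The numbers and group labels below are hypothetical placeholders, not real measurements.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical five-number summaries; no raw data is needed.
# Axes.bxp expects the keys whislo, q1, med, q3 and whishi for each box.
summary_stats = [
    {"label": "group A", "whislo": 1.0, "q1": 2.5, "med": 4.0, "q3": 5.5, "whishi": 9.0, "fliers": []},
    {"label": "group B", "whislo": 0.5, "q1": 3.0, "med": 4.5, "q3": 7.0, "whishi": 10.0, "fliers": []},
]

fig, ax = plt.subplots()
# showfliers=False because no individual outlier points are being shared
artists = ax.bxp(summary_stats, showfliers=False)
ax.set_ylabel("value")
```

Since the whisker ends are supplied directly, setting whislo and whishi to the minimum and maximum gives the classic five-number-summary boxplot. Calling fig.savefig(...) afterwards would write the figure to a file.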

Related

Why is Pandas inverting my x-axis order? [duplicate]

This question already has answers here:
x-axis inverted unexpectedly by pandas.plot(...)
(2 answers)
Closed 4 years ago.
When I plotted two series of data against each other, the X axis was inverted unexpectedly. I know this question sounds very similar to this one: x-axis inverted unexpectedly by pandas.plot(...), and it is, but I want to know whether this behavior can be disabled rather than worked around. Let me explain.
I have a very simple DataFrame that consists of a datetime index and two columns: one holds humidity measurements and the other daily weights. Both are in descending order because when my sample loses water, it also loses weight and humidity. So my DataFrame looks something like this, with the data in descending order:
But when I plot using X = 'Peso' (weight) and Y = 'Humedad' (humidity), my X axis goes in ascending order instead of descending order.
My plotting code:
plt.figure(figsize=(12,9))
plt.scatter(data['Peso'],data['Humedad'])
plt.xlabel('Peso (kg)',fontsize=14)
plt.ylabel("Raw Counts",fontsize=14)
plt.xticks(rotation=90,fontsize=10)
plt.grid()
Resulting in this kind of plot, where X axis is inverted
So, I could do two simple types of workaround:
plt.scatter(sorted(data['Peso']),data['Humedad'])
or
plt.scatter(data['Peso'][::-1],data['Humedad'])
Both give the same result: they plot my data as I wanted, BUT my xticks are still inverted:
So I created a list of my weight values to pass in as follows:
semin=data['Peso']
semin=semin.tolist()
And then adding it to my plt.xticks like this
plt.xticks(semin,rotation=90,fontsize=10)
It "kind of" worked, but some of the xticks overlap, as you can see in the image below:
I know I can solve this with [Locs] and general xticks settings, but I really wanted to know whether it's possible to just ask Pandas to follow the natural descending order of the data, or something similar, and avoid all of this xticks stuff.
I've checked this too: https://github.com/pandas-dev/pandas/issues/10118
and I tried by doing the set_index suggestion:
plt.figure(figsize=(12,9))
data.set_index('Peso').Humedad.plot()
plt.xlabel('Peso (kg)',fontsize=14)
plt.ylabel("Raw Counts",fontsize=14)
plt.xticks(rotation=90,fontsize=10)
plt.grid()
And it worked almost perfectly, except that I needed a scatter plot...
So I tried a few things to "scatter it":
1. Putting the marker type:
data.set_index('Peso').Humedad.plot(marker='o')
Got a marker + line graph:
2. Changing .plot to .scatter:
data.set_index('Peso').Humedad.scatter()
Got this error:
AttributeError: 'Series' object has no attribute 'scatter'
3. Using both:
data.set_index('Peso').Humedad.plot.scatter()
Got this one:
AttributeError: 'SeriesPlotMethods' object has no attribute 'scatter'
4. Making this giant question. Please help.
And that's all, sorry if I'm missing something or if my post is too long. I'm open to suggestions, corrections or anything you're willing to tell me.
Thanks!
Oh, I just saw that the linked question actually does exactly what you need. I will leave this here, but please refer to the linked question instead.
Changing the ticks is not a solution! The resulting plot may easily end up completely wrong.
Instead, search for "invert x axis" or similar, and you will find that you can invert the axis via
ax = df.plot(...)
ax.invert_xaxis()
This is not a workaround. It is the solution. (How much easier can it get?)
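As a minimal sketch of how this could look for the scatter case in the question: the DataFrame below is a hypothetical stand-in for the asker's data (only the column names 'Peso' and 'Humedad' come from the question), and DataFrame.plot.scatter is used instead of the Series-based attempts that raised AttributeError.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in data, both columns in descending order
data = pd.DataFrame(
    {"Peso": [5.0, 4.6, 4.1, 3.9], "Humedad": [820, 760, 700, 650]}
)

# scatter is available on the DataFrame plot accessor, not on a Series
ax = data.plot.scatter(x="Peso", y="Humedad", figsize=(12, 9))
ax.invert_xaxis()  # the axis now runs from high to low; ticks follow automatically
ax.set_xlabel("Peso (kg)", fontsize=14)
ax.set_ylabel("Raw Counts", fontsize=14)
ax.grid()
```

The same invert_xaxis() call works on the Axes returned by plt.gca() after a plain plt.scatter(...), so no manual xticks list is needed.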

What is the difference between Series/DataFrame and ndarray?

Leaving aside that they come from two different libraries.
I know that a Series/DataFrame can hold any data type, and that an ndarray can also hold heterogeneous data.
I also know that all the slicing operations of numpy are applicable to a Series.
Is there any other difference between them?
After some research I found the answer to my question I asked above. For anyone who needs, here it is from pandas docs:
A key difference between Series and ndarray is that operations between
Series automatically align the data based on the label. Thus, you can
write computations without giving consideration to whether the Series
involved have the same labels.
An example:
s[1:] + s[:-1]
The result of the above would be NaN for both the first and the last index label.
If a label is not found in one Series or the other, the result will be marked as missing NaN.
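To make the alignment behavior concrete, here is a small sketch comparing the same expression on a Series and on its underlying ndarray. The values and labels are made up for illustration, and .iloc is used so that the slices are unambiguously positional.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])

# Series: operands are matched by label, so "a" and "d" (each present
# in only one operand) come out as NaN.
aligned = s.iloc[1:] + s.iloc[:-1]

# ndarray: operands are matched purely by position.
arr = s.to_numpy()
positional = arr[1:] + arr[:-1]

print(aligned)
print(positional)
```

The Series result keeps all four labels, with b = 2 + 2 and c = 3 + 3, while the ndarray result has three elements, each the sum of neighboring values.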

Is there a way of using Pandas or Matplotlib to plot Pandas Time Series density?

I am having a hard time plotting the density of a Pandas time series.
I have a data frame with perfectly organised timestamps, like below:
It's a web log, and I want to show the density of the timestamps, which indicates how many visitors there were in a given period of time.
My current solution is to extract the year, month, week and day of each timestamp and group by them, like below:
But I don't think this is an efficient way of dealing with time, and I couldn't find any good information on it; most of what I found is about plotting already calculated values against a date.
So, does anybody have suggestions on how to plot the density of a Pandas time series?
Much appreciated!
The best way to compute the values you want to plot is to use Series.resample; for example, to aggregate the count of dates daily, use this:
ser = pd.Series(1, index=dates)
ser.resample('D').sum()
The documentation there has more details depending on exactly how you want to resample & aggregate the data.
If you want to plot the result, you can use Pandas built-in plotting capabilities; for example:
ser.resample('D').sum().plot()
More info on plotting is here.
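A self-contained sketch of the resample approach above, using a handful of hypothetical timestamps in place of the web log (the variable name dates follows the snippet in the answer):

```python
import pandas as pd

# Hypothetical visit timestamps standing in for the web log
dates = pd.to_datetime(
    [
        "2015-01-01 10:00",
        "2015-01-01 12:30",
        "2015-01-02 09:15",
        "2015-01-04 23:59",
    ]
)

# One "hit" per timestamp; summing per day gives the number of visits per day
ser = pd.Series(1, index=dates)
daily = ser.resample("D").sum()

print(daily)
```

Days with no visits (here 2015-01-03) appear with a count of 0, which keeps the time axis evenly spaced when the result is passed to .plot().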

Which features of Pandas DataFrame could be used to model GPS Tracklog data (read from GPX file)

It's been months now since I started to use Pandas DataFrame to deserialize GPS data and perform some data processing and analyses.
Although I am very impressed with Pandas' robustness, flexibility and power, I'm a bit lost about which features I should use, and how, to model the data properly in terms of clarity, simplicity and computational speed.
Basically, each DataFrame is primarily indexed by a datetime object, having at least one column for a latitude-longitude tuple, and one column for elevation.
The first thing I do is calculate a new column with the geodesic distance between consecutive coordinate pairs (the first one being 0.0), using a function that takes two coordinate pairs as arguments. From that new column I can calculate the cumulative distance along the track, which I use as a Linear Referencing System.
The questions I need to address would be:
Is there a way to use, in the same DataFrame, two different monotonically increasing columns (cumulative distance and timestamp) as indexes, choosing whichever is more convenient in a given context at runtime, and to use these indexes to auto-align newly inserted rows?
In the specific case of applying a diff function that could be vectorized (applied as an array operation instead of an iterative pairwise loop), is there an idiomatic way to do that in pandas? Should I create a "coordinate" class which supports the diff (__sub__) operation so I could use dataframe.latlng.diff directly?
I'm not sure these questions are well formulated, but that is due, at least in part, to the overwhelming number of possibilities and to documentation that is still somewhat fragmented.
Also, any tip about using Pandas for GPS data (tracklogs) or Geospatial data in general is very much welcome.
Thanks for any help!
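One way the distance step described above could be vectorized is sketched below. It makes several assumptions: latitude and longitude are stored as separate float columns rather than as tuples, the column names (lat, lon, ele, step_km, cum_km) are hypothetical, and the haversine great-circle formula on a spherical Earth is swapped in for a true geodesic distance, so results differ slightly from an ellipsoidal calculation.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; part of the spherical assumption


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; accepts scalars, arrays or Series of degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (
        np.sin((lat2 - lat1) / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Hypothetical three-point track along the equator, indexed by timestamp
track = pd.DataFrame(
    {"lat": [0.0, 0.0, 0.0], "lon": [0.0, 1.0, 2.0], "ele": [10.0, 12.0, 11.0]},
    index=pd.to_datetime(
        ["2020-01-01 10:00", "2020-01-01 10:05", "2020-01-01 10:10"]
    ),
)

# shift() pairs each row with its predecessor, replacing the pairwise loop
prev = track[["lat", "lon"]].shift()
track["step_km"] = haversine_km(
    prev["lat"], prev["lon"], track["lat"], track["lon"]
).fillna(0.0)  # the first point has no predecessor, so its step is 0.0
track["cum_km"] = track["step_km"].cumsum()

print(track)
```

With the cumulative distance stored as an ordinary column, track.set_index("cum_km") gives a distance-indexed view when that is more convenient than the timestamp index.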

What exactly do the whiskers in pandas' boxplots specify?

In python-pandas boxplots with default settings, the red bar is the mean (corrected: the median), and the box spans the 25th to 75th percentiles, but what exactly do the whiskers mean in this case? Where is the documentation that gives the exact definition (I couldn't find it)?
Example code:
df.boxplot()
Example result:
Pandas just wraps the boxplot function from matplotlib. The matplotlib docs have the definition of the whiskers in detail:
whis : float, sequence, or string (default = 1.5)
As a float, determines the reach of the whiskers beyond the
first and third quartiles. In other words, where IQR is the
interquartile range (Q3-Q1), the upper whisker will extend to last
datum less than Q3 + whis*IQR. Similarly, the lower whisker will
extend to the first datum greater than Q1 - whis*IQR. Beyond the
whiskers, data are considered outliers and are plotted as individual
points.
Matplotlib (and Pandas) also gives you a lot of options to change this default definition of the whiskers:
Set this to an unreasonably high value to force the whiskers to show
the min and max values. Alternatively, set this to an ascending
sequence of percentile (e.g., [5, 95]) to set the whiskers at specific
percentiles of the data. Finally, whis can be the string 'range' to
force the whiskers to the min and max of the data.
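To check the definition numerically, the sketch below computes the default whisker ends by hand and compares them with matplotlib.cbook.boxplot_stats, the helper matplotlib uses to compute boxplot statistics. The sample data is made up. Note that recent matplotlib releases no longer accept the string 'range' mentioned in the quoted text; a percentile pair such as whis=(0, 100) is used for min/max whiskers instead.

```python
import numpy as np
from matplotlib import cbook

# Made-up sample with one obvious outlier
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30], dtype=float)

# Manual computation following the quoted definition with whis = 1.5
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
whislo = data[data >= lower_fence].min()  # lowest datum inside the fence
whishi = data[data <= upper_fence].max()  # highest datum inside the fence
outliers = data[(data < whislo) | (data > whishi)]

# The same statistics as computed by matplotlib
stats = cbook.boxplot_stats(data, whis=1.5)[0]

# Whiskers at the 0th and 100th percentiles, i.e. the min and max of the data
full_range = cbook.boxplot_stats(data, whis=(0, 100))[0]

print(whislo, whishi, outliers)
print(stats["whislo"], stats["whishi"], stats["fliers"])
print(full_range["whislo"], full_range["whishi"])
```

The whiskers end at actual data points (1 and 9 here), not at the fences themselves (-3.5 and 14.5), and the value 30 is drawn as an individual outlier point.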
Below is a graphic that illustrates this, taken from a stats.stackexchange answer. Note that k=1.5 if you don't supply the whis keyword in Pandas.
From Amelio Vazquez-Reina's answer in Boxplots in matplotlib: Markers and outliers:
The outliers (the + markers in the boxplot) are simply points outside of the wide [(Q1-1.5 IQR), (Q3+1.5 IQR)] margin below.
FYI: Confused by location of fences in box-whisker plots
You mention in your question that the red line is the mean - it is actually the median.
From the matplotlib link mentioned by Chang She above:
The box extends from the lower to upper quartile values of the data,
with a line at the median. The whiskers extend from the box to show
the range of the data. Flier points are those past the end of the
whiskers.
I didn't experiment, but there is a 'meanline' option which might put the line at the mean.
These are specified in the matplotlib documentation. By default, each whisker extends to the most extreme data point that lies within 1.5 times the interquartile range of the box edge.