Why is ggplot2 geom_col misreading discrete x-axis labels as continuous?

Aim: plot a column chart representing concentration values at discrete sites
Problem: the 14 site labels are numeric, so I think ggplot2 is assuming continuous data and adding spaces for what it sees as 'missing numbers'. I only want 14 columns with 14 marks/labels, matching the 14 values in the dataframe. I've tried converting the sites to factors and to characters, but neither works.
Also, how do you ensure the y-axis ends at '0', so the bottom of the columns meet the x-axis?
Thanks
Data:
Sites: 2,4,6,7,8,9,10,11,12,13,14,15,16,17
Concentration: 10,16,3,15,17,10,11,19,14,12,14,13,18,16

You have two questions in one with two pretty straightforward answers:
1. How to force a discrete axis when your column is a continuous one? To make ggplot2 draw a discrete axis, the data must be discrete. You can force your numeric data to be discrete by converting to a factor. So, instead of x=Sites in your plot code, use x=as.factor(Sites).
2. How to eliminate the white space below the columns in a column plot? You can control the limits of the y axis via the scale_y_continuous() function. By default, the limits extend a bit past the actual data (in this case, from 0 to the max Concentration). You can override that behavior via the expand= argument. Check the documentation for expansion() for more details, but here I'm going to use mult=, which uses a multiplication to find the new limits based on the data. I'm using 0 for the lower limit to make the lower axis limit equal the minimum in your data (0), and 0.05 as the upper limit to expand the chart limits about 5% past the max value (this is default, I believe).
Here's the code and resulting plot.
library(ggplot2)
df <- data.frame(
Sites = c(2,4,6,7,8,9,10,11,12,13,14,15,16,17),
Concentration = c(10,16,3,15,17,10,11,19,14,12,14,13,18,16)
)
ggplot(df, aes(x=as.factor(Sites), y=Concentration)) +
geom_col(color="black", fill="lightblue") +
scale_y_continuous(expand=expansion(mult=c(0, 0.05))) +
theme_bw()

Related

how fix the y-axis's rate in plot

I am fitting a line to estimate the slope of my graphs. The data sets are the same size. But look at these two pictures: the first one seems to have a larger slope, but that is not true; the second one has the larger slope. Because the y-axes have different scales, the first one looks steeper. Is there any way to fix the scale of the y-axis so that I can see by eye which one has the bigger slope?
code:
x = np.array(list(range(0,df.shape[0]))) # = array([0, 1, 2, ..., 3598, 3599, 3600])
df1[skill]=pd.to_numeric(df1[skill])
fit = np.polyfit(x, df1[skill], 1)
fit_fn = np.poly1d(fit)
df['fit_fn(x)']=fit_fn(x)
df[['Hodrick-Prescott filter',skill,'fit_fn(x)']].plot(title=skill + date)
Two ways:
One, use matplotlib.pyplot.axis to get the axis limits of the first figure and set the second figure to have the same axis limits (using the same function) (could also use get_ylim and set_ylim, which are specific to the y-axis but require directly referencing the Axes object)
Two, plot both in a subplots figure and set the argument sharey to True (my preferred, depending on the desired use)
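Both options can be sketched as follows. This is a minimal illustration with made-up data (the arrays steep and shallow are assumptions, not the asker's series), using the Agg backend only so that it runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(100)
# Illustrative data only: two straight lines with different slopes.
steep = 0.5 * x + 3.0
shallow = 0.1 * x + 3.0

# Option two: one figure, two panels, with the y-axis shared between them.
fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(8, 3))
ax1.plot(x, shallow)
ax1.set_title("shallow")
ax2.plot(x, steep)
ax2.set_title("steep")

# Option one: two independent figures; copy the y limits from one to the other.
fig_a, ax_a = plt.subplots()
ax_a.plot(x, steep)
fig_b, ax_b = plt.subplots()
ax_b.plot(x, shallow)
ax_b.set_ylim(ax_a.get_ylim())

shared_limits_match = ax1.get_ylim() == ax2.get_ylim()
copied_limits_match = ax_b.get_ylim() == ax_a.get_ylim()
```

With a pandas plot, the same idea should apply by passing ax=ax1 or ax=ax2 to DataFrame.plot so that both frames draw onto the shared axes.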

Adjusting the y-axis in ggplot (bar size, ordering, formatting)

I have the following data.table:
I would like to have a plot which shows the columns symbol and value in a box plot. The boxes should be ordered by the column value.
My code, that I've tried:
plot1 <- ggplot(symbols, aes(symbol, value, fill = from)) +
geom_bar(stat = 'identity') +
ggtitle(paste0("Total quantity traded: ", format(sum(symbols$quantity), scientific = FALSE, nsmall = 2, big.mark = " "))) +
theme_bw()
plot1
This returns the following plot:
What I would like to change:
- flip x- and y-axis
- show the correct height of boxes (y-axis)...currently the relation between the boxes is not correct.
- decreasing order of the boxes by columns value
- format the y-axis with two digits
- make the x-axis readable...currently the x-axis is just a long bunch of what is written in column symbol.
Thanks upfront for the help!
To make things a bit easier, it is suggested that you post your data frame as the output of dput(your.data.frame), which presents code that can be used to replicate your dataset in r.
With that being said, I recreated your data (it was not too big)--some numbers were rounded a bit to make things easier.
A few comments:
y-axis numbers are odd: The numbers on your y-axis are not numeric. If you type str(your.data.frame) you'll probably notice that "value" is not numeric, but a character or factor. This can be easily remedied via: df$value <- as.numeric(df$value), where df is your dataframe.
flipping axis: You can use coord_flip() (typically added to the end of your ggplot call). Be warned that when you do this, your aesthetics flip for the plot, so just keep that in mind.
your dataframe name is also a function/data name in r: This may not be causing any issues (due to your environment), but just be aware to use caution to name your dataset to not have names that are used in r elsewhere. This goes for column/variable names too. I don't think it causes any issues here, but just an FYI
geom_col vs geom_bar: Check out this documentation link for some description on the differences between geom_bar and geom_col. Basically, you want to use geom_bar when your y-axis is count, and geom_col when your y-axis is a value. Here, you want to plot a value, so choose geom_col(), and not geom_bar().
Fixing the issues in plot
Here's the representation of your data (note I rounded... hopefully got the actual data correct, because I manually had to copy each value):
   from  symbol  quantity      usd    value
1   BTC BTCUSDT 12910.470 6776.340 87485737
2   ETH ETHUSDT  6168.730  154.398   952445
3   BNB BNBUSDT 51002.650   14.764   753017
4   BNB  BNBBTC 31071.280   14.764   458745
5   ETH  ETHBTC  2216.576  154.398   342236
6   LTC LTCUSDT  4332.024   40.481   175368
7   BNB  BNBETH  3150.030   14.764    46507
8   LTC  LTCBTC   922.560   40.481    37346
9   LTC  LTCBNB   521.476   40.481    21110
10  NEO NEOUSDT  2438.353    7.203    17564
11  NEO  NEOBTC   417.930    7.203     3010
Here's the basic plot, flipped:
ggplot(df, aes(symbol, value, fill=from)) +
geom_col() +
coord_flip()
The problem here is that when you plot values... BTCUSDT is huge in comparison. I would suggest you plot on log of value. See this link for some advice on how to do that. I like the scale_y_log10() function, since it just works here pretty well:
ggplot(df, aes(symbol, value, fill=from)) +
geom_col() +
scale_y_log10() +
coord_flip()
If you wanted to keep the columns in the vertical orientation, you can still do that and keep the x-axis labels from running into each other. In that case, you can rotate the labels via theme(axis.text.x=...). Note the horizontal alignment adjustment (hjust=1), which forces the labels to be "right-aligned":
ggplot(df, aes(symbol, value, fill=from)) +
geom_col() +
scale_y_log10() +
theme(axis.text.x=element_text(angle=45, hjust=1))

dual y axis plotting causes data points looks messy at left corner of chart

I am using the MPAndroidChart toolkit for data visualization. Plotting works smoothly, but a problem arises when I try to plot dual y axes (left and right). As soon as the right axis appears, the data points are no longer spread across the whole x axis; all of them appear at the left of the chart. How can I spread the data points?
mChart = (LineChart) findViewById(R.id.Chart);
mChart.setGridBackgroundColor(Color.parseColor("#F4F4F4"));
mChart.setDrawGridBackground(false);
mChart.setTouchEnabled(true);
mChart.setHighlightEnabled(false);
mChart.setDragEnabled(false);
mChart.setScaleEnabled(true);
mChart.setPinchZoom(true);
mChart.setDescription("");
mChart.getAxisLeft().setAxisMaxValue(YMaxValue);
mChart.getAxisLeft().setAxisMinValue(YMinValue);
mChart.getAxisLeft().setStartAtZero(false);
mChart.getAxisRight().setEnabled(false);
if(General.InnerClass.Y2AxisValues.size()>0)
{
mChart.getAxisRight().setEnabled(true);
mChart.getAxisRight().setSpaceBottom(12.25f);
mChart.getAxisRight().setAxisMaxValue(Y2MaxValue);
mChart.getAxisRight().setAxisMinValue(Y2MinValue);
mChart.getAxisRight().setStartAtZero(false);
mChart.getXAxis().setAvoidFirstLastClipping(true);
}
mChart.setData(data);
progbar.setVisibility(View.GONE);
mChart.invalidate();
Before
After
The data points are exactly where they should be. The only thing that has changed is the range of values that is displayed on each axis.
--> this range is now significantly lower because only one set of data is represented by each axis; before, one axis had to scale large enough to display both datasets.
I suggest you read the documentation of the YAxis and simply increase the range of values that should be displayed on the axis.
The problem was that I provided the x values again when plotting the y2 axis. The values for the x axis should be provided only once; providing them twice appends them to the existing x values, so the x axis gets doubled.
Thanks, #philipp

What exactly do the whiskers in pandas' boxplots specify?

In python-pandas boxplots with default settings, the red bar is the median (I first wrote "mean"), and the box spans the 25th and 75th percentiles, but what exactly do the whiskers mean in this case? Where is the documentation to figure out the exact definition (couldn't find it)?
Example code:
df.boxplot()
Example result:
Pandas just wraps the boxplot function from matplotlib. The matplotlib docs have the definition of the whiskers in detail:
whis : float, sequence, or string (default = 1.5)
As a float, determines the reach of the whiskers to the beyond the
first and third quartiles. In other words, where IQR is the
interquartile range (Q3-Q1), the upper whisker will extend to last
datum less than Q3 + whis*IQR). Similarly, the lower whisker will
extend to the first datum greater than Q1 - whis*IQR. Beyond the
whiskers, data are considered outliers and are plotted as individual
points.
Matplotlib (and Pandas) also gives you a lot of options to change this default definition of the whiskers:
Set this to an unreasonably high value to force the whiskers to show
the min and max values. Alternatively, set this to an ascending
sequence of percentile (e.g., [5, 95]) to set the whiskers at specific
percentiles of the data. Finally, whis can be the string 'range' to
force the whiskers to the min and max of the data.
Below is a graphic from a stats.stackexchange answer that illustrates this. Note that k=1.5 if you don't supply the whis keyword in Pandas.
From Amelio Vazquez-Reina's answer in Boxplots in matplotlib: Markers and outliers:
The outliers (the + markers in the boxplot) are simply points outside of the wide [(Q1-1.5 IQR), (Q3+1.5 IQR)] margin below.
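To make the definition concrete, here is a minimal sketch in plain Python of how the whisker ends and outliers follow from the quartiles. The helper names percentile and whisker_stats are hypothetical, and the quartiles are assumed to use linear interpolation between neighbouring points; this is an illustration of the rule quoted above, not matplotlib's own code:

```python
def percentile(sorted_values, q):
    """q-th percentile using linear interpolation between neighbouring points (an assumption)."""
    position = (len(sorted_values) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


def whisker_stats(values, whis=1.5):
    """Quartiles, whisker ends, and outliers for a float whis."""
    data = sorted(values)
    q1 = percentile(data, 25)
    q3 = percentile(data, 75)
    iqr = q3 - q1
    low_fence = q1 - whis * iqr
    high_fence = q3 + whis * iqr
    # Each whisker stops at the most extreme data point still inside its fence,
    # not at the fence itself.
    whisker_low = min(v for v in data if v >= low_fence)
    whisker_high = max(v for v in data if v <= high_fence)
    fliers = [v for v in data if v < low_fence or v > high_fence]
    return {"q1": q1, "q3": q3, "whisker_low": whisker_low,
            "whisker_high": whisker_high, "fliers": fliers}


# Made-up sample: nine evenly spaced values plus one far-away point.
stats = whisker_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
```

For this sample the upper fence is Q3 + 1.5*IQR = 7.75 + 6.75 = 14.5, so the upper whisker ends at 9 (the last datum below the fence) and 30 is drawn as an outlier.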
FYI: Confused by location of fences in box-whisker plots
You mention in your question that the red line is the mean - it is actually the median.
From the matplotlib link mentioned by Chang She above:
The box extends from the lower to upper quartile values of the data,
with a line at the median. The whiskers extend from the box to show
the range of the data. Flier points are those past the end of the
whiskers.
I didn't experiment, but there is a 'meanline' option which might put the line at the mean.
These are specified in the matplotlib documentation. By default, each whisker extends to the most extreme data point that lies within 1.5 times the interquartile range of the box edge.

plotting matrices with gnuplot

I am trying to plot a matrix in Gnuplot as I would using imshow in Matplotlib. That means I just want to plot the actual matrix values, not the interpolation between values. I have been able to do this by trying
splot "file.dat" u 1:2:3 ps 5 pt 5 palette
This way we are telling the program to use columns 1,2 and 3 in the file, use squares of size 5 and space the points with very narrow gaps. However the points in my dataset are not evenly spaced and hence I get discontinuities.
Does anyone know a method of plotting matrix values in gnuplot when the points are not evenly spaced along the x and y axes?
Gnuplot doesn't need to have evenly spaced X and Y axes (see another one of my answers: https://stackoverflow.com/a/10690041/748858). I frequently deal with grids that look like x[i] = f_x(i) and y[j] = f_y(j). This is quite trivial to plot; the datafile just looks like:
#datafile.dat
x1 y1 z11
x1 y2 z12
...
x1 yN z1N
#<--- blank line (leave these comments out of your datafile ;)
x2 y1 z21
x2 y2 z22
...
x2 yN z2N
#<--- blank line
...
...
#<--- blank line
xN y1 zN1
...
xN yN zNN
(note the blank lines)
A datafile like that can be plotted as:
set view map
splot "datafile.dat" u 1:2:3 w pm3d
The option set pm3d corners2color can be used to fine-tune which corner is used to color the resulting rectangle.
Also note that you could make essentially the same plot doing this:
set view map
plot "datafile.dat" u 1:2:3 w image
Although I don't use this one myself, so it might fail with a non-equally spaced rectangular grid (you'll need to try it).
Response to your comment
Yes, pm3d does generate (M-1)x(N-1) quadrilaterals as you've alluded to in your comment -- It takes the 4 corners and (by default) averages their value to assign a color. You seem to dislike this -- although (in most cases) I doubt you'd be able to tell a difference in the plot for reasonably large M and N (larger than 20). So, before we go on, you may want to ask yourself if it is really necessary to plot EVERY POINT.
That being said, with a little work, gnuplot can still do what you want. The solution is to specify that a particular corner is to be used to assign the color to the entire quadrilateral.
#specify that the first corner should be used for coloring the quadrilateral
set pm3d corners2color c1 #could also be c2,c3, or c4.
Then simply append the last row and last column of your matrix so that they are plotted twice (making up an extra gridpoint to accommodate the larger dataset). You're not quite there yet: you still need to shift your grid values by half a cell so that your quadrilaterals are centered on the point in question. Which way you shift the cells depends on your choice of corner (c1,c2,c3,c4), so you'll need to play around with it to figure out which one you want.
Note that the problem here isn't gnuplot. It's that there isn't enough information in the datafile to construct an MxN surface given MxN triples. At each point, you need to know its position (x,y), its value (z), and also the size of the quadrilateral to be drawn there, which is more information than you've packed into the file. Of course, you can guess the size at the interior points (just meet halfway), but there's no guessing at the exterior points. "But why not just use the size of the next interior point?" That's a good question, and it would (typically) work well for rectangular grids, but that is only a special case (although a common one) and would (likely) fail miserably for many other grids. The point is that gnuplot decided that averaging the corners is typically "close enough", but then gives you the option to change it.
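For a rectangular (possibly unevenly spaced) grid, the padding and half-cell shift described above can be done in a small preprocessing step. The following Python sketch is one possible way to do it, under stated assumptions: the helper names centers_to_edges and pm3d_blocks are hypothetical, interior cell edges are placed halfway between neighbouring points, and the outer edges reuse the size of the adjacent interior half-cell (exactly the guess discussed above, so it only suits rectangular grids):

```python
def centers_to_edges(centers):
    """Turn N grid points into N+1 cell edges: midpoints inside, mirrored half-steps outside."""
    if len(centers) < 2:
        raise ValueError("need at least two grid points")
    mids = [(a + b) / 2.0 for a, b in zip(centers, centers[1:])]
    first = centers[0] - (mids[0] - centers[0])
    last = centers[-1] + (centers[-1] - mids[-1])
    return [first] + mids + [last]


def pm3d_blocks(xs, ys, z):
    """Build datafile text with len(xs)+1 blocks of len(ys)+1 lines; z[i][j] belongs to (xs[i], ys[j])."""
    x_edges = centers_to_edges(xs)
    y_edges = centers_to_edges(ys)
    blocks = []
    for i, x_edge in enumerate(x_edges):
        row = z[min(i, len(xs) - 1)]  # the last row is repeated for the extra edge
        lines = []
        for j, y_edge in enumerate(y_edges):
            value = row[min(j, len(ys) - 1)]  # the last column is repeated as well
            lines.append(f"{x_edge} {y_edge} {value}")
        blocks.append("\n".join(lines))
    # Blank lines separate the blocks, as gnuplot expects for gridded data.
    return "\n\n".join(blocks) + "\n"


# Made-up 3x2 example with uneven x spacing.
edges = centers_to_edges([0, 1, 3])
text = pm3d_blocks([0, 1, 3], [0, 2], [[1, 2], [3, 4], [5, 6]])
```

In this layout each quadrilateral carries its own z value at the corner with the smallest x and y, so the matching corners2color setting is whichever of c1 to c4 selects that corner for your scan order; as noted above, that may take a little experimenting.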
See the explanation for the input data here. You may have to change your data file's format accordingly.