How to extract values from a dataframe in Julia - dataframe

I would like to plot the energy per spin < E >/N against the temperature T.
However, I am not sure how to "extract" the values from the table below and plot them.

Just do:
using Plots
plot(data.T, data.Emean)
This is the simplest way to get a column from a data frame.
You might also want to check out this notebook: https://github.com/bkamins/Julia-DataFrames-Tutorial/blob/master/02_basicinfo.ipynb in the section "
Most elementary get and set operations".

Related

How to plot timeseries with many NaNs?

Originally I had a dataframe containing power consumption of some devices like this:
and I wanted to plot power consumption vs time for different devices, one plot per one of 6 possible dates. After grouping by date I got plots like this one (for each group = date):
Then I tried to create similar plot, but switch date and device roles so that it is grouped by device and colored by date. In order to do it I prepared this dataframe:
It is similar to the previous one, but has many NaN values due to differing measurement times. I thought it won't be a problem, but then after grouping by device, subplots look like this one (ex is just a name of sub-dataframe extracted from loop going through groups = devices):
This is the ex dataframe (mean lag between observations is around 20 seconds)
Question: What should I do to make plot grouped by device look like ones grouped by date? (I'd like to use ex dataframe but handle NaNs somehow.)
I found solution in answer to similar question: ex.interpolate(method='linear').plot(). This line will fill gaps between data points via interpolation between plotting. This is the result:
Another thing that can help is adding .plot(marker='o', ms = 3) which won't fill gaps between points, but at least will make points visible (previously some points, mainly the peaks in energy consumption were too small in scale of whole plot). This is the result:

Ordering Seaborn heatmap with non-index variable

I am currently in the process of moving from R and ggplot2 to seaborn for a lot of work because R was struggling with the size of data I was using. I am currently working on a heatmap that is fairly simplistic and I have been able to render the general heatmap without too many issues, but I am not sure how to adjust the ordering of my categoricals for the heatmap.
In this case my data has this header:
Sample Position Depth Order
Sample is the "y-axis" categorical and Position is the "x-axis" categorical. Depth is the value of the cell. Order is a meta-value calculated elsewhere, but I want to use Order as my ordering value for the y-axis, while retaining Sample as the label. Is there a way to do this?
You need to provide a rectangular format, or matrix for sns.heatmap, so though you have a Order column for ordering Sample, it's not clear whether there is a unique value for each 'Order' category.
Below I use a simple example, and basically you change the 'Sample' to a category, according to the mean value of 'Order'. It is like changing the factor levels in R. Also, you need to make sure there is no NaN otherwise the heatmap might complain:
df = pd.DataFrame({'Sample':np.repeat(['A','B','C'],4),
'Position':[1,2,3,4]*3,
'Depth':np.random.normal(0,1,12),
'Order':np.repeat([2,1,3],4)})
y_order = df.groupby('Sample')['Order'].agg('mean').sort_values().index
df['Sample'] = pd.Categorical(df['Sample'],ordered=True,categories=y_order)
sns.heatmap(df.pivot(index='Sample',columns='Position', values='Depth'))

Gridding/binning data

I have a dataset with three columns: lat, lon, and wind speed. My goal is to have a 2-dimensional lat/lon gridded array that sums the wind speed observations that fall within each gridbox. It seems like that should be possible with groupby or cut in pandas. But I can't puzzle through how to do that.
Here is an example of what I'm trying to replicate from another language: https://www.ncl.ucar.edu/Document/Functions/Built-in/bin_sum.shtml
It sounds like you are using pandas. Are the data already binned? If so, something like this should work
data.groupby(["lat_bins", "lon_bins"]).sum()
If the lat and lon data are not binned yet, you can use pandas.cut to create a binned value column like this
data["lat_bins"] = pandas.cut(x = data["lat"], bins=[...some binning...])

Is there a way of using Pandas or Matplotlib to plot Pandas Time Series density?

I am having a hard time of plotting the density of Pandas time series.
I have a data frame with perfectly organised timestamps, like below:
It's a web log, and I want to show the density of the timestamp, which indicates how many visitors in certain period of time.
My solution atm is extracting the year, month, week and day of each timestamp, and group them. Like below:
But I don't think it would be a efficient way of dealing with time. And I couldn't find any good info on this, more of them are about plot the calculated values on a date or something.
So, anybody have any suggestions on how to plot Pandas time series?
Much appreciated!
The best way to compute the values you want to plot is to use Series.resample; for example, to aggregate the count of dates daily, use this:
ser = pd.Series(1, index=dates)
ser.resample('D').sum()
The documentation there has more details depending on exactly how you want to resample & aggregate the data.
If you want to plot the result, you can use Pandas built-in plotting capabilities; for example:
ser.resample('D').sum().plot()
More info on plotting is here.

Pandas, compute many means with bootstrap confidence intervals for plotting

I want to compute means with bootstrap confidence intervals for some subsets of a dataframe; the ultimate goal is to produce bar graphs of the means with bootstrap confidence intervals as the error bars. My data frame looks like this:
ATG12 Norm ATG5 Norm ATG7 Norm Cancer Stage
5.55 4.99 8.99 IIA
4.87 5.77 8.88 IIA
5.98 7.88 8.34 IIC
The subsets I'm interested in are every combination of Norm columns and cancer stage. I've managed to produce a table of means using:
df.groupby('Cancer Stage')['ATG12 Norm', 'ATG5 Norm', 'ATG7 Norm'].mean()
But I need to compute bootstrap confidence intervals to use as error bars for each of those means using the approach described here: http://www.randalolson.com/2012/08/06/statistical-analysis-made-easy-in-python/
It boils down to:
import scipy
import scikits.bootstraps as bootstraps
CI = bootstrap.ci(data=Series, statfunction=scipy.mean)
# CI[0] and CI[1] are your low and high confidence intervals
I tried to apply this method to each subset of data with a nested-loop script:
for i in data.groupby('Cancer Stage'):
for p in i.columns[1:3]: # PROBLEM!!
Series = i[p]
print p
print Series.mean()
ci = bootstrap.ci(data=Series, statfunction=scipy.mean)
Which produced an error message
AttributeError: 'tuple' object has no attribute called 'columns'
Not knowing what "tuples" are, I have some reading to do but I'm worried that my current approach of nested for loops will leave me with some kind of data structure I won't be able to easily plot from. I'm new to Pandas so I wouldn't be surprised to find there's a simpler, easier way to produce the data I'm trying to graph. Any and all help will be very much appreciated.
The way you iterate over the groupby-object is wrong! When you use groupby(), your data frame is sliced along the values in your groupby-column(s), together with these values as group names, forming a so-called "tuple":
(name, dataforgroup). The correct recipe for iterating over groupby-objects is
for name, group in data.groupby('Cancer Stage'):
print name
for p in group.columns[0:3]:
...
Please read more about the groupby-functionality of pandas here and go through the python-reference in order to understand what tuples are!
Grouping data frames and applying a function is essentially done in one statement, using the apply-functionality of pandas:
cols=data.columns[0:2]
for col in columns:
print data.groupby('Cancer Stage')[col].apply(lambda x:bootstrap.ci(data=x, statfunction=scipy.mean))
does everything you need in one line, and produces a (nicely plotable) series for you
EDIT:
I toyed around with a data frame object I created myself:
df = pd.DataFrame({'A':range(24), 'B':list('aabb') * 6, 'C':range(15,39)})
for col in ['A', 'C']:
print df.groupby('B')[col].apply(lambda x:bootstrap.ci(data=x.values))
yields two series that look like this:
B
a [6.58333333333, 14.3333333333]
b [8.5, 16.25]
B
a [21.5833333333, 29.3333333333]
b [23.4166666667, 31.25]