Gridding/binning data - pandas

I have a dataset with three columns: lat, lon, and wind speed. My goal is to have a 2-dimensional lat/lon gridded array that sums the wind speed observations that fall within each gridbox. It seems like that should be possible with groupby or cut in pandas. But I can't puzzle through how to do that.
Here is an example of what I'm trying to replicate from another language: https://www.ncl.ucar.edu/Document/Functions/Built-in/bin_sum.shtml

It sounds like you are using pandas. Are the data already binned? If so, something like this should work:
data.groupby(["lat_bins", "lon_bins"]).sum()
If the lat and lon data are not binned yet, you can use pandas.cut to create a binned value column like this:
data["lat_bins"] = pandas.cut(x=data["lat"], bins=[...some binning...])

Related

Numpy: using the interpolation function to average between the new ticks

I have a graph like this, and since I want to use a log scale in the frequency domain, I want to resample the data so the points are evenly distributed. When I use the interpolate function of numpy like this:
f_new = np.geomspace(f[0], f[-1], points)
mag_new = np.interp(f_new, f, mag)
it just interpolates between the neighboring points, but I want to take the average of the nearest points.
Is there an elegant, NumPy-ish way to do this using some aggregation function?
Thanks in advance!
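One possible approach (a sketch, not an answer from the thread): call np.histogram twice to get per-bin sums and counts, then divide. The names f, mag, and points follow the question.

import numpy as np

# geometrically spaced bin edges, so the bins are evenly spaced on a log axis
edges = np.geomspace(f[0], f[-1], points + 1)
counts, _ = np.histogram(f, bins=edges)                 # samples per bin
sums, _ = np.histogram(f, bins=edges, weights=mag)      # sum of mag per bin
f_new = np.sqrt(edges[:-1] * edges[1:])                 # geometric bin centres
mag_new = sums / np.where(counts == 0, np.nan, counts)  # bin averages; NaN where a bin is empty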

How Can I Find Peak Values of Defined Areas from Spectrogram Data using numpy?

I have spectrogram data from an audio analysis which looks like this:
On one axis I have frequencies in Hz and on the other, times in seconds. I added the grid over the map to show the actual data points. Due to the nature of the frequency analysis used, the best results never give evenly spaced time and frequency values.
To allow comparison of data from multiple sources, I would like to normalize this data. For this reason, I would like to calculate the peak values (maximum and minimum) for specified areas in the map.
The second visualization shows the areas where I would like to calculate the peak values. I marked an area with a green rectangle to visualize this.
While for the time values I would like to use equally spaced ranges (e.g. 0.0-10.0, 10.0-20.0, 20.0-30.0), the frequency ranges are unevenly distributed. At higher frequencies they will be like 450-550, 550-1500, 1500-2500, ...
You can download an example dataset here: data.zip. You can unpack it like this:
with np.load(DATA_PATH) as data:
    frequency_labels = data['frequency_labels']
    time_labels = data['time_labels']
    spectrogram_data = data['data']
DATA_PATH has to point to the path of the .npz data file.
As input, I would provide an array of frequency and time ranges. The result should be another 2-D NumPy ndarray with either the maximum or the minimum values. As the amount of data is huge, I would like to rely on NumPy as much as possible to speed up the calculations.
How do I calculate the maximum/minimum values of defined areas from a 2-D data map?
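A sketch of one way to do this with ufunc.reduceat, assuming frequency_labels and time_labels are sorted ascending (the function name and the edge arrays below are illustrative, not from the question):

import numpy as np

def bin_peaks(data, freq_labels, time_labels, freq_edges, time_edges, ufunc=np.maximum):
    # map the bin edges onto row/column indices of the data map
    fi = np.searchsorted(freq_labels, freq_edges)
    ti = np.searchsorted(time_labels, time_edges)
    # restrict to the span covered by the edges, then reduce each segment
    sub = data[fi[0]:fi[-1], ti[0]:ti[-1]]
    out = ufunc.reduceat(sub, fi[:-1] - fi[0], axis=0)
    out = ufunc.reduceat(out, ti[:-1] - ti[0], axis=1)
    return out  # shape: (len(freq_edges) - 1, len(time_edges) - 1)

freq_edges = np.array([450, 550, 1500, 2500])
time_edges = np.array([0.0, 10.0, 20.0, 30.0])
peaks_max = bin_peaks(spectrogram_data, frequency_labels, time_labels, freq_edges, time_edges)
peaks_min = bin_peaks(spectrogram_data, frequency_labels, time_labels, freq_edges, time_edges, ufunc=np.minimum)

One reduceat quirk to keep in mind: an empty segment returns the element at its start index instead of raising, so make sure every range contains at least one label.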

How to extract values from a dataframe in Julia

I would like to plot the energy per spin ⟨E⟩/N against the temperature T.
However, I am not sure how to "extract" the values from the table below and plot them.
Just do:
using Plots
plot(data.T, data.Emean)
This is the simplest way to get a column from a data frame.
You might also want to check out this notebook: https://github.com/bkamins/Julia-DataFrames-Tutorial/blob/master/02_basicinfo.ipynb, in the section "Most elementary get and set operations".

Bin and sum over time for precipitation data in CSV form

I have a large CSV file with this formatting: month, year, lat, long, rainfall
How do I bin and sum over time by year? Also, as a separate question, how do I separate the data out into three bins of rainfall in different basins?
It would be better if you could post an example of the data frame you have, but something like this might work:
import pandas as pd

# read the CSV into a data frame
df = pd.read_csv('your_csv_path')
# group by year and sum the rainfall per year
df.groupby('year')['rainfall'].sum().reset_index(name='rainfall_sum')
For grouping into basins, I assume you need to make a scatter plot first; you might need a clustering algorithm. Take a look at the various algorithms in scikit-learn: https://scikit-learn.org/stable/modules/clustering.html
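For example, a minimal sketch using k-means on the coordinates (three basins and the lat/long column names are assumptions):

from sklearn.cluster import KMeans

# assign each row to one of three spatial clusters ("basins")
df['basin'] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(df[['lat', 'long']])
# rainfall totals per basin and year
df.groupby(['basin', 'year'])['rainfall'].sum()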

How do I plot the output of numpy.fft in bins?

I wrote some Python code that plots the fast Fourier transform of a pandas DataFrame called res, which contains two columns of data ("data" and "filtered"):
fft = pd.DataFrame(np.abs(np.fft.rfft(res["data"])))
fft.columns = ["data"]
fft["filtered"] = pd.DataFrame(np.abs(np.fft.rfft(res["filtered"])))
fft.index = np.fft.rfftfreq(len(res))  # frequency axis matching the np.fft.rfft output
fft.plot(logy=True, logx=True)
The res dataset contains some original randomised datapoints in the "data" column, along with the same data after passing through a filter. The output looks reasonable to me.
While this plot is probably correct, it's not very useful to look at. How can I organise this data into a smaller number of discrete frequency bins to make it easier to understand?
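One option (a sketch; the number of bins and the use of the mean are choices, not requirements): cut the frequency axis into logarithmically spaced bins and average the magnitudes within each bin.

import numpy as np
import pandas as pd

n_bins = 50  # assumed; pick whatever reads well
# geometric bin edges, skipping the DC term at frequency 0
edges = np.geomspace(fft.index[1], fft.index[-1], n_bins + 1)
centres = np.sqrt(edges[:-1] * edges[1:])  # geometric bin centres, used as labels
binned = fft.iloc[1:].groupby(pd.cut(fft.index[1:], edges, labels=centres)).mean()
binned.index = binned.index.astype(float)  # categorical labels back to numbers for plotting
binned.plot(logy=True, logx=True)

Empty bins come out as NaN and are simply skipped in the plot.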