How do I plot the output of numpy.fft in bins? - pandas

I wrote some python code that plots the fast fourier transform of a pandas DataFrame called res, which contains two columns of data ("data" and "filtered"):
fft = pd.DataFrame(np.abs(np.fft.rfft(res["data"])))
fft.columns = ["data"]
fft["filtered"] = pd.DataFrame(np.abs(np.fft.rfft(res["filtered"])))
fft.index = np.fft.rfftfreq(len(res))  # frequencies matching the rfft output; no negative Nyquist entry
fft.plot(logy=True, logx=True)
The res dataset contains some original randomised data points in the "data" column along with the same data after passing through a filter. The output looks reasonable to me.
While this plot is probably correct, it's not very useful to look at. How can I organise this data into a smaller number of discrete frequency bins to make it easier to understand?
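One possible approach, sketched here under assumptions rather than taken from the original thread, is to average the FFT magnitudes within a small number of logarithmically spaced frequency bins. The helper name bin_spectrum, the default of 20 bins, and the choice of the mean as the per-bin statistic are all illustrative; the random demo signal merely stands in for a column such as res["data"].

```python
import numpy as np
import pandas as pd


def bin_spectrum(signal, n_bins=20):
    """Average the rFFT magnitude of `signal` inside log-spaced frequency bins.

    Hypothetical helper: the name, bin count and use of the mean are assumptions.
    """
    # Drop the DC term (frequency 0) so that log-spaced edges are well defined.
    magnitude = np.abs(np.fft.rfft(signal))[1:]
    freqs = np.fft.rfftfreq(len(signal))[1:]  # cycles per sample, up to 0.5

    # Log-spaced bin edges between the lowest and highest non-zero frequency.
    edges = np.geomspace(freqs[0], freqs[-1], n_bins + 1)

    # Bin number of every frequency; clipping keeps the end points in range.
    idx = np.clip(np.digitize(freqs, edges) - 1, 0, n_bins - 1)

    # Mean magnitude per occupied bin, labelled by the geometric bin centre.
    binned = pd.Series(magnitude).groupby(idx).mean()
    occupied = binned.index.to_numpy()
    binned.index = np.sqrt(edges[occupied] * edges[occupied + 1])
    return binned


# Demo on random data standing in for res["data"].
rng = np.random.default_rng(0)
demo = bin_spectrum(rng.normal(size=1024), n_bins=10)
```

Applying the helper to both columns and collecting the results in one DataFrame would allow the same fft.plot(logy=True, logx=True) call to be used on far fewer points.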

Related

Normalization of data using MinMaxScaler over a list of dataframes

I have been trying to normalize my data using the MinMaxScaler method. However, I have a list of data frames (20 columns), and each of them contains a different number of data readings. For example, the first one has 23 cells, and each cell holds 1500x24 values (the number of rows varies from cell to cell). My plan is to find the min and max across the entire data set by scanning those 20 categories (the col variable). If I use scikit-learn's MinMaxScaler, it normalizes only within the specific sub-category it was fitted on, not over the whole data set (or at least I could not get it to). I am probably also having some issues importing the data, because I could not convert it into exactly the array form I wanted. The plan is to convert the data into arrays, normalize them, and then transform them into grayscale RGB images.
Thus, the problem is to find the max and min across all 20 data frames in the list and then normalize every sub-category based on those max and min values. Could anyone give some ideas?
col = ['UN_DTE','UN_DTF','UN_SM05','UN_SM10','UN_SM15','D1_DTE', 'D1_DTF', 'D1_SM05', 'D1_SM10', 'D1_SM15', 'D2_DTE','D2_DTF','D2_SM05','D2_SM10','D2_SM15','D3_DTE','D3_DTF','D3_SM05','D3_SM10','D3_SM15']
df = []
for i in range(len(col)):
    df.append(pd.DataFrame(mat[col[i]]))
# Normalize the data
scaler = MinMaxScaler()
scaler.fit(df[0][0][0])
scaled_features = scaler.transform(df[0][0][0])
pd.DataFrame(scaled_features).describe()
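A minimal sketch of the global normalization described above, assuming the nested cells have already been unpacked into a flat list of numeric NumPy arrays (how mat is nested is not shown in the question, so that step is left out). The function name global_min_max_scale is hypothetical, and plain NumPy is used instead of MinMaxScaler because a single pair of min and max values is shared by every array.

```python
import numpy as np


def global_min_max_scale(arrays):
    """Scale every array to [0, 1] using one min and max shared by all of them.

    Hypothetical helper; assumes `arrays` is a flat list of numeric arrays
    that may have different shapes, and that the data are not all identical.
    """
    lo = min(np.min(a) for a in arrays)
    hi = max(np.max(a) for a in arrays)
    return [(np.asarray(a, dtype=float) - lo) / (hi - lo) for a in arrays]


# Two arrays with different numbers of rows, as in the question.
cells = [np.array([[0.0, 5.0]]), np.array([[10.0], [2.0]])]
scaled = global_min_max_scale(cells)
```

Multiplying the scaled arrays by 255 and casting to an 8-bit integer type would be one way to obtain grayscale pixel values afterwards.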

How Can I Find Peak Values of Defined Areas from Spectrogram Data using numpy?

I have spectrogram data from an audio analysis which looks like this:
On one axis I have frequencies in Hz and on the other, times in seconds. I added the grid over the map to show the actual data points. Due to the nature of the frequency analysis used, the best results never give evenly spaced time and frequency values.
To allow comparison of data from multiple sources, I would like to normalize this data. For this reason, I would like to calculate the peak values (maximum and minimum values) for specified areas in the map.
The second visualization shows the areas where I would like to calculate the peak values. I marked an area with a green rectangle to visualize this.
For the time values, I would like to use equally spaced ranges (e.g. 0.0-10.0, 10.0-20.0, 20.0-30.0), but the frequency ranges are unevenly distributed. At higher frequencies, they will be like 450-550, 550-1500, 1500-2500, ...
You can download an example data-set here: data.zip. You can unpack the datasets like this:
with np.load(DATA_PATH) as data:
    frequency_labels = data['frequency_labels']
    time_labels = data['time_labels']
    spectrogram_data = data['data']
DATA_PATH has to point to the path of the .npz data file.
As input, I would provide an array of frequency and time ranges. The result should be another 2d NumPy ndarray with either the maximum or the minimum values. As the amount of data is huge, I would like to rely on NumPy as much as possible to speed up the calculations.
How do I calculate the maximum/minimum values of defined areas from a 2d data map?
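One way this could be approached, given here as a sketch under assumptions: the label arrays are sorted in ascending order, the data array has shape (number of frequencies, number of times), and each area is the half-open range [lower edge, upper edge) on both axes. The function name area_extrema and the tiny demo arrays are hypothetical. The loop runs only over the areas, while the reduction inside each area is done by NumPy.

```python
import numpy as np


def area_extrema(data, freq_labels, time_labels, freq_edges, time_edges, reduce=np.max):
    """Reduce `data` over rectangular areas given by frequency and time edges.

    Hypothetical helper. Areas that contain no data points are left as NaN.
    """
    # Position of each edge within the sorted label arrays.
    f_idx = np.searchsorted(freq_labels, freq_edges)
    t_idx = np.searchsorted(time_labels, time_edges)

    out = np.full((len(freq_edges) - 1, len(time_edges) - 1), np.nan)
    for i, (f0, f1) in enumerate(zip(f_idx[:-1], f_idx[1:])):
        for j, (t0, t1) in enumerate(zip(t_idx[:-1], t_idx[1:])):
            block = data[f0:f1, t0:t1]
            if block.size:
                out[i, j] = reduce(block)
    return out


# Small made-up grid with unevenly spaced labels.
freqs = np.array([100.0, 200.0, 600.0, 1600.0])
times = np.array([1.0, 5.0, 12.0, 18.0])
values = np.arange(16, dtype=float).reshape(4, 4)

peaks = area_extrema(values, freqs, times, [0, 550, 2500], [0, 10, 20])
lows = area_extrema(values, freqs, times, [0, 550, 2500], [0, 10, 20], reduce=np.min)
```

Passing reduce=np.min returns the minimum of each area instead of the maximum.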

Gridding/binning data

I have a dataset with three columns: lat, lon, and wind speed. My goal is to have a 2-dimensional lat/lon gridded array that sums the wind speed observations that fall within each gridbox. It seems like that should be possible with groupby or cut in pandas. But I can't puzzle through how to do that.
Here is an example of what I'm trying to replicate from another language: https://www.ncl.ucar.edu/Document/Functions/Built-in/bin_sum.shtml
It sounds like you are using pandas. Are the data already binned? If so, something like this should work
data.groupby(["lat_bins", "lon_bins"]).sum()
If the lat and lon data are not binned yet, you can use pandas.cut to create a binned value column like this
data["lat_bins"] = pandas.cut(x = data["lat"], bins=[...some binning...])
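Putting the two steps together, a minimal sketch with made-up observations and one-degree bin edges (both are assumptions for illustration, as is the column name wind) could look like this. The unstack call turns the grouped sums into a 2-dimensional lat/lon grid.

```python
import numpy as np
import pandas as pd

# Made-up observations; the column names are illustrative.
data = pd.DataFrame({
    "lat": [10.2, 10.7, 11.5, 11.9],
    "lon": [20.1, 20.4, 20.3, 21.6],
    "wind": [5.0, 3.0, 2.0, 4.0],
})

# Assumed one-degree grid boxes.
lat_edges = np.arange(10, 13, 1.0)  # edges 10, 11, 12
lon_edges = np.arange(20, 23, 1.0)  # edges 20, 21, 22

data["lat_bins"] = pd.cut(data["lat"], bins=lat_edges)
data["lon_bins"] = pd.cut(data["lon"], bins=lon_edges)

# observed=False keeps empty grid boxes, whose sum is 0.
grid = (
    data.groupby(["lat_bins", "lon_bins"], observed=False)["wind"]
    .sum()
    .unstack()
)
```

Here grid is a DataFrame with latitude bins as rows and longitude bins as columns; grid.to_numpy() gives the plain 2-dimensional array.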

Pandas, compute many means with bootstrap confidence intervals for plotting

I want to compute means with bootstrap confidence intervals for some subsets of a dataframe; the ultimate goal is to produce bar graphs of the means with bootstrap confidence intervals as the error bars. My data frame looks like this:
ATG12 Norm  ATG5 Norm  ATG7 Norm  Cancer Stage
5.55        4.99       8.99       IIA
4.87        5.77       8.88       IIA
5.98        7.88       8.34       IIC
The subsets I'm interested in are every combination of Norm columns and cancer stage. I've managed to produce a table of means using:
df.groupby('Cancer Stage')[['ATG12 Norm', 'ATG5 Norm', 'ATG7 Norm']].mean()
But I need to compute bootstrap confidence intervals to use as error bars for each of those means using the approach described here: http://www.randalolson.com/2012/08/06/statistical-analysis-made-easy-in-python/
It boils down to:
import scipy
import scikits.bootstrap as bootstrap
CI = bootstrap.ci(data=Series, statfunction=scipy.mean)
# CI[0] and CI[1] are your low and high confidence intervals
I tried to apply this method to each subset of data with a nested-loop script:
for i in data.groupby('Cancer Stage'):
    for p in i.columns[1:3]: # PROBLEM!!
        Series = i[p]
        print(p)
        print(Series.mean())
        ci = bootstrap.ci(data=Series, statfunction=scipy.mean)
Which produced an error message
AttributeError: 'tuple' object has no attribute called 'columns'
Not knowing what "tuples" are, I have some reading to do but I'm worried that my current approach of nested for loops will leave me with some kind of data structure I won't be able to easily plot from. I'm new to Pandas so I wouldn't be surprised to find there's a simpler, easier way to produce the data I'm trying to graph. Any and all help will be very much appreciated.
The way you iterate over the groupby object is wrong. When you use groupby(), your data frame is sliced along the values in your groupby column(s), and iterating over the result yields each slice together with its group name as a so-called "tuple" of the form (name, dataforgroup). The correct recipe for iterating over groupby objects is:
for name, group in data.groupby('Cancer Stage'):
    print(name)
    for p in group.columns[0:3]:
        ...
Please read more about the groupby functionality in the pandas documentation and go through the Python reference in order to understand what tuples are.
Grouping a data frame and applying a function can essentially be done in one statement, using the apply functionality of pandas:
cols = data.columns[0:3]  # the three Norm columns
for col in cols:
    print(data.groupby('Cancer Stage')[col].apply(lambda x: bootstrap.ci(data=x, statfunction=scipy.mean)))
This does everything you need in one line per column and produces a (nicely plottable) series for you.
EDIT:
I toyed around with a data frame object I created myself:
df = pd.DataFrame({'A': range(24), 'B': list('aabb') * 6, 'C': range(15, 39)})
for col in ['A', 'C']:
    print(df.groupby('B')[col].apply(lambda x: bootstrap.ci(data=x.values)))
yields two series that look like this:
B
a [6.58333333333, 14.3333333333]
b [8.5, 16.25]
B
a [21.5833333333, 29.3333333333]
b [23.4166666667, 31.25]
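For plotting, it can help to collect the mean and both interval bounds in one table. The sketch below is an illustration under assumptions, not the method used above: it replaces bootstrap.ci with a plain percentile bootstrap written in NumPy, so its intervals will differ somewhat from those printed above. The function name bootstrap_ci and its parameters n_boot, alpha and seed are hypothetical.

```python
import numpy as np
import pandas as pd


def bootstrap_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample with replacement: one row per bootstrap replicate.
    samples = rng.choice(values, size=(n_boot, len(values)), replace=True)
    means = samples.mean(axis=1)
    low, high = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return pd.Series({"mean": values.mean(), "ci_low": low, "ci_high": high})


df = pd.DataFrame({'A': range(24), 'B': list('aabb') * 6, 'C': range(15, 39)})

# One row per group, with columns mean, ci_low and ci_high.
summary = pd.DataFrame(
    {name: bootstrap_ci(group) for name, group in df.groupby('B')['A']}
).T
```

The asymmetric error bars for a bar chart would then be summary["mean"] - summary["ci_low"] below each bar and summary["ci_high"] - summary["mean"] above it.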

Visualizing randomized four dimensional data set

I have a four-dimensional data set. None of the four variables is equally spaced. Right now, I visualize the data using a 3D scatter plot (with the color of the dots indicating the fourth dimension), but this becomes extremely unwieldy when printed. Had the variables been evenly spaced, a series of pcolor plots would have been an option. Is there some way to represent such data using a series of 2D plots? My data set looks something like this:
x = [3.67, 3.89, 25.6]
y = [4.88, 4.88, 322.9]
z = [1.0, 2.0, 3.0]
b = [300.0,411.0,414.5]
A scatter plot matrix is a common way to plot multiple dimensions. Here's a plot of four continuous variables colored by a fifth categorical variable.
How to deal with the uneven spacing depends on the nature of the unevenness:
You might plot it as-is if the unevenness is significant.
You might make a second plot with the extreme values excluded.
You might apply a transformation (such as log or quantile) if the data justifies it.
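As a rough illustration of the transformation option, the sketch below applies a base-10 log and a quantile (rank) transform to the sample values from the question. The frame name points is hypothetical, and whether either transform is appropriate depends on the data; the log requires strictly positive values.

```python
import numpy as np
import pandas as pd

# Sample values from the question.
points = pd.DataFrame({
    "x": [3.67, 3.89, 25.6],
    "y": [4.88, 4.88, 322.9],
    "z": [1.0, 2.0, 3.0],
    "b": [300.0, 411.0, 414.5],
})

# Log transform: compresses large values.
log_points = np.log10(points)

# Quantile transform: each value becomes its rank as a fraction of the
# column length, with tied values sharing their average rank.
quantile_points = points.rank(pct=True)
```

Either transformed frame can then be passed to the same scatter plot matrix in place of the raw values.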