Creating a 2D image by logarithmic binning - pandas

I have a DataFrame consisting of two columns as follows:
col1   col2
0.33   4.33
0.21   4.89
3.2    18.78
6.22   0.05
6.0    2.1
...    ...
...    ...
Now I would like to create a 200 x 200 numpy array by binning both columns. The x-axis should be col1 and the y-axis should be col2. col1 should be binned logarithmically from 0 to 68 and col2 logarithmically from 0 to 35. I would like to use logarithmic binning because there are more smaller values than larger values (i.e. the bins are getting larger with larger values). The 200 x 200 array should then store the amount of samples in each bin (i.e. the count).
Is this possible to do in an efficient way?

Something like this might work for you... (note that you have to choose how close to zero the lower end is):
import numpy as np

# 201 edges give 200 log-spaced bins per axis
bins1 = np.logspace(np.log10(0.001), np.log10(68), num=201)
bins2 = np.logspace(np.log10(0.001), np.log10(35), num=201)
result = np.histogram2d(df['col1'], df['col2'], bins=[bins1, bins2])
...where result[0] is the 200 x 200 array of counts, and result[1] and result[2] are the bin edges (the same arrays as bins1 and bins2).
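A hedged sketch of a complete version, with the lower edges taken from the data's smallest positive values instead of the hard-coded 0.001 floor (the toy DataFrame below is just a stand-in for the real one):
import numpy as np
import pandas as pd

# toy data standing in for the real df
df = pd.DataFrame({'col1': [0.33, 0.21, 3.2, 6.22, 6.0],
                   'col2': [4.33, 4.89, 18.78, 0.05, 2.1]})

# log bins cannot start at 0, so use the smallest positive value as the lower edge
lo1 = df.loc[df['col1'] > 0, 'col1'].min()
lo2 = df.loc[df['col2'] > 0, 'col2'].min()
bins1 = np.logspace(np.log10(lo1), np.log10(68), num=201)
bins2 = np.logspace(np.log10(lo2), np.log10(35), num=201)

counts, xedges, yedges = np.histogram2d(df['col1'], df['col2'], bins=[bins1, bins2])
print(counts.shape)   # (200, 200)
print(counts.sum())   # number of samples that fell inside the ranges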

Related

Pandas: wrong rounding of decimals

I am calculating the duration of data acquisition from some sensors. Although the data is collected faster, I would like to sample it at 10 Hz. Anyway, I created a dataframe with a column called 'Time_diff' which I expect to go [0.0, 0.1, 0.2, 0.3 ...]. However, it goes somehow like [0.0, 0.1, 0.2, 0.30000004 ...]. I am rounding the data frame, but I still get these odd decimals. Are there any suggestions on how to fix it?
The code:
for i in range(self.n_of_trials):
    start = np.zeros(0)
    stop = np.zeros(0)
    for df in self.trials[i].df_list:
        start = np.append(stop, df['Time'].iloc[0])
        stop = np.append(start, df['Time'].iloc[-1])
    t_start = start.min()
    t_stop = stop.max()
    self.trials[i].duration = t_stop - t_start
    t = np.arange(0, self.trials[i].duration + self.trials[i].dt, self.trials[i].dt)
    self.trials[i].df_merged['Time_diff'] = t
    self.trials[i].df_merged.round(1)
When I print the data it looks like this:
0 0.0
1 0.1
2 0.2
3 0.3
4 0.4
...
732 73.2
733 73.3
734 73.4
735 73.5
736 73.6
Name: Time_diff, Length: 737, dtype: float64
However, when I open it as a CSV file it looks like this:
Addition
I think the problem is not the CSV conversion but how the float data is converted/rounded. Here is the next part of the code, where I merge more dataframes on the 10 Hz time stamps:
for j in range(len(self.trials[i].df_list)):
    df = self.trials[i].df_list[j]
    df.insert(0, 'Time_diff', round(df['Time'] - t_start, 1))
    df.round({'Time_diff': 1})
    df.drop_duplicates(subset=['Time_diff'], keep='first', inplace=True)
    self.trials[i].df_merged = pd.merge(self.trials[i].df_merged, df, how="outer", on="Time_diff",
                                        suffixes=(None, '_' + self.trials[i].df_list_names[j]))

# Test csv
self.trials[2].df_merged.to_csv(path_or_buf='merged.csv')
And since the inserted dataframes have the exact correct decimals, the merge does not match them properly and creates additional rows with a new index.
This is not a rounding problem, it is behavior intrinsic to how floating-point numbers work. In fact, 0.30000000000000004 is exactly the result of 0.1 + 0.1 + 0.1 (try it yourself at a Python prompt).
In practice not every decimal number is exactly representable as a floating-point number, so what you get is the closest representable value instead.
You have some options, depending on whether you just want to improve the visualization or you need to work with exact values. If, for example, you want to use that column for a merge, you can use an approximate comparison instead of an exact one.
Another option is to use the decimal module (https://docs.python.org/3/library/decimal.html), which works with exact arithmetic but can be slower.
In your case you said the column should represent time stamps at 10 Hz steps, so I think changing the representation so that you directly use integers like 10, 20, 30, ... will allow you to avoid floats entirely.
If you want to see the "true" value of a floating-point number in Python you can use format(0.1*6, '.30f') and it will print the number with 30 digits (still an approximation, but much better than the default).
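A minimal sketch of the integer-representation idea, assuming a 0.1 s step (the variable names below are illustrative, not taken from the question's code):
import numpy as np
import pandas as pd

dt = 0.1
n = 737

# A float time axis accumulates representation error:
t_float = np.arange(n) * dt
print(format(t_float[3], '.30f'))        # shows the error behind 0.30000000000000004...

# An integer tick axis is exact; keep it for merging and convert only for display:
ticks = np.arange(n)                     # 0, 1, 2, ... in units of dt
df_merged = pd.DataFrame({'Time_ticks': ticks})
df_merged['Time_diff'] = df_merged['Time_ticks'] * dt   # seconds, for display only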

Getting a single value from an N-dimensional histogram in NumPy or SciPy

Assume I have data like this:
x = np.random.randn(4, 100000)
and I fit a histogram
hist = np.histogramdd(x, density=True)
What I want is to get the probability of a number g, e.g. g=0.1. Assume some hypothetical function foo:
g = 0.1
prob = foo(hist, g)
print(prob)
>> 0.2223124214
How could I do something like this, where I get a probability back for a single number or a vector of numbers from a fitted histogram? Especially a histogram that is N-dimensional.
histogramdd takes O(r^D) memory, and unless you have a very large dataset or a very small number of dimensions you will get a poor estimate. Consider your example data: 100k points in 4-D space with the default 10 x 10 x 10 x 10 histogram already gives 10k bins.
x = np.random.randn(4, 100000)
hist = np.histogramdd(x.transpose(), density=True)
np.mean(hist[0] == 0)
gives something around 0.77, meaning that 77% of the bins in the histogram contain no points.
You probably want to smooth the distribution. Unless you have a good reason not to, I would suggest using a Gaussian kernel density estimate:
import numpy as np
import scipy.stats

x = np.random.randn(4, 100000)   # d x n array of samples
f = scipy.stats.gaussian_kde(x)  # d-dimensional PDF estimate
f([1, 2, 3, 4])                  # evaluate the PDF at a given point
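One note on the question's g=0.1: with a d-dimensional estimate the query has to be a d-dimensional point, so a scalar would have to be embedded in a point such as (0.1, 0.1, 0.1, 0.1). A small usage sketch, with made-up query points for illustration:
import numpy as np
import scipy.stats

x = np.random.randn(4, 100000)
f = scipy.stats.gaussian_kde(x)

g = np.full((4, 1), 0.1)          # the single 4-D point (0.1, 0.1, 0.1, 0.1) as a d x 1 column
print(f(g))                       # one PDF value

pts = np.random.randn(4, 5)       # five 4-D query points as a d x m array
print(f(pts))                     # five PDF values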

How to efficiently filter a DataFrame based on an ID column which corresponds to a second DataFrame containing conditions for each ID?

I have a DataFrame with one ID column and two data columns X, Y containing numeric values. For each ID there are several rows of data.
I have a second DataFrame with the same ID column and two numeric columns specifying the lower and upper limit of the X values for each ID.
I want to use the second DataFrame to filter the first one so that it only contains rows whose X values lie within the X_min-X_max range of the specific ID.
I can solve this by looping over the second DataFrame and filtering the groupby(ID) elements of the first one, but that is slow for a large number of IDs. Is there an efficient way to solve this?
Example code with the data in df, the ranges in df_ranges and the expected result in df_result. The real DataFrame is obviously a lot bigger.
import pandas as pd

x = [2.1, 2.2, 2.6, 2.4, 2.8, 3.5, 2.8, 3.2]
y = [3.1, 3.5, 3.4, 2.7, 2.1, 2.7, 4.1, 4.3]
ID = [0]*4 + [0.1]*4
x_min = [2.0, 3.0]
x_max = [2.5, 3.4]
IDs = [0, 0.1]

df = pd.DataFrame({'ID': ID, 'X': x, 'Y': y})
df_ranges = pd.DataFrame({'ID': IDs, 'X_min': x_min, 'X_max': x_max})
df_result = df.iloc[[0, 1, 3, 7], :]
Possible Solution:
def filter_ranges(grp, df_ranges):
    x_min = df_ranges.loc[df_ranges.ID == grp.name, 'X_min'].values[0]
    x_max = df_ranges.loc[df_ranges.ID == grp.name, 'X_max'].values[0]
    return grp.loc[(grp.X >= x_min) & (grp.X <= x_max), :]

target_df_grp = df.groupby('ID').apply(filter_ranges, df_ranges=df_ranges)
Try this:
merged = df.merge(df_ranges, on='ID')
# Keep only the rows whose X falls inside that ID's [X_min, X_max] range
target_df = merged[(merged.X >= merged.X_min) & (merged.X <= merged.X_max)][['ID', 'X', 'Y']]
print(target_df) will give:
    ID    X    Y
0  0.0  2.1  3.1
1  0.0  2.2  3.5
3  0.0  2.4  2.7
7  0.1  3.2  4.3
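If the original row index matters (as in df_result above), a hedged variant of the same merge approach that carries the index through:
merged = df.reset_index().merge(df_ranges, on='ID')
mask = (merged.X >= merged.X_min) & (merged.X <= merged.X_max)
target_df = merged.loc[mask, ['index', 'ID', 'X', 'Y']].set_index('index')
target_df.index.name = None   # now matches df_result = df.iloc[[0, 1, 3, 7], :]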

Pandas, approximating a bar plot for large dataframes

I have a dataframe (around 10k rows) of the following form:
id | voted
123 1.0
12 0.0
215 1.0
362 0.0
...
I want to bar-plot this and look at where the values are mostly 0.0 and where they are mostly 1.0 (the order of the indices in the first column is essential, as the dataframe is sorted).
I tried doing a bar plot, but even if I restrict myself to a small subset of the dataframe, the plot is still not readable:
Is there a way to approximate areas that are mostly 1.0 with a single thicker bar, as we do for histograms when we set the number of bins higher or lower?
As you are searching for an interval approximation of the density of the votes, maybe you can add a moving average:
df['ma'] = df['voted'].rolling(5).mean()
With this you would always have an average; you could then plot it over the indexes as a line graph. If the value is close to 1, you know that you have a group of ids which vote 1.0.
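A minimal sketch of that idea (the window size of 5 comes from the snippet above; the data here is randomly generated just for illustration, so it shows no clear structure):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# toy stand-in for the real ~10k-row, sorted frame
df = pd.DataFrame({'voted': np.random.randint(0, 2, 10000).astype(float)})

df['ma'] = df['voted'].rolling(5).mean()   # local share of 1.0 votes
df['ma'].plot()                            # stretches near 1.0 are mostly-1.0 regions
plt.show()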

Does pandas.describe() exclude missing-value rows?

The pandas describe() function generates descriptive statistics that summarize a dataset, excluding NaN values. But does that exclusion mean the total count (i.e., the number of rows of a variable) varies or stays fixed?
For example, I calculate the mean by using describe() on a DataFrame with missing values:
varA
1
1
1
1
NaN
Is the mean = 4/5 or 4/4 here?
And how does this apply to the other results of describe(), for example the standard deviation and the quartiles?
Thanks!
As ayhan pointed out, in the current 0.21 release NaN values are excluded from all summary statistics provided by pandas.DataFrame.describe().
With NaN:
data_with_nan = list(range(20)) + [np.NaN]*20
df = pd.DataFrame(data=data_with_nan, columns=['col1'])
df.describe()
col1
count 20.00000
mean 9.50000
std 5.91608
min 0.00000
25% 4.75000
50% 9.50000
75% 14.25000
max 19.00000
Without:
data_without_nan = list(range(20))
df = pd.DataFrame(data=data_without_nan, columns=['col1'])
df.describe()
col1
count 20.00000
mean 9.50000
std 5.91608
min 0.00000
25% 4.75000
50% 9.50000
75% 14.25000
max 19.00000
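Applied to the varA example from the question (a quick illustration; the NaN row is dropped, so the denominator is 4, not 5):
import numpy as np
import pandas as pd

s = pd.Series([1, 1, 1, 1, np.nan], name='varA')
print(s.describe())
# count    4.0   <- the NaN row is not counted
# mean     1.0   <- i.e. 4/4, not 4/5
# std      0.0   <- likewise computed over the 4 non-NaN values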