Statistics and Pandas: What normalization means in value_counts() in Pandas

This question is not about coding; I want to understand what normalize means in terms of statistics and the correlation of data.
Here is an example of what I am doing.
Without normalization:
plt.subplot(111)
plt.plot(df['alcoholism'].value_counts(), marker='o')
plt.plot(df.query('no_show =="Yes"')['alcoholism'].value_counts(), color='black')
plt.show();
With normalization:
plt.subplot(111)
plt.plot(df['alcoholism'].value_counts(normalize=True), marker='o')
plt.plot(df.query('no_show =="Yes"')['alcoholism'].value_counts(normalize=True), color='black')
plt.show();
Which one correlates the values better, with or without normalization? Or is the whole idea wrong?
I am new to data and pandas, so excuse my bad code, chaining, commenting, style :)

As you can see, when you normalize (second plot), the points on each plotted line sum to 1. Normalizing gives you the rate of occurrence of each value instead of the number of occurrences.
Here's what the doc says:
normalize : bool, default False
    Return proportions rather than frequencies.
value_counts() probably returns something like:
0 110000
1 1000
dtype: int64
and value_counts(normalize=True) probably returns something like:
0 0.990991
1 0.009009
dtype: float64
In other words, the relation between the normalized and non-normalized can be checked as:
>>> counts = df['alcoholism'].value_counts()
>>> rate = df['alcoholism'].value_counts(normalize=True)
>>> np.allclose(rate, counts / counts.sum())
True
Here np.allclose compares two series of floating point numbers within a small tolerance, which is the proper way to check them for equality.
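To make the difference concrete, here is a minimal sketch on a made-up toy frame (the column names follow the question, but the data is invented for illustration). It shows that the normalized values are proportions, which makes two groups of different sizes directly comparable:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
df = pd.DataFrame({
    "alcoholism": [0, 0, 0, 1, 0, 1, 0, 0],
    "no_show": ["No", "No", "Yes", "Yes", "No", "No", "Yes", "No"],
})

counts = df["alcoholism"].value_counts()               # raw counts: 6 and 2
rate = df["alcoholism"].value_counts(normalize=True)   # proportions: 0.75 and 0.25

# Proportions within the no-show subgroup (3 rows, one of them with alcoholism == 1)
no_show_rate = df.query('no_show == "Yes"')["alcoholism"].value_counts(normalize=True)

print(counts)
print(rate)
print(no_show_rate)
```

With raw counts the subgroup line is always lower simply because it has fewer rows; with proportions you can compare the share of alcoholism == 1 overall (0.25) against the share among no-shows (about 0.33).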

Related

Pandas wrong round decimation

I am calculating the duration of data acquisition from some sensors. Although the data is collected faster, I would like to sample it at 10Hz. I created a dataframe with a column called 'Time_diff', which I expect to go [0.0, 0.1, 0.2, 0.3 ...]. However, it goes like [0.0, 0.1, 0.2, 0.30000004 ...]. I am rounding the dataframe, but I still get these weird decimals. Are there any suggestions on how to fix it?
The code:
for i in range(self.n_of_trials):
    start = np.zeros(0)
    stop = np.zeros(0)
    for df in self.trials[i].df_list:
        start = np.append(stop, df['Time'].iloc[0])
        stop = np.append(start, df['Time'].iloc[-1])
    t_start = start.min()
    t_stop = stop.max()
    self.trials[i].duration = t_stop-t_start
    t = np.arange(0, self.trials[i].duration+self.trials[i].dt, self.trials[i].dt)
    self.trials[i].df_merged['Time_diff'] = t
    self.trials[i].df_merged.round(1)
when I print the data it looks like this:
0 0.0
1 0.1
2 0.2
3 0.3
4 0.4
...
732 73.2
733 73.3
734 73.4
735 73.5
736 73.6
Name: Time_diff, Length: 737, dtype: float64
However when I open as csv file it is like that:
Addition
I think the problem is not the csv conversion but how the float data is converted/rounded. Here is the next part of the code, where I merge more dataframes on the 10Hz time stamps:
for j in range(len(self.trials[i].df_list)):
    df = self.trials[i].df_list[j]
    df.insert(0, 'Time_diff', round(df['Time']-t_start, 1))
    df.round({'Time_diff': 1})
    df.drop_duplicates(subset=['Time_diff'], keep='first', inplace=True)
    self.trials[i].df_merged = pd.merge(self.trials[i].df_merged, df, how="outer", on="Time_diff", suffixes=(None, '_'+self.trials[i].df_list_names[j]))
#Test csv
self.trials[2].df_merged.to_csv(path_or_buf='merged.csv')
Since the inserted dataframes have exactly the expected decimals, their rows are not merged properly; instead, a new row with a new index is created.
This is not a rounding problem; it is a behavior intrinsic to how floating point numbers work. In fact, 0.30000000000000004 is the result of 0.1+0.1+0.1 (try it out yourself in a Python prompt).
In practice, not every decimal number is exactly representable as a floating point number, so what you get instead is the closest representable value.
You have some options, depending on whether you just want to improve the visualization or need to work with exact values. If, for example, you want to use that column for a merge, you can use an approximate comparison instead of an exact one.
Another option is to use the decimal module: https://docs.python.org/3/library/decimal.html which works with exact arithmetic but can be slower.
In your case you said the column should represent steps at 10Hz, so I think changing the representation so that you directly use 10, 20, 30, ... will allow you to use integers instead of floats.
If you want to see the "true" value of a floating point number in python you can use format(0.1*6, '.30f') and it will print the number with 30 digits (still an approximation but much better than the default).
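As a rough sketch of the integer idea (the frames, the "value" column and the "tick" column name are invented here; only 'Time_diff' comes from the question), you can convert each time stamp to an integer sample number and merge on that. The float merge below misses the 0.3 row because 3 * 0.1 and a rounded 0.3 are different floats, while the integer merge lines up:

```python
import numpy as np
import pandas as pd

dt = 0.1  # 10 Hz sampling step

# Time grid built by multiplication: 3 * 0.1 is 0.30000000000000004, not 0.3
grid = pd.DataFrame({"Time_diff": np.arange(6) * dt})

# Hypothetical sensor data whose time stamps were rounded to one decimal
sensor = pd.DataFrame({
    "Time_diff": np.round([0.0, 0.12, 0.31, 0.49], 1),
    "value": [1, 2, 3, 4],
})

# Merging on the float column leaves the 0.3 rows unmatched
float_merge = pd.merge(grid, sensor, how="outer", on="Time_diff")

# Merging on an integer sample number avoids exact float comparison
grid["tick"] = np.round(grid["Time_diff"] / dt).astype(int)
sensor["tick"] = np.round(sensor["Time_diff"] / dt).astype(int)
int_merge = pd.merge(grid[["tick"]], sensor[["tick", "value"]], how="outer", on="tick")

print(len(float_merge), len(int_merge))
```

If you still need the time in seconds for display, you can derive it from the integer column at the very end (tick * dt) and format it when printing or writing the csv.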

Difficulty with numpy broadcasting

I have two 2d point clouds (oldPts and newPts) which I wish to combine. They are mx2 and nx2 numpy integer arrays with m and n of order 2000. newPts contains many duplicates or near duplicates of oldPts and I need to remove these before combining.
So far I have used the histogram2d function to produce a 2d representation of oldPts (H). I then compare each newPt to an NxN area of H, and if it is empty I accept the point. This last part I am currently doing with a Python loop which I would like to remove. Can anybody show me how to do this with broadcasting, or perhaps suggest a completely different method of going about the problem? The working code is below.
npzfile = np.load(path+datasetNo+'\\temp.npz')
arrs = npzfile.files
oldPts = npzfile[arrs[0]]
newPts = npzfile[arrs[1]]
# remove all the negative values
oldPts = oldPts[oldPts.min(axis=1)>=0,:]
newPts = newPts[newPts.min(axis=1)>=0,:]
# round to integers
oldPts = np.around(oldPts).astype(int)
newPts = newPts.astype(int)
# put the oldPts into 2d array
H, xedg,yedg= np.histogram2d(oldPts[:,0],oldPts[:,1],
                             bins = [xMax,yMax],
                             range = [[0, xMax], [0, yMax]])
finalNewList = []
N = 5
for pt in newPts:
    if not H[max(0,pt[0]-N):min(xMax,pt[0]+N),
             max(0,pt[1]- N):min(yMax,pt[1]+N)].any():
        finalNewList.append(pt)
finalNew = np.array(finalNewList)
The right way to do this is to use linear algebra to compute the distance between each pair of 2-long vectors, and then accept only the new points that are "different enough" from each old point: using scipy.spatial.distance.cdist:
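import numpy as np
from scipy.spatial.distance import cdist
oldPts = np.random.randn(1000,2)
newPts = np.random.randn(2000,2)
dist = cdist(oldPts, newPts)
print(dist.shape) # (1000, 2000)
# a new point is accepted only if even its nearest old point is farther than the threshold
okIndex = np.min(dist, axis=0) > 5
print(np.sum(okIndex)) # number of accepted points; few if any for this random sample data, so tune the threshold
finalNew = newPts[okIndex,:]
print(finalNew.shape) # (number of accepted points, 2)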
Above I use the Euclidean distance of 5 as the threshold for "too close": any point in newPts that's farther than 5 from all points in oldPts is accepted into finalNew. You will have to look at the range of values in dist to find a good threshold, but your histogram can guide you in picking the best one.
(One good way to visualize dist is to use matplotlib.pyplot.imshow(dist).)
This is a more refined version of what you were doing with the histogram. In fact, you ought to be able to get very nearly the same answer as the histogram by passing the metric='chebyshev' keyword argument to cdist (a square window around a point corresponds to the Chebyshev distance, not to the diamond-shaped region that metric='minkowski', p=1 would give), assuming your histogram bin widths are the same in both dimensions, and using 5 again as the threshold.
(PS. If you're interested in another useful function in scipy.spatial.distance, check out my answer that uses pdist to find unique rows/columns in an array.)
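Since the question asked about broadcasting specifically, here is a minimal NumPy-only sketch of the same idea. It is not the cdist call from the answer but an equivalent pairwise computation; the sample points and the names old_pts/new_pts are made up, and it uses the Chebyshev (square-window) distance to mimic the NxN neighbourhood test:

```python
import numpy as np

# Made-up integer points for illustration
old_pts = np.array([[10, 10], [50, 50]])
new_pts = np.array([[12, 11], [30, 30], [50, 56]])
N = 5  # half-width of the square window, as in the question

# Broadcasting: (m, 1, 2) - (1, n, 2) -> (m, n, 2) array of coordinate differences
diff = np.abs(old_pts[:, None, :] - new_pts[None, :, :])

# Chebyshev distance between every old/new pair, shape (m, n)
dist = diff.max(axis=2)

# Keep a new point only if even its nearest old point is more than N away
keep = dist.min(axis=0) > N
final_new = new_pts[keep]
print(final_new)
```

Note that this builds an m x n x 2 intermediate array, which is fine for a couple of thousand points per cloud but grows quickly; for much larger clouds a spatial index would probably be a better fit.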

Sklearn and Sparse Matrices ValueError

I'm aware similar questions have been asked before, and I've tried everything suggested in them, but I'm still stumped. I have a dataset with 2 columns: The first with vectors representing words stored as a 1x10000 sparse csr matrix (so a matrix in each cell), and the second contains integer ratings which I will use for classification. When I run the following code
for index, row in data.iterrows():
    print(row)
    print(row[0].shape)
I get the correct output for all the rows
Name: 0, dtype: object
(1, 10000)
Vector (0, 0)\t1.0\n (0, 1)\t1.0\n (0, 2)\t1.0\n ...
Rating 5
Now when I try passing my data in any SKlearn classifier like so:
uniform_random_classifier = DummyClassifier(strategy='uniform')
uniform_random_classifier.fit(data["Vectors"], data["Ratings"])
I get the following error:
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: setting an array element with a sequence.
What am I doing wrong? I've made sure all my sparse matrices are the same size and I've tried reshaping my data in various ways, but with no luck, and the Sklearn classifiers are supposed to be able to deal with csr matrices.
Update: Converting the entire "Vectors" column into one large 2-D matrix did the trick, but for completeness sake the following is the code I used to generate my dataframe if anyone is curious and wants to try solving the original issue. Assume data is a pandas dataframe with rows that look like
"560 420 222" 5.0
"2345 2344 2344 5" 3.0
def vectorize(feature, size):
    """Given a numeric string generated from a vocabulary table return a binary vector representation of
    each feature"""
    vector = sparse.lil_matrix((1, size))
    for number in feature.split(' '):
        try:
            vector[0, int(number) - 1] = 1
        except ValueError:
            pass
    return vector
def vectorize_dataset(data, vectorize, size):
    """Given a dataset in the appropriate "num num num..." format, a specific vectorization format, and a vector size,
    returns the dataset in vectorized form"""
    result_data = pd.DataFrame(index=range(data.shape[0]), columns=["Vector", "Rating"])
    for index, row in data.iterrows():
        # All the mixing up of decodings and encoding has made it so that Pandas incorrectly parses EOF chars
        if type(row[0]) == type('str'):
            result_data.iat[index, 0] = vectorize(row[0], size).tocsr()
            result_data.iat[index, 1] = data.loc[index][1]
    return result_data
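For anyone landing here with the same error, the fix described in the update (one large 2-D matrix instead of one sparse matrix per cell) can be sketched roughly as below. This is an illustration with invented data, not the asker's actual code: the list row_vectors stands in for the contents of the "Vector" column, and size is a hypothetical vocabulary size.

```python
from scipy import sparse

size = 10  # hypothetical vocabulary size, kept small for the example

# Stand-ins for the per-row 1 x size sparse vectors stored in the dataframe cells
row_vectors = [
    sparse.csr_matrix(([1.0], ([0], [k])), shape=(1, size))
    for k in (1, 4, 7)
]
ratings = [5, 3, 4]

# Stack the 1 x size rows into a single (n_samples, size) sparse matrix
X = sparse.vstack(row_vectors, format="csr")
print(X.shape)
```

A matrix like X, together with the ratings, is the 2-D input shape that a classifier's fit(X, y) expects, whereas a pandas column of matrix objects is converted to an object array and triggers the "setting an array element with a sequence" error.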

How to find if any column in an array has duplicate values

Let's say I have a numpy matrix A
A = array([[ 0.5, 0.5, 3.7],
[ 3.8, 2.7, 3.7],
[ 3.3, 1.0, 0.2]])
I would like to know whether there are at least two rows i and i' such that A[i, j] = A[i', j] for some column j.
In the example A, i=0 and i'=1 for j=2 and the answer is yes.
How can I do this?
I tried this:
def test(A, n):
    for j in range(n):
        i = 0
        while i < n:
            a = A[i, j]
            for s in range(i+1, n):
                if A[s, j] == a:
                    return True
            i += 1
    return False
Is there a faster/better way?
There are a number of ways of checking for duplicates. The idea is to use as few loops in the Python code as possible to do this. I will present a couple of ways here:
Use np.unique. You would still have to loop over the columns: the axis argument of unique treats whole rows or columns as the items to deduplicate, so it cannot return the unique elements within each column (and each column could have a different number of them anyway). While it still requires a loop, unique allows you to find the positions and other stats of repeated elements:
def test(A):
    for i in range(A.shape[1]):
        if np.unique(A[:, i]).size < A.shape[0]:
            return True
    return False
With this method, you basically check if the number of unique elements in a column is equal to the size of the column. If not, there are duplicates.
Use np.sort, np.diff and np.any. This is a fully vectorized solution that does not require any loops because you can specify an axis for each of these functions:
def test(A):
    return np.any(np.diff(np.sort(A, axis=0), axis=0) == 0)
This literally reads "if any of the column-wise differences in the column-wise sorted array are zero, return True". A zero difference in the sorted array means that there are identical elements. axis=0 makes sort and diff operate on each column individually.
You never need to pass in n since the size of the matrix is encoded in the attribute shape. If you need to look at the subset of a matrix, just pass in the subset using indexing. It will not copy the data, just return a view object with the required dimensions.
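Here is the sort/diff approach as a small self-contained sketch, run on the matrix from the question plus a second matrix (B, made up here) that has no column duplicates. The function name has_column_duplicates is just an illustrative choice:

```python
import numpy as np

def has_column_duplicates(A):
    """Return True if any column of the 2-D array A contains a repeated value."""
    A = np.asarray(A)
    # Sort each column, then look for a zero difference between vertical neighbours
    return bool(np.any(np.diff(np.sort(A, axis=0), axis=0) == 0))

A = np.array([[0.5, 0.5, 3.7],
              [3.8, 2.7, 3.7],
              [3.3, 1.0, 0.2]])

# Same as A except that the repeated 3.7 in the last column was changed
B = np.array([[0.5, 0.5, 3.7],
              [3.8, 2.7, 3.6],
              [3.3, 1.0, 0.2]])

print(has_column_duplicates(A))  # True: the last column holds 3.7 twice
print(has_column_duplicates(B))  # False
```

Keep in mind that this tests for exact equality; if the values come out of floating point computations, you may want to compare the differences against a small tolerance instead of 0.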
A solution without numpy would look like this: First, swap columns and rows with zip()
zipped = zip(*A)
then check whether any of the new rows has duplicates. You can check for duplicates by turning a row into a set, which discards duplicates, and comparing the lengths.
has_duplicates = any(len(set(row)) != len(row) for row in zip(*A))
This is most probably much slower and more memory intensive than the pure numpy solution, but it may help for clarity.

The fastest way to get filtered data checking substring value within ndarray

I have a big array of data:
>>> len(b)
6636849
>>> print(b)
[['60D19E9E-4E2C-11E2-AA9A-52540027E502' '100015361']
['60D19EB6-4E2C-11E2-AA9A-52540027E502' '100015385']
['60D19ECE-4E2C-11E2-AA9A-52540027E502' '100015409']
...,
['8CC90633-447E-11E6-B010-005056A76B49' '106636785']
['F8C74244-447E-11E6-B010-005056A76B49' '106636809']
['F8C7425C-447E-11E6-B010-005056A76B49' '106636833']]
I need to get the filtered dataset, i.e., everything containing (or starting with) '106' in the string. Something like the following code, but with a substring operation instead of a comparison:
>>> len(b[b[:,1] > '10660600'])
30850
I don't think numpy is well suited for this type of operation. You can do it simply using basic python operations. Here it is with some sample data a:
import random # for the test data
a = []
for i in range(10000):
    a.append(["".join(random.sample('abcdefg',3)), "".join(random.sample('01234567890',8))])
answer = [i for i in a if i[1].find('106') != -1]
Keep in mind that startswith is going to be a lot faster than find, because find has to look for matching substrings in all positions.
It's not too clear why you need to do this with such a large list/array, and there might be a better solution that avoids putting these values in the list in the first place.
Here's a simple pandas solution
import pandas as pd
df = pd.DataFrame(b, columns=['1st String', '2nd String'])
df_filtered = df[df['2nd String'].str.contains('106')]
This gives you
In [29]: df_filtered
Out[29]:
                             1st String 2nd String
3  8CC90633-447E-11E6-B010-005056A76B49  106636785
4  F8C74244-447E-11E6-B010-005056A76B49  106636809
5  F8C7425C-447E-11E6-B010-005056A76B49  106636833
Update: Timing Results
Using Benjamin's list a as the test sample:
In [20]: %timeit [i for i in a if i[1].find('106') != -1]
100 loops, best of 3: 2.2 ms per loop
In [21]: %timeit df[df['2nd String'].str.contains('106')]
100 loops, best of 3: 5.94 ms per loop
So it looks like Benjamin's answer is actually about 3x faster. This surprises me since I was under the impression that the operation in pandas is vectorized. Moreover, the speed ratio does not change when a is 100 times longer.
Look at the functions in the np.char submodule:
import numpy as np
data = [['60D19E9E-4E2C-11E2-AA9A-52540027E502', '100015361'],
        ['60D19EB6-4E2C-11E2-AA9A-52540027E502', '100015385'],
        ['60D19ECE-4E2C-11E2-AA9A-52540027E502', '100015409'],
        ['8CC90633-447E-11E6-B010-005056A76B49', '106636785'],
        ['F8C74244-447E-11E6-B010-005056A76B49', '106636809'],
        ['F8C7425C-447E-11E6-B010-005056A76B49', '106636833']]
# np.str was removed in recent NumPy versions; the builtin str does the same job here
data = np.array([r[1] for r in data], str)
idx = np.char.startswith(data, '106')
print(idx)
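To actually filter the two-column array with such a mask, something along these lines should work. It is a small sketch on a three-row stand-in for b (the last row is invented so that "contains" and "starts with" give different results); np.char.find returns -1 where the substring is absent:

```python
import numpy as np

# Small stand-in for the big array b from the question
b = np.array([
    ["60D19E9E-4E2C-11E2-AA9A-52540027E502", "100015361"],
    ["8CC90633-447E-11E6-B010-005056A76B49", "106636785"],
    ["F8C74244-447E-11E6-B010-005056A76B49", "110610600"],  # invented row
])

# Boolean masks over the second column
starts = np.char.startswith(b[:, 1], "106")
contains = np.char.find(b[:, 1], "106") >= 0

# Boolean indexing keeps whole rows
print(b[starts])
print(b[contains])
```

These np.char functions still work element by element on Python strings under the hood, so it is worth timing them against the list comprehension on the real data before assuming they are faster.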