Find the longest not-null segment in an ndarray using numpy

I have an array ab of shape (2, 12):
ab = np.array([[0, 3, 6, 3, np.nan, 3, 7, 3, 5, 4, 3, np.nan],
               [5, 9, np.nan, 3, 7, 5, 3, 6, 4, np.nan, np.nan, np.nan]])
I am trying to get the longest segment of consecutive not-null values shared by the two rows (i.e., the longest run of columns where neither row contains NaN). From the example above, the output should be:
[[3. 7. 3. 5.]
 [5. 3. 6. 4.]]
I used the solution proposed for a similar question here: Find longest subsequence without NaN values in set of series, after converting my array into a dataframe:
df = pd.DataFrame(ab.T)
seq = np.array(df.dropna(how='any').index)
longest_seq = max(np.split(seq, np.where(np.diff(seq)!=1)[0]+1), key=len)
print(df.iloc[longest_seq])
     0    1
5  3.0  5.0
6  7.0  3.0
7  3.0  6.0
8  5.0  4.0
However, is it possible to find a solution using numpy only?
Thanks

I am not sure your code handles the case where the length of such sequences differs from one row to the other. Instead, I would proceed row-by-row:
res = []
for array in ab:
    # First, let's prepend a nan for regularity:
    arr = np.append(np.nan, array)
    nanindexes = np.nonzero(np.isnan(arr))[0]
    longest = max(np.split(arr, nanindexes), key=len)  # select the biggest slice, they all start with nan
    longest = longest[1:]  # remove the nan we added, or the starting one
    res.append(longest)
print(res)
[array([3., 7., 3., 5., 4., 3.]), array([3., 7., 5., 3., 6., 4.])]
I am not too familiar with numpy, so I took your question as an exercise. There are probably many ways to improve that code.
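For a numpy-only route to the original two-row output, one option is to split the indices of the fully non-NaN columns on their gaps. This is a sketch of my own (not part of the answer above), using the ab array from the question:
# columns where neither row contains NaN
valid_cols = np.flatnonzero(~np.isnan(ab).any(axis=0))
# split those column indices into runs of consecutive columns
runs = np.split(valid_cols, np.where(np.diff(valid_cols) != 1)[0] + 1)
longest = max(runs, key=len)
print(ab[:, longest])
# [[3. 7. 3. 5.]
#  [5. 3. 6. 4.]]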

Related

Get the last non-null value per row of a 2D Numpy array

I have a 2D numpy array that looks like
a = np.array(
    [
        [1, 2, np.nan, np.nan],
        [1, 33, 45, np.nan],
        [11, 22, 3, 78],
    ]
)
I need to extract the last non-null value per row, i.e.
[2, 45, 78]
Please guide me on how to get it.
thanks
Break this into two sub-problems:
1. Remove NaNs from each row
2. Select the last element from the array
[r[~np.isnan(r)][-1] for r in a]
produces
[2.0, 45.0, 78.0]
For a vectorial solution, you can try:
# index of the last non-NaN value in each row, counted from the right
idx = np.argmax(~np.isnan(a)[:, ::-1], axis=1)
# fancy-index that element from each row
a[np.arange(a.shape[0]), a.shape[1] - idx - 1]
output: array([ 2., 45., 78.])

appending rows to pandas dataframe results in duplicate rows

Here's an MWE that illustrates a problem I'm having, where incrementally saving values to a dataframe over the course of a series of loops results in what looks like the overwriting of previous rows.
import pandas as pd
import numpy as np

saved = pd.DataFrame(columns=['value1', 'value2'])
m = np.zeros(2)
for t in range(5):
    for i in range(2):
        m[i] = m[i] + i + 1
    print(t)
    print(m)
    saved.loc[t] = m
print(saved)
The output I get is:
0
[1. 2.]
1
[2. 4.]
2
[3. 6.]
3
[4. 8.]
4
[5. 10.]
   value1  value2
0     2.0     4.0
1     2.0     4.0
2     3.0     6.0
3     4.0     8.0
4     5.0    10.0
Why is the first row of the saved dataframe not 1.0, 2.0?
Edit:
Here's another articulation of the problem, now using a list for saving and then constructing the dataframe at the end. The following code in a .py script
import numpy as np
import pandas as pd

saved_list = []
m = np.zeros(2)
for t in range(5):
    for i in range(2):
        m[i] = m[i] + i + 1
    print(t)
    print(m)
    saved_list.append(m)
saved = pd.DataFrame(saved_list, columns=['value1', 'value2'])
print(saved)
gives this output from the command line:
0
[1. 2.]
1
[2. 4.]
2
[3. 6.]
3
[4. 8.]
4
[ 5. 10.]
   value1  value2
0     5.0    10.0
1     5.0    10.0
2     5.0    10.0
3     5.0    10.0
4     5.0    10.0
Why are the previous saved_list items being overwritten?
It works as expected without any change. (Screenshot from Google Colab omitted.)
Well, it seems that making a copy of the array within the loop for saving solves both scenarios.
For the first, I used
saved.loc[t] = m.copy() and for the second I used saved_list.append(m.copy()).
It may be obvious to some, but because the array is defined outside the loop, the items saved to the list (or assigned into the frame) are references to that same array object, so everything saved within the loop ends up reflecting its final values.
Now I know.
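A minimal sketch of the list variant with the .copy() fix applied (the same code as the second snippet above; only the append line changes):
import numpy as np
import pandas as pd

saved_list = []
m = np.zeros(2)
for t in range(5):
    for i in range(2):
        m[i] = m[i] + i + 1
    saved_list.append(m.copy())  # copy, so later mutations of m don't alter saved rows
saved = pd.DataFrame(saved_list, columns=['value1', 'value2'])
print(saved)
#    value1  value2
# 0     1.0     2.0
# 1     2.0     4.0
# 2     3.0     6.0
# 3     4.0     8.0
# 4     5.0    10.0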

how to delete null values in tensorflow?

This is my first time asking a question on StackOverflow, so if I say something strange, please let me know.
I want to get the prediction accuracy on this dataset.
But I don't know how to accept or delete NaN or null values using pandas or numpy, or how to get the right accuracy and loss in this model.
I tried to use the isnull function in pandas, but it didn't work.
I think the reason the accuracy and loss are not calculated is that the values set as X contain null values,
so I want to know how to exclude NaN or null values.
If you have had a similar case or have solved this problem, please let me know how you did it. Thanks!
You said you don't know how to delete NaN values in pandas or numpy, but sometimes we don't delete them, we replace them. For example, you could put zero in place of the missing values of a feature, or you could calculate the average value of that feature and use it.
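As a sketch of the averaging idea, here is column-wise mean imputation with numpy (using the same example array x as in the snippet below; this is my illustration, not part of the original answer):
import numpy as np

x = np.array([[1, 2, 3, 4],
              [2, 3, np.nan, 5],
              [np.nan, 5, 2, 3]])
# per-column mean, ignoring NaNs, then fill each NaN with its column's mean
col_means = np.nanmean(x, axis=0)
x_filled = np.where(np.isnan(x), col_means, x)
# the NaN in row 1 becomes 2.5, the NaN in row 2 becomes 1.5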
To remove rows with NaN values, you could do this:
x = np.array([[1, 2, 3, 4],
              [2, 3, np.nan, 5],
              [np.nan, 5, 2, 3]])
x = x[~np.isnan(x).any(axis=1)]
output:
array([[1., 2., 3., 4.]])
Or, to replace NaN with something else, like 0, for a numpy array:
x[np.isnan(x)] = 0
output
x:
array([[1., 2., 3., 4.],
       [2., 3., 0., 5.],
       [0., 5., 2., 3.]])
In a pandas dataframe, you can remove the rows where a specific column has a NaN element:
x = np.array([[np.nan, 2, 3, 4],
              [2, 3, np.nan, 5],
              [3, 5, 2, 3]])
xpd = pd.DataFrame(x, columns=["A", "B", "C", "D"])
xpd = xpd[pd.notnull(xpd['A'])]
output
xpd
     A    B    C    D
1  2.0  3.0  NaN  5.0
2  3.0  5.0  2.0  3.0
or remove every row that contains a NaN element (for example with xpd.dropna()):
xpd:
     A    B    C    D
2  3.0  5.0  2.0  3.0
and you can also replace NaN values with something else in a pandas dataframe:
xpd = xpd.replace([np.nan], 0)
output
xpd:
     A    B    C    D
0  0.0  2.0  3.0  4.0
1  2.0  3.0  0.0  5.0
2  3.0  5.0  2.0  3.0

How to handle categorical features in a neural network?

I currently have a dataset with store locations and item names, used to predict sales of a particular product.
I wanted to use binary encoding or pandas get_dummies(), but there are 5000 item names and it causes a memory error. Is there any alternative or better way to handle this? Thanks all!
print(train.shape)
print(train.dtypes)
print(train.head())
(125497040, 6)
id               int64
date            object
store_nbr        int64
item_nbr         int64
unit_sales     float64
onpromotion     object
dtype: object
   id        date  store_nbr  item_nbr  unit_sales onpromotion
0   0  2013-01-01         25    103665         7.0         NaN
1   1  2013-01-01         25    105574         1.0         NaN
2   2  2013-01-01         25    105575         2.0         NaN
3   3  2013-01-01         25    108079         1.0         NaN
4   4  2013-01-01         25    108701         1.0         NaN
Instead of creating gazillions of dummy variables, you should use one-hot encoding: https://en.wikipedia.org/wiki/One-hot
Pandas doesn't have this functionality built-in, so the easiest way is to use scikit-learn: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
>>> from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder()
>>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
OneHotEncoder(categorical_features='all', dtype=<... 'numpy.float64'>,
       handle_unknown='error', n_values='auto', sparse=True)
>>> enc.n_values_
array([2, 3, 4])
>>> enc.feature_indices_
array([0, 2, 5, 9])
>>> enc.transform([[0, 1, 1]]).toarray()
array([[ 1., 0., 0., 1., 0., 0., 1., 0., 0.]])
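For the memory error in the question specifically, the key point is to keep the encoder's sparse output instead of converting it to a dense array; a rough sketch, assuming the train dataframe shown above:
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()  # returns a scipy sparse matrix by default
encoded = enc.fit_transform(train[['store_nbr', 'item_nbr']])
print(encoded.shape)  # thousands of item columns, but stored sparsely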
The way I see it, you could:
Not use all of the items, only the most frequent ones. Creating dummies this way produces fewer new columns and needs less memory. To do this you will need to drop the items with few counts (defining "few" with a threshold), and you will lose some information. A sketch of this idea follows below.
An alternative approach would be to use a Factorization Machine.
You could also use both suggestions above and, at the end, average their predictions for an even better score.
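A rough sketch of the first suggestion, keeping only frequent item_nbr values and lumping the rest into one bucket (the item_grp column name and the threshold of 50 are my own placeholders):
import pandas as pd

counts = train['item_nbr'].value_counts()
frequent = counts[counts >= 50].index  # hypothetical frequency threshold
train['item_grp'] = train['item_nbr'].where(train['item_nbr'].isin(frequent), other=-1)
# one-hot encode the reduced categorical column; sparse=True keeps memory down
dummies = pd.get_dummies(train['item_grp'], prefix='item', sparse=True)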

pandas using qcut on series with fewer values than quantiles

I have thousands of series (rows of a DataFrame) that I need to apply qcut on. Periodically there will be a series (row) that has fewer non-NaN values than the desired number of quantiles (say, 1 value vs 2 quantiles):
>>> s = pd.Series([5, np.nan, np.nan])
When I apply .quantile() to it, it has no problem breaking it into 2 quantiles (with the same boundary value):
>>> s.quantile([0.5, 1])
0.5    5.0
1.0    5.0
dtype: float64
But when I apply .qcut() with an integer value for the number of quantiles, an error is thrown:
>>> pd.qcut(s, 2)
...
ValueError: Bin edges must be unique: array([ 5., 5., 5.]).
You can drop duplicate edges by setting the 'duplicates' kwarg
Even after I set the duplicates argument, it still fails:
>>> pd.qcut(s, 2, duplicates='drop')
....
IndexError: index 0 is out of bounds for axis 0 with size 0
How do I make this work? (And equivalently, pd.qcut(s, [0, 0.5, 1], duplicates='drop') also doesn't work.)
The desired output is to have the 5.0 assigned to a single bin and the NaNs preserved:
0    (4.999, 5.000]
1               NaN
2               NaN
OK, this is a workaround that might work for you:
pd.qcut(s, len(s.dropna()), duplicates='drop')
Out[655]:
0    (4.999, 5.0]
1             NaN
2             NaN
dtype: category
Categories (1, interval[float64]): [(4.999, 5.0]]
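Since the question mentions applying this to thousands of rows of a DataFrame, the same workaround can be wrapped in a row-wise helper; df, q and safe_qcut below are illustrative names, not from the original answer:
import numpy as np
import pandas as pd

df = pd.DataFrame([[5, np.nan, np.nan],
                   [1, 2, 3]])
q = 2  # desired number of quantiles

def safe_qcut(row, q):
    # use fewer bins when the row has fewer non-NaN values than q
    bins = min(q, len(row.dropna()))
    if bins == 0:
        return row  # nothing to bin
    return pd.qcut(row, bins, duplicates='drop')

binned = df.apply(safe_qcut, axis=1, q=q)
print(binned)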
You can try filling your object/numeric columns with an appropriate fill value ('null' for strings and 0 for numeric):
# fill numeric cols with 0
numeric_columns = df.select_dtypes(include=['number']).columns
df[numeric_columns] = df[numeric_columns].fillna(0)

# fill object cols with 'null'
string_columns = df.select_dtypes(include=['object']).columns
df[string_columns] = df[string_columns].fillna('null')
Use Python 3.5 instead of Python 2.7.
This worked for me.