This is my first time asking on Stack Overflow, so if anything I say sounds strange, please let me know.
I want to predict accuracy on this dataset.
But I don't know how to handle or drop NaN/null values using pandas or numpy, or how to get correct accuracy and loss in this model.
I tried to use the isnull function in pandas, but it didn't work.
I think the reason the accuracy and loss are not calculated is that the values assigned to X contain nulls.
So I want to know how to exclude NaN or null values.
If you have had a similar case or have solved this problem, please let me know how. Thanks!
You said you don't know how to delete NaN values in pandas or numpy, but sometimes we don't delete them, we replace them. For example, you could put zero in place of the missing values of a feature, or you could compute the average value of that feature and use it.
To remove rows with NaN values, you could do this:
x = np.array([[1, 2, 3, 4],
              [2, 3, np.nan, 5],
              [np.nan, 5, 2, 3]])
x = x[~np.isnan(x).any(axis=1)]
output:
array([[1., 2., 3., 4.]])
Or, to replace NaN with something else, like 0, in a numpy array:
x[np.isnan(x)] = 0
output
x:
array([[1., 2., 3., 4.],
       [2., 3., 0., 5.],
       [0., 5., 2., 3.]])
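And for the "use the average value of that feature" idea mentioned at the top, here is a minimal numpy sketch (my own addition, not part of the original answer) that fills each NaN with the mean of its column:
import numpy as np

x = np.array([[1, 2, 3, 4],
              [2, 3, np.nan, 5],
              [np.nan, 5, 2, 3]])
col_means = np.nanmean(x, axis=0)   # column means, ignoring NaNs
rows, cols = np.where(np.isnan(x))  # positions of the NaNs
x[rows, cols] = col_means[cols]     # replace each NaN with its column mean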
In a pandas DataFrame, you can remove the rows in which a specific column has a NaN element:
x = np.array([[np.nan, 2, 3, 4],
              [2, 3, np.nan, 5],
              [3, 5, 2, 3]])
xpd = pd.DataFrame(x, columns=["A","B","C","D"])
xpd = xpd[pd.notnull(xpd['A'])]
output
xpd
A B C D
1 2.0 3.0 NaN 5.0
2 3.0 5.0 2.0 3.0
or remove every row that contains a NaN element (for example with xpd = xpd.dropna()):
xpd:
A B C D
2 3.0 5.0 2.0 3.0
And you can also replace NaN values with something else in a pandas DataFrame:
xpd = xpd.replace([np.nan], 0)
output
xpd:
A B C D
0 0.0 2.0 3.0 4.0
1 2.0 3.0 0.0 5.0
2 3.0 5.0 2.0 3.0
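Equivalently, pandas has a dedicated method for this; a one-line sketch using the same xpd:
xpd = xpd.fillna(0)  # same result as xpd.replace([np.nan], 0)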
I have an array ab of shape (2,12)
ab = np.array([[0, 3, 6, 3, np.nan, 3, 7, 3, 5, 4, 3, np.nan],
               [5, 9, np.nan, 3, 7, 5, 3, 6, 4, np.nan, np.nan, np.nan]])
I am trying to get the longest segment of consecutive notnull values between the two rows. From the example above, the output should be:
[[3. 7. 3. 5.]
[5. 3. 6. 4.]]
I used the solution proposed for a similar question here: Find longest subsequence without NaN values in set of series, after converting my array into a dataframe:
df = pd.DataFrame(ab.T)
seq = np.array(df.dropna(how='any').index)
longest_seq = max(np.split(seq, np.where(np.diff(seq)!=1)[0]+1), key=len)
print(df.iloc[longest_seq])
0 1
5 3.0 5.0
6 7.0 3.0
7 3.0 6.0
8 5.0 4.0
However, is it possible to find a solution using numpy only?
Thanks
I am not sure your code handles the case where the length of such sequences differs from one row to the other. Instead, I would proceed row-by-row:
res = []
for array in ab:
    # First, let's prepend a nan for regularity:
    arr = np.append(np.nan, array)
    nanindexes = np.nonzero(np.isnan(arr))[0]
    longest = max(np.split(arr, nanindexes), key=len)  # select the biggest slice, they all start with nan
    longest = longest[1:]  # remove the nan we added, or the starting one
    res.append(longest)
print(res)
[array([3., 7., 3., 5., 4., 3.]), array([3., 7., 5., 3., 6., 4.])]
I am not too familiar with numpy, so I took your question as an exercise. There are probably many ways to improve that code.
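For the common-segment output the question actually asks for (the longest run of columns where both rows are non-NaN), here is a numpy-only sketch that mirrors the pandas dropna approach above; treat it as a sketch rather than a definitive answer:
import numpy as np

ab = np.array([[0, 3, 6, 3, np.nan, 3, 7, 3, 5, 4, 3, np.nan],
               [5, 9, np.nan, 3, 7, 5, 3, 6, 4, np.nan, np.nan, np.nan]])

valid = ~np.isnan(ab).any(axis=0)                         # columns where both rows are non-NaN
idx = np.nonzero(valid)[0]
runs = np.split(idx, np.where(np.diff(idx) != 1)[0] + 1)  # runs of consecutive valid columns
longest = max(runs, key=len)
print(ab[:, longest])
# [[3. 7. 3. 5.]
#  [5. 3. 6. 4.]]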
I have a DataFrame that looks like this:
[image: initial dataframe]
I have different tags in the 'Concepts_clean' column and I want to automatically fill the other columns, like so: [image: resulting dataframe]
For example, in the fourth row the column 'Concepts_clean' contains ['Accueil Amabilité', 'Tarifs'], so I want to fill the columns 'Accueil Amabilité' and 'Tarifs' with ones and all the others with zeros.
What is the most effective way to do it?
Thank you
This is more of an n-hot encoding problem:
>>> def change_df(x):
...     for i in x['Concepts_clean'].replace('[', '').replace(']', '').split(','):
...         if i.strip():  # skip the empty string produced by '[]'
...             x[i.strip()] = 1
...     return x
...
>>> df.apply(change_df, axis=1)
Example Output
Concepts_clean Ecoute Informations Tarifs
[Tarifs] 0.0 0.0 1.0
[] 0.0 0.0 0.0
[Ecoute] 1.0 0.0 0.0
[Tarifs, Informations] 0.0 1.0 1.0
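If Concepts_clean is stored as a plain string like '[Tarifs, Informations]' (which is what the apply above assumes), a vectorized alternative is pandas' str.get_dummies; the bracket-stripping and the separator here are assumptions about the exact formatting:
dummies = (df['Concepts_clean']
           .str.strip('[]')
           .str.get_dummies(sep=', '))
# dummies has one 0/1 column per tag; combine it back into df as needed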
I currently have a dataset containing store locations and item names, which I use to predict the sales of a particular product.
I wanted to use binary encoding or pandas get_dummies(), but there are 5000 distinct item names and that causes a memory error. Is there any alternative or better way to handle this? Thanks all!
print(train.shape)
print(train.dtypes)
print(train.head())
(125497040, 6)
id int64
date object
store_nbr int64
item_nbr int64
unit_sales float64
onpromotion object
dtype: object
id date store_nbr item_nbr unit_sales onpromotion
0 0 2013-01-01 25 103665 7.0 NaN
1 1 2013-01-01 25 105574 1.0 NaN
2 2 2013-01-01 25 105575 2.0 NaN
3 3 2013-01-01 25 108079 1.0 NaN
4 4 2013-01-01 25 108701 1.0 NaN
Instead of creating gazillions of dense dummy columns, you should use a one-hot encoding that produces a sparse result: https://en.wikipedia.org/wiki/One-hot
The easiest way is to use scikit-learn's OneHotEncoder, which returns a sparse matrix by default: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
>>> from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder()
>>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
OneHotEncoder(categorical_features='all', dtype=<... 'numpy.float64'>,
handle_unknown='error', n_values='auto', sparse=True)
>>> enc.n_values_
array([2, 3, 4])
>>> enc.feature_indices_
array([0, 2, 5, 9])
>>> enc.transform([[0, 1, 1]]).toarray()
array([[ 1., 0., 0., 1., 0., 0., 1., 0., 0.]])
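Applied to the data in the question, the important point is that the encoder returns a scipy sparse matrix, so ~5000 distinct item ids do not blow up memory. A rough sketch (assuming a reasonably recent scikit-learn; the constructor arguments have changed between versions):
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown='ignore')       # ignore item ids unseen during fit
item_ohe = enc.fit_transform(train[['item_nbr']])  # scipy.sparse matrix, one column per item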
The way I see it, you could:
Use not all items but only the most frequent ones. Creating dummies this way produces fewer new columns and needs less memory. For this to work you will need to drop (or group together) the items with few occurrences (define "few" with a threshold), and you will lose some information. A small sketch of this idea follows after this list.
An alternative approach would be to use a Factorization Machine.
You could also use both suggestions above and average their predictions at the end for an even better score.
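A minimal sketch of the "most frequent items only" idea (the threshold of 200 and the item_grouped name are illustrative choices, not from the original answers):
import pandas as pd

top_items = train['item_nbr'].value_counts().nlargest(200).index
# Group all rare items under a single placeholder id (-1)
train['item_grouped'] = train['item_nbr'].where(train['item_nbr'].isin(top_items), other=-1)
# A few hundred dummy columns instead of ~5000
dummies = pd.get_dummies(train['item_grouped'], prefix='item', sparse=True)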
I have thousands of series (rows of a DataFrame) that I need to apply qcut on. Periodically there will be a series (row) that has fewer values than the desired number of quantiles (say, 1 value vs 2 quantiles):
>>> s = pd.Series([5, np.nan, np.nan])
When I apply .quantile() to it, it has no problem breaking into 2 quantiles (of the same boundary value)
>>> s.quantile([0.5, 1])
0.5 5.0
1.0 5.0
dtype: float64
But when I apply .qcut() with an integer value for the number of quantiles, an error is thrown:
>>> pd.qcut(s, 2)
...
ValueError: Bin edges must be unique: array([ 5., 5., 5.]).
You can drop duplicate edges by setting the 'duplicates' kwarg
Even after I set the duplicates argument, it still fails:
>>> pd.qcut(s, 2, duplicates='drop')
....
IndexError: index 0 is out of bounds for axis 0 with size 0
How do I make this work? (And equivalently, pd.qcut(s, [0, 0.5, 1], duplicates='drop') also doesn't work.)
The desired output is to have the 5.0 assigned to a single bin and the NaNs preserved:
0 (4.999, 5.000]
1 NaN
2 NaN
Ok, this is a workaround which might work for you.
pd.qcut(s, len(s.dropna()), duplicates='drop')
Out[655]:
0 (4.999, 5.0]
1 NaN
2 NaN
dtype: category
Categories (1, interval[float64]): [(4.999, 5.0]]
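Since the question mentions applying this to thousands of rows, here is a hedged sketch of using the same workaround row by row (df is a hypothetical DataFrame with one series per row, and 2 is the desired number of quantiles):
results = {}
for idx, row in df.iterrows():
    n = row.count()  # number of non-NaN values in this row
    if n:            # skip rows that are entirely NaN
        results[idx] = pd.qcut(row, min(2, n), duplicates='drop')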
You can try filling your object and numeric columns with an appropriate value ('null' for strings and 0 for numeric columns):
#fill numeric cols with 0
numeric_columns = df.select_dtypes(include=['number']).columns
df[numeric_columns] = df[numeric_columns].fillna(0)
#fill object cols with null
string_columns = df.select_dtypes(include=['object']).columns
df[string_columns] = df[string_columns].fillna('null')
Use Python 3.5 instead of Python 2.7.
This worked for me.
Taking inspiration from this discussion here on SO (Merge Columns within a DataFrame that have the Same Name), I tried the suggested method. While it works when using the function sum(), it doesn't when I use np.nansum:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(100,4), columns=['a', 'a','b','b'], index=pd.date_range('2011-1-1', periods=100))
print(df.head(3))
sum() case:
print(df.groupby(df.columns, axis=1).apply(sum, axis=1).head(3))
a b
2011-01-01 1.328933 1.678469
2011-01-02 1.878389 1.343327
2011-01-03 0.964278 1.302857
np.nansum() case:
print(df.groupby(df.columns, axis=1).apply(np.nansum, axis=1).head(3))
a [1.32893299939, 1.87838886222, 0.964278430632,...
b [1.67846885234, 1.34332662587, 1.30285727348, ...
dtype: object
Any idea why?
The issue is that np.nansum converts its input to a numpy array, so it effectively loses the column information (sum doesn't do this). As a result, the groupby doesn't get back any column information when constructing the output, so the output is just a Series of numpy arrays.
Specifically, the source code for np.nansum calls the _replace_nan function. In turn, the source code for _replace_nan checks if the input is an array, and converts it to one if it's not.
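You can see the conversion directly; a quick (illustrative) check using the df from the question:
print(type(np.nansum(df[['a']], axis=1)))  # <class 'numpy.ndarray'>: the column labels are gone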
All hope isn't lost though. You can easily replicate np.nansum with Pandas functions. Specifically use sum followed by fillna:
df.groupby(df.columns, axis=1).sum().fillna(0)
The sum should ignore NaNs and just sum the non-null values. The only case where you'll get back a NaN is if all the values being summed are NaN, which is why fillna is required. Note that you could also do the fillna before the groupby, i.e. df.fillna(0).groupby....
If you really want to use np.nansum, you can recast the result as a pd.Series. This will likely impact performance, as constructing a Series can be a relatively expensive operation, and you'll be doing it multiple times:
df.groupby(df.columns, axis=1).apply(lambda x: pd.Series(np.nansum(x, axis=1), x.index))
Example Computations
For some example computations, I'll be using the following simple DataFrame, which includes NaN values (your example data doesn't):
df = pd.DataFrame([[1,2,2,np.nan,4],[np.nan,np.nan,np.nan,3,3],[np.nan,np.nan,-1,2,np.nan]], columns=list('aaabb'))
a a a b b
0 1.0 2.0 2.0 NaN 4.0
1 NaN NaN NaN 3.0 3.0
2 NaN NaN -1.0 2.0 NaN
Using sum without fillna:
df.groupby(df.columns, axis=1).sum()
a b
0 5.0 4.0
1 NaN 6.0
2 -1.0 2.0
Using sum and fillna:
df.groupby(df.columns, axis=1).sum().fillna(0)
a b
0 5.0 4.0
1 0.0 6.0
2 -1.0 2.0
Comparing to the fixed np.nansum method:
df.groupby(df.columns, axis=1).apply(lambda x: pd.Series(np.nansum(x, axis=1), x.index))
a b
0 5.0 4.0
1 0.0 6.0
2 -1.0 2.0