Suppose I have sparse data in a dataframe. How can I create a sparse matrix from it, and which models can I use it with for predictions?
Consider the dataframe df
df = pd.DataFrame(np.zeros((10, 10)))
df.iloc[5, 5] = 1
df
0 1 2 3 4 5 6 7 8 9
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Memory usage: 880
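You can make it sparse with to_sparse(0).
The first argument is the fill value, i.e. the value that is assumed everywhere and therefore not stored.
Note that to_sparse was removed in pandas 1.0; on current versions the equivalent is df.astype(pd.SparseDtype("float", 0)).
d1 = df.to_sparse(0)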
d1
0 1 2 3 4 5 6 7 8 9
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Memory usage: 88
The memory footprint is a tenth of the size.
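For readers on a recent pandas, here is a minimal sketch of the same idea using the sparse dtype instead of to_sparse. The variable names (sparse_df, dense_bytes, sparse_bytes) are only illustrative, and the exact byte counts will differ from the figures quoted above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((10, 10)))
df.iloc[5, 5] = 1

# Store only the values that differ from the fill value 0.0.
sparse_df = df.astype(pd.SparseDtype("float64", 0.0))

# Compare the column storage of the dense and the sparse frame.
dense_bytes = df.memory_usage(index=False).sum()
sparse_bytes = sparse_df.memory_usage(index=False).sum()
print(dense_bytes, sparse_bytes)

# Fraction of cells that are actually stored.
print(sparse_df.sparse.density)
```

The .sparse accessor is only available when every column has a sparse dtype, which is the case after the astype call above.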
This approach keeps the data as sparse as possible and avoids memory issues. csr_matrix is a standard sparse matrix format that can be used with scipy and sklearn for modeling.
import pandas as pd
from scipy import sparse
df = pd.DataFrame({'rowid':[1,2,3,4,5], 'val1':[1, 1, 0, 0, 0], 'val2':[1, 0, 0, 1, 0]})
print('Input data frame\n{0}'.format(df))
print('DataFrame to a sparse matrix')
# as_matrix() was removed from pandas; to_numpy() is the current equivalent
df_as_sparse_matrix = sparse.csr_matrix(df.to_numpy())
print(df_as_sparse_matrix.todense())
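As a rough sketch of how the matrix would be prepared for a model: I am assuming here that rowid is just an identifier and should not be a feature, so only val1 and val2 go into the matrix. The name X is my own choice.

```python
import pandas as pd
from scipy import sparse

df = pd.DataFrame({'rowid': [1, 2, 3, 4, 5],
                   'val1': [1, 1, 0, 0, 0],
                   'val2': [1, 0, 0, 1, 0]})

# Assumption: rowid is an identifier, so it is left out of the features.
features = df[['val1', 'val2']]

# CSR stores only the non-zero entries, row by row.
X = sparse.csr_matrix(features.to_numpy())
print(X.shape, X.nnz)
```

Many scikit-learn estimators (LogisticRegression, for example) accept a scipy sparse matrix like X directly in fit and predict, so there is usually no need to call todense() before modeling.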
I have this code
df = pd.DataFrame({'R': {0: '1', 1: '2', 2: '3', 3: '4', 4: '5', 5: '6', 6: '7'}, 'a': {0: 1.0, 1: 1.0, 2: 2.0, 3: 3.0, 4: 3.0, 5: 2.0, 6: 3.0}, 'nv1': {0: [-1.0], 1: [-1.0], 2: [], 3: [], 4: [-2.0], 5: [-2.0, -1.0, -3.0, -1.0], 6: [-2.0, -1.0, -2.0, -1.0]}})
yielding the following dataframe:
R a nv1
0 1 1.0 [-1.0]
1 2 1.0 [-1.0]
2 3 2.0 []
3 4 3.0 []
4 5 3.0 [-2.0]
5 6 2.0 [-2.0, -1.0, -3.0, -1.0]
6 7 3.0 [-2.0, -1.0, -2.0, -1.0]
I need to calculate the median of each list in df['nv1']:
df['med'] = median of df['nv1']
Desired output as follows
R a nv1 med
1 1.0 [-1.0] -1
2 1.0 [-1.0] -1
3 2.0 []
4 3.0 []
5 3.0 [-2.0] -2
6 2.0 [-2.0, -1.0, -3.0, -1.0] -1.5
7 3.0 [-2.0, -1.0, -2.0, -1.0] -1.5
I tried both lines of code below independently, but each one raised an error:
df['nv1'] = pd.to_numeric(df['nv1'],errors = 'coerce')
df['med'] = df['nv1'].median()
Use np.median:
df['med'] = df['nv1'].apply(np.median)
Output:
>>> df
R a nv1 med
0 1 1.0 [-1.0] -1.0
1 2 1.0 [-1.0] -1.0
2 3 2.0 [] NaN
3 4 3.0 [] NaN
4 5 3.0 [-2.0] -2.0
5 6 2.0 [-2.0, -1.0, -3.0, -1.0] -1.5
6 7 3.0 [-2.0, -1.0, -2.0, -1.0] -1.5
Or:
df['med'] = df['nv1'].explode().dropna().groupby(level=0).median()
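A small variation, in case the empty lists are a concern: np.median on an empty list emits a RuntimeWarning before returning NaN, so this sketch checks the length first. The helper name median_or_nan is made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'nv1': [[-1.0], [-1.0], [], [], [-2.0],
            [-2.0, -1.0, -3.0, -1.0],
            [-2.0, -1.0, -2.0, -1.0]],
})

def median_or_nan(values):
    # Hypothetical helper: return NaN for an empty list instead of
    # letting np.median warn about an empty slice.
    return float(np.median(values)) if len(values) else np.nan

df['med'] = df['nv1'].apply(median_or_nan)
print(df)
```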
I have this matrix df.head():
0 1 2 3 4 5 6 7 8 9 ... 1848 1849 1850 1851 1852 1853 1854 1855 1856 1857
0 0.0 0.0 0.0 0.0 0.0 0.00000 0.0 0.0 0.0 0.00000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.00000 0.0 0.0 0.0 0.00000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.00000 0.0 0.0 0.0 30.88689 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.00000 0.0 0.0 0.0 0.00000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 42.43819 0.0 0.0 0.0 0.00000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 rows × 1858 columns
And I need to apply a transformation to it every time a value other than 0.0 is found, dividing the value by 0.32
So far I have the mask, like so:
normalize = 0.32
mask = (df>=0.0)
df = df.where(mask)
How do I apply such a transformation on a very large dataframe, after masking it?
You don't need mask, just divide your dataframe by 0.32.
df / 0.32
>>> df
A B
0 0 3
1 5 0
>>> df / 0.32
A B
0 0.000 9.375
1 15.625 0.000
If you do need to use a mask, try:
mask = (df.eq(0))
df.where(mask, df/0.32)
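To check that the two approaches agree, here is a minimal sketch on the small example frame from above (float columns are my assumption). Zeros stay zero under division, which is why the mask is not needed.

```python
import pandas as pd

df = pd.DataFrame({'A': [0.0, 5.0], 'B': [3.0, 0.0]})
normalize = 0.32

# Plain division: zeros remain zero, everything else is scaled.
scaled = df / normalize

# Masked version: keep cells equal to 0, replace the rest with the scaled value.
masked = df.where(df.eq(0), df / normalize)

print(scaled.equals(masked))
```

On a very large frame, df /= normalize does the same thing without binding a second name.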
I'm having trouble with shift and diff, and I feel it should be simple.
Assume I have customers with different product demands, and they get handled with priority from top to bottom. I'd like an efficient solution without looping.
df_situation = pd.DataFrame(
{
"cust": [1, 2, 3, 3,4],
"prod": [1, 1, 1, 2,2],
"available": [1000, np.nan, np.nan, 2000, np.nan],
"needed": [200, 300, 1000, 1000,1000],
}
)
My objective is to get some additional columns like this, but it looks like difference calculations and shift operation are in a "chicken and egg problem situation".
Thanks in advance for any hint
leftover_prod is the forward-filled available amount minus the cumulative demand per product (ffill, then groupby cumsum):
a = df_situation['available'].ffill()
df_situation['leftover_prod'] = (
    a - df_situation.groupby('prod')['demand'].cumsum()
)
0 800.0
1 500.0
2 -500.0
3 1000.0
4 0.0
Name: leftover_prod, dtype: float64
fulfilled_cust is the demand if the stock left by the previous customer covers it, otherwise whatever stock is left (groupby shift + np.where):
s = (df_situation.groupby('prod')['leftover_prod']
     .shift()
     .fillna(df_situation['available']))
df_situation['fulfilled_cust'] = np.where(
    s.ge(df_situation['demand']), df_situation['demand'], s
)
0 200.0
1 300.0
2 500.0
3 1000.0
4 1000.0
Name: fulfilled_cust, dtype: float64
missing_cust is the demand minus fulfilled_cust:
df_situation['missing_cust'] = (
    df_situation['demand'] - df_situation['fulfilled_cust']
)
0 0.0
1 0.0
2 500.0
3 0.0
4 0.0
Name: missing_cust, dtype: float64
Together:
a = df_situation['available'].ffill()
df_situation['leftover_prod'] = (
    a - df_situation.groupby('prod')['demand'].cumsum()
)
s = (df_situation.groupby('prod')['leftover_prod']
     .shift()
     .fillna(df_situation['available']))
df_situation['fulfilled_cust'] = np.where(
    s.ge(df_situation['demand']), df_situation['demand'], s
)
df_situation['missing_cust'] = (
    df_situation['demand'] - df_situation['fulfilled_cust']
)
cust prod available demand leftover_prod fulfilled_cust missing_cust
0 1 1 1000.0 200 800.0 200.0 0.0
1 2 1 NaN 300 500.0 300.0 0.0
2 3 1 NaN 1000 -500.0 500.0 500.0
3 3 2 2000.0 1000 1000.0 1000.0 0.0
4 4 2 NaN 1000 0.0 1000.0 0.0
Imports and DataFrame used:
import numpy as np
import pandas as pd
df_situation = pd.DataFrame({
    "cust": [1, 2, 3, 3, 4],
    "prod": [1, 1, 1, 2, 2],
    "available": [1000, np.nan, np.nan, 2000, np.nan],
    "demand": [200, 300, 1000, 1000, 1000],
})
(Changed "needed" to "demand", as it appears in the image.)
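One thing to watch out for: once leftover_prod has gone negative, any later customer of the same product would see a negative shifted value. The sketch below is a variation of the steps above, not part of the original answer. It assumes that a customer can never receive less than 0, so the stock seen by each customer is clipped at 0, and it forward-fills available per product. The function name allocate is hypothetical.

```python
import numpy as np
import pandas as pd

def allocate(df):
    # Hypothetical helper wrapping the steps above, with a clip at 0.
    out = df.copy()
    # Carry the available stock down within each product.
    available = out.groupby('prod')['available'].ffill()
    out['leftover_prod'] = available - out.groupby('prod')['demand'].cumsum()
    # Stock seen by each customer: what the previous customer left behind,
    # or the initial stock for the first customer of a product.
    before = (out.groupby('prod')['leftover_prod']
                 .shift()
                 .fillna(out['available'])
                 .clip(lower=0))  # assumption: stock cannot be negative
    out['fulfilled_cust'] = np.minimum(before, out['demand'])
    out['missing_cust'] = out['demand'] - out['fulfilled_cust']
    return out

df_situation = pd.DataFrame({
    "cust": [1, 2, 3, 3, 4],
    "prod": [1, 1, 1, 2, 2],
    "available": [1000, np.nan, np.nan, 2000, np.nan],
    "demand": [200, 300, 1000, 1000, 1000],
})
print(allocate(df_situation))
```

On the five rows from the question this gives the same columns as the code above; the clip only matters when another customer follows a row whose leftover_prod is already negative.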
When I use aggfunc = np.var in a pivot table, the values become NaN, but with aggfunc = np.sum they don't.
Why are the original values changed with aggfunc = np.var or aggfunc = np.std? I cannot find the answer in the docs of pivot table.
import pandas as pd
import numpy as np
df = pd.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo",
"bar", "bar", "bar", "bar"],
"B": ["one", "one", "one", "two", "two",
"one", "one", "two", "two"],
"C": ["small", "large", "large", "small",
"small", "large", "small", "small",
"large"],
"D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
"E": [2, 4, 5, 5, 6, 6, 8, 9, 9]})
print(df.pivot_table(
index = ['A', 'B'],
values = ['D', 'E'],
columns = ['C'],
aggfunc= np.sum,
margins=True,
margins_name = 'sum',
dropna = False
))
print('-' * 100)
df = df.pivot_table(
index = ['A', 'B'],
values = ['D', 'E'],
columns = ['C'],
aggfunc= np.var,
margins=True,
margins_name = 'var',
dropna = False
)
print(df)
D E
C large small sum large small sum
A B
bar one 4.0 5.0 9 6.0 8.0 14
two 7.0 6.0 13 9.0 9.0 18
foo one 4.0 1.0 5 9.0 2.0 11
two NaN 6.0 6 NaN 11.0 11
sum 15.0 18.0 33 24.0 30.0 54
-----------------------------------------------------------------------
D E
C large small var large small var
A B
bar one NaN NaN 0.500000 NaN NaN 2.000000
two NaN NaN 0.500000 NaN NaN 0.000000
foo one 0.000000 NaN 0.333333 0.500000 NaN 2.333333
two NaN 0.0 0.000000 NaN 0.5 0.500000
var 5.583333 3.8 3.555556 4.666667 7.5 4.888889
What's more, I expected the var of D = large to be np.var([4.0, 7.0, 4.0]) = 2.0 instead of 5.583333.
what I expected is:
D E
C large small var large small var
A B
bar one 4.0 5.0 0.25 6.0 8.0 1.0
two 7.0 6.0 0.25 9.0 9.0 0
foo one 4.0 1.0 2.25 9.0 2.0 12.25
two NaN 6.0 0 NaN 11.0 0.0
var 2.0 4.25 3.6 2.0 11.25 7.34
What is the meaning of aggfunc = np.var in pivot table?
pandas uses ddof = 1 by default for var, whereas np.var called on a plain array defaults to ddof = 0; see the np.var documentation for details on ddof.
When a group has just one value, the variance with ddof = 1 is NaN, because the divisor n - ddof is zero.
Var of D = large is np.var([2,2,4,7], ddof=1) = 5.583333333333333, so everything is correct (you have to use the individual values, not the sums).
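A quick way to see the effect of ddof on the numbers discussed here; this is only a sketch, and the variable names are my own.

```python
import numpy as np
import pandas as pd

# Individual D values where C == "large" (not the per-group sums).
d_large = [2, 2, 4, 7]

sample_var = np.var(d_large, ddof=1)      # divides by n - 1
population_var = np.var(d_large, ddof=0)  # divides by n
print(sample_var, population_var)

# A group with a single value: n - 1 == 0, so the sample variance is NaN.
single = pd.Series([4.0])
single_sample_var = single.var()             # pandas default is ddof=1
single_population_var = single.var(ddof=0)
print(single_sample_var, single_population_var)
```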
If you need var with ddof = 0 then you can provide your own function:
def var0(x):
    return np.var(x, ddof=0)
print(df.pivot_table(
index = ['A', 'B'],
values = ['D', 'E'],
columns = ['C'],
aggfunc= var0,
margins=True,
margins_name = 'var',
dropna = False
))
Result:
D E
C large small var large small var
A B
bar one 0.0000 0.00 0.250000 0.00 0.00 1.000000
two 0.0000 0.00 0.250000 0.00 0.00 0.000000
foo one 0.0000 0.00 0.222222 0.25 0.00 1.555556
two NaN 0.00 0.000000 NaN 0.25 0.250000
var 4.1875 3.04 3.555556 3.50 6.00 4.888889
UPDATE based on the edited question.
Pivot table with the sums of C and additionally the var of the sums as margin columns/row.
We first create the sum pivot table with margin columns/row named var. Then we update these margin columns/row with the var of the sum table:
dfs = df.pivot_table(
index = ['A', 'B'],
values = ['D', 'E'],
columns = ['C'],
aggfunc= np.sum,
margins=True,
margins_name = 'var',
dropna = False)
dfs[[('D','var'),('E','var')]] = df.pivot_table(
index = ['A', 'B'],
values = ['D', 'E'],
columns = ['C'],
aggfunc= np.sum,
dropna = False).stack().groupby(level=(0,1)).apply(var0)
dfs.iloc[-1] = dfs.iloc[:-1].apply(var0)
Result:
D E
C large small var large small var
A B
bar one 4.0 5.00 0.250000 6.0 8.00 1.000000
two 7.0 6.00 0.250000 9.0 9.00 0.000000
foo one 4.0 1.00 2.250000 9.0 2.00 12.250000
two NaN 6.00 0.000000 NaN 11.00 0.000000
var 2.0 4.25 0.824219 2.0 11.25 26.792969
In the margin row (last row) the var columns are calculated as the var of the row vars. I don't understand how the OP calculated his values for these two cells. Anyway they don't seem to make much sense.
I have this following dataframe:
Date
2002-01-01 10.0 NaN NaN
2002-05-01 NaN 30.0 40.0
2002-07-01 NaN NaN 50.0
I would like to complete the missing months with zeros. I am actually able to do that, but only by adding the entire range of missing days, as you can see in the following code. The relevant part of the code is marked with lines of #.
import datetime
import numpy as np
import pandas as pd

def createSeriesOfCompanies(df):
    listOfCompanies=list(set(df['Company']))
    dfSeries=df.pivot(index='Date', columns='Company', values='var1')
    # Here I include the missing dates
    #######################################################
    initialDate=dfSeries.index[0]
    endDate=dfSeries.index[-1]
    idx = pd.date_range(initialDate, endDate)
    dfSeries.index = pd.DatetimeIndex(dfSeries.index)
    dfSeries = dfSeries.reindex(idx, fill_value=0)
    ########################################################
    # Here it finishes the procedure
    return dfSeries, listOfCompanies  # the caller below unpacks these two values

def creatingDataFrame():
    dateList=[]
    dateList.append(datetime.date(2002,1,1))
    dateList.append(datetime.date(2002,7,1))
    dateList.append(datetime.date(2002,5,1))
    dateList.append(datetime.date(2002,5,1))
    dateList.append(datetime.date(2002,7,1))
    raw_data = {'Date': dateList,
                'Company': ['A', 'B', 'B', 'C' , 'C'],
                'var1': [10, 20, 30, 40 , 50]}
    df = pd.DataFrame(raw_data, columns = ['Date','Company', 'var1'])
    df.loc[1, 'var1'] = np.nan
    return df

if __name__=="__main__":
    df=creatingDataFrame()
    print(df)
    dfSeries,listOfCompanies=createSeriesOfCompanies(df)
I would like to get
Date
2002-01-01 10.0 NaN NaN
2002-02-01 0 0 0
2002-03-01 0 0 0
2002-04-01 0 0 0
2002-05-01 NaN 30.0 40.0
2002-06-01 0 0 0
2002-07-01 NaN NaN 50.0
But I am getting this
Company A B C
2002-01-01 10.0 NaN NaN
2002-01-02 0.0 0.0 0.0
2002-01-03 0.0 0.0 0.0
2002-01-04 0.0 0.0 0.0
2002-01-05 0.0 0.0 0.0
2002-01-06 0.0 0.0 0.0
2002-01-07 0.0 0.0 0.0
2002-01-08 0.0 0.0 0.0
2002-01-09 0.0 0.0 0.0
2002-01-10 0.0 0.0 0.0
2002-01-11 0.0 0.0 0.0
2002-01-12 0.0 0.0 0.0
2002-01-13 0.0 0.0 0.0
2002-01-14 0.0 0.0 0.0
2002-01-15 0.0 0.0 0.0
2002-01-16 0.0 0.0 0.0
2002-01-17 0.0 0.0 0.0
2002-01-18 0.0 0.0 0.0
2002-01-19 0.0 0.0 0.0
2002-01-20 0.0 0.0 0.0
2002-01-21 0.0 0.0 0.0
2002-01-22 0.0 0.0 0.0
2002-01-23 0.0 0.0 0.0
2002-01-24 0.0 0.0 0.0
2002-01-25 0.0 0.0 0.0
2002-01-26 0.0 0.0 0.0
2002-01-27 0.0 0.0 0.0
2002-01-28 0.0 0.0 0.0
2002-01-29 0.0 0.0 0.0
2002-01-30 0.0 0.0 0.0
...
How can I deal with this problem?
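You can use reindex with a month-start date range, assuming the dates are the index. Without fill_value the added months are NaN; pass fill_value=0 to get zeros instead: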
df.index = pd.to_datetime(df.index)
df.reindex(pd.date_range(df.index.min(), df.index.max(), freq = 'MS'))
A B C
2002-01-01 10.0 NaN NaN
2002-02-01 NaN NaN NaN
2002-03-01 NaN NaN NaN
2002-04-01 NaN NaN NaN
2002-05-01 NaN 30.0 40.0
2002-06-01 NaN NaN NaN
2002-07-01 NaN NaN 50.0
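To get zeros in the added months, as the question asks, the same reindex can take fill_value=0. A minimal sketch, rebuilding the pivoted frame by hand (the names monthly and filled are my own):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'A': [10.0, np.nan, np.nan],
     'B': [np.nan, 30.0, np.nan],
     'C': [np.nan, 40.0, 50.0]},
    index=['2002-01-01', '2002-05-01', '2002-07-01'],
)
df.index = pd.to_datetime(df.index)

# 'MS' = month start, so only the first day of each month is generated.
monthly = pd.date_range(df.index.min(), df.index.max(), freq='MS')

# fill_value applies only to newly added rows; existing NaN values are kept.
filled = df.reindex(monthly, fill_value=0)
print(filled)
```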
Use asfreq with MS (month start frequency):
df=creatingDataFrame()
df = df.pivot(index='Date', columns='Company', values='var1').asfreq('MS', fill_value=0)
print (df)
Company A B C
Date
2002-01-01 10.0 NaN NaN
2002-02-01 0.0 0.0 0.0
2002-03-01 0.0 0.0 0.0
2002-04-01 0.0 0.0 0.0
2002-05-01 NaN 30.0 40.0
2002-06-01 0.0 0.0 0.0
2002-07-01 NaN NaN 50.0
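One caveat: creatingDataFrame builds the Date column from datetime.date objects. Depending on the pandas version these may not line up with the Timestamp values that asfreq generates, so the sketch below converts the column with pd.to_datetime before pivoting. That conversion is my addition and not part of the answer above.

```python
import datetime
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    'Date': [datetime.date(2002, 1, 1), datetime.date(2002, 7, 1),
             datetime.date(2002, 5, 1), datetime.date(2002, 5, 1),
             datetime.date(2002, 7, 1)],
    'Company': ['A', 'B', 'B', 'C', 'C'],
    'var1': [10, np.nan, 30, 40, 50],
})

# Assumption: convert to datetime64 so the pivot index is a DatetimeIndex.
raw['Date'] = pd.to_datetime(raw['Date'])

wide = raw.pivot(index='Date', columns='Company', values='var1')

# Month-start frequency; rows that did not exist before are filled with 0.
monthly = wide.asfreq('MS', fill_value=0)
print(monthly)
```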