pandas dataframe to coo matrix and to lil matrix - pandas

I have following series:
groups['combined']
0    (28, 1)    1
1    (32, 1)    1
2    (36, 1)    1
3    (37, 1)    1
4    (84, 1)    1
          ...
Name: combined, Length: 14476, dtype: object
How can I convert this data into a COO matrix and then into LIL format with .tolil()?
For reference, here is how the combined column is formed from the original pandas DataFrame:
import pandas as pd
pd.DataFrame({0: [28, 32, 36, 37, 84], 1: [1, 1, 1, 1, 1], 2: [1, 1, 1, 1, 1]})
Column 0 has over 10K unique features, column 1 has 39 groups, and column 2 is just 1.

Formation of the COOrdinate format from the original pandas DataFrame:
import scipy.sparse as sps
groups.set_index([0, 1], inplace=True)
# MultiIndex.labels was removed in pandas 1.0; its replacement is MultiIndex.codes
sps.coo_matrix((groups[2], (groups.index.codes[0], groups.index.codes[1])))
which results in:
<10312x39 sparse matrix of type '<class 'numpy.int64'>'
with 14476 stored elements in COOrdinate format>
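For anyone reproducing this, here is a minimal self-contained sketch of the same construction on toy data (assuming a recent pandas, where MultiIndex.codes replaces the old labels attribute):
import pandas as pd
import scipy.sparse as sps

# stand-in for the original data: col 0 = feature id, col 1 = group id,
# col 2 = the value to store at that (row, col) coordinate
groups = pd.DataFrame({0: [28, 32, 36, 37, 84],
                       1: [1, 1, 1, 1, 1],
                       2: [1, 1, 1, 1, 1]})

groups.set_index([0, 1], inplace=True)
# the integer codes of each index level serve as the row/col coordinates
coo = sps.coo_matrix((groups[2],
                      (groups.index.codes[0], groups.index.codes[1])))
print(repr(coo))  # <5x1 sparse matrix ... with 5 stored elements in COOrdinate format>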

As regards the LIL matrix:
print(len(networks[0]), len(networks[1]), networks[0].nunique(), networks[1].nunique())
667966 667966 10312 10312
networks[:5]
0 1
0 176 1
1 233 1
2 283 1
3 371 1
4 394 1
# make row and col labels
rows = networks[0]
cols = networks[1]
# the values in column 2 form the crucial third array (the stored data)
networks.set_index([0, 1], inplace=True)
# again, use MultiIndex.codes (formerly .labels) in current pandas
Ntw = sps.coo_matrix((networks[2], (networks.index.codes[0],
                                    networks.index.codes[1])))
d = Ntw.tolil()
d
generates
<10312x10312 sparse matrix of type '<class 'numpy.int64'>'
with 667966 stored elements in LInked List format>
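Assuming d from above, a short illustration of why you might want the LIL format at all: unlike COO, it supports row slicing and in-place assignment (a sketch, not from the original post):
row = d.getrow(0)   # cheap single-row slice, stays sparse
d[0, 3] = 7         # item assignment works on lil_matrix (COO would raise)
m = d.tocsr()       # convert to CSR once editing is done, for fast arithmetic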


Creating a dataframe using roll-forward window on multivariate time series

Based on the simplified sample dataframe
import pandas as pd
import numpy as np
timestamps = pd.date_range(start='2017-01-01', end='2017-01-05', inclusive='left')
values = np.arange(0, len(timestamps))
df = pd.DataFrame({'A': values, 'B': values * 2}, index=timestamps)
print(df)
A B
2017-01-01 0 0
2017-01-02 1 2
2017-01-03 2 4
2017-01-04 3 6
I want to use a roll-forward window of size 2 with a stride of 1 to create a resulting dataframe like
timestep_1 timestep_2 target
0 A 0 1 2
B 0 2 4
1 A 1 2 3
B 2 4 6
I.e., each window step should create a data item with the two values of A and B in this window and the A and B values immediately to the right of the window as target values.
My first idea was to use pandas.DataFrame.rolling
(https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html),
but that only seems to work in combination with aggregate functions such as sum, which is a different use case.
Any ideas on how to implement this rolling-window-based sampling approach?
Here is one way to do it:
window_size = 3
new_df = pd.concat(
    [
        df.iloc[i : i + window_size, :]
        .T.reset_index()
        .assign(other_index=i)
        .set_index(["other_index", "index"])
        .set_axis([f"timestep_{j}" for j in range(1, window_size)] + ["target"], axis=1)
        for i in range(df.shape[0] - window_size + 1)
    ]
)
new_df.index.names = ["", ""]
print(new_df)
# Output
timestep_1 timestep_2 target
0 A 0 1 2
B 0 2 4
1 A 1 2 3
B 2 4 6
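A hedged alternative, reusing df from above: numpy's sliding_window_view (available since numpy 1.20) builds the same layout without a Python-level loop:
from numpy.lib.stride_tricks import sliding_window_view

window_size = 3  # two input timesteps plus one target column
# shape: (n_windows, n_cols, window_size)
windows = sliding_window_view(df.to_numpy(), window_size, axis=0)
n_windows, n_cols = windows.shape[0], windows.shape[1]
alt_df = pd.DataFrame(
    windows.reshape(n_windows * n_cols, window_size),
    index=pd.MultiIndex.from_product([range(n_windows), df.columns]),
    columns=[f"timestep_{j}" for j in range(1, window_size)] + ["target"],
)
print(alt_df)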

Multi-indexed series into DataFrame and reformat

I have a correlation matrix of stock returns in a Pandas DataFrame and I want to extract the top/bottom 10 correlated pairs from the matrix.
Sample DataFrame:
import pandas as pd
import numpy as np
data = np.random.randint(5,30,size=500)
df = pd.DataFrame(data.reshape((50,10)))
corr = df.corr()
This is my function to get the top/bottom 10 correlated pairs by 1) first returning a multi-indexed series (high) for highest correlated pairs, and then 2) unstacking back into a DataFrame (high_df):
def get_rankings(corr_matrix):
    # the matrix is symmetric, so extract the upper triangle without the diagonal (k=1)
    ranked_corr = (corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
                   .stack()
                   .sort_values(ascending=False))
    high = ranked_corr[:10]
    high_df = high.unstack().fillna("")
    return high_df
get_rankings(corr)
My current DF output looks something like this:
6 4 5 7 8 3 9
3 0.359 0.198
1 0.275
4 0.257
2 0.176 0.154
0 0.153 0.164
5 0.156
But I want it to look like this, in either two or three columns:
ID1 ID2 Corr
0 9 0.304471
2 8 0.271009
2 3 0.147702
7 9 0.146176
0 7 0.144549
7 8 0.111888
4 6 0.098619
1 7 0.092338
1 4 0.09091
3 6 0.079688
It needs to be in a DataFrame so I can pass it to a grid widget, which only accepts DataFrames. Can anyone help me rehash the shape of the unstacked DF?
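One way to get that long format (a sketch along the lines of get_rankings above: keep the stacked Series and name its index levels instead of unstacking):
def get_rankings_long(corr_matrix, n=10):
    # upper triangle without the diagonal, so each pair appears once
    mask = np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
    ranked = (corr_matrix.where(mask)
                         .stack()
                         .sort_values(ascending=False))
    # the two MultiIndex levels become ordinary columns ID1 and ID2
    return (ranked.head(n)
                  .rename_axis(['ID1', 'ID2'])
                  .reset_index(name='Corr'))

get_rankings_long(corr)
For the bottom pairs, take ranked.tail(n) (or sort ascending) instead of head.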

Numpy vs Pandas axis

Why axis differs in Numpy vs Pandas?
Example:
If I want to get rid of column in Pandas I could do this:
df.drop("column", axis = 1, inplace = True)
Here, we are using axis = 1 to drop a column (vertically in a DF).
In Numpy, if I want to sum a matrix A vertically I would use:
A.sum(axis = 0)
Here I use axis = 0.
axis isn't used that often in pandas. A dataframe has 2 dimensions, which are often treated quite differently. In drop, the axis definition is well documented, and actually corresponds to the numpy usage.
Make a simple array and data frame:
In [180]: x = np.arange(9).reshape(3,3)
In [181]: df = pd.DataFrame(x)
In [182]: df
Out[182]:
0 1 2
0 0 1 2
1 3 4 5
2 6 7 8
Delete a row from the array, or a column:
In [183]: np.delete(x, 1, 0)
Out[183]:
array([[0, 1, 2],
[6, 7, 8]])
In [184]: np.delete(x, 1, 1)
Out[184]:
array([[0, 2],
[3, 5],
[6, 8]])
Drop does the same thing for the same axis:
In [185]: df.drop(1, axis=0)
Out[185]:
0 1 2
0 0 1 2
2 6 7 8
In [186]: df.drop(1, axis=1)
Out[186]:
0 2
0 0 2
1 3 5
2 6 8
For sum, the definitions are the same as well:
In [188]: x.sum(axis=0)
Out[188]: array([ 9, 12, 15])
In [189]: df.sum(axis=0)
Out[189]:
0 9
1 12
2 15
dtype: int64
In [190]: x.sum(axis=1)
Out[190]: array([ 3, 12, 21])
In [191]: df.sum(axis=1)
Out[191]:
0 3
1 12
2 21
dtype: int64
The pandas sums are Series, which are the pandas equivalent of a 1d array.
Visualizing what axis does with reduction operations like sum is a bit tricky - especially with 2d arrays. Is the axis kept or removed? It can help to think about axis for 1d arrays (the only axis is removed), or 3d arrays, where one axis is removed leaving two.
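A small sketch makes the "axis removed" behavior concrete:
import numpy as np

a = np.arange(24).reshape(2, 3, 4)
a.sum(axis=0).shape   # (3, 4) - the summed axis disappears
a.sum(axis=1).shape   # (2, 4)
a.sum(axis=2).shape   # (2, 3)

v = np.array([1, 2, 3])
v.sum(axis=0)         # 6 - a 1d array reduces to a scalar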
When you drop a column, its label is looked up along axis 1, the horizontal axis. When you sum along axis 0, you sum vertically, down the rows.

Pandas: Imputing Missing Values to Data Frame

Suppose I have a data frame with some missing values, as below:
import pandas as pd
df = pd.DataFrame([[1,3,'NA',2], [0,1,1,3], [1,2,'NA',1]], columns=['W', 'X', 'Y', 'Z'])
print(df)
The variable Y is missing two values. Say I run some imputation model and come up with an estimate of what the two values should be:
to_impute = [2,1]
What is the best way of replacing the two NA's with those two values? I know of ways that are fairly roundabout, e.g. looping over to_impute and using df.iloc to add each value. But I'm hoping there is a concise and non-iterative way.
(This is something that is easy in R, and I'm hoping it can be easy in Pandas.)
In pandas, NA should be NaN. First you need to replace the string 'NA', then you can use fillna:
import numpy as np
df.Y = df.Y.replace('NA', np.nan)
# fillna aligns on the index, so build a Series indexed by the null rows
df.Y = df.Y.fillna(pd.Series([1, 2], index=df.index[df.Y.isnull()]))
df
Out[1375]:
W X Y Z
0 1 3 1.0 2
1 0 1 1.0 3
2 1 2 2.0 1
Alternatively, treat your NA values as strings and assign in place:
df.loc[df.Y == 'NA', 'Y'] = [1, 2]
df
Out[1380]:
W X Y Z
0 1 3 1 2
1 0 1 1 3
2 1 2 2 1
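For completeness, a self-contained sketch combining the pieces above; the key point is that fillna aligns on the index, so the imputed values land exactly on the rows where Y is missing:
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 3, 'NA', 2], [0, 1, 1, 3], [1, 2, 'NA', 1]],
                  columns=['W', 'X', 'Y', 'Z'])
to_impute = [2, 1]

df['Y'] = df['Y'].replace('NA', np.nan)
# a Series indexed by the missing rows, so fillna places each value correctly
df['Y'] = df['Y'].fillna(pd.Series(to_impute, index=df.index[df['Y'].isnull()]))
print(df)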

Check whether a column in a dataframe is an integer or not, and perform operation

Check whether each column in a dataframe is an integer, and if it is, multiply that column by 10.
import numpy as np
import pandas as pd
df = pd.DataFrame(...)
# function to check and multiply if a column is integer
def xtimes(x):
    for col in x:
        if type(x[col]) == np.int64:
            return x[col] * 10
        else:
            return x[col]

# using apply to apply that function on df
df.apply(xtimes).head(10)
I am getting an error like ('GP', 'occurred at index school')
You could use select_dtypes to get numeric columns and then multiply.
In [1284]: df[df.select_dtypes(include=['int', 'int64', np.number]).columns] *= 10
You could pass your own specific list of dtypes to check for via include=[..., np.int64, ..., etc.].
You can use the dtypes attribute and loc.
df.loc[:, df.dtypes <= np.integer] *= 10
Explanation
pd.DataFrame.dtypes returns a pd.Series of numpy dtype objects. We can use the comparison operators to determine subdtype status. See the numpy documentation for the numpy.dtype hierarchy.
Demo
Consider the dataframe df
df = pd.DataFrame([
    [1, 2, 3, 4, 5, 6],
    [1, 2, 3, 4, 5, 6]
]).astype(pd.Series([np.int32, np.int16, np.int64, float, object, str]))
df
0 1 2 3 4 5
0 1 2 3 4.0 5 6
1 1 2 3 4.0 5 6
The dtypes are
df.dtypes
0 int32
1 int16
2 int64
3 float64
4 object
5 object
dtype: object
We'd like to change columns 0, 1, and 2
Conveniently
df.dtypes <= np.integer
0 True
1 True
2 True
3 False
4 False
5 False
dtype: bool
And that is what enables us to use this within a loc assignment.
df.loc[:, df.dtypes <= np.integer] *= 10
df
0 1 2 3 4 5
0 10 20 30 4.0 5 6
1 10 20 30 4.0 5 6
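A self-contained variant of the same idea using select_dtypes, listing the concrete integer dtypes explicitly (a sketch, not from the original answers):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [4.0, 5.0], 'c': ['x', 'y']})

# select only the integer columns and scale them in place
int_cols = df.select_dtypes(include=[np.int16, np.int32, np.int64]).columns
df[int_cols] = df[int_cols] * 10
print(df)
#     a    b  c
# 0  10  4.0  x
# 1  20  5.0  y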