Multiplication of an RDD row with all other rows in PySpark (dataframe)

I have an RDD of DenseVector objects and I want to:
Select one of these vectors (one row)
Multiply this vector with all other vector rows in order to compute a (cosine) similarity
Basically I am trying to perform a dot product between a vector and a matrix, starting from an RDD. For reference, the RDD contains TF-IDF values built with Spark ML, which produces a dataframe of SparseVectors that have been mapped to DenseVectors in order to do the multiplication. The dataframe and corresponding RDD are called tfidf_df and tfidf_rdd respectively.
What I do, and which works, is the following (full script with sample data):
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import SparkSession
from pyspark.ml.feature import IDF, Tokenizer, CountVectorizer
from pyspark.mllib.linalg import DenseVector
import numpy as np

sc = SparkContext()
sqlc = SQLContext(sc)
spark_session = SparkSession(sc)

sentenceData = spark_session.createDataFrame([
    (0, "I go to school school is good"),
    (1, "I like school"),
    (2, "I also like cinema")
], ["label", "sentence"])

tokenizer = Tokenizer(inputCol="sentence", outputCol="tokens")
tokens_df = tokenizer.transform(sentenceData)

# TF feats
count_vectorizer = CountVectorizer(inputCol="tokens",
                                   outputCol="tf_features")
model = count_vectorizer.fit(tokens_df)
tf_df = model.transform(tokens_df)

print(model.vocabulary)
print(tf_df.rdd.take(5))

idf = IDF(inputCol="tf_features",
          outputCol="tf_idf_features",
          )
model = idf.fit(tf_df)
tfidf_df = model.transform(tf_df)

# Transform into RDD of dense vectors
tfidf_rdd = tfidf_df.select("tf_idf_features") \
    .rdd \
    .map(lambda row: DenseVector(row.tf_idf_features.toArray()))

print(tfidf_rdd.take(3))

# Select the test vector
test_label = 1
vec = tfidf_df.filter(tfidf_df.label == test_label) \
    .select('tf_idf_features') \
    .rdd \
    .map(lambda row: DenseVector(row.tf_idf_features.toArray())).collect()[0]

# Cosine similarity of every row with the selected vector
rddB = tfidf_rdd.map(lambda row: np.dot(row / np.linalg.norm(row),
                                        vec / np.linalg.norm(vec))) \
    .zipWithIndex()

# print('*** multiplication', rddB.take(20))

# Sort the similarities
sorted_rddB = rddB.sortByKey(False)
print(sorted_rddB.take(20))
The test vector is the one whose label is 1. The end result (from the last print statement) is [(1.0000000000000002, 1), (0.27105728525552131, 0), (0.1991208898963957, 2)], where the index is used to trace back to the original dataset.
This works fine but looks a bit clunky. I'm looking for best practices to multiply a selected row of a dataframe (a vector) with all of the dataframe's vectors. I'm open to any suggestions about the workflow, especially performance-related ones.
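For reference, a sketch of the same computation kept in the DataFrame API is shown below; it broadcasts the selected (normalized) vector once instead of mapping over a separate RDD. It assumes the variables from the script above (sc, np, vec, tfidf_df) are in scope and is not benchmarked:

from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

# broadcast the normalized test vector so every executor reuses a single copy
vec_arr = vec.toArray()
vec_bc = sc.broadcast(vec_arr / np.linalg.norm(vec_arr))

@F.udf(DoubleType())
def cosine_sim(v):
    arr = v.toArray()
    return float(np.dot(arr / np.linalg.norm(arr), vec_bc.value))

similarities = tfidf_df.withColumn("similarity", cosine_sim("tf_idf_features")) \
                       .select("label", "similarity") \
                       .orderBy(F.desc("similarity"))
similarities.show()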

Related

statsmodels.formula.api.ols(...).fit().pvalues returns a pandas Series instead of a numpy array

This may be hard to explain because it's a chunk of some really large code, so I don't expect it to be reproducible.
But essentially it's a simulation which (using multiple simulated datasets) creates a one-way or two-way regression and calculates the respective t-values and p-values for them.
However, for some of the datasets (with the same information and no missing values), statsmodels.formula.api.ols(...).fit() returns the p-values / t-values as a pandas Series instead of a numpy array (even for one-way studies).
Could someone please explain why, and whether there is a way to specify the output?
An example dataframe looks like this: (x0-x187 is our y, genotype and treatment are the desired factors, staging is a factor used for normalisation)
                         x0        x1        ...  treatment  genotype
200926_ku20_e1_wt_veh    0.075821  0.012796  ...  veh        wt
201210_ku25_e7_wt_veh    0.082307  0.007596  ...  veh        wt
201127_ku55_e6_wt_veh    0.083049  0.008978  ...  veh        wt
201220_ku52_e2_wt_veh    0.078414  0.013488  ...  veh        wt
...                      ...       ...       ...  ...        ...
210913_b6ku_22297_e5_wt  0.067858  0.008081  ...  treat      wt
210821_b6ku_3_e5_wt      0.070417  0.012396  ...  treat      wt
And then the code:
import subprocess as sub
import os
import struct
from pathlib import Path
import tempfile
from typing import Tuple
import shutil
from logzero import logger as logging
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# NOTE: `data`, `df` and `two_way` come from the surrounding (not shown) code
pvals = []  # initialised here for completeness; defined elsewhere in the original code
tvals = []

for col in range(data.shape[1]):
    if not df[f'x{col}'].any():
        p = np.nan
        t = np.nan
    else:
        if two_way:
            # two-way model - if it's just the geno or treat comparison,
            # the one-factor col will be ignored
            # for some simulations smf is returning a Series.
            fit = smf.ols(formula=f'x{col} ~ genotype * treatment + staging', data=df, missing='drop').fit()
            # get all pvals except intercept and staging
            p = fit.pvalues[~fit.pvalues.index.isin(['Intercept', 'staging'])]
            t = fit.tvalues[~fit.tvalues.index.isin(['Intercept', 'staging'])]
        else:
            fit = smf.ols(formula=f'x{col} ~ genotype + staging', data=df, missing='drop').fit()
            p = fit.pvalues['genotype[T.wt]']
            t = fit.tvalues['genotype[T.wt]']
    pvals.append(p)
    tvals.append(t)

p_all = np.array(pvals)
print("example", p_all[0])
print(type(p_all[0][0]), p_all[0][0])
And finally some output:
Desired output:
example [1.63688492e-01 6.05907115e-06 7.70710934e-02]
<class 'numpy.float64'> 0.16368849176977607
"Error" output:
example genotype[T.wt]                     0.862423
treatment[T.veh]                           0.000177
genotype[T.wt]:treatment[T.veh]            0.522066
dtype: float64
<class 'numpy.float64'> 0.8624226150886212
I've manually corrected the data but I would rather not have to do dumb fixes in the future.
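For reference, the difference comes from indexing: boolean-masking fit.pvalues (the two-way branch) always yields a pandas Series, while selecting a single label (the one-way branch) yields a scalar. If a plain ndarray is all that is needed, a minimal sketch of a workaround is to convert before appending:

# convert the selected p-values / t-values to plain numpy arrays before collecting them
p = fit.pvalues[~fit.pvalues.index.isin(['Intercept', 'staging'])].to_numpy()
t = fit.tvalues[~fit.tvalues.index.isin(['Intercept', 'staging'])].to_numpy()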

Scaling down high-dimensional pandas data frame data using sklearn

I am trying to scale down values in a pandas data frame. The problem is that I have 291 dimensions, so scaling the values one by one is time-consuming if we do it as follows:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler = scaler.fit(dataframe[['dimension_1']])
dataframe[['dimension_1']] = scaler.transform(dataframe[['dimension_1']])
Problem: This is only for one dimension, so how can we do this for all 291 dimensions in one shot?
You can pass in a list of the columns that you want to scale instead of individually scaling each column.
from sklearn.preprocessing import StandardScaler

# convert 0/1 values to booleans so those indicator columns are excluded from scaling
df.replace({0: False, 1: True}, inplace=True)
# make a copy of dataframe
scaled_features = df.copy()
# take the numeric columns i.e. those which are not of type object or bool
col_names = df.dtypes[df.dtypes != 'object'][df.dtypes != 'bool'].index.to_list()
features = scaled_features[col_names]
# Use scaler of choice; here Standard scaler is used
scaler = StandardScaler().fit(features.values)
features = scaler.transform(features.values)
scaled_features[col_names] = features
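For instance, assuming a toy frame with made-up column names dimension_1 ... dimension_3 standing in for the 291 columns from the question, the whole scaling step can be a one-liner:

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# toy stand-in for the 291-dimension frame (column names are hypothetical)
dataframe = pd.DataFrame(np.random.rand(5, 3),
                         columns=['dimension_1', 'dimension_2', 'dimension_3'])

cols = dataframe.columns.to_list()  # or any subset of numeric columns
dataframe[cols] = StandardScaler().fit_transform(dataframe[cols])
print(dataframe)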
I normally use a Pipeline, since it can do multi-step transformations.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
num_pipeline = Pipeline([('std_scale', StandardScaler())])
transformed_dataframe = num_pipeline.fit_transform(dataframe)
If you need more transformations, e.g. filling NAs, you just add them to the list (line 3 of the code).
Note: the above code works if the datatype of all columns is numeric. If not, we need to:
select only the numeric columns,
pass them into the pipeline, then
put the result back into the original dataframe.
Here is the code for the 3 steps:
num_col = dataframe.dtypes[dataframe.dtypes != 'object'][dataframe.dtypes != 'bool'].index.to_list()
df_num = dataframe[num_col]                           #1
transformed_df = num_pipeline.fit_transform(df_num)   #2
dataframe[num_col] = transformed_df                   #3

How to convert a matrix as string to ndarray?

I have a csv file with this structure:
id;matrix
1;[[1.2 1.3] [1.2 1.3] [1.2 1.3]]
I'm trying to read the matrix field as a numpy.ndarray, using pandas.read_csv to read the file and df.to_numpy() to convert the matrix, but the resulting array has shape (1, 0). I was expecting a shape of (3, 2), as in:
matrix = [[1.2 1.3]
[1.2 1.3]
[1.2 1.3]]
I also tried numpy.asmatrix, but the result is the same as with df.to_numpy().
A solution with pandas
Provided the format of the matrix column is consistent with that shown in the example, replace the spaces with commas, then use literal_eval to turn the string into a list of lists, and finally apply np.array.
import pandas as pd
from ast import literal_eval
import numpy as np
# read the data
df = pd.read_csv('file.csv', sep=';')
# replace the spaces
df['matrix'] = df['matrix'].str.replace(' ', ',')
# apply literal_eval
df['matrix'] = df['matrix'].apply(literal_eval)
# apply numpy array
df['matrix'] = df['matrix'].apply(np.array)
print(type(df.iloc[0, 1]))
# <class 'numpy.ndarray'>
Each row of the matrix column will be an ndarray.
The two apply calls can be combined into:
df['matrix'] = df['matrix'].apply(lambda x: np.array(literal_eval(x)))
Or this hot mess:
df['matrix'] = df['matrix'].str.replace(' ', ',').apply(lambda x: np.array(literal_eval(x)))
I personally prefer one transformation per line for code clarity.
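As a variation on the same idea (a sketch, assuming the file layout shown in the question), the conversion can also happen at read time via read_csv's converters argument, so the matrix column already holds ndarrays after loading:

import numpy as np
import pandas as pd
from ast import literal_eval

# parse each matrix string while the file is being read
df = pd.read_csv('file.csv', sep=';',
                 converters={'matrix': lambda s: np.array(literal_eval(s.replace(' ', ',')))})
print(df.iloc[0, 1].shape)  # (3, 2) for the sample row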

Categorical features correlation

I have some categorical features in my data along with continuous ones. Is it a good idea, or an absolutely bad one, to one-hot encode the categorical features in order to find their correlation to the labels, along with the other continuous features?
There is a way to calculate a correlation coefficient without one-hot encoding the categorical variable. Cramér's V statistic is one method for measuring the association between categorical variables, and it can be calculated as follows. This link is helpful: Using pandas, calculate Cramér's coefficient matrix. For variables with continuous values, you can first bin them into categories using pandas' cut.
import numpy as np
import pandas as pd
import scipy.stats as ss
import seaborn as sns

print('Pandas version:', pd.__version__)
# Pandas version: 1.3.0

tips = sns.load_dataset("tips")
tips["total_bill_cut"] = pd.cut(tips["total_bill"],
                                np.arange(0, 55, 5),
                                include_lowest=True,
                                right=False)

def cramers_v(confusion_matrix):
    """ calculate Cramers V statistic for categorial-categorial association.
        uses correction from Bergsma and Wicher,
        Journal of the Korean Statistical Society 42 (2013): 323-328
    """
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    return np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))

confusion_matrix = pd.crosstab(tips["day"], tips["time"])
cramers_v(confusion_matrix.values)
# Out[2]: 0.9386619340722221

confusion_matrix = pd.crosstab(tips["total_bill_cut"], tips["time"])
cramers_v(confusion_matrix.values)
# Out[3]: 0.1649870749498837
Please note that .as_matrix() has been deprecated in pandas since version 0.23.0; use .values instead.
I found the phik library quite useful for calculating the correlation between categorical and interval features. It is also useful for binning numerical features. Try it out: phik documentation
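A minimal sketch on the same tips dataset used above (assuming phik is installed; importing it registers the phik_matrix accessor on DataFrames, and treating total_bill and tip as interval columns here is just one possible choice):

import seaborn as sns
import phik  # noqa: F401 -- the import registers the .phik_matrix() DataFrame accessor

tips = sns.load_dataset("tips")
# phi_k correlation across categorical, ordinal and interval columns
phik_corr = tips.phik_matrix(interval_cols=["total_bill", "tip"])
print(phik_corr)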
I was looking to do the same thing in BigQuery.
For numeric features you can use the built-in CORR(x, y) function.
For categorical features, you can calculate it as:
cardinality(cat1 x cat2) / max(cardinality(cat1), cardinality(cat2))
which translates to the following SQL:
SELECT
COUNT(DISTINCT(CONCAT(cat1, cat2))) / GREATEST (COUNT(DISTINCT(cat1)), COUNT(DISTINCT(cat2))) as cat1_2,
COUNT(DISTINCT(CONCAT(cat1, cat3))) / GREATEST (COUNT(DISTINCT(cat1)), COUNT(DISTINCT(cat3))) as cat1_3,
....
FROM ...
Higher number means lower correlation.
I used the following Python script to generate the SQL:
import itertools

arr = range(1, 10)
query = ',\n'.join(list(
    'COUNT(DISTINCT(CONCAT({a}, {b}))) / GREATEST (COUNT(DISTINCT({a})), COUNT(DISTINCT({b}))) as cat{a}_{b}'.format(a=a, b=b)
    for (a, b) in itertools.combinations(arr, 2)))
query = 'SELECT \n ' + query + '\n FROM `...`;'
print(query)
It should be straightforward to do the same thing in numpy.
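For example, a rough pandas equivalent of the cardinality ratio above could look like this sketch (cat1 and cat2 are placeholder column names):

import pandas as pd

# hypothetical categorical columns
df = pd.DataFrame({'cat1': ['a', 'a', 'b', 'b', 'c', 'c'],
                   'cat2': ['x', 'x', 'y', 'y', 'z', 'z']})

# distinct (cat1, cat2) combinations divided by the larger single-column cardinality
ratio = df.groupby(['cat1', 'cat2']).ngroups / max(df['cat1'].nunique(), df['cat2'].nunique())
print(ratio)  # 1.0 here, i.e. the lowest possible value: cat1 and cat2 vary together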

pandas: finding the root of a function

I have some data frame in pandas, where the columns can be viewed as smooth functions of the index:
         f        g
x    ----------------------
0.1    f(0.1)   g(0.1)
0.2    f(0.2)   g(0.2)
...
And I want to know the x value for some f(x) = y, where y is given and I don't necessarily have a data point at the x I am looking for.
Essentially I want to find the intersection of a line and a data series in pandas. Is there a best way to do this?
Suppose your DataFrame looks something like this:
import numpy as np
import pandas as pd
def unknown_func(x):
    return -x ** 3 + 1
x = np.linspace(-10, 10, 100)
df = pd.DataFrame({'f': unknown_func(x)}, index=x)
then, using scipy, you could create an interpolation function:
import scipy.interpolate as interpolate
func = interpolate.interp1d(x, df['f'], kind='linear')
and then use a root finder to solve f(x)-y=0 for x:
import scipy.optimize as optimize
root = optimize.brentq(lambda x: func(x)-y, x.min(), x.max())
import numpy as np
import pandas as pd
import scipy.optimize as optimize
import scipy.interpolate as interpolate
def unknown_func(x):
    return -x ** 3 + 1
x = np.linspace(-10, 10, 100)
df = pd.DataFrame({'f': unknown_func(x)}, index=x)
y = 50
func = interpolate.interp1d(x, df['f'], kind='linear')
root = optimize.brentq(lambda x: func(x)-y, x.min(), x.max())
print(root)
# -3.6566397064
print(func(root))
# 50.0
idx = np.searchsorted(df.index.values, root)
print(df.iloc[idx-1:idx+1])
# f
# -3.737374 53.203496
# -3.535354 45.187410
Notice that you need some model for your data. Above, the linear interpolator interp1d is implicitly imposing a model for the unknown function that generated the data.
If you already have a model function (such as unknown_func), then you could use that instead of the func returned by interp1d. If you have a parametrized model function, then instead of interp1d you could use optimize.curve_fit to find the best-fitting parameters. And if you do choose to interpolate, there are many other interpolation choices (e.g. quadratic or cubic) which you might use too. What to choose depends on what you think best models your data.
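For instance, a minimal sketch of the curve_fit route mentioned above, assuming (purely for illustration) that the data is believed to follow a cubic model a*x**3 + b:

import numpy as np
import scipy.optimize as optimize

# same sample data as above
x = np.linspace(-10, 10, 100)
yobs = -x ** 3 + 1

# parametrized model and its best-fit parameters
def model(x, a, b):
    return a * x ** 3 + b

(a, b), _ = optimize.curve_fit(model, x, yobs)

# solve model(x) - y = 0 for the target y, as before
y = 50
root = optimize.brentq(lambda x: model(x, a, b) - y, x.min(), x.max())
print(root)               # close to -(49 ** (1/3)) ≈ -3.659
print(model(root, a, b))  # ≈ 50.0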