I have a DataFrame like so:
+------------+----------------------+---------+
| number_col | text_col             | outcome |
+------------+----------------------+---------+
| 42         | Here is a string     | 1       |
| 52         | And here is a string | 0       |
+------------+----------------------+---------+
I am trying to use an SVM to predict outcome, with a FeatureUnion in which text_col is transformed by a vectorizer and number_col is transformed by a StandardScaler. Here is my code (drawing on this example):
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

class ItemSelector(BaseEstimator, TransformerMixin):
    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        # select a single column from the incoming DataFrame
        return data_dict[self.key]
a = Pipeline([
    ('union', FeatureUnion(
        transformer_list=[
            # Pipeline for vectorizing text
            ('subject', Pipeline([
                ('selector', ItemSelector(key='text_col')),
                ('tfidf', TfidfVectorizer(min_df=50)),
            ])),
            # Pipeline for standardizing number_col
            ('body_bow', Pipeline([
                ('selector', ItemSelector(key='number_col')),
                ('std', StandardScaler()),
            ])),
        ]
    )),
    # Use a SVC classifier on the combined features
    ('svc', SVC()),
])
a.fit(df, df['outcome'])
I get KeyError: 'text_col', evidently because transform is dropping the other columns? How would I make my code work so that I vectorize text_col but don't drop number_col?
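For reference, a minimal sketch of one way to express the same idea with scikit-learn's ColumnTransformer, which selects columns by name and so avoids a custom ItemSelector; the column names come from the example above, and this is just one possible approach rather than a guaranteed fix for the KeyError:
# sketch assuming scikit-learn >= 0.20, where ColumnTransformer is available
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

preprocess = ColumnTransformer(transformers=[
    # a bare column name hands the vectorizer a 1-D array of strings
    ('tfidf', TfidfVectorizer(min_df=50), 'text_col'),
    # a list of column names hands the scaler a 2-D array
    ('scale', StandardScaler(), ['number_col']),
])

clf = Pipeline([
    ('preprocess', preprocess),
    ('svc', SVC()),
])

# unused columns (here: outcome) are dropped by default
clf.fit(df, df['outcome'])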
I have a dataset which looks something like this:
time | src | a | b | c | d | e | Label
----------------------------------------
0. | 1 | # | # | # | # | # | #
1. | 1 | # | # | # | # | # | #
2. | 1 | # | # | # | # | # | #
3. | 1 | # | # | # | # | # | #
4. | 1 | # | # | # | # | # | #
....
0. | 2 | # | # | # | # | # | #
1. | 2 | # | # | # | # | # | #
2. | 2 | # | # | # | # | # | #
3. | 2 | # | # | # | # | # | #
4. | 2 | # | # | # | # | # | #
I'm training a model to predict Label against a window of [a, b, c, d, e] values, so my X has shape (window_size, 5) and my y is the value of Label at the end of the window. All rows of X must share the same src value (i.e. a window of data should only come from a single source).
I've previously been compiling X/y pairs myself, with a little tf.keras.utils.Sequence to hack together semi-usable memory management. Looking for a better way, I found tf.keras.utils.timeseries_dataset_from_array, but, based on my understanding, it has no concept of src, meaning a single X datum could span several src values. How can I leverage something like tf.keras.utils.timeseries_dataset_from_array, but have it only extract windows of data that have one src value?
Note: I'd like a rolling window, i.e. every possible overlapping window, from each source.
Progress
1
I successfully used timeseries_dataset_from_array, but it doesn't respect src
# ============= Prep ===========
import tensorflow as tf
import numpy as np
#creating numpy data structures representing the problem
X = np.random.random((100,5))
y = np.random.random((100))
src = np.array([0]*50 + [1]*50)
window_size = 5
#making a time series dataset which does not respect src
Xy_ds = tf.keras.utils.timeseries_dataset_from_array(X, y, batch_size = 2, sequence_length=window_size,
sequence_stride=window_size, shuffle=True)
# ============= Train ===========
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, InputLayer, LSTM, Flatten
#training a model, to validate the dataset is working correctly
model = Sequential()
model.add(InputLayer(input_shape=[window_size,5]))
model.add(LSTM(3))
model.add(Flatten())
model.add(Dense(1, activation='relu'))
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
history = model.fit(Xy_ds,epochs=1)
2
Implemented mdaoust's solution, but I'm getting shape errors when training
# ============= Prep ===========
import tensorflow as tf
import numpy as np
#creating numpy data structures representing the problem
X = np.random.random((100,5))
y = np.random.random((100))
src = np.expand_dims(np.array([0]*50 + [1]*50),1)
window_size = 5
#appending source information to X, for filtration
X = np.append(src, X, 1)
#making a time series dataset which does not respect src
Xy_ds = tf.keras.utils.timeseries_dataset_from_array(X, y, sequence_length=window_size,
sequence_stride=1, shuffle=True)
def single_source(x, y):
    source = x[:, 0]
    return tf.reduce_all(source == source[0])

#filtering by and removing src info
def drop_source(x, y):
    return x[:, 1:], y

def set_shapes(x, y, shape):
    x.set_shape(shape)
    return x, y

Xy_ds = Xy_ds.filter(single_source).map(drop_source)
# ============= Train ===========
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, InputLayer, LSTM, Flatten
#training a model, to validate the dataset is working correctly
model = Sequential()
model.add(InputLayer(input_shape=[window_size,5]))
model.add(LSTM(3))
model.add(Flatten())
model.add(Dense(1, activation='relu'))
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
history = model.fit(Xy_ds,epochs=1)
Error:
ValueError: Input 0 is incompatible with layer sequential_3: expected shape=(None, None, 5), found shape=(None, None, 6)
Presumably related to this GitHub thread.
I tried this and something like this, but no dice.
The simplest thing you can do is:
Include the source as one of the columns of X.
Use timeseries_dataset_from_array.
Use filter to drop slices that have mixed sources.
Xy_ds = tf.keras.utils.timeseries_dataset_from_array(...)
def single_source(x, y):
    source = x[:, 0]
    return tf.reduce_all(source == source[0])

def drop_source(x, y):
    return x[:, 1:], y
Xy_ds = Xy_ds.filter(single_source).map(drop_source)
Based on mdaoust's answer, but this is the final working code.
Prep
This will create the time series dataset and do all the manipulation needed to format it correctly.
# ============= Prep ===========
import tensorflow as tf
import numpy as np
batch_size = 32
#creating numpy data structures representing the problem
X = np.random.random((100,5))
y = np.random.random((100))
src = np.expand_dims(np.array([0]*50 + [1]*50),1)
window_size = 5
#appending source information to X, for filtration
X = np.append(src, X, 1)
#making a time series dataset which does not respect src
Xy_ds = tf.keras.utils.timeseries_dataset_from_array(X, y, sequence_length=window_size, batch_size=1,
sequence_stride=1, shuffle=True)
#filtering by and removing src info
def single_source(x, y):
    source = x[:, :, 0]
    return tf.reduce_all(source == source[0])

def drop_source(x, y):
    x_ = x[:, :, 1:]
    print(x_)
    return x_, y
Xy_ds = Xy_ds.filter(single_source)
Xy_ds = Xy_ds.map(drop_source)
Xy_ds = Xy_ds.unbatch().batch(batch_size)
#printing the dataset
i = 0
for x, y in Xy_ds:
    i += 1
    print(x)
    print(y)
print('total batches: {}'.format(i))
Training
Training, just to sanity-check that everything is working.
# ============= Train ===========
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, InputLayer, LSTM, Flatten
#training a model, to validate the dataset is working correctly
model = Sequential()
model.add(InputLayer(input_shape=[window_size,5]))
model.add(LSTM(3))
model.add(Flatten())
model.add(Dense(1, activation='relu'))
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
history = model.fit(Xy_ds,epochs=1)
Important note: for this to work, batching must occur after the filter and map are applied. That's why batch_size is 1 initially, and the real batching happens afterwards with unbatch().batch(batch_size).
I have a DataFrame like this one (see How to get the occurence rate of the specific values with Apache Spark):
+-----------+---------------------+--------------+---------+
| device    | windowtime          | values       | counts  |
+-----------+---------------------+--------------+---------+
| device_A  | 2022-01-01 18:00:00 | [99,100,102] | [1,3,1] |
| device_A  | 2022-01-01 18:00:10 | [98,100,101] | [1,2,2] |
+-----------+---------------------+--------------+---------+
windowtime is considered the X-axis value, values the Y-axis value, and counts the Z-axis value (to be plotted later, say, on a heatmap).
How do I export that from a PySpark DataFrame to a 3D Pandas object?
With "2 dimensions", I have
pdf = df.toPandas()
and then I can use that for Bokeh's figure like that:
fig1ADB = figure(title="My 2 graph", tooltips=TOOLTIPS, x_axis_type='datetime')
fig1ADB.line(x='windowtime', y='values', source=source, color="orange")
But I'd like to use something like this:
hm = HeatMap(data, x='windowtime', y='values', values='counts', title='My heatmap (3d) graph', stat=None)
show(hm)
What kind of transformation should I do for that?
I have realized that the approach itself is wrong: there should be no aggregation into lists before exporting to Pandas!
According to the discussion at
https://discourse.bokeh.org/t/cant-render-heatmap-data-for-apache-zeppelins-pyspark-dataframe/8844/8
instead of the values/counts columns grouped into lists, we need a raw table with one line per unique id ('values') and the count value ('index'), where each line has its own window_time (a sketch of one way to flatten the original DataFrame into this shape follows the table):
+-------------------+------+-----+
|window_time |values|index|
+-------------------+------+-----+
|2022-01-24 18:00:00|999 |2 |
|2022-01-24 19:00:00|999 |1 |
|2022-01-24 20:00:00|999 |3 |
|2022-01-24 21:00:00|999 |4 |
|2022-01-24 22:00:00|999 |5 |
|2022-01-24 18:00:00|998 |4 |
|2022-01-24 19:00:00|998 |5 |
|2022-01-24 20:00:00|998 |3 |
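For completeness, a minimal sketch of how the original grouped DataFrame (arrays of values/counts per windowtime) could be flattened into this raw shape before calling toPandas(); it assumes PySpark 2.4+ for arrays_zip, and the column names follow the tables above:
import pyspark.sql.functions as F

# explode the zipped values/counts arrays into one row per (window_time, value, count)
raw = (df
       .select('windowtime', F.explode(F.arrays_zip('values', 'counts')).alias('vc'))
       .select(F.col('windowtime').alias('window_time'),
               F.col('vc.values').alias('values'),
               F.col('vc.counts').alias('index')))
pdf = raw.toPandas()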
# pandas / Bokeh imports used below
import pandas as pd
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource, LogColorMapper, ColorBar
from bokeh.layouts import gridplot

rowIDs = pdf['values']
colIDs = pdf['window_time']
A = pdf.pivot_table('index', 'values', 'window_time', fill_value=0)
source = ColumnDataSource(data={
    'x': [pd.to_datetime('Jan 24 2022')],  # left most
    'y': [0],  # bottom most
    'dw': [pdf['window_time'].max() - pdf['window_time'].min()],  # TOTAL width of image
    #'dh': [df['delayWindowEnd'].max()],  # TOTAL height of image
    'dh': [1000],  # TOTAL height of image
    'im': [A.to_numpy()],  # 2D array using to_numpy() method on pivoted df
})
color_mapper = LogColorMapper(palette="Viridis256", low=1, high=20)
plot = figure(toolbar_location=None,x_axis_type='datetime')
plot.image(x='x', y='y', source=source, image='im',dw='dw',dh='dh', color_mapper=color_mapper)
color_bar = ColorBar(color_mapper=color_mapper, label_standoff=12)
plot.add_layout(color_bar, 'right')
#show(plot)
show(gridplot([plot], ncols=1, plot_width=1000, plot_height=400))
And the result:
I'm trying to calculate a fuzzy score (preferable partial_ratio score) across two columns in the same dataframe.
| column1     | column2     |
| ----------- | ----------- |
| emmett holt | holt        |
| greenwald   | christopher |
It would need to look something like this:
| column1     | column2     | partial_ratio |
| ----------- | ----------- | ------------- |
| emmett holt | holt        | 100           |
| greenwald   | christopher | 22            |
| schaefer    | schaefer    | 100           |
With the help of another question on this website, I worked towards the following code:
compare = pd.MultiIndex.from_product([dataframe['column1'], dataframe['column2']]).to_series()

def metrics(tup):
    return pd.Series([fuzz.partial_ratio(*tup)], ['partial_ratio'])

df['partial_ratio'] = df.apply(lambda x: fuzz.partial_ratio(x['original_title'], x['title']), axis=1)
But the problem already starts with the first line of the code, which returns the following error:
Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
You can say I'm kind of stuck here so any advice on this is appreciated!
You need a UDF to use fuzzywuzzy:
from fuzzywuzzy import fuzz
import pyspark.sql.functions as F

@F.udf
def fuzzyudf(original_title, title):
    return fuzz.partial_ratio(original_title, title)

df2 = df.withColumn('partial_ratio', fuzzyudf('column1', 'column2'))
df2.show()
+-----------+-----------+-------------+
| column1| column2|partial_ratio|
+-----------+-----------+-------------+
|emmett holt| holt| 100|
| greenwald|christopher| 22|
+-----------+-----------+-------------+
Let us say my spark DataFrame (DF) looks like
id | age | earnings| health
----------------------------
1 | 34 | 65 | 8
2 | 65 | 12 | 4
2 | 20 | 7 | 10
1 | 40 | 75 | 7
. | .. | .. | ..
and I would like to group the DF, apply a function (say a linear regression, which depends on multiple columns of the aggregated DF, two columns in this case) to each aggregated DF, and get output like
id | intercept| slope
----------------------
1 | ? | ?
2 | ? | ?
from sklearn.linear_model import LinearRegression

lr_object = LinearRegression()

def linear_regression(ith_DF):
    # Note: for me it is necessary that ith_DF should contain all
    # data within this function scope, so that I can apply any
    # function that needs all data in ith_DF
    X = [[i.earnings] for i in ith_DF.select("earnings").rdd.collect()]  # sklearn expects 2-D X
    y = [i.health for i in ith_DF.select("health").rdd.collect()]
    lr_object.fit(X, y)
    return lr_object.intercept_, lr_object.coef_[0]
coefficient_collector = []
# following iteration is not possible in spark as 'GroupedData'
# object is not iterable, please consider it as pseudo code
for ith_df in df.groupby("id"):
    c, m = linear_regression(ith_df)
    coefficient_collector.append((float(c), float(m)))
model_df = spark.createDataFrame(coefficient_collector, ["intercept", "slope"])
model_df.show()
I think this can be done since Spark 2.3 using pandas_udf. In fact, there is an example of fitting grouped regressions in the announcement of pandas_udf here:
Introducing Pandas UDF for Python
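A minimal sketch of that grouped-map pandas_udf approach, assuming Spark 2.3+ with pandas and scikit-learn available on the workers; the column names follow the question and the schema is illustrative:
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from sklearn.linear_model import LinearRegression

@pandas_udf("id long, intercept double, slope double", PandasUDFType.GROUPED_MAP)
def fit_group(pdf):
    # pdf is a pandas DataFrame holding every row for one id
    lr = LinearRegression()
    lr.fit(pdf[["earnings"]], pdf["health"])
    return pd.DataFrame({"id": [pdf["id"].iloc[0]],
                         "intercept": [lr.intercept_],
                         "slope": [float(lr.coef_[0])]})

model_df = df.groupby("id").apply(fit_group)
model_df.show()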
What I'd do is to filter the main DataFrame to create smaller DataFrames and do the processing, say a linear regression.
You can then execute the linear regression in parallel (on separate threads using the same SparkSession which is thread-safe) and the main DataFrame cached.
That should give you the full power of Spark.
P.S. My limited understanding of that part of Spark makes me think that a very similar approach is used for grid search-based model selection in Spark MLlib and also in TensorFrames, which is an "Experimental TensorFlow binding for Scala and Apache Spark".
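A minimal sketch of that filter-and-thread idea, reusing the linear_regression function from the question; the threading details here are illustrative, not a prescribed API:
from concurrent.futures import ThreadPoolExecutor
from pyspark.sql.functions import col

df.cache()  # keep the main DataFrame cached while the per-id jobs run
ids = [row["id"] for row in df.select("id").distinct().collect()]

def fit_one(group_id):
    ith_df = df.filter(col("id") == group_id)
    intercept, slope = linear_regression(ith_df)
    return group_id, float(intercept), float(slope)

# separate threads can share the same thread-safe SparkSession
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fit_one, ids))

model_df = spark.createDataFrame(results, ["id", "intercept", "slope"])
model_df.show()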
I have a multi-dimensional array:
julia> sim1.value[1:5,:,:]
5x3x3 Array{Float64,3}:
[:, :, 1] =
0.201974 0.881742 0.497407
0.0751914 0.921308 0.732588
-0.109084 1.06304 1.15962
-0.0149133 0.896267 1.22897
0.717094 0.72558 0.456043
[:, :, 2] =
1.28742 0.760712 1.61112
2.21436 0.229947 1.87528
-1.66456 1.46374 1.94794
-2.4864 1.84093 2.34668
-2.79278 1.61191 2.22896
[:, :, 3] =
0.649675 0.899028 0.628103
0.718837 0.665043 0.153844
0.914646 0.807048 0.207743
0.612839 0.790611 0.293676
0.759457 0.758115 0.280334
I have names for the 2nd dimension in
julia> sim1.names
3-element Array{String,1}:
"beta[1]"
"beta[2]"
"s2"
What's the best way to reshape this multi-dimensional array so that I have a data frame like:
beta[1] | beta[2] | s2 | chain
0.201974 | 0.881742 | 0.497407 | 1
0.0751914| 0.921308 | 0.732588 | 1
-0.109084 | 1.06304 | 1.15962 | 1
-0.0149133| 0.896267 | 1.22897 | 1
... | ... | ... | ...
1.28742 | 0.760712 | 1.61112 | 2
2.21436 | 0.229947 | 1.87528 | 2
-1.66456 | 1.46374 | 1.94794 | 2
-2.4864 | 1.84093 | 2.34668 | 2
-2.79278 | 1.61191 | 2.22896 | 2
... | ... | ... | ...
At the moment, I think the best way to do this would be a mixture of loops and calls to reshape:
using DataFrames
A = randn(5, 3, 3)
df = DataFrame()
for j in 1:3
    df[j] = reshape(A[:, :, j], 5 * 3)
end
names!(df, [:beta1, :beta2, :s2])
Looking at your data, it seems you wanted to stack the three matrices output by sim1.value[1:5,:,:] on top of each other vertically, plus add another column with the index of each matrix. The accepted answer of the brilliant and venerable John Myles White seems to put the entire contents of each of those matrices into its own column.
The code below matches your desired output, using vcat for the stacking and hcat and fill to add the extra column. JMW, I'm sure, will know if there's a better way :)
using DataFrames
A = randn(5, 3, 3)
names = ["beta[1]","beta[2]","s2"]
push!(names, "chain")
newA = vcat([hcat(A[:,:,i],fill(i,size(A,1))) for i in 1:size(A,3)]...)
df = DataFrame(newA, Symbol[names...])
Note also that you can do this slightly more concisely without the explicit calls to hcat and vcat:
newA = [[[A[:,:,i] fill(i,size(A,1))] for i in 1:size(A,3)]...]