TypeError: 'Value' object is not iterable when iterating over a DataFrame to get predictions from a GCP Natural Language model

I'm trying to iterate over a DataFrame in order to apply a predict function that calls a Natural Language model hosted on GCP. Here is the loop code:
model = 'XXXXXXXXXXXXXXXX'
barometre_df_processed = barometre_df
barometre_df_processed['theme'] = ''
barometre_df_processed['proba'] = ''
print('START OF FOR LOOP')
for ind in barometre_df.index:
    if barometre_df.verbatim[ind] is np.nan:
        barometre_df_processed.theme[ind] = "RAS"
        barometre_df_processed.proba[ind] = "1"
    else:
        print(barometre_df.verbatim[ind])
        print(type(barometre_df.verbatim[ind]))
        res = get_prediction(file_path={'text_snippet': {'content': barometre_df.verbatim[ind], 'mime_type': 'text/plain'}}, model_name=model)
        print(res)
        theme = res['displayNames']
        proba = res["classification"]["score"]
        barometre_df_processed.theme[ind] = theme
        barometre_df_processed.proba[ind] = proba
And here is the get_prediction function, which I took from the Natural Language AI documentation:
def get_prediction(file_path, model_name):
    options = ClientOptions(api_endpoint='eu-automl.googleapis.com:443')
    prediction_client = automl_v1.PredictionServiceClient(client_options=options)
    payload = file_path
    # Uncomment the following line (and comment the above line) if want to predict on PDFs.
    # payload = pdf_payload(file_path)
    parameters_dict = {}
    params = json_format.ParseDict(parameters_dict, Value())
    request = prediction_client.predict(name=model_name, payload=payload, params=params)
    print("prediction function")
    print(request)
    return resultat[0]["displayName"], resultat[0]["classification"]["score"], resultat[1]["displayName"], resultat[1]["classification"]["score"], resultat[2]["displayName"], resultat[2]["classification"]["score"]
I'm looping this way because I want each [displayName, score] pair to become a new row in my final DataFrame, so that I end up with something like this:
verbatim1, theme1, proba1
verbatim1, theme2, proba2
verbatim1, theme3, proba3
verbatim2, theme1, proba1
verbatim2, theme2, proba2
...
The if barometre_df.verbatim[ind] is np.nan check is not causing problems; I only use it to handle NaNs, so you can ignore it.
The error I get is this one:
TypeError: 'Value' object is not iterable
I guess the issue is with
res = get_prediction(file_path={'text_snippet': {'content': barometre_df.verbatim[ind]} },model_name=model)
but I can't figure out what's going wrong here.
I already tried removing
,'mime_type': 'text/plain'}
from my get_prediction parameters, but it doesn't change anything.
Does anyone know how to deal with this issue?
Thank you in advance.

I think you are not iterating correctly.
The way to iterate through a dataframe is:
for index, row in df.iterrows():
    print(row['col1'])
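Building on the iterrows() pattern, here is a minimal sketch of how each (theme, score) pair could become its own row, which is the layout the question asks for. It assumes a hypothetical fake_get_prediction helper that stands in for the real model call and returns a list of (theme, score) pairs; the helper and the sample data are illustrations, not part of the original code.

```python
import pandas as pd


def fake_get_prediction(text):
    # Hypothetical stand-in for the real model call: returns (theme, score) pairs.
    return [("theme1", 0.7), ("theme2", 0.2), ("theme3", 0.1)]


# Hypothetical sample data with one missing verbatim.
barometre_df = pd.DataFrame({"verbatim": ["good service", None, "too slow"]})

rows = []
for index, row in barometre_df.iterrows():
    verbatim = row["verbatim"]
    if pd.isna(verbatim):
        # Same placeholder values as in the question for missing text.
        rows.append({"verbatim": verbatim, "theme": "RAS", "proba": 1.0})
        continue
    # One output row per (theme, score) pair returned for this verbatim.
    for theme, proba in fake_get_prediction(verbatim):
        rows.append({"verbatim": verbatim, "theme": theme, "proba": proba})

# Build the result once, instead of assigning cell by cell inside the loop.
result_df = pd.DataFrame(rows, columns=["verbatim", "theme", "proba"])
```

Collecting plain dictionaries and building the DataFrame once at the end also avoids the chained assignments (df.theme[ind] = ...) used in the question, which pandas may apply to a temporary copy.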

Is there a method for converting a wimids object to a mids object?

Suppose I create 10 multiply-imputed datasets and use the (wonderful) MatchThem package in R to create weights for my exposure variable. The MatchThem package takes a mids object and converts it to an object of the class wimids.
My desired output is a mids object, but with weights. I hope to pass this mids object to brms as follows:
library(brms)
m0 <- brm_multiple(Y|weights(weights) ~ A, data = mids_data)
Open to suggestions.
EDIT: Noah's solution below will unfortunately not work.
The package's first author, Farhad Pishgar, sent me the following elegant solution. It will create a mids object from a wimids object. Thank you Farhad!
library(mice)
library(MatchThem)
# "weighted.datasets" is our .wimids object
# Extracting the original dataset with missing values
maindataset <- complete(weighted.datasets, action = 0)
#Some spit-and-polish
maindataset <- data.frame(.imp = 0, .id = seq_len(nrow(maindataset)), maindataset)
#Extracting imputed-weighted datasets in the long format
alldataset <- complete(weighted.datasets, action = "long")
#Binding them together
alldataset <- rbind(maindataset, alldataset)
#Converting to .mids
newmids <- as.mids(alldataset)
Additionally, for brms, I worked out this solution, which instead creates a list of data frames and takes fewer steps.
library("mice")
library("dplyr")
library("MatchThem")
library("brms") # for bayesian estimation.
# Note, I realise that my approach here is not fully Bayesian, but that is a good thing! I need to ensure balance in the exposure.
# impute missing data
data("nhanes2")
imp <- mice(nhanes2, printFlag = FALSE, seed = 0, m = 10)
# MatchThem. This is just a fast method
w_imp <- weightthem(hyp ~ chl + age, data = imp,
                    approach = "within",
                    estimand = "ATE",
                    method = "ps")
# get individual data frames with weights
out <- complete(w_imp, action ="long", include = FALSE, mild = TRUE)
# assemble individual data frames into a list
m <- 10
listdat<- list()
for (i in 1:m) {
  listdat[[i]] <- as.data.frame(out[[i]])
}
# pass the list to brms, and it runs as it should!
fit_1 <- brm_multiple(bmi|weights(weights) ~ age + hyp + chl,
                      data = listdat,
                      backend = "cmdstanr",
                      family = "gaussian",
                      set_prior('normal(0, 1)',
                                class = 'b'))
brm_multiple() can take in a list of data frames for its data argument. You can produce this from the wimids object using complete(). The output of complete() with action = "all" is a mild object, which is a list of data frames, but this is not recognized by brm_multiple() as such. So, you can just convert it to a list. This should look like the following:
df_list <- complete(mids_data, "all")
class(df_list) <- "list"
m0 <- brm_multiple(Y|weights(weights) ~ A, data = df_list)
Using complete() automatically adds a weights column to the resulting imputed data frames.

How to make tf.map_fn work on the last (channel) dimension, i.e. (6,), of inputs with shape (100,24,24,6)?

I am trying to use tf.map_fn(), where my elems should refer to the channel dimension of my inputs (shape = 100,24,24,6). In other words, elems should be a list/tuple of tensors giving access to the values along the channel dimension (6) of the inputs. I am trying to do this with nested for loops, like this:
@tf.function
def call(self, inputs, training=True):
    elems = []
    for b in inputs:
        for h in b:
            for w in h:
                for c in w:
                    elems.append(c)
    changed_inputs = tf.map_fn(self.do_mapping, elems)
    return changed_inputs
What I am trying to achieve in self.do_mapping is a dictionary lookup: it takes a key, looks it up in a dictionary (vmap), and returns the corresponding value. The dictionary vmap is built from the output of a layer by keeping only the distinct values found along the channel dimension, so each key is a tuple of 6 tf.Tensor objects (the size of the channel dimension) and each value is a running count that I maintain. This is how the dictionary is made:
value = list(self.get_values())
vmap = {}
cnt = 0
for v0 in value:
    for v1 in v0:
        for v2 in v1:
            for v3 in v2:
                v = tuple(v3)
                if v not in vmap:
                    vmap[v]=cnt
                    cnt+=1
The do_mapping function is:
@tf.function
def do_mapping(self,pixel):
    if self._compression :
        pixel = tuple(pixel)
        enumerated_value=self._vmap.get(pixel)
        print(enumerated_value)
        print(tf.shape(pixel))
        exit()
        return enumerated_value
If I now use tf.map_fn with elems referring to the channel dimension, I get the following error: ValueError: elements in elems must be 1+ dimensional Tensors, not scalars. Please help me understand how I can use tf.map_fn in my case. Thank you in advance.
First, instead of using for loops (best avoided for efficiency), you can simply reshape like this:
elems = tf.reshape(inputs,-1)
Second, what do you want to do exactly? What do you mean by "it doesn't work"? What is the error message? What is self.do_mapping?
Best,
Keivan
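As a hedged illustration of the reshape idea, the sketch below uses NumPy rather than TensorFlow so that it runs without a model. It keeps the channel axis intact by reshaping to (-1, 6), so each element is a whole 6-vector rather than a scalar, then builds the lookup table and maps every pixel to its id. The small random input is a hypothetical stand-in for the (100, 24, 24, 6) tensor. In TensorFlow the corresponding reshape would be tf.reshape(inputs, (-1, 6)); since tf.map_fn iterates over the first axis of elems, each call would then receive a tensor of shape (6,).

```python
import numpy as np

# Hypothetical small input standing in for the (100, 24, 24, 6) tensor.
rng = np.random.default_rng(0)
inputs = rng.integers(0, 2, size=(2, 3, 3, 6))

# Flatten every axis except the channel axis: one row per pixel.
pixels = inputs.reshape(-1, inputs.shape[-1])

# Build the lookup table: each distinct 6-tuple gets the next integer id.
vmap = {}
for pixel in pixels:
    key = tuple(pixel.tolist())
    if key not in vmap:
        vmap[key] = len(vmap)

# Map every pixel to its id, then restore the (batch, height, width) shape.
ids = np.array([vmap[tuple(p.tolist())] for p in pixels])
ids = ids.reshape(inputs.shape[:-1])
```

Converting each pixel to a tuple of plain Python integers makes the keys hashable by value; tensor objects used directly as dictionary keys are generally compared by identity, not by content.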

What exactly to test (unittest) in a larger function containing several dataframe manipulations

Perhaps this is a limitation of my understanding of unit tests, but I get quite confused about what should be tested, patched, etc. in a method that has several pandas DataFrame manipulations. Many of the unittest examples out there focus on classes and methods that are typically small. For larger methods, I get a bit lost in the typical unittest paradigm. For example:
myscript.py
class Pivot:
    def prepare_dfs(self):
        df = pd.read_csv(self.file, sep=self.delimiter)
        g = df.groupby("Other_Location")
        df1 = g.apply(lambda x: x[x["PRN"] == "Free"].count())
        locations = ["O12-03-01", "O12-03-02"]
        cp = df1["PRN"]
        cp = cp[locations].tolist()
        data = [locations, cp]
        new_df = pd.DataFrame({"Other_Location": data[0], "Free": data[1]})
        return new_df, df
test_myscript.py
class TestPivot(unittest.TestCase):
    def setUp(self):
        args = parse_args(["-f", "test1", "-d", ","])
        self.pivot = Pivot(args)
        self.pivot.path = "Pivot/path"

    @mock.patch("myscript.cp[locations].tolist()", return_value=None)
    @mock.patch("myscript.pd.read_csv", return_value=df)
    def test_prepare_dfs_1(self, mock_read_csv, mock_cp):
        new_df, df = self.pivot.prepare_dfs()
        # Here I get a bit lost
For example, here I am trying to get around the following error message:
ModuleNotFoundError: No module named 'myscript.cp[locations]'; 'myscript' is not a package
I managed to correctly mock pd.read_csv in my method, but further down in the code there are groupby, apply, tolist, etc. The error is thrown at the following line:
cp = cp[locations].tolist()
What is the best way to approach unit testing when your method involves several manipulations of a DataFrame? Is refactoring the code into smaller chunks always advised? In this case, how can I correctly mock tolist?
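One possible approach, sketched below under stated assumptions, is to patch only the I/O boundary (read_csv) and let the real pandas operations run on a small hand-built DataFrame, then assert on the result. The target string given to mock.patch has to be an importable dotted path to a name, which is why an expression on a local variable such as cp[locations].tolist() cannot be patched. The count_free helper, the free-function form of prepare_dfs, and the sample data are hypothetical refactorings for illustration, not the original class.

```python
import unittest
from unittest import mock

import pandas as pd


def count_free(df, locations):
    """Pure function: count rows whose PRN is "Free" for each requested location."""
    free = df[df["PRN"] == "Free"]
    counts = free.groupby("Other_Location").size()
    counts = counts.reindex(locations, fill_value=0)
    return pd.DataFrame({"Other_Location": locations, "Free": counts.tolist()})


def prepare_dfs(path, delimiter=","):
    """Thin I/O wrapper: the only place that reads from disk."""
    df = pd.read_csv(path, sep=delimiter)
    return count_free(df, ["O12-03-01", "O12-03-02"]), df


class TestPrepareDfs(unittest.TestCase):
    def setUp(self):
        # Small hand-built frame replacing the CSV file.
        self.fake = pd.DataFrame({
            "Other_Location": ["O12-03-01", "O12-03-01", "O12-03-02"],
            "PRN": ["Free", "Taken", "Free"],
        })

    def test_count_free(self):
        # No mocking needed: the pure function is tested on real data.
        out = count_free(self.fake, ["O12-03-01", "O12-03-02"])
        self.assertEqual(out["Free"].tolist(), [1, 1])

    def test_prepare_dfs_patches_only_read_csv(self):
        # Patch the file read; groupby, reindex and tolist run for real.
        with mock.patch("pandas.read_csv", return_value=self.fake) as fake_read:
            new_df, df = prepare_dfs("ignored.csv")
        fake_read.assert_called_once_with("ignored.csv", sep=",")
        self.assertEqual(list(new_df.columns), ["Other_Location", "Free"])
        self.assertIs(df, self.fake)
```

Saved as a test module, this can be run with python -m unittest. The design choice is that groupby, apply and tolist are the behavior under test, so they are exercised rather than mocked.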

Problems appending rows to DataFrame. ZMQ messages to Pandas Dataframe

I am taking market data messages from a ZMQ subscription and turning them into a pandas DataFrame.
I tried creating an empty DataFrame and appending rows to it, but it did not work. I keep getting this error:
RuntimeWarning: '<' not supported between instances of 'str' and 'int', sort
order is undefined for incomparable objects
result = result.union(other)
I'm guessing this is because I'm appending a list of strings to a DataFrame. I clear the list and then try to append the next row. Each row has 9 fields: the first is a string and the other 8 are all floats.
list_heartbeat = []
list_fills= []
market_data_bb = []
market_data_fs = []
abacus_fs = []
abacus_bb =[]
df_bar_data_bb = pd.DataFrame(columns= ['Ticker','Start_Time_Intervl','Interval_Length','Current_Open_Price',
                                        'Previous_Open','Previous_Low','Previous_High','Previous_Close','Message_ID'])

def main():
    context = zmq.Context()
    socket_sub1 = context.socket(zmq.SUB)
    socket_sub2 = context.socket(zmq.SUB)
    socket_sub3 = context.socket(zmq.SUB)
    print('Opening Socket...')
    # We can connect to several endpoints if we desire, and receive from all.
    print('Connecting to Nicks BroadCast...')
    socket_sub1.connect("Server:port")
    socket_sub2.connect("Server:port")
    socket_sub3.connect("Server:port")
    print('Connected To Nicks BroadCast... Waiting For Messages.')
    print('Connected To Jasons Two BroadCasts... Waiting for Messages.')
    #socket_sub1.setsockopt_string(zmq.SUBSCRIBE, 'H')
    socket_sub1.setsockopt_string(zmq.SUBSCRIBE, 'R')
    #socket_sub1.setsockopt_string(zmq.SUBSCRIBE, 'HEARTBEAT') #possible heartbeat from Jason
    socket_sub2.setsockopt_string(zmq.SUBSCRIBE, 'BAR_FS')
    socket_sub2.setsockopt_string(zmq.SUBSCRIBE, 'HEARTBEAT')
    socket_sub2.setsockopt_string(zmq.SUBSCRIBE, 'BAR_BB')
    socket_sub3.setsockopt_string(zmq.SUBSCRIBE, 'ABA_FS')
    socket_sub3.setsockopt_string(zmq.SUBSCRIBE, 'ABA_BB')
    poller = zmq.Poller()
    poller.register(socket_sub1, zmq.POLLIN)
    poller.register(socket_sub2, zmq.POLLIN)
    poller.register(socket_sub3, zmq.POLLIN)
    while (running):
        try:
            socks = dict(poller.poll())
        except KeyboardInterrupt:
            break
        #check if the message is in socks, if so then save to message1-3 for future use.
        #Msg1 = heartbeat for Nicks server
        #Msg2 = fills
        #msg3 Mrkt Data split between FS and BB
        #msg4
        if socket_sub1 in socks:
            message1 = socket_sub1.recv_string()
            list_heartbeat.append(message1.split())
        if socket_sub2 in socks:
            message2 = socket_sub2.recv_string()
            message3 = socket_sub2.recv_string()
            if message2 == 'HEARTBEAT':
                print(message2)
                print(message3)
            if message2 == 'BAR_BB':
                message3_split = message3.split(";")
                message3_split = [e[3:] for e in message3_split]
                #print(message3_split)
                message3_split = message3_split
                market_data_bb.append(message3_split)
                if len(market_data_bb) > 20:
                    #df_bar_data_bb = pd.DataFrame(market_data_bb, columns= ['Ticker','Start_Time_Intervl','Interval_Length','Current_Open_Price',
                    #    'Previous_Open','Previous_Low','Previous_High','Previous_Close','Message_ID'])
                    #df_bar_data_bb.set_index('Start_Time_Intervl', inplace=True)
                    #ESA = df_bar_data_bb[df_bar_data_bb['Ticker'] == 'ESA Index'].copy()
                    #print(ESA)
                    #df_bar_data_bb.set_index('Start_Time_Intervl', inplace=True)
                    df_bar_data_bb.append(market_data_bb)
                    market_data_bb.clear()
                    print(df_bar_data_bb)
The very bottom is what throws the error. I found a simple workaround that may or may not work: it's the commented-out lines above, which create a DataFrame, set the index, and try to create copies of the DataFrame. The only problem is that I get anywhere from 40 to 90 messages a second, and every time I get a new one it creates a new DataFrame. I eventually have to create a graph out of this, and I'm not exactly sure how I would create a live graph from it. But that's another problem.
EDIT: I figured it out. Instead of adding the messages to a list, I simply convert each message to a pandas Series, declare my DataFrame as global, and then do df=df.append(message4,ignore_index=True).
I completely removed the need for lists:
if message2 == 'BAR_BB':
    message3_split = message3.split(";")
    message3_split = [e[3:] for e in message3_split]
    message4 = pd.Series(message3_split)
    global df_bar_data_bb1
    df_bar_data_bb1 = df_bar_data_bb1.append(message4, ignore_index = True)
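For readers on current pandas, note that DataFrame.append always returned a new frame rather than modifying the original, and it was removed in pandas 2.0. The sketch below shows one alternative under stated assumptions: rows are collected in a plain list and added in batches with pd.concat, with the column names passed explicitly so that list rows do not end up under integer column labels. The parse_bar and flush helpers, the message format with a 3-character prefix per field, and the batch size of 2 are hypothetical choices for illustration.

```python
import pandas as pd

COLUMNS = ['Ticker', 'Start_Time_Intervl', 'Interval_Length', 'Current_Open_Price',
           'Previous_Open', 'Previous_Low', 'Previous_High', 'Previous_Close', 'Message_ID']


def parse_bar(message):
    """Split one ';'-separated message and drop each field's 3-character prefix."""
    return [field[3:] for field in message.split(";")]


def flush(df, pending):
    """Add all pending rows with a single concat and convert the 8 numeric columns."""
    if not pending:
        return df
    batch = pd.DataFrame(pending, columns=COLUMNS)
    batch[COLUMNS[1:]] = batch[COLUMNS[1:]].apply(pd.to_numeric, errors="coerce")
    # Skip the concat while the accumulated frame is still empty.
    return batch if df.empty else pd.concat([df, batch], ignore_index=True)


# Hypothetical messages: every field carries a 3-character prefix such as "01=".
messages = [
    "01=ESA Index;02=93000;03=60;04=4100.5;05=4099.0;06=4098.5;07=4101.0;08=4100.0;09=1",
    "01=NQA Index;02=93000;03=60;04=14000.0;05=13990.0;06=13980.0;07=14010.0;08=14000.0;09=2",
]

df_bar_data_bb = pd.DataFrame(columns=COLUMNS)
pending = []
for message in messages:
    pending.append(parse_bar(message))
    if len(pending) >= 2:  # the question flushes once more than 20 rows are waiting
        df_bar_data_bb = flush(df_bar_data_bb, pending)
        pending.clear()
```

Batching keeps the number of DataFrame copies low at 40 to 90 messages per second, since each concat copies the accumulated data.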

Changing label name when retraining Inception on Google Cloud ML

I am currently following this tutorial to retrain Inception for image classification:
https://cloud.google.com/blog/big-data/2016/12/how-to-train-and-classify-images-using-google-cloud-machine-learning-and-cloud-dataflow
However, when I make a prediction with the API, I only get the index of my class as a label. I would like the API to return a string with the actual class name instead. For example, instead of
predictions:
- key: '0'
  prediction: 4
  scores:
  - 8.11998e-09
  - 2.64907e-08
  - 1.10307e-06
I would like to get:
predictions:
- key: '0'
  prediction: ROSES
  scores:
  - 8.11998e-09
  - 2.64907e-08
  - 1.10307e-06
Looking at the reference for the Google API it should be possible:
https://cloud.google.com/ml-engine/reference/rest/v1/projects/predict
I already tried changing the following in model.py:
outputs = {
    'key': keys.name,
    'prediction': tensors.predictions[0].name,
    'scores': tensors.predictions[1].name
}
tf.add_to_collection('outputs', json.dumps(outputs))
to
if tensors.predictions[0].name == 0:
    pred_name ='roses'
elif tensors.predictions[0].name == 1:
    pred_name ='tulips'
outputs = {
    'key': keys.name,
    'prediction': pred_name,
    'scores': tensors.predictions[1].name
}
tf.add_to_collection('outputs', json.dumps(outputs))
but this doesn't work.
My next idea was to change this part of the preprocess.py file, so that the string label is used instead of the index:
def process(self, row, all_labels):
    try:
        row = row.element
    except AttributeError:
        pass
    if not self.label_to_id_map:
        for i, label in enumerate(all_labels):
            label = label.strip()
            if label:
                self.label_to_id_map[label] = label #i
and
label_ids = []
for label in row[1:]:
    try:
        label_ids.append(label.strip())
        #label_ids.append(self.label_to_id_map[label.strip()])
    except KeyError:
        unknown_label.inc()
but this gives the error:
TypeError: 'roses' has type <type 'str'>, but expected one of: (<type 'int'>, <type 'long'>) [while running 'Embed and make TFExample']
Hence I thought that I should change something here in preprocess.py in order to allow strings:
example = tf.train.Example(features=tf.train.Features(feature={
    'image_uri': _bytes_feature([uri]),
    'embedding': _float_feature(embedding.ravel().tolist()),
}))
if label_ids:
    label_ids.sort()
    example.features.feature['label'].int64_list.value.extend(label_ids)
But I don't know how to change it appropriately, as I could not find something like str_list. Could anyone please help me out here?
Online prediction certainly allows this, but the model itself needs to be updated to do the conversion from int to string.
Keep in mind that the Python code is just building a graph which describes what computation to do in your model -- you're not sending the Python code to online prediction, you're sending the graph you build.
That distinction is important because the changes you have made are in Python -- you don't yet have any inputs or predictions, so you won't be able to inspect their values. What you need to do instead is add the equivalent lookups to the graph that you're exporting.
You could modify the code like so:
labels = tf.constant(['cars', 'trucks', 'suvs'])
predicted_indices = tf.argmax(softmax, 1)
prediction = tf.gather(labels, predicted_indices)
And leave the inputs/outputs untouched from the original code.
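To make the lookup concrete, here is a small NumPy analogue of the argmax-then-gather step that can be run without building a graph. It is only an illustration of the indexing logic, not the exported TensorFlow model; the label names and score values are hypothetical, and the labels must be listed in the same index order used during training.

```python
import numpy as np

# Hypothetical class names, in training index order.
labels = np.array(["roses", "tulips", "daisies"])

# Hypothetical softmax scores for two images.
softmax = np.array([
    [0.1, 0.7, 0.2],
    [0.8, 0.1, 0.1],
])

# Index of the highest score per row, analogous to tf.argmax(softmax, 1).
predicted_indices = np.argmax(softmax, axis=1)

# Look up the label for each index, analogous to tf.gather(labels, predicted_indices).
prediction = labels[predicted_indices]
```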