How can I get to the stream of the composite MinibatchSource? - cntk

If I create a MinibatchSource like this:
reader_test = MinibatchSource(ImageDeserializer('test_map.txt', StreamDefs(
    features = StreamDef(field='image', transforms=transforms),  # first column in map file is referred to as 'image'
    labels = StreamDef(field='label', shape=num_classes)         # and second as 'label'
)))
then I can get to the features stream like this:
reader_test.streams.features
But if I create the MinibatchSource like this:
image_source = ImageDeserializer('test_map.txt', StreamDefs(
    features = StreamDef(field='image', transforms=transforms),  # first column in map file is referred to as 'image'
    labels = StreamDef(field='label', shape=num_classes)         # and second as 'label'
))
text_source = CTFDeserializer("test_map2.txt")
text_source.map_input('index', dim=1, format="dense")
text_source.map_input('piece_type', dim=6, format="dense")
# define a composite reader
reader_config = ReaderConfig([image_source, text_source])
mb_source = reader_config.minibatch_source()
Trying this:
mb_source.streams.features
results in:
AttributeError: 'MinibatchSource' object has no attribute 'streams'
How can I get to the features stream?

This was fixed a long time ago.
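For reference, a minimal sketch of how this looks on a recent CNTK 2.x build (assuming the same image_source and text_source definitions as in the question; the composite source is built directly from the list of deserializers, without ReaderConfig):
from cntk.io import MinibatchSource

# composite source built straight from the deserializer list
mb_source = MinibatchSource([image_source, text_source])

# stream infos are reachable by the names used in the deserializers
features_si = mb_source.streams.features   # from the ImageDeserializer
labels_si = mb_source.streams.labels
index_si = mb_source.streams.index         # from the CTFDeserializer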

Related

What is the most efficient way of creating a tf.dataset from multiple json.gz files with multiple text records?

I have thousands of json.gz files, each with a variety of information about scientific papers. For each file, I have to extract the relevant information - e.g. title and labels - to build a dataset, then transform it into a tf.dataset. However, this is quite inefficient, since I cannot filter the subjects directly or shuffle them in a single step.
I would like to read them using tf.dataset.interleave in order to shuffle them, but also to filter them according to specific labels.
Here is how I'm doing it so far.
import gzip
import json
import datetime

import tensorflow as tf
import pandas as pd

# For relevant feature extraction
def load_file(file):
    # with gzip.open(bytes.decode(file), 'r') as fin:  # 4. gzip
    with gzip.open(file, 'r') as fin:
        json_bytes = fin.read()
    json_str = json_bytes.decode('utf-8')  # 2. string (i.e. JSON)
    bb = json.loads(json_str)
    bb = pd.json_normalize(bb, 'items',
                           ['indexed', ['title', 'publisher', 'type', 'indexed.date-parts', 'subject']],
                           errors='ignore')
    bb.dropna(subset=['title', 'publisher', 'type', 'indexed.date-parts', 'subject'], inplace=True)
    bb.subject = bb.subject.apply(
        lambda x: int(themes[list(set(x) & set(list(themes.keys())))[0]])
        if len(list(set(x) & set(list(themes.keys())))) > 0
        else len(list(themes.keys())) + 1)
    bb.title = bb.title.str.join('').values
    # bb['author'] = bb['author'].apply(lambda x: '; '.join([', '.join([i['given'], i['family']]) for i in x]))
    bb['indexed.date-parts'] = bb['indexed.date-parts'].apply(
        lambda tpl: datetime.datetime.strptime('-'.join(str(x) for x in tpl[0]), '%Y-%m-%d').strftime('%Y-%m-%d'))
    # bb = bb.sample(n=32, replace=True)
    # return bb.title.str.join('').values, bb.subject.str.join(', ').values
    return dict(bb[['title', 'publisher', 'type', 'indexed.date-parts', 'subject']])
file_list = ['file_2021_01/10625.json.gz',
             'file_2021_01/23897.json.gz',
             'file_2021_01/12169.json.gz',
             'file_2021_01/427.json.gz', ...]

filenames = tf.data.Dataset.list_files(file_list, shuffle=True)

dataset = filenames.apply(
    tf.data.experimental.parallel_interleave(
        lambda x: tf.data.Dataset.from_tensor_slices(tf.numpy_function(load_file, [x], (tf.int64))),
        cycle_length=1))
However, it results in an error:
InternalError: Unsupported object type dict
[[{{node PyFunc}}]] [Op:IteratorGetNext]
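For what it's worth, the "Unsupported object type dict" error comes from tf.numpy_function, which can only return numpy arrays matching the dtypes declared in the call, not a Python dict. A rough sketch of one way around this (the load_file_arrays and make_examples helpers below are hypothetical illustrations: plain arrays are returned from the numpy function and re-packed into a feature dict on the TensorFlow side):
import numpy as np

def load_file_arrays(file):
    bb = load_file(file)  # reuse the dict of columns built above
    return bb['title'].to_numpy(), bb['subject'].to_numpy('int64')

def make_examples(path):
    titles, subjects = tf.numpy_function(
        load_file_arrays, [path], (tf.string, tf.int64))
    # numpy_function loses static shape information, so restore the rank explicitly
    titles.set_shape([None])
    subjects.set_shape([None])
    return tf.data.Dataset.from_tensor_slices({'title': titles, 'subject': subjects})

dataset = filenames.interleave(make_examples, cycle_length=4)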
Thanks

Is there a method for converting a winmids object to a mids object?

Suppose I create 10 multiply-imputed datasets and use the (wonderful) MatchThem package in R to create weights for my exposure variable. The MatchThem package takes a mids object and converts it to an object of the class winmids.
My desired output is a mids object - but with weights. I hope to pass this mids object to BRMS as follows:
library(brms)
m0 <- brm_multiple(Y|weights(weights) ~ A, data = mids_data)
Open to suggestions.
EDIT: Noah's solution below will unfortunately not work.
The package's first author, Farhad Pishgar, sent me the following elegant solution. It creates a mids object from a winmids object. Thank you, Farhad!
library(mice)
library(MatchThem)
#"weighted.dataset" is our .wimids object
#Extracting the original dataset with missing value
maindataset <- complete(weighted.datasets, action = 0)
#Some spit-and-polish
maindataset <- data.frame(.imp = 0, .id = seq_len(nrow(maindataset)), maindataset)
#Extracting imputed-weighted datasets in the long format
alldataset <- complete(weighted.datasets, action = "long")
#Binding them together
alldataset <- rbind(maindataset, alldataset)
#Converting to .mids
newmids <- as.mids(alldataset)
Additionally, for brms, I worked out this solution, which instead creates a list of data frames. It works in fewer steps.
library("mice")
library("dplyr")
library("MatchThem")
library("brms") # for bayesian estimation.
# Note, I realise that my approach here is not fully Bayesian, but that is a good thing! I need to ensure balance in the exposure.
# impute missing data
data("nhanes2")
imp <- mice(nhanes2, printFlag = FALSE, seed = 0, m = 10)
# MatchThem. This is just a fast method
w_imp <- weightthem(hyp ~ chl + age, data = imp,
                    approach = "within",
                    estimand = "ATE",
                    method = "ps")
# get individual data frames with weights
out <- complete(w_imp, action ="long", include = FALSE, mild = TRUE)
# assemble individual data frames into a list
m <- 10
listdat <- list()
for (i in 1:m) {
  listdat[[i]] <- as.data.frame(out[[i]])
}
# pass the list to brms, and it runs as it should!
fit_1 <- brm_multiple(bmi|weights(weights) ~ age + hyp + chl,
                      data = listdat,
                      backend = "cmdstanr",
                      family = "gaussian",
                      set_prior('normal(0, 1)', class = 'b'))
brm_multiple() can take in a list of data frames for its data argument. You can produce this from the wimids object using complete(). The output of complete() with action = "all" is a mild object, which is a list of data frames, but this is not recognized by brm_multiple() as such. So, you can just convert it to a list. This should look like the following:
df_list <- complete(mids_data, "all")
class(df_list) <- "list"
m0 <- brm_multiple(Y|weights(weights) ~ A, data = df_list)
Using complete() automatically adds a weights column to the resulting imputed data frames.

TypeError: 'Value' object is not iterable: iterating over a DataFrame for prediction with a GCP Natural Language model

I'm trying to iterate over a dataframe in order to apply a predict function, which calls a Natural Language model hosted on GCP. Here is the loop code:
model = 'XXXXXXXXXXXXXXXX'
barometre_df_processed = barometre_df
barometre_df_processed['theme'] = ''
barometre_df_processed['proba'] = ''

print('DEBUT BOUCLE FOR')
for ind in barometre_df.index:
    if barometre_df.verbatim[ind] is np.nan:
        barometre_df_processed.theme[ind] = "RAS"
        barometre_df_processed.proba[ind] = "1"
    else:
        print(barometre_df.verbatim[ind])
        print(type(barometre_df.verbatim[ind]))
        res = get_prediction(file_path={'text_snippet': {'content': barometre_df.verbatim[ind]},
                                        'mime_type': 'text/plain'},
                             model_name=model)
        print(res)
        theme = res['displayNames']
        proba = res["classification"]["score"]
        barometre_df_processed.theme[ind] = theme
        barometre_df_processed.proba[ind] = proba
and the get_prediction function that I took from the Natural Language AI documentation:
def get_prediction(file_path, model_name):
    options = ClientOptions(api_endpoint='eu-automl.googleapis.com:443')
    prediction_client = automl_v1.PredictionServiceClient(client_options=options)
    payload = file_path
    # Uncomment the following line (and comment the above line) if want to predict on PDFs.
    # payload = pdf_payload(file_path)
    parameters_dict = {}
    params = json_format.ParseDict(parameters_dict, Value())
    request = prediction_client.predict(name=model_name, payload=payload, params=params)
    print("fonction prediction")
    print(request)
    return (resultat[0]["displayName"], resultat[0]["classification"]["score"],
            resultat[1]["displayName"], resultat[1]["classification"]["score"],
            resultat[2]["displayName"], resultat[2]["classification"]["score"])
I'm looping this way because I want each [displayName, score] pair to create a new row in my final dataframe, so I get something like this:
verbatim1, theme1, proba1
verbatim1, theme2, proba2
verbatim1, theme3, proba3
verbatim2, theme1, proba1
verbatim2, theme2, proba2
...
The if barometre_df.verbatim[ind] is np.nan check is not causing problems; I just use it to deal with NaNs, so don't worry about it.
The error that I get is this one:
TypeError: 'Value' object is not iterable
I guess the issue is with
res = get_prediction(file_path={'text_snippet': {'content': barometre_df.verbatim[ind]} },model_name=model)
but I can't figure out what's going wrong here.
I already tried removing
,'mime_type': 'text/plain'}
from my get_prediction parameters, but it doesn't change anything.
Does anyone know how to deal with this issue?
Thank you in advance.
I think you are not iterating correctly.
The way to iterate through a dataframe is:
for index, row in df.iterrows():
    print(row['col1'])
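As a rough sketch of how that could look for the dataframe in the question (hypothetical, reusing model and get_prediction from above; it assumes get_prediction is changed to return the list of (displayName, score) pairs taken from the response payload, e.g. [(r.display_name, r.classification.score) for r in response.payload]):
import pandas as pd

rows = []
for index, row in barometre_df.iterrows():
    verbatim = row['verbatim']
    if pd.isna(verbatim):
        rows.append({'verbatim': verbatim, 'theme': 'RAS', 'proba': '1'})
        continue
    predictions = get_prediction(
        file_path={'text_snippet': {'content': verbatim}},
        model_name=model)
    # one output row per (displayName, score) pair, as asked for in the question
    for name, score in predictions:
        rows.append({'verbatim': verbatim, 'theme': name, 'proba': score})

barometre_df_processed = pd.DataFrame(rows)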

preprocess images with tf.data.experimental.make_csv_dataset or with read_csv option

I am adding this summary of my issue to make it easier to understand:
I want to do exactly what is done in the following TensorFlow example:
https://www.tensorflow.org/guide/datasets
# Reads an image from a file, decodes it into a dense tensor, and resizes it
# to a fixed shape.
def _parse_function(filename, label):
    image_string = tf.read_file(filename)
    image_decoded = tf.image.decode_jpeg(image_string)
    image_resized = tf.image.resize_images(image_decoded, [28, 28])
    return image_resized, label

# A vector of filenames.
filenames = tf.constant(["/var/data/image1.jpg", "/var/data/image2.jpg", ...])

# `labels[i]` is the label for the image in `filenames[i]`.
labels = tf.constant([0, 37, ...])

dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.map(_parse_function)
The only difference is that I read the data from a CSV that has many more features, and then I call the map method:
dataset = tf.data.experimental.make_csv_dataset(file_pattern=CSV_PATH_TRAIN,
                                                batch_size=2,
                                                header=True,
                                                label_name='label').map(_parse_function)
What does my _parse_function need to look like? How do I access the image path feature, update it to an image representation, and return the modified numeric image matrix without changing any of the other features?
thanks,
eilalan
================== Here are my code attempts: ==================
My code reads a CSV with feature columns and a label. One of the features is an image path; the others are strings.
The image path needs to be processed into an image matrix of numbers.
I have tried doing so with the following two options. In both cases tf.read_file fails with an input dimension error.
My question is how to pass one image at a time into the map methods.
def read_image_png_option_1(image_path, depth=3, scale=False):
    """Reads the image from image_path (tf.string tensor) [jpg image].
    Casts the result to float32 and, if scale=True, scales it to [-1, 1]
    using scale_image. Otherwise the values are in [0, 1].
    Return:
        the decoded jpeg image, cast to float32
    """
    image = tf.image.convert_image_dtype(
        tf.image.decode_png(tf.read_file(image_path), channels=depth),
        dtype=tf.float32)
    if scale:
        image = scale_image(image)
    return image

def read_image_png_option_2(features, depth=3, scale=False):
    """Reads the image from features['image'] (tf.string tensor) [jpg image].
    Casts the result to float32 and, if scale=True, scales it to [-1, 1]
    using scale_image. Otherwise the values are in [0, 1].
    Return:
        the features dict, with the decoded jpeg image cast to float32
    """
    image = tf.image.convert_image_dtype(
        tf.image.decode_png(tf.read_file(features['image']), channels=depth),
        dtype=tf.float32)
    if scale:
        image = scale_image(image)
    features['image'] = image
    return features
def make_input_fn(fileName, batch_size=8, perform_shuffle=True):
    """An input function for training."""
    def _input_fn():
        def decode_csv(line):
            print('line is ', line)
            filename_col, label_col, gender_col, ethinicity = tf.decode_csv(
                line,
                [[""]] * amount_of_columns_csv,
                field_delim=",",
                na_value='NA',
                select_cols=None)
            image_col = read_image_png_option_1(filename_col)
            d = dict(zip(['image', 'label', 'gender', 'ethinicity'],
                         [image_col, label_col, gender_col, ethinicity])), label
            return d

        ## OPTION 1:
        # filenames could be more than one
        # dataset = tf.data.TextLineDataset(filenames=fileName).skip(1).batch(batch_size).map(decode_csv)

        ## OPTION 2:
        dataset = tf.data.experimental.make_csv_dataset(file_pattern=CSV_PATH_TRAIN,
                                                        batch_size=2,
                                                        header=True,
                                                        label_name='label').map(read_image_png_option_2)
        # select_columns=[0,1])  # [tf.string, tf.string, tf.string, tf.string])

        if perform_shuffle:
            dataset = dataset.shuffle(buffer_size=256)
        return dataset
    return _input_fn()
train_input_fn = lambda: make_input_fn(CSV_PATH_TRAIN)
train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=50)
eval_input_fn = lambda: make_input_fn(CSV_PATH_VAL)
eval_spec = tf.estimator.EvalSpec(eval_input_fn)

feature_columns = [tf.feature_column.numeric_column("image", shape=(224, 224)),  # here I need a python method to transform
                   tf.feature_column.categorical_column_with_vocabulary_list("gender", ["ww", "ee"]),
                   tf.feature_column.categorical_column_with_vocabulary_list("ethinicity", ["xx", "yy"])]

estimator = tf.estimator.DNNClassifier(feature_columns=feature_columns, hidden_units=[1024, 512, 256], warm_start_from=ws)

tf.estimator.train_and_evaluate(estimator, train_spec=train_spec, eval_spec=eval_spec)
Error for option 2:
ValueError: Shape must be rank 0 but is rank 1 for 'ReadFile' (op: 'ReadFile') with input shapes: [2].
Error for option 1:
ValueError: Shape must be rank 0 but is rank 1 for 'ReadFile' (op: 'ReadFile') with input shapes: [?].
Any help is appreciated.
Thanks
First you need to read the CSV file into a dataset.
Then, for each row in your CSV, you can call your parse function.
def getInput(fileList):
    # returns a dataset containing a list of filenames
    files = tf.data.Dataset.from_tensor_slices(fileList)

    # Returns a dataset containing the rows taken from all the files in the file list.
    # The dataset is filled dynamically and not all entries are read at once.
    dataset = files.interleave(tf.data.TextLineDataset)

    # Call the parse function for each row.
    # The returned dataset will contain whatever the parse function returns for each row;
    # we want the image path to be converted to a decoded image in the parse function.
    dataset = dataset.map(_parse_function, num_parallel_calls=8)

    # Return an iterator for the dataset which will be used to get elements.
    return dataset.make_one_shot_iterator().get_next()
The parse function will be passed only one parameter, which will be a single row from the CSV file. You need to decode the CSV and do further processing on each value.
Let's say you have 3 columns in your CSV, each being a string.
def _parse_function(value):
    columns_default = [[""], [""], [""]]
    # this will be a tensor of the columns in the row
    columns = tf.decode_csv(value, record_defaults=columns_default,
                            field_delim=',')
    col_names = ["label", "imagepath", "c3"]
    features = dict(zip(col_names, columns))

    for f, tensor in features.items():
        # process imagepath into a decoded image
        if f == "imagepath":
            image_string = tf.read_file(tensor)
            image_decoded = tf.image.decode_jpeg(image_string)
            image_resized = tf.image.resize_images(image_decoded, [28, 28])
            features[f] = image_resized

    labels = tf.equal(features.pop('label'), "1")
    labels = tf.expand_dims(labels, 0)

    return features, labels
Edit:
Explanation for the comment:
A Dataset object simply contains a list of elements. The elements can be tensors, or a tuple of tensors, etc. A Tensor object can contain anything: it could represent a single feature, a single record, or a batch of records. Further, the Dataset API provides handy methods to manipulate the elements within.
If you are using a dataset with another API like Estimator, then that API expects the dataset elements to be in a specific format, which is what we need to return from our input function; see for example:
https://www.tensorflow.org/api_docs/python/tf/estimator/Estimator#train
I have edited my code block above to describe what the dataset object will contain at each step.
From what I understand, you have an image path as one of the fields in your CSV, and you want to convert that path into an actual decoded image which you will use as one of the features.
Since the image is going to be just one of the features, you should not try to create a dataset using the image files alone. The Dataset object will include all your features at once.
So doing this would be incorrect:
files = tf.data.Dataset.from_tensor_slices(ds['imagepath'])
dataset = files.interleave(tf.data.TextLineDataset)
If you are using the make_csv_dataset() function to read your CSV, then it will convert each row of your CSV into one record, where one record contains a list of all the features, the same as the columns of the CSV.
So each element in the returned dataset contains all your features packed together.
Here your image path will be one of the features; now you want to transform that image path into a decoded image.
I suppose you can do it by applying a parse function to the elements of the dataset using the map() function, but it will be slightly tricky, as all your features are already packed together inside a single element, as shown in the sketch below.
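A rough sketch of what such a mapped function could look like (my own illustration, not part of the original answer; it assumes make_csv_dataset is called with batch_size=2 and label_name='label' as in the question, so the 'image' column arrives as a batch of path strings and each path has to be decoded separately, e.g. with tf.map_fn):
def decode_image_batch(features, labels):
    # 'image' is a batch of file-path strings with shape [batch_size],
    # so decode each path on its own and stack the results back together.
    def _decode_one(path):
        image = tf.image.decode_png(tf.read_file(path), channels=3)
        image = tf.image.convert_image_dtype(image, tf.float32)
        return tf.image.resize_images(image, [224, 224])

    features['image'] = tf.map_fn(_decode_one, features['image'], dtype=tf.float32)
    return features, labels

dataset = tf.data.experimental.make_csv_dataset(file_pattern=CSV_PATH_TRAIN,
                                                batch_size=2,
                                                header=True,
                                                label_name='label').map(decode_image_batch)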

Tensorflow Dataset API how to order list_files?

I am using the Dataset API list_files in order to get a list of files in a source directory and target directory, something like:
source_path = '/tmp/data/source/*.ext1'
target_path = '/tmp/data/target/*.ext2'
source_dataset = tf.data.Dataset.list_files(source_path)
target_dataset = tf.data.Dataset.list_files(target_path)
dataset = tf.data.Dataset.zip((source_dataset, target_dataset))
The source and target dir contents have the same sequential filenames but different extensions (e.g. source 0001.ext1 <-> target 0001.ext2).
But since list_files is not ordered in any way, the zipped dataset contains mismatches between the source and the target.
How can I solve this within the new Dataset API?
The default behavior of this method is to return filenames in a non-deterministic random shuffled order. Pass a seed or shuffle=False to get results in a deterministic order.
source_dataset = tf.data.Dataset.list_files(source_path, shuffle=False)
or
val = 5
source_dataset = tf.data.Dataset.list_files(source_path, seed = val)
target_dataset = tf.data.Dataset.list_files(target_path, seed = val)
I had the same issue and I solved it by sorting the file paths first.
My files are named like in OP's case:
input image -> corresponding output
data/mband/01.tif -> data/gt_mband/01.tif
data/mband/02.tif -> data/gt_mband/02.tif
The code looks like this:
from pathlib import Path
import tensorflow as tf

DATA_PATH = Path("data")

# Sort the PATHS
img_paths = sorted(map(str, (DATA_PATH / 'mband').glob('*.tif')))
mask_paths = sorted(map(str, (DATA_PATH / 'gt_mband').glob('*.tif')))

# These are datasets of PATHS
# Paths are strings, so order will be preserved
img_paths = tf.data.Dataset.from_tensor_slices(img_paths)
mask_paths = tf.data.Dataset.from_tensor_slices(mask_paths)

# Load the actual images
def parse_image(image_path: 'some_tensor'):
    # Load the image somehow...
    return image_as_tensor

def parse_mask(mask_path: 'some_tensor'):
    # Load the mask somehow...
    return mask_as_tensor

imgs = img_paths.map(parse_image)
masks = mask_paths.map(parse_mask)
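To get back to the paired structure asked about in the question, the two datasets can then be zipped; since both path lists were sorted with the same key before being turned into datasets, the pairing stays consistent (a small sketch building on the code above):
# Pair each image with its corresponding mask: element i of imgs matches element i of masks
dataset = tf.data.Dataset.zip((imgs, masks))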