Split a TensorFlow dataset by class label

I have a tensorflow dataset that I created with the following function: dataset = tf.data.Dataset.from_tensor_slices((imgs, labels)). My labels are "0" and "1" and I would like to split this dataset into two, one with all entries that have a label "0" and one with all entries that have a label "1". I have been trying to find a way to do this using the .filter function but to no avail. Help would be appreciated!

Filter will work.
dataset_zero = dataset.filter(lambda image, label: label == 0)
dataset_one = dataset.filter(lambda image, label: label == 1)
Dataset's stream-based operations are annoying to work with, though. If your dataset is small, consider materializing it as a list/NumPy array using Dataset.as_numpy_iterator and just using fancy indexing to get your zeros and ones.
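For the small-dataset route, a minimal sketch (assuming everything fits in memory; the variable names are my own):
import numpy as np

# Materialize the dataset, then split with boolean (fancy) indexing.
images, labels = zip(*dataset.as_numpy_iterator())
images, labels = np.array(images), np.array(labels)
images_zero = images[labels == 0]
images_one = images[labels == 1]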

Related

Flipping the labels of a TF dataset

I want to create a malicious dataset for CIFAR-100 to test a Federated Learning Attack similar to this malicious dataset for EMNIST:
url_malicious_dataset = 'https://storage.googleapis.com/tff-experiments-public/targeted_attack/emnist_malicious/emnist_target.mat'
filename = 'emnist_target.mat'
path = tf.keras.utils.get_file(filename, url_malicious_dataset)
emnist_target_data = io.loadmat(path)
I tried the following to flip the label 0 to 4 in the extracted example dataset, but this method isn't working:
cifar_train, cifar_test = tff.simulation.datasets.cifar100.load_data(cache_dir=None)
example_dataset = cifar_train.create_tf_dataset_for_client(cifar_train.client_ids[0])
for example in example_dataset:
    if example['label'].numpy() == 0:
        example['label'] = tf.constant(4, dtype=tf.int64)
Any idea how to create a similar version of the malicious dataset for CIFAR-100 instead of EMNIST by correctly flipping labels?
In general, tf.data.Dataset objects can be modified using their .map method. So for example, a simple label flipping could be done as follows:
def flip_label(example):
    return {'image': example['image'], 'label': 99 - example['label']}
flipped_dataset = example_dataset.map(flip_label)
This reverses the labels 0-99. You could do something similar to send 0 to 4 and leave all other labels unchanged.
Note that if you'd like to apply this to all client datasets in cifar_train, you'd have to use the .preprocess method of tff.simulation.datasets.ClientData. That is, you could do something like cifar_train.preprocess(lambda x: x.map(flip_label)).
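For instance, a minimal sketch that sends 0 to 4, leaves every other label unchanged, and applies the change to all client datasets (the helper name flip_zero_to_four is my own):
def flip_zero_to_four(example):
    label = example['label']
    # Replace 0 with 4, keep every other label as-is.
    new_label = tf.where(tf.equal(label, 0), tf.constant(4, dtype=label.dtype), label)
    return {'image': example['image'], 'label': new_label}

poisoned_cifar_train = cifar_train.preprocess(lambda ds: ds.map(flip_zero_to_four))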

Does plotting with Koalas using TopN have any statistical meaning?

I was going through the source code of Koalas, trying to get a handle on how they actually achieve plotting large datasets. It turns out that they use either sampling or TopN - selecting a given number of records.
I understand the meaning of sampling, and internally it uses spark.DataFrame.sample to do it. For TopN, however, they simply take the first max_rows records from Koalas' DataFrame using data = data.head(max_rows + 1).to_pandas().
This seems strange, and I wonder whether selecting the data in this way correctly reflects the statistical properties of the dataset.
Koalas DataFrame's plot accessor:
class KoalasPlotAccessor(PandasObject):
    pandas_plot_data_map = {
        "pie": TopNPlotBase().get_top_n,
        "bar": TopNPlotBase().get_top_n,
        "barh": TopNPlotBase().get_top_n,
        "scatter": SampledPlotBase().get_sampled,
        "area": SampledPlotBase().get_sampled,
        "line": SampledPlotBase().get_sampled,
    }
    _backends = {}  # type: ignore
    ...

class TopNPlotBase:
    def get_top_n(self, data):
        from databricks.koalas import DataFrame, Series

        max_rows = get_option("plotting.max_rows")
        # Simply use the first 1k elements and make it into a pandas dataframe
        # For categorical variables, it is likely called from df.x.value_counts().plot.xxx().
        if isinstance(data, (Series, DataFrame)):
            data = data.head(max_rows + 1).to_pandas()
        ...

Issue with tf.data.experimental.rejection_resample?

I'm working on an image classification (multi-label) model and the dataset is imbalanced. I'm trying to balance the data by using the tf.data.experimental.rejection_resample method, but I keep getting the error below:
"ValueError:Shape must be rank 3 but is rank 2 for '{{node Tile}} = Tile[T=DT_FLOAT, Tmultiples=DT_INT32](ExpandDims, Tile/multiples)' with input shapes: [1,9,9], [2]."
Please find the blocks of code below:
1) Build a function to get the label from the dataset:
def class_func(features, label):
    return label
2) Create the resampler (the target_dist argument is set according to the number of labels, which is 9):
resampler = tf.data.experimental.rejection_resample(
    class_func, target_dist=[0.115, 0.109, 0.115, 0.109, 0.115, 0.109, 0.115, 0.109, 0.104])
3) Run the resampler by using the .apply function (the dataset has been unbatched as per the TensorFlow recommendation; see the reference link):
resample_train_dataset = train_dataset.unbatch().apply(resampler).batch(10)
Note: I tried to change the tensor dims using tf.expand_dims, but it didn't work. Any idea how to resolve this issue?
In addition to resampling, rejection_resample also transforms your dataset into a new one, where the first element is the class label used for resampling, and the second element is your original item.
So after ds.apply(resampler), if you want your data back in the original format, you might do:
ds.map(lambda extra_label, feat_and_label: feat_and_label)
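Putting the pieces together for this thread's setup, a sketch of the full pipeline (it assumes class_func returns a scalar integer label per unbatched example, which is what rejection_resample expects):
resample_train_dataset = (
    train_dataset
    .unbatch()
    .apply(resampler)
    # rejection_resample yields (class_label, original_element) pairs,
    # so map back to the original (features, label) format.
    .map(lambda extra_label, feat_and_label: feat_and_label)
    .batch(10))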

Pandas masked dataframe assign 2D array

I'm trying to solve a riddle with a dataframe and a masked array.
Context
I'm trying to make some methods to help me with some machine learning stuff. My goal is to start from a simple dataframe and add new information to it. I want these functions to be as simple as possible to use.
Problem
This is a function that takes what I call a model and can make a transform from the data. Then I want to set these new data on a new or existing column. I'm not using the apply function because, if I do, I lose the ability to extract several data points at once with parallelism (that's why I'm extracting first and then applying the 'features').
Here is the riddle: I want to work with a mask, because without this mask I cannot modify the original dataframe. If I remove this mask, and the dataframe and features have the same number of rows, everything is fine. Now, if I add this mask, it seems to be taken row by row and the dimensions mismatch...
EDIT: I forgot to mention that 'features' is a 2D numpy array.
Error
ValueError: Must have equal len keys and value when setting with an ndarray
Code
def transform(dataframe, tags, out, model, mask=None):
    # Check mandatory fields
    mandatory = ['datum']
    if not isinstance(tags, dict) or not all(elem in mandatory for elem in tags.keys()):
        raise Exception(f'Not a dict or missing tag: {mandatory}.')
    # Mask creation (see pandas view / copy mechanism)
    if mask is None:
        mask = [True] * len(dataframe.index)
    features = model.transform(dataframe.loc[mask, tags['datum']].to_numpy())
    dataframe.loc[mask, out] = features.tolist()
    return dataframe
Example
Data can be something like this:
Row ; Data ; Label
0 ; [61953.017837947686, 9.505037089204054, 74.585... ] ;0
1 ; [80832.69302693632, 9.524642547991316, 83.9228... ] ;1
A model could be a PCA from sklearn.
The method could be something like this:
transform(dataframe, {'datum': 'Data'}, 'PCA', PCA(), mask=dataframe[dataframe['Label']==1])
Then output would be:
Row ; Data ; Label ; PCA
0 ; [61953.017837947686, 9.505037089204054, 74.585... ] ;0 ; [74.585... ]
1 ; [80832.69302693632, 9.524642547991316, 83.9228... ] ;1 ; [92.578... ]
Current solution
mask = inputs[condition]
inputs['PCA'] = np.nan
inputs[mask] = Classification.transform(inputs[mask], {'datum': 'Data'}, wavelet, 'PCA')
def transform(dataframe, tags, model, out):
    # Check mandatory fields
    mandatory = ['datum']
    if not isinstance(tags, dict) or not all(elem in mandatory for elem in tags.keys()):
        raise Exception(f'Not a dict or missing tag: {mandatory}.')
    features = model.transform(dataframe[tags['datum']].to_numpy())
    dataframe[out] = features.tolist()
    return dataframe
Thanks for answers!
Best regards
OK, so I found a solution that fits my needs.
The second idea was, in my opinion, a bad one, as it doesn't seem easy to pass a view as an argument and edit it in a convenient way.
So I changed my first proposition. Here is the final version:
def transform(dataframe, tags, model, out, mask=None):
    # Check mandatory fields
    mandatory = ['datum']
    if not isinstance(tags, dict) or not all(elem in mandatory for elem in tags.keys()):
        raise Exception(f'Not a dict or missing tag: {mandatory}.')
    # Mask creation (see pandas view / copy mechanism)
    if mask is None:
        mask = [True] * len(dataframe.index)
    # dataframe[out] = np.nan
    features = model.transform(dataframe.loc[mask, tags['datum']].to_numpy())
    dataframe.loc[mask, out] = pd.Series(features.tolist())
    return dataframe
Instead of doing dataframe.loc[mask, out] = features.tolist(), I switched to dataframe.loc[mask, out] = pd.Series(features.tolist()). By going through a Series, it fits without any problems.
Thanks to Ken Syme for getting involved in my problem.

Dataset API 'flat_map' method producing error for same code which works with 'map' method

I am trying to create a pipeline to read multiple CSV files using the TensorFlow Dataset API and Pandas. However, using the flat_map method produces errors, whereas with the map method I am able to build the code and run it in a session. This is the code I am using. I already opened issue #17415 in the TensorFlow GitHub repository, but apparently it is not an error and they asked me to post here.
folder_name = './data/power_data/'
file_names = os.listdir(folder_name)

def _get_data_for_dataset(file_name, rows=100):
    print(file_name.decode())
    df_input = pd.read_csv(os.path.join(folder_name, file_name.decode()),
                           usecols=['Wind_MWh', 'Actual_Load_MWh'], nrows=rows)
    X_data = df_input.as_matrix()
    X_data.astype('float32', copy=False)
    return X_data

dataset = tf.data.Dataset.from_tensor_slices(file_names)
dataset = dataset.flat_map(lambda file_name: tf.py_func(_get_data_for_dataset,
                                                        [file_name], tf.float64))
dataset = dataset.batch(2)
iter = dataset.make_one_shot_iterator()
get_batch = iter.get_next()
I get the following error: map_func must return a Dataset object. The pipeline works without error when I use map, but it doesn't give the output I want. For example, if Pandas is reading N rows from each of my CSV files, I want the pipeline to concatenate data from B files and give me an array with shape (N*B, 2). Instead, it is giving me (B, N, 2), where B is the batch size. map is adding another axis instead of concatenating on the existing axis. From what I understood in the documentation, flat_map is supposed to give a flattened output. In the documentation, both map and flat_map return type Dataset. So how is my code working with map and not with flat_map?
It would also be great if you could point me towards code where the Dataset API has been used with the Pandas module.
As mikkola points out in the comments, Dataset.map() and Dataset.flat_map() expect functions with different signatures: Dataset.map() takes a function that maps a single element of the input dataset to a single new element, whereas Dataset.flat_map() takes a function that maps a single element of the input dataset to a Dataset of elements.
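A tiny contrast (a sketch with toy data, not part of the original question):
ds = tf.data.Dataset.from_tensor_slices([[1, 2], [3, 4]])

# map: one element in, one element out -> yields [2, 4], then [6, 8]
doubled = ds.map(lambda x: x * 2)

# flat_map: one element in, one Dataset out; the returned datasets are
# concatenated -> yields 1, 2, 3, 4
flattened = ds.flat_map(lambda x: tf.data.Dataset.from_tensor_slices(x))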
If you want each row of the array returned by _get_data_for_dataset() to become a separate element, you should use Dataset.flat_map() and convert the output of tf.py_func() to a Dataset, using Dataset.from_tensor_slices():
folder_name = './data/power_data/'
file_names = os.listdir(folder_name)

def _get_data_for_dataset(file_name, rows=100):
    df_input = pd.read_csv(os.path.join(folder_name, file_name.decode()),
                           usecols=['Wind_MWh', 'Actual_Load_MWh'], nrows=rows)
    X_data = df_input.as_matrix()
    return X_data.astype('float32', copy=False)

dataset = tf.data.Dataset.from_tensor_slices(file_names)

# Use `Dataset.from_tensor_slices()` to make a `Dataset` from the output of
# the `tf.py_func()` op.
dataset = dataset.flat_map(lambda file_name: tf.data.Dataset.from_tensor_slices(
    tf.py_func(_get_data_for_dataset, [file_name], tf.float32)))

dataset = dataset.batch(2)
iter = dataset.make_one_shot_iterator()
get_batch = iter.get_next()
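To drain the pipeline in TF 1.x graph mode, a minimal usage sketch (assuming the CSVs exist under folder_name; with two usecols and batch(2), each full batch has shape (2, 2)):
with tf.Session() as sess:
    while True:
        try:
            batch = sess.run(get_batch)
            print(batch.shape)  # (2, 2) for full batches
        except tf.errors.OutOfRangeError:
            break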