Convert PyTorch AutoTokenizer to TensorFlow TextVectorization - tensorflow

I have a PyTorch encoder loaded on my PC with transformers.
I saved the tokenizer to JSON with tokenizer.save_pretrained(...), and now I need to load it on another PC with TensorFlow's TextVectorization, as I don't have access to the transformers library there.
How can I convert it? I read about tf.keras.preprocessing.text.tokenizer_from_json, but it does not work.
In the saved JSON I have:
{
  "version": "1.0",
  "truncation": null,
  "padding": null,
  "added_tokens": [...],
  "normalizer": {...},
  "pre_tokenizer": {...},
  "post_processor": {...},
  "decoder": {...},
  "model": {...}
}
while TensorFlow's TextVectorization expects:
def __init__(
    self,
    max_tokens=None,
    standardize="lower_and_strip_punctuation",
    split="whitespace",
    ngrams=None,
    output_mode="int",
    output_sequence_length=None,
    pad_to_max_tokens=False,
    vocabulary=None,
    idf_weights=None,
    sparse=False,
    ragged=False,
    **kwargs,
):
or, with tokenizer_from_json, these kinds of fields:
config = tokenizer_config.get("config")
word_counts = json.loads(config.pop("word_counts"))
word_docs = json.loads(config.pop("word_docs"))
index_docs = json.loads(config.pop("index_docs"))
# Integer indexing gets converted to strings with json.dumps()
index_docs = {int(k): v for k, v in index_docs.items()}
index_word = json.loads(config.pop("index_word"))
index_word = {int(k): v for k, v in index_word.items()}
word_index = json.loads(config.pop("word_index"))
tokenizer = Tokenizer(**config)

Simply "tf.keras.preprocessing.text.tokenizer_from_json.()" but you may need to correct format in JSON.
Sample: The sample they using " I love cats " -> " Sticky "
import tensorflow as tf

# Word-level tokenizer, later round-tripped through JSON
text = "I love cats"
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=10000, oov_token='<oov>')
tokenizer.fit_on_texts([text])

# Character-level mapping with StringLookup
vocab = ["a", "b", "c", "d", "e", "f", "g", "h", "I", "j", "k", "l", "m", "n",
         "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "_"]
data = tf.constant([["_", "_", "_", "I"], ["l", "o", "v", "e"], ["c", "a", "t", "s"]])
layer = tf.keras.layers.StringLookup(vocabulary=vocab)
sequences_mapping_string = layer(data)
sequences_mapping_string = tf.reshape(sequences_mapping_string, shape=(1, 12))
print('result: ' + str(sequences_mapping_string))

print('tokenizer.to_json(): ' + str(tokenizer.to_json()))
new_tokenizer = tf.keras.preprocessing.text.tokenizer_from_json(tokenizer.to_json())
print('new_tokenizer.to_json(): ' + str(new_tokenizer.to_json()))
Output:
result: tf.Tensor([[27 27 27 9 12 15 22 5 3 1 20 19]], shape=(1, 12), dtype=int64)
tokenizer.to_json(): {"class_name": "Tokenizer", "config": {"num_words": 10000, "filters": "!\"#$%&()*+,-./:;<=>?#[\\]^_`{|}~\t\n", "lower": true, "split": " ", "char_level": false, "oov_token": "<oov>", "document_count": 1, "word_counts": "{\"i\": 1, \"love\": 1, \"cats\": 1}", "word_docs": "{\"cats\": 1, \"love\": 1, \"i\": 1}", "index_docs": "{\"4\": 1, \"3\": 1, \"2\": 1}", "index_word": "{\"1\": \"<oov>\", \"2\": \"i\", \"3\": \"love\", \"4\": \"cats\"}", "word_index": "{\"<oov>\": 1, \"i\": 2, \"love\": 3, \"cats\": 4}"}}
new_tokenizer.to_json(): {"class_name": "Tokenizer", "config": {"num_words": 10000, "filters": "!\"#$%&()*+,-./:;<=>?#[\\]^_`{|}~\t\n", "lower": true, "split": " ", "char_level": false, "oov_token": "<oov>", "document_count": 1, "word_counts": "{\"i\": 1, \"love\": 1, \"cats\": 1}", "word_docs": "{\"cats\": 1, \"love\": 1, \"i\": 1}", "index_docs": "{\"4\": 1, \"3\": 1, \"2\": 1}", "index_word": "{\"1\": \"<oov>\", \"2\": \"i\", \"3\": \"love\", \"4\": \"cats\"}", "word_index": "{\"<oov>\": 1, \"i\": 2, \"love\": 3, \"cats\": 4}"}}
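
If the goal is to reuse the vocabulary that save_pretrained() wrote rather than refit a Keras Tokenizer, another option is to read the vocab out of the tokenizer JSON and hand it to TextVectorization. A minimal sketch, assuming the file is named tokenizer.json and that its "model" entry holds a WordPiece-style "vocab" mapping of token to id (both are assumptions about your file):
import json
import tensorflow as tf

# Sketch: reuse the saved Hugging Face vocabulary with TextVectorization.
# Assumes tokenizer.json contains model.vocab as a {token: id} mapping.
with open("tokenizer.json", encoding="utf-8") as f:
    tok = json.load(f)

vocab_map = tok["model"]["vocab"]
# Keep the original id order and drop tokens that collide with the
# layer's reserved mask ("") and OOV ("[UNK]") entries.
vocab = [t for t, _ in sorted(vocab_map.items(), key=lambda kv: kv[1])
         if t not in ("", "[UNK]")]

vectorizer = tf.keras.layers.TextVectorization(
    standardize=None,      # the HF normalizer is not reproduced here
    split="whitespace",    # nor the pre-tokenizer / sub-word merges
    vocabulary=vocab,
)
print(vectorizer(["i love cats"]))
This only transfers the vocabulary: whitespace splitting will not match WordPiece/BPE sub-word tokenization, so the resulting ids will generally differ from what transformers would produce.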

Related

Identify change in status due to change in categorical variable in panel data

I have unbalanced panel data (repeated observations per ID at different points in time). I need to identify a change in a categorical variable per person over time.
Here is the code to generate the data frame:
df = pd.DataFrame(
    {
        "region": ["C1", "C1", "C2", "C2", "C2"],
        "id": [1, 1, 2, 2, 2],
        "date": ["01/01/2021", "01/02/2021", "01/01/2021", "01/02/2021", "01/03/2021"],
        "job": ["A", "A", "A", "B", "B"],
    }
)
df
I am trying to create a column ("change") that indicates when individual 2 changes job status from A to B on that date (01/02/2021).
I have tried the following, but it is giving me an error:
df['change']=df.groupby(['id'])['job'].diff().fillna(0)
The error happens because you call diff() on the 'job' column, but 'job' has dtype object, and diff() only works with numeric types.
A concise answer:
df["change"] = df.groupby(
["id"])["job"].transform(lambda x: x.ne(x.shift().bfill())).astype(int)
Here is the (longer) solution that I worked out:
df = pd.DataFrame(
    {
        "region": ["C1", "C1", "C2", "C2", "C2"],
        "id": [1, 1, 2, 2, 2],
        "date": [0, 1, 0, 1, 2],
        "job": ["A", "A", "A", "B", "B"],
    }
)
df1 = df.set_index(['id', 'date']).sort_index()
# Lag job within each id; the first observation has no lag, so fill it with the current job
df1['job_lag'] = df1.groupby(level='id')['job'].shift()
df1['job_lag'] = df1['job_lag'].fillna(df1['job'])

def change(x):
    # 1 if the job differs from the previous observation, else 0
    if x['job'] != x['job_lag']:
        return 1
    else:
        return 0

df1['dummy'] = df1.apply(change, axis=1)
df1

How to assert that sum of two series is equal to sum of another two series

Let's say I have 4 series objects:
import numpy as np
import pandas as pd

ser1 = pd.Series(data={'a': 1, 'b': 2, 'c': np.nan, 'd': 5, 'e': 50})
ser2 = pd.Series(data={'a': 4, 'b': np.nan, 'c': np.nan, 'd': 10, 'e': 100})
ser3 = pd.Series(data={'a': 0, 'b': np.nan, 'c': 7, 'd': 15, 'e': np.nan})
ser4 = pd.Series(data={'a': 5, 'b': 2, 'c': 10, 'd': np.nan, 'e': np.nan})
I would like to assert that ser1 + ser2 == ser3 + ser4, treating NaNs as zeros, except where both ser1 and ser2 are NaN: in that case I want to omit the row and treat the assertion as true. For example, when ser1 and ser2 are both NaN ('c'), the assertion should hold no matter what the values of ser3 and ser4 are. When only one of ser1 or ser2 is NaN, filling NaNs with zeros would work.
Here is one way to do it:
def assert_sum_equality(ser1, ser2, ser3, ser4):
    """Return True if ser1 + ser2 equals ser3 + ser4, treating NaN as 0.

    If ser1 and ser2 are entirely NaN, return True regardless of ser3 and ser4.
    """
    if ser1.isna().all() and ser2.isna().all():
        return True
    ser1, ser2, ser3, ser4 = (ser.fillna(0) for ser in (ser1, ser2, ser3, ser4))
    return all(ser1 + ser2 == ser3 + ser4)
import pandas as pd
# ser1 and ser2 are filled with pd.NA
ser1 = pd.Series({"a": pd.NA, "b": pd.NA, "c": pd.NA, "d": pd.NA, "e": pd.NA})
ser2 = pd.Series({"a": pd.NA, "b": pd.NA, "c": pd.NA, "d": pd.NA, "e": pd.NA})
ser3 = pd.Series({"a": 0, "b": pd.NA, "c": 7, "d": 15, "e": pd.NA})
ser4 = pd.Series({"a": 5, "b": 2, "c": 10, "d": pd.NA, "e": 125})
print(assert_sum_equality(ser1, ser2, ser3, ser4)) # True
# ser1 + ser2 == ser3 + ser4 on all rows
ser1 = pd.Series({"a": 1, "b": 2, "c": 13, "d": 5, "e": 50})
ser2 = pd.Series({"a": 4, "b": pd.NA, "c": 4, "d": 10, "e": 100})
ser3 = pd.Series({"a": 0, "b": pd.NA, "c": 7, "d": 15, "e": pd.NA})
ser4 = pd.Series({"a": 5, "b": 2, "c": 10, "d": pd.NA, "e": 150})
print(assert_sum_equality(ser1, ser2, ser3, ser4)) # True
# ser1 + ser2 != ser3 + ser4 on rows 'c' and 'e'
ser1 = pd.Series({"a": 1, "b": 2, "c": pd.NA, "d": 5, "e": 50})
ser2 = pd.Series({"a": 4, "b": pd.NA, "c": pd.NA, "d": 10, "e": 100})
ser3 = pd.Series({"a": 0, "b": pd.NA, "c": 7, "d": 15, "e": pd.NA})
ser4 = pd.Series({"a": 5, "b": 2, "c": 10, "d": pd.NA, "e": pd.NA})
print(assert_sum_equality(ser1, ser2, ser3, ser4)) # False
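If the both-NaN rule is meant to apply row by row rather than to the whole series (which is how the question reads), a per-element variant could look like the sketch below; the function name is made up here, and the four series are assumed to share the same index:
import pandas as pd

def assert_sum_equality_elementwise(ser1, ser2, ser3, ser4):
    # Rows where ser1 AND ser2 are both NaN are ignored entirely;
    # everywhere else NaN is treated as 0 and the row sums are compared.
    skip = ser1.isna() & ser2.isna()
    lhs = ser1.fillna(0) + ser2.fillna(0)
    rhs = ser3.fillna(0) + ser4.fillna(0)
    return bool((lhs == rhs)[~skip].all())

ser1 = pd.Series({"a": 1, "b": 2, "c": pd.NA, "d": 5, "e": 50})
ser2 = pd.Series({"a": 4, "b": pd.NA, "c": pd.NA, "d": 10, "e": 100})
ser3 = pd.Series({"a": 0, "b": pd.NA, "c": 7, "d": 15, "e": pd.NA})
ser4 = pd.Series({"a": 5, "b": 2, "c": 10, "d": pd.NA, "e": 150})
print(assert_sum_equality_elementwise(ser1, ser2, ser3, ser4))  # True: row 'c' is skipped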

Filtering down a Karate test response object to get a sub-list?

Given this feature file:
Feature: test

  Scenario: filter response
    * def response =
      """
      [
        {
          "a": "a",
          "b": "a",
          "c": "a",
        },
        {
          "d": "ab",
          "e": "ab",
          "f": "ab",
        },
        {
          "g": "ac",
          "h": "ac",
          "i": "ac",
        }
      ]
      """
    * match response[1] contains { e: 'ab' }
How can I filter the response down so that it is equal to:
{
"d": "ab",
"e": "ab",
"f": "ab",
}
Is there a built-in way to do this? In the same way as you can filter a List using a Java stream?
Sample code:
Feature: test

  Scenario: filter response
    * def response =
      """
      [
        {
          "a": "a",
          "b": "a",
          "c": "a",
        },
        {
          "d": "ab",
          "e": "ab",
          "f": "ab",
        },
        {
          "g": "ac",
          "h": "ac",
          "i": "ac",
        }
      ]
      """
    * def filt = function(x){ return x.e == 'ab' }
    * def items = get response[*]
    * def res = karate.filter(items, filt)
    * print res

How can I convert my dataset into JSON format like my required format

I want to convert my dataset (attached as a screenshot) into this JSON format using pandas:
y = {'name':['a','b','c'],"rollno":[1,2,3],"teacher":'xyz',"year":1998}
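Since the dataset is only available as a screenshot, assume a frame like the one below (column names and values reconstructed from the desired output):
import pandas as pd

# Assumed reconstruction of the screenshotted dataset.
df = pd.DataFrame({
    "name": ["a", "b", "c"],
    "rollno": [1, 2, 3],
    "teacher": ["xyz", "xyz", "xyz"],
    "year": [1998, 1998, 1998],
})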
First create a dictionary with DataFrame.to_dict and, in a dictionary comprehension, collapse columns with only one unique value to scalars by checking the length of a set:
d = {k: v if len(set(v)) > 1 else v[0] for k, v in df.to_dict('list').items()}
print (d)
{'name': ['a', 'b', 'c'], 'rollno': [1, 2, 3], 'teacher': 'xyz', 'year': 1998}
And then convert to json:
import json
j = json.dumps(d)
print (j)
{"name": ["a", "b", "c"], "rollno": [1, 2, 3], "teacher": "xyz", "year": 1998}
If the values should stay as (possibly duplicated) lists instead:
import json
j = json.dumps(df.to_dict(orient='list'))
print (j)
{"name": ["a", "b", "c"], "rollno": [1, 2, 3],
"teacher": ["xyz", "xyz", "xyz"], "year": [1998, 1998, 1998]}

Special emphasis on observation by circling it in ggplot

I have a dataset with categorical data with 31 levels. I want to show their distribution in a scatterplot with ggplot, but I want to place special emphasis on some of the data points by circling them in red.
My preference is a red dotted circle around the observation at the point [x = 10, y = 6]. Ideally the solution is programmatic, but advice on manual circling is also welcome :). This is my script:
library(ggplot2)

# data frame
df1 <- data.frame(
  name = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n",
           "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "a", "b", "c", "d", "e"),
  n = rep(1:31, 1),
  value = c(3, 2, 5, 1, 1, 6, 7, 9, 8, 6, 10, 11, 11, 11, 13, 15, 17, 16, 18, 18,
            20, 20, 23, 22, 22, 23, 25, 26, 28, 29, 29)
)

# set correct data type
df1$name <- as.factor(df1$name)

# produce colour vector
color <- grDevices::colors()[grep('gr(a|e)y', grDevices::colors(), invert = T)]
col_sample <- sample(color, 31)
col_sample <- as.vector(col_sample)

# scatterplot
median_scatter <- ggplot(data = df1,
                         aes(x = n,
                             y = value,
                             colour = name))
median_scatter +
  geom_point() +
  scale_colour_manual(values = col_sample)
You can define a subset of your data, i.e. df1[df1$name == "j", ], that corresponds to the point of interest and plot it with a second geom_point, picking a shape that is an open circle and setting colour, size and stroke to your liking.
median_scatter +
  geom_point() +
  scale_colour_manual(values = col_sample) +
  geom_point(data = df1[df1$name == "j", ], colour = "red", shape = 1, size = 4, stroke = 1.5)
Unfortunately no dashed circle shape is available.