Special emphasis on observation by circling it in ggplot - ggplot2

I have a dataset of categorical data with 31 levels. I want to show their distribution in a scatterplot with ggplot, but I want to place special emphasis on some of the data points, for example by drawing a red circle around them.
I would prefer to have a red dotted circle around the observation at data point [x = 10, y = 6]. Ideally the solution is reusable, but advice on manual circling is also welcome :). This is my script:
library(ggplot2)

# dataframe
df1 <- data.frame(
  name = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "a", "b", "c", "d", "e"),
  n = rep(1:31, 1),
  value = c(3, 2, 5, 1, 1, 6, 7, 9, 8, 6, 10, 11, 11, 11, 13, 15, 17, 16, 18, 18, 20, 20, 23, 22, 22, 23, 25, 26, 28, 29, 29)
)

# set correct data type
df1$name <- as.factor(df1$name)

# produce colour vector without greys
color <- grDevices::colors()[grep('gr(a|e)y', grDevices::colors(), invert = T)]
col_sample <- sample(color, 31)
col_sample <- as.vector(col_sample)

# scatterplot
median_scatter <- ggplot(data = df1,
                         aes(x = n,
                             y = value,
                             colour = name))
median_scatter +
  geom_point() +
  scale_colour_manual(values = col_sample)

You can define a subset of your data, i.e. df1[df1$name == "j", ], which corresponds to the point of interest, and plot it with a second geom_point, picking a shape that is an open circle and setting the colour, size and stroke to your liking.
median_scatter +
  geom_point() +
  scale_colour_manual(values = col_sample) +
  geom_point(data = df1[df1$name == "j", ], colour = "red", shape = 1, size = 4, stroke = 1.5)
Unfortunately, no dashed circle shape is available among the built-in point shapes.

Related

Identify change in status due to change in categorical variable in panel data

I have unbalanced panel data (repeated observations per ID at different points in time). I need to identify a change in a variable per person over time.
Here is the code to generate the data frame:
import pandas as pd

df = pd.DataFrame(
    {
        "region": ["C1", "C1", "C2", "C2", "C2"],
        "id": [1, 1, 2, 2, 2],
        "date": ["01/01/2021", "01/02/2021", "01/01/2021", "01/02/2021", "01/03/2021"],
        "job": ["A", "A", "A", "B", "B"],
    }
)
df
I am trying to create a column ("change") that indicates when individual 2 changes job status from A to B on that date (01/02/2021).
I have tried the following, but it is giving me an error:
df['change']=df.groupby(['id'])['job'].diff().fillna(0)
The error happens because you call diff() on the 'job' column, but its dtype is object and diff() only works with numeric types.
Current answer:
df["change"] = (
    df.groupby(["id"])["job"]
    .transform(lambda x: x.ne(x.shift().bfill()))
    .astype(int)
)
Here is the (longer) solution that I worked out:
df = pd.DataFrame(
    {
        "region": ["C1", "C1", "C2", "C2", "C2"],
        "id": [1, 1, 2, 2, 2],
        "date": [0, 1, 0, 1, 2],
        "job": ["A", "A", "A", "B", "B"],
    }
)

df1 = df.set_index(['id', 'date']).sort_index()

# lag the job column within each id
df1['job_lag'] = df1.groupby(level='id')['job'].shift()
# the first observation per id has no lag, so fall back to the current job
df1['job_lag'] = df1['job_lag'].fillna(df1['job'])

def change(x):
    if x['job'] != x['job_lag']:
        return 1
    else:
        return 0

df1['dummy'] = df1.apply(change, axis=1)
df1
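For the sample data, the resulting df1 should look roughly like this, with dummy equal to 1 only on the row where the job changes:
#          region job job_lag  dummy
# id date
# 1  0         C1   A       A      0
#    1         C1   A       A      0
# 2  0         C2   A       A      0
#    1         C2   B       A      1
#    2         C2   B       B      0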

Convert PyTorch AutoTokenizer to TensorFlow TextVectorization

I have a PyTorch encoder loaded on my PC with transformers.
I saved it in JSON with tokenizer.save_pretrained(...) and now I need to load it on another PC with TensorFlow TextVectorization as I don't have access to the transformers library.
How can I convert it? I read about tf.keras.preprocessing.text.tokenizer_from_json, but it does not work.
In the PyTorch JSON I have:
{
  "version": "1.0",
  "truncation": null,
  "padding": null,
  "added_tokens": [...],
  "normalizer": {...},
  "pre_tokenizer": {...},
  "post_processor": {...},
  "decoder": {...},
  "model": {...}
}
and TensorFlow is expecting, with TextVectorization:
def __init__(
    self,
    max_tokens=None,
    standardize="lower_and_strip_punctuation",
    split="whitespace",
    ngrams=None,
    output_mode="int",
    output_sequence_length=None,
    pad_to_max_tokens=False,
    vocabulary=None,
    idf_weights=None,
    sparse=False,
    ragged=False,
    **kwargs,
):
or, with tokenizer_from_json, these kinds of fields:
config = tokenizer_config.get("config")
word_counts = json.loads(config.pop("word_counts"))
word_docs = json.loads(config.pop("word_docs"))
index_docs = json.loads(config.pop("index_docs"))
# Integer indexing gets converted to strings with json.dumps()
index_docs = {int(k): v for k, v in index_docs.items()}
index_word = json.loads(config.pop("index_word"))
index_word = {int(k): v for k, v in index_word.items()}
word_index = json.loads(config.pop("word_index"))
tokenizer = Tokenizer(**config)
Simply "tf.keras.preprocessing.text.tokenizer_from_json.()" but you may need to correct format in JSON.
Sample: The sample they using " I love cats " -> " Sticky "
import tensorflow as tf

text = "I love cats"
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=10000, oov_token='<oov>')
tokenizer.fit_on_texts([text])

# input: map the characters of the sentence through a StringLookup layer
vocab = ["a", "b", "c", "d", "e", "f", "g", "h", "I", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "_"]
data = tf.constant([["_", "_", "_", "I"], ["l", "o", "v", "e"], ["c", "a", "t", "s"]])
layer = tf.keras.layers.StringLookup(vocabulary=vocab)
sequences_mapping_string = layer(data)
sequences_mapping_string = tf.reshape(sequences_mapping_string, (1, 12))

print('result: ' + str(sequences_mapping_string))
print('tokenizer.to_json(): ' + str(tokenizer.to_json()))

new_tokenizer = tf.keras.preprocessing.text.tokenizer_from_json(tokenizer.to_json())
print('new_tokenizer.to_json(): ' + str(new_tokenizer.to_json()))
Output:
result: tf.Tensor([[27 27 27 9 12 15 22 5 3 1 20 19]], shape=(1, 12), dtype=int64)
tokenizer.to_json(): {"class_name": "Tokenizer", "config": {"num_words": 10000, "filters": "!\"#$%&()*+,-./:;<=>?#[\\]^_`{|}~\t\n", "lower": true, "split": " ", "char_level": false, "oov_token": "<oov>", "document_count": 1, "word_counts": "{\"i\": 1, \"love\": 1, \"cats\": 1}", "word_docs": "{\"cats\": 1, \"love\": 1, \"i\": 1}", "index_docs": "{\"4\": 1, \"3\": 1, \"2\": 1}", "index_word": "{\"1\": \"<oov>\", \"2\": \"i\", \"3\": \"love\", \"4\": \"cats\"}", "word_index": "{\"<oov>\": 1, \"i\": 2, \"love\": 3, \"cats\": 4}"}}
new_tokenizer.to_json(): {"class_name": "Tokenizer", "config": {"num_words": 10000, "filters": "!\"#$%&()*+,-./:;<=>?#[\\]^_`{|}~\t\n", "lower": true, "split": " ", "char_level": false, "oov_token": "<oov>", "document_count": 1, "word_counts": "{\"i\": 1, \"love\": 1, \"cats\": 1}", "word_docs": "{\"cats\": 1, \"love\": 1, \"i\": 1}", "index_docs": "{\"4\": 1, \"3\": 1, \"2\": 1}", "index_word": "{\"1\": \"<oov>\", \"2\": \"i\", \"3\": \"love\", \"4\": \"cats\"}", "word_index": "{\"<oov>\": 1, \"i\": 2, \"love\": 3, \"cats\": 4}"}}
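If the goal is to reuse the vocabulary saved by save_pretrained() inside a TextVectorization layer, a minimal sketch is shown below. It assumes the saved tokenizer.json keeps its vocabulary under model.vocab as a token-to-id mapping (as WordPiece/BPE models do); note that the resulting ids will not match the original tokenizer's ids and that the original subword splitting is not reproduced, only its vocabulary.
import json
import tensorflow as tf

# Load the tokenizer.json written by tokenizer.save_pretrained(...)
with open("tokenizer.json", encoding="utf-8") as f:
    tok = json.load(f)

# Assumption: the vocabulary lives under model.vocab as {token: id}
vocab = tok["model"]["vocab"]

# Order tokens by their original ids and drop the special tokens, since
# TextVectorization reserves index 0 for padding and index 1 for OOV.
special = {"[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"}
tokens = [t for t, _ in sorted(vocab.items(), key=lambda kv: kv[1]) if t not in special]

vectorizer = tf.keras.layers.TextVectorization(
    standardize=None,      # the HF normalizer / pre-tokenizer is not reproduced here
    split="whitespace",
    vocabulary=tokens,
)
print(vectorizer(tf.constant(["I love cats"])))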

Efficient column MultiIndex ordering

I have this dataframe:
import pandas

df = pandas.DataFrame({'A': [2000, 2000, 2000, 2000, 2000, 2000],
                       'B': ["A+", 'B+', "A+", "B+", "A+", "B+"],
                       'C': ["M", "M", "M", "F", "F", "F"],
                       'D': [1, 5, 3, 4, 2, 6],
                       'Value': [11, 12, 13, 14, 15, 16]}).set_index(['A', 'B', 'C', 'D'])
df = df.unstack(['C', 'D']).fillna(0)
And I'm wondering if there is a more elegant way to order the column MultiIndex than the following code:
# rows ordering
df = df.sort_values(by=['A', "B"], ascending=[True, True])
# columns ordering
df = df.transpose().sort_values(by=["C", "D"], ascending=[False, False]).transpose()
In particular, I feel like the last line, with the two transposes, is far more complex than it should be. I tried using sort_index but wasn't able to use it in a MultiIndex context (for both rows and columns).
You can use sort_index on both levels:
out = df.sort_index(level=[0, 1], axis=1, ascending=[True, False])
I found that I can also pass axis=1 to sort_values, and therefore the last line becomes:
df = df.sort_values(axis=1, by=["C", "D"], ascending=[True, False])
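Putting the two together, a compact version of the whole reordering (a sketch, assuming descending order is wanted on both C and D as in the original transpose code) could be:
df = (df.sort_index(axis=0, level=["A", "B"], ascending=True)
        .sort_index(axis=1, level=["C", "D"], ascending=False))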

How to efficiently identify records that are different for a specific column

I have two datasets, df1 and df2, where I need to detect any record that is different in df2 compared to df1 and create a resulting dataset with an additional column that flags the records that differ. Here is an example.
package playground

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum}

object sample4 {

  val spark = SparkSession
    .builder()
    .appName("Sample app")
    .master("local")
    .getOrCreate()

  val sc = spark.sparkContext

  final case class Owner(a: Long,
                         b: String,
                         c: Long,
                         d: Short,
                         e: String,
                         f: String,
                         o_qtty: Double)

  final case class Result(a: Long,
                          b: String,
                          c: Long,
                          d: Short,
                          e: String,
                          f: String,
                          o_qtty: Double,
                          isDiff: Boolean)

  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.OFF)
    import spark.implicits._

    val data1 = Seq(
      Owner(11, "A", 666, 2017, "x", "y", 50),
      Owner(11, "A", 222, 2018, "x", "y", 20),
      Owner(33, "C", 444, 2018, "x", "y", 20),
      Owner(33, "C", 555, 2018, "x", "y", 120),
      Owner(22, "B", 555, 2018, "x", "y", 20),
      Owner(99, "D", 888, 2018, "x", "y", 100),
      Owner(11, "A", 888, 2018, "x", "y", 100),
      Owner(11, "A", 666, 2018, "x", "y", 80),
      Owner(33, "C", 666, 2018, "x", "y", 80),
      Owner(11, "A", 444, 2018, "x", "y", 50)
    )

    val data2 = Seq(
      Owner(11, "A", 666, 2017, "x", "y", 50),
      Owner(11, "A", 222, 2018, "x", "y", 20),
      Owner(33, "C", 444, 2018, "x", "y", 20),
      Owner(33, "C", 555, 2018, "x", "y", 55),
      Owner(22, "B", 555, 2018, "x", "y", 20),
      Owner(99, "D", 888, 2018, "x", "y", 100),
      Owner(11, "A", 888, 2018, "x", "y", 100),
      Owner(11, "A", 666, 2018, "x", "y", 80),
      Owner(33, "C", 666, 2018, "x", "y", 80),
      Owner(11, "A", 444, 2018, "x", "y", 50)
    )

    val expected = Seq(
      Result(11, "A", 666, 2017, "x", "y", 50, isDiff = false),
      Result(11, "A", 222, 2018, "x", "y", 20, isDiff = false),
      Result(33, "C", 444, 2018, "x", "y", 20, isDiff = false),
      Result(33, "C", 555, 2018, "x", "y", 55, isDiff = true),
      Result(22, "B", 555, 2018, "x", "y", 20, isDiff = false),
      Result(99, "D", 888, 2018, "x", "y", 100, isDiff = false),
      Result(11, "A", 888, 2018, "x", "y", 100, isDiff = false),
      Result(11, "A", 666, 2018, "x", "y", 80, isDiff = false),
      Result(33, "C", 666, 2018, "x", "y", 80, isDiff = false),
      Result(11, "A", 444, 2018, "x", "y", 50, isDiff = false)
    )

    val df1 = spark
      .createDataset(data1)
      .as[Owner]
      .cache()

    val df2 = spark
      .createDataset(data2)
      .as[Owner]
      .cache()
  }
}
What is the most efficient way to do that?
I think this code could help you to find your answer:
val intersectDF = df1.intersect(df2)
val unionDF = df1.union(df2).dropDuplicates()
val diffDF = unionDF.except(intersectDF)
val intersectDF2 = intersectDF.withColumn("isDiff", functions.lit(false))
val diffDF2 = diffDF.withColumn("isDiff", functions.lit(true))
val answer = intersectDF2.union(diffDF2)

// Common data between the two DataFrames
intersectDF2.show()
// Differing data between the two DataFrames
diffDF2.show()
// Your answer
answer.show()
Perhaps this is helpful: do a left join and flag the un-matched rows as false.
val df1_hash = df1.withColumn("x", lit(0))
df2.join(df1_hash, df2.columns, "left")
  .select(when(col("x").isNull, false).otherwise(true).as("isDiff") +: df2.columns.map(df2(_)): _*)
  .show(false)
/**
* +------+---+---+---+----+---+---+------+
* |isDiff|a |b |c |d |e |f |o_qtty|
* +------+---+---+---+----+---+---+------+
* |true |11 |A |666|2017|x |y |50.0 |
* |true |11 |A |222|2018|x |y |20.0 |
* |true |33 |C |444|2018|x |y |20.0 |
* |false |33 |C |555|2018|x |y |55.0 |
* |true |22 |B |555|2018|x |y |20.0 |
* |true |99 |D |888|2018|x |y |100.0 |
* |true |11 |A |888|2018|x |y |100.0 |
* |true |11 |A |666|2018|x |y |80.0 |
* |true |33 |C |666|2018|x |y |80.0 |
* |true |11 |A |444|2018|x |y |50.0 |
* +------+---+---+---+----+---+---+------+
*/
I think the two other answers are not efficient, because join and intersect build hash tables over all records in all partitions and compare all of them. At the very least, you can try the simplest solution:
df1.rdd.zip(df2.rdd).map {
  case (x, y) => (x, x != y)
}
and compare the speed on a real dataset.
It is also a good idea to replace the one-character strings with chars, because char comparison is very fast.
I don't have a real dataset, so I cannot prove that my answer is faster, but I believe it is, based on a test on a small dataset and on the documentation and implementation of join and intersect: zip does not shuffle partitions, whereas join and intersect do.

Matplotlib: barh full width bars

I'm trying to generate a stacked horizontal bar chart in matplotlib. The issue I am facing is that the bars do not fully fill the available width of the plotting area (there is additional space on the right).
Unfortunately I couldn't find any information on this online.
What could I do to resolve this?
[Image: chart with additional space on the right of the bars]
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.ticker import PercentFormatter

measures = ("A", "B", "C", "D", "A", "B", "C", "D", "A", "B")
measure_bars = y_pos = np.arange(len(measures))
yes_data = [10, 10, 10, 10, 15, 10, 10, 10, 10, 10]
number_of_answers = [20, 30, 20, 20, 20, 20, 20, 20, 20, 20]
font = {'fontname': 'Arial', 'color': '#10384f'}

yes_data = [i / j * 100 for i, j in zip(yes_data, number_of_answers)]
no_data = [100 - i for i in yes_data]

bar_width = 0.6
plt.rcParams['xtick.top'] = plt.rcParams['xtick.labeltop'] = True
plt.rcParams['xtick.bottom'] = plt.rcParams['xtick.labelbottom'] = False

fig = plt.figure()
plt.barh(measure_bars, yes_data, color='#89d329', height=bar_width, zorder=2)
plt.barh(measure_bars, no_data, left=yes_data, color='#ff3162', height=bar_width, zorder=3)
plt.grid(color=font["color"], zorder=0)
plt.yticks(measure_bars, measures, **font)
plt.title("TECHNICAL AND ORGANIZATIONAL MEASURES", fontweight="bold", size="16", x=0.5, y=1.1, **font)

ax = plt.gca()
ax.xaxis.set_major_formatter(PercentFormatter())
ax.spines['bottom'].set_color(font["color"])
ax.spines['top'].set_color(font["color"])
ax.spines['right'].set_color(font["color"])
ax.spines['left'].set_color(font["color"])
ax.xaxis.label.set_color(font["color"])
ax.tick_params(axis='x', colors=font["color"])
for tick in ax.get_xticklabels():
    tick.set_fontname(font["fontname"])
ax.xaxis.set_ticks(np.arange(0.0, 100.1, 10))
plt.gca().legend(('Yes', 'No'), bbox_to_anchor=(0.7, 0), ncol=2, shadow=False)
plt.show()
Please add, anywhere after ax is defined (the axis is in percent, so the limits run from 0 to 100):
ax.set_xlim(0, 100)
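A possible alternative (sketch): instead of hard-coding the limits, remove matplotlib's default horizontal padding, which is what creates the empty space on the right:
ax.margins(x=0)  # matplotlib pads the data limits by 5% on each side by default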