When using coord_cartesian, y-axis disappears - ggplot2

I have the following table:
x  var      y
a  group1   0.5
b  group1  -0.65
c  group1  -1.3
d  group1   0.2
a  group2   1.2
b  group2  -1.6
c  group2  -0.7
d  group2  -3
I want to plot x against y in two separate panels by var (group1 or group2), using ggplot.
However, I also want to "zoom in" on the y-axis, i.e. show the whole x-axis but only y values from -0.5 down to -3:
ggplot(table, aes(x = x, y = y)) +
  geom_point() +
  facet_wrap(vars(var)) +
  scale_y_continuous() +
  coord_cartesian(ylim = c(-0.5, -3))
However, this removes the labels and ticks from the y-axis, and I do not know how to make them appear:

Optimization between 2 dependent variables

I couldn't find a solution anywhere, so I'm asking here. Hope someone can guide me through this.
I'm not quite sure whether this is an optimization problem (if anyone knows what kind of problem this is, let me know), but I need to find how many clients each attendant should have so that both end up with the same amount of work. I don't know if there is a function or a regression that could model this.
Column A has the client's name. Column B has the "difficulty" of assisting each client - "1" is normal difficulty, "2" is double the normal difficulty, and so on - meaning that this client's orders are multiplied by the difficulty. Column C has a spec flag marking clients that only attendant Y can assist. Column D has the quantity of orders each client requested. Finally, column E is the account's attendant.
CLIENT  ATTENTION  SPEC  ORDERS  ATTENDANT
a1      3          0     6       y
a2      3          0     7       x
a3      1          0     1       y
a4      1          0     9       y
a5      2          0     6       y
a6      1          0     7       y
a7      3          0     2       y
a8      3          0     9       x
a9      3          0     9       y
a10     2          1     8       y
a11     2          0     8       x
a12     2          0     9       y
a13     1          1     2       y
a14     2          0     4       x
a15     3          0     10      y
a16     2          0     9       x
a17     2          0     8       y
a18     1          1     5       y
a19     3          0     8       x
a20     1          1     3       y
a21     2          0     10      x
a22     2          0     6       x
Summary tables:

ATTENDANT  TOTAL ORDERS
x          61
y          84

ATTENDANT  TOTAL CLIENTS
x          8
y          14

ATTENDANT   TOTAL ORDERS
x           61
y           84
y (spec 0)  66
y (spec 1)  18
Here is an integer linear program that solves this. Fun problem! This uses the pyomo environment and the separately installed glpk solver. Result: it's a tie! X and Y don't need to haggle! ;)
Code:
# client assignment
import csv
from collections import namedtuple

import pyomo.environ as pyo

data_file = 'data.csv'

Record = namedtuple('Record', ['attention', 'spec', 'orders'])
records = {}
with open(data_file, 'r') as src:
    src.readline()  # burn header row
    reader = csv.reader(src, skipinitialspace=True)
    for line in reader:
        record = Record(int(line[1]), int(line[2]), int(line[3]))
        records[line[0]] = record
# print(records)

# set up the ILP
m = pyo.ConcreteModel()

# SETS
m.C = pyo.Set(initialize=records.keys())  # the set of clients
m.A = pyo.Set(initialize=['X', 'Y'])      # the set of attendants

# PARAMETERS
m.attention = pyo.Param(m.C, initialize={c: r.attention for (c, r) in records.items()})
m.y_required = pyo.Param(m.C, initialize={c: r.spec for (c, r) in records.items()})
m.orders = pyo.Param(m.C, initialize={c: r.orders for (c, r) in records.items()})

# VARIABLES
m.assign = pyo.Var(m.A, m.C, domain=pyo.Binary)      # 1: assign attendant a to client c
m.work_delta = pyo.Var(domain=pyo.NonNegativeReals)  # the abs(work difference)

# OBJECTIVE
# minimize the work delta...
m.obj = pyo.Objective(expr=m.work_delta)

# CONSTRAINTS
# each client must be serviced once and only once
def service(m, c):
    return sum(m.assign[a, c] for a in m.A) == 1
m.C1 = pyo.Constraint(m.C, rule=service)

# spec customers must be serviced by attendant Y; Y may optionally serve the rest
def y_serves(m, c):
    return m.assign['Y', c] >= m.y_required[c]
m.C2 = pyo.Constraint(m.C, rule=y_serves)

# some convenience expressions to capture work...
m.y_work = sum(m.attention[c] * m.orders[c] * m.assign['Y', c] for c in m.C)
m.x_work = sum(m.attention[c] * m.orders[c] * m.assign['X', c] for c in m.C)

# capture ABS(y_work - x_work) with 2 constraints
m.C3a = pyo.Constraint(expr=m.work_delta >= m.y_work - m.x_work)
m.C3b = pyo.Constraint(expr=m.work_delta >= m.x_work - m.y_work)

# check the model
# m.pprint()

# SOLVE
solver = pyo.SolverFactory('glpk')
res = solver.solve(m)

# ensure the result is optimal
status = res.solver.termination_condition
assert status == pyo.TerminationCondition.optimal, f'error occurred, status: {status}. Check model!'

print(res)
print(f'x work: {pyo.value(m.x_work)} units')
print(f'y work: {pyo.value(m.y_work)} units')

# list assignments
for c in m.C:
    for a in m.A:
        if pyo.value(m.assign[a, c]):
            print(f'Assign {a} to customer {c}')
Output:
Problem:
- Name: unknown
  Lower bound: 0.0
  Upper bound: 0.0
  Number of objectives: 1
  Number of constraints: 47
  Number of variables: 46
  Number of nonzeros: 157
  Sense: minimize
Solver:
- Status: ok
  Termination condition: optimal
  Statistics:
    Branch and bound:
      Number of bounded subproblems: 53
      Number of created subproblems: 53
  Error rc: 0
  Time: 0.008551836013793945
Solution:
- number of solutions: 0
  number of solutions displayed: 0

x work: 158.0 units
y work: 158.0 units
Assign Y to customer a1
Assign Y to customer a2
Assign X to customer a3
Assign Y to customer a4
Assign Y to customer a5
Assign Y to customer a6
Assign Y to customer a7
Assign Y to customer a8
Assign X to customer a9
Assign Y to customer a10
Assign X to customer a11
Assign X to customer a12
Assign Y to customer a13
Assign X to customer a14
Assign X to customer a15
Assign X to customer a16
Assign X to customer a17
Assign Y to customer a18
Assign X to customer a19
Assign Y to customer a20
Assign Y to customer a21
Assign Y to customer a22
In Excel Solver
Note: in the solver options, be sure to de-select "ignore integer constraints"
This seems to work OK. The green shaded areas are "locked" to Y and not in the solver's control.
I'm always suspicious of the solver in Excel, so check everything!
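Seconding that: whichever tool produces the assignment, it is cheap to verify in Python. A minimal sketch, reusing the records dict from the script above and assuming the assignment is a plain client -> attendant dict (the helper name is hypothetical):

def check_assignment(records, assignment):
    # recompute each attendant's weighted work and enforce the spec rule
    work = {'X': 0, 'Y': 0}
    for client, rec in records.items():
        attendant = assignment[client]
        if rec.spec == 1 and attendant != 'Y':
            raise ValueError(f'{client} requires attendant Y')
        work[attendant] += rec.attention * rec.orders
    return work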

Adding a secondary y-axis in ggplot2 without a scaling factor

I want to plot two y-axes - one for continuous data and another for values ranging from 0 to 7.
Example:
ID IHC FISH1 FISH2
1 3 11.5 9.5
2 1 2.9 3.9
3 2 1.5 6.5
4 1 3.3 1.3
5 2 5.5 8.5
6 2 6.6 9.6
How can I plot a secondary y-axis if it is not related to the primary y-axis?
I want to code this in R.
I want a plot like this as output:
library(tidyverse)

# helper for evenly spaced breaks (undefined in the original attempt)
number_ticks <- function(n) function(limits) pretty(limits, n)

data %>%
  select(ID, IHC, FISH1, FISH2) %>%
  gather(key = "FISH_IHC", value = "FISH_val", IHC, FISH1, FISH2) %>%
  mutate(as_factor = as.factor(FISH_IHC)) %>%
  ggplot(aes(x = reorder(ID, FISH_val), y = FISH_val, group = as_factor)) +
  geom_point(aes(shape = as_factor, color = as_factor), na.rm = TRUE) +
  scale_y_continuous(limits = c(0, 10),
                     breaks = number_ticks(10),
                     # ggplot2 only supports a secondary axis that is a
                     # one-to-one transformation of the primary axis; here
                     # the 0-10 primary range is rescaled to the 0-7 IHC range
                     sec.axis = sec_axis(~ . * 7 / 10, name = "IHC"))

In a pandas dataframe with a MultiIndex, how to conditionally fill missing values with group means?

Setup:
import pandas as pd

# create a MultiIndex
dfx = pd.MultiIndex.from_product([
    list('ab'),
    list('cd'),
    list('xyz'),
], names=['idx1', 'idx2', 'idx3'])

# create a dataframe that fits the index
df = pd.DataFrame([None, .9, -.08, -2.11, 1.09, .38,
                   None, None, -.37, -.86, 1.51, -.49],
                  columns=['random_data'])
df.set_index(dfx, inplace=True)
Output:
                random_data
idx1 idx2 idx3
a    c    x            NaN
          y           0.90
          z          -0.08
     d    x          -2.11
          y           1.09
          z           0.38
b    c    x            NaN
          y            NaN
          z          -0.37
     d    x          -0.86
          y           1.51
          z          -0.49
Within this index hierarchy, I am trying to accomplish the following:
1. When a single value is missing within an [idx1, idx2] group, fill the NaN with the group mean of [idx1, idx2]
2. When multiple values are missing within an [idx1, idx2] group, fill the NaN with the group mean of [idx1]
I have tried df.apply(lambda col: col.fillna(col.groupby(by='idx1').mean())) as a way to solve #2, but I haven't been able to get it to work.
UPDATE
OK, so I have this solved in parts, but still at a loss about how to apply these conditionally:
For case #1:
df.unstack().apply(lambda col: col.fillna(col.mean()), axis=1).stack()
I verified that the correct value was filled by looking at df.groupby(by=['idx1', 'idx2']).mean(), but it also replaces the missing values that I am trying to handle differently in case #2.
Similarly for #2:
df.unstack().unstack().apply(lambda col: col.fillna(col.mean()), axis=1).stack().stack()
I verified the values replaced were correct by looking at df.groupby(by=['idx1']).mean(), but it also applies to case #1, which I don't want.
I'm sure there is a more elegant way of doing this, but the following should achieve your desired result:
import numpy as np

def get_null_count(df, group_levels, column):
    # number of NaNs in each group, broadcast back to every row
    result = (
        df.loc[:, column]
        .groupby(group_levels)
        .transform(lambda x: x.isnull().sum())
    ).astype("int")
    return result

def fill_groups(df, count_group_levels, column, missing_count_idx_map):
    null_counts = get_null_count(df, count_group_levels, column)
    condition_masks = {
        count: ((null_counts == count) & df[column].isnull()).to_numpy()
        for count in missing_count_idx_map.keys()
    }
    condition_values = {
        count: df.loc[:, column]
        .groupby(indices)
        .transform("mean")
        .to_numpy()
        for count, indices in missing_count_idx_map.items()
    }
    # defaults: non-null entries keep their original values
    condition_masks[0] = (~df[column].isnull()).to_numpy()
    condition_values[0] = df[column].to_numpy()
    sorted_keys = sorted(missing_count_idx_map.keys()) + [0]
    conditions = [condition_masks[count] for count in sorted_keys]
    values = [condition_values[count] for count in sorted_keys]
    result = np.select(conditions, values)
    return result

col = "random_data"
missing_count_idx_map = {
    1: ['idx1', 'idx2'],
    2: ['idx1'],
}
df["filled"] = fill_groups(df, ['idx1', 'idx2'], col, missing_count_idx_map)
df then looks like (note that the answerer generated their own random data, so the values differ from the sample above):
                random_data  filled
idx1 idx2 idx3
a    c    x            NaN   -0.20
          y           1.16    1.16
          z          -1.56   -1.56
     d    x           0.47    0.47
          y          -0.54   -0.54
          z          -0.30   -0.30
b    c    x            NaN   -0.40
          y            NaN   -0.40
          z           0.29    0.29
     d    x           0.98    0.98
          y          -0.41   -0.41
          z          -2.46   -2.46
IIUC, you may try this. Get the mean at level idx1 and the mean at level [idx1, idx2]. Fill NaN using the [idx1, idx2] means. Next, use mask to assign rows of groups having more than one NaN to the idx1 mean.
Sample `df`:
                random_data
idx1 idx2 idx3
a    c    x            NaN
          y          -0.09
          z          -0.01
     d    x          -1.30
          y          -0.11
          z           1.33
b    c    x            NaN
          y            NaN
          z           0.74
     d    x          -1.44
          y           0.50
          z          -0.61
# per-level group means (df.mean(level=...) is deprecated in recent pandas)
df1_m = df.groupby(level='idx1').mean()
df12_m = df.groupby(level=['idx1', 'idx2']).mean()

# mask: rows whose [idx1, idx2] group has more than one NaN
m = df.isna().groupby(level=['idx1', 'idx2']).transform('sum').gt(1)

# fill with [idx1, idx2] means, then overwrite multi-NaN groups with idx1 means
df_filled = df.fillna(df12_m).mask(m & df.isna(), df1_m)
Out[110]:
                random_data
idx1 idx2 idx3
a    c    x        -0.0500
          y        -0.0900
          z        -0.0100
     d    x        -1.3000
          y        -0.1100
          z         1.3300
b    c    x        -0.2025
          y        -0.2025
          z         0.7400
     d    x        -1.4400
          y         0.5000
          z        -0.6100
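The same idea can also be written with transform, which keeps every result aligned to the full three-level index and sidesteps the fillna/mask level-alignment subtleties; a minimal sketch against the same df:

g12 = df['random_data'].groupby(level=['idx1', 'idx2']).transform('mean')
g1 = df['random_data'].groupby(level='idx1').transform('mean')
n_missing = df['random_data'].isna().groupby(level=['idx1', 'idx2']).transform('sum')

# take the [idx1, idx2] mean when at most one value is missing, else the idx1 mean
df['filled'] = df['random_data'].fillna(g12.where(n_missing <= 1, g1))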
OK, solved it.
First, I made a Series containing counts of non-missing values by group:
truth_table = df.apply(lambda row: row.count(), axis=1).groupby(by=['idx1', 'idx2']).sum()
>> truth_table
idx1  idx2
a     c       2
      d       3
b     c       1
      d       3
dtype: int64
Then I set up a dataframe (one for each case I'm trying to resolve) containing the group means:
means_ab = df.groupby(by=['idx1']).mean()
>> means_ab
idx1
a     0.0360
b    -0.0525
means_abcd = df.groupby(by=['idx1', 'idx2']).mean()
>> means_abcd
idx1  idx2
a     c       0.410000
      d      -0.213333
b     c      -0.370000
      d       0.053333
Given the structure of my data, I know:
Case #1 corresponds to an [idx1, idx2] grouping with exactly one missing value (i.e., the NaN values I want to replace with values from means_abcd)
Case #2 corresponds to an [idx1, idx2] grouping with more than one missing value (i.e., the NaN values I want to replace with values from means_ab)
fix_case_2 = df.combine_first(df[truth_table > 1].fillna(means_ab, axis=1))
>> fix_case_2
                random_data
idx1 idx2 idx3
a    c    x            NaN
          y         0.9000
          z        -0.0800
     d    x        -2.1100
          y         1.0900
          z         0.3800
b    c    x        -0.0525 *
          y        -0.0525 *
          z        -0.3700
     d    x        -0.8600
          y         1.5100
          z        -0.4900
df = fix_case_2.combine_first(df[truth_table == 1].fillna(means_abcd, axis=1))
>> df
                random_data
idx1 idx2 idx3
a    c    x         0.4100 *
          y         0.9000
          z        -0.0800
     d    x        -2.1100
          y         1.0900
          z         0.3800
b    c    x        -0.0525 *
          y        -0.0525 *
          z        -0.3700
     d    x        -0.8600
          y         1.5100
          z        -0.4900

Pandas: set order of newly generated columns

I am working on some data and wrote code that splits the values of a column (COL) on commas and writes the split data into new columns. Now I want the code to generate the new columns in the manner shown below (desired output). The code is attached. Thank you in advance.
Input
X1 COL Y1
----------------
A X,Y,Z 146#12
B Z 223#13
C Y,X 725#14
Current output:
X1 Y1 COL-0 COL-1 COL-2
-----------------------------
A 146#12 X Y Z
B 223#13 Z NaN NaN
C 725#14 Y X NaN
Desired output:
X1 COL-1 COL-2 COL-3 Y1
------------------------------
A X Y Z 146#12
B Z - - 223#13
C Y X - 725#14
Script
import pandas as pd
import numpy as np

df = pd.read_csv(r"<PATH TO YOUR CSV>")

for row, item in enumerate(df["COL"]):
    l = item.split(",")
    for idx, elem in enumerate(l):
        col = "COL-%s" % idx
        if col not in df.columns:
            df[col] = np.nan
        df[col][row] = elem

df = df.drop(columns=["COL"])
print(df)
Use DataFrame.pop:
df['Y1'] = df.pop('Y1')
pop removes the column and returns it, so reassigning it appends it as the last column. The splitting itself is better done with Series.str.split:
df = df.join(df.pop('COL').str.split(',', expand=True)
               .fillna('-')
               .rename(columns=lambda x: f'COL-{x+1}'))
df['Y1'] = df.pop('Y1')
print(df)
X1 COL-1 COL-2 COL-3 Y1
0 A X Y Z 146#12
1 B Z - - 223#13
2 C Y X - 725#14
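To see what pop is doing, a tiny sketch with a hypothetical three-column frame:

import pandas as pd

toy = pd.DataFrame({'A': [1], 'Y1': [2], 'B': [3]})
toy['Y1'] = toy.pop('Y1')    # Y1 is removed, then re-added at the end
print(toy.columns.tolist())  # ['A', 'B', 'Y1']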
If you wish to replace the NaN values with dashes, you can use fillna(); to keep the columns in the order you specified, you can simply select the dataframe with that column order:
df_output = df[['X1','COL-1','COL-2','COL-3','Y1']].fillna(value='-')
Not the most elegant of methods, but this should handle your real data and intended result:
import re

cols = df.filter(like='COL').columns.tolist()
pat = r'(\w+)'
new_cols = [f'{re.match(pat, col).group(0)} {i}' for i, col in enumerate(cols, 1)]
df.rename(columns=dict(zip(cols, new_cols)), inplace=True)
df['Y1'] = df.pop('Y1')
out:
X1 COL 1 COL 2 COL 3 Y1
0 A X Y Z 146#12
1 B Z NaN NaN 223#13
2 C Y X NaN 725#14

how to iterate a Series with multiindex in pandas

I am a beginner with pandas, and I want to implement a decision tree algorithm with pandas. First, I read the test data into a pandas.DataFrame, like below:
In [4]: df = pd.read_csv('test.txt', sep = '\t')
In [5]: df
Out[5]:
Chocolate Vanilla Strawberry Peanut
0 Y N Y Y
1 N Y Y N
2 N N N N
3 Y Y Y Y
4 Y Y N Y
5 N N N N
6 Y Y Y Y
7 N Y N N
8 Y N Y N
9 Y N Y Y
Then I group by 'Peanut' and 'Chocolate', and what I get is:
In [15]: df2 = df.groupby(['Peanut', 'Chocolate'])
In [16]: serie1 = df2.size()
In [17]: serie1
Out[17]:
Peanut Chocolate
N N 4
Y 1
Y Y 5
dtype: int64
Now the type of serie1 is Series. I can access the values of serie1, but I cannot get the values of 'Peanut' and 'Chocolate'. How can I get the count in serie1 and the values of 'Peanut' and 'Chocolate' at the same time?
You can use the index:
>>> serie1.index
MultiIndex(levels=[[u'N', u'Y'], [u'N', u'Y']],
labels=[[0, 0, 1], [0, 1, 1]],
names=[u'Peanut', u'Chocolate'])
You can obtain the index names and the level values from it. Note that each entry in labels is a position into the corresponding levels array. So, for example, for 'Peanut' the first value is levels[0][labels[0][0]], which is 'N'; the last value of 'Chocolate' is levels[1][labels[1][2]], which is 'Y'.
I created a small example which loops through the indexes and prints all data:
# loop the rows
for i in range(len(serie1)):
    print("Row", i, "Value", serie1.iloc[i], end=" ")
    # loop the index columns
    for j in range(len(serie1.index.names)):
        print("Column", serie1.index.names[j],
              "Value", serie1.index.levels[j][serie1.index.labels[j][i]],
              end=" ")
    print()
Which results in:
Row 0 Value 4 Column Peanut Value N Column Chocolate Value N
Row 1 Value 1 Column Peanut Value N Column Chocolate Value Y
Row 2 Value 5 Column Peanut Value Y Column Chocolate Value Y
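In recent pandas versions, MultiIndex.labels has been renamed to MultiIndex.codes, and there is a simpler route: Series.items() yields (index, value) pairs, where the index is already the tuple of level values. A minimal sketch against the same serie1:

for (peanut, chocolate), count in serie1.items():
    print(f"Peanut {peanut} Chocolate {chocolate} Count {count}")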