How to get the correct font size in PDFBox

PDF content stream
0.750000 0.000000 0.000000 -0.750000 0.000000 841.920044 cm
q
0.367090 0.000000 0.000000 0.367090 0.000000 0.000000 cm
0.000000 0.000000 0.000000 rg
0.000000 0.000000 0.000000 RG
0.410 w
BT
2 Tr
/F1 40.959999 Tf
1 0 0.000000 -1 847.679993 158.720001 Tm
[<3581>-10.000000<043B>-10.000000<18C5>-20.000000<4374>-10.000000<3635><084D>-20.000000<2195>-10.000000<477D>-10.000000<0B5E>-10.000000<1C3E>-10.000000<34F6>-10.000000<3E98>-20.000000<0003>] TJ
ET
/F1 40.959999 Tf means the PDF selects font F1 and sets the font size to 40.959999.
I have a question about whether the actual font size is 40.959999 or not. A font size of 40 would be very large, but the text shown in Adobe Acrobat Pro is not that large.
I get the font size by calling TextPosition.getFontSizeInPt() (using PDFBox); it returns 40.96.
I think this is not correct.
Can anybody tell me how to get the correct font size?
Do I need to consider the '0.750000 0.000000 0.000000 -0.750000 0.000000 841.920044 cm' operator?
What I have found so far about getting the font size using PDFBox:
TextPosition.getFontSize() returns only the first value, i.e. the raw operand of the Tf operator.
TextPosition.getFontSizeInPt() returns something like that value scaled by the matrices.
That still does not make sense for this PDF.
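So yes, the cm operators have to be taken into account. As a rough sanity check (a sketch only, assuming the scale factors of the text matrix and both cm operators simply multiply), the effective rendered size in this stream works out to about 11.3 pt:

# Values read off the content stream above
tf_size = 40.959999      # /F1 40.959999 Tf
tm_scale = 1.0           # 1 0 0.000000 -1 ... Tm  (scale 1, vertical flip)
outer_cm = 0.75          # 0.750000 0 0 -0.750000 ... cm
inner_cm = 0.367090      # 0.367090 0 0 0.367090 ... cm

effective_pt = tf_size * tm_scale * outer_cm * inner_cm
print(round(effective_pt, 2))  # 11.28

That would explain why the text in Acrobat does not look anywhere near 40 pt.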

public class PDFCustomTextStripper extends PDFTextStripper {

    /**
     * Maps each TextPosition to the graphics state that was active when it was processed.
     */
    private final Map<TextPosition, PDGraphicsState> textPositionPDGraphicsStates = new HashMap<>();

    @Override
    protected void processTextPosition(TextPosition text) {
        textPositionPDGraphicsStates.put(text, getGraphicsState());
        ......
    }
}

public float getActualFontSize() {
    final float fontSizeInPt = getTextPosition().getFontSizeInPt();
    try {
        final Matrix ctm = getPdGraphicsState().getCurrentTransformationMatrix();
        return Math.min(Math.abs(ctm.getScaleX() * fontSizeInPt),
                        Math.abs(ctm.getScaleY() * fontSizeInPt));
    } catch (Exception e) {
        return fontSizeInPt;
    }
}
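Taking the minimum of the two scaled axes is presumably a defensive choice for non-uniform CTMs. For the stream above, both axes scale by 0.75 * 0.36709 ≈ 0.275, so getActualFontSize() returns roughly 40.96 * 0.275 ≈ 11.3 pt either way, matching the sanity check above.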

Related

How to drop values for all rows in pandas

I have data that looks like this:
protein IHD CM ARR VD CHD CCD VOO
0 q9uku9 0.000000 0.039457 0.032901 0.014793 0.006614 0.006591 0.000000
1 o75461 0.000000 0.005832 0.027698 0.000000 0.000000 0.006634 0.000000
There are thousands of rows of proteins. However, I want to drop the rows where all of the disease values in the row are less than 0.01. How do I do this?
You can use loc in combination with any. Basically you want to keep all rows where any value is greater than or equal to 0.01. Note: I adjusted your example so that the second protein has all values < 0.01.
import pandas as pd

df = pd.DataFrame([
    ['q9uku9', 0.000000, 0.039457, 0.032901, 0.014793, 0.006614, 0.006591, 0.000000],
    ['o75461', 0.000000, 0.005832, 0.007698, 0.000000, 0.000000, 0.006634, 0.000000]
], columns=['protein', 'IHD', 'CM', 'ARR', 'VD', 'CHD', 'CCD', 'VOO'])
df = df.set_index('protein')
df_filtered = df.loc[(df >= 0.01).any(axis=1)]
Which gives:
IHD CM ARR VD CHD CCD VOO
protein
q9uku9 0.0 0.039457 0.032901 0.014793 0.006614 0.006591 0.0
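Another option, shown as a console session: keep every row that is not entirely below 0.01 by negating an all over the float columns.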
>>> df
protein IHD CM ARR VD CHD CCD VOO
0 q9uku9 0.0 0.039457 0.032901 0.014793 0.006614 0.006591 0.0
1 o75461 0.0 0.005832 0.027698 0.000000 0.000000 0.006634 0.0
2 d4acr8 0.0 0.001490 0.003920 0.000000 0.000000 0.009393 0.0
>>> df.loc[~(df.select_dtypes(float) < 0.01).all(axis="columns")]
protein IHD CM ARR VD CHD CCD VOO
0 q9uku9 0.0 0.039457 0.032901 0.014793 0.006614 0.006591 0.0
1 o75461 0.0 0.005832 0.027698 0.000000 0.000000 0.006634 0.0

Finding the distances from each point to the rest, looping

I am new to Python.
I have a CSV file containing 400 pairs of x and y in two columns.
I want to loop over the data so that it starts from a pair (x_i, y_i) and finds the distance between that pair and the remaining 399 points. I want the process to be repeated for all pairs (x_i, y_i), with each result appended to a list Dist_i.
import numpy as np
import pandas as pd

x_y_data = pd.read_csv("x_y_points400_labeled_csv.csv")
x = x_y_data.loc[:, 'x']
y = x_y_data.loc[:, 'y']

i = 0
j = 0
while i < len(x):
    Dist = np.sqrt((x[i] - x)**2 + (y[j] - y)**2)
    i = 1 + i
    j = 1 + j
print(Dist)
output:
0 676.144955
1 675.503342
2 674.642602
..
396 9.897127
397 21.659654
398 15.508062
399 0.000000
Length: 400, dtype: float64
This is how far I have gotten, but it is not what I intend to obtain. My goal is to get something like the distance matrix shown in the attached picture.
Thanks for your help in advance.
You can use broadcasting (arr[:, None]) to do this calculation all at once; this gives you the full symmetric matrix, i.e. the repeated calculations you want. Otherwise, scipy.spatial.distance.pdist gives you just the upper triangle of the calculations.
Sample Data
import pandas as pd
import numpy as np

np.random.seed(123)
N = 6
df = pd.DataFrame(np.random.normal(0, 1, (N, 2)),
                  columns=['X', 'Y'],
                  index=[f'point{i}' for i in range(N)])

x = df['X'].to_numpy()
y = df['Y'].to_numpy()
result = pd.DataFrame(np.sqrt((x[:, None] - x)**2 + (y[:, None] - y)**2),
                      index=df.index,
                      columns=df.index)
point0 point1 point2 point3 point4 point5
point0 0.000000 2.853297 0.827596 1.957709 3.000780 1.165343
point1 2.853297 0.000000 3.273161 2.915990 1.172704 1.708145
point2 0.827596 3.273161 0.000000 2.782669 3.121463 1.749023
point3 1.957709 2.915990 2.782669 0.000000 3.718481 1.779459
point4 3.000780 1.172704 3.121463 3.718481 0.000000 2.092455
point5 1.165343 1.708145 1.749023 1.779459 2.092455 0.000000
With scipy.
from scipy.spatial.distance import pdist
pdist(df[['X', 'Y']])
array([2.8532972 , 0.82759587, 1.95770875, 3.00078036, 1.16534282,
3.27316125, 2.91598992, 1.17270443, 1.70814458, 2.78266933,
3.1214628 , 1.74902298, 3.7184812 , 1.77945856, 2.09245472])
To turn this into the above DataFrame.
L = len(df)
arr = np.zeros((L, L))
arr[np.triu_indices(L, 1)] = pdist(df[['X', 'Y']])
arr = arr + arr.T # Lower triangle b/c symmetric
pd.DataFrame(arr, index=df.index, columns=df.index)
point0 point1 point2 point3 point4 point5
point0 0.000000 2.853297 0.827596 1.957709 3.000780 1.165343
point1 2.853297 0.000000 3.273161 2.915990 1.172704 1.708145
point2 0.827596 3.273161 0.000000 2.782669 3.121463 1.749023
point3 1.957709 2.915990 2.782669 0.000000 3.718481 1.779459
point4 3.000780 1.172704 3.121463 3.718481 0.000000 2.092455
point5 1.165343 1.708145 1.749023 1.779459 2.092455 0.000000
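As an aside, scipy also ships squareform, which expands the condensed pdist vector into the same full symmetric matrix in one call (reusing the df from the sample data above):

from scipy.spatial.distance import pdist, squareform

# squareform turns the condensed distance vector into the full
# symmetric square matrix with zeros on the diagonal.
full = pd.DataFrame(squareform(pdist(df[['X', 'Y']])),
                    index=df.index, columns=df.index)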

Mixture prior not working in JAGS, only when likelihood term included

The code at the bottom will replicate the problem; just copy and paste it into R.
What I want is for the mean and precision to be (-100, 100) 30% of the time, and (200, 1000) 70% of the time. Think of the means, precisions, and mixture weights as lined up in a, b, and p.
So 'pick' should be 1 30% of the time, and 2 70% of the time.
What actually happens is that on every iteration, pick is 2 (or 1 if the first element of p is the larger one). You can see this in the summary, where the quantiles for 'pick', 'testa', and 'testb' remain unchanged throughout. The strangest thing is that if you remove the likelihood loop, pick then works exactly as intended.
I hope this explains the problem; if not, let me know. It's my first time posting, so I'm bound to have messed things up.
library(rjags)
n = 10
y <- rnorm(n, 5, 10)
a = c(-100, 200)
b = c(100, 1000)
p = c(0.3, 0.7)
## Model
mod_str = "model{
# Likelihood
for (i in 1:n){
y[i] ~ dnorm(mu, 10)
}
# ISSUE HERE: MIXTURE PRIOR
mu ~ dnorm(a[pick], b[pick])
pick ~ dcat(p[1:2])
testa = a[pick]
testb = b[pick]
}"
model = jags.model(textConnection(mod_str), data = list(y = y, n=n, a=a, b=b, p=p), n.chains=1)
update(model, 10000)
res = coda.samples(model, variable.names = c('pick', 'testa', 'testb', 'mu'), n.iter = 10000)
summary(res)
I think you are having problems for a couple of reasons. First, the data that you have supplied to the model (i.e., y) is not a mixture of normal distributions. As a result, the model itself has no need to mix. I would instead generate data something like this:
set.seed(320)
# number of samples
n <- 10
# Because it is a mixture of 2 we can just use an indicator variable.
# here, pick (in the long run), would be '1' 30% of the time.
pick <- rbinom(n, 1, p[1])
# generate the data. b is in terms of precision so we are converting this
# to standard deviations (which is what R wants).
y_det <- pick * rnorm(n, a[1], sqrt(1/b[1])) + (1 - pick) * rnorm(n, a[2], sqrt(1/b[2]))
# add a small amount of noise, can change to be more as necessary.
y <- rnorm(n, y_det, 1)
These data look more like what you would want to supply to a mixture model.
Following this, I would code the model up in a similar way to how I generated the data: I want some indicator variable to jump between the two normal distributions, so mu may change for each scalar in y.
mod_str = "model{
# Likelihood
for (i in 1:n){
y[i] ~ dnorm(mu[i], 10)
mu[i] <- mu_ind[i] * a_mu + (1 - mu_ind[i]) * b_mu
mu_ind[i] ~ dbern(p[1])
}
a_mu ~ dnorm(a[1], b[1])
b_mu ~ dnorm(a[2], b[2])
}"
model = jags.model(textConnection(mod_str), data = list(y = y, n=n, a=a, b=b, p=p), n.chains=1)
update(model, 10000)
res = coda.samples(model, variable.names = c('mu_ind', 'a_mu', 'b_mu'), n.iter = 10000)
summary(res)
2.5% 25% 50% 75% 97.5%
a_mu -100.4 -100.3 -100.2 -100.1 -100
b_mu 199.9 200.0 200.0 200.0 200
mu_ind[1] 0.0 0.0 0.0 0.0 0
mu_ind[2] 1.0 1.0 1.0 1.0 1
mu_ind[3] 0.0 0.0 0.0 0.0 0
mu_ind[4] 1.0 1.0 1.0 1.0 1
mu_ind[5] 0.0 0.0 0.0 0.0 0
mu_ind[6] 0.0 0.0 0.0 0.0 0
mu_ind[7] 1.0 1.0 1.0 1.0 1
mu_ind[8] 0.0 0.0 0.0 0.0 0
mu_ind[9] 0.0 0.0 0.0 0.0 0
mu_ind[10] 1.0 1.0 1.0 1.0 1
If you supplied more data, you would (in the long run) have the indicator variable mu_ind take the value of 1 30% of the time. If you had more than 2 distributions you could instead use dcat. Thus, an alternative and more generalized way of doing this would be (and I am borrowing heavily from this post by John Kruschke):
mod_str = "model {
# Likelihood:
for( i in 1 : n ) {
y[i] ~ dnorm( mu[i] , 10 )
mu[i] <- muOfpick[ pick[i] ]
pick[i] ~ dcat( p[1:2] )
}
# Prior:
for ( i in 1:2 ) {
muOfpick[i] ~ dnorm( a[i] , b[i] )
}
}"
model = jags.model(textConnection(mod_str), data = list(y = y, n=n, a=a, b=b, p=p), n.chains=1)
update(model, 10000)
res = coda.samples(model, variable.names = c('pick', 'muOfpick'), n.iter = 10000)
summary(res)
2.5% 25% 50% 75% 97.5%
muOfpick[1] -100.4 -100.3 -100.2 -100.1 -100
muOfpick[2] 199.9 200.0 200.0 200.0 200
pick[1] 2.0 2.0 2.0 2.0 2
pick[2] 1.0 1.0 1.0 1.0 1
pick[3] 2.0 2.0 2.0 2.0 2
pick[4] 1.0 1.0 1.0 1.0 1
pick[5] 2.0 2.0 2.0 2.0 2
pick[6] 2.0 2.0 2.0 2.0 2
pick[7] 1.0 1.0 1.0 1.0 1
pick[8] 2.0 2.0 2.0 2.0 2
pick[9] 2.0 2.0 2.0 2.0 2
pick[10] 1.0 1.0 1.0 1.0 1
The link above includes even more priors (e.g., a Dirichlet prior on the probabilities incorporated into the Categorical distribution).

pandas iterate over 3 data frames element-wise into a function

I wrote:
def revertcheck(basevalue, first, second):
    if basevalue == 1:
        return 0
    elif basevalue > first and first > second:
        return -abs(first - second)
    elif basevalue < first and first < second:
        return -abs(first - second)
    else:
        return abs(first - second)
and now I have three same-sized correlation matrices of type pandas.core.frame.DataFrame.
I want to iterate over every element and feed the three corresponding values into my function, one element at a time. Can someone give me a hint how to do that?
AAPL AMZN BAC GE GM GOOG GS SNP XOM
AAPL 1.000000 0.567053 0.410656 0.232328 0.562110 0.616592 0.800797 -0.139989 0.147852
AMZN 0.567053 1.000000 -0.012830 0.071066 0.271695 0.715317 0.146355 -0.861710 -0.015936
BAC 0.410656 -0.012830 1.000000 0.953016 0.958784 0.680979 0.843638 0.466912 0.942582
GE 0.232328 0.071066 0.953016 1.000000 0.935008 0.741110 0.667574 0.308813 0.995237
GM 0.562110 0.271695 0.958784 0.935008 1.000000 0.857678 0.857719 0.206432 0.899904
GOOG 0.616592 0.715317 0.680979 0.741110 0.857678 1.000000 0.632255 -0.326059 0.675568
GS 0.800797 0.146355 0.843638 0.667574 0.857719 0.632255 1.000000 0.373738 0.623147
SNP -0.139989 -0.861710 0.466912 0.308813 0.206432 -0.326059 0.373738 1.000000 0.369004
XOM 0.147852 -0.015936 0.942582 0.995237 0.899904 0.675568 0.623147 0.369004 1.000000
Let's assume basevalue, first and second are your three DataFrames of exactly the same size and structure. Then you can do what you want in a vectorised manner. Note that the basevalue == 1 mask is applied last, so the other two conditions cannot overwrite the cells that should be 0:
output = abs(first - second)
output = output.mask((basevalue > first) & (first > second), -abs(first - second))
output = output.mask((basevalue < first) & (first < second), -abs(first - second))
output = output.mask(basevalue == 1, 0)
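A quick check with three tiny, hypothetical aligned frames (made-up numbers, not the asker's data):

import pandas as pd

basevalue = pd.DataFrame([[1.0, 0.5], [0.2, 0.9]])
first     = pd.DataFrame([[0.3, 0.6], [0.4, 0.7]])
second    = pd.DataFrame([[0.1, 0.8], [0.6, 0.2]])

output = abs(first - second)
output = output.mask((basevalue > first) & (first > second), -abs(first - second))
output = output.mask((basevalue < first) & (first < second), -abs(first - second))
output = output.mask(basevalue == 1, 0)
print(output)
#      0    1
# 0  0.0 -0.2
# 1 -0.2 -0.5

Each cell matches what revertcheck would return for that triple of values.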

Get object indices

I have a mesh loaded from an .obj file:
o Plane_Plane.002
v 1.000000 0.000000 1.000000
v -1.000000 0.000000 1.000000
v 1.000000 0.000000 -1.000000
v -1.000000 0.000000 -1.000000
vt 0.000100 0.000100
vt 0.999900 0.000100
vt 0.999900 0.999900
vt 0.000100 0.999900
vn 0.000000 1.000000 0.000000
usemtl None
s off
f 2/1/1 1/2/1 3/3/1
f 4/4/1 2/1/1 3/3/1
and I create a vertex buffer with the data order:
PosX, PosY, PosZ, NormX, NormY, NormZ, TexX, TexY
Now, do I have to generate indices to draw this plane like 0,1,2,0,2,3, or like 0,1,2,3,4,5, since I already created 6 vertices in my vertex buffer? I'm really confused here :(
You can use a map with the vertex as the key and the index as the value.
Starting from int counter = 0, traverse all vertices: if the map does not yet contain the vertex, assign it the next index with map[vertex] = counter++; otherwise reuse the existing index via index = map[vertex].
Of course you will have to overload the < operator for whatever type you use as the vertex, since the map expects its keys to be comparable.
Here is sample code using a map to unify vertices:
#include <iostream>
#include <map>
using namespace std;
struct Point
{
float x;
float y;
float z;
Point()
{
}
Point(float _x, float _y, float _z)
{
x = _x;
y = _y;
z = _z;
}
bool operator<( const Point& p ) const
{
if(x < p.x)
return true;
if(y < p.y)
return true;
if(z < p.z)
return true;
return false;
}
};
void main()
{
Point p[4];
p[0] = Point(0,0,0);
p[1] = Point(1,1,1);
p[2] = Point(0,0,0);
p[3] = Point(1,1,1);
std::map<Point, int> indicesMap;
int counter = 0;
for(int i=0;i<4;i++)
{
if(indicesMap.find(p[i]) == indicesMap.cend()) // new vertex
{
indicesMap[p[i]] = counter++;
}
}
for(int i=0;i<4;i++)
{
std::cout << indicesMap[p[i]] << std::endl;
}
}
The output will be 0 1 0 1 (one index per line), as p[2] equals p[0] and p[3] equals p[1].
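Applied to the plane above: the six face corners in the two f lines collapse to four unique position/texcoord/normal combinations (2/1/1, 1/2/1, 3/3/1, 4/4/1), so after unifying you would upload four vertices and an index list such as 0,1,2, 3,0,2 instead of six vertices indexed 0,1,2,3,4,5.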