Convert a degree 3 cubic Nurbs Curve to Catmull-Rom? - bezier

Is there a way to convert a degree 3 cubic Nurbs curve to a Catmull-Rom curve?
The Nurbs curve has a standard knot vector, so for example a curve with 10 control points has these 12 knots:
[ 0 0 0 1 2 3 4 5 6 7 7 7 ]
I assume the resulting Catmull-Rom curve would have 12 control points? Or even more..?
If it is not possible to one-to-one convert, is there a good algorithm to get at least a very close match?

Related

Computing JaroWinkler Similarity for unordered and different sized dataframes

I have two dataframes extracted from two attached files.
I want to compute JaroWinkler Similarity for tokens inside the files. I am using below code.
from similarity.jarowinkler import JaroWinkler
jarowinkler = JaroWinkler()
df_gt['jarowinkler_sim'] = [jarowinkler.similarity(x.lower(), y.lower()) for x, y in zip(df_ex['abstract_ex'], df_gt['abstract_gt'])]
I am facing two problems:
1. The order of the tokens is not handled.
When the positions of the tokens 'can' and 'interesting' are swapped in one file, the similarity is computed incorrectly:
Unnamed: 0 abstract_gt jarowinkler_sim
0 0 Bipartite 1.000000
1 1 fluctuations 0.914141
2 2 can 0.474747 <--|
3 3 provide 1.000000 |-- Position swapped in one file
4 4 interesting 0.474747 <--|
5 5 information 1.000000
6 6 about 1.000000
7 7 entanglement 1.000000
8 8 properties 1.000000
9 9 and 1.000000
10 10 correlations 1.000000
2. The two dataframes might not always be the same size.
When one of the dataframes contains fewer elements, my solution raises an error:
raise ValueError(
ValueError: Length of values (10) does not match length of index (11)
How can I solve these two problems and compute the similarity accurately?
Thanks !!
TSV FILES
1. df_ex
abstract_ex
0 Bipartite
1 fluctuations
2 interesting
3 provide
4 can
5 information
6 about
7 entanglement
8 properties
9 and
10 correlations
df_gt
abstract_gt
0 Bipartite
1 fluctuations
2 interesting
3 provide
4 can
5 information
6 about
7 entanglement
8 properties
9 and
10 correlations
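A minimal sketch of one way to handle both problems, assuming each ground-truth token should be scored against its best-matching extracted token regardless of position. The jaro_winkler function below is a plain-Python stand-in (standard formulation, prefix scale 0.1) so the sketch runs without the similarity package; its values may differ slightly from the library's. best_matches is a hypothetical helper name, not part of any library.

```python
def jaro(s1, s2):
    """Plain Jaro similarity between two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * prefix_scale * (1 - j)


def best_matches(gt_tokens, ex_tokens):
    """For each ground-truth token, find the most similar extracted token.

    The two lists may differ in length and order; the result always has
    one row per ground-truth token.
    """
    rows = []
    for gt in gt_tokens:
        if not ex_tokens:
            rows.append((gt, None, 0.0))
            continue
        best = max(ex_tokens, key=lambda ex: jaro_winkler(gt.lower(), ex.lower()))
        rows.append((gt, best, jaro_winkler(gt.lower(), best.lower())))
    return rows


gt = ["Bipartite", "fluctuations", "can", "provide", "interesting"]
ex = ["Bipartite", "fluctuations", "interesting", "provide"]  # shorter, reordered
for token, match, score in best_matches(gt, ex):
    print(f"{token:15s} {str(match):15s} {score:.3f}")
```

Because the result has exactly one row per ground-truth token, the scores can be assigned to a column of df_gt without the length mismatch. The cost is a comparison of every token pair, which is fine for short token lists.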

calculate the mean of a column based on the label in a pandas dataframe [duplicate]

I am new to Python and am having trouble with a pandas dataframe. I have three columns: x1, x2 and label. I want to find the mean of x1 over the rows whose label is 'positive'. The dataframe looks like this. Can someone help me with this?
x1 x2 label
0 5 2 positive
1 6 1 positive
2 7 3 positive
3 7 5 positive
4 8 10 positive
5 9 3 positive
6 0 4 negative
7 1 8 negative
8 2 6 negative
9 4 10 negative
10 5 9 negative
11 6 11 negative
You may want to look at df.loc[] after filtering with df['label'].eq('positive'):
df.loc[df['label'].eq('positive'),'x1'].mean()
You can do it using boolean indexing as follows:
df.loc[df['label'] == 'positive', 'x1'].mean()
or alternatively
df.loc[df['label'].isin(['positive']), 'x1'].mean()
The boolean indexing array is True for the rows whose label is 'positive'. x1 is just the name of the column to compute the mean over.

Internal node predictions of xgboost model

Is it possible to calculate the internal node predictions of an xgboost model? The R package, gbm, provides a prediction for internal nodes of each tree.
The xgboost output, however only shows predictions for the final leaves of the model.
xgboost output:
Notice that the Quality column has the final prediction for the leaf node in row 6. I would like that value for each of the internal nodes as well.
Tree Node ID Feature Split Yes No Missing Quality Cover
1: 0 0 0-0 Sex=female 0.50000 0-1 0-2 0-1 246.6042790 222.75
2: 0 1 0-1 Age 13.00000 0-3 0-4 0-4 22.3424225 144.25
3: 0 2 0-2 Pclass=3 0.50000 0-5 0-6 0-5 60.1275253 78.50
4: 0 3 0-3 SibSp 2.50000 0-7 0-8 0-7 23.6302433 9.25
5: 0 4 0-4 Fare 26.26875 0-9 0-10 0-9 21.4425507 135.00
6: 0 5 0-5 Leaf NA <NA> <NA> <NA> 0.1747126 42.50
R gbm output:
In the R gbm package output, the prediction column contains values for both leaf nodes (SplitVar == -1) and the internal nodes. I would like access to these values from the xgboost model
SplitVar SplitCodePred LeftNode RightNode MissingNode ErrorReduction Weight Prediction
0 1 0.000000000 1 8 15 32.564591 445 0.001132514
1 2 9.500000000 2 3 7 3.844470 282 -0.085827382
2 -1 0.119585850 -1 -1 -1 0.000000 15 0.119585850
3 0 1.000000000 4 5 6 3.047926 207 -0.092846157
4 -1 -0.118731665 -1 -1 -1 0.000000 165 -0.118731665
5 -1 0.008846912 -1 -1 -1 0.000000 42 0.008846912
6 -1 -0.092846157 -1 -1 -1 0.000000 207 -0.092846157
Question:
How do I access or calculate predictions for the internal nodes of an xgboost model? I would like to use them for a greedy, poor man's version of SHAP scores.
The solution to this problem is to dump the xgboost model as JSON with with_stats=True. That adds the cover statistic to the output, which can be used to propagate the leaf values up through the internal nodes:
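# AnyNode, Leaf and Node are the answerer's own tree classes (not shown);
# float32 is presumably numpy.float32.
def _calculate_contribution(node: AnyNode) -> float32:
    if isinstance(node, Leaf):
        # A leaf contributes its own value.
        return node.contrib
    else:
        # Internal node: cover-weighted average of the two children.
        return (
            node.left.cover * Node._calculate_contribution(node.left)
            + node.right.cover * Node._calculate_contribution(node.right)
        ) / node.cover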
The internal contribution is the weighted average of the child contributions. Using this method, the generated results exactly match those returned when calling the predict method with pred_contribs=True and approx_contribs=True.
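A minimal sketch of the same cover-weighted recursion applied directly to one tree of the JSON dump, assuming the nested-dict layout with nodeid, cover, children and leaf keys that the dump produces when statistics are included. internal_values is a hypothetical helper name, and the toy tree is made up for illustration.

```python
def internal_values(node, values=None):
    """Map every nodeid in one dumped tree to its (propagated) value.

    Leaves keep their own value; each internal node gets the
    cover-weighted average of its children's values.
    """
    if values is None:
        values = {}
    if "leaf" in node:
        values[node["nodeid"]] = node["leaf"]
    else:
        for child in node["children"]:
            internal_values(child, values)
        weighted = sum(c["cover"] * values[c["nodeid"]] for c in node["children"])
        values[node["nodeid"]] = weighted / node["cover"]
    return values


# Toy tree in the assumed layout (numbers are illustrative only).
tree = {
    "nodeid": 0, "cover": 10.0,
    "children": [
        {"nodeid": 1, "leaf": 0.5, "cover": 4.0},
        {
            "nodeid": 2, "cover": 6.0,
            "children": [
                {"nodeid": 3, "leaf": -1.0, "cover": 2.0},
                {"nodeid": 4, "leaf": 0.5, "cover": 4.0},
            ],
        },
    ],
}

for nodeid, value in sorted(internal_values(tree).items()):
    print(nodeid, value)
```

With a real model, each string returned by the JSON dump would first be parsed with json.loads before being passed to the helper.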

KMeans clustering for the following mixed variable data

Can somebody help me with this problem?
I'm learning KMeans clustering concepts. I know how to cluster if the variables are continuous. But this data set contains categorical/discrete variables like gender and zip code.
Sno Age Gender Zip Salary
1 26 0 9822 100
2 38 1 9822 700
3 19 1 9822 100
4 64 0 9810 2500
5 53 1 9810 1200
6 75 1 9810 1800
7 19 0 9822 75
8 36 1 9822 350
9 42 1 9875 1800
10 41 0 9875 750
K-means works only with numerical data.
It fails for categorical data because the mean of a categorical variable is not meaningful, and neither is the Euclidean distance between category codes: a zip code of 9875 is not 'farther' from 9810 than 9822 is in any useful sense. Some people run K-means on one-hot encoded data, but this often does not give good clusters either.
To solve this kind of problem you can look at a variation of K-means called the k-prototypes algorithm, which is designed for a mix of categorical and numerical data.
Check out https://pypi.python.org/pypi/kmodes/
This link contains the paper and the python package for using this algorithm. It's easy to understand as well.
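To make the idea concrete, here is a small sketch of the mixed dissimilarity that k-prototypes is built on: squared Euclidean distance on the numeric columns plus a weight gamma times the number of mismatching categorical columns. The function name and the gamma value are assumptions chosen for illustration; this is not the kmodes package's API.

```python
def kprototypes_dissimilarity(a_num, b_num, a_cat, b_cat, gamma=1.0):
    """Mixed dissimilarity between two records.

    a_num, b_num: numeric features (ideally scaled to comparable ranges).
    a_cat, b_cat: categorical features, compared only for equality.
    gamma: weight of the categorical part relative to the numeric part.
    """
    numeric_part = sum((x - y) ** 2 for x, y in zip(a_num, b_num))
    mismatches = sum(x != y for x, y in zip(a_cat, b_cat))
    return numeric_part + gamma * mismatches


# Rows 1 and 2 of the table: (Age, Salary) numeric, (Gender, Zip) categorical.
# The raw salary difference dominates, which is why scaling matters in practice.
d = kprototypes_dissimilarity((26, 100), (38, 700), (0, 9822), (1, 9822))
print(d)
```

Treating Zip as a category means two zip codes are either equal or not, so their numeric closeness plays no role.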

How to figure out the resolution (DPI) of images embedded in a PDF document?

I have a PDF document that also contains images.
Now I want to know the resolution of these images.
A first step would be to somehow get the images out of the PDF document. But how?
Is that even possible with something provided in Cocoa?
Have a look at this answer for your other question:
Can a PDF document contain images with different DPI?
Basically, you can now use the (new) -list parameter for Poppler's pdfimages commandline utility (it will NOT work for XPDF's version of pdfimages!).
It will report the dimensions of each image appearing on the queried pages.
(You can also use it to extract images from a PDF: pdfimages -png -f 3 -l 5 some.pdf prefix--- will extract all images as PNGs from the PDF file, starting with first page 3 and ending with last page 5, using a filename prefix of prefix--- for each image. But this problem seems to not be the main focus of your question...)
Example:
pdfimages -list -f 1 -l 3 /Users/kurtpfeifle/Downloads/ct-magazin-14-2012.pdf
page num type width height color comp bpc enc interp object ID
---------------------------------------------------------------------
1 0 image 1247 1738 rgb 3 8 jpx no 3053 0
2 1 image 582 839 gray 1 8 jpeg no 2080 0
2 2 image 344 364 gray 1 8 jpx no 2079 0
3 3 image 581 838 rgb 3 8 jpeg no 7 0
3 4 image 1088 776 rgb 3 8 jpx no 8 0
3 5 image 6 6 rgb 3 8 image no 9 0
3 6 image 8 6 rgb 3 8 image no 10 0
3 7 image 4 6 rgb 3 8 image no 11 0
3 8 image 212 106 rgb 3 8 jpx no 12 0
3 9 image 150 68 rgb 3 8 jpx no 13 0
3 10 image 6 6 rgb 3 8 image no 14 0
3 11 image 4 4 rgb 3 8 image no 15 0
It does not directly report the DPI resolution -- but from the 'width' and 'height' dimensions you can calculate it easily: you measure the width of the picture on your screen with an inch ruler and then divide the 'width pixels' by the measured ruler number...
You find this strange, because the result is dependent on your current zoom level? Yes, it is!
The concept of the 'resolution' is always dependent on the environment. A so-called 'hi-res' picture basically always has lots of pixels in width and height. This allows for better quality (or 'resolution') if the picture needs to be displayed or printed with higher zoom levels.
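The arithmetic behind this can be sketched in a few lines. The function name and the sample numbers are made up for illustration; the point is only that the same pixel count yields a different pixels-per-inch figure whenever the displayed size changes.

```python
def ppi_on_screen(width_px, displayed_width_inches):
    """Pixels per inch for an image shown at a given physical width."""
    return width_px / displayed_width_inches


# The same 600-pixel-wide image at two zoom levels:
print(ppi_on_screen(600, 4.0))  # shown 4 inches wide
print(ppi_on_screen(600, 8.0))  # zoomed to 8 inches wide: half the ppi
```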
Update
Meanwhile there is a new version of (Poppler's) pdfimages:
$ pdfimages -version
pdfimages version 0.33.0
[....]
This reports the resolution of embedded images as well, in PPI (pixels per inch), in horizontal (x-ppi) and vertical (y-ppi) directions:
page num type width height color comp bpc enc interp objectID x-ppi y-ppi size ratio
-------------------------------------------------------------------------------------
1 0 image 1247 1738 rgb 3 8 jpx no 3053 0 151 151 228K 3.6%
2 1 image 582 839 gray 1 8 jpeg no 2080 0 72 72 319B 0.1%
2 2 image 344 364 gray 1 8 jpx no 2079 0 150 150 4325B 3.5%
3 3 image 581 838 rgb 3 8 jpeg no 7 0 73 73 1980B 0.1%
3 4 image 1088 776 rgb 3 8 jpx no 8 0 150 151 106K 4.3%
3 5 image 6 6 rgb 3 8 image no 9 0 150 150 108B 100%
3 6 image 8 6 rgb 3 8 image no 10 0 150 150 158B 110%
3 7 image 4 6 rgb 3 8 image no 11 0 150 150 73B 101%
3 8 image 212 106 rgb 3 8 jpx no 12 0 150 150 2396B 3.6%
3 9 image 150 68 rgb 3 8 jpx no 13 0 150 150 1878B 6.1%
3 10 image 6 6 rgb 3 8 image no 14 0 150 150 81B 75%
3 11 image 4 4 rgb 3 8 image no 15 0 150 150 50B 104%
This new feature appeared first in Poppler version 0.25 (released Wed December 11, 2013). It additionally reports...
...(file) sizes and
...(compression) ratios
...of embedded images.
Limitations of pdfimages -list
Perhaps I should also make you aware of the limitations of the pdfimages utility, and give an example where its output report is not completely correct.
One example is this handcoded PDF from my (recently created) GitHub repository of PDFs to help beginners to study the syntax of PDF source code.
I originally created this PDF in order to demonstrate a bug with Mozilla's PDF.js renderer.
Here is a screenshot about how it looks in PDF.js (left) and how it should look when rendered correctly (right, rendered by Ghostscript and Adobe Reader):
The PDF file contains a 2x2 pixels image, embedded only once (with object ID 5 0), but displayed on the page multiple times with different settings, where each time the image is placed...
...at a different position,
...with a different scaling,
...with a different rotation,
...even with a different skew.
Under these extreme circumstances pdfimages -list falls flat on its face when trying to determine some of the resolutions for instances of this image:
page num type width height color comp bpc enc interp objectID x-ppi y-ppi size ratio
------------------------------------------------------------------------------------
1 0 image 2 2 rgb 3 8 image no 5 0 4 4 13B 108%
1 1 image 2 2 rgb 3 8 image no 5 0 5 3 13B 108%
1 2 image 2 2 rgb 3 8 image no 5 0 3 5 13B 108%
1 3 image 2 2 rgb 3 8 image no 5 0 6 3 13B 108%
1 4 image 2 2 rgb 3 8 image no 5 0 3 10 13B 108%
1 5 image 2 2 rgb 3 8 image no 5 0 4 72000 13B 108%
1 6 image 2 2 rgb 3 8 image no 5 0 4 2 13B 108%
1 7 image 2 2 rgb 3 8 image no 5 0 2 4 13B 108%
1 8 image 2 2 rgb 3 8 image no 5 0 14401 1 13B 108%
1 9 image 2 2 rgb 3 8 image no 5 0 1 2 13B 108%
1 10 image 2 2 rgb 3 8 image no 5 0 0.950 4 13B 108%
1 11 image 2 2 rgb 3 8 image no 5 0 4 0.950 13B 108%
1 12 image 2 2 rgb 3 8 image no 5 0 0.950 4 13B 108%
1 13 image 2 2 rgb 3 8 image no 5 0 1 4 13B 108%
1 14 image 2 2 rgb 3 8 image no 5 0 0.950 4 13B 108%
1 15 image 2 2 rgb 3 8 image no 5 0 0.950 4 13B 108%
1 16 image 2 2 rgb 3 8 image no 5 0 4 0.950 13B 108%
pdfimages -list gets most values correct if there is no rotation and no skewing involved. It is no wonder that there are discrepancies if the image is rotated or skewed: how would you even reliably define an x-ppi and y-ppi value for such cases? That explains the (completely wrong) values of 72000 y-ppi for image no. 5 and 14401 x-ppi for image no. 8.
As you can easily see, pdfimages is rather clever for determining other image properties:
It correctly reports the same object ID 5 0 for all instances of the displayed image, indicating that this image is embedded once, but displayed multiple times on the page.
It correctly reports the image dimensions to be 2x2 pixels.
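As an illustration of why rotation makes the figures ambiguous, here is a sketch of one possible definition of x-ppi and y-ppi. It assumes the image's unit square is placed on the page by a matrix whose a, b, c, d entries are in points, and measures the lengths of the transformed edges. This is an assumption for illustration, not a description of how pdfimages computes its values.

```python
import math


def image_ppi(width_px, height_px, matrix_abcd):
    """Pixels per inch along the image's own (possibly rotated) edges.

    matrix_abcd: the (a, b, c, d) part of the placement matrix, in points.
    The image's width edge maps to the vector (a, b), its height edge to (c, d).
    """
    a, b, c, d = matrix_abcd
    placed_width_pt = math.hypot(a, b)
    placed_height_pt = math.hypot(c, d)
    return (width_px * 72.0 / placed_width_pt,
            height_px * 72.0 / placed_height_pt)


# A 2x2 image placed as a 36 pt square, unrotated and rotated by 90 degrees.
print(image_ppi(2, 2, (36, 0, 0, 36)))
print(image_ppi(2, 2, (0, 36, -36, 0)))
```

Measured this way, rotation alone does not change the result. A calculation that looks only at the a and d entries would divide by a value near zero for the rotated case, which is one way absurdly large numbers can arise. With skew, the two edges are no longer perpendicular, so even this definition stops corresponding to horizontal and vertical resolution.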
It's not easy, but it's possible. While you cannot do it using PDFDocument, you can instead use the CGPDF* stuff in Quartz. Briefly: you will need to use CGPDFPageGetDictionary() to get the dictionary for the page the image is on, then get the information about its XObject (assuming it's not inlined in the stream) from the dictionary. Even this is not straightforward -- you will need to consult with the PDF standard to understand how the XObject may be formatted and then use the various CG* routines to drill down to what you need.
I should add that the default user space unit in a PDF document is 1/72 inch, so unscaled coordinates correspond to 72 units per inch. Also, many graphics in PDFs are vector graphics, which have no intrinsic DPI at all.
You need the dimensions of the raw image XObject accessed via the Do operator.
The answer is definitely no, because PDF documents don't really have intrinsic resolutions. The resolution ultimately depends on who is handling the document and its elements at the time. It can even vary by the amount of zoom you're using in Adobe Acrobat.
For example, I created a 2D barcode with 16x16 pixel dimensions and scaled it to be an inch wide and an inch tall before adding it to the document. It looks perfectly crisp (i.e., many pixels per square element) in Adobe Acrobat Reader, but when I send the resulting PDF out to a faxing service, it ends up being 100x200 resolution (roughly). When I print that same document on a laser printer, it ends up being more like 400dpi. When I click on the barcode image in Acrobat Reader and copy/paste it into Gimp, it shows up as a tiny 16x16 bitmap.
This answer is intended as an addendum to @Kurt Pfeifle's answer, and works outside of Objective-C.
Alternatively:
If you have a Windows system and do not have a compiler set up, then the following is the easiest method. Download the Windows XPDF binaries; then use pdfimages to extract the images, convert them to a BMP format, and then mspaint will tell you the resolution. The advantages of this method are:
You can get an exact resolution without having to estimate it by measuring the image size;
It WILL work for XPDF's version of pdfimages.
The disadvantages are:
It takes a bit more work, including converting the file to a format you can open without changing the resolution;
You have to do this for each file individually, instead of getting a list.
It gives you the resolution of the images themselves, not the resolution with which they appeared in the PDF file. (thanks to Kurt Pfeifle's comment)