Changing PDF text encoding

I have a PDF document (my schoolbook). The text prints normally, but when I copy it, it comes out as seemingly random glyphs. I found that this is because the text is encoded in cp1251 but is decoded as cp1252 (or vice versa, I'm not sure, but the copied glyphs belong to cp1252). By pasting the text into a decoder from 1252 to 1251 I can recover the original text.
To solve my problem of searching and copying text I just used OCR, but maybe there is a way to change its encoding in some PDF headers? I also need to copy some of the illustrations for school seminars, but Inkscape and AI still output these glyphs in 1252.
Opening the file in Adobe Acrobat DC, I saw that it complained about the font 1251 Times. In Notepad++ I found these objects:
1146 0 obj
<<
/Ascent 756
/CapHeight 750
/Descent -195
/Flags 32
/FontBBox [-91 -224 1237 943]
/FontFamily (1251 Times)
/FontFile2 1147 0 R
/FontName /OGAHOK+1251Times
/FontStretch /Normal
/FontWeight 400
/ItalicAngle 0
/StemV 90
/Type /FontDescriptor
>>
endobj
1145 0 obj
<<
/BaseFont /OGAHOK+1251Times
/Encoding /WinAnsiEncoding
/FirstChar 32
/FontDescriptor 1146 0 R
/LastChar 255
/Subtype /TrueType
/Type /Font
/Widths [351 0 0 0 0 0 828 0 392 392 0 0 326 448 288 455 531 533 532 532 532 532 532 531 531 532 288 0 0 0 0 0 864 724 714 776 0 706 0 0 875 417 0 0 0 0 882 0 661 0 770 599 678 0 0 983 0 0 0 0 0 0 0 0 0 495 539 499 565 489 322 491 583 294 0 532 287 887 590 566 563 0 376 385 332 568 486 729 0 503 476 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 554 554 0 952 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 896 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 699 714 0 747 0 0 597 886 0 812 0 1034 875 0 877 0 776 678 729 0 0 858 0 0 0 0 0 0 759 0 0 495 559 523 434 539 489 757 449 622 622 577 550 715 636 566 622 563 499 468 503 764 500 621 553 880 880 0 760 501 517 820 546]
>>
endobj
1150 0 obj
<<
/Filter /FlateDecode
/Length1 32416
/Length 24094
>>
stream
By replacing all occurrences of 1251 with 1252, I have achieved nothing. What is the right way to do this? And is there such a right way?

A font name like OGAHOK+1251Times (six random letters, a plus sign, and then the name of a font) marks an embedded subset of that font.
It very often shows up when the source has been through OCR (optical character recognition), where each letter, line of letters or page of letters can have its own font; here the font looks like Times Roman in, as you discovered, 1251-style lettering.
So changing the name to 1252 would be like saying that Times is Verdana; it cannot change the raw data.
I am surprised, but pleased for you, that you can get readable text by converting between 1252 and 1251. However, replacing one symbol at a time with another inside the potentially corrupted font while maintaining the string shape would be nigh on impossible; see the varying /Widths.
However, without your base PDF file, this is based on experience rather than on a failed attempt with your source.
[Update]
Wow! That file has 600 fonts! Something has processed those badly.
The problem seems to stem from the use of WinAnsiEncoding rather than UTF-8 or some compatible encoding. I am looking to see if there is any way to modify it, but I am not sure whether it would help or make things worse. I can try editing the settings, but in Tracker PDF-XChange Editor making changes does not help unless the text is cut, converted and pasted back.
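The cut, convert and paste-back step can also be done outside the PDF editor. The following is a minimal sketch, assuming the copied text is cp1251 bytes that were decoded as cp1252, as the question describes. The function name fix_copied_text and the sample string are made up for illustration. It only repairs text after it has been copied; it does not change the PDF itself.

```python
def fix_copied_text(garbled: str) -> str:
    """Undo a cp1251-read-as-cp1252 mix-up (hypothetical helper name)."""
    # Turn the mis-decoded characters back into the original bytes ...
    # errors="replace" keeps going if a character has no cp1252 byte.
    raw = garbled.encode("cp1252", errors="replace")
    # ... and decode those bytes with the code page the font was built for.
    return raw.decode("cp1251", errors="replace")


# Illustrative sample: the Russian word for "hello" as it would be copied
# from a PDF with this kind of encoding mismatch.
sample = "Ïðèâåò"
print(fix_copied_text(sample))
```

Characters that have no byte in cp1252 cannot be mapped back and are replaced, so the result is worth checking by eye.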

Complex SQL request with double join

There are two tables. The first one contains the variable name and its ID:
VARIABLE NAME GUID
1000131 AddIn_EM63Alarms_Alarm_ShotCycle
1000132 AddIn_EM63Alarms_Alarm_Status
1000133 AddIn_EM63Alarms_Alarm_Code
1000134 AddIn_EM63Alarms_Alarm_Message
The second one contains the variable ID and the data of those variables in one column called "STRVALUE":
VARIABLE CALCULATION TIMESTAMP_S TIMESTAMP_MS VALUE STATUS GUID STRVALUE
1000131 0 1646404026 664 33 1078067200 1209
1000132 0 1646404026 664 122 1078067200 1
1000133 0 1646404026 664 48 1078067200 650
1000134 0 1646404026 664 61 1078067200 HOPPER TEMP.: TOL. LIM. +/-
1000131 0 1646404026 886 131 1078067200 1209
1000132 0 1646404026 886 220 1078067200 1
1000133 0 1646404026 886 146 1078067200 650
1000134 0 1646404026 886 159 1078067200 HOPPER TEMP.: TOL. LIM. +/-
1000131 0 1646404027 146 229 1078067200 1209
1000132 0 1646404027 146 318 1078067200 0
1000133 0 1646404027 146 244 1078067200 650
1000134 0 1646404027 146 257 1078067200 HOPPER TEMP.: TOL. LIM. +/-
1000131 0 1646404027 360 327 1078067200 1209
1000132 0 1646404027 360 416 1078067200 0
1000133 0 1646404027 360 342 1078067200 650
1000134 0 1646404027 360 355 1078067200 HOPPER TEMP.: TOL. LIM. +/-
1000131 0 1646404027 607 425 1078067200 1209
1000132 0 1646404027 607 514 1078067200 1
1000133 0 1646404027 607 440 1078067200 650
1000134 0 1646404027 607 453 1078067200 HOPPER TEMP.: TOL. LIM. +/-
1000131 0 1646404027 777 523 1078067200 1209
1000132 0 1646404027 777 612 1078067200 1
1000133 0 1646404027 777 538 1078067200 650
1000134 0 1646404027 777 551 1078067200 HOPPER TEMP.: TOL. LIM. +/-
1000131 0 1646404028 190 621 1078067200 1512
1000132 0 1646404028 190 698 1078067200 1
1000133 0 1646404028 190 636 1078067200 306
1000134 0 1646404028 190 649 1078067200 REQUEST: INSPECTION 2
I would like to write an SQL query that outputs the data of those four variables in separate columns, like this:
timestamp_s | timestamp_ms | strvalue of 1st variable | strvalue of 2nd variable | strvalue of 3rd variable | strvalue of 4th variable
Here is what I tried:
select
    a.timestamp_s,
    a.strvalue as ShotCycle,
    b.strvalue as Code
from
    (
        select
            a.timestamp_s,
            v.name,
            a.strvalue
        from
            IMM0190_VARIABLES as v
        inner join
            IMM0190_AL as a
        on
            a.variable = v.variable
        where
            v.name = 'AddIn_EM63Alarms_Alarm_ShotCycle'
    ) as a
left join
    (
        select
            a.timestamp_s,
            v.name,
            a.strvalue
        from
            IMM0190_VARIABLES as v
        inner join
            IMM0190_AL as a
        on
            a.variable = v.variable
        where
            v.name = 'AddIn_EM63Alarms_Alarm_Code'
    ) as b
on
    a.timestamp_s = b.timestamp_s
However, the result is not as expected.
We can observe that the rows are duplicated:
Timestamp_s ShotCycle Code
1646404026 1209 650
1646404026 1209 650
1646404026 1209 650
1646404026 1209 650
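Please suggest a solution to achieve the required result.
This is how I resolved it. Each variable is selected in its own subquery, and the subqueries are joined on both TIMESTAMP_S and TIMESTAMP_MS, so rows that share the same second are no longer multiplied with each other.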
SELECT
    Main.Time,
    Main.Shot,
    Main.Status,
    Main.Code,
    Main.Message
from
    (SELECT
        dateadd(S, b.TIMESTAMP_S, '1970-01-01') as Time,
        b.STRVALUE as Shot,
        c.STRVALUE as Status,
        d.STRVALUE as Code,
        e.STRVALUE as Message
    FROM
        (
            select al.[VARIABLE]
                ,al.[TIMESTAMP_S]
                ,al.[TIMESTAMP_MS]
                ,al.[STRVALUE]
            from [IMM0190_T].[dbo].[IMM0190_AL] as al
            where (al.variable = 1000131)
        ) as b
    left join
        (
            select al.[VARIABLE]
                ,al.[TIMESTAMP_S]
                ,al.[TIMESTAMP_MS]
                ,al.[STRVALUE]
            from [IMM0190_T].[dbo].[IMM0190_AL] as al
            where (al.variable = 1000132)
        ) as c on b.timestamp_s = c.timestamp_s and b.TIMESTAMP_MS = c.TIMESTAMP_MS
    left join
        (
            select al.[VARIABLE]
                ,al.[TIMESTAMP_S]
                ,al.[TIMESTAMP_MS]
                ,al.[STRVALUE]
            from [IMM0190_T].[dbo].[IMM0190_AL] as al
            where (al.variable = 1000133)
        ) as d on b.timestamp_s = d.timestamp_s and b.TIMESTAMP_MS = d.TIMESTAMP_MS
    left join
        (
            select al.[VARIABLE]
                ,al.[TIMESTAMP_S]
                ,al.[TIMESTAMP_MS]
                ,al.[STRVALUE]
            from [IMM0190_T].[dbo].[IMM0190_AL] as al
            where (al.variable = 1000134)
        ) as e on b.timestamp_s = e.timestamp_s and b.TIMESTAMP_MS = e.TIMESTAMP_MS
    ) Main
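An alternative to joining the table to itself four times is conditional aggregation: group by the two timestamp columns and pick each variable's STRVALUE with a CASE expression. The sketch below is not the query above. It uses Python's sqlite3 module with an in-memory table named AL, which is an assumed, simplified stand-in for IMM0190_AL, filled with a few rows copied from the question. The same GROUP BY / MAX(CASE ...) pattern should carry over to SQL Server, but that has not been checked against the real database.

```python
import sqlite3

# In-memory stand-in for IMM0190_AL (assumed, reduced to the needed columns).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE AL (VARIABLE INTEGER, TIMESTAMP_S INTEGER, "
    "TIMESTAMP_MS INTEGER, STRVALUE TEXT)"
)
rows = [
    (1000131, 1646404026, 664, "1209"),
    (1000132, 1646404026, 664, "1"),
    (1000133, 1646404026, 664, "650"),
    (1000134, 1646404026, 664, "HOPPER TEMP.: TOL. LIM. +/-"),
    (1000131, 1646404026, 886, "1209"),
    (1000132, 1646404026, 886, "1"),
    (1000133, 1646404026, 886, "650"),
    (1000134, 1646404026, 886, "HOPPER TEMP.: TOL. LIM. +/-"),
]
conn.executemany("INSERT INTO AL VALUES (?, ?, ?, ?)", rows)

# One output row per (second, millisecond) pair; MAX ignores the NULLs
# produced by the CASE branches that do not match.
PIVOT_SQL = """
SELECT TIMESTAMP_S,
       TIMESTAMP_MS,
       MAX(CASE WHEN VARIABLE = 1000131 THEN STRVALUE END) AS Shot,
       MAX(CASE WHEN VARIABLE = 1000132 THEN STRVALUE END) AS Status,
       MAX(CASE WHEN VARIABLE = 1000133 THEN STRVALUE END) AS Code,
       MAX(CASE WHEN VARIABLE = 1000134 THEN STRVALUE END) AS Message
FROM AL
GROUP BY TIMESTAMP_S, TIMESTAMP_MS
ORDER BY TIMESTAMP_S, TIMESTAMP_MS
"""
result = conn.execute(PIVOT_SQL).fetchall()
for row in result:
    print(row)
```

Grouping on both timestamp columns plays the same role as joining on both of them: two alarms within the same second stay on separate rows.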

create new column from divided columns over iteration

I am working with the following code:
import pandas as pd
url = 'https://raw.githubusercontent.com/dothemathonthatone/maps/master/fertility.csv'
df = pd.read_csv(url)
year regional_schlüssel Aus15 Deu15 Aus16 Deu16 Aus17 Deu17 Aus18 Deu18 ... aus36 aus37 aus38 aus39 aus40 aus41 aus42 aus43 aus44 aus45
0 2000 5111000 0 4 8 25 20 45 56 89 ... 935 862 746 732 792 660 687 663 623 722
1 2000 5113000 1 1 4 14 13 33 19 48 ... 614 602 498 461 521 470 393 411 397 400
2 2000 5114000 0 11 0 5 2 13 7 20 ... 317 278 265 235 259 228 204 173 213 192
3 2000 5116000 0 2 2 7 3 28 13 26 ... 264 217 206 207 197 177 171 146 181 169
4 2000 5117000 0 0 3 1 2 4 4 7 ... 135 129 118 116 128 148 89 110 124 83
I would like to create a new set of columns fertility_deu15, ..., fertility_deu45 and fertility_aus15, ..., fertility_aus45 such that fertility_aus15 = aus15 / Aus15 and fertility_deu15 = deu15 / Deu15, and likewise for every pair aus_i / Aus_i and deu_i / Deu_i with i in [15, 45].
I'm not sure what is up with that data, but we need to fix it to make it numeric. I'll do that while filtering.
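# Lower-case columns (aus15, deu15, ...) are the numerators
numerator = df.filter(regex=r'^[a-z]+\d+$')
numerator = numerator.apply(pd.to_numeric, errors='coerce')  # non-numeric entries become NaN
# Capitalised columns (Aus15, Deu15, ...) are the denominators; lower-case their names so they line up
denominator = df.filter(regex=r'^[A-Z][a-z]+\d+$').rename(columns=str.lower)
denominator = denominator.apply(pd.to_numeric, errors='coerce')
# Division aligns on column names, giving fertility_aus15, fertility_deu15, ...
numerator.div(denominator).add_prefix('fertility_')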
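To see how the pairing works without downloading the CSV, here is a small self-contained sketch on a toy frame. The values are made up; only the column naming pattern is taken from the question. Joining the result back onto df is an assumption about what "create new columns" should mean here.

```python
import pandas as pd

# Toy frame with the same naming pattern as the question (made-up values).
df = pd.DataFrame({
    "year": [2000, 2000],
    "Aus15": [2, 4],
    "Deu15": [4, 10],
    "aus15": [1, 1],
    "deu15": [2, 5],
})

# Lower-case names followed by digits: the numerators.
numerator = df.filter(regex=r"^[a-z]+\d+$")
# Capitalised names followed by digits: the denominators, renamed to match.
denominator = df.filter(regex=r"^[A-Z][a-z]+\d+$").rename(columns=str.lower)

# div aligns on column labels, so aus15 is divided by Aus15 and deu15 by Deu15.
fertility = numerator.div(denominator).add_prefix("fertility_")

# Attach the new columns to the original frame.
result = df.join(fertility)
print(result)
```

Because the division is aligned by label rather than by position, the order of the columns in the file does not matter.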

How to interpret the log output of docplex optimisation library

I am having a problem interpreting this log that I get after trying to maximise an objective function using docplex:
Nodes Cuts/
Node Left Objective IInf Best Integer Best Bound ItCnt Gap
0 0 6.3105 0 10.2106 26
0 0 5.9960 8 Cone: 5 34
0 0 5.8464 5 Cone: 8 47
0 0 5.8030 11 Cone: 10 54
0 0 5.7670 12 Cone: 13 64
0 0 5.7441 13 Cone: 16 72
0 0 5.7044 9 Cone: 19 81
0 0 5.6844 14 5.6844 559
* 0+ 0 4.5362 5.6844 25.31%
0 0 5.5546 15 4.5362 Cuts: 322 1014 22.45%
0 0 5.4738 15 4.5362 Cuts: 38 1108 20.67%
* 0+ 0 4.6021 5.4738 18.94%
0 0 5.4296 16 4.6021 Cuts: 100 1155 17.98%
0 0 5.3779 19 4.6021 Cuts: 34 1204 16.86%
0 0 5.3462 17 4.6021 Cuts: 80 1252 16.17%
0 0 5.3396 19 4.6021 Cuts: 42 1276 16.03%
0 0 5.3364 24 4.6021 Cuts: 57 1325 15.96%
0 0 5.3269 17 4.6021 Cuts: 66 1353 15.75%
0 0 5.3188 20 4.6021 Cuts: 42 1369 15.57%
0 0 5.2975 21 4.6021 Cuts: 62 1387 15.11%
0 0 5.2838 24 4.6021 Cuts: 72 1427 14.81%
0 0 5.2796 21 4.6021 Cuts: 70 1457 14.72%
0 0 5.2762 24 4.6021 Cuts: 73 1471 14.65%
0 0 5.2655 24 4.6021 Cuts: 18 1479 14.42%
* 0+ 0 4.6061 5.2655 14.32%
* 0+ 0 4.6613 5.2655 12.96%
0 0 5.2554 26 4.6613 Cuts: 40 1492 12.75%
0 0 5.2425 27 4.6613 Cuts: 11 1511 12.47%
0 0 5.2360 23 4.6613 Cuts: 3 1518 12.33%
0 0 5.2296 19 4.6613 Cuts: 7 1521 12.19%
0 0 5.2213 18 4.6613 Cuts: 8 1543 12.01%
0 0 5.2163 24 4.6613 Cuts: 15 1552 11.91%
0 0 5.2106 21 4.6613 Cuts: 4 1558 11.78%
0 0 5.2106 21 4.6613 Cuts: 3 1559 11.78%
* 0+ 0 4.6706 5.2106 11.56%
0 2 5.2106 21 4.6706 5.2106 1559 11.56%
Elapsed time = 9.12 sec. (7822.43 ticks, tree = 0.01 MB, solutions = 5)
51 29 4.9031 3 4.6706 5.1575 1828 10.42%
260 147 4.9207 1 4.6706 5.1575 2699 10.42%
498 242 infeasible 4.6706 5.0909 3364 9.00%
712 346 4.7470 6 4.6706 5.0591 4400 8.32%
991 497 4.7338 6 4.6706 5.0480 5704 8.08%
1358 566 4.8085 11 4.6706 5.0005 7569 7.06%
1708 708 4.7638 14 4.6706 4.9579 9781 6.15%
1985 817 cutoff 4.6706 4.9265 11661 5.48%
2399 843 infeasible 4.6706 4.9058 15567 5.04%
3619 887 4.7066 4 4.6706 4.7875 23685 2.50%
Elapsed time = 17.75 sec. (10933.85 ticks, tree = 3.05 MB, solutions = 5)
4623 500 4.6863 13 4.6706 4.7274 35862 1.22%
What I don't understand is the following:
What is the difference between the third column (Objective) and the fifth column (Best Integer)?
How come the third column (Objective) has higher values than the actual solution of the problem given by CPLEX, which is 4.6706?
Do the values in the third column take into account the constraints given to the optimization problem?
This webpage didn't help me understand either; its explanation of Best Integer is really confusing.
Thank you in advance for your feedback.
Regards.
The user manual includes a detailed explanation of this log in section
CPLEX->User's Manual for CPLEX->Discrete Optimization->Solving Mixed Integer Programming Problems (MIP)->Progress Reports: interpreting the node log
(see https://www.ibm.com/support/knowledgecenter/SSSA5P_12.8.0/ilog.odms.cplex.help/CPLEX/UsrMan/topics/discr_optim/mip/para/52_node_log.html)
I also suggest having a look at
https://fr.slideshare.net/mobile/IBMOptimization/2013-11-informsminingthenodelog
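As a rough reading aid (my own summary, not taken from the linked pages): for a maximisation problem, the Objective column is the value of the continuous relaxation solved at that node. The relaxation respects the constraints but not the integrality requirements, so its value can be higher than any integer solution. Best Integer is the best integer-feasible solution found so far, and Best Bound is the best value any remaining node could still reach. The Gap column compares the last two. The sketch below recomputes it for three lines of the log above, assuming CPLEX's documented relative gap formula |best bound - best integer| / (1e-10 + |best integer|); the function name is made up.

```python
def relative_gap(best_integer: float, best_bound: float) -> float:
    """Relative MIP gap as a fraction (hypothetical helper name)."""
    return abs(best_bound - best_integer) / (1e-10 + abs(best_integer))


# (Best Integer, Best Bound) pairs copied from three lines of the log above.
pairs = [(4.5362, 5.6844), (4.6706, 5.2106), (4.6706, 4.7274)]
for incumbent, bound in pairs:
    gap = relative_gap(incumbent, bound)
    print(f"best integer {incumbent:.4f}, best bound {bound:.4f}: gap {gap:.2%}")
```

The search stops once this gap falls below the configured tolerance, which is why the final answer 4.6706 is the last Best Integer value rather than any value from the Objective column.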

MPI_sendrecv changes loop index with -O1 flag [duplicate]

Despite having written long, heavily parallelized codes with complicated send/receives over three-dimensional arrays, this simple code with a two-dimensional array of integers has got me at my wits' end. I combed Stack Overflow for possible solutions and found one that slightly resembles the issue I am having:
Boost.MPI: What's received isn't what was sent!
However, the solutions there seem to point to the looping segment of the code as the culprit for overwriting sections of memory. This one seems to act even stranger. Maybe it is a careless oversight of some simple detail on my part. The problem is with the code below:
program main
    implicit none
    include 'mpif.h'
    integer :: i, j
    integer :: counter, offset
    integer :: rank, ierr, stVal
    integer, dimension(10, 10) :: passMat, prntMat !! passMat CONTAINS VALUES TO BE PASSED TO prntMat
    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
    counter = 0
    offset = (rank + 1)*300
    do j = 1, 10
        do i = 1, 10
            prntMat(i, j) = 10 !! prntMat OF BOTH RANKS CONTAIN 10
            passMat(i, j) = offset + counter !! passMat OF rank=0 CONTAINS 300..399 AND rank=1 CONTAINS 600..699
            counter = counter + 1
        end do
    end do
    if (rank == 1) then
        call MPI_SEND(passMat(1:10, 1:10), 100, MPI_INTEGER, 0, 1, MPI_COMM_WORLD, ierr) !! SEND passMat OF rank=1 to rank=0
    else
        call MPI_RECV(prntMat(1:10, 1:10), 100, MPI_INTEGER, 1, 1, MPI_COMM_WORLD, stVal, ierr)
        do i = 1, 10
            print *, prntMat(:, i)
        end do
    end if
    call MPI_FINALIZE(ierr)
end program main
When I compile the code with mpif90 with no flags and run it on my machine with mpirun -np 2, I get the following output with wrong values in the first four indices of the array:
0 0 400 0 604 605 606 607 608 609
610 611 612 613 614 615 616 617 618 619
620 621 622 623 624 625 626 627 628 629
630 631 632 633 634 635 636 637 638 639
640 641 642 643 644 645 646 647 648 649
650 651 652 653 654 655 656 657 658 659
660 661 662 663 664 665 666 667 668 669
670 671 672 673 674 675 676 677 678 679
680 681 682 683 684 685 686 687 688 689
690 691 692 693 694 695 696 697 698 699
However, when I compile it with the same compiler but with the -O3 flag on, I get the correct output:
600 601 602 603 604 605 606 607 608 609
610 611 612 613 614 615 616 617 618 619
620 621 622 623 624 625 626 627 628 629
630 631 632 633 634 635 636 637 638 639
640 641 642 643 644 645 646 647 648 649
650 651 652 653 654 655 656 657 658 659
660 661 662 663 664 665 666 667 668 669
670 671 672 673 674 675 676 677 678 679
680 681 682 683 684 685 686 687 688 689
690 691 692 693 694 695 696 697 698 699
This error is machine dependent. This issue turns up only on my system running Ubuntu 14.04.2, using OpenMPI 1.6.5
I tried this on other systems running RedHat and CentOS and the code ran well with and without the -O3 flag. Curiously those machines use an older version of OpenMPI - 1.4
I am guessing that the -O3 flag is performing some odd optimization that is modifying the manner in which arrays are being passed between the processes.
I also tried other versions of array allocation. The above code uses explicit shape arrays. With assumed shape and allocated arrays I am receiving equally, if not more bizarre results, with some of them seg-faulting. I tried using Valgrind to trace the origin of these seg-faults, but I still haven't gotten the hang of getting Valgrind to not give false positives when running with MPI programs.
I believe that resolving the difference in behaviour of the above code will help me understand the tantrums of my other codes as well.
Any help would be greatly appreciated! This code has really gotten me questioning if all the other MPI codes I wrote are sound at all.
Using the Fortran 90 interface to MPI reveals a mismatch in your call to MPI_RECV:
call MPI_RECV(prntMat(1:10, 1:10), 100, MPI_INTEGER, 1, 1, MPI_COMM_WORLD, stVal, ierr)
1
Error: There is no specific subroutine for the generic ‘mpi_recv’ at (1)
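This is because the status variable stVal is an integer scalar rather than an array of size MPI_STATUS_SIZE. MPI_RECV writes a full status into it, so it writes past the end of stVal into neighbouring memory, which is presumably what corrupts the first elements of prntMat. The F77 interface (include 'mpif.h') to MPI_RECV is:
INCLUDE 'mpif.h'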
MPI_RECV(BUF, COUNT, DATATYPE, SOURCE, TAG, COMM, STATUS, IERROR)
<type> BUF(*)
INTEGER COUNT, DATATYPE, SOURCE, TAG, COMM
INTEGER STATUS(MPI_STATUS_SIZE), IERROR
Changing
integer :: rank, ierr, stVal
to
integer :: rank, ierr, stVal(mpi_status_size)
produces a program that works as expected, tested with gfortran 5.1 and OpenMPI 1.8.5.
Using the F90 interface (use mpi vs include "mpif.h") lets the compiler detect the mismatched arguments at compile time rather than producing confusing runtime problems.

Compare two files according to first column and print whole line

I will ask my question with an example. I have 2 files:
File1-
TR100013|c0_g1
TR100013|c0_g2
TR10009|c0_g1
TR10009|c0_g2
File2-
TR100013|c0_g1 AT1G01360.1 78.79 165 35 0 301 795 19 183 2E-089 272
TR100013|c0_g2 AT1G01360.1 78.79 165 35 0 301 795 19 183 2E-089 272
TR10009|c0_g1 AT1G16240.3 77.42 62 14 0 261 76 113 174 4E-025 95.9
TR10009|c0_g2 AT1G16240.2 69.17 120 37 0 1007 648 113 232 2E-050 171
TR29295|c0_g1 AT1G22540.1 69.19 172 53 2 6 521 34 200 2E-053 180
TR49005|c5_g1 AT5G24530.1 69.21 302 90 1 909 13 39 340 5E-157 446
Expected Output :
TR100013|c0_g1 AT1G01360.1 78.79 165 35 0 301 795 19 183 2E-089 272
TR100013|c0_g2 AT1G01360.1 78.79 165 35 0 301 795 19 183 2E-089 272
TR10009|c0_g1 AT1G16240.3 77.42 62 14 0 261 76 113 174 4E-025 95.9
TR10009|c0_g2 AT1G16240.2 69.17 120 37 0 1007 648 113 232 2E-050 171
I want to compare the two files. If the first column of a line in the second file also appears in the first file, print that whole line from the second file.
Using awk:
awk 'NR==FNR{a[$1]++;next};a[$1]' file1 file2
grep can do the same:
grep -wf file1 file2
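-w matches whole words only.
-f specifies the file with the patterns.
Note that, unlike the awk command, grep matches a pattern anywhere on the line, not only in the first column.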
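For readers who find the awk one-liner terse, the same two-pass idea can be sketched in Python: first remember every first field of file1, then keep the lines of file2 whose first field was seen. The function name and the shortened sample lines are made up for illustration. Fields are assumed to be separated by whitespace, as in the awk command.

```python
def lines_with_known_key(key_lines, data_lines):
    """Keep the data lines whose first whitespace-separated field is a key."""
    # First pass (NR==FNR in awk): remember every first field from file1.
    keys = {line.split()[0] for line in key_lines if line.strip()}
    # Second pass (the a[$1] filter): keep file2 lines whose first field was seen.
    return [
        line for line in data_lines
        if line.strip() and line.split()[0] in keys
    ]


# Shortened sample data in the shape of File1 and File2 from the question.
file1 = ["TR100013|c0_g1", "TR10009|c0_g2"]
file2 = [
    "TR100013|c0_g1 AT1G01360.1 78.79",
    "TR29295|c0_g1 AT1G22540.1 69.19",
    "TR10009|c0_g2 AT1G16240.2 69.17",
]
for line in lines_with_known_key(file1, file2):
    print(line)
```

With real files, the two lists can be replaced by open file objects, since both are only iterated line by line.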