Understand a demo pdf file - pdf

https://github.com/GitHubRulesOK/MyNotes/blob/master/demo.pdf
Jane Austen seems to appear 4 times in the raw pdf file.
BT +0.50 Tc -1.4 Tw 999 TL /F2 1 Tf 15 001 10. 30 200.000 440.000 Tm [(Jane A)600(usten)] TJ ET
BT +0.50 Tc 0.00 Tw 000 TL /F3 1 Tf 15 000 000 15 200.000 430.000 Tm [(Ja)-1000(ne Austen)] TJ ET
BT -1.20 Tc 0.00 Tw 999 TL /F3 1 Tf 15 000 000 15 200.000 420.000 Tm [(J)-1200(a)800(ne Austen)] TJ ET
BT +0.00 Tc 0.00 Tw 000 TL /F3 1 Tf 15 000 000 15 200.000 410.000 Tm [(Jane A)100(us)-500(ten)] TJ ET
But it appears only once when I open the file in Acrobat.
I cannot follow the raw PDF code. Could anybody help me understand the content of this PDF file?
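As a starting point for reading the four lines, here is a minimal Python sketch that is not part of the original post. It splits each BT ... ET line into the operands of its Tc (character spacing), Tw (word spacing), TL (leading) and Tm (text matrix) operators, and joins the string pieces of the TJ array. The helper name `describe_text_object` and the regular expressions are illustrative assumptions; they only handle the simple one-line form shown above, not general PDF content streams. The numbers between the strings in a TJ array shift the following glyphs horizontally (in thousandths of a text-space unit), so they change spacing, not the text itself.

```python
import re

LINES = [
    "BT +0.50 Tc -1.4 Tw 999 TL /F2 1 Tf 15 001 10. 30 200.000 440.000 Tm [(Jane A)600(usten)] TJ ET",
    "BT +0.50 Tc 0.00 Tw 000 TL /F3 1 Tf 15 000 000 15 200.000 430.000 Tm [(Ja)-1000(ne Austen)] TJ ET",
    "BT -1.20 Tc 0.00 Tw 999 TL /F3 1 Tf 15 000 000 15 200.000 420.000 Tm [(J)-1200(a)800(ne Austen)] TJ ET",
    "BT +0.00 Tc 0.00 Tw 000 TL /F3 1 Tf 15 000 000 15 200.000 410.000 Tm [(Jane A)100(us)-500(ten)] TJ ET",
]

# A PDF number as it appears in these lines: optional sign, digits, optional fraction.
NUMBER = r"[+-]?(?:\d+\.?\d*|\.\d+)"


def describe_text_object(line):
    """Return the shown text and a few operands from one simple BT ... ET line."""
    info = {}
    # Operators that take a single numeric operand.
    for op in ("Tc", "Tw", "TL"):
        m = re.search(rf"({NUMBER})\s+{op}\b", line)
        info[op] = float(m.group(1)) if m else None
    # Tm takes six numbers (a b c d e f); e and f are the x/y translation.
    m = re.search(rf"((?:{NUMBER}\s+){{6}})Tm\b", line)
    info["Tm"] = [float(x) for x in m.group(1).split()] if m else None
    # TJ takes an array that mixes (strings) with numeric position adjustments.
    m = re.search(r"\[(.*?)\]\s*TJ", line)
    array = m.group(1) if m else ""
    info["text"] = "".join(re.findall(r"\(([^)]*)\)", array))
    info["adjustments"] = [float(x) for x in re.findall(rf"\)\s*({NUMBER})", array)]
    return info


for line in LINES:
    d = describe_text_object(line)
    print(d["text"], "at", d["Tm"][4:], "adjustments", d["adjustments"])
```

Run on the four lines, the sketch reports the same string each time, placed at y = 440, 430, 420 and 410. That confirms the four text objects differ only in spacing, font and matrix; it does not by itself explain what Acrobat chooses to display.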

Plotting Webscraped data onto matplotlib

I recently managed to collect tabular data from a PDF file using camelot in Python. By "collect" I mean print it out on the terminal. Now I would like to find a way to turn the results into a bar graph with matplotlib automatically. How would I do that? Here's my code for extracting the tabular data from the PDF:
import camelot
tables = camelot.read_pdf("data_table.pdf", pages='2')
print(tables[0].df)
Here's an image of the table: [image not included]
Which then prints out a large table in my terminal:
0 1 2 3 4
0 Country \nCase definition \nCumulative cases \...
1 Guinea Confirmed 2727 156 1683
2 Probable 374 * 374
3 Suspected 7 * ‡
4 Total 3108 156 2057
5 Liberia** Confirmed 3149 11 ‡
6 Probable 1876 * ‡
7 Suspected 3982 * ‡
8 Total 9007 11 3900
9 Sierra Leone Confirmed 8212 230 3042
10 Probable 287 * 208
11 Suspected 2604 * 158
12 Total 11103 230 3408
13 Total 23 218 397 9365
I have a bit of experience with matplotlib and I know how to plot data manually, but not automatically from the PDF. This would save me some time, since I'm trying to automate the whole process.
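One possible approach, sketched under assumptions: clean the camelot DataFrame with pandas first, then hand the cleaned Series to matplotlib. The DataFrame below is a hand-built stand-in for `tables[0].df` that copies only the first three columns of the printout (the remaining column names are truncated there), and the function names `confirmed_cases` and `plot_cases` and the output file name are hypothetical.

```python
import pandas as pd

# Stand-in for tables[0].df: camelot returns every cell as a string,
# with the header text in row 0. Only the first three columns are copied.
df = pd.DataFrame([
    ["Country", "Case definition", "Cumulative cases"],
    ["Guinea", "Confirmed", "2727"],
    ["", "Probable", "374"],
    ["", "Suspected", "7"],
    ["", "Total", "3108"],
    ["Liberia**", "Confirmed", "3149"],
    ["", "Probable", "1876"],
    ["", "Suspected", "3982"],
    ["", "Total", "9007"],
    ["Sierra Leone", "Confirmed", "8212"],
    ["", "Probable", "287"],
    ["", "Suspected", "2604"],
    ["", "Total", "11103"],
])


def confirmed_cases(raw):
    """Return cumulative confirmed cases per country as a numeric Series."""
    body = raw.iloc[1:, :3].copy()  # drop the header row
    body.columns = ["country", "case_definition", "cumulative_cases"]
    # The country name appears only on the first row of each block: fill it down.
    body["country"] = body["country"].mask(body["country"] == "").ffill()
    body["country"] = body["country"].str.rstrip("*")  # drop footnote markers
    # Strip thousands separators written as spaces, then convert to numbers.
    body["cumulative_cases"] = pd.to_numeric(
        body["cumulative_cases"].str.replace(" ", ""), errors="coerce"
    )
    confirmed = body[body["case_definition"] == "Confirmed"]
    return confirmed.set_index("country")["cumulative_cases"]


def plot_cases(series, path="confirmed_cases.png"):
    """Draw a bar chart of the Series and save it to `path`."""
    import matplotlib.pyplot as plt  # imported here so the cleaning step works without it

    fig, ax = plt.subplots()
    ax.bar(series.index, series.values)
    ax.set_ylabel("Cumulative confirmed cases")
    fig.savefig(path)


print(confirmed_cases(df))
```

With the real data, `plot_cases(confirmed_cases(tables[0].df))` would be the whole pipeline, assuming the extracted table really has the layout shown in the printout.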

How can I merge two dataframes outside the intersection of the data?

I have a dataframe of presidential candidates, the donation amounts they received, and the states where the donations came from (contbr_st).
However, the state column includes non-state abbreviations such as AA, FF, and FM, as shown below. I also have a single-column dataframe of state abbreviations (the 50 states plus DC).
The dataframe below is "total":
cand_nm Obama, Barack Romney, Mitt
contbr_st
AA 56405.00 135.00
AB 2048.00 NaN
AE 42973.75 5680.00
AK 281840.15 86204.24
AL 543123.48 527303.51
AP 37130.50 1655.00
AR 359247.28 105556.00
AS 2955.00 NaN
AZ 1506476.98 1888436.23
CA 23824984.24 11237636.60
CO 2132429.49 1506714.12
CT 2068291.26 3499475.45
DC 4373538.80 1025137.50
DE 336669.14 82712.00
FF NaN 99030.00
FL 7318178.58 8338458.81
FM 600.00 NaN
The dataframe below holds the state abbreviations; it is "state":
state
0 AL
1 AK
2 AZ
3 AR
4 CA
5 CO
6 CT
7 DC
8 DE
9 FL
10 GA
11 HI
12 ID
13 IL
14 IN
15 IA
16 KS
17 KY
18 LA
19 ME
20 MD
21 MA
22 MI
23 MN
24 MS
25 MO
26 MT
27 NE
28 NV
29 NH
30 NJ
31 NM
32 NY
33 NC
34 ND
35 OH
36 OK
37 OR
38 PA
39 RI
40 SC
41 SD
42 TN
43 TX
44 UT
45 VT
46 VA
47 WA
48 WV
49 WI
50 WY
Is there a simple way in pandas to combine these two dataframes so that the rows for the states are discarded and only the non-state rows from the original dataframe ("total") are kept?
My expected output would include only the non-state abbreviation data, as below:
cand_nm Obama, Barack Romney, Mitt
contbr_st
AA 56405.00 135.00
AP 37130.50 1655.00
FF NaN 99030.00
FM 600.00 NaN
.
.
The only way I can think of is to take the state list from each dataframe, convert each to a set, and use the difference() method. Then convert the result to a dataframe and merge it with the "total" dataframe.
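A minimal sketch of both routes, using shortened copies of the two dataframes (only a few of the rows above are reproduced). A boolean mask built with `Index.isin` avoids the merge entirely; the set-difference idea described above gives the same rows.

```python
import pandas as pd

nan = float("nan")
# Shortened copy of "total": contbr_st is the index, candidates are the columns.
total = pd.DataFrame(
    {
        "Obama, Barack": [56405.00, 281840.15, 543123.48, 37130.50, nan, 7318178.58, 600.00],
        "Romney, Mitt": [135.00, 86204.24, 527303.51, 1655.00, 99030.00, 8338458.81, nan],
    },
    index=pd.Index(["AA", "AK", "AL", "AP", "FF", "FL", "FM"], name="contbr_st"),
)
total.columns.name = "cand_nm"

# Shortened copy of "state".
state = pd.DataFrame({"state": ["AL", "AK", "FL"]})

# Route 1: keep the rows whose index label is NOT in the state column.
non_state = total[~total.index.isin(state["state"])]

# Route 2: the set-difference idea from the question, then select by label.
codes = sorted(set(total.index).difference(state["state"]))
same_rows = total.loc[codes]

print(non_state)
```

The mask version keeps the original row order and needs no intermediate dataframe, which is why it may be the simpler of the two here.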

trying to add text at the end of a line at the first occurrence of a variable in a file

The text file I'm working with is shown below.
Using sed I can add the text, but it is added to every line that contains my variable, which happens to be "33" in this case, because other columns in the file hold the same value. I would like the text "SLOTS" added only to the first occurrence in the file, stopping once the first match is encountered. I've searched numerous websites but haven't had any luck with awk or other sed examples. Any help would be greatly appreciated.
Here's my current sed line:
sed '/\b'$slot_drill'\b/s/$/SLOTS/' $slot_card > $new_slot_card
what my current sed line outputs:
d25104-1.dr -- PANEL SIZE: 18x24
Drilled Slots: Yes
Tool Tool Spindle Feed Hits Max Bits Path Time
Size Speed Rate Hits (Min)
T01 126 1.0 1.0 58 1600 0.0 182.8 1.7
T02 250 1.0 1.0 9 1600 0.0 67.5 0.7
T03 12 1.0 1.0 3965 1600 2.5 514.4 4.6
T04 31 1.0 1.0 65 1600 0.0 62.0 0.6
T05 33 1.0 1.0 569 1600 0.4 46.6 0.4 SLOTS
T06 35 1.0 1.0 33 1600 0.0 45.3 0.4 SLOTS
T07 41 1.0 1.0 97 1600 0.1 79.6 0.7
T08 42 1.0 1.0 135 1600 0.1 104.6 0.9
T09 43 1.0 1.0 33 1600 0.0 53.0 0.5 SLOTS
T10 49.2 1.0 1.0 65 1600 0.0 44.2 0.4
T11 52 1.0 1.0 17 1600 0.0 40.7 0.4
T12 63.5 1.0 1.0 33 1600 0.0 55.6 0.5 SLOTS
T13 98 1.0 1.0 25 1600 0.0 52.0 0.5
Total 5104 3.2 1348.2 12.5
output that is desired:
d25104-1.dr -- PANEL SIZE: 18x24
Drilled Slots: Yes
Tool Tool Spindle Feed Hits Max Bits Path Time
Size Speed Rate Hits (Min)
T01 126 1.0 1.0 58 1600 0.0 182.8 1.7
T02 250 1.0 1.0 9 1600 0.0 67.5 0.7
T03 12 1.0 1.0 3965 1600 2.5 514.4 4.6
T04 31 1.0 1.0 65 1600 0.0 62.0 0.6
T05 33 1.0 1.0 569 1600 0.4 46.6 0.4 SLOTS
T06 35 1.0 1.0 33 1600 0.0 45.3 0.4
T07 41 1.0 1.0 97 1600 0.1 79.6 0.7
T08 42 1.0 1.0 135 1600 0.1 104.6 0.9
T09 43 1.0 1.0 33 1600 0.0 53.0 0.5
T10 49.2 1.0 1.0 65 1600 0.0 44.2 0.4
T11 52 1.0 1.0 17 1600 0.0 40.7 0.4
T12 63.5 1.0 1.0 33 1600 0.0 55.6 0.5
T13 98 1.0 1.0 25 1600 0.0 52.0 0.5
Total 5104 3.2 1348.2 12.5
It is difficult to tell what you are asking. A good minimal example would probably omit the shell variables and make the line you are working with completely self-contained, so that other people can run it on their own systems.
However, it sounds as if you are just asking how to limit sed's s command to the first line that matches. In general, I think the way is to prefix an address range to the command and use 0 as the start of the range (a GNU sed extension). For example, sed '0,/foo/ s/foo/bar/' will replace only the first instance of foo that it finds with bar.
This is the answer given to this sort of problem at https://unix.stackexchange.com/questions/188264/want-to-substitute-only-first-occurence-with-sed
awk to the rescue!
on a simplified file
$ cat file
1
2
3
1
2
3
append SLOT to the first occurrence of the variable slot, let's say 3.
$ awk -v value="$slot" '$1==value && !c++{$0=$0 "SLOT"} 1' file
1
2
3SLOT
1
2
3
Looking at your example, you'll need to use `$2==value`.
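For comparison, the same "mark only the first match" logic can be sketched in Python. The function name `mark_first` and its parameters are hypothetical; the field index 1 corresponds to awk's `$2`, and the suffix includes a leading space to match the desired output shown in the question.

```python
def mark_first(lines, value, suffix=" SLOTS", field=1):
    """Append `suffix` to the first line whose whitespace-separated
    field number `field` (0-based) equals `value`; leave the rest alone."""
    done = False
    out = []
    for line in lines:
        fields = line.split()
        if not done and len(fields) > field and fields[field] == value:
            line = line + suffix
            done = True  # stop matching after the first hit
        out.append(line)
    return out


sample = [
    "T04 31 1.0 1.0 65 1600 0.0 62.0 0.6",
    "T05 33 1.0 1.0 569 1600 0.4 46.6 0.4",
    "T06 35 1.0 1.0 33 1600 0.0 45.3 0.4",
    "T09 43 1.0 1.0 33 1600 0.0 53.0 0.5",
]

for line in mark_first(sample, "33"):
    print(line)
```

Only the T05 line gains the suffix: T06 and T09 contain 33 in the Hits column, but their second field is a different tool size.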

SQL Count for Each Category

I need to count rows based on 2 fields for grouping.
Animals (a)
id group_id strain_id death_date death_cause status
-----------------------------------------------------------------------
1 512 164 2015-12-01 Culled P
2 512 164 2015-12-02 Culled A
3 512 164 2015-12-02 Surplus B
4 512 230 2015-12-06 Culled A
5 512 164 2015-12-28 Culled A
6 512 230 2016-01-20 Culled B
7 512 230 2016-01-20 Surplus P
8 512 164 NULL NULL P
9 512 230 NULL NULL B
10 512 230 NULL NULL A
11 512 164 2016-01-25 Culled B
12 512 164 2016-02-29 Culled A
13 512 230 2016-02-03 Surplus P
14 512 230 2016-02-03 Culled A
Groups (g)
id group_name
--------------
512 Mice
Strain (s)
id strain_name
----------------
164 Strain 1
230 Strain 2
Group Animal Count (gac)
id total_animals alive_count dead_count
----------------------------------------------------------------------
512 14 3 11
Mating History (mh)
id animal_id history_type history_date
--------------------------------------------------------
1001 2 MA 2015-11-20
1002 2 MR 2015-12-01
1003 3 MA 2015-12-01
1004 6 FA 2015-12-21
1005 9 FA 2016-02-07
1006 10 MA 2016-01-27
1007 11 FA 2015-12-12
So when I group them by the strain_id and the death_cause this is what they should look like visually:
Strain 1 ---- Culled
1 512 164 2015-12-01 Culled P
2 512 164 2015-12-02 Culled A
5 512 164 2015-12-28 Culled A
11 512 164 2016-01-25 Culled B
12 512 164 2016-02-29 Culled A
Strain 1 ---- Surplus
3 512 164 2015-12-02 Surplus B
Strain 2 ---- Culled
4 512 230 2015-12-06 Culled A
6 512 230 2016-01-20 Culled B
14 512 230 2016-02-03 Culled A
Strain 2 ---- Surplus
7 512 230 2016-01-20 Surplus P
13 512 230 2016-02-03 Surplus P
What I want to get from the SQL query is the following result:
g_name s_name d_cause a_total c_alive c_dead c_pup c_breeder c_total
------------------------------------------------------------------------------
Mice Strain 1 Culled 12 3 9 1 2 5
Mice Strain 1 Surplus 12 3 9 0 1 1
Mice Strain 2 Culled 12 3 9 0 1 3
Mice Strain 2 Surplus 12 3 9 2 0 2
Basically I want to count the number of animals using 2 categories, which in this case are the strain_name and the death_cause.
Note that for an animal to be counted as a breeder (c_breeder), I have to look at the Mating History table and check if the animal_id has ever had any of these codes MA or FA.
I am using INNER JOIN on the groups, group_animal_count, and strains. I use LEFT JOIN for mating_history since animals with a status of P won't have records in that table since they're just pups and won't be involved with mating.
Try:
SELECT group_name, strain_name,death_cause, count(*) as Total
FROM ANIMALS a
JOIN GROUPS g ON a.group_id = g.id
JOIN STRAIN s ON a.strain_id = s.id
GROUP BY group_name, strain_name,death_cause
You can do the aggregation before joining the tables:
SELECT group_name, strain_name, death_cause, total
FROM (
SELECT group_id,
strain_id,
death_cause,
COUNT(*) AS total
FROM animals
GROUP BY group_id, strain_id, death_cause
) a
INNER JOIN groups g ON ( g.id = a.group_id )
INNER JOIN strain s ON ( s.id = a.strain_id );
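Neither query above produces the c_pup and c_breeder columns from the question. The following is a hedged sketch of one way to get them with conditional aggregation, run against an in-memory SQLite database loaded with the sample rows. SQLite is an assumption (the question does not name a database), the table names are guesses based on the table titles, and the a_total, c_alive and c_dead columns from group_animal_count are left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE animals (id INTEGER PRIMARY KEY, group_id INTEGER, strain_id INTEGER,
                      death_date TEXT, death_cause TEXT, status TEXT);
CREATE TABLE "groups" (id INTEGER PRIMARY KEY, group_name TEXT);
CREATE TABLE strain (id INTEGER PRIMARY KEY, strain_name TEXT);
CREATE TABLE mating_history (id INTEGER PRIMARY KEY, animal_id INTEGER,
                             history_type TEXT, history_date TEXT);
""")
conn.executemany("INSERT INTO animals VALUES (?, ?, ?, ?, ?, ?)", [
    (1, 512, 164, "2015-12-01", "Culled", "P"),
    (2, 512, 164, "2015-12-02", "Culled", "A"),
    (3, 512, 164, "2015-12-02", "Surplus", "B"),
    (4, 512, 230, "2015-12-06", "Culled", "A"),
    (5, 512, 164, "2015-12-28", "Culled", "A"),
    (6, 512, 230, "2016-01-20", "Culled", "B"),
    (7, 512, 230, "2016-01-20", "Surplus", "P"),
    (8, 512, 164, None, None, "P"),
    (9, 512, 230, None, None, "B"),
    (10, 512, 230, None, None, "A"),
    (11, 512, 164, "2016-01-25", "Culled", "B"),
    (12, 512, 164, "2016-02-29", "Culled", "A"),
    (13, 512, 230, "2016-02-03", "Surplus", "P"),
    (14, 512, 230, "2016-02-03", "Culled", "A"),
])
conn.execute('INSERT INTO "groups" VALUES (512, ?)', ("Mice",))
conn.executemany("INSERT INTO strain VALUES (?, ?)",
                 [(164, "Strain 1"), (230, "Strain 2")])
conn.executemany("INSERT INTO mating_history VALUES (?, ?, ?, ?)", [
    (1001, 2, "MA", "2015-11-20"), (1002, 2, "MR", "2015-12-01"),
    (1003, 3, "MA", "2015-12-01"), (1004, 6, "FA", "2015-12-21"),
    (1005, 9, "FA", "2016-02-07"), (1006, 10, "MA", "2016-01-27"),
    (1007, 11, "FA", "2015-12-12"),
])

QUERY = """
WITH flagged AS (
    -- One row per dead animal, with a 0/1 flag for "has ever had an MA or FA record".
    -- EXISTS avoids the row duplication a plain join to mating_history could cause.
    SELECT a.*,
           EXISTS (SELECT 1 FROM mating_history mh
                   WHERE mh.animal_id = a.id
                     AND mh.history_type IN ('MA', 'FA')) AS is_breeder
    FROM animals a
    WHERE a.death_cause IS NOT NULL
)
SELECT g.group_name, s.strain_name, f.death_cause,
       SUM(f.status = 'P') AS c_pup,
       SUM(f.is_breeder)   AS c_breeder,
       COUNT(*)            AS c_total
FROM flagged f
JOIN "groups" g ON g.id = f.group_id
JOIN strain s ON s.id = f.strain_id
GROUP BY g.group_name, s.strain_name, f.death_cause
ORDER BY s.strain_name, f.death_cause
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

On the sample data this yields the c_pup, c_breeder and c_total values from the expected result in the question (1/2/5, 0/1/1, 0/1/3 and 2/0/2). Other databases may need `SUM(CASE WHEN ... THEN 1 ELSE 0 END)` in place of summing a boolean expression.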

How minimal can an SVG be?

I just reduced this SVG:
<?xml version="1.0" standalone="no"?>
<svg viewBox="0 0 480 150" style="background-color:#ffffff00" version="1.1" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" xml:space="preserve" x="0px" y="0px" width="480" height="150">
<path d="M 0 35.5 L 6.5 22.5 L 16 37 L 23 24 L 34.8 43.7 L 42.5 30 L 50.3 47 L 59.7 27.7 L 69 47 L 85 17.7 L 98.3 39 L 113 9.7 L 127.7 42.3 L 136.3 23.7 L 147 44.3 L 158.3 20.3 L 170.3 40.3 L 177.7 25.7 L 189.7 43 L 199.7 21 L 207.7 35 L 219 11 L 233 37 L 240.3 23.7 L 251 43 L 263 18.3 L 272.7 33.3 L 283 10 L 295 32.3 L 301.3 23 L 311.7 37 L 323.7 7.7 L 339.3 39 L 346.3 25.7 L 356.3 42.3 L 369.7 15 L 376.3 25.7 L 384 9 L 393 28.3 L 400.3 19 L 411.7 38.3 L 421 21 L 434.3 43 L 445 25 L 453 36.3 L 464.3 18.3 L 476.2 40.3 L 480 33.5 L 480 215 L 0 215 L 0 35.5 Z" fill="#175720"/>
</svg>
To this:
<svg height="150" width="480"><path d="m0 35.5l6.5-13 9.5 14.5 7-13 11.8 19.7 7.7-13.7 7.8 17 9.4-19.3 9.3 19.3 16-29.3 13.3 21.3 14.7-29.3 14.7 32.6 8.6-18.6 10.7 20.6 11.3-24 12 20 7.4-14.6 12 17.3 10-22 8 14 11.3-24 14 26 7.3-13.3 10.7 19.3 12-24.7 9.7 15 10.3-23.3 12 22.3 6.3-9.3 10.4 14 12-29.3 15.6 31.3 7-13.3 10 16.6 13.4-27.3 6.6 10.7 7.7-16.7 9 19.3 7.3-9.3 11.4 19.3 9.3-17.3 13.3 22 10.7-18 8 11.3 11.3-18 11.9 22 3.8-6.8v181.5h-480v-179.5z" fill="#175720"/></svg>
(I ran it through a minimizer and then deleted a bunch of attributes in the <svg> tag.) I am using it as a background image and it seems to work fine in IE, Firefox and Chrome on Windows. I am just wondering what all that other information is doing there if it has no effect on the image appearance. Will there be compatibility issues somewhere because I stripped that info out?
UPDATE:
I discovered that actually, for my use case, I need to have xmlns="http://www.w3.org/2000/svg" or else it won't render in IE or Chrome.
Removing the viewBox creates a significant semantic difference, as the SVG will no longer scale (i.e. be responsive to UA resizes). This only applies if you're viewing the image directly, though. If you're viewing it as a background-image, or via an SVG <image> tag or HTML <img> tag, then the SVG will be drawn as if it had a viewBox of "0 0 width height" unless a viewBox is already present.
Removing the background-color will mean that the SVG will no longer be opaque when placed on top of something else. Of course if you're not doing that you may not notice.
The xml:space attribute only matters if you have text elements in your SVG file.
The rest of the removals are benign if the SVG is inline. Namespace attributes are required if the SVG is a standalone file, though, which will be the case for a background-image.
The reduced version is not valid SVG. It would be considered "just any" XML which happens to have a root element with the name "svg".
To turn the snippet into valid SVG there are two options:
add an xmlns attribute with the proper namespace to the svg element (as you discovered)
add a DOCTYPE to the document 1, 2
serving the document as MIME type image/svg+xml is not enough!
Examples:
<svg xmlns="http://www.w3.org/2000/svg"> (SVG version selected by consumer)
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.0//EN" "http://www.w3.org/TR/2001/REC-SVG-20010904/DTD/svg10.dtd"> (for SVG 1.0)
Use the W3 validator to check your documents. Make sure to check that the detected doctype is SVG, because the document might still validate, but as general/unknown XML. -- They also have test pages.
1 is not good enough for Chrome 53.
2 not recommended any more
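The namespace point can be checked mechanically. The sketch below uses Python's standard XML parser to test whether the root element is an svg element in the SVG namespace, or just an element that happens to be named "svg". The function name `is_namespaced_svg` and the shortened path data are illustrative assumptions, and this is not a substitute for the validator.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def is_namespaced_svg(markup):
    """True if the root element is <svg> in the SVG namespace."""
    root = ET.fromstring(markup)
    # ElementTree writes namespaced tags as "{namespace-uri}localname".
    return root.tag == f"{{{SVG_NS}}}svg"


# Shortened stand-ins for the reduced file and the corrected file.
reduced = '<svg height="150" width="480"><path d="m0 35.5l6.5-13z" fill="#175720"/></svg>'
fixed = ('<svg xmlns="http://www.w3.org/2000/svg" height="150" width="480">'
         '<path d="m0 35.5l6.5-13z" fill="#175720"/></svg>')

print(is_namespaced_svg(reduced))
print(is_namespaced_svg(fixed))
```

The check is False for the reduced markup and True once the xmlns attribute is added, which matches the behaviour described in the update to the question.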