Multithreading pdf workload slows the program down - pdf

I tried building a program that scans PDF files to find certain elements and then outputs a new PDF file with all the pages that contain the elements. It was originally single-threaded and a bit slow to run. It took about 36 seconds on a 6-Core-5600X. So I tried multiprocessing it with concurrent.futures:
def process_pdf(filename):
    # Open the PDF file
    f = open(filename, "rb")
    print("Searching: " + filename)
    # Create a PDF object
    pdf = PyPDF2.PdfReader(f)
    # Extract the text from each page in the PDF
    extracted_text = [page.extract_text() for page in pdf.pages]
    # Initialize a list
    matching_pages = []
    # Iterate through the extracted text
    for j, text in enumerate(extracted_text):
        # Search for the symbol in the text
        if symbol in text:
            # If the symbol is found, get a new PageObject instance for the page
            page = pdf.pages[j]
            # Add the page to the list
            matching_pages.append(page)
    return matching_pages
Multiprocessing Block:
with concurrent.futures.ThreadPoolExecutor() as executor:
    # Get a list of the file paths for all PDF files in the directory
    file_paths = [
        os.path.join(directory, filename)
        for filename in os.listdir(directory)
        if filename.endswith(".pdf")
    ]
    futures = [executor.submit(process_pdf, file_path) for file_path in file_paths]
    # Initialize a list to store the results
    results = []
    # Iterate through the futures as they complete
    for future in concurrent.futures.as_completed(futures):
        # Get the result of the completed future
        result = future.result()
        # Add the result to the list
        results.extend(result)
    # Add each page to the new PDF file
    for page in results:
        output_pdf.add_page(page)
The multiprocessing works, as evident from the printed text, but it somehow doesn't scale at all. 1 thread ~ 35 seconds, 12 Threads ~ 38 seconds.
Why? Where's my bottleneck?
Tried using other libraries, but most were broken or slower.
Tried using re instead of in to search for the symbol, no improvement.

In general, PDF processing consumes much of a device's primary resources. Each device and system will be different, so by way of illustration here is one simple PDF task.
Search for text using poppler's pdftotext, a component behind a common Python library (others may be faster, but the aim here is to show the normal attempts and issues).
As a ballpark yardstick: in 1 minute I scanned roughly 2500 pages of one file for the word "AMENDMENTS" and found 900 occurrences, such as:
page 82
[1] SHORT TITLE OF 1971 AMENDMENTS
[18] SHORT TITLE OF 1970 AMENDMENTS
[44] SHORT TITLE OF 1968 AMENDMENTS
[54] SHORT TITLE OF 1967 AMENDMENTS
page 83
[11] SHORT TITLE OF 1966 AMENDMENTS
[23] SHORT TITLE OF 1965 AMENDMENTS
[42] SHORT TITLE OF 1964 AMENDMENTS
page 84
[16] SHORT TITLE OF 1956 AMENDMENTS
[26] SHORT TITLE OF 1948 AMENDMENTS
page 85
[43] DRUG ABUSE, AND MENTAL HEALTH AMENDMENTS ACT OF 1988
Scanning the whole file (13,234 pages) would take about 5 minutes 20 seconds,
and I know from testing that 4 CPUs could process 1/4 of that file (3,308 pages of 13,234) in under 1 minute (there is a gain from using smaller files).
So a 4-core device should do, say, 3000 pages per core in a short time. Well, let's see.
If I thread that as 3 x 1000 pages, one finishes after about 50 seconds, another after about 60 seconds and a third after about 70 seconds.
Overall there is little or only a slight gain, because the one application spends one third of its time in each thread.
Overall about 3000 pages per minute.
Let's try clever: let's use 3 applications on the one file. Surely they can each take less time? Guess what:
one finishes after about 50 seconds, another after about 60 seconds and a third after about 70 seconds. No gain from using 3 applications.
Overall about 3000 pages per minute.
Let's try more clever: let's use 3 applications on 3 similar but different files. Surely they can each take less time? Guess what:
one finishes after about 50 seconds, another after about 60 seconds and a third after about 70 seconds. No gain from using 3 applications with 3 tasks.
Overall about 3000 pages per minute.
Whichever way I try, the resources on this device are constrained to about 3000 pages per minute overall.
I may just as well let 1 thread run unfettered.
So the basic answer is to use multiple devices, the same way graphics (render) farming is done.
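To check whether your own device hits the same ceiling, it may help to measure pages per second for different worker counts before restructuring anything. The following is only a minimal sketch under stated assumptions: `search_pages`, `pages_per_second` and `make_tasks` are hypothetical helpers, and the synthetic strings stand in for real extracted page text. The pool class is a parameter, so `concurrent.futures.ProcessPoolExecutor` can be passed in place of the thread pool to compare the two.

```python
import concurrent.futures
import time


def search_pages(pages, needle):
    """Stand-in for one PDF task: return the indexes of pages containing needle."""
    return [index for index, text in enumerate(pages) if needle in text]


def pages_per_second(executor_cls, workers, tasks, needle):
    """Run every task on the given pool type and return (hits, throughput)."""
    total_pages = sum(len(pages) for pages in tasks)
    start = time.perf_counter()
    with executor_cls(max_workers=workers) as pool:
        futures = [pool.submit(search_pages, pages, needle) for pages in tasks]
        hits = sum(len(future.result()) for future in futures)
    elapsed = time.perf_counter() - start
    return hits, total_pages / elapsed


def make_tasks(task_count, pages_per_task):
    """Build synthetic 'files' (an assumption): every tenth page holds the word."""
    pages = [
        "SHORT TITLE OF AMENDMENTS" if number % 10 == 0 else "plain body text"
        for number in range(pages_per_task)
    ]
    return [list(pages) for _ in range(task_count)]


if __name__ == "__main__":
    tasks = make_tasks(task_count=3, pages_per_task=1000)
    for workers in (1, 3):
        hits, rate = pages_per_second(
            concurrent.futures.ThreadPoolExecutor, workers, tasks, "AMENDMENTS"
        )
        print(f"{workers} thread(s): {hits} hits, {rate:,.0f} pages/s")
```

If the throughput stays flat as the worker count grows, the constraint is somewhere other than the number of workers, which is the situation described above.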

Related

Reading and handling many small CSV-s to concatenate one large Dataframe

I have two folders, each containing about 8,000 small CSV files: one with an aggregate size of around 2 GB and another with an aggregate size of around 200 GB.
These files are stored like this so they are easier to update on a daily basis. However, when I conduct EDA, I would like them to be assigned to a single variable. For example:
import os
import pandas as pd
path = "some random path"
df = pd.concat([pd.read_csv(os.path.join(path, name)) for name in os.listdir(path)])
Reading the 2 GB dataset takes much less time on my local machine than reading it on the supercomputer cluster. And it is impossible to read the 200 GB dataset on the local machine without some sort of scaled-out Pandas solution. The situation does not seem to improve on the cluster, even when using popular open-source tools like Dask and Modin.
Is there an effective way that enables to read those csv files effectively with given situation?
Q :"Is there an effective way that enables to read those csv files effectively ... ?"
A :Oh, sure, there is :
CSV format ( standard attempts in RFC4180 ) is not unambiguous and is not obeyed under all circumstances ( commas inside fields, header present or not ), so some caution & care is needed here. Given you are your own data curator, you shall be able to decide plausible steps for handling your own data properly.
So, the as-is state is :
# in <_folder_1_>
:::::::: # 8000 CSV-files ~ 2GB in total
||||||||||||||||||||||||||||||||||||||||||| # 8000 CSV-files ~ 200GB in total
# in <_folder_2_>
Speaking of efficiency, the O/S coreutils provide the most stable, proven and efficient tools (as system tools have always been) for the phase of merging thousands and thousands of plain CSV files' content :
###################### if need be,
###################### use an in-place remove of all CSV-file headers first :
for F in *.csv; do sed -i '1d' "$F"; done
This helps in the case where we cannot avoid headers on the CSV-exporter side. It works like this :
(base):~$ cat ?.csv
HEADER
1
2
3
HEADER
4
5
6
HEADER
7
8
9
(base):~$ for i in $( ls ?.csv ); do sed -i '1d' $i; done
(base):~$ cat ?.csv
1
2
3
4
5
6
7
8
9
Now, the merging phase :
###################### join
cat *.csv > ___all_CSVs_JOINED.csv
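For readers who would rather stay in Python, the same two steps (drop the first line of each file, then concatenate) can be sketched with the standard library alone. This is an illustrative sketch, not the method used above: `join_csvs` and `demo` are hypothetical names, and unlike `sed -i` it leaves the source files untouched.

```python
import shutil
import tempfile
from pathlib import Path


def join_csvs(folder, output_path, skip_header=True):
    """Append every *.csv in folder to output_path, optionally dropping line 1."""
    output_path = Path(output_path)
    # Collect the inputs first and exclude the output file itself,
    # in case it already exists in the same folder from an earlier run.
    sources = sorted(
        path for path in Path(folder).glob("*.csv")
        if path.resolve() != output_path.resolve()
    )
    with open(output_path, "wb") as joined:
        for source in sources:
            with open(source, "rb") as part:
                if skip_header:
                    part.readline()  # discard the header line, like sed '1d'
                # Stream the remainder in blocks instead of loading the whole file.
                shutil.copyfileobj(part, joined)
    return len(sources)


def demo():
    """Rebuild the small HEADER/1..6 example in a temporary folder."""
    with tempfile.TemporaryDirectory() as tmp:
        for name, rows in (("1.csv", "1\n2\n3\n"), ("2.csv", "4\n5\n6\n")):
            Path(tmp, name).write_text("HEADER\n" + rows)
        out = Path(tmp, "___all_CSVs_JOINED.csv")
        join_csvs(tmp, out)
        return out.read_text()
```

As with `cat`, a source file that lacks a trailing newline would have its last row fused with the first row of the next file, so that is worth checking on real data.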
Given the nature of the described file storage policy, performance can be boosted by using more processes that handle the small files and the large files independently, as laid out above, with the logic put inside a pair of conversion_script_?.sh shell scripts :
parallel --jobs 2 conversion_script_{1}.sh ::: $( seq -f "%1g" 1 2 )
The header removal is a "just"-[CONCURRENT] flow of processing, while the merge itself is pure-[SERIAL]. ( For a larger number of files it might become interesting to use a multi-staged tree of trees, i.e. several stages of [SERIAL] collections of [CONCURRENT]-ly pre-processed leaves. Yet for just 8000 files, and not knowing the actual file-system details, the latency masking from processing the two directories independently, just-[CONCURRENT]-ly, will be fine to start with. )
Last but not least, the final pair of ___all_CSVs_JOINED.csv files is safe to open in a way that prevents moving all disk-stored data into RAM at once ( using a chunk-sized file-reading iterator, and avoiding RAM spillovers by using memory-mapped mode, as a context manager ) :
with pandas.read_csv( "<_folder_1_>//___all_CSVs_JOINED.csv",
                      sep        = NoDefault.no_default,
                      delimiter  = None,
                      ...
                      chunksize  = SAFE_CHUNK_SIZE,
                      ...
                      memory_map = True,
                      ...
                      ) \
  as df_reader_MMAPer_CtxMGR:
     ...
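A minimal runnable sketch of this chunked, memory-mapped read might look as follows. It assumes pandas 1.2 or newer (where the reader returned for `chunksize` can be used as a context manager); `count_rows_in_chunks`, `demo` and the tiny temporary file are hypothetical stand-ins for the real joined CSV and whatever per-chunk work the EDA needs.

```python
import os
import tempfile

import pandas as pd


def count_rows_in_chunks(csv_path, chunk_size):
    """Stream a header-less CSV in fixed-size chunks; return (rows, chunks)."""
    rows = 0
    chunks = 0
    # With chunksize set, read_csv returns an iterator of DataFrames, so only
    # one chunk is held in memory at a time.
    with pd.read_csv(
        csv_path,
        header=None,          # headers were stripped before the merge
        chunksize=chunk_size,
        memory_map=True,      # map the file instead of buffered reads
    ) as reader:
        for chunk in reader:
            rows += len(chunk)   # replace with the real per-chunk processing
            chunks += 1
    return rows, chunks


def demo():
    """Write a 10-row file and read it back in chunks of 4 rows."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "joined.csv")
        with open(path, "w", newline="") as handle:
            handle.writelines(f"{i},{i * i}\n" for i in range(10))
        return count_rows_in_chunks(path, chunk_size=4)
```

A suitable chunk size depends on row width and available RAM, so it is best treated as one more option to benchmark.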
When tweaking for ultimate performance, details matter and depend on physical hardware bottlenecks ( disk-I/O-wise, filesystem-wise, RAM-I/O-wise ), so due care may bring further improvement in minimising the repeatedly performed end-to-end processing times. Sometimes this even means turning the data into a compressed/zipped form, in cases where CPU/RAM resources permit sufficient performance advantages over the limited performance of disk-I/O throughput : moving fewer bytes is so much faster that the CPU/RAM decompression costs are still lower than moving 200+ [GB] of uncompressed plain-text data.
Details matter: tweak options, benchmark, tweak options, benchmark, tweak options, benchmark.
It would be nice to post your progress on testing the performance:
end-2-end duration of strategy ... [s] AS-IS now
end-2-end duration of strategy ... [s] with parallel --jobs 2 ...
end-2-end duration of strategy ... [s] with parallel --jobs 4 ...
end-2-end duration of strategy ... [s] with parallel --jobs N ... + compression ...
Keep us posted.

Very Large Text that just disappeared in IDLE PyCharm

I Have This Algorithm Below:
from bs4 import BeautifulSoup
import requests
import time
# html is assumed to already hold the page source (its definition is not shown)
soup = BeautifulSoup(html, 'html.parser')
p = []  # collected task URLs
for link in soup.select('div.sg-actions-list__hole > a[href*="/tarefa"]'):
    ref = link.get('href')
    rt = 'https://brainly.com.br' + str(ref)
    p.append(rt)
print(p)
for url in p:
    r = requests.get(url).text
    time.sleep(10)
    print(r)
Which basically prints the source code of the page.
My problem is not about the algorithm but about IDLE: when I print the page source code, it is so big that some parts of the HTML end up disappearing. Is there any solution to this?
I cannot guess what 'redenected' is supposed to mean. In any case, please specify your OS and OS version, and how many characters and lines you are trying to print ('len(p), count( Also, try to reproduce the problem without involving Beautiful Soup, a 3rd party module, by generating the text in your program.
For instance, on Windows 10 with 3.9.0a1, I can print a 100000 line text.
>>> def f(n):
        nl = '\n'
        s = ('a'*60 + nl)*n
        print(f"s has {len(s)} chars, {s.count(nl)} lines")
        print(s)
>>> f(100000)
s has 6100000 chars, 100000 lines
[Squeezed text (100000 lines).] # Reverse text box after about 1/2 minute.
Squeezing large output was introduced in late 2018. It protects against the freeze effect of long lines. As should be explained in the IDLE doc, squeezed text can be copied to the clipboard, viewed in a separate window, or expanded in the shell.

PDFTK Output Same Size as Input Regardless of Cat'd Page Count

I'm running into a weird situation with a particular group of PDFs and not sure where to start. If I burst a 25M, 600 pg file, the output becomes 25M per bursted file. If I do pdftk input.pdf cat 1-100 output out.pdf the size is also 25M (25292kb vs 25524kb for original). Doing page range 1-5 results in a file size of 25040kb.
Is there a flag that I can add to pdftk to handle this situation? Ghostscript can take a page range from this pdf and make an appropriate size PDF but gs doesn't seem to handle burst as well as requires having every font installed.
You are probably making the following assumption about PDF: if you have a PDF with file size 3000 KB and 10 pages, then splitting this PDF will result in 10 files with file size 300 KB.
This assumption is wrong. Imagine a 3000 KB document with ten pages and the following objects:
- four font subsets used on every page, each about 50 KB
- ten images that figure on a single page, each about 200 KB (one image per page)
- four images that figure on every page, each about 50 KB
- ten pages with content streams of about 25 KB each
- about 350 KB for objects such as the catalog, the info dictionary, the page tree, the cross-reference table, etc.
A single page will need at least:
- the four font subsets: 4 times 50 KB
- the single image: 1 time 200 KB
- the four images: 4 times 50 KB
- a single content stream: 1 time 25 KB
- a slightly reduced cross-reference table, a slightly reduced page tree, an almost identical catalog, an info dictionary of identical size,... 200 KB
Together that's 825 KB. This means that you end up with 8250 KB (10 times 825 KB) if you split up a 10-page 3000 KB PDF document into 10 separate pages.
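The arithmetic can be checked with a few lines of Python. This is just a worked version of the example, using the sizes from the object list (including the 25 KB content stream per page); the constant and function names are made up for illustration.

```python
# Sizes in KB, taken from the example object list.
FONT_SUBSET = 50       # four subsets, used on every page
PAGE_IMAGE = 200       # one image per page
SHARED_IMAGE = 50      # four images, used on every page
CONTENT_STREAM = 25    # one content stream per page
OVERHEAD_FULL = 350    # catalog, info dictionary, page tree, xref (10-page file)
OVERHEAD_SINGLE = 200  # the same structures in a one-page file

PAGES = 10


def whole_document_kb():
    """Size of the original file: shared objects are stored only once."""
    return (4 * FONT_SUBSET + PAGES * PAGE_IMAGE + 4 * SHARED_IMAGE
            + PAGES * CONTENT_STREAM + OVERHEAD_FULL)


def single_page_kb():
    """Size of one split-off page: it must carry every shared object itself."""
    return (4 * FONT_SUBSET + PAGE_IMAGE + 4 * SHARED_IMAGE
            + CONTENT_STREAM + OVERHEAD_SINGLE)


if __name__ == "__main__":
    print("original:", whole_document_kb(), "KB")
    print("one page:", single_page_kb(), "KB")
    print("all split pages:", PAGES * single_page_kb(), "KB")
```

The shared objects (fonts and per-page background images) are what make the sum of the split files larger than the original.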
My guess is that the shared resources (resources that are used in every page, e.g. fonts) are huge in your PDF. E.g. if someone used a high-resolution image as the background of each page that takes about 25M, then each of your 600 pages will need those 25M.
Note that PdfTk is nothing more than a wrapper around an obsolete version of iText. You may want to try a more recent version of iText to find out if the problem persists.

Automating Netlogo based on input from a spreadsheet or txt file

I have developed a model in NetLogo and I want to automate the model run.
Basically, what I want to do is read the input from an Excel, CSV or .txt file and then ask NetLogo to change the inputs in the model accordingly, run the model for, let's say, 100 ticks, and store the required output from the 100th tick either in the same file from which the input was read or in a different file. Something like this:
Trial  Input1  Input2  Output
1      10      20
2      20      20
3      10      30
.
.
.
100    20      100
The variables Input 1 and Input 2 are in the interface either as a slider or input button.
Use the BehaviorSpace feature in NetLogo. It's available under the Tools menu, and the documentation on the topic is below.
https://ccl.northwestern.edu/netlogo/docs/behaviorspace.html

Apache benchmark: what does the total mean milliseconds represent?

I am benchmarking php application with apache benchmark. I have the server on my local machine. I run the following:
ab -n 100 -c 10 http://my-domain.local/
And get this:
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 3 3.7 2 8
Processing: 311 734 276.1 756 1333
Waiting: 310 722 273.6 750 1330
Total: 311 737 278.9 764 1341
However, if I refresh my browser on the page http://my-domain.local/ I find out it takes a lot longer than 737 ms (the mean that ab reports) to load the page (around 3000-4000 ms). I can repeat this many times and the loading of the page in the browser always takes at least 3000 ms.
I tested another, heavier page (page load in browser takes 8-10 seconds). I used a concurrency of 1 to simulate one user loading the page:
ab -n 100 -c 1 http://my-domain.local/heavy-page/
And the results are here:
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.1 0 1
Processing: 17 20 4.7 18 46
Waiting: 16 20 4.6 18 46
Total: 17 20 4.7 18 46
So what does the total line on the ab results actually tell? Clearly it's not the number of milliseconds the browser is loading the web page. Is the number of milliseconds that it takes from browser to load the page (X) linearly dependent of number of the total mean milliseconds ab reports (Y)? So if I'm able to reduce half of Y, have I also reduced half of X?
(Also, I'm not really sure what Processing, Waiting and Total mean.)
I'll reopen this question since I'm facing the problem again.
Recently I installed Varnish.
I run ab like this:
ab -n 100 http://my-domain.local/
Apache bench reports very fast response times:
Requests per second: 462.92 [#/sec] (mean)
Time per request: 2.160 [ms] (mean)
Time per request: 2.160 [ms] (mean, across all concurrent requests)
Transfer rate: 6131.37 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.0 0 0
Processing: 1 2 2.3 1 13
Waiting: 0 1 2.0 1 12
Total: 1 2 2.3 1 13
So the time per request is about 2.2 ms. When I browse the site (as an anonymous user) the page load time is about 1.5 seconds.
Here is a picture from the Firebug Net tab. As you can see, my browser is waiting 1.68 seconds for my site to respond. Why is this number so much bigger than the request times ab reports?
Are you running ab on the server? Don't forget that your browser is local to you, on a remote network link. An ab run on the webserver itself will have almost zero network overhead and report basically the time it takes for Apache to serve up the page. Your home browser link will have however many milliseconds of network transit time added in, on top of the basic page-serving overhead.
Ok.. I think I know what's the problem. While I have been measuring the page load time in browser I have been logged in.. So none of the heavy stuff is happening. The page load times in browser with anonymous user are closer to the ones ab is reporting.