I have a website that displays products. Each page displays 16 products and there are around 70,000 products on the site. The HTML for each page is generated using PHP.
Product information is stored in a database. Roughly, the first page of results (if I want to show the cheapest items first) would be displayed like this (simplified):
// run SQL to fetch product titles and image filenames ($pdo is an existing PDO connection)
$stmt = $pdo->query("SELECT itemTitle, itemImageFileName FROM items ORDER BY itemPrice ASC LIMIT 16");

// loop through and display the items
foreach ($stmt as $row) {
    echo "<p>{$row['itemTitle']}</p>";
    echo "<img src='{$row['itemImageFileName']}' width='80' height='100'>";
}
When I do this, the item titles appear first, and the images load around half a second to a second afterwards. I am wondering how I can speed up the image loading.
All images are stored in a single folder containing 70,000 images. Nearly all images are less than 50KB in size. Each image filename is of the form id_width_height.jpg. For example: 32193_80_100.jpg
I am wondering whether the bottleneck is that it takes the server some time to find the required files, because there are 70,000 files in the folder. Is there a way I can speed this up? Are there any other reasons why the images are slow to load?
First of all, I would inspect the image requests in the network tab of Chrome/Firefox. That will answer the question of whether the lookup or the download is the critical part.
Do you have a link?
70,000 files in one folder is imho too many. I would split the filename up and create subfolders such as:
/htdocs/img/32193/80_100.jpg
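A minimal sketch of that mapping in PHP (illustrative only; it assumes the id_width_height.jpg naming scheme described in the question):

<?php
// Illustrative: map "32193_80_100.jpg" to "/img/32193/80_100.jpg",
// following the id_width_height.jpg naming scheme from the question.
function imageUrl(string $fileName): string
{
    [$id, $rest] = explode('_', $fileName, 2); // "32193" and "80_100.jpg"
    return "/img/$id/$rest";
}

echo imageUrl('32193_80_100.jpg'); // outputs: /img/32193/80_100.jpg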
In my README.md, I've got dozens of images, which are located in media/theme-n, where n is an arbitrary number. Here's the problem: I only want to include the most recent images. For example, I have images in media/theme-1, media/theme-2, and media/theme-3.
I want to create a variable called current-theme, set it to theme-3, and just use that variable instead of modifying every image location.
Here's what I've currently got:
<img src="[current-theme]/zsh-prompt-3.png">
[current-theme]: media/theme-3
Well, as the title says, I want to get the total count of pages printed after the user sends the print order. That's because I have a database where I need to store the sale, like "A4 COLOR (or black & white) / 45 pages". I care about the number of pages, but I don't know if it is possible to get the color mode info too. Thanks for any help.
Which is the better approach for storing an image name in a database? I have two choices: the first is to store just the image name, e.g. apple.png; the second is to store the full image URL, e.g. abc.com/src/apple.png.
Any help will be appreciated. Thanks.
Best practice is not to save the full path to the image, like abc.com/src/apple.png, but to save a domain-relative path to the image. For example:
User image: /user/{id}/avatar/img.png
Product image: /product/{id}/1.png
This way you avoid tying images to a particular server, server path, URL, etc. For example, if you decide to move all your images to another server, you won't need to change all the records in the DB.
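A minimal sketch of how the prefix would be applied at render time (the IMAGE_BASE_URL constant is an assumed configuration value, not something from this answer):

<?php
// Illustrative: the DB stores only the domain-relative path;
// the host lives in configuration and can change freely.
const IMAGE_BASE_URL = 'https://img.example.com'; // assumed config value

function imageUrl(string $storedPath): string
{
    return IMAGE_BASE_URL . $storedPath;
}

echo imageUrl('/user/42/avatar/img.png'); // https://img.example.com/user/42/avatar/img.png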
The two answers already cover it pretty well. It is indeed best practice to save the directory path instead of the entire URL. Some of the reasons were already covered, such as making it easy to move your folders to another server without having to change your file logic at all.
You could also keep everything in one directory, refer to that, and just save the image name. However, I would not recommend it. The other structure simply makes it much easier to navigate and look through. Good file structure is something you'll thank yourself for later, in case you ever have to go through things manually for one reason or another.
With that said, I'd like to add this trick into the mix:
On the server side, $_SERVER['DOCUMENT_ROOT'] always starts you from the document root (handy for things like file_exists() checks), as opposed to tedious relative prefixes such as ../../, which quickly become a mess. In the HTML itself, a leading / does the same job for URLs; note that $_SERVER['DOCUMENT_ROOT'] is a filesystem path that browsers cannot request.
So in the end, as an image path, you'd have something like:
<img src="/<?php echo $row['filePath']; ?>">
$row['filePath'] being your stored file path from the database.
Depending on how your file path is saved, you can drop the leading / in the image source link.
First of all, you should upload all images to the public folder of your project, so there is no need to save the domain name.
If you are storing all images in one directory, then there is no problem with storing only the image name in the database.
You can easily access images like <img src="/foldername/imagename.jpg" />
But if your project has multiple directories, like
profile: to save user avatar images,
background: to save background images,
then it is better to save the image with its path in the database, like "/profile/avatar.jpg",
so you can access the image like <img src="imagepathhere" />
Another common way is to create an image table with the columns:
id
type (enum or int)
name (file name)
Define types in your app (better in a model), like:
const USER_AVATAR = 1;
const PRODUCT_IMG = 2;
Define a path map for each image type, like:
$paths = [
    USER_AVATAR => '/var/www/project/web/images/users',
    ...
];
and use ids from this image table in your other tables. This is called a polymorphic association. It is the most flexible way to store images.
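A minimal sketch of the lookup (illustrative; the user constant and path follow this answer, while the product path is an assumed second entry):

<?php
// As defined above; repeated here so the sketch is self-contained.
const USER_AVATAR = 1;
const PRODUCT_IMG = 2;

// Illustrative: resolve the filesystem path for a row from the image table.
function imagePath(array $row): string
{
    $paths = [
        USER_AVATAR => '/var/www/project/web/images/users',
        PRODUCT_IMG => '/var/www/project/web/images/products', // assumed entry
    ];
    return $paths[$row['type']] . '/' . $row['name'];
}

echo imagePath(['type' => USER_AVATAR, 'name' => 'avatar.jpg']);
// /var/www/project/web/images/users/avatar.jpg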
I'm trying to use VBA in Excel to navigate a site with Internet Explorer and download an Excel file for each day.
After looking through the site's HTML, it looks like each day's page has a similar structure, but a portion of the website link seems completely random. However, this random part stays constant: it does not change each time you load the page.
The following portion of the HTML code contains the unique string:
<a href="#" onClick="showZoomIn('222698519','b1a9134c02c5db3c79e649b7adf8982d', event);return false;">
The part starting with "b1a" is what is used in the website link. Is there any way to extract this part of the page and assign it to a variable that I can then use to build my website link?
Since you don't show your code, I will answer in general terms:
1) You get all the elements of type link (<a>) with a Set allLinks = ie.document.getElementsByTagName("a"). It will be a collection of length n containing all the links in the document.
2) You detect the precise link containing the information you want. Let's imagine it's the 4th one (you can parse the properties to check which one it is, in case it's dynamic):
Set myLink = allLinks(3) '<- 4th : index = 3 (starts from zero)
3) You get your token with a simple split function:
myToken = Split(myLink.onClick, "'")(3)
Of course you can be more concise if the position of the link containing the token is always the same, e.g. always the 4th link:
myToken = Split(ie.document.getElementsByTagName("a")(3).onClick,"'")(3)
I have searched in many places but have been unable to find a good solution.
So what I am trying to achieve is as below:
My program will have quite a lot of PDF docs which I will have to send via mail. There is a mail server limitation of 4 MB. So if all the PDFs together are less than 4 MB, they will be sent as a single mail. Otherwise I will have to create multiple files, each less than 4 MB.
Now my program works fine for the following cases:
1: Lots of files, each less than 4 MB; I keep a tab during merging so that none of the merged files goes over 4 MB.
2: All files are pretty small, so merging them together does not hit the 4 MB limit.
But there can be a scenario where there is one file which is, say, 14 MB. I can split that document by pages, but that is not a good solution either, as the page size is not evenly distributed across the pages. I have used iText and PDFBox. Any help/pointer will be highly appreciated!
Imagine a 3000 KB document with ten pages and the following objects:
- four font subsets used on every page, each about 50 KB
- ten images that figure on a single page, each about 200 KB (one image per page)
- four images that figure on every page, each about 50 KB
- ten pages with content streams of about 25 KB each
- about 350 KB for objects such as the catalog, the info dictionary, the page tree, the cross-reference table, etc.
A single page will need at least:
- the four font subsets: 4 times 50 KB
- the single image: 1 time 200 KB
- the four images: 4 times 50 KB
- a single content stream: 1 time 25 KB
- a slightly reduced cross-reference table, a slightly reduced page tree, an almost identical catalog, an info dictionary of identical size,... 200 KB
Together that's 825 KB. This means that you end up with 8250 KB (10 times 825 KB) if you split up a 10-page, 3000 KB PDF document into 10 separate pages.
This example is the result of guesswork (based on experience) and it assumes that the PDF is predictable. Most PDFs aren't:
some pages will require high-resolution images (maybe even megabytes), other pages won't have any images,
some pages will need many different fonts and font subsets (lots of kilobytes), other pages will consist of merely some vector drawings (a tiny content stream if compressed),
some pages can share a large amount of resources (Form XObjects, Image XObjects,...), other pages won't share any resources,
and so on...
You have noticed that yourself, as you write: "I can split that document by pages, but that is not a good solution either, as the page size is not evenly distributed across the pages."
That's exactly why your question can have no other answer than: you'll have to do trial and error. No software can predict how much space is needed by a page before you look at what is needed by that page.
Update:
As David indicates in the comments, it is possible to calculate all the resources needed for a page, and to check if the current resources plus the needed resources exceed the maximum file size.
I have written a small example:
public void manipulatePdf(String src, String dest)
        throws IOException, DocumentException {
    Document document = new Document();
    PdfCopy copy = new PdfSmartCopy(document, new FileOutputStream(dest));
    document.open();
    PdfReader reader = new PdfReader(src);
    for (int i = 1; i <= reader.getNumberOfPages(); i++) {
        // check resources needed for reader.getPageN(i);
        copy.addPage(copy.getImportedPage(reader, i));
        System.out.println("After adding page: " + copy.getOs().getCounter());
    }
    document.close();
    System.out.println("After closing document: " + copy.getOs().getCounter());
    reader.close();
}
I have executed the example on a PDF sample with 18 pages and this was the output:
After adding page: 56165
After adding page: 111398
After adding page: 162691
After adding page: 210035
After adding page: 253419
After adding page: 273429
After adding page: 330696
After adding page: 351564
After adding page: 400351
After adding page: 456545
After adding page: 495321
After adding page: 523640
After adding page: 576468
After adding page: 633525
After adding page: 751504
After adding page: 907490
After adding page: 957164
After adding page: 999140
After closing document: 1002509
You see how the file size of the copy gradually grows with each page that is added. After all pages are added, the size is 999140 bytes, and then the page tree and cross-reference stream are written, adding another 3369 bytes.
Where it says // check resources needed for reader.getPageN(i);, you could make a guesstimate of the size that will be added for the page and break out of the loop if it exceeds a maximum value.
Why would this be a guesstimate:
You could be counting objects that are already added. If you keep track of the objects (not that difficult), your guess will be more accurate.
I'm using PdfSmartCopy. Suppose that there are two identical objects inside your PDF. Bad PDF software often causes such problems. For instance: the same image bytes are added twice to the file. PdfSmartCopy can detect this and will reuse the first object it encounters instead of adding the redundant bytes of the extra object.
We currently don't have a reader.getTotalPageBytes() in PdfReader because PdfReader tries to use as little memory as possible. It won't load any objects into memory as long as these objects aren't needed. Hence it doesn't know the size of each object before the page is imported.
However, I'll make sure that such a method is added in the next release.
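To make the guesstimate idea concrete, the loop in the example above could be adapted like this (a rough sketch only; estimatePageSize is a hypothetical helper you would have to write yourself, since, as noted, no ready-made method existed at this point):

long maxSize = 4 * 1024 * 1024; // the 4 MB mail limit from the question
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
    // hypothetical helper: guess the bytes this page would add
    long estimate = estimatePageSize(reader, i);
    if (copy.getOs().getCounter() + estimate > maxSize) {
        break; // start a new output document for the remaining pages
    }
    copy.addPage(copy.getImportedPage(reader, i));
}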
Update:
In the next version, you'll find a tool named SmartPdfSplitter that depends on a new class named PdfResourceCounter. You can use it like this:
PdfReader reader = new PdfReader(src);
SmartPdfSplitter splitter = new SmartPdfSplitter(reader);
int part = 1;
while (splitter.hasMorePages()) {
    splitter.split(new FileOutputStream("results/merge/part_" + part + ".pdf"), 200000);
    part++;
}
reader.close();
Note that this can result in a single-page PDF that exceeds the limit (which was set to 200000 bytes in the code sample) in case that single page cannot be reduced to fewer bytes. In that case, splitter.isOverSized() will return true and you'll have to find another way to reduce the PDF.
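For example, the check could slot into the loop above (a sketch; how you handle the oversized part is up to you):

while (splitter.hasMorePages()) {
    splitter.split(new FileOutputStream("results/merge/part_" + part + ".pdf"), 200000);
    if (splitter.isOverSized()) {
        // this part is a single page that could not stay under the limit;
        // reduce the PDF another way (e.g. recompress its images)
    }
    part++;
}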
PDF Clown supports page data size prediction without the need for trial and error: since 2010 it has featured a dedicated method (org.pdfclown.tools.PageManager.getSize(Page)) that calculates the actual page data size in memory, without writing it to a file for trial.
Furthermore, there's another method (org.pdfclown.tools.PageManager.split(long maxDataSize)), purposely implemented to address your kind of scenario, which leverages the above-mentioned PageManager.getSize method: it automatically splits a file based on a size limit without creating any intermediate temporary files for trial and error.
You can see a practical example of its use in org.pdfclown.samples.cli.PageManagementSample (the PageDataSizeCalculation and DocumentSplitOnMaximumFileSize cases) included in the downloadable distribution -- here is an example of console output from the PageDataSizeCalculation case:
Page 1: 29380 (full); 29380 (differential); 29380 (incremental)
Page 2: 30493 (full); 1501 (differential); 30881 (incremental)
Page 3: 21888 (full); 1432 (differential); 32313 (incremental)
Page 4: 33781 (full); 4789 (differential); 37102 (incremental)
. . .
where:
full is the page data size encompassing all its dependencies (like shared resources) -- this is the size of the page when extracted as a single-page document;
differential is the additional page data size -- this is the extra content that's not shared with previous pages;
incremental is the data size of the page sublist encompassing all the previous pages and the current one.
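For orientation, a rough sketch of calling the size-prediction method (only the PageManager.getSize(Page) name comes from this answer; the file/document setup and the assumption that getSize can be called statically are mine -- check PageManagementSample for the authoritative usage):

// Rough sketch; verify against PageManagementSample before relying on it.
public static void main(String[] args) throws Exception {
    org.pdfclown.files.File file = new org.pdfclown.files.File("source.pdf");
    int pageNumber = 1;
    for (org.pdfclown.documents.Page page : file.getDocument().getPages()) {
        // full page data size, including shared resources (assumed call style)
        long size = org.pdfclown.tools.PageManager.getSize(page);
        System.out.println("Page " + pageNumber++ + ": " + size + " (full)");
    }
}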