Display images with Greek in filenames

I'm building an e-shop in Greek and I've come across an important issue with the filenames of images that users upload to the Linux server. If the filename contains Greek characters, the file uploads normally but the image won't display in the browser. It's important to us that Greek filenames are supported, as we are counting on searches in Greek for both web and image results on Google.

Maybe it's a server charset issue with the filename? Your filename is in Greek, but the server may translate it to something like Αήι;;΄'.jpg, in which case the img src would have to reference Αήι;;΄'.jpg.

Why aren't they displaying? Do you have an example?
I guess your options depend on how the images are being stored and served back to the users. One thing you could do, in order to preserve the Greek text on the page for search results, is to serve the files under a non-Greek name and write the Greek part to the image's alt attribute.
This, however, is assuming that you're storing the image files in conjunction with a database of some kind which would contain the necessary meta-data. Basically, upon upload, the image's meta-data (alt text, maybe mime type, anything else important about it) would be written to a record in a database table and then the image file itself can be saved on the file system. Its actual file name could just be the primary key from the table, which can just be an auto-increment field.
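A minimal sketch of both options, in Python. The `/images/` URL layout and the function names are assumptions for illustration, not something from the question. The first function serves the file under its database key and keeps the Greek text in the alt attribute; the second keeps the Greek filename but percent-encodes it as UTF-8, which is one thing worth checking if the server and the page disagree on charset.

```python
import html
from urllib.parse import quote

def img_tag(image_id: int, ext: str, alt_text: str) -> str:
    """Build an <img> tag whose file name is the database key, not the upload name."""
    src = f"/images/{image_id}.{ext}"  # hypothetical URL layout
    # html.escape protects quotes and angle brackets; Greek letters pass through unchanged.
    return f'<img src="{src}" alt="{html.escape(alt_text, quote=True)}">'

def encoded_src(filename: str) -> str:
    """Percent-encode a file name (as UTF-8) so the URL itself is plain ASCII."""
    return "/images/" + quote(filename)

print(img_tag(42, "jpg", "Κόκκινο φόρεμα"))
print(encoded_src("φόρεμα.jpg"))
```

Whether the percent-encoded form resolves depends on the file actually being stored on disk under a UTF-8 name, which this sketch does not check.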

Related

Hidden characters in Cytoscape names

I am attempting to upload a .cys file to a journal website as part of a submission. Although they accept all file types, when I try to upload the .cys file, the name exceeds the 64-character limit, apparently due to some hidden characters in the name. Is there any way to see the hidden characters in the file name and/or change them so the filename is less than 64 characters?
You should be able to simply rename the .cys file. However, they might be attempting to unzip it automatically (a .cys file is actually just a zip archive of everything in a Cytoscape session), in which case they might be having problems with some of the filenames inside the .cys file, which can't be renamed. You'll need to ask the journal about that.
-- scooter
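To actually see hidden characters in a name, something like the following Python sketch may help. The function names are made up for illustration. It reports every control, format, or unusual separator character with its code point, and can strip them.

```python
import unicodedata

def is_hidden(ch: str) -> bool:
    """True for control/format characters and for separators other than a plain space."""
    cat = unicodedata.category(ch)
    return cat.startswith("C") or (cat.startswith("Z") and ch != " ")

def hidden_chars(name: str):
    """Return (index, code point, Unicode name) for every hidden character in name."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(name)
        if is_hidden(ch)
    ]

def strip_hidden(name: str) -> str:
    """Drop the hidden characters and keep everything else."""
    return "".join(ch for ch in name if not is_hidden(ch))

name = "net\u200bwork.cys"  # contains a zero-width space
print(hidden_chars(name))
print(len(name), len(name.encode("utf-8")), strip_hidden(name))
```

If the site counts bytes rather than characters, the second number printed (the UTF-8 length) is the one to compare against the 64 limit. The same check could be run over `zipfile.ZipFile(path).namelist()` to inspect the names inside the archive.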

How to serve files with rewritten file names?

I am working on an app which will store images in the file system. For example, if a user uploads an image named "green.jpg" it will get stored as /wwwroot/images/f/c/3/fc3b16ee-9254-11e9-bc42-526af7764f64.jpg
In the database, I will have a table that contains the image name and its original name:
UUID Name
fc3b16ee-9254-11e9-bc42-526af7764f64 green.jpg
I want the generated html to look like:
<img src="/whatever/green.jpg">
So the question is how to serve /wwwroot/images/f/c/3/fc3b16ee-9254-11e9-bc42-526af7764f64.jpg when the browser requests "green.jpg".
Generating /wwwroot/whatever/green.jpg is not a good idea because of possible name collisions and additional disk I/O.
What I want is for filenames in the generated HTML to be human-readable, while the files are stored in a way that avoids filename collisions and keeps any one directory from holding too many files.
Generating a new directory for each letter of the UUID would solve both problems, but I don't know if it's a good idea.
Using a UUID is not important, just the uniqueness of the file path is. Even if I use an integer as a unique identifier, if there are millions of images, having millions of directories might be too much.
If that matters, I am using PostgreSQL as the DB engine and will run the app on Linux. I don't know what filesystem I'll use, probably ext4.
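One possible compromise, sketched in Python: keep the readable name at the end of the URL but put the identifier in front of it, so the server never needs the readable name to find the file. The `/img/<uuid>/<name>` URL shape is an assumption, not what the question asks for verbatim. The on-disk layout is the one from the question, with three single-character levels taken from the start of the UUID (at most 16^3 = 4096 directories).

```python
import posixpath
import uuid

IMAGE_ROOT = "/wwwroot/images"  # root directory from the question

def disk_path(image_uuid: str, ext: str, depth: int = 3) -> str:
    """Shard by the first `depth` characters of the UUID, one directory per character."""
    canonical = str(uuid.UUID(image_uuid))  # raises ValueError on anything that is not a UUID
    return posixpath.join(IMAGE_ROOT, *canonical[:depth], f"{canonical}.{ext}")

def resolve(url_path: str) -> str:
    """Map a hypothetical URL /img/<uuid>/<display-name> to the stored file."""
    _, _prefix, image_uuid, display_name = url_path.split("/")
    ext = display_name.rsplit(".", 1)[-1]
    # The display name is only cosmetic; the UUID alone decides which file is served.
    return disk_path(image_uuid, ext)

print(resolve("/img/fc3b16ee-9254-11e9-bc42-526af7764f64/green.jpg"))
```

Validating the UUID before building the path also keeps `..` or slashes from the URL out of the file system lookup. If the URL must be exactly /whatever/green.jpg, the name would have to be looked up in the database on each request instead.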

What is the best method to store both relative and absolute URLs in the same column?

Users can upload an image for their avatar, or specify a URL from another website.
This is stored in a 'user.image' column in a database.
User uploads image - A randomly generated 10 letter filename with the original extension is saved locally, and the filename is saved to the database (not the entire path)
User specifies URL - The entire URL is saved to the database
However, when going to display the image, how am I to know whether it was an uploaded image (e.g. filename only in column), or if the user specified a full URL?
Possible thoughts are:
Checking to see if 'user.image' begins with the string 'http://' - but this could be expensive performance-wise under high traffic
Physically downloading the specified URL at the time of saving, saving that image to the local server, and then updating the database with only the filename (not the path, since it's local now)
Or is there a better way?
As already suggested, you can store your images in database or in the file system. Check this question and answer to decide which is better for your case.
Doing this also allows you to store the image at a maximum resolution (automatically resizing it before storing), perform some optimizations to obtain a smaller size, etc.
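For the first idea in the question (telling a stored filename from a full URL at display time), a minimal Python sketch follows. The `/avatars/` prefix and the function name are assumptions. The check is a short string parse per row, so it seems unlikely to be the bottleneck, though that is worth measuring rather than taking on faith.

```python
from urllib.parse import urlsplit

LOCAL_PREFIX = "/avatars/"  # hypothetical public path for uploaded files

def avatar_src(stored: str) -> str:
    """Return the value to put in <img src>, given the content of user.image."""
    if urlsplit(stored).scheme in ("http", "https"):
        return stored  # the user supplied a full external URL
    return LOCAL_PREFIX + stored  # a bare file name saved by the upload code

print(avatar_src("abcdefghij.png"))
print(avatar_src("https://example.com/me.png"))
```

Checking the scheme instead of the literal string 'http://' also covers https URLs. An alternative that avoids parsing altogether would be a separate boolean column set at save time.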

Using ElasticSearch and/or Solr as a datastore for MS Office and PDF documents

I'm currently designing a full text search system where users perform text queries against MS Office and PDF documents, and the result will return a list of documents that best match the query. The user will then be able to select any document returned and view that document within MS Word, Excel, or a PDF viewer.
Can I use ElasticSearch or Solr to import the raw binary documents (i.e. .docx, .xlsx, .pdf files) into its "data store", and then export the document to the user's device on command for viewing?
Previously, I used MongoDB 2.6.6 to import the raw files into GridFS and the extracted text into a separate collection (the collection contained a text index) and that worked fine. However, MongoDB full text searching is quite basic and therefore I'm now looking at either Solr or ElasticSearch to perform more complex text searching.
Nick
Both Solr and Elasticsearch will index the content of the document. Solr has that built-in, Elasticsearch needs a plugin. Easy either way and both use Tika under the covers.
Neither of them will store the document itself. You can try making them do it, but they are not designed for it and you will suffer.
Additionally, neither Solr nor Elasticsearch is currently recommended as primary storage. They can do it, but it is not as mission critical for them as - say - for a filesystem implementation.
So, I would recommend having the files somewhere else and using Solr/Elasticsearch for searching only. That's where they shine.
I would try the Elasticsearch attachment plugin. Details can be found here:
https://www.elastic.co/guide/en/elasticsearch/plugins/2.2/mapper-attachments.html
https://github.com/elasticsearch/elasticsearch-mapper-attachments
It's built on top of Apache Tika:
http://tika.apache.org/1.7/formats.html
Attachment Type
The attachment type allows to index different "attachment" type field
(encoded as base64), for example, Microsoft Office formats, open
document formats, ePub, HTML, and so on (full list can be found here).
The attachment type is provided as a plugin extension. The plugin is a
simple zip file that can be downloaded and placed under
$ES_HOME/plugins location. It will be automatically detected and the
attachment type will be added.
Supported Document Formats
HyperText Markup Language
XML and derived formats
Microsoft Office document formats
OpenDocument Format
iWorks document formats
Portable Document Format
Electronic Publication Format
Rich Text Format
Compression and packaging formats
Text formats
Feed and Syndication formats
Help formats
Audio formats
Image formats
Video formats
Java class files and archives
Source code
Mail formats
CAD formats
Font formats
Scientific formats
Executable programs and libraries
Crypto formats
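Since the attachment type expects the file content encoded as base64, preparing a document for indexing could look roughly like this Python sketch. It only builds the JSON body and does not talk to Elasticsearch. The field names `file` and `path` are hypothetical; the real names depend on the mapping you define.

```python
import base64
import json

def attachment_body(raw: bytes, path: str) -> str:
    """Build a JSON document with the file content as base64 plus a pointer to the original."""
    doc = {
        "file": base64.b64encode(raw).decode("ascii"),  # hypothetical attachment-typed field
        "path": path,  # hypothetical field: where the original is stored outside the index
    }
    return json.dumps(doc)

body = attachment_body(b"%PDF-1.4 example bytes", "/docs/report.pdf")
print(body)
```

Keeping the `path` alongside the encoded content matches the advice above to store the files elsewhere and use the search engine for searching only.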
A bit late to the party but this may help someone :)
I had a similar problem and some research led me to fscrawler. Description:
This crawler helps to index binary documents such as PDF, Open Office, MS Office.
Main features:
Local file system (or a mounted drive) crawling and index new files,
update existing ones and removes old ones. Remote file system over SSH
crawling.
REST interface to let you "upload" your binary documents to elasticsearch.
Regarding solr:
If the docs only need to be returned on metadata searches, Solr features a BinaryField fieldtype, to which you can send binary data base64 encoded. Keep in mind that in general people recommend against doing this, as it may increase your index (RAM requirements/performance), and if possible a set-up where you store the files externally (and the path to the file in Solr) might be a better choice.
If you want Solr to automatically index the text inside the pdf/doc -- that's possible with the ExtractingRequestHandler: https://wiki.apache.org/solr/ExtractingRequestHandler
Elasticsearch does store documents (.pdf, .doc, for instance) in the _source field. It can be used as a NoSQL datastore (same as MongoDB).

How should I format user uploaded pictures' filenames?

My website deals with pictures that users upload. I'm kind of conflicted about what my picture filenames should consist of. I'm worried about scalability and possibly security. Maybe someone out there deals with the same thing and can tell me what they use on their site?
Currently, my filename convention is
{pictureId}_{userId}_{salt}_{variant}.{fileExt}
where salt is a token generated server-side (not sure why I decided to put this here, maybe for security purposes I don't know) and variant is something like t where it signifies it's a thumbnail. So it would look something like
12332_22_hb8324jk_t.jpg
Please advise, thanks.
In addition to the previous comments, you may want to consider creating a directory hierarchy for your files. Depending on volume and the particular OS hosting the files, you can easily reach a point where you have an unreasonably large number of files in a single directory. There may be limits on the number of files allowed per folder. If you ever need to do any manual QA or maintenance on your files, this may be problematic (especially if such maintenance is not scripted).
I once worked on a project with a high volume of images. We decided to record a subpath in our database in addition to the filename of each file. Our folder names looked like this:
a/e/2/f/9
3/3/2/b/7
Essentially, we created folders 5 deep with a single hex value as the folder name. The depth was probably excessive, but effective. I suppose this could have led to us reaching a limit on the number of folders on a volume (not sure if such a limit exists).
I would also consider storing a drive in addition to a path (assuming you have a bunch of disks for storage). This way you can move images around and then update your database (assuming you have one) as part of the move.
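A sketch of how such a subpath could be produced, in Python. The answer does not say where the hex values came from; deriving them from a hash of the file name is an assumption here, chosen because it spreads files evenly and is reproducible.

```python
import hashlib

def subpath(file_name: str, depth: int = 5) -> str:
    """Derive a subpath like 'a/e/2/f/9' from the first `depth` hex digits of a SHA-1 hash."""
    digest = hashlib.sha1(file_name.encode("utf-8")).hexdigest()
    return "/".join(digest[:depth])

def location(drive: str, file_name: str) -> dict:
    """The record to keep in the database: drive, subpath and file name stored separately."""
    return {"drive": drive, "subpath": subpath(file_name), "file_name": file_name}

print(location("disk01", "12332_22_hb8324jk_t.jpg"))  # "disk01" is a made-up drive label
```

Because the drive is a separate field, moving an image to another disk only means copying the file and updating that one column.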
My 2 pence worth; there is a bit of a conflict between scalability and security in this problem I would say.
If you have real security concerns, then you should not rely at all on the filename of the target image: this is just security by obscurity, and somebody could eventually guess the name (even with your salt idea, which makes it harder).
Instead you should at least have a login mechanism to create a session between client and server, to make sure you can only get at stuff once you have authenticated. Even then, traffic can be sniffed: if security really is a concern, then I would say you have to use SSL.
Regarding scalability: I would suggest you actually do give your images sequential numbers, and store them in 'bins' of (say) 500 images each. As you fill up a bin, create a new one. Store bin (min-image-id, max-image-id) information in one DB table and image numbers in another: you can then comparatively cheaply find which bin a particular image lives in from its id. This is a fairly common solution for storing lots of docs/images.
You could then map your URLs to the bin+image id: but then to avoid the problem noted by Jason Williams (sequential numbering, makes it easy to probe), you really should address security separately as in point 1.
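The bin lookup could be sketched like this in Python. An in-memory list stands in for the bin table, and the names are made up. When every bin is exactly 500 wide, plain integer division is enough; the range lookup covers the case where bins have uneven sizes.

```python
import bisect

BIN_SIZE = 500

def bin_number(image_id: int) -> int:
    """Bin index for a 1-based sequential id when every bin holds exactly BIN_SIZE images."""
    return (image_id - 1) // BIN_SIZE

def find_bin(bins, image_id: int) -> str:
    """bins: list of (min_id, max_id, bin_name) tuples sorted by min_id."""
    starts = [b[0] for b in bins]
    i = bisect.bisect_right(starts, image_id) - 1  # last bin whose min_id <= image_id
    if i >= 0 and bins[i][0] <= image_id <= bins[i][1]:
        return bins[i][2]
    raise KeyError(image_id)

bins = [(1, 500, "bin0"), (501, 1000, "bin1")]
print(bin_number(501), find_bin(bins, 501))
```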
You might like to consider replacing the underscores with (e.g.) minuses. (Underscores are used as wildcards in SQL, so you could potentially run into trouble one day in a LIKE comparison). (And of course, underscores are just plain evil :-)
It looks from your example like you're avoiding spaces and upper-case characters - good move. I'd keep everything lowercase and use case-insensitive comparisons to eliminate any potential case-sensitivity issues with different file systems.
Scalability should be fine as long as you can cope with any number of digits in your user, picture and type IDs. You're very unlikely to hit any filename length limits with this scheme.
Security could be an issue if you use sequential IDs, as someone could potentially tweak the numbers and request a picture they shouldn't be able to access - but the salt should make it virtually impossible for someone to guess the correct filename for another picture. If users can't see/access the internal filename in any way, that may be an unnecessary measure though.
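Putting these suggestions together (minus signs instead of underscores, all lowercase, a random salt), a naming helper might look like the Python sketch below. The function names and the 8-character salt length are assumptions.

```python
import secrets

def build_name(picture_id: int, user_id: int, variant: str, ext: str, salt: str = "") -> str:
    """Build '{pictureId}-{userId}-{salt}-{variant}.{ext}', all lowercase."""
    salt = salt or secrets.token_hex(4)  # 8 random hex characters when no salt is given
    return f"{picture_id}-{user_id}-{salt}-{variant}.{ext}".lower()

def parse_name(name: str) -> dict:
    """Split a file name produced by build_name back into its parts."""
    stem, ext = name.rsplit(".", 1)
    picture_id, user_id, salt, variant = stem.split("-")
    return {
        "picture_id": int(picture_id),
        "user_id": int(user_id),
        "salt": salt,
        "variant": variant,
        "ext": ext,
    }

print(build_name(12332, 22, "t", "jpg", salt="hb8324jk"))
print(parse_name("12332-22-hb8324jk-t.jpg"))
```

Parsing relies on the salt and variant never containing a minus sign, which holds for hex salts and single-letter variants.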
The first thing to do is to set up a directory structure that models your use case. In your case you have a user that uploads a picture. You would probably have a directory structure like this (probably on a network share somewhere):
-Pictures
  -UserID1
    -PictureID1~^~Variant.jpg
    -PictureID2~^~Variant.jpg
  -UserID2
    -PictureID1~^~Variant.jpg
    -PictureID2~^~Variant.jpg
Pictures - simply the root directory for the following.
UserID - is the database user ID.
PictureID is simply the picture ID from the database (assuming you record the filename of each uploaded picture in a database.)
~^~ - This is simply a delimiter. You can use a one-character or X-character sequence. I like three characters as it is easily handled with the split function and is readily distinguishable in the file name.
Sometimes I like to add the size of the picture in with the file name .256.jpg or .1024.jpg.
At any rate, all of this depends on your use case. The most important thing is setting up the directory structure properly. That will make it easier to access/serve and manage the pictures.
You can add any other information you need into the filename as long as it doesn't exceed the maximum filename length on your system.
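As a rough illustration of this layout in Python (the function names are hypothetical, and `posixpath` is used so the output looks the same on any OS):

```python
import posixpath

DELIM = "~^~"  # the three-character delimiter described above

def picture_path(user_id: int, picture_id: int, variant: str, ext: str = "jpg") -> str:
    """Build Pictures/<UserID>/<PictureID>~^~<Variant>.<ext>."""
    return posixpath.join("Pictures", str(user_id), f"{picture_id}{DELIM}{variant}.{ext}")

def parse_file_name(file_name: str):
    """Recover (picture_id, variant, ext) with a plain split on the delimiter."""
    stem, ext = file_name.rsplit(".", 1)
    picture_id, variant = stem.split(DELIM)
    return picture_id, variant, ext

print(picture_path(22, 12332, "256"))
print(parse_file_name("12332~^~256.jpg"))
```

Here the variant doubles as the size suffix (256, 1024); it could just as well be a label such as t for thumbnail.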