Displaying only the similarities between 20 or so files? - header

Say I've got a directory full of html pages. Their headers and footers are basically the same, but I want to be able to see only the portions of all the pages that are the same. I'd like to call it an n-way merge, but that isn't what it is, it's looking for just the similarities between the header and the footers of all the files.
(and my header I don't mean just the <head> tag, but rather the portions of the page that are alike).
Note: There are like 20 html files.
Is there a name for a tool that does this?

If you are looking for what they have in common, you need a Clone detector.
Such a tool finds the common code fragments across an arbitrarily large
number of files and reports the commonality. A good one will discover
commonality based on target langauge structure in spite of white space changes, etc.,
e.g., it isn't comparing lines but rather copy-and-pasted structures.

I use ExamDiff

Related

Edit files as if concatenated as one -- Do any IDEs or text editors have this feature?

Background: I am working with Angular (but my problem is not particular to any language or framework). In Angular, each component requires four separate files. So, we often find ourselves with 40+ files open. But, most of these files can be tiny, less than 20 lines each.
Many IDEs allow you to open your files in multiple windows. Each window can have a different panel, and each panel can have different tabs. This is great, but honestly, still isn't enough.
What I want: In addition to windows, panels, and tabs, I'd like to add another level of organization.
I speculate this has probably existed for decades, but I just don't know what it's called. At the very least, I speculate this has existed at least since Angular was a thing.
For example, here is a screenshot of VSCode with four files open across four panels. (Code taken from Angular dynamic component tutorial):
And here is a quick mockup showing what I'm looking for. Four files are open, but the three shortest ones are "concatenated" into one editor. Arrow-key down from the bottom of one file will bring you to the first line of the next file.
Notably, these files are not actually concatenated on-disk.
TLDR: What text editor can allow me to edit multiple files as if they were concatenated, as in the mockup above?
If the files stay as separate windows/tabs, the file editor would have to shrink each tab to a minimal height, and then tile them vertically. If any editor can do it, I suspect it would be emacs or vim. You might also be able to do it by opening separate editor windows and using a tiling window manager.
We can achieve a similar effect with some text editing magic. It would be something like:
Add a header to each file consisting of a unique separator (e.g. # === magic separator === filename my_file.js ===)
Use cat to combine all the files into one file
Edit this one file
When done, use the separator to break them up and put the text back into the original files
You could easily write some scripts for combining and splitting so you can do it quickly. You can also set up a background script that automatically runs the splitter as you edit the combined file. However, the combined file would essentially be a new file, so you could not view changes on it with git, and VS Code's CodeLens/Inline blame wouldn't work.
One option would be to develop your codebase with the combined files checked in to VCS, and then only have the splitter script as part of your "build" step. So you would make your changes, run ./build.sh which splits the files into some temp directory, and then run your application from there.
Lastly, and I hate to be snide, but the fact is that this problem is best solved by avoiding poorly designed frameworks that do not consider developer ergonomics. Many other languages give the developer much freedom and many tools to organize their code as they wish, rather than imposing constraints like requiring many small components to be in separate files. Java for example also had a similar problem (dunno if more recent versions fixed it) - you can only have one class per file, which creates a huge mess if you like having many small files. C# does not have this limitation and as a result C# codebases can be much tidier than Java codebases.

Understanding the PDF DOM

I am writing an application that has to read and interpret data stored in some PDF files. The reading part is done but I am only able to get a dump of all the words on a page and not the format of the words. What I mean is that if I have to extract a table, I am getting the numbers in the table but not the markup which defines the table.
Further, there is some formatting used which displays a few of these numbers within parentheses (meaning that those numbers are negative) but the parentheses themselves are not part of the text. Hence, I am not able to distinguish between positive and negative numbers present in the PDF table!
How do you get the PDF markup along with the text? Is a PDF similar in structure to an XML with tags used to markup tables etc.? If not, then, is there a resource which describes the salient features of the PDF DOM?
I am using VBA and the Acrobat library (AcroExch etc.)
There is no such thing as "PDF markup" in the sense of HTML etc. A table in PDF cannot be distinguished from line art, other than by using OCR, which can be error-prone if the layout is complex. It is simply drawn using geometrical shapes, like in a vector-based graphics program.
"Is a PDF similar in structure to an XML with tags used to markup tables etc.?"
No, not at all.
And there is no such thing as a 'DOM' either. Google for a file named *PDF32000_2008.pdf*. The current PDF specification for v1.7 (ISO spec) is that file. You should be able to locate it on the Adobe website.
As omz stated, text inside PDF does not really have a structure. You can take a look on the specification here. However, for some very specific files, there is something called PDF Tags, or PDF Marked Content, which is fairly new, and it aims to give PDF documents some kind of structure. If you target this kind of files specifically, you might be able to achieve something. Take a look on chapter 10 (Document Interchange) of the Adobe's specification for further details.
Maybe what you want to achieve can be done with less effort and faster by using TET, the Text Extraction Toolkit made by the fine folks from pdflib.com ( http://www.pdflib.com/products/tet/ ) ??
AFAIR, the TET has some (limited) support for table detection as well....

Find duplicate PDFs

I'm looking for a utility that will help me find duplicate PDFs. The problem: I have a 1000s of PDF files. Some are duplicates. They are not easy to detect due differing files names and small differences in file size. Is there a utility/algorithm/library that can help me find the duplicates or show me files that are very similar (or degree of difference)?
Create an MD5 hash for each file and store it in a database. Identical files will then sort next to each other, or you can quickly search for a pre-existing key.
The problem is not yet solved in any way. What I do, is I use fdupes http://premium.caribe.net/~adrian2/fdupes.html to find exact duplicates.
But most of all, I use a workflow which minimizes duplicates. Every document that enters my system gets indexed with this perl-script I wrote: http://seegras.discordia.ch/Programs/fileindex which puts some name and an md5-sum of it into ~/.fileindex.md5 Now I can change metadata of the local PDF-files or whatever (and run fileindex again), and whenever I accidently download the same file again, I will stil lhave the md5-sum of the original file, and thus can detect whether it's a duplicate.
There's also exif-meta and exif-rename on http://seegras.discordia.ch/Programs/ which help with setting PDF metadata and with renaming PDF-files according to metadata; and if you're tagging all the files correctly, you will end up with duplicate filenames, indicating that they might be the same document within a different file.
If the files were created by the different tools, they could look the same but generate very different results because they are structured totally differently. I made some suggestions in a blog article at https://blog.idrsolutions.com/2010/09/comparing-2-pdf-files/
DiffPDF looks like something that might help you.
I remember that there is a UNIX utility called pdf2txt (see the package poppler-utils). You can try to extract the text from the files and make a textual diff.

Naming convention for assets (images, css, js)?

I am still struggling to find a good naming convention for assets like images, js and css files used in my web projects.
So, my current would be:
CSS: style-{name}.css
examples: style-main.css, style-no_flash.css, style-print.css etc.
JS:
script-{name}.js
examples: script-main.js, script-nav.js etc.
Images: {imageType}-{name}.{imageExtension}
{imageType} is any of these
icon (e. g. question mark icon for help content)
img (e. g. a header image inserted via <img /> element)
button (e. g. a graphical submit button)
bg (image is used as a background image in css)
sprite (image is used as a background image in css and contains multiple "versions")
Example-names would be: icon-help.gif, img-logo.gif, sprite-main_headlines.jpg, bg-gradient.gif etc.
So, what do you think and what is your naming convention?
I've noticed a lot of frontend developers are moving away from css and js in favor of styles and scripts because there is generally other stuff in there, such as .less, .styl, and .sass as well as, for some, .coffee. Fact is, using specific technology selections in your choice of folder organization is a bad idea even if everyone does it. I'll continue to use the standard I see from these highly respected developers:
src/html
src/images
src/styles
src/styles/fonts
src/scripts
And their destination build equivalents, which are sometimes prefixed with dest depending on what they are building:
./
images
styles
styles/fonts
scripts
This allows those that want to put all files together (rather than breaking out a src directory) to keep that and keeps things clearly associated for those that do break out.
I actually go a bit futher and add
scripts/before
scripts/after
Which get smooshed into two main-before.min.js and main-after.min.js scripts, one for the header (with essential elements of normalize and modernizr that have to run early, for example) and after for last thing in the body since that javascript can wait. These are not intended for reading, much like Google's main page.
If there are scripts and style sheets that make sense to minify and leave linked alone because of a particular cache management approach that is taken care of in the build rules.
These days, if you are not using a build process of some kind, like gulp or grunt, you likely are not reaching most of the mobile-centric performance goals you should probably be considering.
I place CSS files in a folder css, Javascript in js, images in images, ... Add subfolders as you see fit. No need for any naming convention on the level of individual files.
/Assets/
/Css
/Images
/Javascript (or Script)
/Minified
/Source
Is the best structure I've seen and the one I prefer. With folders you don't really need to prefix your CSS etc. with descriptive names.
For large sites where css might define a lot of background images, a file naming convention for those assets comes in really handy for making changes later on.
For example:
[component].[function-description].[filetype]
footer.bkg-image.png
footer.copyright-gradient.png
We have also discussed adding in the element type, but im not sure how helpful that is and could possibly be misleading for future required changes:
[component].[element]-[function-description].[filetype]
footer.div-bkg-image.png
footer.p-copyright-gradient.png
You can name it like this:
/assets/css/ - For CSS files
/assets/font/ - For Font files. (Mostly you can just go to google fonts to search for usable fonts.)
/assets/images/ - For Images files.
/assets/scripts/ or /assets/js/ - For JavaScript files.
/assets/media/ - For video and misc. files.
You can also replace "assets" with "resource" or "files" folder name and keep the name of it's subfolders. Well having an order folder structure like this isn't very important the only important is you just have to arrange your files by it's format. like creating a folder "/css/" for CSS files or "/images/" for Image files.
First, I divide into folders: css, js, img.
Within css and js, I prefix files with the project name because your site may include js and css files which are components, this makes it clear where files are specific for your site, or relating to plugins.
css/mysite.main.css css/mysite.main.js
Other files might be like
js/jquery-1.6.1.js
js/jquery.validate.js
Finally images are divided by their use.
img/btn/submit.png a button
img/lgo/mysite-logo.png a logo
img/bkg/header.gif a background
img/dcl/top-left-widget.jpg a decal element
img/con/portait-of-something.jpg a content image
It's important to keep images organized since there can be over 100 and can easily get totally mixed together and confusingly-named.
I tend to avoid anything generic, such as what smdrager suggested. "mysite.main.css" doesn't mean anything at all.
What is "mysite"?? This one I'm working on? If so then obvious really, but it already has me thinking what it might be and if it is this obvious!
What is "Main"? The word "Main" has no definition outside the coders knowledge of what is within that css file.
While ok in certain scenarios, avoid names like "top" or "left" too: "top-nav.css" or "top-main-logo.png".
You might end up wanting to use the same thing elsewhere, and putting an image in a footer or within the main page content called "top-banner.png" is very confusing!
I don't see any issue with having a good number of stylesheets to allow for a decent naming convention to portray what css is within the given file.
How many depends entirely on the size of the site and what it's function(s) are, and how many different blocks are on the site.
I don't think you need to state "CSS" or "STYLE" in the css filenames at all, as the fact it's in "css" or "styles" folder and has an extension of .css and mainly as these files are only ever called in the <head> area, I know pretty clearly what they are.
That said, I do this with library, JS and config (etc) files. eg libSomeLibrary.php, or JSSomeScript.php. As PHP and JS files are included or used in various areas within other files, and having info of what the file's main purpose is within the name is useful.
eg: Seeing the filename require('libContactFormValidation.php'); is useful. I know it's a library file (lib) and from the name what it does.
For image folders, I usually have images/content-images/ and images/style-images/. I don't think there needs to be any further separation, but again it depends on the project.
Then each image will be named accordingly to what it is, and again I don't think there's any need for defining the file is an image within the file name. Sizes can be useful, especially for when images have different sizes.
site-logo-150x150.png
site-logo-35x35.png
shop-checkout-button-40x40.png
shop-remove-item-20x20.png
etc
A good rule to follow is: if a new developer came to the files, would they sit scratching their head for hours, or would they likely understand what things do and only need a little time researching (which is unavoidable)?
As anything like this, however, one of the most important rules to follow is simply constancy!
Make sure you follow the same logic and patterns thoughout all your naming conventions!
From simple css file names, to PHP library files to database table and column names.
This is an old question, but still valid.
My current recommendation is to go with something in this lines:
assets (or assets-web or assets-www); this one is intended for static content used by the client (browser)
data; some xml files and other stuff
fonts
images
media
styles
scripts
lib (or 3rd-party); this one is intended for code you don't make or modify, the libraries as you get them
lib-modded (or 3rd-party-modified); this one is intended for code you weren't expected to modify, but had to, like applying a workaround/fix in the meantime the library provider releases it
inc (or assets-server or assets-local); this one is intended for content used server side, not to be used by the client, like libraries in languages like PHP or server scripts, like bash files
fonts
lib
lib-modded
I marked in bold the usual ones, the others are not usual content.
The reason for the main division, is in the future you can decide to server the web assets from a CDN or restrict client access to server assets, for security reasons.
Inside the lib directories i use to be descriptive about the libraries, for example
lib
jquery.com
jQuery
vX.Y.Z
github
[path]
[library/project name]
vX.Y.Z (version)
so you can replace the library with a new one, without breaking the code, also allowing future code maintainers, including yourself, to find the library and update it or get support.
Also, feel free to organize the content inside according to its usage, so images/logos and images/icons are expected directories in some projects.
As a side note, the assets name is meaningful, not only meaning we have resources in there, but meaning the resources in there must be of value for the project and not dead weight.
The BBC have tons of standards relating web development.
Their standard is fairly simple for CSS files:
http://www.bbc.co.uk/guidelines/futuremedia/technical/css.shtml
You might be able to find something useful on their main site:
http://www.bbc.co.uk/guidelines/futuremedia/

How to extract data from a PDF file while keeping track of its structure?

My objective is to extract the text and images from a PDF file while parsing its structure. The scope for parsing the structure is not exhaustive; I only need to be able to identify headings and paragraphs.
I have tried a few of different things, but I did not get very far in any of them:
Convert PDF to text. It does not work for me as I lose images and the structure of the document.
Convert PDF to HTML. I found a few tools that helped me with this, and the best one so far is pdftohtml. The tool is really good presentation wise, but I haven't been able to successfully parse the HTML.
Convert PDF to XML. Same as above.
Anyone has any suggestions on how to tackle this problem?
There is essentially not an easy cut-and-paste solution because PDF isn't really very interested in structure. There are many other answers on this site that will tell you things in much more detail, but this one should give you the main points:
If identifying text structure in PDF documents is so difficult, how do PDF readers do it so well?
If you want to do this in PDF itself (where you would have the majority of control over the process), you'll have to loop over all text on pages and identify headers by looking at their text properties (fonts used, size relative to the other text on the page, etc...).
On top of that you'll also have to identify paragraphs by looking at the positioning of text fragments, white space on the page, closeness of certain letters, words and lines... PDF by itself doesn't even have a concept for a "word", let alone "lines" or "paragraphs".
To complicate things even more, the way text is drawn on the page (and thus the order in which it appears in the PDF file itself) doesn't even have to be the proper reading order (or what us humans would consider to be proper reading order).
PDF parsing for headers and its sub contents are really very difficult (It doesn't mean its impossible ) as PDF comes in various formats. But I recently encountered with tool named GROBID which can helps in this scenario. I know it's not perfect but if we provide proper training it can accomplish our goals.
Grobid available as a opensource on github.
https://github.com/kermitt2/grobid
You may do use the following approach like this with iTextSharp or other open source libraries:
Read PDF file with with iTextSharp or similar open source tools and collect all text objects into an array (or convert PDF to HTML using the tool like pdftohtml and then parse HTML)
Sort all text objects by coordinates so you will have them all together
Then iterate through objects and check the distance between them to see if 2 or more objects can be merged into one paragraph or not
Or you may use the commercial tool like ByteScout PDF Extractor SDK that is capable of doing exactly this:
extract text and images along with analyzing the layout of the text
XML or CSV where text objects are merged or splitted into paragraphs inside a virtual layout grid
access objects via special API that makes it possible to address each object via its "virtual" row and column index disregarding how it is stored inside the original PDF.
Disclaimer: I am affiliated with ByteScout
PDF files can be parsed with tabula-py, or tabula-java.
I made a full tutorial on how to use tabula-py on this article. You can tabula in a web-browser too as long as you have installed Java.
Unless its is Marked Content, PDF does not have a structure.... You have to 'guess' it which is what the various tools are doing. There is a good blog post explaining the issues at http://blog.idrsolutions.com/2010/09/the-easy-way-to-discover-if-a-pdf-file-contains-structured-content/
As mentioned in the answers above, PDF's aren't very easy to parse. However, if you have certain additional information regarding the text that you want to parse, you can pull it off.
If your headings are positioned at specific parts of the page, you can parse the PDF file and sort the parsed output by coordinates.
If you have prior knowledge of the spacing between headings and paragraphs, you could also leverage this information to parse the file.
PDFBox is a PDF parsing tool that you can use for extracting text and images on top of which you can define your custom rules for parsing.
However, for parsing PDFs you need to have some prior knowledge of the general format of the PDF file. You can check out the following blogpost Document parsing for more information regarding document parsing.
Disclaimer:I was involved in writing the blogpost.
iText api:
PdfReader pr=new PdfReader("C:\test.pdf");
References:
PDFReader