How to index plain text files (other than *.txt) - apache

I want to index all the source code of my application. The code base contains files with many extensions: .html, .js, .py, .php, .json, etc. I want to index them all. However, my first attempt to index them like so
bin/post -c gettingstarted ~/Projects/myapp/
was not successful. I see that it indexed only *.txt files.

I found a solution. By default, bin/post only indexes a fixed set of file types, which does not include .js, .py and some others. To force it to index them, pass the extensions with -filetypes:
bin/post -c gettingstarted -filetypes js,py ~/Projects/myapp/
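If the set of extensions in a project is not known up front, the -filetypes argument can be generated instead of typed by hand. The following is only a sketch and not part of bin/post: list_extensions is a hypothetical helper name, and it assumes file names do not contain newlines.

```shell
# list_extensions DIR
# Print the unique file extensions found under DIR as a comma-separated
# list (e.g. "js,py"), in the form expected by -filetypes.
# Hypothetical helper; files without a dot in their name are skipped.
list_extensions() {
  find "$1" -type f -name '*.*' \
    | sed 's/.*\.//' \
    | sort -u \
    | paste -sd, -
}
```

It could then be used as bin/post -c gettingstarted -filetypes "$(list_extensions ~/Projects/myapp)" ~/Projects/myapp/. The generated list also contains extensions of binary files (images, archives), so it is probably worth reviewing before indexing.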

Related

make: Convert .pdf files in a folder to .txt files without using loops

I want to convert all .pdf files in a folder into .txt files with make without using loops and with the help of pdftotext. The new .txt files shall keep the original file name. Additionally, the new file gets a new file extension.
Example:
test1.pdf --> test1.newextension
Everything is written in a Makefile. I start the conversion by typing "make converted" in my console.
My first (miserable) attempt was:
converted:
	@ls *.pdf | -n1 pdftotext
However, there are 3 things still missing with it:
It doesn't repeat the process
The new file extension isn't being added to the newly created files.
Is the original name being kept or being given to the pdftotext function?
I am used to programming in bash, and Makefiles are completely new to me. I'd be thankful for answers!
You can refer to this simple example:
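# Default to every PDF in the current directory unless SOURCES is set.
SOURCES ?= $(wildcard *.pdf)
# Pattern rule: build foo.txt from foo.pdf ($< is the source, $@ the target).
%.txt: %.pdf
	pdftotext $< $@
all: $(SOURCES:%.pdf=%.txt)
clean:
	rm -f *.txt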
If SOURCES is not defined, it defaults to all *.pdf files in the current directory.
Then we define a pattern rule that teaches make how to build a .txt file from the .pdf file of the same name.
We also define a target all that builds a .txt file for each .pdf file in the SOURCES variable.
Finally, a clean rule quietly deletes all .txt files in the current directory (so be careful, this is potentially dangerous).
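For readers coming from bash, the name mapping done by the pattern rule and by $(SOURCES:%.pdf=%.txt) corresponds roughly to the shell logic sketched here. It only prints the commands instead of running them; planned_conversions and its optional extension argument are hypothetical names introduced for illustration.

```shell
# planned_conversions [DIR] [EXT]
# Print, without running them, the pdftotext commands that correspond to
# the pattern rule: one DIR/name.pdf to DIR/name.EXT conversion per PDF.
planned_conversions() {
  dir=${1:-.}
  ext=${2:-txt}
  for src in "$dir"/*.pdf; do
    [ -e "$src" ] || continue        # glob matched nothing
    printf 'pdftotext %s %s\n' "$src" "${src%.pdf}.$ext"
  done
}
```

With make itself, make -n all gives a similar dry run of the real rules.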

Wget file format

I have to download all of a site's content and then search the downloaded folder for "*.pdf" files. I am downloading the site using wget -r --no-parent http://www.example.com/ but the problem is that sometimes a link looks like this:
http://www.foodmanufuture.eu/dpubs?f=K20
and the downloaded PDF is saved under the name "dpubs?f=K20" with no file extension, rather than "dpubs?f=K20.pdf". Is there a way to check how many PDF files I have in this folder?
Have you tried the --content-disposition flag? From the man page:
If this is set to on, experimental (not fully-functional) support for "Content-Disposition" headers is enabled. This can currently result in extra round-trips to the server for a "HEAD" request, and is known to suffer from a few bugs, which is why it is not currently enabled by default. This option is useful for some file-downloading CGI programs that use "Content-Disposition" headers to describe what the name of a downloaded file should be.
So it tries to ask the server for a filename. I tried it for the URL you gave and it seemed to work.
You could use the command
file filename
Like this:
file pdfurl-guide
pdfurl-guide: PDF document, version 1.5
You could use:
file *
to find out exactly which files in your folder are PDF files.
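To get a count rather than a listing, the check can be scripted. The sketch below does not call file; it swaps in a simpler test and looks for the "%PDF-" signature that PDF files start with, so it works regardless of the file name. count_pdfs is a hypothetical helper name, and the check may miss PDFs that have junk bytes before the signature.

```shell
# count_pdfs [DIR]
# Count regular files under DIR (recursively) whose first five bytes are
# the PDF signature "%PDF-", regardless of what the file is called.
count_pdfs() {
  find "${1:-.}" -type f -exec sh -c '
    for f do
      # NUL bytes are dropped so binary files compare cleanly.
      if [ "$(head -c 5 "$f" | tr -d "\000")" = "%PDF-" ]; then
        echo x
      fi
    done' sh {} + | wc -l | tr -d ' '
}
```

Running count_pdfs on the folder created by wget -r prints a single number; a file named "dpubs?f=K20" is counted as long as its content is a PDF.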

Converting all .haml files in a dir to .html in another dir

In a simple website I'm working on, I have a directory haml and a directory pages. The haml folder contains the .haml files I work on, and the pages folder contains my converted .html files.
I know I can convert the files, from the website root directory, by doing:
haml haml/about.haml pages/about.html for each file.
However, is there a way to convert all the .haml files in my haml folder to an equivalent .html file in the pages folder?
Something like: haml haml/*.haml html/*.html
Thanks!
There is a nice html2haml gem for the reverse direction (HTML or ERB to Haml),
and you can run it on a directory using
find . -name \*.erb -print | sed 'p;s/\.erb$/.haml/' | xargs -n2 html2haml
Guard can watch haml files and compile them, and you can set an output directory as well.
https://github.com/guard/guard-haml sounds like an option.
There's a python script that sounds like it will do exactly what you want...
I haven't tested it, but per the README, it sounds like it's as simple as:
python haml2html.py [input_dir] [output_dir]
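For the direction asked about (haml to html), a plain shell loop around the command from the question may already be enough. This is a sketch under assumptions: convert_haml_dir and the HAML override variable are hypothetical names, and the two-argument form "haml input output" is taken from the question; other Haml versions may expect a different command line.

```shell
# convert_haml_dir SRC_DIR OUT_DIR
# Run the converter once per SRC_DIR/*.haml, writing OUT_DIR/<name>.html.
# HAML is a hypothetical override for the converter command; it defaults
# to the "haml" executable used in the question.
convert_haml_dir() {
  src_dir=$1
  out_dir=$2
  mkdir -p "$out_dir"
  for f in "$src_dir"/*.haml; do
    [ -e "$f" ] || continue          # no .haml files present
    name=$(basename "$f" .haml)
    ${HAML:-haml} "$f" "$out_dir/$name.html" || return 1
  done
}
```

From the website root this would be called as convert_haml_dir haml pages.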

How do I give a specific mode to all directories and another to all files in a spec file?

I can't rely on the umask since my machine does not use umask to set permissions. Is there a way to specify, in the %files section of the spec file, that all sub-directories (and their sub-directories etc.) of some root directory have a certain permission, and similarly that all files under the same root directory have another permission?
If not, I'll have to run some external bash script to get the spec file syntax for each individual file, and copy that output to the %files section of the spec file, which will be highly tedious.
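If you look at the various references online, %defattr() takes a lesser-known fourth parameter that sets the mode for directories: %defattr(file mode, user, group, directory mode), for example %defattr(0644, root, root, 0755).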
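An alternative that does not rely on %defattr is to normalise the modes in the build root itself, for example at the end of %install, so they do not depend on the umask. This is only a sketch: set_tree_modes is a hypothetical helper name, and it assumes the %files section leaves the on-disk modes alone (a "-" in a mode field of %defattr does that).

```shell
# set_tree_modes ROOT DIR_MODE FILE_MODE
# Give every directory under ROOT (including ROOT itself) DIR_MODE and
# every regular file FILE_MODE, independent of the current umask.
set_tree_modes() {
  find "$1" -type d -exec chmod "$2" {} +
  find "$1" -type f -exec chmod "$3" {} +
}
```

A call such as set_tree_modes "$RPM_BUILD_ROOT/some/root" 0755 0644 would apply it to one subtree; the path here is a placeholder.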

Problem with multiple listings of the same file in RPM spec

I have some problems with an rpm spec file that lists the same file multiple times. For this spec we do a normal compilation and then have a script that copies everything to the buildroot. Within this buildroot we have a lot of generic scripts that need to be installed on the final system, so we just list this directory.
The problem, however, is that one of the scripts contains configuration options that might be changed after installation, so we also list this script with different attributes as %config. This means the script is defined multiple times with conflicting attributes, so rpmbuild complains and does not include the script in the installation package at all.
Is there a good way to handle this problem and tell rpmbuild to use only the second definition, or do we have to separate the script into two parts, one containing the configuration and one containing the actual logic?
Instead of specifying the directory, you can create a file list and then prune duplicate files from that.
So where you have something like
%files
%dir foo
%config foo/scriptname
You modify those parts to
find $RPM_BUILD_ROOT -type f | sed -e "s|^$RPM_BUILD_ROOT||" > filelist
sed -i "\|^foo/scriptname$|d" filelist
%files -f filelist
%config foo/scriptname
You can also use %{buildroot} in place of $RPM_BUILD_ROOT.
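The two steps (list the files relative to the build root, then drop the one that gets its own %config entry) can be tried outside of rpmbuild with a small sketch like the following. build_filelist is a hypothetical helper name; it assumes the build root path contains no "|" characters and that file names contain no newlines.

```shell
# build_filelist BUILDROOT EXCLUDED_PATH
# Print every regular file under BUILDROOT with the BUILDROOT prefix
# stripped, leaving out the line that equals EXCLUDED_PATH exactly.
build_filelist() {
  buildroot=${1%/}                   # drop a trailing slash, if any
  find "$buildroot" -type f \
    | sed -e "s|^$buildroot||" \
    | grep -v -x -F -- "$2" \
    | sort
}
```

Note that the stripped paths start with a slash, so the excluded path has to be given in that form as well, e.g. build_filelist "$RPM_BUILD_ROOT" /opt/foo/scriptname > filelist, where /opt/foo is a placeholder.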