What's elasticsearch and is it safe to delete logstash? - apache

I have an internal Apache server for testing purposes, not client facing.
I wanted to upgrade the server to Apache 2.4, but there is no space left, so I was trying to delete some files on the server.
After checking file sizes, I found that the folder /var/lib/elasticsearch takes up 80 GB. For example, /var/lib/elasticsearch/elasticsearch/nodes/0/indices/logstash-2015.12.08 alone takes 60 GB. I'm not sure what Elasticsearch is. Is it safe to delete this logstash folder? Thanks!

Elasticsearch is a search engine, similar to a NoSQL database, and it stores its data in indices. What you are seeing is the data of one index.
Probably someone was using the index around 2015, judging by the date in the index name.
I would just delete it.

I'm afraid that only you can answer that question. One use for Logstash plus Elasticsearch is to help make sense of system logs. That combination isn't normally set up by default, so I presume someone set it up at some time for some reason, and it has obviously done some logging. Only you can know whether it is still being used and whether it is safe to delete.

As other answers pointed out, Elasticsearch is a distributed search engine. I believe an earlier user was pushing application or system logs to this Elasticsearch instance using Logstash. If you can find the source application, check whether the original log files are still there; if they are, you can go ahead and delete your index. I highly doubt anyone still needs the logs from 2015, but it is really your call to check your application's archiving requirements and then take the necessary action.
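If the index is no longer needed, one option is to delete it through the Elasticsearch REST API (an HTTP DELETE on the index name) rather than removing files under /var/lib/elasticsearch by hand. The following is a minimal sketch, assuming the node listens on the default address http://localhost:9200; the host, port, and helper name are assumptions, only the index name comes from the question. The request is built but not sent.

```python
import urllib.request


def build_delete_index_request(base_url, index_name):
    """Build (but do not send) an HTTP DELETE request for one index."""
    url = f"{base_url.rstrip('/')}/{index_name}"
    return urllib.request.Request(url, method="DELETE")


# Assumed host and port; the index name is the one from the question.
req = build_delete_index_request("http://localhost:9200", "logstash-2015.12.08")
print(req.get_method(), req.full_url)

# To actually delete the index, send the request:
# urllib.request.urlopen(req)
```

Sending the request is irreversible, so it is worth confirming first that nothing still reads from or writes to the index.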


What is the significance of data-config.xml file in Solr?

When should I use it, and how is it configured? Can anyone please explain in detail?
The data-config.xml file is an example configuration file for how to use the DataImportHandler in Solr. It's one way of getting data into Solr, allowing one of the servers to connect through JDBC (or through a few other plugins) to a database server or a set of files and import them into Solr.
DIH has a few issues (for example the non-distributed way it works), so it's usually suggested to write the indexing code yourself and POST the documents to Solr from a suitable client, such as SolrJ, Solarium, SolrClient, MySolr, etc.
It has been mentioned that the DIH functionality really should be moved into a separate application, but that hasn't happened yet as far as I know.
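As a rough illustration of writing the indexing code yourself, the sketch below builds a JSON POST to Solr's update handler using only the Python standard library. The Solr address, the core name `mycore`, and the document fields are assumptions for illustration; in practice the documents would come from your database query. The request is built but not sent.

```python
import json
import urllib.request


def build_solr_update_request(solr_url, core, docs):
    """Build (but do not send) a JSON update request for a Solr core."""
    url = f"{solr_url.rstrip('/')}/{core}/update?commit=true"
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical documents; real ones would be rows read from the database.
docs = [
    {"id": "1", "title": "First row"},
    {"id": "2", "title": "Second row"},
]
req = build_solr_update_request("http://localhost:8983/solr", "mycore", docs)
print(req.get_method(), req.full_url)

# To index the documents, send the request:
# urllib.request.urlopen(req)
```

A dedicated client such as SolrJ adds batching and error handling on top of the same basic idea.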

Liferay 6.2 Lucene replication in cluster

I'd welcome any help regarding a simple issue: I have a clustered environment and I enabled Lucene replication in the properties (lucene.replicate.write=true). Now, all the tutorials instruct me to reindex Lucene.
Should I run it on one node? On both? Simultaneously or sequentially?
This question has been asked in Liferay Forum as well: https://www.liferay.com/community/forums/-/message_boards/view_message/69175435.
Thank you!
Basically, what I did at first was the following:
and the result was that replication was NOT WORKING.
What I tried next was to set this issue aside and continue with clustering the rest of the portal, which in the end helped Lucene as well. The steps were:
deploy cluster activation keys
deploy ehcache-cluster-web.war
edit the clusterlink_control and clusterlink_transport files as described in the Liferay tutorials
with the servers shut down, delete the contents of data/lucene, then start them and run a reindex on one node from the Control Panel
In the end, Lucene replication IS WORKING. What I think could be significant is the following. First, the portal.properties explanation of the lucene.commit.* keys is rather hard to comprehend. By trial and error I found out that these two keys are in an AND relation. I also found out about the portal.instance.* keys, which are used for multiple purposes in clustering and can matter if you have load balancers and/or Apache servers between the nodes and autodetection fails.
There are multiple ways to configure search clustering in Liferay. If you use the lucene.replicate.write=true way, you're looking at several reindexing runs: On every restart of a server you must reindex that server's documents, as it might have missed indexing requests when it was down.
So, short answer: Don't worry, reindex both. Sooner or later you'll do it anyways, no matter if you need only one now.

Can I use an API such as chef to automatically create, name and set passwords to multiple servers?

I am new to this so forgive me for not understanding the lingo.
I have been using the Rackspace cloud control panel to build multiple virtual servers. I use them for maybe a couple of hours, then I delete them. I need these servers to all have specific and unique names such as "server1, server2, server3, etc." I also need them to have a specific password, unlike the randomly generated password that is assigned by default.
I have been creating each individual server manually (based on an image that's set up), then I have to go back and reset the password and reboot all of them. Doing each one manually is a bit time consuming and I'm sure there is an easier way. Please help me figure this out.
I've been doing some searching but I haven't found anything too relevant to my problem. On top of that, I'm not too familiar with programming and such.
Basically what I'm looking to do is automatically create these servers with their appropriate names and passwords already built in from the start. I'm not sure if some sort of "API" is the answer, or if there's some sort of script that can be written, or both.
Any assistance is much appreciated.

How to maintain lucene indexes in azure cloud-app

I just started playing with the Azure Library for Lucene.NET (http://code.msdn.microsoft.com/AzureDirectory). Until now, I was using my own custom code for writing Lucene indexes to Azure blob storage: I copied the blob to the local storage of the Azure web/worker role and read/wrote docs to the index there. I used a custom locking mechanism to make sure there were no clashes between reads and writes to the blob. I am hoping the Azure Library will take care of these issues for me.
However, while trying out the test app, I tweaked the code to use the compound-file option, and that created a new file every time I wrote to the index. Now, my question is: if I have to maintain the index, i.e. keep a snapshot of the index file and use it if the main index gets corrupted, how do I go about doing this? Should I keep a backup of all the .cfs files that are created, or is handling only the latest one fine? Are there API calls to clean up the blob so that only the latest file is kept after each write to the index?
After I answered this, we ended up changing our search infrastructure and used Windows Azure Drive. We had a Worker Role, which would mount a VHD using the Block Storage, and host the Lucene.NET index on it. The code checked to make sure the VHD was mounted first and that the index directory existed. If the worker role fell over, the VHD would automatically dismount after 60 seconds, and a second worker role could pick it up.
We have since changed our infrastructure again and moved to Amazon with a Solr instance for search, but the VHD option worked well during development. It could have worked well in Test and Production, but requirements meant we needed to move to EC2.
I am using AzureDirectory for full-text indexing on Azure, and I am getting some odd results too, but hopefully this answer will be of some use to you.
Firstly, the compound-file option: from what I am reading and figuring out, the compound file is a single large file with all the index data inside. The alternative is having lots of smaller files (configured using the SetMaxMergeDocs(int) function of IndexWriter) written to storage. The problem with this is that once you get to lots of files (I foolishly set this to about 5000), it takes an age to download the indexes (on the Azure server it takes about a minute; on my dev box it has been running for 20 minutes and still has not finished).
As for backing up indexes, I have not come up against this yet, but given we have about 5 million records currently, and that will grow, I am wondering about this too. If you are using a single compound file, maybe downloading the files to a worker role, zipping them, and uploading them with today's date would work. If you have a smaller set of documents, you might get away with re-indexing the data if something goes wrong, but again, it depends on the number.
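The "zip them and stamp them with today's date" idea could look roughly like the sketch below. It is written in Python rather than .NET purely for illustration, it covers only the local zipping step (downloading from and uploading to blob storage are left out), and the function name, archive prefix, and directory layout are assumptions.

```python
import datetime
import pathlib
import shutil
import tempfile


def backup_index(index_dir, backup_dir, today=None):
    """Zip every file in index_dir into a date-stamped archive in backup_dir."""
    today = today or datetime.date.today()
    backup_dir = pathlib.Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    # make_archive appends the ".zip" extension to the base name itself.
    base = backup_dir / f"lucene-index-{today.isoformat()}"
    return pathlib.Path(shutil.make_archive(str(base), "zip", root_dir=str(index_dir)))


# Demo on a throwaway directory holding one fake compound file.
with tempfile.TemporaryDirectory() as tmp:
    index_dir = pathlib.Path(tmp) / "index"
    index_dir.mkdir()
    (index_dir / "_0.cfs").write_bytes(b"demo")
    archive = backup_index(index_dir, pathlib.Path(tmp) / "backups",
                           datetime.date(2024, 1, 31))
    archive_name = archive.name
    archive_existed = archive.exists()
    print(archive_name)  # lucene-index-2024-01-31.zip
```

The snapshot should be taken while no writer is modifying the index, otherwise the archive may capture a half-written state.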

Using SQL for cleaning up JIRA database

Has anyone had luck removing a large number of issues directly from a JIRA database instead of using the frontend? Deleting 60000 issues with the bulk tools is not really feasible.
Last time I tried it, JIRA went nuts because of its own way of handling indexes.
How about doing a backup to XML, editing the XML, and reimporting?
We got gutsy and did a truncate on the jiraissues table and then used the rebuild index feature on the frontend. It looks like it's working!
This is old, but I see that this question was just edited recently, so to chime in:
Writing directly to the JIRA database is problematic. The reindex feature suggested in the Oct 14 08 answer just rebuilds the Lucene index, so it is unlikely to clean up everything that needs to be cleaned up from the database on a modern JIRA instance. Off the top of my head, this will probably leave data lying around in the following tables, among others:
custom field data (customfieldvalue table)
issue links (issuelink table)
versions and components (nodeassociation table, which contains other stuff too, so be careful!)
remote issue links or wiki mentions (remotelink table)
If one has already done such a manual delete on production, it's always a good idea to run the database integrity checker (YOURJIRAURL/secure/admin/IntegrityChecker!default.jspa) to make sure that nothing got seriously broken.
Fast forwarding to 2014, the best solution is to write a quick shell script that uses the REST API to delete all of the required issues. (The JIRA CLI plugin is usually a good option for automating certain types of tasks too, but as far as I can tell, it does not currently support the deletion of issues, so the REST API is your best bet.)
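A rough sketch of the REST approach follows, written in Python instead of shell for readability. It relies on the issue resource of the JIRA REST API (DELETE on /rest/api/2/issue/{issueIdOrKey}) with HTTP basic authentication; the base URL, credentials, and issue keys are placeholders, and whether basic authentication is enabled depends on your instance. The requests are built but not sent.

```python
import base64
import urllib.request


def build_delete_issue_request(jira_url, issue_key, user, password):
    """Build (but do not send) a DELETE request for a single issue."""
    url = f"{jira_url.rstrip('/')}/rest/api/2/issue/{issue_key}"
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Basic {token}"},
        method="DELETE",
    )


# Placeholder keys; in practice these would come from a JQL search or an export.
issue_keys = ["TEST-1", "TEST-2"]
requests_to_send = [
    build_delete_issue_request("https://jira.example.com", key, "admin", "secret")
    for key in issue_keys
]

for req in requests_to_send:
    print(req.get_method(), req.full_url)
    # To really delete the issue, send the request:
    # urllib.request.urlopen(req)
```

Deleting through the API lets JIRA clean up related data and its index itself, which is the reason to prefer it over direct SQL.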