Splunk Database - Lucene

I understand that Splunk does not need much of the functionality a MySQL database would provide, and that a relational database may not be a good option for indexing and searching Big Data.
Does Splunk use Lucene as its search engine, or have they built their own on-disk data format?
I am sorry if there are any problems in the way I am asking the question. This is my first question on Stack Overflow.

Splunk uses its own search engine; it is not based on any third-party technology.
Its search engine works on flat files only, with no database behind it.
It does not store fields, only raw data. Fields are extracted at search time, which makes them very dynamic.
It is also very fast at finding keywords in the data (the needle in the haystack).
To be more detailed, Splunk stores data in the following way:
Breaking the data into time-based events, attaching a timestamp to each raw event
Marking every word found in the events and its location across the index
Storing the events in a compressed format (tar.gz)
This approach allows:
Very fast searches for keywords inside the events
Looking at the original raw data
Creating new fields on the raw data at search time and using them with statistics commands (see the sketch below)
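To make the search-time field extraction concrete, here is a minimal, hypothetical Python sketch of the "schema on read" idea. This is not Splunk's actual code; the event text and the key=value field pattern are invented for illustration:

    import re

    # Hypothetical raw events, kept as plain text lines the way an
    # indexer would store raw data (invented sample data).
    events = [
        "2016-12-01T10:02:11 user=alice action=login status=200",
        "2016-12-01T10:02:15 user=bob action=login status=403",
    ]

    # No fields are stored up front. At search time a pattern pulls
    # key=value pairs out of the raw text, so new fields can be defined
    # at any moment without re-indexing anything.
    FIELD = re.compile(r"(\w+)=(\S+)")

    for raw in events:
        fields = dict(FIELD.findall(raw))
        if fields.get("status") == "403":
            print(fields["user"], "failed to log in")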
Sources:
http://www.splunk.com/web_assets/pdfs/secure/Splunk_for_BigData.pdf
http://docs.splunk.com/Documentation/Splunk/6.5.1/Indexer/Howindexingworks
(Splunk architect, 3+ years of experience.)

Googling would have helped: http://answers.splunk.com/answers/43533/search-capabilities-of-splunk-how-powerful-is-it-really --> No Lucene

Splunk has a proprietary data format for its indexes. Lucene is not used, and Splunk has its own search language called SPL.

Related

Solution to host 200GB of data and provide JSON API with aggregates?

I am looking for a solution that will host a nearly static, structured, clean 200GB dataset and provide a JSON API onto the data, for querying in a web app.
Each row of my data looks like this, and I have about 700 million rows:
parent_org,org,spend,count,product_code,product_name,date
A31,A81001,1003223.2,14,QX0081,Rosiflora,2014-01-01
The data is almost completely static - it updates once a month. I would like to support straightforward aggregate queries like:
get total spending on product codes starting QX, by organisation, by month
get total spending by parent org A31, by month
And I would like these queries to be available over a RESTful JSON API, so that I can use the data in a web application.
I don't need to do joins, I only have one table.
Solutions I have investigated:
To date I have been using Postgres (with a web app to provide the API), but I am starting to reach the limits of what I can do with indexing and materialized views without dedicated hardware and more skills than I have.
Google Cloud Datastore: suitable for structured data of about this size, and has a baked-in JSON API, but doesn't do aggregates (so I couldn't support my "total spending" queries above).
Google Bigtable: can definitely handle data of this size and can do aggregates, and I could build my own API using App Engine. Might need to convert the data to HBase format to import it.
Google BigQuery: fast at aggregating and easy to import data into, but I would need to roll my own API as with Bigtable.
I'm wondering if there's a generic solution for my needs above. If not, I'd also be grateful for any advice on the best setup for hosting this data and providing a JSON API.
Update: It seems that BigQuery and Cloud SQL support SQL-like queries, but Cloud SQL may not be big enough (see comments) and BigQuery gets expensive very quickly, because you pay per query, so it isn't ideal for a public web app. Datastore is good value but doesn't do aggregates, so I'd have to pre-aggregate and keep multiple tables.
Cloud SQL is likely sufficient for your needs. It certainly is capable of handling 200GB, especially if you use Cloud SQL Second Generation.
The only reason a conventional database like MySQL (the database Cloud SQL uses) might not be sufficient is if your queries are very complex and not indexed. I recommend you try Cloud SQL, and if the performance isn't sufficient, try ensuring you have sufficient indexes (hint: use the EXPLAIN statement to see how the queries are being executed).
If your queries cannot be indexed in a useful way, or your queries are so CPU-intensive that they are slow regardless of indexing, you might want to graduate up to BigQuery. BigQuery is parallelised so that it can handle pretty much as much data as you throw at it; however, it isn't optimized for real-time use and isn't as convenient as Cloud SQL's "MySQL in a box".
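As a rough illustration of the indexing advice, here is a runnable sketch using Python's built-in sqlite3 as a stand-in for MySQL/Cloud SQL. The table layout follows the question's row format, and EXPLAIN QUERY PLAN is SQLite's analogue of MySQL's EXPLAIN:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    # Case-sensitive LIKE lets SQLite use an index for 'QX%' prefix scans.
    conn.execute("PRAGMA case_sensitive_like = ON")
    conn.execute("""CREATE TABLE spend (
        parent_org TEXT, org TEXT, spend REAL, count INTEGER,
        product_code TEXT, product_name TEXT, date TEXT)""")
    conn.execute("INSERT INTO spend VALUES "
                 "('A31','A81001',1003223.2,14,'QX0081','Rosiflora','2014-01-01')")
    conn.execute("CREATE INDEX idx_code ON spend (product_code, date)")

    # "Total spending on product codes starting QX, by organisation, by month."
    query = """SELECT org, substr(date, 1, 7) AS month, SUM(spend)
               FROM spend
               WHERE product_code LIKE 'QX%'
               GROUP BY org, month"""
    for row in conn.execute(query):
        print(row)

    # Check that the index is actually used before blaming the engine.
    for row in conn.execute("EXPLAIN QUERY PLAN " + query):
        print(row)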
Take a look at ElasticSearch. It's JSON, REST, cloud, distributed, quick on aggregate queries and so on. It may or may not be what you're looking for.
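For a feel of what the aggregate queries look like in Elasticsearch, here is a hedged sketch. The index name and field names simply mirror the question's CSV columns and are assumptions; newer Elasticsearch versions spell the histogram parameter calendar_interval instead of interval:

    import requests

    # "Total spending by parent org A31, by month" as an aggregation.
    body = {
        "size": 0,
        "query": {"match": {"parent_org": "A31"}},
        "aggs": {
            "per_month": {
                "date_histogram": {"field": "date", "interval": "month"},
                "aggs": {"total_spend": {"sum": {"field": "spend"}}},
            }
        },
    }
    resp = requests.post("http://localhost:9200/spend/_search", json=body)
    for bucket in resp.json()["aggregations"]["per_month"]["buckets"]:
        print(bucket["key_as_string"], bucket["total_spend"]["value"])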

Totally unstructured data

We currently have a solution where we increasingly need to store unstructured data of various kinds. For example, clients can define their own workflows in which they specify what kind of data should be captured (of various types, some simple, some complex). This data then needs to be stored and displayed in a web application, with a bit of functionality to modify it.
Until now the workflows have been defined internally, so an MS SQL database was designed to cater for these specific workflows and their data. However, now that clients can define workflows, we need to relax the structure of our DB. At first I thought a key-value table in MS SQL might be a good idea, but obviously I lose the typing of the data being captured and then need to deserialize all the data in the website (MVC.NET). I am also considering something like RavenDB, but I am not sure whether it would be a good fit.
So my question is: what would be the best way to store this unstructured data, bearing in mind that users must be able to search, edit, and display it as well?
How about combining two types of databases? Use a NoSQL database for your unstructured data and the relational MS SQL database to store references to the data for each workflow, so you can retrieve them later.
The data type will always be a problem and you will always have to deserialize. Searching can be done by taking the string representation of each value in your workflow and combining them into a searchable field in your MS SQL row.
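A minimal sketch of that hybrid layout, using Python with sqlite3 standing in for the MS SQL side and a JSON-style dict standing in for the document store (all names are illustrative assumptions):

    import sqlite3

    # The full, typed workflow document lives in the document store.
    doc = {
        "workflow": "client-onboarding",
        "fields": [
            {"name": "company", "type": "string", "value": "Acme Ltd"},
            {"name": "employees", "type": "int", "value": 42},
        ],
    }

    # String representations of every value, combined into one searchable field.
    search_blob = " ".join(str(f["value"]) for f in doc["fields"])

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE workflow_ref (doc_id TEXT, workflow TEXT, search_text TEXT)")
    conn.execute("INSERT INTO workflow_ref VALUES (?, ?, ?)",
                 ("doc-001", doc["workflow"], search_blob))

    # Search hits the relational side; the doc_id then points back to the
    # full document, which is deserialized with its type info for display.
    row = conn.execute("SELECT doc_id FROM workflow_ref WHERE search_text LIKE ?",
                       ("%Acme%",)).fetchone()
    print(row[0])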

google-bigquery

I am using BigQuery for SEO purposes. I am a search TC, and I am a little confused about why you are not using the Google forum, as I thought that was standard. I want to use BigQuery to find out when my competitors change data on their websites and which pages were changed. So I need the URL that was changed and when it was changed (the date), so I can also pull the page title and description to see what they are doing differently from me.
Is there anyone that knows how to use BigQuery to pull:
Date the page was changed
URL
Title
Description
We've switched to using Stack Overflow for support for many of our developer products, such as BigQuery. There's a great community here on Stack Overflow, and the interface for formatting technical questions and interacting with the community is fantastic.
BigQuery does not collect the data for you-- it's a cloud service for performing ad hoc queries on massive datasets. Before first performing the queries, you need to upload the data to the service (as a CSV format).
So, if you have a job which collects this data -- URL, title, description, date and perhaps a hash of the webpage, you could potentially ingest a CSV file of this data into BigQuery and use it to understand when webpages have changed.
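As a sketch of what that query could look like with the BigQuery Python client library (the dataset, table, and column names -- page_snapshots, url, page_hash, fetch_date -- are assumptions matching the suggested CSV layout):

    from google.cloud import bigquery

    client = bigquery.Client()

    # URLs whose content hash differs across crawls, i.e. changed pages.
    sql = """
        SELECT url,
               COUNT(DISTINCT page_hash) AS versions,
               MAX(fetch_date) AS last_crawl
        FROM mydataset.page_snapshots
        GROUP BY url
        HAVING versions > 1
    """
    for row in client.query(sql).result():
        print(row.url, row.versions, row.last_crawl)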
Of course, there are also third-party services (such as Changedetection.com) that may be easier to use for your purposes.

Grails app: own query language for data stored in DB and files + full text search (Hibernate Search, Compass etc)

I have an application which stores short descriptive data in a DB and lots of related textual data in text files.
I would like to add "advanced search" for the DB. I was thinking about adding my own query language, like JIRA does (JIRA Query Language). Then I thought about having full-text search across those text files (lower priority).
I am wondering which tool would suit me better to implement this faster and more simply.
Most of all, I want to give users the ability to write their own queries instead of using UI elements to specify search filters.
Thanks
UPD: I save dates in the DB, and most of the varchar fields contain one-word strings.
UPD2: Apache Derby is used right now.
Take a look at the Searchable plugin for Grails.
http://www.grails.org/plugin/searchable
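The Searchable plugin is backed by Compass/Lucene, which gives you a Lucene-style query string that users can type themselves -- much like JQL. To illustrate just that idea in isolation (this is not Grails code; it is a sketch using the pure-Python Whoosh library as a stand-in for a Lucene-like index):

    import os

    from whoosh.fields import ID, TEXT, Schema
    from whoosh.index import create_in
    from whoosh.qparser import QueryParser

    # One index over the DB descriptions and the related text files.
    schema = Schema(path=ID(stored=True), body=TEXT)
    os.makedirs("idx", exist_ok=True)
    ix = create_in("idx", schema)

    writer = ix.writer()
    writer.add_document(path="doc1.txt", body="short descriptive data from the DB")
    writer.add_document(path="doc2.txt", body="lots of related textual data")
    writer.commit()

    # Users write their own queries in a Lucene-like syntax.
    with ix.searcher() as searcher:
        query = QueryParser("body", ix.schema).parse("textual AND data")
        for hit in searcher.search(query):
            print(hit["path"])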

SQL Server Full Text Search - Getting word occurrences/locations in text?

Suppose I have SQL Server (2005/2008) create an index from one of my tables.
I wish to use my own custom search engine (a little more tuned to my needs than Full Text Search).
In order to use it, however, I need SQL Server to provide me with the word positions and other data required by the search engine.
Is there any way to query the "index" for this data instead of just getting search results?
Thanks
Roey
No. And if you could, what would happen when Microsoft decided to change their internal data structures? Your code would break.
What are you trying to achieve?
You shouldn't rely on SQL Server's internal data structures - they are tailored specifically for SQL Server's use and aren't accessible for querying anyway.
If you want a fast indexer, you will probably have more success using a pre-written one rather than trying to write your own. Give Lucene.Net a try.
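If you do roll your own, the word-position data has to be built outside SQL Server anyway. Here is a minimal Python sketch of a positional inverted index over text you would pull from the database yourself (all data invented for illustration):

    from collections import defaultdict

    # word -> list of (doc_id, position)
    index = defaultdict(list)

    def add_document(doc_id, text):
        for pos, word in enumerate(text.lower().split()):
            index[word].append((doc_id, pos))

    add_document(1, "full text search in sql server")
    add_document(2, "word positions drive phrase search")

    # Phrase lookup: "search" immediately followed by "in".
    starts = set(index["search"])
    for doc_id, pos in index["in"]:
        if (doc_id, pos - 1) in starts:
            print("phrase found in doc", doc_id, "at position", pos - 1)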