I need some advice about druid and metamarkets - sql

I need a solution for storing logs (which more or less follow one of, say, 10 standard formats), preferably in real time, in a database which is fast to query and can easily give me results for various weird queries, e.g. queries looking for keywords in text bodies or queries involving multiple tables.
A solution that was recommended to me was Metamarkets, which seems to do real-time logging with a very good query system. However, I'm unsure about the cost and whether or not such a complex solution is needed.
From what I understand, the "selling point" of Metamarkets is the Druid database, which is open source and can be deployed outside of their stack. So what I come here to ask is:
Have any of you had experience deploying a real-time logging system with Druid? How hard was it? How long did it take? What are the challenges? What other technologies besides Druid did you use? Do you have any recommended reading?
Have any of you had experience with Metamarkets? If so, again, how hard was it? How long did it take? What are the challenges? How were the costs once it hit production? Do you have any recommended reading on the subject?
Also, a bonus question: are there actually any benchmarks done by "unbiased professionals" on Druid? The fact that a real-time in, real-time out database is written in Java seems a bit... ahm, hard to believe.

This is a quick answer.
It is true that Druid is open source, but the missing link here is a good UI that plays nicely with Druid. There is one UI, formerly called Caravel and now called Superset, which I guess can do an OK job.
Running a Druid cluster should not be that hard if you have enough resources (e.g. engineers) to put the whole pipeline in place, from packaging to deploying Druid on the machines/cloud.
Finally, the last piece is monitoring and updating the cluster, which needs a good amount of work as well.
And yes, it is written in Java, but that is the case for much other real-time software; take Kafka as an example. In fact, Druid does a lot of things off-heap and uses memory-mapped files for serving data. Reading the white paper will give you a good basic understanding of the system, which should help you decide whether Druid is a good fit or not.

Related

BigQuery Testing, Debugging, and Design Patterns

We use BigQuery as the main data warehouse in our company.
We have gotten very efficient with SQL syntax and we write multi-page SQL queries with valid Syntax to analyze our data.
The main problem we are struggling with is terrible logic mistakes in our queries. For example, a > should have been a >=, or a join treated NULL values the wrong way.
The effect is that we are getting wrong data out of BigQuery.
The logic within our data structure is so complicated ("what again was the definition of Customer Type ABC?") that it's terribly difficult to actually pull out anything useable. We estimate that up to 50% of analytics that we pull out of BigQuery are plain wrong.
Of course this is a problem that significantly hurts our bottom line and leads to wrong business decisions. It has gotten so bad that we are craving a normalized database structure that at least could be comprehended more easily.
My hope is that maybe we are just missing certain design patterns to properly use BigQuery. However I find zero guidance about this online. The SQL we are using is so complex that I'm starting to think that although the Syntax is correct, SQL was not made for this. What we are doing feels like fitting a complex program into a single function, which in turn becomes untestable and a nightmare to work with.
I would appreciate any input and guidance
I can empathize here. I don't think your issue is unique, and there isn't one best practice. I can tell you what we have done to help with these same issues.
We are a small team of analysts, and only have a couple TB of data to crunch daily so your mileage will vary with these tips depending on your situation.
We use DBT - https://www.getdbt.com/. It has a free command line version, or you can pay for DBT Cloud if you aren't confident with command line tools. It will help you go from pages-long SQL queries to smaller, digestible chunks that are easier to maintain.
It helps with 3 main use cases for us.
Database normalization/summarization: you can easily write queries, have them depend on each other, and have them scheduled to run at a certain time, while the tool does a lot of the more complex data engineering tasks for you, such as making sure things run in the right order and that no circular references exist. This part of the tool helped us migrate away from pages-long SQL queries to smaller, digestible chunks that are useful in multiple applications.
Documentation: there is a documentation site built in, so you can document a column and write out the definition of 'customer' easily.
Testing: we write loads of tests. We have a 100% accepted answer to certain metrics. Any time we need to reference such a metric in other queries, or transform data to slice that metric by other dimensions, we write a test to make sure the new transformation matches back to the 100% accepted answer.
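The reconciliation tests described above can be sketched in a few lines. This is only an illustration, not dbt itself: it uses Python's built-in sqlite3 module as a stand-in for BigQuery, and the orders and customer_types tables, their columns and the figures are all hypothetical. It shows an inner join silently dropping rows whose customer type is NULL, and a check against the accepted total catching the mistake.

```python
import sqlite3

# In-memory SQLite database standing in for the warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, customer_type TEXT, revenue REAL);
    CREATE TABLE customer_types (code TEXT, label TEXT);
    INSERT INTO orders VALUES (1, 'A', 100.0), (2, 'B', 50.0), (3, NULL, 25.0);
    INSERT INTO customer_types VALUES ('A', 'Retail'), ('B', 'Wholesale');
""")


def total(sql):
    """Run a query that returns a single number and return that number."""
    return conn.execute(sql).fetchone()[0]


# The "100% accepted answer": total revenue straight from the base table.
accepted_total = total("SELECT SUM(revenue) FROM orders")

# Slicing by customer type with an inner join drops the NULL-typed order.
buggy_sliced = total("""
    SELECT SUM(revenue) FROM (
        SELECT t.label, SUM(o.revenue) AS revenue
        FROM orders o JOIN customer_types t ON o.customer_type = t.code
        GROUP BY t.label
    )
""")

# A left join with an explicit 'Unknown' bucket keeps every order.
fixed_sliced = total("""
    SELECT SUM(revenue) FROM (
        SELECT COALESCE(t.label, 'Unknown') AS label, SUM(o.revenue) AS revenue
        FROM orders o LEFT JOIN customer_types t ON o.customer_type = t.code
        GROUP BY COALESCE(t.label, 'Unknown')
    )
""")

# The test itself: a sliced metric must add back up to the accepted total.
print("buggy matches:", buggy_sliced == accepted_total)
print("fixed matches:", fixed_sliced == accepted_total)
```

The point of the design is that the test never needs to know why a transformation is wrong; it only needs one trusted number to compare against.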
We have explored DBT, but unfortunately we didn't have the bandwidth to support it at the company level. As an alternative, we use Airflow to build and maintain datasets in BigQuery. We use the BigQuery operators to interface with BQ through Airflow. This helps us in the following ways:
Ability to build custom operators that can help with organizational level bells and whistles (integration with internal systems, data lifecycle management, lineage management etc.)
Ability to break down complex pieces of SQL into smaller manageable blocks that can be reused
Ability to incorporate testing in the process. You can build testing into your pipeline DAG or can build out separate DAGs of tests that can monitor your datasets and send out reports.
Ability to replay and recreate datasets
Ability to easily manage schema changes
I am sure there are other use cases where Airflow helps, but these are some of the things that come to mind.

What are some methods of testing data analytics systems and ETL processes?

I work primarily with so-called "Big Data": the ETL and analytics parts. One of the challenges I constantly face is finding a good way to "test my data", so to speak. For my MapReduce and ETL scripts I write solid unit test coverage, but if there are unexpected underlying changes in the data itself (coming from multiple application systems), the code won't necessarily throw a noticeable error, which leaves me with bad or altered data that I don't know about.
Are there any best practices out there that help people keep an eye on what / how the underlying data may be changing?
Our technology stack is AWS EMR, Hive, Postgres, and Python. We're not really interested in bringing in a big ETL framework like Informatica.
You could create some kind of mapping files (maybe XML or something similar) according to the standards specific to your systems, and validate your incoming data before putting it into your cluster, or maybe during the process itself. I was facing a similar issue some time ago and ended up doing this.
I don't know how feasible it is for your data and your use case, but it did the trick for us. I had to create the XML files once (I know it's boring and tedious, but it's worth a try), and now whenever I get new files I use these XML files to validate the data before putting it into my cluster, checking whether the data is correct according to the defined standards. This saves a lot of the time and effort that would be involved if I had to check everything manually every time I got some new data.
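As a rough sketch of the mapping-file idea, assuming a very simple XML layout that I made up for illustration (the mapping/field elements, their name, type and required attributes, and the sample field names are all hypothetical), validation could look something like this in Python:

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping file: one <field> per expected column.
MAPPING_XML = """
<mapping>
  <field name="user_id" type="int" required="true"/>
  <field name="event" type="str" required="true"/>
  <field name="amount" type="float" required="false"/>
</mapping>
"""

# Type names allowed in the mapping and the Python casts used to check them.
CASTS = {"int": int, "float": float, "str": str}


def load_rules(xml_text):
    """Parse the mapping into (name, cast, required) tuples."""
    root = ET.fromstring(xml_text)
    return [
        (f.get("name"), CASTS[f.get("type")], f.get("required") == "true")
        for f in root.findall("field")
    ]


def validate(record, rules):
    """Return a list of problems found in one incoming record (a dict of strings)."""
    errors = []
    for name, cast, required in rules:
        value = record.get(name)
        if value is None or value == "":
            if required:
                errors.append(f"{name}: missing required field")
            continue
        try:
            cast(value)
        except (TypeError, ValueError):
            errors.append(f"{name}: expected {cast.__name__}, got {value!r}")
    return errors


rules = load_rules(MAPPING_XML)
good = validate({"user_id": "42", "event": "click", "amount": "9.5"}, rules)
bad = validate({"user_id": "abc", "event": ""}, rules)
print(good)
print(bad)
```

Records that come back with a non-empty error list can be rejected or quarantined before they reach the cluster, which is where an upstream format change would otherwise go unnoticed.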

Is RavenDB Right for my Situation?

I have an interesting situation where I'm near the end of an evaluation period for a RavenDB prototype for use with a project at our company. The reason it's interesting is that 99.99% of the time, I believe it fits Raven's sweet spot; it repeatedly queries for new data, often, and in small batches (< 1000 documents at a time).
However, we do have an initial load period, where we need to load two days' worth of data, which can be 3 million (or more) records in some cases.
A diagram might help:
It's the Transfer Service that is responsible for getting the correct data out of three production databases and storing it in RavenDB. The WCF service will query this data and make it available to its clients.
Once we do the initial load of millions of records/documents into RavenDB, we'll rarely have to do that again.
As an initial load test, on a machine with 4GB RAM and two processors, it took just over 23 minutes to read the initial data. In this case, it was only about 1.28 million records. I eliminated all async operations from this initial load, because I wanted each read to not be interfered with by other read operations. I found the best results this way.
To accomplish all this, I had to change settings that I know aren't recommended to be changed:
I had to increase the timeout:
documentStore.JsonRequestFactory.ConfigureRequest += (e, x) => ((HttpWebRequest)x.Request).Timeout = ravenTimeoutInMilliseconds;
In the Raven.Server.exe.config, I had to increase the page size (to int.MaxValue):
<add key="Raven/MaxPageSize" value="2147483647"/>
And in my retrieval methods, I had to use Take(int.MaxValue):
return session.Query<T>().Where(whereClause).Take(int.MaxValue).ToList();
Remember this is all for that one-time, initial load. After that, it's many queries, quickly, and often. I should also note that each document is self-contained in RavenDB. There are no relationships to manage.
Knowing all this, is RavenDB a good fit?
A good fit for what?
Full text search? Yes. Background aggregations (map/reduce ones)? Yes. Easy replication and sharding, say scaling? Yes...
Ad-hoc reporting? No. Support for probably thousands of third party tools? No...
If you're talking about performance, you probably want to look at Oren's latest post on that. His numbers are quite similar to yours: http://ayende.com/blog/154913/ravendb-amp-freedb-an-optimization-story
From what I understand of your question, you need to "prep" the WCF web-service. To do this you read 1.2M docs from RavenDB (in about 23 mins) and hold them in memory, so the WCF service can then serve queries from them, is this right? Or am I missing something?
Why not get the WCF service to send its queries to Raven one at a time? I.e. for each query it gets from a client, ask RavenDB to do the query for it?
From what you've told us in the comments on the other answers, I believe the only good way to serve the WCF clients fast enough is to actually store everything in memory, i.e. just the way you do it now.
Whether RavenDB is a good fit for that situation depends on whether your data model benefits in other ways from the document-oriented nature. So, in case you have dynamic data that would require some kind of EAV in a relational database and lots of joins, then RavenDB will probably be a very good solution. However, if you just need something you can throw flat data in, then I would go with a relational database here. In terms of licensing costs and ease of use, you might also want to take a look at PostgreSQL, as this is a really awesome database that comes completely free.

I need advice choosing a NoSQL database for a project with a lot of minute related information

I am currently working on a private project that is going to use Google's GTFS spec to get information about hundreds of public transit agencies, their routes, stations, times, and other related information. I will be getting my information from here and the Google Code wiki page with similar info. There is a lot of data and it's partitioned into multiple CSV-formatted text files. These can be huge, some ranging from 80 to 100 MB of data.
With the data I have, I want to translate it all into a nice solid database that I can build layers on top of to use for my project. I will be using GPS positioning to pinpoint a location and all surrounding stations/stops.
My goal is to access all the information for all these stops and stations with as few calls as possible, while keeping datasets small for queried results.
I am currently leaning towards MongoDB and CouchDB for their GeoSpatial support that can really optimize getting small datasets. But I also need to be sure to link all the stops on a route because I will be propagating information along a transit route for that line. In this case I have found that I can benefit from a Graph DB like Neo4j and OrientDB, but from what I know, neither has GeoSpatial support nor am I 100% sure that a Graph DB would be what I need.
The perfect solution might not exist, but I come here asking for help on finding the best possible one for my situation. I know I will possibly have to work around limitations of whatever I choose, but I want to at least have done my research and know that it's the best I can get at the moment.
It has also been suggested that I split the data into multiple DBs, but that could get very messy because all the information is very tightly interconnected through IDs.
Any help would be appreciated.
Obviously, a graph database fits your problem 100%. My advice here is to go for a geospatial module on top of Neo4j or OrientDB, although there are other free and open-source implementations.
I think the best one right now, with all the geospatial features implemented, is the neo4j-spatial package. But as far as I know, you can also reproduce most of the geospatial functionality on your own if necessary.
By the way, talking about splitting: if the amount of data or queries will be high, I strongly recommend that you share the load and design the model in those terms. It can certainly be done.
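To give an idea of what reproducing the basic geospatial part on your own might involve, here is a minimal Python sketch that finds stops within a radius using the haversine formula. The column names follow the GTFS stops.txt file (stop_id, stop_name, stop_lat, stop_lon); the sample stops and coordinates are made up, and the linear scan is an assumption for brevity where a real database would use a spatial index.

```python
import csv
import io
import math

# Made-up sample in the shape of a GTFS stops.txt file.
STOPS_CSV = """stop_id,stop_name,stop_lat,stop_lon
S1,First St,40.7505,-73.9905
S2,Second St,40.7550,-73.9900
S3,Far Away Ave,40.8000,-73.9500
"""


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def nearby_stops(stops_csv, lat, lon, radius_km):
    """Return the ids of stops within radius_km, nearest first (linear scan)."""
    found = []
    for row in csv.DictReader(io.StringIO(stops_csv)):
        distance = haversine_km(lat, lon, float(row["stop_lat"]), float(row["stop_lon"]))
        if distance <= radius_km:
            found.append((distance, row["stop_id"]))
    return [stop_id for _, stop_id in sorted(found)]


nearest = nearby_stops(STOPS_CSV, 40.7500, -73.9900, radius_km=1.0)
print(nearest)
```

Scanning every stop is fine for a prototype, but with hundreds of agencies the spatial indexes offered by the databases discussed here are what keep the result sets and query times small.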
I've used Mongo's GeoSpatial features and can offer some guidance if you need help with a C# or javascript implementation - I would recommend it to start because it's super easy to use. I'm learning all about Neo4j right now and I am working on a hybrid approach that takes advantage of both Mongo and Neo4j. You might want to cross reference the documents in Mongo to the nodes in Neo4j using the Mongo object id.
For my hybrid implementation, I'm storing profiles and any other large static data in Mongo. In Neo4j, I'm storing relationships like friend and friend-of-friend. If I wanted to analyze movies two friends are most likely to want to watch together (or really any other relationship I hadn't thought of initially), by keeping that object id reference I can simply add some code instructing each node go out and grab a list of movies from the related profile.
Added 2011-02-12:
Just wanted to follow up on this "hybrid" idea as I created prototypes for and implemented a few more solutions recently where I ended up using more than one database. Martin Fowler refers to this as "Polyglot Persistence."
I'm finding that I am often using a combination of a relational database, document database and a graph database (in my case this is generally SQL Server, MongoDB and Neo4j). Since the question is related to data modeling as much as it is to geospatial, I thought I would touch on that here:
I've used Neo4j for site organization (similar to the idea of hypermedia in the REST model), modeling social data and building recommendations (often based on social data). As a result, I will generally model this part of the application before I begin programming.
I often end up using MongoDB for prototyping the rest of the application because it provides such a simple persistence mechanism. I like to start developing an application with the user interface, so this ends up working well.
When I start moving entities from Mongo to SQL Server, the context is usually important. For instance, if I have an application that allows users to build daily reports based on periodically collected data, it may make sense to run a procedure that builds those reports each night and stores daily report objects in Mongo, which may be combined into larger aggregate reports as needed (obviously this doesn't consider a few special cases, but that is not relevant to the point). On the other hand, if users need to pull on-demand reports limited to very specific time periods, it may make sense to keep everything in SQL Server and build those reports as needed.
That said, and this deserves more intense thought, here are some considerations that may be helpful:
I generally try to store entities in a relational database if pulling an entity from the database (in other words, in the context of a relational database, querying the data required to generate an entity or list of entities that fulfills the requested parameters) does not require significant processing (multiple joins, for instance).
Do you require ACID compliance (aside: if you have a graph problem, you can leverage Neo4j for this)? There are document databases with ACID compliance, but there's a reason Mongo is not: What does MongoDB not being ACID compliant really mean?
One use of Mongo I saw in the wild that I thought was worthy of mention: Hadoop was being used to compute massive hash tables that were then stored in Mongo. I believe a similar approach is used by TripAdvisor for user-based customization in terms of targeting offers, advertising, etc.
NoSQL only exists because MySQL users assume that all databases have their performance problems when their database grows large and/or becomes complex.
I suggest that you use PostGIS. You can use the same database for the rest of your data needs as well.
http://postgis.refractions.net/

PostgreSQL performance monitoring tool

I'm setting up a web application with a FreeBSD PostgreSQL back-end. I'm looking for some database performance optimization tool/technique.
Database optimization is usually a combination of two things
Reduce the number of queries to the database
Reduce the amount of data that needs to be looked at to answer queries
Reducing the number of queries is usually done by caching non-volatile/less important data (e.g. "Which users are online" or "What are the latest posts by this user?") inside the application (if possible) or in an external, more efficient datastore (memcached, redis, etc.). If you've got information which is very write-heavy (e.g. hit counters) and doesn't need ACID semantics, you can also think about moving it out of the Postgres database to more efficient data stores.
Optimizing the query runtime is more tricky. This can amount to creating special indexes (or indexes in the first place), changing (possibly denormalizing) the data model, or changing the fundamental approach the application takes when it comes to working with the database. See for example the "Pagination done the Postgres way" talk by Markus Winand on how to rethink the concept of pagination to make it more database efficient.
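To make the in-application caching idea above a bit more concrete, here is a minimal time-to-live cache sketch in Python. The TTLCache class, the load_online_users function and the 30-second lifetime are all hypothetical; in practice the loader would run the database query, and an external store such as memcached or redis would take the place of the dictionary.

```python
import time


class TTLCache:
    """Tiny in-process cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be demonstrated without sleeping
        self._store = {}     # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        """Return the cached value, calling loader() only on a miss or after expiry."""
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]
        value = loader()
        self._store[key] = (now + self._ttl, value)
        return value


# Demonstration with a fake clock and a loader that counts its calls.
fake_now = [0.0]
load_calls = []


def load_online_users():
    load_calls.append(1)      # stands in for one round trip to the database
    return ["alice", "bob"]


cache = TTLCache(ttl_seconds=30, clock=lambda: fake_now[0])
first = cache.get_or_load("online_users", load_online_users)
second = cache.get_or_load("online_users", load_online_users)  # served from the cache
fake_now[0] = 31.0
third = cache.get_or_load("online_users", load_online_users)   # expired, loaded again
print(len(load_calls))
```

Three lookups cost two database round trips here; the trade-off is that readers may see data that is up to 30 seconds stale, which is why this suits "less important" data.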
Measuring queries the slow way
But to understand which queries should be looked at first you need to know how often they are executed and how long they run on average.
One approach to this is logging all (or "slow") queries including their runtime and then parsing the query log. A good tool for this is pgfouine, which has already been mentioned earlier in this discussion. It has since been replaced by pgbadger, which is written in a more friendly language, is much faster and is more actively maintained.
Both pgfouine and pgbadger suffer from the fact that they need query logging enabled, which can cause a noticeable performance hit on the database or cause disk space trouble. On top of that, parsing the log with the tool can take quite some time and won't give you up-to-date insights into what is going on in the database.
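For a feel of what such log parsing involves, here is a small Python sketch that sums up durations per statement. It is not how pgfouine or pgbadger work internally. It assumes log lines of the form "duration: ... ms  statement: ...", which is what PostgreSQL writes when log_min_duration_statement is enabled; the sample lines themselves are made up, and multi-line statements are ignored.

```python
import re
from collections import defaultdict

# Matches the duration/statement part of a PostgreSQL log line.
LINE_RE = re.compile(r"duration: (?P<ms>\d+(?:\.\d+)?) ms\s+statement: (?P<sql>.*)")


def summarize(lines):
    """Return (statement, calls, total_ms) tuples, slowest total first."""
    stats = defaultdict(lambda: [0, 0.0])  # statement -> [calls, total_ms]
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not a duration line (e.g. checkpoint messages)
        entry = stats[match.group("sql").strip()]
        entry[0] += 1
        entry[1] += float(match.group("ms"))
    rows = [(sql, calls, total) for sql, (calls, total) in stats.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Made-up sample log lines.
sample_log = [
    "2024-01-01 12:00:00 UTC [101] LOG:  duration: 120.500 ms  statement: SELECT * FROM posts WHERE user_id = 7",
    "2024-01-01 12:00:01 UTC [102] LOG:  duration: 30.250 ms  statement: SELECT 1",
    "2024-01-01 12:00:02 UTC [101] LOG:  duration: 80.250 ms  statement: SELECT * FROM posts WHERE user_id = 7",
    "2024-01-01 12:00:03 UTC [103] LOG:  checkpoint starting: time",
]

report = summarize(sample_log)
for sql, calls, total_ms in report:
    print(f"{total_ms:9.2f} ms  {calls:3d} calls  {sql}")
```

Sorting by total time rather than by the single slowest execution surfaces the cheap-but-frequent queries, which are often the better optimization target.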
Speeding it up with extensions
To address these shortcomings there are now two extensions which track query performance directly in the database: pg_stat_statements (which is only helpful in version 9.2 or newer) and pg_stat_plans. Both extensions offer the same basic functionality: tracking how often a given "normalized query" (the query string minus all expression literals) has been run and how long it took in total. Because this is done while the query is actually run, it is very efficient; the measurable overhead was less than 5% in synthetic benchmarks.
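The idea of a "normalized query" can be illustrated with a rough Python approximation. This is not how the extensions do it (they work inside the database rather than on query text); the regular expressions below are a simplification that only handles quoted strings and plain numbers, and the sample queries and timings are made up.

```python
import re
from collections import Counter


def normalize(sql):
    """Replace string and numeric literals with '?' and collapse whitespace."""
    sql = re.sub(r"'(?:[^']|'')*'", "?", sql)      # quoted strings, '' is an escaped quote
    sql = re.sub(r"\b\d+(?:\.\d+)?\b", "?", sql)   # standalone numbers
    return re.sub(r"\s+", " ", sql).strip()


# Made-up executions: (query text, runtime in milliseconds).
executed = [
    ("SELECT * FROM posts WHERE user_id = 7", 12.0),
    ("SELECT * FROM posts  WHERE user_id = 42", 8.0),
    ("SELECT * FROM users WHERE name = 'it''s me'", 3.0),
]

calls = Counter()
total_ms = Counter()
for sql, ms in executed:
    key = normalize(sql)
    calls[key] += 1
    total_ms[key] += ms

for key in calls:
    print(f"{calls[key]} calls, {total_ms[key]:.1f} ms total: {key}")
```

Grouping by the normalized text is what lets two executions that differ only in their literal values count as the same query.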
Making sense of the data
The list of queries itself is very "dry" from an information perspective. There's been work on a third extension trying to address this fact and offer nicer representation of the data called pg_statsinfo (along with pg_stats_reporter), but it's a bit of an undertaking to get it up and running.
To offer a more convenient solution to this problem I started working on a commercial project which is focussed around pg_stat_statements and pg_stat_plans and augments the information collected by lots of other data pulled out of the database. It's called pganalyze and you can find it at https://pganalyze.com/.
To offer a concise overview of interesting tools and projects in the Postgres monitoring area, I also started compiling a list at the Postgres Wiki which is updated regularly.
pgfouine works fairly well for me. And it looks like there's a FreeBSD port for it.
I've used pgtop a little. It is quite crude, but at least I can see which query is running for each process ID.
I tried pgfouine, but if I remember correctly, it's an offline tool.
I also tail the psql.log file and set the logging criteria down to a level where I can see the problem queries.
#log_min_duration_statement = -1 # -1 is disabled, 0 logs all statements
# and their durations, > 0 logs only
# statements running at least this time.
I also use EMS Postgres Manager to do general admin work. It doesn't do anything for you, but it does make most tasks easier and makes reviewing and setting up your schema simpler. I find that when using a GUI, it is much easier for me to spot inconsistencies (like a missing index, field criteria, etc.). It's one of only two programs I'm willing to run VMware on my Mac for.
Munin is a quite simple yet effective way to get trends of how the database is evolving and performing over time. With the standard Munin kit you can, among other things, monitor the size of the database, number of locks, number of connections, sequential scans, size of the transaction log and long-running queries.
It is easy to set up and get started with, and if needed you can write your own plugin quite easily.
Check out the latest postgresql plugins that are shipped with Munin here:
http://munin-monitoring.org/browser/branches/1.4-stable/plugins/node.d/
Well, the first thing to do is try all your queries from psql using "explain" and see if there are sequential scans that can be converted to index scans by adding indexes or rewriting the query.
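The explain workflow described above can be tried out in miniature. The sketch below deliberately swaps in SQLite (via Python's built-in sqlite3 module) and its EXPLAIN QUERY PLAN statement, so that it runs without a PostgreSQL server; the output format differs from PostgreSQL's EXPLAIN, and the posts table and idx_posts_user index are hypothetical. The point is only the before/after comparison of the plan.

```python
import sqlite3

# SQLite stand-in for the real database (assumption); table and data are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER, body TEXT)")
conn.executemany(
    "INSERT INTO posts (user_id, body) VALUES (?, ?)",
    [(i % 50, "text") for i in range(1000)],
)

QUERY = "SELECT * FROM posts WHERE user_id = 7"


def plan(sql):
    """Return the planner's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column holds the detail text


before = plan(QUERY)   # no index yet: the whole table is scanned
conn.execute("CREATE INDEX idx_posts_user ON posts (user_id)")
after = plan(QUERY)    # the planner can now search via the index

print("before:", before)
print("after: ", after)
```

In psql the equivalent step is to prefix the query with EXPLAIN and look for sequential scans on large tables in the output.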
Other than that, I'm as interested in the answers to this question as you are.
Check out Lightning Admin, it has a GUI for capturing log statements, not perfect but works great for most needs. http://www.amsoftwaredesign.com
DBTuna http://www.dbtuna.com/postgresql_monitor.php has recently started supporting PostgreSQL monitoring. We use it extensively for MySQL monitoring, so if it provides the same for Postgres then it should be a good fit for you too.