Real-time queries in MongoDB for different criteria and processing the result

New to MongoDB. Is MongoDB efficient for real-time queries where the values for the criteria change every time I query? There will also be some aggregation of the result set before sending the response back to the user. As an example, my use case needs to produce data in the following format after processing a collection for different criteria values.
Service    Total    Improved
A          1000     500
B          2000     700
..         ..       ..
I see MongoDB has an aggregation framework which processes records and returns computed results. Should I be using aggregation instead for efficiency? If aggregation is the way to go, I guess I would re-run it every time my source data changes. Also, is this what the MongoDB/Hadoop integration is used for? Am I on the right track in my understanding? Thanks in advance.

Your question is too general, IMHO.
Speed depends on the size of your data, on the kind of query, and on whether you have put an index on your key, etc.
Changing values in your queries are not critical, AFAIK.
For example, I work on a MongoDB database with 3 million docs and can do some queries in a couple of seconds, some in a couple of minutes. A simple map-reduce over all 3M docs takes about 25 minutes on that box.
I have not tried the aggregation API yet, which seems to be a successor/alternative to map-reduce runs.
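Untested on my side, but a grouping like the Service/Total/Improved table in the question would presumably look something like the following with the aggregation framework (a minimal pymongo sketch; the collection name "requests" and the "service", "improved", and "status" fields are assumptions for illustration):

# Minimal sketch of a per-service aggregation with pymongo.
# Collection and field names are illustrative assumptions.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
coll = client["mydb"]["requests"]

pipeline = [
    # Optional: apply whatever criteria change from query to query.
    {"$match": {"status": "done"}},
    # One output document per service: total count plus count of improved docs.
    {"$group": {
        "_id": "$service",
        "total": {"$sum": 1},
        "improved": {"$sum": {"$cond": ["$improved", 1, 0]}},
    }},
    {"$sort": {"_id": 1}},
]

for row in coll.aggregate(pipeline):
    print(row["_id"], row["total"], row["improved"])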
I did not know about the MongoDB/Hadoop integration. It seems to keep MongoDB as an easy-to-use storage unit which feeds data to a Hadoop cluster and gets results back from it, using the more advanced map-reduce framework from Hadoop (more phases, better use of a cluster of Hadoop nodes).

I would follow MongoDB's guidelines for counting things.
See MongoDB's documentation page on pre-aggregated reports (a sketch of that pattern follows below).
Hadoop is good for batch processing, which you probably don't need for these counting use cases.
See this list for other typical Hadoop use cases: link.
And here's a resource for typical MongoDB + Hadoop use cases: link.
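A minimal sketch of the pre-aggregated report pattern with pymongo: instead of re-scanning raw documents on every request, counters are incremented as events arrive (collection and field names are assumptions for illustration):

# Minimal sketch of pre-aggregated counters with pymongo.
# Collection and field names are illustrative assumptions.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
stats = client["mydb"]["service_stats"]

def record_event(service, improved):
    # Bump the running counters for one incoming event; upsert creates the
    # per-service document on first use.
    stats.update_one(
        {"_id": service},
        {"$inc": {"total": 1, "improved": 1 if improved else 0}},
        upsert=True,
    )

record_event("A", improved=True)

Reading the report is then a cheap find() over a handful of small documents rather than an aggregation over the raw collection.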

Related

How to store millions of statistics records efficiently?

We have about 1.7 million products in our eshop and we want to keep a record of how many views these products had over a one-year period, recording the views at least every 2 hours. The question is what structure to use for this task?
Right now we tried keeping stats for the last 30 days in records that have 2 columns, classified_id,stats, where stats is a stripped-down JSON of the form date:views,date:views... For example, a record would look like
345422,{051216:23212,051217:64233}
where 051216,051217 = mm/dd/yy and 23212,64233 = number of views.
This of course is kind of stupid if you want to go one year back, since to get the sum of views of, say, 1000 products you need to fetch something like 30 MB from the database and calculate it yourself.
The other way we are thinking of going right now is to have a massive table with 3 columns, classified_id,date,views, and store each recording in its own row. This of course will result in a huge table with hundreds of millions of rows; for example, if we have 1.8 million classifieds and keep records 24/7 for one year, every 2 hours, we need
1,800,000 * 365 * 12 = 7,884,000,000 (billions, with a B) rows, which, while well inside the theoretical limit of Postgres, makes me imagine that queries on it (say, for updating the views), even with the correct indices, will take some time.
Any suggestions? I can't even imagine how google analytics stores the stats...
This number is not as high as you think. In my current work we store metrics data for websites and the total number of rows we have is much higher. In a previous job I worked with a pg database which collected metrics from a mobile network, and it collected ~2 billion records per day. So do not be afraid of billions of records.
You will definitely need to partition the data - most probably by day. With this amount of data you may find indexes quite useless; it depends on the plans you see in the EXPLAIN command output. That telco app, for example, did not use any indexes at all because they would just slow down the whole engine.
Another question is how quick the query responses need to be, and which granularities (sums over hours/days/weeks, etc.) you will allow users to query. You may even need to pre-compute some aggregations for granularities like week, month, or quarter.
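To make the partition-by-day suggestion concrete, a minimal sketch using PostgreSQL 10+ declarative range partitioning, driven from Python with psycopg2 (table and column names are assumptions for illustration):

# Minimal sketch of daily range partitioning in PostgreSQL 10+ via psycopg2.
# Table and column names are illustrative assumptions.
import psycopg2

conn = psycopg2.connect("dbname=stats user=stats")
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS views (
        classified_id bigint      NOT NULL,
        recorded_at   timestamptz NOT NULL,
        view_count    integer     NOT NULL
    ) PARTITION BY RANGE (recorded_at);
""")

# One partition per day; a cron job (or a tool like pg_partman) would
# pre-create these ahead of time.
cur.execute("""
    CREATE TABLE IF NOT EXISTS views_2023_01_01
        PARTITION OF views
        FOR VALUES FROM ('2023-01-01') TO ('2023-01-02');
""")

conn.commit()
cur.close()
conn.close()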
Addition:
Those ~2 billion records per day in that telco app took ~290 GB per day. That meant inserts of ~23,000 records per second using bulk inserts with the COPY command, where every bulk was several thousand records. Raw data were partitioned by minute. To avoid disk waits, the db had 4 tablespaces on 4 different disks/arrays and the partitions were distributed over them. PostgreSQL was able to handle it all without any problems. So you should think about proper HW configuration too.
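A minimal sketch of that COPY-based bulk loading from Python with psycopg2 (the table layout matches the illustrative sketch above and is an assumption):

# Minimal sketch of bulk loading with COPY via psycopg2.
# Table layout and connection details are illustrative assumptions.
import io
import psycopg2

conn = psycopg2.connect("dbname=stats user=stats")
cur = conn.cursor()

rows = [(345422, "2023-01-01 10:00:00+00", 23212),
        (345423, "2023-01-01 10:00:00+00", 64233)]
buf = io.StringIO("".join(f"{c}\t{t}\t{v}\n" for c, t, v in rows))

# COPY ... FROM STDIN is far faster than row-by-row INSERTs for bulk loads.
cur.copy_expert(
    "COPY views (classified_id, recorded_at, view_count) FROM STDIN", buf)
conn.commit()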
It is also a good idea to move the pg_xlog directory to a separate disk or array - not just a different filesystem, it really must be separate HW. SSDs I can recommend only in arrays with proper error checking; lately we had problems with a corrupted database on a single SSD.
First, do not use the database for recording statistics. Or, at the very least, use a different database. The write overhead of the logs will degrade the responsiveness of your webapp. And your daily backups will take much longer because of big tables that do not need to be backed up so frequently.
The "do it yourself" solution of my choice would be to write asynchronously to log files and then process these files afterwards to construct the statistics in your analytics database. There is good code snippet of async write in this response. Or you can benchmark any of the many loggers available for Java.
Also note that there are products like Apache Kafka specifically designed to collect this kind of information.
Another possibility is to create a time series in a column-oriented database like HBase or Cassandra. In this case you'd have one row per product and as many columns as hits.
Last, if you are going to do it with the database, as #JosMac pointed out, create partitions and avoid indexes as much as you can. Set the fillfactor storage parameter to 100. You can also consider UNLOGGED tables, but read the PostgreSQL documentation thoroughly before turning off the write-ahead log.
Just to raise another non-RDBMS option for you (so a little off topic): you could send text files (CSV, TSV, JSON, Parquet, ORC) to Amazon S3 and use AWS Athena to query them directly using SQL.
Since it queries the raw files, you may be able to just send it unfiltered weblogs and query them through JDBC.
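A minimal sketch of kicking off such an Athena query from Python with boto3 (the bucket, database, and table names are assumptions for illustration):

# Minimal sketch of querying raw files on S3 with Athena via boto3.
# Bucket, database and table names are illustrative assumptions.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

resp = athena.start_query_execution(
    QueryString=("SELECT classified_id, COUNT(*) AS views "
                 "FROM weblogs GROUP BY classified_id"),
    QueryExecutionContext={"Database": "stats"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(resp["QueryExecutionId"])   # poll get_query_execution() for completion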

How to Let Spark Handle Bigger Data Sets?

I have a very complex query that needs to join 9 or more tables with some 'group by' expressions. Most of these tables have almost the same number of rows. These tables also have some columns that can be used as the 'key' to partition the tables.
Previously, the app ran fine, but now the data set has 3-4 times as much data as before. My tests showed that if the row count of each table is less than 4,000,000, the application still runs pretty nicely. However, if the count is more than that, the application writes hundreds of terabytes of shuffle data and stalls (no matter how I adjust the memory, partitions, executors, etc.). The actual data is probably just dozens of GB.
I would think that if the partitioning works properly, Spark shouldn't shuffle so much and the join should be done on each node. It is puzzling why Spark is not 'smart' enough to do so.
I could split the data set (using the 'key' I mentioned above) into many data sets that can be dealt with independently, but the burden would then be on me... which discounts the very reason to use Spark. What other approaches could help?
I use Spark 2.0 over Hadoop YARN.
My tests showed that if the row count of each table is less than 4,000,000, the application still runs pretty nicely. However, if the count is more than that, the application writes hundreds of terabytes of shuffle data
When joining datasets, if the size of one side is below a certain configurable threshold, Spark broadcasts that entire table to each executor so that the join can be performed locally everywhere. Your observation above is consistent with this. You can also provide the broadcast hint to Spark explicitly, like so: df1.join(broadcast(df2))
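A minimal PySpark sketch of that broadcast-hint join (the input paths and the join key are assumptions for illustration):

# Minimal sketch of an explicit broadcast join in PySpark.
# Input paths and the join key are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join-example").getOrCreate()

df1 = spark.read.parquet("/data/big_table")     # large fact table
df2 = spark.read.parquet("/data/small_table")   # small dimension table

# Ship df2 to every executor so the join avoids a full shuffle of df1.
joined = df1.join(broadcast(df2), on="key")
joined.explain()   # the plan should show a BroadcastHashJoin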
Other than that, can you please provide more specifics about your problem?
[Some time ago I was also grappling with the issue of joins and shuffles for one of our jobs that had to handle a couple of TBs. We were using RDDs (not the Dataset API). I wrote about my findings here. These may be of some use to you as you try to reason about the underlying data shuffle.]
Update: according to the documentation, spark.sql.autoBroadcastJoinThreshold is the configurable property key. Its default value is 10 MB, and it does the following:
Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when performing a join. By setting this value to -1 broadcasting can be disabled. Note that currently statistics are only supported for Hive Metastore tables where the command ANALYZE TABLE COMPUTE STATISTICS noscan has been run.
So apparently, this is currently supported only for Hive Metastore tables.
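A minimal sketch of adjusting that threshold in PySpark (the 100 MB value is illustrative only; -1 disables automatic broadcasting):

# Minimal sketch of raising (or disabling) the automatic broadcast threshold.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("broadcast-threshold-example")
         .config("spark.sql.autoBroadcastJoinThreshold", 100 * 1024 * 1024)  # 100 MB
         .getOrCreate())

# Set to -1 to disable automatic broadcast joins entirely:
# spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
print(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))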

How to tell there is new data available in ga_sessions_intraday_ efficiently

Google Analytics data should be exported to BigQuery 3 times a day, according to the docs. I am trying to determine an efficient way to detect when new data is available in the ga_sessions_intraday_ table and then run a query in BQ to extract the new data.
My best idea is to poll ga_sessions_intraday_ by running a SQL query every hour. I would track the max visitStartTime (storing the state somewhere), and if a new max visitStartTime shows up in ga_sessions_intraday_ then I would run my full queries.
The problem with this approach is that I need to store state about the max visitStartTime. I would prefer something simpler.
Does GA BigQuery have a better way of telling that new data is available in ga_sessions_intraday_? Some kind of event that fires? Should I use the last modified date of the table (but then I need to keep track of the time window to run against)?
Thanks in advance for your help,
Kevin
Last modified time on the table is probably the best approach here (and cheaper than issuing a probe query). I don't believe there is any other signalling mechanism for delivery of the data.
If your full queries run more quickly than your polling interval, you could probably just use the modified time of your derived tables to hold the data (and update when your output tables are older than your input tables).
Metadata queries are free, so you can even embed most of the logic in a query:
SELECT
  (SELECT MAX(last_modified_time)
   FROM `YOUR_INPUT_DATASET.__TABLES__`)
  >
  (SELECT MAX(last_modified_time)
   FROM `YOUR_OUTPUT_DATASET.__TABLES__`) AS need_update
If you have a mix of tables in your output dataset, you can be more selective (with a WHERE clause) to filter down the tables you examine.
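A minimal sketch of driving that check from Python with the google-cloud-bigquery client (the dataset names are the same placeholders as in the query above):

# Minimal sketch of the "is an update needed?" check via google-cloud-bigquery.
# Dataset names are placeholders, as in the query above.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT
  (SELECT MAX(last_modified_time) FROM `YOUR_INPUT_DATASET.__TABLES__`)
  >
  (SELECT MAX(last_modified_time) FROM `YOUR_OUTPUT_DATASET.__TABLES__`) AS need_update
"""

need_update = list(client.query(sql).result())[0]["need_update"]
if need_update:
    # Kick off the full extraction queries here.
    print("New intraday data detected; refreshing derived tables.")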
If you need a convenient place to run this scheduling logic (that isn't a developer's workstation), you might consider one of my previous answers. (Short version: Apps Script is pretty neat)
You might also consider filing a feature request for "materialized views" or "scheduled queries" on BigQuery's public issue tracker. I didn't see an existing entry for this with a quick skim, but I've certainly heard similar requests in the past.
I'm not sure how the Google Analytics team handles feature requests, but having a pubsub notification upon delivery of a new batch of Analytics data seems like it could be useful as well.

Improve apache hive performance

I have 5 GB of data in my HDFS sink. When I run any query on Hive it takes more than 10-15 minutes to complete. The number of rows I get when I run
select count(*) from table_name
is 3,880,900. My VM has 4.5 GB of memory and it runs on a 2012 MBP. I would like to know if creating an index on the table will give any performance improvement. Also, are there other ways to tell Hive to only use a certain amount of data or rows so as to get results faster? I am OK even if the queries run over a smaller subset of the data, at least to get a glimpse of the results.
Yes, indexing should help. However, getting a subset of the data (using limit) isn't really helpful, as Hive still scans the whole data set before limiting the output.
You can try using the RCFile/ORC file formats for faster results. In my experiments, RCFile-based tables executed queries roughly 10 times faster than textfile/sequence-file-based tables.
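A minimal sketch of converting the text-format table to ORC from Python using PyHive (the host, port, and table names are assumptions for illustration):

# Minimal sketch of creating an ORC copy of a Hive table via PyHive.
# Host, port and table names are illustrative assumptions.
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000)
cur = conn.cursor()

# CTAS: create an ORC-backed copy of the table and load it in one pass.
cur.execute(
    "CREATE TABLE table_name_orc STORED AS ORC AS SELECT * FROM table_name")

# Subsequent queries hit the columnar ORC table instead of the text files.
cur.execute("SELECT COUNT(*) FROM table_name_orc")
print(cur.fetchone())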
Depending on the data you are querying, you can get gains by using different file formats like ORC or Parquet. What kind of data are you querying - is it structured or unstructured? What kind of queries are you trying to perform? If it is structured data, you can also see gains by using other SQL-on-Hadoop solutions such as InfiniDB, Presto, Impala, etc.
I am an architect for InfiniDB
http://infinidb.co
SQL-on-Hadoop solutions like InfiniDB, Impala, and others work by having you load your data through them, at which point they perform calculations, optimizations, etc. to make that data faster to query. This helps tremendously for interactive analytical queries, especially when compared to something like Hive.
With that said, you are working with 5 GB of data (but data always grows! someday it could be TBs), which is pretty small, so you can still work with some of the tools that are not intended for high-performance queries. Your best solution with Hive is to look at how your data is laid out and see whether ORC or Parquet could benefit your queries (columnar formats are good for analytic queries).
Hive is always going to be one of the slower options for performing SQL queries on your HDFS data, though. Hortonworks, with their Stinger initiative, is making it better; you might want to check that out.
http://hortonworks.com/labs/stinger/
The use case sounds like a fit for ORC or Parquet if you are interested in a subset of the columns. ORC with Hive 0.12 comes with predicate pushdown (PPD), which helps you discard blocks while running queries, using the metadata it stores for each column.
We did an implementation on top of Hive to support bloom filters in the metadata indexes for ORC files, which gave a performance gain of 5-6x.
What is the average number of mapper/reducer tasks launched for the queries you execute? Tuning some parameters can definitely help.

Improving query performance of a database table with a large number of columns and rows (50 columns, 5mm rows)

We are building a caching solution for our user data. The data is currently stored in Sybase and is distributed across 5-6 tables, but the query service built on top of it uses Hibernate and we are getting very poor performance. Loading the data into the cache would take in the range of 10-15 hours.
So we have decided to create a denormalized table of 50-60 columns and 5mm rows in another relational database (UDB), populate that table first, and then populate the cache from the new denormalized table using JDBC, so the time to build our cache is lower. This gives us much better performance, and now we can build the cache in around an hour, but it still does not meet our requirement of building the cache within 5 minutes. The denormalized table is queried using the following query
select * from users where user_id in (...)
Here user_id is the primary key. We also tried the query
select * from user where user_location in (...)
and also created a non-unique index on location, but that did not help either.
So is there a way we can make the queries faster? If not, then we are also open to considering some NoSQL solutions.
Which NoSQL solution would be suited to our needs? Apart from the large table, we would be making around 1mm updates to the table on a daily basis.
I have read about MongoDB and it seems that it might work, but no one has posted any experience with MongoDB with so many rows and so many daily updates.
Please let us know your thoughts.
The short answer here, relating to MongoDB, is yes - it can be used in this way to create a denormalized cache in front of an RDBMS. Others have used MongoDB to store datasets of similar (and larger) sizes to the one you described, and can keep a dataset of that size in RAM. There are some details missing here in terms of your data, but it is certainly not beyond the capabilities of MongoDB and is one of the more frequently used implementations:
http://www.mongodb.org/display/DOCS/The+Database+and+Caching
The key will be the size of your working data set and therefore your available RAM (MongoDB maps data into memory). For larger solutions, write-heavy scaling, and similar issues, there are numerous approaches (sharding, replica sets) that can be employed.
With the level of detail given it is hard to say for certain that MongoDB will meet all of your requirements, but given that others have already done similar implementations and based on the information given there is no reason it will not work either.
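A minimal pymongo sketch of how the same lookups and daily bulk updates might look in MongoDB (collection and field names are assumptions for illustration):

# Minimal sketch of id-based lookups and bulk updates with pymongo.
# Collection and field names are illustrative assumptions.
from pymongo import MongoClient, UpdateOne

client = MongoClient("mongodb://localhost:27017")
users = client["cache"]["users"]

# Equivalent of: select * from users where user_id in (...)
docs = list(users.find({"_id": {"$in": [101, 102, 103]}}))

# Secondary index to support the location query, analogous to the
# non-unique index on user_location mentioned in the question.
users.create_index("user_location")

# The ~1mm daily updates can be batched as unordered bulk writes.
ops = [UpdateOne({"_id": uid}, {"$set": {"user_location": loc}}, upsert=True)
       for uid, loc in [(101, "NYC"), (102, "LDN")]]
users.bulk_write(ops, ordered=False)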