Is there a way to concatenate the results of multple mongodb queries together in one statement? - mongodb-query

I have a mongodb database that contains a large amount of data without a highly consistent schema. It is used for doing Google Analytics-style interaction tracking with our applications. I need to gather some output covering a whole month, but I'm struggling with the performance of the query, and I don't really know MongoDB very well at all.
The only way I can get results out is by restricting the timespan I am querying within to one day at a time, using the _timestamp field which I believe is indexed by default (I might be wrong).
db.myCollection.find({internalId:"XYZ",username:"Demo",_timestamp:{$gte:ISODate("2019-09-01T00:00:00.000Z"),$lte:ISODate("2019-09-02T00:00:00.000Z")}}); // Day 1..
db.myCollection.find({internalId:"XYZ",username:"Demo",_timestamp:{$gte:ISODate("2019-09-03T00:00:00.000Z"),$lte:ISODate("2019-09-04T00:00:00.000Z")}}); // Day 2..
db.myCollection.find({internalId:"XYZ",username:"Demo",_timestamp:{$gte:ISODate("2019-09-05T00:00:00.000Z"),$lte:ISODate("2019-09-06T00:00:00.000Z")}}); // Day 3..
This works 'fine', but I'd rather be able to SQL union those seperate queries together - but then I guess I'd still end up timing out.
Ideally I'd end up with each of those queries executing seperately, with the resultset being appended to each time and returned at the end.
I might be better off writing a simple application to do this.
Help me Obi-Wan Kenobi, you're my only hope.

Related

Limitations in using all string columns in BigQuery

I have an input table in BigQuery that has all fields stored as strings. For example, the table looks like this:
name dob age info
"tom" "11/27/2000" "45" "['one', 'two']"
And in the query, I'm currently doing the following
WITH
table AS (
SELECT
"tom" AS name,
"11/27/2000" AS dob,
"45" AS age,
"['one', 'two']" AS info )
SELECT
EXTRACT( year from PARSE_DATE('%m/%d/%Y', dob)) birth_year,
ANY_value(PARSE_DATE('%m/%d/%Y', dob)) bod,
ANY_VALUE(name) example_name,
ANY_VALUE(SAFE_CAST(age AS INT64)) AS age
FROM
table
GROUP BY
EXTRACT( year from PARSE_DATE('%m/%d/%Y', dob))
Additionally, I tried doing a very basic group by operation casting an item to a string vs not, and I didn't see any performance degradation on a data set of ~1M rows (actually, in this particular case, casting to a string was faster):
Other than it being bad practice to "keep" this all-string table and not convert it into its proper type, what are some of the limitations (either functional or performance-wise) that I would encounter by keeping a table all-string instead of storing it as their proper type. I know there would be a slight increase in size due to storing strings instead of number/date/bool/etc., but what would be the major limitations or performance hits I'd run into if I kept it this way?
Off the top of my head, the only limitations I see are:
Queries would become more complex (though wouldn't really matter if using a query-builder).
A bit more difficult to extract non-string items from array fields.
Inserting data becomes a bit trickier (for example, need to keep track of what the date format is).
But these all seem like very small items that can be worked around. Are there are other, "bigger" reasons why using all string fields would be a huge limitation, either in limiting query-ability or having a huge performance hit in various cases?
First of all - I don't really see any bigger show-stoppers than those you already know and enlisted
Meantime,
though wouldn't really matter if using a query-builder ...
based on above excerpt - I wanted to touch upon some aspect of this approach (storing all as strings)
While we usually concerned about CASTing from string to native type to apply relevant functions and so on, I realized that building complex and generic query with some sort of query builder in some cases requires opposite - cast native type to string for applying function like STRING_AGG [just] as a quick example
So, my thoughts are:
When table is designed for direct user's access with trivial or even complex queries - having native types is beneficial and performance wise and being more friendly for user to understand, etc.
Meantime, if you are developing your own query builder and you design table such that it will be available to users for querying via that query builder with some generic logic being implemented - having all fields in string can be helpful in building the query builder itself.
So it is a balance - you can lose a little in performance but you can win in being able to better implement generic query builder. And such balance depend on nature of your business - both from data prospective and what kind of query you envision to support
Note: your question is quite broad and opinion based (which is btw not much respected on SO) so, obviously my answer - is totally my opinion but based on quite an experience with BigQuery
Are you OK to store string "33/02/2000" as a date in one row and "21st of December 2012" in another row and "22ое октября 2013" in another row?
Are you OK to store string "45" as age in one row and "young" in another row?
Are you OK when age "10" is less than age "9"?
Data types provide some basic data validation mechanism at the database level.
Does BigQuery databases have a notion of indexes?
If yes, then most likely these indexes become useless as soon as you start casting your strings to proper types, such as
SELECT
...
WHERE
age > 10 and age < 30
vs
SELECT
...
WHERE
ANY_VALUE(SAFE_CAST(age AS INT64)) > 10
and ANY_VALUE(SAFE_CAST(age AS INT64)) < 30
It is normal that with less columns/rows you don't feel the problems. You start to feel the problems when your data gets huge.
Major concerns:
Maintenance of the code: Think of future requirements that you may receive. Every conversion for data manipulation will add extra complexity to your code. For example, if your customer asks for retrieving teenagers in future, you'll need to convert string to date to get the age and then be able to do the manupulation.
Data size: The data size has broader impacts that can not be seen at the start. For example if you have N parallel test teams which require own test systems, you'll need to allocate more disk space.
Read Performance: When you have more bytes to read in huge tables it will cost you considerable time. For example typically telco operators have a couple of billions of rows data per month.
If your code complexity increase, you'll need to replicate conversions in multiple places.
Even single of above items should push one to distance from using strings for everything.
I would think the biggest issue with this would be if there are other users of this table/data, for instance if someone is trying to write reports with it and do calculations or charts or date ranges it could be a big headache having to always cast or convert the data with whatever tool they are using. You or someone would likely get a lot of complaints about it.
And if someone decided to build a layer between this data and the reporting tool which converted all of the data, then you may as well just do it one time to the table/data and be done with it.
From the solution below, you might face some storage and performance problems, you can find some guidance in the official documentation:
The main performance problem will come from the CAST operation, remember that the BigQuery Engine will have to deal with a CAST operation for each value per row.
In order to test the compute cost of this operations, I used the following query:
SELECT
street_number
FROM
`bigquery-public-data.austin_311.311_service_requests`
LIMIT
5000
Inspecting the stages executed in the execution details we are able to see the following:
READ
$1:street_number
FROM bigquery-public-data.austin_311.311_service_requests
LIMIT
5000
WRITE
$1
TO __stage00_output
Only the Read, Limit and Write operations are required. However if we execute the same query adding the the CAST operator.
SELECT
CAST(street_number AS int64)
FROM
`bigquery-public-data.austin_311.311_service_requests`
LIMIT
5000
We see that a compute operation is also required in order to perform the cast operation:
READ
$1:street_number
FROM bigquery-public-data.austin_311.311_service_requests
LIMIT
5000
COMPUTE
$10 := CAST($1 AS INT64)
WRITE
$10
TO __stage00_output
Those compute operations will consume some time, that might cause problems when escalating the operation size.
Also, remember that each time that you want to use the data type properties of each data type, you will have to cast your value, and deal with the compute operation time required.
Finally, referring to the storage performance, as you mentioned Strings do not have a fixed size, and that might cause a size increase.

How to count rows of finished BigQuery job using node.js client library

I would like to get the row count of job that was run using:
bigquery.startQuery(options)
The naive way of doing this, would be to stream the result (e.g. using):
job.getQueryResultsStream()
And count one by one. This obviously isn't very efficient, especially for large results. Another way I thought of is using the metadata of the job:
job.on('complete', function(metadata) {...}
Where I could kind of "reverse engineer" the response, to get the query plan, and see the number of written rows in the last step. I could find that in:
statistics.query.queryPlan[statistics.query.queryPlan.length - 1].recordsWritten
While a sample of different queries convinced me that this might work, it feels like a "hack", and it's difficult to say how robust it will be. Seems like I might need to handle different cases (failed queries, etc.)
EDIT: Another option suggested below is "SELECT COUNT"ing the temp table created by the original query (available in the job metadata). While this absolutely is a straightforward way to get the result I'm looking for, it has the disadvantage of requiring another roundtrip to query the BigQuery service, which costs several seconds. It is a 0 "bytes billed" query (counting a full table uses table metadata only), but it seems redundant when the job "knows" how many rows it has written to the output.
Is there a straightforward and "correct" way to get this count from the job object, without a roundtrip to BQ service? Perhaps a field I missed / misinterpreted, or a function in the job object that returns this?
Any job has destination table - even when you do not explicitly set it - result is still saved in so-called anonymous table that you can in turn query to get the count of output rows. So below simple extra query will work (note - names are just as an example)
SELECT COUNT(1)
FROM `yourProject._0511743a77ca76c1b55482d7cb1f8e91ac5c7b36.anon17286defe54b5c07ba6810a71abfdba6388ac4e0`
The actual destination table to use - can be retrieved from configuration.query.destinationTable property of job
job.on('complete', function(metadata) {
console.log(metadata.statistics.query.numDmlAffectedRows)
}

What is a best way to organise the complex couchdb view (sql-like query)?

In my application I need a SQL-like query of the documents. The big picture is that there is a page with a paginated table showing the couchdb documents of a certain "type". I have about 15 searchable columns like timestamp, customer name, the us state, different numeric fields, etc. All of these columns are orderable, also there is a filter form allowing the user to filter by each of the fields.
For a more concrete below is a typical query which is a result by a customer setting some of the filter options and following to the second page. Its written in a pseodo-sql code, just to explain the problem:
timestamp > last_weeks_monday_epoch AND timestamp < this_weeks_monday_epoch AND marked_as_test = False AND dataspace="production" AND fico > 650
SORT BY timestamp DESC
LIMIT 15
SKIP 15
This would be a trivial problem if I were using any sql-like database, but couchdb is way more fun ;) To solve this I've created a view with the following structure of the emitted rows:
key: [field, value], id: doc._id, value: null
Now, to resolve the example query above I need to perform a bunch of queries:
{startkey: ["timestamp", last_weeks_monday_epoch], endkey: ["timestamp", this_weeks_monday_epoch]}, the *_epoch here are integers epoch timestamps,
{key: ["marked_as_test", False]},
{key: ["dataspace", "production"]},
{startkey: ["fico", 650], endkey: ["fico", {}]}
Once I have the results of the queries above I calculate intersection of the sets of document IDs and apply the sorting using the result of timestamp query. Than finally I can apply the slice resolving the document IDs of the rows 15-30 and download their content using bulk get operation.
Needless to say, its not the fastest operation. Currently the dataset I'm working with is roughly 10K documents big. I can already see that the part when I'm calculating the intersection of the sets can take like 4 seconds, obviously I need to optimize it further. I'm afraid to think, how slow its going to get in a few months when my dataset doubles, triples, etc.
Ok, so having explained the situation I'm at, let me ask the actual questions.
Is there a better, more natural way to reach my goal without loosing the flexibility of the tool?
Is the view structure I've used optimal ? At some point I was considering using a separate map() function generating the value of each field. This would result in a smaller b-trees but more work of the view server to generate the index. Can I benefit this way ?
The part of algorithm where I have to calculate intersections of the big sets just to later get the slice of the result bothers me. Its not a scalable approach. Does anyone know a better algorithm for this ?
Having map function:
function(doc){
if(doc.marked_as_test) return;
emit([doc.dataspace, doc.timestamp, doc.fico], null):
}
You can made similar request:
http://localhost:5984/db/_design/ddoc/_view/view?startkey=["production", :this_weeks_monday_epoch]&endkey=["production", :last_weeks_monday_epoch, 650]&descending=true&limit=15&skip=15
However, you should pass :this_weeks_monday_epoch and :last_weeks_monday_epoch values from the client side (I believe they are some calculable variables on database side, right?)
If you don't care about dataspace field (e.g. it's always constant), you may move it into the map function code instead of having it in query parameters.
I don't think CouchDB is a good fit for the general solution to your problem. However, there are two basic ways you can mitigate the ways CouchDB fits the problem.
Write/generate a bunch of map() functions that use each separate column as the key (for even better read/query performance, you can even do combinatoric approaches). That way you can do smart filtering and sorting, making use of a bunch of different indices over the data. On the other hand, this will cost extra disk space and index caching performance.
Try to find out which of the filters/sort orders your users actually use, and optimize for those. It seems unlikely that each combination of filters/sort orders is used equally, so you should be able to find some of the most-used patterns and write view functions that are optimal for those patterns.
I like the second option better, but it really depends on your use case. This is one of those things SQL engines have been pretty good at traditionally.

Efficient way to compute accumulating value in sqlite3

I have an sqlite3 table that tells when I gain/lose points in a game. Sample/query result:
SELECT time,p2 FROM events WHERE p1='barrycarter' AND action='points'
ORDER BY time;
1280622305|-22
1280625580|-9
1280627919|20
1280688964|21
1280694395|-11
1280698006|28
1280705461|-14
1280706788|-13
[etc]
I now want my running point total. Given that I start w/ 1000 points,
here's one way to do it.
SELECT DISTINCT(time), (SELECT
1000+SUM(p2) FROM events e WHERE p1='barrycarter' AND action='points'
AND e.time <= e2.time) AS points FROM events e2 WHERE p1='barrycarter'
AND action='points' ORDER BY time
but this is highly inefficient. What's a better way to write this?
MySQL has #variables so you can do things like:
SELECT time, #tot := #tot+points ...
but I'm using sqlite3 and the above isn't ANSI standard SQL anyway.
More info on the db if anyone needs it: http://ccgames.db.94y.info/
EDIT: Thanks for the answers! My dilemma: I let anyone run any
single SELECT query on "http://ccgames.db.94y.info/". I want to give
them useful access to my data, but not to the point of allowing
scripting or allowing multiple queries with state. So I need a single
SQL query that can do accumulation. See also:
Existing solution to share database data usefully but safely?
SQLite is meant to be a small embedded database. Given that definition, it is not unreasonable to find many limitations with it. The task at hand is not solvable using SQLite alone, or it will be terribly slow as you have found. The query you have written is a triangular cross join that will not scale, or rather, will scale badly.
The most efficient way to tackle the problem is through the program that is making use of SQLite, e.g. if you were using Web SQL in HTML5, you can easily accumulate in JavaScript.
There is a discussion about this problem in the sqlite mailing list.
Your 2 options are:
Iterate through all the rows with a cursor and calculate the running sum on the client.
Store sums instead of, or as well as storing points. (if you only store sums you can get the points by doing sum(n) - sum(n-1) which is fast).

Ruby Rails Complex SQL with aggregate function and DayOfWeek

Rails 2.3.4
I have searched google, and have not found an answer to my dilemma.
For this discussion, I have two models. Users and Entries. Users can have many Entries (one for each day).
Entries have values and sent_at dates.
I want to query and display the average value of entries for a user BY DAY OF WEEK. So if a user has entered values for, say, the past 3 weeks, I want to show the average value for Sundays, Mondays, etc. In MySQL, it is simple:
SELECT DAYOFWEEK(sent_at) as day, AVG(value) as average FROM entries WHERE user_id = ? GROUP BY 1
That query will return between 0 and 7 records, depending upon how many days a user has had at least one entry.
I've looked at find_by_sql, but while I am searching Entry, I don't want to return an Entry object; instead, I need an array of up to 7 days and averages...
Also, I am concerned a bit about the performance of this, as we would like to load this to the user model when a user logs in, so that it can be displayed on their dashboard. Any advice/pointers are welcome. I am relatively new to Rails.
You can query the database directly, no need to use an actual ActiveRecord object. For example:
ActiveRecord::Base.connection.execute "SELECT DAYOFWEEK(sent_at) as day, AVG(value) as average FROM entries WHERE user_id = #{user.id} GROUP BY DAYOFWEEK(sent_at);"
This will give you a MySql::Result or MySql2::Result that you can then use each or all on this enumerable, to view your results.
As for caching, I would recommend using memcached, but any other rails caching strategy will work as well. The nice benefit of memcached is that you can have your cache expire after a certain amount of time. For example:
result = Rails.cache.fetch('user/#{user.id}/averages', :expires_in => 1.day) do
# Your sql query and results go here
end
This would put your results into memcached for one day under the key 'user//averages'. For example if you were user with id 10 your averages would be in memcached under 'user/10/average' and the next time you went to perform this query (within the same day) the cached version would be used instead of actually hitting the database.
Untested, but something like this should work:
#user.entries.select('DAYOFWEEK(sent_at) as day, AVG(value) as average').group('1').all
NOTE: When you use select to specify columns explicitly, the returned objects are read only. Rails can't reliably determine what columns can and can't be modified. In this case, you probably wouldn't try to modify the selected columns, but you can'd modify your sent_at or value columns through the resulting objects either.
Check out the ActiveRecord Querying Guide for a breakdown of what's going on here in a fairly newb-friendly format. Oh, and if that query doesn't work, please post back so others that may stumble across this can see that (and I can possibly update).
Since that won't work due to entries returning an array, we can try using join instead:
User.where(:user_id => params[:id]).joins(:entries).select('...').group('1').all
Again, I don't know if this will work. Usually you can specify where after joins, but I haven't seen select combined in there. A tricky bit here is that the select is probably going to eliminate returning any data about the user at all. It might make more sense just to eschew find_by_* methods in favor of writing a method in the Entry model that just calls your query with select_all (docs) and skips the association mapping.