CouchDB views and many (thousands of) document types - optimization

I'm studying CouchDB and I'm picturing a worst-case scenario:
for each document type I need 3 views, and this application can generate 10 thousand document types.
By "document type" I mean the structure of the document.
After a new document is inserted, CouchDB makes 3 * 10K calls to view functions while searching for the right document type.
Is this true?
Is there a smarter solution than making a database for each doc type?
Document example (assume that no two documents have the same structure; in this example the data is under different keys):
[
  {
    "_id": "1251888780.0",
    "_rev": "1-582726400f3c9437259adef7888cbac0",
    "type": "sensorX",
    "value": {"valueA": "123"}
  },
  {
    "_id": "1251888780.0",
    "_rev": "1-37259adef7888cbac06400f3c9458272",
    "type": "sensorY",
    "value": {"valueB": "456"}
  },
  {
    "_id": "1251888780.0",
    "_rev": "1-6400f3c945827237259adef7888cbac0",
    "type": "sensorZ",
    "value": {"valueC": "789"}
  }
]
Views example (in this example only one per doc type):
"views":
{
  "sensorX": {
    "map": "function(doc) { if (doc.type == 'sensorX') emit(null, doc.value.valueA) }"
  },
  "sensorY": {
    "map": "function(doc) { if (doc.type == 'sensorY') emit(null, doc.value.valueB) }"
  },
  "sensorZ": {
    "map": "function(doc) { if (doc.type == 'sensorZ') emit(null, doc.value.valueC) }"
  }
}

The results of the map() function in CouchDB are cached, for each new document, the first time you request the view. Let me explain with a quick illustration.
You insert 100 documents into CouchDB.
You request the view. Now the 100 documents have the map() function run against them and the results are cached.
You request the view again. The data is read from the indexed view data; no documents have to be re-mapped.
You insert 50 more documents.
You request the view. The 50 new documents are mapped and merged into the index with the old 100 documents.
You request the view again. The data is read from the indexed view data; no documents have to be re-mapped.
I hope that makes sense. If you're concerned about a big load being generated when a user requests a view after lots of new documents have been added, you could have your import process call the view (to map the new documents) and have the user's request for the view include stale=ok.
The CouchDB book is a really good resource for information on CouchDB.
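To make the insert/request sequence above concrete, here is a toy model in plain JavaScript. It is only a sketch of the idea (map new documents once, then serve from the cached index) and not how CouchDB is actually implemented; createView, mapCalls and the sample documents are made up for the illustration.

```javascript
// Toy model of incremental view indexing; not CouchDB's real implementation.
function createView(mapFn) {
  const index = [];      // cached rows emitted so far
  let indexedCount = 0;  // how many documents have already been mapped
  let mapCalls = 0;      // counts map() invocations, for illustration only

  return {
    // Map only the documents added since the last query, then read the index.
    query(docs) {
      for (let i = indexedCount; i < docs.length; i++) {
        const doc = docs[i];
        mapCalls++;
        mapFn(doc, (key, value) => index.push({ id: doc._id, key, value }));
      }
      indexedCount = docs.length;
      return index.slice();
    },
    mapCalls: () => mapCalls,
  };
}

const docs = [];
for (let i = 0; i < 100; i++) {
  docs.push({ _id: String(i), type: 'sensorX', value: { valueA: i } });
}

const view = createView((doc, emit) => {
  if (doc.type === 'sensorX') emit(null, doc.value.valueA);
});

view.query(docs); // first request: maps the 100 documents
view.query(docs); // second request: nothing new to map
const callsAfterFirstBatch = view.mapCalls();

for (let i = 100; i < 150; i++) {
  docs.push({ _id: String(i), type: 'sensorX', value: { valueA: i } });
}
const rows = view.query(docs); // maps only the 50 new documents
const callsAfterSecondBatch = view.mapCalls();

console.log(callsAfterFirstBatch, callsAfterSecondBatch, rows.length);
```

In this model the map function runs once per document no matter how often the view is requested, which is the behaviour the steps above describe.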

James has a great answer.
It looks like you are asking the question "what are the values of documents of type X?"
I think you can do that with one view:
function(doc) {
// _view/sensor_value
var val_names = { "sensorX": "valueA"
, "sensorY": "valueB"
, "sensorZ": "valueC"
};
var value_name = val_names[doc.type];
if(value_name) {
// e.g. "sensorX" -> "123"
// or "sensorZ" -> "789"
emit(doc.type, doc.value[value_name]);
}
}
Now, to get all values for sensorX, you query /db/_design/app/_view/sensor_value with a parameter ?key="sensorX". CouchDB will show all values for sensorX, which come from the document's value.valueA field. (For sensorY, it comes from value.valueB, etc.)
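If you want to check the map logic outside CouchDB, a rough stand-alone sketch like the following can help. Here emit is passed in as a parameter, and buildRows and queryByKey are hypothetical helpers that imitate building the view and filtering with ?key=...; none of them are CouchDB APIs.

```javascript
// The map logic from above, with emit passed in so it can run outside CouchDB.
function sensorValueMap(doc, emit) {
  const valNames = { sensorX: 'valueA', sensorY: 'valueB', sensorZ: 'valueC' };
  const valueName = valNames[doc.type];
  if (valueName) emit(doc.type, doc.value[valueName]);
}

// Stand-in for the view index: run the map over every document and collect rows.
function buildRows(docs) {
  const rows = [];
  for (const doc of docs) {
    sensorValueMap(doc, (key, value) => rows.push({ id: doc._id, key, value }));
  }
  return rows;
}

// Stand-in for ?key="...": keep only rows whose key matches exactly.
function queryByKey(rows, key) {
  return rows.filter((row) => row.key === key);
}

const sampleDocs = [
  { _id: 'a', type: 'sensorX', value: { valueA: '123' } },
  { _id: 'b', type: 'sensorY', value: { valueB: '456' } },
  { _id: 'c', type: 'sensorZ', value: { valueC: '789' } },
  { _id: 'd', type: 'unknown', value: { other: '0' } }, // no mapping, so no row
];

const sensorRows = buildRows(sampleDocs);
const sensorYRows = queryByKey(sensorRows, 'sensorY');
console.log(sensorYRows);
```

One view function handles every known type, and documents of an unknown type simply emit nothing.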
Future-proofing
If you might have new document types in the future, something more general might be better:
function(doc) {
if(doc.type && doc.value) {
emit(doc.type, doc.value);
}
}
That is very simple, and any document will work if it has a type and value field. Next, to get the valueA, valueB, etc. from the view, just do that on the client side.
If using the client is impossible, use a _list function.
function(head, req) {
// _list/sensor_val
//
start({'headers':{'Content-Type':'application/json'}});
// Updating this will *not* cause the map/reduce view to re-build.
var val_names = { "sensorX": "valueA"
, "sensorY": "valueB"
, "sensorZ": "valueC"
};
var row;
var doc_type, val_name, doc_val;
while(row = getRow()) {
doc_type = row.key;
val_name = val_names[doc_type];
doc_val = row.value[val_name];
send("Doc " + row.id + " is type " + doc_type + " and value " + doc_val);
}
}
Obviously use send() to send whichever format you prefer for the client (such as JSON).


How do I perform a graph query and join?

I apologize for the title, I don't exactly know how to word it. But essentially, this is a graph-type query but I know RavenDB's graph functionality will be going away so this probably needs to be solved with Javascript.
Here is the scenario:
I have a bunch of documents of different types, call them A, B, C, D. Each of these particular types of documents have some common properties. The one that I'm interested in right now is "Owner". The owner field is an ID which points to one of two other document types; it can be a Group or a User.
The Group document has a 'Members' field which contains an ID which either points to a User or another Group. Something like this
It's worth noting that the documents in play have custom IDs that begin with their entity type. For example, Users and Groups begin with user: and group: respectively. Example IDs look like this: user:john@castleblack.com or group:the-nights-watch. This comes into play later.
What I want to be able to do is the following type of query:
"Given that I have either a group id or a user id, return all documents of type a, b, or c where the group/user id is equal to or is a descendant of the document's owner."
In other words, I need to be able to return all documents that are owned by a particular user or group either explicitly or implicitly through a hierarchy.
I've considered solving this a couple different ways with no luck. Here are the two approaches I've tried:
Using a function within a query
With Dejan's help in an email thread, I was able to devise a function that would walk its way down the ownership graph. What this attempted to do was build a flat array of IDs which represented explicit and implicit owners (i.e. root + descendants):
declare function hierarchy(doc, owners){
owners = owners || [];
while(doc != null) {
let ownerId = id(doc)
if(ownerId.startsWith('user:')) {
owners.push(ownerId);
} else if(ownerId.startsWith('group:')) {
owners.push(ownerId);
doc.Members.forEach(m => {
let owner = load(m, 'Users') || load(m, 'Groups');
owners = hierarchy(owner, owners);
});
}
}
return owners;
}
I had two issues with this. 1. I don't actually know how to use this in a query lol. I tried to use it as part of the where clause but apparently that's not allowed:
from @all_docs as d
where hierarchy(d) = 'group:my-group-d'
// error: method hierarchy not allowed
Or if I tried anything in the select statement, I got an error that I have exceeded the number of allowed statements.
As a custom index
I tried the same idea through a custom index. Essentially, I tried to create an index that would produce an array of IDs using roughly the same function above, so that I could just query where my id was in that array
map('@all_docs', function(doc) {
function hierarchy(n, graph) {
while(n != null) {
let ownerId = id(n);
if(ownerId.startsWith('user:')) {
graph.push(ownerId);
return graph;
} else if(ownerId.startsWith('group:')){
graph.push(ownerId);
n.Members.forEach(g => {
let owner = load(g, 'Groups') || load(g, 'Users');
hierarchy(owner, graph);
});
return graph;
}
}
}
function distinct(value, index, self){ return self.indexOf(value) === index; }
let ownerGraph = []
if(doc.Owner) {
let owner = load(doc.Owner, 'Groups') || load(doc.Owner, 'Users');
ownerGraph = hierarchy(owner, ownerGraph).filter(distinct);
}
return { Owners: ownerGraph };
})
// error: recursion is not allowed by the javascript host
The problem with this is that I'm getting an error that recursion is not allowed.
So I'm stumped now. Am I going about this wrong? I feel like this could be a subquery of sorts or a filter by function, but I'm not sure how to do that either. Am I going to have to do this in two separate queries (i.e. two round-trips), one to get the IDs and the other to get the docs?
Update 1
I've revised my attempt at the index to the following and I'm not getting the recursion error anymore, but assuming my queries are correct, it's not returning anything
// Entity/ByOwnerGraph
map('@all_docs', function(doc) {
function walkGraph(ownerId) {
let owners = []
let idsToProcess = [ownerId]
while(idsToProcess.length > 0) {
let current = idsToProcess.shift();
if(current.startsWith('user:')){
owners.push(current);
} else if(current.startsWith('group:')) {
owners.push(current);
let group = load(current, 'Groups')
if(!group) { continue; }
idsToProcess.concat(group.Members)
}
}
return owners;
}
let owners = [];
if(doc.Owner) {
owners.concat(walkGraph(doc.Owner))
}
return { Owners: owners };
})
// query (no results)
from index Entity/ByOwnerGraph as x
where x.Owners = "group:my-group-id"
// alternate query (no results)
from index Entity/ByOwnerGraph as x
where x.Owners ALL IN ("group:my-group-id")
I still can't use this approach in a query either as I get the same error that there are too many statements.
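One possible cause of the empty results, offered as a guess rather than a confirmed diagnosis: Array.prototype.concat returns a new array and leaves the original untouched, so idsToProcess.concat(group.Members) and owners.concat(walkGraph(doc.Owner)) both throw their results away. The sketch below is plain JavaScript, not a RavenDB index. The groups object and the loadGroup helper are hypothetical stand-ins for load(), the IDs are made up, and the seen set is an extra guard against membership cycles that the original does not have.

```javascript
// Hypothetical in-memory data standing in for the Groups collection.
const groups = {
  'group:the-nights-watch': { Members: ['user:jon', 'group:rangers'] },
  // Deliberately points back at the parent group to show the cycle guard.
  'group:rangers': { Members: ['user:benjen', 'group:the-nights-watch'] },
};

// Stand-in for load(id, 'Groups'); returns null when the group is unknown.
function loadGroup(id) {
  return groups[id] || null;
}

// Iteratively collect the owner id plus everything reachable through Members.
function walkGraph(ownerId) {
  const owners = [];
  const seen = new Set();
  const idsToProcess = [ownerId];
  while (idsToProcess.length > 0) {
    const current = idsToProcess.shift();
    if (seen.has(current)) continue; // already handled; also breaks cycles
    seen.add(current);
    if (current.startsWith('user:')) {
      owners.push(current);
    } else if (current.startsWith('group:')) {
      owners.push(current);
      const group = loadGroup(current);
      if (!group) continue;
      // push(...) mutates the queue in place; concat() would return a new array.
      idsToProcess.push(...group.Members);
    }
  }
  return owners;
}

const ownerGraph = walkGraph('group:the-nights-watch');
console.log(ownerGraph);
```

Whether spread syntax and Set are available inside the index's JavaScript host is something to verify; reassigning the result of concat (idsToProcess = idsToProcess.concat(group.Members)) is the more conservative form of the same fix.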

SSIS Import data that is NOT columnar into SQL

I am fairly new to SSIS and need a little help getting started. I have several reports that come out of our mainframe. The reports are not in a columnar format. The date record is at the top, then there might be some initial data, then there might be a little more. So I need to read in each line, look at what the text says, and figure out whether I need the data or should move to the next row.
This is a VERY rough example of the report I want to import into a SQL table.
DATE: 01/08/2020 FACILITY NAME PAGE1
REVENUE USAGE FOR ACCOUNTING PERIOD 02
----TOTAL---- ----TOTAL---- ----OTHER---- ----INSURANCE---- ----INSURANCE2----
SERVICE CODE - 123456789 DESCRIPTION: WIDGETS
CURR 2,077
IP 0.0000 3 2,345 0.00
143
OP 0.0000 2 1,231 0.00
YTD 5
IP 0.0000
76
OP 0.0000
etc . . . .. .
SERVICE CODE
After the SERVICE CODE the data will start to repeat like it is above. This is the basic idea of a report.
I want to get the Date then the Service Code, Description, Current IP Volume, Current IP Dollar, Current OP Volume, Current OP Dollar, YTD IP Volume, YTD IP Dollar, YTD OP Volume, YTD OP Dollar . . then repeat.
Just to clarify, I am not asking anyone to do this for me. I want to learn how to do this. I have looked into how to do this, but every example I have found talks about doing it with a CSV, tab-delimited, or Excel file. I do not have that type of file, so I am asking what I need to look at. I currently use Monarch to format the file, but again I want to learn more about SSIS and this is a perfect way to learn. Asking the vendor to redo the report is not an option, plus I want to learn how to do this. Thank you, I just wanted to get that out there.
Any help would be greatly appreciated.
Rodger
As stated in comments, you could do this using a script task. The basic steps are:
Define a DataTable to store your data.
Use a StreamReader to read your report.
Process this using a combination of conditionals, String methods, and parsing to extract the relevant fields from the relevant lines.
Write the DataTable to the database using SqlBulkCopy
The following would go inside your Main method in your script task:
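//Needs these directives at the top of the script file:
//using System.Data; using System.Data.SqlClient; using System.IO;
//Define a table to store your data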
var table = new DataTable
{
Columns =
{
{ "ServiceCode", typeof(string) },
{ "Description", typeof(string) },
{ "CurrentIPVolume", typeof(int) },
{ "CurrentIPDollar", typeof(decimal) },
{ "CurrentOPVolume", typeof(int) },
{ "CurrentOPDollar", typeof(decimal) },
{ "YTDIPVolume", typeof(int) },
{ "YTDIPDollar", typeof(decimal) },
{ "YTDOPVolume", typeof(int) },
{ "YTDOPDollar", typeof(decimal) }
}
};
var filePath = @"Your File Path";
using (var reader = new StreamReader(filePath))
{
string line = null;
DataRow row = null;
// As YTD and Curr are identical, we will need a flag later to mark our position within the record
bool ytdFlag= false;
//Loop through every line in the file
while ((line = reader.ReadLine()) != null)
{
//if the line is blank, move on to the next
if (string.IsNullOrWhiteSpace(line))
continue;
// If the line starts with service code, then it marks the start of a new record
if (line.StartsWith("SERVICE CODE"))
{
//If the current value for row is not null then this is
//not the first record, so we need to add the previous
//record to the table before continuing
if (row != null)
{
table.Rows.Add(row);
ytdFlag= false; // New record, reset YTD flag
}
row = table.NewRow();
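//Split the line now based on known values. Because the line starts with the first
//separator, tokens[0] is an empty string: the code is tokens[1], the description tokens[2]
var tokens = line.Split(new string[] { "SERVICE CODE - ", "DESCRIPTION: "}, StringSplitOptions.None);
row[0] = tokens[1].Trim();
row[1] = tokens[2].Trim();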
}
if (line.StartsWith("CURR"))
{
//Process the row --> "CURR 2,077"
//Not sure what 2,077 is, but this will parse it
int i = 0;
if (int.TryParse(line.Substring(4).Trim().Replace(",", ""), out i))
{
//Do something with your int
Console.WriteLine(i);
}
}
if (line.StartsWith(" IP"))
{
//Start at after IP then split the line into the 4 numbers
var tokens = line.Substring(3).Split(new [] { " "}, StringSplitOptions.RemoveEmptyEntries);
//If we have gone past the CURR record, then add to the YTD columns.
//Assumes tokens[1] is the volume and tokens[2] the dollar amount, e.g. "3" and "2,345"
if (ytdFlag)
{
row[6] = int.Parse(tokens[1]);
row[7] = decimal.Parse(tokens[2]);
}
//Otherwise we are still in the CURR section:
else
{
row[2] = int.Parse(tokens[1]);
row[3] = decimal.Parse(tokens[2]);
}
}
if (line.StartsWith(" OP"))
{
//Start at after OP then split the line into the 4 numbers
var tokens = line.Substring(3).Split(new [] { " "}, StringSplitOptions.RemoveEmptyEntries);
//If we have gone past the CURR record, then add to the YTD columns.
//Assumes tokens[1] is the volume and tokens[2] the dollar amount, e.g. "2" and "1,231"
if (ytdFlag)
{
row[8] = int.Parse(tokens[1]);
row[9] = decimal.Parse(tokens[2]);
}
//Otherwise we are still in the CURR section:
else
{
row[4] = int.Parse(tokens[1]);
row[5] = decimal.Parse(tokens[2]);
}
//After we have processed an OP record, we must set the YTD Flag to true.
//Doesn't matter if it is the YTD OP record, since the flag will be reset
//By the next line that starts with SERVICE CODE anyway
ytdFlag= true;
}
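}
//The last record has no following SERVICE CODE line to trigger the add, so add it here
if (row != null)
{
table.Rows.Add(row);
}
}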
//Now that we have processed the file, we can write the data to a database
using (var sqlBulkCopy = new SqlBulkCopy("Your Connection String"))
{
sqlBulkCopy.DestinationTableName = "dbo.YourTable";
//If necessary add column mappings, but if your DataTable matches your database table
//then this is not required
sqlBulkCopy.WriteToServer(table);
}
This is a very quick example, far from the finished article, and I have done little or no testing, but it should give you the gist of how it could be done, and get you started on one possible solution.
It can definitely be cleaned up and refactored, but I have tried to make it as clear as possible what is going on, rather than trying to write the most efficient code ever. It should also (hopefully) demonstrate what a monumental pain this is to do, and that very minor report changes, such as an extra space before "OP", will break the whole thing.
So again, I would reiterate: if you can get the data in a standard flat file format, with one line per record, you should. I do however appreciate that sometimes these things are out of your control, and I have had to write incredibly ugly import routines like this in the past, so I feel your pain if you can't get the data in a consumable format.

Speed Up Retrieving View Data?

The database I am trying to pull data from has approximately 50,000 documents. Currently it takes around 90 seconds for an iOS or Android device to query and display the data to the mobile device in a view. My code is posted below. Is there something I could be doing differently to speed this up? Thanks for any tips.
function updateAllPoliciesTable() {
try {
var db = Alloy.Globals.dbPolicyInquiry;
var view = db.getView("AllRecordsByInsured");
var vec = view.getAllEntriesBySQL("Agent like ? OR MasterAgent like ?", [Ti.App.agentNumber, Ti.App.agentNumber], true);
var ve = vec.getFirstEntry();
var data = [];
while (ve) {
var unid = ve.getColumnValue("id");
var row = Ti.UI.createTableViewRow({
unid : unid,
height: '45dp',
rowData: ve.getColumnValue("Insured") + " " + ve.getColumnValue("PolicyNumber")
});
var viewLabel = Ti.UI.createLabel({
color : '#333',
font : {
fontSize : '16dp'
},
text: toTitleCase(ve.getColumnValue("Insured")) + " " + ve.getColumnValue("PolicyNumber"),
left: '10dp'
});
row.add(viewLabel);
data.push(row);
ve = vec.getNextEntry();
}
//Ti.API.log("# of policies= " + data.length);
if(data.length == 0) {
var row = Ti.UI.createTableViewRow({
title : "No policies found"
});
data.push(row);
}
$.AllPoliciesTable.setData(data);
Alloy.Globals.refreshAllPolicies = false;
Alloy.Globals.loading.hide();
} catch (e) {
DTG.exception("updateAllPoliciesTable -> ", e);
}
}
Create an index on the appropriate table, that should speed up things.
The SQLite table for your view should be named "view_AllRecordsByInsured".
Create an index for that table, check SQLite documentation about "CREATE INDEX" for more details.
To execute the appropriate SQL, you could use the DTGDatabase class like this (SQLite requires an index name; idx_AllRecordsByInsured_Agent is just an arbitrary choice):
var sqldb = new DTGDatabase(Alloy.Globals.dbPolicyInquiry.localdbname);
sqldb.execute("CREATE INDEX IF NOT EXISTS idx_AllRecordsByInsured_Agent ON view_AllRecordsByInsured (Agent, MasterAgent)");
If that does not give enough speed, look at full text search for SQLite dbs.
Here is some example code regarding full text indexes to give you a starting point:
CREATE VIRTUAL TABLE ft_view__mobile_companies_ USING fts4(id, customername, customercity)
INSERT INTO ft_view__mobile_companies_(id, customername, customercity) SELECT id, customername, customercity FROM view__mobile_companies_
To query the index you need to execute SQL with the MATCH operator (see SQLite documentation). In one app I have well over 100,000 records synchronized from a Domino view, and searching using a full text search in SQLite works instantly.
Well, unlike big database engines, the SQLite database engine is more limited, and so are the devices that it runs on.
What I would try to do is check the query that pulls the data: are you using indexes in your table? Do you use them in the query? Are there unnecessary joins or pulls?
If you fail to tweak the query, you should maybe consider checking out a mobile NoSQL solution. I know there are some on the Appcelerator marketplace; check whether one suits your needs and speeds things up.

Conditionally adjust visible columns in Rally Cardboard UI

So I want to allow the user to conditionally turn columns on/off in a Cardboard app I built. I have two problems.
First, I tried using the 'columns' attribute in the config, but I can't find a default value for it that would display ALL columns (all check boxes checked), i.e. the default behavior when 'columns' is not included in the config object at all. I tried null and [], but both display NO columns.
That leads to my second problem: if there is no such default value, is there a simple way to change only that value in the config object, or do I have to wrap the entire variable in 'if-else' statements?
Finally, if I have to build the string manually, I need to parse the values of an existing custom attribute (a drop list) we have on the portfolio object. I can't seem to get the rally.forEach loop syntax right. Does someone have a simple example?
Thanks
Dax - Autodesk
I found an example in the online SDK from Rally that I could modify to answer the second part (this assumes a custom attribute on Portfolio Item called "ADSK Kanban State" and will output its values to the console):
var showAttributeValues = function(results) {
for (var property in results) {
for (var i=0 ; i < results[property].length ; i++) {
console.log("Attribute Value : " + results[property][i]);
}
}
};
var queryConfig = [];
queryConfig[0] = {
type: 'Portfolio Item',
key : 'eKanbanState',
attribute: 'ADSK Kanban State'
};
rallyDataSource.findAll(queryConfig, showAttributeValues);
rally.forEach loops over each key in the first argument and will execute the function passed as the second argument each time.
It will work with either objects or arrays.
For an array:
var array = [1];
rally.forEach(array, function(value, i) {
//value = 1
//i = 0
});
For an object:
var obj = {
foo: 'bar'
};
rally.forEach(obj, function(value, key) {
//value = 'bar'
//key = 'foo'
});
I think that the code to dynamically build a config using the "results" collection created by your query above and passed to your sample showAttributeValues callback, is going to look a lot like the example of dynamically building a set of Table columns as shown in:
Rally App SDK: Is there a way to have variable columns for table?
I'm envisioning something like the following:
// Dynamically build column config array for cardboard config
var columnsArray = [];
for (var property in results) {
for (var i=0 ; i < results[property].length ; i++) {
// Push the raw value; wrapping it in extra quotes would change the column name
columnsArray.push(results[property][i]);
}
}
var cardboardConfig = {
attribute: 'eKanbanState',
columns: columnsArray,
// .. rest of config here
};
// .. (re)-construct cardboard...
Sounds like you're building a neat board. You'll have to provide the board with the list of columns to show each time (destroying the old board and creating a new one).
Example config:
{
attribute: 'ScheduleState',
columns: [
'In-Progress',
'Completed'
]
}
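On the if/else part of the question: one way to avoid wrapping the whole config in a conditional is to build the common part once and add columns only when the user has narrowed the selection. This is a minimal sketch in plain JavaScript. buildCardboardConfig and its parameters are hypothetical names, and it assumes (as the question describes) that leaving columns out gives the default of showing every column.

```javascript
// Build a cardboard config. Pass null/undefined for selectedColumns to mean "all columns".
function buildCardboardConfig(attribute, selectedColumns) {
  var config = { attribute: attribute };
  // Only set `columns` for an explicit selection, so the default applies otherwise.
  if (Array.isArray(selectedColumns)) {
    config.columns = selectedColumns.slice();
  }
  return config;
}

var allColumnsConfig = buildCardboardConfig('ScheduleState', null);
var filteredConfig = buildCardboardConfig('ScheduleState', ['In-Progress', 'Completed']);
console.log(allColumnsConfig, filteredConfig);
```

The board would still need to be destroyed and re-created with the new config each time the selection changes, as described above.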

Proper Way to Retrieve More than 128 Documents with RavenDB

I know variants of this question have been asked before (even by me), but I still don't understand a thing or two about this...
It was my understanding that one could retrieve more documents than the 128 default setting by doing this:
session.Advanced.MaxNumberOfRequestsPerSession = int.MaxValue;
And I've learned that a WHERE clause should be an ExpressionTree instead of a Func, so that it's treated as Queryable instead of Enumerable. So I thought this should work:
public static List<T> GetObjectList<T>(Expression<Func<T, bool>> whereClause)
{
using (IDocumentSession session = GetRavenSession())
{
return session.Query<T>().Where(whereClause).ToList();
}
}
However, that only returns 128 documents. Why?
Note, here is the code that calls the above method:
RavenDataAccessComponent.GetObjectList<Ccm>(x => x.TimeStamp > lastReadTime);
If I add Take(n), then I can get as many documents as I like. For example, this returns 200 documents:
return session.Query<T>().Where(whereClause).Take(200).ToList();
Based on all of this, it would seem that the appropriate way to retrieve thousands of documents is to set MaxNumberOfRequestsPerSession and use Take() in the query. Is that right? If not, how should it be done?
For my app, I need to retrieve thousands of documents (that have very little data in them). We keep these documents in memory and use them as the data source for charts.
** EDIT **
I tried using int.MaxValue in my Take():
return session.Query<T>().Where(whereClause).Take(int.MaxValue).ToList();
And that returns 1024. Argh. How do I get more than 1024?
** EDIT 2 - Sample document showing data **
{
"Header_ID": 3525880,
"Sub_ID": "120403261139",
"TimeStamp": "2012-04-05T15:14:13.9870000",
"Equipment_ID": "PBG11A-CCM",
"AverageAbsorber1": "284.451",
"AverageAbsorber2": "108.442",
"AverageAbsorber3": "886.523",
"AverageAbsorber4": "176.773"
}
It is worth noting that since version 2.5, RavenDB has an "unbounded results API" to allow streaming. The example from the docs shows how to use this:
var query = session.Query<User>("Users/ByActive").Where(x => x.Active);
using (var enumerator = session.Advanced.Stream(query))
{
while (enumerator.MoveNext())
{
User activeUser = enumerator.Current.Document;
}
}
There is support for standard RavenDB queries and Lucene queries, and there is also async support.
The documentation can be found here. Ayende's introductory blog article can be found here.
The Take(n) function will only give you up to 1024 by default. However, you can change this default in Raven.Server.exe.config:
<add key="Raven/MaxPageSize" value="5000"/>
For more info, see: http://ravendb.net/docs/intro/safe-by-default
The Take(n) function will only give you up to 1024 by default. However, you can pair it with Skip(n) to get all the results:
var points = new List<T>();
var nextGroupOfPoints = new List<T>();
const int ElementTakeCount = 1024;
int i = 0;
int skipResults = 0;
RavenQueryStatistics stats; // must be declared before it is used as an out argument
do
{
nextGroupOfPoints = session.Query<T>().Statistics(out stats).Where(whereClause).Skip(i * ElementTakeCount + skipResults).Take(ElementTakeCount).ToList();
i++;
skipResults += stats.SkippedResults;
points = points.Concat(nextGroupOfPoints).ToList();
}
while (nextGroupOfPoints.Count == ElementTakeCount);
return points;
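The shape of that loop, stripped of the RavenDB specifics, is sketched below in plain JavaScript. queryPage is a hypothetical stand-in for a query whose server caps the page size, and SkippedResults is not modelled here.

```javascript
const SERVER_MAX_PAGE_SIZE = 1024; // mirrors the default cap mentioned above

// Hypothetical stand-in for a query: the "server" never returns more than its cap.
function queryPage(store, skip, take) {
  const capped = Math.min(take, SERVER_MAX_PAGE_SIZE);
  return store.slice(skip, skip + capped);
}

// Keep requesting pages until one comes back short, which signals the end.
function fetchAll(store, pageSize) {
  const size = Math.min(pageSize, SERVER_MAX_PAGE_SIZE);
  const results = [];
  let requests = 0;
  let page;
  do {
    page = queryPage(store, results.length, size);
    requests++;
    results.push(...page);
  } while (page.length === size);
  return { results, requests };
}

const store = Array.from({ length: 2500 }, (_, i) => ({ id: i }));
const fetched = fetchAll(store, SERVER_MAX_PAGE_SIZE);
console.log(fetched.results.length, fetched.requests);
```

Note that every page is a separate request, so in RavenDB a loop like this counts against the session's request limit.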
RavenDB Paging
The number of requests per session is a separate concept from the number of documents retrieved per call. Sessions are short-lived and are expected to have few calls issued over them.
If you are getting more than 10 of anything from the store (even fewer than the default 128) for human consumption, then something is wrong, or your problem requires different thinking than a truckload of documents coming from the data store.
RavenDB indexing is quite sophisticated. Good article about indexing here and facets here.
If you need to perform data aggregation, create a map/reduce index that produces the aggregated data, e.g.:
Index:
from post in docs.Posts
select new { post.Author, Count = 1 }
from result in results
group result by result.Author into g
select new
{
Author = g.Key,
Count = g.Sum(x=>x.Count)
}
Query:
session.Query<AuthorPostStats>("Posts/ByUser/Count")(x=>x.Author)();
You can also use a predefined index with the Stream method. You may use a Where clause on indexed fields.
// Either stream the whole index:
// var query = session.Query<User, MyUserIndex>();
// or filter on an indexed field:
var query = session.Query<User, MyUserIndex>().Where(x => !x.IsDeleted);
using (var enumerator = session.Advanced.Stream<User>(query))
{
while (enumerator.MoveNext())
{
var user = enumerator.Current.Document;
// do something
}
}
Example index:
public class MyUserIndex: AbstractIndexCreationTask<User>
{
public MyUserIndex()
{
this.Map = users =>
from u in users
select new
{
u.IsDeleted,
u.Username,
};
}
}
Documentation: What are indexes?
Session : Querying : How to stream query results?
Important note: the Stream method will NOT track objects. If you change objects obtained from this method, SaveChanges() will not be aware of any change.
Other note: you may get the following exception if you do not specify the index to use.
InvalidOperationException: StreamQuery does not support querying dynamic indexes. It is designed to be used with large data-sets and is unlikely to return all data-set after 15 sec of indexing, like Query() does.