MongoDB Update / Upsert Question - Schema Related - schema

I have an problem representing data in MongoDB. I was using this schema design, where a combination of date and word is unique.
{'date':2-1-2011,
'word':word1'
users = [user1, user2, user3, user4]}
{'date':1-1-2011,
'word':word2'
users = [user1, user2]}
There are a fixed number of dates, approximately 200; potentially 100k+ words for each date; and 100k+ users.
I inserted records with an algorithm like so:
while records exist:
message, user, date = pop a record off a list
words = set(tokenise(message))
for word in words:
collection1.insert({'date':date, 'word':word}, {'user':user})
collection2.insert('something similar')
collection3.insert('something similar again')
collection4.insert('something similar again')
However, this schema resulted in extremely large collections and terrible performance was terrible. I am inserting different information into each of the four collections, so it is an extremely large number of operations on the database.
I'm considering representing the data in a format like so, where the words and users arrays are sets.
{'date':'26-6-2011',
'words': [
'word1': ['user1', 'user2'],
'word2': ['user1']
'word1': ['user1', 'user2', 'user3']]}
The idea behind this was to cut down on the number of database operations. So that for each loop of the algorithm, I perform just one update for each collection. However, I am unsure how to perform an update / upsert on this because with each loop of the algorithm, I may need to insert a new word, user, or both.
Could anyone recommend either a way to update this document, or could anyone suggest an alternative schema?
Thanks

Upsert is well suited for dynamically extending documents. Unfortunately I only found it working properly if you have an atomic modifier operation in your update object. like the $addToSet here (mongo shell code):
db.words is empty. add first document for a given date with an upsert.
var query = { 'date' : 'date1' }
var update = { $addToSet: { 'words.word1' : 'user1' } }
db.words.update(query,update,true,false)
check object.
db.words.find();
{ "_id" : ObjectId("4e3bd4eccf7604a2180c4905"), "date" : "date1", "words" : { "word1" : [ "user1" ] } }
now add some more users to first word and another word in one update.
var update = { $addToSet: { 'words.word1' : { $each : ['user2', 'user4', 'user5'] }, 'words.word2': 'user3' } }
db.words.update(query,update,true,false)
again, check object.
db.words.find()
{ "_id" : ObjectId("4e3bd7e9cf7604a2180c4907"), "date" : "date1", "words" : { "word1" : [ "user1", "user2", "user4", "user5" ], "word2" : [ "user3" ] } }

I'm using MongoDB to insert 105mil records with ~10 attributes each. Instead of updating this dataset with changes, I just delete and re insert everything. I found this method to be faster than individually touching each row to see if it was one that I needed to update. You will have better insert speeds if you create JSON formatted text files and use MongoDB's mongoimport tool.
format your data into JSON txt files (one file per collection)
mongoimport each file and specify the collection you want it inserted into

Related

Is it correct to do 1-to-1 mapping in Update API request param

There is a need for me to do bulk update of user details.
Let the object details have the following fields,
User First Name
User ID
User Last Name
User Email ID
User Country
An admin can upload the updated data of the users through a csv file. Values with mismatching data needs to be updated. The most probable request format for this bulk update request will be like:(Method 1)
"data" : {
"userArray" : [
{
"id" : 2343565432,
"f_name" : "David",
"email" : "david#testmail.com"
},
{
"id" : 2344354351,
"country" : "United States",
}
.
.
.
]
}
Method 2 : I would send the details in two arrays, one containing the list of similar filed values with respect to their user ids
"data" : {
"userArray" : [
{
"ids" : [23234323432, 4543543543, 45654543543],
"country" : ["United States", "Israel", "Mexico"]
},
{
"ids" : [2323432334543, 567676565],
"email" : ["groove#drivein.com", "zara#foobar.com"]
},
.
.
.
]
}
In method 1, i need to query the database for every user update, which will be more as the no of user edited is more. In contrast, if i use method 2, i query the database only once for each param(i add the array in the query and get those rows whose user id is present in the given array in a single query). And then i can update the each row with their respective details.
But overall in the internet, most of the update api had params in the format specified in method 1 which gives user good readability. But i need to know what will be advantage if i go with method 1 rather than method 2? (I save some query time in method 2 if the no of users count is large which can improve my performance)
I almost always see it being method 1 style.
Woth that said, I don't understand why your DB performance is based on the way the input data is structured. That's just the way information gets into your code.
You can have the client send the data as method 1 and then shim it to method 2 on the backend if that helps you structure the DB queries better

FaunaDB: Query for all documents not referenced by another collection

I'm working on an app where users learn about different patterns of grammar in a language. There are three collections; users and patterns are interrelated by progress, which looks like this:
Create(Collection("progress"), {
data: {
userRef: Ref(Collection("users"), userId),
patternRef: Ref(Collection("patterns"), patternId),
initiallyLearnedAt: Now(),
lastReviewedAt: Now(),
srsLevel: 1
}
})
I've learned how to do some basic Fauna queries, but now I have a somewhat more complex relational one. I want to write an FQL query (and the required indexes) to retrieve all patterns for which a given user doesn't have progress. That is, everything they haven't learned yet. How would I compose such a query?
One clarifying assumption - a progress document is created when a user starts on a particular pattern and means the user has some progress. For example, if there are ten patterns and a user has started two, there will be two documents for that user in progress.
If that assumption is valid, your question is "how can we find the other eight?"
The basic approach is:
Get all available patterns.
Get the patterns a user has worked on.
Select the difference between the two sets.
1. Get all available patterns.
This one is trivial with the built-in Documents function in FQL:
Documents(Collection("patterns"))
2. Get the patterns a user has worked on.
To get all the patterns a user has worked on, you'll want to create an index over the progress collection, as you've figured out. Your terms are what you want to search on, in this case userRef. Your values are the results you want back, in this case patternRef.
This looks like the following:
CreateIndex({
name: "patterns_by_user",
source: Collection("progress"),
terms: [
{ field: ["data", "userRef"] }
],
values: [
{ field: ["data", "patternRef"] }
],
unique: true
})
Then, to get the set of all the patterns a user has some progress against:
Match(
"patterns_by_user",
Ref(Collections("users"), userId)
)
3. Select the difference between the two sets
The FQL function Difference has the following signature:
Difference( source, diff, ... )
This means you'll want the largest set first, in this case all of the documents from the patterns collection.
If you reverse the arguments you'll get an empty set, because there are no documents in the set of patterns the user has worked on that are not also in the set of all patterns.
From the docs, the return value of Difference is:
When source is a Set Reference, a Set Reference of the items in source that are missing from diff.
This means you'll need to Paginate over the difference to get the references themselves.
Paginate(
Difference(
Documents(Collection("patterns")),
Match(
"patterns_by_user",
Ref(Collection("users"), userId)
)
)
)
From there, you can do what you need to do with the references. As an example, to retrieve all of the data for each returned pattern:
Map(
Paginate(
Difference(
Documents(Collection("patterns")),
Match(
"patterns_by_user",
Ref(Collection("users"), userId)
)
)
),
Lambda("patternRef", Get(Var("patternRef")))
)
Consolidated solution
Create the index patterns_by_user as in step two
Query the difference as in step three

How to achieve generic Audit.NET json data processing?

I am using Audit.Net library to log EntityFramework actions into a database (currently everything into one AuditEventLogs table, where the JsonData column stores the data in the following Json format:
{
"EventType":"MyDbContext:test_database",
"StartDate":"2021-06-24T12:11:59.4578873Z",
"EndDate":"2021-06-24T12:11:59.4862278Z",
"Duration":28,
"EntityFrameworkEvent":{
"Database":"test_database",
"Entries":[
{
"Table":"Offices",
"Name":"Office",
"Action":"Update",
"PrimaryKey":{
"Id":"40b5egc7-46ca-429b-86cb-3b0781d360c8"
},
"Changes":[
{
"ColumnName":"Address",
"OriginalValue":"test_address",
"NewValue":"test_address"
},
{
"ColumnName":"Contact",
"OriginalValue":"test_contact",
"NewValue":"test_contact"
},
{
"ColumnName":"Email",
"OriginalValue":"test_email",
"NewValue":"test_email2"
},
{
"ColumnName":"Name",
"OriginalValue":"test_name",
"NewValue":"test_name"
},
{
"ColumnName":"OfficeSector",
"OriginalValue":1,
"NewValue":1
},
{
"ColumnName":"PhoneNumber",
"OriginalValue":"test_phoneNumber",
"NewValue":"test_phoneNumber"
}
],
"ColumnValues":{
"Id":"40b5egc7-46ca-429b-86cb-3b0781d360c8",
"Address":"test_address",
"Contact":"test_contact",
"Email":"test_email2",
"Name":"test_name",
"OfficeSector":1,
"PhoneNumber":"test_phoneNumber"
},
"Valid":true
}
],
"Result":1,
"Success":true
}
}
Me and my team has a main aspect to achieve:
Being able to create a search page where administrators are able to tell
who changed
what did they change
when did the change happen
They can give a time period, to reduce the number of audit records, and the interesting part comes here:
There should be an input text field which should let them search in the values of the "ColumnValues" section.
The problems I encountered:
Even if I map the Json structure into relational rows, I am unable to search in every column, with keeping the genericity.
If I don't map, I could search in the Json string with LIKE mssql function but on the order of a few 100,000 records it takes an eternity for the query to finish so it is probably not the way.
Keeping the genericity would be important, so we don't need to modify the audit search page every time when we create or modify a new entity.
I only know MSSQL, but is it possible that storing the audit logs in a document oriented database like cosmosDB (or anything else, it was just an example) would solve my problem? Or can I reach the desired behaviour using relational database like MSSQL?
Looks like you're asking for an opinion, in that case I would strongly recommend a document oriented DB.
CosmosDB could be a great option since it supports SQL queries.
There is an extension to log to CosmosDB from Audit.NET: Audit.AzureCosmos
A sample query:
SELECT c.EventType, e.Table, e.Action, ch.ColumnName, ch.OriginalValue, ch.NewValue
FROM c
JOIN e IN c.EntityFrameworkEvent.Entries
JOIN ch IN e.Changes
WHERE ch.ColumnName = "Address" AND ch.OriginalValue = "test_address"
Here is a nice post with lot of examples of complex SQL queries on CosmosDB

How to implement a backend for GraphQL connections?

In GraphQL the recommended way for pagination is to use connections as described here. I understand the reasons and advantages of this usage but I need an advice how to implement it.
The server side of the application works on top of a SQL database (Postgres in my case). Some of the GraphQL connection fields have optional argument to specify sorting. Now with knowing the sorting columns and a cursor from the GraphQL query, how can I build an SQL query? Of course it should be efficient - if there is a SQL index index for the combination of sorting columns it should be used.
The problem is that SQL doesn't know anything like GraphQL cursors - we can't tell it to select all rows after certain row. There is just WHERE, OFFSET and LIMIT. From my point of view it seems I need to firstly select a single row based on the cursor and then build a second SQL query using the values of the sorting columns in that row to specify a complicated WHERE clause - not sure if the database would use index in that case.
What bothers me is that I could not find any article on this topic. Does it mean that SQL database is not usually used when implementing a GraphQL server? What database should be used then? How are GraphQL queries to connection fields usually transformed to queries for the underlying database?
EDIT: This is more or less what I came up with myself. The problem is how to extend it to support sorting as well and how to implement it efficiently using database indexes.
The trick here is that, as the server implementer, the cursor can be literally any value you want encoded as a string. Most examples I've seen have been base64-encoded for a bit of opacity, but it doesn't have to be. (Try base64-decoding the cursors from the Star Wars examples in your link, for example.)
Let's say your GraphQL schema looks like
enum ThingColumn { FOO BAR }
input ThingFilter {
foo: Int
bar: Int
}
type Query {
things(
filter: ThingFilter,
sort: ThingColumn,
first: Int,
after: String
): ThingConnection
}
Your first query might be
query {
things(filter: { foo: 1 }, sort: BAR, first: 2) {
edges {
node { bar }
}
pageInfo {
endCursor
hasNextPage
}
}
}
This on its own could fairly directly translate into an SQL query like
SELECT bar FROM things WHERE foo=1 ORDER BY bar ASC LIMIT 2;
Now as you iterate through each item you can just use a string version of its offset as its cursor; that's totally allowed by the spec.
{
"data": {
"things": {
"edges": [
{ "node": { "bar": 17 } },
{ "node": { "bar": 42 } }
],
"pageInfo": {
"endCursor": "2",
"hasNextPage": true
}
}
}
}
Then when the next query says after: "2", you can turn that back into an SQL OFFSET and repeat the query.
If you're trying to build a generic GraphQL interface that gets translated to reasonably generic SQL queries, it's impossible to create indexes such that every query is "fast". Like other cases, you need to figure out what your common and/or slow queries are and CREATE INDEX as needed. You might be able to limit the options in your schema to things you know you can index:
type Other {
things(first: Int, after: String): ThingConnection
}
query OtherThings($id: ID!, $cursor: String) {
node(id: $id) {
... on Other {
things(first: 100, after: $cursor) { ... FromAbove }
}
}
}
SELECT * FROM things WHERE other_id=? ORDER BY id LIMIT ?;
CREATE INDEX things_other ON things(other_id);

What are the resources or tools used to manage temporal data in key-value stores?

I'm considering using MongoDB or CouchDB on a project that needs to maintain historical records. But I'm not sure how difficult it will be to store historical data in these databases.
For example, in his book "Developing Time-Oriented Database Applications in SQL," Richard Snodgrass points out tools for retrieving the state of data as of a particular instant, and he points out how to create schemas that allow for robust data manipulation (i.e. data manipulation that makes invalid data entry difficult).
Are there tools or libraries out there that make it easier to query, manipulate, or define temporal/historical structures for key-value stores?
edit:
Note that from what I hear, the 'version' data that CouchDB stores is erased during normal use, and since I would need to maintain historical data, I don't think that's a viable solution.
P.S. Here's a similar question that was never answered: key-value-store-for-time-series-data
There are a couple options if you wanted to store the data in MongoDB. You could just store each version as a separate document, as then you can query to get the object at a certain time, the object at all times, objects over ranges of time, etc. Each document would look something like:
{
object : whatever,
date : new Date()
}
You could store all the versions of a document in the document itself, as mikeal suggested, using updates to push the object itself into a history array. In Mongo, this would look like:
db.foo.update({object: obj._id}, {$push : {history : {date : new Date(), object : obj}}})
// make changes to obj
...
db.foo.update({object: obj._id}, {$push : {history : {date : new Date(), object : obj}}})
A cooler (I think) and more space-efficient way, although less time-efficient, might be to store a history in the object itself about what changed in the object at each time. Then you could replay the history to build the object at a certain time. For instance, you could have:
{
object : startingObj,
history : [
{ date : d1, addField : { x : 3 } },
{ date : d2, changeField : { z : 7 } },
{ date : d3, removeField : "x" },
...
]
}
Then, if you wanted to see what the object looked like between time d2 and d3, you could take the startingObj, add the field x with the value 3, set the field z to the value of 7, and that would be the object at that time.
Whenever the object changed, you could atomically push actions to the history array:
db.foo.update({object : startingObj}, {$push : {history : {date : new Date(), removeField : "x"}}})
Yes, in CouchDB the revisions of a document are there for replication and are usually lost during compaction. I think UbuntuOne did something to keep them around longer but I'm not sure exactly what they did.
I have a document that I need the historical data on and this is what I do.
In CouchDB I have an _update function. The document has a "history" attribute which is an array. Each time I call the _update function to update the document I append to the history array the current document (minus the history attribute) then I update the document with the changes in the request body. This way I have the entire revision history of the document.
This is a little heavy for large documents, there are some javascript diff tools I was investigating and thinking about only storing the diff between the documents but haven't done it yet.
http://wiki.apache.org/couchdb/How_to_intercept_document_updates_and_perform_additional_server-side_processing
Hope that helps.
I can't speak for mongodb but for couchdb it all really hinges on how you write your views.
I don't know the specifics of what you need but if you have a unique id for a document throughout its lifetime and store a timestamp in that document then you have everything you need for robust querying of that document.
For instance:
document structure:
{ "docid" : "doc1", "ts" : <unix epoch> ...<set of key value pairs> }
map function:
function (doc) {
if (doc.docid && doc.ts)
emit([doc.docid, doc.ts], doc);
}
}
The view will now output each doc and its revisions in historical order like so:
["doc1", 1234567], ["doc1", 1234568], ["doc2", 1234567], ["doc2", 1234568]
You can use view collation and start_key or end_key to restrict the returned documents.
start_key=["doc1", 1] end_key=["doc1", 9999999999999]
will return all historical copies of doc1
start_key=["doc2", 1234567] end_key=["doc2", 123456715]
will return all historical copies of doc2 between 1234567 and 123456715 unix epoch times.
see ViewCollation for more details