mongoengine slow serialization of embedded documents with reference fields - pymongo

I have a small DB with about 500 records. I'm trying to implement a versioning scheme where I save the form along with its current version to my Record collection. Ideally, I would like to store the form along with its version number in an embedded document to keep things nice and tidy:
class Structure(db.EmbeddedDocument):
form = db.ReferenceField(Form, required = True)
version = db.IntField(required = True)
#property
def short(self):
return {
'form': self.form,
'version': self.version
}
class Record(db.Document):
structure = db.EmbeddedDocumentField(Structure)
#property
def short(self):
return {
'structure': self.structure.short
}
This way when I recall a record I can grab the form and the version that was used at the time. Running some timing tests:
start = time.clock()
records = Record.objects.select_related()
print ('Time: ', time.clock() - start)
response = [i.short for i in records]
print ('Time: ', time.clock() - start)
I find the query time for all records Record.objects.select_related() to be reasonable at, ~ 1.12 s, however, I'm finding serialization for the purpose of JSON transfer is extremely expensive at ~ 24.1s!
If I make a slight modification by removing use of the EmbeddedDocument:
class Record(db.Document):
form = db.ReferenceField(Form, required = True)
version = db.IntField(required = True)
#property
def short(self):
return {
'form': self.form,
'version': self.version
}
Running the same test I find the query time to be pretty much unchanged at ~ 1.36 s, however, the serialization time improved by 24x to 1.14s. I really do not understand why use of an embedded document would lead to such as massive penalty in serialization time...? Is dereferencing in an embedded object more difficult?

Related

Scalding Unit Test - How to Write A Local File?

I work at a place where scalding writes are augmented with a specific API to track dataset meta data. When converting from normal writes to these special writes, there are some intricacies with respect to Key/Value, TSV/CSV, Thrift ... datasets. I would like to compare the binary file is the same prior to conversion and after conversion to the special API.
Given I cannot provide the specific api for the metadata-inclusive writes, I only ask how can I write a unit test for .write method on a TypedPipe?
implicit val timeZone: TimeZone = DateOps.UTC
implicit val dateParser: DateParser = DateParser.default
implicit def flowDef: FlowDef = new FlowDef()
implicit def mode: Mode = Local(true)
val fileStrPath = root + "/test"
println("writing data to " + fileStrPath)
TypedPipe
.from(Seq[Long](1, 2, 3, 4, 5))
// .map((x: Long) => { println(x.toString); System.out.flush(); x })
.write(TypedTsv[Long](fileStrPath))
.forceToDisk
The above doesn't seem to write anything to local (OSX) disk.
So I wonder if I need to use a MiniDFSCluster something like this:
def setUpTempFolder: String = {
val tempFolder = new TemporaryFolder
tempFolder.create()
tempFolder.getRoot.getAbsolutePath
}
val root: String = setUpTempFolder
println(s"root = $root")
val tempDir = Files.createTempDirectory(setUpTempFolder).toFile
val hdfsCluster: MiniDFSCluster = {
val configuration = new Configuration()
configuration.set(MiniDFSCluster.HDFS_MINIDFS_BASEDIR, tempDir.getAbsolutePath)
configuration.set("io.compression.codecs", classOf[LzopCodec].getName)
new MiniDFSCluster.Builder(configuration)
.manageNameDfsDirs(true)
.manageDataDfsDirs(true)
.format(true)
.build()
}
hdfsCluster.waitClusterUp()
val fs: DistributedFileSystem = hdfsCluster.getFileSystem
val rootPath = new Path(root)
fs.mkdirs(rootPath)
However, my attempts to get this MiniCluster to work haven't panned out either - somehow I need to link the MiniCluster with the Scalding write.
Note: The Scalding JobTest framework for unit testing isn't going to work due actual data written is sometimes wrapped in bijection codec or setup with case class wrappers prior to the writes made by the metadata-inclusive writes APIs.
Any ideas how I can write a local file (without using the Scalding REPL) with either Scalding alone or a MiniCluster? (If using the later, I need a hint how to read the file.)
Answering ... There is an example of how to use a mini cluster for exactly reading and writing to HDFS. I will be able to cross read with my different writes and examine them. Here it is in the tests for scalding's TypedParquet type
HadoopPlatformJobTest is an extension for JobTest that uses a MiniCluster.
With some hand-waiving on detail in the link, the bulk of the code is this:
"TypedParquetTuple" should {
"read and write correctly" in {
import com.twitter.scalding.parquet.tuple.TestValues._
def toMap[T](i: Iterable[T]): Map[T, Int] = i.groupBy(identity).mapValues(_.size)
HadoopPlatformJobTest(new WriteToTypedParquetTupleJob(_), cluster)
.arg("output", "output1")
.sink[SampleClassB](TypedParquet[SampleClassB](Seq("output1"))) {
toMap(_) shouldBe toMap(values)
}
.run()
HadoopPlatformJobTest(new ReadWithFilterPredicateJob(_), cluster)
.arg("input", "output1")
.arg("output", "output2")
.sink[Boolean]("output2")(toMap(_) shouldBe toMap(values.filter(_.string == "B1").map(_.a.bool)))
.run()
}
}

insert_many in pymongo not persisting

I'm having some issues with persisting documents with pymongo when using insert_many.
I'm handing over a list of dicts to insert_many and it works fine from inside the same script that does the inserting. Less so once the script has finished.
def row_to_doc(row):
rowdict = row.to_dict()
for key in rowdict:
val = rowdict[key]
if type(val) == float or type(val) == np.float64:
if np.isnan(val):
# If we want a SQL style document collection
rowdict[key] = None
# If we want a NoSQL style document collection
# del rowdict[key]
return rowdict
def dataframe_to_collection(df):
n = len(df)
doc_list = []
for k in range(n):
doc_list.append(row_to_doc(df.iloc[k]))
return doc_list
def get_mongodb_client(host="localhost", port=27017):
return MongoClient(host, port)
def create_collection(client):
db = client["material"]
return db["master-data"]
def add_docs_to_mongo(collection, doc_list):
collection.insert_many(doc_list)
def main():
client = get_mongodb_client()
csv_fname = "some_csv_fname.csv"
df = get_clean_csv(csv_fname)
doc_list = dataframe_to_collection(df)
collection = create_collection(client)
add_docs_to_mongo(collection, doc_list)
test_doc = collection.find_one({"MATERIAL": "000000000000000001"})
When I open up another python REPL and start looking through the client.material.master_data collection with collection.find_one({"MATERIAL": "000000000000000001"}) or collection.count_documents({}) I get None for the find_one and 0 for the count_documents.
Is there a step where I need to call some method to persist the data to disk? db.collection.save() in the mongo client API sounds like what I need but it's just another way of inserting documents from what I have read. Any help would be greatly appreciated.
The problem was that I was getting my collection via client.db_name.collection_name and it wasn't getting the same collection I was creating with my code. client.db_name["collection-name"] solved my issue. Weird.

output for append job in BigQuery using Luigi Orchestrator

I have a Bigquery task which only aims to append a daily temp table (Table-xxxx-xx-xx) to an existing table (PersistingTable).
I am not sure how to handle the output(self) method. Indeed, I can not just output PersistingTable as a luigi.contrib.bigquery.BigQueryTarget, since it already exists before the process started. Has anyone asked himself such a question?
I could not find an answer anywhere else so I will give my solution even though this is a very old question.
I created a new class that inherits from luigi.contrib.bigquery.BigQueryLoadTask
class BigQueryLoadIncremental(luigi.contrib.bigquery.BigQueryLoadTask):
'''
a subclass that checks whether a write-log on gcs exists to append data to the table
needs to define Two Outputs! [0] of type BigQueryTarget and [1] of type GCSTarget
Everything else is left unchanged
'''
def exists(self):
return luigi.contrib.gcs.GCSClient.exists(self.output()[1].path)
#property
def write_disposition(self):
"""
Set to WRITE_APPEND as this subclass only makes sense for this
"""
return luigi.contrib.bigquery.WriteDisposition.WRITE_APPEND
def run(self):
output = self.output()[0]
gcs_output = self.output()[1]
assert isinstance(output,
luigi.contrib.bigquery.BigQueryTarget), 'Output[0] must be a BigQueryTarget, not %s' % (
output)
assert isinstance(gcs_output,
luigi.contrib.gcs.GCSTarget), 'Output[1] must be a Cloud Storage Target, not %s' % (
gcs_output)
bq_client = output.client
source_uris = self.source_uris()
assert all(x.startswith('gs://') for x in source_uris)
job = {
'projectId': output.table.project_id,
'configuration': {
'load': {
'destinationTable': {
'projectId': output.table.project_id,
'datasetId': output.table.dataset_id,
'tableId': output.table.table_id,
},
'encoding': self.encoding,
'sourceFormat': self.source_format,
'writeDisposition': self.write_disposition,
'sourceUris': source_uris,
'maxBadRecords': self.max_bad_records,
'ignoreUnknownValues': self.ignore_unknown_values
}
}
}
if self.source_format == luigi.contrib.bigquery.SourceFormat.CSV:
job['configuration']['load']['fieldDelimiter'] = self.field_delimiter
job['configuration']['load']['skipLeadingRows'] = self.skip_leading_rows
job['configuration']['load']['allowJaggedRows'] = self.allow_jagged_rows
job['configuration']['load']['allowQuotedNewlines'] = self.allow_quoted_new_lines
if self.schema:
job['configuration']['load']['schema'] = {'fields': self.schema}
# test write to and removal of GCS pseudo output in order to make sure this does not fail.
gcs_output.fs.put_string(
'test write for task {} (this file should have been removed immediately)'.format(self.task_id),
gcs_output.path)
gcs_output.fs.remove(gcs_output.path)
bq_client.run_job(output.table.project_id, job, dataset=output.table.dataset)
gcs_output.fs.put_string(
'success! The following BigQuery Job went through without errors: {}'.format(self.task_id), gcs_output.path)
it uses a second output (which might violate luigis atomicity principle) on google cloud storage. Example usage:
class LeadsToBigQuery(BigQueryLoadIncremental):
date = luigi.DateParameter(default=datetime.date.today())
def output(self):
return luigi.contrib.bigquery.BigQueryTarget(project_id=...,
dataset_id=...,
table_id=...), \
create_gcs_target(...)

Advice needed on setting up an (Objective C?) Mac-based web service

I have developed numerous iOS apps over the years so know Objective C reasonably well.
I'd like to build my first web service to offload some of the most processor intensive functions.
I'm leaning towards using my Mac as the server, which comes with Apache. I have configured this and it appears to be working as it should (I can type the Mac's IP address and receive a confirmation).
Now I'm trying to decide on how to build the server-side web service, which is totally new to me. I'd like to leverage my Objective C knowledge if possible. I think I'm looking for an Objective C-compatible web service engine and some examples how to connect it to browsers and mobile interfaces. I was leaning towards using Amazon's SimpleDB as the database.
BTW: I see Apple have Lion Server, but I cannot work out if this is an option.
Any thoughts/recommendations are appreciated.?
There are examples of simple web servers out there written in ObjC such as this and this.
That said, there are probably "better" ways of doing this if you don't mind using other technologies. This is a matter of preference; but I've use Python, MySQL, and the excellent web.py framework for these sorts of backends.
For example, here's an example web service (some redundancies omitted...) using the combination of technologies described. I just run this on my server, and it takes care of url redirection and serves JSON from the db.
import web
import json
import MySQLdb
urls = (
"/equip/gruppo", "gruppo", # GET = get all gruppos, # POST = save gruppo
"/equip/frame", "frame"
)
class StatusCode:
(Success,SuccessNoRows,FailConnect,FailQuery,FailMissingParam,FailOther) = range(6);
# top-level class that handles db interaction
class APIObject:
def __init__(self):
self.object_dict = {} # top-level dictionary to be turned into JSON
self.rows = []
self.cursor = ""
self.conn = ""
def dbConnect(self):
try:
self.conn = MySQLdb.connect( host = 'localhost', user = 'my_api_user', passwd = 'api_user_pw', db = 'my_db')
self.cursor = self.conn.cursor(MySQLdb.cursors.DictCursor)
except:
self.object_dict['api_status'] = StatusCode.FailConnect
return False
else:
return True
def queryExecute(self,query):
try:
self.cursor.execute(query)
self.rows = self.cursor.fetchall()
except:
self.object_dict['api_status'] = StatusCode.FailQuery
return False
else:
return True
class gruppo(APIObject):
def GET(self):
web.header('Content-Type', 'application/json')
if self.dbConnect() == False:
return json.dumps(self.object_dict,sort_keys=True, indent=4)
else:
if self.queryExecute("SELECT * FROM gruppos") == False:
return json.dumps(self.object_dict,sort_keys=True, indent=4)
else:
self.object_dict['api_status'] = StatusCode.SuccessNoRows if self.rows.count == 0 else StatusCode.Success
data_list = []
for row in self.rows:
# create a dictionary with the required elements
d = {}
d['id'] = row['id']
d['maker'] = row['maker_name']
d['type'] = row['type_name']
# append to the object list
data_list.append(d)
self.object_dict['data'] = data_list
# return to the client
return json.dumps(self.object_dict,sort_keys=True, indent=4)

which OO programming style in R will result readable to a Python programmer?

I'm author of the logging package on CRAN, I don't see myself as an R programmer, so I tried to make it as code-compatible with the Python standard logging package as I could, but now I have a question. and I hope it will give me the chance to learn some more R!
it's about hierarchical loggers. in Python I would create a logger and send it logging records:
l = logging.getLogger("some.lower.name")
l.debug("test")
l.info("some")
l.warn("say no")
In my R package instead you do not create a logger to which you send messages, you invoke a function where one of the arguments is the name of the logger. something like
logdebug("test", logger="some.lower.name")
loginfo("some", logger="some.lower.name")
logwarn("say no", logger="some.lower.name")
the problem is that you have to repeat the name of the logger each time you want to send it a logging message. I was thinking, I might create a partially applied function object and invoke that instead, something like
logdebug <- curry(logging::logdebug, logger="some.lower.logger")
but then I need doing so for all debugging functions...
how would you R users approach this?
Sounds like a job for a reference class ?setRefClass, ?ReferenceClasses
Logger <- setRefClass("Logger",
fields=list(name = "character"),
methods=list(
log = function(level, ...)
{ levellog(level, ..., logger=name) },
debug = function(...) { log("DEBUG", ...) },
info = function(...) { log("INFO", ...) },
warn = function(...) { log("WARN", ...) },
error = function(...) { log("ERROR", ...) }
))
and then
> basicConfig()
> l <- Logger$new(name="hierarchic.logger.name")
> l$debug("oops")
> l$info("oops")
2011-02-11 11:54:05 NumericLevel(INFO):hierarchic.logger.name:oops
> l$warn("oops")
2011-02-11 11:54:11 NumericLevel(WARN):hierarchic.logger.name:oops
>
This could be done with the proto package. This supports older versions of R (its been around for years) so you would not have a problem of old vs. new versions of R.
library(proto)
library(logging)
Logger. <- proto(
new = function(this, name)
this$proto(name = name),
log = function(this, ...)
levellog(..., logger = this$name),
setLevel = function(this, newLevel)
logging::setLevel(newLevel, container = this$name),
addHandler = function(this, ...)
logging::addHandler(this, ..., logger = this$name),
warn = function(this, ...)
this$log(loglevels["WARN"], ...),
error = function(this, ...)
this$log(loglevels["ERROR"], ...)
)
basicConfig()
l <- Logger.$new(name = "hierarchic.logger.name")
l$warn("this may be bad")
l$error("this definitely is bad")
This gives the output:
> basicConfig()
> l <- Logger.$new(name = "hierarchic.logger.name")
> l$warn("this may be bad")
2011-02-28 10:17:54 WARNING:hierarchic.logger.name:this may be bad
> l$error("this definitely is bad")
2011-02-28 10:17:54 ERROR:hierarchic.logger.name:this definitely is bad
In the above we have merely layered proto on top of logging but it would be possible to turn each logging object into a proto object, i.e. it would be both, since both logging objects and proto objects are R environments. That would get rid of the extra layer.
See the http://r-proto.googlecode.com for more info.
Why would you repeat the name? It would be more convenient to pass the log-object directly to the function, ie
logdebug("test",logger=l)
# or
logdebug("test",l)
A bit the way one would use connections in a number of functions. That seems more the R way of doing it I guess.