django objects...values() select only some fields - sql

I'm optimizing the memory load (~2GB, offline accounting and analysis routine) of this line:
l2 = Photograph.objects.filter(**(movie.get_selectors())).values()
Is there a way to convince django to skip certain columns when fetching values()?
Specifically, the routine obtains all rows of the table matching certain criteria (db is optimized and performs it very quickly), but it is a bit too much for python to handle - there is a long string referenced in each row, storing the urls for thumbnails.
I only really need three fields from each row, but, if all the fields are included, it suddenly consumes about 5kB/row which sadly pushes the RAM to the limit.

The values(*fields) function allows you to specify which fields you want.

Check out the QuerySet method, only. When you declare that you only want certain fields to be loaded immediately, the QuerySet manager will not pull in the other fields in your object, till you try to access them.
If you have to deal with ForeignKeys, that must also be pre-fetched, then also check out select_related
The two links above to the Django documentation have good examples, that should clarify their use.
Take a look at Django Debug Toolbar it comes with a debugsqlshell management command that allows you to see the SQL queries being generated, along with the time taken, as you play around with your models on a django/python shell.

Related

Find out the amount of space each field takes in Google Big Query

I want to optimize the space of my Big Query and google storage tables. Is there a way to find out easily the cumulative space that each field in a table gets? This is not straightforward in my case, since I have a complicated hierarchy with many repeated records.
You can do this in Web UI by simply typing (and not running) below query changing to field of your interest
SELECT <column_name>
FROM YourTable
and looking into Validation Message that consists of respective size
Important - you do not need to run it – just check validation message for bytesProcessed and this will be a size of respective column
Validation is free and invokes so called dry-run
If you need to do such “columns profiling” for many tables or for table with many columns - you can code this with your preferred language using Tables.get API to get table schema ; then loop thru all fields and build respective SELECT statement and finally Dry Run it (within the loop for each column) and get totalBytesProcessed which as you already know is the size of respective column
I don't think this is exposed in any of the meta data.
However, you may be able to easily get good approximations based on your needs. The number of rows is provided, so for some of the data types, you can directly calculate the size:
https://cloud.google.com/bigquery/pricing
For types such as string, you could get the average length by querying e.g. the first 1000 fields, and use this for your storage calculations.

Large results set from Oracle SELECT

I have a simple, pasted below, statement called against an Oracle database. This result set contains names of businesses but it has 24,000 results and these are displayed in a drop down list.
I am looking for ideas on ways to reduce the result set to speed up the data returned to the user interface, maybe something like Google's search or a completely different idea. I am open to whatever thoughts and any direction is welcome.
SELECT BusinessName FROM MyTable ORDER BY BusinessName;
Idea:
SELECT BusinessName FROM MyTable WHERE BusinessName LIKE "A%;
I'm know all about how LIKE clauses are not wise to use but like I said this is a LARGE result set. Maybe something along the lines of a BINARY search?
The last query can perform horribly. String comparisons inside the database can be very slow, and depending on the number of "hits" it can be a huge drag on performance. If that doesn't concern you that's fine. This is especially true if the Company data isn't normalized into it's own db table.
As long as the user knows the company he's looking up, then I would identify an existing JavaScript component in some popular JavaScript library that provides a search text field with a dynamic dropdown that shows matching results would be an effective mechanism. But you might want to use '%A%', if they might look for part of a name. For example, If I'm looking for IBM Rational, LLC. do I want it to show up in results when I search for "Rational"?
Either way, watch your performance and if it makes sense cache that data in the company look up service that sits on the server in front of the DB. Also, make sure you don't respond to every keystroke, but have a timeout 500ms or so, to allow the user to type in multiple chars before going to the server and searching. Also, I would NOT recommend bringing all of the company names to the client. We're always looking to reduce the size and frequency of traversals to the server from the browser page. Waiting for 24k company names to come down to the client when the form loads (or even behind the scenes) when shorter quicker very specific queries will perform sufficiently well seems more efficient to me. Again, test it and identify the performance characteristics that fit your use case best.
These are techniques I've used on projects with large data, like searching for a user from a base of 100,000+ users. Our code was a custom Dojo widget (dijit), I 'm not seeing how to do it directly with the dijit code, but jQuery UI provides the autocomplete widget.
Also use limit on this query with a text field so that the drop down only provides a subset of all the matches, forcing the user to further refine the query.
SELECT BusinessName FROM MyTable ORDER BY BusinessName LIMIT 10

Is there a way for VBA UDF to "know" what other functions will be run?

Assume I have a UDF that will be used in a worksheet 100,000+ times. Is there a way, within the function, for it to know how many more times it is going to be called in the batch? Basically what I want to do is have every function create a to-do list of work to do. I want to do something like:
IF remaining functions to be executed after this one = 0 then ...
Is there a way to do this?
Background:
I want to make a UDF that will perform SQL queries with the user just giving parameters(date, hour, node, type). This is pretty easy to make if you're willing to actually execute the SQL query every time the function is run. I know its easy because I did this and it was ridiculously slow. My new idea is to have the function first see if the data it is looking for exists in a global cache variable and if it isn't to add it to a global variable "job-list".
What I want it to do is when the last function is called to then go through the job list and perform the fewest number of SQL queries and fill the global cache variable. Once the cache variable is full it would do a table refresh to make all the other functions get called again since on the subsequent call they'll find the data they need in the cache.
Firstly:
VBA UDF performance is extremely sensitive to the way the UDF is coded:
see my series of posts about writing efficient VBA UDFs:
http://fastexcel.wordpress.com/2011/06/13/writing-efficient-vba-udfs-part-3-avoiding-the-vbe-refresh-bug/
http://fastexcel.wordpress.com/2011/05/25/writing-efficient-vba-udfs-part-1/
You should also consider using an Array UDF to return multiple results:
http://fastexcel.wordpress.com/2011/06/20/writing-efiicient-vba-udfs-part5-udf-array-formulas-go-faster/
Secondly:
The 12th post in this series outlines using the AfterCalculate event and a cache
http://fastexcel.wordpress.com/2012/12/05/writing-efficient-udfs-part-12-getting-used-range-fast-using-application-events-and-a-cache/
Basically the approach you would need is for the UDF to check the cache & if not current or available then add a request to the queue. Then use the after-calculation event to process the queue and if neccessary trigger another recalc.
Performing 100,000 SQL queries from an Excel spreadsheet seems like a poor design. Creating a cache'ing mechanism on top of these seems to compound the problem, making it more complicated than it probably needs to be. There are some circumstances where this might be appropriate, but I would consider other design approaches instead.
The most obvious is to take the data from the Excel spreadsheet and load it into a table in the database. Then use the database to do the processing on all the rows as once. The final step is to read the result back into Excel.
I find that the best way to get large numbers of rows from Excel into a database is to save the Excel file as csv and bulk insert them.
This approach may not work for your problem. In general, though, set-based approaches running in the database are going to perform much better.
As for the cach'ing mechanism, if you have to go down that route. I can imagine a function that has the following pseudo-code:
Check if input values are in cache.
If so, read values from cache.
Else do complex processing.
Load values in cache.
This logic could go in the function. As #Bulat suggests, though, it is probably better to add an additional caching layer around the function.

DATA_BUFFER_EXCEEDED error when calling RFC_READ_TABLE?

My java/groovy program receives table names and table fields from the user input, it queries the tables in SAP and returns its contents.
The user input may concern the tables CDPOS and CDHDR. After reading the SAP documentations and googling, I found these are tables storing change document logs. But I did not find any remote call functions that can be used in java to perform this kind of queries.
Then I used the deprecated RFC Function Module RFC_READ_TABLE and tried to build up customized queries only depending on this RFC. However, I found if the number of desired fields I passed to this RFC are more than 2, I always got the DATA_BUFFER_EXCEEDED error even if I limit the max rows.
I am not authorized to be an ABAP developer in the SAP system and can not add any FM to existing systems, so I can only write code to accomplish this requirement in JAVA.
Am I doing something wrong? Could you give me some hints on that issue?
DATA_BUFFER_EXCEEDED only happens if the total width of the fields you want to read exceeds the width of the DATA parameter, which may vary depending on the SAP release - 512 characters for current systems. It has nothing to do with the number of rows, but the size of a single dataset.
So the question is: What are the contents of the FIELDS parameter? If it's empty, this means "read all fields." CDHDR is 192 characters in width, so I'd assume that the problem is CDPOS which is 774 characters wide. The main issue would be the fields VALUE_OLD and VALUE_NEW, both 245 Characters.
Even if you don't get developer access, you should prod someone to get read-only dictionary access to be able to examine the structures in detail.
Shameless plug: RCER contains a wrapper class for RFC_READ_TABLE that takes care of field handling and ensures that the total width of the selected fields is below the limit imposed by the function module.
Also be aware that these tables can be HUGE in production environments - think billions of entries. You can easily bring your database to a grinding halt by performing excessive read operations on these tables.
PS: RFC_READ_TABLE is not released for customer use as per SAP note 382318, and the note 758278 recommends to create your own function module and provides a template with an improved logic.
Use BBP_RFC_READ_TABLE instead
There is a way around the DATA_BUFFER_EXCEED error. Although this function is not released for customer use as per SAP OSS note 382318, you can get around this issue with changes to the way you pass parameters to this function. Its not a single field that is causing your error, but if the row of data exceeds 512 bytes this error will be raised. CDPOS will have this issue for sure!
The work around if you know how to call the function using Jco and pass table parameters is to specify the exact fields you want returned. You then can keep your returned results under the 512 byte limit.
Using your example of table CDPOS, specify something like this and you should be good to go...(be careful, CDPOS can get massive! You should specify and pass a where clause!)
FIELDS = 'OBJECTCLAS'....
FIELDS = 'OBJECTID'
In Java it can be expressed as..
listParams.setValue(this.getpObjectclas(), "OBJECTCLAS");
By limiting the fields you are returning you can avoid this error.

Database vs. Front-End for Output Formatting

I've read that (all things equal) PHP is typically faster than MySQL at arithmetic and string manipulation operations. This being the case, where does one draw the line between what one asks the database to do versus what is done by the web server(s)? We use stored procedures exclusively as our data-access layer. My unwritten rule has always been to leave output formatting (including string manipulation and arithmetic) to the web server. So our queries return:
unformatted dates
null values
no calculated values (i.e. return values for columns "foo" and "bar" and let the web server calculate foo*bar if it needs to display value foobar)
no substring-reduced fields (except when shortened field is so significantly shorter that we want to do it at database level to reduce result set size)
two separate columns to let front-end case the output as required
What I'm interested in is feedback about whether this is generally an appropriate approach or whether others know of compelling performance/maintainability considerations that justify pushing these activities to the database.
Note: I'm intentionally tagging this question to be dbms-agnostic, as I believe this is an architectural consideration that comes into play regardless of one's specific dbms.
I would draw the line on how certain layers could rotate out in place for other implementations. It's very likely that you will never use a different RDBMS or have a mobile version of your site, but you never know.
The more orthogonal a data point is, the closer it should be to being released from the database in that form. If on every theoretical version of your site your values A and B are rendered A * B, that should be returned by your database as A * B and never calculated client side.
Let's say you have something that's format heavy like a date. Sometimes you have short dates, long dates, English dates... One pure form should be returned from the database and then that should be formatted in PHP.
So the orthogonality point works in reverse as well. The more dynamic a data point is in its representation/display, the more it should be handled client side. If a string A is always taken as a substring of the first six characters, then have that be returned from the database as pre-substring'ed. If the length of the substring depends on some factor, like six for mobile and ten for your web app, then return the larger string from the database and format it at run time using PHP.
Usually, data formatting is better done on client side, especially culture-specific formatting.
Dynamic pivoting (i. e. variable columns) is also an example of what is better done on client side
When it comes to string manipulation and dynamic arrays, PHP is far more powerful than any RDBMS I'm aware of.
However, data formatting can use additional data which is also kept in the database. Like, the coloring info for each row can be stored in additional table.
You should then correspond the color to each row on database side, but wrap it into the tags on PHP side.
The rule of thumb is: retrieve everything you need for formatting in as few database round-trips as possible, then do the formatting itself on the client side.
I believe in returning the data pretty much as-is from the database and letting it be formatted on the front-end instead. I don't stick to it religously, but in general I think it's better as it provides greater flexibility - e.g. 1 sproc can service n different requirements for data, each of which can format the data as each individually needs. Otherwise, you end up either with multiple queries returning the same data with slightly different formatting from the DB (from a SQL Server point of view, thus reducing execution plan caching benefits - therefore negative impact on performance).
Leave output formatting to the web server