Large results set from Oracle SELECT - sql

I have a simple statement, pasted below, that runs against an Oracle database. The result set contains names of businesses, but it has 24,000 rows, and these are displayed in a drop-down list.
I am looking for ideas on ways to reduce the result set to speed up the data returned to the user interface, maybe something like Google's search or a completely different idea. I am open to any thoughts, and any direction is welcome.
SELECT BusinessName FROM MyTable ORDER BY BusinessName;
Idea:
SELECT BusinessName FROM MyTable WHERE BusinessName LIKE 'A%';
I know all about how LIKE clauses are not wise to use, but like I said, this is a LARGE result set. Maybe something along the lines of a BINARY search?

The last query can perform horribly. String comparisons inside the database can be very slow, and depending on the number of "hits" it can be a huge drag on performance. This is especially true if the company data isn't normalized into its own db table. If that doesn't concern you, that's fine.
As long as the user knows the company he's looking up, I would find an existing component in some popular JavaScript library that provides a search text field with a dynamic dropdown showing matching results; that would be an effective mechanism. But you might want to use '%A%' if users might look for part of a name. For example, if I'm looking for IBM Rational, LLC., do I want it to show up in results when I search for "Rational"?
Either way, watch your performance and, if it makes sense, cache that data in the company lookup service that sits on the server in front of the DB. Also, make sure you don't respond to every keystroke; use a timeout of 500 ms or so to let the user type several characters before going to the server and searching. Also, I would NOT recommend bringing all of the company names to the client. We're always looking to reduce the size and frequency of round trips from the browser page to the server. Waiting for 24k company names to come down to the client when the form loads (or even behind the scenes) seems less efficient to me than shorter, quicker, very specific queries that perform sufficiently well. Again, test it and identify the performance characteristics that fit your use case best.
These are techniques I've used on projects with large data, like searching for a user from a base of 100,000+ users. Our code was a custom Dojo widget (dijit). I'm not seeing how to do it directly with the dijit code, but jQuery UI provides an autocomplete widget.
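Also put a row limit on this query behind the text field, so that the drop-down only shows a subset of all the matches, forcing the user to refine the query further. Note that Oracle has no LIMIT clause; on Oracle 12c and later the equivalent is FETCH FIRST:

SELECT BusinessName FROM MyTable ORDER BY BusinessName FETCH FIRST 10 ROWS ONLY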
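
The server-side part of this advice (bound prefix search, a capped result set, and a cache in front of the database) might look roughly like the sketch below. It is only an illustration under stated assumptions: SQLite stands in for Oracle so the code runs anywhere, the sample rows are made up, and suggest() and escape_like() are hypothetical helper names. The 500 ms keystroke delay lives in the browser and is not shown.

```python
import sqlite3
from functools import lru_cache

# Assumption: an in-memory SQLite table stands in for the Oracle table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (BusinessName TEXT)")
conn.executemany(
    "INSERT INTO MyTable VALUES (?)",
    [("Acme Ltd",), ("Apex Corp",), ("Alpha Inc",), ("Beta LLC",)],
)


def escape_like(term):
    """Escape LIKE wildcards so the user's input is matched literally."""
    return term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")


@lru_cache(maxsize=1024)  # repeated prefixes are served from memory
def suggest(prefix, max_rows=10):
    """Return at most max_rows business names that start with prefix."""
    if not prefix:
        return ()
    rows = conn.execute(
        # LIMIT is SQLite syntax; Oracle 12c+ would use FETCH FIRST n ROWS ONLY.
        "SELECT BusinessName FROM MyTable "
        "WHERE BusinessName LIKE ? ESCAPE '\\' "
        "ORDER BY BusinessName LIMIT ?",
        (escape_like(prefix) + "%", max_rows),
    )
    return tuple(name for (name,) in rows)
```

A cache like this goes stale when the table changes, so a real service would need some expiry or invalidation policy.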

Related

Find out the amount of space each field takes in Google Big Query

I want to optimize the space used by my BigQuery and Google Storage tables. Is there an easy way to find out the cumulative space that each field in a table takes? This is not straightforward in my case, since I have a complicated hierarchy with many repeated records.
You can do this in the Web UI by simply typing (and not running) the query below, substituting the field you are interested in
SELECT <column_name>
FROM YourTable
and looking at the validation message, which reports the respective size.
Important: you do not need to run it. Just check the validation message for bytesProcessed; this is the size of the respective column.
Validation is free and invokes a so-called dry run.
If you need to do such "column profiling" for many tables, or for a table with many columns, you can code it in your preferred language: use the Tables.get API to get the table schema, loop through all the fields, build the respective SELECT statement for each, and finally dry-run it (within the loop, once per column) to get totalBytesProcessed, which as you already know is the size of the respective column.
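
The loop described above could be sketched in Python as follows. This is a rough sketch, not a definitive implementation: it assumes the google-cloud-bigquery client library, standard SQL, and a fully qualified table name, it only covers top-level columns (nested and repeated fields would need their own handling), and the three function names are hypothetical.

```python
def column_size_queries(table, columns):
    """Build one SELECT per column; table is a fully qualified table name."""
    return {col: f"SELECT {col} FROM `{table}`" for col in columns}


def dry_run_bytes(client, sql):
    """Dry-run sql and return the number of bytes it would process."""
    # Imported lazily so the query-building helper works without the library.
    from google.cloud import bigquery

    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=config)
    return job.total_bytes_processed


def profile_columns(client, table):
    """Map each top-level column of table to its dry-run size in bytes."""
    schema = client.get_table(table).schema
    queries = column_size_queries(table, [field.name for field in schema])
    return {col: dry_run_bytes(client, sql) for col, sql in queries.items()}
```

With a configured client, `profile_columns(bigquery.Client(), "my-project.my_dataset.my_table")` would return a dict of column name to bytes; the table name here is a placeholder.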
I don't think this is exposed in any of the metadata.
However, you may be able to easily get good approximations based on your needs. The number of rows is provided, so for some of the data types, you can directly calculate the size:
https://cloud.google.com/bigquery/pricing
For types such as string, you could get the average length by querying e.g. the first 1000 rows, and use this for your storage calculations.
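
As a back-of-the-envelope version of that calculation, here is a small sketch. The per-type sizes are an assumption based on the pricing page linked above (8 bytes for INT64, FLOAT64, TIMESTAMP and DATE, 1 byte for BOOL, and 2 bytes plus the UTF-8 length for STRING); verify them there before relying on the numbers. The function name is hypothetical, and repeated or nested fields are not handled.

```python
# Assumed bytes per value; check the pricing page for the current figures.
FIXED_SIZES = {"INT64": 8, "FLOAT64": 8, "BOOL": 1, "TIMESTAMP": 8, "DATE": 8}


def estimate_column_bytes(num_rows, column_type, avg_string_bytes=0.0):
    """Approximate the storage of one column from its row count and type."""
    if column_type == "STRING":
        # 2 bytes of overhead per value plus the sampled average UTF-8 length.
        return int(num_rows * (2 + avg_string_bytes))
    return num_rows * FIXED_SIZES[column_type]
```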

What is a best way to organise the complex couchdb view (sql-like query)?

In my application I need a SQL-like query of the documents. The big picture is that there is a page with a paginated table showing the couchdb documents of a certain "type". I have about 15 searchable columns like timestamp, customer name, the us state, different numeric fields, etc. All of these columns are orderable, also there is a filter form allowing the user to filter by each of the fields.
To make this more concrete, below is a typical query, which results from a customer setting some of the filter options and moving to the second page. It's written in pseudo-SQL, just to explain the problem:
timestamp > last_weeks_monday_epoch AND timestamp < this_weeks_monday_epoch AND marked_as_test = False AND dataspace="production" AND fico > 650
SORT BY timestamp DESC
LIMIT 15
SKIP 15
This would be a trivial problem if I were using any sql-like database, but couchdb is way more fun ;) To solve this I've created a view with the following structure of the emitted rows:
key: [field, value], id: doc._id, value: null
Now, to resolve the example query above I need to perform a bunch of queries:
{startkey: ["timestamp", last_weeks_monday_epoch], endkey: ["timestamp", this_weeks_monday_epoch]}, the *_epoch here are integers epoch timestamps,
{key: ["marked_as_test", False]},
{key: ["dataspace", "production"]},
{startkey: ["fico", 650], endkey: ["fico", {}]}
Once I have the results of the queries above, I calculate the intersection of the sets of document IDs and apply the sorting using the result of the timestamp query. Then finally I can apply the slice, resolving the document IDs of rows 15-30, and download their content using a bulk get operation.
Needless to say, it's not the fastest operation. Currently the dataset I'm working with is roughly 10K documents. I can already see that the part where I calculate the intersection of the sets can take around 4 seconds, so obviously I need to optimize it further. I'm afraid to think how slow it's going to get in a few months when my dataset doubles, triples, etc.
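
For reference, the intersect, sort and slice step described above can be sketched in Python like this. It is a sketch under assumptions: the view results have already been fetched and reduced to plain Python values, and page_of_ids is a hypothetical name.

```python
def page_of_ids(timestamp_rows, other_id_sets, skip=15, limit=15):
    """Return the document IDs for one page of results.

    timestamp_rows: (timestamp, doc_id) pairs from the timestamp range query.
    other_id_sets: one set of document IDs per remaining filter query.
    """
    allowed = set.intersection(*other_id_sets) if other_id_sets else None
    matching = [
        (ts, doc_id)
        for ts, doc_id in timestamp_rows
        if allowed is None or doc_id in allowed
    ]
    matching.sort(key=lambda row: row[0], reverse=True)  # SORT BY timestamp DESC
    return [doc_id for _, doc_id in matching[skip:skip + limit]]
```

Every filter's full ID set has to be loaded before the first page can be produced, which is why the cost grows with the dataset rather than with the page size.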
Ok, so having explained the situation I'm at, let me ask the actual questions.
Is there a better, more natural way to reach my goal without losing the flexibility of the tool?
Is the view structure I've used optimal? At some point I was considering using a separate map() function generating the value of each field. This would result in smaller b-trees but more work for the view server to generate the index. Can I benefit this way?
The part of the algorithm where I have to calculate intersections of the big sets just to later get a slice of the result bothers me. It's not a scalable approach. Does anyone know a better algorithm for this?
Given this map function:
function(doc){
  // Test documents are left out of the index entirely.
  if(doc.marked_as_test) return;
  emit([doc.dataspace, doc.timestamp, doc.fico], null);
}
you can make a similar request:
http://localhost:5984/db/_design/ddoc/_view/view?startkey=["production", :this_weeks_monday_epoch]&endkey=["production", :last_weeks_monday_epoch, 650]&descending=true&limit=15&skip=15
However, you should pass :this_weeks_monday_epoch and :last_weeks_monday_epoch values from the client side (I believe they are some calculable variables on database side, right?)
If you don't care about dataspace field (e.g. it's always constant), you may move it into the map function code instead of having it in query parameters.
I don't think CouchDB is a good fit for the general solution to your problem. However, there are two basic ways you can make CouchDB fit the problem better.
Write/generate a bunch of map() functions that use each separate column as the key (for even better read/query performance, you can even do combinatoric approaches). That way you can do smart filtering and sorting, making use of a bunch of different indices over the data. On the other hand, this will cost extra disk space and index caching performance.
Try to find out which of the filters/sort orders your users actually use, and optimize for those. It seems unlikely that each combination of filters/sort orders is used equally, so you should be able to find some of the most-used patterns and write view functions that are optimal for those patterns.
I like the second option better, but it really depends on your use case. This is one of those things SQL engines have been pretty good at traditionally.

django objects...values() select only some fields

I'm optimizing the memory load (~2GB, offline accounting and analysis routine) of this line:
l2 = Photograph.objects.filter(**(movie.get_selectors())).values()
Is there a way to convince django to skip certain columns when fetching values()?
Specifically, the routine obtains all rows of the table matching certain criteria (db is optimized and performs it very quickly), but it is a bit too much for python to handle - there is a long string referenced in each row, storing the urls for thumbnails.
I only really need three fields from each row, but, if all the fields are included, it suddenly consumes about 5kB/row which sadly pushes the RAM to the limit.
The values(*fields) function allows you to specify which fields you want.
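
Applied to the line from the question, that might look like the sketch below. The three field names are hypothetical (the question does not name them) and slim_rows is an illustrative helper; values() and iterator() are standard QuerySet methods.

```python
NEEDED_FIELDS = ("id", "taken_at", "width")  # hypothetical field names


def slim_rows(queryset, fields=NEEDED_FIELDS):
    """Return an iterator of dicts that hold only the requested fields."""
    # values(*fields) narrows the SELECT; iterator() streams rows instead of
    # caching the whole result set on the QuerySet.
    return queryset.values(*fields).iterator()


# Usage with the model from the question:
# l2 = slim_rows(Photograph.objects.filter(**movie.get_selectors()))
```

The result can only be looped over once; if the rows are needed several times, drop iterator() and keep the narrowed values() call.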
Check out the QuerySet method only(). When you declare that you only want certain fields to be loaded immediately, the QuerySet manager will not pull in the other fields of your object until you try to access them.
If you have to deal with ForeignKeys that must also be pre-fetched, then also check out select_related().
The two links above to the Django documentation have good examples, that should clarify their use.
Take a look at Django Debug Toolbar. It comes with a debugsqlshell management command that lets you see the SQL queries being generated, along with the time taken, as you play around with your models in a Django/Python shell.

figuring out which field to look for a value in with SQL and perl

I'm not too good with SQL and I know there's probably a much more efficient way to accomplish what I'm doing here, so any help would be much appreciated. Thanks in advance for your input!
I'm writing a short program for the local high school. At this school, juniors and seniors who have driver's licenses and cars can opt to drive to school rather than ride the bus. Each driver is assigned exactly one space, and their DLN is used as the primary key of the drivers table. Makes, models, and colors of cars are stored in a separate cars table, related to the drivers table by the license plate number field.
My idea is to have a single search box on the main GUI of the program where the school secretary can type in who/what she's looking for and pull up a list of results. Thing is, she could be typing a license plate number, a car color, make, or model, some driver's name, some student driver's DLN, or a space number. As the programmer, I don't know what exactly she's looking for, so a couple of options come to mind for me to build to be certain I check everywhere for a match:
1) perform a couple of
SELECT * FROM [tablename]
SQL statements, one per table and cram the results into arrays in my program, then search across the arrays one element at a time with regex, looking for a matched pattern similar to the search term, and if I find one, add the entire record that had a match in it to a results array to display on screen at the end of the search.
2) take whatever she's looking for into the program as a scalar and prepare multiple select statements around it, such as
SELECT * FROM DRIVERS WHERE DLN = $Search_Variable
SELECT * FROM DRIVERS WHERE First_Name = $Search_Variable
SELECT * FROM CARS WHERE LICENSE = $Search_Variable
and so on for each attribute of each table, sticking the results into a results array to show on screen when the search is done.
Is there a cleaner way to go about this lookup without having to make her specify exactly what she's looking for? Possibly some kind of SQL statement I've never seen before?
This seems like a good application for the Sphinx full-text search engine. There's the Sphinx::Search module on CPAN, which can be used as a Perl client for Sphinx.
First of all, you should not use SELECT * and you should definitely use bind values.
Second, the easiest way to figure out what the user is searching for is to ask the user. Have a set of checkboxes like so:
Search among: [ ] Names
[ ] License Plate Numbers
[ ] Driver's License Numbers
Alternatively, you can note that names do not contain any digits, and I have not seen any driver's license number that does not contain digits. There are other heuristics you can apply to partially deduce what the user was trying to search for.
If you do an OK job of presenting the results, this might work out.
Finally, try to figure out what search possibilities are offered by the database you are using and leverage them so that most of the searching happens before the user interface touches the data.
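
To illustrate the points about bind values and letting the database do the searching, here is a small sketch. It makes several assumptions: it is written in Python with SQLite rather than Perl with DBI (the same idea applies with DBI placeholders), and the column names and sample rows are made up to match the description in the question. One bound parameter is compared against every searchable column in a single query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE drivers (dln TEXT PRIMARY KEY, first_name TEXT,
                          last_name TEXT, space INTEGER, license TEXT);
    CREATE TABLE cars (license TEXT PRIMARY KEY, make TEXT,
                       model TEXT, color TEXT);
    INSERT INTO drivers VALUES ('D1234567', 'Ann', 'Lee', 12, 'ABC123');
    INSERT INTO drivers VALUES ('D7654321', 'Bob', 'Ray', 40, 'XYZ789');
    INSERT INTO cars VALUES ('ABC123', 'Honda', 'Civic', 'Red');
    INSERT INTO cars VALUES ('XYZ789', 'Ford', 'Focus', 'Blue');
""")

# Explicit column list instead of SELECT *, and a named bind value (:term)
# instead of interpolating the search string into the SQL text.
SEARCH_SQL = """
    SELECT d.first_name, d.last_name, d.dln, d.space,
           c.license, c.make, c.model, c.color
    FROM drivers d JOIN cars c ON c.license = d.license
    WHERE :term IN (d.dln, d.first_name, d.last_name,
                    c.license, c.make, c.model, c.color)
       OR CAST(d.space AS TEXT) = :term
    ORDER BY d.last_name
"""


def search(term):
    """Return every driver/car row in which any column equals term."""
    return conn.execute(SEARCH_SQL, {"term": term}).fetchall()
```

This matches whole values only; partial matches would need LIKE or a full-text index, and the digit heuristic above could be used to skip the name columns for terms that contain digits.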

Database vs. Front-End for Output Formatting

I've read that (all things equal) PHP is typically faster than MySQL at arithmetic and string manipulation operations. This being the case, where does one draw the line between what one asks the database to do versus what is done by the web server(s)? We use stored procedures exclusively as our data-access layer. My unwritten rule has always been to leave output formatting (including string manipulation and arithmetic) to the web server. So our queries return:
unformatted dates
null values
no calculated values (i.e. return values for columns "foo" and "bar" and let the web server calculate foo*bar if it needs to display value foobar)
no substring-reduced fields (except when shortened field is so significantly shorter that we want to do it at database level to reduce result set size)
two separate columns to let front-end case the output as required
What I'm interested in is feedback about whether this is generally an appropriate approach or whether others know of compelling performance/maintainability considerations that justify pushing these activities to the database.
Note: I'm intentionally tagging this question to be dbms-agnostic, as I believe this is an architectural consideration that comes into play regardless of one's specific dbms.
I would draw the line based on how easily certain layers could be swapped out for other implementations. It's very likely that you will never use a different RDBMS or have a mobile version of your site, but you never know.
The more orthogonal a data point is, the closer it should be to being released from the database in that form. If on every theoretical version of your site your values A and B are rendered A * B, that should be returned by your database as A * B and never calculated client side.
Let's say you have something that's format heavy like a date. Sometimes you have short dates, long dates, English dates... One pure form should be returned from the database and then that should be formatted in PHP.
So the orthogonality point works in reverse as well. The more dynamic a data point is in its representation/display, the more it should be handled client side. If a string A is always taken as a substring of the first six characters, then have that be returned from the database as pre-substring'ed. If the length of the substring depends on some factor, like six for mobile and ten for your web app, then return the larger string from the database and format it at run time using PHP.
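
A small sketch of that split is shown below, in Python rather than PHP, with made-up column names and formats: the database returns one pure date, the full string and the already-computed product, and the front end picks the substring length and date format per client.

```python
from datetime import date


def render(row, client):
    """Format one raw database row for a client ("mobile" or "web")."""
    title_len = 6 if client == "mobile" else 10
    date_fmt = "%m/%d/%y" if client == "mobile" else "%Y-%m-%d"
    return {
        "title": row["title"][:title_len],
        "created": row["created"].strftime(date_fmt),
        "total": row["total"],  # A * B, already computed by the database
    }


row = {"title": "Quarterly-report", "created": date(2020, 3, 5), "total": 42}
mobile_view = render(row, "mobile")
web_view = render(row, "web")
```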
Usually, data formatting is better done on client side, especially culture-specific formatting.
Dynamic pivoting (i.e. variable columns) is also an example of what is better done on the client side.
When it comes to string manipulation and dynamic arrays, PHP is far more powerful than any RDBMS I'm aware of.
However, data formatting can use additional data which is also kept in the database. Like, the coloring info for each row can be stored in additional table.
You should then match the color to each row on the database side, but wrap it in the tags on the PHP side.
The rule of thumb is: retrieve everything you need for formatting in as few database round-trips as possible, then do the formatting itself on the client side.
I believe in returning the data pretty much as-is from the database and letting it be formatted on the front end instead. I don't stick to it religiously, but in general I think it's better because it provides greater flexibility: e.g. one sproc can service n different requirements for data, each of which can format the data as it needs. Otherwise, you end up with multiple queries returning the same data with slightly different formatting from the DB (which, from a SQL Server point of view, reduces the benefits of execution plan caching and so hurts performance).
Leave output formatting to the web server