Is it possible to refer to queries defined in a previous %%sql module in a later module? - google-bigquery

I just started working with the new Google Cloud Datalab and IPython last week (though I've been using BigQuery for a few months). The tutorials and samples in github are very helpful, but as my scripts and queries become more complex I'm wondering a few things. The first one is this: can I refer to queries defined in one %%sql module in a later %%sql module? The other, somewhat related question is can I somehow store the results from one %%sql module and then put that information into something like an IN clause in a subsequent %%sql module?

Here are some things to try to see if they meet your needs. If they don't, I encourage you to file issues on GitHub, as I think both of your scenarios are things we want to make sure work well.
For the first, you need a combination of SQL cells and code cells [for now]:
Cell 1
%%sql --module m1
DEFINE QUERY q1
SELECT ...
Cell 2
%%sql --module m2
DEFINE QUERY q2
SELECT ... FROM $src ...
Cell 3
import gcp.bigquery as bq
compositequery = bq.Query(m2.q2, src = m1.q1)
Essentially, %%sql modules are turned into auto-imported Python modules behind the scenes.
I used to split queries out one per %%sql cell myself, but since the introduction of modules I also, depending on the scenario, define multiple queries within a single module, which avoids needing a bit of Python code to stitch them together. Which is better depends on your scenario.
For your second question, again, if the queries are split across cells, you'll need some Python glue in the middle. Execute one query, get its result, and use that as a parameter for the next query. This works for general scalar values, but for IN clauses and tuples/lists of values, we have this issue we need to address: https://github.com/GoogleCloudPlatform/datalab/issues/615
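To illustrate that glue, here is a rough sketch. The results() call and the 'id' column are assumptions for illustration; check your Datalab version for the exact result-iteration API.

import gcp.bigquery as bq
# Run the first query and materialize its rows (results() is an assumed
# method name; your Datalab build may expose it differently).
rows = bq.Query(m1.q1).results()
# Build the IN list by hand, since list-valued parameters are not yet
# supported (see the issue linked above).
id_list = ','.join(str(row['id']) for row in rows)
next_query = bq.Query('SELECT * FROM [mydataset.mytable] WHERE id IN (%s)' % id_list)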
For more ideas on how you can use JOINs in BigQuery to produce scalar results in one query that you consume in the next query, you can also see the query under Step 3 in the BigQuery tutorial notebook titled "SQL Query Composition".
Hope that helps.
As mentioned, if you hit specific issues where something didn't work as you expected, please do file an issue, and we can see if it makes sense to address, and possibly you or someone else might even step up to make a contribution. :)

Related

IGNORE CASE query problems saving to a table and using Allow large results

I need case insensitivity in my queries so I found IGNORE CASE which works superbly when used in queries that target the browser (I am talking about BQ web UI). If I choose a destination table (an absolute must for me) and select Allow Large Results (with unchecked Flatten Results) then I get a cryptic error like this:
Error: unexpected LIMIT clause at: 2.200 - 2.206
Even though this official Google BigQuery issue and feature request tracker post seems to describe the same issue, and even though the problem seems to have been acknowledged back in Jan 2015, the solution isn't apparent.
I could potentially use a bunch of temp tables with lowercased search columns as a workaround but that sounds awfully difficult with the number of tables and columns that I have and the complex queries that I intend to run.
Any other possible workarounds? Why isn't this working yet on BQ?
Yes, it is a known problem, and it has not been neglected. The code changes to fix it are (surprisingly) not trivial, but they are mostly done. The team is now carefully looking at how to enable and deploy them. I cannot give you a timeline, but the fix to this problem is coming.
The only workaround in the meantime is to wrap all the string comparisons, string GROUP BYs, and string ORDER BYs with a conversion of the operands to LOWER() (or UPPER()).
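For example, a query rewritten that way might look like the following (the table and column names are only illustrative):

SELECT LOWER(name) AS lname, COUNT(*) AS n
FROM [mydataset.mytable]
WHERE LOWER(name) = 'acme'
GROUP BY lname
ORDER BY lname

Here the comparison, the GROUP BY, and the ORDER BY all operate on the lowercased value.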

Scala 22-param limit: trying to find a workaround and still use for comprehensions instead of plain SQL in Slick

I'm working with 23 fields here, at last count. I've generally thrown my hands up at trying to count them after reducing from a 31-field table by using a foreign key.
All the good links
Fundamental explanation of how to read and understand Slick's schema code provided by one very good Faiz.
On 22+ parameters...
Stefan Zeigar has been immensely helpful in the example code he's written in this discussion, and also more directly linked here on GitHub.
The good Stefan Zeigar has also posted here on plain SQL queries.
What this post is about
I think the above is enough to get me on my way to a working refactoring of my app so that CRUD is feasible. I'll update this question or ask new questions if something comes up and stalls me. The thing is...
I miss using for comprehensions for querying. I'm talking about Slick's Query Templates.
The problem I run into when I use a for comprehension is that the table... will probably have
object Monsters extends Table[Int]("monster_table") {
  // lots of column definitions
  def * = id // for a Table[Int], despite having 21 other columns that I'm
             // not describing in this projection/ColumnBase/???
}
and the * projection won't describe everything I want to return in a query.
The usual simple for comprehension Slick query template will look something like this:
def someQueryTemplate = for {
  m <- Monsters
} yield m
and m will be an Int instead of the entire object I want, because I declared the table to be a Table[Int]. I can't construct a mapped projection of 22+ params, because the compiler support for tuples (the generated TupleXX classes) stops at 22 elements.
So... in a nutshell:
Is there any way to use Query Templates in Slick with 22+ columns?
I stumbled on this question after answering a similar question a couple of days ago. Here's a link to the question: slick error : type TupleXX is not a member of package scala (XX > 22).
To answer the question here, you are able to use nested tuples to solve the 22 column limit. When solving the problem myself, the following post was extremely helpful. https://groups.google.com/forum/#!msg/scalaquery/qjNW8P7VQJ8/ntqCkz0S4WIJ
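As a rough sketch of the nested-tuple idea, using Slick 2.x-style table definitions rather than the older API from the question (all names are illustrative, and only three of the many columns are shown):

import scala.slick.driver.H2Driver.simple._

case class Monster(id: Int, name: String, hp: Int) // ...plus the other fields

class Monsters(tag: Tag) extends Table[Monster](tag, "monster_table") {
  def id = column[Int]("id", O.PrimaryKey)
  def name = column[String]("name")
  def hp = column[Int]("hp")
  // ...the remaining column definitions...

  // Group the columns into nested tuples so that no single tuple exceeds
  // 22 elements, then map the nested shape onto the case class.
  def * = (id, (name, hp)) <> (
    { case (id, (name, hp)) => Monster(id, name, hp) },
    { m: Monster => Some((m.id, (m.name, m.hp))) }
  )
}

With a mapped * projection like this, the for comprehension from the question yields full Monster rows instead of a bare Int.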
Another option is to use the soon-to-be-released version of Slick which uses HLists to remove the column count limitation. This version of Slick can be pulled from the Slick master branch on Github. Kudos to cvogt for pointing out this option. https://github.com/slick/slick/blob/master/slick-testkit/src/main/scala/com/typesafe/slick/testkit/tests/MapperTest.scala#L249

Get last few query results in SQL

I frequently do a static analysis of SQL databases, during which I have the luxury of nobody being able to change the data except me.
However, I have not found a way to 'tell' this to SQL in order to prevent running the same query multiple times.
Here is what I would like to do, first I start with a complicated query that has a very small output.
SELECT * FROM MYTABLE WHERE MYPROPERTY = 1234
Then I run a simple query from the same window (mostly using SQL Server Management Studio, if that is relevant):
SELECT 1
Now I suddenly realize that I forgot to save the results from my first complicated (slow) query.
As I know the underlying data did not change (or even if it did) I would like to look one step back and simply get the result. However at the moment I don't know any trick to do this and I have to run the entire query again.
So the question summary is: how can I (automatically store and) get the results from recently executed queries?
I am particularly interested in simple SELECT queries, and would be happy to allocate, say, 100 MB of memory for automated result storage. I would prefer a solution that works in SQL Server Management Studio with T-SQL, but other SQL solutions are also welcome.
EDIT: I am not looking for a way to manually prevent this from happening. In the cases where I can anticipate the problem it will not happen.
This can't be done in Microsoft SQL Server. SQL Server does not cache results, instead it caches data pages that were accessed by your query. This should make your query go a lot faster the second time around so it won't be as painful to re-run it.
In other databases, such as Oracle and MySQL, they do have a query caching mechanism that will allow you to retrieve the results directly the second time around.
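For reference, both of those caches can be requested explicitly; a quick sketch (the MySQL hint only has an effect when the server's query cache is enabled):

-- MySQL query cache
SELECT SQL_CACHE * FROM MYTABLE WHERE MYPROPERTY = 1234;

-- Oracle server result cache
SELECT /*+ RESULT_CACHE */ * FROM MYTABLE WHERE MYPROPERTY = 1234;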
I run into this frequently; I often just throw the results of longer-running queries into a temp table:

SELECT *
INTO #results1
FROM MYTABLE
WHERE MYPROPERTY = 1234

SELECT *
FROM #results1
If the query is very long-running, I might use a 'real' table. It's a good way to save on re-run time.
The downside is that it adds extra statements to your query.
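A sketch of the 'real' table variant (the table name is just an example; remember to drop it when you're done):

SELECT *
INTO dbo.results_myproperty_1234
FROM MYTABLE
WHERE MYPROPERTY = 1234

-- later, from any session:
SELECT * FROM dbo.results_myproperty_1234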
You can also send query results to a file in SSMS, info on formatting the output is here: SSMS Results to File
The easiest way to do this is to run each query in its own SSMS window, the results will stay there until you close it, or run out of memory - besides that, I am not sure there is a way to accomplish what you want.
Once you close the SSMS window, I don't believe there is a way to get back 'cached' results.
This isn't a technical answer to your question. Having written queries and looking at results for many years, I am in the habit of saving the results in Excel, regardless of the database/query tool I'm using.
The format in Excel is rather methodical:
Each worksheet has the date. (Called something like "1 Jul".)
Each spreadsheet contains one month. (Typically with the month name like "work-201307".)
In the "B" column I copy and paste the query.
Underneath, in the "C" column, I copy and paste the results.
The next query starts a few lines after, one after the other.
I put the queries in the "B" column, so I can go to the "A" column and use Ctrl+arrow to get to the first row. I put the results in the "C" column, so I can go to the "B" column and use Ctrl+arrow to move between queries.
I mostly do this so I can go back and see the work I did many months ago. For instance, someone sends an email from February and says "do this again". I can go back to the February spreadsheet, go to the day it was created, and see what I was doing at that time.
In your question, though, I realize that I now instinctively solve this problem, because the "right click on the grid, copy with column headers, alt-tab to Excel, alt-V" routine is a behavior that comes quite naturally to me.
I was going to suggest running each query through a script with a counter (stored in a table) that is incremented each time a query is executed (i.e. i++), storing each result in a temp table called "tmpTable" + i, but that sounds very complicated to manage. Am I right?
Then I googled and I've found this Tool Pack: I didn't try it but you could take a look:
http://www.ssmstoolspack.com/Features
Hope it helps.
EDIT: added the following link. There's the option to output to an XML file, and they mention SQL Server Integration Services as a possible solution too.
http://michaeljswart.com/2012/03/sending-query-results-to-others/#method5
SECOND EDIT: There's this DBMS-Independent tool too, it sounds interesting:
http://www.sql-workbench.net/
I am not sure this is what you want, but check my answer anyway.
In SQL Server Management Studio you can open multiple tabs for executing queries. Open a new tab for each query, and the result of each executed query will remain available under that tab.
After executing a query in a tab, don't use that tab for a new query; open a new tab for that job.
Have you considered using some kind of offline SQL client such as Excel? Specifically, Excel will retrieve the results into the spread sheet (using the Data ribbon/menus) where they are stored pretty much permanently as results. It will prompt you to refresh when necessary or you can do it on demand.
As to whether it can be done in T-SQL or other databases: it depends on the database and its results cache, and even then caching is an option the query processor can use, not a guarantee for an individual query.

Script to compare two tables in database, from user input

I am very new to VBA and SQL and am trying to learn. I have an MS Access project that requires a VBA script that prompts the user to input two table names and numerous field names, then creates a SQL query utilizing those names.
The specific SQL query I'm trying to use is below.
SELECT
A.user_index, A.input1, B.input1, A.input2, B.input2, A.input3, B.input3,
A.input4, B.input4, A.input5, B.input5
FROM
table1 AS A
LEFT JOIN
table2 AS B ON A.user_index = B.user_index
WHERE
(((A.input1) <> [B].[input1])) OR
(((A.input2) <> [B].[input2])) OR
(((A.input4) <> [B].[input4]));
The overall purpose of this is to have a script that will be able to list fields for comparison that is applicable with any database. I know this is probably a relatively easy solution. However, I have no idea where to start.
My first instinct is to say "What have you tried so far?", but as you said, you don't know where to start.
It sounds like you need to first prompt the user for several field and table names, then build a query based on those values. I recommend first outlining exactly what you want your script to do. Maybe something like:
1. Declare variables to hold the values.
2. Prompt the user for each of the values and store them in the variables.
2a. After the user enters a value, make sure it is valid. If not, do something accordingly.
3. Declare a variable to hold your SQL query.
4. Construct the query.
5. Run the query.
This is obviously just an example. Break down each step into "baby steps" as much as possible.
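To make the first few steps concrete, here is a minimal VBA sketch. It assumes a shared user_index key column and a single comparison field, does no validation, and all names are illustrative:

Dim tbl1 As String, tbl2 As String, fld As String, sql As String
' Steps 1-2: declare variables and prompt the user for the names.
tbl1 = InputBox("First table name:")
tbl2 = InputBox("Second table name:")
fld = InputBox("Field to compare:")
' Steps 3-4: build the SQL string from the user's input.
sql = "SELECT A." & fld & ", B." & fld & _
      " FROM " & tbl1 & " AS A LEFT JOIN " & tbl2 & " AS B" & _
      " ON A.user_index = B.user_index" & _
      " WHERE A." & fld & " <> B." & fld & ";"
' Step 5: save it as a named query and open it to show the results
' (this errors if qryCompare already exists).
CurrentDb.CreateQueryDef "qryCompare", sql
DoCmd.OpenQuery "qryCompare"

From there you can extend the prompts to multiple fields and add the validation from step 2a.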
It's a good idea to ask yourself how unique these baby steps are to your particular situation (hint: they almost certainly are not unique). If they aren't, then they have probably been solved tens of thousands of times already, so you have a very good chance of googling your questions.
If you still can't find an answer to how to do a particular step, feel free to ask here. Just remember to include your code even if it is broken :)

How to update field names automatically after updating SQL

I am changing the command text for a dataset inside the .rdl file:
I would like to know how can I update the resulting fields that are returned by the select statement:
I know that these fields must be automatically generated, so I was wondering if it's possible to update them right after editing the SQL code inline?
Usually, when someone wants to have a look at the data in the command text, they want it for reference for an end user (from what I have seen). You may want to amend it, but ultimately with reporting your first question should be: "What am I doing this for?" If your goal is dynamic creation at runtime, then I would avoid this and offer a few other suggestions:
Proceduralize it. Making a stored procedure, if you have the know-how in SQL Server, is a convenient and fast way to get what you want, and you can optimize it if you know what you are doing with your SQL fu. The downside is that if you work with multiple environments, you have to deploy the T-SQL code as well as the RDL file.
Use an expression to build the dataset at runtime. In cases where other developers told me that the query itself was not properly optimized, they mentioned doing this. I myself do not always see the advantage of this versus just having your predicate construction work well with good indexing on the source engine. Regardless, you can build your dataset at runtime. It would be similar to hitting 'fx' next to the text and then putting in something like this (assuming you have a parameter named @Start):
="Select thing
from table
Where >= " & Parameters!Start.Value
Again, I have not really seen whether this is that much faster than:
SELECT thing
FROM table
WHERE thing >= @Start
But it is there if you just want to build it dynamically.
You can try to build your expression dynamically from parameters that are PART of the select statement. SSRS is all about 'expressions' and what you can do with them. Once you jump in and learn how they apply to everything, you can go nuts, so to speak, with using them. A general rule, though: the more of them you use and rely on, the slower your reports will become.
I hope some of this helps. I would first ask whether the dynamic behavior is needed because the report must be event-driven, or whether it is performance-related.