How to avoid running a query twice - ODBC Excel (new query or editing an existing query)

I have an ODBC connection to an AWS MySQL database instance. It's extremely frustrating that the Excel UI appears to obligate me to run the query twice.
First, I have to run the query like this:
After this runs and returns a limited preview of rows (2nd image below), I then have to run it again to load the data into Excel.
My question is: is there any way to skip step 1 or step 2, so that I can enter my query and have it load directly into the workbook?

I'm not understanding the problem. You are configuring a Query Connection. The first execution returns a preview and the "Transform data" option (if you want to tailor the query further). The second execution loads it. From that point on the query is set up; it only needs to be configured once.
To get new/changed data, you just need to do a "Refresh All" or configure it to automatically Refresh Data when the Excel Workbook is opened.
If you are adding a query to many workbooks, you could probably set up one and then code a query substitution script.

Cannot schedule a script in BigQuery without having to define a destination table [duplicate]

I wrote a script in standard SQL that declares several variables and uses a loop to replace data in multiple tables, and I set it up as a scheduled query in BigQuery. Before Google updated the UI, I wasn't required to specify a destination table, so it worked just fine; after the UI change, I'm required to define a destination for the query results, or else the scheduled query cannot be created.
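For illustration, the script looks roughly like the sketch below (the dataset and table names here are made up, not the real ones); because a script writes to several tables, there is no single destination table to point the scheduler at:
DECLARE i INT64 DEFAULT 1;

-- Hypothetical sketch of a scripted scheduled query: loop over several
-- tables and replace their contents. A script like this has no single
-- destination table, which is exactly what the scheduler now asks for.
LOOP
  IF i > 3 THEN
    LEAVE;
  END IF;

  EXECUTE IMMEDIATE FORMAT("""
    CREATE OR REPLACE TABLE mydataset.daily_copy_%d AS
    SELECT *
    FROM mydataset.source_%d
    WHERE load_date = CURRENT_DATE()
  """, i, i);

  SET i = i + 1;
END LOOP;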
I tried creating the scheduled query with a destination table, and then I got this error message:
'configuration.query.destinationTable cannot be set for scripts'
I read this document https://cloud.google.com/bigquery/docs/scheduling-queries
and it states that scripting queries are not supposed to have a destination table, but it seems like Google has changed this and now there is no option to leave the destination dataset undefined.
Is anybody else currently facing the same problem, or does anyone have a solution?
Thanks!
I've had the same problem. When I rolled back the last UI update the problem was fixed. You can go back to the old version by clicking the "Hide Preview Features" button in the bar at the top of the screen.

Will BigQuery finish long-running jobs with a destination table if my browser crashes / computer turns off?

I frequently run BigQuery jobs in the web GUI that take 30 minutes or more, saving the results into another table to view later.
Since I'm not waiting for the result to come soon, and not storing them in my computer's memory, it would be great if I could start a query and then turn off my computer, to come back the next day and look at the results in the destination table.
Will this work?
The same applies if my computer crashes, the browser runs out of memory, or anything else causes me to lose my connection to BigQuery while the job is running.
The simple answer is yes: the processing takes place in the cloud, not in your browser. As long as you set a destination table, the results will be saved there; if not, you can check the query history to see whether any issue prevented them from being produced.
If you don't set a destination table, the results are saved to a temporary table, which may no longer be available if you don't return in time.
I'm sure someone can give you a much more detailed answer.
Even if you have not defined a destination table, you can still access the result of the query by checking the Query History: locate your query in the list, expand the respective item, and find the value of Destination Table.
Note: this is not a regular table but a so-called anonymous table that stays available for about 24 hours after the query was executed.
So, knowing that table, you can use it in whatever way you want; for example, simply query it as below:
SELECT *
FROM `yourproject._1e65a8880ba6772f612fbe6ff0eee22c939f1a47.anon9139110fa21b95d8c8729cf0bb6e4bb6452946d4`
Note: the anonymous table is "saved" in a "system" dataset whose name starts with an underscore, so you will not be able to see it in the UI. Also, the table name starts with 'anon', which I believe stands for 'anonymous'.
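If you want to keep those results beyond the roughly 24-hour window, one option (a hedged sketch; the target dataset and table name here are made up) is to copy the anonymous table into a regular table before it expires:
CREATE TABLE `yourproject.yourdataset.saved_results` AS
SELECT *
FROM `yourproject._1e65a8880ba6772f612fbe6ff0eee22c939f1a47.anon9139110fa21b95d8c8729cf0bb6e4bb6452946d4`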

How to create a temporary table with SELECT * INTO ##temp FROM table syntax (for MS SQL) in Pentaho Data Integration

When I use the above syntax in the "Execute row script" step, it shows success, but the temporary table does not get created. Please help me out with this.
Yes, the behavior you're seeing is exactly what I would expect. It works fine from the T-SQL prompt and throws no error in the transform, but the table is not there after the transform completes.
The problem here is the execution model of PDI transforms. When a transform is run, each step gets its own thread of execution. At startup, any step that needs a DB connection is given its own unique connection. After processing finishes, all steps disconnect from the DB. This includes the connection that defined the temp table. Once that happens (the defining connection goes out of scope), the temp table vanishes.
Note that this means that in a transform (as opposed to a Job) you cannot assume a specific order of completion of anything (without blocking steps).
We still don't have many specifics about what you're trying to do with this temp table and how you're using its data, but I suspect you want its contents to persist outside your transform. In that case you have some options, but a global temp table like this simply won't work.
Options that come to mind:
Convert the temp table to a permanent table. This is the simplest solution: you're basically making a staging table, loading it with a Table Output step (or whatever), and then reading it with Table Input steps in other transforms (see the sketch below).
Write the table contents to a temp file with something like a Text File Output or Serialize to File step, then read it back in from the other transforms.
Store rows in memory. This involves wrapping your transforms in a Job and using the Copy Rows to Results and Get Rows from Results steps.
Each of these approaches has its own pros and cons. For example, storing rows in memory will be faster than writing to disk or network, but memory may be limited.
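As a hedged sketch of the first option (object names here are hypothetical), the SELECT * INTO ##temp pattern from the question becomes an ordinary permanent staging table that survives the transform's per-step connections:
-- Run against MS SQL (e.g. from a SQL step in a Job), instead of
-- SELECT * INTO ##temp FROM some_table:
IF OBJECT_ID('dbo.stg_some_table', 'U') IS NOT NULL
    DROP TABLE dbo.stg_some_table;

SELECT *
INTO dbo.stg_some_table
FROM dbo.some_table;

-- Other transforms then read it with a plain Table Input step:
-- SELECT * FROM dbo.stg_some_table;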
Another step you might need, depending on what you're doing, is the ETL Metadata Injection step, which in many cases lets you dynamically move the metadata from one transform to another. See the docs for descriptions of how each of these works.
If you'd like further assistance here, or I've made a wrong assumption, please edit your question and add as much detail as you can.

How to resume data migration from the point where the error happened in SSIS?

I am migrating data from an Oracle database to a SQL Server 2008 R2 database using SSIS. My problem is that at a certain point the package fails, say at some 40,000 rows out of 100,000 rows. What can I do so that the next time I run the package, after correcting the errors, it restarts from the 40,001st row, i.e., the row where the error occurred?
I have tried using checkpoints in SSIS, but the problem is that they only work between different control flow tasks. I want something that works on the rows being transferred.
There's no native magic I'm aware of that is going to "know" that it failed on row 40,000 and when it restarts, it should start streaming row 40,001. You are correct that checkpoints are not the answer and have plenty of their own issues (can't serialize Object types, loops restart, etc).
How you can address the issue is through good design. If your package is created with the expectation that it's going to fail, then you should be able to handle these scenarios.
There are two approaches I'm familiar with. The first is to add a Lookup Transformation in the Data Flow between your source and your destination. The goal of this is to identify which records already exist in the target system; if no match is found, only those rows are sent on to the destination. This is a very common pattern and also lets you detect changes between source and destination (if that is a need). The downside is that you will always be transferring the full data set out of the source system and then filtering rows in the data flow: if it failed on row 999,999 out of 1,000,000, you will still need to stream all 1,000,000 rows back to SSIS for it to find the one that hasn't been sent.
The other approach is to use a dynamic filter in the WHERE clause of your source. If you can make assumptions such as rows being inserted in order, then you can structure your SSIS package with an Execute SQL Task that runs a query like SELECT COALESCE(MAX(SomeId), 0) + 1 AS startingPoint FROM dbo.MyTable against the destination database and assigns the result to an SSIS variable (@[User::StartingId]). You then use an expression on your SELECT statement from the source, something like "SELECT * FROM dbo.MyTable T WHERE T.SomeId >= " + (DT_WSTR, 10) @[User::StartingId]. Now when the data flow begins, it will start where it last loaded data. The challenge with this approach is being confident that data hasn't been inserted out of order.
Let me know if you have questions, need things explained better, want pictures, etc. Also, the above code is freehanded, so there could be syntax errors, but the logic should be correct.
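Spelled out a little more, the second approach looks roughly like this (table and column names are placeholders, and the variable name is only an example):
-- Execute SQL Task, run against the destination; map the single-row
-- result to an SSIS variable such as @[User::StartingId].
SELECT COALESCE(MAX(SomeId), 0) + 1 AS StartingPoint
FROM dbo.MyTable;

-- The source's SQL command then comes from an expression, roughly:
-- "SELECT * FROM dbo.MyTable T WHERE T.SomeId >= " + (DT_WSTR, 10) @[User::StartingId]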

Synchronize Data Access

Some years ago I built an Excel file for my colleagues that displays lots of data from an external ODBC data source. The data is partitioned into lots of data tables in different sheets. The file also contains a button that allows the user to update the data.
Since accessing the data from the external source was very slow, I implemented some caching logic that stored the parts of the results that were unlikely to change in external tables on our SQL Server, with some magic to keep the data synchronized. The Excel file itself only accesses the SQL Server; every data table uses a stored procedure (SPROC) to get its part of the data.
Fast forward 5 years. The Excel file has grown so large and contains so many sheets and so much data that our Excel (still version 2003) has problems with it. So my colleagues split the file into two halves.
The problem now is that both Excel files contain the logic to update the data, and it can happen that a user clicks the update button in file no. 1 while another user is already updating file no. 2.
That's the point where the updating logic goes berserk and produces garbage.
The update run is only required once for both Excel files because it updates all the data that's displayed in both files. It's quite expensive and takes from 5 to 15 minutes.
I could split the update run into two halves as well, but that wouldn't make it any faster and updating the two files would take twice as long.
What I'm thinking about is some kind of mutex: user A clicks the update button and the update run starts. User B wants to update too, but the (VBA/SPROC) logic detects that there's already an update running and waits until it finishes.
You could perform the updates in a transaction with the Serializable isolation level; your update code would need to detect and handle SQL Server error 1205 (and report to the user that another update is in progress).
Alternatively, add a rowversion timestamp to each row and only update a row if it hasn't been changed since you loaded it.
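A hedged sketch of the rowversion idea (the table and column names are made up): remember the row's version when you load it, and make the later write conditional on that version:
-- One-time setup: a version column SQL Server maintains automatically.
CREATE TABLE dbo.CachedData
(
    Id      INT PRIMARY KEY,
    Payload NVARCHAR(MAX),
    RowVer  ROWVERSION
);

DECLARE @Id INT = 1,
        @RowVerReadEarlier BINARY(8);

-- Read the row (and remember its version) when loading it.
SELECT @RowVerReadEarlier = RowVer
FROM dbo.CachedData
WHERE Id = @Id;

-- Later, only write it back if nobody has changed it in the meantime.
UPDATE dbo.CachedData
SET Payload = N'refreshed value'
WHERE Id = @Id
  AND RowVer = @RowVerReadEarlier;

IF @@ROWCOUNT = 0
    PRINT 'Row was changed by another update; skip or retry.';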
But when A has finished, B will run the update 'for nothing'.
Instead: when A clicks update, call a stored proc that fires the update asynchronously.
When the update starts, it looks at the last time it ran and exits if that was less than X minutes ago.
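A hedged sketch of that guard on the SQL Server side (the procedure name, lock name, and log table are all made up, and the asynchronous firing, e.g. via a SQL Agent job or Service Broker, is left out): sp_getapplock keeps two callers from passing the check at the same time, and the last-run check makes the second caller exit instead of repeating the work.
CREATE PROCEDURE dbo.usp_RunCacheUpdate
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @lockResult INT;

    -- Session-scoped application lock; @LockTimeout = 0 means "don't wait".
    EXEC @lockResult = sp_getapplock
        @Resource    = 'CacheUpdate',
        @LockMode    = 'Exclusive',
        @LockOwner   = 'Session',
        @LockTimeout = 0;

    IF @lockResult < 0
        RETURN;  -- another update is already running

    -- Assumed single-row log table with a LastRun DATETIME2 column.
    IF EXISTS (SELECT 1 FROM dbo.UpdateLog
               WHERE LastRun > DATEADD(MINUTE, -15, SYSDATETIME()))
    BEGIN
        EXEC sp_releaseapplock @Resource = 'CacheUpdate', @LockOwner = 'Session';
        RETURN;  -- data was refreshed recently, so skip this run
    END;

    -- ... the expensive 5-15 minute refresh work goes here ...

    UPDATE dbo.UpdateLog SET LastRun = SYSDATETIME();

    EXEC sp_releaseapplock @Resource = 'CacheUpdate', @LockOwner = 'Session';
END;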