Can we access a user variable in the query of a Source in a DFT?

I am working on optimizing a Data Flow Task. I have an ADO.NET source firing a query like the one below:
Select Emp_id, EmpName, Salary from Employee
After the source I have a Derived Column transform which adds a derived column with the user variable value @[User::TestVariable].
Now I guess this Derived Column transform takes at least some time, so I was wondering whether I could save that time by doing something like the following at the source:
Select Emp_id, EmpName, Salary, @[User::TestVariable] as DerivColumn
from Employee
Is it possible to do something of this kind? If yes, how?
Above is the DFT I am working on. How can I find out which component took how much time, so I can look to optimize that?

You can use the variable in the ADO.NET Source.
1. In the property window of the Data Flow Task, click the Expressions property and select the ADO.NET Source's SqlCommand property.
2. In the expression, write your SQL query:
"Select LoginId, JobTitle, " + (DT_WSTR, 10) @[User::TestVariable] + " as DerivedColumn
from HumanResources.Employee"
I don't think that your Derived Column is adding any overhead, as it is a non-blocking component (but there are some exceptions to that).
In order to find the speed of the individual components:
1. Calculate the overall execution time for the package, which you can find in the Execution Results tab.
Overall Execution Speed = Source Speed + Transformation Speed
2. Remove the Derived Column component and connect the source to a Row Count transformation, then check the execution time again. This gives you the source speed.
Overall Execution Speed - Source Speed = Transformation Speed
SSIS is an in-memory pipeline, so all its transformations occur in memory, and it relies heavily on buffers. In your case the SSIS buffer carries 196,602 rows. This value is controlled by two properties, DefaultBufferMaxRows and DefaultBufferSize; the maximum buffer size is 100 MB. Now you need to calculate the estimated row size by adding up the column sizes in your table. Suppose the data type lengths add up to around 40 bytes; then the amount in bytes for 196,602 rows is
196,602 * 40 = 7,864,080 bytes ≈ 7.5 MB
which is less than the default DefaultBufferSize of 10 MB. You can try increasing DefaultBufferMaxRows to increase the speed, but you need to do your own performance testing before coming to a conclusion.
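If the source is SQL Server, one rough way to get that estimated row size is to sum the column lengths from the catalog views. A minimal sketch, assuming the source table is dbo.Employee (note that max_length is -1 for MAX/LOB types, which you would have to estimate separately):
-- Approximate bytes per row, summed from the column definitions
SELECT SUM(max_length) AS estimated_row_bytes
FROM sys.columns
WHERE object_id = OBJECT_ID('dbo.Employee');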
I suggest you read this article to get a complete picture about SSIS performance

So you wish to add a new column to your data set with a fixed value (contained in @[User::TestVariable]) to be inserted later into a destination, right? NO, you can't do what you are thinking, because the scope is the database (where you execute the query) while the variable lives in the package.
Are you sure this Derived Column operation is taking that long? It shouldn't. If it is, you could use an Execute SQL Task to insert the value into a temp table on the DB and then use it in your query:
declare @aux int
select @aux = your_config_column from your_temp_table
Select Emp_id, EmpName, Salary, @aux as DerivColumn
from Employee
It is kind of a messy solution, but it is worth it if the Derived Column really is taking that long.
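A sketch of what that preceding Execute SQL Task could run (the table and column names are placeholders, a permanent staging table is used instead of a #temp table so it survives across connections, and the ? parameter marker assumes an OLE DB connection):
-- Hypothetical staging step executed before the Data Flow Task
IF OBJECT_ID('dbo.your_temp_table') IS NOT NULL
    DROP TABLE dbo.your_temp_table;
CREATE TABLE dbo.your_temp_table (your_config_column int);
INSERT INTO dbo.your_temp_table (your_config_column)
VALUES (?);   -- map this parameter to @[User::TestVariable] in the Execute SQL Task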

Related

Improve performance of deducting values of same table in SQL

For a metering project I use a simple SQL table in the following format:
ID
Timestamp: dat_Time
Metervalue: int_Counts
Meterpoint: fk_MetPoint
While this works nicely in general I have not found an efficient solution for one specific problem: There is one Meterpoint which is a submeter of another Meterpoint. I'd be interested in the Delta of those two Meterpoints to get the remaining consumption. As the registration of counts is done by one device I get datapoints for the various Meterpoints at the same Timestamp.
I think I found a solution applying a subquery which appears to be not very efficient.
SELECT
    A.dat_Time,
    (A.int_Counts - (SELECT B.int_Counts FROM tbl_Metering AS B WHERE B.fk_MetPoint = 2 AND B.dat_Time = A.dat_Time)) AS Delta
FROM tbl_Metering AS A
WHERE fk_MetPoint = 1
How could I improve this query?
Thanks in advance
You can try using a window function instead:
SELECT m.dat_Time,
       (m.int_Counts - m.int_Counts_2) AS delta
FROM (SELECT m.*,
             MAX(CASE WHEN fk_MetPoint = 2 THEN int_Counts END) OVER (PARTITION BY dat_Time) AS int_Counts_2
      FROM tbl_Metering m
     ) m
WHERE fk_MetPoint = 1
From a query point of view, you should at a minimum change to a set-based approach (using a GROUP BY) instead of an inline sub-query for each row, but it is a good candidate for a windowing query, just as suggested by the "Great" Gordon Linoff.
However, if this is a metering project, then we should expect a high volume of records, if not now, then certainly over time.
I would recommend you look into altering the input so that the delta is stored as its own first-class column. This moves much of the performance hit to the write process, which presumably only ever occurs once for each record, whereas your select will be executed many times.
This can be done using an INSTEAD OF trigger (a sketch follows at the end of this answer), or you could write it into the business logic. In a recent IoT project we computed and stored these additional properties with each inserted reading to greatly simplify many types of aggregate and analysis queries:
Id of the Previous sequential reading
Timestamp of the Previous sequential reading
Value Delta
Time Delta
Number of readings between this and the previous reading
The last one sounds close to your scenario; we were deliberately batching multiple sequential readings into a single record.
You could also process the received data into a separate table that includes this level of aggregation information, so as not to pollute the raw feed and to allow you to re-process it on demand.
You could redirect your analysis queries to this second table, which is now effectively a data warehouse of sorts.
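As an illustration of the trigger approach mentioned above, here is a minimal SQL Server-style sketch. It assumes a new int_Delta column has been added to tbl_Metering and computes the delta against the previous sequential reading of the same meter point (the trigger and column names are made up):
CREATE TRIGGER trg_Metering_ComputeDelta
ON tbl_Metering
INSTEAD OF INSERT
AS
BEGIN
    -- Store each reading together with its delta against the previous
    -- reading of the same meter point (NULL for the first reading)
    INSERT INTO tbl_Metering (dat_Time, int_Counts, fk_MetPoint, int_Delta)
    SELECT i.dat_Time,
           i.int_Counts,
           i.fk_MetPoint,
           i.int_Counts - prev.int_Counts
    FROM inserted AS i
    OUTER APPLY (SELECT TOP (1) m.int_Counts
                 FROM tbl_Metering AS m
                 WHERE m.fk_MetPoint = i.fk_MetPoint
                   AND m.dat_Time < i.dat_Time
                 ORDER BY m.dat_Time DESC) AS prev;
END;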

Slow running U-SQL Job due to SqlFilterTransformer

I have a U-SQL job that extracts data from 2 .tsv and 2 .csv files, selects some features and performs some simple transformations before outputting to csv/tsv files in ADL.
However, when I attempt to add further transformations within SELECT statements, it seems the job takes considerably longer to run (10+ mins vs 1 min), due to one SELECT statement in particular.
I believe it is due to the calculation of the 'YearMonth' column, where I have essentially used concatenation to get the date column to the format I need it in.
Below is the job that runs quickly:
@StgCrime =
SELECT CrimeID,
[Month],
ReportedBy,
FallsWithin,
Longitude,
Latitude,
Location,
LSOACode,
LSOAName,
CrimeType,
LastOutcome,
Context
FROM @ExtCrime;
OUTPUT @StgCrime
TO "CrimeOutput/Crimes.csv"
USING Outputters.Csv(outputHeader:true);
And the job that takes a lot longer:
@StgCrime =
SELECT CrimeID,
String.Concat([Month].Substring(0, 4),[Month].Substring(5, 2)) AS YearMonth,
ReportedBy AS ForceName,
Longitude,
Latitude,
Location,
LSOACode,
CrimeType,
LastOutcome
FROM @ExtCrime;
OUTPUT @StgCrime
TO @OCrime
USING Outputters.Csv(outputHeader:true);
The difference in the Vertex view (screenshots): the simple/quick job vs. the job with the additional transformation.
Can anyone help clarify this for me? Surely that one transformation shouldn't cause such an increase in job run time?
The data file being queried is made up of 1,066 csv files, around 2.5GB in total.
Without seeing the full script, the generated job graph, and the number of specified AUs, it is a bit hard to estimate why one job ran that much slower than the other.
You say that the "data file" is made up of 1,066 CSV files, all of which seem rather small at 2.5 GB in total. I would expect that you probably get 1,066 extract vertices in the extract stage. Is that the same for the simple job as well?
We have a new feature in preview that will group up to 200 files (or 1 GB, whichever comes first) into a single vertex to minimize the vertex start-up time.
Can you try your job with the following statement added:
SET @@FeaturePreviews = "InputFileGrouping:on";

Find out the amount of space each field takes in Google Big Query

I want to optimize the space of my Big Query and google storage tables. Is there a way to find out easily the cumulative space that each field in a table gets? This is not straightforward in my case, since I have a complicated hierarchy with many repeated records.
You can do this in the Web UI by simply typing (not running) the query below, changing <column_name> to the field of your interest
SELECT <column_name>
FROM YourTable
and looking at the validation message, which contains the respective size.
Important: you do not need to run it; just check the validation message for bytesProcessed, and this will be the size of the respective column.
Validation is free and invokes a so-called dry run.
If you need to do such "column profiling" for many tables or for a table with many columns, you can code this in your preferred language using the Tables.get API to get the table schema, then loop through all the fields, build the respective SELECT statement, and finally dry-run it (within the loop, for each column) and read totalBytesProcessed, which as you already know is the size of the respective column.
I don't think this is exposed in any of the meta data.
However, you may be able to easily get good approximations based on your needs. The number of rows is provided, so for some of the data types, you can directly calculate the size:
https://cloud.google.com/bigquery/pricing
For types such as string, you could get the average length by querying e.g. a sample of 1000 values, and use this for your storage calculations.
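For example, a sketch of the string case (standard SQL; the column name is a placeholder): average the length over a small sample, then multiply by the total row count to approximate the column's storage. Note that LENGTH counts characters, so this is only a rough proxy for bytes when the data is not plain ASCII.
-- Approximate average length of a STRING column from a 1000-row sample
SELECT AVG(LENGTH(your_string_column)) AS avg_len
FROM (
  SELECT your_string_column
  FROM YourTable
  LIMIT 1000
) AS t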

looping in a Kettle transformation

I want to repeatedly execute an SQL query that looks like this:
SELECT '${date.i}' AS d,
COUNT(DISTINCT xid) AS n
FROM table
WHERE date
BETWEEN DATE_SUB('${date.i}', INTERVAL 6 DAY)
AND '${date.i}'
;
It is basically a grouping by time spans, except that the spans overlap, which prevents the use of GROUP BY.
That is why I want to execute the query repeatedly, once for every day in a certain time span. But I am not sure how I should implement the loop. What solution would you suggest?
The Kettle variable date.i is initialized from a global variable. The transformation is just one of several in the same transformation bundle. The "stop trafo" would be implemented, maybe implicitly, by just not re-entering the loop.
Here's the flow chart (flow of the transformation):
In step "INPUT" I create a result set with three identical fields keeping the dates from ${date.from} until ${date.until} (Kettle variables). (for details on this technique check out my article on it - Generating virtual tables for JOIN operations in MySQL).
In step "SELECT" I set the data source to be used ("INPUT") and that I want "SELECT" to be executed for every row in the served result set. Because Kettle maps parameters 1 on 1 by a faceless question-mark I have to serve three times the same paramter - for each usage.
The "text file output" finally outputs the result in a generical fashion. Just a filename has to be set.
Content of the resulting text output for 2013-01-01 until 2013-01-05:
d;n
2013/01/01 00:00:00.000;3038
2013/01/02 00:00:00.000;2405
2013/01/03 00:00:00.000;2055
2013/01/04 00:00:00.000;2796
2013/01/05 00:00:00.000;2687
I am not sure if this is the slickest solution, but it does the trick.
In Kettle you want to avoid loops, as they can cause real trouble in transforms. Instead, you should do this by adding a step that puts a row in the stream for each date you want (with the value stored in a field) and then using that field value in the query.
ETA: The stream is the thing that moves rows (records) between steps. It may help to think of it as a table at each hop that temporarily holds rows between steps.
You want to avoid loops because a Kettle transform is only sequential at the row level: rows may be processed in parallel and out of order, and the only guarantee is that each row passes through the steps in order. Because of this, a loop in a transform does not behave as you would intuitively expect.
FYI, it also sounds like you might need to go through some of the Kettle tutorials if you are still unclear about what the stream is.

long running queries: observing partial results?

As part of a data analysis project, I will be issuing some long-running queries on a MySQL database. My future course of action is contingent on the results I obtain along the way. It would be useful for me to be able to view partial results generated by a SELECT statement that is still running.
Is there a way to do this? Or am I stuck waiting until the query completes to view results that were generated in the very first seconds it ran?
Thank you for any help :)
In the general case, a partial result cannot be produced. For example, if you have an aggregate function with a GROUP BY clause, then all the data has to be analysed before the first row is returned. A LIMIT clause will not help you, because it is applied after the output is computed. Maybe you can give concrete data and the SQL query?
One thing you may consider is sampling your tables down. This is good practice in data analysis in general to get your iteration speed up while you're writing code.
For example, suppose you have table-creation privileges and some mega-huge table X with key unique_id and some data data_value.
If unique_id is numeric, then in nearly any database
create table sample_table as
select unique_id, data_value
from X
where mod(unique_id, <some_large_prime_number_like_1013>) = 1
will give you a random sample of data to work your queries out on, and you can inner join your sample_table against the other tables to speed up testing and query results. Thanks to the sampling, your query results should be roughly representative of what you would get from the full table. Note that the number you're modding by has to be prime, otherwise it won't give a correct sample. The example above will shrink your table down to about 0.1% of the original size (0.0987% to be exact).
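For instance, testing against the sample might look like this (the second table Y and its other_value column are made-up names for illustration):
-- Run test queries against the sample instead of the mega-huge table X
SELECT s.unique_id, s.data_value, y.other_value
FROM sample_table AS s
INNER JOIN Y AS y
    ON y.unique_id = s.unique_id;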
Most databases also have better sampling and random-number methods than just using mod. Check the documentation to see what's available for your version.
Hope that helps,
McPeterson
It depends on what your query is doing. If it needs to have the whole result set before producing output, as might happen for queries with GROUP BY, ORDER BY, or HAVING clauses, then there is nothing to be done.
If, however, the reason for the delay is client-side buffering (which is the default mode), then that can be adjusted by using mysql_use_result as an attribute of the database handle rather than the default mysql_store_result. This is true for the Perl and Java interfaces; I think in the C interface you have to use an unbuffered version of the function that executes the query.