USQL Custom Extractor - Latest Version - azure-data-lake

I have a data lake that gets sent data whenever the source system is updated. This can result in a single item being sent multiple times, once for each version of it.
In U-SQL I can retrieve everything, then partition the data set and get the latest version of each item.
However, it doesn't look like variables are available in views? I'd like a view for ease of access by other teams. e.g.
CREATE VIEW MyDatabase.DataLakeViews.LastestDataVersion
AS
@output =
    EXTRACT MyKey string,
            MyData string,
            EventEnqueuedUtcTime DateTime
    FROM @"adl://bwdatalakestore.azuredatalakestore.net/Stream/MGS/pts/sportsbook/betinfo/csv/2017/11/27/{*}.csv"
    USING Extractors.Text(delimiter : '|', skipFirstNRows : 1);

@PartitionedOutput =
    SELECT *,
           ROW_NUMBER() OVER(PARTITION BY MyKey ORDER BY EventEnqueuedUtcTime DESC) AS RowNumber
    FROM @output;

@FinalOutput =
    SELECT *
    FROM @PartitionedOutput
    WHERE RowNumber == 1;

OUTPUT @FinalOutput
TO "/ReferenceGuide/QSE/Extract/SearchLog_extracted.txt"
USING Outputters.Tsv();
This doesn't work in a view. Is there a way to shorthand this partitioning, rather than putting it into every query?
Possibly a way to achieve this via a custom extractor? It appears that extractors work by looping over each row, so maybe they're not suited here...

Views follow the SQL convention of being defined, without parameters, over a single expression.
What you want is a parameterized view, which in U-SQL is a table-valued function.
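A table-valued function wrapping the logic from the question might look like the following sketch (the function and schema names are illustrative; the extract path and column list are taken from the question, and other teams would then call the function instead of a view):

```sql
CREATE FUNCTION MyDatabase.dbo.LatestDataVersion()
RETURNS @result TABLE(MyKey string, MyData string, EventEnqueuedUtcTime DateTime)
AS
BEGIN
    @output =
        EXTRACT MyKey string,
                MyData string,
                EventEnqueuedUtcTime DateTime
        FROM @"adl://bwdatalakestore.azuredatalakestore.net/Stream/MGS/pts/sportsbook/betinfo/csv/2017/11/27/{*}.csv"
        USING Extractors.Text(delimiter : '|', skipFirstNRows : 1);

    @result =
        SELECT MyKey, MyData, EventEnqueuedUtcTime
        FROM (
            SELECT *,
                   ROW_NUMBER() OVER(PARTITION BY MyKey ORDER BY EventEnqueuedUtcTime DESC) AS RowNumber
            FROM @output
        ) AS t
        WHERE RowNumber == 1;
END;
```

Callers can then simply write `SELECT * FROM MyDatabase.dbo.LatestDataVersion() AS v;`, and the function could later take parameters (e.g. a date) if the path needs to vary.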

Related

Is the order of a view guaranteed?

PostgreSQL allows ORDER BY in views, so for example I can write a view like this:
CREATE VIEW people_overview AS
SELECT
    id
  , name1
  , name2
FROM person
ORDER BY
    name2
  , name1
Let's say I have an application where I now use this view like this:
SELECT * FROM people_overview
The app then reads all the data and displays it to the user in some way, like with a grid.
Is it guaranteed that in this situation the ordering specified in the view is maintained when returning the rows to the application?
Or would I be better off coding the ORDER BY into the application?
As per the comments: define the view without an ORDER BY (unless some subquery needs it for TOP N ROWS type purposes) and let the ultimate user of the view determine the sort order they want. That order is then guaranteed to be what they want, and there is less risk of the data being sorted twice, once needlessly. The optimizer should realise that the in-view ordering is redundant when a `select * from view order by x` is applied, but there's little point in taking the risk or adding the extra code clutter.
I'd extend this philosophy to things like data conversion and formatting too: leave the data in its as-stored form so it remains useful for as long as possible, and let the calling application decide on formatting (i.e. don't format all your dates to a yyyyMMdd string in your view if the calling app will then have to parse them again to do date math on them, etc.)
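In other words, keep the view unordered and sort at the call site, along these lines:

```sql
CREATE VIEW people_overview AS
SELECT
    id
  , name1
  , name2
FROM person;

-- the application decides the ordering it actually needs
SELECT * FROM people_overview ORDER BY name2, name1;
```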

Oracle sql: is there a way to compose queries using aliases for entire clauses?

Analyzing an Oracle DB of an application of mine, I always run queries ending with the very same "order by" clause, given that every table has a date type "DT_EXTRACTION" column.
Is there a way to define an alias for String "order by DT_EXTRACTION desc" (say, equals to $DD) and write my query like this?
select *
from foo
$DD;
Since you're using SQL Developer you could (ab)use substitution variables for this:
define DD='order by DT_EXTRACTION desc'
select * from your_table
&DD
but you'd have to either define that string in each script/session, or add it to a login script to make it always available (which you can choose from Tools->Preferences->Database).
That would work in SQL*Plus too.
SQL Developer also has 'snippets', which you can view and manage from the panel revealed by View->Snippets. You can add your own snippet for that order by clause, and then drag-and-drop it from the snippets panel into your code wherever you need it. Not quite what you asked for, but still useful. @thatjeffsmith has a write-up with pictures, so I won't repeat those details here.
You may find code templates useful too. From Tools->Preferences->Database choose SQL Editor Code Templates, and define a new one for your string.
Then in the worksheet, type as far as:
select * from your_table DD
hit control-space and it will expand automatically to
select * from your_table order by dt_extraction desc

How to lower case entire column data in Google Cloud BigQuery

I am trying to find a "quick" way to lower case all the data (strings) in a table's column inside Google Cloud BigQuery.
Before going into building a script, I'm looking for a shorter way, like a plain query.
How can I query BigQuery to lower case entire column?
You can use an UPDATE statement:
UPDATE YourTable
SET string_column = LOWER(string_column)
WHERE true;
How can I query BigQuery to lower case entire column?
Definitely LOWER is the function to use
For example
#standardSQL
WITH `dataset.table` AS (
  SELECT
    'https://stackoverflow.com/q/44970976/5221944' AS url,
    'How to lower case entire column data in Google Cloud BigQuery' AS title
)
SELECT * REPLACE(LOWER(title) AS title)
FROM `dataset.table`
I am trying to find a "quick" way to lower case all the data
From what I see in your question, I would not recommend using DML's UPDATE: it is costly, not necessarily "quick", and certainly not flexible if you later change your mind and want, say, UPPER or some other casing (camel case, for example).
The quick way in your case, as I see it, is to create a view like the one below. It is cheap ($0.00) and flexible enough to accommodate any logic for transforming columns in the original table:
SELECT * REPLACE(LOWER(title) AS title)
FROM `dataset.table`
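The full view definition would be along these lines (the view name here is illustrative, and assumes BigQuery's standard SQL DDL support for views):

```sql
#standardSQL
CREATE VIEW `dataset.table_lowered` AS
SELECT * REPLACE(LOWER(title) AS title)
FROM `dataset.table`;
```

Consumers then query `dataset.table_lowered` exactly as they would the original table, and the casing logic lives in one place.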
Found it out myself: it can be done using UPDATE as Elliott suggested, but it must use standard SQL. I used the #standardSQL declaration for that.
#standardSQL
UPDATE dataset.table
SET field = LOWER(field)
WHERE TRUE

How do I restate a partition in C# in BigQuery?

I have an unpartitioned table in BigQuery called "rawdata". That table is becoming quite big, and I would like to partition it. I did not find any way to partition the original table, but according to https://cloud.google.com/bigquery/docs/creating-partitioned-tables#restating_data_in_a_partition, I can run a command line that will push from unpartitioned "rawdata" into a partitioned table using a query, but only for a specific day/partition.
My instinct was to use the C# API (we already append data through that) to automate the process of doing the bq query --replace restating from the unpartitioned table, but there doesn't seem to be anything that can do that in the C# code. Do you guys have any recommendation on how to proceed? Should I wrap the bq command-line execution instead of using the Google API?
I am not certain which portion of the API you are referring to, but it looks like you are referring to the Query API here: https://cloud.google.com/bigquery/docs/reference/v2/jobs/query#examples, which won't allow you to pass in a destination table and truncate/append to it.
The Insert API here: https://cloud.google.com/bigquery/docs/reference/v2/jobs/insert#examples can be used to do what you like by filling in the Configuration.Query part here: https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.query
Specifically, you'd want to specify the 'configuration.query.destinationTable' field to be the table partition you want to populate, and the 'configuration.query.createDisposition' and 'configuration.query.writeDisposition' fields that match your requirements.
This is effectively what the shell client does passing the '--replace' and '--destination_table' parameters.
Finally, you might want to check this thread for cost considerations when you created a partitioned table from a non-partitioned one: Migrating from non-partitioned to Partitioned tables
Based on Ahmed's comments I put together some C# code using the C# API client that can restate data for a partition. Our use case was to remove personal information after it had served its purpose. The code below sets the second field to a blank value. I noticed you couldn't use the your_table$20180114 partition-decorator syntax inside the query itself, as you would in the CLI tool, so that is changed to a WHERE clause to retrieve the original data from the corresponding partition.
var queryRestateJob = _sut.CreateQueryJob(
        @"SELECT field1, '' AS field2, field3
          FROM your_dataset.your_table
          WHERE _PARTITIONTIME = TIMESTAMP('2018-01-14')",
        new List<BigQueryParameter>(),
        new QueryOptions
        {
            CreateDisposition = CreateDisposition.CreateIfNeeded,
            WriteDisposition = WriteDisposition.WriteTruncate,
            DestinationTable = new TableReference
            {
                DatasetId = "your_dataset",
                ProjectId = "your_project",
                TableId = "your_table$20180114"
            },
            AllowLargeResults = true,
            FlattenResults = false,
            ProjectId = "your_project"
        })
    .PollUntilCompleted();
queryRestateJob = queryRestateJob.ThrowOnAnyError();

Spring JDBCDaoSupport - dealing with multiple selects

Does Spring have any features which allow for a batch select? I basically have n selects to execute, depending on the number of records in a list. At the moment, because the size of the list is always changing, I have to dynamically build the SQL that will be executed. The end product looks something like this:
select * from record_details t WHERE id IN ((?),(?),(?))
However the code to generate this SQL on the fly is messy and I'm wondering if there is a nicer approach for this type of problem?
The NamedParameterJdbcTemplate (and the corresponding support class, NamedParameterJdbcDaoSupport) does have that support.
public void someRepoMethod(List<Long> ids) {
    String query = "select * from record_details where id in (:ids)";
    getNamedParameterJdbcTemplate().query(query, Collections.singletonMap("ids", ids), new YourRowMapper());
}
If you don't want to generate the SQL yourself, you have to use an existing framework. From what I know, MyBatis is more lightweight than Hibernate, so it may suit you better, but there may be other, more suitable options.
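One caveat with the `in (:ids)` approach: some databases cap the number of elements in an IN list (Oracle at 1000, for example), so for very large lists you may still need to run the query in batches. A minimal, hypothetical helper for that chunking (`IdBatcher` is not part of Spring; it just splits the list so each batch can be bound to `:ids` in turn):

```java
import java.util.ArrayList;
import java.util.List;

public class IdBatcher {

    // Splits a list into consecutive chunks of at most chunkSize elements,
    // so each chunk can be bound to an IN (:ids) clause separately.
    // subList returns views of the original list, which is fine for read-only use.
    public static <T> List<List<T>> partition(List<T> ids, int chunkSize) {
        List<List<T>> chunks = new ArrayList<>();
        for (int i = 0; i < ids.size(); i += chunkSize) {
            chunks.add(ids.subList(i, Math.min(i + chunkSize, ids.size())));
        }
        return chunks;
    }
}
```

The repository method would then loop over `IdBatcher.partition(ids, 1000)`, run the named-parameter query once per chunk, and concatenate the results.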