How to parse big string U-SQL Regex


I have large CSV files that contain long strings, and I want to parse them in U-SQL.
@t1 =
SELECT
Regex.Match("ID=881cf2f5f474579a:T=1489536183:S=ALNI_MZsMMpA4voGE4kQMYxooceW2AOr0Q", "ID=(?<ID>\\w+):T=(?<T>\\w+):S=(?<S>[\\w\\d_]*)") AS p
FROM
(VALUES(1)) AS fe(n);
@t2 =
SELECT
p.Groups["ID"].Value AS gads_id,
p.Groups["T"].Value AS gads_t,
p.Groups["S"].Value AS gads_s
FROM
@t1;
OUTPUT @t2
TO "/inhabit/test.csv"
USING Outputters.Csv();
Severity Code Description Project File Line Suppression State
Error E_CSC_USER_INVALIDCOLUMNTYPE:
'System.Text.RegularExpressions.Match' cannot be used as column type.
I know how to do it the SQL way with EXPLODE/CROSS APPLY/GROUP BY, but is it possible to do this without those contortions?
One more update
@t1 =
SELECT
Regex.Match("ID=881cf2f5f474579a:T=1489536183:S=ALNI_MZsMMpA4voGE4kQMYxooceW2AOr0Q", "ID=(?<ID>\\w+):T=(?<T>\\w+):S=(?<S>[\\w\\d_]*)").Groups["ID"].Value AS id,
Regex.Match("ID=881cf2f5f474579a:T=1489536183:S=ALNI_MZsMMpA4voGE4kQMYxooceW2AOr0Q", "ID=(?<ID>\\w+):T=(?<T>\\w+):S=(?<S>[\\w\\d_]*)").Groups["T"].Value AS t,
Regex.Match("ID=881cf2f5f474579a:T=1489536183:S=ALNI_MZsMMpA4voGE4kQMYxooceW2AOr0Q", "ID=(?<ID>\\w+):T=(?<T>\\w+):S=(?<S>[\\w\\d_]*)").Groups["S"].Value AS s
FROM
(VALUES(1)) AS fe(n);
OUTPUT @t1
TO "/inhabit/test.csv"
USING Outputters.Csv();
This variant works fine, but it raises a question: will the regex be evaluated three times per row? Is there any way to hint to the U-SQL engine that the function Regex.Match is deterministic?

You should probably be using something more efficient than Regex.Match. But to answer your original question:
System.Text.RegularExpressions.Match is not part of the built-in U-SQL types.
Thus you would need to convert it into a built-in type, such as string or SqlArray<string>, or wrap it in a UDT that provides an IFormatter to make it a user-defined type.
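To illustrate the "convert it into a built-in type" idea, here is a minimal sketch in Python rather than U-SQL: the pattern is matched once, and only a plain list of strings is kept. The function name parse_gads is hypothetical, and Python writes named groups as (?P<name>...) where .NET uses (?<name>...).

```python
import re

# Compiled once and reused. [\w\d_]* from the question is equivalent to \w*,
# because \w already covers digits and the underscore.
GADS_PATTERN = re.compile(r"ID=(?P<ID>\w+):T=(?P<T>\w+):S=(?P<S>\w*)")


def parse_gads(value):
    """Match once and return [id, t, s] as plain strings, or None on no match."""
    if value is None:
        return None
    # search() looks anywhere in the string, like .NET's Regex.Match.
    m = GADS_PATTERN.search(value)
    if m is None:
        return None
    return [m.group("ID"), m.group("T"), m.group("S")]


parts = parse_gads(
    "ID=881cf2f5f474579a:T=1489536183:S=ALNI_MZsMMpA4voGE4kQMYxooceW2AOr0Q"
)
```

Because the match object never leaves the function, the caller only ever sees strings, which is the property the U-SQL column type check is asking for.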

It looks like it is better to use something like this to parse simple strings. Regexes are slow for this task, and if I use simple string expressions (instead of CLR calls), they will probably be translated into C++ code at the codegen phase, eliminating the .NET interop (I'm not sure).
@t1 =
SELECT
pv.cust_gads != null ? new SQL.ARRAY<string>(pv.cust_gads.Split(':')) : null AS p
FROM
dwh.raw_page_view_data AS pv
WHERE
pv.year == "2017" AND
pv.month == "04";
@t3 =
SELECT
p != null && p.Count == 3 ? p[0].Split('=')[1] : null AS id,
p != null && p.Count == 3 ? p[1].Split('=')[1] : null AS t,
p != null && p.Count == 3 ? p[2].Split('=')[1] : null AS s
FROM
@t1 AS t1;
OUTPUT @t3
TO "/tmp/test.csv"
USING Outputters.Csv();
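The split-based approach above can be sketched in Python as follows. This is an analogue of the logic, not U-SQL; the function name split_gads is hypothetical, and the extra guard for a missing '=' is an addition that the script above does not have.

```python
def split_gads(value):
    """Return (id, t, s) from 'ID=..:T=..:S=..', or None if the shape is unexpected."""
    if value is None:
        return None
    parts = value.split(":")
    if len(parts) != 3:
        return None
    fields = []
    for part in parts:
        # partition() keeps everything after the first '=', whereas
        # Split('=')[1] would keep only the second segment.
        _key, sep, val = part.partition("=")
        if not sep:
            # No '=' at all: treat the whole value as malformed.
            return None
        fields.append(val)
    return tuple(fields)


fields = split_gads(
    "ID=881cf2f5f474579a:T=1489536183:S=ALNI_MZsMMpA4voGE4kQMYxooceW2AOr0Q"
)
```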

Related

Does BigQuery have a safe navigation operator?

Does BigQuery have a safe navigation operator, i.e. a null-safe variant of its field navigation operator?
Ideally I'm looking for an operator akin to ?. in Swift/TypeScript, &. in Ruby, etc., but a function I could call would suffice as well.
Right now my query looks like:
SELECT a.b.c.d.e
FROM myTable AS a
WHERE
a.b IS NOT NULL
&& a.b.c IS NOT NULL
&& a.b.c.d IS NOT NULL
&& a.b.c.d.e = "my desired value"
Edit: This doesn't actually work.
Name b not found inside a at [12:34]
I wish it could be something like:
SELECT a.b.c.d.e
FROM myTable AS a
WHERE a?.b?.c?.d?.e = "my desired value"

AFAIK, there is no safe navigation operator for the STRUCT type in BigQuery.
What I can come up with is to convert the nested STRUCT to the JSON type and use JSON path navigation, which lets you navigate safely.
WITH myTable AS (
SELECT STRUCT(STRUCT(STRUCT('my_desired_value' AS e) AS d) AS c) AS b
)
SELECT TO_JSON(b).c.d.e, --
TO_JSON(b).f.d.e, -- non-existing path
-- b.f.d.e --> error - Field name f does not exist ...
FROM myTable AS a;
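The behaviour this answer relies on, where a missing step in the path yields NULL instead of an error, can be sketched in Python over nested dictionaries. The helper safe_get is hypothetical and only mirrors the idea; it is not a BigQuery function.

```python
def safe_get(value, *path):
    """Walk nested dicts along path; return None as soon as a step is missing."""
    current = value
    for key in path:
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


# Same shape as the STRUCT built in the query above.
row = {"b": {"c": {"d": {"e": "my_desired_value"}}}}

found = safe_get(row, "b", "c", "d", "e")
missing = safe_get(row, "b", "f", "d", "e")  # non-existing path
```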
To check the field paths of a STRUCT column, you can use INFORMATION_SCHEMA.COLUMN_FIELD_PATHS.
SELECT *
FROM `your-project.your_dataset.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS`
WHERE table_name = 'myTable';

Error converting data type varchar to bigint. when using update

I'm using SQL Server. I'm trying to run the following SQL script and I'm getting this error:
Error converting data type varchar to bigint.
This is the script:
with T as
(
select
sp.ProfileId
,sp.ProfileHandle
,sp.[Description]
from
[SocialProfile] sp
inner join
Entity e on sp.EntityId = e.EntityId
where
e.EntityStatusId != 3 and e.EntityStatusId != 4
and sp.SocialProfileTypeId in (1, 2, 10)
and (ISNUMERIC(sp.ProfileHandle) = 1)
and IsRemoved = 0
)
update T
set ProfileHandle = NULL
where ProfileHandle = ProfileId
I tried to use cast function but it doesn't work. Can someone tell me what I'm doing wrong?

The isnumeric() doesn't give the protection you want. Use try_convert() for the comparison:
with T as (
select sp.ProfileId, sp.ProfileHandle, sp.[Description]
from [SocialProfile] sp inner join
Entity e
on sp.EntityId = e.EntityId
where e.EntityStatusId not in (3, 4) and
sp.SocialProfileTypeId in (1, 2, 10) and
ISNUMERIC(sp.ProfileHandle) = 1 and -- you can leave it in
IsRemoved = 0
)
update T
set ProfileHandle = NULL
where try_convert(bigint, ProfileHandle) = ProfileId;
SQL Server has a "feature" where it will rearrange the conditions in the query. The CTE does not get executed "first", so the isnumeric() is not necessarily run before the conversion in the where. I consider this a bug. Microsoft considers this a feature (because it provides more options for optimization).
The only ordering guaranteed in a query is through case expressions. The simplest work-around is try_convert().
In addition, I strongly recommend never relying on implicit conversions. Always explicitly convert. I have spent many hours debugging code for problems caused by implicit conversion.
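The idea behind try_convert(), returning NULL instead of raising when a value cannot be converted, can be sketched in Python. The helper try_convert_int and the sample rows are hypothetical, and Python's int() does not accept exactly the same inputs as SQL Server, so treat this as an illustration of the pattern only.

```python
def try_convert_int(text):
    """Return text as an int, or None if it cannot be converted."""
    try:
        return int(text)
    except (TypeError, ValueError):
        return None


# Hypothetical rows shaped like the SocialProfile columns in the question.
rows = [
    {"ProfileId": 42, "ProfileHandle": "42"},
    {"ProfileId": 7, "ProfileHandle": "some_handle"},
    {"ProfileId": 8, "ProfileHandle": "1e5"},
    {"ProfileId": 9, "ProfileHandle": None},
]

for row in rows:
    # A failed conversion yields None, which never equals an id, so the
    # comparison is safe no matter which rows reach it or in what order.
    if try_convert_int(row["ProfileHandle"]) == row["ProfileId"]:
        row["ProfileHandle"] = None
```

Only the first row is cleared; the others are left alone without any conversion error.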

Check if field exists in CosmosDB JSON with SQL - nodeJS

I am using Azure CosmosDB to store documents (JSON).
I am trying to query all documents that contain the field "abc", and not return the documents that do not have the field "abc". For example, return the first object below and not the second:
{
"abc": "123"
}
{
"jkl": "098"
}
I am trying to use the following code:
client.queryDocuments(
collectionUrl,
`SELECT r.id, r.authToken.instagram,r.userName FROM root r WHERE r.abc`
)
I assumed the above would check whether abc exists, similar to if (r.abc) {}.
I have also tried using WHERE r.abc IS NOT NULL.
Thanks in advance

If you want to know whether a field exists, you should use IS_DEFINED(c.FieldName).
If you want to know whether the field has a non-null value, use
c.FieldName != null or
c.FieldName <> null (apparently)
I use variations of this in production:
SELECT c.FieldName
FROM c
WHERE IS_DEFINED(c.FieldName)
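The difference between "the field exists" and "the field has a non-null value" can be sketched in Python with plain dictionaries standing in for documents. The third document, with an explicit null, is an added assumption used to show where the two checks diverge; it also assumes that IS_DEFINED treats an explicit null as defined.

```python
# Stand-ins for JSON documents; "3" has the field but with a null value.
docs = [
    {"id": "1", "abc": "123"},
    {"id": "2", "jkl": "098"},
    {"id": "3", "abc": None},
]

# Analogue of WHERE IS_DEFINED(c.abc): the key is present, whatever its value.
defined = [d["id"] for d in docs if "abc" in d]

# Analogue of WHERE c.abc != null: the key is present and its value is not null.
not_null = [d["id"] for d in docs if d.get("abc") is not None]
```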

All you need to do is change your query to
SELECT r.id, r.authToken.instagram,r.userName FROM root r WHERE r.abc != null
or
SELECT r.id, r.authToken.instagram,r.userName FROM root r WHERE r.abc <> null
Both operators work (tested on the Data Explorer)

To negate the check, add the NOT operator to the SQL query:
SELECT r.id, r.authToken.instagram,r.userName
FROM root r
WHERE NOT IS_DEFINED(r.abc)
This returns all entries where the field abc doesn't exist.

PostgreSQL conditional where clause

In my Ruby on Rails app I'm using Blazer (https://github.com/ankane/blazer) and I have the following SQL query:
SELECT *
FROM survey_results sr
LEFT JOIN clients c ON c.id = sr.client_id
WHERE sr.client_id = {client_id}
This query works well, but I need to add conditional logic that checks whether the client_id variable is present. If it is, I filter by it; if not, the WHERE clause should not be applied. How can I do this in PostgreSQL?

Check whether it's null OR your condition holds, like this:
WHERE {client_id} IS NULL OR sr.client_id = {client_id}
The WHERE clause evaluates as follows: if the variable is empty, the first condition is true, so the whole clause is true and no filter is applied. If not, evaluation continues to the next condition of the OR.
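Here is a small runnable sketch of this "IS NULL OR" pattern. It uses Python's built-in sqlite3 module with an in-memory database instead of PostgreSQL and Blazer, so the table contents, the helper results_for, and the :client_id parameter style are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE survey_results (id INTEGER PRIMARY KEY, client_id INTEGER)"
)
# Three sample rows; ids are assigned 1, 2, 3 in insertion order.
conn.executemany(
    "INSERT INTO survey_results (client_id) VALUES (?)", [(1,), (2,), (2,)]
)


def results_for(client_id=None):
    """Return result ids, filtered by client_id only when one is given."""
    sql = (
        "SELECT id FROM survey_results "
        "WHERE :client_id IS NULL OR client_id = :client_id "
        "ORDER BY id"
    )
    return [row[0] for row in conn.execute(sql, {"client_id": client_id})]


everything = results_for()       # no filter applied
only_client_2 = results_for(2)   # filtered by client_id
```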

If anyone runs into the psql error "operator does not exist: bigint = bytea", here is my workaround:
WHERE ({client_id} < 0 AND sr.client_id > {client_id}) OR sr.client_id = {client_id}
Note that client_id generally cannot be negative, so you can use that fact to avoid the cast issue.

My solution:
I use Spring Data JPA with a native query.
Here is my repository interface signature.
@Query(... where (case when 0 in :itemIds then true else i.id in :itemIds end) ...)
List<Item> getItems(@Param("itemIds") List<Long> itemIds);
Before calling this method, I check whether itemIds is null. If so, I set it to a list containing 0L:
if(itemIds == null) {
itemIds = new ArrayList<Long>();
itemIds.add(0L);
}
itemRepo.getItems(itemIds);
My IDs start from 1, so there is never a case where ID = 0.

How to make Linq to SQL translate to a derived column?

I have a table with a 'Wav' column that is of type 'VARBINARY(max)' (storing a wav file) and would like to be able to check if there is a wav from Linq to SQL.
My first approach was to do the following in Linq:
var result = from row in dc.Table
select new { NoWav = row.Wav != null };
The problem with the code above is that it will retrieve all the binary content into RAM, which isn't good (slow and memory hungry).
Any idea how to get the LINQ query to translate into something like the SQL below?
SELECT (CASE WHEN Wav IS NULL THEN 1 ELSE 0 END) As NoWav FROM [Update]

Thanks for all the replies. They all make sense. Indeed, LINQ should translate the != null correctly, but it didn't seem to do so effectively: running my code was very slow, so my only explanation is that the binary data got transferred into RAM... but maybe I'm wrong.
I think I found a workaround anyway somewhere else on Stack Overflow: Create a computed column on a datetime
I ran the following query against my table:
ALTER TABLE [Table]
ADD WavIsNull AS (CASE WHEN [Wav] IS NULL Then (1) ELSE (0) END)
Now I'll update my DBML to reflect that computed column and see how it goes.

Are you sure that this code will retrieve the data to RAM?
I did some testing using LINQPad and the generated SQL was optimized as you suggest:
from c in Categories
select new
{
Description = c.Description != null
}
SELECT
(CASE
WHEN [t0].[description] IS NOT NULL THEN 1
ELSE 0
END) AS [Description]
FROM [Category] AS [t0]

What about this query:
var result = from row in dc.Table where row.Wav == null
select row.PrimaryKey
for a list of keys where your value is null. For listing of null/not null you could do this:
var result = from row in db.Table
select new
{ Key = row.Key, NoWav = (row.Wav == null ? true : false) };
That will generate SQL code similar to this:
SELECT [t0].[WavID] AS [Key],
(CASE
WHEN [t0].[Wav] IS NULL THEN 1
ELSE 0
END) AS [NoWav]
FROM [tblWave] AS [t0]

I'm not clear here: your SQL code is going to return a list of 1s and 0s from your database. Is that what you are looking for? If you have an ID for your record, you could just retrieve that single record with a condition on the Wav field; an empty result would indicate no wav, i.e.
var result = from row in dc.Table
where (row.ID == id) && (row.Wav != null)
select new { row.Wav };
