Combining two rowsets in ADLA without join on clause - azure-data-lake

I've got two types of input files I'm loading into an ADLA job. In one, I've got a bunch of data (left) and in another, I've got a list of values that are important to me (right).
As an example here, let's say I'm using the following in my "left" rowset:
| ID | URL |
|----|-------------------------|
| 1 | https://www.google.com/ |
| 2 | https://www.yahoo.com/ |
| 3 | https://www.hotmail.com/|
I'll have something like the following in my right rowset:
| ID | Name | Regex | Exceptions | Other Lookup Val |
|----|-------|-------------|------------|------------------|
| 1 | ThisA | /[a-z]{3,}/ | abc | 091238 |
| 2 | ThatA | /[a-z]{3,}/ | xyz | lksdf9 |
| 3 | OtherA| /[a-z]{3,}/ | def | 098143 |
As each is loaded via an EXTRACT statement, the two are in separate rowsets. Ideally, I'd like to load all the values for both rowsets and loop through the right one, running a series of calculations against the left one to find a match per various business rules. Notably, there's no value to simply join on, nor is it a simple Regex evaluation; it's something a bit more involved. Thus, the output might look something like a filtered version of the "left" rowset:
| ID | URL |
|----|-------------------------|
| 1 | https://www.google.com/ |
| 3 | https://www.hotmail.com/|
Now, a COMBINER is the only UDO I see that accepts two rowsets, but the U-SQL syntax requires that I do some sort of join statement here. There's no common identifier between each of the rowsets though, so there's nothing to join on, which suddenly makes this seem less ideal. Of the attribute options defined at https://learn.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-u-sql-programmability-guide#use-user-defined-combiners, I'd like to specify this as a Full because I'd need each of the left values available to evaluate against each of the right ones, but again, no shared identifier to do this on.
I then tried to use a REDUCER that accepted an IRowset in the IReducer constructor as a parameter, then tried to just pass the rowset in from the U-SQL, but it didn't like that syntax.
Is there any way to perform this custom combining in a manner that doesn't require a JOIN ON clause?

It sounds like you may be able to use an IProcessor. This would allow you to analyze each row in the RIGHT set and add a column (with a value based on your business rules) that you can subsequently use to join to the LEFT set.
[Adding a bit more detail]: You could also do this twice, once for the left and once for the right to create an artificial join column, like row_number or some such.
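The artificial join column idea can be sketched outside U-SQL as follows. This is only an illustration in Python, not ADLA code: the constant `JOIN_KEY`, the `matches` rule, and the right-hand lookup values are all hypothetical stand-ins for the real business rules. It assumes the same constant is added to every row on both sides, so that joining on it pairs each left row with each right row.

```python
import re

left = [
    {"ID": 1, "URL": "https://www.google.com/"},
    {"ID": 2, "URL": "https://www.yahoo.com/"},
    {"ID": 3, "URL": "https://www.hotmail.com/"},
]
# Hypothetical lookup rows; the real Regex/Exceptions values will differ.
right = [
    {"ID": 1, "Name": "ThisA", "Regex": r"goo[a-z]{3,}", "Exceptions": "abc"},
    {"ID": 2, "Name": "OtherA", "Regex": r"hot[a-z]{3,}", "Exceptions": "def"},
]

JOIN_KEY = 1  # assumption: the same constant on both sides


def with_key(rows):
    """Add the artificial join column to every row."""
    return [dict(row, JoinKey=JOIN_KEY) for row in rows]


def matches(l, r):
    """Hypothetical business rule: regex hit and no exception string in the URL."""
    return re.search(r["Regex"], l["URL"]) is not None and r["Exceptions"] not in l["URL"]


def combine(left_rows, right_rows):
    # Joining on the constant key pairs every left row with every right row.
    pairs = [
        (l, r)
        for l in with_key(left_rows)
        for r in with_key(right_rows)
        if l["JoinKey"] == r["JoinKey"]
    ]
    matched_ids = sorted({l["ID"] for l, r in pairs if matches(l, r)})
    by_id = {row["ID"]: row for row in left_rows}
    return [by_id[i] for i in matched_ids]


result = combine(left, right)
```

With these made-up rules, `result` keeps only the left rows that matched at least one right row. Note that the pairing grows as left × right, so it is worth keeping the right rowset small.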

Related

Designing tournament/matches database

I'm trying to design a simple database that represents some kind of tournament. I have a table 'Matches' (it contains the two teams' IDs), and I've also created a table 'Round' (one round contains a few matches). Here's my question: I want to create 'something' (table/view/procedure/function) that lets me show the ranking of my tournament after a given round ID (passed as an argument or used in the 'where' clause of a select). For example, there are two teams, team A and team B. Team A won in both the first and second rounds. So after I pass in the number '1', I want to get this output:
Position | Team | Points
:----- | -----: | :----:
1 | Team A | 3
2 | Team B | 0
What is the easiest way to achieve something like that?
I'm not fully sure I understand what the parameter you are passing in represents, but it sounds like you want a stored procedure. And yes, you are correct: you could use the passed-in parameter in your WHERE clause.
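As a rough sketch of what such a parameterised ranking could look like, here is a Python function standing in for the stored procedure, using SQLite. The schema (`Teams`, `Matches` with `round_id`, `team1`, `team2`, `winner`) is hypothetical, and the sketch assumes 3 points per win (inferred from the example output), no draws, and that the ranking is cumulative up to and including the given round.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Teams (name TEXT PRIMARY KEY);
CREATE TABLE Matches (round_id INTEGER, team1 TEXT, team2 TEXT, winner TEXT);
INSERT INTO Teams VALUES ('Team A'), ('Team B');
INSERT INTO Matches VALUES
    (1, 'Team A', 'Team B', 'Team A'),
    (2, 'Team B', 'Team A', 'Team A');
""")

WIN_POINTS = 3  # assumption taken from the example output


def ranking(conn, round_id):
    """Return (position, team, points) counting matches up to round_id."""
    rows = conn.execute(
        """
        SELECT t.name,
               SUM(CASE WHEN m.winner = t.name THEN ? ELSE 0 END) AS points
        FROM Teams AS t
        LEFT JOIN Matches AS m
               ON m.round_id <= ?
              AND t.name IN (m.team1, m.team2)
        GROUP BY t.name
        ORDER BY points DESC, t.name
        """,
        (WIN_POINTS, round_id),
    ).fetchall()
    # Position is simply the row order after sorting by points.
    return [(pos, name, pts) for pos, (name, pts) in enumerate(rows, start=1)]


after_round_1 = ranking(conn, 1)
after_round_2 = ranking(conn, 2)
```

In a real database the same SELECT could live inside a stored procedure or table-valued function, with the round ID as its parameter.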

How to select with bitwise flag values in SQL

I have two tables in a SQL Server DB. One table BusinessOperations has various information about this business object, the other table OperationType is purely a bitwise flag table that looks like this:
| ID | Type | BitFlag |
| 1 | Basic-A | -2 |
| 2 | Basic | -1 |
| 3 | Type A | 0001 |
| 4 | Type B | 0002 |
| 5 | Type C | 0004 |
| 6 | Type D | 0008 |
| 7 | Type E | 0016 |
| 8 | Type F | 0032 |
The BitFlag column is a varchar column, the bitflags were inserted as '0001' as an example. In the BusinessOperations table, there's a column where the application that uses these tables updates it based on what is selected in the application's UI. As an example, I have one type which has the Basic,Type A, and Type B types selected. The column value in BusinessOperations is 3.
Based on this, I am trying to write a query which will show me something like this:
| ID | Name | Description | OperationType |
| 1 | Test | Test | Basic, Type A, Type B |
Here is the actual layout of the BusinessOperations table (Basic-A and Basic are bit columns):
| ID | Name | Description | Basic-A | Basic | OperationType |
| 1 | Test | Test | 0 | 1 | 3 |
There is nothing that relates these two tables to each other, so I cannot perform a join. I am very inexperienced with bitwise operations and am at a loss on how exactly to structure my select query which is being used to analyze this data. I feel like it needs a STUFF or CASE, but I don't know how I can get this to just show the types and not just the resultant BitFlag.
SELECT ID, Name, Description, OperationType
FROM OperationType
ORDER BY ID
Since you're storing the flag in OperationType as a VARCHAR, the first thing you need to do is CONVERT or CAST the string to a number so we can do proper bitwise comparisons. I'm slightly unfamiliar with SQL Server, but you may need to remove the leading zeroes before the cast. Thus, the BitFlag column in our desired SQL will look something like
CONVERT(INT, BitFlag)
Then, comparing that to our OperationType column would look something like
CONVERT(INT, BitFlag) & OperationType
The full query would look something like (forgive my lack of SQL Server expertise again):
SELECT bo.ID, bo.Name, bo.Description, ot.Type
FROM BusinessOperations AS bo
JOIN OperationType AS ot
ON (CONVERT(INT, ot.BitFlag) & bo.OperationType) <> 0
The above query will effectively get you a list of the OperationTypes. If you absolutely need them on one line, see other answers to learn how to emulate something like GROUP_CONCAT in SQL Server. Disclaimer: Joining on a bitmask gives no guarantee of performance.
The other problem this answer does not solve is that of your legacy Basic and Basic-A fields. Personally, I'd do one of two things:
Remove them from the OperationType table and have the application tack the two on, based on the Basic and Basic-A columns as appropriate.
Put Basic and Basic-A as their own, positive flags in the OperationType table, and have the application populate the legacy columns as well as the OperationType column as appropriate.
As Aaron Bertrand has said in the comments, this really isn't an issue for Bitmasking at all. Having a many-many table that associates BusinessOperations.ID to OperationType.ID would solve all your problems in a much better way.
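The bitmask join from this answer can be tried out with a small sketch. It uses Python with SQLite rather than SQL Server, so `CAST(... AS INTEGER)` stands in for `CONVERT(INT, ...)`. It also assumes the negative Basic-A and Basic flags have been removed from the lookup table, as suggested above, because a flag of -1 has every bit set and would match any row. The comma-separated list is assembled in Python to keep the ordering predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE OperationType (ID INTEGER, Type TEXT, BitFlag TEXT);
INSERT INTO OperationType VALUES
    (3, 'Type A', '0001'),
    (4, 'Type B', '0002'),
    (5, 'Type C', '0004');
CREATE TABLE BusinessOperations (
    ID INTEGER, Name TEXT, Description TEXT, OperationType INTEGER
);
INSERT INTO BusinessOperations VALUES (1, 'Test', 'Test', 3);
""")

# One row per (business operation, matching flag); the varchar flag is
# cast to an integer before the bitwise AND.
rows = conn.execute("""
    SELECT bo.ID, ot.Type
    FROM BusinessOperations AS bo
    JOIN OperationType AS ot
      ON (CAST(ot.BitFlag AS INTEGER) & bo.OperationType) <> 0
    ORDER BY bo.ID, ot.ID
""").fetchall()

# Collapse the matches into one comma-separated string per operation.
type_names = {}
for op_id, type_name in rows:
    type_names.setdefault(op_id, []).append(type_name)
type_names = {op_id: ", ".join(names) for op_id, names in type_names.items()}
```

Here the stored value 3 has bits 1 and 2 set, so it matches Type A and Type B but not Type C.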
In the BusinessOperations table, the Basic-A and Basic fields are bit fields, which is just another way of saying the value can only be 1 or 0. Think of it like a boolean True/False value. So, in your query you can check each of those to determine whether or not to include 'Basic-A' and 'Basic'.
The OperationType is probably an id which you can lookup in the OperationsType table to get the Type and BitFlag. Without understanding your data completely it looks as if you could do a join for that part. Hopefully that is in the right general direction. If not, let me know.

SQL join two tables using value from one as column name for other

I'm a bit stumped on a query I need to write for work. I have the following two tables:
|===============Patterns==============|
|type | bucket_id | description |
|-----------------------|-------------|
|pattern a | 1 | Email |
|pattern b | 2 | Phone |
|==========Results============|
|id | buc_1 | buc_2 |
|-----------------------------|
|123 | pass | |
|124 | pass |fail |
In the results table, I can see that entity 124 failed a validation check in buc_2. Looking at the patterns table, I can see bucket 2 belongs to pattern b (bucket_id corresponds to the column name in the results table), so entity 124 failed phone validation. But how do I write a query that joins these two tables on the value of one of the columns? Limitations to how this query is going to be called will most likely prevent me from using any cursors.
Some crude solutions:
SELECT "id", "description" FROM
Results JOIN Patterns
ON "buc_1" = 'fail' AND "bucket_id" = 1
union all
SELECT "id", "description" FROM
Results JOIN Patterns
ON "buc_2" = 'fail' AND "bucket_id" = 2
Or, with a very probably better execution plan:
SELECT "id", "description" FROM
Results JOIN Patterns
ON ("buc_1" = 'fail' AND "bucket_id" = 1)
OR ("buc_2" = 'fail' AND "bucket_id" = 2);
This will report all failure descriptions for each id having a fail case in bucket 1 or 2.
See http://sqlfiddle.com/#!4/a3eae/8 for a live example
That being said, the right solution would be probably to change your schema to something more manageable. Say by using an association table to store each failed test -- as you have in fact here a many to many relationship.
Another approach, if you are using Oracle ≥ 11g, would be to use the UNPIVOT operation. This will translate columns to rows at query execution:
select * from Results
unpivot ("result" for "bucket_id" in ("buc_1" as 1, "buc_2" as 2))
join Patterns
using("bucket_id")
where "result" = 'fail';
Unfortunately, you still have to hard-code the various column names.
See http://sqlfiddle.com/#!4/a3eae/17
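To make the columns-to-rows idea concrete, here is a small sketch that does the unpivot on the client side. It is Python with SQLite, not Oracle, and it assumes every bucket column follows the `buc_<bucket_id>` naming seen in the question. Because the column names are read from the cursor, they do not have to be hard-coded, but that is a property of doing this in application code, not of UNPIVOT itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Patterns (type TEXT, bucket_id INTEGER, description TEXT);
INSERT INTO Patterns VALUES ('pattern a', 1, 'Email'), ('pattern b', 2, 'Phone');
CREATE TABLE Results (id INTEGER, buc_1 TEXT, buc_2 TEXT);
INSERT INTO Results VALUES (123, 'pass', NULL), (124, 'pass', 'fail');
""")

# bucket_id -> description lookup built from the Patterns table.
descriptions = dict(conn.execute("SELECT bucket_id, description FROM Patterns"))

cursor = conn.execute("SELECT * FROM Results ORDER BY id")
columns = [col[0] for col in cursor.description]

PREFIX = "buc_"  # assumption: bucket columns are named buc_<bucket_id>
failures = []
for row in cursor:
    record = dict(zip(columns, row))
    # "Unpivot": treat each buc_N column as its own (bucket_id, result) row.
    for column, value in record.items():
        if column.startswith(PREFIX) and value == "fail":
            bucket_id = int(column[len(PREFIX):])
            failures.append((record["id"], descriptions[bucket_id]))
```

For the sample data this reports that entity 124 failed the Phone check.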
It looks to me that what you really want to know is the description (in your example, Phone) of a Pattern entry, given the condition that the bucket failed. You want a solution that fulfills that condition in general, not just for your particular example.
I agree with the comment above. Your bucket entries should be tuples (rows) and not columns, and the two tables should share an ID so you can actually join them. For example, consider adding a bucket_id column to Results and then just ONE result column to store the state. Like this:
|===============Patterns==============|
|type | bucket_id | description |
|-----------------------|-------------|
|pattern a | 1 | Email |
|pattern b | 2 | Phone |
|==========Results====================|
|entity_id | bucket_id |status |
|-------------------------------------|
|123 | 1 |pass |
|124 | 1 |pass |
|123 | 2 | |
|124 | 2 |fail |
1.-Use an Inner Join: http://www.w3schools.com/sql/sql_join_inner.asp and the WHERE clause to filter only those buckets that failed:
2.-Would this example help?
SELECT Patterns.type, Patterns.description, Results.entity_id, Results.status
FROM Patterns
INNER JOIN Results
ON
Patterns.bucket_id = Results.bucket_id
WHERE
Results.status = 'fail'
Lastly, I would also add a primary_key column to each table to make sure indexing is faster for each unique combination.
Thanks!

How to get numbers arranged right to left in sql server SELECT statements

When performing SELECT statements that include number columns (prices, for example), the result is always left-aligned, which reduces readability. Therefore I'm searching for a method to right-align the output of number columns.
I already tried to use something like
SELECT ... SPACE(15-LEN(A.Nummer))+A.Nummer ...
FROM Artikel AS A ...
which gives close results, but depending on font not really. An alternative would be to replace 'SPACE()' with 'REPLICATE('_',...)', but I don't really like the underscores in output.
Besides that, this formula will fail on numbers with more than 15 digits, so I searched for a way to find the maximum length of the entries to make it safer, like
SELECT ... SPACE(MAX(A.Nummer)-LEN(A.Nummer))+A.Nummer ...
FROM Artikel AS A ...
but this does not work due to the aggregate character of the MAX-function.
So, what's the best way to achieve the right-justified order for the number-columns?
Thanks,
Rainer
To get your problem with the list box solved, have a look at this link: http://www.lebans.com/List_Combo.htm
I strongly believe that this type of adjustment should be made in the UI layer and not mixed in with data retrieval.
But to answer your original question, I have created a SQL Fiddle:
MS SQL Server 2008 Schema Setup:
CREATE TABLE dbo.some_numbers(n INT);
Create some example data:
INSERT INTO dbo.some_numbers
SELECT CHECKSUM(NEWID())
FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1))X(x);
The following query uses the OVER() clause to specify that the MAX() is to be applied over all rows. The > and < that the result is wrapped in are just for illustration purposes and not required for the solution.
Query 1:
SELECT '>'+
SPACE(MAX(LEN(CAST(n AS VARCHAR(MAX))))OVER()-LEN(CAST(n AS VARCHAR(MAX))))+
CAST(n AS VARCHAR(MAX))+
'<'
FROM dbo.some_numbers SN;
Results:
| COLUMN_0 |
|---------------|
| >-1486993739< |
| > 1620287540< |
| >-1451542215< |
| >-1257364471< |
| > -819471559< |
| >-1364318127< |
| >-1190313739< |
| > 1682890896< |
| >-1050938840< |
| >  484064148< |
This query does a straight cast to show the difference:
Query 2:
SELECT '>'+CAST(n AS VARCHAR(MAX))+'<'
FROM dbo.some_numbers SN;
Results:
| COLUMN_0 |
|---------------|
| >-1486993739< |
| >1620287540< |
| >-1451542215< |
| >-1257364471< |
| >-819471559< |
| >-1364318127< |
| >-1190313739< |
| >1682890896< |
| >-1050938840< |
| >484064148< |
With this query you still need to change the display font to a monospaced font like COURIER NEW. Otherwise, as you have noticed, the result is still misaligned.
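For comparison, doing the padding in the presentation layer instead of in the query could look like the sketch below. It is plain Python, not tied to any particular UI toolkit, and the list of numbers is just a few of the sample values from the results above.

```python
# A few of the sample values from the result set above.
numbers = [-1486993739, 1620287540, -819471559, 484064148]

# Width of the longest rendered number, including a leading minus sign.
width = max(len(str(n)) for n in numbers)

# Right-justify each number to that width; > and < only mark the edges.
lines = [">" + str(n).rjust(width) + "<" for n in numbers]

for line in lines:
    print(line)
```

The same caveat applies here: the columns only line up when the output is shown in a monospaced font.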

Inserting entries only once

I have a relationships table, the table looks something like this
------------------------
| client_id | service_id |
------------------------
| 1 | 1 |
| 1 | 2 |
| 1 | 4 |
| 1 | 7 |
| 2 | 1 |
| 2 | 5 |
------------------------
I have a list of new permissions I need to add. What I'm doing right now is, for example, if I have to add new permissions for the client with id 1, I do
DELETE FROM myTable WHERE client_id = 1
INSERT INTO ....
Is there a more efficient way I can remove only the ones I won't insert later, and add only the new ones?
Yes, you can do this, but in my humble opinion it's not really a SQL-dependent subject; it depends on your language/platform choice. If you use a powerful platform like .NET or Java, there are many database classes (adapters, datasets, etc.) that can take care of things for you, such as finding the changed parts and updating/inserting/deleting only the necessary rows.
I prefer using libraries like Hibernate/NHibernate. In that case, you don't even need to write SQL queries most of the time: just do things at the OOP level and synchronize with the database.
If you put the new permissions into another table, you could do something like:
DELETE FROM myTable WHERE client_id in (SELECT client_id FROM tmpTable);
INSERT INTO myTable (client_id, service_id) SELECT client_id, service_id FROM tmpTable;
You are still taking 2 passes, but you are doing them all at once instead of one at a time.
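A variation on the staging-table idea, closer to what the question asks for, is to delete only the rows that are absent from the new list and insert only the rows that are missing. The sketch below shows this with Python and SQLite. It is a different technique from the delete-everything-then-insert approach above, and it assumes `tmpTable` holds the complete desired set of services for each client it mentions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE myTable (client_id INTEGER, service_id INTEGER);
INSERT INTO myTable VALUES (1, 1), (1, 2), (1, 4), (1, 7), (2, 1), (2, 5);
CREATE TABLE tmpTable (client_id INTEGER, service_id INTEGER);
INSERT INTO tmpTable VALUES (1, 2), (1, 4), (1, 9);
""")

with conn:  # run both statements in one transaction
    # Remove rows for the affected clients that are not in the new list.
    conn.execute("""
        DELETE FROM myTable
        WHERE client_id IN (SELECT client_id FROM tmpTable)
          AND NOT EXISTS (
              SELECT 1 FROM tmpTable AS t
              WHERE t.client_id = myTable.client_id
                AND t.service_id = myTable.service_id)
    """)
    # Add only the rows that are not already present.
    conn.execute("""
        INSERT INTO myTable (client_id, service_id)
        SELECT t.client_id, t.service_id
        FROM tmpTable AS t
        WHERE NOT EXISTS (
            SELECT 1 FROM myTable AS m
            WHERE m.client_id = t.client_id
              AND m.service_id = t.service_id)
    """)

rows = conn.execute(
    "SELECT client_id, service_id FROM myTable ORDER BY client_id, service_id"
).fetchall()
```

Rows that appear in both tables, such as (1, 2) and (1, 4), are left untouched, and client 2 is not affected because it does not appear in `tmpTable`.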