Imperative USQL - azure-data-lake

How do I write control code for USQL? For example, move records between rowsets depending on the result of a query, and do this iteratively? Do I need to write a C# wrapper that dynamically generates and submits USQL?
For example, imagine I am writing a type of divisive hierarchical clustering algorithm. I have a large data set, where each row represents a point I want to segment. I iteratively repeat the following procedure:
For next cluster of rows X on stack:
For each potential disjoint subsegment Y of X:
Test if Y is different than (X-Y)
If yes:
mark Y as cluster and push to stack
let X=X-Y
Obviously, this procedure involves lots of control flow. How could I implement such a routine in USQL? If it is not possible, how would I implement with ADLA?

Related

How to handle concurrency in faunadb

I've some backend APIs which connect to faunadb; I'm able to do everything I need with data but I've some serious doubts about concurrent modifications (which maybe are not strictly related to faunadb only, but I'd like to understand how to deal with it using this technology).
One example above all: I want to create a new document (A) in a collection (X) which is linked (via reference or other fields) to other documents (B and C) in another collection (Y); in order to be linked, these documents (B and C) must satisfy a condition (e.g. field F = "V"). Once A has been created, B and C cannot be modified (or the condition will be invalidated!).
Of course the API to create the document A can run concurrently with the API used to modify documents B and C.
Here comes the doubt: what if, while creating the document A linked to document B and C, someone else changes field F of document B to something different from "V"?
I could end up with A linked to a wrong document, because both APIs don't know what the other one is doing..
Do I need to use the "Do" function in both APIs to create atomic transactions? So I can:
Check if B and C are valid and, if yes, create A in a single transaction
Check if B is linked to A and, if it doesn't, modify it in a single transaction
Thanks everyone.
Fauna tries to present a consistent data view no matter or when your clients need to ask. Isolation of transaction effects is what matters on short time scales (typically less than 10ms).
The Do function merely lets you combine multiple disparate FQL expressions into a single query. There is no conditional processing aspect to Do.
You can certainly check conditions before undertaking operations, and all Fauna queries are atomic transactions: all of the query succeeds or none of its does.
Arranging for intermediate query values in order to perform conditional logic does tend to make FQL queries more complex, but they are definitely possible:
The query for your first API might look something like this:
Let(
{
document_b: Get(<reference to B document>),
document_c: Get(<reference to C document>),
required_b: Select(["data", "required_field"], Var("document_b"),
required_c: Select(["data", "other_required"], Var("document_c"),
condition: And(Var("required_b"), Var("required_c")),
},
If(
Var("condition"),
Create(Collection("A"), { data: { <document A data }}),
Abort("Cannot create A because the conditions have not been met.")
)
)
The Let function allows you to compose named values for intermediate expressions, which can read or write whatever they need, along with logical operations that determine which conditions need to be tested. The value composition is followed by an expression which, in this example, tests the conditions and only creates the document in the A collection when the conditions are met. When the conditions are not met, the transaction is aborted with an appropriate error message.
Let can nest Lets as much as required, provide the query fits within the maximum query length of 16MB, so you can embed a significant amount of logic into your queries. When the length of a single query is not sufficient, you can define UDFs that can be called, which allow you to store business logic that you can use any number of times.
See the E-commerce tutorial for a UDF that performs all of the processing required to submit an order, check if there is sufficient product in stock, deduct requested quantities from inventory, set backordered status, and create the order.

How to map the column wise data in flowfile in NiFi?

i have csv file which having following structure.,
Alfreds,Centro,Ernst,Island,Bacchus
Germany,Mexico,Austria,UK,Canada
01,02,03,04,05
Now i have to move that data into database like below.
Name,City,ID
Alfreds,Germay,01
Centro,Mexico,02
Ernst,Austria,03
Island,UK,04
Bacchus,Canda,05
i try to map those colums but i can't able to extract the data in column wise.
Here my input data in column wise but i need to insert those in row wise in SQLServer
Can anyone suggest way to transfer column wise data into row wise in sql server?.
Thanks
There is no existing Apache NiFi processor to perform column transposition. One of the problems is that this is difficult to do in a streaming manner, as most NiFi components are designed, because in a naïve implementation you need to hold the entire contents of the flowfile in active memory at the same time.
I would recommend using an ExecuteScript processor to do this (here's a 6 line Python example). Be careful doing this because you can easily end up overflowing your heap if it is not set properly/you read unexpectedly large files into memory.
You could write a custom processor which performs a streaming transpose operation by iterating over each of n rows and reading up to your delimiter, storing a byte counter per row, combining the n elements as a single output row, and repeating the process starting from the respective byte counter of each row. (Given m columns, this is O(m * n)).
Another solution would be splitting the CSV input into individual rows using the SplitText processor, using an ExecuteScript or custom processor to transpose a single row into a single column, and then using a custom merge operation (either extend the existing MergeContent processor or write a script to do this) which laterally concatenates the incoming columns into a reconstructed matrix. (O(n) + O(n) + O(m) => O(2n + m) but the individual transposition operations can be performed in parallel so with x threads it's O(n + n/x + m)).
Any of these approaches will require some level of custom development. If you are really hesitant to pursue that, you could try using ExecuteStreamCommand and one of the many bash solutions to do the transposition on the command-line.
#Andy,
It could be possible in NiFi also without using ExecuteScript.
I have extract the 3 input rows as input.1,input.2,input.3 in ExtractText. And then count number of columns in "input.1" using AnydelinateValues in expression language and store that in "TotalCount" Attribute.
Initially made "Count=1".
Using Loop Concept to get the first column by using "Count" and then increment "Count" Check "Count" in RouteOnAttribute
"le(totalcount)"
Now form insert Query with "Count" Attribute.
It worked well for me.It could be useful for someone.

Multithreaded grouping algorithm

I have a collection of circles, each of which may or may not intersect one or more other circles in the collection. I want to group these circles such that each "group" contains all circles such that every member of the group intersects at least one other member of the group, and such that no member of any group intersects any member of any other group. I have come up with the following VB.NET/pseudocode algorithm to solve this problem on a single thread:
Dim groups As New List(Of List(Of Circle))
For Each circleToClassify In allCircles
Dim added As Boolean
For Each group In groups
For Each circle In group
If circleToClassify.Intersects(circle) Then
group.Add(circleToClassify)
added = True
Exit For
End If
Next
If added Then
Exit For
End If
Next
If Not added Then
Dim newGroup As New List(Of Circle)
newGroup.Add(circleToClassify)
groups.Add(newGroup)
End If
Next
Return groups
Or in English
Take each item from the collection of circles
Check if it intersects with any member of any existing group (Bear in mind a "group" may only contain a single circle)
If the circle does intersect in the aforementioned manner add it to the appropriate group
Otherwise create a new group with this circle as its only member
Go to step 1.
What I want to be able to do is perform this task using an arbitrary number of threads. However, I haven't got very far at all as all solutions I've come up with so far will just end up executing serially due to locking.
Can anyone provide any tips on what I want to be thinking about to achieve this multithreading?
TLDR
The best multithreaded solutions avoid sharing or perform read-only sharing. (And hence don't need locks.)
Consider partitioning your work so that threads don't share result data, and then merging each thread's results.
Note that when you strip away the detail of detecting whether groups of circles intersect, you are really dealing with a connected components graph theory problem. There's plenty of useful material on this subject online. And in fact you may find it much easier and sufficiently fast to simply apply a breadth first search algorithm to find connected components.
Detail
When doing multi-threaded development, first prize is to implement the threads in such a way as to minimise the number of locks. In the most trivial case: if they don't share any data, they don't need locks at all. However, if you can guarantee that the shared data won't be modified while the threads are running: then you don't need locks in this case either.
In your question, there's no need for your input list of circles to be modified. The problem you have is that you're building up a shared list of circle groups. Basically you're sharing your result space and need locks to ensure the integrity of the results.
One technique in this situation is to "partition and merge". As a trivial example consider finding the maximum of a large list of numbers. The naive (and ideal single-threaded solution) is to:
keep a single "current maximum" found;
compare each element to this value;
and update the "current maximum" if it's higher.
The problem for multithreading occurs in updating of the shared result. One solution is to:
partition the list for each of p threads;
find the maximum within each partition;
once all threads finish their work, the final result is trivially obtained by finding the maximum of the p partitioned maximums.
The trade-off against a single-threaded solution involves weighing up the ease with which the workload can be partitioned and the per-thread results merged versus the often much simpler single-threaded approach.
Applying partition and merge to circle clusters
As a side note: Observe that your question is essentially a graph theory question such that: Each circle is a node; where if any 2 circles intersect, there's an undirected edge between them; and you're trying to determine the connected components of the graph.
Obviously this provides an area that you can research for more ideas/information. But more importantly it makes easier to analyse the problem with simple boolean assessment of whether 2 circles intersect.
Also note the potential performance improvements by first pre-processing your circles into a suitable graph structure.
Assume you have 8 circles (A-H) where 1's in the table below indicate the 2 circles intersect.
ABCDEFGH
A11000110
B11000000
C00100000
D00010101
E00001110
F10011100
G10001010
H00010001
One partitioning idea involves determining what's connected by only considering a subset of circles and all their immediate connections.
ABCDEFGH
A11000110 p1 [AB]
B11000000
---------
C00100000 p2 [CD]
D00010101
---------
E00001110 p3 [EF]
F10011100
---------
G10001010 p4 [GH]
H00010001
NB Even though threads are sharing data (e.g. 2 threads may consider the intersection between circles A and F concurrently), the share is read-only and doesn't require a lock.
Assume 4 partitions (and 4 threads) of [AB][CD][EF][GH]. Connected components per partition would be broken down as follows:
[AB]: ABFG
[CD]: C DFH
[EF]: ADEFG
[GH]: AEG DH
You now have a list of potentially overlapping connected components. Merging involves iterating the list to find overlaps. If found, take union of the 2 sets is a new connected component. This will finally produce: ABFGDHE and C
Some optimisation techniques to consider:
The bottom left of the matrix mirrors the top-right. So you should be able to avoid duplicating processing of the inverse connections.
The merging of partitions can itself be partitioned and merged.
In fact in the extreme case you could start out partitioning a single circle per partition.
Connected(A) = ABFG
Connected(B) = B
Connected(AB) = ABFG
Connected(C) = C
Connected(D) = DFH
Connected(CD) = C,DFH
Connected(ABCD) = ABFGDH,C
Connected(E) = EFG
Connected(F) = F
Connected(EF) = EFG
Connected(G) = G
Connected(H) = H
Connected(GH) = G,H
Connected(EFGH) = EFG,H
Connected(ABCDEFGH) = ABFGDHE,C
Very NB You need to ensure appropriate selection of data structures and algorithms or suffer extremely poor performance. E.g. A naive intersection implementation might require O(n^2) operations to determine if two intermediate connected components intersect and totally destroy your goal that lead to all this additional complexity.
One approach is to divide the image into blocks, run the algorithm for each block independently, on different threads (i.e. considering only the circles whose center is in that block), and afterwards join the groups from different blocks that have intersecting circles.
Another approach is to formulate the problem using a graph, where the nodes represent circles, and an edge exists between two nodes if the corresponding circles are intersecting. We need to find the connected components of this graph. This disregards the geometric aspects of the problem, however, there are general algorithms which may be useful (e.g. you could consider the last slides from this link).

Hive/SQL queries for funnel analysis.

I need to perform funnel analysis on a data the schema for which is following:
A(int X) Matched_B(int[] Y) Filtered_C(int[] Z)
Where,
A refers to client ID which can send multiple requests. Instead of storing request ID only client ID is being stored per request in the data pipeline. (I don't know why)
Matched_B refers to a list of items returned for a query.
Flitered_C is a subset of Matched_B and refers to items which successfully passed the filter.
All the data is stored in avro files in HDFS. The QPS with which data is being stored in HDFS is around 12000.
I need to prepare the following reports:
For each combination of (X,Y[i]), the number of times Y[i] appears in Matched_B.
For each combination of (X,Y[i]), the number of times Y[i] appears in Filtered_C.
Basically I would like to know whether this task can be performed using Hive only?
Currently, I am thinking of the following architecture.
HDFS(avro_schema)--> Hive_Script_1 --> HDFS(avro_schema_1) --> Java Application --> HDFS(avro_schema_2) --> Hive_Script_2(external_table) --> result
Where,
avro_schema is the schema described above.
avro_schema_1 is generated by Hive_Script_1 by transforming (using Lateral View explode(Matched_B)) avro_schema and is described as follows:
A(int X) Matched_B_1(int Y) Filtered_C(int[] Z)
avro_schema_2 is generated by the Java Application and is described as follows:
A(int X) Matched_B(int Y) Matched_Y(1 if Y is matched, else 0) Filtered_Y(1 if Y is filtered, 0 otherwise)
Finally we can run a Hive script to process this data for events generated each day.
The other architecture could be that we remove avro_schema_1 generation and directly process avro_schema from the Java application and generate the result.
However, I would like to avoid writing a Java application for this task. Could some point me to a Hive solution to the above problem?
Would also like some architecture's POV regarding the efficient solution to this problem.
Note: Kindly suggest the solution taking into account the QPS(12000).

How to speed up table-retrieval with MATLAB and JDBC?

I am accessing a PostGreSQL 8.4 database with JDBC called by MATLAB.
The tables I am interested in basically consist of various columns of different datatypes. They are selected through their time-stamps.
Since I want to retrieve big amounts of data I am looking for a way of making the request faster than it is right now.
What I am doing at the moment is the following:
First I establish a connection to the database and call it DBConn. Next step would be to prepare a Select-Statement and execute it:
QUERYSTRING = ['SELECT * FROM ' TABLENAME '...
WHERE ts BETWEEN ''' TIMESTART ''' AND ''' TIMEEND ''''];
QUERY = DBConn.prepareStatement(QUERYSTRING);
RESULTSET = QUERY.executeQuery();
Then I store the columntypes in variable COLTYPE (1 for FLOAT, -1 for BOOLEAN and 0 for the rest - nearly all columns contain FLOAT). Next step is to process every row, column by column, and retrieve the data by the corresponding methods. FNAMES contains the fieldnames of the table.
m=0; % Variable containing rownumber
while RESULTSET.next()
m = m+1;
for n = 1:length(FNAMES)
if COLTYPE(n)==1 % Columntype is a FLOAT
DATA{1}.(FNAMES{n})(m,1) = RESULTSET.getDouble(n);
elseif COLTYPE(n)==-1 % Columntype is a BOOLEAN
DATA{1}.(FNAMES{n})(m,1) = RESULTSET.getBoolean(n);
else
DATA{1}.(FNAMES{n}){m,1} = char(RESULTSET.getString(n));
end
end
end
When I am done with my request I close the statement and the connection.
I don´t have the MATLAB database toolbox so I am looking for solutions without it.
I understand that it is very ineffective to request the data of every single field. Still, I failed on finding a way to get more data at once - for example multiple rows of the same column. Is there any way to do so? Do you have other suggestions of speeding the request up?
Summary
To speed this up, push the loops, and then your column datatype conversion, down in to the Java layer, using the Database Toolbox or custom Java code. The Matlab-to-Java method call overhead is probably what's killing you, and there's no way of doing block fetches (multiple rows in one call) with plain JDBC. Make sure the knobs on the JDBC driver you're using are set appropriately. And then optimize the transfer of expensive column data types like strings and dates.
(NB: I haven't done this with Postgres, but have with other DBMSes, and this will apply to Postgres too because most of it is about the JDBC and Matlab layers above it.)
Details
Push loops down to Java to get block fetching
The most straightforward way to get this faster is to push the loops over the rows and columns down in to the Java layer, and have it return blocks of data (e.g. 100 or 1000 rows at a time) to the Matlab layer. There is substantial per-call overhead in invoking a Java method from Matlab, and looping over JDBC calls in M-code is going to incur (see Is MATLAB OOP slow or am I doing something wrong? - full disclosure: that's my answer). If you're calling JDBC from M-code like that, you're incurring that overhead on every single column of every row, and that's probably the majority of your execution time right now.
The JDBC API itself does not support "block cursors" like ODBC does, so you need to get that loop down in to the Java layer. Using the Database Toolbox like Oleg suggests is one way to do it, since they implement their lower-level cursor stuff in Java. (Probably for precisely this reason.) But if you can't have a database toolbox dependency, you can just write your own thin Java layer to do so, and call that from your M-code. (Probably through a Matlab class that is coupled to your custom Java code and knows how to interact with it.) Make the Java code and Matlab code share a block size, buffer up the whole block on the Java side, using primitive arrays instead of object arrays for column buffers wherever possible, and have your M-code fetch the result set in batches, buffering those blocks in cell arrays of primitive column arrays, and then concatenate them together.
Pseudocode for the Matlab layer:
colBufs = repmat( {{}}, [1 nCols] );
while (cursor.hasMore())
cursor.fetchBlock();
for iCol = 1:nCols
colBufs{iCol}{end+1} = cursor.getBlock(iCol); % should come back as primitive
end
end
for iCol = 1:nCols
colResults{iCol} = cat(2, colBufs{iCol}{:});
end
Twiddle JDBC DBMS driver knobs
Make sure your code exposes the DBMS-specific JDBC connection parameters to your M-code layer, and use them. Read the doco for your specific DBMS and fiddle with them appropriately. For example, Oracle's JDBC driver defaults to setting the default fetch buffer size (the one inside their JDBC driver, not the one you're building) to about 10 rows, which is way too small for typical data analysis set sizes. (It incurs a network round trip to the db every time the buffer fills.) Simply setting it to 1,000 or 10,000 rows is like turning on the "Go Fast" switch that had shipped set to "off". Benchmark your speed with sample data sets and graph the results to pick appropriate settings.
Optimize column datatype transfer
In addition to giving you block fetch functionality, writing custom Java code opens up the possibility of doing optimized type conversion for particular column types. After you've got the per-row and per-cell Java call overhead handled, your bottlenecks are probably going to be in date parsing and passing strings back from Java to Matlab. Push the date parsing down in to Java by having it convert SQL date types to Matlab datenums (as Java doubles, with a column type indicator) as they're being buffered, maybe using a cache to avoid recalculation of repeated dates in the same set. (Watch out for TimeZone issues. Consider Joda-Time.) Convert any BigDecimals to double on the Java side. And cellstrs are a big bottleneck - a single char column could swamp the cost of several float columns. Return narrow CHAR columns as 2-d chars instead of cellstrs if you can (by returning a big Java char[] and then using reshape()), converting to cellstr on the Matlab side if necessary. (Returning a Java String[]converts to cellstr less efficiently.) And you can optimize the retrieval of low-cardinality character columns by passing them back as "symbols" - on the Java side, build up a list of the unique string values and map them to numeric codes, and return the strings as an primitive array of numeric codes along with that map of number -> string; convert the distinct strings to cellstr on the Matlab side and then use indexing to expand it to the full array. This will be faster and save you a lot of memory, too, since the copy-on-write optimization will reuse the same primitive char data for repeated string values. Or convert them to categorical or ordinal objects instead of cellstrs, if appropriate. This symbol optimization could be a big win if you use a lot of character data and have large result sets, because then your string columns transfer at about primitive numeric speed, which is substantially faster, and it reduces cellstr's typical memory fragmentation. (Database Toolbox may support some of this stuff now, too. I haven't actually used it in a couple years.)
After that, depending on your DBMS, you could squeeze out a bit more speed by including mappings for all the numeric column type variants your DBMS supports to appropriate numeric types in Matlab, and experimenting with using them in your schema or doing conversions inside your SQL query. For example, Oracle's BINARY_DOUBLE can be a bit faster than their normal NUMERIC on a full trip through a db/Matlab stack like this. YMMV.
You could consider optimizing your schema for this use case by replacing string and date columns with cheaper numeric identifiers, possibly as foreign keys to separate lookup tables to resolve them to the original strings and dates. Lookups could be cached client-side with enough schema knowledge.
If you want to go crazy, you can use multithreading at the Java level to have it asynchronously prefetch and parse the next block of results on separate Java worker thread(s), possibly parallelizing per-column date and string processing if you have a large cursor block size, while you're doing the M-code level processing for the previous block. This really bumps up the difficulty though, and ideally is a small performance win because you've already pushed the expensive data processing down in to the Java layer. Save this for last. And check the JDBC driver doco; it may already effectively be doing this for you.
Miscellaneous
If you're not willing to write custom Java code, you can still get some speedup by changing the syntax of the Java method calls from obj.method(...) to method(obj, ...). E.g. getDouble(RESULTSET, n). It's just a weird Matlab OOP quirk. But this won't be much of a win because you're still paying for the Java/Matlab data conversion on each call.
Also, consider changing your code so you can use ? placeholders and bound parameters in your SQL queries, instead of interpolating strings as SQL literals. If you're doing a custom Java layer, defining your own #connection and #preparedstatement M-code classes is a decent way to do this. So it looks like this.
QUERYSTRING = ['SELECT * FROM ' TABLENAME ' WHERE ts BETWEEN ? AND ?'];
query = conn.prepare(QUERYSTRING);
rslt = query.exec(startTime, endTime);
This will give you better type safety and more readable code, and may also cut down on the server-side overhead of query parsing. This won't give you much speed-up in a scenario with just a few clients, but it'll make coding easier.
Profile and test your code regularly (at both the M-code and Java level) to make sure your bottlenecks are where you think they are, and to see if there are parameters that need to be adjusted based on your data set size, both in terms of row counts and column counts and types. I also like to build in some instrumentation and logging at both the Matlab and Java layer so you can easily get performance measurements (e.g. have it summarize how much time it spent parsing different column types, how much in the Java layer and how much in the Matlab layer, and how much waiting on the server's responses (probably not much due to pipelining, but you never know)). If your DBMS exposes its internal instrumentation, maybe pull that in too, so you can see where you're spending your server-side time.
It occurs to me that to speed up the query to the table, you have to remove if, for that in the JDBC ResulSetMetadata there that give you the information about the data type of the column and the same name, so you save time of if, else if
  ResultSetMetaData RsmD = (ResultSetMetaData) rs.getMetaData ();
        int cols=rsMd.getColumnCount();
        
  while (rs.next)
for i=1 to cols
row[i]=rs.getString(i);
My example is pseudocode becouse i´m not matlab programmer.
I hope you find it useful JDBC, anything let me know!