Azure Stream Analytics -> how much control over path prefix do I really have?

I'd like to set the prefix based on some of the data coming from event hub.
My data is something like:
{"id":"1234",...}
I'd like to write a blob prefix that is something like:
foo/{id}/guid....
Ultimately I'd like to have one blob for each id. This will help with how it gets consumed downstream by a couple of things.
What I don't see is a way to create prefixes that aren't related to date and time. In theory I can write another job to pull from blobs and break it up after the stream analytics step. However, it feels like SA should allow me to break it up immediately.
Any ideas?

{date}, {time} and {partition} are the only tokens supported in the blob output prefix. {partition} is a number.
Using a column value in the blob prefix is currently not supported.
If you have a limited number of such {id}s, you could work around this by writing multiple SELECT statements with different filters, each writing to a different output with the prefix hardcoded on that output (see the sketch below). Otherwise it is not possible with just ASA.
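A rough sketch of that workaround, assuming two known ids ('1234' and '5678') and two blob outputs named BlobOutput1234 and BlobOutput5678 configured on the job, each with its own hardcoded prefix (all names are illustrative):
-- one SELECT per id, each writing INTO its own blob output
SELECT * INTO BlobOutput1234 FROM EventHubInput WHERE id = '1234'
SELECT * INTO BlobOutput5678 FROM EventHubInput WHERE id = '5678'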

It should be noted that you can now actually do this. I'm not sure when it was implemented, but you can now use a single property from your message as a custom partition key, and the syntax is exactly what the OP asked for: foo/{id}/something/else
More details are documented here: https://learn.microsoft.com/en-us/azure/stream-analytics/stream-analytics-custom-path-patterns-blob-storage-output
Key points:
Only one custom property allowed
Must be a direct reference to an existing message property (i.e. no concatenations like {prop1+prop2})
If the custom property results in too many partitions (more than 8,000), then an arbitrary number of blobs may be created for the same partition
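For example (an assumption on my part based on the linked docs, which also allow combining the custom property with the built-in tokens), the path pattern on the blob output could be set to something like:
foo/{id}/{date}/{time}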

Related

RediSql (for redis): Get column names as well as data type?

I am using the excellent RediSql, a module for Redis, to get a powerful caching solution.
When sending a command to Redis that interacts with the SQLite db in the background, like this:
REDISQL.EXEC db "SELECT * FROM jobcache"
I get a result that includes a type for the integer column, but not for the string column, and no column names are provided.
Is there a way to get column name and defined data type always? I would need this, as I need to convert the results back to a more standard sql result format.
Unfortunately, at the moment this is not possible with the EXEC command.
You can use the QUERY.INTO command (see the command reference).
QUERY.INTO adds the result of your query into a Redis stream; it adds the column names and the values for each row. Then you can consume the stream in whichever way you prefer.
When doing queries (reads) against RediSQL, it is good practice to use the .QUERY family of commands; this avoids useless replication of data in case you are in a cluster setup.
Moreover, it is possible to use the .QUERY commands against replicas of the main Redis instance, while the .EXEC commands can be used only against the primary instance.
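A rough sketch of the QUERY.INTO flow, assuming a target stream key called jobcache_stream (the key name is illustrative; check the QUERY.INTO reference for the exact argument order):
REDISQL.QUERY.INTO jobcache_stream db "SELECT * FROM jobcache"
XRANGE jobcache_stream - +
Each stream entry carries the column names alongside the values, so the consumer can rebuild a tabular result from it.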

How do I partition a large file into files/directories using only U-SQL and certain fields in the file?

I have an extremely large CSV, where each row contains customer and store ids, along with transaction information. The current test file is around 40 GB (about 2 days worth), so partitioning is an absolute must for any reasonable return time on select queries.
My question is this: When we receive a file, it contains multiple stores' data. I would like to use the "virtual column" functionality to separate this file into the respective directory structure. That structure is "/Data/{CustomerId}/{StoreID}/file.csv".
I haven't yet gotten it to work with the OUTPUT statement. The statement I used was this:
// Output to file
OUTPUT @dt
TO @"/Data/{CustomerNumber}/{StoreNumber}/PosData.csv"
USING Outputters.Csv();
It gives the following error:
Bad request. Invalid pathname. Cosmos Path: adl://<obfuscated>.azuredatalakestore.net/Data/{0}/{1}/68cde242-60e3-4034-b3a2-1e14a5f7343d
Has anyone attempted the same kind of thing? I tried to concatenate the output path from the fields, but that was a no-go. I thought about doing it as a function (UDF) that takes the two IDs and filters the whole dataset, but that seems terribly inefficient.
Thanks in advance for reading/responding!
Currently U-SQL requires that all the file outputs of a script must be understood at compile time. In other words, the output files cannot be created based on the input data.
Dynamic outputs based on data are something we are actively working on for release sometime later in 2017.
In the meantime, until the dynamic output feature is available, the pattern to accomplish what you want requires two scripts.
The first script will use GROUP BY to identify all the unique combinations of CustomerNumber and StoreNumber and write that to a file.
Then, through the use of scripting or a tool written using our SDKs, download the previous output file and programmatically generate a second U-SQL script that has an explicit OUTPUT statement for each pair of CustomerNumber and StoreNumber (see the sketch below).
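A rough sketch of that two-script pattern, assuming @dt is the rowset extracted from the source CSV; the C001/S001 values in the second script stand in for one of the pairs found by the first script (all names and values are illustrative):
// Script 1: find the distinct CustomerNumber/StoreNumber pairs
@pairs =
    SELECT CustomerNumber, StoreNumber
    FROM @dt
    GROUP BY CustomerNumber, StoreNumber;

OUTPUT @pairs
TO "/Data/PartitionKeys.csv"
USING Outputters.Csv();

// Script 2 (generated from the pairs file): one explicit, filtered OUTPUT per pair
@pair_C001_S001 =
    SELECT *
    FROM @dt
    WHERE CustomerNumber == "C001" AND StoreNumber == "S001";

OUTPUT @pair_C001_S001
TO "/Data/C001/S001/PosData.csv"
USING Outputters.Csv();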

How to query documents in MarkLogic and process results

I've been working off of the tutorial pages but seem to have a fundamental disconnect in my thinking transitioning off of RDBMS systems. I'm using MarkLogic and handling this database interaction through the Java API, focusing on the search access via the POJO methods outlined in the tutorial documentation.
My reference up to this point has come from here principally: http://developer.marklogic.com/learn/java/processing-search-results
My scenario is this:
I have a series of documents. We'll call them 'books' for simplicity. I'm writing these books into my DB like this:
jsonDocMgr.write("/" + book.getID() + "/",
new StringHandle(
"{name: \""+book.getID()+"\","+
"chaps: "+ book.getNumChaps()+","+
"pages: "+ book.getNumPages()+","+
"}"));
What I want is to execute the following type of operation:
Query all documents with a name like "book*" (as the ID is represented by book0, book1, book2, etc.) where chaps > 3. For these documents only, I want to modify the number of pages by reducing it by half.
In an RDBMS, I'd use something like jdbcTemplate and get a result set for me to iterate through. For each iteration I'd know I was working with a single record (aka a book), parse the field values from the result set, make a note of the ID, then update the DB accordingly.
With MarkLogic, I'm awash in a sea of different handlers and managers...none of which seems to follow the pattern of the ResultSet with a cursor abstraction. Ultimately I want to do a two-step operation of check the chapter count then update the page field for that specific URI.
What's the most common approach to this? It seems like the most basic of operations...
Try the high-level Java API and see if it works for you. Create a multi-statement transaction with a query by example, then use document operations.
At a lower level, the closest match to a ResultSet is the ResultSequence class. The examples at http://docs.marklogic.com/javadoc/xcc/overview-summary.html are pretty good. For updates the interaction model between Java and MarkLogic is a bit different from JDBC and SQL. There is no SELECT... FOR UPDATE syntax.
The most efficient low-level technique is to select and update in one XQuery transaction, something like a stored procedure. However this requires good knowledge of XQuery. The other low-level approach is to use an XCC multi-statement transaction, which requires a little less knowledge of XQuery.
A minor issue in your code ... you definitely do NOT want to end your JSON document URIs with "/" as you do in your sample code. You should end them with ".json" or some other extension, or no extension at all, but definitely not "/", as that is treated specially in the server.

Removing privacy data from a database?

Say that I needed to share a database with a partner. Obviously I have customer information in that database. Short of going through and identifying every column that contains privacy information and writing a custom script to 'scrub' the data, is there any tool or script which can scrub the data, but keep the format intact (for example, if a string is 5 characters, it would stay 5 characters, only scrubbed)?
If not, how would you accomplish something like this, preferably in TSQL?
You may consider sharing only views; create VIEWs that hide the data you don't want to share.
Example:
CREATE VIEW v_customer
AS
SELECT
NAME,
LEFT(CreditCard,5) + '****' As CreditCard -- OR, don't show this column at all
....
FROM customer
Firstly, I need to state a professional interest: I work for IBM, which has tools that do exactly this.
Step 1. Ensure you identify all the PII (Personally Identifiable Information). When sharing database information it is typical that the obvious column names like "name" are found, but you also need to find the "hidden" data, where either the data is embedded in a standard format (e.g. string-name-string with a column name like "reference code") or it sits in free-format text fields. As you have seen, this is not going to be an easy job unless you automate it. The tool for this is InfoSphere Discovery.
Step 2. Decide what context the "scrubbed" data needs to be in. Changing name fields to random characters causes problems when testing, as users focus on text errors rather than functional failures, so change names to real but fictitious ones. Credit card information often needs to be "valid"; by that I mean it needs to have a valid prefix, say 49XX, but the rest an invalid sequence. Finally, you need to ensure that every instance of the change is propagated through the database to maintain consistency. The tool for this is Optim Test Data Management with the Data Privacy option.
The two tools integrate to give a full data privacy solution.
Based on the original question, it seems you need the fields to be the same length, but not in a "valid" format? How about:
UPDATE customers
SET email = REPLICATE('z', LEN(email))
-- additional fields as needed
Copy/paste and rename tables/fields as appropriate. I think you're going to have a hard time finding a tool that's less work, unless your schema is very complicated, or my formatting assumptions are incorrect.
I don't have an MSSQL database in front of me right now, but you can also find all of the string-like columns by something like:
SELECT *
FROM INFORMATION_SCHEMA.COLUMNS
WHERE DATA_TYPE IN ('...', '...')
I don't remember the exact values you need to compare for, but if you run the query and see what's there, they should be pretty self-explanatory.
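If it helps, here is a hedged sketch that ties the two suggestions together by generating one REPLICATE-style UPDATE per character column from INFORMATION_SCHEMA (assuming SQL Server; the type list is the usual set of character types, and the generated statements should be reviewed before running):
-- emits one UPDATE statement per character column; copy the output and run it after review
-- (this also lists view columns; join to INFORMATION_SCHEMA.TABLES and filter on TABLE_TYPE = 'BASE TABLE' if that matters)
SELECT 'UPDATE ' + QUOTENAME(TABLE_SCHEMA) + '.' + QUOTENAME(TABLE_NAME)
     + ' SET ' + QUOTENAME(COLUMN_NAME)
     + ' = REPLICATE(''z'', LEN(' + QUOTENAME(COLUMN_NAME) + '));'
FROM INFORMATION_SCHEMA.COLUMNS
WHERE DATA_TYPE IN ('char', 'varchar', 'nchar', 'nvarchar');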

Converting SQL Result Sets to XML

I am looking for a tool that can serialize and/or transform SQL Result Sets into XML. Getting dumbed down XML generation from SQL result sets is simple and trivial, but that's not what I need.
The solution has to be database neutral, and accepts only regular SQL query results (no db xml support used). A particular challenge of this tool is to provide nested XML matching any schema from row based results. Intermediate steps are too slow and wasteful - this needs to happen in one single step; no RS->object->XML, preferably no RS->XML->XSLT->XML. It must support streaming due to large result sets, big XML.
Anything out there for this?
With SQL Server you really should consider using the FOR XML construct in the query.
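A minimal sketch of what that might look like, assuming a Contacts table with Name and Address columns (names are illustrative):
-- each row becomes a <Record> element under a single <Result> root
SELECT Name, Address
FROM Contacts
FOR XML PATH('Record'), ROOT('Result');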
If you're using .Net, just use a DataAdapter to fill a dataset. Once it's in a dataset, just use its .WriteXml() method. That breaks your DB->object->XML rule, but it's really how things are done. You might be able to work something out with a datareader, but I doubt it.
Not that I know of. I would just roll my own. It's not that hard to do, maybe something like this:
#!/usr/bin/env jruby
require 'java'
java_import java.sql.DriverManager
# TODO some magic to load the JDBC driver jar
conn = DriverManager.get_connection(ARGV[0], ARGV[1], ARGV[2])
stmt = conn.create_statement
res = stmt.execute_query(ARGV[3])
meta = res.meta_data
puts "<result>"
while res.next
  puts "<row>"
  (1..meta.column_count).each do |n|
    column = meta.get_column_name(n)
    puts "<#{column}>#{res.get_string(n)}</#{column}>"
  end
  puts "</row>"
end
puts "</result>"
Disclaimer: I just made all of that up, I'm not even bothering to pretend that it works. :-)
In .NET you can fill a dataset from any source and then it can write that out to disk for you as XML with or without the schema. I can't say what performance for large sets would be like. Simple :)
Another option, depending on how many schemas you need to output, and/or how dynamic this solution is supposed to be, would be to actually write the XML directly from the SQL statement, as in the following simple example...
SELECT
'<Record>' ||
'<name>' || name || '</name>' ||
'<address>' || address || '</address>' ||
'</Record>'
FROM
contacts
You would have to prepend and append the document element, but I think this example is easy enough to understand.
dbunit (www.dbunit.org) does go from sql to xml and vice versa; you might be able to modify it more for your needs.
Technically, converting a result set to an XML file is straightforward and doesn't need any tool, unless you have a requirement to convert the data structure to fit a specific export schema. In general the result set becomes the top-level element of the XML file, and then you produce a number of record elements containing attributes, which are effectively the fields of a record.
When it comes to Java, for example, you just need an appropriate JDBC driver for interfacing with the DBMS of your choice, addressing the database-independence requirement (drivers are usually provided by the DBMS vendor), and a few lines of code to read a result set and print out an XML string per record, per field. Not a difficult task for an average Java developer in my opinion.
Anyway, the more concrete purpose you state the more concrete answer you get.
In Java, you may just fill an object with the data (like an entity bean) and then use XMLEncoder to write it out as XML. From there you may use XSLT for further conversion or XMLDecoder to bring it back to an object.
Greetz, GHad
PS: See http://ghads.wordpress.com/2008/09/16/java-to-xml-to-java/ for an example for the Object to XML part... From DB to Object multiple more way are possible: JDBC, Groovy DataSets or GORM. Apache Common Beans may help to fill up JavaBeans via Reflection-like methods.
I created a solution to this problem by using the equivalent of a mail merge using the resultset as the source, and a template through which it was merged to produce the desired XML.
The template was standard XML, with a Header element, a Footer element and a Body element. Using a CDATA block in the Body element allowed me to include a complete XML structure that acted as the template for each row. In order to include fields from the resultset in the template, I used markers that looked like this: <[FieldName]>. The template was then pre-parsed to isolate the markers, such that in operation, the template requests each of the fields from the resultset as the Body is being produced.
The Header and Footer elements are output only once at the beginning and end of the output set. The body could be any XML or text structure desired. In your case, it sounds like you might have several templates, one for each of your desired schemas.
All of the above was encapsulated in a Template class, such that after loading the Template, I merely called merge() on the template passing the resultset in as a parameter.