How do I partition a large file into files/directories using only U-SQL and certain fields in the file? - azure-data-lake

I have an extremely large CSV, where each row contains customer and store ids, along with transaction information. The current test file is around 40 GB (about 2 days worth), so partitioning is an absolute must for any reasonable return time on select queries.
My question is this: When we receive a file, it contains multiple store's data. I would like to use the "virtual column" functionality to separate this file into the respective directory structure. That structure is "/Data/{CustomerId}/{StoreID}/file.csv".
I haven't yet gotten it to work with the OUTPUT statement. The statement use was thus:
// Output to file
OUTPUT #dt
TO #"/Data/{CustomerNumber}/{StoreNumber}/PosData.csv"
USING Outputters.Csv();
It gives the following error:
Bad request. Invalid pathname. Cosmos Path: adl://<obfuscated>.azuredatalakestore.net/Data/{0}/{1}/68cde242-60e3-4034-b3a2-1e14a5f7343d
Has anyone attempted the same kind of thing? I tried to concatenate the outputpath from the fields, but that was a no-go. I thought about doing it as a function (UDF) that takes the two ID's and filters the whole dataset, but that seems terribly inefficient.
Thanks in advance for reading/responding!

Currently U-SQL requires that all the file outputs of a script must be understood at compile time. In other words, the output files cannot be created based on the input data.
Dynamic outputs based on data are something we are actively working for release sometime later in 2017.
In the meanwhile until the dynamic output feature is available, the pattern to accomplish what you want requires using two scripts
The first script will use GROUP BY to identify all the unique combinations of CustomerNumber and StoreNumber and write that to a file.
Then through the use of scripting or a tool written using our SDKs, download the previous output file and then programmatically create a second U-SQL script that has an explicit OUTPUT statement for each pair of CustomerNumber and StoreNumber

Related

how to properly load a gcs file into gbq with double-pipe delimiters

my existing query:
bq load --field_delimiter="||" --skip_leading_rows=1 source.taxassessor gs://taxassessor/OFFRS_5_0_TAXASSESSOR_0001_001.txt taxassessor.txt
the error I get back is:
Not enough positional args, still looking for destination_table
I tried to mimic the command in the Web UI, and I cannot reproduce because the Web UI doesn't allow double-pipe delimiters (limitation on the UI? or the solution?)
I have two questions:
How do I repair the current query
the source file OFFRS_5_0_TAXASSESSOR_0001_001.txt is one of many source files with the last three characters of the file name showing what file number in the series that file is. How do I use wild cards so I can get file 002.txt, 003.txt, etc. something like OFFRS_5_0_TAXASSESSOR_0001_*.txt?
Thanks
How do I use wild cards so I can get file 002.txt, 003.txt, etc. something like OFFRS_5_0_TAXASSESSOR_0001_*.txt?
Do as you suggested, for instance:
bq load --field_delimiter="||" --skip_leading_rows=1 source.taxassessor gs://taxassessor/OFFRS_5_0_TAXASSESSOR_0001_*.txt taxassessor.txt
It should already work.
How do I repair the current query?
Not sure why you are getting this message as everything seems to be correct...but still it shouldn't work as your delimiter have 2 characters and they are supposed to have just one (imagine for instance if your file has the string "abcd|||efg||hijk|||l", it'd be hard to tell where the delimiter is; whether it's the first two pipes or the last).
If you can't change the delimiter, one thing you could do is saving everything in BigQuery as one entire STRING field. After that, you can extract the fields as you want, something like:
WITH data AS(
select "alsdkfj||sldkjf" as field UNION ALL
select "sldkfjld|||dlskfjdslk"
)
SELECT SPLIT(field, "||") all_fields FROM data
all_fields will have all columns in your files, you can then save the results to some other table or run any analyzes you want.
As a recommendation, it would probably be better if you could change this delimiter to something else, having just one character.

.Net Parsing Fixed Width Data... From a Concatenated, Single, Fixed-Width Column

I was bored and looking at old code that runs like molasses on a cold day. I found that a group of tables in our accounting system - each with 500,000 records of ~20 datapoints - that use a single column of concatenated, fixed-width values instead of separate columns. (Fixing the tables isn't an option.) An old .net ETL project is grabbing all records, doing a bunch of substrings on each record to set an object's corresponding attributes, then sending the object to merge with production data via a stored proc.
The way it is working is fine. It works. And, to be perfectly honest, I doubt I'll be given the go-ahead to fix it even if I come up with a better solution, but I was curious to see if anyone knew of a better way of doing this, because it's not entirely unlikely that I'll face a situation like this in the future.
I was thinking that if there was a way to use the TextFieldParser to parse a static string instead of a file/stream that might be a valid idea. Or, instead, I could write the entire table to a text file and then use the TextFieldParser to send data to the SProc. http://www.dotnetperls.com/textfieldparser does show that TextFieldParser is quite a bit faster than split, which I would assume is tantamount to the string manipulation our project is currently doing with substring. So there may be something to that idea.
Or perhaps the whole, old project should be dumped for a shiny new SSIS project. Would it also have to write the records to a flat file before importing into SQL? Or can it import directly from the table?
Thank you in advance!

Dynamically execute a transformation against a column at runtime

I have a Pentaho Kettle job that can load data from x number of tables, and put it into target tables with a different schema.
Assume I have table 1, like so:
I want to load this table into a destination table that looks like this:
The columns have been renamed, the order has been changed, and the data has been transformed. The rename, and order is easily managed by using the Select Values step, which can be used within an ETL Metadata Injection step, making it dependent on some configuration values loaded at runtime.
But if I need to perform some transformation logic on some of the columns, based on where they go in the target table, this seems to be less straightforward.
In my example, I want the column "CountryName" to be capitalised, and the column "Rating" to be floored (as in changing the real number to the previous integer value).
While I could do this by just manually adding a transformation to accomplish each, I want my solution to be dynamic, so it could just as easily run the "CountryName" column through a checksum component, or perform a ceiling on "Rating" instead.
I can easily wrap these transformations in another transformation so that they can be parameterised and executed when needed:
But, where I'm having trouble is, when I process a row of data, I need a way to be able to say:
Column "CountryName" should be passed through the Capitalisation transform
Column "Rating" should be passed through the Floor transform
Column(s) "AnythingElse" should be passed through the SomeOther transform
Is there a way to dynamically split out the columns in a row, and execute a different transform on each one, based on some configuration metadata that can be supplied?
Logically, it would be something like this, although I suspect there may be a way to handle it as a loop or some form of dynamic transformation, rather than mapping out a path per column:
Kettle is so flexible that it seems like there must be a way to do this, I'm just struggling to know which components to use and how to do it. Any experts out there have some suggestions?
I'm dealing with some biggish data sets here (hundreds of millions of rows) so reluctant to use Row Normaliser/Denormaliser or writing to file/DB if possible.
Have you considered the Modified Java Script Value step? Start with the Data Grid step, the a Select Values step, then the Modified Java Script Value step. In that step you will transform the value of each column in what you form you want and output that in a file.
That of course requires some Java script knowledge but given your example it seems that the required knowledge is pretty basic.

Is it feasible to split data from differently formatted csv files in MS-SQL into several tables with one row per field of a file?

I only found answers about how to import csv files into the database, for example as blob or as 1:1 representation of the table you are importing it into.
What I need is a little different: My team and I are tracking everything we do in a database. A lot of these tasks produce logfiles, benchmark results, etc., which are stored in CSV format. The number of columns are far from consistent and also the data could be completely different from file to file, e.g. it could be a log from fraps with frametimes in it or a log of CPU temparatures over an amount of time, or even something completely different.
Long story short, I came up with an idea, but - being far from a sql pro - I am not sure if it makes sense or if there is a more elegant solution.
Does this make sense to you:
We also need to deal with a lot of data that is produced, so please give me also your opinion if that is feasible with like 200 files per day which can easyly have a couple of thousands rows.
The purpose of all this will be, that we can generate reports form the stored data and perform analysis of the data. E.g. view it on a webpage in a graph or do calculations with it.
I'm limited to MS-SQL in this case, because that's what the current (quite complex) database is and I'm just adding a new schema with that functionality to it.
Currently we just archive the files on a raid and store a link to it in the database. So everyone who wants to do magic with the data needs to download every file he needs and then use R or Excel to create a visualization of the data.
Have you considered a column of XML data type for the file data as an alternative of ColumnId -> Data structure? SQL server provides is a special dedicated XML index (over the entire XML structure) so your data can be fully indexed no matter what CSV columns you have. You will have much less records in the database to handle (as an entire CSV file will be a single XML field value). There are good XML query options to search by values & attributes of the XML type.
For that you will need to translate CSV to XML, but you will have to parse it either way ...
Not that your plan won't work, I am just giving an idea :)
=========================================================
Update with some online info:
An article from Simple talk: The XML Methods in SQL Server
Microsoft documentation for nodes() with various use case samples: nodes() Method (xml Data Type)
Microsoft document for value() with various use case samples: value() Method (xml Data Type)

Is there a way to parser a SQL query to pull out the column names and table names?

I have 150+ SQL queries in separate text files that I need to analyze (just the actual SQL code, not the data results) in order to identify all column names and table names used. Preferably with the number of times each column and table makes an appearance. Writing a brand new SQL parsing program is trickier than is seems, with nested SELECT statements and the like.
There has to be a program, or code out there that does this (or something close to this), but I have not found it.
I actually ended up using a tool called
SQL Pretty Printer. You can purchase a desktop version, but I just used the free online application. Just copy the query into the text box, set the Output to "List DB Object" and click the Format SQL button.
It work great using around 150 different (and complex) SQL queries.
How about using the Execution Plan report in MS SQLServer? You can save this to an xml file which can then be parsed.
You may want to looking to something like this:
JSqlParser
which uses JavaCC to parse and return the query string as an object graph. I've never used it, so I can't vouch for its quality.
If you're application needs to do it, and has access to a database that has the tables etc, you could run something like:
SELECT TOP 0 * FROM MY_TABLE
Using ADO.NET. This would give you a DataTable instance for which you could query the columns and their attributes.
Please go with antlr... Write a grammar n follow the steps..which is given in antlr site..eventually you will get AST(abstract syntax tree). For the given query... we can traverse through this and bring all table ,column which is present in the query..
In DB2 you can append your query with something such as the following, but 1 is the minimum you can specify; it will throw an error if you try to specify 0:
FETCH FIRST 1 ROW ONLY