USQL Extract subset of files - azure-data-lake

I have a U-SQL question. I have a daily job that outputs files to a directory in the following format:
/MyOutput/{YYYY}/{MM}/{DD}/file.csv
I now have a second job I want to run that will use the most recent 30 files produced by the first job, but I can't figure out the best way to do this.
I know I can use wildcards in the extractor, but I'd prefer not to extract all the files and then use a SELECT/WHERE to remove the ones I don't want, as extracting everything could get really costly if I'm keeping years' worth of these files.
So is there a nice way in U-SQL to say "extract only the most recent x files"? What other options do I have here?
Thanks,
John

If you use a date pattern it will do what you want.
@rows =
    EXTRACT ...,
            date DateTime
    FROM "/MyOutput/{date:yyyy}/{date:MM}/{date:dd}/file.csv"
    USING Extractors.Csv();

@result =
    SELECT *
    FROM @rows
    WHERE date > new DateTime(2018, 5, 3);
This will read only the files matching the date predicate; it won't read them all in first.
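For the rolling window in the question, a minimal sketch along these lines should work, assuming one file per day (so the last 30 files are the last 30 days); the someColumn column and the output path are just placeholders:
DECLARE @cutoff DateTime = DateTime.Now.AddDays(-30);

@rows =
    EXTRACT someColumn string,      // stand-in for your real columns
            date DateTime           // virtual column taken from the path
    FROM "/MyOutput/{date:yyyy}/{date:MM}/{date:dd}/file.csv"
    USING Extractors.Csv();

@recent =
    SELECT *
    FROM @rows
    WHERE date >= @cutoff;

OUTPUT @recent
TO "/MyOutput/last30days.csv"
USING Outputters.Csv();
Because the predicate is on the virtual date column, the same file elimination should apply and only the matching folders get read.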

Related

Import csv file into SQL Server temptable, without specifying columns

I am currently trying to manipulate a fairly poor CSV file into an SQL database. I would prefer not to use another program to amend the CSV file: it's the output of a locked program, and the business has a general understanding of SQL, which is why I prefer an SQL solution.
What I want to do is take the 60 columns from this CSV and import them all into a table, without specifying the column names. I have seen BULK INSERT, which allows me to put them into a table, but only one that's already been created.
My actual end goal is to transform the data in the CSV so I have a table with just five columns, but step one is to get all of the data in, so that I can then grab the information that's relevant.
My CSV file has a header row, but 59 of the 60 headers are blank, so there are no actual column headers in the sheet.
Example
SHEET,,,,
Data1, Data2, Data3, Data4, Data5,
If I could grab just the columns I need (which in my case are columns 7, 31, 55 and 59), that would be even better.
I've scoured the internet and can't find the exact solution I'm looking for.
I've also tried to use SSIS for this, but honestly I find it unreliable sometimes and slight changes seem to break everything, so I gave up! (My idea was flat file import > Derived Column > data conversion > OLE DB destination, but I got errors with the flat file import which I can't seem to solve.)
I would prefer to do it all in one SQL script if I can, but any suggestions on the best way to achieve this are welcome.
Thank you,
Craig
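One possible approach, sketched below under some assumptions (SQL Server with BULK INSERT rights, a comma-delimited file with one header line, and made-up file, table and column names): build a generic 60-column staging table dynamically, bulk-load the file while skipping the header, then keep only the columns that matter.
-- Build a staging table with 60 generic text columns (col1 .. col60).
DECLARE @cols NVARCHAR(MAX) = N'';
DECLARE @i INT = 1;
WHILE @i <= 60
BEGIN
    SET @cols += CASE WHEN @i > 1 THEN N', ' ELSE N'' END
               + N'col' + CAST(@i AS NVARCHAR(10)) + N' NVARCHAR(255)';
    SET @i += 1;
END;

DECLARE @sql NVARCHAR(MAX) = N'CREATE TABLE dbo.CsvStaging (' + @cols + N');';
EXEC sp_executesql @sql;

-- Load everything as text, skipping the (mostly blank) header row.
BULK INSERT dbo.CsvStaging
FROM 'C:\import\source.csv'              -- hypothetical path
WITH (FIELDTERMINATOR = ',',
      ROWTERMINATOR   = '\n',
      FIRSTROW        = 2);

-- Step two: keep only the columns of interest (7, 31, 55 and 59 here).
SELECT col7, col31, col55, col59
INTO dbo.CsvFinal
FROM dbo.CsvStaging;
From there the five-column transformation can be done with an ordinary SELECT/INSERT.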

How do I partition a large file into files/directories using only U-SQL and certain fields in the file?

I have an extremely large CSV where each row contains customer and store IDs, along with transaction information. The current test file is around 40 GB (about two days' worth), so partitioning is an absolute must for any reasonable return time on select queries.
My question is this: when we receive a file, it contains multiple stores' data. I would like to use the "virtual column" functionality to separate this file into the respective directory structure. That structure is "/Data/{CustomerId}/{StoreID}/file.csv".
I haven't yet gotten it to work with the OUTPUT statement. The statement I used was:
// Output to file
OUTPUT @dt
TO "/Data/{CustomerNumber}/{StoreNumber}/PosData.csv"
USING Outputters.Csv();
It gives the following error:
Bad request. Invalid pathname. Cosmos Path: adl://<obfuscated>.azuredatalakestore.net/Data/{0}/{1}/68cde242-60e3-4034-b3a2-1e14a5f7343d
Has anyone attempted the same kind of thing? I tried to concatenate the output path from the fields, but that was a no-go. I thought about doing it as a user-defined function (UDF) that takes the two IDs and filters the whole dataset, but that seems terribly inefficient.
Thanks in advance for reading/responding!
Currently U-SQL requires that all the file outputs of a script be known at compile time. In other words, the output files cannot be created based on the input data.
Dynamic outputs based on data are something we are actively working on for release sometime later in 2017.
In the meantime, until the dynamic output feature is available, the pattern to accomplish what you want requires two scripts:
The first script uses GROUP BY to identify all the unique combinations of CustomerNumber and StoreNumber and writes that list to a file.
Then, through scripting or a tool written using our SDKs, download that output file and programmatically generate a second U-SQL script that has an explicit OUTPUT statement for each pair of CustomerNumber and StoreNumber (a sketch of both pieces follows).
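Roughly, with the @dt rowset from the question and made-up paths and ID values, the two pieces might look like this:
// Script 1: write out the distinct CustomerNumber/StoreNumber pairs.
@pairs =
    SELECT CustomerNumber,
           StoreNumber
    FROM @dt
    GROUP BY CustomerNumber, StoreNumber;

OUTPUT @pairs
TO "/Data/partition-list.csv"          // assumed location for the pair list
USING Outputters.Csv();

// Script 2, generated programmatically: one filtered OUTPUT per pair,
// shown here for a hypothetical pair (123, 456) with integer IDs.
@pair_123_456 =
    SELECT *
    FROM @dt
    WHERE CustomerNumber == 123 AND StoreNumber == 456;

OUTPUT @pair_123_456
TO "/Data/123/456/PosData.csv"
USING Outputters.Csv();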

How to get SQLite to ignore the first 17 rows of .csv file?

I'm brand new to using SQL and I'm not quite sure where to begin. I have many .csv files in a folder and I want to build a database to provide a way to search through the information stored in each file. All of the files are identical in their parameters, meaning they are all set up the same way in terms of columns and rows, so this should be easy to code. Currently I am using SQLite to store and organize all of the data AFTER the first 17 rows of information. Deleting the first 17 rows is also acceptable. I am familiar with Java and C++, but I'm not sure how to skip or delete the first 17 rows in SQL, and I don't know how to code for this in SQLite. I would think this is a simple thing, but I can't find anything on how to skip or delete the first 17 rows of each .csv file. How would I go about telling SQLite to delete the first 17 rows?
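A sketch using the sqlite3 command-line shell, with made-up file and table names. It assumes an ordinary rowid table so insertion order is preserved, and that the staging table already exists (if it doesn't, .import consumes the first line as column names, leaving only 16 rows to delete). Recent versions of the shell also have a --skip option on .import that avoids the DELETE entirely.
-- Load the whole file into a staging table.
.mode csv
.import data.csv readings

-- Then drop the 17 preamble rows that came in with the data.
DELETE FROM readings
WHERE rowid IN (SELECT rowid FROM readings ORDER BY rowid LIMIT 17);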

Search for file names within a given convention

Below is an example of a file that gets downloaded to some directory. I was given this file name convention: the "SSUP-RX-" prefix is static and never changes, while the rest varies by username, date and time.
SSUP-RX-admin-2014_12_2-9_16_5_69.csv
What I need to do is search for that kind of file in a given directory and then extract the date and time from the name. What is the best way to search for such files, and how do I read the date/time out of them afterwards?
P.S. After the username, the parts are probably year, month, day, hour, minutes, seconds, milliseconds.
Well, I'm not quite sure I got your question right, but I think you are trying to take the file name and extract or clean up the parts you need, so that for example the file name is:
test-2015-01-28-14-15-30-9.csv
So for that, the simple way is to retrieve the file name as is and, as has been said, use a regex to extract only the parts you want, which you can then use however you wish.
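For example, a pattern along these lines (only a sketch built from the sample name above, using .NET named groups, usable from VB.NET via System.Text.RegularExpressions.Regex) would capture the user and each date/time part:
^SSUP-RX-(?<user>.+)-(?<year>\d{4})_(?<month>\d{1,2})_(?<day>\d{1,2})-(?<hour>\d{1,2})_(?<minute>\d{1,2})_(?<second>\d{1,2})_(?<ms>\d+)\.csv$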
Here is how to use regex:
visualbasic.about.com/od/usingvbnet/a/RegExNET.htm
Check this out under "Check USA Telephone Number"; it is similar to what you want:
visualbasic.about.com/od/usingvbnet/a/RegExNET_2.htm
Hope this answers your question.

SQL Query crafting

Edited outputs: no file names or trailing slashes are included
I have a database with potentially thousands of records (we're talking a 2 MB result string if it were just SELECT * FROM xxx in a standard use case).
Now, for security reasons, this result cannot be held anywhere for much more processing.
There is a path field, and I want to extract all records at each level of the folder structure.
So running the query one way, I get every record at the root:
C:\
Running the query another way, I get every record at the first folder level:
C:\a\
C:\b\
etc
Then of course I will GROUP somehow in order to return
C:\a\
C:\b\
and not
C:\a\
C:\a\
C:\b\
C:\b\
Hopefully you get the idea?
I will be grateful for any answers that at least move me in the right direction. I really am stumped as to where to start with this, since downloading every record and processing it is far from the ideal solution in my context (which is what we do now).
SAMPLE DATA
C:\a\b\c\d
C:\a\b\c
C:\
C:\a\b
C:\g
D:\x
D:\x\y
Sample output 1
C:\
D:\
Sample output 2
C:\a
C:\g
D:\x
Sample output 3
C:\a\b
D:\x\y
Sample output 4
C:\a\b\c
Sample output 5
C:\a\b\c\d
If you have only folders, you could do: SELECT DISTINCT path FROM table WHERE LENGTH(path) - LENGTH(REPLACE(path, '\', '')) = N
If the paths also contain file names, then it depends on whether your RDBMS provides an INSTR function (or some regexp substitution function). In all cases it depends on the string functions that are available.
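When paths go deeper than the level you are querying (as in the sample data), you also have to cut each path down to its first N separators before applying DISTINCT. A sketch in SQL Server syntax, with paths/path as placeholder table and column names; other RDBMSs would use INSTR or SUBSTRING_INDEX instead of CHARINDEX:
-- Sample output 1: everything up to and including the first backslash.
SELECT DISTINCT LEFT(path, CHARINDEX('\', path)) AS level0
FROM   paths;

-- Sample output 2: trim each path to its first folder, skipping rows that
-- stop at the drive root; nest another CHARINDEX per additional level.
SELECT DISTINCT
       LEFT(path + '\', CHARINDEX('\', path + '\', CHARINDEX('\', path) + 1) - 1) AS level1
FROM   paths
WHERE  LEN(path) > CHARINDEX('\', path);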