SQL Query crafting

Edit: the outputs include no file names or trailing slashes.
I have a database with potentially thousands of records (we're talking a 2 MB result string if it were just SELECT * FROM xxx in a standard use case).
Now, for security reasons, this result cannot be held anywhere for much further processing.
There is a path field from which I want to extract all records at each level of the folder structure.
So, running the query one way, I get every record in the root:
C:\
Querying it another way, I get every record at the first folder level:
C:\a\
C:\b\
etc.
Then of course I will GROUP somehow in order to return
C:\a\
C:\b\
and not
C:\a\
C:\a\
C:\b\
C:\b\
Hopefully you get the idea.
I will be grateful for any answers that at least move me in the right direction. I really am stumped about where to start with this, as downloading every record and processing it (which is what we do now) is far from the ideal solution in my context.
SAMPLE DATA
C:\a\b\c\d
C:\a\b\c
C:\
C:\a\b
C:\g
D:\x
D:\x\y
Sample output 1
C:\
D:\
Sample output 2
C:\a
C:\g
D:\x
Sample output 3
C:\a\b
D:\x\y
Sample output 4
C:\a\b\c
Sample output 5
C:\a\b\c\d

If you have only folders, you could do: SELECT DISTINCT path FROM table WHERE LENGTH(path) - LENGTH(REPLACE(path, '\', '')) = N
If you have only file names, then it depends on whether your RDBMS provides an INSTR function (or some regexp substitution function). In all cases, it depends on the string functions that are available.
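For illustration, here is a minimal sketch of the folders-only case (MySQL syntax assumed; the table name files and the column path are hypothetical). Level 1 returns the drive roots, level 2 the first folder level, and so on:

-- Pick the folder level to extract; only distinct prefixes come back.
SET @lvl := 2;

SELECT DISTINCT SUBSTRING_INDEX(t.p, '\\', @lvl) AS folder
FROM (SELECT TRIM(TRAILING '\\' FROM path) AS p FROM files) AS t
WHERE LENGTH(t.p) - LENGTH(REPLACE(t.p, '\\', '')) >= @lvl - 1;

Because the DISTINCT is applied server-side, only the handful of prefixes travels over the wire rather than the full 2 MB result; re-append the trailing backslash with CONCAT if the root level should read C:\ rather than C:.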

Related

How do I partition a large file into files/directories using only U-SQL and certain fields in the file?

I have an extremely large CSV, where each row contains customer and store ids, along with transaction information. The current test file is around 40 GB (about 2 days worth), so partitioning is an absolute must for any reasonable return time on select queries.
My question is this: when we receive a file, it contains multiple stores' data. I would like to use the "virtual column" functionality to separate this file into the respective directory structure. That structure is "/Data/{CustomerId}/{StoreID}/file.csv".
I haven't yet gotten it to work with the OUTPUT statement. The statement used was this:
// Output to file
OUTPUT @dt
TO #"/Data/{CustomerNumber}/{StoreNumber}/PosData.csv"
USING Outputters.Csv();
It gives the following error:
Bad request. Invalid pathname. Cosmos Path: adl://<obfuscated>.azuredatalakestore.net/Data/{0}/{1}/68cde242-60e3-4034-b3a2-1e14a5f7343d
Has anyone attempted the same kind of thing? I tried to build the output path by concatenating the fields, but that was a no-go. I thought about doing it as a function (UDF) that takes the two IDs and filters the whole dataset, but that seems terribly inefficient.
Thanks in advance for reading/responding!
Currently, U-SQL requires that all the file outputs of a script be known at compile time. In other words, the output files cannot be created based on the input data.
Dynamic outputs based on data are something we are actively working on for release sometime later in 2017.
In the meantime, until the dynamic output feature is available, the pattern to accomplish what you want requires two scripts:
The first script will use GROUP BY to identify all the unique combinations of CustomerNumber and StoreNumber and write them to a file.
Then, through scripting or a tool written with our SDKs, download the previous output file and programmatically create a second U-SQL script that has an explicit OUTPUT statement for each pair of CustomerNumber and StoreNumber.
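A minimal sketch of that first script (U-SQL; @dt is the rowset from the question, the output path here is hypothetical, and DISTINCT plays the role of the GROUP BY mentioned above):

// First pass: write out every unique CustomerNumber/StoreNumber pair.
@pairs =
    SELECT DISTINCT CustomerNumber, StoreNumber
    FROM @dt;

OUTPUT @pairs
TO "/Data/CustomerStorePairs.csv"
USING Outputters.Csv();

A small tool built with the SDKs would then read CustomerStorePairs.csv and emit the second script, containing one filtered SELECT plus one explicit OUTPUT statement per pair.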

In SQL, what is the memory-efficient way of "mapping" 1 ID to multiple IDs?

I'll describe my scenario so you guys understand what type of design pattern I'm looking for.
I'm making an application where I provide someone with a link that is associated with one or more files. For example, someone needs somePowerpoint.ppx, main.cpp and somevid.mp4, and I have a tool that generates kj13h1djdsja213j1hhadad9933932 and associates it with those 3 files so that I can give someone
mysite.com/getfiles?fid=kj13h1djdsja213j1hhadad9933932
and they'll get a list of those files that they can download individually or all at once.
Since I'm new to SQL, the only way I know of doing that is having my tool use a table like
fid | filename
------------------------------------------------------------------
kj13h1djdsja213j1hhadad9933932 somePowerpoint.ppx
kj13h1djdsja213j1hhadad9933932 main.cpp
kj13h1djdsja213j1hhadad9933932 somevid.mp4
jj133823u22h248884h4h24h01h232 someotherfile.someextension
to go along with the above example. It would be nice if I could do some equivalent of
fid | filename(s)
---------------------------------------------------------------------------
kj13h1djdsja213j1hhadad9933932 somePowerpoint.ppx, main.cpp, somevid.mp4
jj133823u22h248884h4h24h01h232 someotherfile.someextension
but I'm not sure if that's possible or if I should be using some other design pattern altogether.
Any advice?
I believe "Concatenate many rows into a single text string?" can help give you a query that would generate your condensed format (you'd still want to store the full list in SQL, but you could create a view showing the condensed version using the query in the link).
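As a sketch of such a view (STRING_AGG syntax as in SQL Server 2017+ or PostgreSQL; the table and column names file_map, fid and filename are hypothetical):

CREATE VIEW file_map_condensed AS
SELECT fid,
       STRING_AGG(filename, ', ') AS filenames
FROM file_map
GROUP BY fid;

On older SQL Server versions the FOR XML PATH trick from the linked question (or GROUP_CONCAT on MySQL) produces the same condensed output, while the underlying table keeps one row per fid/filename pair.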

TSQL Getting Info From Another Row in the Same Table?

I am building a TSQL query to parse through an FTP log from FileZilla. I am trying to figure out whether there is a way to get information from a line preceding the current one.
For example,
I have parsed out the following procedure: "STOR file.exe"
With FileZilla, it doesn't say whether the STOR was successful until the next line. So I want to check the next line and see whether the STOR was successful or unsuccessful.
Also, people could try to STOR a file multiple times, so I want to get the last version of its status.
Example Info from Log file:
(000005) 4/10/2010 14:55:30 PM - ftp_login_name (IP Address)> STOR file.exe
(000005) 4/10/2010 14:55:30 PM - ftp_login_name (IP Address)> 150 Opening data for transfer.
(000005) 4/10/2010 14:55:30 PM - ftp_login_name (IP Address)> 226 Transfer OK
I want to add a column in my query that says that the STOR was successful or unsuccessful.
Thanks!
Assuming you have parsed these lines into actual columns and you have SQL Server 2005 or greater, you can use CROSS APPLY; an example query is below (untested). I hope this helps.
SELECT o.*, prev.*
FROM FTPLog o
CROSS APPLY
(
    -- the most recent log line before the current one
    SELECT TOP 1 *
    FROM FTPLog p
    WHERE p.LogDate < o.LogDate
    ORDER BY p.LogDate DESC
) prev
James has the right idea, though there may be some issues if you ever have log dates that are exactly the same (and from your sample it looks like you might). You may be able to add an identity column to force an order at the time the data is inserted, then you can use James' concept on the identity column.
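A sketch of that variant (T-SQL; the Message column name is an assumption, and this version looks at the line that follows, since the 150/226 status arrives after the STOR):

ALTER TABLE FTPLog ADD LogId INT IDENTITY(1, 1);

SELECT o.*, nxt.Message AS StatusLine
FROM FTPLog o
CROSS APPLY
(
    -- the first line strictly after the current one, even when LogDate values tie
    SELECT TOP 1 n.Message
    FROM FTPLog n
    WHERE n.LogId > o.LogId
    ORDER BY n.LogId ASC
) nxt
WHERE o.Message LIKE 'STOR %';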
More than that though, TSQL may not be the best choice for this project, at least not by itself. While there are techniques you can use to make it iterate sequentially, it is not as good for that as certain other languages are. You may want to consider parsing your files in a tool, such as Python or Perl or even C#, that is better at text processing and better at processing data sequentially.

Creating database tables from text file data

I have text files, generated by another piece of software, about genes in the human body. I need to insert them into a table and shape the table as I need. I have 15 different text files that go into one table, as 15 different columns.
GFER = 3.58982863
BPIL1 = 3.58982863
BTBD1 = 4.51464898
BTBD2 = 4.40934218
RPLP1 = 3.57462687
PDHA1 = 4.19320066
LRRC7 = 4.50967385
HIGD1A = 4.46876727
Shown above is the data in the text file: gene name and distance. I need to put this into a table, with the gene name in one column and the distance in another. Each text file has 3500 lines, and I have 14 text files of data. How can I enter this data into a table without inserting it manually? Is there any automated software or tool you know of? Please help me out!
Regards,
Rangana
The mysqlimport command ought to load it directly (http://dev.mysql.com/doc/refman/5.0/en/mysqlimport.html), if you use a little trick to tell it that the = sign is the field delimiter.
shell> mysqlimport blah blah --fields-terminated-by==
If that does not work, write yourself a little routine to read the file, split on = sign, and replace it with a comma or something closer to what mysqlimport wants to see as a default.
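If an external routine is awkward, the split on the = sign can also be done inside MySQL by loading each raw line into a one-column staging table first (a sketch; the staging table gene_raw and the target table gene are assumptions):

CREATE TABLE gene_raw (line VARCHAR(255));

LOAD DATA INFILE '/tmp/gene.txt' INTO TABLE gene_raw
LINES TERMINATED BY '\n';

-- Split each line on '=' and trim the surrounding spaces.
INSERT INTO gene (name, distance)
SELECT TRIM(SUBSTRING_INDEX(line, '=', 1)),
       CAST(TRIM(SUBSTRING_INDEX(line, '=', -1)) AS DECIMAL(12, 8))
FROM gene_raw;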
You need an import wizard. I can't say I've personally used one with MySQL (but I have with other DBMSs); a quick Google search suggests something like SQL Maestro might be what you need. I have a feeling phpMyAdmin used to have a feature that did this.
First create the table as something like:
mysql> create table gene (name varchar(10), distance double);
and then import a file:
mysql> load data infile '/tmp/gene.txt' into table gene columns terminated by ' = ';
The file needs to be in a place that is accessible to the user account under which the MySQL server is running.
You can also use mysqlimport from outside the mysql shell. It connects to the server and issues the equivalent load data infile command.
I tested the above with your sample data and it worked.
i have 15 different text files that goes in to one table, as 15 different columns.
Do you mean 30 columns? 2 columns loaded from each file?
You may have to use ' = ' (with spaces on both sides) as the delimiter. And as Ken said, if that doesn't do it, search and replace " = " with just a comma ",".
If you have SSIS, this can be done fairly quickly. Set up the 15 input files and map each file to a pair of columns, like:
File1 ... map to ... Column1 & Column2
File2 ... map to ... Column3 & Column4
etc
Or you can combine the 15 files (can be done easily using Excel) into 1 file with 30 columns and load it in.
I have done it. It may seem odd, but I'm adding it here for someone to learn from if it is valuable. I opened those data files using the OpenOffice spreadsheet; OpenOffice has an amazing feature for separating a data file into different columns. So I used it, separated my data files into columns, and saved the result as an Excel file (.xls). Then, using SQL Maestro as suggested by m.edmondson, I was able to achieve my task with that software's feature for importing data from an Excel file.
Thank you all for your valuable answers; they surely add new things to my knowledge! Thank you all once again!

Is there a way to parse a SQL query to pull out the column names and table names?

I have 150+ SQL queries in separate text files that I need to analyze (just the actual SQL code, not the data results) in order to identify all column names and table names used, preferably with the number of times each column and table makes an appearance. Writing a brand-new SQL parsing program is trickier than it seems, with nested SELECT statements and the like.
There has to be a program, or code out there that does this (or something close to this), but I have not found it.
I actually ended up using a tool called SQL Pretty Printer. You can purchase a desktop version, but I just used the free online application. Just copy the query into the text box, set the Output to "List DB Object" and click the Format SQL button.
It worked great on around 150 different (and complex) SQL queries.
How about using the Execution Plan report in MS SQLServer? You can save this to an xml file which can then be parsed.
You may want to look into something like JSqlParser, which uses JavaCC to parse the query string and return it as an object graph. I've never used it, so I can't vouch for its quality.
If your application needs to do it, and has access to a database that has the tables etc., you could run something like:
SELECT TOP 0 * FROM MY_TABLE
using ADO.NET. This would give you a DataTable instance from which you could query the columns and their attributes.
I would suggest going with ANTLR. Write a grammar and follow the steps given on the ANTLR site; eventually you will get an AST (abstract syntax tree). For a given query, you can traverse the AST and pull out every table and column present in the query.
In DB2, you can append your query with something such as the following; note that 1 is the minimum you can specify, and it will throw an error if you try to specify 0:
FETCH FIRST 1 ROW ONLY
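For example (reusing the hypothetical MY_TABLE from the ADO.NET answer above):

SELECT * FROM MY_TABLE
FETCH FIRST 1 ROW ONLY

That returns at most one row of data while still exposing the full column list in the result's metadata.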