Select rows to extract from a CSV file in USQL - azure-data-lake

I'm trying to extract a few columns from a CSV file.
This file is replaced every day, and columns can be added to it.
My problem is that every time the number of columns changes, I need to update the U-SQL code. Any help?
@billing =
    EXTRACT
        id string,
        company string
    FROM @companydatafile
    USING Extractors.Csv(skipFirstNRows : 1);
That works on this CSV file:
1, company1
2, company2
But if I update the file to
1, company1, address1
2, company2, address1
the script returns an error.
Many Thanks!

Another hint, in case you do not want to use a custom extractor but would like to use built-in extractors:
If you know that your CSV schema will evolve over time, differentiate between the versions in the path name. Then you can use the following pattern:
@s1 = EXTRACT ... FROM "/data/v1/{*}.csv" USING Extractors.Csv();
@s2 = EXTRACT ... FROM "/data/v2/{*}.csv" USING Extractors.Csv();
....
@data = SELECT * FROM @s1 OUTER UNION ALL BY NAME(*) SELECT * FROM @s2 ...;
You can also wrap this in a table-valued function to abstract it. That way you only have to update the function definition, and the scripts that use it will automatically get the latest version.

David is correct - if you would like to run the same job for variable columns with no changes to the script, you should create a custom extractor. You can also automatically create an EXTRACT statement from a file using ADL Tools for VS (blog here), which means you can avoid delving through the file each time to get the new columns.
You can also vote or create a new feature request here to help increase the priority for developing this. Hope this helps, and let me know if you have other questions.
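To make the custom-extractor idea more concrete, here is a minimal sketch of the logic such an extractor would need: keep only the leading columns you care about and ignore any columns appended later. It is written in Python purely for illustration (a real U-SQL custom extractor would be written in C#), and the function name `extract_columns` and its parameters are hypothetical.

```python
import csv
import io


def extract_columns(lines, names, skip_first_n_rows=0):
    """Yield one dict per row, keeping only the first len(names) fields.

    Extra trailing columns are ignored, so the caller does not need to
    change when new columns are appended to the file.
    """
    reader = csv.reader(lines, skipinitialspace=True)
    for index, row in enumerate(reader):
        if index < skip_first_n_rows or not row:
            continue  # skip header rows and blank lines
        if len(row) < len(names):
            raise ValueError(f"row {index + 1} has only {len(row)} columns")
        # zip stops after len(names) fields, dropping the extra ones
        yield dict(zip(names, row))


# The file from the question after a third column was added,
# with a hypothetical header row to skip.
sample = io.StringIO(
    "id, company, address\n"
    "1, company1, address1\n"
    "2, company2, address1\n"
)
rows = list(extract_columns(sample, ["id", "company"], skip_first_n_rows=1))
print(rows)
```

This only works when new columns are appended at the end; if columns can be inserted in the middle, the extractor would have to map by header name instead.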

Have you seen "How to deal with files containing rows with different column counts in U-SQL: Introducing a Flexible Schema Extractor"?

Related

How to delete records using sysdate through informatica

I have developed a mapping in Informatica. The source is a file. I need to write a post-SQL statement that deletes the existing data if a file with the same name arrives again. The file arrives once a month and is named like jass_naming_yyyymm.csv. I have written delete from tab where load_date = sysdate, but it's not working. load_date is a column in the target table that stores the yyyymm value from the file name. So the logic should be: if a file with an existing yyyymm arrives again, the existing data is deleted and the new file is loaded.
Please suggest a solution.
Post SQL will not help here. You need two pipelines.
Pipeline 1 - Src->exp->tgt.
Use the indirect file read method and get the file name, so that you can extract the yyyy_mm part from it.
Use the 'update override' option in the target to delete the data, with this logic:
DELETE FROM target_table WHERE target_yyyy_mm = :TU.source_yyyy_mm
Pipeline 2 - your mapping.
HTH
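The delete-then-load logic can also be sketched outside Informatica. The example below is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for the real target database, the `payload` column is hypothetical, and the file name pattern is taken from the question (jass_naming_yyyymm.csv). The key point is that the delete is keyed on the yyyymm taken from the file name, not on sysdate.

```python
import re
import sqlite3

# Matches the trailing _yyyymm.csv part of names like jass_naming_201901.csv
FILE_PATTERN = re.compile(r"_(\d{6})\.csv$")


def period_from_file_name(file_name):
    """Return the yyyymm part of the file name, or raise if it is missing."""
    match = FILE_PATTERN.search(file_name)
    if match is None:
        raise ValueError(f"no yyyymm found in {file_name!r}")
    return match.group(1)


def reload_period(conn, file_name, payloads):
    """Delete rows already loaded for this file's period, then insert new ones."""
    period = period_from_file_name(file_name)
    with conn:  # one transaction: the delete and the inserts commit together
        conn.execute("DELETE FROM tab WHERE load_date = ?", (period,))
        conn.executemany(
            "INSERT INTO tab (load_date, payload) VALUES (?, ?)",
            [(period, payload) for payload in payloads],
        )
    return period


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tab (load_date TEXT, payload TEXT)")
reload_period(conn, "jass_naming_201901.csv", ["a", "b"])
reload_period(conn, "jass_naming_201902.csv", ["c"])
reload_period(conn, "jass_naming_201901.csv", ["d"])  # same month arrives again
remaining = conn.execute(
    "SELECT load_date, payload FROM tab ORDER BY load_date, payload"
).fetchall()
print(remaining)
```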

Trying to create a table and load data into same table using Databricks and SQL

I Googled for a way to create a table using Databricks and Azure SQL Server and load data into that same table. I found some sample code online that seems pretty straightforward, but apparently there is an issue somewhere. Here is my code.
CREATE TABLE MyTable
USING org.apache.spark.sql.jdbc
OPTIONS (
url "jdbc:sqlserver://server_name_here.database.windows.net:1433;database = db_name_here",
user "u_name",
password "p_wd",
dbtable "MyTable"
);
Now, here is my error.
Error in SQL statement: SQLServerException: Invalid object name 'MyTable'.
My password, unfortunately, has spaces in it. That could be the problem, perhaps, but I don't think so.
Basically, I would like to get this to recursively loop through files in a folder and sub-folders, and load data from files with a string pattern, like 'ABC*', and load recursively all these files into a table. The blocker, here, is that I need the file name loaded into a field as well. So, I want to load data from MANY files, into 4 fields of actual data, and 1 field that captures the file name. The only way I can distinguish the different data sets is with the file name. Is this possible? Or, is this an exercise in futility?
My suggestion is to use the Azure SQL Spark library, as also mentioned in the documentation:
https://docs.databricks.com/spark/latest/data-sources/sql-databases-azure.html#connect-to-spark-using-this-library
'Bulk Copy' is what you want to use to get good performance. Just load your files into a DataFrame and bulk copy it to Azure SQL:
https://docs.databricks.com/data/data-sources/sql-databases-azure.html#bulk-copy-to-azure-sql-database-or-sql-server
To read files from subfolders, the answer is here:
How to import multiple csv files in a single load?
I finally, finally, finally got this working.
val myDFCsv = spark.read.format("csv")
  .option("sep", "|")
  .option("inferSchema", "true")
  .option("header", "false")
  .load("mnt/rawdata/2019/01/01/client/ABC*.gz")
myDFCsv.show()
myDFCsv.count()
Thanks for pointing me in the right direction, mauridb!!
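The question also asked for the source file name to be stored in a fifth field, which the snippet above does not do. In Spark itself, the `input_file_name()` function is the usual way to capture the source path as a column. The sketch below shows the same idea with only the Python standard library, assuming pipe-delimited gzip files whose names start with ABC; the function name `read_with_file_name` and the demo file names are hypothetical.

```python
import csv
import gzip
import tempfile
from pathlib import Path


def read_with_file_name(root, pattern="ABC*.gz", sep="|"):
    """Read every matching file under root (recursively) and append
    the file name as an extra field on each row."""
    records = []
    for path in sorted(Path(root).rglob(pattern)):
        with gzip.open(path, "rt", newline="") as handle:
            for row in csv.reader(handle, delimiter=sep):
                records.append(row + [path.name])
    return records


# Small demo with throwaway files in a temporary folder tree.
with tempfile.TemporaryDirectory() as tmp:
    nested = Path(tmp) / "2019" / "01"
    nested.mkdir(parents=True)
    with gzip.open(nested / "ABC_one.gz", "wt", newline="") as out:
        out.write("1|2|3|4\n")
    with gzip.open(Path(tmp) / "ABC_two.gz", "wt", newline="") as out:
        out.write("5|6|7|8\n")
    with gzip.open(Path(tmp) / "XYZ_skip.gz", "wt", newline="") as out:
        out.write("9|9|9|9\n")  # does not match the pattern
    demo = read_with_file_name(tmp)

print(demo)
```

Each resulting record has the four data fields followed by the file name, which is what makes the different data sets distinguishable after loading.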

Query for finding all occurrences of a string in a database

I'm trying to find a specific string in my database. I'm currently using FlameRobin to open the FDB file, but this software doesn't seem to have a proper feature for this task.
I tried the following SQL query, but it didn't work:
SELECT
*
FROM
*
WHERE
* LIKE '126278'
So what is the best way to do that? Thanks in advance.
You can't do such a thing. But you can convert your FDB file to text files such as CSV, so that you can search for your string in all the tables/files at the same time.
1. Download a database converter
First, you need software to convert your database file. I recommend Full Convert. Just download the free trial. It is really easy to use, and it exports each table to a separate CSV file.
2. Find your string in multiple files at the same time
For this task you can use the Find in Files feature of Notepad++ to search for the string in all the CSV files located in the same folder.
3. Open the desired table in FlameRobin
When Notepad++ highlights the string, it shows which file it is in and the line number. Full Convert saves each CSV with the same name as the original table, so you can find it easily whatever database manager software you are using.
Here is Firebird documentation: https://www.firebirdsql.org/file/documentation/reference_manuals/fblangref25-en/html/fblangref25.html
You need to read about:
stored procedures of the "selectable" kind,
the EXECUTE STATEMENT command, including the FOR EXECUTE STATEMENT variant,
the system tables that have "relation" in their names.
Then, in your stored procedure, you enumerate all the tables, then all the columns in those tables, and for each of them you run the usual
select 'tablename', 'columnname', columnname
from tablename
where columnname containing '12345'
for every field of every table.
Practically speaking, though, it would most probably be better to avoid SQL commands altogether: extract the whole database into one long SQL script, open that script in Notepad (or any other text editor), and search there for the string you need.
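The enumerate-then-query approach described in this answer can be sketched as follows. This is only an illustration under stated assumptions: it uses SQLite (through Python's built-in sqlite3 module) as a stand-in, so the catalog queries (`sqlite_master`, `PRAGMA table_info`) and the `LIKE` match differ from what Firebird uses (its system tables and `CONTAINING`). The demo table and column names are hypothetical.

```python
import sqlite3


def find_string(conn, needle):
    """Return (table, column, value) for every value that contains needle."""
    hits = []
    # Step 1: enumerate all user tables from the catalog.
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    )]
    for table in tables:
        # Step 2: enumerate the columns of this table.
        columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
        for column in columns:
            # Step 3: run one ordinary query per column.
            query = (
                f'SELECT "{column}" FROM "{table}" '
                f'WHERE CAST("{column}" AS TEXT) LIKE ?'
            )
            for (value,) in conn.execute(query, (f"%{needle}%",)):
                hits.append((table, column, value))
    return hits


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, note TEXT)")
conn.execute("CREATE TABLE customers (code TEXT, name TEXT)")
conn.execute("INSERT INTO orders VALUES (126278, 'first')")
conn.execute("INSERT INTO customers VALUES ('X-126278-Y', 'Ann')")
hits = find_string(conn, "126278")
print(hits)
```

On a large database this issues one query per column, so it can be slow; that cost is the reason the answer suggests a text export as the more practical route.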

How to use Vertica's COPY LOCAL as an sql statement from MATLAB on Windows

I'm trying to insert around 80 million records created in MATLAB into a Vertica database table. I wanted to know whether we can call a COPY LOCAL statement from MATLAB as a regular SQL statement using exec(conn, sql). For testing purposes, I tried it with a .dat file containing around 4 million records, as follows:
sqlstmnt = 'COPY schema.table_name (FK_CUSTOMER_ID,FK_RUN_START_DATE_ID,FK_RUN_END_DATE_ID,FK_TRAVEL_ID,FK_ORIGIN_ID,FK_DEST_ID,FK_SEGMENT_ID,SEGMENT_PERCENTAGE,LAST_UPDATED) FROM LOCAL ''/my/file/full/path/test1.dat''';
results = exec(conn,sqlstmnt);
But it gave an error in results.Message like:
[Vertica]JDBC A ResultSet was expected but not generated from query "COPY schema.table_name(FK_CUSTOMER_ID,FK_RUN_START_DATE_ID,FK_RUN_END_DATE_ID,FK_TRAVEL_ID,FK_ORIGIN_ID,FK_DEST_ID,FK_SEGMENT_ID,SEGMENT_PERCENTAGE,LAST_UPDATED) FROM LOCAL '/my/file/full/path/test1.dat'". Query not executed.
I have the data in the '.dat' file in the order in which the columns are mentioned in COPY LOCAL.
I could not find any helpful resource explaining this error.
I am able to insert this test1.dat file using COPY from vsql, but since I run my code in MATLAB over many iterations, each producing about a million records, I want to insert the records during each iteration. Any help would be great.
The COPY command returns a result set that includes the number of rows loaded. I see two main options:
1) results = exec(conn,sqlstmnt);
2) results = runsqlscript(conn,'nameOfSQLScriptthatIncludeTheCopyCommand.sql')
I hope you will find it useful
Thanks
I just finished reviewing your sample input data.
I see a major problem with the mapping of the input CSV to the target table.
The main issues are:
1) Records are broken across two lines (keep one record per line and avoid splitting it in two).
E.g.: "1,20150101,0,2,2573,2714,1,8.147237e-01
50,48,49,54,45,48,51,-28 12:11:46"
2) When you define data types on the Vertica table, e.g. timestamp, the data in the CSV must match them (what you have is "-28 12:11:46", which will not work).
After you fix these issues, make sure you test the load using vsql, then try it from MATLAB.
I hope you will find it useful.
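Both issues can be caught before running COPY with a quick check of the data file. The sketch below is only an illustration: it assumes nine comma-separated fields per record (matching the nine columns in the COPY statement) and a timestamp in `YYYY-MM-DD HH:MM:SS` form in the last field, which is a guess at the intended format. The first sample line is hypothetical; the other two are the broken record quoted above.

```python
from datetime import datetime

EXPECTED_FIELDS = 9                       # columns listed in the COPY statement
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"    # assumed format for LAST_UPDATED


def check_line(line):
    """Return a description of the problem with this line, or None if it looks fine."""
    fields = line.rstrip("\n").split(",")
    if len(fields) != EXPECTED_FIELDS:
        return f"expected {EXPECTED_FIELDS} fields, found {len(fields)}"
    try:
        datetime.strptime(fields[-1].strip(), TIMESTAMP_FORMAT)
    except ValueError:
        return f"last field is not a timestamp: {fields[-1]!r}"
    return None


def find_bad_lines(lines):
    """Return (line_number, problem) for every line that fails the check."""
    problems = []
    for number, line in enumerate(lines, start=1):
        problem = check_line(line)
        if problem is not None:
            problems.append((number, problem))
    return problems


sample = [
    "1,20150101,0,2,2573,2714,1,8.147237e-01,2016-03-28 12:11:46",  # hypothetical good record
    "1,20150101,0,2,2573,2714,1,8.147237e-01",                      # first half of a broken record
    "50,48,49,54,45,48,51,-28 12:11:46",                            # second half of a broken record
]
problems = find_bad_lines(sample)
print(problems)
```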

Write results of SQL query to multiple files based on field value

My team uses a query that generates a text file over 500MB in size.
The query is executed from a Korn Shell script on an AIX server connecting to DB2.
The results are ordered and grouped by a specific field.
My question: Is it possible, using SQL, to write all rows with this specific field value to its own text file?
For example: All rows with field VENDORID = 1 would go to 1.txt, VENDORID = 2 to 2.txt, etc.
The field in question currently has 1000+ different values, so I would expect the same number of text files.
Here is an alternative approach that gets each file directly from the database.
You can use the DB2 export command to generate each file. Something like this should create one file (the statement is quoted so that the shell does not expand the *):
db2 "export to 1.txt of del select * from table where vendorid = 1"
I would use a shell script or something like Perl to automate running this command for each value.
Depending on how fancy you want to get, you could just hardcode the range of vendorid values, or you could first get the list of distinct vendorid values from the table and use that.
This method might scale a bit better than extracting one huge text file first.
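As a rough illustration of the automation step, the sketch below builds one export command per vendor id. It is written in Python rather than Korn shell or Perl, and it only produces the command strings; running them (for example from a script that has already connected to the database) is left out. The function name `export_commands` and the placeholder table name are hypothetical.

```python
import shlex


def export_commands(vendor_ids, table="table"):
    """Build one 'db2 export' shell command per vendor id."""
    commands = []
    for vendor_id in vendor_ids:
        vendor_id = int(vendor_id)  # rejects anything that is not a plain number
        statement = (
            f"export to {vendor_id}.txt of del "
            f"select * from {table} where vendorid = {vendor_id}"
        )
        # Quote the whole statement so the shell does not expand the '*'.
        commands.append("db2 " + shlex.quote(statement))
    return commands


# In practice the ids would come from a 'select distinct vendorid' query.
commands = export_commands([1, 2, 3])
for command in commands:
    print(command)
```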