load a csv file from a particular point onwards - sql

I have the code below to read a particular CSV file. My problem is that the CSV file contains two sets of data, one underneath the other, with different headers.
The starting point of the second data set can vary daily, so I need something that finds the row in the CSV where the second dataset begins (it will always start with the number '302') and loads the CSV from there. The problem I have is that the code below starts from where I need it to start, but it always includes the headers from the first part of the data, which is wrong.
USE csvImpersonation
FILE 'c:\\myfile.TXT'
SKIP_AT_START 3
RETURN #myfile
#loaddata = SELECT * FROM #myfile
WHERE c1 = '302'
The below is a sample of the text file (after the first 3 rows are skipped, which are just full of file settings, dates, etc).
Any help is much appreciated

Related

append new information in csv files to existing historical qvd

Let's say I have a "master" qvd file named salesHistory.qvd, and I want to append new monthly sales from file salesMarch.csv
How do I do that without replacing existing information, but adding new months?
Thanks for the help!
By default, QlikView automatically appends table loads to a previously loaded table if the fields are identical. You can use this to your advantage by using a script similar to the following:
SalesHistory:
LOAD
*
FROM
[salesHistory.qvd] (qvd);
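// The CSV load below is concatenated onto SalesHistory automatically,
// because its field names are identical to those already loaded.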
LOAD
*
FROM
[salesMarch.csv]
(txt, utf8, embedded labels, delimiter is ',', msq);
STORE SalesHistory INTO [salesHistory.qvd] (qvd);
This initially loads the contents of your salesHistory.qvd file into a table, then loads the contents of salesMarch.csv and concatenates it onto the SalesHistory table (which already contains the contents of salesHistory.qvd).
The final STORE step saves this concatenated table into the salesHistory.qvd file by overwriting it completely.
In the above example, we use * as a field specifier to load all fields from the source files. This means that this approach only works if your QVD file contains the same fields (and field names) as your CSV file.
Furthermore, because this script reloads the contents of the QVD file each time it is executed, it will start to duplicate data if it is run more than once per month, since it does not check which months already exist in the QVD file. If you need to run it more than once per month (perhaps due to adjustments), you may wish to apply a WHERE clause to the load from salesHistory.qvd so that only data up to and including the previous month is loaded.
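For illustration, a minimal sketch of such a guard, assuming the data contains a date field named SalesMonth (that field name is not from the original question):
SalesHistory:
LOAD *
FROM [salesHistory.qvd] (qvd)
WHERE SalesMonth < MonthStart(Today()); // keep only months before the current one; SalesMonth is an assumed field name
Note that a WHERE clause like this makes the QVD read a non-optimized load, which is slower on very large files.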
Finally, you may wish to alter the name of your CSV file so that it is always the same (e.g. salesCurrentMonth.csv) so that you do not have to change the filename in your script.

How to Load data in CSV file with a query in SSIS

I have a CSV file which contains millions of records/rows. The header row is like:
<NIC,Name,Address,Telephone,CardType,Payment>
In my scenario I want to load only the rows where "CardType" is equal to "VIP". How can I perform this operation without loading all of the records in the file into a staging table?
I am not loading these records into a data warehouse. I only need to separate this data within the CSV file.
The question isn't super clear, but it sounds like you want to do some processing of the rows before outputting them back into another CSV file. If that's the case, you'll want to make use of the various transforms available, notably Conditional Split. There you can look for rows where the SSIS expression CardType == "VIP" is true (string literals in SSIS expressions take double quotes) and send those down one output (call it "Valid Rows"), while sending the others to the default output. Connect your "Valid Rows" output to your CSV destination and that should be it.

Pentaho PDI how to validate source Excel metadata for the order and number of columns?

In my case, I need to process input data in Excel (xls and xlsx) format. I need to do a file-level validation of the Excel file for the order and number of columns before processing the row-level data. If this file-level validation fails, then the file should be excluded and the people concerned informed by mail.
Please guide me, with a sample or example, on how to validate the Excel files' metadata. I thought of placing a variable in kettle.properties with semicolon-separated header fields and comparing it with the source Excel file, but I cannot find a way to extract only the header row from the file as I want.
Please guide me.
Are the column names on row 1 of your file (or on some other row reasonably close to row 1), and do you know at most how many fields each file can have? If so, you may be able to get away with the approach below.
Step 1: You need to understand how many fields there may be, what they may be called, what data types they have, etc.
Step 2: Read the first N rows of the file(s), ensuring the header row is read; filter out everything that is not the header (how exactly depends on the specific structure). Because you don't know what the field names are, just name them field0, ..., field999 or whatever.
Step 3: Work some magic on the headers: filter based on the position of certain fields, map field names to data types, etc.
Step 4: Metadata injection. Using the information you gathered above, you create a template transformation that is generic in the sense that field names are not set up in the Excel input step. Metadata injection allows you to set up that step at run time, based on the logic you just applied to the headers.
This page has a couple example videos: http://wiki.pentaho.com/display/EAI/ETL+Metadata+Injection
I had to build something like that (only it was CSV files and not XLS) a while back and metadata injection allowed me to load every single file in one go with 100% mapping accuracy. Of course, the magic happens before, when you parse the header row.
Thanks nsousa for your answer.
I got to the required solution with the help of my colleague. Here is what I did:
(1) Read only the first row of the source Excel file as normal data (no header, limit 1), so the fields are named F1, F2, etc.
(2) Concatenate the fields (data) to get a pattern.
(3) Match this pattern against the actual metadata pattern; if they match, the Excel file passes.
Good trick. Thanks.

Load Data Infile Syntax - Invalid field count in CSV on line 1

I was using phpMyAdmin for ease of use and am using the LOAD DATA INFILE syntax, which gives the following error: invalid field count in CSV on line 1. I know there is an invalid field count; it is on purpose.
Basically, the table has 8 columns and the files have 7. I could go into each file and manually change it to 8 by entering data in the 8th column, but this is just too time consuming; in fact, I would have to start again by the time I finished, so I have to rule that out.
The eighth column will be a number that is exactly the same for every row in a file, so it is unique to each file.
For example, the first file has 1000 rows, each with data that goes in the first seven columns, and the 8th column is used to identify what the file's data refers to. So for those 1000 rows in the SQL table, the first 7 columns are data, while the last column will just be a thousand 1's; the next file's 1000 rows will have an 8th column containing a thousand 2's, and so on. (Note that I'm actually going to be entering 100001 rather than 1 or 000001, for obvious reasons.)
Anyway, I can't delete the column and add it back after loading the file, for good reasons which I'll not explain, but I am aware of that method and it is useless in this scenario.
What I would like is a method whereby, as I load a file that fills the first 7 columns, a specified int is placed in the 8th column of every row loaded from the CSV. Like auto-increment, except that rather than incrementing on each new row it stays the same. Then for the second file, all I need to do is change the specified int.
Note: the solution can't be to change the CSV files, as this is too time consuming and actually counter-intuitive.
I'm hoping someone knows whether there is a way to do this, possibly with SQL that mixes LOAD DATA INFILE and INSERT so that it processes correctly without error.
The solution is to list only the file's seven columns in the LOAD DATA statement and then populate the eighth column from a user variable via the SET clause, something like this:
SET @dummy_variable = 100001; /* the identifier for this file */
LOAD DATA INFILE 'file.txt'
INTO TABLE t1
(column1, column2, ..., column7)
SET column8 = @dummy_variable;
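For each subsequent file, only the variable value (and the file name) change; the rest of the statement stays the same. A sketch using the placeholder names from the answer above (the file name is made up):
SET @dummy_variable = 100002; /* the identifier for the second file */
LOAD DATA INFILE 'file2.txt'
INTO TABLE t1
(column1, column2, column3, column4, column5, column6, column7)
SET column8 = @dummy_variable;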

Import Unformatted txt file into SQL

I am having an issue importing data into SQL from a text file. Not because I don't know how...but because the formatting is pretty much terrible for this purpose. Below is an altered sample of the types of text files I need to work with:
1 VA - P
2 VB to 1X P
3 VC to 1Y P
4 N - P
5 G to 1G,Frame P
6 Fout to 1G,Frame P
7 Open Breaker P
8 1B to 1X P
9 1C to 1Y P
Test Status: Pass
Hi-Pot # 1500V: Pass
Customer Order:904177-F
Number: G4901626-200
Serial Number: J245F6-2D03856
Catalog #: CBDC37-X5LE30-H40-L630C-4GJ-G31
Operator: TGY
Date: Aug 01, 2013
Start Time: 04:09:26
Finish Time: 04:09:33
The first 9 lines are all specific test results (tab separated), with header information below. My issue is that I need to figure out:
How can I take the data above and turn it into something broken down into a standard column format to import into SQL?
How can I then automate this such that I can loop through an entire folder structure?
-What you see above is one of hundreds of files divided into several sub-directories.
Also note that the number of test lines above the header information varies from file to file. The header information remains in much the same format, though. This is all legacy data that cannot be regenerated, but it needs to be imported into our SQL databases.
I am thinking of using an SSIS project with a custom script to import the data... splitting the top section from the bottom by looking for the first empty row... then pivoting the header data into column format... merging... then moving on. But I don't write much VB and I'm not sure how to approach that.
I am working in a SQL Server 2008R2 environment with access to BIDS.
Thoughts?
I would start by importing the data as all-character into a table with a single field, one record per line. Then, from that table, you can parse each record into the fields and field types appropriate for each line. Hopefully there is a way to figure out what kind of data each line is, whether each file is consistent in order, or whether the header record indicates information about subsequent lines. From there, the data can be moved to a final table (parsing may take more than one pass) with the data stored in a format that is usable for whatever you need.
I would first concentrate on getting the data into the database in the least complicated (and least error prone) way possible. Create a table with three columns: filename, line_number and line_data. Plop all of your files into that table and then you can start to think about how to interpret the data. I would probably be looking to use PIVOT, but if different files can have different numbers of fields it may introduce complications.
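A minimal T-SQL sketch of that staging-table idea, for illustration only; the table and column names are invented here, and the parse assumes the "Name: Value" header lines shown in the sample above:
CREATE TABLE dbo.RawTestFile (
    filename    varchar(260)  NOT NULL,
    line_number int           NOT NULL,
    line_data   varchar(4000) NULL
);

-- Once the raw lines are loaded (BULK INSERT, SSIS, etc.), the "Name: Value"
-- header rows can be split into name/value pairs ready for pivoting.
-- The numbered test-result rows contain no colon, so they fall out of this
-- query and can be handled in a separate pass.
SELECT
    filename,
    line_number,
    LTRIM(RTRIM(LEFT(line_data, NULLIF(CHARINDEX(':', line_data), 0) - 1))) AS header_name,
    LTRIM(RTRIM(SUBSTRING(line_data, CHARINDEX(':', line_data) + 1, 4000))) AS header_value
FROM dbo.RawTestFile
WHERE CHARINDEX(':', line_data) > 0;
From there, PIVOT (or conditional aggregation) can turn the name/value pairs into one row per file.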
I would take a different approach and use an SSDT/SSIS package to import the data.
Add a script component to read in the text file and convert it to XML. It's not hard; there are many examples on the web. In your script, store the XML you build in a variable.
Add a data flow.
Add an XML Source. In the XML Source you can select the XML variable you created and process either group of data present in your file. Here is some information on using the XML Source.
Add a destination task to import the data into a destination of your choice.
This solution assumes your input lines are terminated with {CR}{LF}, the normal Windows way.
Tell SQL Server's Import/Export Wizard to import a Flat File; the Format is "Delimited"; the "Text qualifier" is {CR}; the "Header row delimiter" is {LF}; and the OutputColumnWidth (under "Advanced") is a bit more than the longest possible line length.
It's simple and it works.
I just used this to import 23 million lines of mixed up data, and it took less than ten minutes. Now to edit it...