Flat File Import: Remove Data - sql

(Posted a similar question earlier but HR department changed conditions today)
Our HR department has an automated export from our SAP system in the form of a flat file. The information in the flat file looks like so.
G/L Account 4544000 Recruiting/Job Search
Company Code 0020
--------------------------
| Posting Date| LC amnt|
|------------------------|
| 01/01/2013 | 406.25 |
| 02/01/2013 | 283.33 |
| 03/21/2013 |1,517.18 |
--------------------------
G/L Account 4544000 Recruiting/Job Search
Company Code 0020
--------------------------
| Posting Date| LC amnt|
|------------------------|
| 05/01/2013 | 406.25 |
| 06/01/2013 | 283.33 |
| 07/21/2013 |1,517.18 |
--------------------------
When I look at the data in the SSIS Flat File Source Connection all of the information is in a single column. I have tried to use the Delimiter set to Pipe but it will not separate the data, I assume due to the nonessential information at the top and middle of the file.
I need to remove the data at the top and middle and then have the Date and Total split into two separate columns.
The goal of this is to separate the data so that I can get a single SUM for the running year.
Year Total
2013 $5123.25
I have tried to do this in SSIS but I cant seem to separate the columns or remove the data. I want to avoid a script task as I am not familiar with the code or operation of that component.
Any assistance would be appreciated.

I would create a temp table that can import the whole flat file, after that do filter on SQL level
An example
Create TABLE tmp (txtline VARCHAR(MAX))
BCP or SSIS file into tmp table
Run Query like this to get result ( you may need adjust string length to fit your flat file)
WITH cte AS (
SELECT
CAST(SUBSTRING(txtline,2,10) AS DATE) AS PostingDate,
CAST(REPLACE(REPLACE(SUBSTRING(txtline,15,100),'|',''),',','') AS NUMERIC(19,4)) AS LCAmount
FROM tmp
WHERE ISDATE(SUBSTRING(txtline,2,10)) = 1
)
SELECT
YEAR(PostingDate),
SUM(LCAmount)
FROM cte
GROUP BY YEAR(PostingDate)

maybe you could use MS-Excel to open the flat file, using pipe-character as the delimeter, and then create a CSV from that, if needed.

Short of a script task/component (or a full-blown custom SSIS component), I don't think you'll be able to parse that specific format in SSIS. The Flat File Connection Manager does allow you to select how many rows of your text file are headers to be skipped, but the format you're showing has multiple sections (and thus multiple headers). There's also the issue of the horizontal lines, which the Flat File Connection won't be able to properly handle.
I'd first see if there's any way to get a normal CSV file with this data out of SAP. If that turns out to be impossible, then you'll need some sort of custom code to strip out the excess text.

Related

Is comparing two tables faster by importing them into a sql database or by using jdbc?

Background
I need to compare two tables in two different datacenters to make sure they're the same. The tables can be hundreds of millions, even a billion lines.
An example of this is having a production data pipeline and a development data pipeline. I need to verify that the tables at the end of each pipeline are the same, however, they're located in different datacenters.
The tables are the same if all the values and datatypes for each row and column match. There are primary keys for each table.
Here's an example input and output:
Input
table1:
Name | Age |
Alice| 25.0|
Bob | 49 |
Jim | 45 |
Cal | 52 |
table2:
Name | Age |
Bob | 49 |
Cal | 42 |
Alice| 25 |
Output:
table1 missing rows (empty):
Name | Age |
| |
table2 missing rows:
Name | Age |
Jim | 45 |
mismatching rows:
Name | Age | table |
Alice| 25.0| table1|
Alice| 25 | table2|
Cal | 52 | table1|
Cal | 42 | table2|
Note: The output doesn't need to be exactly like the above format, but it does need to contain the same information.
Question
Is it faster to import these tables into a new, common SQL environment, then use SQL to produce my desired output?
OR
Is it faster to use something like JDBC, retrieve all rows for each table, sort each table, then compare them line by line to produce my desired output?
Edits:
The above solutions would be executed at a datacenter that's hosting one of the tables. In the first solution, the only purpose for creating a new database would be to compare these tables using SQL, there are no other uses.
You should definitively start with the database option. Especially if the databases are connected with a database link you can easy set up the transfer of the data.
Such comparison often leads to a full outer join of the two sources and the experience tell us that DIY joins are notorically less performant that the native database implementation (you can deploy for example a parallel option).
Anyway you may try to implement some sofisticated algoritm that can make the compare without the necessity to transfer the whole table.
An example is based on the Merkle Trees where you first scan both source in their location to recognise which parts are identical (that can be ignored) and transfer and compare only the party with a difference.
So if you expect the tables are nearly identical and have keys that allows some hierarchy such approach could end better than a brute force full compare.
The faster solution is to load both tables to variables (memory) in your programing language and then compare them with your favorite algorithm.
Copy them first to a new table is the more than the double of time in read/write operations to disk, especially the write ones.

Transpose/Pivot Excel file in Pentaho (using multiple files)

I've been having some trouble with the following situation: There's an Excel file I need to use which has the information in the following format:
ColumnA | ColumnB
Name | John
Business | Pentaho
Address | Evergreen 123
Job type | Food processing
NameBoss | Boss lv1
Phone | 555-NoPhone
Mail | thisATmail
What I need to do is get all column A as different columns, ending with 7 different columns, each one with one value, which is the data in column B. Additionally, the integration is reading the filename as an extra output field:
SELECT
'${FILES_ROOT}/proyectos/BUSINESS_NAME/B_NAME_OPER/archivos_fuente/NÓMINA BAC - ' ||nombre_empresa||'.xlsx' as nombre_archivo
--, nombre_empresa
FROM "public".maestro_empresa
The transformation for the Excel file I have it as this:
As can bee seen, in the fields tab of the transformation, added manually each column, since the data in the Excel file does not has headers.
With this done, I am not sure how to proceed from here in order to get the transposed data I need. What can I do?
End result I am looking forward is something like this:
Name | Business | Address | Job type | NameBoss | Phone | Mail | excel_name
John | Pentaho | Evergreen 123 | Food processing | Boss lv1 | 555-NoPhone | thisAtMail | ExcelName.xlsx
With step 'Row demoralizer', you can do this easily. AT first you need to take input from excel file -> you need to use 'Row demoralizer' step. You can see sample from HERE.
Note: Remove ''Id'' column from my sample if you always suppose to get one line.
If you ColumnA values are dynamic /not specific . You can use THIS Metadata Injection sample ( where you need to take same excel input twice. But not require to specify column name). Please run transformation "MetaDataInjectionPV.ktr"

Creation of DB Structure for particular Excel Sheet formats

I have basic knowledge of SQL queries. 
Problem Statement:
Every month I will get an Excel sheet with transaction name as one column and the rest of n columns will be dates. The number of rows(Transaction names) are fixed but the dates of a month might vary. I have also attached the screenshot of the excel file.
Excel table
How should I create a table and structure it in the DB?
How should I import this particular excel table with values in the SQL DB?
Kindly please help me!
Thanks.
You can create table like this format.
Create table table_name(Transaction varchar(40),dates date,values float)
This Table looks like:
Transaction | dates | values
billing_ 1 | 11-oct-2019 | 1.2006
billing_ 1 | 12-oct-2019 | 2.2006
billing_ 2 | 11-oct-2019 | 1.2006
billing_ 2 | 12-oct-2019 | 2.2006
For importing excel data.
First you need to format this type then you can easily import into the database.

How to create value Buffer in Pentaho Kettle Transformation

I have some values coming in over time(stream) and over some different rows, which need to be processed as one row.
The incoming data looks kind of like this:
|timestamp |temp|otherStuff|
|------------|----|----------|
|... | |other |
|04:20:00.321|19.0|other |
|04:20:01.123|20.5|other |
|04:20:02.321|22.5|other |
|04:20:03.234|25.5|other |
|04:20:04.345|23.5|other |
|...(new data coming in) |
What I need could look something like this:
|val0|val1|val2|...|valN |
|----|----|----| |------|
|... create new row, |
|as new data arrives |
|23.5|25.5|23.5|...|valN |
|25.5|22.5|20.5|...|valN-1|
|22.5|20.5|19.0|...|valN-2|
I didn't find a good way to solve this with kettle. I'm also using a data service, (Basically a database, with a predefined amount of rows which refresh, as soon as a new dataset arrives) that holds the data in the same way as displayed in the first example.
That means I also could use SQL to flip the table around (which I don't know how to do either). It wouldn’t be as clean as using kettle but it would do the trick.
For better understanding, another example: This is what is coming in:
And something like this is what I need my data to transform to:
Is there any good way of achiving this?
Cheers.
Thank you #jxc,
the Analytic Query step did the trick.
Here's a screenshot of how I did it.
as #jxc stated, you have to
Add N+1 fields with Subject = temp, Type = Lag N rows BACKWARD in get Subject and N from 0 to N
(temp = Value in my case)

Export varbinary to file (image) from multiple rows

I have a MSSQL 2k8 database, in it I have a table of format below.
Employee Number | Segment | Data (varbinary(8000))
----------------------------------------------------------
111111 | 1 | 0x01234567...DEF
111111 | 2 | 0x01234567...DEF
111111 | 3 | 0x01234567...DEF
The data (varbinary) column makes up a picture but unfortunately is split in multiple segments by a process I cannot control.
Is there a way to export this data via an SQL script/procedure to a file? I have seem some questions that answer for a varbinary(max) column but I can't for the life of me work out how to stitch these all together into one file.
Note: Some of the files have >500 segments but this procedure will not be occuring exceedingly regularly.
If the picture can be reconstructed by simply concatenating all of the segments, then you could try execsql.py, which is a SQL script processor written in Python (by me). It has a metacommand of this form:
EXPORT <table_or_view> TO <filename> AS RAW
which will concatenate all columns and rows in the given table or view.