Keep column types from a Java ResultSet in CSV export

I'm currently building a tool that pulls data directly from a database (because SPSS Modeler is too slow) and stores it in a Java ResultSet first.
I then want to export the data to a CSV (or similar) file while keeping as many of the column types as possible.
Currently I'm using opencsv, but it casts decimals and many other types to strings. When I load the file back into SPSS Modeler, I only get integers and strings.
Are there any CSV libraries (maybe with a special encoding) or other file types I can use to export the data together with its column types (like IBM InfoSphere Data Architect can do), so I can load it directly back into SPSS Modeler without changing the types back manually there?
Thank you!

Retrieving the Metadata from the DB Information Schema
If the data is currently stored in a database, you can retrieve the column types from the information schema. All you need to do is retrieve this information after you have queried the table and store it so that you can reuse it later.
// connect to the database as usual
Statement stmt = conn.createStatement();
// create your query
// Note that you can use a dummy query here: you only need to access the
// metadata of the table, regardless of the actual query.
ResultSet rse = stmt.executeQuery("SELECT A, B FROM table WHERE ..");
// get the ResultSetMetaData
ResultSetMetaData rsmd = rse.getMetaData();
// get the database-specific type
rsmd.getColumnTypeName(1); // database-specific type name for column 1 (e.g. VARCHAR)
rsmd.getColumnTypeName(2); // database-specific type name for column 2 (e.g. DateTime)
// ... and so on for the remaining columns
// get the generic JDBC type, see http://docs.oracle.com/javase/7/docs/api/java/sql/Types.html
rsmd.getColumnType(1); // generic type for column 1 (e.g. 12)
rsmd.getColumnType(2); // generic type for column 2
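If you want a readable name for the generic type instead of the raw integer, java.sql.JDBCType (available since Java 8) can translate the constant. A minimal sketch, reusing the rse ResultSet from above:

import java.sql.JDBCType;
import java.sql.ResultSetMetaData;

// print a generic type name for every column of the result set
ResultSetMetaData md = rse.getMetaData();
for (int i = 1; i <= md.getColumnCount(); i++) {
    // note: JDBCType.valueOf(int) throws IllegalArgumentException for vendor-specific type codes
    String genericName = JDBCType.valueOf(md.getColumnType(i)).getName();
    System.out.println(md.getColumnName(i) + " -> " + genericName); // e.g. "A -> VARCHAR"
}

These name/type pairs are exactly what you would persist alongside the CSV so the types can be restored on re-import.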
Processing
You could store this information in a CSV schema file and apply it during the transformation process.
I recommend that you use SuperCSV for this.
This library provides so-called cell processors, which allow you to define the type of each column.
Description:
Cell processors are an integral part of reading and writing with Super CSV - they automate the data type conversions, and enforce constraints. They implement the chain of responsibility design pattern - each processor has a single, well-defined purpose and can be chained together with other processors to fully automate all of the required conversions and constraint validation for a single CSV column.
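Here is a minimal sketch of reading a CSV back with typed columns via cell processors. The three-column layout (id, amount, created), the date pattern, and the file name are assumptions for illustration:

import java.io.FileReader;
import java.util.List;
import org.supercsv.cellprocessor.Optional;
import org.supercsv.cellprocessor.ParseBigDecimal;
import org.supercsv.cellprocessor.ParseDate;
import org.supercsv.cellprocessor.ParseInt;
import org.supercsv.cellprocessor.ift.CellProcessor;
import org.supercsv.io.CsvListReader;
import org.supercsv.io.ICsvListReader;
import org.supercsv.prefs.CsvPreference;

public class TypedCsvReader {
    public static void main(String[] args) throws Exception {
        // one processor per column - assumed layout: id (int), amount (decimal), created (datetime)
        CellProcessor[] processors = {
                new ParseInt(),                       // id -> Integer
                new Optional(new ParseBigDecimal()),  // amount -> BigDecimal, empty cells allowed
                new ParseDate("yyyy-MM-dd HH:mm:ss")  // created -> java.util.Date
        };
        try (ICsvListReader reader = new CsvListReader(
                new FileReader("export.csv"), CsvPreference.STANDARD_PREFERENCE)) {
            reader.getHeader(true); // skip the header row
            List<Object> row;
            while ((row = reader.read(processors)) != null) {
                // each element now carries the Java type its processor produced
                System.out.println(row);
            }
        }
    }
}

If you persist the JDBC metadata from the step above, you can select the matching processor for each column at runtime instead of hardcoding this array.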

Related

Is there a way to avoid the data type conversion from STRING to STRUCT<string STRING, text STRING, provided STRING> for Datastore imports to BigQuery?

We are automatically loading Datastore Backups to BigQuery for further analysis overwriting the table every day.
When a Datastore Kind with at least one Entity with long text is imported in BigQuery, that field is automatically converted to a STRUCT<string STRING, text STRING, provided STRING> instead of a STRING field like all the other text/string fields. This then changes the schema of the BigQuery table and makes any further processing or analysis really hard as queries need to be adapted to account for this. We cannot control the length of text on the Datastore side, so we need to find a way to at least stabilize the schema on the BigQuery side.
Any idea on how to deal with this elegantly?
Any way this conversion can be avoided so the schema of the BigQuery table does not change?
Setting a schema on a load job from a Datastore export is not possible in BigQuery, which means the schema will always be inferred from the data. If you try to load it through the UI, for example, you will see a message saying:
Source file defines the schema
How the type conversion works between Datastore and BigQuery is described in the BigQuery documentation.
Try using a view as the final table, or create a scheduled query that reads your table once it is loaded and saves the results in another table with the right schema.
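As a sketch of the second option: the same SELECT could be registered as a scheduled query in the BigQuery console, but to show the idea, here it is run with the Java client library and written to a fixed destination table. The project, dataset, and table names, and the field names (id, description.text), are assumptions for illustration:

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableId;

public class StabilizeSchema {
    public static void main(String[] args) throws Exception {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        // hypothetical query: pull the long-text value out of the auto-detected STRUCT
        // so the destination column is a plain STRING again
        String sql = "SELECT id, description.text AS description "
                + "FROM `my_project.raw_dataset.datastore_import`";
        QueryJobConfiguration config = QueryJobConfiguration.newBuilder(sql)
                .setDestinationTable(TableId.of("clean_dataset", "stable_table"))
                .setWriteDisposition(JobInfo.WriteDisposition.WRITE_TRUNCATE) // overwrite on every run
                .build();
        bigquery.query(config); // runs the query and materializes the result
    }
}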

How to pass dynamic table names for sink database in Azure Data Factory

I am trying to copy tables from one schema to another within the same Azure SQL DB. So far, I have created a lookup pipeline and passed the parameters for the ForEach loop and Copy activity. But my sink dataset is not taking the parameter value I have given under the "table option" field; rather, it is taking the dummy table I chose when creating the sink dataset. Can someone tell me how I can pass a dynamic table name to a sink dataset?
I have given concat('dest_schema.STG_',#{item().table_name})} in the table option field.
To make the schema and table names dynamic, add Parameters to the Dataset:
Most important - do NOT import a schema. If you already have one defined in the Dataset, clear it. For this Dataset to be dynamic, you don't want improper schemas interfering with the process.
In the Copy activity, provide the values at runtime. These can be hardcoded, variables, parameters, or expressions, so very flexible.
If it's the same database, you can even use the same Dataset for both, just provide different values for the Source and Sink.
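For example (using the names from the question, so treat them as placeholders): give the Dataset a schema parameter and a table parameter, then in the Copy activity sink set the schema value to dest_schema and the table value to the expression @concat('STG_', item().table_name). Note the leading @ - without it, Data Factory treats the whole string as a literal instead of evaluating the expression.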
WARNING: If you use the "Auto-create table" option, the schema for the new table will define any character field as varchar(8000), which can cause serious performance problems.
MY OPINION:
While you can do this, one of my personal rules is to not cross the database boundary. If the Source and Sink are on the same SQL database, I would try to solve this problem with a Stored Procedure rather than a data factory.

Padding SSIS input source columns to avoid truncation errors?

First post. In SSIS I am using an ODBC Source, and the database (or ODBC driver) doesn't appear to report column metadata correctly for any of the tables in the database for varchar type columns. Therefore, each time I import a table, I get truncation errors on all the varchar fields. Is there any way to set the size of these fields besides doing it ONE AT A TIME in the advanced editor? When importing a flat file source it lets you select a padding % for string fields. Does something like this exist for OLE or ODBC sources? If not, is there any way I can override the column length to, say, force them all to be VARCHAR(1000)?
I have never experienced SQL Server providing the wrong metadata for an ODBC connection, and it is unlikely you have a ghost in the machine (deus ex machina). The metadata of the columns can be set in the ODBC source via the Advanced Editor. I am willing to bet that is where the difference is. To confirm this:
Right click the ODBC connection and select the Advanced Editor
Click on the Input/Output Properties tab
Expand OLE DB Source Output
Expand both External Columns and Output Columns
Inspect each column pair and verify that the meta data matches
Correct any mismatches in the metadata
Let me know if that works. If it does not, please provide the data and the SQL query you are using.
The VARCHAR field width must be set to the maximum incoming field width. I know the default field width is 50. Regardless, each field must be set. I previously worked on a project with large numbers of columns in the input files. My solution was to store the metadata for the columns in a database table, and then I built a C# application to read in that metadata, modify the *.dtsx file, and set the metadata on all columns. This is the best solution I am aware of to automate the task.
Unfortunately, I don't have much experience with pulling data through ODBC. Are you pulling from an Access database? Or, what are you pulling from?

Program to update a database table given as a parameter from an Excel sheet in ABAP

I will come directly to the question.
I have two parameters: a file name and a table name. The requirement is to upload the data from the Excel sheet into the database table entered in the other parameter. This should happen at runtime: no hardcoding of field names, and the program should be flexible enough to suit any table. Please help.
I can think of two possible approaches:
Dynamic code generation -- write a program which writes a program
Use dynamic type tools
For 1., try googling.
For 2., see https://wiki.scn.sap.com/wiki/display/Snippets/Example+-+create+a+dynamic+internal+table - this wiki shows a way (though it may be overkill, as it creates the type from scratch, whereas any table in your SAP system is already a defined type in the Data Dictionary).
You can easily reference a parameterised table in Open SQL, e.g. MODIFY (p_tab) ...
Perhaps you could do a generic SPLIT of each line read in from the file at the delimiter into a table of fields - you can then use ASSIGN COMPONENT to match the fields you have read in to the fields of your internal type.
If you are doing this, I think a whitelist of allowed tables would be wise - along with authority checks. Otherwise someone could upload data into SAP standard tables without authorisation.

Talend ETL tool

I am developing a migration tool using the Talend ETL tool (free edition).
Challenges faced:
Is it possible to create a Talend job that uses a dynamic schema every time it runs, i.e. with no hard-coded mappings in the tMap component?
I want the user to provide an input CSV/Excel file, and the job should create the mappings based on that input file. Is this possible in Talend?
Any other free ETL tool could also be helpful, or any sample job.
Yes, this can be done in Talend, but if you do not wish to use a tMap then your table and file must match exactly. The way we have implemented it is for stage tables whose columns are all of type varchar. This works when you are loading raw data into a stage table and your validation is done after the load, prior to loading the stage data into a data warehouse.
Here is a summary of our method:
The filenames contain the table name, so the process starts with a tFileList and parses the table name out of the file name.
Using tMSSQLColumnList, obtain each column name, type, and length for the table (one way is to store it as an inline table in tFixedFlowInput).
Run this through a tSetDynamicSchema to produce your dynamic schema for that table.
Use a file input that references the dynamic schema.
Load that into an MSSQL output, again referencing the dynamic schema.
One more note on data types: it may work with data types other than varchar, but our stage tables only contain varchar and datetime. We had issues with datetime, so we filtered out those column types with a tMap.
Keep in mind, this is a summary to point you in the right direction, not a precise tutorial. But with this info in your hands, it can save you many hours of work while building your solution.