My vendor is providing a CSV file where the columns (names included on the first line) are dynamic, meaning a column only appears if there is data in it, and there is no guarantee on the order in which the columns will be provided.
I am looking to understand the best approach for importing such a horrible file. I have been using FileHelpers.net with optional fields, but the issue is that the column order can change.
You can build a FileHelpers class on the fly, then use it with the engine to import the CSV. If you import into a DataTable, you can then check whether a column exists and populate your database from it, or do whatever else you need with those columns.
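FileHelpers itself is a .NET library, but the underlying technique (read the header line first, then guard every optional column before using it) is easy to sketch. Here is a minimal, hypothetical illustration in Python, with vendor.csv and the Price column standing in for your actual file and fields:

import csv

# DictReader keys each row by column name, so column order no longer
# matters and missing columns can be detected up front.
with open('vendor.csv', newline='') as f:
    reader = csv.DictReader(f)
    present = set(reader.fieldnames or [])
    for row in reader:
        # Use a column only if the vendor sent it in this file.
        if 'Price' in present:
            print(row['Price'])
        else:
            print('no Price column this time')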
I have a CSV file with one column which I want to import into my BigQuery environment. When I use the console to import data, it always takes my first row as a data row rather than as a column name. Is there a way in the console to ensure the first row is always treated as the column name?
E.g.
Tk Number
Tk - 0001
Tk - 0002
In CSV format, if the first row is a string and the others are integers, BigQuery automatically takes the first row as the header, provided you checked the auto-detect schema option while creating the table.
But since you have strings in the body as well as in the header, you will need to give the schema manually while creating the table in BigQuery. Then, under advanced options, you can specify the number of rows to be skipped using the 'Header rows to skip' option.
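If you ever script this instead of using the console, the same two settings exist on a load job. A minimal sketch with the google-cloud-bigquery Python client, where mydataset.tickets and tickets.csv are hypothetical names (note that BigQuery column names cannot contain spaces, hence Tk_Number):

from google.cloud import bigquery

client = bigquery.Client()

# Give the schema manually and skip the header row, mirroring
# the console settings described above.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # 'Header rows to skip' = 1
    schema=[bigquery.SchemaField('Tk_Number', 'STRING')],
)

with open('tickets.csv', 'rb') as f:
    job = client.load_table_from_file(f, 'mydataset.tickets', job_config=job_config)
job.result()  # wait for the load to complete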
I have a CSV file with more than 700 columns, and I want just 175 of them to be inserted into an RDBMS table or a flat file using Pentaho Data Integration (PDI). The source CSV has variable columns, i.e. columns can keep being added or deleted, but the column names contain specific keywords that remain constant throughout. I have the list of keywords, present in the column names, that have to be excluded, e.g. starts_with("avgbal_"), starts_with("emi_"), starts_with("delinq_prin_"), starts_with("total_utilization_"), starts_with("min_overdue_"), starts_with("payment_received_").
Any column whose name contains one of the above keywords has to be excluded and should not pass into my RDBMS table or flat file. Is there any way to remove these columns by writing an SQL query in PDI? Selecting the specific 175 columns is not possible, as they are variable in nature.
I think your case is a fit for ETL metadata injection; you can refer to the example shared below:
https://help.pentaho.com/Documentation/7.1/0L0/0Y0/0K0/ETL_Metadata_Injection
There are two things you need to be careful about:
Maintain the list of columns you need to push in.
Since you have changing column names, you may face issues even with the valid columns you want to import or work with. To handle this, make sure you generate the metadata file every time, so you are sure about the column names you want to push out from the flat file.
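The metadata file drives which columns survive, but the exclusion rule itself is simple to state. As a rough illustration outside PDI, here is the prefix filter from the question sketched in Python, with input.csv and filtered.csv as hypothetical file names:

import csv

# Prefixes from the question: any column whose name starts with one
# of these must not pass through to the output.
EXCLUDED_PREFIXES = ('avgbal_', 'emi_', 'delinq_prin_',
                     'total_utilization_', 'min_overdue_',
                     'payment_received_')

with open('input.csv', newline='') as src, open('filtered.csv', 'w', newline='') as dst:
    reader = csv.DictReader(src)
    # Keep only the columns that do not start with an excluded prefix.
    kept = [c for c in (reader.fieldnames or [])
            if not c.startswith(EXCLUDED_PREFIXES)]
    writer = csv.DictWriter(dst, fieldnames=kept, extrasaction='ignore')
    writer.writeheader()
    for row in reader:
        writer.writerow(row)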
I would like to import a spreadsheet into an Access database. One column has ages 1-89 plus another value that says 90+, which in turn creates an import error. Using DoCmd.TransferText, is it possible to import everything as it is, including the 90+ in a column of otherwise numeric values?
If you import data into a table that doesn't already exist, Access will create one for you. It automatically determines the data type of each column based on the first few rows that are imported, so if your source data contains a mixture of data types in one column, you may experience this error.
There are 2 solutions:
Build the import table to be of the correct data type for your data (i.e. specify that the age column is Short Text). Then import to the pre-defined import table.
Ensure that the CSV or Excel file stores each age as a string, i.e. "20" instead of 20 (in Excel, format the cell as Text so it is left-aligned, start the cell contents with an apostrophe as in '20, or use formulaic notation such as ="20"); see the sketch after this list.
You could do either of those things, but it would be better to do both of them if possible.
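For the CSV side of the second option, the file can be rewritten before import so that every field is quoted. Whether quotes alone are enough depends on your import specification, but as a minimal Python sketch, with ages.csv and ages_text.csv as hypothetical file names:

import csv

# Re-emit the CSV with every field quoted, so mixed values such as
# 89 and 90+ are all presented as text to the importer.
with open('ages.csv', newline='') as src, open('ages_text.csv', 'w', newline='') as dst:
    writer = csv.writer(dst, quoting=csv.QUOTE_ALL)
    for row in csv.reader(src):
        writer.writerow(row)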
How do I add a column from an external .csv file to an existing project?
I tried to find the solution online, but I wasn't successful.
Using the file you provided, I did this in less than one minute.
I had a project with one column.
If you know a little Python, try Jython. Use Edit Column > Add column based on this column, choose Language: Jython, and enter this:
import csv

# We use DictReader to turn each imported row into a dict, so we can
# later refer to the column we want by its key, i.e. the header.
rows = csv.DictReader(open('/home/yourusername/Downloads/example.csv'), delimiter=",")
for row in rows:
    return row['Comprar']  # 'Comprar' is the header of the column I want
I am stuck with a CSV file with over 100,000 rows that contains product images from a provider. Here are the details of the issue; I would really appreciate some tips to help resolve this. Thanks.
The file has one row per product and the following 4 columns.
ID,URL,HEIGHT,WIDTH
example: 1,http://i.img.com,100,200
The problem starts when a product has multiple images.
Instead of having one row per image, the file has more columns in the same row.
example:
1,http://i.img.com,100,200,//i.img.com,20,100,//i.img.com,30,50
Note that only the first image has "http://"; the remaining images start with "//".
There is no telling how many images there are per product, hence no way to tell the total number of columns per row or the maximum number of columns.
How can I import this using SSIS or the SQL import wizard?
Also, I need to do this at regular intervals.
Thank you for your help.
I don't think you can use any standard SSIS task or wizard to do this. You're going to have to write some custom code that parses each line. You can do this in SSIS using VB code, or you can import the file into a staging table that is just a single column holding each row and do the parsing in SQL. SSIS will probably be faster for this kind of operation.
Another possibility is to preprocess the file using regex or a search-and-replace command. Try to get double quotes around the image list; then you should be able to import the whole file fine, with the quoted part going into a single column. Catching the start of the string should be easy enough given the "http://" you can search for. Determining where the end quote goes might be more of a problem.
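If a preprocessing step is acceptable, another variant is to normalize the file to one row per image before SSIS ever sees it, rather than quoting the image list. A rough Python sketch under the four-column layout described above, with images.csv and images_normalized.csv as hypothetical file names:

import csv

# After the ID, images repeat as (URL, HEIGHT, WIDTH) triples, so split
# each input row into one output row per triple.
with open('images.csv', newline='') as src, open('images_normalized.csv', 'w', newline='') as dst:
    reader = csv.reader(src)
    # If the source file has a header line, skip it first with next(reader).
    writer = csv.writer(dst)
    writer.writerow(['ID', 'URL', 'HEIGHT', 'WIDTH'])
    for row in reader:
        product_id, rest = row[0], row[1:]
        for i in range(0, len(rest), 3):
            url, height, width = rest[i:i + 3]
            # Only the first image carries 'http:'; the rest start with '//'.
            if url.startswith('//'):
                url = 'http:' + url
            writer.writerow([product_id, url, height, width])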
A third potential solution would be to get the source to fix the data. Even if you can't get the images in separate rows (or in another file with separate rows, which would be ideal), maybe you can get the double quotes added at the source as part of the export. This would likely be less error-prone than the search-and-replace method.
Good luck!