I was using phpMyAdmin for ease of use and am using the LOAD DATA INFILE syntax, which gives the following error: invalid field count in CSV on line 1. I know there is an invalid field count; it is on purpose.
Basically the table has 8 columns and the files have 7. I could go into each file and manually enter data in an 8th column, but this is far too time consuming; in fact I would have to start again by the time I finished, so I have to rule that out.
The eighth column will be a number which is exactly the same for every row in a file, so unique to each file.
For example, the first file has 1000 rows, each with data that goes in the first seven columns; the 8th column is then used to identify what this file's data refers to. So for the 1000 rows in the SQL table the first 7 columns are data, while the last column will just be 1000 1's, and the next file's 1000 rows will have an 8th column of 1000 2's, and so on. (Note I'm actually going to be entering 100001, rather than 1 or 000001, for obvious reasons.)
Anyway, I can't delete the column and add it back after loading the file either, for good reasons which I'll not explain; I am aware of that method, but it is useless in this scenario.
What I would like is a method which, as I load a file that fills the first 7 columns, places a specified int in the 8th column of every row loaded from the CSV. Like auto increment except that, rather than incrementing on each new row, the value stays the same. Then for the second file all I need to do is change the specified int.
Notes: the solution can't be to change the CSV file, as this is too time consuming and is actually counterintuitive.
I'm hoping someone knows whether there is a way to do this, possibly with SQL code that is a mixture of LOAD DATA and INSERT, so that it processes correctly without error.
The solution is to simply load the 8th column into a variable, something like this:
SET @dummy_variable = 0; /* <-not sure if you need this line...*/
LOAD DATA INFILE 'file.txt'
INTO TABLE t1
(column1, column2, ..., column7, @dummy_variable);
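Since the files contain only seven fields, another option worth trying is the SET clause of LOAD DATA, which assigns a value to a column that is not read from the file. The following is a minimal sketch, not a tested solution for this exact table: it assumes a comma-separated file, and the names column1 through column8 are hypothetical placeholders for the real column names.

```sql
-- Sketch for MySQL. column1..column8 are hypothetical names; replace them with the real ones.
-- The seven fields in each line fill column1..column7; column8 is not read from the file.
LOAD DATA INFILE 'file.txt'
INTO TABLE t1
FIELDS TERMINATED BY ','
(column1, column2, column3, column4, column5, column6, column7)
SET column8 = 100001;  -- constant identifier for this file; change it for the next file
```

For the second file, only the file name and the value in the SET clause would need to change.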
I need to get the column number of a staged file on Snowflake.
The main idea is that I need to automate looking up this field position in other queries rather than hard-coding t.$3, where 3 is the position of the field. That position may change because our surveys are expandable (more or fewer questions depending on the situation).
So what I need is something like that:
SELECT COL_NUMBER FROM @my_stage/myfile.csv WHERE value = 'my_column_name'
-- Without any file format to read the header
And then this COL_NUMBER could be used as t.$"+COL_NUMBER+" inside merge queries.
I am creating a transformation that takes input from a CSV file and outputs to a table. It runs correctly, but the problem appears if I run the transformation more than once: the output table then contains the same rows duplicated again and again.
Now I want to remove all duplicate rows from the output table.
And if I run the transformation repeatedly, it should not affect the output table unless there is a new row.
How can I solve this?
Two solutions come to my mind:
Use the Insert / Update step instead of the Table output step to store data into the output table. It will search the output table for a row that matches the incoming stream row according to the key fields you define (all fields / columns in your case). It works like this:
If the row can't be found, it inserts the row. If it can be found and the fields to update are the same, nothing is done. If they are not all the same, the row in the table is updated.
Use following parameters:
The keys to look up the values: tableField1 = streamField1; tableField2 = streamField2; tableField3 = streamField3; and so on..
Update fields: tableField1, streamField1, N; tableField2, streamField2, N; tableField3, streamField3, N; and so on..
After storing duplicate values in the output table, you can remove the duplicates using this concept:
Use an Execute SQL step in which you define SQL that removes duplicate entries and keeps only unique rows. For inspiration on writing such SQL, see: How can I remove duplicate rows?
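As a rough illustration of the kind of SQL the Execute SQL step could run, the sketch below rebuilds the table from its distinct rows. It assumes the duplicates are identical in every column and that the database supports CREATE TABLE ... AS SELECT (MySQL, PostgreSQL, Oracle and SQLite do; SQL Server would need SELECT ... INTO instead). The names output_table and output_table_dedup are hypothetical.

```sql
-- Sketch only: output_table and output_table_dedup are hypothetical names.
-- 1. Copy one instance of every distinct row into a scratch table.
CREATE TABLE output_table_dedup AS
SELECT DISTINCT * FROM output_table;

-- 2. Empty the original table and refill it from the scratch copy.
DELETE FROM output_table;
INSERT INTO output_table
SELECT * FROM output_table_dedup;

-- 3. Remove the scratch table.
DROP TABLE output_table_dedup;
```

This keeps the original table definition (indexes, constraints) in place, at the cost of rewriting every row, so it suits small to medium tables better than very large ones.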
Another way is to use the Merge rows (diff) step, followed by a Synchronize after merge step.
As long as the number of rows in your CSV that differ from your target table is below 20 - 25% of the total, this is usually the most performance-friendly option.
Merge rows (diff) takes two input streams that must be sorted on its key fields (by a compatible collation), and generates the union of the two inputs with each row marked as "new", "changed", "deleted", or "identical". This means you'll have to put Sort rows steps on the CSV input and possibly the input from the target table if you can't use an ORDER BY clause. Mark the CSV input as the "Compare" row origin and the target table as the "Reference".
The Synchronize after merge step then applies the changes marked in the rows to the target table. Note that Synchronize after merge is the only step in PDI (I believe) that requires input be entered in the Advanced tab. There you set the flag field and the values that identify the row operation. After applying the changes the target table will contain the exact same data as the input CSV.
Note also that you can use a Switch/Case or Filter Rows step to do things like remove deletes or updates if you want. I often flow off the "identical" rows and write the rest to a text file so I can examine only the changes.
I looked for visual answers, but the answers were text, so I'm adding this visual answer for any Kettle newbie like me.
Case
user-updateslog.csv (has dup values) ---> users_table , store only latest user detail.
Solution
Step 1: Connect csv to insert/update as in the below Transformation.
Step 2: In Insert/Update, add a condition that compares the keys to find the candidate row, and choose "Y" for the fields to update.
I have a flat file that I'm loading into SQL, and that flat file has 2 different RecordTypes and 2 different file layouts based on the RecordType.
So I may have
000010203201501011 (RecordType 1)
00002XXYYABCDEFGH2 (RecordType 2)
So I want to immediately check for records of RecordType 1 and then send those records through [Derived Column] & [Data Conversion] & [Loading to SQL].
And I want to ignore all records of RecordType 2.
I tried a Conditional Split, but it seems like the records of RecordType 2 are still trying to go through the [Derived Column] & [Data Conversion] steps.
It gives me a Data Conversion error on the RecordType 2 records.
I have the Conditional Split set up as RecordType == 1 to go through the process I have set up.
I guess Conditional Split isn't set up to be used this way?
Where in my process can I tell it to check for RecordType 1 and only send records past that point that have RecordType = 1?
It makes perfect sense that you are getting data type errors for Record Type 2 rows, since you have probably defined the columns and their data types based on Record Type 1 records. I see three options to achieve what you want to do:
Have a script task in the control flow to copy only Record Type 1 records to a fresh file that would be used by the data flow you already have (Pro: you do not need to touch the data flow, Con: reading file twice), OR
In the existing data flow: Instead of getting all the columns from the data source, read every line coming from the file as one big-fat column, then a Derived Column to get RecordType, then a Conditional Split, then a Derived Column to re-create all the columns you had defined in the data source, OR
Ideal if you have another package processing Record Type 2 rows: Dump the file into a database table in the staging area, then replace the Data Source in your Data Flow for an OLEDB Data Source (or whatever you use) and obtain+filter the records with something like: SELECT substring(rowdata,1,5) AS RecordType, substring(rowdata,6,...) AS Column2, .... FROM STG.FileData WHERE substring(rowdata,1,5) = '00001'. If using this approach it would be better to have a dedicated column for RecordType.
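The dedicated RecordType column mentioned in the last option could be sketched as a computed column on the staging table. This assumes SQL Server (T-SQL), a character-typed rowdata column, and the STG.FileData table named in the query above; the Payload alias is hypothetical and simply stands for the rest of the line, since the real column layout is not given.

```sql
-- Sketch for SQL Server. STG.FileData and rowdata come from the answer above.
-- Add RecordType as a persisted computed column holding the first 5 characters.
ALTER TABLE STG.FileData
    ADD RecordType AS SUBSTRING(rowdata, 1, 5) PERSISTED;
GO  -- batch separator (SSMS / sqlcmd) so the new column is visible to the next statement

-- Filter on the dedicated column instead of repeating the SUBSTRING in the WHERE clause.
SELECT RecordType,
       SUBSTRING(rowdata, 6, LEN(rowdata)) AS Payload  -- hypothetical alias: remainder of the line
FROM STG.FileData
WHERE RecordType = '00001';
```

A persisted computed column can also be indexed, which may help if the staging table is large and several packages filter on the record type.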
I have a table with a bunch of different fields. One is named period.
The period is not part of the raw data but I run a query when I import new data to the database that gives each record a period.
Now I need a delete query that will delete all the records that have the same period as what is selected in a combobox.
The values in the combobox come from a calendar table that contain all the possible values that could be in that period column at any time.
This is the basic query I thought would solve this issue, but it tells me it is going to delete 0 rows every time I run it:
DELETE *
FROM PlanTemp
WHERE PlanTemp.period = Forms![Plan Form]!Combo163;
If you don't need the key field, just remove it.
Look at the "PROPERTIES" section and look at the column names.
Either remove it there, or from your query source.
You can also look at the Data section of the properties and change your bound column to column 2, or whichever column holds the data you want to use.
I am exporting a file from a system as .csv. My aim is to link to this file as a table (which matches the output field for field) and then run the queries and export.
The problem I am having is that, upon import, all the fields are 255 bytes wide rather than what they need to be.
Here's what I've tried so far:
I've looked at ALTER TABLE but I cannot run multiple ALTER TABLE statements in one macro.
I've also tried appending the table into another table with the correct structure but it seems to overwrite the structure.
I've also tried using the Left function with the appropriate field length, but when I try to export, I pretty much just see 5 bytes per column.
What I would like is a suggestion as to what is the best path to take given my situation. I am not able to amend the initial .csv export, and I would like to avoid VBA if possible, as I am not at all familiar with it.
You don't really need to worry about the size of Text fields in an Access linked table that is connected to a CSV file. Access simply assigns each Text field the largest possible maximum size: 255. It does not mean that every value is actually 255 characters long, it just means that any values in those fields can be at most 255 characters long.
Even if you could change the structure of the linked table (which you can't), it wouldn't make any real difference except to possibly truncate longer Text values, and you could easily do that with a String function. For example, if a particular field had to be restricted to 15 characters then you could simply use Left([fieldName], 15) as a query column or as the control source in a report.
In the end, as the data set is not that large, I have set this up to append from my source data into a table with the correct structure. I can now run my processes against this table as per normal.
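An append query of the kind described here, combined with the Left() truncation suggested in the answer, might look like the sketch below. All table and field names (tblTarget, tblLinkedCsv, CustomerCode, CustomerName) are hypothetical, as is the 15-character limit carried over from the answer's example.

```sql
-- Sketch for Access. All table and field names here are hypothetical.
-- tblLinkedCsv is the linked CSV table; tblTarget has the correct field sizes.
INSERT INTO tblTarget (CustomerCode, CustomerName)
SELECT Left([CustomerCode], 15), [CustomerName]
FROM tblLinkedCsv;
```

The Access SQL view does not accept comments, so the two comment lines would need to be removed before pasting this into a query. Saved as an append query, it can be run from a macro without any VBA.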