I am inserting records into a sequence file from Hive, but the key part comes up blank when I view the file with the hadoop fs -text command. Is there any way to populate the key part with simple HiveQL code?
In the key part I am trying to provide sequence numbers like 1, 2, 3... for each record.
By default, Hive ignores the key part of a sequence file. So when you view the content of the table with a SELECT statement, the key part is never shown; only the columns of the table, i.e. the value part of the sequence file, are shown.
However, when running hadoop fs -text on the sequence file, both the key part and the value part are shown.
In order to have the key part populated in a sequence file, you need to explicitly write a MapReduce program.
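If all you need is the running number itself rather than the actual SequenceFile key, a hedged HiveQL sketch (table and column names below are assumptions, not from the question) is to materialize the number as an ordinary column with row_number(); the key written by Hive will still be empty:
-- Sketch only: source_table, target_table and their columns are placeholder names.
-- The generated number lands in the value part; Hive still writes an empty key.
INSERT OVERWRITE TABLE target_table
SELECT row_number() OVER (ORDER BY col1) AS seq_no,
       col1,
       col2
FROM source_table;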
I'm new to PDI and Kettle, and what I thought was a simple experiment to teach myself some basics has turned into a lot of frustration.
I want to check a database to see if a particular record exists (i.e. vendor). I would like to get the name of the vendor from reading a flat file (.CSV).
My first hurdle is selecting only the vendor name from the 8 fields in the CSV.
The second hurdle is how to use that vendor name as a variable in a database query.
My third issue is what type of step to use for the database lookup.
I tried a dynamic SQL query, but I couldn't determine how to build the query using a variable, nor how to pass the desired value to the variable.
The database table (VendorRatings) has 30 fields, one of which is vendor. The CSV also has 8 fields, one of which is also vendor.
My best effort was to use a dynamic query using:
SELECT * FROM VENDORRATINGS WHERE VENDOR = ?
How do I programmatically assign the desired value to "?" in the query? Specifically, how do I link the output of a specific field from Text File Input to the "vendor = ?" SQL query?
The best practice is a Stream lookup. For each record in the main flow (VendorRatings), look up the vendor details (the lookup fields) in the reference file (the CSV), based on its identifier (possibly its number, or name, or firstname+lastname).
First "hurdle" : Once the path of the csv file defined, press the Get field button.
It will take the first line as header to know the field names and explore the first 100 (customizable) record to determine the field types.
If the name is not on the first line, uncheck the Header row present, press the Get field button, and then change the name on the panel.
If there is more than one header row or other complexities, use the Text file input.
The same is valid for the lookup step: use the Get lookup field button and delete the fields you do not need.
Given that:
There is at most one vendor rating per vendor.
You have to do something if there is no match.
I suggest the following flow:
Read the CSV and, for each row, look up in the table (i.e. the lookup table is the SQL table rather than the CSV file), and set a default value when there is no match. I suggest something really visible like "--- NO MATCH ---".
Then, in case of no match, a filter redirects the flow to the alternative action (here: insert into the SQL table). The two flows are then merged back into the downstream flow.
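Conceptually only, and with placeholder names (PDI does this with steps, not by running this SQL), the lookup-with-default plus insert-on-no-match amounts to something like:
-- Lookup with a visible default; csv_input and rating are placeholder names,
-- only VENDORRATINGS and VENDOR come from the question.
SELECT c.vendor,
       COALESCE(v.rating, '--- NO MATCH ---') AS rating
FROM csv_input AS c
LEFT JOIN VENDORRATINGS AS v ON v.VENDOR = c.vendor;

-- Rows with no match are then inserted into the table.
INSERT INTO VENDORRATINGS (VENDOR)
SELECT c.vendor
FROM csv_input AS c
LEFT JOIN VENDORRATINGS AS v ON v.VENDOR = c.vendor
WHERE v.VENDOR IS NULL;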
I found that BigQuery's schema auto-detection doesn't recognize a field if it doesn't appear near the beginning of an input JSON file.
I have a field named "details" which is a record type. In the first 2K rows of the JSON input file, this field doesn't have any sub-fields. But then at row 2,698 of the input file, this field has a "report" sub-field for the first time. If I move that line to the top of the JSON file, it works fine.
How can I solve this issue? Explicitly specifying the schema is one way but I am wondering if there is a way to make the auto detection scan more rows or something like that.
I need guidance regarding the most appropriate approach to perform an index/lookup function using Pentaho Data Integration (Kettle).
My situation is as follows:
Using the global VoIP system report, I stored all the data in a MySQL database, which gives me several ID numbers plus name and last name, but without the department name.
Each department has its own Excel reports, which can be identified by the group file name; that name is not available in the global file.
What I am trying to achieve is a lookup, for each identification number, to identify the department it belongs to using the report filename, and to store it in the appropriate column.
Any help will be appreciated.
Assuming you're using the Excel File Input step, there is an option on the Additional Output Fields tab that will allow you to specify the Full Filename Field. You can name this whatever you want, and it will add an additional column to your incoming Excel data that contains the name of the file. You may need to do some regex cleanup on that field since it's the full file path, not just the filename.
As far as doing the lookup, there are many lookup options to merge streams in the Lookup category of the design tab. I think the Stream Lookup is the step you'll want.
As far as I understood your need, you have to first build a "mapping table" of two columns: the department (aka the start of the xls filename) and the employee (aka its ID).
This table does not need to be materialized and may stay in a step of the PDI transformation. So:
Read all the xls files with a Microsoft Excel Input step. In case you do not know how to do it: browse to any of these files, press the Add button, then in the Selected files table remove the filename to keep only its directory path, and write .*\.xls in the Regex wildcard. Check that you selected the appropriate files with the Show filename(s) button.
In the same step, define the Sheet to be "Fiche technique" (assuming they are all the same). Define the fields to be "A" with type String (an empty column) and "ID", also with type String (otherwise you'll get an un-trappable error on "Agent ID" and "Total"). Also follow #eicherjc's suggestion and keep the filename, although I suggest you keep the Short file name and call it filename.
You should get a two-column stream, ID and filename, which needs a bit of data massaging before it can be used: the ID contains non-integer values and the filename contains extra characters.
The simplest way to do this is with a Modified Javascript Value step. I suggest this code:
var ID = Number(ID);
var regex = filename.match(/(.*)__\d+\.xls/);
if(regex) filename = regex[1];
and do not forget to specify that ID now has type Integer and to put a "Y" in the Replace value column of the Fields table at the bottom.
The first line converts any number to its value, and any non-number to 0, which is an ID that does not exist.
The next lines extract the department from the filename with a regex. If you do not like regexes, you may use filename = filename.substr(0, filename.indexOf('__')), or any formula that does the job.
Now you have a stream ready to be used, except that some employees may, right or wrong, be in more than one department. If it does not matter which one, then leave it like that. Otherwise you have to provide some logic to filter the correct department.
You can now use a Stream lookup to read the department of each employee. The Lookup step is the Modified Javascript Value (or whatever name you gave to this step). The field to look up is the ID field in your MySQL data. The Lookup field is the ID (or whatever name you gave to column B of your xls files). And the field to retrieve is the filename (or, more precisely, the department name extracted from the filename).
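For the main stream itself, the Table input on the MySQL side can stay minimal; a sketch with placeholder table and column names:
-- Placeholder names; adjust to the actual MySQL report table.
SELECT id, firstname, lastname
FROM global_voip_report;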
I am creating a transformation that takes input from a CSV file and outputs it to a table. It runs correctly, but the problem is that if I run the transformation more than once, the output table contains the duplicate rows again and again.
Now I want to remove all duplicate rows from the output table.
And if I run the transformation repeatedly, it should not affect the output table unless there are new rows.
How can I solve this?
Two solutions come to my mind:
Use the Insert / Update step instead of the Table output step to store data into the output table. It will try to find a row in the output table that matches the incoming stream row according to the key fields you define (all fields/columns in your case). It works like this:
If the row can't be found, it inserts the row. If it can be found and the fields to update are the same, nothing is done. If they are not all the same, the row in the table is updated.
Use the following parameters:
The keys to look up the values: tableField1 = streamField1; tableField2 = streamField2; tableField3 = streamField3; and so on..
Update fields: tableField1, streamField1, N; tableField2, streamField2, N; tableField3, streamField3, N; and so on..
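As an illustration only (placeholder table name; this is not the step's actual implementation), the configuration above behaves roughly like this per incoming row:
-- Look for a row whose key fields match the incoming stream row.
SELECT * FROM output_table
 WHERE tableField1 = ?   -- streamField1
   AND tableField2 = ?   -- streamField2
   AND tableField3 = ?;  -- streamField3

-- With the update flag set to N, a missing row is inserted and a found row is left alone.
INSERT INTO output_table (tableField1, tableField2, tableField3)
VALUES (?, ?, ?);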
After storing duplicate values to the output table, you can remove the duplicates using this concept:
Use an Execute SQL step where you define SQL which removes duplicate entries and keeps only unique rows. You can take inspiration from this to create such SQL: How can I remove duplicate rows?
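As a minimal sketch of such de-duplication SQL, assuming the output table is called output_table, has an auto-increment id column, and duplicates are defined by col1..col3 (all of these names are assumptions):
-- Keep the lowest id per duplicate group and delete the rest.
DELETE FROM output_table
WHERE id NOT IN (
    SELECT keep_id FROM (
        SELECT MIN(id) AS keep_id
        FROM output_table
        GROUP BY col1, col2, col3
    ) AS keepers
);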
Another way is to use the Merge rows (diff) step, followed by a Synchronize after merge step.
As long as the number of rows in your CSV that differ from your target table is below 20-25% of the total, this is usually the most performance-friendly option.
Merge rows (diff) takes two input streams that must be sorted on their key fields (by a compatible collation), and generates the union of the two inputs with each row marked as "new", "changed", "deleted", or "identical". This means you'll have to put Sort rows steps on the CSV input and possibly on the input from the target table if you can't use an ORDER BY clause. Mark the CSV input as the "Compare" row origin and the target table as the "Reference".
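For example, if the target table comes in through a Table input step, the sorting can be pushed into the query instead of adding a Sort rows step; a sketch with placeholder names (and keeping the collation caveat above in mind):
-- Sort on the same key fields used by Merge rows (diff); names are placeholders.
SELECT *
FROM target_table
ORDER BY key_field;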
The Synchronize after merge step then applies the changes marked in the rows to the target table. Note that Synchronize after merge is the only step in PDI (I believe) that requires input be entered in the Advanced tab. There you set the flag field and the values that identify the row operation. After applying the changes the target table will contain the exact same data as the input CSV.
Note also that you can use a Switch/Case or Filter Rows step to do things like remove deletes or updates if you want. I often flow off the "identical" rows and write the rest to a text file so I can examine only the changes.
I looked for visual answers, but the answers were text, so I'm adding this visual answer for any Kettle newbie like me.
Case
user-updateslog.csv (has duplicate values) ---> users_table, store only the latest user details.
Solution
Step 1: Connect the CSV input to the Insert / Update step as in the transformation below.
Step 2: In Insert / Update, add the conditions that compare keys to find the candidate row, and set "Y" on the fields you want to update.
I am exporting a file from a system as .csv. My aim is to link to this file as a table (which matches the output field for field) and then run the queries and export.
The problem I am having is that, upon import, all the fields are 255 bytes wide rather than what they need to be.
Here's what I've tried so far:
I've looked at ALTER TABLE but I cannot run multiple ALTER TABLE statements in one macro.
I've also tried appending the table into another table with the correct structure but it seems to overwrite the structure.
I've also tried using the Left function with the appropriate field length, but when I try to export, I pretty much just see 5 bytes per column.
What I would like is a suggestion as to what is the best path to take given my situation. I am not able to amend the initial .csv export, and I would like to avoid VBA if possible, as I am not at all familiar with it.
You don't really need to worry about the size of Text fields in an Access linked table that is connected to a CSV file. Access simply assigns each Text field the largest possible maximum size: 255. It does not mean that every value is actually 255 characters long, it just means that any values in those fields can be at most 255 characters long.
Even if you could change the structure of the linked table (which you can't), it wouldn't make any real difference except to possibly truncate longer Text values, and you could easily do that with a String function. For example, if a particular field had to be restricted to 15 characters then you could simply use Left([fieldName], 15) as a query column or as the control source in a report.
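For instance, a minimal sketch of that approach as an Access query, where tblCsvImport and the field names other than the Left() pattern are assumptions:
SELECT Left([VendorName], 15) AS VendorShort,
       Left([Comments], 50) AS CommentsShort
FROM tblCsvImport;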
In the end, as the data set is not that large, I have set this up to append from my source data into a table with the correct structure. I can now run my processes against this table as per normal.