I have some text files; some of them have a header, some don't, and some have extra lines before the actual records start. Is there a way to remove the extra lines? Basically, I am creating an external table using the files in the specified location. Any links would be really helpful.
Basically, the headers do not span a fixed number of lines, otherwise I could have skipped them using
tblproperties ("skip.header.line.count"="1")
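(For reference, this is roughly how that property is applied when the header height is fixed; the table name, columns, delimiter and location below are placeholders:)
CREATE EXTERNAL TABLE my_table (
  col1 STRING,
  col2 STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/path/to/files'
TBLPROPERTIES ("skip.header.line.count"="1");  -- skips one line per file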
Only filtering will help in this case. Filter out the headers when selecting from the table:
select t.*
from your_table t
where t.col not in ('header_value1','header_value2','header_value3')
This will filter out NULLs as well. To keep NULLs, add OR t.col IS NULL.
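For example, the full query keeping the NULL rows would look like this (same placeholder table, column and header values as above):
select t.*
from your_table t
where t.col not in ('header_value1','header_value2','header_value3')
   or t.col is null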
Currently we have the Scala DataFrame output with the id column shown first (although it is chronologically added to the DataFrame last). The other columns appear dynamically, based on the .pivot() function and the data.
When I query the data in the %sql interpreter, the order changes, so the CSV file I download also has the id column last, which doesn't work for me. I can't simply write the selection script with the id column placed first manually, since I can't control the other columns because of the pivot. Is there any other way to make a specific column come first?
The Scala paragraph is:
resultMean.registerTempTable("mean")
The sql paragraph is:
%sql
select *
from mean
For anyone reading this in the future, the reason for this behavior was a misused DataFrame. In Scala, .show() was applied to one DataFrame, while the export to the temp table was done on a different one. If you face the same issue, please double-check that you apply your methods to the same object.
I need to get the position (column number) of a given column in a staged file on Snowflake.
The main idea is that I need to automate getting this field in other queries rather than hard-coding t.$3, where 3 is the position of the field, which might change because our surveys are expandable (more or fewer questions depending on the situation).
So what I need is something like this:
SELECT COL_NUMBER FROM @my_stage/myfile.csv WHERE value = 'my_column_name'
-- Without any file format to read the header
And then this COL_NUMBER could be used as t.$"+COL_NUMBER+" inside merge queries.
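One possible sketch of this: read the header line as a single value through a one-column file format, split it, and return the matching position. Here @my_stage, myfile.csv and the file format one_column_format (assumed to use e.g. FIELD_DELIMITER = 'NONE' and no header skipping) are placeholders/assumptions:
-- read only the first line of the file as one string, split it on commas,
-- and return the 1-based position of the wanted column name
SELECT f.index + 1 AS col_number            -- FLATTEN index is 0-based, $N is 1-based
FROM (
    SELECT $1 AS header_line
    FROM @my_stage/myfile.csv (FILE_FORMAT => 'one_column_format')
    WHERE METADATA$FILE_ROW_NUMBER = 1
) h,
LATERAL FLATTEN(INPUT => SPLIT(h.header_line, ',')) f
WHERE TRIM(f.value::STRING) = 'my_column_name';
The returned col_number could then be concatenated into the t.$<n> reference used in the merge queries.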
I want to read the column names from a file stored in Azure Files and then validate the column names and their sequence, e.g. "First_Column"="First_Column", "Second_Column"="Second_Column", etc.; the order should also match. Please suggest a way to do this in Azure Data Factory.
Update:
Alternatively, we can use a Lookup activity to view the headers, but the condition expression will be a little complex.
At If Condition1 we can use the expression: @and(and(equals(activity('Lookup1').output.firstRow.Prop_0,'First_Column'),equals(activity('Lookup1').output.firstRow.Prop_1,'Second_Column')),equals(activity('Lookup1').output.firstRow.Prop_2,'Third_Column'))
We can validate the column names and sequence in a data flow via column patterns in a Derived Column transformation.
For example:
The source data csv file is like this:
The dataflow is like this:
I don't select First row as header, so the headers of the CSV file are read into the data flow as ordinary rows.
Then I use SurrogateKey1 to add a row_no to the data.
The data preview is like this:
At the ConditionalSplit1 transformation, I use row_no == 1 to split out the header row.
At the DerivedColumn1 transformation, I use several column patterns to validate the column names and sequence.
The result is as follows:
I have a CSV file with more than 700 columns. I want just 175 of those columns to be inserted into an RDBMS table or a flat file using Pentaho (PDI). Now, the source CSV file has a variable set of columns, i.e. columns can keep being added or removed, but they contain some specific keywords that remain constant throughout. I have the list of keywords present in the column names that have to be excluded, e.g. starts_with("avgbal_"), starts_with("emi_"), starts_with("delinq_prin_"), starts_with("total_utilization_"), starts_with("min_overdue_"), starts_with("payment_received_")
Any column containing the above keywords has to be excluded and should not be passed on to my RDBMS table or flat file. Is there any way to remove these columns by writing some SQL query in PDI? Selecting the specific 175 columns is not possible, as they are variable in nature.
I think your case is a good fit for metadata injection; you can refer to the example shared below:
https://help.pentaho.com/Documentation/7.1/0L0/0Y0/0K0/ETL_Metadata_Injection
Two things you need to be careful about:
Maintain the list of columns you need to push in.
Since the column names change, you may run into issues with the valid columns you want to import or work with as well. To handle this, make sure you regenerate the metadata file every time, so you are sure about the column names you want to push out from the flat file.
I am using PDI/Kettle. I know it is possible to add new columns by specifying them in the Fields section. Is it possible to remove deprecated input columns coming from the previous step in a Modified JavaScript step with Spoon?
You can use the Select / Rename values step to remove any field from the record stream.
Do it in the second tab, Remove, where you define the fields to remove.
@Hello-lad
Reading your question, it looks like you wanted to know specifically whether you can discard an input column coming from a previous step inside a Modified JavaScript step. The real use of that step, however, is to create columns derived from values coming through the Pentaho stream, not to eliminate unwanted fields from that stream; for that you would specifically use the Select / Rename values step (as indicated by mzy).