Import csv to impala - sql

For a previous homework assignment, we were asked to import a CSV file with no column names into Impala, explicitly giving the name and type of each column when creating the table. Now we have a CSV file that does include column names. In this case, do we still need to write out the name and type of each column even though the names are provided in the data?

Yes, you still have to create an external table and define the column names and types. In addition, pass the following option at the very end of the CREATE TABLE statement so that the header line is not read as data:
tblproperties ("skip.header.line.count"="1");
-- Once the table property is set, queries skip the specified number of lines
-- at the beginning of each text data file. Therefore, all the files in the table
-- should follow the same convention for header lines.
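To show where the property sits in the full statement, here is a small Python sketch that assembles the DDL as a string. The function name, the `students` table, its columns, and the HDFS path are all made up for illustration; the column types still have to be supplied by you, since the header only carries names.

```python
def impala_csv_ddl(table, columns, location, header_lines=1):
    """Build a CREATE EXTERNAL TABLE statement for a comma-delimited text file.

    columns is a list of (name, type) pairs in file order.
    All names used by callers below are hypothetical examples.
    """
    column_defs = ",\n  ".join(f"{name} {col_type}" for name, col_type in columns)
    return (
        f"CREATE EXTERNAL TABLE {table} (\n  {column_defs}\n)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        f"LOCATION '{location}'\n"
        # The table property goes last, after LOCATION.
        f'TBLPROPERTIES ("skip.header.line.count"="{header_lines}");'
    )


if __name__ == "__main__":
    # Hypothetical table, columns, and path.
    print(impala_csv_ddl("students", [("id", "INT"), ("name", "STRING")], "/user/hw/students"))
```

The printed statement can then be pasted into impala-shell or whatever client you use.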

Related

CSV file metadata validation (comparing with existing SQL table)

I have a requirement to validate a CSV file before loading it into a staging folder, and later to load it into a SQL table.
I need to validate the metadata (the structure of the file must be the same as the target SQL table):
The number of columns should be equal to that of the target SQL table
The order of columns should be the same as in the target SQL table
Data types of columns (no text values should exist in a numeric field of the CSV file)
I'm looking for an easy and efficient way to achieve this.
Thanks for the help
A Python program and module that does most of what you're looking for is chkcsv.py: https://pypi.org/project/chkcsv/. It can be used to verify that a CSV file contains a specified set of columns and that the data type of each column conforms to the specification. It does not, however, verify that the order of columns in the CSV file is the same as the order in the database table. Instead of loading the CSV file directly into the target table, you can load it into a staging table and then move it from there into the target table--this two-step process eliminates column order dependence.
Disclaimer: I wrote chkcsv.py
Edit 2020-01-26: I just added an option that allows you to specify that the column order should be checked also.
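If you would rather see the three checks spelled out, here is a minimal standard-library sketch. It is not how chkcsv works internally and does not use its API. The `expected` list is something you would build by hand or from the target table's metadata; the function name and the "numeric"/"text" labels are my own. `float()` is a loose numeric test (it also accepts values such as "nan"), so tighten it if your target types are stricter.

```python
import csv
import io


def validate_csv(stream, expected):
    """Check a CSV stream against expected = [(column_name, kind), ...].

    kind is "numeric" or "text" (labels invented for this sketch).
    Returns a list of error messages; an empty list means the file passed.
    """
    reader = csv.reader(stream)
    header = next(reader, None)
    if header is None:
        return ["file is empty"]

    names = [name for name, _ in expected]
    if len(header) != len(names):
        return [f"expected {len(names)} columns, found {len(header)}"]
    if header != names:
        return [f"column names/order differ: found {header}, expected {names}"]

    errors = []
    for row in reader:
        if len(row) != len(names):
            errors.append(f"line {reader.line_num}: expected {len(names)} fields, found {len(row)}")
            continue
        for (name, kind), value in zip(expected, row):
            # Empty strings are treated as missing values, not as type errors.
            if kind == "numeric" and value != "":
                try:
                    float(value)
                except ValueError:
                    errors.append(f"line {reader.line_num}: {name!r} has non-numeric value {value!r}")
    return errors


if __name__ == "__main__":
    sample = io.StringIO("id,name,amount\n1,alice,2.5\n2,bob,abc\n")
    for problem in validate_csv(sample, [("id", "numeric"), ("name", "text"), ("amount", "numeric")]):
        print(problem)
```

For a real file, pass `open(path, newline="")` instead of the `io.StringIO` sample.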

Export autodetect schema

I am trying to create a table from a CSV using the schema autodetect option. It fails because some rows / columns have values that do not conform to the auto detected type. I would like to change the type for those columns to STRING.
Is there a way to export the autodetected schema so I can update it and use it for the load? The CSV has 30+ columns and I would like to avoid having to manually generate a schema file for all the columns.
Update
This question is not a duplicate of this. The latter is a solution to the case where the table already exists. In this question there is no existing table whose schema can be exported.

How do I import data from a csv when the records are not separated by line breaks but with brackets

I'm looking at the AM data for a data analysis project, and I'm having trouble importing the data into my DBMS (PostgreSQL).
My SQL code is this:
DROP TABLE IF EXISTS member_details;
CREATE TABLE member_details(
pnum varchar(255),
.....
updatedon timestamp);
COPY member_details
FROM '/Users/etc/data/sample_dump.csv'
WITH DELIMITER ','
CSV;
The problem is that the CSV file has no line breaks to separate the records. Instead, each record is enclosed in brackets, which my code above does not recognise, so it imports all the data into the header as one line and no records are created.
How the data is structured:
(dataA1, ....,dataAx),(dataB1,...,dataBx)
How can I alter my code so that PostgreSQL imports the data record by record by recognising the brackets?
Based on the PostgreSQL COPY documentation, I don't believe it allows for row delimiters other than carriage returns and/or line feeds. I believe you'll need to process your file before importing. You can simply replace all ,( with \n(, then remove all the parentheses to make it a standard CSV format that COPY will happily consume.
Perhaps there's another method for PostgreSQL that would work too, but I haven't come across anything yet.
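A rough Python sketch of that preprocessing step follows. It is a slightly stricter variant of the replacements described above: it splits on the full `),(` boundary and strips only the outermost parentheses, so parentheses inside field values survive. It assumes there is no whitespace between records and that no field value itself contains `),(`. The function name is my own.

```python
def records_to_lines(text):
    """Convert '(a1,a2),(b1,b2)' into one comma-separated record per line.

    Assumes records are joined by exactly '),(' with no whitespace in between,
    and that this sequence never appears inside a field value.
    """
    body = text.strip()
    # Drop the opening '(' of the first record and the closing ')' of the last.
    if body.startswith("(") and body.endswith(")"):
        body = body[1:-1]
    # Each '),(' marks a record boundary; turn it into a line break.
    return "\n".join(body.split("),(")) + "\n"


if __name__ == "__main__":
    print(records_to_lines("(a1,a2,a3),(b1,b2,b3)"), end="")
```

Write the result to a new file and point the COPY statement from the question at that file instead.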

Hive external table

When we create a table using
Create external table employee (name string, salary float) row format delimited fields terminated by ',' location '/emp';
In the /emp directory there are 2 emp files.
So when we run select * from employee, it gets the data from both files and displays it.
What will happen when there are other files too, containing a different kind of record whose columns do not match the employee table? Will it try to load all the files when we run "select * from employee"?
1. Can we specify the specific file name which we want to load?
2. Can we also create another table with the same location?
Thanks
Prashant
It will load all the files in the emp directory even if they don't match the table.
For your first question: you can use the RegexSerDe. If your data matches the regex, it is loaded into the table.
regex for access log in hive serde
https://github.com/apache/hive/blob/trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java
Other options: here are some links that describe a few approaches.
when creating an external table in hive can I point the location to specific files in a directory?
https://issues.apache.org/jira/browse/HIVE-951
For your second question: yes, we can create other tables with the same location.
Here are your answers
1. If the data in a file doesn't match the table format, Hive doesn't throw an error. It tries to read the data as best it can. If data for some columns is missing, it puts NULL for them.
2. No, we cannot specify a file name for a table to read data from. Hive considers all the files under the table directory.
3. Yes, we can create other tables with the same location.

Using Bulk Insert to import a csv file without specifying column names?

When creating a table before the bulk insert, is there a way to NOT specify the column names and use whatever column names are on the csv file? I have some columns in my csv file that are quarters, like 2012Q2, 2012Q3, etc. In the future, these are going to change depending on the time and that's why I don't want to specify the column names. If this is possible, any help will be appreciated.
Thanks!
One way to do this is:
1. Drop the table on BigQuery
2. Get column names from your CSV
3. Create the table using the column names from your CSV. A PHP example can be found here: https://github.com/GoogleCloudPlatform/php-docs-samples/blob/master/bigquery/api/src/functions/create_table.php (the fields = your column names)
4. Import the CSV
Done!
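The "get column names from your CSV" step could look roughly like this in Python. It only builds a list of name/type pairs to hand to whatever create-table call you use; it does not talk to BigQuery. Treating every column as STRING and prefixing names that start with a digit (such as 2012Q2) with an underscore are assumptions on my part, so check the naming rules of your target system. The function name is made up.

```python
import csv
import io
import re


def schema_from_header(stream, default_type="STRING"):
    """Read the first CSV row and return [{"name": ..., "type": ...}, ...].

    Sanitising rules here are assumptions for illustration:
    non-alphanumeric characters become "_", and names that do not start
    with a letter or underscore get a leading "_".
    """
    header = next(csv.reader(stream))
    fields = []
    for raw in header:
        name = re.sub(r"[^A-Za-z0-9_]", "_", raw.strip())
        if not re.match(r"[A-Za-z_]", name):
            name = "_" + name
        fields.append({"name": name, "type": default_type})
    return fields


if __name__ == "__main__":
    sample = io.StringIO("region,2012Q2,2012Q3\nnorth,10,12\n")
    for field in schema_from_header(sample):
        print(field)
```

Because the field list is rebuilt from the header on every run, new quarter columns are picked up without editing the table definition by hand.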