Can I display column headings when querying via gcloud dataproc jobs submit spark-sql? - apache-spark-sql

I'm issuing a spark-sql job to dataproc that simply displays some data from a table:
gcloud dataproc jobs submit spark-sql --cluster mycluster --region europe-west1 -e "select * from mydb.mytable limit 10"
When the data is returned and outputted to stdout I don't see column headings, I only see the raw data, whitespace delimited. I'd really like the output to be formatted better, specifically I'd like to see column headings. I tried this:
gcloud dataproc jobs submit spark-sql --cluster mycluster --region europe-west1 -e "SET hive.cli.print.header=true;select * from mydb.mytable limit 10"
But it had no effect.
Is there a way to get spark-sql to display column headings on dataproc?
If there is a way to get the data displayed like so:
+----+-------+
| ID | Name  |
+----+-------+
| 1  | Jim   |
| 2  | Ann   |
| 3  | Simon |
+----+-------+
then that would be even better.

I have been performing some tests with a Dataproc cluster, and it looks like it is not possible to retrieve query results with the column names using Spark SQL. However, this is more of an Apache Spark SQL issue than a Dataproc one, so I have added that tag to your question too, in order for it to receive more attention.
If you get into the Spark SQL console in your Dataproc cluster (by SSHing into the master and typing spark-sql), you will see that the result of a SELECT query does not include the column names:
SELECT * FROM mytable;
18/04/17 10:31:51 INFO org.apache.hadoop.mapred.FileInputFormat: Total input files to process : 3
2 Ann
1 Jim
3 Simon
There's no change if using SELECT ID FROM mytable; instead. Therefore, the issue is not in the gcloud dataproc jobs submit spark-sql command, but in the fact that Spark SQL does not provide this type of output.
If you do not necessarily have to use Spark SQL, you can try Hive instead. Hive does provide the information you want (including the column names, plus prettier formatting):
user@project:~$ gcloud dataproc jobs submit hive --cluster <CLUSTER_NAME> -e "SELECT * FROM mytable;"
Job [JOB_ID] submitted.
Waiting for job output...
+-------------+---------------+--+
| mytable.id  | mytable.name  |
+-------------+---------------+--+
| 2           | Ann           |
| 1           | Jim           |
| 3           | Simon         |
+-------------+---------------+--+
Job [JOB_ID] finished successfully.
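If you do end up post-processing raw spark-sql output yourself, the boxed layout above is simple to reproduce. Below is a minimal, purely illustrative sketch (the column names have to be supplied by hand, since spark-sql does not emit them). Alternatively, if switching job types is an option, a PySpark job calling spark.sql(...).show() prints this layout with headers included.

```python
def boxed(headers, rows):
    """Render rows in a MySQL/Beeline-style box; every cell is coerced to str."""
    cells = [headers] + [[str(c) for c in row] for row in rows]
    widths = [max(len(row[i]) for row in cells) for i in range(len(headers))]
    sep = "+" + "+".join("-" * (w + 2) for w in widths) + "+"

    def line(row):
        return "| " + " | ".join(c.ljust(w) for c, w in zip(row, widths)) + " |"

    return "\n".join([sep, line(headers), sep] + [line(r) for r in cells[1:]] + [sep])

# Headers supplied manually, data as parsed from the whitespace-delimited output:
print(boxed(["ID", "Name"], [[1, "Jim"], [2, "Ann"], [3, "Simon"]]))
```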

Related

Apache Drill and Apache Kudu - not able to run "select * from <some table>" using Apache Drill, for the table created in Kudu through Apache Impala

I'm able to connect to Kudu through Apache Drill and can list tables fine. But when I try to fetch data from the table "impala::default.customer" below, none of the options I tried worked for me.
The table in Kudu was created through Impala-Shell as external table.
Initial connection to Kudu and listing objects
ubuntu@ubuntu-VirtualBox:~/Downloads/apache-drill-1.19.0/bin$ sudo ./drill-embedded
Apache Drill 1.19.0
"A Drill is a terrible thing to waste."
apache drill> SHOW DATABASES;
+--------------------+
| SCHEMA_NAME        |
+--------------------+
| cp.default         |
| dfs.default        |
| dfs.root           |
| dfs.tmp            |
| information_schema |
| kudu               |
| sys                |
+--------------------+
7 rows selected (24.818 seconds)
apache drill> use kudu;
+------+----------------------------------+
| ok   | summary                          |
+------+----------------------------------+
| true | Default schema changed to [kudu] |
+------+----------------------------------+
1 row selected (0.357 seconds)
apache drill (kudu)> SHOW TABLES;
+--------------+--------------------------------+
| TABLE_SCHEMA | TABLE_NAME                     |
+--------------+--------------------------------+
| kudu         | impala::default.customer       |
| kudu         | impala::default.my_first_table |
+--------------+--------------------------------+
2 rows selected (9.045 seconds)
Now when I try to run "select * from impala::default.customer", I'm not able to run it at all.
apache drill (kudu)> SELECT * FROM `impala::default`.customer;
Error: VALIDATION ERROR: Schema [[impala::default]] is not valid with respect to either root schema or current default schema.
Current default schema: kudu
[Error Id: 8a4ca4da-2488-4775-b2f3-443b8b4b17ef ] (state=,code=0)
apache drill (kudu)> SELECT * FROM `default`.customer;
Error: VALIDATION ERROR: Schema [[default]] is not valid with respect to either root schema or current default schema.
Current default schema: kudu
[Error Id: ce96ea13-392f-4910-9f6c-789a6052b5c1 ] (state=,code=0)
apache drill (kudu)> SELECT * FROM `impala`::`default`.customer;
Error: PARSE ERROR: Encountered ":" at line 1, column 23.
SQL Query: SELECT * FROM `impala`::`default`.customer
^
[Error Id: 5aacdd98-db6e-4308-9b33-90118efa3625 ] (state=,code=0)
apache drill (kudu)> SELECT * FROM `impala::`.`default`.customer;
Error: VALIDATION ERROR: Schema [[impala::, default]] is not valid with respect to either root schema or current default schema.
Current default schema: kudu
[Error Id: 5450bd90-dfcd-4efe-a8d3-b517be85b10a ] (state=,code=0)
In Drill conventions, the first part of the FROM clause is the storage plugin, in this case kudu. When you ran the SHOW TABLES query, you saw that the table name is actually impala::default.my_first_table. If I'm reading that correctly, that whole bit is the table name, and the query below is how you should escape it.
Note the backtick before impala and after my_first_table, but nowhere else.
SELECT *
FROM kudu.`impala::default.my_first_table`
Does that work for you?
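For what it's worth, the quoting rule here (backticks around each dotted path segment, doubling any backtick that appears inside a segment, with the whole Impala name treated as a single segment) can be sketched as a tiny helper. The function below is purely illustrative, not part of any Drill API:

```python
def drill_quote(*segments):
    """Backtick-quote each path segment for Drill SQL, doubling embedded backticks."""
    return ".".join("`" + seg.replace("`", "``") + "`" for seg in segments)

# The storage plugin is one segment; the full Impala table name is another:
print("SELECT * FROM " + drill_quote("kudu", "impala::default.my_first_table"))
# -> SELECT * FROM `kudu`.`impala::default.my_first_table`
```

Quoting the kudu segment as well is harmless, since backticks around a plain identifier are optional.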

Why Query_parallelism affects the result of a join between two UUID columns

I'm running the following test on ignite 2.10.0
I have 2 tables created with a query_parallelism=1 and without affinity key.
When I join the 2 following tables I have the result as expected.
0: jdbc:ignite:thin://localhost:10800> SELECT "id" AS "_A_id", "source_id" AS "_A_source_id" FROM PUBLIC."source_ml_blue";
+--------------------------------------+--------------------------------------+
| _A_id                                | _A_source_id                         |
+--------------------------------------+--------------------------------------+
| 86c068cd-da89-11eb-a185-3da86c6c6bb3 | 86c068cc-da89-11eb-a185-3da86c6c6bb3 |
+--------------------------------------+--------------------------------------+
1 row selected (0.004 seconds)
0: jdbc:ignite:thin://localhost:10800> SELECT "id" AS "_B_id", "flx_src_ip_text" AS "_B_src_ip" FROM PUBLIC."source_nprobe_tcp_blue";
+--------------------------------------+-----------+
| _B_id                                | _B_src_ip |
+--------------------------------------+-----------+
| 86c068cc-da89-11eb-a185-3da86c6c6bb3 | 1.1.1.1   |
+--------------------------------------+-----------+
1 row selected (0.003 seconds)
0: jdbc:ignite:thin://localhost:10800> SELECT _A."id" AS "_A_id", _A."source_id" AS "_A_source_id", _B."id" AS "_B_id", _B."flx_src_ip_text" AS "_B_src_ip" FROM PUBLIC."source_ml_blue" AS "_A" INNER JOIN PUBLIC."source_nprobe_tcp_blue" AS "_B" ON "_A"."source_id"="_B"."id";
+--------------------------------------+--------------------------------------+--------------------------------------+-----------+
| _A_id                                | _A_source_id                         | _B_id                                | _B_src_ip |
+--------------------------------------+--------------------------------------+--------------------------------------+-----------+
| 86c068cd-da89-11eb-a185-3da86c6c6bb3 | 86c068cc-da89-11eb-a185-3da86c6c6bb3 | 86c068cc-da89-11eb-a185-3da86c6c6bb3 | 1.1.1.1   |
+--------------------------------------+--------------------------------------+--------------------------------------+-----------+
1 row selected (0.005 seconds)
If I delete and recreate the same tables with query_parallelism = 8, I do not get a SQL error (the parallelism is equal on the 2 tables), but the result of the join is empty.
Any idea why I get this behavior?
You observe this behaviour because of optimisations for parallel query execution. Most likely your records landed in different partitions (handled by different threads). If you increase the number of records in both tables, you will see a subset of this join as the result.
The most elegant option here is to make "_A"."source_id" and "_B"."id" affinity keys. Most likely ignite.jdbc.distributedJoins is going to affect performance for a clustered installation. Affinity collocation will make items with matching "_A"."source_id" and "_B"."id" reside in the same partition, avoiding cross-partition interaction (in clustered environments it would otherwise lead to additional network hops).
The problem comes from the SQL client: it has to be aware of the parallelism.
In DBeaver, I had to enable ignite.jdbc.distributedJoins in the connection properties to make the request work properly.
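The mechanism is easy to reproduce outside Ignite with a toy model: each row sits in the partition of its own primary key, and the join runs only within each partition. This is a sketch of the idea, not Ignite's actual affinity function:

```python
def partition(key, parallelism):
    # Stand-in for Ignite's affinity function: map a key to a partition.
    return key % parallelism

def partition_local_join(a_rows, b_rows, parallelism):
    """Join on a['source_id'] == b['id'], but only within matching partitions.

    Each row lives in the partition of its OWN primary key, which is what
    happens when no affinity key is declared on the join column.
    """
    matches = []
    for p in range(parallelism):
        a_local = [a for a in a_rows if partition(a["id"], parallelism) == p]
        b_local = [b for b in b_rows if partition(b["id"], parallelism) == p]
        matches += [(a, b) for a in a_local for b in b_local
                    if a["source_id"] == b["id"]]
    return matches

a_rows = [{"id": 2, "source_id": 1}]       # stored by its own id -> partition 2 of 8
b_rows = [{"id": 1, "src_ip": "1.1.1.1"}]  # stored in partition 1 of 8

print(len(partition_local_join(a_rows, b_rows, 1)))  # 1: found with parallelism 1
print(len(partition_local_join(a_rows, b_rows, 8)))  # 0: the rows never meet
```

With an affinity key on the join column, both rows would be placed by the join key and land in the same partition, which is what the fix above achieves.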

Insufficient Resources error while inserting into SQL table using Vertica

I'm running a Python script to load data from a DataFrame into a SQL Table. However, the insert command is throwing this error:
(pyodbc.Error) ('HY000', '[HY000] ERROR 3587: Insufficient resources to execute plan on pool fastlane [Request exceeds session memory cap: 28357027KB > 20971520KB]\n (3587) (SQLExecDirectW)')
This is my code:
df.to_sql('TableName',engine,schema='trw',if_exists='append',index=False) #copying data from Dataframe df to a SQL Table
Can you do the following for me:
Run this command and share the output. MAXMEMORYSIZE, MEMORYSIZE and MAXQUERYMEMORYSIZE, plus PLANNEDCONCURRENCY, give you an idea of the (memory) budget at the time the query / copy command was planned.
gessnerm@gessnerm-HP-ZBook-15-G3:~/1/fam/fam-ostschweiz$ vsql -x -c \
"select * from resource_pools where name='fastlane'"
-[ RECORD 1 ]------------+------------------
pool_id                  | 45035996273841188
name                     | fastlane
is_internal              | f
memorysize               | 0%
maxmemorysize            |
maxquerymemorysize       |
executionparallelism     | 16
priority                 | 0
runtimepriority          | MEDIUM
runtimeprioritythreshold | 2
queuetimeout             | 00:05
plannedconcurrency       | 2
maxconcurrency           |
runtimecap               |
singleinitiator          | f
cpuaffinityset           |
cpuaffinitymode          | ANY
cascadeto                |
Then, you should dig the actual SQL command that your Python script triggered out of the QUERY_REQUESTS system table. It should be in the format of:
COPY <_the_target_table_>
FROM STDIN DELIMITER ',' ENCLOSED BY '"'
DIRECT REJECTED DATA '<_bad_file_name_>'
or similar.
Then: how big is the file / are the files you're trying to load in one go? If too big, then B.Muthamizhselvi is right - you'll need to portion the data volume you load.
Can you also run:
vsql -c "SELECT EXPORT_OBJECTS('','<schema>.<table>',FALSE)"
... and share the output? It could well be that you have too many projections for the memory to be enough, or that you are sorting by too many columns.
Hope this helps for starters ...
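Until the pool budget itself is fixed, one client-side workaround is to load the DataFrame in portions, so that no single COPY exceeds the session memory cap. pandas' to_sql accepts a chunksize argument for exactly this; the sketch below slices manually instead (the table and schema names are the ones from the question, and the 50,000-row chunk size is an arbitrary assumption to tune):

```python
def row_chunks(n_rows, chunk_rows):
    """Yield (start, stop) index pairs covering n_rows in chunk_rows-sized pieces."""
    for start in range(0, n_rows, chunk_rows):
        yield start, min(start + chunk_rows, n_rows)

# Hypothetical usage with the DataFrame and engine from the question:
# for start, stop in row_chunks(len(df), 50_000):
#     df.iloc[start:stop].to_sql('TableName', engine, schema='trw',
#                                if_exists='append', index=False)

print(list(row_chunks(10, 4)))  # [(0, 4), (4, 8), (8, 10)]
```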

How to fix: cannot retrieve all fields MongoDB collection with Apache Drill SQL expression query

I'm trying to retrieve all (*) columns from a MongoDB collection with an Apache Drill SQL expression:
`_id`.`$oid`
Background: I'm using Apache Drill to query MongoDB collections. By default, Drill retrieves ObjectId values in a different format than the one stored in the database. For example:
Mongo: ObjectId("59f2c3eba83a576fe07c735c")
Drill query result: [B@3149a…]
In order to get the data in String format (59f2c3eba83a576fe07c735c) from the object, I changed the Drill config "store.mongo.bson.record.reader" to "false".
ALTER SESSION SET store.mongo.bson.record.reader = false
Drill query result after config set to false:
select * from calc;
+--------------------------------------+---------+
| _id                                  | name    |
+--------------------------------------+---------+
| {"$oid":"5cb0e161f0849231dfe16d99"}  | thiago  |
+--------------------------------------+---------+
Running a query by _id:
select `o`.`_id`.`$oid` , `o`.`name` from mongo.od_teste.calc o where `o`.`_id`.`$oid`='5cb0e161f0849231dfe16d99';
Result:
+---------------------------+---------+
| EXPR$0                    | name    |
+---------------------------+---------+
| 5cb0e161f0849231dfe16d99  | thiago  |
+---------------------------+---------+
For an object with a few columns like the one above (_id, name) it's ok to specify all the columns in the select query by id. However, in my production database, the objects have a "hundred" of columns.
If I try to query all (*) columns from the collection, this is the result:
select `o`.* from mongo.od_teste.calc o where `o`.`_id`.`$oid`='5cb0e161f0849231dfe16d99';
or
select * from mongo.od_teste.calc o where `o`.`_id`.`$oid`='5cb0e161f0849231dfe16d99';
+-----+
| **  |
+-----+
+-----+
No rows selected (6.112 seconds)
Expected result: Retrieve all columns from a MongoDB collection instead of declaring all of them on the SQL query.
I have no suggestions here, because it is a bug in the Mongo storage plugin.
I have created a Jira ticket for it; please take a look and feel free to add any related info there: DRILL-7176

Apache Sqoop/Pig Consistent Data Representation/Processing

In our organization, we have been trying to use hadoop ecosystem based tools to implement ETLs lately. Although the ecosystem itself is quite big, we are using only a very limited set of tools at the moment. Our typical pipeline flow is as follows:
Source Database (1 or more) -> sqoop import -> pig scripts -> sqoop export -> Destination Database (1 or more)
Over a period of time, we have encountered multiple issues with the above approach to implementing ETLs. One problem we notice a fair bit is that the fields don't align properly when trying to read a file from HDFS using pig (where the data on HDFS has typically been imported with sqoop), and the pig script fails with an error. For instance, a string may end up in a numeric field due to misalignment.
It appears that there are two approaches to this problem:
Remove the problem characters you know of in fields before processing with pig. This is the approach we have taken in the past. We do know that we have some bad data in our source databases - typically new lines and tabs in fields where they shouldn't exist. (NOTE: we used to have tabs as field delimiters.) So what we did was use a DB view or sqoop's free-form query option that in turn uses a REPLACE function or its equivalent available in the source DB (typically mysql, less often postgres). This approach does work, but it has the side effect of the HDFS data not matching the source data. In addition, some of the other imported fields will no longer make sense - for instance, imagine that you have an MD5 or SHA1 hash on a field, but the field has been modified to replace some characters; we then have to compute the MD5 or SHA1 ourselves to be consistent instead of importing the one from the source DB. Finally, this approach involves trial and error to a certain extent: we wouldn't necessarily know ahead of time which fields need to be modified (and which characters to remove), so we might need more than one iteration to reach our end goal.
Use enclosure feature with sqoop in combination with escaping and combine this with a loader of an appropriate type in pig so that not only do the fields do line up properly but a given field (and its associated values) are represented the same way as the data moves through the pipeline.
I was trying to figure out a good way to accomplish #2 using different options available in sqoop and pig. Presented below is an outline of what I have tried so far in addition to the findings.
Below are the specific versions of software used for this experiment:
Sqoop: 1.4.3
Pig: 0.12.0
Hadoop: 2.0.0
Since our data sets are typically large (and would take several hours to process), I figured I'll come up with an extremely small data set that kind of mimics some of the data issues we have had. Towards this end, I put together a small table in mysql (which will be used as source DB):
mysql> desc example;
+-------+---------------+------+-----+---------+----------------+
| Field | Type          | Null | Key | Default | Extra          |
+-------+---------------+------+-----+---------+----------------+
| id    | int(11)       | NO   | PRI | NULL    | auto_increment |
| name  | varchar(1024) | YES  |     | NULL    |                |
| v1    | int(11)       | YES  |     | NULL    |                |
| v2    | int(11)       | YES  |     | NULL    |                |
| v3    | int(11)       | YES  |     | NULL    |                |
+-------+---------------+------+-----+---------+----------------+
5 rows in set (0.00 sec)
After data has been added with INSERT statement, here are the contents of the example table:
mysql> select * from example;
+----+----------------------------------------------------------------------------+------+------+------+
| id | name | v1 | v2 | v3 |
+----+----------------------------------------------------------------------------+------+------+------+
| 1 | Some string, with a comma. | 1 | 2 | 3 |
| 2 | Another "string with quotes" | 4 | 5 | 6 |
| 3 | A string with
new line | 7 | 8 | 9 |
| 4 | A string with 3 new lines -
first new line
second new line
third new line | 10 | 11 | 12 |
| 5 | a string with "quote" and a
new line | 13 | 14 | 15 |
| 6 | clean record | 0 | 1 | 2 |
| 7 | single
newline | 0 | 1 | 2 |
| 8 | | 51 | 52 | 53 |
| 9 | NULL | 105 | NULL | 103 |
+----+----------------------------------------------------------------------------+------+------+------+
9 rows in set (0.00 sec)
We can readily see the new lines in name field. I didn't include tabs in this data set as I switched the delimiter from tab to comma, so there is one record with a comma. Since typical enclosing character is a double quote, there are some records with double quotes. Finally in the last two records (id = 8 and 9), I wanted to see how the empty string and nulls are represented in char type field and how null is represented in numeric type field.
I tried the following sqoop import on the above table:
sqoop import --connect jdbc:mysql://localhost/test --username user --password pass --table example --columns 'id, name, v1, v2, v3' --verbose --split-by id --target-dir example --fields-terminated-by , --escaped-by \\ --enclosed-by \" --num-mappers 1
Notice that backslash has been used as the escape character, double quote as the enclosure, and comma as the field delimiter.
Here is how the data looks on HDFS:
$hadoop fs -cat example/part-m-00000
"1","Some string, with a comma.","1","2","3"
"2","Another \"string with quotes\"","4","5","6"
"3","A string with
new line","7","8","9"
"4","A string with 3 new lines -
first new line
second new line
third new line","10","11","12"
"5","a string with \"quote\" and a
new line","13","14","15"
"6","clean record","0","1","2"
"7","single
newline","0","1","2"
"8","","51","52","53"
"9","null","105","null","103"
I created a small pig script to read and parse the above data:
REGISTER '……./pig/contrib/piggybank/java/piggybank.jar';
data = LOAD 'example' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'YES_MULTILINE') AS (id:int, name:chararray, v1:int, v2:int, v3:int);
dump data;
Notice the usage of CSVExcelStorage loader that's available in piggybank. Since we have newlines in the incoming data set, we enable MULTILINE option. The above script produces the following output:
(1,Some string, with a comma.,1,2,3)
(2,Another \string with quotes\",4,5,6)
(3,A string with
new line,7,8,9)
(4,A string with 3 new lines -
first new line
second new line
third new line,10,11,12)
(5,a string with \quote\" and a
new line,13,14,15)
(6,clean record,0,1,2)
(7,single
newline,0,1,2)
(8,",51,52,53)
(9,null,105,,103)
In records with id 2 and 5, a backslash remains in place of the very first double quote, while for subsequent double quotes both the slash and the quote remain. This is not exactly what I want. Noting that CSVExcelStorage, based on Excel 2007, uses a double quote to escape a quote (i.e., consecutive double quotes are treated as a single double quote), I made the escape character a double quote:
sqoop import --connect jdbc:mysql://localhost/test --username user --password pass --table example --columns 'name, v1, v2, v3' --verbose --split-by id --target-dir example --fields-terminated-by , --escaped-by '\"' --enclosed-by '\"' --num-mappers 1
Before executing the above command, I deleted the existing data:
$hadoop fs -rm -r example
After the sqoop import runs through, here is how data looks on HDFS now:
$hadoop fs -cat example/part-m-00000
"1","Some string, with a comma.","1","2","3"
"2","Another """"string with quotes""""","4","5","6"
"3","A string with
new line","7","8","9"
"4","A string with 3 new lines -
first new line
second new line
third new line","10","11","12"
"5","a string with """"quote"""" and a
new line","13","14","15"
"6","clean record","0","1","2"
"7","single
newline","0","1","2"
"8","","51","52","53"
"9","null","105","null","103"
I ran the same pig script once more on this data and it produces the following output:
(1,Some string, with a comma.,1,2,3)
(2,Another ""string with quotes"",4,5,6)
(3,A string with
new line,7,8,9)
(4,A string with 3 new lines -
first new line
second new line
third new line,10,11,12)
(5,a string with ""quote"" and a
new line,13,14,15)
(6,clean record,0,1,2)
(7,single
newline,0,1,2)
(8,",51,52,53)
(9,null,105,,103)
Noticing that any double quotes in the string are now effectively doubled, I can get rid of this by using the REPLACE function in pig:
data2 = FOREACH data GENERATE id, REPLACE(name, '""', '"') as name, v1, v2, v3;
dump data2;
The above script produces the following output:
(1,Some string, with a comma.,1,2,3)
(2,Another "string with quotes",4,5,6)
(3,A string with
new line,7,8,9)
(4,A string with 3 new lines -
first new line
second new line
third new line,10,11,12)
(5,a string with "quote" and a
new line,13,14,15)
(6,clean record,0,1,2)
(7,single
newline,0,1,2)
(8,",51,52,53)
(9,null,105,,103)
The above looks much more like the output I want. One last item I need to ensure is that nulls and empty strings for chararray type and nulls for int type are accounted for.
Towards that end, I add one more section to the above pig script that generates null and empty strings for char type and null for int type:
data3 = FOREACH data2 GENERATE id, name, v1, v2, v3, null as name2:chararray, '' as name3:chararray, null as v4:int;
dump data3;
The output looks as follows:
(1,Some string, with a comma.,1,2,3,,,)
(2,Another "string with quotes",4,5,6,,,)
(3,A string with
new line,7,8,9,,,)
(4,A string with 3 new lines -
first new line
second new line
third new line,10,11,12,,,)
(5,a string with "quote" and a
new line,13,14,15,,,)
(6,clean record,0,1,2,,,)
(7,single
newline,0,1,2,,,)
(8,",51,52,53,,,)
(9,null,105,,103,,,)
I stored the same output in HDFS using the following pig script:
STORE data3 INTO 'example_output' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'YES_MULTILINE');
Here is how data on HDFS looks like:
$hadoop fs -cat example_output/part-m-00000
1,"Some string, with a comma.",1,2,3,,,
2,"Another ""string with quotes""",4,5,6,,,
3,"A string with
new line",7,8,9,,,
4,"A string with 3 new lines -
first new line
second new line
third new line",10,11,12,,,
5,"a string with ""quote"" and a
new line",13,14,15,,,
6,clean record,0,1,2,,,
7,"single
newline",0,1,2,,,
8,"""",51,52,53,,,
9,null,105,,103,,,
For nulls and empty strings, the only two records of interest are the bottom two ones (id = 8 and 9). It’s clear that there is a difference between empty string and null from source using sqoop versus that which is generated from pig. I could account for null and empty strings in the name field above similar to how I have done for the double quote but it seems rather manual and more steps than needed.
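This ambiguity is baked into the CSV layer itself. For example, Python's csv module writes None and the empty string identically, so the distinction has to travel out-of-band; sqoop's --null-string and --null-non-string options exist for exactly this reason:

```python
import csv
import io

buf = io.StringIO()
csv.writer(buf).writerow([8, "", None])  # an empty string and a null in one row
print(buf.getvalue().rstrip("\r\n"))     # 8,, -- the two are indistinguishable

row = next(csv.reader(io.StringIO(buf.getvalue())))
print(row)  # ['8', '', ''] -- the null is gone after a round trip
```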
Notice that although we have used "enclosed-by" option in sqoop import (as opposed to "optionally-enclosed-by" option), the output from PIG uses enclosure only when there is a need for it i.e., if a quote or a comma appears in the field, then enclosing is performed, otherwise not - in other words, this looks like the sqoop equivalent of "optionally-enclosed-by" option.
The final stage in the pipeline is sqoop export. I put together the following table:
mysql> desc example_output;
+-------+---------------+------+-----+---------+-------+
| Field | Type          | Null | Key | Default | Extra |
+-------+---------------+------+-----+---------+-------+
| id    | int(11)       | YES  |     | NULL    |       |
| name  | varchar(1024) | YES  |     | NULL    |       |
| v1    | int(11)       | YES  |     | NULL    |       |
| v2    | int(11)       | YES  |     | NULL    |       |
| v3    | int(11)       | YES  |     | NULL    |       |
| name2 | varchar(1024) | YES  |     | NULL    |       |
| name3 | varchar(1024) | YES  |     | NULL    |       |
| v4    | int(11)       | YES  |     | NULL    |       |
+-------+---------------+------+-----+---------+-------+
8 rows in set (0.00 sec)
Here is the sqoop export command I used:
sqoop export --connect jdbc:mysql://localhost/test --username user --password pass --table example_output --export-dir example_output --input-fields-terminated-by , --input-escaped-by '\"' --input-optionally-enclosed-by '\"' --num-mappers 1 --verbose
The export options are similar to import options except that the "enclosed-by" has been replaced by "optionally-enclosed-by" and an "input-" prefix has been added to some of the options (e.g: --input-fields-terminated-by) since sqoop export uses those while reading input from HDFS.
This fails with the following error in the logs:
2014-02-25 22:19:05,750 ERROR org.apache.sqoop.mapreduce.TextExportMapper: Exception:
java.lang.RuntimeException: Can't parse input data: 'Some string, with a comma.,1,2,3,,,'
at example_output.__loadFromFields(example_output.java:396)
at example_output.parse(example_output.java:309)
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:83)
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:39)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:140)
at org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:64)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused by: java.util.NoSuchElementException
at java.util.ArrayList$Itr.next(ArrayList.java:794)
at example_output.__loadFromFields(example_output.java:366)
... 12 more
2014-02-25 22:19:05,756 ERROR org.apache.sqoop.mapreduce.TextExportMapper: On input: 1,"Some string, with a comma.",1,2,3,,,
2014-02-25 22:19:05,757 ERROR org.apache.sqoop.mapreduce.TextExportMapper: On input file: hdfs://nameservice1/user/xyz/example_output/part-m-00000
2014-02-25 22:19:05,757 ERROR org.apache.sqoop.mapreduce.TextExportMapper: At position 0
In an effort to troubleshoot this problem, I created a HDFS location that has only one record (id = 6) from the input data set:
$ hadoop fs -cat example_output_single_record/part-m-00000
6,clean record,0,1,2,,,
Now the sqoop export command becomes:
sqoop export --connect jdbc:mysql://localhost/test --username user --password pass --table example_output --export-dir example_output_single_record --input-fields-terminated-by , --input-escaped-by '\"' --input-optionally-enclosed-by '\"' --num-mappers 1 --verbose
The above command runs through fine and produces the desired result of inserting the single record into the destination DB:
mysql> select * from example_output;
+------+--------------+------+------+------+-------+-------+------+
| id   | name         | v1   | v2   | v3   | name2 | name3 | v4   |
+------+--------------+------+------+------+-------+-------+------+
|    6 | clean record |    0 |    1 |    2 |       |       | NULL |
+------+--------------+------+------+------+-------+-------+------+
1 row in set (0.00 sec)
While the null value has been preserved for the numeric field, both the null and the empty string mapped to an empty string in the destination DB.
With the above as the background, here are the questions:
I think it would be easier if we can ensure that a given value for a given data type will be represented/processed exactly the same way regardless of whether it’s coming from sqoop or generated by pig. Has anyone figured out a way to ensure consistent representation/processing of a given data type while preserving the original field values? I have covered only two data types here (chararray and int) but I suppose some of the other data types also have potentially similar issues.
I have used "enclosed-by" option in sqoop import instead of "optionally-enclosed-by" so that every field value will be enclosed within double quotes. I just thought it would be a source of less confusion if every value in every field was enclosed instead of just those that need enclosing. What do others use and has one of these options worked better for your use case relative to the other? It looks like CSVExcelStorage doesn't support a notion of "enclosed-by" - are there any other storage functions that support this mechanism?
Any suggestions on how to get the sqoop export to work as intended on the full output of pig script (i.e., example_output on HDFS)?
Maybe you need to step back and choose a simpler solution. So you have newlines, tabs, commas, double quotes, nulls, foreign characters, and maybe even some garbage in your data, but how random is it really? Could you choose an obscure delimiter character and survive?
For example, to use 0x17 as the field delimiter
use delimiter with sqoop:
--fields-terminated-by \0x17
and with pig:
LOAD 'input.dat' USING PigStorage('\\0x17') as (x,y,z);
Or maybe there is some other obscure ascii value you could use:
http://en.wikipedia.org/wiki/ASCII
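A quick sanity check of the idea in plain Python (0x17 is ASCII ETB, End of Transmission Block, which essentially never occurs in business data). Note that an obscure delimiter only disambiguates fields; embedded newlines would still break record boundaries unless they are removed or escaped:

```python
FIELD_SEP = "\x17"  # ASCII 0x17 (ETB), the delimiter suggested above

fields = ["1", 'Some "string", with a comma.', "2", "3"]
record = FIELD_SEP.join(fields)

# Quotes and commas pass through untouched; no enclosing or escaping needed.
print(record.split(FIELD_SEP))  # ['1', 'Some "string", with a comma.', '2', '3']
```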
Q. While importing data with sqoop, I cannot determine an unambiguous delimiter for text files. What are my options in such a scenario?
A. When a chosen delimiter might occur in imported data, use qualifiers to avoid ambiguity. You can accomplish this by using the --escaped-by and --enclosed-by args of the sqoop import command. For example, the following command encloses fields in an imported file in double quotes:
sqoop import --fields-terminated-by , --escaped-by \\ --enclosed-by '\"'
Big Data Analytics with HDInsight in "cough" 24 "cough" hours
https://books.google.com/books?id=FWvoCgAAQBAJ&pg=PT648&lpg=PT648&dq=sqoop+enclose+fields+double+quote&source=bl&ots=zkYTKphcZp&sig=LdB0BxQVQWrBbiNyA9g_roFA8Yk&hl=en&sa=X&ved=0ahUKEwiMzO-a4KrOA