File parsing issue using a Hive query - hive

I have a complex text file to parse and load for analysis.
I started off with a simple Hive query to parse a text file and load it as a table in HDFS.
I am using Beeswax to run this query.
name_area.txt
arun:salem
anand:vnr
Cheeli:guntur
Hive Query
CREATE TABLE test(
name STRING,
area STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "^(.*):(.*)$","output.format.string" = "%1$s %2$s")
LOCATION '/user/name_area.txt';
The file is copied to HDFS.
When I execute the query, I get the following exception.
NoReverseMatch at /beeswax/execute/6
Reverse for 'execute_parameterized_query' with arguments '(6,)' and keyword arguments '{}' not found.
Request Method: POST
Request URL: http://192.168.58.128:8000/beeswax/execute/6
Django Version: 1.2.3
Exception Type: NoReverseMatch
Exception Value:
Reverse for 'execute_parameterized_query' with arguments '(6,)' and keyword arguments '{}' not found.
Exception Location: /usr/lib/hue/build/env/lib/python2.6/site-packages/Django-1.2.3-py2.6.egg/django/core/urlresolvers.py in reverse, line 297
Python Executable: /usr/bin/python2.6
Python Version: 2.6.6
Python Path: [long sys.path listing from the Hue virtualenv omitted]
Server time: Fri, 24 Apr 2015 07:37:07 -0700
Appreciate your help on this.

Your CREATE TABLE statement does not seem right: the LOCATION should be the directory that contains the input file, not the file name itself (see the sketch below).
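For illustration, here is a minimal sketch of the corrected statement, issued through pyhs2 (the client that appears later on this page); the HDFS directory /user/name_area, the EXTERNAL keyword, and the connection settings are assumptions, not part of the original question:

import pyhs2

# Connection settings are placeholders for your HiveServer2 instance.
conn = pyhs2.connect(host='localhost', port=10000, authMechanism='NOSASL')

with conn.cursor() as cur:
    # LOCATION names the HDFS directory holding name_area.txt, not the
    # file itself; '/user/name_area' is a hypothetical directory.
    # EXTERNAL keeps the file in place if the table is later dropped.
    cur.execute("""
        CREATE EXTERNAL TABLE test (
          name STRING,
          area STRING)
        ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
        WITH SERDEPROPERTIES (
          "input.regex" = "^(.*):(.*)$",
          "output.format.string" = "%1$s %2$s")
        LOCATION '/user/name_area'
    """)

Copy name_area.txt into that directory (e.g. hdfs dfs -put name_area.txt /user/name_area/) before querying the table.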

Related

LOAD DATA LOCAL INPATH doesn't exist

I am new to Spark and Scala. I'm getting the following exception while trying to load a file from the local file system into a table using Spark.
Spark version 2.0, Scala version 2.11
scala> sqlContext.sql("LOAD DATA LOCAL INPATH 'file.txt' INTO TABLE student")
org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: file.txt
Please try to give the complete path, in the form file:/complete/path/to/the/file.
In the above case:
sqlContext.sql("LOAD DATA LOCAL INPATH 'file:/complete/path/to/the/file.txt' INTO TABLE student")
~Kedar

Unable to load CSV file from GCS into BigQuery

I am unable to load a 500 MB CSV file from Google Cloud Storage into BigQuery; I get this error:
Errors:
Too many errors encountered. (error code: invalid)
Job ID xxxx-xxxx-xxxx:bquijob_59e9ec3a_155fe16096e
Start Time Jul 18, 2016, 6:28:27 PM
End Time Jul 18, 2016, 6:28:28 PM
Destination Table xxxx-xxxx-xxxx:DEV.VIS24_2014_TO_2017
Write Preference Write if empty
Source Format CSV
Delimiter ,
Skip Leading Rows 1
Source URI gs://xxxx-xxxx-xxxx-dev/VIS24 2014 to 2017.csv.gz
I gzipped the 500 MB CSV file to .csv.gz before uploading it to GCS. Please help me solve this issue.
The internal details for your job show that there was an error reading the row #1 of your CSV file. You'll need to investigate further, but it could be that you have a header row that doesn't conform to the schema of the rest of the file, so we're trying to parse a string in the header as an integer or boolean or something like that. You can set the skipLeadingRows property to skip such a row.
Other than that, I'd check that the first row of your data matches the schema you're attempting to import with.
Also, the error message you received is unfortunately very unhelpful, so I've filed a bug internally to make the error you received in this case more helpful.
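For reference, here is a minimal sketch of the same load expressed with the google-cloud-bigquery Python client, showing where skip_leading_rows (the skipLeadingRows property) is set; the original job was submitted through the web UI, and the table and bucket names below are placeholders:

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder table ID and source URI; substitute your own values.
table_id = "my-project.DEV.VIS24_2014_TO_2017"
uri = "gs://my-bucket/VIS24_2014_to_2017.csv.gz"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # skip the header row instead of parsing it
    write_disposition=bigquery.WriteDisposition.WRITE_EMPTY,  # "Write if empty"
)

load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()  # waits for completion and raises if the load fails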

How to construct S3 URL for copying to Redshift?

I am trying to import a CSV file into a Redshift cluster. I have successfully completed the example in the Redshift documentation. Now I am trying to COPY from my own CSV file.
This is my command:
copy frontend_chemical from 's3://awssampledb/mybucket/myfile.CSV'
credentials 'aws_access_key_id=xxxxx;aws_secret_access_key=xxxxx'
delimiter ',';
This is the error I see:
An error occurred when executing the SQL command:
copy frontend_chemical from 's3://awssampledb/mybucket/myfile.CSV'
credentials 'aws_access_key_id=XXXX...'
[Amazon](500310) Invalid operation: The specified S3 prefix 'mybucket/myfile.CSV' does not exist
Details:
-----------------------------------------------
error: The specified S3 prefix 'mybucket/myfile.CSV' does not exist
code: 8001
context:
query: 3573
location: s3_utility.cpp:539
process: padbmaster [pid=2432]
-----------------------------------------------;
Execution time: 0.7s
1 statement failed.
I think I'm constructing the S3 URL wrong, but how should I do it?
My Redshift cluster is in the US East (N Virginia) region.
The Amazon Redshift COPY command can load multiple files in parallel. Note that in an S3 URL the first component after s3:// is the bucket name, so your command points at a bucket called awssampledb and looks for the prefix mybucket/myfile.CSV inside it, which is exactly what the error reports.
For example, if:
Bucket = mybucket
The files are in the bucket under the path data
then refer to the contents as:
s3://mybucket/data
For example:
COPY frontend_chemical
FROM 's3://mybucket/data'
CREDENTIALS 'aws_access_key_id=xxxxx;aws_secret_access_key=xxxxx'
DELIMITER ',';
This will load all files within the data directory. You can also refer to a specific file by including it in the path, e.g. s3://mybucket/data/file.csv
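If you are running the COPY from a script rather than a SQL client, a minimal sketch with psycopg2 looks like this; the cluster endpoint, database, and credentials are placeholders:

import psycopg2

# Placeholder connection details for your Redshift cluster.
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="dev", user="admin", password="...")

copy_sql = """
    COPY frontend_chemical
    FROM 's3://mybucket/data'
    CREDENTIALS 'aws_access_key_id=xxxxx;aws_secret_access_key=xxxxx'
    DELIMITER ',';
"""

with conn, conn.cursor() as cur:
    cur.execute(copy_sql)  # Redshift expands the prefix to all matching objects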

Google BigQuery - bq load failure displays the file number; how to get the file name?

I'm running the following bq command
bq load --source_format=CSV --skip_leading_rows=1 --max_bad_records=1000 --replace raw_data.order_20150131 gs://raw-data/order/order/2050131/* order.json
and I get the following message when loading the data into BigQuery:
*************************************
Waiting on bqjob_r4ca10491_0000014ce70963aa_1 ... (412s) Current status: DONE
BigQuery error in load operation: Error processing job
'orders:bqjob_r4ca10491_0000014ce70963aa_1': Too few columns: expected
11 column(s) but got 1 column(s). For additional help: http://goo.gl/RWuPQ
Failure details:
- File: 844 / Line:1: Too few columns: expected 11 column(s) but got
1 column(s). For additional help: http://goo.gl/RWuPQ
**********************************
The message displays only the file number.
I checked the files' content; most of them are good.
gsutil ls and the Cloud Console, on the other hand, display file names.
How can I know which file it is from the file number?
There seems to be some weird spacing introduced in the question, but if the desired path to ingest is "*/order.json", that won't work: you can only use "*" at the end of the path when ingesting data into BigQuery.
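A minimal sketch of that rule with the google-cloud-bigquery Python client (the URIs echo the question and are placeholders):

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.CSV,
                                    skip_leading_rows=1)

# Valid: a single "*" wildcard at the end of the path.
good_uri = "gs://raw-data/order/order/20150131/*"

# Invalid per the rule above: a wildcard in the middle of the path.
# bad_uri = "gs://raw-data/order/*/order.json"

load_job = client.load_table_from_uri(good_uri, "raw_data.order_20150131",
                                      job_config=job_config)
load_job.result()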

pyhs2/Hive: No files matching path, but the file exists

Using the hive or beeline client, I have no problem executing this statement:
hive -e "LOAD DATA LOCAL INPATH '/tmp/tmpBKe_Mc' INTO TABLE unit_test_hs2"
The data from the file is loaded successfully into hive.
However, when using pyhs2 from the same machine, the file is not found:
import pyhs2

conn_str = {'authMechanism': 'NOSASL', 'host': 'azus'}
conn = pyhs2.connect(**conn_str)  # unpack the settings as keyword arguments
with conn.cursor() as cur:
    cur.execute("LOAD DATA LOCAL INPATH '/tmp/tmpBKe_Mc' INTO TABLE unit_test_hs2")
Throws exception:
Traceback (most recent call last):
File "data_access/hs2.py", line 38, in write
cur.execute("LOAD DATA LOCAL INPATH '%s' INTO TABLE %s" % (csv_file.name, table_name))
File "/edge/1/anaconda/lib/python2.7/site-packages/pyhs2/cursor.py", line 63, in execute
raise Pyhs2Exception(res.status.errorCode, res.status.errorMessage)
pyhs2.error.Pyhs2Exception: "Error while compiling statement: FAILED: SemanticException Line 1:23 Invalid path ''/tmp/tmpBKe_Mc'': No files matching path file:/tmp/tmpBKe_Mc"
I've seen similar questions posted about this problem, and the usual answer is that the query is running on a different server that doesn't have the local file '/tmp/tmpBKe_Mc' stored on it. However, if that is the case, why would running the command directly from the CLI work but using pyhs2 not work?
(Secondary question: how can I show which server is trying to handle the query? I've tried cur.execute("set"), which returns all configuration parameters but when grepping for "host" the returned parameters don't seem to contain a real hostname.)
Thanks!
This happens because pyhs2 is trying to find the file on the cluster: the statement is executed by HiveServer2, so LOCAL INPATH is resolved on the HiveServer2 host, not on the machine running your Python client, whereas the hive CLI reads the local path on your machine directly.
The solution is to save your source file to an HDFS location instead of the server-local /tmp and load it from there, as sketched below.
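A minimal sketch of that workaround, assuming the file is first copied into HDFS; the staging directory /user/hive/staging and the connection settings are assumptions:

import subprocess
import pyhs2

# Copy the local file into HDFS first; the target directory is hypothetical.
subprocess.check_call(["hdfs", "dfs", "-put", "/tmp/tmpBKe_Mc",
                       "/user/hive/staging/tmpBKe_Mc"])

conn = pyhs2.connect(host='azus', authMechanism='NOSASL')
with conn.cursor() as cur:
    # No LOCAL keyword: the path is resolved against HDFS, which the
    # HiveServer2 host can see regardless of where the client runs.
    cur.execute("LOAD DATA INPATH '/user/hive/staging/tmpBKe_Mc' "
                "INTO TABLE unit_test_hs2")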