file parsing using hive query issue - hive
I have a complex text file to parse and load for analysis.
I started off with a simple Hive query to parse a text file and load as a table in HDFS.
I am using beeswax to run this query.
name_area.txt
arun:salem
anand:vnr
Cheeli:guntur
Hive Query
CREATE TABLE test(
name STRING,
area STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "^(.*):(.*)$","output.format.string" = "%1$s %2$s")
LOCATION '/user/name_area.txt';
The file is copied to HDFS.
When i execute the query, I am getting the following exception.
NoReverseMatch at /beeswax/execute/6
Reverse for ‘execute_parameterized_query’ with arguments ‘(6,)’ and keyword arguments ‘{}’ not found.
Request Method: POST
Request URL: http://192.168.58.128:8000/beeswax/execute/6
Django Version: 1.2.3
Exception Type: NoReverseMatch
Exception Value:
Reverse for ‘execute_parameterized_query’ with arguments ‘(6,)’ and keyword arguments ‘{}’ not found.
Exception Location: /usr/lib/hue/build/env/lib/python2.6/site-packages/Django-1.2.3-py2.6.egg/django/core/urlresolvers.py in reverse, line 297
Python Executable: /usr/bin/python2.6
Python Version: 2.6.6
Python Path: [”, ‘/usr/lib/hue/build/env/lib/python2.6/site-packages/setuptools-0.6c11-py2.6.egg’, ‘/usr/lib/hue/build/env/lib/python2.6/site-packages/pip-0.6.3-py2.6.egg’, ‘/usr/lib/hue/build/env/lib/python2.6/site-packages/Babel-0.9.6-py2.6.egg’, ‘/usr/lib/hue/build/env/lib/python2.6/site-packages/BabelDjango-0.2.2-py2.6.egg’, ‘/usr/lib/hue/build/env/lib/python2.6/site-packages/Beaker-1.4.2-py2.6.egg’, ‘/usr/lib/hue/build/env/lib/python2.6/site-packages/Django-1.2.3-py2.6.egg’, ‘/usr/lib/hue/build/env/lib/python2.6/site-packages/Mako-0.7.2-py2.6.egg’, ‘/usr/lib/hue/build/env/lib/python2.6/site-packages/Markdown-2.0.3-py2.6.egg’, ‘/usr/lib/hue/build/env/lib/python2.6/site-packages/MarkupSafe-0.9.3-py2.6-linux-x86_64.egg’, ‘/usr/lib/hue/build/env/lib/python2.6/site-packages/MySQL_python-1.2.3c1-py2.6-linux-x86_64.egg’, ‘/usr/lib/hue/build/env/lib/python2.6/site-packages/Paste-1.7.2-py2.6.egg’, ‘/usr/lib/hue/build/env/lib/python2.6/site-packages/PyYAML-3.09-py2.6-linux-x86_64.egg’, ‘/usr/lib/hue/build/env/lib/python2.6/site-packages/Pygments-1.3.1-py2.6.egg’, ‘/usr/lib/hue/build/env/lib/python2.6/site-packages/South-0.7-py2.6.egg’, ‘/usr/lib/hue/build/env/lib/python2.6/site-packages/Spawning-0.9.6-py2.6.egg’, ‘/usr/lib/hue/build/env/lib/python2.6/site-packages/Twisted-8.2.0-py2.6-linux-x86_64.egg’, ‘/usr/lib/hue/build/env/lib/python2.6/site-packages/anyjson-0.3.1-py2.6.egg’, ‘/usr/lib/hue/build/env/lib/python2.6/site-packages/avro-1.5.0-py2.6.egg’, ‘/usr/lib/hue/build/env/lib/python2.6/site-packages/billiard-2.7.3.28-py2.6-linux-x86_64.egg’, ‘/usr/lib/hue/build/env/lib/python2.6/site-packages/celery-3.0.19-py2.6.egg’, ‘/usr/lib/hue/build/env/lib/python2.6/site-packages/configobj-4.6.0-py2.6.egg’, ‘/usr/lib/hue/build/env/lib/python2.6/site-packages/django_auth_ldap-1.2.1-py2.6.egg’, ‘/usr/lib/hue/build/env/lib/python2.6/site-packages/django_celery-3.0.17-py2.6.egg’, ‘/usr/lib/hue/build/env/lib/python2.6/site-packages/django_extensions-0.5-py2.6.egg’, ‘/usr/lib/hue/build/env/lib/python2.6/site-packages/django_nose-0.5-py2.6.egg’, ‘/usr/lib/hue/build/env/lib/python2.6/site-packages/elementtree-1.2.6_20050316-py2.6.egg’, ‘/usr/lib/hue/build/env/lib/python2.6/site-packages/enum-0.4.4-py2.6.egg’, ‘/usr/lib/hue/build/env/lib/python2.6/site-packages/eventlet-0.9.14-py2.6.egg’, ‘/usr/lib/hue/build/env/lib/python2.6/site-packages/greenlet-0.3.1-py2.6-linux-x86_64.egg’, ‘/usr/lib/hue/build/env/lib/python2.6/site-packages/httplib2-0.8-py2.6.egg’, ‘/usr/lib/hue/build/env/lib/python2.6/site-packages/importlib-1.0.2-py2.6.egg’, ‘/usr/lib/hue/build/env/lib/python2.6/site-packages/kerberos-1.1.1-py2.6-linux-x86_64.egg’, ‘/usr/lib/hue/build/env/lib/python2.6/site-packages/kombu-2.5.10-py2.6.egg’, ‘/usr/lib/hue/build/env/lib/python2.6/site-packages/lockfile-0.8-py2.6.egg’, ‘/usr/lib/hue/build/env/lib/python2.6/site-packages/lxml-3.3.5-py2.6-linux-x86_64.egg’, ‘/usr/lib/hue/build/env/lib/python2.6/site-packages/moxy-1.0.0-py2.6.egg’, ‘/usr/lib/hue/build/env/lib/python2.6/site-packages/openpyxl-1.6.1-py2.6.egg’, ‘/usr/lib/hue/build/env/lib/python2.6/site-packages/ordereddict-1.1-py2.6.egg’, ‘/usr/lib/hue/build/env/lib/python2.6/site-packages/pam-0.1.3-py2.6.egg’, ‘/usr/lib/hue/build/env/lib/python2.6/site-packages/processing-0.52-py2.6-linux-x86_64.egg’, ‘/usr/lib/hue/build/env/lib/python2.6/site-packages/pyOpenSSL-0.13-py2.6-linux-x86_64.egg’, ‘/usr/lib/hue/build/env/lib/python2.6/site-packages/pycrypto-2.6-py2.6-linux-x86_64.egg’, ‘/usr/lib/hue/build/env/lib/python2.6/site-packages/pysqlite-2.5.5-py2.6-linux-x86_64.egg’, ‘/usr/lib/hue/build/env/lib/python2.6/site-packages/python_daemon-1.5.1-py2.6.egg’, ‘/usr/lib/hue/build/env/lib/python2.6/site-packages/python_dateutil-2.0-py2.6.egg’, ‘/usr/lib/hue/build/env/lib/python2.6/site-packages/python_ldap-2.3.13-py2.6-linux-x86_64.egg’, ‘/usr/lib/hue/build/env/lib/python2.6/site-packages/pytidylib-0.2.1-py2.6.egg’, ‘/usr/lib/hue/build/env/lib/python2.6/site-packages/requests-2.2.1-py2.6.egg’, ‘/usr/lib/hue/build/env/lib/python2.6/site-packages/requests_kerberos-0.4-py2.6.egg’, ‘/usr/lib/hue/build/env/lib/python2.6/site-packages/sasl-0.1.1-py2.6-linux-x86_64.egg’, ‘/usr/lib/hue/build/env/lib/python2.6/site-packages/sh-1.08-py2.6.egg’, ‘/usr/lib/hue/build/env/lib/python2.6/site-packages/simplejson-2.0.9-py2.6-linux-x86_64.egg’, ‘/usr/lib/hue/build/env/lib/python2.6/site-packages/threadframe-0.2-py2.6-linux-x86_64.egg’, ‘/usr/lib/hue/build/env/lib/python2.6/site-packages/thrift-0.9.1-py2.6-linux-x86_64.egg’, ‘/usr/lib/hue/build/env/lib/python2.6/site-packages/urllib2_kerberos-0.1.6-py2.6.egg’, ‘/usr/lib/hue/build/env/lib/python2.6/site-packages/xlrd-0.9.0-py2.6.egg’, ‘/usr/lib/hue/build/env/lib/python2.6/site-packages/zope.interface-3.5.2-py2.6-linux-x86_64.egg’, ‘/usr/lib/hue/desktop/core/src’, ‘/usr/lib/hue/desktop/libs/hadoop/src’, ‘/usr/lib/hue/desktop/libs/liboozie/src’, ‘/usr/lib/hue/build/env/lib/python2.6/site-packages’, ‘/usr/lib/hue/apps/about/src’, ‘/usr/lib/hue/apps/beeswax/src’, ‘/usr/lib/hue/apps/filebrowser/src’, ‘/usr/lib/hue/apps/hcatalog/src’, ‘/usr/lib/hue/apps/help/src’, ‘/usr/lib/hue/apps/jobbrowser/src’, ‘/usr/lib/hue/apps/jobsub/src’, ‘/usr/lib/hue/apps/oozie/src’, ‘/usr/lib/hue/apps/pig/src’, ‘/usr/lib/hue/apps/proxy/src’, ‘/usr/lib/hue/apps/useradmin/src’, ‘/usr/lib/hue/build/env/bin’, ‘/usr/lib64/python2.6′, ‘/usr/lib64/python2.6/plat-linux2′, ‘/usr/lib64/python2.6/lib-dynload’, ‘/usr/lib64/python2.6/site-packages’, ‘/usr/lib/python2.6/site-packages’, ‘/usr/lib/python2.6/site-packages/setuptools-0.6c11-py2.6.egg-info’, ‘/usr/lib/hue/apps/beeswax/gen-py’, ‘/usr/lib/hue’, ‘/usr/lib64/python26.zip’, ‘/usr/lib64/python2.6/lib-tk’, ‘/usr/lib64/python2.6/lib-old’, ‘/usr/lib/python2.6/site-packages/setuptools-0.6c11-py2.6.egg-info’, ‘/usr/lib/python2.6/site-packages/setuptools-0.6c11-py2.6.egg-info’, ‘/usr/lib/hue/apps/beeswax/src/beeswax/../../gen-py’, ‘/usr/lib/hue/apps/jobbrowser/src/jobbrowser/../../gen-py’, ‘/usr/lib/hue/apps/proxy/src/proxy/../../gen-py’]
Server time: Fri, 24 Apr 2015 07:37:07 -0700
Appreciate your help on this.
Your create statement query does not seems write. The location should be the directory name where input file exists and not the file name.
Related
LOAD DATA LOCAL INPUT PATH doesnt exist
I am new to the Spark and Scala Technology. I'm getting the following exception while trying to load a file from local file system into table using Spark. Spark version -2.0 and Scala version - 2.11 scala> sqlContext.sql("LOAD DATA LOCAL INPATH 'file.txt' INTO TABLE student") org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: file.txt
Please try to give complete path as file:/complete path to the file. In above case: sqlContext.sql("LOAD DATA LOCAL INPATH 'file:/complete path to the file.txt' INTO TABLE student") ~Kedar
unable to load csv file from GCS into bigquery
I am unable to load 500mb csv file from google cloud storage to big query but i got this error Errors: Too many errors encountered. (error code: invalid) Job ID xxxx-xxxx-xxxx:bquijob_59e9ec3a_155fe16096e Start Time Jul 18, 2016, 6:28:27 PM End Time Jul 18, 2016, 6:28:28 PM Destination Table xxxx-xxxx-xxxx:DEV.VIS24_2014_TO_2017 Write Preference Write if empty Source Format CSV Delimiter , Skip Leading Rows 1 Source URI gs://xxxx-xxxx-xxxx-dev/VIS24 2014 to 2017.csv.gz I have gzipped 500mb csv file to csv.gz to upload to GCS.Please help me to solve this issue
The internal details for your job show that there was an error reading the row #1 of your CSV file. You'll need to investigate further, but it could be that you have a header row that doesn't conform to the schema of the rest of the file, so we're trying to parse a string in the header as an integer or boolean or something like that. You can set the skipLeadingRows property to skip such a row. Other than that, I'd check that the first row of your data matches the schema you're attempting to import with. Also, the error message you received is unfortunately very unhelpful, so I've filed a bug internally to make the error you received in this case more helpful.
How to construct S3 URL for copying to Redshift?
I am trying to import a CSV file into a Redshift cluster. I have successfully completed the example in the Redshift documentation. Now I am trying to COPY from my own CSV file. This is my command: copy frontend_chemical from 's3://awssampledb/mybucket/myfile.CSV' credentials 'aws_access_key_id=xxxxx;aws_secret_access_key=xxxxx' delimiter ','; This is the error I see: An error occurred when executing the SQL command: copy frontend_chemical from 's3://awssampledb/mybucket/myfile.CSV' credentials 'aws_access_key_id=XXXX...' [Amazon](500310) Invalid operation: The specified S3 prefix 'mybucket/myfile.CSV' does not exist Details: ----------------------------------------------- error: The specified S3 prefix 'mybucket/myfile.CSV' does not exist code: 8001 context: query: 3573 location: s3_utility.cpp:539 process: padbmaster [pid=2432] -----------------------------------------------; Execution time: 0.7s 1 statement failed. I think I'm constructing the S3 URL wrong, but how should I do it? My Redshift cluster is in the US East (N Virginia) region.
The Amazon Redshift COPY command can be used to load multiple files in parallel. For example: Bucket = mybucket The files are in the bucket under the path data Then refer to the contents as: s3://mybucket/data For example: COPY frontend_chemical FROM 's3://mybucket/data' CREDENTIALS 'aws_access_key_id=xxxxx;aws_secret_access_key=xxxxx' DELIMITER ','; This will load all files within the data directory. You can also refer to a specific file by including it in the path, eg s3://mybucket/data/file.csv
Hi , Google big query - bq fail load display file number how to get the file name
I'm running the following bq command bq load --source_format=CSV --skip_leading_rows=1 --max_bad_records=1000 --replace raw_data.order_20150131 gs://raw-data/order/order/2050131/* order.json and getting the following message when loading data into bq . ************************************* Waiting on bqjob_r4ca10491_0000014ce70963aa_1 ... (412s) Current status: DONE BigQuery error in load operation: Error processing job 'orders:bqjob_r4ca10491_0000014ce70963aa_1': Too few columns: expected 11 column(s) but got 1 column(s). For additional help: http://goo.gl/RWuPQ Failure details: - File: 844 / Line:1: Too few columns: expected 11 column(s) but got 1 column(s). For additional help: http://goo.gl/RWuPQ ********************************** The message display only the file number . checked the files content most of them are good . gsutil ls and the cloud console on the other hand display file names . how can I know which file is it according to the file number?
There seems to be some weird spacing introduced in the question, but if the desired path to ingest is "/order.json" - that won't work: You can only use "" at the end of the path when ingesting data to BigQuery.
pyhs2/hive No files matching path file and file Exists
Using the hive or beeline client, I have no problem executing this statement: hive -e "LOAD DATA LOCAL INPATH '/tmp/tmpBKe_Mc' INTO TABLE unit_test_hs2" The data from the file is loaded successfully into hive. However, when using pyhs2 from the same machine, the file is not found: import pyhs2 conn_str = {'authMechanism':'NOSASL', 'host':'azus',} conn = pyhs2.connect(conn_str) with conn.cursor() as cur: cur.execute("LOAD DATA LOCAL INPATH '/tmp/tmpBKe_Mc' INTO TABLE unit_test_hs2") Throws exception: Traceback (most recent call last): File "data_access/hs2.py", line 38, in write cur.execute("LOAD DATA LOCAL INPATH '%s' INTO TABLE %s" % (csv_file.name, table_name)) File "/edge/1/anaconda/lib/python2.7/site-packages/pyhs2/cursor.py", line 63, in execute raise Pyhs2Exception(res.status.errorCode, res.status.errorMessage) pyhs2.error.Pyhs2Exception: "Error while compiling statement: FAILED: SemanticException Line 1:23 Invalid path ''/tmp/tmpBKe_Mc'': No files matching path file:/tmp/tmpBKe_Mc" I've seen similar questions posted about this problem, and the usual answer is that the query is running on a different server that doesn't have the local file '/tmp/tmpBKe_Mc' stored on it. However, if that is the case, why would running the command directly from the CLI work but using pyhs2 not work? (Secondary question: how can I show which server is trying to handle the query? I've tried cur.execute("set"), which returns all configuration parameters but when grepping for "host" the returned parameters don't seem to contain a real hostname.) Thanks!
This happens because pyhs2 trying to find file on cluster Solution is to have your source saved in related hdfs location instead of /tmp