I am new to Spark and Scala. I'm getting the following exception while trying to load a file from the local file system into a table using Spark.
Spark version: 2.0, Scala version: 2.11
scala> sqlContext.sql("LOAD DATA LOCAL INPATH 'file.txt' INTO TABLE student")
org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: file.txt
Please give the complete path to the file, prefixed with the file: scheme, i.e. file:/complete path to the file.
In the above case:
sqlContext.sql("LOAD DATA LOCAL INPATH 'file:/complete path to the file.txt' INTO TABLE student")
~Kedar
Related
I am trying to load Avro files from Google Storage into BigQuery tables but am facing the following issue.
The steps I followed are below.
Create a DataFrame in Spark.
Store the data by writing it out as Avro:
dataframe.write.avro("path")
Load the Avro files into Google Storage.
Try to load the data into Google BigQuery using the following command:
bq --nosync load --autodetect --source_format AVRO datasettest.testtable gs://test/avrodebug/*.avro
This command gives the following error:
Error while reading data, error message: The Apache Avro library failed to read data with the follwing error: Cannot resolve: "long" with "int"
So I also tried the command with an explicit schema:
bq --nosync load --source_format AVRO datasettest.testtable gs://test/avrodebug/*.avro C1:STRING, C2:STRING, C3:STRING, C4:STRING, C5:STRING, C6:INTEGER, C7:INTEGER, C8:INTEGER, C9:STRING, C10:STRING, C11:STRING
Here only C6, C7, and C8 hold integer values.
Even this gives the same error as before.
Is there any reason why I am getting the error for "long" with "int" instead of long with INTEGER?
Please let me know if there is any way to load this data by casting it.
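One hedged possibility (a sketch, not verified against this data): the "Cannot resolve" error usually means the Avro files do not all carry the same schema, with some declaring a field as "int" and others as "long". Casting the integer columns to one consistent type in Spark before writing (column names taken from the schema above; LongType chosen arbitrarily) would make every file under gs://test/avrodebug/ agree:

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.LongType

// Sketch: force the integer columns to a single consistent type so all
// written Avro files share one schema. Uses the same spark-avro writer
// as above (import com.databricks.spark.avro._).
val casted = dataframe
  .withColumn("C6", col("C6").cast(LongType))
  .withColumn("C7", col("C7").cast(LongType))
  .withColumn("C8", col("C8").cast(LongType))

casted.write.avro("path")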
I am trying to fetch data from a Hive external table using HiveContext and store it in a text file. The data path for the Hive external table is hdfs:/data/abc/job_log. My code is failing intermittently with the error below.
WARN TaskSetManager: Lost task 1524.0 in stage 0.0 (TID 1524, ): java.io.FileNotFoundException: File does not exist: /data/abc/job_log/abc_job_20171027001515.COPYING
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1828)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1799)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1712)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:672)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:373)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
I am using Spark 1.6.1, Scala 2.10.5, and an HDP 2.4.2 cluster. Any help will be appreciated.
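The .COPYING suffix suggests the file was still being copied into HDFS when Spark planned the job and had been renamed by the time the task tried to read it. A minimal sketch of one workaround, assuming the Spark 1.6 shell (sc/sqlContext) and the path above: list the directory first and read only the files that have finished copying.

import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch: skip in-flight *.COPYING files and read only completed ones.
val fs = FileSystem.get(sc.hadoopConfiguration)
val completedFiles = fs.listStatus(new Path("/data/abc/job_log"))
  .map(_.getPath.toString)
  .filterNot(_.contains(".COPYING"))
val jobLog = sqlContext.read.text(completedFiles: _*)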
I can see the file is on HDFS.
$hadoop fs -cat /user/root/1.txt
1
2
3
but from Hive, it does not recognize the file:
hive> create table test4 (numm INT);
OK
Time taken: 0.187 seconds
hive> load data inpath '/user/root/1.txt' into table test4;
FAILED: SemanticException Line 1:17 Invalid path ''/user/root/1.txt'': No files matching path file:/user/root/1.txt
Loading the file from the local file system works fine.
Please give the complete path for the file.
E.g. load data inpath 'Namenode:' into table .
Hope this helps. Please let me know if you still face any difficulties.
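For illustration, a fully qualified path would look like the following (the namenode host and port are hypothetical; use your cluster's values):
load data inpath 'hdfs://namenode-host:8020/user/root/1.txt' into table test4;
Note also that the error mentions file:/user/root/1.txt: Hive is resolving the unqualified path against the local file system, which usually means fs.defaultFS is not pointing at HDFS in the client's configuration; a fully qualified hdfs:// path sidesteps that.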
Using the hive or beeline client, I have no problem executing this statement:
hive -e "LOAD DATA LOCAL INPATH '/tmp/tmpBKe_Mc' INTO TABLE unit_test_hs2"
The data from the file is loaded successfully into hive.
However, when using pyhs2 from the same machine, the file is not found:
import pyhs2
conn_str = {'authMechanism':'NOSASL', 'host':'azus',}
conn = pyhs2.connect(**conn_str)
with conn.cursor() as cur:
cur.execute("LOAD DATA LOCAL INPATH '/tmp/tmpBKe_Mc' INTO TABLE unit_test_hs2")
Throws exception:
Traceback (most recent call last):
File "data_access/hs2.py", line 38, in write
cur.execute("LOAD DATA LOCAL INPATH '%s' INTO TABLE %s" % (csv_file.name, table_name))
File "/edge/1/anaconda/lib/python2.7/site-packages/pyhs2/cursor.py", line 63, in execute
raise Pyhs2Exception(res.status.errorCode, res.status.errorMessage)
pyhs2.error.Pyhs2Exception: "Error while compiling statement: FAILED: SemanticException Line 1:23 Invalid path ''/tmp/tmpBKe_Mc'': No files matching path file:/tmp/tmpBKe_Mc"
I've seen similar questions posted about this problem, and the usual answer is that the query is running on a different server that doesn't have the local file '/tmp/tmpBKe_Mc' stored on it. However, if that is the case, why would running the command directly from the CLI work but using pyhs2 not work?
(Secondary question: how can I show which server is trying to handle the query? I've tried cur.execute("set"), which returns all configuration parameters but when grepping for "host" the returned parameters don't seem to contain a real hostname.)
Thanks!
This happens because pyhs2 submits the statement to HiveServer2, so the file must be visible to the HiveServer2 process on the cluster; the hive CLI, by contrast, resolves LOCAL paths on the machine where you run it.
The solution is to have your source saved in an HDFS location instead of /tmp, and load it without the LOCAL keyword.
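A hedged sketch of that fix (the HDFS destination path /user/me/tmpBKe_Mc is hypothetical): copy the file to HDFS first, then drop the LOCAL keyword so the path is resolved on HDFS, which HiveServer2 can see:

terminal> hadoop fs -put /tmp/tmpBKe_Mc /user/me/tmpBKe_Mc

with conn.cursor() as cur:
    cur.execute("LOAD DATA INPATH '/user/me/tmpBKe_Mc' INTO TABLE unit_test_hs2")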
While loading an XML data file into a Hive table, I got the following error message:
FAILED: SemanticException 7:9 Input format must implement InputFormat. Error encountered near token 'StoresXml'.
The way I am loading the XML file is as follows:
Create a table StoresXml:
CREATE EXTERNAL TABLE StoresXml (storexml string)
STORED AS INPUTFORMAT 'org.apache.mahout.classifier.bayes.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/user/hive/warehouse/stores';
The location /user/hive/warehouse/stores is in HDFS.
load data inpath <local path where the xml file is stored> into table StoresXml;
Now, the problem is that when I select any column from table StoresXml, the above-mentioned error comes up.
Please help me with it. Where am I going wrong?
1) First you need to create a single-column table, like:
CREATE TABLE xmlsample(xml string);
2) After that you need to load the data from local/HDFS into the Hive table, like:
LOAD DATA INPATH '---------' INTO TABLE XMLSAMPLE;
3) Then query the XML using the xpath, xpath_array, and xpath_string UDFs, as in the sample query below.
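For instance, a minimal hedged example (the tag names here are hypothetical and must match your XML):
hive> select xpath_string(xml, 'record/name') from xmlsample;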
I have just loaded this transactions.xml file into a Hive table using xpath.
For the XML file:
Bring each record of the XML file onto its own line:
terminal> cat /home/cloudera/Desktop/Test/Transactions_xml.xml | tr -d '&' | tr '\n' ' ' | tr '\r' ' ' | sed 's|</record>|</record>\n|g' | grep -v '^\s*$' > /home/cloudera/Desktop/trx_xml.xml;
terminal> hadoop fs -put /home/cloudera/Desktop/trx_xml.xml /user/cloudera/DataTest/Transactions_xml
hive>create table Transactions_xml1(xmldata string);
hive>load data inpath '/user/cloudera/DataTest/Transactions_xml' overwrite into table Transactions_xml1;
hive>create table Transactions_xml(trx_id int,account int,amount int);
hive>insert overwrite table Transactions_xml select xpath_int(xmldata,'record/Tid'),
xpath_int(xmldata,'record/AccounID'),
xpath_int(xmldata,'record/Amount') from Transactions_xml1;
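Then verify the load:
hive> select * from Transactions_xml;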
I hope this will help you. Let me know the result.
I have developed a tool to generate Hive scripts from a CSV file. Following are a few examples of how the files are generated.
Tool -- https://sourceforge.net/projects/csvtohive/?source=directory
Select a CSV file using Browse and set the Hadoop root directory, e.g. /user/bigdataproject/
The tool generates a Hadoop script with all the CSV files; the following is a sample of the
generated Hadoop script that inserts the CSVs into Hadoop:
#!/bin/bash -v
hadoop fs -put ./AllstarFull.csv /user/bigdataproject/AllstarFull.csv
hive -f ./AllstarFull.hive
hadoop fs -put ./Appearances.csv /user/bigdataproject/Appearances.csv
hive -f ./Appearances.hive
hadoop fs -put ./AwardsManagers.csv /user/bigdataproject/AwardsManagers.csv
hive -f ./AwardsManagers.hive
Sample of the generated Hive scripts:
CREATE DATABASE IF NOT EXISTS lahman;
USE lahman;
CREATE TABLE AllstarFull (playerID string,yearID string,gameNum string,gameID string,teamID string,lgID string,GP string,startingPos string) row format delimited fields terminated by ',' stored as textfile;
LOAD DATA INPATH '/user/bigdataproject/AllstarFull.csv' OVERWRITE INTO TABLE AllstarFull;
SELECT * FROM AllstarFull;
Thanks
Vijay