I have a directory in HDFS that contains Avro files. When I try to overwrite the directory with a DataFrame, it fails.
Syntax: avroData_df.write.mode(SaveMode.Overwrite).format("com.databricks.spark.avro").save("")
The error is:
Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://nameservice1//part-r-00000-bca9a5b6-5e12-45c1-a877-b0f6d6cc8cd3.avro
It somehow seems to still be treating the existing Avro files as input while overwriting the directory.
Can we do this using Spark 1.6 at all?
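A workaround that is often suggested for this kind of failure (not something stated in the original post) is to avoid reading from and overwriting the same directory inside one lazy pipeline: write the result to a temporary HDFS directory first, and only then swap it into place with the Hadoop FileSystem API. A rough Scala sketch for Spark 1.6, where the paths are placeholders and avroData_df and sc come from the existing job:
import org.apache.spark.sql.SaveMode
import org.apache.hadoop.fs.{FileSystem, Path}
// Hypothetical paths; the path in the original .save("") call was redacted.
val srcDir = "/data/avro"
val tmpDir = "/data/avro_tmp"
// 1. Materialize the result somewhere other than the directory being read.
avroData_df.write.mode(SaveMode.Overwrite)
  .format("com.databricks.spark.avro")
  .save(tmpDir)
// 2. Only then replace the original directory.
val fs = FileSystem.get(sc.hadoopConfiguration)
fs.delete(new Path(srcDir), true)
fs.rename(new Path(tmpDir), new Path(srcDir))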
I cannot solve a compatibility problem between an externally created ORC file and Cloudera's Hive.
I have Cloudera Express 6.3.2 with Hive 2.1.1.
In general it is strange: I downloaded the latest version of Cloudera, and it still ships the old Hive 2.1.1.
Case:
Externally I create an ORC file (I tried creating it both in local Spark and in the same Cloudera cluster through a MapReduce job, with the same result).
I try to read this ORC in my Cloudera cluster, even just through orcfiledump,
and I get:
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 6
at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)
I downloaded the orc-tools-1.5.5-uber.jar utility to my local computer
and also downloaded the problematic ORC file there.
I ran java -jar orc-tools-1.5.5-uber.jar meta msout2o12.orc
The uber jar, with its own Hadoop bundled inside, read this ORC just fine:
Structure for msout2o12.orc
File Version: 0.12 with ORC_135
Rows: 242
Compression: ZLIB
Compression size: 262144
Without creating any tables at all, Hive in Cloudera simply cannot read the ORC file using its own utility.
The problem started when I created an external table and HiveQL queries over the ORC produced this error.
Here I have just reduced the problem to a minimum: plain hive --orcfiledump cannot read the ORC.
How can I make Cloudera read ORC files normally?
What do I need to adjust in my Cloudera setup?
This was a big surprise for me.
I am going back to Parquet.
https://community.cloudera.com/t5/Cloudera-Labs/Problem-of-compatibility-of-an-external-orc-and-Claudera-s/m-p/299395/highlight/false#M582
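For context, the meta output above reports File Version: 0.12 with ORC_135, i.e. the file was produced by an ORC 1.5.x writer, a writer version that the ORC reader bundled with Hive 2.1.1 does not know about; that is what the ArrayIndexOutOfBoundsException: 6 in WriterVersion.from points at. If the ORC files are produced from Spark 2.3 or later, one possible workaround (not from the original post) is to write them with Spark's legacy Hive-based ORC writer rather than the native ORC 1.5 writer, for example (paths are placeholders):
import org.apache.spark.sql.SparkSession
// Write ORC through the Hive-based writer so older Hive readers (2.1.x) can open it.
val spark = SparkSession.builder()
  .appName("orc-for-old-hive")
  .config("spark.sql.orc.impl", "hive") // legacy writer instead of the native ORC 1.5 one
  .getOrCreate()
val df = spark.read.parquet("/tmp/source_data")      // placeholder input
df.write.mode("overwrite").orc("/tmp/msout2o12_orc") // placeholder output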
I'm trying to load multiple JSON files (4000 of them) into a table in BigQuery using the following command:
bq load --source_format=NEWLINE_DELIMITED_JSON --replace=true kx-test.store_requests gs://kx-gam-test/store/requests/*
and I am getting the following error:
Error encountered during job execution:
Not found: Files /bigstore/kx-gam-test/store/requests/7fb27d63-5581-43a1-821d-fcf47b3412fd.json.gz
Failure details:
- Not found: Files /bigstore/kx-gam-test/store/requests/93b54246-2284-4b85-8620-76657f4a338b.json.gz
- Not found: Files /bigstore/kx-gam-test/store/requests/fd24a53d-2c49-4f66-bf54-a7ccf14a1cfe.json.gz
- Not found: Files /bigstore/kx-gam-test/store/requests/35a27032-930c-456a-846d-67481a21e52d.json.gz
I am not sure why it is not working. Is it possibly due to the number of files I am trying to load? And what is this bigstore folder prefixed to my GCS bucket path?
I would like to highlight that the folder structure is such that there are some folders inside kx-gam-test/store/requests, and I want to load the gzipped JSON files inside all of these folders.
According to the documentation:
BigQuery does not support source URIs that include multiple consecutive slashes after the initial double slash.
Also, here is some additional info to consider when loading data from Cloud Storage.
A few things you can check:
Make sure that you have the necessary permissions
Make sure that the files do exist in GCS (one way to check this is sketched after this list)
Do you have any process that deletes files after loading? Check the audit logs for any trace that a file might have been deleted while BigQuery was actually reading/loading it.
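For the second check, that the objects really exist under gs://kx-gam-test/store/requests/, a minimal sketch using the google-cloud-storage Java client from Scala is shown below; the client-library approach is only an illustration and is not part of the original answer (gsutil ls would work just as well):
import com.google.cloud.storage.{Storage, StorageOptions}
import scala.collection.JavaConverters._
// List every object under the prefix the bq load command points at,
// to confirm the .json.gz files reported as "Not found" are really there.
val storage: Storage = StorageOptions.getDefaultInstance.getService
val blobs = storage.list("kx-gam-test", Storage.BlobListOption.prefix("store/requests/"))
blobs.iterateAll().asScala.foreach(blob => println(blob.getName))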
I have a Spark job running on an EMR cluster that writes out a DataFrame to HDFS (which is then s3-dist-cp-ed to S3). The data size isn't big (2 GB when saved as Parquet). The data in S3 is then copied to a local filesystem (an EC2 instance running Linux) and then loaded into a Java application.
It turns out I cannot have the data in Parquet format, because Parquet has been designed for HDFS and cannot be used on a local FS (if I am wrong, please point me to a resource on how to read Parquet files on a local FS).
What other format can I use to address this? Would Avro be compact enough and not blow up the size of data by packing the schema with each row of the dataframe?
You can use Parquet on a local filesystem. To see an example in action, download the parquet-mr library from here, build it with the local profile (mvn -P local install should do it, provided that you have thrift and protoc installed), then issue the following to see the contents of your parquet file:
java -jar parquet-tools/target/parquet-tools-1.10.0.jar cat /path/to/your-file.parquet
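As a complementary illustration (not part of the original answer), Spark itself will also read a Parquet file straight off the local filesystem when run in local mode, which may be easier to embed alongside a Java application than parquet-tools. A minimal Scala sketch, with the path as a placeholder:
import org.apache.spark.sql.SparkSession
// Read Parquet from the local filesystem; no HDFS involved.
val spark = SparkSession.builder()
  .appName("read-local-parquet")
  .master("local[*]")
  .getOrCreate()
val df = spark.read.parquet("file:///path/to/your-file.parquet")
df.show(10)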
I tried to load the data into my table 'users' in LOCAL mode, and I am using Cloudera on my VirtualBox. I have placed my file inside the /home/cloudera/Desktop/Hive/ directory, but I am getting an error:
FAILED: SemanticException Line 1:23 Invalid path ''/home/cloudera/Desktop/Hive/hive_input.txt'': No files matching path file:/home/cloudera/Desktop/Hive/hive_input.txt
My syntax to load data into the table:
Load DATA LOCAL INPATH '/home/cloudera/Desktop/Hive/hive_input.txt' INTO Table users
Yes, I removed LOCAL as per @Bhaskar, and the path is my HDFS path where the file exists, not the underlying Linux path.
Load DATA INPATH '/user/cloudera/input_project/' INTO Table users;
You should change the permissions on the folder that contains your file:
chmod -R 755 /home/user/
Another reason could be a file access issue. If you are running the Hive CLI as user01 and accessing a file (your INPATH) from user02's home directory, it will give you the same error.
So the solution could be:
1. Move the file to a location where user01 can access the file.
OR
2. Relaunch the Hive CLI after logging in with user02.
Check whether you are using a Sqoop import in your script and trying to import data into Hive from an empty table.
This may cause the Sqoop import to delete the HDFS location of the Hive table.
To confirm, run hdfs dfs -ls before and after you execute the Sqoop import, and re-create the directory using hdfs dfs -mkdir if it was removed.
My path to the file in HDFS was data/file.csv; note that it is not /data/file.csv.
I specified the LOCATION during table creation as data/file.csv.
Executing
LOAD DATA INPATH '/data/file.csv' INTO TABLE example_table;
failed with the mentioned exception. However, executing
LOAD DATA INPATH 'data/file.csv' INTO TABLE example_table;
worked as desired.
I bootstrap a data file in my EMR job. The bootstrapping succeeds and the file is copied to the /home/hadoop/contents/ folder with the right permissions.
However when I try to access it in the Pig script like below:
userdidstopick = load '/home/hadoop/contents/UserIdsToPick.txt' AS (uid:chararray);
I get an error that the input path does not exist:
hdfs://10.183.166.176:9000/home/hadoop/contents/UserIdsToPick.txt
When running Ruby jobs, the bootstrapped file was always accessible under the /home/hadoop/contents/ folder and everything worked for me.
Is it different for Pig?
By default, Pig on EMR is configured to access HDFS rather than the local filesystem, which is why the error shows an HDFS location.
There are two ways to solve this:
Either copy the file to S3 and load it directly from S3:
userdidstopick = load 's3_bucket_location/UserIdsToPick.txt' AS (uid:chararray);
Or you can first copy the file to HDFS (instead of the local filesystem), and then use the same path you are using today (see the sketch after this answer).
I would prefer the first option.
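For the second option, a minimal sketch of copying the bootstrapped file from the local disk into HDFS with the Hadoop FileSystem API, written in Scala (the HDFS destination path is an assumption; hadoop fs -copyFromLocal from the shell does the same thing):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
// Copy the bootstrapped file from local disk into HDFS so the existing
// Pig LOAD path resolves against HDFS.
val fs = FileSystem.get(new Configuration())
fs.copyFromLocalFile(
  new Path("file:///home/hadoop/contents/UserIdsToPick.txt"),
  new Path("/home/hadoop/contents/UserIdsToPick.txt"))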