I have a workflow in which an Oozie map-reduce action is supposed to read data from a Hive table and hand it to the appropriate mapper. I have not been able to find the corresponding settings/properties for the workflow.
The data of the Hive table is stored in a folder, so why don't you read it from there?
CREATE TABLE t01 (line STRING) STORED AS TEXTFILE LOCATION "pathToStoredFile";
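If the table already exists, its storage folder can be looked up in Hive and used directly as the map-reduce action's input directory (a sketch, using the t01 table from above):
-- the "Location:" field in the output is the HDFS folder holding the table's files
DESCRIBE FORMATTED t01;
That path can then be supplied to the map-reduce action, e.g. as the value of the mapred.input.dir property in the action's configuration block.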
I have a lot of legacy Pig scripts that run on an on-prem cluster. We are trying to move to AWS Data Pipeline (PigActivity) and want these Pig scripts to be able to read data from the S3 buckets where my source data would reside. The on-prem Pig scripts use the HCatalog loader to read the Hive table schemas. So, if I create Athena tables on those S3 buckets, is there a way to read the schema from those Athena tables inside the Pig scripts, using some sort of loader similar to HCatLoader?
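For reference, the HCatalog-based load on the on-prem cluster looks roughly like this (a sketch; the table name is illustrative, the script is run with pig -useHCatalog, and on older HCatalog versions the package is org.apache.hcatalog.pig instead):
inp_data = LOAD 'database_name.abc' USING org.apache.hive.hcatalog.pig.HCatLoader();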
Current: the code below works, but I have to define the schema inside the Pig script:
%default SOURCE_LOC 's3://s3bucket/input/abc'
inp_data = LOAD '$SOURCE_LOC' USING PigStorage('\001') AS
(id: bigint, val_id: int, provision: chararray);
Want:
Read from an Athena table instead
Athena table: database_name.abc (schema as id:bigint, val_id:int, provision:string)
So I am looking for something like the below, so that I do not have to define the schema inside the Pig script:
%default SOURCE_LOC 'database_name.abc'
inp_data = LOAD '$SOURCE_LOC' USING athenaloader();
Is there a loader utility to read Athena, or is there an alternative solution to my need? Please help.
I want to create a table for logging in Hive. This table should contain basic details of a Sqoop job that runs every day: the name of the job / the table to which the data was loaded by the Sqoop job, the number of records ingested, whether the ingestion was successful or failed, and the time of ingestion.
A .log file is created after every Sqoop job run, but this log file is not structured in a way that can be loaded directly into the Hive table using the LOAD DATA INPATH command. I would really appreciate it if someone could point me in the right direction: should a shell script be written to achieve this, or something else?
Thank you in advance
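A minimal sketch of what such a logging table could look like (column names and types are illustrative):
CREATE TABLE sqoop_job_audit (
  job_name         STRING,     -- name of the Sqoop job
  target_table     STRING,     -- table the data was loaded into
  records_ingested BIGINT,
  status           STRING,     -- e.g. 'SUCCESS' or 'FAILED'
  ingestion_time   TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
A wrapper shell script around the Sqoop command could then parse each run's log (for example the record-count line and the exit status), write one tab-separated row, and load it with LOAD DATA INPATH or an INSERT.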
I have created an external table in Hive that uses data from a Parquet store in HDFS.
When the data in HDFS is deleted, there is no data in the table. When the data is inserted again in the same spot in HDFS, the table does not get updated to contain the new data. If I insert new records into the existing table that contains data, no new data is shown when I run my Hive queries.
How I create the table in Hive:
CREATE EXTERNAL TABLE nodes (id string) STORED AS PARQUET LOCATION "/hdfs/nodes";
The relevant error:
Error: java.io.FileNotFoundException: File does not exist: /hdfs/nodes/part-r-00038-2149d17d-f890-48bc-a9dd-5ea07b0ec590.gz.parquet
I have seen several posts that explain that external tables should have the most up to date data in them, such as here. However, this is not the case for me, and I don't know what is happening.
I inserted the same data into the database again, and queried the table. It contained the same amount of data as before. I then created an identical table with a different name. It had twice as much data in it, which was the right amount.
The issue might be with the metastore database. I am using PostgreSQL instead of Derby for the database.
Relevant information:
Hive 0.13.0
Spark Streaming 1.4.1
PostgreSQL 9.3
CentOS 7
EDIT:
After examining the Parquet files, I found that the part files have seemingly incompatible file names.
-rw-r--r-- 3 hdfs hdfs 18702811 2015-08-27 08:22 /hdfs/nodes/part-r-00000-1670f7a9-9d7c-4206-84b5-e812d1d8fd9a.gz.parquet
-rw-r--r-- 3 hdfs hdfs 18703029 2015-08-26 15:43 /hdfs/nodes/part-r-00000-7251c663-f76e-4903-8c5d-e0c6f61e0192.gz.parquet
-rw-r--r-- 3 hdfs hdfs 18724320 2015-08-27 08:22 /hdfs/nodes/part-r-00001-1670f7a9-9d7c-4206-84b5-e812d1d8fd9a.gz.parquet
-rw-r--r-- 3 hdfs hdfs 18723575 2015-08-26 15:43 /hdfs/nodes/part-r-00001-7251c663-f76e-4903-8c5d-e0c6f61e0192.gz.parquet
These are the files that cause Hive to throw the FileNotFoundException above when it cannot find them. This suggests that the external table is not acting dynamically, accepting any files in the directory, but instead is probably just keeping track of the list of Parquet files that were inside the directory when it was created.
Sample Spark code:
nodes.foreachRDD(rdd => {
  if (!rdd.isEmpty())
    sqlContext.createDataFrame(rdd.map(n => Row(n.stuff)), ParquetStore.nodeSchema)
      .write.mode(SaveMode.Append).parquet(node_name)
})
where ParquetStore.nodeSchema is the schema and node_name is "/hdfs/nodes".
See my other question about getting Hive external tables to detect new files.
In order to get Hive to update its tables, I had to resort to using the partitioning feature of Hive. By creating a new partition during each Spark run, I create a series of directories internal to the /hdfs/nodes directory like this:
/hdfs/nodes/timestamp=<a-timestamp>/<parquet-files>
/hdfs/nodes/timestamp=<a-different-timestamp>/<parquet-files>
Then, after each Spark job completes, I run the Hive command MSCK REPAIR TABLE nodes using a HiveContext in my Spark job, which finds new partitions and updates the table.
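A rough sketch of that write-plus-repair step (assuming the external table was created with PARTITIONED BY (timestamp string), hiveContext is a HiveContext, and df is the DataFrame built from the RDD as above):
// Write this batch into a new timestamp=<...> partition directory,
// then ask Hive to discover the new partition.
val ts = System.currentTimeMillis.toString
df.write.mode(SaveMode.Append).parquet(s"/hdfs/nodes/timestamp=$ts")
hiveContext.sql("MSCK REPAIR TABLE nodes")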
I realize this isn't automatic, but it at least works.
OK, so you probably need to encapsulate the file in a folder. A Hive external table must be mapped to a folder, which may contain more than one file.
Try writing the file to: /path/to/hdfs/nodes/file
and then mapping the external table to /path/to/hdfs/nodes,
so that the nodes folder contains only the Parquet file(s), and it should work.
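In other words, something like this (paths and file names are placeholders):
hdfs dfs -mkdir -p /path/to/hdfs/nodes
hdfs dfs -put part-00000.gz.parquet /path/to/hdfs/nodes/
with the table's LOCATION kept pointed at /path/to/hdfs/nodes, as in the CREATE EXTERNAL TABLE statement above.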
I've been trying to store CSV data into a table in a database using a Pig script.
But instead of inserting the data into a table in the database, I ended up creating a new file in the metastore.
Can someone please let me know if it is possible to insert data into a table in a database with a pig script, and if so what that script might look like?
You can take a look at DBStorage, but be sure to register the JDBC jar in your Pig script and declare the UDF.
The documentation for the storage UDF is here:
http://pig.apache.org/docs/r0.12.0/api/org/apache/pig/piggybank/storage/DBStorage.html
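A rough sketch of what that could look like (driver, connection details, and table/column names are placeholders):
-- register the JDBC driver and piggybank jars so DBStorage and the driver are on the classpath
REGISTER /path/to/mysql-connector-java.jar;
REGISTER /path/to/piggybank.jar;

data = LOAD 'input.csv' USING PigStorage(',') AS (id:int, name:chararray);

-- DBStorage writes each tuple through the prepared INSERT statement; the STORE location string is ignored
STORE data INTO 'ignored' USING org.apache.pig.piggybank.storage.DBStorage(
    'com.mysql.jdbc.Driver',
    'jdbc:mysql://dbhost:3306/mydb',
    'dbuser', 'dbpassword',
    'INSERT INTO mytable (id, name) VALUES (?, ?)');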
You can use:
STORE data INTO 'tablename' USING org.apache.hcatalog.pig.HCatStorer();
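A fuller sketch of that approach (table and column names are illustrative; the target Hive table must already exist and the script is run with pig -useHCatalog):
data = LOAD 'input.csv' USING PigStorage(',') AS (id:int, name:chararray);
STORE data INTO 'mydb.mytable' USING org.apache.hcatalog.pig.HCatStorer();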
I am using an Interactive Hive Session in Elastic MapReduce to run Hive. Previously I was loading data from S3 into Hive tables. Now I want to run some scripts on S3 input files without loading the data into Hive tables.
Is this possible? If yes, how can this be achieved?
You can run queries on data right in S3.
CREATE EXTERNAL TABLE mydata (key STRING, value INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n' LOCATION 's3n://mys3bucket/';
or similar
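Once the external table is defined over the bucket, queries run directly against the files in S3 with no load step, for example:
SELECT key, SUM(value) FROM mydata GROUP BY key;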