Solr & HBase integration using the NGDATA hbase-indexer - indexing

Data is not reflected in the Solr UI after indexing HBase table data with the hbase-indexer. I followed the steps described in the hbase-indexer documentation:
1. Created an HBase table.
2. Copied the hbase-sep JAR files to HBase's lib directory.
3. Created an indexer XML file with the index configuration.
4. Created an indexer from that XML file.
After all of the above steps I tried searching from the Solr UI, but I don't see any data there. Has anyone worked on this?

Steps to verify:
1. Does the HBase column family have REPLICATION_SCOPE = 1?
2. Are you using Put to load the data? The indexer picks up edits from the WALs (write-ahead logs); a Put goes through the WAL, whereas a bulk load does not (see the sketch after this list).
3. Verify the indexer mapping of HBase column qualifiers to Solr fields.
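If points 1 and 2 are the problem, here is a minimal sketch using the HBase 2.x Java client that sets REPLICATION_SCOPE = 1 on the indexed column family and writes a row with Put so the edit goes through the WAL. The table name, column family, qualifier, and row key are just examples, not taken from your setup.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class IndexerPrereqCheck {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {

            TableName table = TableName.valueOf("indexdemo-user");     // example table name

            // 1. The indexed column family needs REPLICATION_SCOPE = 1,
            //    otherwise its WAL edits are never shipped to the hbase-indexer (SEP).
            ColumnFamilyDescriptor cf = ColumnFamilyDescriptorBuilder
                    .newBuilder(Bytes.toBytes("info"))                 // example column family
                    .setScope(1)                                       // REPLICATION_SCOPE = 1
                    .build();
            admin.modifyColumnFamily(table, cf);

            // 2. Load data with Put so the edits go through the WAL;
            //    bulk-loaded HFiles bypass the WAL and are invisible to the indexer.
            try (Table t = conn.getTable(table)) {
                Put put = new Put(Bytes.toBytes("row1"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("firstname"), Bytes.toBytes("John"));
                t.put(put);
            }
        }
    }
}

For point 3, also double-check that the field names in your indexer XML match the fields defined in the Solr collection's schema.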

Related

Committing hudi files manually

I am using Spark 3.x with Apache Hudi 0.8.0.
While trying to create a Presto table using the hudi-hive-sync tool, I get the error below:
Got runtime exception when hive syncing
java.lang.IllegalArgumentException: Could not find any data file written for commit [20220116033425__commit__COMPLETED], could not get schema for table
But I checked the data for all partition keys using a Zeppelin notebook, and I can see that all the data is present.
I understand that I need to commit the file manually. How do I do that?

Non-HBase solution for storing huge data and updating in real time

Hi, I have developed an application where I have to store about 5 TB of data initially, and then apply roughly 20 GB of monthly incremental inserts/updates/deletes, delivered as XML, on top of that 5 TB of data.
Finally, on request, I have to generate a full snapshot of all the data and create 5K text files based on the business logic, so that the respective data ends up in the respective files.
I have built this project using HBase.
I have created 35 tables in HBase, each with between 10 and 500 regions.
My data is in HDFS, and using MapReduce I bulk load it into the respective HBase tables.
After that, I have a SAX parser application written in Java that parses all incoming incremental XML files and updates the HBase tables. The XML files arrive at roughly 10 files per minute, with a total of about 2000 updates.
The incremental messages are strictly ordered.
Finally, on request, I run my last MapReduce application to scan all the HBase tables, create the 5K text files, and deliver them to the client.
All three steps work fine, but when I went to deploy the application on the production server, which is a shared cluster, the infrastructure team would not allow it because I do a full table scan on HBase.
I am on a 94-node cluster, and the biggest HBase table has approximately 2 billion records. All the other tables have fewer than a million records each.
The MapReduce job takes about 2 hours in total to scan the tables and create the text files.
Now I am looking for some other solution to implement this.
I can't use Hive on its own because I have record-level inserts/updates/deletes, and they have to be applied very precisely.
I have also integrated HBase and Hive tables, so that the HBase table is used for the incremental data and Hive is used for the full table scan.
But since Hive goes through the HBase storage handler, I can't create partitions on the Hive table, and that is why the Hive full table scan becomes very slow, even 10 times slower than the HBase full table scan.
I can't think of any solution right now and am kind of stuck.
Please help me with some other solution where HBase is not involved.
Can I use Avro or Parquet files in this use case? But I am not sure how Avro would support record-level updates.
I will answer my own question.
My issue is that I don't want to perform a full table scan on HBase, because it impacts region server performance, and especially on a shared cluster it hits HBase read/write performance.
So my solution keeps HBase, because it is very good for updates, especially delta updates, that is, column-level updates.
So, to avoid the full table scan, I take a snapshot of the HBase table, export it to HDFS, and then run the full table scan against the snapshot instead of the live table.
Here are the detailed steps for the process.
Create the snapshot:
snapshot 'FundamentalAnalytic','FundamentalAnalyticSnapshot'
Export the snapshot to HDFS:
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot FundamentalAnalyticSnapshot -copy-to /tmp -mappers 16
Driver job configuration to run MapReduce on the HBase snapshot:
// Driver-side job setup (fragment): run the MapReduce job over the exported snapshot instead of the live table.
Configuration conf = HBaseConfiguration.create();
conf.set("hbase.rootdir", "hdfs://quickstart.cloudera:8020/hbase");
Job job = Job.getInstance(conf, "FundamentalAnalyticSnapshotScan");

String snapshotName = "FundamentalAnalyticSnapshot";
Path restoreDir = new Path("hdfs://quickstart.cloudera:8020/tmp"); // temp dir where the snapshot is restored for the job
Scan scan = new Scan();          // Scan instance to control CF and attribute selection

TableMapReduceUtil.initTableSnapshotMapperJob(
        snapshotName,            // snapshot name
        scan,
        DefaultMapper.class,     // mapper class
        NullWritable.class,      // mapper output key
        Text.class,              // mapper output value
        job,
        true,                    // add HBase dependency jars to the job
        restoreDir);             // restore directory
Also, running MapReduce on the HBase snapshot skips scanning the live HBase table, so there is no impact on the region servers.
The key to using HBase efficiently is DESIGN. With a good design you should never have to do a full scan. That is not what HBase was made for. Instead, you could be doing a scan with a Filter, something HBase was built to handle efficiently.
I cannot check your design now, but I think you may have to revisit it.
The idea is not to design an HBase table the way you would an RDBMS table; the key is designing a good rowkey. If your rowkey is well designed, you should never need a full scan.
You may also want to look at a project like Apache Phoenix if you want to access your table using columns other than the rowkey. It also performs well; I have had a good experience with Phoenix.
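To illustrate the point about scans with a Filter, here is a minimal sketch using the HBase 2.x Java client. The table name reuses the one from the question, but the rowkey prefix, column family, and qualifier are made-up examples, and the whole approach assumes the rowkey was designed for this access pattern.

import org.apache.hadoop.hbase.CompareOperator;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class PrefixScanExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("FundamentalAnalytic"))) {

            Scan scan = new Scan();
            // Restrict the scan to rows whose key starts with a known prefix
            // (only useful if the rowkey encodes this access pattern).
            scan.setRowPrefixFilter(Bytes.toBytes("AAPL|2017"));
            // Optional server-side filter on a column value (family/qualifier are hypothetical).
            scan.setFilter(new SingleColumnValueFilter(
                    Bytes.toBytes("cf"), Bytes.toBytes("status"),
                    CompareOperator.EQUAL, Bytes.toBytes("ACTIVE")));

            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result row : scanner) {
                    System.out.println(Bytes.toString(row.getRow()));
                }
            }
        }
    }
}

The difference from a full scan is that the prefix bounds the range of rows the region servers have to read, and the filter is evaluated server side, so only matching rows come back to the client.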

Loading or pointing to multiple parquet paths for data analysis with hive or prestodb

I have a couple of Spark jobs that produce Parquet files in AWS S3. Every once in a while I need to run some ad-hoc queries on a given date range of this data. I don't want to do this in Spark because I want our QA team, which has no knowledge of Spark, to be able to do it. What I would like to do is spin up an AWS EMR cluster, load the Parquet files into HDFS, and run my queries against it. I have figured out how to create tables with Hive and point them to one S3 path, but that limits my data to only one day, because each day of data has multiple files under a path like
s3://mybucket/table/date/(parquet files 1 ... n).
So problem one is to figure out how to load multiple days of data into Hive, i.e.
s3://mybucket/table_a/day_1/(parquet files 1 ... n).
s3://mybucket/table_a/day_2/(parquet files 1 ... n).
s3://mybucket/table_a/day_3/(parquet files 1 ... n).
...
s3://mybucket/table_b/day_1/(parquet files 1 ... n).
s3://mybucket/table_b/day_2/(parquet files 1 ... n).
s3://mybucket/table_b/day_3/(parquet files 1 ... n).
I know Hive supports partitions, but my S3 files are not laid out that way.
I have also looked into Presto, which seems to be the favorite tool for this type of data analysis. The fact that it supports ANSI SQL makes it a great tool for people who have SQL knowledge but know very little about Hadoop or Spark. I did install it on my cluster and it works great. But it looks like you can't really load data into your tables and you have to rely on Hive to do that part. Is this the right way to use Presto? I watched a Netflix presentation about their use of Presto with S3 in place of HDFS. If this works, that's great, but I wonder how the data is moved into memory. At what point are the Parquet files moved from S3 to the cluster? Do I need a cluster that can load the entire dataset into memory? How is this generally set up?
You can install Hive and create Hive tables over your data in S3, as described in this blog post: https://blog.mustardgrain.com/2010/09/30/using-hive-with-existing-files-on-s3/
Then install Presto on AWS and configure it to connect to the Hive catalog you set up previously. You can then query your data on S3 with Presto using SQL.
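For the "multiple days per table" part, one common approach (a sketch, not tested against your layout) is to declare an external partitioned table and map each existing day prefix to a partition with ALTER TABLE ... ADD PARTITION, since MSCK REPAIR only discovers partitions laid out as key=value directories. The example below submits the DDL through the Hive JDBC driver; the HiveServer2 URL, the column list, and the partition column name dt are placeholders, and the Hive JDBC driver is assumed to be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveS3Partitions {
    public static void main(String[] args) throws Exception {
        // HiveServer2 endpoint on the EMR master node (placeholder host/port/credentials).
        try (Connection conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {

            // External table over the existing Parquet files; columns are placeholders.
            stmt.execute(
                "CREATE EXTERNAL TABLE IF NOT EXISTS table_a (id BIGINT, payload STRING) " +
                "PARTITIONED BY (dt STRING) STORED AS PARQUET " +
                "LOCATION 's3://mybucket/table_a/'");

            // Register each day's prefix as a partition; nothing is copied or moved.
            for (String day : new String[]{"day_1", "day_2", "day_3"}) {
                stmt.execute(
                    "ALTER TABLE table_a ADD IF NOT EXISTS PARTITION (dt='" + day + "') " +
                    "LOCATION 's3://mybucket/table_a/" + day + "/'");
            }
        }
    }
}

Presto can then query table_a through its Hive catalog, and a predicate on dt prunes the query down to the requested days.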
Rather than trying to load multiple files, you could instead use the API to concatenate the days you want into a single object, which you can then load through the means you already mentioned.
AWS has a blog post highlighting how to do this exact thing purely through the API (without downloading + re-uploading the data):
https://ruby.awsblog.com/post/Tx2JE2CXGQGQ6A4/Efficient-Amazon-S3-Object-Concatenation-Using-the-AWS-SDK-for-Ruby
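If the Ruby post is not convenient, the same server-side concatenation can be sketched in Java with the AWS SDK for Java v1 by driving a multipart upload whose parts are copied from the existing objects. Bucket and key names below are placeholders, and each source object except the last must be at least 5 MB for copy-part to be accepted.

import java.util.ArrayList;
import java.util.List;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.*;

public class S3Concat {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        String bucket = "mybucket";
        String destKey = "table_a/combined/days_1_to_3";                // placeholder destination object
        String[] sourceKeys = {"table_a/day_1/part-00000.parquet",      // placeholder source objects
                               "table_a/day_2/part-00000.parquet",
                               "table_a/day_3/part-00000.parquet"};

        // Start a multipart upload for the destination object.
        InitiateMultipartUploadResult init =
                s3.initiateMultipartUpload(new InitiateMultipartUploadRequest(bucket, destKey));

        // Each part is copied server side from an existing object (no download/re-upload).
        List<PartETag> etags = new ArrayList<>();
        for (int i = 0; i < sourceKeys.length; i++) {
            CopyPartResult part = s3.copyPart(new CopyPartRequest()
                    .withSourceBucketName(bucket).withSourceKey(sourceKeys[i])
                    .withDestinationBucketName(bucket).withDestinationKey(destKey)
                    .withUploadId(init.getUploadId())
                    .withPartNumber(i + 1));
            etags.add(part.getPartETag());
        }

        // Stitch the copied parts together into the final object.
        s3.completeMultipartUpload(
                new CompleteMultipartUploadRequest(bucket, destKey, init.getUploadId(), etags));
    }
}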

How to rebuild hive metadata

I am using Hive and HBase to do some analysis on data.
I accidentally removed the following path in HDFS:
(hdfs) /user/hive/mydata.db
Although all the tables still exist in Hive, when I retrieve data from Hive through the Thrift server to plot it, it shows nothing. How can I rebuild my data? Any guidance would be appreciated.

Pig: Load a table, then overwrite that table after transformation

Let's say I have a table:
db.table
I load the table, do some transforms on it, and finally attempt to store it:
mytable = LOAD 'db.table' USING HCatLoader();
.
.
-- My transforms
.
.
STORE mytable_final INTO 'db.table' USING HCatStorer();
But the code complains I'm writing into a table with existing data.
I've looked at this JIRA ticket, which seems to be inactive (I have tried using FORCE and OVERWRITE in several places in the STORE command).
I've also looked at this SO post, but the author is loading from one location and storing to a different location. If I use the approach in that post, the transformation produces no data. Deleting the files isn't an option. I'm thinking of storing the files temporarily, but I don't know if this is the best option.
I am trying to get the behavior you get in Hive using INSERT OVERWRITE.
I am not familiar with HCatLoader and HCatStorer, but if you LOAD from and STORE to HDFS paths, Pig provides shell commands that let you do the deleting and moving from within your script:
-- Write the transformed data to a temporary location first.
STORE A INTO '/this/path/is/temporary';
-- Remove the old data (rmf does not fail if the path does not exist).
RMF '/this/path/is/permanent';
-- Move the new output into the permanent location.
MV '/this/path/is/temporary' '/this/path/is/permanent';