I have used Protocol Buffers to serialize the class and store the result in HBase columns.
I want to reduce the number of MapReduce jobs for simple aggregations, so I need an SQL-like tool to query the data.
If I use Hive, is it possible to extend the HBaseStorageHandler and write our own SerDe for each table?
Or is there any other good solution available?
Updated:
I created the HBase table as
create 'hive:users', 'i'
and inserted user data from the Java API:
public static final byte[] INFO_FAMILY = Bytes.toBytes("i");
private static final byte[] USER_COL = Bytes.toBytes(0);

public Put mkPut(User u) {
    Put p = new Put(Bytes.toBytes(u.userid));
    p.addColumn(INFO_FAMILY, USER_COL, UserConverter.fromDomainToProto(u).toByteArray());
    return p;
}
My scan gave these results:
hbase(main):016:0> scan 'hive:users'
ROW COLUMN+CELL
kim123 column=i:\x00, timestamp=1521409843085, value=\x0A\x06kim123\x12\x06kimkim\x1A\x10kim123#gmail.com
1 row(s) in 0.0340 seconds
When I query the table in Hive, I don't see any records.
Here is the command I used to create the table:
create external table users(userid binary, userobj binary)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties("hbase.columns.mapping" = ":key, i:0", "hbase.table.default.storage.type" = "binary")
tblproperties("hbase.table.name" = "hive:users");
When I query the Hive table, I don't see the record inserted from HBase.
Can you please tell me what is wrong here?
You could try writing a UDF which takes the binary protobuf and converts it to some readable structure (comma-separated or JSON). You would have to make sure to map the values as binary data.
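A minimal sketch of such a UDF, assuming a protobuf-generated class called UserProtos.User (a placeholder; use whatever class your .proto actually generates) and the protobuf-java-util JsonFormat helper:

import com.google.protobuf.util.JsonFormat;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;

public class ProtoUserToJson extends UDF {
    // Takes the raw protobuf bytes stored in the HBase cell and returns a JSON string.
    public Text evaluate(BytesWritable raw) throws Exception {
        if (raw == null) {
            return null;
        }
        // UserProtos.User is a placeholder for your generated protobuf class.
        UserProtos.User user = UserProtos.User.parseFrom(raw.copyBytes());
        return new Text(JsonFormat.printer().print(user));
    }
}

After an ADD JAR and CREATE TEMPORARY FUNCTION, you could then select proto_user_to_json(userobj) alongside userid in your Hive query.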
Related
Is there any way to write unstructured data to a BigQuery table using the Apache Beam Dataflow BigQuery IO API (i.e. without providing a schema upfront)?
BigQuery needs to know the schema when it creates the table, or when you write to it. Depending on your situation, you may be able to determine the schema dynamically in the pipeline construction code rather than hard-coding it.
Alternatively, create a table with just a single STRING column to store the data from Dataflow:
CREATE TABLE IF NOT EXISTS `your_project.dataset.rawdata` (
raw STRING
);
You can store any data as a string without knowing its schema. For example, you can store a JSON document as a single string, a CSV line as a single string, and so on.
Specify the table as the destination of your Dataflow job. You may need to provide Dataflow with a JavaScript UDF which converts a message from the source into a single string compatible with the schema of the table above.
/**
 * User-defined function (UDF) to transform events
 * as part of a Dataflow template job.
 *
 * @param {string} inJson input Pub/Sub JSON message (stringified)
 * @return {string} outJson output JSON message (stringified)
 */
function process(inJson) {
  var obj = JSON.parse(inJson),
      includePubsubMessage = obj.data && obj.attributes,
      data = includePubsubMessage ? obj.data : obj;

  // INSERT CUSTOM TRANSFORMATION LOGIC HERE

  return JSON.stringify(obj);
}
https://cloud.google.com/blog/topics/developers-practitioners/extend-your-dataflow-template-with-udfs
You can see that the sample UDF above returns a JSON string.
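If you are building the pipeline yourself with the Beam SDK instead of using a Dataflow template, a minimal sketch of writing each message into the single raw column could look like the following (the Pub/Sub topic and the project/dataset/table names are just placeholders matching the DDL above):

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptor;

public class RawToBigQuery {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        p.apply("ReadMessages", PubsubIO.readStrings()
                .fromTopic("projects/your_project/topics/your_topic"))
         // Wrap each message, whatever its structure, into the single STRING column "raw".
         .apply("WrapAsRawRow", MapElements.into(TypeDescriptor.of(TableRow.class))
                .via(msg -> new TableRow().set("raw", msg)))
         .apply("WriteToBigQuery", BigQueryIO.writeTableRows()
                .to("your_project:dataset.rawdata")
                .withCreateDisposition(CreateDisposition.CREATE_NEVER) // the table already exists
                .withWriteDisposition(WriteDisposition.WRITE_APPEND));

        p.run();
    }
}

Because every message is stored verbatim in one column, no schema has to be known at pipeline construction time.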
You can later interpret the data with a schema (a.k.a. the schema-on-read strategy), like the following:
SELECT JSON_VALUE(raw, '$.json_path_you_have') AS column1,
JSON_QUERY_ARRAY(raw, '$.json_path_you_have') AS column2,
...
FROM `your_project.dataset.rawdata`
Depending on your source data, you can use JSON functions or regular expressions to organize your data into a table with the schema you want.
I have used the "Use the BigQuery connector with Spark" example to extract data from a table in BigQuery by running the code on Google Dataproc. As far as I'm aware, the code shared there:
conf = {
    # Input Parameters.
    'mapred.bq.project.id': project,
    'mapred.bq.gcs.bucket': bucket,
    'mapred.bq.temp.gcs.path': input_directory,
    'mapred.bq.input.project.id': 'publicdata',
    'mapred.bq.input.dataset.id': 'samples',
    'mapred.bq.input.table.id': 'shakespeare',
}

# Output Parameters.
output_dataset = 'wordcount_dataset'
output_table = 'wordcount_output'

# Load data in from BigQuery.
table_data = sc.newAPIHadoopRDD(
    'com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'com.google.gson.JsonObject',
    conf=conf)
copies the entirety of the named table into input_directory. The table I need to extract data from contains >500m rows and I don't need all of those rows. Is there a way to instead issue a query (as opposed to specifying a table) so that I can copy a subset of the data from a table?
It doesn't look like BigQuery supports any kind of filtering/querying for table exports at the moment:
https://cloud.google.com/bigquery/docs/exporting-data
https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.extract
I'm currently building a tool that pulls data directly from a database (because SPSS Modeler is too slow) and stores it in a Java ResultSet first.
I then try to export the data into a CSV (or similar) file while keeping as many of the column types as possible.
Currently I'm using opencsv, but it casts Decimals and many other types to a String. When I load the file back into SPSS Modeler, I get only Integers and Strings.
Are there any CSV libraries (maybe with a special encoding) or other file types I can use to export the data with its column types (like IBM InfoSphere Data Architect can do), so I can load it directly back into SPSS Modeler without changing it back manually there?
Thank you!
Retrieving the Metadata from the DB Information Schema
If the data is currently stored in a database, you can retrieve the column types from the information schema. All you need to do is retrieve this information after you have queried the table and store it so that you can reuse it later.
// Connect to the DB as usual.
Statement stmt = conn.createStatement();

// Create your query.
// Note that you can use a dummy query here: you only need to access the
// metadata of the result, regardless of the actual query.
ResultSet rse = stmt.executeQuery("SELECT A, B FROM table WHERE ...");

// Get the ResultSetMetaData.
ResultSetMetaData rsmd = rse.getMetaData();

// Get the database-specific type.
rsmd.getColumnTypeName(1); // database-specific type name for column 1 (e.g. VARCHAR)
rsmd.getColumnTypeName(2); // database-specific type name for column 2 (e.g. DATETIME)
...

// Get the generic JDBC type, see http://docs.oracle.com/javase/7/docs/api/java/sql/Types.html
rsmd.getColumnType(1); // generic type for column 1 (e.g. 12)
rsmd.getColumnType(2); // generic type for column 2
Processing
You could store this information as a small schema alongside the CSV and apply it during the transformation process.
I recommend using Super CSV.
This library provides so called cell processors, which allow you to define the type of the columns.
Description:
Cell processors are an integral part of reading and writing with Super CSV - they automate the data type conversions, and enforce constraints. They implement the chain of responsibility design pattern - each processor has a single, well-defined purpose and can be chained together with other processors to fully automate all of the required conversions and constraint validation for a single CSV column.
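A minimal sketch with Super CSV's CsvListWriter (the column names and processors here are assumptions for illustration; in your tool you would pick the processors based on the ResultSetMetaData types retrieved above):

import java.io.FileWriter;
import java.math.BigDecimal;
import java.util.Arrays;
import java.util.Date;

import org.supercsv.cellprocessor.FmtDate;
import org.supercsv.cellprocessor.Optional;
import org.supercsv.cellprocessor.constraint.NotNull;
import org.supercsv.cellprocessor.ift.CellProcessor;
import org.supercsv.io.CsvListWriter;
import org.supercsv.io.ICsvListWriter;
import org.supercsv.prefs.CsvPreference;

public class TypedCsvExport {
    public static void main(String[] args) throws Exception {
        // One processor per column: an id, a decimal amount and a timestamp.
        CellProcessor[] processors = new CellProcessor[] {
            new NotNull(),                      // id (must be present, written as-is)
            new Optional(),                     // amount (may be null)
            new FmtDate("yyyy-MM-dd HH:mm:ss")  // created (formatted consistently)
        };

        try (ICsvListWriter writer = new CsvListWriter(
                new FileWriter("export.csv"), CsvPreference.STANDARD_PREFERENCE)) {
            writer.writeHeader("id", "amount", "created");
            // In your tool these values would come from the ResultSet, row by row.
            writer.write(Arrays.asList(1, new BigDecimal("9.99"), new Date()), processors);
        }
    }
}

When you read the file back, the matching parsing processors (ParseInt, ParseBigDecimal, ParseDate) restore the column types instead of leaving everything as strings.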
My CSV file contains a data structure like:
99999,{k1:v1,k2:v2,k3:v3},9,1.5,http://www.asd.com
What is the CREATE TABLE query for this structure?
I don't want to have to do any processing on the CSV file before it is loaded into the table.
You need to use the OpenCSV SerDe to read/write CSV data to/from a Hive table. Download it here: https://drone.io/github.com/ogrodnek/csv-serde/files/target/csv-serde-1.1.2-0.11.0-all.jar
Add the SerDe to the library path of Hive. This step can be skipped, but do upload the jar to the HDFS cluster your Hive server is running on; we will use it later when querying.
Create Table
CREATE TABLE my_table(a int, b string, c int, d double, url string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = "'",
"escapeChar" = "\\"
)
STORED AS TEXTFILE;
Notice that if you use the OpenCSV SerDe, every column is treated as a String by Hive, no matter what type you declare. But no worries, as Hive is loosely typed: it will cast the string to int, JSON, etc. at runtime.
Query
To query at the Hive prompt, first add the jar if it has not already been added to Hive's library path:
add jar hdfs:///user/hive/aux_jars/opencsv.jar;
Now you could query as:
select a, get_json_object(b, '$.k1') from my_table where get_json_object(b, '$.k2') > val;
The above is an example of accessing a JSON field in a Hive table.
References:
http://documentation.altiscale.com/using-csv-serde-with-hive
http://thornydev.blogspot.in/2013/07/querying-json-records-via-hive.html
PS: json_tuple is the faster way to access JSON elements, but I find the syntax of get_json_object more appealing.
I have a file test_file_1.txt containing:
20140101,value1
20140102,value2
and file test_file_2.txt containing:
20140103,value3
20140104,value4
In HCatalog there is a table:
create table stage.partition_pk (value string)
Partitioned by(date string)
stored as orc;
These two scripts work nicely:
Script 1:
LoadFile = LOAD 'test_file_1.txt' using PigStorage(',') AS (date : chararray, wartosc : chararray);
store LoadFile into 'stage.partition_pk' using org.apache.hcatalog.pig.HCatStorer();
Script 2:
LoadFile = LOAD 'test_file_2.txt' using PigStorage(',')
AS (date : chararray, wartosc : chararray);
store LoadFile into 'stage.partition_pk' using org.apache.hcatalog.pig.HCatStorer();
Table partition_pk contains four partitions - everything is as expected.
But let's say there is another file containing data that should be inserted into one of the existing partitions.
Pig seems unable to write into a partition that already contains data (or have I missed something?).
How do you manage loading into existing partitions (or into non-empty non-partitioned tables)?
Do you read the partition, union it with the new data, delete the partition (how?), and insert it as a new partition?
According to HCatalog's documentation, https://cwiki.apache.org/confluence/display/Hive/HCatalog+UsingHCat, "Once a partition is created records cannot be added to it, removed from it, or updated in it." So, by the nature of HCatalog, you can't add data to an existing partition that already has data in it.
There are issues around this that are being worked on; some of them were fixed in Hive 0.13:
https://issues.apache.org/jira/browse/HIVE-6405 (Still unresolved) - The bug used to track the other bugs
https://issues.apache.org/jira/browse/HIVE-6406 (Resolved in 0.13) - separate table property for mutable
https://issues.apache.org/jira/browse/HIVE-6476 (Still unresolved) - Specific to dynamic partitioning
https://issues.apache.org/jira/browse/HIVE-6475 (Resolved in 0.13) - Specific to static partitioning
https://issues.apache.org/jira/browse/HIVE-6465 (Still unresolved) - Adds DDL support to HCatalog
Basically, it looks like if you don't need dynamic partitioning, then Hive 0.13 might work for you. You just need to remember to set the appropriate table property (the one introduced by HIVE-6406).
What I've found works for me is to create another partition key that I call build_num. I then pass the value of this parameter via the command line and set it in the store statement, like so:
create table stage.partition_pk (value string)
Partitioned by(date string,build_num string)
stored as orc;
STORE LoadFile into 'partition_pk' using org.apache.hcatalog.pig.HCatStorer('build_num=${build_num}');
Just don't include the build_num partition in your queries. I generally set build_num to the timestamp at which I ran the job.
Try using multiple partitions:
create table stage.partition_pk (value string) Partitioned by(date string, counter string) stored as orc;
Storing looks like this:
LoadFile = LOAD 'test_file_2.txt' using PigStorage(',') AS (date : chararray, wartosc : chararray);
store LoadFile into 'stage.partition_pk' using org.apache.hcatalog.pig.HCatStorer('date=20161120, counter=0');
So now you can store data into the same date partition again by increasing the counter.