Skip Line By Prefix - azure-data-lake

I've been trying to use Azure Data Lake Analytics to do some analysis over a large group of IIS log files. So far I can get this to work for a single, best-case file using something like this:
@results =
    EXTRACT
        s_date DateTime,
        s_time string,
        s_ip string,
        cs_method string,
        cs_uristem string,
        cs_uriquery string,
        s_port int,
        cs_username string,
        c_ip string,
        cs_useragent string,
        sc_status int,
        sc_substatus int,
        sc_win32status int,
        s_timetaken int
    FROM @"/input/u_ex151115.log"
    USING Extractors.Text(delimiter:' ', skipFirstNRows: 4);

@statuscount =
    SELECT COUNT(*) AS TheCount,
           sc_status
    FROM @results
    GROUP BY sc_status;

OUTPUT @statuscount
TO @"/output/statuscount_results.tsv"
USING Outputters.Tsv();
As you can see, in the EXTRACT statement, I'm skipping over the IIS log file header using the skipFirstNRows attribute. The problem I'm running into is that many of the log files I have as input contain headers in the middle of the file, presumably because the IIS app pool restarted at some point during the day. When I try to include these files in my query, I get the following error:
Unexpected number of columns in input record at line 14. Expected 14 columns, processed 6 columns out of 6.
The error references a location somewhere in the file where it's encountered the header text.
My question is, using the Text extractor, is there a way to direct it to skip processing a line based on the starting character of the line or something similar? Or, will I need to write a custom extractor to accomplish this?

Based on the documentation for the Text extractor, using the silent parameter will cause any lines that do not have the correct number of columns to be silently skipped, allowing processing to continue on to the next line. Since the IIS log header doesn't have the same number of columns as the log data, setting this parameter to true solved my problem.
So, my revised code looks like:
@results =
    EXTRACT
        s_date DateTime,
        s_time string,
        s_ip string,
        cs_method string,
        cs_uristem string,
        cs_uriquery string,
        s_port int,
        cs_username string,
        c_ip string,
        cs_useragent string,
        sc_status int,
        sc_substatus int,
        sc_win32status int,
        s_timetaken int
    FROM @"/input/u_ex140521.log"
    USING Extractors.Text(delimiter:' ', silent: true);

@statuscount =
    SELECT COUNT(*) AS TheCount,
           sc_status
    FROM @results
    GROUP BY sc_status;

OUTPUT @statuscount
TO @"/output/statuscount_results.tsv"
USING Outputters.Tsv();

Related

Hive simple Regular expression

I am trying to check whether all of the data in a column is a valid date.
create table dates (tm string, dt string) row format delimited fields terminated by '\t';
dates.txt (sample data):
20181205 15
20171023 23
20170516 16
load data local inpath 'dates.txt' overwrite into table dates;
create temporary macro isitDate(s string)
case when regexp_extract(s,'((0[1-9]|[12][0-9]|3[01])',0) = ''
then false
else true
end;
select * from dates where isitDate(dt);
But the select statement is giving the error below:
Failed with exception
java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException:
Unable to execute method public java.lang.String
org.apache.hadoop.hive.ql.udf.UDFRegExpExtract.evaluate(java.lang.String,java.lang.String,java.lang.Integer)
on object org.apache.hadoop.hive.ql.udf.UDFRegExpExtract@66b45e1e of
class org.apache.hadoop.hive.ql.udf.UDFRegExpExtract with arguments
{15:java.lang.String, ((0[1-9]|[12][0-9]|3[01]):java.lang.String,
0:java.lang.Integer} of size 3
Is there something wrong with my regular expression?
I made a silly mistake: there is one extra opening bracket in the macro.
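For reference, here is the corrected macro with the extra opening bracket removed (same logic as above, just balanced parentheses):
-- Corrected macro: the stray "(" before 0[1-9] is removed so the group is balanced.
create temporary macro isitDate(s string)
case when regexp_extract(s, '(0[1-9]|[12][0-9]|3[01])', 0) = ''
     then false
     else true
end;
select * from dates where isitDate(dt);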

Cast to long datatype - BigQuery

BigQuery and SQL noob here. I was going through the possible data types BigQuery supports here. I have a column in Bigtable which is of type bytes; its original data type is a Scala Long, which was converted to bytes and stored in Bigtable by my application code. I am trying to do CAST(itemId AS integer) (where itemId is the column name) in the BigQuery UI, but the output of CAST(itemId AS integer) is 0 instead of the actual value. I have no idea how to do this. If someone could point me in the right direction, I would greatly appreciate it.
EDIT: Adding more details
Sample itemId is 190007788462
Following is the code which writes itemId to Bigtable; I have included only the relevant method. I am using the HBase client to write to Bigtable.
import org.apache.hadoop.hbase.client._
import org.apache.hadoop.hbase.util.Bytes

def toPut(key: String, itemId: Long): Put = {
  val TrxColumnFamily = Bytes.toBytes("trx")
  val ItemIdColumn = Bytes.toBytes("itemId")
  new Put(Bytes.toBytes(key))
    .addColumn(TrxColumnFamily, ItemIdColumn, Bytes.toBytes(itemId))
}
Following is the entry in Bigtable based on the above code:
ROW COLUMN+CELL
foo column=trx:itemId, value=\x00\x00\x00\xAFP]F\xAA
Following is the relevant code which reads the entry from Bigtable in Scala. This works correctly. Result is an org.apache.hadoop.hbase.client.Result.
private def getItemId(row: Result): Long = {
  val key = Bytes.toString(row.getRow)
  val TrxColumnFamily = Bytes.toBytes("trx")
  val ItemIdColumn = Bytes.toBytes("itemId")
  val itemId = Bytes.toLong(row.getValue(TrxColumnFamily, ItemIdColumn))
  itemId
}
The getItemId function above correctly returns itemId. That's because Bytes.toLong is part of org.apache.hadoop.hbase.util.Bytes, which correctly converts the byte string to a Long.
I am using a BigQuery UI similar to this one, and I am using CAST(itemId AS integer) because BigQuery doesn't have a Long data type. This incorrectly casts the itemId byte string to an integer, and the resulting value is 0.
Is there any way I can have a Bytes.toLong equivalent from hbase-client in the BigQuery UI? If not, is there any other way I can go about this issue?
Try this:
SELECT CAST(CONCAT('0x', TO_HEX(itemId)) AS INT64) AS itemId
FROM YourTable;
It converts the bytes into a hex string, then casts that string into an INT64. Note that the query uses standard SQL, as opposed to legacy SQL. If you want to try it with some sample data, you can run this query:
WITH `YourTable` AS (
SELECT b'\x00\x00\x00\xAFP]F\xAA' AS itemId UNION ALL
SELECT b'\xFA\x45\x99\x61'
)
SELECT CAST(CONCAT('0x', TO_HEX(itemId)) AS INT64) AS itemId
FROM YourTable;

Apache Hive loads null values instead of integers

I am new to Apache Hive and was running queries on sample data which is saved in a CSV file as below:
0195153448;"Classical Mythology";"Mark P. O. Morford";"2002";"Oxford University Press";"//images.amazon.com/images/P/0195153448.01.THUMBZZZ.jpg";"http://images.amazon.com/images/P/0195153448.01.MZZZZZZZ.jpg";"images.amazon.com/images/P/0195153448.01.LZZZZZZZ.jpg"
and the table which I created is of the following form:
hive> describe book;
OK
isbn bigint
title string
author string
year string
publ string
img1 string
img2 string
img3 string
Time taken: 0.085 seconds, Fetched: 8 row(s)
and the script which I used to create the table is:
create table book(isbn int,title string,author string, year string,publ string,img1 string,img2 string,img3 string) row format delimited fields terminated by '\;' lines terminated by '\n' location 'path';
When I try to retrieve the data from the table by using the following query:
select * from book limit 1;
I get the following result:
NULL "Classical Mythology" "Mark P. O. Morford" "2002" "Oxford University Press" "http://images.amazon.com/images/P/0195153448.01.THUMBZZZ.jpg" "images.amazon.com/images/P/0195153448.01.MZZZZZZZ.jpg" "images.amazon.com/images/P/0195153448.01.LZZZZZZZ.jpg"
Even though I specify the first column type as int or bigint, the data is being loaded into the table as NULL.
I searched on the internet and figured out that I have to specify the row delimiter. I used that too, but there was no change in the data in the table.
Is there a mistake I am making somewhere? Please help.

Is there a way to create Columnfamily in external table dynamically?

I created an external table like this:
CREATE EXTERNAL TABLE IF NOT EXISTS words (word string, timest string,
  url string, occs string, nos string, hiveall string, occall string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key, count:timest, count:url, count:occs, count:nos, other:hiveall, other:occall');
Is there any way to create the column families dynamically, so that I have, for example, something like this:
1397897857000 column=word:occall, timestamp=1449778100184, value=value1
1397897857000 column=otherword:occall, timestamp=1449778100184, value=value2
I thought about something like this, but from Hive; this code here is from HBase:
Configuration config = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(config);
String table = "myTable";

admin.disableTable(table);

HColumnDescriptor cf1 = ...;
admin.addColumn(table, cf1);      // adding new ColumnFamily
HColumnDescriptor cf2 = ...;
admin.modifyColumn(table, cf2);   // modifying existing ColumnFamily

admin.enableTable(table);
from here:
http://hbase.apache.org/0.94/book/schema.html
Or does somebody have another idea for my problem?
I have data from a word count job. This data contains the URL the word was read from, a timestamp for when the word was read, the number of times it occurred in that URL, and some information about a category (there are news, social and all) with its occurrence count. The main problem is that multiple words can occur at the same timestamp, which overrides an existing entry. I need the rowkey to be the timestamp so that I can run queries against it (like what was the most used word in the last 2 weeks).
Column families can't be changed after creation like this. In your scenario, you should create different column qualifiers instead of different column families.
Fix a column family and use the incoming word as the qualifier name. That way, nothing gets overridden when different words arrive at the same timestamp.
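As a rough sketch of what that can look like from the Hive side (assuming the HBase storage handler's map-typed column mapping; the table and column names here are made up for illustration), an entire column family can be mapped to a Hive MAP so that each map key becomes its own qualifier, i.e. the word itself:
-- Sketch with assumed names: the whole "count" family is mapped to a Hive MAP,
-- so every map key you write becomes a separate qualifier (the word), while the
-- timestamp stays in the row key.
CREATE EXTERNAL TABLE IF NOT EXISTS word_counts (
  timest string,
  counts map<string, string>
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,count:');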

Hive - dynamic partitions: Long loading times with a lot of partitions when updating table

I run Hive via AWS EMR and have a jobflow that parses log data frequently into S3. I use dynamic partitions (date and log level) for my parsed Hive table.
One thing that now takes forever, once I have several gigabytes of data and a lot of partitions, is the step where Hive loads the data into the table after the parsing is done.
Loading data to table default.logs partition (dt=null, level=null)
...
Loading partition {dt=2013-08-06, level=INFO}
Loading partition {dt=2013-03-12, level=ERROR}
Loading partition {dt=2013-08-03, level=WARN}
Loading partition {dt=2013-07-08, level=INFO}
Loading partition {dt=2013-08-03, level=ERROR}
...
Partition default.logs{dt=2013-03-05, level=INFO} stats: [num_files: 1, num_rows: 0, total_size: 1905, raw_data_size: 0]
Partition default.logs{dt=2013-03-06, level=ERROR} stats: [num_files: 1, num_rows: 0, total_size: 4338, raw_data_size: 0]
Partition default.logs{dt=2013-03-06, level=INFO} stats: [num_files: 1, num_rows: 0, total_size: 828250, raw_data_size: 0]
...
Partition default.logs{dt=2013-08-14, level=INFO} stats: [num_files: 5, num_rows: 0, total_size: 626629, raw_data_size: 0]
Partition default.logs{dt=2013-08-14, level=WARN} stats: [num_files: 4, num_rows: 0, total_size: 4405, raw_data_size: 0]
Is there a way to overcome this problem and reduce the loading times for this step?
I have already tried archiving old logs to Glacier via a bucket lifecycle rule, in the hope that Hive would skip loading the archived partitions. Well, since this still keeps the file paths visible in S3, Hive recognizes the archived partitions anyway, so no performance is gained.
Update 1
The loading of the data is done by simply inserting the data into the dynamically partitioned table
INSERT INTO TABLE logs PARTITION (dt, level)
SELECT time, thread, logger, identity, message, logtype, logsubtype, node, storageallocationstatus, nodelist, userid, nodeid, path, datablockid, hash, size, value, exception, server, app, version, dt, level
FROM new_logs ;
from one table that contains the unparsed logs
CREATE EXTERNAL TABLE new_logs (
dt STRING,
time STRING,
thread STRING,
level STRING,
logger STRING,
identity STRING,
message STRING,
logtype STRING,
logsubtype STRING,
node STRING,
storageallocationstatus STRING,
nodelist STRING,
userid STRING,
nodeid STRING,
path STRING,
datablockid STRING,
hash STRING,
size STRING,
value STRING,
exception STRING,
version STRING
)
PARTITIONED BY (
server STRING,
app STRING
)
ROW FORMAT
DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS
INPUTFORMAT 'org.maz.hadoop.mapred.LogFileInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION 's3://my-log/logs/${LOCATION}' ;
into the new (parsed) table
CREATE EXTERNAL TABLE logs (
time STRING,
thread STRING,
logger STRING,
identity STRING,
message STRING,
logtype STRING,
logsubtype STRING,
node STRING,
storageallocationstatus STRING,
nodelist STRING,
userid STRING,
nodeid STRING,
path STRING,
datablockid STRING,
hash STRING,
size STRING,
exception STRING,
value STRING,
server STRING,
app STRING,
version STRING
)
PARTITIONED BY (
dt STRING,
level STRING
)
ROW FORMAT
DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 's3://my-log/parsed-logs' ;
The input format (LogFileInputFormat) is responsible for parsing log entries into the desired log format.
Update 2
When I try the following
INSERT INTO TABLE logs PARTITION (dt, level)
SELECT time, thread, logger, identity, message, logtype, logsubtype, node, storageallocationstatus, nodelist, userid, nodeid, path, datablockid, hash, size, value, exception, server, app, version, dt, level
FROM new_logs
WHERE dt > 'some old date';
Hive still loads all partitions in logs. If, on the other hand, I use static partitioning like
INSERT INTO TABLE logs PARTITION (dt='some date', level)
SELECT time, thread, logger, identity, message, logtype, logsubtype, node, storageallocationstatus, nodelist, userid, nodeid, path, datablockid, hash, size, value, exception, server, app, version, level
FROM new_logs
WHERE dt = 'some date';
Hive only loads the concerned partitions, but then I need to create one query for each date I think might be present in new_logs. Usually new_logs only contains log entries from today and yesterday, but it might contain older entries as well.
Static partitioning is my solution of choice at the moment, but aren't there any other (better) solutions to my problem?
During this slow phase, Hive takes the files it built for each partition and moves them from a temporary directory to a permanent directory. You can see this in the "explain extended" output as a Move Operator.
So for each partition it's one move and an update to the metastore. I don't use EMR but I presume this act of moving files to S3 has high latency for each file it needs to move.
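If you want to see that operator yourself, one way (using the insert statement from Update 1) is to prefix it with explain extended; the plan should include a Move Operator stage for the dynamic-partition insert:
-- Inspect the plan of the dynamic-partition insert shown above.
EXPLAIN EXTENDED
INSERT INTO TABLE logs PARTITION (dt, level)
SELECT time, thread, logger, identity, message, logtype, logsubtype, node, storageallocationstatus, nodelist, userid, nodeid, path, datablockid, hash, size, value, exception, server, app, version, dt, level
FROM new_logs;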
What's not clear from what you wrote is whether you're doing a full load each time you run. For example, why do you have a 2013-03-05 partition? Are you getting new log data that contains this old date? If this data is already in your logs table, you should modify your insert statement like:
SELECT fields
FROM new_logs
WHERE dt > 'date of last run';
This way you'll only get a few buckets and only a few files to move. It's still wasteful to scan all this extra data from new_logs, but you can solve that by partitioning new_logs.
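A hedged sketch of what partitioning new_logs by date could look like (hypothetical: it assumes the raw log files can be laid out under per-date prefixes in S3, and the example partition values are made up):
-- Variant of the new_logs DDL with dt moved from a data column to a partition
-- column, so an insert filtered on dt only has to scan the matching dates.
CREATE EXTERNAL TABLE new_logs_by_date (
  time STRING,
  thread STRING,
  level STRING,
  logger STRING,
  identity STRING,
  message STRING,
  logtype STRING,
  logsubtype STRING,
  node STRING,
  storageallocationstatus STRING,
  nodelist STRING,
  userid STRING,
  nodeid STRING,
  path STRING,
  datablockid STRING,
  hash STRING,
  size STRING,
  value STRING,
  exception STRING,
  version STRING
)
PARTITIONED BY (
  dt STRING,
  server STRING,
  app STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS
  INPUTFORMAT 'org.maz.hadoop.mapred.LogFileInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION 's3://my-log/logs-by-date/';

-- Register only the dates that actually need to be loaded (example values):
ALTER TABLE new_logs_by_date ADD IF NOT EXISTS
  PARTITION (dt='2013-08-14', server='server1', app='app1')
  LOCATION 's3://my-log/logs-by-date/2013-08-14/server1/app1/';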
AWS has improved Hive partition recovery time by more than an order of magnitude on EMR 3.2.x and above.
We have a Hive table that has more than 20,000 partitions on S3. With prior versions of EMR, it used to take ~80 minutes to recover, and now with 3.2.x/3.3.x we are able to do it in under 5 minutes.
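For context, a minimal sketch of what that recovery step looks like, assuming the logs table from above (RECOVER PARTITIONS is the EMR Hive extension; stock Apache Hive has the roughly equivalent MSCK REPAIR TABLE):
-- EMR Hive extension: scan the table's S3 location and register any
-- partitions it finds in the metastore.
ALTER TABLE logs RECOVER PARTITIONS;
-- Roughly equivalent in stock Apache Hive:
-- MSCK REPAIR TABLE logs;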