Apache Pig: How to whitelist or blacklist in Load function?

I am wondering if it's possible to maintain a whitelist or blacklist in Pig's Load function. Say I am doing the following:
AllData = LOAD '/path/to/dir/CAT*' USING AvroStorage();
This would load all the files that start with the CAT prefix.
e.g. CAT1, CAT2, CAT3, CAT4, CAT5, CAT6
I am wondering if it's possible to maintain a blacklist that filters out, say, CAT2 and CAT3, or a whitelist that keeps only CAT1, CAT4, CAT5 and CAT6. Thanks!

You can do a whitelist by listing all the filename suffixes in curly brackets, like:
AllData = LOAD '/path/to/dir/CAT{1,4,5,6}' USING AvroStorage();
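To make the effect of the brace group concrete, here is a small Python sketch that imitates it outside of Pig. It assumes the group is a plain comma-separated list of alternatives with no nesting; the helper names expand_braces and select are made up for this illustration and are not part of Pig or Hadoop.

```python
import fnmatch
import re


def expand_braces(pattern):
    """Expand {a,b,c} groups into a list of plain patterns (no nesting)."""
    match = re.search(r"\{([^{}]*)\}", pattern)
    if match is None:
        return [pattern]
    head, tail = pattern[:match.start()], pattern[match.end():]
    expanded = []
    for option in match.group(1).split(","):
        # Recurse so that a pattern with several groups is fully expanded.
        expanded.extend(expand_braces(head + option + tail))
    return expanded


def select(names, pattern):
    """Return the names matched by any expansion of the pattern."""
    patterns = expand_braces(pattern)
    return [n for n in names if any(fnmatch.fnmatchcase(n, p) for p in patterns)]


files = ["CAT1", "CAT2", "CAT3", "CAT4", "CAT5", "CAT6"]

# Whitelist: keep only the listed suffixes.
whitelisted = select(files, "CAT{1,4,5,6}")

# Blacklist: everything under CAT* except the listed suffixes.
excluded = set(select(files, "CAT{2,3}"))
blacklisted = [n for n in select(files, "CAT*") if n not in excluded]
```

Both lists end up with the same four names, which is why a whitelist pattern is usually enough when the set of files is known in advance.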

Related

how to get rid of queries in a URL using Hive?

I have a few million urls that can look like:
www.wikipedia.com/helloworld?somekey=published_links&otherkey=1
www.wikipedia.com/helloworld?wowkey=20005
www.wikipedia.com/helloworld
I would like to get rid of the url queries so that they all look like:
www.wikipedia.com/helloworld
How can I do that? Is it safe to do it with regex? Should I use parse_url instead (Hive)?
Thanks!
You can use the parse_url function. It expects a full URL, so prepend http:// or https:// to the existing column, extract the HOST and PATH values, and concatenate them to get the desired result.
select CONCAT(parse_url(concat('http://',col),'HOST'),
parse_url(concat('http://',col),'PATH')
)
from tbl
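The same idea (add a scheme so the parser can recognise the host, then keep only host and path) can be sketched in Python with the standard library. The function name strip_query is an assumption made for this example. Note that netloc also keeps a port if one is present, which may differ from what Hive returns for HOST.

```python
from urllib.parse import urlsplit


def strip_query(url):
    """Drop the query string (and fragment) from a scheme-less URL."""
    # Without a scheme the host would be parsed as part of the path,
    # so a dummy scheme is prepended first.
    parts = urlsplit("http://" + url)
    return parts.netloc + parts.path


urls = [
    "www.wikipedia.com/helloworld?somekey=published_links&otherkey=1",
    "www.wikipedia.com/helloworld?wowkey=20005",
    "www.wikipedia.com/helloworld",
]
cleaned = [strip_query(u) for u in urls]
```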

How to find the parent page when using rules in Scrapy?

rules = (
    Rule(LinkExtractor(
        restrict_xpaths='//need_data',
        deny=deny_urls), callback='parse_info'),
    Rule(LinkExtractor(allow=r'/need/', deny=deny_urls), follow=True),
)
The rules extract the URLs that need to be scraped, right?
Can I get, inside the callback, the URL of the page we came from?
For example, for the website needdata.com:
Rule(LinkExtractor(allow=r'/need/', deny=deny_urls), follow=True),
extracts URLs like needdata.com/need/1, right?
Rule(LinkExtractor(
    restrict_xpaths='//need_data',
    deny=deny_urls), callback='parse_info'),
extracts URLs from needdata.com/need/1 (for example, from a table of people), and then parse_info scrapes them, right?
But how can I tell, inside parse_info, which page is the parent?
If needdata.com/need/1 links to needdata.com/people/1, I want to add a parent column to the output file, and its value should be needdata.com/need/1.
How can I do that? Thank you very much.
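You can use a LinkExtractor directly:
lx = LinkExtractor(allow=(r'shop-online/',))
Then iterate over the links it extracts from the response:
for l in lx.extract_links(response):
    # l.url is the extracted URL
Then pass the value you need (in your case the parent URL, response.url) to the next request through meta:
meta={'category': category}
I did not find a better solution.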
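The pattern described here, attaching the referring page to each follow-up request and reading it back in the callback, can be modelled without any crawler library. The sketch below is only such a model: PageRequest and follow_links are hypothetical stand-ins invented for this example, not Scrapy classes or functions, and the meta key "parent" is an assumed name.

```python
from dataclasses import dataclass, field


@dataclass
class PageRequest:
    """Hypothetical stand-in for a crawler request; not a Scrapy class."""
    url: str
    meta: dict = field(default_factory=dict)


def follow_links(parent_url, link_urls):
    """Build one request per extracted link, remembering where it was found."""
    return [PageRequest(url=u, meta={"parent": parent_url}) for u in link_urls]


def parse_info(request):
    """Callback: emit a row whose 'parent' column is the referring page."""
    return {"url": request.url, "parent": request.meta["parent"]}


# needdata.com/need/1 links to needdata.com/people/1, as in the question.
requests = follow_links("needdata.com/need/1", ["needdata.com/people/1"])
rows = [parse_info(r) for r in requests]
```

In a real spider the same value would travel in the request's meta dictionary, as the answer suggests.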

How to remove a field from a tuple in Apache Pig

I have code as follows:
B_t= LOAD 'test.csv' USING PigStorage('\t') as (id:chararray,usr_id:chararray,weed:chararray,ip:chararray);
In the above, I have a field named weed. I would like to remove this field from each record using the FILTER command, without using code like the following:
B_f = FOREACH B_t GENERATE id , usr_id, ip
or
B_t= LOAD 'test.csv' USING PigStorage('\t') $0 as id, ....;
Does anyone have an idea?
Why do you want to use FILTER only? FILTER selects records based on a condition; it does not remove a field from every record. FOREACH ... GENERATE is the best way.

Using REGEXP_EXTRACT to get domain and subdomains

I have only managed to extract the TLD of the list of websites that I have using
REGEXP_EXTRACT(Domain_name, r'(\.[^.:]*)\.?:?[0-9]*$') AS web_tld
Example:
I have
www.example1.abc.com
www.example2.efg.123.net
I want the result:
Subdomain  Domain  TLD
example1   abc     .com
efg        123     .net
EDIT:
I encountered this error in my query:
'Exactly one capturing group must be specified'
when I use (\.?([^.:]+)\.([^.:]+)\.([^.:]+):?[0-9]*$) as the regex:
SELECT
REGEXP_EXTRACT(Domain, r'(\.?([^.:]+)\.([^.:]+)\.([^.:]+):?[0-9]*$)'),
FROM [weblist.domain]
ORDER BY 1
LIMIT 250;
As you can only use one capturing group, I think you can actually use 3 separate regular expressions to get the values you want:
SELECT
REGEXP_EXTRACT(Domain, r'([^.:]+):?[0-9]*$'),
REGEXP_EXTRACT(Domain, r'([^.:]+)\.[^.:]+:?[0-9]*$'),
REGEXP_EXTRACT(Domain, r'([^.:]+)\.[^.:]+\.[^.:]+:?[0-9]*$')
FROM [weblist.domain]
ORDER BY 1
LIMIT 250;
Note that you may be better off using the built-in HOST, DOMAIN, and TLD functions rather than custom regular expressions.
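For comparison, the three-part split can be sketched in Python, where one pattern may carry several capturing groups. This is a naive illustration that assumes the last three dot-separated labels are subdomain, domain and TLD; it will mislabel multi-part suffixes such as co.uk. The function name split_host is made up for this example.

```python
import re

# Last three dot-separated labels, followed by an optional :port.
HOST_PARTS = re.compile(r"([^.:]+)\.([^.:]+)\.([^.:]+):?[0-9]*$")


def split_host(host):
    """Return (subdomain, domain, tld) or None if there are fewer than 3 labels."""
    match = HOST_PARTS.search(host)
    if match is None:
        return None
    subdomain, domain, tld = match.groups()
    return subdomain, domain, "." + tld


parts = [split_host(h) for h in ("www.example1.abc.com", "www.example2.efg.123.net")]
```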

set up way of getting mysite.$domain

I have several domains, but only one website, and one database table for each domain.
Example: website.us - data from the USA goes to database table main_usa
website.co.uk - data from the UK goes to database table main_uk
There is only one database, named after the website. The single website is structured around variables like this:
$sql="select * from main_".$countrycode." where bla..bla...
and many other variables to catch the domain extension, and so on...
Now, instead of having one full website for each domain, how can I set up a script, and where do I put it, in order to detect the domain that the user is visiting?
In my server root, do I create something like website.$domain?
Something like the OLX website, but for a different purpose.
I hope I made myself clear.
Thank you.
You could use the $_SERVER['SERVER_NAME'] superglobal and parse it to get the country code.
I didn't try it, but something like this should work:
$servername_array = explode('.', $_SERVER['SERVER_NAME']);
$country_code = array_pop($servername_array);
That way, $country_code will be com, uk, or whatever is after the last dot in the domain name.
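The same last-label approach, sketched in Python together with a lookup that turns the suffix into a table name. The mapping TABLES and the function name table_for_host are assumptions for this illustration; the table names come from the question. Looking the suffix up in a fixed mapping, rather than concatenating it into the SQL string, keeps an unexpected host name from reaching the query.

```python
# Assumed mapping from domain suffix to table name (names taken from the question).
TABLES = {"us": "main_usa", "uk": "main_uk"}


def table_for_host(server_name):
    """Return the table for the host's last label, or None if it is unknown."""
    # Drop an optional :port and a trailing dot, and normalise case.
    host = server_name.split(":")[0].rstrip(".").lower()
    suffix = host.rsplit(".", 1)[-1]
    return TABLES.get(suffix)


us_table = table_for_host("website.us")
uk_table = table_for_host("website.co.uk")
unknown = table_for_host("website.de")
```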