UDF for formatting numbers into strings in Pig - apache-pig

In Pig, I want to get a numeric column out, let's say "12345", and cast it to a string with formatting like "$12,345".
Are there existing UDFs to help with standard formatting like adding dollar signs, commas, percents, etc.? I haven't seen any in the docs.

Here is a Python UDF that you can leverage.
#!/usr/bin/python
@outputSchema("formatted:chararray")
def toDol(number):
    # render the value as an integer string, e.g. 12345.0 -> '12345'
    s = '%d' % number
    groups = []
    # peel off three digits at a time, right to left
    while s and s[-1].isdigit():
        groups.append(s[-3:])
        s = s[:-3]
    # s now holds any leading sign; rejoin the digit groups with commas
    res = s + ','.join(reversed(groups))
    res = '$' + res
    return res
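A quick sanity check in plain Python, outside Pig (values are illustrative):
print(toDol(12345))    # $12,345
print(toDol(1234567))  # $1,234,567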
This is how your Pig script will look:
REGISTER 'locale_udf.py' USING jython AS myfuncs;
DT = LOAD 'sample_data.txt' USING PigStorage() AS (dol:float);
DTR = FOREACH DT GENERATE dol, myfuncs.toDol(dol) AS formattedstring;
DUMP DTR;
This should work for you.

Related

Pyspark Schema update/alter Dataframe

I need to read a CSV file from S3; it has string and double data, but I will read everything as strings, which gives a dynamic frame of only strings. For each row I want to:
concatenate a few columns and create new columns
add new columns
convert the value in the 3rd column from string to date
convert the values of columns 4, 5 and 6 individually from string to decimal
Storename,code,created_date,performancedata,accumulateddata,maxmontlydata
GHJ 0,GHJ0000001,2020-03-31,0015.5126-,0024.0446-,0017.1811-
MULT,C000000001,2020-03-31,0015.6743-,0024.4533-,0018.0719-
Below is the code that I have written so far:
import re
import datetime
from decimal import Decimal
from pyspark.sql import Row

def ConvertToDec(myString):
    # values arrive like '0015.5126-', with a trailing minus sign
    pattern = re.compile("[0-9]{0,4}[\\.]?[0-9]{0,4}[-]?")
    myString = myString.strip()
    if myString and not pattern.match(myString):
        doubleVal = Decimal("-9999.9999")
    else:
        # strip the trailing '-' so Decimal can parse, then negate
        doubleVal = -Decimal(myString.rstrip('-'))
    return doubleVal
def rowwise_function(row):
    row_dict = row.asDict()
    data = 'd'
    if not row_dict['code']:
        data = row_dict['code']
    else:
        data = 'CD'
    if not row_dict['performancedata']:
        data = data + row_dict['performancedata']
    else:
        data = data + 'HJ'
    # new columns
    row_dict['LC_CODE'] = data
    row_dict['CD_CD'] = 123
    row_dict['GBL'] = 123.345
    if row_dict["created_date"]:
        row_dict["created_date"] = datetime.datetime.strptime(row_dict["created_date"], '%Y-%m-%d')
    if row_dict["performancedata"]:
        row_dict["performancedata"] = ConvertToDec(row_dict["performancedata"])
    newrow = Row(**row_dict)
    return newrow
store_df = spark.read.option("header","true").csv("C:\\STOREDATA.TXT", sep="|")
ratings_rdd = store_df.rdd
ratings_rdd_new = ratings_rdd.map(lambda row: rowwise_function(row))
updatedDF=spark.createDataFrame(ratings_rdd_new)
Basically, I am creating an almost entirely new DataFrame. My questions are below:
Is this the right approach?
Since I am mostly changing the schema, is there any other approach?
Use Spark DataFrames/SQL; why use RDDs? You don't need to perform any low-level data operations here; everything is column-level, so DataFrames are easier and more efficient to use.
To create new columns, use .withColumn(<col_name>, <expression/value>).
All the ifs can become .filter calls or conditional column expressions.
The whole ConvertToDec can be written better using strip and the ast module, or float.
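A minimal sketch of that advice (untested; assumes Spark 2.x with an active SparkSession named spark, the comma-separated sample above, and an illustrative decimal(10,4) type; the LC_CODE rule is simplified):
from pyspark.sql import functions as F

# read everything as strings, as in the question (the path is illustrative)
df = spark.read.option("header", "true").csv("s3://your-bucket/STOREDATA.TXT")

def to_dec(name):
    c = F.col(name)
    # '0015.5126-' -> '-0015.5126': move the trailing minus sign to the front
    signed = F.when(c.endswith("-"),
                    F.concat(F.lit("-"), F.regexp_replace(c, "-$", ""))).otherwise(c)
    return signed.cast("decimal(10,4)")

updatedDF = (df
    .withColumn("LC_CODE", F.when(F.col("code").isNull(), F.lit("d")).otherwise(F.lit("CD")))
    .withColumn("CD_CD", F.lit(123))
    .withColumn("GBL", F.lit(123.345))
    .withColumn("created_date", F.to_date("created_date", "yyyy-MM-dd"))
    .withColumn("performancedata", to_dec("performancedata"))
    .withColumn("accumulateddata", to_dec("accumulateddata"))
    .withColumn("maxmontlydata", to_dec("maxmontlydata")))
updatedDF.show()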

Putting dbSendQuery into a function in R

I'm using RJDBC in RStudio to pull a set of data from an Oracle database into R.
After loading the RJDBC package I have the following lines:
drv = JDBC("oracle.jdbc.OracleDriver", classPath="C:/R/ojdbc7.jar", identifier.quote = " ")
conn = dbConnect(drv, "jdbc:oracle:thin:@private_server_info", "804301", "password")
rs = dbSendQuery(conn, statement= paste("LONG SQL QUERY TO SELECT REQUIRED DATA INCLUDING REQUEST FOR VARIABLE x"))
masterdata = fetch(rs, n = -1) # extract all rows
Run as part of the usual script, these lines always execute without fail; depending on variable x it can take a few minutes, e.g. 100K rows or 1M rows may be pulled. masterdata ends up with everything in a data frame.
I'm now trying to place all of the above into a function, with one required argument, variable x, which is a text argument (a city name); this input is also part of the LONG SQL QUERY.
The function I wrote called Data_Grab is as follows:
Data_Grab = function(x) {
  drv = JDBC("oracle.jdbc.OracleDriver", classPath="C:/R/ojdbc7.jar", identifier.quote = " ")
  conn = dbConnect(drv, "jdbc:oracle:thin:@private_server_info", "804301", "password")
  rs = dbSendQuery(conn, statement= paste("LONG SQL QUERY TO SELECT REQUIRED DATA,
                                           INCLUDING REQUEST FOR VARIABLE x"))
  masterdata = fetch(rs, n = -1) # extract all rows
  return (masterdata)
}
My function appears to execute in seconds (no error is produced), however I get just the 21 column headings of the data frame and the line
<0 rows> (or 0-length row.names)
Not sure what is wrong here; I was obviously expecting the function to take minutes to execute, since the data being pulled is large, but no actual data frame is being returned.
Help is appreciated!
If you want to parameterize your query to a JDBC database, try also using the gsubfn package. The code might look like this:
library(gsubfn)
library(RJDBC)
Data_Grab = function(x) {
  rd1 = x
  df <- fn$dbGetQuery(conn, "SELECT BLAH1, BLAH2
                             FROM TABLENAME
                             WHERE BLAH1 = '$rd1'")
  return(df)
}
Basically, you need to put a $ before the variable name that stores the parameter you wish to pass.

Error while returning output of Pig macro via tuple

The error is in the function below; I'm trying to generate two measures of entropy (the latter excludes all events with frequency < 5).
My error:
ERROR 1200: Cannot expand macro 'TOTUPLE'. Reason: Macro must be defined before expansion.
Which is weird, because TOTUPLE is a built-in function. Other pig scripts use TOTUPLE with no problems.
Code:
define dual_entropies (search, field) returns entropies {
summary = summary_total($search, $field);
entr1 = count_sum_entropy(summary, $field);
summary = filter summary by events >= 5L;
entr2 = count_sum_entropy(summary, $field);
$entropies = TOTUPLE(entr1, entr2);
};
Note that entr1 and entr2 are both single numbers, not vectors of numbers - I suspect that's part of the issue.
I ran into similar confusion. I'm not sure if it's true in general, but Pig only liked TOTUPLE when it was part of a FOREACH operation. I worked around it by doing a GROUP ... ALL, which returns a bag with a single tuple in it, followed by a FOREACH ... GENERATE, such as:
B = group A ALL;
C = foreach B generate 'x', 2, TOTUPLE('a', 'b', 'c');
dump C;
...
(x,2,(a,b,c))
Perhaps this will help.

how can i ignore " (double quotes) while loading file in PIG?

I have the following data in a file:
"a","b","1","2"
"a","b","4","3"
"a","b","3","1"
I am reading this file using the command below:
File1 = LOAD '/path' USING PigStorage(',') AS (f1:chararray, f2:chararray, f3:int, f4:int);
But it is ignoring the data in fields 3 and 4: the quoted values cannot be cast to int, so they come back as null.
How can I read this file correctly, or make Pig skip the '"' characters?
Additional information: I am using Apache Pig version 0.10.0.
You may use the REPLACE function (it won't be in one pass though):
file1 = load 'your.csv' using PigStorage(',');
data = foreach file1 generate $0 as f1:chararray, $1 as f2:chararray, (int)REPLACE($2, '\\"', '') as f3:int, (int)REPLACE($3, '\\"', '') as f4:int;
You may also use regexes with REGEX_EXTRACT :
file1 = load 'your.csv' using PigStorage(',');
data = foreach file1 generate $0, $1, REGEX_EXTRACT($2, '([0-9]+)', 1), REGEX_EXTRACT($3, '([0-9]+)', 1);
Of course, you could erase " for f1 and f2 the same way.
Try the loader below (no need to escape or replace double quotes); note it requires the piggybank jar to be registered, as the last answer shows:
File1 = LOAD '/path' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',') AS (f1:chararray, f2:chararray, f3:int, f4:int);
If you have Jython installed you could deploy a simple UDF to accomplish the job.
python UDF
#!/usr/bin/env python
'''
udf.py
'''
@outputSchema("out:chararray")
def formatter(item):
    chars = 'abcdefghijklmnopqrstuvwxyz'
    nums = '1234567890'
    # take what sits between the double quotes, e.g. '"a"' -> 'a'
    new_item = item.split('"')[1]
    if new_item in chars:
        output = str(new_item)
    elif new_item in nums:
        output = int(new_item)
    else:
        output = new_item  # fall back to the raw value so output is always bound
    return output
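A quick local check of the parsing, outside Pig (input quoted as in the sample file):
print(formatter('"a"'))  # a
print(formatter('"1"'))  # 1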
pig script
REGISTER 'udf.py' USING jython as udf;
data = load 'file' USING PigStorage(',') AS (col1:chararray, col2:chararray,
col3:chararray, col4:chararray);
out = foreach data generate udf.formatter(col1) as a, udf.formatter(col3) as b;
dump out;
Which yields:
(a,1)
(a,4)
(a,3)
How about using REPLACE, if the case is this simple?
data = LOAD 'YOUR_DATA' Using PigStorage(',') AS (a:chararray, b:chararray, c:chararray, d:chararray) ;
new_data = foreach data generate
REPLACE(a, '"', '') AS a,
REPLACE(b, '"', '') AS b,
(int)REPLACE(c, '"', '') AS c:int,
(int)REPLACE(d, '"', '') AS d:int;
One more tip: if you are loading a CSV file produced by an Excel-like tool, setting a correct number format there (so the numbers are written without quotes) might also help.
You can use the CSVExcelStorage loader from Pig.
The double quotes in the data are handled by this loader.
You have to register the piggybank jar to use this loader.
Register ${jar_location}/piggybank-0.15.0.jar;
load_data = load '${data_location}' using
org.apache.pig.piggybank.storage.CSVExcelStorage(',');
Hope this helps.

Converting dynamic, nicely formatted tabular data in Python to str.format()

I have the following Python 2.x code, which generates a header row for tabular data:
headers = ['Name', 'Date', 'Age']
maxColumnWidth = 20 # this is just a placeholder
headerRow = "|".join( ["%s" % k.center(maxColumnWidth) for k in headers] )
print(headerRow)
This code outputs the following:
        Name        |        Date        |        Age
Which is exactly what I want - the data is nicely formatted and centered in columns of width maxColumnWidth. (maxColumnWidth is calculated earlier in the program)
According to the Python docs, you should be able to do the same thing in Python 3 with curly-brace string formatting, as follows:
headerRow = "|".join( ["{:^maxColumnWidth}".format(k) for k in headers] )
However, when I do this, I get the following:
ValueError: Invalid conversion specification
But, if I do this:
headerRow = "|".join( ["{:^30}".format(k) for k in headers] )
Everything works fine.
My question is: how do I use a variable in the format string instead of an integer?
headerRow = "|".join( ["{:^maxColumnWidth}".format(k) for k in headers] )
headers = ['Name', 'Date', 'Age']
maxColumnWidth=21
headerRow = "|".join( "{k:^{m}}".format(k=k,m=maxColumnWidth) for k in headers )
print(headerRow)
yields
        Name         |        Date         |         Age
You can represent the width maxColumnWidth as {m}, and then substitute the value through a format parameter. No need to use brackets (a list comprehension) inside the join; a generator expression (without brackets) suffices.
As it says, your conversion specification is invalid. "maxColumnWidth" is not a valid conversion specification.
>>> "{:^{maxColumnWidth}}".format('foo', maxColumnWidth=10)
'   foo    '
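The nested-field trick also works with positional arguments, if you prefer not to name the width parameter (a quick sketch; values are illustrative):
headers = ['Name', 'Date', 'Age']
maxColumnWidth = 20
print("|".join("{0:^{1}}".format(k, maxColumnWidth) for k in headers))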