Using Python UDF with Hive

I am trying to learn how to use Python UDFs with Hive.
I have a very basic Python UDF here:
import sys

for line in sys.stdin:
    line = line.strip()
    print line
Then I add the file in Hive:
ADD FILE /home/hadoop/test2.py;
Now I call the Hive Query:
SELECT TRANSFORM (admission_type_id, description)
USING 'python test2.py'
FROM admission_type;
This works as expected: no changes are made to the fields, and the output is printed as is.
Now, when I modify the UDF by introducing the split function, I get an execution error. How do I debug this, and what am I doing wrong?
New UDF:
import sys

for line in sys.stdin:
    line = line.strip()
    fields = line.split('\t')  # when this line is introduced, I get an execution error
    print line
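Since the script is a plain stdin-to-stdout filter, you can debug it outside Hive entirely: pipe a tab-separated sample row into it from a shell and look at the Python traceback directly (the sample values here are made up):

echo -e '1\tEMERGENCY ADMISSION' | python test2.py

If the script handles representative rows locally, the failure is more likely on the Hive side, for example the script not being shipped to the workers, or rows whose field count doesn't match the split.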

For reference, here is the revised UDF and query, with the output columns named explicitly:

import sys

for line in sys.stdin:
    line = line.strip()
    field1, field2 = line.split('\t')
    print '\t'.join([str(field1), str(field2)])

SELECT TRANSFORM (admission_type_id, description)
USING 'python test2.py' AS (admission_type_id_new, description_new)
FROM admission_type;

Related

dask read_sql_table fails on sqlite table with numeric datetime

I've been given some large sqlite tables that I need to read into dask dataframes. The tables have columns with datetimes (ISO-formatted strings) stored as the sqlite NUMERIC data type. I am able to read this kind of data using Pandas' read_sql_table, but the same call from dask gives an error. Can someone suggest a good workaround? (I do not know of an easy way to change the sqlite data type of these columns from NUMERIC to TEXT.) I am pasting a minimal example below.
import sqlalchemy
import pandas as pd
import dask.dataframe as ddf
connString = "sqlite:///c:\\temp\\test.db"
engine = sqlalchemy.create_engine(connString)
conn = engine.connect()
conn.execute("create table testtable (uid integer Primary Key, datetime NUM)")
conn.execute("insert into testtable values (1, '2017-08-03 01:11:31')")
print(conn.execute('PRAGMA table_info(testtable)').fetchall())
conn.close()
pandasDF = pd.read_sql_table('testtable', connString, index_col='uid', parse_dates={'datetime':'%Y-%m-%d %H:%M:%S'})
pandasDF.head()
daskDF = ddf.read_sql_table('testtable', connString, index_col='uid', parse_dates={'datetime':'%Y-%m-%d %H:%M:%S'})
Here is the traceback:
Warning (from warnings module):
  File "C:\Program Files\Python36\lib\site-packages\sqlalchemy\sql\sqltypes.py", line 596
    'storage.' % (dialect.name, dialect.driver))
SAWarning: Dialect sqlite+pysqlite does *not* support Decimal objects natively, and SQLAlchemy must convert from floating point - rounding errors and other issues may occur. Please consider storing Decimal numbers as strings or integers on this platform for lossless storage.
Traceback (most recent call last):
  File "<pyshell#2>", line 1, in <module>
    daskDF = ddf.read_sql_table('testtable', connString, index_col='uid', parse_dates={'datetime':'%Y-%m-%d %H:%M:%S'})
  File "C:\Program Files\Python36\lib\site-packages\dask\dataframe\io\sql.py", line 98, in read_sql_table
    head = pd.read_sql(q, engine, **kwargs)
  File "C:\Program Files\Python36\lib\site-packages\pandas\io\sql.py", line 416, in read_sql
    chunksize=chunksize)
  File "C:\Program Files\Python36\lib\site-packages\pandas\io\sql.py", line 1104, in read_query
    parse_dates=parse_dates)
  File "C:\Program Files\Python36\lib\site-packages\pandas\io\sql.py", line 157, in _wrap_result
    coerce_float=coerce_float)
  File "C:\Program Files\Python36\lib\site-packages\pandas\core\frame.py", line 1142, in from_records
    coerce_float=coerce_float)
  File "C:\Program Files\Python36\lib\site-packages\pandas\core\frame.py", line 6304, in _to_arrays
    data = lmap(tuple, data)
  File "C:\Program Files\Python36\lib\site-packages\pandas\compat\__init__.py", line 129, in lmap
    return list(map(*args, **kwargs))
TypeError: must be real number, not str
EDIT: The comments by @mdurant make me wonder now whether this is a bug in sqlalchemy. The following code gives the same error message as pandas does:
import sqlalchemy as sa
from sqlalchemy import text
m = sa.MetaData()
table = sa.Table('testtable', m, autoload=True, autoload_with=engine)
resultList = conn.execute(sa.sql.select(table.columns).select_from(table)).fetchall()
print(resultList)
resultList2 = conn.execute(sa.sql.select(columns=[text('uid'),text('datetime')], from_obj = text('testtable'))).fetchall()
print(resultList2)
Traceback (most recent call last):
  File "<ipython-input-20-188c84a35d95>", line 1, in <module>
    print(resultList)
  File "c:\program files\python36\lib\site-packages\sqlalchemy\engine\result.py", line 156, in __repr__
    return repr(sql_util._repr_row(self))
  File "c:\program files\python36\lib\site-packages\sqlalchemy\sql\util.py", line 329, in __repr__
    ", ".join(trunc(value) for value in self.row),
TypeError: must be real number, not str
Puzzling.
Here is some further information, which hopefully can lead to an answer.
The query being executed at the line in question is
pd.read_sql(sql.select(table.columns).select_from(table),
            engine, index_col='uid')
which fails as you show (the limit is not relevant here).
However, the text version of the same query
sql.select(table.columns).select_from(table).compile().string
-> 'SELECT testtable.uid, testtable.datetime \nFROM testtable'
pd.read_sql('SELECT testtable.uid, testtable.datetime \nFROM testtable',
            engine, index_col='uid')  # works fine
The following workaround, using a cast in the query, does work (but isn't pretty):
import sqlalchemy as sa
engine = sa.create_engine(connString)
m = sa.MetaData()  # needed here too; omitted in the original snippet
table = sa.Table('testtable', m, autoload=True, autoload_with=engine)
uid, dt = list(table.columns)
q = sa.select([dt.cast(sa.types.String)]).select_from(table)
daskDF = ddf.read_sql_table(q, connString, index_col=uid.label('uid'))
-edit-
Simpler form of this that also appears to work (see comment):

daskDF = ddf.read_sql_table('testtable', connString, index_col='uid',
                            columns=['uid', sa.sql.column('datetime').cast(sa.types.String).label('datetime')])

How to get the file name when a file is imported using pandas.read_csv

I am writing a big function, and I need the input file name in order to build the output file name. I tried something like this:
import pandas as pd
import os
input_file = pd.read_csv('my_file.csv',header=None)
input_file_name = os.basename(input_file)
but I can't get the file name back.
How can I retrieve 'my_file' here?
For context, the file name is available inside my function as the argument:

def do_job(input_file):
    if not os.path.exists(input_file):
        sys.stderr.write("Error: '%s' does not exist" % input_file)
        sys.exit(1)
    input = pd.read_csv(input_file, header=None)
    # do many operations
    # the file name is stored in the handle 'input_file',
    # so I can build the output file name from input_file
    output_name = 'Results_' + input_file
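To answer the question as asked: to recover 'my_file' you need to apply os.path to the path string itself, not to the DataFrame that read_csv returns. A minimal sketch:

import os

input_file = 'my_file.csv'
stem = os.path.splitext(os.path.basename(input_file))[0]  # 'my_file'
output_name = 'Results_' + stem + '.csv'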

Pandas fails to read SAS as iterable

UPDATE. This is a known bug - pandas.read_sas breaks if trying to read a SAS7bdat as an iterable.
I receive an error while attempting pandas.read_sas on pandas 0.18.1 in Spyder 3.0.1, Windows 10.
I generated a simple dataset in SAS and saved in the SAS7bdat format:
data basic;
    do i=1 to 20;
        j=i**2;
        if mod(i,2) then type='Even';
        else type='Odd';
        output;
    end;
run;
We save this data to a directory.
The following code successfully imports the SAS dataset when run in Python:
import pandas
f=pandas.read_sas('basic.sas7bdat')
The following code fails:
import pandas
for chunk in pandas.read_sas('basic.sas7bdat', chunksize=1):
    pass
The error generated is
File "C:\Program Files\Anaconda3\lib\site-packages\pandas\io\common.py", line 101, in __next__
raise AbstractMethodError(self)
AbstractMethodError: This method must be defined in the concrete class of SAS7BDATReader
The same error is produced if I use the option iterator=True, or if I use both iterator= and chunksize= together.
Relevant documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sas.html
Sample SAS7bdat datasets: http://www.principlesofeconometrics.com/sas.htm.
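Until the bug is fixed, one workaround is to read the whole file in a single call (which works, as shown above) and chunk the resulting DataFrame manually; this is a sketch, and it does mean holding the full dataset in memory:

import pandas

df = pandas.read_sas('basic.sas7bdat')
chunksize = 1
for start in range(0, len(df), chunksize):
    chunk = df.iloc[start:start + chunksize]
    # process each chunk here, as the broken iterator would have yielded it
    pass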

How can I print a certain line using for line in lines and line length in Python?

I have to use the sys module (import sys) for this. What I have so far is this:
import sys

file = sys.argv[1]
fp1 = open(file, 'r+')
fp2 = open(file + 'cl.', 'w+')
lines = fp1.readlines()
for line in lines:
    if len(line) > 1 and line[0] == 'Query':
        print line.split('|')[0:1]
fp1.close()
Basically when I run this on the command line:
python homework4.py sqout
It gives me nothing, but if I take away the line[0]=='Query' condition,
it prints the first 2 splits of every line (which I want it to do), just not for every line. I only want it to print the first line which starts with Query. Thanks.
line[0] is just the first character of the string line. You could use line[0:5]=='Query' or line[:5]=='Query' instead.
Before doing this, I suggest first checking that len(line)>4, or using an exception.
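Putting that together, a corrected sketch of the script (Python 2, as in the question; startswith is the idiomatic way to spell the prefix test and sidesteps the length check, and [0:2] takes the first two fields the question mentions):

import sys

filename = sys.argv[1]
fp1 = open(filename, 'r')
for line in fp1.readlines():
    if line.startswith('Query'):
        print line.split('|')[0:2]
fp1.close()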

django-admin.py dumpdata to SQL statements

I'm trying to dump my data to SQL statements.
The django-admin.py dumpdata command provides only json, xml, and yaml.
So: does anyone know a good way to do it?
I tried this:
def sqldumper(model):
    result = ""
    units = model.objects.all().values()
    for unit in units:
        statement = "INSERT INTO myapp.model " + str(tuple(unit.keys())).replace("'", "") + " VALUES " + str(tuple(unit.values())) + "\r\n"
        result += statement
    return result
So I'm going over the model values myself and building the INSERT statements myself.
Then I thought of using "django-admin.py sql" to get the "CREATE" statements, but I don't know how to run this command from inside my code (rather than through the command line).
I tried os.popen and os.system, but they don't really work.
Any tips?
I'll put it clearly: how do you use "manage.py sql" from inside your code?
I added something like this to my view:
import os, sys
import imp
from django.core.management import execute_manager
sys_argv_backup = sys.argv
imp.find_module("settings")
import settings
sys.argv = ['','sql','myapp']
execute_manager(settings)
sys.argv = sys_argv_backup
The thing is, it works, but it writes the statements to stdout...
It's something, but not perfect. I'll try using django.core.management.sql.sql_create directly; we'll see how it goes.
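In the meantime, one way to capture the statements instead of letting them go to stdout is to temporarily swap sys.stdout for an in-memory buffer around the execute_manager call (a sketch, in the same Python 2 style as the snippet above):

import sys
from StringIO import StringIO

old_stdout = sys.stdout
sys.stdout = StringIO()
try:
    execute_manager(settings)  # with sys.argv set to ['', 'sql', 'myapp'] as above
    sql_statements = sys.stdout.getvalue()
finally:
    sys.stdout = old_stdout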
Thanks.
I suggest using a database-specific dump program (e.g. mysqldump for MySQL).
For SQLite embedded in Python, you can look at this example, which doesn't involve Django:
# Convert the file existing_db.db to the SQL dump file dump.sql
import sqlite3

con = sqlite3.connect('existing_db.db')
with open('dump.sql', 'w') as f:
    for line in con.iterdump():
        f.write('%s\n' % line)
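Going the other way, the dump produced above can be replayed into a fresh database with sqlite3's executescript (a quick round-trip check; 'restored_db.db' is just an illustrative name):

import sqlite3

new_con = sqlite3.connect('restored_db.db')
with open('dump.sql') as f:
    new_con.executescript(f.read())
new_con.close()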