dask read_sql_table fails on sqlite table with numeric datetime - pandas

I've been given some large sqlite tables that I need to read into dask dataframes. The tables have columns with datetimes (ISO formatted strings) stored as sqlite NUMERIC data type. I am able to read in this kind of data using Pandas' read_sql_table. But, the same call from dask gives an error. Can someone suggest a good workaround? (I do not know of an easy way to change the sqlite data type of these columns from NUMERIC to TEXT.) I am pasting a minimal example below.
import sqlalchemy
import pandas as pd
import dask.dataframe as ddf
connString = "sqlite:///c:\\temp\\test.db"
engine = sqlalchemy.create_engine(connString)
conn = engine.connect()
conn.execute("create table testtable (uid integer Primary Key, datetime NUM)")
conn.execute("insert into testtable values (1, '2017-08-03 01:11:31')")
print(conn.execute('PRAGMA table_info(testtable)').fetchall())
conn.close()
pandasDF = pd.read_sql_table('testtable', connString, index_col='uid', parse_dates={'datetime':'%Y-%m-%d %H:%M:%S'})
pandasDF.head()
daskDF = ddf.read_sql_table('testtable', connString, index_col='uid', parse_dates={'datetime':'%Y-%m-%d %H:%M:%S'})
Here is the traceback:
Warning (from warnings module):
File "C:\Program Files\Python36\lib\site-packages\sqlalchemy\sql\sqltypes.py", line 596
'storage.' % (dialect.name, dialect.driver))
SAWarning: Dialect sqlite+pysqlite does *not* support Decimal objects natively, and SQLAlchemy must convert from floating point - rounding errors and other issues may occur. Please consider storing Decimal numbers as strings or integers on this platform for lossless storage.
Traceback (most recent call last):
File "<pyshell#2>", line 1, in <module>
daskDF = ddf.read_sql_table('testtable', connString, index_col='uid', parse_dates={'datetime':'%Y-%m-%d %H:%M:%S'})
File "C:\Program Files\Python36\lib\site-packages\dask\dataframe\io\sql.py", line 98, in read_sql_table
head = pd.read_sql(q, engine, **kwargs)
File "C:\Program Files\Python36\lib\site-packages\pandas\io\sql.py", line 416, in read_sql
chunksize=chunksize)
File "C:\Program Files\Python36\lib\site-packages\pandas\io\sql.py", line 1104, in read_query
parse_dates=parse_dates)
File "C:\Program Files\Python36\lib\site-packages\pandas\io\sql.py", line 157, in _wrap_result
coerce_float=coerce_float)
File "C:\Program Files\Python36\lib\site-packages\pandas\core\frame.py", line 1142, in from_records
coerce_float=coerce_float)
File "C:\Program Files\Python36\lib\site-packages\pandas\core\frame.py", line 6304, in _to_arrays
data = lmap(tuple, data)
File "C:\Program Files\Python36\lib\site-packages\pandas\compat\__init__.py", line 129, in lmap
return list(map(*args, **kwargs))
TypeError: must be real number, not str
EDIT: The comments by @mdurant now make me wonder if this is a bug in sqlalchemy. The following code gives the same error message as pandas does:
import sqlalchemy as sa
from sqlalchemy import text
m = sa.MetaData()
table = sa.Table('testtable', m, autoload=True, autoload_with=engine)
resultList = conn.execute(sa.sql.select(table.columns).select_from(table)).fetchall()
print(resultList)
resultList2 = conn.execute(sa.sql.select(columns=[text('uid'),text('datetime')], from_obj = text('testtable'))).fetchall()
print(resultList2)
Traceback (most recent call last):
File "<ipython-input-20-188c84a35d95>", line 1, in <module>
print(resultList)
File "c:\program files\python36\lib\site-packages\sqlalchemy\engine\result.py", line 156, in __repr__
return repr(sql_util._repr_row(self))
File "c:\program files\python36\lib\site-packages\sqlalchemy\sql\util.py", line 329, in __repr__
", ".join(trunc(value) for value in self.row),
TypeError: must be real number, not str

Puzzling.
Here is some further information, which hopefully can lead to an answer.
The query being executed at the line in question is
pd.read_sql(sql.select(table.columns).select_from(table),
engine, index_col='uid')
which fails as you show (the limit is not relevant here).
However, the text version of the same query
sql.select(table.columns).select_from(table).compile().string
-> 'SELECT testtable.uid, testtable.datetime \nFROM testtable'
pd.read_sql('SELECT testtable.uid, testtable.datetime \nFROM testtable',
engine, index_col='uid') # works fine
The following workaround, using a cast in the query, does work (but isn't pretty):
import sqlalchemy as sa
engine = sa.create_engine(connString)
table = sa.Table('testtable', m, autoload=True, autoload_with=engine)
uid, dt = list(table.columns)
q = sa.select([dt.cast(sa.types.String)]).select_from(table)
daskDF = ddf.read_sql_table(q, connString, index_col=uid.label('uid'))
-edit-
Simpler form of this that also appears to work (see comment)
daskDF = ddf.read_sql_table('testtable', connString, index_col='uid',
columns=['uid', sa.sql.column('datetime').cast(sa.types.String).label('datetime')])
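Since the cast returns the datetime column as plain strings, you will probably still want to parse it after loading. A minimal sketch of that step (assuming the workaround above; dask's to_datetime applies pandas' parser per partition):
daskDF = ddf.read_sql_table('testtable', connString, index_col='uid',
                            columns=['uid', sa.sql.column('datetime').cast(sa.types.String).label('datetime')])
# parse the ISO-formatted strings into real datetimes, partition by partition
daskDF['datetime'] = ddf.to_datetime(daskDF['datetime'], format='%Y-%m-%d %H:%M:%S')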

Related

Copy all .csv files in directory to .xlsx files in another directory, Traceback error

I'm working on this script that first takes all .csv's and converts them to .xlsx's in a separate folder. I'm getting the first file to output exactly how I want in the 'Script files' folder, but then it throws a Traceback error before it does the second one.
Script code below, Traceback error below that. Some path data removed for privacy:
import pandas as pd
import matplotlib.pyplot as plt
import os

# Assign current directory and list files there
f_path = os.path.dirname(__file__)
rd_path = f_path+'\\Raw Data'
sc_path = f_path+'\\Script files'

# Create /Script files folder
if os.path.isdir(sc_path) == False:
    os.mkdir(sc_path)
    print("\nCreating new Script files path here...",sc_path)
else:
    print("\nScript files directory exists!")

# List files in Raw Data directory
print("\nRaw Data files in the directory:\n",rd_path,"\n")
for filename in os.listdir(rd_path):
    f = os.path.join(rd_path,filename)
    if os.path.isfile(f):
        print(filename)
print("\n\n\n")

# Copy and edit data files to /Script files folder
for filename in os.listdir(rd_path):
    src = os.path.join(rd_path,filename)
    if os.path.isfile(src):
        name = os.path.splitext(filename)[0]
        read_file = pd.read_csv(src)
        result = sc_path+"\\"+name+'.xlsx'
        read_file.to_excel(result)
        print(src,"\nconverted and written to: \n",result,"\n\n")
Traceback (most recent call last):
File "C:\Users\_________________\Graph.py", line 32, in <module>
read_file = pd.read_csv(src)
File "C:\Users\_____________\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "C:\Users\______________\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py", line 680, in read_csv
return _read(filepath_or_buffer, kwds)
File "C:\Users\_____________\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py", line 581, in _read
return parser.read(nrows)
File "C:\Users\_____________\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py", line 1250, in read
index, columns, col_dict = self._engine.read(nrows)
File "C:\Users\_____________\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\c_parser_wrapper.py", line 225, in read
chunks = self._reader.read_low_memory(nrows)
File "pandas\_libs\parsers.pyx", line 805, in pandas._libs.parsers.TextReader.read_low_memory
File "pandas\_libs\parsers.pyx", line 861, in pandas._libs.parsers.TextReader._read_rows
File "pandas\_libs\parsers.pyx", line 847, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas\_libs\parsers.pyx", line 1960, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 2 fields in line 47, saw 8
Have you tried converting the second file in the folder to .xlsx on its own? I'm not sure, but it seems like there's a problem when pandas reads that csv.
So I found out this issue was due to the formatting of my 2nd .csv file. Opening it up, I had a number of cells above the data I wanted. After deleting these extra rows so that the relevant data started in row 1, the code ran correctly. Looks like I'll have to add code to detect these extra rows and delete them prior to attempting to convert to .xlsx.
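One way to automate that (a sketch, assuming the real header row can be recognized by a known column name; 'Time' below is a placeholder, not from the original files):
import pandas as pd

def read_csv_skipping_preamble(path, marker):
    # find the first line containing the known header marker,
    # then let pandas start parsing from that row
    with open(path) as fh:
        for i, line in enumerate(fh):
            if marker in line:
                return pd.read_csv(path, skiprows=i)
    raise ValueError("header marker %r not found in %s" % (marker, path))

# read_file = read_csv_skipping_preamble(src, marker="Time")  # would replace pd.read_csv(src) in the loop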

Question about insert from TK form to csv file

I am trying to get some info from a Tk form that I built into a new CSV file, but unfortunately I am getting some errors.
The code:
def sub_func():
    with open('Players.csv','w') as df:
        df = pd.DataFrame
        i=len(df.index)
        data=[]
        data.append(entry_box1.get())
        data.append(entry_box2.get())
        data.append(entry_box3.get())
        data.append(entry_box4.get())
        if i==4:
            df.loc[i,:]=data
The Errors:
Exception in Tkinter callback
Traceback (most recent call last):
File "C:\Users\user\AppData\Local\Programs\Python\Python38-32\lib\tkinter\_init.py", line 1883, in __call_
return self.func(*args)
File "C:/Users/user/PycharmProjects/test11/Final Project/registration form.py", line 23, in sub_func
i=len(df.index)
TypeError: object of type 'pandas._libs.properties.AxisProperty' has no len()
You have two problems in your code:
The file is opened as df, but df is immediately overwritten on the next line.
df = pd.DataFrame does not create anything, as it would need parentheses () to create a dataframe. This means df is the DataFrame class itself rather than a dataframe instance -> the error comes when you try to take the length of df.index, which at that point is just a class attribute.
Solution:
df = pd.read_csv('./Players.csv') # Make sure that the file is in the same directory
i = len(df.index)
Not sure what you are trying to achieve, but you might need to work on the rest of the code, too.
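If the goal is just to append one row of form values to Players.csv, a minimal sketch that skips pandas entirely (assuming entry_box1..entry_box4 are the Tkinter Entry widgets from the form):
import csv

def sub_func():
    # collect the current values from the four entry widgets (names assumed from the question)
    data = [entry_box1.get(), entry_box2.get(), entry_box3.get(), entry_box4.get()]
    # 'a' appends to the existing file; newline='' avoids blank lines on Windows
    with open('Players.csv', 'a', newline='') as fh:
        csv.writer(fh).writerow(data)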

pandas unable to write to Postgres db throws "KeyError: ("SELECT name FROM sqlite_master ..."

I have created a package allowing a user to write data to either a sqlite or Postgres db. I created a module for connecting to the db and a separate module that provides the writing functionality. In the latter module, the write is a straightforward pandas to_sql call:
indata.to_sql('pay_' + table, con, if_exists='append', index=False)
Writing to an sqlite db (with a connection made using 'sqlite3') succeeds; however, when writing to a Postgres db I get the following error:
Traceback (most recent call last):
File "/anaconda3/envs/PCAN_v1/lib/python3.7/site-packages/pg8000/core.py", line 1778, in execute
ps = cache['ps'][key]
KeyError: ("SELECT name FROM sqlite_master WHERE type='table' AND name=?;", ((705, 0, <function Connection.__init__.<locals>.text_out at 0x7fc3205fb510>),))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/anaconda3/envs/PCAN_v1/lib/python3.7/site-packages/pandas/io/sql.py", line 1595, in execute
cur.execute(*args)
File "/anaconda3/envs/PCAN_v1/lib/python3.7/site-packages/pg8000/core.py", line 861, in execute
self._c.execute(self, operation, args)
File "/anaconda3/envs/PCAN_v1/lib/python3.7/site-packages/pg8000/core.py", line 1837, in execute
self.handle_messages(cursor)
File "/anaconda3/envs/PCAN_v1/lib/python3.7/site-packages/pg8000/core.py", line 1976, in handle_messages
raise self.error
pg8000.core.ProgrammingError: {'S': 'ERROR', 'V': 'ERROR', 'C': '42P01', 'M': 'relation "sqlite_master" does not exist', 'P': '18', 'F': 'parse_relation.c', 'L': '1180', 'R': 'parserOpenTable'}
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/anaconda3/envs/PCAN_v1/lib/python3.7/site-packages/pandas/io/sql.py", line 1610, in execute
raise_with_traceback(ex)
File "/anaconda3/envs/PCAN_v1/lib/python3.7/site-packages/pandas/compat/__init__.py", line 46, in raise_with_traceback
raise exc.with_traceback(traceback)
File "/anaconda3/envs/PCAN_v1/lib/python3.7/site-packages/pandas/io/sql.py", line 1595, in execute
cur.execute(*args)
File "/anaconda3/envs/PCAN_v1/lib/python3.7/site-packages/pg8000/core.py", line 861, in execute
self._c.execute(self, operation, args)
File "/anaconda3/envs/PCAN_v1/lib/python3.7/site-packages/pg8000/core.py", line 1837, in execute
self.handle_messages(cursor)
File "/anaconda3/envs/PCAN_v1/lib/python3.7/site-packages/pg8000/core.py", line 1976, in handle_messages
raise self.error
pandas.io.sql.DatabaseError: Execution failed on sql 'SELECT name FROM sqlite_master WHERE type='table' AND name=?;': {'S': 'ERROR', 'V': 'ERROR', 'C': '42P01', 'M': 'relation "sqlite_master" does not exist', 'P': '18', 'F': 'parse_relation.c', 'L': '1180', 'R': 'parserOpenTable'}
I traced the error to the following file:
/anaconda3/envs/PCAN_v1/lib/python3.7/site-packages/pandas/io/sql.py
What seems to be happening is that the '.to_sql' function ends up querying a table named 'sqlite_master' at this point in the 'sql.py' file:
def has_table(self, name, schema=None):
    # TODO(wesm): unused?
    # escape = _get_valid_sqlite_name
    # esc_name = escape(name)
    wld = "?"
    query = (
        "SELECT name FROM sqlite_master " "WHERE type='table' AND name={wld};"
    ).format(wld=wld)
    return len(self.execute(query, [name]).fetchall()) > 0
Looking more closely at the errors, you can see that the connection to the db is made correctly, but that pandas is looking for an sqlite table.
I know that the db name was one I used about half a year ago when I first started working with sqlite, so I'm thinking that somewhere I set a configuration value. So:
is my reasoning correct?
if so, how do I change the configuration?
if not, what is possibly going on?
Per pandas.DataFrame.to_sql documentation:
con : sqlalchemy.engine.Engine or sqlite3.Connection
Using SQLAlchemy makes it possible to use any DB supported by that
library. Legacy support is provided for sqlite3.Connection objects.
This means only SQLite allows a raw connection for the to_sql method. All other RDBMSs, including Postgres, must use an SQLAlchemy connectable for this method to create structures and append data. Do note: read_sql does not require SQLAlchemy, since it does not make persistent changes.
Therefore, this raw DB-API connection cannot work:
import psycopg2
con = psycopg2.connect(host="localhost", port=5432, dbname="mydb", user="myuser", password="mypwd")
indata.to_sql('pay_' + table, con, if_exists='append', index=False)
However, this SQLAlchemy connection can work:
from sqlalchemy import create_engine
engine = create_engine('postgresql+psycopg2://myuser:mypwd@localhost:5432/mydb')
indata.to_sql('pay_' + table, engine, if_exists='append', index=False)
Better use SQLAlchemy for both databases, here for SQLite:
engine = create_engine("sqlite:///path/to/mydb.db")

Snakemake, pandas, and NCBI: how do I combine a pandas dataframe with the remote NCBI search?

I'm still pretty new to Snakemake, and I've been having trouble with a rule I'm trying to write.
I've been trying to combine using snakemake.remote.NCBI with accessing a pandas dataframe and using a wildcard, but I can't seem to make it work.
I have a tsv file called genomes.tsv with several columns, where each row is one species. One column, "id", has the GenBank id for the species's genome. Another, "species", has a short string unique to each species. In my Snakefile, genomes.tsv is imported as genomes with only the id and species columns, and then "species" is set as the index of genomes (which drops it as a column).
I want to use the values in "species" as values for the wildcard {species} in my workflow, and I want my rule to use snakemake.remote.NCBI to download each species's genome sequence in fasta format and then output it to a file "{species}_gen.fasta".
from snakemake.remote.NCBI import RemoteProvider as NCBIRemoteProvider
import pandas as pd

configfile: "config.yaml"

NCBI = NCBIRemoteProvider(email=config["email"]) # email required by NCBI to prevent abuse

genomes = pd.read_table(config["genomes"], usecols=["species","id"]).set_index("species")
SPECIES = genomes.index.values.tolist()

rule all:
    input: expand("{species}_gen.fasta", species=SPECIES)

rule download_and_count:
    input:
        lambda wildcards: NCBI.remote(str(genomes[str(wildcards.species)]) + ".fasta", db="nuccore")
    output:
        "{species}_gen.fasta"
    shell:
        "{input} > {output}"
Currently, trying to run my code results in a KeyError, but the key it reports is a value from species, so it should be able to get the corresponding GenBank id from genomes.
EDIT: here is the error
InputFunctionException in line 18 of /home/sjenkins/work/olflo/Snakefile:
KeyError: 'cappil'
Wildcards:
species=cappil
cappil is a valid value for {species}, and it should be usable as an index, I think. Here are the first few rows of genomes, for reference:
species id accession name assembly
cappil 8252558 GCA_004027915.1 Capromys_pilorides_(Desmarest's_hutia) CapPil_v1_BIUU
cavape 1067048 GCA_000688575.1 Cavia_aperea_(Brazilian_guinea_pig) CavAp1.0
cavpor 175118 GCA_000151735.1 Cavia_porcellus_(domestic_guinea_pig) Cavpor3.0
Update:
I tried changing the the input line to:
lambda wildcards: NCBI.remote(str(genomes[genomes['species'] == wildcards.species].iloc[0]['id']) + ".fasta", db="nuccore")
but that gives me the error message:
Traceback (most recent call last):
File "/home/sjenkins/miniconda3/envs/olflo/lib/python3.7/site-packages/snakemake/init.py", line 547, in snakemake
export_cwl=export_cwl)
File "/home/sjenkins/miniconda3/envs/olflo/lib/python3.7/site-packages/snakemake/workflow.py", line 421, in execute
dag.init()
File "/home/sjenkins/miniconda3/envs/olflo/lib/python3.7/site-packages/snakemake/dag.py", line 122, in init
job = self.update([job], progress=progress)
File "/home/sjenkins/miniconda3/envs/olflo/lib/python3.7/site-packages/snakemake/dag.py", line 603, in update
progress=progress)
File "/home/sjenkins/miniconda3/envs/olflo/lib/python3.7/site-packages/snakemake/dag.py", line 666, in update_
progress=progress)
File "/home/sjenkins/miniconda3/envs/olflo/lib/python3.7/site-packages/snakemake/dag.py", line 603, in update
progress=progress)
File "/home/sjenkins/miniconda3/envs/olflo/lib/python3.7/site-packages/snakemake/dag.py", line 655, in update_
missing_input = job.missing_input
File "/home/sjenkins/miniconda3/envs/olflo/lib/python3.7/site-packages/snakemake/jobs.py", line 398, in missing_input
for f in self.input
File "/home/sjenkins/miniconda3/envs/olflo/lib/python3.7/site-packages/snakemake/jobs.py", line 399, in
if not f.exists and not f in self.subworkflow_input)
File "/home/sjenkins/miniconda3/envs/olflo/lib/python3.7/site-packages/snakemake/io.py", line 208, in exists
return self.exists_remote
File "/home/sjenkins/miniconda3/envs/olflo/lib/python3.7/site-packages/snakemake/io.py", line 119, in wrapper
v = func(self, *args, **kwargs)
File "/home/sjenkins/miniconda3/envs/olflo/lib/python3.7/site-packages/snakemake/io.py", line 258, in exists_remote
return self.remote_object.exists()
File "/home/sjenkins/miniconda3/envs/olflo/lib/python3.7/site-packages/snakemake/remote/NCBI.py", line 72, in exists
likely_request_options = self._ncbi.guess_db_options_for_extension(self.file_ext, db=self.db, rettype=self.rettype, retmode=self.retmode)
File "/home/sjenkins/miniconda3/envs/olflo/lib/python3.7/site-packages/snakemake/remote/NCBI.py", line 110, in file_ext
accession, version, file_ext = self._ncbi.parse_accession_str(self.local_file())
File "/home/sjenkins/miniconda3/envs/olflo/lib/python3.7/site-packages/snakemake/remote/NCBI.py", line 366, in parse_accession_str
assert file_ext, "file_ext must be defined: {}.{}.. Possible values include: {}".format(accession,version,", ".join(list(self.valid_extensions)))
AssertionError: file_ext must be defined: ... Possible values include: est, ssexemplar, gb.xml, docset, fasta.xml, fasta, fasta_cds_na, abstract, txt, gp, medline, chr, flt, homologene, alignmentscores, gbwithparts, seqid, fasta_cds_aa, gpc, uilist, uilist.xml, rsr, xml, gb, gene_table, gss, ft, gp.xml, acc, asn1, gbc
I think you should have:
genomes = pd.read_table(config["genomes"], usecols=["species","id"])
SPECIES = list(genomes['species'])
and then access the ID of a given species with:
lambda wildcards: str(genomes[genomes['species'] == wildcards.species].iloc[0]['id'])
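For what it's worth, the original KeyError: 'cappil' happens because genomes[wildcards.species] indexes columns, not rows. If you prefer to keep species as the index (as in the original Snakefile), a row lookup with .loc should also work; a sketch:
# assuming genomes still has species as its index, as in the original Snakefile
lambda wildcards: NCBI.remote(str(genomes.loc[wildcards.species, "id"]) + ".fasta", db="nuccore")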
Ok, so it turns out that the reason I was getting AssertionError: file_ext must be defined: is that NCBIRemoteProvider can't recognize the file extension if the file name it's given doesn't contain a valid GenBank accession number. I was giving it file names built from GenBank ids, so it returned that error.
Also, it seems like these whole-genome sequences don't have a single accession number that returns all the sequences. Instead there's an accession number for the WGS report and then accession numbers for each scaffold. I decided to download the genomes I need manually instead of trying to download all the scaffolds and then combine them.

Using Python UDF with Hive

I am trying to learn how to use Python UDFs with Hive.
I have a very basic python UDF here:
import sys

for line in sys.stdin:
    line = line.strip()
    print line
Then I add the file in Hive:
ADD FILE /home/hadoop/test2.py;
Now I call the Hive Query:
SELECT TRANSFORM (admission_type_id, description)
USING 'python test2.py'
FROM admission_type;
This works as expected, no changes is made to the field and the output is printed as is.
Now, when I modify the UDF by introducing the split function, I get an execution error. How do I debug this, and what am I doing wrong?
New UDF:
import sys
for line in sys.stdin:
line = line.strip()
fields = line.split('\t') # when this line is introduced, I get an execution error
print line
To change the fields, the UDF needs to write the transformed values back to stdout (tab-separated), and the TRANSFORM clause needs to name the new output columns:
import sys

for line in sys.stdin:
    line = line.strip()
    field1, field2 = line.split('\t')
    print '\t'.join([str(field1), str(field2)])
SELECT TRANSFORM (admission_type_id, description)
USING 'python test2.py' AS (admission_type_id_new, description_new)
FROM admission_type;
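Hive only reports a generic TRANSFORM failure, so for debugging it can help to make the script defensive and write diagnostics to stderr (stdout is reserved for the output rows). A sketch in the same Python 2 style as the scripts above:
import sys

for line in sys.stdin:
    line = line.strip()
    fields = line.split('\t')
    if len(fields) != 2:
        # log unexpected rows to stderr; Hive surfaces stderr in the task logs
        print >> sys.stderr, 'unexpected field count %d: %r' % (len(fields), line)
        continue
    print '\t'.join(fields)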