Write PySpark Dataframe to Impala table

I want to write a PySpark dataframe into an Impala table, but I am getting an error message. Basically, my code looks like:
import os
properties = {'user': os.getenv('USERNAME'), 'password': os.getenv('SECRET'), 'driver': 'com.cloudera.impala.jdbc41.Driver'}
df.write.jdbc(url=os.getenv('URL'), table=os.getenv('URL'), mode='append', properties=properties)
The problem seems to be that the generated "Create table" statement has bad syntax:
When I run the same query in DBeaver, I get the same error message. The table only gets created when I delete the quotation marks. I have no idea how to solve this. I created the dataframe by flattening a JSON file, using the withColumn and explode functions. Can I somehow prevent these quotation marks from being generated?
As an alternative, would it be possible to write the dataframe into an already existing table, using an insert into query instead?
Edit: Another issue I just realized: for string columns, the "create table" statement contains the word "TEXT" instead of "STRING" or "varchar", and Impala does not recognize "TEXT" as a proper data type either...
Thanks a lot!
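One way to sidestep the generated statement is to build the "Create table" statement yourself and run it through an Impala client such as DBeaver. The following is a minimal sketch, not a confirmed fix: it assumes the column list comes from df.dtypes (a list of (name, type) pairs), that all columns are flat primitive types, and that backtick-quoted names and the STRING type are what Impala accepts. The helper name impala_create_table, the type map, and the table name some_db.some_tbl are made up for illustration. It only covers creating the table; loading the rows afterwards is a separate step.

```python
# Map Spark simple type names (as reported by df.dtypes) to Impala column types.
# Assumption: flat, primitive columns only; nested types are not handled here.
SPARK_TO_IMPALA = {
    "string": "STRING",
    "int": "INT",
    "bigint": "BIGINT",
    "smallint": "SMALLINT",
    "tinyint": "TINYINT",
    "float": "FLOAT",
    "double": "DOUBLE",
    "boolean": "BOOLEAN",
    "timestamp": "TIMESTAMP",
}


def impala_create_table(table, dtypes):
    """Build a CREATE TABLE statement with backtick-quoted column names."""
    columns = []
    for name, spark_type in dtypes:
        if spark_type not in SPARK_TO_IMPALA:
            raise ValueError(f"no Impala mapping for column {name}: {spark_type}")
        if "`" in name:
            raise ValueError(f"column name contains a backtick: {name}")
        columns.append(f"`{name}` {SPARK_TO_IMPALA[spark_type]}")
    return f"CREATE TABLE IF NOT EXISTS {table} ({', '.join(columns)})"


# Hand-written dtypes list for illustration; with a real dataframe pass df.dtypes.
ddl = impala_create_table("some_db.some_tbl", [("name", "string"), ("age", "int")])
print(ddl)
```

Any column type missing from the map raises an error instead of silently falling back to something Impala might reject.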

Related

Pyodbc not detecting parameter marker in SQL statement (i.e. - Insert into table SELECT ...) to Hive table. Is there a workaround for this issue?

My goal is to iterate through a set of values and use them to run a set of queries that will insert their results into a hive table using pyodbc.
I tried
params = ['USA','JP']
set(params)
for i in params:
    cursor.execute("insert into table some_db.some_tbl SELECT name, country from some_db.some_tbl_2 where country = ?", i)
and received the following error message:
ProgrammingError: ('The SQL contains 0 parameter markers, but 1 parameters were supplied', 'HY000')
If I remove the insert into table some_db.some_tbl portion, it works fine. Not sure what else to do as all documentation and looking at similar questions suggest what I am doing is correct.
If I keep the insert into table some_db.some_tbl portion but remove the parameterization, it works fine.

I am answering this question with a simple workaround as it might help save someone else some time.
Since the 'insert without parameterization' and a 'select with parameterization' work independently, I worked around this issue by looping through my parameters with a select statement and then saving the results to a pandas dataframe. From there, you can then run the insert against the pandas dataframe without parameterization for the current iteration. All of this would be in the body of the loop thus accounting for all parameter values separately.
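The loop described above might look roughly like the sketch below. It is a simplified version under stated assumptions: it reads the rows straight from the cursor with fetchall() instead of going through a pandas dataframe, it works with any DB-API cursor that uses ? markers (pyodbc does), and the helper names sql_literal and copy_rows are made up for illustration. Because the insert is built without parameters, the sketch refuses values containing quotes or backslashes rather than guessing the engine's escaping rules.

```python
def sql_literal(value):
    """Render a value as a SQL literal; refuse anything that needs escaping."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):
        raise ValueError("boolean literals are not handled in this sketch")
    if isinstance(value, (int, float)):
        return str(value)
    if isinstance(value, str):
        if "'" in value or "\\" in value:
            raise ValueError(f"refusing to inline value: {value!r}")
        return "'" + value + "'"
    raise ValueError(f"unsupported type: {type(value).__name__}")


def copy_rows(cursor, source, target, countries):
    """Per country: parameterized SELECT, then an INSERT with inlined literals."""
    copied = 0
    for country in countries:
        # The SELECT on its own accepts the ? marker.
        cursor.execute(
            f"SELECT name, country FROM {source} WHERE country = ?", (country,)
        )
        rows = cursor.fetchall()
        if not rows:
            continue
        values = ", ".join(
            "(" + ", ".join(sql_literal(v) for v in row) + ")" for row in rows
        )
        # The INSERT carries no parameter markers at all.
        cursor.execute(f"INSERT INTO {target} VALUES {values}")
        copied += len(rows)
    return copied
```

With the tables from the question, the call would be copy_rows(cursor, "some_db.some_tbl_2", "some_db.some_tbl", ["USA", "JP"]).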

Pandas read_sql Challenging syntax for postgres query

I am querying a postgres db using python/pandas with sqlalchemy. I have an sql query that looks like this:
SELECT table_name
FROM fixed_602fcccd0f189c2434611b14.information_schema."tables" t
WHERE table_type='BASE TABLE'
and table_schema='di_602fccd10f189c2434611be9'
and (
table_name like 'general_journal%'
or table_name like 'output_journal_line%'
or table_name like 'output_journal_tx%'
or table_name like 'flats%'
)
I've tested it in dBeaver and it works perfectly. I am now trying to pass the same query through pandas read_sql as follows:
from_string = pg_db + '.information_schema."tables"'
print(from_string)
pg_query = queries.id_tables.format(from_string,di_id)
The idea is that I construct the query with variables 'pg_db' (string) and 'di_id' (string) as I make a series of queries. The problem is the query returns empty array when done this way. No error is thrown.
I suspected the problem was the "tables" identifier, e.g. that pandas strips off the double quotes when it interprets the query, but that doesn't actually seem to matter. Any thoughts on how to make this work with pandas?
UPDATE:
I have tried a parameterized query and met with the same problem. It seems to boil down to the FROM parameter being passed in with double quotes. I have tried to strip these, but it looks like pandas appends them anyway. In principle double quotes should be fine according to the postgres docs, but that doesn't seem to be the case even when running the query in dBeaver. If I pass the query to pandas exactly as it is written at the top of this post, there is no problem. The trouble starts when I use variables for the FROM and table_schema parameters: then I get syntax errors.

It turns out that the problem disappeared when I removed the parentheses I had put around the 'or' conditions. I think the lesson is to pay attention to how you construct the query, e.g. form and join all the strings and variables before passing them to pandas.
That said I have used parentheses with much more complex queries in pandas and they were not a problem.
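Assembling the complete query string first and printing it makes it easy to compare against the version that works in dBeaver. Here is a minimal sketch of that idea; the template ID_TABLES is only a stand-in for queries.id_tables (whose real contents are not shown), and the function name and the identifier check are assumptions added for illustration.

```python
import re

# Hypothetical template standing in for queries.id_tables from the question.
ID_TABLES = (
    "SELECT table_name "
    'FROM {db}.information_schema."tables" t '
    "WHERE table_type = 'BASE TABLE' "
    "AND table_schema = '{schema}' "
    "AND (table_name LIKE 'general_journal%' "
    "OR table_name LIKE 'output_journal_line%' "
    "OR table_name LIKE 'output_journal_tx%' "
    "OR table_name LIKE 'flats%')"
)


def build_id_tables_query(pg_db, di_id):
    """Join all pieces into one string before anything is sent to pandas."""
    for name in (pg_db, di_id):
        # Both values are spliced into the SQL text, so keep them to plain names.
        if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name):
            raise ValueError(f"unexpected characters in {name!r}")
    return ID_TABLES.format(db=pg_db, schema=di_id)


pg_query = build_id_tables_query(
    "fixed_602fcccd0f189c2434611b14", "di_602fccd10f189c2434611be9"
)
# Inspect the exact text that would be handed to pandas.read_sql.
print(pg_query)
```

If the printed text differs from the hand-written query in any way (quotes, spacing, a missing keyword), the difference is in the string building rather than in pandas.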

I would first suggest that you use a parameterized query for input, but in some cases it's just easier to use the built-in function repr():
s = "SQL sometimes likes \"here\" and %s \"now\" problems"
print(repr(s))
gives
'SQL sometimes likes "here" and %s "now" problems'

pyodbc execute command not accepting ? parameters correctly?

This code:
cursor.execute('select RLAMBD from ?', OPTable)
print cursor.fetchone().RLAMBD
produces this error:
ProgrammingError: ('42S02', '[42S02] [Oracle][ODBC][Ora]ORA-00903: invalid table name\n (903) (SQLExecDirectW)')
OPTable is an alphanumeric string which I've built from another database query which contains the table name I want to select from.
The following code works just fine within the same script.
sql = 'select RLAMBD from ' + OPTable
cursor.execute(sql)
print cursor.fetchone().RLAMBD
I guess it's not a huge deal to build the sql statements this way, but I just don't understand why it's not accepting the ? parameters. I even have another query in the same script which uses the ? parameterization and works just fine. The parameters for the working query are produced using the raw_input function, though. Is there some subtle difference between the way those two strings might be formatted that's preventing me from getting the query to work? Thank you all.
I'm running python 2.7 and pyodbc 3.0.10.

Parameter placeholders cannot be used to represent object names (e.g., table or column names) or SQL keywords. They are only used to pass data values, e.g., numbers, strings, dates, etc.
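Since the table name has to go into the SQL text itself, a common precaution is to check it before concatenating. A minimal sketch, assuming the names are plain unquoted identifiers (letters, digits, _, $ and #); the helper name select_from and the table name OPTABLE1 are made up for illustration.

```python
import re

# Assumption: only plain, unquoted identifiers are allowed through.
_IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9_$#]*")


def select_from(column, table):
    """Build 'select <column> from <table>' after validating both names."""
    for name in (column, table):
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    return f"select {column} from {table}"


sql = select_from("RLAMBD", "OPTABLE1")
print(sql)
```

The resulting string goes to cursor.execute(sql) as in the working version from the question; any data values in a WHERE clause can still use ? markers.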

retrieve value from sql server and passing it to string to write it into txt file using VB .NET

I am retrieving some data from SQL Server and trying to write it to a text file. I am getting an error when retrieving the data and passing it to a variable. Could you please help me with this?

You don't give a lot of detail regarding the error. Assuming your query is working and returning data to the table, StreamWriter.Write(dt) won't write the data in the table; it will write the name of the table or something like that. To get all the data you need to either loop through the rows and columns and print each one as you like, or use dt.WriteXml(myIOStream), obWRiter.(myIOstream). You can also try streamwriter.write(dt.writexml()). I haven't tried anything out, so you'll need to work on the code.

SQL query syntax error using INSERT INTO

So, I know my code for the database connection and reader is functional, because it has worked for me many times before, however, something about this SQL query:
gives this error message:
when this data is inputted:
This is the database table that I am trying to add the data to:

The issue is that you are using "password" as a column name and that's a reserved word in Jet SQL. Either change the name or escape it in SQL code. You do the latter by wrapping it in square brackets [].
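The bracket escaping can be applied to every column so that no reserved word slips through. The sketch below is in Python for illustration only; the resulting SQL text is what matters and can be used from any language. The helper name jet_insert, the table name Users and the column username are made up; only password comes from the question.

```python
def jet_insert(table, columns):
    """Build a parameterized INSERT with each name wrapped in [brackets]."""
    for name in (table, *columns):
        if "[" in name or "]" in name:
            raise ValueError(f"name already contains a bracket: {name!r}")
    cols = ", ".join(f"[{c}]" for c in columns)
    marks = ", ".join("?" for _ in columns)
    return f"INSERT INTO [{table}] ({cols}) VALUES ({marks})"


statement = jet_insert("Users", ["username", "password"])
print(statement)  # INSERT INTO [Users] ([username], [password]) VALUES (?, ?)
```

Keeping the values as ? markers means only the names are spliced into the text.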
