Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState' - hive

While printing the schema of a SQL database, I am getting the following error:

Use SparkSession instead of SQLContext. So do:
from pyspark.sql import SparkSession

sqlContext = SparkSession.builder \
    .master("local[*]") \
    .appName("appName") \
    .config("spark.sql.warehouse.dir", "./spark-warehouse") \
    .getOrCreate()
The rest of your code should work normally. You may want to rename the variable from sqlContext (e.g. to spark) to reflect the object it now holds.

Related

Code Inspection Warning with pandas.read_csv read from string buffer

My Python environment uses Pandas 1.4.2. I have the following code that reads from a string buffer:
response: requests.Response = session.get(url="...")
data: pandas.DataFrame = pandas.read_csv(io.StringIO(response.content.decode("utf-8")), skiprows=2)
When I run Code Inspection in PyCharm, I get the following warning:
Expected type 'str | PathLike[str] | ReadCsvBuffer[bytes] | ReadCsvBuffer[str]', got 'StringIO' instead
What change should I make to my code to resolve the issue short of suppressing the warning?
I would ignore this warning.
StringIO is meant to be accepted as valid input to read_csv: "By file-like object, we refer to objects with a read() method, such as a file handle (e.g. via builtin open function) or StringIO." (https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html). The issue lies in Pandas' own type annotations (https://github.com/pandas-dev/pandas/blob/v1.4.2/pandas/io/parsers/readers.py#L584-L680) or in PyCharm's handling of them.
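As a quick runtime check (a minimal sketch with inline CSV text standing in for the decoded HTTP response body), read_csv accepts a StringIO without complaint:

```python
import io

import pandas

# Inline CSV standing in for response.content.decode("utf-8");
# the first two lines mimic a preamble skipped via skiprows=2.
text = "preamble line 1\npreamble line 2\na,b\n1,2\n3,4\n"

data = pandas.read_csv(io.StringIO(text), skiprows=2)
print(data.shape)  # two data rows, two columns
```

The warning is purely static: at runtime the StringIO satisfies the "object with a read() method" contract the docs describe.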

Scala Sparksession not initialised properly in junit dataframe foreach

I have Spark 2.4 / Scala 2.11 code that does some work using a foreach over a DataFrame. In unit tests I'm using a local SparkSession object.
The issue is that inside the foreach loop I'm trying to create and modify a DataFrame using the Seq.toDF method. For some weird reason the SparkSession inside the loop is not instantiated properly, and I'm getting a NullPointerException in SQLContext when attempting that manipulation.
Any ideas how I can ensure that the SparkSession object inside the loop is valid, so that the unit test executes properly?
Additional opinion
It seems Spark holds a connection to a file, because of which the session cannot be serialized, hence it fails in JUnit. More generally, the SparkSession lives on the driver and is not available inside closures that run on executors, which is why creating a DataFrame with toDF inside foreach fails.

SQLAlchemy Column Types As Generic (BIT Type)

I am trying to list out the columns of each table in an MSSQL database. I can get them fine, but I need to turn them into generic types for use elsewhere. (Python types would be more ideal but not sure how to do that?)
My code so far works until I come across a BIT type column.
from sqlalchemy import *
from sqlalchemy.engine import URL

connection_string = (
    f"DRIVER={{ODBC Driver 17 for SQL Server}};"
    f"SERVER={auth['server']};DATABASE={auth['database']};"
    f"UID={auth['username']};PWD={auth['password']}"
)
connection_url = URL.create("mssql+pyodbc", query={"odbc_connect": connection_string})
engine = create_engine(connection_url)

metadata = MetaData(schema="Audit")
metadata.reflect(bind=engine)
for table in metadata.sorted_tables:
    print('\n' + table.name)
    for col in table.columns:
        name = col.name
        col_type = col.type.as_generic()  # avoid shadowing the builtin "type"
        print(name, col_type)
I get the error:
NotImplementedError: Default TypeEngine.as_generic() heuristic method was unsuccessful for sqlalchemy.dialects.mssql.base.BIT. A custom as_generic() method must be implemented for this type class.
I have tried a number of things to work around this, but I'd like to learn what I need to do to fix it properly.
Can I make a custom method that turns BIT to INTEGER? And if so how can I implement it?
Thanks,
Solved: it was a bug; see the comments on the first post.
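For SQLAlchemy versions that still carry the bug, one possible workaround (a minimal sketch, not the library's own fix) is to attach a custom as_generic() to the MSSQL BIT type. Mapping BIT to the generic Boolean is an assumption here; map it to Integer instead if that fits your downstream use better:

```python
from sqlalchemy import Boolean
from sqlalchemy.dialects.mssql import BIT

# Sketch: give MSSQL's BIT an as_generic() so reflection code like the
# loop above can treat it like any other column type.
def _bit_as_generic(self, allow_nulltype=False):
    return Boolean()

BIT.as_generic = _bit_as_generic

print(type(BIT().as_generic()).__name__)  # Boolean
```

With the patch applied before metadata.reflect(), the col.type.as_generic() loop no longer raises NotImplementedError on BIT columns.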

When trying to create a DataFrame getting 'TypeError: 'dict' object is not callable' despite calling a series, not a dict?

I have created a series from an existing data frame using value_counts(), and want to turn the output of this into a new data frame, as below:
yeardata= dataset1['Year'].value_counts()
totals = pd.DataFrame(yeardata)
and I am getting the following error:
TypeError: 'dict' object is not callable
I don't understand this, as nowhere within that code am I trying to call a dict. Using type() on yeardata confirms that it is a Series.
I swear this code was working earlier, and I haven't changed anything above it, but it's now suddenly raising an error.
Does anyone know what the issue is?
thanks!
You can also use to_frame():
totals = yeardata.to_frame()
Use
totals = pd.DataFrame.from_dict(yeardata)
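A minimal sketch of the to_frame() route, with a small made-up dataset standing in for dataset1. (Note that this particular TypeError is often caused by pd.DataFrame having been accidentally reassigned to a dict earlier in the session; restarting the interpreter clears that.)

```python
import pandas as pd

# Made-up data standing in for dataset1 from the question.
dataset1 = pd.DataFrame({"Year": [2021, 2021, 2021, 2020, 2020, 2019]})

yeardata = dataset1["Year"].value_counts()  # Series of counts per year
totals = yeardata.to_frame()                # one-column DataFrame

print(totals.shape)  # (3, 1)
```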

return a computed field in a serialized django object?

I'm writing an API using Django, and I'm running into some issues around returning data that isn't stored in the database directly, or in other cases organized differently than the database schema.
In particular, given a particular data request, I want to add a field of computed data to my model before I serialize and return it. However, if I just add the field to the model, the built-in serializer (I'm using json) ignores it, presumably because it's getting the list of fields from the model definition.
I could write my own serializer, but what a pain. Or I guess I could run model_to_dict, then serialize the dict instead of the model. Anyone have any better ideas?
Here's what the code vaguely looks like right now:
squidlets = Squidlet.objects.filter(stuff)
for i in range(len(squidlets)):
    squidlets[i].newfield = do_some_computation(squidlets[i])
return HttpResponse(json_serializer.serialize(squidlets, ensure_ascii=False),
                    'text/json')
But newfield ain't in the returned json.
I think you should serialize using simplejson; it doesn't have to be a queryset. To escape it as JSON, also use mark_safe:
from django.utils.safestring import mark_safe
from django.utils import simplejson

simplejson.dumps(mark_safe(your_data_structure))
I went with the dict solution, which turned out to be fairly clean.
Here's what the code looks like:
from django.forms.models import model_to_dict
squiddicts = []
squidlets = Squidlet.objects.filter(stuff)
for i in range(len(squidlets)):
    squiddict = model_to_dict(squidlets[i])
    squiddict["newfield"] = do_some_computation(squidlets[i])
    squiddicts.append(squiddict)
return HttpResponse(simplejson.dumps(squiddicts, ensure_ascii=False),
                    'text/json')
This is maybe slightly more verbose than necessary but I think it's clearer this way.
This does still feel somewhat unsatisfying, but seems to work just fine.
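Outside of Django, the dict approach boils down to plain Python. A sketch with made-up records and a stand-in computation (the dicts and do_some_computation here are placeholders for the question's model instances and logic):

```python
import json

# Placeholder for the question's computed field.
def do_some_computation(squidlet):
    return squidlet["a"] + squidlet["b"]

# Made-up records standing in for Squidlet.objects.filter(...).
squidlets = [{"a": 1, "b": 2}, {"a": 3, "b": 4}]

squiddicts = []
for squidlet in squidlets:
    squiddict = dict(squidlet)                          # like model_to_dict
    squiddict["newfield"] = do_some_computation(squidlet)
    squiddicts.append(squiddict)

payload = json.dumps(squiddicts, ensure_ascii=False)
print(payload)
```

The computed field travels with each record because the serializer sees only plain dicts, not the model's declared fields.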