Superset with Impala - Invalid Session id/No protocol version header - impala

I have Superset using Impala as the main data source. Most of the times, every query runs smoothly and I can build charts and dashboards with ease. I need to generate a Table Chart, containing around 100k records and 30+ columns, but I am having some issues. It is basically a SELECT *, no aggregations, filtering or ordering
are being used.
When the data is relatively big, Superset just throws a bunch of errors (It appears to be that the errors are coming from Impala). But I cannot find any
information regarding those errors. I have tried paginating the results, but it did not worked. Also, when I run the query in Superset Chart page, it doesn't take long, it just
displays the error. The only way some information gets displayed in the Table Chart is when I limit the rows at the "Row limit" option to 10 records. But, this will not work out for me.
Those are the errors that keep ocurring:
impala error: Invalid session id: f344bf1aa2a42e2b:ad1df0047d7f909c
impala error: No protocol version header
When I use the Oracle connection that I also have, I can generate a table chart from a large amount of records with no problem.
My setup is the following:
Impala v3.2.0-cdh6.3.3
Superset v0.36.0
So, is that a problem with Superset or Impala? Could have something to with configuration in Superset?

Related

Pyspark - identical dataframe filter operation gives different output

I’m facing a particularly bizarre issue while firing filter queries on a spark dataframe. Here's a screenshot of the filter command I'm trying to run:
As you can see, I'm trying to run the same command multiple times. Each time, it's giving a different number of rows. It is actually meant to return 6 records, but it ends up showing a random number of records every time.
FYI, The underlying data source (from which I'm creating the dataframe) is an Avro file in a Hadoop data lake.
This query only gives me consistent results if I cache the dataframe. But this is not always possible for me because the dataframe might be very huge and hence would choke up memory resources if I cache it.
What might be the possible reasons for this random behavior? Any advice on how to fix it?
Many thanks :)

Connection Timeout Error while reading the table having more than 100 columns in Mosaic Decisions

I am reading a table via snowflake reader node having less number of columns/attributes(around 50-80),the table is getting read on the Mosaic decisions Canvas. But when the attributes of table increases (approx 385 columns),Mosaic reader node fails. As a workaround I tried using the where clause with 1=2,in that case it is pulling the structure of the Table. But when I am trying to read the records even by applying the limit (only 10 records) to the query, it is throwing connection timeout Error.
Even I faced similar issue while reading (approx. 300 columns) table and I managed it with the help of input parameters available in Mosaic. In your case you will have to change the copy field variable to 1=1 used in the query at run time.
Below steps can be referred to achieve this -
Create a parameter (e.g. copy_variable) that will contain the default value 2 for the copy field variable
In reader node, write the SQL with 1 = $(copy_variable) So while validating, it’s same as 1=2 condition and it should validate fine.
Once validated and schema is generated, update the default value of $(copy_variable) to 1 so that while running, you will still get all records.

BQ Switching to TIMESTAMP Partitioned Table

I'm attempting to migrate IngestionTime (_PARTITIONTIME) to TIMESTAMP partitioned tables in BQ. In doing so, I also need to add several required columns. However, when I flip the switch and redirect my dataflow to the new TIMESTAMP partitioned table, it breaks. Things to note:
Approximately two million rows (likely one batch) is successfully inserted. The job continues to run but doesn't insert anything after that.
The job runs in batches.
My project is entirely in Java
When I run it as streaming, it appears to work as intended. Unfortunately, it's not practical for my use case and batch is required.
I've been investigating the issue for a couple of days and tried to break down the transition into the smallest steps possible. It appears that the step responsible for the error is introducing REQUIRED variables (it works fine when the same variables are NULLABLE). To avoid any possible parsing errors, I've set default values for all of the REQUIRED variables.
At the moment, I get the following combination of errors and I'm not sure how to address any of them:
The first error, repeats infrequently but usually in groups:
Profiling Agent not found. Profiles will not be available from this
worker
Occurs a lot and in large groups:
Can't verify serialized elements of type BoundedSource have well defined equals method. This may produce incorrect results on some PipelineRunner
Appears to be one very large group of these:
Aborting Operations. java.lang.RuntimeException: Unable to read value from state
Towards the end, this error appears every 5 minutes only surrounded by mild parsing errors described below.
Processing stuck in step BigQueryIO.Write/BatchLoads/SinglePartitionWriteTables/ParMultiDo(WriteTables) for at least 20m00s without outputting or completing in state finish
Due to the sheer volume of data my project parses, there are several parsing errors such as Unexpected character. They're rare but shouldn't break data insertion. If they do, I have a bigger problem as the data I collect changes frequently and I can adjust the parser only after I see the error, and therefore, see the new data format. Additionally, this doesn't cause the ingestiontime table to break (or my other timestamp partition tables to break). That being said, here's an example of a parsing error:
Error: Unexpected character (',' (code 44)): was expecting double-quote to start field name
EDIT:
Some relevant sample code:
public PipelineResult streamData() {
try {
GenericSection generic = new GenericSection(options.getBQProject(), options.getBQDataset(), options.getBQTable());
Pipeline pipeline = Pipeline.create(options);
pipeline.apply("Read PubSub Events", PubsubIO.readMessagesWithAttributes().fromSubscription(options.getInputSubscription()))
.apply(options.getWindowDuration() + " Windowing", generic.getWindowDuration(options.getWindowDuration()))
.apply(generic.getPubsubToString())
.apply(ParDo.of(new CrowdStrikeFunctions.RowBuilder()))
.apply(new BigQueryBuilder().setBQDest(generic.getBQDest())
.setStreaming(options.getStreamingUpload())
.setTriggeringFrequency(options.getTriggeringFrequency())
.build());
return pipeline.run();
}
catch (Exception e) {
LOG.error(e.getMessage(), e);
return null;
}
Writing to BQ. I did try to set the partitoning field here directly, but it didn't seem to affect anything:
BigQueryIO.writeTableRows()
.to(BQDest)
.withMethod(Method.FILE_LOADS)
.withNumFileShards(1000)
.withTriggeringFrequency(this.triggeringFrequency)
.withTimePartitioning(new TimePartitioning().setType("DAY"))
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER);
}
After a lot of digging, I found the error. I had parsing logic (a try/catch) that returned nothing (essentially a null row) in the event there was a parsing error. This would break BigQuery as my schema had several REQUIRED rows.
Since my job ran in batches, even one null row would cause the entire batch job to fail and not insert anything. This also explains why streaming inserted just fine. I'm surprised that BigQuery didn't throw an error claiming that I was attempting to insert a null into a required field.
In reaching this conclusion, I also realized that setting the partition field in my code was also necessary as opposed to just in the schema. It could be done using
.setField(partitionField)

SAP DBTech JDBC: [2048]: column store error: search table error: [2724] Olap temporary data size exceeded 31/32 bit limit"

I have a calculation view which is based on other calculation views and joins to bring material Accounts data from different vendors (all joins have 1-1 mapping with target). in the final view I have a calculated column as Formatted_MATERIAL (Material numbers without any leading zeros, used Ltrim() to remove leading zeros.)
Now, when I'm searching Formatted_MATERIAL equal to some specific number it's showing read error (heading). If I'm searching for some range of material it's giving results.
For example, if I search for material (500098), it's present in following query results
select "Formatted_MATERIAL"
FROM "_SYS_BIC"."CA_REPORTS_001_VK"
where "Formatted_MATERIAL" between 5000000 and 6000000
order by "Formatted_MATERIAL"
but no results for
select "Formatted_MATERIAL"
FROM "_SYS_BIC"."CA_REPORTS_001_VK"
where "Formatted_MATERIAL" = 5000098
The cause of the error is that during some processing step in one of the views you're using, the intermediate result set exceeds 2 billion records.
Based on my experience with typical HANA use cases (that would mostly be use cases in relation to SAP products) I am pretty sure that the way these underlying views have been modelled is not really right. Whenever you try and join or aggregate an intermediate result set of two billion records at once, chances are that important operations like filtering, projection and aggregation should have been done much earlier in the model.
Of course, without seeing the model(s) and the execution details (use PlanViz for this) and knowing with HANA version you're using, there is nothing we can say about how to solve this issue.

BigQuery streaming data not available instantly

Since couple of days some data i am streaming to bigquery is not available instantly (as it normally happens) within bigquery web ui after being inserted successfully.
My use case consists of inserting thousand of lines using :
bigquery.tabledata().insertAll(...)
The results of the streaming inserts into the table are :
(i am also checking for insertErrors to be sure as described here):
BigQuery insert status : {"kind":"bigquery#tableDataInsertAllResponse"}
BigQuery insert errors : null
Total number of lines available in bigquery web ui is different that total inserted.
I would be grateful for any help.
Bigquery project details :
Project ID : favorable-beach-87616
Table : mtp_UA_xxxx_1_20150410
Project dependencies on google libraries:
compile 'com.google.api-client:google-api-client:1.19.0'
compile 'com.google.http-client:google-http-client:1.19.0'
compile 'com.google.http-client:google-http-client-jackson2:1.19.0'
compile 'com.google.oauth-client:google-oauth-client:1.19.0'
compile 'com.google.oauth-client:google-oauth-client-servlet:1.19.0'
compile 'com.google.apis:google-api-services-bigquery:v2-rev171-1.19.0'
compile 'com.google.api-client:google-api-client:1.17.0-rc'
Great thanks in advance for your help!
When you say the total number of lines available in the web UI, do you mean the number of rows that show up in the 'details' pane on the table, or the number of rows that are returned if you do a SELECT COUNT(*) query?
If the former, that is expected, since that counter only returns the number of rows that have been flushed to long-term storage (as opposed to the short-term storage buffers the streaming data originally gets written to). This is admittedly confusing, and we are working on a fix.
If the latter, the rows don't show up in a query, that is more concerning. If that is the case, please let us know and we'll investigate.