Tableau Impala empty dates - impala

I've been creating tableau dashboards for a while now using the ZN(Lookup(SUM(field), 0)) trick for a while now to pad empty dates with zeros.
However, when using the same trick with a live Impala connection to a Hadoop cluster, it seems that the output is no longer padded.
Does anyone know why it no longer works and what the potential solution would be?

Related

Failed to load FileDescriptorProto for '_CLOUD_QUERY_METADATA_SCHEMA_'

My Firebase project is integrated with BigQuery, so all raw Google Analytics events are exported daily & streamed to a dedicated collection.
Since today even simple queries on those events are failing with an error:
Error running query
Failed to load FileDescriptorProto for
'CLOUD_QUERY_METADATA_SCHEMA': ;Field number 23 has already been
used in "Msg_0_CLOUD_QUERY_TABLE" by field "items".
An example query which is failing:
SELECT * FROM `project.analytics_184030700.events_*` WHERE event_name IN ("share")
As I mentioned, those (and more advanced) queries used to run until yesterday. I did not change the schema nor any other configuration in the meantime. I've also noticed that BigQuery was updated yesterday.
Looking at the error description, looks like my table schema indeed contains a field called items (a very last one, 23rd) but it was automatically added by Google Analytics.
My suspicions:
Something went wrong with the recent BigQuery release
Something went wrong during daily sync Google Analytics -> BigQuery
Some old job or cache is getting in the way of new queries
At this point I have no idea what to try next. Does anyone have any insight into what could be causing this error?
EDIT:
I noticed that this problem was also just reported in the Google Issue Tracker here: https://issuetracker.google.com/issues/192325507.
I have same issue and I didn't solve it yet but as you said it's cause is Firebase I guess. There's an extra field problem which are limited only for three days (26th,27th and 28th June).
I checked all data older than 26th June but there was no privacy_info field. As you see there is no privacy_info field again for 29th June. I think firebase put this new field but they changed their mind for some reason. But this causes a big problem for us.
Update:
I changed this part:
SELECT * FROM `project.analytics_184030700.events_*`
Like this:
SELECT * FROM `project.analytics_184030700.events_2*`
Interestingly this worked for me.
You can do a workaround for that issue; It seems there are problems with the field
privacy_info
If you select multiple table partitions, just make sure you only select the fields you need, and omit the field privacy_info.
Not using "SELECT *" did resolve this error for me.

Pyspark - identical dataframe filter operation gives different output

I’m facing a particularly bizarre issue while firing filter queries on a spark dataframe. Here's a screenshot of the filter command I'm trying to run:
As you can see, I'm trying to run the same command multiple times. Each time, it's giving a different number of rows. It is actually meant to return 6 records, but it ends up showing a random number of records every time.
FYI, The underlying data source (from which I'm creating the dataframe) is an Avro file in a Hadoop data lake.
This query only gives me consistent results if I cache the dataframe. But this is not always possible for me because the dataframe might be very huge and hence would choke up memory resources if I cache it.
What might be the possible reasons for this random behavior? Any advice on how to fix it?
Many thanks :)

Updating Parquet datasets where the schema changes overtime

I have a single parquet file that I have been incrementally been building every day for several months. The file size is around 1.1GB now and when read into memory it approaches my PCs memory limit. So, I would like to split it up into several files base on the year and month combination (i.e. Data_YYYYMM.parquet.snappy) that will all be in a directory.
My current process reads in the daily csv that I need to append, reads in the historical parquet file with pyarrow and converts to pandas, concats the new and historical data in pandas (pd.concat([df_daily_csv, df_historical_parquet])) and then writes back to a single parquet file. Every few weeks the schema of the data can change (i.e. a new column). With my current method this is not an issue since the concat in pandas can handle the different schemas and I overwriting it each time.
By switching to this new setup I am worried about having inconsistent schemas between months and then being unable to read in data over multiple months. I have tried this already and gotten errors due to non matching schemas. I thought might be able to specify this with the schema parameter in pyarrow.parquet.Dataset. From the doc it looks like it takes a type of pyarrow.parquet.Schema. When I try using this I get AttributeError: module 'pyarrow.parquet' has no attribute 'Schema'. I also tried taking the schema of a pyarrow Table (table.schema) and passing that to the schema parameter but got an error msg (sry I forget error right now and cant connect workstation right now so cant reproduce error - I will update with this info when I can).
I've seen some mention of schema normalization in the context of the broader Arrow/Datasets project but I'm not sure if my use case fits what that covers and also the Datasets feature is experimental so I dont want to use it in production.
I feel like this is a pretty common use case and I wonder if I am missing something or if parquet isn't meant for schema changes over time like I'm experiencing. I've considered investigating the schema of the new file and comparing vs historical and then if there is change deserializing, updating schema, and reserializing every file in the dataset but I'm really hoping to avoid that.
So my questions are:
Will using a pyarrow parquet Dataset (or something else in the pyarrow API) allow me to read in all of the data in multiple parquet files even if the schema is different? To be specific, my expectation is that the new column would be appended and the values prior to when this column were available would be null). If so, how do you do this?
If the answer to 1 is no, is there another method or library for handling this?
Some resources I've been going through.
https://arrow.apache.org/docs/python/dataset.html
https://issues.apache.org/jira/browse/ARROW-2659
https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html#pyarrow.parquet.ParquetDataset

Why is pyspark so much slower in finding the max of a column?

Is there a general explanation, why spark needs so much more time to calculate the maximum value of a column?
I imported the Kaggle Quora training set (over 400.000 rows) and I like what spark is doing when it comes to rowwise feature extraction. But now I want to scale a column 'manually': find the maximum value of a column and divide by that value.
I tried the solutions from Best way to get the max value in a Spark dataframe column and https://databricks.com/blog/2015/06/02/statistical-and-mathematical-functions-with-dataframes-in-spark.html
I also tried df.toPandas() and then calculate the max in pandas (you guessed it, df.toPandas took a long time.)
The only thing I did ot try yet is the RDD way.
Before I provide some test code (I have to find out how to generate dummy data in spark), I'd like to know
can you give me a pointer to an article discussing this difference?
is spark more sensitive to memory constraints on my computer than pandas?
As #MattR has already said in the comment - you should use Pandas unless there's a specific reason to use Spark.
Usually you don't need Apache Spark unless you encounter MemoryError with Pandas. But if one server's RAM is not enough, then Apache Spark is the right tool for you. Apache Spark has an overhead, because it needs to split your data set first, then process those distributed chunks, then process and join "processed" data, collect it on one node and return it back to you.
#MaxU, #MattR, I found an intermediate solution that also makes me reassess Sparks laziness and understand the problem better.
sc.accumulator helps me define a global variable, and with a separate AccumulatorParam object I can calculate the maximum of the column on the fly.
In testing this I noticed that Spark is even lazier then expected, so this part of my original post ' I like what spark is doing when it comes to rowwise feature extraction' boils down to 'I like that Spark is doing nothing quite fast'.
On the other hand a lot of the time spent on calculating the maximum of the column has most presumably been the calculation of the intermediate values.
Thanks for yourinput and this topic really got me much further in understanding Spark.

BigQuery: result to my daily query is now NULL since March 15th

This question/bug is mainly for the Google BigQuery team.
I have a daily report in Tableau that connect to a Google BigQuery live Connection. This report has been running for over a year without problems. Since March 15th however, the report is no longer working and the result of the gbq queries generated by Tableau now returns 'null'.
Note: The version of Tableau and version of the BigQuery driver have not changed for over a month. So, nothing has changed on our side. I have also checked in the Query History and the generated queries have always been the same in the last weeks.
One simple query that is generated by Tableau and that now returns 'null' looks like this:
SELECT (CASE WHEN 1000000 = 0 THEN NULL ELSE FLOAT([log_time]) / 1000000 END)
AS [none_Calculation_0500516094317075_ok]
FROM [GDT.MissingItems] [sqlproxy]
GROUP BY 1
This query comes from a simple calculated field created in Tableau that is divided by 1000000 and is cast to a INT. The job_id is job_ydTIq1c_ydnyua4s4SW3zJj00fs
This looks to me like something has changed recently and that is causing the query to now return 'null' instead of what it should return. This is a big problemfor us as we are using this report for operational purposes.
I posted my question/problem in Stackoverflow as mentioned in the Google BigQuery Support page:
https://developers.google.com/bigquery/support
This was a bug in the incorrect application of an optimization in the query execution engine. It has been fixed and we expect to release the fix today (it is possible that the fix won't go live until monday, because we often try to avoid making production changes last minute before the weekend).
The workaround in the meantime would be to use 0.0 rather than null in the case statement.