Warning message while using a window function in SparkSQL dataframe - hive

I am getting the warning message below when I use a window function in Spark SQL. Can anyone please let me know how to fix this issue?
Warning Message:
No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
My Code:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag

def calcPrevBrdrx(df: DataFrame): DataFrame = {
  val w = Window.orderBy("existing_col1")
  df.withColumn("new_col", lag("existing_col2", 1).over(w))
}

The warning is exactly what it says. In general, when you use a window function, you first partition by some column and only then order. For example, if you had logs for a user, you might partition by the user and then order by time, which would do the sorting separately for each user.
If you do not have a PARTITION BY, you are sorting the entire data frame. This effectively means you have a single partition: all rows of the dataframe move to that one partition and are sorted there.
This is slow (you shuffle everything and then sort everything), and worse, it means all your data must fit in a single partition, which does not scale.
You should probably review your logic to make sure you really need to sort everything, rather than partitioning by something first.
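To make the partition-then-order pattern concrete, here is the same windowing idea illustrated in pandas rather than Spark (the column names are made up): a per-user lag computed within each partition, so each group can be sorted independently.

```python
import pandas as pd

# Hypothetical per-user event log; "user" plays the role of the partition column.
df = pd.DataFrame({
    "user": ["a", "b", "a", "b"],
    "t":    [2, 1, 1, 2],
    "val":  [20, 30, 10, 40],
})

# Partition by user, order by time, then lag by one row: each group is
# sorted and shifted independently, so no single partition holds all rows.
df = df.sort_values(["user", "t"])
df["prev_val"] = df.groupby("user")["val"].shift(1)
```

The first row of each group has no predecessor, so its lag is null, exactly as `lag(...).over(w)` behaves in Spark.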

If your logic demands using ORDER BY without a PARTITION BY clause, perhaps because you have nothing else to partition on or it doesn't make sense for the window function used, you can add a dummy value like below:
.withColumn("id", explode(typedLit((1 to 100).toList)))
This will create an id field with values from 1 to 100 for each row in the original dataframe; use it in the PARTITION BY clause (PARTITION BY id), and it will launch 100 tasks. The total number of rows created will be the current row count × 100, so make sure you drop the id field and apply distinct to the result.

Related

Improve performance of deducting values of same table in SQL

For a metering project I use a simple SQL table in the following format:
ID
Timestamp: dat_Time
Metervalue: int_Counts
Meterpoint: fk_MetPoint
While this works nicely in general I have not found an efficient solution for one specific problem: There is one Meterpoint which is a submeter of another Meterpoint. I'd be interested in the Delta of those two Meterpoints to get the remaining consumption. As the registration of counts is done by one device I get datapoints for the various Meterpoints at the same Timestamp.
I think I found a solution applying a subquery which appears to be not very efficient.
SELECT
A.dat_Time,
(A.int_Counts- (SELECT B.int_Counts FROM tbl_Metering AS B WHERE B.fk_MetPoint=2 AND B.dat_Time=A.dat_Time)) AS Delta
FROM tbl_Metering AS A
WHERE fk_MetPoint=1
How could I improve this query?
Thanks in advance
You can try using a window function instead:
SELECT m.dat_Time,
(m.int_counts - m.int_counts_2) as delta
FROM (SELECT m.*,
MAX(CASE WHEN fk_MetPoint = 2 THEN int_counts END) OVER (PARTITION BY dat_time) as int_counts_2
FROM tbl_Metering m
) m
WHERE fk_MetPoint = 1
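For reference, here is a runnable sketch of that window-function query against SQLite via Python (the sample rows are invented; SQLite needs version 3.25+ for window functions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tbl_Metering (dat_Time TEXT, int_Counts INTEGER, fk_MetPoint INTEGER);
INSERT INTO tbl_Metering VALUES
  ('2024-01-01 00:00', 100, 1),
  ('2024-01-01 00:00',  30, 2),
  ('2024-01-01 00:15', 110, 1),
  ('2024-01-01 00:15',  35, 2);
""")

# Spread meterpoint 2's reading across every row with the same timestamp,
# then subtract it from meterpoint 1's reading.
rows = conn.execute("""
SELECT dat_Time, int_Counts - int_counts_2 AS Delta
FROM (SELECT m.*,
             MAX(CASE WHEN fk_MetPoint = 2 THEN int_Counts END)
                 OVER (PARTITION BY dat_Time) AS int_counts_2
      FROM tbl_Metering m) m
WHERE fk_MetPoint = 1
ORDER BY dat_Time
""").fetchall()
# rows -> [('2024-01-01 00:00', 70), ('2024-01-01 00:15', 75)]
```

Because the window is partitioned by timestamp, the submeter value is looked up once per timestamp group instead of once per row as in the correlated subquery.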
From a query point of view, you should at a minimum change to a set-based approach instead of an inline sub-query for each row. A GROUP BY would do, but this is a good candidate for a windowing query, just as suggested by the "Great" Gordon Linoff.
However, if this is a metering project, then we can expect a high volume of records, if not now, then certainly over time.
I would recommend you look into altering the input so that the delta is stored as its own first-class column. This moves much of the performance cost to the write process, which presumably occurs only once per record, whereas your SELECT will be executed many times.
This can be done using an INSTEAD OF trigger, or you could write it into the business logic. In a recent IoT project we computed and stored these additional properties with each inserted reading, which greatly simplified many types of aggregate and analysis queries:
Id of the Previous sequential reading
Timestamp of the Previous sequential reading
Value Delta
Time Delta
Number of readings between this and the previous reading
The last one sounds close to your scenario; we were deliberately batching multiple sequential readings into a single record.
You could also process the received data into a separate table that includes this level of aggregation information, so as not to pollute the raw feed and to allow you to re-process it on demand.
You could redirect your analysis queries to this second table, which is now effectively a data warehouse of sorts.
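A minimal sketch of the write-time approach, done in application code rather than a trigger (SQLite via Python; the readings table and its derived columns are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE readings (
    id INTEGER PRIMARY KEY,
    dat_time TEXT,
    value INTEGER,
    prev_id INTEGER,       -- id of the previous sequential reading
    value_delta INTEGER    -- delta against that previous reading
)""")

def insert_reading(conn, dat_time, value):
    # Look up the previous reading once, at write time, so later analysis
    # queries can read value_delta directly instead of self-joining.
    prev = conn.execute(
        "SELECT id, value FROM readings ORDER BY id DESC LIMIT 1").fetchone()
    prev_id, delta = (prev[0], value - prev[1]) if prev else (None, None)
    conn.execute(
        "INSERT INTO readings (dat_time, value, prev_id, value_delta) "
        "VALUES (?, ?, ?, ?)", (dat_time, value, prev_id, delta))

insert_reading(conn, '00:00', 100)
insert_reading(conn, '00:15', 110)
```

The write pays a small extra lookup; every subsequent analysis query gets the delta for free.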

Applying function to unique values for efficiency in pandas

This is a general question about how to apply a function efficiently in pandas. I often encounter situations where I need to apply a function to a pd.Series and it would be faster to apply the function only to unique values.
For example, suppose I have a very large dataset. One column is date, and I want to add a column that gives the last date of the quarter for date. I would do this:
mf['qtr'] = pd.Index(mf['date']) + pd.offsets.QuarterEnd(0)
But for large data sets, this can take a while. So to speed it up, I'll extract the unique values of date, apply the function to those, and then merge it back in to the original data:
dts = mf['date'].drop_duplicates()
eom = pd.Series(pd.Index(dts) + pd.offsets.QuarterEnd(0), index=dts)
eom.name = 'qtr'
mf = pd.merge(mf, eom.reset_index())
This can be much faster than the one-liner above.
So here's my question: Is this really the right way to do things like this, or is there a better approach?
And, would it make sense and be feasible to add a feature to pandas that would take this unique/apply/merge approach automatically? (It wouldn't work for certain functions, such as those that rely on rolling data, so presumably the user would have to explicitly request this behavior.)
I'd personally just group on the date column and then call your function for each group:
mf.groupby('date',as_index=False)['date'].apply(lambda x: x + pd.offsets.QuarterEnd(0))
I think that should work.
EDIT
OK, the above doesn't work, but the following does. I think it's a bit convoluted, though:
mf.groupby('date', as_index=False)['date'].apply(lambda x: (pd.Index(x)+ QuarterEnd(0))[0])
We create a DatetimeIndex for each date group, add the offset, and then access the single element to return the value. Personally, I don't think this is great.
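For comparison, the unique/apply/merge idea from the question can also be expressed with `Series.map`, which avoids both the merge and the groupby (a sketch with made-up dates):

```python
import pandas as pd

mf = pd.DataFrame({"date": pd.to_datetime(
    ["2020-01-15", "2020-02-01", "2020-01-15", "2020-05-20"])})

# Compute the quarter end once per unique date, then map it back onto
# every row -- the same work reduction as drop_duplicates + merge.
dts = mf["date"].drop_duplicates()
qtr_map = pd.Series(pd.Index(dts) + pd.offsets.QuarterEnd(0), index=dts)
mf["qtr"] = mf["date"].map(qtr_map)
```

Like the merge version, the expensive offset arithmetic runs once per distinct date rather than once per row.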

"Shuffle failed with error" message when trying to use GROUP BY to return a table's distinct rows

We have a 1.01 TB table with known duplicates that we are trying to de-duplicate using GROUP EACH BY.
There is an error message we'd like some help deciphering:
Query Failed
Error:
Shuffle failed with error: Cannot shuffle more than 3.00T in a single shuffle. One of the shuffle partitions in this query exceeded 3.84G. Strategies for working around this error are available on the go/dremelfaq.
Job ID: job_MG3RVUCKSDCEGRSCSGA3Z3FASWTSHQ7I
The query as you'd imagine does quite a bit and looks a little something like this
SELECT Twenty, Different, Columns, For, Deduping, ...
including_some, INTEGER(conversions),
plus_also, DAYOFWEEK(SEC_TO_TIMESTAMP(INTEGER(timestamp_as_string))), conversions,
and_also, HOUROFDAY(SEC_TO_TIMESTAMP(INTEGER(timestamp_as_string))), conversions,
and_a, IF(REGEXP_MATCH(long_string_field,r'ab=(\d+)'),TRUE, NULL) as flag_for_counting,
with_some, joined, reference, columns,
COUNT(*) as duplicate_count
FROM [MainDataset.ReallyBigTable] as raw
LEFT OUTER JOIN [RefDataSet.ReferenceTable] as ref
ON ref.id = raw.refid
GROUP EACH BY ... all columns in the select bar the count...
Question
What does this error mean? Is it trying to do this kind of shuffling? ;-)
And finally, is the dremelfaq referenced in the error message available outside of Google, and would it help in understanding what's going on?
Side Note
For completeness, we tried a more modest GROUP EACH BY:
SELECT our, twenty, nine, string, column, table,
count(*) as dupe_count
FROM [MainDataSet.ReallyBigTable]
GROUP EACH BY all, those, twenty, nine, string, columns
And we receive a more subtle
Error: Resources exceeded during query execution.
Job ID: job_D6VZEHB4BWZXNMXMMPWUCVJ7CKLKZNK4
Should Bigquery be able to perform these kind of de-duplication queries? How should we best approach this problem?
Actually, the shuffling involved is closer to this: http://www.youtube.com/watch?v=KQ6zr6kCPj8.
When you use the 'EACH' keyword, you're instructing the query engine to shuffle your data... you can think of it as a giant sort operation.
This is likely pushing close to the cluster limits that we've set in BigQuery. I'll talk to some of the other folks on the BigQuery team to see if there is a way to make your query work.
In the meantime, one option would be to partition your data into smaller tables and do the de-duplication on those smaller tables, then use table copy/append operations to create your final output table. To partition your data, you can do something like:
(SELECT * from [your_big_table] WHERE ABS(HASH(column1) % 10) == 1)
Unfortunately, this is going to be expensive, since it will require running the query over your 1 TB table 10 times.
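The hash-partitioning idea can be sketched in plain Python: bucket rows by a stable hash of a key, de-duplicate each bucket independently, then combine. This is essentially what the per-partition queries above do (the data and bucket count here are invented):

```python
import hashlib

rows = [("a", 1), ("b", 2), ("a", 1), ("c", 3), ("b", 2)]

def bucket_of(row, n_buckets=10):
    # Stable hash of the key column, mirroring ABS(HASH(column1)) % 10.
    digest = hashlib.md5(str(row[0]).encode()).hexdigest()
    return int(digest, 16) % n_buckets

# Identical rows always hash to the same bucket, so each bucket can be
# de-duplicated on its own without ever seeing the full dataset at once.
buckets = {}
for row in rows:
    buckets.setdefault(bucket_of(row), set()).add(row)

deduped = [row for b in sorted(buckets) for row in sorted(buckets[b])]
```

The key property is that duplicates can never straddle two buckets, so no single de-duplication step needs more than one bucket's worth of data.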

Getting maximum value of field in solr

I'd like to boost my query by the item's view count; I'd like to use something like view_count / max_view_count for this purpose, to be able to measure how the item's view count relates to the biggest view count in the index. I know how to boost the results with a function query, but how can I easily get the maximum view count? If anybody could provide an example it would be very helpful...
There aren't any aggregate functions in Solr in the way you might be thinking of them from SQL. The easiest way to do it is a two-step process:
Get the max value via an appropriate query with a sort
Use it with the max() function
So, something like:
q=*:*&sort=view_count desc&rows=1&fl=view_count
...to get an item with the max view_count, which you record somewhere, and then
q=whatever&bq=div(view_count, max(the_max_view_count, 1))
Note that that max() function isn't doing an aggregate max; just getting the maximum of the max-view-count you pass in or 1 (to avoid divide-by-zero errors).
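The boost expression itself is just a guarded division. Mirrored in plain Python (using the placeholder names from the query above):

```python
# div(view_count, max(the_max_view_count, 1)) from the bq parameter,
# written out: normalize by the pre-fetched maximum, never dividing by zero.
def view_boost(view_count, max_view_count):
    return view_count / max(max_view_count, 1)
```

An empty index would yield a maximum of 0, which the `max(..., 1)` guard turns into a divisor of 1 instead of an error.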
If you have a multiValued field (which you can't sort on) you could also use the StatsComponent to get the max. Either way, you would probably want to do this once, not for every query (say, every night at midnight or whatever once your data set settles down).
You can add just:
&stats=true&stats.field=view_count
You will see a small set of statistics for the specified field. See the StatsComponent documentation for more details.

long running queries: observing partial results?

As part of a data analysis project, I will be issuing some long running queries on a mysql database. My future course of action is contingent on the results I obtain along the way. It would be useful for me to be able to view partial results generated by a SELECT statement that is still running.
Is there a way to do this? Or am I stuck with waiting until the query completes to view results which were generated in the very first seconds it ran?
Thank you for any help : )
In the general case, a partial result cannot be produced. For example, if you have an aggregate function with a GROUP BY clause, then all the data must be analysed before the first row is returned. A LIMIT clause will not help you, because it is applied after the output is computed. Maybe you can give concrete data and an SQL query?
One thing you may consider is sampling your tables down. This is good practice in data analysis in general: it speeds up your iteration while you're writing code.
For example, suppose you have table-create privileges and some huge table X with key unique_id and some data data_value.
If unique_id is numeric, then in nearly any database:
create table sample_table as
select unique_id, data_value
from X
where mod(unique_id, <some_large_prime_number_like_1013>) = 1
will give you a random sample of data to work your queries out on, and you can inner join your sample_table against the other tables to improve the speed of testing. Thanks to the sampling, your query results should be roughly representative of what you will get on the full data. Note that the number you're modding by has to be prime, otherwise it won't give a correct sample. The example above shrinks your table to about 0.1% of the original size (0.0987%, to be exact).
Most databases also have better sampling and random-number methods than just using mod. Check the documentation to see what's available for your version.
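Here is the mod-based sampling run against SQLite via Python, with an invented 10,000-row table (SQLite uses the `%` operator rather than a `mod()` function):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE X (unique_id INTEGER, data_value TEXT)")
conn.executemany("INSERT INTO X VALUES (?, ?)",
                 [(i, f"v{i}") for i in range(1, 10001)])

# Keep only the rows whose id falls in one residue class mod a large prime:
# roughly a 1/1013 (~0.1%) deterministic sample.
conn.execute("""
CREATE TABLE sample_table AS
SELECT unique_id, data_value
FROM X
WHERE unique_id % 1013 = 1
""")
n = conn.execute("SELECT COUNT(*) FROM sample_table").fetchone()[0]
# n -> 10 rows out of 10,000
```

Because the predicate is deterministic, re-running a query against sample_table always sees the same rows, which makes iterative debugging reproducible.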
Hope that helps,
McPeterson
It depends on what your query is doing. If it needs to have the whole result set before producing output - such as might happen for queries with group by or order by or having clauses, then there is nothing to be done.
If, however, the reason for the delay is client-side buffering (which is the default mode), then that can be adjusted by using mysql_use_result instead of the default mysql_store_result as an attribute of the database handle. This is true for the Perl and Java interfaces; in the C interface, you call the unbuffered mysql_use_result() instead of mysql_store_result() when fetching the result of the query.