Error with function PERCENTILE_DISC in BigQuery

I am currently running the following query in BigQuery:
SELECT
longTime,
PERCENTILE_DISC(0.99) OVER (ORDER BY longTime DESC) AS percentil_99,
cityList
FROM (FLATTEN(FLATTEN([table], longTime), clientId))
WHERE
cityList IS NOT NULL
AND clientId='UKTRAR'
AND bp_Time BETWEEN TIMESTAMP('2015-01-18')
AND TIMESTAMP('2015-12-18')
but when it finishes, it displays the following error:
Error: Resources exceeded during query execution.
Job ID: table:job_Qm_LOGJAFMSXSeXXnVRoLer4C8A
The table has the following characteristics:
Table Size => 23.4 GB
Number of Rows => 71,809,730


AnalysisException: Could not resolve column/field reference: 'transaction_nominal_value' in Impala SQL

I have the following SQL query where I have two datasets, CSDB and MMSR. I merge these two datasets into the Final dataset. So far everything works fine. Now I want to aggregate the data on trade_date, but I get the error message: AnalysisException: Could not resolve column/field reference: 'transaction_nominal_value'. I am working in Impala. How can I resolve the problem?
--CSDB
WITH CSDB AS (
SELECT isin, nominal_currency, amount_outstanding, issue_price, amount_issued, yield, original_maturity, residual_maturity
FROM csdb_pq
where nominal_currency = "EUR"
),
--MMSR
MMSR AS (
SELECT transaction_nominal_amount, maturity_days, deal_rate, collateral_haircut, collateral_isin, collateral_nominal_amount, trade_date
FROM datashop_store_business_mmsr.secured_vl_pq
WHERE collateral_isin IS NOT NULL
),
---Join
Final AS (
SELECT *
FROM MMSR
LEFT JOIN CSDB
ON MMSR.collateral_isin = CSDB.isin
)
--Aggregate Data
SELECT trade_date, AVG(transaction_nominal_amount)
FROM Final
GROUP BY trade_date;
It should be transaction_nominal_amount, not transaction_nominal_value, matching the columns you select in your MMSR CTE. It's just a simple column name error; fix that and it should work (see the sketch below).
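For illustration, a minimal sketch of a sanity check in Impala, using the table name from the question (the alias avg_nominal_amount is just illustrative): DESCRIBE lists the columns the table actually exposes, and the final aggregation must reference the column exactly as it is selected in the MMSR CTE.
-- List the columns of the underlying MMSR source table
DESCRIBE datashop_store_business_mmsr.secured_vl_pq;

-- Final SELECT of the WITH query above, using the column name from the MMSR CTE
SELECT trade_date, AVG(transaction_nominal_amount) AS avg_nominal_amount
FROM Final
GROUP BY trade_date;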

How to iterate many Hive scripts over Spark

I have many Hive scripts (somewhere around 20-25), each containing multiple queries. I want to run these scripts using Spark so that the process runs faster; since MapReduce jobs in Hive take a long time to execute, running the queries through Spark should be much quicker. Below is the code I have written. It works for 3-4 files, but when given multiple files with multiple queries it fails.
Please help me optimize it if possible.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("yarn").appName("my app").enableHiveSupport().getOrCreate()

// Collect all .hql files from the validation script directory
val filename = new java.io.File("/mapr/tmp/validation_script/").listFiles.filter(_.getName.endsWith(".hql")).toList

for (i <- 0 to filename.length - 1) {
  val filename1 = filename(i)
  scala.io.Source.fromFile(filename1).getLines()
    .filterNot(_.isEmpty)               // filter out empty lines
    .foreach(query => spark.sql(query)) // run each non-empty line as a query
}
Some of the errors I am getting look like:
ERROR SparkSubmit: Job aborted.
org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:224)
ERROR FileFormatWriter: Aborting job null.
org.apache.spark.SparkException: Job aborted due to stage failure: ShuffleMapStage 12 (sql at validationtest.scala:67) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.FetchFailedException: failed to allocate 16777216 byte(s) of direct memory (used: 1023410176, max: 1029177344) at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:528)
I get many different types of errors when I run the same code multiple times.
Below is how one of the HQL files looks. Its name is xyz.hql and it contains:
drop table pontis_analyst.daydiff_log_sms_distribution
create table pontis_analyst.daydiff_log_sms_distribution as select round(datediff(date_sub(current_date(),cast(date_format(CURRENT_DATE ,'u') as int) ),cast(subscriberActivationDate as date))/7,4) as daydiff,subscriberkey as key from pontis_analytics.prepaidsubscriptionauditlog
drop table pontis_analyst.weekly_sms_usage_distribution
create table pontis_analyst.weekly_sms_usage_distribution as select sum(event_count_ge) as eventsum,subscriber_key from pontis_analytics.factadhprepaidsubscriptionsmsevent where effective_date_ge_prt < date_sub(current_date(),cast(date_format(CURRENT_DATE ,'u') as int) - 1 ) and effective_date_ge_prt >= date_sub(date_sub(current_date(),cast(date_format(CURRENT_DATE ,'u') as int) ),84) group by subscriber_key;
drop table pontis_analyst.daydiff_sms_distribution
create table pontis_analyst.daydiff_sms_distribution as select day.daydiff,sms.subscriber_key,sms.eventsum from pontis_analyst.daydiff_log_sms_distribution day inner join pontis_analyst.weekly_sms_usage_distribution sms on day.key=sms.subscriber_key
drop table pontis_analyst.weekly_sms_usage_final_distribution
create table pontis_analyst.weekly_sms_usage_final_distribution as select spp.subscriberkey as key, case when spp.tenure < 3 then round((lb.eventsum )/dayDiff,4) when spp.tenure >= 3 then round(lb.eventsum/12,4)end as result from pontis_analyst.daydiff_sms_distribution lb inner join pontis_analytics.prepaidsubscriptionsubscriberprofilepanel spp on spp.subscriberkey = lb.subscriber_key
INSERT INTO TABLE pontis_analyst.validatedfinalResult select 'prepaidsubscriptionsubscriberprofilepanel' as fileName, 'average_weekly_sms_last_12_weeks' as attributeName, tbl1_1.isEqual as isEqual, tbl1_1.isEqualCount as isEqualCount, tbl1_2.countAll as countAll, (tbl1_1.isEqualCount/tbl1_2.countAll)* 100 as percentage from (select tbl1_0.isEqual as isEqual, count(isEqual) as isEqualCount from (select case when round(aal.result) = round(srctbl.average_weekly_sms_last_12_weeks) then 1 when aal.result is null then 1 when aal.result = 'NULL' and srctbl.average_weekly_sms_last_12_weeks = '' then 1 when aal.result = '' and srctbl.average_weekly_sms_last_12_weeks = '' then 1 when aal.result is null and srctbl.average_weekly_sms_last_12_weeks = '' then 1 when aal.result is null and srctbl.average_weekly_sms_last_12_weeks is null then 1 else 0 end as isEqual from pontis_analytics.prepaidsubscriptionsubscriberprofilepanel srctbl left join pontis_analyst.weekly_sms_usage_final_distribution aal on srctbl.subscriberkey = aal.key) tbl1_0 group by tbl1_0.isEqual) tbl1_1 inner join (select count(*) as countAll from pontis_analytics.prepaidsubscriptionsubscriberprofilepanel) tbl1_2 on 1=1
Your issue is that your code is running out of memory, as shown below:
failed to allocate 16777216 byte(s) of direct memory (used: 1023410176, max: 1029177344)
Though what you are trying to do is not the optimal way of doing things in Spark, I would recommend that you remove the memory serialization, as it will not help in any way. You should cache data only if it is going to be used in multiple transformations; if it is going to be used only once, there is no reason to put the data in the cache.

Output "none" - psycopg2 database query

I'm new to SQL and am currently trying to resolve a data table problem.
I have a data table and first need to find the dates on which a request led to an error. They are pulled as timestamps from the log database. Afterwards the status is checked (where not status = '200 OK'), and the days on which more than 1% of requests led to an error are shown (ratio > 0.01), ordered by the percentage, descending.
Now I have the problem that no output is shown:
OUTPUT IN THE TERMINAL:
--
Following the dates for >1 percent requests, leading to an error:
None
None
CODE:
def number_one_error():
    """
    Percentage of errors from requests
    Counting errors and timestamps
    Output:
    number one errors
    """
    db = psycopg2.connect(database=dbname)
    c = db.cursor()
    c.execute('''
        select oneerror.date_column, round(((cast(oneerror.request_error as decimal))/requests*1.0),2) as percent
        from (select date(log.time) AS date_column,
                     count (*) as request_error
              from log where not status = '200 OK'
              group by date_column) as oneerror
        join (select date(log.time) AS date_column,
                     count(*) as requests
              from log
              group by date_column) as total
        on oneerror.date_column = total.date_column
        where round((cast(oneerror.request_error as decimal)/requests*1.0),3) > 0.01
        order by percent desc
        ''')
    number_one_error = c.fetchall()
    db.close()
THANK YOU SO MUCH!

Where clause with dates in Hive

The WHERE clause in the Hive query below is not working:
select
e.num as badge
from dbo.events as e
where TO_DATE(e.event_time_utc) > TO_DATE(select event_date from DL_EDGE_LRF_facilities.card_swipes_lastpulldate)
Both the event_time_utc and event_date fields are defined as strings; event_time_utc has timestamp values like '2017-09-18 20:10:19.000000', and event_date has only one date value, like '2018-01-25'.
I am getting an error like "cannot recognize input near 'select' 'event_date' 'from' in function specification" when I run the query. Please help.
@user86683: Hive does not recognize the syntax because it does not allow a subquery inside the inequality condition (>). You may try this query and let me know the result:
select e.num as badge
from dbo.events as e, DL_EDGE_LRF_facilities.card_swipes_lastpulldate c
where TO_DATE(e.event_time_utc) > TO_DATE(c.event_date)
You will get a warning but you may ignore it since the table for event_date has only one record.
Warning: Map Join MAPJOIN[10][bigTable=e] in task 'Map 1' is a cross product
Query ID = xxx_20180201102128_aaabb2235-ee69275cbec1
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_09fdf345)
Hope this helps. Thanks.

Google BigQuery Internal Error

Edit: Tidied up the query a bit. Checked running on one day (versus the 27 I need) and the query runs. With 27 days of data it's trying to process 5.67TB. Could this be the issue?
Latest ID of error run:
Job ID: ee-corporate:bquijob_3f47d425_1530e03af64
I keep getting this error message when trying to run a query in BigQuery, both through the UI and bigrquery.
Query Failed
Error: An internal error occurred and the request could not be completed.
Job ID: ee-corporate:bquijob_6b9bac2e_1530dba312e
Code below:
SELECT
CASE WHEN d.category_grouped IS NULL THEN 'N/A' ELSE d.category_grouped END AS category_grouped_cleaned,
COUNT(UNIQUE(msisdn_token)) AS users,
(SUM(up_link_data_bytes) + SUM(down_link_data_bytes))/1000000 AS tot_data_mb
FROM (
SELECT
request_domain, up_link_data_bytes, down_link_data_bytes, msisdn_token, timestamp
FROM (TABLE_DATE_RANGE([helpful-skyline-97216:WEBLOG_Staging.WEBLOG_], TIMESTAMP('20160101'), TIMESTAMP('20160127')))
WHERE SUBSTR(http_status_code,1,1) IN ('1',
'2',
'3')) a
LEFT JOIN EACH web_usage_201601.domain_to_cat_lookup_27JAN_with_groups d
ON
a.request_domain = d.request_domain
WHERE
DATE(timestamp) >= '2016-01-01'
AND DATE(timestamp) <= '2016-01-27'
GROUP EACH BY
1
Is there something I'm doing wrong?
The problem seems to be coming from UNIQUE() - it returns a repeated field with too many elements in it. The error message could be improved, but the workaround for you would be to use an explicit GROUP BY and then run COUNT on top of it (see the sketch at the end of this answer).
If you are okay with an approximation, you can also use
COUNT(DISTINCT msisdn_token) AS users
or a higher approximation parameter than the default 1000,
COUNT(DISTINCT msisdn_token, 5000) AS users
GROUP BY is the most general approach, but these can be faster if they do what you need.
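For illustration, a minimal sketch of the explicit GROUP BY approach, reusing the tables and legacy SQL syntax from the query in the question (the alias tot_data_mb_part is just illustrative). The inner query groups by category and msisdn_token, so each user appears once per category; the outer query then counts those rows and sums the per-user data:
SELECT
  category_grouped_cleaned,
  COUNT(msisdn_token) AS users,
  SUM(tot_data_mb_part) AS tot_data_mb
FROM (
  -- one row per (category, user) with that user's data volume
  SELECT
    CASE WHEN d.category_grouped IS NULL THEN 'N/A' ELSE d.category_grouped END AS category_grouped_cleaned,
    msisdn_token,
    (SUM(up_link_data_bytes) + SUM(down_link_data_bytes))/1000000 AS tot_data_mb_part
  FROM (
    SELECT request_domain, up_link_data_bytes, down_link_data_bytes, msisdn_token, timestamp
    FROM (TABLE_DATE_RANGE([helpful-skyline-97216:WEBLOG_Staging.WEBLOG_], TIMESTAMP('20160101'), TIMESTAMP('20160127')))
    WHERE SUBSTR(http_status_code,1,1) IN ('1', '2', '3')
  ) a
  LEFT JOIN EACH web_usage_201601.domain_to_cat_lookup_27JAN_with_groups d
  ON a.request_domain = d.request_domain
  WHERE DATE(timestamp) >= '2016-01-01'
    AND DATE(timestamp) <= '2016-01-27'
  GROUP EACH BY 1, 2
) t
GROUP EACH BY 1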