Export LOAD jobs to GCS - google-bigquery

I have several processes that create LOAD jobs in BigQuery. I can find the job_id with this query:
SELECT
  *
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
  USER_EMAIL = 'email@email.com'
ORDER BY CREATION_TIME DESC
The table the rows are loaded into does not contain an insert_timestamp that I could use to filter runs, but is there a way I can get only the rows loaded by a given job_id and export just those to GCS?
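One option, sketched below, is to read the job's destination table from INFORMATION_SCHEMA and then export that table with an EXPORT DATA statement. This is only a sketch with made-up names (my_load_job_id, my-bucket and the table path are placeholders), and it exports the whole destination table: BigQuery does not tag individual rows with the job_id that loaded them, so per-job filtering would need a column you populate yourself.
-- look up where the load job wrote its rows (the job_id is a placeholder)
SELECT
  destination_table.project_id,
  destination_table.dataset_id,
  destination_table.table_id
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
  job_id = 'my_load_job_id';
-- export that table to GCS (bucket and table path are placeholders)
EXPORT DATA OPTIONS (
  uri = 'gs://my-bucket/exports/*.csv',
  format = 'CSV',
  overwrite = true,
  header = true
) AS
SELECT *
FROM `my-project.my_dataset.my_table`;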

Related

Massive Delete statement - How to improve query execution time?

I have a Spring batch that will run every day to:
1. Read CSV files and import them into our database
2. Aggregate this data and save the aggregated data into another table.
We have a table BATCH_LIST that contains information about all the batches that were already executed.
BATCH_LIST has the following columns:
1. BATCH_ID
2. EXECUTION_DATE
3. STATUS
Among the CSV files that are imported, we have one CSV file to feed an APP_USERS table, and another one to feed the ACCOUNTS table.
APP_USERS has the following columns:
1. USER_ID
2. BATCH_ID
-- more columns
ACCOUNTS has the following columns:
1. ACCOUNT_ID
2. BATCH_ID
-- more columns
In step 2, we aggregate data from ACCOUNTS and APP_USERS to insert rows into a USER_ACCOUNT_RELATION table. This table has exactly two columns: ACCOUNT_ID (referring to ACCOUNTS.ACCOUNT_ID) and USER_ID (referring to APP_USERS.USER_ID).
Now we want to add another step to our Spring batch. We want to delete all the data from the USER_ACCOUNT_RELATION table, but also from APP_USERS and ACCOUNTS, that is no longer relevant (i.e. data that was imported before sysdate - 2).
What has been done so far:
Get all the BATCH_ID that we want to remove from the database
SELECT BATCH_ID FROM BATCH_LIST WHERE trunc(EXECUTION_DATE) < sysdate - 2
For each BATCH_ID, we are calling the following methods:
public void deleteAppUsersByBatchId(Connection connection, long batchId) throws SQLException
// prepared statements to delete User account relation and user
And here are the two prepared statements :
DELETE FROM USER_ACCOUNT_RELATION
WHERE USER_ID IN (
SELECT USER_ID FROM APP_USERS WHERE BATCH_ID = ?
);
DELETE FROM APP_USERS WHERE BATCH_ID = ?
My issue is that it takes too long to delete data for one BATCH_ID (more than 1 hour).
Note: I only mentioned the APP_USERS, ACCOUNTS and USER_ACCOUNT_RELATION tables, but I actually have around 25 tables to delete from.
How can I improve the query time?
(I've just tried to change the WHERE USER_ID IN () into an EXISTS. It is better but still way too long.)
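For reference, the EXISTS variant mentioned above would look roughly like this (a sketch only, using the table and column names from the question):
DELETE FROM USER_ACCOUNT_RELATION r
WHERE EXISTS (
  SELECT 1
  FROM APP_USERS u
  WHERE u.USER_ID = r.USER_ID
    AND u.BATCH_ID = ?
);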
If this will be your regular process, i.e. you want to keep only the last 2 days, you don't need indexes, since every time you will delete about 1/3 of all rows.
It's better to use just 3 deletes instead of 3*7 separate deletes:
DELETE FROM USER_ACCOUNT_RELATION
WHERE ACCOUNT_ID IN
(
  SELECT u.ID
  FROM {USER} u
  JOIN {FILE} f
    ON u.FILE_ID = f.FILE_ID
  WHERE trunc(f.IMPORT_DATE) < (sysdate - 2)
);
DELETE FROM {USER}
WHERE FILE_ID IN (SELECT FILE_ID FROM {FILE} WHERE trunc(IMPORT_DATE) < (sysdate - 2));
DELETE FROM {ACCOUNT}
WHERE FILE_ID IN (SELECT FILE_ID FROM {FILE} WHERE trunc(IMPORT_DATE) < (sysdate - 2));
Just replace {USER}, {FILE}, {ACCOUNT} with your real table names.
Obviously, with the partitioning option it would be much easier: with daily interval partitioning you could simply drop old partitions, as sketched below.
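A minimal sketch of that approach, assuming an Oracle version with interval partitioning; the table name, columns and dates here are placeholders, the real DDL would carry all of your columns and constraints:
-- hypothetical DDL: one partition per day, created automatically as data arrives
CREATE TABLE ACCOUNTS_PART (
  ACCOUNT_ID  NUMBER,
  BATCH_ID    NUMBER,
  IMPORT_DATE DATE
)
PARTITION BY RANGE (IMPORT_DATE)
INTERVAL (NUMTODSINTERVAL(1, 'DAY'))
(PARTITION P_FIRST VALUES LESS THAN (DATE '2020-01-01'));
-- dropping a day is a metadata operation, far faster than a DELETE
ALTER TABLE ACCOUNTS_PART DROP PARTITION FOR (DATE '2020-01-05') UPDATE GLOBAL INDEXES;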
But even in your case there is another, more difficult but really fast, solution: "partition views". For example, for ACCOUNT you can create 3 different tables ACCOUNT_1, ACCOUNT_2 and ACCOUNT_3, then create a partition view:
create view ACCOUNT as
select 1 table_id, a1.* from ACCOUNT_1 a1
union all
select 2 table_id, a2.* from ACCOUNT_2 a2
union all
select 3 table_id, a3.* from ACCOUNT_3 a3;
Then you can use an INSTEAD OF trigger on this view to insert each day's data into its own table: the first day into ACCOUNT_1, the second into ACCOUNT_2, etc., and truncate the oldest table each midnight. You can easily get the target table name using
select 'ACCOUNT_' || (mod(to_number(to_char(sysdate, 'j')), 3) + 1) tab_name from dual;
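A rough sketch of such an INSTEAD OF trigger, assuming ACCOUNT_ID and BATCH_ID stand in for the view's real columns:
CREATE OR REPLACE TRIGGER ACCOUNT_INS_TRG
INSTEAD OF INSERT ON ACCOUNT
FOR EACH ROW
DECLARE
  -- pick 1, 2 or 3 from the Julian day, matching the query above
  v_slot PLS_INTEGER := MOD(TO_NUMBER(TO_CHAR(SYSDATE, 'J')), 3) + 1;
BEGIN
  IF v_slot = 1 THEN
    INSERT INTO ACCOUNT_1 (ACCOUNT_ID, BATCH_ID) VALUES (:NEW.ACCOUNT_ID, :NEW.BATCH_ID);
  ELSIF v_slot = 2 THEN
    INSERT INTO ACCOUNT_2 (ACCOUNT_ID, BATCH_ID) VALUES (:NEW.ACCOUNT_ID, :NEW.BATCH_ID);
  ELSE
    INSERT INTO ACCOUNT_3 (ACCOUNT_ID, BATCH_ID) VALUES (:NEW.ACCOUNT_ID, :NEW.BATCH_ID);
  END IF;
END;
/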

Materialized view of first rows

Supposing I have a table with columns date | group_id | user_id | text, and I would like to get the first 3 texts (by date) of each group_id/user_id pair.
It seems wasteful to query the whole table, say, every 3 hours, as the results are unlikely to change for a given pair once set, so I looked at materialized views, but the examples were about single rows, not sets of rows.
Another issue is that the date column does not correspond to the ingestion date; does this mean that I have to add an ingestion date column to be able to use @run_time in scheduled queries?
Alternatively, would it be more sensible to load the batch into a separate table and compare it with/update the "first/materialized" table before merging it with the main table? (So instead of having queries on the main table, the materialized table is filled preemptively at every load.) This looks hacky/wrong though?
The question links to I want a "materialized view" of the latest records, and notes that that answer deals with single rows instead of multiple rows; this question wants the 3 latest rows instead of only one.
For that, look at the inner query in that answer. Instead of doing this:
SELECT latest_row.*
FROM (
  SELECT ARRAY_AGG(a ORDER BY datehour DESC LIMIT 1)[OFFSET(0)] latest_row
  FROM `fh-bigquery.wikipedia_v3.pageviews_2018` a
  WHERE datehour > TIMESTAMP_SUB(@run_time, INTERVAL 1 DAY)
  # change to CURRENT_TIMESTAMP() or let scheduled queries do it
  AND datehour > '2000-01-01' # nag
  AND wiki='en' AND title LIKE 'A%'
  GROUP BY title
)
Do this:
SELECT latest_row.*
FROM (
  SELECT ARRAY_AGG(a ORDER BY datehour DESC LIMIT 3) latest_rows
  FROM `fh-bigquery.wikipedia_v3.pageviews_2018` a
  WHERE datehour > TIMESTAMP_SUB(@run_time, INTERVAL 1 DAY)
  # change to CURRENT_TIMESTAMP() or let scheduled queries do it
  AND datehour > '2000-01-01' # nag
  AND wiki='en' AND title LIKE 'A%'
  GROUP BY title
), UNNEST(latest_rows) latest_row
Note that the [OFFSET(0)] is gone: the aggregated array is UNNESTed so that all 3 rows per title are returned, not just the first.
Re @run_time: you can compare it to any column; just make sure you have a column that makes sense for the logic you want to implement.
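For the schema in the question (date | group_id | user_id | text), a sketch of what such a scheduled query could look like, assuming a hypothetical table my_project.my_dataset.texts and a hypothetical ingestion_ts column populated at load time:
SELECT first_rows.*
FROM (
  SELECT ARRAY_AGG(t ORDER BY t.date LIMIT 3) first_3
  FROM `my_project.my_dataset.texts` t  -- hypothetical table
  WHERE t.ingestion_ts > TIMESTAMP_SUB(@run_time, INTERVAL 3 HOUR)  -- hypothetical ingestion column
  GROUP BY t.group_id, t.user_id
), UNNEST(first_3) first_rows
The scheduled query would then merge these incremental results into the "first rows" table, keeping only the earliest 3 per group_id/user_id pair.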

Partition by week/month/quarter/year to get over the partition limit?

I have 32 years of data that I want to put into a partitioned table. However BigQuery says that I'm going over the limit (4000 partitions).
For a query like:
CREATE TABLE `deleting.day_partition`
PARTITION BY FlightDate
AS
SELECT *
FROM `flights.original`
I'm getting an error like:
Too many partitions produced by query, allowed 2000, query produces at least 11384 partitions
How can I get over this limit?
Instead of partitioning by day, you could partition by week/month/year.
In my case each year contains around 3 GB of data, so I'll get the most benefit from clustering if I partition by year.
For this, I'll create a year date column, and partition by it:
CREATE TABLE `fh-bigquery.flights.ontime_201903`
PARTITION BY FlightDate_year
CLUSTER BY Origin, Dest
AS
SELECT *, DATE_TRUNC(FlightDate, YEAR) FlightDate_year
FROM `fh-bigquery.flights.raw_load_fixed`
Note that I created the extra column DATE_TRUNC(FlightDate, YEAR) AS FlightDate_year in the process.
Table stats:
Since the table is clustered, I'll get the benefits of partitioning even if I don't use the partitioning column (year) as a filter:
SELECT *
FROM `fh-bigquery.flights.ontime_201903`
WHERE FlightDate BETWEEN '2008-01-01' AND '2008-01-10'
Predicted cost: 83.4 GB
Actual cost: 3.2 GB
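Conversely, when a query does filter on the new year column, whole partitions can be pruned; a sketch against the same table (DATE_TRUNC(..., YEAR) stores January 1st of each year):
SELECT COUNT(*) flights_2008
FROM `fh-bigquery.flights.ontime_201903`
WHERE FlightDate_year = '2008-01-01'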
As an alternative example, I created a NOAA GSOD summary table clustered by station name, and instead of partitioning by day, I didn't partition it at all.
Let's say I want to find the hottest days since 1980 for all stations with a name like 'SAN FRANC%':
SELECT name, state, ARRAY_AGG(STRUCT(date,temp) ORDER BY temp DESC LIMIT 5) top_hot, MAX(date) active_until
FROM `fh-bigquery.weather_gsod.all`
WHERE name LIKE 'SAN FRANC%'
AND date > '1980-01-01'
GROUP BY 1,2
ORDER BY active_until DESC
Note that I got the results after processing only 55.2MB of data.
The equivalent query on the source tables (without clustering) processes 4GB instead:
# query on non-clustered tables - too much data compared to the other one
SELECT name, state, ARRAY_AGG(STRUCT(CONCAT(a.year,a.mo,a.da),temp) ORDER BY temp DESC LIMIT 5) top_hot, MAX(CONCAT(a.year,a.mo,a.da)) active_until
FROM `bigquery-public-data.noaa_gsod.gsod*` a
JOIN `bigquery-public-data.noaa_gsod.stations` b
ON a.wban=b.wban AND a.stn=b.usaf
WHERE name LIKE 'SAN FRANC%'
AND _table_suffix >= '1980'
GROUP BY 1,2
ORDER BY active_until DESC
I also added a geo clustered table, to search by location instead of station name. See details here: https://stackoverflow.com/a/34804655/132438

How to optimize spark sql to run it in parallel

I am a Spark newbie and have a simple Spark application that uses Spark SQL/hiveContext to:
1. select data from a Hive table (1 billion rows)
2. do some filtering and aggregation, including row_number over a window function to select the first row, group by, count() and max(), etc.
3. write the result into HBase (hundreds of millions of rows)
I submit the job to run on a YARN cluster (100 executors). It's slow, and when I look at the DAG visualization in the Spark UI, it seems only the Hive table scan tasks run in parallel; the rest, steps #2 and #3 above, run in only one instance, which probably should be possible to parallelize?
The application looks like:
Step 1:
val input = hiveContext
.sql("""
SELECT
user_id
, address
, age
, phone_number
, first_name
, last_name
, server_ts
FROM
(
SELECT
user_id
, address
, age
, phone_number
, first_name
, last_name
, server_ts
, row_number() over
(partition by user_id, address, phone_number, first_name, last_name order by user_id, address, phone_number, first_name, last_name, server_ts desc, age) AS rn
FROM
(
SELECT
user_id
, address
, age
, phone_number
, first_name
, last_name
, server_ts
FROM
table
WHERE
phone_number <> '911' AND
server_date >= '2015-12-01' and server_date < '2016-01-01' AND
user_id IS NOT NULL AND
first_name IS NOT NULL AND
last_name IS NOT NULL AND
address IS NOT NULL AND
phone_number IS NOT NULL
) all_rows
) all_rows_with_row_number
WHERE rn = 1""")
input.registerTempTable("input_tbl")
Step 2:
val result = hiveContext.sql("""
SELECT state,
phone_number,
address,
COUNT(*) as hash_count,
MAX(server_ts) as latest_ts
FROM
( SELECT
udf_getState(address) as state
, user_id
, address
, age
, phone_number
, first_name
, last_name
, server_ts
FROM
input_tbl ) input
WHERE state IS NOT NULL AND state != ''
GROUP BY state, phone_number, address""")
Step 3:
result.cache()
result.map(x => ...).saveAsNewAPIHadoopDataset(conf)
The DAG Visualization looks like:
As you can see, the "Filter", "Project" and "Exchange" steps in stage 0 appear to run in only one instance, as do stage 1 and stage 2, so a few questions (apologies if they are dumb):
1. Do "Filter", "Project" and "Exchange" happen in the driver after data is shuffled from each executor?
2. What code maps to "Filter", "Project" and "Exchange"?
3. How could I run "Filter", "Project" and "Exchange" in parallel to optimize performance?
4. Is it possible to run stage 1 and stage 2 in parallel?
You're not reading the DAG graph correctly: the fact that each step is visualized using a single box does not mean that it isn't using multiple tasks (and therefore cores) to calculate that step.
You can see how many tasks are used for each step by drilling down into the stage view, which displays all tasks for that stage.
For example, here's a sample DAG visualization similar to yours:
You can see each stage is depicted by a "single" column of steps.
But if we look at the table below, we can see the number of tasks per stage:
One of them uses only 2 tasks, but the other uses 220, which means the data is split into 220 partitions that are processed in parallel, given enough available resources.
If you drill down into that stage, you can see again that it used 220 tasks, along with details for all of them.
Only tasks reading data from disk are shown in the graph with these "multiple dots", to help you understand how many files were read.
So, as Rashid's answer suggests, check the number of tasks for each stage.
It is not obvious, so I would do the following things to zero in on the problem.
1. Calculate the execution time of each step.
2. The first step may be slow if your table is in a text format; Spark usually works better if the data is stored in Hive in Parquet format.
3. See if your table is partitioned by the column used in the WHERE clause.
4. If saving data to HBase is slow, you may need to pre-split the HBase table, since by default data is stored in a single region.
5. Look at the Stages tab in the Spark UI to see how many tasks are started for each stage, and also look at the data locality level, as described here.
Hopefully, you will be able to zero in on the problem.

Parallelizing insert queries that operate on large volume of data in Postgresql

I have the following query that I would like to cut into smaller queries
and execute in parallel:
insert into users (
  SELECT "user", company, "date"
  FROM (
    SELECT
      json_each(json -> 'users') AS "user",
      json ->> 'company' AS company,
      date(datetime) AS "date"
    FROM companies
    WHERE date(datetime) = '2015-05-18'
  ) AS s
);
I could try to do it manually: launch workers that would connect to PostgreSQL,
and every worker would take 1000 companies, extract the users and insert them into the other table. But is it possible to do it in plain SQL?
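For the manual approach described above, each worker's statement could be an ordinary INSERT restricted to its own slice of companies; a sketch, assuming a hypothetical id primary key on companies to delimit the slices:
-- one worker's slice: companies with id 1..1000 (id is a hypothetical key column)
insert into users (
  SELECT "user", company, "date"
  FROM (
    SELECT
      json_each(json -> 'users') AS "user",
      json ->> 'company' AS company,
      date(datetime) AS "date"
    FROM companies
    WHERE date(datetime) = '2015-05-18'
      AND id BETWEEN 1 AND 1000
  ) AS s
);
Each worker gets a different BETWEEN range, so the inserts run in separate sessions and transactions in parallel.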