I have an existing database from which I need to extract a single record that contains a total of 10 GB of data. I have tried to load the data with
conn = sqlite(databaseFile, 'readonly')
GetResult = [
'SELECT result1, result2 ...... FROM Result '...
'WHERE ResultID IN ......'
];
Data = fetch(conn, GetResult)
With this query, memory usage grows until all 16 GB of RAM are used, and then the software crashes.
I also tried limiting the result with
'LIMIT 10000'
at the end of the query and paging through the results by OFFSET. This works, but it would take about 3 hours (extrapolated from 20 individual results) to get all the results. (The database cannot be changed.)
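For illustration, the paging loop looks roughly like this (sketched here with Python's sqlite3 module rather than the MATLAB Database Toolbox I am actually using; the file name, column names, and ResultID value are placeholders):
import sqlite3

# Placeholder path and ID; the real query selects result1, result2, ... for one ResultID.
conn = sqlite3.connect("results.db")
result_id = 42
page_size = 10000
offset = 0
rows = []

while True:
    page = conn.execute(
        "SELECT result1, result2 FROM Result "
        "WHERE ResultID = ? LIMIT ? OFFSET ?",
        (result_id, page_size, offset),
    ).fetchall()
    if not page:
        break
    rows.extend(page)  # or process/stream each page instead of keeping everything in memory
    offset += page_size

conn.close()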
Maybe one of you has an idea of how to get the data faster, or in a single query.
I have a dataset in BigQuery with roughly 1000 tables, one for each variable. Each table contains two columns: observation_number and variable_name; note that the variable_name column is actually named after the variable the table holds. Each table contains at least 20000 rows. What is the best way to merge these tables on the observation number?
I have developed Python code that will run in a Cloud Function and generates the SQL query to merge the tables. It does this by connecting to the dataset and looping through the tables to get all of the table_ids. However, the query ends up being very large and the performance is not great.
Here is a sample of the Python code that generates the query (mind that it's still running locally, not yet in a Cloud Function).
from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set project_id and dataset_id.
project_id = 'project-id-gcp'
dataset_name = 'sample_dataset'
dataset_id = project_id+'.'+dataset_name

dataset = client.get_dataset(dataset_id)

# View tables in dataset
tables = list(client.list_tables(dataset))  # API request(s)
table_names = []
if tables:
    for table in tables:
        table_names.append(table.table_id)
else:
    print("\tThis dataset does not contain any tables.")

query_start = "select "+table_names[0]+".observation"+","+table_names[0]+"."+table_names[0]
query_select = ""
query_from_select = "(select observation,"+table_names[0]+" from `"+dataset_name+"."+table_names[0]+"`) "+table_names[0]
for table_name in table_names:
    if table_name != table_names[0]:
        query_select = query_select + "," + table_name+"."+table_name
        query_from_select = query_from_select + " FULL OUTER JOIN (select observation," + table_name + " from " + "`"+dataset_name+"."+table_name+"`) "+table_name+" on "+table_names[0]+".observation="+table_name+".observation"
query_from_select = " from ("+query_from_select + ")"
query_where = " where " + table_names[0] + ".observation IS NOT NULL"
query_order_by = " order by observation"
query_full = query_start+query_select+query_from_select+query_where+query_order_by
with open("query.sql", "w") as f:
    f.write(query_full)
And this is a sample of the generated query for two tables:
select
  VARIABLE1.observation,
  VARIABLE1.VARIABLE1,
  VARIABLE2.VARIABLE2
from
  (
    (
      select
        observation,
        VARIABLE1
      from
        `sample_dataset.VARIABLE1`
    ) VARIABLE1
    FULL OUTER JOIN (
      select
        observation,
        VARIABLE2
      from
        `sample_dataset.VARIABLE2`
    ) VARIABLE2 on VARIABLE1.observation = VARIABLE2.observation
  )
where
  VARIABLE1.observation IS NOT NULL
order by
  observation
As the number of tables grows, this query gets larger and larger. Any suggestions on how to improve the performance of this operation? Any other way to approach this problem?
I don't know if there is a great technical answer to this question. It seems like you are trying to do a huge number of joins in a single query, and many joins are not where BigQuery's strengths lie.
While I outline a potential solution below, have you considered if/why you really need a table with 1000+ potential columns? Not saying you haven't, but there might be alternate ways to solve your problem without creating such a complex table.
One possible solution is to subset your joins/tables into more manageable chunks.
If you have 1000 tables, for example, run your script against smaller subsets of them (2, 5, 10, etc.) and write those results to intermediate tables. Then join your intermediate tables. This might take a few layers of intermediate tables, depending on the size of your sub-tables. Basically, you want to minimize (or keep reasonable) the number of joins in each query. Delete the intermediate tables after you are finished to avoid unnecessary storage costs.
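For example, here is a rough Python sketch of that chunking, reusing the google-cloud-bigquery client from your script (the chunk size, the observation column name, and the intermediate_N table names are illustrative assumptions):
from google.cloud import bigquery

client = bigquery.Client()
dataset = "project-id-gcp.sample_dataset"  # placeholder project.dataset
chunk_size = 10                            # tables joined per intermediate query

table_names = [t.table_id for t in client.list_tables(dataset)]
chunks = [table_names[i:i + chunk_size]
          for i in range(0, len(table_names), chunk_size)]

for n, chunk in enumerate(chunks):
    base = chunk[0]
    cols = [f"{base}.observation"] + [f"{t}.{t}" for t in chunk]
    joins = f"`{dataset}.{base}` {base}"
    for t in chunk[1:]:
        joins += (f" FULL OUTER JOIN `{dataset}.{t}` {t}"
                  f" ON {base}.observation = {t}.observation")
    sql = "SELECT " + ", ".join(cols) + " FROM " + joins

    # Write each chunk's join result into an intermediate table.
    job_config = bigquery.QueryJobConfig(
        destination=f"{dataset}.intermediate_{n}",
        write_disposition="WRITE_TRUNCATE",
    )
    client.query(sql, job_config=job_config).result()
A second pass would then join the intermediate_N tables in the same way (adding another layer if there are still too many of them).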
$queryResults = $bigQuery->runQuery("
    SELECT
        date,
        (value/10)+273 as temperature
    FROM
        [bigquery-public-data:ghcn_d.ghcnd_$year]
    WHERE
        id = '{$station['id']}' AND
        element LIKE '%$element%'
    ORDER BY
        date,
        temperature"
);
I am calling this query and iterating over the years for each station. It gets through one or two stations and then I get a
killed
in my output and the process is halted...
Is it possible that the queries are not being closed? I have looked for a way to close a query, but it appears that it should close itself.
Any ideas?
The following query causes some pretty heavy load on the server, and is currently run every 60 seconds. It finds all rows in a table that do not have lat/long data, and looks up the lat/long based on the city and state, for the source and destination locations represented by each row. Right off the bat, I assume that the LTRIM/RTRIM functions are probably slowing things down, and if so, that would be fairly easy to remedy by making sure the data is cleansed on the way in. But the zip codes/geo database is huge, and even with indexes, things are pretty slow and CPU intensive (entirely possible I'm not creating the indexes properly). Any advice is welcome - even if the best thing is to somehow limit the number of rows per execution of the query, to spread out the load over time a bit.
UPDATE
T1
SET
T1.coordinatesChecked = 1
, T1.FromLatitude = T2.Latitude
, T1.FromLongitude = T2.Longitude
, T1.ToLatitude = T3.Latitude
, T1.ToLongitude = T3.Longitude
FROM
LoadsAvail AS T1
LEFT JOIN
ZipCodes AS T2 ON LTRIM(RTRIM(T1.FromCity)) = T2.CityName AND LTRIM(RTRIM(T1.FromState)) = T2.ProvinceAbbr
LEFT JOIN
ZipCodes AS T3 ON LTRIM(RTRIM(T1.toCity)) = T3.CityName AND LTRIM(RTRIM(T1.toState)) = T3.ProvinceAbbr
WHERE
T1.coordinatesChecked = 0
I have 2 years of combined data, around 300 GB in size, on my local disk, which I extracted from Teradata. I have to load the same data into both Google Cloud Storage and a BigQuery table.
The final data in Google Cloud Storage should be segregated by day in compressed format (each day's data should be a single gz file).
I also have to load the data into a day-wise partitioned BigQuery table, i.e. each day's data should be stored in one partition.
I loaded the combined 2 years of data into Google Cloud Storage first. Then I tried using Google Dataflow to segregate the data by day, using the concept of partitioning in Dataflow, and load it back into Google Cloud Storage (FYI, Dataflow partitioning is different from BigQuery partitioning). But Dataflow did not allow me to create 730 partitions (for 2 years), as it hit the 413 Request Entity Too Large limit ("The size of serialized JSON representation of the pipeline exceeds the allowable limit").
So I ran the Dataflow job twice, filtering the data for one year at a time.
It filtered each year's data and wrote it into separate files in Google Cloud Storage, but it could not compress them, as Dataflow currently cannot write compressed files.
After the first approach failed, I thought of filtering one year's data at a time from the combined data using partitioning in Dataflow as explained above, writing it directly to BigQuery, and then exporting it to Google Cloud Storage in compressed format. This process would have been repeated twice.
But with this approach I could not write more than 45 days of data at once, as I repeatedly hit java.lang.OutOfMemoryError: Java heap space. So this strategy also failed.
Any help in figuring out a strategy for a date-wise segregated migration to Google Cloud Storage (compressed) and BigQuery would be greatly appreciated.
Currently, partitioning the results is the best way to produce multiple output files/tables. What you're likely running into is the fact that each write allocates a buffer for the uploads, so if you have a partition followed by N writes, there are N buffers.
There are two strategies for making this work.
You can reduce the size of the upload buffers using the uploadBufferSizeBytes option in GcsOptions. Note that this may slow down the uploads since the buffers will need to be flushed more frequently.
You can apply a Reshuffle operation to each PCollection after the partition. This will limit the number of BigQuery sinks running concurrently, so fewer buffers will be allocated.
For example, you could do something like:
PCollection<Data> allData = ...;
PCollectionList<Data> partitions = allData.apply(Partition.of(...));

// Assuming the partitioning function has produced numDays partitions,
// and those can be mapped back to the day in some meaningful way:
for (int i = 0; i < numDays; i++) {
  String outputName = nameFor(i); // compute the output name
  partitions.get(i)
      .apply("Write_" + outputName, new ReshuffleAndWrite(outputName));
}
That makes use of these two helper PTransforms:
private static class Reshuffle<T>
    extends PTransform<PCollection<T>, PCollection<T>> {
  @Override
  public PCollection<T> apply(PCollection<T> in) {
    return in
        .apply("Random Key", WithKeys.of(
            new SerializableFunction<T, Integer>() {
              @Override
              public Integer apply(T value) {
                return ThreadLocalRandom.current().nextInt();
              }
            }))
        .apply("Shuffle", GroupByKey.<Integer, T>create())
        .apply("Remove Key", Values.<Iterable<T>>create())
        .apply("Ungroup", Flatten.<T>iterables());
  }
}

private static class ReshuffleAndWrite
    extends PTransform<PCollection<Data>, PDone> {

  private final String outputName;

  public ReshuffleAndWrite(String outputName) {
    this.outputName = outputName;
  }

  @Override
  public PDone apply(PCollection<Data> in) {
    // tableNameFor(...) and schema are assumed to be defined elsewhere.
    return in
        .apply("Reshuffle", new Reshuffle<Data>())
        .apply("Write", BigQueryIO.Write.to(tableNameFor(outputName))
            .withSchema(schema)
            .withWriteDisposition(WriteDisposition.WRITE_TRUNCATE));
  }
}
Let's see if this helps.
Steps + pseudo code
1 - Upload combined data (300GB) to BigQuery to CombinedData table
2 - Split Years (Cost 1x2x300GB = 600GB)
SELECT * FROM CombinedData WHERE year = year1 -> write to DataY1 table
SELECT * FROM CombinedData WHERE year = year2 -> write to DataY2 table
3 - Split to 6 months (Cost 2x2x150GB = 600GB)
SELECT * FROM DataY1 WHERE month in (1,2,3,4,5,6) -> write to DataY1H1 table
SELECT * FROM DataY1 WHERE month in (7,8,9,10,11,12) -> write to DataY1H2 table
SELECT * FROM DataY2 WHERE month in (1,2,3,4,5,6) -> write to DataY2H1 table
SELECT * FROM DataY2 WHERE month in (7,8,9,10,11,12) -> write to DataY2H2 table
4 - Split to 3 months (Cost 4x2x75GB = 600GB)
SELECT * FROM DataY1H1 WHERE month in (1,2,3) -> write to DataY1Q1 table
SELECT * FROM DataY1H1 WHERE month in (4,5,6) -> write to DataY1Q2 table
SELECT * FROM DataY1H2 WHERE month in (7,8,9) -> write to DataY1Q3 table
SELECT * FROM DataY1H2 WHERE month in (10,11,12) -> write to DataY1Q4 table
SELECT * FROM DataY2H1 WHERE month in (1,2,3) -> write to DataY2Q1 table
SELECT * FROM DataY2H1 WHERE month in (4,5,6) -> write to DataY2Q2 table
SELECT * FROM DataY2H2 WHERE month in (7,8,9) -> write to DataY2Q3 table
SELECT * FROM DataY2H2 WHERE month in (10,11,12) -> write to DataY2Q4 table
5 - Split each quarter into 1 and 2 months (Cost 8x2x37.5GB = 600GB)
SELECT * FROM DataY1Q1 WHERE month = 1 -> write to DataY1M01 table
SELECT * FROM DataY1Q1 WHERE month in (2,3) -> write to DataY1M02-03 table
SELECT * FROM DataY1Q2 WHERE month = 4 -> write to DataY1M04 table
SELECT * FROM DataY1Q2 WHERE month in (5,6) -> write to DataY1M05-06 table
Same for rest of Y(1/2)Q(1-4) tables
6 - Split all double months tables into separate month table (Cost 8x2x25GB = 400GB)
SELECT * FROM DataY1M02-03 WHERE month = 2 -> write to DataY1M02 table
SELECT * FROM DataY1M02-03 WHERE month = 3 -> write to DataY1M03 table
SELECT * FROM DataY1M05-06 WHERE month = 5 -> write to DataY1M05 table
SELECT * FROM DataY1M05-06 WHERE month = 6 -> write to DataY1M06 table
Same for the rest of Y(1/2)M(XX-YY) tables
7 - Finally, you have 24 monthly tables, and I hope the limitations you are facing will now be gone, so you can proceed with your plan (the second approach, let's say) to further split them into daily tables
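If it helps, the splits do not have to be run by hand; roughly the same steps can be scripted with the Python BigQuery client. A minimal sketch for step 2 (the dataset ID, the year values, and the year column are assumptions following the pseudo code above):
from google.cloud import bigquery

client = bigquery.Client()
dataset = "my-project.my_dataset"  # placeholder

# One level of the split, e.g. step 2: CombinedData -> per-year tables.
# The year values here are placeholders for your actual two years.
splits = {
    "DataY1": "SELECT * FROM `{d}.CombinedData` WHERE year = 2016",
    "DataY2": "SELECT * FROM `{d}.CombinedData` WHERE year = 2017",
}

for table, sql in splits.items():
    job_config = bigquery.QueryJobConfig(
        destination=f"{dataset}.{table}",
        write_disposition="WRITE_TRUNCATE",
    )
    client.query(sql.format(d=dataset), job_config=job_config).result()
The remaining levels (steps 3-6) follow the same pattern, just with different source tables and WHERE clauses.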
I think, cost-wise, this is the most optimal approach, and the final querying cost is
(assuming billing tier 1)
4x600GB + 400GB = 2800GB = $14
Of course, don't forget to delete the intermediate tables.
Note: I am not happy with this plan, but if splitting your original file into daily chunks outside of BigQuery is not an option, this can help.