I'm trying to export a table in BigQuery to multiple CSV files of 10k lines each.
My plan is to create multiple temporary tables of 10k lines from the main table then export them.
Creating a temporary table is easy:
CREATE TEMP TABLE my_temp_table00x
(
field_a STRING,
field_b INT64
);
To export tables, I can use the BigQuery EXPORT DATA statement:
EXPORT DATA
OPTIONS (
uri = 'gs://temp-file-in-gcs/test-export-*.csv',
format = 'CSV',
overwrite = true,
header = true,
field_delimiter = ',')
AS (
SELECT field_a, field_b FROM my_temp_table00x
);
But I'm stuck on how to divide the table into multiple 10k-row temp tables...
Actually, there is a simpler way using offsets. An offset defines where your query starts retrieving rows: an offset of 200 skips the first 200 rows the query would otherwise return.
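For example, against the test_table used below (ordering on field_a is an assumption; any stable ordering works):
-- skip the first 200 rows, return the next 10,000
SELECT field_a, field_b
FROM test_table
ORDER BY field_a
LIMIT 10000 OFFSET 200;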
With that in mind, we declare two offsets:
-- an offset that will be incremented during a LOOP
DECLARE offset_it INT64 DEFAULT 0;
-- the max offset that is the size of your table
DECLARE offset_max INT64 DEFAULT (SELECT COUNT(*) FROM test_table);
Then you loop over your table in chunks of 10k rows. EXECUTE IMMEDIATE lets you build a new SQL statement for each chunk:
LOOP
  EXECUTE IMMEDIATE FORMAT("""
    EXPORT DATA
    OPTIONS (
      -- the offset goes into the file name so each chunk gets its own files
      -- instead of overwriting the previous export
      uri = 'gs://temp-file-in-gcs/test-export-%d-*.csv',
      format = 'CSV',
      overwrite = true,
      header = true,
      field_delimiter = ',')
    AS (
      SELECT ROW_NUMBER() OVER (ORDER BY field1) AS table_id,
             ds.*
      FROM test_table ds
      ORDER BY table_id -- this is so we'll always have the same sorting
      LIMIT 10000 OFFSET %d
    )""", offset_it, offset_it);

  SET offset_it = offset_it + 10000;

  IF offset_it >= offset_max THEN
    LEAVE;
  END IF;
END LOOP;
Let's say we have a terrible design in BigQuery, one that should never have been created that way, like the following:
some_project contains a dataset named metadata, which contains a table named metadata. Sample data for some_project.metadata.metadata:
| dataset_id |
| xyz1234567 |
| zzz8562042 |
| vyz0009091 |
For each dataset_id I need to query some_table in this dataset, for example some_project.xyz1234567.some_table.
Is it possible to query these multiple tables in a single query? I'm looking to get aggregate results for each table.
In other words, I'm trying to say something like that:
SELECT SUM(table.x) from table WHERE table IN
(SELECT CONCAT('some_project.', dataset_id, '.some_table') FROM `some_project.metadata.metadata`)
or
SELECT SUM(table.x) FROM
(SELECT CONCAT('some_project.', dataset_id, '.some_table') AS table FROM `some_project.metadata.metadata`)
I know that no one should ever need to do something like this, but the design I described above is something I just have to work with.
You can consider this approach, which uses a temporary table as a SQL cursor alternative with the help of BigQuery looping statements.
You read it row by row and execute a query for each table name.
Here is an example:
DECLARE var1 INT64 DEFAULT 1;
DECLARE var2 INT64 DEFAULT 0;
DECLARE str1 string DEFAULT '';
DECLARE str2 string DEFAULT '';
DECLARE str3 string DEFAULT '';
CREATE TEMP TABLE temp_emp AS
SELECT empid,
ename,
deptid,
RANK() OVER(ORDER BY empid) rownum
FROM td.emp1;
SET var2= (SELECT COUNT(*) FROM temp_emp);
WHILE var1 <= var2 DO
SET str1 = (SELECT CAST(empid AS STRING) FROM temp_emp WHERE rownum = var1);
SET str2 = (SELECT CAST(ename AS STRING) FROM temp_emp WHERE rownum = var1);
SET str3 = (SELECT CAST(deptid AS STRING) FROM temp_emp WHERE rownum = var1);
SET var1 = var1 + 1;
END WHILE;
A few points to note:
We use the SET command to assign a value to a variable; it plays the role of SELECT..INTO in the original cursor example.
We do not open and close a cursor.
We create a TEMPORARY table in place of the cursor declaration.
You can see more documentation in this link.
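Applied to the metadata table from the question, a rough sketch of the same pattern could look like this (it assumes every some_table has a numeric column x to aggregate; adjust names to your schema):
DECLARE i INT64 DEFAULT 1;
DECLARE n INT64 DEFAULT 0;
DECLARE ds STRING;

-- list of datasets to visit, with a row number to drive the loop
CREATE TEMP TABLE dataset_list AS
SELECT dataset_id, ROW_NUMBER() OVER (ORDER BY dataset_id) AS rownum
FROM `some_project.metadata.metadata`;

-- one aggregate row per dataset
CREATE TEMP TABLE results (dataset_id STRING, total FLOAT64);

SET n = (SELECT COUNT(*) FROM dataset_list);

WHILE i <= n DO
  SET ds = (SELECT dataset_id FROM dataset_list WHERE rownum = i);
  -- build and run the per-table aggregate, collecting into the temp table
  EXECUTE IMMEDIATE FORMAT("""
    INSERT INTO results
    SELECT '%s', SUM(x) FROM `some_project.%s.some_table`
  """, ds, ds);
  SET i = i + 1;
END WHILE;

SELECT * FROM results;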
Try this
declare sql string;
-- build one statement that UNION ALLs the per-dataset queries
-- (a scalar subquery fails once metadata has more than one row)
set sql = (
  select string_agg(
    concat("select sum(x) as x from `some_project.", dataset_id, ".some_table`"),
    " union all ")
  from `some_project.metadata.metadata`);
execute immediate sql;
I am working on a SQL database that will provide data for a grid. The grid will support filtering, sorting and paging, but there is also a strict requirement that users can enter free text into an input above the grid, for example
'Engine 1001 Requi', and the result must contain only rows whose columns, taken together, contain all the pieces of the text. So one column may contain Engine, another may contain 1001, and some other column will contain Requi.
I created a technical column (let's call it myTechnicalColumn) in the table (let's call it myTable) which is updated each time someone inserts or updates a row; it contains the values of all the columns concatenated together, separated by spaces.
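For illustration only, such a column could be kept up to date as a persisted computed column rather than by application code (colA, colB, colC are made-up names for the real searchable string columns; CONCAT_WS needs SQL Server 2017+):
-- hypothetical: combine the searchable string columns, separated by spaces
ALTER TABLE myTable
ADD myTechnicalColumn AS CONCAT_WS(' ', colA, colB, colC) PERSISTED;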
Now, to use it with Entity Framework, I decided to use a table-valued function that accepts one parameter @searchText and handles it like this:
CREATE FUNCTION myFunctionName(@searchText NVARCHAR(MAX))
RETURNS @Result TABLE
( ... here come columns )
AS
BEGIN
DECLARE @searchToken TokenType
INSERT INTO @searchToken(token) SELECT value FROM STRING_SPLIT(@searchText, ' ')
DECLARE @searchTextLength INT
SET @searchTextLength = (SELECT COUNT(*) FROM @searchToken)
INSERT INTO @Result
SELECT
... here come columns
FROM myTable
WHERE (SELECT COUNT(*) FROM @searchToken WHERE CHARINDEX(token, myTechnicalColumn) > 0) = @searchTextLength
RETURN;
END
Of course the solution works fine, but it's kinda slow. Any hints on how to improve its efficiency?
You can use an inline Table Valued Function, which should be quite a lot faster.
This would be a direct translation of your current code
CREATE FUNCTION myFunctionName(@searchText NVARCHAR(MAX))
RETURNS TABLE
AS RETURN
(
    WITH searchText AS (
        SELECT value AS token
        FROM STRING_SPLIT(@searchText, ' ')
    )
    SELECT
    ... here come columns
    FROM myTable t
    WHERE (
        SELECT COUNT(*)
        FROM searchText s
        WHERE CHARINDEX(s.token, t.myTechnicalColumn) > 0
    ) = (SELECT COUNT(*) FROM searchText)
);
GO
You are using a form of query called Relational Division Without Remainder and there are other ways to cut this cake:
CREATE FUNCTION myFunctionName(@searchText NVARCHAR(MAX))
RETURNS TABLE
AS RETURN
(
    WITH searchText AS (
        SELECT value AS token
        FROM STRING_SPLIT(@searchText, ' ')
    )
    SELECT
    ... here come columns
    FROM myTable t
    WHERE NOT EXISTS (
        SELECT 1
        FROM searchText s
        WHERE CHARINDEX(s.token, t.myTechnicalColumn) = 0
    )
);
GO
This may be faster or slower depending on a number of factors; you need to test.
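Either version is called in the same way, for example:
-- returns only rows whose combined column text contains every token of the search string
SELECT *
FROM dbo.myFunctionName(N'Engine 1001 Requi');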
Since there is no data to test with, I am not sure if the following will solve your issue:
-- Replace the last INSERT portion
INSERT INTO @Result
SELECT
... here come columns
FROM myTable T
JOIN @searchToken S ON CHARINDEX(S.token, T.myTechnicalColumn) > 0
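Note that a bare join like this returns a row for every matching token, so rows that contain only some of the tokens still come back (possibly several times). A sketch of the same join that still requires all tokens would group and compare counts:
INSERT INTO @Result
SELECT
... here come columns
FROM myTable T
JOIN @searchToken S ON CHARINDEX(S.token, T.myTechnicalColumn) > 0
GROUP BY
... here come columns -- same list as the SELECT
HAVING COUNT(*) = @searchTextLength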
I have two tables:
properties (geo_point POINT, locality_id INTEGER, neighborhood_id INTEGER, id UUID)
places_temp (id INTEGER, poly GEOMETRY, placetype TEXT)
Note: all columns in places_temp are indexed.
properties has ~2 million rows and I would like to:
update locality_id and neighborhood_id for each row in properties with the id from places_temp where properties.geo_point is contained by a polygon in places_temp.poly
Whatever I do, it just seems to hang for hours, during which time I don't know whether it's working, whether the connection has been lost, etc.
Any thoughts on how to do this performantly?
My query:
-- drop indexes on locality_id and neighborhood_id to speed up update
DROP INDEX IF EXISTS idx_properties_locality_id;
DROP INDEX IF EXISTS idx_properties_neighborhood_id;
-- for each property find the locality and neighborhood
UPDATE
properties
SET
locality_id = (
SELECT
id
FROM
places_temp
WHERE
placetype = 'locality'
-- check if geo_point is contained by polygon. geo_point is stored as SRID 26910 so must be
-- transformed first
AND st_intersects (st_transform (geo_point, 4326), poly)
LIMIT 1),
neighborhood_id = (
SELECT
id
FROM
places_temp
WHERE
placetype = 'neighbourhood'
-- check if geo_point is contained by polygon. geo_point is stored as SRID 26910 so must be
-- transformed first
AND st_intersects (st_transform (geo_point, 4326), poly)
LIMIT 1);
-- Add indexes back after update
CREATE INDEX IF NOT EXISTS idx_properties_locality_id ON properties (locality_id);
CREATE INDEX IF NOT EXISTS idx_properties_neighborhood_id ON properties (neighborhood_id);
CREATE INDEX properties_point_idx ON properties USING gist (geo_point);
CREATE INDEX places_temp_poly_idx ON places_temp USING gist (poly);
UPDATE properties p
SET    locality_id = t.id
FROM   places_temp t
WHERE  t.placetype = 'locality'
AND    st_intersects (st_transform (p.geo_point, 4326), t.poly);
-- if a point falls inside several locality polygons, one arbitrary match is used,
-- which matches the LIMIT 1 behaviour of the original query
And similar for the other field (you could combine them into one query)
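If the planner still cannot use an index because of the st_transform() call in the join condition, a functional index on the transformed geometry is a common PostGIS pattern (assuming 4326 is always the SRID you query in):
-- index the transformed points so the spatial join condition can use it
CREATE INDEX properties_point_4326_idx ON properties USING gist (st_transform(geo_point, 4326));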
Try this, which fills both columns in one pass (note that it only updates rows that fall inside both a locality and a neighbourhood polygon):
UPDATE properties p
SET    locality_id     = loc.id,
       neighborhood_id = nb.id
FROM   places_temp loc,
       places_temp nb
WHERE  loc.placetype = 'locality'
AND    nb.placetype = 'neighbourhood'
AND    st_intersects (st_transform (p.geo_point, 4326), loc.poly)
AND    st_intersects (st_transform (p.geo_point, 4326), nb.poly);
I have a table designed like this:
create table tbl (
id number(5),
data blob
);
It turns out that the data column only holds very small values, which could be stored in raw(200), so the new table would be:
create table tbl (
id number(5),
data raw(200)
);
How can I migrate this table to the new design without losing the data in it?
This is a somewhat lengthy method, but it works if you are sure that your data column values never exceed 200 bytes in length.
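To double-check that assumption before migrating, a quick count of oversized values is cheap:
-- should return 0; any rows counted here would be truncated by the migration
select count(*) from tbl where dbms_lob.getlength(data) > 200;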
Create a table to hold the contents of tbl temporarily
create table tbl_temp as select * from tbl;
Rem -- Ensure that tbl_temp contains all the contents
select * from tbl_temp;
Rem -- Double verify by subtracting the contents
select * from tbl minus select * from tbl_temp;
Delete the contents in tbl
delete from tbl;
commit;
Drop column data
alter table tbl drop column data;
Create a column data with raw(200) type
alter table tbl add data raw(200);
Select & insert from the temporary table created
insert into tbl select id, dbms_lob.substr(data,200,1) from tbl_temp;
commit;
We are using the substr function of the dbms_lob package, which returns RAW when given a BLOB, so the result can be inserted directly.
Sybase BCP exports nicely but only includes the data. Is there a way to include column names in the output?
AFAIK it's very difficult to include column names in the bcp output.
Try sqsh (http://www.sqsh.org/), a free isql replacement with pipe and redirect features. For example:
1> select * from sysobjects
2> go 2>/dev/null >/tmp/objects.txt
I suppose you can achieve the necessary result.
With bcp you can't get the table columns.
You can get them with a query like this:
select c.name from sysobjects o
inner join syscolumns c on o.id = c.id and o.name = 'tablename'
order by c.colid
I solved this problem not too long ago with a proc that loops through the table's columns and concatenates them. I removed all the error checking and the procedure wrapper from this example, but it should give you the idea. I then BCP'd out of the table below into header.txt, BCP'd the results into detail.txt, and used DOS copy /b header.txt+detail.txt file.txt to combine the header and detail records; this was all done in a batch script (sketched after the proc below).
The table you will BCP
create table dbo.header_record
(
headers_delimited varchar(5000)
)
Then massage the commands below into a stored proc. Use isql to call this proc before your BCP extracts.
declare
@last_col int,
@curr_col int,
@header_conc varchar(5000),
@table_name varchar(35),
@delim varchar(5),
@delim_size int
select
@header_conc = '',
@table_name = 'dbo.detail_table',
@delim = '~'
set @delim_size = len(@delim)
--
-- create a column list table with an identity() column so we can work through it
--
create local temporary table col_list
(
col_head int identity
,column_name varchar(50)
) on commit preserve rows
--
-- Delete existing rows in case columns have changed
--
delete from header_record
--
-- insert our column names in the order in which they were created
--
insert into col_list (column_name)
select
trim(column_name)
from SYS.SYSCOLUMN --sybase IQ specific, you will need to adjust.
where table_id+100000 = object_id(@table_name) --Sybase IQ 12.7 specific, 15.x will need to be changed.
order by column_id asc
--
-- select the biggest identity in the col_list table
--
select @last_col = max(col_head)
from col_list
--
-- Start at column 1
--
set @curr_col = 1
--
-- while the current column is less than or equal to the last column we need to
-- process, continue, else end
--
while (@curr_col <= @last_col)
BEGIN
select
@header_conc =
@header_conc + @delim + column_name
from col_list where col_head = @curr_col
set @curr_col = @curr_col + 1
END
--
-- insert our final concatenated value into 1 field, skipping the leading delimiter
--
insert into dbo.header_record
select substring(@header_conc, @delim_size + 1, len(@header_conc))
--
-- Drop temp table
--
drop table col_list
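The batch script itself is not shown above, but it might look roughly like this (server, login, file names, and the script wrapping the SQL above are all placeholders):
REM sketch only: build_header.sql is assumed to run the statements above
isql -S MYSERVER -U myuser -P mypass -i build_header.sql
bcp mydb.dbo.header_record out header.txt -c -t~ -S MYSERVER -U myuser -P mypass
bcp mydb.dbo.detail_table out detail.txt -c -t~ -S MYSERVER -U myuser -P mypass
copy /b header.txt+detail.txt file.txt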
I created a view with the first row being the column names unioned to the actual table.
create view bcp_view
as select 'name' col1, 'age' col2, ....
union
select name, convert(varchar, age), .... from people
Just remember to convert any non-varchar columns.