Export BigQuery results to GCS based on the value of a column - sql

I've been using the following code to export BQ results to GCS:
export_query = f"""
EXPORT DATA
OPTIONS(
  uri='{uri}',
  format='{format}',
  overwrite=true,
  compression='GZIP')
AS {query}"""
client.query(export_query, project=project).result()
Now I need to split a large table based on the value of a particular column and then export each part separately, i.e.
query_k = """
SELECT * FROM table WHERE col = k
"""
for k = 1, 2, 3, ..., N
I can do it by running N queries, but that seems very slow and consumes a lot of resources. I am wondering if it is possible to accomplish the task with a single query.
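For reference, this is roughly what the N-query approach looks like; the bucket path, table name and CSV format below are placeholders rather than the actual setup:
for k in range(1, N + 1):
    export_query = f"""
    EXPORT DATA
    OPTIONS(
      uri='gs://my-bucket/export/col_{k}/*.csv.gz',
      format='CSV',
      overwrite=true,
      compression='GZIP')
    AS SELECT * FROM `project.dataset.table` WHERE col = {k}"""
    # One export job per value of col - this is the slow path being asked about.
    client.query(export_query, project=project).result()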

Related

Apache Beam Pipeline Write to Multiple BQ tables

I have a scenario where I need to do the following:
Read data from Pub/Sub.
Apply multiple transformations to the data.
Persist the PCollection to multiple Google BigQuery tables based on some config.
My question is: how can I write data to multiple BigQuery tables?
I searched for multiple BQ writes using Apache Beam but could not find any solution.
You can do that with 3 sinks; here is an example with Beam Python:
import logging

import apache_beam as beam
from apache_beam.io.gcp.pubsub import ReadFromPubSub
from apache_beam.options.pipeline_options import PipelineOptions

def map1(element):
    ...

def map2(element):
    ...

def map3(element):
    ...

def main() -> None:
    logging.getLogger().setLevel(logging.INFO)

    # your_options = PipelineOptions().view_as(YourOptions)  # YourOptions: your custom options class, if any
    pipeline_options = PipelineOptions()

    with beam.Pipeline(options=pipeline_options) as p:
        # The first PCollection: read from Pub/Sub, then apply the 3 transformations.
        result_pcollection = (
            p
            | 'Read from pub sub' >> ReadFromPubSub(subscription='input_subscription')
            | 'Map 1' >> beam.Map(map1)
            | 'Map 2' >> beam.Map(map2)
            | 'Map 3' >> beam.Map(map3)
        )

        # The same PCollection feeds all three BigQuery sinks.
        (result_pcollection |
            'Write to BQ table 1' >> beam.io.WriteToBigQuery(
                project='project_id',
                dataset='dataset',
                table='table1',
                method='STREAMING_INSERTS',
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))

        (result_pcollection |
            'Write to BQ table 2' >> beam.io.WriteToBigQuery(
                project='project_id',
                dataset='dataset',
                table='table2',
                method='STREAMING_INSERTS',
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))

        (result_pcollection |
            'Write to BQ table 3' >> beam.io.WriteToBigQuery(
                project='project_id',
                dataset='dataset',
                table='table3',
                method='STREAMING_INSERTS',
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))

if __name__ == "__main__":
    main()
The first PCollection is the result of the input from Pub/Sub.
I applied 3 transformations to the input PCollection.
Then the result is sunk to the 3 different BigQuery tables:
res = Flow
=> Map 1
=> Map 2
=> Map 3
res => Sink result to BQ table 1 with `BigqueryIO`
res => Sink result to BQ table 2 with `BigqueryIO`
res => Sink result to BQ table 3 with `BigqueryIO`
In this example I used STREAMING_INSERTS for ingestion to the BigQuery tables, but you can adapt and change it if needed in your case.
I see the previous answer satisfies your requirement of writing the same result to multiple tables. However, assuming the scenario below, here is a slightly different pipeline:
Read data from PubSub
Filter the data based on configs (from event message keys)
Apply the different/same transformation to the filtered collections
Write results from previous collections to different BigQuery Sinks
Here, we filter the events at an early stage in the pipeline, which helps to:
Avoid processing the same event messages multiple times.
Skip the messages which are not needed.
Apply only the relevant transformations to each event message.
Keep the overall system efficient and cost-effective.
For example, say you are processing messages from all around the world and you need to process and store the data with respect to geography - storing European messages in the Europe region.
Also, you need to apply transformations which are relevant to the country-specific data - add an Aadhar number to messages generated from India and a Social Security number to messages generated from the USA.
And you don't want to process/store any events from specific countries - in this use case, data from Oceanian countries is irrelevant and does not need to be processed or stored.
So, in this made-up example, by filtering the data (based on the config) at an early stage, you can store country-specific data in multiple sinks, you don't have to process all events generated from the USA/any other region when adding an Aadhar number (event-specific transformations), and you can skip/drop the irrelevant records or simply store them in BigQuery without applying any transformations.
If the above made-up example resembles your scenario, the sample pipeline design may look like this:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions,...
from apache_beam.io.gcp.internal.clients import bigquery

class TaggedData(beam.DoFn):
    def process(self, element):
        try:
            # filter here
            if element["country"] == "in":
                yield {"indiaelements:taggedasindia"}
            if element["country"] == "usa":
                yield {"usaelements:taggedasusa"}
            ...
        except:
            yield {"taggedasunprocessed"}

def addAadhar(element):
    "Filtered messages - only India"
    yield "elementwithAadhar"

def addSSN(element):
    "Filtered messages - only USA"
    yield "elementwithSSN"

p = beam.Pipeline(options=options)

messages = (
    p
    | "ReadFromPubSub" >> ...
    | "Tagging" >> beam.ParDo(TaggedData()).with_outputs('usa', 'india', 'oceania', ...)
)

india_messages = (
    messages.india
    | "AddAadhar" >> ...
    | "WriteIndiamsgToBQ" >> streaming inserts
)

usa_messages = (
    messages.usa
    | "AddSSN" >> ...
    | "WriteUSAmsgToBQ" >> streaming inserts
)

oceania_messages = (
    messages.oceania
    | "DoNothing&WriteOceaniamsgToBQ" >> streaming inserts
)

deadletter = (
    (messages.unprocessed, stage1.failed, stage2.failed)
    | "CombineAllFailed" >> Flatn...
    | "WriteUnprocessed/InvalidMessagesToBQ" >> streaminginserts...
)
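The sketch above leaves the routing to the named outputs implicit; a minimal runnable illustration of tagged outputs with beam.pvalue.TaggedOutput (the country codes, output names and the in-memory Create input are placeholders for the real Pub/Sub source) might look like this:
import apache_beam as beam
from apache_beam import pvalue

class TagByCountry(beam.DoFn):
    # Routes each element to a named output based on its "country" field.
    def process(self, element):
        country = element.get("country")
        if country == "in":
            yield pvalue.TaggedOutput("india", element)
        elif country == "usa":
            yield pvalue.TaggedOutput("usa", element)
        else:
            yield pvalue.TaggedOutput("unprocessed", element)

with beam.Pipeline() as p:
    tagged = (
        p
        | "Create" >> beam.Create([{"country": "in"}, {"country": "usa"}, {"country": "fj"}])
        | "Tag" >> beam.ParDo(TagByCountry()).with_outputs("india", "usa", "unprocessed")
    )
    # Each named output is an independent PCollection that can get its own
    # transformations and its own BigQuery sink.
    tagged.india | "PrintIndia" >> beam.Map(print)
    tagged.usa | "PrintUSA" >> beam.Map(print)
    tagged.unprocessed | "PrintOther" >> beam.Map(print)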

How to truncate a table in PySpark?

In one of my projects, I need to check if an input dataframe is empty or not. If it is not empty, I need to do a bunch of operations and load some results into a table and overwrite the old data there.
On the other hand, if the input dataframe is empty, I do nothing and simply need to truncate the old data in the table. I know how to insert data with overwrite but don't know how to truncate the table only. I searched existing questions/answers and found no clear answer.
driver = 'com.microsoft.sqlserver.jdbc.SQLServerDriver'
stage_url = 'jdbc:sqlserver://server_name\DEV:51433;databaseName=project_stage;user=xxxxx;password=xxxxxxx'

if input_df.count() > 0:
    # Do something here to generate result_df
    print(" write to table ")
    write_dbtable = 'Project_Stage.StageBase.result_table'
    write_df = result_df
    write_df.write.format('jdbc').option('url', stage_url).option('dbtable', write_dbtable). \
        option('truncate', 'true').mode('overwrite').option('driver', driver).save()
else:
    print('no account to process!')
    query = """TRUNCATE TABLE Project_Stage.StageBase.result_table"""
    ### Not sure how to run the query
Truncating is probably easiest done like this:
write_df = write_df.limit(0)
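Spelled out in the context of the question's else branch, a sketch of this approach, reusing stage_url and driver from the question and assuming input_df has the same schema as the target table:
# Overwriting with an empty DataFrame plus option('truncate', 'true') clears the
# table without dropping it.
empty_df = input_df.limit(0)
(empty_df.write
    .format('jdbc')
    .option('url', stage_url)
    .option('dbtable', 'Project_Stage.StageBase.result_table')
    .option('truncate', 'true')
    .option('driver', driver)
    .mode('overwrite')
    .save())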
Also, for better performance, instead of input_df.count() > 0 you should use:
Spark 3.2 and below: len(input_df.head(1)) > 0
Spark 3.3+: not input_df.isEmpty()

How can I import data from excel to postgres- many to many relationship

I'm creating a web application and I encountered a problem with importing data into a table in a Postgres database.
I have an Excel file with id_b and id_cat (book IDs and category IDs). Books have several categories and categories can be assigned to many books. The Excel file looks like this:
excel data
It has 30,000 records.
I have a problem with how to import it into the database (Postgres). The table for this data has two columns:
id_b and id_cat. I wanted to export this data to CSV in such a way that each book is assigned one category identifier per row (e.g., the book with identifier 1 should appear 9 times because it has 9 categories assigned to it, and so on), but I can't do it easily. It should look like this:
correct data
Does anyone know any way to get data in this form?
Your Excel sheet format has a large and variable number of columns, which depends on the number of categories per book, and SQL isn't well adapted to that.
The simplest option would be to:
Export your excel data as CSV.
Use a python script to read it using the csv module and output COPY-friendly tab-delimited format.
Load this into the database (or INSERT directly from python script).
Something like this:
import csv

with open('bookcat.csv') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        if row:
            id = row[0].strip()
            categories = row[1:]
            for cat in categories:
                cat = cat.strip()
                if cat:
                    print("%s\t%s" % (id, cat))
csv output version:
import csv

with open('bookcat.csv') as csvfile, open("out.csv", "w") as outfile:
    reader = csv.reader(csvfile)
    writer = csv.writer(outfile)
    for row in reader:
        if row:
            id = row[0].strip()
            categories = row[1:]
            for cat in categories:
                cat = cat.strip()
                if cat:
                    writer.writerow((id, cat))
If you need a specific csv format, check the docs of csv module.
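To then load out.csv into the database (step 3 above), one possible sketch with psycopg2; the table name book_category, its columns and the connection string are assumptions to adapt to your schema:
import psycopg2

# Hypothetical connection string and table/column names - adjust to your database.
conn = psycopg2.connect("dbname=mydb user=myuser password=secret host=localhost")
with conn, conn.cursor() as cur, open("out.csv") as f:
    cur.copy_expert(
        "COPY book_category (id_b, id_cat) FROM STDIN WITH (FORMAT csv)", f
    )
conn.close()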

Load multiple Excel sheets using For loop with QlikView

I want to load data from two different Excel files, and use them in the same table in QlikView.
I have two files (DIVISION.xls and REGION.xls), and I'm using the following code:
let tt = 'DIVISION$' and 'REGION$';

FOR Each db_schema in 'DIVISION.xls','REGION.xls'
    FOR Each v_db in $(tt)
        div_reg_table:
        LOAD *
        FROM $(db_schema)
        (biff, embedded labels, table is $(v_db));
    NEXT
NEXT
This code runs without any error, but I don't get any data, and I don't see my new table (div_reg_table).
Can you help me?
The main reason that your code does not load any data is due to the form of the tt variable.
When your inner loop executes, it evaluates tt (denoted by $(tt)), which then results in the evaluation of:
'DIVISION$' AND 'REGION$'
This results in null, since these are just two strings.
If you change your statement slightly, from LET to SET and remove the AND, then the inner loop will work. For example:
SET tt = 'DIVISION$', 'REGION$';
However, this also means that your inner loop will be executed for each value in tt for both workbooks, so unless you have a REGION sheet in your DIVISION workbook (and vice versa), the load will fail.
To avoid this, you may wish to restructure your script slightly so that you have a control table which details the files and tables to load. An example I prepared is shown below:
FileList:
LOAD * INLINE [
FileName, TableName
REGION.XLS, REGION$
DIVISION.XLS, DIVISION$
];
FOR i = 0 TO NoOfRows('FileList') - 1
    LET FileName = peek('FileName', i, 'FileList');
    LET TableName = peek('TableName', i, 'FileList');

    div_reg_table:
    LOAD
        *
    FROM '$(FileName)'
    (biff, embedded labels, table is '$(TableName)');
NEXT
If you have just two files DIVISION.XLS and REGION.XLS then it may be worth just using two separate loads one after the other, and removing the for loop entirely. For example:
div_reg_table:
LOAD
*
FROM [DIVISION.XLS]
(biff, embedded labels, table is DIVISION$)
WHERE DivisionName = 'AA';
LOAD
*
FROM [REGION.XLS]
(biff, embedded labels, table is REGION$)
WHERE RegionName = 'BB';
Don't you need to make the NoOfRows('FileList') test be 1 less than the answer?
NoOfRows('FileList') will evaluate to 2, so you will get a loop step for 0, 1 and 2; step 2 will return a null, which will cause it to fail.
I did it like this:
FileList:
LOAD * INLINE [
FileName, TableName
REGION.XLS, REGION
DIVISION.XLS, DIVISION
];
let vNo = NoOfRows('FileList') - 1;

FOR i = 0 TO $(vNo)
    LET FileName = peek('FileName', i, 'FileList');
    LET TableName = peek('TableName', i, 'FileList');

    div_reg_table:
    LOAD
        *,
        $(i) as I,
        FileBaseName() as SOURCE
    FROM '$(FileName)'
    (biff, embedded labels, table is '$(TableName)');
NEXT

apache spark sql query optimization and saving result values?

I have a large amount of data in text files (1,000,000 lines). Each line has 128 columns.
Here each line is a feature and each column is a dimension.
I have converted the txt files to JSON format and am able to run SQL queries on the JSON files using Spark.
Now I am trying to build a k-d tree with this large data.
My steps:
1) Calculate the variance of each column, pick the column with the maximum variance and make it the key of the first node, with the mean of that column as the value of the node.
2) Based on the first node's value, split the data into 2 parts and repeat the process until a stopping point is reached.
My sample code:
import sqlContext._
val people = sqlContext.jsonFile("siftoutput/")
people.printSchema()
people.registerTempTable("people")
val output = sqlContext.sql("SELECT * From people")
The people table has 128 columns.
My questions:
1) How do I save the result values of a query into a list?
2) How do I calculate the variance of a column?
3) I will be running multiple queries on the same data. Does Spark have any way to optimize this?
4) How do I save the output as key-value pairs in a text file?
Please help.
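As a rough starting point for questions 1-4, a minimal PySpark sketch, assuming the same siftoutput/ JSON directory and that all 128 columns are numeric; the kv_output/ path and the tab-separated key/value layout are assumptions:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kdtree-prep").getOrCreate()

# Same JSON directory as in the question; cache() keeps the parsed data in memory
# so repeated queries do not re-read the files (question 3).
people = spark.read.json("siftoutput/")
people.cache()

# Variance of every column in a single aggregation (question 2);
# assumes every column is numeric.
variances = people.agg(*[F.variance(c).alias(c) for c in people.columns]).first().asDict()
split_col = max(variances, key=variances.get)  # column with the largest variance

# collect() returns the rows of a query as a Python list (question 1).
rows = people.select(split_col).collect()

# Key/value text output (question 4): one "column<TAB>value" line per row.
(people.select(split_col)
       .rdd
       .map(lambda r: "%s\t%s" % (split_col, r[0]))
       .saveAsTextFile("kv_output/"))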