Ignore Last row in CSV file as part of BigQuery External table command - google-bigquery

I have about 40-odd CSV files, comma delimited, in GCS. However, the last line of every file contains only a quote and a dot: ".
So these files don't exactly conform to the CSV schema and have a data quality issue that I have to work around.
My aim is to create an external table referencing to the gcs files and then be able to select the data.
example:
create or replace external table dataset.tableName
options (
    uris = ['gs://bucket_path/allCSVFILES_*.csv'],
    format = 'CSV',
    skip_leading_rows = 1,
    ignore_unknown_values = true
)
The external table gets created without any error. However, when I select the data, I run into this error:
"error message: CSV table references column position 16, but line starting at position:18628631 contains only 1 columns"
This is due to the quote and dot (".) at the end of each file.
My question is: is there any way in BigQuery to consume the data without the LAST LINE? Among the options we have skip_leading_rows to skip the header, but is there any way to skip the last row?
Currently my best option is to clean the files using a sed/tail command.
I have checked the create or replace external table options list below and have tried ignore_unknown_values, but other than that I don't see any option that will work.
https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#create_external_table_statement
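For reference, here is a rough pre-cleaning sketch along the lines of the sed/tail route, using the Python GCS client (the bucket name and prefix are placeholders, and it assumes each file is small enough to rewrite in memory):
from google.cloud import storage

client = storage.Client()

# rewrite each CSV in place without its trailing junk line
for blob in client.list_blobs("bucket_path", prefix="allCSVFILES_"):
    text = blob.download_as_text()
    lines = text.splitlines()
    if lines and lines[-1].strip() == '".':
        blob.upload_from_string("\n".join(lines[:-1]) + "\n", content_type="text/csv")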

You can try the workaround below:
I tried it with pandas and removed the last record from the CSV file.
from google.cloud import bigquery
from google.cloud import storage
import pandas as pd

# read the CSV from GCS into a dataframe
df = pd.read_csv('gs://samplecsv.csv')

client = bigquery.Client()
dataset_ref = client.dataset('dataset')
table_ref = dataset_ref.table('new_table')

# drop the last (malformed) record, then load the dataframe into BigQuery
df.drop(df.tail(1).index, inplace=True)
client.load_table_from_dataframe(df, table_ref).result()
For more information you can refer to this link, which mentions the limitations for loading CSV files into BigQuery.
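Another angle worth a try (a sketch only, not verified against your files): the external table definition also accepts max_bad_records, so if the trailing ". line is the only malformed record in each file, letting BigQuery skip a handful of bad records may be enough. With the Python client it would look roughly like:
from google.cloud import bigquery

client = bigquery.Client()

external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://bucket_path/allCSVFILES_*.csv"]
external_config.ignore_unknown_values = True
external_config.max_bad_records = 40          # roughly one bad trailing line per file; adjust as needed
external_config.options.skip_leading_rows = 1

table = bigquery.Table("project.dataset.tableName")   # placeholder project/dataset/table
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)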

Related

GBQ BQ-CLI Export CSV to Google Storage with query and specified delimiter

I'm trying to export a table from Google BigQuery to CSV and save the file to Google Cloud Storage using bq extract. The issue is that I want to export that table into 2 CSV files with different filters, and also use ; as the delimiter for my CSV files. But I cannot find any documentation online on how to use queries with bq extract.
For example, I have the table mytable.cities and I would like to export it into 2 CSV files: for the first CSV I want the rows where city = 'Los Angeles', and for the second CSV the rows where city = 'New York'.
My syntax currently is this:
bq extract --destination_format=CSV --field_delimiter=';' mytable.cities gs://myBucket/myFile.csv
I did not use the command bq query because it doesn't give me the option to change my delimiter to ;
How can I achieve this?
An easy way to do this is to write code that uses the BigQuery API to run your queries and then uses the GCS API to save the results as CSV. See the code below:
from google.cloud import bigquery
from google.cloud import storage
bq_client = bigquery.Client()
gcs_client = storage.Client()
bucket = gcs_client.get_bucket("your-bucket-name")
query_1 = """
SELECT mascot_name,mascot FROM `bigquery-public-data.ncaa_basketball.mascots`
where non_tax_type = 'Devils'
"""
query_2 = """
SELECT mascot_name,mascot FROM `bigquery-public-data.ncaa_basketball.mascots`
where non_tax_type = 'Dragons'
"""
df_1 = bq_client.query(query_1).to_dataframe()
bucket.blob("devils.csv").upload_from_string(df_1.to_csv(index=False,sep=";"),"text/csv")
df_2 = bq_client.query(query_2).to_dataframe()
bucket.blob("dragons.csv").upload_from_string(df_2.to_csv(index=False,sep=";"),"text/csv")
NOTE: The example above uses a public dataset to test the query and export to CSV.
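If you would rather not route the rows through pandas, another option (a sketch, reusing the table, bucket and filters from the question) is BigQuery's EXPORT DATA statement, which does let you set field_delimiter, run once per filter:
from google.cloud import bigquery

client = bigquery.Client()

# one EXPORT DATA statement per filter; the uri must contain a single '*' wildcard
for city, prefix in [("Los Angeles", "los_angeles"), ("New York", "new_york")]:
    sql = f"""
        EXPORT DATA OPTIONS(
            uri = 'gs://myBucket/{prefix}_*.csv',
            format = 'CSV',
            overwrite = true,
            header = true,
            field_delimiter = ';'
        ) AS
        SELECT * FROM mytable.cities WHERE city = '{city}'
    """
    client.query(sql).result()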

Generate Multiple files Table Structure and create table

I'm trying to generate the table structure for multiple files in Snowflake via Python.
I have a list of files in a directory, and I want to read the data from the files and create the tables dynamically in Snowflake using the file names.
Below is what I have tried so far:
# Generate Multiple files Table Structure to Snowflake via Python
import os
from os import path
import pandas as pd
import snowflake.connector
from snowflake.sqlalchemy import URL
from sqlalchemy import create_engine
from snowflake.connector.pandas_tools import write_pandas, pd_writer

dir = r"D:\Datasets\users_dataset"

engine = create_engine(URL(
    account='<account>',
    user='<user>',
    password='<password>',
    role='ACCOUNTADMIN',
    warehouse='COMPUTE_WH',
    database='DEM0_DB',
    schema='PUBLIC'
))
connection = engine.connect()
connection.execute("USE DATABASE DEMO_DB")
connection.execute("USE SCHEMA PUBLIC")
results = connection.execute('USE DATABASE DEMO_DB').fetchone()
print(results)

# read the files from the directory and split the filename and extension
for file in os.listdir(dir):
    name, extr = path.splitext(file)
    print(name)
    file_path = os.path.join(dir, file)
    print(file_path)
    df = pd.read_csv(file_path, delimiter=',')
    df.to_sql(name, con=engine, index=False)
I'm getting the error below:
sqlalchemy.exc.ProgrammingError: (snowflake.connector.errors.ProgrammingError) 090105 (22000): Cannot perform CREATE TABLE. This session does not have a current database. Call 'USE DATABASE', or use a qualified name.
[SQL:
CREATE TABLE desktop (
"[.ShellClassInfo]" FLOAT
)
]
(Background on this error at: https://sqlalche.me/e/14/f405)
I checked for permission issues on Snowflake and haven't found any.
Can someone please help with this error?
In the create_engine call the database is spelled DEM0_DB (with a zero) instead of DEMO_DB:
engine = create_engine(URL(
    ...
    #database='DEM0_DB',
    database='DEMO_DB',
    schema='PUBLIC'
))
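As a quick sanity check (a small sketch assuming the corrected engine above), you can confirm that the session now has a current database before running the loop:
# CURRENT_DATABASE() should now return DEMO_DB instead of NULL
with engine.connect() as connection:
    print(connection.execute("SELECT CURRENT_DATABASE(), CURRENT_SCHEMA()").fetchone())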

reading partitioned dataset in aws s3 with pyarrow doesn't add partition columns

I'm trying to read a partitioned dataset in AWS S3. It looks like:
MyDirectory/
    code=1/file.parquet
    code=2/another.parquet
    code=3/another.parquet
I created a file_list containing the paths to all the files in the directory, then executed:
df = pq.ParquetDataset(file_list, filesystem=fs).read().to_pandas()
Everything works, except that the partition column code doesn't exist in the dataframe df.
I also tried using a single path to MyDirectory instead of file_list, but got the error
"Found files in an intermediate directory: s3://bucket/Mydirectoty". I can't find any answer online.
Thank you!
AWS has a project (AWS Data Wrangler) that helps with the integration between Pandas/PyArrow and their services.
This snippet should work:
import awswrangler as wr

# Write
wr.s3.to_parquet(
    df=df,
    path="s3://...",
    mode="overwrite",
    dataset=True,
    database="my_database",  # Optional, only if you want it available on Athena/Glue Catalog
    table="my_table",
    partition_cols=["PARTITION_COL_NAME"]
)

# READ
df = wr.s3.read_parquet(path="s3://...", dataset=True)
If you're happy with other tools, you can give dask a try. Assuming all the data you want to read is in s3://folder, you can just use:
import dask.dataframe as dd

storage_options = {
    'key': your_key,
    'secret': your_secret
}

df = dd.read_parquet("s3://folder",
                     storage_options=storage_options)
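Staying with pyarrow itself, the newer dataset API (a sketch, assuming pyarrow >= 3.x, AWS credentials available in the environment, and that MyDirectory contains only the code=... subdirectories) discovers hive-style partitions when pointed at the directory root rather than at a file list:
import pyarrow.dataset as ds

# hive partitioning turns the code=1/code=2/... directories into a 'code' column
dataset = ds.dataset("s3://bucket/MyDirectory/", format="parquet", partitioning="hive")
df = dataset.to_table().to_pandas()   # df now includes the partition column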

How to Convert Many CSV files to Parquet using AWS Glue

I'm using AWS S3, Glue, and Athena with the following setup:
S3 --> Glue --> Athena
My raw data is stored on S3 as CSV files. I'm using Glue for ETL, and I'm using Athena to query the data.
Since I'm using Athena, I'd like to convert the CSV files to Parquet. I'm using AWS Glue to do this right now. This is the current process I'm using:
1. Run a Crawler to read the CSV files and populate the Data Catalog.
2. Run an ETL job to create Parquet files from the Data Catalog.
3. Run a Crawler to populate the Data Catalog using the Parquet files.
The Glue job only allows me to convert one table at a time. If I have many CSV files, this process quickly becomes unmanageable. Is there a better way, perhaps a "correct" way, of converting many CSV files to Parquet using AWS Glue or some other AWS service?
I had the exact same situation: I wanted to efficiently loop through the catalog tables catalogued by the crawler, which point to CSV files, and then convert them to Parquet. Unfortunately there is not much information available on the web yet, which is why I have written a blog post on LinkedIn explaining how I did it. Please have a read, especially point #5. Hope that helps. Please let me know your feedback.
Note: As per Antti's feedback, I am pasting the excerpted solution from my blog below:
Iterating through catalog/database/tables
The Job Wizard comes with an option to run a predefined script on a data source. The problem is that the data source you can select is a single table from the catalog. It does not give you the option to run the job on a whole database or a set of tables. You can modify the script later anyway, but the way to iterate through the database tables in the Glue catalog is also very difficult to find. There are Catalog APIs, but they lack suitable examples. The GitHub example repo could be enriched with a lot more scenarios to help developers.
After some mucking around, I came up with the script below, which does the job. I have used the boto3 client to loop through the tables. I am pasting it here in case it helps someone. I would also like to hear from you if you have a better suggestion.
import sys
import boto3
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## #params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

client = boto3.client('glue', region_name='ap-southeast-2')
databaseName = 'tpc-ds-csv'
print('\ndatabaseName: ' + databaseName)

# loop through every table in the catalog database and write it out as Parquet
Tables = client.get_tables(DatabaseName=databaseName)
tableList = Tables['TableList']
for table in tableList:
    tableName = table['Name']
    print('\n-- tableName: ' + tableName)
    datasource0 = glueContext.create_dynamic_frame.from_catalog(
        database="tpc-ds-csv",
        table_name=tableName,
        transformation_ctx="datasource0"
    )
    datasink4 = glueContext.write_dynamic_frame.from_options(
        frame=datasource0,
        connection_type="s3",
        connection_options={
            "path": "s3://aws-glue-tpcds-parquet/" + tableName + "/"
        },
        format="parquet",
        transformation_ctx="datasink4"
    )
job.commit()
Please refer to EDIT for updated info.
S3 --> Athena
Why not use the CSV format directly with Athena?
https://docs.aws.amazon.com/athena/latest/ug/supported-format.html
CSV is one of the supported formats. Also to make it efficient, you can compress multiple CSV files for faster loading.
Supported compression,
https://docs.aws.amazon.com/athena/latest/ug/compression-formats.html
Hope it helps.
EDIT:
Why is the Parquet format more helpful than CSV?
https://dzone.com/articles/how-to-be-a-hero-with-powerful-parquet-google-and
S3 --> Glue --> Athena
More details on CSV to Parquet conversion,
https://aws.amazon.com/blogs/big-data/build-a-data-lake-foundation-with-aws-glue-and-amazon-s3/
I'm not a big fan of Glue, nor of creating schemas from data.
Here's how to do it in Athena, which is dramatically faster than Glue.
This is for the CSV files:
create external table foo (
    id int,
    name string,
    some date
)
row format delimited
fields terminated by ','
location 's3://mybucket/path/to/csvs/'
This is for the parquet files:
create table bar
with (
    external_location = 's3://mybucket/path/to/parquet/',
    format = 'PARQUET'
)
as select * from foo
You don't need to create that path for parquet, even if you use partitioning
You can convert either JSON or CSV files into Parquet directly, without importing them into the catalog first.
This is for JSON files - the code below will convert anything hosted in the rawFiles directory:
import sys
from awsglue.job import Job
from awsglue.transforms import *
from awsglue.context import GlueContext
from pyspark.context import SparkContext
from awsglue.utils import getResolvedOptions

## #params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sparkContext = SparkContext()
glueContext = GlueContext(sparkContext)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

s3_json_path = 's3://rawFiles/'
s3_parquet_path = 's3://convertedFiles/'

# read the JSON files and write them back out as Parquet
output = spark.read.load(s3_json_path, format='json')
output.write.parquet(s3_parquet_path)

job.commit()
Sounds like in your step 1 you are crawling the individual CSV files (e.g. some-bucket/container-path/file.csv). If you instead point your crawler at a path level rather than a file level (e.g. some-bucket/container-path/) and all your CSV files are uniform, then the crawler should create a single external table instead of one external table per file, and you'll be able to extract the data from all of the files at once.
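For reference, a minimal sketch of defining such a path-level crawler with boto3 (the crawler name, IAM role, database and bucket path are placeholders):
import boto3

glue = boto3.client('glue', region_name='us-east-1')

# the crawler target is the folder, not an individual CSV file, so uniform
# files under it end up catalogued as a single table
glue.create_crawler(
    Name='csv-folder-crawler',
    Role='arn:aws:iam::123456789012:role/my-glue-role',   # placeholder role ARN
    DatabaseName='my_csv_database',
    Targets={'S3Targets': [{'Path': 's3://some-bucket/container-path/'}]}
)
glue.start_crawler(Name='csv-folder-crawler')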

concatenate text files and import them into a SQLite DB

Let us say I have thousands of comma-separated text files with 1050 columns each (no header). Is there a way to concatenate and import all the text files into one table in one SQLite database? (Ideally I'd use R and sqldf to communicate with SQLite.)
I.e.,
The files are called table1.txt, table2.txt, table3.txt; all have different numbers of rows, but the same column types, and different unique IDs in the ID column (the first column of each file).
table1.txt
id1,20.3,1.2,3.4
id10,2.1,5.2,9.3
id21,20.5,1.2,8.4
table2.txt
id2,20.3,1.2,3.4
id92,2.1,5.2,9.3
table3.txt
id3,1.3,2.2,5.4
id30,9.1,4.4,9.3
The real example is pretty much the same but with more columns and more rows. As you can see, the first column in each file corresponds to a unique ID.
Now I'd like my new table mysupertable, in the database super.db, to be (also uniquely indexed):
super.db - name of the DB
mysupertable - name of the table in the DB
myids,v1,v2,v3
id1,20.3,1.2,3.4
id10,2.1,5.2,9.3
id21,20.5,1.2,8.4
id2,20.3,1.2,3.4
id92,2.1,5.2,9.3
id3,1.3,2.2,5.4
id30,9.1,4.4,9.3
For reference, I am using SQLite3, and I am looking for a SQL command that I can run in the background without logging interactively into the sqlite3 interpreter, i.e., IMPORT bla INTO,...
I could try in unix:
cat *.txt > allmyfiles.txt
and then a .sql file,
CREATE TABLE test (myids varchar(255), v1 float, v2 float, v3 float);
.separator ,
.import output.csv test
But this approach does not work since I am using the R sqldf library and dbGetQuery(db, sql), and I have no idea how to create such a string in R without getting an error.
P.S. I asked a similar question about appending tables from a DB, but this time I need to append/import text files, not tables from a DB.
If you are using sqlite database files anyway, you might want to consider working with RSQLite.
install.packages( "RSQLite" ) # will install package "DBI"
library( RSQLite )
db <- dbConnect( dbDriver("SQLite"), dbname = "super.db" )
You can still use the Unix command within R, which should be faster than any loop in R, using the system() command:
system( "cat *.txt > allmyfiles.txt" )
Provided that your allmyfiles.txt has a consistent format, you can import it as a data.frame into R
allMyFiles <- read.table( "allmyfiles.txt", header = FALSE, sep = "," )
and write it to your database, following @Martín Bel's advice, with something like
dbWriteTable( db, "mysupertable", allMyFiles, overwrite = TRUE, append = FALSE )
EDIT:
Or, if you don't want to route your data through R, you can again resort to using the system() command. This may get you started:
You have a file with the data you want to get into SQLite called allmyfiles.txt. Create a file called table.sql with this content (obviously the structure must match):
CREATE TABLE mysupertable (myids varchar(255), v1 float, v2 float, v3 float);
.separator ,
.import allmyfiles.txt mysupertable
and call it from R with
system( "sqlite3 super.db < table.sql" )
That should avoid routing the data through R but still do all the work from within R.
Take a look at termsql:
https://gitorious.org/termsql/pages/Home
cat *.txt | termsql -d ',' -t mysupertable -c 'myids,v1,v2,v3' -o mynew.db
This should do the job.