Can dbt seed be used with pipe-delimited CSV files? - dbt

We have over a dozen tables we have built that are perfect candidates for dbt seed, and it is working great. We do have two files with addresses that contain commas, though. I tried to use a pipe-delimited file but get a syntax error saying it found an unexpected '|'.
I have searched the internet, getdbt and Stack Overflow and can't find any reference to declaring a delimiter in the dbt_project.yml file. Can we use a pipe delimiter in the csv file instead of a comma? Thanks.

Just quote the values like so:
col_1,col_2,col_3
value,"value, with comma",another value

Unfortunately, this is an issue I've raised with the dbt team, but it has yet to be resolved.
See: dbt-core issue #3990
That issue thread has more detail on the underlying issues with the agate library's CSV function, kwargs, etc.
In the interim, Jeremy was helpful in giving some options which make it possible to parse "|" or other delimiters via the dbt-external-tables package, which you can find more information about here:
See: dbt-external-tables issue #72
The file_format property, used as the input to create external table
statements, accepts a string of any length. So today, you could pass
it:
sources:
  - name: my_external_source
    tables:
      - name: my_external_tbl
        external:
          location: "#my_stage"
          file_format: "( type = csv field_delimiter = 'aa' record_delimiter = 'aabb' )"
The package macros will template this out as:
create or replace external table my_external_source.my_external_tbl
with location = '#my_stage'
file_format = ( type = csv field_delimiter = 'aa' record_delimiter = 'aabb' )
Hope that helps. I find the support for "|" delimited files to be relatively weak in general, since they seem to be used heavily only in certain industries (financial) or locales.
Will update this answer if anything above changes, but this is current information as of dbt-core v1.0 & dbt-external-tables v0.1.2.

Related

Not able to filter files using pathGlobFilter

We are trying to read files from a directory in Azure Blob Storage based on a pattern. We are using the
pathGlobFilter option to select files. The directory contains the following files:
Sales_51820_14529409_T_7a3cc7d1d17261fd17e7e1fabd3.csv
Sales_51820_14529409_7a3cc7d1d17261fd17e7e1fabd3.csv
Sales_61820_17529409_7a3cc7d1d17261fd17e7e1fabd3.csv
Sales_61820_17529409_T_7a3cc7d1d17261fd17e7e1fabd3.csv
We need to process only those files which do not have "T" in the file name, i.e. only these two files:
Sales_51820_14529409_7a3cc7d1d17261fd17e7e1fabd3.csv
Sales_61820_17529409_7a3cc7d1d17261fd17e7e1fabd3.csv
But we are not able to read only these two files.
Here is the code:
df = spark.read.format("csv").schema(structSchema).options(header=False, inferSchema=True, sep='|', pathGlobFilter="Sales_\d{5}_\d{8}_[a-z0-9]+.csv$").load("wasbs://abc#xxxxx.blob.core.windows.net/abc/2022/02/11/")
Regards,
Rajib
Glob is not a standard regular expression; there are differences between them.
For example, a glob pattern cannot specify how many times something repeats.
For details, see here.
Back to this question: below is a relatively clumsy workaround; I look forward to a more elegant solution from someone more expert.
pathGlobFilter="Sales_[0-9][0-9][0-9][0-9][0-9]_[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]_[a-z0-9]*.csv"

Getting Error for Excel to Table Conversion

I just started learning Python and now I'm trying to integrate that with my GIS knowledge. As the title suggests, I'm attempting to convert an Excel sheet to a table but I keep getting errors: one is wholly undecipherable to me and the other seems to suggest that my file does not exist, which I know is incorrect since I copied its location directly from its properties.
Here is a screenshot of my environment. Please help if you can and thanks in advance.
Simply put: you put the workspace directory inside the filename variable, so when arcpy handles it, it tries to access a file that does not exist, in an unknown workspace.
Try this.
arcpy.env.workspace = r"J:\egis_work\dpcd\projects\SHARITA\Python"
arcpy.ExcelToTable_conversion("Exceltest.xlsx", "Bookstorestable", "Sheet1")
Arcpy uses the following syntax to convert geodatabase tables to Excel.
It is straightforward.
Example
Excel tables cannot be stored in a geodatabase. The most reasonable thing is to store them in the root folder that contains the geodatabase holding the table. Say I want to convert the table below into Excel and save it in the root folder, i.e. the folder in which the geodatabase sits.
I proceed as follows; I have put the explanations after the # signs.
import arcpy
import os
from datetime import datetime, date, time
# Set environment settings
in_table= r"C:\working\Sunderwood\Network Analyst\MarchDistances\Centroid.gdb\SunderwoodFirstArcpyTable"
#os.path.basename(in_table)
out_xls = os.path.basename(in_table) + datetime.now().strftime('%Y%m%d')
# os.path.basename(in_table) gives the base name of the path. In this case, it returns the name of the table
# + is used in Python to concatenate
# datetime.now() gives today's date
# strftime('%Y%m%d') converts today's date into a string in the format YYYYMMDD
# Put all of the above together and you get a new file name: the input table name plus today's date
# os.path.dirname() is used to get the directory name from the specified path
geodatabase = os.path.dirname(in_table)
# In this case, os.path.dirname(in_table) gives us the geodatabase
# The join() method takes all items in an iterable and joins them into one string
SaveInFolder = "\\".join(geodatabase.split('\\')[:-1])
# Here I tell Python to split the geodatabase path defined above, drop the last piece, and join the rest back together with "\". The split is explained below.
# The split() method splits a string into a list
# In the case above it splits on "\\" into ['W:', 'working', 'Sunderwood', 'Network Analyst', 'MarchDistances', 'Centroid.gdb']. However, that is not quite what I want: I want to drop 'Centroid.gdb' so that I am left with the path 'W:\\working\\Sunderwood\\Network Analyst\\MarchDistances'
#Before I tell arcpy to save, I have to specify the workspace in which it will save. So I now make my environment the SaveInFolder
arcpy.env.workspace =SaveInFolder
## Now I have to tell arcpy what I will call my new table. I use os.path.join. This method joins path components, inserting exactly one directory separator between each non-empty part except the last.
newtable = os.path.join(arcpy.env.workspace, out_xls)
#In the above case it will give me "W:\working\Sunderwood\Network Analyst\MarchDistances\SunderwoodFirstArcpyTable20200402"
# You notice the newtable does not have an excel extension. I resort to + to concatenate .xls onto my path and make it "W:\working\Sunderwood\Network Analyst\MarchDistances\SunderwoodFirstArcpyTable20200402.xls"
table= newtable+".xls"
#Finally, I call the arcpy method and feed it with the required variables
# Execute TableToExcel
arcpy.TableToExcel_conversion(in_table, table)
print(table + " is now available")

How to fix 'File name too long' errors when using Snakemake

When using Snakemake, I store the values for my variables as part of the filenames (ex. "processed/count_{project}.tsv"). Recently, I've started using R formulas with many covariates as a variable. Now I get an error because the filename is too long for the operating system. Has anyone else run into this issue and have any suggestions? Is there a canonical Snakemake approach for this problem?
Personally, I don't think it is a good idea to store information in the filename.
Rather, I would create a temp file in tabular or YAML format linking the file in question to covariates or other data, then read this file in R (or whatever else) to extract the relevant information. See the sketch below.
One other idea is to use paths (directories) instead, since paths are allowed to be longer than individual file names.
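For example, a minimal sketch of the tabular-mapping idea (the models.tsv file, the model IDs, and the fit_model rule are all hypothetical): keep a short ID in the file name and look the full formula up from a table inside the Snakefile.
import csv

# models.tsv maps a short id to the full covariate formula, e.g.
# id      formula
# m01     count ~ age + sex + batch + treatment
with open("models.tsv") as fh:
    MODELS = {row["id"]: row["formula"] for row in csv.DictReader(fh, delimiter="\t")}

rule fit_model:
    output:
        "processed/count_{project}_{model_id}.tsv"
    params:
        formula=lambda wc: MODELS[wc.model_id]
    shell:
        "Rscript fit.R --formula '{params.formula}' --project {wildcards.project} --out {output}"
That keeps the file name short while the full formula still travels with the job.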

Using the BQ command line to change configuration.load.quote

I want to know how, using the BQ command line tool, I can change the configuration of a BigQuery API job. E.g., I want to change the configuration.load.quote property from the command line tool. Is there any way? I need this to load a table with fields containing a double quote (") inside.
You cannot modify a job once it is created, but I guess what you want is to set the quote property when creating the job.
In most cases, bq help <command> will get you what you need. Here's the output of bq help load. As you can see, you just have to specify --quote="'" after the command but before the arguments.
$ bq help load
Python script for interacting with BigQuery.
USAGE: bq.py [--global_flags] <command> [--command_flags] [args]
load Perform a load operation of source into destination_table.
Usage:
load <destination_table> <source> [<schema>]
The <destination_table> is the fully-qualified table name of table to
create, or append to if the table already exists.
The <source> argument can be a path to a single local file, or a
comma-separated list of URIs.
The <schema> argument should be either the name of a JSON file or a
text schema. This schema should be omitted if the table already has
one.
In the case that the schema is provided in text form, it should be a
comma-separated list of entries of the form name[:type], where type
will default to string if not specified.
In the case that <schema> is a filename, it should contain a single
array object, each entry of which should be an object with properties
'name', 'type', and (optionally) 'mode'. See the online documentation
for more detail:
https://developers.google.com/bigquery/preparing-data-for-bigquery
Note: the case of a single-entry schema with no type specified is
ambiguous; one can use name:string to force interpretation as a
text schema.
Examples:
bq load ds.new_tbl ./info.csv ./info_schema.json
bq load ds.new_tbl gs://mybucket/info.csv ./info_schema.json
bq load ds.small gs://mybucket/small.csv name:integer,value:string
bq load ds.small gs://mybucket/small.csv field1,field2,field3
Arguments:
destination_table: Destination table name.
source: Name of local file to import, or a comma-separated list of
URI paths to data to import.
schema: Either a text schema or JSON file, as above.
Flags for load:
/home/David/google-cloud-sdk/platform/bq/bq.py:
--[no]allow_jagged_rows: Whether to allow missing trailing optional columns in
CSV import data.
--[no]allow_quoted_newlines: Whether to allow quoted newlines in CSV import
data.
-E,--encoding: <UTF-8|ISO-8859-1>: The character encoding used by the input
file. Options include:
ISO-8859-1 (also known as Latin-1)
UTF-8
-F,--field_delimiter: The character that indicates the boundary between
columns in the input file. "\t" and "tab" are accepted names for tab.
--[no]ignore_unknown_values: Whether to allow and ignore extra, unrecognized
values in CSV or JSON import data.
--max_bad_records: Maximum number of bad records allowed before the entire job
fails.
(default: '0')
(an integer)
--quote: Quote character to use to enclose records. Default is ". To indicate
no quote character at all, use an empty string.
--[no]replace: If true erase existing contents before loading new data.
(default: 'false')
--schema: Either a filename or a comma-separated list of fields in the form
name[:type].
--skip_leading_rows: The number of rows at the beginning of the source file to
skip.
(an integer)
--source_format: <CSV|NEWLINE_DELIMITED_JSON|DATASTORE_BACKUP>: Format of
source data. Options include:
CSV
NEWLINE_DELIMITED_JSON
DATASTORE_BACKUP
gflags:
--flagfile: Insert flag definitions from the given file into the command line.
(default: '')
--undefok: comma-separated list of flag names that it is okay to specify on
the command line even if the program does not define a flag with that name.
IMPORTANT: flags in this list that have arguments MUST use the --flag=value
format.
(default: '')
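Putting it together, a load that uses a single quote as the quote character (the table name, source file, and schema file below are placeholders) would be:
bq load --quote="'" mydataset.my_tbl ./data.csv ./schema.json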

Get the contents of a Application Server directory

I need to get a listing of a server-side directory inside SAP. How do I achieve this in ABAP? Are there any built-in SAP functions I can call?
Ideally I want a function which I can pass a path as input, and which will return a list of filenames in an internal table.
EPS2_GET_DIRECTORY_LISTING does the same thing as EPS_GET_DIRECTORY_LISTING BUT returns the file names up to 200 chars!
Call function RZL_READ_DIR_LOCAL:
FUNCTION RZL_READ_DIR_LOCAL.
*"----------------------------------------------------------------------
*"Lokale Schnittstelle:
*" IMPORTING
*" NAME LIKE SALFILE-LONGNAME
*" TABLES
*" FILE_TBL STRUCTURE SALFLDIR
*" EXCEPTIONS
*" ARGUMENT_ERROR
*" NOT_FOUND
*"----------------------------------------------------------------------
Place the path in the NAME import parameter, and then read the directory listing from FILE_TBL after it returns.
RZL_READ_DIR_LOCAL can handle normal local paths as well as UNC paths.
The only downside is it only gives you access to the first 32 chars of each filename. However, you can easily create a new function based on the RZL_READ_DIR_LOCAL code, and change the way the C program output is read, as the first 187 characters of each filename are actually available.
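A minimal call would look something like this (the directory path is a placeholder; the parameter names are taken from the interface above, and SALFLDIR's NAME field holds the file name):
DATA: lt_files TYPE STANDARD TABLE OF salfldir,
      ls_file  TYPE salfldir.

CALL FUNCTION 'RZL_READ_DIR_LOCAL'
  EXPORTING
    name           = '/usr/sap/trans/data'   " placeholder path
  TABLES
    file_tbl       = lt_files
  EXCEPTIONS
    argument_error = 1
    not_found      = 2
    OTHERS         = 3.

IF sy-subrc = 0.
  LOOP AT lt_files INTO ls_file.
    WRITE: / ls_file-name.
  ENDLOOP.
ENDIF.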
After reading the answers of Chris Carrthers and tomdemuyt I would say:
1) Use RZL_READ_DIR_LOCAL if you need a simple list of filenames.
2) EPS_GET_DIRECTORY_LISTING is more powerful - it can also list subdirectories.
Thank you both!
With best regards,
Niki Galanov
The answer is to call function module EPS_GET_DIRECTORY_LISTING.
DIR_NAME -> Name of the directory
FILE_MASK -> Pass '*' to get all files.
Note: This does not deal with really long file names (80+ characters); it truncates the name.
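For example (a sketch; the directory path is a placeholder, and you should double-check the exact interface in SE37, but DIR_LIST with structure EPSFILI is what I recall it returning):
DATA: lt_dir_list TYPE STANDARD TABLE OF epsfili,
      ls_entry    TYPE epsfili.

CALL FUNCTION 'EPS_GET_DIRECTORY_LISTING'
  EXPORTING
    dir_name  = '/usr/sap/trans/data'   " placeholder path
    file_mask = '*'
  TABLES
    dir_list  = lt_dir_list
  EXCEPTIONS
    OTHERS    = 1.

IF sy-subrc = 0.
  LOOP AT lt_dir_list INTO ls_entry.
    WRITE: / ls_entry-name.
  ENDLOOP.
ENDIF.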
Take a look at transaction AL11 source code: RSWATCH0 form fill_file_list
There you can get all information about files.
Hope this helps!