Generating TPC-DS database for sql server

Generating TPC-DS database for sql server - sql

How do I populate the Transaction Processing Performance Council's TPC-DS database for SQL Server? I have downloaded the TPC-DS tool but there are few tutorials about how to use it.

In case you are using windows, you gotta have visual studio 2005 or later. Unzip dsgen in the folder tools there is dsgen2.sln file, open it using visual studio and build the project, will generate tables for you, I've tried that and I loaded tables manually into sql server

I've just succeeded in generating these queries.
There are some tips may not the best but useful.
cp ${...}/query_templates/* ${...}/tools/
add define _END = ""; to each query.tpl
${...}/tools/dsqgen -INPUT templates.lst -OUTPUT_DIR /home/query99/

Let's describe the base steps:
Before go to the next steps double-check that the required TPC-DS Kit has not been already prepared for your DB
Download TPC-DS Tools
Build Tools as described in 'v2.11.0rc2\tools\How_To_Guide-DS-V2.0.0.docx' (I used VS2015)
Create DB
Take the DB schema described in tpcds.sql and tpcds_ri.sql (they located in 'v2.11.0rc2\tools\'-folder), suit it to your DB if required.
Generate data that be stored to database
# Windows
dsdgen.exe /scale 1 /dir .\tmp /suffix _001.dat
# Linux
dsdgen -scale 1 -dir /tmp -suffix _001.dat
Upload data to DB
# example for ClickHouse
database_name=tpcds
ch_password=12345
for file_fullpath in /tmp/tpc-ds/*.dat; do
filename=$(echo ${file_fullpath##*/})
tablename=$(echo ${filename%_*})
echo " - $(date +"%T"): start processing $file_fullpath (table: $tablename)"
query="INSERT INTO $database_name.$tablename FORMAT CSV"
cat $file_fullpath | clickhouse-client --format_csv_delimiter="|" --query="$query" --password $ch_password
done
Generate queries
# Windows
set tmpl_lst_path="..\query_templates\templates.lst"
set tmpl_dir="..\query_templates"
set dialect_path="..\..\clickhouse-dialect"
set result_dir="..\queries"
set tmpl_name="query1.tpl"
dsqgen /input %tmpl_lst_path% /directory %tmpl_dir% /dialect %dialect_path% /output_dir %result_dir% /scale 1 /verbose y /template %tmpl_name%
# Linux
# see for example https://github.com/pingcap/tidb-bench/blob/master/tpcds/genquery.sh
To fix the error 'Substitution .. is used before being initialized' follow this fix.

Related

Deploy sql workflow with DBX

I am developing deployment via DBX to Azure Databricks. In this regard I need a data job written in SQL to happen everyday. The job is located in the file data.sql. I know how to do it with a python file. Here I would do the following:
build:
python: "pip"
environments:
default:
workflows:
- name: "workflow-name"
#schedule:
quartz_cron_expression: "0 0 9 * * ?" # every day at 9.00
timezone_id: "Europe"
format: MULTI_TASK #
job_clusters:
- job_cluster_key: "basic-job-cluster"
<<: *base-job-cluster
tasks:
- task_key: "task-name"
job_cluster_key: "basic-job-cluster"
spark_python_task:
python_file: "file://filename.py"
But how can I change it so I can run a SQL job instead? I imagine it is the last two lines of code (spark_python_task: and python_file: "file://filename.py") which needs to be changed.

There are various ways to do that.
(1) One of the most simplest is to add a SQL query in the Databricks SQL lens, and then reference this query via sql_task as described here.
(2) If you want to have a Python project that re-uses SQL statements from a static file, you can add this file to your Python Package and then call it from your package, e.g.:
sql_statement = ... # code to read from the file
spark.sql(sql_statement)
(3) A third option is to use the DBT framework with Databricks. In this case you probably would like to use dbt_task as described here.

I found a simple workaround (although might not be the prettiest) to simply change the data.sql to a python file and run the queries using spark. This way I could use the same spark_python_task.

T-SQL Database Backup Cleanup Script logging of files to be deleted before delete

I was having issues with the SQL maintanence plan cleanup task not deleting one particularly large database backup each night, it works for days then fails then starts working again and it always works correctly on the small databases.
I researched this maintanence plan cleanup task issue and tried everything I could think of to get it working, changed extension matching to *, added a \ at the end of folder path, changed age no NONE so all files would be deleted regardless of age, and still sometimes this backup is not getting deleted.
So I implemented this SQL JOB using the script below to see if that would work, but the same issue again, intermittently the large backup file is not deleted, when I run the task manually it seems to always delete it.
My question here is, is there a way to firstly get a list of files that match the delete criteria and write them to a log file before actually attempting to delete the files, that way I could at least see if for some reason the large backup file is not matching the criteria to be deleted in the first place.
Any assistance to otherwise delete the old backup files using T-SQL without using xp_cmdshell and without using a batch or powershell script would be appreciated.
declare #dt datetime
select #dt=dateadd(hh,-22,getdate())
EXECUTE master.dbo.xp_delete_file 0,N'Z:\SQLBackups\',N'BAK',#dt,1
The version of SQL server I'm having the issue with:
Microsoft SQL Server Management Studio 10.50.4042.0
Microsoft Analysis Services Client Tools 10.50.4042.0
Microsoft Data Access Components (MDAC) 6.1.7601.17514
Microsoft MSXML 3.0 6.0
Microsoft Internet Explorer 9.10.9200.17609
Microsoft .NET Framework 2.0.50727.5485
Operating System 6.1.7601

After creating a powershell script to delete the files and finding the error "Another process is using the file" I made this script to check what process that is using the handle64.exe program and found it was the Commvault agent CLBackup.exe that was locking the file.
Backup schedule conflict is the cause of the issue.
$log = ($MyInvocation.MyCommand.Path).TrimEnd("ps1") + "log"
$handlelog = ($MyInvocation.MyCommand.Path).TrimEnd(".ps1") + "-Handle.log"
$1 = gci 'Z:\SQLBackups' | %{gci $_.fullname -Filter '*.BAK' | ? {$_.LastWriteTime -le (get-date).addhours(-22)}}
write-output "$(get-date -format g) Files will be deleted: $1" >> $log
$1 | % {remove-item $_.fullname -force -Confirm:$FALSE}
if (!!($error)) {
write-output "$(get-date -format g) Open Handles on .BAK files:" >> $handlelog
$exec = "$env:SystemRoot\system32\cmd.exe /c" + (Split-Path($MyInvocation.MyCommand.Path)) + "\handle64.exe .BAK -u -nobanner -accepteula"
Invoke-Expression -Command:$exec >> $handlelog
$error >> $log
}

Running SQL in Stata

I am trying to load data from SQL server management studio into Stata. How do I get Stata to run the .sql file? I have used the -ado- procedure from another post, but it does not work because my database has a username and password.
Original -ado- code:
program define loadsql
*! Load the output of an SQL file into Stata, version 1.2 (dvmaster#gmail.com)
version 12.1
syntax using/, DSN(string) [CLEAR NOQuote LOWercase SQLshow ALLSTRing DATESTRing]
#delimit;
tempname mysqlfile exec line;
file open `mysqlfile' using `"`using'"', read text;
file read `mysqlfile' `line';
while r(eof)==0 {;
local `exec' `"``exec'' ``line''"';
file read `mysqlfile' `line';
};
file close `mysqlfile';
odbc load, exec(`"``exec''"') dsn(`"`dsn'"') `clear' `noquote' `lowercase' `sqlshow' `allstring' `datestring';
end;

help odbc discusses connect_options for connecting to odbc data sources. Two of which are u(userId) and p(password) which can be added to the original code written by #Dimitriy V. Masterov (see post here).
I believe you should be able to connect using SQL Server authentication by adding the u(string) and p(string) as additional options following syntax in the ado file, and then again down below following
odbc load, exec(`"``exec''"') dsn(`"`dsn'"')
This would also require that you pass these arguments to the program when you call it:
loadsql using "./sqlfile.sql", dsn("mysqlodbcdata") u(userId) p(Password)

How to generate executable TPC-DS queries?

I have downloaded the DSGEN tool from the TPC-DS web site and already generated the tables and loaded the data into Oracle XE.
I am using the following command to generate the SQL statements :
dsqgen -input ..\query_templates\templates.lst -directory ..\query_templates -dialect oracle -scale 1
However, No matter how I adjust the command I always get this error message :
ERROR: A query template list must be supplied using the INPUT option
Can anybody help?

Apparently you need to use / rather than - for the flags for the Windows executable:
dsqgen /input ..\query_templates\templates.lst /directory ..\query_templates
/dialect oracle /scale 1

MS SQL Server use old log file location after detach/copy/attach

I create database "Test" in folder "d:\test". Database files are "d:\test\Test.mdf" and "d:\Test\Test_log.ldf". I detach database from MS SQL Server 2008 R2, copy all files to new folder ("d:\test_new"), delete log file ("d:\test_new\Test_log.ldf"), and try to attach database again from new location. When I use SQL Server Management Studio, and choose "d:\test_new\Test.mdf" file, it determines that log file is located in "d:\test\Test_log.ldf" (old location). How can I attach this database with rebuilding log in new location? Just imagine, that I cannot copy ldf file again to new location, and that it is still available there, so SQL Server see it anyway. I want to say to SQL Server - "please, forget that log file, and create new log file here". It's be better if you help me with T-SQL script, but if it will be steps in Management studio - I will convert it to script myself.
What I had tried already:
1.
CREATE DATABASE [test]
ON ( FILENAME = N'D:\test_new\test.mdf' )
FOR ATTACH_REBUILD_LOG
attaches log file from old location (FOR ATTACH - the same)
2.
CREATE DATABASE [test]
ON ( FILENAME = N'D:\test_new\test.mdf' )
LOG ON ( FILENAME = N'D:\test_new\test_log.ldf' )
FOR ATTACH_REBUILD_LOG
returns an error: Unable to open the physical file "D:\test_new\test_log.ldf". Operating system error 2: "2(File not found.)".
3.
sp_attach_db and sp_attach_single_file_db
was tried too. And I even had checked their source codes - they just create dynamic SQL and call CREATE DATABASE ... FOR ATTACH statement.
The question is slowly changed to: "Is it possible?"
UPDATE
Well, it looks like it's not possible with current versions of SQL Server. If anybody knows a way to do it - please, I will be very pleased to know it too!

Edit2: To my knowledge, it is not possible for SQL Server to recreate a log file. It can shrink the ldf, but not create it when only the mdf exists.
When you copy your files from d:\test\ to d:\test_new\, do not delete the d:\test_new\Test_log.ldf.
Leave the log file there, because you cannot reattach the new DB without that log file. Afterwards, you can shrink that log to a minimum size.
So, to synthesize:
Copy your files from d:\test\ to d:\test_new\ and leave the log
file there.
Run your create database script that you posted in your question (point 2).
Run the following script to shrink the log to a minimum size
.
USE test
GO
DBCC SHRINKFILE(logicalFileName, 1)
GO
To find out what logicalFileName is, run sp_helpfile, that will give you the logical file name for your log file:
USE test
GO
EXEC sp_helpfile
GO
more info here
Edit:
I think you need first to detach the test database from the old location:
(You might create a script that does it all, from the following commands)
C:\> osql -E
1> sp_detach_db 'test'
2> go
3> quit
C:\>
Then copy the files to the new location.
C:\> copy d:\test\* d:\test_new\*
Next, attach the test DB to the new path location:
C:\> osql -E
1> sp_attach_db #dbname = N'test', #filename1 = N'd:\test_new\Test.mdf', #filename2 = N'd:\test_new\Test_log.ldf'
2> go
3> quit
C:\>
to test if the new database was successfully attached:
C:\> osql -E
1> use test
2> go
3> quit
C:\>
If there are no errors after the go command, then all is ok
Hope this helps
Microsoft article on how to move files

The users must copy BOTH the .mdf and .ldf files. They then have to use the following command (one or the other).
sp_attach_db (deprecated, use the CREATE DATABASE WITH ATTACH in the future)
EXEC sp_attach_db #dbname = 'dbname', #filename1='d:\test_new\test.mdf', #filename2='d:\test_new\test.ldf'
This will result in the databas using the data file (mdf) and transaction log (ldf) from the \test_new directory
CREATE DATABASE FOR ATTACH
CREATE DATABASE dbname ON '(FILENAME=d:\test_new\test.mdf'), (FILENAME='d:\test_new\test.ldf') FOR ATTACH

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas