Export varbinary to file (image) from multiple rows - sql

I have an MSSQL 2008 database, and in it I have a table in the format below.
Employee Number | Segment | Data (varbinary(8000))
----------------------------------------------------------
111111 | 1 | 0x01234567...DEF
111111 | 2 | 0x01234567...DEF
111111 | 3 | 0x01234567...DEF
The data (varbinary) column makes up a picture, but unfortunately it is split into multiple segments by a process I cannot control.
Is there a way to export this data via an SQL script/procedure to a file? I have seen some questions that answer this for a varbinary(max) column, but I can't for the life of me work out how to stitch these segments together into one file.
Note: some of the files have >500 segments, but this procedure will not need to run very often.

If the picture can be reconstructed by simply concatenating all of the segments, then you could try execsql.py, which is a SQL script processor written in Python (by me). It has a metacommand of this form:
EXPORT <table_or_view> TO <filename> AS RAW
which will concatenate all columns and rows in the given table or view.
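If you would rather stay entirely inside SQL Server, here is a minimal T-SQL sketch; the table name dbo.EmployeePictures and the column names EmployeeNumber, Segment and Data are assumptions based on the layout above, and it assumes Segment defines the byte order. It rebuilds one employee's image into a single varbinary(max) value; the one-row result can then be written to disk, for example with bcp ... queryout and a format file that uses a zero-length prefix.
-- Reassemble the segments of one picture into a single varbinary(max) value.
DECLARE @img varbinary(max) = 0x;
DECLARE @chunk varbinary(8000);
DECLARE seg_cur CURSOR LOCAL FAST_FORWARD FOR
    SELECT Data
    FROM dbo.EmployeePictures          -- assumed table name
    WHERE EmployeeNumber = 111111
    ORDER BY Segment;                  -- the cursor guarantees concatenation order
OPEN seg_cur;
FETCH NEXT FROM seg_cur INTO @chunk;
WHILE @@FETCH_STATUS = 0
BEGIN
    SET @img = @img + @chunk;          -- append the next segment
    FETCH NEXT FROM seg_cur INTO @chunk;
END
CLOSE seg_cur;
DEALLOCATE seg_cur;
SELECT @img AS Picture;                -- one row, one column: the reassembled image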


Open sums in SQL / dynamic selection of tables

Much ink has been spilled on the topic of sum types in SQL. The standard solutions are called absorption, separation, and partition; see, e.g.: https://www.inf.unibz.it/~montali/teaching/1415/dpm/slides/4.relational-mapping.pdf .
I want to ask about how to encode open sums. Normal sums allow a field to be one of a fixed set of several different types; with open sums, this set is not fixed.
The basic setup in our program: There is a list of "triggers," where each trigger can be one of many different things. Plugins can be written defining new trigger types, although the set of trigger types can be assumed to be known at compile time.
We want a table of all triggers.
Our current best idea:
Part 1: dynamically create a materialized view of the following form:
id | id_in_plugin_table | thing_in_main_program_it_refs | plugin_name
---------------------------------------------------------------------
1 | 27 | 8 | RegexTrigger
2 | 27 | 12 | RidiculouslyUnsafeCustomJSTrigger
This relation is automatically generated from the various plugin tables, each of which has its own id and a thing_in_main_program_it_refs field.
For illustration, here's what the referenced tables may look like.
RegexTrigger table:
id | thing_in_main_program_it_refs | regex
---------------------------------------------------------------------
27 | 8 | hel*o
RidiculouslyUnsafeCustomJSTrigger table:
id | thing_in_main_program_it_refs | custom_js
---------------------------------------------------------------------
27 | 12 | (x) => isPrime(x.length())
Part 2: either use two round trips (first look up which plugin table to use, then query it), or combine them into a single SQL program that uses EXEC.
I'm happy with part 1, but not with part 2. Neither option sounds efficient, and the latter option uses EXEC.
So, we're looking for either (a) a better way to dynamically select a table in a query, or (b) a different approach to open sums.
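For concreteness, here is a hedged sketch of how the part 1 view could be generated (a plain CREATE VIEW is shown, since materializing it is DBMS-specific, and the plugin table names regex_trigger and ridiculously_unsafe_custom_js_trigger are assumptions); the generator would emit one UNION ALL branch per installed plugin table:
CREATE VIEW all_triggers AS
SELECT ROW_NUMBER() OVER (ORDER BY plugin_name, id_in_plugin_table) AS id,
       id_in_plugin_table,
       thing_in_main_program_it_refs,
       plugin_name
FROM (
    SELECT id AS id_in_plugin_table,
           thing_in_main_program_it_refs,
           'RegexTrigger' AS plugin_name
    FROM regex_trigger
    UNION ALL
    SELECT id,
           thing_in_main_program_it_refs,
           'RidiculouslyUnsafeCustomJSTrigger'
    FROM ridiculously_unsafe_custom_js_trigger
) AS t;
Querying a specific trigger's details still requires knowing which plugin table to hit, which is exactly the part 2 problem.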

Is comparing two tables faster by importing them into a sql database or by using jdbc?

Background
I need to compare two tables in two different datacenters to make sure they're the same. The tables can have hundreds of millions, even a billion, rows.
An example of this is having a production data pipeline and a development data pipeline. I need to verify that the tables at the end of each pipeline are the same, however, they're located in different datacenters.
The tables are the same if all the values and datatypes for each row and column match. There are primary keys for each table.
Here's an example input and output:
Input
table1:
Name | Age |
Alice| 25.0|
Bob | 49 |
Jim | 45 |
Cal | 52 |
table2:
Name | Age |
Bob | 49 |
Cal | 42 |
Alice| 25 |
Output:
table1 missing rows (empty):
Name | Age |
| |
table2 missing rows:
Name | Age |
Jim | 45 |
mismatching rows:
Name | Age | table |
Alice| 25.0| table1|
Alice| 25 | table2|
Cal | 52 | table1|
Cal | 42 | table2|
Note: The output doesn't need to be exactly like the above format, but it does need to contain the same information.
Question
Is it faster to import these tables into a new, common SQL environment, then use SQL to produce my desired output?
OR
Is it faster to use something like JDBC, retrieve all rows for each table, sort each table, then compare them line by line to produce my desired output?
Edits:
The above solutions would be executed at a datacenter that's hosting one of the tables. In the first solution, the only purpose of creating a new database would be to compare these tables using SQL; there are no other uses.
You should definitely start with the database option. Especially if the databases are connected with a database link, you can easily set up the transfer of the data.
This kind of comparison usually comes down to a full outer join of the two sources, and experience tells us that DIY joins are notoriously less performant than the native database implementation (which can, for example, use a parallel plan).
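To make the full-outer-join point concrete, here is a hedged sketch against the example tables above; a real query would join on the actual primary key and compare every column:
SELECT COALESCE(t1.Name, t2.Name) AS Name,
       t1.Age                     AS table1_Age,
       t2.Age                     AS table2_Age,
       CASE
           WHEN t1.Name IS NULL THEN 'missing from table1'
           WHEN t2.Name IS NULL THEN 'missing from table2'
           ELSE 'value mismatch'
       END                        AS difference
FROM table1 AS t1
FULL OUTER JOIN table2 AS t2
    ON t2.Name = t1.Name               -- primary-key join
WHERE t1.Name IS NULL                  -- row exists only in table2
   OR t2.Name IS NULL                  -- row exists only in table1
   OR t1.Age <> t2.Age;                -- same key, different values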
Alternatively, you could implement a more sophisticated algorithm that performs the comparison without transferring the whole table.
One example is based on Merkle trees: you first scan both sources in place to identify which parts are identical (and can be ignored), then transfer and compare only the parts that differ.
So if you expect the tables to be nearly identical, and they have keys that allow some kind of hierarchy, such an approach could work out better than a brute-force full compare.
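That summary-first idea can also be sketched in plain SQL without building an actual Merkle tree: compute a checksum per key bucket on each side, compare only those small summaries across datacenters, and fetch full rows only for the buckets that differ. A hedged, T-SQL-flavoured sketch (the bucket count and column list are placeholders, and CHECKSUM is a weak hash, so treat matching checksums as "probably equal"):
-- Run the same query in each datacenter and compare the resulting summaries.
SELECT ABS(CHECKSUM(Name)) % 1024        AS bucket,           -- hash the key into 1024 buckets
       COUNT(*)                          AS row_count,
       CHECKSUM_AGG(CHECKSUM(Name, Age)) AS bucket_checksum   -- order-independent per-bucket checksum
FROM table1
GROUP BY ABS(CHECKSUM(Name)) % 1024;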
The faster solution is to load both tables into variables (memory) in your programming language and then compare them with your favorite algorithm.
Copying them into a new table first more than doubles the read/write operations against disk, especially the writes.

Creation of DB Structure for particular Excel Sheet formats

I have basic knowledge of SQL queries. 
Problem Statement:
Every month I will get an Excel sheet with the transaction name as one column and the remaining n columns as dates. The number of rows (transaction names) is fixed, but the dates of a month may vary. I have also attached a screenshot of the Excel file.
[Screenshot: Excel table]
How should I create a table and structure it in the DB?
How should I import this particular Excel table, with its values, into the SQL DB?
Kindly please help me!
Thanks.
You can create a table in this format:
CREATE TABLE table_name ([Transaction] varchar(40), dates date, [values] float)
(Transaction and values are reserved words in SQL Server, so they are bracketed here.)
The table looks like this:
Transaction | dates | values
billing_ 1 | 11-oct-2019 | 1.2006
billing_ 1 | 12-oct-2019 | 2.2006
billing_ 2 | 11-oct-2019 | 1.2006
billing_ 2 | 12-oct-2019 | 2.2006
For importing the Excel data: first reshape it into this one-row-per-date format, then you can easily import it into the database.
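A hedged sketch of that reshaping step, assuming the Excel sheet is first loaded as-is into a staging table with one column per date (the staging table name and its date columns below are made up for illustration); UNPIVOT then turns the date columns into (dates, values) rows that fit the table above:
CREATE TABLE staging_raw (
    [Transaction] varchar(40),
    [11-oct-2019] float,
    [12-oct-2019] float
    -- ...one column per date in that month's sheet
);
INSERT INTO table_name ([Transaction], dates, [values])
SELECT [Transaction],
       CAST(dates AS date),    -- the 'dd-mon-yyyy' column names parse as dates
       [values]
FROM staging_raw
UNPIVOT ([values] FOR dates IN ([11-oct-2019], [12-oct-2019])) AS u;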

Transpose variable number of rows into columns in OpenRefine

I have an xml file containing records from a library catalogue. I have imported it into OpenRefine but all the values are in one column. I want to transpose it so each field in the record has its own column. However, this is complicated by the fact that a) each field is optional so does not exist in all records and b) many fields are repeatable so can appear multiple times in each record. Here's a simplified example of what the data looks like:
| RecordID | Tag | Data |
| 1 | 040a | CaABCD |
| 1 | 245a | Go fish |
| 1 | 245a | A guide to fish |
| 1 | 246i | Fish series |
| 1 | 260a | Fishing friends |
| 2 | 040a | CaABDC |
| 2 | 245a | Happy trails |
| 2 | 246i | Hiking series |
| 2 | 260i | The happy hiker |
| 2 | 500a | Notes |
I have read the Q&A here Openrefine - Transpose rows into columns based on text but the problem with this solution is that if I concatenate all the values together I have no way to be sure what field they belong in anymore, as my data is much more complicated than the data in that question (my actual data has 25+ fields and many thousands of records).
I was able to get closer using Google Sheets and making a pivot table with a calculated field (as in PivotTable to show values, not sum of values - see the answer at the very bottom). However, I still don't know how to handle the repeating fields. In the pivot table the multiple values are there but only the first displays (double-clicking on an individual cell brings up a details table which lists all the values), so when I copy-paste the table I lose the additional values. I would like to concatenate them but I cannot see a way to do so within the pivot table.
Can you think of any other way I could do this, in OpenRefine or another tool? Thanks!
The classic way to fix this in OpenRefine is to use "Transpose -> Columnize by key value". But this feature is poorly documented and can cause headaches even for OpenRefine developers. In your case, repeated fields will be problematic, so here is a possible solution.
1° Go to the "tag" column, click on "Transpose -> Columnize by key value" and use the following configuration (don't forget the "Note column (optional)")
The result will look like this (my dataset is not exactly the same as yours; I modified a value to do some tests).
2° In the new column "Record ID: 040 a", click on "edit column -> Move Column To Beginning".
3° If you want to merge the repeated fields, go to each column that contains them and click on "Edit Cells -> Join Multi Value cells" by choosing a separator, for example "|".
The end result will look like this.
To get rid of unnecessary columns: Click on Export -> Custom tabular export and deselect the columns whose name starts with RecordId.
OpenRefine also has a native MARC importer which might be something worth trying if you need to work with MARC data in the future. MARCEdit also has some specific OpenRefine support built in.

Flat File Import: Remove Data

(Posted a similar question earlier but HR department changed conditions today)
Our HR department has an automated export from our SAP system in the form of a flat file. The information in the flat file looks like this:
G/L Account 4544000 Recruiting/Job Search
Company Code 0020
--------------------------
| Posting Date| LC amnt|
|------------------------|
| 01/01/2013 | 406.25 |
| 02/01/2013 | 283.33 |
| 03/21/2013 |1,517.18 |
--------------------------
G/L Account 4544000 Recruiting/Job Search
Company Code 0020
--------------------------
| Posting Date| LC amnt|
|------------------------|
| 05/01/2013 | 406.25 |
| 06/01/2013 | 283.33 |
| 07/21/2013 |1,517.18 |
--------------------------
When I look at the data in the SSIS Flat File Source Connection, all of the information is in a single column. I have tried setting the delimiter to pipe, but it will not separate the data, I assume because of the nonessential information at the top and middle of the file.
I need to remove the data at the top and middle and then have the Date and Total split into two separate columns.
The goal of this is to separate the data so that I can get a single SUM for the running year.
Year Total
2013 $5123.25
I have tried to do this in SSIS but I can't seem to separate the columns or remove the data. I want to avoid a script task as I am not familiar with the code or operation of that component.
Any assistance would be appreciated.
I would import the whole flat file into a temp table, and then do the filtering at the SQL level.
An example:
CREATE TABLE tmp (txtline VARCHAR(MAX))
BCP or SSIS the file into the tmp table (a hedged BULK INSERT alternative is sketched after the query below).
Run a query like this to get the result (you may need to adjust the substring offsets and lengths to fit your flat file):
WITH cte AS (
SELECT
    -- characters 2-13 hold the posting date plus padding; trim to the date text
    LTRIM(RTRIM(SUBSTRING(txtline, 2, 12))) AS PostingDateText,
    -- the rest of the line holds the amount; strip pipes and thousands separators
    REPLACE(REPLACE(SUBSTRING(txtline, 15, 100), '|', ''), ',', '') AS LCAmountText
FROM tmp
)
SELECT
    YEAR(CAST(PostingDateText AS DATE)) AS [Year],
    SUM(CAST(LCAmountText AS NUMERIC(19,4))) AS Total
FROM cte
WHERE ISDATE(PostingDateText) = 1          -- keeps only the "| date | amount |" rows
GROUP BY YEAR(CAST(PostingDateText AS DATE))
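For the load step mentioned above ("BCP or SSIS the file into the tmp table"), a BULK INSERT statement is another minimal option; a hedged sketch with a placeholder path that must be visible to the SQL Server instance:
-- Load every line of the flat file into one row of the single-column tmp table.
BULK INSERT tmp
FROM 'C:\exports\hr_flatfile.txt'      -- placeholder path
WITH (ROWTERMINATOR = '\n');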
Maybe you could use MS Excel to open the flat file, using the pipe character as the delimiter, and then create a CSV from that, if needed.
Short of a script task/component (or a full-blown custom SSIS component), I don't think you'll be able to parse that specific format in SSIS. The Flat File Connection Manager does allow you to select how many rows of your text file are headers to be skipped, but the format you're showing has multiple sections (and thus multiple headers). There's also the issue of the horizontal lines, which the Flat File Connection won't be able to properly handle.
I'd first see if there's any way to get a normal CSV file with this data out of SAP. If that turns out to be impossible, then you'll need some sort of custom code to strip out the excess text.