Compare table schemas before starting a job - pentaho

We are currently working on a project where we need to check whether the database schema has changed every time we start a Spoon job, since our source is a third-party database over which we have little to no control.
The most obvious solution to us would be to create a script that calls a tool like apgdiff and compares the schema against a previously generated schema file. If there were any change, we would then send a notification.
The question is basically: is this the best way to achieve this?
Any help would be appreciated.
Thanks for your time.
P.S.: I'm not sure whether Stack Overflow is the best place for this kind of question, so if it is not, please feel free to suggest a more appropriate forum.

Solution I:
Assuming it is a PostgreSQL database you are referring to, if you have sufficient privileges to INFORMATION_SCHEMA, I would suggest that you query the database like so:
select column_name, data_type, character_maximum_length
from INFORMATION_SCHEMA.COLUMNS where table_name = '<name of table>';
Store the expected result in a persistent way, like you mentioned, and then just compare the results in a sub-transformation. The persistent schema could be a CSV file that stores the definition like so:
app_id character varying 255
platform character varying 255
etl_tstamp timestamp without time zone (null)
collector_tstamp timestamp without time zone (null)
dvce_tstamp timestamp without time zone (null)
event character varying 128
event_id character 36
Then simply compare the two files: (1) the file that holds the expected schema definition and (2) the file that you just generated, fresh from the database. You could use the File Compare step to do so.
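If you would rather do the whole comparison in SQL instead of diffing files, here is a minimal sketch, assuming PostgreSQL and that the previous snapshot has been loaded into a staging table (called expected_columns here, a name I made up):
(SELECT column_name, data_type, character_maximum_length
 FROM information_schema.columns
 WHERE table_name = '<name of table>'
 EXCEPT
 SELECT column_name, data_type, character_maximum_length
 FROM expected_columns)
UNION ALL
(SELECT column_name, data_type, character_maximum_length
 FROM expected_columns
 EXCEPT
 SELECT column_name, data_type, character_maximum_length
 FROM information_schema.columns
 WHERE table_name = '<name of table>');
-- any rows returned mean the live schema no longer matches the stored snapshot
If the query returns zero rows the schemas match and the job can continue; otherwise fire the notification.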
I hope this helps a bit.
EDIT:
Solution II:
Another solution you could apply: you can also use the Table Compare step (contributed by www.kjube.de) to compare two tables from different sources.
What's nice about this step is that you can specify two different connections for the two tables you are comparing.

Related

how to view stats on snowflake?

I am looking for a way to visualize the stats of a table in Snowflake.
The slow way is to pull a meaningful sample of the data with Python and profile it with pandas, but it is somewhat inefficient and unsafe to pull the data out of Snowflake.
Snowflake's new interface shows these stats graphically, and I would like to know whether there is a way to obtain this data with a query or by consulting metadata.
I need something like pandas-profiling but without an external server. Maybe Snowflake stores metadata/statistics about its columns (numeric, categorical).
https://github.com/pandas-profiling/pandas-profiling
Thank you for your advice.
You can find a lot of meta information in the INFORMATION_SCHEMA.
All the views and table functions in the Snowflake INFORMATION_SCHEMA can be found here: https://docs.snowflake.com/en/sql-reference/info-schema.html
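For example, table-level metadata such as row counts and sizes is already there, so a query like this returns it without scanning any data (run it in the database you care about):
SELECT table_schema, table_name, row_count, bytes
FROM information_schema.tables
WHERE table_type = 'BASE TABLE'
ORDER BY row_count DESC;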
Not sure if you're talking about viewing the information schema as mentioned, but if you need documentation on this whole new interface, it's called SnowSight.
You can learn more here:
https://docs.snowflake.com/en/user-guide/ui-snowsight.html
Cheers!
The highlight in your screenshot isn't statistics about the data in the table, but merely about the query result (which looks like a DESCRIBE TABLE query). For example, if you look at type, it simply tells you that this table has 6 VARCHAR columns, 2 timestamps, and 1 number.
What you're looking for is something that is provided by most BI tools or data catalogs. I suggest you take a look at those instead.
You could also use an independent tool, like Soda, which is open source.
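If a rough, pandas-profiling-style summary is enough, plain aggregates pushed down to Snowflake also work and never pull the data out; a minimal sketch for one hypothetical numeric column:
SELECT COUNT(*)                         AS total_rows,
       COUNT(my_column)                 AS non_null_rows,
       APPROX_COUNT_DISTINCT(my_column) AS approx_distinct_values,
       MIN(my_column)                   AS min_value,
       MAX(my_column)                   AS max_value,
       AVG(my_column)                   AS mean_value
FROM my_table;
You would have to generate one such query per column (or per table), which is roughly what the profiling tools do under the hood.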

Error: Schema changed for Timestamp field (additional)

I am getting an error message when I query a specific table in my data set that has a nullable timestamp field. In the BigQuery web tool, I run a simple query, e.g.:
SELECT * FROM [reztrack.201401] LIMIT 100
The result I get is: Error: Schema changed for Timestamp field date
Example Job ID: esiteisthebomb:job_6WKi7ZhSi8D_Ewr8b5rKV-a5Eac
This is the exact same issue that was noted here: Error: Schema changed for Timestamp field.
Also logged this under https://code.google.com/p/google-bigquery/issues/detail?id=307, but I was unsure since it said we should be logging everything on Stack Overflow.
Any information on how to fix this for this or other tables would be greatly appreciated.
Note: The original answer says to contact Google support, but Google support for BigQuery was moved to Stack Overflow, so I assume that means opening it as a new question in the hope that the engineers will respond.
BigQuery recently improved the representation of its internal timestamp format (there had previously been a lot of cases where timestamps broke in strange ways and this change should fix that). Your table still was using the old timestamp format, and you tickled a bug in the old format when schemas changed (in this case, the field went from REQUIRED to OPTIONAL).
We have an automated process that coalesces tables to make their storage more efficient. I scheduled this to run over your table, and have verified that it has rewritten your table using the new timestamp format.
You should now be able to query this field of your table without further problems.

On DB2 for i, Search for Column, return table names in list form

I'm still a bit of a noob, so pardon if this question is a bit obvious. I did search for an answer but either couldn't understand how the answers I found applied, or simply couldn't find an answer.
I have a massive database housed on a DB2 for i server which I'm accessing using SQL through SQLExplorer (based on Squirrel SQL). The tables are very poorly documented and the first order of business is figuring out how to find my way around.
I want to write a simple query that does this:
1) Allows me to search the entire database looking for tables that include a column called "Remarks" (which contains field descriptions).
2) I then want it to search that column for a keyword.
3) I want a table returned that includes the names of the tables that include that keyword (just the name, I can look up the table alphabetically later and look inside if I need to.)
I need this search to be super lightweight, and I'm hoping the concept I describe will achieve that. Anything that eats up a lot of resources will likely anger the sys admin for the server.
Just to show I have tried (and that I am a complete noob), here's what I've got so far.
SELECT *
FROM <dbname>
WHERE Remarks LIKE '<keyword>'
Feel free to mock, I told you I'm an idiot :-).
Any help? Perhaps at least a push in the right direction?
PS - I can't seem to find a search function in SQLExplorer; if someone knows whether I can use a simple search or filter to accomplish this same goal, that would be great.
You can query the system catalog to identify the tables:
SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME
FROM QSYS2.SYSCOLUMNS WHERE UPPER(DBILFL) = 'REMARKS'
And then query each table individually:
SELECT * FROM TABLE_SCHEMA.TABLE_NAME WHERE Remarks LIKE '%<keyword>%'
See the LIKE predicate for details of the pattern expression.
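If many tables turn up, one way to avoid writing each follow-up query by hand is to have the catalog generate them for you; a rough sketch (copy the generated statements out and run the ones you want, filling in '<keyword>' yourself):
SELECT 'SELECT * FROM ' || RTRIM(TABLE_SCHEMA) || '.' || RTRIM(TABLE_NAME)
       || ' WHERE REMARKS LIKE ''%<keyword>%'''
FROM QSYS2.SYSCOLUMNS
WHERE UPPER(COLUMN_NAME) = 'REMARKS';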
Normally I use something like this:
SELECT TABLE_SCHEMA, TABLE_NAME
,COLUMN_NAME,SYSTEM_COLUMN_NAME,COLUMN_HEADING
,DATA_TYPE, "LENGTH",NUMERIC_SCALE
FROM QSYS2.SYSCOLUMNS
WHERE UPPER(COLUMN_NAME) LIKE '%REMARK%'
@JamesA, I'm at V6R1; by default, normal users are not authorized to object QADBIFLD in QSYS.
Generally, many if not most IBM i shops (especially those that use RPG) stick to schema and table names of 10 characters or fewer, and keep 10-character (or shorter) 'system' names for columns even when longer column names are also provided. The column text generally describes each field.
SELECT SYSTEM_TABLE_SCHEMA, SYSTEM_TABLE_NAME
,SYSTEM_COLUMN_NAME
,DATA_TYPE, "LENGTH",NUMERIC_SCALE
,CHAR(COLUMN_TEXT)
FROM QSYS2.SYSCOLUMNS
WHERE UPPER(COLUMN_NAME) LIKE '%REMARK%'

Removing privacy data from a database?

Say that I needed to share a database with a partner. Obviously I have customer information in that database. Short of going through and identifying every column that contains private information and writing a custom script to 'scrub' the data, is there any tool or script that can scrub the data but keep the format intact (for example, if a string is 5 characters, it would stay 5 characters, only scrubbed)?
If not, how would you accomplish something like this, preferably in TSQL?
You may consider sharing only views: create VIEWs that hide the data you don't want to share.
Example:
CREATE VIEW v_customer
AS
SELECT
NAME,
LEFT(CreditCard,5) + '****' As CreditCard -- OR, don't show this column at all
....
FROM customer
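To actually limit the partner to the view, grant them access on the view only and nothing on the base table; a sketch, with partner_login standing in for whatever login or role you give them:
GRANT SELECT ON v_customer TO partner_login;
-- no GRANT on the underlying customer table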
Firstly, I need to declare a professional interest: I work for IBM, which has tools that do exactly this.
Step 1. Ensure you identify all the PII (Personally Identifiable Information). When sharing database information, the obvious column names like "name" are typically found, but you also need to find the "hidden" data, where the data is either embedded in a standard format (e.g. string-name-string) under a column name like "reference code", or sits in free-format text fields. As you have seen, this is not going to be an easy job unless you automate it. The tool for this is InfoSphere Discovery.
Step 2. Work out what context the "scrubbed" data needs to be in. Changing name fields to random characters causes problems in testing, because users focus on text errors rather than functional failures, so change names to real but fictitious ones. Credit card information often needs to be "valid"; by that I mean it needs to have a valid prefix, say 49XX, while the rest is an invalid sequence. Finally, you need to ensure that every instance of the change is propagated through the database to maintain consistency. The tool for this is Optim Test Data Management with the Data Privacy option.
The two tools integrate to give a full data privacy solution.
Based on the original question, it seems you need the fields to be the same length, but not in a "valid" format? How about:
UPDATE customers
SET email = REPLICATE('z', LEN(email))
-- additional fields as needed
Copy/paste and rename tables/fields as appropriate. I think you're going to have a hard time finding a tool that's less work, unless your schema is very complicated, or my formatting assumptions are incorrect.
I don't have an MSSQL database in front of me right now, but you can also find all of the string-like columns by something like:
SELECT *
FROM INFORMATION_SCHEMA.COLUMNS
WHERE DATA_TYPE IN ('...', '...')
I don't remember the exact values you need to compare for, but if you run the query and see what's there, they should be pretty self-explanatory.
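Combining the two ideas above (REPLICATE plus INFORMATION_SCHEMA), you can let the metadata generate the scrub statements for you; a minimal sketch, assuming SQL Server, and you would want to review the generated statements before running them:
SELECT 'UPDATE ' + QUOTENAME(TABLE_SCHEMA) + '.' + QUOTENAME(TABLE_NAME)
     + ' SET ' + QUOTENAME(COLUMN_NAME)
     + ' = REPLICATE(''z'', LEN(' + QUOTENAME(COLUMN_NAME) + '));'
FROM INFORMATION_SCHEMA.COLUMNS
WHERE DATA_TYPE IN ('char', 'nchar', 'varchar', 'nvarchar');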

Database: best way to model a spreadsheet

I am trying to figure out the best way to model a spreadsheet (from the database point of view), taking into account :
The spreadsheet can contain a variable number of rows.
The spreadsheet can contain a variable number of columns.
Each column can contain one single value, but its type is unknown (integer, date, string).
It has to be easy (and performant) to generate a CSV file containing the data.
I am thinking about something like :
from django.db import models

class Cell(models.Model):
    column = models.ForeignKey('Column')
    row_number = models.IntegerField()
    value = models.CharField(max_length=100)

class Column(models.Model):
    spreadsheet = models.ForeignKey('Spreadsheet')
    name = models.CharField(max_length=100)
    type = models.CharField(max_length=100)

class Spreadsheet(models.Model):
    name = models.CharField(max_length=100)
    creation_date = models.DateField()
Can you think of a better way to model a spreadsheet? My approach stores every value as a string, and I am worried that generating the CSV file will be too slow.
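For reference, the CSV export I have in mind would essentially boil down to a single ordered join; a sketch of the underlying SQL, assuming the Django tables end up named sheets_cell and sheets_column (i.e. an app label of sheets):
SELECT c.row_number, col.name, c.value
FROM sheets_cell AS c
JOIN sheets_column AS col ON col.id = c.column_id
WHERE col.spreadsheet_id = 1
ORDER BY c.row_number, col.id;
The exporting code would then walk the result set, grouping consecutive rows by row_number and writing one CSV line per group.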
From a relational viewpoint:
Spreadsheet <-->> Cell : RowId, ColumnId, ValueType, Contents
There is no requirement for Row and Column to be entities, but you can model them that way if you like.
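A minimal DDL sketch of that shape (all names here are illustrative):
CREATE TABLE spreadsheet (
    spreadsheet_id INTEGER PRIMARY KEY,
    name           VARCHAR(100) NOT NULL
);
CREATE TABLE cell (
    spreadsheet_id INTEGER NOT NULL REFERENCES spreadsheet (spreadsheet_id),
    row_id         INTEGER NOT NULL,
    column_id      INTEGER NOT NULL,
    value_type     VARCHAR(20) NOT NULL,  -- 'integer', 'date', 'string', ...
    contents       VARCHAR(100),
    PRIMARY KEY (spreadsheet_id, row_id, column_id)
);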
Databases aren't designed for this. But you can try a couple of different ways.
The naive way to do it is to do a version of One Table To Rule Them All. That is, create a giant generic table, with all types being (n)varchars, that has enough columns to cover any foreseeable spreadsheet. Then you'll need a second table to store metadata about the first, such as what Column1's spreadsheet column name is, what type it stores (so you can cast in and out), and so on. Then you'll need triggers to run against inserts that check the incoming data against the metadata to make sure the data isn't corrupt, etc etc etc. As you can see, this way is a complete and utter cluster. I'd run screaming from it.
The second option is to store your data as XML. Most modern databases have XML data types and some support for XPath within queries. You can also use XSDs to provide some kind of data validation, and XSLTs to transform that data into CSVs. I'm currently doing something similar with configuration files, and it's working out okay so far. No word on performance issues yet, but I'm trusting Knuth on that one.
The first option is probably much easier to search and faster to retrieve data from, but the second is probably more stable and definitely easier to program against.
It's times like this I wish Celko had a SO account.
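For what it's worth, a small sketch of the second option, assuming SQL Server's xml type (the table name and document shape are made up):
CREATE TABLE spreadsheet_doc (
    spreadsheet_id INT PRIMARY KEY,
    doc            XML NOT NULL
);
-- pull one cell back out with an XQuery path
SELECT doc.value('(/sheet/row[1]/cell[2])[1]', 'varchar(100)')
FROM spreadsheet_doc
WHERE spreadsheet_id = 1;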
You may want to study EAV (Entity-attribute-value) data models, as they are trying to solve a similar problem.
Entity-Attribute-Value - Wikipedia
The best solution depends greatly on how the database will be used. Try to find the couple of top use cases you expect and then decide on the design. For example, if there is no use case for fetching the value of a single cell from the database (the data is always loaded at row level, or even in groups of rows), then there is no need to store a 'cell' as such.
That is a good question that calls for many answers depending on how you approach it, and I'd love to share an opinion with you.
This topic is one of several we have researched at Zenkit; we even wrote an article about it, and we'd love your opinion on it: https://zenkit.com/en/blog/spreadsheets-vs-databases/