Sharing schema definition between BigQuery Client Libraries and Beam IO - google-bigquery

Background:
We are using the Cloud Dataflow runner in Beam 2.0 to ETL our data into our warehouse in BigQuery. We would like to use the BigQuery Client Libraries (Beta) to create the schema of our data warehouse before the Beam pipelines populate it with data. (Reasons: full control over table definitions, e.g. partitioning; ease of creating DW instances, i.e. datasets; separation of ETL logic from DW design; and code modularisation.)
Problem:
BigQueryIO in Beam uses the TableFieldSchema and TableSchema classes under com.google.api.services.bigquery.model to represent BigQuery fields and schemas, while the BigQuery Client Libraries use TableDefinition under the com.google.cloud.bigquery package for the same purpose, so the field and schema definitions cannot be defined in one place and reused in another.
Is there a way to define the schema in one place and reuse it?
Thanks,
Soby
p.s. we are using the Java SDK in Beam

A similar question was asked here.
I wrote some utils that might be of interest to you and published them on GitHub.
The ParseToProtoBuffer.py script downloads the schema from BigQuery and parses it into a Protobuf schema (you might want to look into protocol buffers to boost your pipeline's performance as well). If you compile this into a Java class and use it in your project, you can use the makeTableSchema function in ProtobufUtils.java to get the TableSchema for that class. You might want to use makeTableRow as well if you decide to develop your pipeline with protocol buffers.
The code I pushed there is WIP and not being used in production or anything yet, but I hope it gives you a push in the right direction.
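If you would rather keep the client library's own schema objects as the single source of truth instead of protocol buffers, a small converter is another option. The following is a minimal, untested sketch assuming reasonably recent versions of google-cloud-bigquery and the BigQuery API model classes; method names (e.g. getType() returning LegacySQLTypeName) differ between releases, so treat it as illustrative rather than a drop-in utility.

    import com.google.api.services.bigquery.model.TableFieldSchema;
    import com.google.api.services.bigquery.model.TableSchema;
    import com.google.cloud.bigquery.Field;
    import com.google.cloud.bigquery.Schema;
    import java.util.List;
    import java.util.stream.Collectors;

    /** Converts a client-library Schema into the TableSchema that Beam's BigQueryIO expects. */
    public class SchemaConverter {

        public static TableSchema toTableSchema(Schema schema) {
            List<TableFieldSchema> fields = schema.getFields().stream()
                    .map(SchemaConverter::toTableFieldSchema)
                    .collect(Collectors.toList());
            return new TableSchema().setFields(fields);
        }

        private static TableFieldSchema toTableFieldSchema(Field field) {
            TableFieldSchema result = new TableFieldSchema()
                    .setName(field.getName())
                    // Recent client libraries return LegacySQLTypeName here; the Beta API
                    // exposes Field.Type instead, so adjust this call for your version.
                    .setType(field.getType().name())
                    .setMode(field.getMode() == null ? "NULLABLE" : field.getMode().name());
            if (field.getSubFields() != null && !field.getSubFields().isEmpty()) {
                result.setFields(field.getSubFields().stream()
                        .map(SchemaConverter::toTableFieldSchema)
                        .collect(Collectors.toList()));
            }
            return result;
        }
    }

With something like this, the schema lives only in the client-library definition used to create the tables, and the pipeline derives its TableSchema from it at construction time.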

Related

One time migration of VSAM files from Mainframe to Cloud Azure

We want to migrate bulk files (e.g. VSAM) from the mainframe to Azure at the beginning of the project. How can that be achieved?
Is there a utility for this, or do we need to write our own scripts?
I suspect there are some utilities out there, but most or all of them are likely priced products. Since VSAM datasets are not defined using a language construct like DDL, you will likely have to do most of the heavy lifting yourself, either by writing your own programs or custom scripts. You didn’t mention the operating system, but I assume you’re working on z/OS.
Here are some things to consider:
The structure of a VSAM dataset is basically record-oriented. There are three basic types you’ll run into that host application data:
Key Sequenced Datasets (KSDS)
Entry Sequenced Datasets (ESDS)
Relative Record datasets (RRDS)
Familiarize yourself with the means of defining the datasets, as it will give you some insight into the dataset specifics. DFSMS Access Method Services Commands will show the utilities used to create them and to get information like the key length and offset of the key. DEFINE CLUSTER is the command to create the dataset. You mentioned you are moving the data to Azure, but this will help you understand the characteristics of the data you are moving.
Since there is no DDL for VSAM datasets, you will generally find the structure in the programs that manipulate them, such as COBOL copybooks, HLASM DSECTs and similar constructs. This is the long pole in the tent for you.
Consider the semantics of accessing the data. VSAM as an access method does have some ability to control read/write access at a macro level using a DEFINE CLUSTER option called SHAREOPTIONS. The SHAREOPTIONS instruct the operating system how to handle the VSAM buffers in terms of reading and writing so that multiple processes can access the same data. It's primitive compared to shared file systems like NFS. VSAM allows the application to control access (or serialization) using ENQ / DEQ functions. These enable applications to express their intent regarding a VSAM file and coordinate their own activities.
You might find that converting a VSAM file to a relational form like Db2 is better for you. Again, you’ll have to create the DDL to describe the tables, data formats and the like.
Another consideration is data conversion. You’ll find there is character data that is most likely in EBCDIC and needs to be converted to a new code page. Numeric data can be in Packed Decimal, Binary, or even text and will need to be converted.
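To illustrate the kind of conversion code involved, here is a rough Java sketch that decodes an EBCDIC character field and a packed-decimal (COMP-3) field from a record buffer. The code page, field offsets, lengths and scale are assumptions you would take from the copybook, and the Cp1047 charset may require a full JRE / the jdk.charsets module.

    import java.math.BigDecimal;
    import java.nio.charset.Charset;

    public class VsamRecordConverter {

        // EBCDIC code page used by the source system; confirm against the site standard.
        private static final Charset EBCDIC = Charset.forName("Cp1047");

        /** Decode an EBCDIC character field into a Java String. */
        static String decodeText(byte[] record, int offset, int length) {
            return new String(record, offset, length, EBCDIC).trim();
        }

        /** Decode a packed-decimal (COMP-3) field: two digits per byte, sign in the last nibble. */
        static BigDecimal decodePackedDecimal(byte[] record, int offset, int length, int scale) {
            StringBuilder digits = new StringBuilder();
            for (int i = 0; i < length; i++) {
                int b = record[offset + i] & 0xFF;
                digits.append((b >> 4) & 0x0F);      // high nibble is always a digit
                if (i < length - 1) {
                    digits.append(b & 0x0F);         // low nibble is a digit, except in the last byte
                } else if ((b & 0x0F) == 0x0D || (b & 0x0F) == 0x0B) {
                    digits.insert(0, '-');           // last low nibble is the sign; 0xD/0xB mean negative
                }
            }
            return new BigDecimal(digits.toString()).movePointLeft(scale);
        }
    }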
The short answer is there isn’t an “Easy Button” to do what you want. The data itself is only one of the questions that needs to be answered: serialization of and access to the data, code page conversion, and, if you are moving some data but not all of it, whether you will need to map some of the converted data back to data on the mainframe.
Consider exploring IBM CDC Classic replication. You can achieve this with a few clicks of a button.
I have not done it for Azure, so I'm not sure about support.

Append structure to standard table or create Z table?

Nowadays SAP recommends that you "keep the core clean" in order to be able to move to the cloud and always be able to update to the latest version without having to worry or retest; this also applies to on-premise systems.
I have a requirement to add a Z field to the QMEL table to link its notifications to SAP PS projects (the PROJ table). The QMEL table already has a structure, CI_QMEL, ready to be extended, and the related BAPIs support this extension.
But in order to keep the core clean, I'm considering challenging the functional requirement and suggesting that we create a ZNOTIF_PROJ table with the same key as QMEL (Notification ID). This would be totally separate from the standard, but at the same time the official BAPI wouldn't be able to support it, so a wrapper on top would be needed to update both the standard and the custom tables, and everything would become more complex.
Should I stick to the old extension style or go for a new table?
Personally I prefer extending standard tables. Having BAPIs, standard transactions, etc., work as expected is worth far more than a nebulous idea like a "clean core."
As long as you're not modding core code or extending tables in an incorrect manner, customizing the system in ways supported by SAP is not a bad thing. You should consider your future upgrade plans (S/4 on-prem vs cloud, for example) when deciding the right answer, but don't make things too hard on yourself.
S/4 on-prem and cloud already provide functionality for adding new fields and tables. You can do this in a web UI, similar to SAP CRM, so there is no problem with extending the existing structure. The help page about this functionality is here.

Do semantic tools like Anzo create a copy of the data?

I'm new to semantic technologies. I understand what RDF, OWL, ontologies and other basic terminology are, and how semantic search uses them. When we create a semantic search module using Anzo with enterprise search capabilities, it connects to various data sources and creates relationships between them. Now I'm interested in knowing what a semantic tool like Anzo does internally.
Does it create a copy of the data on the local machine, or does it hit the data sources every time we execute a SPARQL query?
If it stores data, is the data stored in its raw format, or is it stored after cleaning and creating semantic relations between the sources?
What happens to the data after the query is executed? How does it get current data every time?
Any thoughts on this would be valuable to me.
Thanks a lot in advance!
Based on your comments, it appears you're using Anzo Graph Query Engine? If so, then the answers to your questions are:
1. A copy of the data is held in memory.
2. Not clear from any of the published information.
3. It doesn't. You need to load in the data using the 'LOAD' command.
A bit more on 3: you would be responsible for implementing a mechanism to keep the data held there up to date with the underlying data source (which might be as simple as rebuilding the graph from a nightly dump, or as involved as implementing change data capture against the underlying store that replicates CRUD operations onto the graph).
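For illustration only, a nightly rebuild against a SPARQL 1.1 Update endpoint could look roughly like the sketch below, here using Apache Jena's RDFConnection; the endpoint URL, graph IRI and dump location are made up, and AnzoGraph's own load mechanism and connection details may differ from plain SPARQL Update.

    import org.apache.jena.rdfconnection.RDFConnection;
    import org.apache.jena.rdfconnection.RDFConnectionFactory;

    public class NightlyGraphRefresh {
        public static void main(String[] args) {
            String endpoint = "http://graph-engine.example.com:8080/sparql";   // hypothetical endpoint
            String graph = "http://example.com/graphs/warehouse";              // hypothetical graph IRI

            try (RDFConnection conn = RDFConnectionFactory.connect(endpoint)) {
                // Drop yesterday's copy and reload the graph from the nightly dump.
                conn.update("DROP SILENT GRAPH <" + graph + ">");
                conn.update("LOAD <file:///data/dumps/nightly.ttl> INTO GRAPH <" + graph + ">");
            }
        }
    }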
My answers are based on the marketing and support information available on the CambridgeSemantics site.

Use RavenDB as the database for an Orchard CMS module

I'm just getting underway with Orchard CMS. How difficult would it be to create an Orchard module that uses RavenDB as its database? Is a hard dependency on SQL and NHibernate baked deeply into Orchard?
All of Orchard's core features are based on NHibernate, so it would be difficult to move the entire Orchard database to another DBMS not supported by NHibernate. However, Orchard is very extensible and it is quite easy to access all kinds of custom data sources from your own modules. For example, I am currently working on a project where we store our data in a graph database (neo4j) and access it in Orchard using a WCF service.
It depends on what kind of data you need to access, but you will probably need to create a custom content part which dynamically loads data instead of using the underlying SQL database through NHibernate. You can do this by inheriting from the non-generic ContentPart class (the generic one uses a record stored using NHibernate) and using a ContentHandler to populate the data from your custom data source.
There is an experimental RavenDB-based data layer implementation in the 'ravendb' Mercurial branch.
It was built a couple of months ago and I'm not sure about its compatibility with the current release, but you can give it a try. There have been no big changes to the data layer since then, so I assume it should work or need just a couple of tweaks.

I need advice choosing a NoSQL database for a project with a lot of minute related information

I am currently working on a private project that is going to use Google's GTFS spec to get information about hundreds of public transit agencies: their routes, stations, times, and other related information. I will be getting my information from here and the Google Code wiki page with similar info. There is a lot of data and it's partitioned into multiple CSV-formatted text files. These can be huge, some ranging from 80-100 MB of data.
With the data I have, I want to translate it all into a nice solid database that I can build layers on top of to use for my project. I will be using GPS positioning to pinpoint a location and all surrounding stations/stops.
My goal is to access all the information for all these stops and stations with as few calls as possible, while keeping datasets small for queried results.
I am currently leaning towards MongoDB and CouchDB for their geospatial support, which can really optimize getting small datasets. But I also need to be sure to link all the stops on a route, because I will be propagating information along a transit route for that line. In this case I have found that I could benefit from a graph DB like Neo4j or OrientDB, but from what I know, neither has geospatial support, nor am I 100% sure that a graph DB is what I need.
The perfect solution might not exist, but I come here asking for help in finding the best possible option for my situation. I know I will possibly have to work around limitations of whatever I choose, but I want to at least have done my research and know that it's the best I can get at the moment.
It has also been suggested that I split the data into multiple DBs, but that could get very messy because all the information is very tightly interconnected through IDs.
Any help would be appreciated.
Obviously a graph database fits your problem 100%. My advice here is to go for a geospatial module on top of neo4j or orientdb, although there are some other free and open-source implementations.
I think the best one right now, with all the geospatial features implemented, is the neo4j-spatial package. But as far as I know, you can also reproduce most of the geospatial functionality on your own if necessary.
BTW, talking about splitting: if the amount of data/queries will be high, I strongly recommend you share the load and design the model with that in mind. Surely you can do something there.
I've used Mongo's GeoSpatial features and can offer some guidance if you need help with a C# or javascript implementation - I would recommend it to start because it's super easy to use. I'm learning all about Neo4j right now and I am working on a hybrid approach that takes advantage of both Mongo and Neo4j. You might want to cross reference the documents in Mongo to the nodes in Neo4j using the Mongo object id.
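For reference, a nearby-stops lookup against a 2dsphere index with the MongoDB Java driver looks roughly like the sketch below; the database, collection and field names are made up for illustration.

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.model.Filters;
    import com.mongodb.client.model.Indexes;
    import com.mongodb.client.model.geojson.Point;
    import com.mongodb.client.model.geojson.Position;
    import org.bson.Document;

    public class NearbyStops {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoCollection<Document> stops = client.getDatabase("gtfs").getCollection("stops");

                // One-time setup: index the GeoJSON "location" field for spherical queries.
                stops.createIndex(Indexes.geo2dsphere("location"));

                // Find stops within 500 metres of the given GPS position (GeoJSON uses lon, lat order).
                Point here = new Point(new Position(-73.9857, 40.7484));
                for (Document stop : stops.find(Filters.near("location", here, 500.0, 0.0))) {
                    System.out.println(stop.getString("stop_name"));
                }
            }
        }
    }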
For my hybrid implementation, I'm storing profiles and any other large static data in Mongo. In Neo4j, I'm storing relationships like friend and friend-of-friend. If I wanted to analyze movies two friends are most likely to want to watch together (or really any other relationship I hadn't thought of initially), by keeping that object id reference I can simply add some code instructing each node to go out and grab a list of movies from the related profile.
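A rough sketch of that cross-reference, using the Neo4j and MongoDB Java drivers; the node labels, property names, collection names and credentials are all made up for illustration.

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.model.Filters;
    import org.bson.Document;
    import org.bson.types.ObjectId;
    import org.neo4j.driver.AuthTokens;
    import org.neo4j.driver.Driver;
    import org.neo4j.driver.GraphDatabase;
    import org.neo4j.driver.Record;
    import org.neo4j.driver.Result;
    import org.neo4j.driver.Session;
    import org.neo4j.driver.Values;

    public class FriendProfiles {
        public static void main(String[] args) {
            try (Driver neo4j = GraphDatabase.driver("bolt://localhost:7687", AuthTokens.basic("neo4j", "secret"));
                 MongoClient mongo = MongoClients.create("mongodb://localhost:27017");
                 Session session = neo4j.session()) {

                MongoCollection<Document> profiles = mongo.getDatabase("app").getCollection("profiles");

                // Walk the graph for a user's friends, then hydrate each one from Mongo via the stored ObjectId.
                Result friends = session.run(
                        "MATCH (:Person {mongoId: $id})-[:FRIEND]->(f:Person) RETURN f.mongoId AS mongoId",
                        Values.parameters("id", "64b0c0ffee0ddba11c0ffee1"));
                while (friends.hasNext()) {
                    Record rec = friends.next();
                    Document profile = profiles.find(
                            Filters.eq("_id", new ObjectId(rec.get("mongoId").asString()))).first();
                    if (profile != null) {
                        System.out.println(profile.getString("name"));
                    }
                }
            }
        }
    }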
Added 2011-02-12:
Just wanted to follow up on this "hybrid" idea as I created prototypes for and implemented a few more solutions recently where I ended up using more than one database. Martin Fowler refers to this as "Polyglot Persistence."
I'm finding that I am often using a combination of a relational database, document database and a graph database (in my case this is generally SQL Server, MongoDB and Neo4j). Since the question is related to data modeling as much as it is to geospatial, I thought I would touch on that here:
I've used Neo4j for site organization (similar to the idea of hypermedia in the REST model), modeling social data and building recommendations (often based on social data). As a result, I will generally model this part of the application before I begin programming.
I often end up using MongoDB for prototyping the rest of the application because it provides such a simple persistence mechanism. I like to start developing an application with the user interface, so this ends up working well.
When I start moving entities from Mongo to SQL Server, the context is usually important. For instance, if I have an application that allows users to build daily reports based on periodically collected data, it may make sense to run a procedure that builds those reports each night and stores daily report objects in Mongo that may be combined into larger aggregate reports as needed (obviously this doesn't consider a few special cases, but that is not relevant to the point). On the other hand, if users need to pull on-demand reports limited to very specific time periods, it may make sense to keep everything in SQL Server and build those reports as needed.
That said, and this deserves more intense thought, here are some considerations that may be helpful:
I generally try to store entities in a relational database if I find that pulling an entity from the database (in other words, in the relational context, querying the data required to generate an entity or list of entities that fulfills the requested parameters) does not require significant processing (multiple joins, for instance).
Do you require ACID compliance (aside: if you have a graph problem, you can leverage Neo4j for this)? There are document databases with ACID compliance, but there's a reason Mongo is not ACID compliant: What does MongoDB not being ACID compliant really mean?
One use of Mongo I saw in the wild that I thought was worthy of mention - Hadoop was being used to compute massive hash tables that were then stored in Mongo. I believe a similar approach is used by TripAdvisor for user based customization in terms of targeting offers, advertising, etc..
NoSQL only exists because MySQL users assume that all databases share MySQL's performance problems when the database grows large and/or becomes complex.
I suggest that you use PostGIS. You can use the same database for the rest of your data needs as well.
http://postgis.refractions.net/
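For example, a nearby-stops query in PostGIS could look roughly like this via JDBC; the table, column and connection details are assumptions, and you need the PostgreSQL JDBC driver plus the postgis extension installed.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class NearbyStopsPostgis {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost:5432/gtfs", "gtfs", "secret")) {
                // Stops within 500 metres of a GPS position, using a geography column named 'geom'.
                String sql = "SELECT stop_id, stop_name "
                           + "FROM stops "
                           + "WHERE ST_DWithin(geom, ST_MakePoint(?, ?)::geography, 500) "
                           + "ORDER BY ST_Distance(geom, ST_MakePoint(?, ?)::geography)";
                try (PreparedStatement ps = conn.prepareStatement(sql)) {
                    ps.setDouble(1, -73.9857);  // longitude
                    ps.setDouble(2, 40.7484);   // latitude
                    ps.setDouble(3, -73.9857);
                    ps.setDouble(4, 40.7484);
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            System.out.println(rs.getString("stop_name"));
                        }
                    }
                }
            }
        }
    }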