SQL database management question for Webscraper project


I have very little database management experience; I took a single class as an undergrad. I'd like others' input on the best way to set up the database.
I have developed a Docker application (web scraper plus PostGIS database). The web scraper scrapes multiple websites every day and uploads the results to the database, checking for duplicates before uploading.
However, I don't want the research assistants to be able to change anything in the original tables, since much of the web scraper depends on their structure. I gave them SELECT access, but I also want them to be able to share their own data in the database, as this is a collaborative project.
My original thought was to create a new, empty database where they have full permissions, and to give them only SELECT access to the web scraper database. I don't know if this is the best way to do this.
What are your thoughts?
Also note that this is a contract job for a university project under a grant, so I won't be maintaining the database after the contract ends. The project also isn't big enough to hire someone with Docker and database experience just to maintain the database, so I am trying to make this as bulletproof as possible.
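A minimal sketch of the permission layout described above, written as a Python helper that only builds the PostgreSQL statements (it does not connect to anything). One substitution should be stated plainly: it uses a separate schema inside the same database rather than a separate database, on the assumption that the research assistants will want to query their own tables together with the scraped ones, which plain PostgreSQL queries cannot do across databases. The role name `research_assistants` and the schema name `shared` are hypothetical, and the scraper tables are assumed to live in `public`.

```python
def access_setup_sql(role: str, scraper_schema: str, shared_schema: str) -> list[str]:
    """Build PostgreSQL statements: read-only on scraper tables, full rights in a shared schema."""
    # Identifiers are interpolated into SQL, so accept only plain names.
    for identifier in (role, scraper_schema, shared_schema):
        if not identifier.isidentifier():
            raise ValueError(f"unsafe identifier: {identifier!r}")
    return [
        # Group role; individual logins would be made members of it.
        f"CREATE ROLE {role} NOLOGIN;",
        # Nobody but the owner may create objects next to the scraper tables.
        f"REVOKE CREATE ON SCHEMA {scraper_schema} FROM PUBLIC;",
        f"GRANT USAGE ON SCHEMA {scraper_schema} TO {role};",
        f"GRANT SELECT ON ALL TABLES IN SCHEMA {scraper_schema} TO {role};",
        # Covers tables created later by the role that runs this statement.
        f"ALTER DEFAULT PRIVILEGES IN SCHEMA {scraper_schema} GRANT SELECT ON TABLES TO {role};",
        # The shared area is owned by the group, so members can create and change tables there.
        f"CREATE SCHEMA {shared_schema} AUTHORIZATION {role};",
    ]


for statement in access_setup_sql("research_assistants", "public", "shared"):
    print(statement)
```

The statements would be run once by the role that owns the scraper tables; each research assistant's login would then be added with something like `GRANT research_assistants TO some_login;`.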

Related

PET technology Fluent Nhibernate

For a web application (holding real private data) we want to use privacy-enhancing technology to limit the damage when someone gains access to our database.
The application is built in layers. We use (as the title says) Fluent NHibernate to connect to our database, and we've created our own wrapper class to build queries.
Security is a big issue for the kind of application we're building. I'll try to explain the setting with a simple example:
Our customers have clients in their application (each installation of the application uses its own database), for whom some sensitive data is stored. There is a client table and a person table, and they are linked.
The base table is the client table; it links to the other tables (there will soon be hundreds of them), which probably contain sensitive data.
At the moment, the client has a client_id and a table_id in the database. Our customer only knows the client_id; the system links the data by the table_id, which is unknown to the user.
What we want to ensure:
A hacker who has gained access to our database should not be able to see the link between the customer and the other tables just by opening the database. So there should be some kind of "hidden link" between the customer and the other tables. The personal data and all the other sensitive tables should not be obviously linked together.
Because of the sensitivity of the data, we're looking for a more robust solution than "statically hash the table_id and use this in other tables": if one of the persons is linked to the corresponding client, the data of all the other clients should not be compromised as well.
Ultimately, it should be impossible to link the customer table to the other tables just by working inside the database; the application code should be needed to link the tables.
To accomplish this we've been looking into different methods, but because many tables are linked to this client, and further development will probably add even more, we're looking for a centralised solution. That's why we concluded this should be handled in the database connector. Searching the internet and here on Stack Overflow did not point us in the right direction; perhaps we used the wrong search terms (searching for PET or privacy-enhancing technology combined with NHibernate did not turn up anything).
How can we accomplish our goals in this specific situation, or where should we look for help?

We have a similar requirement for our application, and what we ended up with was using database schemas.
We have one database, and each customer has a separate schema where all the data for that customer is stored. It is possible to link from a schema to the rest of the database, but not to other schemas.
Security can be set for each schema separately, so you can make the life of a hacker harder.
That being said, I can also imagine a solution where you let NHibernate encrypt every piece of data it sends to the database and decrypt everything it gets back. The data will be stored safely, but it will be very difficult to query.
So there is probably no single answer to this question, and you have to decide which is better: not being able to query, or just making it more difficult for a hacker to get to the data.
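To make the "application code is needed to link the tables" idea from the question concrete, here is a minimal sketch in Python. It is neither the schema approach nor the encrypt-everything approach described above, and it is not NHibernate code: it swaps in a keyed hash (HMAC-SHA256) whose key lives only in the application. The key, the table layout, and the helper names are all hypothetical, and SQLite merely stands in for the real database.

```python
import hashlib
import hmac
import sqlite3

# Hypothetical secret; assumed to live in application configuration, never in the database.
LINK_KEY = b"example-key-kept-outside-the-database"


def link_token(key: bytes, table: str, client_id: int) -> str:
    """Derive the value stored in `table` in place of the raw client id."""
    # Including the table name gives every table a different token for the same client.
    message = f"{table}:{client_id}".encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def persons_for_client(conn, key, client_id):
    """Only code that holds the key can compute which person rows belong to a client."""
    token = link_token(key, "person", client_id)
    rows = conn.execute("SELECT full_name FROM person WHERE client_ref = ?", (token,))
    return [row[0] for row in rows]


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE client (client_id INTEGER PRIMARY KEY, name TEXT)")
db.execute("CREATE TABLE person (client_ref TEXT, full_name TEXT)")
db.execute("INSERT INTO client VALUES (1, 'Client A')")
db.execute(
    "INSERT INTO person VALUES (?, ?)",
    (link_token(LINK_KEY, "person", 1), "Jane Doe"),
)

print(persons_for_client(db, LINK_KEY, 1))
```

The trade-off named in the answer shows up here too: the database can no longer join `client` to `person` on its own, so every lookup has to go through application code, and anyone who obtains the key can recompute all tokens.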

Temporary Tables Quick Guide

I have a structured database and software to handle it, and I want to set up a demo version based on a simple template version. I'm reading through some resources on temporary tables, but I have questions.
What is the best way to go about cloning a "temporary" database while keeping a clean list of databases?
From what I've seen, there are two ways to do this - temporary local versions that are terminated at the end of the session, and tables that are stored in the database until deleted by the client or me.
I think I would prefer the second option, because I would like to be able to see what they do with it. However, I do not want to add a ton of throw-away databases and clutter my system.
How can I a) schedule these for deletion after, say, 30 days and b) if possible, keep them all under one umbrella? In other words, is there a way to keep them out of my main list of databases and grouped by themselves?
To solve b), I've thought about having one database and serving up the information using a unique ID for the user and 'faux indexes', so that it appears as 1, 2, 3 instead of 556, 557, 558. I'm unsure how I could solve a), other than adding date and protected columns and having a script that runs daily and deletes anything that is over 30 days old and not protected.
I apologize for the open-ended question, but the resources I've found are a bit ambiguous.

These aren't true temp tables in the sense that your DBMS knows them. What you're looking for is a way to have a demo copy of your database, probably with a cut-down data set. It's really no different from having any other non-production copy of your database.
Don't do this on your production database server.
Do not do this on your production database server.
1) Script the creation of your database schema. Depending on the DBMS you're using, this may be pretty easy. If you've got a good development/deployment/maintenance process for your system, this should already exist.
2) Create your database on the non-production server using the script(s) generated in the previous step. Use an easily-identifiable naming convention, like starting the database name with demo.
3) Load any data required into the tables.
4) Point the demo version of your app (that's running on your non-production servers) at this new database.
5) Create a script/process/job which looks at your database server and drops any databases that match your demo DB naming convention and were created more than 30 days ago.
Without details about your actual environment, people can't give concrete examples/sample code/instructions.
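Independent of the DBMS, the selection logic of the cleanup job in the last step can be sketched as follows. This is only an illustration in Python: how you list databases and their creation dates depends on your server, so the sketch takes them as plain (name, created_at) pairs, and the `demo` prefix and the sample names are hypothetical.

```python
from datetime import datetime, timedelta

DEMO_PREFIX = "demo"           # hypothetical naming convention for demo databases
MAX_AGE = timedelta(days=30)


def databases_to_drop(databases, now, prefix=DEMO_PREFIX, max_age=MAX_AGE):
    """Return the names of demo databases created more than `max_age` before `now`.

    `databases` is an iterable of (name, created_at) pairs, however your DBMS exposes them.
    """
    return [
        name
        for name, created_at in databases
        if name.startswith(prefix) and now - created_at > max_age
    ]


now = datetime(2024, 3, 1)
catalog = [
    ("production", datetime(2020, 1, 1)),     # old, but not a demo database
    ("demo_acme", datetime(2024, 1, 15)),     # demo, older than 30 days
    ("demo_globex", datetime(2024, 2, 20)),   # demo, still within 30 days
]

for name in databases_to_drop(catalog, now):
    # Names come from the server's own catalog, not from user input.
    print(f"DROP DATABASE {name};")
```

Because the prefix check comes first, a database that does not follow the naming convention is never selected, no matter how old it is.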
If you cannot run a second, independent database server for these demos, then you will have to make do with your production server. This is still a bad idea because of potential security exposures and performance impact on your production database (constrained resources).
1) Create a complete copy of your database (or at least the schema, with a reduced data set) for each demo.
2) Create a unique set of credentials for each of these demo databases. This account should have access to only its demo database.
3) Configure the demo instance(s) of your application to connect to the demo database.
Here's why I'm pushing so hard for separate databases: If you keep copying your "demo" tables within the database, you will have to update your application code to point at those tables each time you do a new demo. Once you start doing this, you're taking a big risk with your demos - the code you keep changing isn't really the application you're running in production anymore. And if you miss one of those changes, you'll get unexpected results at best, and mangling of your production data at worst.

Local SQL database interface to cloud database

Excuse me if the question is simple. We have multiple medical clinics, each running its own SQL database EHR.
Is there any way I can interface each local SQL database with a cloud system?
I essentially want to use the data of the patient currently being consulted to generate a pathology request that links to a cloud (Google App Engine?) database.

As a medical student / software developer this project of yours interests me greatly!
If you don't mind me asking, where are you based? I'm from the UK and unfortunately there's just no way a system like this would get off the ground as most data is locked in proprietary databases.
What you're talking about is fairly complex anyway, whatever country you're in I assume there would have to be a lot of checks / security around any cloud system that dealt with patient data. Theoretically though, what you would want to do ideally is create an online database (cloud, hosted, intranet etc), and scrap the local databases entirely.
You then have one 'pool' of data each clinic can pull information from (i.e. ALL records for patient #3563). They could then edit that data and/or insert new records and SAVE them, exporting them back to the main database.
If there is a need to keep certain information private to one clinic only, this could still be achieved on one database in a number of ways, or you could retain parts of the local database and have them merge with the cloud data as they're requested by the clinic.

This might be a bit outdated, but you guys should check out https://www.firebase.com/. It would let you do what you want fairly easily. We just did this for a client in the exact same business you are in.
Basically, Firebase lets you work with a central database in the cloud that is automatically synchronised with all its front ends. It even handles losing the connection to the server automagically. It's the best solution I've found so far for keeping several systems running against a single cloud database.
We used to have our own backend that would try its best to sync changes, but you need to be really careful with inter-system unique IDs for your tables (i.e. creating a new user at one of the branches must not yield an ID that already exists in any other branch or in the central database). It becomes cumbersome very quickly.
CakePHP can generate this kind of unique ID pretty easily and automatically, but you still have to work on syncing all the local databases with the central repository.
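The ID problem described above can be illustrated in a few lines. This sketch uses Python's standard `uuid` module as a stand-in (it is not CakePHP's mechanism), and the record layout is hypothetical: each branch creates IDs on its own, and merging into the central store needs no renumbering.

```python
import uuid


def new_user(name):
    """Create a record whose ID is generated locally, without asking a central server."""
    # uuid4 is random, so branches do not need to coordinate to avoid clashes.
    return {"id": str(uuid.uuid4()), "name": name}


branch_a = [new_user("Alice"), new_user("Bob")]
branch_b = [new_user("Carol")]

# Merging is a plain keyed insert; no record overwrites another.
central = {}
for record in branch_a + branch_b:
    central[record["id"]] = record

print(len(central))
```

With auto-increment integers, both branches would have produced a user with ID 1, and the merge would have needed a remapping step. The synchronisation itself (deciding which changes win) is still left to you, as the answer says.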

What strategies are available for migrating Access databases to SQL server-based applications?

I'm considering undertaking a project to migrate a very large MS Access application to a new system based on SQL Server. The existing system is essentially an ERP application with a couple of dozen users, all sharing the Access database over the network. The database has around 300 tables and lots of messy VBA code. This system is beginning to break down (actually, it's amazing it has worked as long as it has).
Due to the size and complexity of the Access application, a 'big bang' approach is not really feasible. It seems sensible to rope off chunks of functionality and migrate them piecemeal to the new system. During the migration process, which I expect to take several months, there may be a need for both databases to be in operation and be able to query and modify data in both systems.
I have considered using something like the ADO.NET Entity Framework to implement a data abstraction layer to handle this, but as far as I can tell, the Entity Framework has no Access provider.
Does my approach seem reasonable? What other strategies have people used to accomplish similar goals?

You may find that the main problem is using the MS Access JET engine as the backend. I'm assuming that you do have an Access FE (frontend) with all objects except tables, and a BE (backend - tables only).
You may find that migrating the data to SQL Server, and linking the Access FE to that, would help alleviate problems immediately.
Then, if you don't want to continue to use MS Access as the FE, you could consider breaking it up into 'modules', and redesign modules one by one using a separate development platform.

We faced a similar situation a few years ago, but we knew from the beginning that we would have to switch to SQL Server one day, so the whole code was written to work from an Access client against both Access and SQL Server databases.
A 'one-step' migration to SQL Server is certainly the easier way to manage this on the database side, and there are many tools for that. But, depending on the way your client app talks to the database, your code might then not work properly. If, for example, your code includes a lot of SQL instructions (or generates them on the fly by, for example, adding filters to SELECT instructions), your syntax might not be SQL Server compatible: Access wildcards, dates, and functions will not work on SQL Server.
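Two of the incompatibilities just mentioned, wildcards and date literals, can be shown with a small translation sketch. It is written in Python purely for illustration (the original client code would be VBA), the function names are hypothetical, and it deliberately ignores bracket expressions such as [a-c]; it is not a complete Access-to-T-SQL converter.

```python
def access_like_to_tsql(pattern: str) -> str:
    """Translate an Access (Jet) LIKE pattern to SQL Server LIKE syntax.

    Simplification: bracket expressions such as [a-c] or [*] are not handled.
    """
    mapping = {
        "*": "%",       # any run of characters
        "?": "_",       # exactly one character
        "#": "[0-9]",   # exactly one digit
        "%": "[%]",     # literal in Access, wildcard in SQL Server
        "_": "[_]",     # literal in Access, wildcard in SQL Server
    }
    return "".join(mapping.get(ch, ch) for ch in pattern)


def access_date_to_tsql(literal: str) -> str:
    """Turn an Access #m/d/yyyy# literal into a quoted yyyymmdd string."""
    month, day, year = (int(part) for part in literal.strip("#").split("/"))
    return f"'{year:04d}{month:02d}{day:02d}'"


print(access_like_to_tsql("Sm?th*"))       # Sm_th%
print(access_date_to_tsql("#3/15/2009#"))  # '20090315'
```

A filter built on the fly as `WHERE Name LIKE 'Sm?th*' AND Created > #3/15/2009#` would therefore have to become `WHERE Name LIKE 'Sm_th%' AND Created > '20090315'` before SQL Server accepts it.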
In addition to this, and as said by @mjv, the other drawback of a one-time switch to MS SQL is that you will inherit many of the problems of the original database: wrong or inappropriate field names, inappropriate primary/foreign key policies, hidden one-to-many relations that you'd like to implement in the new database model, etc.
I'll propose here some principles and rules for implementing a 'soft transition' solution, which clearly fits you best. It's not going to be easy, but it's definitely very interesting, particularly when dealing with 300 tables! Lucky you!
I assume here that you have the ability to update the client code, and that you'd prefer to keep the same client interface at all times. It is of course possible to have two different interfaces during the transition, one for each database, but this will be very confusing for the users and a permanent source of frustration for them.
In my opinion, the best solution strongly depends on:
1) The original connection technology and the way data is managed in your client's code: Access linked tables, ODBC, ADODB, recordsets, local tables, form recordsources, batch updating, etc.
2) The possibility of splitting your tables and your app into 'mostly independent' modules.
And you will not be spared the following mandatory activities:
1) Set up a transfer procedure from the Access database to SQL Server. You can use existing tools (the Access upsizing wizard is very poor, so do not hesitate to buy a real one, like SSW or EMS SQL Manager, very powerful) or build your own with Visual Basic. If your plan is to make some changes to the data definition, you'll definitely have to write some code. Keep in mind that you will run this code many, many times, so make sure it includes all the time-saving instructions that will allow you to restart the process from the start as many times as you want. You will have to choose between 2 basic strategies when importing data:
a - DELETE existing record, then INSERT imported record
b - UPDATE existing record from imported record
If you plan to switch to new primary/foreign key types, you'll have to keep track of the old identifiers in your new database model during the transition period. Do not hesitate to switch to GUID primary keys at this stage, especially if the plan is to replicate data across multiple sites one of these days.
This transfer procedure will be divided into modules corresponding to the 'logical' modules defined previously, and you should be able to run any of these modules independently (keeping in mind, of course, that they'll probably have to run in a specific order, where the 'customers' module has to run before the 'invoicing' module).
2) Implement in your client's code the possibility of connecting to both the original MS Access database and the new MS SQL Server. Ideally, you should be able to manage both connections from within your code for displaying and validating data.
This possibility will be implemented module by module, where each module gets a 'trial period', i.e. the possibility of choosing at testing time between the Access connection and the SQL Server connection when using the module. Once testing is complete, the module can be run in exclusive SQL Server mode.
During the transfer period, which can last a few months, you will have to manage programmatically the database constraints that exist between 'SQL Server' modules and 'Access' modules. Going back to our customers/invoicing example, the customers module will be switched to MS SQL first. Before the invoicing module can be switched, you'll have to implement programmatically the one-to-many relation between Customers and Invoices, where each of the tables will be in a different database. Such a constraint can be implemented on the Invoice form by populating the Customers combobox with the Customers recordset from the SQL Server.
My proposal is to build your modules following your database model, always beginning with the 'one' tables of your 'one-to-many' relations: basic lists like 'Units', 'Currencies', 'Countries' should be switched first. This will give you first hands-on experience in writing data transfer code and in managing a second connection in your client interface. You'll then be able to 'go up' in your database model, switching the 'products' and 'customers' tables (where units, countries and currencies are foreign keys) to the new server.
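The re-runnable transfer procedure described above (strategy b, with the old identifiers kept next to new GUID primary keys) could be sketched as follows. This is an illustration only: two in-memory SQLite databases stand in for the Access source and the SQL Server target, Python stands in for Visual Basic, and the `customer` table and `old_id` column are hypothetical.

```python
import sqlite3
import uuid


def transfer_customers(source, target):
    """Re-runnable import: UPDATE rows already transferred, INSERT the rest (strategy b)."""
    target.execute(
        "CREATE TABLE IF NOT EXISTS customer ("
        "id TEXT PRIMARY KEY, old_id INTEGER UNIQUE, name TEXT)"
    )
    for old_id, name in source.execute("SELECT id, name FROM customer").fetchall():
        updated = target.execute(
            "UPDATE customer SET name = ? WHERE old_id = ?", (name, old_id)
        )
        if updated.rowcount == 0:
            # First time this record is seen: give it a new GUID key, remember the old one.
            target.execute(
                "INSERT INTO customer (id, old_id, name) VALUES (?, ?, ?)",
                (str(uuid.uuid4()), old_id, name),
            )
    target.commit()


access_db = sqlite3.connect(":memory:")    # stand-in for the Access database
access_db.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
access_db.executemany("INSERT INTO customer VALUES (?, ?)", [(1, "Acme"), (2, "Globex")])
sql_server = sqlite3.connect(":memory:")   # stand-in for SQL Server

transfer_customers(access_db, sql_server)
access_db.execute("UPDATE customer SET name = 'Acme Ltd' WHERE id = 1")
transfer_customers(access_db, sql_server)  # second run updates, does not duplicate

print(sql_server.execute("SELECT old_id, name FROM customer ORDER BY old_id").fetchall())
```

Because rows are matched on `old_id`, the procedure can be restarted as often as needed, and a later 'invoicing' module can use the same column to translate old foreign keys into the new GUIDs.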
Good luck!

I would second the suggestion to upsize the back end to SQL Server as step 1.
I would never go to the suggested Step 2, though (i.e., replacing the Access front end with something else). I would instead suggest investing the effort in fixing the flaws of the schema, and adjusting the Access app to work with the new schema.
Obviously, it is never the case that everything just works hunky dory when you upsize -- some things that were previously quite fast will be dogs, and some things that were previously quite slow will be fast. And I've found that the problems are very often not where you anticipate they will be. You can only figure out what needs to be fixed by testing.
Basically, anything that works poorly gets re-architected, or moved entirely server-side.
Leverage the investment in the existing Access app rather than tossing all that out and starting from scratch. Access is a fine front end for a SQL Server back end as long as you don't assume it's going to work just the same way as it would with a Jet/ACE back end.

...thinking out loud... I think this may work.
It appears that the complexity of the application resides in the various VBA modules rather than in the database tables/schema themselves. A possible migration path could therefore be to first migrate the data storage to SQL Server, exactly as-is, as follows:
1) Prevent any change to the data for a few hours.
2) Duplicate all tables to the SQL Server; be sure to create the same indexes as well.
3) Create linked tables to an ODBC source pointing to the newly created tables on SQL Server. These linked tables should have the very same names as the original tables (which may therefore need to be renamed, say with a leading underscore, for possible reference).
Now the application can be restarted and should be using the SQL Server tables rather than the Access tables. All logic should work as before (right...); some slowness is to be expected, depending on the distance between the two machines.
All the above could be tested in about a day's work or so, the most tedious part being the creation of the tables on SQL Server (much of that can be automated, I'm sure). The next most tedious task is to verify that the application effectively works as before, but with its storage on SQL Server.
EDIT: As suggested by a comment, I should stress that there is a [fair ?] possibility that the application would not readily work so smoothly with a SQL Server back end, and could require weeks of hard work in testing and fixing. However, unless some of these difficulties can be anticipated because of insight into the application not expressed in the question, I propose that attempting the "as-is" migration to SQL Server should be considered; after all, it may just work with minimal effort, and if it doesn't, we'd know this very quickly. This is therefore a high-return, low-risk proposal...
The main advantage sought with this approach is that there will be a single storage during the longer period [as the OP expects] in which the old Access application will co-exist with the new application.
The drawback of this approach is that, at least at first, the schema of the original database is reproduced verbatim, i.e. including some of its known quirks and inherited idiosyncrasies. These schema issues (and the underlying application logic) can be corrected in time, but this is of course less easy than if the new application starts ab initio, with its own separate storage and distinct schema.
After the storage is moved to SQL Server, the most used and/or the most independent modules of the Access application can be rewritten in the new application, and as significant portions of the original application are ported, effective usage by select beta testers or by actual users can start to be switched to the new application.
Possibly, some kind of screen-scraping based logic or some other system could be used to produce a hybrid application which would provide the end users with a comprehensive application that sometimes works from the new logic and sometimes from the original MS Access program.

How can I maintain consistent DB schema across 18 databases (sql server)?

We have 18 databases that should have identical schemas, but don't. In certain scenarios, a table was added to one, but not the rest. Or, certain stored procedures were required in a handful of databases, but not the others. Or, our DBA forgot to run a script to add views on all of the databases.
What is the best way to keep database schemas in sync?

For legacy fixes/cleanup, there are tools, like SQLCompare, that can generate scripts to sync databases.
For .NET shops running SQL Server, there is also the Visual Studio Database Edition, which can create change scripts for schema changes that can be checked into source control, and automatically built using your CI/build process.

SQL Compare by Red Gate is a great tool for this.

SQLCompare is the best tool that I have used for finding differences between databases and getting them synced.
To keep the databases synced up, you need to have several things in place:
1) You need policies about who can make changes to production. Generally this should only be the DBA (DBA team for larger orgs) and 1 or 2 backups. The backups should only make changes when the DBA is out, or in an emergency. The backups should NOT be deploying on a regular basis. Set database rights according to this policy.
2) A process and tools to manage deployment requests. Ideally you will have a development environment, a test environment, and a production environment. Developers should do initial development in the dev environment, and have changes pushed to test and production as appropriate. You will need some way of letting the DBA know when to push changes. I would NOT recommend a process where you holler to the next cube. Large orgs may have a change control committee and changes only get made once a month. Smaller companies may just have the developer request testing, and after testing is passed a request for deployment to production. One smaller company I worked for used Problem Tracker for these requests.
Use whatever works in your situation and budget, just have a process, and have tools that work for that process.
3) You said that sometimes objects only need to go to a handful of databases. With only 18 databases, probably on one server, I would recommend making the objects in each database match exactly. Only 5 DBs need usp_DoSomething? So what? Put it in every database. This will be much easier to manage. We did it this way on a 6 server system with around 250-300 DBs. There were exceptions, but they were grouped. Databases on server C got this extra set of objects. Databases on server L got this other set.
4) You said that sometimes the DBA forgets to deploy change scripts to all the DBs. This tells me that s/he needs tools for deploying changes. S/he is probably taking a SQL script, opening it in Query Analyzer or Management Studio (or whatever you use) and manually going to each database and executing the SQL. This is not a good long term (or short term) solution. Red Gate (makers of SQLCompare above) have many great tools. MultiScript looks like it may work for deployment purposes. I worked with a DBA who wrote his own tool in SQL Server 2000 using osql. It would take an SQL file and execute it on each database on the server. He had to execute it on each server, but it beat executing on each DB. I also helped write a VB.net tool that would do the same thing, except it would also go through a list of servers, so it only had to be executed once.
5) Source Control. My current team doesn't use source control, and I don't have enough time to tell you how many problems this causes. If you don't have some kind of source control system, get one.
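The home-grown deployment tool mentioned in point 4 boils down to a loop over databases that records which ones failed. A minimal sketch, assuming nothing about the original osql or VB.net tools: it is written in Python, uses 18 in-memory SQLite databases as stand-ins for the SQL Server databases, and the `audit_log` table is hypothetical.

```python
import sqlite3


def run_script_everywhere(script, connections):
    """Execute the same SQL script on every database; return {name: error message or None}."""
    results = {}
    for name, conn in connections.items():
        try:
            conn.executescript(script)
            results[name] = None
        except sqlite3.Error as exc:
            # Keep going, so one broken database does not hide the state of the others.
            results[name] = str(exc)
    return results


# Stand-ins for the 18 databases; a real tool would open connections from a server list.
databases = {f"db{n:02d}": sqlite3.connect(":memory:") for n in range(1, 19)}
script = "CREATE TABLE IF NOT EXISTS audit_log (id INTEGER PRIMARY KEY, note TEXT);"

report = run_script_everywhere(script, databases)
failed = [name for name, error in report.items() if error is not None]
print(f"{len(report) - len(failed)} succeeded, {len(failed)} failed")
```

The point of returning a per-database report is that "forgot to run it on database 7" becomes visible immediately instead of surfacing weeks later.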

I haven't got enough reputation to comment on the above answer but the pro version of SQL Compare has a scriptable API. Given that you have to replicate stuff to all of these databases you could use this to make an automated job to either generate the change scripts or to validate that the databases are all in sync. It's also not much more expensive than the standard version.
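The "validate that the databases are all in sync" job can be sketched without any particular product. The example below does not use the SQL Compare API; it is a Python illustration against SQLite's `sqlite_master` catalog, and on SQL Server the same comparison would read from the system catalog (for example the INFORMATION_SCHEMA views) instead. Table and view names are hypothetical.

```python
import sqlite3


def schema_objects(conn):
    """Return {(type, name): whitespace-normalized definition} for user-defined objects."""
    rows = conn.execute(
        "SELECT type, name, sql FROM sqlite_master WHERE name NOT LIKE 'sqlite_%'"
    )
    return {(kind, name): " ".join((sql or "").split()) for kind, name, sql in rows}


def schema_drift(reference, other):
    """Compare two schemas; list objects that are missing, extra, or defined differently."""
    ref, cur = schema_objects(reference), schema_objects(other)
    return {
        "missing": sorted(set(ref) - set(cur)),
        "extra": sorted(set(cur) - set(ref)),
        "different": sorted(key for key in set(ref) & set(cur) if ref[key] != cur[key]),
    }


master = sqlite3.connect(":memory:")
master.executescript(
    "CREATE TABLE customer (id INTEGER);"
    "CREATE VIEW v_customer AS SELECT id FROM customer;"
)
replica = sqlite3.connect(":memory:")
replica.executescript("CREATE TABLE customer (id INTEGER);")  # the view was never deployed

print(schema_drift(master, replica))
```

Run nightly against all 18 databases with one of them (or a scripted master schema) as the reference, a non-empty result is the signal that a script was skipped somewhere.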

Aside from using database comparison tools, with 18 databases you should have a DBA, so enforce a policy that only the DBA can change tables at the database level by restricting CREATE and ALTER access to the DBA only, on both your test and live databases. The dev database shouldn't have this, of course! Make the developers who have been creating or altering the schemas willy-nilly go via the DBA.

Create a single source-controlled DDL/SQL script for each release and only use it to update the databases. The diff tools can be useful but mainly for checking that you haven't made a mistake and getting out of trouble when the policies fail. Combine the DDL, SQL, and stored procedure scripts into a single script so that it's not easy to "forget" to run one of the scripts.

We have got a tool called DB Schema Difftective that can compare and sync database schemas. With our other tool, DB MultiRun you can easily deploy generated (sync) scripts to multiple db servers (project based).

I realize this post is old, but TurnKey is correct. If you are a developer working in a team environment, the best way to maintain a database schema for a large application is to make updates to a master schema in whatever source control system you use. Simply write your own scripting class and your database will be perfect every time.
