Extracting Data from a VERY old unix machine - sql

First, apologies if this question reads like a wall of text; I couldn't think of a better way to format it.
I have a machine containing valuable data (circa 1995). The machine is running Unix (SCO OpenServer 6) with some sort of database stored on it.
The data is normally accessed via a software package whose license has expired, and the developers are no longer trading.
The software package connects to the machine via telnet to retrieve and modify data (the telnet connection no longer functions due to the license change).
I can access the machine via an ODBC driver (SeaODBC.dll) over the network, which is how I was planning to extract the data, but so far I have retrieved 300,000 rows in just over 24 hours. In total I estimate there are around 50,000,000 rows, so at the current speed it would take 6 months!
I need either a quicker way to extract the data from the machine via ODBC, or a way to export the entire DB locally on the machine to an external drive, network drive, or other external destination.
I've played around with the Unix interface and the only large files I can find are in a massive matrix of single-character folders (e.g. A\G\data.dat, A\H\Data.dat, etc.).
Does anyone know how to find out which DB system is installed on the machine? Hopefully it is something standard and I'll be able to find a way to export everything into a nicely formatted file.
Edit
Digging around the file system, I have found a folder under root > L which contains lots of single-letter folders; each of these contains more single-letter folders.
There are also files named after the tables I need (e.g. "ooi.r") which have the following format:
<Id>
[]
l for ooi_lno, lc for ooi_lcno, s for ooi_invno, id for ooi_indate
require l="AB"
require ls="SO"
require id=25/04/1998
{<id>} is s
sort increasing Id

I do not recognize those kinds of filenames (A\G\data.dat and so on - filenames with backslashes in them?), and it's likely a proprietary format, so I wouldn't expect much from that avenue. You can try running file on them to see whether they are in any recognized format, just in case.
I would suggest improving the speed of data extraction over ODBC by virtualizing the system. A modern computer will have faster memory, faster disks, and a faster CPU and may be able to extract the data a lot more quickly. You will have to extract a disk image from the old system in order to virtualize it, but hopefully a single sequential pass at reading everything off its disk won't be too slow.
I don't know what the architecture of this system is, but I guess it is x86, which means it might not be too hard to virtualize (depending on how well SCO OpenServer 6 copes with virtualization). You will have to use a hypervisor that supports full virtualization (not paravirtualization).

I finally solved the problem: running the query with another tool (not through MS Access or MS Excel) was massively faster. I ended up using DaFT (Database Fishing Tool) to SELECT INTO a text file, which processed all 50 million rows in a few hours.
It seems the DLL driver I was using doesn't work well with MS products.
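For reference, the general technique - a plain ODBC client using a large fetch size and streaming rows straight to a delimited text file - is only a few lines. This is a sketch in Python with pyodbc, not DaFT itself; the DSN, credentials, and table name are placeholders:
import csv
import pyodbc

# Placeholders: DSN, credentials, and table name.
conn = pyodbc.connect("DSN=SEAODBC_DSN;UID=user;PWD=secret", autocommit=True)
cur = conn.cursor()
cur.arraysize = 10000                        # fetch rows in big batches
cur.execute("SELECT * FROM ooi")             # table name is a guess from the question

with open("ooi_export.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow([col[0] for col in cur.description])   # header row
    while True:
        rows = cur.fetchmany(10000)
        if not rows:
            break
        writer.writerows(rows)

conn.close()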

Related

Database model to manage documents

I need to build tables to manage documents such as jpg, doc, msg, and pdf files, using SQL Server 2008.
As far as I know, SQL Server supports .jpg images, so my question is whether it's possible to upload other kinds of files into a DB.
This is an example of the table (it could be redefined if needed):
Document: document_id int(10)
name varchar(10)
type image (I don't know how this might work)
Those are the initial columns for the table, but I don't know how to make it work for any file type.
PS: do I need to assign a directory on the server to save these documents?
You can store almost any file type in a SQL Server table... and if you do, you will almost certainly regret it.
Store metadata and a pointer to the file in your database instead, and store the files themselves directly on disk, where they belong.
Otherwise your database size - and thus the hardware required to run it - will grow very rapidly, and you will incur large costs that you do not need to incur.
Use FILESTREAM:
https://learn.microsoft.com/en-us/sql/relational-databases/blob/filestream-sql-server
I know that a link-only answer is not an answer, but I can't believe no one has mentioned it yet.
The proper database design pattern is not to save files in the DBMS. You should develop a kind of File Manager Subsystem to manage files across all of your projects.
File Manager Subsystem
This subsystem should be reusable, extendable, secure, and so on. All of your projects that need to save files can use it.
Files can be saved anywhere: a local hard disk, a network drive, external drives, the cloud, and so on. So this subsystem should be designed to support all kinds of requests.
(You can improve the subsystem by adding more features, for example checking for duplicate files.)
The subsystem should generate a unique key for each file: after a file is uploaded and saved, the subsystem returns that key.
You then save this unique key in the database (instead of the file). Whenever you want the file, you read its unique key from the database and ask the subsystem for the file by that key.
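A minimal sketch of the idea (the storage root, table, and column names are placeholders, not a real implementation; a production subsystem would add security, de-duplication, and so on):
import os
import shutil
import uuid

STORAGE_ROOT = r"\\fileserver\documents"    # local disk, network drive, cloud, ...

def save_file(src_path, cursor):
    """Copy a file into managed storage and record its unique key in the DB."""
    key = str(uuid.uuid4())                                  # the unique key
    dest = os.path.join(STORAGE_ROOT, key + os.path.splitext(src_path)[1])
    shutil.copy2(src_path, dest)
    cursor.execute(
        "INSERT INTO Document (file_key, name, stored_path) VALUES (?, ?, ?)",
        (key, os.path.basename(src_path), dest))
    return key

def get_file(key, cursor):
    """Look up the stored path for a unique key."""
    cursor.execute("SELECT stored_path FROM Document WHERE file_key = ?", (key,))
    row = cursor.fetchone()
    return row[0] if row else None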

Reading from database file over network in visual basic, with no shared drives

The machine I'm trying to access is an industrial PC that serves as the interface for some PLC-automated equipment. The computer creates a data record (an .mdb database) of the various machine settings so that they can be reviewed later if desired. I've created an application in Visual Basic 2010 to display and sort that information once it's been copied from the machine to someone's laptop.
What I'd like to do, though, is allow the user to access the database from their laptop over the network; each PC has a static IP address on the customer's LAN. Currently, we can use Teamviewer to transfer files, but I'd like to include that ability in my data viewing application. Without changing any settings on the industrial PC (i.e. leaving the network file sharing alone, and avoiding installing any sort of SQL server software), how can I access this information? The database files can get pretty large (40+ MB) so I'm trying to avoid any sort of FTP transfer, which would undoubtedly take forever.
The database is already set up on an ODBC connection (which is how the pc stores the PLC information in the database), and I suspect that this may be the key, but between my lack of any thorough understanding regarding networking and the internet's general distrust of ODBC (well earned, from what I've seen so far), I'm having a hard time finding any useful information.
If anyone could point me towards some useful tutorials, or give me a good place to start, I'd appreciate it.
I think you are out of luck. Think about it: if you could access the file without sharing it, wouldn't that be a serious security problem?
What you can do: use the administrative share \\host\c$, which always exists, to access the file. For that you'll need administrative privileges on the host.
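For example, with the database reachable through that share, an ODBC connection string can point straight at the UNC path. A hedged sketch, shown in Python with pyodbc just to illustrate the connection string (the same DBQ-style string works from an ODBC/OleDb connection in VB.NET); the host, path, and table name are placeholders:
import pyodbc

# You must already be authenticated to \\host\c$ (e.g. via "net use"),
# and the Access ODBC driver must be installed locally.
conn_str = (
    r"DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};"
    r"DBQ=\\192.168.1.50\c$\MachineData\settings.mdb;"
)
conn = pyodbc.connect(conn_str)
cur = conn.cursor()
cur.execute("SELECT TOP 10 * FROM MachineSettings")   # table name is a guess
for row in cur.fetchall():
    print(row)
conn.close()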
But be aware: Even if this "works", it introduces problems:
It will be slow, because you are essentially copying 40MB over the wire over and over again
If you are not careful, you might lock the *.mdb file while accessing it. Access databases are not really meant to be accessed by multiple users/processes. That didn't prevent Microsoft from trying to fake it: to do so, a *.ldb file (L as in lock) is created to signal to other processes that the file is locked.
TL;DR: Don't do it.

How to maintain lucene indexes in azure cloud-app

I just started playing with the Azure Library for Lucene.NET (http://code.msdn.microsoft.com/AzureDirectory). Until now I was using my own custom code for writing Lucene indexes to the Azure blob: I was copying the blob to the local storage of the Azure web/worker role and reading/writing docs to the index there. I was using my own custom locking mechanism to make sure we don't have clashes between reads and writes to the blob. I am hoping the Azure library will take care of these issues for me.
However, while trying out the test app, I tweaked the code to use the compound-file option, and that created a new file every time I wrote to the index. Now, my question is: if I have to maintain the index - i.e. keep a snapshot of the index file and use it if the main index gets corrupted - how do I go about doing this? Should I keep a backup of all the .cfs files that are created, or is handling only the latest one fine? Are there API calls to clean up the blob to keep only the latest file after each write to the index?
Thanks
Kapil
After I answered this, we ended up changing our search infrastructure and used Windows Azure Drive. We had a worker role which would mount a VHD from blob storage and host the Lucene.NET index on it. The code checked that the VHD was mounted and that the index directory existed. If the worker role fell over, the VHD would automatically dismount after 60 seconds, and a second worker role could pick it up.
We have since changed our infrastructure again and moved to Amazon with a Solr instance for search, but the VHD option worked well during development. It could have worked well in test and production too, but requirements meant we needed to move to EC2.
I am using AzureDirectory for full-text indexing on Azure, and I am getting some odd results too... but hopefully this answer will be of some use to you.
Firstly, the compound-file option: from what I am reading and figuring out, the compound file is a single large file with all the index data inside. The alternative is having lots of smaller files (configured using the SetMaxMergeDocs(int) function of IndexWriter) written to storage. The problem with this is that once you get to lots of files (I foolishly set this to about 5000) it takes an age to download the indexes (on the Azure server it takes about a minute; on my dev box... well, it's been running for 20 minutes now and still hasn't finished...).
As for backing up indexes, I have not come up against this yet, but given that we have about 5 million records currently, and that will grow, I am wondering about this too. If you are using a single compound file, maybe downloading the files to a worker role, zipping them, and uploading them with today's date would work... If you have a smaller set of documents, you might get away with re-indexing the data if something goes wrong... but again, it depends on the number.
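For what it's worth, the zip-with-today's-date idea above is only a few lines once the index files are on local storage. A sketch with placeholder paths (uploading the archive back to blob storage is not shown):
import datetime
import shutil

index_dir = r"C:\LocalStorage\LuceneIndex"          # local copy of the index files
backup_name = "index-backup-" + datetime.date.today().isoformat()
archive_path = shutil.make_archive(backup_name, "zip", index_dir)
print("wrote", archive_path)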

DB (SQL) automated stress/load tools?

I want to measure the performance and scalability of my DB application. I am looking for a tool that would allow me to run many SQL statements against my DB, taking the DB and script (SQL) file as arguments (+necessary details, e.g. host name, port, login...).
Ideally it should let me control parameters such as the number of simulated clients and the duration of the test, and randomize variables or select them from a list (e.g. SELECT ... FROM ... WHERE value = #var, where var is read from the command line or randomized per execution). I would like the test results to be saved as a CSV or XML file that I can analyze and plot. And of course in terms of pricing I prefer "free" or "demo" :-)
Surprisingly (for me at least) while there are dozens of such tools for web application load testing, I couldn't find any for DB testing!? The ones I did see, such as pgbench, use a built-in DB based on some TPC scenario, so they help test the DBMS configuration and H/W but I cannot test MY DB! Any suggestions?
Specifically I use Postgres 8.3 on Linux, though I could use any DB-generic tool that meets these requirements. The H/W has 32GB of RAM while the size of the main tables and indexes is ~120GB. Hence there can be a 1:10 response time ratio between cold vs warm cache runs (I/O vs RAM). Realistically I expect requests to be spread evenly, so it's important for me to test queries against different pieces of the DB.
Feel free to also contact me via email.
Thanks!
-- Shaul Dar (info#shauldar.com)
JMeter from Apache can handle different server types. I use it for load tests against web applications; others on the team use it for DB calls. It can be configured in many ways to get the load you need. It can be run in console mode and even clustered across different client machines to minimize client-side overhead (which would otherwise falsify the results).
It's a Java application and a bit complex at first sight, but we still love it. :-)
k6.io can stress test a few relational databases with the xk6-sql extension.
For reference, a test script could be something like:
import sql from 'k6/x/sql';

const db = sql.open("sqlite3", "./test.db");

export function setup() {
  db.exec(`CREATE TABLE IF NOT EXISTS keyvalues (
    id integer PRIMARY KEY AUTOINCREMENT,
    key varchar NOT NULL,
    value varchar);`);
}

export function teardown() {
  db.close();
}

export default function () {
  db.exec("INSERT INTO keyvalues (key, value) VALUES('plugin-name', 'k6-plugin-sql');");
  let results = sql.query(db, "SELECT * FROM keyvalues;");
  for (const row of results) {
    console.log(`key: ${row.key}, value: ${row.value}`);
  }
}
Read more in this short tutorial.
The SQL Load Generator is another such tool:
http://sqlloadgenerator.codeplex.com/
I like it, but it doesn't yet have the option to save the test setup.
We never really found an adequate solution for stress testing our mainframe DB2 database so we ended up rolling our own. It actually just consists of a bank of 30 PCs running Linux with DB2 Connect installed.
29 of the boxes run a script which simply waits for a starter file to appear on an NFS mount and then starts executing fixed queries based on the data. The fact that these queries (and the data in the database) are fixed means we can easily compare against previous successful runs.
The 30th box runs two scripts in succession (the second is the same as the one on all the other boxes). The first empties and then repopulates the database tables with our known data, and then creates the starter file to allow all the other machines (and itself) to continue.
This is all done with bash and DB2 Connect, so it is fairly easy to maintain (and free).
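To give an idea of the shape of those worker scripts, here is an illustrative sketch (not the actual bash + DB2 Connect scripts; the NFS path, DSN, and queries are placeholders):
import csv
import os
import time
import pyodbc

STARTER_FILE = "/mnt/nfs/loadtest/start.flag"        # created by the 30th box
QUERIES = [
    "SELECT COUNT(*) FROM ORDERS",
    "SELECT * FROM CUSTOMERS WHERE REGION = 'EU'",
]

# Wait for the controller box to finish repopulating the tables.
while not os.path.exists(STARTER_FILE):
    time.sleep(5)

conn = pyodbc.connect("DSN=DB2TEST;UID=loaduser;PWD=secret")
cur = conn.cursor()

with open("results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["query", "seconds", "rows"])
    for q in QUERIES:
        start = time.time()
        rows = cur.execute(q).fetchall()
        writer.writerow([q, round(time.time() - start, 3), len(rows)])

conn.close()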
We also have another variant to do random queries based on analysis of production information collected over many months. It's harder to check the output against a known successful baseline but, in that circumstance, we're only looking for functional and performance problems (so we check for errors and queries that take too long).
We're currently examining whether we can consolidate all those physical servers into virtual machines, on both the mainframe running zLinux (which will use the shared-memory HyperSockets for TCP/IP, basically removing the network delays) and Intel platforms with VMWare, to free up some of that hardware.
It's an option you should examine if you don't mind a little bit of work up front since it gives you a great deal of control down the track.
Did you check Bristlecone, an open-source tool from Continuent? I don't use it myself, but it works for Postgres and seems to be able to do the things you request. (Sorry, as a new user I cannot give you a direct link to the tool page, but Google will get you there ;o])

Open, Read, Write Files on Network Attached Storage via VBScript

I have thousands of small CSV files I want to aggregate (with a little munging in-script first). They are on a NAS device, a "SNAP" server to be more exact. I've had some success with VBA from Excel - doing about 700 files in about a minute, if I recall (that was a month ago). Actually, it was a half-success: the SNAP server is home to 80% PDFs and some proprietary-format files and only 20% CSVs. The loop to test for file type took the execution time north of 2 hours, and the script apparently completely ignored the date filtering I put in. The quick result, or 'success', was on 700 copies of the CSVs I had made and put on my C: drive. I've been doing VBA scripting for almost 20 years, and I think I'm decent at it; I've done a lot of CSV reading and writing from VBA over the last 9 years. So my question is more about your experience with SNAP servers, or NAS generally.
Can I not treat the SNAP server more or less like any drive/folder with VBA?
Would VBScript be more appropriate (I'm already using FileSystemObject, after all)?
If I can use VBS, can I store the script on the NAS and run it using Task Scheduler?
I'd appreciate any tips or gotchas from you folks who have experience with SNAP servers!
Some thoughts on the choice of language:
VBScript is more lightweight than VBA in that it does not require MS Office to be installed. The syntax is similar, so there is no real productivity difference.
Moving forward, PowerShell would be strongly recommended for Windows system admin tasks, general text file processing, etc.
Some thoughts on using the NAS server:
a) If running your script on a workstation, you should be able to use a UNC path (\\myserver\myshare) to connect to a share on the NAS. If not, you may need to map a drive letter to that share before your script runs (a sketch of reading CSVs over such a path follows below).
b) If you want to run your script on the NAS itself, there are two things to consider: whether the NAS OS is locked down so that you cannot add your own scheduled tasks, and whether it runs Linux or some flavor of Windows. Many NAS products use embedded Linux, so running a VBA or VBScript solution directly on the NAS may not work unless it is based on something like Embedded XP and you have access to Scheduled Tasks, etc.
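To illustrate point a), a minimal sketch of the CSV aggregation against a UNC path, with the file-type and date filtering from the question (shown in Python for brevity; the same logic maps onto FileSystemObject in VBScript, and the share path, date cutoff, and output file are placeholders):
import csv
import datetime
import os

SHARE = r"\\myserver\myshare"                 # the NAS share
CUTOFF = datetime.datetime(2011, 1, 1)        # only files modified after this date
OUTPUT = r"C:\temp\aggregated.csv"

with open(OUTPUT, "w", newline="") as out:
    writer = csv.writer(out)
    for name in os.listdir(SHARE):
        if not name.lower().endswith(".csv"):
            continue                          # skip the PDFs and proprietary files
        path = os.path.join(SHARE, name)
        if datetime.datetime.fromtimestamp(os.path.getmtime(path)) < CUTOFF:
            continue                          # date filter
        with open(path, newline="") as src:
            for row in csv.reader(src):
                writer.writerow([name] + row) # keep the source file name per row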
Hope this helps...