I'm project managing a development that's pulling data from all kinds of data sources (SQL, MySQL, FileMaker, Excel) before loading it into a new database structure, with records going back 10 years. Obviously I need to clean all this before exporting, and I am wondering if there are any apps that can simplify this process for me, or any guides that I can follow.
Any help would be great
I do this all the time and, like Tom, do it in SQL Server using DTS or SSIS, depending on the version of the final database.
Some things I strongly recommend:
Archive all files received before you process them, especially if you are getting this data from outside sources; you may have to research old imports and go back to the raw data. After the archive is successful, copy the file to the processing location.
For large files especially, it is helpful to get some sort of flag file that is only copied after the other file is complete, or even better, one which contains the number of records in the file. This can help prevent problems from corrupted or incomplete files.
Keep a log of the number of records, and start failing your jobs if the file size or number of records is suspect. Put in a method to process anyway if you find the change is correct. Sometimes they really did mean to cut the file in half, but most of the time they didn't.
If possible get column headers in the file. You would be amazed at how often data sources change the columns, column names or order of the columns without advance warning and break imports. It is easier to check this before processing data if you have column headers.
Never import directly to a production table. Always better to use a staging table where you can check and clean data before putting it into prod.
Log each step of your process, so you can easily find what caused a failure.
If you are cleaning lots of files, consider creating functions to do specific types of cleaning (phone number formatting, for instance); then you can use the same function in multiple imports (a minimal sketch follows at the end of this answer).
Excel files are evil. Look for places where leading zeros have been stripped in the import process.
I write my processes so I can run them as a test with a rollback at the end. Much better to do this than realize your dev data is so hopelessly messed up that you can't even do a valid test to be sure everything can be moved to prod.
Never do a new import on prod without doing it on dev first. Eyeball the records directly when you are starting a new import (not all of them if it is a large file, of course, but a good sampling). If you think you should get 20 columns and it imports the first time as 21 columns, look at the records in that last column; many times that means the tab-delimited file had a tab somewhere in the data and the column data is off for that record.
Don't assume the data is correct, check it first. I've had first names in the last name column, phones in the zip code column etc.
Check for invalid characters, string data where there should just be numbers etc.
Any time it is possible, get the identifier from the people providing the data. Put this in a table that links to your identifier. This will save you from much duplication of records because the last name changed or the address changed.
There's lots more but this should get you started on thinking about building processes to protect your company's data by not importing bad stuff.
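As an illustration of the reusable cleaning function mentioned above, here is a minimal sketch in Python (the function name and rules are my own invention, and your ETL tool may call for a T-SQL function instead):

    import re

    def clean_phone(raw):
        """Normalize a US-style phone number to digits only, or return None if it is unusable."""
        if raw is None:
            return None
        digits = re.sub(r"\D", "", str(raw))   # strip everything that is not a digit
        if len(digits) == 11 and digits.startswith("1"):
            digits = digits[1:]                # drop a leading country code
        return digits if len(digits) == 10 else None

    # The same function can then be reused across imports:
    print(clean_phone("(555) 123-4567"))   # -> 5551234567
    print(clean_phone("1-555-123-4567"))   # -> 5551234567
    print(clean_phone("N/A"))              # -> None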
I work mostly with Microsoft SQL Server, so that's where my expertise is, but SSIS can connect to a pretty big variety of data sources and is very good for ETL work. You can use it even if none of your data sources are actually MS SQL Server. That said, if you're not using MS SQL Server there is probably something out there that's better for this.
To provide a really good answer one would need to have a complete list of your data sources and destination(s) as well as any special tasks which you might need to complete along with any requirements for running the conversion (is it a one-time deal or do you need to be able to schedule it?)
Not sure about tools, but you're going to have to deal with:
synchronizing generated keys
synchronizing/normalizing data formats (e.g. different date formats)
synchronizing record structures.
orphan records
If the data is running/being updated while you're developing this process or moving data, you're also going to need to capture the updates. When I've had to do this sort of thing in the past, the best (though not so great) answer I had was to develop a set of scripts that ran in multiple iterations, so that I could develop and test the process iteratively before I moved any of the data. I found it helpful to have a script (I used a schema and an ant script, but it could be anything) that could clean/rebuild the destination database. It's also likely that you'll need to have some way of recording dirty/mismatched data.
In similar situations I personally have found Emacs and Python mighty useful, but, I guess, any text editor with good searching capabilities and a language with powerful string manipulation features should do the job. I first convert the data into flat text files and then:
Eyeball either the whole data set or a representative true random sample of the data.
Based on that, make conjectures about different columns ("doesn't allow nulls", "contains only values 'Y' and 'N'", "'start date' always precedes 'end date'", etc.).
Write scripts to check the conjectures.
Obviously this kind of method tends to focus on one table at a time and therefore only complements the checks made after uploading the data into a relational database.
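A minimal sketch of what such a conjecture-checking script might look like (the file name and column positions are assumptions for illustration):

    import csv
    from datetime import datetime

    def check_conjectures(path):
        """Check a tab-delimited extract against a few hand-written conjectures."""
        problems = []
        with open(path, newline="") as f:
            for lineno, row in enumerate(csv.reader(f, delimiter="\t"), start=1):
                if len(row) < 7:
                    problems.append((lineno, "too few columns"))
                    continue
                flag, start, end = row[3], row[5], row[6]   # assumed column positions
                if not row[0]:
                    problems.append((lineno, "id column is empty"))
                if flag not in ("Y", "N"):
                    problems.append((lineno, f"unexpected flag value {flag!r}"))
                if datetime.strptime(start, "%Y-%m-%d") > datetime.strptime(end, "%Y-%m-%d"):
                    problems.append((lineno, "start date is after end date"))
        return problems

    for lineno, msg in check_conjectures("customers.tsv"):
        print(f"line {lineno}: {msg}")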
One trick that comes in useful for me with this is to find a way for each type of data source to output a single column plus unique identifier at a time, in tab-delimited form say, so that you can clean it up using text tools (sed, awk, or TextMate's grep search), and then re-import it / update the (copy of!) original source.
It then becomes much quicker to clean up multiple sources, as you can re-use tools across them (e.g. capitalising last names - McKay, O'Leary, O'Neil, Da Silva, Von Braun, etc. - fixing date formats, trimming whitespace) and to some extent automate the process (depending on the source).
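As a rough Python stand-in for the sed/awk step, the round trip on one column might look like this (file names and the capitalisation rule are simplified for illustration):

    import csv

    def fix_surname(name):
        """Very rough surname capitalisation -- real rules (McKay, von Braun, ...) need a lookup table."""
        return " ".join(part[:1].upper() + part[1:] for part in name.strip().split())

    # Read "id<TAB>last_name", clean the single column, write it back out for re-import.
    with open("lastname_extract.tsv", newline="") as src, \
         open("lastname_clean.tsv", "w", newline="") as dst:
        writer = csv.writer(dst, delimiter="\t")
        for row_id, surname in csv.reader(src, delimiter="\t"):
            writer.writerow([row_id, fix_surname(surname)])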
I am working on an Azure Data Factory solution to solve the following scenario:
Data files in CSV format are dumped into Data Lake Gen 2 paths. There are two varieties of files, let's call them TypeA and TypeB and each is dumped into a path reflecting a grouping of sensors and the date.
For example:
/mycontainer/csv/Group1-20210729-1130/TypeA.csv
/mycontainer/csv/Group1-20210729-1130/TypeB.csv
/mycontainer/csv/Group1-20210729-1138/TypeA.csv
/mycontainer/csv/Group1-20210729-1138/TypeB.csv
I need to extract data from TypeA files in Delta format into a different location on Data Lake Gen 2 storage. I'll need to do similar processing for TypeB files but they'll have a different format.
I have successfully put together a "Data Flow" which, given a specific blob path, accomplishes step 2. But I am struggling to put together a pipeline which applies this for each file which comes in.
My first thought was to do this based on a storage event trigger, whereby each time a CSV file appeared the pipeline would be run to process that one file. I was almost able to accomplish this using a combination of fileName and folderPath parameters and wildcards. I even had a pipeline which will work when triggered manually (meaning I entered a specific fileName and folderPath value manually). However I had two problems which made me question whether this was the correct approach:
a) I wasn't able to get it to work when triggered by real storage events, I suspect because my combination of parameters and wildcards was ending up including the container name twice in the path it was generating. It's hard to check this because the error message you get doesn't tell you what the various values actually resolve to (!).
b) The cluster that is needed to extract the CSV into parquet Delta and put the results into Data Lake takes several minutes to spin up - not great if working at the file level. (I realize I can mitigate this somewhat - at a cost - by setting a TTL on the cluster.)
So then I abandoned this approach and tried to set up a pipeline which will be triggered periodically, and will pick up all the CSV files matching a particular pattern (e.g. /mycontainer/csv/*/TypeA.csv), process them as a batch, then delete them. At this point I was very surprised to find out that the "Delimited Text" dataset doesn't seem to support wildcards, which is what I was kind of relying on to achieve this in a simple way.
So my questions are:
Am I broadly on the right track with my 'batch of files' approach? Is there a way to define a delimited text data source which reads its data from multiple blobs?
Or do I need a more 'iterative' approach using maybe a 'Foreach' step? I'm really really hoping this isn't the case as it seems an odd pattern to be adopting in 2021.
A much wider question: is ADF a suitable tool for this kind of scenario? I was excited about using it at first, but increasingly it feels like one of those 'exciting to demo but hard to actually use' things which so often pop up in the low/no code space. Are there popular alternatives which will work nicely with Azure storage?
Any pointers very much appreciated.
I believe you're very much on the right track.
Last week I was able to get wildcard CSVs to be imported if the wildcard is in the CSV name. Maybe create an intermediate step to put all the TypeA files in the same folder?
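If it helps, here is a rough sketch of what such an intermediate step could look like outside ADF, using the azure-storage-blob Python SDK (the container and path layout are the ones from the question; the connection string and staging path are assumptions, and for a private container the source URL would need a SAS token):

    from azure.storage.blob import BlobServiceClient

    service = BlobServiceClient.from_connection_string("<connection-string>")
    container = service.get_container_client("mycontainer")

    # Copy every .../TypeA.csv under csv/ into a single staging folder so the
    # Data Flow can read the whole batch from one path.
    for blob in container.list_blobs(name_starts_with="csv/"):
        if blob.name.endswith("/TypeA.csv"):
            group_folder = blob.name.split("/")[1]            # e.g. Group1-20210729-1130
            target = container.get_blob_client(f"staging/TypeA/{group_folder}.csv")
            target.start_copy_from_url(f"{container.url}/{blob.name}")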
Concerning ADF - it's a cool technology, with a steep learning curve (and a lot of updates - incl. breaking changes sometimes) if you're looking to get data ingested without too much coding. Some drawbacks:
Monitoring - if you want to have it cheaper, there's a lot of hacking (e.g. mailing via Logic Apps)
Debugging - as you've noticed, debug messages are often cryptic or insufficient
Multiple monthly updates make it feel like a beta. Indeed, often there are straightforward tasks that are quite difficult to achieve.
Good luck ;)
I am tasked to create a template that will be filled in by business users with employee information; then our program will load this into the database using external tables.
However, our Business Users constantly change the template by adding, removing or reordering fields.
I am convinced I should use XLSX instead of CSV so that I can lock the column headers, so they cannot remove, add or reorder the columns.
However, when I query the external table, it shows non-ASCII characters when reading the XLSX, because it is a binary format.
How can I do either of the following?
Effectively Read Excel Files from External Tables
Lock the Headers of CSV Files?
What you have here is a political problem, but you are looking for a technical fix. Not a good fit.
The problem comes in two halves:
Somebody decided it was a good idea to collect user input in a spreadsheet, which it is generally not.
Users are fiddling with the input format, which they should not.
Fixes are:
Strictly enforce the data structure. Reject any CSV which doesn't match and make the users edit them. They will quickly tire of tweaking the spreadsheets when they realise they're just creating more work for themselves. But they will also get resentful, so consider ...
Building a data input screen. It's pretty simple to knock up a spreadsheet-like grid UI. You don't need anything complicated in Java: Oracle's Apex is intended for exactly this sort of thing. Find out more.
However, if you are stuck with Excel as a UI I suggest you have a look at Anton Scheffer's excellent PLSQL as_read_xlsx package on the Amis site. Check it out. You'll probably need to replace your external table with a view over a table (perhaps pipelined) function.
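For the "strictly enforce the data structure" option, the pre-load check can be very small. A Python sketch (the expected header list and file name are made up for illustration):

    import csv
    import sys

    EXPECTED_HEADER = ["employee_id", "first_name", "last_name", "hire_date"]  # assumed columns

    def validate_header(path):
        """Reject a CSV whose header row does not match the agreed template exactly."""
        with open(path, newline="") as f:
            header = next(csv.reader(f))
        if header != EXPECTED_HEADER:
            sys.exit(f"{path}: header {header} does not match template {EXPECTED_HEADER}")

    validate_header("employee_upload.csv")
    print("header OK, safe to load via the external table")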
Most of my day is spent writing SQL queries to perform small tasks, mainly to get information from the database and manipulate it somehow for data visualization and building reports for others.
At the end of the day I try to have a nice folder scheme to help me reuse code and so on, but it's becoming harder to handle so many files and keep track of everything I've done so far.
I don't want to have huge SQL files, because I might want to reuse only parts of them. In the end it's hard to avoid a war zone on my desktop and in these folders. It's also a mess to handle so many folders/codes.
For version control we're using a Git server, but there is plenty of code that is not in production that we would like to keep track of and reuse somehow.
We're using IPython Notebook, RStudio and SSMS to build our code; I wonder if there are more efficient ways to work.
There must be an efficient way to work out there. What do you use to keep track of your (SQL) code? And, more importantly, how do you reuse it?
Thanks in advance,
Rafael
I just use a folder system. I keep the shell scripts, so to speak, as the first files (like the generic code to do X), and the specific code, where I take X and apply dates and other conditions, in the bottom half of the folder.
What do you use to keep track of your (SQL) codes? and more importantly reuse it.
For ease of reuse, I have all my running SQL code backed up on an SQL server through routine INFORMATION_SCHEMA dumps. For all development code that I need to reuse with others, I have a Git server that gets automatic updates throughout the day. For reuse on my laptop itself, I have a local backup through Time Machine.
As for directory or folder structure, all code starts as project based and eventually I migrate the best and most useful code to a personal folder structure that is topic based (date arithmetic, indexing, etc.). No matter how they are stored, all these folders are indexed using local and remote indexing features so I can search and retrieve them with just a few keystrokes when needed. Ultimately what's needed for optimum reuse is ease of retrieval. The quicker I can retrieve, the more reuse I get.
Lastly, it's not just SQL code, but all the supporting documents that led to that code solution. Sometimes this collection may include code from other languages, code from other servers, emails, text documents, images, workflows, etc. Keeping them all together enhances the value of reuse.
This question is in continuation to my previous question related to File I/O.
I am using RFile to open a file and read/write data to it. Now, my requirement is such that I would have to modify certain fields within the file. I separate each field within a record with a colon and each record with a newline. Sample is below:
abc#def.com:Albert:1:2
def#ghi.com:Alice:3:1
Suppose I want to replace the '3' in the second record with '2'. I am finding it difficult to overwrite a specific field in the file using RFile, because RFile does not provide its users with such a facility.
Due to this, to modify a record I have to delete the contents of the file and serialize (that is, loop through the in-memory representation of the records and write them to the file). Doing this every time a record's value changes is quite expensive, as there are hundreds of records and the changes can be quite frequent.
I searched around for alternatives and found CPermanentFileStore. But I feel the API is hard to use as I am not able to find any source on the Internet that demonstrates its use.
Is there a way around this? Please help.
Depending on which version(s) of Symbian OS you are targeting, you could store the information in a relational database. Since v9.4, Symbian OS includes an SQL implementation (based on the open source SQLite engine).
Using normal files for this type of records takes a lot of effort no matter the operating system. To be able to do this efficiently you need to reserve space in the file for expansion of each record - otherwise you need to rewrite the entire file if a record value changes from say 9 to 10. Also storing a lookup table in the file will make it possible to jump directly to a record using RFile::Seek.
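To illustrate the fixed-size-record idea (shown here in Python rather than Symbian C++; the field widths and file name are arbitrary), overwriting one field becomes a simple seek and write:

    RECORD_SIZE = 64          # every record padded to a fixed length
    FIELD_OFFSET = 48         # byte offset of the field we want to change
    FIELD_SIZE = 4

    def update_field(path, record_index, new_value):
        """Overwrite one fixed-width field in place instead of rewriting the whole file."""
        with open(path, "r+b") as f:
            f.seek(record_index * RECORD_SIZE + FIELD_OFFSET)
            f.write(str(new_value).ljust(FIELD_SIZE).encode("ascii"))

    # e.g. change the '3' in the second record to '2'
    update_field("records.dat", 1, 2)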
The CPermanentFileStore simplifies the actual reading and writing of the file, but basically does what you would have to do yourself otherwise. A database may be a better choice in this instance. If you don't want to use a database, I think using stores would be a better solution.
I am interested in learning how a database engine works (i.e. the internals of it). I know most of the basic data structures taught in CS (trees, hash tables, lists, etc.) as well as a pretty good understanding of compiler theory (and have implemented a very simple interpreter) but I don't understand how to go about writing a database engine. I have searched for tutorials on the subject and I couldn't find any, so I am hoping someone else can point me in the right direction. Basically, I would like information on the following:
How the data is stored internally (i.e. how tables are represented, etc.)
How the engine finds data that it needs (e.g. run a SELECT query)
How data is inserted in a way that is fast and efficient
And any other topics that may be relevant to this. It doesn't have to be an on-disk database - even an in-memory database is fine (if it is easier) because I just want to learn the principles behind it.
Many thanks for your help.
If you're good at reading code, studying SQLite will teach you a whole boatload about database design. It's small, so it's easier to wrap your head around. But it's also professionally written.
SQLite 2.5.0 for Code Reading
http://sqlite.org/
The answer to this question is a huge one; expect a PhD thesis to answer it 100% ;)
But we can think of the problems one by one:
How to store the data internally:
You should have a data file containing your database objects, plus a caching mechanism to load the data in focus, and some data around it, into RAM.
Assume you have a table with some data. We would create a data format to convert this table into a binary file, by agreeing on the definition of a column delimiter and a row delimiter, and making sure such a delimiter pattern is never used in your data itself. I.e. if you have selected <*>, for example, to separate columns, you should validate that the data you are placing in this table does not contain this pattern. You could also use a row header and a column header, by specifying the size of the row and some internal indexing number to speed up your search, and at the start of each column have the length of that column.
like "Adam", 1, 11.1, "123 ABC Street POBox 456"
you can have it like
<&RowHeader, 1><&Col1,CHR, 4>Adam<&Col2, num,1,0>1<&Col3, Num,2,1>111<&Col4, CHR, 24>123 ABC Street POBox 456<&RowTrailer>
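A toy Python version of that idea, length-prefixing each column so a reader can skip over it without scanning for delimiters (the header layout is simplified compared to the example above):

    def encode_row(values):
        """Encode one row as: <2-byte column count> then, per column, <2-byte length><utf-8 bytes>."""
        parts = [len(values).to_bytes(2, "big")]
        for value in values:
            data = str(value).encode("utf-8")
            parts.append(len(data).to_bytes(2, "big") + data)
        return b"".join(parts)

    def decode_row(buf):
        """Reverse of encode_row: read the count, then each length-prefixed column."""
        count = int.from_bytes(buf[:2], "big")
        pos, values = 2, []
        for _ in range(count):
            length = int.from_bytes(buf[pos:pos + 2], "big")
            values.append(buf[pos + 2:pos + 2 + length].decode("utf-8"))
            pos += 2 + length
        return values

    row = encode_row(["Adam", 1, 11.1, "123 ABC Street POBox 456"])
    print(decode_row(row))   # -> ['Adam', '1', '11.1', '123 ABC Street POBox 456']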
How to find items quickly
Try using hashing and indexing to point at data stored and cached, based on different criteria.
Taking the same example above, you could sort the values of the first column and store them in a separate object pointing at the row IDs of the items, sorted alphabetically, and so on.
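A sketch of that sorted-column index in Python: a separate structure mapping values of the first column to row IDs, so lookups do not have to scan every row (the sample rows are made up):

    import bisect

    # rows as they sit in the data file, keyed by row id
    rows = {1: ("Adam", 1), 2: ("Zoe", 2), 3: ("Bob", 3)}

    # build a sorted (value, row_id) list over the first column
    index = sorted((name, row_id) for row_id, (name, _) in rows.items())

    def find(name):
        """Binary-search the index instead of scanning the table."""
        i = bisect.bisect_left(index, (name, 0))
        return index[i][1] if i < len(index) and index[i][0] == name else None

    print(find("Bob"))   # -> 3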
How to speed up inserting data
What I know from Oracle is that it inserts data in a temporary place, both in RAM and on disk, and does housekeeping on a periodic basis. The database engine is busy all the time optimizing its structure, but at the same time we do not want to lose data in case of a power failure or something like that.
So try to keep data in this temporary place with no sorting, append it to your original storage, and later on, when the system is free, re-sort your indexes and clear the temp area when done.
Good luck, great project.
There are books on the topic; a good place to start would be Database Systems: The Complete Book by Garcia-Molina, Ullman, and Widom.
SQLite was mentioned before, but I want to add something.
I personally learned a lot by studying SQLite. The interesting thing is that I did not go to the source code (though I just had a short look). I learned much by reading the technical material and especially by looking at the internal commands it generates. It has its own stack-based interpreter inside, and you can read the P-Code it generates internally just by using explain. Thus you can see how various constructs are translated to the low-level engine (which is surprisingly simple -- but that is also the secret of its stability and efficiency).
I would suggest focusing on www.sqlite.org
It's recent, small (source code 1MB), open source (so you can figure it out for yourself)...
Books have been written about how it is implemented:
http://www.sqlite.org/books.html
It runs on a variety of operating systems for both desktop computers and mobile phones so experimenting is easy and learning about it will be useful right now and in the future.
It even has a decent community here: https://stackoverflow.com/questions/tagged/sqlite
Okay, I have found a site which has some information on SQL and implementation - it is a bit hard to link to the page which lists all the tutorials, so I will link them one by one:
http://c2.com/cgi/wiki?CategoryPattern
http://c2.com/cgi/wiki?SliceResultVertically
http://c2.com/cgi/wiki?SqlMyopia
http://c2.com/cgi/wiki?SqlPattern
http://c2.com/cgi/wiki?StructuredQueryLanguage
http://c2.com/cgi/wiki?TemplateTables
http://c2.com/cgi/wiki?ThinkSqlAsConstraintSatisfaction
Maybe you can learn from HSQLDB. I think it offers a small and simple database for learning. You can look at the code since it is open source.
If MySQL interests you, I would also suggest this wiki page, which has got some information about how MySQL works. Also, you might want to take a look at Understanding MySQL Internals.
You might also consider looking at a non-SQL interface for your database engine. Please take a look at Apache CouchDB. It's what you would call a document-oriented database system.
Good Luck!
I am not sure whether it fits your requirements, but I have implemented a simple file-oriented database with support for simple operations (SELECT, INSERT, UPDATE) using Perl.
What I did was store each table as a file on disk, with entries following a well-defined pattern, and manipulate the data using built-in Linux tools like awk and sed. To improve efficiency, frequently accessed data was cached.