sql search for data format - sql

Is it possible to search an SQL database by an entry's format?
I'm specifically trying to find instances in a database where the structure of an entry is A##.# (Any letter, followed by 2 numbers, a decimal, then a final number)
And in case it could cause complications or may need separate attention, would the same code be able to find instances where there is no decimal, just A##?

Related

data sanitization/clean-up

Just wondering…
We have a table where the data in certain fields is alphanumeric, comprising a 1-2 digit alpha followed by a 1-2 digit number e.g. x2, x53, yz1, yz95
The number of letters added before the number can be determined by the field so that certain fields will always have the same 1 letter added before the number while others will always have the same 2 letters.
For each field, the actual letters and number of letters added (1 or 2) are always the same, thus, we can always tell which letters appear before the numbers just via the field names.
For the purposes of all downstream data analysis, it is only ever the numeric value from the string which is important.
Sql queries are constructed dynamically behind a user form where the final sql can take many forms depending on which selections and switches the user has chosen. With this, the VBA generating the sql constructs is fairly involved, containing many conditions/variable pathways to the final sql construct.
With this, it would make the VBA and sql much easier to write, read, debug, and perhaps increase the sql execution speed, etc. – if we were only dealing with a numeric datatype e.g. I wouldn’t need to accommodate the many apostrophes within the numerous lines of “strSQL = strSQL & …”
Given that the data itself being analysed is a copy that’s imported via regular .csv extracts from a live source, would it be acceptable to pre sanitize/clean-up these fields around the import stage by converting the data within to numeric values and field datatypes?
- perhaps either by modifying the sql used to generate the extract or by modifying the schema/vba process used to import the extract into the analysis table e.g. using something like a Replace function such as “ = Replace(OriginalField,”yz”,””) “ to strip out the yz characters.
Yes, link the csv "as is", and for each linked table create a straight select query that does the sanitization, like:
Select
Val(Mid([Field1], 2)) As NumField1,
Val(Mid([Field2], 1)) As NumField2,
etc.
Val(Mid([FieldN], 2)) As NumFieldN
From
YourLinkedCsvTable
then use this query throughout your application when you need the data.

Yahoo finance API field misalignment

I'm trying to use the Yahoo Finance API to create a custom csv but depending upon the stock there is field misalignment.
For instance, if I just want to download the "k3" field for yahoo which corresponds to last trade size, I would craft the url like so:
http://finance.yahoo.com/d/quotes.csv?s=yhoo&f=k3
However, if you download that csv there are two columns of data.
Similarly, if I decide to get Float Shares , I want the url:
http://finance.yahoo.com/d/quotes.csv?s=yhoo&f=f6
However that gives me 3 columns. Is there a way to get it in exactly one column? I want to automate this process but the different column orientations make it very difficult as different rows then have different column lengths and I am unable to easily match up the column name with the row.
Bonus: If someone can explain where the 3 float share numbers come from that would be great, I seem to only be able to match up the first to the site...
Thank you for your help!
In short, you are describing known bugs that Yahoo isn't going to fix as the feed is officially unsupported.
Specifically re. Float (f6): the number returned is the entire float. It is not 3 csv numbers. The commas are not delimiters; rather, they are 1,000s separators. (I suspect the same is the case with K3. As it is with a couple of other known numbers. (See link below.))
Two solutions:
(1) Write your own workaround using conditional statements (if or case) in your code.
(2) Download the buggy parameters separately from the clean ones.
See: Yahoo's official reply to your question.
The multiple columns is because that excel (or whatever csv viewer you are using) treats "thousand-seperator" as the the "comma-seperator". We used to have this problem in our school project, and found a hack which is good only if you are using this api for some hobby project and not concerning data usage.
The idea is instead of treating the results as a csv, pick a static column (column A) where you will know the value beforehand (e.g. column 's' stock symbol) or put this value as the first column. When constructing the query, use this column to surround those columns (float columns) with formatting problem. once you get the quotes.csv, manual seperate the results on the column A value.
for example using
http://download.finance.yahoo.com/d/quotes.csv?s=yhoo&f=sf6sa5sb6
will get you
"YHOO", 887,675,000,"YHOO",400,"YHOO",N/A
Then use ,"YHOO", to seperate the results (excluding first column).
Not an elegent way to solve the problem, but at least it gives you the correct result.

Storing multiple values in one column vs. Multiple columns for multiple values

With regards to Performance & 'The proper way of doing things' and trying to figure out which way is better to store configuration data in a SQL database. Assume you have website configuration data for setting a minimum and maximum age that a person must be to access the site.
CREATE TABLE SiteConfig
(
featureName varchar(100),
value varchar(100),
)
Which is better:
To store it all in a single row and process it in PHP with explode();.
featureName: "ageRequirement"
value: "13|60"
To store it in separate rows for each feature and just SELECT the feature you need when needed.
featureName: "minAge"
value: "13"
featureName: "maxAge"
value: "60"
Due to the amount of features of the website, this is a difference of having 60 ROWS of data vs. about 25.
Store it has two separate features.
You may want to access the feature in the database, rather than in the application code.
There is no reason to parse a comma delimited string to get two features.
Admittedly, in the case of just two numbers separated by a comma, not much can go wrong. This is, however, a slippery slope and you might soon find yourself trying to stuff multiple fields into a single string, and then devoting a lot of resources to parsing the string.
If those are values that you never will use by themselves in the database, either would work.
If you suspect that you would ever want to use the value in the database, you would not want them in the same value. That would make querying the data complicated and inefficient.

Generating Record Layouts for EBCDIC Data Files.

We are attempting to write a tool in Perl which is expected to parse a fixed length EBCDIC data file and generate the record layout by looking at the hex value of each byte in the record.
It is assumed that each data file, which is written by a Cobol program whose source code we do not have, can have multiple record layouts. The aim of this tool is to perform data migration (EBCDIC to ASCII) by generating layout which would then be fed to a converter.
The problem is that there are hundreds of permutations and combinations that may arise with each byte. I thought that comparing the hex value of the corresponding byte in the record below the current one might give us some clue as to what this might be. But even in this case there is no concrete solution that one might arrive at. Decisions need to be taken at every juncture which might affect the end result.
Could someone please let me know for any said patterns that I can look for? For example, for all COMP-3s each nibble can possibly represent a value from 0-9 and hence the hex value of the byte might be something like, [0-9][0-9]. Essentially for data migration one need not bother about COMPs and COMP-3s as their value would not be affected in the migration. But identifying what is the DISPLAY fields are is also turning out to be a huge task. Can someone throw some ideas or point me in some direction that I can further explore?
Any help would be highly appreciated. I am really stuck in a mire here.
Thanks,
Aditya.
There are many enterprise transformation tools that will do exactly what you need. Alternatively, it is easy to parse the ADATA records from the compiled copybooks to get the exact byte positions and representations of every field.
Can I hazard a guess? Do you have nobody skilled in Cobol? It isn't that hard to process Cobol copybooks, certainly not as hard as it is to use a write only language like Perl.
Do you have syncsort or DFsort available? It will do what you ask with a simple config file...
I guess you have to go with probabilities, and hope the data is varied enough to get a lot out of that.
Any field that only contains EBCDIC values of alpha-numeric plus punctuation
Numeric DISPLAY fields will be the easiest, containing just EBCDIC 0-9. Note that if signed then the first number will be changed to a letter, like A is -1 I think.
Pretty random distribution of values, leading with hex 0's, will likely be binary numeric "COMP" fields.
COMP-3 fields are one decimal digit in each hex digit of data. So if all the hex digits happen to be 0-9, that's a strong sign of a comp-3 field. Except the last hex digit of the field, which will contain a C for positive, D for negative, and F for unsigned.
Some programs use spaces on numeric fields, so if a field contains all sorts of binary, and also hex 40 (spaces), it's probably best to toss the hex 40 out of the mix. It might tell you a group of bytes is one field if they are all spaces together, or all data together.
As for multiple layouts, that's tough. A common convention for records that can have multiple layouts is to have a limited set of values for "what type of data is this" near the front of the record. Like significantID, recordType, data. So the significantID should increase steadily, while the recordType fields will vary between just a few values and re-cycle.
The FileWizard in RecordEditor / JRecord can search for Mainframe Cobol fields in a Files. The FileWizard results can be stored in a Xml file for use in other languages or you can use the copy Function to copy from Ebcdic to either Ascii fixed or CSV formats.
There is some out of date documentation on the File Wizard

Struggling with a MySQL database of phone numbers

My application wants to store a list of international phone number in a mysql database. Then the application would need to query the database and search for a specific number. Sounds simple but it's actually a huge problem.
Because users can search for that number in a different format we'll have to do a full scan to the database each time.
For example. We might have the number 17162225555 stored in the database (along with another 5 million entries). Now the user comes along and attempt to search using 7162225555. Another user might try to serach with 2225555. etc etc. So in other words, the database have to issue the SQL query using a "like %number%" which would result in a full scan.
How should we design this application? Is there some way to tweat the Mysql to handle this better? Or should we not use SQL at all?
PS. We have millions of entries, and 10s of these search request per second.
This is very weird, I've struggled with this issue myself many times, over the last 15 years and generally come up with structures that separate area codes, country codes and number into separate fields etc. But whilst reading your question another solution just popped into my head, it does require a separate field though so may not be appropriate for you.
You could have a separate field called reverse_phone_number, have this automatically populated by the DB engine then when people search simply reverse the search string and use the indexed reverse field with just a % at the end of the like string, thereby allowing the use of the index.
Dependant on your DB engine you may be able to create an index based on a user-defined function that does the reverse for you obviating the need for an additional field.
In some countries, e.g. the UK, you may have an issue with leading zeros. A UK phone number is represented as (area code)(Phone Number) e.g. 01634 511098, when this is internationalised the leading zero of the area code is removed and the international dial code (+ or 00) and the country code (44) are added. This results in an international phone number of +441634511098. Any user searching for 0163451109 would not find the phone number if it was entered in internationalised format. You can overcome this issue by removing leading zeros from the search string.
EDIT
Based on suggestions from Ollie Jones you should store the number as entered by the user and then strip leading zeros, punctuation and white space from the number before reversing and storing in the reversed field. Then simply use the same algorithm to strip the search string before reversing, find the record and then display the originally entered number back to the user.