Consolidating / Clustering Terms and phrases - indexing

Our application allows a user to enter the names of companies that their organization works with. A current issue is that different users enter the same company name in different ways. We need to consolidate this data. Are there any proven approaches for tackling this problem?

The problem of data quality is generally referred to as Data Cleansing. There are many methods and tools in this area.
The best one for you will depend on the extent of your problem and on the technologies you use. But if I understand correctly, the stored data are OK, and the problem is that users enter search terms with incorrect spelling? In that case fuzzy searching could help.
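As a rough illustration of fuzzy matching applied to company names, here is a minimal Python sketch using the standard library's difflib. The suffix list, the 0.85 threshold, and the function names normalize and cluster_names are assumptions made for the example, not part of any particular data cleansing tool; real data would likely need tuning and a manual review step.

```python
import difflib


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, drop common legal suffixes.

    The suffix list is a hypothetical starting point, not an exhaustive one.
    """
    suffixes = {"inc", "llc", "ltd", "corp", "co", "gmbh"}
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    words = [w for w in cleaned.split() if w not in suffixes]
    return " ".join(words)


def cluster_names(names, threshold=0.85):
    """Greedily group names whose normalized forms are similar enough.

    Each name joins the first cluster whose canonical form has a
    SequenceMatcher ratio >= threshold; otherwise it starts a new cluster.
    """
    clusters = []  # list of (canonical normalized form, [original names])
    for name in names:
        key = normalize(name)
        for canonical, members in clusters:
            if difflib.SequenceMatcher(None, key, canonical).ratio() >= threshold:
                members.append(name)
                break
        else:
            clusters.append((key, [name]))
    return [members for _, members in clusters]


groups = cluster_names(["Acme Inc.", "ACME, Inc", "Acmee Inc", "Globex Corp"])
```

The greedy pass compares every name against every existing cluster, so it is only practical for modest lists; for larger data sets a blocking key (for example the first few letters) would reduce the number of comparisons.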

How to build an efficient query string in ASP.NET Core?

Hello World,
I'm researching a feature to be built into our software, and it involves something we have not faced before.
On one form we have a drop-down with a list of items. The user can either select the default, which means all items need to be considered, or selectively pick certain list items.
The form drives filter functionality: depending on the user's input, the data is filtered and displayed in the UI.
The main problem we are trying to solve is this: if the user selects the default, all list item IDs are sent in the POST call to the API. The list can be huge, say 1 to 1K items or more.
In that case we can build the query string, but it will be very long. I have also read that some browsers limit the length of the query string they support.
So currently I have the following questions:
Will shortening the query string work here?
Which technique would handle this efficiently?
What performance considerations do I need to keep in mind while doing so?
Any suggestions or thoughts are welcome. They would boost my software design thinking.
Based on what I understand from your question, here is my opinion (if I misunderstood, please correct me so I can help):
You need to send the query in the URL, not in the body as JSON. Is that correct?
I think you don't need to send every selected item one by one.
If the selected IDs are consecutive, you can send ranges in your query,
like http://abcd/test?id=1-43,6-765 (take the ID parameter as a string and then extract the useful data in the back end). With this approach you can shorten your query.
Also think about the database (if there is one): querying this much data uses a lot of IO and slows the query down.
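The range idea from this answer can be sketched in Python as a pair of helpers: one that collapses consecutive IDs into ranges on the client side and one that expands the string again on the server side. The function names compress_ids and expand_ids are hypothetical, and the sketch assumes non-negative integer IDs, since a leading minus sign would clash with the range separator.

```python
def compress_ids(ids):
    """Turn a collection of integer IDs into a string such as '1-3,7,9-10'."""
    sorted_ids = sorted(set(ids))
    if not sorted_ids:
        return ""
    parts = []
    start = prev = sorted_ids[0]
    for n in sorted_ids[1:]:
        if n == prev + 1:
            # Still inside a consecutive run; extend it.
            prev = n
            continue
        parts.append(str(start) if start == prev else f"{start}-{prev}")
        start = prev = n
    parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(parts)


def expand_ids(text):
    """Parse a string such as '1-3,7,9-10' back into a sorted list of integers."""
    result = set()
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            result.update(range(int(low), int(high) + 1))
        else:
            result.add(int(part))
    return sorted(result)


query_value = compress_ids([1, 2, 3, 7, 9, 10])
```

This only pays off when the selected IDs are mostly consecutive. For the "select all" default, sending no ID list at all and letting the back end treat that as "everything" may be simpler still.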

MS Access 2007 case-based capacity planner for tender consultants

As a way to score points for the study I’m doing, as well as out of interest into databases and wanting to help my team I’m trying to build a capacity planning tool in MS Access 2007. I work in a department that handles registering and supporting tenders. I have attached two pictures of what I’m trying to do here.
I’ve already spent some weeks making multiple iterations with colleagues who are involved and help write VBA and SQL (out of interest, wanting to learn something, or otherwise; our core business, however, isn’t development). The primary goals of the database are as follows:
A user can access, create and modify “cases” that correlate to a case ID that we use in a different system.
A user can write down his capacity per week per year for a case.
Multiple users can assign themselves to a case.
Users can leave messages (records) for other users to see on a case.
Metadata can be attached to the case.
The main problem we seem to be running into is that whenever a user tries to edit an existing case through the overview, the case data no longer “complies” with entries elsewhere. Forcing updates through Visual Basic also seems not to have worked so far.
Adding to the complexity: most of the names we use are in Dutch.
Here is an overview of the relations.
http://imgur.com/O022LAG
Here is a screenshot of the case overview as seen by a user.
http://imgur.com/kuENqaq
Main question:
How can I make entire records change for multiple users based on the input of one user?
In compliance with the guidelines regarding asking subjective questions I’m trying to be a bit more precise here:
Additionally I’m uncertain:
whether it is our approach that is wrong,
if perhaps we’re overlooking a glaring issue, or
if we should redesign this from scratch with a different layout.
Any help specifying where we should look or what would be advisable to do would be much appreciated!
Kind regards,
Timo

T-SQL database design and tables

I'd like to hear some opinions or discussion on a matter of database design. My colleagues and I are developing a complex application in the finance industry that is being installed in several countries.
Our contractors wanted us to keep a single application for all the countries, so we naturally face difficulties with the different workflows in each of them and try to make the application adjustable to satisfy the various needs.
The issue I encountered today was a request from the head of the IT department on the contractor's side that we keep the database model unchanged in terms of its tables and the columns they consist of.
For example, we have a table of different risks and we needed to add a flag column IsSomething (BIT NOT NULL ...). According to third normal form it fully qualifies to exist within the risk table: it is a non-key value with no transitive dependency on the key ...
BUT the guy said that he wants to keep the tables as they are, so we had to make a new table "riskinfo" that holds the new column and links 1:1 to the risk table.
What is your opinion?
We add columns to our tables that are referenced by a variety of apps all the time.
So long as the applications specifically reference the columns they want to use and you make sure the new fields are either nullable or have a sensible default defined so it doesn't interfere with inserts I don't see any real problem.
That said, if an app does a select * then proceeds to reference the columns by index rather than name you could produce issues in existing code. Personally I have confidence that nothing referencing our database does this because of our coding conventions (That and I suspect the code review process would lynch someone who tried it :P), but if you're not certain then there is at least some small risk to such a change.
In your actual scenario I'd go back to the contractor and give your reasons you don't think the change will cause any problems and ask the rationale behind their choice. Maybe they have some application-specific wisdom behind their suggestion, maybe just paranoia from dealing with other companies that change the database structure in ways that aren't backwards-compatible, or maybe it's just a policy at their company that got rubber-stamped long ago and nobody's challenged. Till you ask you never know.
This question is indeed subjective, as Binary Worrier commented. I have neither an answer nor a suggestion; I'm just sharing my 2 cents.
Do you know the rationale for those decisions? Sometimes good designs are compromised for the sake of not breaking currently working applications or simply for the fact that too much has been done based on the previous one. It could also be many other non-technical reasons.
Very often, the programming community is unreasonably concerned about the ripple effect that results from redefining tables. Usually, this is a result of failure to understand data independence, and failure to guard the data independence of their operations on the data. Occasionally, the original database designer is at fault.
Most object oriented programmers understand encapsulation better than I do. But these same experts typically don't understand squat about data independence. And anyone who has learned how to operate on an SQL database, but never learned the concept of data independence is dangerously ignorant. The superficial aspects of data independence can be learned in about five minutes. But to really learn it takes time and effort.
Other responders have mentioned queries that use "select *". A select with a wildcard is more data dependent than the same select that lists the names of all the columns in the table. This is just one example among dozens.
The thing is, both data independence and encapsulation pursue the same goal: containing the unintended consequences of a change in the model.
Here's how to keep your IT chief happy. Define a new table with a new name that contains all the columns from the old table, and also all the additional columns that are now necessary. Create a view, with the same name as the old table, that contains precisely the same columns, and in the same order, that the old table had. Typically, this view will show all the rows in the old table, and the old PK will still guarantee uniqueness.
Once in a while, this will fail to meet all of the IT chief's needs. And if the IT chief is really saying "I don't understand databases; so don't change anything" then you are up the creek until the IT chief changes or gets changed.
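The table-plus-view approach described in this answer can be sketched as follows. The question is about T-SQL, but to keep the example self-contained and runnable it uses Python's built-in sqlite3 module as a stand-in; the table name risk_v2 and the sample rows are assumptions made for illustration, and only IsSomething comes from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The original table that existing applications query.
    CREATE TABLE risk (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    INSERT INTO risk (id, name) VALUES (1, 'credit'), (2, 'market');

    -- New table under a new name: the old columns plus the new flag.
    CREATE TABLE risk_v2 (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        IsSomething INTEGER NOT NULL DEFAULT 0
    );
    INSERT INTO risk_v2 (id, name) SELECT id, name FROM risk;

    -- Replace the old table with a view that has the same name,
    -- the same columns, and the same column order.
    DROP TABLE risk;
    CREATE VIEW risk AS SELECT id, name FROM risk_v2;
""")

# New code works against risk_v2 and can use the flag.
conn.execute("UPDATE risk_v2 SET IsSomething = 1 WHERE id = 2")

# Old code still sees exactly the old shape, even with SELECT *.
old_shape = conn.execute("SELECT * FROM risk ORDER BY id").fetchall()
new_shape = conn.execute("SELECT * FROM risk_v2 ORDER BY id").fetchall()
```

One caveat when carrying this over to another engine: writes through the view behave differently from product to product. In SQLite a view is read-only unless INSTEAD OF triggers are defined, so the behavior of inserts and updates against the old name should be checked on the actual target database.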

Wiki Database, is there one?

I was searching the net for something like a wiki database, just like wikipedia but instead stores structured content, editable by users. What I was looking for was an online database accessible by everyone where people can design the schema and data with proper versioning of both schema and data. I couldn't find any such site. I am not sure if it is my search skills or if there really is no wiki database as of now. Does anyone out there know anything like this?
I think there is great potential for something like this. A possible example would be a website with a GUI for querying a MySQL DB, where any visitor can create DB objects and populate data.
UPDATE: I had registered the domain wikidatabase.org to get started on a tool but I didn't find enough time yet. If anyone is interested in spending some time and coding on this, please let me know at wikidatabase.org
It's not quite what you're looking for, but Semantic Mediawiki adds database-like features to MediaWiki:
http://semantic-mediawiki.org/wiki/Semantic_MediaWiki
It's still fundamentally a Wiki, but you can add semantic tags to pages ([[foo::bar]] [[baz::1000]]) and then do database-type queries across them: SELECT baz FROM pages WHERE foo=bar would be {{#ask: [[foo::bar]] | ?baz}}. There is even an embryonic SPARQL implementation for pseudo-SQL queries.
OK this question is old, but Google led me here, so for anyone else out there looking for a wiki for structured data: Take a look at Foswiki.
This might be like what you're looking for: dbpedia.org. They're working on extracting data from Wikipedia, and encoding it in a structured format using RDF, so that it can be queried using SPARQL.
Linkeddata.org has a big list of RDF data sets.
Do you mean something like http://www.freebase.com?
You should check out https://www.wikidata.org/wiki/Wikidata:Main_Page which is a bit different but still may be of interest.
Something that might come close to your requirements is Google Docs.
What's offered is document editing roughly similar to MS Word, and spreadsheets roughly similar to Excel. I'm thinking of the latter, of course.
In Google Docs, you can create spreadsheets for free; being spreadsheets, they naturally have a row-and-column structure similar to a database, which you can define flexibly. You can also share these sheets with other people. This seems to be a by-invite-only process rather than open-to-all, but there may be other possibilities I'm not aware of, or that level of sharing might be enough for you in any case.
MindTouch should be able to do it. It's rather easy to get data in and out (for example, it's trivial to aggregate all the IPs for servers into one table).
I pretty much use it as a DB in the wiki itself (pages have tables, key/value..inheritance, templates, etc...) but you can also interface with the API, write dekiscript, grab the XML...
I like this idea. I have heard of some sites that are trying to pull together large datasets for various things for open consumption, but none that would allow a wiki feel.
You could start with something as simple as an installation of phpMyAdmin with a known password that would allow people to log in, create a database, edit data and query from any other site on the web.
It might suffer from more accuracy problems than wikipedia though.
OpenRecord, development of which seems to have halted in 2008, seems to approach this. It is a structured wiki in which pages are views on the data. Unlike RDBMSes it is loosely typed - the system tries to make a best guess about what data you entered, but defaults to text when it cannot guess. Schemas appear to have been implied.
http://openrecord.org
An example of the typing that is given is that of a date. If you enter '2008' in a record, the system interprets this as a date. If you enter 'unknown' however, the system allows that as well.
Perhaps you might be interested in CouchDB:
"Apache CouchDB is a document-oriented database that can be queried and indexed in a MapReduce fashion using JavaScript. CouchDB also offers incremental replication with bi-directional conflict detection and resolution."
I'm working on an Open Source PHP / Symfony / PostgreSQL app that does this.
It allows multiple projects, each project can have multiple directories, each directory has a defined field structure. Admins set all this up.
Then members of the public can suggest new records, edit or report existing ones. All this is moderated and versioned.
It's early days yet but it basically works and is already in real world use in several projects.
Future plans already in progress include tools to help keep the data up to date, better searching/querying and field types that allow translations of content between languages.
There is more at http://www.directoki.org/
I'm surprised that nobody has mentioned Wikibase yet, which is the software that powers Wikidata.

How to map subjective data in the semantic web?

I've been looking at the freebase project for storing data. It seems to be a great place to store concrete, objective data like names, locations and dates. Is it a good place to store subjective data like opinions or ratings? Is there another/better open data, semantic data store or strategy for storing and querying this kind of information?
Additionally, since it is subjective I can be sure that others will not agree with my opinion. How would I store the opinions of others inline so the crowd opinion could be represented better?
Is freebase the right place to store this type of data?
For example: a restaurant rating or a movie rating. The movie rating would probably be less time sensitive than the restaurant rating. Any non-identifying information about the person who entered the data would be interesting for determining other factors and relationships.
The Semantic Web is, for the most part, more or less a variant of first-order logic, so the important part is to have a clear understanding of what each of your predicates "means". This idea is very simple but applicable to a wide variety of meaning representations, i.e. it is behind the entity model of databases.
There should be no problem representing the information you mentioned in a semantic web representation. Just be sure to have a clear definition of what each of your predicates denotes, so that the meaning doesn't shift over time and leave you with an inconsistent representation.
Genesereth's book is old but a good one if you are interested in reading about this in further detail. I think a lot of people who worked on the Semantic Web were involved in Douglas Lenat's Cyc project which gradually shifted to a logic-based meaning representation over time.
http://www.amazon.com/Logical-Foundations-Artificial-Intelligence-Genesereth/dp/0934613311
The site for Cyc:
http://www.cyc.com/
I find designing/selecting data formats is very hard without an understanding of the questions I will be asking using that data. What purpose do you expect the data to be used for? Come up with some use cases and that may guide your search.
Storing attributed data is an open research topic, with development in (among other places) the Intelligence community: these users obviously need to keep track of where information came from, and who has added to it along the way, both to verify its reliability and to do things like track whether Secret information has been included by accident. That may be a good place to look.
Data is data; what you want to do is label the data as what it is: an opinion or a rating. A "fact" that could be inferred from such data, I suppose, would be that most people held opinion x about the topic.
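To make the labeling idea concrete, here is a small Python sketch, not tied to Freebase or any RDF store. Each statement carries a kind ("fact" or "opinion") and a non-identifying source, and a helper derives the majority opinion as a fact about the crowd. The Statement class, its field names, and majority_opinion are all hypothetical names chosen for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    subject: str     # the thing being described, e.g. a movie
    predicate: str   # what is being asserted about it
    value: object    # the asserted value
    kind: str        # "fact" or "opinion"
    source: str      # non-identifying ID of whoever asserted it


def majority_opinion(statements, subject, predicate):
    """Return (value, share) for the most common opinion, or None if there is none."""
    values = [
        s.value
        for s in statements
        if s.kind == "opinion" and s.subject == subject and s.predicate == predicate
    ]
    if not values:
        return None
    value, count = Counter(values).most_common(1)[0]
    return value, count / len(values)


statements = [
    Statement("Some Movie", "release_year", 1999, "fact", "catalog"),
    Statement("Some Movie", "rating", 4, "opinion", "user-a"),
    Statement("Some Movie", "rating", 4, "opinion", "user-b"),
    Statement("Some Movie", "rating", 4, "opinion", "user-c"),
    Statement("Some Movie", "rating", 5, "opinion", "user-d"),
]
summary = majority_opinion(statements, "Some Movie", "rating")
```

Keeping the source on every statement is what allows conflicting opinions to sit side by side, and adding a timestamp field would cover the point that restaurant ratings age faster than movie ratings.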
from twitter:
jimpick #the_real_kevinw Each user and app/base has their own namespace, but I'd ask on the developers mailing list. A mashup might fit better.