First time posting on here because Google is yielding no results!
So, I have a website that is based around travelling and locations. Every time someone enters content into the site, they select a location, and that content then has lat and long, country, etc.
The issue I face is that I have a DB of all the "cities and areas" of the world, and there are a good 3.5 million records in it, I believe.
My question to you is: how would you guys recommend doing a one-field autocomplete form for all the cities? I don't need advice on the autocomplete form itself; I need advice on HOW and WHERE I should be storing the data... text files? SQL? Up until now I have been using SQL, but I don't know how it should be done. Would an AJAX autoloader be able to handle it if I only returned 100 records or so? Should all the results be preloaded?
Thanks for your help guys!
EDIT: I have actually found another way to do it. I found this awesome little plugin that integrates the autocomplete with Google Maps:
http://xilinus.com/jquery-addresspicker/demos/index.html
Fantastic.
Benny
I have a few thoughts here:
Since you don't know whether a user will enter the English or the local (native) name, each city record in your database should have both. Make sure to index these fields.
Do not do auto-complete until you have a minimum number of characters. Otherwise, you will match far too many rows in your table. For example, assuming an even distribution of English characters (26), a prefix of n characters would statistically match about 3,500,000 / 26^n of your 3.5 million records:
1 char = 135k
2 chars = 5.2k
3 chars = 200
4 chars = 8
If you are using MySQL, you will want to use the LIKE operator (a prefix-anchored pattern such as LIKE 'abc%' can still use the index).
There are much more advanced methods for predictive matching, but this should be a good start.
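To make that concrete, here is a minimal sketch of the prefix query, using Python's built-in sqlite3 driver for illustration (the cities table and its name_en/name_local columns are assumptions; the same prefix-anchored LIKE works in MySQL):

import sqlite3

MIN_CHARS = 3  # don't query until the prefix is long enough

def autocomplete(conn, prefix, limit=100):
    # A prefix-anchored pattern ('par%') can use an index on the column;
    # a leading wildcard ('%par') cannot.
    if len(prefix) < MIN_CHARS:
        return []
    cur = conn.execute(
        "SELECT name_en, name_local FROM cities "
        "WHERE name_en LIKE ? OR name_local LIKE ? "
        "ORDER BY name_en LIMIT ?",
        (prefix + "%", prefix + "%", limit),
    )
    return cur.fetchall()

conn = sqlite3.connect("cities.db")  # hypothetical database file
print(autocomplete(conn, "par"))

Capping the result set with LIMIT keeps the AJAX payload small, which is what makes returning "100 records or so" workable.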
I have a table I have filtered from API data; it is my highlights from across the web. Ultimately, I want to output these to a doc file I have, grouped by the page they came from.
I have the API data filtered down to two columns:
url|quote
How do I, for each url, output its quotes to a doc file? Or, just for starters, how do I iterate through the set of quotes for each url?
In SQL it would be something like this:
SELECT url AS sub_header, quote
FROM table
ORDER BY url
url | quote
https://jotengine.com/transcriptions/WIUL8HBabqxffIDOkUA9Dg I actually think that the bigger problem is not necessarily having the ideas. I think everyone has lots of interesting ideas. I think the bigger problem is not killing the bad ideas fast enough. I have the most respect for the Codecademy founders in this respect. I think they tried 12 ideas in seven weeks or something like that, in the summer of YC.
https://jotengine.com/transcriptions/WIUL8HBabqxffIDOkUA9Dg We were like what the heck is going on here, so we went and visited five of our largest customers in New York, this was about three years ago, and we said okay, you're using the S3 integration but what the heck are you using it for? For five out of five customers in a row, they said well we have a data engineering team that's taking data from the S3 bucket, converting it into CSV files and managing all the schema translations, and now they're uploading it into a data warehouse like Redshift. The first time I heard that from a customer, I was like okay, that's interesting
I want to output a url header followed by all the quotes I've highlighted. Ideally my final product will be in docx
It would be great if you could provide some source code to help explain your problem. From looking at your question, I would say all you need to do is put your columns into a DataFrame, then export it to Excel.
import pandas as pd

df = pd.DataFrame({"url": url, "quote": quote})
# Export both columns so each quote keeps its url.
df.to_excel("filename.xlsx", index=False)
Hope this helps.
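If the end goal really is a doc file rather than a spreadsheet, here is a minimal sketch that groups the quotes under one heading per url, using pandas plus the python-docx package (the output file name is a placeholder, and url/quote are the two columns from above):

import pandas as pd
from docx import Document

df = pd.DataFrame({"url": url, "quote": quote})

doc = Document()
# One heading per page, followed by every quote highlighted on it,
# which matches the "partition by url" idea from the question.
for page_url, group in df.groupby("url"):
    doc.add_heading(page_url, level=2)
    for q in group["quote"]:
        doc.add_paragraph(q)
doc.save("highlights.docx")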
Here's my situation (or see TLDR at bottom): I'm trying to make a system that will search for user entered words through several documents and return the documents that contain those words. The user(s) will be searching through thousands of documents, each of which will be 10 - 100+ pages long, and stored on a webserver.
The solution I have right now is to store each unique word in a table with an ID (only maybe 120 000 relevant words in the English language), and then in a separate table store the word id, the document it is in, and the number of times it appears in that document.
E.g: Document foo's text is
abc abc def
and document bar's text is
abc def ghi
Documents table will have
id | name
1 'foo'
2 'bar'
Words table:
id | word
1 'abc'
2 'def'
3 'ghi'
Word Document table:
word id | doc id | occurrences
1 1 2
1 2 1
2 1 1
2 2 1
3 2 1
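For illustration, a search over this schema might look like the following sketch (using SQLite, with assumed snake_case names for the tables above):

import sqlite3

def search(conn, word):
    # Documents containing the word, ranked by occurrences.
    return conn.execute(
        "SELECT d.name, wd.occurrences "
        "FROM words w "
        "JOIN word_document wd ON wd.word_id = w.id "
        "JOIN documents d ON d.id = wd.doc_id "
        "WHERE w.word = ? "
        "ORDER BY wd.occurrences DESC",
        (word,),
    ).fetchall()

conn = sqlite3.connect("index.db")  # hypothetical database file
print(search(conn, "abc"))  # -> [('foo', 2), ('bar', 1)] for the data above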
As you can see, when you have thousands of documents and each has thousands of unique words, the Word Document table blows up very quickly and takes far too long to search through.
TL;DR My question is this:
How can I store searchable data from large documents in an SQL database, while retaining the ability to use my own search algorithm (I am aware SQL has one built in for .docs and PDFs) based on custom factors (like occurrence, as well as others), without an outright massive table linking each word to a document and its properties in that document?
Sorry for the long read and thanks for any help!
Rather than building your own search engine using SQL Server, have you considered using a C# .NET implementation of the Lucene search APIs? Have a look at https://github.com/apache/lucene.net
Good question. I would piggyback on the existing solution of SQL Server (full-text indexing). They have integrated a nice indexing engine which optimises considerably better than your own code probably could (unless the developers at Microsoft were lazy, or they just got a dime to build it :-)).
Please see the SQL Server full-text indexing background. You could query views such as sys.fulltext_index_fragments or use the stored procedures.
Of course, piggybacking on an existing solution has some drawbacks:
You need to have a license for the solution.
When your needs can no longer be served, you will have to program it all yourself.
But if you allow SQL Server to do the indexing, you could build your own solution on top of it more easily and in less time.
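As a sketch of what leaning on the built-in engine can look like from application code (assuming a full-text index already exists on the content column; the connection string and table/column names are placeholders), using pyodbc:

import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=docs;Trusted_Connection=yes;"
)
cur = conn.cursor()
# CONTAINS goes through the full-text index rather than scanning every row.
cur.execute(
    "SELECT id, name FROM Documents WHERE CONTAINS(content, ?)",
    ("searchterm",),
)
for row in cur.fetchall():
    print(row.id, row.name)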
Your question strikes me as being naive. In the first place... you are begging the question. You are giving a flawed solution to your own problem... and then explaining why it can't work. Your question would be much better if you simply described what your objective is... and then got out of the way so that people smarter than you could tell you HOW to accomplish that objective.
Just off hand... the database sounds like a really dumb idea to me. People have been grepping text with command line tools in UNIX-like environments for a long time. Either something already exists that will solve your problem or else a decent perl script will "fake" it for you-- depending on your real world constraints, of course.
Depending on what your problem actually is, I suspect that this could get into some really interesting computer science questions-- indexing, Bayesian filtering, and who knows what else. I suspect, however, that you're making a very basic task more complicated than it needs to be.
TL;DR My answer is this:
Why wouldn't you just write a script to go through a directory... and then use regexes to count the occurrences of the word in each file found there?
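For what it's worth, that script is only a few lines. A rough sketch in Python (the directory name and search word are placeholders):

import os
import re

def count_word(root_dir, word):
    pattern = re.compile(r"\b%s\b" % re.escape(word), re.IGNORECASE)
    results = {}
    # Walk the directory tree and count matches per file.
    for dirpath, _, filenames in os.walk(root_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, errors="ignore") as f:
                hits = len(pattern.findall(f.read()))
            if hits:
                results[path] = hits
    # Most occurrences first, i.e. a crude relevance ranking.
    return sorted(results.items(), key=lambda kv: kv[1], reverse=True)

print(count_word("documents/", "searchterm"))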
I have two Tables:
Articles(artID, artContents, artPublishDate, artCategoryID, publisherID).
ArticleUpdated(upArtID, upArtContents, upArtEditedData, upArtPublishDate, upArtCategory, upArtOriginalArticleID, upPublisherID)
A user logs in to the application and updates an article's contents in the (artContents) column. I want to know:
Which changes did the user make to the article's contents?
I want to store both versions of the article, the original version and the edited version!
What should I do to accomplish the above two tasks?
What changes, if any, are needed in the tables?
What query gets the exact edited data of (artContents)?
(By "the exact edited data" I mean: there may be 5000 characters in the column, and the user may edit 200 characters in the middle or somewhere else among the column's characters. I want exactly those edited characters, before the edit and after the edit.)
Note: I am using ASP.NET with C# for development.
You are not going to be able to do the exact editing using SQL. You need an algorithm such as the Unix diff on files (which works at the line level). At the character level, the algorithm would be some variation of Levenshtein distance. If diff meets your needs, you could download it, write a stored procedure to call it, and then use it in the database. This would be rather expensive.
The part of your question about maintaining the different versions is much easier. I would add two columns, EffDate and EndDate, to each record. You can get the most recent version by looking for EndDate IS NULL, and you can find the version active at any given time. MERGE is generally useful for maintaining such a table.
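For the character-level part, most languages have a diff library; a minimal sketch with Python's standard difflib shows the idea (a C# project would use an equivalent diff library):

import difflib

original = "The quick brown fox jumps over the lazy dog"
edited = "The quick red fox leaps over the lazy dog"

# get_opcodes() describes how to turn `original` into `edited`,
# span by span, at the character level.
matcher = difflib.SequenceMatcher(None, original, edited)
for op, i1, i2, j1, j2 in matcher.get_opcodes():
    if op != "equal":
        print(op, repr(original[i1:i2]), "->", repr(edited[j1:j2]))

Storing those changed spans per edit gives you exactly the "before" and "after" characters the question asks for.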
Basically, this type of requirement needs custom logging.
The example you provided (wanting exactly the edited characters out of a 5000-character column, both before and after the edit) can include the case where a user updates particular words in several different places in the text.
You can use http://nlog-project.org/ for logging; it's a fast and robust tool that we normally use for .NET logging.
Also, you can take a look at:
http://www.codeproject.com/Articles/38756/Two-Simple-Approaches-to-WinForms-Dirty-Tracking
Asp.net Event for change tracking of entities
What would be the best way to implement change tracking on an object
The above URLs will clear some of the air on how to do it.
You would obviously need to track and store every change.
I have been trying to get the full list of playlists matching a certain keyword. I have discovered, however, that using a start-index past 100 brings back the same set of results as start-index=1. It does not matter what the max-results parameter is: still the same results. The total results reported, however, is well above 100, so it cannot be that the query returned only 100 results.
What might the problem be? Is it a quota of some sort or any other authentication restriction?
As an example, the following query brings back the same result set whether you use start-index=1, start-index=101, start-index=201, etc.:
http://gdata.youtube.com/feeds/api/playlists/snippets?q=%22Jan+Smit+Laura%22&max-results=50&start-index=1&v=2
Any idea will be much appreciated!
Regards
Christo
I made an interface for my site, and the way I avoided this problem was to do one query for a large number of results, then store them. Let your web page then break up the stored results and present them however is needed.
For example, if someone wants to search across more than 100 videos, do the search and collect the results, but only present the first group, say 10. Then, when the person wants to see the next ten, serve them from the list you stored rather than doing a new query.
Not only does this make paging faster, but it cuts down on the constant queries to the YouTube database.
Hope this makes sense and helps.
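A sketch of that cache-and-slice approach (fetch_playlists_once is a hypothetical stand-in for whatever single big query you already make):

def get_page(cached_results, page, per_page=10):
    # Serve pages from the stored list instead of hitting the API again.
    start = page * per_page
    return cached_results[start:start + per_page]

results = fetch_playlists_once("Jan Smit Laura")  # one large query, stored
first_ten = get_page(results, 0)
next_ten = get_page(results, 1)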
I'm not too good with SQL and I know there's probably a much more efficient way to accomplish what I'm doing here, so any help would be much appreciated. Thanks in advance for your input!
I'm writing a short program for the local high school. At this school, juniors and seniors who have driver's licenses and cars can opt to drive to school rather than ride the bus. Each driver is assigned exactly one space, and their DLN is used as the primary key of the drivers table. Makes, models, and colors of cars are stored in a separate cars table, related to the drivers table by the license plate number field.
My idea is to have a single search box on the main GUI of the program where the school secretary can type in who/what she's looking for and pull up a list of results. Thing is, she could be typing a license plate number, a car color, make, or model, some driver's name, some student driver's DLN, or a space number. As the programmer, I don't know what exactly she's looking for, so a couple of options come to mind to be certain I check everywhere for a match:
1) Perform a couple of
SELECT * FROM [tablename]
SQL statements, one per table, and cram the results into arrays in my program; then search across the arrays one element at a time with a regex, looking for a pattern similar to the search term, and if I find one, add the entire matching record to a results array to display on screen at the end of the search.
2) Take whatever she's looking for into the program as a scalar and prepare multiple SELECT statements around it, such as
SELECT * FROM DRIVERS WHERE DLN = $Search_Variable
SELECT * FROM DRIVERS WHERE First_Name = $Search_Variable
SELECT * FROM CARS WHERE LICENSE = $Search_Variable
and so on for each attribute of each table, sticking the results into a results array to show on screen when the search is done.
Is there a cleaner way to go about this lookup without having to make her specify exactly what she's looking for? Possibly some kind of SQL statement I've never seen before?
Seems like the right application for the Sphinx full-text search engine. There's the Sphinx::Search module on CPAN, which can be used as a Perl client for Sphinx.
First of all, you should not use SELECT *, and you should definitely use bind values.
Second, the easiest way to figure out what the user is searching for is to ask the user. Have a set of checkboxes, like so:
Search among: [ ] Names
[ ] License Plate Numbers
[ ] Driver's License Numbers
Alternatively, you can note that names do not contain any digits, whereas license plate numbers and driver's license numbers do. There are other heuristics you can apply to partially deduce what the user was trying to search for.
If you do an OK job of presenting the results, this might work out.
Finally, try to figure out what search capabilities are offered by the database you are using and leverage them, so that most of the searching happens before the user interface touches the data.
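Putting the bind-values advice together, a rough sketch (SQLite for illustration; the column names are guesses based on the question's tables):

import sqlite3

def lookup(conn, term):
    # Bound parameters (the ? placeholders) keep the search term out of
    # the SQL string, unlike interpolating $Search_Variable directly.
    like = "%" + term + "%"
    hits = []
    hits += conn.execute(
        "SELECT dln, first_name, last_name FROM drivers "
        "WHERE dln = ? OR first_name LIKE ? OR last_name LIKE ?",
        (term, like, like),
    ).fetchall()
    hits += conn.execute(
        "SELECT license, make, model, color FROM cars "
        "WHERE license = ? OR make LIKE ? OR model LIKE ? OR color LIKE ?",
        (term, like, like, like),
    ).fetchall()
    return hits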