How do you replace Amazon CloudSearch in development? - amazon-cloudsearch

In production, my application runs MySQL + Amazon CloudSearch. In development, it runs only MySQL and I'm not interested in running a search domain only for development.
Currently, in development, I run text searches in MySQL, which is not ideal because I have to write specific environment code.
I have found Groonga CloudSearch, which seems awesome but is still very young and incomplete.
So, what would be the best approach to replace Amazon CloudSearch in development?

We simply added an index field to classify the data, e.g.
system: dev or live
Then any search from the code just passes in dev/live to the search query depending on what system it is running on.
You could also add an index field for "domain" with values such as "www.mydomain.com" and "dev.mydomain.com". Then, when your code runs a search, just pass in the domain that the documents come from and where they should be visible.
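As a rough sketch of the idea above (in Python, purely for illustration): the helper below builds search parameters that always carry the environment filter. The field name `system` comes from the example above; the function name is hypothetical, and the `fq` parameter with the `field:'value'` form is my reading of the 2013-01-01 CloudSearch search API, so check it against the API version you actually use.

```python
def build_search_params(terms: str, environment: str) -> dict:
    """Return search parameters restricted to one environment.

    Assumes an index field named `system` holding 'dev' or 'live'.
    """
    if environment not in ("dev", "live"):
        raise ValueError(f"unknown environment: {environment!r}")
    return {
        "q": terms,
        # Filter query: only documents tagged for this environment match.
        "fq": f"system:'{environment}'",
    }
```

Every search then goes through this one helper, so the dev/live switch lives in a single place instead of being repeated at each call site.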

This is one of those "it depends on how you're using it" answers. Is there a reason (other than expense) why you don't want to use Amazon CloudSearch in development? You're choosing a pretty specific SaaS product that's effectively a black box to anyone who doesn't work on the project.
That said, some viable substitutions (depending on features needed) are:
- Solr - lucene.apache.org/solr/
- Elasticsearch - elasticsearch.org
- Whoosh - bitbucket.org/mchaput/whoosh/
- Xapian - xapian.org
- Really anything that uses the Lucene full-text search engine - lucene.apache.org/core/
- MongoDB - mongodb.org
- Memcached - memcached.org
- Redis - redis.io/
So... yeah... it completely depends on the role of CloudSearch in your stack. You could be using it as a key-value store with minimal logic, or with a complex framework like Haystack - https://github.com/pbs/haystack-cloudsearch
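Whichever substitute you pick, one way to avoid scattering environment-specific code is to hide the engine behind a small interface. The sketch below is only an illustration in Python: `SearchBackend`, `InMemorySearch` and `find_documents` are hypothetical names, and the in-memory class is a naive stand-in, not a replacement for CloudSearch ranking or query syntax.

```python
from typing import Protocol


class SearchBackend(Protocol):
    """Anything that can answer a text search with a list of document ids."""

    def search(self, terms: str) -> list[str]: ...


class InMemorySearch:
    """Development stand-in: case-insensitive substring match, no ranking."""

    def __init__(self, documents: dict[str, str]) -> None:
        self._documents = documents

    def search(self, terms: str) -> list[str]:
        needle = terms.lower()
        return sorted(
            doc_id
            for doc_id, text in self._documents.items()
            if needle in text.lower()
        )


def find_documents(backend: SearchBackend, terms: str) -> list[str]:
    # Application code depends only on the interface, not on the engine.
    return backend.search(terms)
```

In production, a second class with the same `search` method would call the real search domain, and the choice between the two would be made once, at startup.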

Related

Picking the right database technique for file storage and search

For a personal project, I am searching for the "most suitable" database engine to address the following key requirements:
- need to store large amounts of individual document files (PDF)
- need to perform full-text search on the PDFs (for this I plan to use OCR and additionally save the processed data/metadata to the database)
- need to get pieces/chunks of the saved documents (for example from a specific year) and show a preview of lots of them within a nice web UI
- as much performance as possible
Up to now I have worked a lot with SQL (MySQL) and have some theoretical knowledge about other systems (Memcached, Redis, PostgreSQL, MongoDB). But I've never used them in combination and never reached the point of knowing WHEN they should be used for WHAT exactly, or how they can be combined.
I think that, especially for a project like this, it's very important to select the right engine from the beginning so as not to hit performance issues later.
So, to all experienced developers out there: what would be your favourite choice for this kind of project (I guess SQL may not be the only right solution)?
Or, in the end, would it be better to store the files in the filesystem and keep only metadata in the database?
BTW my planned API backend for this is Laravel 7+, frontend will be Vue 2+.
Thank you very much !

How do you implement search over static content within cshtml files

I am using ASP.NET Core and Razor. As it is a help system, I would like to implement some kind of search facility that brings back a list of hyperlinked results based on the search terms.
I would like the search to iterate essentially over the content contained within the and tags and then link this to the appropriate page/view.
What is the best way to do this?
I'm not even sure how you get a handle on the actual content of your own cshtml pages and then go from there.
This question is far too broad. However, I can provide you some pointers.
First, you need to determine what you're actually wanting to surface and where that data lives. Your question says "static web pages", but then you mention .cshtml. Traditionally, when it comes to creating your own search, you're going to have access to some particular dataset (tables in a database, for example). It's much simpler to search across the more structured data than the end result of it being dumped in various and sundry places over a web page.
Search engines like Google only index in this way because they typically don't have access to the raw data (although some amount of "access" can be granted via things like JSON-LD and other forms of Schema.org markup). In other words, they actually read from the web page out of necessity, because that's what they have to work with. It's certainly not the approach you would take if you have access to the data directly.
If for some reason you need to actually spider and index your own site's HTML content, then you'll essentially have to do what the big boys do: create a bot, run it on a schedule, crawl your site, link by link, downloading each document, and then parse and process it. The end result would be to create a set of structured data that you can actually query against, which is why all this is pretty much just wasted effort if you already have that data.
Once you have the data, however you got there, you simply query it. In the most basic of forms, you could store it in a table in a database and literally issue SQL queries against it. Your search keywords/parameters are essentially the WHERE of the SELECT statement, so you'd have to figure out a way to map the keywords/parameters you're receiving to an acceptable WHERE clause that achieves that.
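To make the "keywords become the WHERE clause" idea concrete, here is a minimal sketch in Python using the standard-library `sqlite3` module. The `pages` table with `url` and `body` columns is an assumption for illustration, as are the function names; it requires every keyword to appear somewhere in the body and uses bound parameters rather than string concatenation.

```python
import sqlite3


def escape_like(term: str) -> str:
    """Escape LIKE wildcards so user input is matched literally."""
    return term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")


def build_where(keywords: list) -> tuple:
    """Return (where_sql, params): one LIKE condition per keyword, ANDed."""
    if not keywords:
        return "1 = 1", []
    where_sql = " AND ".join("body LIKE ? ESCAPE '\\'" for _ in keywords)
    params = [f"%{escape_like(keyword)}%" for keyword in keywords]
    return where_sql, params


def search_pages(conn: sqlite3.Connection, query: str) -> list:
    """Return the URLs of pages whose body contains every word of `query`."""
    where_sql, params = build_where(query.split())
    sql = f"SELECT url FROM pages WHERE {where_sql} ORDER BY url"
    return [row[0] for row in conn.execute(sql, params)]
```

This is about as basic as search gets: no ranking, no stemming, and a full table scan per query, which is part of why a dedicated search engine is the more usual choice.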
More traditionally, you'd use an actual search engine: essentially a document database that is designed and optimized for search, and generally provides a more search-appropriate API to query against. There are lots of options in this space, from roll-your-own to hosted SaaS solutions and anything in between. Of course, the cost goes down the more work you do yourself and goes up the more out-of-the-box the solution is.
One popular open-source and largely free option is Elasticsearch. It uses Lucene indexes, which it stitches together in a clustered environment to provide failover and scale. Deployment is a beast, to say the least, though it's gotten considerably better with things like containerization and orchestration. You can stand up an Elasticsearch cluster in something like Kubernetes with relative ease, though you will probably still need to do a bit of config. Elasticsearch does also have hosted options, but you know, cost.

Backend solution for fetching and transforming data from various third-party APIs

We are building new feature sets for one of our financial applications. We have our own SQL Server database and we will be calling multiple RESTful APIs that return JSON responses. For example, some return news data, some return stock info, some return finance data, and our own SQL Server database has employee data. So they all come with their own data formats. The new app we are building is going to aggregate all that data and transform it into a meaningful display on the web, like mint.com does.
Web application will display analytical reports based on these data
There will be an option to download reports through various templates
We are completely open in terms of technology stack for our backend and middle tier. As a first thought, NoSQL like MongoDB, with Elasticsearch for search and reporting, comes to mind. There will be a web application built on top of this data (stored or retrieved from the APIs), most likely in ASP.NET MVC.
We need your input, especially if you have experience building a similar enterprise solution.
Can you please share your opinions on the following:
What tech stack would you pick for this app?
How would that scale now and in the future when the APIs' data formats change?
Performance is also important since data will be displayed on web UI.
We have a similar setup to what you are mentioning, using ASP.Net MVC with ElasticSearch (SQL server for relational data, periodically updating ES), aggregating data (XML/JSON) from multiple sources, although with the purpose of improving searching and filtering results instead of reporting. However, I would expect that the scenario you are looking at would also be a suitable match for ElasticSearch, depending on your specific requirements.
1) Since you are already using SQL Server (and I expect are familiar with that), I would suggest combining that with ElasticSearch - the additional mongodb layer seems unnecessary, in terms of maintenance of another technology and development to fit that integration. There is a very good C# library (two actually, ElasticSearch.Net and NEST, used together) that exposes most of the ES functionality.
2) We chose ElasticSearch for its scalability in combination with flexibility and ease-of-use. A challenge you may face could be mapping the documents from C# classes to ElasticSearch documents. In essence, it is incredibly easy to set up, however you do need to do some planning to index data the way you want to search and retrieve it. So if choosing ES as a platform, spend some time with the structure of the documents - by default, dynamic mapping is enabled, so you can pretty much throw any JSON into a document. However, for a production environment, it's better to turn that off and have one or more mappings set up, so they can be queried in a standardized way.
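As an illustration of turning dynamic mapping off, the sketch below (written in Python rather than C#, only to keep the example short) builds an index body with an explicit mapping. The field names are hypothetical, and the typeless `mappings` layout assumes a reasonably recent Elasticsearch version. `"dynamic": "strict"` makes the index reject documents with unknown fields; `"dynamic": false` would instead accept them but leave the extra fields unindexed.

```python
import json

# Hypothetical index body: the field names are illustrative only.
REPORT_INDEX_BODY = {
    "mappings": {
        "dynamic": "strict",  # reject documents carrying unmapped fields
        "properties": {
            "headline": {"type": "text"},       # analyzed for full-text search
            "symbol": {"type": "keyword"},      # exact-match filtering
            "published_at": {"type": "date"},
        },
    }
}


def unmapped_fields(document: dict, index_body: dict = REPORT_INDEX_BODY) -> list:
    """Return top-level document fields the explicit mapping does not know."""
    known = index_body["mappings"]["properties"]
    return sorted(name for name in document if name not in known)


if __name__ == "__main__":
    # The JSON printed here is what would be sent when creating the index.
    print(json.dumps(REPORT_INDEX_BODY, indent=2))
```

A client-side check like `unmapped_fields` is optional; it just surfaces mapping drift before the index rejects the document.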
3) Performance is a key factor for us as well, which is why we were looking at Lucene-based engines like Solr and ElasticSearch when doing research, along with NoSQL databases. In our experience, ElasticSearch outperforms SQL Server by 10 to 1 or better in most search scenarios. Solr vs. ElasticSearch performance depends on the scenario; benchmarks and comparisons are around if you Google them. The exception may be when many documents have to be retrieved in one query: ES (or Lucene) is not made for that use case. It's best for fast retrieval of a small number of results per page (similar to Google's per-page result count). If you need 1000 documents per page/result, a NoSQL database may be a better option.
ElasticSearch is fast to get up and running - install it on a local development box and try it out, you'll get a feel for if it fits.
From my experience, MongoDB is the worst choice for reporting, especially for aggregation. It lacks good aggregation functionality, has some data type conflicts (such as decimals being stored as strings, which you cannot use in its built-in aggregation framework API), and you'll probably have to maintain map-reduce functions in JavaScript for most scenarios.
If your application's true nature is only reports, and they do not have to be updated in realtime, I would drop off the on-demand RPC calls to external APIs. I would consider copying ahead as much data as possible and storing it under a schema that is the most convenient for you to work with, and synchronising it afterwards under scheduled, predicted intervals.
I wouldn't be in a hurry making assumptions about that data to be available all the time nor it always to be in the format you expect. You also gain optimisation benefits on running your own copy of it, indexed in the way you want, instead of trying to figure which of the RPCs is your bottleneck.
As for your questions:
1) If you don't mind using Python, I would pick Django on top of a PostgreSQL database. Django is a fully featured, sturdy ORM + web framework which is excellent for this kind of work. If not, just stick to a relational SQL database. I have heard wonders of Cassandra but haven't tried it yet.
2 + 3) As I mentioned before, replicate the data as much as possible for your own good. Once everything is "in house" you can cluster it and tweak it freely. Using a distributed cache (such as Redis) against heavy client requests is also a good idea, instead of generating those reports on demand each time.
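The caching idea can be sketched without any particular cache product. The Python class below is a hypothetical in-process stand-in for what a distributed cache such as Redis would do with a key expiry: a report is rebuilt only when its cached copy is older than a chosen time-to-live. The class and method names are my own.

```python
import time
from typing import Any, Callable


class ReportCache:
    """Tiny time-to-live cache: rebuild a report only after it expires."""

    def __init__(
        self,
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested
        self._entries: dict = {}  # key -> (stored_at, value)

    def get_or_build(self, key: str, build: Callable[[], Any]) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # still fresh: skip the expensive build
        value = build()
        self._entries[key] = (now, value)
        return value
```

With a shared cache the structure stays the same; only the storage moves out of the process so that all web servers see the same cached reports.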
I've been using Jasper Reports and the Jasper Reports Server to integrate into our web app. Jasper accepts many different data source types, including JSON and SQL Server. The core version is free and allows you to produce HTML and PDF reports of high complexity. The paid version with the server allows you to easily integrate it into your web app. The core is Java Spring (partially open source) running on Tomcat/JBoss, and you can interact with it using REST web services or the visualize.js library for your web front end. It uses Highcharts, which can produce some beautiful results, and has options for ad hoc reporting and dashboards built from many reports.
See demos here: http://www.jaspersoft.com/
This has an assumed stack of your backend db's and data sources, tomcat with Java Spring, web front end HTML/Javascript.
The tool is used by many large enterprises, including Amazon, so scalability shouldn't be an issue.
If the format of your data changes, you'll need to change the report. The report is XML-formatted and edited through a WYSIWYG GUI.

JSON vs classic schema design [duplicate]

The Project
I've been asked to work on an interesting project -- what amounts to a basic Web CMS -- that uses HTML/CSS/jQuery with PHP. However, one requirement is that there won't be a database to house the data (they want flat files for the documents/pages -- preferably in JSON format).
In a very basic sense, it'll be used to generate HTML pages via a very "non-techie" interface. Each installation would only have around 20 pages, but a few may get up to 100. It has to be fairly easy to drop onto a PHP capable server and run, with very little setup needed.
What's Out There
There are tons of CMS options and quite a few flat file versions. But an OSS or other existing CMS is not an option. They need a simple proprietary system.
Initial Thoughts
So flat files it is... but I'd really like to get some feedback on the drawbacks, and if it is worth the effort to try and convince them to use something like MySQL (SQLite or CouchDB are out since none of the servers can be configured to run them at the present time).
Of course the document files are pretty straightforward, but we're also talking about login info for 1 or 2 admins per installation, a few lists, as well as configs/settings (which also can easily be stored in a file with protection).
The Dilemma
If there are benefits to using MySQL rather than JSON-formatted files and some arrays in a simple project like this -- beyond my own pre-conceived notions :) -- I'll be sure to argue them.
But honestly I can't see any that outweigh their need to not have a database system.
I'd appreciate your insight and opinions.
If you can't cite a specific need for relational table design, then you're good with flat files. Build as specified. The moment you can cite a specific need, let them know; upgrading isn't that hard, if your perception is timely (that is, if you aren't in the position of having to normalize data that should have been integrated earlier).
It's a shame you can't use CouchDB, this seems like the perfect application for it. Keep in mind that using flat-files severely constrains your architecture and, especially, scalability.
What's the best case scenario for your CMS app? It's successful and people want to use it more? If you're using flat-files it'll be harder to service and improve your system (e.g. make it more robust, and add new features for future versions) and performance will not scale well. So "success" in this case is at best short-lived, as success translates into more and more work for less and less gains in feature-set and performance.
Then again, if the CMS is designed right, then switching from flat files to an RDBMS should be as simple as using a different data access file.
Will this be installed on any shared hosting sites? For this to work somewhat safely, a mechanism like suEXEC needs to be set up properly, as the web server will need write permissions to various directories.
What would be cool with a simple site that was fed via JSON and jQuery is that the site wouldn't need to reload on each click. Just the relevant data would change. You could then use hashes in the location bar to keep track of where you were (ex. http://localhost/#about)
The problem is that if they are editing the raw JSON file, they can mess it up pretty quickly. I think your admin tools would have to generate the JSON files based on the input so that you can ensure nothing breaks. The admin tools would be more involved than the site (though isn't that always the case with dynamic sites?)
What are the predicted data sizes for the CMS?
A large reason for the use of an RDBMS is quick, specific access to large amounts of data. The data format might not be large, but if there is a lot of data, then an RDBMS might be better in the long run.
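A minimal sketch of "the admin tool generates the JSON", written in Python for illustration even though the project itself is PHP. The required fields and the slug rule are assumptions of mine. The page is validated first, then written to a temporary file and moved into place, so a half-written or malformed file never replaces a good one.

```python
import json
import os
import re
import tempfile

# Hypothetical page schema: these field names are illustrative only.
REQUIRED_FIELDS = ("slug", "title", "body")
SLUG_RE = re.compile(r"^[a-z0-9-]+$")


def save_page(directory: str, page: dict) -> str:
    """Validate a page and write it atomically as <slug>.json."""
    missing = [field for field in REQUIRED_FIELDS if not page.get(field)]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    if not SLUG_RE.match(page["slug"]):
        # Also blocks path tricks such as '../' in the file name.
        raise ValueError(f"invalid slug: {page['slug']!r}")

    target = os.path.join(directory, f"{page['slug']}.json")
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(page, handle, indent=2)
        os.replace(tmp_path, target)  # swap in the complete file
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return target
```

Because the file is only ever produced by the serializer, it is always well-formed JSON, which is the guarantee hand editing cannot give.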
While an RDBMS may be necessary for a very large CMS, a small one could run off flat files very well. A lot of CMS products out there fall down in that regard, I think, by throwing an RDBMS into the mix when there's no real need.
However, if you are using flat files, there are security issues which others have highlighted. Another issue I've come across is hosting providers using the disable_functions directive in php.ini to disable file I/O functions like fopen() and friends. If you're hosting your CMS on a box you control, you won't have this problem but if you're using a third-party provider, check first.
As the original poster, I wasn't signed in, so I'm following up on the answers so far in an answer (sorry if this is bad form).
There may be instances where this is on a shared host.
Though the JSON files can technically be edited, this won't be the case. The admin interface will be robust enough to do all of the creating/editing of pages.
The size for each install will be relatively small -- 1 - 2 admins, 10-100 pages. A few lists of common items may run longer (snippets of copy, for example).
Security will be a big issue -- any other suggestions on this specifically?
Well, isn't the real problem that they distrust any database system? Isn't the problem more in their thinking than in the technology? Maybe they are afraid of a database because it sounds complex to them. In that case, you could just present them with a very simple CMS (like CMS Made Simple, which I've heard is really simple and quick to learn). If they see that everything is easy, then maybe they won't care what's behind it, whether it's a database or whatever!
They might listen to arguments like better maintainability, lower cost of maintenance, a much easier handover to another webmaster than with proprietary solutions (they are not dependent on you), etc.

Problems using MySQL FULLTEXT indexing for programming-related data (SO Data Dump)

I'm trying to implement a search feature for an offline-accessible StackOverflow, and I'm noticing some problems with using MySQL's FULLTEXT indexing.
Specifically, by default FULLTEXT indexing is restricted to words between 4 and 84 characters long. Terms such as "PHP" or "SQL" would not meet the minimum length and searching for those terms would yield no results.
It is possible to modify the variable which controls the minimum length a word needs to be to be indexed (ft_min_word_len), but this is a system-wide change requiring indexes in all databases to be rebuilt. On the off chance others find this app useful, I'd rather keep these sort of variables as vanilla as possible. I found a post on this site the other day stating that changing that value is just a bad idea anyway.
Another issue is with terms like "VB.NET" where, as far as I can tell, the period in the middle of the term separates it into two indexed values - VB and NET. Again, this means searches for "VB.NET" would return nothing.
Finally, since I'm doing a direct dump of the monthly XML-based dumps, all values are converted to HTML Entities and I'm concerned that this might have an impact on my search results.
I found a blog post which tries to address these issues with the following advice:
keep two copies of your data - one with markup, etc. for display, and one modified for searching (remove unwanted words, markup, etc)
pad short terms so they will be indexed, I assume with a pre/suffix.
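A rough Python sketch of the second workaround, assuming (as the MySQL manual describes for the built-in parser) that letters, digits and underscores count as word characters. Underscores are my choice of padding; the blog post's actual scheme is not known. Dots inside a term are replaced with underscores so "VB.NET" stays one word, and short terms are padded up to the minimum length. The same function has to be applied to the search copy of the data and to each query.

```python
import re

MIN_WORD_LEN = 4  # default minimum word length mentioned above

# A word, optionally followed by dot-joined parts (e.g. "VB.NET").
TOKEN_RE = re.compile(r"[A-Za-z0-9_]+(?:\.[A-Za-z0-9_]+)*")


def normalize_term(term: str, min_len: int = MIN_WORD_LEN) -> str:
    """Lower-case a term, join dotted parts, and pad it to min_len."""
    cleaned = term.lower().replace(".", "_")
    if len(cleaned) < min_len:
        cleaned += "_" * (min_len - len(cleaned))
    return cleaned


def normalize_text(text: str, min_len: int = MIN_WORD_LEN) -> str:
    """Build the search copy of a piece of text, one normalized term per word."""
    return " ".join(normalize_term(t, min_len) for t in TOKEN_RE.findall(text))
```

For example, `normalize_term("PHP")` gives `"php_"` and `normalize_term("VB.NET")` gives `"vb_net"`. Note that this also pads common short words such as "and", so a real version would likely want to drop those before padding.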
What I'd like to know is, are these really the best workarounds for these issues? It seems like semi-duplicating a > 1GB table is wasteful, but maybe that's just me.
Also, if anyone could recommend a good site for understanding MySQL's FULLTEXT indexing, I'd appreciate it. To keep this question from being too cluttered, please leave the site recommendations in the question comments, or email me directly (at the site on my user profile).
Thanks!
Additional Info:
I think I should clarify a couple of things.
I know "MySQL" tends to lead to the assumption of "web application", but that's not what I'm going for here. I could install Apache and PHP and run things that way, but I'm trying to keep this light. I can use my website for playing with PHP, so I don't feel the need to install it on my home machine too. I also hope this could be useful for others as well, and I don't want to force anyone else into installing a bunch of extra utilities. I went with MySQL since it was easy and needing to install some sort of DB was unavoidable.
The specifics of the project were going to be:
Desktop application written in C# (WinForms)
MySQL backend
I'm starting to wonder if I should just say to hell with it, and install everything I'd need to make this an (offline) webapp. As much as we'd all like to think our pet project is going to be used and loved by the community at large, I should know by now that this is likely going to end up being used by only a single user.
From what was already said, I understand that MySQL FULLTEXT is not for you ;) But why stick to MySQL? Try Sphinx:
http://www.sphinxsearch.com/
It will solve most of your problems.