Querying a large text file containing JSON objects - sql

I have a text file of a few gigabytes in this format:
{"user_ip":"x.x.x.x", "action_type":"xxx", "action_data":{"some_key":"some_value"...},...}
Each entry is one line.
First, I would like to easily find entries for a given IP. This part is easy because I can use grep, for example, but even here I would like a better solution because I want the response as fast as possible.
The next part is more complicated: I would like to find entries from a selected IP, of a selected type, and with a particular value of some_key in action_data.
I will probably have to convert this file to a SQL database (probably SQLite, because it will be a desktop app), but I wanted to ask whether better solutions exist.

You could take a look at MongoDB, a document-based database. With it you essentially store JSON objects that you can then index and easily query in an efficient way. You can find out how to query in the docs: Querying.
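For a rough idea of what that looks like in code, here is a minimal sketch with pymongo, assuming a local mongod; the database and collection names are made up:

from pymongo import MongoClient, ASCENDING

client = MongoClient()                      # assumes a local mongod instance
actions = client["logs"]["actions"]         # hypothetical database/collection names

# index the fields you filter on so lookups stay fast over gigabytes of documents
actions.create_index([("user_ip", ASCENDING)])
actions.create_index([("action_type", ASCENDING), ("action_data.some_key", ASCENDING)])

# entries for a given IP
for doc in actions.find({"user_ip": "1.2.3.4"}):
    print(doc)

# the combined lookup: IP + action type + nested key inside action_data
cursor = actions.find({"user_ip": "1.2.3.4",
                       "action_type": "xxx",
                       "action_data.some_key": "some_value"})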

Yes, put it into a database, any database. Then querying it will be straightforward.

Just wanted to mention that Oracle Berkeley DB 11gR2 (released on April 1st, 2010) introduces support for a SQL API. In fact, the SQL API is the sqlite3() API. So, as Jason mentioned, if you'd like the ease-of-use of SQLite, combined with the scalability and concurrency of Berkeley DB, you can now get both things in a single library.
Regards,
Dave

If you need the relational guarantees of an SQL-based DB, definitely go ahead with SQLite. It will allow for fast queries, joins, aggregations, sorts, and overall any sort of search you could possibly dream up. It sounds like this is just a big list of Actions performed by users at some IP, so you'll probably want to use some sort of sequence as your primary key since none of the other attributes look like good candidates.
On the other hand, if you just need to do very simple queries, e.g. look up entries by IP, look up entries by action type, etc., you might want to look into Oracle Berkeley DB. As long as you don't need any searches that are too fancy, Berkeley DB will let you store Terabytes of data and access them at record speed.
So look over both and see what's best for your use case. They're good for different things, which might be why both are available as storage systems on Android, for instance. I think SQLite will probably win out, but when thinking about embedded local DB systems you should always at least consider both of these technologies.
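To make the SQLite route concrete, here is a minimal sketch in Python with the built-in sqlite3 module; the table layout, index names, and file names are just assumptions based on the JSON fields in the question:

import json
import sqlite3

conn = sqlite3.connect("actions.db")          # file name is an assumption
conn.execute("""
CREATE TABLE IF NOT EXISTS actions (
    id          INTEGER PRIMARY KEY,          -- the sequence key suggested above
    user_ip     TEXT,
    action_type TEXT,
    some_key    TEXT,                         -- pulled out of action_data for filtering
    raw         TEXT                          -- the original JSON line
)""")
conn.execute("CREATE INDEX IF NOT EXISTS ix_ip ON actions (user_ip)")
conn.execute("CREATE INDEX IF NOT EXISTS ix_type_key ON actions (action_type, some_key)")

def rows(path):
    with open(path) as f:
        for line in f:
            e = json.loads(line)
            yield (e["user_ip"], e["action_type"],
                   e.get("action_data", {}).get("some_key"), line)

conn.executemany(
    "INSERT INTO actions (user_ip, action_type, some_key, raw) VALUES (?, ?, ?, ?)",
    rows("log.txt"))
conn.commit()

# the combined lookup from the question
for (raw,) in conn.execute(
        "SELECT raw FROM actions WHERE user_ip = ? AND action_type = ? AND some_key = ?",
        ("1.2.3.4", "xxx", "some_value")):
    print(raw, end="")

Pulling some_key out into its own column is what makes the combined query indexable.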

Related

What is the easiest/non-technical friendly way to upload and sort through a large dataset (400GB+)?

I would like to sort the (largely text) data by date tags first (separating the data by quarters, let's say, even if into different files). Then I would like to perform standard functions like conditional sums on the data. Without a substantial programming/database background, but with a willingness to spend a few days to learn, what's my best bet for a solution?
It sounds like you want to be able to do some queries on your data. I would look into a SQL database solution. The most difficult part would be getting your data into the database.
All of AWS's relational databases can import from text files:
Microsoft's SQL Server: http://msdn.microsoft.com/en-us/library/ms178129.aspx
Oracle: http://docs.oracle.com/cd/B28359_01/text.111/b28304/aload.htm
MySQL: http://dev.mysql.com/doc/refman/5.0/en/mysqlimport.html
I would base the decision purely on whichever one makes it easiest to load your file. If you are budget-constrained, you can download MySQL and skip the whole cloud thing. Just keep it on your local computer, assuming you have enough disk space to host the database.
After that's done, they all support SQL, which makes it very easy for you to query your data. If you don't want to write your own SQL, there are tools for you to create queries by drag and drop. But being a programmer, I highly recommend writing your own queries, or coming back here for some query help.
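For example, the "separate by quarters and do conditional sums" part becomes a single SQL statement once the rows are loaded. A rough sketch, using Python's built-in sqlite3 just so it runs without a server; the records table and its columns are invented for illustration (MySQL has a QUARTER() function that makes the date arithmetic simpler):

import sqlite3

conn = sqlite3.connect("data.db")   # hypothetical database; assumes a table like
                                    # records(event_date TEXT, category TEXT, amount REAL)
query = """
SELECT strftime('%Y', event_date) AS year,
       (CAST(strftime('%m', event_date) AS INTEGER) + 2) / 3 AS quarter,
       SUM(CASE WHEN category = 'sale' THEN amount ELSE 0 END) AS sale_total
FROM records
GROUP BY year, quarter
ORDER BY year, quarter
"""
for row in conn.execute(query):
    print(row)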

Is SQL the "assembler" of the NoSQL database world?

I recently came across http://www.fossil-scm.org/index.html/doc/tip/www/theory1.wiki by D. Richard Hipp, the developer responsible for SQLite.
It got me thinking: is Fossil the only NoSQL database that uses SQL?
Do others use SQL as a 'high-level scripting language'?
From the article, it sounds like Fossil isn't a database any more than git is a database. Yes, it's a thing that contains data, and yes, it's backed by a database, but it seems pretty far from a database itself. So the first part of your question basically relies on a faulty assumption. There is a database called Friendly which uses MySQL to store schema-less models, but it seems like an awkward bandaid sort of solution at best.
I'm certainly not familiar with all of the NoSQL options out there, but, to my knowledge, none of the well-thought-of ones use SQL for anything. MongoDB and CouchDB, the two I'm most familiar with, both use Javascript as part of their query interface, though in very different ways. MongoDB has queries more like what you'd expect from a relational database: you can write an arbitrary query for all documents that match a certain set of attributes. However, unlike a relational database, there's no such thing as a join (you'll only ever get a list of distinct documents back, not compound documents) and you can write arbitrary Javascript code to select documents. CouchDB, on the other hand, does not allow arbitrary queries. Instead, you create views (which are essentially simpler key-value stores) using map/reduce functions written in Javascript and then query those views from a start key to an end key.
In both cases, the type of information being transmitted to the server to perform the query isn't well-suited for the type of problem that SQL is good at solving. The trade-off to SQL being so high-level (to use the logic of the author of the paper) is that it's only suitable for a very narrow set of problems.
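To make the contrast concrete, here is a small sketch of the MongoDB side via pymongo (collection and field names are made up; assumes a local mongod). The second query passes arbitrary JavaScript through MongoDB's $where operator:

from pymongo import MongoClient

docs = MongoClient()["example"]["docs"]      # invented database/collection names

# "an arbitrary query for all documents that match a certain set of attributes"
matching = docs.find({"type": "comment", "author": "alice"})

# arbitrary JavaScript as the selection criterion, via the $where operator
long_threads = docs.find({"$where": "this.replies.length > this.score"})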
The creator of Fossil / SQLite is working on and pushing UnQL as a NoSQL standard:
UnQL means Unstructured Query Language.
It's an open query language for JSON, semi-structured and document databases.
It looks like a stripped down version of SQL.

What does NoSQL mean? Can someone explain it to me in simple words?

In this post, Stack Overflow Architecture, I read about something called NoSQL. I didn't understand what it means, and I tried to search on Google, but it seems I can't figure out exactly what it is.
Can anyone explain what NoSQL means in simple words?
If you've ever worked with a database, you've probably worked with a relational database. Examples would be an Access database, SQL Server, or MySQL. When you think about tables in these kinds of databases, you generally think of a grid, like in Excel. You have to name each column of your database table, and you have to specify whether all the values in that column are integers, strings, etc. Finally, when you want to look up information in that table, you have to use a language called SQL.
A new trend is forming around non-relational databases, that is, databases that do not fall into a neat grid. You don't have to specify which things are integers and strings and booleans, etc. These types of databases are more flexible, but they don't use SQL, because they are not structured that way.
Put simply, that is why they are "NoSQL" databases.
The advantage of using a NoSQL database is that you don't have to know exactly what your data will look like ahead of time. Perhaps you have a Contacts table, but you don't know what kind of information you'll want to store about each contact. In a relational database, you need to make columns like "Name" and "Address". If you find out later on that you need a phone number, you have to add a column for that. There's no need for this kind of planning/structuring in a NoSQL database. There are also potential scaling advantages, but that is a bit controversial, so I won't make any claims there.
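For instance, in a document store the two contacts below, with different fields, can sit in the same collection without any schema change (a hypothetical sketch using pymongo; the names are made up):

from pymongo import MongoClient

contacts = MongoClient()["crm"]["contacts"]   # invented names, assumes a local mongod

# two contacts with different shapes live in the same collection
contacts.insert_one({"name": "Ada", "address": "12 Analytical Way"})
contacts.insert_one({"name": "Bob", "phone": "555-0100", "twitter": "@bob"})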
The disadvantage of NoSQL databases is really the lack of SQL. SQL is simple and ubiquitous. SQL allows you to slice and dice your data more easily to get aggregate results, whereas it's a bit more complicated in NoSQL databases (you'll probably use things like MapReduce, for which there is a bit of a learning curve).
From the NoSQL Homepage
NoSQL is a fast, portable, relational database management system without arbitrary limits, (other than memory and processor speed) that runs under, and interacts with, the UNIX Operating System. It uses the "Operator-Stream Paradigm" described in "Unix Review", March, 1991, page 24, entitled "A 4GL Language". There are a number of "operators" that each perform a unique function on the data. The "stream" is supplied by the UNIX Input/Output redirection mechanism. Therefore each operator processes some data and then passes it along to the next operator via the UNIX pipe function. This is very efficient as UNIX pipes are implemented in memory. NoSQL is compliant with the "Relational Model".
I would also look at this answer on Stack Overflow.
Put simply, it means not using a relational database for data storage.
Here's a relevant article: http://www.computerworld.com/s/article/9135086/No_to_SQL_Anti_database_movement_gains_steam_
NoSQL is a new database philosophy that addresses the shortcomings of relational database design, particularly the problems relational databases have scaling up for today's demanding web environments.
NoSQL is quickly evolving into a movement, with new tools, software, and formats coming up as alternatives to SQL.
RDBMS is as ubiquitous as OOP and while both of these design methodologies solve some problems wonderfully, they don't solve all.
So think of NoSQL as the functional programming of the database world.
Was this simple enough?
NoSQL is the idea that SQL-type databases don't satisfy the demands/requirements of a heavily used database that requires transactions to be reliable and failsafe (or close to it). This ties into the ideas of ACID and CAP, both things worth looking into but not something to lose sleep over unless you run a really popular, transaction-heavy site (e.g. Amazon or eBay). To get a great start on these subjects, I suggest:
http://www.eflorenzano.com/blog/post/my-thoughts-nosql/
and
http://www.julianbrowne.com/article/viewer/brewers-cap-theorem
Something everyone considering a "nosql" approach should consider:
(I shan't risk putting the image into this post as it contains a curse word, and I don't want offensive flags. So clicker beware -- there's an f-word in there. Only click if you have a sense of humor.)
http://browsertoolkit.com/fault-tolerance.png
Found this nice article about no-sql
and this as well:
NoSQL, Yes Search

Implementing a massive search application

We have an email service that hosts close to 10,000 domains. We store the headers of messages in a SQL Server database.
I need to implement an application that will search the message body for keywords. The messages are stored as files on a NAS storage system.
As a proof of concept, I had implemented a SQL Server-based search system where I would parse the message and store all the words in a database table along with the memberid and the messageid. The database was on a separate server to the headers database.
The problem with that system was that I ended up with a table with 600 million rows after processing messages on just one domain. Obviously this is not a very scalable solution.
Since the headers are stored in a SQL Server table, I am going to need to join the messageIDs from the search application to the header table to display the messages that contain the searched for keywords.
Any suggestions on a better architecture? Any better alternative to using SQL server? We receive over 20 million messages a day.
We are a small company with limited resources with respect to servers, maintenance etc.
Thanks
Have a look at Hadoop. It's a complete "map-reduce" framework for working with huge datasets, inspired by Google. I think (but I could be wrong) Rackspace is using it for email search for their clients.
lucene.net will help you a lot, but no matter how you approach this, it's going to be a lot of work.
Consider not using SQL for this. It isn't helping.
GREP and other flat-file techniques for searching the text of the headers is MUCH faster and much simpler.
You can also check out the Java Lucene stuff, which might be useful to you. Both Katta, which is a distributed Lucene index, and Solr, which can use rsync for index syncing, might be useful. While I don't consider either to be very elegant, it is often better to use something that is already built and known to work before embarking on actual development. Without knowing more details, it's hard to make a more specific recommendation.
If you can break up your 600 million rows, look into database sharding. Any query across all rows is going to be slow. At the very least you could break it up by language. If they're all English, well, find some way to split the data that makes sense based on common searches. I'm just guessing here, but maybe domains could be grouped by TLD (.com, .net, .org, etc.).
For full-text search, compare SQL Server vs Lucene.NET vs cLucene vs MySQL vs PostgreSQL. Note that full-text search will be faster if you don't need to rank the results. If a database is still slow, look into performance tuning, and if that fails, look into a Linux-based DB.
http://incubator.apache.org/lucene.net/
http://sourceforge.net/projects/clucene/
I wonder if BigTable (http://en.wikipedia.org/wiki/BigTable) does searching.
Look into the SQL Server full text search services/functionality. I haven't used it myself, but I once read that Stack Overflow uses it.
Three solutions:
Use an already-existing text search engine (Lucene is the most mentioned; there are several more).
Store the whole message in the SQL database and use the included full-text search (most DBs have it these days).
Don't create a new record for each word occurrence; just append a new value to a big field in the word's record. Even better, don't use SQL for this table: use a key-value store where the key is the word and the value is the list of occurrences (see the sketch below). Check some inverted index bibliography for inspiration.
But to be honest, I think the only reasonable approach is #1.
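As a hypothetical sketch of option #3, here is an in-memory inverted index in Python; a real system would persist it in a key-value store (Berkeley DB, etc.) rather than a dict:

import re
from collections import defaultdict

# inverted index: each word maps to the list of (member_id, message_id) pairs
# in which it occurs
index = defaultdict(list)

def add_message(member_id, message_id, body):
    for word in set(re.findall(r"[a-z0-9']+", body.lower())):
        index[word].append((member_id, message_id))

add_message(1, "msg-001", "Meeting moved to Friday")
add_message(2, "msg-002", "Friday invoice attached")

print(index["friday"])     # -> [(1, 'msg-001'), (2, 'msg-002')]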

What's the best way of converting a mysql database to a sqlite one?

I currently have a relatively small (4 or 5 tables, 5000 rows) MySQL database that I would like to convert to an sqlite database. As I'd potentially have to do this more than once, I'd be grateful if anyone could recommend any useful tools, or at least any easily-replicated method.
(I have complete admin access to the database/machines involved.)
I've had to do similar things a few times. The easiest approach for me has been to write a script that pulls from one data source and produces an output for the new data source. Just do a SELECT * query for each table in your current database, and then dump all the rows into an INSERT INTO query for your new database. You can either dump this into a file or pipe it straight into the database frontend.
It's not pretty, but honestly, pretty hardly seems to be a major concern for things like this. This technique is quick to write, and it works. Those are my primary criteria for things like this.
You might want to check out this thread, too. It looks like a couple of people have already put together basically what you need. I didn't look that far into it, though, so no guarantees.
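Here is a minimal sketch of that kind of script in Python, assuming pymysql for the MySQL side and that the target tables already exist in the SQLite file with the same column order; the connection details and table names are made up:

import sqlite3
import pymysql   # any MySQL client library works; connection details are assumptions

src = pymysql.connect(host="localhost", user="user", password="secret", database="olddb")
dst = sqlite3.connect("new.db")      # assumes the tables already exist in the SQLite file
                                     # (created by hand or from an edited schema dump)
for table in ["users", "orders"]:    # hypothetical table names
    with src.cursor() as cur:
        cur.execute(f"SELECT * FROM {table}")
        rows = cur.fetchall()
    if rows:
        placeholders = ", ".join(["?"] * len(rows[0]))
        dst.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
dst.commit()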
As long as the MySQL dump file doesn't use features beyond what SQLite's SQL dialect supports, you should be able to migrate fairly easily:
tgl#moto~$ mysqldump old-database > old-database-dump.sql
tgl#moto~$ sqlite3 -init old-database-dump.sql new-database
I haven't tried this myself.
UPDATE:
Looks like you'll need to do a couple edits of the MySQL dump. I'd use sed, or Google for it.
Just the comment syntax, auto_increment & TYPE= declaration, and escape characters differ.
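If you'd rather script it than use sed, here is a hedged Python equivalent of those edits; the patterns are a starting point, not a complete converter:

import re

with open("old-database-dump.sql") as f:
    dump = f.read()

# substitutions for the differences mentioned above; check the result by hand
dump = re.sub(r"/\*!.*?\*/;?", "", dump, flags=re.S)               # MySQL conditional comments
dump = re.sub(r"\s+AUTO_INCREMENT(=\d+)?", "", dump, flags=re.I)   # auto_increment
dump = re.sub(r"\s+(ENGINE|TYPE)=\w+", "", dump, flags=re.I)       # TYPE= / ENGINE= declarations
dump = dump.replace("\\'", "''")                                   # escaped single quotes

with open("sqlite-dump.sql", "w") as f:
    f.write(dump)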
Here is a list of converters:
http://www.sqlite.org/cvstrac/wiki?p=ConverterTools
An alternative method that would work nicely but is rarely mentioned is: use an ORM class that abstracts the specific database differences away for you. E.g. you get these in PHP (RedBean), Python (Django's ORM layer, Storm, SQLAlchemy), Ruby on Rails (ActiveRecord), and Cocoa (CoreData).
i.e. you could do this:
Load data from source database using the ORM class.
Store data in memory or serialize to disk.
Store data into the target database using the ORM class.
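A rough sketch of those three steps with SQLAlchemy (one of the layers mentioned above), using table reflection rather than mapped classes; the connection strings and file names are assumptions:

from sqlalchemy import create_engine, MetaData   # SQLAlchemy 1.4+ API

src = create_engine("mysql+pymysql://user:secret@localhost/olddb")   # hypothetical DSN
dst = create_engine("sqlite:///new.db")

meta = MetaData()
meta.reflect(bind=src)        # step 1: read the table definitions from MySQL
meta.create_all(bind=dst)     # recreate them in SQLite (types are mapped generically)

with src.connect() as s, dst.begin() as d:
    for table in meta.sorted_tables:
        rows = [dict(r._mapping) for r in s.execute(table.select())]   # step 2: hold rows in memory
        if rows:
            d.execute(table.insert(), rows)                            # step 3: write to the new DB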
If it's just a few tables, you could probably script this in your preferred scripting language and have it all done in the time it'd take to read all the replies or track down a suitable tool. I would, anyway. :)