Migrating a website's knowledgebase to a PostgreSQL database for search querying - sql

I've been tasked with creating a search function for a website's knowledgebase (which is stored in a GitHub repo). I'm only really familiar with building databases with Django, so I'm having trouble understanding how I'm supposed to upload a bunch of HTML files to the database and query them with Postgres. Any pointers on how the database can be structured? I've heard that HTML files can be stored in a text field, but how are the columns structured, does each page get its own row, etc.? And how can I do this with a fairly large knowledge base without having to manually upload each file?
The db hosting platform I am using has a migration utility that says
Uploading will accept data in any of three forms, plain text (SQL), tar archives (uncompressed), or PostgreSQL's own compressed 'custom' format.
That's assuming the database is already structured.

I've heard that HTML files can be stored in a text field, but how are the columns structured, does each page get its own row, etc.?
Storing HTML in a text column is perfectly acceptable. If you're storing the HTML in a column, then each new page gets its own row.
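A minimal sketch of such a layout, assuming psycopg2 locally and table/column names invented for illustration: one table, one row per page, with the raw HTML in a text column you can later index for search.

```python
# Illustrative schema: one row per knowledgebase page, raw HTML in a text column.
# Table and column names are just an example, not prescribed.
import psycopg2

conn = psycopg2.connect("dbname=kb user=postgres")  # placeholder connection details
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS kb_page (
            id    serial PRIMARY KEY,
            path  text UNIQUE NOT NULL,  -- file path within the repo
            title text,                  -- optional, e.g. parsed from <title>
            html  text NOT NULL          -- the raw HTML for the page
        )
    """)
```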
And how can I do this with a fairly large knowledge base without having to manually upload each file?
You just said the hosting provider permits "PostgreSQL's own compressed 'custom' format". So install PostgreSQL locally, get it all up and working, and insert every page locally. Then you can upload to the hosting provider using pg_dump --format=c, which gives you a single compressed file rather than a page-by-page upload.
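A rough sketch of the local load step, assuming the repo is checked out on disk and the kb_page table from the earlier sketch exists (the repo path and connection details are placeholders):

```python
# Walk the checked-out repo and insert every HTML page as one row in kb_page.
from pathlib import Path

import psycopg2

REPO_DIR = Path("/path/to/knowledgebase-repo")  # placeholder path

conn = psycopg2.connect("dbname=kb user=postgres")
with conn, conn.cursor() as cur:
    for html_file in REPO_DIR.rglob("*.html"):
        cur.execute(
            "INSERT INTO kb_page (path, html) VALUES (%s, %s) "
            "ON CONFLICT (path) DO UPDATE SET html = EXCLUDED.html",
            (str(html_file.relative_to(REPO_DIR)),
             html_file.read_text(encoding="utf-8")),
        )

# Then produce the single compressed dump to upload to the hosting provider:
#   pg_dump --format=c kb > kb.dump
```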

Related

Convert an online JSON set of files to a relational DB (SQL Server, MySQL, SQLITE)

I'm using a tool called Teamwork to manage my team's projects.
They have an online API that consists of JSON files that are accessible with authorisation:
https://developer.teamwork.com/projects/introduction/welcome-to-the-teamwork-projects-api
I would like to be able to convert this online data to a SQL DB so I can create custom reports for my management.
I can't seem to find anything ready to do that.
I need a strategy to do this.
If you know how to program, this should be pretty straightforward.
In Python, for example, you could (a rough sketch follows these steps):
Come up with a SQL schema that maps to the JSON data objects you want to store. Create it in a database of your choice.
Use the Requests library to download the JSON resources, if you don't already have them on your system.
Convert each JSON resource to a python data structure using json.loads.
Connect to your database server using the appropriate Python library for your database. e.g., PyMySQL.
Iterate over the python data, inserting rows into the database as appropriate. This is essentially the JSON-to-Tables mapping from step 1 made procedural.
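Here is that sketch. The endpoint, authentication scheme, table, and field names are hypothetical, since the exact shape of the Teamwork payload isn't shown here; adjust them to the real API response.

```python
# Steps 2-5 in one rough pass. The endpoint, auth, table, and field names are
# hypothetical -- adjust them to the actual Teamwork API response.
import json

import pymysql
import requests

API_URL = "https://yoursite.teamwork.com/projects.json"  # hypothetical endpoint
API_KEY = "your-api-key"

resp = requests.get(API_URL, auth=(API_KEY, "x"))  # commonly basic auth with the key; check their docs
resp.raise_for_status()
data = json.loads(resp.text)                       # step 3: JSON text -> Python dicts/lists

conn = pymysql.connect(host="localhost", user="report",
                       password="secret", database="teamwork")
with conn:
    with conn.cursor() as cur:
        for project in data.get("projects", []):   # step 5: map JSON objects to rows
            cur.execute(
                "INSERT INTO projects (id, name, status) VALUES (%s, %s, %s)",
                (project["id"], project["name"], project["status"]),
            )
    conn.commit()
```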
If you are not looking to do this in code, you should be able to use an open-source ETL tool to do this transformation. At LinkedIn, a coworker of mine used Talend Data Integration for solid ETL work of a very similar nature (JSON to SQL). He was very fond of it and I respected his opinion, so I figured I should mention it, although I have zero experience with it myself.

Form a database out of a .txt file

I have a .txt file with rows of the following format
SI1334596|MRKU3|High Cube|1|EGST|First Line|Vehicle one|25|13|
How do I form a database from the above .txt entries to perform SQL queries on it? I also want to assign a name to each of the columns. I have little to no knowledge of importing .txt file entries into a database. I am looking for software that can be installed on my Windows computer, import the .txt file, convert it into a database, and allow me to perform queries thereafter.
If you are asking for recommendations on specific tools, then your Question is off-topic for StackOverflow.com. See the Software Recommendations Stack Exchange.
Here are some possible approaches, with and without programming.
Database Import
Databases often have a built-in command or facility for importing data straight from a text file. When directly importing text with little or no processing, the import is often very fast.
For example, Postgres has the COPY command for imports. This command includes a DELIMITER parameter where you can tell it to expect the vertical bar | as the separator between fields.
You would define your table structure ahead of time, before the import, defining a name and data type for each expected column/field.
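For instance, a hedged sketch using psycopg2 and COPY, with column names that are only guesses at what the nine fields in the sample row mean:

```python
# Create a table for the nine '|'-separated fields and bulk-load the file with
# Postgres COPY. Column names are only guesses at the meaning of the sample row.
import io

import psycopg2

conn = psycopg2.connect("dbname=mydb user=postgres")  # placeholder connection
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS shipments (
            ref_no    text,
            unit_code text,
            unit_type text,
            qty       integer,
            port_code text,
            line_name text,
            vehicle   text,
            length_ft integer,
            width_ft  integer
        )
    """)

    # Strip the trailing '|' on each row so the field count matches the table.
    with open("data.txt", encoding="utf-8") as f:
        buf = io.StringIO("".join(line.rstrip("\n").rstrip("|") + "\n" for line in f))

    cur.copy_expert("COPY shipments FROM STDIN WITH (FORMAT text, DELIMITER '|')", buf)
```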
Custom App
You can write an app to read the text file, process the incoming data, and feed the prepared data to the database. For example, write a Java app that reads the text file, uses JDBC to connect to the server, and SQL written as text strings to instruct the database server on what to do.
You can do this row by row. Or, for increased speed, you can write a batch statement telling the database server to create multiple rows at the same time.
This is the way to go if the data requires complicated processing or there are other related chores such as keeping a history of many such imports, logging other information, reporting duplicate data, and so on.
For Java, the Apache Commons CSV library helps with reading/writing plain text files.
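The paragraph above describes a Java/JDBC app; the same batching idea, sketched here in Python against the illustrative shipments table from the previous snippet (JDBC would use addBatch/executeBatch instead):

```python
# Parse each line yourself, then insert many rows per statement instead of one at a time.
import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect("dbname=mydb user=postgres")
with conn, conn.cursor() as cur:
    with open("data.txt", encoding="utf-8") as f:
        rows = [line.rstrip("\n").rstrip("|").split("|") for line in f]

    execute_values(
        cur,
        "INSERT INTO shipments VALUES %s",  # shipments table from the sketch above
        rows,
        page_size=500,                      # rows per INSERT statement
    )
```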
Spreadsheet
Many spreadsheets, such as LibreOffice Calc, can parse the data, deduce the column headers as titles, and populate a spreadsheet. You can do queries within the spreadsheet. Works well for smaller amounts of data that can comfortably reside within memory. You may not need a database at all.
Database Tool
SQL database engines such as Postgres, H2, SQLite, and MySQL/MariaDB are just black-box engines, not full-blown interactive data tools. You can obtain such tools that connect with these engines. These tools can import/export text files, display lists of data, let you enter/modify data, create forms for better access to the data, and generate reports.
But there are some such data tools that have a database engine built-in. Examples include:
FileMaker
4D
LibreOffice Base

Which dashboard analytics will support Parse.com data source?

I've developed an app that uses Parse.com as the back end. I now need a dashboard analytics software package (such as iDashboards) that will enable me to pull data from my Parse.com database classes and present some of that data in a pretty dashboard fashion.
iDashboards looks to be the kind of tool I'm after, but it only supports certain data source inputs such as JDBC, ODBC, SQL, MySQL, etc. Not being a database guru by any means, I'm not sure if Parse.com can be classed as any of the above, but from what I've read it doesn't come under any of these categories.
Can anybody recommend a way of either connecting Parse.com to iDashboards, or suggest another dashboard tool that will support Parse.com as a data source?
The main issue you are facing is that data coming out of Parse.com is going to be in json format. Most dashboards are going to prefer csv files.
The best dashboard I am aware of is Tableau and there is a discussion about getting json into Tableau here: http://community.tableau.com/ideas/1276
If your preference is using iDashboards then you need to convert the json coming out of Parse into a csv format that iDashboards can consume. You can do that using RJSON as mentioned in the post above but you'll probably have an easier time of it with a simple php or python script that periodically connects to Parse and pulls out data updates for you and then pushes it to your dashboard of choice.
Converting json to csv in php is addressed here: Converting JSON to CSV format using PHP
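A minimal sketch of such a script, assuming the classic Parse REST API and using placeholder class, key, and column names:

```python
# Pull one Parse class via the REST API and flatten it to CSV for iDashboards.
# The class name, keys, and chosen columns are placeholders.
import csv

import requests

resp = requests.get(
    "https://api.parse.com/1/classes/GameScore",  # placeholder class name
    headers={
        "X-Parse-Application-Id": "YOUR_APP_ID",
        "X-Parse-REST-API-Key": "YOUR_REST_KEY",
    },
)
resp.raise_for_status()
rows = resp.json()["results"]  # Parse wraps the objects in a "results" list

with open("gamescore.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["objectId", "score", "createdAt"])
    writer.writeheader()
    for row in rows:
        writer.writerow({name: row.get(name) for name in writer.fieldnames})
```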
The difference is much more fundamental than "unsupported file format". In fact, JSON data coming out of Parse is stored in a so-called denormalized form, which means that a single JSON data file may contain the equivalent of arbitrarily many tables in a relational database. Stated differently, one JSON file may translate into potentially many CSV files, and there's no unique choice of how to perform that translation.
This is a so-called ETL problem, where ETL stands for Extract-Transform-Load. As such, you may be interested in open source ETL tools such as Kettle. Kettle is supported by Pentaho and includes functionality that can help you develop a workflow to turn JSON data into multiple CSV files that can then be imported into iDashboards (or similar). Aside from Kettle, Talend is also widely used for this purpose and has the same ability.
Finally, note that Parse is powered by MongoDB, and exports JSON data that is easily stored and manipulated in MongoDB. As such, a natural fit for reporting on Parse data is any reporting tool built for MongoDB.
As of the time of this writing, there are two such options:
JSON Studio, which is a commercial solution that is built explicitly for MongoDB and has your stated capability to produce dashboards.
SlamData, which is an open source solution, also built for MongoDB, which allows native SQL on the database. The current version does not have reporting capabilities (just CSV export), but the 2.09 version due out in June has reporting dashboards baked in.
An advantage of using a MongoDB reporting tool is that you will not have to wrangle your data into relational form. If it's heavily nested, uses arrays, and so forth, it can be quite painful to develop an ETL workflow and keep it in sync with how the data is changing. Instead, all you have to do is build a script to pipe the raw data from Parse into a MongoDB instance (perhaps hosted by MongoLab or equivalent, if you don't want to host it yourself), and connect the MongoDB reporting tool on top.
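A hedged sketch of that pipe, using pymongo and the same placeholder Parse class as above (the URI, database, and collection names are invented):

```python
# Pipe the raw Parse JSON into a MongoDB instance with pymongo. The class name,
# keys, URI, and database/collection names are placeholders.
import requests
from pymongo import MongoClient

resp = requests.get(
    "https://api.parse.com/1/classes/GameScore",  # placeholder class name
    headers={
        "X-Parse-Application-Id": "YOUR_APP_ID",
        "X-Parse-REST-API-Key": "YOUR_REST_KEY",
    },
)
resp.raise_for_status()

client = MongoClient("mongodb://localhost:27017")  # or your hosted MongoDB URI
client["parse_mirror"]["game_scores"].insert_many(resp.json()["results"])
```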
You might also contact Parse and see if they have a recommended solution for this. It occurs to me they should probably bake some sort of analytical / reporting functionality into their APIs as this is such a common use case.
You can use Axibase Time-Series Database (ATSD) to ingest your data from Parse.com; it has built-in dashboards and widgets for visualization, or you can just export data from ATSD to CSV and use iDashboards.

Uploading a file to a VarBinaryMax field into Windows Azure?

I'm extremely confused. I've created a SQL Database in Windows Azure and created a "video" table with a "video_file" column as varbinary(max), because I want to upload a video file into that field. However, Azure offers no "Upload" option like, say, phpMyAdmin does, where you can hit "browse" and upload a video directly into the field. Can anyone guide me as to how to actually upload a file into a Windows Azure SQL Database so it can be read as a varbinary type? Can it be done within the Azure management portal? Or does it require some sort of external program/service?
To answer your question, the functionality to upload files directly into SQL Azure Database does not exist. This is something you have to do on your own.
Can anyone guide me as to how to actually upload a file into a Windows Azure SQL Database so it can be read as a varbinary type?
Do a search for uploading files in SQL Server and you will find plenty of examples on how to do that. Take a look at this link for example: http://www.codeproject.com/Articles/225446/Uploading-and-downloading-files-to-from-a-SQL-Serv
Can it be done within the Azure management portal? Or does it require some sort of external program/service?
No. This functionality does not exist in Azure Management Portal. As mentioned above, you would need to write some code to do so.
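For example, a minimal sketch of that code in Python with pyodbc; the server, credentials, and the video_name column are placeholders, and only the video table and video_file column come from the question:

```python
# Read the file and insert it into the varbinary(max) column with pyodbc.
import pyodbc

with open("intro.mp4", "rb") as f:
    video_bytes = f.read()

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=yourserver.database.windows.net;DATABASE=yourdb;"
    "UID=youruser;PWD=yourpassword"
)
cursor = conn.cursor()
cursor.execute(
    "INSERT INTO video (video_name, video_file) VALUES (?, ?)",
    ("intro.mp4", pyodbc.Binary(video_bytes)),
)
conn.commit()
```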
A little bit off-topic comment:
May I suggest that instead of saving the video files in the database, you save them in Blob Storage and store the URL of the blob in your table? (A rough sketch of this approach follows the two points below.) There are some advantages I can see in this:
Compared to SQL Database, Azure Blob Storage is much cheaper. If you store video files (in other words, large files) in the database, you will end up with a large database and thus end up paying more money.
You will be choking the database when reading this large data out of it, which will impact the performance of your application.
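Here is that sketch, using the azure-storage-blob package and pyodbc; the connection strings, container, and table/column names are placeholders:

```python
# Upload the video to a blob container and store only its URL in the table.
import pyodbc
from azure.storage.blob import BlobServiceClient

blob_service = BlobServiceClient.from_connection_string("YOUR_STORAGE_CONNECTION_STRING")
blob_client = blob_service.get_blob_client(container="videos", blob="intro.mp4")

with open("intro.mp4", "rb") as f:
    blob_client.upload_blob(f, overwrite=True)

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=yourserver.database.windows.net;DATABASE=yourdb;"
    "UID=youruser;PWD=yourpassword"
)
conn.cursor().execute(
    "INSERT INTO video (video_name, video_url) VALUES (?, ?)",
    ("intro.mp4", blob_client.url),
)
conn.commit()
```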

Uploading pictures to a path VS database

I am about to create an ASP.NET MVC app which will have over 2000 products, and each product will have approximately 20 photos. I am using SQL Server 2008 R2 to manage my data. Which is the better approach here:
1. Uploading pictures to a path and storing their file names in the database, in order to be able to relate them to each other.
2. Storing pictures inside the database as bytes as well, and retrieving them from there when needed.
Definitely the filesystem (store the path) is better; I have done both in the past.
Arguments against storing images in SQL Server:
A) Getting data in and out can be more difficult, as you have to use blob-type objects, and some ORMs don't really cater for this.
B) Your database is much bigger, which affects your backup/restore policy. The more frequently you back up the better, but the space required increases. Storing files on the filesystem, you still need backups, but backing up a filesystem is easy.
C) When you run out of storage space you just add another NAS drive/server and start storing images there, so it scales horizontally.
The common objection is that the data is stored in two places, but for me it's better, because each type of data is stored in the storage medium best suited to it.
Definitely storing as a path rather than the byte array. This means you can easily change the actual image itself without having to alter any code or muck around in SQL (as long as the new file has the same name as the old one).
Hope this helps.
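The app in question is ASP.NET MVC, but the store-the-path pattern is language-agnostic; a hedged sketch of the idea in Python, with folder, table, and column names invented for illustration:

```python
# The "store the path, not the bytes" idea: write the uploaded photo to disk
# and keep only its file name in a product_photo table.
import shutil
import uuid
from pathlib import Path

import pyodbc

PHOTO_ROOT = Path("/var/www/app/uploads/photos")  # folder served as static files

def save_photo(product_id: int, uploaded_file_path: str) -> str:
    # Give the file a collision-proof name and copy it under the photo root...
    dest_name = f"{product_id}_{uuid.uuid4().hex}{Path(uploaded_file_path).suffix}"
    shutil.copy(uploaded_file_path, PHOTO_ROOT / dest_name)

    # ...then store only the file name in the database row.
    conn = pyodbc.connect("DSN=products_db")  # placeholder DSN
    conn.cursor().execute(
        "INSERT INTO product_photo (product_id, file_name) VALUES (?, ?)",
        (product_id, dest_name),
    )
    conn.commit()
    return dest_name
```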
In the database, using FILESTREAM, which combines the two ideas (file and database):
FILESTREAM integrates the SQL Server Database Engine with an NTFS file system by storing varbinary(max) binary large object (BLOB) data as files on the file system. Transact-SQL statements can insert, update, query, search, and back up FILESTREAM data. Win32 file system interfaces provide streaming access to the data.
This changes the file-vs-database argument.
If you want to store paths only, then you'll have to accept the fact that images and database will get out of synch over time.