Google BigTable and free text indexing

The main example in the BigTable paper is storing web content in a big table.
Is there an article discussing how Google indexes free text?
Would it be right to say they use BigTable under the hood? Is there any information about the BigTable schema?

I don't think you can get the underlying schema. BigTable is built on top of GFS, if you are looking at the file-system level.
Web Search for a Planet: The Google Cluster Architecture
The section "Serving a Google Query" (page 23) gives you a general idea of how data is organised into buckets.

Related

How to create a CDN to store and serve images and videos?

We have a requirement to store and retrieve content (audio, video, images) quickly. We are not allowed to use commercial providers like AWS S3 etc.
Any suggestions on how to go about it? The challenges I foresee are
a) Storage
b) Fast retrieval
c) Caching
Would Cassandra help with the above?
This is a very typical use case for Cassandra for things like streaming services or media-sharing social apps.
The difference is that the media files are saved in an object store and only the metadata (such as URL of the media file) is stored in Cassandra so you can retrieve information about where the media is stored really quickly.
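As a rough illustration of that pattern, here is a minimal sketch using the DataStax cassandra-driver; the keyspace, table, and object-store URL are all hypothetical, and the keyspace is assumed to already exist:

import uuid
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("media")  # hypothetical keyspace, assumed to exist

session.execute("""
    CREATE TABLE IF NOT EXISTS media_metadata (
        media_id uuid PRIMARY KEY,
        title text,
        content_type text,
        storage_url text  -- points at the object store, not the media bytes
    )
""")

media_id = uuid.uuid4()
session.execute(
    "INSERT INTO media_metadata (media_id, title, content_type, storage_url) "
    "VALUES (%s, %s, %s, %s)",
    (media_id, "cat-video.mp4", "video/mp4",
     "https://objectstore.example.com/media/cat-video.mp4"),
)

# Fast metadata lookup; the client then fetches the bytes from the object store.
row = session.execute(
    "SELECT storage_url FROM media_metadata WHERE media_id = %s", (media_id,)
).one()
print(row.storage_url)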
As a side note, I wanted to warn you that others will likely vote to close your question because it is soliciting opinions vs a specific software issue. Cheers!

Amazon CloudSearch and Amazon Kendra

I was wondering what the main difference is between Amazon CloudSearch and Kendra. Why are there two different search products from the same company that compete with each other? Both look similar, and I am not sure how their features differ or how one is differentiated from the other.
Amazon CloudSearch: Set up, manage, and scale a search solution for your website or application. Amazon CloudSearch enables you to search large collections of data such as web pages, document files, forum posts, or product information. With a few clicks in the AWS Management Console, you can create a search domain, upload the data you want to make searchable to Amazon CloudSearch, and the search service automatically provisions the required technology resources and deploys a highly tuned search index.
Amazon Kendra: Enterprise search service powered by machine learning. It is a highly accurate and easy to use enterprise search service that’s powered by machine learning. It delivers powerful natural language search capabilities to your websites and applications so your end users can more easily find the information they need within the vast amount of content spread across your company.
The key difference between the two services is that Amazon CloudSearch is based on Solr, a keyword engine, while Amazon Kendra is an ML-powered search engine designed to provide more accurate search results over unstructured data such as Word documents, PDFs, HTML, PPTs, and FAQs. Kendra was designed from the ground up to natively handle natural language queries and return specific answers, instead of just lists of documents like keyword engines do.
Another key difference is that in CloudSearch, to upload data to your domain, it must be formatted as a valid JSON or XML batch. Kendra, on the other hand, provides out-of-the-box connectors that allow customers to automatically index content from popular repositories like SharePoint Online, S3, Salesforce, ServiceNow, etc., directly into the Kendra index. So, depending on your use case, Kendra may be a better choice, especially if you're considering the service for enterprise search applications, or even website search where deeper language understanding is important. Hope this helps, happy to address follow-up questions. You can also visit our Kendra FAQ page for more specific answers about the service: https://aws.amazon.com/kendra/faqs/
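For instance, here is a minimal sketch of a CloudSearch batch upload with boto3; the domain endpoint, document IDs, and field names are all hypothetical:

import json
import boto3

# The document endpoint is specific to your search domain (hypothetical here).
client = boto3.client(
    "cloudsearchdomain",
    endpoint_url="https://doc-mydomain-abc123.us-east-1.cloudsearch.amazonaws.com",
)

# CloudSearch expects a batch of add/delete operations as JSON or XML.
batch = [
    {"type": "add", "id": "doc1",
     "fields": {"title": "Hello", "content": "First searchable document"}},
    {"type": "delete", "id": "doc0"},
]

client.upload_documents(
    documents=json.dumps(batch),
    contentType="application/json",
)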

Pull data from HTTP request API to Google Cloud

I have an app that sends me data from an API. The data is semi-structured (JSON).
I would like to send this data to Google BigQuery in order to store all the information.
However, I'm not able to find out how to do this properly.
So far I have used Node.js on my own server to get the data using POST requests.
Could you please help me? Thanks.
You can use the BigQuery API to do streaming inserts (a sketch follows these options).
You can also write the data to Pub/Sub or Google Cloud Storage and use Dataflow pipelines to load it into BigQuery (you can use either streaming inserts, which incur costs, or batch load jobs, which are free).
You can also log to Stackdriver and from there select logs and send them to BigQuery (there are already direct options for this in GCP; note that under the hood it performs streaming inserts).
If you feel that setting up Dataflow is complicated, you can store your files and perform batch load jobs by calling the BigQuery API directly. Note that there is a limit on the number of batch loads you can make per day on a particular table (1,000 per day).
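As a minimal sketch of the streaming-insert option with the google-cloud-bigquery client library (the project, dataset, table, and row fields below are hypothetical):

from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.api_events"  # hypothetical table

# Streaming inserts incur a cost but make rows queryable almost immediately.
rows = [{"event": "signup", "user_id": "u-123", "ts": "2020-01-01T00:00:00Z"}]
errors = client.insert_rows_json(table_id, rows)
if errors:
    print("Insert failed:", errors)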
There is a page in the official documentation that lists all the options for loading data into BigQuery.
For simplicity, you can just send data from your local data source. You should use the Google Cloud client libraries for BigQuery. Here you have a guide on how to do that, as well as a relevant code example.
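A minimal sketch of that local-source route, assuming a newline-delimited JSON file and a hypothetical table:

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # infer a schema from the semi-structured JSON
)
with open("events.ndjson", "rb") as source_file:  # hypothetical local file
    job = client.load_table_from_file(
        source_file, "my-project.my_dataset.api_events", job_config=job_config)
job.result()  # waits for the load job to complete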
But my honest recommendation is to send the data to Google Cloud Storage and, from there, load it into BigQuery (sketched after these remarks). This way the whole process will be more stable.
You can check all the options from the first link that I've posted and choose whatever fits your workflow best.
Keep in mind the limitations of this process.
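Here is a minimal sketch of that recommended route, loading newline-delimited JSON from a hypothetical Cloud Storage bucket into a hypothetical table:

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,
)
load_job = client.load_table_from_uri(
    "gs://my-bucket/api-dumps/*.json",    # hypothetical bucket
    "my-project.my_dataset.api_events",   # hypothetical table
    job_config=job_config,
)
load_job.result()  # batch loads are free, but limited per table per day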

Storing Files ( images, Microsoft office documents )

What is the best way to store images and Microsoft Office documents:
Google Drive
Google Storage
You may want to consider checking this page to help you choose which storage option suits you best and also learn more.
To differentiate the two:
Google Drive
A collaborative space for storing, sharing, and editing files, including Google Docs. It is good for the following:
End-user interaction with docs and files
Collaborative creation and editing
Syncing files between cloud and local devices
Google Cloud Storage
A scalable, fully managed, highly reliable, and cost-efficient object/blob store, good for these:
Images, pictures, and videos
Objects and blobs
Unstructured data
In addition to that, see Google Cloud Platform - FAQ for more insights.
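If you go with Cloud Storage, uploading a file programmatically is straightforward; here is a minimal sketch with the google-cloud-storage client library, using a hypothetical bucket and file:

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-documents-bucket")  # hypothetical bucket
blob = bucket.blob("reports/q1-summary.docx")
blob.upload_from_filename(
    "q1-summary.docx",  # hypothetical local file
    content_type="application/vnd.openxmlformats-officedocument"
                 ".wordprocessingml.document",
)
print(f"gs://{bucket.name}/{blob.name}")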
Different approaches can be taken into consideration. Google Docs is widely used for working with office documents online; it provides much the same layout as Microsoft Office, and the advantage is that you can share a document with other people and edit it online at any time.
Google Drive (useful way to store your files)
Every Google Account starts with 15 GB of free storage that's shared across Google Drive, Gmail, and Google Photos. When you upgrade to Google One, your total storage increases to 100 GB or more depending on what plan you choose.
Mediafire (another useful way to store your files)
MediaFire's basic package gives you 10 GB of cloud space for free, and the files you store in MediaFire can be protected with password encryption. It offers a number of other features as well and is worth exploring.

Using BigQuery for logs analysis

I'm trying to do log analysis with BigQuery. Specifically, I have an App Engine app and a JavaScript client that will be sending log data to BigQuery. In BigQuery, I'll store the full log text in one column but also extract important fields into other columns. I then want to be able to do ad hoc queries over those columns.
Two questions:
1) Is BigQuery particularly good or particularly bad at this use case?
2) How do I set up revolving logs? I.e., I want to store only the last N logs or the last X GB of log data. I see that delete is not supported.
Just so you know, there is an excellent demo of moving App Engine log data to BigQuery via App Engine MapReduce, called log2bq (http://code.google.com/p/log2bq/)
Re: "use case" - Stack Overflow is not a good place for judgements about best or worst, but BigQuery is used internally at Google to analyse really really big log data.
I don't see the advantage of storing full log text in a single column. If you decide that you must set up revolving "logs," you could ingest daily log dumps by creating separate BigQuery tables, perhaps one per day, and then delete the tables when they become old. See https://developers.google.com/bigquery/docs/reference/v2/tables/delete for more information on the Table.delete method.
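A minimal sketch of that table-per-day rotation with the modern google-cloud-bigquery client (the older v2 Table.delete REST method linked above does the same thing); the dataset and daily table prefix are hypothetical:

import datetime
from google.cloud import bigquery

client = bigquery.Client()
KEEP_DAYS = 30
cutoff = datetime.date.today() - datetime.timedelta(days=KEEP_DAYS)

for table in client.list_tables("my-project.logs_dataset"):  # hypothetical dataset
    if not table.table_id.startswith("applog_"):  # hypothetical daily prefix
        continue
    day = datetime.datetime.strptime(
        table.table_id[len("applog_"):], "%Y%m%d").date()
    if day < cutoff:
        client.delete_table(table.reference)  # drops logs older than N days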
After implementing this - we decided to open source the framework we built for it. You can see the details of the framework here: http://blog.streak.com/2012/07/export-your-google-app-engine-logs-to.html
If you want your Google App Engine (Google Cloud) project's logs to be in BigQuery, Google has added this functionality built in to the new Cloud Logging system. It is a beta feature known as "Logs Export"
https://cloud.google.com/logging/docs/install/logs_export
They summarize it as:
Export your Google Compute Engine logs and your Google App Engine logs to a Google Cloud Storage bucket, a Google BigQuery dataset, a Google Cloud Pub/Sub topic, or any combination of the three.
We use the "Stream App Engine Logs to BigQuery" feature in our Python GAE projects. This sends our app's logs directly to BigQuery as they are occurring to provide near real-time log records in a BigQuery dataset.
There is also a page describing how to use the exported logs.
https://cloud.google.com/logging/docs/export/using_exported_logs
When you want to query logs exported to BigQuery over multiple days (e.g. the last week), you can use a legacy SQL query with a FROM clause like this:
FROM
  (TABLE_DATE_RANGE(my_bq_dataset.myapplog_,
                    DATE_ADD(CURRENT_TIMESTAMP(), -7, 'DAY'),
                    CURRENT_TIMESTAMP()))
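If you want to run that kind of query from code rather than the console, here is a minimal sketch with the google-cloud-bigquery client; note that TABLE_DATE_RANGE requires legacy SQL, and the dataset/table prefix comes from the example above while the rest is assumed:

from google.cloud import bigquery

client = bigquery.Client()
query = """
SELECT *
FROM (TABLE_DATE_RANGE(my_bq_dataset.myapplog_,
                       DATE_ADD(CURRENT_TIMESTAMP(), -7, 'DAY'),
                       CURRENT_TIMESTAMP()))
LIMIT 100
"""
job_config = bigquery.QueryJobConfig(use_legacy_sql=True)  # TABLE_DATE_RANGE is legacy SQL
for row in client.query(query, job_config=job_config).result():
    print(dict(row))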