Can I search in an Azure Table Storage? - azure-storage

I want to store the addresses of thousands of people. The address is the typical name, address, city, etc. I then want to search by first name, last name, city, and so on.
Can I use Azure Table Storage and its API to do that?

The comments regarding indexing your table using Azure Search are excellent, if you don't mind paying the hefty monthly fee.
To answer your question, you can search using the Azure Storage API, but you've got to be very intentional about what fields you want to search when structuring your Azure Storage Tables initially.
The only "indexes" you have to work with are the partition and row keys. Entities with the same partition key are stored together and can be searched efficiently if the partitions are not large. Since Azure Tables do not enforce a schema, you can actually store the same data under different partitions to make search easier.
Assume you had the address: Johnny Appleseed, 839 Sherman Oaks Drive, Knoxville, TN 37497. You could duplicate this data in the same table under the partition keys:
citystate-knoxville_tn_37497
name-appleseed_johnny
street-sherman_oaks
When your user tries to search, choose a partition based on the criteria that the user entered, then Azure will perform a full partition scan to find all matching records. You'll also need to deal with continuation tokens.
You can also limit the query to a partial partition scan if you make the row key part of the search criteria within a partition; Azure will then scan only the rows that could potentially match that row key.
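As a rough sketch of that duplication approach using the azure-data-tables Python package (the connection string, table name, and row-key scheme below are made-up placeholders, not anything prescribed by the API):
from azure.data.tables import TableClient

table = TableClient.from_connection_string("<connection-string>", table_name="addresses")
address = {"Name": "Johnny Appleseed", "Street": "839 Sherman Oaks Drive",
           "City": "Knoxville", "State": "TN", "Zip": "37497"}

# Write the same entity once per searchable criterion, under a different partition key each time.
for pk in ("citystate-knoxville_tn_37497", "name-appleseed_johnny", "street-sherman_oaks"):
    table.create_entity({"PartitionKey": pk, "RowKey": "appleseed_johnny_37497", **address})

# A search by name is now a single-partition query rather than a whole-table scan.
for entity in table.query_entities("PartitionKey eq 'name-appleseed_johnny'"):
    print(entity["City"], entity["Street"])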

Azure Table Storage only has two indexed properties: PartitionKey and RowKey. Querying on non-indexed properties will trigger a whole table scan. If you only need to store several thousand records, Azure Table Storage is a good option due to its low price. However, if you're going to store many more records, I'd suggest choosing SQL Azure, since it lets you create indexes on any column and run more advanced queries.
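To make the difference concrete, here is a hedged sketch with the azure-data-tables Python package (table and property names are placeholders); both calls use the same method, but only the first filter can be served from the PartitionKey/RowKey index:
from azure.data.tables import TableClient

table = TableClient.from_connection_string("<connection-string>", table_name="people")

# Served by the index: PartitionKey (optionally narrowed further by RowKey).
indexed = table.query_entities("PartitionKey eq 'smith' and RowKey ge 'a'")

# Not indexed: filtering on an ordinary property forces a full table scan on the service side.
scanned = table.query_entities("City eq 'Knoxville'")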

Related

Does a SQL query with fewer attributes cost less?

My question is very simple - Does a SQL query with fewer attributes cost less?
Example:
Let's say our users table has 10 columns, like userId, name, phone, email, ...
SELECT name, phone FROM users WHERE userId='id'
is cheaper than this
SELECT * FROM users WHERE userId='id'
Is it true in the perspective of resource utilization?
It depends.
It is certainly possible that limiting the number of columns in the projection improves performance, but it depends on what indexes are available. If we assume that userId is either the primary key or at least an indexed column, you'd expect the database's optimizer to determine which row(s) to fetch by doing a lookup using an index that has userId as its leading column.
If there is an index on (userId, phone), or if phone is an included column on the index (where your database supports that concept), the database can get the phone from the same index it used to find the row(s) to return. In that case the database never has to visit the actual table to fetch the phone. An index that has all the information the database needs to process the query without visiting the table is known as a "covering index". Roughly speaking, searching the index for the rows to return costs about as much as visiting the table to fetch additional columns for the projection, so if limiting the columns in the projection lets the database use a covering index, that too may significantly reduce the cost of the query. The savings are even greater if visiting the table to fetch every column involves multiple reads because of chained rows or out-of-line LOB columns in Oracle, TOAST-able data types in PostgreSQL, etc.
Reducing the number of columns in the projection will also decrease the amount of data that needs to be sent over the network and the amount of memory required on the client to process that data. This tends to be most significant when you have larger fields. For example, if one of the columns in the users table happened to be an LDAP path for the user's record, that could easily be hundreds of characters in length and account for half the network bandwidth consumed and half the memory used on the middle tier. Those things probably aren't critical if you're building a relatively low traffic internal line of business application that needs to serve a few hundred users. It is probably very critical if you're building a high volume SaaS application that needs to serve millions of users.
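You can see the covering-index effect directly with Python's built-in sqlite3 module; the sketch below is only illustrative (the table and index names are made up, and the exact plan text differs between engines and versions):
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (userId TEXT PRIMARY KEY, name TEXT, phone TEXT, email TEXT)")
con.execute("CREATE INDEX idx_users_id_phone ON users (userId, phone)")

# The narrow projection can be answered from idx_users_id_phone alone (a covering index).
for row in con.execute("EXPLAIN QUERY PLAN SELECT phone FROM users WHERE userId = ?", ("id",)):
    print(row)

# SELECT * still has to visit the table after the index lookup.
for row in con.execute("EXPLAIN QUERY PLAN SELECT * FROM users WHERE userId = ?", ("id",)):
    print(row)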
In the grand scheme of things, both are negligible.
If the data is stored by rows, there isn't much of a difference, as retrieving a row doesn't cost much. Perhaps if one of the columns were particularly large, avoiding its retrieval would be beneficial.
But if the data is stored by columns, then the first query is cheaper, as each column is stored in a different location.

Structuring a large DynamoDB table with many searchable attributes?

I've been struggling with the best way to structure my table. It's intended to hold many, many GBs of data (I haven't been given a more detailed estimate). The table will hold claims data (example here), with the partition key being the resourceType and the sort key being the id (although these could potentially be changed). The end user should be able to search by a number of attributes (institution, provider, payee, etc., totaling ~15).
I've been toying with combining global and local secondary indexes to achieve this functionality on the backend. What would be the best way to structure the table to allow a user to search the data by one or more of these attributes in essentially any combination?
If you use resourceType as a partition key you are essentially throwing away the horizontal scaling features that DynamoDB provides out of the box.
The reason to partition your data is such that you distribute it across many nodes in order to be able to scale without incurring a performance penalty.
It sounds like you're looking to put all claim documents into a single partition so you can do "searches" by arbitrary attributes.
You might be better off combining your DynamoDB table with something like ElasticSearch for quick, arbitrary search capabilities.
Keep in mind that a single DynamoDB partition can only hold approximately 10 GB of data and is limited to 3,000 reads per second and 1,000 writes per second (reads + 3 * writes <= 3000).
Finally, you might consider storing your claim documents directly into ElasticSearch.
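As a hedged boto3 sketch of why that matters (table, attribute, and value names here are hypothetical): a query is efficient because it targets one partition key value, while an arbitrary attribute combination falls back to a scan that reads every item and filters afterwards:
import boto3
from boto3.dynamodb.conditions import Key, Attr

table = boto3.resource("dynamodb").Table("claims")

# Efficient: the partition key value is known, so only that partition is read.
by_type = table.query(KeyConditionExpression=Key("resourceType").eq("Claim"))

# Arbitrary attribute search: DynamoDB reads the whole table and filters server-side.
by_provider = table.scan(
    FilterExpression=Attr("provider").eq("prov-123") & Attr("institution").eq("inst-9")
)
print(len(by_type["Items"]), len(by_provider["Items"]))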

Is DynamoDB suitable as an S3 Metadata index?

I would like to store and query a large quantity of raw event data. The architecture I would like to use is the 'data lake' architecture where S3 holds the actual event data, and DynamoDB is used to index it and provide metadata. This is an architecture that is talked about and recommended in many places:
https://aws.amazon.com/blogs/big-data/building-and-maintaining-an-amazon-s3-metadata-index-without-servers/
https://www.youtube.com/watch?v=7Px5g6wLW2A
https://s3.amazonaws.com/big-data-ipc/AWS_Data-Lake_eBook.pdf
However, I am struggling to understand how to use DynamoDB for the purposes of querying the event data in S3. In the link to the AWS blog above, they use the example of storing customer events produced by multiple different servers:
S3 path format: [4-digit hash]/[server id]/[year]-[month]-[day]-[hour]-[minute]/[customer id]-[epoch timestamp].data
Eg: a5b2/i-31cc02/2015-07-05-00-25/87423-1436055953839.data
And the schema to record this event in DynamoDB looks like:
Customer ID (Partition Key), Timestamp-Server (Sort Key), S3-Key, Size
87423, 1436055953839-i-31cc02, a5b2/i-31cc02/2015-07-05-00-25/87423-1436055953839.data, 1234
I would like to perform a query such as: "Get me all the customer events produced by all servers in the last 24 hours" but as far as I understand, it's impossible to efficiently query DynamoDB without using the partition key. I cannot specify the partition key for this kind of query.
Given this requirement, should I use a database other than DynamoDB to record where my events are in S3? Or do I simply need to use a different type of DynamoDB schema?
The architecture looks fine and is feasible with DynamoDB. The DynamoDBMapper class (part of the AWS SDK for Java) can be used to create the model, and it has useful methods to get the data from S3.
DynamoDBMapper: getS3ClientCache() returns the underlying S3ClientCache for accessing S3.
A DynamoDB table can't be queried without a partition key; you have to scan the whole table if the partition key is not available. However, you can create a Global Secondary Index (GSI) on the date/time field and query that for your use case.
In simple terms, a GSI is similar to an index in any RDBMS. The difference is that you query the GSI directly rather than the main table. Normally, a GSI is needed when you want to query DynamoDB for a use case where the table's partition key is not available. There are options to project ALL attributes or only selected attributes from the main table into the GSI.
Global Secondary Index (GSI)
Difference between Scan and Query in DynamoDB
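As a hedged sketch of what querying a GSI directly looks like with boto3 (the index name, attribute names, and the assumption that all attributes are projected are mine, not from the question); note that this still only allows equality on the GSI's partition key, which is exactly the limitation discussed next:
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("s3-metadata")
resp = table.query(
    IndexName="date-index",  # hypothetical GSI whose partition key is a "date" attribute
    KeyConditionExpression=Key("date").eq("2015-07-05"),
)
for item in resp["Items"]:
    print(item["S3-Key"])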
Yes, in this use case it looks like a GSI can't help, because the use case requires a range query on the partition key, and DynamoDB supports only the equality operator on partition keys. Range queries are supported on sort keys (or via filters on other non-key attributes) only when a partition key value is available. You may have to scan the table to fulfill this use case, which is a costly operation.
You either have to think about an alternate data model where you can query by partition key, or use some other database.
First, I've read that same AWS blog page too: https://aws.amazon.com/blogs/big-data/building-and-maintaining-an-amazon-s3-metadata-index-without-servers/
The only way you can make this work with DynamoDB is:
add another attribute called "foo" and put same value 1 for all items
add another attribute called "timestamp" and put epoch timestamp there
create a GSI with partition key "foo" and range key "timestamp", and project all other attributes
Looks a bit dirty, huh? You can then query items for the last 24 hours with partition key 1 (every item has 1) and a range condition on that timestamp key; a sketch follows at the end of this answer. Now, the problems:
A GSI with all items under the same partition key? Performance will suck if the data grows large.
Costs more with a GSI
You should think about the costs as well. Think about your data ingestion rate: putting 1,000 objects per second into a bucket would cost you about $600 per month, and $600 more with the GSI. Just because of that one query need (last 24 hrs), you have to spend $600 more.
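A hedged boto3 sketch of the scheme described above (the table name, index name, and capacity numbers are assumptions, and the GSI has to finish building before it can be queried):
import time
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("s3-metadata")

# One-time setup: GSI with constant partition key "foo" and epoch-millisecond range key "timestamp".
table.update(
    AttributeDefinitions=[
        {"AttributeName": "foo", "AttributeType": "N"},
        {"AttributeName": "timestamp", "AttributeType": "N"},
    ],
    GlobalSecondaryIndexUpdates=[{
        "Create": {
            "IndexName": "foo-timestamp-index",
            "KeySchema": [
                {"AttributeName": "foo", "KeyType": "HASH"},
                {"AttributeName": "timestamp", "KeyType": "RANGE"},
            ],
            "Projection": {"ProjectionType": "ALL"},
            "ProvisionedThroughput": {"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
        }
    }],
)

# "All customer events from all servers in the last 24 hours" becomes one query.
day_ago = int(time.time() * 1000) - 24 * 60 * 60 * 1000
resp = table.query(
    IndexName="foo-timestamp-index",
    KeyConditionExpression=Key("foo").eq(1) & Key("timestamp").gt(day_ago),
)
print(len(resp["Items"]))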
I'm running into the same problems designing this metadata index. DynamoDB just doesn't look right; this is what you always get when you try to use DynamoDB the way you would use an RDBMS. I have a few querying needs like yours. I thought about ElasticSearch and the S3 listing river plugin, but that doesn't look good either, since I would have to manage ES clusters and storage. What about CloudSearch? Looking at its limits, CloudSearch doesn't feel right either.
My requirements:
be able to access the most recent object with a given prefix
be able to access objects within a specific time range
get maximum performance out of S3 by hashing strings in the key space, for AWS EMR, Athena, or Redshift Spectrum
I am lost here. I even thought about the S3 versioning feature, since it gives me the most recent object naturally. Nothing seems quite right, and the AWS documents and blog articles are full of confusion.
This is where I've been stuck for the whole week :(
People at AWS just love drawing diagrams. When they introduce some new architecture scheme or concept, they just put a bunch of AWS product icons there and say it's beautifully integrated.

How to perform select on a massive dataset of 10 billion+ rows

When a user registers, email must be unique, and the registration check must take 1 second at most.
How do Facebook / Google manage to perform a select on a table with several billion rows and get an instant response?
Is it as simple as:
select email from users where email = 'xxx@yyy.zzz' limit 1
Does having an index on email field and running this query on a super fast server do the trick?
Or is there more to it?
Short answer: yes. Though with that much data, you may want to look into things like sharding to make things even faster.
When using SQL, indexing and uniqueness can be ensured by utilizing primary keys. These primary keys are used by the database engine to guarantee that there are no duplicates in the table. Because the keys are used to index the rows, lookups are quick even on a large data set. Set the primary key to be the email address and you should be good to go in this case.
Even when using NoSQL databases like Mongo, Cassandra, etc. it is necessary to create indices on your data so that lookup is quick.
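A minimal sketch with Python's built-in sqlite3 module (table and column names are placeholders); the same shape applies to any SQL engine, with sharding and replication layered on top at very large scale:
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (userId INTEGER PRIMARY KEY, email TEXT NOT NULL)")
con.execute("CREATE UNIQUE INDEX idx_users_email ON users (email)")  # enforces uniqueness and speeds up lookups
con.execute("INSERT INTO users (email) VALUES ('someone@example.com')")

# Registration check: an equality lookup on the unique index, not a full scan.
taken = con.execute(
    "SELECT 1 FROM users WHERE email = ? LIMIT 1", ("someone@example.com",)
).fetchone() is not None
print(taken)  # True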

Efficient table structure or indexing for searchable IP address ranges in SQL

I have raw data provided to me for a geolocation service, in the form of a table of IP address ranges mapped to location data.
The addresses are provided as byte-packed integers (one octet of the dotted quad per byte), permitting easy storage and comparisons, so each row in this table provides a range low address, a range high address, and some text location fields. I don't have to / am not able to use CIDR.
The table is several million records.
I don't have strong SQL chops. The code I inherited simply does a SQL call like:
SELECT location FROM geodata WHERE lookup_address >= range_low AND lookup_address <= range_high
The performance is terrible. My understanding is that this will simply do a linear scan for matching records. To get around this temporarily, I have thrown together a client-side cache in a tree map to bring lookups down to log performance, but a) the memory usage is now hard to justify, and b) detecting live database updates is a problem I don't really want to tackle right now.
It seems like this problem must come up now and then in the SQL world for addresses, telephone numbers, etc.. Is there a "standard" way to organize and index ranges in a SQL table so that I can get at least log performance out of a direct SQL query?
Check that you have an index on your filter fields - in this case range_low and range_high.
CREATE INDEX IX_geodata_range_fields ON geodata (range_low, range_high)
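A hedged sketch with Python's built-in sqlite3 module showing that index in use (the sample values are placeholders; exact plans vary by DBMS, but the engine can at least seek on range_low instead of scanning every row):
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE geodata (range_low INTEGER, range_high INTEGER, location TEXT)")
con.execute("CREATE INDEX IX_geodata_range_fields ON geodata (range_low, range_high)")
con.execute("INSERT INTO geodata VALUES (3232235520, 3232301055, 'Example City')")  # 192.168.0.0 - 192.168.255.255

lookup_address = 3232235777  # 192.168.1.1 as a packed integer
row = con.execute(
    "SELECT location FROM geodata WHERE ? >= range_low AND ? <= range_high",
    (lookup_address, lookup_address),
).fetchone()
print(row)  # ('Example City',)

# The plan should show an index search on range_low rather than a full table scan.
for step in con.execute(
    "EXPLAIN QUERY PLAN SELECT location FROM geodata WHERE ? >= range_low AND ? <= range_high",
    (lookup_address, lookup_address),
):
    print(step)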