Data preparation for uploading into a Redis server

I have a 10 GB .xml file that I want to upload into a Redis server using mass insert. I need advice on how to convert this .xml data into key/value pairs or any other data structure supported by Redis. I am working with the Stack Overflow dumps; for example, take comments.xml.
Data pattern:
row Id="5" PostId="5" Score="9" Text="this is a super theoretical AI question. An interesting discussion! but out of place..." CreationDate="2014-05-14T00:23:15.437" UserId="34"
Let's say I want to retrieve all comments made by a particular UserId or on a particular date. How do I do that?
Firstly:
How do I prepare this .xml data as a data structure suitable for Redis?
How can I upload it into Redis? I am using Redis on Windows; the pipe and cat commands do not seem to work. I have tried using CentOS, but I prefer using Redis on Windows.

Before you choose the proper data structure you need to understand what type of queries you will make. For example, if you have user-specific data and you need to group different user activities per user and get aggregated results, you need to go with different structures, build indexes, split the data into chunks, and so on.
For relatively large amounts of aggregated data (45 GB) I found Sorted Sets with ZRANGE usable, because it has better complexity than LRANGE. You can split your data into chunks based on its size, process each ZRANGE individually in threads, and then combine the results.
On top of that structure you can add indexes with LISTS where you only need to iterate over relatively small amounts of data.
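For the original question, a minimal Python sketch of one way to prepare comments.xml for Redis is shown below: one hash per comment plus set-based indexes per UserId and per date, written out in the Redis protocol so redis-cli --pipe can mass-insert it. The key names and the output file are illustrative assumptions, not a fixed schema.

import xml.etree.ElementTree as ET

def to_resp(*args):
    # Encode one command in the Redis protocol (RESP), the format expected
    # by redis-cli --pipe for mass insertion.
    parts = [f"*{len(args)}\r\n".encode()]
    for a in args:
        b = str(a).encode("utf-8")
        parts.append(b"$" + str(len(b)).encode() + b"\r\n" + b + b"\r\n")
    return b"".join(parts)

with open("comments.redis", "wb") as out:
    # Stream the 10 GB file row by row instead of loading it all into memory.
    for _, row in ET.iterparse("comments.xml", events=("end",)):
        if row.tag != "row":
            continue
        cid, user = row.get("Id", ""), row.get("UserId", "")
        day = (row.get("CreationDate") or "")[:10]      # e.g. 2014-05-14
        # One hash per comment (HSET with multiple fields needs Redis >= 4.0).
        out.write(to_resp("HSET", f"comment:{cid}",
                          "PostId", row.get("PostId", ""),
                          "Score", row.get("Score", ""),
                          "UserId", user,
                          "CreationDate", row.get("CreationDate", ""),
                          "Text", row.get("Text", "")))
        # Secondary indexes: comment ids per user and per day.
        out.write(to_resp("SADD", f"user:{user}:comments", cid))
        out.write(to_resp("SADD", f"date:{day}:comments", cid))
        row.clear()

Feeding comments.redis to redis-cli --pipe loads everything in one pass; the two example queries then become SMEMBERS user:34:comments or SMEMBERS date:2014-05-14:comments followed by HGETALL on each returned id.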

Related

How to store and aggregate data in ~300m JSON objects efficiently

I have an app where I receive 300m JSON text files (10m daily, retention = 30 days) from a Kafka topic.
The data it contains needs to be aggregated every day based on different properties.
We would like to build it with Apache Spark, using Azure Databricks, because the size of the data will grow, we cannot vertically scale this process any more (it currently runs on a single Postgres server), and we also need something that is cost-effective.
Having this job in Apache Spark is straightforward in theory, but I haven't found any practical advice on how to process JSON objects efficiently.
These are the options as I see them:
Store the data in Postgres and ingest it with the Spark job (SQL) - may be slow to transfer the data
Store the data in Azure Blob Storage in JSON format - we may hit the limit on the number of files that can be stored, and it also seems inefficient to read so many files
Store the JSON data in big chunks, e.g. 100,000 JSON objects in one file - it could be slow to delete/reinsert when the data changes
Convert the data to CSV or some binary format with a fixed structure and store it in blobs in big chunks - changing the format would be a challenge, but it would rarely happen in the future, and CSV/binary is quicker to parse
Any practical advice would be really appreciated. Thanks in advance.
There are multiple factors to be considered:
If you need to read the data daily, then I strongly suggest storing it in Parquet format in Databricks. If you are not accessing it daily, then keep it in the Azure storage buckets themselves (computation cost will be minimised).
If the JSON data needs to be flattened, then do all the data manipulations and write into Delta tables with OPTIMIZE conditions.
If the 30-day retention really is mandatory, then be cautious with file formats because the data will grow quickly day by day. Otherwise, alter the table properties to set the retention period to 7 or 15 days.
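To make the Parquet/Delta suggestion concrete, here is a rough PySpark sketch. The storage path, the column names (event_time, properties) and the partition layout are assumptions for illustration, not part of the original question.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()   # already provided on Databricks

# Read the incoming raw JSON objects from blob storage (path is a placeholder).
raw = spark.read.json("wasbs://events@myaccount.blob.core.windows.net/incoming/*.json")

# Flatten the nested payload and derive a partition column from the timestamp.
flat = (raw
        .withColumn("event_date", F.to_date("event_time"))
        .select("event_date", "properties.*"))

# Append into a date-partitioned Delta table; daily aggregations then only
# scan the partitions they actually need.
(flat.write.format("delta")
     .mode("append")
     .partitionBy("event_date")
     .save("/mnt/delta/events"))

An aggregation job can then load format("delta") from /mnt/delta/events, filter on event_date and group by whatever properties the report needs.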

Can BigQuery table extracted rows be randomized

I am currently extracting a BigQuery table into sharded .csv's in Google Cloud Storage -- is there any way to shuffle/randomize the rows for the extract? The GCS .csv's will be used as training data for a GCMLE model, and the current exports are in a non-random order as they are bunched up by similar "labels".
This causes issues when training a GCMLE model, as you must hand the model random examples of all labels within each batch. GCMLE/TF has the ability to randomize the order of rows WITHIN individual .csv's, but there is not (to my knowledge) any way to randomize the rows selected across multiple .csv's. So, I am looking for a way to ensure that the rows being output to the .csv's are indeed random.
Can BigQuery table extracted rows be randomized?
No. The Extract Job API (and thus any client built on top of it) has nothing that would allow you to do so.
I am looking for a way to ensure that the rows being output to the .csv are indeed random.
You should first create tables corresponding to your csv files and then extract them one by one into separate csv files. That way you can control what goes into which csv.
If your concern is the cost of processing (you will need to scan the table as many times as there are csv files), you can check the partitioning approaches in Migrating from non-partitioned to Partitioned tables. This still involves cost, but a substantially reduced one.
Finally, the zero-cost option is to use the Tabledata.list API with paging while distributing the response across your csv files - you can do this easily in the client of your choice.
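A rough sketch of the table-per-csv idea, using the google-cloud-bigquery Python client; the project/dataset/table names, the GCS bucket and the unique id column are placeholders:

from google.cloud import bigquery

client = bigquery.Client()
n_shards = 10

for i in range(n_shards):
    dest = bigquery.TableReference.from_string(
        f"myproject.mydataset.training_shard_{i}")
    job_config = bigquery.QueryJobConfig(destination=dest,
                                         write_disposition="WRITE_TRUNCATE")
    # Hash a unique key into buckets so every label is spread over all shards;
    # unlike RAND(), this stays consistent across the N scans (no dupes/gaps).
    sql = f"""
        SELECT * FROM `myproject.mydataset.training_data`
        WHERE MOD(ABS(FARM_FINGERPRINT(CAST(id AS STRING))), {n_shards}) = {i}
    """
    client.query(sql, job_config=job_config).result()
    # Extract each shard table into its own csv in GCS.
    client.extract_table(dest, f"gs://my-bucket/training/shard_{i}.csv").result()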

How to store millions of statistics records efficiently?

We have about 1.7 million products in our eshop and we want to keep a record of how many views each product had over a 1-year period, recording the views at least every 2 hours. The question is what structure to use for this task.
Right now we tried keeping stats for 30 days back in records that have 2 columns, classified_id and stats, where stats is a stripped-down JSON with the format date:views,date:views... For example, a record would look like
345422,{051216:23212,051217:64233} where 051216,051217=mm/dd/yy and 23212,64233=number of views
This of course is kinda stupid if you want to go 1 year back, since to get the sum of views of, say, 1000 products you need to fetch something like 30 MB from the database and calculate it yourself.
The other way we are thinking of going right now is just to have a massive table with 3 columns, classified_id, date, views, and store each recording in its own row. This of course will result in a huge table with hundreds of millions of rows; for example, if we have 1.8 million classifieds and keep records 24/7 for one year, every 2 hours, we need
1800000*365*12 = 7,884,000,000 (billions with a B) rows, which, while way inside the theoretical limit of Postgres, means that queries on it (say, for updating the views), even with the correct indices, will, I imagine, take some time.
Any suggestions? I can't even imagine how google analytics stores the stats...
This number is not as high as you think. In my current work we store metrics data for websites, and the total number of rows we have is much higher. And in a previous job I worked with a pg database which collected metrics from a mobile network and took in ~2 billion records per day. So do not be afraid of billions of records.
You will definitely need to partition the data - most probably by day. With this amount of data you can find indexes quite useless; it depends on the plans you will see in the EXPLAIN command output. For example, that telco app did not use any indexes at all because they would just slow down the whole engine.
Another question is how quick the responses to queries need to be, and which steps in granularity (sums over hours/days/weeks etc.) you will allow for users. You may even need to make some aggregations for granularities like week, month or quarter.
Addition:
Those ~2 billion records per day in that telco app took ~290 GB per day. That meant inserts of ~23000 records per second using bulk inserts with the COPY command; every bulk was several thousand records. Raw data were partitioned by minute. To avoid disk waits, the db had 4 tablespaces on 4 different disks/arrays and the partitions were distributed over them. PostgreSQL was able to handle it all without any problems, so you should think about proper HW configuration too.
A good idea is also to move the pg_xlog directory to a separate disk or array - not just a different filesystem, it really must be separate HW. SSDs I can recommend only in arrays with proper error checking; lately we had problems with a corrupted database on a single SSD.
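As a minimal illustration of the COPY-based bulk loading mentioned above, here is a Python sketch using psycopg2; the table name, columns and table-per-day layout are assumptions, not something from the answer.

import io
import psycopg2

conn = psycopg2.connect("dbname=stats user=stats")
cur = conn.cursor()

# One table per day, acting as a simple partition.
cur.execute("""
    CREATE TABLE IF NOT EXISTS product_views_20230514 (
        classified_id bigint,
        ts            timestamptz,
        views         integer
    )
""")

# Batch a few thousand rows into an in-memory buffer and load them in one COPY,
# which is far faster than row-by-row INSERTs.
rows = [(345422, "2023-05-14 10:00:00+00", 23212),
        (345423, "2023-05-14 10:00:00+00", 64233)]
buf = io.StringIO("".join(f"{c}\t{t}\t{v}\n" for c, t, v in rows))
cur.copy_expert("COPY product_views_20230514 FROM STDIN", buf)

conn.commit()
cur.close()
conn.close()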
First, do not use the database for recording statistics. Or, at the very least, use a different database. The write overhead of the logs will degrade the responsiveness of your webapp. And your daily backups will take much longer because of big tables that do not need to be backed up so frequently.
The "do it yourself" solution of my choice would be to write asynchronously to log files and then process these files afterwards to construct the statistics in your analytics database. There is good code snippet of async write in this response. Or you can benchmark any of the many loggers available for Java.
Also note that there are products like Apache Kafka specifically designed to collect this kind of information.
Another possibility is to create a time series in a column-oriented database like HBase or Cassandra. In this case you'd have one row per product and as many columns as hits.
Last, if you are going to do it with the database, as @JosMac pointed out, create partitions and avoid indexes as much as you can. Set the fillfactor storage parameter to 100. You can also consider UNLOGGED tables, but read the PostgreSQL documentation thoroughly before turning off the write-ahead log.
Just to raise another non-RDBMS option for you (so a little off topic): you could send text files (CSV, TSV, JSON, Parquet, ORC) to Amazon S3 and use AWS Athena to query them directly using SQL.
Since it queries plain files, you may be able to just send it unfiltered weblogs and query them through JDBC.
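A small sketch of what that could look like from Python with boto3, assuming the view events were dumped to S3 and an external table named view_events was defined over them (database, table and bucket names are placeholders):

import time
import boto3

athena = boto3.client("athena")

qid = athena.start_query_execution(
    QueryString="""
        SELECT classified_id, count(*) AS views
        FROM view_events
        WHERE view_date BETWEEN date '2023-05-01' AND date '2023-05-31'
        GROUP BY classified_id
    """,
    QueryExecutionContext={"Database": "stats"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)["QueryExecutionId"]

# Poll until the query leaves the queue, then read the result rows.
state = "RUNNING"
while state in ("QUEUED", "RUNNING"):
    time.sleep(1)
    state = athena.get_query_execution(
        QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]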

SparkSQL: intra-SparkSQL-application table registration

Context. I have tens of SQL queries stored in separate files. For benchmarking purposes, I created an application that iterates through each of those query files and passes it to a standalone Spark application. This latter first parses the query, extracts the used tables, registers them (using: registerTempTable() in Spark < 2 and createOrReplaceTempView() in Spark 2), and executes effectively the query (spark.sql()).
Challenge. Since registering the tables can be time consuming, I would like to lazily register the tables, i.e. only once when they are first used, and keep that in form of metadata that can readily be used in the subsequent queries without the need to re-register the tables with each query. It's a sort of intra-job caching but not any of the caching options Spark offers (table caching), as far as I know.
Is that possible? If not, can anyone suggest another approach to accomplish the same goal (iterating through separate query files and running a querying Spark application without re-registering tables that have already been registered before)?
In general, registering a table should not take time (except if you have lots of files, where it might take time to generate the list of file sources). It is basically just giving the dataframe a name. What would take time is reading the dataframe from disk.
So the basic question is how the dataframes (tables) are written to disk. If they are written as a large number of small files or in a file format which is slow (e.g. csv), this can take some time (having lots of files takes time to generate the file list, and a "slow" file format means the actual reading is slow).
So the first thing you can try to do is read your data and resave it.
Let's say, for the sake of example, that you have a large number of csv files in some path. You can do something like:
df = spark.read.csv("path/*.csv")
Now that you have a dataframe, you can change it to have fewer files and use a better format, such as:
df.coalesce(100).write.parquet("newPath")
If the above is not enough, and your cluster is large enough to cache everything, you might put everything in a single job, go over all tables in all queries, register all of them and cache them. Then run your sql queries one after the other (and time each one separately).
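A rough sketch of that single-job approach; the table paths, view names and query directory here are assumptions for illustration:

import glob, time
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Register and cache every table once, up front.
tables = {"orders": "/data/orders.parquet", "customers": "/data/customers.parquet"}
for name, path in tables.items():
    spark.read.parquet(path).createOrReplaceTempView(name)
    spark.catalog.cacheTable(name)
    spark.table(name).count()          # force materialisation of the cache

# Then run (and time) every benchmark query against the registered views.
for qfile in sorted(glob.glob("queries/*.sql")):
    with open(qfile) as f:
        sql = f.read()
    start = time.time()
    spark.sql(sql).collect()
    print(qfile, time.time() - start)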
If all of this fails, you can try to use something like Alluxio (http://www.alluxio.org/) to create an in-memory file system and read from that.

Need help designing a DB - for a non DBA

I'm using Google's Cloud Storage & BigQuery. I am not a DBA, I am a programmer. I hope this question is generic enough to help others too.
We've been collecting data from a lot of sources and will soon start collecting data real-time. Currently, each source goes to an independent table. As new data comes in we append it into the corresponding existing table.
Our data analysis requires each record to have a timestamp. However, our source data files are too big to edit before we add them to cloud storage (4+ GB of textual data per file). As far as I know there is no way to append a timestamp column to each row before bringing them into BigQuery, right?
We are thus toying with the idea of creating daily tables for each source, but we don't know how this will work when we have real-time data coming in.
Any tips/suggestions?
Currently, there is no way to automatically add timestamps to a table, although that is a feature that we're considering.
You say your source files are too big to edit before putting in cloud storage... does that mean that the entire source file should have the same timestamp? If so, you could import to a new BigQuery table without a timestamp, then run a query that basically copies the table but adds a timestamp. For example, SELECT all,fields, CURRENT_TIMESTAMP() FROM my.temp_table (you will likely want to use allow_large_results and set a destination table for that query). If you want to get a little bit trickier, you could use the dataset.DATASET pseudo-table to get the modified time of the table, and then add it as a column to your table either in a separate query or in a JOIN. Here is how you'd use the DATASET pseudo-table to get the last modified time:
SELECT MSEC_TO_TIMESTAMP(last_modified_time) AS time
FROM [publicdata:samples.__DATASET__]
WHERE table_id = 'wikipedia'
Another alternative to consider is the BigQuery streaming API (More info here). This lets you insert single rows or groups of rows into a table just by posting them directly to bigquery. This may save you a couple of steps.
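For example, with the google-cloud-bigquery Python client a streaming insert could look roughly like this; the project/dataset/table and field names are placeholders, and the destination schema is assumed to already include the timestamp column:

import datetime
from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("myproject.mydataset.source_a")   # placeholder table

# Stamp each record as it arrives and post it straight to BigQuery,
# instead of rewriting 4+ GB source files to add a timestamp column.
rows = [{
    "raw_line": "one record from the source feed",
    "ingest_ts": datetime.datetime.utcnow().isoformat(),
}]

errors = client.insert_rows_json(table, rows)
if errors:
    print("streaming insert failed:", errors)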
Creating daily tables is a reasonable option, depending on how you plan to query the data and how many input sources you have. If this is going to make your queries span hundreds of tables, you're likely going to see poor performance. Note that if you need timestamps because you want to limit your queries to certain dates and those dates are within the last 7 days, you can use the time range decorators (documented here).