How can I optimize my varchar(max) column?

I'm running SQL Server and I have a table of user profiles which contains columns for the user's personal info and a profile picture.
When setting up the project, I was given advice to store the profile image in the database. This seemed OK and worked fine, but now that I'm dealing with real data and querying more rows, the data is taking a lifetime to return.
To pull just the personal data, the query takes one second. To pull the images, I'm looking at upwards of 6 seconds for 5 records.
The column is of type varchar(max) and the size of the data varies. Here's an example of the data lengths:
28171
4925543
144881
140455
25955
630515
439299
1700483
1089659
1412159
6003
4295935
Is there a way to optimize my fetching of this data? My query looks like this:
SELECT *
FROM userProfile
ORDER BY id
Indexing is out of the question due to the data lengths. Should I be looking at compressing the images before storing?

It takes time to return data. Five seconds seems a little long for a few megabytes, but there is overhead.
I would recommend compressing the data if retrieval time is that important. You may be able to retrieve and decompress the data faster than you can read the uncompressed data.
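For example, if you are on SQL Server 2016 or later, the built-in COMPRESS and DECOMPRESS functions can handle this server-side. A minimal sketch, assuming the image column is called profileImage (a placeholder name):
-- sketch only: profileImage / profileImageCompressed are placeholder column names
ALTER TABLE userProfile ADD profileImageCompressed VARBINARY(MAX) NULL;

UPDATE userProfile
SET profileImageCompressed = COMPRESS(profileImage);

-- decompress only when the image is actually needed
SELECT id, CAST(DECOMPRESS(profileImageCompressed) AS VARCHAR(MAX)) AS profileImage
FROM userProfile
WHERE id = 1;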
That said, you should not be using SELECT * unless you specifically want the image column. If you are using it in places where the image is not needed, dropping it can improve performance. If you want to make this safe for other users, you can add a view without the image column and encourage them to use the view.
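For example, a view along these lines (the personal-info column names are placeholders) lets other users query the profile data without ever touching the image column:
CREATE VIEW userProfileInfo AS
SELECT id, firstName, lastName, email   -- placeholder columns: everything except the image
FROM userProfile;
GO

-- callers use the view instead of SELECT * on the base table
SELECT *
FROM userProfileInfo
ORDER BY id;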

If it is still possible, take one step back and drop the idea of storing images in the table. Instead, save the path in the database and the image in a folder. This is the most efficient approach.
SELECT *
FROM userProfile
ORDER BY id
Do not use SELECT *, and why are you using ORDER BY? You can do the ordering in the UI code.
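A rough sketch of the path-based layout (the column name and size are just examples):
-- keep only a pointer to the image on disk or in blob storage
ALTER TABLE userProfile ADD imagePath VARCHAR(260) NULL;   -- e.g. '\\fileserver\profiles\42.jpg'

-- queries stay small because no image bytes ever travel with the row
SELECT id, imagePath
FROM userProfile;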

Related

Architectural design clarification

I built an API in nodejs+express that allows reactjs clients to upload CSV files (maximum size at most 1 GB) to the server.
I also wrote another API which, given a filename and an array of row numbers as input, selects the corresponding rows from the previously stored file and writes them to another result file (writeStream).
The resultant file is then piped back to the client (all via streaming).
Currently, as you can see, I am using files (basically nodejs' read and write streams) to manage this asynchronously.
But I have faced serious latency (only 2 cores are used) and some memory leakage (900 MB consumption) when I have 15 requests, each asking for about 600 rows from files of approximately 150 MB.
I also have planned an alternate design.
Basically, I will store the entire file as a SQL table, with the row number as the primary (indexed) key.
I will convert the user-supplied array of row numbers into another table using SQL unnest, and then join these two tables to get the rows needed.
Then I will supply the resultant table back to the client as a csv file.
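Concretely, the idea is something like this (PostgreSQL-style sketch; the table and column names are just placeholders):
-- one row per CSV line, keyed by (file, row number)
CREATE TABLE csv_rows (
    file_id    TEXT    NOT NULL,
    row_number INTEGER NOT NULL,
    row_data   TEXT    NOT NULL,
    PRIMARY KEY (file_id, row_number)
);

-- turn the user-supplied array into a table and join it against the stored rows
SELECT r.row_data
FROM csv_rows AS r
JOIN unnest(ARRAY[2, 57, 300]) AS wanted(row_number)
  ON r.row_number = wanted.row_number
WHERE r.file_id = 'upload-123'
ORDER BY r.row_number;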
Would this architecture be better than the previous architecture?
Any suggestions from devs are highly appreciated.
Thanks.
Use the client to do all the heavy lifting by using the XLSX package for any manipulation of content. Then have an API that saves information about the transaction. This removes the upload to and download from the server and helps you provide a better experience.

BigQuery: Limitation on length of list inside "IN" operator?

Is there any limitation on the number of elements that can be used in the IN operator?
I am asking this because I have a hive partitioned table (connected to a bucket with JSON) that I need to query hourly to extract some information. In order not to re-process already processed files, I use one of the partition fields as an identifier of which IDs I have already processed, so I can query only the new ones with a NOT IN.
I'll show an example.
This is an example of the content of the bucket:
date=2021-05-15/id=ad9isjiodpa/file.jsonl
date=2021-05-15/id=sda0u9dsapo/file.jsonl
date=2021-05-15/id=adsi9ojdsds/file.jsonl
so I can make a query like this, to exclude those I already processed:
SELECT * FROM hive_table where id NOT IN ('ad9isjiodpa', 'sda0u9dsapo')
Usually this query processes around 30 GB per run, and everything works great, everyone is happy. The list usually doesn't have more than 2k elements.
Usually ...
Last time the number of elements exceeded 4k, and this resulted in 2.6 TB of data processed. That was extremely unexpected and made me think that it actually processed ALL the files in the bucket (inside the time range).
Is there some scenario, or documentation I didn't pay enough attention to? Do you know why it did process so much data? What did I do wrong?
The current fix I applied is to split the list of elements into smaller chunks and do something like
SELECT * FROM hive_table where id NOT IN (<chunked_elements1>) AND id NOT IN (<chunked_elements2>) ...
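With the example ids from above, the chunked version would look like this:
SELECT * FROM hive_table
WHERE id NOT IN ('ad9isjiodpa', 'sda0u9dsapo')
  AND id NOT IN ('adsi9ojdsds')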
Will this work?
Thank you very much in advance

Solr Re-indexing taking time

We have indexed data with 143 million rows (docs) into Solr. It takes around 3 hours to index. I used csvUpdateHandler and index the csv file by remote streaming.
Now, when I re-index the same csv data, it is still taking 3+ hours.
Ideally, since there are no changes in the _id values, it should finish quickly. Is there any way to speed up re-indexing?
Please help with this.
You're probably almost as efficient as you can be when it comes to actual submission of data - a possible change is to only submit the data that you know has changed due to some external factor.
Solr would have to query the index for each value anyway, then determine which fields have changed before reindexing, which would probably be more expensive than what it already does.
For that number of documents, 3 hours is quite good. You should work on reducing the number of rows submitted instead, so that the total amount of work is less than it used to be. If the CSV is sorted and rows are only appended, keep the last _id available and only submit the CSV rows that come after that id.

Find out the amount of space each field takes in Google Big Query

I want to optimize the space of my Big Query and google storage tables. Is there a way to find out easily the cumulative space that each field in a table gets? This is not straightforward in my case, since I have a complicated hierarchy with many repeated records.
You can do this in the Web UI by simply typing (and not running) the query below, changing <column_name> to the field of your interest
SELECT <column_name>
FROM YourTable
and looking at the validation message, which shows the respective size.
Important: you do not need to run it – just check the validation message for bytesProcessed, and this will be the size of the respective column.
Validation is free and invokes a so-called dry run.
If you need to do such “column profiling” for many tables, or for a table with many columns, you can script it in your preferred language: use the Tables.get API to get the table schema, then loop through all fields, build the respective SELECT statement for each column, dry-run it, and read totalBytesProcessed, which, as you already know, is the size of the respective column.
I don't think this is exposed in any of the meta data.
However, you may be able to easily get good approximations based on your needs. The number of rows is provided, so for some of the data types, you can directly calculate the size:
https://cloud.google.com/bigquery/pricing
For types such as string, you could get the average length by querying e.g. the first 1000 rows, and use this for your storage calculations.
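For example, something like this (standard SQL; the table and column names are placeholders) gives a rough average per-row size for one string column:
SELECT AVG(LENGTH(my_string_column)) AS avg_chars   -- use BYTE_LENGTH(...) if you want bytes instead of characters
FROM (
  SELECT my_string_column
  FROM `my_project.my_dataset.my_table`
  LIMIT 1000
) AS sample;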

SQL Database Design for Test Data

So, I am trying to learn how to set up good, usable databases. I have run into a problem involving storing large amounts of data correctly. The database I am using is MSSQL 2008. For example:
We test about 50,000 devices a week. Each one of these devices has a lot of data associated with it. Overall, we are just looking at the summary calculated from the raw data. The summary is easy to handle; it's just the raw data I'm trying to enter into a database for future use in case someone wants more details.
For the summary, I have a database full of tables for each set of 50,000 devices. But for each device there is data similar to this:
("DevID") I,V,P I,V,P I,V,P ...
("DevID") WL,P WL,P WL,P ...
That totals 126 data points (~882 chars) for the first line and 12,000 data points (~102,000 chars) for the second line. What would be the best way to store this information? Create a table for each and every device (this seems unwieldy)? Is there a data type that can handle this much info? I am just not sure.
Thanks!
EDIT: Updated ~char count and second line data points.
You could just normalize everything into one table
CREATE TABLE device
( id BIGINT IDENTITY(1,1) PRIMARY KEY   -- MSSQL equivalent of AUTO_INCREMENT
, DevID INT
, DataPoint VARCHAR(100)                -- pick a length that fits a single reading
);
CREATE INDEX IX_device_DevID ON device (DevID);
Pseudocode obviously, since I don't know your exact requirements.
Does this data represent a series of readings over time? Time-series data tends to be highly repetitive, so a common strategy is to compress it in ways that avoid storing every single value. For example, use run-length encoding, or associate time intervals with each value instead of single points.
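A rough sketch of the interval/run-length idea, assuming the readings are timestamped (the names and types are illustrative):
-- one row per run of identical readings instead of one row per reading
CREATE TABLE deviceReading
( DevID     INT           NOT NULL
, StartTime DATETIME      NOT NULL
, EndTime   DATETIME      NOT NULL   -- the value held steady over [StartTime, EndTime]
, Value     DECIMAL(18,6) NOT NULL
, PRIMARY KEY (DevID, StartTime)
);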