I need some storage ideas and tips for optimization [closed] - sql

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don't allow questions seeking recommendations for books, tools, software libraries, and more.
Closed 11 hours ago.
I have a large JSONL file (a text file with one JSON object per line) that is over 45 GB in size. Each JSON object contains a company name, which I'll query from a web application: users will type a name into a search box and results have to appear in less than a second. So far I have tried Cosmos DB with the company's city as the partition key, but that is pretty slow and probably also expensive. Now I'm using SQL Server in Azure with a FULL TEXT index on the name column. My table looks like this:
Company ID | Name | JSON
One idea would be to use the "General Purpose" tier in Azure with 4 vCores; that's about 700 EUR/month and the search takes <100 ms most of the time (though sometimes my web app gets a timeout, with times >30 s). But is there a way to improve this further? Maybe there's a better storage option that I'm not considering, or another Azure tier besides "General Purpose" that's a better fit for this scenario? I'm not an expert in SQL Server, so I would appreciate some ideas on how to optimize this beyond the FULL TEXT catalog. I don't know if it matters, but the data set will only be updated every 3 months, so it doesn't change often.
Here's how I'm querying the database:
SELECT TOP(5) Id, Name
FROM Companies
WHERE
CONTAINS(Name,'%abc%')
The output with "STATISTICS TIME ON":
SQL Server parse and compile time:
CPU time = 15 ms, elapsed time = 49 ms.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
(5 rows affected)
SQL Server Execution Times:
CPU time = 16 ms, elapsed time = 1083 ms.
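For what it's worth, CONTAINS does word and prefix matching against the full-text index rather than LIKE-style substring matching, so as far as I know the % characters in '%abc%' are not acting as wildcards. A minimal sketch of the two usual alternatives (the index name below is made up):
-- Full-text prefix search: the wildcard in CONTAINS is * inside a quoted prefix term.
SELECT TOP (5) Id, Name
FROM Companies
WHERE CONTAINS(Name, '"abc*"');

-- For prefix-only searches, a plain nonclustered index on Name also supports a seek
-- with LIKE 'abc%' (but not with '%abc%', which forces a scan).
CREATE INDEX IX_Companies_Name ON Companies (Name);

SELECT TOP (5) Id, Name
FROM Companies
WHERE Name LIKE 'abc%';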

Related

Query time for a specific entity is 10000 times higher [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Closed 9 months ago.
We have run into a problem: a SELECT filtering on one particular id takes a very long time. For every other id it takes about 5 ms; for this one it takes 10 seconds.
The EXPLAIN output was attached as an image: the left plan is the normal case, the right one is the slow case. It is exactly the same SQL query; the only difference is the value in 'where id = ...'.
What stands out is that a filter appears in the right-hand plan but for some reason not in the left-hand one, along with a huge number of 'rows removed'. Such a number can only be obtained by multiplying the row counts of the joined tables. Again, the SQL query is exactly the same except for the entity id, and the amount of data retrieved per entity is comparable.
One of the tables also uses a btree index. The only special thing about this id is that it comes after a gap in the numbering, e.g. 22, 23, 24, 30. But I was not able to reproduce the problem based on that.
Unfortunately I cannot show the code, but I hope this information is enough to suggest something.
Update:
I found the reason. For some reason Postgres estimates that one of the tables will return only 1 row, when it actually returns 10k+, and therefore chooses the wrong join algorithm. For other entity ids it estimates correctly and picks better plans. How does Postgres estimate the row counts in a plan? What could be the problem?
If I understand correctly, your problem is the data histogram (the column statistics). It's hard to help further without example code, but briefly: one of your tables has an id column with very skewed data. For example, your table has 1 billion records and most ids have about 500 records each, yet a few ids have, say, 20 or 200 million records. If you search for one of these highly non-selective values, the database optimizer will not help you.
Check your data histogram!
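If it helps to make "check your data histogram" concrete, here is a minimal sketch (the orders table and entity_id column are made-up names) of how you could inspect the statistics Postgres keeps for the column, give the planner a larger sample, and compare its estimate with reality:
-- Look at the statistics the planner has for the skewed column.
SELECT n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'entity_id';

-- Collect more detailed statistics for this column, then re-analyze the table.
ALTER TABLE orders ALTER COLUMN entity_id SET STATISTICS 1000;
ANALYZE orders;

-- Compare the estimated row count with the actual one for the slow id.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE entity_id = 30;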

Calculate user size to store in PostgreSQL DB [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 2 years ago.
I'm creating a PostgreSQL DB where I'll store some users, so I need to know the exact size (in MB) of each user.
This is my reasoning:
Profile picture: JPEG 15 Mb at 75% = up to 1.8 MB
Name + surname + work: 20 characters each = 60 B
Date of birth: timestamp = 8 B
Bio: up to 500 characters = 500 B
For a total of (approximately) 2.5 MB.
So if I have 1 GB of available space on the DB I will store up to 400 users.
Is it right? Am I missing something?
I would not store the image binary data in the database; that is not a good idea. Store it in Azure Blob Storage and just store the URL to it in the database. In any case, keep the database as small and fast as possible or you will get issues later down the line, e.g. indexes over large columns will slow queries down over time.
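For illustration, a minimal sketch of such a table (column names and sizes are just placeholders based on the estimates in the question), with the picture kept outside the database:
CREATE TABLE users (
    id            bigserial PRIMARY KEY,
    name          varchar(20),
    surname       varchar(20),
    work          varchar(20),
    date_of_birth date,           -- 4 bytes (a timestamp would be 8 bytes)
    bio           varchar(500),
    picture_url   text            -- points at blob storage, not the image bytes
);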

How to write a correct function in dynamic programming [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
I have the following problem in dynamic programming.
A person has a time machine and can move forward in time either 1 year or 2 years at a time. He starts at year 0 and wants to reach year 100. For every step he takes (1 or 2 years) he pays some fixed fee. There is an array of 100 integers representing the fee he has to pay if he passes through a specific year.
I need to find the minimum amount the person can pay to go from year 0 to year 100, using dynamic programming.
From what I have done so far, I think there should be something like
minCost(i) = min{A[i-1], A[i-2]}
and the base cases are years 1 and 2, which cost A[1] and A[2] respectively. But I think this approach is more of a greedy algorithm than dynamic programming.
I have seen the dynamic programming solution for bin packing, and I understood it and the matrix that represents it.
What should the matrix for the problem above look like?
And how should I build the function and the pseudocode for this problem?
You are almost there.
Think about how you reach year i from year i-1 and from year i-2. There is a fee you are forgetting to take into account.
MinCostToReachYear(i) = min( MinCostToReachYear(i-1) + fee(i-1), MinCostToReachYear(i-2) + fee(i-2) )
You already know the base cases, year 1 and year 2. Can you build up from them with a for loop, or more easily with a recursive function, using the recurrence above? I leave that as an exercise for you.

What are typical lengths of a chat message and a comment in a database? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 8 years ago.
I need to create a column in a SQL Server database. Entries in that column will contain messages from a chat. Previously such messages have been stored as comments.
My main question is:
What is typical text length for chat message and comment?
By the way:
What would happen if I used varchar(max)? How would it impact database size and performance? Is it better to use powers of 2 or powers of 10 (e.g. 128 instead of 100) when choosing text lengths?
Using VARCHAR(MAX) has a disadvantage: you cannot use such a column as a key column in an index.
Generally, your application should impose a maximum length for a chat message. How big that limit is depends very much on what the application is used for, but anything longer than 1000 bytes is probably less a legitimate message than an attempt to disrupt your service.
Whether your maximum length is a power of 2, a power of 10, or any other value has no influence on performance, as long as the row fits in one (8 KB) page.
Short answer - it doesn't matter.
From MSDN:
The storage size is the actual length of the data entered + 2 bytes.
So VARCHAR(10) and VARCHAR(10000) will consume the same amount of data if the values don't exceed 10 characters.
Definitely use N/VARCHAR(MAX); it can grow to 2 GB (if I remember correctly). It only grows as required, though, so it is very space-efficient unless you are only storing very small amounts of data.
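To make the trade-off above concrete, here is a minimal sketch (table, column, and index names are made up) contrasting a bounded column, which can be an index key, with nvarchar(max), which cannot:
-- Bounded column: nvarchar(450) is at most 900 bytes, so it fits within the
-- classic 900-byte index key limit (newer SQL Server versions allow 1700 bytes
-- for nonclustered index keys).
CREATE TABLE ChatMessages (
    Id   bigint IDENTITY PRIMARY KEY,
    Body nvarchar(450) NOT NULL
);
CREATE INDEX IX_ChatMessages_Body ON ChatMessages (Body);

-- Unbounded column: can hold up to 2 GB, but cannot be an index key column;
-- it can only appear in the INCLUDE list of an index (or in a full-text index).
CREATE TABLE Comments (
    Id   bigint IDENTITY PRIMARY KEY,
    Body nvarchar(max) NOT NULL
);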

Improving performance of SQL next row

I wrote an application that performs around 40 queries and then does some processing on the results of each query. (Right now I'm using Qt 3.2.2 in Visual C++ 6.0 on Windows XP with SQL Server 2005, but that's not required.) The paradigm is to create a QSqlQuery object with the query (this causes the query to be performed) and then:
while (query.next()) { operate(query.value(0)); }
By profiling I find that the query.next() call is taking up half the time of the program, which seems excessive, as it's just fetching a row of data (6 or 7 fields).
This performance is unacceptable, and I'm looking for a way to improve this. I'm open to changing anything -- switching my compiler, switching languages, switching the paradigm I use to get data from the database. How can I speed this up?
Here's the query:
select rtrim(Portfolio.securityid), rtrim(type), rtrim(coordinate), rtrim(value)
from MarketData
inner join portfolio
on cast(MarketData.securityid as varchar(36))=portfolio.securityid
where Portfolioname=?
and type in
('bond_profit', 'bondoption_profit', 'equity_profit', 'equityoption_profit')
and marketdate=?
order by Portfolio.securityid, type, coordinate
CPU usage is around 40% while the program is running, so I suspect it's spending the majority of its time waiting for the .next() call to return with more data.
Performing the same query in SSMS returns 4.5 million rows in about 5 minutes, but the total time spent waiting on .next() during the run of the program is 30 minutes.
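If the time really is going into fetching the 4.5 million rows one by one, no index will fix the client side, but on the server side one thing worth checking (a rough sketch; the index names are made up, and this assumes the tables are not already indexed this way) is whether an index covers the query:
-- Covers the WHERE filter and the selected columns on MarketData so rows can be
-- streamed without extra lookups.
CREATE INDEX IX_MarketData_date_type
    ON MarketData (marketdate, type)
    INCLUDE (securityid, coordinate, value);

-- Supports the Portfolioname filter and the join on the Portfolio side.
CREATE INDEX IX_Portfolio_name
    ON Portfolio (Portfolioname, securityid);

-- Note: CAST(MarketData.securityid AS varchar(36)) in the join prevents an index
-- seek on that column; storing both securityid columns with the same type would
-- avoid the conversion.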