Check that files exist before returning the results of a SQL query

I have a website with an admin panel where people can upload files and data. The website runs on two servers behind NLB, and the two servers are kept in sync with DFS. I also have a SQL Server instance on each server, and those are synced too.
The problem is that syncing files between the two servers sometimes takes a while.
Now imagine I have a table with these fields:
Name, price, fileName
The fileName field points to a location on the physical disk that is synced between the two servers.
Now imagine the website runs this query:
Select * from myTable
How can I tell SQL to return only the records whose fileName exists on the physical disk?
Note: I want to do this in SQL, not in my application.

There are two options:
The first is the one you already mentioned, and it is a good idea: add a column (FileExists, a bit flag) that tells you directly whether the physical copy is present.
The second option is tedious, because you need to write custom logic that looks for the physical copy at runtime, as described in the first link commented by @MichałKomorowski.
Ideally, a database is used to store data and communicate with the application. Interfacing with the outside world from inside the database is tedious work and also hurts performance.
For example, if you implement a function that checks for the physical file and call it in a select query like the one above, the extra work of checking each file's status will definitely increase the query time. Think it over again.
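A minimal sketch of the first option, using Python's built-in sqlite3 module as a stand-in for SQL Server. The helper names, the upload folder, and the idea of running the refresh from a scheduled job on each server are assumptions for illustration and are not part of the original setup. The point is that the query itself stays plain SQL and only filters on the flag.

```python
import os
import sqlite3
import tempfile


def refresh_file_flags(conn, base_dir):
    """Set FileExists to 1 or 0 for every row, based on what is on disk."""
    rows = conn.execute("SELECT rowid, fileName FROM myTable").fetchall()
    for rowid, file_name in rows:
        exists = os.path.isfile(os.path.join(base_dir, file_name))
        conn.execute(
            "UPDATE myTable SET FileExists = ? WHERE rowid = ?",
            (int(exists), rowid),
        )
    conn.commit()


def visible_rows(conn):
    """Return only the records whose file was present at the last refresh."""
    return conn.execute(
        "SELECT Name, price, fileName FROM myTable "
        "WHERE FileExists = 1 ORDER BY Name"
    ).fetchall()


# Demo data: a.pdf has already arrived on this server, b.pdf has not.
upload_dir = tempfile.mkdtemp()  # hypothetical synced folder
with open(os.path.join(upload_dir, "a.pdf"), "w") as handle:
    handle.write("placeholder")

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE myTable (Name TEXT, price REAL, fileName TEXT, "
    "FileExists INTEGER NOT NULL DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO myTable (Name, price, fileName) VALUES (?, ?, ?)",
    [("first", 10.0, "a.pdf"), ("second", 20.0, "b.pdf")],
)
refresh_file_flags(conn, upload_dir)
print(visible_rows(conn))
```

The flag is only as fresh as the last refresh, so how often the refresh runs decides how quickly a newly synced file becomes visible.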

Related

SQLite and multiple writes

I am new to databases and I'm trying to decide which one would suit my needs the most. I am in the planning stages of a program that will store directory references to thousands of PDF files at multiple locations (all under one domain). Basically, each record in the database will hold just a date, a location, the originator's name and a link to the file. Nothing more (no files, nothing fancy). All I'll need to do with the database is sort by location, date, or name of the originator -- that's it. There will be instances where multiple writes would need to occur at the same time. I've read on the SQLite website that only one write is supported at any given time. Does that mean writes to a specific file, or to the database, period?
What I mean is that multiple records would need to be added from different clients at any given time, but the existing records would not need to be modified (and if they do, it would be done from a specific client). To give a little more detail, I'll have several locations at which a service application will be running in the background and listening to folders. Once a file enters a folder, it gets renamed to a specific format and added to the database. It is very likely that two folder-listening apps would try to add files to the database at the same time.
Would I be able to accomplish this with SQLite, or is it one write at any given time to the entire database? If only one write is possible, period, to the entire database, is there a way to implement some sort of spooling system (sort of like on a printer), where writes would wait in a queue with life timers on them?
If it's not possible, then I will look at MySQL. Cost is a concern, so I'm steering towards these two.

Only one write can occur at any given instant, and the write lock covers the whole database file, not a single table or record. Competing writers simply wait their turn for the lock (for as long as the connection's busy timeout allows), so this works as a built-in queue, and you can still achieve thousands of writes per second.
The main question is what type of application this is. Is it a web application developed on one machine and deployed to a single other production machine? Then the extra trouble of installing and maintaining MySQL is not a concern and you're better off using MySQL. If this is a desktop application installed on many desktops, then an embedded database is far easier for development, installation, and maintenance, and in that case use SQLite.
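To see the queuing behaviour in practice, here is a small sketch with Python's built-in sqlite3 module. Several threads stand in for the folder-listening services, each with its own connection to the same database file. The table layout, the names, and the 30-second timeout are assumptions for illustration. The timeout argument is how long a writer waits for the lock before giving up with a "database is locked" error.

```python
import os
import sqlite3
import tempfile
import threading


def init_db(db_path):
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE pdf_files "
        "(added TEXT, location TEXT, originator TEXT, link TEXT)"
    )
    conn.commit()
    conn.close()


def watcher(db_path, location, count):
    """Simulate one folder-listening service adding `count` records."""
    # timeout=30: wait up to 30 seconds for another writer to finish.
    # isolation_level=None lets us issue BEGIN/COMMIT ourselves.
    conn = sqlite3.connect(db_path, timeout=30, isolation_level=None)
    for i in range(count):
        conn.execute("BEGIN IMMEDIATE")  # take the write lock up front
        conn.execute(
            "INSERT INTO pdf_files VALUES (datetime('now'), ?, ?, ?)",
            (location, "someone", f"{location}/file{i}.pdf"),
        )
        conn.execute("COMMIT")
    conn.close()


def run_demo(writers=4, per_writer=25):
    """Run several writers at once and return the number of rows stored."""
    db_path = os.path.join(tempfile.mkdtemp(), "files.db")
    init_db(db_path)
    threads = [
        threading.Thread(target=watcher, args=(db_path, f"site{n}", per_writer))
        for n in range(writers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    conn = sqlite3.connect(db_path)
    total = conn.execute("SELECT COUNT(*) FROM pdf_files").fetchone()[0]
    conn.close()
    return total
```

Calling run_demo() should report every record from every writer, since a writer that finds the database locked waits rather than dropping its insert.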

Why does my SELECT query take so much longer to run on the web server than on the database itself?

I'm running the following setup:
Physical Server
Windows 2003 Standard Edition R2 SP2
IIS 6
ColdFusion 8
JDBC connection to iSeries AS400 using JT400 driver
I am running a simple SQL query against a file in the database:
SELECT
column1,
column2,
column3,
....
FROM LIB/MYFILE
No conditions.
The file has 81 columns (alphanumeric and numeric) and about 16,000 records.
When I run the query in the emulator using the STRSQL command, the query comes back immediately.
When I run the query on my Web Server, it takes about 30 seconds.
Why is this happening, and is there any way to reduce this time?

While I cannot address whatever overhead might be involved in your web server, I can say there are several other factors to consider:
This likely has to do primarily with the differences between the way the two system interfaces work.
Your interactive STRSQL session will start displaying results as quickly as it receives the first few pages of data. You are able to page down through that initial data, but generally at some point you will see a status message at the bottom of the screen indicating that it is now getting more data.
I assume your web server is waiting until it receives the entire result set. It wants to get all the data as it is building the HTML page, before it sends the page. Thus you will naturally wait longer.
If this is not how your web server application works, then it is likely to be a JT400 JDBC Properties issue.
If you have overridden any default settings, make sure that those are appropriate.
In some situations the OPTIMIZATION_GOAL settings might be a factor. But if you are reading the table (aka physical file or PF) directly, in its physical sequence, without any index or key, then that might not apply here.
Your interactive STRSQL session will default to a setting of *FIRSTIO, meaning that the query is optimized for returning the first pages of data quickly, which corresponds to the way it works.
Your JDBC connection will default to a "query optimize goal" of "0", which will translate to an OPTIMIZATION_GOAL setting of *ALLIO, unless you are using extended dynamic packages. *ALLIO means the optimizer will try to minimize the time needed to return the entire result set, not just the first pages.
Or, perhaps first try simply adding FOR READ ONLY onto the end of your SELECT statement.
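With the JT400 driver, connection properties such as "query optimize goal" and "block size" are appended to the JDBC URL as semicolon-separated name=value pairs. The sketch below only assembles such a URL string; the host name and the chosen values are placeholders, and the helper function is hypothetical. In the driver's documentation, a "query optimize goal" of 1 corresponds to *FIRSTIO and 2 to *ALLIO, and "block size" is given in kilobytes. Which values suit your page is something to measure rather than assume.

```python
def build_jt400_url(host, properties):
    """Join JDBC properties onto a jdbc:as400:// URL, separated by semicolons."""
    parts = [f"jdbc:as400://{host}"]
    parts += [f"{name}={value}" for name, value in properties.items()]
    return ";".join(parts)


# Hypothetical host and values, shown only to illustrate the URL format.
url = build_jt400_url(
    "myiseries.example.com",
    {"query optimize goal": 2, "block size": 512},
)
print(url)
```

The resulting string is what would go into the data source's JDBC URL setting.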
Update: a more advanced solution
You may be able to bypass the delay caused by waiting for the entire result set as part of constructing the web page to be sent.
Send a web page out to the browser without any records, or limited records, but use AJAX code to load the remainder of the data behind the scenes.
Use large block fetches whenever feasible, to grab plenty of rows in one clip.
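The difference between "show the first page as soon as it arrives" and "wait for everything" can be sketched with any database cursor. The example below uses Python's built-in sqlite3 module as a stand-in for the iSeries connection; the table, the column names, and the block size of 500 are assumptions for illustration.

```python
import sqlite3


def fetch_in_blocks(cursor, block_size):
    """Yield lists of rows, block_size rows at a time, until the cursor is empty."""
    while True:
        rows = cursor.fetchmany(block_size)
        if not rows:
            return
        yield rows


# Stand-in table with 16,000 rows, like the file in the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myfile (column1 INTEGER, column2 TEXT)")
conn.executemany(
    "INSERT INTO myfile VALUES (?, ?)",
    [(i, f"row {i}") for i in range(16000)],
)

cursor = conn.execute("SELECT column1, column2 FROM myfile")
blocks = fetch_in_blocks(cursor, 500)

first_page = next(blocks)  # available right away, like the first STRSQL screen
remaining = sum(len(block) for block in blocks)  # the rest, fetched afterwards
print(len(first_page), remaining)
```

A page that renders first_page immediately and requests the remaining blocks in the background behaves like the interactive session, while a page that waits for all blocks before sending anything behaves like the slow case in the question.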

One thing you need to remember: the IBM i saves the access paths it creates in the job in case they are needed again. This means that if you log out, log back in, and then run your query, it should take longer the first time and be faster the second time. When running queries in a web application, you may or may not be reusing a job, meaning the access paths may have to be rebuilt.
If speed is important, I would:
Look into optimizing the query. I know there are better sources, but I can't find them right now.
Create a stored procedure. A stored procedure saves the access paths created.

With only 16000 rows and no WHERE or ORDER BY this thing should scream. Break the problem down to help diagnose where the bottleneck is. Go back to the IBM i, run your query in the SQL command line and then use the B, BOT or BOTTOM command to tell the database to show the last row. THAT will force the database to cough up the entire 16k result set, and give you a better idea of the raw performance on the IBM side. If that's poor, have the IBM administrators run Navigator and monitor the performance for you. It might be something unexpected, like the 'table' is really a view and the columns you are selecting might be user defined functions.
If the performance on the IBM side is OK, then look at what ColdFusion is doing with the result set. Not being a CF programmer, I'm no help there. But generally, when I am tasked with solving multi-platform performance issues, the client side tends to consume the entire result set and then use program logic to choose which rows to display or work with. The server is MUCH faster than the client, and given the right hints, the database optimiser can make some very good decisions about how to get at those rows.

Database of images and text

background:
I'm in the design phase of building an app.
I want the app to display text and images. The problem is that I will have a lot of them: hundreds to thousands.
This is my largest app so far, and I am unsure on how to handle all the data.
The question:
What would be the best way to store and access these images and text?
Would I use a formal database approach like SQL?
Or would it be better to navigate files/folders e.g. dropping all the files in res/drawable?
potentially useful facts:
The database will be stored and accessed natively so it can be accessed off-line.
The user will not be adding to the database in any way, only accessing the data.
The database will be updated every 6 months.
The application 'page' will display 1-5 images along with several blocks of text.
Concept:
The app will be like a recipe app: the user will pick some parameters (e.g. ingredients, type, diet), then select a recipe. Several images and blocks of text will then be displayed, showing and detailing the process of the recipe.
I apologize if this is repeated but I didn't see a specific answer for my purposes.

The "best" approach will depend on the functionality of the database server in question.
Generally, you should store the images "in" the database until that becomes a performance issue. Once you start storing images "outside" of the database, you will have to handle all the issues that are normally taken care of by the database: disk space management, orphan records, file name conflicts, and folder file limits, to name just a few. Depending on your situation these may be big issues or they may be nothing to worry about.
I've seen several applications where images (or attachments) were kept "outside" the database, and in each case it was done poorly. There are just so many issues to handle, and most developers don't even think of half of them. In many cases the performance of storing the images "in" the database was acceptable, but the developers decided against it because they just knew it would not perform well.
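As a rough sketch of the "in the database" approach, the snippet below stores image bytes in a BLOB column using Python's built-in sqlite3 module. The table layout, the function names, and the sample bytes are assumptions for illustration. Note how one of the issues listed above, file name conflicts, is handled by an ordinary constraint instead of custom code.

```python
import sqlite3


def store_image(conn, name, data):
    """Insert image bytes under a unique name; duplicates raise IntegrityError."""
    conn.execute("INSERT INTO images (name, data) VALUES (?, ?)", (name, data))
    conn.commit()


def load_image(conn, name):
    """Return the stored bytes for `name`, or None if there is no such image."""
    row = conn.execute(
        "SELECT data FROM images WHERE name = ?", (name,)
    ).fetchone()
    return None if row is None else bytes(row[0])


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE images (name TEXT PRIMARY KEY, data BLOB NOT NULL)")

fake_png = b"\x89PNG\r\n\x1a\n" + b"\x00" * 16  # placeholder bytes, not a real image
store_image(conn, "step1.png", fake_png)

try:
    store_image(conn, "step1.png", fake_png)  # same name again
except sqlite3.IntegrityError:
    print("duplicate name rejected by the database")
```

Whether this performs well enough for a given app is something to measure with realistic image sizes.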

If you're using SQL Server 2008, FILESTREAM storage is ideal for your case. It stores the binary data as files outside of the database files, but the column behaves like a normal field. You are also able to read and write the files using a stream instead of getting or setting the whole file as a byte array (as you would with a plain varbinary(max) column).
If you don't have this functionality in your database, I would recommend storing the images outside of the DB.

It's probably a better idea to use a file-based approach for deployed static resources.
At the very least, this is because taking a dependency on the file system is typically easier to manage than taking a dependency on a DB.
Also, this line indicates some sort of non-web client:
"The database will be stored and accessed natively so it can be accessed off-line."
This means that if you go with the DB approach you'll have a couple of other interesting problems.
Deployment
Depending on your target platform, deploying a DB can be a real bear. What happens if they already have the engine but it's a different version?
Resources
Is your DB going to be client/server based (like MySQL, SQL Server, etc.)? If so, then your app now has to manage the current state of its process. If not, then you'll be using a file-based DB such as SQLite or MS Access, at which point I would question whether using a static DB is worth doing at all.
One final note: there's nothing stopping your content production environment from using a DB. It's quite common for content producers to maintain a database for their content that you will later use to produce the files for publishing/deployment.

IIS access log to SQL normalization

I am looking to insert IIS 6.0 access logs (5 servers, over 400 MB daily) into a SQL database. What scares me is the size. A lot of the information is duplicated (e.g. site name, URL, referrer, browser) and could be normalized with an index and look-up tables.
The reason I am looking at my own database instead of using other tools is that there are 5 servers and I need very custom statistics and reports on each, a few, or all of them. Also, installing any software (especially open source) is an ordeal here (it needs to have 125% of the functionality and takes months).
I wonder what would be the most efficient way to do it. Has anyone seen examples or articles about it?

Whilst I would suggest buying a decent log parsing tool, if you insist on going it alone, take a look at Log Parser
http://www.microsoft.com/downloads/en/details.aspx?FamilyID=890cd06b-abf8-4c25-91b2-f8d975cf8c07&displaylang=en
to help you do some of the heavy lifting, either into SQL or perhaps by getting the results you are after directly.

On the one hand, you will reduce disk space for values a lot by using artificial keys for things like server IP address, user agent, and referrer. Some of that space you save will be lost to the index, but the overall disk savings for 400 MB per day, times 5 servers, should still be substantial.
The tradeoff, of course, is the need to use joins to bring that information back together for reporting.
My nitpick is that replacing one column's values with an artificial key to a two-column lookup table shouldn't be called "normalizing". You can do that without identifying any functional dependencies. (I'm not certain you're proposing to do that, but it sounds like it.)
You're looking at about 12 gigs a month in raw data, right? Did you consider approaching it from a data warehousing point of view? (Instead of an OLTP point of view.)
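The artificial-key idea and the join it requires can be sketched as follows, again using Python's built-in sqlite3 module as a stand-in for the target SQL database. The table and column names and the sample log values are assumptions for illustration. Each distinct user agent string is stored once, and the large hit table keeps only a small integer key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE user_agent (
        id    INTEGER PRIMARY KEY,
        value TEXT UNIQUE NOT NULL
    );
    CREATE TABLE hit (
        ts            TEXT NOT NULL,
        url           TEXT NOT NULL,
        user_agent_id INTEGER NOT NULL REFERENCES user_agent(id)
    );
    """
)


def user_agent_id(conn, value):
    """Return the key for a user agent string, adding it on first sight."""
    conn.execute("INSERT OR IGNORE INTO user_agent (value) VALUES (?)", (value,))
    return conn.execute(
        "SELECT id FROM user_agent WHERE value = ?", (value,)
    ).fetchone()[0]


def add_hit(conn, ts, url, agent):
    conn.execute(
        "INSERT INTO hit VALUES (?, ?, ?)", (ts, url, user_agent_id(conn, agent))
    )


def hits_per_agent(conn):
    """Reporting needs a join to turn the keys back into readable values."""
    return conn.execute(
        "SELECT ua.value, COUNT(*) FROM hit "
        "JOIN user_agent ua ON ua.id = hit.user_agent_id "
        "GROUP BY ua.value ORDER BY ua.value"
    ).fetchall()


add_hit(conn, "2010-01-01 00:00:01", "/index.htm", "Mozilla/4.0")
add_hit(conn, "2010-01-01 00:00:02", "/about.htm", "Opera/9.0")
add_hit(conn, "2010-01-01 00:00:03", "/index.htm", "Mozilla/4.0")
print(hits_per_agent(conn))
```

The same pattern would apply to referrer, site name, and URL columns, with one lookup table per repeated value.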

Low MySQL Table Cache Hit Rate

I've been working on optimizing my site and databases, and I have been using mysqltuner.pl to help with this. I've gotten just about everything right except the table cache hit rate: no matter how high I raise the table cache in my.cnf, I am still hitting about 0% (284 open / 79k opened).
My problem is that I don't really understand exactly what affects this so I don't really know what to look for in my queries/database structure to fix this.

The table cache defines the number of simultaneous file descriptors that MySQL has open. So table cache hit rate will be affected by how many tables you have relative to your limit as well as how frequently you re-reference tables (keeping in mind that it is counting not just a single connection, but simultaneous connections)
For instance, if your limit is 100 and you have 101 tables and you query each one in order, you'll never get any table cache hits. On the other hand, if you only have 1 table, you should generally get close to 100% hit rate unless you run FLUSH TABLES a lot ( as long as your table_cache is set higher than the number of typically simultaneous connections).
So for tuning, you need to look at how many distinct tables you might reference by one process/client and then look at how many simultaneous connections you might typically have.
Without more details, I can't guess whether your case is due to too many simultaneous connections or too many frequently referenced tables.
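The two extremes described above can be reproduced with a toy simulation. It assumes a plain least-recently-used eviction policy and one cache slot per table, which is a simplification of what MySQL actually does; the function name and the table names are made up for illustration.

```python
from collections import OrderedDict


def simulate_table_cache(limit, accesses):
    """Return (hits, opens) for a sequence of table names under LRU eviction."""
    cache = OrderedDict()
    hits = 0
    opens = 0
    for table in accesses:
        if table in cache:
            hits += 1
            cache.move_to_end(table)  # mark as most recently used
        else:
            opens += 1  # table has to be opened again
            cache[table] = True
            if len(cache) > limit:
                cache.popitem(last=False)  # evict the least recently used
    return hits, opens


# 101 tables queried in order, ten times over, with room for only 100.
tables = [f"t{i}" for i in range(101)]
print(simulate_table_cache(100, tables * 10))

# A single table queried 1000 times.
print(simulate_table_cache(100, ["t0"] * 1000))
```

In the first case every access evicts exactly the table that will be needed next, so nothing is ever found in the cache; in the second, only the very first access has to open the table.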

A cache is supposed to maintain copies of hot data. Hot data is data that is used a lot. If you cannot retrieve data out of a certain cache it means the DB has to go to disk to retrieve it.
--edit--
Sorry if the definition seemed a bit obnoxious. A specific cache often covers a lot of entities, and these are database specific, so you first need to find out what is cached by the table cache.
--edit: some investigation --
OK, it seems (from the reply to this post) that MySQL uses the table cache for the data structures used to represent a table. These data structures also (via encapsulation or by having duplicate table entries for each table) represent a set of file descriptors open for the data files on the file system. The MyISAM engine uses one for a table and one for each index; additionally, each active query element requires its own descriptors.
A file descriptor is a kernel entity used for file I/O; it represents the low-level context of a particular file read or write.
I think you are either interpreting the values incorrectly or they need to be interpreted differently in this context. 284 is the number of active tables at the instant you took the snapshot, and the second value represents the number of times a table was acquired since you started MySQL.
I would hazard a guess that you need to take multiple snapshots of this reading and see if the first value (active file descriptors at that instant) ever exceeds your cache size.
P.S. The kernel generally has an upper limit on the number of file descriptors it will allow each process to open, so you might need to tune this if it is too low.
