I have a machine which creates a new log file at the beginning of the day (12 a.m.) and updates that file whenever there are any changes, until the end of the day.
How do I import the data in near real time (every 30 sec, 1 min, or whenever there are changes) to my SQL Server database?
Will SQL Server 2008 be able to access the active log file? If not, would it be easier if I let my machine create a new log file for every update? But if so, how do I import so many log files with different names in real time? (I must be able to scale the solution up to multiple machines.)
Thanks a lot
You can log each new line with a reversed time stamp.
Since you need to act only when the file changes, you can implement an in-memory queue that reads from the file and stores the data.
Then implement a producer-consumer model in which one thread reads the file and loads data into the queue and the consumer writes to the database.
A Windows service can then keep reading from the queue and logging to SQL Server.
(Since it is a producer-consumer setup, there will not be any busy waiting when the queue is empty.)
You will also somehow have to notify the producer thread whenever a new log entry is made. This can be done through sockets or some other means if you have access to the code that is doing the logging.
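A minimal Java sketch of that queue, assuming the notification mechanism hands new lines to onNewLogLine and the database write is a placeholder:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Minimal producer-consumer sketch: the producer is fed by whatever notifies us
// of new log lines; the consumer writes each line to the database.
public class LogPipeline {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

    // Called by the component that detects a new log line (e.g. a socket listener).
    public void onNewLogLine(String line) throws InterruptedException {
        queue.put(line);
    }

    // Consumer loop, typically hosted in a Windows service / daemon thread.
    public void consume() throws InterruptedException {
        while (!Thread.currentThread().isInterrupted()) {
            String line = queue.take();      // blocks while the queue is empty, no busy waiting
            writeToDatabase(line);           // hypothetical DB call, e.g. a JDBC INSERT
        }
    }

    private void writeToDatabase(String line) {
        // INSERT INTO LogEntries ... (omitted here)
    }
}
```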
If you have no control over the application producing the file then you have little option but to poll the file. Write an application that regularly polls the file and writes the deltas to the database. The application will need to record a high water mark that it has last read to.
Another wrinkle is that if the application does not close the file between writes then the last accessed time stamp might not be updated, so checking the age of the file may not be reliable. In this case you need to implement something like this process:
Open the log file
Seek to your last recorded EOF position
Try reading
If successful, process the new data until you get to the new EOF.
Update your persistent EOF position
Close the file
You will need to make sure that the number of bytes read aligns with your file seek position. If the log file is Unicode then it may not have a 1:1 mapping between bytes and characters, so you may need to read chunks of the file in binary mode and do the translation to characters from the buffer.
Once you have the log file entries parsed, you can just insert the data, or use SqlBulkCopy for larger data volumes.
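Here is a minimal Java sketch of that tail-reading loop, assuming a UTF-8 log file and a hypothetical side file (log.offset) for the persistent high-water mark; a real implementation would also handle lines or multi-byte characters split across chunk boundaries:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of the polling loop: seek to the last recorded EOF, read the delta,
// persist the new position. File names are assumptions for illustration.
public class LogTailer {
    private final Path logFile = Path.of("machine.log");
    private final Path offsetFile = Path.of("log.offset");

    public void pollOnce() throws IOException {
        long lastEof = readSavedOffset();
        try (RandomAccessFile raf = new RandomAccessFile(logFile.toFile(), "r")) {
            raf.seek(lastEof);                                 // seek to last recorded EOF
            byte[] buffer = new byte[64 * 1024];
            int read;
            while ((read = raf.read(buffer)) > 0) {
                // Binary-mode read, then translate bytes to characters from the buffer.
                String chunk = new String(buffer, 0, read, StandardCharsets.UTF_8);
                processNewData(chunk);                         // parse entries, insert into the database
                lastEof += read;                               // advance by bytes actually read
            }
        }
        saveOffset(lastEof);                                   // update persistent EOF position
    }

    private long readSavedOffset() throws IOException {
        return Files.exists(offsetFile)
                ? Long.parseLong(Files.readString(offsetFile).trim())
                : 0L;
    }

    private void saveOffset(long offset) throws IOException {
        Files.writeString(offsetFile, Long.toString(offset));
    }

    private void processNewData(String chunk) {
        // Parse log entries and write them to SQL Server (omitted here).
    }
}
```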
If you can relax your latency constraints and the log file is small enough then you could possibly just implement a process that copies the log file to a staging area and reloads the whole thing periodically.
How about an SSIS package called by a SQL Server Agent scheduled job?
I have a huge file that can have anywhere from a few hundred thousand to 5 million records. It is a tab-delimited file. I need to read the file from an FTP location, transform it, and finally write it to an FTP location.
I was going to use the FTP connector to get a repeatable stream and feed it into a Mule batch job. Inside the batch process the idea was to use a batch step to transform the records and finally, in the batch aggregator, do an FTP write to the destination in append mode, 100 records at a time.
Q1. Is this a good approach or is there some better approach?
Q2. How does the Mule batch load and dispatch phase work (https://docs.mulesoft.com/mule-runtime/4.3/batch-processing-concept#load-and-dispatch)? Is it waiting for the entire stream of millions of records to be read into memory before dispatching a Mule batch instance?
Q3. While doing the FTP write in the batch aggregator there is a chance that parallel threads will start appending content to the FTP file at the same time, thereby corrupting the records. Is that avoidable? I read about file locks (https://docs.mulesoft.com/ftp-connector/1.5/ftp-write#locks). My assumption is it will simply raise a file lock exception and not necessarily wait to write to FTP in append mode.
Q1. Is this a good approach or is there some better approach?
See the answer to Q3; this might not work for you. You could instead use a foreach and process the file sequentially, though that will increase the processing time significantly.
Q2. How does the Mule batch load and dispatch phase work (https://docs.mulesoft.com/mule-runtime/4.3/batch-processing-concept#load-and-dispatch)? Is it waiting for the entire stream of millions of records to be read into memory before dispatching a Mule batch instance?
Batch doesn't load large numbers of records into memory; it uses file-based queues. And yes, it loads all records into the queue before starting to process them.
Q3. While doing the FTP write in the batch aggregator there is a chance that parallel threads will start appending content to the FTP file at the same time, thereby corrupting the records. Is that avoidable? I read about file locks (https://docs.mulesoft.com/ftp-connector/1.5/ftp-write#locks). My assumption is it will simply raise a file lock exception and not necessarily wait to write to FTP in append mode.
The file write operation will throw a FILE:FILE_LOCK error if the file is already locked. Note that Mule 4 doesn't manage errors through exceptions, it uses Mule errors.
If you are using DataWeave flat file to parse the input file, note that it will load the file into memory and use significantly more memory than the file itself to process it, so you are probably going to get an out-of-memory error anyway.
I have to implement a Kafka consumer which reads data from a topic and writes it to a file based on the account id (there will be close to a million of them) present in the payload. Assuming there will be around 3K events per second, is it OK to open and close a file for each message read?
Or should I consider a different approach?
I am assuming the following:
Each account id will be unique and will have its own unique file.
It is okay to have a little lag in the data in the file, i.e. the data in the file will be near real time.
The data read per event is not huge.
Solution:
Kafka Consumer reads the data and writes to a database, preferably a NoSQL db.
A separate single thread periodically reads the database for newly inserted records and groups them by accountId.
It then iterates over the accountIds and, for each accountId, opens the file, writes the data at once, closes the file and moves to the next accountId.
Advantages:
Your consumer will not be blocked by file handling, as the two operations are decoupled.
Even if file handling fails, the data is always present in the DB to reprocess.
If your account ids repeat, then it is better to use windowing. You can aggregate all events of, say, 1 minute with windowing, then group the events by key and process each accountId at once.
This way you will not have to open a file multiple times.
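As a rough illustration of the grouping idea with the plain Kafka consumer API (the topic name, one-file-per-account naming, and using the record key as the accountId are assumptions):

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.time.Duration;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

// Sketch: buffer events per accountId over a small window, then append each
// account's batch with a single open/close instead of one per message.
public class AccountFileWriter {
    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "account-file-writer");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("account-events"));      // assumed topic name
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                Map<String, List<String>> byAccount = new HashMap<>();
                for (ConsumerRecord<String, String> rec : records) {
                    // Assumes the accountId is the record key.
                    byAccount.computeIfAbsent(rec.key(), k -> new ArrayList<>()).add(rec.value());
                }
                for (Map.Entry<String, List<String>> e : byAccount.entrySet()) {
                    Path file = Path.of(e.getKey() + ".log");   // one file per accountId
                    Files.write(file,
                            (String.join(System.lineSeparator(), e.getValue()) + System.lineSeparator())
                                    .getBytes(StandardCharsets.UTF_8),
                            StandardOpenOption.CREATE, StandardOpenOption.APPEND);
                }
            }
        }
    }
}
```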
It is not okay to open a file for every single message; you should buffer a fixed number of messages, then write to a file when you reach that limit.
You can use the HDFS Kafka Connector provided by Confluent to manage this.
If configured with the FieldPartitioner writing out to a local filesystem given store.url=file:///tmp, for example, it will create one directory per unique accountId field in your topic. The flush.size configuration then determines how many messages will end up in a single file.
Hadoop does not need to be installed, as the HDFS libraries are included in the Kafka Connect classpath and they support local filesystems.
You would start it like this after creating the two property files:
bin/connect-standalone worker.properties hdfs-local-connect.properties
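As a rough idea of what the hdfs-local-connect.properties side could contain (topic name, partition field and flush size are placeholders; check the connector documentation for the exact keys in your version):

```properties
name=local-file-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=account-events
# write to the local filesystem instead of a real HDFS cluster
store.url=file:///tmp
# one output directory per unique accountId field in the records
partitioner.class=io.confluent.connect.storage.partitioner.FieldPartitioner
partition.field.name=accountId
# number of messages per output file
flush.size=1000
```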
As I understood from reading some articles on the internet, SQL Server has a buffer cache where it stores pages, and when an insert statement is executed, the modified data is written only to that buffer in memory, not to disk.
Then, when a system checkpoint comes, all dirty pages are flushed to disk.
Does this mean that when we execute an insert statement and get a return value saying everything was OK, the data might still not be written to disk, and in theory, if a system crash occurs before the checkpoint, the dirty pages won't be saved to disk even though we were told everything was OK and the transaction is committed?
No, because you are ignoring the second part of the mechanism: the log file. The transaction log records every change and is flushed to disk before a commit is acknowledged (write-ahead logging). In case of a crash, upon startup the server will replay the changes from the log file.
Requirement - Our application processes files containing records, and we have to maintain a log for the records in every file. The log file could easily be 100 MB in size at times.
Solution - Since database operations would be very heavy, we wanted to go for an in-memory cache: write the logs for a particular file into a Redis key (the key might be the unique file name itself). Later, when the user wants to see the log file, the application should be able to read the contents from the cache using the unique file name as the key and write the contents into a file which the user can see/download.
Question - Is it a good idea to keep appending the logs for a particular file to the same key and later, when we have to write the file, read from the key and write its contents to the file? Basically the value of the Redis key would always be a string and its size might run into 100 MB. Will there be any problems because of this?
You can achieve this with Redis easily, but don't forget that Redis is an in-memory store (make sure you don't run out of RAM). Ask yourself why you want to go for an in-memory store over normal disk operations when dealing with files. If reads are frequent and access time is crucial, go ahead with Redis.
Regarding size - 100 MB is not a problem; a Redis string can hold up to 512 MB, and Lists, Sets and Hashes can hold more than 4 billion entries.
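A minimal sketch of that pattern with the Jedis client (the key prefix and paths are just for illustration):

```java
import redis.clients.jedis.Jedis;

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Append log lines to one key per file, then dump the whole value
// to a file when the user asks for it.
public class RedisLogStore {
    private final Jedis jedis = new Jedis("localhost", 6379);

    public void appendLog(String fileName, String logLine) {
        // APPEND grows the string value in place; the key is the unique file name.
        jedis.append("log:" + fileName, logLine + "\n");
    }

    public void dumpToFile(String fileName, Path target) throws IOException {
        String contents = jedis.get("log:" + fileName);   // whole value comes back in one reply
        if (contents != null) {
            Files.write(target, contents.getBytes(StandardCharsets.UTF_8));
        }
    }
}
```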
I prefer MongoDB (which is a disk-based document store) over Redis for this kind of operation.
Consider looking at this link to learn when Redis is a good fit.
I have a streamed service. The message returned from the operation has a stream as its only body member, which is a stream to a file on the file system. I wonder if there's a way to record, from the server, how much time it takes the client to consume that file?
One of the ways you can go: return from the server not only the stream but also a data structure that contains the file size.
On the client you can use a timer and calculate progress from how much has already been read versus the elapsed time versus the full file size.
See this example: http://www.codeproject.com/Articles/20364/Progress-Indication-while-Uploading-Downloading-Fi
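The question is about WCF, but the client-side timing idea is generic; a minimal sketch in Java, assuming the server has reported the file size alongside the stream:

```java
import java.io.IOException;
import java.io.InputStream;

// Track bytes read against elapsed time, using the file size the server reported.
public class DownloadTimer {
    public static void consume(InputStream stream, long fileSize) throws IOException {
        long start = System.nanoTime();
        long totalRead = 0;
        byte[] buffer = new byte[64 * 1024];
        int read;
        while ((read = stream.read(buffer)) != -1) {
            totalRead += read;                                   // bytes consumed so far
            double elapsedSec = (System.nanoTime() - start) / 1_000_000_000.0;
            double percent = 100.0 * totalRead / fileSize;       // progress against reported size
            System.out.printf("%.1f%% in %.1fs%n", percent, elapsedSec);
        }
        // totalRead divided by the elapsed time gives the effective consumption rate,
        // which the client could report back to the server if needed.
    }
}
```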