Alternatives to using RFile in Symbian - serialization

This question is in continuation to my previous question related to File I/O.
I am using RFile to open a file and read/write data to it. Now my requirement is that I have to modify certain fields within the file. I separate each field within a record with a colon and each record with a newline. A sample is below:
abc#def.com:Albert:1:2
def#ghi.com:Alice:3:1
Suppose I want to replace the '3' in the second record with '2'. I am finding it difficult to overwrite a specific field in the file using RFile, because RFile does not provide its users with such a facility.
Because of this, to modify a record I have to delete the contents of the file and serialize again (that is, loop through the in-memory representation of the records and write them back to the file). Doing this every time a record's value changes is quite expensive, as there are hundreds of records and changes can be frequent.
I searched around for alternatives and found CPermanentFileStore, but the API seems hard to use and I could not find any source on the Internet that demonstrates its use.
Is there a way around this? Please help.

Depending on which version(s) of Symbian OS you are targeting, you could store the information in a relational database. Since v9.4, Symbian OS includes an SQL implementation (based on the open source SQLite engine).
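As a sketch of what the relational approach buys you (shown here in desktop Python with sqlite3 rather than the Symbian SQL API, so the setup code is illustrative, but the schema and queries carry over):

```python
import sqlite3

# In-memory database for illustration; on Symbian you would go through
# the platform's SQL API instead.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE records (
        email  TEXT PRIMARY KEY,
        name   TEXT NOT NULL,
        field1 INTEGER,
        field2 INTEGER
    )
""")
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?, ?)",
    [("abc#def.com", "Albert", 1, 2),
     ("def#ghi.com", "Alice", 3, 1)],
)

# Updating one field of one record no longer requires rewriting the
# whole file; the engine handles in-place storage for you.
conn.execute("UPDATE records SET field1 = 2 WHERE email = 'def#ghi.com'")
conn.commit()
```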

Using normal files for this type of record takes a lot of effort no matter the operating system. To do this efficiently you need to reserve space in the file for expansion of each record - otherwise you need to rewrite the entire file whenever a record value grows, say from 9 to 10. Storing a lookup table in the file also makes it possible to jump directly to a record using RFile::Seek.
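A minimal sketch of the fixed-width idea (in Python for brevity, with an in-memory buffer standing in for the file; the same seek-and-overwrite logic applies with RFile::Seek and RFile::Write):

```python
import io

RECORD_LEN = 40  # fixed record size in bytes, including the newline

def make_record(email, name, a, b):
    # Pad each record to a fixed length so it can later be
    # overwritten in place without shifting its neighbours.
    text = f"{email}:{name}:{a}:{b}"
    return (text.ljust(RECORD_LEN - 1) + "\n").encode("ascii")

# An in-memory "file" standing in for the real one.
f = io.BytesIO()
f.write(make_record("abc#def.com", "Albert", 1, 2))
f.write(make_record("def#ghi.com", "Alice", 3, 1))

# Overwrite record 1 in place: seek directly to its offset and rewrite
# the whole fixed-width record, leaving the rest of the file untouched.
f.seek(1 * RECORD_LEN)
f.write(make_record("def#ghi.com", "Alice", 2, 1))
```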
CPermanentFileStore simplifies the actual reading and writing of the file, but it basically does what you would otherwise have to do yourself. A database may be a better choice in this instance. If you don't want to use a database, I think stores would be a better solution.

Related

Azure Data Factory - optimal design for an IOT pipeline

I am working on an Azure Data Factory solution to solve the following scenario:
Data files in CSV format are dumped into Data Lake Gen 2 paths. There are two varieties of files, let's call them TypeA and TypeB and each is dumped into a path reflecting a grouping of sensors and the date.
For example:
/mycontainer/csv/Group1-20210729-1130/TypeA.csv
/mycontainer/csv/Group1-20210729-1130/TypeB.csv
/mycontainer/csv/Group1-20210729-1138/TypeA.csv
/mycontainer/csv/Group1-20210729-1138/TypeB.csv
I need to extract data from TypeA files in Delta format into a different location on Data Lake Gen 2 storage. I'll need to do similar processing for TypeB files but they'll have a different format.
I have successfully put together a "Data Flow" which, given a specific blob path, accomplishes this extraction. But I am struggling to put together a pipeline which applies it for each file that comes in.
My first thought was to do this based on a storage event trigger, whereby each time a CSV file appeared the pipeline would be run to process that one file. I was almost able to accomplish this using a combination of fileName and folderPath parameters and wildcards. I even had a pipeline which will work when triggered manually (meaning I entered a specific fileName and folderPath value manually). However I had two problems which made me question whether this was the correct approach:
a) I wasn't able to get it to work when triggered by real storage events, I suspect because my combination of parameters and wildcards was ending up including the container name twice in the path it was generating. It's hard to check this because the error message you get doesn't tell you what the various values actually resolve to (!).
b) The cluster that is needed to extract the CSV into parquet Delta and put the results into Data Lake takes several minutes to spin up - not great if working at the file level. (I realize I can mitigate this somewhat - at a cost - by setting a TTL on the cluster.)
So then I abandoned this approach and tried to set up a pipeline which will be triggered periodically, and will pick up all the CSV files matching a particular pattern (e.g. /mycontainer/csv/*/TypeA.csv), process them as a batch, then delete them. At this point I was very surprised to find out that the "Delimited Text" dataset doesn't seem to support wildcards, which is what I was kind of relying on to achieve this in a simple way.
So my questions are:
Am I broadly on the right track with my 'batch of files' approach? Is there a way to define a delimited text data source which reads its data from multiple blobs?
Or do I need a more 'iterative' approach using maybe a 'Foreach' step? I'm really hoping this isn't the case, as it seems an odd pattern to be adopting in 2021.
A much wider question: is ADF a suitable tool for this kind of scenario? I was excited about using it at first, but increasingly it feels like one of those 'exciting to demo but hard to actually use' things which so often pop up in the low/no-code space. Are there popular alternatives which will work nicely with Azure storage?
Any pointers very much appreciated.
I believe you're very much on the right track.
Last week I was able to get wildcard CSVs imported, as long as the wildcard is in the CSV file name. Maybe create an intermediate step to put all TypeA files in the same folder?
Concerning ADF - it's a cool technology, with a steep learning curve (and a lot of updates - incl. breaking changes sometimes) if you're looking to get data ingested without too much coding. Some drawbacks:
Monitoring - if you want to keep it cheap, you'll need a lot of workarounds (e.g. mailing alerts via Logic Apps)
Debugging - as you've noticed, debug messages are often cryptic or insufficient
Multiple monthly updates make it feel like a beta, and straightforward tasks are often surprisingly difficult to achieve.
Good luck ;)

MongoDB or SQL for text file?

I have a 25GB text file with this structure (headers):
Sample Name Allele1 Allele2 Code metaInfo...
So it is just one table with a few million records. I need to put it into a database because sometimes I need to search the file - for example, for a specific sample - and then retrieve the whole row exactly as it appears in the file. This would be a basic application. What is important? The file is constant: no insert functionality is needed, because all samples are final.
My question is:
Which DB will be better in this case, and why? Should I put the file in an SQL database, or would MongoDB be a better idea? I need to learn one of them and want to pick the best option. Could someone give advice? I didn't find anything specific on the internet.
Your question is a bit broad, but assuming your 25GB text file in fact has a regular structure, with each line having the same number (and data type) of columns, then you might want to host this data in a SQL relational database. The reason for choosing SQL over a NoSQL solution is that the former tool is well suited for working with data having a well defined structure. In addition, if you ever need to relate your 25GB table to other tables, SQL has a bunch of tools at its disposal to make that fast, such as indices.
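To make the indexing point concrete, here is a sketch with Python's built-in sqlite3 (the table and column names are just guesses based on the headers in the question; real types would depend on the actual file):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE samples (
        sample_name TEXT,
        allele1     TEXT,
        allele2     TEXT,
        code        TEXT,
        meta_info   TEXT
    )
""")
# An index on sample_name is what makes the "find a specific sample"
# lookup fast even with millions of rows.
conn.execute("CREATE INDEX idx_sample_name ON samples (sample_name)")
conn.execute("INSERT INTO samples VALUES ('S001', 'A', 'T', '42', 'x')")

rows = conn.execute(
    "SELECT * FROM samples WHERE sample_name = ?", ("S001",)
).fetchall()
```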
Both MySQL and MongoDB are equally good for your use-case, as you only want read-only operations on a single collection/table.
For comparison refer to MySQL vs MongoDB 1000 reads
But I will suggest going for MongoDB because of its aggregation pipeline. Though your current use case is straightforward, in future you may need complex operations, and then MongoDB's aggregation pipeline will come in very handy.

Using data from a text file for database in SQL Developer

I suppose this is somewhat of an extension of the question asked here.
However, I believe the linked OP and my own reason for reading a file with SQL Developer are different. I am learning SQL and databases and am attempting to create a model database (as in, I won't be editing the data after insertion, just setting up search queries and the like). I want it to be large (over 100,000 entries), so I've created a C++ program that writes randomly generated entries for the database to a .txt file (one entry per line) instead of hard-coding the insertion of each entry. Now what I want to do is read the .txt file in SQL Developer and insert the data into a table.
My problem lies in the fact that I am not able to create directories. I am using a university oracle connection and I do not have the privileges to actually make a directory so that I can use UTL_FILE on my .txt file as was answered in the linked question. Assuming there is no way for me to gain this permission, is there an alternate way I can accomplish reading a .txt file for data for my table? Is there a better way to go about creating "dummy data" for my database?
What I ended up doing to insert my mock data was change the way the .txt file was formatted. Instead of having my C++ code write the data one entry per row, I made the code write SQL statements to the .txt file, as I think @toddlermenot was suggesting, more or less. After the C++ code had written as many inserts-with-mock-entries as I needed to the text file, I just copy/pasted it into SQL Developer and achieved the desired results.
My problem is a classic case of making the process more complicated than it needed to be.
Also, even though I did not use the method, @Multisync provided an interesting way to go about achieving my goal. I had no idea SQL had the tools for me to generate mock data. Thanks for introducing me to that.
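For anyone taking the same route, the generation step can be sketched like this (Python instead of C++, with a made-up table name and columns; substitute your own schema):

```python
import random

# Hypothetical value pool; a real generator would draw from larger lists.
FIRST_NAMES = ["Ada", "Bob", "Cara"]

def make_insert(entry_id):
    name = random.choice(FIRST_NAMES)
    score = random.randint(0, 100)
    # Each line is a complete INSERT statement, ready to paste into
    # SQL Developer (or run as a script).
    return (f"INSERT INTO mock_entries (id, name, score) "
            f"VALUES ({entry_id}, '{name}', {score});")

with open("mock_data.sql", "w") as out:
    for i in range(1, 101):
        out.write(make_insert(i) + "\n")
```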

Which would be better? Storing/access data in a local text file, or in a database?

Basically, I'm still working on a puzzle-related website (micro-site really), and I'm making a tool that lets you input a word pattern (e.g. "r??n") and get all the matching words (in this case: rain, rein, ruin, etc.). Should I store the words in local text files (such as words5.txt, which would have a return-delimited list of 5-letter words), or in a database (such as the table Words5, which would again store 5-letter words)?
I'm looking at the problem in terms of data retrieval speeds and CPU server load. I could definitely try it both ways and record the times taken for several runs with both methods, but I'd rather hear it from people who might have had experience with this.
Which method is generally better overall?
The database will give you the best performance with the least amount of work. The built-in index support and query analyzers will give you good performance for free, while a text file might give you excellent performance for a ton of work.
In the short term, I'd recommend creating a generic interface which would hide the difference between a database and a flat-file. Later on, you can benchmark which one will provide the best performance but I think the database will give you the best bang per hour of development.
For fast retrieval you certainly want some kind of index. If you don't want to write index code yourself, it's certainly easiest to use a database.
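To make this concrete: a puzzle pattern like "r??n" maps directly onto SQL's single-character wildcard, so the database's index does the work for you. A sketch with Python's sqlite3 (table and column names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO words VALUES (?)",
                 [("rain",), ("rein",), ("ruin",), ("barn",)])

def find_matches(pattern):
    # '?' in the puzzle pattern becomes '_' (match any single
    # character) in SQL LIKE syntax.
    like = pattern.replace("?", "_")
    return [row[0] for row in conn.execute(
        "SELECT word FROM words WHERE word LIKE ? ORDER BY word",
        (like,))]
```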
If you are using Java or .NET for your app, consider looking into db4o. It just stores any object as is with a single line of code and there are no setup costs for creating tables.
Storing data in a local text file (appending new records to the end of the file) is always faster than storing in a database. So, if you are building a high-load application, you can save the data to a text file and copy it to a database later. However, in most applications you should use a database instead of a text file, because the database approach has many benefits.

cleaning datasources [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking us to recommend or find a book, tool, software library, tutorial or other off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it.
Closed 8 years ago.
I'm project managing a development that's pulling data from all kinds of data sources (SQL Server, MySQL, FileMaker, Excel) before loading it into a new database structure, with records spanning 10 years. Obviously I need to clean all this before exporting, and I am wondering if there are any apps that can simplify this process for me, or any guides that I can follow.
Any help would be great
I do this all the time and, like Tom, do it in SQL Server using DTS or SSIS, depending on the version of the final database.
Some things I strongly recommend:
Archive all files received before you process them especially if you are getting this data from outside sources, you may have to research old imports and go back to the raw data. After the archive is successful, copy the file to the processing location.
For large files especially, it is helpful to get some sort of flag file that is only copied after the other file is complete, or, even better, which contains the number of records in the file. This can help prevent problems from corrupted or incomplete files.
Keep a log of number of records and start failing your jobs if the file size or number of records is suspect. Put in a method to process anyway if you find the change is correct. Sometimes they really did mean to cut the file in half but most of the time they didn't.
If possible get column headers in the file. You would be amazed at how often data sources change the columns, column names or order of the columns without advance warning and break imports. It is easier to check this before processing data if you have column headers.
Never import directly to a production table. Always better to use a staging table where you can check and clean data before putting it into prod.
Log each step of your process, so you can easily find what caused a failure.
If you are cleaning lots of files consider creating functions to do specific types of cleaning (phone number formatting for instance) then you can use the same function in multiple imports.
Excel files are evil. Look for places where leading zeros have been stripped in the import process.
I write my processes so I can run them as a test with a rollback at the end. Much better to do this than realize your dev data is so hopelessly messed up that you can't even do a valid test to be sure everything can be moved to prod.
Never do a new import on prod without doing it on dev first. Eyeball the records directly when you are starting a new import (not all of them if it is a large file of course, but a good sampling). If you think you should get 20 columns and it imports the first time as 21 columns, look at the records in that last column, many times that means the tab delimited file had a tab somewhere in the data and the column data is off for that record.
Don't assume the data is correct, check it first. I've had first names in the last name column, phones in the zip code column etc.
Check for invalid characters, string data where there should just be numbers etc.
Any time it is possible, get the identifier from the people providing the data. Put this in a table that links to your identifier. This will save you from much duplication of records because a last name or an address changed.
There's lots more but this should get you started on thinking about building processes to protect your company's data by not importing bad stuff.
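As an illustration of the reusable-cleaning-function idea above (Python here; in SSIS this would typically live in a shared script component), a hypothetical phone-number formatter:

```python
import re

def clean_phone(raw):
    """Normalize a US-style phone number to XXX-XXX-XXXX.

    Returns None when the input doesn't contain exactly ten digits,
    so callers can route bad values to a staging/error table instead
    of silently importing them.
    """
    digits = re.sub(r"\D", "", raw or "")
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop a leading US country code
    if len(digits) != 10:
        return None
    return f"{digits[0:3]}-{digits[3:6]}-{digits[6:10]}"
```

The same function can then be called from every import that carries a phone column, so a formatting rule is fixed in one place.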
I work mostly with Microsoft SQL Server, so that's where my expertise is, but SSIS can connect to a pretty big variety of data sources and is very good for ETL work. You can use it even if none of your data sources are actually MS SQL Server. That said, if you're not using MS SQL Server there is probably something out there that's better for this.
To provide a really good answer one would need to have a complete list of your data sources and destination(s) as well as any special tasks which you might need to complete along with any requirements for running the conversion (is it a one-time deal or do you need to be able to schedule it?)
Not sure about tools, but you're going to have to deal with:
synchronizing generated keys
synchronizing/normalizing data formats (e.g. different date formats)
synchronizing record structures.
orphan records
If the data is running/being updated while you're developing this process or moving data you're also going to need to capture the updates. When I've had to do this sort of thing in the past the best, not so great answer I had was to develop a set of scripts that ran in multiple iterations, so that I could develop and test the process iteratively before I moved any of the data. I found it helpful to have a script (I used a schema and an ant script, but it could be anything) that could clean/rebuild the destination database. It's also likely that you'll need to have some way of recording dirty/mismatched data.
In similar situations I personally have found Emacs and Python mighty useful but, I guess, any text editor with good searching capabilities and a language with powerful string manipulation features should do the job. I first convert the data into flat text files and then
Eyeball either the whole data set or a representative true random sample of the data.
Based on that make conjectures about different columns ("doesn't allow nulls", "contains only values 'Y' and 'N'", "'start date' always precede 'end date'", etc.).
Write scripts to check the conjectures.
Obviously this kind of method tends to focus on one table at a time and therefore only complements the checks made after uploading the data into a relational database.
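The check-the-conjectures step might look like this in Python (the column names and rules are hypothetical, taken from the examples above):

```python
import csv
import io

# A small stand-in for one of the flat text files.
data = io.StringIO(
    "status,start_date,end_date\n"
    "Y,2005-01-01,2005-12-31\n"
    "N,2006-03-01,2006-02-01\n"  # violates the date-order conjecture
)

violations = []
for lineno, row in enumerate(csv.DictReader(data), start=2):
    # Conjecture 1: 'status' contains only values 'Y' and 'N'.
    if row["status"] not in ("Y", "N"):
        violations.append((lineno, "bad status"))
    # Conjecture 2: 'start_date' always precedes 'end_date'
    # (ISO dates compare correctly as strings).
    if row["start_date"] > row["end_date"]:
        violations.append((lineno, "start after end"))
```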
One trick that comes in useful for me with this is to find a way for each type of data source to output a single column plus a unique identifier at a time, in tab-delimited form say, so that you can clean it up using text tools (sed, awk, or TextMate's grep search) and then re-import it / update the (copy of!) original source.
It then becomes much quicker to clean up multiple sources, as you can re-use tools across them (e.g. capitalising last names - McKay, O'Leary, O'Neil, Da Silva, Von Braun, etc. - fixing date formats, trimming whitespace) and to some extent automate the process (depending on the source).
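The capitalisation fix mentioned above could be one such reusable tool. A rough Python heuristic (it handles apostrophes and the "Mc" prefix, but real name data has many more edge cases, so treat it as a starting point):

```python
def cap_surname(name):
    """Heuristically capitalize a surname: mckay -> McKay, o'leary -> O'Leary."""
    words = []
    for word in name.split():
        w = word.capitalize()
        if "'" in w:
            # Capitalize the letter after an apostrophe: o'leary -> O'Leary.
            i = w.index("'")
            w = w[: i + 1] + w[i + 1 :].capitalize()
        if w.lower().startswith("mc") and len(w) > 2:
            # Scottish/Irish prefix: mckay -> McKay.
            w = "Mc" + w[2:].capitalize()
        words.append(w)
    return " ".join(words)
```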