Uploading CSV using Sqlite3 Console - Different treatment of commas within quotes - ruby-on-rails-3

I am currently experiencing problems reading CSV files into a sqlite3 database in a Rails application. I have around 20 CSV files, each with about 20k lines of data, which I need to read into the database on a regular basis.
Having experimented with a few different approaches, I have opted for using the sqlite3 console, as this enables me to upload the data quickly (in seconds, as opposed to hours going through Rails with the code I was using previously). I tested this approach locally, where I am running sqlite 3.7.15.2, and successfully read the data into my table allitems using the following commands:
sqlite3 development.sqlite3
.separator ','
.import '../newdata.csv' allitems
Encouraged by my success, I proceeded to attempt to recreate this process on a live test site. However, in this case I get a number of errors indicating that the number of columns in newdata.csv doesn't always match the number of columns in allitems. I inspected the data in Excel and found that all the data was in the correct columns. On further investigation, I discovered that it was commas within text strings which were causing the issues, and found some information online (http://www.sqlite.org/cvstrac/wiki?p=ImportingFiles) suggesting that sqlite3 will always split on commas, regardless of whether they're inside quotes.
My first solution was to switch to a separator which would never appear within the text strings (,|,). This did succeed, but it caused a different problem: many text fields now display with a " at the start and end on the webpage, which has various knock-on effects. I created a further workaround, converting my separator to "," and inserting " before and after fields which were not strings, but accounting for exceptions in the data is turning into a never-ending fiddle.
Having lost patience with the above approach, I am looking for advice on how I can get around this problem. In particular, I am puzzled as to why I do not have any problems when I run the commands locally, but face all these issues on the server. The server is currently running sqlite 3.7.3, but I don't know whether this is the cause of the issue, or how I could update the version remotely if it were...
Thanks for your suggestions

Importing CSV files robustly is complex; there are still somewhat frequent bugfixes for sqlite3's import function.
Apparently, there was a necessary bugfix between versions 3.7.3 and 3.7.15.
The sqlite3 tool does not really have any dependencies.
Download or compile your own copy, rename it to whatever you like, and use that.
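If replacing the sqlite3 binary on the server turns out to be awkward, another way around it is to skip the console importer and load the CSV with a proper CSV parser, which handles commas inside quoted fields correctly. Below is a minimal sketch using Python's csv and sqlite3 modules, purely for illustration (the same could be done with Ruby's CSV library from a Rake task). The database, file and table names are taken from the question; the assumption that every row has the same number of columns as the table is mine.

import csv
import sqlite3

DB_PATH = "development.sqlite3"   # names taken from the question
CSV_PATH = "../newdata.csv"
TABLE = "allitems"

with open(CSV_PATH, newline="") as f, sqlite3.connect(DB_PATH) as conn:
    reader = csv.reader(f)  # handles commas inside quoted fields
    rows = list(reader)
    placeholders = ",".join("?" * len(rows[0]))
    conn.executemany(f"INSERT INTO {TABLE} VALUES ({placeholders})", rows)

Because executemany runs inside a single transaction, a 20k-row file should still load in seconds, and the quoting behaviour no longer depends on which sqlite3 version happens to be installed on the server.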


VBA to automatically print page ranges of a PDF

At work, email attachment size is limited to 10 MB, and for several reasons:
Outlook is the only way to share files
I can only use the programs already installed
I am trying to create a VBA macro to:
automatically split PDFs bigger than 10 MB by printing them into smaller files
merge them on the other side
I know it is far from ideal (and many tools exist to do it), but I have no other options.
So far, it seems that I can only use PDFCreator and Adobe Reader for this task, as no other helpful tools are deployed on my PC (mostly Office)... and I cannot figure out a way to print a range of pages from the command line.
I successfully created a working (if very inelegant) macro, based on Shell commands and VBA SendKeys, basically emulating human interaction to print range A, then waiting for the job to be done, then printing range B, and so on... Among the many problems I still need to solve:
add protection to account for machines with different processing power (replace my fixed timings with file-creation checks, and detect whether jobs are still running in the background)
create a robust merging system when receiving the mail
Plus, I am very dependent on the software versions installed, and I foresee a lot of issues with software updates/versions if this macro is to be used by many people.
So this method doesn't have a bright future for now, and unless I find another way to solve this problem, I will probably give up and keep doing it manually (after all, if my employer doesn't provide better tools, I should not be expected to be as efficient as I could be).
Do you have any insight into how to solve this issue cleverly?
(Yes, I already told my boss that working like this is a nightmare, but easy file exchange is not the priority).
I managed to solve my problem using 7-Zip and its "-v" (volume) option on the command line: I split my big file into smaller binary volumes and automatically create new mails with them as attachments.

Is there an interactive console for a powerful language for everyday string-processing work?

I recently started working in IT, with some programming background.
There are so many occasions where there's a need to process large amounts of data, mainly strings I guess.
for example:
there are two large sets of lines, and we need all the lines that appear in both sets
replacing one or more consecutive whitespace characters with a single line break...
taking the 4th to 7th characters of each line and printing them on one line with a comma as a delimiter
These are not the best examples, but generally any kind of parsing, manipulation and querying of text.
Very often the task is extremely easy in any programming language, but it is just too frustrating to open a full IDE for it...
I'm looking for some way to write code (with IntelliSense/autocomplete) in a quick, lightweight window, with simple input and output textboxes.
Do you understand my need? Can you think of anything that can help?
I know some of the problems can be solved using Excel, but I really prefer some good old programming... unless someone strongly believes I'm wrong.
If I build something myself, there will be an option to add any number of multiline textboxes. They'll be automatically named, although the names are changeable (the names will be the names of the variables).
You can also add any number of named output textboxes...
And you have the editor window, in which you write the procedure... and it will have some interactive, IntelliSense-like interface...
Can you see what I'm saying? Do you know of anything similar?
Seems like Python would be fine for this.
It has an interactive interpreter, quite nice abstraction facilities,
and strings as objects with good libraries for processing them.
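For instance, the three example tasks from the question come out to a few lines each at the Python prompt (the file names here are just placeholders):

# Lines that appear in both of two large files (order not preserved).
with open("a.txt") as a, open("b.txt") as b:
    common = set(a) & set(b)

# Replace each run of whitespace characters with a single line break.
import re
print(re.sub(r"\s+", "\n", open("input.txt").read()))

# The 4th to 7th character of each line, joined with commas.
with open("input.txt") as f:
    print(",".join(line[3:7] for line in f))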
It sounds like a lot of what you want can be handled with regular expressions using sed, awk, or perl in a standard console. Autocomplete will be pretty limited, but your scripts will be short anyway - to deal with your third case above, for example:
sed 's/^...\(....\).*/\1/g' < input.txt | tr "\n" ',' > output.csv
What you can do is use an interactive regex tester. There are many online, like this one.
You could also look into tools like Data Wrangler from Stanford, which are designed to be more accessible than, yet as powerful as, traditional shell tools.
(Note that your first issue - intersecting sets of lines - is a bit different, and would be solved in the shell with comm. This page has a good explanation of how to use comm to perform set operations like "all the lines in this file that are not in that file" or "only the lines in this file that are also in that other file".)

XPerfView slow to load symbols

I am attempting to perform a stackwalk with Xperf, using a batch file similar to the one listed at Getting the symbols with xperf.
I launch XperfView, confirm the symbol path is correct, and then load the symbols. However, when I attempt to open a summary table on a selected portion (5 seconds or so) of the "CPU Sampling by CPU" graph, the Performance Analyzer hangs (not responding) for a long time (hours).
I left it running last night, and when I came in this morning the Summary Table had finally loaded, containing the results I expected... I had thought perhaps it was just performing an initial download to cache the symbols to C:\symbols, but a repeat test this morning is showing the same problem (it has been hanging for 1 hr 15 minutes at this point).
WPT (xperf, xperfview, WPA) doesn't ship with dbghelp.dll and symsrv.dll. That means that, depending on what is in your path, you may get:
Fast symbol loading
Symbol loading that takes up to 150x longer
No symbol loading at all.
The solution is to copy a known-good version of these DLLs into the WPT install directory. For more details see this post:
http://randomascii.wordpress.com/2012/10/04/xperf-symbol-loading-pitfalls/
In his post, Bruce Dawson speculates that there is an issue with dbghelp.dll and/or symsrv.dll in WPT as shipped in the current SDK. He suggests replacing them with the ones from either Visual Studio 2010 or Debugging Tools for Windows (i.e. WinDbg). Worked for me...
Have you set up a local symbol cache, something like this?
SRV*c:\dev\symbols*http://msdl.microsoft.com/download/symbols
The local cache (c:\dev\symbols here) stores the downloaded symbols. I usually have my _NT_SYMBOL_PATH environment variable set to the above.
HTH

Compare two fxcop results

I'm going to analyse two different versions of the same DLL with FxCop.
I would like to display only the differences between these two reports.
Does anyone know if this is possible?
Thanks for your time.
Yes, it's possible, but there are no built-in tools available for this. One fairly simple approach would be to use a diff tool to compare the two reports. If the result is too noisy for you, another approach would be to roll your own tool to compare the XML of the two reports.
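If you do go the roll-your-own route, a rough sketch in Python might look like the following. The element and attribute names (Message, Issue, CheckId) are assumptions about the FxCop report schema rather than something to rely on verbatim; adjust them to whatever your reports actually contain:

import xml.etree.ElementTree as ET

def issues(path):
    # Collect (rule, issue text) pairs from a report; names are assumptions.
    found = set()
    for msg in ET.parse(path).iter("Message"):
        check = msg.get("CheckId", "")
        for issue in msg.iter("Issue"):
            found.add((check, (issue.text or "").strip()))
    return found

old = issues("OldVersionReport.xml")
new = issues("NewVersionReport.xml")

print("Fixed since the old report:")
for item in sorted(old - new):
    print(" ", item)

print("Introduced in the new report:")
for item in sorted(new - old):
    print(" ", item)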
Are you using the UI or the command line?
With the command line tool, you have a number of options. One of them is to import an old report to be used as a baseline. Then set the FxCop project to report only new errors: Report Status="Active, Absent" NewOnly="True"
The command line will be something like this: fxcopcmd.exe /i:OldVersionReport.xml /out:NewVersionReport.xml /p:FXCopProject.fxcop /f:mydll.dll
The new report will have only new active errors, and also a list of missing (i.e. fixed) errors from the old version.
While this will work for the most part, you need to understand that the difference will not be 100% accurate. FxCop does its best to match the old report to the new version of the DLL, but sometimes it fails. For example, if you fixed a particular violation somewhere in code, but added the same type of violation in another place, FxCop will most likely miss this and show no difference.
For the FxCop that ships with VS 2010, all you need is to specify /saveMessagesToReport:Absent along with the previously generated FxCop file via /import:"OldFile.xml".
Just an example:
fxcopcmd.exe /import:"c:\Old.xml" /summary "/file:c:\*.dll"
/saveMessagesToReport:Absent /out:"c:\Output.xml"

Check SQL scripts are valid

As part of a release we run a load of PL/SQL scripts against a database. Recently someone left the ; off the end of a line in one script that calls another script, which meant the called script did not get run. Because this did not cause an error - it just didn't run - it took quite a while to track down what had happened.
I want to check the scripts, before they are run, for lines that are missing either a ; at the end or a / on the line after. This is made more complicated because 'lines' in the script can actually span more than one line if they are statements or blocks of code.
It seems to me that to do this I'm going to have to parse the scripts and then check that they meet the above.
I've found ANTLR and wonder if this might be a way to do it, since there seem to be existing PL/SQL grammars, but that looks like a steep learning curve for what's just a simple check.
Does anyone know an easy way, or any other tools, Eclipse plugins, etc. that I can use to check for lines in the scripts that are missing either a ; at the end or a / on the line after?
Update
We already do most of the stuff Tom H suggested. The scripts are run against our test server and we have a version table that gets updated at the end. The problem was that the missing semi-colon in the container script meant one script did not get run, but the rest, including the one to update the version number, ran without errors. Therefore the problem only got picked up quite a way into testing. Fixing it meant restoring the database and rerunning the scripts with the missing semi-colon added, so it basically cost half a day of testing time. If there were a simple way to check this before running the scripts against the test server, it could save quite a bit of time.
I agree with MattH that you may be going about this the wrong way. I would just add an insert statement to the end of all of your scripts which inserts a "version" row into a table in the database. At the end of your deployment scripts it's then an easy task to check that the version table has all of the correct rows in it.
Also, you should have all of your release scripts being run exactly as they will be in production against your QA server. That's where all of the testing takes place. You never do anything to the server besides what is in your release steps - you only run the release scripts and if those release scripts are ever changed then you refresh the QA server with them and redo testing.
When you go to production, your release process has then been fully tested. As a fail-safe measure you can also use tools like Red Gate's SQL Compare and SQL Data Compare to check that production matches the QA server. The data compare would only be against certain tables (look-up tables, etc.). If you have data changes to major tables (1M rows, etc.) then you can write a custom script to check that they are correct.
Even if the scripts are different for every release (and not part of a defined source control structure that creates or replaces database objects) I would adopt a practice of breaking the scripts down into the most fundamental units of work per file and deploying them through Ant with the standard sql task. You probably have these types of scripts:
CREATE or REPLACE dbobject...
SQL DML scripts
Anonymous PL/SQL blocks
If you standardize on a consistent statement delimiter (I suggest using "/" since it works with all of the cases above) and set the deployment to fail on error, then Ant will either deploy all of the files or indicate why it couldn't.
I think it would be very difficult to otherwise parse files of one or more SQL and/or PL/SQL statements and find missing delimiters if there are no standards on delimiter choice or statements per file.
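That said, if you do adopt the convention above - one unit of work per file, each terminated by a "/" on its own line - a crude pre-deployment check becomes easy to script without a real parser. Here is a sketch in Python; the directory argument and the .sql extension are assumptions about how your release is laid out:

import sys
from pathlib import Path

# Flag any .sql file whose last non-blank line is not the "/" terminator.
bad = []
for path in Path(sys.argv[1]).rglob("*.sql"):
    lines = [l.strip() for l in path.read_text().splitlines() if l.strip()]
    if not lines or lines[-1] != "/":
        bad.append(path)

for path in bad:
    print("Missing trailing '/':", path)
sys.exit(1 if bad else 0)

Run against the release directory before anything reaches the test server, this won't catch every mistake, but it at least flags scripts whose final statement is missing its terminator.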
Just a thought, but are you going about this the wrong way?
I assume that, at the file level, the lack of a semi-colon was not a problem, but it only became a problem when run via the batch processing? If that's the case, maybe you can change your batch processing to cope with this.
If it was the file itself, then testing should have picked it up. You don't want to parse your input files to make sure they compile, etc.