Microsoft Access: Import CSV file from a list of multiple files - vba

I have a huge file, around 10 GB, in *.csv format. It contains data from 1960 to the present date for different regions. There are about 8,000 regions, and I was able to break the file down by region, so I have about 8,000 files of roughly 2 MB each.
I was wondering what would be the most efficient way to create an Access database system to look up data for each region. Is it by:
Keeping the file separated into small files (i.e. one per region, about 8,000 files), named by region, and importing the relevant one into Access each time, OR
Splitting the data into files of a constant size, about 1 GB each, and querying those?
In either case, how do I import the files into Access?

As you may be aware, an Access database file is limited to 2GB in size, so you almost certainly won't be able to keep all of the information in one file. Even if it did just barely fit, keeping that much information in a single Access database file would likely be rather slow to work with.
Depending on the "shape" of your data there may be other limits in Access that could cause difficulties in your particular situation. For example, a table (or query) is limited to 255 columns. If you haven't done so already, take a look at the Access specifications before proceeding too much further with this.
And in any case, consider using another database as the back-end. Your data may even be too large for a single database in SQL Server Express Edition (maximum of 10 GB total size per database, as I recall), but even if you had to split the data into two SQL Express databases it would be easier to deal with than a dozen (or more?) Access databases.
Bear in mind that if you use a different database back-end you may still be able to use Access as a query and reporting tool (via ODBC linked tables).
Edit re: comment
Based on your description, if you will never need to query across regions (and remember that "never" is a very long time™) then your 8000-file approach would be feasible. However, I wouldn't necessarily recommend importing the corresponding CSV data every time you want to run a query. Instead, I would borrow ideas from both Tom's and HansUp's answers:
Plan "A": Start by running queries directly against the CSV files themselves to see if that is fast enough for your needs. You could test that by creating a linked table to the CSV file and running some typical queries. As Tom mentioned, a CSV linked table cannot be indexed, so if you find that the queries are too slow then you may have to go to Plan "B".
Plan "B": If you do need to import the CSV data then you'll probably want to use HansUp's suggestion of using DoCmd.TransferText to help automate the process. It seems wasteful to import the specific CSV file for every query, so you might consider creating ~8000 .accdb files and then using a query like...
strSQL = _
"SELECT * FROM TableName " & _
"IN ""C:\__tmp\region12345.accdb"" " & _
"WHERE StartDate BETWEEN #2013-05-10# AND #2013-05-15#"
...where your code could substitute
the name of the appropriate .accdb file based on the region of interest, and
the required date range.
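For illustration, here is one hedged way that substitution might be wired up in VBA; the table name, date field, folder, and file-naming pattern are just the placeholders carried over from the snippet above:
' A sketch only: build the IN-clause query for one region and date range,
' then open a read-only recordset on it.
Public Function GetRegionData(ByVal RegionID As Long, _
        ByVal FromDate As Date, ByVal ToDate As Date) As DAO.Recordset
    Dim strSQL As String
    strSQL = "SELECT * FROM TableName " & _
        "IN ""C:\__tmp\region" & RegionID & ".accdb"" " & _
        "WHERE StartDate BETWEEN " & Format(FromDate, "\#yyyy-mm-dd\#") & _
        " AND " & Format(ToDate, "\#yyyy-mm-dd\#")
    Set GetRegionData = CurrentDb.OpenRecordset(strSQL, dbOpenSnapshot)
End Function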

If you will be doing this with VBA, you can use the DoCmd.TransferText Method to import CSV data into Access.
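A minimal sketch of that call (the table name and file path are assumptions; with no import specification, Access will guess the field types from the data):
' Hypothetical example: import one region's CSV (with a header row)
' into a local table named tblRegionData.
DoCmd.TransferText TransferType:=acImportDelim, _
    TableName:="tblRegionData", _
    FileName:="C:\__tmp\region12345.csv", _
    HasFieldNames:=True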
I wouldn't want to do that in your situation, though. 10 GB is too much data to reasonably manage in Access. And if you partition that into separate db files, querying data pulled from multiple db files is challenging and slow. Furthermore, if the query's combined result set hits the 2 GB Access limit, you will get a confusing error about insufficient disk space.
This is not a reasonable job for data storage in MS Access.

Gord's and HansUp's answers are very good. Use a better back-end for your data. Free ones include SQL Server Express and MySQL. If you're in a corporate environment, you may already have a license for MS SQL Server.
However, if you insist on doing this strictly in Access, here are two related ideas. Both require that you link and de-link (using VBA) the data you need as you need it.
You don't have to import a CSV file to be able to see it as a table. You can link to it just as you would a table in another database (a sketch follows below).
Positives: You don't have to change your existing data format.
Drawbacks: You can't edit your existing data, nor can you index it, so queries may be slow.
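As a rough sketch of the linking idea (the table and file names are made up), you can create and drop the link from VBA:
' Link the CSV as a (read-only) table, query it, then drop the link.
DoCmd.TransferText TransferType:=acLinkDelim, _
    TableName:="lnkRegionData", _
    FileName:="C:\__tmp\region12345.csv", _
    HasFieldNames:=True
' ... run queries against lnkRegionData here ...
DoCmd.DeleteObject acTable, "lnkRegionData"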
Or, you can convert each CSV file into its own Access DB (you can automate this with VBA; see the sketch after this list). Then, as in the above suggestion, link and de-link the tables as needed.
Positives: You can edit your existing data, and also index it, so queries may be quick.
Drawbacks: It's an awful lot of work just to avoid using a different back-end DB.
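For what it's worth, here is one hedged sketch of the conversion step, driving a separate Access instance to create one .accdb per CSV; the folder, file pattern, and table name are assumptions:
' Turn every CSV in C:\__tmp into its own .accdb containing one table.
Sub ConvertCsvFolderToAccdbs()
    Dim app As Access.Application
    Dim strFile As String
    Set app = New Access.Application
    strFile = Dir("C:\__tmp\*.csv")
    Do While Len(strFile) > 0
        ' Create the empty .accdb named after the CSV file.
        app.NewCurrentDatabase "C:\__tmp\" & Left(strFile, Len(strFile) - 4) & ".accdb"
        ' Import the CSV (with header row) into a table named tblData.
        app.DoCmd.TransferText acImportDelim, , "tblData", "C:\__tmp\" & strFile, True
        app.CloseCurrentDatabase
        strFile = Dir
    Loop
    app.Quit
    Set app = Nothing
End Sub
Linking and de-linking the resulting tables from your front-end then works much like the CSV sketch above, e.g. via DoCmd.TransferDatabase acLink and DoCmd.DeleteObject.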

Related

Excel Power query import csv files

Hopefully someone can help me with the following question. I want to add one large CSV file (600k rows) every month to the existing ones. What would be the most efficient way (in terms of loading)? Adding the CSV file to a folder and using that whole folder as input for Power Query, or somehow using an append query for only the month that was added? Thanks for your help,
Regards,
Michiel
I would use Power Query to load the directory of CSV files. When a new file is added you just need to "Refresh All".
Loading from CSV files is the fastest way to load information. (I just read a study and it is incredibly fast, even if there are millions of rows across multiple files.)
You can do any required data cleaning or transformation in the query, which is also the fastest way to manipulate the data. It is much, much faster than making the changes later in Excel tables.
You have the choice of loading into the Data Model, into an Excel table, or both.
Loading the data into the Data Model makes it available for any use you might have and is extremely useful for creating pivot tables.

SparkSQL: intra-SparkSQL-application table registration

Context. I have tens of SQL queries stored in separate files. For benchmarking purposes, I created an application that iterates through each of those query files and passes it to a standalone Spark application. The latter first parses the query, extracts the used tables, registers them (using registerTempTable() in Spark < 2 and createOrReplaceTempView() in Spark 2), and then actually executes the query (spark.sql()).
Challenge. Since registering the tables can be time-consuming, I would like to register them lazily, i.e. only once when they are first used, and keep that in the form of metadata that can readily be used in subsequent queries without the need to re-register the tables with each query. It's a sort of intra-job caching, but not any of the caching options Spark offers (table caching), as far as I know.
Is that possible? If not, can anyone suggest another approach to accomplish the same goal (iterating through separate query files and running a querying Spark application without re-registering tables that have already been registered)?
In general, registering a table should not take time (except that, if you have lots of files, it might take time to generate the list of file sources). It is basically just giving the dataframe a name. What would take time is reading the dataframe from disk.
So the basic question is: how are the dataframes (tables) written to disk? If they are written as a large number of small files or in a file format which is slow (e.g. CSV), this can take some time (having lots of files takes time to generate the file list, and having a "slow" file format means the actual reading is slow).
So the first thing you can try to do is read your data and resave it.
Let's say, for the sake of example, that you have a large number of CSV files in some path. You can do something like:
df = spark.read.csv("path/*.csv")
Now that you have a dataframe, you can rewrite it with fewer files and in a better format, such as:
df.coalesce(100).write.parquet("newPath")
If the above is not enough, and your cluster is large enough to cache everything, you might put everything in a single job: go over all tables in all queries, register all of them and cache them, then run your SQL queries one after the other (and time each one separately).
If all of this fails, you can try to use something like Alluxio (http://www.alluxio.org/) to create an in-memory file system and read from that.

How do I create an Excel Pivot connected to an Access DB that downloads only the queried data?

I have a table of around 60 columns and 400,000 rows and increasing. Our company laptops and MS Excel cannot handle this much data in RAM. So I decided to store the data in MS Access and link it to Excel.
However, the pivot in Excel still downloads all the data into Excel and then performs the filters and operations on it. That worked with less data, but with more data it has now started giving memory errors. Also, even though the data shown in the pivot might be only 50 cells, the file size is 30+ MB...
So is it possible to create a connection to Access in such a way that it downloads only the data that is queried, performs the operations beforehand, and then sends the revised data to Excel?
I saw this setup in my previous company (where the Excel pivot would only download what it needed). But it was querying an SQL DB as far as I remember. (Sadly couldn't learn more about it since the IT director was intent on being the only guy who knew core operations (He basically had the company's IT operations hostage in exchange for his job security))... But I digress.
I've tried searching for this on the internet for a few days, but it's a very specific problem that I can't find in Google :/
Any help or even pointers would be highly appreciated!
Edit: I'd just like to point out that I'm trying to create an OLAP connection for analysis, so the pivot fields would be changing. My understanding of how pivots work was that when we select the fields in the pivot, Excel would build a query (based on the selected fields) and send it to the connected DB to retrieve the requested data. If this is not how it happens, how do I make something like this happen? I hope that clarifies things.
I suppose that you created a single massive table in Access to store all your data, so if you just link that table as the data source, Excel won't know which particular bit of data is relevant and will most probably have to go through all of it itself.
Instead, you can try a combination of different approaches:
Create a query that pre-filters the data from Access and link that query to Excel.
Use a SQL Command Type for your Connection Properties instead of a Table (see the sketch after this list).
Test that query in Access to make sure it runs well and is fast enough.
Make sure that all important fields have indexes: fields you filter on, fields you group by, and any field that Excel has to examine to decide whether a record should be included in the pivot.
Make sure that you have set a Primary Key in your table(s) in Access. Just use the default auto-increment ID if it's not already used.
If all else fails, break down that huge table: it's not so much the number of records that's the problem as the high number of columns.
If you use calculated fields in your pivot or filter data based on some criteria, consider adding columns to your table(s) in Access that contain pre-calculated data. For instance you could run a query from Access to update these additional fields or add some VBA to do that.
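As a rough illustration of the "SQL Command Type" point above, the existing workbook connection can be repointed at a pre-aggregated SQL command from Excel VBA; the connection, table, and field names below are made up:
' Point an existing workbook connection at a pre-aggregated SQL command
' instead of pulling the whole Access table into Excel.
Sub UseSqlCommandForPivotSource()
    Dim cn As WorkbookConnection
    Set cn = ThisWorkbook.Connections("AccessData")   ' assumed connection name
    With cn.OLEDBConnection
        .CommandType = xlCmdSql
        .CommandText = "SELECT Region, ProductGroup, Sum(Amount) AS TotalAmount " & _
            "FROM tblSales WHERE SaleDate >= #2015-01-01# " & _
            "GROUP BY Region, ProductGroup"
    End With
    cn.Refresh
End Sub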
It should work pretty well, though: to give you an idea, I've run some tests with Excel 2013 linked to a 250 MB ACCDB containing 33 fields and 392,498 rows (a log of stock operations): most operations on the pivot in Excel take only a fraction of a second, maybe a couple of seconds for the most data-intensive ones.
Another thing: Access has support for pivot tables and pivot charts. Maybe you don't need Excel if Access is enough. You can use the free Access Runtime as a front-end on each machine that needs access to the data. Each front-end can then be linked to the back-end database that holds the data on a network share. The tools are a bit more clunky than in Excel, but they work.
Another possible solution, to avoid creating queries in the Access DB, is to use the PowerPivot add-in in Excel and implement the queries and normalizations there.

Efficiency of analysing exported SQL data sheet in SQL server

Apologies for the long title.
I have been given a large flat file to be analysed in SQL server, which was generated by another SQL database which I do not have direct access to. Due to the way the query had been generated, there are over 5000 different rows, for only 900 unique objects.
My question is a straightforward one: I am not attempting to create a long-term database. Would it be more time-efficient to split this back into separate tables to re-run queries, or would it be easier to analyse it as is?
5,000 rows is not a large table. The time and effort to reverse engineer a "normalised" database, and write and debug the ETL, will not be repaid in reduced runtime. Index it well for your queries and remember to put a DISTINCT in where it's required. You'll be fine.

SQL Data load - Suggestions required

Gurus,
We are in the process of setting up an SSIS package to load a formatted text file into SQL Server. It will have around 100 million rows, and the total size will be about 100 GB (multiple files of around 15 GB each). The file format is aligned with an XML schema, as shown below... it takes nearly 72 hours to load these files into the SQL Server tables...
File format
EM|123|XYZ|30|Sales mgr|20000|AD|1 Street 1| State1|City1|US|AD|12Street 2|state 2|City2|UK|CON|2012689648|CON|42343435
EM|113|WYZ|31|Sales grade|200|AD|12 Street 1| State2|City2|US|AD|1Street 22|state 3|City 3|UK|CON|201689648|CON|423435
EM|143|rYZ|32|Sales Egr|2000|AD|113Street 1| State3|City3|US|AD|12Street 21|state 4|City 5|UK|CON|201269648|CON|443435
Data will come in the above format. Everything from "EM" up to "AD" is employee details such as Code, Name, Age, Designation and Salary, and "AD" is address details such as Street, State, City and Country. There can be multiple addresses for the same employee... similarly, "CON" is contact details with a phone number, which may also occur multiple times.
So we need to load the employee details into a separate table, the address details into a separate table, and the contact details into a separate table, with Code as the primary key in the Employee Details table and a reference (foreign) key in the other two tables.
We designed the package with a Script Component as the source: it parses the file line by line using .NET code, creates one output buffer per table, and adds the rows in the script. The Script Component outputs are mapped to 3 OLE DB Destinations (SQL Server tables).
Our server is a virtualized quad-core with 48 GB of RAM, and we have 2 cores with 24 GB dedicated to the DB. Our SQL Server database (simple recovery model) has its data files on a network share backed by SAN storage. To improve performance we put each table in a different data file (primary and secondary)... but it still takes around 72 hours.
Need guidance on following points.
Is it possible to use BCP? If yes, any pointers... (I hope BCP will perform better.)
Any suggestions on the solution described above.
Any alternatives...
There are no indexes defined on the tables, and no triggers... We have even set DefaultBufferSize to 100 MB.
Looking forward to your responses... Any help is much appreciated.
1.) If necessary, simplify/flatten your XML file via XSLT as shown here:
http://blogs.msdn.com/b/mattm/archive/2007/12/15/xml-source-making-things-easier-with-xslt.aspx
2.) Use XML Source as shown here:
http://blogs.msdn.com/b/mattm/archive/2007/12/11/using-xml-source.aspx
3.) Drop any indexes on the destination tables
4.) If your source data is bulletproof, disable constraints on the tables via:
ALTER TABLE [MyTable] NOCHECK CONSTRAINT ALL
5.) Load the data via OLEDB Destination
6.) Re-enable constraints
7.) Re-create indexes
You say the data files are on a network share. One improvement would be to add a hard drive and run the job on the SQL Server machine itself, since you'd eliminate the network latency. I think even hooking up a USB drive to read the files from would be better than using a network location. Certainly worth a little test, in my opinion.
SSIS is pretty quick when doing bulk loads, so I suspect that the bottleneck isn't with SSIS itself, but something about the way your database/server is configured. Some suggestions:
When you're running the import, how many rows are you importing every second? (You can run a "SELECT COUNT(*) FROM yourtable WITH (READUNCOMMITTED)" during the import to see.) Does this rate stay constant, or does it slow down towards the end of your import?
As others have said, do you have any indexes or triggers on your destination tables?
While you're running the import, what do your disks look like? In perfmon, is the disk queue spiking like crazy, indicating that your disks are the bottleneck? What's the throughput on these disks during normal performance testing? I've had experience where improperly configured iSCSI or improperly aligned SAN storage can drop my disks from 400MB/s down to 15MB/s - still fine during regular usage, but way too slow to do anything in bulk.
You're also talking about loading 100GB of data, which is no small amount - it shouldn't take 72 hours to load, but it won't load it in 20 minutes either, so have reasonable expectations. Please weigh in on these and the other bottlenecks people have asked about and we may be able to help you isolate your issue.
If you have any control over the way the files are created in the first place, I would move away from the one-to-many relationship you have between the |EM| records and the |AD|/|CON| records and do something like this:
|EM|EmpID|data|data|
|AD|EmpID|data|data|
|CON|EmpID|data|data|
Further, if you can split the records into three different files, you'd be able to use a Flat File Source component with a fixed specification for each source to process the data in bulk.