Hopefully someone can help me with the following question. I want to add one large CSV file (600k rows) every month to the existing ones. What would be the most efficient way (in terms of loading): adding the CSV file to a folder and using that whole folder as input for Power Query, or somehow using an append query for only the month that was added? Thanks for your help,
Regards,
Michiel
I would use Power Query to load the directory of CSV files. When a new file is added you just need to "Refresh All".
Loading from CSV files is a very fast way to bring in information (I recently read a study on this; it is incredibly fast even with millions of rows spread over multiple files).
You can do any required data cleaning or transformation in the query itself, which is also the fastest place to do it; it is much, much faster than making the changes later in Excel tables.
You have the choice of loading into the Data Model, into an Excel table, or both.
Loading the data into the Data Model makes it available for any use you might have and is especially useful for creating pivot tables.
I'm new to databases. I've been saving a financials table from a website in JSON format on a daily basis, accumulating new files in my directory every day. I simply parse the contents into a C# collection for use in my program and compare data via LINQ.
Obviously I'm looking for a more efficient solution especially as my file collection will grow over time.
An example of a row of the table is:
{"strike":"5500","type":"Call","open":"-","high":"9.19B","low":"8.17A","last":"9.03B","change":"+.33","settle":"8.93","volume":"0","openInterest":"1,231"}
I'd prefer to keep a 'compact file' per stock that I can access individually as opposed to a large database with many stocks.
What would be an 'advisable' solution to use? I know that's a bit of an open-ended question, but some suggestions would be great.
I don't mind slower writing into the DB but a fast read would be beneficial.
What would be the best way to store the data? Strings or numerical values?
I found this link to help with the conversion: How to Save JSON data to SQL server database in C#?
Thank you.
For faster reads from a DB, I would suggest denormalizing the data.
Read "Normalization vs Denormalization"
Judging from your JSON file, it doesn't seem like you need any table joins, so keeping that flat structure should be fine.
For the comparison between varchar (string) and int (numeric): ints are faster to compare and sort than varchars, and they take less space. Integer types use 1-8 bytes depending on the type (a plain int is typically 4 bytes), whereas a varchar stores a small length prefix plus the actual characters. So fields such as strike, volume and open interest are better stored as numbers.
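Purely as a hedged illustration (in Python rather than C#, just to keep the sketch short), one way to normalize a row like the sample above into typed values before inserting it into a database might look like the following; the cleanup rules (stripping thousands separators, treating "-" as missing, dropping the trailing A/B bid/ask marker) are assumptions you would adjust to your actual data:
import json

def parse_row(raw):
    """Turn one JSON row of the financials table into typed values."""
    row = json.loads(raw)

    def to_number(value):
        # "-" marks a missing value; "1,231" uses thousands separators;
        # "9.19B" / "8.17A" carry a bid/ask suffix, stripped here (an assumption).
        value = value.replace(",", "").rstrip("AB")
        if value in ("", "-"):
            return None
        return float(value)

    volume = to_number(row["volume"])
    open_interest = to_number(row["openInterest"])
    return {
        "strike": to_number(row["strike"]),
        "type": row["type"],                  # categorical, keep as text
        "open": to_number(row["open"]),
        "high": to_number(row["high"]),
        "low": to_number(row["low"]),
        "last": to_number(row["last"]),
        "change": to_number(row["change"]),   # "+.33" -> 0.33
        "settle": to_number(row["settle"]),
        "volume": None if volume is None else int(volume),
        "open_interest": None if open_interest is None else int(open_interest),
    }
Once the values are stored as numbers, range filters and comparisons work on numeric columns rather than strings, which is where the speed benefit comes from.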
I am using a simple text file to store filenames and their hash values, which is later read to search for a particular file. Should I go for SQL for such a simple task?
It depends on your needs and operations.
If you only need simple operations like reads and writes (updates and deletions are harder than in a DB) and the data volume is very low, it is OK to go that way, though I wouldn't recommend it.
Relational databases are generally better than plain files because their row/tuple structure is well suited to data manipulation operations.
If your needs are simple, use a JSON or XML structure; either is much better than a raw text file.
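For instance, a minimal Python sketch of the JSON approach (the index file name and the choice of SHA-256 are just assumptions):
import hashlib
import json
from pathlib import Path

INDEX = Path("file_hashes.json")  # assumed name of the index file

def load_index():
    return json.loads(INDEX.read_text()) if INDEX.exists() else {}

def save_index(index):
    INDEX.write_text(json.dumps(index, indent=2))

def add_file(path):
    # Store (or update) the SHA-256 hash for one file.
    index = load_index()
    index[path] = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    save_index(index)

def find_by_hash(digest):
    # Return every stored filename whose hash matches.
    return [name for name, h in load_index().items() if h == digest]
If lookups become frequent or the index grows large, that is the point where moving to SQLite or another small database starts to pay off.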
Context. I have tens of SQL queries stored in separate files. For benchmarking purposes, I created an application that iterates through each of those query files and passes it to a standalone Spark application. The latter first parses the query, extracts the tables used, registers them (using registerTempTable() in Spark < 2 and createOrReplaceTempView() in Spark 2), and then actually executes the query (spark.sql()).
Challenge. Since registering the tables can be time consuming, I would like to register them lazily, i.e. only once when they are first used, and keep that in the form of metadata that can readily be reused by subsequent queries without re-registering the tables for each query. It's a sort of intra-job caching, but not any of the caching options Spark offers (table caching), as far as I know.
Is that possible? If not, can anyone suggest another approach to accomplish the same goal (iterating through separate query files and running a querying Spark application without re-registering tables that have already been registered)?
In general, registering a table should not take time (except that if you have lots of files it might take time to generate the list of file sources). It is basically just giving the dataframe a name. What takes time is reading the dataframe from disk.
So the basic question is how the dataframes (tables) are written to disk. If they are written as a large number of small files or in a file format that is slow to read (e.g. CSV), this can take some time: lots of files take time to list, and a "slow" file format means the actual reading is slow.
So the first thing you can try to do is read your data and resave it.
Let's say, for the sake of example, that you have a large number of CSV files in some path. You can do something like:
df = spark.read.csv("path/*.csv")
Now that you have a dataframe, you can rewrite it with fewer files and in a better format, such as:
df.coalesce(100).write.parquet("newPath")
If the above is not enough, and your cluster is large enough to cache everything, you might put everything in a single job: go over all the tables in all the queries, register all of them and cache them, then run your SQL queries one after the other (and time each one separately).
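A rough PySpark sketch of that single-job idea might look like the following; it keeps track of which tables have already been registered so each one is read, cached and registered only once. The parquet layout and path are assumptions, and the table-name extraction is whatever your existing parser already does:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("query-benchmark").getOrCreate()

registered = set()  # table names already registered in this job

def ensure_registered(table_name):
    # Read, cache and register a table only the first time it is needed.
    if table_name in registered:
        return
    # Assumption: each table lives in a parquet directory named after it.
    df = spark.read.parquet("/data/" + table_name)
    df.cache()
    df.createOrReplaceTempView(table_name)  # registerTempTable() on Spark < 2
    registered.add(table_name)

def run_query(sql_text, tables):
    # 'tables' is the list your existing parser extracts from the query file.
    for t in tables:
        ensure_registered(t)
    return spark.sql(sql_text).collect()
With this in place, iterating over the query files only pays the read and registration cost once per table.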
If all of this fails, you can try something like Alluxio (http://www.alluxio.org/) to create an in-memory file system and read from that.
I'm using Google's Cloud Storage & BigQuery. I am not a DBA; I am a programmer. I hope this question is generic enough to help others too.
We've been collecting data from a lot of sources and will soon start collecting data in real time. Currently, each source goes to an independent table. As new data comes in we append it to the corresponding existing table.
Our data analysis requires each record to have a timestamp. However, our source data files are too big to edit before we add them to Cloud Storage (4+ GB of textual data per file). As far as I know there is no way to append a timestamp column to each row before bringing them into BigQuery, right?
We are thus toying with the idea of creating daily tables for each source, but we don't know how this will work when we have real-time data coming in.
Any tips/suggestions?
Currently, there is no way to automatically add timestamps to a table, although that is a feature that we're considering.
You say your source files are too big to edit before putting them in Cloud Storage... does that mean that the entire source file should have the same timestamp? If so, you could import into a new BigQuery table without a timestamp, then run a query that basically copies the table but adds a timestamp, for example: SELECT all,fields, CURRENT_TIMESTAMP() FROM my.temp_table (you will likely want to use allow_large_results and set a destination table for that query). If you want to get a little bit trickier, you could use the __DATASET__ pseudo-table to get the modified time of the table, and then add it as a column to your table either in a separate query or in a JOIN. Here is how you'd use the __DATASET__ pseudo-table to get the last modified time:
SELECT MSEC_TO_TIMESTAMP(last_modified_time) AS time
FROM [publicdata:samples.__DATASET__]
WHERE table_id = 'wikipedia'
Another alternative to consider is the BigQuery streaming API (more info here). This lets you insert single rows or groups of rows into a table just by posting them directly to BigQuery. This may save you a couple of steps.
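As a small, hedged sketch using the current google-cloud-bigquery Python client (which wraps the streaming API; the original answer predates this client), you could stamp each row with a load timestamp on the client side as it is streamed in. The table ID and field name are placeholders:
from datetime import datetime, timezone

from google.cloud import bigquery

client = bigquery.Client()
TABLE_ID = "my-project.my_dataset.my_table"  # placeholder

def stream_rows(rows):
    # Stamp each row with an ingest timestamp, then stream the batch in.
    now = datetime.now(timezone.utc).isoformat()
    payload = [dict(row, ingest_timestamp=now) for row in rows]
    errors = client.insert_rows_json(TABLE_ID, payload)
    if errors:
        raise RuntimeError("Streaming insert failed: %s" % errors)
Streamed rows are generally available for querying shortly after insertion, which also fits the real-time collection you mention.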
Creating daily tables is a reasonable option, depending on how you plan to query the data and how many input sources you have. If this is going to make your queries span hundreds of tables, you're likely going to see poor performance. Note that if you need timestamps because you want to limit your queries to certain dates and those dates are within the last 7 days, you can use the time range decorators (documented here).
I have a huge file, around 10 GB, in *.csv format. It is data from 1960 to the present date for different regions. I was able to break the file down by region; there are about 8000 regions, so I have 8000 files of about 2 MB each.
I was wondering what would be the most efficient way to create an Access database system to look up data for each region. Is it by:
Separating the file into small files by region name (i.e. 8000 files, one per region) and importing them into Access each time, OR
Splitting them into files of a constant size, about 1 GB each, and querying them?
In either case, how do I import the files into Access?
As you may be aware, an Access database file is limited to 2GB in size, so you almost certainly won't be able to keep all of the information in one file. Even if it did just barely fit, keeping that much information in a single Access database file would likely be rather slow to work with.
Depending on the "shape" of your data there may be other limits in Access that could cause difficulties in your particular situation. For example, a table (or query) is limited to 255 columns. If you haven't done so already, take a look at the Access specifications before proceeding too much further with this.
And in any case, consider using another database as the back-end. Your data may even be too large for a single database in SQL Server Express Edition (maximum of 10 GB total size per database, as I recall), but even if you had to split the data into two SQL Express databases it would be easier to deal with than a dozen (or more?) Access databases.
Bear in mind that if you use a different database back-end you may still be able to use Access as a query and reporting tool (via ODBC linked tables).
Edit re: comment
Based on your description, if you will never need to query across regions (and remember that "never" is a very long time™) then your 8000-file approach would be feasible. However, I wouldn't necessarily recommend importing the corresponding CSV data every time you want to run a query. Instead, I would borrow ideas from both Tom's and HansUp's answers:
Plan "A": Start by running queries directly against the CSV files themselves to see if that is fast enough for your needs. You could test that by creating a linked table to the CSV file and running some typical queries. As Tom mentioned, a CSV linked table cannot be indexed, so if you find that the queries are too slow then you may have to go to Plan "B".
Plan "B": If you do need to import the CSV data then you'll probably want to use HansUp's suggestion of using DoCmd.TransferText to help automate the process. It seems wasteful to import the specific CSV file for every query, so you might consider creating ~8000 .accdb files and then using a query like...
strSQL = _
"SELECT * FROM TableName " & _
"IN ""C:\__tmp\region12345.accdb"" " & _
"WHERE StartDate BETWEEN #2013-05-10# AND #2013-05-15#"
...where your code could substitute
the name of the appropriate .accdb file based on the region of interest, and
the required date range.
If you will be doing this with VBA, you can use the DoCmd.TransferText Method to import CSV data into Access.
I wouldn't want to do that in your situation, though. 10 GB is too much data to reasonably manage in Access. And if you partition that into separate db files, querying data pulled from multiple db files is challenging and slow. Furthermore, if the query's combined result set hits the 2 GB Access limit, you will get a confusing error about insufficient disk space.
This is not a reasonable job for data storage in MS Access.
Gord's and HansUp's answers are very good. Use a better back-end for your data; free options include SQL Server Express and MySQL. If you're in a corporate environment, you may already have a license for MS SQL Server.
However, if you insist on doing this strictly in Access, here are two related ideas. Both require that you link and de-link (using VBA) the data you need, as you need it.
You don't have to import a CSV file to be able to see it as a table. You can link to it just as you would a table in another database.
Positives: You don't have to change your existing data format.
Drawbacks: You can't edit your existing data, nor can you index it, so queries may be slow.
Or, you can convert each CSV file into its own Access DB (you can automate this with VBA). Then, as in the above suggestion, link and de-link the tables as needed.
Positives: You can edit your existing data, and also index it, so queries may be quick.
Drawbacks: It's an awful lot of work just to avoid using a different back-end DB.