What is the best way to import data into a Google Cloud SQL DB from a spreadsheet file?
I have to import two files with 4k rows each into a DB.
I've tried to load 4k rows (one file) using Apps Script and the result was:
Execution succeeded [294.336 seconds total runtime]
Ideas?
Code here
https://pastebin.com/3RiM1CNb
Depends a bit on how often you need this done. From your comment "No, this files will be uploaded two times for month in gdrive.", I think you mean 2 times/month.
If you need this done programmatically, I suggest using a cron job and having either App Engine or a local machine run it.
You can access the spreadsheet with a service account (add it to the users of that spreadsheet like any other user) using the client libraries (check the quickstarts for your language of preference) and process the data with that. How to actually process the data depends on your language of choice, but that's simply inserting rows into MySQL.
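For illustration, a rough Python sketch of that service-account approach (the spreadsheet ID, range, table, columns, and connection details below are all hypothetical; it assumes google-api-python-client and the mysql-connector-python driver):
from google.oauth2 import service_account
from googleapiclient.discovery import build
import mysql.connector

SCOPES = ["https://www.googleapis.com/auth/spreadsheets.readonly"]
SPREADSHEET_ID = "your-spreadsheet-id"   # hypothetical

# Authenticate as the service account that was added to the spreadsheet
creds = service_account.Credentials.from_service_account_file(
    "service-account.json", scopes=SCOPES)
sheets = build("sheets", "v4", credentials=creds)

# Read the rows (start at A2 to skip the header row)
result = sheets.spreadsheets().values().get(
    spreadsheetId=SPREADSHEET_ID, range="Sheet1!A2:D").execute()
rows = result.get("values", [])

# Bulk-insert into Cloud SQL (MySQL); one executemany call handles 4k rows easily
conn = mysql.connector.connect(host="1.2.3.4", user="app",
                               password="secret", database="mydb")
cur = conn.cursor()
cur.executemany(
    "INSERT INTO my_table (col_a, col_b, col_c, col_d) VALUES (%s, %s, %s, %s)",
    rows)
conn.commit()
conn.close()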
The simplest option would be to export to CSV and import this into Cloud SQL. Note that you may need to reformat this into something Cloud SQL understands, but that depends on the source data in Google Sheets.
As for the error you're getting, you're exceeding the maximum allowed runtime for Apps Script, which is 6 minutes...
Thanks to the help in this forum, I got my SQL connection and inserts working now.
The following TR formula is used to retrieve the data from Eikon in Excel:
#TR($C3,"TR.CLOSEPRICE (adjusted=0);
TR.CompanySharesOutstanding;
TR.Volume;
TR.TURNOVER"," NULL=Null CODE=MULTI Frq=D SDate="&$A3&" EDate="&$A3)
For 100k RICs the formulas usually need between 30s and 120s to refresh. That would still be acceptable.
The problem is to get the same refresh-speed in a VBA-loop. Application.Run "EikonRefreshWorksheet" is currently used for a synchronous refresh as recommended in this post.
https://community.developers.refinitiv.com/questions/20247/can-you-please-send-me-the-excel-vba-code-which-ex.html
The syntax of the code is correct and works for 100 RICs. But already for 1k the fetching gets very slow, and it freezes completely for something like 50k, even with a timeout interval of 5 min.
I isolated the refresh-part. There is nothing else slowing it down. So is this maybe just not the right method for fetching larger data sets? Does anyone know a better alternative?
I finally got some good advice from the Refinitiv Developer Forum which I wanted to share here:
I think you should be using the APIs directly as opposed to opening a spreadsheet and taking the data from that - but all our APIs have limits in place. There are limits for the worksheet functions as well (which use the same backend services as our APIs) - as I think you have been finding out.
You are able to use our older Eikon COM APIs directly in VBA. In this instance you would want to use the DEX2 API to open a session and download the data. You can find more details and a DEX2 tutorial sample here:
https://developers.refinitiv.com/en/api-catalog/eikon/com-apis-for-use-in-microsoft-office/tutorials#tutorial-6-data-engine-dex-2
However, I would recommend using our Eikon Data API in the Python environment, as it is much more modern and will give you a much better experience than the COM APIs. If you have a list of, say, 50K instruments, you could make 10 API calls of 5K instruments each using some chunking, which is much easier to manage without even resorting to Excel. You can then use any Python SQL tool to ingest into any database you wish, all from one Python script.
import refinitiv.dataplatform.eikon as ek
ek.set_app_key('YOUR APPKEY HERE')
riclist = ['VOD.L', 'IBM.N', 'TSLA.O']
df, err = ek.get_data(riclist, ["TR.CLOSEPRICE(adjusted=0).date",
                                "TR.CLOSEPRICE(adjusted=0)",
                                "TR.CompanySharesOutstanding",
                                "TR.Volume",
                                "TR.TURNOVER"])
df
#df.to_sql - see note below
#df.to_csv("test1.csv")
This will return a pandas DataFrame that you can easily write directly into any SQLAlchemy database (see example here), or to CSV / JSON.
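To make the chunking concrete, a rough sketch along those lines (the RIC list, chunk size, connection string, and table name below are placeholders):
import pandas as pd
from sqlalchemy import create_engine
import refinitiv.dataplatform.eikon as ek

ek.set_app_key('YOUR APPKEY HERE')

riclist = ['VOD.L', 'IBM.N', 'TSLA.O']   # in practice the full 50K instrument list
fields = ["TR.CLOSEPRICE(adjusted=0).date", "TR.CLOSEPRICE(adjusted=0)",
          "TR.CompanySharesOutstanding", "TR.Volume", "TR.TURNOVER"]

# fetch in chunks of ~5K RICs and collect the partial dataframes
frames = []
for i in range(0, len(riclist), 5000):
    df, err = ek.get_data(riclist[i:i + 5000], fields)
    frames.append(df)
all_data = pd.concat(frames, ignore_index=True)

# write straight into any SQLAlchemy-supported database (connection string is hypothetical)
engine = create_engine("mysql+pymysql://user:password@host/dbname")
all_data.to_sql("eikon_prices", engine, if_exists="append", index=False)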
Unfortunately, our company policy does not allow for Python at the moment. But the VBA solution also worked, even though it took some time to understand the tutorial and it has more limitations.
Goal
We have a MEAN stack application that implements a strict mongoose schema. The MEAN stack app needs to be seeded with data that originates from a SQL Server database. The app should function as expected as long as the seeded data complies with the mongoose schema.
Problem
Currently, the data transfer job is being done through the mongo CLI, which does not perform validation. Issues that have come up include Date objects being saved as strings, keys required by our schema missing, entire documents missing, and so on. The dev team has lost hours of development time debugging the app and discovering these data issues.
Solution we are looking for
How can we validate data so it:
Throws errors
Fails and halts the transfer
Or gives some other indication that the data is not clean
Disclaimer
I was not part of the data transfer process so I don't have more detail on the specifics of that process.
This is a general problem of what you might call "batch import", "extract-transform-load (ETL)", or "data store migration", disconnected from any particular tech. I'd approach it as follows:
Export the data into some portable format (e.g. CSV or JSON)
Push the data into the new system through the same validation logic that will handle new data on an ongoing basis.
It's often necessary to modify that logic a bit. For example, maybe your API will autogenerate timestamps for normal operation, but for data import you want to set them explicitly from the old data source. A more complicated situation is when there are constraints across your models/entities that need to be suspended until all the data is present.
Typically, you write your import script or system to generate a summary of how many records were processed, which ones failed, and why. Then you fix the issues and re-run it on the remaining records. Repeat until you're happy.
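For a concrete (if simplified) picture, here is a sketch of such an import pass, in Python purely for illustration; the same shape works in Node against your mongoose schema. The field names, date format, and file name are hypothetical, and the actual insert into the target store is left out:
import csv
from datetime import datetime

REQUIRED_FIELDS = ["name", "email", "created_at"]   # stand-ins for your schema's required keys

def validate(row):
    errors = []
    for field in REQUIRED_FIELDS:
        if not row.get(field):
            errors.append("missing required field '%s'" % field)
    try:
        # make sure dates are real dates, not opaque strings
        datetime.strptime(row.get("created_at", ""), "%Y-%m-%d")
    except ValueError:
        errors.append("created_at is not a valid YYYY-MM-DD date")
    return errors

passed, failed = 0, []
with open("export.csv", newline="") as f:
    for line_no, row in enumerate(csv.DictReader(f), start=2):
        problems = validate(row)
        if problems:
            failed.append((line_no, problems))
        else:
            passed += 1
            # insert the clean record into the target store here

print("%d records passed, %d failed" % (passed, len(failed)))
for line_no, problems in failed:
    print("  line %d: %s" % (line_no, "; ".join(problems)))
if failed:
    raise SystemExit(1)   # halt the transfer instead of loading bad data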
P.S. It's a good idea to version control your import script.
Export to CSV and write a small script using Node; that will solve your problem. You can use the fast-csv npm package.
I need to transfer a large table in BigQuery, 2B records, to Cloud Storage in CSV format. I am doing the transfer using the console.
Due to the size of the file, I need to specify a URI including a * to shard the export. I end up with 400 CSV files in Cloud Storage. Each has a header row.
This makes combining the files time consuming, since I need to download the CSV files to another machine, strip out the header rows, combine the files, and then re-upload. FYI, the size of the combined CSV file is about 48 GB.
Is there a better approach for this?
Using the API, you will be able to tell BigQuery not to print the header row during the table extraction. This is done by setting the configuration.extract.printHeader option to false. See the documentation for more info. The command-line utility should also be able to do that.
Once you've done this, concatenating the files is much easier. On a Linux/Mac computer it would be a single cat command. However, you could also try to concatenate directly in Cloud Storage by using the compose operation. See more details here. Composition can be performed either from the API or with the command-line utility.
Since a compose operation is limited to 32 components, you will have to compose the files in batches of 32. That should come to around 13 composition operations for 400 files. Note that I have never tried the composition operation, so I'm just guessing on this part.
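As a rough sketch with the Python client libraries (project, dataset, table, and bucket names are placeholders, and I haven't verified the compose part myself):
from google.cloud import bigquery, storage

# extract without header rows (configuration.extract.printHeader = false)
bq = bigquery.Client()
job_config = bigquery.ExtractJobConfig()
job_config.print_header = False
bq.extract_table("my-project.my_dataset.my_table",          # hypothetical table
                 "gs://my-bucket/export/part-*.csv",
                 job_config=job_config).result()

# concatenate directly in Cloud Storage, 32 components per compose call
gcs = storage.Client()
bucket = gcs.bucket("my-bucket")
parts = sorted(bucket.list_blobs(prefix="export/part-"), key=lambda b: b.name)

intermediates = []
for i in range(0, len(parts), 32):
    blob = bucket.blob("export/intermediate-%03d.csv" % (i // 32))
    blob.compose(parts[i:i + 32])
    intermediates.append(blob)

# ~13 intermediates for 400 shards, so one final compose is enough
bucket.blob("export/combined.csv").compose(intermediates)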
From the console, use the bq utility to strip the headers:
bq --skip_leading_rows 1
Firstly, apologies if this question seems like a wall of text; I can't think of a better way to format it.
I have a machine with valuable data on it (circa 1995). The machine is running Unix (SCO OpenServer 6) with some sort of database stored on it.
The data is normally accessed via a software package of which the license has expired and the developers are no longer trading.
The software package connects to the machine via telnet to retrieve data and modify data (the telnet connection no longer functions due to the license being changed).
I can access the machine via an ODBC driver (SeaODBC.dll) over a network, which is how I was planning to extract the data. But so far I have retrieved 300,000 rows in just over 24 hours; I estimate there are around 50,000,000 rows in total, so at the current speed it will take 6 months!
I need either a quicker way to extract the data from the machine via ODBC or a way to extract the entire DB locally on the machine to an external drive/network drive or other external source.
I've played around with the Unix interface and the only large files I can find are in a massive matrix of single-character folders (e.g. A\G\data.dat, A\H\Data.dat, etc.).
Does anyone know how to find out which DB systems are installed on the machine? Hopefully it is a standard one and I'll be able to find a way to export everything into a nicely formatted file.
Edit
Digging around the file system, I have found a folder under root > L which contains lots of single-letter folders; each single-letter folder contains more single-letter folders.
There are also files named after the table I need (e.g. "ooi.r") which have the following format:
<Id>
[]
l for ooi_lno, lc for ooi_lcno, s for ooi_invno, id for ooi_indate
require l="AB"
require ls="SO"
require id=25/04/1998
{<id>} is s
sort increasing Id
I do not recognize those kinds of filenames, A\G\data.dat and so on (filenames with backslashes in them???), and it's likely to be a proprietary format, so I wouldn't expect much from that avenue. You can try running file on them to see if they are in any recognized format, just to check...
I would suggest improving the speed of data extraction over ODBC by virtualizing the system. A modern computer will have faster memory, faster disks, and a faster CPU and may be able to extract the data a lot more quickly. You will have to extract a disk image from the old system in order to virtualize it, but hopefully a single sequential pass at reading everything off its disk won't be too slow.
I don't know what the architecture of this system is, but I guess it is x86, which means it might not be too hard to virtualize (depending on how well the SCO OpenServer 6 OS agrees with the virtualization). You will have to use a hypervisor that supports full virtualization (not paravirtualization).
I finally solved the problem: running the query from another tool (not through MS Access or MS Excel) was massively faster. I ended up using DaFT (Database Fishing Tool) to SELECT INTO a text file, which processed all 50 million rows in a few hours.
It seems the DLL driver I was using doesn't work well with any MS products.
I want to measure the performance and scalability of my DB application. I am looking for a tool that would allow me to run many SQL statements against my DB, taking the DB and script (SQL) file as arguments (+necessary details, e.g. host name, port, login...).
Ideally it should let me control parameters such as the number of simulated clients and the duration of the test, and let me randomize variables or select them from a list (e.g. SELECT FROM ... WHERE value = #var, where var is read from the command line or randomized per execution). I would like the test results to be saved as a CSV or XML file that I can analyze and plot. And of course, in terms of pricing I prefer "free" or "demo" :-)
Surprisingly (for me at least) while there are dozens of such tools for web application load testing, I couldn't find any for DB testing!? The ones I did see, such as pgbench, use a built-in DB based on some TPC scenario, so they help test the DBMS configuration and H/W but I cannot test MY DB! Any suggestions?
Specifically I use Postgres 8.3 on Linux, though I could use any DB-generic tool that meets these requirements. The H/W has 32GB of RAM while the size of the main tables and indexes is ~120GB. Hence there can be a 1:10 response time ratio between cold vs warm cache runs (I/O vs RAM). Realistically I expect requests to be spread evenly, so it's important for me to test queries against different pieces of the DB.
Feel free to also contact me via email.
Thanks!
-- Shaul Dar (info#shauldar.com)
JMeter from Apache can handle different server types. I use it for load tests against web applications; others in the team use it for DB calls. It can be configured in many ways to get the load you need. It can be run in console mode and even be clustered across different client machines to minimize client overhead (which would otherwise falsify the results).
It's a Java application and a bit complex at first sight, but we still love it. :-)
k6.io can stress test a few relational databases with the xk6-sql extension.
For reference, a test script could be something like:
import sql from 'k6/x/sql';

const db = sql.open("sqlite3", "./test.db");

export function setup() {
  db.exec(`CREATE TABLE IF NOT EXISTS keyvalues (
    id integer PRIMARY KEY AUTOINCREMENT,
    key varchar NOT NULL,
    value varchar);`);
}

export function teardown() {
  db.close();
}

export default function () {
  db.exec("INSERT INTO keyvalues (key, value) VALUES('plugin-name', 'k6-plugin-sql');");
  let results = sql.query(db, "SELECT * FROM keyvalues;");
  for (const row of results) {
    console.log(`key: ${row.key}, value: ${row.value}`);
  }
}
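Note that this needs a k6 binary built with the extension (xk6 build --with github.com/grafana/xk6-sql, if I remember the command correctly); after that the script runs with a plain k6 run script.js.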
Read more in this short tutorial.
The SQL Load Generator is another such tool:
http://sqlloadgenerator.codeplex.com/
I like it, but it doesn't yet have the option to save test setup.
We never really found an adequate solution for stress testing our mainframe DB2 database so we ended up rolling our own. It actually just consists of a bank of 30 PCs running Linux with DB2 Connect installed.
29 of the boxes run a script which simply waits for a starter file to appear on an NFS mount, then starts executing fixed queries based on the data. The fact that these queries (and the data in the database) are fixed means we can easily compare against previous successful runs.
The 30th box runs two scripts in succession (the second is the same as all the other boxes). The first empties then populates the database tables with our known data and then creates the starter file to allow all the other machines (and itself) to continue.
This is all done with bash and DB2 Connect, so it is fairly easy to maintain (and free).
We also have another variant to do random queries based on analysis of production information collected over many months. It's harder to check the output against a known successful baseline but, in that circumstance, we're only looking for functional and performance problems (so we check for errors and queries that take too long).
We're currently examining whether we can consolidate all those physical servers into virtual machines, on both the mainframe running zLinux (which will use the shared-memory HyperSockets for TCP/IP, basically removing the network delays) and Intel platforms with VMWare, to free up some of that hardware.
It's an option you should examine if you don't mind a little bit of work up front since it gives you a great deal of control down the track.
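If you do end up rolling your own against Postgres, the skeleton is small. A rough sketch (table, column, connection details, client count, and duration are placeholders; psycopg2 assumed as the driver):
import csv, random, threading, time
import psycopg2

QUERY = "SELECT * FROM my_table WHERE value = %s"   # hypothetical table/column
CLIENTS = 10          # number of simulated clients
DURATION = 60         # seconds per run
results = []
lock = threading.Lock()

def client(client_id):
    conn = psycopg2.connect(host="dbhost", port=5432, dbname="mydb",
                            user="tester", password="secret")
    cur = conn.cursor()
    deadline = time.time() + DURATION
    while time.time() < deadline:
        param = random.randint(1, 1000000)           # randomized per execution
        start = time.time()
        cur.execute(QUERY, (param,))
        cur.fetchall()
        with lock:
            results.append((client_id, param, time.time() - start))
    conn.close()

threads = [threading.Thread(target=client, args=(i,)) for i in range(CLIENTS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# dump per-query timings to CSV for later analysis/plotting
with open("results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["client", "param", "seconds"])
    writer.writerows(results)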
Did you check Bristlecone, an open source tool from Continuent? I don't use it myself, but it works for Postgres and seems to be able to do the things you request. (Sorry, as a new user I cannot give you the direct link to the tool page, but Google will get you there ;o])