I have an issue when trying to export the data of a single column: the export is very slow. However, when I export the whole table, the export speed is very good.
Here is my MariaDB information in case it helps in figuring out the issue:
The table contains more than 6,557,519 records and has 35 columns.
The column I am trying to export is an integer and NOT NULL because it is the PK. When I let the export operation finish, the file containing the single column is 102 MB, so it should take almost no time at the server's regular export speed.
This is the single-column query:
SELECT phone_number FROM pos2
This is what I get when I export the single-column data:
This is the whole-table query:
SELECT * FROM pos2
This is what I get when I export the whole table's data:
You can see the difference in speed, which wastes a lot of my company's time, given that I do this almost on a daily basis.
I tried changing the export format from CSV to other formats like SQL or TXT, and the same problem happens.
I also tried compressing the query result before exporting by enabling the ZIP compression option in the server settings, but that made the situation worse: the export didn't happen at all and the server froze.
I am reading a table with a smaller number of columns/attributes (around 50-80) via a Snowflake reader node, and the table is read fine on the Mosaic Decisions canvas. But when the number of columns increases (approx. 385 columns), the Mosaic reader node fails. As a workaround I tried using a WHERE clause with 1=2; in that case it pulls in the structure of the table. But when I try to read the records, even with a LIMIT (only 10 records) applied to the query, it throws a connection timeout error.
I faced a similar issue while reading a table (approx. 300 columns) and managed it with the help of the input parameters available in Mosaic. In your case you will have to change the copy-field variable used in the query to 1=1 at run time.
The steps below can be followed to achieve this:
Create a parameter (e.g. copy_variable) whose default value is 2, to be used as the copy-field variable.
In the reader node, write the SQL with 1 = $(copy_variable). While validating, this is the same as the 1=2 condition, so it should validate fine.
Once it has validated and the schema has been generated, update the default value of $(copy_variable) to 1 so that at run time you still get all the records.
Attempts at changing a data type in Access have failed due to the error:
"There isn't enough disk space or memory". Over 385,325 records exist in the table.
Attempts at the solutions in the following links, among other Stack Overflow threads, have failed:
Can't change data type on MS Access 2007
Microsoft Access can't change the datatype. There isn't enough disk space or memory
The intention is to change the data type of one column from "Text" to "Number". The aforementioned links cannot accommodate that, either because of the table's size or because of the desired data type.
Breaking out the table may not be an option due to the number of records.
Help on this would be appreciated.
I cannot tell for sure about MS Access, but for MS SQL you can avoid a table rebuild (which requires lots of time and space) by appending a new column that allows NULL values at the rightmost end of the table, updating that column using normal UPDATE queries, and AFAIK even dropping the old column and renaming the new one. So in the end it's just the position of that column that has changed.
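As an illustration of that add/update/drop/rename sequence, here is a minimal sketch, assuming SQL Server; the table and column names (dbo.MyTable, AmountText, AmountNew) and the connection string are made up, and it is driven from Python via pyodbc only so it runs as one script. The statements can just as well be executed directly.

import pyodbc

# Hypothetical connection string; adjust driver, server and database.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=MyDb;Trusted_Connection=yes;"
)
conn.autocommit = True
cur = conn.cursor()

# 1. Append a new nullable column of the target type (no table rebuild).
cur.execute("ALTER TABLE dbo.MyTable ADD AmountNew INT NULL;")

# 2. Convert the data with ordinary UPDATE queries
#    (batching is possible if the transaction log becomes an issue).
cur.execute("UPDATE dbo.MyTable SET AmountNew = CAST(AmountText AS INT);")

# 3. Drop the old column and give the new column the old name.
cur.execute("ALTER TABLE dbo.MyTable DROP COLUMN AmountText;")
cur.execute("EXEC sp_rename 'dbo.MyTable.AmountNew', 'AmountText', 'COLUMN';")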
As for your 385,325 records (I'd expect that number to be correct): even if the table had 1,000 columns with 500 Unicode characters each, we'd end up with approximately 385,325 * 1000 * 500 * 2 bytes ~ 385 GB of data. That should nowadays not exceed what's available, so:
If it's disk space you're running out of, how about moving the data to some other computer, changing the DB there, and moving it back?
If the DB seems to be corrupted (and the standard tools didn't help; make a copy first), it will most probably help to create a new table or database using table-creation queries (better: create it manually and append the data).
I am trying to export data from a BigQuery table using the Python API. The table contains 1 to 4 million rows, so I have set the maxResults parameter to its maximum, i.e. 100000, and I am paging through the results. The problem is that I am getting only 2652 rows per page, so the number of pages is far too high. Can anyone explain why this happens, or suggest a way to deal with it? The format is JSON.
Or can I export data into CSV format without using GCS?
I tried inserting a job with allowLargeResults = true, but the result remains the same.
Below is my query body:
queryData = {'query': query,
             'maxResults': 100000,
             'timeoutMs': '130000'}
Thanks in advance.
You can try to export data from a table without using GCS by using the bq command-line tool (https://cloud.google.com/bigquery/bq-command-line-tool), like this:
bq --format=prettyjson query --n=10000000 "SELECT * from publicdata:samples.shakespeare"
You can use --format=json depending on your needs as well.
The actual page size is determined not by row count but rather by the size of the rows in a given page; I think the limit is somewhere around 10 MB.
You can also set maxResults to limit the number of rows per page, in addition to the criterion above.
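For illustration, here is a rough paging sketch using the google-cloud-bigquery client (rather than the raw REST body shown in the question); the sample table is public, and the page size is only an upper bound:

from google.cloud import bigquery

client = bigquery.Client()

# Ask for up to 100,000 rows per page; BigQuery may still return fewer
# rows per page because it also caps each page by response size (~10 MB).
query_job = client.query(
    "SELECT * FROM `bigquery-public-data.samples.shakespeare`"
)
rows = query_job.result(page_size=100000)

for page in rows.pages:
    for row in page:
        pass  # process or write out each row here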
I'm loading batch files into a table.
I want to add a timestamp column to the table so I can know the insertion time
of each record. I'm loading in append mode, so not all records are inserted at the same time.
Unfortunately, I haven't found a way to do this in BigQuery. When loading a file into a table, I didn't find an option to pad the inserted rows with additional columns. I just want to compute the timestamp in my code and set it as a constant field for the whole insertion.
The solution I'm using now is to load into a temp table and then query that table, plus a new timestamp field, into the target table. It works, but it's an extra step, and since I have multiple loads, the full process takes too much time because of the latency this step adds.
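For reference, the second step of that workaround might look roughly like this with the google-cloud-bigquery client (a sketch only; the project, dataset and table names are placeholders):

from google.cloud import bigquery

client = bigquery.Client()

# Append the temp table plus a constant timestamp into the target table.
job_config = bigquery.QueryJobConfig(
    destination="my-project.my_dataset.target_table",
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
client.query(
    "SELECT *, CURRENT_TIMESTAMP() AS insert_ts "
    "FROM `my-project.my_dataset.temp_table`",
    job_config=job_config,
).result()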
Does anyone know about another solution with only 1 step?
That's a great feature request for https://code.google.com/p/google-bigquery/issues/list. Unfortunately, there is no automated way to do it today. I like the way you are doing it though :)
If you are willing to make a new table to house this information, I recommend creating it with the following settings:
a table with the _PARTITIONTIME field based on insertion time
If you make a table partitioned on the default _PARTITIONTIME field, it does exactly what you are asking, based on the time of insertion.
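For example, a minimal sketch with the google-cloud-bigquery client that creates an ingestion-time partitioned table (the project, dataset, table and schema are placeholders):

from google.cloud import bigquery

client = bigquery.Client()

table = bigquery.Table(
    "my-project.my_dataset.my_table",                  # placeholder table id
    schema=[bigquery.SchemaField("payload", "STRING")],
)
# Ingestion-time partitioning: each row is assigned to a partition by the
# time it was loaded, exposed through the _PARTITIONTIME pseudo-column.
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY
)
client.create_table(table)

When querying, _PARTITIONTIME can then be selected like any other column to see when each row was loaded (truncated to the partition granularity, e.g. the day).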
You can add a timestamp column/value using a Pandas DataFrame:
from datetime import datetime
import pandas as pd
from google.cloud import bigquery
insertDate = datetime.utcnow()
bigqueryClient = bigquery.Client()
tableRef = bigqueryClient.dataset("dataset-name").table("table-name")
dataFrame = pd.read_json("file.json")
dataFrame['insert_date'] = insertDate
bigqueryJob = bigqueryClient.load_table_from_dataframe(dataFrame, tableRef)
bigqueryJob.result()
You can leverage the "hive partitioning" functionality of BigQuery load jobs to accomplish this. This feature is normally used for "external tables" where the data just sits in GCS in carefully-organized folders, but there's no law against using it to import data into a native table.
When you write your batch files, include your timestamp as part of the path. For example, if your timestamp field is called "added_at" then write your batch files to gs://your-bucket/batch_output/added_at=1658877709/file.json
Load your data with the hive partitioning parameters so that the "added_at" value comes from the path instead of from the contents of your file. Example:
bq load --source_format=NEWLINE_DELIMITED_JSON \
  --hive_partitioning_mode=AUTO \
  --hive_partitioning_source_uri_prefix=gs://your-bucket/batch_output/ \
  dataset-name.table-name \
  gs://your-bucket/batch_output/added_at=1658877709/*
The Python API has equivalent functionality.
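For example, a rough sketch of the same load with the google-cloud-bigquery client (the bucket, dataset and table names are placeholders matching the bq example above):

from google.cloud import bigquery

client = bigquery.Client()

# Derive the "added_at" value from the GCS path instead of the file contents.
hive_config = bigquery.HivePartitioningOptions()
hive_config.mode = "AUTO"
hive_config.source_uri_prefix = "gs://your-bucket/batch_output/"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
)
job_config.hive_partitioning = hive_config

load_job = client.load_table_from_uri(
    "gs://your-bucket/batch_output/added_at=1658877709/*",
    "my-project.my_dataset.my_table",   # placeholder destination table
    job_config=job_config,
)
load_job.result()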
So, I am trying to learn how to set up good, usable databases, and I have run into a problem with storing large amounts of data correctly. The database I am using is MSSQL 2008. For example:
We test about 50,000 devices a week. Each one of these devices has a lot of data associated with it. Overall, we mostly look at a summary calculated from the raw data. The summary is easy to handle; it's the raw data that I'm trying to enter into a database for future use, in case someone wants more details.
For the summaries, I have a database full of tables, one for each set of 50,000 devices. But for each device there is data similar to this:
("DevID") I,V,P I,V,P I,V,P ...
("DevID") WL,P WL,P WL,P ...
That totals 126 data points (~882 chars) for the first line and 12,000 data points (~102,000 chars) for the second line. What would be the best way to store this information? Create a table for each and every device (that seems unwieldy)? Is there a data type that can handle this much information? I am just not sure.
Thanks!
EDIT: Updated ~char count and second line data points.
You could just normalize everything into one table:
CREATE TABLE device
( id BIGINT IDENTITY(1,1) PRIMARY KEY
, DevID INT
, DataPoint VARCHAR(100)   -- pick a length that fits your readings
);
CREATE INDEX IX_device_DevID ON device (DevID);
Pseudocode obviously, since I don't know your exact requirements.
Does this data represent a series of readings over time? Time-series data tends to be highly repetitive, so a common strategy is to compress it in ways that avoid storing every single value: for example, use run-length encoding, or associate a time interval with each value instead of single points.
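For instance, a tiny run-length-encoding sketch in Python (purely illustrative; the readings and values are made up):

def run_length_encode(readings):
    """Collapse consecutive equal readings into (value, count) pairs."""
    encoded = []
    for value in readings:
        if encoded and encoded[-1][0] == value:
            encoded[-1] = (value, encoded[-1][1] + 1)
        else:
            encoded.append((value, 1))
    return encoded

# 12,000 raw points that only ever take a handful of distinct values
raw = [3.3] * 5000 + [3.4] * 4000 + [3.3] * 3000
print(run_length_encode(raw))  # [(3.3, 5000), (3.4, 4000), (3.3, 3000)]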