How can I recover a recently expired BigQuery table? - google-bigquery

A table I had been updating every day disappeared, and I found out it may have expired; I hadn't noticed the expiration setting.
Is there a simple way to recover that table?

If it is within 7 days, you may be able to fetch the data using time travel:
https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#for_system_time_as_of
Once you can read the data, you can save the query result to a destination table as a way of recovering the table.
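For example, a minimal sketch of that recovery (the project, dataset, and table names are placeholders):

-- Read the table as it existed before the expiry; the result can be
-- written to a destination table to materialize the recovered data.
SELECT *
FROM `my_project.my_dataset.my_table`
  FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 6 DAY);

FOR SYSTEM_TIME AS OF only accepts timestamps within the time-travel window (7 days by default). If the table name no longer resolves at all, the bq command-line tool's snapshot decorator (bq cp my_dataset.my_table@<unix-millis> my_dataset.recovered) is another documented way to restore a deleted table within that window.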

Related

BigQuery events_intraday tables are deleted but no events_ table is created

I have set up Google Analytics GA4 to send both streaming and daily data to BigQuery for a test project. When I first did this, the events_intraday_ table was converted into an events_ table each day.
As of sometime this week, that stopped working. In the morning I can see yesterday's intraday table, and when I send some data I see that today's is created. Sometime during the day, yesterday's table disappears, but no matching events_ table shows up.
Possibly related: I removed some old events_ tables last week as I was figuring out what I wanted to store. Is it possible that deleting them somehow triggered recurring deletions?
All the tables have expiration dates set to NEVER.
Is there a setting/log file/anything I can look at to understand what went wrong? And any ideas how to fix it?

How to identify deleted transactions in the Netsuite Transactions table

I'm working on building reports outside of NetSuite (in order to join this data with data from other source systems) using data pulled into Redshift from the NetSuite back-end tables. I have several tables that have been completely piped into Redshift, against which I write my queries. In trying to recreate some values from the monthly P&Ls, I noticed my totals were not tying out with what is shown in the NetSuite UI. After troubleshooting with our finance team, it appears that there are 3 invoices they deleted that are still showing up in the Transactions table. I do not see an IsDeleted field or anything similar. How can I identify which records in the table have been deleted, in order to filter them out of my results?
For transactions, as other posters have said, use the Deleted Records table, but a word to the wise here: Deleted Records only tracks the transactions themselves. If your end users delete some lines within a transaction, those deletions WON'T show up in Deleted Records.
Some commenters said to look in System Notes, but in our case we only had the new and old version_id for the impacted type in the system notes, and we never found a way to map those to anything on the ODBC side. (Correct me if I am wrong; I would be more than happy to learn a better way than the ugly hack we found.)
The only workaround we found for our process is to load all transactions with last_modified_date > {ts 'last import date'} into a temporary table and check whether any lines on those transactions were deleted, in addition to looking in Deleted Records to find the deleted transactions themselves. That is the only way we were able to make our P&L reports match long term.
The logic behind this is that, luckily, in our process the end users must always edit the transaction itself to delete lines. So when they save their changes, the transaction gets a new modified date.
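Roughly, the line-level diff looks like this (a sketch; stg_transaction_lines is a hypothetical staging table of the freshly pulled lines, and transaction_lines is the table already in Redshift):

-- A line that exists in the warehouse but is missing from the fresh pull
-- for a recently modified transaction was deleted in NetSuite.
SELECT w.transaction_id, w.line_id
FROM transaction_lines w
JOIN (SELECT DISTINCT transaction_id FROM stg_transaction_lines) m
  ON m.transaction_id = w.transaction_id
LEFT JOIN stg_transaction_lines s
  ON s.transaction_id = w.transaction_id
 AND s.line_id = w.line_id
WHERE s.line_id IS NULL;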
We asked NetSuite support directly, and they told us they do not have an official table that tracks line deletions.
You can create a saved search of Deleted Records (record type of Invoice) in NetSuite. Export it via CSV or Excel, then use that file to update the Redshift table and tag the deleted records.
Going forward, you could make an API call to Redshift (if available) whenever a NetSuite record gets deleted, which updates/tags the deleted record in Redshift. That way you don't need to regenerate the deleted-records export.
DELETE is logged in TRANSACTION_HISTORY

BigQuery streamed data is not in table

I've got an ETL process which streams data from a mongo cluster to BigQuery. This runs via cron on a weekly basis, and manually when needed. I have a separate dataset for each of our customers, with the table structures being identical across them.
I just ran the process, only to find that while all of my data chunks returned a "success" response ({"kind": "bigquery#tableDataInsertAllResponse"}) from the insertAll api, the table is empty for one specific dataset.
I had seen this happen a few times before, but was never able to reproduce it. I've now run the process twice more with the same results. I know my code is working, because the other datasets are properly populated.
There's no 'streaming buffer' in the table details, and running a count(*) query returns 0 response. I've even tried removing cached results from the query, to force freshness - but nothing helps.
Edit - After 10 minutes from my data stream (I keep timestamped logs) - partial data now appears in the table; however, after another 40 minutes, it doesn't look like any new data is flowing in.
Is anyone else experiencing hiccups in streaming service?
Might be worth mentioning that part of my process is to copy the existing table to a backup table, remove the original table, and recreate it with the latest schema. Could this be affecting the inserts in some specific edge cases?
Probably this is what is happening to you: BigQuery table truncation before streaming not working
If you delete or create a table, you must wait at least 2 minutes before starting to stream data into it.
Since you mentioned that all the other tables are working correctly, and only the table that goes through the delete-and-recreate process is not saving data, this probably explains what you are observing.
To fix the issue, you can either wait a bit longer after the delete and create operations before streaming, or change the upload strategy (for example, saving the data to a CSV file and then using a load job to upload it into the table).
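For illustration, here is a sketch of the load-job route using BigQuery's LOAD DATA statement (the dataset, table, and bucket names are placeholders; the same thing can be done through the jobs.insert load API from your ETL code):

-- Load CSV files from Cloud Storage with a load job instead of streaming.
LOAD DATA INTO my_dataset.my_table
FROM FILES (
  format = 'CSV',
  uris = ['gs://my_bucket/export-*.csv']
);

Load jobs commit atomically, so they avoid the warm-up window that streaming hits on freshly recreated tables.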

Collecting MySQL statistics

What would be the easiest way to count the new records that are inserted into a database? Is it possible to include a count query in with the load query?
Or is something more complex needed, such as recording the existing last record and counting everything added after it?
edit:
I have a cron job, that uses LOAD DATA INFILE in a script that is passed directly to mysql. This data is used with a php web application. As part of the php web application, I need to generate weekly reports, including how many records were inserted in the last week.
I am unable to patch MySQL or drastically change the database schema/structure, but I am able to add new tables or fields. I would prefer not to count records from the CSV file and store the result in a text file or something. Instead, I would prefer to do everything from within PHP with queries.
Assuming you're using MySQL 5 or greater, you could create a trigger that fires upon inserting into a specific table. Note that an "insert" trigger also fires with the LOAD command.
Using a trigger would require you to persist the count information in a separate table. Basically you'd need to create a new table with 1 row/column to hold the count. The trigger would then update that value with the amount of data loaded.
Here's the MySQL manual page on triggers, the syntax is fairly straight forward. http://dev.mysql.com/doc/refman/5.0/en/create-trigger.html
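A minimal sketch of that setup (my_table is a placeholder name; note that LOAD DATA INFILE fires the INSERT trigger once per loaded row):

-- Single-row table holding the running count.
CREATE TABLE insert_counts (row_count INT NOT NULL);
INSERT INTO insert_counts VALUES (0);

DELIMITER //
CREATE TRIGGER my_table_after_insert
AFTER INSERT ON my_table
FOR EACH ROW
BEGIN
  UPDATE insert_counts SET row_count = row_count + 1;
END//
DELIMITER ;

Your weekly report can then read row_count (and reset it to 0 afterwards if you want per-week figures).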
edit
Alternatively, if you don't want to persist the count within the database, you could perform a SELECT COUNT(*) on the table before you begin the Load and again after it completes, then subtract the two values to determine how many rows were inserted during the Load. (You could wrap the counting in a stored procedure, but note that MySQL does not permit LOAD DATA itself inside stored programs, so the load has to run outside it.)
Here's the MySQL manual page on procedures.
http://dev.mysql.com/doc/refman/5.0/en/create-procedure.html
That would probably depend on what counts as new. Is it entries inserted into the database in the last five or ten minutes, or any record past a certain auto-increment ID?
If you want a time-based method of determining what's new, you can add a field (probably of type DATETIME) that records when the row was inserted, and to get the number you simply do a...
SELECT COUNT(*) FROM your_table WHERE currentTime > 'time-you-consider-to-be-new';
If you don't want to record the time, you can use an auto-increment key and simply keep track of the last inserted ID, then count the rows that come after it for any given window. So if one hour ago the last ID was 10000, some number of records have been inserted since then: count all records with an ID greater than 10000, store the new last-insert ID, and repeat whenever needed.
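A quick sketch of the ID-based approach (your_table and id are placeholder names; 10000 is the example high-water mark from above):

-- Rows added since the last check.
SELECT COUNT(*) FROM your_table WHERE id > 10000;
-- New high-water mark to store for the next check.
SELECT MAX(id) FROM your_table;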
If you are not looking at a specific table, you can use the following:
show global status like "Com_%";
This will show you statistics for every type of query. These counters keep increasing for the lifetime of the server, so to use them, record the value when you start tracking and subtract it from the final number.
If you are looking for pure statistics, I can recommend using Munin with the MySQL plugins.
Where do you load the data from? You might consider counting the records before you insert them into the database. If it's an SQL script, you could write a quick and dirty bash script (with grep or something similar) to count the rows.
You say you can't change the structure. Does that mean you can't change the table you are inserting into, or that you can't change the database at all? If you can add a table, then just create one with 2 columns: a timestamp and the key of the table you are loading. Before you load your CSV file, create another CSV file with just those two columns, and load it after your main one (see the sketch below).
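A rough sketch of that companion-table idea (all names here are made up):

-- Two-column log table: when the row was loaded, and its key.
CREATE TABLE load_log (
  loaded_at DATETIME NOT NULL,
  record_key INT NOT NULL
);

-- Loaded right after the main CSV, from the companion CSV.
LOAD DATA INFILE '/tmp/load_log.csv'
INTO TABLE load_log
FIELDS TERMINATED BY ',';

-- Weekly report: rows inserted in the last week.
SELECT COUNT(*) FROM load_log WHERE loaded_at > NOW() - INTERVAL 7 DAY;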
This might be simpler than you want, but what about a Nagios monitor to track the row count? (Also consider asking around on serverfault.com; this stuff is totally up their alley.)
Perhaps you could write a small shell script that queries the database for the number of rows, and a cron job that runs every minute/hour/day and appends the COUNT to a log file. Over time, you could review the log file and see the rate at which the database is growing. If you also put a date in the log file, you could review it more easily over longer periods.
See if this is the kind of MySQL data collection you're interested in: http://code.google.com/p/google-mysql-tools/wiki/UserTableMonitoring.
If that is the case, Google offers a MySQL patch (to apply to a clean MySQL source directory) at http://google-mysql-tools.googlecode.com/svn/trunk/mysql-patches/all.v4-mysql-5.0.37.patch.gz. You can read more about the patch at http://code.google.com/p/google-mysql-tools/wiki/Mysql5Patches.
If this is not what you're looking for, I suggest you explain yourself a little more in order for us to help you better.
Could you use a trigger on the table that inserts into a table you created, whose structure includes a timestamp?
You could then use a date calculation on a period range to find the information needed.
I don't know which version of MySQL you are using, but here is a link to the trigger-creation syntax for version 5.0: http://dev.mysql.com/doc/refman/5.0/en/create-trigger.html
Good luck,
Matt
Well, if you need exhaustive information (which rows were inserted, updated, or deleted), it might make sense to create an additional audit table that stores those events with a timestamp. You could populate it with triggers. I would also write a stored procedure that runs as a scheduled event and erases old entries (whatever you consider old).
Refer to the link posted by Lima on how to create triggers in MySQL.
Refer to page 655 of "MySQL Cookbook" by Paul Dubois (2nd Edition) or page 158 of "SQL for smarties" by Joe Celko.
So will the 'load' only insert new data into the table, or rewrite the whole table?
If it only loads new data, then you can do a SELECT COUNT(*) FROM your_table
once before the load and once after; the difference will show you how many new records were inserted.
If, on the other hand, it rewrites the whole table and you want to find the records that differ from the previous version, then you would need a completely different approach.
Which one is it ?
Your question is a bit ambiguous, but the MySQL C API provides a function, mysql_affected_rows(), that you can call after each query to get the number of affected rows. For an INSERT it returns the number of rows inserted. Be aware that for UPDATEs it returns the number of rows changed, not the number of rows that matched the WHERE clause.
If you are performing a number of queries and need to know how many were inserted the most reliable way would probably be doing a count before and after the queries.
As noted in sobbayi's answer adding a "created at" timestamp to your tables would allow you to query for records created after (or before) a given time.
UPDATE:
OK here is what you need to do to get a count before and after:
create a table for the counts:

CREATE TABLE row_counts (ts TIMESTAMP NOT NULL, row_count INTEGER NOT NULL);

in your script, add the following before and after your LOAD DATA INFILE query:

INSERT INTO row_counts (ts, row_count) SELECT NOW(), COUNT(0) FROM YOUR_TABLE;
LOAD DATA INFILE ...
INSERT INTO row_counts (ts, row_count) SELECT NOW(), COUNT(0) FROM YOUR_TABLE;

the row_counts table will now have the counts before and after your load.
show global status like 'Com_insert';
FLUSH STATUS and SHOW SESSION STATUS will work for just the current connection.
see http://dev.mysql.com/doc/refman/5.1/en/server-status-variables.html#statvar_Com_xxx
Since you asked for the easiest way, I would suggest using a trigger on INSERT. You could use a single-column, single-row table as a counter and update it from the trigger.

How would you maintain a history in SQL tables?

I am designing a database to store product information, and I want to store several months of historical price data for future reference. However, after a set period, I would like to start overwriting the oldest entries, with minimal effort to find those entries. Does anyone have a good idea of how to approach this problem? My initial design is to have a table named historical_data; every day, a job pulls the active data and stores it in that historical table with a timestamp. Does anyone have a better idea? Or can you see what is wrong with mine?
First, I'd like to comment on your proposed solution. The weak part, of course, is that there can actually be more than one change between your snapshots: if the record was changed three times during the day, you only archive the last change.
A better solution is possible, but it must be event-driven. If you have a database server that supports events or triggers (like MS SQL), you should write trigger code that creates an entry in the history table. If your server does not support triggers, you can add the archiving code to your application (during the Save operation).
You could place a trigger on your price table. That way you can archive the old price into another table on each UPDATE or DELETE event.
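For instance, in MySQL-flavored SQL (a sketch; the prices table and its columns are made-up names):

CREATE TABLE price_history (
  product_id INT NOT NULL,
  old_price  DECIMAL(10,2) NOT NULL,
  changed_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);

DELIMITER //
CREATE TRIGGER prices_before_update
BEFORE UPDATE ON prices
FOR EACH ROW
BEGIN
  -- Archive the previous price whenever it actually changes.
  IF NEW.price <> OLD.price THEN
    INSERT INTO price_history (product_id, old_price)
    VALUES (OLD.product_id, OLD.price);
  END IF;
END//
DELIMITER ;

This captures every change as it happens, rather than only the end-of-day state.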
It's a much broader topic than it initially seems. Martin Fowler has a nice narrative about "things that change with time".
IMO your approach seems sound if the history you need is a snapshot of the data at the end of each day. In the past I have used a similar approach with overnight jobs (stored procedures) that pick up the day's new data, timestamp it, and then run a "delete all data that has a timestamp < today - x", where x is the time period of data I want to keep.
If you need to track all history changes, then you need to look at triggers.
I would like to, after a set period, start overwriting initial entries with minimal effort to find the initial entries
We store data in Archive tables using a trigger, as others have suggested. Our archive table has an additional AuditDate column and stores the "deleted" data, i.e. the previous version of the row. The current data is only stored in the actual table.
We prune the Archive table with a business rule along the lines of "Delete all Archive data more than 3 months old where there exists at least one archive record younger than 3 months old; delete all archive data more than 6 months old"
So if there has been no price change in the last 3 months you would still have a price change record from the 3-6 months ago period.
(A rough sketch of the self-referencing-join delete follows; ask if you need the Trigger that stores changes in the Archive table.)
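A hedged sketch of the pruning rule, in MySQL-flavored SQL (PriceArchive and ProductId are made-up names; AuditDate is the audit column mentioned above):

-- Rows older than 3 months are deleted only when the same product
-- has at least one archive record younger than 3 months.
DELETE old
FROM PriceArchive AS old
JOIN PriceArchive AS newer
  ON newer.ProductId = old.ProductId
 AND newer.AuditDate >= NOW() - INTERVAL 3 MONTH
WHERE old.AuditDate < NOW() - INTERVAL 3 MONTH;

-- Everything older than 6 months goes regardless.
DELETE FROM PriceArchive
WHERE AuditDate < NOW() - INTERVAL 6 MONTH;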