BigQuery - one file's rows duplicated out of many uploaded using the Java API

I am using the Java API quick start program to upload CSV files into BigQuery tables. I have uploaded more than a thousand files, but in one of the tables the rows from a single file are duplicated. I went through the logs but found only one entry for its upload. I also went through many similar questions where #jordan mentioned that the bug is fixed. Can there be any reason for this behavior? I have yet to try the suggested solution of setting a job ID, but I just cannot see any reason on my side for the duplicate entries...
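(As a side note on the job-ID idea mentioned above: with the google-cloud-bigquery Java client, a load can be keyed by a deterministic job ID, so a silent retry of the same file is rejected rather than appended twice. This is only a sketch; the dataset, table, and file names below are placeholders, and it assumes each source file can map to exactly one job ID.)

```java
import com.google.cloud.bigquery.*;

import java.io.OutputStream;
import java.nio.channels.Channels;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class IdempotentCsvLoad {
    public static void main(String[] args) throws Exception {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        Path csvFile = Paths.get("data/my_file_0001.csv");          // placeholder path
        TableId table = TableId.of("my_dataset", "my_table");       // placeholder names

        WriteChannelConfiguration config = WriteChannelConfiguration.newBuilder(table)
                .setFormatOptions(FormatOptions.csv())
                .build();

        // Derive the job ID from the file name: if the same file is submitted twice,
        // BigQuery rejects the second job instead of appending the rows again.
        JobId jobId = JobId.of("load-" + csvFile.getFileName().toString().replace('.', '-'));

        TableDataWriteChannel writer = bigquery.writer(jobId, config);
        try (OutputStream out = Channels.newOutputStream(writer)) {
            Files.copy(csvFile, out);
        }
        Job job = writer.getJob().waitFor();
        if (job.getStatus().getError() != null) {
            System.err.println("Load failed: " + job.getStatus().getError());
        }
    }
}
```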

Since BigQuery is append-only by design, you need to accept that there will be some duplicates in your system, and be able to write your queries in such a way that they select the most recent version of each row.
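A common way to do that is to keep only the newest row per key with a window function. This is a sketch only: the `id` key and `load_time` column are assumptions about your schema, and the table name is a placeholder.

```java
import com.google.cloud.bigquery.*;

public class LatestVersionQuery {
    public static void main(String[] args) throws InterruptedException {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // Keep only the most recently loaded row for each business key.
        String sql =
                "SELECT * EXCEPT(rn) FROM ("
              + "  SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY load_time DESC) AS rn"
              + "  FROM `my_project.my_dataset.my_table`)"
              + " WHERE rn = 1";

        QueryJobConfiguration query = QueryJobConfiguration.newBuilder(sql)
                .setUseLegacySql(false)
                .build();

        for (FieldValueList row : bigquery.query(query).iterateAll()) {
            System.out.println(row);
        }
    }
}
```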

Related

Backfill Google Analytics in BigQuery

I'm looking for a workaround on the following issue. Hope someone can help.
I'm unable to backfill data in the ga_sessions_ table in BigQuery through product linking in GA; e.g. the partition ga_sessions_20180517 is missing.
This specific view has already been linked before, and Google's documentation says that the historical load is only done once per view, hence the issue (https://support.google.com/analytics/answer/3416092?hl=en).
Is there any way to work around it?
Kind regards,
Martijn
You can use the Google Analytics Reporting API to get the data for that view. This method has a lot of restrictions (the data is sometimes sampled, and only 7 dimensions can be exported in one call), but at least you will be able to fetch your data in a partitioned manner.
Documentation here.
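For example, pulling the missing day through the v4 Reporting API with the Java client might look roughly like this. It is a sketch under assumptions: the view ID and credentials file path are placeholders, and you would pick your own dimensions and metrics (up to 7 dimensions per request).

```java
import com.google.api.client.googleapis.auth.oauth2.GoogleCredential;
import com.google.api.client.googleapis.javanet.GoogleNetHttpTransport;
import com.google.api.client.json.jackson2.JacksonFactory;
import com.google.api.services.analyticsreporting.v4.AnalyticsReporting;
import com.google.api.services.analyticsreporting.v4.AnalyticsReportingScopes;
import com.google.api.services.analyticsreporting.v4.model.*;

import java.io.FileInputStream;
import java.util.Arrays;

public class BackfillOneDay {
    public static void main(String[] args) throws Exception {
        // Service-account key file path is hypothetical.
        GoogleCredential credential = GoogleCredential
                .fromStream(new FileInputStream("service-account.json"))
                .createScoped(AnalyticsReportingScopes.all());

        AnalyticsReporting reporting = new AnalyticsReporting.Builder(
                GoogleNetHttpTransport.newTrustedTransport(),
                JacksonFactory.getDefaultInstance(),
                credential)
                .setApplicationName("ga-backfill")
                .build();

        // One request per missing day, e.g. the missing ga_sessions_20180517 partition.
        ReportRequest request = new ReportRequest()
                .setViewId("123456789")  // placeholder view ID
                .setDateRanges(Arrays.asList(
                        new DateRange().setStartDate("2018-05-17").setEndDate("2018-05-17")))
                .setDimensions(Arrays.asList(
                        new Dimension().setName("ga:date"),
                        new Dimension().setName("ga:pagePath"),
                        new Dimension().setName("ga:dimension1")))  // e.g. a clientId custom dimension
                .setMetrics(Arrays.asList(
                        new Metric().setExpression("ga:sessions"),
                        new Metric().setExpression("ga:pageviews")));

        GetReportsResponse response = reporting.reports()
                .batchGet(new GetReportsRequest().setReportRequests(Arrays.asList(request)))
                .execute();

        // The rows can then be written to a date-partitioned BigQuery table.
        response.getReports().forEach(r -> System.out.println(r.getData().getRowCount()));
    }
}
```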
If you need a lot of dimensions/metrics in hit-level format, scitylana.com has a service that can provide this data historically.
If you have a clientId set in a custom dimension, the data quality is near perfect.
It also works without a clientId set.
You can get all the history that is available through the API.
You can get 100+ dimensions/metrics in one batch into BQ.

How to add multiple images to a single record in SQL Server using asp.net

I am working on a site and one of its requirements is to add multiple images to a single column of the database along with other details.
This is the screenshot of my webform:
and this is my database table:
But I am confused about how to add multiple images to a specific field of one row. Is it possible for a single record to have a field with multiple images stored in it? Please suggest a good solution to this problem.
Thank you
You can store multiple files together in a single blob or stream by using MIME Multipart formatting. See this RFC: https://www.w3.org/Protocols/rfc1341/7_2_Multipart.html
Note that using a separate table with one record per image/file is a better overall solution, because there is a large overhead in extracting files from a multipart blob, making it slow and inefficient... so don't store files larger than a few kilobytes in a multipart blob.
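To illustrate the separate-table approach: one parent row, and one child row per image linked by a foreign key. This is a sketch only; the table, column, and connection names are hypothetical, and the same structure applies from ADO.NET in an ASP.NET app (the example below uses plain JDBC).

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.*;
import java.util.List;

public class SaveRecordWithImages {
    // Assumes tables roughly like:
    //   Records(Id INT IDENTITY PRIMARY KEY, Title NVARCHAR(200), ...)
    //   RecordImages(Id INT IDENTITY PRIMARY KEY, RecordId INT REFERENCES Records(Id),
    //                ImageData VARBINARY(MAX))
    public static void save(Connection con, String title, List<Path> images) throws Exception {
        con.setAutoCommit(false);
        try (PreparedStatement insertRecord = con.prepareStatement(
                     "INSERT INTO Records (Title) VALUES (?)", Statement.RETURN_GENERATED_KEYS);
             PreparedStatement insertImage = con.prepareStatement(
                     "INSERT INTO RecordImages (RecordId, ImageData) VALUES (?, ?)")) {

            insertRecord.setString(1, title);
            insertRecord.executeUpdate();

            ResultSet keys = insertRecord.getGeneratedKeys();
            keys.next();
            int recordId = keys.getInt(1);

            // One row per image instead of packing several images into one column.
            for (Path image : images) {
                insertImage.setInt(1, recordId);
                insertImage.setBytes(2, Files.readAllBytes(image));
                insertImage.executeUpdate();
            }
            con.commit();
        } catch (Exception e) {
            con.rollback();
            throw e;
        }
    }
}
```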

Enrich CSV with metadata from database

I've been looking around for a lightweight, scalable solution to enrich a CSV file with additional metadata from a database. Each line in the CSV represents a data item, and the columns are the metadata belonging to that item.
Basically I have a CSV extract and I need to add additional metadata from a database. The metadata can be accessed via ODBC or REST API call.
I have a number of options in my head but I'm looking for other ideas. My options are as follows:
Import the CSV into a database table, apply the additional metadata with SQL UPDATE statements by finding the necessary metadata with SELECT statements, and then export the data back into CSV format. For this solution I was thinking of using an ETL tool, which may be a bit heavyweight for this problem.
I also thought about a NodeJS-based solution where I read the CSV in, call a web service to get the metadata, and write the data back into the CSV file. However, the CSV can be quite large, with potentially tens of thousands of rows, so this could be heavy on memory, or not very performant in the case of line-by-line processing.
If you have a better solution in mind, please post. Many thanks.
I think you've come up with a couple of pretty good ideas here already.
Running with your first suggestion using an ETL tool to enrich your CSV files, you should check out https://github.com/streamsets/datacollector
It's a continuous ingestion approach, so you could even monitor a directory of CSV files to load as you get them. While there's no specific functionality yet for doing lookups in a database, it's certainly possible in a number of ways (including writing your own custom logic in Java, or a script in Python or JavaScript).
*Full disclosure: I work on this project.
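As for the line-by-line option from the question, the memory concern mostly goes away if you stream the file one line at a time and cache the metadata lookups, since tens of thousands of rows usually map to far fewer distinct keys. A minimal sketch, assuming the item id is in the first column; the file names, lookup query, and connection string are all hypothetical:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.HashMap;
import java.util.Map;

public class EnrichCsv {
    public static void main(String[] args) throws Exception {
        Map<String, String> cache = new HashMap<>();  // key -> metadata, avoids repeated lookups

        try (Connection con = DriverManager.getConnection("jdbc:...");  // placeholder JDBC/ODBC URL
             PreparedStatement lookup = con.prepareStatement(
                     "SELECT description FROM item_metadata WHERE item_id = ?");
             BufferedReader in = Files.newBufferedReader(Paths.get("items.csv"));
             BufferedWriter out = Files.newBufferedWriter(Paths.get("items_enriched.csv"))) {

            String line;
            while ((line = in.readLine()) != null) {
                String key = line.split(",", -1)[0];   // assumes the id is the first column, no quoted commas
                String metadata = cache.computeIfAbsent(key, k -> fetch(lookup, k));
                out.write(line + "," + metadata);
                out.newLine();                         // only one line is held in memory at a time
            }
        }
    }

    private static String fetch(PreparedStatement lookup, String key) {
        try {
            lookup.setString(1, key);
            try (ResultSet rs = lookup.executeQuery()) {
                return rs.next() ? rs.getString(1) : "";
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```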

Big Query table too fragmented - unable to rectify

I have got a Google BigQuery table that is too fragmented, meaning that it is unusable. Apparently there is supposed to be a job running to fix this, but it doesn't seem to have resolved the issue for me.
I have attempted to fix this myself, with no success.
Steps tried:
Copying the table and deleting the original - this does not work, as the table is too fragmented for the copy.
Exporting the file and re-importing. I managed to export to Google Cloud Storage; as the file was JSON I couldn't download it, but that was fine. The problem was on re-import: I was trying to use the web interface and it asked for a schema. I only have the file to work with, so I tried to use the schema as identified by BigQuery, but couldn't get it to be accepted - I think the problem was with the nested tree/leaf format not translating properly.
To fix this, I know I either need to get the coalesce process to work (out of my hands - anyone from Google able to help? My project ID is 189325614134), or to get help formatting the import schema correctly.
This is currently causing a project to grind to a halt, as we can't query the data, so any help that can be given is greatly appreciated.
Andrew
I've run a manual coalesce on your table. It should be marginally better, but there seems to be a problem where we're not coalescing as thoroughly as we should. We're still investigating; we have an open bug on the issue.
Can you confirm this is the SocialAccounts table? You should not be seeing the fragmentation limit on this table when you try to copy it. Can you give the exact error you are seeing?
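On the re-import side, one way around hand-typing the nested schema in the web UI is to read the schema off the existing table with the client library and pass it to a load job from Cloud Storage. A sketch, assuming the export is newline-delimited JSON in a bucket you control; the dataset, destination table, and bucket names are placeholders:

```java
import com.google.cloud.bigquery.*;

public class ReimportWithExistingSchema {
    public static void main(String[] args) throws InterruptedException {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // Reuse the schema of the fragmented table instead of re-entering it by hand.
        TableId original = TableId.of("my_dataset", "SocialAccounts");
        Schema schema = bigquery.getTable(original).getDefinition().getSchema();

        TableId destination = TableId.of("my_dataset", "SocialAccounts_reimported");
        LoadJobConfiguration load = LoadJobConfiguration
                .newBuilder(destination, "gs://my-bucket/socialaccounts-export-*.json")
                .setFormatOptions(FormatOptions.json())   // newline-delimited JSON export
                .setSchema(schema)
                .build();

        Job job = bigquery.create(JobInfo.of(load)).waitFor();
        if (job.getStatus().getError() != null) {
            System.err.println("Load failed: " + job.getStatus().getError());
        }
    }
}
```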

Exclude versioned documents while querying - RavenDB

I appended the versioning bundle midway through my project, after having written most of my Raven queries in my data access layer. Now, because of versioning, I have lots of replicated data: whenever I query a type of document, I can see the values repeated as many times as the document has been versioned. Is there a way to stop the revisioned documents from being returned when I query for the current data, without rewriting all of my queries with Exclude("Revisions")? Is there any setting like "query on revisioned documents = false" that I can set globally? Please suggest something to overcome this.
That is the way it is supposed to work, actually: revisions are normally excluded from queries. It appears that you have disabled the versioning bundle, which would cause this to happen.