SingleColumnValueFilter not working in Bigtable Emulator - bigtable

It seems that the Bigtable emulator is not filtering properly when using the SingleColumnValueFilter as shown in the example below; however, this same code works correctly in the production version of Bigtable.
Incorrect result output: "row1 row2 row3 row4"
Should have printed: "row3"
byte[] cf = "cf".getBytes();
byte[] cq = "cq".getBytes();
Connection conn = BigtableConfiguration.connect("fake-project", "fake-instance");
Admin admin = conn.getAdmin();
TableName testTableName = TableName.valueOf("testTable");
HTableDescriptor descriptor = new HTableDescriptor(testTableName);
descriptor.addFamily(new HColumnDescriptor(cf));
admin.createTable(descriptor);
byte[] val = { 0x1a };
byte[] val2 = { 0x11 };
byte[] val3 = "a".getBytes();
byte[] val4 = "b".getBytes();
Table table = conn.getTable(testTableName);
table.put(new Put("row1".getBytes()).addColumn(cf, cq, val));
table.put(new Put("row2".getBytes()).addColumn(cf, cq, val2));
table.put(new Put("row3".getBytes()).addColumn(cf, cq, val3));
table.put(new Put("row4".getBytes()).addColumn(cf, cq, val4));
Scan scan = new Scan().setFilter(new SingleColumnValueFilter(cf, cq, CompareOp.EQUAL, val3));
// THIS wrongly prints all rows in the table rather than just row3
for (Result r : table.getScanner(scan)) {
    String row = new String(r.getRow());
    System.out.print(row);
}

The code in the question is correct.
This was a bug in an older emulator version (gcloud beta, 2019.02.22); it has since been fixed (see the original report here).

Related

Order of the iterations of entries in an Ignite cache and seek method

What is the ordering of the keys in an Ignite cache (without using indexing), and is it possible to do the equivalent of the following RocksDB snippet,
try (final RocksIterator rocksIterator =
         rocksDB.newIterator(columnFamilyHandleList.get(1))) {
    for (rocksIterator.seek(prefixKey);
i.e. jump to the next entry starting with a given byte[] or String?
The way you'd do that in Ignite is by using SQL.
var query = new SqlFieldsQuery("select x,y,z from table where z like ? order by x").setArgs("prefix%");
try (var cursor = cache.query(query)) {
    for (var r : cursor) {
        Long id = (Long) r.get(0);
        BigDecimal value = (BigDecimal) r.get(1);
        String name = (String) r.get(2);
    }
}
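For SqlFieldsQuery to work, the cache needs an SQL schema. Below is a minimal, self-contained sketch of such a setup; the Entry value class and its field names are hypothetical stand-ins for the x, y, z columns in the query above, while @QuerySqlField, setIndexedTypes and SqlFieldsQuery are standard Ignite API:
import java.math.BigDecimal;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.query.SqlFieldsQuery;
import org.apache.ignite.cache.query.annotations.QuerySqlField;
import org.apache.ignite.configuration.CacheConfiguration;

public class IgnitePrefixQuery {
    // Hypothetical value class backing the x, y, z columns used in the answer above.
    public static class Entry {
        @QuerySqlField(index = true) Long x;
        @QuerySqlField BigDecimal y;
        @QuerySqlField(index = true) String z;

        Entry(Long x, BigDecimal y, String z) { this.x = x; this.y = y; this.z = z; }
    }

    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) {
            CacheConfiguration<Long, Entry> cfg = new CacheConfiguration<>("entries");
            cfg.setIndexedTypes(Long.class, Entry.class); // exposes the cache as SQL table "Entry"
            IgniteCache<Long, Entry> cache = ignite.getOrCreateCache(cfg);

            cache.put(1L, new Entry(1L, BigDecimal.ONE, "prefix-a"));
            cache.put(2L, new Entry(2L, BigDecimal.TEN, "other"));

            // The prefix "seek" expressed as SQL, as in the answer above.
            SqlFieldsQuery query = new SqlFieldsQuery(
                    "select x, y, z from Entry where z like ? order by x").setArgs("prefix%");
            try (var cursor = cache.query(query)) {
                for (var r : cursor) {
                    System.out.println(r.get(0) + " " + r.get(1) + " " + r.get(2));
                }
            }
        }
    }
}
With an index on z, a LIKE with a constant prefix can typically be answered with an index range scan, which is the closest analogue Ignite SQL offers to a RocksDB prefix seek.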

Not able to upload json data to Bigquery tables using c#

I am trying to upload JSON data to one of the tables created under a dataset in BigQuery, but it fails with "Google.GoogleApiException: 'Google.Apis.Requests.RequestError
Not found: Table currency-342912:sampleDataset.currencyTable [404]"
The service account is created with the roles BigQuery Admin/DataEditor/DataOwner/DataViewer.
The roles are also applied to the table itself.
Below is the snippet:
public static void LoadTableGcsJson(string projectId = "currency-342912", string datasetId = "sampleDataset", string tableId = "currencyTable ")
{
    // Read the service account key json file
    string dir = Directory.GetParent(Directory.GetCurrentDirectory()).Parent.Parent.FullName + "\\" + "currency-342912-ae9b22f23a36.json";
    GoogleCredential credential = GoogleCredential.FromFile(dir);
    string toFileName = Directory.GetParent(Directory.GetCurrentDirectory()).Parent.Parent.FullName + "\\" + "sample.json";
    BigQueryClient client = BigQueryClient.Create(projectId, credential);
    var dataset = client.GetDataset(datasetId);

    using (FileStream stream = File.Open(toFileName, FileMode.Open))
    {
        // Create and run job
        BigQueryJob loadJob = client.UploadJson(datasetId, tableId, null, stream); // This throws the error
        loadJob.PollUntilCompleted();
    }
}
Permissions for the table are granted to the service account "sampleservicenew", as shown in the screenshot.
Any leads on this are much appreciated.
Your issue might reside in your credentials. Please follow these steps to check your code:
Check whether the user you are using to execute your application has access to the table you want to insert data into.
Check whether your JSON tags match your table columns.
Check whether your JSON inputs (table name, dataset name) are correct.
Use a dummy table to perform a quick test of your credentials and data integrity.
These steps will help you identify what could be missing on your side. I performed the following operations to reproduce your case:
I created a table on BigQuery based on the values of your json data:
create or replace table `projectid.datasetid.tableid` (
  IsFee BOOL,
  BlockDateTime TIMESTAMP,
  Address STRING,
  BlockHeight INT64,
  Type STRING,
  Value INT64
);
Created a .json file with your test data
{"IsFee":false,"BlockDateTime":"2018-09-11T00:12:14Z","Address":"tz3UoffC7FG7zfpmvmjUmUeAaHvzdcUvAj6r","BlockHeight":98304,"Type":"OUT","Value":1}
{"IsFee":false,"BlockDateTime":"2018-09-11T00:12:14Z","Address":"tz2KuCcKSyMzs8wRJXzjqoHgojPkSUem8ZBS","BlockHeight":98304,"Type":"IN","Value":18}
Built and ran the code below.
using System;
using Google.Cloud.BigQuery.V2;
using Google.Apis.Auth.OAuth2;
using System.IO;

namespace stackoverflow
{
    class Program
    {
        static void Main(string[] args)
        {
            String projectid = "projectid";
            String datasetid = "datasetid";
            String tableid = "tableid";
            String safilepath = "credentials.json";

            var credentials = GoogleCredential.FromFile(safilepath);
            BigQueryClient client = BigQueryClient.Create(projectid, credentials);

            using (FileStream stream = File.Open("data.json", FileMode.Open))
            {
                BigQueryJob loadJob = client.UploadJson(datasetid, tableid, null, stream);
                loadJob.PollUntilCompleted();
            }
        }
    }
}
Output:
Row  IsFee  BlockDateTime            Address                               BlockHeight  Type  Value
1    false  2018-09-11 00:12:14 UTC  tz3UoffC7FG7zfpmvmjUmUeAaHvzdcUvAj6r  98304        OUT   1
2    false  2018-09-11 00:12:14 UTC  tz2KuCcKSyMzs8wRJXzjqoHgojPkSUem8ZBS  98304        IN    18
Note: You can use the code above to run quick tests of your credentials and of the integrity of the data to insert.
I also made use of the following documentation:
Load Credentials from a file
Google.Cloud.BigQuery.V2
Load Json data into a new table
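One thing worth double-checking, given that the error is "Not found: Table", is that the table really exists under exactly the name being passed (note the trailing space in the question's default tableId value, "currencyTable "). A minimal sketch that creates the table if it is missing, using the TableSchemaBuilder and GetOrCreateTable members of Google.Cloud.BigQuery.V2 (the key file path is the one from the question):
using System;
using Google.Apis.Auth.OAuth2;
using Google.Cloud.BigQuery.V2;

// Assumed paths and identifiers taken from the question.
var credential = GoogleCredential.FromFile("currency-342912-ae9b22f23a36.json");
var client = BigQueryClient.Create("currency-342912", credential);

// Schema matching the JSON test data above.
var schema = new TableSchemaBuilder
{
    { "IsFee", BigQueryDbType.Bool },
    { "BlockDateTime", BigQueryDbType.Timestamp },
    { "Address", BigQueryDbType.String },
    { "BlockHeight", BigQueryDbType.Int64 },
    { "Type", BigQueryDbType.String },
    { "Value", BigQueryDbType.Int64 }
}.Build();

// Creates the table if it is missing, otherwise returns the existing one,
// so the subsequent UploadJson call has something to load into.
var table = client.GetOrCreateTable("sampleDataset", "currencyTable", schema);
Console.WriteLine($"Table ready: {table.FullyQualifiedId}");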

Kafka Spark streaming HBase insert issues

I'm using Kafka to send a file with 3 columns, and Spark Streaming 1.3 to insert it into HBase.
This is what my HBase table looks like:
ROW            COLUMN+CELL
zone:bizert    column=travail:call, timestamp=1491836364921, value=contact:numero
zone:jendouba  column=travail:Big data, timestamp=1491835836290, value=contact:email
zone:tunis     column=travail:info, timestamp=1491835897342, value=contact:num
3 row(s) in 0.4200 seconds
And this is how I read the data with Spark Streaming (I'm using spark-shell):
import org.apache.spark.streaming.{ Seconds, StreamingContext }
import org.apache.spark.streaming.kafka.KafkaUtils
import kafka.serializer.StringDecoder

val ssc = new StreamingContext(sc, Seconds(10))
val topicSet = Set("zed")
val kafkaParams = Map[String, String]("metadata.broker.list" -> "xx.xx.xxx.xx:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicSet)
val lines = stream.map(_._2) // take the message values from the (key, value) pairs

lines.foreachRDD(rdd => {
  if (!rdd.partitions.isEmpty)
    lines.saveAsTextFiles("hdfs://xxxxx:8020/user/admin/zed/steams3/")
})
This code works when saving the data to HDFS, although it also saves a lot of empty files.
Before writing this question I searched here and through other questions like mine, but I didn't find a good solution.
Can you propose the best way to do this?
This is how my code looks now:
val sc = new SparkContext("local", "Hbase spark")
val tableName = "notz"
val conf = HBaseConfiguration.create()
conf.addResource(new Path("file:///opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/etc/hbase/conf.dist/hbase-site.xml"))
conf.set(TableInputFormat.INPUT_TABLE, tableName)
val admin = new HBaseAdmin(conf)

lines.foreachRDD(rdd => {
  if (!rdd.partitions.isEmpty) {
    if (!admin.isTableAvailable(tableName)) {
      print("Creating HBase Table")
      val tableDesc = new HTableDescriptor(tableName)
      tableDesc.addFamily(new HColumnDescriptor("zone".getBytes()))
      admin.createTable(tableDesc)
    } else {
      print("Table already exists!!")
    }
    val myTable = new HTable(conf, tableName)
    // i'm blocked here
  }
})
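From the point marked "i'm blocked here", a minimal sketch of writing each batch into HBase is shown below. It assumes each Kafka message is a semicolon-separated rowKey;qualifier;value triple and uses the "zone" column family created above; adjust the parsing to the real message format. The table handle is opened inside foreachPartition so that nothing non-serializable is captured by the closure.
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HTable, Put}
import org.apache.hadoop.hbase.util.Bytes

lines.foreachRDD(rdd => {
  if (!rdd.partitions.isEmpty) {
    rdd.foreachPartition(partition => {
      // One connection per partition, created on the executor side
      // (hbase-site.xml must be on the executor classpath).
      val hconf = HBaseConfiguration.create()
      val table = new HTable(hconf, tableName)
      partition.foreach(line => {
        // Assumed message format: rowKey;qualifier;value
        val Array(rowKey, qualifier, value) = line.split(";", 3)
        val put = new Put(Bytes.toBytes(rowKey))
        put.addColumn(Bytes.toBytes("zone"), Bytes.toBytes(qualifier), Bytes.toBytes(value))
        table.put(put)
      })
      table.flushCommits()
      table.close()
    })
  }
})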

Apache Spark Performance Issue

We thought of using Apache Spark to match records faster, but we are finding it far less efficient than SQL matching using a select statement.
Using:
JavaSparkContext javaSparkContext = new JavaSparkContext(new SparkConf().setAppName("AIRecordLinkage").setMaster("local[*]"));
Dataset<Row> sourceFileContent = spark.read().jdbc("jdbc:oracle:thin:@//Connection_IP/stage", "Database_name.Table_name", connectionProperties);
We are able to import around 1.8 million records into the Spark environment, stored in a Dataset object.
Now, using the filter function
targetFileContent.filter(col("TARGETUPC").equalTo(upcValue))
The above filter statement is in a loop where upcValue gets updated for approximately 46k IDs.
This program runs for several hours, but when we tried the same thing with a SQL IN operator containing all 46k UPC IDs, it executed in less than a minute.
Configuration:
Spark-sql 2.11
Spark-core 2.11
JDK 8
Windows 10, Single node 4 cores 3Ghz, 16 GB RAM.
C drive -> 12 GB free space.
Eclipse -> Run configuration -> -Xms15000m.
Kindly help us analyze and understand whether there are any mistakes, and suggest what needs to be done to improve the performance.
@Component("upcExactMatch")
public class UPCExactMatch {

    @Autowired
    private Environment envirnoment;

    @Autowired
    private LoadCSV loadCSV;

    @Autowired
    private SQLHandler sqlHandler;

    public ArrayList<Row> perform() {
        ArrayList<Row> upcNonMatchedItemIDs = new ArrayList<Row>();
        ArrayList<Row> upcMatchedItemIDs = new ArrayList<Row>();

        JavaSparkContext javaSparkContext = new JavaSparkContext(new SparkConf().setAppName("SparkJdbcDs").setMaster("local[*]"));
        SQLContext sqlContext = new SQLContext(javaSparkContext);
        SparkSession sparkSession = SparkSession.builder().appName("JavaStopWordshandlerTest").getOrCreate();

        try {
            Dataset<Row> sourceFileContent = loadCSV.load(sourceFileName, sourceFileLocation, javaSparkContext, sqlContext);

            // load target from database
            Dataset<Row> targetFileContent = sparkSession.read().jdbc("jdbc:oracle:thin:@//Connection_IP/stage", "Database_name.Table_name", connectionProperties);
            System.out.println("File counts :" + sourceFileContent.count() + " : " + targetFileContent.count());

            ArrayList<String> upcMatched = new ArrayList<String>();
            ArrayList<String> partNumberMatched = new ArrayList<String>();
            List<Row> sourceFileContents = sourceFileContent.collectAsList();

            int upcColumnIndex = -1;
            int itemIDColumnIndex = -1;
            int partNumberTargetIndex = -1;
            String upcValue = "";

            StructType schema = targetFileContent.schema();
            List<Row> data = Arrays.asList();
            Dataset<Row> upcMatchedRows = sparkSession.createDataFrame(data, schema);

            for (Row rowSourceFileContent : sourceFileContents) {
                upcColumnIndex = rowSourceFileContent.fieldIndex("Vendor UPC");

                if (!rowSourceFileContent.isNullAt(upcColumnIndex)) {
                    upcValue = rowSourceFileContent.get(upcColumnIndex).toString();
                    upcMatchedRows = targetFileContent.filter(col("TARGETUPC").equalTo(upcValue));

                    if (upcMatchedRows.count() > 0) {
                        for (Row upcMatchedRow : upcMatchedRows.collectAsList()) {
                            partNumberTargetIndex = upcMatchedRow.fieldIndex("PART_NUMBER");

                            if (partNumberTargetIndex != -1) {
                                upcMatched.add(upcValue);
                                partNumberMatched.add(upcMatchedRow.get(partNumberTargetIndex).toString());
                                System.out.println("Source UPC : " + upcValue + "\tTarget part number :" + upcMatchedRow.get(partNumberTargetIndex));
                            }
                        }
                    }
                }
            }

            for (int i = 0; i < upcMatched.size(); i++) {
                System.out.println("Matched Exact UPC ids are :" + upcMatched.get(i) + "\t:Target\t" + partNumberMatched.get(i));
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            sparkSession.stop();
            sqlContext.clearCache();
            javaSparkContext.close();
        }

        return upcMatchedItemIDs;
    }
}
Try doing an inner join between the two DataFrames to get the matched records.
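For reference, a minimal sketch of that join, reusing the Dataset variables and column names from the question (the helper method name is just for illustration); this lets Spark run one distributed job instead of roughly 46k separate filter/count/collect passes:
// Sketch: replace the per-UPC loop in perform() with a single join.
// sourceFileContent and targetFileContent are the same Datasets as in the question;
// requires: import static org.apache.spark.sql.functions.col;
private static Dataset<Row> matchByUpc(Dataset<Row> sourceFileContent, Dataset<Row> targetFileContent) {
    return sourceFileContent
            .filter(col("Vendor UPC").isNotNull())
            .join(targetFileContent,
                  sourceFileContent.col("Vendor UPC").equalTo(targetFileContent.col("TARGETUPC")),
                  "inner")
            .select(sourceFileContent.col("Vendor UPC"), targetFileContent.col("PART_NUMBER"));
}

// Usage inside perform():
// Dataset<Row> matched = matchByUpc(sourceFileContent, targetFileContent);
// matched.show(false);   // or collectAsList() if the matched set is small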

Java executed statement not returning string data in resultset

I have a simple SQL code that returns one record but when I execute it from Java, it does not return the string portions of the record, only numerical. The fields are VARCHAR2 but do not get extracted into my resultset. Following is the code. The database connectivity portion has been edited out for posting in the forum but it does connect. I have also attached the output. Any guidance would be appreciated as my searches on the web have returned empty. -Greg
package testsql;

import java.sql.*;

public class TestSQL {

    String SQLtracknbr;
    int SQLtracklength;
    int numberOfColumns;
    String coltypename;
    int coldispsize;
    String SQLschemaname;

    public static void main(String[] args)
            throws ClassNotFoundException, SQLException {

        Class.forName("oracle.jdbc.driver.OracleDriver");
        DriverManager.registerDriver(new oracle.jdbc.driver.OracleDriver());

        String url = "jdbc:oracle:thin:@oracam.corp.mot.com:1522:oracam";
        String SQLcode = "select DISTINCT tracking_number from sfc_unit_process_track where tracking_number = 'CAH15F6WW9'";
        System.out.println(SQLcode);

        Connection conn = DriverManager.getConnection(url, "report", "report");
        conn.setAutoCommit(false);

        try (Statement stmt = conn.createStatement(); ResultSet rset = stmt.executeQuery(SQLcode)) {
            ResultSetMetaData rsmd = rset.getMetaData();
            while (rset.next()) {
                int numberOfColumns = rsmd.getColumnCount();
                boolean b = rsmd.isSearchable(1);
                String coltypename = rsmd.getColumnTypeName(1);
                int coldispsize = rsmd.getColumnDisplaySize(1);
                String SQLschemaname = rsmd.getSchemaName(1);
                String SQLtracknbr = rset.getString(1);
                int SQLtracklength = SQLtracknbr.length();

                if (SQLtracknbr == null)
                    System.out.println("NULL**********************.");
                else
                    System.out.println("NOT NULL.");

                System.out.println("numberOfColumns = " + numberOfColumns);
                System.out.println("column type = " + coltypename);
                System.out.println("column display size = " + coldispsize);
                System.out.println("tracking_number = " + SQLtracknbr);
                System.out.println("track number length = " + SQLtracklength);
                System.out.println("schema name = " + SQLschemaname);
            }
        }
        System.out.println("*******End of code*******");
    }
}
The result of what is executed in Java is below:
run:
select DISTINCT tracking_number from sfc_unit_process_track where tracking_number = 'CAH15F6WW9'
NOT NULL.
numberOfColumns = 1
column type = VARCHAR2
column display size = 30
tracking_number =
track number length = 0
schema name =
*******End of code*******
BUILD SUCCESSFUL (total time: 0 seconds)
This seems to be caused by an incompatibility between the driver you're using, ojdbc7.jar, and the version of the database you're connecting to, 9i.
According to the JDBC FAQ section "What are the various supported Oracle database version vs JDBC compliant versions vs JDK version supported?", the JDK 7/8 driver ojdbc7.jar that you're using is only supported against Oracle 12c.
Oracle generally only supports client/server versions two releases apart (see My Oracle Support note 207303.1), and the Oracle 12c and 9i client and server have never been supported in either direction. JDBC is slightly different of course, but it may be related, as the drivers are installed with the Oracle software.
You will have to upgrade your database to a supported version, or - perhaps more practically in the short term - use an earlier driver. The Wayback Machine snapshot of the JDBC FAQ from 2013 says the 11.2.0 JDBC drivers - which include ojdbc6.jar and ojdbc5.jar - can talk to RDBMS 9.2.0. So either of those ought to work...
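If you want to confirm which driver and database versions are actually in play before swapping jars, the standard java.sql.DatabaseMetaData calls will report both sides (a small sketch reusing the connection details from the question):
import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.DriverManager;

public class CheckVersions {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@oracam.corp.mot.com:1522:oracam", "report", "report")) {
            DatabaseMetaData md = conn.getMetaData();
            // Driver side: should report the ojdbc jar actually on the classpath.
            System.out.println("Driver:   " + md.getDriverName() + " " + md.getDriverVersion());
            // Server side: should report the 9i release you are connecting to.
            System.out.println("Database: " + md.getDatabaseProductVersion());
        }
    }
}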