Queries on Delta tables stored in Amazon S3

I'm trying to read some tables from a Delta Lake stored in an S3 bucket, using delta-rs in Rust.
When I run the code, it does seem to open the table, because printing the table returns the following:
version: 0
metadata: GUID=36348853-e380-4d3d-986f-034b1cd7bcd2, name=None, description=None, partitionColumns=[], createdTime=1632167494225, configuration={}
min_version: read=1, write=2
files count: 1
But when I try to run a query against it, I get the following error:
Parquet reader thread terminated due to error:
IoError(Os { code: 2, kind: NotFound, message: "No such file or directory" })
Here is my code; I have no idea what I can do to solve this issue:
#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
    env::set_var("AWS_ACCESS_KEY_ID", AWS_ACCESS_KEY_ID);
    env::set_var("AWS_SECRET_ACCESS_KEY", AWS_SECRET_ACCESS_KEY);
    let web_site_request = GetBucketWebsiteRequest {
        bucket: S3_TEST_BUCKET.to_string(),
        expected_bucket_owner: None,
    };
    let table_uri = "s3://dev-evandro/common/lakehouse-sync/parquet/payments/chargebee/customer/";
    let be = storage::get_backend_for_uri(table_uri).unwrap();
    let table = deltalake::open_table(table_uri).await.unwrap();
    println!("{}", table);
    let mut ctx = ExecutionContext::new();
    ctx.register_table("test_table", Arc::new(table))?;
    let batches = ctx
        .sql("SELECT * FROM test_table LIMIT 1")?
        .collect()
        .await?;
    let batch = pretty_format_batches(&batches).unwrap();
    println!("{}", batch);
    Ok(())
}

Related

Why doesn't such a simple BufWriter operation work?

The following code is very simple: open a file for writing, create a BufWriter from the file, and write a line of text.
The program reports no errors and returns Ok(10), but the file has no content and remains empty.
use tokio::io::AsyncWriteExt;

#[tokio::test]
async fn save_file_async() {
    let path = "./hello.txt";
    let inner = tokio::fs::OpenOptions::new()
        .create(true)
        .write(true)
        //.truncate(true)
        .open(path)
        .await
        .unwrap();
    let mut writer = tokio::io::BufWriter::new(inner);
    println!(
        "{} bytes written",
        writer.write("1234567890".as_bytes()).await.unwrap()
    );
}
You need an explicit flush; tokio's BufWriter buffers in memory and does not flush when dropped:
writer.flush().await.unwrap();
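For completeness, a minimal corrected version of the whole test (same setup as above, with truncate enabled so reruns start from an empty file):

use tokio::io::AsyncWriteExt;

#[tokio::test]
async fn save_file_async() {
    let path = "./hello.txt";
    let inner = tokio::fs::OpenOptions::new()
        .create(true)
        .write(true)
        .truncate(true)
        .open(path)
        .await
        .unwrap();
    let mut writer = tokio::io::BufWriter::new(inner);
    let n = writer.write("1234567890".as_bytes()).await.unwrap();
    println!("{} bytes written", n);
    // Flush the in-memory buffer into the file; tokio's BufWriter cannot
    // flush asynchronously on drop, so the data would otherwise be lost.
    writer.flush().await.unwrap();
}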

git2-rs leaving the index in an odd state

I'm using libgit2 via git2-rs. I'm trying to commit things, which is working; however, old files are showing up as deleted and/or staged even after the commit. How can I clean this up?
let repo: Repository = ...
let head_commit = match repo.head() {
    Ok(head) => head.peel_to_commit().ok(),
    Err(_) => None,
};
let head_tree = match head_commit {
    Some(ref commit) => commit.tree().ok(),
    None => None,
};
let mut diff_options = DiffOptions::new();
diff_options
    .include_untracked(true)
    .recurse_untracked_dirs(true);
let diff_result =
    repo.diff_tree_to_workdir_with_index(head_tree.as_ref(), Some(&mut diff_options));
let diff_deltas: Vec<_> = match diff_result {
    Ok(ref diff) => diff.deltas().collect(),
    Err(_) => Vec::new(),
};
if diff_deltas.is_empty() {
    info!("no files changed");
    return Ok(());
}
let mut index = repo.index()?;
for diff_delta in diff_deltas {
    let delta = diff_delta.status();
    match delta {
        Delta::Added
        | Delta::Copied
        | Delta::Modified
        | Delta::Renamed
        | Delta::Untracked
        | Delta::Unmodified => {
            let path = diff_delta.new_file().path().unwrap();
            debug!("Staging {:?} file: {:?}", delta, path);
            index.add_path(path)?;
        }
        Delta::Deleted => {
            let path = diff_delta.old_file().path().unwrap();
            debug!("Unstaging {:?} file: {:?}", delta, path);
            index.remove_path(path)?;
        }
        _ => debug!("skipping {:?} file", delta),
    }
}
let index_oid = index.write_tree()?;
let index_tree = repo.find_tree(index_oid)?;
let sig = Signature::new(&self.committer.name, &self.committer.email, &time)?;
let parents: Vec<_> = [&head_commit].iter().flat_map(|c| c.as_ref()).collect();
repo.commit(Some("HEAD"), &sig, &sig, message, &index_tree, &parents)?;
index.clear().unwrap();
As @user2722968 pointed out, you must call both index.write() and index.write_tree().
index.write_tree() creates a tree (or trees) from the index and returns the root tree object, which can be used to create a commit.
The trouble here is that you have updated HEAD with the commit that you've created. When you run git status, it compares the working directory to the index, and the index to the HEAD commit.
In this case you have updated the index in memory and used that to update HEAD, but you haven't actually written the index itself to disk. So the working directory will not match the index for any modified files, nor will the index match HEAD. (The working directory and HEAD contents will match, though git never actually compares those two directly.)
Once you call index.write(), the working directory, the index and the HEAD commit will all be identical, and you'll see the expected (empty) status.
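Concretely, that means one extra call at the end of the commit flow; a sketch of the tail end, reusing the variables from the question's code:

let index_oid = index.write_tree()?; // writes trees to the object database only
let index_tree = repo.find_tree(index_oid)?;
repo.commit(Some("HEAD"), &sig, &sig, message, &index_tree, &parents)?;
// Persist the in-memory index to .git/index so that `git status`
// sees the working directory, the index and HEAD in agreement.
index.write()?;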

How to test two parallel transactions in Rust SQLx?

I'm experimenting with Rocket, Rust and SQLx, and I'd like to test what happens when two parallel transactions try to insert a duplicated record into my table.
My insert fn contains nothing special and it works fine:
async fn insert_credentials<'ex, EX>(&self, executor: EX, credentials: &Credentials) -> Result<u64, Errors>
where
    EX: 'ex + Executor<'ex, Database = Postgres>,
{
    sqlx::query!(
        r#"INSERT INTO credentials (username, password)
        VALUES ($1, crypt($2, gen_salt('bf')))"#,
        credentials.username,
        credentials.password,
    )
    .execute(executor)
    .await
    .map(|result| result.rows_affected())
    .map_err(|err| err.into())
}
My test, though, hangs indefinitely since it waits for a commit that never happens:
#[async_std::test]
async fn it_should_reject_duplicated_username_in_parallel() {
    let repo = new_repo();
    let db: Pool<Postgres> = connect().await;
    let credentials = new_random_credentials();

    println!("TX1 begins");
    let mut tx1 = db.begin().await.unwrap();
    let rows_affected = repo.insert_credentials(&mut tx1, &credentials).await.unwrap();
    assert_eq!(rows_affected, 1);

    println!("TX2 begins");
    let mut tx2 = db.begin().await.unwrap();
    println!("It hangs on the next line");
    let rows_affected = repo.insert_credentials(&mut tx2, &credentials).await.unwrap();
    assert_eq!(rows_affected, 1);

    println!("It never reaches this line");
    tx1.commit().await.unwrap();
    tx2.commit().await.unwrap();
}
How do I create and execute those TXs in parallel, such that the assertions pass but the test fails when trying to commit the second TX?
For reference, this is my Cargo.toml:
[package]
name = "auth"
version = "0.1.0"
edition = "2021"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
async-trait = "0.1.52"
serde = "1.0.136"
thiserror = "1.0.30"
# TODO https://github.com/SergioBenitez/Rocket/issues/1893#issuecomment-1002393878
rocket = { git = "https://github.com/SergioBenitez/Rocket", features = ["json"] }

[dependencies.redis]
version = "0.21.5"
features = ["tokio-comp"]

[dependencies.sqlx]
version = "0.5.11"
features = ["macros", "runtime-tokio-rustls", "postgres"]

[dependencies.uuid]
version = "1.0.0-alpha.1"
features = ["v4", "fast-rng", "macro-diagnostics"]

## DEV ##
[dev-dependencies]
mockall = "0.11.0"

[dev-dependencies.async-std]
version = "1.11.0"
features = ["attributes", "tokio1"]
You can use async_std::future::timeout or tokio::time::timeout. Example using async_std:
use async_std::future::timeout;
use std::time::Duration;

let max_duration = Duration::from_millis(100);
assert!(timeout(max_duration, tx2.commit()).await.is_err());
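The tokio flavor mentioned above looks essentially the same; a minimal sketch, assuming a tokio 1.x timer is available (as it is here, via the "tokio1" feature of async-std in the Cargo.toml):

use std::time::Duration;
use tokio::time::timeout;

let max_duration = Duration::from_millis(100);
// tx2 is still blocked behind tx1, so the commit does not complete
// in time and `timeout` returns Err(Elapsed).
assert!(timeout(max_duration, tx2.commit()).await.is_err());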
If you want to continue with tx2 before completing tx1, you can async_std::task::spawn or tokio::spawn tx1 first:
async_std::task::spawn(async move {
    assert!(tx1.commit().await.is_ok());
});
@Mika pointed me in the right direction: I could spawn both transactions and add a bit of sleep to give the concurrent TXs some time to execute.
let handle1 = tokio::spawn(async move {
    let repo = new_repo();
    let mut tx = db1.begin().await.unwrap();
    let rows_affected = repo.insert_credentials(&mut tx, &credentials1).await.unwrap();
    assert_eq!(rows_affected, 1);
    tokio::time::sleep(Duration::from_millis(100)).await;
    tx.commit().await.unwrap()
});

let handle2 = tokio::spawn(async move {
    let repo = new_repo();
    let mut tx = db2.begin().await.unwrap();
    let rows_affected = repo.insert_credentials(&mut tx, &credentials2).await.unwrap();
    assert_eq!(rows_affected, 1);
    tokio::time::sleep(Duration::from_millis(100)).await;
    tx.commit().await.unwrap()
});

let (_first, _second) = rocket::tokio::try_join!(handle1, handle2).unwrap();
I thought this way both TXs would execute in parallel until the sleep line, then one would commit and the other would fail on the commit line. But no: both TXs do start in parallel, and TX1 runs until the sleep, but TX2 blocks on the insert line until TX1 commits, and then TX2 fails on the insert line. In hindsight that makes sense: with a unique constraint on username, Postgres cannot tell whether TX2's insert conflicts until TX1 either commits or rolls back, so it holds TX2's INSERT until then.
I guess that's just how the database works in this case, and maybe I could change that by messing with the TX isolation level, but that's not my intent here. I'm just playing to learn more, and that's enough learning for today :)
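If you want the test to actually pass by asserting that behavior, one possible shape is the sketch below. It reuses db1/db2, credentials1/credentials2 and new_repo() from the snippets above, and each task reports whether its insert succeeded instead of unwrapping:

use std::time::Duration;

// Both tasks attempt the same insert; the loser blocks on the row lock
// and gets a unique violation once the winner commits.
let run_tx = |db: Pool<Postgres>, credentials: Credentials| async move {
    let repo = new_repo();
    let mut tx = db.begin().await.unwrap();
    let inserted = repo.insert_credentials(&mut tx, &credentials).await.is_ok();
    tokio::time::sleep(Duration::from_millis(100)).await;
    if inserted {
        tx.commit().await.unwrap();
    }
    inserted
};

let handle1 = tokio::spawn(run_tx(db1, credentials1));
let handle2 = tokio::spawn(run_tx(db2, credentials2));
let (first, second) = rocket::tokio::try_join!(handle1, handle2).unwrap();
// Exactly one of the two parallel transactions should win the insert.
assert!(first != second);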

Rust macro to generate multiple individual tests

Is it possible to have a macro that generates standalone tests? I have two text files, one with inputs and another with expected outputs. Each line in the text files represents a new test.
Currently, this is how I run my tests:
#[test]
fn it_works() {
    let input = read_file("input.txt").expect("failed to read input");
    let input = input.split("\n").collect::<Vec<_>>();
    let output = read_file("output.txt").expect("failed to read output");
    let output = output.split("\n").collect::<Vec<_>>();
    input.iter().zip(output).for_each(|(a, b)| {
        println!("a: {}, b: {}", a, b);
        assert_eq!(b, get_result(a));
    })
}
But, as you can see, if one test fails, all of them fail, since there's a loop inside a single test. I need each iteration to be a single, isolated test, without having to repeat myself.
So I was wondering: is it possible to achieve that using macros?
The macro ideally would output something like:
#[test]
fn it_works_1() {
    let input = read_file("input.txt").expect("failed to read input");
    let input = input.split("\n").collect::<Vec<_>>();
    let output = read_file("output.txt").expect("failed to read output");
    let output = output.split("\n").collect::<Vec<_>>();
    assert_eq!(output[0], get_result(input[0])); // first test
}

#[test]
fn it_works_2() {
    let input = read_file("input.txt").expect("failed to read input");
    let input = input.split("\n").collect::<Vec<_>>();
    let output = read_file("output.txt").expect("failed to read output");
    let output = output.split("\n").collect::<Vec<_>>();
    assert_eq!(output[1], get_result(input[1])); // second test
}

// ... the N remaining tests: it_works_n()
You can't do this with a declarative macro, because a declarative macro cannot generate new identifiers to name the test functions. However, you can use a crate such as test-case, which can run the same test with different inputs:
use test_case::test_case;

#[test_case(0)]
#[test_case(1)]
#[test_case(2)]
fn it_works(index: usize) {
    let input = read_file("input.txt").expect("failed to read input");
    let input = input.split("\n").collect::<Vec<_>>();
    let output = read_file("output.txt").expect("failed to read output");
    let output = output.split("\n").collect::<Vec<_>>();
    assert_eq!(output[index], get_result(input[index]));
}
If you have a lot of different inputs to test, you could use a declarative macro to generate the code above, which would add all of the #[test_case] annotations.
Following Peter Hall's answer, I was able to achieve what I wanted. I added the seq_macro crate to generate the repeated #[test_case] attributes. Maybe there's a way to loop through all the test cases instead of manually defining the number of tests (like I did), but this is good for now:
macro_rules! test {
    ( $from:expr, $to:expr ) => {
        #[cfg(test)]
        mod tests {
            use crate::{get_result, read_file};
            use seq_macro::seq;
            use test_case::test_case;

            seq!(N in $from..$to {
                #(#[test_case(N)])*
                fn it_works(index: usize) {
                    let input = read_file("input.txt").expect("failed to read input");
                    let input = input.split("\n").collect::<Vec<_>>();
                    let output = read_file("output.txt").expect("failed to read output");
                    let output = output.split("\n").collect::<Vec<_>>();
                    let res = get_result(input[index]);
                    assert_eq!(
                        output[index], res,
                        "Test '{}': Want '{}' got '{}'",
                        input[index], output[index], res
                    );
                }
            });
        }
    };
}

test!(0, 82);

Rebuild an SQL database through the command line in Swift

I'm having a very difficult time trying to run command-line arguments through Swift. I need to run commands on SQL files that a user manually drags onto the app (so the file path is different every time).
The piping between my app and the command line is working (sending 'pwd' returns the correct response), but when I send the arguments I actually want, I cannot get them to work. I have tried using both "bin/bash" and "usr/bin/env" to no avail.
Essentially I am trying to rebuild a database that has been corrupted, without having to go into the terminal and do it myself. Common errors across attempts include 'Launch path not accessible' and 'File or directory not found'. I have tried using 'chmod 6' in the terminal to set the permissions on the file, but this still does not work. Any help on what I am doing wrong when accessing the file, or another way to rebuild a database, would be greatly appreciated.
func checkForCorruption(filePath: URL) -> (String?, Bool) {
    let folder = filePath.deletingLastPathComponent()
    let arguments = ["cd \(folder.relativePath)", "sqlite3 Restaurant.sql", ".mode insert", ".output dump.sql", ".dump", ".exit"]
    let task = Process()
    task.launchPath = "bin/bash/"
    task.arguments = arguments
    let inPipe = Pipe()
    task.standardInput = inPipe
    let pipe = Pipe()
    task.standardOutput = pipe
    let errPipe = Pipe()
    task.standardError = errPipe
    var output: [String] = []
    task.launch()
    task.waitUntilExit()
    let data = pipe.fileHandleForReading.readDataToEndOfFile()
    let errData = errPipe.fileHandleForReading.readDataToEndOfFile()
    if let out = NSString(data: data, encoding: String.Encoding.utf8.rawValue) {
        print(out)
    }
    if let errOut = NSString(data: errData, encoding: String.Encoding.utf8.rawValue) {
        print("error: \(errOut)")
    }
    let outHandle = pipe.fileHandleForReading
    if var string = String(data: data, encoding: .utf8) {
        string = string.trimmingCharacters(in: .newlines)
        output = string.components(separatedBy: "\n")
        do {
            try string.write(toFile: "\(folder.relativePath)/dump.sql", atomically: true, encoding: String.Encoding.utf8)
        } catch _ {
            print("something went wrong")
        }
    }
    outHandle.readabilityHandler = { pipe in
        print("reading")
        if let line = String(data: pipe.availableData, encoding: String.Encoding.utf8) {
            print("New output: \(line)")
        } else {
            print("Error decoding data: \(pipe.availableData)")
        }
    }
    return ("", false)
}
I got some help at work; for anyone struggling with this, here is the answer (the print statement just shows where the dump file is located):
let arguments = ["\(filePath.relativePath)", ".mode insert", ".output dump.sql", ".dump", ".exit"]
let task = Process()
task.launchPath = "/usr/bin/sqlite3"
task.arguments = arguments // attach the dot-commands to the task
print(FileManager.default.currentDirectoryPath)
task.launch()
task.waitUntilExit()