Is it possible to load constraints from a file (CSV, TXT) into Deequ Checks?

Is it possible to save suggested constraints to a file and then load them back as checks? I was able to apply them without saving using the following code:
val allConstraints = suggestionResult.constraintSuggestions.flatMap {
  case (_, suggestions) =>
    suggestions.map { _.constraint }
}.toSeq

val generatedCheck = Check(CheckLevel.Error, "generated constraints", allConstraints)

val verificationResult: VerificationResult = {
  VerificationSuite()
    .onData(tested_df)
    .addCheck(generatedCheck)
    .run()
}
However, I want to save them to a file and apply them later when needed. Is there any way to do this?
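One workaround, since Deequ does not expose a built-in serializer for Check objects: each ConstraintSuggestion also carries a codeForConstraint string (a Scala snippet such as .isComplete("colA")). The sketch below assumes scala-compiler is on the classpath and that compiling snippets at load time is acceptable; it persists the snippets to a text file and rebuilds the Check with the runtime reflection toolbox. saveSuggestions and loadCheck are hypothetical helper names:

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}
import com.amazon.deequ.checks.Check
import com.amazon.deequ.suggestions.ConstraintSuggestionResult
import scala.reflect.runtime.currentMirror
import scala.tools.reflect.ToolBox

// Persist each suggested constraint as its Scala code snippet, one per line.
def saveSuggestions(result: ConstraintSuggestionResult, path: String): Unit = {
  val snippets = result.constraintSuggestions.values.flatten.map(_.codeForConstraint)
  Files.write(Paths.get(path), snippets.mkString("\n").getBytes(StandardCharsets.UTF_8))
}

// Rebuild a Check by compiling "Check(...)" with the saved snippets appended.
def loadCheck(path: String): Check = {
  val snippets = new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8)
    .split("\n").filter(_.nonEmpty)
  val source =
    "import com.amazon.deequ.checks.{Check, CheckLevel}\n" +
      "Check(CheckLevel.Error, \"loaded constraints\")" + snippets.mkString
  val toolbox = currentMirror.mkToolBox()
  toolbox.eval(toolbox.parse(source)).asInstanceOf[Check]
}

The loaded Check can then be passed to addCheck exactly like the generated one above.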


Why doesn't such a simple BufWriter operation work?

The following code is very simple: open a file for writing, create a BufWriter over the file, and write a string.
The program reports no errors and returns Ok(10), but the file ends up with no content at all.
use tokio::io::AsyncWriteExt; // brings write()/flush() into scope

#[tokio::test]
async fn save_file_async() {
    let path = "./hello.txt";
    let inner = tokio::fs::OpenOptions::new()
        .create(true)
        .write(true)
        //.truncate(true)
        .open(path)
        .await
        .unwrap();
    let mut writer = tokio::io::BufWriter::new(inner);
    println!(
        "{} bytes written",
        writer.write("1234567890".as_bytes()).await.unwrap()
    );
}
Need an explicit flush. Tokio's BufWriter only buffers the bytes in memory and, unlike its std counterpart, does not flush automatically when it is dropped, so unflushed data is silently discarded:
writer.flush().await.unwrap();

Scalding Unit Test - How to Write A Local File?

I work at a place where Scalding writes are augmented with a specific API to track dataset metadata. When converting from normal writes to these special writes, there are some intricacies with respect to Key/Value, TSV/CSV, Thrift ... datasets. I would like to verify that the binary file is identical prior to conversion and after conversion to the special API.
Given that I cannot share the specific API for the metadata-inclusive writes, I only ask: how can I write a unit test for the .write method on a TypedPipe?
implicit val timeZone: TimeZone = DateOps.UTC
implicit val dateParser: DateParser = DateParser.default
implicit def flowDef: FlowDef = new FlowDef()
implicit def mode: Mode = Local(true)

val fileStrPath = root + "/test"
println("writing data to " + fileStrPath)

TypedPipe
  .from(Seq[Long](1, 2, 3, 4, 5))
  // .map((x: Long) => { println(x.toString); System.out.flush(); x })
  .write(TypedTsv[Long](fileStrPath))
  .forceToDisk
The above doesn't seem to write anything to local (OSX) disk.
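A likely reason nothing shows up: constructing a TypedPipe and calling .write only declares the flow, and nothing is materialized until the flow actually runs. A minimal local-mode sketch using Scalding's Execution API (the output path here is illustrative):

import com.twitter.scalding._

val writeExec: Execution[Unit] =
  TypedPipe
    .from(Seq[Long](1, 2, 3, 4, 5))
    .writeExecution(TypedTsv[Long]("/tmp/scalding-test/test"))
// waitFor blocks until the local flow completes and the file is on disk
writeExec.waitFor(Config.default, Local(strictSources = true))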
So I wonder if I need to use a MiniDFSCluster, something like this:
def setUpTempFolder: String = {
  val tempFolder = new TemporaryFolder
  tempFolder.create()
  tempFolder.getRoot.getAbsolutePath
}

val root: String = setUpTempFolder
println(s"root = $root")
val tempDir = Files.createTempDirectory(setUpTempFolder).toFile

val hdfsCluster: MiniDFSCluster = {
  val configuration = new Configuration()
  configuration.set(MiniDFSCluster.HDFS_MINIDFS_BASEDIR, tempDir.getAbsolutePath)
  configuration.set("io.compression.codecs", classOf[LzopCodec].getName)
  new MiniDFSCluster.Builder(configuration)
    .manageNameDfsDirs(true)
    .manageDataDfsDirs(true)
    .format(true)
    .build()
}
hdfsCluster.waitClusterUp()
val fs: DistributedFileSystem = hdfsCluster.getFileSystem
val rootPath = new Path(root)
fs.mkdirs(rootPath)
However, my attempts to get this MiniCluster to work haven't panned out either; somehow I need to link the MiniCluster with the Scalding write.
Note: the Scalding JobTest framework for unit testing isn't going to work, because the actual data written is sometimes wrapped in a bijection codec or set up with case class wrappers prior to the writes made by the metadata-inclusive write APIs.
Any ideas how I can write a local file (without using the Scalding REPL), with either Scalding alone or a MiniCluster? (If using the latter, I need a hint on how to read the file back.)
Answering: there is an example of how to use a mini cluster for exactly this, reading and writing to HDFS, in the tests for Scalding's TypedParquet type. I will be able to cross-read my different writes and examine them.
HadoopPlatformJobTest is an extension of JobTest that uses a MiniCluster.
With some hand-waving over details in the link, the bulk of the code is this:
"TypedParquetTuple" should {
"read and write correctly" in {
    import com.twitter.scalding.parquet.tuple.TestValues._

    def toMap[T](i: Iterable[T]): Map[T, Int] = i.groupBy(identity).mapValues(_.size)

    HadoopPlatformJobTest(new WriteToTypedParquetTupleJob(_), cluster)
      .arg("output", "output1")
      .sink[SampleClassB](TypedParquet[SampleClassB](Seq("output1"))) {
        toMap(_) shouldBe toMap(values)
      }
      .run()

    HadoopPlatformJobTest(new ReadWithFilterPredicateJob(_), cluster)
      .arg("input", "output1")
      .arg("output", "output2")
      .sink[Boolean]("output2")(toMap(_) shouldBe toMap(values.filter(_.string == "B1").map(_.a.bool)))
      .run()
  }
}
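To actually compare bytes across the two write paths (the original goal), something along the following lines should work against the MiniDFSCluster's FileSystem. This is a sketch, and readAllBytes is a hypothetical helper:

import java.io.ByteArrayOutputStream
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils

// Concatenate every part file under an output directory into one byte array,
// so the output of two write paths can be compared byte-for-byte.
def readAllBytes(fs: FileSystem, dir: Path): Array[Byte] = {
  val out = new ByteArrayOutputStream()
  fs.listStatus(dir)
    .filter(s => s.isFile && !s.getPath.getName.startsWith("_")) // skip _SUCCESS etc.
    .sortBy(_.getPath.getName)
    .foreach { status =>
      val in = fs.open(status.getPath)
      try IOUtils.copyBytes(in, out, 4096, false)
      finally in.close()
    }
  out.toByteArray
}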

How to write a string to clipboard (Windows OS) with a Kotlin/Native application?

I'm very new to Kotlin and am making a command-line .exe on Windows using Kotlin/Native. The application should read from a text file and print it to the screen, line by line. When it reaches the last line of the file, it should put that line in the clipboard.
aFile.txt looks something like this:
one
two
three
...
...
the last line
and the code in read.kt (Kotlin/Native) I have so far is this:
import kotlinx.cinterop.*
import platform.posix.*

fun main(args: Array<String>) {
    if (args.size != 1) {
        println("Usage: read.exe <file.txt>")
        return
    }
    val fileName = args[0]
    val file = fopen(fileName, "r")
    if (file == null) {
        perror("cannot open input file $fileName")
        return
    }
    try {
        memScoped {
            val bufferLength = 64 * 1024
            val buffer = allocArray<ByteVar>(bufferLength)
            do {
                val nextLine = fgets(buffer, bufferLength, file)?.toKString()
                if (nextLine == null || nextLine.isEmpty()) break
                print(nextLine)
            } while (true)
        }
    } finally {
        fclose(file)
    }
}
The code above prints each line on the screen, but how do I write the string "the last line" to the computer's clipboard? I'm looking for a native (not Java) solution, if that's possible.
Thank you very much.
Update:
Obviously, this is not the solution I was looking for, but I don't yet understand what they are talking about here (https://learn.microsoft.com/en-us/windows/desktop/api/winuser/nf-winuser-setclipboarddata).
As a temporary fix, I was able to get what I needed using system(), echo and clip with code like this:
system("echo ${nextLine} | clip")
print("${nextLine}")
Try the following (note that this relies on AWT, which is available on the JVM only, so it won't work for Kotlin/Native):
import java.awt.Toolkit
import java.awt.datatransfer.Clipboard
import java.awt.datatransfer.StringSelection

fun setClipboard(s: String) {
    val selection = StringSelection(s)
    val clipboard: Clipboard = Toolkit.getDefaultToolkit().systemClipboard
    clipboard.setContents(selection, selection)
}
In Windows, you can work with the clipboard through the WinAPI, as you can see there. The reference says that you need to use functions from the winuser.h header. This header is included in windows.h, as far as I know, so it is in your platform.windows.* package. You can confirm this by checking the Kotlin/Native repository files.
To clarify what I meant, I wrote this small example of platform.windows.* usage. You can add this function to your code and call it whenever you need to copy a string.
import platform.posix.memcpy // memcpy is exposed via the POSIX bindings
import platform.windows.*

fun toClipboard(lastLine: String?) {
    val len = lastLine!!.length + 1 // +1 for the terminating NUL byte
    // Allocate a movable global memory block and copy the string into it
    val hMem = GlobalAlloc(GMEM_MOVEABLE, len.toULong())
    memcpy(GlobalLock(hMem), lastLine.cstr, len.toULong())
    GlobalUnlock(hMem)
    // Open the clipboard, replace its contents, and close it again
    OpenClipboard(HWND_TOP)
    EmptyClipboard()
    SetClipboardData(CF_TEXT, hMem)
    CloseClipboard()
}

PDFBox does not clean up tmp files after the convertToImage method

I use the PDFBox convertToImage function and everything works fine, but PDFBox does not clean up its temporary files after the conversion. In my system's directory for temporary files, /tmp, there are many files such as +~JF132216249314633400.tmp; they are deleted only after restarting my application. While the application keeps running, the temporary files are never deleted.
PDFBox version: 1.8.15
When I use this:
page.convertToImage(BufferedImage.TYPE_INT_RGB, 300)
the PDFBox library creates tmp files such as "+~JF132216249314633400.tmp".
My method:
def splitPdfToImages(file: File): List[File] = {
  val document = PDDocument.load(file)
  val pages = (for (i <- 0 until document.getNumberOfPages)
    yield document.getDocumentCatalog.getAllPages.get(i).asInstanceOf[PDPage]).toList
  val imgFiles = pages.zipWithIndex.map { case (page, i) =>
    // Render the page to a JPEG held in memory
    val baos = IOUtils.createBAOS
    ImageIO.write(page.convertToImage(BufferedImage.TYPE_INT_RGB, 300), "jpg", baos)
    val bais = IOUtils.createBAIS(baos.toByteArray)
    try {
      val img = Image.fromStream(bais)
      implicit val writer = JpegWriter().withCompression(100)
      // Write the rendered page out to a temp file of our own
      val tmpFile = File.createTempFile(s"""${file.getName.split("\\.").head}_$i""", file.getName.split("\\.").last)
      img.output(tmpFile)
    } finally {
      baos.close()
      bais.close()
    }
  }
  document.close()
  imgFiles
}
Please help me to solve this issue.
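One direction worth trying, though I have not verified that it avoids the +~JF temp files for this workload: PDFBox 2.x replaced page.convertToImage with the PDFRenderer API. A minimal sketch of the same rendering loop against the 2.x API (method and variable names are illustrative):

import java.io.File
import javax.imageio.ImageIO
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.rendering.{ImageType, PDFRenderer}

def renderPdfToJpegs(file: File): List[File] = {
  val document = PDDocument.load(file)
  try {
    val renderer = new PDFRenderer(document)
    (0 until document.getNumberOfPages).toList.map { i =>
      // renderImageWithDPI is the 2.x replacement for page.convertToImage
      val image = renderer.renderImageWithDPI(i, 300, ImageType.RGB)
      val out = File.createTempFile(s"${file.getName.split("\\.").head}_$i", ".jpg")
      ImageIO.write(image, "jpg", out)
      out
    }
  } finally {
    document.close() // close the document even if rendering fails
  }
}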

Detecting file size with MultipartFormDataStreamProvider before file is saved?

We are using the MultipartFormDataStreamProvider to save file uploads from clients. I have a hard requirement that file size must be greater than 1KB. The easiest thing to do would of course be to save the file to disk and then look at it; unfortunately I can't do it like this. After I save the file to disk I don't have the ability to access it, so I need to inspect the file before it's saved to disk. I've been looking at the properties of the stream provider to try to figure out the size of the file, but so far I've been unsuccessful.
The test file I'm using is 1025 bytes:
MultipartFormDataStreamProvider.BufferSize is 4096
Headers.ContentDisposition.Size is null
ContentLength is null
Is there a way to determine file size before it's saved to the file system?
Thanks to Guanxi I was able to formulate a solution. I used his code in the link as the basis; I just added a little more async/await goodness :). I wanted to add the solution in case it helps anyone else:
private async Task SaveMultipartStreamToDisk(Guid guid, string fullPath)
{
    var user = HttpContext.Current.User.Identity.Name;
    var multipartMemoryStreamProvider = await Request.Content.ReadAsMultipartAsync();
    foreach (var content in multipartMemoryStreamProvider.Contents)
    {
        using (content)
        {
            if (content.Headers.ContentDisposition.FileName != null)
            {
                var existingFileName = content.Headers.ContentDisposition.FileName.Replace("\"", string.Empty);
                Log.Information("Original File name was {OriginalFileName}: {guid} {user}", existingFileName, guid, user);
                using (var st = await content.ReadAsStreamAsync())
                {
                    var ext = Path.GetExtension(existingFileName);
                    List<string> validExtensions = new List<string>() { ".pdf", ".jpg", ".jpeg", ".png" };
                    // 1024 bytes = 1KB
                    if (st.Length > 1024 && validExtensions.Contains(ext, StringComparer.OrdinalIgnoreCase))
                    {
                        var newFileName = guid + ext;
                        using (var fs = new FileStream(Path.Combine(fullPath, newFileName), FileMode.Create))
                        {
                            await st.CopyToAsync(fs);
                            Log.Information("Completed writing {file}: {guid} {user}", Path.Combine(fullPath, newFileName), guid, user);
                        }
                    }
                    else
                    {
                        if (st.Length < 1025)
                        {
                            Log.Warning("File of length {FileLength} bytes was attempted to be uploaded: {guid} {user}", st.Length, guid, user);
                        }
                        else
                        {
                            Log.Warning("A file of type {FileType} was attempted to be uploaded: {guid} {user}", ext, guid, user);
                        }
                        var responseMessage = new HttpResponseMessage(HttpStatusCode.BadRequest)
                        {
                            Content = st.Length < 1025
                                ? new StringContent($"file of length {st.Length} does not meet our minimum file size requirements")
                                : new StringContent($"a file extension of {ext} is not an acceptable type")
                        };
                        throw new HttpResponseException(responseMessage);
                    }
                }
            }
        }
    }
}
You can also read the request contents without using MultipartFormDataStreamProvider. In that case, all of the request contents (including files) would be in memory. I have given an example of how to do that at this link.
In this case you can read the header for the file size, or read the stream and check the file size. Only if it satisfies your criteria do you write it to the desired location.