How do I partition on saving parquet files with polars using Rust?

So I was able to save some structs to a Parquet file using the following code:
pub fn store(data: Vec<Data>) -> () {
    let json = serde_json::to_string(&data).unwrap();
    let cursor = Cursor::new(json);
    let mut df: DataFrame = JsonReader::new(cursor)
        .finish()
        .unwrap();
    let buffer = File::create("foo.txt").unwrap();
    let pw = ParquetWriter::new(buffer);
    pw.finish(&mut df).unwrap();
    println!("{:?}", df)
}
I was wondering if there is a way to partition based on columns? This feature is supported in pandas; you can find the documentation here, where it is possible to pass partition_cols when calling the to_parquet function.
Is this feature supported in polars?
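I'm not aware of a built-in partition_cols equivalent on polars' Rust ParquetWriter (newer releases may have grown more options here), but a similar layout can be produced by splitting the frame yourself and writing one parquet file per group. Below is a minimal sketch under that assumption; the "category" column and the "out" directory are invented for the example, and DataFrame::partition_by is feature-gated with a signature that has changed between polars versions:

use std::fs::{create_dir_all, File};
use polars::prelude::*;

// Hand-rolled stand-in for pandas' partition_cols: split the DataFrame on a
// column and write one parquet file per partition. `partition_by` needs the
// matching polars feature flag; newer releases also take an `include_key` flag.
pub fn store_partitioned(df: &DataFrame) {
    create_dir_all("out").unwrap();
    let parts = df.partition_by(["category"], true).unwrap();
    for (i, mut part) in parts.into_iter().enumerate() {
        // A real implementation would typically derive a hive-style directory
        // name from the key value ("category=<value>"); a running index keeps
        // the sketch short.
        let file = File::create(format!("out/part-{}.parquet", i)).unwrap();
        ParquetWriter::new(file).finish(&mut part).unwrap();
    }
}

This mirrors the directory-per-group layout pandas produces, while leaving the naming scheme and error handling up to you.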

Related

How to access row elements in a polars LazyFrame/DataFrame

I struggle to access the row elements of a Frame.
One idea I have is to filter the dataframe down to a row, convert it to a vec or something similar, and access the elements that way?!
In Pandas I used to just use ".at / .loc / .iloc / etc."; with Polars in Rust I have no clue.
Any suggestions on what the proper way to do this is?
Thanks to #isaactfa ... he got me onto the right track. I ended up getting the row not with "get_row" but rather with "get"; this is probably due to my limited Rust understanding (my 2nd week).
Here is a working code sample:
use polars::export::arrow::temporal_conversions::date32_to_date;
use polars::prelude::*;

fn main() -> Result<()> {
    let days = df!(
        "date_string" => &["1900-01-01", "1900-01-02", "1900-01-03", "1900-01-04", "1900-01-05",
                           "1900-01-06", "1900-01-07", "1900-01-09", "1900-01-10"])?;
    let options = StrpTimeOptions {
        date_dtype: DataType::Date,   // the result column datatype
        fmt: Some("%Y-%m-%d".into()), // the source format of the date string
        strict: false,
        exact: true,
    };
    // convert date_string into dtype(date) and put it into the new column "date_type";
    // we convert the days DataFrame to a LazyFrame
    // because in my real-world example I am getting a LazyFrame
    let mut new_days_lf = days.lazy().with_column(
        col("date_string")
            .alias("date_type")
            .str()
            .strptime(options),
    );
    // Getting the weekday as a number:
    // This is what I wanted to do ... but I get a string result and need u32
    // let o = GetOutput::from_type(DataType::Date);
    // new_days_lf = new_days_lf.with_column(
    //     col("date_type")
    //         .alias("weekday_number")
    //         .map(|x| Ok(x.strftime("%w").unwrap()), o.clone()),
    // );
    // This is the convoluted workaround for getting the weekday as a number
    let o = GetOutput::from_type(DataType::Date);
    new_days_lf = new_days_lf.with_column(col("date_type").alias("weekday_number").map(
        |x| {
            Ok(x.date()
                .unwrap()
                .clone()
                .into_iter()
                .map(|opt_name: Option<i32>| {
                    opt_name.map(|datum: i32| {
                        // println!("{:?}", datum);
                        date32_to_date(datum)
                            .format("%w")
                            .to_string()
                            .parse::<u32>()
                            .unwrap()
                    })
                })
                .collect::<UInt32Chunked>()
                .into_series())
        },
        o,
    ));
    new_days_lf = new_days_lf.with_column(
        col("weekday_number")
            .shift_and_fill(-1, 9999)
            .alias("next_weekday_number"),
    );
    // now we convert the LazyFrame into a normal DataFrame for further processing:
    let mut new_days_df = new_days_lf.collect()?;
    // to get a column by name we need to collect the LazyFrame into a normal DataFrame;
    // convert the column to a series
    let col1 = new_days_df.column("weekday_number")?;
    // convert the column to a series
    let col2 = new_days_df.column("next_weekday_number")?;
    // now I can use series arithmetic
    let diff = col2 - col1;
    // create a bool column based on "element == 2"
    // and add the bool column to the DataFrame
    new_days_df.replace_or_add("weekday diff eq(2)", diff.equal(2)?.into_series())?;
    // could not figure out how to filter the eager frame ...
    let result = new_days_df
        .lazy()
        .filter(col("weekday diff eq(2)").eq(true))
        .collect()
        .unwrap();
    // could not figure out how to access ROW elements,
    // thus I used "get" instead of "get_row"
    // getting the date where diff is == 2 (true)
    let filtered_row = result.get(0).unwrap();
    // within the filtered_row get an element by index
    let date = filtered_row.get(0).unwrap();
    println!("\n{:?}", date);
    Ok(())
}
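As an aside (not part of the original answer): newer polars releases expose a dt namespace on expressions that can likely replace the manual date32_to_date mapping above. A sketch under that assumption; the method names and feature flags ("temporal", "dtype-date") vary by version:

use polars::prelude::*;

// Hypothetical shorter route to the weekday number: assumes a polars build
// where the `dt` expression namespace and `weekday()` are available.
fn with_weekday_number(lf: LazyFrame) -> LazyFrame {
    lf.with_column(
        col("date_type")
            .dt()
            .weekday()
            .alias("weekday_number"),
    )
}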

scala spark foldLeft on map with array

I have a configuration in the form of a map:
val config: Map[String, Array[String]] = Map("column1" -> Array("field1"), "column2" -> Array("field1","field2"))
I want to use foldLeft to apply this to a dataframe and dynamically filter nested fields using the dropFields function.
I came up with:
val result = config.foldLeft(""){ (k, v) =>
  df.withColumn(v._1, col(k + v._1).dropFields(v._2: _*))
}
but I struggle to make foldLeft work; any help would be appreciated.

Get the file name as a string in Rust

I am trying to get the file name from a given path string using the following code:
use std::path::Path;

fn get_filename() -> Result<(), std::io::Error> {
    let file = "folder/file.text";
    let path = Path::new(file);
    let filename = path.file_name()?.to_str()?;
    println!("{}", filename);
    Ok(())
}
But I get this error:
error[E0277]: `?` couldn't convert the error to `std::io::Error`
The original code didn't want to return Result, but that's the only way to use ?. How can I fix this?
Based on Niklas's answer, here is a working version:
use std::path::Path;

fn filename() -> Option<()> {
    let file = "hey.text";
    let path = Path::new(file);
    let filename = path.file_name()?.to_str()?;
    println!("{}", filename);
    Some(())
}
Both .file_name() and .to_str() return an Option here. Using the ? operator on an Option inside a function that returns Result (converting the None variant into an Err) is an experimental feature (see the NoneError documentation). If you want to use the ? operator, you could consider returning Option<()> as above, or unwrapping the values by hand.
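If the function really needs to keep the Result<(), std::io::Error> signature from the question, another option (a sketch, not from the original answer) is to map the None cases to an io::Error explicitly with ok_or_else, so the ? operator works on stable Rust:

use std::io::{Error, ErrorKind};
use std::path::Path;

fn get_filename() -> Result<(), Error> {
    let file = "folder/file.text";
    let filename = Path::new(file)
        .file_name()
        .and_then(|name| name.to_str())
        // turn a missing or non-UTF-8 file name into an io::Error so `?` applies
        .ok_or_else(|| Error::new(ErrorKind::InvalidInput, "no valid file name"))?;
    println!("{}", filename);
    Ok(())
}

fn main() {
    get_filename().unwrap();
}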

Extract words from a string in spark hadoop with scala

I was using the code below to extract the strings I needed in Spark SQL. But now I am working with more data in Spark on Hadoop and want to extract the same strings. I tried the same code, but it does not work.
val sparkConf = new SparkConf().setAppName("myapp").setMaster("local[*]")
val sc = new SparkContext(sparkConf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import org.apache.spark.sql.functions.{col, udf}
import java.util.regex.Pattern

// User Defined Function to extract the matches
def toExtract(str: String) = {
  val pattern = Pattern.compile("#\\w+")
  val tmplst = scala.collection.mutable.ListBuffer.empty[String]
  val matcher = pattern.matcher(str)
  while (matcher.find()) {
    tmplst += matcher.group()
  }
  tmplst.mkString(",")
}

val Extract = udf(toExtract _)
val values = List("#always_nidhi #YouTube no i dnt understand bt i loved the music nd their dance awesome all the song of this mve is rocking")
val df = sc.parallelize(values).toDF("words")
df.select(Extract(col("words"))).show()
How do I solve this problem?
First off, you're not using Spark the way it's meant to be used. Your DataFrame isn't partitioned at all. Use:
val values = List("#always_nidhi", "#YouTube", "no", "i", "dnt", "understand" ...). That way, each batch of words will be assigned to a different partition, and to different JVMs and/or cluster nodes (depending on the total number of partitions and the size of the data). In your solution, the entire sentence is assigned to a single partition, so there is no parallelism or distribution.
Second, you don't have to use a UDF (try to avoid those in general).
In order to find your regex, you can simply execute:
dataFrame.filter(col("words") rlike "#\\w+")
Hope it helps :-)

Generate a sorted map in Rust

I am looking for a way to generate a sorted map, with the keys ordered by multiple fields of the value.
Something like this:
let mut f = File::open("storing_data_like_a_pro.csv")?;
let mut buffer = String::new();
f.read_to_string(&mut buffer)?;
let sorted_cache_map: BTreeMap<String, FancyParsedData> = buffer.lines()
    .map(|line| FancyParsedData::new(line))
    .map(|data| (data.key, data))
    // .sortBy(|key, value| value.foo, ASCENDING)
    // .thenSortBy(|key, value| value.bar, DESCENDING)...
    .collect();
When I know the key, I would get a snappy response. If I search by one or more values, I want the results ordered (useful for pagination).
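A BTreeMap only orders by its key, so multi-field ordering has to be encoded into the key (or kept in a separately sorted index). Below is a minimal sketch of the composite-key idea; FancyParsedData and its foo/bar fields are invented stand-ins for the real type in the question, and the descending order on bar is emulated with std::cmp::Reverse:

use std::cmp::Reverse;
use std::collections::BTreeMap;

// Stand-in for the question's FancyParsedData; the fields are invented.
struct FancyParsedData {
    key: String,
    foo: u32,
    bar: u32,
}

// Encode "foo ascending, then bar descending, then key" into a composite
// BTreeMap key, so iteration yields entries in exactly that order.
fn build_sorted_index(
    items: Vec<FancyParsedData>,
) -> BTreeMap<(u32, Reverse<u32>, String), FancyParsedData> {
    items
        .into_iter()
        .map(|data| ((data.foo, Reverse(data.bar), data.key.clone()), data))
        .collect()
}

fn main() {
    let items = vec![
        FancyParsedData { key: "a".into(), foo: 2, bar: 1 },
        FancyParsedData { key: "b".into(), foo: 1, bar: 5 },
        FancyParsedData { key: "c".into(), foo: 1, bar: 9 },
    ];
    let index = build_sorted_index(items);
    for ((foo, Reverse(bar), key), _data) in &index {
        // prints c, b, a: foo ascending, bar descending within equal foo
        println!("foo={} bar={} key={}", foo, bar, key);
    }
}

Note that a composite key gives up direct lookup by the original key, so in practice you might pair this with a HashMap<String, FancyParsedData> for point lookups, or keep the map keyed by the plain key and sort query results on demand.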