Colab: How to run a specific codeblock - google-colaboratory

I'm trying to run a specific code block in Colab with different input arguments and record the results.
Let's say, for simplicity, my Colab notebook looks like this.
As an alternative to manually changing the arguments every time (obviously there are many more arguments than in the example given), is there a way to do the 'run code block 3' step in Google Colab?

Why not put your arguments into either a list or a dictionary, since it looks like you're manually entering them anyway?
For a list you could do:
args.input = [5, 10, 15, 20]
then
saved = {}
for i in args.input:
    sum = 0
    for j in range(i):
        sum += j
    saved[str(i)] = sum
Which saves your results in a dictionary:
>> saved
{'10': 45, '15': 105, '20': 190, '5': 10}
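If the real block is more involved than a nested loop, another option (a minimal sketch; run_block and the Namespace fields are placeholders, not anything from the original notebook) is to wrap the block in a function and call it once per argument set:
from argparse import Namespace

def run_block(args):
    # stand-in for the body of code block 3 (hypothetical)
    total = 0
    for j in range(args.input):
        total += j
    return total

results = {}
for value in [5, 10, 15, 20]:
    args = Namespace(input=value)   # rebuild the argument object for each run
    results[value] = run_block(args)

print(results)   # {5: 10, 10: 45, 15: 105, 20: 190}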

Related

LGBMClassifier + Unbalanced data + GridSearchCV()

The dependent variable is binary, the data is unbalanced at 1:10, the dataset has 70k rows, and the scoring metric is ROC AUC. I'm trying to use LGBM + GridSearchCV to get a model. However, I'm struggling with the parameters, as sometimes it doesn't recognize them even when I use them as the documentation shows:
params = {'num_leaves': [10, 12, 14, 16],
          'max_depth': [4, 5, 6, 8, 10],
          'n_estimators': [50, 60, 70, 80],
          'is_unbalance': [True]}
best_classifier = GridSearchCV(LGBMClassifier(), params, cv=3, scoring="roc_auc")
best_classifier.fit(X_train, y_train)
So:
What is the difference between putting the parameters in GridSearchCV() and putting them in params?
As the data is unbalanced, I'm trying to use ROC AUC as the scoring metric, since it accounts for the imbalance. Should I use the argument scoring="roc_auc", or put it in the params argument?
The difference between putting the parameters in GridSearchCV() or in params is mentioned in the GridSearchCV docs.
When you put them in params:
Dictionary with parameters names (str) as keys and lists of parameter settings to try as values, or a list of such dictionaries, in which case the grids spanned by each dictionary in the list are explored. This enables searching over any sequence of parameter settings.
As for scoring: that is an argument of GridSearchCV() itself rather than a hyperparameter of the estimator, so keep scoring="roc_auc" in the GridSearchCV() call and leave it out of params.
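A minimal sketch of that split (the specific hyperparameter values are only illustrative; check the LightGBM docs for your version):
from lightgbm import LGBMClassifier
from sklearn.model_selection import GridSearchCV

# Fixed estimator settings go in the constructor...
estimator = LGBMClassifier(is_unbalance=True)

# ...values to search over go in the param grid (keys must be valid
# LGBMClassifier constructor arguments)...
param_grid = {'num_leaves': [10, 12, 14, 16],
              'max_depth': [4, 5, 6, 8, 10],
              'n_estimators': [50, 60, 70, 80]}

# ...and search-level options such as cv and scoring belong to GridSearchCV itself.
best_classifier = GridSearchCV(estimator, param_grid, cv=3, scoring="roc_auc")
# best_classifier.fit(X_train, y_train)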

Column missing when trying to open hdf created by pandas in h5py

This is what my dataframe looks like. The first column is a single int. The second column is a single list of 512 ints.
IndexID Ids
1899317 [0, 47715, 1757, 9, 38994, 230, 12, 241, 12228...
22861131 [0, 48156, 154, 6304, 43611, 11, 9496, 8982, 1...
2163410 [0, 26039, 41156, 227, 860, 3320, 6673, 260, 1...
15760716 [0, 40883, 4086, 11, 5, 18559, 1923, 1494, 4, ...
12244098 [0, 45651, 4128, 227, 5, 10397, 995, 731, 9, 3...
I saved it to hdf and tried opening it using
df.to_hdf('test.h5', key='df', data_columns=True)
h3 = h5py.File('test.h5')
I see 4 keys when I list the keys
h3['df'].keys()
KeysViewHDF5 ['axis0', 'axis1', 'block0_items', 'block0_values']
axis1 seems to contain the values from the first column
h3['df']['axis1'][0:5]
array([ 1899317, 22861131, 2163410, 15760716, 12244098,
However, there doesn't seem to be any data from the second column. There is another dataset with other data:
h3['df']['block0_values'][0][0:5]
array([128, 4, 149, 1, 0], dtype=uint8)
But that doesn't seem to correspond to any of the data in the second column.
Purpose
I am eventually trying to create a memory-mapped datastore that retrieves data using particular indices.
So something like
h3['df']['workingIndex'][22861131, 15760716]
would retrieve
[0, 48156, 154, 6304, 43611, 11, 9496, 8982, 1...],
[0, 40883, 4086, 11, 5, 18559, 1923, 1494, 4, ...
The problem is you're trying to serialize a Pandas Series of Python lists and it is not rectangular (it is jagged).
Pandas and HDF5 are largely used for rectangular (cube, hypercube, etc) data, not for jagged lists-of-lists.
Did you see this warning when you called to_hdf()?
PerformanceWarning:
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block0_values] [items->['Ids']]
What it's trying to tell you is that lists-of-lists are not supported in an intuitive, high-performance way. And if you run an HDF5 visualization tool like h5dump on your output file, you'll see what's wrong. The index (which is well-behaved) looks like this:
DATASET "axis1" {
DATATYPE H5T_STD_I64LE
DATASPACE SIMPLE { ( 5 ) / ( 5 ) }
DATA {
(0): 1899317, 22861131, 2163410, 15760716, 12244098
}
ATTRIBUTE "CLASS" {
DATA {
(0): "ARRAY"
}
}
But the values (lists of lists) look like this:
DATASET "block0_values" {
DATATYPE H5T_VLEN { H5T_STD_U8LE}
DATASPACE SIMPLE { ( 1 ) / ( H5S_UNLIMITED ) }
DATA {
(0): (128, 5, 149, 164, ...)
}
ATTRIBUTE "CLASS" {
DATA {
(0): "VLARRAY"
}
}
ATTRIBUTE "PSEUDOATOM" {
DATA {
(0): "object"
}
}
What's happening is exactly what the PerformanceWarning warned you about:
> PyTables will pickle object types that it cannot map directly to c-types
Your list-of-lists is being pickled and stored as H5T_VLEN which is just a blob of bytes.
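Note that the data isn't lost: pandas can still unpickle its own object blocks, it's only opaque to h5py. A quick check, assuming the file written above:
import pandas as pd

# pandas unpickles its own object blocks on the way back in
df_back = pd.read_hdf('test.h5', key='df')
print(df_back['Ids'].iloc[0][:5])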
Here are some ways you could fix this:
1. Store each row under a separate key in HDF5. That is, each list will be stored as its own array, and they can all have different lengths. This is no problem with HDF5, because it supports any number of keys in one file (see the sketch after this list).
2. Change your data to be rectangular, e.g. by padding the shorter lists with zeros. See: Pandas split column of lists into multiple columns
3. Use h5py to write the data in whatever format you like. It's much more flexible and creates simpler (and yet more powerful) HDF5 files than Pandas/PyTables. Here's one example (which shows h5py can actually store jagged arrays, though it's not pretty): Storing multidimensional variable length array with h5py
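A minimal sketch of the first option, assuming a DataFrame df with the IndexID and Ids columns shown above (the file and dataset names are just illustrative):
import h5py
import numpy as np

# Write: one dataset per row, keyed by that row's IndexID.
with h5py.File('test_rows.h5', 'w') as h5:
    for index_id, ids in zip(df['IndexID'], df['Ids']):
        h5.create_dataset(str(index_id), data=np.asarray(ids, dtype=np.int64))

# Read back selected rows by their IndexID.
with h5py.File('test_rows.h5', 'r') as h5:
    rows = [h5[str(i)][:] for i in (22861131, 15760716)]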

Identifying the index of an array inside a 2D array

I'm trying out an inventory system in python 3.8 using functions and numpy.
I am new to numpy, and I haven't found anything in the numpy manuals for this problem.
My problem is this specifically:
I have a 2D array, in this case the unequipped inventory;
unequippedinv = [[""], [""], [""], [""], ["Iron greaves", 15, 10, 10]]
I have an if statement to ensure that the item selected is acceptable. I'm now trying to remove the entire entry ["Iron greaves", 15, 10, 10] using unequippedinv.pop(unequippedinv.index(item)), but I keep getting the error ValueError: "'Iron greaves', 15, 10, 10" is not in list
I've tried using numpy's where and argwhere but instead just got [] as the outcome.
Is there a way to search for an entire array in a 2D array, such as how SQL has SELECT * IN y WHERE x IS b but in which it gives me the index for the entire row?
Note: I have now found out that it has something to do with easygui's choicebox, which, I assume, turns the chosen entry into a string, which is why it raises the error.
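For what it's worth, a minimal sketch of that behaviour, assuming the inventory is a plain list of lists and that the GUI hands back a string rather than the original sublist:
unequippedinv = [[""], [""], [""], [""], ["Iron greaves", 15, 10, 10]]

# .index() works when given the actual sublist...
print(unequippedinv.index(["Iron greaves", 15, 10, 10]))   # 4

# ...but not when given a stringified version like a choice box might return (assumed).
choice = "'Iron greaves', 15, 10, 10"
# unequippedinv.index(choice)   # ValueError: not in list

# One workaround: match rows on the item name instead of the whole entry.
row = next(i for i, entry in enumerate(unequippedinv)
           if entry[0] and str(entry[0]) in choice)
unequippedinv.pop(row)          # removes ["Iron greaves", 15, 10, 10]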

How do I get a snakemake rule to execute M times to generate MxN files?

I'm converting a bioinformatics pipeline into Snakemake and have a script which loops over M files (where M=22, one per non-sex chromosome).
Each file essentially contains N label columns that I want as individual files. The Python script does this reliably; my issue is that if I provide the Snakefile with wildcards for the output (both chromosomes and labels), it will run the script MxN times, whilst I only want it to run M times.
I can circumvent the problem by only looking for one label file per chromosome, but this isn't in keeping with the spirit of Snakemake, and the next step in the pipeline requires input from all label files.
I've already tried using the checkpoint feature (which, as I understand it, re-evaluates the DAG after each rule is executed) to check the output, understand that N files have been generated, and skip N jobs. But this crashes and I get this error. Because I know my labels beforehand, as I understand it I shouldn't need checkpoint/dynamic; I just don't know exactly what I do need.
Is it possible to stop a wildcard from generating jobs and just check that the output is produced?
LABELS = ['A', 'B', 'C', 'D']
CHROMOSOMES = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]

rule all:
    input:
        "out/final.txt"

rule split_files:
    '''
    Splits the chromosome files by label.
    '''
    input:
        "per_chromosome/myfile.{chromosome}.txt"
    output:
        "per_label/myfile.{label}.{chromosome}.txt"
    script:
        "scripts/split_files_snake.py"

rule make_out_file:
    '''
    Makes the final output file by checking each of the label.chromosome files one by one.
    '''
    input:
        expand("per_label/myfile.{label}.{chromosome}.txt",
               label=LABELS,
               chromosome=CHROMOSOMES)
    output:
        "out/final.txt"
    script:
        "scripts/make_out_file_snake.py"
If you wish to avoid the script being run N times per chromosome, you can specify all the label files, without a label wildcard, in the output:
output:
    "per_label/myfile.A.{chromosome}.txt",
    "per_label/myfile.B.{chromosome}.txt",
    "per_label/myfile.C.{chromosome}.txt",
    "per_label/myfile.D.{chromosome}.txt"
To make the code more generic, you can use the expand function, but pay special attention to the braces in the format string (the doubled braces keep {chromosome} as a wildcard while {label} is expanded):
output:
    expand("per_label/myfile.{label}.{{chromosome}}.txt", label=LABELS)

Building a language model using tensorflow, dataset shape issue

I'm trying to build a translation model, so I get a text as input and encode it to a list of integers (the type of encoding is not important). So far so good.
Let's say this is what I have so far:
<class 'list'>: [1645, 3, 205, 753, 753, 1332, 18, 7, 7, 24]
Now I want to run these lines:
ds = tf.data.Dataset.from_tensors(encoded_txt)
ds = ds.batch(32)
(By the way, why do we need the first line, just to be able to do the second?)
But from these lines I'm getting:
shape=(?,32)
and I don't understand why.
I have a batch size of 32 and 10 numbers,
why isn't it (1, 32) (with padding or something)?
This affects me later on in the code; I really need to understand how to handle this.
By the way, just reshaping isn't working. :(
Thanks!
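For context, a minimal TF 2.x sketch of what from_tensors and from_tensor_slices each produce for a list like this (an illustration only, not the original notebook code):
import tensorflow as tf

encoded_txt = [1645, 3, 205, 753, 753, 1332, 18, 7, 7, 24]

# from_tensors wraps the whole list as ONE dataset element of shape (10,)
ds = tf.data.Dataset.from_tensors(encoded_txt)
print(ds.element_spec)             # TensorSpec(shape=(10,), dtype=tf.int32)
print(ds.batch(32).element_spec)   # shape=(None, 10): batches of whole lists

# from_tensor_slices yields one element per integer instead
ds_slices = tf.data.Dataset.from_tensor_slices(encoded_txt)
print(ds_slices.batch(32).element_spec)   # shape=(None,): batches of up to 32 ints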