Split a SAM file in Awk keeping N number of lines as header

I have a very big Sequence Alignment Map (SAM) file as depicted below
@X YYYYYY ZZZZZ
@X ssssss ddddd
@X CCCCCC LLLLL
> FFFFFF 117 ch1 16448 0 * = 16448 0 TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH########## MC:Z:55S22M23S RG:Z:Sample_POP1 AS:i:0 XS:i:0
> FFFFFF 117 ch6 16448 0 * = 16448 0 TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH########## MC:Z:55S22M23S RG:Z:Sample_POP1 AS:i:0 XS:i:0
> FFFFFF 117 ch2 16448 0 * = 16448 0 TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH########## MC:Z:55S22M23S RG:Z:Sample_POP1 AS:i:0 XS:i:0
> FFFFFF 117 ch5 16448 0 * = 16448 0 TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH########## MC:Z:55S22M23S RG:Z:Sample_POP1 AS:i:0 XS:i:0
> FFFFFF 117 ch1 16448 0 * = 16448 0 TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH########## MC:Z:55S22M23S RG:Z:Sample_POP1 AS:i:0 XS:i:0
I want to split the file based on column 3, so I run awk '{print > $3}' file.txt, which works fine. Now I want to keep the lines
@X YYYYYY ZZZZZ
@X ssssss ddddd
@X CCCCCC LLLLL
as a header on top of all the split files. How can I do that?
I tried this:
awk '$1 ~ /^@/ {print > $3}' file.txt

You have to keep track of whether the file is one you have seen before, and if not, write the header before you write to it for the first time.
awk '$1 ~ /^@/ { header = header $0 ORS; next }
!seen[$3]++ { printf "%s", header >$3 }
{ print > $3 }' file.txt
The built-in variable ORS contains a newline by default, but it's customary to use the variable so that you only need to change the string in one place if you want a different output record separator.
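For instance, to emit CRLF-terminated records you would only set the variable once (a hypothetical variation, not something the question asks for):
awk 'BEGIN { ORS = "\r\n" }
     $1 ~ /^@/ { header = header $0 ORS; next }
     !seen[$3]++ { printf "%s", header > $3 }
     { print > $3 }' file.txt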
This can run out of file handles if you have more than a couple of dozen distinct values in $3, but if your script otherwise works, I guess that's not a problem in your case.
(The brute-force fix, sketched below, is to close and reopen the file after each write, which makes the script much slower. A better fix, if you have the memory, is to collect all the results into RAM and only write them out once you have read all the data. A more sophisticated approach would keep a buffer of, say, 20 open file handles and close the least recently used one when you need to write to a file which isn't among them.)
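A minimal sketch of the brute-force variant in plain awk; note the switch to >> after the first, truncating write of the header, since reopening with > would clobber what was already written:
awk '$1 ~ /^@/ { header = header $0 ORS; next }
     !seen[$3]++ { printf "%s", header > $3; close($3) }
     { print >> $3; close($3) }' file.txt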

If the header lines always contain 3 fields, then the criterion can be as follows:
For lines containing more than 3 fields, set the first and second fields to ""; otherwise, print the line as it is.
file.txt used:
cat file.txt
@X YYYYYY ZZZZZ
@X ssssss ddddd
@X CCCCCC LLLLL
> FFFFFF 117 ch1 16448 0 * = 16448 0 TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH########## MC:Z:55S22M23S RG:Z:Sample_POP1 AS:i:0 XS:i:0
> FFFFFF 117 ch6 16448 0 * = 16448 0 TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH########## MC:Z:55S22M23S RG:Z:Sample_POP1 AS:i:0 XS:i:0
> FFFFFF 117 ch2 16448 0 * = 16448 0 TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH########## MC:Z:55S22M23S RG:Z:Sample_POP1 AS:i:0 XS:i:0
> FFFFFF 117 ch5 16448 0 * = 16448 0 TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH########## MC:Z:55S22M23S RG:Z:Sample_POP1 AS:i:0 XS:i:0
> FFFFFF 117 ch1 16448 0 * = 16448 0 TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH########## MC:Z:55S22M23S RG:Z:Sample_POP1 AS:i:0 XS:i:0
awk:
awk '{ if( NF > 3) $1=$2=""; print }' file.txt
@X YYYYYY ZZZZZ
@X ssssss ddddd
@X CCCCCC LLLLL
117 ch1 16448 0 * = 16448 0 TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH########## MC:Z:55S22M23S RG:Z:Sample_POP1 AS:i:0 XS:i:0
117 ch6 16448 0 * = 16448 0 TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH########## MC:Z:55S22M23S RG:Z:Sample_POP1 AS:i:0 XS:i:0
117 ch2 16448 0 * = 16448 0 TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH########## MC:Z:55S22M23S RG:Z:Sample_POP1 AS:i:0 XS:i:0
117 ch5 16448 0 * = 16448 0 TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH########## MC:Z:55S22M23S RG:Z:Sample_POP1 AS:i:0 XS:i:0
117 ch1 16448 0 * = 16448 0 TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH########## MC:Z:55S22M23S RG:Z:Sample_POP1 AS:i:0 XS:i:0


Filtering file according to the highest value in a column of each line

I have the following file:
gene.100079.0.5.p3 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 84.9
gene.100079.0.3.p1 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 84.9
gene.100079.0.0.p1 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 86.7
gene.100080.0.3.p1 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 99.9
gene.100080.0.0.p1 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 99.9
chr11_pilon3.g3568.t1 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 74.9
chr11_pilon3.g3568.t2 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 76.7
The above file has some IDs which are similar
gene.100079.0.5.p3
gene.100079.0.3.p1
gene.100079.0.0.p1
Keeping only gene.100079, the IDs become identical. I would like to filter the above file in the following way:
chr11_pilon3.g3568.t1 = 74.9: IDs starting with chr are excluded from the comparison and go straight into the output.
gene.100079.0.0.p1 = 86.7, gene.100079.0.5.p3 = 84.9, gene.100079.0.3.p1 = 84.9: gene.100079.0.0.p1 has the highest value and therefore it should be in the output.
gene.100080.0.3.p1 = 99.9, gene.100080.0.0.p1 = 99.9: both IDs have the same value and therefore both should be in the output.
However, this awk script from @RavinderSingh13 and @anubhava returns the wrong results.
awk '{
if (/^gene\./) {
split($1, a, /\./)
k = a[1] "." a[2]
}
else
k = $1
}
!(k in max) || $13 >= max[k] {
if(!(k in max))
ord[++n] = k
else if (max[k] == $13) {
print
next
}
max[k] = $13
rec[k] = $0
}
END {
for (i=1; i<=n; ++i)
print rec[ord[i]]
}' file
Wrong output with the above script:
chr11_pilon3.g3568.t1 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 74.9
chr11_pilon3.g3568.t2 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 76.7
gene.100079.0.0.p1 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 86.7
gene.100079.0.3.p1 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 84.9
gene.100080.0.0.p1 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 99.9
gene.100080.0.3.p1 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 99.9
As output I would like to get:
chr11_pilon3.g3568.t1 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 74.9
chr11_pilon3.g3568.t2 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 76.7
gene.100079.0.0.p1 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 86.7
gene.100080.0.3.p1 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 99.9
gene.100080.0.0.p1 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 99.9
I also tried the fix shown below, but it didn't work:
awk '{
if (/^gene\./) {
split($1, a, /\./)
k = a[1] "." a[2]
}
else
k = $1
}
!(k in max) || $13 > max[k] {
max[k]=$13;
line[k]=$0
}
END {
for(i in line)
print line[i]
}'
Thank you in advance,
This seems to work correctly, assuming that the data is ordered so that all the lines with the same first two name components are grouped together in the data file. The order of those lines within the group doesn't matter.
As revised, the question now wants lines starting with chr transferred to the output without any filtering. That is easily achieved: the rule matching /^chr/ provides that functionality.
#!/bin/sh
awk '
function dump_memo()
{
if (memo_num > 0)
{
for (i = 0; i < memo_num; i++)
print memo_line[i]
}
}
/^chr/ { print; next } # Do not process lines starting chr specially
{
split($1, a, ".")
key = a[1] "." a[2]
val = $NF
# print "# " key " = " val " (memo_key = " memo_key ", memo_val = " memo_val ")"
if (memo_key == key)
{
if (memo_val == val)
{
memo_line[memo_num++] = $0
}
else if (memo_val < val)
{
memo_val = val
memo_num = 0
memo_line[memo_num++] = $0
}
}
else
{
dump_memo()
memo_num = 0
memo_line[memo_num++] = $0
memo_key = key
memo_val = val
}
}
END { dump_memo() }' "$@"
When run on the data file shown in the question, the original output from the unrevised script was:
gene.100079.0.0.p1 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 86.7
gene.100080.0.3.p1 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 99.9
gene.100080.0.0.p1 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 99.9
chr11_pilon3.g3568.t2 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 76.7
The main difference between this and what was requested is the sort order. If you need the data in sorted order, pipe the output of the script through sort.
The output with the revised script (with the /^chr/ rule) is:
gene.100079.0.0.p1 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 86.7
chr11_pilon3.g3568.t1 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 74.9
chr11_pilon3.g3568.t2 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 76.7
gene.100080.0.3.p1 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 99.9
gene.100080.0.0.p1 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 99.9
Again, if you want the data in some specific order, apply a sort to the output.
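For example, assuming the script above is saved as split_max.sh (a name chosen here for illustration):
sh split_max.sh file | sort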

Run TFMA for Keras models without compiling

I am training a Keras model using custom training loops in TensorFlow, where the weights are updated using gradient tape rather than the model.fit() method. As such, the model is not compiled before training.
After exporting the saved_model, I am able to successfully load it for inference:
model = tf.saved_model.load("path/to/saved_model")
pred_fn = model.signatures["serving_default"]
results = pred_fn(tf.constant(examples))
However, when I try loading it with TFMA using run_model_analysis:
eval_shared_model = tfma.default_eval_shared_model("path/to/saved_model", eval_config=eval_config)
eval_results = tfma.run_model_analysis(
eval_shared_model=eval_shared_model,
data_location=test_tfrecords_path,
file_format="tfrecords"
)
I get the following error:
WARNING:tensorflow:No training configuration found in save file, so the model was *not* compiled. Compile it manually.
-----------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-107-19f51f42014a> in <module>
2 eval_shared_model=eval_shared_model,
3 data_location=test_tfrecords_path,
----> 4 file_format="tfrecords"
5 )
~/.pyenv/versions/miniconda3-4.3.30/envs/tensorflow/lib/python3.7/site-packages/tensorflow_model_analysis/api/model_eval_lib.py in run_model_analysis(eval_shared_model, eval_config, data_location, file_format, output_path, extractors, evaluators, writers, pipeline_options, slice_spec, write_config, compute_confidence_intervals, min_slice_size, random_seed_for_testing, schema)
1200 random_seed_for_testing=random_seed_for_testing,
1201 tensor_adapter_config=tensor_adapter_config,
-> 1202 schema=schema))
1203 # pylint: enable=no-value-for-parameter
1204
~/.pyenv/versions/miniconda3-4.3.30/envs/tensorflow/lib/python3.7/site-packages/apache_beam/pvalue.py in __or__(self, ptransform)
138
139 def __or__(self, ptransform):
--> 140 return self.pipeline.apply(ptransform, self)
141
142
~/.pyenv/versions/miniconda3-4.3.30/envs/tensorflow/lib/python3.7/site-packages/apache_beam/pipeline.py in apply(self, transform, pvalueish, label)
575 if isinstance(transform, ptransform._NamedPTransform):
576 return self.apply(
--> 577 transform.transform, pvalueish, label or transform.label)
578
579 if not isinstance(transform, ptransform.PTransform):
~/.pyenv/versions/miniconda3-4.3.30/envs/tensorflow/lib/python3.7/site-packages/apache_beam/pipeline.py in apply(self, transform, pvalueish, label)
585 try:
586 old_label, transform.label = transform.label, label
--> 587 return self.apply(transform, pvalueish)
588 finally:
589 transform.label = old_label
~/.pyenv/versions/miniconda3-4.3.30/envs/tensorflow/lib/python3.7/site-packages/apache_beam/pipeline.py in apply(self, transform, pvalueish, label)
628 transform.type_check_inputs(pvalueish)
629
--> 630 pvalueish_result = self.runner.apply(transform, pvalueish, self._options)
631
632 if type_options is not None and type_options.pipeline_type_check:
~/.pyenv/versions/miniconda3-4.3.30/envs/tensorflow/lib/python3.7/site-packages/apache_beam/runners/runner.py in apply(self, transform, input, options)
196 m = getattr(self, 'apply_%s' % cls.__name__, None)
197 if m:
--> 198 return m(transform, input, options)
199 raise NotImplementedError(
200 'Execution of [%s] not implemented in runner %s.' % (transform, self))
~/.pyenv/versions/miniconda3-4.3.30/envs/tensorflow/lib/python3.7/site-packages/apache_beam/runners/runner.py in apply_PTransform(self, transform, input, options)
226 def apply_PTransform(self, transform, input, options):
227 # The base case of apply is to call the transform's expand.
--> 228 return transform.expand(input)
229
230 def run_transform(self,
~/.pyenv/versions/miniconda3-4.3.30/envs/tensorflow/lib/python3.7/site-packages/apache_beam/transforms/ptransform.py in expand(self, pcoll)
921 # Might not be a function.
922 pass
--> 923 return self._fn(pcoll, *args, **kwargs)
924
925 def default_label(self):
~/.pyenv/versions/miniconda3-4.3.30/envs/tensorflow/lib/python3.7/site-packages/tensorflow_model_analysis/api/model_eval_lib.py in ExtractEvaluateAndWriteResults(examples, eval_shared_model, eval_config, extractors, evaluators, writers, output_path, display_only_data_location, display_only_file_format, slice_spec, write_config, compute_confidence_intervals, min_slice_size, random_seed_for_testing, tensor_adapter_config, schema)
1079 | 'ExtractAndEvaluate' >> ExtractAndEvaluate(
1080 extractors=extractors, evaluators=evaluators)
-> 1081 | 'WriteResults' >> WriteResults(writers=writers))
1082
1083 return beam.pvalue.PDone(examples.pipeline)
~/.pyenv/versions/miniconda3-4.3.30/envs/tensorflow/lib/python3.7/site-packages/apache_beam/pvalue.py in __or__(self, ptransform)
138
139 def __or__(self, ptransform):
--> 140 return self.pipeline.apply(ptransform, self)
141
142
~/.pyenv/versions/miniconda3-4.3.30/envs/tensorflow/lib/python3.7/site-packages/apache_beam/pipeline.py in apply(self, transform, pvalueish, label)
575 if isinstance(transform, ptransform._NamedPTransform):
576 return self.apply(
--> 577 transform.transform, pvalueish, label or transform.label)
578
579 if not isinstance(transform, ptransform.PTransform):
~/.pyenv/versions/miniconda3-4.3.30/envs/tensorflow/lib/python3.7/site-packages/apache_beam/pipeline.py in apply(self, transform, pvalueish, label)
585 try:
586 old_label, transform.label = transform.label, label
--> 587 return self.apply(transform, pvalueish)
588 finally:
589 transform.label = old_label
~/.pyenv/versions/miniconda3-4.3.30/envs/tensorflow/lib/python3.7/site-packages/apache_beam/pipeline.py in apply(self, transform, pvalueish, label)
628 transform.type_check_inputs(pvalueish)
629
--> 630 pvalueish_result = self.runner.apply(transform, pvalueish, self._options)
631
632 if type_options is not None and type_options.pipeline_type_check:
~/.pyenv/versions/miniconda3-4.3.30/envs/tensorflow/lib/python3.7/site-packages/apache_beam/runners/runner.py in apply(self, transform, input, options)
196 m = getattr(self, 'apply_%s' % cls.__name__, None)
197 if m:
--> 198 return m(transform, input, options)
199 raise NotImplementedError(
200 'Execution of [%s] not implemented in runner %s.' % (transform, self))
~/.pyenv/versions/miniconda3-4.3.30/envs/tensorflow/lib/python3.7/site-packages/apache_beam/runners/runner.py in apply_PTransform(self, transform, input, options)
226 def apply_PTransform(self, transform, input, options):
227 # The base case of apply is to call the transform's expand.
--> 228 return transform.expand(input)
229
230 def run_transform(self,
~/.pyenv/versions/miniconda3-4.3.30/envs/tensorflow/lib/python3.7/site-packages/apache_beam/transforms/ptransform.py in expand(self, pcoll)
921 # Might not be a function.
922 pass
--> 923 return self._fn(pcoll, *args, **kwargs)
924
925 def default_label(self):
~/.pyenv/versions/miniconda3-4.3.30/envs/tensorflow/lib/python3.7/site-packages/tensorflow_model_analysis/api/model_eval_lib.py in ExtractAndEvaluate(extracts, extractors, evaluators)
818 for v in evaluators:
819 if v.run_after == x.stage_name:
--> 820 update(evaluation, extracts | v.stage_name >> v.ptransform)
821 for v in evaluators:
822 if v.run_after == extractor.LAST_EXTRACTOR_STAGE_NAME:
~/.pyenv/versions/miniconda3-4.3.30/envs/tensorflow/lib/python3.7/site-packages/apache_beam/pvalue.py in __or__(self, ptransform)
138
139 def __or__(self, ptransform):
--> 140 return self.pipeline.apply(ptransform, self)
141
142
~/.pyenv/versions/miniconda3-4.3.30/envs/tensorflow/lib/python3.7/site-packages/apache_beam/pipeline.py in apply(self, transform, pvalueish, label)
575 if isinstance(transform, ptransform._NamedPTransform):
576 return self.apply(
--> 577 transform.transform, pvalueish, label or transform.label)
578
579 if not isinstance(transform, ptransform.PTransform):
~/.pyenv/versions/miniconda3-4.3.30/envs/tensorflow/lib/python3.7/site-packages/apache_beam/pipeline.py in apply(self, transform, pvalueish, label)
585 try:
586 old_label, transform.label = transform.label, label
--> 587 return self.apply(transform, pvalueish)
588 finally:
589 transform.label = old_label
~/.pyenv/versions/miniconda3-4.3.30/envs/tensorflow/lib/python3.7/site-packages/apache_beam/pipeline.py in apply(self, transform, pvalueish, label)
628 transform.type_check_inputs(pvalueish)
629
--> 630 pvalueish_result = self.runner.apply(transform, pvalueish, self._options)
631
632 if type_options is not None and type_options.pipeline_type_check:
~/.pyenv/versions/miniconda3-4.3.30/envs/tensorflow/lib/python3.7/site-packages/apache_beam/runners/runner.py in apply(self, transform, input, options)
196 m = getattr(self, 'apply_%s' % cls.__name__, None)
197 if m:
--> 198 return m(transform, input, options)
199 raise NotImplementedError(
200 'Execution of [%s] not implemented in runner %s.' % (transform, self))
~/.pyenv/versions/miniconda3-4.3.30/envs/tensorflow/lib/python3.7/site-packages/apache_beam/runners/runner.py in apply_PTransform(self, transform, input, options)
226 def apply_PTransform(self, transform, input, options):
227 # The base case of apply is to call the transform's expand.
--> 228 return transform.expand(input)
229
230 def run_transform(self,
~/.pyenv/versions/miniconda3-4.3.30/envs/tensorflow/lib/python3.7/site-packages/apache_beam/transforms/ptransform.py in expand(self, pcoll)
921 # Might not be a function.
922 pass
--> 923 return self._fn(pcoll, *args, **kwargs)
924
925 def default_label(self):
~/.pyenv/versions/miniconda3-4.3.30/envs/tensorflow/lib/python3.7/site-packages/tensorflow_model_analysis/evaluators/metrics_and_plots_evaluator_v2.py in _EvaluateMetricsAndPlots(extracts, eval_config, eval_shared_models, metrics_key, plots_key, validations_key, schema, random_seed_for_testing)
757 plots_key=plots_key,
758 schema=schema,
--> 759 random_seed_for_testing=random_seed_for_testing))
760
761 for k, v in evaluation.items():
~/.pyenv/versions/miniconda3-4.3.30/envs/tensorflow/lib/python3.7/site-packages/apache_beam/pvalue.py in __or__(self, ptransform)
138
139 def __or__(self, ptransform):
--> 140 return self.pipeline.apply(ptransform, self)
141
142
~/.pyenv/versions/miniconda3-4.3.30/envs/tensorflow/lib/python3.7/site-packages/apache_beam/pipeline.py in apply(self, transform, pvalueish, label)
575 if isinstance(transform, ptransform._NamedPTransform):
576 return self.apply(
--> 577 transform.transform, pvalueish, label or transform.label)
578
579 if not isinstance(transform, ptransform.PTransform):
~/.pyenv/versions/miniconda3-4.3.30/envs/tensorflow/lib/python3.7/site-packages/apache_beam/pipeline.py in apply(self, transform, pvalueish, label)
585 try:
586 old_label, transform.label = transform.label, label
--> 587 return self.apply(transform, pvalueish)
588 finally:
589 transform.label = old_label
~/.pyenv/versions/miniconda3-4.3.30/envs/tensorflow/lib/python3.7/site-packages/apache_beam/pipeline.py in apply(self, transform, pvalueish, label)
628 transform.type_check_inputs(pvalueish)
629
--> 630 pvalueish_result = self.runner.apply(transform, pvalueish, self._options)
631
632 if type_options is not None and type_options.pipeline_type_check:
~/.pyenv/versions/miniconda3-4.3.30/envs/tensorflow/lib/python3.7/site-packages/apache_beam/runners/runner.py in apply(self, transform, input, options)
196 m = getattr(self, 'apply_%s' % cls.__name__, None)
197 if m:
--> 198 return m(transform, input, options)
199 raise NotImplementedError(
200 'Execution of [%s] not implemented in runner %s.' % (transform, self))
~/.pyenv/versions/miniconda3-4.3.30/envs/tensorflow/lib/python3.7/site-packages/apache_beam/runners/runner.py in apply_PTransform(self, transform, input, options)
226 def apply_PTransform(self, transform, input, options):
227 # The base case of apply is to call the transform's expand.
--> 228 return transform.expand(input)
229
230 def run_transform(self,
~/.pyenv/versions/miniconda3-4.3.30/envs/tensorflow/lib/python3.7/site-packages/apache_beam/transforms/ptransform.py in expand(self, pcoll)
921 # Might not be a function.
922 pass
--> 923 return self._fn(pcoll, *args, **kwargs)
924
925 def default_label(self):
~/.pyenv/versions/miniconda3-4.3.30/envs/tensorflow/lib/python3.7/site-packages/tensorflow_model_analysis/evaluators/metrics_and_plots_evaluator_v2.py in _ComputeMetricsAndPlots(extracts, eval_config, metrics_specs, eval_shared_models, metrics_key, plots_key, schema, random_seed_for_testing)
582 if eval_shared_model.model_type == constants.TF_KERAS:
583 keras_specs = keras_util.metrics_specs_from_keras(
--> 584 model_name, eval_shared_model.model_loader)
585 metrics_specs = keras_specs + metrics_specs[:]
586 # TODO(mdreves): Add support for calling keras.evaluate().
~/.pyenv/versions/miniconda3-4.3.30/envs/tensorflow/lib/python3.7/site-packages/tensorflow_model_analysis/evaluators/keras_util.py in metrics_specs_from_keras(model_name, model_loader)
60 # y_true, y_pred as inputs so it can't be calculated via standard inputs so
61 # we remove it.
---> 62 metrics.extend(model.compiled_loss.metrics[1:])
63 metrics.extend(model.compiled_metrics.metrics)
64 metric_names = [m.name for m in metrics]
AttributeError: 'NoneType' object has no attribute 'metrics'
I suspect this might be because I am not compiling the Keras model before exporting it. Does TFMA only support compiled models?
I am using tensorflow==2.3.0 and tensorflow-model-analysis==0.22.1
Yes, your understanding is correct; i.e., it results in an error because you are not compiling the model and, consequently, not adding the metrics.
This is also evident from the following note in the TensorFlow Model Analysis documentation:
Note: Only training time metrics added via model.compile (not model.add_metric) are currently supported for keras.
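A minimal sketch of a workaround, assuming the trained Keras model object from the custom training loop is still at hand: compile it purely to attach the loss/metric metadata before saving (this does not retrain or alter the weights). The loss and metrics below are placeholders; substitute whatever TFMA should compute:
import tensorflow as tf

# model was trained with a custom tf.GradientTape loop; compile() here
# only records the loss/metrics configuration in the SavedModel.
model.compile(
    loss=tf.keras.losses.BinaryCrossentropy(),  # placeholder loss
    metrics=[tf.keras.metrics.AUC(name="auc"),
             tf.keras.metrics.BinaryAccuracy(name="accuracy")])
model.save("path/to/saved_model", save_format="tf")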

Python function to calculate balance for every row corresponding to individual transactions

I am working on a bank statement. Given the output dataframe and an ending balance corresponding to output['balance'][0], I would like to calculate the balance values for all the individual transactions as described below. It's a very straightforward calculation and yet it doesn't seem to be working; is there something quite obvious I am missing? Thanks in advance!
def bal_calc(output, amount):
    output['balance'] = ''
    output['balance'][0] = 21.15
    if len(output[amount]) > 0:
        return output[balance][i+1].append((output[balance][i]-output[amount][i+1]))
    else:
        output[balance].append((output[balance][0]))

output[['balance']] = output['Amount'].apply(lambda amount: bal_calc(output, amount))
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
2896 try:
-> 2897 return self._engine.get_loc(key)
2898 except KeyError:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 4.95
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
<ipython-input-271-b85947935fca> in <module>
----> 1 output[['balance']] = output['Amount'].apply(lambda amount: bal_calc(output, amount))
~\Anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
4040 else:
4041 values = self.astype(object).values
-> 4042 mapped = lib.map_infer(values, f, convert=convert_dtype)
4043
4044 if len(mapped) and isinstance(mapped[0], Series):
pandas/_libs/lib.pyx in pandas._libs.lib.map_infer()
<ipython-input-271-b85947935fca> in <lambda>(amount)
----> 1 output[['balance']] = output['Amount'].apply(lambda amount: bal_calc(output, amount))
<ipython-input-270-cbf5ac20716d> in bal_calc(output, amount)
2 output['balance'] = ''
3 output['balance'][0] = 21.15
----> 4 if len(output[amount]) > 0:
5 return output[balance][i+1].append((output[balance][i]-output[amount][i+1]))
6 else:
~\Anaconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
2978 if self.columns.nlevels > 1:
2979 return self._getitem_multilevel(key)
-> 2980 indexer = self.columns.get_loc(key)
2981 if is_integer(indexer):
2982 indexer = [indexer]
~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
2897 return self._engine.get_loc(key)
2898 except KeyError:
-> 2899 return self._engine.get_loc(self._maybe_cast_indexer(key))
2900 indexer = self.get_indexer([key], method=method, tolerance=tolerance)
2901 if indexer.ndim > 1 or indexer.size > 1:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 4.95
It will be easier to understand your problem if you post your existing dataframe and your intended dataframe. From your description, I think you can approach calculating the balance like this:
import pandas as pd
import numpy as np
## creating dummy dataframe for testing
arr = np.random.choice(range(500,1000),(10,2))
debit_credit = np.random.choice([0,1], (10))
arr[:,0] = arr[:,0] * debit_credit
arr[:,1] = arr[:,1] * (1-debit_credit)
df=pd.DataFrame(arr, columns=["Debit", "Credit"])
display(df)
## calculating Balance
df["Balance"] = (df.Debit-df.Credit).cumsum()
display(df)
Output
Debit Credit Balance
0 957 0 957
1 0 698 259
2 608 0 867
3 0 969 -102
4 0 766 -868
5 0 551 -1419
6 985 0 -434
7 861 0 427
8 927 0 1354
9 0 923 431
Alternatively, starting from the known opening balance and filling each row from the previous one:
bs['balance'][0] = 21.15
for i in range(1, len(bs)):
bs.loc[i, 'balance'] = bs.loc[i-1, 'balance'] + bs.loc[i, 'Credit'] -bs.loc[i, 'Debit']
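The same running balance can be computed without an explicit loop; a sketch under the same assumptions (a frame bs with Credit and Debit columns and an opening balance of 21.15 in row 0):
delta = bs['Credit'] - bs['Debit']
delta.iloc[0] = 0  # row 0 holds the known opening balance, so no delta there
bs['balance'] = 21.15 + delta.cumsum()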

Tensorflow_probability integer type error

I am trying to use tensorflow_probability to construct an MCMC chain. This is my code:
chain_states, kernel_results = tfp.mcmc.sample_chain(
num_results=tf.constant(1e3, dtype=tf.int32),
num_burnin_steps=tf.constant(1e2, dtype=tf.int32),
parallel_iterations=tf.constant(10, dtype=tf.int32),
current_state=current_state,
kernel=tfp.mcmc.MetropolisHastings(
inner_kernel=tfp.mcmc.HamiltonianMonteCarlo(
target_log_prob_fn=joint_log_prob,
num_leapfrog_steps=tf.constant(2, dtype=tf.int32),
step_size=tf.Variable(1.),
step_size_update_fn=tfp.mcmc.make_simple_step_size_update_policy()
)))
But I got this error:
> InvalidArgumentError Traceback (most recent call last)
> <ipython-input-13-7e972cc65053> in <module>()
> ----> 1 make_model(well_complex, well_ligand, fi_complex, fi_ligand)
>
> ~/Documents/GitHub/assaytools2/assaytools2/assaytools2/inference.py in
> make_model(well_complex, well_ligand, fi_complex, fi_ligand)
> 162 num_leapfrog_steps=tf.constant(2, dtype=tf.int32),
> 163 step_size=tf.Variable(1.),
> --> 164 step_size_update_fn=tfp.mcmc.make_simple_step_size_update_policy()
> 165 )))
> 166
>
> ~/anaconda2/envs/py36/lib/python3.6/site-packages/tensorflow_probability/python/mcmc/sample.py
> in sample_chain(num_results, current_state, previous_kernel_results,
> kernel, num_burnin_steps, num_steps_between_results,
> parallel_iterations, name)
> 238
> 239 if previous_kernel_results is None:
> --> 240 previous_kernel_results = kernel.bootstrap_results(current_state)
> 241 return tf.scan(
> 242 fn=_scan_body,
>
> ~/anaconda2/envs/py36/lib/python3.6/site-packages/tensorflow_probability/python/mcmc/metropolis_hastings.py
> in bootstrap_results(self, init_state)
> 261 name=mcmc_util.make_name(self.name, 'mh', 'bootstrap_results'),
> 262 values=[init_state]):
> --> 263 pkr = self.inner_kernel.bootstrap_results(init_state)
> 264 if not has_target_log_prob(pkr):
> 265 raise ValueError(
>
> ~/anaconda2/envs/py36/lib/python3.6/site-packages/tensorflow_probability/python/mcmc/hmc.py
> in bootstrap_results(self, init_state)
> 506 def bootstrap_results(self, init_state):
> 507 """Creates initial `previous_kernel_results` using a supplied `state`."""
> --> 508 kernel_results = self._impl.bootstrap_results(init_state)
> 509 if self.step_size_update_fn is not None:
> 510 step_size_assign = self.step_size_update_fn(self.step_size, None) # pylint:
> disable=not-callable
>
> ~/anaconda2/envs/py36/lib/python3.6/site-packages/tensorflow_probability/python/mcmc/metropolis_hastings.py
> in bootstrap_results(self, init_state)
> 261 name=mcmc_util.make_name(self.name, 'mh', 'bootstrap_results'),
> 262 values=[init_state]):
> --> 263 pkr = self.inner_kernel.bootstrap_results(init_state)
> 264 if not has_target_log_prob(pkr):
> 265 raise ValueError(
>
> ~/anaconda2/envs/py36/lib/python3.6/site-packages/tensorflow_probability/python/mcmc/hmc.py
> in bootstrap_results(self, init_state)
> 672 init_target_log_prob,
> 673 init_grads_target_log_prob,
> --> 674 ] = mcmc_util.maybe_call_fn_and_grads(self.target_log_prob_fn, init_state)
> 675 return UncalibratedHamiltonianMonteCarloKernelResults(
> 676 log_acceptance_correction=tf.zeros_like(init_target_log_prob),
>
> ~/anaconda2/envs/py36/lib/python3.6/site-packages/tensorflow_probability/python/mcmc/util.py
> in maybe_call_fn_and_grads(fn, fn_arg_list, result, grads,
> check_non_none_grads, name)
> 232 fn_arg_list = (list(fn_arg_list) if is_list_like(fn_arg_list)
> 233 else [fn_arg_list])
> --> 234 result, grads = _value_and_gradients(fn, fn_arg_list, result, grads)
> 235 if not all(r.dtype.is_floating
> 236 for r in (result if is_list_like(result) else [result])): # pylint: disable=superfluous-parens
>
> ~/anaconda2/envs/py36/lib/python3.6/site-packages/tensorflow_probability/python/mcmc/util.py
> in _value_and_gradients(fn, fn_arg_list, result, grads, name)
> 207 ]
> 208 else:
> --> 209 grads = tfe.gradients_function(fn)(*fn_arg_list)
> 210 else:
> 211 if is_list_like(result) and len(result) == len(fn_arg_list):
>
> ~/anaconda2/envs/py36/lib/python3.6/site-packages/tensorflow/python/eager/backprop.py
> in decorated(*args, **kwds)
> 368 """Computes the gradient of the decorated function."""
> 369
> --> 370 _, grad = val_and_grad_function(f, params=params)(*args, **kwds)
> 371 return grad
> 372
>
> ~/anaconda2/envs/py36/lib/python3.6/site-packages/tensorflow/python/eager/backprop.py
> in decorated(*args, **kwds)
> 469 "receive keyword arguments.")
> 470 val, vjp = make_vjp(f, params)(*args, **kwds)
> --> 471 return val, vjp(dy=dy)
> 472
> 473 return decorated
>
> ~/anaconda2/envs/py36/lib/python3.6/site-packages/tensorflow/python/eager/backprop.py
> in vjp(dy)
> 539 return imperative_grad.imperative_grad(
> 540 _default_vspace, this_tape, nest.flatten(result), sources,
> --> 541 output_gradients=dy)
> 542 return result, vjp
> 543
>
> ~/anaconda2/envs/py36/lib/python3.6/site-packages/tensorflow/python/eager/imperative_grad.py
> in imperative_grad(vspace, tape, target, sources, output_gradients)
> 61 """
> 62 return pywrap_tensorflow.TFE_Py_TapeGradient(
> ---> 63 tape._tape, vspace, target, sources, output_gradients) # pylint: disable=protected-access
>
> ~/anaconda2/envs/py36/lib/python3.6/site-packages/tensorflow/python/eager/backprop.py
> in _gradient_function(op_name, attr_tuple, num_inputs, inputs,
> outputs, out_grads)
> 115 return [None] * num_inputs
> 116
> --> 117 return grad_fn(mock_op, *out_grads)
> 118
> 119
>
> ~/anaconda2/envs/py36/lib/python3.6/site-packages/tensorflow/python/ops/math_grad.py
> in _ProdGrad(op, grad)
> 158 with ops.device("/cpu:0"):
> 159 rank = array_ops.rank(op.inputs[0])
> --> 160 reduction_indices = (reduction_indices + rank) % rank
> 161 reduced = math_ops.cast(reduction_indices, dtypes.int32)
> 162 idx = math_ops.range(0, rank)
>
> ~/anaconda2/envs/py36/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py
> in binary_op_wrapper(x, y)
> 860 with ops.name_scope(None, op_name, [x, y]) as name:
> 861 if isinstance(x, ops.Tensor) and isinstance(y, ops.Tensor):
> --> 862 return func(x, y, name=name)
> 863 elif not isinstance(y, sparse_tensor.SparseTensor):
> 864 try:
>
> ~/anaconda2/envs/py36/lib/python3.6/site-packages/tensorflow/python/ops/gen_math_ops.py
> in add(x, y, name)
> 322 else:
> 323 message = e.message
> --> 324 _six.raise_from(_core._status_to_exception(e.code, message), None)
> 325
> 326
>
> ~/anaconda2/envs/py36/lib/python3.6/site-packages/six.py in
> raise_from(value, from_value)
>
> InvalidArgumentError: cannot compute Add as input #0(zero-based) was
> expected to be a int32 tensor but is a int64 tensor [Op:Add] name:
> mcmc_sample_chain/mh_bootstrap_results/mh_bootstrap_results/hmc_kernel_bootstrap_results/maybe_call_fn_and_grads/value_and_gradients/add/
I double-checked, and none of my initial tensors were of integer type.
I wonder where I went wrong.
Thanks!

SQL query is not working (Error in rsqlite_send_query)

This is what the head of my data frame looks like
> head(d19_1)
SMZ SIZ1_diff SIZ1_base SIZ2_diff SIZ2_base SIZ3_diff SIZ3_base SIZ4_diff SIZ4_base SIZ5_diff SIZ5_base
1 1 -620 4170 -189 1347 -35 2040 82 1437 244 1533
2 2 -219 831 -57 255 -4 392 8 282 14 297
3 3 -426 834 -162 294 -134 379 -81 241 -22 221
4 4 -481 676 -142 216 -114 267 -50 158 -43 166
5 5 -233 1711 -109 584 54 913 71 624 74 707
6 6 -322 1539 -79 512 -50 799 23 532 63 576
Total_og Total_base %_SIZ1 %_SIZ2 %_SIZ3 %_SIZ4 %_SIZ5 Total_og Total_base
1 11980 12648 14.86811 14.03118 1.715686 5.706333 15.916504 11980 12648
2 2156 2415 26.35379 22.35294 1.020408 2.836879 4.713805 2156 2415
3 1367 2314 51.07914 55.10204 35.356201 33.609959 9.954751 1367 2314
4 790 1736 71.15385 65.74074 42.696629 31.645570 25.903614 790 1736
5 5339 5496 13.61777 18.66438 5.914567 11.378205 10.466761 5339 5496
6 4362 4747 20.92268 15.42969 6.257822 4.323308 10.937500 4362 4747
The datatypes of the data frame are as shown below:
> str(d19_1)
'data.frame': 1588 obs. of 20 variables:
$ SMZ : int 1 2 3 4 5 6 7 8 9 10 ...
$ SIZ1_diff : int -620 -219 -426 -481 -233 -322 -176 -112 -34 -103 ...
$ SIZ1_base : int 4170 831 834 676 1711 1539 720 1396 998 1392 ...
$ SIZ2_diff : int -189 -57 -162 -142 -109 -79 -12 72 -36 -33 ...
$ SIZ2_base : int 1347 255 294 216 584 512 196 437 343 479 ...
$ SIZ3_diff : int -35 -4 -134 -114 54 -50 16 4 26 83 ...
$ SIZ3_base : int 2040 392 379 267 913 799 361 804 566 725 ...
$ SIZ4_diff : int 82 8 -81 -50 71 23 36 127 46 75 ...
$ SIZ4_base : int 1437 282 241 158 624 532 242 471 363 509 ...
$ SIZ5_diff : int 244 14 -22 -43 74 63 11 143 79 125 ...
$ SIZ5_base : int 1533 297 221 166 707 576 263 582 429 536 ...
$ Total_og : int 11980 2156 1367 790 5339 4362 2027 4715 3465 4561 ...
$ Total_base: int 12648 2415 2314 1736 5496 4747 2168 4464 3278 4375 ...
$ %_SIZ1 : num 14.9 26.4 51.1 71.2 13.6 ...
$ %_SIZ2 : num 14 22.4 55.1 65.7 18.7 ...
$ %_SIZ3 : num 1.72 1.02 35.36 42.7 5.91 ...
$ %_SIZ4 : num 5.71 2.84 33.61 31.65 11.38 ...
$ %_SIZ5 : num 15.92 4.71 9.95 25.9 10.47 ...
$ Total_og : int 11980 2156 1367 790 5339 4362 2027 4715 3465 4561 ...
$ Total_base: int 12648 2415 2314 1736 5496 4747 2168 4464 3278 4375 ...
When I run the below query, it returns the error shown beneath it and I don't know why; I don't have any column named <NA> in the table.
Query
d20_1 <- sqldf('SELECT *, CASE
WHEN SMZ BETWEEN 1 AND 110 THEN "Baltimore City"
WHEN SMZ BETWEEN 111 AND 217 THEN "Anne Arundel County"
WHEN SMZ BETWEEN 218 AND 405 THEN "Baltimore County"
WHEN SMZ BETWEEN 406 AND 453 THEN "Carroll County"
WHEN SMZ BETWEEN 454 AND 524 THEN "Harford County"
WHEN SMZ BETWEEN 1667 AND 1674 THEN "York County"
ELSE 0
END Jurisdiction
FROM d19_1')
Error:
Error in rsqlite_send_query(conn@ptr, statement) :
table d19_1 has no column named <NA>
Your code works correctly for me:
d19_1 <- structure(list(SMZ = 1:6, SIZ1_diff = c(-620L, -219L, -426L,
-481L, -233L, -322L), SIZ1_base = c(4170L, 831L, 834L, 676L,
1711L, 1539L), SIZ2_diff = c(-189L, -57L, -162L, -142L, -109L,
-79L), SIZ2_base = c(1347L, 255L, 294L, 216L, 584L, 512L), SIZ3_diff = c(-35L,
-4L, -134L, -114L, 54L, -50L), SIZ3_base = c(2040L, 392L, 379L,
267L, 913L, 799L), SIZ4_diff = c(82L, 8L, -81L, -50L, 71L, 23L
), SIZ4_base = c(1437L, 282L, 241L, 158L, 624L, 532L), SIZ5_diff = c(244L,
14L, -22L, -43L, 74L, 63L), SIZ5_base = c(1533L, 297L, 221L,
166L, 707L, 576L), Total_og = c(11980L, 2156L, 1367L, 790L, 5339L,
4362L), Total_base = c(12648L, 2415L, 2314L, 1736L, 5496L, 4747L
), X._SIZ1 = c(14.86811, 26.35379, 51.07914, 71.15385, 13.61777,
20.92268), X._SIZ2 = c(14.03118, 22.35294, 55.10204, 65.74074,
18.66438, 15.42969), X._SIZ3 = c(1.715686, 1.020408, 35.356201,
42.696629, 5.914567, 6.257822), X._SIZ4 = c(5.706333, 2.836879,
33.609959, 31.64557, 11.378205, 4.323308), X._SIZ5 = c(15.916504,
4.713805, 9.954751, 25.903614, 10.466761, 10.9375), Total_og.1 = c(11980L,
2156L, 1367L, 790L, 5339L, 4362L), Total_base.1 = c(12648L, 2415L,
2314L, 1736L, 5496L, 4747L)), .Names = c("SMZ", "SIZ1_diff",
"SIZ1_base", "SIZ2_diff", "SIZ2_base", "SIZ3_diff", "SIZ3_base",
"SIZ4_diff", "SIZ4_base", "SIZ5_diff", "SIZ5_base", "Total_og",
"Total_base", "X._SIZ1", "X._SIZ2", "X._SIZ3", "X._SIZ4", "X._SIZ5",
"Total_og.1", "Total_base.1"), row.names = c(NA, -6L), class = "data.frame")
library(sqldf)
sqldf('SELECT *, CASE
WHEN SMZ BETWEEN 1 AND 110 THEN "Baltimore City"
WHEN SMZ BETWEEN 111 AND 217 THEN "Anne Arundel County"
WHEN SMZ BETWEEN 218 AND 405 THEN "Baltimore County"
WHEN SMZ BETWEEN 406 AND 453 THEN "Carroll County"
WHEN SMZ BETWEEN 454 AND 524 THEN "Harford County"
WHEN SMZ BETWEEN 1667 AND 1674 THEN "York County"
ELSE 0
END Jurisdiction
FROM d19_1')
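As an aside (not the cause of the error above): in SQL, string literals are conventionally single-quoted, and SQLite only falls back to treating double-quoted tokens as strings when they don't match a column name. A sketch of the same query with standard quoting, also making the ELSE branch a string so the new column has a single type:
library(sqldf)
d20_1 <- sqldf("SELECT *, CASE
    WHEN SMZ BETWEEN 1 AND 110 THEN 'Baltimore City'
    WHEN SMZ BETWEEN 111 AND 217 THEN 'Anne Arundel County'
    WHEN SMZ BETWEEN 218 AND 405 THEN 'Baltimore County'
    WHEN SMZ BETWEEN 406 AND 453 THEN 'Carroll County'
    WHEN SMZ BETWEEN 454 AND 524 THEN 'Harford County'
    WHEN SMZ BETWEEN 1667 AND 1674 THEN 'York County'
    ELSE '0'
  END Jurisdiction
  FROM d19_1")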