Python 3: using sequence repetition ("a" * x) to add empty elements to a list

How can I use Python 3's repetition operator (as in "a" * x) to add empty elements to a list?
>>> "a"*5
'aaaaa'
This builds a list with three empty elements, one append at a time:
l = ['']
>>> l
['']
>>> l.append('')
>>> l.append('')
>>> l
['', '', '']
When I try to add 5 empty elements at once, I get just one:
>>> l=['' * 5]
>>> l
['']
I am writing this list to a CSV file, and I want a cheap way to add empty columns (elements) to a row, where each row is built as a list of elements.

It was just a matter of where I put the multiplication: repeat the one-element list [''], not the empty string inside it.
>>> l = [''] * 5
>>> l
['', '', '', '', '']
or
>>> l=[]
>>> l.extend([''] * 5)
>>> l
['', '', '', '', '']
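For the CSV use case mentioned above, a minimal sketch of padding a row before writing it (the file name and values are illustrative):

```python
import csv

row = ['id-123', 'value']   # the row built so far (illustrative values)
row.extend([''] * 3)        # pad with three empty columns in one call

with open('out.csv', 'w', newline='') as f:
    csv.writer(f).writerow(row)
```

The padded row comes out as ['id-123', 'value', '', '', ''], i.e. three empty columns appended in a single call.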

Related

Conditional mapping among columns of two data frames with Pandas Data frame

I need advice on how to map columns between two data frames. I have put it in a simple way so that it's easier to understand:
df = dataframe
EXAMPLE:
df1 = pd.DataFrame({
    "X": [],
    "Y": [],
    "Z": []
})
df2 = pd.DataFrame({
    "A": ['', '', 'A1'],
    "C": ['', '', 'C1'],
    "D": ['D1', 'Other', 'D3'],
    "F": ['', '', ''],
    "G": ['G1', '', 'G3'],
    "H": ['H1', 'H2', 'H3']
})
Requirement:
1st step:
Fill column X of df1 from df2's columns A, C, D, in that order: stop at the first non-empty value found and select it.
2nd step:
If the selected value is "Other", then search columns F, G, and H in the same way until a value is found.
Result:
X
0 D1
1 H2
2 A1
Thank you so much in advance
Try this:
def first_non_empty(df, cols):
    """Return the first non-empty, non-null value among the specified columns, per row."""
    return df[cols].replace('', pd.NA).bfill(axis=1).iloc[:, 0]

col_x = first_non_empty(df2, ['A', 'C', 'D'])
col_x = col_x.mask(col_x == 'Other', first_non_empty(df2, ['F', 'G', 'H']))
df1['X'] = col_x
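As a sanity check, running the suggested helper against df2 from the question reproduces the expected Result column:

```python
import pandas as pd

df2 = pd.DataFrame({
    "A": ['', '', 'A1'],
    "C": ['', '', 'C1'],
    "D": ['D1', 'Other', 'D3'],
    "F": ['', '', ''],
    "G": ['G1', '', 'G3'],
    "H": ['H1', 'H2', 'H3'],
})

def first_non_empty(df, cols):
    # Treat empty strings as missing, back-fill across the columns,
    # then take the first column: the first non-empty value per row.
    return df[cols].replace('', pd.NA).bfill(axis=1).iloc[:, 0]

col_x = first_non_empty(df2, ['A', 'C', 'D'])
col_x = col_x.mask(col_x == 'Other', first_non_empty(df2, ['F', 'G', 'H']))
print(col_x.tolist())  # ['D1', 'H2', 'A1']
```

Row 1 picks 'Other' from D, so the F/G/H fallback kicks in and returns 'H2', matching the Result table in the question.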

How to ignore terms in replace if they do not exist in the pandas dataframe?

I have the following code to replace one term with another. It only works when the value exists in the pandas dataframe. I assume I need to wrap gdf[montype] = gdf[montype].replace(dict(montype), regex=True) in an if statement? How would I do this, or is there a better way?
montype = [
    ['HIS_COP_', ''],
    ['_Ply', ''],
    ['_Pt', ''],
    ['BURIAL', 'burial'],
    ['CUT', 'CUT'],
    ['MODERN', 'MODERN'],
    ['NATURAL', 'NATURAL'],
    ['STRUCTURE', 'STRUCTURE'],
    ['SURFACE', 'SURFACE'],
    ['TREETHROW', 'natural feature'],
    ['FURROW', 'FURROW'],
    ['FIELD_DRAIN', 'FIELD_DRAIN'],
    ['DEPOSIT_FILL', 'DEPOSIT_FILL'],
    ['POSTHOLE', ''],
    ['TIMBER', ''],
    ['', '']
]
gdf[montype] = gdf[montype].replace(dict(montype), regex=True)
When the term does not exist, I get the error: raise KeyError(f"None of [{key}] are in the [{axis_name}]")
Edit:
mtype = {
    'HIS_COP_': '',
    '_Ply': '',
    '_Pt': '',
    'BURIAL': 'burial',
    'CUT': 'CUT',
    'MODERN': 'MODERN',
    'NATURAL': 'NATURAL',
    'STRUCTURE': 'STRUCTURE',
    'SURFACE': 'SURFACE',
    'TREETHROW': 'natural feature',
    'FURROW': 'FURROW',
    'FIELD_DRAIN': 'FIELD_DRAIN',
    'DEPOSIT_FILL': 'DEPOSIT_FILL',
    'POSTHOLE': '',
    'TIMBER': ''
}  # dict(montype)
gdf['montype'] = gdf['montype'].map(mtype).fillna(gdf['montype'])
You can try this:
# Convert your list to a dict
Montype = {'His_cop': '', 'Modern': 'Modern', etc...}  # dict(montype)
gdf['montype'] = gdf['montype'].map(Montype).fillna('whatever value you want')
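For context, Series.replace with a dict already skips terms that aren't present, so no if statement is needed; the KeyError above comes from indexing with the list montype instead of the column name. A minimal sketch (the Series is a stand-in for gdf['montype']):

```python
import pandas as pd

mtype = {'BURIAL': 'burial', 'TREETHROW': 'natural feature'}
s = pd.Series(['BURIAL', 'TREETHROW', 'UNKNOWN'])

# replace() maps values that match a key and leaves everything else
# untouched -- 'UNKNOWN' is not in the dict and passes through, no KeyError
out = s.replace(mtype)
print(out.tolist())  # ['burial', 'natural feature', 'UNKNOWN']
```

Unlike map(), which returns NaN for unmapped values (hence the fillna above), replace() only touches values that actually match.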

pandas: filter rows with list elements beginning with string?

I have the following dataframe.
d = pd.DataFrame({'a': [['foo', 'bar'], ['bar'], ['fah', 'baz']]})
I'd like to return just the rows where some value in a begins with 'f', i.e. the first and third rows.
This is what I've tried:
d[d.a.is_in('f')]
Use any in a list comprehension with a generator:
d = d[[any(y.startswith('f') for y in x) for x in d['a']]]
print (d)
a
0 [foo, bar]
2 [fah, baz]
Detail (converted to a list only for the sample):
print ([list(y.startswith('f') for y in x) for x in d['a']])
[[True, False], [False], [True, False]]
Solution using .apply(), iterating over the individual list elements, checking with .startswith() and evaluating the length of the resultant list:
import pandas as pd
df = pd.DataFrame({'a': [['foo', 'bar'], ['bar'], ['fah', 'baz']]})
df = df[df.a.apply(lambda x: len([el for el in x if el.startswith('f')]) > 0)]
print(df)
which results in:
a
0 [foo, bar]
2 [fah, baz]
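An alternative sketch using Series.explode (available in pandas 0.25+), which flattens the lists so the per-element test can be vectorised and then folded back to one flag per original row:

```python
import pandas as pd

d = pd.DataFrame({'a': [['foo', 'bar'], ['bar'], ['fah', 'baz']]})

# explode() yields one row per list element while repeating the original
# index; groupby(level=0).any() folds the per-element test back per row
mask = d['a'].explode().str.startswith('f').groupby(level=0).any()
print(d[mask])  # rows 0 and 2
```

This trades the Python-level loop of the comprehension for pandas' string methods, which can read more clearly when the per-element test grows.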

Exporting Tokenized SpaCy result into Excel or SQL tables

I'm using spaCy with pandas to tokenise a sentence with part-of-speech (POS) tags and export the result to Excel. The code is as follows:
import spacy
import xlsxwriter
import pandas as pd
nlp = spacy.load('en_core_web_sm')
text ="""He is a good boy."""
doc = nlp(text)
for token in doc:
    x = [token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.shape_, token.is_alpha, token.is_stop]
    print(x)
When I print(x) I get the following:
['He', '-PRON-', 'PRON', 'PRP', 'nsubj', 'Xx', True, False]
['is', 'be', 'VERB', 'VBZ', 'ROOT', 'xx', True, True]
['a', 'a', 'DET', 'DT', 'det', 'x', True, True]
['good', 'good', 'ADJ', 'JJ', 'amod', 'xxxx', True, False]
['boy', 'boy', 'NOUN', 'NN', 'attr', 'xxx', True, False]
['.', '.', 'PUNCT', '.', 'punct', '.', False, False]
To the token loop, I added the DataFrame as follows:
for token in doc:
    x = [token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.shape_, token.is_alpha, token.is_stop]
    df = pd.DataFrame(x)
    print(df)
Now I start to get the following format:
0
0 He
1 -PRON-
2 PRON
3 PRP
4 nsubj
5 Xx
6 True
7 False
........
........
However, when I try exporting the output (df) to Excel using pandas with the following code, it only shows me the last iteration of x in the column:
df=pd.DataFrame(x)
writer = pd.ExcelWriter('pandas_simple.xlsx', engine='xlsxwriter')
df.to_excel(writer,sheet_name='Sheet1')
Output (in Excel Sheet):
0
0 .
1 .
2 PUNCT
3 .
4 punct
5 .
6 False
7 False
How can I have all the iterations one after the other, each in its own column, like this?
0 He is ….
1 -PRON- be ….
2 PRON VERB ….
3 PRP VBZ ….
4 nsubj ROOT ….
5 Xx xx ….
6 True True ….
7 False True ….
Some shorter code:
import spacy
import pandas as pd
nlp = spacy.load('en_core_web_sm')
text ="""He is a good boy."""
param = [[token.text, token.lemma_, token.pos_,
          token.tag_, token.dep_, token.shape_,
          token.is_alpha, token.is_stop] for token in nlp(text)]
df = pd.DataFrame(param)
headers = ['text', 'lemma', 'pos', 'tag', 'dep',
           'shape', 'is_alpha', 'is_stop']
df.columns = headers
In case you don't have your version yet:
import pandas as pd
rows = [
    ['He', '-PRON-', 'PRON', 'PRP', 'nsubj', 'Xx', True, False],
    ['is', 'be', 'VERB', 'VBZ', 'ROOT', 'xx', True, True],
    ['a', 'a', 'DET', 'DT', 'det', 'x', True, True],
    ['good', 'good', 'ADJ', 'JJ', 'amod', 'xxxx', True, False],
    ['boy', 'boy', 'NOUN', 'NN', 'attr', 'xxx', True, False],
    ['.', '.', 'PUNCT', '.', 'punct', '.', False, False],
]
headers = ['text', 'lemma', 'pos', 'tag', 'dep',
           'shape', 'is_alpha', 'is_stop']

# example 1: list of dicts
# following https://stackoverflow.com/a/28058264/1758363
d = []
for row in rows:
    dict_ = {k: v for k, v in zip(headers, row)}
    d.append(dict_)
df = pd.DataFrame(d)[headers]

# example 2: appending dicts
# (note: DataFrame.append was removed in pandas 2.0; use pd.concat there)
df2 = pd.DataFrame(columns=headers)
for row in rows:
    dict_ = {k: v for k, v in zip(headers, row)}
    df2 = df2.append(dict_, ignore_index=True)

# example 3: list of dicts created with the map() function
def as_dict(row):
    return {k: v for k, v in zip(headers, row)}
df3 = pd.DataFrame(list(map(as_dict, rows)))[headers]

def is_equal(df_a, df_b):
    """Substitute for pd.DataFrame.equals()"""
    return (df_a == df_b).all().all()

assert is_equal(df, df2)
assert is_equal(df2, df3)
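To complete the export step from the question, any of these DataFrames can be written out in one call. to_excel needs an engine such as xlsxwriter or openpyxl installed, so a dependency-free CSV call is shown alongside it (file names are illustrative):

```python
import pandas as pd

rows = [
    ['He', '-PRON-', 'PRON', 'PRP', 'nsubj', 'Xx', True, False],
    ['is', 'be', 'VERB', 'VBZ', 'ROOT', 'xx', True, True],
]
headers = ['text', 'lemma', 'pos', 'tag', 'dep',
           'shape', 'is_alpha', 'is_stop']
df = pd.DataFrame(rows, columns=headers)

# Building the DataFrame once, outside the loop, keeps every token;
# the question's code recreated df per iteration, so only the last row survived.
df.to_csv('tokens.csv', index=False)
# df.to_excel('tokens.xlsx', sheet_name='Sheet1', index=False)  # needs an Excel engine
```

The resulting file has one row per token and one column per attribute, rather than only the final iteration.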

How does Bio.PDB identify hetero-residues?

I'm wondering how Bio.PDB identifies a residue as a hetero-residue.
I know that the residue.id method returns a tuple in which the first item is the hetero flag, the second one is the residue identifier (number) and the third one is the insertion code.
But how does the internal code decide what to put in the hetero flag field? Does it check whether the atoms in the residue are HETATM records vs. ATOM records?
Or does it check the atom names in each residue and compare it to some set of hetero-atoms?
The reason I ask is that in 4MDH chain B, the first residue in the coordinates section is ACE (acetyl). It has only C and O atoms, and the PDB file lists it as a HETATM. But the residue.id for this residue is (' ', 0, ' ').
Here is my code:
>>> from Bio.PDB.mmtf import MMTFParser
>>> structure = MMTFParser.get_structure_from_url('4mdh')
/Library/Python/2.7/site-packages/Bio/PDB/StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain A is discontinuous at line 0.
PDBConstructionWarning)
/Library/Python/2.7/site-packages/Bio/PDB/StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain B is discontinuous at line 0.
PDBConstructionWarning)
>>> chain = [c for c in structure.get_chains() if c.get_id() == 'B'][0]
>>> residue0 = [r for r in chain.get_residues() if r.id[1] == 0][0]
>>> residue0.id
(' ', 0, ' ')
>>>
TL;DR: it's not Biopython but the mmtf library that does the interpretation.
From the source code:
self.structure_bulder.init_residue(group_name, self.this_type,
                                   group_number, insertion_code)
Here the residue is created. The 2nd parameter (self.this_type) is the field (hetero-flag) argument of init_residue:
def init_residue(self, resname, field, resseq, icode):
    """Create a new Residue object.

    Arguments:
    - resname - string, e.g. "ASN"
    - field - hetero flag, "W" for waters, "H" for hetero residues, otherwise blank.
In the mmtfParser this_type is set for the whole chain in set_chain_info.
If you load the same structure with mmtf, you can see that chains 0 and 1 are considered polymers, which Biopython interprets as 'regular' residues. That makes sense, since the acetyl group is bound to the peptide chain.
from mmtf import fetch
decoded_data = fetch("4mdh")
print(decoded_data.entity_list)
[{'chainIndexList': [0, 1],
'description': 'CYTOPLASMIC MALATE DEHYDROGENASE',
'sequence': 'XSE...SSA',
'type': 'polymer'},
{'chainIndexList': [2, 4],
'description': 'SULFATE ION',
'sequence': '',
'type': 'non-polymer'},
{'chainIndexList': [3, 5],
'description': 'NICOTINAMIDE-ADENINE-DINUCLEOTIDE',
'sequence': '',
'type': 'non-polymer'},
{'chainIndexList': [6, 7],
'description': 'water',
'sequence': '',
'type': 'water'}]
Note you can access models, chains and residues in Biopython by index, e.g.
structure[0]['B'][0] would give you the same residue as in the question.