Gilda modules reference
API
- gilda.api.annotate(text, sent_split_fun=None, organisms=None, namespaces=None, context_text=None)[source]
Annotate a given text with Gilda (i.e., do named entity recognition).
- Parameters
text (str) – The text to be annotated.
sent_split_fun (Callable[str, Iterable[Tuple[int, int]]], optional) – A function that splits the text into sentences. The default is nltk.tokenize.PunktSentenceTokenizer.span_tokenize(). The function should take a string as input and return an iterable of coordinate pairs corresponding to the start and end coordinates for each sentence in the input text.
organisms (list[str], optional) – A list of organism names to pass to the grounder. If not provided, human is used.
namespaces (list[str], optional) – A list of namespaces to pass to the grounder to restrict the matches to. By default, no restriction is applied.
context_text (Optional[str]) – A longer span of text that serves as additional context for the text being annotated for disambiguation purposes.
- Returns
A list of matches where each match is an Annotation object which contains as attributes the text span that was matched, the list of ScoredMatches, and the start and end character offsets of the text span.
- Return type
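The sent_split_fun contract described above (a string in, an iterable of start/end offsets out) can be illustrated with a small stand-in splitter. This is a minimal sketch: split_on_terminators is a hypothetical helper, not part of Gilda or NLTK, and its rule for what counts as a sentence is deliberately naive.

```python
import re
from typing import Iterator, Tuple


def split_on_terminators(text: str) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) character offsets, one pair per sentence.

    Hypothetical helper: a sentence is taken to be a run of characters
    ending in '.', '!' or '?' (or at the end of the text).
    """
    for m in re.finditer(r"[^.!?]+[.!?]*", text):
        chunk = m.group()
        # Trim surrounding whitespace so the offsets cover only the sentence
        start = m.start() + (len(chunk) - len(chunk.lstrip()))
        end = m.end() - (len(chunk) - len(chunk.rstrip()))
        if start < end:
            yield start, end


text = "BRAF binds MEK. ER stress follows."
spans = list(split_on_terminators(text))
sentences = [text[start:end] for start, end in spans]
```

Going by the signature above, such a function would be passed as gilda.annotate(text, sent_split_fun=split_on_terminators).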
- gilda.api.get_grounder()[source]
Initialize and return the default Grounder instance.
- Return type
- Returns
A Grounder instance whose attributes and methods can be used directly.
- gilda.api.get_models()[source]
Return a list of entity texts for which disambiguation models exist.
- gilda.api.get_names(db, id, status=None, source=None)[source]
Return a list of entity texts corresponding to a given database ID.
- Parameters
db (str) – The database in which the ID is an entry, e.g., HGNC.
id (str) – The ID of an entry in the database.
status (Optional[str]) – If given, only entity texts with the given status (e.g., “synonym”) are returned.
source (Optional[str]) – If given, only entity texts from the given source (e.g., “uniprot”) are returned.
- gilda.api.ground(text, context=None, organisms=None, namespaces=None)[source]
Return a list of scored matches for a text to ground.
- Parameters
text (str) – The entity text to be grounded.
context (Optional[str]) – Any additional text that serves as context for disambiguating the given entity text, used if a model exists for disambiguating the given text.
organisms (Optional[List[str]]) – A list of taxonomy identifiers to use as a priority list when surfacing matches for proteins/genes from multiple organisms.
namespaces (Optional[List[str]]) – A list of namespaces to restrict the matches to. By default, no restriction is applied.
- Returns
A list of ScoredMatch objects representing the groundings.
- Return type
Examples
Ground a string corresponding to an entity name, label, or synonym
>>> import gilda
>>> scored_matches = gilda.ground('mapt')
The matches are sorted in descending order by score and, in the event of a tie, by the namespace of the primary grounding. Each scored match has a gilda.term.Term object that contains information about the primary grounding.
>>> scored_matches[0].term.db
'hgnc'
>>> scored_matches[0].term.id
'6893'
>>> scored_matches[0].term.get_curie()
'hgnc:6893'
The score for each match can be accessed directly:
>>> scored_matches[0].score
0.7623
The rationale for each match is contained in the match attribute, whose fields are described in gilda.scorer.Match:
>>> match_object = scored_matches[0].match
Give optional context to be used by Gilda’s disambiguation models, if available
>>> scored_matches = gilda.ground('ER', context='Calcium is released from the ER.')
Only return results from a certain namespace, such as when a family and gene have the same name
>>> scored_matches = gilda.ground('ESR', namespaces=["hgnc"])
- gilda.api.make_grounder(terms)[source]
Create a custom grounder from a list of Terms.
- Parameters
terms (Union[str, List[Term], Mapping[str, List[Term]]]) – Specifies the grounding terms that should be loaded in the Grounder. If str, it is interpreted as a path to a grounding terms gzipped TSV file which is then loaded. If list, it is assumed to be a flat list of Terms. If dict, it is assumed to be a grounding terms dict with normalized entity strings as keys and lists of Term objects as values. Default: None
- Return type
- Returns
A Grounder instance, initialized with either the default terms loaded from the resource file or a custom set of terms if the terms argument was specified.
Examples
The following example shows how to get an ontology with obonet and load custom terms:

import gilda
import obonet
from gilda.process import normalize
from tqdm import tqdm

prefix = "UBERON"
url = "http://purl.obolibrary.org/obo/uberon/basic.obo"
g = obonet.read_obo(url)
custom_obo_terms = []
it = tqdm(g.nodes(data=True), unit_scale=True, unit="node")
for node, data in it:
    # Skip entries imported from other ontologies
    if not node.startswith(f"{prefix}:"):
        continue
    # str.removeprefix requires Python 3.9 or later
    identifier = node.removeprefix(f"{prefix}:")
    name = data["name"]
    # Add a term for the primary name
    custom_obo_terms.append(gilda.Term(
        norm_text=normalize(name),
        text=name,
        db=prefix,
        id=identifier,
        entry_name=name,
        status="name",
        source=prefix,
    ))
    # Add terms for all synonyms
    for synonym_raw in data.get("synonym", []):
        try:
            # Try to parse out of the quoted OBO Field
            synonym = synonym_raw.split('"')[1].strip()
        except IndexError:
            continue  # the synonym was malformed
        custom_obo_terms.append(gilda.Term(
            norm_text=normalize(synonym),
            text=synonym,
            db=prefix,
            id=identifier,
            entry_name=name,
            status="synonym",
            source=prefix,
        ))

custom_grounder = gilda.make_grounder(custom_obo_terms)
scored_matches = custom_grounder.ground("head")
Additional examples for loading custom content from OBO Graph JSON,
pyobo
, and more can be found in the Jupyter notebooks in the Gilda repository on GitHub.
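For a handful of hand-written entries, the same pattern works without an ontology file. The sketch below is illustrative only: term_kwargs and build_grounder are hypothetical helpers, the "EX" namespace and its identifier are made up, and the gilda import is deferred into build_grounder so the row-building step can be inspected on its own.

```python
def term_kwargs(db, identifier, name, synonyms=()):
    """Build keyword-argument dicts for gilda.Term, one per name or synonym.

    norm_text is left out here and filled in by build_grounder, because
    normalization is done with gilda.process.normalize.
    """
    rows = [dict(text=name, db=db, id=identifier, entry_name=name,
                 status="name", source=db)]
    for synonym in synonyms:
        rows.append(dict(text=synonym, db=db, id=identifier, entry_name=name,
                         status="synonym", source=db))
    return rows


def build_grounder(rows):
    """Turn the rows into Terms and a custom Grounder (requires gilda)."""
    import gilda
    from gilda.process import normalize

    terms = [gilda.Term(norm_text=normalize(row["text"]), **row) for row in rows]
    return gilda.make_grounder(terms)


# Made-up entry in a made-up "EX" namespace, for illustration only
rows = term_kwargs("EX", "0001", "example tissue", synonyms=["sample tissue"])
```

With Gilda installed, build_grounder(rows).ground("sample tissue") would then look up the synonym against only these custom terms.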
Grounder
- class gilda.grounder.Annotation(text, matches, start, end)[source]
Bases: object
A class representing a named entity annotation in a given text.
- matches
The list of scored matches for the text span.
- Type
- class gilda.grounder.Grounder(terms=None, *, namespace_priority=None)[source]
Bases: object
Class to look up and ground query texts in a terms file.
- Parameters
terms (Union[str, Path, Iterable[Term], Mapping[str, List[Term]], None]) – Specifies the grounding terms that should be loaded in the Grounder.
- If None, the default grounding terms are loaded from the versioned resource folder.
- If str or pathlib.Path, it is interpreted as a path to a grounding terms gzipped TSV file which is then loaded. If it is a str and looks like a URL, the file is downloaded from the internet.
- If dict, it is assumed to be a grounding terms dict with normalized entity strings as keys and lists of gilda.term.Term instances as values.
- If list, set, tuple, or any other iterable, it is assumed to be a flat list of gilda.term.Term instances.
namespace_priority (Optional[List[str]]) – Specifies a term namespace priority order. For example, if multiple terms are matched with the same score, this list decides their order: terms whose namespace appears closer to the front of the list are ranked first. By default, DEFAULT_NAMESPACE_PRIORITY is used, which, for example, prioritizes famplex entities over HGNC ones.
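The tie-breaking role of namespace_priority can be pictured with a small stand-alone sort. This is only an illustration of the idea, not Gilda's implementation: the PRIORITY list and the (namespace, score) tuples are made up, and the real DEFAULT_NAMESPACE_PRIORITY may contain different entries.

```python
from typing import List, Tuple

# Hypothetical priority list; Gilda's DEFAULT_NAMESPACE_PRIORITY may differ
PRIORITY = ["FPLX", "HGNC", "CHEBI"]


def rank(candidates: List[Tuple[str, float]],
         priority: List[str]) -> List[Tuple[str, float]]:
    """Sort (namespace, score) pairs by score, then by namespace priority."""
    def key(item):
        namespace, score = item
        # Namespaces missing from the list sort after all listed ones
        position = priority.index(namespace) if namespace in priority else len(priority)
        return (-score, position)
    return sorted(candidates, key=key)


ranked = rank(
    [("HGNC", 0.78), ("MESH", 0.78), ("FPLX", 0.78), ("CHEBI", 0.9)],
    PRIORITY,
)
```

Here the highest score still wins outright, and the priority list only orders the three candidates that share a score.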
- get_ambiguities(skip_names=True, skip_curated=True, skip_name_matches=True, skip_species_ambigs=True)[source]
Return a list of ambiguous term groups in the grounder.
- Parameters
skip_names (bool) – If True, groups of terms where one has the “name” status are skipped. This makes sense usually since these are prioritized over synonyms anyway.
skip_curated (bool) – If True, groups of terms where one has the “curated” status are skipped. This makes sense usually since these are prioritized over synonyms anyway.
skip_name_matches (bool) – If True, groups of terms that all share the same standard name are skipped. This is effective at eliminating spurious ambiguities due to unresolved cross-references between equivalent terms in different namespaces.
skip_species_ambigs (bool) – If True, groups of terms that are all genes or proteins, and are all from different species (one term from each species) are skipped. This is effective at eliminating ambiguities between orthologous genes in different species that are usually resolved using the organism priority list.
- Return type
- get_names(db, id, status=None, source=None)[source]
Return a list of entity texts corresponding to a given database ID.
- Parameters
db (str) – The database in which the ID is an entry, e.g., HGNC.
id (str) – The ID of an entry in the database.
status (Optional[str]) – If given, only entity texts with the given status (e.g., “synonym”) are returned.
source (Optional[str]) – If given, only entity texts from the given source (e.g., “uniprot”) are returned.
- Returns
names – A list of entity texts corresponding to the given database/ID
- Return type
- ground(raw_str, context=None, organisms=None, namespaces=None)[source]
Return scored groundings for a given raw string.
- Parameters
raw_str (str) – A string to be grounded with respect to the set of Terms that the Grounder contains.
context (Optional[str]) – Any additional text that serves as context for disambiguating the given entity text, used if a model exists for disambiguating the given text.
organisms (Optional[List[str]]) – An optional list of organism identifiers defining a priority ranking among organisms, if genes/proteins from multiple organisms match the input. If not provided, the default [‘9606’] i.e., human is used.
namespaces (Optional[List[str]]) – A list of namespaces to restrict matches to. This will apply to both the primary namespace of a matched term, to any subsumed matches, and to the source namespaces of terms if they were created using cross-reference mappings. By default, no restriction is applied.
- Returns
A list of ScoredMatch objects representing the groundings sorted by decreasing score.
- Return type
- ground_best(raw_str, context=None, organisms=None, namespaces=None)[source]
Return the best scored grounding for a given raw string.
- Parameters
raw_str (str) – A string to be grounded with respect to the set of Terms that the Grounder contains.
context (Optional[str]) – Any additional text that serves as context for disambiguating the given entity text, used if a model exists for disambiguating the given text.
organisms (Optional[List[str]]) – An optional list of organism identifiers defining a priority ranking among organisms, if genes/proteins from multiple organisms match the input. If not provided, the default [‘9606’] i.e., human is used.
namespaces (Optional[List[str]]) – A list of namespaces to restrict matches to. This will apply to both the primary namespace of a matched term, to any subsumed matches, and to the source namespaces of terms if they were created using cross-reference mappings. By default, no restriction is applied.
- Returns
The best ScoredMatch returned by ground() if any are returned, otherwise None.
- Return type
Optional[gilda.grounder.ScoredMatch]
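Since ground() returns its matches sorted by decreasing score, the contract of ground_best() amounts to "first element or None". The following sketch shows that contract on plain lists. best_or_none is a hypothetical helper, not the library's code.

```python
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")


def best_or_none(matches: Sequence[T]) -> Optional[T]:
    """Return the first element of an already-ranked list, or None if empty."""
    return matches[0] if matches else None


# Stand-in values; real code would pass the list returned by Grounder.ground()
top = best_or_none(["first match", "second match"])
nothing = best_or_none([])
```

Callers of ground_best() should therefore check for None before accessing attributes such as term or score.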
- class gilda.grounder.ScoredMatch(term, score, match, disambiguation=None, subsumed_terms=None)[source]
Bases: object
Class representing a scored match to a grounding term.
- term
The Term that the scored match is for.
- Type
gilda.grounder.Term
- match
The Match object characterizing the match to the Term.
- Type
- subsumed_terms
A list of additional Term objects that also matched, have the same db/id value as the term associated with the match, but were further down the score ranking. In some cases examining the subsumed terms associated with a match can provide additional metadata in downstream applications.
- Type
Optional[list[gilda.grounder.Term]]
- get_groundings()[source]
Return all groundings for this match including from mapped and subsumed terms.
- gilda.grounder.load_entries_from_terms_file(terms_file)[source]
Yield Terms from a compressed terms TSV file path.
- gilda.grounder.load_terms_file(terms_file)[source]
Load a TSV file containing terms into a lookup dictionary.
Scorer
- class gilda.scorer.Match(query, ref, exact=None, space_mismatch=None, dash_mismatches=None, cap_combos=None)[source]
Bases: object
Class representing a match between a query and a reference string
- gilda.scorer.generate_match(query, ref, beginning_of_sentence=False)[source]
Return a match data structure based on comparing a query to a ref str.
- Parameters
- Returns
A Match object characterizing the match between the two strings.
- Return type
- gilda.scorer.score_string_match(match)[source]
Return a score between 0 and 1 for the goodness of a match.
This score is purely based on the relationship of the two strings and does not take the status of the reference into account.
- Parameters
match (gilda.scorer.Match) – The Match object characterizing the relationship of the query and reference strings.
- Returns
A match score between 0 and 1.
- Return type
Term
- class gilda.term.Term(norm_text, text, db, id, entry_name, status, source, organism=None, source_db=None, source_id=None)[source]
Bases: object
Represents a text entry corresponding to a grounded term.
- organism
When the term represents a protein, this attribute provides the taxonomy code of the species for the protein. For non-proteins, not provided. Default: None
- Type
Optional[str]
- source_db
If the term’s db/id was mapped from a different, original db/id from a given source, this attribute provides the original db value before mapping.
- Type
Optional[str]
- source_id
If the term’s db/id was mapped from a different, original db/id from a given source, this attribute provides the original ID value before mapping.
- Type
Optional[str]
Process
Module containing various string processing functions used for grounding.
- gilda.process.dashes = ['−', '-', '‐', '‑', '‒', '–', '—', '―']
A list of all kinds of dashes
- gilda.process.depluralize(word)[source]
Return the depluralized version of the word, along with a status flag.
- Parameters
word (str) – The word which is to be depluralized.
- Returns
The original word, if it is detected to be non-plural, or the depluralized version of the word, and a status flag representing the detected pluralization status of the word, with non_plural (e.g., BRAF), plural_oes (e.g., mosquitoes), plural_ies (e.g., antibodies), plural_es (e.g., switches), plural_cap_s (e.g., MAPKs), and plural_s (e.g., receptors).
- Return type
list of str pairs
- gilda.process.get_capitalization_pattern(word, beginning_of_sentence=False)[source]
Return the type of capitalization for the string.
- Parameters
- Returns
The capitalization pattern of the given word. Returns one of the following: sentence_initial_cap, single_cap_letter, all_caps, all_lower, initial_cap, mixed.
- Return type
- gilda.process.replace_dashes(s, rep='-')[source]
Replace all types of dashes in a given string with a given replacement.
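The behavior described for replace_dashes can be approximated in a few lines. This is a stand-alone sketch, not the library's source: the code points below are written as escapes and are assumed to correspond to the eight characters in the dashes list above.

```python
# Assumed code points for the eight dash characters listed above
DASHES = [
    "\u2212",  # minus sign
    "\u002d",  # ASCII hyphen-minus
    "\u2010",  # hyphen
    "\u2011",  # non-breaking hyphen
    "\u2012",  # figure dash
    "\u2013",  # en dash
    "\u2014",  # em dash
    "\u2015",  # horizontal bar
]


def replace_dashes_sketch(s: str, rep: str = "-") -> str:
    """Replace every dash-like character in s with rep, in a single pass."""
    return "".join(rep if ch in DASHES else ch for ch in s)


normalized = replace_dashes_sketch("MEK1\u20132")
spaced = replace_dashes_sketch("NF\u2212kB", rep=" ")
```

Normalizing every variant to one character before matching means that strings differing only in the kind of dash compare equal.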
- gilda.process.replace_greek_latin(s)[source]
Replace Greek spelled-out letters with their Latin character.
- gilda.process.replace_greek_spelled_out(s)[source]
Replace Greek Unicode characters with their spelled-out Latin names.
- gilda.process.replace_greek_uni(s)[source]
Replace Greek spelled-out letters with their Unicode character.
- gilda.process.replace_unicode(s)[source]
Replace unicode with ASCII equivalent, except Greek letters.
Greek letters are handled separately and aren’t replaced in this context.
Named Entity Recognition
Pandas Utilities
Utilities for Pandas.
- gilda.pandas_utils.ground_df(df, source_column, *, target_column=None, grounder=None, **kwargs)[source]
Ground the elements of a column in a Pandas dataframe as CURIEs, in-place.
- Parameters
df (DataFrame) – A pandas dataframe.
source_column (Union[str, int]) – The column to ground. This column contains text corresponding to named entities’ labels or synonyms.
target_column (Union[None, str, int]) – The column where to put the groundings (either a CURIE string, or None). It’s possible to create a new column when passing a string for this argument. If not given, will create a new column name like <source column>_grounded.
grounder (Optional[Grounder]) – A custom grounder. If none given, uses the built-in grounder.
kwargs – Keyword arguments passed to Grounder.ground(), could include context, organisms, or namespaces.
- Return type
Examples
The following example shows how to use this function.
import pandas as pd
import gilda

url = "https://raw.githubusercontent.com/OBOAcademy/obook/master/docs/tutorial/linking_data/data.csv"
df = pd.read_csv(url)
gilda.ground_df(df, source_column="disease", target_column="disease_curie")
- gilda.pandas_utils.ground_df_map(df, source_column, *, grounder=None, **kwargs)[source]
Ground the elements of a column in a Pandas dataframe as CURIEs.
- Parameters
df (DataFrame) – A pandas dataframe.
source_column (Union[str, int]) – The column to ground. This column contains text corresponding to named entities’ labels or synonyms.
grounder (Optional[Grounder]) – A custom grounder. If none given, uses the built-in grounder.
kwargs – Keyword arguments passed to Grounder.ground(), could include context, organisms, or namespaces.
- Returns
A pandas series representing the grounded CURIE strings. Contains NaNs if grounding was not successful or if there was an NaN in the cell before.
- Return type
series