Gilda modules reference

API

gilda.api.annotate(text, sent_split_fun=None, organisms=None, namespaces=None, context_text=None)[source]

Annotate a given text with Gilda (i.e., do named entity recognition).

Parameters
  • text (str) – The text to be annotated.

  • sent_split_fun (Callable[str, Iterable[Tuple[int, int]]], optional) – A function that splits the text into sentences. The default is nltk.tokenize.PunktSentenceTokenizer.span_tokenize(). The function should take a string as input and return an iterable of coordinate pairs corresponding to the start and end coordinates for each sentence in the input text.

  • organisms (list[str], optional) – A list of organism names to pass to the grounder. If not provided, human is used.

  • namespaces (list[str], optional) – A list of namespaces to pass to the grounder to restrict the matches to. By default, no restriction is applied.

  • context_text (Optional[str]) – A longer span of text that serves as additional context for the text being annotated for disambiguation purposes.

Returns

A list of matches where each match is an Annotation object which contains as attributes the text span that was matched, the list of ScoredMatches, and the start and end character offsets of the text span.

Return type

list[Annotation]
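The sent_split_fun contract described above (a string in, an iterable of start/end offsets out) can be illustrated with a minimal sketch that treats each non-empty line as one sentence. The helpers split_on_newlines and annotate_by_line are hypothetical names, not part of Gilda, and gilda is imported lazily so that the splitter can be tried without the library installed.

```python
import re
from typing import Iterator, List, Tuple


def split_on_newlines(text: str) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) character offsets of each non-empty line."""
    for m in re.finditer(r"[^\n]+", text):
        yield m.start(), m.end()


def annotate_by_line(text: str) -> List:
    """Annotate text, treating each line as one sentence (requires gilda)."""
    # Imported here so the splitter above has no third-party dependencies.
    from gilda.api import annotate

    return annotate(text, sent_split_fun=split_on_newlines)


# The splitter can be checked on its own, without calling Gilda.
spans = list(split_on_newlines("BRAF binds MEK.\nCalcium leaves the ER."))
print(spans)
```

Each returned pair can be used to slice the original text, for example text[start:end], which is a quick way to check a splitter before passing it to annotate.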

gilda.api.get_grounder()[source]

Initialize and return the default Grounder instance.

Return type

Grounder

Returns

A Grounder instance whose attributes and methods can be used directly.

gilda.api.get_models()[source]

Return a list of entity texts for which disambiguation models exist.

Returns

The list of entity texts for which a disambiguation model is available.

Return type

list[str]

gilda.api.get_names(db, id, status=None, source=None)[source]

Return a list of entity texts corresponding to a given database ID.

Parameters
  • db (str) – The database in which the ID is an entry, e.g., HGNC.

  • id (str) – The ID of an entry in the database.

  • status (Optional[str]) – If given, only entity texts with the given status (e.g., “synonym”) are returned.

  • source (Optional[str]) – If given, only entity texts from the given source (e.g., “uniprot”) are returned.

gilda.api.ground(text, context=None, organisms=None, namespaces=None)[source]

Return a list of scored matches for a text to ground.

Parameters
  • text (str) – The entity text to be grounded.

  • context (Optional[str]) – Any additional text that serves as context for disambiguating the given entity text, used if a model exists for disambiguating the given text.

  • organisms (Optional[List[str]]) – A list of taxonomy identifiers to use as a priority list when surfacing matches for proteins/genes from multiple organisms.

  • namespaces (Optional[List[str]]) – A list of namespaces to restrict the matches to. By default, no restriction is applied.

Returns

A list of ScoredMatch objects representing the groundings.

Return type

list[gilda.grounder.ScoredMatch]

Examples

Ground a string corresponding to an entity name, label, or synonym

>>> import gilda
>>> scored_matches = gilda.ground('mapt')

The matches are sorted in descending order by score and, in the event of a tie, by the namespace of the primary grounding. Each scored match has a gilda.term.Term object that contains information about the primary grounding.

>>> scored_matches[0].term.db
'hgnc'
>>> scored_matches[0].term.id
'6893'
>>> scored_matches[0].term.get_curie()
'hgnc:6893'

The score for each match can be accessed directly:

>>> scored_matches[0].score
0.7623

The rationale for each match is contained in the match attribute whose fields are described in gilda.scorer.Match:

>>> match_object = scored_matches[0].match

Give optional context to be used by Gilda’s disambiguation models, if available

>>> scored_matches = gilda.ground('ER', context='Calcium is released from the ER.')

Only return results from a certain namespace, such as when a family and gene have the same name

>>> scored_matches = gilda.ground('ESR', namespaces=["hgnc"])

gilda.api.make_grounder(terms)[source]

Create a custom grounder from a list of Terms.

Parameters

terms (Union[str, List[Term], Mapping[str, List[Term]]]) – Specifies the grounding terms that should be loaded in the Grounder. If str, it is interpreted as a path to a grounding terms gzipped TSV file which is then loaded. If list, it is assumed to be a flat list of Terms. If dict, it is assumed to be a grounding terms dict with normalized entity strings as keys and lists of Term objects as values.

Return type

Grounder

Returns

A Grounder instance, initialized with either the default terms loaded from the resource file or a custom set of terms if the terms argument was specified.

Examples

The following example shows how to get an ontology with obonet and load custom terms:

import gilda
import obonet
from gilda.process import normalize
from tqdm import tqdm

prefix = "UBERON"
url = "http://purl.obolibrary.org/obo/uberon/basic.obo"
g = obonet.read_obo(url)
custom_obo_terms = []
it = tqdm(g.nodes(data=True), unit_scale=True, unit="node")
for node, data in it:
    # Skip entries imported from other ontologies
    if not node.startswith(f"{prefix}:"):
        continue

    identifier = node.removeprefix(f"{prefix}:")

    name = data["name"]
    custom_obo_terms.append(gilda.Term(
        norm_text=normalize(name),
        text=name,
        db=prefix,
        id=identifier,
        entry_name=name,
        status="name",
        source=prefix,
    ))

    # Add terms for all synonyms
    for synonym_raw in data.get("synonym", []):
        try:
            # Try to parse out of the quoted OBO Field
            synonym = synonym_raw.split('"')[1].strip()
        except IndexError:
            continue  # the synonym was malformed

        custom_obo_terms.append(gilda.Term(
            norm_text=normalize(synonym),
            text=synonym,
            db=prefix,
            id=identifier,
            entry_name=name,
            status="synonym",
            source=prefix,
        ))

custom_grounder = gilda.make_grounder(custom_obo_terms)
scored_matches = custom_grounder.ground("head")

Additional examples for loading custom content from OBO Graph JSON, pyobo, and more can be found in the Jupyter notebooks in the Gilda repository on GitHub.

Grounder

class gilda.grounder.Annotation(text, matches, start, end)[source]

Bases: object

A class representing a named entity annotation in a given text.

text

The text span that was annotated.

Type

str

matches

The list of scored matches for the text span.

Type

list[ScoredMatch]

start

The start character offset of the text span.

Type

int

end

The end character offset of the text span.

Type

int

to_json()[source]

Convert the Annotation object to JSON.

class gilda.grounder.Grounder(terms=None, *, namespace_priority=None)[source]

Bases: object

Class to look up and ground query texts in a terms file.

Parameters
  • terms (Union[str, Path, Iterable[Term], Mapping[str, List[Term]], None]) –

    Specifies the grounding terms that should be loaded in the Grounder.

    • If None, the default grounding terms are loaded from the versioned resource folder.

    • If str or pathlib.Path, it is interpreted as a path to a grounding terms gzipped TSV file which is then loaded. If it is a str that looks like a URL, the file is downloaded from that URL.

    • If dict, it is assumed to be a grounding terms dict with normalized entity strings as keys and lists of gilda.term.Term instances as values.

    • If list, set, tuple, or any other iterable, it is assumed to be a flat list of gilda.term.Term instances.

  • namespace_priority (Optional[List[str]]) – Specifies a term namespace priority order. For example, if multiple terms are matched with the same score, this list is used to order them: a term whose namespace appears closer to the front of the list is ranked higher. By default, DEFAULT_NAMESPACE_PRIORITY is used, which, for example, prioritizes famplex entities over HGNC ones.
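The tie-breaking role of namespace_priority can be pictured with a small, self-contained sketch. This is not Gilda's implementation: the rank function, the (namespace, score) tuples, and the PRIORITY list are all illustrative assumptions, and the real default order is whatever DEFAULT_NAMESPACE_PRIORITY contains.

```python
from typing import List, Tuple

# Hypothetical priority order used only for this illustration.
PRIORITY = ["FPLX", "HGNC", "CHEBI"]


def rank(candidates: List[Tuple[str, float]],
         priority: List[str]) -> List[Tuple[str, float]]:
    """Sort (namespace, score) pairs by score descending, then by priority."""

    def key(item: Tuple[str, float]) -> Tuple[float, int]:
        namespace, score = item
        # Namespaces missing from the list sort after all listed ones.
        pos = priority.index(namespace) if namespace in priority else len(priority)
        return (-score, pos)

    return sorted(candidates, key=key)


ranked = rank([("HGNC", 0.8), ("CHEBI", 0.9), ("FPLX", 0.8)], PRIORITY)
print(ranked)
```

The highest score always wins; the priority list only decides the order of the two candidates that share a score.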

get_ambiguities(skip_names=True, skip_curated=True, skip_name_matches=True, skip_species_ambigs=True)[source]

Return a list of ambiguous term groups in the grounder.

Parameters
  • skip_names (bool) – If True, groups of terms where one has the “name” status are skipped. This makes sense usually since these are prioritized over synonyms anyway.

  • skip_curated (bool) – If True, groups of terms where one has the “curated” status are skipped. This makes sense usually since these are prioritized over synonyms anyway.

  • skip_name_matches (bool) – If True, groups of terms that all share the same standard name are skipped. This is effective at eliminating spurious ambiguities due to unresolved cross-references between equivalent terms in different namespaces.

  • skip_species_ambigs (bool) – If True, groups of terms that are all genes or proteins, and are all from different species (one term from each species) are skipped. This is effective at eliminating ambiguities between orthologous genes in different species that are usually resolved using the organism priority list.

Return type

List[List[Term]]

get_models()[source]

Return a list of entity texts for which disambiguation models exist.

Returns

The list of entity texts for which a disambiguation model is available.

Return type

list[str]

get_names(db, id, status=None, source=None)[source]

Return a list of entity texts corresponding to a given database ID.

Parameters
  • db (str) – The database in which the ID is an entry, e.g., HGNC.

  • id (str) – The ID of an entry in the database.

  • status (Optional[str]) – If given, only entity texts with the given status (e.g., “synonym”) are returned.

  • source (Optional[str]) – If given, only entity texts from the given source (e.g., “uniprot”) are returned.

Returns

names – A list of entity texts corresponding to the given database/ID

Return type

list[str]

ground(raw_str, context=None, organisms=None, namespaces=None)[source]

Return scored groundings for a given raw string.

Parameters
  • raw_str (str) – A string to be grounded with respect to the set of Terms that the Grounder contains.

  • context (Optional[str]) – Any additional text that serves as context for disambiguating the given entity text, used if a model exists for disambiguating the given text.

  • organisms (Optional[List[str]]) – An optional list of organism identifiers defining a priority ranking among organisms, if genes/proteins from multiple organisms match the input. If not provided, the default [‘9606’] (i.e., human) is used.

  • namespaces (Optional[List[str]]) – A list of namespaces to restrict matches to. The restriction applies to the primary namespace of a matched term, to any subsumed matches, and to the source namespaces of terms if they were created using cross-reference mappings. By default, no restriction is applied.

Returns

A list of ScoredMatch objects representing the groundings sorted by decreasing score.

Return type

list[gilda.grounder.ScoredMatch]

ground_best(raw_str, context=None, organisms=None, namespaces=None)[source]

Return the best scored grounding for a given raw string.

Parameters
  • raw_str (str) – A string to be grounded with respect to the set of Terms that the Grounder contains.

  • context (Optional[str]) – Any additional text that serves as context for disambiguating the given entity text, used if a model exists for disambiguating the given text.

  • organisms (Optional[List[str]]) – An optional list of organism identifiers defining a priority ranking among organisms, if genes/proteins from multiple organisms match the input. If not provided, the default [‘9606’] (i.e., human) is used.

  • namespaces (Optional[List[str]]) – A list of namespaces to restrict matches to. The restriction applies to the primary namespace of a matched term, to any subsumed matches, and to the source namespaces of terms if they were created using cross-reference mappings. By default, no restriction is applied.

Returns

The best ScoredMatch returned by ground() if any are returned, otherwise None.

Return type

Optional[gilda.grounder.ScoredMatch]

lookup(raw_str)[source]

Return matching Terms for a given raw string.

Parameters

raw_str (str) – A string to be looked up in the set of Terms that the Grounder contains.

Return type

List[Term]

Returns

A list of Terms that are potential matches for the given string.

print_summary(**kwargs)[source]

Print the summary of this grounder.

Return type

None

summary_str()[source]

Summarize the contents of the grounder.

Return type

str

class gilda.grounder.ScoredMatch(term, score, match, disambiguation=None, subsumed_terms=None)[source]

Bases: object

Class representing a scored match to a grounding term.

term

The Term that the scored match is for.

Type

gilda.grounder.Term

score

The score associated with the match.

Type

float

match

The Match object characterizing the match to the Term.

Type

gilda.scorer.Match

disambiguation

Meta-information about disambiguation, when available.

Type

Optional[dict]

subsumed_terms

A list of additional Term objects that also matched, have the same db/id value as the term associated with the match, but were further down the score ranking. In some cases examining the subsumed terms associated with a match can provide additional metadata in downstream applications.

Type

Optional[list[gilda.grounder.Term]]

get_grounding_dict()[source]

Get the groundings as CURIEs and URLs.

Return type

Mapping[str, str]

get_groundings()[source]

Return all groundings for this match including from mapped and subsumed terms.

Return type

Set[Tuple[str, str]]

Returns

A set of tuples representing groundings for this match including the grounding for the primary term as well as any subsumed terms, and groundings that come from having mapped an original source grounding during grounding resource construction.

get_namespaces()[source]

Return all namespaces for this match including from mapped and subsumed terms.

Return type

Set[str]

Returns

A set of strings representing namespaces for terms involved in this match, including the namespace for the primary term as well as any subsumed terms, and namespaces that come from having mapped an original source grounding during grounding resource construction.

gilda.grounder.load_entries_from_terms_file(terms_file)[source]

Yield Terms from a compressed terms TSV file path.

Parameters

terms_file (Union[str, Path]) – Path to a compressed TSV terms file with columns corresponding to the serialized elements of a Term.

Return type

Iterator[Term]

Returns

Terms loaded from the file yielded by a generator.

gilda.grounder.load_terms_file(terms_file)[source]

Load a TSV file containing terms into a lookup dictionary.

Parameters

terms_file (Union[str, Path]) – Path to a compressed TSV terms file with columns corresponding to the serialized elements of a Term.

Return type

Mapping[str, List[Term]]

Returns

A lookup dictionary whose keys are normalized entity texts, and values are lists of Terms with that normalized entity text.
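The shape of the lookup dictionary described above (normalized text as key, list of terms as value) can be sketched without Gilda. MiniTerm is a hypothetical stand-in that carries only three of the fields of gilda.term.Term, and build_lookup is an illustrative helper, not a Gilda function.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class MiniTerm(NamedTuple):
    """Hypothetical stand-in with a subset of Term's fields."""
    norm_text: str
    text: str
    db: str


def build_lookup(terms: Iterable[MiniTerm]) -> Dict[str, List[MiniTerm]]:
    """Group terms by their normalized text."""
    lookup: Dict[str, List[MiniTerm]] = defaultdict(list)
    for term in terms:
        lookup[term.norm_text].append(term)
    return dict(lookup)


# Placeholder entries; the db values are made up for the example.
terms = [
    MiniTerm("er", "ER", "DB_A"),
    MiniTerm("er", "ER", "DB_B"),
    MiniTerm("head", "head", "DB_C"),
]
lookup = build_lookup(terms)
print({key: len(value) for key, value in lookup.items()})
```

An ambiguous string such as "er" maps to several terms under a single key, which is why the values are lists rather than single terms.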

Scorer

class gilda.scorer.Match(query, ref, exact=None, space_mismatch=None, dash_mismatches=None, cap_combos=None)[source]

Bases: object

Class representing a match between a query and a reference string

gilda.scorer.generate_match(query, ref, beginning_of_sentence=False)[source]

Return a match data structure based on comparing a query to a ref str.

Parameters
  • query (str) – The string to be compared against a reference string.

  • ref (str) – The reference string against which the incoming query string is compared.

  • beginning_of_sentence (bool) – True if the query appears at the beginning of a sentence, relevant for how capitalization is evaluated.

Returns

A Match object characterizing the match between the two strings.

Return type

Match

gilda.scorer.score_string_match(match)[source]

Return a score between 0 and 1 for the goodness of a match.

This score is purely based on the relationship of the two strings and does not take the status of the reference into account.

Parameters

match (gilda.scorer.Match) – The Match object characterizing the relationship of the query and reference strings.

Returns

A match score between 0 and 1.

Return type

float

Term

class gilda.term.Term(norm_text, text, db, id, entry_name, status, source, organism=None, source_db=None, source_id=None)[source]

Bases: object

Represents a text entry corresponding to a grounded term.

norm_text

The normalized text corresponding to the text entry, used for lookups.

Type

str

text

The text entry itself.

Type

str

db

The database / name space corresponding to the grounded term.

Type

str

id

The identifier of the grounded term within the database / name space.

Type

str

entry_name

The standardized name corresponding to the grounded term.

Type

str

status

The relationship of the text entry to the grounded term, e.g., synonym.

Type

str

source

The source from which the term was obtained.

Type

str

organism

When the term represents a protein, this attribute provides the taxonomy code of the species for the protein. For non-proteins, not provided. Default: None

Type

Optional[str]

source_db

If the term’s db/id was mapped from a different, original db/id from a given source, this attribute provides the original db value before mapping.

Type

Optional[str]

source_id

If the term’s db/id was mapped from a different, original db/id from a given source, this attribute provides the original ID value before mapping.

Type

Optional[str]

get_curie()[source]

Get the compact URI for this term.

Return type

str

get_groundings()[source]

Return all groundings for this term, including from a mapped source.

Return type

Set[Tuple[str, str]]

Returns

A set of tuples representing the main grounding for this term, as well as any source grounding from which the main grounding was mapped.

get_namespaces()[source]

Return all namespaces for this term, including from a mapped source.

Return type

Set[str]

Returns

A set of strings including the main namespace for this term, as well as any source namespace from which the main grounding was mapped.
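The relationship between a term's main grounding and its mapped source grounding, as described for get_groundings() and get_namespaces(), can be sketched as follows. SketchTerm is a hypothetical class written for this illustration that follows the documented behavior; it is not Gilda's code, and the db/id values are placeholders.

```python
from dataclasses import dataclass
from typing import Optional, Set, Tuple


@dataclass
class SketchTerm:
    """Hypothetical stand-in with only the grounding-related fields."""
    db: str
    id: str
    source_db: Optional[str] = None
    source_id: Optional[str] = None

    def get_groundings(self) -> Set[Tuple[str, str]]:
        # Always include the main grounding.
        groundings = {(self.db, self.id)}
        # Add the original grounding if this term was mapped from one.
        if self.source_db and self.source_id:
            groundings.add((self.source_db, self.source_id))
        return groundings

    def get_namespaces(self) -> Set[str]:
        return {db for db, _ in self.get_groundings()}


plain = SketchTerm(db="DB_A", id="1")
mapped = SketchTerm(db="DB_A", id="1", source_db="DB_B", source_id="x9")
print(plain.get_groundings(), mapped.get_namespaces())
```

A term without a source grounding yields a single pair, while a mapped term yields two pairs and therefore two namespaces.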

to_json()[source]

Return the term serialized into a JSON dict.

to_list()[source]

Return the term serialized into a list of strings.

gilda.term.dump_terms(terms, fname)[source]

Dump a list of terms to a tsv.gz file.

Return type

None

Process

Module containing various string processing functions used for grounding.

gilda.process.dashes = ['−', '-', '‐', '‑', '‒', '–', '—', '―']

A list of all kinds of dashes

gilda.process.depluralize(word)[source]

Return the depluralized version of the word, along with a status flag.

Parameters

word (str) – The word which is to be depluralized.

Returns

The original word, if it is detected to be non-plural, or otherwise the depluralized version of the word, together with a status flag representing the detected pluralization status of the word. The flag is one of non_plural (e.g., BRAF), plural_oes (e.g., mosquitoes), plural_ies (e.g., antibodies), plural_es (e.g., switches), plural_cap_s (e.g., MAPKs), and plural_s (e.g., receptors).

Return type

list of str pairs

gilda.process.get_capitalization_pattern(word, beginning_of_sentence=False)[source]

Return the type of capitalization for the string.

Parameters
  • word (str) – The word whose capitalization is determined.

  • beginning_of_sentence (Optional[bool]) – True if the word appears at the beginning of a sentence. Default: False

Returns

The capitalization pattern of the given word. Returns one of the following: sentence_initial_cap, single_cap_letter, all_caps, all_lower, initial_cap, mixed.

Return type

str

gilda.process.normalize(s)[source]

Normalize white spaces, dashes and case of a given string.

Parameters

s (str) – The string to be normalized.

Returns

The normalized string.

Return type

str

gilda.process.remove_dashes(s)[source]

Remove all types of dashes in the given string.

Parameters

s (str) – The string from which all types of dashes should be removed.

Returns

The string from which dashes have been removed.

Return type

str

gilda.process.replace_dashes(s, rep='-')[source]

Replace all types of dashes in a given string with a given replacement.

Parameters
  • s (str) – The string in which all types of dashes should be replaced.

  • rep (Optional[str]) – The string with which dashes should be replaced. By default, the plain ASCII dash (-) is used.

Returns

The string in which dashes have been replaced.

Return type

str
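The dash handling described for remove_dashes and replace_dashes can be approximated with the standard library alone. The sketch below is an assumption-laden illustration, not Gilda's implementation: the function names carry a _sketch suffix to mark them as hypothetical, and the code points in DASHES are a reading of the gilda.process.dashes list shown earlier.

```python
import re

# Assumed code points for the dash characters listed in gilda.process.dashes.
DASHES = ["\u2212", "-", "\u2010", "\u2011", "\u2012", "\u2013", "\u2014", "\u2015"]
_DASH_RE = re.compile("|".join(re.escape(d) for d in DASHES))


def replace_dashes_sketch(s: str, rep: str = "-") -> str:
    """Replace every dash variant in s with rep."""
    # A function is used as the replacement so rep is inserted literally.
    return _DASH_RE.sub(lambda m: rep, s)


def remove_dashes_sketch(s: str) -> str:
    """Remove every dash variant from s."""
    return replace_dashes_sketch(s, rep="")


print(replace_dashes_sketch("MEK\u2013ERK"))
print(remove_dashes_sketch("IL\u20106"))
```

Mapping all variants to one character means that strings differing only in dash type compare equal afterwards.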

gilda.process.replace_greek_latin(s)[source]

Replace spelled-out Greek letters with their Latin characters.

gilda.process.replace_greek_spelled_out(s)[source]

Replace Greek Unicode characters with their spelled-out Latin names.

gilda.process.replace_greek_uni(s)[source]

Replace spelled-out Greek letters with their Unicode characters.

gilda.process.replace_unicode(s)[source]

Replace Unicode characters with their ASCII equivalents, except Greek letters.

Greek letters are handled separately and aren’t replaced in this context.

gilda.process.replace_whitespace(s, rep=' ')[source]

Replace whitespace runs of any length in the given string with a replacement.

Parameters
  • s (str) – The string in which whitespace runs of any length should be replaced.

  • rep (Optional[str]) – The string with which all whitespace should be replaced. By default, the plain ASCII space ( ) is used.

Returns

The string in which whitespaces have been replaced.

Return type

str
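One way to picture the behavior described for replace_whitespace is a single regular-expression substitution over runs of whitespace. The function below is a hypothetical sketch under that assumption, not Gilda's code.

```python
import re


def replace_whitespace_sketch(s: str, rep: str = " ") -> str:
    """Collapse each run of whitespace in s into a single rep."""
    # A function is used as the replacement so rep is inserted literally.
    return re.sub(r"\s+", lambda m: rep, s)


print(replace_whitespace_sketch("tumor  necrosis\t\nfactor"))
```

Note that leading and trailing runs are replaced too, so stripping, if wanted, is a separate step.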

gilda.process.split_preserve_tokens(s)[source]

Return split words of a string including the non-word tokens.

Parameters

s (str) – The string to be split.

Returns

The list of words in the string including the separator tokens, typically spaces and dashes.

Return type

list of str
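Splitting while keeping the separators can be done with a capturing group in re.split. The sketch below is one possible approach written for illustration; whether Gilda uses the same pattern is an assumption, and the _sketch suffix marks the function as hypothetical.

```python
import re
from typing import List


def split_preserve_tokens_sketch(s: str) -> List[str]:
    """Split s on non-word characters, keeping each separator as a token."""
    # The capturing group makes re.split return the separators as well.
    return re.split(r"(\W)", s)


tokens = split_preserve_tokens_sketch("MEK-ERK pathway")
print(tokens)
```

Because nothing is discarded, joining the tokens with an empty string gives back the input. Two adjacent separators produce an empty string between them in this sketch.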

Named Entity Recognition

Pandas Utilities

Utilities for Pandas.

gilda.pandas_utils.ground_df(df, source_column, *, target_column=None, grounder=None, **kwargs)[source]

Ground the elements of a column in a Pandas dataframe as CURIEs, in-place.

Parameters
  • df (DataFrame) – A pandas dataframe

  • source_column (Union[str, int]) – The column to ground. This column contains text corresponding to named entities’ labels or synonyms

  • target_column (Union[None, str, int]) – The column where to put the groundings (either a CURIE string, or None). It’s possible to create a new column when passing a string for this argument. If not given, will create a new column name like <source column>_grounded.

  • grounder (Optional[Grounder]) – A custom grounder. If none given, uses the built-in grounder.

  • kwargs – Keyword arguments passed to Grounder.ground(), could include context, organisms, or namespaces.

Return type

None

Examples

The following example shows how to use this function.

import pandas as pd
import gilda

url = "https://raw.githubusercontent.com/OBOAcademy/obook/master/docs/tutorial/linking_data/data.csv"
df = pd.read_csv(url)
gilda.ground_df(df, source_column="disease", target_column="disease_curie")

gilda.pandas_utils.ground_df_map(df, source_column, *, grounder=None, **kwargs)[source]

Ground the elements of a column in a Pandas dataframe as CURIEs.

Parameters
  • df (DataFrame) – A pandas dataframe

  • source_column (Union[str, int]) – The column to ground. This column contains text corresponding to named entities’ labels or synonyms

  • grounder (Optional[Grounder]) – A custom grounder. If none given, uses the built-in grounder.

  • kwargs – Keyword arguments passed to Grounder.ground(), could include context, organisms, or namespaces.

Returns

A pandas Series containing the grounded CURIE strings. It contains NaN where grounding was not successful or where the source cell was already NaN.

Return type

pandas.Series