-
Notifications
You must be signed in to change notification settings - Fork 2
Populate and surface post-mapped HGVS expressions and VEP functional consequence, and surface gnomAD AF #553
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
bencap
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on the scripts, and let me know if you have any questions about refactoring to share these methods with a worker process.
src/mavedb/lib/score_sets.py
Outdated
| value = str(mapping.hgvs_g) if mapping and mapping.hgvs_g else "" | ||
| elif column_key == "post_mapped_hgvs_p": | ||
| hgvs_str = get_hgvs_from_post_mapped(mapping.post_mapped) if mapping and mapping.post_mapped else None | ||
| if hgvs_str is not None and is_hgvs_p(hgvs_str): | ||
| value = hgvs_str | ||
| value = str(mapping.hgvs_p) if mapping and mapping.hgvs_p else "" | ||
| elif column_key == "post_mapped_hgvs_c": | ||
| value = str(mapping.hgvs_c) if mapping and mapping.hgvs_c else "" | ||
| elif column_key == "post_mapped_hgvs_at_assay_level": | ||
| value = str(mapping.hgvs_assay_level) if mapping and mapping.hgvs_assay_level else "" | ||
| elif column_key == "vep_functional_consequence": | ||
| vep_functional_consequence = mapping.vep_functional_consequence if mapping else None | ||
| if vep_functional_consequence is not None: | ||
| value = vep_functional_consequence | ||
| else: | ||
| value = "" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of the empty string to represent null values here we should use the na_rep.
|
|
||
| from mavedb.scripts.environment import script_environment, with_database_session | ||
|
|
||
| CLINGEN_API_URL = "https://reg.test.genome.network/allele" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should import this from /src/mavedb/lib/clingen, and should also probably change that one to not be the test site (https://ldh.genome.network/ldh/srvc).
Probably reworking the environment variable CAR_SUBMISSION_ENDPOINT would be the best thing to do, so we can have the URL be more configurable and use the test one in dev environments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could just add an issue for reworking the env var though, to keep this task clean.
| ) | ||
|
|
||
| # for variant_urn, post_mapped, clingen_allele_id in variant_info: | ||
| for variant_urn, mapped_variant in variant_info: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since score sets may have a lot of variants and it can be frustrating to invoke the routine and not have insight into how far we've gotten, it may be worthwhile to convert that variant_info result into a list and log our progress every 10% of variants.
| for variant_urn, mapped_variant in variant_info: | |
| variant_info_list = variant_info.all() | |
| num_variants = len(variant_info_list) | |
| # for variant_urn, post_mapped, clingen_allele_id in variant_info: | |
| for v_idx, (variant_urn, mapped_variant) in enumerate(variant_info_list): | |
| if (v_idx + 1) % ((num_variants + 9) // 10) == 0: | |
| logger.info( | |
| f"Processing variant {v_idx+1}/{num_variants} ({variant_urn}) for score set {score_set.urn} ({idx+1}/{len(urns)})." | |
| ) |
| hgvs_p: Optional[str] = None | ||
|
|
||
| # NOTE: if no clingen allele id, could consider searching clingen using hgvs_assay_level. for now, skipping variant if no clingen allele id in db | ||
| # TODO skipping multi-variants for now |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add TODO as issue.
| hgvs_p = allele["hgvs"][0] | ||
| break | ||
|
|
||
| # TODO should we check that assay level hgvs mtches either g. or p.? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could be a good check if we want to fail the population if they don't match. If we are fine populating when they don't match, we can skip it.
| queue = [] | ||
| variant_map = {} | ||
| for mapped_variant in mapped_variants: | ||
| hgvs_string = mapped_variant.post_mapped.get("expressions", {})[0].get("value") # type: ignore | ||
| if not hgvs_string: | ||
| logger.warning(f"No HGVS string found in post_mapped for variant {mapped_variant.id}.") | ||
| continue | ||
| queue.append(hgvs_string) | ||
| variant_map[hgvs_string] = mapped_variant | ||
|
|
||
| if len(queue) == 200: | ||
| consequences = get_functional_consequence(queue) | ||
| for hgvs, consequence in consequences.items(): | ||
| mapped_variant = variant_map[hgvs] | ||
| if consequence: | ||
| mapped_variant.vep_functional_consequence = consequence | ||
| mapped_variant.vep_access_date = date.today() | ||
| db.add(mapped_variant) | ||
| else: | ||
| logger.warning(f"Could not retrieve functional consequence for HGVS {hgvs}.") | ||
| db.commit() | ||
| queue.clear() | ||
| variant_map.clear() | ||
|
|
||
| # Process any remaining variants in the queue | ||
| if queue: | ||
| consequences = get_functional_consequence(queue) | ||
| for hgvs, consequence in consequences.items(): | ||
| mapped_variant = variant_map[hgvs] | ||
| if consequence: | ||
| mapped_variant.vep_functional_consequence = consequence | ||
| mapped_variant.vep_access_date = date.today() | ||
| db.add(mapped_variant) | ||
| else: | ||
| logger.warning(f"Could not retrieve functional consequence for HGVS {hgvs}.") | ||
| db.commit() | ||
|
|
||
| except Exception as e: | ||
| logger.error( | ||
| f"Failed to populate functional consequence predictions for score set {score_set.urn}: {str(e)}" | ||
| ) | ||
| db.rollback() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could simplify this queue logic by building the strings up front and then batching the list with the batched function from mavedb/lib/utils.py.
In pseudocode, you would:
hgvs_strings = []
for mv in mapped_variants:
append hgvs to hgvs_strings, warn on missing
for hgvs_batch in batched(hgvs_strings, 200):
get_functional_consequence(hgvs_batch)
existing logic ...
Implemented like this, we don't have to handle a half filled queue or worry about checking the queue length at all. We just generate the list of items we'd like to annotate, batch them, and handle all requests and results identically.
5422a27 to
4fdad00
Compare
857b15c to
21a771f
Compare
No description provided.