Fix Funcotator transcript override functionality #9214
- Correctly remove the version suffix from transcripts that include an underscore character
- Allow ENSEMBL GTF files to use the --prefer-mane-transcripts option
GitHub Actions tests reported job failures from Actions build 15832447462
Thanks @kockan! just a couple very small comments, and a failing test, then should be good to go
src/main/java/org/broadinstitute/hellbender/tools/funcotator/FuncotatorUtils.java
...roadinstitute/hellbender/tools/funcotator/dataSources/gencode/GencodeFuncotationFactory.java
@@ -886,19 +886,25 @@ private List<GencodeFuncotation> createFuncotationsHelper(final VariantContext v
List<GencodeGtfTranscriptFeature> transcriptList;

// Only get basic transcripts if we're using data from Gencode:
if ( gtfFeature.getGtfSourceFileType().equals(GencodeGtfCodec.GTF_FILE_TYPE_STRING) ||
gtfFeature.getGtfSourceFileType().equals(EnsemblGtfCodec.GTF_FILE_TYPE_STRING)) {
if ( gtfFeature.getGtfSourceFileType().equals(GencodeGtfCodec.GTF_FILE_TYPE_STRING) ) {
why split this up like this instead of keeping in the same if block?
ohhh i see, nvm
looks good, thanks @kockan!
Some of the newer GENCODE GTFs (at the very least, GRCh37 liftovers of versions later than v19) contain transcript IDs whose version includes a secondary number separated by an underscore, e.g. "ENST00000xxxxxx.yy_zz". The regex currently in use matches only a single dot followed by one or more digits, so it fails to remove the full version suffix. This PR should fix that.
Personal note: I don't like regexes in production code anyway, so instead of replacing it with one that works for now, I opted for a cleaner and more maintainable (in my opinion) alternative.
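For illustration only, a regex-free approach could look roughly like the sketch below. It is not the code from this PR: the class name `TranscriptIdSketch` and the method `stripVersionSuffix` are hypothetical, and the sketch assumes that everything from the first dot onward is version information that can be dropped.

```java
// Hypothetical, self-contained sketch; not the actual FuncotatorUtils code.
public final class TranscriptIdSketch {

    private TranscriptIdSketch() {}

    /**
     * Returns the transcript ID truncated at the first '.' character.
     * Assumption: everything after the first dot is version information,
     * including any "_zz" secondary number.
     * IDs without a dot are returned unchanged.
     */
    public static String stripVersionSuffix(final String transcriptId) {
        final int dotIndex = transcriptId.indexOf('.');
        return dotIndex < 0 ? transcriptId : transcriptId.substring(0, dotIndex);
    }

    public static void main(final String[] args) {
        // Classic "ID.version" form.
        System.out.println(stripVersionSuffix("ENST00000123456.7"));   // ENST00000123456
        // Liftover form with a secondary number after an underscore.
        System.out.println(stripVersionSuffix("ENST00000123456.7_2")); // ENST00000123456
        // No version at all.
        System.out.println(stripVersionSuffix("ENST00000123456"));     // ENST00000123456
    }
}
```

Cutting at the first dot handles both the plain and the underscore form without needing a pattern that has to anticipate every suffix shape.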
In addition, the recently introduced --prefer-mane-transcripts option requires the file to be strictly of the GENCODE GTF format, but the hg19 data source with GENCODE v43 (backmapped to GRCh37) contains a GTF in the ENSEMBL GTF format (actually GFF3), making this option unusable. This PR should also fix that.