Commit fe8a14830f9ee90bc20fbcf09cafff90686bce0d

Authored by Jean-Michel Garant
1 parent 6235bcc9
Exists in stable-0.3 and in 1 other branch master

Documentation

Showing 4 changed files with 252 additions and 67 deletions   Show diff stats
MANUAL.md
... ... @@ -4,19 +4,63 @@
4 4 **G4RNA SCREENER MANUAL**
5 5 =========================
6 6  
  7 +1. [Name](#name)
  8 +2. [Synopsis](#synopsis)
  9 +3. [Screen.py](#screen.py)
  10 + 1. [Usage](#screen-usage)
  11 + 2. [Description](#screen-description)
  12 + 3. [Arguments](#screen-arguments)
  13 + 4. [Fasta Example](#screen-fasta-example)
  14 +4. [Merge.py](#merge.py)
  15 + 1. [Usage](#merge-usage)
  16 + 2. [Description](#merge-description)
  17 + 3. [Arguments](#merge-arguments)
  18 + 4. [Example](#merge-example)
  19 +
7 20 ## **NAME**
8 21  
9 22 G4RNA Screener - The nucleic acid screener for RNA G-quadruplexes
10 23  
11 24 ## **SYNOPSIS**
12 25  
13   - ./screen.py [-?|--help] [-V|--version]
  26 +```
  27 +./screen.py [-h|--help] [-V|--version]
  28 +
  29 +./screen.py <path>[.fa|.fas|.fasta]
  30 + [-a|--ann <path>.pkl]
  31 + [-c|--columns <list>]
  32 + [-w|--window <int>]
  33 + [-s|--step <int>]
  34 + [-b|--bedgraph]
  35 + [-e|--error]
  36 + [-v|--verbose]
  37 +```
  38 +
  39 +```
  40 +./merge.py [-h|--help] [-V|--version]
  41 +
  42 +./merge.py <path>[.tsv|.csv|.txt]
  43 + [--cGcC [<float>]]
  44 + [--G4H [<float>]]
  45 + [--G4NN [<float>]]
  46 + [-w|--window <int>]
  47 + [-s|--step <int>]
  48 + [-a|--aggregate <str>]
  49 +```
  50 +
  51 +> Combination using pipeline
  52 +> ```./screen.py <path>.fa | ./merge.py -```
  53 +> example:
  54 +> ```./screen.py example.fasta | ./merge.py - --G4NN > example.tsv```
  55 +
  56 +**SCREEN.PY**
  57 +=============
14 58  
15   - ./screen.py [-a|--ann <path>.pkl] [-c|--columns <list>] [-w|--window <int>]
16   - [-s|--step <int>] [-b|--bedgraph]
17   - [-f|--fasta <path>.fas] [-e|--error] [-v|--verbose]
  59 +## **SCREN USAGE**
18 60  
19   -## **DESCRIPTION**
  61 + screen.py [-h] FASTA [-a ANN] [-w INT] [-s INT] [-c [...]] [-b] [-v] [-V] [-e]
  62 +
  63 +## **SCREEN DESCRIPTION**
20 64  
21 65 Score nucleic acid using an artificial neural network classifier (G4NN) that was
22 66 trained on sequences found in the [G4RNA database](http://scottgroup.med.
... ... @@ -24,27 +68,23 @@ usherbrooke.ca/G4RNA/). It also provides the previously described:
24 68 [G4Hunter score](https://www.ncbi.nlm.nih.gov/pubmed/26792894) and
25 69 [cG/cC score](https://www.ncbi.nlm.nih.gov/pubmed/24121682) if specified.
26 70  
27   -## **OPTIONS**
  71 +## **SCREEN ARGUMENTS**
28 72  
29   -**_-?, --help_**
  73 +**_FASTA Positional Argument_** = - (default STDIN)
30 74  
31   -> Show usage and exit.
  75 +> Path to the fasta file to analyze. Will support string value as long as it
  76 +respects the fasta format. Use "-" to feed standard input.
32 77  
33   -**_-V, --version_**
  78 +**_-h, --help_**
34 79  
35   -> Show version information and exit.
  80 +> Show usage and exit.
36 81  
37 82 **_-a, --ann_** = G4RNA_2016-11-07.pkl
38 83  
39 84 > Path to the pickled ANN (.pkl) which will provide the program a particular
40 85 pattern to evaluate each sequences or windows of sequences.
41 86  
42   -**_-f, --fasta_** = STDIN
43   -
44   -> Path to the fasta file to analyze. Will support string value as long as it
45   -respects the fasta format. Use "STDIN" to feed standard input to -f argument.
46   -
47   -**_-w, --window_**
  87 +**_-w, --window_** = 60
48 88  
49 89 > Length of the sliding window that is used to analyze long sequences.
50 90  
... ... @@ -108,7 +148,19 @@ gene,description,chromosome,strand,...
108 148 > + G4NN: _Score obtained through G4NN (Must be specified when
109 149 enumerating columns in list)_
110 150  
111   -###### Fasta example
  151 +**_-v, --verbose_**
  152 +
  153 +> Rudimental verbose
  154 +
  155 +**_-V, --version_**
  156 +
  157 +> Show version information and exit.
  158 +
  159 +**_-e, --error_**
  160 +
  161 +> Raise and display error and warnings
  162 +
  163 +###### Screen Fasta Examples
112 164  
113 165 Any fasta file can support columns: description,length,sequence,cGcC,G4H,G4NN
114 166 ```html
... ... @@ -151,3 +203,84 @@ tgacggctattatcgcatgatcGGCGGCGGCGGGGcggcggcgatatcgt
151 203 >ENST00000380152 chromosome:GRCh38:11:106073501:106100710:1
152 204 ACGTTTACGGCTATTCGTATGCTTGTGCTATTGCATGTACTG
153 205 ```
  206 +
  207 +**MERGE.PY**
  208 +============
  209 +
  210 +## **MERGE USAGE**
  211 +
  212 + merge.py [-h] TSV [--cGcC [FLOAT]] [--G4H [FLOAT]] [--G4NN [FLOAT]] [-w INT] [-s INT] [-a STR]
  213 +
  214 +## **MERGE DESCRIPTION**
  215 +
  216 +Transforms the tabular output of `screen.py` in order to merge overlapping
  217 +regions that are scored above threshold(s). The omission of using --cGcC, --G4H
  218 +and --G4NN argument wil merge all overlapping windows back into the original
  219 +sequences. The -w,--window and -s,--step arguments must be the same used for
  220 +the `screen.py`.
  221 +
  222 +## **MERGE ARGUMENTS**
  223 +
  224 +**_TSV Positional Argument_** = - (default STDIN)
  225 +
  226 +> Path to the tabular separated values file to analyze. Use "-" to feed
  227 +standard input.
  228 +
  229 +**_-h, --help_**
  230 +
  231 +> Show usage and exit.
  232 +
  233 +**_--cGcC_** = 4.5
  234 +
  235 +> Use consecutive G over consecutive C score to determine positive windows
  236 +above threshold and merge them.
  237 +
  238 +**_--G4H_** = 0.9
  239 +
  240 +> Use G4Hunter score to determine positive windows above threshold and merge
  241 +them.
  242 +
  243 +**_--G4NN_** = 0.5
  244 +
  245 +> Use G4 neural network score to determine positive windows above threshold and
  246 +merge them.
  247 +
  248 +**_-w, --window_** = 60
  249 +
  250 +> Length of the sliding window that was used to analyze long sequences.
  251 +
  252 +**_-a, --aggregation_** = list (only available with python 2.7.15rc1 +)
  253 +
  254 +> Scores of merged windows will be aggregated using an aggregation function:
  255 +(choose from 'max', 'min', 'median', 'mean', 'std', 'sem')
  256 +
  257 +**_-s, --step_** = 10
  258 +
  259 +> Step size that defines the overlap between the windows sequences.
  260 +
  261 +**_-v, --verbose_**
  262 +
  263 +> Show version information and exit.
  264 +
  265 +**_-e, --error_**
  266 +
  267 +> Raise and display error and warnings
  268 +
  269 +
  270 +###### Merge Examples
  271 +
  272 +`screen.py` output
  273 +```
  274 +./screen.py spinach.fas -w 60 -s 10
  275 + description sequence start cGcC G4H G4NN
  276 +1 Spinach aptamer (false negative example) GGACGCGACCGAAAUGGUGAAGGACGGGUCCAGUGCGAAACACGCACUGUUGAGUAGAGU 1 2.1875 0.3 0.209915966961
  277 +2 Spinach aptamer (false negative example) GAAAUGGUGAAGGACGGGUCCAGUGCGAAACACGCACUGUUGAGUAGAGUGUGAGCUCCG 11 2.2 0.283333333333 0.146798576587
  278 +3 Spinach aptamer (false negative example) AGGACGGGUCCAGUGCGAAACACGCACUGUUGAGUAGAGUGUGAGCUCCGUAACUGGUCG 21 1.88235294118 0.233333333333 0.154917456609
  279 +4 Spinach aptamer (false negative example) CAGUGCGAAACACGCACUGUUGAGUAGAGUGUGAGCUCCGUAACUGGUCGCGUC 31 1.26666666667 0.0555555555556 0.118000916901
  280 +```
  281 +
  282 +`screen.py | merge.py` output
  283 +```
  284 +./screen.py spinach.fas -w 60 -s 10 | ./merge.py - -w 60 -s 10
  285 +1 Spinach aptamer (false negative example) GGACGCGACCGAAAUGGUGAAGGACGGGUCCAGUGCGAAACACGCACUGUUGAGUAGAGUGUGAGCUCCGUAACUGGUCGCGUC 1 [2.1875, 2.2, 1.88235294118, 1.2666666666700002] [0.3, 0.283333333333, 0.233333333333, 0.0555555555556] [0.209915966961, 0.14679857658700002, 0.154917456609, 0.118000916901]
  286 +```
... ...
README.md
... ... @@ -34,7 +34,7 @@ library_name (recommended version)
34 34 ##### From environment
35 35  
36 36 ```bash
37   -python2.7 (2.7.12)
  37 +python2.7 (2.7.15rc1)
38 38 ```
39 39  
40 40 ##### From Python
... ...
merge.py
... ... @@ -34,6 +34,8 @@ class float_range(object):
34 34 Object that defines a range of float that is authorized or returns the
35 35 authorized range in error. Used to validate score threshold input by user.
36 36 """
  37 + # float range is a function that is needed to have any threshold of float in
  38 + # supported range of scores
37 39 def __init__(self, start, end):
38 40 self.start = start
39 41 self.end = end
... ... @@ -102,12 +104,14 @@ def merge_g4rna(df, window=60, step=10,
102 104 score_aggregation=list):
103 105 """
104 106 """
  107 + # filter windows with users thresholds
105 108 if cGcC:
106 109 df = df[ df.cGcC >= cGcC ].dropna()
107 110 if G4H:
108 111 df = df[ df.G4H >= G4H ].dropna()
109 112 if G4NN:
110 113 df = df[ df.G4NN >= G4NN ].dropna()
  114 + # aggregate functions are defined in this dictionnary
111 115 agg_fct = {
112 116 # 'description':"".join,
113 117 'gene_symbol':'max',
... ... @@ -134,25 +138,32 @@ def merge_g4rna(df, window=60, step=10,
134 138 # overlap is the length that sequential windows should share
135 139 overlap = window-step
136 140 pd.set_option('display.max_colwidth', -1)
  141 + # the merge is dependant on the sequence columns
137 142 if 'sequence' in df.columns:
138   - for ite in range(0,len(df)):
139   - print any(df.sequence.str[:(overlap+step*ite)].eq(
140   - df.sequence.str[step:].shift(1)))
  143 + for ite in range(0,len(df)):#max iterations is dataframe length
  144 + # check if any windows still needs some merging
141 145 if any(df.sequence.str[:(overlap+step*ite)].eq(
142 146 df.sequence.str[step:].shift(1))) is False:
143 147 break
  148 + # merge overlapping window by appending the [step] first
  149 + # nucleotides to the next window
144 150 df.loc[
145 151 df.sequence.str[:(overlap+step*ite)].eq(
146 152 df.sequence.str[step:].shift(1))
147 153 , 'sequence'] = df.sequence.str[:step].shift(1) + \
148 154 df.sequence.str[:]
  155 + # description column is used when available which should prevent the
  156 + # eventual problems caused by a user that submits two sequences that
  157 + # share some subsequences
149 158 if 'description' in df.columns:
150 159 df_grouped = df.groupby(
151 160 [df.description,df.sequence.str[:overlap]],
152 161 sort=False,
153 162 as_index=False)
154   - print "******",{k:agg_fct[k] for k in df.columns.drop(['description'])}
155   - return df_grouped.agg({k:agg_fct[k] for k in df.columns.drop(['description'])}).reindex(columns=df.columns)
  163 + # grouping the resulting windows using the first [:overlap] nts
  164 + return df_grouped.agg(
  165 + {k:agg_fct[k] for k in df.columns.drop(['description'])}
  166 + ).reindex(columns=df.columns)
156 167 else:
157 168 df_grouped = df.groupby(
158 169 df.sequence.str[-overlap:],
... ... @@ -211,6 +222,18 @@ def arguments():
211 222 help="Use G4NN score threshold to determine positive windows "\
212 223 "(default: 0.5)",
213 224 metavar="FLOAT")
  225 + # windows length
  226 + parser.add_argument("-w", "--window",
  227 + type=int,
  228 + default=60,
  229 + help="Windows length of input",
  230 + metavar="INT")
  231 + # step length
  232 + parser.add_argument("-s","--step",
  233 + type=int,
  234 + default=10,
  235 + help="Step lenght of input",
  236 + metavar="INT")
214 237 # aggregation function for scores
215 238 parser.add_argument("-a", "--aggregation",
216 239 #nargs="+",
... ... @@ -219,8 +242,19 @@ def arguments():
219 242 help="Aggregation function to pool scores of merged windows "\
220 243 "(default: list)",
221 244 metavar="STR")
  245 + parser.add_argument("-V", "--version",
  246 + action="version",
  247 + version="%(prog)s 0.3")
  248 + # useful for debug, not meant for users
  249 + parser.add_argument("-e", "--error",
  250 + action="store_true",
  251 + default=False,
  252 + help="Raise errors and exceptions")
  253 + # needed to to have default help display
  254 + if len(sys.argv[1:])==0:
  255 + parser.print_help()
  256 + parser.exit()
222 257 args = parser.parse_args()
223   -# args.aggregation = [ aggr if ( aggr != 'list' ) else list for aggr in args.aggregation ]
224 258 if args.cGcC == None:
225 259 args.cGcC = 4.5
226 260 if args.G4H == None:
... ... @@ -232,11 +266,21 @@ def arguments():
232 266 def main():
233 267 args = arguments()
234 268 g4rna_frame = pd.read_csv(args.tsv, sep='\t', index_col=0)
235   - merge_g4rna(g4rna_frame,
236   - 60, 10,
237   - args.cGcC, args.G4H, args.G4NN,
238   - args.aggregation).to_csv(
239   - path_or_buf=sys.stdout, sep='\t')
  269 + try:
  270 + merge_g4rna(g4rna_frame,
  271 + args.window, args.step,
  272 + args.cGcC, args.G4H, args.G4NN,
  273 + args.aggregation).to_csv(
  274 + path_or_buf=sys.stdout, sep='\t')
  275 + except:
  276 + # raise python error calls if -e --error is used
  277 + if args.error:
  278 + raise
  279 + # custom error message
  280 + else:
  281 + parser.print_usage()
  282 + sys.stderr.write(parser.prog+': error: '\
  283 + 'An option is missing, incorrect or not authorized\n')
240 284  
241 285 if __name__ == '__main__':
242 286 main()
... ...
screen.py
... ... @@ -239,7 +239,7 @@ def arguments():
239 239 parser.add_argument("-a", "--ann",
240 240 type=argparse.FileType('r'),
241 241 default=os.path.dirname(__file__)+"/G4RNA_2016-11-07.pkl",
242   - help="Supply a picled ANN (default: G4RNA_2016-11-07.pkl)")
  242 + help="Supply a pickled ANN (default: G4RNA_2016-11-07.pkl)")
243 243 # length of segmentation of long sequences into analysis windows
244 244 parser.add_argument("-w", "--window",
245 245 type=int,
... ... @@ -253,13 +253,6 @@ def arguments():
253 253 default=10,
254 254 help="Step length between windows (default: 10)",
255 255 metavar="INT")
256   - # bedgraph will generate the required header, compatible UCSC genome browser
257   - parser.add_argument("-b", "--bedgraph",
258   - action="store_true",
259   - default=False,
260   - help="Display output as BedGraph, user must provides columns")
261   - ## TODO use choices of three scores as bedgraph options which will
262   - ## select columns for the user, must include verifications
263 256 # columns to generate are provided by a space delimited list
264 257 parser.add_argument("-c", "--columns",
265 258 nargs="+",
... ... @@ -292,16 +285,61 @@ def arguments():
292 285 "cGcC G4H G4NN). "\
293 286 "To browse available columns use: -c list",
294 287 metavar="")
  288 + # bedgraph will generate the required header, compatible UCSC genome browser
  289 + parser.add_argument("-b", "--bedgraph",
  290 + action="store_true",
  291 + default=False,
  292 + help="Display output as BedGraph, user must provides columns")
  293 + ## TODO use choices of three scores as bedgraph options which will
  294 + ## select columns for the user, must include verifications
295 295 # verbose option is very rudimental
296 296 parser.add_argument("-v", "--verbose",
297 297 action="store_true",
298 298 default=False,
299 299 help="Verbose output with operations when completed")
  300 + parser.add_argument("-V", "--version",
  301 + action="version",
  302 + version="%(prog)s 0.3")
300 303 # useful for debug, not meant for users
301 304 parser.add_argument("-e", "--error",
302 305 action="store_true",
303 306 default=False,
304 307 help="Raise errors and exceptions")
  308 + # needed to to have default help display
  309 + if len(sys.argv[1:])==0:
  310 + parser.print_help()
  311 + parser.exit()
  312 + # adapted help for columns listing
  313 + if "list" in sys.argv:
  314 + splitted_help = parser.format_help().split(
  315 + " Columns to display (default: description sequence\n start cGcC G4H G4NN). To browse available columns use:\n -c list")
  316 + sys.stdout.write("\n\t".join([splitted_help[0],
  317 + "Available columns:",
  318 + "all \t\tAll of the following except description\n",
  319 + "description\t\tDescription as available in fasta (Default)",
  320 + "gene_symbol\t\tGene symbol",
  321 + "mrnaAcc \t\tRefSeq mRNA accession number",
  322 + "protAcc \t\tRefseq protein accession number",
  323 + "gene_stable_id\t\tEnsembl gene stable ID",
  324 + "transcript_stable_id\tEnsembl transcript stable ID",
  325 + "full_name \t\tGene full name (From HGNC)",
  326 + "HGNC_id \t\tHGNC numeric ID",
  327 + "identifier \t\tIdentifier",
  328 + "source \t\tSource of the data",
  329 + "genome_assembly\t\tGenome build version",
  330 + "chromosome \t\tChromosome",
  331 + "start \t\tStart position (Default)",
  332 + "end \t\tEnd position",
  333 + "strand \t\tCoding strand",
  334 + "range \t\tInitial chromosomic range",
  335 + "length \t\tLength of sequence analyzed",
  336 + "sequence \t\tSequence analyzed (Default)",
  337 + "cGcC \t\tcGcC score (Default)",
  338 + "G4H \t\tG4Hunter score (Default)",
  339 + "G4NN \t\tG4NN score of similitude (Default)",
  340 + " \t\t(must be specified to use ANN)",
  341 + splitted_help[1]]))
  342 + sys.exit()
305 343 return parser
306 344  
307 345 def main():
... ... @@ -311,36 +349,6 @@ def main():
311 349 parser = arguments()
312 350 args = parser.parse_args()
313 351 # custom help message to list columns choices
314   - if args.columns == ["list"]:
315   - splitted_help = parser.format_help().split(
316   - " Columns to display (default: description sequence\n start cGcC G4H G4NN). To browse available columns use:\n -c list")
317   - print("\n\t".join([splitted_help[0],
318   - "Available columns:",
319   - "description\t\tDescription as available in fasta (Default)",
320   - "all \t\tAll of the following except description\n",
321   - "gene_symbol\t\tGene symbol",
322   - "mrnaAcc \t\tRefSeq mRNA accession number",
323   - "protAcc \t\tRefseq protein accession number",
324   - "gene_stable_id\tEnsembl gene stable ID",
325   - "transcript_stable_id\tEnsembl transcript stable ID",
326   - "full_name \t\tGene full name (From HGNC)",
327   - "HGNC_id \t\tHGNC numeric ID",
328   - "identifier \t\tIdentifier",
329   - "source \t\tSource of the data",
330   - "genome_assembly\tGenome build version",
331   - "chromosome \t\tChromosome",
332   - "start \t\tStart position",
333   - "end \t\tEnd position",
334   - "strand \t\tCoding strand",
335   - "range \t\tInitial chromosomic range",
336   - "length \t\tLength of sequence analyzed",
337   - "sequence \t\tSequence analyzed",
338   - "cGcC \t\tcGcC score",
339   - "G4H \t\tG4Hunter score",
340   - "G4NN \t\tG4NN score of similitude",
341   - " \t\t(must be specified to use ANN)",
342   - splitted_help[1]]))
343   - sys.exit()
344 352 # restrictive verifications for bedgraph options
345 353 if args.bedgraph and (
346 354 len(args.columns) != 4\
... ...