
    Ofm                     
   S SK r S SKrS SKrS SKrS SKrS SKJr  S SKJr  S SK	J
r
JrJrJr  S SKJr  S SKJr  S SKJrJrJr   " S S	\5      r " S
 S\5      rS rS rS rS rS rS rSS jrSS jrS r S r!S r"S r#S r$g)    N)reduce)ElementTree)FileSystemPathPointerPathPointerSeekableUnicodeStreamReaderZipFilePathPointer)slice_bounds)wordpunct_tokenize)AbstractLazySequenceLazyConcatenationLazySubsequencec                   z    \ rS rSrSrSS jr\" S SS9rS rS	 r	S
 r
S rS rS rS rS rS rS rS rS rSrg)StreamBackedCorpusView    a*  
A 'view' of a corpus file, which acts like a sequence of tokens:
it can be accessed by index, iterated over, etc.  However, the
tokens are only constructed as-needed -- the entire corpus is
never stored in memory at once.

The constructor to ``StreamBackedCorpusView`` takes two arguments:
a corpus fileid (specified as a string or as a ``PathPointer``);
and a block reader.  A "block reader" is a function that reads
zero or more tokens from a stream, and returns them as a list.  A
very simple example of a block reader is:

    >>> def simple_block_reader(stream):
    ...     return stream.readline().split()

This simple block reader reads a single line at a time, and
returns a single token (consisting of a string) for each
whitespace-separated substring on the line.
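
As a rough illustration of the contract described above, the sketch below
calls the same block reader on an in-memory stream.  The use of
``io.StringIO`` and the sample text are assumptions made for the example;
a real corpus view would pass its own file stream.

```python
import io


def simple_block_reader(stream):
    # One block = one line, split on whitespace.
    return stream.readline().split()


# Hypothetical two-line "corpus" held in memory for illustration.
stream = io.StringIO("the quick brown fox\njumps over\n")

first = simple_block_reader(stream)   # tokens from line 1
second = simple_block_reader(stream)  # tokens from line 2
at_eof = simple_block_reader(stream)  # nothing left to read
```

Each call consumes exactly one line, so successive calls return
successive blocks, and a call at end of file returns an empty list.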

When deciding how to define the block reader for a given
corpus, careful consideration should be given to the size of
blocks handled by the block reader.  Smaller block sizes will
increase the memory requirements of the corpus view's internal
data structures (by 2 integers per block).  On the other hand,
larger block sizes may decrease performance for random access to
the corpus.  (But note that larger block sizes will *not*
decrease performance for iteration.)

Internally, ``CorpusView`` maintains a partial mapping from token
index to file position, with one entry per block.  When a token
with a given index *i* is requested, the ``CorpusView`` constructs
it as follows:

  1. First, it searches the toknum/filepos mapping for the token
     index closest to (but less than or equal to) *i*.

  2. Then, starting at the file position corresponding to that
     index, it reads one block at a time using the block reader
     until it reaches the requested token.
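
The first step above can be sketched with the standard ``bisect``
module.  This is a minimal sketch, not the class's actual code: the
helper name ``find_start_block`` and the sample mapping values are
hypothetical, and the two lists stand in for ``_toknum`` and
``_filepos``.

```python
import bisect


def find_start_block(toknum, filepos, i):
    # Index of the last block whose first token index is <= i.
    block_index = bisect.bisect_right(toknum, i) - 1
    return block_index, toknum[block_index], filepos[block_index]


# Hypothetical mapping: three blocks starting at tokens 0, 4 and 9.
toknum = [0, 4, 9]
filepos = [0, 20, 47]

# Token 6 lives in the block that starts at token 4 (file position 20).
block_index, start_tok, start_pos = find_start_block(toknum, filepos, 6)
```

From ``start_pos`` the view would then read blocks forward until the
requested token is reached, as in step 2.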

The toknum/filepos mapping is created lazily: it is initially
empty, but every time a new block is read, the block's
initial token is added to the mapping.  (Thus, the toknum/filepos
map has one entry per block.)

In order to increase efficiency for random access patterns that
have high degrees of locality, the corpus view may cache one or
more blocks.

:note: Each ``CorpusView`` object internally maintains an open file
    object for its underlying corpus file.  This file should be
    automatically closed when the ``CorpusView`` is garbage collected,
    but if you wish to close it manually, use the ``close()``
    method.  If you access a ``CorpusView``'s items after it has been
    closed, the file object will be automatically re-opened.

:warning: If the contents of the file are modified during the
    lifetime of the ``CorpusView``, then the ``CorpusView``'s behavior
    is undefined.

:warning: If a unicode encoding is specified when constructing a
    ``CorpusView``, then the block reader may only call
    ``stream.seek()`` with offsets that have been returned by
    ``stream.tell()``; in particular, calling ``stream.seek()`` with
    relative offsets, or with offsets based on string lengths, may
    lead to incorrect behavior.
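
The reason for this restriction can be illustrated with a small sketch.
It uses the standard library's ``io.TextIOWrapper`` as a stand-in for
the stream a corpus view would supply (an assumption made for the
example); the sample text is also hypothetical.

```python
import io

# Two lines containing non-ASCII characters, encoded as UTF-8.
raw = io.BytesIO("café au lait\nthé vert\n".encode("utf-8"))
stream = io.TextIOWrapper(raw, encoding="utf-8")

first_line = stream.readline()
pos = stream.tell()            # offset obtained from tell(): safe to reuse
second_line = stream.readline()

stream.seek(pos)               # seek only to an offset tell() returned
reread = stream.readline()

# String length and encoded length differ, so an offset computed from
# len(first_line) would not be a reliable file position.
char_len = len(first_line)
byte_len = len(first_line.encode("utf-8"))
```

Seeking back to a value returned by ``tell()`` re-reads the same line,
whereas the character count and the byte count of the first line
disagree as soon as a multi-byte character appears.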

:ivar _block_reader: The function used to read
    a single block from the underlying file stream.
:ivar _toknum: A list containing the token index of each block
    that has been processed.  In particular, ``_toknum[i]`` is the
    token index of the first token in block ``i``.  Together
    with ``_filepos``, this forms a partial mapping between token
    indices and file positions.
:ivar _filepos: A list containing the file position of each block
    that has been processed.  In particular, ``_filepos[i]`` is the
    file position of the first character in block ``i``.  Together
    with ``_toknum``, this forms a partial mapping between token
    indices and file positions.
:ivar _stream: The stream used to access the underlying corpus file.
:ivar _len: The total number of tokens in the corpus, if known;
    or None, if the number of tokens is not yet known.
:ivar _eofpos: The character position of the last character in the
    file.  This is calculated when the corpus view is initialized,
    and is used to decide when the end of file has been reached.
:ivar _cache: A cache of the most recently read block.  It
   is encoded as a tuple (start_toknum, end_toknum, tokens), where
   start_toknum is the token index of the first token in the block;
   end_toknum is the token index of the first token not in the
   block; and tokens is a list of the tokens in the block.
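
A minimal sketch of how a cache tuple of this shape can be consulted,
assuming the half-open range described above.  The helper name
``cached_token`` and the sample cache are hypothetical and are not part
of the class.

```python
def cached_token(cache, i):
    # cache is (start_toknum, end_toknum, tokens); end_toknum is exclusive.
    start_toknum, end_toknum, tokens = cache
    if start_toknum <= i < end_toknum:
        return tokens[i - start_toknum]
    return None  # cache miss: the caller must read from the file


# Hypothetical cached block holding tokens 4..6.
cache = (4, 7, ["jumps", "over", "the"])

hit = cached_token(cache, 5)
miss = cached_token(cache, 7)
```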
Nc                    U(       a  X l         S/U l        U/U l        X@l        SU l        Xl        SU l        SU l         SU l          [        U R
                  [        5      (       a   U R
                  R                  5       U l        O0[        R                  " U R
                  5      R                  U l         SU l        g! [          a  n[#        SU< SU 35      UeSnAff = f)af  
Create a new corpus view, based on the file ``fileid``, and
read with ``block_reader``.  See the class documentation
for more information.

:param fileid: The path to the file that is read by this
    corpus view.  ``fileid`` can either be a string or a
    ``PathPointer``.

:param startpos: The file position at which the view will
    start reading.  This can be used to skip over preface
    sections.

:param encoding: The unicode encoding that should be used to
    read the file's contents.  If no encoding is specified,
    then the file's contents will be read as a non-unicode
    string (i.e., a str).
r   NzUnable to open or access z -- )r   N)
read_block_toknum_filepos	_encoding_len_fileid_stream_current_toknum_current_blocknum
isinstancer   	file_size_eofpososstatst_size	Exception
ValueError_cache)selffileidblock_readerstartposencodingexcs         9/usr/lib/python3/dist-packages/nltk/corpus/reader/util.py__init__StreamBackedCorpusView.__init__}   s    & *Os!
!	#	( "&	(	W$,,44#||557!wwt||4<< %  	W8
$seLMSVV	Ws   	>C  /C   
C"
CC"c                     U R                   $ N)r   r%   s    r+   <lambda>StreamBackedCorpusView.<lambda>   s    T\\    za
        The fileid of the file that is accessed by this view.

        :type: str or PathPointer)docc                     [        S5      e)z
Read a block from the input stream.

:return: a block of tokens from the input stream
:rtype: list(any)
:param stream: an input stream
:type stream: stream
zAbstract Method)NotImplementedError)r%   streams     r+   r   !StreamBackedCorpusView.read_block   s     ""344r3   c                 P   [        U R                  [        5      (       a+  U R                  R                  U R                  5      U l        gU R                  (       a0  [        [        U R                  S5      U R                  5      U l        g[        U R                  S5      U l        g)z
Open the file stream associated with this corpus view.  This
will be performed if any value is read from the view
while its file stream is closed.
rbN)r   r   r   openr   r   r   r0   s    r+   _openStreamBackedCorpusView._open   sj     dllK00<<,,T^^<DL^^6T\\4($..DL  d3DLr3   c                 `    U R                   b  U R                   R                  5         SU l         g)a;  
Close the file stream associated with this corpus view.  This
can be useful if you are worried about running out of file
handles (although the stream should automatically be closed
upon garbage collection of the corpus view).  If the corpus
view is accessed after it is closed, it will be automatically
re-opened.
N)r   closer0   s    r+   r?   StreamBackedCorpusView.close   s%     <<#LL r3   c                     U $ r/    r0   s    r+   	__enter__ StreamBackedCorpusView.__enter__   s    r3   c                 $    U R                  5         g r/   )r?   )r%   typevalue	tracebacks       r+   __exit__StreamBackedCorpusView.__exit__   s    

r3   c                 ~    U R                   c%  U R                  U R                  S   5       H  nM     U R                   $ Nr   )r   iterate_fromr   r%   toks     r+   __len__StreamBackedCorpusView.__len__   s9    99 ((b)9: ;yyr3   c                    [        U[        5      (       aU  [        X5      u  p#U R                  S   nXB::  a(  X0R                  S   ::  a  U R                  S   X$-
  X4-
   $ [	        XU5      $ US:  a  U[        U 5      -  nUS:  a  [        S5      eU R                  S   nXAs=::  a  U R                  S   :  a  O  OU R                  S   X-
     $  [        U R                  U5      5      $ ! [         a  n[        S5      UeS nAff = f)Nr         zindex out of range)
r   slicer	   r$   r   len
IndexErrornextrM   StopIteration)r%   istartstopoffsetes         r+   __getitem__"StreamBackedCorpusView.__getitem__   s    a&t/KE[[^F4;;q>#9{{1~ent}EE"455 1uSY1u !566[[^F+T[[^+{{1~aj11>D--a011  > !56A=>s   C* *
D4D  Dc              #     #    U R                   S   Us=::  a  U R                   S   :  a4  O  O1U R                   S   XR                   S   -
  S   H  nUv   US-  nM     XR                  S   :  aC  [        R                  " U R                  U5      S-
  nU R                  U   nU R                  U   nO6[        U R                  5      S-
  nU R                  S   nU R                  S   nU R                  c  U R                  5         U R                  S:X  a  SU l	        XPR                  :  Ga  U R                  R                  U5        X@l        X0l        U R                  U R                  5      n[        U[        [         ["        45      (       d   SU R                  R$                  -  5       e[        U5      nU R                  R'                  5       nX:  d    SU R                  R$                  U4-  5       eXDU-   [!        U5      4U l         X@R                  S   ::  d   eUS:  a  US-  nX@R                  S   :X  aM  XR                  S   :  d   eU R                  R)                  U5        U R                  R)                  XG-   5        O5XR                  U   :X  d   S5       eXG-   U R                  U   :X  d   S5       eXR                  :X  a	  XG-   U l	        U[+        SX-
  5      S   H  nUv   M	     XR                  ::  d   eXR                  :X  a  OXG-  nUnXPR                  :  a  GM  U R                  c   eU R-                  5         g 7f)	Nr   rS   rT   r   z.block reader %s() should return list or tuple.z=block reader %s() should consume at least 1 byte (filepos=%d)z*inconsistent block reader (num chars read)z/inconsistent block reader (num tokens returned))r$   r   bisectbisect_rightr   rV   r   r<   r   r   seekr   r   r   r   tuplelistr   __name__tellappendmaxr?   )	r%   	start_tokrO   block_indextoknumfilepostokensnum_toksnew_fileposs	            r+   rM   #StreamBackedCorpusView.iterate_from  s    ;;q>Y7Q7{{1~i++a.&@&BC	Q	 D ||B'' --dllIFJK\\+.FmmK0Gdll+a/K\\"%FmmB'G <<JJL <<1DI $LLg&#) %0"__T\\2Ffud4H&IJJ @//**+J 6{H,,++-K%N((R % "H#4d6lCDK \\"----!|q \\"--&r)::::MM((5LL''(9: $}}['AADCDA )T\\+-FFIHIF ll*"-	 c!Y%78:;	 < ,,...ll*F!Gk $p yy$$$ 	

s   L'M, Mc                     [        X/5      $ r/   concatr%   others     r+   __add__StreamBackedCorpusView.__add__l  s    tm$$r3   c                     [        X/5      $ r/   rt   rv   s     r+   __radd__StreamBackedCorpusView.__radd__o  s    um$$r3   c                      [        U /U-  5      $ r/   rt   r%   counts     r+   __mul__StreamBackedCorpusView.__mul__r      tfun%%r3   c                      [        U /U-  5      $ r/   rt   r~   s     r+   __rmul__StreamBackedCorpusView.__rmul__u  r   r3   )r$   r   r   r   r   r   r   r   r   r   r   )Nr   utf8)rg   
__module____qualname____firstlineno____doc__r,   propertyr&   r   r<   r?   rC   rI   rP   r_   rM   rx   r{   r   r   __static_attributes__rB   r3   r+   r   r       sa    Zx8%t !%F	54>6Yz%%&&r3   r   c                   0    \ rS rSrSrS rS rS rS rSr	g)	ConcatenatedCorpusViewiy  z
A 'view' of a corpus file that joins together one or more
``StreamBackedCorpusViews<StreamBackedCorpusView>``.  At most
one file handle is left open at any time.
c                 2    Xl          S/U l         S U l        g )Nr   )_pieces_offsets_open_piece)r%   corpus_viewss     r+   r,   ConcatenatedCorpusView.__init__  s,    #	 	>  	Jr3   c                     [        U R                  5      [        U R                  5      ::  a%  U R                  U R                  S   5       H  nM     U R                  S   $ rL   )rV   r   r   rM   rN   s     r+   rP   ConcatenatedCorpusView.__len__  sN    t}}T\\!22((r):; < }}R  r3   c                 J    U R                    H  nUR                  5         M     g r/   )r   r?   )r%   pieces     r+   r?   ConcatenatedCorpusView.close  s    \\EKKM "r3   c              #   j  #    [         R                  " U R                  U5      S-
  nU[        U R                  5      :  a  U R                  U   nU R                  U   nU R
                  ULa-  U R
                  b  U R
                  R                  5         X@l        UR                  [        SX-
  5      5       S h  vN   US-   [        U R                  5      :X  a4  U R                  R                  U R                  S   [        U5      -   5        US-  nU[        U R                  5      :  a  M  g g  Nv7f)NrS   r   r   )
rb   rc   r   rV   r   r   r?   rM   rj   ri   )r%   rk   piecenumr]   r   s        r+   rM   #ConcatenatedCorpusView.iterate_from  s     &&t}}i@1DT\\**]]8,FLL*E u,##/$$**,#(  ))#a1C*DEEE !|s4==11$$T]]2%6U%CD MH% T\\** Fs   B8D3:D1;A2D3/D3)r   r   r   N)
rg   r   r   r   r   r,   rP   r?   rM   r   rB   r3   r+   r   r   y  s    J!r3   r   c                 "   [        U 5      S:X  a  U S   $ [        U 5      S:X  a  [        S5      eU  Vs1 s H  oR                  iM     nn[        S U  5       5      (       a  SR	                  U 5      $ U H   n[        U[        [        45      (       a  M     O   [        U 5      $ U H  n[        U[        5      (       a  M    O   [        U 5      $ [        U5      S:X  a  [        U5      S   n[        U[        5      (       a  [        S U / 5      $ [        U[        5      (       a  [        S U S5      $ [        R                  " U5      (       a2  [        R                  " S	5      nU  H  nUR!                  U5        M     U$ [        S
U-  5      es  snf )z
Concatenate together the contents of multiple documents from a
single corpus, using an appropriate concatenation function.  This
utility function is used by corpus readers when the user requests
more than one document at a time.
rS   r   z%concat() expects at least one object!c              3   B   #    U  H  n[        U[        5      v   M     g 7fr/   )r   str).0r4   s     r+   	<genexpr>concat.<locals>.<genexpr>  s     
04C:c34s    c                 
    X-   $ r/   rB   abs     r+   r1   concat.<locals>.<lambda>      r3   c                 
    X-   $ r/   rB   r   s     r+   r1   r     r   r3   rB   	documentsz'Don't know how to concatenate types: %r)rV   r#   	__class__alljoin
issubclassr   r   r   r   rf   r   re   r   	iselementElementri   )docsdtypestypxmltreer4   s         r+   ru   ru     si    4yA~Aw
4yA~@AA"&'$Q[[$E' 
04
000wwt} # 68NOPP  &d++ #344  !&& 5zQ5k!nc4  -b99c5!!-b99  %%!))+6Gs# N >F
GGK (s   Fc                     / n[        S5       H0  nUR                  U R                  5       R                  5       5        M2     U$ N   )rangeextendreadlinesplitr7   toksrZ   s      r+   read_whitespace_blockr     s6    D2YFOO%++-. Kr3   c                 ~    / n[        S5       H+  nUR                  [        U R                  5       5      5        M-     U$ r   )r   r   r
   r   r   s      r+   read_wordpunct_blockr     s3    D2Y&v'89: Kr3   c                     / n[        S5       H>  nU R                  5       nU(       d  Us  $ UR                  UR                  S5      5        M@     U$ )Nr   
)r   r   ri   rstrip)r7   r   rZ   lines       r+   read_line_blockr     sE    D2Y KDKK%&	 
 Kr3   c                     Sn U R                  5       nU(       d  U(       a  U/$ / $ U(       a   UR                  5       (       d  U(       a  U/$ OX-  nMP  )Nr   )r   stripr7   sr   s      r+   read_blankline_blockr     sR    
A
 s
	$**,,s
  IA r3   c                     Sn U R                  5       nUS   S:X  d  US   S:X  d	  US S S:X  a  M.  U(       d  U(       a  U/$ / $ X-  n[        R                  " SU5      b  U/$ Ma  )Nr   r   =r   rT   z
z^\d+-\d+)r   rematchr   s      r+   read_alignedsent_blockr     ss    
A
 7c>T!W_RaF0Bs
	 IAxxT*6s
 r3   c                     U R                  5       nU(       d  / $ [        R                  " X5      (       a  OM7  U/n U R                  5       nU R                  5       nU(       d  SR	                  U5      /$ Ub-  [        R                  " X#5      (       a  SR	                  U5      /$ Uc>  [        R                  " X5      (       a#  U R                  U5        SR	                  U5      /$ UR                  U5        M  )a  
Read a sequence of tokens from a stream, where tokens begin with
lines that match ``start_re``.  If ``end_re`` is specified, then
tokens end with lines that match ``end_re``; otherwise, tokens end
whenever the next line matching ``start_re`` or EOF is found.
r   )r   r   r   rh   r   rd   ri   )r7   start_reend_rer   linesoldposs         r+   read_regexp_blockr   )  s      I88H##  FE
 GGEN##"((6"8"8GGEN## >bhhx66KKGGEN##T r3   c                 r   U R                  5       nU R                  U5      n[        U SS5      nUc  [        U[        5      (       d   eUS;  a  SSKnUR                  SU-  5        U(       a-  [        R                  " S[        R                  " U5      -  5      n  U(       a.  X@R                  5       -  n[        R                  " W[        U5      n[        U5      u  p[        R                  " S5      R                  XI5      R                  5       n	Uc  U R!                  X9-   5        U$ U R!                  U[#        USU	 R%                  U5      5      -   5        U$ ! [&         aQ  n
U
R(                  S   S:X  a8  U R                  U5      nU(       a
  XK-  n Sn
A
M  UR+                  5       /s Sn
A
$ e Sn
A
ff = f)	a  
Read a sequence of s-expressions from the stream, and leave the
stream's file position at the end of the last complete s-expression
read.  This function will always return at least one s-expression,
unless there are no more s-expressions in the file.

If the file ends in the middle of an s-expression, then that
incomplete s-expression is returned when the end of the file is
reached.

:param block_size: The default block size for reading.  If an
    s-expression is longer than one block, then more than one
    block will be read.
:param comment_char: A character that marks comments.  Any lines
    that begin with this character will be stripped out.
    (If spaces or tabs precede the comment character, then the
    line will not be stripped.)
r)   N)Nzutf-8r   zAParsing may fail, depending on the properties of the %s encoding!z
(?m)^%s.*$z\s*Block too small)rh   readgetattrr   r   warningswarnr   compileescaper   sub
_sub_space_parse_sexpr_blocksearchendrd   rV   encoder#   argsr   )r7   
block_sizecomment_charr[   blockr)   r   COMMENTro   r]   r^   
next_blocks               r+   read_sexpr_blockr   L  s   & KKMEKK
#Evz40H:eS#9#999&"$,-	
 **\BIIl,CCD
	
 **w
E:/6NFZZ'..u=AACF EN+
 M ECgv(=(=h(G$HHI M 
	vvay--#[[4
'E "KKM?*
	s1   BE *0E 
F6%/F1F1*F60F11F6c                 J    SU R                  5       U R                  5       -
  -  $ )znHelper function: given a regexp match, return a string of
spaces that's the same length as the matched string. )r   r[   )ms    r+   r   r     s      !%%'AGGI%&&r3   c                    / nS=p#U[        U 5      :  GaD  [        R                  " S5      R                  X5      nU(       d  X4$ UR	                  5       nUR                  5       S:w  aR  [        R                  " S5      R                  X5      nU(       a  UR	                  5       nOU(       a  X4$ [        S5      eSn[        R                  " S5      R                  X5       H:  nUR                  5       S:X  a  US-  nOUS-  nUS:X  d  M*  UR                  5       n  O   U(       a  X4$ [        S5      eUR                  XU 5        U[        U 5      :  a  GMD  X4$ )Nr   z\S(z[\s(]r   z[()]rS   )
rV   r   r   r   r[   groupr#   finditerr   ri   )r   ro   r[   r   r   m2nestings          r+   r   r     s6   FOE
E

JJu$$U0;	 779H%,,U:Bhhj!;& !233 GZZ(11%?779#qLGqLGa<%%'C @ !;& !233e#&'C E

F ;r3   c           
      f   [        U [        5      (       d  [        S5      eUS-  n[        U [        5      (       a  U R                  R                  5        Vs/ s H3  nUR                  S5      (       a  M  U[        U R                  5      S  PM5     nnU Vs/ s H"  n[        R                  " X5      (       d  M   UPM$     nn[        U5      $ [        U [        5      (       a  / n[        R                  " U R                  5       H  u  pVnSR!                  S [#        U R                  U5       5       5      nUU Vs/ s H'  n[        R                  " XU-   5      (       d  M#  Xx-   PM)     sn-  nSU;   d  Mr  UR%                  S5        M     [        U5      $ ['        SU -  5      es  snf s  snf s  snf )Nz+find_corpus_fileids: expected a PathPointer$/r   c              3   ,   #    U  H
  nS U-  v   M     g7f)z%s/NrB   )r   ps     r+   r   &find_corpus_fileids.<locals>.<genexpr>  s     O0N1UQY0Ns   z.svnzDon't know how to handle %r)r   r   	TypeErrorr   zipfilenamelistendswithrV   entryr   r   sortedr   r   walkpathr   
_path_fromremoveAssertionError)	rootregexpnamefileidsitemsdirnamesubdirsprefixr&   s	            r+   find_corpus_fileidsr    sz   dK((EFF
cMF $*++ --/
/==% $DTZZ"#/ 	 

 #*D'$RXXf-C'De} 
D/	0	0)+);%GgWWO
499g0NOOF%%F88FVO4  % E  v& *< e} :TABB3

 Es$   F$5F$F):F);"F.!F.c                 h   [         R                  R                  U 5      S   S:X  a"  [         R                  R                  U 5      S   n / nX:w  ab  [         R                  R                  U5      u  pUR                  SU5        [         R                  R                  U5      S   U:w  d   eX:w  a  Mb  U$ )NrS   r   r   )r   r  r   insert)parentchildr  r  s       r+   r  r    s    	ww}}VQ2%v&q)D
/u-Awww}}U#A&%/// / Kr3   c                     Sn U R                  5       n[        R                  " SU5      (       a  UR                  5       (       a  U/$ O$US:X  a  UR                  5       (       a  U/$ / $ X-  nMk  )Nr   z======+\s*$)r   r   r   r   )r7   parar   s      r+   !tagged_treebank_para_block_readerr    sl    D
 88ND))zz||v  RZzz||v	 LD r3   r/   )i @  N)%rb   r   pickler   tempfile	functoolsr   	xml.etreer   	nltk.datar   r   r   r   nltk.internalsr	   nltk.tokenizer
   	nltk.utilr   r   r   r   r   ru   r   r   r   r   r   r   r   r   r   r  r  r  rB   r3   r+   <module>r     s     	  	   !  ( , N NV&1 V&r
61 6r1Hr&& FCL''^!CH r3   