|
1 :mod:`difflib` --- Helpers for computing deltas |
|
2 =============================================== |
|
3 |
|
4 .. module:: difflib |
|
5 :synopsis: Helpers for computing differences between objects. |
|
6 .. moduleauthor:: Tim Peters <tim_one@users.sourceforge.net> |
|
7 .. sectionauthor:: Tim Peters <tim_one@users.sourceforge.net> |
|
8 .. Markup by Fred L. Drake, Jr. <fdrake@acm.org> |
|
9 |
|
10 .. testsetup:: |
|
11 |
|
12 import sys |
|
13 from difflib import * |
|
14 |
|
15 .. versionadded:: 2.1 |
|
16 |
|
17 This module provides classes and functions for comparing sequences. It |
|
18 can be used for example, for comparing files, and can produce difference |
|
19 information in various formats, including HTML and context and unified |
|
20 diffs. For comparing directories and files, see also, the :mod:`filecmp` module. |
|
21 |
|
22 .. class:: SequenceMatcher |
|
23 |
|
24 This is a flexible class for comparing pairs of sequences of any type, so long |
|
25 as the sequence elements are :term:`hashable`. The basic algorithm predates, and is a |
|
26 little fancier than, an algorithm published in the late 1980's by Ratcliff and |
|
27 Obershelp under the hyperbolic name "gestalt pattern matching." The idea is to |
|
28 find the longest contiguous matching subsequence that contains no "junk" |
|
29 elements (the Ratcliff and Obershelp algorithm doesn't address junk). The same |
|
30 idea is then applied recursively to the pieces of the sequences to the left and |
|
31 to the right of the matching subsequence. This does not yield minimal edit |
|
32 sequences, but does tend to yield matches that "look right" to people. |
|
33 |
|
34 **Timing:** The basic Ratcliff-Obershelp algorithm is cubic time in the worst |
|
35 case and quadratic time in the expected case. :class:`SequenceMatcher` is |
|
36 quadratic time for the worst case and has expected-case behavior dependent in a |
|
37 complicated way on how many elements the sequences have in common; best case |
|
38 time is linear. |
|
39 |
|
40 |
|
41 .. class:: Differ |
|
42 |
|
43 This is a class for comparing sequences of lines of text, and producing |
|
44 human-readable differences or deltas. Differ uses :class:`SequenceMatcher` |
|
45 both to compare sequences of lines, and to compare sequences of characters |
|
46 within similar (near-matching) lines. |
|
47 |
|
48 Each line of a :class:`Differ` delta begins with a two-letter code: |
|
49 |
|
50 +----------+-------------------------------------------+ |
|
51 | Code | Meaning | |
|
52 +==========+===========================================+ |
|
53 | ``'- '`` | line unique to sequence 1 | |
|
54 +----------+-------------------------------------------+ |
|
55 | ``'+ '`` | line unique to sequence 2 | |
|
56 +----------+-------------------------------------------+ |
|
57 | ``' '`` | line common to both sequences | |
|
58 +----------+-------------------------------------------+ |
|
59 | ``'? '`` | line not present in either input sequence | |
|
60 +----------+-------------------------------------------+ |
|
61 |
|
62 Lines beginning with '``?``' attempt to guide the eye to intraline differences, |
|
63 and were not present in either input sequence. These lines can be confusing if |
|
64 the sequences contain tab characters. |
|
65 |
|
66 |
|
67 .. class:: HtmlDiff |
|
68 |
|
69 This class can be used to create an HTML table (or a complete HTML file |
|
70 containing the table) showing a side by side, line by line comparison of text |
|
71 with inter-line and intra-line change highlights. The table can be generated in |
|
72 either full or contextual difference mode. |
|
73 |
|
74 The constructor for this class is: |
|
75 |
|
76 |
|
77 .. function:: __init__([tabsize][, wrapcolumn][, linejunk][, charjunk]) |
|
78 |
|
79 Initializes instance of :class:`HtmlDiff`. |
|
80 |
|
81 *tabsize* is an optional keyword argument to specify tab stop spacing and |
|
82 defaults to ``8``. |
|
83 |
|
84 *wrapcolumn* is an optional keyword to specify column number where lines are |
|
85 broken and wrapped, defaults to ``None`` where lines are not wrapped. |
|
86 |
|
87 *linejunk* and *charjunk* are optional keyword arguments passed into ``ndiff()`` |
|
88 (used by :class:`HtmlDiff` to generate the side by side HTML differences). See |
|
89 ``ndiff()`` documentation for argument default values and descriptions. |
|
90 |
|
91 The following methods are public: |
|
92 |
|
93 |
|
94 .. function:: make_file(fromlines, tolines [, fromdesc][, todesc][, context][, numlines]) |
|
95 |
|
96 Compares *fromlines* and *tolines* (lists of strings) and returns a string which |
|
97 is a complete HTML file containing a table showing line by line differences with |
|
98 inter-line and intra-line changes highlighted. |
|
99 |
|
100 *fromdesc* and *todesc* are optional keyword arguments to specify from/to file |
|
101 column header strings (both default to an empty string). |
|
102 |
|
103 *context* and *numlines* are both optional keyword arguments. Set *context* to |
|
104 ``True`` when contextual differences are to be shown, else the default is |
|
105 ``False`` to show the full files. *numlines* defaults to ``5``. When *context* |
|
106 is ``True`` *numlines* controls the number of context lines which surround the |
|
107 difference highlights. When *context* is ``False`` *numlines* controls the |
|
108 number of lines which are shown before a difference highlight when using the |
|
109 "next" hyperlinks (setting to zero would cause the "next" hyperlinks to place |
|
110 the next difference highlight at the top of the browser without any leading |
|
111 context). |
|
112 |
|
113 |
|
114 .. function:: make_table(fromlines, tolines [, fromdesc][, todesc][, context][, numlines]) |
|
115 |
|
116 Compares *fromlines* and *tolines* (lists of strings) and returns a string which |
|
117 is a complete HTML table showing line by line differences with inter-line and |
|
118 intra-line changes highlighted. |
|
119 |
|
120 The arguments for this method are the same as those for the :meth:`make_file` |
|
121 method. |
|
122 |
|
123 :file:`Tools/scripts/diff.py` is a command-line front-end to this class and |
|
124 contains a good example of its use. |
|
125 |
|
126 .. versionadded:: 2.4 |
|
127 |
|
128 |
|
129 .. function:: context_diff(a, b[, fromfile][, tofile][, fromfiledate][, tofiledate][, n][, lineterm]) |
|
130 |
|
131 Compare *a* and *b* (lists of strings); return a delta (a :term:`generator` |
|
132 generating the delta lines) in context diff format. |
|
133 |
|
134 Context diffs are a compact way of showing just the lines that have changed plus |
|
135 a few lines of context. The changes are shown in a before/after style. The |
|
136 number of context lines is set by *n* which defaults to three. |
|
137 |
|
138 By default, the diff control lines (those with ``***`` or ``---``) are created |
|
139 with a trailing newline. This is helpful so that inputs created from |
|
140 :func:`file.readlines` result in diffs that are suitable for use with |
|
141 :func:`file.writelines` since both the inputs and outputs have trailing |
|
142 newlines. |
|
143 |
|
144 For inputs that do not have trailing newlines, set the *lineterm* argument to |
|
145 ``""`` so that the output will be uniformly newline free. |
|
146 |
|
147 The context diff format normally has a header for filenames and modification |
|
148 times. Any or all of these may be specified using strings for *fromfile*, |
|
149 *tofile*, *fromfiledate*, and *tofiledate*. The modification times are normally |
|
150 expressed in the format returned by :func:`time.ctime`. If not specified, the |
|
151 strings default to blanks. |
|
152 |
|
153 >>> s1 = ['bacon\n', 'eggs\n', 'ham\n', 'guido\n'] |
|
154 >>> s2 = ['python\n', 'eggy\n', 'hamster\n', 'guido\n'] |
|
155 >>> for line in context_diff(s1, s2, fromfile='before.py', tofile='after.py'): |
|
156 ... sys.stdout.write(line) # doctest: +NORMALIZE_WHITESPACE |
|
157 *** before.py |
|
158 --- after.py |
|
159 *************** |
|
160 *** 1,4 **** |
|
161 ! bacon |
|
162 ! eggs |
|
163 ! ham |
|
164 guido |
|
165 --- 1,4 ---- |
|
166 ! python |
|
167 ! eggy |
|
168 ! hamster |
|
169 guido |
|
170 |
|
171 See :ref:`difflib-interface` for a more detailed example. |
|
172 |
|
173 .. versionadded:: 2.3 |
|
174 |
|
175 |
|
176 .. function:: get_close_matches(word, possibilities[, n][, cutoff]) |
|
177 |
|
178 Return a list of the best "good enough" matches. *word* is a sequence for which |
|
179 close matches are desired (typically a string), and *possibilities* is a list of |
|
180 sequences against which to match *word* (typically a list of strings). |
|
181 |
|
182 Optional argument *n* (default ``3``) is the maximum number of close matches to |
|
183 return; *n* must be greater than ``0``. |
|
184 |
|
185 Optional argument *cutoff* (default ``0.6``) is a float in the range [0, 1]. |
|
186 Possibilities that don't score at least that similar to *word* are ignored. |
|
187 |
|
188 The best (no more than *n*) matches among the possibilities are returned in a |
|
189 list, sorted by similarity score, most similar first. |
|
190 |
|
191 >>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy']) |
|
192 ['apple', 'ape'] |
|
193 >>> import keyword |
|
194 >>> get_close_matches('wheel', keyword.kwlist) |
|
195 ['while'] |
|
196 >>> get_close_matches('apple', keyword.kwlist) |
|
197 [] |
|
198 >>> get_close_matches('accept', keyword.kwlist) |
|
199 ['except'] |
|
200 |
|
201 |
|
202 .. function:: ndiff(a, b[, linejunk][, charjunk]) |
|
203 |
|
204 Compare *a* and *b* (lists of strings); return a :class:`Differ`\ -style |
|
205 delta (a :term:`generator` generating the delta lines). |
|
206 |
|
207 Optional keyword parameters *linejunk* and *charjunk* are for filter functions |
|
208 (or ``None``): |
|
209 |
|
210 *linejunk*: A function that accepts a single string argument, and returns true |
|
211 if the string is junk, or false if not. The default is (``None``), starting with |
|
212 Python 2.3. Before then, the default was the module-level function |
|
213 :func:`IS_LINE_JUNK`, which filters out lines without visible characters, except |
|
214 for at most one pound character (``'#'``). As of Python 2.3, the underlying |
|
215 :class:`SequenceMatcher` class does a dynamic analysis of which lines are so |
|
216 frequent as to constitute noise, and this usually works better than the pre-2.3 |
|
217 default. |
|
218 |
|
219 *charjunk*: A function that accepts a character (a string of length 1), and |
|
220 returns if the character is junk, or false if not. The default is module-level |
|
221 function :func:`IS_CHARACTER_JUNK`, which filters out whitespace characters (a |
|
222 blank or tab; note: bad idea to include newline in this!). |
|
223 |
|
224 :file:`Tools/scripts/ndiff.py` is a command-line front-end to this function. |
|
225 |
|
226 >>> diff = ndiff('one\ntwo\nthree\n'.splitlines(1), |
|
227 ... 'ore\ntree\nemu\n'.splitlines(1)) |
|
228 >>> print ''.join(diff), |
|
229 - one |
|
230 ? ^ |
|
231 + ore |
|
232 ? ^ |
|
233 - two |
|
234 - three |
|
235 ? - |
|
236 + tree |
|
237 + emu |
|
238 |
|
239 |
|
240 .. function:: restore(sequence, which) |
|
241 |
|
242 Return one of the two sequences that generated a delta. |
|
243 |
|
244 Given a *sequence* produced by :meth:`Differ.compare` or :func:`ndiff`, extract |
|
245 lines originating from file 1 or 2 (parameter *which*), stripping off line |
|
246 prefixes. |
|
247 |
|
248 Example: |
|
249 |
|
250 >>> diff = ndiff('one\ntwo\nthree\n'.splitlines(1), |
|
251 ... 'ore\ntree\nemu\n'.splitlines(1)) |
|
252 >>> diff = list(diff) # materialize the generated delta into a list |
|
253 >>> print ''.join(restore(diff, 1)), |
|
254 one |
|
255 two |
|
256 three |
|
257 >>> print ''.join(restore(diff, 2)), |
|
258 ore |
|
259 tree |
|
260 emu |
|
261 |
|
262 |
|
263 .. function:: unified_diff(a, b[, fromfile][, tofile][, fromfiledate][, tofiledate][, n][, lineterm]) |
|
264 |
|
265 Compare *a* and *b* (lists of strings); return a delta (a :term:`generator` |
|
266 generating the delta lines) in unified diff format. |
|
267 |
|
268 Unified diffs are a compact way of showing just the lines that have changed plus |
|
269 a few lines of context. The changes are shown in a inline style (instead of |
|
270 separate before/after blocks). The number of context lines is set by *n* which |
|
271 defaults to three. |
|
272 |
|
273 By default, the diff control lines (those with ``---``, ``+++``, or ``@@``) are |
|
274 created with a trailing newline. This is helpful so that inputs created from |
|
275 :func:`file.readlines` result in diffs that are suitable for use with |
|
276 :func:`file.writelines` since both the inputs and outputs have trailing |
|
277 newlines. |
|
278 |
|
279 For inputs that do not have trailing newlines, set the *lineterm* argument to |
|
280 ``""`` so that the output will be uniformly newline free. |
|
281 |
|
282 The context diff format normally has a header for filenames and modification |
|
283 times. Any or all of these may be specified using strings for *fromfile*, |
|
284 *tofile*, *fromfiledate*, and *tofiledate*. The modification times are normally |
|
285 expressed in the format returned by :func:`time.ctime`. If not specified, the |
|
286 strings default to blanks. |
|
287 |
|
288 >>> s1 = ['bacon\n', 'eggs\n', 'ham\n', 'guido\n'] |
|
289 >>> s2 = ['python\n', 'eggy\n', 'hamster\n', 'guido\n'] |
|
290 >>> for line in unified_diff(s1, s2, fromfile='before.py', tofile='after.py'): |
|
291 ... sys.stdout.write(line) # doctest: +NORMALIZE_WHITESPACE |
|
292 --- before.py |
|
293 +++ after.py |
|
294 @@ -1,4 +1,4 @@ |
|
295 -bacon |
|
296 -eggs |
|
297 -ham |
|
298 +python |
|
299 +eggy |
|
300 +hamster |
|
301 guido |
|
302 |
|
303 See :ref:`difflib-interface` for a more detailed example. |
|
304 |
|
305 .. versionadded:: 2.3 |
|
306 |
|
307 |
|
308 .. function:: IS_LINE_JUNK(line) |
|
309 |
|
310 Return true for ignorable lines. The line *line* is ignorable if *line* is |
|
311 blank or contains a single ``'#'``, otherwise it is not ignorable. Used as a |
|
312 default for parameter *linejunk* in :func:`ndiff` before Python 2.3. |
|
313 |
|
314 |
|
315 .. function:: IS_CHARACTER_JUNK(ch) |
|
316 |
|
317 Return true for ignorable characters. The character *ch* is ignorable if *ch* |
|
318 is a space or tab, otherwise it is not ignorable. Used as a default for |
|
319 parameter *charjunk* in :func:`ndiff`. |
|
320 |
|
321 |
|
322 .. seealso:: |
|
323 |
|
324 `Pattern Matching: The Gestalt Approach <http://www.ddj.com/184407970?pgno=5>`_ |
|
325 Discussion of a similar algorithm by John W. Ratcliff and D. E. Metzener. This |
|
326 was published in `Dr. Dobb's Journal <http://www.ddj.com/>`_ in July, 1988. |
|
327 |
|
328 |
|
329 .. _sequence-matcher: |
|
330 |
|
331 SequenceMatcher Objects |
|
332 ----------------------- |
|
333 |
|
334 The :class:`SequenceMatcher` class has this constructor: |
|
335 |
|
336 |
|
337 .. class:: SequenceMatcher([isjunk[, a[, b]]]) |
|
338 |
|
339 Optional argument *isjunk* must be ``None`` (the default) or a one-argument |
|
340 function that takes a sequence element and returns true if and only if the |
|
341 element is "junk" and should be ignored. Passing ``None`` for *isjunk* is |
|
342 equivalent to passing ``lambda x: 0``; in other words, no elements are ignored. |
|
343 For example, pass:: |
|
344 |
|
345 lambda x: x in " \t" |
|
346 |
|
347 if you're comparing lines as sequences of characters, and don't want to synch up |
|
348 on blanks or hard tabs. |
|
349 |
|
350 The optional arguments *a* and *b* are sequences to be compared; both default to |
|
351 empty strings. The elements of both sequences must be :term:`hashable`. |
|
352 |
|
353 :class:`SequenceMatcher` objects have the following methods: |
|
354 |
|
355 |
|
356 .. method:: set_seqs(a, b) |
|
357 |
|
358 Set the two sequences to be compared. |
|
359 |
|
360 :class:`SequenceMatcher` computes and caches detailed information about the |
|
361 second sequence, so if you want to compare one sequence against many |
|
362 sequences, use :meth:`set_seq2` to set the commonly used sequence once and |
|
363 call :meth:`set_seq1` repeatedly, once for each of the other sequences. |
|
364 |
|
365 |
|
366 .. method:: set_seq1(a) |
|
367 |
|
368 Set the first sequence to be compared. The second sequence to be compared |
|
369 is not changed. |
|
370 |
|
371 |
|
372 .. method:: set_seq2(b) |
|
373 |
|
374 Set the second sequence to be compared. The first sequence to be compared |
|
375 is not changed. |
|
376 |
|
377 |
|
378 .. method:: find_longest_match(alo, ahi, blo, bhi) |
|
379 |
|
380 Find longest matching block in ``a[alo:ahi]`` and ``b[blo:bhi]``. |
|
381 |
|
382 If *isjunk* was omitted or ``None``, :meth:`find_longest_match` returns |
|
383 ``(i, j, k)`` such that ``a[i:i+k]`` is equal to ``b[j:j+k]``, where ``alo |
|
384 <= i <= i+k <= ahi`` and ``blo <= j <= j+k <= bhi``. For all ``(i', j', |
|
385 k')`` meeting those conditions, the additional conditions ``k >= k'``, ``i |
|
386 <= i'``, and if ``i == i'``, ``j <= j'`` are also met. In other words, of |
|
387 all maximal matching blocks, return one that starts earliest in *a*, and |
|
388 of all those maximal matching blocks that start earliest in *a*, return |
|
389 the one that starts earliest in *b*. |
|
390 |
|
391 >>> s = SequenceMatcher(None, " abcd", "abcd abcd") |
|
392 >>> s.find_longest_match(0, 5, 0, 9) |
|
393 Match(a=0, b=4, size=5) |
|
394 |
|
395 If *isjunk* was provided, first the longest matching block is determined |
|
396 as above, but with the additional restriction that no junk element appears |
|
397 in the block. Then that block is extended as far as possible by matching |
|
398 (only) junk elements on both sides. So the resulting block never matches |
|
399 on junk except as identical junk happens to be adjacent to an interesting |
|
400 match. |
|
401 |
|
402 Here's the same example as before, but considering blanks to be junk. That |
|
403 prevents ``' abcd'`` from matching the ``' abcd'`` at the tail end of the |
|
404 second sequence directly. Instead only the ``'abcd'`` can match, and |
|
405 matches the leftmost ``'abcd'`` in the second sequence: |
|
406 |
|
407 >>> s = SequenceMatcher(lambda x: x==" ", " abcd", "abcd abcd") |
|
408 >>> s.find_longest_match(0, 5, 0, 9) |
|
409 Match(a=1, b=0, size=4) |
|
410 |
|
411 If no blocks match, this returns ``(alo, blo, 0)``. |
|
412 |
|
413 .. versionchanged:: 2.6 |
|
414 This method returns a :term:`named tuple` ``Match(a, b, size)``. |
|
415 |
|
416 |
|
417 .. method:: get_matching_blocks() |
|
418 |
|
419 Return list of triples describing matching subsequences. Each triple is of |
|
420 the form ``(i, j, n)``, and means that ``a[i:i+n] == b[j:j+n]``. The |
|
421 triples are monotonically increasing in *i* and *j*. |
|
422 |
|
423 The last triple is a dummy, and has the value ``(len(a), len(b), 0)``. It |
|
424 is the only triple with ``n == 0``. If ``(i, j, n)`` and ``(i', j', n')`` |
|
425 are adjacent triples in the list, and the second is not the last triple in |
|
426 the list, then ``i+n != i'`` or ``j+n != j'``; in other words, adjacent |
|
427 triples always describe non-adjacent equal blocks. |
|
428 |
|
429 .. XXX Explain why a dummy is used! |
|
430 |
|
431 .. versionchanged:: 2.5 |
|
432 The guarantee that adjacent triples always describe non-adjacent blocks |
|
433 was implemented. |
|
434 |
|
435 .. doctest:: |
|
436 |
|
437 >>> s = SequenceMatcher(None, "abxcd", "abcd") |
|
438 >>> s.get_matching_blocks() |
|
439 [Match(a=0, b=0, size=2), Match(a=3, b=2, size=2), Match(a=5, b=4, size=0)] |
|
440 |
|
441 |
|
442 .. method:: get_opcodes() |
|
443 |
|
444 Return list of 5-tuples describing how to turn *a* into *b*. Each tuple is |
|
445 of the form ``(tag, i1, i2, j1, j2)``. The first tuple has ``i1 == j1 == |
|
446 0``, and remaining tuples have *i1* equal to the *i2* from the preceding |
|
447 tuple, and, likewise, *j1* equal to the previous *j2*. |
|
448 |
|
449 The *tag* values are strings, with these meanings: |
|
450 |
|
451 +---------------+---------------------------------------------+ |
|
452 | Value | Meaning | |
|
453 +===============+=============================================+ |
|
454 | ``'replace'`` | ``a[i1:i2]`` should be replaced by | |
|
455 | | ``b[j1:j2]``. | |
|
456 +---------------+---------------------------------------------+ |
|
457 | ``'delete'`` | ``a[i1:i2]`` should be deleted. Note that | |
|
458 | | ``j1 == j2`` in this case. | |
|
459 +---------------+---------------------------------------------+ |
|
460 | ``'insert'`` | ``b[j1:j2]`` should be inserted at | |
|
461 | | ``a[i1:i1]``. Note that ``i1 == i2`` in | |
|
462 | | this case. | |
|
463 +---------------+---------------------------------------------+ |
|
464 | ``'equal'`` | ``a[i1:i2] == b[j1:j2]`` (the sub-sequences | |
|
465 | | are equal). | |
|
466 +---------------+---------------------------------------------+ |
|
467 |
|
468 For example: |
|
469 |
|
470 >>> a = "qabxcd" |
|
471 >>> b = "abycdf" |
|
472 >>> s = SequenceMatcher(None, a, b) |
|
473 >>> for tag, i1, i2, j1, j2 in s.get_opcodes(): |
|
474 ... print ("%7s a[%d:%d] (%s) b[%d:%d] (%s)" % |
|
475 ... (tag, i1, i2, a[i1:i2], j1, j2, b[j1:j2])) |
|
476 delete a[0:1] (q) b[0:0] () |
|
477 equal a[1:3] (ab) b[0:2] (ab) |
|
478 replace a[3:4] (x) b[2:3] (y) |
|
479 equal a[4:6] (cd) b[3:5] (cd) |
|
480 insert a[6:6] () b[5:6] (f) |
|
481 |
|
482 |
|
483 .. method:: get_grouped_opcodes([n]) |
|
484 |
|
485 Return a :term:`generator` of groups with up to *n* lines of context. |
|
486 |
|
487 Starting with the groups returned by :meth:`get_opcodes`, this method |
|
488 splits out smaller change clusters and eliminates intervening ranges which |
|
489 have no changes. |
|
490 |
|
491 The groups are returned in the same format as :meth:`get_opcodes`. |
|
492 |
|
493 .. versionadded:: 2.3 |
|
494 |
|
495 |
|
496 .. method:: ratio() |
|
497 |
|
498 Return a measure of the sequences' similarity as a float in the range [0, |
|
499 1]. |
|
500 |
|
501 Where T is the total number of elements in both sequences, and M is the |
|
502 number of matches, this is 2.0\*M / T. Note that this is ``1.0`` if the |
|
503 sequences are identical, and ``0.0`` if they have nothing in common. |
|
504 |
|
505 This is expensive to compute if :meth:`get_matching_blocks` or |
|
506 :meth:`get_opcodes` hasn't already been called, in which case you may want |
|
507 to try :meth:`quick_ratio` or :meth:`real_quick_ratio` first to get an |
|
508 upper bound. |
|
509 |
|
510 |
|
511 .. method:: quick_ratio() |
|
512 |
|
513 Return an upper bound on :meth:`ratio` relatively quickly. |
|
514 |
|
515 This isn't defined beyond that it is an upper bound on :meth:`ratio`, and |
|
516 is faster to compute. |
|
517 |
|
518 |
|
519 .. method:: real_quick_ratio() |
|
520 |
|
521 Return an upper bound on :meth:`ratio` very quickly. |
|
522 |
|
523 This isn't defined beyond that it is an upper bound on :meth:`ratio`, and |
|
524 is faster to compute than either :meth:`ratio` or :meth:`quick_ratio`. |
|
525 |
|
526 The three methods that return the ratio of matching to total characters can give |
|
527 different results due to differing levels of approximation, although |
|
528 :meth:`quick_ratio` and :meth:`real_quick_ratio` are always at least as large as |
|
529 :meth:`ratio`: |
|
530 |
|
531 >>> s = SequenceMatcher(None, "abcd", "bcde") |
|
532 >>> s.ratio() |
|
533 0.75 |
|
534 >>> s.quick_ratio() |
|
535 0.75 |
|
536 >>> s.real_quick_ratio() |
|
537 1.0 |
|
538 |
|
539 |
|
540 .. _sequencematcher-examples: |
|
541 |
|
542 SequenceMatcher Examples |
|
543 ------------------------ |
|
544 |
|
545 This example compares two strings, considering blanks to be "junk:" |
|
546 |
|
547 >>> s = SequenceMatcher(lambda x: x == " ", |
|
548 ... "private Thread currentThread;", |
|
549 ... "private volatile Thread currentThread;") |
|
550 |
|
551 :meth:`ratio` returns a float in [0, 1], measuring the similarity of the |
|
552 sequences. As a rule of thumb, a :meth:`ratio` value over 0.6 means the |
|
553 sequences are close matches: |
|
554 |
|
555 >>> print round(s.ratio(), 3) |
|
556 0.866 |
|
557 |
|
558 If you're only interested in where the sequences match, |
|
559 :meth:`get_matching_blocks` is handy: |
|
560 |
|
561 >>> for block in s.get_matching_blocks(): |
|
562 ... print "a[%d] and b[%d] match for %d elements" % block |
|
563 a[0] and b[0] match for 8 elements |
|
564 a[8] and b[17] match for 21 elements |
|
565 a[29] and b[38] match for 0 elements |
|
566 |
|
567 Note that the last tuple returned by :meth:`get_matching_blocks` is always a |
|
568 dummy, ``(len(a), len(b), 0)``, and this is the only case in which the last |
|
569 tuple element (number of elements matched) is ``0``. |
|
570 |
|
571 If you want to know how to change the first sequence into the second, use |
|
572 :meth:`get_opcodes`: |
|
573 |
|
574 >>> for opcode in s.get_opcodes(): |
|
575 ... print "%6s a[%d:%d] b[%d:%d]" % opcode |
|
576 equal a[0:8] b[0:8] |
|
577 insert a[8:8] b[8:17] |
|
578 equal a[8:29] b[17:38] |
|
579 |
|
580 See also the function :func:`get_close_matches` in this module, which shows how |
|
581 simple code building on :class:`SequenceMatcher` can be used to do useful work. |
|
582 |
|
583 |
|
584 .. _differ-objects: |
|
585 |
|
586 Differ Objects |
|
587 -------------- |
|
588 |
|
589 Note that :class:`Differ`\ -generated deltas make no claim to be **minimal** |
|
590 diffs. To the contrary, minimal diffs are often counter-intuitive, because they |
|
591 synch up anywhere possible, sometimes accidental matches 100 pages apart. |
|
592 Restricting synch points to contiguous matches preserves some notion of |
|
593 locality, at the occasional cost of producing a longer diff. |
|
594 |
|
595 The :class:`Differ` class has this constructor: |
|
596 |
|
597 |
|
598 .. class:: Differ([linejunk[, charjunk]]) |
|
599 |
|
600 Optional keyword parameters *linejunk* and *charjunk* are for filter functions |
|
601 (or ``None``): |
|
602 |
|
603 *linejunk*: A function that accepts a single string argument, and returns true |
|
604 if the string is junk. The default is ``None``, meaning that no line is |
|
605 considered junk. |
|
606 |
|
607 *charjunk*: A function that accepts a single character argument (a string of |
|
608 length 1), and returns true if the character is junk. The default is ``None``, |
|
609 meaning that no character is considered junk. |
|
610 |
|
611 :class:`Differ` objects are used (deltas generated) via a single method: |
|
612 |
|
613 |
|
614 .. method:: Differ.compare(a, b) |
|
615 |
|
616 Compare two sequences of lines, and generate the delta (a sequence of lines). |
|
617 |
|
618 Each sequence must contain individual single-line strings ending with newlines. |
|
619 Such sequences can be obtained from the :meth:`readlines` method of file-like |
|
620 objects. The delta generated also consists of newline-terminated strings, ready |
|
621 to be printed as-is via the :meth:`writelines` method of a file-like object. |
|
622 |
|
623 |
|
624 .. _differ-examples: |
|
625 |
|
626 Differ Example |
|
627 -------------- |
|
628 |
|
629 This example compares two texts. First we set up the texts, sequences of |
|
630 individual single-line strings ending with newlines (such sequences can also be |
|
631 obtained from the :meth:`readlines` method of file-like objects): |
|
632 |
|
633 >>> text1 = ''' 1. Beautiful is better than ugly. |
|
634 ... 2. Explicit is better than implicit. |
|
635 ... 3. Simple is better than complex. |
|
636 ... 4. Complex is better than complicated. |
|
637 ... '''.splitlines(1) |
|
638 >>> len(text1) |
|
639 4 |
|
640 >>> text1[0][-1] |
|
641 '\n' |
|
642 >>> text2 = ''' 1. Beautiful is better than ugly. |
|
643 ... 3. Simple is better than complex. |
|
644 ... 4. Complicated is better than complex. |
|
645 ... 5. Flat is better than nested. |
|
646 ... '''.splitlines(1) |
|
647 |
|
648 Next we instantiate a Differ object: |
|
649 |
|
650 >>> d = Differ() |
|
651 |
|
652 Note that when instantiating a :class:`Differ` object we may pass functions to |
|
653 filter out line and character "junk." See the :meth:`Differ` constructor for |
|
654 details. |
|
655 |
|
656 Finally, we compare the two: |
|
657 |
|
658 >>> result = list(d.compare(text1, text2)) |
|
659 |
|
660 ``result`` is a list of strings, so let's pretty-print it: |
|
661 |
|
662 >>> from pprint import pprint |
|
663 >>> pprint(result) |
|
664 [' 1. Beautiful is better than ugly.\n', |
|
665 '- 2. Explicit is better than implicit.\n', |
|
666 '- 3. Simple is better than complex.\n', |
|
667 '+ 3. Simple is better than complex.\n', |
|
668 '? ++\n', |
|
669 '- 4. Complex is better than complicated.\n', |
|
670 '? ^ ---- ^\n', |
|
671 '+ 4. Complicated is better than complex.\n', |
|
672 '? ++++ ^ ^\n', |
|
673 '+ 5. Flat is better than nested.\n'] |
|
674 |
|
675 As a single multi-line string it looks like this: |
|
676 |
|
677 >>> import sys |
|
678 >>> sys.stdout.writelines(result) |
|
679 1. Beautiful is better than ugly. |
|
680 - 2. Explicit is better than implicit. |
|
681 - 3. Simple is better than complex. |
|
682 + 3. Simple is better than complex. |
|
683 ? ++ |
|
684 - 4. Complex is better than complicated. |
|
685 ? ^ ---- ^ |
|
686 + 4. Complicated is better than complex. |
|
687 ? ++++ ^ ^ |
|
688 + 5. Flat is better than nested. |
|
689 |
|
690 |
|
691 .. _difflib-interface: |
|
692 |
|
693 A command-line interface to difflib |
|
694 ----------------------------------- |
|
695 |
|
696 This example shows how to use difflib to create a ``diff``-like utility. |
|
697 It is also contained in the Python source distribution, as |
|
698 :file:`Tools/scripts/diff.py`. |
|
699 |
|
700 .. testcode:: |
|
701 |
|
702 """ Command line interface to difflib.py providing diffs in four formats: |
|
703 |
|
704 * ndiff: lists every line and highlights interline changes. |
|
705 * context: highlights clusters of changes in a before/after format. |
|
706 * unified: highlights clusters of changes in an inline format. |
|
707 * html: generates side by side comparison with change highlights. |
|
708 |
|
709 """ |
|
710 |
|
711 import sys, os, time, difflib, optparse |
|
712 |
|
713 def main(): |
|
714 # Configure the option parser |
|
715 usage = "usage: %prog [options] fromfile tofile" |
|
716 parser = optparse.OptionParser(usage) |
|
717 parser.add_option("-c", action="store_true", default=False, |
|
718 help='Produce a context format diff (default)') |
|
719 parser.add_option("-u", action="store_true", default=False, |
|
720 help='Produce a unified format diff') |
|
721 hlp = 'Produce HTML side by side diff (can use -c and -l in conjunction)' |
|
722 parser.add_option("-m", action="store_true", default=False, help=hlp) |
|
723 parser.add_option("-n", action="store_true", default=False, |
|
724 help='Produce a ndiff format diff') |
|
725 parser.add_option("-l", "--lines", type="int", default=3, |
|
726 help='Set number of context lines (default 3)') |
|
727 (options, args) = parser.parse_args() |
|
728 |
|
729 if len(args) == 0: |
|
730 parser.print_help() |
|
731 sys.exit(1) |
|
732 if len(args) != 2: |
|
733 parser.error("need to specify both a fromfile and tofile") |
|
734 |
|
735 n = options.lines |
|
736 fromfile, tofile = args # as specified in the usage string |
|
737 |
|
738 # we're passing these as arguments to the diff function |
|
739 fromdate = time.ctime(os.stat(fromfile).st_mtime) |
|
740 todate = time.ctime(os.stat(tofile).st_mtime) |
|
741 fromlines = open(fromfile, 'U').readlines() |
|
742 tolines = open(tofile, 'U').readlines() |
|
743 |
|
744 if options.u: |
|
745 diff = difflib.unified_diff(fromlines, tolines, fromfile, tofile, |
|
746 fromdate, todate, n=n) |
|
747 elif options.n: |
|
748 diff = difflib.ndiff(fromlines, tolines) |
|
749 elif options.m: |
|
750 diff = difflib.HtmlDiff().make_file(fromlines, tolines, fromfile, |
|
751 tofile, context=options.c, |
|
752 numlines=n) |
|
753 else: |
|
754 diff = difflib.context_diff(fromlines, tolines, fromfile, tofile, |
|
755 fromdate, todate, n=n) |
|
756 |
|
757 # we're using writelines because diff is a generator |
|
758 sys.stdout.writelines(diff) |
|
759 |
|
760 if __name__ == '__main__': |
|
761 main() |