FIFTH EDITION
Steve Marschner
Cornell University
Peter Shirley
NVIDIA
with
Michael Ashikhmin, Gro Intelligence
Michael Gleicher, University of Wisconsin
Naty Hoffman, Lucasfilm
Garrett Johnson, Rochester Institute of Technology
Tamara Munzner, University of British Columbia
Erik Reinhard, InterDigital, Inc.
William B. Thompson, University of Utah
Peter Willemsen, University of Minnesota Duluth
Brian Wyvill, SceneWizard Software Ltd.
Fifth edition published 2022
by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742
and by Routledge
2 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN
© 2022 Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, LLC
Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are not available on CCC please contact mpkbookspermissions@tandf.co.uk
Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for identification and explanation without intent to infringe.
Library of Congress Cataloging‑in‑Publication Data
Names: Marschner, Steve, author. | Shirley, Peter, author.
Title: Fundamentals of computer graphics / Steve Marschner, Peter Shirley.
Description: 5th edition. | Boca Raton: CRC Press, 2021. | Includes bibliographical references and index. Identifiers: LCCN 2021008492 | ISBN 9780367505035 (hardback) | ISBN 9781003050339 (ebook)
Subjects: LCSH: Computer graphics.
Classification: LCC T385 .M36475 2021 | DDC 006.6—dc23
LC record available at https://lccn.loc.gov/2021008492
ISBN: 978-0-367-50503-5 (hbk)
ISBN: 978-0-367-50558-5 (pbk)
ISBN: 978-1-003-05033-9 (ebk)
Typeset in Times
by codeMantra
1.7 Designing and Coding Graphics Programs
2.2 Solving Quadratic Equations
3.2 Images, Pixels, and Geometry
4.1 The Basic Ray-Tracing Algorithm
6.3 Computing with Matrices and Determinants
6.4 Eigenvalues and Matrix Diagonalization
7.3 Translation and Affine Transformations
7.4 Inverses of Transformation Matrices
7.5 Coordinate Transformations
8.2 Projective Transformations
8.4 Some Properties of the Perspective Transform
9.2 Operations Before and After Rasterization
9.4 Culling Primitives for Efficiency
10.1 Digital Audio: Sampling in 1D
10.4 Signal Processing for Images
11.1 Looking Up Texture Values
11.2 Texture Coordinate Functions
11.3 Antialiasing Texture Lookups
11.4 Applications of Texture Mapping
12 Data Structures for Graphics
12.5 Tiling Multidimensional Arrays
14.4 Dielectrics with Subsurface Scattering
14.5 A Brute Force Photon Tracer
17.2 What Is Graphics Hardware
17.3 Heterogeneous Multiprocessing
17.4 Graphics Hardware Programming: Buffers, State, and Shaders
17.6 Basic OpenGL Application Layout
17.12 Shading with Per-Vertex Attributes
17.13 Shading in the Fragment Processor
17.16 Object-Oriented Design for Graphics Hardware Programming
19.4 Objects, Locations, and Events
20.5 Frequency-Based Operators
20.6 Gradient-Domain Operators
21.1 Implicit Functions, Skeletal Primitives, and Summation Blending
21.5 Constructive Solid Geometry
21.9 Interactive Implicit Modeling Systems
22.5 The Game Production Process
23.3 Human-Centered Design Process
23.4 Visual Encoding Principles
This edition of Fundamentals of Computer Graphics includes substantial rewrites of the material on shading, light reflection, and path tracing, as well as many corrections throughout. This book now provides a better introduction to the techniques that go by the names of physics-based materials and physics-based rendering and are becoming predominant in actual practice. This material is now better integrated, and we think this book maps well to the way many instructors are organizing graphics courses at present.
The organization of this book remains substantially similar to the fourth edition. As we have revised this book over the years, we have endeavored to retain the informal, intuitive style of presentation that characterizes the earlier editions, while at the same time improving consistency, precision, and completeness. We hope the reader will find the result is an appealing platform for a variety of courses in computer graphics.
The cover image is from Tiger in the Water by J. W. Baker (brushed and air-brushed acrylic on canvas, 16” by 20”, www.jwbart.com).
The subject of a tiger is a reference to a wonderful talk given by Alain Fournier (1943–2000) at a workshop at Cornell University in 1998. His talk was an evocative verbal description of the movements of a tiger. He summarized his point:
Even though modelling and rendering in computer graphics have been improved tremendously in the past 35 years, we are still not at the point where we can model automatically a tiger swimming in the river in all its glorious details. By automatically I mean in a way that does not need careful manual tweaking by an artist/expert.
The bad news is that we have still a long way to go.
The good news is that we have still a long way to go.
The website for this book is http://www.cs.cornell.edu/~srm/fcg5/. We will continue to maintain a list of errata and links to courses that use the book, as well as teaching materials that match the book’s style. Most of the figures in this book are in Adobe Illustrator format, and we would be happy to convert specific figures into portable formats on request. Please feel free to contact us at srm@cs.cornell.edu or ptrshrl@gmail.com.
The following people have provided helpful information, comments, or feedback about the various editions of this book: Ahmet Oğuz Akyüz, Josh Andersen, Beatriz Trinchão Andrade, Zeferino Andrade, Bagossy Attila, Kavita Bala, Mick Beaver, Robert Belleman, Adam Berger, Adeel Bhutta, Solomon Boulos, Stephen Chenney, Michael Coblenz, Greg Coombe, Frederic Cremer, Brian Curtin, Dave Edwards, Jonathon Evans, Karen Feinauer, Claude Fuhrer, Yotam Gingold, Amy Gooch, Eungyoung Han, Chuck Hansen, Andy Hanson, Razen Al Harbi, Dave Hart, John Hart, Yong Huang, John “Spike” Hughes, Helen Hu, Vicki Interrante, Wenzel Jakob, Doug James, Henrik Wann Jensen, Shi Jin, Mark Johnson, Ray Jones, Revant Kapoor, Kristin Kerr, Erum Arif Khan, Mark Kilgard, Fangjun Kuang, Dylan Lacewell, Mathias Lang, Philippe Laval, Joshua Levine, Marc Levoy, Howard Lo, Joann Luu, Mauricio Maurer, Andrew Medlin, Ron Metoyer, Keith Morley, Eric Mortensen, Koji Nakamaru, Micah Neilson, Blake Nelson, Michael Nikelsky, James O’Brien, Hongshu Pan, Steve Parker, Sumanta Pattanaik, Matt Pharr, Ken Phillis Jr, Nicolò Pinciroli, Peter Poulos, Shaun Ramsey, Rich Riesenfeld, Nate Robins, Nan Schaller, Chris Schryvers, Tom Sederberg, Richard Sharp, Sarah Shirley, Peter-Pike Sloan, Hannah Story, Tony Tahbaz, Jan-Phillip Tiesel, Bruce Walter, Alex Williams, Amy Williams, Chris Wyman, Kate Zebrose, and Angela Zhang.
Ching-Kuang Shene and David Solomon allowed us to borrow their examples. Henrik Wann Jensen, Eric Levin, Matt Pharr, and Jason Waltman generously provided images. Brandon Mansfield helped improve the discussion of hierarchical bounding volumes for ray tracing. Philip Greenspun (philip.greenspun.com) kindly allowed us to use his photographs. John “Spike” Hughes helped improve the discussion of sampling theory. Wenzel Jakob’s Mitsuba renderer was invaluable in creating many figures. We are extremely thankful to J. W. Baker for helping create the cover Pete envisioned. In addition to being a talented artist, he was a great pleasure to work with personally.
Many works that were helpful in preparing this book are cited in the chapter notes. However, a few key texts that influenced the content and presentation deserve special recognition here. These include the two classic computer graphics texts from which we both learned the basics: Computer Graphics: Principles & Practice (Foley, Van Dam, Feiner, & Hughes, 1990) and Computer Graphics (Hearn & Baker, 1986). Other texts include both of Alan Watt’s influential books (Watt, 1993, 1991), Hill’s Computer Graphics Using OpenGL (Francis S. Hill, 2000), Angel’s Interactive Computer Graphics: A Top-Down Approach Using OpenGL (Angel, 2002), Hugues Hoppe’s University of Washington dissertation (Hoppe, 1994), and Rogers’ two excellent graphics texts (Rogers, 1985, 1989).
We would like to especially thank Alice and Klaus Peters for encouraging Pete to write the first edition of this book and for their great skill in bringing a book to fruition. Their patience with the authors and their dedication to making their books the best they can be has been instrumental in making this book what it is. This book certainly would not exist without their extraordinary efforts.
Steve Marschner, Ithaca, NY
Peter Shirley, Salt Lake City, UT
February 2021
Steve Marschner is a Professor of Computer Science at Cornell University. He obtained his Sc.B. from Brown University in 1993 and his Ph.D. from Cornell in 1998. He held research positions at Microsoft Research and Stanford University before joining Cornell in 2002. He is recipient of the SIGGRAPH Computer Graphics Achievement Award in 2015 and co-recipient of a 2003 Technical Academy Award.
Peter Shirley is a Distinguished Research Scientist at NVIDIA. He held academic positions at Indiana University, Cornell University, and the University of Utah. He obtained a B.A. in Physics from Reed College in 1985 and a Ph.D. in Computer Science from University of Illinois in 1991.
The term computer graphics describes any use of computers to create and manipulate images. This book introduces the algorithmic and mathematical tools that can be used to create all kinds of images—realistic visual effects, informative technical illustrations, or beautiful computer animations. Graphics can be two- or three-dimensional; images can be completely synthetic or can be produced by manipulating photographs. This book is about the fundamental algorithms and mathematics, especially those used to produce synthetic images of three-dimensional objects and scenes.
Actually doing computer graphics inevitably requires knowing about specific hardware, file formats, and usually a graphics API (see Section 1.3) or two. Computer graphics is a rapidly evolving field, so the specifics of that knowledge are a moving target. Therefore, in this book we do our best to avoid depending on any specific hardware or API. Readers are encouraged to supplement the text with relevant documentation for their software and hardware environment. Fortunately, the culture of computer graphics has enough standard terminology and concepts that the discussion in this book should map nicely to most environments.
This chapter defines some basic terminology and provides some historical background, as well as information sources related to computer graphics.
Imposing categories on any field is dangerous, but most graphics practitioners would agree on the following major areas of computer graphics:
Modeling deals with the mathematical specification of shape and appearance properties in a way that can be stored on the computer. For example, a coffee mug might be described as a set of ordered 3D points along with some interpolation rule to connect the points and a reflection model that describes how light interacts with the mug.
Rendering is a term inherited from art and deals with the creation of shaded images from 3D computer models.
Animation is a technique to create an illusion of motion through sequences of images. Animation uses modeling and rendering but adds the key issue of movement over time, which is not usually dealt with in basic modeling and rendering.
There are many other areas that involve computer graphics, and whether they are core graphics areas is a matter of opinion. These will all be at least touched on in the text. Such related areas include the following:
User interaction deals with the interface between input devices such as mice and tablets, the application, feedback to the user in imagery, and other sensory feedback. Historically, this area is associated with graphics largely because graphics researchers had some of the earliest access to the input/output devices that are now ubiquitous.
Virtual reality attempts to immerse the user into a 3D virtual world. This typically requires at least stereo graphics and response to head motion. For true virtual reality, sound and force feedback should be provided as well. Because this area requires advanced 3D graphics and advanced display technology, it is often closely associated with graphics.
Visualization attempts to give users insight into complex information via visual display. Often, there are graphic issues to be addressed in a visualization problem.
Image processing deals with the manipulation of 2D images and is used in both the fields of graphics and vision.
Three-dimensional scanning uses range-finding technology to create measured 3D models. Such models are useful for creating rich visual imagery, and the processing of such models often requires graphics algorithms.
Computational photography is the use of computer graphics, computer vision, and image processing methods to enable new ways of photographically capturing objects, scenes, and environments.
Almost any endeavor can make some use of computer graphics, but the major consumers of computer graphics technology include the following industries:
Video games increasingly use sophisticated 3D models and rendering algorithms.
Cartoons are often rendered directly from 3D models. Many traditional 2D cartoons use backgrounds rendered from 3D models, which allow a continuously moving viewpoint without huge amounts of artist time.
Visual effects use almost all types of computer graphics technology. Almost every modern film uses digital compositing to superimpose backgrounds with separately filmed foregrounds. Many films also use 3D modeling and animation to create synthetic environments, objects, and even characters that most viewers will never suspect are not real.
Animated films use many of the same techniques that are used for visual effects, but without necessarily aiming for images that look real.
CAD/CAM stands for computer-aided design and computer-aided manufacturing. These fields use computer technology to design parts and products on the computer and then, using these virtual designs, to guide the manufacturing process. For example, many mechanical parts are designed in a 3D computer modeling package and then automatically produced on a computer-controlled milling device.
Simulation can be thought of as accurate video gaming. For example, a flight simulator uses sophisticated 3D graphics to simulate the experience of flying an airplane. Such simulations can be extremely useful for initial training in safety-critical domains such as driving, and for scenario training for experienced users such as specific fire-fighting situations that are too costly or dangerous to create physically.
Medical imaging creates meaningful images of scanned patient data. For example, a computed tomography (CT) dataset is composed of a large 3D rectangular array of density values. Computer graphics is used to create shaded images that help doctors extract the most salient information from such data.
Information visualization creates images of data that do not necessarily have a “natural” visual depiction. For example, the temporal trend of the price of ten different stocks does not have an obvious visual depiction, but clever graphing techniques can help humans see the patterns in such data.
A key part of using graphics libraries is dealing with a graphics API. An application program interface (API) is a standard collection of functions to perform a set of related operations, and a graphics API is a set of functions that perform basic operations such as drawing images and 3D surfaces into windows on the screen.
Every graphics program needs to be able to use two related APIs: a graphics API for visual output and a user-interface API to get input from the user. There are currently two dominant paradigms for graphics and user-interface APIs. The first is the integrated approach, exemplified by Java, where the graphics and user-interface toolkits are integrated and portable packages that are fully standardized and supported as part of the language. The second is represented by Direct3D and OpenGL, where the drawing commands are part of a software library tied to a language such as C++, and the user-interface software is an independent entity that might vary from system to system. In this latter approach, it is problematic to write portable code, although for simple programs, it may be possible to use a portable library layer to encapsulate the system-specific user-interface code.
Whatever your choice of API, the basic graphics calls will be largely the same, and the concepts of this book will apply.
Every desktop computer today has a powerful 3D graphics pipeline. This is a special software/hardware subsystem that efficiently draws 3D primitives in perspective. Usually, these systems are optimized for processing 3D triangles with shared vertices. The basic operations in the pipeline map the 3D vertex locations to 2D screen positions and shade the triangles so that they both look realistic and appear in proper back-to-front order.
Although drawing the triangles in valid back-to-front order was once the most important research issue in computer graphics, it is now almost always solved using the z-buffer, which uses a special memory buffer to solve the problem in a brute-force manner.
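To make the brute-force idea concrete, here is a minimal sketch of a z-buffer in Python. It is not the pipeline's actual implementation: the function name, the fragment tuple layout, and the convention that a smaller depth is closer to the viewer are all assumptions made for illustration.

```python
import math

def rasterize_fragments(fragments, width, height, background=(0, 0, 0)):
    """Resolve visibility per pixel with a depth buffer.

    `fragments` is an iterable of (x, y, depth, color) tuples. In this
    sketch a smaller depth means closer to the viewer (an assumption).
    """
    # Every pixel starts infinitely far away, so the first fragment always wins.
    depth = [[math.inf] * width for _ in range(height)]
    color = [[background] * width for _ in range(height)]
    for x, y, z, c in fragments:
        if z < depth[y][x]:   # closer than anything drawn so far at this pixel
            depth[y][x] = z
            color[y][x] = c
    return color
```

Because each pixel keeps only the nearest depth seen so far, the fragments can arrive in any order and no sorting of triangles is needed.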
It turns out that the geometric manipulation used in the graphics pipeline can be accomplished almost entirely in a 4D coordinate space composed of three traditional geometric coordinates and a fourth homogeneous coordinate that helps with perspective viewing. These 4D coordinates are manipulated using 4 × 4 matrices and 4-vectors. The graphics pipeline, therefore, contains much machinery for efficiently processing and composing such matrices and vectors. This 4D coordinate system is one of the most subtle and beautiful constructs used in computer science, and it is certainly the biggest intellectual hurdle to jump when learning computer graphics. A big chunk of the first part of every graphics book deals with these coordinates.
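As a small preview, the sketch below applies a 4 × 4 matrix to a 4-vector and composes two matrices. The helper names are hypothetical, and the conventions used (a location carries w = 1, and dividing by w recovers the 3D point) are standard ones assumed here ahead of their proper treatment later in the book.

```python
def mat_vec(m, v):
    """Multiply a 4x4 matrix (a list of four rows) by a 4-vector."""
    return [sum(m[i][j] * v[j] for j in range(4)) for i in range(4)]

def mat_mul(a, b):
    """Compose two 4x4 matrices; the product applies b first, then a."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def translation(tx, ty, tz):
    """Translation matrix: the offset sits in the fourth column."""
    return [[1, 0, 0, tx],
            [0, 1, 0, ty],
            [0, 0, 1, tz],
            [0, 0, 0, 1]]

def to_cartesian(h):
    """Divide by the homogeneous coordinate w to recover a 3D point."""
    x, y, z, w = h
    return (x / w, y / w, z / w)

point = [1.0, 2.0, 3.0, 1.0]   # a location, assumed to have w = 1
moved = mat_vec(translation(10.0, 0.0, 0.0), point)
```

Composing transformations is then just matrix multiplication, for example `mat_mul(translation(1, 0, 0), translation(0, 2, 0))`.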
The speed at which images can be generated depends strongly on the number of triangles being drawn. Because interactivity is more important in many applications than visual quality, it is worthwhile to minimize the number of triangles used to represent a model. In addition, if the model is viewed in the distance, fewer triangles are needed than when the model is viewed from a closer distance. This suggests that it is useful to represent a model with a varying level of detail (LOD).
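One simple way to act on this is to store several versions of a model and choose among them by viewing distance. The rule below (drop one level each time the distance doubles past `base_distance`) is purely an illustrative assumption, not a method prescribed by this book; real systems typically choose levels from projected screen size or an error metric.

```python
import math

def select_lod(distance, levels, base_distance=10.0):
    """Return one entry of `levels`, ordered from most to least detailed.

    Hypothetical rule: full detail up to `base_distance`, then one coarser
    level for each doubling of distance beyond it.
    """
    if distance <= base_distance:
        return levels[0]
    index = int(math.log2(distance / base_distance)) + 1
    return levels[min(index, len(levels) - 1)]
```

For example, with `levels = [20000, 5000, 1200, 300]` triangles, a model 15 units away would be drawn with 5000 triangles under this rule.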
Many graphics programs are really just 3D numerical codes. Numerical issues are often crucial in such programs. In the “old days,” it was very difficult to handle such issues in a robust and portable manner because machines had different internal representations for numbers, and even worse, handled exceptions in different and incompatible ways. Fortunately, almost all modern computers conform to the IEEE floating-point standard (IEEE Standards Association, 1985). This allows the programmer to make many convenient assumptions about how certain numeric conditions will be handled.
Although IEEE floating-point has many features that are valuable when coding numeric algorithms, there are only a few that are crucial to know for most situations encountered in graphics. First, and most important, is to understand that there are three “special” values for real numbers in IEEE floating-point:
Infinity (∞). This is a valid number that is larger than all other valid numbers.
Minus infinity (–∞). This is a valid number that is smaller than all other valid numbers.
Not a number (NaN). This is an invalid number that arises from an operation with undefined consequences, such as zero divided by zero.
The designers of IEEE floating-point made some decisions that are extremely convenient for programmers. Many of these relate to the three special values above in handling exceptions such as division by zero. In these cases, an exception is logged, but in many cases, the programmer can ignore that. Specifically, for any positive real number a, the following rules involving division by infinite values hold:
+a/(+∞) = +0
–a/(+∞) = –0
+a/(–∞) = –0
–a/(–∞) = +0
Other operations involving infinite values behave the way one would expect. Again for positive a, the behavior is as follows:
∞ + ∞ = +∞
∞ – ∞ = NaN
∞ × ∞ = ∞
∞/∞ = NaN
∞/a = ∞
∞/0 = ∞
0/0 = NaN
The rules in a Boolean expression involving infinite values are as expected:
All finite valid numbers are less than +∞.
All finite valid numbers are greater than –∞.
–∞ is less than +∞.
The rules involving expressions that have NaN values are simple:
Any arithmetic expression that includes NaN results in NaN.
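Any Boolean comparison involving NaN is false. The one exception is the inequality test: NaN compares as not equal to every value, including itself.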
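These rules are easy to check experimentally. The following sketch uses Python floats, which are IEEE double-precision values on essentially all current platforms. One caveat: Python itself raises an error for floating-point division by zero, so only the infinity and NaN rules are exercised here.

```python
import math

inf = math.inf
nan = math.nan
a = 3.0  # any positive finite number

# Division by infinite values yields zeros whose sign follows the usual rule.
assert a / inf == 0.0 and math.copysign(1.0, a / inf) == 1.0
assert math.copysign(1.0, -a / inf) == -1.0
assert math.copysign(1.0, a / -inf) == -1.0
assert math.copysign(1.0, -a / -inf) == 1.0

# Other operations on infinite values.
assert inf + inf == inf
assert inf * inf == inf
assert inf / a == inf
assert math.isnan(inf - inf)
assert math.isnan(inf / inf)

# Comparisons with infinite values.
assert 1e308 < inf and -1e308 > -inf and -inf < inf

# NaN propagates through arithmetic, and ordered comparisons with it are false.
assert math.isnan(nan + 1.0)
assert not (nan > 0) and not (nan < 0) and not (nan == nan)
assert nan != nan  # the inequality test is the one comparison that is true
```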
Perhaps the most useful aspect of IEEE floating-point is how divide-by-zero is handled; for any positive real number a, the following rules involving division by zero values hold:
+a/+0 = +∞
–a/+0 = –∞
There are many numeric computations that become much simpler if the programmer takes advantage of the IEEE rules. For example, consider the expression:
a = 1 / (1/b + 1/c)
Such expressions arise with resistors and lenses. If divide-by-zero resulted in a program crash (as was true in many systems before IEEE floating-point), then two if statements would be required to check for small or zero values of b or c. Instead, with IEEE floating-point, if b or c is zero, we will get a zero value for a as desired. Another common technique to avoid special checks is to take advantage of the Boolean properties of NaN. Consider the following code segment:
a = f(x)
if (a > 0) then
    do something
Here, the function f may return “ugly” values such as ∞ or NaN, but the if condition is still well-defined: it is false for a = NaN or a = –∞ and true for a = +∞. With care in deciding which values are returned, often the if can make the right choice, with no special checks needed. This makes programs smaller, more robust, and more efficient.
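Both techniques can be tried out in a few lines. The sketch below assumes the expression under discussion is the parallel-combination formula a = 1/(1/b + 1/c) familiar from resistors and thin lenses. Because Python raises `ZeroDivisionError` where IEEE hardware would return an infinity, the hypothetical helper `ieee_div` restores the IEEE result; in a language such as C++ on IEEE hardware, the plain division operator would typically behave this way already.

```python
import math

def ieee_div(num, den):
    """Divide with IEEE semantics (hypothetical helper for this sketch)."""
    try:
        return num / den
    except ZeroDivisionError:
        if num == 0 or math.isnan(num):
            return math.nan                      # 0/0 is NaN
        sign = math.copysign(1.0, num) * math.copysign(1.0, den)
        return math.copysign(math.inf, sign)     # a/0 is a signed infinity

def parallel(b, c):
    """a = 1 / (1/b + 1/c), written with no explicit checks for zero."""
    return ieee_div(1.0, ieee_div(1.0, b) + ieee_div(1.0, c))

def is_positive(a):
    """The comparison stays well-defined for 'ugly' values:
    False for NaN and -infinity, True for +infinity."""
    return a > 0
```

With b = 0, the inner term 1/b becomes +∞, the sum stays +∞, and the outer division returns 0, so `parallel(0.0, 5.0)` evaluates to zero without any special-case branch.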
There are no magic rules for making code more efficient. Efficiency is achieved through careful tradeoffs, and these tradeoffs are different for different architectures. However, for the foreseeable future, a good heuristic is that programmers should pay more attention to memory access patterns than to operation counts. This is the opposite of the best heuristic of two decades ago. This switch has occurred because the speed of memory has not kept pace with the speed of processors. Since that trend continues, the importance of limited and coherent memory access for optimization should only increase.
A reasonable approach to making code fast is to proceed in the following order, taking only those steps which are needed:
Write the code in the most straightforward way possible. Compute intermediate results as needed on the fly rather than storing them.
Compile in optimized mode.
Use whatever profiling tools exist to find critical bottlenecks.
Examine data structures to look for ways to improve locality. If possible, make data unit sizes match the cache/page size on the target architecture.
If profiling reveals bottlenecks in numeric computations, examine the assembly code generated by the compiler for missed efficiencies. Rewrite source code to solve any problems you find.
The most important of these steps is the first one. Most “optimizations” make the code harder to read without speeding things up. In addition, time spent upfront optimizing code is usually better spent correcting bugs or adding features. Also, beware of suggestions from old texts; some classic tricks such as using integers instead of reals may no longer yield speed because modern CPUs can usually perform floating-point operations just as fast as they perform integer operations. In all situations, profiling is needed to be sure of the merit of any optimization for a specific machine and compiler.
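As an illustration of the profiling step, the following sketch wraps a function call in Python's standard `cProfile` module and returns a short report. The workload `shade_all` is a hypothetical stand-in for a per-pixel computation; other languages have their own profilers, and the same workflow applies.

```python
import cProfile
import io
import pstats

def shade_all(n):
    """Hypothetical stand-in for a per-pixel workload."""
    total = 0.0
    for i in range(n):
        total += (i % 7) * 0.5
    return total

def profile(func, *args):
    """Run func(*args) under the profiler; return (result, report text)."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = func(*args)
    profiler.disable()
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(5)  # five most expensive entries
    return result, stream.getvalue()
```

Calling `profile(shade_all, 100000)` returns the function's result together with a table of where the time went, which is the evidence to consult before rewriting anything.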
Certain common strategies are often useful in graphics programming. In this section, we provide some advice that you may find helpful as you implement the methods you learn about in this book.
A key part of any graphics program is to have good classes or routines for geometric entities such as vectors and matrices, as well as graphics entities such as RGB colors and images. These routines should be made as clean and efficient as possible. A universal design question is whether locations and displacements should be separate classes because they have different operations; e.g., a location multiplied by one-half makes no geometric sense while one-half of a displacement does (Goldman, 1985; DeRose, 1989). There is little agreement on this question, which can spur hours of heated debate among graphics practitioners, but for the sake of example, let’s assume we will not make the distinction.
This implies that some basic classes to be written include
vector2. A 2D vector class that stores an x- and y-component. It should store these components in a length-2 array so that an indexing operator can be well supported. You should also include operations for vector addition, vector subtraction, dot product, cross product, scalar multiplication, and scalar division.
vector3. A 3D vector class analogous to vector2.
hvector. A homogeneous vector with four components (see Chapter 8).
rgb. An RGB color that stores three components. You should also include operations for RGB addition, RGB subtraction, RGB multiplication, scalar multiplication, and scalar division.
transform. A 4 × 4 matrix for transformations. You should include a matrix multiply and member functions to apply to locations, directions, and surface normal vectors. As shown in Chapter 7, these are all different.
image. A 2D array of RGB pixels with an output operation.
In addition, you might or might not want to add classes for intervals, orthonormal bases, and coordinate frames.
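A minimal sketch of the 3D vector class described above might look as follows in Python. The class and method names are our own choices for illustration, and a production version would add more operations (length, normalization, and so on).

```python
class Vector3:
    """3D vector whose components live in a length-3 list, so indexing works."""

    def __init__(self, x, y, z):
        self.e = [float(x), float(y), float(z)]

    def __getitem__(self, i):
        return self.e[i]

    def __eq__(self, other):
        return self.e == other.e

    def __repr__(self):
        return "Vector3({}, {}, {})".format(*self.e)

    def __add__(self, other):
        return Vector3(*(a + b for a, b in zip(self.e, other.e)))

    def __sub__(self, other):
        return Vector3(*(a - b for a, b in zip(self.e, other.e)))

    def __mul__(self, s):            # scalar multiplication: v * s
        return Vector3(*(a * s for a in self.e))

    __rmul__ = __mul__               # also allow s * v

    def __truediv__(self, s):        # scalar division: v / s
        return Vector3(*(a / s for a in self.e))

    def dot(self, other):
        return sum(a * b for a, b in zip(self.e, other.e))

    def cross(self, other):
        a, b = self.e, other.e
        return Vector3(a[1] * b[2] - a[2] * b[1],
                       a[2] * b[0] - a[0] * b[2],
                       a[0] * b[1] - a[1] * b[0])
```

For example, `Vector3(1, 0, 0).cross(Vector3(0, 1, 0))` gives the z-axis direction.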
Modern architecture suggests that keeping memory use down and maintaining coherent memory access are the keys to efficiency. This suggests using single-precision data. However, avoiding numerical problems suggests using double-precision arithmetic. The tradeoffs depend on the program, but it is nice to have a default in your class definitions.
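The tradeoff can be seen directly. Python floats are double precision, so the sketch below uses the standard `struct` module to round a value to single precision and back; the helper name `to_float32` is our own.

```python
import struct

def to_float32(x):
    """Round a double-precision Python float to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

# Storage: single precision takes half the memory of double precision.
single_bytes = struct.calcsize("f")   # 4
double_bytes = struct.calcsize("d")   # 8

# Precision: integers above 2**24 are no longer all representable in single.
lost = to_float32(16777217.0)         # 2**24 + 1 rounds to 16777216.0
```

Halving the storage doubles how many values fit in a cache line, but, as the last line shows, single precision cannot even distinguish neighboring integers beyond about sixteen million.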
If you ask around, you may find that as programmers become more experienced, they use traditional debuggers less and less. One reason for this is that using such debuggers is more awkward for complex programs than for simple programs. Another reason is that the most difficult errors are conceptual ones where the wrong thing is being implemented, and it is easy to waste large amounts of time stepping through variable values without detecting such cases. We have found several debugging strategies to be particularly useful in graphics.
In graphics programs, there is an alternative to traditional debugging that is often very useful. The downside to it is that it is very similar to what computer programmers are taught not to do early in their careers, so you may feel “naughty” if you do it: we create an image and observe what is wrong with it. Then, we develop a hypothesis about what is causing the problem and test it. For example, in a ray-tracing program we might have many somewhat random looking dark pixels. This is the classic “shadow acne” problem that most people run into when they write a ray tracer. Traditional debugging is not helpful here; instead, we must realize that the shadow rays are hitting the surface being shaded. We might notice that the color of the dark spots is the ambient color, so the direct lighting is what is missing. Direct lighting can be turned off in shadow, so you might hypothesize that these points are incorrectly being tagged as in shadow when they are not. To test this hypothesis, we could turn off the shadowing check and recompile. This would indicate that these are false shadow tests, and we could continue our detective work. The key reason that this method can sometimes be good practice is that we never had to spot a false value or really determine our conceptual error. Instead, we just narrowed in on our conceptual error experimentally. Typically, only a few trials are needed to track things down, and this type of debugging is enjoyable.
In many cases, the easiest channel by which to get debugging information out of a graphics program is the output image itself. If you want to know the value of some variable for part of a computation that runs for every pixel, you can just modify your program temporarily to copy that value directly to the output image and skip the rest of the calculations that would normally be done. For instance, if you suspect a problem with surface normals is causing a problem with shading, you can copy the normal vectors directly to the image (x goes to red, y goes to green, z goes to blue), resulting in a color-coded illustration of the vectors actually being used in your computation. Or, if you suspect a particular value is sometimes out of its valid range, make your program write bright red pixels where that happens. Other common tricks include drawing the back sides of surfaces with an obvious color (when they are not supposed to be visible), coloring the image by the ID numbers of the objects, or coloring pixels by the amount of work they took to compute.
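Two of these tricks are sketched below. The remapping of each normal component from [-1, 1] to [0, 255] is an assumption of this sketch (one common choice); the function names are likewise our own.

```python
def normal_to_rgb(n):
    """Color-code a unit normal: x goes to red, y to green, z to blue.

    Components in [-1, 1] are remapped to 0..255 via (c + 1) / 2,
    an assumed convention for this sketch.
    """
    def channel(c):
        c = max(-1.0, min(1.0, c))            # clamp slightly-off inputs
        return int(round((c + 1.0) * 0.5 * 255))
    return tuple(channel(c) for c in n)

def flag_out_of_range(value, lo, hi, normal_color):
    """Return bright red where `value` leaves [lo, hi], else the usual color.

    NaN fails every comparison, so it is flagged as well.
    """
    if not (lo <= value <= hi):
        return (255, 0, 0)
    return normal_color
```

With this mapping, a normal pointing straight along +z appears as a light blue pixel, so a flat surface facing the viewer should show up as a uniform patch of that color.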
There are still cases, particularly when the scientific method seems to have led to a contradiction, when there’s no substitute for observing exactly what is going on. The trouble is that graphics programs often involve many, many executions of the same code (once per pixel, for instance, or once per triangle), making it completely impractical to step through in the debugger from the start. And the most difficult bugs usually only occur for complicated inputs.
A useful approach is to “set a trap” for the bug. First, make sure your program is deterministic—run it in a single thread and make sure that all random numbers are computed from fixed seeds. Then, find out which pixel or triangle is exhibiting the bug and add a statement before the code you suspect is incorrect that will be executed only for the suspect case. For instance, if you find that pixel (126, 247) exhibits the bug, then add
if x = 126 and y = 247 then print “blarg!”
If you set a breakpoint on the print statement, you can drop into the debugger just before the pixel you’re interested in is computed. Some debuggers have a “conditional breakpoint” feature that can achieve the same thing without modifying the code.
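In Python, the same trap might be sketched as follows. The `render` function, its `shade` callback, and the `on_trap` hook are hypothetical names used only for illustration; in practice you would set a breakpoint on the trap line or call the built-in `breakpoint()` there.

```python
import random

def render(width, height, shade, trap=None, on_trap=print):
    """Shade every pixel; if `trap` is an (x, y) pair, stop just before it."""
    rng = random.Random(12345)   # fixed seed, so every run is identical
    image = [[None] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if (x, y) == trap:
                # Executed only for the suspect pixel: set a breakpoint here,
                # or replace this call with breakpoint().
                on_trap("blarg!", x, y)
            image[y][x] = shade(x, y, rng)
    return image
```

Because the random number generator is seeded with a constant and the loop runs in a single thread, the suspect pixel sees exactly the same inputs on every run, which is what makes the trap reliable.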
In the cases where the program crashes, a traditional debugger is useful for pinpointing the site of the crash. You should then start backtracking in the program, using asserts and recompiles, to find where the program went wrong. These asserts should be left in the program for potential future bugs you will add. This again means the traditional step-through process is avoided, because that would not be adding the valuable asserts to your program.
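As a small illustration of the kind of assert worth leaving in place, consider a routine that expects unit-length input vectors. The diffuse shading term used here is only a hypothetical example of such a routine, and the tolerance is an arbitrary choice.

```python
import math

def length(v):
    return math.sqrt(sum(c * c for c in v))

def shade_diffuse(normal, light_dir):
    """Return max(0, n . l) for two unit vectors given as 3-tuples."""
    # These checks document the routine's assumptions and catch future bugs.
    assert math.isclose(length(normal), 1.0, rel_tol=1e-6), "normal is not unit length"
    assert math.isclose(length(light_dir), 1.0, rel_tol=1e-6), "light_dir is not unit length"
    cosine = sum(n * l for n, l in zip(normal, light_dir))
    assert not math.isnan(cosine), "NaN reached the shading code"
    return max(0.0, cosine)
```

If a caller later forgets to normalize a vector, the program stops at the assert with a message naming the violated assumption, rather than silently producing a slightly wrong image.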
Often, it is hard to understand what your program is doing, because it computes a lot of intermediate results before it finally goes wrong. The situation is similar to a scientific experiment that measures a lot of data, and one solution is the same: make good plots and illustrations for yourself to understand what the data mean. For instance, in a ray tracer you might write code to visualize ray trees so you can see what paths contributed to a pixel, or in an image resampling routine you might make plots that show all the points where samples are being taken from the input. Time spent writing code to visualize your program’s internal state is also repaid in a better understanding of its behavior when it comes time to optimize it.
The discussion of software engineering is influenced by the Effective C++ series (Meyers, 1995, 1997), the Extreme Programming movement (Beck & Andres, 2004), and The Practice of Programming (Kernighan & Pike, 1999). The discussion of experimental debugging is based on discussions with Steve Parker.
There are a number of annual conferences related to computer graphics, including ACM SIGGRAPH and SIGGRAPH Asia, Graphics Interface, the Game Developers Conference (GDC), Eurographics, Pacific Graphics, High Performance Graphics, the Eurographics Symposium on Rendering, and IEEE VisWeek. These can be readily found by web searches on their names.
Much of graphics is just translating math directly into code. The cleaner the math, the cleaner the resulting code; so much of this book concentrates on using just the right math for the job. This chapter reviews various tools from high school and college mathematics and is designed to be used more as a reference than as a tutorial. It may appear to be a hodge-podge of topics and indeed it is; each topic is chosen because it is a bit unusual in “standard” math curricula, because it is of central importance in graphics, or because it is not typically treated from a geometric standpoint. In addition to establishing a review of the notation used in this book, this chapter also emphasizes a few points that are sometimes skipped in the standard undergraduate curricula, such as barycentric coordinates on triangles. This chapter is not intended to be a rigorous treatment of the material; instead, intuition and geometric interpretation are emphasized. A discussion of linear algebra is deferred until Chapter 6 just before transformation matrices are discussed. Readers are encouraged to skim this chapter to familiarize themselves with the topics covered and to refer back to it as needed. The exercises at the end of this chapter may be useful in determining which topics need a refresher.
Mappings, also called functions, are basic to mathematics and programming. Like a function in a program, a mapping in math takes an argument of one type and maps it to (returns) an object of a particular type. In a program, we say “type”; in math, we would identify the set. When we have an object that is a member of a set, we use the ∈ symbol. For example,
a ∈ S,
can be read “a is a member of set S.” Given any two sets A and B, we can create a third set by taking the Cartesian product of the two sets, denoted A × B. This set A × B is composed of all possible ordered pairs (a, b) where a ∈ A and b ∈ B. As a shorthand, we use the notation A^{2} to denote A × A. We can extend the Cartesian product to create a set of all possible ordered triples from three sets and so on for arbitrarily long ordered tuples from arbitrarily many sets.
Common sets of interest include
ℝ—the real numbers;
ℝ ^{+}—the nonnegative real numbers (includes zero);
ℝ ^{2}—the ordered pairs in the real 2D plane;
ℝ ^{n}—the points in n-dimensional Cartesian space;
Z—the integers;
S^{2}—the set of 3D points (points in ℝ^{3}) on the unit sphere.
Note that although S^{2} is composed of points embedded in three-dimensional space, it is on a surface that can be parameterized with two variables, so it can be thought of as a 2D set. Notation for mappings uses the arrow and a colon, for example,
$$f : \mathbb{R} \mapsto \mathbb{Z},$$
which you can read as “There is a function called f that takes a real number as input and maps it to an integer.” Here, the set that comes before the arrow is called the domain of the function, and the set on the right-hand side is called the target. Computer programmers might be more comfortable with the following equivalent language: “There is a function called f which has one real argument and returns an integer.” In other words, the set notation above is equivalent to the common programming notation:
integer f(real)
So the colon-arrow notation can be thought of as a programming syntax. It’s that simple.
The point f (a) is called the image of a, and the image of a set A (a subset of the domain) is the subset of the target that contains the images of all points in A. The image of the whole domain is called the range of the function.
If we have a function f : A ⟼ B, there may exist an inverse function f^{–1}: B ⟼ A, which is defined by the rule f^{–1}(b) = a where b = f (a) . This definition only works if every b ∈ B is an image of some point under f (i.e., the range equals the target) and if there is only one such point (i.e., there is only one a for which f (a) = b). Such mappings or functions are called bijections. A bijection maps every a ∈ A to a unique b ∈ B, and for every b ∈ B, there is exactly one a ∈ A such that f (a) = b (Figure 2.1). A bijection between a group of riders and horses indicates that everybody rides a single horse, and every horse is ridden. The two functions would be rider (horse) and horse (rider). These are inverse functions of each other. Functions that are not bijections have no inverse (Figure 2.2).
An example of a bijection is f : ℝ ⟼ ℝ, with f (x) = x^{3}. The inverse is f^{–1}(x) = ∛x. This example shows that the standard notation can be somewhat awkward because x is used as a dummy variable in both f and f^{–1}. It is sometimes more intuitive to use different dummy variables, with y = f (x) and x = f^{–1}(y). This yields the more intuitive y = x^{3} and x = ∛y. An example of a function that does not have an inverse is sqr : ℝ ⟼ ℝ, where sqr(x) = x^{2}. This is true for two reasons: first x^{2} = (–x)^{2}, and second no members of the domain map to the negative portions of the target. Note that we can define an inverse if we restrict the domain and range to ℝ^{+}. Then, √x is a valid inverse.
Often, we would like to specify that a function deals with real numbers that are restricted in value. One such constraint is to specify an interval. An example of an interval is the real numbers between zero and one, not including zero or one. We denote this (0, 1) . Because it does not include its endpoints, this is referred to as an open interval. The corresponding closed interval, which does contain its endpoints, is denoted with square brackets: [0, 1]. This notation can be mixed; i.e., [0, 1) includes zero but not one. When writing an interval [a, b], we assume that a ≤ b. The three common ways to represent an interval are shown in Figure 2.3. The Cartesian products of intervals are often used. For example, to indicate that a point x is in the unit cube in 3D, we say x ∈ [0, 1]^{3}.
Intervals are particularly useful in conjunction with set operations: intersection, union, and difference. For example, the intersection of two intervals is the set of points they have in common. The symbol ∩ is used for intersection. For example, [3, 5)∩[4, 6] = [4, 5) . For unions, the symbol ∪ is used to denote points in either interval. For example, [3, 5) ∪ [4, 6] = [3, 6]. Unlike the first two operators, the difference operator produces different results depending on argument order. The minus sign is used for the difference operator, which returns the points in the left interval that are not also in the right. For example, [3, 5) – [4, 6] = [3, 4) and [4, 6] – [3, 5) = [5, 6]. These operations are particularly easy to visualize using interval diagrams (Figure 2.4).
Although not as prevalent today as they were before calculators, logarithms are often useful in problems where equations with exponential terms arise. By definition, every logarithm has a base a. The “log base a” of x is written log_{a} x and is defined as “the exponent to which a must be raised to get x,” i.e.,
$$y = \log_a x \iff a^y = x.$$
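Note that the logarithm base a and the function that raises a to a power are inverses of each other. This basic definition has several consequences:
$$\begin{aligned}
a^{\log_a x} &= x, \\
\log_a (a^x) &= x, \\
\log_a (xy) &= \log_a x + \log_a y, \\
\log_a (x/y) &= \log_a x - \log_a y, \\
\log_a x &= \log_a b \, \log_b x.
\end{aligned}$$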
When we apply calculus to logarithms, the special number e = 2.718... often turns up. The logarithm with base e is called the natural logarithm. We adopt the common shorthand ln to denote it:
$$\ln x \equiv \log_e x.$$
Note that the “≡” symbol can be read “is equivalent by definition.” Like π, the special number e arises in a remarkable number of contexts. Many fields use a particular base in addition to e for manipulations and omit the base in their notation, i.e., log x. For example, astronomers often use base 10 and theoretical computer scientists often use base 2. Because computer graphics borrows technology from many fields, we will avoid this shorthand.
The derivatives of logarithms and exponents illuminate why the natural logarithm is “natural”:
$$\frac{d}{dx}\log_a x = \frac{1}{x \ln a}, \qquad \frac{d}{dx}\, a^x = a^x \ln a.$$
The constant multipliers above are unity only for a = e.
A quadratic equation has the form
$$Ax^2 + Bx + C = 0,$$
where x is a real unknown, and A, B, and C are known constants. If you think of a 2D xy plot with y = Ax^{2} + Bx + C, the solution is just whatever x values are “zero crossings” in y. Because y = Ax^{2} + Bx + C is a parabola, there will be zero, one, or two real solutions depending on whether the parabola misses, grazes, or hits the x-axis (Figure 2.5).
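To solve the quadratic equation analytically, we first divide by A:
$$x^2 + \frac{B}{A}x + \frac{C}{A} = 0.$$
Then, we “complete the square” to group terms:
$$\left(x + \frac{B}{2A}\right)^2 - \frac{B^2}{4A^2} + \frac{C}{A} = 0.$$
Moving the constant portion to the right-hand side and taking the square root give
$$x + \frac{B}{2A} = \pm\sqrt{\frac{B^2}{4A^2} - \frac{C}{A}}.$$
Subtracting B/(2A) from both sides and grouping terms with the denominator 2A gives the familiar form:^{1}
$$x = \frac{-B \pm \sqrt{B^2 - 4AC}}{2A}.$$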
Here, the “±” symbol means there are two solutions, one with a plus sign and one with a minus sign. Thus, 3 ± 1 equals “two or four.” Note that the term that determines the number of real solutions is
$$D \equiv B^2 - 4AC,$$
which is called the discriminant of the quadratic equation. If D > 0, there are two real solutions (also called roots). If D = 0, there is one real solution (a “double” root). If D < 0, there are no real solutions.
For example, the roots of 2x^{2} + 6x + 4 = 0 are x = –1 and x = –2, and the equation x^{2} + x + 1 = 0 has no real solutions. The discriminants of these equations are D = 4 and D = –3, respectively, so we expect the number of solutions given. In programs, it is usually a good idea to evaluate D first and return “no roots” without taking the square root if D is negative.
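The advice to evaluate D first can be sketched as follows. This is a minimal illustration, not a numerically robust solver; the function name `solve_quadratic` and the choice to return a sorted list are assumptions made for the example.

```python
import math

def solve_quadratic(A, B, C):
    """Return the real roots of A x^2 + B x + C = 0 as a sorted list (A != 0)."""
    if A == 0:
        raise ValueError("not a quadratic: A must be nonzero")
    D = B * B - 4.0 * A * C          # discriminant
    if D < 0:
        return []                    # no real roots; the square root is never taken
    if D == 0:
        return [-B / (2.0 * A)]      # one real solution (a double root)
    s = math.sqrt(D)
    return sorted([(-B - s) / (2.0 * A), (-B + s) / (2.0 * A)])

print(solve_quadratic(2, 6, 4))   # [-2.0, -1.0]
print(solve_quadratic(1, 1, 1))   # []
```

The two calls reproduce the examples in the text: D = 4 gives two roots, and D = –3 gives none.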
In graphics, we use basic trigonometry in many contexts. Usually, it is nothing too fancy, and it often helps to remember the basic definitions.
Although we take angles somewhat for granted, we should return to their definition so we can extend the idea of the angle onto the sphere. An angle is formed between two half-lines (infinite rays stemming from an origin) or directions, and some convention must be used to decide between the two possibilities for the angle created between them as shown in Figure 2.6. An angle is defined by the length of the arc segment it cuts out on the unit circle. A common convention is that the smaller arc length is used, and the sign of the angle is determined by the order in which the two half-lines are specified. Using that convention, all angles are in the range [–π, π].
Each of these angles is the length of the arc of the unit circle that is “cut” by the two directions. Because the perimeter of the unit circle is 2π, the two possible angles sum to 2π. The unit of these arc lengths is radians. Another common unit is degrees, where the perimeter of the circle is 360°. Thus, an angle that is π radians is 180 degrees, usually denoted 180°. The conversion between degrees and radians is
$$\text{degrees} = \frac{180}{\pi}\,\text{radians}, \qquad \text{radians} = \frac{\pi}{180}\,\text{degrees}.$$
Given a right triangle with sides of length a, o, and h, where h is the length of the longest side (which is always opposite the right angle), or hypotenuse, an important relation is described by the Pythagorean theorem:
$$a^2 + o^2 = h^2.$$
You can see that this is true from Figure 2.7, where the big square has area (a+o)^{2}, the four triangles have the combined area 2ao, and the center square has area h^{2}.
Because the triangles and inner square subdivide the larger square evenly, we have 2ao + h^{2} = (a + o)^{2}, which is easily manipulated to the form above.
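We define sine and cosine of ϕ, as well as the other ratio-based trigonometric expressions:
$$\begin{aligned}
\sin\phi &\equiv o/h, & \csc\phi &\equiv h/o, \\
\cos\phi &\equiv a/h, & \sec\phi &\equiv h/a, \\
\tan\phi &\equiv o/a, & \cot\phi &\equiv a/o.
\end{aligned}$$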
These definitions allow us to set up polar coordinates, where a point is coded as a distance from the origin and a signed angle relative to the positive x-axis (Figure 2.8). Note the convention that angles are in the range ϕ ∈ (–π, π], and that the positive angles are counterclockwise from the positive x-axis. This convention that counterclockwise maps to positive numbers is arbitrary, but it is used in many contexts in graphics so it is worth committing to memory.
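Trigonometric functions are periodic and can take any angle as an argument. For example, sin(A) = sin(A + 2π). This means the functions are not invertible when considered with the domain ℝ. This problem is avoided by restricting the range of standard inverse functions, and this is done in a standard way in almost all modern math libraries (e.g., Plauger (1991)). The domains and ranges are
$$\begin{aligned}
\text{asin} &: [-1, 1] \mapsto [-\pi/2, \pi/2], \\
\text{acos} &: [-1, 1] \mapsto [0, \pi], \\
\text{atan} &: \mathbb{R} \mapsto [-\pi/2, \pi/2], \\
\text{atan2} &: \mathbb{R}^2 \mapsto [-\pi, \pi].
\end{aligned}$$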
The last function, atan2(s, c), is often very useful. It takes an s value proportional to sin A and a c value that scales cos A by the same factor and returns A. The factor is assumed to be positive. One way to think of this is that it returns the angle, in polar coordinates, of the 2D Cartesian point with x = c and y = s (Figure 2.9).
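As a small illustration, Python's standard `math.atan2(y, x)` follows this convention, taking the sine-proportional value first. The helper names `to_polar` and `from_polar` below are hypothetical, chosen only for this sketch.

```python
import math

def to_polar(x, y):
    """Return (r, phi), with phi measured counterclockwise from the positive x-axis."""
    return math.hypot(x, y), math.atan2(y, x)   # note the (y, x) argument order

def from_polar(r, phi):
    """Return the Cartesian point (x, y) for polar coordinates (r, phi)."""
    return r * math.cos(phi), r * math.sin(phi)

r, phi = to_polar(-1.0, 1.0)
print(r, math.degrees(phi))   # a point in the second quadrant, near 135 degrees

# Scaling sine and cosine by the same positive factor does not change the angle.
A = 2.0
print(math.atan2(3.0 * math.sin(A), 3.0 * math.cos(A)))
```

Unlike atan, which sees only the ratio s/c, atan2 receives both values and can therefore tell opposite quadrants apart.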
This section lists without derivation a variety of useful trigonometric identities.
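Half-angle identities:
$$\sin^2(A/2) = (1 - \cos A)/2, \qquad \cos^2(A/2) = (1 + \cos A)/2.$$
Product identities:
$$\begin{aligned}
\sin A \sin B &= -(\cos(A+B) - \cos(A-B))/2, \\
\sin A \cos B &= (\sin(A+B) + \sin(A-B))/2, \\
\cos A \cos B &= (\cos(A+B) + \cos(A-B))/2.
\end{aligned}$$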
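The following identities are for arbitrary triangles with side lengths a, b, and c, each with an angle opposite it given by A, B, C, respectively (Figure 2.10),
$$\begin{aligned}
\frac{\sin A}{a} &= \frac{\sin B}{b} = \frac{\sin C}{c} && \text{(law of sines)}, \\
c^2 &= a^2 + b^2 - 2ab\cos C && \text{(law of cosines)}, \\
\frac{a+b}{a-b} &= \frac{\tan\left(\frac{A+B}{2}\right)}{\tan\left(\frac{A-B}{2}\right)} && \text{(law of tangents)}.
\end{aligned}$$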
The area of a triangle can also be computed in terms of these side lengths:
$$\text{triangle area} = \frac{1}{4}\sqrt{(a+b+c)(-a+b+c)(a-b+c)(a+b-c)}.$$
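A short sketch of how side lengths alone determine a triangle's area and angles, using the standard side-length area formula and the law of cosines. The function names are hypothetical, and rejecting side lengths that cannot form a triangle is an assumption of this example.

```python
import math

def triangle_area(a, b, c):
    """Area of a triangle from its three side lengths."""
    p = (a + b + c) * (-a + b + c) * (a - b + c) * (a + b - c)
    if p < 0:
        raise ValueError("side lengths violate the triangle inequality")
    return 0.25 * math.sqrt(p)

def angle_opposite(c, a, b):
    """Angle C (radians) opposite side c, from the law of cosines."""
    return math.acos((a * a + b * b - c * c) / (2.0 * a * b))

print(triangle_area(3, 4, 5))                  # 6.0
print(math.degrees(angle_opposite(5, 3, 4)))   # the right angle of the 3-4-5 triangle
```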
Traditional trigonometry in this section deals with triangles on the plane. Triangles can be defined on non-planar surfaces as well, and one that arises in many fields, astronomy, for example, is triangles on the unit-radius sphere. These spherical triangles have sides that are segments of the great circles (unit-radius circles) on the sphere. The study of these triangles is a field called spherical trigonometry; it is not used that commonly in graphics, but it is sometimes critical when it does arise. We won't discuss the details here, but we want the reader to be aware that this area exists for when those problems do arise, and that it offers many useful rules, such as a spherical law of cosines and a spherical law of sines. For an example of the machinery of spherical trigonometry being used, see the paper on sampling triangle lights (which project to a spherical triangle) (Arvo, 1995b).
Of more central importance to computer graphics are solid angles. While angles allow us to quantify things like “what is the separation of those two poles in my visual field,” solid angles let us quantify things like “how much of my visual field does that airplane cover.” For traditional angles, we project the poles onto the unit circle and measure the arc length between them. We work with angles often enough that many of us can forget this definition because it is all so intuitive to us now. Solid angles are just as simple, but they may seem more confusing because most of us learn about them as adults. For solid angles, we take the directions that “see” the airplane, project them onto the unit sphere, and measure the area they cover. This area is the solid angle in the same way the arc length is the angle. While angles are measured in radians and sum to 2π (the total length of a unit circle), solid angles are measured in steradians and sum to 4π (the total area of a unit sphere).
A vector describes a length and a direction. It can be usefully represented by an arrow. Two vectors are equal if they have the same length and direction even if we think of them as being located in different places (Figure 2.11). As much as possible, you should think of a vector as an arrow and not as coordinates or numbers. At some point, we will have to represent vectors as numbers in our programs, but even in code, they should be manipulated as objects and only the low-level vector operations should know about their numeric representation (DeRose, 1989). Vectors will be represented as bold characters, e.g., a. A vector’s length is denoted ||a||. A unit vector is any vector whose length is one. The zero vector is the vector of zero length. The direction of the zero vector is undefined.
Vectors can be used to represent many different things. For example, they can be used to store an offset, also called a displacement. If we know “the treasure is buried two paces east and three paces north of the secret meeting place,” then we know the offset, but we don’t know where to start. Vectors can also be used to store a location, another word for position or point. Locations can be represented as a displacement from another location. Usually, there is some understood origin location from which all other locations are stored as offsets. Note that locations are not vectors. As we shall discuss, you can add two vectors. However, it usually does not make sense to add two locations unless it is an intermediate operation when computing weighted averages of a location (Goldman, 1985). Adding two offsets does make sense, so that is one reason why offsets are vectors. But this emphasizes that a location is not an offset; it is an offset from a specific origin location. The offset by itself is not the location.
Vectors have most of the usual arithmetic operations that we associate with real numbers. Two vectors are equal if and only if they have the same length and direction. Two vectors are added according to the parallelogram rule. This rule states that the sum of two vectors is found by placing the tail of either vector against the head of the other (Figure 2.12). The sum vector is the vector that “completes the triangle” started by the two vectors. The parallelogram is formed by taking the sum in either order. This emphasizes that vector addition is commutative:
$$\mathbf{a} + \mathbf{b} = \mathbf{b} + \mathbf{a}.$$
Note that the parallelogram rule just formalizes our intuition about displacements. Think of walking along one vector, tail to head, and then walking along the other. The net displacement is just the parallelogram diagonal. You can also create a unary minus for a vector: –a (Figure 2.13) is a vector with the same length as a but opposite direction. This allows us to also define subtraction:
$$\mathbf{b} - \mathbf{a} \equiv -\mathbf{a} + \mathbf{b}.$$
You can visualize vector subtraction with a parallelogram (Figure 2.14). We can write
$$\mathbf{a} + (\mathbf{b} - \mathbf{a}) = \mathbf{b}.$$
Vectors can also be multiplied. In fact, there are several kinds of products involving vectors. First, we can scale the vector by multiplying it by a real number k.
This just multiplies the vector’s length without changing its direction. For example, 3.5a is a vector in the same direction as a, but it is 3.5 times as long as a. We discuss two products involving two vectors, the dot product and the cross product, later in this section, and a product involving three vectors, the determinant, in Chapter 6.
A 2D vector can be written as a combination of any two nonzero vectors which are not parallel. This property of the two vectors is called linear independence. Two linearly independent vectors form a 2D basis, and the vectors are thus referred to as basis vectors. For example, a vector c may be expressed as a combination of two basis vectors a and b (Figure 2.15):
$$\mathbf{c} = a_c \mathbf{a} + b_c \mathbf{b}. \tag{2.3}$$
Note that the weights a_{c} and b_{c} are unique. Bases are especially useful if the two vectors are orthogonal; i.e., they are at right angles to each other. It is even more useful if they are also unit vectors in which case they are orthonormal. If we assume two such “special” vectors x and y are known to us, then we can use them to represent all other vectors in a Cartesian coordinate system, where each vector is represented as two real numbers. For example, a vector a might be represented as
$$\mathbf{a} = x_a \mathbf{x} + y_a \mathbf{y},$$
where x_{a} and y_{a} are the real Cartesian coordinates of the 2D vector a (Figure 2.16). Note that this is not really any different conceptually from Equation (2.3), where the basis vectors were not orthonormal. But there are several advantages to a Cartesian coordinate system. For instance, by the Pythagorean theorem, the length of a is
$$\|\mathbf{a}\| = \sqrt{x_a^2 + y_a^2}.$$
It is also simple to compute dot products, cross products, and coordinates of vectors in Cartesian systems, as we’ll see in the following sections.
By convention, we write the coordinates of a either as an ordered pair (x_{a}, y_{a}) or a column matrix:
$$\mathbf{a} = \begin{bmatrix} x_a \\ y_a \end{bmatrix}.$$
The form we use will depend on typographic convenience. We will also occasionally write the vector as a row matrix, which we will indicate as a^{T}:
$$\mathbf{a}^T = \begin{bmatrix} x_a & y_a \end{bmatrix}.$$
We can also represent 3D, 4D, etc., vectors in Cartesian coordinates. For the 3D case, we use a basis vector z that is orthogonal to both x and y.
The simplest way to multiply two vectors is the dot product. The dot product of a and b is denoted a · b and is often called the scalar product because it returns a scalar. The dot product returns a value related to its arguments’ lengths and the angle ϕ between them (Figure 2.17):
$$\mathbf{a} \cdot \mathbf{b} = \|\mathbf{a}\| \, \|\mathbf{b}\| \cos\phi.$$
The most common use of the dot product in graphics programs is to compute the cosine of the angle between two vectors.
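The dot product can also be used to find the projection of one vector onto another. This is the length a→b of a vector a that is projected at right angles onto a vector b (Figure 2.18):
$$\mathbf{a}{\rightarrow}\mathbf{b} = \|\mathbf{a}\| \cos\phi = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{b}\|}.$$
The dot product obeys the familiar commutative, associative, and distributive properties we have in real arithmetic:
$$\begin{aligned}
\mathbf{a} \cdot \mathbf{b} &= \mathbf{b} \cdot \mathbf{a}, \\
\mathbf{a} \cdot (\mathbf{b} + \mathbf{c}) &= \mathbf{a} \cdot \mathbf{b} + \mathbf{a} \cdot \mathbf{c}, \\
(k\mathbf{a}) \cdot \mathbf{b} &= \mathbf{a} \cdot (k\mathbf{b}) = k\, \mathbf{a} \cdot \mathbf{b}.
\end{aligned}$$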
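If 2D vectors a and b are expressed in Cartesian coordinates, we can take advantage of x · x = y · y = 1 and x · y = 0 to derive that their dot product is
$$\mathbf{a} \cdot \mathbf{b} = (x_a \mathbf{x} + y_a \mathbf{y}) \cdot (x_b \mathbf{x} + y_b \mathbf{y}) = x_a x_b + y_a y_b.$$
Similarly, in 3D we can find
$$\mathbf{a} \cdot \mathbf{b} = x_a x_b + y_a y_b + z_a z_b.$$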
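The two uses of the dot product described in this section, the cosine of the angle between two vectors and the projected length of one vector on another, can be sketched in Python. Representing vectors as plain tuples and the helper names below are assumptions made for brevity; real code would hide the numeric representation behind a vector class.

```python
import math

def dot(a, b):
    """Dot product of two vectors given as equal-length tuples of coordinates."""
    return sum(ai * bi for ai, bi in zip(a, b))

def length(a):
    return math.sqrt(dot(a, a))

def cos_angle(a, b):
    """Cosine of the angle between two nonzero vectors."""
    return dot(a, b) / (length(a) * length(b))

def projected_length(a, b):
    """Signed length of a projected at right angles onto nonzero b."""
    return dot(a, b) / length(b)

a = (1.0, 1.0, 0.0)
b = (2.0, 0.0, 0.0)
print(dot(a, b))                # 2.0
print(projected_length(a, b))   # 1.0
print(cos_angle(a, b))          # about 0.707, the cosine of 45 degrees
```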
The cross product a × b is usually used only for three-dimensional vectors; generalized cross products are discussed in references given in the chapter notes. The cross product returns a 3D vector that is perpendicular to the two arguments of the cross product. The length of the resulting vector is related to sin ϕ:
$$\|\mathbf{a} \times \mathbf{b}\| = \|\mathbf{a}\| \, \|\mathbf{b}\| \sin\phi.$$
The magnitude ||a × b|| is equal to the area of the parallelogram formed by vectors a and b. In addition, a × b is perpendicular to both a and b (Figure 2.19). Note that there are only two possible directions for such a vector. By definition, the vectors in the direction of the x-, y- and z-axes are given by
$$\mathbf{x} = (1, 0, 0), \qquad \mathbf{y} = (0, 1, 0), \qquad \mathbf{z} = (0, 0, 1),$$
and we set as a convention that x × y must be in the plus or minus z direction. The choice is somewhat arbitrary, but it is standard to assume that
$$\mathbf{z} = \mathbf{x} \times \mathbf{y}.$$
All possible permutations of the three Cartesian unit vectors are
$$\begin{aligned}
\mathbf{x} \times \mathbf{y} &= +\mathbf{z}, & \mathbf{y} \times \mathbf{x} &= -\mathbf{z}, \\
\mathbf{y} \times \mathbf{z} &= +\mathbf{x}, & \mathbf{z} \times \mathbf{y} &= -\mathbf{x}, \\
\mathbf{z} \times \mathbf{x} &= +\mathbf{y}, & \mathbf{x} \times \mathbf{z} &= -\mathbf{y}.
\end{aligned}$$
Because of the sin ϕ property, we also know that a vector cross itself is the zero vector, so x × x = 0 and so on. Note that the cross product is not commutative, i.e., x × y ≠ y × x. The careful observer will note that the above discussion does not allow us to draw an unambiguous picture of how the Cartesian axes relate. More specifically, if we put x and y on a sidewalk, with x pointing east and y pointing north, then does z point up to the sky or into the ground? The usual convention is to have z point to the sky. This is known as a right-handed coordinate system. This name comes from the memory scheme of “grabbing” x with your right palm and fingers and rotating it toward y. The vector z should align with your thumb. This is illustrated in Figure 2.20.
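The cross product has the nice property that
$$\mathbf{a} \times (\mathbf{b} + \mathbf{c}) = \mathbf{a} \times \mathbf{b} + \mathbf{a} \times \mathbf{c},$$
and
$$\mathbf{a} \times (k\mathbf{b}) = k(\mathbf{a} \times \mathbf{b}).$$
However, a consequence of the right-hand rule is
$$\mathbf{a} \times \mathbf{b} = -(\mathbf{b} \times \mathbf{a}).$$
In Cartesian coordinates, we can use an explicit expansion to compute the cross product:
$$\begin{aligned}
\mathbf{a} \times \mathbf{b} &= (x_a \mathbf{x} + y_a \mathbf{y} + z_a \mathbf{z}) \times (x_b \mathbf{x} + y_b \mathbf{y} + z_b \mathbf{z}) \\
&= x_a x_b\, \mathbf{x} \times \mathbf{x} + x_a y_b\, \mathbf{x} \times \mathbf{y} + x_a z_b\, \mathbf{x} \times \mathbf{z} \\
&\quad + y_a x_b\, \mathbf{y} \times \mathbf{x} + y_a y_b\, \mathbf{y} \times \mathbf{y} + y_a z_b\, \mathbf{y} \times \mathbf{z} \\
&\quad + z_a x_b\, \mathbf{z} \times \mathbf{x} + z_a y_b\, \mathbf{z} \times \mathbf{y} + z_a z_b\, \mathbf{z} \times \mathbf{z} \\
&= (y_a z_b - z_a y_b)\mathbf{x} + (z_a x_b - x_a z_b)\mathbf{y} + (x_a y_b - y_a x_b)\mathbf{z}.
\end{aligned}$$
So, in coordinate form,
$$\mathbf{a} \times \mathbf{b} = (y_a z_b - z_a y_b,\; z_a x_b - x_a z_b,\; x_a y_b - y_a x_b).$$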
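The coordinate form of the cross product translates directly into code. The sketch below assumes vectors are stored as 3-tuples of Cartesian coordinates; the function name `cross` is an illustrative choice.

```python
def cross(a, b):
    """Cross product of two 3D vectors given as (x, y, z) tuples."""
    xa, ya, za = a
    xb, yb, zb = b
    return (ya * zb - za * yb,
            za * xb - xa * zb,
            xa * yb - ya * xb)

x = (1.0, 0.0, 0.0)
y = (0.0, 1.0, 0.0)
z = (0.0, 0.0, 1.0)
print(cross(x, y))   # (0.0, 0.0, 1.0), i.e., +z in a right-handed system
print(cross(y, x))   # (0.0, 0.0, -1.0): swapping the arguments flips the sign
```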
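Managing coordinate systems is one of the core tasks of almost any graphics program; the key to this is managing orthonormal bases. Any set of two 2D vectors u and v forms an orthonormal basis provided that they are orthogonal (at right angles) and are each of unit length. Thus,
$$\|\mathbf{u}\| = \|\mathbf{v}\| = 1,$$
and
$$\mathbf{u} \cdot \mathbf{v} = 0.$$
In 3D, three vectors u, v, and w form an orthonormal basis if
$$\|\mathbf{u}\| = \|\mathbf{v}\| = \|\mathbf{w}\| = 1,$$
and
$$\mathbf{u} \cdot \mathbf{v} = \mathbf{v} \cdot \mathbf{w} = \mathbf{w} \cdot \mathbf{u} = 0.$$
This orthonormal basis is right-handed provided
$$\mathbf{w} = \mathbf{u} \times \mathbf{v},$$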
and otherwise, it is left-handed.
Note that the Cartesian canonical orthonormal basis is just one of infinitely many possible orthonormal bases. What makes it special is that it and its implicit origin location are used for low-level representation within a program. Thus, the vectors x, y, and z are never explicitly stored and neither is the canonical origin location o. The global model is typically stored in this canonical coordinate system, and it is thus often called the global coordinate system. However, if we want to use another coordinate system with origin p and orthonormal basis vectors u, v, and w, then we do store those vectors explicitly. Such a system is called a frame of reference or coordinate frame. For example, in a flight simulator, we might want to maintain a coordinate system with the origin at the nose of the plane, and the orthonormal basis aligned with the airplane. Simultaneously, we would have the master canonical coordinate system (Figure 2.21). The coordinate system associated with a particular object, such as the plane, is usually called a local coordinate system.
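At a low level, the local frame is stored in canonical coordinates. For example, if u has coordinates (x_{u}, y_{u}, z_{u}),
$$\mathbf{u} = x_u \mathbf{x} + y_u \mathbf{y} + z_u \mathbf{z}.$$
A location implicitly includes an offset from the canonical origin:
$$\mathbf{p} = \mathbf{o} + x_p \mathbf{x} + y_p \mathbf{y} + z_p \mathbf{z},$$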
where (x_{p},y_{p},z_{p}) are the coordinates of p.
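Note that if we store a vector a with respect to the u-v-w frame, we store a triple (u_{a}, v_{a}, w_{a}) which we can interpret geometrically as
$$\mathbf{a} = u_a \mathbf{u} + v_a \mathbf{v} + w_a \mathbf{w}.$$
To get the canonical coordinates of a vector a stored in the u-v-w coordinate system, simply recall that u, v, and w are themselves stored in terms of Cartesian coordinates, so the expression u_{a}u + v_{a}v + w_{a}w is already in Cartesian coordinates if evaluated explicitly. To get the u-v-w coordinates of a vector b stored in the canonical coordinate system, we can use dot products:
$$u_b = \mathbf{u} \cdot \mathbf{b}, \qquad v_b = \mathbf{v} \cdot \mathbf{b}, \qquad w_b = \mathbf{w} \cdot \mathbf{b}.$$
This works because we know that for some u_{b}, v_{b}, and w_{b},
$$\mathbf{b} = u_b \mathbf{u} + v_b \mathbf{v} + w_b \mathbf{w},$$
and the dot product isolates the u_{b} coordinate:
$$\mathbf{u} \cdot \mathbf{b} = u_b (\mathbf{u} \cdot \mathbf{u}) + v_b (\mathbf{u} \cdot \mathbf{v}) + w_b (\mathbf{u} \cdot \mathbf{w}) = u_b.$$
This works because u, v, and w are orthonormal.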
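The round trip between canonical and u-v-w coordinates can be sketched as follows. This example handles vectors (offsets) only and assumes an orthonormal basis stored as tuples of canonical coordinates; the names `to_local` and `to_canonical` are hypothetical.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def to_local(b, u, v, w):
    """u-v-w coordinates of a vector b given in canonical coordinates."""
    return (dot(u, b), dot(v, b), dot(w, b))

def to_canonical(a_local, u, v, w):
    """Canonical coordinates of a vector stored as (u_a, v_a, w_a)."""
    ua, va, wa = a_local
    return tuple(ua * ui + va * vi + wa * wi for ui, vi, wi in zip(u, v, w))

# An orthonormal, right-handed basis: the canonical axes rotated 90 degrees about z.
u = (0.0, 1.0, 0.0)
v = (-1.0, 0.0, 0.0)
w = (0.0, 0.0, 1.0)

b = (2.0, 3.0, 4.0)
local = to_local(b, u, v, w)
print(local)                          # (3.0, -2.0, 4.0)
print(to_canonical(local, u, v, w))   # (2.0, 3.0, 4.0)
```

If the basis were not orthonormal, the dot products would no longer isolate the individual coordinates and the round trip would fail.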
Using matrices to manage changes of coordinate systems is discussed in Sections 7.2.1 and 7.5.
Often we need an orthonormal basis that is aligned with a given vector. That is, given a vector a, we want an orthonormal u, v, and w such that w points in the same direction as a (Hughes & Möller, 1999), but we don’t particularly care what u and v are. One vector isn’t enough to uniquely determine the answer; we just need a robust procedure that will find any one of the possible bases.
This can be done using cross products as follows. First, make w a unit vector in the direction of a:
$$\mathbf{w} = \frac{\mathbf{a}}{\|\mathbf{a}\|}.$$
Then, choose any vector t not collinear with w, and use the cross product to build a unit vector u perpendicular to w:
$$\mathbf{u} = \frac{\mathbf{t} \times \mathbf{w}}{\|\mathbf{t} \times \mathbf{w}\|}.$$
If t is collinear with w, the denominator will vanish, and if they are nearly collinear, the results will have low precision. A simple procedure to find a vector sufficiently different from w is to start with t equal to w and change the smallest magnitude component of t to 1. For example, if w = (1/√2, –1/√2, 0), then t = (1/√2, –1/√2, 1). Once w and u are in hand, completing the basis is simple:
$$\mathbf{v} = \mathbf{w} \times \mathbf{u}.$$
An example of a situation where this construction is used is surface shading, where a basis aligned to the surface normal is needed but the rotation around the normal is often unimportant.
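The single-vector construction described in this section can be sketched as below. This is a plain illustration of the cross-product procedure, not production code; the helper names are hypothetical, and vectors are assumed to be 3-tuples.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalize(a):
    n = math.sqrt(dot(a, a))
    if n == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return tuple(c / n for c in a)

def basis_from_w(a):
    """Return an orthonormal, right-handed (u, v, w) with w in the direction of a."""
    w = normalize(a)
    # Build t from w by setting w's smallest-magnitude component to 1,
    # which keeps t safely away from being collinear with w.
    t = list(w)
    smallest = min(range(3), key=lambda i: abs(w[i]))
    t[smallest] = 1.0
    u = normalize(cross(t, w))
    v = cross(w, u)
    return u, v, w

u, v, w = basis_from_w((1.0, -1.0, 0.0))
print(u, v, w)
```

Since w and u are perpendicular unit vectors, v = w × u is already of unit length and needs no normalization.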
For serious production code, researchers at Pixar have developed a rather remarkable method for constructing an orthonormal basis from a single vector that is impressive in its compactness and efficiency (Duff et al., 2017). They provide battle-tested code, and readers are encouraged to use it, as no “gotchas” have emerged while it has been used throughout the industry.
The procedure in the previous section can also be used in situations where the rotation of the basis around the given vector is important. A common example is building a basis for a camera: it’s important to have one vector aligned in the direction the camera is looking, but the orientation of the camera around that vector is not arbitrary, and it needs to be specified somehow. Once the orientation is pinned down, the basis is completely determined.
A common way to fully specify a frame is by providing two vectors a (which specifies w) and b (which specifies v). If the two vectors are known to be perpendicular, it is a simple matter to construct the third vector by u = b × a.
To be sure that the resulting basis really is orthonormal, even if the input vectors weren’t quite, a procedure much like the single-vector procedure is advisable:
$$\mathbf{w} = \frac{\mathbf{a}}{\|\mathbf{a}\|}, \qquad \mathbf{u} = \frac{\mathbf{b} \times \mathbf{w}}{\|\mathbf{b} \times \mathbf{w}\|}, \qquad \mathbf{v} = \mathbf{w} \times \mathbf{u}.$$
In fact, this procedure works just fine when a and b are not perpendicular. In this case, w will be constructed exactly in the direction of a, and v is chosen to be the closest vector to b among all vectors perpendicular to w.
This procedure won’t work if a and b are collinear. In this case, b is of no help in choosing which of the directions perpendicular to a we should use: it is perpendicular to all of them.
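The two-vector construction can be sketched in the same style. The function name `basis_from_wv` is hypothetical, and raising an error for collinear inputs is an assumption of this example; nearly collinear inputs are not detected and will simply give imprecise results.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalize(a):
    n = math.sqrt(dot(a, a))
    if n == 0.0:
        raise ValueError("zero-length vector (are the inputs collinear?)")
    return tuple(c / n for c in a)

def basis_from_wv(a, b):
    """Orthonormal (u, v, w): w along a, v as close as possible to b."""
    w = normalize(a)
    u = normalize(cross(b, w))   # fails when a and b are collinear
    v = cross(w, u)
    return u, v, w

# b is not perpendicular to a; v ends up as the part of b perpendicular to w.
u, v, w = basis_from_wv((0.0, 0.0, 2.0), (0.0, 1.0, 1.0))
print(u, v, w)   # (1.0, 0.0, 0.0) (0.0, 1.0, 0.0) (0.0, 0.0, 1.0)
```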
In the example of specifying camera positions (Section 4.3), we want to construct a frame that has w parallel to the direction the camera is looking, and v should point out the top of the camera. To orient the camera upright, we build the basis around the view direction, using the straight-up direction as the reference vector to establish the camera’s orientation around the view direction. Setting v as close as possible to straight up exactly matches the intuitive notion of “holding the camera straight.”
Occasionally, you may find problems caused in your computations by a basis that is supposed to be orthonormal but where error has crept in—due to rounding error in computation, or to the basis having been stored in a file with low precision, for instance.
The procedure of the previous section can be used; simply constructing the basis anew using the existing w and v vectors will produce a new basis that is orthonormal and is close to the old one.
This approach is good for many applications, but it is not the best available. It does produce accurately orthogonal vectors, and for nearly orthogonal starting bases, the result will not stray far from the starting point. However, it is asymmetric: it “favors” w over v and v over u (whose starting value is thrown away). It chooses a basis close to the starting basis but has no guarantee of choosing the closest orthonormal basis. When this is not good enough, the SVD (Section 6.4.1) can be used to compute an orthonormal basis that is guaranteed to be closest to the original basis.
A possibly misleading thing about graphics is that it is full of integrals and thus one might think one has to be good at algebraically solving integrals. This is most definitely not the case. Most of the integrals in graphics are not analytically solvable and are thus solved numerically. It is quite possible to have a great career in graphics and never algebraically solve a single integral.
While you do not need to be able to algebraically solve integrals, you do need to be able to read them so you can numerically solve them. In one dimension, integrals are usually pretty readable. For example, this integral
$$\int_{\pi}^{2\pi} \sin(x)\, dx$$
can be read as “compute the area of the function sin (x) between x = π and x = 2π.” A computer scientist might view this part:
$$\int \cdots \; dx$$
as a function call. We might call it “integrate().” It takes two objects: a function and a domain (interval). So the whole call might be
float area = integrate(sin(), [pi,2pi]).
In more advanced calculus, we might start taking integrals over spheres, and the neat thing for graphics is we can still think of things that way:
float area = integrate(cos(), unit-sphere)
The machinery inside this function may be different, but all integrals have two things:
1. The function being integrated
2. The domain over which it is integrated.
The trick, usually, is just carefully decoding what 1 and 2 are for a problem at hand. This is pretty similar in spirit to getting an API call right from sometimes confusing documentation.
Integrals compute totals of things: lengths, areas, volumes, and so on. But they are also often used to compute averages. For example, we can compute the total volume of a region by integrating the elevation over that region (like a country).
float volume = integrate(elevation(), country)
But we could also compute the average elevation:
float averageElevation = integrate(elevation(), country) / integrate(1, country)
This is basically “divide the volume by the area.” This can be abstracted as
Float averageElevation = average(elevation, country)
We can also take a weighted average. Here, we add a weighting function to emphasize some points in the average more than others. For example, suppose we want to weight each part of the region by its temperature (this is pretty arbitrary, and we will see more graphics-relevant examples in the next section):
float weightedAverageElevation =
    integrate(temperature()*elevation(), country) / integrate(temperature(), country)
It’s a good idea to keep an eye out for this form; often integrals contain a weighted average without explicitly pointing that out and it can sometimes help intuition.
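The average and weighted-average forms can be sketched in one dimension. Everything here is hypothetical: the "country" is the interval [0, 1], and `elevation` and `temperature` are made-up functions chosen so the answers can be checked by hand.

```python
def integrate(f, domain, n=10_000):
    """Midpoint-rule estimate of the integral of f over domain = (a, b)."""
    a, b = domain
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

def average(f, domain):
    """Total of f divided by the size of the domain."""
    return integrate(f, domain) / integrate(lambda x: 1.0, domain)

def weighted_average(f, w, domain):
    """Average of f in which each point x counts in proportion to w(x)."""
    return integrate(lambda x: w(x) * f(x), domain) / integrate(w, domain)

def elevation(x):      # hypothetical terrain: rises linearly
    return x

def temperature(x):    # hypothetical weight: warmer toward x = 1
    return x

country = (0.0, 1.0)
plain = average(elevation, country)                         # 1/2
weighted = weighted_average(elevation, temperature, country)  # (1/3)/(1/2) = 2/3
```

The weighted result is larger than the plain average because the weight emphasizes the high end of the interval.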
One example of a type of integral we see a lot is one of these forms or something related:
float shade = integrate(cos()*f(), unit-hemisphere)
Note that since integrate(cos(), unit-hemisphere) = pi, the weighted average version is just
float shade = integrate((1/pi)*cos()*f(), unit-hemisphere)
The more traditional form of this integral is
Or with spherical coordinates as we might use to solve such integrals algebraically:
The sine term is an area-correction factor for spherical coordinates. Note that in graphics, we will rarely need to write that all out and will use simpler forms without explicit coordinates as we numerically solve the integrals.
The particular integral above is the shade of a perfectly reflective matte (diffuse) surface, and it is also a weighted average of all incident colors. This structure can be great for intuition; the color of a surface is usually related to a weighted average of incident colors.
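As a numerical check of the claim that the cosine integrates to pi over the hemisphere, the following sketch integrates in spherical coordinates, including the sine area-correction factor. The function name `integrate_hemisphere`, the midpoint grid, and the constant incident value 0.7 are all assumptions made for illustration.

```python
import math

def integrate_hemisphere(f, n_theta=200, n_phi=200):
    """Midpoint rule over the unit hemisphere; f takes (theta, phi).

    theta is the angle from the surface normal, phi is the azimuth.
    """
    d_theta = (math.pi / 2) / n_theta
    d_phi = (2 * math.pi) / n_phi
    total = 0.0
    for i in range(n_theta):
        theta = (i + 0.5) * d_theta
        for j in range(n_phi):
            phi = (j + 0.5) * d_phi
            # sin(theta) converts the (theta, phi) cell into area on the sphere
            total += f(theta, phi) * math.sin(theta) * d_theta * d_phi
    return total

cos_total = integrate_hemisphere(lambda theta, phi: math.cos(theta))

# Weighted average of a constant incident value: the average of 0.7 is 0.7.
shade = integrate_hemisphere(
    lambda theta, phi: (1 / math.pi) * math.cos(theta) * 0.7)
```

Because the weights `(1/pi)*cos` integrate to one, a constant incident value passes through unchanged, which is exactly what a weighted average should do.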
The integrals over solid angle are almost always the same but appear in a wide variety of notations. The key is to recognize that these are just notations and to map the notation you see to the one you are most comfortable with. This is much like reading pseudocode!
Density functions come up all the time in graphics (e.g., “probability density functions”) and they can be surprisingly confusing at times, but getting a handle on what precisely they are will help us use them and navigate out of confusion when it strikes us. We know what a function is, and a density function is just one that returns a density. So what is a density? Density is something that is a “per unit something,” or more formally an intensive quantity. For example, your weight is not a density, it is an extensive quantity, or just an amount of stuff, not an amount of stuff per unit something. The amount of weight a person might gain in a set period of time, say a year, is an amount of stuff, is measured in kilograms, and is thus an extensive quantity and not a density. The amount of weight the person was gaining “per day” or “per hour” is an intensive quantity, so is a density.
As an example of a non-density function, consider the amount of energy that is produced by a solar panel on a given day, July 1, 2014, and let’s say it is 120 kilojoules. That is an amount of “stuff.” Well that is fine, but is it enough to run my computer? My computer, if a desktop, needs a density of energy, or rate of energy, to keep working. So how do we take that day of energy and convert it into a rate of energy? We could divide it into segments of time. For example, we could do four-hour blocks, two-hour blocks, or one-hour blocks, and we would see that the rate changes during the day, but also that the boxes keep getting shorter, as shown in Figure 2.22.
As we divide time finer and finer, we would eventually get down to minutes and seconds and we would get more information about time variation, but the box heights would get so small that we wouldn’t see anything. So what we could do is rescale the heights of the boxes based on their widths; for example, (30 kJ)/(0.5 h) = 60 kJ/h. If we use this new “kJ per hour” measure, the boxes no longer get shorter, as shown in Figure 2.23. If we take this process to the limit where the width of the box becomes infinitesimal, we get a smooth curve.
This curve is an example of a density function. It would be called by some an “energy density” function where the dimension the density is taken over is time, and in some contexts it would be called a “temporal energy density” function. Because this particular density is so useful and commonly talked about, it gets its own name, power, and instead of saying “joules per hour,” we say watts. Note that a watt is a joule per second rather than per hour by convention; the specific units rather than dimension are chosen for convenience. For example, some physical units make more sense with meters, some with kilometers, and some with nanometers (and a few like spectral radiance for light use both meters and nanometers in the same quantity, so when you find yourself confused, it is not your fault).
Putting this all together, (1) a density is always some kind of ratio where you say “so many X per unit Y” or “so many X per Y” like “so many kilometers per hour” (saying “so many kilometers per unit time” would be odd, but makes sense if everybody agrees what the unit of time is by default), and (2) a density function is a function that returns a density.
Density functions by themselves are useful for comparing relative concentrations at two different points. For example, with our energy density function defined over time (power), we can say “there is twice as much power at 2 pm as at 9 am.” But another way we can use them is to compute the total quantity in a region. For example, to compute how much energy is produced between 2 pm and 4 pm, we just integrate:
$$\text{energy} = \int_{t = 2\,\text{pm}}^{4\,\text{pm}} \text{power}(t) \, dt$$
Many integrals are this sort of “integrate a density function” but that is not spelled out. It can sometimes make things more clear if you tease out whether an integral is processing the “mass” of a density function in some interval or region.
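The two uses of a density function can be sketched as follows. The power curve below is entirely hypothetical (a parabola peaking at 60 kJ/h at noon); only the pattern matters: integrating a density over an interval gives an amount, and dividing an amount by the interval width gives back a density.

```python
def power(t):
    """Hypothetical solar-panel power in kJ per hour; t is in hours (24 h clock)."""
    return max(0.0, 60.0 * (1.0 - ((t - 12.0) / 6.0) ** 2))

def integrate(f, domain, n=1000):
    """Midpoint-rule estimate of the integral of f over domain = (a, b)."""
    a, b = domain
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

# Amount of "stuff": energy in kJ produced between 2 pm and 4 pm.
energy = integrate(power, (14.0, 16.0))

# Box view: the amount in a half-hour box, rescaled by the box width,
# is again a density in kJ per hour.
amount = integrate(power, (12.0, 12.5))  # kJ
box_height = amount / 0.5                # kJ per hour
```

For this made-up curve the exact energy between 2 pm and 4 pm is 800/9 kJ, and the rescaled half-hour box just after noon has a height slightly under the peak value of 60 kJ/h.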
The geometry of curves, and especially surfaces, plays a central role in graphics, and here, we review the basics of curves and surfaces in 2D and 3D space.
Intuitively, a curve is a set of points that can be drawn on a piece of paper without lifting the pen. A common way to describe a curve is using an implicit equation. An implicit equation in two dimensions has the form
$$f(x, y) = 0.$$
The function f (x, y) returns a real value. Points (x, y) where this value is zero are on the curve, and points where the value is nonzero are not on the curve. For example, let’s say that f (x, y) is
$$f(x, y) = (x - x_c)^2 + (y - y_c)^2 - r^2, \tag{2.9}$$
where (x_{c},y_{c}) is a 2D point and r is a nonzero real number. If we take f (x, y) = 0, the points where this equality holds are on the circle with center (x_{c},y_{c}) and radius r. The reason that this is called an “implicit” equation is that the points (x, y) on the curve cannot be immediately calculated from the equation and instead must be determined by solving the equation. Thus, the points on the curve are not generated by the equation explicitly, but they are buried somewhere implicitly in the equation.
It is interesting to note that f does have values for all (x, y). We can think of f as a terrain, with sea level at f = 0 (Figure 2.24). The shore is the implicit curve. The value of f is the altitude. Another thing to note is that the curve partitions space into regions where f > 0, f < 0, and f = 0. So you evaluate f to decide whether a point is “inside” a curve. Note that f (x, y) = c is a curve for any constant c, and c = 0 is just used as a convention. For example, if f (x, y) = x^{2} + y^{2} – 1, varying c just gives a variety of circles centered at the origin (Figure 2.25).
We can compress our notation using vectors. If we have c = (x_{c},y_{c}) and p = (x, y) , then our circle with center c and radius r is defined by those position vectors that satisfy
$$(\mathbf{p} - \mathbf{c}) \cdot (\mathbf{p} - \mathbf{c}) - r^2 = 0.$$
This equation, if expanded algebraically, will yield Equation (2.9), but it is easier to see that this is an equation for a circle by “reading” the equation geometrically. It reads, “points p on the circle have the following property: the vector from c to p when dotted with itself has value r^{2}.” Because a vector dotted with itself is just its own length squared, we could also read the equation as, “points p on the circle have the following property: the vector from c to p has squared length r^{2}.”
Even better is to observe that the squared length is just the squared distance from c to p, which suggests the equivalent form
$$\|\mathbf{p} - \mathbf{c}\| - r = 0.$$
The above could be read “the points p on the circle are those a distance r from the center point c,” which is as good a definition of circle as any. This illustrates that the vector form of an equation often suggests more geometry and intuition than the equivalent full-blown Cartesian form with x and y. For this reason, it is usually advisable to use vector forms when possible. In addition, you can support a vector class in your code; the code is cleaner when vector forms are used. The vector-oriented equations are also less error prone in implementation: once you implement and debug vector types in your code, the cut-and-paste errors involving x, y, and z will go away. It takes a little while to get used to vectors in these equations, but once you get the hang of it, the payoff is large.
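A small sketch of what "supporting a vector class" might look like for the implicit circle. The class `Vec2` and the function names are hypothetical; a production vector type would carry more operations.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Vec2:
    x: float
    y: float

    def __sub__(self, other):
        return Vec2(self.x - other.x, self.y - other.y)

    def dot(self, other):
        return self.x * other.x + self.y * other.y

    def length(self):
        return math.sqrt(self.dot(self))

def circle_implicit(p, c, r):
    """(p - c) . (p - c) - r^2: zero on the circle, negative inside, positive outside."""
    d = p - c
    return d.dot(d) - r * r

def circle_distance_form(p, c, r):
    """||p - c|| - r: zero on the circle, with the same sign pattern."""
    return (p - c).length() - r

center = Vec2(0.0, 0.0)
on_circle = circle_implicit(Vec2(3.0, 4.0), center, 5.0)   # 0.0
inside = circle_implicit(Vec2(1.0, 1.0), center, 5.0)      # negative
outside = circle_implicit(Vec2(6.0, 0.0), center, 5.0)     # positive
```

Neither function mentions x or y explicitly, which is the point: the component-wise arithmetic lives in one debugged place.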
If we think of the function f (x, y) as a height field with height = f (x, y) , the gradient vector points in the direction of maximum upslope, i.e., straight uphill. The gradient vector ∇f(x, y) is given by
$$\nabla f(x, y) = \left(\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}\right).$$
The gradient vector evaluated at a point on the implicit curve f (x, y) = 0 is perpendicular to the tangent vector of the curve at that point. This perpendicular vector is usually called the normal vector to the curve. In addition, since the gradient points uphill, it indicates the direction of the f (x, y) > 0 region.
In the context of height fields, the geometric meaning of partial derivatives and gradients is more visible than usual. Suppose that near the point (a, b) , f (x, y) is a plane (Figure 2.26). There is a specific uphill and downhill direction. At right angles to this direction is a direction that is level with respect to the plane. Any intersection between the plane and the f (x, y) = 0 plane will be in the direction that is level. Thus, the uphill/downhill directions will be perpendicular to the line of intersection f (x, y) = 0. To see why the partial derivative has something to do with this, we need to visualize its geometric meaning. Recall that the conventional derivative of a 1D function y = g(x) is
$$\frac{dy}{dx} \equiv \lim_{\Delta x \to 0} \frac{\Delta y}{\Delta x} = \lim_{\Delta x \to 0} \frac{g(x + \Delta x) - g(x)}{\Delta x}. \tag{2.10}$$
This measures the slope of the tangent line to g (Figure 2.27).
The partial derivative is a generalization of the 1D derivative. For a 2D function f (x, y) , we can’t take the same limit for x as in Equation (2.10), because f can change in many ways for a given change in x. However, if we hold y constant, we can define an analog of the derivative, called the partial derivative (Figure 2.28):
$$\frac{\partial f}{\partial x} \equiv \lim_{\Delta x \to 0} \frac{f(x + \Delta x, y) - f(x, y)}{\Delta x}.$$
Why is it that the partial derivatives with respect to x and y are the components of the gradient vector? Again, there is more obvious insight in the geometry than in the algebra. In Figure 2.29, we see the vector a travels along a path where f does not change. Note that this is again at a small enough scale that the surface height (x, y) = f (x, y) can be considered locally planar. From the figure, we see that the vector a = (Δx, Δy) .
Because the uphill direction is perpendicular to a, we know the dot product is equal to zero:
$$(\nabla f) \cdot \mathbf{a} \equiv (x_\nabla, y_\nabla) \cdot (x_a, y_a) = x_\nabla x_a + y_\nabla y_a = 0.$$
We also know that the change in f in the direction (x_{a},y_{a}) equals zero:
$$\Delta f = \frac{\partial f}{\partial x} \Delta x + \frac{\partial f}{\partial y} \Delta y = \frac{\partial f}{\partial x} x_a + \frac{\partial f}{\partial y} y_a = 0. \tag{2.11}$$
Given any vectors (x, y) and (x′, y′) that are perpendicular, we know that the angle between them is 90 degrees, and thus, their dot product equals zero (recall that the dot product is proportional to the cosine of the angle between the two vectors). Thus, we have xx′ + yy′ = 0. Given (x, y), it is easy to construct valid vectors whose dot product with (x, y) equals zero, the two most obvious being (y, –x) and (–y, x); you can verify that these vectors give the desired zero dot product with (x, y). A generalization of this observation is that (x, y) is perpendicular to k(y, –x) where k is any nonzero constant. This implies that
$$(x_a, y_a) = k(y_\nabla, -x_\nabla). \tag{2.12}$$
Combining Equations (2.11) and (2.12) gives
$$(x_\nabla, y_\nabla) = k'\left(\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}\right),$$
where k′ is any nonzero constant. By definition, “uphill” implies a positive change in f, so we would like k′ > 0, and k′ = 1 is a perfectly good convention.
As an example of the gradient, consider the implicit circle x^{2} + y^{2} – 1 = 0 with gradient vector (2x, 2y) , indicating that the outside of the circle is the positive region for the function f (x, y) = x^{2} + y^{2} – 1. Note that the length of the gradient vector can be different depending on the multiplier in the implicit equation. For example, the unit circle can be described by Ax^{2} + Ay^{2} – A = 0 for any nonzero A. The gradient for this curve is (2Ax, 2Ay) . This will be normal (perpendicular) to the circle, but will have a length determined by A. For A > 0, the normal will point outward from the circle, and for A < 0, it will point inward. This switch from outward to inward is as it should be, since the positive region switches inside the circle. In terms of the height-field view, h = Ax^{2} + Ay^{2} – A, and the circle is at zero altitude. For A > 0, the circle encloses a depression, and for A < 0, the circle encloses a bump. As A becomes more negative, the bump increases in height, but the h = 0 circle doesn’t change. The direction of maximum uphill doesn’t change, but the slope increases. The length of the gradient reflects this change in degree of the slope. So intuitively, you can think of the gradient’s direction as pointing uphill and its magnitude as measuring how uphill the slope is.
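The circle example can be checked numerically. This sketch approximates the partial derivatives with central differences (an assumed technique and step size `h`, not part of the text) and compares the result with the analytic gradient (2Ax, 2Ay).

```python
def gradient(f, x, y, h=1e-6):
    """Central-difference estimate of (df/dx, df/dy) at (x, y)."""
    dfdx = (f(x + h, y) - f(x - h, y)) / (2 * h)  # hold y constant
    dfdy = (f(x, y + h) - f(x, y - h)) / (2 * h)  # hold x constant
    return (dfdx, dfdy)

def scaled_circle(A):
    """The implicit unit circle A x^2 + A y^2 - A = 0 for a nonzero multiplier A."""
    return lambda x, y: A * x * x + A * y * y - A

# A point on the unit circle.
px, py = 0.6, 0.8

g_out = gradient(scaled_circle(1.0), px, py)   # about (1.2, 1.6): points outward
g_in = gradient(scaled_circle(-3.0), px, py)   # about (-3.6, -4.8): points inward, longer
```

Both gradients are parallel to the radius through (0.6, 0.8), so both are normal to the circle; only the sign and length depend on A.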
The familiar “slope-intercept” form of the line is
$$y = mx + b.$$
This can be converted easily to implicit form (Figure 2.30):
$$y - mx - b = 0.$$
Here, m is the “slope” (ratio of rise to run), and b is the y value where the line crosses the y-axis, usually called the y-intercept. The line also partitions the 2D plane, but here “inside” and “outside” might be more intuitively called “over” and “under.”
Because we can multiply an implicit equation by any constant without changing the points where it is zero, kf (x, y) = 0 is the same curve for any nonzero k. This allows several implicit forms for the same line, for example,
$$2y - 2mx - 2b = 0.$$
One reason the slope-intercept form is sometimes awkward is that it can’t represent some lines such as x = 0 because m would have to be infinite. For this reason, a more general form is often useful:
$$Ax + By + C = 0, \tag{2.15}$$
for real numbers A, B, C.
Suppose we know two points on the line, (x_{0}, y_{0}) and (x_{1},y_{1}) . What A, B, and C describe the line through these two points? Because these points lie on the line, they must both satisfy Equation (2.15):
$$Ax_0 + By_0 + C = 0,$$
$$Ax_1 + By_1 + C = 0.$$
Unfortunately, we have two equations and three unknowns: A, B, and C. This problem arises because of the arbitrary multiplier we can have with an implicit equation. We could set C = 1 for convenience:
$$Ax + By + 1 = 0,$$
but we have a similar problem to the infinite slope case in slope-intercept form: lines through the origin would need to have A(0) + B(0) + 1 = 0, which is a contradiction. For example, the equation for a 45° line through the origin can be written x – y = 0, or equally well y – x = 0, or even 17y – 17x = 0, but it cannot be written in the form Ax + By + 1 = 0.
Whenever we have such pesky algebraic problems, we try to solve the problems using geometric intuition as a guide. One tool we have, as discussed in Section 2.7.2, is the gradient. For the line Ax + By + C = 0, the gradient vector is (A, B) . This vector is perpendicular to the line (Figure 2.31), and points to the side of the line where Ax + By + C is positive. Given two points on the line (x_{0},y_{0}) and (x_{1},y_{1}) , we know that the vector between them points in the same direction as the line. This vector is just (x_{1} – x_{0},y_{1} – y_{0}), and because it is parallel to the line, it must also be perpendicular to the gradient vector (A, B) . Recall that there are an infinite number of (A, B, C) that describe the line because of the arbitrary scaling property of implicits. We want any one of the valid (A, B, C) .
We can start with any (A, B) perpendicular to (x_{1}–x_{0},y_{1}–y_{0}). Such a vector is just (A, B) = (y_{0} –y_{1}, x_{1} – x_{0}) by the same reasoning as in Section 2.7.2. This means that the equation of the line through (x_{0},y_{0}) and (x_{1},y_{1}) is
$$(y_0 - y_1)x + (x_1 - x_0)y + C = 0. \tag{2.16}$$
Now we just need to find C. Because (x_{0},y_{0}) and (x_{1},y_{1}) are on the line, they must satisfy Equation (2.16). We can plug either value in and solve for C. Doing this for (x_{0},y_{0}) yields C = x_{0}y_{1} – x_{1}y_{0}, and thus, the full equation for the line is
$$(y_0 - y_1)x + (x_1 - x_0)y + x_0 y_1 - x_1 y_0 = 0. \tag{2.17}$$
Again, this is one of infinitely many valid implicit equations for the line through two points, but this form has no division operation and thus no numerically degenerate cases for points with finite Cartesian coordinates. A nice thing about Equation (2.17) is that we can always convert to the slope-intercept form (when it exists) by moving the non-y terms to the right-hand side of the equation and dividing by the multiplier of the y term:
$$y = \frac{y_1 - y_0}{x_1 - x_0}\, x + \frac{x_1 y_0 - x_0 y_1}{x_1 - x_0}.$$
An interesting property of the implicit line equation is that it can be used to find the signed distance from a point to the line. The value of Ax + By + C is proportional to the distance from the line (Figure 2.32). As shown in Figure 2.33, the distance from a point to the line is the length of the vector k(A, B), which is
$$\text{distance} = k\sqrt{A^2 + B^2}. \tag{2.18}$$
For the point (x, y) + k(A, B), the value of f (x, y) = Ax + By + C is
$$f(x + kA, y + kB) = Ax + kA^2 + By + kB^2 + C = k(A^2 + B^2). \tag{2.19}$$
The simplification in that equation is a result of the fact that we know (x, y) is on the line, so Ax + By + C = 0. From Equations (2.18) and (2.19), we can see that the signed distance from line Ax + By + C = 0 to a point (a, b) is
$$\text{distance} = \frac{f(a, b)}{\sqrt{A^2 + B^2}}.$$
Here, “signed distance” means that its magnitude (absolute value) is the geometric distance, but on one side of the line, distances are positive and on the other, they are negative. You can choose between the equally valid representations f (x, y) = 0 and –f (x, y) = 0 if your problem has some reason to prefer a particular side being positive. Note that if (A, B) is a unit vector, then f (a, b) is the signed distance. We can multiply Equation (2.17) by a constant that ensures that (A, B) is a unit vector:
$$f(x, y) = \frac{(y_0 - y_1)x + (x_1 - x_0)y + x_0 y_1 - x_1 y_0}{\sqrt{(x_1 - x_0)^2 + (y_0 - y_1)^2}} = 0. \tag{2.20}$$
Note that evaluating f (x, y) in Equation (2.20) directly gives the signed distance, but it does require a square root to set up the equation. Implicit lines will turn out to be very useful for triangle rasterization (Section 9.1.2). Other forms for 2D lines are discussed in Chapter 13.
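The construction of (A, B, C) from two points and the signed-distance formula translate almost line for line into code. The function names below are hypothetical, and the sketch assumes the two points are distinct (otherwise (A, B) is the zero vector and the division fails).

```python
import math

def implicit_line(p0, p1):
    """Coefficients (A, B, C) of Ax + By + C = 0 through p0 and p1.

    (A, B) = (y0 - y1, x1 - x0) is perpendicular to p1 - p0, and
    C = x0*y1 - x1*y0 makes p0 satisfy the equation. No division is needed.
    """
    x0, y0 = p0
    x1, y1 = p1
    return (y0 - y1, x1 - x0, x0 * y1 - x1 * y0)

def signed_distance(line, point):
    """f(a, b) / sqrt(A^2 + B^2); positive on the side that (A, B) points to."""
    A, B, C = line
    a, b = point
    return (A * a + B * b + C) / math.hypot(A, B)

diagonal = implicit_line((0.0, 0.0), (1.0, 1.0))   # (-1, 1, 0): the line y - x = 0
vertical = implicit_line((2.0, 0.0), (2.0, 5.0))   # (-5, 0, 10): the line x = 2

d_above = signed_distance(diagonal, (0.0, 1.0))    # +1/sqrt(2)
d_below = signed_distance(diagonal, (1.0, 0.0))    # -1/sqrt(2)
d_left = signed_distance(vertical, (0.0, 0.0))     # +2
```

The vertical line, which has no slope-intercept form, is handled without any special case.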
In the previous section, we saw that a linear function f (x, y) gives rise to an implicit line f (x, y) = 0. If f is instead a quadratic function of x and y, with the general form
$$f(x, y) = Ax^2 + Bxy + Cy^2 + Dx + Ey + F,$$
the resulting implicit curve is called a quadric. Two-dimensional quadric curves include ellipses and hyperbolas, as well as the special cases of parabolas, circles, and lines.
Examples of quadric curves include the circle with center (x_{c}, y_{c}) and radius r,
$$(x - x_c)^2 + (y - y_c)^2 - r^2 = 0,$$
and axis-aligned ellipses of the form
$$\frac{(x - x_c)^2}{a^2} + \frac{(y - y_c)^2}{b^2} - 1 = 0,$$
where (x_{c},y_{c}) is the center of the ellipse, and a and b are the minor and major semi-axes (Figure 2.34).
Just as implicit equations can be used to define curves in 2D, they can be used to define surfaces in 3D. As in 2D, implicit equations implicitly define a set of points that are on the surface:
$$f(x, y, z) = 0.$$
Any point (x, y, z) that is on the surface results in zero when given as an argument to f . Any point not on the surface results in some number other than zero. You can check whether a point is on the surface by evaluating f , or you can check which side of the surface the point lies on by looking at the sign of f , but you cannot always explicitly construct points on the surface. Using vector notation, we will write such functions of p = (x, y, z) as
$$f(\mathbf{p}) = 0.$$
A surface normal (which is needed for lighting computations, among other things) is a vector perpendicular to the surface. Each point on the surface may have a different normal vector. In the same way that the gradient provides a normal to an implicit curve in 2D, the surface normal at a point p on an implicit surface is given by the gradient of the implicit function
$$\mathbf{n} = \nabla f(\mathbf{p}) = \left(\frac{\partial f(\mathbf{p})}{\partial x}, \frac{\partial f(\mathbf{p})}{\partial y}, \frac{\partial f(\mathbf{p})}{\partial z}\right).$$
The reasoning is the same as for the 2D case: the gradient points in the direction of fastest increase in f , which is perpendicular to all directions tangent to the surface, in which f remains constant. The gradient vector points toward the side of the surface where f (p) > 0, which we may think of as “into” the surface or “out from” the surface in a given context. If the particular form of f creates inward-facing gradients, and outward-facing gradients are desired, the surface –f (p) = 0 is the same as surface f (p) = 0 but has directionally reversed gradients, i.e., –∇f (p) = ∇(–f (p)) .
As an example, consider the infinite plane through point a with surface normal n. The implicit equation to describe this plane is given by
$$(\mathbf{p} - \mathbf{a}) \cdot \mathbf{n} = 0.$$
Note that a and n are known quantities. The point p is any unknown point that satisfies the equation. In geometric terms this equation says “the vector from a to p is perpendicular to the plane normal.” If p were not in the plane, then (p – a) would not make a right angle with n (Figure 2.35).
Sometimes, we want the implicit equation for a plane through points a, b, and c. The normal to this plane can be found by taking the cross product of any two vectors in the plane. One such cross product is
$$\mathbf{n} = (\mathbf{b} - \mathbf{a}) \times (\mathbf{c} - \mathbf{a}).$$
This allows us to write the implicit plane equation:
$$(\mathbf{p} - \mathbf{a}) \cdot \left((\mathbf{b} - \mathbf{a}) \times (\mathbf{c} - \mathbf{a})\right) = 0. \tag{2.22}$$
A geometric way to read this equation is that the volume of the parallelepiped defined by p – a, b – a, and c – a is zero; i.e., they are coplanar. This can only be true if p is in the same plane as a, b, and c. The full-blown Cartesian representation for this is given by the determinant (this is discussed in more detail in Section 6.3):
$$\begin{vmatrix} x - x_a & y - y_a & z - z_a \\ x_b - x_a & y_b - y_a & z_b - z_a \\ x_c - x_a & y_c - y_a & z_c - z_a \end{vmatrix} = 0. \tag{2.23}$$
The determinant can be expanded (see Section 6.3 for the mechanics of expanding determinants) to the bloated form with many terms.
Equations (2.22) and (2.23) are equivalent, and comparing them is instructive. Equation (2.22) is easy to interpret geometrically and will yield efficient code. In addition, it is relatively easy to avoid a typographic error that compiles into incorrect code if it takes advantage of debugged cross and dot product code. Equation (2.23) is also easy to interpret geometrically and will be efficient provided an efficient 3 × 3 determinant function is implemented. It is also easy to implement without a typo if a function determinant (a, b, c) is available. It will be especially easy for others to read your code if you rename the determinant function volume. So both Equations (2.22) and (2.23) map well into code. The full expansion of either equation into x-, y-, and z-components is likely to generate typos. Such typos are likely to compile and, thus, to be especially pesky. This is an excellent example of clean math generating clean code and bloated math generating bloated code.
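Here is a sketch of how both forms might map into code, using plain tuples as 3D vectors. The helper names (`sub`, `dot`, `cross`, `volume`, `plane_implicit`) are hypothetical; `volume` is the 3 × 3 determinant with the three vectors as rows.

```python
def sub(u, v):
    return (u[0] - v[0], u[1] - v[1], u[2] - v[2])

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1] + u[2] * v[2]

def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def volume(u, v, w):
    """Determinant of the 3x3 matrix with rows u, v, w (signed parallelepiped volume)."""
    return (u[0] * (v[1] * w[2] - v[2] * w[1])
            - u[1] * (v[0] * w[2] - v[2] * w[0])
            + u[2] * (v[0] * w[1] - v[1] * w[0]))

def plane_implicit(p, a, b, c):
    """Cross/dot form: (p - a) . ((b - a) x (c - a))."""
    return dot(sub(p, a), cross(sub(b, a), sub(c, a)))

def plane_implicit_det(p, a, b, c):
    """Determinant form of the same function."""
    return volume(sub(p, a), sub(b, a), sub(c, a))

a, b, c = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)
in_plane = plane_implicit((0.3, 0.4, 0.0), a, b, c)   # 0.0
above = plane_implicit((0.0, 0.0, 2.0), a, b, c)      # 2.0
```

The component-level arithmetic appears only inside `cross`, `dot`, and `volume`; once those are debugged, the plane functions themselves have little room for typos.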
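Just as quadratic polynomials in two variables define quadric curves in 2D, quadratic polynomials in x, y, and z define quadric surfaces in 3D. For instance, a sphere with center c and radius r can be written as
$$f(\mathbf{p}) = (\mathbf{p} - \mathbf{c}) \cdot (\mathbf{p} - \mathbf{c}) - r^2 = 0,$$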
and an axis-aligned ellipsoid may be written as
$$f(\mathbf{p}) = \frac{(x - x_c)^2}{a^2} + \frac{(y - y_c)^2}{b^2} + \frac{(z - z_c)^2}{c^2} - 1 = 0.$$
One might hope that an implicit 3D curve could be created with the form f (p) = 0. However, all such curves are just degenerate surfaces and are rarely useful in practice. A 3D curve can be constructed from the intersection of two simultaneous implicit equations:
$$f(\mathbf{p}) = 0,$$
$$g(\mathbf{p}) = 0.$$
For example, a 3D line can be formed from the intersection of two implicit planes. Typically, it is more convenient to use parametric curves instead; they are discussed in the following sections.
A parametric curve is controlled by a single parameter that can be considered a sort of index that moves continuously along the curve. Such curves have the form
$$\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} g(t) \\ h(t) \end{bmatrix}.$$
Here, (x, y) is a point on the curve, and t is the parameter that influences the curve. For a given t, there will be some point determined by the functions g and h. For continuous g and h, a small change in t will yield a small change in x and y. Thus, as t continuously changes, points are swept out in a continuous curve. This is a nice feature because we can use the parameter t to explicitly construct points on the curve. Often, we can write a parametric curve in vector form,
$$\mathbf{p} = \mathbf{f}(t),$$
where f is a vector-valued function, f : ℝ → ℝ^{2}. Such vector functions can generate very clean code, so they should be used when possible.
We can think of the curve with a position as a function of time. The curve can go anywhere and could loop and cross itself. We can also think of the curve as having a velocity at any point. For example, the point p(t) is traveling slowly near t = –2 and quickly between t = 2 and t = 3. This type of “moving point” vocabulary is often used when discussing parametric curves even when the curve is not describing a moving point.
A parametric line in 2D that passes through points p_{0} = (x_{0}, y_{0}) and p_{1} = (x_{1},y_{1}) can be written as
$$\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} x_0 + t(x_1 - x_0) \\ y_0 + t(y_1 - y_0) \end{bmatrix}.$$
Because the formulas for x and y have such similar structure, we can use the vector form for p = (x, y) (Figure 2.36):
$$\mathbf{p}(t) = \mathbf{p}_0 + t(\mathbf{p}_1 - \mathbf{p}_0).$$
You can read this in geometric form as “start at point p_{0} and go some distance toward p_{1} determined by the parameter t.” A nice feature of this form is that p(0) = p_{0} and p(1) = p_{1}. Since the point changes linearly with t, the value of t between p_{0} and p_{1} measures the fractional distance between the points. Points with t < 0 are to the “far” side of p_{0}, and points with t > 1 are to the “far” side of p_{1}.
Parametric lines can also be described as just a point o and a vector d:
$$\mathbf{p}(t) = \mathbf{o} + t\,\mathbf{d}.$$
When the vector d has unit length, the line is arc-length parameterized. This means t is an exact measure of distance along the line. Any parametric curve can be arc-length parameterized, which is obviously a very convenient form, but not all can be converted analytically.
A circle with center (x_{c},y_{c}) and radius r has a parametric form:
$$\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} x_c + r\cos\phi \\ y_c + r\sin\phi \end{bmatrix}.$$
To ensure that there is a unique parameter ϕ for every point on the curve, we can restrict its domain: ϕ ∈ [0, 2π) or ϕ ∈ (–π, π] or any other half-open interval of length 2π.
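The contrast with the implicit form can be seen in a few lines: the parametric circle lets us generate points directly, and each generated point can then be checked against the implicit circle equation. The function names, the sample circle, and the choice of eight samples over [0, 2π) are assumptions for illustration.

```python
import math

def circle_point(center, r, phi):
    """Point on the circle at parameter phi: (xc + r cos(phi), yc + r sin(phi))."""
    xc, yc = center
    return (xc + r * math.cos(phi), yc + r * math.sin(phi))

def circle_implicit(center, r, point):
    """(x - xc)^2 + (y - yc)^2 - r^2, which is zero for points on the circle."""
    xc, yc = center
    x, y = point
    return (x - xc) ** 2 + (y - yc) ** 2 - r ** 2

center, radius = (1.0, 2.0), 3.0

# Sample the half-open interval [0, 2*pi) so no point is generated twice.
points = [circle_point(center, radius, 2 * math.pi * k / 8) for k in range(8)]
residuals = [circle_implicit(center, radius, p) for p in points]
```

Every residual is zero up to rounding, confirming that the two descriptions define the same set of points.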
An axis-aligned ellipse can be constructed by scaling the x and y parametric equations separately:
$$\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} x_c + a\cos\phi \\ y_c + b\sin\phi \end{bmatrix}.$$
A 3D parametric curve operates much like a 2D parametric curve:
$$x = f(t), \quad y = g(t), \quad z = h(t).$$
For example, a spiral around the z-axis is written as
$$x = \cos t, \quad y = \sin t, \quad z = t.$$
As with 2D curves, the functions f , g, and h are defined on a domain D ⊂ R if we want to control where the curve starts and ends. In vector form, we can write
$$\mathbf{p} = \mathbf{p}(t).$$
In this chapter, we only discuss 3D parametric lines in detail. General 3D parametric curves are discussed more extensively in Chapter 15.
A 3D parametric line can be written as a straightforward extension of the 2D parametric line, e.g.,
$$x = 2 + 7t, \quad y = 1 + 2t, \quad z = 3 - 5t.$$
This is cumbersome and does not translate well to code variables, so we will write it in vector form:
$$\mathbf{p} = \mathbf{o} + t\,\mathbf{d},$$
where, for this example, o and d are given by
$$\mathbf{o} = (2, 1, 3), \quad \mathbf{d} = (7, 2, -5).$$
Note that this is very similar to the 2D case. The way to visualize this is to imagine that the line passes through o and is parallel to d. Given any value of t, you get some point p(t) on the line. For example, at t = 2, p(t) = (2, 1, 3) + 2(7, 2, –5) = (16, 5, –7) . This general concept is the same as for two dimensions (Figure 2.36).
As in 2D, a line segment can be described by a 3D parametric line and an interval t ∈ [t_{a},t_{b}]. The line segment between two points a and b is given by p(t) = a + t(b – a) with t ∈ [0, 1]. Here, p(0) = a, p(1) = b, and p(0.5) = (a + b)/2, the midpoint between a and b.
A ray, or half-line, is a 3D parametric line with a half-open interval, usually [0, ∞) . From now on, we will refer to all lines, line segments, and rays as “rays.” This is sloppy, but corresponds to common usage and makes the discussion simpler.
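The vector form is short enough that a sketch of it is nearly a one-liner. The function names are hypothetical, and tuples stand in for a proper vector type; the first call reproduces the worked example with o = (2, 1, 3) and d = (7, 2, –5).

```python
def point_on_line(o, d, t):
    """p(t) = o + t d, evaluated component by component."""
    return tuple(oi + t * di for oi, di in zip(o, d))

def point_on_segment(a, b, t):
    """p(t) = a + t (b - a); t in [0, 1] stays between a and b."""
    return tuple(ai + t * (bi - ai) for ai, bi in zip(a, b))

p = point_on_line((2, 1, 3), (7, 2, -5), 2)            # (16, 5, -7)
mid = point_on_segment((0, 0, 0), (2, 4, 6), 0.5)      # (1.0, 2.0, 3.0)
```

A ray would use the same `point_on_line` function; the restriction to t ≥ 0 lives in the caller, not in the formula.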
The parametric approach can be used to define surfaces in 3D space in much the same way we define curves, except that there are two parameters to address the two-dimensional area of the surface. These surfaces have the form
$$x = f(u, v), \quad y = g(u, v), \quad z = h(u, v),$$
or, in vector form,
$$\begin{bmatrix} x \\ y \\ z \end{bmatrix} = \mathbf{p}(u, v).$$
With implicit surfaces, the derivative of the function f gave us the surface normal. With parametric surfaces, the derivatives of p also give information about the surface geometry.
Consider the function q(t) = p(t, v_{0}). This function defines a parametric curve obtained by varying u while holding v fixed at the value v_{0}. This curve, called an isoparametric curve (or sometimes “isoparm” for short), lies in the surface. The derivative of q gives a vector tangent to the curve, and since the curve lies in the surface, the vector q′ also lies in the surface. Since it was obtained by varying one argument of p, the vector q′ is the partial derivative of p with respect to u, which we’ll denote p_{u}. A similar argument shows that the partial derivative p_{v} gives the tangent to the isoparametric curves for constant u, which is a second tangent vector to the surface.
The derivative of p, then, gives two tangent vectors at any point on the surface. The normal to the surface may be found by taking the cross product of these vectors: since both are tangent to the surface, their cross product, which is perpendicular to both tangents, is normal to the surface. The right-hand rule for cross products provides a way to decide which side is the front, or outside, of the surface; we will use the convention that the vector
$$\mathbf{n} = \mathbf{p}_u \times \mathbf{p}_v$$
points toward the outside of the surface.
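As an illustration of the tangent-vector construction, the sketch below estimates p_u and p_v by central differences and takes their cross product. The sphere parameterization (u as polar angle, v as azimuth) and the step size `h` are assumptions chosen so that the cross product points outward; a different parameter order would flip the normal.

```python
import math

def sphere(u, v):
    """Unit sphere: u is the polar angle in (0, pi), v is the azimuth."""
    return (math.cos(v) * math.sin(u), math.sin(v) * math.sin(u), math.cos(u))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def surface_normal(p, u, v, h=1e-5):
    """Unit normal p_u x p_v, with the partials estimated by central differences."""
    p_u = tuple((hi - lo) / (2 * h) for hi, lo in zip(p(u + h, v), p(u - h, v)))
    p_v = tuple((hi - lo) / (2 * h) for hi, lo in zip(p(u, v + h), p(u, v - h)))
    n = cross(p_u, p_v)
    length = math.sqrt(sum(c * c for c in n))
    return tuple(c / length for c in n)

n = surface_normal(sphere, 1.0, 0.5)
position = sphere(1.0, 0.5)
```

For a unit sphere centered at the origin the outward normal equals the position vector, so `n` and `position` should agree up to the finite-difference error.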
Implicit curves in 2D or surfaces in 3D are defined by scalar-valued functions of two or three variables, f : ℝ^{2} → ℝ or f : ℝ^{3} → ℝ, and the surface consists of all points where the function is zero:
$$S = \{\mathbf{p} \mid f(\mathbf{p}) = 0\}.$$
Parametric curves in 2D or 3D are defined by vector-valued functions of one variable, p : D ⊂ ℝ → ℝ^{2} or p : D ⊂ ℝ → ℝ^{3}, and the curve is swept out as t varies over all of D:
$$S = \{\mathbf{p}(t) \mid t \in D\}.$$
Parametric surfaces in 3D are defined by vector-valued functions of two variables, p : D ⊂ ℝ^{2} → ℝ^{3}, and the surface consists of the images of all points (u, v) in the domain:
$$S = \{\mathbf{p}(u, v) \mid (u, v) \in D\}.$$
For implicit curves and surfaces, the normal vector is given by the derivative of f (the gradient), and the tangent vector (for a curve) or vectors (for a surface) can be derived from the normal by constructing a basis.
For parametric curves and surfaces, the derivative of p gives the tangent vector (for a curve) or vectors (for a surface), and the normal vector can be derived from the tangents by constructing a basis.
Perhaps the most common mathematical operation in graphics is linear interpolation. We have already seen an example of linear interpolation of position to form line segments in 2D and 3D, where two points a and b are associated with a parameter t to form the line p = (1 – t)a + tb. This is interpolation because p goes through a and b exactly at t = 0 and t = 1. It is linear interpolation because the weighting terms t and 1 – t are linear polynomials of t.
Another common linear interpolation is among a set of positions on the x-axis: x_{0}, x_{1}, ..., x_{n}, and for each x_{i}, we have an associated height, y_{i}. We want to create a continuous function y = f (x) that interpolates these positions, so that f goes through every data point, i.e., f (x_{i}) = y_{i}. For linear interpolation, the points (x_{i},y_{i}) are connected by straight line segments. It is natural to use parametric line equations for these segments. The parameter t is just the fractional distance between x_{i} and x_{i+1}:
$$f(x) = y_i + \frac{x - x_i}{x_{i+1} - x_i}\,(y_{i+1} - y_i). \tag{2.26}$$
Because the weighting functions are linear polynomials of x, this is linear interpolation.
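A sketch of this piecewise linear interpolation follows. The function names and the use of `bisect` to locate the segment are implementation choices assumed here, and the sketch requires the x values to be sorted and distinct. Values of x outside the data range are linearly extrapolated from the nearest segment.

```python
from bisect import bisect_right

def lerp(a, b, t):
    """Linear interpolation: a at t = 0, b at t = 1."""
    return (1 - t) * a + t * b

def interpolate(xs, ys, x):
    """Piecewise linear f with f(xs[i]) == ys[i]; xs must be sorted and distinct."""
    # Index of the segment [xs[i], xs[i+1]] containing x, clamped to a valid segment.
    i = bisect_right(xs, x) - 1
    i = max(0, min(i, len(xs) - 2))
    t = (x - xs[i]) / (xs[i + 1] - xs[i])  # fractional distance along the segment
    return lerp(ys[i], ys[i + 1], t)

xs = [0.0, 1.0, 3.0]
ys = [0.0, 2.0, 6.0]

halfway_first = interpolate(xs, ys, 0.5)    # 1.0
halfway_second = interpolate(xs, ys, 2.0)   # 4.0
```

Separating `lerp` from the segment lookup keeps the common form (1 – t)A + tB visible and reusable for colors, positions, or any other data that can be scaled and added.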
The two examples above have the common form of linear interpolation. We create a variable t that varies from 0 to 1 as we move from data item A to data item B. Intermediate values are just the function (1 – t)A + tB. Notice that Equation (2.26) has this form with
$$t = \frac{x - x_i}{x_{i+1} - x_i}.$$
Triangles in both 2D and 3D are the fundamental modeling primitive in many graphics programs. Often information such as color is tagged onto triangle vertices, and this information is interpolated across the triangle. The coordinate system that makes such interpolation straightforward is called barycentric coordinates; we will develop these from scratch. We will also discuss 2D triangles, which must be understood before we can draw their pictures on 2D screens.
If we have a 2D triangle defined by 2D points a, b, and c, we can first find its area:
$$\text{area} = \frac{1}{2} \begin{vmatrix} x_b - x_a & x_c - x_a \\ y_b - y_a & y_c - y_a \end{vmatrix} = \frac{1}{2}\left(x_a y_b + x_b y_c + x_c y_a - x_a y_c - x_b y_a - x_c y_b\right).$$
The derivation of this formula can be found in Section 6.3. This area will have a positive sign if the points a, b, and c are in counterclockwise order and a negative sign otherwise.
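The sign convention is easy to confirm with a short sketch. The function name `signed_area` is hypothetical, and the formula used is half the 2 × 2 determinant whose columns are b – a and c – a.

```python
def signed_area(a, b, c):
    """Signed area of triangle abc: positive if a, b, c are counterclockwise."""
    return 0.5 * ((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1]))

ccw = signed_area((0.0, 0.0), (1.0, 0.0), (0.0, 1.0))   # +0.5
cw = signed_area((0.0, 0.0), (0.0, 1.0), (1.0, 0.0))    # -0.5
flat = signed_area((0.0, 0.0), (1.0, 1.0), (2.0, 2.0))  # 0.0 for collinear points
```

Swapping any two vertices reverses the orientation and therefore the sign, while collinear points give zero area.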
Often in graphics, we wish to assign a property, such as color, at each triangle vertex and smoothly interpolate the value of that property across the triangle. There are a variety of ways to do this, but the simplest is to use barycentric coordinates. One way to think of barycentric coordinates is as a nonorthogonal coordinate system as was discussed briefly in Section 2.4.2. Such a coordinate system is shown in Figure 2.38, where the coordinate origin is a and the vectors from a to b and c are the basis vectors. With that origin and those basis vectors, any point p can be written as
p = a + β(b – a) + γ(c – a).   (2.28)
Note that we can reorder the terms in Equation (2.28) to get
p = (1 – β – γ)a + βb + γc.
Often people define a new variable α to improve the symmetry of the equations:
α ≡ 1 – β – γ,
which yields the equation
p(α, β, γ) = αa + βb + γc,   (2.29)
with the constraint that
α + β + γ = 1.   (2.30)
Barycentric coordinates seem like an abstract and unintuitive construct at first, but they turn out to be powerful and convenient. You may find it useful to think of how street addresses would work in a city where there are two sets of parallel streets, but where those sets are not at right angles. The natural system would essentially be barycentric coordinates, and you would quickly get used to them. Barycentric coordinates are defined for all points on the plane. A particularly nice feature of barycentric coordinates is that a point p is inside the triangle formed by a, b, and c if and only if
0 < α < 1,
0 < β < 1,
0 < γ < 1.
If one of the coordinates is zero and the other two are between zero and one, then you are on an edge. If two of the coordinates are zero, then the other is one, and you are at a vertex. Another nice property of barycentric coordinates is that Equation (2.29) in effect mixes the coordinates of the three vertices in a smooth way. The same mixing coefficients (α, β, γ) can be used to mix other properties, such as color, as we will see in the next chapter.
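As a small illustration of these two properties, the sketch below tests whether a set of barycentric coordinates describes an interior point and uses the same coefficients to mix a per-vertex property. The function names and the RGB tuples are assumptions made for the example.

```python
def inside_triangle(alpha, beta, gamma):
    """True when the point lies strictly inside the triangle."""
    return 0.0 < alpha < 1.0 and 0.0 < beta < 1.0 and 0.0 < gamma < 1.0


def mix(alpha, beta, gamma, va, vb, vc):
    """Blend per-vertex values (tuples of equal length) with barycentric weights."""
    return tuple(alpha * x + beta * y + gamma * z for x, y, z in zip(va, vb, vc))


# Hypothetical vertex colors: pure red at a, green at b, blue at c.
red, green, blue = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)
color = mix(0.5, 0.25, 0.25, red, green, blue)
```

A point with coordinates (0.5, 0.25, 0.25) is inside the triangle and receives half of the color at a and a quarter each of the colors at b and c.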
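Given a point p, how do we compute its barycentric coordinates? One way is to write Equation (2.28) as a linear system with unknowns β and γ, solve, and set α = 1 – β – γ. That linear system, written out by component, is
β(x_{b} – x_{a}) + γ(x_{c} – x_{a}) = x_{p} – x_{a},
β(y_{b} – y_{a}) + γ(y_{c} – y_{a}) = y_{p} – y_{a}.   (2.31)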
Although it is straightforward to solve Equation (2.31) algebraically, it is often fruitful to compute a direct geometric solution.
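For reference, the algebraic route can be sketched with Cramer's rule on the 2 × 2 system whose columns are the edge vectors b – a and c – a. This is one possible way to solve it, not necessarily the one the text has in mind; the function name is hypothetical, and points are assumed to be (x, y) tuples.

```python
def barycentric_from_system(a, b, c, p):
    """Solve beta*(b - a) + gamma*(c - a) = p - a for (alpha, beta, gamma)."""
    # Matrix columns are the two basis vectors of the triangle's coordinate system.
    m00, m01 = b[0] - a[0], c[0] - a[0]
    m10, m11 = b[1] - a[1], c[1] - a[1]
    rx, ry = p[0] - a[0], p[1] - a[1]
    det = m00 * m11 - m01 * m10  # twice the signed area of the triangle
    if det == 0:
        raise ValueError("degenerate triangle: vertices are collinear")
    beta = (rx * m11 - m01 * ry) / det
    gamma = (m00 * ry - rx * m10) / det
    return 1.0 - beta - gamma, beta, gamma
```

The determinant is twice the signed area of the triangle, so the solve fails exactly when the triangle is degenerate.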
One geometric property of barycentric coordinates is that they are the signed scaled distance from the lines through the triangle sides, as is shown for β in Figure 2.39. Recall from Section 2.7.2 that evaluating the equation f (x, y) for the line f (x, y) = 0 returns the scaled signed distance from (x, y) to the line. Also recall that if f (x, y) = 0 is the equation for a particular line, so is kf (x, y) = 0 for any nonzero k. Changing k scales the distance and controls which side of the line has positive signed distance, and which negative. We would like to choose k such that, for example, kf (x, y) = β. Since k is only one unknown, we can force this with one constraint, namely, that at point b, we know β = 1. So if the line f_{ac}(x, y) = 0 goes through both a and c, then we can compute β for a point (x, y) as follows:
β = f_{ac}(x, y) / f_{ac}(x_{b}, y_{b}),
and we can compute γ and α in a similar fashion. For efficiency, it is usually wise to compute only two of the barycentric coordinates directly and to compute the third using Equation (2.30).
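To find this “ideal” form for the line through p_{0} and p_{1}, we can first use the technique of Section 2.7.2 to find some valid implicit lines through the vertices. Equation (2.17) gives us
f_{ab}(x, y) ≡ (y_{a} – y_{b})x + (x_{b} – x_{a})y + x_{a}y_{b} – x_{b}y_{a} = 0.
Note that f_{ab}(x_{c}, y_{c}) probably does not equal one, so it is probably not the ideal form we seek. By dividing through by f_{ab}(x_{c}, y_{c}), we get
γ = ((y_{a} – y_{b})x + (x_{b} – x_{a})y + x_{a}y_{b} – x_{b}y_{a}) / ((y_{a} – y_{b})x_{c} + (x_{b} – x_{a})y_{c} + x_{a}y_{b} – x_{b}y_{a}).
The presence of the division might worry us because it introduces the possibility of divide-by-zero, but this cannot occur for triangles with areas that are not near zero. There are analogous formulas for α and β, but typically only one is needed:
β = ((y_{a} – y_{c})x + (x_{c} – x_{a})y + x_{a}y_{c} – x_{c}y_{a}) / ((y_{a} – y_{c})x_{b} + (x_{c} – x_{a})y_{b} + x_{a}y_{c} – x_{c}y_{a}),
α = 1 – β – γ.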
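The geometric construction translates almost directly into code. In this sketch, `implicit_line` builds the implicit function for the line through two points, and the barycentric coordinates are ratios of that function evaluated at p and at the opposite vertex. The names are hypothetical, and the triangle is assumed to be nondegenerate.

```python
def implicit_line(p0, p1):
    """Return f with f(x, y) = 0 on the line through the 2D points p0 and p1."""
    x0, y0 = p0
    x1, y1 = p1

    def f(x, y):
        return (y0 - y1) * x + (x1 - x0) * y + x0 * y1 - x1 * y0

    return f


def barycentric_from_lines(a, b, c, p):
    """Barycentric coordinates of p as scaled signed distances to the edge lines."""
    f_ab = implicit_line(a, b)
    f_ac = implicit_line(a, c)
    gamma = f_ab(*p) / f_ab(*c)  # scaled so that gamma = 1 at vertex c
    beta = f_ac(*p) / f_ac(*b)   # scaled so that beta = 1 at vertex b
    return 1.0 - beta - gamma, beta, gamma
```

Only two of the coordinates are computed directly; the third follows from the constraint that the three sum to one. The same code works for points outside the triangle, where at least one coordinate becomes negative.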
Another way to compute barycentric coordinates is to compute the areas A_{a}, A_{b}, and A_{c} of subtriangles as shown in Figure 2.40. Barycentric coordinates obey
α = A_{a}/A,  β = A_{b}/A,  γ = A_{c}/A,
where A is the area of the triangle. Note that A = A_{a} + A_{b} + A_{c}, so it can be computed with two additions rather than a full area formula. This rule still holds for points outside the triangle if the areas are allowed to be signed. The reason for this is shown in Figure 2.41. Note that these are signed areas and will be computed correctly as long as the same signed area computation is used for both A and the subtriangles A_{a}, A_{b}, and A_{c}.
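A minimal sketch of the area-based method follows. Each subtriangle is formed by replacing one vertex with p, and the total area is obtained by adding the three signed subareas. The function names are hypothetical, and a nondegenerate triangle is assumed.

```python
def signed_area(a, b, c):
    """Signed area of the 2D triangle a, b, c (positive if counterclockwise)."""
    return 0.5 * ((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1]))


def barycentric_from_areas(a, b, c, p):
    """Barycentric coordinates of p as ratios of signed subtriangle areas."""
    area_a = signed_area(p, b, c)  # subtriangle opposite vertex a
    area_b = signed_area(a, p, c)  # subtriangle opposite vertex b
    area_c = signed_area(a, b, p)  # subtriangle opposite vertex c
    total = area_a + area_b + area_c  # equals the signed area of a, b, c
    return area_a / total, area_b / total, area_c / total
```

Because the same signed-area routine is used throughout, the result is also correct for points outside the triangle and for clockwise vertex order.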
One wonderful thing about barycentric coordinates is that they extend almost transparently to 3D. If we assume the points a, b, and c are 3D, then we can still use the representation
p = (1 – β – γ)a + βb + γc.
Now, as we vary β and γ, we sweep out a plane.
The normal vector to a triangle can be found by taking the cross product of any two vectors in the plane of the triangle (Figure 2.42). It is easiest to use two of the three edges as these vectors, for example,
n = (b – a) × (c – a).   (2.34)
Note that this normal vector is not necessarily of unit length, and it obeys the right-hand rule of cross products.
The area of the triangle can be found by taking the length of the cross product:
area = (1/2) ‖(b – a) × (c – a)‖.
Note that this is not a signed area, so it cannot be used directly to evaluate barycentric coordinates. However, we can observe that a triangle with a “clockwise” vertex order will have a normal vector that points in the opposite direction to the normal of a triangle in the same plane with a “counterclockwise” vertex order. Recall that
a · b = ‖a‖ ‖b‖ cos ϕ,
where ϕ is the angle between the vectors. If a and b are parallel, then cos ϕ = ±1, and this gives a test of whether the vectors point in the same or opposite directions. This, along with Equations (2.33)–(2.35), suggests the formulas
α = (n · n_{a}) / ‖n‖^{2},
β = (n · n_{b}) / ‖n‖^{2},
γ = (n · n_{c}) / ‖n‖^{2},
where n is Equation (2.34) evaluated with vertices a, b, and c; n_{a} is Equation (2.34) evaluated with vertices b, c, and p, and so on, i.e.,
n_{a} = (c – b) × (p – b),
n_{b} = (a – c) × (p – c),
n_{c} = (b – a) × (p – a).
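The 3D formulas can be sketched with a few small vector helpers. All names here are hypothetical, points are assumed to be (x, y, z) tuples, and p is assumed to lie in the plane of a nondegenerate triangle.

```python
def sub(u, v):
    """Componentwise difference u - v of two 3D vectors."""
    return (u[0] - v[0], u[1] - v[1], u[2] - v[2])


def cross(u, v):
    """Cross product u x v, following the right-hand rule."""
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def dot(u, v):
    """Dot product of two 3D vectors."""
    return u[0] * v[0] + u[1] * v[1] + u[2] * v[2]


def barycentric_3d(a, b, c, p):
    """Barycentric coordinates of p from subtriangle normals projected onto n."""
    n = cross(sub(b, a), sub(c, a))
    n_a = cross(sub(c, b), sub(p, b))  # normal of subtriangle b, c, p
    n_b = cross(sub(a, c), sub(p, c))  # normal of subtriangle c, a, p
    n_c = cross(sub(b, a), sub(p, a))  # normal of subtriangle a, b, p
    nn = dot(n, n)
    return dot(n, n_a) / nn, dot(n, n_b) / nn, dot(n, n_c) / nn
```

The dot product with n supplies the sign that the unsigned area formula lacks: a subtriangle normal that points against n contributes a negative coordinate.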
Probability is the study of things that have random outcomes, and discrete probability refers to the case where there is a finite number of possible outcomes. A classic example is a six-sided die, which takes on a random value from {1, 2, 3, 4, 5, 6} when you roll it, with each outcome equally probable. The probability of a certain outcome is the fraction of the time that outcome happens. The fraction of the time that something happens at all is 1, so each face comes up one sixth of the time.
One of the more confusing things about randomness is distinguishing between a random outcome that hasn’t happened yet (or has happened, but we don’t know the result) and a die after it has been rolled. A random variable is a single quantity that does not have a known value, but will take on one from a known set of possibilities with a known likelihood. The term “variable” here comes from math and is directly related to “variable” in programming. An example of a random variable is X, where “X = the eventual outcome of the die.” The variable could use any symbol; capital X is often used as a random variable symbol in math in the same way “i” and “j” are often used for loop variables in computer science. Computer programs have a pretty direct use of random variables:
int X = rand_from(1,6)
where X is a variable whose value we don’t know, but we do know that when we run the program, we will get one of six values, each with a probability of one sixth, and this corresponds directly to the case of a random variable “X = the eventual outcome of the die.” There are two properties of random variables that are used all the time: expected value and variance. The expected value, sometimes called expectation, of a random variable X, often denoted EX or E(X), might better be called “expected average value,” but it isn’t, so don’t say that or it will confuse people who know the standard terminology. It is just the average value that X takes on over all the parallel universes where “the die is rolled.” It can be computed by multiplying each outcome by its probability and adding:
E(X) = (1/6)(1) + (1/6)(2) + (1/6)(3) + (1/6)(4) + (1/6)(5) + (1/6)(6) = 3.5.
So if we averaged a lot of dice rolls, we would “expect” a value around 3.5. Thus saying “I expect the die to come up 3.5” is not the nonsense it sounds like when you know the die can’t come up anything but a whole number, though the terminology is perhaps unfortunate. It is quite standard across fields, so despite its flaws, just try to internalize it and you will have no problems communicating with people from other fields about this topic.
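Both views of the expected value, the weighted sum and the long-run average, can be checked with a short sketch. The function names are hypothetical, Python's `random.Random.randint` stands in for the `rand_from` call used in the text, and the fixed seed is only there to make the run repeatable.

```python
import random
from fractions import Fraction


def expected_value(outcomes, probabilities):
    """Multiply each outcome by its probability and add."""
    return sum(p * x for x, p in zip(outcomes, probabilities))


faces = [1, 2, 3, 4, 5, 6]
fair = [Fraction(1, 6)] * 6
die_mean = expected_value(faces, fair)  # exact arithmetic: 7/2


def average_of_rolls(n, seed=0):
    """Average n simulated rolls of a fair six-sided die."""
    rng = random.Random(seed)
    return sum(rng.randint(1, 6) for _ in range(n)) / n
```

With many rolls the simulated average settles near the exact value, even though no single roll can ever equal it.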
Expected value tells us where a random variable will trend, but it doesn’t tell us how long that trend will take to emerge nor how much the variable strays from its average. For example, a die that had 3 ones and 3 sixes would still have an expected value of 3.5, but the “deviation from the mean” would be larger than on a conventional die. So how do we measure the magnitude of variation? One way would be to measure the average deviation from 3.5, but if we include signs, that average deviation is zero because the –2.5 of rolling 1 cancels out the +2.5 of rolling 6. We could take the absolute difference, but that has practical problems (algebra including absolute values is challenging) and some theoretical issues. In practice people prefer the average squared deviation and call it variance:
variance(X) = average of (X – E(X))^{2}.
Because it is statistical, that average is an expectation, so
variance(X) = E[(X – E(X))^{2}].
For the case of the die, E(X) = 3.5, and the values of X – E(X) are –2.5, –1.5, –0.5, 0.5, 1.5, 2.5, and the values of (X – E(X))^{2} are thus 6.25, 2.25, 0.25, 0.25, 2.25, 6.25, and thus the variance of X, written V(X) here, is 17.5/6.
An algebraic manipulation of the variance formula yields a sometimes more convenient form:
V(X) = E(X^{2}) – (E(X))^{2}.
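The two forms of the variance can be compared exactly for the die using rational arithmetic. This sketch assumes equally likely outcomes; the function names are hypothetical.

```python
from fractions import Fraction


def expectation(values):
    """Average of equally likely values, kept exact with Fraction."""
    values = list(values)
    return sum(Fraction(v) for v in values) / len(values)


def variance_by_definition(values):
    """E[(X - E(X))^2]: the average squared deviation from the mean."""
    values = list(values)
    mean = expectation(values)
    return expectation((v - mean) ** 2 for v in values)


def variance_by_shortcut(values):
    """E(X^2) - (E(X))^2: the rearranged form."""
    values = list(values)
    return expectation(v * v for v in values) - expectation(values) ** 2


faces = [1, 2, 3, 4, 5, 6]
```

Both functions give 35/12 for the fair die, which is the same number as 17.5/6. For the die with three ones and three sixes, both give 25/4, confirming the larger spread around the same mean.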
There are some algebraic niceties to expectation and variance that get used a lot. For example, suppose we have two random variables X and Y and define a variable Z = X + Y . What is E(Z)? It turns out
E(Z) = E(X + Y) = E(X) + E(Y).
An amazing thing is that this holds even if X and Y are not “statistically independent” (so for the case of our dice, even if they influence each other somehow). In an extreme example, we can look at the first die and just set the second die to the same value as the first. The formula still applies! This is very powerful and is often used as an unstated property in programs.
The variance has the same behavior, variance(X + Y) = variance(X) + variance(Y), but only if X and Y are independent.
As a counterexample showing that this formula does not necessarily apply for dependent X and Y, assume you roll X and then just set Y to be the opposite side of that die, so for X = 1 choose Y = 6, for X = 2 choose Y = 5, for X = 3 choose Y = 4, etc. The value of Z = X + Y will always be 7, and thus the variance of Z is zero, clearly not the 2(17.5/6) that the sum of two independent dice would yield.
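These claims about sums can be verified by enumerating the joint outcomes. The sketch below compares three ways of choosing the second die: independently, as the opposite face, and as a copy of the first. The helper name and the list names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def mean_and_variance(weighted_values):
    """Exact mean and variance of a list of (value, probability) pairs."""
    pairs = list(weighted_values)
    mean = sum(p * v for v, p in pairs)
    variance = sum(p * (v - mean) ** 2 for v, p in pairs)
    return mean, variance


faces = range(1, 7)
# Two independent dice: all 36 pairs are equally likely.
independent = [(x + y, Fraction(1, 36)) for x, y in product(faces, faces)]
# Y is the opposite face of X, so the sum is always 7.
opposite = [(x + (7 - x), Fraction(1, 6)) for x in faces]
# Y is a copy of X, so the sum is 2X.
copied = [(x + x, Fraction(1, 6)) for x in faces]
```

All three sums have expected value 7, but the variances are 35/6 for the independent dice, 0 for the opposite-face rule, and 35/3 for the copy. Only the independent case matches the sum of the two individual variances.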
One disadvantage of variance is that it is not very intuitive because of the squaring. So people often use the square root of the variance, called the standard deviation, usually denoted σ(X). So
σ(X) = sqrt(variance(X)).
There are no nice formulas for σ(X + Y), so the appeal of variance, where there are nice formulas, becomes more obvious. Note that for the die example, the standard deviation is sqrt(17.5/6) ≈ 1.71. This is “around” the average distance from the mean of 3.5, but is slightly different, as the actual mean absolute distance is 1.5. So while in practice it is almost always safe to think of standard deviation as average absolute deviation, it is good to keep in the back of your mind that they are different.
In graphics, we often use random variables that can take on a range of values. These are usually called continuous random variables. The good news is that almost everything about discrete random variables carries over: the terminology, the expected value definition and formulas, and the variance definition and formulas. There is, however, a big difference: the probability of a continuous random variable taking on any particular value is zero. Suppose you have a uniform random variable X that is between 0 and 10:
X = continuous_random_from(0, 10)
The values 1.7, π, and e are all equally likely. The trouble is that getting exactly 1.7 has a probability of zero.
The good news is that density functions solve this problem. Just like the case of joules per second, we can use probability per length for the 1D case. In the example above of X = continuous_random_from(0, 10), the dimension over which we measure the probability is length. If the length is in some unspecified unit and we just know the zero to ten range, then we would say the probability is measured “per unit length.”
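For a uniform random variable, the density is simply one over the length of the range, and the probability of landing in an interval is the density times the interval's length. A minimal sketch, with hypothetical function names and the range passed in as `lo` and `hi`:

```python
def uniform_pdf(x, lo, hi):
    """Probability per unit length of a uniform random variable on [lo, hi]."""
    return 1.0 / (hi - lo) if lo <= x <= hi else 0.0


def probability_between(x0, x1, lo, hi):
    """P(x0 <= X <= x1): the overlap with [lo, hi] as a fraction of its length."""
    overlap = max(0.0, min(x1, hi) - max(x0, lo))
    return overlap / (hi - lo)
```

On the range from 0 to 10, the density at 1.7 is 0.1 per unit length, yet the probability of the zero-length interval from 1.7 to 1.7 is zero, which matches the discussion above.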
Section 2.5 discussed how to “read” an integral and abstracted it away as an “integrate()” function. But how do we actually implement that function? The most common way in graphics is to use Monte Carlo Integration. The algebra for Monte Carlo Integration is often ugly and intimidating. But if we look at this function:
float shade = average(f(), hemisphere)
our intuition would find the right answer. Pick a bunch of random points v_{i} on the hemisphere, evaluate f (v_{i}) at each, and average the results, for example:
float sum = 0.0
int N = 10000 // or some other big number the user sets
for (int i = 1 to N)
    vec3 v = random_point_on_hemisphere()
    sum = sum + f(v)
float average = sum / N
It really is that easy! Now you need a function to pick random points on the unit hemisphere. The simplest method is a “rejection method” that first picks points uniformly in the unit ball by repeatedly picking three random numbers uniformly in the cube [–1, 1]^{3} that encloses it, until the point lands inside the ball:
do
    X = random_from(-1,1)
    Y = random_from(-1,1)
    Z = random_from(-1,1)
while (X*X + Y*Y + Z*Z > 1)
And then flip the Z if needed to be in the half-ball:
if (Z < 0) Z = -Z
Then, project the point onto the unit hemisphere:
v = unit_vector(X, Y, Z)
That is a way to handle an average. But what about a general integral? Recall that
average(f(), domain) = integrate(f(), domain) /
integrate(1, domain)
So
integrate(f(), domain) = average(f(), domain)*
integrate(1, domain)
In the case of a hemisphere, integrate(1, domain) is just the area, which is 2π.
So Monte Carlo integration often is an average of random points times a constant (the size of the domain: length, area, etc.).
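Putting the pieces together, here is a runnable sketch of the hemisphere integral in Python. The function names mirror the pseudocode, `random.Random.uniform` stands in for `random_from`, and the test integrand f(v) = z (the cosine of the angle to the pole), whose integral over the unit hemisphere is π, is chosen purely for illustration. The rejection loop also skips the origin so the normalization never divides by zero.

```python
import math
import random


def random_point_on_hemisphere(rng):
    """Uniform point on the unit hemisphere z >= 0, by rejection sampling."""
    while True:
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        z = rng.uniform(-1.0, 1.0)
        r2 = x * x + y * y + z * z
        if 0.0 < r2 <= 1.0:  # keep only points inside the unit ball
            break
    r = math.sqrt(r2)
    # Flip into the upper half-ball, then project onto the sphere.
    return (x / r, y / r, abs(z) / r)


def integrate_over_hemisphere(f, n, seed=0):
    """Average f over n random directions, times the hemisphere area 2*pi."""
    rng = random.Random(seed)
    total = sum(f(random_point_on_hemisphere(rng)) for _ in range(n))
    return (total / n) * 2.0 * math.pi


estimate = integrate_over_hemisphere(lambda v: v[2], 50000)
```

With 50,000 samples the estimate should land close to π; increasing the sample count reduces the error only slowly, roughly in proportion to one over the square root of the number of samples.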
When a function we want to take a random average of has a wide variation between its high and low values, it can be to our advantage to concentrate samples in some areas and then correct for the nonuniformity with weights. Probability density functions give us the right tool for that: if we know the PDF of a sample, that is a direct measure of how “oversampled” that region is. If we use nonuniform samples with PDF p(), we thus get
integrate(f(), domain) = average_of_nonuniform_samples(f()/p(),
domain).
A neat thing about this formula is it also works for uniform random samples. In that case, the PDF p() = 1/ integrate(1, domain) so the “size” of the domain is encoded in the PDF.
For any given Monte Carlo importance sampling problem, there is a pretty formulaic approach we follow, at least to get started:
1. Identify what is the function f () and the domain of integration (e.g., points on the unit sphere or points on a triangle).
2. Pick a method for generating random samples x_{i} on that domain, and make sure there is a way to evaluate the PDF p(x_{i}) for each sample.
3. Average the ratio f (x_{i})/p(x_{i}) for many x_{i}. This is our estimate of the integral.
A neat thing is that any p() can be used and you will converge to the right answer (with the caveat that where f () is nonzero p() must be nonzero). Which p() you use merely influences how fast your estimate converges. So we usually start with a constant p() for debugging our code.
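The recipe above can be sketched for a simple 1D case. The example integrates f(x) = x^2 over [0, 2], whose exact value is 8/3, first with a constant PDF and then with a PDF proportional to x. The integrand, the two sampling strategies, and all function names are assumptions chosen for illustration.

```python
import math
import random


def estimate_integral(f, sample, pdf, n, seed=0):
    """Average f(x)/p(x) over n samples x drawn with density p."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = sample(rng)
        total += f(x) / pdf(x)
    return total / n


def f(x):
    return x * x


# Constant PDF on [0, 2]: p(x) = 1/2, the "debugging" choice.
uniform_estimate = estimate_integral(
    f, lambda rng: rng.uniform(0.0, 2.0), lambda x: 0.5, 20000)

# PDF p(x) = x/2 on (0, 2], sampled by inverting its CDF x^2/4.
# Using 1 - rng.random() keeps x strictly positive so p(x) is never zero.
linear_estimate = estimate_integral(
    f, lambda rng: 2.0 * math.sqrt(1.0 - rng.random()), lambda x: 0.5 * x, 20000)
```

Both estimates converge to 8/3, but the second PDF follows the shape of f more closely, so the ratio f/p varies less from sample to sample and the estimate settles faster.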
Why isn’t there vector division?
It turns out that there is no “nice” analogy of division for vectors. However, it is possible to motivate the quaternions by examining this question in detail (see Hoffmann’s book referenced in the chapter notes).
Is there something as clean as barycentric coordinates for polygons with more than three sides?
Unfortunately, there is not. Even convex quadrilaterals are much more complicated. This is one reason triangles are such a common geometric primitive in graphics.
Is there an implicit form for 3D lines?
No. However, the intersection of two 3D planes defines a 3D line, so a 3D line can be described by two simultaneous implicit 3D equations.
How is quasi–Monte Carlo (QMC) or blue noise sampling related to Monte Carlo sampling?
The core idea of Monte Carlo is that you can average a bunch of “fair” samples to estimate a true average. Here, fair can be framed in a statistical sense. But some sample sets can also be shown to be “fair” even if they are not random. One such family is quasi–Monte Carlo sets, which have obvious deterministic structure and so are not random, but are uniform in a formal, nonstatistical sense; these sets often improve convergence over random ones. Blue noise sample sets add constraints on the samples to avoid clumping and, like QMC sets, can improve convergence without being fully random. In practice, most techniques are developed using Monte Carlo formalisms because the math is more tractable, and then QMC or blue noise points are inserted in the code with the empirical confidence that uniformity is all that is needed in practice.
The history of vector analysis is particularly interesting. It was largely invented by Grassmann in the mid-1800s but was ignored and reinvented later (Crowe, 1994). Grassmann now has a following in the graphics field of researchers who are developing Geometric Algebra based on some of his ideas (Doran & Lasenby, 2003). Readers interested in why the particular scalar and vector products are in some sense the right ones, and why we do not have a commonly used vector division, will find enlightenment in the concise About Vectors (Hoffmann, 1975). Another important geometric tool is the quaternion, invented by Hamilton in the mid-1800s. Quaternions are useful in many situations, but especially where orientations are concerned (Hanson, 2005).
1. The cardinality of a set is the number of elements it contains. Under IEEE floating-point representation (Section 1.5), what is the cardinality of the floats?
2. Is it possible to implement a function that maps 32-bit integers to 64-bit integers that has a well-defined inverse? Do all functions from 32-bit integers to 64-bit integers have well-defined inverses?
3. Specify the unit cube (x-, y-, and z-coordinates all between 0 and 1 inclusive) in terms of the Cartesian product of three intervals.
4. If you have access to the natural log function ln(x) , specify how you could use it to implement a log(b, x) function where b is the base of the log. What should the function do for negative b values? Assume an IEEE floating-point implementation.
5. Solve the quadratic equation 2x^{2} +6x +4 = 0.
6. Implement a function that takes in coefficients A, B, and C for the quadratic equation Ax^{2} + Bx + C = 0 and computes the two solutions. Have the function return the number of valid (not NaN) solutions and fill in the return arguments so the smaller of the two solutions is first.
7. Show that the two forms of the quadratic formula on page 17 are equivalent (assuming exact arithmetic) and explain how to choose one for each root in order to avoid subtracting nearly equal floating-point numbers, which leads to loss of precision.
8. Show by counterexample that it is not always true that for 3D vectors a, b, and c, a × (b × c) = (a × b) × c.
9. Given the nonparallel 3D vectors a and b, compute a right-handed orthonormal basis such that u is parallel to a and v is in the plane defined by a and b.
10. What is the gradient of f (x, y, z) = x^{2} + y – 3z^{3}?
11. What is a parametric form for the axis-aligned 2D ellipse?
12. What is the implicit equation of the plane through 3D points (1, 0, 0) , (0, 1, 0) , and (0, 0, 1) ? What is the parametric equation? What is the normal vector to this plane?
13. Given four 2D points a_{0}, a_{1}, b_{0}, and b_{1}, design a robust procedure to determine whether the line segments a_{0}a_{1} and b_{0}b_{1} intersect.
14. Design a robust procedure to compute the barycentric coordinates of a 2D point with respect to three 2D non-collinear points.
15. Calculate various 1D integrals from introductory calculus using Monte Carlo integration, and vary the number of samples. How quickly does the answer converge as the number of samples is increased?
Most computer graphics images are presented to the user on some kind of raster display. Raster displays show images as rectangular arrays of pixels. A common example is a flat-panel computer display or television, which has a rectangular array of small light-emitting pixels that can individually be set to different colors to create any desired image. Different colors are achieved by mixing varying intensities of red, green, and blue light. Most printers, such as laser printers and ink-jet printers, are also raster devices. They are based on scanning: there is no physical grid of pixels, but the image is laid down sequentially by depositing ink at selected points on a grid.
Pixel is short for “picture element.”
Rasters are also prevalent in input devices for images. A digital camera contains an image sensor comprising a grid of light-sensitive pixels, each of which records the color and intensity of light falling on it. A desktop scanner contains a linear array of pixels that is swept across the page being scanned, making many measurements per second to produce a grid of pixels.
Color in printers is more complicated, involving mixtures of at least four pigments.
Because rasters are so prevalent in devices, raster images are the most common way to store and process images. A raster image is simply a 2D array that stores the pixel value for each pixel—usually a color stored as three numbers, for red, green, and blue. A raster image stored in memory can be displayed by using each pixel in the stored image to control the color of one pixel of the display.
Or, maybe it’s because raster images are so convenient that raster devices are prevalent.
But we don’t always want to display an image this way. We might want to change the size or orientation of the image, correct the colors, or even show the image pasted on a moving three-dimensional surface. Even in televisions, the display rarely has the same number of pixels as the image being displayed. Considerations like these break the direct link between image pixels and display pixels. It’s best to think of a raster image as a device-independent description of the image to be displayed, and the display device as a way of approximating that ideal image.
There are other ways of describing images besides using arrays of pixels. A vector image is described by storing descriptions of shapes—areas of color bounded by lines or curves—with no reference to any particular pixel grid. In essence, this amounts to storing the instructions for displaying the image rather than the pixels needed to display it. The main advantage of vector images is that they are resolution independent and can be displayed well on very-high-resolution devices. The corresponding disadvantage is that they must be rasterized before they can be displayed. Vector images are often used for text, diagrams, mechanical drawings, and other applications where crispness and precision are important and photographic images and complex shading aren’t needed.
In this chapter, we discuss the basics of raster images and displays, paying particular attention to the nonlinearities of standard displays. The details of how pixel values relate to light intensities are important to have in mind when we discuss computing images in later chapters.
Or: you have to know what those numbers in your image actually mean.
Before discussing raster images in the abstract, it is instructive to look at the basic operation of some specific devices that use these images. A few familiar raster devices can be categorized into a simple hierarchy:
Output
– Display
* Transmissive: liquid crystal display (LCD)
* Emissive: light-emitting diode (LED) display
– Hardcopy
* Binary: ink-jet printer
* Continuous tone: dye sublimation printer
Input
– 2D array sensor: digital camera
– 1D array sensor: flatbed scanner
Current displays, including televisions and digital cinematic projectors as well as displays and projectors for computers, are nearly universally based on fixed arrays of pixels. They can be separated into emissive displays, which use pixels that directly emit controllable amounts of light, and transmissive displays, in which the pixels themselves don’t emit light but instead vary the amount of light that they allow to pass through them. Transmissive displays require a light source to illuminate them: in a direct-viewed display, this is a backlight behind the array; in a projector, it is a lamp that emits light that is projected onto the screen after passing through the array. An emissive display is its own light source.
Light-emitting diode (LED) displays are an example of the emissive type. Each pixel is composed of one or more LEDs, which are semiconductor devices (based on inorganic or organic semiconductors) that emit light with intensity depending on the electrical current passing through them (see Figure 3.1).
The pixels in a color display are divided into three independently controlled subpixels—one red, one green, and one blue—each with its own LED made using different materials so that they emit light of different colors (Figure 3.2). When the display is viewed from a distance, the eye can’t separate the individual subpixels, and the perceived color is a mixture of red, green, and blue.
Liquid crystal displays (LCDs) are an example of the transmissive type. A liquid crystal is a material whose molecular structure enables it to rotate the polarization of light that passes through it, and the degree of rotation can be adjusted by an applied voltage. An LCD pixel (Figure 3.3) has a layer of polarizing film behind it, so that it is illuminated by polarized light—let’s assume it is polarized horizontally.
A second layer of polarizing film in front of the pixel is oriented to transmit only vertically polarized light. If the applied voltage is set so that the liquid crystal layer in between does not change the polarization, all light is blocked and the pixel is in the “off” (minimum intensity) state. If the voltage is set so that the liquid crystal rotates the polarization by 90°, then all the light that entered through the back of the pixel will escape through the front, and the pixel is fully “on”—it has its maximum intensity. Intermediate voltages will partly rotate the polarization so that the front polarizer partly blocks the light, resulting in intensities between the minimum and maximum (Figure 3.4). Like color LED displays, color LCDs have red, green, and blue subpixels within each pixel, which are three independent pixels with red, green, and blue color filters over them.
Any type of display with a fixed pixel grid, including these and other technologies, has a fundamentally fixed resolution determined by the size of the grid. For displays and images, resolution simply means the dimensions of the pixel grid: if a desktop monitor has a resolution of 1920 × 1200 pixels, this means that it has 2,304,000 pixels arranged in 1920 columns and 1200 rows.
The resolution of a display is sometimes called its “native resolution” since most displays can handle images of other resolutions, via built-in conversion.
An image of a different resolution, to fill the screen, must be converted into a 1920 × 1200 image using the methods of Chapter 10.
The process of recording images permanently on paper has very different constraints from showing images transiently on a display. In printing, pigments are distributed on paper or another medium so that when light reflects from the paper it forms the desired image. Printers are raster devices like displays, but many printers can only print binary images—pigment is either deposited or not at each grid position, with no intermediate amounts possible.
An ink-jet printer (Figure 3.5) is an example of a device that forms a raster image by scanning. An ink-jet print head contains liquid ink carrying pigment, which can be sprayed in very small drops under electronic control. The head moves across the paper, and drops are emitted as it passes grid positions that should receive ink; no ink is emitted in areas intended to remain blank. After each sweep, the paper is advanced slightly, and then, the next row of the grid is laid down. Color prints are made by using several print heads, each spraying ink with a different pigment, so that each grid position can receive any combination of different colored drops. Because all drops are the same, an ink-jet printer prints binary images: at each grid point, there is a drop or no drop; there are no intermediate shades.
An ink-jet printer has no physical array of pixels; the resolution is determined by how small the drops can be made and how far the paper is advanced after each sweep. Many ink-jet printers have multiple nozzles in the print head, enabling several sweeps to be made in one pass, but it is the paper advance, not the nozzle spacing, that ultimately determines the spacing of the rows.
The thermal dye transfer process is an example of a continuous tone printing process, meaning that varying amounts of dye can be deposited at each pixel—it is not all-or-nothing like an ink-jet printer (Figure 3.6). A donor ribbon containing colored dye is pressed between the paper, or dye receiver, and a print head containing a linear array of heating elements, one for each column of pixels in the image. As the paper and ribbon move past the head, the heating elements switch on and off to heat the ribbon in areas where dye is desired, causing the dye to diffuse from the ribbon to the paper. This process is repeated for each of several dye colors. Since higher temperatures cause more dye to be transferred, the amount of each dye deposited at each grid position can be controlled, allowing a continuous range of colors to be produced. The number of heating elements in the print head establishes a fixed resolution in the direction across the page, but the resolution along the page is determined by the rate of heating and cooling compared to the speed of the paper.
There are also continuous ink-jet printers that print in a continuous helical path on paper wrapped around a spinning drum, rather than moving the head back and forth.
Unlike displays, the resolution of printers is described in terms of the pixel density instead of the total count of pixels. So a thermal dye transfer printer that has elements spaced 300 per inch across its print head has a resolution of 300 pixels per inch (ppi) across the page. If the resolution along the page is chosen to be the same, we can simply say the printer’s resolution is 300 ppi. An ink-jet printer that places dots on a grid with 1200 grid points per inch is described as having a resolution of 1200 dots per inch (dpi). Because the ink-jet printer is a binary device, it requires a much finer grid for at least two reasons. Because edges are abrupt black/white boundaries, very high resolution is required to keep stair-stepping, or aliasing, from appearing (see Section 9.3). When continuous-tone images are printed, the high resolution is required to simulate intermediate colors by printing varying-density dot patterns called halftones.
The term “dpi” is all too often used to mean “pixels per inch,” but dpi should be used in reference to binary devices and ppi in reference to continuous-tone devices.
Raster images have to come from somewhere, and any image that wasn’t computed by some algorithm has to have been measured by some raster input device, most often a camera or scanner. Even in rendering images of 3D scenes, photographs are used constantly as texture maps (see Chapter 11). A raster input device has to make a light measurement for each pixel, and (like output devices) they are usually based on arrays of sensors.
A digital camera is an example of a 2D array input device. The image sensor in a camera is a semiconductor device with a grid of light-sensitive pixels. Two common types of arrays are known as CCDs (charge-coupled devices) and CMOS (complementary metal–oxide–semiconductor) image sensors. The camera’s lens projects an image of the scene to be photographed onto the sensor, and then each pixel measures the light energy falling on it, ultimately resulting in a number that goes into the output image (Figure 3.7). In much the same way as color displays use red, green, and blue subpixels, most color cameras work by using a color-filter array or mosaic to allow each pixel to see only red, green, or blue light, leaving the image processing software to fill in the missing values in a process known as demosaicking (Figure 3.8).
Other cameras use three separate arrays, or three separate layers in the array, to measure independent red, green, and blue values at each pixel, producing a usable color image without further processing. The resolution of a camera is determined by the fixed number of pixels in the array and is usually quoted using the total count of pixels: a camera with an array of 3000 columns and 2000 rows produces an image of resolution 3000 × 2000, which has 6 million pixels, and is called a 6 megapixel (MP) camera. It’s important to remember that a mosaic sensor does not measure a complete color image, so a camera that measures the same number of pixels but with independent red, green, and blue measurements records more information about the image than one with a mosaic sensor.
People who are selling cameras use “mega” to mean 10^{6}, not 2^{20} as with megabytes.
A flatbed scanner also measures red, green, and blue values for each of a grid of pixels, but like a thermal dye transfer printer, it uses a 1D array that sweeps across the page being scanned, making many measurements per second (Figure 3.9). The resolution across the page is fixed by the size of the array, and the resolution along the page is determined by the frequency of measurements compared to the speed at which the scan head moves. A color scanner has a 3 × n_{x} array, where n_{x} is the number of pixels across the page, with the three rows covered by red, green, and blue filters. With an appropriate delay between the times at which the three colors are measured, this allows three independent color measurements at each grid point. As with continuous-tone printers, the resolution of scanners is reported in pixels per inch (ppi).
The resolution of a scanner is sometimes called its “optical resolution” since most scanners can produce images of other resolutions, via built-in conversion.
With this concrete information about where our images come from and where they will go, we’ll now discuss images more abstractly, in the way we’ll use them in graphics algorithms.
“A pixel is not a little square!”—Alvy Ray Smith (1995)
We know that a raster image is a big array of pixels, each of which stores information about the color of the image at its grid point. We’ve seen what various output devices do with images we send to them and how input devices derive them from images formed by light in the physical world. But for computations in the computer, we need a convenient abstraction that is independent of the specifics of any device, that we can use to reason about how to produce or interpret the values stored in images.
When we measure or reproduce images, they take the form of two-dimensional distributions of light energy: the light emitted from the monitor as a function of position on the face of the display; the light falling on a camera’s image sensor as a function of position across the sensor’s plane; the reflectance, or fraction of light reflected (as opposed to absorbed) as a function of position on a piece of paper. So in the physical world, images are functions defined over two-dimensional areas—almost always rectangles. So we can abstract an image as a function
\[ I(x, y) : R \rightarrow V, \]
where R ⊂ ℝ^{2} is a rectangular area and V is the set of possible pixel values. The simplest case is an idealized grayscale image where each point in the rectangle has just a brightness (no color), and we can say V = ℝ^{+} (the nonnegative reals). An idealized color image, with red, green, and blue values at each pixel, has V = (ℝ^{+})^{3}. We’ll discuss other possibilities for V in the next section.
Are there any raster devices that are not rectangular?
How does a raster image relate to this abstract notion of a continuous image? Looking to the concrete examples, a pixel from a camera or scanner is a measurement of the average color of the image over some small area around the pixel. A display pixel, with its red, green, and blue subpixels, is designed so that the average color of the image over the face of the pixel is controlled by the corresponding pixel value in the raster image. In both cases, the pixel value is a local average of the color of the image, and it is called a point sample of the image. In other words, when we find the value x in a pixel, it means “the value of the image in the vicinity of this grid point is x.” The idea of images as sampled representations of functions is explored further in Chapter 10.
A mundane but important question is where the pixels are located in 2D space. This is only a matter of convention, but establishing a consistent convention is important! In this book, a raster image is indexed by the pair (i, j) indicating the column (i) and row (j) of the pixel, counting from the bottom left. If an image has n_{x} columns and n_{y} rows of pixels, the bottom-left pixel is (0, 0) and the top-right pixel is (n_{x} – 1, n_{y} – 1). We need 2D real screen coordinates to specify pixel positions. We will place the pixels’ sample points at integer coordinates, as shown by the 4 × 3 screen in Figure 3.10.
In some APIs, and many file formats, the rows of an image are organized top-to-bottom, so that (0, 0) is at the top left. This is for historical reasons: the rows in analog television transmission started from the top.
The rectangular domain of the image has width n_{x} and height n_{y} and is centered on this grid, meaning that it extends half a pixel beyond the last sample point on each side. So the rectangular domain of an n_{x} × n_{y} image is
\[ R = [-0.5,\; n_x - 0.5] \times [-0.5,\; n_y - 0.5]. \]
Some systems shift the coordinates by half a pixel to place the sample points halfway between the integers but place the edges of the image at integers.
Again, these coordinates are simply conventions, but they will be important to remember later when implementing cameras and viewing transformations.
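The convention described here (sample points at integer coordinates, with the image rectangle extending half a pixel beyond the outermost samples) can be sketched in a few lines of Python. The function names `image_domain` and `nearest_pixel` are illustrative assumptions, not names used elsewhere in this book.

```python
import math

def image_domain(nx, ny):
    """Return ((x_min, x_max), (y_min, y_max)) for an nx-by-ny image."""
    return (-0.5, nx - 0.5), (-0.5, ny - 0.5)

def nearest_pixel(x, y, nx, ny):
    """Map a continuous point to the (i, j) of the nearest sample point.

    Returns None when the point lies outside the image rectangle.
    Column i and row j count from the bottom left, as in the text.
    """
    (x0, x1), (y0, y1) = image_domain(nx, ny)
    if not (x0 <= x < x1 and y0 <= y < y1):
        return None
    return math.floor(x + 0.5), math.floor(y + 0.5)
```

For the 4 × 3 screen, `image_domain(4, 3)` gives the rectangle from (-0.5, -0.5) to (3.5, 2.5). A system that uses the other convention (image edges at integers) would instead take the floor of the coordinate directly.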
So far we have described the values of pixels in terms of real numbers, representing intensity (possibly separately for red, green, and blue) at a point in the image. This suggests that images should be arrays of floating-point numbers, with either one (for grayscale, or black and white, images) or three (for RGB color images) 32-bit floating-point numbers stored per pixel. This format is sometimes used, when its precision and range of values are needed, but images have a lot of pixels and memory and bandwidth for storing and transmitting images are invariably scarce. Just one ten-megapixel photograph would consume about 115 MB of RAM in this format.
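The storage figure can be checked with a little arithmetic. The sketch below assumes that "MB" here means 2^{20} bytes; the helper name `image_bytes` is hypothetical.

```python
def image_bytes(n_pixels, channels, bytes_per_channel):
    """Raw storage for an uncompressed image, in bytes."""
    return n_pixels * channels * bytes_per_channel

# Ten megapixels (10 * 10**6 pixels), RGB, one 32-bit float per channel.
total = image_bytes(10_000_000, 3, 4)
print(total)                     # 120000000 bytes
print(round(total / 2**20, 1))   # 114.4, i.e. about 115 when counted in units of 2**20 bytes
```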
Less range is required for images that are meant to be displayed directly. While the range of possible light intensities is unbounded in principle, any given device has a decidedly finite maximum intensity, so in many contexts, it is perfectly sufficient for pixels to have a bounded range, usually taken to be [0, 1] for simplicity. For instance, the possible values in an 8-bit image are 0, 1/255, 2/255, ..., 254/255, 1. Images stored with floating-point numbers, allowing a wide range of values, are often called high dynamic range (HDR) images to distinguish them from fixed-range, or low dynamic range (LDR) images that are stored with integers. See Chapter 20 for an in-depth discussion of techniques and applications for high dynamic range images.
Why 115 MB and not 120 MB?
The denominator of 255, rather than 256, is awkward, but being able to represent 0 and 1 exactly is important.
Here are some pixel formats with typical applications:
1-bit grayscale—text and other images where intermediate grays are not desired (high resolution required);
8-bit RGB fixed-range color (24 bits total per pixel)—web and email applications, consumer photographs;
8- or 10-bit fixed-range RGB (24–30 bits/pixel)—digital interfaces to computer displays;
12- to 14-bit fixed-range RGB (36–42 bits/pixel)—raw camera images for professional photography;
16-bit fixed-range RGB (48 bits/pixel)—professional photography and printing; intermediate format for image processing of fixed-range images;
16-bit fixed-range grayscale (16 bits/pixel)—radiology and medical imaging;
16-bit “half-precision” floating-point RGB—HDR images; intermediate format for real-time rendering;
32-bit floating-point RGB—general-purpose intermediate format for software rendering and processing of HDR images.
Reducing the number of bits used to store each pixel leads to two distinctive types of artifacts, or artificially introduced flaws, in images. First, encoding images with fixed-range values produces clipping when pixels that would otherwise be brighter than the maximum value are set, or clipped, to the maximum representable value. For instance, a photograph of a sunny scene may include reflections that are much brighter than white surfaces; these will be clipped (even if they were measured by the camera) when the image is converted to a fixed range to be displayed. Second, encoding images with limited precision leads to quantization artifacts, or banding, when the need to round pixel values to the nearest representable value introduces visible jumps in intensity or color. Banding can be particularly insidious in animation and video, where the bands may not be objectionable in still images, but become very visible when they move back and forth.
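Both artifacts are easy to reproduce numerically. The following is a minimal sketch, assuming pixel values are stored as integer codes that represent evenly spaced levels in [0, 1]; the names `to_fixed` and `from_fixed` are illustrative.

```python
def to_fixed(value, bits=8):
    """Encode a real intensity as an integer code with 2**bits levels.

    Values outside [0, 1] are clipped; values inside are rounded to the
    nearest representable level (quantization).
    """
    max_code = 2**bits - 1
    clipped = min(max(value, 0.0), 1.0)
    return int(clipped * max_code + 0.5)

def from_fixed(code, bits=8):
    """Decode an integer code back to an intensity in [0, 1]."""
    return code / (2**bits - 1)

# Clipping: a highlight 3.7 times brighter than "white" is stored as white.
print(to_fixed(3.7))                              # 255
# Quantization: with only 2 bits, 0.3 and 0.4 land on the same level.
print(to_fixed(0.3, bits=2), to_fixed(0.4, bits=2))  # 1 1
```

In a smooth gradient, runs of neighboring pixels that collapse onto the same code in this way are what appear as bands.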
All modern monitors take digital input for the “value” of a pixel and convert this to an intensity level. Real monitors have some nonzero intensity when they are off because the screen reflects some light. For our purposes, we can consider this “black” and the monitor fully on as “white.” We assume a numeric description of pixel color that ranges from zero to one. Black is zero, white is one, and a gray halfway between black and white is 0.5. Note that here “halfway” refers to the physical amount of light coming from the pixel, rather than the appearance. The human perception of intensity is nonlinear and will not be part of the present discussion; see Chapter 19 for more.
There are two key issues that must be understood to produce correct images on monitors. The first is that monitors are nonlinear with respect to input. For example, if you give a monitor 0, 0.5, and 1.0 as inputs for three pixels, the intensities displayed might be 0, 0.25, and 1.0 (off, one-quarter fully on, and fully on). As an approximate characterization of this nonlinearity, monitors are commonly characterized by a γ (“gamma”) value. This value is the degree of freedom in the formula
\[ \text{displayed intensity} = (\text{maximum intensity})\, a^{\gamma}, \tag{3.1} \]
where a is the input pixel value between zero and one. For example, if a monitor has a gamma of 2.0, and we input a value of a = 0.5, the displayed intensity will be one-fourth the maximum possible intensity because 0.5^{2} = 0.25. Note that a = 0 maps to zero intensity and a = 1 maps to the maximum intensity regardless of the value of γ. Describing a display’s nonlinearity using γ is only an approximation; we do not need a great deal of accuracy in estimating the γ of a device. A nice visual way to gauge the nonlinearity is to find what value of a gives an intensity halfway between black and white. This a will be
\[ 0.5 = a^{\gamma}. \]
If we can find that a, we can deduce γ by taking logarithms on both sides:
\[ \gamma = \frac{\ln 0.5}{\ln a}. \]
We can find this a by a standard technique where we display a checkerboard pattern of black and white pixels next to a square of gray pixels with input a (Figure 3.11), then ask the user to adjust a (with a slider, for instance) until the two sides match in average brightness. When you look at this image from a distance (or without glasses if you are nearsighted), the two sides of the image will look about the same when a is producing an intensity halfway between black and white. This is because the blurred checkerboard is mixing even numbers of white and black pixels so the overall effect is a uniform color halfway between white and black.
For monitors with analog interfaces, which have difficulty changing intensity rapidly along the horizontal direction, horizontal black and white stripes work better than a checkerboard.
Once we know γ, we can gamma correct our input so that a value of a = 0.5 is displayed with intensity halfway between black and white. This is done with the transformation
\[ a' = a^{1/\gamma}. \]
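When this formula is plugged into Equation (3.1), we get
\[ \text{displayed intensity} = (a')^{\gamma}\,(\text{maximum intensity}) = \left(a^{1/\gamma}\right)^{\gamma} (\text{maximum intensity}) = a\,(\text{maximum intensity}). \]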
Another important characteristic of real displays is that they take quantized input values. So while we can manipulate intensities in the floating-point range [0, 1], the detailed input to a monitor is a fixed-size integer. The most common range for this integer is 0–255 which can be held in 8 bits of storage. This means that the possible values for a are not any number in [0, 1] but instead
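\[ a \in \left\{ \frac{0}{255}, \frac{1}{255}, \frac{2}{255}, \ldots, \frac{254}{255}, \frac{255}{255} \right\}. \]
This means the possible displayed intensity values are approximately
\[ \left\{ M\left(\frac{0}{255}\right)^{\gamma}, M\left(\frac{1}{255}\right)^{\gamma}, M\left(\frac{2}{255}\right)^{\gamma}, \ldots, M\left(\frac{254}{255}\right)^{\gamma}, M\left(\frac{255}{255}\right)^{\gamma} \right\}, \]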
where M is the maximum intensity. In applications where the exact intensities need to be controlled, we would have to actually measure the 256 possible intensities, and these intensities might be different at different points on the screen, especially for CRTs. They might also vary with viewing angle. Fortunately, few applications require such accurate calibration.
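The gamma relationships in this section can be collected into a short sketch. It assumes the simple power-law model, in which the displayed intensity is the maximum intensity times a^γ; real displays only approximate this, and the function names are illustrative.

```python
import math

def displayed_intensity(a, gamma, max_intensity=1.0):
    """Power-law display model: intensity shown for input a in [0, 1]."""
    return max_intensity * a ** gamma

def estimate_gamma(a_half):
    """Deduce gamma from the input a_half that matches the checkerboard,
    i.e. the input that produces half of the maximum intensity."""
    return math.log(0.5) / math.log(a_half)

def gamma_correct(a, gamma):
    """Pre-distort a so that the display shows an intensity proportional to a."""
    return a ** (1.0 / gamma)

print(displayed_intensity(0.5, 2.0))                       # 0.25
gamma = estimate_gamma(0.73)                               # hypothetical slider reading
print(displayed_intensity(gamma_correct(0.5, gamma), gamma))  # approximately 0.5
```

The slider reading 0.73 is made up for illustration; it corresponds to a gamma of roughly 2.2.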
Most computer graphics images are defined in terms of red-green-blue (RGB) color. RGB color is a simple space that allows straightforward conversion to the controls for most computer screens. In this section, RGB color is discussed from a user’s perspective, and operational facility is the goal. A more thorough discussion of color is given in Chapter 18, but the mechanics of RGB color space will allow us to write most graphics programs. The basic idea of RGB color space is that the color is displayed by mixing three primary lights: one red, one green, and one blue. The lights mix in an additive manner.
In grade school, you probably learned that the primaries are red, yellow, and blue, and that, e.g., yellow + blue = green. This is subtractive color mixing, which is fundamentally different from the more familiar additive mixing that happens in displays.
In RGB additive color mixing we have (Figure 3.12)
red + green = yellow,
green + blue = cyan,
blue + red = magenta,
red + green + blue = white.
The color “cyan” is a blue-green, and the color “magenta” is a purple.
If we are allowed to dim the primary lights from fully off (indicated by pixel value 0) to fully on (indicated by 1), we can create all the colors that can be displayed on an RGB monitor. The red, green, and blue pixel values create a three-dimensional RGB color cube that has a red, a green, and a blue axis. Allowable coordinates for the axes range from zero to one. The color cube is shown graphically in Figure 3.13.
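The colors at the corners of the cube are
black = (0, 0, 0),
red = (1, 0, 0),
green = (0, 1, 0),
blue = (0, 0, 1),
yellow = (1, 1, 0),
magenta = (1, 0, 1),
cyan = (0, 1, 1),
white = (1, 1, 1).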
Actual RGB levels are often given in quantized form, just like the grayscales discussed in Section 3.2.2. Each component is specified with an integer. The most common size for these integers is one byte each, so each of the three RGB components is an integer between 0 and 255. The three integers together take up three bytes, which is 24 bits. Thus, a system that has “24-bit color” has 256 possible levels for each of the three primary colors. Issues of gamma correction discussed in Section 3.2.2 also apply to each RGB component separately.
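As a small illustration of additive mixing and of the 24-bit quantized form, consider the sketch below. It assumes colors are RGB triples in [0, 1] and that a sum exceeding "fully on" is clamped to 1; the names `mix` and `to_24bit` are hypothetical.

```python
RED, GREEN, BLUE = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)

def mix(*colors):
    """Additively mix RGB colors, clamping each channel at fully on (1.0)."""
    return tuple(min(sum(channel), 1.0) for channel in zip(*colors))

def to_24bit(color):
    """Quantize an RGB triple in [0, 1] to three integers in 0..255."""
    return tuple(int(min(max(c, 0.0), 1.0) * 255 + 0.5) for c in color)

yellow = mix(RED, GREEN)
print(yellow)                          # (1.0, 1.0, 0.0)
print(to_24bit(yellow))                # (255, 255, 0)
print(to_24bit(mix(RED, GREEN, BLUE))) # (255, 255, 255), white
```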
Often, we would like to only partially overwrite the contents of a pixel. A common example of this occurs in compositing, where we have a background and want to insert a foreground image over it. For opaque pixels in the foreground, we just replace the background pixel. For entirely transparent foreground pixels, we do not change the background pixel. For partially transparent pixels, some care must be taken. Partially transparent pixels can occur when the foreground object has partially transparent regions, such as glass. But, the most frequent case where foreground and background must be blended is when the foreground object only partly covers the pixel, either at the edge of the foreground object, or when there are subpixel holes such as between the leaves of a distant tree.
The most important piece of information needed to blend a foreground object over a background object is the pixel coverage, which tells the fraction of the pixel covered by the foreground layer. We can call this fraction α. If we want to composite a foreground color c_{f} over background color c_{b}, and the fraction of the pixel covered by the foreground is α, then we can use the formula
\[ \mathbf{c} = \alpha\, \mathbf{c}_f + (1 - \alpha)\, \mathbf{c}_b. \tag{3.2} \]
For an opaque foreground layer, the interpretation is that the foreground object covers area α within the pixel’s rectangle and the background object covers the remaining area, which is (1 – α). For a transparent layer (think of an image painted on glass or on tracing paper, using translucent paint), the interpretation is that the foreground layer blocks the fraction α of the light coming through from the background, letting the remaining fraction (1 – α) through, and contributes a fraction α of its own color to replace what was removed. An example of using Equation (3.2) is shown in Figure 3.14.
The α values for all the pixels in an image might be stored in a separate grayscale image, which is then known as an alpha mask or transparency mask. Or the information can be stored as a fourth channel in an RGB image, in which case it is called the alpha channel, and the image can be called an RGBA image. With 8-bit images, each pixel then takes up 32 bits, which is a conveniently sized chunk in many computer architectures.
Since the weights of the foreground and background layers add up to 1, the color won’t change if the foreground and background layers have the same color.
Although Equation (3.2) is what is usually used, there are a variety of situations where α is used differently (Porter & Duff, 1984).
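The basic blend, in which the foreground is weighted by α and the background by (1 – α), is short enough to write out directly. This is a minimal sketch for a single pixel, assuming colors are RGB triples; the function name `over` is an illustrative choice.

```python
def over(c_f, c_b, alpha):
    """Composite foreground color c_f over background color c_b.

    alpha is the fraction of the pixel covered by the foreground.
    """
    return tuple(alpha * f + (1.0 - alpha) * b for f, b in zip(c_f, c_b))

red, blue = (1.0, 0.0, 0.0), (0.0, 0.0, 1.0)
print(over(red, blue, 0.25))   # (0.25, 0.0, 0.75)
print(over(red, blue, 1.0))    # opaque foreground replaces the background
print(over(red, blue, 0.0))    # fully transparent foreground leaves it unchanged
```

Applying the same function to every pixel, with α read from an alpha mask or alpha channel, composites whole images.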
Most RGB image formats use eight bits for each of the red, green, and blue channels. This results in approximately three megabytes of raw information for a single million-pixel image. To reduce the storage requirement, most image formats allow for some kind of compression. At a high level, such compression is either lossless or lossy. No information is discarded in lossless compression, while some information is lost unrecoverably in a lossy system. Popular image storage formats include
jpeg. This lossy format compresses image blocks based on thresholds in the human visual system. This format works well for natural images.
tiff. This format is most commonly used to hold binary images or losslessly compressed 8- or 16-bit RGB although many other options exist.
ppm. This very simple lossless, uncompressed format is most often used for 8-bit RGB images although many options exist.
png. This is a set of lossless formats with a good set of open source management tools.
Because of compression and variants, writing input/output routines for images can be involved. Fortunately, one can usually rely on library routines to read and write standard file formats. For quick-and-dirty applications, where simplicity is valued above efficiency, a simple choice is to use raw ppm files, which can often be written simply by dumping the array that stores the image in memory to a file, prepending the appropriate header.
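As an illustration of how little is needed, here is a sketch of a writer for the binary ("P6") variant of ppm with 8 bits per channel. It assumes the image is stored as `image[j][i]` with row j = 0 at the bottom, following this book's convention; since ppm stores the top row first, the rows are reversed on output. The function names are hypothetical.

```python
def ppm_bytes(image):
    """Encode image[j][i] = (r, g, b), each an integer in 0..255, as binary ppm."""
    ny, nx = len(image), len(image[0])
    header = f"P6\n{nx} {ny}\n255\n".encode("ascii")
    body = bytearray()
    for row in reversed(image):      # ppm stores rows from top to bottom
        for r, g, b in row:
            body.extend((r, g, b))
    return header + bytes(body)

def write_ppm(path, image):
    """Write the image to a file in binary ppm format."""
    with open(path, "wb") as f:
        f.write(ppm_bytes(image))

# A 2 x 2 test image: bottom row red, green; top row blue, white.
image = [[(255, 0, 0), (0, 255, 0)],
         [(0, 0, 255), (255, 255, 255)]]
data = ppm_bytes(image)
```

When the in-memory array is already a top-to-bottom block of bytes, the loop reduces to writing the header followed by the array itself.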
Why don’t they just make monitors linear and avoid all this gamma business?
Ideally, the 256 possible intensities of a monitor should look evenly spaced as opposed to being linearly spaced in energy. Because human perception of intensity is itself nonlinear, a gamma between 1.5 and 3 (depending on viewing conditions) will make the intensities approximately uniform in a subjective sense. In this way, gamma is a feature. Otherwise, the manufacturers would make the monitors linear.
1. Simulate an image acquired from the Bayer mosaic by taking a natural image (preferably a scanned photo rather than a digital photo where the Bayer mosaic may already have been applied) and creating a grayscale image composed of interleaved red/green/blue channels. This simulates the raw output of a digital camera. Now create a true RGB image from that output and compare with the original.
One of the basic tasks of computer graphics is rendering three-dimensional objects: taking a scene composed of many geometric objects arranged in 3D space and computing a 2D image that shows the objects as viewed from a particular viewpoint. It is the same operation that has been done for centuries by architects and engineers creating drawings to communicate their designs to others.
Fundamentally, rendering is a process that takes as its input a set of objects and produces as its output an array of pixels. One way or another, rendering involves considering how each object contributes to each pixel, and it can be organized in two general ways. In object-order rendering, each object is considered in turn, and for each object, all the pixels that it influences are found and updated. In image-order rendering, each pixel is considered in turn, and for each pixel all the objects that influence it are found and the pixel value is computed. You can think of the difference in terms of the nesting of loops: in image-order rendering, the “for each pixel” loop is on the outside, whereas in object-order rendering, the “for each object” loop is on the outside.
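The difference in loop nesting can be made concrete with a toy example. The sketch below assumes a deliberately simplified "scene" in which each object is an axis-aligned block of pixels `(x0, y0, x1, y1, color)` and later objects are drawn on top of earlier ones; none of this is meant as a real renderer.

```python
def covers(obj, i, j):
    """True if the object's pixel block contains pixel (i, j)."""
    x0, y0, x1, y1, _ = obj
    return x0 <= i <= x1 and y0 <= j <= y1

def render_image_order(objects, nx, ny, background=0):
    image = [[background] * nx for _ in range(ny)]
    for j in range(ny):                 # "for each pixel" on the outside
        for i in range(nx):
            for obj in objects:         # find the objects that influence it
                if covers(obj, i, j):
                    image[j][i] = obj[4]
    return image

def render_object_order(objects, nx, ny, background=0):
    image = [[background] * nx for _ in range(ny)]
    for x0, y0, x1, y1, color in objects:   # "for each object" on the outside
        for j in range(max(y0, 0), min(y1, ny - 1) + 1):
            for i in range(max(x0, 0), min(x1, nx - 1) + 1):
                image[j][i] = color         # update the pixels it influences
    return image

scene = [(0, 0, 2, 1, 1), (1, 1, 3, 2, 2)]
assert render_image_order(scene, 4, 3) == render_object_order(scene, 4, 3)
```

Both functions produce the same image; they differ only in which loop is outermost and, as a result, in how much work is done per pixel and per object.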
If the output is a vector image rather than a raster image, rendering doesn’t have to involve pixels, but we’ll assume raster images in this book.
Image-order and object-order renderers can compute exactly the same images, but they lend themselves to computing different kinds of effects and have quite different performance characteristics. We’ll explore the comparative strengths of the approaches in Chapter 9 after we have discussed them both, but, broadly speaking, image-order rendering is simpler to get working and more flexible in the effects that can be produced and usually (though not always) takes more execution time to produce a comparable image.
In a ray tracer, it is easy to compute accurate shadows and reflections, which are awkward in the object-order framework.
Ray tracing is an image-order algorithm for making renderings of 3D scenes, and we’ll consider it first because it’s possible to get a ray tracer working without developing any of the mathematical machinery that’s used for object-order rendering.
A ray tracer works by computing one pixel at a time, and for each pixel, the basic task is to find the object that is seen at that pixel’s position in the image. Each pixel “looks” in a different direction, and any object that is seen by a pixel must intersect the viewing ray, a line that emanates from the viewpoint in the direction that pixel is looking. The particular object we want is the one that intersects the viewing ray nearest the camera, since it blocks the view of any other objects behind it. Once that object is found, a shading computation uses the intersection point, surface normal, and other information (depending on the desired type of rendering) to determine the color of the pixel. This is shown in Figure 4.1, where the ray intersects two triangles, but only the first triangle hit, T_{2}, is shaded.
A basic ray tracer therefore has three parts:
ray generation, which computes the origin and direction of each pixel’s viewing ray based on the camera geometry;
ray intersection, which finds the closest object intersecting the viewing ray;
shading, which computes the pixel color based on the results of ray intersection.
The structure of the basic ray tracing program is
for each pixel do
    compute viewing ray
    find first object hit by ray and its surface normal n
    set pixel color to value computed from hit point, lights, and n
This chapter covers basic methods for ray generation, ray intersection, and shading, that are sufficient for implementing a simple demonstration ray tracer. For a really useful system, more efficient ray intersection techniques from Chapter 12 need to be added, and the real potential of a ray tracer will be seen with the more advanced rendering techniques from Chapter 14.
The problem of representing a 3D object or scene with a 2D drawing or painting was studied by artists hundreds of years before computers. Photographs also represent 3D scenes with 2D images. While there are many unconventional ways to make images, from cubist painting to fisheye lenses (Figure 4.2) to peripheral cameras, the standard approach for both art and photography, as well as computer graphics, is linear perspective, in which 3D objects are projected onto an image plane in such a way that straight lines in the scene become straight lines in the image.
The simplest type of projection is parallel projection, in which 3D points are mapped to 2D by moving them along a projection direction until they hit the image plane (Figures 4.3–4.4). The view that is produced is determined by the choice of projection direction and image plane. If the image plane is perpendicular to the view direction, the projection is called orthographic; otherwise, it is called oblique.
Some books reserve “orthographic” for projection directions that are parallel to the coordinate axes.
Parallel projections are often used for mechanical and architectural drawings because they keep parallel lines parallel and they preserve the size and shape of planar objects that are parallel to the image plane.
The advantages of parallel projection are also its limitations. In our everyday experience (and even more so in photographs), objects look smaller as they get farther away, and as a result, parallel lines receding into the distance do not appear parallel. This is because eyes and cameras don’t collect light from a single viewing direction; they collect light that passes through a particular viewpoint. As has been recognized by artists since the Renaissance, we can produce natural-looking views using perspective projection: we simply project along lines that pass through a single point, the viewpoint, rather than along parallel lines (Figure 4.4). In this way, objects farther from the viewpoint naturally become smaller when they are projected. A perspective view is determined by the choice of viewpoint (rather than projection direction) and image plane. As with parallel views, there are oblique and non-oblique perspective views; the distinction is made based on the projection direction at the center of the image.
You may have learned about the artistic conventions of three-point perspective, a system for manually constructing perspective views (Figure 4.5). A surprising fact about perspective is that all the rules of perspective drawing will be followed automatically if we follow the simple mathematical rule underlying perspective: objects are projected directly toward the eye, and they are drawn where they meet a view plane in front of the eye.
From the previous section, the basic tools of ray generation are the viewpoint (or view direction, for parallel views) and the image plane. There are many ways to work out the details of camera geometry; in this section, we explain one based on orthonormal bases that supports normal and oblique parallel and perspective views.
In order to generate rays, we first need a mathematical representation for a ray. A ray is really just an origin point and a propagation direction; a 3D parametric line is ideal for this. As discussed in Section 2.7.7, the 3D parametric line from the eye e through a point s on the image plane (Figure 4.6) is given by
\[ \mathbf{p}(t) = \mathbf{e} + t(\mathbf{s} - \mathbf{e}). \]
This should be interpreted as, “we advance from e along the vector (s – e) a fractional distance t to find the point p.” So given t, we can determine a point p. The point e is the ray’s origin, and s – e is the ray’s direction.
Note that p(0) = e, and p(1) = s, and more generally, if 0 < t_{1} < t_{2}, then p(t_{1}) is closer to the eye than p(t_{2}). Also, if t < 0, then p(t) is “behind” the eye. These facts will be useful when we search for the closest object hit by the ray that is not behind the eye.
Caution: we are overloading the variable t, which is the ray parameter and also the v-coordinate of the top edge of the image.
Rays are invariably represented in code using some kind of structure or object that stores the position and direction. For instance, in an object-oriented program we might write:
class Ray
    Vec3 o | ray origin
    Vec3 d | ray direction
    Vec3 evaluate(real t)
        return o + td
We are assuming there is a class Vec3 that represents three-dimensional vectors and supports the usual arithmetic operations.
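For readers who want something runnable, one possible Python rendering of this structure is sketched below. The tiny `Vec3` class is an assumption made only so the example is self-contained; it supports just the operations the ray needs.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Vec3:
    x: float
    y: float
    z: float

    def __add__(self, other):
        return Vec3(self.x + other.x, self.y + other.y, self.z + other.z)

    def __sub__(self, other):
        return Vec3(self.x - other.x, self.y - other.y, self.z - other.z)

    def __mul__(self, s):
        # Scaling by a number; __rmul__ makes "t * v" work as well as "v * t".
        return Vec3(self.x * s, self.y * s, self.z * s)

    __rmul__ = __mul__

@dataclass(frozen=True)
class Ray:
    o: Vec3  # ray origin
    d: Vec3  # ray direction

    def evaluate(self, t):
        """Return the point o + t d on the ray."""
        return self.o + t * self.d

e = Vec3(0.0, 0.0, 0.0)
s = Vec3(1.0, 2.0, 3.0)
ray = Ray(e, s - e)
print(ray.evaluate(0.5))   # Vec3(x=0.5, y=1.0, z=1.5)
```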
To compute a viewing ray, we need to know e (which is given) and s. Finding s may seem difficult, but it is actually straightforward if we look at the problem in the right coordinate system.
All of our ray-generation methods start from an orthonormal coordinate frame known as the camera frame (Figure 4.7), which we’ll denote by e, for the eye point, or viewpoint, and u, v, and w for the three basis vectors, organized with u pointing rightward (from the camera’s view), v pointing upward, and w pointing backward, so that {u, v, w} forms a right-handed coordinate system. The most common way to construct the camera frame is from the viewpoint, which becomes e, the view direction, which is –w, and the up vector, which is used to construct a basis that has v and w in the plane defined by the view direction and the up direction, using the process for constructing an orthonormal basis from two vectors described in Section 2.4.7 (Figure 4.8).
Since v and w have to be perpendicular, the up vector and v are not generally the same. But setting the up vector to point straight upward in the scene will orient the camera in the way we would think of as “up-right.”
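The construction can be sketched as follows, assuming the usual cross-product recipe for building an orthonormal basis from two vectors: w is the normalized reverse of the view direction, u is the normalized cross product of the up vector with w, and v is w × u. Vectors are plain tuples here, and the function names are illustrative.

```python
import math

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalize(a):
    length = math.sqrt(sum(c * c for c in a))
    return tuple(c / length for c in a)

def camera_frame(view_dir, up):
    """Return the basis (u, v, w) of the camera frame.

    Assumes view_dir and up are nonzero and not parallel to each other.
    """
    w = normalize(tuple(-c for c in view_dir))  # w points backward
    u = normalize(cross(up, w))                 # u points rightward
    v = cross(w, u)                             # v completes the right-handed basis
    return u, v, w

# Looking down the negative z-axis with y up gives the standard axes.
u, v, w = camera_frame((0.0, 0.0, -1.0), (0.0, 1.0, 0.0))
```

Note that v is computed rather than copied from the up vector, so a tilted up vector still yields a v that is perpendicular to w.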
For an orthographic view, all the rays will have the direction –w. Even though a parallel view doesn’t have a viewpoint per se, we can still use the origin of the camera frame to define the plane where the rays start, so that it’s possible for objects to be behind the camera.
The viewing rays should start on the plane defined by the point e and the vectors u and v; the only remaining information required is where on the plane the image is supposed to be. We’ll define the image dimensions with four numbers, for the four sides of the image: l and r are the positions of the left and right edges of the image, as measured from e along the u direction; and b and t are the positions of the bottom and top edges of the image, as measured from e along the v direction. Usually, l < 0 < r and b < 0 < t. (See Figure 4.9a.)
It might seem logical that orthographic viewing rays should start from infinitely far away, but then it would not be possible to make orthographic views of an object inside a room, for instance.
In Section 3.2, we discussed pixel coordinates in an image. To fit an image with n_{x} × n_{y} pixels into a rectangle of size (r – l) × (t – b), the pixels are spaced a distance (r – l)/n_{x} apart horizontally and (t – b)/n_{y} apart vertically, with a half-pixel space around the edge to center the pixel grid within the image rectangle. This means that the pixel at position (i, j) in the raster image has the position
\[ u = l + (r - l)(i + 0.5)/n_x, \qquad v = b + (t - b)(j + 0.5)/n_y, \tag{4.1} \]
where (u, v) are the coordinates of the pixel’s position on the image plane, measured with respect to the origin e and the basis {u, v}.
Many systems assume that l = – r and b = – t so that a width and a height suffice.
With l and r both specified, there is redundancy: moving the viewpoint a bit to the right and correspondingly decreasing l and r will not change the view (and similarly on the v-axis).
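The mapping from pixel indices to image-plane coordinates follows directly from the spacing just described: pixels are (r – l)/n_{x} and (t – b)/n_{y} apart, offset by half a pixel from the left and bottom edges. A minimal sketch, with the hypothetical name `pixel_to_uv`:

```python
def pixel_to_uv(i, j, nx, ny, l, r, b, t):
    """Image-plane coordinates (u, v) of the center of pixel (i, j)."""
    u = l + (r - l) * (i + 0.5) / nx
    v = b + (t - b) * (j + 0.5) / ny
    return u, v

# A 4 x 2 image spanning [-2, 2] x [-1, 1]: pixels are 1 unit apart.
print(pixel_to_uv(0, 0, 4, 2, -2.0, 2.0, -1.0, 1.0))   # (-1.5, -0.5)
print(pixel_to_uv(3, 1, 4, 2, -2.0, 2.0, -1.0, 1.0))   # (1.5, 0.5)
```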
In an orthographic view, we can simply use the pixel’s image-plane position as the ray’s starting point, and we already know the ray’s direction is the view direction. The procedure for generating orthographic viewing rays is then
compute u and v using (4.1)
ray.o ← \mathbf{e} + u \mathbf{u} + v \mathbf{v}
ray.d ← –\mathbf{w}
It’s very simple to make an oblique parallel view: just allow the image plane normal w to be specified separately from the view direction d. The procedure is then exactly the same, but with d substituted for –w. Of course, w is still used to construct u and v.
For a perspective view, all the rays have the same origin, at the viewpoint; it is the directions that are different for each pixel. The image plane is no longer positioned at e, but rather some distance d in front of e; this distance is the image plane distance, often loosely called the focal length, because choosing d plays the same role as choosing focal length in a real camera. The direction of each ray is defined by the viewpoint and the position of the pixel on the image plane. This situation is illustrated in Figure 4.9, and the resulting procedure is similar to the orthographic one:
compute u and v using (4.1)
ray.o ← e
ray.d ← –d w + u u + v v
As with parallel projection, oblique perspective views can be achieved by specifying the image plane normal separately from the projection direction.
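The two ray-generation procedures above can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: the `Camera` container, the helper names, and the use of plain tuples for vectors are assumptions made for the example, and `pixel_to_uv` uses the half-pixel-centered mapping described in the text for Equation (4.1).

```python
from dataclasses import dataclass

def add(a, b):
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def scale(s, a):
    return (s * a[0], s * a[1], s * a[2])

@dataclass
class Camera:            # hypothetical container for the camera frame
    e: tuple             # viewpoint
    u: tuple             # basis vectors of the camera frame
    v: tuple
    w: tuple
    l: float             # image rectangle: left, right, bottom, top
    r: float
    b: float
    t: float
    nx: int              # image resolution
    ny: int
    d: float = 1.0       # image plane distance (used for perspective only)

def pixel_to_uv(cam, i, j):
    """Image-plane coordinates of the center of pixel (i, j)."""
    u = cam.l + (cam.r - cam.l) * (i + 0.5) / cam.nx
    v = cam.b + (cam.t - cam.b) * (j + 0.5) / cam.ny
    return u, v

def orthographic_ray(cam, i, j):
    """Origin varies per pixel; direction is always -w."""
    u, v = pixel_to_uv(cam, i, j)
    origin = add(cam.e, add(scale(u, cam.u), scale(v, cam.v)))
    return origin, scale(-1.0, cam.w)

def perspective_ray(cam, i, j):
    """Origin is always e; direction varies per pixel."""
    u, v = pixel_to_uv(cam, i, j)
    direction = add(scale(-cam.d, cam.w),
                    add(scale(u, cam.u), scale(v, cam.v)))
    return cam.e, direction
```

Note that the direction returned by `perspective_ray` is not normalized; code that relies on t measuring distance along the ray would need to normalize it first.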
Once we’ve generated a ray e + td, we next need to find the first intersection with any object where t > 0. In practice, it turns out to be useful to solve a slightly more general problem: find the first intersection between the ray and a surface that occurs at a t in the interval [t_{0},t_{1}]. The basic ray intersection is then the case where t_{0} = 0 and t_{1} = +∞. We solve this problem for both spheres and triangles. In the next section, multiple objects are discussed.
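Given a ray p(t) = e + td and an implicit surface f (p) = 0 (see Section 2.7.3), we’d like to know where they intersect. Intersection points occur when points on the ray satisfy the implicit equation, so the values of t we seek are those that solve the equation
$$f(\mathbf{p}(t)) = 0 \quad\text{or}\quad f(\mathbf{e} + t\mathbf{d}) = 0.$$
A sphere with center c = (x_{c}, y_{c}, z_{c}) and radius R can be represented by the implicit equation
$$(x - x_c)^2 + (y - y_c)^2 + (z - z_c)^2 - R^2 = 0.$$
We can write this same equation in vector form:
$$(\mathbf{p} - \mathbf{c}) \cdot (\mathbf{p} - \mathbf{c}) - R^2 = 0.$$
Any point p that satisfies this equation is on the sphere. If we plug points on the ray p(t) = e + td into this equation, we get an equation in terms of t that is satisfied by the values of t that yield points on the sphere:
$$(\mathbf{e} + t\mathbf{d} - \mathbf{c}) \cdot (\mathbf{e} + t\mathbf{d} - \mathbf{c}) - R^2 = 0.$$
Rearranging terms yields
$$(\mathbf{d} \cdot \mathbf{d})\,t^2 + 2\,\mathbf{d} \cdot (\mathbf{e} - \mathbf{c})\,t + (\mathbf{e} - \mathbf{c}) \cdot (\mathbf{e} - \mathbf{c}) - R^2 = 0.$$
Here, everything is known except the parameter t, so this is a classic quadratic equation in t, meaning it has the form
$$At^2 + Bt + C = 0.$$
The solution to this equation is discussed in Section 2.2. The term under the square root sign in the quadratic solution, B^{2} – 4AC, is called the discriminant and tells us how many real solutions there are. If the discriminant is negative, its square root is imaginary and the line and sphere do not intersect. If the discriminant is positive, there are two solutions: one solution where the ray enters the sphere and one where it leaves. If the discriminant is zero, the ray grazes the sphere, touching it at exactly one point. Plugging in the actual terms for the sphere and canceling a factor of two, we get
$$t = \frac{-\mathbf{d} \cdot (\mathbf{e} - \mathbf{c}) \pm \sqrt{\left(\mathbf{d} \cdot (\mathbf{e} - \mathbf{c})\right)^2 - (\mathbf{d} \cdot \mathbf{d})\left((\mathbf{e} - \mathbf{c}) \cdot (\mathbf{e} - \mathbf{c}) - R^2\right)}}{\mathbf{d} \cdot \mathbf{d}}.$$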
In an actual implementation, you should first check the value of the discriminant before computing other terms. To correctly find the closest intersection in the interval [t_{0},t_{1}], there are three cases: if the smaller of the two solutions is in the interval, it is the first hit; otherwise, if the larger solution is in the interval, it is the first hit; otherwise, there is no hit.
As discussed in Section 2.7.4, the normal vector at point p is given by the gradient n = 2(p – c) . The unit normal is (p – c)/R.
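As a rough sketch of how the sphere test might look in code, the Python function below checks the discriminant first, then tries the smaller root before the larger one, and returns the unit normal (p – c)/R at the hit. The function name, the tuple-based vectors, and the convention of returning `None` for a miss are assumptions made for this example; it also assumes d is not the zero vector.

```python
import math

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def hit_sphere(e, d, c, R, t0=0.0, t1=math.inf):
    """Return (t, unit_normal) for the first hit in [t0, t1], or None."""
    ec = sub(e, c)
    A = dot(d, d)
    half_B = dot(d, ec)              # B/2, after canceling the factor of two
    C = dot(ec, ec) - R * R
    disc = half_B * half_B - A * C   # (B^2 - 4AC)/4, so it has the same sign
    if disc < 0:
        return None                  # the line misses the sphere entirely
    root = math.sqrt(disc)
    # Try the smaller solution first, then the larger one.
    for t in ((-half_B - root) / A, (-half_B + root) / A):
        if t0 <= t <= t1:
            p = (e[0] + t * d[0], e[1] + t * d[1], e[2] + t * d[2])
            n = tuple(x / R for x in sub(p, c))
            return t, n
    return None
```

For the ray (1, 1, 1) + t(–1, –1, –1) and the unit sphere at the origin, the two solutions are t = 1 ± 1/√3, so calling the function with the default interval returns the smaller one.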
There are many algorithms for computing ray-triangle intersections. We will present the form that uses barycentric coordinates for the parametric plane containing the triangle, because it requires no long-term storage other than the vertices of the triangle (Snyder & Barr, 1987).
To intersect a ray with a parametric surface, we set up a system of equations where the Cartesian coordinates all match:
$$\left.\begin{aligned} x_e + t x_d &= f(u, v) \\ y_e + t y_d &= g(u, v) \\ z_e + t z_d &= h(u, v) \end{aligned}\right\} \quad\text{or}\quad \mathbf{e} + t\mathbf{d} = \mathbf{f}(u, v).$$
Here, we have three equations and three unknowns (t, u, and v). In the case where the surface is a parametric plane, the parametric equation is linear and can be written in vector form as discussed in Section 2.9.2. If the vertices of the triangle are a, b, and c, then the intersection will occur when
$$\mathbf{e} + t\mathbf{d} = \mathbf{a} + \beta(\mathbf{b} - \mathbf{a}) + \gamma(\mathbf{c} - \mathbf{a}), \tag{4.2}$$
for some t, β, and γ. Solving this equation tells us both t, which locates the intersection point along the ray, and (β, γ), which locates the intersection point relative to the triangle. The intersection p will be at e + td as shown in Figure 4.10. Again from Section 2.9.2, we know the intersection is inside the triangle if and only if β > 0, γ > 0, and β + γ < 1. Otherwise, the ray has hit the plane outside the triangle, so it misses the triangle. If there are no solutions, either the triangle is degenerate or the ray is parallel to the plane containing the triangle.
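To solve for t, β, and γ in Equation (4.2), we expand it from its vector form into the three equations for the three coordinates:
$$\begin{aligned} x_e + t x_d &= x_a + \beta(x_b - x_a) + \gamma(x_c - x_a), \\ y_e + t y_d &= y_a + \beta(y_b - y_a) + \gamma(y_c - y_a), \\ z_e + t z_d &= z_a + \beta(z_b - z_a) + \gamma(z_c - z_a). \end{aligned}$$
This can be rewritten as a standard linear system:
$$\begin{bmatrix} x_a - x_b & x_a - x_c & x_d \\ y_a - y_b & y_a - y_c & y_d \\ z_a - z_b & z_a - z_c & z_d \end{bmatrix} \begin{bmatrix} \beta \\ \gamma \\ t \end{bmatrix} = \begin{bmatrix} x_a - x_e \\ y_a - y_e \\ z_a - z_e \end{bmatrix}.$$
The fastest classic method to solve this 3 × 3 linear system is Cramer’s rule. This gives us the solutions
$$\beta = \frac{\begin{vmatrix} x_a - x_e & x_a - x_c & x_d \\ y_a - y_e & y_a - y_c & y_d \\ z_a - z_e & z_a - z_c & z_d \end{vmatrix}}{|\mathbf{A}|}, \quad \gamma = \frac{\begin{vmatrix} x_a - x_b & x_a - x_e & x_d \\ y_a - y_b & y_a - y_e & y_d \\ z_a - z_b & z_a - z_e & z_d \end{vmatrix}}{|\mathbf{A}|}, \quad t = \frac{\begin{vmatrix} x_a - x_b & x_a - x_c & x_a - x_e \\ y_a - y_b & y_a - y_c & y_a - y_e \\ z_a - z_b & z_a - z_c & z_a - z_e \end{vmatrix}}{|\mathbf{A}|},$$
where the matrix A is
$$\mathbf{A} = \begin{bmatrix} x_a - x_b & x_a - x_c & x_d \\ y_a - y_b & y_a - y_c & y_d \\ z_a - z_b & z_a - z_c & z_d \end{bmatrix},$$
and |A| denotes the determinant of A. The 3 × 3 determinants have common subterms that can be exploited for efficiency in implementation. Looking at the linear system with dummy variables
$$\begin{bmatrix} a & d & g \\ b & e & h \\ c & f & i \end{bmatrix} \begin{bmatrix} \beta \\ \gamma \\ t \end{bmatrix} = \begin{bmatrix} j \\ k \\ l \end{bmatrix},$$
Cramer’s rule gives us
$$\begin{aligned} \beta &= \frac{j(ei - hf) + k(gf - di) + l(dh - eg)}{M}, \\ \gamma &= \frac{i(ak - jb) + h(jc - al) + g(bl - kc)}{M}, \\ t &= -\frac{f(ak - jb) + e(jc - al) + d(bl - kc)}{M}, \end{aligned}$$
where
$$M = a(ei - hf) + b(gf - di) + c(dh - eg).$$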
We can reduce the number of operations by reusing numbers such as “ei-minus-hf.”
The ray-triangle intersection algorithm built on this linear solution can terminate early as soon as any one of its conditions fails. Thus, the function should look something like:
boolean raytri(Ray r, vector3 a, vector3 b, vector3 c, interval [t_{0},t_{1}])
    compute t
    if (t < t_{0}) or (t > t_{1}) then
        return false
    compute γ
    if (γ < 0) or (γ > 1) then
        return false
    compute β
    if (β < 0) or (β > 1 – γ) then
        return false
    return true
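The pseudocode above, combined with Cramer’s rule and its shared subterms, might be written in Python roughly as follows. This is a sketch under a few assumptions: vectors are plain tuples, the ray origin and direction are named `orig` and `dirn` and the vertices `va`, `vb`, `vc` so that the single-letter dummy variables a through l stay free, a zero determinant is treated as a miss, and the function returns (t, β, γ) rather than a boolean.

```python
import math

def hit_triangle(orig, dirn, va, vb, vc, t0=0.0, t1=math.inf):
    """Return (t, beta, gamma) if the ray hits the triangle in [t0, t1], else None."""
    # Entries of the 3x3 system, using the dummy-variable names a..l.
    a, b, c = va[0] - vb[0], va[1] - vb[1], va[2] - vb[2]   # first column
    d, e, f = va[0] - vc[0], va[1] - vc[1], va[2] - vc[2]   # second column
    g, h, i = dirn                                           # third column
    j, k, l = va[0] - orig[0], va[1] - orig[1], va[2] - orig[2]  # right side

    # Subterms shared between the determinants.
    ei_hf = e * i - h * f
    gf_di = g * f - d * i
    dh_eg = d * h - e * g
    ak_jb = a * k - j * b
    jc_al = j * c - a * l
    bl_kc = b * l - k * c

    M = a * ei_hf + b * gf_di + c * dh_eg
    if M == 0:
        return None   # degenerate triangle, or ray parallel to its plane

    t = -(f * ak_jb + e * jc_al + d * bl_kc) / M
    if t < t0 or t > t1:
        return None
    gamma = (i * ak_jb + h * jc_al + g * bl_kc) / M
    if gamma < 0 or gamma > 1:
        return None
    beta = (j * ei_hf + k * gf_di + l * dh_eg) / M
    if beta < 0 or beta > 1 - gamma:
        return None
    return t, beta, gamma
```

An exact comparison of M against zero is the simplest choice here; a production implementation would likely compare against a small tolerance instead.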
In a ray tracing program, it is a good idea to use an object-oriented design that has a class called something like Surface with derived classes Triangle, Sphere, etc. Anything that a ray can intersect, including groups of surfaces or efficiency structures (Section 12.3) should be a subclass of Surface. The ray-tracing program would then have one reference to a Surface for the whole model, and new types of objects and efficiency structures can be added transparently.
The key interface of the Surface class is a method to intersect a ray (Kirk & Arvo, 1988).
class Surface
    HitRecord hit(Ray r, real t_{0}, real t_{1})
Here, (t_{0},t_{1}) is the interval on the ray where hits will be returned, and HitRecord is a class that contains all the data about the surface intersection that will be needed:
class HitRecord
    Surface s | surface that was hit
    real t | coordinate of hit point along the ray
    Vec3 n | surface normal at the hit point
    ...
The surface that was hit, the t value, and the surface normal are the minimum required, but other data such as texture coordinates or tangent vectors may be stored as well. Depending on the language, the hit record might not be literally returned from the function but rather passed by reference and filled in. A miss can be indicated by a hit that has t = ∞.
Of course, most interesting scenes consist of more than one object, and when we intersect a ray with the scene, we must find only the closest intersection to the camera along the ray. A simple way to implement this is to think of a group of objects as itself being another type of object. To intersect a ray with a group, you simply intersect the ray with the objects in the group and return the intersection with the smallest t value. The following code tests for hits in the interval t ∈ [t_{0},t_{1}]:
class Group, subclass of Surface
    list-of-Surface surfaces | list of all surfaces in the group
    HitRecord hit(Ray ray, real t_{0}, real t_{1})
        HitRecord closest-hit(∞) | initialize to indicate miss
        for surf in surfaces do
            rec = surf.hit(ray, t_{0}, t_{1})
            if rec.t < ∞ then
                closest-hit = rec
                t_{1} = rec.t
        return closest-hit
Note that this code shrinks the intersection interval [t_{0},t_{1}] so that the call to surf.hit will only hit surfaces that are closer than the closest one seen so far.
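To make the interval-shrinking idea concrete, here is a small Python sketch of the group pattern. The `ZPlane` class is a hypothetical stand-in surface (the plane z = z0) chosen only because its intersection is a one-line computation; the field names on `HitRecord` and the use of tuples for the ray origin e and direction d are likewise assumptions for the example.

```python
import math
from dataclasses import dataclass

@dataclass
class HitRecord:
    t: float = math.inf      # t = inf indicates a miss
    surface: object = None   # surface that was hit
    normal: tuple = None     # surface normal at the hit point

class ZPlane:
    """Hypothetical stand-in surface: the plane z = z0."""
    def __init__(self, z0):
        self.z0 = z0

    def hit(self, e, d, t0, t1):
        if d[2] == 0:
            return HitRecord()           # ray parallel to the plane
        t = (self.z0 - e[2]) / d[2]
        if t0 <= t <= t1:
            n = (0.0, 0.0, 1.0) if d[2] < 0 else (0.0, 0.0, -1.0)
            return HitRecord(t, self, n)
        return HitRecord()

class Group:
    """A surface made of other surfaces; returns the closest hit."""
    def __init__(self, surfaces):
        self.surfaces = list(surfaces)

    def hit(self, e, d, t0, t1):
        closest = HitRecord()            # initialized to indicate a miss
        for surf in self.surfaces:
            rec = surf.hit(e, d, t0, t1)
            if rec.t < math.inf:
                closest = rec
                t1 = rec.t               # later hits must be closer than this
        return closest
```

Because a `Group` exposes the same `hit` method as any other surface, groups can be nested inside other groups without any change to the calling code.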
Once ray-scene intersection works, we can render an image like Figure 4.11, but nicer results depend on including more visual cues, as we describe next.
Once the visible surface for a pixel is known, the pixel value is computed by evaluating a shading model. How this is done depends entirely on the application: methods range from simple heuristics to elaborate physics-based models. Exactly the same shading models can be used in ray tracing or in object-order rendering methods.
Chapter 5 describes a simple shading model that is suitable for a basic ray tracer and that is the one we used to make the renderings in this chapter. For more realism, you can upgrade to the models discussed in Chapter 14, which are much more true to the physics of real surfaces. Here, we will discuss how a ray tracer computes the inputs to shading.
To support shading, a ray tracing program always has a list of light sources. For the Chapter 5 shading model, we need three types of lights: point lights, which emit light from a point in space, directional lights, which illuminate the scene from a single direction, and ambient lights, which provide constant illumination to fill in the shadows. In fancier systems, other types of lights are supported, such as area lights (which are basically scene geometry that emits light) or environment lights (which use an image to represent light coming from far-away sources like the sky).
Computing shading from a point or directional light source requires certain geometric information, and in a ray tracer, after a viewing ray has been determined to hit the surface, we have all we need to determine these four vectors:
The shading point x can be computed by evaluating the viewing ray at the t value of the intersection.
The surface normal n depends on the type of surface (sphere, triangle, etc.), and every surface needs to be able to compute its normal at the point where a ray intersects it.
The light direction l is computed from the light source position or direction as part of shading.
The viewing direction v is simply opposite the direction of the viewing ray (v = –d/ ||d||).
The shading from an ambient source is much simpler: there is no l since light comes from everywhere; the shading does not depend on v; and for the simple models of Chapter 5, it doesn’t even depend on x or n.
Computing shading in a scene containing several lights is simply a matter of adding up the contributions of the lights. In a basic ray tracer, you can simply loop over all the light sources, computing shading from each one, and accumulate the results into the pixel color.
A ray tracing program usually contains objects representing light sources and materials. Light sources can be instances of subclasses of a Light class, and they must include enough information to fully describe the light source. Since shading also requires parameters describing the material of the surface, another class that is useful is Material, which encapsulates everything needed to evaluate the shading model.
Different systems take different approaches to breaking up the shading calculations between lights and materials. An approach that aligns with the presentation in this chapter is to make lights responsible for the overall illumination computation and materials responsible for computing BRDF values. With this setup, the interfaces of these classes might look like:
class Light
    Color illuminate(Ray ray, HitRecord hrec)
class Material
    Color evaluate(Vec3 l, Vec3 v, Vec3 n)
Each surface would then store a reference to its material, and in this way, point light illumination might be implemented as follows:
class PointLight, subclass of Light
    Color I
    Vec3 p
    Color illuminate(Ray ray, HitRecord hrec)
        Vec3 x = ray.evaluate(hrec.t)
        real r = ‖p – x‖
        Vec3 l = (p – x)/r
        Vec3 v = –ray.d/‖ray.d‖
        Vec3 n = hrec.normal
        Color E = max(0, n · l) I/r^{2}
        Color k = hrec.surface.material.evaluate(l, v, n)
        return kE
These computations assume the class Color carries the RGB components of a color and supports componentwise multiplication. This arrangement is also amenable to treating ambient lighting as a light source, by making the ambient coefficient a property of the material:
class AmbientLight, subclass of Light
    Color I_{a}
    Color illuminate(Ray ray, HitRecord hrec)
        Color k_{a} = hrec.surface.material.k_{a}
        return k_{a} I_{a}
The complete calculation for shading a ray, including the intersection and handling several lights, can look like this:
function shade-ray(Ray ray, real t_{0}, real t_{1})
    HitRecord rec = scene.hit(ray, t_{0}, t_{1})
    if rec.t < ∞ then
        Color c = 0
        for light in scene.lights do
            c = c + light.illuminate(ray, rec)
        return c
    else return background-color
This setup keeps materials and lights reasonably separate and allows you to later add new kinds of materials and lights transparently. Textures add some complexity to the architecture of a ray tracer; see Section 11.2.5.
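One way the light and material split might look in Python is sketched below. Several simplifications are assumptions of this example rather than part of the design described above: colors and vectors are plain tuples, the hit record stores the material directly instead of going through the surface, the material is purely Lambertian, and the scene is represented by a hypothetical `scene_hit` callable that returns a hit record.

```python
import math
from dataclasses import dataclass

def add(a, b): return tuple(x + y for x, y in zip(a, b))
def sub(a, b): return tuple(x - y for x, y in zip(a, b))
def mul(a, b): return tuple(x * y for x, y in zip(a, b))   # componentwise
def scale(s, a): return tuple(s * x for x in a)
def dot(a, b): return sum(x * y for x, y in zip(a, b))
def length(a): return math.sqrt(dot(a, a))

@dataclass
class Material:
    R: tuple      # diffuse reflectance (RGB)
    k_a: tuple    # ambient coefficient (RGB)

    def evaluate(self, l, v, n):
        # Lambertian BRDF value: constant, independent of l and v.
        return scale(1.0 / math.pi, self.R)

@dataclass
class HitRecord:
    t: float = math.inf       # t = inf indicates a miss
    normal: tuple = None
    material: Material = None

@dataclass
class PointLight:
    I: tuple      # intensity (RGB)
    p: tuple      # position

    def illuminate(self, e, d, rec):
        x = add(e, scale(rec.t, d))            # shading point
        r = length(sub(self.p, x))
        l = scale(1.0 / r, sub(self.p, x))     # unit vector toward the light
        v = scale(-1.0 / length(d), d)         # unit vector toward the viewer
        n = rec.normal
        E = scale(max(0.0, dot(n, l)) / (r * r), self.I)
        return mul(rec.material.evaluate(l, v, n), E)

@dataclass
class AmbientLight:
    I_a: tuple

    def illuminate(self, e, d, rec):
        return mul(rec.material.k_a, self.I_a)

def shade_ray(scene_hit, lights, e, d, t0, t1, background=(0.0, 0.0, 0.0)):
    rec = scene_hit(e, d, t0, t1)
    if rec.t < math.inf:
        c = (0.0, 0.0, 0.0)
        for light in lights:                   # sum contributions of all lights
            c = add(c, light.illuminate(e, d, rec))
        return c
    return background
```

Adding a new kind of light then only requires another class with an `illuminate` method, and a new material only requires a different `evaluate`.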
By itself, shading makes images of 3D objects more realistic and understandable, but it doesn’t show their interactions with other objects. For instance, the spheres in Figure 4.12 appear to float above the floor they are resting on.
Once you have basic shading in your ray tracer, shadows for point and directional lights can be added very easily. If we imagine ourselves at a point x on a surface being shaded, the point is in shadow if we “look” towards the light source and see an object between us and the light source. If there are no objects in between, then the light is not blocked.
This is shown in Figure 4.13 for two shading points. For one of them, the ray x + tl does not hit any objects, and thus that point is not in shadow. For the other, the ray x + tl does hit an object, so that point is in shadow. The rays that determine in or out of shadow are called shadow rays to distinguish them from viewing rays.
To add shadows to the shading algorithm, we add an if statement to the code that computes shading from a light source, so that it first determines whether the light is blocked. In a naive implementation, the shadow ray will check for t ∈ [0, r], but because of numerical imprecision, this can result in an intersection with the surface on which x lies. Instead, the usual adjustment to avoid that problem is to test for t ∈ [ε, r] where ε is some small positive constant (Figure 4.14).
A shadow test can be added to the method PointLight.illuminate shown above by tracing a shadow ray and adding a conditional:
HitRecord srec = scene.hit(Ray(x, l), ε, r)
if srec.t < ∞ then
    return 0 | shading point is in shadow
else
    proceed with normal illumination calculation
The shadow test for directional lights is similar but uses t_{1} = ∞ rather than r. Note that the illumination computation for each light requires a separate shadow ray, and there is no shadow test in computing ambient shading.
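The shadow tests for the two light types can be sketched in Python as below. The `scene_hit` callable (returning the t of the first hit in the given interval, or infinity on a miss), the `plane_z` helper used to build a trivial test scene, and the value of `EPS` are all assumptions made for this example; a suitable ε depends on the scale of the scene.

```python
import math

EPS = 1e-4   # assumed offset to avoid self-intersection

def point_light_blocked(scene_hit, x, p, eps=EPS):
    """True if some object lies between shading point x and the light at p."""
    to_light = tuple(pc - xc for pc, xc in zip(p, x))
    r = math.sqrt(sum(c * c for c in to_light))
    l = tuple(c / r for c in to_light)           # unit length, so t is a distance
    return scene_hit(x, l, eps, r) < math.inf    # a hit in [eps, r] means shadow

def directional_light_blocked(scene_hit, x, l, eps=EPS):
    """Same test for a directional light: the interval extends to infinity."""
    return scene_hit(x, l, eps, math.inf) < math.inf

def plane_z(z0):
    """Hypothetical test scene: the plane z = z0, as a scene_hit callable."""
    def hit(o, d, t0, t1):
        if d[2] == 0:
            return math.inf
        t = (z0 - o[2]) / d[2]
        return t if t0 <= t <= t1 else math.inf
    return hit
```

With a floor at z = 0 and a shading point on that floor, passing `eps=0.0` reports a shadow cast by the floor on itself, which is exactly the artifact the ε offset is meant to prevent.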
Shadows serve an important visual role in showing the relationships between nearby objects, as shown in Figure 4.15.
It is straightforward to add ideal specular reflection, or mirror reflection, to a ray-tracing program. The key observation is shown in Figure 4.16 where a viewer looking from direction e sees what is in direction r as seen from the surface. The vector r is the reflection of the vector –d across the surface normal n, which can be computed using the projection of d onto the direction of the surface normal:
$$\mathbf{r} = \mathbf{d} - 2(\mathbf{d} \cdot \mathbf{n})\mathbf{n}.$$
In the real world, some energy is lost when the light reflects from the surface, and this loss can be different for different colors. For example, gold reflects yellow more efficiently than blue, so it shifts the colors of the objects it reflects. This can be implemented by adding a recursive call in shade-ray that adds one more contribution after all the lights are accounted for:
c = c + k_{m} shade-ray(Ray(x, r), ε, ∞)
where k_{m} (for “mirror reflection”) is the specular RGB color. We need to make sure to pass t_{0} = ε for the same reason as we did with shadow rays; we don’t want the reflection ray to hit the object that generates it.
The problem with the recursive call above is that it may never terminate. For example, if a ray starts inside a room, it will bounce forever. This can be fixed by adding a maximum recursion depth. The code will be more efficient if a reflection ray is generated only if k_{m} is not zero.
Using a constant mirror reflection coefficient k_{m} gives a particular look characteristic of simple ray tracers (Figure 4.17); in the real world, this coefficient varies substantially depending on the incident angle. For better models, see Chapter 14.
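The reflection direction and the two safeguards mentioned above (a maximum recursion depth, and skipping the reflection ray when k_{m} is zero) might be sketched in Python as follows. The names `reflect`, `add_mirror`, `MAX_DEPTH`, and `EPS`, and the idea of threading a `depth` counter through a `shade_ray` callable, are assumptions for this example; the normal n is assumed to be a unit vector.

```python
import math

MAX_DEPTH = 5    # assumed recursion limit
EPS = 1e-4       # assumed offset to avoid hitting the reflecting surface itself

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def reflect(d, n):
    """Mirror direction r = d - 2 (d . n) n for a unit normal n."""
    s = 2.0 * dot(d, n)
    return (d[0] - s * n[0], d[1] - s * n[1], d[2] - s * n[2])

def add_mirror(c, k_m, x, d, n, shade_ray, depth):
    """Return c plus the mirror contribution, or c unchanged if recursion stops.

    shade_ray(origin, direction, t0, t1, depth) is a hypothetical callable
    that returns the RGB color seen along a ray.
    """
    if depth >= MAX_DEPTH or not any(k_m):
        return c                       # depth limit reached, or non-mirror surface
    r = reflect(d, n)
    c_r = shade_ray(x, r, EPS, math.inf, depth + 1)
    return tuple(ci + ki * ri for ci, ki, ri in zip(c, k_m, c_r))
```

Each level of recursion multiplies the reflected color by k_{m}, so with coefficients below one, the contribution of deep bounces shrinks quickly and a small depth limit is usually not visible in the image.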
Ray tracing was developed early in the history of computer graphics (Appel, 1968) but was not used much until sufficient compute power was available (Kay & Greenberg, 1979; Whitted, 1980).
Ray tracing has a lower asymptotic time complexity than basic object-order rendering (Snyder & Barr, 1987; Muuss, 1995; Parker et al., 1999; Wald, Slusallek, Benthin, & Wagner, 2001). Although it was traditionally thought of as an offline method, real-time ray tracing implementations are becoming more and more common.
Why is there no perspective matrix in ray tracing?
The perspective matrix in a z-buffer exists so that we can turn the perspective projection into a parallel projection. This is not needed in ray tracing, because it is easy to do the perspective projection implicitly by fanning the rays out from the eye.
Can ray tracing be made interactive?
For sufficiently small models and images, any modern PC is sufficiently powerful for ray tracing to be interactive. In practice, multiple CPUs with a shared frame buffer are required for a full-screen implementation. Computer power is increasing much faster than screen resolution, and it is just a matter of time before conventional PCs can ray trace complex scenes at screen resolution.
Is ray tracing useful in a hardware graphics program?
Ray tracing is frequently used for picking. When the user clicks the mouse on a pixel in a 3D graphics program, the program needs to determine which object is visible within that pixel. Ray tracing is an ideal way to determine that.
1. What are the ray parameters of the intersection points between ray (1, 1, 1)+ t(–1, –1, –1) and the sphere centered at the origin with radius 1? Note: this is a good debugging case.
2. What are the barycentric coordinates and ray parameter where the ray (1, 1, 1) + t(–1, –1, –1) hits the triangle with vertices (1, 0, 0) , (0, 1, 0) , and (0, 0, 1) ? Note: this is a good debugging case.
3. Do a back of the envelope computation of the approximate time complexity of ray tracing on “nice” (non-adversarial) models. Split your analysis into the cases of preprocessing and computing the image, so that you can predict the behavior of ray tracing multiple frames for a static model.
When we are rendering images of 3D scenes, whether by using ray tracing or rasterization, in real time or in batch processing, one of the key contributors to the visual impression of three-dimensionality is shading or coloring surfaces in the scene based on their shape and their relationship to other objects in the scene. In the physical world, most of the light we see is reflected light, and the physics of light reflection is strongly influenced by geometry, which produces a variety of cues that the human visual system makes very effective use of to understand shape.
In computer graphics, the purpose of shading is to provide these cues to the visual system, although the goals differ depending on the application. In computer-aided design or scientific visualization, the focus is on clarity: shading should be designed to provide the clearest, most accurate impression of 3D shape. On the other hand, in visual effects or advertising, the goal is to maximize the resemblance of renderings to the appearance of real objects. In animation, virtual environments, or games, the goals are somewhere in the middle: shading is meant to achieve artistic ends, which include depicting shape and material, but may not necessarily be intended to literally imitate reality.
The equations used to compute shading are known as a shading model, and a range of different shading models have been developed for these different applications. Generally, they all begin with simple models that provide a useful approximation to the physics of light reflection. From this starting point, additional features can be added to achieve closer approximations to physics for realistic rendering, or some parts can be modified or left out to make models suitable for more abstract styles.
A shading model is quite independent of the rest of a rendering system, and the same models can be used in ray tracing and rasterization systems. This chapter describes a basic shading model for an opaque surface illuminated by a point light source. This model might be all we need for simple applications, and it forms the starting point for more advanced shading computations such as those discussed in Chapter 14.
In the real world, light falls on surfaces from all directions. But for modeling illumination, the simplest case is when light arrives from a single direction; this is always an idealization, but it makes a useful model for light sources that are small in proportion to their distance from the surface, either because they are indeed small (for example, an LED flashlight) or because they are very far away (for example, the sun). Point-like sources come in two flavors: a point source is small enough to be treated as a point, but is close to the scene and can illuminate different surfaces differently; and a directional source is both small enough (relative to its distance) to be treated as point-like and also so far away that it illuminates all surfaces the same and there is no need to keep track of its location, only its direction. The flashlight and the sun are canonical examples of these two types of light sources.
A point light source is described by its position, which is a point in 3D space, and its intensity, which describes the amount of light it produces. A point source can be isotropic, meaning the intensity is the same in all directions; this is normally the default, but many systems provide “spot lights” that only send light in some directions, which can be handy for controlling light in a virtual scene in the same way that a real spot light is useful for controlling light on a stage.
When in doubt, make light sources neutral in color, with equal red, green, and blue intensities.
For an isotropic point source, it’s easy to reason about how much light falls on a surface a certain distance away. Suppose we have a point source that emits one Watt of radiant power isotropically, and we place this source at the center of a hollow sphere with one meter radius (Figure 5.1). All the power from the light falls on the inside surface of the sphere, and it’s distributed uniformly over the whole surface area of 4π m^{2}, so the density of radiant power per unit area is 1/(4π) Watts per square meter. This density is known as irradiance and is the right quantity to describe how much light is falling on a surface for the purposes of light reflection.
In the general case of a source that has power P and a receiving sphere of radius r, we find the irradiance E is
$$E = \frac{P}{4\pi r^2}. \tag{5.1}$$
The quantity I = P/(4π) is the intensity of the source; it is a property of the source itself that is independent of what surface it’s illuminating. The r^{–2} factor, often called the inverse square term, describes how irradiance depends on the distance r between the source and the surface.
One other important consideration in computing irradiance is the angle of incidence—the angle between the surface normal and the direction the light is traveling. Consider a small surface that is illuminated by a point source that is far away compared to the size of the surface. The light that falls on the surface is all travelling approximately parallel. If we tilt the surface to an angle of 60° as shown in Figure 5.2, the surface intercepts only half the light that it did when it was facing the source. In general, when rotated by an angle θ it intercepts an amount of light (radiant power) proportional to cos θ, and since the area stays the same, the irradiance (which, remember, is radiant power per unit area) is proportional to the same factor. This rule, that the irradiance on a surface falls off as the cosine of the incident angle, is known as Lambert’s cosine law because it was described by Johann Heinrich Lambert in his 1760 book Photometria.
Putting this together with the formula we just derived for irradiance on a surface facing exactly toward the source, we get the general formula for irradiance due to a point source,
$$E = \frac{I \cos\theta}{r^2}. \tag{5.2}$$
The term cos θ/r^{2} can be called the geometry factor for a point source; it depends on the geometric relationship between source and receiving surface, but not on the specific properties of either one.
In practice, the angle θ is not normally computed, because given a unit vector n that is normal to the surface and a unit vector l that points toward the light (Figure 5.3), the cosine factor can be computed using the dot product
$$\cos\theta = \mathbf{n} \cdot \mathbf{l},$$
which is simpler and more efficient than computations with trigonometric functions.
Note that this formula only holds when these two vectors have unit length!
A directional source is a limiting case of a very bright, far-away point source. As the source gets farther and farther away, the ratio I/r^{2} in Equation 5.2 varies less and less over the scene, and for a directional source, we replace this with a constant, H:
$$E = H \cos\theta.$$
This constant can be called the normal irradiance since it is equal to the irradiance when the light is positioned along the surface normal. A directional source is characterized by the direction toward the source (rather than by a position) and by the normal irradiance H (rather than by an intensity). The illumination from a directional source is uniform and does not fall off with distance in the way that point source illumination does.
Now that we have the ability to compute irradiance, which describes how much light falls on an object, we come to the question of how the object reflects that light. This depends on the material the object is made out of, and in this chapter, we develop a basic model for a colored material with an optional shiny surface. The idea behind this model is shown in Figure 5.4: the material can have a base layer that determines the object’s overall color, and it can have a surface that provides a shiny, mirror-like reflection, and we will look at the simplest model for each.
The very simplest kind of reflection is a surface that reflects light equally to all directions, regardless of where it came from, so that the reflected light L_{r} seen by the observer is simply a constant multiple of the irradiance:
$$L_r = kE. \tag{5.3}$$
A surface that behaves this way is known as an ideal diffuse surface and appears equally bright from all directions; its color is view independent and is completely described by its reflectance, R, which is the fraction of the irradiance it reflects. The coefficient relating reflected to incident light is R/π (the reason for the factor of π will have to wait for Chapter 14):
$$L_r = \frac{R}{\pi} E.$$
The reflectance can be different for different colors of light, and for simple modeling of color, it suffices to just keep three different reflectances, one each for red, green, and blue, so this shading equation is carried out separately for the three color channels.
Ideal diffuse shading, often called Lambertian shading because Lambert’s cosine law is the main effect it models, provides a flat, chalky appearance by itself. Physically, it models light that bounces around inside the material so that it “forgets” where it came from and emerges randomly in all directions. It is an effective model for paper, flat paint, dirt, tree bark, stone, and other rough materials that don’t have a distinct and smooth enough top surface to produce noticeable shiny reflections.
Precise prediction of color is a bit more complex; see Chapter 18.
Many materials have some degree of shininess to them—for example, metals, plastics, gloss or semi-gloss paints, or many leaves of plants. When you look at these materials, you see reflections that move around when you move your viewpoint; you could describe their color as being view-dependent in contrast to the view-independent color of a Lambertian surface. The view-dependent part of the reflection generally happens at the top surface of the material and is known as specular reflection.
The simplest kind of specular reflection happens at perfectly smooth surfaces like a mirror or the surface of water: light reflects in a mirrorlike way so that light coming from a point source goes in exactly one direction. This is known as ideal specular reflection and generally needs to be handled as a special case. But many surfaces are not perfectly smooth, and they exhibit a more general kind of reflection known in computer graphics as glossy reflection. There are many models for glossy reflection, and better ones are discussed in Chapter 14, but a simple and well-known model was originally proposed by Phong (1975) and later updated by Blinn (1976) and others to the form most commonly used today, known as the Modified Blinn–Phong model.
Since specular reflection is view dependent, it is a function of the view vector v that points from the shading point toward the viewer, as well as the normal vector n and light direction l. The idea is to produce reflection that is at its brightest when v and l are symmetrically positioned across the surface normal, which is when mirror reflection would occur; the reflection then decreases smoothly as the vectors move away from a mirror configuration.
We can tell how close we are to a mirror configuration using the idea of a half vector, which is the vector halfway between the viewing and illumination directions and is perpendicular to the surface exactly when l and v are in a mirror reflection configuration (Figure 5.5). If the half vector is near the surface normal, the specular component should be bright; if it is far away, it should be dim. We measure the nearness of h and n by computing their dot product (remember they are unit vectors, so n · h reaches its maximum of 1 when the vectors are equal) and then take the result to a power p > 1 to make it decrease faster:
(n · h)^{p}
The Phong exponent, p, controls the apparent shininess of the surface: higher values make the reflection fall off faster away from the mirror direction, leading to a shinier appearance. The half vector itself is easy to compute: since v and l are the same length, their sum is a vector that bisects the angle between them, which only needs to be normalized to produce h:
$$\mathbf{h} = \frac{\mathbf{v} + \mathbf{l}}{\|\mathbf{v} + \mathbf{l}\|}.$$
Typical values of p:
10: “eggshell”;
100: mildly shiny;
1000: really glossy;
10,000: nearly mirror-like.
To incorporate the Blinn–Phong idea into a shading computation, we add a specular component to Lambertian shading; the Lambertian part is then the diffuse component. We simply generalize the factor k from (5.3) that relates reflected light to incident irradiance to include not just the contribution of diffuse reflection but also a separate term that adds in specular reflection:
$$L_r = \left(\frac{R}{\pi} + k_s \max(0, \mathbf{n} \cdot \mathbf{h})^p\right) E,$$
where the scale factor k_{s} is the specular coefficient (separate for red, green, and blue) and controls how bright the specular component is, and we have added a clamping operation to avoid surprises for corner cases in which n faces away from h.
When in doubt, for surfaces that also have a diffuse color, make the specular coefficient neutral in color, with equal red, green, and blue values.
The expression that generalizes the factor k is called the bidirectional reflectance distribution function or BRDF, because it describes how the reflectance varies as a function of the two directions l and v. The BRDF of a Lambertian surface is constant, but the BRDF of a surface that has specular reflection is not. The shading calculation then boils down to computing the irradiance (describing how much light is available to reflect) and the BRDF (describing how the surface reflects it), and then multiplying them. The BRDF is discussed more completely in Chapter 14.
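As an illustration, the diffuse-plus-specular factor described above could be evaluated in Python along these lines. The function name, the tuple representation of RGB values and vectors, and the choice to return zero specular contribution when v and l are exactly opposite (so that the half vector is undefined) are assumptions of this sketch; l, v, and n are assumed to be unit vectors.

```python
import math

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def blinn_phong_brdf(l, v, n, R, k_s, p):
    """Per-channel BRDF value R/pi + k_s * max(0, n . h)^p.

    R and k_s are RGB tuples; p is the Phong exponent.
    """
    s = (v[0] + l[0], v[1] + l[1], v[2] + l[2])
    s_len = math.sqrt(dot(s, s))
    if s_len == 0.0:
        spec = 0.0                      # v = -l: no well-defined half vector
    else:
        h = (s[0] / s_len, s[1] / s_len, s[2] / s_len)
        spec = max(0.0, dot(n, h)) ** p
    return tuple(Rc / math.pi + ks_c * spec for Rc, ks_c in zip(R, k_s))
```

Setting `k_s` to zero recovers the constant Lambertian value R/π, and in the exact mirror configuration (h equal to n) the specular term reaches its maximum of `k_s`.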
When implementing surface shading, the code needs to have access to information about the light source, the surface, and the viewing direction. Writing clean code that supports both point and directional lights is easiest to do by separating the calculation of irradiance from the calculation of reflected light. Irradiance depends only on the light source and the surface geometry, and once it’s known, calculating the reflected light only depends on the surface properties and the viewing geometry.
Basic shading calculations can be done in exactly the same way in ray tracing and rasterization systems; it’s really only how the inputs are computed that varies. To compute irradiance, we need
The shading point x, a 3D point on a surface
The surface normal n perpendicular to the surface at x
The light source position p for a point light or its direction l for a directional light
The light source intensity I for a point light or its normal irradiance H for a directional light (these are RGB colors).
For a point light, we need to compute the distance and the light direction, which are both simple to get from the vector p – x:
\[ r = \|\mathbf{p} - \mathbf{x}\|, \qquad \mathbf{l} = \frac{\mathbf{p} - \mathbf{x}}{r}, \]
and for both types of lights, the cosine factor is best computed using a dot product; as long as n and l are unit vectors,
\[ \cos\theta = \mathbf{n} \cdot \mathbf{l}. \]
In practice, it’s a good idea when computing irradiance to clamp the dot product at zero, so that even if the light direction faces away from the surface normal (this can happen with interpolated normals; see Section 9.2.4), you won’t get negative shading. This leads to what we view as the official equations for computing irradiance:
\[ E = \frac{\max(0, \mathbf{n} \cdot \mathbf{l})}{r^2}\, I \quad \text{(point light)}, \qquad E = H \max(0, \mathbf{n} \cdot \mathbf{l}) \quad \text{(directional light)}. \]
Once the irradiance is known, it needs to be multiplied by the BRDF value, and the ingredients for calculating that value are
The light direction l, a unit vector pointing from x toward the light (already computed as part of the irradiance calculation)
The viewing direction v, a unit vector pointing from x toward the viewer
The parameters describing the properties of the surface material. For this chapter’s model, this includes R, k_{s}, and p.
How you get these quantities differs substantially between ray tracing and rasterization systems, but the actual shading calculation itself is the same. Don’t forget that v, l, and n all must be unit vectors; failing to normalize these vectors is a very common error in shading computations.
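The separation described above, irradiance first and BRDF second, can be sketched in a few lines of Python. This is a minimal illustration rather than a definitive implementation: the function names are hypothetical, the diffuse term is assumed to be R/π, and the half vector h is assumed to be the normalized sum of l and v. Colors are RGB tuples and n is assumed to be a unit vector.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    length = math.sqrt(dot(v, v))
    return tuple(c / length for c in v)

def point_light_irradiance(x, n, p, I):
    """Irradiance at x from a point light at position p with intensity I.
    Also returns the unit light direction l for later use."""
    d = tuple(pc - xc for pc, xc in zip(p, x))
    r2 = dot(d, d)                      # squared distance to the light
    l = normalize(d)
    cos_theta = max(0.0, dot(n, l))     # clamp: no negative shading
    return tuple(Ic * cos_theta / r2 for Ic in I), l

def directional_light_irradiance(n, l, H):
    """Irradiance from a directional light with direction l and normal irradiance H."""
    l = normalize(l)
    cos_theta = max(0.0, dot(n, l))
    return tuple(Hc * cos_theta for Hc in H), l

def blinn_phong_brdf(n, l, v, R, k_s, p):
    """Assumed form: R/pi + k_s * max(0, n.h)^p, with h the normalized l + v."""
    h = normalize(tuple(lc + vc for lc, vc in zip(l, v)))
    spec = max(0.0, dot(n, h)) ** p
    return tuple(Rc / math.pi + kc * spec for Rc, kc in zip(R, k_s))

def reflected_radiance(E, brdf):
    """Multiply irradiance by the BRDF value, componentwise in RGB."""
    return tuple(Ec * fc for Ec, fc in zip(E, brdf))
```

Because both irradiance functions return the same pair (E, l), the BRDF code does not need to know which kind of light produced the irradiance.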
Point-like sources are models for very localized sources that produce a lot of light near one direction. Other kinds of light sources are not so localized—for instance the sky, or the light reflected from the walls of a room. While such extended sources can be modeled in great detail, for basic shading we need a really simple approximation, so we make the assumption of ambient light that is exactly the same in all directions and at all locations in the scene (Figure 5.6). We further assume ambient light is only reflected diffusely (since there is no light direction and therefore no way to compute specular shading). This makes ambient shading very simple: it is a constant!
Normally, this constant is factored into the product of a material-related ambient reflection coefficient k_{a} and a light-related ambient intensity I_{a}:
\[ L_a = k_a I_a. \]
Really, k_{a} ought to be called reflectance and I_{a} ought to be called radiance, but this is not the usual nomenclature.
Both these quantities are colored, so they are multiplied componentwise (the ambient coefficient for red scales the red ambient intensity). This arrangement makes it convenient to tune ambient shading per object and in the scene as a whole.
Ambient shading is a bit of a hack, since lighting from large extended light sources does still vary: it tends to be darker in corners and other concave areas. But it is an important part of simple shading setups because it prevents shadows from being completely black and allows an easy way to tweak overall scene contrast.
When in doubt, set the ambient color to be the same as the diffuse color and the ambient intensity to a neutral color.
Many systems treat ambient light as a type of light source that appears in a list with point and directional lights; other systems make the ambient intensity a parameter of the scene so that there is no explicit light source for ambient, which is the same as assuming there is always exactly one ambient light.
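As a small illustration of the componentwise product, ambient shading might be coded as follows. The function name and the sample colors are hypothetical; the defaults follow the rule of thumb above (ambient color equal to the diffuse color, neutral ambient intensity).

```python
def ambient_shading(k_a, I_a):
    """Componentwise product of the ambient coefficient and ambient intensity (RGB)."""
    return tuple(k * i for k, i in zip(k_a, I_a))

# Hypothetical values: reuse the diffuse color as k_a, and a neutral gray I_a.
diffuse_color = (0.8, 0.4, 0.2)
ambient_intensity = (0.1, 0.1, 0.1)
ambient = ambient_shading(diffuse_color, ambient_intensity)
```

The result is a constant color that is simply added to the contributions of the point and directional lights.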
Phong shading seems like an enormous hack. Is that true?
Yes. It is not a very good model if you are trying to match measurements of real surfaces. However, it is simple and has proven to produce shading that is very useful in practice. Applications that are looking for realistic shading are moving away from Phong shading to more complex but much more accurate models based on microfacet theory (Walter, Marschner, Li, & Torrance, 2007). But realism also absolutely requires going beyond point-like light sources. All this is discussed in Chapter 14.
I hate calling pow(). Is there a way to avoid it when doing Phong lighting?
A simple way is to only have exponents that are themselves a power of two, i.e., 2, 4, 8, 16, .... In practice, this is not a problematic restriction for most applications. Many systems designed for fast graphics calculations have library functions for pow() that are much faster and slightly less accurate than the ones found in standard math libraries.
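When the exponent is restricted to a power of two, the power can be computed by repeated squaring. The sketch below assumes the exponent is supplied as its base-2 logarithm; the function name is hypothetical.

```python
def pow_power_of_two(x, log2_exponent):
    """Return x ** (2 ** log2_exponent) using only multiplications."""
    for _ in range(log2_exponent):
        x = x * x   # each squaring doubles the exponent
    return x
```

For example, an exponent of 16 needs four multiplications: `pow_power_of_two(cos_alpha, 4)`.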
1. The moon is poorly approximated by both diffuse and Phong shading. What observations tell you that this is true?
2. Velvet is poorly approximated by both diffuse and Phong shading. What observations tell you that this is true?
3. Why do most highlights on plastic objects look white, while those on gold metal look gold?
Perhaps the most universal tools of graphics programs are the matrices that change or transform points and vectors. In the next chapter, we will see how a vector can be represented as a matrix with a single column, and how the vector can be represented in a different basis via multiplication with a square matrix. We will also describe how we can use such multiplications to accomplish changes in the vector such as scaling, rotation, and translation. In this chapter, we review basic linear algebra from a geometric perspective, focusing on intuition and algorithms that work well in the two- and three-dimensional case.
This chapter can be skipped by readers comfortable with linear algebra. However, there may be some enlightening tidbits even for such readers, such as the development of determinants and the discussion of singular and eigenvalue decomposition.
We usually think of determinants as arising in the solution of linear equations. However, for our purposes, we will think of determinants as another way to multiply vectors. For 2D vectors a and b, the determinant |ab| is the area of the parallelogram formed by a and b (Figure 6.1). This is a signed area, and the sign is positive if a and b are right-handed and negative if they are left-handed. This means |ab| = –|ba|. In 2D, we can interpret “right-handed” as meaning we rotate the first vector counterclockwise to close the smallest angle to the second vector. In 3D, the determinant must be taken with three vectors at a time. For three 3D vectors, a, b, and c, the determinant |abc| is the signed volume of the parallelepiped (3D parallelogram; a sheared 3D box) formed by the three vectors (Figure 6.2). To compute a 2D determinant, we first need to establish a few of its properties. We note that scaling one side of a parallelogram scales its area by the same fraction (Figure 6.3):
\[ |(k\mathbf{a})\mathbf{b}| = |\mathbf{a}(k\mathbf{b})| = k|\mathbf{a}\mathbf{b}|. \]
Also, we note that “shearing” a parallelogram does not change its area (Figure 6.4):
\[ |(\mathbf{a} + k\mathbf{b})\mathbf{b}| = |\mathbf{a}(\mathbf{b} + k\mathbf{a})| = |\mathbf{a}\mathbf{b}|. \]
Finally, we see that the determinant has the following property:
\[ |\mathbf{a}(\mathbf{b} + \mathbf{c})| = |\mathbf{a}\mathbf{b}| + |\mathbf{a}\mathbf{c}|, \]
because as shown in Figure 6.5, we can “slide” the edge between the two parallelograms over to form a single parallelogram without changing the area of either of the two original parallelograms.
Now let’s assume a Cartesian representation for a and b:
\[ \begin{aligned} |\mathbf{a}\mathbf{b}| &= |(x_a\mathbf{x} + y_a\mathbf{y})(x_b\mathbf{x} + y_b\mathbf{y})| \\ &= x_a x_b |\mathbf{x}\mathbf{x}| + x_a y_b |\mathbf{x}\mathbf{y}| + y_a x_b |\mathbf{y}\mathbf{x}| + y_a y_b |\mathbf{y}\mathbf{y}| \\ &= x_a x_b (0) + x_a y_b (+1) + y_a x_b (-1) + y_a y_b (0) \\ &= x_a y_b - y_a x_b. \end{aligned} \]
This simplification uses the fact that |vv| = 0 for any vector v, because the parallelograms would all be collinear with v and thus without area.
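In three dimensions, the determinant of three 3D vectors a, b, and c is denoted |abc|. With Cartesian representations for the vectors, there are analogous rules for parallelepipeds as there are for parallelograms, and we can do an analogous expansion as we did for 2D:
\[ \begin{aligned} |\mathbf{a}\mathbf{b}\mathbf{c}| &= |(x_a\mathbf{x} + y_a\mathbf{y} + z_a\mathbf{z})(x_b\mathbf{x} + y_b\mathbf{y} + z_b\mathbf{z})(x_c\mathbf{x} + y_c\mathbf{y} + z_c\mathbf{z})| \\ &= x_a y_b z_c - x_a z_b y_c - y_a x_b z_c + y_a z_b x_c + z_a x_b y_c - z_a y_b x_c. \end{aligned} \]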
As you can see, the computation of determinants in this fashion gets uglier as the dimension increases. We will discuss less error-prone ways to compute determinants in Section 6.3.
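The coordinate expansions of the 2D and 3D determinants translate directly into code. The following is a minimal sketch with hypothetical function names; vectors are plain tuples.

```python
def det2(a, b):
    """Signed area of the parallelogram formed by 2D vectors a and b."""
    return a[0] * b[1] - a[1] * b[0]

def det3(a, b, c):
    """Signed volume of the parallelepiped formed by 3D vectors a, b, and c."""
    return (a[0] * b[1] * c[2] - a[0] * b[2] * c[1]
            - a[1] * b[0] * c[2] + a[1] * b[2] * c[0]
            + a[2] * b[0] * c[1] - a[2] * b[1] * c[0])
```

Swapping the two arguments of `det2` flips the sign, which matches |ab| = –|ba|.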
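Example 2 Determinants arise naturally when computing the expression for one vector as a linear combination of two others. For example, suppose we wish to express a vector c as a combination of vectors a and b:
\[ \mathbf{c} = a_c \mathbf{a} + b_c \mathbf{b}. \]
We can see from Figure 6.6 that
\[ |(b_c \mathbf{b})\mathbf{a}| = |\mathbf{c}\mathbf{a}|, \]
because these parallelograms are just sheared versions of each other. Solving for b_{c} yields
\[ b_c = \frac{|\mathbf{c}\mathbf{a}|}{|\mathbf{b}\mathbf{a}|}. \]
An analogous argument yields
\[ a_c = \frac{|\mathbf{b}\mathbf{c}|}{|\mathbf{b}\mathbf{a}|}. \]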
This is the two-dimensional version of Cramer’s rule, which we will revisit in Section 6.3.2.
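The two-dimensional rule can be sketched in code as ratios of 2D determinants, using b_c = |ca|/|ba| and a_c = |bc|/|ba|, which follow from substituting c = a_c a + b_c b and using |vv| = 0. The function names are hypothetical.

```python
def det2(a, b):
    return a[0] * b[1] - a[1] * b[0]

def combination_coefficients(a, b, c):
    """Return (a_c, b_c) such that c = a_c * a + b_c * b."""
    denom = det2(b, a)
    if denom == 0:
        # a and b are parallel, so they do not span the plane
        raise ValueError("a and b are linearly dependent")
    return det2(b, c) / denom, det2(c, a) / denom
```

With a = (1, 1), b = (–1, 1), and c = (0, 2), both coefficients come out as 1, since c = a + b.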
A matrix is an array of numeric elements that follow certain arithmetic rules. An example of a matrix with two rows and three columns is
Matrices are frequently used in computer graphics for a variety of purposes including representation of spatial transforms. For our discussion, we assume the elements of a matrix are all real numbers. This chapter describes both the mechanics of matrix arithmetic and the determinant of “square” matrices, i.e., matrices with the same number of rows as columns.
A matrix times a constant results in a matrix where each element has been multiplied by that constant, e.g.,
For matrix multiplication, we “multiply” rows of the first matrix with columns of the second matrix:
So the element p_{ij} of the resulting product is
\[ p_{ij} = a_{i1} b_{1j} + a_{i2} b_{2j} + \cdots + a_{im} b_{mj}. \tag{6.2} \]
Taking a product of two matrices is only possible if the number of columns of the left matrix is the same as the number of rows of the right matrix. For example,
Matrix multiplication is not commutative in most instances:
\[ \mathbf{A}\mathbf{B} \neq \mathbf{B}\mathbf{A}. \]
Also, if AB = AC, it does not necessarily follow that B = C. Fortunately, matrix multiplication is associative and distributive:
\[ (\mathbf{A}\mathbf{B})\mathbf{C} = \mathbf{A}(\mathbf{B}\mathbf{C}), \qquad \mathbf{A}(\mathbf{B} + \mathbf{C}) = \mathbf{A}\mathbf{B} + \mathbf{A}\mathbf{C}, \qquad (\mathbf{A} + \mathbf{B})\mathbf{C} = \mathbf{A}\mathbf{C} + \mathbf{B}\mathbf{C}. \]
We would like a matrix analog of the inverse of a real number. We know the inverse of a real number x is 1/x and that the product of x and its inverse is 1. We need a matrix I that we can think of as a “matrix one.” This exists only for square matrices and is known as the identity matrix; it consists of ones down the diagonal and zeroes elsewhere. For example, the four by four identity matrix is
\[ \mathbf{I} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}. \]
The inverse matrix A^{–1} of a matrix A is the matrix that ensures AA^{–1} = I. For example,
Note that the inverse of A^{-1} is A. So AA^{-1} = A^{-1}A = I. The inverse of a product of two matrices is the product of the inverses, but with the order reversed:
\[ (\mathbf{A}\mathbf{B})^{-1} = \mathbf{B}^{-1}\mathbf{A}^{-1}. \tag{6.4} \]
We will return to the question of computing inverses in Section 6.3.
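A small numerical check can make the multiplication rules concrete. The sketch below multiplies two sample 2 × 2 matrices in both orders and compares the inverse of the product with the product of the inverses in reversed order. The sample matrices are arbitrary, and `inverse2` uses the standard closed-form 2 × 2 inverse, which is an assumption here since the text has not derived it yet.

```python
def matmul(A, B):
    """Product of matrices stored as lists of rows."""
    if len(A[0]) != len(B):
        raise ValueError("columns of A must equal rows of B")
    inner, cols = len(B), len(B[0])
    return [[sum(row[k] * B[k][j] for k in range(inner)) for j in range(cols)]
            for row in A]

def inverse2(M):
    """Closed-form inverse of a 2 x 2 matrix (assumes a nonzero determinant)."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

A = [[1.0, 2.0], [0.0, 1.0]]
B = [[2.0, 0.0], [1.0, 1.0]]

AB = matmul(A, B)   # [[4, 2], [1, 1]]
BA = matmul(B, A)   # [[2, 4], [1, 3]], so AB and BA differ
inv_of_product = inverse2(AB)
product_of_inverses = matmul(inverse2(B), inverse2(A))
```

Here `inv_of_product` and `product_of_inverses` agree, while `AB` and `BA` do not.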
The transpose A^{T} of a matrix A has the same numbers, but the rows are switched with the columns. If we label the entries of A^{T} as a′_{ij}, then
\[ a'_{ij} = a_{ji}. \]
For example,
The transpose of a product of two matrices obeys a rule similar to Equation (6.4):
\[ (\mathbf{A}\mathbf{B})^{T} = \mathbf{B}^{T}\mathbf{A}^{T}. \]
The determinant of a square matrix is simply the determinant of the columns of the matrix, considered as a set of vectors. The determinant has several nice relationships to the matrix operations just discussed, which we list here for reference:
\[ |\mathbf{A}\mathbf{B}| = |\mathbf{A}|\,|\mathbf{B}|, \tag{6.5} \]
\[ |\mathbf{A}^{-1}| = \frac{1}{|\mathbf{A}|}, \tag{6.6} \]
\[ |\mathbf{A}^{T}| = |\mathbf{A}|. \tag{6.7} \]
In graphics, we use a square matrix to transform a vector represented as a matrix. For example, if you have a 2D vector a = (x_{a}, y_{a}) and want to rotate it by 90 degrees about the origin to form the vector a′ = (–y_{a}, x_{a}), you can use a product of a 2 × 2 matrix and a 2 × 1 matrix, called a column vector. The operation in matrix form is
\[ \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} x_a \\ y_a \end{bmatrix} = \begin{bmatrix} -y_a \\ x_a \end{bmatrix}. \]
We can get the same result by using the transpose of this matrix and multiplying on the left (“premultiplying”) with a row vector:
\[ \begin{bmatrix} x_a & y_a \end{bmatrix} \begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix} = \begin{bmatrix} -y_a & x_a \end{bmatrix}. \]
These days, postmultiplication using column vectors is fairly standard, but in many older books and systems, you will run across row vectors and premultiplication. The only difference is that the transform matrix must be replaced with its transpose.
We also can use matrix formalism to encode operations on just vectors. If we consider the result of the dot product as a 1 × 1 matrix, it can be written
\[ \mathbf{a} \cdot \mathbf{b} = \mathbf{a}^{T}\mathbf{b}. \]
For example, if we take two 3D vectors we get
\[ \begin{bmatrix} x_a & y_a & z_a \end{bmatrix} \begin{bmatrix} x_b \\ y_b \\ z_b \end{bmatrix} = \begin{bmatrix} x_a x_b + y_a y_b + z_a z_b \end{bmatrix}. \]
A related vector product is the outer product between two vectors, which can be expressed as a matrix multiplication with a column vector on the left and a row vector on the right: ab^{T}. The result is a matrix consisting of products of all pairs of an entry of a with an entry of b. For 3D vectors, we have
\[ \begin{bmatrix} x_a \\ y_a \\ z_a \end{bmatrix} \begin{bmatrix} x_b & y_b & z_b \end{bmatrix} = \begin{bmatrix} x_a x_b & x_a y_b & x_a z_b \\ y_a x_b & y_a y_b & y_a z_b \\ z_a x_b & z_a y_b & z_a z_b \end{bmatrix}. \]
It is often useful to think of matrix multiplication in terms of vector operations. To illustrate using the three-dimensional case, we can think of a 3 × 3 matrix as a collection of three 3D vectors in two ways: either it is made up of three column vectors side-by-side or it is made up of three row vectors stacked up. For instance, the result of a matrix-vector multiplication y = Ax can be interpreted as a vector whose entries are the dot products of x with the rows of A. Naming these row vectors r_{i}, we have
\[ \mathbf{y} = \begin{bmatrix} \mathbf{r}_1 \cdot \mathbf{x} \\ \mathbf{r}_2 \cdot \mathbf{x} \\ \mathbf{r}_3 \cdot \mathbf{x} \end{bmatrix}. \]
Alternatively, we can think of the same product as a sum of the three columns c_{i} of A, weighted by the entries of x:
\[ \mathbf{y} = x_1 \mathbf{c}_1 + x_2 \mathbf{c}_2 + x_3 \mathbf{c}_3. \]
Using the same ideas, one can understand a matrix–matrix product AB as an array containing the pairwise dot products of all rows of A with all columns of B (cf. (6.2)); as a collection of products of the matrix A with all the column vectors of B, arranged left to right; as a collection of products of all the row vectors of A with the matrix B, stacked top to bottom; or as the sum of the pairwise outer products of all columns of A with all rows of B. (See Exercise 8.)
These interpretations of matrix multiplication can often lead to valuable geometric interpretations of operations that may otherwise seem very abstract.
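The two readings of a matrix–vector product y = Ax can be compared in code. The sketch below uses hypothetical function names and stores a matrix as a list of rows. Both functions compute the same vector, one as dot products with the rows and one as a weighted sum of the columns.

```python
def matvec_by_rows(A, x):
    """Entry i of y is the dot product of row i of A with x."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def matvec_by_columns(A, x):
    """y is the sum of the columns of A, each weighted by the matching entry of x."""
    y = [0] * len(A)
    for j, xj in enumerate(x):
        for i in range(len(A)):
            y[i] += xj * A[i][j]   # add x_j times column j
    return y
```

The row form corresponds to the loop order usually written first; the column form is what results when the two loops are exchanged.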
The identity matrix is an example of a diagonal matrix, where all nonzero elements occur along the diagonal. The diagonal consists of those elements whose column index equals the row index counting from the upper left.
The identity matrix also has the property that it is the same as its transpose. Such matrices are called symmetric.
The identity matrix is also an orthogonal matrix, because each of its columns considered as a vector has length 1 and the columns are orthogonal to one another. The same is true of the rows (see Exercise 2). The determinant of any orthogonal matrix is either +1 or –1.
The idea of an orthogonal matrix corresponds to the idea of an orthonormal basis, not just a set of orthogonal vectors—an unfortunate glitch in terminology.
A very useful property of orthogonal matrices is that they are nearly their own inverses. Multiplying an orthogonal matrix by its transpose results in the identity,
\[ \mathbf{R}^{T}\mathbf{R} = \mathbf{I} = \mathbf{R}\mathbf{R}^{T}. \]
This is easy to see because the entries of R^{T}R are dot products between the columns of R. Off-diagonal entries are dot products between orthogonal vectors, and the diagonal entries are dot products of the (unit-length) columns with themselves.
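This property gives a simple numerical test for orthogonality: form R^{T}R and compare it with the identity. The following is a minimal sketch; the function names and the tolerance are assumptions.

```python
import math

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    return [[sum(row[k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for row in A]

def is_orthogonal(R, tol=1e-12):
    """True if R^T R equals the identity to within tol."""
    n = len(R)
    P = matmul(transpose(R), R)   # entries are dot products between columns of R
    return all(abs(P[i][j] - (1.0 if i == j else 0.0)) <= tol
               for i in range(n) for j in range(n))

angle = 0.3
rotation = [[math.cos(angle), -math.sin(angle)],
            [math.sin(angle),  math.cos(angle)]]
stretch = [[2.0, 0.0], [0.0, 1.0]]   # orthogonal columns, but not unit length
```

The rotation passes the test and the stretch fails it, which mirrors the distinction between orthonormal and merely orthogonal columns.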
Example 3 The matrix
is diagonal, and therefore symmetric, but not orthogonal (the columns are orthogonal but they are not unit length).
The matrix
is symmetric, but not diagonal or orthogonal.
The matrix
is orthogonal, but neither diagonal nor symmetric.
Recall from Section 6.1 that the determinant takes n n-dimensional vectors and combines them to get a signed n-dimensional volume of the n-dimensional parallelepiped defined by the vectors. For example, the determinant in 2D is the area of the parallelogram formed by the vectors. We can use matrices to handle the mechanics of computing determinants.
If we have 2D vectors r and s, we denote the determinant |rs|; this value is the signed area of the parallelogram formed by the vectors. Suppose we have two 2D vectors with Cartesian coordinates (a, b) and (A, B) (Figure 6.7). The determinant can be written in terms of column vectors or as a shorthand:
\[ |\mathbf{r}\mathbf{s}| = \left| \begin{bmatrix} a \\ b \end{bmatrix} \begin{bmatrix} A \\ B \end{bmatrix} \right| \equiv \begin{vmatrix} a & A \\ b & B \end{vmatrix} = aB - Ab. \]
Note that the determinant of a matrix is the same as the determinant of its transpose:
\[ \begin{vmatrix} a & A \\ b & B \end{vmatrix} = \begin{vmatrix} a & b \\ A & B \end{vmatrix} = aB - Ab. \]
This means that for any parallelogram in 2D, there is a “sibling” parallelogram that has the same area but a different shape (Figure 6.8). For example, the parallelogram defined by vectors (3, 1) and (2, 4) has area 10, as does the parallelogram defined by vectors (3, 2) and (1, 4).
Example 4 The geometric meaning of the 3D determinant is helpful in seeing why certain formulas make sense. For example, the equation of the plane through the points (x_{i}, y_{i}, z_{i}) for i = 0, 1, 2 is
\[ \begin{vmatrix} x - x_0 & x - x_1 & x - x_2 \\ y - y_0 & y - y_1 & y - y_2 \\ z - z_0 & z - z_1 & z - z_2 \end{vmatrix} = 0. \]
Each column is a vector from point (x_{i}, y_{i}, z_{i}) to point (x, y, z). The volume of the parallelepiped with those vectors as sides is zero only if (x, y, z) is coplanar with the three other points. Almost all equations involving determinants have similarly simple underlying geometry.
As we saw earlier, we can compute determinants by a brute force expansion where most terms are zero, and there is a great deal of bookkeeping on plus and minus signs. The standard way to manage the algebra of computing determinants is to use a form of Laplace’s expansion. The key part of computing the determinant this way is to find cofactors of various matrix elements. Each element of a square matrix has a cofactor which is the determinant of a matrix with one fewer row and column possibly multiplied by minus one. The smaller matrix is obtained by eliminating the row and column that the element in question is in. For example, for a 10 × 10 matrix, the cofactor of a_{82} is the determinant of the 9 × 9 matrix with the 8th row and 2nd column eliminated. The sign of a cofactor is positive if the sum of the row and column indices is even and negative otherwise. This can be remembered by a checkerboard pattern:
\[ \begin{bmatrix} + & - & + & - & \cdots \\ - & + & - & + & \cdots \\ + & - & + & - & \cdots \\ - & + & - & + & \cdots \\ \vdots & \vdots & \vdots & \vdots & \ddots \end{bmatrix}. \]
So, for a 4 × 4 matrix,
The cofactors of the first row are
The determinant of a matrix is found by taking the sum of products of the elements of any row or column with their cofactors. For example, the determinant of the 4 × 4 matrix above taken about its second column is
We could do a similar expansion about any row or column and they would all yield the same result. Note the recursive nature of this expansion.
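The recursive nature of the expansion is easy to see in code. The sketch below always expands about the first row; the function names are hypothetical, and the matrix is a list of row lists. It is meant for the small matrices common in graphics, since the cost grows factorially with size.

```python
def minor(M, row, col):
    """Matrix M with the given row and column removed."""
    return [r[:col] + r[col + 1:] for i, r in enumerate(M) if i != row]

def determinant(M):
    """Determinant by Laplace expansion about the first row."""
    n = len(M)
    if n == 1:
        return M[0][0]
    total = 0
    for j in range(n):
        sign = -1 if j % 2 else 1   # checkerboard sign for row 0, column j
        total += sign * M[0][j] * determinant(minor(M, 0, j))
    return total
```

For a 3 × 3 matrix whose first and third rows sum to twice the second row, such as [[0, 1, 2], [3, 4, 5], [6, 7, 8]], the function returns zero, as linear dependence requires.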
Example 5 A concrete example for the determinant of a particular 3 × 3 matrix by expanding the cofactors of the first row is
We can deduce that the volume of the parallelepiped formed by the vectors defined by the columns (or rows since the determinant of the transpose is the same) is zero. This is equivalent to saying that the columns (or rows) are not linearly independent. Note that the sum of the first and third rows is twice the second row, which implies linear dependence.
Determinants give us a tool to compute the inverse of a matrix. It is a very inefficient method for large matrices, but often in graphics, our matrices are small. A key to developing this method is that the determinant of a matrix with two identical rows is zero. This should be clear because the volume of the n-dimensional parallelepiped is zero if two of its sides are the same. Suppose we have a 4 × 4 A and we wish to find its inverse A^{–1}. The inverse is
Note that this is just the transpose of the matrix where elements of A are replaced by their respective cofactors, multiplied by the leading constant (1 or –1). This matrix is called the adjoint of A. The adjoint is the transpose of the cofactor matrix of A. We can see why this is an inverse. Look at the product AA^{–1}, which we expect to be the identity. If we multiply the first row of A by the first column of the adjoint matrix, we need to get |A| (remember that the leading constant above divides by |A|):
This is true because the elements in the first row of A are multiplied exactly by their cofactors in the first column of the adjoint matrix which is exactly the determinant. The other values along the diagonal of the resulting matrix are |A| for analogous reasons. The zeros follow a similar logic:
Note that this product is a determinant of some matrix:
The matrix in fact is
Because the first two rows are identical, the matrix is singular, and thus, its determinant is zero.
The argument above does not apply just to four by four matrices; using that size just simplifies typography. For any matrix, the inverse is the adjoint matrix divided by the determinant of the matrix being inverted. The adjoint is the transpose of the cofactor matrix, which is just the matrix whose elements have been replaced by their cofactors.
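The recipe of dividing the adjoint by the determinant can be sketched as follows. The function names are hypothetical, and `Fraction` is used only so that integer inputs give exact results. As noted above, this approach is only sensible for small matrices.

```python
from fractions import Fraction

def minor(M, row, col):
    return [r[:col] + r[col + 1:] for i, r in enumerate(M) if i != row]

def determinant(M):
    if len(M) == 0:
        return 1   # empty determinant, so the 1 x 1 case works too
    return sum((-1 if j % 2 else 1) * M[0][j] * determinant(minor(M, 0, j))
               for j in range(len(M)))

def cofactor(M, i, j):
    sign = -1 if (i + j) % 2 else 1
    return sign * determinant(minor(M, i, j))

def inverse(M):
    """Inverse as the adjoint (transposed cofactor matrix) divided by |M|."""
    det = determinant(M)
    if det == 0:
        raise ValueError("matrix is singular")
    n = len(M)
    # entry (row, col) of the adjoint is the cofactor of element (col, row)
    return [[Fraction(cofactor(M, col, row)) / det for col in range(n)]
            for row in range(n)]
```

Multiplying a matrix by the result of `inverse` should give the identity, which is a convenient way to check the index order in the adjoint.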
Example 6 The inverse of one particular three-by-three matrix whose determinant is 6 is
You can check this yourself by multiplying the matrices and making sure you get the identity.
We often encounter linear systems in graphics with “n equations and n unknowns,” usually for n = 2 or n = 3. For example,
Here, x, y, and z are the “unknowns” for which we wish to solve. We can write this in matrix form:
A common shorthand for such systems is Ax = b where it is assumed that A is a square matrix with known constants, x is an unknown column vector (with elements x, y, and z in our example), and b is a column matrix of known constants.
There are many ways to solve such systems, and the appropriate method depends on the properties and dimensions of the matrix A. Because in graphics we so frequently work with systems of size n ≤ 4, we’ll discuss here a method appropriate for these systems, known as Cramer’s rule, which we saw earlier, from a 2D geometric viewpoint, in the example on page 108. Here, we show this algebraically. The solution to the above equation is
The rule here is to take a ratio of determinants, where the denominator is |A| and the numerator is the determinant of a matrix created by replacing a column of A with the column vector b. The column replaced corresponds to the position of the unknown in vector x. For example, y is the second unknown and the second column is replaced. Note that if |A| = 0, the division is undefined and the rule cannot produce a solution. This is just another version of the rule that if A is singular (zero determinant), then there is no unique solution to the equations.
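For a 3 × 3 system, the rule can be sketched as below. The function names are hypothetical, and the sample system in the usage note is made up for illustration rather than taken from the text.

```python
def det3(M):
    """Determinant of a 3 x 3 matrix stored as a list of rows."""
    return (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
            - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
            + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))

def cramer3(A, b):
    """Solve A x = b for a 3 x 3 matrix A using Cramer's rule."""
    denom = det3(A)
    if denom == 0:
        raise ValueError("A is singular; no unique solution")
    solution = []
    for col in range(3):
        replaced = [row[:] for row in A]   # copy A, then swap in b
        for r in range(3):
            replaced[r][col] = b[r]
        solution.append(det3(replaced) / denom)
    return solution
```

For instance, `cramer3([[1, 1, 1], [0, 2, 5], [2, 5, -1]], [6, -4, 27])` recovers x = 5, y = 3, z = –2 up to floating-point rounding.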
Square matrices have eigenvalues and eigenvectors associated with them. The eigenvectors are those nonzero vectors whose directions do not change when multiplied by the matrix. For example, suppose for a matrix A and vector a, we have
\[ \mathbf{A}\mathbf{a} = \lambda\mathbf{a}. \]
This means we have stretched or compressed a, but its direction has not changed. The scale factor λ is called the eigenvalue associated with eigenvector a. Knowing the eigenvalues and eigenvectors of matrices is helpful in a variety of practical applications. We will describe them to gain insight into geometric transformation matrices and as a step toward singular values and vectors described in the next section.
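If we assume a matrix has at least one eigenvector, then we can do a standard manipulation to find it. First, we write both sides as the product of a square matrix with the vector a:
\[ \mathbf{A}\mathbf{a} = \lambda\mathbf{I}\mathbf{a}, \]
where I is an identity matrix. This can be rewritten
\[ \mathbf{A}\mathbf{a} - \lambda\mathbf{I}\mathbf{a} = \mathbf{0}. \]
Because matrix multiplication is distributive, we can group the matrices:
\[ (\mathbf{A} - \lambda\mathbf{I})\,\mathbf{a} = \mathbf{0}. \]
For a nonzero a, this equation can only be true if the matrix (A – λI) is singular, and thus, its determinant is zero. The elements in this matrix are the numbers in A except along the diagonal. For example, for a 2 × 2 matrix the eigenvalues obey
\[ \begin{vmatrix} a_{11} - \lambda & a_{12} \\ a_{21} & a_{22} - \lambda \end{vmatrix} = \lambda^2 - (a_{11} + a_{22})\lambda + (a_{11}a_{22} - a_{12}a_{21}) = 0. \]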
Because this is a quadratic equation, we know there are exactly two solutions for λ. These solutions may or may not be unique or real. A similar manipulation for an n × n matrix will yield an nth-degree polynomial in λ. Because it is not possible, in general, to find exact explicit solutions of polynomial equations of degree greater than four, we can only compute eigenvalues of matrices 4 × 4 or smaller by analytic methods. For larger matrices, numerical methods are the only option.
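For the 2 × 2 case, the characteristic equation is the quadratic λ² – (trace)λ + (determinant) = 0, so the eigenvalues follow from the quadratic formula. The sketch below uses a hypothetical function name and `cmath.sqrt` so that complex eigenvalues are returned rather than raising an error.

```python
import cmath

def eigenvalues_2x2(A):
    """Both roots of the characteristic quadratic of a 2 x 2 matrix."""
    (a, b), (c, d) = A
    trace = a + d
    det = a * d - b * c
    disc = cmath.sqrt(trace * trace - 4 * det)   # complex if the discriminant is negative
    return (trace + disc) / 2, (trace - disc) / 2
```

A symmetric input such as [[2, 1], [1, 1]] gives two real roots, while the 90-degree rotation [[0, -1], [1, 0]] gives the purely imaginary pair ±i, since no real direction is preserved by that rotation.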
An important special case where eigenvalues and eigenvectors are particularly simple is symmetric matrices (where A = A^{T}). The eigenvalues of real symmetric matrices are always real numbers, and if they are also distinct, their eigenvectors are mutually orthogonal. Such matrices can be put into diagonal form:
\[ \mathbf{A} = \mathbf{Q}\mathbf{D}\mathbf{Q}^{T}, \tag{6.14} \]
where Q is an orthogonal matrix and D is a diagonal matrix. The columns of Q are the eigenvectors of A and the diagonal elements of D are the eigenvalues of A. Putting A in this form is also called the eigenvalue decomposition, because it decomposes A into a product of simpler matrices that reveal its eigenvectors and eigenvalues.
Recall that an orthogonal matrix has orthonormal rows and orthonormal columns.
Example 7 Given the matrix
the eigenvalues of A are the solutions to
We approximate the exact values for compactness of notation:
Now we can find the associated eigenvector. The first is the nontrivial (not x = y = 0) solution to the homogeneous equation,
This is approximately (x, y) = (0.8507, 0.5257). Note that there are infinitely many solutions parallel to that 2D vector, and we just picked the one of unit length. Similarly, the eigenvector associated with λ_{2} is (x, y) = (–0.5257, 0.8507). This means the diagonal form of A is (within some precision due to our numeric approximation):
We will revisit the geometry of this matrix as a transform in the next chapter.
We saw in the last section that any symmetric matrix can be diagonalized, or decomposed into a convenient product of orthogonal and diagonal matrices. However, most matrices we encounter in graphics are not symmetric, and the eigenvalue decomposition for nonsymmetric matrices is not nearly so convenient or illuminating, and in general involves complex-valued eigenvalues and eigenvectors even for real-valued inputs.
We would recommend learning in this order: symmetric eigenvalues/vectors, singular values/vectors, and then nonsymmetric eigenvalues, which are much trickier.
There is another generalization of the symmetric eigenvalue decomposition to nonsymmetric (and even non-square) matrices; it is the singular value decomposition (SVD). The main difference between the eigenvalue decomposition of a symmetric matrix and the SVD of a nonsymmetric matrix is that the orthogonal matrices on the left and right sides are not required to be the same in the SVD:
\[ \mathbf{A} = \mathbf{U}\mathbf{S}\mathbf{V}^{T}. \]
Here, U and V are two, potentially different, orthogonal matrices, whose columns are known as the left and right singular vectors of A, and S is a diagonal matrix whose entries are known as the singular values of A. When A is symmetric and has all nonnegative eigenvalues, the SVD and the eigenvalue decomposition are the same.
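There is another relationship between singular values and eigenvalues that can be used to compute the SVD (though this is not the way an industrial-strength SVD implementation works). First, we define M = AA^{T}. We assume that we can perform an SVD on A and substitute it into M:
\[ \mathbf{M} = \mathbf{A}\mathbf{A}^{T} = (\mathbf{U}\mathbf{S}\mathbf{V}^{T})(\mathbf{U}\mathbf{S}\mathbf{V}^{T})^{T} = \mathbf{U}\mathbf{S}(\mathbf{V}^{T}\mathbf{V})\mathbf{S}\mathbf{U}^{T} = \mathbf{U}\mathbf{S}^{2}\mathbf{U}^{T}. \]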
The substitution is based on the fact that (BC)^{T} = C^{T}B^{T}, that the transpose of an orthogonal matrix is its inverse, and that the transpose of a diagonal matrix is the matrix itself. The beauty of this new form is that M is symmetric and US^{2}U^{T} is its eigenvalue decomposition, where S^{2} contains the (all nonnegative) eigenvalues. Thus, we find that the singular values of a matrix are the square roots of the eigenvalues of the product of the matrix with its transpose, and the left singular vectors are the eigenvectors of that product. A similar argument allows V, the matrix of right singular vectors, to be computed from A^{T}A.
Example 8 We now make this concrete with an example:
We saw the eigenvalue decomposition for this matrix in the previous section. We observe immediately
We can solve for V algebraically:
The inverse of S is a diagonal matrix with the reciprocals of the diagonal elements of S. This yields
This form used the standard symbol σ_{i} for the ith singular value. Again, for a symmetric matrix, the eigenvalues and the singular values are the same (σ_{i} = λ_{i}). We will examine the geometry of SVD further in Section 7.1.6.
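The relationship between singular values and the eigenvalues of AA^{T} can be checked numerically for the 2 × 2 case. The sketch below is only an illustration of that relationship, not how a production SVD works. The function name is hypothetical, and the shear matrix [[1, 1], [0, 1]] used in the usage note is an assumption about the example's input.

```python
import math

def singular_values_2x2(A):
    """Singular values of a 2 x 2 matrix as square roots of the eigenvalues of A A^T."""
    (a, b), (c, d) = A
    # M = A A^T is symmetric: [[m11, m12], [m12, m22]]
    m11 = a * a + b * b
    m12 = a * c + b * d
    m22 = c * c + d * d
    # eigenvalues of a symmetric 2 x 2 matrix: mean of diagonal plus or minus a radius
    mean = (m11 + m22) / 2
    radius = math.sqrt(((m11 - m22) / 2) ** 2 + m12 * m12)
    lam1 = mean + radius
    lam2 = max(0.0, mean - radius)   # guard against tiny negative rounding error
    return math.sqrt(lam1), math.sqrt(lam2)
```

For the shear matrix, AA^{T} is [[2, 1], [1, 1]], and the two singular values multiply to 1, the absolute value of the shear's determinant.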
Why is matrix multiplication defined the way it is rather than just element by element?
Element by element multiplication is a perfectly good way to define matrix multiplication, and indeed, it has nice properties. However, in practice it is not very useful. Ultimately, most matrices are used to transform column vectors; e.g., in 3D you might have
\[ \mathbf{b} = \mathbf{M}\mathbf{a}, \]
where a and b are vectors and M is a 3×3 matrix. To allow geometric operations such as rotation, combinations of all three elements of a must go into each element of b. That requires us to go either row-by-row or column-by-column through M. That choice is made based on composition of matrices having the desired property,
\[ \mathbf{M}_2(\mathbf{M}_1\mathbf{a}) = (\mathbf{M}_2\mathbf{M}_1)\,\mathbf{a}, \]
which allows us to use one composite matrix C = M_{2}M_{1} to transform our vector. This is valuable when many vectors will be transformed by the same composite matrix. So, in summary, the somewhat weird rule for matrix multiplication is engineered to have these desired properties.
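The composition property can be verified with two sample matrices. In this sketch the matrices and helper names are made up for illustration: M1 scales x by two and M2 rotates by 90 degrees about the z-axis. Integer entries keep the comparison exact.

```python
def matmul(A, B):
    return [[sum(row[k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for row in A]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

M1 = [[2, 0, 0], [0, 1, 0], [0, 0, 1]]    # scale x by 2
M2 = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]   # rotate 90 degrees about z

C = matmul(M2, M1)                        # composite: apply M1 first, then M2
a = [1, 2, 3]
one_step = matvec(C, a)
two_steps = matvec(M2, matvec(M1, a))
```

Both `one_step` and `two_steps` equal [-2, 2, 3]. Building the composite in the other order, `matmul(M1, M2)`, gives a different transform.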
Sometimes I hear that eigenvalues and singular values are the same thing and sometimes that one is the square of the other. Which is right?
If a real matrix A is symmetric, and its eigenvalues are nonnegative, then its eigenvalues and singular values are the same. If A is not symmetric, the matrix M = AA^{T} is symmetric and has nonnegative real eigenvalues. The singular values of A and A^{T} are the same and are the square roots of the singular/eigenvalues of M. Thus, when the square root statement is made, it is because two different matrices (with a very particular relationship) are being talked about: M = AA^{T}.
The discussion of determinants as volumes is based on A Vector Space Approach to Geometry (Hausner, 1998). Hausner has an excellent discussion of vector analysis and the fundamentals of geometry as well. The geometric derivation of Cramer’s rule in 2D is taken from Practical Linear Algebra: A Geometry Toolbox (Farin & Hansford, 2004). That book also has geometric interpretations of other linear algebra operations such as Gaussian elimination. The discussion of eigenvalues and singular values is based primarily on Linear Algebra and Its Applications (Strang, 1988). The example of SVD of the shear matrix is based on a discussion in Computer Graphics and Geometric Modeling (Salomon, 1999).
1. Write an implicit equation for the 2D line through points (x_{0},y_{0}) and (x_{1},y_{1}) using a 2D determinant.
2. Show that if the columns of a matrix are orthonormal, then so are the rows.
3. Prove the properties of matrix determinants stated in Equations (6.5)–(6.7).
4. Show that the eigenvalues of a diagonal matrix are its diagonal elements.
5. Show that for a square matrix A, AA^{T} is a symmetric matrix.
6. Show that for three 3D vectors a, b, c, the following identity holds: |abc| = (a × b) · c .
7. Explain why the volume of the tetrahedron with side vectors a, b, c (see Figure 6.2) is given by |abc|/6.
8. Demonstrate the four interpretations of matrix–matrix multiplication by taking the following matrix–matrix multiplication code, rearranging the nested loops, and interpreting the resulting code in terms of matrix and vector operations.
function mat-mult(in a[m][p], in b[p][n], out c[m][n]) {
    // the array c is initialized to zero
    for i = 1 to m
        for j = 1 to n
            for k = 1 to p
                c[i][j] += a[i][k] * b[k][j]
}
9. Prove that if A, Q, and D satisfy Equation (6.14), v is the ith column of Q, and λ is the ith entry on the diagonal of D, then v is an eigenvector of A with eigenvalue λ.
10. Prove that if A, Q, and D satisfy Equation (6.14), the eigenvalues of A are all distinct, and v is an eigenvector of A with eigenvalue λ, then for some i, v is parallel to the ith column of Q and λ is the ith entry on the diagonal of D.
11. Given the (x, y) coordinates of the three vertices of a 2D triangle, explain why the area is given by
The machinery of linear algebra can be used to express many of the operations required to arrange objects in a 3D scene, view them with cameras, and get them onto the screen. Geometric transformations such as rotation, translation, scaling, and projection can be accomplished with matrix multiplication, and the transformation matrices used to do this are the subject of this chapter.
We will show how a set of points transform if the points are represented as offset vectors from the origin, and we will use the clock shown in Figure 7.1 as an example of a point set. So think of the clock as a bunch of points that are the ends of vectors whose tails are at the origin. We also discuss how these transforms operate differently on locations (points), displacement vectors, and surface normal vectors.
We can use a 2 × 2 matrix to change, or transform, a 2D vector:
\[ \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} a_{11}x + a_{12}y \\ a_{21}x + a_{22}y \end{bmatrix}. \]
This kind of operation, which takes in a 2-vector and produces another 2-vector by a simple matrix multiplication, is a linear transformation.
By this simple formula, we can achieve a variety of useful transformations, depending on what we put in the entries of the matrix, as will be discussed in the following sections. For our purposes, consider a move along the x-axis to be a horizontal move and a move along the y-axis to be a vertical move.
The most basic transform is a scale along the coordinate axes. This transform can change length and possibly direction:
Note what this matrix does to a vector with Cartesian components (x, y) :
So, just by looking at the matrix of an axis-aligned scale, we can read off the two scale factors.
Example 9 The matrix that shrinks x and y uniformly by a factor of two is (Figure 7.1)
A matrix which halves in the horizontal and increases by three-halves in the vertical is (see Figure 7.2)
A shear is something that pushes things sideways, producing something like a deck of cards across which you push your hand; the bottom card stays put and cards move more the closer they are to the top of the deck. The horizontal and vertical shear matrices are
Example 10 The transform that shears horizontally so that vertical lines become 45° lines leaning toward the right is (see Figure 7.3)
An analogous transform vertically is (see Figure 7.4)
In both cases, the square outline of the sheared clock becomes a parallelogram, and the circular face of the sheared clock becomes an ellipse. In fact, the image of a circle under any matrix transformation is an ellipse.
Another way to think of a shear is in terms of rotation of only the vertical (or horizontal) axis. The shear transform that takes a vertical axis and tilts it clockwise by an angle ϕ is
Similarly, the shear matrix which rotates the horizontal axis counterclockwise by angle ϕ is
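Suppose we want to rotate a vector a by an angle ϕ counterclockwise to get vector b (Figure 7.5). If a makes an angle α with the x-axis, and its length is r = \sqrt{x_{a}^{2} + y_{a}^{2}}, then we know that
\[ x_{a} = r\cos\alpha, \qquad y_{a} = r\sin\alpha. \]
Because b is a rotation of a, it also has length r. Because it is rotated an angle ϕ from a, b makes an angle (α + ϕ) with the x-axis. Using the trigonometric addition identities (Section 2.3.3):
\[ x_{b} = r\cos(\alpha + \phi) = r\cos\alpha\cos\phi - r\sin\alpha\sin\phi, \]
\[ y_{b} = r\sin(\alpha + \phi) = r\sin\alpha\cos\phi + r\cos\alpha\sin\phi. \]
Substituting x_{a} = r cos α and y_{a} = r sin α gives
\[ x_{b} = x_{a}\cos\phi - y_{a}\sin\phi, \qquad y_{b} = y_{a}\cos\phi + x_{a}\sin\phi. \]
In matrix form, the transformation that takes a to b is then
\[ \text{rotate}(\phi) = \begin{bmatrix} \cos\phi & -\sin\phi \\ \sin\phi & \cos\phi \end{bmatrix}. \]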
Example 11 A matrix that rotates vectors by π/4 radians (45°) is (see Figure 7.6)
A matrix that rotates by π/6 radians (30°) in the clockwise direction is a rotation by –π/6 radians in our framework (see Figure 7.7):
Because the norm of each row of a rotation matrix is one (sin^{2} ϕ + cos^{2} ϕ = 1), and the rows are orthogonal (cos ϕ(– sin ϕ) + sin ϕ cos ϕ = 0), we see that rotation matrices are orthogonal matrices (Section 6.2.4). By looking at the matrix, we can read off two pairs of orthonormal vectors: the two columns, which are the vectors to which the transformation sends the canonical basis vectors (1, 0) and (0, 1); and the rows, which are the vectors that the transformation sends to the canonical basis vectors.
Said briefly, Re_{i} = u_{i} and Rv_{i} = e_{i}, for a rotation with columns u_{i} and rows v_{i}.
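The column and row properties of a rotation matrix can be checked numerically. The following is a minimal Python sketch, not code from the text; the helper names rotate and matvec are illustrative assumptions.

```python
import math

def rotate(phi):
    """2x2 matrix for a counterclockwise rotation by phi radians."""
    c, s = math.cos(phi), math.sin(phi)
    return [[c, -s],
            [s,  c]]

def matvec(m, v):
    """Multiply matrix m (list of rows) by column vector v."""
    return [sum(m[i][k] * v[k] for k in range(len(v))) for i in range(len(m))]

R = rotate(math.pi / 6)

# R sends the canonical basis vector e_1 to the first column of R.
image_of_e1 = matvec(R, [1.0, 0.0])
first_column = [R[0][0], R[1][0]]

# R sends its own first row back to the canonical basis vector e_1.
first_row = R[0]
image_of_first_row = matvec(R, first_row)
```

Here image_of_e1 equals first_column, and image_of_first_row equals (1, 0) up to floating-point rounding.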
We can reflect a vector across either of the coordinate axes by using a scale with one negative scale factor (see Figures 7.8 and 7.9):
While one might expect that the matrix with –1 in both elements of the diagonal is also a reflection, in fact it is just a rotation by π radians.
This rotation can also be called a “reflection through the origin.”
It is common for graphics programs to apply more than one transformation to an object. For example, we might want to first apply a scale S and then a rotation R. This would be done in two steps on a 2D vector v_{1}:
\[ \text{first, } v_{2} = S v_{1}, \text{ then, } v_{3} = R v_{2}. \]
Another way to write this is
\[ v_{3} = R(S v_{1}). \]
Because matrix multiplication is associative, we can also write
\[ v_{3} = (RS) v_{1}. \]
In other words, we can represent the effects of transforming a vector by two matrices in sequence using a single matrix of the same size, which we can compute by multiplying the two matrices: M = RS (Figure 7.10).
It is very important to remember that these transforms are applied from the right side first. So the matrix M = RS first applies S and then R.
Example 12 Suppose we want to scale by one-half in the vertical direction and then rotate by π/4 radians (45°). The resulting matrix is
It is important to always remember that matrix multiplication is not commutative, so the order of transforms does matter. In this example, rotating first and then scaling results in a different matrix (see Figure 7.11):
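To see the order dependence concretely, the short Python sketch below composes the scale and rotation of Example 12 in both orders. The helper names (rotate, scale, matmul) are assumptions made for illustration.

```python
import math

def rotate(phi):
    """2x2 counterclockwise rotation by phi radians."""
    c, s = math.cos(phi), math.sin(phi)
    return [[c, -s], [s, c]]

def scale(sx, sy):
    """2x2 axis-aligned scale."""
    return [[sx, 0.0], [0.0, sy]]

def matmul(a, b):
    """Matrix product a * b for matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

R = rotate(math.pi / 4)
S = scale(1.0, 0.5)

# The right-hand matrix is applied first.
scale_then_rotate = matmul(R, S)   # M = RS: scale, then rotate
rotate_then_scale = matmul(S, R)   # M = SR: rotate, then scale
```

The two products share their first column but differ in the remaining entries, so they are different transformations.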
Example 13 Using the scale matrices we have presented, nonuniform scaling can only be done along the coordinate axes. If we wanted to stretch our clock by 50% along one of its diagonals, so that 8:00 through 1:00 move to the northwest and 2:00 through 7:00 move to the southeast, we can use rotation matrices in combination with an axis-aligned scaling matrix to get the result we want. The idea is to use a rotation to align the scaling axis with a coordinate axis, then scale along that axis, and then rotate back. In our example, the scaling axis is the “backslash” diagonal of the square, and we can make it parallel to the x-axis with a rotation by +45°. Putting these operations together, the full transformation is
Remember to read the transformations from right to left.
In mathematical notation, this can be written RSR^{T}. The result of multiplying the three matrices together is
It is no coincidence that this matrix is symmetric: try applying the transpose-of-product rule to the formula RSR^{T}.
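The rotate, scale, rotate-back recipe of Example 13 can be sketched in a few lines of Python. This is an illustrative sketch under the assumption that the stretch is a factor of 1.5 along the x-axis after the +45° alignment; the helper names are not from the text.

```python
import math

def rotate(phi):
    """2x2 counterclockwise rotation by phi radians."""
    c, s = math.cos(phi), math.sin(phi)
    return [[c, -s], [s, c]]

def scale(sx, sy):
    """2x2 axis-aligned scale."""
    return [[sx, 0.0], [0.0, sy]]

def matmul(a, b):
    """Matrix product a * b for matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

# Read right to left: rotate by +45 degrees, stretch x by 1.5, rotate back.
diagonal_stretch = matmul(rotate(-math.pi / 4),
                          matmul(scale(1.5, 1.0), rotate(math.pi / 4)))
```

Up to rounding, diagonal_stretch has 1.25 on the diagonal and -0.25 off the diagonal, which is symmetric as the transpose-of-product rule predicts.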
Building up a transformation from rotation and scaling transformations actually works for any linear transformation, and this fact leads to a powerful way of thinking about these transformations, as explored in the next section.
Sometimes, it’s necessary to “undo” a composition of transformations, taking a transformation apart into simpler pieces. For instance, it’s often useful to present a transformation to the user for manipulation in terms of separate rotations and scale factors, but a transformation might be represented internally simply as a matrix, with the rotations and scales already mixed together. This kind of manipulation can be achieved if the matrix can be computationally disassembled into the desired pieces, the pieces adjusted, and the matrix reassembled by multiplying the pieces together again.
It turns out that this decomposition, or factorization, is possible, regardless of the entries in the matrix—and this fact provides a fruitful way of thinking about transformations and what they do to geometry that is transformed by them.
Let’s start with symmetric matrices. Recall from Section 6.4 that a symmetric matrix can always be taken apart using the eigenvalue decomposition into a product of the form
\[ A = RSR^{T}, \]
where R is an orthogonal matrix and S is a diagonal matrix; we will call the columns of R (the eigenvectors) by the names v_{1} and v_{2}, and we’ll call the diagonal entries of S (the eigenvalues) by the names λ_{1} and λ_{2}.
In geometric terms, we can now recognize R as a rotation and S as a scale, so this is just a multi-step geometric transformation (Figure 7.12):
1. Rotate v_{1} and v_{2} to the x- and y-axes (the transform by R^{T}).
2. Scale in x and y by (λ_{1}, λ_{2}) (the transform by S).
3. Rotate the x- and y-axes back to v_{1} and v_{2} (the transform by R).
If you like to count dimensions: a symmetric 2 × 2 matrix has 3 degrees of freedom, and the eigenvalue decomposition rewrites them as a rotation angle and two scale factors.
Looking at the effect of these three transforms together, we can see that they have the effect of a nonuniform scale along a pair of axes. As with an axis-aligned scale, the axes are perpendicular, but they aren’t the coordinate axes; instead, they are the eigenvectors of A. This tells us something about what it means to be a symmetric matrix: symmetric matrices are just scaling operations—albeit potentially nonuniform and non–axis-aligned ones.
Example 14 Recall the example from Section 6.4:
The matrix above, then, according to its eigenvalue decomposition, scales in a direction 31.7° counterclockwise from three o’clock (the x-axis). This is a touch before 2 p.m. on the clockface as is confirmed by Figure 7.13.
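For a symmetric 2 × 2 matrix, the rotation angle and the two scale factors have a closed form, which the Python sketch below uses. It is a sketch rather than the book's algorithm: the function name symmetric_eig_2x2 is hypothetical, and the test matrix [[2, 1], [1, 1]] is assumed here to be the matrix of Example 14 (its computed angle agrees with the 31.7° quoted above).

```python
import math

def symmetric_eig_2x2(a, b, d):
    """Decompose the symmetric matrix [[a, b], [b, d]] as R S R^T.

    Returns (phi, lam1, lam2), where R is a counterclockwise rotation
    by phi radians and S = diag(lam1, lam2) with lam1 >= lam2.
    """
    phi = 0.5 * math.atan2(2.0 * b, a - d)   # direction of the larger eigenvalue
    mean = 0.5 * (a + d)
    radius = math.hypot(0.5 * (a - d), b)
    return phi, mean + radius, mean - radius

def matmul(x, y):
    """Matrix product x * y for matrices stored as lists of rows."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]

phi, lam1, lam2 = symmetric_eig_2x2(2.0, 1.0, 1.0)
angle_degrees = math.degrees(phi)

# Reassemble R S R^T to confirm that the pieces reproduce the matrix.
c, s = math.cos(phi), math.sin(phi)
R = [[c, -s], [s, c]]
R_T = [[c, s], [-s, c]]
S = [[lam1, 0.0], [0.0, lam2]]
rebuilt = matmul(R, matmul(S, R_T))
```

For this input the scale factors come out near 2.618 and 0.382, and rebuilt matches the input matrix up to rounding.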
We can also reverse the diagonalization process; to scale by (λ_{1},λ_{2}) with the first scaling direction an angle ϕ clockwise from the x-axis, we have
We should take heart that this is a symmetric matrix as we know must be true since we constructed it from a symmetric eigenvalue decomposition.
A very similar kind of decomposition can be done with nonsymmetric matrices as well: it’s the singular value decomposition (SVD), also discussed in Section 6.4.1. The difference is that the matrices on either side of the diagonal matrix are no longer the same:
\[ A = USV^{T}. \]
The two orthogonal matrices that replace the single rotation R are called U and V, and their columns are called u_{i} (the left singular vectors) and v_{i} (the right singular vectors), respectively. In this context, the diagonal entries of S are called singular values rather than eigenvalues. The geometric interpretation is very similar to that of the symmetric eigenvalue decomposition (Figure 7.14):
1. Rotate v_{1} and v_{2} to the x- and y-axes (the transform by V^{T}).
2. Scale in x and y by (σ_{1}, σ_{2}) (the transform by S).
3. Rotate the x- and y-axes to u_{1} and u_{2} (the transform by U).
For dimension counters: a general 2 × 2 matrix has 4 degrees of freedom, and the SVD rewrites them as two rotation angles and two scale factors. One more bit is needed to keep track of reflections, but that doesn’t add a dimension.
The principal difference is between a single rotation and two different orthogonal matrices. This difference causes another, less important, difference. Because the SVD has different singular vectors on the two sides, there is no need for negative singular values: we can always flip the sign of a singular value, reverse the direction of one of the associated singular vectors, and end up with the same transformation again. For this reason, the SVD always produces a diagonal matrix with all nonnegative entries, but the matrices U and V are not guaranteed to be rotations; they could include reflection as well. In geometric applications like graphics, this is an inconvenience, but a minor one: it is easy to differentiate rotations from reflections by checking the determinant, which is +1 for rotations and –1 for reflections, and if rotations are desired, one of the singular values can be negated, resulting in a rotation–scale–rotation sequence where the reflection is rolled in with the scale, rather than with one of the rotations.
Example 15 The example used in Section 6.4.1 is in fact a shear matrix (Figure 7.15):
An immediate consequence of the existence of SVD is that all the 2D transformation matrices we have seen can be made from rotation matrices and scale matrices. Shear matrices are a convenience, but they are not required for expressing transformations.
In summary, every matrix can be decomposed via SVD into a rotation times a scale times another rotation. Only symmetric matrices can be decomposed via eigenvalue diagonalization into a rotation times a scale times the inverse-rotation, and such matrices are a simple scale in an arbitrary direction. The SVD of a symmetric matrix will yield the same triple product as eigenvalue decomposition via a slightly more complex algebraic manipulation.
Another decomposition uses shears to represent nonzero rotations (Paeth, 1990). The following identity allows this:
For example, a rotation by π/4 (45°) is (see Figure 7.16)
This particular transform is useful for raster rotation because shearing is a very efficient raster operation for images; it introduces some jagginess, but will leave no holes. The key observation is that if we take a raster position (i, j) and apply a horizontal shear to it, we get
If we round sj to the nearest integer, this amounts to taking each row in the image and moving it sideways by some amount—a different amount for each row. Because it is the same displacement within a row, this allows us to rotate with no gaps in the resulting image. A similar action works for a vertical shear. Thus, we can implement a simple raster rotation easily.
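The three-shear idea can be checked with a short Python sketch. It assumes the shear factors (cos ϕ − 1)/sin ϕ for the two horizontal shears and sin ϕ for the vertical shear between them; the function names are illustrative, and shear_row_offsets only computes the rounded per-row displacements rather than moving any pixels.

```python
import math

def matmul(a, b):
    """Matrix product a * b for matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def shear_x(s):
    return [[1.0, s], [0.0, 1.0]]

def shear_y(s):
    return [[1.0, 0.0], [s, 1.0]]

def rotation_from_shears(phi):
    """Product of three shears equal to a rotation by phi (phi must be nonzero)."""
    a = (math.cos(phi) - 1.0) / math.sin(phi)
    return matmul(shear_x(a), matmul(shear_y(math.sin(phi)), shear_x(a)))

def shear_row_offsets(s, height):
    """Integer sideways shift of each image row j under a horizontal shear by s."""
    return [round(s * j) for j in range(height)]

phi = math.pi / 4
composed = rotation_from_shears(phi)
offsets = shear_row_offsets((math.cos(phi) - 1.0) / math.sin(phi), 4)
```

Because every pixel in a row moves by the same integer offset, each shear pass is a permutation of pixels within rows (or columns), which is why no holes appear.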
The linear 3D transforms are an extension of the 2D transforms. For example, a scale along Cartesian axes is
Rotation is considerably more complicated in 3D than in 2D, because there are more possible axes of rotation. However, if we simply want to rotate about the z-axis, which will only change x- and y-coordinates, we can use the 2D rotation matrix with no operation on z:
Similarly we can construct matrices to rotate about the x-axis and the y-axis:
To understand why the minus sign is in the lower left for the y-axis rotation, think of the three axes in a circular sequence: y after x; z after y; x after z.
We will discuss rotations about arbitrary axes in the next section.
As in two dimensions, we can shear along a particular axis, for example,
As with 2D transforms, any 3D transformation matrix can be decomposed using SVD into a rotation, scale, and another rotation. Any symmetric 3D matrix has an eigenvalue decomposition into rotation, scale, and inverse-rotation. Finally, a 3D rotation can be decomposed into a product of 3D shear matrices.
As in 2D, 3D rotations are orthogonal matrices. Geometrically, this means that the three rows of the matrix are the Cartesian coordinates of three mutually orthogonal unit vectors as discussed in Section 2.4.5. The columns are three, potentially different, mutually orthogonal unit vectors. There are an infinite number of such rotation matrices. Let’s write down such a matrix:
Here, u = x_{u}x + y_{u}y + z_{u}z and so on for v and w. Since the three vectors are orthonormal, we know that
We can infer some of the behavior of the rotation matrix by applying it to the vectors u, v and w. For example,
Note that those three rows of R_{uvw}u are all dot products:
Similarly, R_{uvw}v = y, and R_{uvw}w = z. So R_{uvw} takes the basis uvw to the corresponding Cartesian axes via rotation.
If R_{uvw} is a rotation matrix with orthonormal rows, then R^{T}_{uvw} is also a rotation matrix with orthonormal columns and in fact is the inverse of R_{uvw} (the inverse of an orthogonal matrix is always its transpose). An important point is that for transformation matrices, the algebraic inverse is also the geometric inverse. So if R_{uvw} takes u to x, then R^{T}_{uvw} takes x to u. The same should be true of v and y as we can confirm:
So we can always create rotation matrices from orthonormal bases.
If we wish to rotate about an arbitrary vector a, we can form an orthonormal basis with w = a, rotate that basis to the canonical basis xyz, rotate about the z-axis, and then rotate the canonical basis back to the uvw basis. In matrix form, to rotate about the w-axis by an angle ϕ:
Here, we have w a unit vector in the direction of a (i.e., a divided by its own length). But what are u and v? A method to find reasonable u and v is given in Section 2.4.6.
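The rotate-to-canonical, rotate-about-z, rotate-back construction can be sketched in Python as follows. This is a sketch under assumptions: the basis construction (copy w, set its smallest-magnitude component to 1 to obtain a non-collinear helper vector t, then take cross products) is one common choice, and all function names are hypothetical.

```python
import math

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def normalize(a):
    n = math.sqrt(sum(x * x for x in a))
    return [x / n for x in a]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matvec(m, v):
    return [sum(m[i][k] * v[k] for k in range(len(v))) for i in range(len(m))]

def basis_from_axis(a):
    """Right-handed orthonormal basis (u, v, w) with w along a."""
    w = normalize(a)
    t = list(w)
    t[min(range(3), key=lambda k: abs(w[k]))] = 1.0   # t is not collinear with w
    u = normalize(cross(t, w))
    v = cross(w, u)
    return u, v, w

def rotate_about_axis(a, phi):
    """3x3 rotation by phi radians about the axis a."""
    u, v, w = basis_from_axis(a)
    r_uvw = [u, v, w]                               # rows u, v, w: takes uvw to xyz
    r_uvw_t = [list(col) for col in zip(*r_uvw)]    # transpose: takes xyz back to uvw
    c, s = math.cos(phi), math.sin(phi)
    r_z = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    return matmul(r_uvw_t, matmul(r_z, r_uvw))

about_z = matvec(rotate_about_axis([0.0, 0.0, 2.0], math.pi / 2), [1.0, 0.0, 0.0])
about_diag = matvec(rotate_about_axis([1.0, 1.0, 1.0], 2 * math.pi / 3), [1.0, 0.0, 0.0])
```

Both examples send the x-axis to the y-axis: a quarter turn about z, and a one-third turn about the (1, 1, 1) diagonal, which cycles the three coordinate axes.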
If we have a rotation matrix and we wish to have the rotation in axis-angle form, we can compute the one real eigenvalue (which will be λ = 1), and the corresponding eigenvector is the axis of rotation. This is the one axis that is not changed by the rotation.
See Section 16.2.2 for a comparison of the few most-used ways to represent rotations, besides rotation matrices.
While most 3D vectors we use represent positions (offset vectors from the origin) or directions, such as where light comes from, some vectors represent surface normals. Surface normal vectors are perpendicular to the tangent plane of a surface. These normals do not transform the way we would like when the underlying surface is transformed. For example, if the points of a surface are transformed by a matrix M, a vector t that is tangent to the surface and is multiplied by M will be tangent to the transformed surface. However, a surface normal vector n that is transformed by M may not be normal to the transformed surface (Figure 7.17).
We can derive a transform matrix N which does take n to a vector perpendicular to the transformed surface. One way to attack this issue is to note that a surface normal vector and a tangent vector are perpendicular, so their dot product is zero, which is expressed in matrix form as
\[ n^{T} t = 0. \]
If we denote the desired transformed vectors as t_{M} = Mt and n_{N} = Nn, our goal is to find N such that n^{T}_{N}t_{M} = 0. We can find N by some algebraic tricks.
First, we can sneak an identity matrix into the dot product and then take advantage of M^{–1}M = I:
\[ n^{T} t = n^{T} I t = n^{T} M^{-1} M t = 0. \]
Although the manipulations above don’t obviously get us anywhere, note that we can add parentheses that make the above expression more obviously a dot product:
\[ (n^{T} M^{-1})(M t) = (n^{T} M^{-1})\, t_{M} = 0. \]
This means that the row vector that is perpendicular to t_{M} is the left part of the expression above. This expression holds for any of the tangent vectors in the tangent plane. Since there is only one direction in 3D (and its opposite) that is perpendicular to all such tangent vectors, we know that the left part of the expression above must be the row vector expression for n_{N}; i.e., it is n^{T}_{N}, so this allows us to infer N:
\[ n_{N}^{T} = n^{T} M^{-1}, \]
so we can take the transpose of that to get
\[ n_{N} = (n^{T} M^{-1})^{T} = (M^{-1})^{T} n. \]
Therefore, we can see that the matrix that correctly transforms normal vectors so they remain normal is N = (M^{–1})^{T}, i.e., the transpose of the inverse matrix. Since this matrix may change the length of n, we can multiply it by an arbitrary scalar and it will still produce n_{N} with the right direction. Recall from Section 6.3 that the inverse of a matrix is the transpose of the cofactor matrix divided by the determinant. Because we don’t care about the length of a normal vector, we can skip the division and find that for a 3 × 3 matrix,
This assumes the element of M in row i and column j is m_{ij}. So the full expression for N is
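A small Python sketch can confirm that the cofactor matrix keeps normals perpendicular where M itself does not. The matrix, tangent, and normal below are made-up test values, and the function names are illustrative assumptions.

```python
def cofactor_matrix(m):
    """Cofactor matrix of a 3x3 matrix m, equal to det(m) times the inverse transpose."""
    def minor(i, j):
        r = [k for k in range(3) if k != i]
        c = [k for k in range(3) if k != j]
        return m[r[0]][c[0]] * m[r[1]][c[1]] - m[r[0]][c[1]] * m[r[1]][c[0]]
    return [[(-1) ** (i + j) * minor(i, j) for j in range(3)] for i in range(3)]

def matvec(m, v):
    return [sum(m[i][k] * v[k] for k in range(len(v))) for i in range(len(m))]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# A nonuniform scale combined with a shear (hypothetical example values).
M = [[2, 1, 0],
     [0, 1, 0],
     [0, 0, 1]]
N = cofactor_matrix(M)

t = [1, -1, 0]     # a tangent vector
n = [1, 1, 0]      # a normal vector, perpendicular to t

t_M = matvec(M, t)
wrong_normal = matvec(M, n)    # transforming the normal like a tangent
right_normal = matvec(N, n)    # transforming it with the cofactor matrix

wrong_dot = dot(t_M, wrong_normal)
right_dot = dot(t_M, right_normal)
```

Here right_dot is zero while wrong_dot is not, so only the vector transformed by N remains perpendicular to the transformed tangent.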
We have been looking at methods to change vectors using a matrix M. In two dimensions, these transforms have the form
We cannot use such transforms to move objects, only to scale and rotate them. In particular, the origin (0, 0) always remains fixed under a linear transformation. To move, or translate, an object by shifting all its points the same amount, we need a transform of the form
There is just no way to do that by multiplying (x, y) by a 2 × 2 matrix. One possibility for adding translation to our system of linear transformations is to simply associate a separate translation vector with each transformation matrix, letting the matrix take care of scaling and rotation and the vector take care of translation. This is perfectly feasible, but the bookkeeping is awkward and the rule for composing two transformations is not as simple and clean as with linear transformations.
Instead, we can use a clever trick to get a single matrix multiplication to do both operations together. The idea is simple: represent the point (x, y) by a 3D vector [x y 1]^{T}, and use 3 × 3 matrices of the form
The fixed third row serves to copy the 1 into the transformed vector, so that all vectors have a 1 in the last place, and the first two rows compute x′ and y′ as linear combinations of x, y, and 1:
The single matrix implements a linear transformation followed by a translation! This kind of transformation is called an affine transformation, and this way of implementing affine transformations by adding an extra dimension is called homogeneous coordinates (Roberts, 1965; Riesenfeld, 1981; Penna & Patterson, 1986). Homogeneous coordinates not only clean up the code for transformations, but this scheme also makes it obvious how to compose two affine transformations: simply multiply the matrices.
A problem with this new formalism arises when we need to transform vectors that are not supposed to be positions—they represent directions or offsets between positions. Vectors that represent directions or offsets should not change when we translate an object. Fortunately, we can arrange for this by setting the third coordinate to zero:
If there is a scaling/rotation transformation in the upper-left 2 × 2 entries of the matrix, it will apply to the vector, but the translation still multiplies with the zero and is ignored. Furthermore, the zero is copied into the transformed vector, so direction vectors remain direction vectors after they are transformed.
This gives an explanation for the name “homogeneous”: translation, rotation, and scaling of positions and directions all fit into a single system.
This is exactly the behavior we want for vectors, so they fit smoothly into the system: the extra (third) coordinate will be either 1 or 0 depending on whether we are encoding a position or a direction. We actually do need to store the homogeneous coordinate so we can distinguish between locations and other vectors. For example,
Later, when we do perspective viewing, we will see that it is useful to allow the homogeneous coordinate to take on values other than one or zero.
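The different treatment of positions and directions is easy to see in code. The following Python sketch applies one 3 × 3 homogeneous translation to a point (third coordinate 1) and to a direction (third coordinate 0); the helper names and numeric values are illustrative.

```python
def translate(tx, ty):
    """3x3 homogeneous matrix for a 2D translation by (tx, ty)."""
    return [[1, 0, tx],
            [0, 1, ty],
            [0, 0, 1]]

def matvec(m, v):
    return [sum(m[i][k] * v[k] for k in range(len(v))) for i in range(len(m))]

T = translate(5, -1)

point = [2, 3, 1]        # a location: homogeneous coordinate 1
direction = [2, 3, 0]    # a direction or offset: homogeneous coordinate 0

moved_point = matvec(T, point)          # the translation is added
moved_direction = matvec(T, direction)  # the translation is multiplied by 0
```

The point moves to (7, 2) and keeps its 1, while the direction is unchanged and keeps its 0.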
Homogeneous coordinates are used nearly universally to represent transformations in graphics systems. In particular, homogeneous coordinates underlie the design and operation of renderers implemented in graphics hardware. We will see in Chapter 8 that homogeneous coordinates also make it easy to draw scenes in perspective, another reason for their popularity.
Homogeneous coordinates are also ubiquitous in computer vision.
Homogeneous coordinates can be considered just a clever way to handle the bookkeeping for translation, but there is also a different, geometric interpretation. The key observation is that when we do a 3D shear based on the z-coordinate, we get this transform:
Note that this almost has the form we want in x and y for a 2D translation, but has a z hanging around that doesn’t have a meaning in 2D. Now comes the key decision: we will add a coordinate z = 1 to all 2D locations. This gives us
By associating a (z = 1)-coordinate with all 2D points, we now can encode translations into matrix form. For example, to first translate in 2D by (x_{t} ,y_{t}) and then rotate by angle ϕ we would use the matrix
Note that the 2D rotation matrix is now 3 × 3 with zeros in the “translation slots.” With this type of formalism, which uses shears along z = 1 to encode translations, we can represent any number of 2D shears, 2D rotations, and 2D translations as one composite 3D matrix. The bottom row of that matrix will always be (0, 0, 1) , so we don’t really have to store it. We just need to remember it is there when we multiply two matrices together.
In 3D, the same technique works: we can add a fourth coordinate, a homogeneous coordinate, and then, we have translations:
Again, for a direction vector, the fourth coordinate is zero and the vector is thus unaffected by translations.
Example 16 (Windowing transformations) Often in graphics, we need to create a transform matrix that takes points in the rectangle [x_{l}, x_{h}] × [y_{l}, y_{h}] to the rectangle [x′_{l}, x′_{h}] × [y′_{l}, y′_{h}]. This can be accomplished with a single scale and translate in sequence. However, it is more intuitive to create the transform from a sequence of three operations (Figure 7.18):
1. Move the point (x_{l}, y_{l}) to the origin.
2. Scale the rectangle to be the same size as the target rectangle.
3. Move the origin to point (x′_{l}, y′_{l}).
Remembering that the right-hand matrix is applied first, we can write
It is perhaps not surprising to some readers that the resulting matrix has the form it does, but the constructive process with the three matrices leaves no doubt as to the correctness of the result.
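The three-step construction translates directly into code. The Python sketch below builds the windowing matrix as a product of a translation, a scale, and a translation; the function name window_2d and its argument layout are assumptions for illustration.

```python
def translate(tx, ty):
    return [[1.0, 0.0, tx], [0.0, 1.0, ty], [0.0, 0.0, 1.0]]

def scale(sx, sy):
    return [[sx, 0.0, 0.0], [0.0, sy, 0.0], [0.0, 0.0, 1.0]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matvec(m, v):
    return [sum(m[i][k] * v[k] for k in range(len(v))) for i in range(len(m))]

def window_2d(src, dst):
    """Map rectangle src = (xl, yl, xh, yh) onto rectangle dst (same layout)."""
    xl, yl, xh, yh = src
    xl2, yl2, xh2, yh2 = dst
    to_origin = translate(-xl, -yl)                               # step 1
    resize = scale((xh2 - xl2) / (xh - xl), (yh2 - yl2) / (yh - yl))  # step 2
    to_target = translate(xl2, yl2)                               # step 3
    # The right-hand matrix is applied first.
    return matmul(to_target, matmul(resize, to_origin))

W = window_2d((1.0, 2.0, 3.0, 4.0), (0.0, 10.0, 8.0, 14.0))
low_corner = matvec(W, [1.0, 2.0, 1.0])
high_corner = matvec(W, [3.0, 4.0, 1.0])
```

The low and high corners of the source rectangle land on the low and high corners of the target rectangle, which is the defining property of the windowing transformation.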
An exactly analogous construction can be used to define a 3D windowing transformation, which maps the box [x_{l}, x_{h}] × [y_{l}, y_{h}] × [z_{l}, z_{h}] to the box [x′_{l}, x′_{h}] × [y′_{l}, y′_{h}] × [z′_{l}, z′_{h}].
It is interesting to note that if we multiply an arbitrary matrix composed of scales, shears, and rotations with a simple translation (translation comes second), we get
Thus, we can look at any matrix and think of it as a scaling/rotation part and a translation part because the components are nicely separated from each other.
An important class of transforms are rigid-body transforms. These are composed only of translations and rotations, so they have no stretching or shrinking of the objects. Such transforms will have a pure rotation for the a_{ij} above.
While a nonsingular matrix can always be inverted algebraically, we can use geometry if we know what the transform does. For example, the inverse of scale(s_{x}, s_{y}, s_{z}) is scale(1/s_{x}, 1/s_{y}, 1/s_{z}). The inverse of a rotation is the same rotation with the opposite sign on the angle. The inverse of a translation is a translation in the opposite direction. If we have a series of matrices M = M_{1}M_{2}... M_{n}, then M^{–1} = M^{–1}_{n} ... M^{–1}_{2}M^{–1}_{1}.
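As a check on these rules, the Python sketch below builds a 2D homogeneous transform from a scale, a rotation, and a translation, then builds its inverse from the three geometric inverses in reverse order. The helper names and parameter values are illustrative assumptions.

```python
import math

def translate(tx, ty):
    return [[1.0, 0.0, tx], [0.0, 1.0, ty], [0.0, 0.0, 1.0]]

def rotate(phi):
    c, s = math.cos(phi), math.sin(phi)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def scale(sx, sy):
    return [[sx, 0.0, 0.0], [0.0, sy, 0.0], [0.0, 0.0, 1.0]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

phi, sx, sy, tx, ty = 0.3, 2.0, 0.5, 4.0, -1.0

# M = T R S: scale first, then rotate, then translate.
M = matmul(translate(tx, ty), matmul(rotate(phi), scale(sx, sy)))

# Inverse: undo each step geometrically, in the opposite order.
M_inv = matmul(scale(1.0 / sx, 1.0 / sy),
               matmul(rotate(-phi), translate(-tx, -ty)))

product = matmul(M, M_inv)
```

Up to rounding, product is the 3 × 3 identity matrix, and no general-purpose matrix inversion routine was needed.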
Also, certain types of transformation matrices are easy to invert. We’ve already mentioned scales, which are diagonal matrices; the second important example is rotations, which are orthogonal matrices. Recall (Section 6.2.4) that the inverse of an orthogonal matrix is its transpose. This makes it easy to invert rotations and rigid body transformations (see Exercise 6). Also, it’s useful to know that a matrix with [0 0 0 1] in the bottom row has an inverse that also has [0 0 0 1] in the bottom row (see Exercise 7).
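Interestingly, we can use SVD to invert a matrix as well. Since we know that any matrix can be decomposed into a rotation times a scale times a rotation, inversion is straightforward. For example, in 3D we have
\[ M = R_{1}\,\text{scale}(\sigma_{1}, \sigma_{2}, \sigma_{3})\,R_{2}, \]
and from the rules above, it follows easily that
\[ M^{-1} = R_{2}^{T}\,\text{scale}(1/\sigma_{1}, 1/\sigma_{2}, 1/\sigma_{3})\,R_{1}^{T}. \]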
All of the previous discussion has been in terms of using transformation matrices to move points around. We can also think of them as simply changing the coordinate system in which the point is represented. For example, in Figure 7.19, we see two ways to visualize a movement. In different contexts, either interpretation may be more suitable.
For example, a driving game may have a model of a city and a model of a car. If the player is presented with a view out the windshield, objects inside the car are always drawn in the same place on the screen, while the streets and buildings appear to move backward as the player drives. On each frame, we apply a transformation to these objects that moves them farther back than on the previous frame. One way to think of this operation is simply that it moves the buildings backward; another way to think of it is that the buildings are staying put but the coordinate system in which we want to draw them—which is attached to the car—is moving. In the second interpretation, the transformation is changing the coordinates of the city geometry, expressing them as coordinates in the car’s coordinate system. Both ways will lead to exactly the same matrix that is applied to the geometry outside the car.
If the game also supports an overhead view to show where the car is in the city, the buildings and streets need to be drawn in fixed positions while the car needs to move from frame to frame. The same two interpretations apply: we can think of the changing transformation as moving the car from its canonical position to its current location in the world; or we can think of the transformation as simply changing the coordinates of the car’s geometry, which is originally expressed in terms of a coordinate system attached to the car, to express them instead in a coordinate system fixed relative to the city. The change-of-coordinates interpretation makes it clear that the matrices used in these two modes (city-to-car coordinate change vs. car-to-city coordinate change) are inverses of one another.
The idea of changing coordinate systems is much like the idea of type conversions in programming. Before we can add a floating-point number to an integer, we need to convert the integer to floating point or the floating-point number to an integer, depending on our needs, so that the types match. And before we can draw the city and the car together, we need to convert the city to car coordinates or the car to city coordinates, depending on our needs, so that the coordinates match.
When managing multiple coordinate systems, it’s easy to get confused and wind up with objects in the wrong coordinates, causing them to show up in unexpected places. But with systematic thinking about transformations between coordinate systems, you can reliably get the transformations right.
Geometrically, a coordinate system, or coordinate frame, consists of an origin and a basis—a set of three vectors. Orthonormal bases are so convenient that we’ll normally assume frames are orthonormal unless otherwise specified. In a frame with origin p and basis {u, v, w}, the coordinates (u, v, w) describe the point
In 2D, of course, there are two basis vectors.
When we store these vectors in the computer, they need to be represented in terms of some coordinate system. To get things started, we have to designate some canonical coordinate system, often called “global” or “world” coordinates, which is used to describe all other systems. In the city example, we might adopt the street grid and use the convention that the x-axis points along Main Street, the y-axis points up, and the z-axis points along Central Avenue. Then, when we write the origin and basis of the car frame in terms of these coordinates, it is clear what we mean.
In 2D, our convention is to use the point o for the origin, and x and y for the right-handed orthonormal basis vectors (Figure 7.20). In 2D, right-handed means y is counterclockwise from x.
Another coordinate system might have an origin e and right-handed orthonormal basis vectors u and v. Note that typically the canonical data o, x, and y are never stored explicitly. They are the frame-of-reference for all other coordinate systems. In that coordinate system, we often write down the location of p as an ordered pair, which is shorthand for a full vector expression:
For example, in Figure 7.20, (x_{p}, y_{p}) = (2.5, 0.9) . Note that the pair (x_{p}, y_{p}) implicitly assumes the origin o. Similarly, we can express p in terms of another equation:
In Figure 7.20, this has (u_{p}, v_{p}) = (0.5, –0.7). Again, the origin e is left as an implicit part of the coordinate system associated with u and v.
We can express this same relationship using matrix machinery, like this:
Note that this assumes we have the point e and vectors u and v stored in canonical coordinates; the (x, y)-coordinate system is the first among equals. In terms of the basic types of transformations we’ve discussed in this chapter, this is a rotation (involving u and v) followed by a translation (involving e). Looking at the matrix for the rotation and translation together, you can see it’s very easy to write down: we just put u, v, and e into the columns of a matrix, with the usual [0 0 1] in the third row. To make this even clearer, we can write the matrix like this:
We call this matrix the frame-to-canonical matrix for the (u, v) frame. It takes points expressed in the (u, v) frame and converts them to the same points expressed in the canonical frame.
The name “frame-to-canonical” is based on thinking about changing the coordinates of a vector from one system to another. Thinking in terms of moving vectors around, the frame-to-canonical matrix maps the canonical frame to the (u, v) frame.
To go in the other direction, we have
This is a translation followed by a rotation; they are the inverses of the rotation and translation we used to build the frame-to-canonical matrix, and when multiplied together, they produce the inverse of the frame-to-canonical matrix, which is (not surprisingly) called the canonical-to-frame matrix:
The canonical-to-frame matrix takes points expressed in the canonical frame and converts them to the same points expressed in the (u, v) frame. We have written this matrix as the inverse of the frame-to-canonical matrix because it can’t immediately be written down using the canonical coordinates of e, u, and v. But remember that all coordinate systems are equivalent; it’s only our convention of storing vectors in terms of x- and y-coordinates that creates this seeming asymmetry. The canonical-to-frame matrix can be expressed simply in terms of the (u, v) coordinates of o, x, and y:
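The two change-of-coordinates matrices can be sketched in Python as follows. The frame used here (origin e and orthonormal basis u, v) is made up for illustration and is not the frame of Figure 7.20; the function names are likewise assumptions.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matvec(m, v):
    return [sum(m[i][k] * v[k] for k in range(len(v))) for i in range(len(m))]

def frame_to_canonical(e, u, v):
    """Columns are u, v, and e, with [0 0 1] in the third row."""
    return [[u[0], v[0], e[0]],
            [u[1], v[1], e[1]],
            [0.0,  0.0,  1.0]]

def canonical_to_frame(e, u, v):
    """Translate by -e, then rotate by the transpose; requires orthonormal u, v."""
    rotation = [[u[0], u[1], 0.0],
                [v[0], v[1], 0.0],
                [0.0,  0.0,  1.0]]
    translation = [[1.0, 0.0, -e[0]],
                   [0.0, 1.0, -e[1]],
                   [0.0, 0.0, 1.0]]
    return matmul(rotation, translation)

# Hypothetical frame: origin e and a right-handed orthonormal basis u, v.
e, u, v = [2.0, 1.0], [0.6, 0.8], [-0.8, 0.6]

p_frame = [0.5, -0.7, 1.0]    # a point given in (u, v) coordinates
p_canonical = matvec(frame_to_canonical(e, u, v), p_frame)
p_round_trip = matvec(canonical_to_frame(e, u, v), p_canonical)
```

Converting to canonical coordinates and back recovers the original (u, v) coordinates, which reflects the fact that the two matrices are inverses of one another.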
All these ideas work strictly analogously in 3D, where we have
and
Can’t I just hardcode transforms rather than use the matrix formalisms?
Yes, but in practice it is harder to derive, harder to debug, and not any more efficient. Also, all current graphics APIs use this matrix formalism so it must be understood even to use graphics libraries.
The bottom row of the matrix is always (0,0,0,1). Do I have to store it?
You do not have to store it unless you include perspective transforms (Chapter 8).
The derivation of the transformation properties of normals is based on Properties of Surface Normal Transformations (Turkowski, 1990). In many treatments through the mid-1990s, vectors were represented as row vectors and premultiplied, e.g., b = aM. In our notation, this would be b^{T} = a^{T}M^{T}. If you want to find a rotation matrix R that takes one vector a to a vector b of the same length, b = Ra, you could use two rotations constructed from orthonormal bases. A more efficient method is given in Efficiently Building a Matrix to Rotate One Vector to Another (Akenine-Möller, Haines, & Hoffman, 2008).
1. Write down the 4 × 4 3D matrix to move by (x_{m}, y_{m}, z_{m}).
2. Write down the 4 × 4 3D matrix to rotate by an angle θ about the y-axis.
3. Write down the 4 × 4 3D matrix to scale an object by 50% in all directions.
4. Write the 2D rotation matrix that rotates by 90° clockwise.
5. Write the matrix from Exercise 4 as a product of three shear matrices.
6. Find the inverse of the rigid body transformation:
where R is a 3 × 3 rotation matrix and t is a 3-vector.
7. Show that the inverse of the matrix for an affine transformation (one that has all zeros in the bottom row except for a one in the lower right entry) also has the same form.
8. Describe in words what this 2D transform matrix does:
9. Write down the 3 × 3 matrix that rotates a 2D point by angle θ about a point p = (x_{p}, y_{p}).
10. Write down the 4 × 4 rotation matrix that takes the orthonormal 3D vectors u = (x_{u}, y_{u}, z_{u}), v = (x_{v}, y_{v}, z_{v}), and w = (x_{w}, y_{w}, z_{w}) to the orthonormal 3D vectors a = (x_{a}, y_{a}, z_{a}), b = (x_{b}, y_{b}, z_{b}), and c = (x_{c}, y_{c}, z_{c}), so that Mu = a, Mv = b, and Mw = c.
11. What is the inverse matrix for the answer to the previous problem?
In the previous chapter, we saw how to use matrix transformations as a tool for arranging geometric objects in 2D or 3D space. A second important use of geometric transformations is in moving objects between their 3D locations and their positions in a 2D view of the 3D world. This 3D to 2D mapping is called a viewing transformation, and it plays an important role in object-order rendering, in which we need to rapidly find the image-space location of each object in the scene.
When we studied ray tracing in Chapter 4, we covered the different types of perspective and orthographic views and how to generate viewing rays according to any given view. This chapter is about the inverse of that process. Here, we explain how to use matrix transformations to express any parallel or perspective view. The transformations in this chapter project 3D points in the scene (world space) to 2D points in the image (image space), and they will project any point on a given pixel’s viewing ray back to that pixel’s position in image space.
If you have not looked at it recently, it is advisable to review the discussion of perspective and ray generation in Chapter 4 before reading this chapter.
By itself, the ability to project points from the world to the image is only good for producing wireframe renderings—renderings in which only the edges of objects are drawn, and closer surfaces do not occlude more distant surfaces (Figure 8.1). Just as a ray tracer needs to find the closest surface intersection along each viewing ray, an object-order renderer displaying solid-looking objects has to work out which of the (possibly many) surfaces drawn at any given point on the screen is closest and display only that one. In this chapter, we assume we are drawing a model consisting only of 3D line segments that are specified by the (x, y, z) coordinates of their two endpoints. Later chapters will discuss the machinery needed to produce renderings of solid surfaces.
The viewing transformation has the job of mapping 3D locations, represented as (x, y, z) coordinates in the canonical coordinate system, to coordinates in the image, expressed in units of pixels. It is a complicated beast that depends on many different things, including the camera position and orientation, the type of projection, the field of view, and the resolution of the image. As with all complicated transformations, it is best approached by breaking it up into a product of several simpler transformations. Most graphics systems do this by using a sequence of three transformations:
A camera transformation or eye transformation, which is a rigid body transformation that places the camera at the origin in a convenient orientation. It depends only on the position and orientation, or pose, of the camera.
A projection transformation, which projects points from camera space so that all visible points fall in the range –1 to 1 in x and y. It depends only on the type of projection desired.
A viewport transformation or windowing transformation, which maps this unit image rectangle to the desired rectangle in pixel coordinates. It depends only on the size and position of the output image.
Some APIs use “viewing transformation” for just the piece of our viewing transformation that we call the camera transformation.
To make it easy to describe the stages of the process (Figure 8.2), we give names to the coordinate systems that are the inputs and output of these transformations.
The camera transformation converts points in canonical coordinates (or world space) to camera coordinates or places them in camera space. The projection transformation moves points from camera space to the canonical view volume. Finally, the viewport transformation maps the canonical view volume to screen space.
Each of these transformations is individually quite simple. We’ll discuss them in detail for the orthographic case beginning with the viewport transformation and then cover the changes required to support perspective projection.
Other names: camera space is also “eye space,” and the camera transformation is sometimes the “viewing transformation;” the canonical view volume is also “clip space” or “normalized device coordinates;” screen space is also “pixel coordinates.”
We begin with a problem whose solution will be reused for any viewing condition. We assume that the geometry we want to view is in the canonical view volume, and we wish to view it with an orthographic camera looking in the –z direction. The canonical view volume is the cube containing all 3D points whose Cartesian coordinates are between –1 and +1, that is, (x, y, z) ∈ [–1, 1]^{3} (Figure 8.3). We project x = –1 to the left side of the screen, x = +1 to the right side of the screen, y = –1 to the bottom of the screen, and y = +1 to the top of the screen.
The word “canonical” crops up again—it means something arbitrarily chosen for convenience. For instance, the unit circle could be called the “canonical circle.”
Recall the conventions for pixel coordinates from Chapter 3: each pixel “owns” a unit square centered at integer coordinates; the image boundaries have a half-unit overshoot from the pixel centers; and the smallest pixel center coordinates are (0, 0). If we are drawing into an image (or window on the screen) that has n_{x} by n_{y} pixels, we need to map the square [–1, 1]^{2} to the rectangle [–0.5, n_{x} – 0.5] × [–0.5, n_{y} – 0.5].
Mapping a square to a potentially non-square rectangle is not a problem; x and y just end up with different scale factors going from canonical to pixel coordinates.
For now, we will assume that all line segments to be drawn are completely inside the canonical view volume. Later, we will relax that assumption when we discuss clipping.
Since the viewport transformation maps one axis-aligned rectangle to another, it is a case of the windowing transform given by Equation (7.6):
Note that this matrix ignores the z-coordinate of the points in the canonical view volume, because a point’s distance along the projection direction doesn’t affect where that point projects in the image. But before we officially call this the viewport matrix, we add a row and column to carry along the z-coordinate without changing it. We don’t need it in this chapter, but eventually, we will need the z values because they can be used to make closer surfaces hide more distant surfaces (see Section 9.2.3).
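As a rough illustration, the viewport matrix can be built directly from the requirement that [–1, 1]^{2} maps to [–0.5, n_{x} – 0.5] × [–0.5, n_{y} – 0.5] while z passes through unchanged. The sketch below assumes plain nested lists for matrices; `viewport_matrix` and `transform` are names chosen here, not part of any library.

```python
def viewport_matrix(nx, ny):
    """4x4 matrix mapping x, y in [-1, 1] to pixel coordinates; z is carried along."""
    return [[nx / 2, 0.0,    0.0, (nx - 1) / 2],
            [0.0,    ny / 2, 0.0, (ny - 1) / 2],
            [0.0,    0.0,    1.0, 0.0],
            [0.0,    0.0,    0.0, 1.0]]

def transform(m, p):
    """Multiply a 4x4 matrix by the homogeneous point [x, y, z, 1]."""
    v = (p[0], p[1], p[2], 1.0)
    return [sum(m[i][j] * v[j] for j in range(4)) for i in range(4)]

m_vp = viewport_matrix(640, 480)
lower_left = transform(m_vp, (-1.0, -1.0, 0.3))   # corner of the canonical square
upper_right = transform(m_vp, (1.0, 1.0, 0.3))    # opposite corner
```

For a 640 × 480 image, the two corners of the canonical square land half a pixel outside the first and last pixel centers, as the pixel-coordinate convention requires.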
Of course, we usually want to render geometry in some region of space other than the canonical view volume. Our first step in generalizing the view will keep the view direction and orientation fixed looking along – z with +y up, but will allow arbitrary rectangles to be viewed. Rather than replacing the viewport matrix, we’ll augment it by multiplying it with another matrix on the right.
Under these constraints, the view volume is an axis-aligned box, and we’ll name the coordinates of its sides so that the view volume is [l, r] × [b, t] × [f, n] shown in Figure 8.4. We call this box the orthographic view volume and refer to the bounding planes as follows:
That vocabulary assumes a viewer who is looking along the minus z-axis with his head pointing in the y-direction.^{1} This implies that n > f, which may be unintuitive, but if you assume the entire orthographic view volume has negative z values, then the z = n “near” plane is closer to the viewer if and only if n > f ; here, f is a smaller number than n, i.e., a negative number of larger absolute value than n.
This concept is shown in Figure 8.5. The transform from orthographic view volume to the canonical view volume is another windowing transform, so we can simply substitute the bounds of the orthographic and canonical view volumes into Equation (7.7) to obtain the matrix for this transformation:
This matrix is very close to the one used traditionally in OpenGL, except that n, f, and z_{canonical} all have the opposite sign.
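The orthographic matrix can be sketched the same way, assuming the usual windowing form that scales and translates each axis independently so that [l, r] × [b, t] × [f, n] lands on [–1, 1]^{3}. The function names and the example bounds are illustrative only.

```python
def orthographic_matrix(l, r, b, t, n, f):
    """Windowing transform from the box [l, r] x [b, t] x [f, n] to [-1, 1]^3."""
    return [[2 / (r - l), 0.0,         0.0,         -(r + l) / (r - l)],
            [0.0,         2 / (t - b), 0.0,         -(t + b) / (t - b)],
            [0.0,         0.0,         2 / (n - f), -(n + f) / (n - f)],
            [0.0,         0.0,         0.0,         1.0]]

def transform(m, p):
    """Multiply a 4x4 matrix by the homogeneous point [x, y, z, 1]."""
    v = (p[0], p[1], p[2], 1.0)
    return [sum(m[i][j] * v[j] for j in range(4)) for i in range(4)]

# Example bounds; note n > f because both planes have negative z.
l, r, b, t, n, f = -2.0, 4.0, -1.0, 3.0, -1.0, -9.0
m_orth = orthographic_matrix(l, r, b, t, n, f)
near_corner = transform(m_orth, (l, b, n))   # expected near (-1, -1, +1)
far_corner = transform(m_orth, (r, t, f))    # expected near (+1, +1, -1)
```

With this sign convention the near plane maps to z = +1 and the far plane to z = –1, consistent with a viewer looking along –z.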
^{1} Most programmers find it intuitive to have the x-axis pointing right and the y-axis pointing up. In a right-handed coordinate system, this implies that we are looking in the –z direction. Some systems use a left-handed coordinate system for viewing so that the gaze direction is along +z. Which is best is a matter of taste, and this text assumes a right-handed coordinate system. A reference that argues for the left-handed system instead is given in the notes at the end of this chapter.
To draw 3D line segments in the orthographic view volume, we project them into screen x- and y-coordinates and ignore z-coordinates. We do this by combining Equations (8.2) and (8.3). Note that in a program, we multiply the matrices together to form one matrix and then manipulate points as follows:
The z-coordinate will now be in [–1, 1]. We don’t take advantage of this now, but it will be useful when we examine z-buffer algorithms.
The code to draw many 3D lines with endpoints a_{i} and b_{i} thus becomes both simple and efficient:
construct M_{vp}
construct M_{orth}
M = M_{vp}M_{orth}
for each line segment (a_{i}, b_{i}) do
    p = Ma_{i}
    q = Mb_{i}
    drawline(x_{p}, y_{p}, x_{q}, y_{q})
This is a first example of how matrix transformation machinery makes graphics programs clean and efficient.
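The loop above translates almost line for line into Python. The following is a minimal sketch under a few assumptions: matrices are nested lists, the helper names are invented for this example, and instead of calling a real `drawline` routine the function simply returns the screen-space endpoints it would draw.

```python
def matmul(a, b):
    """Product of two 4x4 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def transform(m, p):
    """Multiply a 4x4 matrix by the homogeneous point [x, y, z, 1]."""
    v = (p[0], p[1], p[2], 1.0)
    return [sum(m[i][j] * v[j] for j in range(4)) for i in range(4)]

def viewport_matrix(nx, ny):
    return [[nx / 2, 0.0, 0.0, (nx - 1) / 2],
            [0.0, ny / 2, 0.0, (ny - 1) / 2],
            [0.0, 0.0, 1.0, 0.0],
            [0.0, 0.0, 0.0, 1.0]]

def orthographic_matrix(l, r, b, t, n, f):
    return [[2 / (r - l), 0.0, 0.0, -(r + l) / (r - l)],
            [0.0, 2 / (t - b), 0.0, -(t + b) / (t - b)],
            [0.0, 0.0, 2 / (n - f), -(n + f) / (n - f)],
            [0.0, 0.0, 0.0, 1.0]]

def project_segments(segments, m):
    """Return (x_p, y_p, x_q, y_q) for each 3D segment (a, b)."""
    out = []
    for a, b in segments:
        p = transform(m, a)
        q = transform(m, b)
        out.append((p[0], p[1], q[0], q[1]))  # a renderer would call drawline here
    return out

# One composite matrix, built once and reused for every endpoint.
m = matmul(viewport_matrix(100, 50), orthographic_matrix(-2, 2, -1, 1, -1, -5))
segments = [((-2.0, -1.0, -1.0), (2.0, 1.0, -5.0))]
screen = project_segments(segments, m)
```

The example segment runs between opposite corners of the view volume, so its endpoints project to opposite corners of the 100 × 50 image.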
We’d like to be able to change the viewpoint in 3D and look in any direction. There are a multitude of conventions for specifying viewer position and orientation. We will use the following one (see Figure 8.6):
the eye position e,
the gaze direction g,
the view-up vector t.
The eye position is a location that the eye “sees from.” If you think of graphics as a photographic process, it is the center of the lens. The gaze direction is any vector in the direction that the viewer is looking. The view-up vector is any vector in the plane that both bisects the viewer’s head into right and left halves and points “to the sky” for a person standing on the ground. These vectors provide us with enough information to set up a coordinate system with origin e and a uvw basis, using the construction of Section 2.4.7:
Our job would be done if all points we wished to transform were stored in coordinates with origin e and basis vectors u, v, and w. But as shown in Figure 8.7, the coordinates of the model are stored in terms of the canonical (or world) origin o and the x-, y-, and z-axes. To use the machinery we have already developed, we just need to convert the coordinates of the line segment endpoints we wish to draw from xyz-coordinates into uvw-coordinates. This kind of transformation was discussed in Section 7.5, and the matrix that enacts this transformation is the canonical-to-frame matrix of the camera’s coordinate frame:
Alternatively, we can think of this same transformation as first moving e to the origin, then aligning u, v, w to x, y, z.
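A short sketch of the camera transformation follows. It assumes the common construction in which w points opposite the gaze direction, u is perpendicular to both the view-up vector and w, and v completes the right-handed basis; the matrix is then "translate e to the origin, then rotate uvw onto xyz". The helper names and the example camera are assumptions made for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalize(a):
    length = math.sqrt(dot(a, a))
    return tuple(x / length for x in a)

def camera_basis(g, t):
    """Orthonormal basis: w opposes the gaze, u points right, v points up."""
    w = normalize(tuple(-x for x in g))
    u = normalize(cross(t, w))
    v = cross(w, u)
    return u, v, w

def camera_matrix(e, g, t):
    """Rows u, v, w rotate the basis onto x, y, z; the last column moves e to the origin."""
    u, v, w = camera_basis(g, t)
    return [[u[0], u[1], u[2], -dot(u, e)],
            [v[0], v[1], v[2], -dot(v, e)],
            [w[0], w[1], w[2], -dot(w, e)],
            [0.0, 0.0, 0.0, 1.0]]

def transform(m, p):
    v = (p[0], p[1], p[2], 1.0)
    return [sum(m[i][j] * v[j] for j in range(4)) for i in range(4)]

# Hypothetical camera: at (1, 2, 3), looking along (1, 0, -1), with +y up.
e, g, t = (1.0, 2.0, 3.0), (1.0, 0.0, -1.0), (0.0, 1.0, 0.0)
m_cam = camera_matrix(e, g, t)
eye_in_cam = transform(m_cam, e)                   # the eye itself
ahead_in_cam = transform(m_cam, (3.0, 2.0, 1.0))   # the point e + 2g
```

In camera coordinates the eye sits at the origin, and a point straight ahead of the camera lies on the negative z-axis at its original distance from the eye.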
To make our previously z-axis-only viewing algorithm work for cameras with any location and orientation, we just need to add this camera transformation to the product of the viewport and projection transformations, so that it converts the incoming points from world to camera coordinates before they are projected:
construct M_{vp}
construct M_{orth}
construct M_{cam}
M = M_{vp}M_{orth}M_{cam}
for each line segment (a_{i}, b_{i}) do
    p = Ma_{i}
    q = Mb_{i}
    drawline(x_{p}, y_{p}, x_{q}, y_{q})
Again, almost no code is needed once the matrix infrastructure is in place.
We have left perspective for last because it takes a little bit of cleverness to make it fit into the system of vectors and matrix transformations that has served us so well up to now. To see what we need to do, let’s look at what the perspective projection transformation needs to do with points in camera space. Recall that the viewpoint is positioned at the origin and the camera is looking along the z-axis.
For the moment, we will ignore the sign of z to keep the equations simpler, but it will return on page 168.
The key property of perspective is that the size of an object on the screen is proportional to 1/z for an eye at the origin looking up the negative z-axis. This can be expressed more precisely in an equation for the geometry in Figure 8.8:
where y is the distance of the point along the y-axis, and y_{s} is where the point should be drawn on the screen.
We would really like to use the matrix machinery we developed for orthographic projection to draw perspective images; we could then just multiply another matrix into our composite matrix and use the algorithm we already have. However, this type of transformation, in which one of the coordinates of the input vector appears in the denominator, can’t be achieved using affine transformations.
We can allow for division with a simple generalization of the mechanism of homogeneous coordinates that we have been using for affine transformations. We have agreed to represent the point (x, y, z) using the homogeneous vector [x y z 1]^{T}; the extra coordinate, w, is always equal to 1, and this is ensured by always using [0 0 0 1] as the fourth row of an affine transformation matrix.
Rather than just thinking of the 1 as an extra piece bolted on to coerce matrix multiplication to implement translation, we now define it to be the denominator of the x-, y-, and z-coordinates: the homogeneous vector [x y z w]^{T} represents the point (x/w, y/w, z/w). This makes no difference when w = 1, but it allows a broader range of transformations to be implemented if we allow any values in the bottom row of a transformation matrix, causing w to take on values other than 1.
Concretely, linear transformations allow us to compute expressions like
and affine transformations extend this to
Treating w as the denominator further expands the possibilities, allowing us to compute functions like
this could be called a “linear rational function” of x, y, and z. But there is an extra constraint: the denominators are the same for all coordinates of the transformed point:
Expressed as a matrix transformation,
and
A transformation like this is known as a projective transformation or a homography.
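For reference, the progression from linear to affine to projective transformations described above can be written out as follows. The coefficient names (a, b, c, d, the subscripted versions, and e, f, g, h) and the tilde notation for the homogeneous result are chosen here for illustration and are not tied to any particular numbered equation.

```latex
\begin{align*}
\text{linear:}     \quad & x' = a x + b y + c z \\
\text{affine:}     \quad & x' = a x + b y + c z + d \\
\text{projective:} \quad & x' = \frac{a_1 x + b_1 y + c_1 z + d_1}{e x + f y + g z + h},\;
                           y' = \frac{a_2 x + b_2 y + c_2 z + d_2}{e x + f y + g z + h},\;
                           z' = \frac{a_3 x + b_3 y + c_3 z + d_3}{e x + f y + g z + h}
\end{align*}
% The shared denominator is exactly what a single 4x4 matrix followed by a divide by w produces:
\[
\begin{bmatrix} \tilde{x} \\ \tilde{y} \\ \tilde{z} \\ \tilde{w} \end{bmatrix}
=
\begin{bmatrix}
a_1 & b_1 & c_1 & d_1 \\
a_2 & b_2 & c_2 & d_2 \\
a_3 & b_3 & c_3 & d_3 \\
e   & f   & g   & h
\end{bmatrix}
\begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix},
\qquad
(x', y', z') = (\tilde{x}/\tilde{w},\; \tilde{y}/\tilde{w},\; \tilde{z}/\tilde{w}).
\]
```

Setting (e, f, g, h) = (0, 0, 0, 1) makes the denominator 1 and recovers the affine case.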
Example 17 The matrix
represents a 2D projective transformation that transforms the unit square ([0, 1] × [0, 1]) to the quadrilateral shown in Figure 8.9.
For instance, the lower-right corner of the square at (1, 0) is represented by the homogeneous vector [1 0 1]^{T} and transforms as follows:
which represents the point (3, 0). Note that if we use the matrix
instead, the result is [3 0 1]^{T}, which also represents (3, 0). In fact, any scalar multiple cM is equivalent: the numerator and denominator are both scaled by c, which does not change the result.
There is a more elegant way of expressing the same idea, which avoids treating the w-coordinate specially. In this view, a 3D projective transformation is simply a 4D linear transformation, with the extra stipulation that all scalar multiples of a vector refer to the same point:
The symbol ~ is read as “is equivalent to” and means that the two homogeneous vectors both describe the same point in space.
Example 18 In 1D homogeneous coordinates, in which we use 2-vectors to represent points on the real line, we could represent the point (1.5) using the homogeneous vector [1.5 1]^{T}, or any other point on the line x = 1.5h in homogeneous space. (See Figure 8.10.)
In 2D homogeneous coordinates, in which we use 3-vectors to represent points in the plane, we could represent the point (–1, –0.5) using the homogeneous vector [–2 –1 2]^{T}, or any other point on the line x = α[–1 –0.5 1]^{T}. Any homogeneous vector on the line can be mapped to the line’s intersection with the plane w = 1 to obtain its Cartesian coordinates. (See Figure 8.11.)
It’s fine to transform homogeneous vectors as many times as needed, without worrying about the value of the w-coordinate; in fact, it is fine if the w-coordinate is zero at some intermediate phase. It is only when we want the ordinary Cartesian coordinates of a point that we need to normalize to an equivalent point that has w = 1, which amounts to dividing all the coordinates by w. Once we’ve done this, we are allowed to read off the (x, y, z)-coordinates from the first three components of the homogeneous vector.
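The normalization step is small enough to show in full. The sketch below uses a hypothetical helper, `homogenize`, that works for homogeneous vectors of any dimension; it reuses the 2D point from Example 18 to show that scalar multiples of a homogeneous vector name the same Cartesian point.

```python
def homogenize(v):
    """Divide by the last component (w) and return the Cartesian coordinates."""
    w = v[-1]
    if w == 0:
        # Allowed at intermediate stages, but there is no w = 1 representative.
        raise ValueError("w = 0: cannot normalize to a Cartesian point")
    return tuple(c / w for c in v[:-1])

a = homogenize((-2.0, -1.0, 2.0))    # the vector used in Example 18
b = homogenize((-1.0, -0.5, 1.0))    # same line through the origin, w = 1
c = homogenize((3.0, 1.5, -3.0))     # a negative multiple works too
```

All three vectors lie on the same line through the origin of homogeneous space, so they normalize to the same point (–1, –0.5).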
The mechanism of projective transformations makes it simple to implement the division by z required to implement perspective. In the 2D example shown in Figure 8.8, we can implement the perspective projection with a matrix transformation
as follows:
This transforms the 2D homogeneous vector [y z 1]^{T} to the 1D homogeneous vector [dy z]^{T}, which represents the 1D point (dy/z) (because it is equivalent to the 1D homogeneous vector [dy/z 1]^{T}). This matches Equation (8.5).
For the “official” perspective projection matrix in 3D, we’ll adopt our usual convention of a camera at the origin facing in the –z direction, so the distance of the point (x, y, z) is –z. As with orthographic projection, we also adopt the notion of near and far planes that limit the range of distances to be seen. In this context, we will use the near plane as the projection plane, so the image plane distance is –n.
The desired mapping is then y_{s} = (n/z)y, and similarly for x. This transformation can be implemented by the perspective matrix:
Remember, n < 0.
The first, second, and fourth rows simply implement the perspective equation. The third row, as in the orthographic and viewport matrices, is designed to bring the z-coordinate “along for the ride” so that we can use it later for hidden surface removal. In the perspective projection, though, the non-constant denominator prevents us from preserving the value of z: it is impossible to keep z from changing while getting x and y to do what we need them to do. Instead, we’ve opted to keep z unchanged for points on the near or far planes.
There are many matrices that could function as perspective matrices, and all of them nonlinearly distort the z-coordinate. This specific matrix has the nice properties shown in Figures 8.12 and 8.13: it leaves points on the (z = n)-plane entirely alone, and it keeps points on the (z = f)-plane on that plane while “squishing” them in x and y by the appropriate amount. The effect of the matrix on a point (x, y, z) is
As you can see, x and y are scaled and, more importantly, divided by z. Because both n and z (inside the view volume) are negative, there are no “flips” in x and y. Although it is not obvious (see the exercise at the end of this chapter), the transform also preserves the relative order of z values between z = n and z = f, allowing us to do depth ordering after this matrix is applied. This will be important later when we do hidden surface elimination.
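These properties are easy to check numerically. The sketch below assumes the perspective matrix has rows (n, 0, 0, 0), (0, n, 0, 0), (0, 0, n + f, –fn), and (0, 0, 1, 0), which is one matrix satisfying the description above (x and y scaled by n/z, and z preserved on the near and far planes); the function names and sample values are illustrative.

```python
def perspective_matrix(n, f):
    """Perspective matrix for a camera at the origin looking along -z (n, f < 0)."""
    return [[n,   0.0, 0.0,   0.0],
            [0.0, n,   0.0,   0.0],
            [0.0, 0.0, n + f, -f * n],
            [0.0, 0.0, 1.0,   0.0]]

def project(m, p):
    """Apply a 4x4 matrix to [x, y, z, 1] and divide by the resulting w."""
    v = (p[0], p[1], p[2], 1.0)
    x, y, z, w = (sum(m[i][j] * v[j] for j in range(4)) for i in range(4))
    return (x / w, y / w, z / w)

n, f = -1.0, -5.0
P = perspective_matrix(n, f)
near_pt = project(P, (0.5, 0.25, n))    # on the near plane: left entirely alone
far_pt = project(P, (2.0, 1.0, f))      # on the far plane: x, y scaled by n/f
mid_pt = project(P, (1.0, 1.0, -2.0))   # in between: z changes, but order is kept
```

The point on the near plane is unchanged, the point on the far plane keeps z = f while x and y shrink by the factor n/f, and the intermediate point’s transformed z stays between those of the near and far points.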
Sometimes, we will want to take the inverse of P, for example, to bring a screen coordinate plus z back to the original space, as we might want to do for picking. The inverse is
Since multiplying a homogeneous vector by a scalar does not change its meaning, the same is true of matrices that operate on homogeneous vectors. So we can write the inverse matrix in a prettier form by multiplying through by nf :
This matrix is not literally the inverse of the matrix P, but the transformation it describes is the inverse of the transformation described by P.
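The claim that a scaled inverse describes the same transformation can be checked by multiplying the two matrices: the product should be a scalar multiple of the identity. The sketch assumes the same form of P as the rows (n, 0, 0, 0), (0, n, 0, 0), (0, 0, n + f, –fn), (0, 0, 1, 0), and an inverse scaled by nf; both function names are invented here.

```python
def matmul(a, b):
    """Product of two 4x4 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def perspective_matrix(n, f):
    return [[n,   0.0, 0.0,   0.0],
            [0.0, n,   0.0,   0.0],
            [0.0, 0.0, n + f, -f * n],
            [0.0, 0.0, 1.0,   0.0]]

def perspective_inverse_scaled(n, f):
    """The true inverse of P multiplied through by n*f to clear the fractions."""
    return [[f,   0.0, 0.0,  0.0],
            [0.0, f,   0.0,  0.0],
            [0.0, 0.0, 0.0,  f * n],
            [0.0, 0.0, -1.0, n + f]]

n, f = -1.0, -5.0
product = matmul(perspective_inverse_scaled(n, f), perspective_matrix(n, f))
# product is (n*f) times the identity, which acts as the identity on homogeneous vectors
```

Because every homogeneous vector is multiplied by the same scalar nf, the product leaves every point where it was, which is all that is required of an inverse in homogeneous coordinates.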
Taken in the context of the orthographic projection matrix M_{orth} in Equation (8.3), the perspective matrix simply maps the perspective view volume (which is shaped like a slice, or frustum, of a pyramid) to the orthographic view volume (which is an axis-aligned box). The beauty of the perspective matrix is that once we apply it, we can use an orthographic transform to get to the canonical view volume. Thus, all of the orthographic machinery applies, and all that we have added is one matrix and the division by w. It is also heartening that we are not “wasting” the bottom row of our four by four matrices!
Concatenating P with M_{orth} results in the perspective projection matrix,
One issue, however, is: How are l, r, b, and t determined for perspective? They identify the “window” through which we look. Since the perspective matrix does not change the values of x and y on the (z = n)-plane, we can specify (l, r, b, t) on that plane.
To integrate the perspective matrix into our orthographic infrastructure, we simply replace M_{orth} with M_{per}, which inserts the perspective matrix P after the camera matrix M_{cam} has been applied but before the orthographic projection. So the full set of matrices for perspective viewing is
The resulting algorithm is
compute M_{vp}
compute M_{per}
compute M_{cam}
M = M_{vp}M_{per}M_{cam}
for each line segment (a_{i}, b_{i}) do
    p = Ma_{i}
    q = Mb_{i}
    drawline(x_{p} /w_{p}, y_{p} /w_{p}, x_{q} /w_{q}, y_{q} /w_{q})
Note that the only change other than the additional matrix is the divide by the homogeneous coordinate w.
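A compact Python sketch of the perspective version of the loop is given below. It makes several simplifying assumptions: the camera is already at the origin looking along –z, so M_{cam} is the identity; M_{per} is formed as the product of the orthographic matrix and the perspective matrix; and the function returns pixel coordinates rather than calling a drawing routine. All names are illustrative.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def viewport_matrix(nx, ny):
    return [[nx / 2, 0.0, 0.0, (nx - 1) / 2],
            [0.0, ny / 2, 0.0, (ny - 1) / 2],
            [0.0, 0.0, 1.0, 0.0],
            [0.0, 0.0, 0.0, 1.0]]

def orthographic_matrix(l, r, b, t, n, f):
    return [[2 / (r - l), 0.0, 0.0, -(r + l) / (r - l)],
            [0.0, 2 / (t - b), 0.0, -(t + b) / (t - b)],
            [0.0, 0.0, 2 / (n - f), -(n + f) / (n - f)],
            [0.0, 0.0, 0.0, 1.0]]

def perspective_matrix(n, f):
    return [[n, 0.0, 0.0, 0.0],
            [0.0, n, 0.0, 0.0],
            [0.0, 0.0, n + f, -f * n],
            [0.0, 0.0, 1.0, 0.0]]

def to_screen(m, p):
    """Transform [x, y, z, 1] by m, then divide x and y by the homogeneous w."""
    v = (p[0], p[1], p[2], 1.0)
    x, y, _, w = (sum(m[i][j] * v[j] for j in range(4)) for i in range(4))
    return (x / w, y / w)

IDENTITY = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

nx, ny = 100, 100
l, r, b, t, n, f = -1.0, 1.0, -1.0, 1.0, -1.0, -10.0
m_per = matmul(orthographic_matrix(l, r, b, t, n, f), perspective_matrix(n, f))
m_cam = IDENTITY  # assumption: camera at the origin, looking along -z
m = matmul(viewport_matrix(nx, ny), matmul(m_per, m_cam))

# Both endpoints lie on the viewing ray through the window corner (r, t, n).
segments = [((1.0, 1.0, -1.0), (2.0, 2.0, -2.0))]
screen = [to_screen(m, a) + to_screen(m, b) for a, b in segments]
```

Because the two endpoints lie on a single viewing ray, they project to the same pixel position, the top-right corner of the image.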
Multiplied out, the matrix M_{per} looks like this:
This or similar matrices often appear in documentation, and they are less mysterious when one realizes that they are usually the product of a few simple matrices.
Example 19 Many APIs such as OpenGL (Shreiner, Neider, Woo, & Davis, 2004) use the same canonical view volume as presented here. They also usually have the user specify the absolute values of n and f. The projection matrix for OpenGL is
Other APIs send n and f to 0 and 1, respectively. Blinn (1996) recommends making the canonical view volume [0, 1]^{3} for efficiency. All such decisions will change the projection matrix slightly.
An important property of the perspective transform is that it takes lines to lines and planes to planes. In addition, it takes line segments in the view volume to line segments in the canonical volume. To see this, consider the line segment
When transformed by a 4 × 4 matrix M, it is a point with possibly varying homogeneous coordinate:
The homogenized 3D line segment is
If Equation (8.6) can be rewritten in a form
then all the homogenized points lie on a 3D line. Brute force manipulation of Equation (8.6) yields such a form with
It also turns out that the line segments do map to line segments preserving the ordering of the points (Exercise 11); i.e., they do not get reordered or “torn.”
A byproduct of the transform taking line segments to line segments is that it takes the edges and vertices of a triangle to the edges and vertices of another triangle. Thus, it takes triangles to triangles and planes to planes.
While we can specify any window using the (l, r, b, t) and n values, sometimes we would like to have a simpler system where we look through the center of the window. This implies the constraint that
If we also add the constraint that the pixels are square, i.e., there is no distortion of shape in the image, then the ratio of r to t must be the same as the ratio of the number of horizontal pixels to the number of vertical pixels:
Once n_{x} and n_{y} are specified, this leaves only one degree of freedom. That is often set using the field-of-view shown as θ in Figure 8.14. This is sometimes called the vertical field-of-view to distinguish it from the angle between left and right sides or from the angle between diagonal corners. From the figure, we can see that
If n and θ are specified, then we can derive t and use code for the more general viewing system. In some systems, the value of n is hard-coded to some reasonable value, and thus, we have one fewer degree of freedom.
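Deriving the window from a field of view takes only a few lines. The sketch assumes the constraints described above: a centered window (l = –r, b = –t), square pixels (r/t = n_{x}/n_{y}), and a vertical field of view θ satisfying tan(θ/2) = t/|n|. The function name and example values are made up for illustration.

```python
import math

def window_from_fov(theta, n, nx, ny):
    """Return (l, r, b, t) on the near plane z = n for a centered window.

    theta is the vertical field of view in radians; n is negative.
    """
    t = abs(n) * math.tan(theta / 2)   # half-height of the window
    r = t * nx / ny                    # square pixels: r / t = nx / ny
    return (-r, r, -t, t)

# A 90-degree vertical field of view with the near plane at z = -1.
l, r, b, t = window_from_fov(math.radians(90), -1.0, 640, 480)
```

The resulting values can be passed directly to code written for the more general (l, r, b, t, n, f) viewing system.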
Is orthographic projection ever useful in practice?
It is useful in applications where relative length judgments are important. It can also yield simplifications where perspective would be too expensive, as occurs in some medical visualization applications.
The tessellated spheres I draw in perspective look like ovals. Is this a bug?
No. It is correct behavior. If you place your eye in the same relative position to the screen as the virtual viewer has with respect to the viewport, then these ovals will look like circles because they themselves are viewed at an angle.
Does the perspective matrix take negative z values to positive z values with a reversed ordering? Doesn’t that cause trouble?
Yes. The equation for transformed z is z′ = n + f – fn/z.
So, as ε shrinks toward zero, z = +ε is transformed to z′ = –∞ and z = –ε is transformed to z′ = +∞. Any line segments that span z = 0 will therefore be “torn,” although all points will be projected to an appropriate screen location. This tearing is not relevant when all objects are contained in the viewing volume, which is usually assured by clipping to the view volume. However, clipping itself is made more complicated by the tearing phenomenon, as is discussed in Chapter 9.
The perspective matrix changes the value of the homogeneous coordinate. Doesn’t that make the move and scale transformations no longer work properly?
Applying a translation to a homogeneous point, we have
Similar effects are true for other transforms (see Exercise 5).
Most of the discussion of viewing matrices is based on information in Real-Time Rendering (Akenine-Möller, Haines, & Hoffman, 2008), the OpenGL Programming Guide (Shreiner et al., 2004), Computer Graphics (Hearn & Baker, 1986), and 3D Game Engine Design (Eberly, 2000).
1. Construct the viewport matrix required for a system in which pixel coordinates count down from the top of the image, rather than up from the bottom.
2. Multiply the viewport and orthographic projection matrices, and show that the result can also be obtained by a single application of Equation (7.7).
3. Derive the third row of Equation (8.3) from the constraint that z is preserved for points on the near and far planes.
4. Show algebraically that the perspective matrix preserves order of z values within the view volume.
5. For a 4×4 matrix whose top three rows are arbitrary and whose bottom row is (0, 0, 0, 1), show that the points (x, y, z, 1) and (hx, hy, hz, h) transform to the same point after homogenization.
6. Verify that the form of M_{p}^{–1} (the inverse of the perspective matrix) given in the text is correct.
7. Verify that the full perspective to canonical matrix M_{per} takes (r, t, n) to (1, 1, 1).
8. Write down a perspective matrix for n = 1, f = 2.
9. For the point p = (x, y, z, 1), what are the homogenized and unhomogenized results for that point transformed by the perspective matrix in Exercise 8?
10. For the eye position e = (0, 1, 0), a gaze vector g = (0, –1, 0), and a view-up vector t = (1, 1, 0), what is the resulting orthonormal uvw basis used for coordinate rotations?
11. Show that, for a perspective transform, line segments that start in the view volume do map to line segments in the canonical volume after homogenization. Furthermore, show that the relative ordering of points on the two segments is the same. Hint: Show that the f(t) in Equation (8.8) has the properties f(0) = 0, f(1) = 1, the derivative of f is positive for all t ∈ [0, 1], and the homogeneous coordinate does not change sign.
The previous several chapters have established the mathematical scaffolding we need to look at the second major approach to rendering: drawing objects one by one onto the screen, or object-order rendering. Unlike in ray tracing, where we consider each pixel in turn and find the objects that influence its color, we’ll now instead consider each geometric object in turn and find the pixels that it could have an effect on. The process of finding all the pixels in an image that are occupied by a geometric primitive is called rasterization, so object-order rendering can also be called rendering by rasterization. The sequence of operations that is required, starting with objects and ending by updating pixels in the image, is known as the graphics pipeline.
Any graphics system has one or more types of “primitive object” that it can handle directly, and more complex objects are converted into these “primitives.” Triangles are the most often used primitive.
Rasterization-based systems are also called scanline renderers.
Object-order rendering has enjoyed great success because of its efficiency. For large scenes, management of data access patterns is crucial to performance, and making a single pass over the scene visiting each bit of geometry once has significant advantages over repeatedly searching the scene to retrieve the objects required to shade each pixel.
The title of this chapter suggests that there is only one way to do object-order rendering. Of course, this isn’t true—two quite different examples of graphics pipelines with very different goals are the hardware pipelines used to support interactive rendering via APIs like OpenGL and Direct3D and the software pipelines used in film production, supporting APIs like RenderMan. Hardware pipelines must run fast enough to react in real time for games, visualizations, and user interfaces. Production pipelines must render the highest quality animation and visual effects possible and scale to enormous scenes, but may take much more time to do so. Despite the different design decisions resulting from these divergent goals, a remarkable amount is shared among most, if not all, pipelines, and this chapter attempts to focus on these common fundamentals, erring on the side of following the hardware pipelines more closely.
The work that needs to be done in object-order rendering can be organized into the task of rasterization itself, the operations that are done to geometry before rasterization, and the operations that are done to pixels after rasterization. The most common geometric operation is applying matrix transformations, as discussed in the previous two chapters, to map the points that define the geometry from object space to screen space, so that the input to the rasterizer is expressed in pixel coordinates, or screen space. The most common pixelwise operation is hidden surface removal, which arranges for surfaces closer to the viewer to appear in front of surfaces farther from the viewer. Many other operations also can be included at each stage, thereby achieving a wide range of different rendering effects using the same general process.
For the purposes of this chapter, we’ll discuss the graphics pipeline in terms of four stages (Figure 9.1). Geometric objects are fed into the pipeline from an interactive application or from a scene description file, and they are always described by sets of vertices. The vertices are operated on in the vertex-processing stage, then the primitives using those vertices are sent to the rasterization stage. The rasterizer breaks each primitive into a number of fragments, one for each pixel covered by the primitive. The fragments are processed in the fragment processing stage, and then, the various fragments corresponding to each pixel are combined in the fragment blending stage.
We’ll begin by discussing rasterization and then illustrate the purpose of the geometric and pixel-wise stages by a series of examples.
Rasterization is the central operation in object-order graphics, and the rasterizer is central to any graphics pipeline. For each primitive that comes in, the rasterizer has two jobs: it enumerates the pixels that are covered by the primitive and it interpolates values, called attributes, across the primitive—the purpose for these attributes will be clear with later examples. The output of the rasterizer is a set of fragments, one for each pixel covered by the primitive. Each fragment “lives” at a particular pixel and carries its own set of attribute values.
In this chapter, we will present rasterization with a view toward using it to render three-dimensional scenes. The same rasterization methods are used to draw lines and shapes in 2D as well—although it is becoming more and more common to use the 3D graphics system “under the covers” to do all 2D drawing.
Most graphics packages contain a line drawing command that takes two endpoints in screen coordinates (see Figure 3.10) and draws a line between them. For example, the call for endpoints (1,1) and (3,2) would turn on pixels (1,1) and (3,2) and fill in one pixel between them. For general screen coordinate endpoints (x_{0},y_{0}) and (x_{1},y_{1}), the routine should draw some “reasonable” set of pixels that approximates a line between them. Drawing such lines is based on line equations, and we have two types of equations to choose from: implicit and parametric. This section describes the approach using implicit lines.
Even though we often use integer-valued endpoints for examples, it’s important to properly support arbitrary endpoints.
The most common way to draw lines using implicit equations is the midpoint algorithm (Pitteway, 1967; van Aken & Novak, 1985). The midpoint algorithm ends up drawing the same lines as the Bresenham algorithm (Bresenham, 1965), but it is somewhat more straightforward.
The first thing to do is find the implicit equation for the line as discussed in Section 2.7.2:
f(x, y) ≡ (y_{0} – y_{1})x + (x_{1} – x_{0})y + x_{0}y_{1} – x_{1}y_{0} = 0. (9.1)
We assume that x_{0} ≤ x_{1}. If that is not true, we swap the points so that it is true. The slope m of the line is given by
m = (y_{1} – y_{0}) / (x_{1} – x_{0}).
The following discussion assumes m ∈ (0, 1]. Analogous discussions can be derived for m ∈ (–∞, –1], m ∈ (–1, 0], and m ∈ (1, ∞). The four cases cover all possibilities.
For the case m ∈ (0, 1], there is more “run” than “rise”; i.e., the line is moving faster in x than in y. If we have an API where the y-axis points downward, we might have a concern about whether this makes the process harder, but, in fact, we can ignore that detail. We can ignore the geometric notions of “up” and “down,” because the algebra is exactly the same for the two cases. Cautious readers can confirm that the resulting algorithm works for the y-axis downward case. The key assumption of the midpoint algorithm is that we draw the thinnest line possible that has no gaps. A diagonal connection between two pixels is not considered a gap.
As the line progresses from the left endpoint to the right, there are only two possibilities: draw a pixel at the same height as the pixel drawn to its left, or draw a pixel one higher. There will always be exactly one pixel in each column of pixels between the endpoints. Zero would imply a gap, and two would be too thick a line. There may be two pixels in the same row for the case we are considering; the line is more horizontal than vertical, so sometimes it will go right and sometimes up. This concept is shown in Figure 9.2, where three “reasonable” lines are shown, each advancing more in the horizontal direction than in the vertical direction.
The midpoint algorithm for m ∈ (0, 1] first establishes the leftmost pixel and the column number (x-value) of the rightmost pixel and then loops horizontally establishing the row (y-value) of each pixel. The basic form of the algorithm is
y = y_{0}
for x = x_{0} to x_{1} do
    draw(x, y)
    if (some condition) then
        y = y + 1
Note that x and y are integers. In words this says, “keep drawing pixels from left to right and sometimes move upward in the y-direction while doing so.” The key is to establish efficient ways to make the decision in the if statement.
An effective way to make the choice is to look at the midpoint of the line between the two potential pixel centers. More specifically, the pixel just drawn is pixel (x, y) whose center in real screen coordinates is at (x, y). The candidate pixels to be drawn to the right are pixels (x + 1, y) and (x + 1, y + 1). The midpoint between the centers of the two candidate pixels is (x + 1, y + 0.5). If the line passes below this midpoint, we draw the bottom pixel, and otherwise, we draw the top pixel (Figure 9.3).
To decide whether the line passes above or below (x + 1, y + 0.5), we evaluate f(x + 1, y + 0.5) in Equation (9.1). Recall from Section 2.7.1 that f(x, y) = 0 for points (x, y) on the line, f(x, y) > 0 for points on one side of the line, and f(x, y) < 0 for points on the other side of the line. Because –f(x, y) = 0 and f(x, y) = 0 are both perfectly good equations for the line, it is not immediately clear whether f(x, y) being positive indicates that (x, y) is above the line or whether it is below. However, we can figure it out; the key term in Equation (9.1) is the y term (x_{1} – x_{0})y. Note that (x_{1} – x_{0}) is definitely positive because x_{1} > x_{0}. This means that as y increases, the term (x_{1} – x_{0})y gets larger (i.e., more positive or less negative). Thus, the case f(x, +∞) is definitely positive, and definitely above the line, implying points above the line are all positive.
Another way to look at it is that the y component of the gradient vector is positive. So above the line, where y can increase arbitrarily, f (x, y) must be positive. This means we can make our code more specific by filling in the if statement:
if f(x + 1, y + 0.5) < 0 then
    y = y + 1
The above code will work nicely for lines of the appropriate slope (i.e., between zero and one). The reader can work out the other three cases which differ only in small details.
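The loop above can be sketched in Python as follows. This is a minimal illustration for integer endpoints and slopes in (0, 1] only; the function names `implicit_line` and `midpoint_line` are chosen here for illustration, and the pixels are returned as a list rather than drawn.

```python
def implicit_line(x0, y0, x1, y1):
    """Return f(x, y) for the line through the two points, in the form of Equation (9.1)."""
    def f(x, y):
        return (y0 - y1) * x + (x1 - x0) * y + x0 * y1 - x1 * y0
    return f


def midpoint_line(x0, y0, x1, y1):
    """List the pixels of a line with integer endpoints and slope in (0, 1]."""
    if x0 > x1:
        # Swap the endpoints so that we always walk from left to right.
        x0, y0, x1, y1 = x1, y1, x0, y0
    f = implicit_line(x0, y0, x1, y1)
    pixels = []
    y = y0
    for x in range(x0, x1 + 1):
        pixels.append((x, y))
        # f < 0 at the midpoint means the midpoint is below the line,
        # so the line passes above it and the upper pixel is chosen.
        if f(x + 1, y + 0.5) < 0:
            y += 1
    return pixels


print(midpoint_line(1, 1, 3, 2))
```

For the endpoints (1,1) and (3,2) used earlier in this section, the sketch turns on both endpoints and one pixel between them.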
If greater efficiency is desired, using an incremental method can help. An incremental method tries to make a loop more efficient by reusing computation from the previous step. In the midpoint algorithm as presented, the main computation is the evaluation of f(x + 1, y + 0.5). Note that inside the loop, after the first iteration, either we already evaluated f(x – 1, y + 0.5) or f(x – 1, y – 0.5) (Figure 9.4). Note also this relationship:
f(x + 1, y) = f(x, y) + (y_{0} – y_{1}),
f(x + 1, y + 1) = f(x, y) + (y_{0} – y_{1}) + (x_{1} – x_{0}).
This allows us to write an incremental version of the code:
y = y_{0}
d = f(x_{0} + 1, y_{0} + 0.5)
for x = x_{0} to x_{1} do
    draw(x, y)
    if d < 0 then
        y = y + 1
        d = d + (x_{1} – x_{0}) + (y_{0} – y_{1})
    else
        d = d + (y_{0} – y_{1})
This code should run faster since it has little extra setup cost compared to the non-incremental version (that is not always true for incremental algorithms), but it may accumulate more numeric error because the evaluation of f (x, y + 0.5) may be composed of many adds for long lines. However, given that lines are rarely longer than a few thousand pixels, such an error is unlikely to be critical. Slightly longer setup cost, but faster loop execution, can be achieved by storing (x_{1} – x_{0})+(y_{0} – y_{1}) and (y_{0} – y_{1}) as variables. We might hope a good compiler would do that for us, but if the code is critical, it would be wise to examine the results of compilation to make sure.
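As a hedged sketch of the incremental version, the following Python function stores the two increments in variables before the loop, as suggested above. The name `midpoint_line_incremental` and the variable names are illustrative; the restriction to integer endpoints and slopes in (0, 1] still applies.

```python
def midpoint_line_incremental(x0, y0, x1, y1):
    """Incremental midpoint algorithm for integer endpoints, slope in (0, 1]."""
    if x0 > x1:
        x0, y0, x1, y1 = x1, y1, x0, y0
    # Increments of the decision variable, computed once before the loop.
    step_right = y0 - y1                     # x advances, y stays
    step_up_right = (x1 - x0) + (y0 - y1)    # x and y both advance
    # d starts as f(x0 + 1, y0 + 0.5), with f as in Equation (9.1).
    d = (y0 - y1) * (x0 + 1) + (x1 - x0) * (y0 + 0.5) + x0 * y1 - x1 * y0
    pixels = []
    y = y0
    for x in range(x0, x1 + 1):
        pixels.append((x, y))
        if d < 0:
            y += 1
            d += step_up_right
        else:
            d += step_right
    return pixels


print(midpoint_line_incremental(0, 0, 5, 2))
```

Only additions remain inside the loop; the single evaluation of f happens during setup.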
We often want to draw a 2D triangle with 2D points p_{0} = (x_{0}, y_{0}), p_{1} = (x_{1}, y_{1}), and p_{2} = (x_{2}, y_{2}) in screen coordinates. This is similar to the line drawing problem, but it has some of its own subtleties. As with line drawing, we may wish to interpolate color or other properties from values at the vertices. This is straightforward if we have the barycentric coordinates (Section 2.9). For example, if the vertices have colors c_{0}, c_{1}, and c_{2}, the color at a point in the triangle with barycentric coordinates (α, β, γ) is
c = αc_{0} + βc_{1} + γc_{2}.
This type of interpolation of color is known in graphics as Gouraud interpolation after its inventor (Gouraud, 1971).
Another subtlety of rasterizing triangles is that we are usually rasterizing triangles that share vertices and edges. This means we would like to rasterize adjacent triangles, so there are no holes. We could do this by using the midpoint algorithm to draw the outline of each triangle and then fill in the interior pixels. This would mean adjacent triangles both draw the same pixels along each edge. If the adjacent triangles have different colors, the image will depend on the order in which the two triangles are drawn. The most common way to rasterize triangles that avoids the order problem and eliminates holes is to use the convention that pixels are drawn if and only if their centers are inside the triangle; i.e., the barycentric coordinates of the pixel center are all in the interval (0, 1). This raises the issue of what to do if the center is exactly on the edge of the triangle. There are several ways to handle this as will be discussed later in this section. The key observation is that barycentric coordinates allow us to decide whether to draw a pixel and what color that pixel should be if we are interpolating colors from the vertices. So our problem of rasterizing the triangle boils down to efficiently finding the barycentric coordinates of pixel centers (Pineda, 1988). The brute force rasterization algorithm is
for all x do
    for all y do
        compute (α, β, γ) for (x, y)
        if (α ∈ [0, 1] and β ∈ [0, 1] and γ ∈ [0, 1]) then
            c = αc_{0} + βc_{1} + γc_{2}
            drawpixel (x, y) with color c
The rest of the algorithm limits the outer loops to a smaller set of candidate pixels and makes the barycentric computation efficient.
We can add a simple efficiency by finding the bounding rectangle of the three vertices and only looping over this rectangle for candidate pixels to draw. We can compute barycentric coordinates using Equation (2.32). This yields the algorithm:
x_{min} = floor(min_{i} x_{i})
x_{max} = ceiling(max_{i} x_{i})
y_{min} = floor(min_{i} y_{i})
y_{max} = ceiling(max_{i} y_{i})
for y = y_{min} to y_{max} do
    for x = x_{min} to x_{max} do
        α = f_{12}(x, y)/f_{12}(x_{0}, y_{0})
        β = f_{20}(x, y)/f_{20}(x_{1}, y_{1})
        γ = f_{01}(x, y)/f_{01}(x_{2}, y_{2})
        if (α > 0 and β > 0 and γ > 0) then
            c = αc_{0} + βc_{1} + γc_{2}
            drawpixel (x, y) with color c
Here, f_{ij} is the line given by Equation (9.1) with the appropriate vertices:
f_{01}(x, y) = (y_{0} – y_{1})x + (x_{1} – x_{0})y + x_{0}y_{1} – x_{1}y_{0},
f_{12}(x, y) = (y_{1} – y_{2})x + (x_{2} – x_{1})y + x_{1}y_{2} – x_{2}y_{1},
f_{20}(x, y) = (y_{2} – y_{0})x + (x_{0} – x_{2})y + x_{2}y_{0} – x_{0}y_{2}.
Note that we have exchanged the test α ∈ (0, 1) with α > 0 etc., because if all of α, β, γ are positive, then we know they are all less than one because α + β + γ = 1. We could also compute only two of the three barycentric variables and get the third from that relation, but it is not clear that this saves computation once the algorithm is made incremental, which is possible as in the line drawing algorithms; each of the computations of α, β, and γ does an evaluation of the form f(x, y) = Ax + By + C. In the inner loop, only x changes, and it changes by one. Note that f(x + 1, y) = f(x, y) + A. This is the basis of the incremental algorithm. In the outer loop, the evaluation changes from f(x, y) to f(x, y + 1), so a similar efficiency can be achieved. Because α, β, and γ change by constant increments in the loop, so does the color c. So this can be made incremental as well. For example, the red value for pixel (x + 1, y) differs from the red value for pixel (x, y) by a constant amount that can be precomputed. An example of a triangle with color interpolation is shown in Figure 9.5.
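The bounding-rectangle algorithm can be sketched in Python as below. This is a non-incremental illustration, not an optimized rasterizer: the names `edge_function` and `rasterize_triangle` are hypothetical, colors are assumed to be RGB tuples, and fragments are collected in a dictionary instead of being drawn. Pixels whose centers lie exactly on an edge are skipped here, matching the strict α > 0 test.

```python
import math


def edge_function(xa, ya, xb, yb):
    """Implicit line through (xa, ya) and (xb, yb), in the form of Equation (9.1)."""
    def f(x, y):
        return (ya - yb) * x + (xb - xa) * y + xa * yb - xb * ya
    return f


def rasterize_triangle(p0, p1, p2, c0, c1, c2):
    """Return {(x, y): color} for pixel centers strictly inside the triangle."""
    (x0, y0), (x1, y1), (x2, y2) = p0, p1, p2
    f01 = edge_function(x0, y0, x1, y1)
    f12 = edge_function(x1, y1, x2, y2)
    f20 = edge_function(x2, y2, x0, y0)
    f_alpha, f_beta, f_gamma = f12(x0, y0), f20(x1, y1), f01(x2, y2)
    if f_alpha == 0 or f_beta == 0 or f_gamma == 0:
        return {}  # degenerate (zero-area) triangle: nothing to draw
    x_min, x_max = math.floor(min(x0, x1, x2)), math.ceil(max(x0, x1, x2))
    y_min, y_max = math.floor(min(y0, y1, y2)), math.ceil(max(y0, y1, y2))
    fragments = {}
    for y in range(y_min, y_max + 1):
        for x in range(x_min, x_max + 1):
            alpha = f12(x, y) / f_alpha
            beta = f20(x, y) / f_beta
            gamma = f01(x, y) / f_gamma
            if alpha > 0 and beta > 0 and gamma > 0:
                # Gouraud interpolation of the vertex colors.
                fragments[(x, y)] = tuple(
                    alpha * a + beta * b + gamma * c
                    for a, b, c in zip(c0, c1, c2)
                )
    return fragments


frags = rasterize_triangle((0, 0), (4, 0), (0, 4),
                           (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
print(sorted(frags))
```

An incremental variant would replace the three evaluations per pixel with additions of the A and B coefficients of each edge function.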
We have still not discussed what to do for pixels whose centers are exactly on the edge of a triangle. If a pixel is exactly on the edge of a triangle, then it is also on the edge of the adjacent triangle if there is one. There is no obvious way to award the pixel to one triangle or the other. The worst decision would be to not draw the pixel because a hole would result between the two triangles. Better, but still not good, would be to have both triangles draw the pixel. If the triangles are transparent, this will result in a double-coloring. We would really like to award the pixel to exactly one of the triangles, and we would like this process to be simple; which triangle is chosen does not matter as long as the choice is well defined.
One approach is to note that any off-screen point is definitely on exactly one side of the shared edge, and the triangle on that side is the one that will draw the edge. For two non-overlapping triangles, the vertices not on the edge are on opposite sides of the edge from each other. Exactly one of these vertices will be on the same side of the edge as the off-screen point (Figure 9.6). This is the basis of the test. The test of whether numbers p and q have the same sign can be implemented as the test pq > 0, which is very efficient in most environments.
Note that the test is not perfect because the line through the edge may also go through the off-screen point, but we have at least greatly reduced the number of problematic cases. Which off-screen point is used is arbitrary, and (x, y) = (–1, –1) is as good a choice as any. We will need to add a check for the case of a point exactly on an edge. We would like this check not to be reached for common cases, which are the completely inside or outside tests. This suggests
f_{α} = f_{12}(x_{0}, y_{0})
f_{β} = f_{20}(x_{1}, y_{1})
f_{γ} = f_{01}(x_{2}, y_{2})
for y = y_{min} to y_{max} do
    for x = x_{min} to x_{max} do
        α = f_{12}(x, y)/f_{α}
        β = f_{20}(x, y)/f_{β}
        γ = f_{01}(x, y)/f_{γ}
        if (α ≥ 0 and β ≥ 0 and γ ≥ 0) then
            if (α > 0 or f_{α}f_{12}(–1, –1) > 0) and
               (β > 0 or f_{β}f_{20}(–1, –1) > 0) and
               (γ > 0 or f_{γ}f_{01}(–1, –1) > 0) then
                c = αc_{0} + βc_{1} + γc_{2}
                drawpixel (x, y) with color c
We might expect that the above code would work to eliminate holes and double-draws only if we use exactly the same line equation for both triangles. In fact, the line equation is the same only if the two shared vertices have the same order in the draw call for each triangle. Otherwise, the equation might flip in sign. This could be a problem depending on whether the compiler changes the order of operations. So if a robust implementation is needed, the details of the compiler and arithmetic unit may need to be examined. The first four lines in the pseudocode above must be coded carefully to handle cases where the edge exactly hits the pixel center.
In addition to being amenable to an incremental implementation, there are several potential early exit points. For example, if α is negative, there is no need to compute β or γ. While this may well result in a speed improvement, profiling is always a good idea; the extra branches could reduce pipelining or concurrency and might slow down the code. So as always, test any attractive-looking optimizations if the code is a critical section.
Another detail of the above code is that the divisions could be divisions by zero for degenerate triangles, i.e., if f_{γ} = 0. Either the floating point error conditions should be accounted for properly, or another test will be needed.
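The edge rule and the degenerate-triangle check can be illustrated with a small Python sketch. It assumes integer vertex coordinates so that the on-edge case is hit exactly, uses the off-screen point (–1, –1) from the text, and returns the set of covered pixels; the names `edge_function` and `covered_pixels` are illustrative only.

```python
import math


def edge_function(xa, ya, xb, yb):
    """Implicit line through (xa, ya) and (xb, yb), in the form of Equation (9.1)."""
    def f(x, y):
        return (ya - yb) * x + (xb - xa) * y + xa * yb - xb * ya
    return f


def covered_pixels(p0, p1, p2, offscreen=(-1, -1)):
    """Pixels drawn for one triangle, awarding on-edge pixels by the off-screen test."""
    (x0, y0), (x1, y1), (x2, y2) = p0, p1, p2
    f01 = edge_function(x0, y0, x1, y1)
    f12 = edge_function(x1, y1, x2, y2)
    f20 = edge_function(x2, y2, x0, y0)
    f_alpha, f_beta, f_gamma = f12(x0, y0), f20(x1, y1), f01(x2, y2)
    if f_alpha == 0 or f_beta == 0 or f_gamma == 0:
        return set()  # degenerate triangle: avoid dividing by zero
    ox, oy = offscreen
    # An edge is "owned" if the opposite vertex and the off-screen point
    # lie on the same side of it (same sign, tested as a positive product).
    owns_alpha = f_alpha * f12(ox, oy) > 0
    owns_beta = f_beta * f20(ox, oy) > 0
    owns_gamma = f_gamma * f01(ox, oy) > 0
    pixels = set()
    for y in range(math.floor(min(y0, y1, y2)), math.ceil(max(y0, y1, y2)) + 1):
        for x in range(math.floor(min(x0, x1, x2)), math.ceil(max(x0, x1, x2)) + 1):
            alpha = f12(x, y) / f_alpha
            beta = f20(x, y) / f_beta
            gamma = f01(x, y) / f_gamma
            if alpha >= 0 and beta >= 0 and gamma >= 0:
                if ((alpha > 0 or owns_alpha) and (beta > 0 or owns_beta)
                        and (gamma > 0 or owns_gamma)):
                    pixels.add((x, y))
    return pixels


# Two triangles that tile a square and share the edge from (4, 0) to (0, 4).
lower = covered_pixels((0, 0), (4, 0), (0, 4))
upper = covered_pixels((4, 0), (4, 4), (0, 4))
print(sorted(lower & upper), len(lower | upper))
```

In this example the pixels centered on the shared edge go to the triangle on the same side as the off-screen point, so no pixel is drawn twice and no hole is left between the two triangles.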
There are some subtleties in achieving correct-looking perspective when interpolating quantities, such as texture coordinates or 3D positions, that need to vary linearly across the 3D triangles. We’ll use texture coordinates as an example of a quantity where perspective correction is important, but the same considerations apply to any attribute where linearity in 3D space is important.
The reason things are not straightforward is that just interpolating texture coordinates in screen space results in incorrect images, as shown for the grid texture in Figure 9.7. Because things in perspective get smaller as the distance to the viewer increases, the lines that are evenly spaced in 3D should compress in 2D image space. More careful interpolation of texture coordinates is needed to accomplish this.
We can implement texture mapping on triangles by interpolating the (u, v) coordinates, modifying the rasterization method of Section 9.1.2, but this results in the problem shown at the right of Figure 9.7. A similar problem occurs for triangles if screen space barycentric coordinates are used as in the following rasterization code:
for all x do
    for all y do
        compute (α, β, γ) for (x, y)
        if α ∈ (0, 1) and β ∈ (0, 1) and γ ∈ (0, 1) then
            t = αt_{0} + βt_{1} + γt_{2}
            drawpixel (x, y) with color texture(t) for a solid texture
            or with texture(β, γ) for a 2D texture.
This code will generate images, but there is a problem. To unravel the basic problem, let’s consider the progression from world space q to homogeneous point r to homogenized point s:
q = [x_{q}, y_{q}, z_{q}, 1]^{T},
r = Mq = [x_{r}, y_{r}, z_{r}, w_{r}]^{T} (transform),
s = [x_{r}/w_{r}, y_{r}/w_{r}, z_{r}/w_{r}, 1]^{T} (homogenize).
The simplest form of the texture coordinate interpolation problem is when we have texture coordinates (u, v) associated with two points, q and Q, and we need to generate texture coordinates in the image along the line between s and S. If the world-space point q′ that is on the line between q and Q projects to the screen-space point s′ on the line between s and S, then the two points should have the same texture coordinates.
The naïve screen-space approach, embodied by the algorithm above, says that at the point s′ = s + α(S – s), we should use texture coordinates u_{s} + α(u_{S} – u_{s}) and v_{s} + α(v_{S} – v_{s}). This doesn’t work correctly because the world-space point q′ that transforms to s′ is not q + α(Q – q).
However, we know from Section 8.4 that the points on the line segment between q and Q do end up somewhere on the line segment between s and S; in fact, in that section we showed that
q + t(Q – q) ↦ s + α(S – s).
The interpolation parameters t and α are not the same, but we can compute one from the other:^{1}
t(α) = w_{r}α / (w_{R} + α(w_{r} – w_{R})) and α(t) = w_{R}t / (w_{r} + t(w_{R} – w_{r})), (9.2)
where w_{r} and w_{R} are the homogeneous w-coordinates of the transformed endpoints r = Mq and R = MQ.
These equations provide one possible fix to the screen-space interpolation idea. To get texture coordinates for the screen-space point s′ = s + α(S – s), compute u′_{s} = u_{s} + t(α)(u_{S} – u_{s}) and v′_{s} = v_{s} + t(α)(v_{S} – v_{s}). These are the coordinates of the point q′ that maps to s′, so this will work. However, it is slow to evaluate t(α) for each fragment, and there is a simpler way.
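A short numeric check makes the difference between t and α concrete. The sketch below assumes the relation between the two parameters has the form t(α) = w_r α / (w_R + α(w_r – w_R)), with inverse α(t) = w_R t / (w_r + t(w_R – w_r)), where w_r and w_R are the homogeneous w-coordinates of the two transformed endpoints; the function names are illustrative.

```python
def t_of_alpha(alpha, w_r, w_R):
    """World-space parameter t for a given screen-space parameter alpha."""
    return w_r * alpha / (w_R + alpha * (w_r - w_R))


def alpha_of_t(t, w_r, w_R):
    """Screen-space parameter alpha for a given world-space parameter t."""
    return w_R * t / (w_r + t * (w_R - w_r))


def texture_u(alpha, u_s, u_S, w_r, w_R):
    """Texture coordinate at screen parameter alpha using the corrected parameter."""
    t = t_of_alpha(alpha, w_r, w_R)
    return u_s + t * (u_S - u_s)


# Far endpoint three times as deep as the near one: the screen-space
# midpoint corresponds to a point only a quarter of the way along in 3D.
print(t_of_alpha(0.5, 1.0, 3.0))
print(texture_u(0.5, 0.0, 1.0, 1.0, 3.0))
```

When w_r = w_R the two parameters coincide, which is why the naïve approach looks correct for surfaces parallel to the image plane.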
The key observation is that because, as we know, the perspective transform preserves lines and planes, it is safe to linearly interpolate any attributes we want across triangles, but only as long as they go through the perspective transformation along with the points (Figure 9.8). To get a geometric intuition for this, reduce the dimension so that we have homogeneous points (x_{r}, y_{r}, w_{r}) and a single attribute u being interpolated. The attribute u is supposed to be a linear function of x_{r} and y_{r}, so if we plot u as a height field over (x_{r}, y_{r}), the result is a plane. Now, if we think of u as a third spatial coordinate (call it u_{r} to emphasize that it’s treated the same as the others) and send the whole 3D homogeneous point (x_{r}, y_{r}, u_{r}, w_{r}) through the perspective transformation, the result (x_{s}, y_{s}, u_{s}) still generates points that lie on a plane. There will be some warping within the plane, but the plane stays flat. This means that u_{s} is a linear function of (x_{s}, y_{s}), which is to say, we can compute u_{s} anywhere by using linear interpolation based on the coordinates (x_{s}, y_{s}).
Returning to the full problem, we need to interpolate texture coordinates (u, v) that are linear functions of the world space coordinates (x_{q}, y_{q}, z_{q}). After transforming the points to screen space, and adding the texture coordinates as if they were additional coordinates, we have
[u, v, 1, x_{r}, y_{r}, z_{r}, w_{r}]^{T} → (homogenize) → [u/w_{r}, v/w_{r}, 1/w_{r}, x_{r}/w_{r} = x_{s}, y_{r}/w_{r} = y_{s}, z_{r}/w_{r} = z_{s}, 1]^{T}. (9.3)
^{1} It is worthwhile to derive these functions yourself from Equation (7.6); in that chapter’s notation, α = f(t).
The practical implication of the previous paragraph is that we can go ahead and interpolate all of these quantities based on the values of (x_{s}, y_{s}), including the value z_{s} used in the z-buffer. The problem with the naïve approach is simply that we are interpolating components selected inconsistently: as long as the quantities involved are all from before or all from after the perspective divide, all will be well.
The one remaining problem is that (u/w_{r}, v/w_{r}) is not directly useful for looking up texture data; we need (u, v). This explains the purpose of the extra parameter we slipped into (9.3), whose value is always 1: once we have u/w_{r}, v/w_{r}, and 1/w_{r}, we can easily recover (u, v) by dividing.
To verify that this is all correct, let’s check that interpolating the quantity 1/w_{r} in screen space indeed produces the reciprocal of the interpolated w_{r} in world space. To see this is true, confirm (Exercise 2):
1/w_{r} + α(t)(1/w_{R} – 1/w_{r}) = 1/w′_{r} = 1/(w_{r} + t(w_{R} – w_{r})),
remembering that α(t) and t are related by Equation 9.2.
This ability to interpolate 1/w_{r} linearly with no error in the transformed space allows us to correctly texture triangles. We can use these facts to modify our scan-conversion code for three points p_{i} = (x_{i}, y_{i}, z_{i}, w_{i}) that have been passed through the viewing matrices, but have not been homogenized, complete with texture coordinates t_{i} = (u_{i}, v_{i}):
for all x_{s} do
    for all y_{s} do
        compute (α, β, γ) for (x_{s}, y_{s})
        if (α ∈ [0, 1] and β ∈ [0, 1] and γ ∈ [0, 1]) then
            u_{s} = α(u_{0}/w_{0}) + β(u_{1}/w_{1}) + γ(u_{2}/w_{2})
            v_{s} = α(v_{0}/w_{0}) + β(v_{1}/w_{1}) + γ(v_{2}/w_{2})
            1_{s} = α(1/w_{0}) + β(1/w_{1}) + γ(1/w_{2})
            u = u_{s}/1_{s}
            v = v_{s}/1_{s}
            drawpixel (x_{s}, y_{s}) with color texture(u, v)
Of course, many of the expressions appearing in this pseudocode would be precomputed outside the loop for speed.
In practice, modern systems interpolate all attributes in a perspective-correct way, unless some other method is specifically requested.
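The per-pixel arithmetic of the pseudocode above can be sketched in Python for a single pixel. The function names are hypothetical, the screen-space barycentric coordinates are assumed to be already computed, and a naïve interpolation is included only for comparison.

```python
def interpolate_uv_naive(bary, uvs):
    """Screen-space interpolation of (u, v), ignoring perspective."""
    return tuple(sum(b * uv[k] for b, uv in zip(bary, uvs)) for k in range(2))


def interpolate_uv_perspective(bary, uvs, ws):
    """Perspective-correct (u, v): interpolate u/w, v/w, and 1/w, then divide."""
    weights = [b / w for b, w in zip(bary, ws)]   # alpha/w0, beta/w1, gamma/w2
    one_s = sum(weights)                          # interpolated 1/w
    u_s = sum(wt * uv[0] for wt, uv in zip(weights, uvs))
    v_s = sum(wt * uv[1] for wt, uv in zip(weights, uvs))
    return (u_s / one_s, v_s / one_s)


uvs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
ws = (1.0, 3.0, 1.0)            # second vertex is three times as deep
bary = (0.5, 0.5, 0.0)          # screen-space midpoint of the first edge
print(interpolate_uv_naive(bary, uvs))
print(interpolate_uv_perspective(bary, uvs, ws))
```

At the screen-space midpoint of an edge whose far end has three times the w of its near end, the naïve result is u = 0.5 while the perspective-correct result is approximately 0.25. When all three w values are equal the two methods agree.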
Simply transforming primitives into screen space and rasterizing them does not quite work by itself. This is because primitives that are outside the view volume—particularly, primitives that are behind the eye—can end up being rasterized, leading to incorrect results. For instance, consider the triangle shown in Figure 9.9. Two vertices are in the view volume, but the third is behind the eye. The projection transformation maps this vertex to a nonsensical location behind the far plane, and if this is allowed to happen, the triangle will be rasterized incorrectly. For this reason, rasterization has to be preceded by a clipping operation that removes parts of primitives that could extend behind the eye.
Clipping is a common operation in graphics, needed whenever one geometric entity “cuts” another. For example, if you clip a triangle against the plane x = 0, the plane cuts the triangle into two parts if the signs of the x-coordinates of the vertices are not all the same. In most applications of clipping, the portion of the triangle on the “wrong” side of the plane is discarded. This operation for a single plane is shown in Figure 9.10.
In clipping to prepare for rasterization, the “wrong” side is the side outside the view volume. It is always safe to clip away all geometry outside the view volume—that is, clipping against all six faces of the volume—but many systems manage to get away with only clipping against the near plane.
This section discusses the basic implementation of a clipping module. Those interested in implementing an industrial-speed clipper should see the book by Blinn mentioned in the notes at the end of this chapter.
The two most common approaches for implementing clipping are
1. In world coordinates using the six planes that bound the truncated viewing pyramid,
2. In the 4D transformed space before the homogeneous divide.
Either possibility can be effectively implemented (J. Blinn, 1996) using the following approach for each triangle:
for each of six planes do
    if (triangle entirely outside of plane) then
        break (triangle is not visible)
    else if triangle spans plane then
        clip triangle
        if (quadrilateral is left) then
            break into two triangles
Option 1 has a straightforward implementation. The only question is, “What are the six plane equations?” Because these equations are the same for all triangles rendered in the single image, we do not need to compute them very efficiently. For this reason, we can just invert the transform shown in Figure 7.12 and apply it to the eight vertices of the transformed view volume:
(x, y, z) = (l, b, n), (r, b, n), (l, t, n), (r, t, n), (l, b, f), (r, b, f), (l, t, f), (r, t, f).
The plane equations can be inferred from here. Alternatively, we can use vector geometry to get the planes directly from the viewing parameters.
Surprisingly, the option usually implemented is that of clipping in homogeneous coordinates before the divide. Here, the view volume is 4D, and it is bounded by 3D volumes (hyperplanes). These are
–x + lw = 0,
x – rw = 0,
–y + bw = 0,
y – tw = 0,
–z + nw = 0,
z – fw = 0.
These planes are quite simple, so the efficiency is better than for Option 1. They still can be improved by transforming the view volume [l, r] × [b, t] × [f, n] to [0, 1]^{3}. It turns out that the clipping of the triangles is not much more complicated than in 3D.
No matter which option we choose, we must clip against a plane. Recall from Section 2.7.5 that the implicit equation for a plane through point q with normal n is
f(p) = n · (p – q) = 0.
This is often written
f(p) = n · p + D = 0. (9.5)
Interestingly, this equation not only describes a 3D plane, but also describes a line in 2D and the volume analog of a plane in 4D. All of these entities are usually called planes in their appropriate dimension.
If we have a line segment between points a and b, we can “clip” it against a plane using the techniques for cutting the edges of 3D triangles in BSP tree programs described in Section 12.4.3. Here, the points a and b are tested to determine whether they are on opposite sides of the plane f(p) = 0 by checking whether f(a) and f(b) have different signs. Typically, f(p) < 0 is defined to be “inside” the plane, and f(p) > 0 is “outside” the plane. If the plane does split the line, then we can solve for the intersection point by substituting the equation for the parametric line,
p = a + t(b – a),
into the f(p) = 0 plane of Equation (9.5). Writing the plane as n · p + D = 0 with D = –n · q, this yields
n · (a + t(b – a)) + D = 0.
Solving for t gives
t = (n · a + D) / (n · (a – b)).
We can then find the intersection point and “shorten” the line.
To clip a triangle, we again can follow Section 12.4.3 to produce one or two triangles.
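Clipping against a single plane can be sketched in Python as follows. The plane is assumed to be given as n · p + D = 0 with f(p) < 0 inside, and points are tuples of any dimension, so the same code applies to the 4D case. Note that the triangle routine here is a Sutherland-Hodgman style polygon clip followed by a fan triangulation, which is a substitute for, not a transcription of, the method of Section 12.4.3; all function names are illustrative.

```python
def plane_value(n, D, p):
    """Evaluate f(p) = n . p + D; negative values are inside."""
    return sum(ni * pi for ni, pi in zip(n, p)) + D


def intersect(a, b, fa, fb):
    """Point where segment ab crosses the plane; t = fa / (fa - fb)."""
    t = fa / (fa - fb)
    return tuple(ai + t * (bi - ai) for ai, bi in zip(a, b))


def clip_segment(a, b, n, D):
    """Return the inside part of segment ab, or None if it is entirely outside."""
    fa, fb = plane_value(n, D, a), plane_value(n, D, b)
    if fa > 0 and fb > 0:
        return None
    if fa <= 0 and fb <= 0:
        return (a, b)
    hit = intersect(a, b, fa, fb)
    return (a, hit) if fa <= 0 else (hit, b)


def clip_triangle(tri, n, D):
    """Return zero, one, or two triangles covering the inside part of tri."""
    kept = []
    for i in range(3):
        a, b = tri[i], tri[(i + 1) % 3]
        fa, fb = plane_value(n, D, a), plane_value(n, D, b)
        if fa <= 0:
            kept.append(a)
        if (fa < 0 < fb) or (fb < 0 < fa):
            kept.append(intersect(a, b, fa, fb))
    if len(kept) < 3:
        return []
    # A quadrilateral is split into two triangles by fanning from kept[0].
    return [(kept[0], kept[i], kept[i + 1]) for i in range(1, len(kept) - 1)]


# Clip against the plane x = 0, keeping the side with x < 0.
n, D = (1.0, 0.0, 0.0), 0.0
print(clip_segment((-1.0, 0.0, 0.0), (3.0, 4.0, 0.0), n, D))
print(len(clip_triangle(((-2.0, 0.0, 0.0), (2.0, 0.0, 0.0), (-2.0, 4.0, 0.0)), n, D)))
```

The expression fa / (fa – fb) is the same as (n · a + D) / (n · (a – b)), since the D terms cancel in the denominator. A full clipper would apply `clip_triangle` once per plane to every triangle that survives the previous plane.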
Before a primitive can be rasterized, the vertices that define it must be in screen coordinates, and the colors or other attributes that are supposed to be interpolated across the primitive must be known. Preparing this data is the job of the vertex-processing stage of the pipeline. In this stage, incoming vertices are transformed by the modeling, viewing, and projection transformations, mapping them from their original coordinates into screen space (where, recall, position is measured in terms of pixels). At the same time, other information, such as colors, surface normals, or texture coordinates, is transformed as needed; we’ll discuss these additional attributes in the examples below.
After rasterization, further processing is done to compute a color and depth for each fragment. This processing can be as simple as just passing through an interpolated color and using the depth computed by the rasterizer; or it can involve complex shading operations. Finally, the blending phase combines the fragments generated by the (possibly several) primitives that overlapped each pixel to compute the final color. The most common blending approach is to choose the color of the fragment with the smallest depth (closest to the eye).
The purposes of the different stages are best illustrated by examples.
The simplest possible pipeline does nothing in the vertex or fragment stages, and in the blending stage, the color of each fragment simply overwrites the value of the previous one. The application supplies primitives directly in pixel coordinates, and the rasterizer does all the work. This basic arrangement is the essence of many simple, older APIs for drawing user interfaces, plots, graphs, and other 2D content. Solid color shapes can be drawn by specifying the same color for all vertices of each primitive, and our model pipeline also supports smoothly varying color using interpolation.
To draw objects in 3D, the only change needed to the 2D drawing pipeline is a single matrix transformation: the vertex-processing stage multiplies the incoming vertex positions by the product of the modeling, camera, projection, and viewport matrices, resulting in screen-space triangles that are then drawn in the same way as if they’d been specified directly in 2D.
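The vertex-processing step of this minimal pipeline amounts to one matrix product followed by the homogeneous divide. The Python sketch below uses plain nested lists for 4×4 matrices. The `viewport_matrix` helper is an assumption made for the example (a common form that maps the canonical square [–1, 1]² to pixel coordinates with pixel centers at integers); in a real pipeline the combined matrix would also include the modeling, camera, and projection matrices.

```python
def mat_mul(A, B):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transform_vertex(M, p):
    """Apply the combined matrix M to a 3D point and do the homogeneous divide."""
    v = (p[0], p[1], p[2], 1.0)
    x_r, y_r, z_r, w_r = (sum(M[i][k] * v[k] for k in range(4)) for i in range(4))
    return (x_r / w_r, y_r / w_r, z_r / w_r)


def viewport_matrix(nx, ny):
    """Hypothetical viewport matrix for an nx-by-ny image (an assumption of this sketch)."""
    return [[nx / 2, 0.0, 0.0, (nx - 1) / 2],
            [0.0, ny / 2, 0.0, (ny - 1) / 2],
            [0.0, 0.0, 1.0, 0.0],
            [0.0, 0.0, 0.0, 1.0]]


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
# Stand-in for viewport * projection * camera * modeling.
M = mat_mul(viewport_matrix(4, 4), identity)
print(transform_vertex(M, (-1.0, -1.0, 0.0)))
print(transform_vertex(M, (1.0, 1.0, 0.0)))
```

Because the four matrices are multiplied together once, each vertex costs a single matrix-vector product regardless of how many transformations are composed.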
One problem with the minimal 3D pipeline is that in order to get occlusion relationships correct—to get nearer objects in front of farther away objects—primitives must be drawn in back-to-front order. This is known as the painter’s algorithm for hidden surface removal, by analogy to painting the background of a painting first, and then painting the foreground over it. The painter’s algorithm is a perfectly valid way to remove hidden surfaces, but it has several drawbacks. It cannot handle triangles that intersect one another, because there is no correct order in which to draw them. Similarly, several triangles, even if they don’t intersect, can still be arranged in an occlusion cycle, as shown in Figure 9.11, another case in which the back-to-front order does not exist. And most importantly, sorting the primitives by depth is slow, especially for large scenes, and disturbs the efficient flow of data that makes object-order rendering so fast. Figure 9.12 shows the result of this process when the objects are not sorted by depth.
In practice, the painter’s algorithm is rarely used; instead, a simple and effective hidden surface removal algorithm known as the z-buffer algorithm is used. The method is very simple: at each pixel, we keep track of the distance to the closest surface that has been drawn so far, and we throw away fragments that are farther away than that distance. The closest distance is stored by allocating an extra value for each pixel, in addition to the red, green, and blue color values, which is known as the depth, or z-value. The depth buffer, or z-buffer, is the name for the grid of depth values.
The z-buffer algorithm is implemented in the fragment blending phase, by comparing the depth of each fragment with the current value stored in the z-buffer. If the fragment’s depth is closer, both its color and its depth value overwrite the values currently in the color and depth buffers. If the fragment’s depth is farther away, it is discarded. To ensure that the first fragment will pass the depth test, the z-buffer is initialized to the maximum depth (the depth of the far plane). Irrespective of the order in which surfaces are drawn, the same fragment will win the depth test, and the image will be the same.
Of course there can be ties in the depth test, in which case the order may well matter.
The z-buffer algorithm requires each fragment to carry a depth. This is done simply by interpolating the z-coordinate as a vertex attribute, in the same way that color or other attributes are interpolated.
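A minimal Python sketch of the depth test in the blending stage is shown below. It assumes that smaller depth values are closer to the viewer and that depths are floats with the far plane at 1.0; the function names and the fragment layout (x, y, depth, color) are choices made for this example.

```python
def make_buffers(width, height, far_depth=1.0, background=(0, 0, 0)):
    """Create a color buffer and a depth buffer initialized to the far depth."""
    color = [[background] * width for _ in range(height)]
    depth = [[far_depth] * width for _ in range(height)]
    return color, depth


def blend_fragment(color, depth, x, y, frag_depth, frag_color):
    """Keep the fragment only if it is strictly closer than what is stored."""
    if frag_depth < depth[y][x]:
        depth[y][x] = frag_depth
        color[y][x] = frag_color
        return True
    return False


def render(fragments, width, height):
    """Blend a sequence of (x, y, depth, color) fragments into a color buffer."""
    color, depth = make_buffers(width, height)
    for x, y, z, c in fragments:
        blend_fragment(color, depth, x, y, z, c)
    return color


fragments = [
    (0, 0, 0.8, (255, 0, 0)),   # far red fragment
    (0, 0, 0.3, (0, 255, 0)),   # near green fragment at the same pixel
    (1, 0, 0.5, (0, 0, 255)),
]
print(render(fragments, 2, 1) == render(list(reversed(fragments)), 2, 1))
```

With distinct depths the result does not depend on drawing order. With equal depths, the strict comparison lets the first fragment drawn win, which is one way the order can matter for ties.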
The z-buffer is such a simple and practical way to deal with hidden surfaces in object-order rendering that it is by far the dominant approach. It is much simpler than geometric methods that cut surfaces into pieces that can be sorted by depth, because it avoids solving any problems that don’t need to be solved. The depth order only needs to be determined at the locations of the pixels, and that is all that the z-buffer does. It is universally supported by hardware graphics pipelines and is also the most commonly used method for software pipelines. Figures 9.13 and 9.14 show example results.
In practice, the z-values stored in the buffer are nonnegative integers. This is preferable to true floats because the fast memory needed for the z-buffer is somewhat expensive and is worth keeping to a minimum.
The use of integers can cause some precision problems. If we use an integer range having B values {0, 1, ..., B – 1}, we can map 0 to the near clipping plane z = n and B – 1 to the far clipping plane z = f. Note that, for this discussion, we assume z, n, and f are positive. This gives the same results as the negative case, but the details of the argument are easier to follow. We send each z-value to a “bucket” with depth Δz = (f – n)/B. We would not use the integer z-buffer if memory were not at a premium, so it is useful to make B as small as possible.
If we allocate b bits to store the z-value, then B = 2^{b}. We need enough bits to make sure any triangle in front of another triangle will have its depth mapped to distinct depth bins.
For example, if you are rendering a scene where triangles have a separation of at least one meter, then Δz < 1 should yield images without artifacts. There are two ways to make Δz smaller: move n and f closer together or increase b. If b is fixed, as it may be in APIs or on particular hardware platforms, adjusting n and f is the only option.
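As a rough sketch of this reasoning, the following function finds the smallest bit count b for which the bucket depth Δz = (f – n)/2^{b} falls below a required separation. The function name and the example numbers are assumptions chosen for illustration, and the calculation ignores the perspective effects discussed next.

```python
def bits_needed(n, f, min_separation):
    """Smallest b such that (f - n) / 2**b is strictly below min_separation."""
    b = 0
    while (f - n) / 2**b >= min_separation:
        b += 1
    return b


# Near plane at 1, far plane at 1001, surfaces at least 1 unit apart.
b = bits_needed(1.0, 1001.0, 1.0)
bucket_depth = (1001.0 - 1.0) / 2**b
```

Moving n and f closer together lowers the result, which mirrors the two options named above: shrink the depth range or add bits.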
The precision of z-buffers must be handled with great care when perspective images are created. The value Δz above is used after the perspective divide. Recall from Section 8.3 that the result of the perspective divide is
z = n + f – fn/z_{w}.
The actual bin depth is related to z_{w}, the world depth, rather than z, the post-perspective divide depth. We can approximate the bin size by differentiating both sides:
Δz ≈ fn Δz_{w} / z_{w}^{2}.
Bin sizes vary in depth. The bin size in world space is
Δz_{w} ≈ z_{w}^{2} Δz / (fn).
Note that the quantity Δz is as previously discussed. The biggest bin will be for z = f, where
Δz^{max}_{w} ≈ f Δz / n.
Note that choosing n = 0, a natural choice if we don’t want to lose objects right in front of the eye, will result in an infinitely large bin—a very bad condition. To make Δz^{max}_{w} as small as possible, we want to minimize f and maximize n. Thus, it is always important to choose n and f carefully.
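To get a feel for how strongly the bin size depends on n and f, here is a small sketch. It assumes the perspective mapping z = n + f – fn/z_{w} with positive z, n, and f, so that the world-space bin size is approximately z_{w}^{2} Δz / (fn); the function name and the numbers are illustrative only.

```python
def world_bin_size(z_w, n, f, bits):
    """Approximate world-space depth of the z-buffer bin at world depth z_w.

    Assumes z = n + f - f*n/z_w after the perspective divide, with
    2**bits equal buckets of depth dz between z = n and z = f.
    """
    dz = (f - n) / 2**bits
    return z_w**2 * dz / (f * n)


# 16-bit buffer, far plane at 1000: compare bins at the near and far planes.
near_bin = world_bin_size(1.0, 1.0, 1000.0, 16)
far_bin = world_bin_size(1000.0, 1.0, 1000.0, 16)

# Pushing the near plane out from 1 to 10 shrinks the largest bin roughly tenfold.
far_bin_larger_n = world_bin_size(1000.0, 10.0, 1000.0, 16)
```

The bin at the far plane is larger than the bin at the near plane by a factor of (f/n)^{2}, which is why a very small n is so costly.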
So far the application sending triangles into the pipeline is responsible for setting the color; the rasterizer just interpolates the colors and they are written directly into the output image. For some applications, this is sufficient, but in many cases, we want 3D objects to be drawn with shading, using the same illumination equations that we used for image-order rendering in Chapter 4. Recall that these equations require a light direction, an eye direction, and a surface normal to compute the color of a surface.
One way to handle shading computations is to perform them in the vertex stage. The application provides normal vectors at the vertices, and the positions and colors of the lights are provided separately (they don’t vary across the surface, so they don’t need to be specified for each vertex). For each vertex, the direction to the viewer and the direction to each light are computed based on the positions of the camera, the lights, and the vertex. The desired shading equation is evaluated to compute a color, which is then passed to the rasterizer as the vertex color. Per-vertex shading is sometimes called Gouraud shading.
One decision to be made is the coordinate system in which shading computations are done. World space or eye space are good choices. It is important to choose a coordinate system that is orthonormal when viewed in world space, because shading equations depend on angles between vectors, which are not preserved by operations like nonuniform scale that are often used in the modeling transformation, or perspective projection, often used in the projection to the canonical view volume. Shading in eye space has the advantage that we don’t need to keep track of the camera position: in eye space, the camera is always at the origin (in perspective projection), or the view direction is always +z (in orthographic projection).
Per-vertex shading has the disadvantage that it cannot produce any details in the shading that are smaller than the primitives used to draw the surface, because it only computes shading once for each vertex and never in between vertices. For instance, in a room with a floor that is drawn using two large triangles and illuminated by a light source in the middle of the room, shading will be evaluated only at the corners of the room, and the interpolated value will likely be much too dark in the center. Also, curved surfaces that are shaded with specular highlights must be drawn using primitives small enough that the highlights can be resolved.
Figure 9.15 shows our two spheres drawn with per-vertex shading.
To avoid the interpolation artifacts associated with per-vertex shading, we can avoid interpolating colors by performing the shading computations after the interpolation, in the fragment stage. In per-fragment shading, the same shading equations are evaluated, but they are evaluated for each fragment using interpolated vectors, rather than for each vertex using the vectors from the application.
In per-fragment shading, the geometric information needed for shading is passed through the rasterizer as attributes, so the vertex stage must coordinate with the fragment stage to prepare the data appropriately. One approach is to interpolate the eye-space surface normal and the eye-space vertex position, which then can be used just as they would in per-vertex shading.
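The difference between the two approaches can be seen numerically in the floor example. The sketch below is an illustration under stated assumptions: it uses only a Lambertian term max(0, n · l) with no distance falloff, works in world space, and evaluates the midpoint of the floor's diagonal, where interpolation reduces to averaging the two diagonal corners. All names and dimensions are hypothetical.

```python
import math


def normalize(v):
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)


def diffuse(position, normal, light_pos):
    """Lambertian term max(0, n . l) for a point light (no falloff)."""
    l = normalize(tuple(lp - p for lp, p in zip(light_pos, position)))
    n = normalize(normal)
    return max(0.0, sum(a * b for a, b in zip(n, l)))


def lerp(a, b, t):
    return tuple((1 - t) * x + t * y for x, y in zip(a, b))


light = (0.0, 2.0, 0.0)        # light above the middle of the room
up = (0.0, 1.0, 0.0)           # floor normal
corner_a = (-5.0, 0.0, -5.0)   # two opposite corners of the floor
corner_b = (5.0, 0.0, 5.0)

# Per-vertex: shade at the vertices, then interpolate the shaded values.
per_vertex_center = 0.5 * diffuse(corner_a, up, light) + 0.5 * diffuse(corner_b, up, light)

# Per-fragment: interpolate position and normal, then shade.
center_pos = lerp(corner_a, corner_b, 0.5)
center_nrm = lerp(up, up, 0.5)
per_fragment_center = diffuse(center_pos, center_nrm, light)
```

Here the per-fragment value at the center is full brightness, while the per-vertex value is only about a quarter of that, because the corners see the light at a grazing angle.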
Figure 9.16 shows our two spheres drawn with per-fragment shading.
Per-fragment shading is sometimes called Phong shading, which is confusing because the same name is attached to the Phong illumination model.
Textures (discussed in Chapter 11) are images that are used to add extra detail to the shading of surfaces that would otherwise look too homogeneous and artificial. The idea is simple: each time shading is computed, we read one of the values used in the shading computation—the diffuse color, for instance—from a texture instead of using the attribute values that are attached to the geometry being rendered. This operation is known as a texture lookup: the shading code specifies a texture coordinate, a point in the domain of the texture, and the texture-mapping system finds the value at that point in the texture image and returns it. The texture value is then used in the shading computation.
The most common way to define texture coordinates is simply to make the texture coordinate another vertex attribute. Each primitive then knows where it lives in the texture.
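A texture lookup can be sketched as follows. This is a simplified illustration, not the method developed in Chapter 11: it assumes texture coordinates (u, v) in the unit square, a texture stored as a nested list of RGB tuples, and nearest-texel lookup with clamping at the borders. The function names are hypothetical.

```python
def texture_lookup(texture, u, v):
    """Return the texel nearest to (u, v), with (u, v) assumed to lie in [0, 1]^2."""
    height = len(texture)
    width = len(texture[0])
    i = min(max(int(u * width), 0), width - 1)    # column index, clamped
    j = min(max(int(v * height), 0), height - 1)  # row index, clamped
    return texture[j][i]


def shade_diffuse(texture, uv, n_dot_l):
    """Use the texture value as the diffuse color in a Lambertian shading term."""
    kd = texture_lookup(texture, uv[0], uv[1])
    return tuple(c * max(0.0, n_dot_l) for c in kd)


black, white = (0.0, 0.0, 0.0), (1.0, 1.0, 1.0)
checker = [[black, white],
           [white, black]]
shaded = shade_diffuse(checker, (0.75, 0.25), 0.5)
```

In per-fragment shading, `uv` would be the interpolated texture coordinate attribute of the fragment.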
The decision about where to place shading computations depends on how fast the color changes—the scale of the details being computed. Shading with large-scale features, such as diffuse shading on curved surfaces, can be evaluated fairly infrequently and then interpolated: it can be computed with a low shading frequency. Shading that produces small-scale features, such as sharp highlights or detailed textures, needs to be evaluated at a high shading frequency. For details that need to look sharp and crisp in the image, the shading frequency needs to be at least one shading sample per pixel.
So large-scale effects can safely be computed in the vertex stage, even when the vertices defining the primitives are many pixels apart. Effects that require a high shading frequency can also be computed at the vertex stage, as long as the vertices are close together in the image; alternatively, they can be computed at the fragment stage when primitives are larger than a pixel.
For example, a hardware pipeline as used in a computer game, generally using primitives that cover several pixels to ensure high efficiency, normally does most shading computations per fragment. On the other hand, the PhotoRealistic RenderMan system does all shading computations per vertex, after first subdividing, or dicing, all surfaces into small quadrilaterals called micropolygons that are about the size of pixels. Since the primitives are small, per-vertex shading in this system achieves a high shading frequency that is suitable for detailed shading.
Just as with ray tracing, rasterization will produce jagged lines and triangle edges if we make an all-or-nothing determination of whether each pixel is inside the primitive or not. In fact, the set of fragments generated by the simple triangle rasterization algorithms described in this chapter, sometimes called standard or aliased rasterization, is exactly the same as the set of pixels that would be mapped to that triangle by a ray tracer that sends one ray through the center of each pixel. Also as in ray tracing, the solution is to allow pixels to be partly covered by a primitive (Crow, 1978). In practice, this form of blurring helps visual quality, especially in animations. This is shown as the top line of Figure 9.17.
There are a number of different approaches to antialiasing in rasterization applications. Just as with a ray tracer, we can produce an antialiased image by setting each pixel value to the average color of the image over the square area belonging to the pixel, an approach known as box filtering. This means we have to think of all drawable entities as having well-defined areas. For example, the line in Figure 9.17 can be thought of as approximating a one-pixel-wide rectangle.
There are better filters than the box, but a box filter will suffice for all but the most demanding applications.
The easiest way to implement box-filter antialiasing is by supersampling: create images at very high resolutions and then downsample. For example, if our goal is a 256 × 256 pixel image of a line with width 1.2 pixels, we could rasterize a rectangle version of the line with width 4.8 pixels on a 1024 × 1024 screen, and then average 4 × 4 groups of pixels to get the colors for each of the 256 × 256 pixels in the “shrunken” image. This is an approximation of the actual box-filtered image, but works well when objects are not extremely small relative to the distance between pixels.
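The downsampling step can be sketched as a block average. The example below assumes a grayscale image stored as a nested list whose dimensions are multiples of the supersampling factor; the function name is hypothetical.

```python
def downsample(image, factor):
    """Average factor x factor blocks of a 2D list of gray values."""
    height = len(image) // factor
    width = len(image[0]) // factor
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            total = 0.0
            for dy in range(factor):
                for dx in range(factor):
                    total += image[y * factor + dy][x * factor + dx]
            row.append(total / (factor * factor))
        out.append(row)
    return out


# An 8 x 8 aliased image of a vertical edge covering the two leftmost columns.
high_res = [[1.0 if x < 2 else 0.0 for x in range(8)] for y in range(8)]
low_res = downsample(high_res, 4)   # 2 x 2 result
```

The output pixel that the edge passes through receives the value 0.5, reflecting that half of its area is covered.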
Supersampling is quite expensive, however. Because the very sharp edges that cause aliasing are normally caused by the edges of primitives, rather than sudden variations in shading within a primitive, a widely used optimization is to sample visibility at a higher rate than shading. If information about coverage and depth is stored for several points within each pixel, very good antialiasing can be achieved even if only one color is computed. In systems like RenderMan that use per-vertex shading, this is achieved by rasterizing at high resolution: it is inexpensive to do so because shading is simply interpolated to produce colors for the many fragments, or visibility samples. In systems with per-fragment shading, such as hardware pipelines, multisample antialiasing is achieved by storing for each fragment a single color plus a coverage mask and a set of depth values.
The strength of object-order rendering, that it requires a single pass over all the geometry in the scene, is also a weakness for complex scenes. For instance, in a model of an entire city, only a few buildings are likely to be visible at any given time. A correct image can be obtained by drawing all the primitives in the scene, but a great deal of effort will be wasted processing geometry that is behind the visible buildings, or behind the viewer, and therefore doesn’t contribute to the final image.
Identifying and throwing away invisible geometry to save the time that would be spent processing it is known as culling. Three commonly implemented culling strategies (often used in tandem) are
view volume culling—the removal of geometry that is outside the view volume;
occlusion culling—the removal of geometry that may be within the view volume but is obscured, or occluded, by other geometry closer to the camera;
backface culling—the removal of primitives facing away from the camera.
We will briefly discuss view volume culling and backface culling, but culling in high performance systems is a complex topic; see Akenine-Möller, Haines, and Hoffman (2008) for a complete discussion and for information about occlusion culling.
When an entire primitive lies outside the view volume, it can be culled, since it will produce no fragments when rasterized. If we can cull many primitives with a quick test, we may be able to speed up drawing significantly. On the other hand, testing primitives individually to decide exactly which ones need to be drawn may cost more than just letting the rasterizer eliminate them.
View volume culling, also known as view frustum culling, is especially helpful when many triangles are grouped into an object with an associated bounding volume. If the bounding volume lies outside the view volume, then so do all the triangles that make up the object. For example, if we have 1000 triangles bounded by a single sphere with center c and radius r, we can check whether the sphere lies outside the clipping plane,
(p – a) · n = 0,
where a is a point on the plane, n is the plane’s outward-facing normal vector, and p is a variable. This is equivalent to checking whether the signed distance from the center of the sphere c to the plane is greater than +r. This amounts to the check that
(c – a) · n / ‖n‖ > r.
Note that the sphere may overlap the plane even in a case where all the triangles do lie outside the plane. Thus, this is a conservative test. How conservative the test is depends on how well the sphere bounds the object.
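The sphere test can be written directly from the signed-distance description. In this sketch the plane is given by a point a and a normal n that is assumed to point toward the outside of the view volume and need not be unit length; the function name is hypothetical.

```python
import math


def sphere_outside_plane(c, r, a, n):
    """True if the sphere with center c and radius r lies entirely on the
    outside of the plane through point a with outward normal n."""
    n_length = math.sqrt(sum(x * x for x in n))
    signed_distance = sum((ci - ai) * ni for ci, ai, ni in zip(c, a, n)) / n_length
    return signed_distance > r


plane_point = (0.0, 0.0, 0.0)
plane_normal = (2.0, 0.0, 0.0)   # outside is the +x half-space

far_outside = sphere_outside_plane((3.0, 0.0, 0.0), 1.0, plane_point, plane_normal)
straddling = sphere_outside_plane((0.5, 0.0, 0.0), 1.0, plane_point, plane_normal)
inside = sphere_outside_plane((-5.0, 0.0, 0.0), 1.0, plane_point, plane_normal)
```

Only the first sphere can be culled against this plane. The straddling sphere must be kept even if its triangles all happen to lie outside, which is what makes the test conservative.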
The same idea can be applied hierarchically if the scene is organized in one of the spatial data structures described in Chapter 12.
When polygonal models are closed, i.e., they bound a closed space with no holes, they are often assumed to have outward facing normal vectors as discussed in Chapter 5. For such models, the polygons that face away from the eye are certain to be overdrawn by polygons that face the eye. Thus, those polygons can be culled before the pipeline even starts.
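One common way to test whether a triangle faces away from the eye is to compare its normal with the direction to the eye. The sketch below rests on assumptions not stated in the text: a perspective view with a known eye position, and triangle vertices listed counterclockwise when seen from outside the model, so that the cross product of two edges gives the outward normal. All names are hypothetical.

```python
def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def is_backfacing(p0, p1, p2, eye):
    """True if the triangle's outward normal points away from the eye
    (edge-on triangles are also reported as backfacing)."""
    normal = cross(sub(p1, p0), sub(p2, p0))
    return dot(normal, sub(eye, p0)) <= 0


# This triangle's outward normal is +z.
tri = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
seen_from_front = is_backfacing(*tri, eye=(0.0, 0.0, 5.0))
seen_from_behind = is_backfacing(*tri, eye=(0.0, 0.0, -5.0))
```

This test is only safe for closed models; for open surfaces, a backfacing polygon may be the only thing visible at a pixel.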
I’ve often seen clipping discussed at length, and it is a much more involved process than that described in this chapter. What is going on here?
The clipping described in this chapter works, but lacks optimizations that an industrial-strength clipper would have. These optimizations are discussed in detail in Blinn’s definitive work listed in the chapter notes.
How are polygons that are not triangles rasterized?
These can either be done directly scan-line by scan-line, or they can be broken down into triangles. The latter appears to be the more popular technique.
Is it always better to antialias?
No. Some images look crisper without antialiasing. Many programs use unantialiased “screen fonts” because they are easier to read.
The documentation for my API talks about “scene graphs” and “matrix stacks.” Are these part of the graphics pipeline?
The graphics pipeline is certainly designed with these in mind, and whether we define them as part of the pipeline is a matter of taste. This book delays their discussion until Chapter 12.
Is a uniform distance z-buffer better than the standard one that includes perspective matrix nonlinearities?
It depends. One “feature” of the nonlinearities is that the z-buffer has more resolution near the eye and less in the distance. If a level-of-detail system is used, then geometry in the distance is coarser and the “unfairness” of the z-buffer can be a good thing.
Is a software z-buffer ever useful?
Yes. Most of the movies that use 3D computer graphics have used a variant of the software z-buffer developed by Pixar (Cook, Carpenter, & Catmull, 1987).
A wonderful book about designing a graphics pipeline is Jim Blinn’s Corner: A Trip Down the Graphics Pipeline (J. Blinn, 1996). Many nice details of the pipeline and culling are in 3D Game Engine Design (Eberly, 2000) and Real-Time Rendering (Akenine-Möller et al., 2008).
1. Suppose that in the perspective transform, we have n = 1 and f = 2. Under what circumstances will we have a “reversal” where a vertex before and after the perspective transform flips from in front of to behind the eye or vice versa?
2. Is there any reason not to clip in x and y after the perspective divide (see Figure 11.2, stage 3)?
3. Derive the incremental form of the midpoint line-drawing algorithm with colors at endpoints for 0 < m ≤ 1.
4. Modify the triangle-drawing algorithm so that it will draw exactly one pixel for points on a triangle edge which goes through (x, y) = (–1, –1).
5. Suppose you are designing an integer z-buffer for flight simulation where all of the objects are at least one meter thick, are never closer to the viewer than 4 m, and may be as far away as 100 km. How many bits are needed in the z-buffer to ensure there are no visibility errors? Suppose that visibility errors only matter near the viewer, i.e., for distances less than 100 m. How many bits are needed in that case?
In graphics, we often deal with functions of a continuous variable: an image is the first example you have seen, but you will encounter many more as you continue your exploration of graphics. By their nature, continuous functions can’t be directly represented in a computer; we have to somehow represent them using a finite number of bits. One of the most useful approaches to representing continuous functions is to use samples of the function: just store the values of the function at many different points and reconstruct the values in between when and if they are needed.
You are by now familiar with the idea of representing an image using a two-dimensional grid of pixels—so you have already seen a sampled representation! Think of an image captured by a digital camera: the actual image of the scene that was formed by the camera’s lens is a continuous function of the position on the image plane, and the camera converted that function into a two-dimensional grid of samples. Mathematically, the camera converted a function of type ℝ^{2} → C (where C is the set of colors) to a two-dimensional array of color samples, or a function of type ℤ^{2} → C.
Another example of a sampled representation is a 2D digitizing tablet, such as the screen of a tablet computer or a separate pen tablet used by an artist. In this case, the original function is the motion of the stylus, which is a time-varying 2D position, or a function of type ℝ → ℝ^{2}. The digitizer measures the position of the stylus at many points in time, resulting in a sequence of 2D coordinates, or a function of type ℤ → ℝ^{2} . A motion capture system does exactly the same thing for a special marker attached to an actor’s body: it takes the 3D position of the marker over time (ℝ → ℝ^{3}) and makes it into a series of instantaneous position measurements (ℤ → ℝ^{3}).
Going up in dimension, a medical CT scanner, used to non-invasively examine the interior of a person’s body, measures density as a function of position inside the body. The output of the scanner is a 3D grid of density values: it converts the density of the body (ℝ^{3} → ℝ) to a 3D array of real numbers (ℤ^{3} → ℝ).
These examples seem different, but in fact they can all be handled using exactly the same mathematics. In all cases, a function is being sampled at the points of a lattice in one or more dimensions, and in all cases, we need to be able to reconstruct that original continuous function from the array of samples.
From the example of a 2D image, it may seem that the pixels are enough, and we never need to think about continuous functions again once the camera has discretized the image. But what if we want to make the image larger or smaller on the screen, particularly by non-integer scale factors? It turns out that the simplest algorithms to do this perform badly, introducing obvious visual artifacts known as aliasing. Explaining why aliasing happens and understanding how to prevent it require the mathematics of sampling theory. The resulting algorithms are rather simple, but the reasoning behind them, and the details of making them perform well, can be subtle.
Representing continuous functions in a computer is, of course, not unique to graphics; nor is the idea of sampling and reconstruction. Sampled representations are used in applications from digital audio to computational physics, and graphics is just one (and by no means the first) user of the related algorithms and mathematics. The fundamental facts about how to do sampling and reconstruction have been known in the field of communications since the 1920s and were stated in exactly the form we use them by the 1940s (Shannon & Weaver, 1964).
This chapter starts by summarizing sampling and reconstruction using the concrete one-dimensional example of digital audio. Then, we go on to present the basic mathematics and algorithms that underlie sampling and reconstruction in one and two dimensions. Finally, we go into the details of the frequency-domain viewpoint, which provides many insights into the behavior of these algorithms.
Although sampled representations had already been in use for years in telecommunications, the introduction of the compact disc in 1982, following the increased use of digital recording for audio in the previous decade, was the first highly visible consumer application of sampling.
In audio recording, a microphone converts sound, which exists as pressure waves in the air, into a time-varying voltage that amounts to a measurement of the changing air pressure at the point where the microphone is located. This electrical signal needs to be stored somehow so that it may be played back at a later time and sent to a loudspeaker that converts the voltage back into pressure waves by moving a diaphragm in synchronization with the voltage.
The digital approach to recording the audio signal (Figure 10.1) uses sampling: an analog-to-digital converter (A/D converter, or ADC) measures the voltage many thousand times per second, generating a stream of integers that can easily be stored on any number of media, say a disk on a computer in the recording studio, or transmitted to another location, say the memory in a portable audio player. At playback time, the data are read out at the appropriate rate and sent to a digital-to-analog converter (D/A converter, or DAC). The DAC produces a voltage according to the numbers it receives, and, provided we take enough samples to fairly represent the variation in voltage, the resulting electrical signal is, for all practical purposes, identical to the input.
It turns out that the number of samples per second required to end up with a good reproduction depends on how high-pitched the sounds are that we are trying to record. A sample rate that works fine for reproducing a string bass or a kick drum produces bizarre-sounding results if we try to record a piccolo or a cymbal; but those sounds are reproduced just fine with a higher sample rate. To avoid these undersampling artifacts, the digital audio recorder filters the input to the ADC to remove high frequencies that can cause problems.
Another kind of problem arises on the output side. The DAC produces a voltage that changes whenever a new sample comes in, but stays constant until the next sample, producing a stair-step-shaped graph. These stair-steps act like noise, adding a high-frequency, signal-dependent buzzing sound. To remove this reconstruction artifact, the digital audio player filters the output from the DAC to smooth out the waveform.
The digital audio recording chain can serve as a concrete model for the sampling and reconstruction processes that happen in graphics. The same kind of under-sampling and reconstruction artifacts also happens with images or other sampled signals in graphics, and the solution is the same: filtering before sampling and filtering again during reconstruction.
A concrete example of the kind of artifacts that can arise from too-low sample frequencies is shown in Figure 10.2. Here, we are sampling a simple sine wave using two different sample frequencies: 10.8 samples per cycle on the top and 1.2 samples per cycle on the bottom. The higher rate produces a set of samples that obviously capture the signal well, but the samples resulting from the lower sample rate are indistinguishable from samples of a low-frequency sine wave—in fact, faced with this set of samples the low-frequency sinusoid seems the more likely interpretation.
Once the sampling has been done, it is impossible to know which of the two signals—the fast or the slow sine wave—was the original, and therefore, there is no single method that can properly reconstruct the signal in both cases. Because the high-frequency signal is “pretending to be” a low-frequency signal, this phenomenon is known as aliasing.
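The ambiguity can be checked numerically. In the sketch below (function name hypothetical), a sine wave sampled at 1.2 samples per cycle produces exactly the same sample values as a negated sine wave that takes 6 samples per cycle, because sin(2πk/1.2) = sin(2πk – 2πk/6) = –sin(2πk/6) for every integer k.

```python
import math


def sample_sine(samples_per_cycle, count):
    """Samples of a unit-amplitude sine taken `samples_per_cycle` times per cycle."""
    return [math.sin(2 * math.pi * k / samples_per_cycle) for k in range(count)]


fast = sample_sine(1.2, 12)                  # undersampled high-frequency sine
slow = [-s for s in sample_sine(6.0, 12)]    # negated low-frequency sine

largest_difference = max(abs(f - s) for f, s in zip(fast, slow))
```

The two sample lists agree to within floating-point rounding, so no reconstruction method working from the samples alone can tell the two signals apart.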
Aliasing shows up whenever flaws in sampling and reconstruction lead to artifacts at surprising frequencies. In audio, aliasing takes the form of odd-sounding extra tones: a bell ringing at 10 kHz, after being sampled at 8 kHz, turns into a 2 kHz tone. In images, aliasing often takes the form of moiré patterns that result from the interaction of the sample grid with regular features in an image, for instance, the window blinds in Figure 10.34.
Another example of aliasing in a synthetic image is the familiar stair-stepping on straight lines that are rendered with only black and white pixels (Figure 10.34). This is an example of small-scale features (the sharp edges of the lines) creating artifacts at a different scale (for shallow-slope lines, the stair steps are very long).
The basic issues of sampling and reconstruction can be understood simply based on features being too small or too large, but some more quantitative questions are harder to answer:
What sample rate is high enough to ensure good results?
What kinds of filters are appropriate for sampling and reconstruction?
What degree of smoothing is required to avoid aliasing?
Solid answers to these questions will have to wait until we have developed the theory fully in Section 10.5.
Before we discuss algorithms for sampling and reconstruction, we’ll first examine the mathematical concept on which they are based: convolution. It is a simple concept that underlies the algorithms used for sampling, filtering, and reconstruction, and it is also the basis of how we will analyze these algorithms throughout this chapter.
Convolution is an operation on functions: it takes two functions and combines them to produce a new function. In this book, the convolution operator is denoted by a star: the result of applying convolution to the functions f and g is f * g. We say that f is convolved with g, and f * g is the convolution of f and g.
Convolution can be applied either to continuous functions (functions f(x) that are defined for any real argument x) or to discrete sequences (functions a[i] that are defined only for integer arguments i). It can also be applied to functions defined on one-dimensional, two-dimensional, or higher-dimensional domains (i.e., functions of one, two, or more arguments). We will start with the discrete, one-dimensional case and then continue to continuous functions and two- and three-dimensional functions.
For convenience in the definitions, we generally assume that the functions’ domains go on forever, although of course in practice they will have to stop somewhere, and we have to handle the endpoints in a special way.
To get a basic picture of convolution, consider the example of smoothing a 1D function using a moving average (Figure 10.3). To get a smoothed value at any point, we compute the average of the function over a range extending a distance r in each direction. The distance r, called the radius of the smoothing operation, is a parameter that controls how much smoothing happens.
We can state this idea mathematically for discrete or continuous functions. If we’re smoothing a continuous function g(x), averaging means integrating g over an interval and then dividing by the length of the interval:
h(x) = (1/(2r)) ∫_{x–r}^{x+r} g(t) dt.
On the other hand, if we’re smoothing a discrete function a[i], averaging means summing a for a range of indices and dividing by the number of values:
c[i] = (1/(2r + 1)) Σ_{j=i–r}^{i+r} a[j].    (10.1)
In each case, the normalization constant is chosen so that if we smooth a constant function, the result will be the same function.
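For the discrete case, the moving average might be sketched as follows. The function name is hypothetical, and the sketch simply refuses to evaluate positions where the averaging window would run past the ends of the sequence.

```python
def moving_average(a, i, r):
    """Average of the 2r + 1 values a[i - r], ..., a[i + r]."""
    if i - r < 0 or i + r >= len(a):
        raise IndexError("averaging window extends past the ends of the sequence")
    window = a[i - r : i + r + 1]
    return sum(window) / (2 * r + 1)


constant = [3.0] * 7
spike = [0.0, 0.0, 0.0, 6.0, 0.0, 0.0, 0.0]

smoothed_constant = moving_average(constant, 3, 2)
smoothed_spike = moving_average(spike, 3, 1)
```

Smoothing the constant sequence returns the same constant, which is the property the normalization by 2r + 1 is chosen to guarantee.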
This idea of a moving average is the essence of convolution; the only difference is that in convolution, the moving average is a weighted average.
We will start with the most concrete case of convolution: convolving a discrete sequence a[i] with another discrete sequence b[i]. The result is a discrete sequence (a * b)[i]. The process is just like smoothing a with a moving average, but this time instead of equally weighting all samples within a distance r, we use a second sequence b to give a weight to each sample (Figure 10.4). The value b[i – j] gives the weight for the sample at position j, which is at a distance i – j from the index i where we are evaluating the convolution. Here is the definition of (a * b), expressed as a formula:
(a * b)[i] = Σ_{j} a[j] b[i – j].    (10.2)
By omitting bounds on j, we indicate that this sum runs over all integers (i.e., from –∞ to +∞). Figure 10.4 illustrates how one output sample is computed for an example filter b.
In graphics, one of the two functions will usually have finite support (as does the example in Figure 10.4), which means that it is nonzero only over a finite interval of argument values. If we assume that b has finite support, there is some radius r such that b[k] = 0 whenever |k| > r. In that case, we can write the sum above as
(a * b)[i] = Σ_{j=i–r}^{i+r} a[j] b[i – j],
and we can express the definition in code as
function convolve(sequence a, filter b, int i)
    s = 0
    r = b.radius
    for j = i – r to i + r do
        s = s + a[j] b[i – j]
    return s
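A runnable counterpart of this pseudocode is sketched below. It makes two assumptions that the pseudocode leaves open: the signal a is a list indexed from 0 and treated as zero outside its range, and the filter is a list of 2r + 1 weights in which entry k + r holds b[k].

```python
def convolve(a, b, r, i):
    """Evaluate (a * b)[i] for a filter b of radius r.

    a: list of samples, treated as zero outside 0..len(a)-1 (an assumption).
    b: list of 2r + 1 weights; b[k + r] stores the weight b[k] for -r <= k <= r.
    """
    s = 0.0
    for j in range(i - r, i + r + 1):
        if 0 <= j < len(a):
            s += a[j] * b[(i - j) + r]
    return s


# Convolving a single spike with an asymmetric filter reproduces the filter,
# centered on the spike.
spike = [0.0, 0.0, 1.0, 0.0, 0.0]
weights = [1.0, 2.0, 3.0]          # b[-1] = 1, b[0] = 2, b[1] = 3
result = [convolve(spike, weights, 1, i) for i in range(len(spike))]
```

Using an asymmetric filter in the example makes the index arithmetic b[i – j] visible; with a symmetric filter, an accidental b[j – i] would go unnoticed.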
Convolution is important because we can use it to perform filtering. Looking back at our first example of filtering, the moving average, we can now reinterpret that smoothing operation as convolution with a particular sequence. When we compute an average over some limited range of indices, that is the same as weighting the points in the range all identically and weighting the rest of the points with zeros. This kind of filter, which has a constant value over the interval where it is nonzero, is known as a box filter (because it looks like a rectangle if you draw its graph; see Figure 10.5). For a box filter of radius r, the weight is 1/(2r + 1):
b[k] = 1/(2r + 1) for –r ≤ k ≤ r, and b[k] = 0 otherwise.
If you substitute this filter into Equation (10.2), you will find that it reduces to the moving average in Equation (10.1).
As in this example, convolution filters are usually designed so that they sum to 1. That way, they don’t affect the overall level of the signal.
Example 20 (Convolution of a box and a step)
For a simple example of filtering, let the signal be the step function
a[i] = 1 for i ≥ 0, and a[i] = 0 for i < 0,
and the filter be the five-point box filter centered at zero,
b[k] = 1/5 for –2 ≤ k ≤ 2, and b[k] = 0 otherwise.
What is the result of convolving a and b? At a particular index i, as shown in Figure 10.6, the result is the average of the step function over the range from i – 2 to i + 2. If i < –2, we are averaging all zeros and the result is zero. If i ≥ 2, we are averaging all ones and the result is one. In between, there are i + 3 ones, resulting in the value (i + 3)/5. The output is a linear ramp that goes from 0 to 1 over five samples: (1/5)[..., 0, 0, 1, 2, 3, 4, 5, 5, ...].
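This example is easy to check by direct evaluation. The sketch below represents the step and the five-point box as Python functions of an integer index and uses exact fractions so that the ramp values come out without rounding; the function names are hypothetical.

```python
from fractions import Fraction


def step(i):
    """Step signal: 1 for i >= 0 and 0 otherwise."""
    return 1 if i >= 0 else 0


def box(k, r=2):
    """Box filter of radius r: weight 1/(2r + 1) for |k| <= r, else 0."""
    return Fraction(1, 2 * r + 1) if -r <= k <= r else Fraction(0)


def convolve_at(a, b, r, i):
    """(a * b)[i] for sequences given as functions, where b has radius r."""
    return sum(a(j) * b(i - j) for j in range(i - r, i + r + 1))


ramp = [convolve_at(step, box, 2, i) for i in range(-4, 5)]
```

For i from –4 to 4 this gives 0, 0, 1/5, 2/5, 3/5, 4/5, 1, 1, 1, matching the value (i + 3)/5 in the transition region.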
The way we’ve written it so far, convolution seems like an asymmetric operation: a is the sequence we’re smoothing, and b provides the weights. But one of the nice properties of convolution is that it actually doesn’t make any difference which is which: the filter and the signal are interchangeable. To see this, just rethink the sum in Equation (10.2) with the indices counting from the origin of the filter b, rather than from the origin of a. That is, we replace j with i – k. The result of this change of variable is
(a * b)[i] = Σ_{k} a[i – k] b[k] = Σ_{k} b[k] a[i – k].
This is exactly the same as Equation (10.2) but with a acting as the filter and b acting as the signal. So for any sequences a and b, (a * b) = (b * a), and we say that convolution is a commutative operation.^{1}
More generally, convolution is a “multiplication-like” operation. Like multiplication or addition of numbers or functions, neither the order of the arguments nor the placement of parentheses affects the result. Also, convolution relates to addition in the same way that multiplication does. To be precise, convolution is commutative and associative, and it is distributive over addition.
These properties are very natural if we think of convolution as being like multiplication, and they are very handy to know about because they can help us save work by simplifying convolutions before we actually compute them. For instance, suppose we want to take a sequence a and convolve it with three filters, b_{1}, b_{2}, and b_{3}—that is, we want ((a * b_{1}) * b_{2}) * b_{3}. If the sequence is long and the filters are short (that is, they have small radii), it is much faster to first convolve the three filters together (computing b_{1} * b_{2} * b_{3}) and finally to convolve the result with the signal, computing a * (b_{1} * b_{2} * b_{3}), which we know from associativity gives the same result.
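These algebraic properties can be checked on small examples. The sketch below assumes finite sequences stored as lists indexed from 0 and treated as zero elsewhere, so that the convolution of lists of lengths m and n is a list of length m + n – 1; the function name is hypothetical, and integer data keeps the comparison exact.

```python
def conv_full(a, b):
    """Full discrete convolution of two finite sequences (zero outside their lists)."""
    out = [0] * (len(a) + len(b) - 1)
    for i, a_i in enumerate(a):
        for j, b_j in enumerate(b):
            out[i + j] += a_i * b_j   # a[i] * b[j] contributes to output index i + j
    return out


a = [1, 2, 3, 4, 5, 6]
b1, b2, b3 = [1, 2, 1], [1, -1], [2, 0, 1]

# Filter the signal three times in a row ...
one_at_a_time = conv_full(conv_full(conv_full(a, b1), b2), b3)

# ... or combine the short filters first and filter the signal once.
combined_filter = conv_full(conv_full(b1, b2), b3)
all_at_once = conv_full(a, combined_filter)
```

Both routes produce the same sequence. When a is long, the second route touches the long sequence only once, which is where the savings come from.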
A very simple filter serves as an identity for discrete convolution: it is the discrete filter of radius zero, or the sequence d[i] = ..., 0, 0, 1, 0, 0,... (Figure 10.7). If we convolve d with a signal a, there will be only one nonzero term in the sum:
$$(a \star d)[i] = \sum_{j=i}^{j=i} a[j]\, d[i-j] = a[i].$$
^{1} You may have noticed that one of the functions in the convolution sum seems to be flipped over—that is, b[k] gives the weight for the sample k units earlier in the sequence, while b[–k] gives the weight for the sample k units later in the sequence. The reason for this has to do with ensuring associativity; see Exercise 4. Most of the filters we use are symmetric, so you hardly ever need to worry about this.
So clearly, convolving a with d just gives back a again. The sequence d is known as the discrete impulse. It is occasionally useful in expressing a filter: for instance, the process of smoothing a signal a with a filter b and then subtracting that from the original could be expressed as a single convolution with the filter d – b:
$$(a - a \star b) = a \star d - a \star b = a \star (d - b).$$
There is a second, entirely equivalent, way of interpreting Equation (10.2). Looking at the samples of a⋆b one at a time leads to the weighted-average interpretation that we have already seen. But if we omit the [i], we can instead think of the sum as adding together entire sequences. One piece of notation is required to make this work: if b is a sequence, then the same sequence shifted to the right by j places is called b_{→j} (Figure 10.8):
$$b_{\to j}[i] = b[i-j].$$
Then, we can write Equation (10.2) as a statement about the whole sequence (a * b) rather than element-by-element:
$$(a \star b) = \sum_{j} a[j]\, b_{\to j}.$$
Looking at it this way, the convolution is a sum of shifted copies of b, weighted by the entries of a (Figure 10.9). Because of commutativity, we can pick either a or b as the filter; if we choose b, then we are adding up one copy of the filter for every sample in the input.
While it is true that discrete sequences are what we actually work with in a computer program, these sampled sequences are supposed to represent continuous functions, and often we need to reason mathematically about the continuous functions in order to figure out what to do. For this reason, it is useful to define convolution between continuous functions and also between continuous and discrete functions.
The convolution of two continuous functions is the obvious generalization of Equation (10.2), with an integral replacing the sum:
$$(f \star g)(x) = \int_{-\infty}^{+\infty} f(t)\, g(x-t)\, dt. \tag{10.3}$$
One way of interpreting this definition is that the convolution of f and g, evaluated at the argument x, is the area under the curve of the product of the two functions after we shift g so that g(0) lines up with f(x). Just like in the discrete case, the convolution is a moving average, with the filter providing the weights for the average (see Figure 10.10).
Like discrete convolution, convolution of continuous functions is commutative and associative, and it is distributive over addition. Also as with the discrete case, the continuous convolution can be seen as a sum of copies of the filter rather than the computation of weighted averages. In this case, however, there are infinitely many copies of the filter g:
$$(f \star g) = \int_{-\infty}^{+\infty} f(t)\, g_{\to t}\, dt.$$
Example 21 (Convolution of two box functions)
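Let f be a box function:
$$f(x) = \begin{cases} 1, & -\tfrac{1}{2} \le x < \tfrac{1}{2}, \\ 0, & \text{otherwise}. \end{cases}$$
Then what is f * f ? The definition (Equation 10.3) gives
$$(f \star f)(x) = \int_{-\infty}^{+\infty} f(t)\, f(x-t)\, dt = \int_{-1/2}^{1/2} f(x-t)\, dt.$$
Figure 10.11 shows the two cases of this integral. The two boxes might have zero overlap, which happens when x ≤ –1 or x ≥ 1; in this case, the result is zero. When –1 < x < 1, the overlap depends on the separation between the two boxes, which is |x|; the result is 1 –|x|. So
$$(f \star f)(x) = \begin{cases} 1 - |x|, & -1 < x < 1, \\ 0, & \text{otherwise}. \end{cases}$$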
This function, known as the tent function, is another common filter (see Section 10.3.1).
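The result of this example can be confirmed numerically. The following sketch approximates the convolution integral with a midpoint Riemann sum, assuming the unit-width box on [–1/2, 1/2); the helper names and the integration range are illustrative choices.

```python
def box(x):
    """Unit-width box: 1 on [-1/2, 1/2), 0 elsewhere."""
    return 1.0 if -0.5 <= x < 0.5 else 0.0

def tent(x):
    """Tent function: 1 - |x| for |x| < 1, 0 elsewhere."""
    return max(0.0, 1.0 - abs(x))

def convolve_continuous(f, g, x, lo=-2.0, hi=2.0, n=4000):
    """Midpoint-rule approximation of the integral of f(t) g(x - t) over [lo, hi]."""
    dt = (hi - lo) / n
    total = 0.0
    for k in range(n):
        t = lo + (k + 0.5) * dt
        total += f(t) * g(x - t)
    return total * dt

samples = [-1.5, -1.0, -0.5, 0.0, 0.25, 0.75, 1.0]
approx = [convolve_continuous(box, box, x) for x in samples]
```

Each entry of `approx` agrees with `tent(x)` up to the discretization error of the sum, which is on the order of the step size `dt`.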
In discrete convolution, we saw that the discrete impulse d acted as an identity: d * a = a. In the continuous case, there is also an identity function, called the Dirac impulse or Dirac delta function, denoted δ(x).
Intuitively, the delta function is a very narrow, very tall spike that has infinitesimal width but still has area equal to 1 (Figure 10.12). The key defining property of the delta function is that multiplying it by a function selects out the value exactly at zero:
$$\int_{-\infty}^{+\infty} \delta(x)\, f(x)\, dx = f(0).$$
The delta function does not have a well-defined value at 0 (you can think of its value loosely as +∞), but it does have the value δ(x) = 0 for all x ≠ 0.
From this property of selecting out single values, it follows that the delta function is the identity for continuous convolution (Figure 10.13), because convolving δ with any function f yields
$$(\delta \star f)(x) = \int_{-\infty}^{+\infty} \delta(t)\, f(x-t)\, dt = f(x).$$
So δ * f = f (and because of commutativity f * δ = f also).
There are two ways to connect the discrete and continuous worlds. One is sampling: we convert a continuous function into a discrete one by writing down the function’s value at all integer arguments and forgetting about the rest. Given a continuous function f (x), we can sample it to convert to a discrete sequence a[i]:
$$a[i] = f(i).$$
Going the other way, from a discrete function, or sequence, to a continuous function, is called reconstruction. This is accomplished using yet another form of convolution, the discrete-continuous form. In this case, we are filtering a discrete sequence a[i] with a continuous filter f (x):
$$(a \star f)(x) = \sum_{i} a[i]\, f(x-i).$$
The value of the reconstructed function a * f at x is a weighted sum of the samples a[i] for values of i near x (Figure 10.14). The weights come from the filter f , which is evaluated at a set of points spaced one unit apart. For example, if x = 5.3 and f has radius 2, f is evaluated at 1.3, 0.3, –0.7, and –1.7. Note that for discrete-continuous convolution, we generally write the sequence first and the filter second, so that the sum is over integers.
As with discrete convolution, we can put bounds on the sum if we know the filter’s radius, r, eliminating all points where the difference between x and i is at least r:
Note that if a point falls exactly at distance r from x (i.e., if x – r turns out to be an integer), it will be left out of the sum. This is in contrast to the discrete case, where we included the point at i – r.
Expressed in code, this is
function reconstruct(sequence a, filter f, real x)
    s = 0
    r = f.radius
    for i = ⌈x – r⌉ to ⌊x + r⌋ do
        s = s + a[i]f(x – i)
    return s
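A runnable version of this routine might look as follows. This is a sketch under stated assumptions: the sequence is a Python list that is taken to be zero outside its bounds, the filter is passed as a plain function together with its radius, and the tent filter 1 – |x| is used as the example; none of these choices is prescribed by the pseudocode.

```python
import math

def tent(x):
    """Tent filter of radius 1."""
    return max(0.0, 1.0 - abs(x))

def reconstruct(a, f, r, x):
    """(a * f)(x): weighted sum of the samples a[i] for integers i within r of x."""
    s = 0.0
    for i in range(math.ceil(x - r), math.floor(x + r) + 1):
        if 0 <= i < len(a):          # assumption: samples outside the list are zero
            s += a[i] * f(x - i)
    return s

a = [1.0, 3.0, 2.0, 5.0]
at_sample = reconstruct(a, tent, 1.0, 1.0)     # lands exactly on sample 1
midway = reconstruct(a, tent, 1.0, 1.5)        # halfway between samples 1 and 2
off_center = reconstruct(a, tent, 1.0, 2.25)   # a quarter of the way from 2 to 3
```

With the tent filter, reconstruction is linear interpolation, so the three results are 3.0, 2.5, and 2.75. A sample lying exactly at distance r from x contributes nothing here because the tent is zero at ±1.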
As with the other forms of convolution, discrete-continuous convolution may be seen as summing shifted copies of the filter (Figure 10.15):
$$(a \star f) = \sum_{i} a[i]\, f_{\to i}.$$
Discrete-continuous convolution is closely related to splines. For uniform splines (a uniform B-spline, for instance), the parameterized curve for the spline is exactly the convolution of the spline’s basis function with the control point sequence (see Section 15.6.2).
So far, everything we have said about sampling and reconstruction has been one-dimensional: there has been a single variable x or a single sequence index i. Many of the important applications of sampling and reconstruction in graphics, though, are applied to two-dimensional functions—in particular, to 2D images. Fortunately, the generalization of sampling algorithms and theory from 1D to 2D, 3D, and beyond is conceptually very simple.
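Beginning with the definition of discrete convolution, we can generalize it to two dimensions by making the sum into a double sum:
$$(a \star b)[i, j] = \sum_{i'} \sum_{j'} a[i', j']\, b[i - i', j - j'].$$
If b is a finitely supported filter of radius r (i.e., it has (2r +1)^{2} values), then we can write this sum with bounds (Figure 10.16):
$$(a \star b)[i, j] = \sum_{i'=i-r}^{i+r} \sum_{j'=j-r}^{j+r} a[i', j']\, b[i - i', j - j'].$$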
^{2} Note that the term “Fourier transform” is used both for the function \hat{f} and for the operation that computes \hat{f} from f. Unfortunately, this rather ambiguous usage is standard.
function convolve2d(sequence2d a, filter2d b, int i, int j)
    s = 0
    r = b.radius
    for i′ = i – r to i + r do
        for j′ = j – r to j + r do
            s = s + a[i′][j′]b[i – i′][j – j′]
    return s
This definition can be interpreted in the same way as in the 1D case: each output sample is a weighted average of an area in the input, using the 2D filter as a “mask” to determine the weight of each sample in the average.
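The bounded double sum translates directly into Python. In this sketch the filter is assumed to be stored as a (2r + 1) × (2r + 1) list of lists, so that the weight b[k][l] for offsets k, l in –r..r lives at index [k + r][l + r]; that storage convention is an assumption of the example, not something the text specifies.

```python
def convolve2d(a, b, r, i, j):
    """One output sample of the 2D discrete convolution (a * b)[i, j]."""
    s = 0
    for ip in range(i - r, i + r + 1):
        for jp in range(j - r, j + r + 1):
            # weight for offset (i - ip, j - jp), shifted by r into list indices
            s += a[ip][jp] * b[i - ip + r][j - jp + r]
    return s

# A 5 x 5 image containing a single impulse at the center.
impulse = [[1 if (x, y) == (2, 2) else 0 for y in range(5)] for x in range(5)]

# A deliberately asymmetric 3 x 3 filter (radius 1).
b = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]

response = [[convolve2d(impulse, b, 1, i, j) for j in range(1, 4)]
            for i in range(1, 4)]
```

Convolving the impulse with the filter reproduces the filter itself, unflipped, which is a quick way to check that the index arithmetic is right.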
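Continuing the generalization, we can write continuous-continuous (Figure 10.17) and discrete-continuous (Figure 10.18) convolutions in 2D as well:
$$(f \star g)(x, y) = \iint f(x', y')\, g(x - x', y - y')\, dx'\, dy',$$
$$(a \star f)(x, y) = \sum_{i} \sum_{j} a[i, j]\, f(x - i, y - j).$$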
In each case, the result at a particular point is a weighted average of the input near that point. For the continuous-continuous case, it is a weighted integral over a region centered at that point, and in the discrete-continuous case, it is a weighted average of all the samples that fall near the point.
Once we have gone from 1D to 2D, it should be fairly clear how to generalize further to 3D or even to higher dimensions.
Now that we have the machinery of convolution, let’s examine some of the particular filters commonly used in graphics.
Each of the following filters has a natural radius, which is the default size to be used for sampling or reconstruction when samples are spaced one unit apart. In this section, filters are defined at this natural size: for instance, the box filter has a natural radius of 1/2, and the cubic filters have a natural radius of 2. We also arrange for each filter to integrate to 1: $\int_{-\infty}^{+\infty} f(x)\, dx = 1$, as required for sampling and reconstruction without changing a signal’s average value.
As we will see in Section 10.4.3, some applications require filters of different sizes, which can be obtained by scaling the basic filter. For a filter f (x), we can define a version of scale s:
$$f_{s}(x) = \frac{f(x/s)}{s}.$$
The filter is stretched horizontally by a factor of s and then squashed vertically by the same factor so that its area is unchanged. A filter that has a natural radius of r and is used at scale s has a radius of support sr (see Figure 10.20).
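The effect of scaling can be checked numerically. This sketch assumes the scaled filter is f(x/s)/s, uses the tent filter 1 – |x| as the base filter, and integrates with a simple midpoint rule; the helper names are illustrative.

```python
def tent(x):
    """Tent filter of natural radius 1."""
    return max(0.0, 1.0 - abs(x))

def scaled(f, s):
    """Return the filter f stretched horizontally by s and divided by s."""
    return lambda x: f(x / s) / s

def integrate(f, lo, hi, n=8000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    dx = (hi - lo) / n
    return sum(f(lo + (k + 0.5) * dx) for k in range(n)) * dx

wide = scaled(tent, 3.0)          # natural radius 1 at scale 3: support radius 3
area = integrate(wide, -4.0, 4.0)
```

The scaled tent has peak value 1/3, reaches zero at ±3, and still integrates to 1.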
The box filter (Figure 10.19) is a piecewise constant function whose integral is equal to one. As a discrete filter, it can be written as
$$a_{\text{box},r}[i] = \begin{cases} \dfrac{1}{2r+1}, & |i| \le r, \\ 0, & \text{otherwise}. \end{cases}$$
Note that for symmetry, we include both endpoints.
As a continuous filter, we write
$$f_{\text{box},r}(x) = \begin{cases} \dfrac{1}{2r}, & -r \le x < r, \\ 0, & \text{otherwise}. \end{cases}$$
In this case, we exclude one endpoint, which makes the box of radius 0.5 usable as a reconstruction filter. It is because the box filter is discontinuous that these boundary cases are important, and so for this particular filter, we need to pay attention to them. We write just f_{box} for the natural radius of 1/2.
The tent, or linear filter (Figure 10.20), is a continuous, piecewise linear function:
$$f_{\text{tent}}(x) = \begin{cases} 1 - |x|, & |x| < 1, \\ 0, & \text{otherwise}. \end{cases}$$
Its natural radius is 1. For filters, such as this one, that are at least C^{0} (i.e., there are no sudden jumps in the value, as there are with the box), we no longer need to separate the definitions of the discrete and continuous filters: the discrete filter is just the continuous filter sampled at the integers.
The Gaussian function (Figure 10.21), also known as the normal distribution, is an important filter theoretically and practically. We’ll see more of its special properties as this chapter goes on:
$$f_{g,\sigma}(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-x^{2}/2\sigma^{2}}.$$
The parameter σ is called the standard deviation. The Gaussian makes a good sampling filter because it is very smooth; we’ll make this statement more precise in Section 10.5.
The Gaussian filter does not have any particular natural radius; it is a useful sampling filter for a range of σ. The Gaussian also does not have a finite radius of support, although because of the exponential decay, its values rapidly become small enough to ignore. When necessary, then, we can trim the tails from the function by setting it to zero outside some radius r, resulting in a trimmed Gaussian. This means that the filter’s width and natural radius can vary depending on the application, and a trimmed Gaussian scaled by s is the same as an unscaled trimmed Gaussian with standard deviation sσ and radius sr. The best way to handle this in practice is to let σ and r be set as properties of the filter, fixed when the filter is specified, and then scale the filter just like any other when it is applied.
Good starting points are σ = 1 and r = 3.
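Many filters are defined as piecewise polynomials, and cubic filters with four pieces (natural radius of 2) are often used as reconstruction filters. One such filter is known as the B-spline filter (Figure 10.22) because of its origins as a blending function for spline curves (see Chapter 15):
$$f_{B}(x) = \frac{1}{6} \begin{cases} -3(1-|x|)^{3} + 3(1-|x|)^{2} + 3(1-|x|) + 1, & -1 \le x \le 1, \\ (2-|x|)^{3}, & 1 \le |x| \le 2, \\ 0, & \text{otherwise}. \end{cases}$$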
Among piecewise cubics, the B-spline is special because it has continuous first and second derivatives—that is, it is C^{2}. A more concise way of defining this filter is f_{B} = f_{box} * f_{box} * f_{box} * f_{box}; proving that the longer form above is equivalent is a nice exercise in convolution (see Exercise 3).
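Another piecewise cubic filter named for a spline, the Catmull–Rom filter (Figure 10.23), has the value zero at x = –2, –1, 1, and 2, which means it will interpolate the samples when used as a reconstruction filter (Section 10.3.2):
$$f_{C}(x) = \frac{1}{2} \begin{cases} -3(1-|x|)^{3} + 4(1-|x|)^{2} + (1-|x|), & -1 \le x \le 1, \\ (2-|x|)^{3} - (2-|x|)^{2}, & 1 \le |x| \le 2, \\ 0, & \text{otherwise}. \end{cases}$$
For the all-important application of resampling images, Mitchell and Netravali (1988) made a study of cubic filters and recommended one partway between the previous two filters as the best all-around choice (Figure 10.24). It is simply a weighted combination of the previous two filters:
$$f_{M}(x) = \frac{1}{3} f_{B}(x) + \frac{2}{3} f_{C}(x) = \frac{1}{18} \begin{cases} -21(1-|x|)^{3} + 27(1-|x|)^{2} + 9(1-|x|) + 1, & -1 \le x \le 1, \\ 7(2-|x|)^{3} - 6(2-|x|)^{2}, & 1 \le |x| \le 2, \\ 0, & \text{otherwise}. \end{cases}$$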
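The three cubic filters are short enough to write out and test. The sketch below assumes the usual piecewise-cubic forms of the B-spline and Catmull–Rom filters, written in terms of 1 – |x| and 2 – |x|, and takes the Mitchell–Netravali filter to be one-third B-spline plus two-thirds Catmull–Rom; the function names are illustrative.

```python
def f_bspline(x):
    """Cubic B-spline filter, natural radius 2."""
    ax = abs(x)
    if ax <= 1.0:
        u = 1.0 - ax
        return (-3 * u**3 + 3 * u**2 + 3 * u + 1) / 6
    if ax <= 2.0:
        return (2.0 - ax) ** 3 / 6
    return 0.0

def f_catmull_rom(x):
    """Catmull-Rom filter, natural radius 2; zero at all nonzero integers."""
    ax = abs(x)
    if ax <= 1.0:
        u = 1.0 - ax
        return (-3 * u**3 + 4 * u**2 + u) / 2
    if ax <= 2.0:
        v = 2.0 - ax
        return (v**3 - v**2) / 2
    return 0.0

def f_mitchell(x):
    """Mitchell-Netravali filter as a blend of the two filters above."""
    return f_bspline(x) / 3 + 2 * f_catmull_rom(x) / 3

def grid_sum(f, x):
    """Sum of f over the integer-spaced grid x - i, for i within the support."""
    return sum(f(x - i) for i in range(-3, 4))
```

The Catmull–Rom filter equals 1 at the origin and 0 at ±1 and ±2, and it dips below zero in between (for example, at x = 1.5). The B-spline peaks at 2/3 and the Mitchell–Netravali filter at 8/9, so neither of those passes exactly through the samples.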
Filters have some traditional terminology that goes with them, which we use to describe the filters and compare them to one another.
The impulse response of a filter is just another name for the function: it is the response of the filter to a signal that just contains an impulse (and recall that convolving with an impulse just gives back the filter).
A continuous filter is interpolating if, when it is used to reconstruct a continuous function from a discrete sequence, the resulting function takes on exactly the values of the samples at the sample points—that is, it “connects the dots” rather than producing a function that only goes near the dots. Interpolating filters are exactly those filters f for which f (0) = 1 and f (i) = 0 for all nonzero integers i (Figure 10.25).
A filter that takes on negative values has ringing or overshoot: it will produce extra oscillations in the value around sharp changes in the value of the function being filtered.
For instance, the Catmull–Rom filter has negative lobes on either side, and if you filter a step function with it, it will exaggerate the step a bit, resulting in function values that undershoot 0 and overshoot 1 (Figure 10.26).
A continuous filter is ripple free if, when used as a reconstruction filter, it will reconstruct a constant sequence as a constant function (Figure 10.27). This is equivalent to the requirement that the filter sum to one on any integer-spaced grid:
$$\sum_{i} f(x - i) = 1 \quad \text{for all } x.$$
All the filters in Section 10.3.1 are ripple-free at their natural radii, except the Gaussian, but none of them are necessarily ripple-free when they are used at a non-integer scale. If it is necessary to eliminate ripple in discrete-continuous convolution, it is easy to do so: divide each computed sample by the sum of the weights used to compute it:
$$(\overline{a \star f})(x) = \frac{\sum_{i} a[i]\, f(x-i)}{\sum_{i} f(x-i)}.$$
This expression can still be interpreted as convolution between a and a filter (see Exercise 6).
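Ripple, and the normalization that removes it, can be demonstrated with the tent filter. The sketch assumes the scaled filter f(x/s)/s and an input sequence given as a function of the integer index; all names are illustrative.

```python
import math

def tent(x):
    return max(0.0, 1.0 - abs(x))

def scaled(f, s):
    return lambda x: f(x / s) / s

def grid_sum(f, r, x):
    """Sum of the filter weights f(x - i) over the integers i within r of x."""
    lo, hi = math.ceil(x - r), math.floor(x + r)
    return sum(f(x - i) for i in range(lo, hi + 1))

def reconstruct_normalized(a, f, r, x):
    """Discrete-continuous convolution divided by the sum of the weights used."""
    num = den = 0.0
    for i in range(math.ceil(x - r), math.floor(x + r) + 1):
        w = f(x - i)
        num += a(i) * w
        den += w
    return num / den

wide = scaled(tent, 1.5)          # tent at a non-integer scale, radius 1.5
constant = lambda i: 7.0          # a constant input sequence
```

At its natural radius the tent's weights sum to 1 everywhere. At scale 1.5 the sum is about 1.11 at x = 0 and about 0.89 at x = 0.5, so a constant sequence would be reconstructed with ripple; dividing by the weight sum restores the constant.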
A continuous filter has a degree of continuity, which is the highest-order derivative that is defined everywhere. A filter, like the box filter, that has sudden jumps in its value is not continuous at all. A filter that is continuous but has sharp corners (discontinuities in the first derivative), such as the tent filter, has order of continuity zero, and we say it is C^{0}. A filter that has a continuous derivative (no sharp corners), such as the piecewise cubic filters in the previous section, is C^{1}; if its second derivative is also continuous, as is true of the B-spline filter, it is C^{2}. The order of continuity of a filter is particularly important for a reconstruction filter because the reconstructed function inherits the continuity of the filter.
So far we have only discussed filters for 1D convolution, but for images and other multidimensional signals, we need filters too. In general, any 2D function could be a 2D filter, and occasionally it is useful to define them this way. But, in most cases, we can build suitable 2D (or higher-dimensional) filters from the 1D filters we have already seen.
The most useful way of doing this is by using a separable filter. The value of a separable filter f_{2}(x, y) at a particular x and y is simply the product of f_{1} (the 1D filter) evaluated at x and at y:
$$f_{2}(x, y) = f_{1}(x)\, f_{1}(y).$$
Similarly, for discrete filters,
$$a_{2}[i, j] = a_{1}[i]\, a_{1}[j].$$
Any horizontal or vertical slice through f_{2} is a scaled copy of f_{1}. The integral of f_{2} is the square of the integral of f_{1}, so in particular, if f_{1} is normalized, then so is f_{2}.
Example 22 (The separable tent filter)
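If we choose the tent function for f_{1}, the resulting piecewise bilinear function (Figure 10.28) is
$$f_{2,\text{tent}}(x, y) = \begin{cases} (1 - |x|)(1 - |y|), & |x| < 1 \text{ and } |y| < 1, \\ 0, & \text{otherwise}. \end{cases}$$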
The profiles along the coordinate axes are tent functions, but the profiles along the diagonals are quadratics (for instance, along the line x = y in the positive quadrant, we see the quadratic function (1 – x)^{2}).
Example 23 (The 2D Gaussian filter)
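If we choose the Gaussian function for f_{1}, the resulting 2D function (Figure 10.29) is
$$f_{2,g}(x, y) = \frac{1}{2\pi\sigma^{2}}\, e^{-x^{2}/2\sigma^{2}}\, e^{-y^{2}/2\sigma^{2}} = \frac{1}{2\pi\sigma^{2}}\, e^{-(x^{2}+y^{2})/2\sigma^{2}}.$$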
Notice that this is (up to a scale factor) the same function we would get if we revolved the 1D Gaussian around the origin to produce a circularly symmetric function. The property of being both circularly symmetric and separable at the same time is unique to the Gaussian function. The profiles along the coordinate axes are Gaussians, but so are the profiles along any direction at any offset from the center.
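The key advantage of separable filters over other 2D filters has to do with efficiency in implementation. Let’s substitute the separable form b_{2}[i, j] = b_{1}[i] b_{1}[j] into the definition of discrete convolution:
$$(a \star b_{2})[i, j] = \sum_{i'} \sum_{j'} a[i', j']\, b_{1}[i - i']\, b_{1}[j - j'].$$
Note that b_{1}[i–i′] does not depend on j′ and can be factored out of the inner sum:
$$(a \star b_{2})[i, j] = \sum_{i'} b_{1}[i - i'] \sum_{j'} a[i', j']\, b_{1}[j - j'].$$
Let’s abbreviate the inner sum as S[i′]:
$$S[i'] = \sum_{j'} a[i', j']\, b_{1}[j - j'], \qquad (a \star b_{2})[i, j] = \sum_{i'} b_{1}[i - i']\, S[i']. \tag{10.5}$$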
With the equation in this form, we can first compute and store S[i′] for each value of i′ and then compute the outer sum using these stored values. At first glance, this does not seem remarkable, since we still had to do work proportional to (2r +1)^{2} to compute all the inner sums. However, it’s quite different if we want to compute the value at many points [i, j].
Suppose we need to compute a ⋆ b_{2} at [2, 2] and [3, 2], and b_{1} has a radius of 2. Examining Equation 10.5, we can see that we will need S[0],...,S[4] to compute the result at [2, 2], and we will need S[1],...,S[5] to compute the result at [3, 2]. So, in the separable formulation, we can just compute all six values of S and share S[1],...,S[4] (Figure 10.30).
This savings has great significance for large filters. Filtering an image with a filter of radius r in the general case requires computation of (2r +1)^{2} products per pixel, while filtering the image with a separable filter of the same size requires 2(2r +1) products (at the expense of some intermediate storage). This change in asymptotic complexity from O(r^{2}) to O(r) enables the use of much larger filters.
The algorithm is
function filterImage(image I, filter b)
    r = b.radius
    n_{x} = I.width
    n_{y} = I.height
    allocate storage array S[0 ... n_{x} – 1]
    allocate image I_{out}[r ... n_{x} – r – 1, r ... n_{y} – r – 1]
    initialize S and I_{out} to all zero
    for j = r to n_{y} – r – 1 do
        for i′ = 0 to n_{x} – 1 do
            S[i′] = 0
            for j′ = j – r to j + r do
                S[i′] = S[i′] + I[i′, j′]b[j – j′]
        for i = r to n_{x} – r – 1 do
            for i′ = i – r to i + r do
                I_{out}[i, j] = I_{out}[i, j] + S[i′]b[i – i′]
    return I_{out}
For simplicity, this function avoids all questions of boundaries by trimming r pixels off all four sides of the output image. In practice, there are various ways to handle the boundaries; see Section 10.4.3.
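The separable algorithm can be compared against the direct double sum on a small image. This is a sketch under a few assumptions: the image is a list of columns indexed as I[i][j], the 1D filter is a list of 2r + 1 weights with the weight for offset k stored at index k + r, and the trimmed output is kept in a dictionary keyed by (i, j). Integer data is used so that the two results can be compared exactly.

```python
def filter_image_separable(I, b, r):
    """Filter I with the separable filter b[i] * b[j], one row of output at a time."""
    nx, ny = len(I), len(I[0])
    out = {}
    for j in range(r, ny - r):
        # S[ip]: column ip filtered vertically around row j
        S = [sum(I[ip][jp] * b[j - jp + r] for jp in range(j - r, j + r + 1))
             for ip in range(nx)]
        for i in range(r, nx - r):
            out[(i, j)] = sum(S[ip] * b[i - ip + r]
                              for ip in range(i - r, i + r + 1))
    return out

def filter_image_direct(I, b, r):
    """Same result using the full (2r + 1)^2 double sum per pixel."""
    nx, ny = len(I), len(I[0])
    out = {}
    for j in range(r, ny - r):
        for i in range(r, nx - r):
            out[(i, j)] = sum(I[ip][jp] * b[i - ip + r] * b[j - jp + r]
                              for ip in range(i - r, i + r + 1)
                              for jp in range(j - r, j + r + 1))
    return out

image = [[3, 1, 4, 1],
         [5, 9, 2, 6],
         [5, 3, 5, 8],
         [9, 7, 9, 3],
         [2, 3, 8, 4]]          # 5 columns of 4 pixels each
b1 = [1, 2, 1]                  # unnormalized 3-tap filter, radius 1

separable = filter_image_separable(image, b1, 1)
direct = filter_image_direct(image, b1, 1)
```

Both functions produce the same values for the six interior pixels, but the separable version reuses each S[i′] for every output pixel in the row that needs it.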
We have discussed sampling, filtering, and reconstruction in the abstract so far, using mostly 1D signals for examples. But as we observed at the beginning of this chapter, the most important and most common application of signal processing in graphics is for sampled images. Let us look carefully at how all this applies to images.
Perhaps the simplest application of convolution is processing images using discrete convolution. Some of the most widely used features of image manipulation programs are simple convolution filters. Blurring of images can be achieved by convolving with many common low-pass filters, ranging from the box to the Gaussian (Figure 10.31). A Gaussian filter creates a very smooth-looking blur and is commonly used for this purpose.
The opposite of blurring is sharpening, and one way to do this is by using the “unsharp mask” procedure: subtract a fraction α of a blurred image from the original. With a rescaling to avoid changing the overall brightness, we have
$$I_{\text{sharp}} = (1 + \alpha) I - \alpha (f_{g,\sigma} \star I) = ((1 + \alpha) d - \alpha f_{g,\sigma}) \star I = f_{\text{sharp}}(\sigma, \alpha) \star I,$$
where f_{g,σ} is the Gaussian filter of width σ. Using the discrete impulse d and the distributive property of convolution, we are able to write this whole process as a single filter that depends on both the width of the blur and the degree of sharpening (Figure 10.32).
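A 1D version of the unsharp mask shows how the two steps collapse into one filter. In this sketch a three-tap filter (1/4, 1/2, 1/4) stands in for the Gaussian blur, and the signal is treated as zero outside its list; both are assumptions made to keep the example small and exact.

```python
from fractions import Fraction

def convolve_same(a, b, r):
    """(a * b)[i] for each index of a; b holds weights for offsets -r..r."""
    n = len(a)
    out = []
    for i in range(n):
        s = Fraction(0)
        for k in range(-r, r + 1):
            j = i - k
            if 0 <= j < n:               # assumption: a is zero outside the list
                s += a[j] * b[k + r]
        out.append(s)
    return out

alpha = Fraction(1, 2)
blur = [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)]   # stand-in for a Gaussian
d = [Fraction(0), Fraction(1), Fraction(0)]               # discrete impulse

# Single sharpening filter: (1 + alpha) d - alpha * blur
sharpen = [(1 + alpha) * di - alpha * bi for di, bi in zip(d, blur)]

a = [Fraction(v) for v in [0, 0, 0, 1, 1, 1]]             # a step edge
two_step = [(1 + alpha) * ai - alpha * si
            for ai, si in zip(a, convolve_same(a, blur, 1))]
one_step = convolve_same(a, sharpen, 1)
```

The combined filter is (–1/8, 5/4, –1/8); its weights sum to 1, so overall brightness is preserved. Applied to the step edge it gives –1/8 just before the edge and 9/8 just after, the overshoot that makes the edge look sharper.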
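Another example of combining two discrete filters is a drop shadow. It’s common to take a blurred, shifted copy of an object’s outline to create a soft drop shadow (Figure 10.33). We can express the shifting operation as convolution with an off-center impulse:
$$d_{m,n}(i, j) = \begin{cases} 1, & i = m \text{ and } j = n, \\ 0, & \text{otherwise}. \end{cases}$$
Shifting, then blurring, is achieved by convolving with both filters:
$$I_{\text{shadow}} = (I \star d_{m,n}) \star f_{g,\sigma} = I \star (d_{m,n} \star f_{g,\sigma}) = I \star f_{\text{shadow}}(m, n, \sigma).$$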
Here, we have used associativity to group the two operations into a single filter with three parameters.
In image synthesis, we often have the task of producing a sampled representation of an image for which we have a continuous mathematical formula (or at least a procedure we can use to compute the color at any point, not just at integer pixel positions). Ray tracing is a common example; more about ray tracing and the specific methods for antialiasing is in Chapter 4. In the language of signal processing, we have a continuous 2D signal (the image) that we need to sample on a regular 2D lattice. If we go ahead and sample the image without any special measures, the result will exhibit various aliasing artifacts (Figure 10.34). At sharp edges in the image, we see stair-step artifacts known as “jaggies.” In areas where there are repeating patterns, we see wide bands known as moiré patterns.
The problem here is that the image contains too many small-scale features; we need to smooth it out by filtering it before sampling. Looking back at the definition of continuous convolution in Equation (10.3), we need to average the image over an area around the pixel location, rather than just taking the value at a single point. The specific methods for doing this are discussed in Chapter 4. A simple filter like a box will improve the appearance of sharp edges, but it still produces some moiré patterns (Figure 10.35). The Gaussian filter, which is very smooth, is much more effective against the moiré patterns, at the expense of overall somewhat more blurring. These two examples illustrate the tradeoff between sharpness and aliasing that is fundamental to choosing antialiasing filters.
One of the most common image operations where careful filtering is crucial is resampling—changing the sample rate, or changing the image size.
Suppose we have taken an image with a digital camera that is 3000 by 2000 pixels in size, and we want to display it on a monitor that has only 1280 by 1024 pixels. In order to make it fit, while maintaining the 3:2 aspect ratio, we need to resample it to 1278 by 852 pixels. How should we go about this?
One way to approach this problem is to think of the process as dropping pixels: the size ratio is between 2 and 3, so we’ll have to drop out one or two pixels between pixels that we keep. It’s possible to shrink an image in this way, but the quality of the result is low—the images in Figure 10.34 were made using pixel dropping. Pixel dropping is very fast, however, and it is a reasonable choice to make a preview of the resized image during an interactive manipulation.
The way to think about resizing images is as a resampling operation: we want a set of samples of the image on a particular grid that is defined by the new image dimensions, and we get them by sampling a continuous function that is reconstructed from the input samples (Figure 10.36). Looking at it this way, it’s just a sequence of standard image processing operations: first, we reconstruct a continuous function from the input samples, and then, we sample that function just as we would sample any other continuous image. To avoid aliasing artifacts, appropriate filters need to be used at each stage.
A small example is shown in Figure 10.37: if the original image is 12 × 9 pixels and the new one is 8 × 6 pixels, there are 2/3 as many output pixels as input pixels in each dimension, so their spacing across the image is 3/2 the spacing of the original samples.
In order to come up with a value for each of the output samples, we need to somehow compute values for the image in between the samples. The pixel-dropping algorithm gives us one way to do this: just take the value of the closest sample in the input image and make that the output value. This is exactly equivalent to reconstructing the image with a 1-pixel-wide (radius one-half) box filter and then point sampling.
Of course, if the main reason for choosing pixel dropping or other very simple filtering is performance, one would never implement that method as a special case of the general reconstruction-and-resampling procedure. In fact, because of the discontinuities, it’s difficult to make box filters work in a general framework. But, for high-quality resampling, the reconstruction/sampling framework provides valuable flexibility.
To work out the algorithmic details, it’s simplest to drop down to 1D and discuss resampling a sequence. The simplest way to write an implementation is in terms of the reconstruct function we defined in Section 10.2.5.
function resample(sequence a, float x_{0}, float Δx, int n, filter f)
    create sequence b of length n
    for i = 0 to n – 1 do
        b[i] = reconstruct(a, f, x_{0} + iΔx)
    return b
The parameter x_{0} gives the position of the first sample of the new sequence in terms of the samples of the old sequence. That is, if the first output sample falls midway between samples 3 and 4 in the input sequence, x_{0} is 3.5.
This procedure reconstructs a continuous image by convolving the input sequence with a continuous filter and then point samples it. That’s not to say that these two operations happen sequentially—the continuous function exists only in principle, and its values are computed only at the sample points. But mathematically, this function computes a set of point samples of the function a * f.
This point sampling seems wrong, though, because we just finished saying that a signal should be sampled with an appropriate smoothing filter to avoid aliasing. We should be convolving the reconstructed function with a sampling filter g and point sampling g * (f * a). But since this is the same as (g * f) * a, we can roll the sampling filter together with the reconstruction filter; one convolution operation is all we need (Figure 10.38). This combined reconstruction and sampling filter is known as a resampling filter.
When resampling images, we usually specify a source rectangle in the units of the old image that specifies the part we want to keep in the new image. For example, using the pixel sample positioning convention from Chapter 3, the rectangle we’d use to resample the entire image is (–0.5, n_{x}^{old} – 0.5) × (–0.5, n_{y}^{old} – 0.5). Given a source rectangle (x_{l}, x_{h}) × (y_{l}, y_{h}), the sample spacing for the new image is Δx = (x_{h} – x_{l})/n_{x}^{new} in x and Δy = (y_{h} – y_{l})/n_{y}^{new} in y. The lower-left sample is positioned at (x_{l} +Δx/2,y_{l} +Δy/2).
Modifying the 1D pseudocode to use this convention and expanding the call to the reconstruct function into the double loop that is implied, we arrive at
function resample(sequence a, float x_{l}, float x_{h}, int n, filter f)
    create sequence b of length n
    r = f.radius
    Δx = (x_{h} – x_{l})/n
    x_{0} = x_{l} + Δx/2
    for i = 0 to n – 1 do
        s = 0
        x = x_{0} + iΔx
        for j = ⌈x – r⌉ to ⌊x + r⌋ do
            s = s + a[j]f(x – j)
        b[i] = s
    return b
This routine contains all the basics of resampling an image. One last issue that remains to be addressed is what to do at the edges of the image, where the simple version here will access beyond the bounds of the input sequence. There are several things we might do:
Just stop the loop at the ends of the sequence. This is equivalent to padding the image with zeros on all sides.
Clip all array accesses to the end of the sequence—that is, return a[0] when we would want to access a[–1]. This is equivalent to padding the edges of the image by extending the last row or column.
Modify the filter as we approach the edge so that it does not extend beyond the bounds of the sequence.
The first option leads to dim edges when we resample the whole image, which is not really satisfactory. The second option is easy to implement; the third is probably the best performing. The simplest way to modify the filter near the edge of the image is to renormalize it: divide the filter by the sum of the part of the filter that falls within the image. This way, the filter always adds up to 1 over the actual image samples, so it preserves image intensity. For performance, it is desirable to handle the band of pixels within a filter radius of the edge (which require this renormalization) separately from the center (which contains many more pixels and does not require renormalization).
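The third option, with renormalization, can be sketched in 1D as follows. The code assumes the filter is passed as a function with an explicit radius, uses the tent filter for an upsampling example, and applies the renormalization to every sample rather than handling the interior separately, which is simpler but slower than the split described above.

```python
import math

def tent(x):
    return max(0.0, 1.0 - abs(x))

def resample(a, x_l, x_h, n, f, r):
    """Resample the part of a covering (x_l, x_h) to n samples.

    Filter weights that fall outside the sequence are dropped, and the
    remaining weights are renormalized so that they sum to 1.
    """
    dx = (x_h - x_l) / n
    x0 = x_l + dx / 2
    b = []
    for i in range(n):
        x = x0 + i * dx
        s = w_sum = 0.0
        for j in range(math.ceil(x - r), math.floor(x + r) + 1):
            if 0 <= j < len(a):
                w = f(x - j)
                s += a[j] * w
                w_sum += w
        b.append(s / w_sum)
    return b

# Upsample a 4-sample ramp to 8 samples over the whole source interval.
ramp = resample([0.0, 1.0, 2.0, 3.0], -0.5, 3.5, 8, tent, 1.0)
flat = resample([5.0, 5.0, 5.0, 5.0], -0.5, 3.5, 8, tent, 1.0)
```

A constant input stays constant right up to the ends, instead of dimming as it would with zero padding. For the ramp, the two outermost samples take on the value of the nearest input sample, because only that sample lies within the filter's reach.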
The choice of filter for resampling is important. There are two separate issues: the shape of the filter and the size (radius). Because the filter serves both as a reconstruction filter and a sampling filter, the requirements of both roles affect the choice of filter. For reconstruction, we would like a filter smooth enough to avoid aliasing artifacts when we enlarge the image, and the filter should be ripple-free. For sampling, the filter should be large enough to avoid undersampling and smooth enough to avoid moiré artifacts. Figure 10.39 illustrates these two different needs.
Generally, we will choose one filter shape and scale it according to the relative resolutions of the input and output. The lower of the two resolutions determines the size of the filter: when the output is more coarsely sampled than the input (downsampling, or shrinking the image), the smoothing required for proper sampling is greater than the smoothing required for reconstruction, so we size the filter according to the output sample spacing (radius 3 in Figure 10.39). On the other hand, when the output is more finely sampled (upsampling, or enlarging the image), the smoothing required for reconstruction dominates (the reconstructed function is already smooth enough to sample at a higher rate than it started), so the size of the filter is determined by the input sample spacing (radius 1 in Figure 10.39).
Choosing the filter itself is a tradeoff between speed and quality. Common choices are the box filter (when speed is paramount), the tent filter (moderate quality), or a piecewise cubic (excellent quality). In the piecewise cubic case, the degree of smoothing can be adjusted by interpolating between f_{B} and f_{C}; the Mitchell–Netravali filter is a good choice.
Just as with image filtering, separable filters can provide a significant speedup. The basic idea is to resample all the rows first, producing an image with changed width but not height, then to resample the columns of that image to produce the final result (Figure 10.40). Modifying the pseudocode given earlier so that it takes advantage of this optimization is reasonably straightforward.
If you are only interested in implementation, you can stop reading here; the algorithms and recommendations in the previous sections will let you implement programs that perform sampling and reconstruction and achieve excellent results. However, there is a deeper mathematical theory of sampling with a history reaching back to the first uses of sampled representations in telecommunications. Sampling theory answers many questions that are difficult to answer with reasoning based strictly on scale arguments.
But most important, sampling theory gives valuable insight into the workings of sampling and reconstruction. It gives the student who learns it an extra set of intellectual tools for reasoning about how to achieve the best results with the most efficient code.
The Fourier transform, along with convolution, is the main mathematical concept that underlies sampling theory. You can read about the Fourier transform in many math books on analysis, as well as in books on signal processing.
The basic idea behind the Fourier transform is to express any function by adding together sine waves (sinusoids) of all frequencies. By using the appropriate weights for the different frequencies, we can arrange for the sinusoids to add up to any (reasonable) function we want.
As an example, the square wave in Figure 10.41 can be expressed by a sequence of sine waves:
$$\sum_{n=1,3,5,\ldots}^{\infty} \frac{4}{\pi n} \sin 2\pi n x.$$
This Fourier series starts with a sine wave (sin 2πx) that has frequency 1.0—same as the square wave—and the remaining terms add smaller and smaller corrections to reduce the ripples and, in the limit, reproduce the square wave exactly. Note that all the terms in the sum have frequencies that are integer multiples of the frequency of the square wave. This is because other frequencies would produce results that don’t have the same period as the square wave.
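To see the convergence numerically, here is a small sketch that sums the first few odd harmonics. It assumes the standard series for a square wave that alternates between +1 and −1 with period 1, whose n-th odd harmonic has weight 4/(πn); the function name is hypothetical.

```python
import math

def square_wave_partial_sum(x, n_terms):
    """Sum the first n_terms odd harmonics: 4/(pi n) sin(2 pi n x) for n = 1, 3, 5, ..."""
    total = 0.0
    for k in range(n_terms):
        n = 2 * k + 1  # odd multiples of the fundamental frequency
        total += 4.0 / (math.pi * n) * math.sin(2.0 * math.pi * n * x)
    return total
```

With a single term the value at x = 0.25 is 4/π, which overshoots the square wave; adding more terms brings the sum toward 1 there and toward −1 at x = 0.75.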
A surprising fact is that a signal does not have to be periodic in order to be expressed as a sum of sinusoids in this way: a non-periodic signal just requires more sinusoids. Rather than summing over a discrete sequence of sinusoids, we will instead integrate over a continuous family of sinusoids. For instance, a box function can be written as the integral of a family of cosine waves:
$$\int_{-\infty}^{\infty} \frac{\sin \pi u}{\pi u} \cos 2\pi u x \, du. \tag{10.6}$$
This integral in Equation (10.6) is adding up infinitely many cosines, weighting the cosine of frequency u by the weight (sin πu)/πu. The result, as we include higher and higher frequencies, converges to the box function (see Figure 10.42). When a function f is expressed in this way, this weight, which is a function of the frequency u, is called the Fourier transform of f, denoted f̂. The function f̂ tells us how to build f by integrating over a family of sinusoids:
$$f(x) = \int_{-\infty}^{\infty} \hat{f}(u) \, e^{2\pi i u x} \, du. \tag{10.7}$$
Equation (10.7) is known as the inverse Fourier transform (IFT) because it starts with the Fourier transform of f and ends up with f.^{2}
Note that in Equation (10.7), the complex exponential e^{2πiux} has been substituted for the cosine in the previous equation. Also, f̂ is a complex-valued function. The machinery of complex numbers is required to allow the phase, as well as the frequency, of the sinusoids to be controlled; this is necessary to represent any functions that are not symmetric across zero. The magnitude of f̂ is known as the Fourier spectrum, and, for our purposes, this is sufficient: we won’t need to worry about phase or use any complex numbers directly.
It turns out that computing f̂ from f looks very much like computing f from f̂:
$$\hat{f}(u) = \int_{-\infty}^{\infty} f(x) \, e^{-2\pi i u x} \, dx. \tag{10.8}$$
Equation (10.8) is known as the (forward) Fourier transform (FT). The sign in the exponential is the only difference between the forward and inverse Fourier transforms, and it is really just a technical detail. For our purposes, we can think of the FT and IFT as the same operation.
Sometimes, the f–f̂ notation is inconvenient, and then we will denote the Fourier transform of f by ℱ{f} and the inverse Fourier transform of f̂ by ℱ^{−1}{f̂}.
A function and its Fourier transform are related in many useful ways. A few facts (most of them easy to verify) that we will use later in this chapter are
A function and its Fourier transform have the same squared integral:
$$\int (f(x))^2 \, dx = \int |\hat{f}(u)|^2 \, du.$$
The physical interpretation is that the two have the same energy (Figure 10.43).
In particular, scaling a function up by a also scales its Fourier transform by a. That is, ℱ{af} = aℱ{f}.
Stretching a function along the x-axis squashes its Fourier transform along the u-axis by the same factor (Figure 10.44):
$$\mathcal{F}\{f(x/b)\} = b \, \hat{f}(bu).$$
(The renormalization by b is needed to keep the energy the same.)
This means that if we are interested in a family of functions of different width and height (say all box functions centered at zero), then we only need to know the Fourier transform of one canonical function (say the box function with width and height equal to one), and we can easily know the Fourier transforms of all the scaled and dilated versions of that function. For example, we can instantly generalize Equation (10.6) to give the Fourier transform of a box of width b and height a:
$$ab \, \frac{\sin \pi b u}{\pi b u}.$$
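The scaling and dilation rules can be checked numerically. The sketch below compares a midpoint-rule estimate of the Fourier transform of a centered box of width b and height a against the closed form ab sin(πbu)/(πbu) that follows from those rules. The function names and the choice of integration rule are assumptions made for illustration only.

```python
import math

def sinc_pi(u):
    """sin(pi u) / (pi u), using the limiting value 1 at u = 0."""
    return 1.0 if u == 0.0 else math.sin(math.pi * u) / (math.pi * u)

def box_transform_numeric(u, width=1.0, height=1.0, n=20000):
    """Midpoint-rule estimate of the Fourier transform of a centered box at frequency u."""
    dx = width / n
    total = 0.0
    for k in range(n):
        x = -0.5 * width + (k + 0.5) * dx
        # Only the cosine part is needed: the sine part cancels because the box is even.
        total += height * math.cos(2.0 * math.pi * u * x)
    return total * dx

def box_transform_closed_form(u, width=1.0, height=1.0):
    """Unit-box transform, dilated by the width and scaled by the height."""
    return height * width * sinc_pi(width * u)
```

At u = 0 both functions return the area of the box, which is the zero-frequency component.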
The integral of f over its whole domain (its average value, up to normalization) is equal to f̂(0). This makes sense since f̂(0) is supposed to be the zero-frequency component of the signal (the DC component if we are thinking of an electrical voltage).
If f is real (which it always is for us), f̂(−u) is the complex conjugate of f̂(u), so the magnitude of f̂ is an even function; that is, |f̂(u)| = |f̂(−u)|. Likewise, if f is an even function, then f̂ will be real (this is not usually the case in our domain, but remember that we really are only going to care about the magnitude of f̂).
One final property of the Fourier transform that deserves special mention is its relationship to convolution (Figure 10.45). Briefly,
$$\mathcal{F}\{f \star g\} = \hat{f} \hat{g}.$$
The Fourier transform of the convolution of two functions is the product of the Fourier transforms. Following the by now familiar symmetry,
$$\hat{f} \star \hat{g} = \mathcal{F}\{fg\}.$$
The convolution of two Fourier transforms is the Fourier transform of the product of the two functions. These facts are fairly straightforward to derive from the definitions.
This relationship is the main reason Fourier transforms are useful in studying the effects of sampling and reconstruction. We’ve seen how sampling, filtering, and reconstruction can be seen in terms of convolution; now the Fourier transform gives us a new domain—the frequency domain—in which these operations are simply products.
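The convolution property is easy to test in a discrete setting. The sketch below uses the discrete Fourier transform and circular convolution as a finite, periodic stand-in for the continuous definitions in this section; that substitution, and the function names, are assumptions made so the check can be run with exact finite sums.

```python
import cmath

def dft(signal):
    """Discrete Fourier transform of a finite sequence, computed directly from the definition."""
    n = len(signal)
    return [sum(signal[k] * cmath.exp(-2j * cmath.pi * m * k / n) for k in range(n))
            for m in range(n)]

def circular_convolve(a, b):
    """Convolution of two equal-length sequences, treating both as periodic."""
    n = len(a)
    return [sum(a[j] * b[(i - j) % n] for j in range(n)) for i in range(n)]

signal = [1.0, 2.0, 0.0, -1.0, 3.0, 0.0, 0.0, 0.0]
smoothing = [0.25, 0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0]

# Transform of the convolution versus product of the transforms.
lhs = dft(circular_convolve(signal, smoothing))
rhs = [x * y for x, y in zip(dft(signal), dft(smoothing))]
max_difference = max(abs(p - q) for p, q in zip(lhs, rhs))
```

The two sides agree up to floating-point rounding, so `max_difference` is tiny.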
Now that we have some facts about Fourier transforms, let’s look at some examples of individual functions. In particular, we’ll look at some filters from Section 10.3.1, which are shown with their Fourier transforms in Figure 10.46. We have already seen the box function:
$$\mathcal{F}\{f_{\mathrm{box}}\} = \frac{\sin \pi u}{\pi u} = \operatorname{sinc} \pi u.$$
The function^{3} sin x/x is important enough to have its own name, sinc x.
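The tent function is the convolution of the box with itself, so its Fourier transform is just the square of the Fourier transform of the box function:
$$\mathcal{F}\{f_{\mathrm{tent}}\} = \frac{\sin^2 \pi u}{\pi^2 u^2} = \operatorname{sinc}^2 \pi u.$$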
^{3} You may notice that sin πu/πu is undefined for u = 0. It is, however, continuous across zero, and we take it as understood that we use the limiting value of this ratio, 1, at u = 0.
We can continue this process to get the Fourier transform of the B-spline filter (see Exercise 3):
$$\mathcal{F}\{f_{B}\} = \frac{\sin^4 \pi u}{\pi^4 u^4} = \operatorname{sinc}^4 \pi u.$$
The Gaussian has a particularly nice Fourier transform:
$$\mathcal{F}\{f_{g}\} = e^{-(2\pi u)^2 / 2}.$$
It is another Gaussian! The Gaussian with standard deviation 1.0 becomes a Gaussian with standard deviation 1/(2π).
The reason impulses are useful in sampling theory is that we can use them to talk about samples in the context of continuous functions and Fourier transforms. We represent a sample, which has a position and a value, by an impulse translated to that position and scaled by that value. A sample at position a with value b is represented by bδ(x – a). This way we can express the operation of sampling the function f (x) at a as multiplying f by δ(x – a). The result is f (a)δ(x – a).
Sampling a function at a series of equally spaced points is therefore expressed as multiplying the function by the sum of a series of equally spaced impulses, called an impulse train (Figure 10.47). An impulse train with period T, meaning that the impulses are spaced a distance T apart, is
$$s_T(x) = \sum_{i=-\infty}^{\infty} \delta(x - Ti).$$
The Fourier transform of s_{1} is the same as s_{1}: a sequence of impulses at all integer frequencies. You can see why this should be true by thinking about what happens when we multiply the impulse train by a sinusoid and integrate. We wind up adding up the values of the sinusoid at all the integers. This sum will exactly cancel to zero for non-integer frequencies, and it will diverge to +∞ for integer frequencies.
Because of the dilation property of the Fourier transform, we can guess that the Fourier transform of an impulse train with period T (which is like a dilation of s_{1}) is an impulse train with period 1/T . Making the sampling finer in the space domain makes the impulses farther apart in the frequency domain.
Now that we have built the mathematical machinery, we need to understand the sampling and reconstruction process from the viewpoint of the frequency domain. The key advantage of introducing Fourier transforms is that it makes the effects of convolution filtering on the signal much clearer, and it provides more precise explanations of why we need to filter when sampling and reconstructing.
We start the process with the original, continuous signal. In general, its Fourier transform could include components at any frequency, although for most kinds of signals (especially images), we expect the content to decrease as the frequency gets higher. Images also tend to have a large component at zero frequency—remember that the zero-frequency, or DC, component is the integral of the whole image, and since images are all positive values this tends to be a large number.
Let’s see what happens to the Fourier transform if we sample and reconstruct without doing any special filtering (Figure 10.48). When we sample the signal, we model the operation as multiplication with an impulse train; the sampled signal is fs_{T}. Because of the multiplication-convolution property, the FT of the sampled signal is f̂ ⋆ ŝ_{T}.
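Recall that δ is the identity for convolution. This means that, up to a constant scale factor,
$$(\hat{f} \star \hat{s}_T)(u) = \sum_{i=-\infty}^{\infty} \hat{f}(u - i/T);$$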
that is, convolving with the impulse train makes a whole series of equally spaced copies of the spectrum of f . A good intuitive interpretation of this seemingly odd result is that all those copies just express the fact (as we saw back in Section 10.1.1) that frequencies that differ by an integer multiple of the sampling frequency are indistinguishable once we have sampled—they will produce exactly the same set of samples. The original spectrum is called the base spectrum, and the copies are known as alias spectra.
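The claim that such frequencies produce exactly the same samples can be checked directly. This is a minimal sketch with a hypothetical helper name: with sample spacing T = 0.25 the sample frequency is 4, so cosines of frequency 1, 5 = 1 + 4, and −3 = 1 − 4 should be indistinguishable after sampling.

```python
import math

def sample_cosine(frequency, sample_spacing, count):
    """Samples of cos(2 pi f x) taken at x = 0, T, 2T, ..."""
    return [math.cos(2.0 * math.pi * frequency * i * sample_spacing) for i in range(count)]

base = sample_cosine(1.0, 0.25, 16)
alias_up = sample_cosine(5.0, 0.25, 16)     # 1 + sample frequency
alias_down = sample_cosine(-3.0, 0.25, 16)  # 1 - sample frequency

largest_gap = max(
    max(abs(p - q) for p, q in zip(base, alias_up)),
    max(abs(p - q) for p, q in zip(base, alias_down)),
)
```

The three sample lists agree to within floating-point rounding, even though the underlying continuous signals are very different.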
The trouble begins if these copies of the signal’s spectrum overlap, which will happen if the signal contains any significant content beyond half the sample frequency. When this happens, the spectra add, and the information about different frequencies is irreversibly mixed up. This is the first place aliasing can occur, and if it happens here, it’s due to undersampling—using too low a sample frequency for the signal.
Suppose we reconstruct the signal using the nearest-neighbor technique. This is equivalent to convolving with a box of width 1. (The discrete-continuous convolution used to do this is the same as a continuous convolution with the series of impulses that represent the samples.) The convolution-multiplication property means that the spectrum of the reconstructed signal will be the product of the spectrum of the sampled signal and the spectrum of the box. The resulting reconstructed Fourier transform contains the base spectrum (though somewhat attenuated at higher frequencies), plus attenuated copies of all the alias spectra. Because the box has a fairly broad Fourier transform, these attenuated bits of alias spectra are significant, and they are the second form of aliasing, due to an inadequate reconstruction filter. These alias components manifest themselves in the image as the pattern of squares that is characteristic of nearest-neighbor reconstruction.
To do high-quality sampling and reconstruction, we have seen that we need to choose sampling and reconstruction filters appropriately. From the standpoint of the frequency domain, the purpose of low-pass filtering when sampling is to limit the frequency range of the signal so that the alias spectra do not overlap the base spectrum. Figure 10.49 shows the effect of sample rate on the Fourier transform of the sampled signal. Higher sample rates move the alias spectra farther apart, and eventually, whatever overlap is left does not matter.
The key criterion is that the width of the spectrum must be less than the distance between the copies—that is, the highest frequency present in the signal must be less than half the sample frequency. This is known as the Nyquist criterion, and the highest allowable frequency is known as the Nyquist frequency or Nyquist limit. The Nyquist–Shannon sampling theorem states that a signal whose frequencies do not exceed the Nyquist limit (or, said another way, a signal that is bandlimited to the Nyquist frequency) can, in principle, be reconstructed exactly from samples.
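As a small worked illustration, the criterion and its consequence for a single real sinusoid can be written as two helper functions. The folding rule in `apparent_frequency` is the standard consequence of the alias spectra for real sinusoids; it is not spelled out in the text, and both function names are hypothetical.

```python
def satisfies_nyquist(highest_frequency, sample_frequency):
    """True if the signal's highest frequency is below half the sample frequency."""
    return highest_frequency < 0.5 * sample_frequency

def apparent_frequency(frequency, sample_frequency):
    """Frequency in [0, sample_frequency / 2] that yields the same samples as the input."""
    folded = abs(frequency) % sample_frequency
    # Copies of the spectrum repeat every sample_frequency and are mirrored for real signals.
    return min(folded, sample_frequency - folded)
```

For example, at a sample frequency of 4 the Nyquist limit is 2, so a sinusoid of frequency 1 is represented faithfully, while sinusoids of frequency 3 or 5 both show up as frequency 1.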
With a high enough sample rate for a particular signal, we don’t need to use a sampling filter. But if we are stuck with a signal that contains a wide range of frequencies (such as an image with sharp edges in it), we must use a sampling filter to bandlimit the signal before we can sample it. Figure 10.50 shows the effects of three low-pass (smoothing) filters in the frequency domain, and Figure 10.51 shows the effect of using these same filters when sampling. Even if the spectra overlap without filtering, convolving the signal with a low-pass filter can narrow the spectrum enough to eliminate overlap and produce a well-sampled representation of the filtered signal. Of course, we have lost the high frequencies, but that’s better than having them get scrambled with the signal and turn into artifacts.
From the frequency domain perspective, the job of a reconstruction filter is to remove the alias spectra while preserving the base spectrum. In Figure 10.48, we can see that the crudest reconstruction filter, the box, does attenuate the alias spectra. Most important, it completely blocks the DC spike for all the alias spectra. This is a characteristic of all reasonable reconstruction filters: they have zeroes in frequency space at all multiples of the sample frequency. This turns out to be equivalent to the ripple-free property in the space domain.
So a good reconstruction filter needs to be a good low-pass filter, with the added requirement of completely blocking all multiples of the sample frequency. The purpose of using a reconstruction filter different from the box filter is to more completely eliminate the alias spectra, reducing the leakage of high-frequency artifacts into the reconstructed signal, while disturbing the base spectrum as little as possible. Figure 10.52 illustrates the effects of different filters when used during reconstruction. As we have seen, the box filter is quite “leaky” and results in plenty of artifacts even if the sample rate is high enough. The tent filter, resulting in linear interpolation, attenuates high frequencies more, resulting in milder artifacts, and the B-spline filter is very smooth, controlling the alias spectra very effectively. It also smooths the base spectrum some—this is the tradeoff between smoothing and aliasing that we saw earlier.
When the operations of reconstruction and sampling are combined in resampling, the same principles apply, but with one filter doing the work of both reconstruction and sampling. Figure 10.53 illustrates how a resampling filter must remove the alias spectra and leave the spectrum narrow enough to be sampled at the new sample rate.
Following the frequency domain analysis to its logical conclusion, a filter that is exactly a box in the frequency domain is ideal for both sampling and reconstruction. Such a filter would prevent aliasing at both stages without diminishing the frequencies below the Nyquist frequency at all.
Recall that the inverse and forward Fourier transforms are essentially identical, so the spatial domain filter that has a box as its Fourier transform is the function sin πx/πx = sinc πx.
However, the sinc filter is not generally used in practice, either for sampling or for reconstruction, because it is impractical and because, even though it is optimal according to the frequency domain criteria, it doesn’t produce the best results for many applications.
For sampling, the infinite extent of the sinc filter, and its relatively slow rate of decrease with distance from the center, is a liability. Also, for some kinds of sampling, the negative lobes are problematic. A Gaussian filter makes an excellent sampling filter even for difficult cases where high-frequency patterns must be removed from the input signal, because its Fourier transform falls off exponentially, with no bumps that tend to let aliases leak through. For less difficult cases, a tent filter generally suffices.
For reconstruction, the size of the sinc function again creates problems, but even more importantly, the many ripples create “ringing” artifacts in reconstructed signals.
1. Show that discrete convolution is commutative and associative. Do the same for continuous convolution.
2. Discrete-continuous convolution can’t be commutative, because its arguments have two different types. Show that it is associative, though.
3. Prove that the B-spline is the convolution of four box functions.
4. Show that the “flipped” definition of convolution is necessary by trying to show that convolution is commutative and associative using this (incorrect) definition (see the footnote on page 214):
$$(a \star b)[i] = \sum_{j} a[j] \, b[i + j].$$
5. Prove that ℱ{f ⋆ g} = f̂ĝ and f̂ ⋆ ĝ = ℱ{fg}.
6. Equation 10.4 can be interpreted as the convolution of a with a filter f̄. Write a mathematical expression for the “de-rippled” filter f̄. Plot the filter that results from de-rippling the box, tent, and B-spline filters scaled to s = 1.25.
When trying to replicate the look of the real world, one quickly realizes that hardly any surfaces are featureless. Wood grows with grain; skin grows with wrinkles; cloth shows its woven structure; and paint shows the marks of the brush or roller that laid it down. Even smooth plastic is made with bumps molded into it, and smooth metal shows the marks of the machining process that made it. Materials that were once featureless quickly become covered with marks, dents, stains, scratches, fingerprints, and dirt.
In computer graphics, we lump all these phenomena under the heading of “spatially varying surface properties”—attributes of surfaces that vary from place to place but don’t really change the shape of the surface in a meaningful way. To allow for these effects, all kinds of modeling and rendering systems provide some means for texture mapping: using an image, called a texture map, texture image, or just a texture, to store the details that you want to go on a surface and then mathematically “mapping” the image onto the surface.
This is mapping in the sense of Section 2.1.
As it turns out, once the mechanism to map images onto surfaces exists, there are many less obvious ways it can be used that go beyond the basic purpose of introducing surface detail. Textures can be used to make shadows and reflections, to provide illumination, even to define surface shape. In sophisticated interactive programs, textures are used to store all kinds of data that doesn’t even have anything to do with pictures!
This chapter discusses the use of textures for representing surface detail, shadows, and reflections. While the basic ideas are simple, several practical problems complicate the use of textures. First of all, textures easily become distorted, and designing the functions that map textures onto surfaces is challenging. Also, texture mapping is a resampling process, just like rescaling an image, and as we saw in Chapter 10, resampling can very easily introduce aliasing artifacts. The use of texture mapping and animation together readily produces truly dramatic aliasing, and much of the complexity of texture mapping systems is created by the antialiasing measures that are used to tame these artifacts.
To start off, let’s consider a simple application of texture mapping. We have a scene with a wood floor, and we would like the diffuse color of the floor to be controlled by an image showing floorboards with wood grain. Regardless of whether we are using ray tracing or rasterization, the shading code that computes the color for a ray–surface intersection point or for a fragment generated by the rasterizer needs to know the color of the texture at the shading point, in order to use it as the diffuse color in the Lambertian shading model from Chapter 5.
To get this color, the shader performs a texture lookup: it figures out the location, in the coordinate system of the texture image, that corresponds to the shading point, and it reads out the color at that point in the image, resulting in the texture sample. That color is then used in shading, and since the texture lookup happens at a different place in the texture for every pixel that sees the floor, a pattern of different colors shows up in the image. The code might look like this:
Color texture_lookup(Texture t, float u, float v) {
    // find the texel nearest to the point (u, v) in the unit square
    int i = round(u * t.width() - 0.5)
    int j = round(v * t.height() - 0.5)
    return t.get_pixel(i, j)
}

Color shade_surface_point(Surface s, Point p, Texture t) {
    Vector normal = s.get_normal(p)
    (u, v) = s.get_texcoord(p)
    // the lookup needs the texture as well as the coordinates
    Color diffuse_color = texture_lookup(t, u, v)
    // compute shading using diffuse_color and normal
    // return shading result
}
In this code, the shader asks the surface where to look in the texture, and somehow every surface that we want to shade using a texture needs to be able to answer this query. This brings us to the first key ingredient of texture mapping: we need a function that maps from the surface to the texture that we can easily compute for every pixel. This is the texture coordinate function (Figure 11.1), and we say that it assigns texture coordinates to every point on the surface. Mathematically, it is a mapping from the surface S to the domain of the texture, T:
$$\phi : S \rightarrow T : (x, y, z) \mapsto (u, v).$$
The set T, often called “texture space,” is usually just a rectangle that contains the image; it is common to use the unit square (u, v) ∈ [0,1]^{2} (in this book, we’ll use the names u and v for the two texture coordinates). In many ways, it’s similar to the viewing projection discussed in Chapter 8, called π in this chapter, which maps points on surfaces in the scene to points in the image; both are 3D-to-2D mappings, and both are needed for rendering—one to know where to get the texture value from, and one to know where to put the shading result in the image. But there are some important differences, too: π is almost always a perspective or orthographic projection, whereas ϕ can take on many forms; and there is only one viewing projection for an image, whereas each object in the scene is likely to have a completely separate texture coordinate function.
It may seem surprising that ϕ is a mapping from the surface to the texture, when our goal is to put the texture onto the surface, but this is the function we need.
So … the first thing you have to learn is how to think backwards?
For the case of the wood floor, if the floor happens to be at constant z and aligned to the x and y axes, we could just use the mapping
$$u = ax; \quad v = by,$$
for some suitably chosen scale factors a and b, to assign texture coordinates (u,v) to the point (x,y,z)_{floor}, and then use the value of the texture pixel, or texel, closest to (u,v) as the texture value at (x,y). In this way we rendered the image in Figure 11.2.
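Putting the floor mapping and the nearest-texel lookup together gives a complete, if tiny, texturing pipeline. The sketch below is one way to make it runnable; the `Texture` class is a hypothetical stand-in for an image type, and clamping out-of-range coordinates to the edge texels is an added assumption that the pseudocode in this section does not make.

```python
class Texture:
    """A minimal stand-in for a texture image: a grid of colors stored row by row."""

    def __init__(self, rows):
        self.rows = rows

    def width(self):
        return len(self.rows[0])

    def height(self):
        return len(self.rows)

    def get_pixel(self, i, j):
        return self.rows[j][i]

def texture_lookup(t, u, v):
    """Nearest-texel lookup for (u, v), with the texture covering the unit square."""
    # Truncating u * width picks the texel whose cell contains u; the result is then
    # clamped so that u = 1 or slightly out-of-range values stay inside the image.
    i = min(max(int(u * t.width()), 0), t.width() - 1)
    j = min(max(int(v * t.height()), 0), t.height() - 1)
    return t.get_pixel(i, j)

def floor_texcoord(p, a, b):
    """Texture coordinates for a floor at constant z: u = a x, v = b y."""
    x, y, _ = p
    return (a * x, b * y)

def floor_diffuse_color(p, t, a, b):
    """Diffuse color at floor point p, read from texture t."""
    u, v = floor_texcoord(p, a, b)
    return texture_lookup(t, u, v)
```

With a 2 × 2 texture and a = b = 0.25, the floor point (1, 3, 0) maps to (u, v) = (0.25, 0.75), which falls in the first column of the second row of texels.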
This is pretty limiting, though: what if the room is modeled at an angle to the x and y axes, or what if we want the wood texture on the curved back of a chair? We will need some better way to compute texture coordinates for points on the surface.
Another problem that arises from the simplest form of texture mapping is illustrated dramatically by rendering a high-contrast texture from a very grazing angle into a low-resolution image. Figure 11.3 shows a larger plane textured using the same approach but with a high-contrast grid pattern and a view toward the horizon. You can see it contains aliasing artifacts (stairsteps in the foreground, wavy and glittery patterns in the distance) similar to the ones that arise in image resampling (Chapter 10) when appropriate filters are not used. Although it takes an extreme case to make these artifacts so obvious in a tiny still image printed in a book, in animations these patterns move around and are very distracting even when they are much more subtle.
We have now seen the two primary issues in basic texture mapping:
Defining texture coordinate functions, and
Looking up texture values without introducing too much aliasing.
These two concerns are fundamental to all kinds of applications of texture mapping and are discussed in Sections 11.2 and 11.3. Once you understand them and some of the solutions to them, you understand texture mapping. The rest is just how to apply the basic texturing machinery for a variety of different purposes, which is discussed in Section 11.4.
Designing the texture coordinate function ϕ well is a key requirement for getting good results with texture mapping. You can think of this as deciding how you are going to deform a flat, rectangular image so that it conforms to the 3D surface you want to draw. Or alternatively, you are taking the surface and gently flattening it, without letting it wrinkle, tear, or fold, so that it lies flat on the image. Sometimes, this is easy: maybe the 3D surface is already a flat rectangle! In other cases, it’s very tricky: the 3D shape might be very complicated, like the surface of a character’s body.
The problem of defining texture coordinate functions is not new to computer graphics. Exactly the same problem is faced by cartographers when designing maps that cover large areas of the Earth’s surface: the mapping from the curved globe to the flat map inevitably causes distortion of areas, angles, and/or distances that can easily make maps very misleading. Many map projections have been proposed over the centuries, all balancing the same competing concerns that are faced in texture mapping: minimizing various kinds of distortion while covering a large area in one contiguous piece.
In some applications (some examples are in Section 11.2.1), there’s a clear reason to use a particular map. But in most cases, designing the texture coordinate map is a delicate task of balancing competing concerns, which skilled modelers put considerable effort into.
“UV mapping” or “surface parameterization” are other names you may encounter for the texture coordinate function.
You can define ϕ in just about any way you can dream up. But there are several competing goals to consider:
Bijectivity. In most cases, you’d like ϕ to be bijective (see Section 2.1.1), so that each point on the surface maps to a different point in texture space. If several points map to the same texture space point, the value at one point in the texture will affect several points on the surface. In cases where you want a texture to repeat over a surface (think of wallpaper or carpet with their repeating patterns), it makes sense to deliberately introduce a many-to-one mapping from surface points to texture points, but you don’t want this to happen by accident.
Size distortion. The scale of the texture should be approximately constant across the surface. That is, close-together points anywhere on the surface that are about the same distance apart should map to points about the same distance apart in the texture. In terms of the function ϕ, the magnitude of the derivatives of ϕ should not vary too much.
Shape distortion. The texture should not be very distorted. That is, a small circle drawn on the surface should map to a reasonably circular shape in texture space, rather than an extremely squashed or elongated shape. In terms of ϕ, the derivative of ϕ should not be too different in different directions.
Continuity. There should not be too many seams: neighboring points on the surface should map to neighboring points in the texture. That is, ϕ should be continuous or have as few discontinuities as possible. In most cases, some discontinuities are inevitable, and we’d like to put them in inconspicuous locations.
Surfaces that are defined by parametric equations (Section 2.7.8) come with a built-in choice for the texture coordinate function: simply invert the function that defines the surface, and use the two parameters of the surface as texture coordinates. These texture coordinates may or may not have desirable properties, depending on the surface, but they do provide a mapping.
But for surfaces that are defined implicitly, or are just defined by a triangle mesh, we need some other way to define the texture coordinates, without relying on an existing parameterization. Broadly speaking, the two ways to define texture coordinates are to compute them geometrically, from the spatial coordinates of the surface point, or, for mesh surfaces, to store values of the texture coordinates at vertices and interpolate them across the surface. Let’s look at these options one at a time.
Geometrically determined texture coordinates are used for simple shapes or special situations, as a quick solution, or as a starting point for designing a hand-tweaked texture coordinate map.
We will illustrate the various texture coordinate functions by mapping the test image in Figure 11.4 onto the surface. The numbers in the image let you read the approximate (u,v) coordinates out of the rendered image, and the grid lets you see how distorted the mapping is.
Probably the simplest mapping from 3D to 2D is a parallel projection, the same mapping as used for orthographic viewing (Figure 11.5). The machinery we developed already for viewing (Section 8.1) can be reused directly for defining texture coordinates: just as orthographic viewing boils down to multiplying by a matrix and discarding the z component, generating texture coordinates by planar projection can be done with a simple matrix multiply:
$$\phi(x, y, z) = (u, v) \quad \text{with} \quad \begin{bmatrix} u \\ v \\ * \\ 1 \end{bmatrix} = M_t \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix},$$
where the texturing matrix M_{t} represents an affine transformation, and the asterisk indicates that we don’t care what ends up in the third coordinate.
This works quite well for surfaces that are mostly flat, without too much variation in surface normal, and a good projection direction can be found by taking the average normal. For any kind of closed shape, though, a planar projection will not be injective: points on the front and back will map to the same point in texture space (Figure 11.6).
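By simply substituting perspective projection for orthographic, we get projective texture coordinates (Figure 11.7):
$$\phi(x, y, z) = (\tilde{u}/w, \tilde{v}/w) \quad \text{where} \quad \begin{bmatrix} \tilde{u} \\ \tilde{v} \\ * \\ w \end{bmatrix} = P_t \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix}.$$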
Now the 4×4 matrix P_{t} represents a projective (not necessarily affine) transformation—that is, the last row may not be [0,0,0,1].
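Both planar and projective texture coordinates amount to a 4×4 matrix multiply followed by discarding the third coordinate, with a divide by the fourth coordinate in the projective case. The following sketch shows the two side by side; the helper names and the example matrices are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def mat_vec(m, v):
    """Multiply a 4x4 matrix (list of rows) by a 4-vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

def planar_texcoord(m_t, p):
    """Affine (planar projection) texture coordinates: keep u and v, ignore the rest."""
    u, v, _, _ = mat_vec(m_t, [p[0], p[1], p[2], 1.0])
    return (u, v)

def projective_texcoord(p_t, p):
    """Projective texture coordinates: divide by the homogeneous coordinate w."""
    u, v, _, w = mat_vec(p_t, [p[0], p[1], p[2], 1.0])
    return (u / w, v / w)

# Example affine matrix: project along z and map x, y from [-1, 1] to [0, 1].
M_T = [[0.5, 0.0, 0.0, 0.5],
       [0.0, 0.5, 0.0, 0.5],
       [0.0, 0.0, 1.0, 0.0],
       [0.0, 0.0, 0.0, 1.0]]

# Example projective matrix: the last row is not [0, 0, 0, 1], so w = z.
P_T = [[1.0, 0.0, 0.0, 0.0],
       [0.0, 1.0, 0.0, 0.0],
       [0.0, 0.0, 1.0, 0.0],
       [0.0, 0.0, 1.0, 0.0]]
```

With these matrices, the planar map sends (1, −1, 7) to (1, 0) regardless of z, while the projective map sends (2, 4, 2) to (1, 2), shrinking the coordinates of points that are farther along the projection axis.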
Projective texture coordinates are important in the technique of shadow mapping, discussed in Section 11.4.4.
For spheres, the latitude/longitude parameterization is familiar and widely used. It has a lot of distortion near the poles, which can lead to difficulties, but it does cover the whole sphere with discontinuities only along one line of longitude (a single meridian running from pole to pole).
Surfaces that are roughly spherical in shape can be parameterized using a texture coordinate function that maps a point on the surface to a point on a sphere using radial projection: take a line from the center of the sphere through the point on the surface, and find the intersection with the sphere. The spherical coordinates of this intersection point are the texture coordinates of the point you started with on the surface.
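Another way to say this is that you express the surface point in spherical coordinates (ρ,θ,ϕ) and then discard the ρ coordinate and map θ and ϕ each to the range [0,1]. The formula depends on the spherical coordinates convention; using the convention of Section 2.7.8,
$$\phi(x, y, z) = \left( \frac{\pi + \operatorname{atan2}(y, x)}{2\pi}, \; \frac{\pi - \operatorname{acos}\left(z / \sqrt{x^2 + y^2 + z^2}\right)}{\pi} \right).$$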
This and other texture coordinate functions in this chapter are for objects that are in the box [-1, 1]^{3} and centered at the origin.
A spherical coordinates map will be bijective everywhere except at the poles if the whole surface is visible from the center point. It inherits the same distortion near the poles as the latitude–longitude map on the sphere. Figure 11.8 shows an object for which spherical coordinates provide a suitable texture coordinate function.
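A spherical texture coordinate function can be sketched as follows. This version assumes one common convention, with z as the polar axis and the azimuth measured by atan2(y, x); if the convention in use differs, the two angle computations change accordingly. The function name is hypothetical.

```python
import math

def spherical_texcoord(x, y, z):
    """Map a point (not at the origin) to (u, v) by discarding its distance from the center."""
    rho = math.sqrt(x * x + y * y + z * z)
    # The azimuth lies in [-pi, pi], so u covers [0, 1] with a seam where y = 0 and x < 0.
    u = (math.pi + math.atan2(y, x)) / (2.0 * math.pi)
    # The polar angle lies in [0, pi]; v is 1 at the +z pole and 0 at the -z pole.
    v = (math.pi - math.acos(z / rho)) / math.pi
    return (u, v)
```

Points along a ray from the center all receive the same coordinates, which is the radial projection described above. At the two poles u has no meaningful value, which is where the distortion concentrates.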
For objects that are more columnar than spherical, projection outward from an axis onto a cylinder may work better than projection from a point onto a sphere (Figure 11.9). Analogously to spherical projection, this amounts to converting to cylindrical coordinates and discarding the radius:
$$\phi(x, y, z) = \left( \frac{\pi + \operatorname{atan2}(y, x)}{2\pi}, \; \frac{1 + z}{2} \right).$$
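In code, the cylindrical map is a short function. This sketch assumes the cylinder's axis is the z-axis and that the object fits in the box [−1, 1]^3, so that z maps linearly to [0, 1]; the function name is hypothetical.

```python
import math

def cylindrical_texcoord(x, y, z):
    """Map a point off the z-axis to (u, v) by discarding its distance from the axis."""
    u = (math.pi + math.atan2(y, x)) / (2.0 * math.pi)  # angle around the axis
    v = (1.0 + z) / 2.0                                 # height along the axis
    return (u, v)
```

Unlike the spherical map, the v coordinate here is linear in z, so there is no pinching at the top and bottom, but points on the axis itself still have no well-defined u.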
Using spherical coordinates to parameterize a spherical or sphere-like shape leads to high distortion of shape and area near the poles, which often leads to visible artifacts that reveal that there are two special points where something is going wrong with the texture. A popular alternative is much more uniform at the cost of having more discontinuities. The idea is to project onto a cube, rather than a sphere, and then use six separate square textures for the six faces of the cube. The collection of six square textures is called a cubemap. This introduces discontinuities along all the cube edges, but it keeps distortion of shape and area low.
Computing cubemap texture coordinates is also cheaper than for spherical coordinates, because projecting onto a plane just requires a division, essentially the same operation as perspective projection for viewing. For instance, for a point that projects onto the +z face of the cube:
A confusing aspect of cubemaps is establishing the convention for how the u and v directions are defined on the six faces. Any convention is fine, but the convention chosen affects the contents of textures, so standardization is important. Because cubemaps are very often used for textures that are viewed from the inside of the cube (see environment mapping in Section 11.4.5), the usual conventions have the u and v axes oriented so that u is clockwise from v as viewed from inside. The convention used by OpenGL is
The subscripts indicate which face of the cube each projection corresponds to. For example, ϕ_{−x} is used for points that project to the face of the cube at x = −1. You can tell which face a point projects to by looking at the coordinate with the largest absolute value: for example, if |x| > |y| and |x| > |z|, the point projects to the +x or −x face, depending on the sign of x.
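The face-selection rule and the per-face division can be sketched in a few lines. The sign choices below follow the cube map face selection table of the OpenGL specification as best recalled here, so they should be treated as an assumption and checked against the specification; the function name and the string face labels are hypothetical.

```python
def cubemap_texcoord(x, y, z):
    """Return (face, u, v) for a direction (x, y, z) that is not the zero vector."""
    ax, ay, az = abs(x), abs(y), abs(z)
    # The coordinate with the largest absolute value picks the face.
    if ax >= ay and ax >= az:
        face, sc, tc, ma = ("+x", -z, -y, ax) if x > 0 else ("-x", z, -y, ax)
    elif ay >= az:
        face, sc, tc, ma = ("+y", x, z, ay) if y > 0 else ("-y", x, -z, ay)
    else:
        face, sc, tc, ma = ("+z", x, -y, az) if z > 0 else ("-z", -x, -y, az)
    # One division projects onto the face; then map [-1, 1] to [0, 1].
    return face, 0.5 * (sc / ma + 1.0), 0.5 * (tc / ma + 1.0)
```

For example, the direction (0, 0, 1) lands at the center (0.5, 0.5) of the +z face.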
A texture to be used with a cubemap has six square pieces. (See Figure 11.10.) Often they are packed together in a single image for storage, arranged as if the cube were unwrapped.
For more fine-grained control over the texture coordinate function on a triangle mesh surface, you can explicitly store the texture coordinates at each vertex, and interpolate them across the triangles using barycentric interpolation (Section 9.1.2). It works in exactly the same way as any other smoothly varying quantities you might define over a mesh: colors, normals, even the 3D position itself.
The idea of interpolated texture coordinates is very simple—but it can be a bit confusing at first.
Let’s look at an example with a single triangle. Figure 11.11 shows a triangle texture mapped with part of the by now familiar test pattern. By looking at the pattern that appears on the rendered triangle, you can deduce that the texture coordinates of the three vertices are (0.2, 0.2), (0.8, 0.2), and (0.2, 0.8), because those are the points in the texture that appear at the three corners of the triangle. Just as with the geometrically determined mappings in the previous section, we control where the texture goes on the surface by giving the mapping from the surface to the texture domain, in this case by specifying where each vertex should go in texture space. Once you position the vertices, linear (barycentric) interpolation across triangles takes care of the rest.
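To make the single-triangle example concrete, here is a small sketch of barycentric interpolation of texture coordinates. The function name and the tuple layout are assumptions made for illustration.

```python
def interpolate_texcoord(bary, uv0, uv1, uv2):
    """Blend three per-vertex (u, v) pairs with barycentric weights (b0, b1, b2)."""
    b0, b1, b2 = bary
    u = b0 * uv0[0] + b1 * uv1[0] + b2 * uv2[0]
    v = b0 * uv0[1] + b1 * uv1[1] + b2 * uv2[1]
    return u, v

# Vertex texture coordinates deduced from the rendered triangle in the example.
UV = ((0.2, 0.2), (0.8, 0.2), (0.2, 0.8))
centroid_uv = interpolate_texcoord((1 / 3, 1 / 3, 1 / 3), *UV)
```

At the centroid of the triangle, this gives texture coordinates of approximately (0.4, 0.4), and at each vertex it returns that vertex's own coordinates.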
In Figure 11.12, we show a common way to visualize texture coordinates on a whole mesh: simply draw triangles in texture space with the vertices positioned at their texture coordinates. This visualization shows you what parts of the texture are being used by which triangles, and it is a handy tool for evaluating texture coordinates and for debugging all sorts of texture-mapping code.
The quality of a texture coordinate mapping that is defined by vertex texture coordinates depends on what coordinates are assigned to the vertices—that is, how the mesh is laid out in texture space. No matter what coordinates are assigned, as long as the triangles in the mesh share vertices (Section 12.1), the texture coordinate mapping is always continuous, because neighboring triangles will agree on the texture coordinate at points on their shared edge. But the other desirable qualities described above are not so automatic. Injectivity means the triangles don’t overlap in texture space—if they do, it means there’s some point in the texture that will show up at more than one place on the surface.
Size distortion is low when the areas of triangles in texture space are in proportion to their areas in 3D. For instance, if a character’s face is mapped with a continuous texture coordinate function, one often ends up with the nose squeezed into a relatively small area in texture space, as shown in Figure 11.13. Although triangles on the nose are smaller than on the cheek, the ratio of sizes is more extreme in texture space. The result is that the texture is enlarged on the nose, because a small area of texture has to cover a large area of surface. Similarly, comparing the forehead to the temple, the triangles are similar in size in 3D, but the triangles around the temple are larger in texture space, causing the texture to appear smaller there.
Similarly, shape distortion is low when the shapes of triangles are similar in 3D and in texture space. The face example has fairly low shape distortion, but, for example, the sphere in Figure 11.15 has very large shape distortion near the poles.
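One simple way to inspect size distortion is to compare each triangle's area in texture space with its area in 3D. The sketch below is only one possible measure, with hypothetical function names; it does not capture shape distortion.

```python
import math

def area_3d(p0, p1, p2):
    """Area of a triangle given its 3D vertex positions."""
    e1 = [p1[i] - p0[i] for i in range(3)]
    e2 = [p2[i] - p0[i] for i in range(3)]
    cx = e1[1] * e2[2] - e1[2] * e2[1]
    cy = e1[2] * e2[0] - e1[0] * e2[2]
    cz = e1[0] * e2[1] - e1[1] * e2[0]
    return 0.5 * math.sqrt(cx * cx + cy * cy + cz * cz)

def area_uv(t0, t1, t2):
    """Area of the same triangle as laid out in texture space."""
    return 0.5 * abs((t1[0] - t0[0]) * (t2[1] - t0[1])
                     - (t2[0] - t0[0]) * (t1[1] - t0[1]))

def texture_area_ratio(positions, texcoords):
    """Texture-space area per unit of 3D area (assumes a non-degenerate triangle)."""
    return area_uv(*texcoords) / area_3d(*positions)
```

If the ratio is nearly the same for all triangles, size distortion is low. A triangle whose ratio is much smaller than its neighbors' is covered by relatively little texture, so the texture appears enlarged there.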
It’s often useful to allow texture coordinates to go outside the bounds of the texture image. Sometimes, this is a detail: rounding error in a texture coordinate calculation might cause a vertex that lands exactly on the texture boundary to be slightly outside, and the texture mapping machinery should not fail in that case. But it can also be a modeling tool.
If a texture is only supposed to cover part of the surface, but texture coordinates are already set up to map the whole surface to the unit square, one option is to prepare a texture image that is mostly blank with the content in a small area. But that might require a very high resolution texture image to get enough detail in the relevant area. Another alternative is to scale up all the texture coordinates so that they cover a larger range—[−4.5,5.5] × [−4.5,5.5] for instance, to position the unit square at one-tenth size in the center of the surface.
For a case like this, texture lookups outside the unit-square area that’s covered by the texture image should return a constant background color. One way to do this is to set a background color to be returned by texture lookups outside the unit square. If the texture image already has a constant background color (for instance, a logo on a white background), another way to extend this background automatically over the plane is to arrange for lookups outside the unit square to return the color of the texture image at the closest point on the edge, achieved by clamping the u and v coordinates to the range from the first pixel to the last pixel in the image.
Sometimes, we want a repeating pattern, such as a checkerboard, a tile floor, or a brick wall. If the pattern repeats on a rectangular grid, it would be wasteful to create an image with many copies of the same data. Instead, we can handle texture lookups outside the texture image using wraparound indexing—when the lookup point exits the right edge of the texture image, it wraps around to the left edge. This is handled very simply using the integer remainder operation on the pixel coordinates.
Color texture_lookup_wrap(Texture t, float u, float v) {
    int i = round(u * t.width() - 0.5)
    int j = round(v * t.height() - 0.5)
    // % denotes the remainder in [0, n); in languages where % can return
    // a negative value, use ((i % n) + n) % n instead
    return t.get_pixel(i % t.width(), j % t.height())
}

Color texture_lookup_clamp(Texture t, float u, float v) {
    int i = round(u * t.width() - 0.5)
    int j = round(v * t.height() - 0.5)
    // clamp each index to the valid range [0, size - 1]
    return t.get_pixel(max(0, min(i, t.width()-1)),
                       max(0, min(j, t.height()-1)))
}
The choice between these two ways of handling out-of-bounds lookups is specified by selecting a wrapping mode from a list that includes tiling, clamping, and often combinations or variants of the two. With wrapping modes, we can freely think of a texture as a function that returns a color for any point in the infinite 2D plane (Figure 11.14). When we specify a texture using an image, these modes describe how the finite image data are supposed to be used to define this function. In Section 11.5, we’ll see that procedural textures can naturally extend across an infinite plane, since they are not limited by finite image data. Since both are logically infinite in extent, the two types of textures are interchangeable.
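The wrapping modes can be sketched as a single index-mapping function. This is a minimal illustration with hypothetical mode names, not the interface of any particular renderer. It uses floor(coord * size) to find the texel containing the coordinate, which agrees with rounding coord * size - 0.5 except exactly on texel boundaries.

```python
import math

def texel_index(coord, size, mode):
    """Map a texture coordinate to a texel index in [0, size), or None for background."""
    i = math.floor(coord * size)  # texel whose cell contains the coordinate
    if mode == "wrap":
        return i % size           # Python's % is non-negative for positive size
    if mode == "clamp":
        return max(0, min(i, size - 1))
    if mode == "background":
        return i if 0 <= i < size else None  # caller substitutes a constant color
    raise ValueError("unknown wrapping mode: " + mode)
```

Applying this function independently to u and v allows different modes in the two directions, which is one way the combinations mentioned above can arise.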
When adjusting the scale and placement of textures, it’s convenient to avoid actually changing the functions that generate texture coordinates, or the texture coordinate values stored at vertices of meshes, by instead applying a matrix transformation to the texture coordinates before using them to sample the texture:
ϕ(x) = M_{T} ϕ_{model}(x),
where ϕ_{model} is the texture coordinate function provided with the model, and M_{T} is a 3 by 3 matrix representing an affine or projective transformation of the 2D texture coordinates using homogeneous coordinates. Such a transformation, sometimes limited just to scaling and/or translation, is supported by most renderers that use texture mapping.
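As an illustration, the sketch below applies a 3 by 3 homogeneous matrix to a texture coordinate pair. The example matrix is our own construction: it scales by 10 about the point (0.5, 0.5), which maps the unit square to [−4.5,5.5] × [−4.5,5.5] as in the one-tenth-size example earlier. The names are hypothetical.

```python
def transform_texcoord(M, u, v):
    """Apply a 3x3 homogeneous matrix (list of rows) to the 2D point (u, v)."""
    x = M[0][0] * u + M[0][1] * v + M[0][2]
    y = M[1][0] * u + M[1][1] * v + M[1][2]
    w = M[2][0] * u + M[2][1] * v + M[2][2]
    return x / w, y / w  # w is 1 for affine matrices

# Scale by 10 about the center of the unit square.
M_T = [[10.0, 0.0, -4.5],
       [0.0, 10.0, -4.5],
       [0.0, 0.0, 1.0]]
```

With this matrix, the center of the surface keeps texture coordinate (0.5, 0.5), and only the middle tenth of the original coordinate range falls inside the unit square.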
Although low distortion and continuity are nice properties to have in a texture coordinate function, discontinuities are often unavoidable. For any closed 3D surface, it’s a basic result of topology that there is no continuous, bijective function that maps the whole surface into a texture image. Something has to give, and by introducing seams (curves on the surface where the texture coordinates change suddenly) we can have low distortion everywhere else. Many of the geometrically determined mappings discussed above already contain seams: in spherical and cylindrical coordinates, the seams are where the angle computed by atan2 wraps around from π to −π, and in the cubemap, the seams are along the cube edges, where the mapping switches between the six square textures.
With interpolated texture coordinates, seams require special consideration, because they don’t happen naturally. We observed earlier that interpolated texture coordinates are automatically continuous on shared-vertex meshes: the sharing of texture coordinates guarantees it. But this means that if a triangle spans a seam, with some vertices on one side and some on the other, the interpolation machinery will cheerfully provide a continuous mapping, but it will likely be highly distorted or fold over so that it’s not injective. Figure 11.15 illustrates this problem on a globe mapped with spherical coordinates. For example, there is a triangle near the bottom of the globe that has one vertex at the tip of New Zealand’s South Island, and another vertex in the Pacific about 400 km northeast of the North Island. A sensible pilot flying between these points would fly over New Zealand, but the path starts at longitude 167° E (+167) and ends at 179° W (i.e., longitude −179), so linear interpolation chooses a route that crosses South America on the way. This causes a backward copy of the entire map to be compressed into the strip of triangles that crosses the 180th meridian! One tempting fix is to label the second vertex with the equivalent longitude of 181° E, but this just pushes the problem to the next triangle.
The only way to create a clean transition is to avoid sharing texture coordinates at the seam: the triangle crossing New Zealand needs to interpolate to longitude +181, and the next triangle in the Pacific needs to continue starting from longitude −179. To do this, we duplicate the vertices at the seam: for each vertex, we add a second vertex with an equivalent longitude, differing by 360°, and the triangles on opposite sides of the seam use different vertices. This solution is shown in the right half of Figure 11.15, in which the vertices at the far left and right of the texture space are duplicates, with the same 3D positions.
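One way to decide which copy of a duplicated seam vertex a given triangle should use is to shift each vertex longitude by a multiple of 360° so that it lies within 180° of the triangle's first vertex. The sketch below illustrates that idea on the New Zealand example; it is a hypothetical helper, not a complete mesh-processing routine, and it assumes no triangle legitimately spans more than 180° of longitude.

```python
def unwrap_longitudes(lons_deg):
    """Shift longitudes by multiples of 360 so each is within 180 degrees of the first."""
    ref = lons_deg[0]
    out = []
    for lon in lons_deg:
        d = lon - ref
        d -= 360.0 * round(d / 360.0)  # bring the difference into [-180, 180]
        out.append(ref + d)
    return out
```

For a triangle with vertex longitudes 167, −179, and 170, this yields 167, 181, and 170, so the triangle interpolates across the seam the short way. A neighboring triangle that starts from −179 keeps its vertices near −179, which is exactly the situation in which the seam vertex needs two different texture coordinates.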
Textures are used in all kinds of rendering systems, and although the fundamentals are the same, the details are different for ray tracing and rasterization systems.
Texture coordinates are part of the model being rendered, and the scene description needs to include enough information to define what they are. Mostly, this means storing texture coordinates as per-vertex attributes of all triangle meshes that will be used with textures. If the rendering system directly supports geometric primitives other than meshes, these primitives usually have pre-defined texture coordinates (e.g., latitude–longitude coordinates on spheres), possibly with a choice of mapping schemes for each primitive type.
In a ray tracing renderer, each type of surface that supports ray intersection must be able to compute not just the intersection point and surface normal, but also the texture coordinates of the intersection point. Like the other information about the intersection, texture coordinates can be stored in a hit record (see Section 4.4.3). In the common case of geometry represented by triangle meshes, the ray–triangle intersection code will compute texture coordinates by barycentric interpolation from the texture coordinates stored at the vertices, and for other types of geometry, the intersection code must compute the texture coordinates directly.
In a rasterization-based system, triangles will normally be the only supported type of geometry, so all surfaces must be converted to this form. Texture coordinates can be read in with the model (the common case), or for triangle meshes that are generated in code, they can be computed and stored at the time the mesh is created. Alternatively, for texture coordinates that can be computed from other vertex data (for instance, where texture coordinates are computed from the 3D position), texture coordinates can also be computed in a vertex shader and passed on to the rasterizer. Texture coordinates are then interpolated by the rasterizer, so that every invocation of the fragment shader has the appropriate texture coordinates for its fragment.
The second fundamental problem of texture mapping is antialiasing. Rendering a texture mapped image is a sampling process: mapping the texture onto the surface and then projecting the surface into the image produce a 2D function across the image plane, and we are sampling it at pixels. As we saw in Chapter 10, doing this using point samples will produce aliasing artifacts when the image contains detail or sharp edges—and since the whole point of textures is to introduce detail, they become a prime source of aliasing problems like the ones we saw in Figure 11.3.
It’s a good idea to review the first half of Chapter 10 now.
Just as with antialiased rasterization of lines or triangles (Section 9.3), antialiased ray tracing, or downsampling images (Section 10.4), the solution is to make each pixel not a point sample but an area average of the image, over an area similar in size to the pixel. Using the same supersampling approach used for antialiased rasterization and ray tracing, with enough samples, excellent results can be obtained with no changes to the texture mapping machinery: many samples within a pixel’s area will land at different places in the texture map, and averaging the shading results computed using the different texture lookups is an accurate way to approximate the average color of the image over the pixel. However, with detailed textures it takes very many samples to get good results, which is slow. Computing this area average efficiently in the presence of textures on the surface is the first key topic in texture antialiasing.
Textures are usually defined by raster images, so there is also a reconstruction problem to be considered, just as with upsampling images (Section 10.4). The solution is the same for textures: use a reconstruction filter to interpolate between texels.
We expand on each of these topics in the following sections.
What makes antialiasing textures more complex than other kinds of antialiasing is that the relationship between the rendered image and the texture is constantly changing. Every pixel value should be computed as an average color over the area belonging to the pixel in the image, and in the common case that the pixel is looking at a single surface, this corresponds to averaging over an area on the surface. If the surface color comes from a texture, this in turn amounts to averaging over a corresponding part of the texture, known as the texture space footprint of the pixel. Figure 11.16 illustrates how the footprints of square areas (which could be pixel areas in a lower-resolution image) map to very different sized and shaped areas in the floor’s texture space.
Recall the three spaces involved in rendering with textures (the image, the 3D surface, and texture space) and the two mappings that connect them: the projection π that maps 3D points into the image and the texture coordinate function ϕ that maps 3D points into texture space. To work with pixel footprints, we need to understand the composition of these two mappings: first follow π backwards to get from the image to the surface and then follow ϕ forwards. This composition ψ = ϕ ∘ π^{-1} is what determines pixel footprints: the footprint of a pixel is the image of that pixel’s square area of the image under the mapping ψ.
The core problem in texture antialiasing is computing an average value of the texture over the footprint of a pixel. To do this exactly in general could be a pretty complicated job: for a faraway object with a complicated surface shape, the footprint could be a complicated shape covering a large area, or possibly several disconnected areas, in texture space. But in the typical case, a pixel lands in a smooth area of surface that is mapped to a single area in the texture.
Because ψ contains both the mapping from image to surface and the mapping from surface to texture, the size and shape of the footprint depend on both the viewing situation and the texture coordinate function. When a surface is closer to the camera, pixel footprints will be smaller; when the same surface moves farther away, the footprint gets bigger. When surfaces are viewed at an oblique angle, the footprint of a pixel on the surface is elongated, which usually means it will be elongated in texture space also. Even with a fixed view, the texture coordinate function can cause variations in the footprint: if it distorts area, the size of footprints will vary, and if it distorts shape, they can be elongated even for head-on views of the surface.
However, to find an efficient algorithm for computing antialiased lookups, some substantial approximations will be needed. When a function is smooth, a linear approximation is often useful. In the case of texture antialiasing, this means approximating the mapping ψ from image space to texture space as a linear mapping from 2D to 2D:
ψ(x) ≈ ψ(x_{0}) + J (x − x_{0}),
In mathematicians’ terms, we have made a one-term Taylor series approximation to the function ψ.
where the two-by-two matrix J is some approximation to the derivative of ψ. It has four entries, and if we denote the image-space position as x = (x,y) and the texture-space position as u = (u,v), then
J = \begin{bmatrix} du/dx & du/dy \\ dv/dx & dv/dy \end{bmatrix},
where the four derivatives describe how the texture point (u,v) that is seen at a point (x,y) in the image changes when we change x and y.
A geometric interpretation of this approximation (Figure 11.17) is that it says a unit-sized square pixel area centered at x in the image will map approximately to a parallelogram in texture space, centered at ψ(x) and with its edges parallel to the vectors u_{x} = (du/dx,dv/dx) and u_{y} = (du/dy,dv/dy).
The derivative matrix J is useful because it tells the whole story of variation in the (approximated) texture-space footprint across the image. Derivatives that are larger in magnitude indicate larger texture-space footprints, and the relationship between the derivative vectors u_{x} and u_{y} indicates the shape. When they are orthogonal and the same length, the footprint is square, and as they become skewed and/or very different in length, the footprint becomes elongated.
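When ψ is available only as a black box, the two columns of J can be estimated numerically. The sketch below uses central differences with a half-pixel step. Here psi is a hypothetical callable that returns (u, v) for an image position, and real renderers typically obtain these derivatives in other ways.

```python
def estimate_jacobian(psi, x, y, h=0.5):
    """Central-difference estimate of the columns u_x and u_y of J at (x, y)."""
    u_px, v_px = psi(x + h, y)
    u_mx, v_mx = psi(x - h, y)
    u_py, v_py = psi(x, y + h)
    u_my, v_my = psi(x, y - h)
    ux = ((u_px - u_mx) / (2 * h), (v_px - v_mx) / (2 * h))  # (du/dx, dv/dx)
    uy = ((u_py - u_my) / (2 * h), (v_py - v_my) / (2 * h))  # (du/dy, dv/dy)
    return ux, uy

def example_psi(x, y):
    """A made-up linear mapping, for which the estimate is exact."""
    return 2.0 * x + y, 3.0 * y
```

For example_psi, the estimate is u_x = (2, 0) and u_y = (1, 3) everywhere, so every pixel has the same skewed parallelogram footprint.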
The approach here uses a box filter to sample the image. Some systems instead use a Gaussian pixel filter, which becomes an elliptical Gaussian in texture space; this is elliptical weighted averaging (EWA).
We’ve now reached the form of the problem that’s usually thought of as the “right answer”: a filtered texture sample at a particular image-space position should be the average value of the texture map over the parallelogram-shaped footprint defined by the texture coordinate derivatives at that point. This already has some assumptions baked into it—namely, that the mapping from image to texture is smooth—but it is sufficiently accurate for excellent image quality. However, this parallelogram area average is already too expensive to compute exactly, so various approximations are used. Approaches to texture antialiasing differ in the speed/quality tradeoffs they make in approximating this lookup. We discuss these in the following sections.
When the footprint is smaller than a texel, we are magnifying the texture as it is mapped into the image. This case is analogous to upsampling an image, and the main consideration is interpolating between texels to produce a smooth image in which the texel grid is not obvious. Just as in image upsampling, this smoothing process is defined by a reconstruction filter that is used to compute texture samples at arbitrary locations in texture space. (See Figure 11.18.)
The considerations are pretty much the same as in image resampling, with one important difference. In image resampling, the task is to compute output samples on a regular grid and that regularity enabled an important optimization in the case of a separable reconstruction filter. In texture filtering, the pattern of lookups is not regular, and the samples have to be computed separately. This means large, high-quality reconstruction filters are very expensive to use, and for this reason the highest-quality filter normally used for textures is bilinear interpolation.
The calculation of a bilinearly interpolated texture sample is the same as computing one pixel in an image being upsampled with bilinear interpolation. First, we express the texture-space sample point in terms of (real-valued) texel coordinates, then we read the values of the four neighboring texels and take a weighted average of them, with weights determined by how close the sample point is to each texel. Textures are usually parameterized over the unit square, and the texels are located in the same way as pixels in any image, spaced a distance 1/n_{u} apart in the u direction and 1/n_{v} in v, with texel (0,0) positioned half a texel in from the edge for symmetry. (See Chapter 10 for the full explanation.)
Color tex_sample_bilinear(Texture t, float u, float v) {
    u_p = u * t.width - 0.5
    v_p = v * t.height - 0.5
    iu0 = floor(u_p); iu1 = iu0 + 1
    iv0 = floor(v_p); iv1 = iv0 + 1
    a_u = (iu1 - u_p); b_u = 1 - a_u
    a_v = (iv1 - v_p); b_v = 1 - a_v
    return a_u * a_v * t[iu0][iv0] + a_u * b_v * t[iu0][iv1] +
           b_u * a_v * t[iu1][iv0] + b_u * b_v * t[iu1][iv1]
}
In many systems, this operation becomes an important performance bottleneck, mainly because of the memory latency involved in fetching the four texel values from the texture data. The pattern of sample points for textures is irregular, because the mapping from image to texture space is arbitrary, but often coherent, since nearby image points tend to map to nearby texture points that may read the same texels. For this reason, high-performance systems have special hardware devoted to texture sampling that handles interpolation and manages caches of recently used texture data to minimize the number of slow data fetches from the memory where the texture data are stored.
After reading Chapter 10, you may complain that linear interpolation may not be a smooth enough reconstruction for some demanding applications. However, it can always be made good enough by resampling the texture to a somewhat higher resolution using a better filter, so that the texture is smooth enough that bilinear interpolation works well.
Doing a good job of interpolation only suffices in situations where the texture is being magnified: where the pixel footprint is small compared to the spacing of texels. When a pixel footprint covers many texels, good antialiasing requires computing the average of many texels to smooth out the signal so that it can be sampled safely.
One very accurate way to compute the average texture value over the footprint would be to find all the texels within the footprint and average them. However, this is potentially very expensive when the footprint is large: it could require reading many thousands of texels just for a single lookup. A better approach is to precompute and store the averages of the texture over various areas of different size and position.
The name “mip” stands for the Latin phrase multum in parvo meaning “much in a small space.”
A very popular version of this idea is known as “MIP mapping” or just mipmapping. A mipmap is a sequence of textures that all contain the same image but at lower and lower resolution. The original, full-resolution texture image is called the base level, or level 0, of the mipmap, and level 1 is generated by taking that image and downsampling it by a factor of 2 in each dimension, resulting in an image with one-fourth as many texels. The texels in this image are, roughly speaking, averages of square areas 2 by 2 texels in size in the level-0 image.
This process can be continued to define as many mipmap levels as desired: the image at level k is computed by downsampling the image at level k − 1 by two. A texel at level k corresponds to a square area measuring 2^{k} by 2^{k} texels in the original texture. For instance, starting with a 1024 × 1024 texture image, we could generate a mipmap with 11 levels: level 0 is 1024 × 1024; level 1 is 512 × 512, and so on until level 10, which has just a single texel. This kind of structure, with images that represent the same content at a series of lower and lower sampling rates, is called an image pyramid, based on the visual metaphor of stacking all the smaller images on top of the original.
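The construction of the pyramid can be sketched as repeated 2 by 2 averaging. This is a minimal version that assumes a square, power-of-two, single-channel image stored as a list of rows and uses a plain box filter; a production system might use a better downsampling filter. The function names are hypothetical.

```python
def downsample_by_two(image):
    """Average each 2x2 block of a square image given as a list of rows."""
    n = len(image) // 2
    return [[(image[2 * j][2 * i] + image[2 * j][2 * i + 1]
              + image[2 * j + 1][2 * i] + image[2 * j + 1][2 * i + 1]) / 4.0
             for i in range(n)]
            for j in range(n)]

def build_mipmap(base):
    """Return [level 0, level 1, ...], ending with a single-texel level."""
    levels = [base]
    while len(levels[-1]) > 1:
        levels.append(downsample_by_two(levels[-1]))
    return levels
```

A base image of width 2^{m} produces m + 1 levels, which matches the count of 11 levels for the 1024 × 1024 example.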
With the mipmap, or image pyramid, in hand, texture filtering can be done much more efficiently than by accessing many texels individually. When we need a texture value averaged over a large area, we simply use values from higher levels of the mipmap, which are already averages over large areas of the image. The simplest and fastest way to do this is to look up a single value from the mipmap, choosing the level so that the size covered by the texels at that level is roughly the same as the overall size of the pixel footprint. Of course, the pixel footprint might be quite different in shape from the (always square) area represented by the texel, and we can expect that to produce some artifacts.
Setting aside for a moment the question of what to do when the pixel footprint has an elongated shape, suppose the footprint is a square of width D, measured in terms of texels in the full-resolution texture. What level of the mipmap is it appropriate to sample? Since the texels at level k cover squares of width 2^{k}, it seems appropriate to choose k so that
2^{k} ≈ D,
so we let k = log_{2}D. Of course, this will give non-integer values of k most of the time, and we only have stored mipmap images for integer levels. Two possible solutions are to look up a value only for the integer nearest to k (efficient but produces seams at the abrupt transitions between levels) or to look up values for the two nearest integers to k and linearly interpolate the values (twice the work, but smoother).
Before we can actually write down the algorithm for sampling a mipmap, we have to decide how we will choose the “width” D when footprints are not square. Some possibilities might be to use the square root of the area or to find the longest axis of the footprint and call that the width. A practical compromise that is easy to compute is to use the length of the longest edge:
D = max(‖u_{x}‖, ‖u_{y}‖).
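Color mipmap_sample_trilinear(Texture mip[], float u, float v, matrix J) {
    // J holds the derivatives of (u, v), measured in level-0 texels per pixel
    D = max_column_norm(J)
    k = log2(D)
    // assumes k0 and k1 are both valid levels of the mipmap
    k0 = floor(k); k1 = k0 + 1
    a = k1 - k; b = 1 - a
    c0 = tex_sample_bilinear(mip[k0], u, v)
    c1 = tex_sample_bilinear(mip[k1], u, v)
    return a * c0 + b * c1
}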
Basic mipmapping does a good job of removing aliasing, but because it’s unable to handle elongated, or anisotropic pixel footprints, it doesn’t perform well when surfaces are viewed at grazing angles. This is most commonly seen on large planes that represent a surface the viewer is standing on. Points on the floor that are far away are viewed at very steep angles, resulting in very anisotropic footprints that mipmapping approximates with much larger square areas. The resulting image will appear blurred in the horizontal direction.
A mipmap can be used with multiple lookups to approximate an elongated footprint better. The idea is to select the mipmap level based on the shortest axis of the footprint rather than the longest and then average together several lookups spaced along the long axis. (See Figure 11.19.)
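A rough sketch of planning such a lookup follows. It is a simplification under stated assumptions: the footprint edge vectors u_{x} and u_{y} (in level-0 texels) stand in for the footprint axes, the number of samples is the ratio of long to short length capped at a hypothetical max_samples, and the function only returns the level and sample offsets rather than performing the texture reads.

```python
import math

def anisotropic_plan(ux, uy, max_samples=16):
    """Return (level, offsets): mip level from the short edge, offsets along the long edge."""
    len_x, len_y = math.hypot(*ux), math.hypot(*uy)
    long_axis, long_len = (ux, len_x) if len_x >= len_y else (uy, len_y)
    short_len = min(len_x, len_y)
    # Level from the short edge; clamp to level 0 when the footprint is under a texel.
    level = math.log2(short_len) if short_len > 1.0 else 0.0
    n = max(1, min(max_samples, math.ceil(long_len / max(short_len, 1.0))))
    # n offsets (in level-0 texels) evenly spaced along the long edge, centered on 0.
    offsets = [tuple(c * ((i + 0.5) / n - 0.5) for c in long_axis) for i in range(n)]
    return level, offsets
```

The filtered value would then be the average of n mipmap lookups at the chosen level, one at each offset from the footprint center.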
Once you understand the idea of defining texture coordinates for a surface and the machinery of looking up texture values, this machinery has many uses. In this section, we survey a few of the most important techniques in texture mapping, but textures are a very general tool with applications limited only by what the programmer can think up.
The most basic use of texture mapping is to introduce variation in color by making the diffuse color that is used in shading computations—whether in a ray tracer or in a fragment shader—dependent on a value looked up from a texture. A textured diffuse component can be used to paste decals, paint decorations, or print text on a surface, and it can also simulate the variation in material color, for example, for wood or stone.
Nothing limits us to varying only the diffuse color, though. Any other parameters, such as the specular reflectance or specular roughness, can also be textured. For instance, a cardboard box with transparent packing tape stuck to it may have the same diffuse color everywhere but be shinier, with higher specular reflectance and lower roughness, where the tape is than elsewhere. In many cases, the maps for different parameters are correlated: for instance, a glossy white ceramic cup with a logo printed on it may be both rougher and darker where it is printed (Figure 11.20), and a book with its title printed in metallic ink might change in diffuse color, specular color, and roughness, all at once.
Another quantity that is important for shading is the surface normal. With interpolated normals (Section 9.2), we know that the shading normal does not have to be the same as the geometric normal of the underlying surface. Normal mapping takes advantage of this fact by making the shading normal depend on values read from a texture map. The simplest way to do this is just to store the normals in a texture, with three numbers stored at every texel that are interpreted, instead of as the three components of a color, as the 3D coordinates of the normal vector.
Before a normal map can be used, though, we need to know what coordinate system the normals read from the map are represented in. Storing normals directly in object space, in the same coordinate system used for representing the surface geometry itself, is simplest: the normal read from the map can be used in exactly the same way as the normal reported by the surface itself. In most cases, it will need to be transformed into world space for lighting calculations, just like a normal that came with the geometry.
However, normal maps that are stored in object space are inherently tied to the surface geometry—even for the normal map to have no effect, to reproduce the result with the geometric normals, the contents of the normal map have to track the orientation of the surface. Furthermore, if the surface is going to deform, so that the geometric normal changes, the object-space normal map can no longer be used, since it would keep providing the same shading normals.
The solution is to define a coordinate system for the normals that is attached to the surface. Such a coordinate system can be defined based on the tangent space of the surface (see Section 2.7): select a pair of tangent vectors and use them to define an orthonormal basis (Section 2.4.5). The texture coordinate function itself provides a useful way to select a pair of tangent vectors: use the directions tangent to lines of constant u and v. These tangents are not generally orthogonal, but we can use the procedure from Section 2.4.7 to “square up” the basis so that it is orthonormal, or the basis can be defined using the surface normal and just one tangent vector.
When normals are expressed in this basis they vary a lot less; since they are mostly pointing near the direction of the normal to the smooth surface, they will be near the vector (0,0,1)^{T} in the normal map.
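As a rough illustration of the tangent-frame idea, the sketch below builds an orthonormal frame from the surface normal and a single tangent vector, then uses it to convert a normal read from a tangent-space map into the coordinate system of the geometry. The function names (`tangent_frame`, `tangent_to_object`) and the choice of Gram-Schmidt orthogonalization are assumptions made for this example, not the book's code.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalize(v):
    length = math.sqrt(dot(v, v))
    return tuple(c / length for c in v)

def tangent_frame(normal, tangent_u):
    """Orthonormal frame (t, b, n) from the surface normal and one tangent."""
    n = normalize(normal)
    # Remove the part of the tangent that lies along n, then normalize.
    d = dot(tangent_u, n)
    t = normalize(tuple(tc - d * nc for tc, nc in zip(tangent_u, n)))
    b = cross(n, t)  # completes a right-handed frame
    return t, b, n

def tangent_to_object(n_tangent, frame):
    """Express a tangent-space normal in the coordinates of the frame vectors."""
    t, b, n = frame
    x, y, z = n_tangent
    return normalize(tuple(x * t[i] + y * b[i] + z * n[i] for i in range(3)))
```

With this setup, the tangent-space normal (0, 0, 1) always maps back to the smooth surface normal, which is why a "flat" tangent-space normal map has no effect regardless of how the surface is oriented or deformed.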
Where do normal maps come from? Often they are computed from a more detailed model to which the smooth surface is an approximation; other times, they can be measured directly from real surfaces. They can also be authored as part of the modeling process; in this case, it’s often nice to use a bump map to specify the normals indirectly. The idea is that a bump map is a height field: a function that gives the local height of the detailed surface above the smooth surface. Where the values are high (where the map looks bright, if you display it as an image), the surface is protruding outside the smooth surface; where the values are low (where the map looks dark), the surface is receding below it. For instance, a narrow dark line in the bump map is a scratch, or a small white dot is a bump.
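Deriving a normal map from a bump map is simple: the normal map (expressed in the tangent frame) is computed from the derivatives of the bump map. Where the height is constant, the normal is (0,0,1)^{T}; where the height changes, the normal tilts away from that direction, leaning downhill more strongly as the slope gets steeper.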
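One way to make this concrete is with finite differences. The sketch below assumes the bump map is a 2D list of heights and that the tangent-frame normal of a height field h(u,v) is proportional to (-∂h/∂u, -∂h/∂v, 1); the `scale` parameter, which absorbs texel size and height units, and the clamped border handling are choices made for this example.

```python
import math

def bump_to_normal(height, i, j, scale=1.0):
    """Tangent-frame unit normal at texel (row i, column j) of a height map."""
    rows, cols = len(height), len(height[0])
    # Central differences, clamped at the borders of the map.
    j0, j1 = max(j - 1, 0), min(j + 1, cols - 1)
    i0, i1 = max(i - 1, 0), min(i + 1, rows - 1)
    dh_du = scale * (height[i][j1] - height[i][j0]) / max(j1 - j0, 1)
    dh_dv = scale * (height[i1][j] - height[i0][j]) / max(i1 - i0, 1)
    # The normal leans away from the uphill direction.
    nx, ny, nz = -dh_du, -dh_dv, 1.0
    length = math.sqrt(nx * nx + ny * ny + nz * nz)
    return (nx / length, ny / length, nz / length)
```

A constant height map yields (0, 0, 1) everywhere, matching the observation that tangent-space normals stay near that vector on nearly smooth surfaces.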
Figure 11.21 shows texture maps being used to create woodgrain color and to simulate increased surface roughness due to finish soaking into the more porous parts of the wood, together with a bump map to create an imperfect finish and gaps between boards, to make a realistic wood floor.
A problem with normal maps is that they don’t actually change the surface at all; they are just a shading trick. This becomes obvious when the geometry implied by the normal map should cause noticeable effects in 3D. In still images, the first problem to be noticed is usually that the silhouettes of objects remain smooth despite the appearance of bumps in the interior. In animations, the lack of parallax gives away that the bumps, however convincing, are really just “painted” on the surface.
Textures can be used for more than just shading, though: they can be used to alter geometry. A displacement map is one of the simplest versions of this idea. The concept is the same as a bump map: a scalar (one-channel) map that gives the height above the “average terrain.” But the effect is different. Rather than deriving a shading normal from the height map while using the smooth geometry, a displacement map actually changes the surface, moving each point along the normal of the smooth surface to a new location. The normals are roughly the same in each case, but the surface is different.
The most common way to implement displacement maps is to tessellate the smooth surface with a large number of small triangles and then displace the vertices of the resulting mesh using the displacement map. In the graphics pipeline, this can be done using a texture lookup at the vertex stage and is particularly handy for terrain.
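The vertex displacement step can be sketched as follows. Here `height_lookup(u, v)` is a hypothetical stand-in for whatever texture lookup the renderer provides, and `scale` is an assumed parameter that sets how far a unit of height moves the surface.

```python
def displace_vertices(positions, normals, uvs, height_lookup, scale=1.0):
    """Move each vertex along its (unit) normal by the height read at its (u, v)."""
    displaced = []
    for p, n, (u, v) in zip(positions, normals, uvs):
        h = scale * height_lookup(u, v)
        displaced.append(tuple(p[k] + h * n[k] for k in range(3)))
    return displaced
```

After displacement the original normals no longer match the new surface exactly, so in practice the shading normals would be recomputed from the displaced mesh or taken from a matching normal map.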
Shadows are an important cue to object relationships in a scene, and as we have seen, they are simple to include in ray-traced images. However, it’s not obvious how to get shadows in rasterized renderings, because surfaces are considered one at a time, in isolation. Shadow maps are a technique for using the machinery of texture mapping to get shadows from point light sources.
The idea of a shadow map is to represent the volume of space that is illuminated by a point light source. Think of a source like a spotlight or video projector, which emits light from a point into a limited range of directions. The volume that is illuminated—the set of points where you would see light on your hand if you held it there—is the union of line segments joining the light source to the closest surface point along every ray leaving that point.
Interestingly, this volume is the same as the volume that is visible to a perspective camera located at the same point as the light source: a point is illuminated by a source if and only if it is visible from the light source location. In both cases, there’s a need to evaluate visibility for points in the scene: for visibility, we needed to know whether a fragment was visible to the camera, to know whether to draw it in the image; and for shadowing, we need to know whether a fragment is visible to the light source, to know whether it’s illuminated by that source or not. (See Figure 11.22.)
In both cases, the solution is the same: a depth map that tells the distance to the closest surface along a bunch of rays. In the visibility case, this is the z-buffer (Section 9.2.3), and for the shadowing case, it is called a shadow map. In both cases, visibility is evaluated by comparing the depth of a new fragment to the depth stored in the map, and the surface is hidden from the projection point (occluded or shadowed) if its depth is greater than the depth of the closest visible surface. A difference is that the z-buffer is used to keep track of the closest surface seen so far and is updated during rendering, whereas a shadow map tells the distance to the closest surface in the whole scene.
A shadow map is calculated in a separate rendering pass ahead of time: simply rasterize the whole scene as usual, and retain the resulting depth map (there is no need to bother with calculating pixel values). Then, with the shadow map in hand, you perform an ordinary rendering pass, and when you need to know whether a fragment is visible to the source, you project its location into the shadow map (using the same perspective projection that was used to render the shadow map in the first place) and compare the looked-up value d_{map} with the actual distance d to the source. If the distances are the same, the fragment’s point is illuminated; if d > d_{map}, that implies there is a different surface closer to the source, so it is shadowed.
The phrase “if the distances are the same” should raise some red flags in your mind: since all the quantities involved are approximations with limited precision, we can’t expect them to be exactly the same. For visible points, d ≈ d_{map}, but sometimes d will be a bit larger and sometimes a bit smaller. For this reason, a tolerance is required: a point is considered illuminated if d - d_{map} < ϵ. This tolerance ϵ is known as shadow bias.
When looking up in shadow maps it doesn’t make a lot of sense to interpolate between the depth values recorded in the map. This might lead to more accurate depths (requiring less shadow bias) in smooth areas, but will cause bigger problems near shadow boundaries, where the depth value changes suddenly. Therefore, texture lookups in shadow maps are done using nearest-neighbor reconstruction. To reduce aliasing, multiple samples can be used, with the 1-or-0 shadow results (rather than the depths) averaged; this is known as percentage closer filtering.
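The biased depth test and percentage closer filtering can be sketched as below. The shadow map is assumed to be a 2D list of depths indexed by texel, and the square neighborhood with clamped borders is one simple sampling pattern chosen for this example; real implementations vary.

```python
def lit(d, d_map, bias):
    """1.0 if the point at distance d is illuminated, else 0.0 (biased test)."""
    return 1.0 if d - d_map < bias else 0.0

def pcf_shadow(shadow_map, x, y, d, bias, radius=1):
    """Average the 1-or-0 shadow results (not the depths) around texel (x, y)."""
    rows, cols = len(shadow_map), len(shadow_map[0])
    total, count = 0.0, 0
    for yy in range(y - radius, y + radius + 1):
        for xx in range(x - radius, x + radius + 1):
            cy = min(max(yy, 0), rows - 1)  # clamp to the map borders
            cx = min(max(xx, 0), cols - 1)
            total += lit(d, shadow_map[cy][cx], bias)
            count += 1
    return total / count
```

Averaging the comparison results gives a fractional value near shadow boundaries, whereas averaging the depths first would produce a single, meaningless in-between depth.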
Just as a texture is handy for introducing detail into the shading on a surface without having to add more detail to the model, a texture can also be used to introduce detail into the illumination without having to model complicated light source geometry. When light comes from far away compared to the size of objects in view, the illumination changes very little from point to point in the scene. It is handy to make the assumption that the illumination depends only on the direction you look and is the same for all points in the scene, and then to express this dependence of illumination on direction using an environment map.
The idea of an environment map is that a function defined over directions in 3D is a function on the unit sphere, so it can be represented using a texture map in exactly the same way as we might represent color variation on a spherical object. Instead of computing texture coordinates from the 3D coordinates of a surface point, we use exactly the same formulas to compute texture coordinates from the 3D coordinates of the unit vector that represents the direction from which we want to know the illumination.
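A possible version of the direction-to-texture-coordinate mapping is sketched here. It assumes a longitude-latitude parameterization with z as the polar axis and (u, v) in [0, 1]; the function name `spheremap_coords` matches the pseudocode used in this section, but the exact convention is an assumption of the example.

```python
import math

def spheremap_coords(d):
    """Map a direction vector to longitude-latitude texture coordinates."""
    x, y, z = d
    length = math.sqrt(x * x + y * y + z * z)
    x, y, z = x / length, y / length, z / length
    # Longitude from atan2, shifted and scaled from [-pi, pi] to [0, 1].
    u = (math.pi + math.atan2(y, x)) / (2.0 * math.pi)
    # Latitude from the polar angle; the clamp guards acos against round-off.
    v = (math.pi - math.acos(max(-1.0, min(1.0, z)))) / math.pi
    return u, v
```

Only the direction matters, so the input does not need to be normalized in advance.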
The simplest application of an environment map is to give colors to rays in a ray tracer that don’t hit any objects:
trace_ray(ray, scene) {
    if (surface = scene.intersect(ray)) {
        return surface.shade(ray)
    } else {
        u, v = spheremap_coords(ray.direction)
        return texture_lookup(scene.env_map, u, v)
    }
}
With this change to the ray tracer, shiny objects that reflect other scene objects will now also reflect the background environment.
A similar effect can be achieved in the rasterization context by adding a mirror reflection to the shading computation, which is computed in the same way as in a ray tracer, but simply looks up directly in the environment map with no regard for other objects in the scene:
shade_fragment(view_dir, normal) {
    out_color = diffuse_shading(k_d, normal)
    out_color += specular_shading(k_s, view_dir, normal)
    u, v = spheremap_coords(reflect(view_dir, normal))
    out_color += k_m * texture_lookup(environment_map, u, v)
}
This technique is known as reflection mapping.
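The `reflect` helper used in the shading pseudocode is not spelled out in the text. A minimal sketch, assuming that `view_dir` is a unit vector pointing from the surface toward the viewer and that `normal` is a unit vector, is:

```python
def reflect(view_dir, normal):
    """Mirror direction r = 2 (n . v) n - v for unit vectors v and n."""
    d = sum(v * n for v, n in zip(view_dir, normal))
    return tuple(2.0 * d * n - v for v, n in zip(view_dir, normal))
```

If the view direction is instead stored pointing toward the surface, the sign convention flips and the formula becomes v - 2 (n . v) n.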
A more advanced use of environment maps computes all the illumination from the environment map, not just the mirror reflection. This is environment lighting and can be computed in a ray tracer using Monte Carlo integration or in rasterization by approximating the environment with a collection of point sources and computing many shadow maps.
Environment maps can be stored in any coordinates that could be used for mapping a sphere. Spherical (longitude–latitude) coordinates are one popular option, though the compression of textures at the poles wastes texture resolution and can create artifacts there. Cubemaps are a more efficient choice, widely used in interactive applications (Figure 11.23).
In previous chapters, we used c_{r} as the diffuse reflectance at a point on an object. For an object that does not have a solid color, we can replace this with a function c_{r}(p) which maps 3D points to RGB colors (Peachey, 1985; Perlin, 1985). This function might just return the reflectance of the object that contains p. But for objects with texture, we should expect c_{r}(p) to vary as p moves across a surface.
An alternative to defining texture mapping functions that map from a 3D surface to a 2D texture domain is to create a 3D texture that defines an RGB value at every point in 3D space. We will only call it for points p on the surface, but it is usually easier to define it for all 3D points than a potentially strange 2D subset of points that are on an arbitrary surface. The good thing about 3D texture mapping is that it is easy to define the mapping function, because the surface is already embedded in 3D space, and there is no distortion in the mapping from 3D to texture space. Such a strategy is clearly suitable for surfaces that are “carved” from a solid medium, such as a marble sculpture.
The downside to 3D textures is that storing them as 3D raster images or volumes consumes a great deal of memory. For this reason, 3D texture coordinates are most commonly used with procedural textures in which the texture values are computed using a mathematical procedure rather than by looking them up from a texture image. In this section, we look at a couple of the fundamental tools used to define procedural textures. These could also be used to define 2D procedural textures, though in 2D it is more common to use raster texture images.
There are a surprising number of ways to make a striped texture. Let’s assume we have two colors c_{0} and c_{1} that we want to use to make the stripe color. We need some oscillating function to switch between the two colors. An easy one is a sine:
RGB stripe( point p)
if (sin(x_{p}) > 0) then
return c_{0}
else
return c_{1}
We can also make the stripe’s width w controllable:
RGB stripe( point p, real w)
if (sin(πx_{p}∕w) > 0) then
return c_{0}
else
return c_{1}
If we want to interpolate smoothly between the stripe colors, we can use a parameter t to vary the color linearly:
RGB stripe( point p, real w)
t = (1 + sin(πx_{p}∕w))∕2
return (1 - t)c_{0} + tc_{1}
These three possibilities are shown in Figure 11.24.
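For readers who want to experiment, the stripe functions translate directly into runnable code. In this sketch, points and colors are plain tuples, and the default colors `C0` and `C1` are arbitrary choices for the example.

```python
import math

C0 = (1.0, 1.0, 1.0)  # example stripe colors, chosen arbitrarily
C1 = (0.0, 0.0, 0.0)

def stripe_hard(p, w, c0=C0, c1=C1):
    """Two-color stripes of width w along x, with hard edges."""
    return c0 if math.sin(math.pi * p[0] / w) > 0 else c1

def stripe_smooth(p, w, c0=C0, c1=C1):
    """Stripes that blend linearly between c0 and c1 using t in [0, 1]."""
    t = (1 + math.sin(math.pi * p[0] / w)) / 2
    return tuple((1 - t) * a + t * b for a, b in zip(c0, c1))
```

Note that, following the pseudocode as written, the smooth version reaches c1 where the sine is largest, while the hard-edged version returns c0 there.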
Although regular textures such as stripes are often useful, we would like to be able to make “mottled” textures such as we see on birds’ eggs. This is usually done by using a sort of “solid noise,” usually called Perlin noise after its inventor, who received a technical Academy Award for its impact in the film industry (Perlin, 1985).
Getting a noisy appearance by calling a random number generator for every point would not be appropriate, because it would just be like “white noise” in TV static. We would like to make it smoother without losing the random quality. One possibility is to blur white noise, but there is no practical implementation of this. Another possibility is to make a large lattice with a random number at every lattice point and then interpolate these random points for new points between lattice nodes; this is just a 3D texture array as described in the last section with random numbers in the array. This technique makes the lattice too obvious. Perlin used a variety of tricks to improve this basic lattice technique so the lattice was not so obvious. This results in a rather baroque-looking set of steps, but essentially there are just three changes from linearly interpolating a 3D array of random values. The first change is to use Hermite interpolation to avoid Mach bands, just as can be done with regular textures. The second change is the use of random vectors rather than values, with a dot product to derive a random number; this makes the underlying grid structure less visually obvious by moving the local minima and maxima off the grid vertices. The third change is to use a 1D array and hashing to create a virtual 3D array of random vectors. This adds computation to lower memory use. Here is his basic method:
n(x,y,z) = ∑_{i=⌊x⌋}^{⌊x⌋+1} ∑_{j=⌊y⌋}^{⌊y⌋+1} ∑_{k=⌊z⌋}^{⌊z⌋+1} Ω_{ijk}(x - i, y - j, z - k),
where (x,y,z) are the Cartesian coordinates of the point x, and
Ω_{ijk}(u,v,w) = ω(u) ω(v) ω(w) (Γ_{ijk} ⋅ (u,v,w)),
and ω(t) is the cubic weighting function:
ω(t) = 2|t|^{3} - 3|t|^{2} + 1 if |t| < 1, and ω(t) = 0 otherwise.
The final piece is that Γ_{ijk} is a random unit vector for the lattice point (x,y,z) = (i,j,k). Since we want any potential ijk, we use a pseudorandom table:
Γ_{ijk} = G(ϕ(i + ϕ(j + ϕ(k)))),
where G is a precomputed array of n random unit vectors, and ϕ(i) = P[i mod n] where P is an array of length n containing a permutation of the integers 0 through n - 1. In practice, Perlin reports n = 256 works well. To choose a random unit vector (v_{x},v_{y},v_{z}), first set
v_{x} = 2ξ - 1,
v_{y} = 2ξ′ - 1,
v_{z} = 2ξ^{″} - 1,
where ξ,ξ′,ξ^{″} are canonical random numbers (uniform in the interval [0,1)). Then, if (v_{x}^{2} + v_{y}^{2} + v_{z}^{2}) < 1, make the vector a unit vector. Otherwise, keep setting it randomly until its length is less than one, and then make it a unit vector. This is an example of a rejection method, which will be discussed more in Chapter 13. Essentially, the “less than” test gets a random point in the unit sphere, and the direction of the vector from the origin to that point is uniformly random. That would not be true of random points in the cube, so we “get rid” of the corners with the test.
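The pieces described above fit together as in the following sketch. It assumes the standard form of the construction: a sum over the eight surrounding lattice points of ω(u)ω(v)ω(w)(Γ_{ijk} ⋅ (u,v,w)), with ω(t) = 2|t|^3 - 3|t|^2 + 1 for |t| < 1 and Γ_{ijk} = G(ϕ(i + ϕ(j + ϕ(k)))). The fixed seed, the module-level tables, and the extra guard against a zero-length vector are choices made for this example.

```python
import math
import random

N = 256                  # table size; the text reports n = 256 works well
_rng = random.Random(1)  # fixed seed so the texture is repeatable

def random_unit_vector(rng):
    """Rejection method: sample the cube, keep points inside the unit sphere."""
    while True:
        v = (2 * rng.random() - 1, 2 * rng.random() - 1, 2 * rng.random() - 1)
        len2 = v[0] ** 2 + v[1] ** 2 + v[2] ** 2
        if 0.0 < len2 < 1.0:  # also rejects the zero vector (example's guard)
            s = 1.0 / math.sqrt(len2)
            return (v[0] * s, v[1] * s, v[2] * s)

G = [random_unit_vector(_rng) for _ in range(N)]  # random unit vectors
P = list(range(N))                                # permutation of 0..N-1
_rng.shuffle(P)

def phi(i):
    return P[i % N]  # Python's % is non-negative for negative i

def gamma(i, j, k):
    return G[phi(i + phi(j + phi(k)))]

def omega(t):
    t = abs(t)
    return 2 * t ** 3 - 3 * t ** 2 + 1 if t < 1 else 0.0

def noise(x, y, z):
    """Solid noise: weighted sum over the 8 lattice points around (x, y, z)."""
    fx, fy, fz = math.floor(x), math.floor(y), math.floor(z)
    total = 0.0
    for i in (fx, fx + 1):
        for j in (fy, fy + 1):
            for k in (fz, fz + 1):
                u, v, w = x - i, y - j, z - k
                g = gamma(i, j, k)
                total += (omega(u) * omega(v) * omega(w)
                          * (g[0] * u + g[1] * v + g[2] * w))
    return total
```

Because each term contains a dot product with the offset from its lattice point, the noise is exactly zero at every lattice point, which is what moves the local minima and maxima off the grid vertices.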
Because solid noise can be positive or negative, it must be transformed before being converted to a color. The absolute value of noise over a 10 × 10 square is shown in Figure 11.25, along with stretched versions. These versions are stretched by scaling the points input to the noise function.
The dark curves are where the original noise function changed from positive to negative. Since noise varies from - 1 to 1, a smoother image can be achieved by using (noise + 1)∕2 for color. However, since noise values close to 1 or - 1 are rare, this will be a fairly smooth image. Larger scaling can increase the contrast (Figure 11.26).
Many natural textures contain a variety of feature sizes in the same texture. Perlin uses a pseudofractal “turbulence” function:
n_{t}(x) = ∑_{i} |n(2^{i}x)| ∕ 2^{i}.
This effectively repeatedly adds scaled copies of the noise function on top of itself as shown in Figure 11.27.
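A short sketch of the turbulence sum follows, assuming the usual form n_t(x) = ∑_i |n(2^i x)| / 2^i truncated to a fixed number of terms. The `noise` argument is any solid-noise function of three coordinates, and the `octaves` parameter is a name introduced for this example.

```python
def turbulence(noise, p, octaves=4):
    """Sum |noise| at doubling frequencies, halving the weight each time."""
    total = 0.0
    for i in range(octaves):
        s = 2.0 ** i
        total += abs(noise(p[0] * s, p[1] * s, p[2] * s)) / s
    return total
```

Passing the noise function in as a parameter keeps the sketch independent of any particular noise implementation; the weights 1, 1/2, 1/4, ... mean that a few terms already capture most of the result.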
The turbulence can be used to distort the stripe function:
RGB turbstripe( point p, double w)
double t = (1 + sin(k_{1}z_{p} + turbulence(k_{2}p))∕w)∕2
return t * c_{0} + (1 − t) * c_{1}
Various values for k_{1} and k_{2} were used to generate Figure 11.28.
How do I implement displacement mapping in ray tracing?
There is no ideal way to do it. Generating all the triangles and caching the geometry when necessary will prevent memory overload (Pharr & Hanrahan, 1996; Pharr, Kolb, Gershbein, & Hanrahan, 1997). Trying to intersect the displaced surface directly is possible when the displacement function is restricted (Patterson, Hoggar, & Logie, 1991; Heidrich & Seidel, 1998; Smits, Shirley, & Stark, 2000).
Humans are good at seeing small imperfections in surfaces. Geometric imperfections are typically absent in computer-generated images that use texture maps for details, so they look “too smooth.”
The discussion of perspective-correct textures is based on Fast Shadows and Lighting Effects Using Texture Mapping (Segal, Korobkin, van Widenfelt, Foran, & Haeberli, 1992) and on 3D Game Engine Design (Eberly, 2000).
1. Find several ways to implement an infinite 2D checkerboard using surface and solid techniques. Which is best?
2. Verify that Equation (9.4) is a valid equality using brute-force algebra.
3. How could you implement solid texturing by using the z-buffer depth and a matrix transform?
4. Expand the function mipmap_sample_trilinear into a single function.
Certain data structures seem to pop up repeatedly in graphics applications, perhaps because they address fundamental underlying ideas such as surfaces, space, and scene structure. This chapter talks about several basic and unrelated categories of data structures that are among the most common and useful: mesh structures, spatial data structures, scene graphs, and tiled multidimensional arrays.
For meshes, we discuss the basic storage schemes used for storing static meshes and for transferring meshes to graphics APIs. We also discuss the winged-edge data structure (Baumgart, 1974) and the related half-edge structure, which are useful for managing models where the tessellation changes, such as in subdivision or model simplification. Although these methods generalize to arbitrary polygon meshes, we focus on the simpler case of triangle meshes here.
Next, the scene-graph data structure is presented. Various forms of this data structure are ubiquitous in graphics applications because they are so useful in managing objects and transformations. All new graphics APIs are designed to support scene graphs well.
For spatial data structures, we discuss three approaches to organizing models in 3D space—bounding volume hierarchies, hierarchical space subdivision, and uniform space subdivision—and the use of hierarchical space subdivision (BSP trees) for hidden surface removal. The same methods are also used for other purposes, including geometry culling and collision detection.
Finally, the tiled multidimensional array is presented. Originally developed to help paging performance in applications where graphics data needed to be swapped in from disk, such structures are now crucial for memory locality on machines regardless of whether the array fits in main memory.
Most real-world models are composed of complexes of triangles with shared vertices. These are usually known as triangular meshes, triangle meshes, or triangular irregular networks (TINs), and handling them efficiently is crucial to the performance of many graphics programs. The kind of efficiency that is important depends on the application. Meshes are stored on disk and in memory, and we’d like to minimize the amount of storage consumed. When meshes are transmitted across networks or from the CPU to the graphics system, they consume bandwidth, which is often even more precious than storage. In applications that perform operations on meshes, besides simply storing and drawing them—such as subdivision, mesh editing, mesh compression, or other operations—efficient access to adjacency information is crucial.
Triangle meshes are generally used to represent surfaces, so a mesh is not just a collection of unrelated triangles, but rather a network of triangles that connect to one another through shared vertices and edges to form a single continuous surface. This is a key insight about meshes: a mesh can be handled more efficiently than a collection of the same number of unrelated triangles.
The minimum information required for a triangle mesh is a set of triangles (triples of vertices) and the positions (in 3D space) of their vertices. But many, if not most, programs require the ability to store additional data at the vertices, edges, or faces to support texture mapping, shading, animation, and other operations. Vertex data are the most common: each vertex can have material parameters, texture coordinates, and irradiances—any parameters whose values change across the surface. These parameters are then linearly interpolated across each triangle to define a continuous function over the whole surface of the mesh. However, it is also occasionally important to be able to store data per edge or per face.
The idea that meshes are surface-like can be formalized as constraints on the mesh topology—the way the triangles connect together, without regard for the vertex positions. Many algorithms will only work, or are much easier to implement, on a mesh with predictable connectivity. The simplest and most restrictive requirement on the topology of a mesh is for the surface to be a manifold. A manifold mesh is “watertight”—it has no gaps and separates the space on the inside of the surface from the space outside. It also looks like a surface everywhere on the mesh.
We’ll leave the precise definitions to the mathematicians; see the chapter notes.
The term manifold comes from the mathematical field of topology: roughly speaking, a manifold (specifically a two-dimensional manifold, or 2-manifold) is a surface in which a small neighborhood around any point could be smoothed out into a bit of flat surface. This idea is most clearly explained by counterexample: if an edge on a mesh has three triangles connected to it, the neighborhood of a point on the edge is different from the neighborhood of one of the points in the interior of one of the triangles, because it has an extra “fin” sticking out of it (Figure 12.1). If the edge has exactly two triangles attached to it, points on the edge have neighborhoods just like points in the interior, only with a crease down the middle. Similarly, if the triangles sharing a vertex are in a configuration like the left one in Figure 12.2, the neighborhood is like two pieces of surface glued together at the center, which can’t be flattened without doubling it up. The vertex with the simpler neighborhood shown at right is just fine.
Many algorithms assume that meshes are manifold, and it’s always a good idea to verify this property to prevent crashes or infinite loops if you are handed a malformed mesh as input. This verification boils down to checking that all edges are manifold and checking that all vertices are manifold by verifying the following conditions:
Every edge is shared by exactly two triangles.
Every vertex has a single, complete loop of triangles around it.
Figure 12.1 illustrates how an edge can fail the first test by having too many triangles, and Figure 12.2 illustrates how a vertex can fail the second test by having two separate loops of triangles attached to it.
Manifold meshes are convenient, but sometimes, it’s necessary to allow meshes to have edges or boundaries. Such meshes are not manifolds—a point on the boundary has a neighborhood that is cut off on one side. They are not necessarily watertight. However, we can relax the requirements of a manifold mesh to those for a manifold with boundary without causing problems for most mesh processing algorithms. The relaxed conditions are
Every edge is used by either one or two triangles.
Every vertex connects to a single edge-connected set of triangles.
Figure 12.3 illustrates these conditions: from left to right, there is an edge with one triangle, a vertex whose neighboring triangles are in a single edge-connected set, and a vertex with two disconnected sets of triangles attached to it.
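Both sets of conditions can be checked mechanically. The sketch below assumes the mesh is given as a list of vertex-index triples with no degenerate triangles; the function names and the `allow_boundary` flag are inventions of this example. The vertex test checks that the triangles around each vertex form a single edge-connected set, which, combined with the stricter edge test, amounts to a single complete loop.

```python
from collections import defaultdict

def edge_counts(tris):
    """Number of triangles using each undirected edge."""
    counts = defaultdict(int)
    for a, b, c in tris:
        for u, v in ((a, b), (b, c), (c, a)):
            counts[(min(u, v), max(u, v))] += 1
    return counts

def edges_ok(tris, allow_boundary=False):
    allowed = (1, 2) if allow_boundary else (2,)
    return all(n in allowed for n in edge_counts(tris).values())

def vertices_ok(tris):
    """Each vertex's triangles must form one edge-connected set."""
    incident = defaultdict(list)
    for t, tri in enumerate(tris):
        for v in tri:
            incident[v].append(t)
    for v, ts in incident.items():
        # Triangles at v are linked when they share an edge (v, other).
        by_other = defaultdict(list)
        for t in ts:
            for o in tris[t]:
                if o != v:
                    by_other[o].append(t)
        adj = defaultdict(set)
        for group in by_other.values():
            for t in group:
                adj[t].update(group)
        # Flood fill from one triangle and see whether all are reached.
        seen, stack = {ts[0]}, [ts[0]]
        while stack:
            for nb in adj[stack.pop()]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        if len(seen) != len(ts):
            return False
    return True

def is_manifold(tris, allow_boundary=False):
    return edges_ok(tris, allow_boundary) and vertices_ok(tris)
```

For example, a tetrahedron passes the strict test, a single triangle passes only with `allow_boundary=True`, and two triangles that touch at just one vertex fail both.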
Finally, in many applications it’s important to be able to distinguish the “front” or “outside” of a surface from the “back” or “inside”—this is known as the orientation of the surface. For a single triangle, we define orientation based on the order in which the vertices are listed: the front is the side from which the triangle’s three vertices are arranged in counterclockwise order. A connected mesh is consistently oriented if its triangles all agree on which side is the front—and this is true if and only if every pair of adjacent triangles is consistently oriented.
In a consistently oriented pair of triangles, the two shared vertices appear in opposite orders in the two triangles’ vertex lists (Figure 12.4). What’s important is consistency of orientation—some systems define the front using clockwise rather than counterclockwise order.
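This observation leads to a simple test, sketched here under the assumption that the mesh is a manifold (possibly with boundary) stored as vertex-index triples: if two adjacent triangles list their shared vertices in opposite orders, then no directed edge (a, b) can occur twice in the whole mesh. The function name is hypothetical.

```python
def consistently_oriented(tris):
    """True if no directed edge is used by more than one triangle."""
    seen = set()
    for a, b, c in tris:
        for e in ((a, b), (b, c), (c, a)):
            if e in seen:
                return False  # two triangles traverse this edge the same way
            seen.add(e)
    return True
```

For instance, the pair (0, 1, 2), (2, 1, 3) passes, while (0, 1, 2), (1, 2, 3) fails because both triangles traverse the shared edge from vertex 1 to vertex 2.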
Any mesh that has non-manifold edges can’t be oriented consistently. But it’s also possible for a mesh to be a valid manifold with boundary (or even a manifold) and yet have no consistent way to orient the triangles—they are not orientable surfaces. An example is the Möbius band shown in Figure 12.5. This is rarely an issue in practice, however.
A simple triangular mesh is shown in Figure 12.6. You could store these three triangles as independent entities, each of this form:
Triangle {
    vector3 vertexPosition[3]
}
This would result in storing vertex b three times and the other vertices twice each for a total of nine stored points (three vertices for each of three triangles). Or you could instead arrange to share the common vertices and store only four, resulting in a shared-vertex mesh. Logically, this data structure has triangles which point to vertices which contain the vertex data (Figure 12.7):
Triangle {
    Vertex v[3]
}
Vertex {
    vector3 position // or other vertex data
}
Note that the entries in the v array are references, or pointers, to Vertex objects; the vertices are not contained in the triangle.
In implementation, the vertices and triangles are normally stored in arrays, with the triangle-to-vertex references handled by storing array indices:
IndexedMesh {
    int tInd[nt][3]
    vector3 verts[nv]
}
The index of the kth vertex of the ith triangle is found in tInd[i][k], and the position of that vertex is stored in the corresponding row of the verts array; see Figure 12.8 for an example. This way of storing a shared-vertex mesh is an indexed triangle mesh.
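To see how the two representations relate, the following sketch converts a list of independent triangles into indexed-mesh arrays. It assumes vertex positions are tuples that compare exactly equal when they are meant to be shared; real data often needs a tolerance or an explicit vertex identity instead. The function name is made up for the example.

```python
def build_indexed_mesh(triangles):
    """Convert separate triangles (triples of positions) to (tInd, verts)."""
    verts = []     # one entry per distinct position
    index_of = {}  # position -> index into verts
    t_ind = []     # three vertex indices per triangle
    for tri in triangles:
        row = []
        for p in tri:
            if p not in index_of:
                index_of[p] = len(verts)
                verts.append(p)
            row.append(index_of[p])
        t_ind.append(tuple(row))
    return t_ind, verts
```

Three triangles around a common vertex, stored separately as nine positions, collapse to four shared vertices and nine indices.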
Separate triangles or shared vertices will both work well. Is there a space advantage for sharing vertices? If our mesh has n_{v} vertices and n_{t} triangles, and if we assume that the data for floats, pointers, and ints all require the same storage (a dubious assumption), the space requirements are as follows:
Triangle. Three vectors per triangle, for 9n_{t} units of storage;
IndexedMesh. One vector per vertex and three ints per triangle, for 3n_{v} + 3n_{t} units of storage.
The relative storage requirements depend on the ratio of n_{t} to n_{v}.
As a rule of thumb, a large mesh has each vertex connected to about six triangles (although there can be any number for extreme cases). Since each triangle connects to three vertices, this means that there are generally twice as many triangles as vertices in a large mesh: n_{t} ≈ 2n_{v}. Making this substitution, we can conclude that the storage requirements are 18n_{v} for the Triangle structure and 9n_{v} for IndexedMesh. Using shared vertices reduces storage requirements by about a factor of two, and this seems to hold in practice for most implementations.
Is this factor of two worth the complication? I think the answer is yes, and it becomes an even bigger win as soon as you start adding “properties” to the vertices.
Indexed meshes are the most common in-memory representation of triangle meshes, because they achieve a good balance of simplicity, convenience, and compactness. They are also commonly used to transfer meshes over networks and between the application and graphics pipeline. In applications where even more compactness is desirable, the triangle vertex indices (which take up two-thirds of the space in an indexed mesh with only positions at the vertices) can be expressed more efficiently using triangle strips and triangle fans.
A triangle fan is shown in Figure 12.9. In an indexed mesh, the triangles array would contain [(0, 1, 2), (0, 2, 3), (0, 3, 4), (0, 4, 5)]. We are storing 12 vertex indices, although there are only six distinct vertices. In a triangle fan, all the triangles share one common vertex, and the other vertices generate a set of triangles like the vanes of a collapsible fan. The fan in the figure could be specified with the sequence [0, 1, 2, 3, 4, 5]: the first vertex establishes the center, and subsequently each pair of adjacent vertices (1-2, 2-3, etc.) creates a triangle.
The triangle strip is a similar concept, but it is useful for a wider range of meshes. Here, vertices are added alternating top and bottom in a linear strip as shown in Figure 12.10. The triangle strip in the figure could be specified by the sequence [0 1 2 3 4 5 6 7], and every subsequence of three adjacent vertices (0-1-2, 1-2-3, etc.) creates a triangle. For consistent orientation, every other triangle needs to have its order reversed. In the example, this results in the triangles (0, 1, 2), (2, 1, 3), (2, 3, 4), (4, 3, 5), etc. For each new vertex that comes in, the oldest vertex is forgotten and the order of the two remaining vertices is swapped. See Figure 12.11 for a larger example.
In both strips and fans, n + 2 vertices suffice to describe n triangles—a substantial savings over the 3n vertices required by a standard indexed mesh. Long triangle strips will save approximately a factor of three if the program is vertex-bound.
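The rules for expanding fans and strips into individual triangles can be written down in a few lines. This sketch takes a sequence of vertex indices and returns index triples; the function names are chosen for the example.

```python
def fan_to_triangles(seq):
    """First vertex is the center; each adjacent pair after it adds a triangle."""
    return [(seq[0], seq[i], seq[i + 1]) for i in range(1, len(seq) - 1)]

def strip_to_triangles(seq):
    """Every three adjacent vertices form a triangle; odd ones are reversed."""
    tris = []
    for i in range(len(seq) - 2):
        a, b, c = seq[i], seq[i + 1], seq[i + 2]
        # Swap the first two vertices of every other triangle so that all
        # triangles keep the same orientation.
        tris.append((a, b, c) if i % 2 == 0 else (b, a, c))
    return tris
```

Applied to the sequences in the text, the fan [0, 1, 2, 3, 4, 5] expands to (0, 1, 2), (0, 2, 3), (0, 3, 4), (0, 4, 5), and the strip [0, 1, ..., 7] begins (0, 1, 2), (2, 1, 3), (2, 3, 4), (4, 3, 5).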
It might seem that triangle strips are only useful if the strips are very long, but even relatively short strips already gain most of the benefits. The savings in storage space (for only the vertex indices) are as follows:
strip length | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 16 | 100 | ∞
relative size | 1.00 | 0.67 | 0.56 | 0.50 | 0.47 | 0.44 | 0.43 | 0.42 | 0.38 | 0.34 | 0.33
So, in fact, the returns diminish rapidly as the strips grow longer. Thus, even for an unstructured mesh, it is worthwhile to use some greedy algorithm to gather the triangles into short strips.
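The relative sizes in the table follow from the vertex counts given earlier: a strip of n triangles needs n + 2 indices, compared with 3n for separate indexed triangles. A small sketch (the function name is an assumption of the example):

```python
def relative_size(n):
    """Index storage of an n-triangle strip relative to n separate triangles."""
    return (n + 2) / (3 * n)
```

The ratio approaches 1/3 as n grows, but a strip of only four triangles already reaches 0.50, which is why short strips capture most of the benefit.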
Indexed meshes, strips, and fans are all good, compact representations for static meshes. However, they do not readily allow for meshes to be modified. In order to efficiently edit meshes, more complicated data structures are needed to efficiently answer queries such as
Given a triangle, what are the three adjacent triangles?
Given an edge, which two triangles share it?
Given a vertex, which faces share it?
Given a vertex, which edges share it?
There are many data structures for triangle meshes, polygonal meshes, and polygonal meshes with holes (see the notes at the end of this chapter for references). In many applications, the meshes are very large, so an efficient representation can be crucial.
The most straightforward, though bloated, implementation would be to have three types, Vertex, Edge, and Triangle, and to just store all the relationships directly:
Triangle {
    Vertex v[3]
    Edge e[3]
}
Edge {
    Vertex v[2]
    Triangle t[2]
}
Vertex {
    Triangle t[]
    Edge e[]
}
This lets us directly look up answers to the connectivity questions above, but because this information is all inter-related, it stores more than is really needed. Also, storing connectivity in vertices makes for variable-length data structures (since vertices can have arbitrary numbers of neighbors), which are generally less efficient to implement. Rather than committing to store all these relationships explicitly, it is best to define a class interface to answer these questions, behind which a more efficient data structure can hide. It turns out we can store only some of the connectivity and efficiently recover the other information when needed.
The fixed-size arrays in the Edge and Triangle classes suggest that it will be more efficient to store the connectivity information there. In fact, for polygon meshes, in which polygons have arbitrary numbers of edges and vertices, only edges have fixed-size connectivity information, which leads to many traditional mesh data structures being based on edges. But for triangle-only meshes, storing connectivity in the (less numerous) faces is appealing.
A good mesh data structure should be reasonably compact and allow efficient answers to all adjacency queries. Efficient means constant-time: the time to find neighbors should not depend on the size of the mesh. We’ll look at three data structures for meshes, one based on triangles and two based on edges.
We can create a compact mesh data structure based on triangles by augmenting the basic shared-vertex mesh with pointers from the triangles to the three neighboring triangles, and a pointer from each vertex to one of the adjacent triangles (it doesn’t matter which one); see Figure 12.12:
Triangle {
    Triangle nbr[3];
    Vertex v[3];
}
Vertex {
    // ... per-vertex data ...
    Triangle t; // any adjacent tri
}
In the array Triangle.nbr, the kth entry points to the neighboring triangle that shares vertices k and k + 1. We call this structure the triangle-neighbor structure. Starting from standard indexed mesh arrays, it can be implemented with two additional arrays: one that stores the three neighbors of each triangle and one that stores a single neighboring triangle for each vertex (see Figure 12.13 for an example):
Mesh {
    // ... per-vertex data ...
    int tInd[nt][3]; // vertex indices
    int tNbr[nt][3]; // indices of neighbor triangles
    int vTri[nv];    // index of any adjacent triangle
}
Clearly, the neighboring triangles and vertices of a triangle can be found directly in the data structure, but by using this triangle adjacency information carefully, it is also possible to answer connectivity queries about vertices in constant time. The idea is to move from triangle to triangle, visiting only the triangles adjacent to the relevant vertex. If triangle t has vertex v as its kth vertex, then the triangle t.nbr[k] is the next triangle around v in the clockwise direction. This observation leads to the following algorithm to traverse all the triangles adjacent to a given vertex:
TrianglesOfVertex(v) {
    t = v.t
    do {
        find i such that (t.v[i] == v)
        t = t.nbr[i]
    } while (t != v.t)
}
Of course, a real program would do something with the triangles as it found them.
This operation finds each subsequent triangle in constant time—even though a search is required to find the position of the central vertex in each triangle’s vertex list, the vertex lists have constant size so the search takes constant time. However, that search is awkward and requires extra branching.
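As a concrete illustration, the array-based structure and the traversal above can be sketched in Python. This is only a minimal sketch under stated assumptions: the mesh is a closed, consistently oriented manifold, and the helper names build_triangle_neighbors and triangles_of_vertex, as well as the tetrahedron used as sample data, are our own and not part of the structure itself.

```python
def build_triangle_neighbors(t_ind, nv):
    """Derive tNbr and vTri from the vertex-index array tInd.

    Assumes a closed manifold mesh with consistent orientation, so each
    directed edge (a, b) belongs to exactly one triangle and its reverse
    (b, a) belongs to the neighbor across that edge.
    """
    owner = {}
    for t, tri in enumerate(t_ind):
        for k in range(3):
            owner[(tri[k], tri[(k + 1) % 3])] = t
    # t_nbr[t][k] is the triangle sharing vertices k and k+1 of triangle t.
    t_nbr = [[owner[(tri[(k + 1) % 3], tri[k])] for k in range(3)]
             for tri in t_ind]
    # v_tri[v] is any triangle adjacent to v (here: the last one seen).
    v_tri = [None] * nv
    for t, tri in enumerate(t_ind):
        for v in tri:
            v_tri[v] = t
    return t_nbr, v_tri


def triangles_of_vertex(v, t_ind, t_nbr, v_tri):
    """Collect the triangles around v by hopping from neighbor to neighbor."""
    found = []
    t = v_tri[v]
    while True:
        found.append(t)
        i = t_ind[t].index(v)   # constant-size search: v is vertex i of t
        t = t_nbr[t][i]         # step across the edge (v_i, v_{i+1})
        if t == v_tri[v]:
            return found


# Hypothetical sample data: a tetrahedron with consistently oriented faces.
TETRA = [(0, 2, 1), (0, 1, 3), (1, 2, 3), (0, 3, 2)]
t_nbr, v_tri = build_triangle_neighbors(TETRA, 4)
print(triangles_of_vertex(0, TETRA, t_nbr, v_tri))
```

The dictionary keyed on directed edges is used only while building the arrays; once tNbr and vTri exist, the traversal touches nothing but the arrays.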
A small refinement can avoid these searches. The problem is that once we follow a pointer from one triangle to the next, we don’t know from which way we came: we have to search the triangle’s vertices to find the vertex that connects back to the previous triangle. To solve this, instead of storing pointers to neighboring triangles, we can store pointers to specific edges of those triangles by storing an index with the pointer:
Triangle {
    Edge nbr[3];
    Vertex v[3];
}
Edge { // the i-th edge of triangle t
    Triangle t;
    int i; // in {0,1,2}
}
Vertex {
    // ... per-vertex data ...
    Edge e; // any edge leaving vertex
}
In practice, the Edge is stored by borrowing two bits of storage from the triangle index t to store the edge index i, so that the total storage requirements remain the same.
In this structure, the neighbor array for a triangle tells which of the neighboring triangles’ edges are shared with the three edges of that triangle. With this extra information, we always know where to find the original triangle, which leads to an invariant of the data structure: for any jth edge of any triangle t,
t.nbr[j].t.nbr[t.nbr[j].i].t == t.
Knowing which edge we came in through lets us know immediately which edge to leave through in order to continue traversing around a vertex, leading to a streamlined algorithm:
TrianglesOfVertex(v) {
    {t, i} = v.e;
    do {
        {t, i} = t.nbr[i];
        i = (i+1) mod 3;
    } while (t != v.e.t);
}
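The refined structure, including the trick of packing the edge index into the low two bits of the triangle index, can be sketched as follows. This is an illustrative sketch, not a definitive implementation: the names pack, unpack, build_edge_neighbors, and triangles_of_vertex are our own, and a closed, consistently oriented manifold mesh is assumed.

```python
def pack(t, i):
    """Store edge i (0..2) of triangle t in a single integer."""
    return (t << 2) | i


def unpack(e):
    return e >> 2, e & 3


def build_edge_neighbors(t_ind, nv):
    """Return nbr[t][i] (packed edge across edge i of t) and v_edge[v]."""
    where = {}
    for t, tri in enumerate(t_ind):
        for i in range(3):
            where[(tri[i], tri[(i + 1) % 3])] = pack(t, i)
    # The neighbor across directed edge (a, b) owns the reversed edge (b, a).
    nbr = [[where[(tri[(i + 1) % 3], tri[i])] for i in range(3)]
           for tri in t_ind]
    v_edge = [None] * nv
    for t, tri in enumerate(t_ind):
        for i in range(3):
            v_edge[tri[i]] = pack(t, i)   # edge i leaves vertex tri[i]
    return nbr, v_edge


def triangles_of_vertex(v, nbr, v_edge):
    """Walk around v without searching any vertex list."""
    t0, i = unpack(v_edge[v])
    t, found = t0, []
    while True:
        found.append(t)
        t, i = unpack(nbr[t][i])   # cross the shared edge
        i = (i + 1) % 3            # the next edge of t leaves v again
        if t == t0:
            return found


# Hypothetical sample data: a consistently oriented tetrahedron.
TETRA = [(0, 2, 1), (0, 1, 3), (1, 2, 3), (0, 3, 2)]
nbr, v_edge = build_edge_neighbors(TETRA, 4)
print(triangles_of_vertex(3, nbr, v_edge))
```

Crossing an edge and then crossing back returns to the same packed edge, which is exactly the invariant stated above.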
The triangle-neighbor structure is quite compact. For a mesh with only vertex positions, we are storing four numbers (three coordinates and an edge) per vertex and six (three vertex indices and three edges) per face. Since a large closed triangle mesh has about twice as many triangles as vertices (n_{t} ≈ 2n_{v}), the total is 4n_{v} + 6n_{t} ≈ 16n_{v} units of storage, compared with 3n_{v} + 3n_{t} ≈ 9n_{v} for the basic indexed mesh.
The triangle-neighbor structure as presented here works only for manifold meshes, because it depends on returning to the starting triangle to terminate the traversal of a vertex’s neighbors, which will not happen at a boundary vertex that doesn’t have a full cycle of triangles. However, it is not difficult to generalize it to manifolds with boundary, by introducing a suitable sentinel value (such as −1) for the neighbors of boundary triangles and taking care that the boundary vertices point to the most counterclockwise neighboring triangle, rather than to any arbitrary triangle.
One widely used mesh data structure that stores connectivity information at the edges instead of the faces is the winged-edge data structure. This data structure makes edges the first-class citizen of the data structure, as illustrated in Figures 12.14 and 12.15.
In a winged-edge mesh, each edge stores pointers to the two vertices it connects (the head and tail vertices), the two faces it is part of (the left and right faces), and, most importantly, the next and previous edges in the counterclockwise traversal of its left and right faces (Figure 12.16). Each vertex and face also stores a pointer to a single, arbitrary edge that connects to it:
Edge {
    Edge lprev, lnext, rprev, rnext;
    Vertex head, tail;
    Face left, right;
}
Face {
    // ... per-face data ...
    Edge e; // any adjacent edge
}
Vertex {
    // ... per-vertex data ...
    Edge e; // any incident edge
}
The winged-edge data structure supports constant-time access to the edges of a face or of a vertex, and from those edges the adjoining vertices or faces can be found:
EdgesOfVertex(v) {
    e = v.e;
    do {
        if (e.tail == v)
            e = e.lprev;
        else
            e = e.rprev;
    } while (e != v.e);
}
EdgesOfFace(f) {
    e = f.e;
    do {
        if (e.left == f)
            e = e.lnext;
        else
            e = e.rnext;
    } while (e != f.e);
}
These same algorithms and data structures will work equally well in a polygon mesh that isn’t limited to triangles; this is one important advantage of edge-based structures.
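To make the orientation checks concrete, here is a minimal Python sketch that builds winged-edge records from a list of faces and runs both traversals. It assumes a closed, consistently oriented manifold mesh; the builder build_winged_edges and the sample tetrahedron are our own illustrative choices, and faces are identified simply by their index.

```python
class Edge:
    def __init__(self, tail, head):
        self.tail, self.head = tail, head
        self.left = self.right = None
        self.lprev = self.lnext = self.rprev = self.rnext = None


def build_winged_edges(faces, nv):
    """faces: counterclockwise vertex loops (any polygon size)."""
    edges = {}

    def edge(a, b):
        # One record per undirected edge; direction fixed on first use.
        key = frozenset((a, b))
        if key not in edges:
            edges[key] = Edge(a, b)
        return edges[key]

    face_e = [None] * len(faces)
    vert_e = [None] * nv
    for f, loop in enumerate(faces):
        n = len(loop)
        for k in range(n):
            a, b = loop[k], loop[(k + 1) % n]
            e = edge(a, b)
            prev_e = edge(loop[k - 1], a)
            next_e = edge(b, loop[(k + 2) % n])
            if e.tail == a:      # f runs tail -> head, so f is the left face
                e.left, e.lprev, e.lnext = f, prev_e, next_e
            else:                # f runs head -> tail, so f is the right face
                e.right, e.rprev, e.rnext = f, prev_e, next_e
            face_e[f] = e
            vert_e[a] = e
    return list(edges.values()), face_e, vert_e


def edges_of_vertex(v, vert_e):
    found, e = [], vert_e[v]
    while True:
        found.append(e)
        e = e.lprev if e.tail == v else e.rprev
        if e is vert_e[v]:
            return found


def edges_of_face(f, face_e):
    found, e = [], face_e[f]
    while True:
        found.append(e)
        e = e.lnext if e.left == f else e.rnext
        if e is face_e[f]:
            return found


# Hypothetical sample data: a consistently oriented tetrahedron.
TETRA = [(0, 2, 1), (0, 1, 3), (1, 2, 3), (0, 3, 2)]
edges, face_e, vert_e = build_winged_edges(TETRA, 4)
print(len(edges), len(edges_of_vertex(0, vert_e)), len(edges_of_face(0, face_e)))
```

Nothing in the builder or the traversals depends on the faces being triangles, which is the advantage noted above.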
As with any data structure, the winged-edge data structure makes a variety of time/space tradeoffs. For example, we can eliminate the prev references. This makes it more difficult to traverse clockwise around faces or counterclockwise around vertices, but when we need to know the previous edge, we can always follow the successor edges in a circle until we get back to the original edge. This saves space, but it makes some operations slower. (See the chapter notes for more information on these tradeoffs.)
The winged-edge structure is quite elegant, but it has one remaining awkwardness: the need to constantly check which way the edge is oriented before moving to the next edge. This check is directly analogous to the search we saw in the basic version of the triangle-neighbor structure: we are looking to find out whether we entered the present edge from the head or from the tail. The solution is also very similar: rather than storing data for each edge, we store data for each half-edge. There is one half-edge for each of the two triangles that share an edge, and the two half-edges are oriented oppositely, each oriented consistently with its own triangle.
The data normally stored in an edge are split between the two half-edges. Each half-edge points to the face on its side of the edge and to the vertex at its head, and each contains the edge pointers for its face (Figure 12.17). It also points to its neighbor on the other side of the edge, from which the other half of the information can be found. Like the winged-edge, a half-edge can contain pointers to both the previous and next half-edges around its face, or only to the next half-edge. We’ll show the example that uses a single pointer.
HEdge {
    HEdge pair, next;
    Vertex v;
    Face f;
}
Face {
    // ... per-face data ...
    HEdge h; // any h-edge of this face
}
Vertex {
    // ... per-vertex data ...
    HEdge h; // any h-edge pointing toward this vertex
}
Traversing a half-edge structure is just like traversing a winged-edge structure except that we no longer need to check orientation, and we follow the pair pointer to access the edges in the opposite face.
EdgesOfVertex(v) {
    h = v.h;
    do {
        // h points toward v, so h.next leaves v and its pair points back toward v
        h = h.next.pair;
    } while (h != v.h);
}
EdgesOfFace(f) {
    h = f.h;
    do {
        h = h.next;
    } while (h != f.h);
}
The vertex traversal here is clockwise, which is necessary because of omitting the prev pointer from the structure.
Because half-edges are generally allocated in pairs (at least in a mesh with no boundaries), many implementations can do away with the pair pointers. For instance, in an implementation based on array indexing (such as shown in Figure 12.18), the array can be arranged so that an even-numbered edge i always pairs with edge i + 1 and an odd-numbered edge j always pairs with edge j - 1.
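The array-indexed variant can be sketched in Python as follows. This is a sketch under assumptions, not the layout of any particular library: the mesh is closed and consistently oriented, the two half-edges of every edge occupy slots 2k and 2k + 1 so that the pair of h is h XOR 1, and each vertex stores a half-edge whose head is that vertex, so one step around the vertex is "next, then pair". The function and array names are our own.

```python
def build_half_edges(t_ind, nv):
    """Return (head, face, nxt, face_h, vert_h) arrays; pair(h) == h ^ 1."""
    slot = {}                      # directed edge (a, b) -> half-edge index
    head, face, nxt = [], [], []
    for f, tri in enumerate(t_ind):
        for k in range(3):
            a, b = tri[k], tri[(k + 1) % 3]
            if (b, a) in slot:
                h = slot[(b, a)] ^ 1          # twin already reserved our slot
            else:
                h = len(head)                 # reserve an even/odd pair
                head += [None, None]
                face += [None, None]
                nxt += [None, None]
            slot[(a, b)] = h
            head[h], face[h] = b, f
    for tri in t_ind:
        for k in range(3):
            a, b, c = tri[k], tri[(k + 1) % 3], tri[(k + 2) % 3]
            nxt[slot[(a, b)]] = slot[(b, c)]
    face_h = [slot[(tri[0], tri[1])] for tri in t_ind]
    vert_h = [None] * nv
    for (a, b), h in slot.items():
        vert_h[b] = h                          # h points toward b
    return head, face, nxt, face_h, vert_h


def edges_of_vertex(v, nxt, vert_h):
    found, h = [], vert_h[v]
    while True:
        found.append(h)
        h = nxt[h] ^ 1             # next leaves v; its pair points back at v
        if h == vert_h[v]:
            return found


def edges_of_face(f, nxt, face_h):
    found, h = [], face_h[f]
    while True:
        found.append(h)
        h = nxt[h]
        if h == face_h[f]:
            return found


# Hypothetical sample data: a consistently oriented tetrahedron.
TETRA = [(0, 2, 1), (0, 1, 3), (1, 2, 3), (0, 3, 2)]
head, face, nxt, face_h, vert_h = build_half_edges(TETRA, 4)
print(edges_of_vertex(0, nxt, vert_h), edges_of_face(0, nxt, face_h))
```

With this layout no pair array is stored at all; flipping the lowest bit of the index plays the role of following the pair pointer.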
In addition to the simple traversal algorithms shown in this chapter, all three of these mesh topology structures can support “mesh surgery” operations of various sorts, such as splitting or collapsing vertices, swapping edges, and adding or removing triangles.
A triangle mesh manages a collection of triangles that constitute an object in a scene, but another universal problem in graphics applications is arranging the objects in the desired positions. As we saw in Chapter 7, this is done using transformations, but complex scenes can contain a great many transformations and organizing them well makes the scene much easier to manipulate. Most scenes admit to a hierarchical organization, and the transformations can be managed according to this hierarchy using a scene graph.
To motivate the scene-graph data structure, we will use the hinged pendulum shown in Figure 12.19. Consider how we would draw the top part of the pendulum:
M_{1} = rotate(θ)
M_{2} = translate(p)
M_{3} = M_{2}M_{1}
Apply M_{3} to all points in upper pendulum
The bottom is more complicated, but we can take advantage of the fact that it is attached to the bottom of the upper pendulum at point b in the local coordinate system. First, we rotate the lower pendulum so that it is at an angle ϕ relative to its initial position. Then, we move it so that its top hinge is at point b. Now it is at the appropriate position in the local coordinates of the upper pendulum, and it can then be moved along with that coordinate system. The composite transform for the lower pendulum is
M_{a} = rotate(ϕ)
M_{b} = translate(b)
M_{c} = M_{b}M_{a}
M_{d} = M_{3}M_{c}
Apply M_{d} to all points in lower pendulum
Thus, we see not only that the lower pendulum lives in its own local coordinate system, but also that this coordinate system is itself moved along with that of the upper pendulum.
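The two composite transforms can be checked numerically with a small Python sketch using 3 × 3 homogeneous matrices for 2D. The helper names and the particular values of θ, ϕ, p, and b below are hypothetical choices made only for illustration.

```python
import math


def rotate(angle):
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def translate(p):
    return [[1.0, 0.0, p[0]], [0.0, 1.0, p[1]], [0.0, 0.0, 1.0]]


def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def apply(M, q):
    """Transform the 2D point q by the homogeneous matrix M."""
    return (M[0][0] * q[0] + M[0][1] * q[1] + M[0][2],
            M[1][0] * q[0] + M[1][1] * q[1] + M[1][2])


def pendulum_matrices(theta, phi, p, b):
    M3 = matmul(translate(p), rotate(theta))     # upper pendulum
    Mc = matmul(translate(b), rotate(phi))       # lower, in upper's frame
    Md = matmul(M3, Mc)                          # lower, in world frame
    return M3, Md


# Hypothetical configuration: hinge at p, lower hinge at b in local coords.
M3, Md = pendulum_matrices(math.pi / 2, 0.3, (0.0, 2.0), (0.0, -1.0))
print(apply(M3, (0.0, -1.0)))   # where point b of the upper pendulum ends up
print(apply(Md, (0.0, 0.0)))    # where the lower pendulum's hinge ends up
```

Whatever angle ϕ is chosen, the lower pendulum's hinge (its local origin) lands on the transformed point b, which is what it means for the lower pendulum to ride along with the upper one.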
We can encode the pendulum in a data structure that makes management of these coordinate system issues easier, as shown in Figure 12.20. The appropriate matrix to apply to an object is just the product of all the matrices in the chain from the object to the root of the data structure. For example, consider the model of a ferry that has a car that can move freely on the deck of the ferry and wheels that each move relative to the car as shown in Figure 12.21.
As with the pendulum, each object should be transformed by the product of the matrices in the path from the root to the object:
ferry transform using M_{0};
car body transform using M_{0}M_{1};
left wheel transform using M_{0}M_{1}M_{2};
right wheel transform using M_{0}M_{1}M_{3}.
An efficient implementation in the case of rasterization can be achieved using a matrix stack, a data structure supported by many APIs. A matrix stack is manipulated using push and pop operations that add and delete matrices from the right-hand side of a matrix product. For example, calling
push(M_{0})
push(M_{1})
push(M_{2})
creates the active matrix M = M_{0}M_{1}M_{2}. A subsequent call to pop() strips the last matrix added so that the active matrix becomes M = M_{0}M_{1}. Combining the matrix stack with a recursive traversal of a scene graph gives us
function traverse(node)
    push(M_{local})
    draw object using composite matrix from stack
    traverse(left child)
    traverse(right child)
    pop()
There are many variations on scene graphs but all follow the basic idea above.
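A matrix stack and the recursive traversal can be sketched in a few lines of Python. This is a minimal sketch, not the interface of any particular graphics API: the MatrixStack and Node classes, the draw callback, and the pure-translation matrices standing in for M_0 through M_3 are all assumptions made for illustration, and nodes hold a list of children rather than exactly a left and a right child.

```python
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]


def translate(tx, ty):
    return [[1.0, 0.0, tx], [0.0, 1.0, ty], [0.0, 0.0, 1.0]]


def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


class MatrixStack:
    """Keeps the running product; push multiplies on the right-hand side."""

    def __init__(self):
        self._composites = [IDENTITY]

    def push(self, M):
        self._composites.append(matmul(self._composites[-1], M))

    def pop(self):
        self._composites.pop()

    def active(self):
        return self._composites[-1]


class Node:
    def __init__(self, name, local, children=()):
        self.name, self.local, self.children = name, local, list(children)


def traverse(node, stack, draw):
    stack.push(node.local)
    draw(node.name, stack.active())
    for child in node.children:
        traverse(child, stack, draw)
    stack.pop()


def build_ferry():
    # Hypothetical stand-ins for M_0 .. M_3 (pure translations).
    wheels = [Node("left wheel", translate(1.0, 0.0)),
              Node("right wheel", translate(-1.0, 0.0))]
    car = Node("car body", translate(2.0, 0.0), wheels)
    return Node("ferry", translate(10.0, 0.0), [car])


traverse(build_ferry(), MatrixStack(),
         lambda name, M: print(name, "origin at", (M[0][2], M[1][2])))
```

Storing the composite at each stack level makes pop a constant-time operation; no matrix inverse is ever needed to undo a push.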
An elegant property of ray tracing is that it allows very natural application of transformations without changing the representation of the geometry. The basic idea of instancing is to distort all points on an object by a transformation matrix before the object is displayed. For example, if we transform the unit circle (in 2D) by a scale factor (2,1) in x and y, respectively, then rotate it by 45^{∘}, and move one unit in the x-direction, the result is an ellipse whose long axis is twice the length of its short axis, with the long axis along the (x = y)-direction, centered at (1,0) (Figure 12.22). The key thing that makes that entity an “instance” is that we store the circle and the composite transform matrix. Thus, the explicit construction of the ellipse is left as a future operation at render time.
The advantage of instancing in ray tracing is that we can choose the space in which to do intersection. If the base object is composed of a set of points, one of which is p, then the transformed object is composed of that set of points transformed by matrix M, where the example point is transformed to Mp. If we have a ray a + tb that we want to intersect with the transformed object, we can instead intersect an inverse-transformed ray with the untransformed object (Figure 12.23). There are two potential advantages to computing in the untransformed space (i.e., the right-hand side of Figure 12.23):
The untransformed object may have a simpler intersection routine, e.g., a sphere versus an ellipsoid.
Many transformed objects can share the same untransformed object, thus reducing storage, e.g., a traffic jam of cars, where individual cars are just transforms of a few base (untransformed) models.
As discussed in Section 7.2.2, surface normal vectors transform differently. With this in mind and using the concepts illustrated in Figure 12.23, we can determine the intersection of a ray and an object transformed by matrix M. If we create an instance class of type surface, we need to create a hit function:
instance::hit(ray a + tb, real t_{0}, real t_{1}, hit-record rec)
    ray r′ = M^{-1}a + tM^{-1}b
    if (base-object→hit(r′, t_{0}, t_{1}, rec)) then
        rec.n = (M^{-1})^{T}rec.n
        return true
    else
        return false
An elegant thing about this function is that the parameter rec.t does not need to be changed, because it is the same in either space. Also note that we need not compute or store the matrix M.
This brings up a very important point: the ray direction b must not be restricted to a unit-length vector, or none of the infrastructure above works. For this reason, it is useful not to restrict ray directions to unit vectors.
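The hit function can be tried out on a 2D analogue: a unit circle instanced as an ellipse. The sketch below is an illustration under assumptions rather than a definitive implementation. The function names are our own, the base object is a unit circle instead of a general surface, the result is returned as a tuple instead of being written into a hit record, and the transformed normal is renormalized, a step the pseudocode above leaves out.

```python
import math


def xform_point(M, p):
    return (M[0][0] * p[0] + M[0][1] * p[1] + M[0][2],
            M[1][0] * p[0] + M[1][1] * p[1] + M[1][2])


def xform_dir(M, d):
    # Directions ignore the translation column.
    return (M[0][0] * d[0] + M[0][1] * d[1],
            M[1][0] * d[0] + M[1][1] * d[1])


def xform_normal(M_inv, n):
    # Multiply by the transpose of the linear part of M^{-1}, then normalize.
    x = M_inv[0][0] * n[0] + M_inv[1][0] * n[1]
    y = M_inv[0][1] * n[0] + M_inv[1][1] * n[1]
    length = math.hypot(x, y)
    return (x / length, y / length)


def hit_unit_circle(a, b, t0, t1):
    """Smallest t in [t0, t1] with |a + t b| = 1, plus the normal there."""
    A = b[0] * b[0] + b[1] * b[1]          # b need not be unit length
    B = 2.0 * (a[0] * b[0] + a[1] * b[1])
    C = a[0] * a[0] + a[1] * a[1] - 1.0
    disc = B * B - 4.0 * A * C
    if disc < 0.0:
        return None
    root = math.sqrt(disc)
    for t in ((-B - root) / (2.0 * A), (-B + root) / (2.0 * A)):
        if t0 <= t <= t1:
            return t, (a[0] + t * b[0], a[1] + t * b[1])
    return None


def instance_hit(M_inv, a, b, t0, t1):
    hit = hit_unit_circle(xform_point(M_inv, a), xform_dir(M_inv, b), t0, t1)
    if hit is None:
        return None
    t, n = hit                 # t is the same in both spaces
    return t, xform_normal(M_inv, n)


def inverse_of_translate_scale(tx, ty, sx, sy):
    """Inverse of M = translate(tx, ty) * scale(sx, sy), written directly."""
    return [[1.0 / sx, 0.0, -tx / sx], [0.0, 1.0 / sy, -ty / sy],
            [0.0, 0.0, 1.0]]


# Ellipse centered at (3, 0) with semi-axes 2 and 1; ray direction (2, 0).
M_inv = inverse_of_translate_scale(3.0, 0.0, 2.0, 1.0)
print(instance_hit(M_inv, (0.0, 0.0), (2.0, 0.0), 0.0, 100.0))
```

The transformed direction M^{-1}b is used as is, without normalization; that is why the returned t can be plugged straight back into the original ray a + tb.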
In many, if not all, graphics applications, the ability to quickly locate geometric objects in particular regions of space is important. Ray tracers need to find objects that intersect rays; interactive applications navigating an environment need to find the objects visible from any given viewpoint; games and physical simulations require detecting when and where objects collide. All these needs can be supported by various spatial data structures designed to organize objects in space so they can be looked up efficiently.
In this section, we will discuss examples of three general classes of spatial data structures. Structures that group objects together into a hierarchy are object partitioning schemes: objects are divided into disjoint groups, but the groups may end up overlapping in space. Structures that divide space into disjoint regions are space partitioning schemes: space is divided into separate partitions, but one object may have to intersect more than one partition. Space partitioning schemes can be regular, in which space is divided into uniformly shaped pieces, or irregular, in which space is divided adaptively into irregular pieces, with smaller pieces where there are more and smaller objects.
We will use ray tracing as the primary motivation while discussing these structures, although they can all also be used for view culling or collision detection. In Chapter 4, all objects were looped over while checking for intersections. For N objects, this is an O(N) linear search and is thus slow for large scenes. Like most search problems, the ray-object intersection can be computed in sub-linear time using “divide and conquer” techniques, provided we can create an ordered data structure as a preprocess. There are many techniques to do this.
This section discusses three of these techniques in detail: bounding volume hierarchies (Rubin & Whitted, 1980; Whitted, 1980; Goldsmith & Salmon, 1987), uniform spatial subdivision (Cleary, Wyvill, Birtwistle, & Vatti, 1983; Fujimoto, Tanaka, & Iwata, 1986; Amanatides & Woo, 1987), and binary space partitioning (Glassner, 1984; Jansen, 1986; Havran, 2000). An example of the first two strategies is shown in Figure 12.24.
A key operation in most intersection-acceleration schemes is computing the intersection of a ray with a bounding box (Figure 12.25). This differs from conventional intersection tests in that we do not need to know where the ray hits the box; we only need to know whether it hits the box.
To build an algorithm for ray-box intersection, we begin by considering a 2D ray whose direction vector has positive x and y components. We can generalize this to arbitrary 3D rays later. The 2D bounding box is defined by two horizontal and two vertical lines:
x = x_{min},
x = x_{max},
y = y_{min},
y = y_{max}.
The points bounded by these lines can be described in interval notation:
(x,y) ∈ [x_{min},x_{max}] × [y_{min},y_{max}].
As shown in Figure 12.26, the intersection test can be phrased in terms of these intervals. First, we compute the ray parameter where the ray hits the line x = x_{min}:
t_{xmin} = (x_{min} − x_{e})∕x_{d}.
We then make similar computations for t_{xmax}, t_{ymin}, and t_{ymax}. The ray hits the box if and only if the intervals [t_{xmin},t_{xmax}] and [t_{ymin},t_{ymax}] overlap; i.e., their intersection is nonempty. In pseudocode this algorithm is
t_{xmin} = (x_{min} − x_{e})∕x_{d}
t_{xmax} = (x_{max} − x_{e})∕x_{d}
t_{ymin} = (y_{min} − y_{e})∕y_{d}
t_{ymax} = (y_{max} − y_{e})∕y_{d}
if (t_{xmin} > t_{ymax}) or (t_{ymin} > t_{xmax}) then
    return false
else
    return true
The if statement may seem non-obvious. To see the logic of it, note that there is no overlap if the first interval is either entirely to the right or entirely to the left of the second interval.
The first thing we must address is the case when x_{d} or y_{d} is negative. If x_{d} is negative, then the ray will hit x_{max} before it hits x_{min}. Thus, the code for computing t_{xmin} and t_{xmax} expands to
if (x_{d} ≥ 0) then
    t_{xmin} = (x_{min} − x_{e})∕x_{d}
    t_{xmax} = (x_{max} − x_{e})∕x_{d}
else
    t_{xmin} = (x_{max} − x_{e})∕x_{d}
    t_{xmax} = (x_{min} − x_{e})∕x_{d}
A similar code expansion must be made for the y cases. A major concern is that horizontal and vertical rays have a zero value for y_{d} and x_{d}, respectively. This will cause a divide-by-zero, which may be a problem. However, before addressing this directly, we check whether IEEE floating point computation handles these cases gracefully for us. Recall from Section 1.5 the rules for divide-by-zero: for any positive real number a,
+a∕0 = +∞,
−a∕0 = −∞.
Consider the case of a vertical ray where x_{d} = 0 and y_{d} > 0. We can then calculate
t_{xmin} = (x_{min} − x_{e})∕0,
t_{xmax} = (x_{max} − x_{e})∕0.
There are three possibilities of interest:
x_{e} ≤ x_{min} (no hit);
x_{min} < x_{e} < x_{max} (hit);
x_{max} ≤ x_{e} (no hit).
For the first case, we have
t_{xmin} = (positive number)∕0,
t_{xmax} = (positive number)∕0.
This yields the interval (t_{xmin},t_{xmax}) = (∞,∞). That interval will not overlap with any interval, so there will be no hit, as desired. For the second case, we have
t_{xmin} = (negative number)∕0,
t_{xmax} = (positive number)∕0.
This yields the interval (t_{xmin},t_{xmax}) = (−∞,∞), which will overlap with all intervals and thus will yield a hit, as desired. The third case results in the interval (−∞,−∞), which yields no hit, as desired. Because these cases work as desired, we need no special checks for them. As is often the case, IEEE floating point conventions are our ally. However, there is still a problem with this approach.
Consider the code segment:
if (x_{d} ≥ 0) then
    t_{min} = (x_{min} − x_{e})∕x_{d}
    t_{max} = (x_{max} − x_{e})∕x_{d}
else
    t_{min} = (x_{max} − x_{e})∕x_{d}
    t_{max} = (x_{min} − x_{e})∕x_{d}
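This code breaks down when x_{d} = −0: the test x_{d} ≥ 0 succeeds, because IEEE comparison treats −0 as equal to 0, but dividing by −0 flips the signs of the resulting infinities, so the interval comes out reversed. This can be overcome by testing on the reciprocal of x_{d} (Williams, Barrus, Morley, & Shirley, 2005), which is −∞ when x_{d} = −0: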
a = 1∕x_{d}
if (a ≥ 0) then
    t_{min} = a(x_{min} − x_{e})
    t_{max} = a(x_{max} − x_{e})
else
    t_{min} = a(x_{max} − x_{e})
    t_{max} = a(x_{min} − x_{e})
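The complete 2D test, including the reciprocal trick, might be sketched in Python as follows. One caveat is specific to this choice of language: Python raises ZeroDivisionError on float division by zero instead of returning an IEEE infinity, so the sketch emulates the IEEE result with a small helper. The names reciprocal, slab, and ray_hits_box_2d are our own. Like the pseudocode above, the sketch tests the whole line through the ray and does not restrict t to a range.

```python
import math


def reciprocal(x):
    """1/x with IEEE semantics: 1/+0 is +inf and 1/-0 is -inf."""
    if x == 0.0:
        return math.copysign(math.inf, x)
    return 1.0 / x


def slab(lo, hi, e, d):
    """Parameter interval where the ray e + t d lies between lo and hi."""
    a = reciprocal(d)
    if a >= 0:
        return a * (lo - e), a * (hi - e)
    return a * (hi - e), a * (lo - e)


def ray_hits_box_2d(e, d, box_min, box_max):
    t_xmin, t_xmax = slab(box_min[0], box_max[0], e[0], d[0])
    t_ymin, t_ymax = slab(box_min[1], box_max[1], e[1], d[1])
    # No overlap if one interval lies entirely beyond the other.
    return not (t_xmin > t_ymax or t_ymin > t_xmax)


box = ((1.0, 1.0), (2.0, 2.0))
print(ray_hits_box_2d((0.0, 0.0), (1.0, 1.0), *box))     # diagonal ray
print(ray_hits_box_2d((1.5, 0.0), (-0.0, 1.0), *box))    # vertical, x_d = -0
print(ray_hits_box_2d((0.0, 0.0), (1.0, 0.0), *box))     # horizontal, below box
```

Testing the sign of the reciprocal rather than of x_{d} itself is what makes the second call behave the same for x_{d} = −0 as for x_{d} = +0.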
The basic idea of hierarchical bounding boxes can be seen by the common tactic of placing an axis-aligned 3D bounding box around all the objects as shown in Figure 12.27. Rays that hit the bounding box will actually be more expensive to compute than in a brute force search, because testing for intersection with the box is not free. However, rays that miss the box are cheaper than the brute force search. Such bounding boxes can be made hierarchical by partitioning the set of objects in a box and placing a box around each partition as shown in Figure 12.28. The data structure for the hierarchy shown in Figure 12.29 might be a tree with the large bounding box at the root and the two smaller bounding boxes as left and right subtrees. These would in turn each point to a list of three triangles. The intersection of a ray with this particular hard-coded tree would be
if (ray hits root box) then
    if (ray hits left subtree box) then
        check three triangles for intersection
    if (ray hits right subtree box) then
        check other three triangles for intersection
    if (an intersection is returned from each subtree) then
        return the closest of the two hits
    else if (an intersection is returned from exactly one subtree) then
        return that intersection
    else
        return false
else
    return false
Some observations related to this algorithm are that there is no geometric ordering between the two subtrees, and there is no reason a ray might not hit both subtrees. Indeed, there is no reason that the two subtrees might not overlap.
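The same logic, written recursively so that it works for a tree of any depth, can be sketched in Python. This is an illustrative sketch under assumptions, not the implementation developed in this chapter: it works in 2D, uses circles instead of triangles as primitives to keep the intersection code short, and the names hits_box, Circle, Node, and closest_hit are our own.

```python
import math


def _recip(x):
    # Python raises on float division by zero; emulate IEEE signed infinity.
    return math.copysign(math.inf, x) if x == 0.0 else 1.0 / x


def hits_box(e, d, lo, hi):
    """Does the ray e + t d, t >= 0, pass through the box [lo, hi]?"""
    t_min, t_max = 0.0, math.inf
    for k in range(2):
        a = _recip(d[k])
        t_near, t_far = a * (lo[k] - e[k]), a * (hi[k] - e[k])
        if a < 0:
            t_near, t_far = t_far, t_near
        t_min, t_max = max(t_min, t_near), min(t_max, t_far)
    return t_min <= t_max


class Circle:
    def __init__(self, center, radius):
        self.c, self.r = center, radius

    def bounds(self):
        return ((self.c[0] - self.r, self.c[1] - self.r),
                (self.c[0] + self.r, self.c[1] + self.r))

    def hit(self, e, d):
        """Smallest t >= 0 at which the ray meets the circle, or None."""
        ox, oy = e[0] - self.c[0], e[1] - self.c[1]
        A = d[0] * d[0] + d[1] * d[1]
        B = 2.0 * (ox * d[0] + oy * d[1])
        C = ox * ox + oy * oy - self.r * self.r
        disc = B * B - 4.0 * A * C
        if disc < 0.0:
            return None
        root = math.sqrt(disc)
        ts = [t for t in ((-B - root) / (2 * A), (-B + root) / (2 * A))
              if t >= 0.0]
        return min(ts) if ts else None


class Node:
    """Leaf if prims is given; otherwise an interior node with two children."""

    def __init__(self, prims=None, left=None, right=None):
        self.prims, self.left, self.right = prims, left, right
        boxes = ([p.bounds() for p in prims] if prims
                 else [(left.lo, left.hi), (right.lo, right.hi)])
        self.lo = tuple(min(b[0][k] for b in boxes) for k in range(2))
        self.hi = tuple(max(b[1][k] for b in boxes) for k in range(2))


def closest_hit(node, e, d):
    if not hits_box(e, d, node.lo, node.hi):
        return None                      # the whole subtree is skipped
    if node.prims:
        ts = [p.hit(e, d) for p in node.prims]
    else:                                # both children must be checked
        ts = [closest_hit(node.left, e, d), closest_hit(node.right, e, d)]
    ts = [t for t in ts if t is not None]
    return min(ts) if ts else None


left = Node(prims=[Circle((2.0, 0.0), 0.5), Circle((4.0, 0.0), 0.5)])
right = Node(prims=[Circle((6.0, 0.0), 0.5), Circle((8.0, 0.0), 0.5)])
root = Node(left=left, right=right)
print(closest_hit(root, (0.0, 0.0), (1.0, 0.0)))
```

Because the two subtrees have no geometric ordering, the sketch always visits both children whose boxes are hit and keeps the smaller t, mirroring the "return the closest of the two hits" branch above.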
A key point of such data hierarchies is that a box is guaranteed to bound all objects that are below it in the hierarchy, but it is not guaranteed to contain all objects that overlap it spatially, as shown in Figure 12.29. This makes this geometric search somewhat more complicated than a traditional binary search on strictly ordered one-dimensional data. The reader may note that several possible optimizations present themselves