Home | History | Annotate | Download | only in doc
      1 namespace Eigen {
      2 
      3 /** \page TopicInsideEigenExample What happens inside Eigen, on a simple example
      4 
      5 \b Table \b of \b contents
      6   - \ref WhyInteresting
      7   - \ref ConstructingVectors
      8   - \ref ConstructionOfSumXpr
      9   - \ref Assignment
     10 \n
     11 
     12 <hr>
     13 
     14 
     15 Consider the following example program:
     16 
     17 \code
     18 #include<Eigen/Core>
     19 
     20 int main()
     21 {
     22   int size = 50;
     23   // VectorXf is a vector of floats, with dynamic size.
     24   Eigen::VectorXf u(size), v(size), w(size);
     25   u = v + w;
     26 }
     27 \endcode
     28 
     29 The goal of this page is to understand how Eigen compiles it, assuming that SSE2 vectorization is enabled (GCC option -msse2).
     30 
     31 \section WhyInteresting Why it's interesting
     32 
     33 Maybe you think, that the above example program is so simple, that compiling it shouldn't involve anything interesting. So before starting, let us explain what is nontrivial in compiling it correctly -- that is, producing optimized code -- so that the complexity of Eigen, that we'll explain here, is really useful.
     34 
     35 Look at the line of code
     36 \code
     37   u = v + w;   //   (*)
     38 \endcode
     39 
     40 The first important thing about compiling it, is that the arrays should be traversed only once, like
     41 \code
     42   for(int i = 0; i < size; i++) u[i] = v[i] + w[i];
     43 \endcode
     44 The problem is that if we make a naive C++ library where the VectorXf class has an operator+ returning a VectorXf, then the line of code (*) will amount to:
     45 \code
     46   VectorXf tmp = v + w;
     47   VectorXf u = tmp;
     48 \endcode
     49 Obviously, the introduction of the temporary \a tmp here is useless. It has a very bad effect on performance, first because the creation of \a tmp requires a dynamic memory allocation in this context, and second as there are now two for loops:
     50 \code
     51   for(int i = 0; i < size; i++) tmp[i] = v[i] + w[i];
     52   for(int i = 0; i < size; i++) u[i] = tmp[i];
     53 \endcode
     54 Traversing the arrays twice instead of once is terrible for performance, as it means that we do many redundant memory accesses.
     55 
     56 The second important thing about compiling the above program, is to make correct use of SSE2 instructions. Notice that Eigen also supports AltiVec and that all the discussion that we make here applies also to AltiVec.
     57 
     58 SSE2, like AltiVec, is a set of instructions allowing to perform computations on packets of 128 bits at once. Since a float is 32 bits, this means that SSE2 instructions can handle 4 floats at once. This means that, if correctly used, they can make our computation go up to 4x faster.
     59 
     60 However, in the above program, we have chosen size=50, so our vectors consist of 50 float's, and 50 is not a multiple of 4. This means that we cannot hope to do all of that computation using SSE2 instructions. The second best thing, to which we should aim, is to handle the 48 first coefficients with SSE2 instructions, since 48 is the biggest multiple of 4 below 50, and then handle separately, without SSE2, the 49th and 50th coefficients. Something like this:
     61 
     62 \code
     63   for(int i = 0; i < 4*(size/4); i+=4) u.packet(i)  = v.packet(i) + w.packet(i);
     64   for(int i = 4*(size/4); i < size; i++) u[i] = v[i] + w[i];
     65 \endcode
     66 
     67 So let us look line by line at our example program, and let's follow Eigen as it compiles it.
     68 
     69 \section ConstructingVectors Constructing vectors
     70 
     71 Let's analyze the first line:
     72 
     73 \code
     74   Eigen::VectorXf u(size), v(size), w(size);
     75 \endcode
     76 
     77 First of all, VectorXf is the following typedef:
     78 \code
     79   typedef Matrix<float, Dynamic, 1> VectorXf;
     80 \endcode
     81 
     82 The class template Matrix is declared in src/Core/util/ForwardDeclarations.h with 6 template parameters, but the last 3 are automatically determined by the first 3. So you don't need to worry about them for now. Here, Matrix\<float, Dynamic, 1\> means a matrix of floats, with a dynamic number of rows and 1 column.
     83 
     84 The Matrix class inherits a base class, MatrixBase. Don't worry about it, for now it suffices to say that MatrixBase is what unifies matrices/vectors and all the expressions types -- more on that below.
     85 
     86 When we do
     87 \code
     88   Eigen::VectorXf u(size);
     89 \endcode
     90 the constructor that is called is Matrix::Matrix(int), in src/Core/Matrix.h. Besides some assertions, all it does is to construct the \a m_storage member, which is of type DenseStorage\<float, Dynamic, Dynamic, 1\>.
     91 
     92 You may wonder, isn't it overengineering to have the storage in a separate class? The reason is that the Matrix class template covers all kinds of matrices and vector: both fixed-size and dynamic-size. The storage method is not the same in these two cases. For fixed-size, the matrix coefficients are stored as a plain member array. For dynamic-size, the coefficients will be stored as a pointer to a dynamically-allocated array. Because of this, we need to abstract storage away from the Matrix class. That's DenseStorage.
     93 
     94 Let's look at this constructor, in src/Core/DenseStorage.h. You can see that there are many partial template specializations of DenseStorages here, treating separately the cases where dimensions are Dynamic or fixed at compile-time. The partial specialization that we are looking at is:
     95 \code
     96 template<typename T, int _Cols> class DenseStorage<T, Dynamic, Dynamic, _Cols>
     97 \endcode
     98 
     99 Here, the constructor called is DenseStorage::DenseStorage(int size, int rows, int columns)
    100 with size=50, rows=50, columns=1.
    101 
    102 Here is this constructor:
    103 \code
    104 inline DenseStorage(int size, int rows, int) : m_data(internal::aligned_new<T>(size)), m_rows(rows) {}
    105 \endcode
    106 
    107 Here, the \a m_data member is the actual array of coefficients of the matrix. As you see, it is dynamically allocated. Rather than calling new[] or malloc(), as you can see, we have our own internal::aligned_new defined in src/Core/util/Memory.h. What it does is that if vectorization is enabled, then it uses a platform-specific call to allocate a 128-bit-aligned array, as that is very useful for vectorization with both SSE2 and AltiVec. If vectorization is disabled, it amounts to the standard new[].
    108 
    109 As you can see, the constructor also sets the \a m_rows member to \a size. Notice that there is no \a m_columns member: indeed, in this partial specialization of DenseStorage, we know the number of columns at compile-time, since the _Cols template parameter is different from Dynamic. Namely, in our case, _Cols is 1, which is to say that our vector is just a matrix with 1 column. Hence, there is no need to store the number of columns as a runtime variable.
    110 
    111 When you call VectorXf::data() to get the pointer to the array of coefficients, it returns DenseStorage::data() which returns the \a m_data member.
    112 
    113 When you call VectorXf::size() to get the size of the vector, this is actually a method in the base class MatrixBase. It determines that the vector is a column-vector, since ColsAtCompileTime==1 (this comes from the template parameters in the typedef VectorXf). It deduces that the size is the number of rows, so it returns VectorXf::rows(), which returns DenseStorage::rows(), which returns the \a m_rows member, which was set to \a size by the constructor.
    114 
    115 \section ConstructionOfSumXpr Construction of the sum expression
    116 
    117 Now that our vectors are constructed, let's move on to the next line:
    118 
    119 \code
    120 u = v + w;
    121 \endcode
    122 
    123 The executive summary is that operator+ returns a "sum of vectors" expression, but doesn't actually perform the computation. It is the operator=, whose call occurs thereafter, that does the computation.
    124 
    125 Let us now see what Eigen does when it sees this:
    126 
    127 \code
    128 v + w
    129 \endcode
    130 
    131 Here, v and w are of type VectorXf, which is a typedef for a specialization of Matrix (as we explained above), which is a subclass of MatrixBase. So what is being called is
    132 
    133 \code
    134 MatrixBase::operator+(const MatrixBase&)
    135 \endcode
    136 
    137 The return type of this operator is
    138 \code
    139 CwiseBinaryOp<internal::scalar_sum_op<float>, VectorXf, VectorXf>
    140 \endcode
    141 The CwiseBinaryOp class is our first encounter with an expression template. As we said, the operator+ doesn't by itself perform any computation, it just returns an abstract "sum of vectors" expression. Since there are also "difference of vectors" and "coefficient-wise product of vectors" expressions, we unify them all as "coefficient-wise binary operations", which we abbreviate as "CwiseBinaryOp". "Coefficient-wise" means that the operations is performed coefficient by coefficient. "binary" means that there are two operands -- we are adding two vectors with one another.
    142 
    143 Now you might ask, what if we did something like
    144 
    145 \code
    146 v + w + u;
    147 \endcode
    148 
    149 The first v + w would return a CwiseBinaryOp as above, so in order for this to compile, we'd need to define an operator+ also in the class CwiseBinaryOp... at this point it starts looking like a nightmare: are we going to have to define all operators in each of the expression classes (as you guessed, CwiseBinaryOp is only one of many) ? This looks like a dead end!
    150 
    151 The solution is that CwiseBinaryOp itself, as well as Matrix and all the other expression types, is a subclass of MatrixBase. So it is enough to define once and for all the operators in class MatrixBase.
    152 
    153 Since MatrixBase is the common base class of different subclasses, the aspects that depend on the subclass must be abstracted from MatrixBase. This is called polymorphism.
    154 
    155 The classical approach to polymorphism in C++ is by means of virtual functions. This is dynamic polymorphism. Here we don't want dynamic polymorphism because the whole design of Eigen is based around the assumption that all the complexity, all the abstraction, gets resolved at compile-time. This is crucial: if the abstraction can't get resolved at compile-time, Eigen's compile-time optimization mechanisms become useless, not to mention that if that abstraction has to be resolved at runtime it'll incur an overhead by itself.
    156 
    157 Here, what we want is to have a single class MatrixBase as the base of many subclasses, in such a way that each MatrixBase object (be it a matrix, or vector, or any kind of expression) knows at compile-time (as opposed to run-time) of which particular subclass it is an object (i.e. whether it is a matrix, or an expression, and what kind of expression).
    158 
    159 The solution is the <a href="http://en.wikipedia.org/wiki/Curiously_Recurring_Template_Pattern">Curiously Recurring Template Pattern</a>. Let's do the break now. Hopefully you can read this wikipedia page during the break if needed, but it won't be allowed during the exam.
    160 
    161 In short, MatrixBase takes a template parameter \a Derived. Whenever we define a subclass Subclass, we actually make Subclass inherit MatrixBase\<Subclass\>. The point is that different subclasses inherit different MatrixBase types. Thanks to this, whenever we have an object of a subclass, and we call on it some MatrixBase method, we still remember even from inside the MatrixBase method which particular subclass we're talking about.
    162 
    163 This means that we can put almost all the methods and operators in the base class MatrixBase, and have only the bare minimum in the subclasses. If you look at the subclasses in Eigen, like for instance the CwiseBinaryOp class, they have very few methods. There are coeff() and sometimes coeffRef() methods for access to the coefficients, there are rows() and cols() methods returning the number of rows and columns, but there isn't much more than that. All the meat is in MatrixBase, so it only needs to be coded once for all kinds of expressions, matrices, and vectors.
    164 
    165 So let's end this digression and come back to the piece of code from our example program that we were currently analyzing,
    166 
    167 \code
    168 v + w
    169 \endcode
    170 
    171 Now that MatrixBase is a good friend, let's write fully the prototype of the operator+ that gets called here (this code is from src/Core/MatrixBase.h):
    172 
    173 \code
    174 template<typename Derived>
    175 class MatrixBase
    176 {
    177   // ...
    178 
    179   template<typename OtherDerived>
    180   const CwiseBinaryOp<internal::scalar_sum_op<typename internal::traits<Derived>::Scalar>, Derived, OtherDerived>
    181   operator+(const MatrixBase<OtherDerived> &other) const;
    182 
    183   // ...
    184 };
    185 \endcode
    186 
    187 Here of course, \a Derived and \a OtherDerived are VectorXf.
    188 
    189 As we said, CwiseBinaryOp is also used for other operations such as substration, so it takes another template parameter determining the operation that will be applied to coefficients. This template parameter is a functor, that is, a class in which we have an operator() so it behaves like a function. Here, the functor used is internal::scalar_sum_op. It is defined in src/Core/Functors.h.
    190 
    191 Let us now explain the internal::traits here. The internal::scalar_sum_op class takes one template parameter: the type of the numbers to handle. Here of course we want to pass the scalar type (a.k.a. numeric type) of VectorXf, which is \c float. How do we determine which is the scalar type of \a Derived ? Throughout Eigen, all matrix and expression types define a typedef \a Scalar which gives its scalar type. For example, VectorXf::Scalar is a typedef for \c float. So here, if life was easy, we could find the numeric type of \a Derived as just
    192 \code
    193 typename Derived::Scalar
    194 \endcode
    195 Unfortunately, we can't do that here, as the compiler would complain that the type Derived hasn't yet been defined. So we use a workaround: in src/Core/util/ForwardDeclarations.h, we declared (not defined!) all our subclasses, like Matrix, and we also declared the following class template:
    196 \code
    197 template<typename T> struct internal::traits;
    198 \endcode
    199 In src/Core/Matrix.h, right \em before the definition of class Matrix, we define a partial specialization of internal::traits for T=Matrix\<any template parameters\>. In this specialization of internal::traits, we define the Scalar typedef. So when we actually define Matrix, it is legal to refer to "typename internal::traits\<Matrix\>::Scalar".
    200 
    201 Anyway, we have declared our operator+. In our case, where \a Derived and \a OtherDerived are VectorXf, the above declaration amounts to:
    202 \code
    203 class MatrixBase<VectorXf>
    204 {
    205   // ...
    206 
    207   const CwiseBinaryOp<internal::scalar_sum_op<float>, VectorXf, VectorXf>
    208   operator+(const MatrixBase<VectorXf> &other) const;
    209 
    210   // ...
    211 };
    212 \endcode
    213 
    214 Let's now jump to src/Core/CwiseBinaryOp.h to see how it is defined. As you can see there, all it does is to return a CwiseBinaryOp object, and this object is just storing references to the left-hand-side and right-hand-side expressions -- here, these are the vectors \a v and \a w. Well, the CwiseBinaryOp object is also storing an instance of the (empty) functor class, but you shouldn't worry about it as that is a minor implementation detail.
    215 
    216 Thus, the operator+ hasn't performed any actual computation. To summarize, the operation \a v + \a w just returned an object of type CwiseBinaryOp which did nothing else than just storing references to \a v and \a w.
    217 
    218 \section Assignment The assignment
    219 
    220 At this point, the expression \a v + \a w has finished evaluating, so, in the process of compiling the line of code
    221 \code
    222 u = v + w;
    223 \endcode
    224 we now enter the operator=.
    225 
    226 What operator= is being called here? The vector u is an object of class VectorXf, i.e. Matrix. In src/Core/Matrix.h, inside the definition of class Matrix, we see this:
    227 \code
    228     template<typename OtherDerived>
    229     inline Matrix& operator=(const MatrixBase<OtherDerived>& other)
    230     {
    231       eigen_assert(m_storage.data()!=0 && "you cannot use operator= with a non initialized matrix (instead use set()");
    232       return Base::operator=(other.derived());
    233     }
    234 \endcode
    235 Here, Base is a typedef for MatrixBase\<Matrix\>. So, what is being called is the operator= of MatrixBase. Let's see its prototype in src/Core/MatrixBase.h:
    236 \code
    237     template<typename OtherDerived>
    238     Derived& operator=(const MatrixBase<OtherDerived>& other);
    239 \endcode
    240 Here, \a Derived is VectorXf (since u is a VectorXf) and \a OtherDerived is CwiseBinaryOp. More specifically, as explained in the previous section, \a OtherDerived is:
    241 \code
    242 CwiseBinaryOp<internal::scalar_sum_op<float>, VectorXf, VectorXf>
    243 \endcode
    244 So the full prototype of the operator= being called is:
    245 \code
    246 VectorXf& MatrixBase<VectorXf>::operator=(const MatrixBase<CwiseBinaryOp<internal::scalar_sum_op<float>, VectorXf, VectorXf> > & other);
    247 \endcode
    248 This operator= literally reads "copying a sum of two VectorXf's into another VectorXf".
    249 
    250 Let's now look at the implementation of this operator=. It resides in the file src/Core/Assign.h.
    251 
    252 What we can see there is:
    253 \code
    254 template<typename Derived>
    255 template<typename OtherDerived>
    256 inline Derived& MatrixBase<Derived>
    257   ::operator=(const MatrixBase<OtherDerived>& other)
    258 {
    259   return internal::assign_selector<Derived,OtherDerived>::run(derived(), other.derived());
    260 }
    261 \endcode
    262 
    263 OK so our next task is to understand internal::assign_selector :)
    264 
    265 Here is its declaration (all that is still in the same file src/Core/Assign.h)
    266 \code
    267 template<typename Derived, typename OtherDerived,
    268          bool EvalBeforeAssigning = int(OtherDerived::Flags) & EvalBeforeAssigningBit,
    269          bool NeedToTranspose = Derived::IsVectorAtCompileTime
    270                 && OtherDerived::IsVectorAtCompileTime
    271                 && int(Derived::RowsAtCompileTime) == int(OtherDerived::ColsAtCompileTime)
    272                 && int(Derived::ColsAtCompileTime) == int(OtherDerived::RowsAtCompileTime)
    273                 && int(Derived::SizeAtCompileTime) != 1>
    274 struct internal::assign_selector;
    275 \endcode
    276 
    277 So internal::assign_selector takes 4 template parameters, but the 2 last ones are automatically determined by the 2 first ones.
    278 
    279 EvalBeforeAssigning is here to enforce the EvalBeforeAssigningBit. As explained <a href="TopicLazyEvaluation.html">here</a>, certain expressions have this flag which makes them automatically evaluate into temporaries before assigning them to another expression. This is the case of the Product expression, in order to avoid strange aliasing effects when doing "m = m * m;" However, of course here our CwiseBinaryOp expression doesn't have the EvalBeforeAssigningBit: we said since the beginning that we didn't want a temporary to be introduced here. So if you go to src/Core/CwiseBinaryOp.h, you'll see that the Flags in internal::traits\<CwiseBinaryOp\> don't include the EvalBeforeAssigningBit. The Flags member of CwiseBinaryOp is then imported from the internal::traits by the EIGEN_GENERIC_PUBLIC_INTERFACE macro. Anyway, here the template parameter EvalBeforeAssigning has the value \c false.
    280 
    281 NeedToTranspose is here for the case where the user wants to copy a row-vector into a column-vector. We allow this as a special exception to the general rule that in assignments we require the dimesions to match. Anyway, here both the left-hand and right-hand sides are column vectors, in the sense that ColsAtCompileTime is equal to 1. So NeedToTranspose is \c false too.
    282 
    283 So, here we are in the partial specialization:
    284 \code
    285 internal::assign_selector<Derived, OtherDerived, false, false>
    286 \endcode
    287 
    288 Here's how it is defined:
    289 \code
    290 template<typename Derived, typename OtherDerived>
    291 struct internal::assign_selector<Derived,OtherDerived,false,false> {
    292   static Derived& run(Derived& dst, const OtherDerived& other) { return dst.lazyAssign(other.derived()); }
    293 };
    294 \endcode
    295 
    296 OK so now our next job is to understand how lazyAssign works :)
    297 
    298 \code
    299 template<typename Derived>
    300 template<typename OtherDerived>
    301 inline Derived& MatrixBase<Derived>
    302   ::lazyAssign(const MatrixBase<OtherDerived>& other)
    303 {
    304   EIGEN_STATIC_ASSERT_SAME_MATRIX_SIZE(Derived,OtherDerived)
    305   eigen_assert(rows() == other.rows() && cols() == other.cols());
    306   internal::assign_impl<Derived, OtherDerived>::run(derived(),other.derived());
    307   return derived();
    308 }
    309 \endcode
    310 
    311 What do we see here? Some assertions, and then the only interesting line is:
    312 \code
    313   internal::assign_impl<Derived, OtherDerived>::run(derived(),other.derived());
    314 \endcode
    315 
    316 OK so now we want to know what is inside internal::assign_impl.
    317 
    318 Here is its declaration:
    319 \code
    320 template<typename Derived1, typename Derived2,
    321          int Vectorization = internal::assign_traits<Derived1, Derived2>::Vectorization,
    322          int Unrolling = internal::assign_traits<Derived1, Derived2>::Unrolling>
    323 struct internal::assign_impl;
    324 \endcode
    325 Again, internal::assign_selector takes 4 template parameters, but the 2 last ones are automatically determined by the 2 first ones.
    326 
    327 These two parameters \a Vectorization and \a Unrolling are determined by a helper class internal::assign_traits. Its job is to determine which vectorization strategy to use (that is \a Vectorization) and which unrolling strategy to use (that is \a Unrolling).
    328 
    329 We'll not enter into the details of how these strategies are chosen (this is in the implementation of internal::assign_traits at the top of the same file). Let's just say that here \a Vectorization has the value \a LinearVectorization, and \a Unrolling has the value \a NoUnrolling (the latter is obvious since our vectors have dynamic size so there's no way to unroll the loop at compile-time).
    330 
    331 So the partial specialization of internal::assign_impl that we're looking at is:
    332 \code
    333 internal::assign_impl<Derived1, Derived2, LinearVectorization, NoUnrolling>
    334 \endcode
    335 
    336 Here is how it's defined:
    337 \code
    338 template<typename Derived1, typename Derived2>
    339 struct internal::assign_impl<Derived1, Derived2, LinearVectorization, NoUnrolling>
    340 {
    341   static void run(Derived1 &dst, const Derived2 &src)
    342   {
    343     const int size = dst.size();
    344     const int packetSize = internal::packet_traits<typename Derived1::Scalar>::size;
    345     const int alignedStart = internal::assign_traits<Derived1,Derived2>::DstIsAligned ? 0
    346                            : internal::first_aligned(&dst.coeffRef(0), size);
    347     const int alignedEnd = alignedStart + ((size-alignedStart)/packetSize)*packetSize;
    348 
    349     for(int index = 0; index < alignedStart; index++)
    350       dst.copyCoeff(index, src);
    351 
    352     for(int index = alignedStart; index < alignedEnd; index += packetSize)
    353     {
    354       dst.template copyPacket<Derived2, Aligned, internal::assign_traits<Derived1,Derived2>::SrcAlignment>(index, src);
    355     }
    356 
    357     for(int index = alignedEnd; index < size; index++)
    358       dst.copyCoeff(index, src);
    359   }
    360 };
    361 \endcode
    362 
    363 Here's how it works. \a LinearVectorization means that the left-hand and right-hand side expression can be accessed linearly i.e. you can refer to their coefficients by one integer \a index, as opposed to having to refer to its coefficients by two integers \a row, \a column.
    364 
    365 As we said at the beginning, vectorization works with blocks of 4 floats. Here, \a PacketSize is 4.
    366 
    367 There are two potential problems that we need to deal with:
    368 \li first, vectorization works much better if the packets are 128-bit-aligned. This is especially important for write access. So when writing to the coefficients of \a dst, we want to group these coefficients by packets of 4 such that each of these packets is 128-bit-aligned. In general, this requires to skip a few coefficients at the beginning of \a dst. This is the purpose of \a alignedStart. We then copy these first few coefficients one by one, not by packets. However, in our case, the \a dst expression is a VectorXf and remember that in the construction of the vectors we allocated aligned arrays. Thanks to \a DstIsAligned, Eigen remembers that without having to do any runtime check, so \a alignedStart is zero and this part is avoided altogether.
    369 \li second, the number of coefficients to copy is not in general a multiple of \a packetSize. Here, there are 50 coefficients to copy and \a packetSize is 4. So we'll have to copy the last 2 coefficients one by one, not by packets. Here, \a alignedEnd is 48.
    370 
    371 Now come the actual loops.
    372 
    373 First, the vectorized part: the 48 first coefficients out of 50 will be copied by packets of 4:
    374 \code
    375   for(int index = alignedStart; index < alignedEnd; index += packetSize)
    376   {
    377     dst.template copyPacket<Derived2, Aligned, internal::assign_traits<Derived1,Derived2>::SrcAlignment>(index, src);
    378   }
    379 \endcode
    380 
    381 What is copyPacket? It is defined in src/Core/Coeffs.h:
    382 \code
    383 template<typename Derived>
    384 template<typename OtherDerived, int StoreMode, int LoadMode>
    385 inline void MatrixBase<Derived>::copyPacket(int index, const MatrixBase<OtherDerived>& other)
    386 {
    387   eigen_internal_assert(index >= 0 && index < size());
    388   derived().template writePacket<StoreMode>(index,
    389     other.derived().template packet<LoadMode>(index));
    390 }
    391 \endcode
    392 
    393 OK, what are writePacket() and packet() here?
    394 
    395 First, writePacket() here is a method on the left-hand side VectorXf. So we go to src/Core/Matrix.h to look at its definition:
    396 \code
    397 template<int StoreMode>
    398 inline void writePacket(int index, const PacketScalar& x)
    399 {
    400   internal::pstoret<Scalar, PacketScalar, StoreMode>(m_storage.data() + index, x);
    401 }
    402 \endcode
    403 Here, \a StoreMode is \a #Aligned, indicating that we are doing a 128-bit-aligned write access, \a PacketScalar is a type representing a "SSE packet of 4 floats" and internal::pstoret is a function writing such a packet in memory. Their definitions are architecture-specific, we find them in src/Core/arch/SSE/PacketMath.h:
    404 
    405 The line in src/Core/arch/SSE/PacketMath.h that determines the PacketScalar type (via a typedef in Matrix.h) is:
    406 \code
    407 template<> struct internal::packet_traits<float>  { typedef __m128  type; enum {size=4}; };
    408 \endcode
    409 Here, __m128 is a SSE-specific type. Notice that the enum \a size here is what was used to define \a packetSize above.
    410 
    411 And here is the implementation of internal::pstoret:
    412 \code
    413 template<> inline void internal::pstore(float*  to, const __m128&  from) { _mm_store_ps(to, from); }
    414 \endcode
    415 Here, __mm_store_ps is a SSE-specific intrinsic function, representing a single SSE instruction. The difference between internal::pstore and internal::pstoret is that internal::pstoret is a dispatcher handling both the aligned and unaligned cases, you find its definition in src/Core/GenericPacketMath.h:
    416 \code
    417 template<typename Scalar, typename Packet, int LoadMode>
    418 inline void internal::pstoret(Scalar* to, const Packet& from)
    419 {
    420   if(LoadMode == Aligned)
    421     internal::pstore(to, from);
    422   else
    423     internal::pstoreu(to, from);
    424 }
    425 \endcode
    426 
    427 OK, that explains how writePacket() works. Now let's look into the packet() call. Remember that we are analyzing this line of code inside copyPacket():
    428 \code
    429 derived().template writePacket<StoreMode>(index,
    430     other.derived().template packet<LoadMode>(index));
    431 \endcode
    432 
    433 Here, \a other is our sum expression \a v + \a w. The .derived() is just casting from MatrixBase to the subclass which here is CwiseBinaryOp. So let's go to src/Core/CwiseBinaryOp.h:
    434 \code
    435 class CwiseBinaryOp
    436 {
    437   // ...
    438     template<int LoadMode>
    439     inline PacketScalar packet(int index) const
    440     {
    441       return m_functor.packetOp(m_lhs.template packet<LoadMode>(index), m_rhs.template packet<LoadMode>(index));
    442     }
    443 };
    444 \endcode
    445 Here, \a m_lhs is the vector \a v, and \a m_rhs is the vector \a w. So the packet() function here is Matrix::packet(). The template parameter \a LoadMode is \a #Aligned. So we're looking at
    446 \code
    447 class Matrix
    448 {
    449   // ...
    450     template<int LoadMode>
    451     inline PacketScalar packet(int index) const
    452     {
    453       return internal::ploadt<Scalar, LoadMode>(m_storage.data() + index);
    454     }
    455 };
    456 \endcode
    457 We let you look up the definition of internal::ploadt in GenericPacketMath.h and the internal::pload in src/Core/arch/SSE/PacketMath.h. It is very similar to the above for internal::pstore.
    458 
    459 Let's go back to CwiseBinaryOp::packet(). Once the packets from the vectors \a v and \a w have been returned, what does this function do? It calls m_functor.packetOp() on them. What is m_functor? Here we must remember what particular template specialization of CwiseBinaryOp we're dealing with:
    460 \code
    461 CwiseBinaryOp<internal::scalar_sum_op<float>, VectorXf, VectorXf>
    462 \endcode
    463 So m_functor is an object of the empty class internal::scalar_sum_op<float>. As we mentioned above, don't worry about why we constructed an object of this empty class at all -- it's an implementation detail, the point is that some other functors need to store member data.
    464 
    465 Anyway, internal::scalar_sum_op is defined in src/Core/Functors.h:
    466 \code
    467 template<typename Scalar> struct internal::scalar_sum_op EIGEN_EMPTY_STRUCT {
    468   inline const Scalar operator() (const Scalar& a, const Scalar& b) const { return a + b; }
    469   template<typename PacketScalar>
    470   inline const PacketScalar packetOp(const PacketScalar& a, const PacketScalar& b) const
    471   { return internal::padd(a,b); }
    472 };
    473 \endcode
    474 As you can see, all what packetOp() does is to call internal::padd on the two packets. Here is the definition of internal::padd from src/Core/arch/SSE/PacketMath.h:
    475 \code
    476 template<> inline __m128  internal::padd(const __m128&  a, const __m128&  b) { return _mm_add_ps(a,b); }
    477 \endcode
    478 Here, _mm_add_ps is a SSE-specific intrinsic function, representing a single SSE instruction.
    479 
    480 To summarize, the loop
    481 \code
    482   for(int index = alignedStart; index < alignedEnd; index += packetSize)
    483   {
    484     dst.template copyPacket<Derived2, Aligned, internal::assign_traits<Derived1,Derived2>::SrcAlignment>(index, src);
    485   }
    486 \endcode
    487 has been compiled to the following code: for \a index going from 0 to the 11 ( = 48/4 - 1), read the i-th packet (of 4 floats) from the vector v and the i-th packet from the vector w using two __mm_load_ps SSE instructions, then add them together using a __mm_add_ps instruction, then store the result using a __mm_store_ps instruction.
    488 
    489 There remains the second loop handling the last few (here, the last 2) coefficients:
    490 \code
    491   for(int index = alignedEnd; index < size; index++)
    492     dst.copyCoeff(index, src);
    493 \endcode
    494 However, it works just like the one we just explained, it is just simpler because there is no SSE vectorization involved here. copyPacket() becomes copyCoeff(), packet() becomes coeff(), writePacket() becomes coeffRef(). If you followed us this far, you can probably understand this part by yourself.
    495 
    496 We see that all the C++ abstraction of Eigen goes away during compilation and that we indeed are precisely controlling which assembly instructions we emit. Such is the beauty of C++! Since we have such precise control over the emitted assembly instructions, but such complex logic to choose the right instructions, we can say that Eigen really behaves like an optimizing compiler. If you prefer, you could say that Eigen behaves like a script for the compiler. In a sense, C++ template metaprogramming is scripting the compiler -- and it's been shown that this scripting language is Turing-complete. See <a href="http://en.wikipedia.org/wiki/Template_metaprogramming"> Wikipedia</a>.
    497 
    498 */
    499 
    500 }
    501