Tại sao tôi nhận được rất nhiều lỗi cho người giải Runge Kutta Fehlberg của tôi?

Dự án hiện tại của tôi là lập trình lại mô hình gấp protein liên quan đến giải pháp hàng ngàn ODE trong C ++. Tôi đã dừng lại và bắt đầu tiến bộ khi tôi đang viết bộ giải để chạy hoàn toàn trên GPU. Cuối cùng tôi đã tích hợp được nó nhưng khi tôi cố gắng giải dC / dt = -C bằng cách sử dụng giải pháp bậc 5 từ thuật toán RKF45 với kích thước bước cố định là h = .001, tôi nhận được quyết định từ giá trị máy tính của e ^ {- t } theo thứ tự 10 ^ {- 4} khi có lỗi toàn cầu theo thứ tự 10 ^ {- 12}.

Có phải vì tôi không sử dụng kích thước bước thích ứng và kiểm soát lỗi bằng cả hai giải pháp thứ tư và thứ năm mà điều này đang xảy ra? Tôi nghĩ rằng vì một thuật toán thích ứng nhìn vào sự khác biệt giữa các lần đoán thứ tư và thứ năm và về cơ bản coi lệnh thứ năm là "câu trả lời đúng" mà tôi chỉ sử dụng câu trả lời đó để thử tích hợp.

BIÊN TẬP :

Tôi nghĩ rằng kể từ khi

Lỗi toàn cầu = (Số điểm) * (Lỗi cục bộ)

và

Lỗi cục bộ = h ^ O

Trong đó h là kích thước bước và O là thứ tự, tôi tính

O = ln (Lỗi toàn cầu / Số điểm) / ln h

Thực hiện phân tích đơn hàng theo cả cách tôi nghĩ chính xác và cách được đề xuất trong câu trả lời, tôi nhận được những điều sau:

Number of Points    h       Global Error    My Way      The Answer's Way
10                  0.1     2.89E-06        6.539E+00   
20                  0.05    7.09E-08        6.495E+00   5.350E+00
40                  0.025   1.76E-07        5.216E+00   -1.315E+00
50                  0.02    1.128E-07       5.089E+00   2.003E+00
80                  0.0125  4.401E-08       4.866E+00   2.002E+00
100                 0.01    2.816E-08       4.775E+00   2.001E+00
500                 0.002   1.126E-09       4.316E+00   2.000E+00
1000                0.001   2.810E-10       4.184E+00   2.003E+00
10000               0.0001  3.000E-12       3.881E+00   1.972E+00

Vì vậy, mặc dù tôi đã đột phá vào đêm qua, nơi tôi thực sự có lỗi lập trình, thứ tự bộ giải của tôi vẫn bị tắt. Tuy nhiên, lỗi đã nói ở trên liên quan đến .001 bước đã giảm đáng kể, tuy nhiên, từ O (10 ^ -4) xuống O (10 ^ -10).

EDIT lần 2

Dưới đây là biểu đồ của LN (E) -LN (a) so với LN (h). Tôi đã đưa ra một sự phù hợp tuyến tính vì vậy tôi mong muốn điều đó sẽ tiết lộ thứ tự cho tôi. Điều thực sự khó hiểu bây giờ là tôi có ba tính toán cho thứ tự phương pháp của mình. Biểu đồ cho tôi biết ~ 3, phương pháp của Godric cho tôi biết gần như chính xác 2 và phương pháp của tôi cho thấy nó thay đổi từ 6-> 4 để tăng kích thước bước. Tôi có xu hướng tin vào các tính toán của mình hơn vì Lỗi toàn cầu rất thấp (~ 10 ^ -10) và gần nhất với những gì tôi mong đợi (10 ^ -15 * 10 ^ 3 = 10 ^ -12) cho bước đó (10 ^ -3).

nhập mô tả hình ảnh ở đây

EDIT 3:

Mã của tôi. Xin tha thứ cho sự noobishness.

#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <stdio.h>
#include <iostream>
#include <iomanip>                      //display 2 decimal places
#include <math.h>
using namespace std;

/*
Things to do log:
1. Decide how adaptive step sizes is handled by the integrator loop.  We can no longer say how many points to integrate if that is to be determined in the kernel
itself, only what the initial step size is.
2. Figure out how I'm going to graph all this stuff.
3. On paper solve for the step size.
4. Adaptive Step Size
    a. arrMin kernel to get min of array
    b. Decide how to implement tolerances
5. Use multiples of 256 for maxlength 

Important things accomplished:
1. Can create 2D array for storing concentrations over time.
2. Can store a 1D array in 2D array.
3. Can calculate all the k-arrays given static step size.
4. Can take a step given static step size.
5. Can step through an integration from t0 to tf
6. Can accomodate fixed time-step
7. Calculated the current order of integrator using logs.
8. Learned how to profile the program.
9. Fixed the GPU timeout problem with WDDM.

Ideas:
1. Just use large fixed size array
*/
__global__ void rkf5(double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, int*, int*, size_t, double*, double*, double*);
__global__ void calcK(double*, double*, double*);
__global__ void k1(double*, double*, double*);
__global__ void k2(double*, double*, double*);
__global__ void k3(double*, double*, double*);
__global__ void k4(double*, double*, double*);
__global__ void k5(double*, double*, double*);
__global__ void k6(double*, double*, double*);
__global__ void arrAdd(double*, double*, double*);
__global__ void arrSub(double*, double*, double*);
__global__ void arrMult(double*, double*, double*);
__global__ void arrInit(double*, double);
__global__ void arrCopy(double*, double*);
__device__ void setup(double , double*, double*, double*, double*, int*);
__device__ double flux(int, double*) ;
__device__ double knowles_flux(int, double*);
__device__ void calcStepSize(double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, int*);
__global__ void storeConcs(double*, size_t, double*, int);
__global__ void takeFourthOrderStep(double*, double*, double*, double*, double*, double*, double*);
__global__ void takeFifthOrderStep(double*, double*, double*, double*, double*, double*, double*, double*);

//Error checking that I don't understand yet.
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, char *file, int line, bool abort=true)
{
   if (code != cudaSuccess) 
   {
      fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
      if (abort) exit(code);
   }
}

//Main program.
int main(int argc, char** argv)
{
    //std::cout << std::fixed;          //display 2 decimal places
    //std::cout << std::setprecision(12);   //display 2 decimal places
    const int maxlength = 1;            //Number of discrete concentrations we are tracking.
    double concs[maxlength];            //Meant to store the current concentrations 
    double temp1[maxlength];                //Used as a bin to store products of Butcher's tableau and k values.
    double temp2[maxlength];                //Used as a bin to store products of Butcher's tableau and k values.
    double tempsum[maxlength];          //Used as a bin to store cumulative sum of tableau and k values
    double k1s[maxlength];
    double k2s[maxlength];
    double k3s[maxlength];
    double k4s[maxlength];
    double k5s[maxlength];
    double k6s[maxlength];
    const int numpoints = 40;       
    double to = 0;
    double tf = 1;
    //double dt = static_cast<double>(.5)/static_cast<double>(64);
    double dt = (tf-to)/static_cast<double>(numpoints);
    double mo = 1;
    double concStorage[maxlength][numpoints];   //Stores concs vs. time                     

    //Initialize all the arrays on the host to ensure arrays of 0's are sent to the device.
    //Also, here is where we can seed the system.
    std::cout<<dt;
    std::cout<<"\n";
    concs[0]=mo;
    std::cout<<concs[0];
    std::cout<<" ";
    for (int i=0; i<maxlength; i++)
    {
        for (int j=0; j<numpoints; j++)
            concStorage[i][j]=0;
        concs[i]=0;
        temp1[i]=0;
        temp2[i]=0;
        tempsum[i]=0;
        k1s[i]=0;
        k2s[i]=0;
        k3s[i]=0;
        k4s[i]=0;
        k5s[i]=0;
        k6s[i]=0;
        std::cout<<concs[i];
        std::cout<<" ";
    }
    concs[0]=mo;
    std::cout<<"\n";

    //Define all the pointers to device array memory addresses. These contain the on-GPU
    //addresses of all the data we're generating/using.
    double *d_concs;
    double *d_temp1;
    double *d_temp2;
    double *d_tempsum;
    double *d_k1s;
    double *d_k2s;
    double *d_k3s;
    double *d_k4s;
    double *d_k5s;
    double *d_k6s;
    double *d_dt;
    int *d_maxlength;
    int *d_numpoints;
    double *d_to;
    double *d_tf;
    double *d_concStorage;

    //Calculate all the sizes of the arrays in order to allocate the proper amount of memory on the GPU.
    size_t size_concs = sizeof(concs);
    size_t size_temp1 = sizeof(temp1);
    size_t size_temp2 = sizeof(temp2);
    size_t size_tempsum = sizeof(tempsum);
    size_t size_ks = sizeof(k1s);
    size_t size_maxlength = sizeof(maxlength);
    size_t size_numpoints = sizeof(numpoints);
    size_t size_dt = sizeof(dt);
    size_t size_to = sizeof(to);
    size_t size_tf = sizeof(tf);
    size_t h_pitch = numpoints*sizeof(double);
    size_t d_pitch;

    //Calculate the "pitch" of the 2D array.  The pitch is basically the length of a 2D array's row.  IT's larger 
    //than the actual row full of data due to hadware issues.  We thusly will use the pitch instead of the data 
    //size to traverse the array.
    gpuErrchk(cudaMallocPitch( (void**)&d_concStorage, &d_pitch, numpoints * sizeof(double), maxlength)); 

    //Allocate memory on the GPU for all the arrrays we're going to use in the integrator.
    gpuErrchk(cudaMalloc((void**)&d_concs, size_concs));
    gpuErrchk(cudaMalloc((void**)&d_temp1, size_temp1));
    gpuErrchk(cudaMalloc((void**)&d_temp2, size_temp1));
    gpuErrchk(cudaMalloc((void**)&d_tempsum, size_tempsum));
    gpuErrchk(cudaMalloc((void**)&d_k1s, size_ks));
    gpuErrchk(cudaMalloc((void**)&d_k2s, size_ks));
    gpuErrchk(cudaMalloc((void**)&d_k3s, size_ks));
    gpuErrchk(cudaMalloc((void**)&d_k4s, size_ks));
    gpuErrchk(cudaMalloc((void**)&d_k5s, size_ks));
    gpuErrchk(cudaMalloc((void**)&d_k6s, size_ks));
    gpuErrchk(cudaMalloc((void**)&d_maxlength, size_maxlength));
    gpuErrchk(cudaMalloc((void**)&d_numpoints, size_numpoints));
    gpuErrchk(cudaMalloc((void**)&d_dt, size_dt));
    gpuErrchk(cudaMalloc((void**)&d_to, size_to));
    gpuErrchk(cudaMalloc((void**)&d_tf, size_tf));

    //Copy all initial values of arrays to GPU.
    gpuErrchk(cudaMemcpy2D(d_concStorage, d_pitch, concStorage, h_pitch, numpoints*sizeof(double), maxlength, cudaMemcpyHostToDevice));
    gpuErrchk(cudaMemcpy(d_concs, &concs, size_concs, cudaMemcpyHostToDevice));
    gpuErrchk(cudaMemcpy(d_temp1, &temp1, size_temp1, cudaMemcpyHostToDevice));
    gpuErrchk(cudaMemcpy(d_temp2, &temp2, size_temp2, cudaMemcpyHostToDevice));
    gpuErrchk(cudaMemcpy(d_tempsum, &tempsum, size_tempsum, cudaMemcpyHostToDevice));
    gpuErrchk(cudaMemcpy(d_k1s, &k1s, size_ks, cudaMemcpyHostToDevice));
    gpuErrchk(cudaMemcpy(d_k2s, &k2s, size_ks, cudaMemcpyHostToDevice));
    gpuErrchk(cudaMemcpy(d_k3s, &k3s, size_ks, cudaMemcpyHostToDevice));
    gpuErrchk(cudaMemcpy(d_k4s, &k4s, size_ks, cudaMemcpyHostToDevice));
    gpuErrchk(cudaMemcpy(d_k5s, &k5s, size_ks, cudaMemcpyHostToDevice));
    gpuErrchk(cudaMemcpy(d_k6s, &k6s, size_ks, cudaMemcpyHostToDevice));
    gpuErrchk(cudaMemcpy(d_maxlength, &maxlength, size_maxlength, cudaMemcpyHostToDevice));
    gpuErrchk(cudaMemcpy(d_numpoints, &numpoints, size_numpoints, cudaMemcpyHostToDevice));
    gpuErrchk(cudaMemcpy(d_dt, &dt, size_dt, cudaMemcpyHostToDevice));
    gpuErrchk(cudaMemcpy(d_to, &to, size_to, cudaMemcpyHostToDevice));
    gpuErrchk(cudaMemcpy(d_tf, &tf, size_tf, cudaMemcpyHostToDevice));

    //Run the integrator.
    rkf5<<<1,1>>>(d_concs, d_concStorage, d_temp1, d_temp2, d_tempsum, d_k1s, d_k2s, d_k3s, d_k4s, d_k5s, d_k6s, d_maxlength, d_numpoints, d_pitch, d_dt, d_to, d_tf);
    gpuErrchk( cudaPeekAtLastError() );
    gpuErrchk( cudaDeviceSynchronize() );
    cudaDeviceSynchronize();

    //Copy concentrations from GPU to Host.  Almost defunct now that transferring the 2D array works.
    cudaMemcpy(concs, d_concs, size_concs, cudaMemcpyDeviceToHost);
    //Copy 2D array of concentrations vs. time from GPU to Host.
    gpuErrchk( cudaMemcpy2D(concStorage, h_pitch, d_concStorage, d_pitch, numpoints*sizeof(double), maxlength, cudaMemcpyDeviceToHost) );   

    //Print concentrations after the integrator kernel runs.  Used to test that data was transferring to and from GPU correctly.
    std::cout << "\n";
    for (int i=0; i<maxlength; i++)
    {
        std::cout<<concs[i];
        std::cout<<" ";
    }

    double a[10];
    double b[10];
    double c[10];
    for(int i = 0; i< 10; i++)
    {
        a[i]=0;
        b[i]=0;
        c[i]=0;
    }



    //Print out the concStorage array after the kernel runs.  Used to test that the 2D array transferred correctly from host to GPU and back.
    std::cout << "\n\n";
    std::cout << "Calculated Array";
    std::cout << "\n\n";
    for (int i=0; i<maxlength; i++)
    {
        for(int j=0; j<numpoints; j++)
        {
            if (j%(numpoints/10)==0)
            {
                a[j/(numpoints/10)]=concStorage[i][j];
                std::cout<<concStorage[i][j];
                std::cout<<"   ";
            }
        }
        std::cout << "\n";
    }
    std::cout << "\n";
    std::cout << "Exponential";
    std::cout << "\n\n";
    for (int i=0; i<10; i++)
    {
        b[i]=exp(-i*(tf-to)/10);
        std::cout<<exp(-i*(tf-to)/10);
        std::cout<<"   ";
    }
    std::cout << "\n\n";
    std::cout << "Error Array";
    std::cout << "\n\n";
    for (int i=0; i<10; i++)
    {
        c[i]=a[i]-b[i];
        std::cout<<c[i];
        std::cout<<"   ";
    }
    std::cout << "\n\n";
    cudaDeviceReset();  //Clean up all memory.

    return 0;
}
//Main kernel.  This is mean to be run as a master thread that calls all the other functions and thusly "runs" the integrator.
__global__ void rkf5(double* concs, double* concStorage, double* temp1, double* temp2, double* tempsum, double* k1s, double* k2s, double* k3s, double* k4s, double* k5s, double* k6s, int* maxlength, int* numpoints, size_t pitch, double* dt, double* to, double* tf)
{
    /*
    axy variables represent the coefficients in the Butcher's tableau where x represents the order of the step and the y value corresponds to the ky value 
    the coefficient gets multiplied by.  Have to cast them all as doubles, or the ratios evaluate as integers.
    e.g. a21 -> a21 * k1
    e.g. a31 -> a31 * k1 + a32 * k2
    */
    double a21 = static_cast<double>(.25);

    double a31 = static_cast<double>(3)/static_cast<double>(32);
    double a32 = static_cast<double>(9)/static_cast<double>(32);

    double a41 = static_cast<double>(1932)/static_cast<double>(2197);
    double a42 = static_cast<double>(-7200)/static_cast<double>(2197);
    double a43 = static_cast<double>(7296)/static_cast<double>(2197);

    double a51 = static_cast<double>(439)/static_cast<double>(216);
    double a52 = static_cast<double>(-8);
    double a53 = static_cast<double>(3680)/static_cast<double>(513);
    double a54 = static_cast<double>(-845)/static_cast<double>(4104);

    double a61 = static_cast<double>(-8)/static_cast<double>(27);
    double a62 = static_cast<double>(2);
    double a63 = static_cast<double>(-3544)/static_cast<double>(2565);
    double a64 = static_cast<double>(1859)/static_cast<double>(4104);
    double a65 = static_cast<double>(-11)/static_cast<double>(40);

    //for loop that integrates over the specified number of points. Actually, might have to make it a do-while loop for adaptive step sizes 
    for(int k = 0; k < *numpoints; k++)
    {
        if (k!=0)
        {
            arrCopy<<< 1, *maxlength >>>(concs, tempsum);
            cudaDeviceSynchronize();
        }
        arrInit<<< 1, *maxlength >>>(tempsum, 0);
        cudaDeviceSynchronize();
        arrInit<<< 1, *maxlength >>>(temp1, 0);
        cudaDeviceSynchronize();
        arrInit<<< 1, *maxlength >>>(temp2, 0);
        cudaDeviceSynchronize();

        calcK<<< 1, *maxlength >>>(concs, k1s, dt);         //k1 = dt * flux (concs)
        cudaDeviceSynchronize(); //Sync here because kernel continues onto next line before k1 finished

        setup(a21, temp1, tempsum, k1s, concs, maxlength);  //tempsum = a21*k1
        arrAdd<<< 1, *maxlength >>>(concs, tempsum, tempsum);   //tempsum = concs + a21*k1    
        cudaDeviceSynchronize();

        calcK<<< 1, *maxlength >>>(tempsum, k2s, dt);       //k2 = dt * flux (concs + a21*k1)
        cudaDeviceSynchronize();

        arrInit<<< 1, *maxlength >>>(tempsum, 0);
        cudaDeviceSynchronize();
        setup(a31, temp1, tempsum, k1s, concs, maxlength);  //temp1sum = a31*k1
        setup(a32, temp1, tempsum, k2s, concs, maxlength);  //tempsum = a31*k1 + a32*k2
        arrAdd<<< 1, *maxlength >>>(concs, tempsum, tempsum);   //tempsum = concs + a31*k1 + a32*k2
        cudaDeviceSynchronize();

        calcK<<< 1, *maxlength >>>(tempsum, k3s, dt);           //k3 = dt * flux (concs + a31*k1 + a32*k2)
        cudaDeviceSynchronize();

        arrInit<<< 1, *maxlength >>>(tempsum, 0);
        cudaDeviceSynchronize();
        setup(a41, temp1, tempsum, k1s, concs, maxlength);  //tempsum = a41*k1
        setup(a42, temp1, tempsum, k2s, concs, maxlength);  //tempsum = a41*k1 + a42*k2
        setup(a43, temp1, tempsum, k3s, concs, maxlength);  //tempsum = a41*k1 + a42*k2 + a43*k3
        arrAdd<<< 1, *maxlength >>>(concs, tempsum, tempsum);   //tempsum = concs + a41*k1 + a42*k2 + a43*k3
        cudaDeviceSynchronize();

        calcK<<< 1, *maxlength >>>(tempsum, k4s, dt);           //k4 = dt * flux (concs + a41*k1 + a42*k2 + a43*k3)
        cudaDeviceSynchronize();

        arrInit<<< 1, *maxlength >>>(tempsum, 0);
        cudaDeviceSynchronize();
        setup(a51, temp1, tempsum, k1s, concs, maxlength);  //tempsum = a51*k1
        setup(a52, temp1, tempsum, k2s, concs, maxlength);  //tempsum = a51*k1 + a52*k2
        setup(a53, temp1, tempsum, k3s, concs, maxlength);  //tempsum = a51*k1 + a52*k2 + a53*k3
        setup(a54, temp1, tempsum, k4s, concs, maxlength);  //tempsum = a51*k1 + a52*k2 + a53*k3 + a54*k4
        arrAdd<<< 1, *maxlength >>>(concs, tempsum, tempsum);   //tempsum = concs + a51*k1 + a52*k2 + a53*k3 + a54*k4
        cudaDeviceSynchronize();

        calcK<<< 1, *maxlength >>>(tempsum, k5s, dt);           //k5 = dt * flux (concs + a51*k1 + a52*k2 + a53*k3 + a54*k4)
        cudaDeviceSynchronize();

        arrInit<<< 1, *maxlength >>>(tempsum, 0);
        cudaDeviceSynchronize();
        setup(a61, temp1, tempsum, k1s, concs, maxlength);  //tempsum = a61*k1
        setup(a62, temp1, tempsum, k2s, concs, maxlength);  //tempsum = a61*k1 + a62*k2
        setup(a63, temp1, tempsum, k3s, concs, maxlength);  //tempsum = a61*k1 + a62*k2 + a63*k3
        setup(a64, temp1, tempsum, k4s, concs, maxlength);  //tempsum = a61*k1 + a62*k2 + a63*k3 + a64*k4
        setup(a65, temp1, tempsum, **k4s**, concs, maxlength);  //tempsum = a61*k1 + a62*k2 + a63*k3 + a64*k4 + a65*k5
        arrAdd<<< 1, *maxlength >>>(concs, tempsum, tempsum);   //tempsum = concs + a61*k1 + a62*k2 + a63*k3 + a64*k4 + a65*k5
        cudaDeviceSynchronize();

        calcK<<< 1, *maxlength >>>(tempsum, k6s, dt);           //k6 = dt * flux (concs + a61*k1 + a62*k2 + a63*k3 + a64*k4 + a65*k5)
        cudaDeviceSynchronize();

        //At this point, temp1 and tempsum are maxlength dimension arrays that are able to be used for other things.

        //Calculate acceptable step size before storing the concentrations.
        calcStepSize(temp1, temp2, tempsum, concs, k1s, k2s, k3s, k4s, k5s, k6s, dt, maxlength);    //temp1 = 4th Order guess, tempsum = 5th Order guess
        cudaDeviceSynchronize();

        //Store the initial conditions in the first column of the storage array.
        if (k==0)
        {
            storeConcs<<< 1, *maxlength >>>(concStorage, pitch, concs, k);  //Store this step's concentrations in 2D array
            cudaDeviceSynchronize();
        }
        //Store future concentration in next column of storage array.
        storeConcs<<< 1, *maxlength >>>(concStorage, pitch, tempsum, k+1);  //Store this step's concentrations in 2D array
        cudaDeviceSynchronize();
    }
}
//calcStepSize will take in an error tolerance, the current concentrations and the k values and calculate the resulting step size according to the following equation
//e[n+1]=y4[n+1] - y5[n+1]
__device__ void calcStepSize(double* temp1, double*temp2, double* tempsum, double* concs, double* k1s, double* k2s, double* k3s, double* k4s, double* k5s, double* k6s, double* dt, int* maxlength)
{
    //do
    //{
        takeFourthOrderStep<<< 1, *maxlength >>>(temp1, concs, k1s, k2s, k3s, k4s, k5s);            //Store 4th order guess in temp1
        takeFifthOrderStep<<< 1, *maxlength >>>(tempsum, concs, k1s, k2s, k3s, k4s, k5s, k6s);  //Store 5th order guess in tempsum
        cudaDeviceSynchronize();
        arrSub<<< 1, *maxlength >>>(temp1, tempsum, temp2)
        arrMin<<< 1, *maxlength >>>
    //arrMult
    //}
    //while
}
//takeFourthOrderStep is going to overwrite the old temp1 array with the new array of concentrations that result from a 4th order step.  This kernel is meant to be launched 
// with as many threads as there are discrete concentrations to be tracked.
__global__ void takeFourthOrderStep(double* y4, double* concs, double* k1s, double* k2s, double* k3s, double* k4s, double* k5s)
{
    double b41 = static_cast<double>(25)/static_cast<double>(216);
    double b42 = static_cast<double>(0);
    double b43 = static_cast<double>(1408)/static_cast<double>(2565);
    double b44 = static_cast<double>(2197)/static_cast<double>(4104);
    double b45 = static_cast<double>(-1)/static_cast<double>(5);
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    y4[idx] = concs[idx] + b41 * k1s[idx] + b42 * k2s[idx] + b43 * k3s[idx] + b44 * k4s[idx] + b45 * k5s[idx];
}
//takeFifthOrderStep is going to overwrite the old array of concentrations with the new array of concentrations.  As of now, this will be the 5th order step.  Another function can be d
//defined that will take a fourth order step if that is interesting for any reason.  This kernel is meant to be launched with as many threads as there are discrete concentrations
//to be tracked.
//Store b values in register? Constants?
__global__ void takeFifthOrderStep(double* y5, double* concs, double* k1s, double* k2s, double* k3s, double* k4s, double* k5s, double* k6s)
{
    double b51 = static_cast<double>(16)/static_cast<double>(135);
    double b52 = static_cast<double>(0);
    double b53 = static_cast<double>(6656)/static_cast<double>(12825);
    double b54 = static_cast<double>(28561)/static_cast<double>(56430);
    double b55 = static_cast<double>(-9)/static_cast<double>(50);
    double b56 = static_cast<double>(2)/static_cast<double>(55);
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    y5[idx] = concs[idx] + b51 * k1s[idx] + b52 * k2s[idx] + b53 * k3s[idx] + b54 * k4s[idx] + b55 * k5s[idx] + b56 * k6s[idx];
}
//storeConcs takes the current array of concentrations and stores it in the cId'th column of the 2D concStorage array
//pitch = memory size of a row
__global__ void storeConcs(double* cS, size_t pitch, double* concs, int cId)
{
    int tIdx = threadIdx.x;
    //cS is basically the memory address of the first element of the flattened (1D) 2D array.
    double* row = (double*)((char*)cS + tIdx * pitch);
    row[cId] = concs[tIdx];
}
//Perhaps I can optimize by using shared memory to hold conc values.
__global__ void calcK(double* concs, double* ks, double* dt)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    ks[idx]=(*dt)*flux(idx, concs);
}
//Adds two arrays (a + b) element by element and stores the result in array c.
__global__ void arrAdd(double* a, double* b, double* c)
{                                                 
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    c[idx]=a[idx]+b[idx];
}
//Subtracts two arrays (a - b) element by element and stores the result in array c.
__global__ void arrSub(double* a, double* b, double* c)
{                                                 
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    c[idx]=a[idx]-b[idx];
}
//Multiplies two arrays (a * b) element by element and stores the result in array c.
__global__ void arrMult(double* a, double* b, double* c)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    c[idx]=a[idx]*b[idx];
}
//Will find the min of errors array.
__global__ void arrMin(double* errors)
{
    //extern _shared_ double[7];
}
//Initializes an array a to double value b.
__global__ void arrInit(double* a, double b)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    a[idx]=b;
}
//Copies array b onto array a.
__global__ void arrCopy(double* a, double* b)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    a[idx]=b[idx];
}
//Placeholder function for the flux calculation.  It will take the size of the oligomer and current concentrations as inputs.
__device__ double flux(int r, double *concs) 
{
    return -concs[r];
}
//This function multiplies a tableau value by the corresponding k array and adds the result to tempsum.  Used to
//add all the a*k terms.
//e.g. setup(a21, temp1, tempsum, k1s, concs, maxlength) => tempsum = a21 * k1
__device__ void setup(double tableauValue, double *temp1, double *tempsum, double *ks, double *concs, int *maxlength) 
{
    //Sets tempsum to tabVal * k
    arrInit<<< 1, *maxlength >>>(temp1, tableauValue);      //Set [temp1] to tableau value, temp1 = a
    cudaDeviceSynchronize();
    arrMult<<< 1, *maxlength >>>(ks, temp1, temp1);         //Multiply tableau value by appropriate [k], temp1 = a*k
    cudaDeviceSynchronize();
    arrAdd<<< 1, *maxlength >>>(tempsum, temp1, tempsum);   //Move tabVal*k to [tempsum], tempsum = tempsum+temp1
    cudaDeviceSynchronize();
    //temp1 = tableauValue * kArray
    //tempsum = current sum (tableauValue * kArray)
}

EDIT CUỐI CÙNG Tôi thấy lỗi của mình. Đó là một lỗi đánh máy tính toán của K6. Tôi đặt dấu sao xung quanh cụm từ "k4s" vi phạm sẽ là "k5s". Bất cứ ai muốn tích hợp CUDA RKF45 đều được hoan nghênh sử dụng nó vì nó hoạt động ngay bây giờ và đúng thứ tự.

Number of Points    h       Global Error    Calculated Order
        10          0.1         3.59E-09    
        20          0.05        1.08E-10    5.06E+00
        40          0.025       3.31E-12    5.03E+00
        80          0.0125      1.03E-13    5.01E+00
        160         0.00625     2.20E-15    5.55E+00
        320         0.003125    1.20E-15    8.74E-01
        640         0.0015625   1.40E-15    -2.22E-01
        1280        0.00078125  2.00E-16    2.81E+00

— Tóc của Slytherin
nguồn

Có bất kỳ lý do nào bạn không sử dụng một thư viện thực hiện phương pháp này?

— Geoff Oxberry

Nếu bạn chưa quen với C ++, CUDA và tán tỉnh ODE, thì nhiệm vụ này có vẻ khá khó khăn. Nếu bạn có quyền truy cập vào MatLab với hộp Công cụ lập trình song song và thẻ Đồ họa NVidia, có thể đáng để lấy nguyên mẫu mã này trong MatLab. Nó có hỗ trợ cho các GPU CUDA ( http://www.mathworks.com/discovery/matlab-gpu.html ) (và OpenMP và MPI cho vấn đề đó) những ngày này. Khi mã đang chạy trong MatLab, bạn có thể chuyển nó sang C ++ hoặc sử dụng MeX. Chỉ là một ý nghĩ.

— seb

Bạn có thể sử dụng thư viện odeint (một phần của Boost) với giao diện Thrust. Nếu bạn định cuộn mã của riêng mình, bạn nên kiểm tra việc triển khai của mình bằng phương pháp giải pháp được sản xuất.

— Geoff Oxberry

@Karsten: Có thể đáng để làm lại phân tích của bạn khi cho rằng lỗi toàn cầu có dạng . Sau đó , do đó, một biểu đồ log-log (hoặc hồi quy tuyến tính trên dữ liệu chuyển đổi nhật ký) sẽ cho bạn biết lỗi là gì. Một cốt truyện cũng sẽ có nhiều thông tin hơn cho người đọc bài viết của bạn; để hiểu được dữ liệu của bạn, tôi phải tự vẽ biểu đồ để xem thứ tự phương thức của bạn hoặc tôi phải tự mình thực hiện hồi quy.

E (h) = a * h^{n}

$E(h) = a * h^{n}$

\log E = \log a + n \log h

$\log{E} = \log{a} + n \log{h}$

— Geoff Oxberry

@KarstenChu lấy điểm. Tôi thích kết hợp Python / C bản thân mình; Tôi chỉ muốn giữ một tâm trí cởi mở. Để xem những gì bạn có thể làm với MatLab những ngày này, hãy xem liên kết này từ Cụm PDC ở Stockholm, Thụy Điển . Họ có các slide và rất nhiều kịch bản ví dụ matlab. Chúng bao gồm openmp, MPI, cuda.

— seb

Tôi giả sử rằng bạn đang đặt dung sai lỗi tại 1e-12. Bạn đúng rằng khi một lược đồ thích ứng chấp nhận kích thước bước hiện tại, thì nó giả sử lược đồ thứ 5 là, cho tất cả các ý định và mục đích, câu trả lời "chính xác". Tuy nhiên đây chỉ là khi nó chấp nhận bước hiện tại. Nếu sự khác biệt giữa các bước thứ 4 và 5 quá lớn, nó không chấp nhận hoặc là đủ tốt. Để thử nghiệm một thuật toán duy nhất như thế này, bạn cần chạy nó với nhiều bước thời gian và xem thứ tự hội tụ để xác minh nó.

Mặt khác, một bước 0.001với sơ đồ bậc 5 về vấn đề kiểm tra của bạn (tôi giả sử là 1D) là quá đủ để đưa lỗi đến gần máy epsilon. Tôi đoán là bạn có một lỗi đánh máy ở đâu đó làm giảm tích hợp của bạn xuống thứ 1. Giống như tôi đã nói trong đoạn 1, hãy chạy nó với nhiều bước thời gian và xác minh thứ tự hội tụ của nó. Điều này sẽ cho bạn biết liệu lỗi bạn thấy là hợp lý hay nếu bạn có lỗi ở đâu đó trong bộ tích hợp của bạn.

BIÊN TẬP

Để tính thứ tự hội tụ của bộ giải ode, hãy chạy nó trên một vấn đề với một giải pháp đã biết cho hai bước thời gian khác nhau và . Lỗi toàn cầu của bạn sau đó sẽ là: $h$ $\frac{h}{2}$

$\frac{E(h)}{E(\frac{h}{2})} = 2^n$

Lưu ý rằng 2 xuất phát từ tỷ lệ giữa các bước thời gian. Bạn sẽ nhận được cho sơ đồ đặt hàng thứ 5 của mình. $n=5$

— Thần kiếm
nguồn

Ngay bây giờ, tôi không có một thuật toán bước thích ứng được lập trình, vì vậy điều đó có thể góp phần gây ra lỗi. Tuy nhiên, đối với các kích thước cố định, tôi đã thực hiện một trò đùa nhỏ thú vị vào logarit để tính thứ tự gần đúng của bộ giải của tôi và tôi nhận được ~ 2.1. Sử dụng sự khác biệt giữa đoán thứ tự thứ 5 và giá trị được chấp nhận là "lỗi toàn cầu" của tôi, tôi đã thực hiện O = ln (GlobalError / NumberOfPoints) / ln (StepSize) cho các kích thước bước là 0,001, 0,01 và .1. Điều đó có gợi ý rằng tôi có lỗi trong mã của mình không?

— Tóc của Slytherin

Xem chỉnh sửa của tôi để biết cách tính thứ tự.

— Godric Seer

Tôi đã thêm một biểu đồ trong đó tôi đã thực hiện một số lần chạy dữ liệu của mình và các lỗi liên quan và cắm chúng vào phương trình O = LN (E (h1) / E (h2)) / LN (h1 / h2) cho các bước tùy ý. Đó là những gì tôi đặt trong bảng xếp hạng là "Đơn hàng của bạn"

— Tóc của Slytherin

Nhìn vào hai điểm dữ liệu cuối cùng trong bảng của bạn. Bạn đã giảm kích thước bước xuống 10 lần và lỗi của bạn giảm xuống khoảng 100. Phương pháp của bạn chắc chắn là thứ 2. Không thấy mã của bạn, tôi không biết nên gợi ý gì ngoài việc kiểm tra lại các hệ số của bạn.

— Godric Seer

Tất cả các hệ số của bạn là chính xác. Một điều tôi không thấy là nơi bạn đang sửa đổi dt để tính các K khác nhau (tức là f (t + a dt, x + a K)). Vì các phương trình của bạn (kiểm tra và thực tế) có vẻ là tự trị, nên điều đó không quan trọng.

— Godric Seer