
gpgpu - Distributed (multi-device) implementation of sequence-to-sequence models in TensorFlow?

Here is a very good tutorial on training a sequence-to-sequence model in TensorFlow. I am just interested to know whether there is a distributed version that leverages a set of GPUs on a single machine for better performance. The TensorFlow white paper mentions that it is possible to train a large multilayer recurrent neural network (see Figure 8 and the "model parallel training" section) as used in Sequence to Sequence Learning with Neural Networks. Does anybody know whether the current tutorial covers model parallel training? If not, how to improve the original...
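(Editorial aside, not from the question.) The coarse model parallelism the white paper describes can be expressed with device scopes in the TF 1.x graph API. A minimal sketch of the idea, with illustrative layer sizes, not taken from the tutorial:

import tensorflow as tf

# Illustrative input: [batch, time, features]
inputs = tf.placeholder(tf.float32, [None, None, 128])

# Pin each recurrent layer to its own GPU so the layers form a
# cross-device pipeline (model parallelism).
with tf.device('/gpu:0'):
    cell0 = tf.nn.rnn_cell.LSTMCell(512)
    out0, _ = tf.nn.dynamic_rnn(cell0, inputs, dtype=tf.float32, scope='layer0')
with tf.device('/gpu:1'):
    cell1 = tf.nn.rnn_cell.LSTMCell(512)
    out1, _ = tf.nn.dynamic_rnn(cell1, out0, dtype=tf.float32, scope='layer1')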

gpgpu - CUDA limit seems to be reached, but what limit is that?

I have a CUDA program that seems to be hitting some sort of limit on some resource, but I can't figure out what that resource is. Here is the kernel function:

__global__ void DoCheck(float2* points, int* segmentToPolylineIndexMap, int segmentCount, int* output)
{
    int segmentIndex = threadIdx.x + blockIdx.x * blockDim.x;
    int pointCount = segmentCount + 1;
    if (segmentIndex >= segmentCount) return;
    int polylineIndex = segmentToPolylineIndexMap[segmentIndex];
    int result = 0;
    if (polylineIndex >= 0...
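(Editorial aside.) One way to see which limit is biting is to ask the runtime for the launch error and the kernel's resource usage. A diagnostic sketch, assuming the kernel above is linked in:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void DoCheck(float2* points, int* segmentToPolylineIndexMap,
                        int segmentCount, int* output);  // defined elsewhere

void diagnoseLaunch(int blockSize)
{
    // Error code from the most recent launch (this call also clears it).
    cudaError_t err = cudaGetLastError();
    printf("launch status: %s\n", cudaGetErrorString(err));

    // Per-kernel resource usage: registers per thread, static shared
    // memory, and the cap on threads per block for this kernel.
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, (const void*)DoCheck);
    printf("regs/thread=%d sharedMem=%zu maxThreadsPerBlock=%d\n",
           attr.numRegs, attr.sharedSizeBytes, attr.maxThreadsPerBlock);

    // How many blocks of this size fit on one SM given those resources.
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, (const void*)DoCheck, blockSize, 0);
    printf("blocks/SM at blockSize=%d: %d\n", blockSize, blocksPerSM);
}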

gpgpu - Running OpenCL on hardware from mixed vendors

I've been playing with the ATI OpenCL implementation in their Stream 2.0 beta. The OpenCL in the current beta only uses the CPU for now; the next version is supposed to support GPU kernels. I downloaded Stream because I have an ATI GPU in my work machine. I write software that would benefit hugely from GPU gains. However, this software runs on customer machines; I don't have the luxury (as many scientific computing environments do) of choosing the exact hardware to develop for and optimizing for that. So my question is: if I distribut...
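(Editorial aside.) Because OpenCL enumerates whatever implementations are installed at run time, one binary can discover ATI, NVIDIA, or CPU platforms on a customer machine. A minimal enumeration sketch:

#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    // Every installed vendor driver registers itself as a platform.
    cl_platform_id platforms[16];
    cl_uint platformCount = 0;
    clGetPlatformIDs(16, platforms, &platformCount);

    for (cl_uint p = 0; p < platformCount; ++p) {
        char name[256];
        clGetPlatformInfo(platforms[p], CL_PLATFORM_NAME, sizeof name, name, NULL);

        cl_uint deviceCount = 0;
        clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 0, NULL, &deviceCount);
        printf("platform %u: %s, %u device(s)\n", p, name, deviceCount);
    }
    return 0;
}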

gpgpu - Is it possible to get information of cache inside of GPU using perf event?

I'm using perf event to get performance counts or cache information (such as cache access count and cache miss count). Now I want to get the GPU's cache information, but the question is whether perf event can get the GPU's cache information. I did one test:

ioctl(fd, PERF_EVENT_IOC_RESET, 0);
ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
matrixMulCUDA<<< grid, threads >>> ( ... );
ioctl(fd, PERF_EVENT_IOC_RESET, 0);
ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

and I confirmed that data seem to be extracted. But I can't be sure it's cache information inside of the GP...
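(Editorial aside.) Two things are worth noting: perf_event reads the host CPU's performance-monitoring units, not the GPU's, and a kernel launch is asynchronous, so the counting window above may close before the kernel has even run; also, the second ioctl pair re-enables rather than disables, which may be a transcription slip. A synchronized sketch of the fragment; GPU-side cache counters would instead come from a tool such as CUPTI or nvprof:

ioctl(fd, PERF_EVENT_IOC_RESET, 0);
ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
matrixMulCUDA<<< grid, threads >>> ( ... );
cudaDeviceSynchronize();                 // wait for the GPU work to finish
ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);    // close the counting window

long long count;
read(fd, &count, sizeof(count));         // the (CPU-side) counter value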

gpgpu - How to use the GPU to speed up the Pymc3 sampling?

I've used the 'njobs' parameter to get multi-sample results, and it's far from my expectation. I've changed the '.theanorc' file to set the 'floatX' and 'cnmem' values, etc. I've monitored GPU usage with the command 'nvidia-smi', and it is well utilized. But the sampling speed is still slow, even slower than on the CPU. Is that normal?...
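(For reference.) A '.theanorc' of the kind described might look like the sketch below; the section names are Theano's, the values are illustrative, and 'cnmem' only applies to the old CUDA backend (the newer gpuarray backend uses 'device = cuda' and a preallocation setting instead):

[global]
device = gpu
floatX = float32

[lib]
cnmem = 0.8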

gpgpu - Is physics simulation really faster on GPU?

From what I have observed, Havok does a significantly better job at rigid-body simulation than PhysX, especially their new Havok Physics 2013. I'm not very familiar with how state-of-the-art physics engines work, but by testing alone I cannot get very accurate results. For example, PhysX still seems to cripple CPU performance on purpose. My results show that when the number of simultaneously interacting rigid bodies exceeds a certain amount (this ranges from 1024 to 8096 boxes), its performance drops along a very unnaturally steep curve and stops plumb dropping when...

gpgpu - How to obtain version of installed Vulkan API in Linux?

On 2018-03-07, a new version of the API (Vulkan 1.1) was released. I want to know which console command can display the currently installed API version:

$ /usr/bin/vulkaninfo | head -n 5
===========
VULKAN INFO
===========
Vulkan Instance Version: 1.1.70
WARNING: radv is not a conformant vulkan implementation, testing use only.

and how to determine the same thing programmatically in C#....
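(Editorial aside.) For the C# half, one plausible route is P/Invoking the loader's vkEnumerateInstanceVersion, which only Vulkan 1.1+ loaders export; the library name below assumes Linux. A sketch, not a vetted answer:

using System;
using System.Runtime.InteropServices;

static class VulkanVersion
{
    // Vulkan 1.1+ loaders export this; on a 1.0 loader the symbol is absent.
    [DllImport("libvulkan.so.1")]
    private static extern int vkEnumerateInstanceVersion(out uint apiVersion);

    static void Main()
    {
        vkEnumerateInstanceVersion(out uint v);
        // Decode the VK_VERSION_MAJOR/MINOR/PATCH bit fields.
        Console.WriteLine($"Vulkan {v >> 22}.{(v >> 12) & 0x3FF}.{v & 0xFFF}");
    }
}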

gpgpu - Data corruption when replacing a GLSL constant with a uniform value

Follow-up to this recent question. I am doing GPGPU programming in WebGL2, and I'm passing a large 4-dimensional square array to my shaders by packing it into a texture to bypass the uniform count limits. Having freed myself from having to use a relatively small fixed-size array, I would like to be able to specify the size of the data actually being passed in programmatically. Previously, I had hard-coded the size of the data to read using a const int as follows:

const int SIZE = 5;
const int SIZE2 = SIZE*SIZE;
const int SIZE3 = SIZE2*SIZ...
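(Editorial aside.) The substitution in question looks roughly like the sketch below; note that GLSL ES 3.00 only permits constant expressions as global initializers, so sizes derived from a uniform have to be computed inside a function:

uniform int u_size;              // was: const int SIZE = 5;

int size4() {
    int s2 = u_size * u_size;    // was: const int SIZE2 = SIZE*SIZE;
    return s2 * s2;              // SIZE4 = SIZE2*SIZE2
}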

gpgpu - Data corruption when replacing uniform array with 1d texture in WebGL

I am doing some GPGPU processing on a large 4D input array in WebGL2. Initially, I just flattened the input array and passed it in as a uniform array of ints, with a custom accessor function in GLSL to translate 4D coordinates into an array index, as follows:

const int SIZE = 5; // The largest dimension that works; if I can switch to textures, this will be passed in as a uniform value.
const int SIZE2 = SIZE*SIZE;
const int SIZE3 = SIZE*SIZE2;
const int SIZE4 = SIZE*SIZE3;
uniform int u_map[SIZE4];

int get_cell(vec4 m)
{
    ivec4 i = ivec4(mod(m, float(S...
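(Editorial aside.) A texture-backed replacement for the uniform array typically samples an integer texture with texelFetch, so no filtering or normalization can alter the stored values. A sketch with illustrative names and an assumed 4096-texel row width:

uniform highp isampler2D u_map_tex;  // R32I texture holding the flattened map
uniform int u_size;

int get_cell(ivec4 m) {
    // Flatten the 4D coordinate, then address the texture row-major.
    int idx = ((m.x * u_size + m.y) * u_size + m.z) * u_size + m.w;
    return texelFetch(u_map_tex, ivec2(idx % 4096, idx / 4096), 0).r;
}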

gpgpu - Exception: device kernel image is invalid

I'm new to the OmniSci open source community. I have followed the instructions (https://www.omnisci.com/docs/latest/4_ubuntu-apt-gpu-os-recipe.html) to install OmniSci (open source version) on my Ubuntu 18.04 LTS:

~$ sudo systemctl start omnisci_server
~$ $OMNISCI_PATH/bin/omnisql
Password:
User mapd connected to database maps
omnisql>

I have also installed the CUDA driver 10.0:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 415.27       Driver Version: 415.27       CUDA Version: 10.0     |
|----------...

gpgpu - GPU programming model - how many simultaneous, divergent threads without penalty

I am new to GPGPU and CUDA. From my reading, on current-generation CUDA GPUs, threads get bundled into warps of 32 threads. All threads in a warp execute the same instructions, so if the branches diverge, the warp essentially takes the time corresponding to executing all the incurred branches. However, it seems that different warps executing simultaneously on the GPU can take divergent branches without this cost, since the different warps are executed by separate computational resources. So my question is: how many concurrent warps can...
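(As an aside, not from the question.) The runtime exposes the quantities needed to bound this. A sketch that computes the device-wide ceiling on resident warps, assuming device 0:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Upper bound on warps that can be resident (schedulable) at once.
    int warpsPerSM = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("%d SMs x %d resident warps/SM = %d warps in flight\n",
           prop.multiProcessorCount, warpsPerSM,
           prop.multiProcessorCount * warpsPerSM);
    return 0;
}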

gpgpu - Error using Tensorflow with GPU

I've tried a bunch of different TensorFlow examples, which work fine on the CPU but generate the same error when I try to run them on the GPU. One small example is this:

import tensorflow as tf
# Creates a graph.
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)
# Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Runs the op.
print sess.run(c)

The error is al...
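(Editorial aside; the error text is truncated above, so this is an assumption about the cause, not a confirmed fix.) A common GPU-placement failure in TF 1.x is an op with no GPU kernel, which soft placement lets fall back to the CPU. A minimal sketch:

# Allow TensorFlow to place ops on the CPU when no GPU kernel exists,
# instead of failing the placement outright.
config = tf.ConfigProto(allow_soft_placement=True,
                        log_device_placement=True)
sess = tf.Session(config=config)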

gpgpu - DXGI_ERROR_DEVICE_HUNG resulting from concurrency::copy on C++ AMP

I have created some C++ AMP code for performing background gradient removal on astronomical images. They come in as 16-bit unsigned integers for RGB. All of my application's processing and output occur in single-precision floating point, so I convert the input data, run the C++ AMP code, and then copy the results back to the CPU. (In reality the image will go through many of these C++ AMP filters on the GPU before being copied back, but for this test code I have isolated it to just a single such filter.) Everything goes well until I initiate th...
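(Editorial aside.) A long-running kernel that trips the Windows TDR watchdog is a classic source of DXGI_ERROR_DEVICE_HUNG surfacing at the next copy. A sketch, with hypothetical names, of wrapping the copy so the HRESULT is visible rather than fatal:

#include <amp.h>
#include <iostream>
#include <vector>

// Hypothetical helper: copy a GPU array back to the host, surfacing
// the HRESULT instead of letting the process terminate on a hung device.
void copyBack(concurrency::array<float, 2>& gpuImage, std::vector<float>& host)
{
    try {
        concurrency::copy(gpuImage, host.begin());
    } catch (const concurrency::runtime_exception& ex) {
        std::cerr << "C++ AMP error 0x" << std::hex << ex.get_error_code()
                  << ": " << ex.what() << std::endl;
    }
}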

gpgpu - Accessing IList<T> inside Alea.Gpu.Default.For

I am trying to access values of a System.Collections.Generic.IList<T> which is declared outside Alea.Gpu.Default.For:

[GpuManaged]
private void Evaluate_Caching(IList<TGenome> genomeList)
{
    var gpu = Gpu.Default;
    gpu.For(0, genomeList.Count - 1, i =>
    {
        TGenome genome = genomeList[i];
        TPhenome phenome = (TPhenome)genome.CachedPhenome;
        if (null == phenome)
        {
            // Decode the phenome and store a ref against the genome.
            phenome = _genomeDecoder.Decode(genome);
            genome.CachedPhenome...
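(Editorial aside, based on the general limitation that device code cannot dispatch through interface types.) A sketch of the usual workaround, reusing the question's names: materialize the IList<T> into a plain array before the parallel-for. Note also that Gpu.For's upper bound is exclusive, like Parallel.For, so Count - 1 would skip the last element:

[GpuManaged]
private void Evaluate_Caching(IList<TGenome> genomeList)
{
    // Materialize the interface-typed list into a plain array,
    // which the Alea compiler can address on the device.
    var genomes = genomeList.ToArray();
    var gpu = Gpu.Default;
    gpu.For(0, genomes.Length, i =>   // exclusive upper bound
    {
        TGenome genome = genomes[i];
        // ... decode and evaluate as in the original body ...
    });
}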