.NET - An approach to multithreaded file processing


I have quite a large file (> 15 GB) (never mind what kind of file). I have to read the file, process the data, and write the processed data to a blank file. I do this in chunks. Each chunk contains a header of some sort, followed by the data. The simplest file of multiple chunks would contain:

    Number of block bytes
    Block bytes
    Number of block bytes
    Block bytes

So, I would create one thread for reading the file in chunks, several threads for processing each chunk that has been read, and one thread for writing the chunks of processed data.

And I have some sort of problem managing those threads.

I don't know the order in which the chunks will finish processing, though I must write the chunks to the file in the order in which they were read.

So, my question is what kind of approach I should use to manage this multithreaded processing.

I guess it might be better to use the producer-consumer pattern. Which data structure is best in that case for storing the data that has been processed? I have one guess: a stack based on an array, which I would need to sort once before I start writing.

But I'm not sure. So, please help me with an approach.

    // Sample of the code, without the thread-management logic
    using System.IO;
    using System.Threading;

    public class DataBlock
    {
        public byte[] Data { get; }
        public long Index { get; }

        public DataBlock(byte[] data, long index)
        {
            Data = data;
            Index = index;
        }
    }

    int bufferSize = 1024 * 64; // 65536
    long processedBlockCounter = 0L;
    MyStack<DataBlock> processedBlockStore = new MyStack<DataBlock>();

    using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read, bufferSize))
    {
        using (BufferedStream bs = new BufferedStream(fs, bufferSize))
        {
            byte[] buffer = new byte[bufferSize];
            int byteRead;
            while ((byteRead = bs.Read(buffer, 0, bufferSize)) > 0)
            {
                // Copy only the bytes actually read into a fresh array for this block.
                byte[] originalBytes;
                using (MemoryStream mStream = new MemoryStream())
                {
                    mStream.Write(buffer, 0, byteRead);
                    originalBytes = mStream.ToArray();
                }

                long dataBlockIndex = Interlocked.Increment(ref processedBlockCounter);

                Thread processThread = new Thread(() =>
                {
                    byte[] processedBytes = MyProcessor.Process(originalBytes);
                    // Use the index captured for this block, not the shared counter,
                    // which may already have advanced by the time this thread runs.
                    DataBlock processedBlock = new DataBlock(processedBytes, dataBlockIndex);
                    lock (processedBlockStore)
                    {
                        processedBlockStore.Add(processedBlock);
                    }
                });
                processThread.Start();
            }
        }
    }

You're creating a new thread on each iteration. That isn't going to scale. I'd recommend you use the ThreadPool instead. The preferred way is to use the TPL, which internally uses the ThreadPool.
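As a rough sketch of that suggestion, the per-chunk `new Thread(...)` could be replaced with `Task.Run`, which queues the work on the thread pool. `Process` and `StartProcessing` are hypothetical names, and the chunks here are small in-memory arrays rather than reads from your real file. Keeping the tasks in a list in read order means the results can be consumed in that same order, whichever task happens to finish first.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical stand-in for MyProcessor.Process from the question.
byte[] Process(byte[] input) => Array.ConvertAll(input, b => (byte)(b + 1));

// Queue one thread-pool task per chunk; the list preserves the read order.
List<Task<byte[]>> StartProcessing(IEnumerable<byte[]> chunks)
{
    var tasks = new List<Task<byte[]>>();
    foreach (byte[] chunk in chunks)
    {
        byte[] current = chunk; // explicit per-iteration copy for the closure
        tasks.Add(Task.Run(() => Process(current)));
    }
    return tasks;
}

var sampleChunks = new[] { new byte[] { 1, 2, 3 }, new byte[] { 4, 5 }, new byte[] { 6 } };
List<Task<byte[]>> pending = StartProcessing(sampleChunks);

// Results are consumed in the order the tasks were queued,
// not in the order they finished.
foreach (Task<byte[]> task in pending)
{
    byte[] processed = task.Result; // blocks until this particular chunk is done
    Console.WriteLine(string.Join(",", processed));
}
// prints:
// 2,3,4
// 5,6
// 7
```

For a 15 GB file you would not want to queue every chunk up front like this, since all inputs and results would sit in memory at once; some cap on the number of outstanding tasks would be needed.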

Since you need both ordering and parallel processing, and the two don't go hand in hand, you can simply make your code synchronous, if that's an option.

If you do need to process in parallel, I'd recommend the following fork-join strategy, given that your file is larger than 15 GB and the processing is time-consuming too.

  • Chunkify the file
  • Start a task for each chunk
  • Make each task write its output to a temporary file named after its index: 1.txt, 2.txt, etc.
  • Wait for all the tasks to complete
  • Finally, read the temporary files and create the output file in order.
  • Then, of course, delete the temporary files. You're done.
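The steps above might look roughly like the following. This is only a sketch under some assumptions: `Process`, `ForkJoin`, and the zero-based `.tmp` file naming are made up for illustration, and the chunks are passed in as in-memory arrays, whereas for your real file you would read them from the stream and limit how many are in flight at once.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical stand-in for the real processing step.
byte[] Process(byte[] input) => Array.ConvertAll(input, b => (byte)(b + 1));

void ForkJoin(byte[][] chunks, string tempDir, string outputPath)
{
    Directory.CreateDirectory(tempDir);

    // Fork: one task per chunk, each writing to a temp file named by its index.
    Task[] tasks = chunks
        .Select((chunk, index) => Task.Run(() =>
            File.WriteAllBytes(Path.Combine(tempDir, index + ".tmp"), Process(chunk))))
        .ToArray();

    // Join: block until every chunk has been processed.
    Task.WaitAll(tasks);

    // Merge the temp files in index order, deleting each one once it is copied.
    using (FileStream output = File.Create(outputPath))
    {
        for (int i = 0; i < chunks.Length; i++)
        {
            string tempFile = Path.Combine(tempDir, i + ".tmp");
            using (FileStream input = File.OpenRead(tempFile))
            {
                input.CopyTo(output);
            }
            File.Delete(tempFile);
        }
    }
}

string workDir = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
string resultPath = Path.Combine(workDir, "output.bin");
ForkJoin(new[] { new byte[] { 1, 2 }, new byte[] { 3 }, new byte[] { 4, 5 } }, workDir, resultPath);
Console.WriteLine(string.Join(",", File.ReadAllBytes(resultPath))); // prints 2,3,4,5,6
```

The output order depends only on the file names, so it doesn't matter which task finishes first. The price is that the processed data is written to disk twice.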
