• Fastest conatenating algorithm
    19 replies, posted
I want to be able to multiple input files and folders of files into one files as fast as possible. How would I do that? In python I can get speeds of about 2500 per minute @ 2.4 ghz with file chunk caching. I'm wondering how I would go about this and if someone can post some code to help if possible.
If this was in C/C++ you could use the CreateFile function with the FILE_FLAG_SEQUENTIAL_SCAN for improved read performance. You should also try to read in blocks of data the size of a disk sector.
If I post some python code that shows what I'm trying to do could someone post a C version? [code] import random, time, os, glob, sys a = open(r"C:\Documents and Settings\pkt\My Documents\My Videos\coder\wasabilist.txt") b = [] c = 1 toggle = 1 solar = '' for ghoster in sys.argv: print str(sys.argv) print str(len(sys.argv)) if len(sys.argv) == 1: print 'not enough mana' else: for relishmayo in range(1,len(sys.argv)): inputdir = str(sys.argv[relishmayo]) solar = '' if (inputdir[1:] != ':\\'): for letter in inputdir[::-1]: if (letter != '\\'): solar += letter else: break golem = solar[::-1] else: golem = inputdir[0] + '-' os.chdir(inputdir) if (toggle == 0): g = a.readline() while (len(g) >= 3): g = a.readline() b.append(g[:-1]) c += 1 else: for files in glob.glob("*.mp3"): if (inputdir[1:] != ':\\'): b.append(inputdir + '\\' + files) else: b.append(inputdir + files) while (len(b) <= 10): b.extend(b) cupple = '' ixis = '' if ixis == '': ixis = 1 ixis = 1000000 caulder = [] for wapples in b: qwilt = open(wapples,'rb') caulder.append(qwilt.read()) qwilt.close() scar = 0 while (1): scar += 1 cupple = '' while (len(cupple) <= (96 / 8 * 1000 * 60 * 1)): cupple += caulder[random.randint(0,len(caulder) - 1)] axez = open('R:\\dakii.mp3','wb') axez.write(cupple) axez.close() print str(scar) [/code]
Sorry, but I have no idea what's going on in that script. Ghoster, relishmayo, golem, wapples?
Those are just variable names... Basically it takes one or more directories as input and loads all of the mp3 files in those into an array and goes through that array concatenating those together until it reaches a certain size and then it writes that to a mp3 file. I suck at naming variables as far as sticking to common conventions go but those help me remember them and stuff. Also easier for me than typing out long names over and over and my lack of object oriented understanding comes in too. For what it matters I know that program works but I want it in a language that's much faster. Maybe C or assembler. Just want to see if you guys know it before trying to learn a language just for this test.
Here you go. Takes a list of files (not directories) on the command line and cats them to the file specified after -o [cpp]#include <algorithm> #include <string> #include <iostream> #include <sstream> #include <Windows.h> void usage(const char* err) { std::cerr << err << std::endl; exit(-1); } int main(int argc, char* argv[]) { char **arg_o = std::find_if(argv, argv+argc, [](const char *c){ return !strcmp(c, "-o");}); char **arg_v = std::find_if(argv, argv+argc, [](const char *c){ return !strcmp(c, "-v");}); if (arg_o + 1 >= argv + argc) usage("provide output filename with -o <file>"); bool verbose = argv + argc != arg_v; SYSTEM_INFO sys = {}; GetSystemInfo(&sys); char *ofilename = *(arg_o + 1); HANDLE ofile = CreateFile(ofilename, FILE_GENERIC_READ | FILE_GENERIC_WRITE, 0, 0, CREATE_NEW, FILE_ATTRIBUTE_NORMAL, 0); if (INVALID_HANDLE_VALUE == ofile) usage("error: output file exists or access denied"); DWORD size = 0; std::for_each(argv + 1, argv+argc, [&](char *val) { if (val == *arg_o || val == *(arg_o + 1) || val == *arg_v) return; HANDLE ifile = CreateFile(val, FILE_GENERIC_READ, 0, 0, OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL, 0); if (INVALID_HANDLE_VALUE == ifile) { std::cerr << "warning: unable to open file: " << val << std::endl; return; } DWORD isize = GetFileSize(ifile, NULL); if (verbose) std::cout << "appending file: " << val << " size: " << isize << std::endl; HANDLE imapping = CreateFileMapping(ifile, 0, PAGE_READONLY, 0, isize, 0); void *iview = MapViewOfFile(imapping, FILE_MAP_READ, 0, 0, 0); DWORD page_offset = size % sys.dwAllocationGranularity; HANDLE omapping = CreateFileMapping(ofile, 0, PAGE_READWRITE, 0, size + isize, 0); void *oview = MapViewOfFile(omapping, FILE_MAP_WRITE, 0, size - page_offset, 0); CopyMemory((char*)oview + page_offset, iview, isize); UnmapViewOfFile(oview); CloseHandle(omapping); UnmapViewOfFile(iview); CloseHandle(imapping); CloseHandle(ifile); size += isize; }); } [/cpp] note: not sure if I did the code tags right.
[QUOTE=artanis;33754977]Here you go. Takes a list of files (not directories) on the command line and cats them to the file specified after -o [cpp]#include <algorithm> #include <string> #include <iostream> #include <sstream> #include <Windows.h> void usage(const char* err) { std::cerr << err << std::endl; exit(-1); } int main(int argc, char* argv[]) { char **arg_o = std::find_if(argv, argv+argc, [](const char *c){ return !strcmp(c, "-o");}); char **arg_v = std::find_if(argv, argv+argc, [](const char *c){ return !strcmp(c, "-v");}); if (arg_o + 1 >= argv + argc) usage("provide output filename with -o <file>"); bool verbose = argv + argc != arg_v; SYSTEM_INFO sys = {}; GetSystemInfo(&sys); char *ofilename = *(arg_o + 1); HANDLE ofile = CreateFile(ofilename, FILE_GENERIC_READ | FILE_GENERIC_WRITE, 0, 0, CREATE_NEW, FILE_ATTRIBUTE_NORMAL, 0); if (INVALID_HANDLE_VALUE == ofile) usage("error: output file exists or access denied"); DWORD size = 0; std::for_each(argv + 1, argv+argc, [&](char *val) { if (val == *arg_o || val == *(arg_o + 1) || val == *arg_v) return; HANDLE ifile = CreateFile(val, FILE_GENERIC_READ, 0, 0, OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL, 0); if (INVALID_HANDLE_VALUE == ifile) { std::cerr << "warning: unable to open file: " << val << std::endl; return; } DWORD isize = GetFileSize(ifile, NULL); if (verbose) std::cout << "appending file: " << val << " size: " << isize << std::endl; HANDLE imapping = CreateFileMapping(ifile, 0, PAGE_READONLY, 0, isize, 0); void *iview = MapViewOfFile(imapping, FILE_MAP_READ, 0, 0, 0); DWORD page_offset = size % sys.dwAllocationGranularity; HANDLE omapping = CreateFileMapping(ofile, 0, PAGE_READWRITE, 0, size + isize, 0); void *oview = MapViewOfFile(omapping, FILE_MAP_WRITE, 0, size - page_offset, 0); CopyMemory((char*)oview + page_offset, iview, isize); UnmapViewOfFile(oview); CloseHandle(omapping); UnmapViewOfFile(iview); CloseHandle(imapping); CloseHandle(ifile); size += isize; }); } [/cpp] note: not sure if I did the code tags right.[/QUOTE] This piece of code is just awful: a mess of C++11, c-strings and WinAPI.
[QUOTE=sim642;33755139]This piece of code is just awful: a mess of C++11, c-strings and WinAPI.[/QUOTE] I think the only ugly part is the Win32, and IMHO it is unavoidable. Also, he said fast, not pretty. That is as fast as you can do it on windows. Prove me wrong!
Why not use the Unix 'cat' command?
Or 'copy a + b c' on Windows.
[QUOTE=ROBO_DONUT;33756969]Or 'copy a + b c' on Windows.[/QUOTE] Yeah, for binary files it's "copy /b a+b c" otherwise it'll stop when it hits an EOF char.
In the process of reading your code, I ended up rewriting it: [code]import os import os.path import glob import sys def cat_files(directories, destination_file, pattern="*.txt", binary=False): """Concatenates the contents of all files matching the pattern in any of the directories.""" def get_filenames(directories, pattern): """Yields the paths of all pattern-matching files within the directories. Does not avoid duplicates. Assumes that CWD is not changed.""" for inputdir in directories: #Next line, while correct, could cause unexpected behaviour #inputdir = os.path.dirname(inputdir) #Small error check, should use exceptions instead if not os.path.isdir(inputdir): print "Directory does not exist:" print "\t", inputdir continue #Glob operates on cwd os.chdir(inputdir) for filename in glob.glob(pattern): if not os.path.isfile(filename): continue print os.path.join(inputdir, filename) yield os.path.join(inputdir, filename) def read_files(paths, binary=False): """Yields file contents for each path.""" for path in paths: try: with open(path, 'rb' if binary else 'r') as f: yield f.read() except IOError: print "Error on access of", path continue #Concat the files with open(destination_file, 'wb' if binary else 'w') as f: for data in read_files(get_filenames(directories, pattern), binary): f.write(data) if __name__=='__main__': print '(%d)' % len(sys.argv), ' '.join(sys.argv) if len(sys.argv) == 1: print "Required argument(s) missing." else: cat_files(sys.argv[1:], r"R:\dakii.mp3", "*.mp3", binary=True)[/code] It definitely doesn't do exactly what the original did anymore, but I think it's more in line with your description.
Gentlemen. I dialed up the c++ a bit and added some features. Directories are recursed into and wildcards are expanded. [url]http://pastebin.com/VEtDQ6Sv[/url] Oh yeah, it catches all API level errors now and I think the main function is a little easier to understand.
[QUOTE=artanis;33760837]Gentlemen. I dialed up the c++ a bit and added some features. Directories are recursed into and wildcards are expanded. [url]http://pastebin.com/VEtDQ6Sv[/url] Oh yeah, it catches all API level errors now and I think the main function is a little easier to understand.[/QUOTE] Bleh, C mixed with C++. [cpp]#include <vector> #include <string> #include <algorithm> #include <iostream> #include <fstream> #include <filesystem> void usage() { std::cout << "Usage: filecat <directories> -o <output file>" << std::endl; } int main(int argc, char* argv[]) { using namespace std::tr2::sys; std::vector<std::string> args(argv, argv + argc); // Find the iterator of argument "-o" and get the next from it auto it_out = std::find_if(args.begin(), args.end(), [](const std::string& arg) { return arg == "-o"; }); // Increment to leave out -o if (it_out == args.end() || ++it_out == args.end()) { std::cerr << "missing output filename" << std::endl; usage(); return EXIT_FAILURE; } // +1 to skip the first argument (program name) // -1 to leave out the -o std::vector<std::string> dir_list(args.begin() + 1, it_out - 1); std::ofstream ofs(*it_out, std::ios::binary); if (!ofs.good()) { std::cerr << "unable to open output file" << std::endl; usage(); return EXIT_FAILURE; } for (auto it = dir_list.begin(); it != dir_list.end(); ++it) { path dir(*it); if (!is_directory(dir)) continue; recursive_directory_iterator dir_end; for (recursive_directory_iterator dir_it(dir); dir_it != dir_end; ++dir_it) { std::ifstream f(dir_it->path(), std::ios::binary); if (!f.good()) continue; ofs << f.rdbuf(); } } return EXIT_SUCCESS; }[/cpp] C++11 (and filesystem from TR2). Although filesystem is only implemented by MSVC11 so far, but it should be coming to g++ soon. (Or just use Boost::Filesystem, which the TR2 implementation is)
[QUOTE=raBBish;33762349]Bleh, C mixed with C++. [/QUOTE] One c lib function call? :v: I hope you're not implying that char* should be avoided. Really the whole point was to use memory mapped files, something that c++ still doesn't have classes to deal with. There is a boost implementation but I've never used it.
[QUOTE=artanis;33762584]One c lib function call? :v:[/QUOTE] Two [img]http://facepunch.com/fp/emoot/eng101.gif[/img] (Why CopyMemory instead of memcpy?) [QUOTE=artanis;33762584]I hope you're not implying that char* should be avoided.[/quote] Why not? What's the advantage of not using std::string? You're also being inconsistent. You use iterators as pointers, char arrays for command line arguments but std::string everywhere else, std::ostringstream for concatenating two strings instead of just using std::string? My first experiences with C++ was mostly C with some C++. It was horrible to read and write, so I dropped learning it. Few months ago I started learning again, this time with proper C++ and I haven't looked back. Also I prefer not touching Win32 or any platform APIs in general.
Well this thing is used to mix sound bites together to form songs for certain genres of music. The one I posted would go and randomly select parts from the directories listed on the command line as arguments following the program name. The reason I didn't use any fancy flags like -o or whatever is because this thing runs from the send to directory which means it's a right click thing and thus doesn't allow inserting any characters with the side effect of being faster and more convenient. I have other versions which mix the file orders in certain patterns but that's the most basic one. The reason I'm asking to have it as a faster language is to see how fast it can get at maximum speed though it's more for a test than anything. [url]http://www.mediafire.com/?xp4x9coiqwou9j7[/url] There;s the pack of scripts I was using maybe a week ago but things haven't changed that much but it could be useful if you guys want to create similar stuff (Glitchcore, Glitchhop, Breakcore, etc) You'll have to adjust it a little as it's for windows xp but since Vista and 7 have the send to folder as well you can use those too after adjusting some paths.
[cpp] // Public domain, do whatever you want with it. #include <stdio.h> #include <Windows.h> int main(int iArgCount, char **pArguments) { int i; DWORD dwRead; DWORD dwDummy; HANDLE hInput; HANDLE hOutput; LPBYTE lpBuffer; if(iArgCount < 2) { printf("No output file specified.\n"); return 1; } lpBuffer = (LPBYTE)VirtualAlloc(NULL, 4096, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE); if(!lpBuffer) { printf("Couldn't allocate buffer space.\n"); return 2; } hOutput = CreateFile(pArguments[1], GENERIC_WRITE, FILE_SHARE_READ, NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL); if(hOutput == INVALID_HANDLE_VALUE) { printf("Couldn't open %s.\n", pArguments[1]); VirtualFree(lpBuffer, 0, MEM_RELEASE); return 3; } for(i = 2; i < iArgCount; i++) { hInput = CreateFile(pArguments[i], GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL | FILE_FLAG_SEQUENTIAL_SCAN, NULL); if(hInput == INVALID_HANDLE_VALUE) { printf("Couldn't open %s.\n", pArguments[i]); CloseHandle(hOutput); VirtualFree(lpBuffer, 0, MEM_RELEASE); return 4; } while(1) { ReadFile(hInput, lpBuffer, 4096, &dwRead, NULL); WriteFile(hOutput, lpBuffer, dwRead, &dwDummy, NULL); if(dwRead < 4096) { break; } } CloseHandle(hInput); } CloseHandle(hOutput); VirtualFree(lpBuffer, 0, MEM_RELEASE); return 0; } [/cpp]
[QUOTE=raBBish;33762825]Two [img]http://facepunch.com/fp/emoot/eng101.gif[/img] (Why CopyMemory instead of memcpy?) [/QUOTE] Oh yeah, that is a macro that is indeed implemented by memcpy. I used it because I was already using winapi and initially didn't even realize it was a macro. [QUOTE=raBBish;33762825] Why not? What's the advantage of not using std::string? You're also being inconsistent. You use iterators as pointers, char arrays for command line arguments but std::string everywhere else, std::ostringstream for concatenating two strings instead of just using std::string? My first experiences with C++ was mostly C with some C++. It was horrible to read and write, so I dropped learning it. Few months ago I started learning again, this time with proper C++ and I haven't looked back. Also I prefer not touching Win32 or any platform APIs in general.[/QUOTE] I wouldn't call it inconsistent. Pointers are random access iterators! argv is a preformed vector of char pointers that can be iterated over in the same way. If you don't need to modify the list, another object is unnecessary. The other places I used char* was due to std::exception and the winapi requiring it. ostringstream is necessary anywhere you want to format non-strings into a string. 3/4 cases used that functionality to insert error codes, the one at the end was unnecessary and usage should have just accepted a std::string as well. [editline]17th December 2011[/editline] [QUOTE=pkt;33762977]Well this thing is used to mix sound bites together to form songs for certain genres of music. The one I posted would go and randomly select parts from the directories listed on the command line as arguments following the program name. The reason I didn't use any fancy flags like -o or whatever is because this thing runs from the send to directory which means it's a right click thing and thus doesn't allow inserting any characters with the side effect of being faster and more convenient. I have other versions which mix the file orders in certain patterns but that's the most basic one. The reason I'm asking to have it as a faster language is to see how fast it can get at maximum speed though it's more for a test than anything. [url]http://www.mediafire.com/?xp4x9coiqwou9j7[/url] There;s the pack of scripts I was using maybe a week ago but things haven't changed that much but it could be useful if you guys want to create similar stuff (Glitchcore, Glitchhop, Breakcore, etc) You'll have to adjust it a little as it's for windows xp but since Vista and 7 have the send to folder as well you can use those too after adjusting some paths.[/QUOTE] Well the best speed you're going to get should be however fast your hdd can read and write. What was the unit of the measurement from your original post? 2500MB/min? That would be about 42MB/sec, and depending on your pc (if it's a laptop, or an older hdd) you may be at your limit already, though it seems unlikely.
In megabytes maybe only...well I just did some calculations and running from my ramdisk the original is getting about 1.9 GB per minute or 31.6 MB per sec. I'm starting to think that the most important limiting factor in this program is the write speed of the used media. So at this rate python isn't causing that much of a slowdown. My calculations maybe be incorrect though and correct me if I'm wrong but would the main bottlenecks in a script so simple be more related to read/write speeds and not necessarily a shortcoming in the program or language used? I mean sure c code would be more efficient on the processor as far as I can tell but the tradeoff being modularity with the not needing to compile. I'll likely use the code the guy provided up there at some point though. Is all of the code in this thread public domain unless otherwise noted? At least the thing I posted is. I'm not sure anyone noticed but in this one and I can't tell from the c code posted but one I'm using randomly concats the pieces together until it reaches a certain limit which in this case is 1 minute of 96 kbps of audio and a little after that since I can't think up a better way to limit the written file size than a while loop. Would it be safe to only write a certain amount of bytes of such a file? I heard as long as it's in a multiple of 2 bytes then it shouldn't leave anything weird like fragmentation generated pointers and such or does the EOF character magically make it all better. Also I think the mp3's used start with a header of about 659 bytes if that matters any. It might have a footer too dunno. Just that's what the encoder puts out when it there's zero data as input. Anything else you think I should know? Thanks for reading.
Sorry, you need to Log In to post a reply to this thread.