Reading a very large file drives CPU load to 100%


I'm reading a text file with Apache Commons I/O, and CPU usage hits 100% on a large file (23 GB, about 404 million lines). My code snippet is below:

try (LineIterator it = FileUtils.lineIterator(file1, "UTF-8")) {
    while (it.hasNext()) {
        String lineR = it.nextLine();
        // do something with line
        bytesRead += lineR.length();
        int percent = (int) (bytesRead * 100 / totalBytes);
        if (percent > prePercent && percent % 5 == 0) {
            log.info(percent + "% " + prefix + " read.");
            prePercent = percent;
        }

        //split \t or " ", get domainName
        String domainName = Arrays.stream(lineR.split("[\t ]")).filter(line -> line.contains(prefix)).findFirst().orElse(" ");
        uniqueNameDomainSet.add(domainName.substring(0, a.length() -1));
    }
}

I don't think the problem is in Apache Commons I/O, so which part of this code could be driving the CPU to 100%?

There is 1 answer

Chicky

One problem is that String.split() treats its delimiter as a regex, so each time you invoke split(), the pattern [\t ] has to be compiled again before it can be matched.

You can optimize it a bit by using Pattern.split() so that you can reuse the compiled pattern.

Try this:

import java.util.regex.Pattern;

public class SplitBenchmark {
    public static void main(String[] args) {
        String text = "one\ttwo\tthree";
        String separatorPatternStr = "[\t ]";
        // Compile the delimiter pattern once and reuse it in the loop.
        Pattern separatorsPattern = Pattern.compile(separatorPatternStr);
        int numIterations = 404000000;

        // Case 1: split with the precompiled pattern.
        long startTime = System.currentTimeMillis();
        for (int i = 0; i < numIterations; i++) {
            String[] tmp = separatorsPattern.split(text);
        }
        long preCompiledElapsedTime = System.currentTimeMillis() - startTime;
        System.out.printf("preCompiledElapsedTime: %d(ms)\n", preCompiledElapsedTime);

        // Case 2: String.split() with the pattern string on every iteration.
        long startTime2 = System.currentTimeMillis();
        for (int i = 0; i < numIterations; i++) {
            String[] tmp = text.split(separatorPatternStr);
        }
        long plainSplitElapsedTime = System.currentTimeMillis() - startTime2;
        System.out.printf("plainSplitElapsedTime: %d(ms)\n", plainSplitElapsedTime);
    }
}
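
Applied to the loop in the question, the same idea might look roughly like the sketch below. The class and method names (DomainFieldExtractor, firstFieldContaining) are made up for illustration, and the " " fallback simply mirrors the orElse(" ") in the question:

```java
import java.util.regex.Pattern;

final class DomainFieldExtractor {
    // Compiled once for the whole run instead of once per line.
    private static final Pattern SEPARATORS = Pattern.compile("[\t ]");

    // Returns the first tab- or space-separated field that contains prefix,
    // or " " if no field matches (same fallback as the question's code).
    static String firstFieldContaining(String line, String prefix) {
        for (String field : SEPARATORS.split(line)) {
            if (field.contains(prefix)) {
                return field;
            }
        }
        return " ";
    }

    public static void main(String[] args) {
        // Hypothetical sample line; the real file format is not shown in the question.
        String line = "1.2.3.4\twww.example.com. IN";
        System.out.println(firstFieldContaining(line, "example")); // prints "www.example.com."
    }
}
```

A plain loop is used here instead of Arrays.stream(...) so that no stream pipeline is built for each of the ~404 million lines; whether that makes a measurable difference would need to be checked on your data.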

Also, regex matching is a costly operation. If you can, make your document use a single delimiter character (e.g. '\t') so that you can write a custom split with String.indexOf, or use Guava's Splitter, which should be much more efficient.
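
A custom split along those lines could be sketched as follows, assuming every line uses a single delimiter character. TabSplit and its split method are illustrative names, not part of any library:

```java
import java.util.ArrayList;
import java.util.List;

final class TabSplit {
    // Splits line on a single delimiter character using indexOf only;
    // no regex is compiled or matched.
    static List<String> split(String line, char delimiter) {
        List<String> fields = new ArrayList<>();
        int start = 0;
        int end;
        while ((end = line.indexOf(delimiter, start)) >= 0) {
            fields.add(line.substring(start, end));
            start = end + 1; // continue after the delimiter
        }
        fields.add(line.substring(start)); // last field, possibly empty
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(split("one\ttwo\tthree", '\t')); // prints [one, two, three]
    }
}
```

Note that, unlike String.split(), this version keeps trailing empty fields, so "a\t\t" yields three fields rather than one.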