Counting nbr of lines based on a field separator

23 December 2019

A while ago a customer for which I have designed and developed a data platform has asked me to develop a solution to count the real number of lines of a flat file. With this number we are able to validate the load, just to be sure that we have loaded all rows into a table. In short the number of imported rows returned by the T-SQL command bulk insert should be the same as the number of lines of the flat file. Of course there are other options as well think about a data delivery list with some metrics. Due to various reasons we consider counting the number of lines as the best short term solution.

Okay, we are already using a combination of PowerShell scripts and T-SQL to the load flat files. So counting the number of lines with PowerShell is not a real big thing. For your background we have to deal with large flat files. And handling these files with a minimum consumption of server resources we are using the text stream object instead of using Get-Content. The latest loads the whole file in memory.

A text stream object looks like this:


Clear-Host;
Push-Location $PSScriptRoot;

[int]$TotalLineCount = 0;
[string]$FileName = "MyTestFile.txt";

ForEach ($Line in [System.IO.File]::ReadLines($FileName)){
    $TotalLineCount++;
};

Write-Host "Nbr of lines in file $($FileName): $($TotalLineCount.ToString())";

[System.GC]::Collect();


We are already using the above mentioned text stream object to meet all kind of other requirements e.g. calculating the number lines.

But the challenge to overcome was a free text field. Yes, we have this one in the flat files and we need it to perform text analysis with R. And of course the content of a free text field contains in some cases a carriage return. Due to this the free text field is split into multiple lines in the text file; e.g.


ID*|*Naam*|*StartDatum*|*Comments*|*EindDatum
1*|*Clara de Koe*|*20140419*|*"6sfsjkjkjl fsjdfkljsdflksd
fdfddf
fdfd dfdjk ""dfd"" "*|*5-2-2019
2*|*Jan Jansen*|*19941118*|*"4sfsjkjkjl fsjdfkljsdflksd
fdfddf
fdfd dfdjk ""dfd"" "*|*6-2-2019
3*|*Piet Puk*|*20050125*|*"1sfsjkjkjl fsjdfkljsdflksdfdfddffdfd dfdjk ""dfd"" "*|*7-2-2019

201 rows affected

In the above example we have one header, 3 "data" lines and 12 physical lines. So, for each file we know two things 1. ) the field separator and 2.) the header. The header is mandatory and this is validated when a flat file is processed. Based on the header we can calculate the number of lines with two solutions.

Furthermore we the T-SQL command bulk insert can import a carriage return if the row terminator is different, in our case this should always be true. And we assume that the field separator is not used in the comment field. If an assumption fails, either by catching the line causing the error or by validating the outcome of the calculation of data lines (it contains some decimals). It was not possible to remove the carriage return in the free text field because this increase the burden of the export process at party delivery side significant. And the customer wants the use the data as is.


First solution: Calculation: (Total number of field separators in flat file / total number of field separators in the header) - 1. We subtract 1 because the numerator contains also the header field separators and we need only the number of "data" lines. In PowerShell it looks like this:


Clear-Host;
Push-Location $PSScriptRoot;

[int]$HeaderRowPos = 1;
[int]$TotalLineCount = 0;
[string]$FieldTerminatorChar = "*|*";
[int]$FieldSepCount = 0;
[int]$FieldSepHdrCount = 0;
[int]$FieldSepTotalLineCount = 0;
[string]$FileName = "MyTestFileLineBreaks.txt";

$FieldTerminatorChar = [string]::Concat("(", $FieldTerminatorChar.Replace("*", "\*").Replace("|", "\|"), ")");

ForEach ($Line in [System.IO.File]::ReadLines($FileName)){

    $TotalLineCount++;

    $FieldSepCount = ($($Line | Select-String -Pattern $FieldTerminatorChar -AllMatches).Matches.Count);
    $FieldSepTotalLineCount = $FieldSepTotalLineCount + $FieldSepCount;
  
    if ($TotalLineCount -eq $HeaderRowPos){
        $FieldSepHdrCount = $FieldSepCount;
    };
};

Write-host "FieldSepTotalLineCount: $($FieldSepTotalLineCount)";
Write-host "TotalLineCount: $($TotalLineCount)";
Write-host "TotalLineCount: $(($FieldSepTotalLineCount / $FieldSepHdrCount) - 1)";

[System.GC]::Collect();

The second solution: Just like in the approach of the first step we calculate the number of the field separators in the header. And then we calculate the number of field separators of each line compared this to the total number of the field separators in the header. If the numbers are equal, reset the variable to zero and read the next line. If the number is unequal then read the next line and add the number of found field separators to the previous number. Do this until the numbers (header vs. line) are equal. The code in PowerShell:


Clear-Host;
Push-Location $PSScriptRoot;

[int]$HeaderRowPos = 1;
[int]$TotalLineCount = 0;
[string]$FieldTerminatorChar = "*|*";
[int]$FieldSepCount = 0;
[int]$FieldSepHdrCount = 0;
[int]$FieldSepTotalLineCount = 0;
[string]$FileName = "MyTestFileLineBreaks.txt";

$FieldTerminatorChar = [string]::Concat("(", $FieldTerminatorChar.Replace("*", "\*").Replace("|", "\|"), ")");

ForEach ($Line in [System.IO.File]::ReadLines($FileName)){
    $TotalLineCount++;

    # Count the nbr of field terminators
    $FieldSepCount = ($($Line | Select-String -Pattern $FieldTerminatorChar -AllMatches).Matches.Count);

    if ($TotalLineCount -eq $HeaderRowPos){
        $FieldSepHdrCount = $FieldSepCount;
    } else{

        if (($Line.Length -gt 0) `
            -and ($FieldSepHdrCount -gt 0)){

            if (($FieldSepLineCount + $FieldSepCount) -eq $FieldSepHdrCount){
                # Total nbr is the same as the one from the header
                $FieldSepTotalLineCount++;
                $FieldSepLineCount = 0;
            } else{
                # Not equal, thus we need the next line
                $FieldSepLineCount = ($FieldSepLineCount + $FieldSepCount);
            };
        };
    };
};

Write-host "FieldSepTotalLineCount: $($FieldSepTotalLineCount)";
Write-host "TotalLineCount: $($TotalLineCount)";

[System.GC]::Collect();

In both solutions we need to add some escape characters to the field separator to use this as a regular expression in Powershell function Select-String:

$FieldTerminatorChar = [string]::Concat("(", $FieldTerminatorChar.Replace("*", "\*").Replace("|", "\|"), ")");

In this case we are using *|* as a field separator.

Above I described two solutions as a pattern and provide some code. Of course you should refine it to get the right fit in your own solutions and to meet all requirements.