Recently, a challenge came across my desk that involved comparing very large sets of data against one another: specifically, a list of all computers in our domain against a list of all computers registered with a specific application. It posed an interesting question, “What would be the fastest way to accomplish this?”
I set out to look for different ways of comparing lists, and I can think of three. The first two load all of the items into an array and then search it, item by item, for each value using either the –match operator or the –contains operator. The third loads all the items into a hash table with empty values and then checks whether each name exists as a key. Since loading a hash table should take more time than loading an array, I want to time the entire process, not just the searches.
To actually do the timing, I will use the Measure-Command cmdlet. If you haven’t ever used it, you should really play with it. It’s a great tool for figuring out how long any given code block takes to run. That can be useful for things like filling in the time on a Write-Progress call, or reporting execution time back to a user. Really, you can look at it as a way to avoid setting a variable to Get-Date and then creating a New-TimeSpan after the command completes; it rolls it all into one.
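To make that concrete, here is a quick sketch of both approaches side by side (the Start-Sleep is just a stand-in for whatever code block you want to time):

```powershell
# The long way: grab the time before and after, then build a timespan
$Start = Get-Date
Start-Sleep -Milliseconds 200
$Elapsed = New-TimeSpan -Start $Start -End (Get-Date)
"Long way: $($Elapsed.TotalMilliseconds) milliseconds"

# The same thing with Measure-Command, rolled into one
$Elapsed = Measure-Command { Start-Sleep -Milliseconds 200 }
"Measure-Command: $($Elapsed.TotalMilliseconds) milliseconds"
```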
So, it’s a race between searching hash tables and searching arrays using both –match and –contains. Here is the code I used:
# $Checks is the list of names to search for; here it is loaded from the
# same file, so every lookup should come back as a match
$Checks = @(Get-Content u:\script\workstations.txt)

$ArrayContainsTime = Measure-Command {
    $Array = @(Get-Content u:\script\workstations.txt)
    $Found = 0
    foreach ($Name in $Checks){
        If ($Array -contains $Name){$Found++}
    }
}
"Array Contains Count: `t$($Array.Count)"
"Array Contains Found: `t$($Found)"

$ArrayMatchTime = Measure-Command {
    $Array = @(Get-Content u:\script\workstations.txt)
    $Found = 0
    foreach ($Name in $Checks){
        If ($Array -match $Name){$Found++}
    }
}
"Array Matches Count: `t$($Array.Count)"
"Array Matches Found: `t$($Found)"

$HashTime = Measure-Command {
    $HashTable = @{}
    ForEach ($Line in Get-Content u:\script\workstations.txt){
        $HashTable.Add($Line,"1")
    }
    $Found = 0
    foreach ($Name in $Checks){
        If ($HashTable.ContainsKey($Name)){$Found++}
    }
}
"Hash Table Count: `t$($HashTable.Count)"
"Hash Table Found: `t$($Found)"

"Milliseconds for Array Contains:`t$($ArrayContainsTime.TotalMilliseconds)"
"Milliseconds for Array Matches:`t$($ArrayMatchTime.TotalMilliseconds)"
"Milliseconds for Hash Table Contains:`t$($HashTime.TotalMilliseconds)"
I have loaded the text file with 2,000 entries, so we are basically comparing 2,000 items against 2,000 items. Every single one will be a match, so we can verify it’s working by making sure the Found and Count values are the same. If you took this code and loaded two different lists, you would see a difference there. So, without further delay, it’s off to the races!
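As a sketch of that two-list case, here is how the hash-table approach finds domain computers that are missing from the registered list. The computer names below are made-up stand-ins; in practice both lists would come from Get-Content:

```powershell
# In-memory stand-ins for the two lists; in practice these come from files
$Domain     = "PC1","PC2","PC3","PC4"
$Registered = "PC1","PC3"

# Load the registered list into a hash table for fast key lookups
$Table = @{}
ForEach ($Name in $Registered){ $Table[$Name] = 1 }

# Anything in the domain list without a matching key is unregistered
$Missing = @($Domain | Where-Object { -not $Table.ContainsKey($_) })
"Not registered: $($Missing -join ', ')"
```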
Array Contains Count: 2000
Array Contains Found: 2000
Array Matches Count: 2000
Array Matches Found: 2000
Hash Table Count: 2000
Hash Table Found: 2000
Milliseconds for Array Contains: 532.6136
Milliseconds for Array Matches: 9839.4498
Milliseconds for Hash Table Contains: 51.2049
So we have a winner! As you can see, all three methods work, but the hash table is substantially faster than the two array-based methods, searching through all 2,000 items in under a tenth of a second. The array with the –contains operator still posts a very reasonable time, and is probably easier and more comfortable for the average scripter to use. It should also be said that the array with the –match operator isn’t insanely slow by any means, and it is by far the most flexible method for searching, since it can match any portion of a name in the array. Use it with caution, though, because that flexibility can create false positives. Let’s say you are looking for “Computer” in a list that contains “Computer1”. You may not expect this to be a match, but it will be.
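A quick illustration of that false-positive behavior:

```powershell
$Array = @("Computer1","Computer2")

# -contains looks for an exact element match, so this is False
$Array -contains "Computer"

# -match treats the right side as a regex and returns every element that
# matches it; the non-empty result counts as True in a boolean test
[bool]($Array -match "Computer")
```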
So, there you have it. If you need to search a massive list and speed is top of mind, use a hash table!