Basic PowerShell website scraper


Today, I put together a quick little project for a friend of mine who needed a way to scrape downloadable documents and video files from a website.  Rather than clicking and downloading each one manually, he wondered if there was a way he could do it with a script.

BUT OF COURSE THERE IS.

So, “Aaron’s barebones scraper” was born.

It’s not glamorous, but … it’s mine.  And it’s some nifty PowerShell that you could maybe someday use to build something great.  The basis of the script is:

• Download the content of a page using Invoke-WebRequest.
• Expand the .Links property and then select the Href attribute, using a regex match to select document types.
• Loop through the resulting links, downloading them to an output directory of our choosing.

The most “interesting” tidbit was deciding to build a mechanism for submitting basic auth to websites.  Sometimes the standard passing of the -Credential object works (Invoke-WebRequest -Credential), sometimes it doesn’t.  Sometimes, you need to pass an Authorization header (I haven’t really determined why authoritatively, but I’m guessing it has to do with how the site responds for the 401 challenge request).  I tried to kill two birds with one stone, giving you options to use either the standard PSCredential object (with UseCredentialAsPSCred parameter) or a more legacy style (default).  Let me know how it works for you!
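To see that tidbit in isolation, here's a minimal sketch of the two styles — the header construction mirrors what the full script below does, and the Invoke-WebRequest lines are just illustrative calls against a placeholder $Site:

```powershell
# Build a "Basic <base64(user:pass)>" Authorization header from a PSCredential
$Credential = Get-Credential
$Pair = "$($Credential.UserName):$($Credential.GetNetworkCredential().Password)"
$Encoded = [System.Convert]::ToBase64String([System.Text.Encoding]::ASCII.GetBytes($Pair))
$Headers = @{ Authorization = "Basic $Encoded" }

# Legacy style: send the Authorization header yourself
Invoke-WebRequest -Uri $Site -Headers $Headers

# Standard style: hand over the PSCredential and let PowerShell
# answer the 401 challenge
Invoke-WebRequest -Uri $Site -Credential $Credential
```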

# Basic PowerShell scraper
param (
    [System.Management.Automation.PSCredential]$Credential,
    [array]$DocumentTypes = @("pdf","doc","docx","xls","xlsx","xlsm","xlsxm","ppt","pptx","jpg","gif","mp4","m4v","mp3","mov","avi","wmv","wma"),
    [string]$OutputPath = "C:\Temp\",
    [switch]$UseCredentialAsPSCred,
    [Parameter(Mandatory = $true)][string]$Site
)

# Build a Basic Authorization header from the supplied credential
If ($Credential)
{
    $Username = $Credential.UserName
    $Password = $Credential.GetNetworkCredential().Password
    $CredentialString = "$($Username):$($Password)"
    $CredentialEncoded = [System.Convert]::ToBase64String([System.Text.Encoding]::ASCII.GetBytes($CredentialString))
    $BasicAuthValue = "Basic $CredentialEncoded"
    $Headers = @{ Authorization = $BasicAuthValue }
}

# Fetch the page, using whichever auth style was requested
$PageParams = @{ Uri = $Site }
If ($Credential) { If ($UseCredentialAsPSCred) { $PageParams.Add('Credential', $Credential) } Else { $PageParams.Add('Headers', $Headers) } }
try { $data = Invoke-WebRequest @PageParams } catch { "Unable to gather data from $($Site)" }

If (!(Test-Path $OutputPath))
{
    $FolderResult = New-Item -Path $OutputPath -Type Directory -Force
}
$OutputPath = $OutputPath.TrimEnd("\")

if ($data)
{
    [array]$Links = @()
    $Links += ($data.Links).Href

    # Case-insensitive match on the chosen file extensions
    $Filter = '(?i)(' + (($DocumentTypes | % { [regex]::Escape($_) }) -join "|") + ')$'
    [array]$FilesToDownload = $Links -match $Filter

    $i = 1
    $iTotal = $FilesToDownload.Count
    foreach ($File in $FilesToDownload)
    {
        $Filename = Split-Path $File -Leaf
        $OutputFile = Join-Path $OutputPath -ChildPath $Filename
        Write-Progress -Activity "Downloading $($File)." -PercentComplete (($i/$iTotal) * 100) -Id 1 -Status "File $($i) of $($iTotal)"
        $params = @{ Uri = $File; OutFile = $OutputFile }
        If ($Credential) { If ($UseCredentialAsPSCred) { $params.Add('Credential', $Credential) } Else { $params.Add('Headers', $Headers) } }
        try { Invoke-WebRequest @params } catch { Write-Progress -Status "Error downloading $($File)." -Activity "Downloading $($File)." -Id 1 }
        $i++
    }
    Write-Progress -Activity "Finished." -Completed
}

Copy the code and save it as a .ps1.  As you can see from the DocumentTypes parameter, I have set a bunch of standard document file extensions.  You can obviously change these as you see fit to download more (or fewer) file types from a site, and since the list is evaluated via an escaped regex, you can choose whether or not to include the “.” as part of the extension.  You could also modify the regular expression to pattern match any part of the link (hint: remove the $ at the end of the expression).
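For reference, a few example invocations — the script name (Scraper.ps1) and URL here are placeholders; use whatever you saved the file as:

```powershell
# Anonymous site, default document types
.\Scraper.ps1 -Site "https://www.example.com/downloads"

# Basic auth via the Authorization header (the default when -Credential is supplied)
.\Scraper.ps1 -Site "https://www.example.com/downloads" -Credential (Get-Credential)

# Basic auth via the standard PSCredential mechanism instead
.\Scraper.ps1 -Site "https://www.example.com/downloads" -Credential (Get-Credential) -UseCredentialAsPSCred

# Only grab PDFs and MP4s, to a different output folder
.\Scraper.ps1 -Site "https://www.example.com/downloads" -DocumentTypes @("pdf","mp4") -OutputPath "D:\Scraped\"
```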

Here’s what it looks like when it’s running:

Happy scraping!