Scraping in C# - エビフライの唐揚げ

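First, what scraping actually is

Web scraping is a computer software technique for extracting information from websites. It is also called a web crawler or web spider.

Source: https://ja.wikipedia.org/

In short, it is a technique for automatically pulling whatever information you want out of a web page.

Scraping can run into copyright and other legal issues, so do it at your own risk. Also, check the target site's terms and the relevant rules properly before you start.

To scrape in C#, I used a library called "AngleSharp".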

Once you have this, the rest is easy.

First, build an IHtmlDocument from the HTML:

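// Namespaces this snippet needs
using System;
using System.IO;
using System.Net;
using AngleSharp.Html.Dom;
using AngleSharp.Html.Parser;

/// <summary>
/// Downloads the page at the given URL and parses it into an IHtmlDocument.
/// </summary>
/// <param name="url">URL of the page to fetch</param>
/// <returns>The parsed document, or null if fetching or parsing failed</returns>
public IHtmlDocument GetHtmlDocument(string url)
{
    var doc = default(IHtmlDocument);
    try
    {
        // Wait one second before each request so the target site is not hammered
        System.Threading.Thread.Sleep(1000);

        using (WebClient wc = new WebClient())
        {
            // Fetch the HTML of the given site as a stream
            using (Stream st = wc.OpenRead(url))
            using (StreamReader sr = new StreamReader(st))
            {
                string htmlText = sr.ReadToEnd();

                // Parse the HTML text into a DOM that can be queried
                var parser = new HtmlParser();
                doc = parser.ParseDocument(htmlText);
            }
        }
    }
    catch (Exception ex)
    {
        // The error is swallowed here, so callers have to handle a null result.
        // Logging is left commented out (Logger is not shown in this post).
        //Logger.Writer(ex.Message);
    }
    return doc;
}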
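
WebClient is marked obsolete as of .NET 6, so on newer runtimes the same step can be written with HttpClient instead. The following is only a rough sketch of that idea, not the code this post actually used: the class name HtmlFetcher and the method name GetHtmlDocumentAsync are made up for illustration, and it only catches HttpRequestException (timeouts and malformed URLs would still throw).

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;
using AngleSharp.Html.Dom;
using AngleSharp.Html.Parser;

// Hypothetical helper; the names are not from the original post.
public static class HtmlFetcher
{
    // Reuse a single HttpClient instead of creating one per request
    private static readonly HttpClient Client = new HttpClient();

    // Returns the parsed document, or null if the HTTP request failed.
    public static async Task<IHtmlDocument> GetHtmlDocumentAsync(string url)
    {
        try
        {
            // Same one-second pause as the synchronous version, without blocking a thread
            await Task.Delay(1000);

            string htmlText = await Client.GetStringAsync(url);

            // Parse the downloaded HTML text into a queryable document
            return new HtmlParser().ParseDocument(htmlText);
        }
        catch (HttpRequestException ex)
        {
            // Assumption: writing to stderr stands in for whatever logger you use
            Console.Error.WriteLine($"Failed to fetch {url}: {ex.Message}");
            return null;
        }
    }
}
```

You would call it as `var doc = await HtmlFetcher.GetHtmlDocumentAsync(url);` and check for null before querying.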

After that, pass a CSS selector like this and you get the value you want:

private static ProductInfo Judge(IHtmlDocument doc, int index)
{
	var info = new ProductInfo();

	// Every value lives under the index-th <li> of the list (nth-child is 1-based)
	string item = $"#sec-02 > ul > li:nth-child({index})";

	// Look up each value by CSS selector; fall back to an empty string when doc is null or nothing matches
	info.Name = doc?.QuerySelector($"{item} > p.ttl")?.InnerHtml?.Trim() ?? string.Empty;
	info.Price = doc?.QuerySelector($"{item} > p.price > span:nth-child(1)")?.InnerHtml.Replace("円", string.Empty) ?? string.Empty;
	info.URL = (doc?.QuerySelector($"{item} > p.img > a") as IHtmlAnchorElement)?.Href ?? string.Empty;
	info.Kcal = doc?.QuerySelector($"{item} > p:nth-child(3)")?.TextContent ?? string.Empty;
	info.Detail = doc?.QuerySelector($"{item} > p.smalltxt")?.InnerHtml?.Trim() ?? string.Empty;

	// Keep only the part before "kcal"
	int i = info.Kcal.IndexOf("kcal");
	if (i >= 0)
		info.Kcal = info.Kcal.Substring(0, i).Trim();

	return info;
}

The code above is probably hard to read, so here is the gist of it:

doc.QuerySelector(<CSS selector>)
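
To try the selector part without hitting any site, something like the following should work. It is just a sketch: the HTML is made up to resemble the selectors used earlier (it is not taken from a real page), and SelectorDemo, SampleHtml and GetTitle are names invented for this example.

```csharp
using System;
using AngleSharp.Html.Dom;
using AngleSharp.Html.Parser;

public static class SelectorDemo
{
    // Hypothetical markup shaped like "#sec-02 > ul > li > p.ttl"
    public const string SampleHtml = @"
<div id=""sec-02"">
  <ul>
    <li><p class=""ttl""> Ebi Fry </p><p class=""price""><span>500円</span></p></li>
    <li><p class=""ttl""> Karaage </p><p class=""price""><span>400円</span></p></li>
  </ul>
</div>";

    // Returns the title text of the index-th list item (1-based),
    // or an empty string when no element matches the selector.
    public static string GetTitle(IHtmlDocument doc, int index)
    {
        var element = doc.QuerySelector($"#sec-02 > ul > li:nth-child({index}) > p.ttl");
        return element?.TextContent.Trim() ?? string.Empty;
    }

    public static void Main()
    {
        // Parse from a string, so no network access is involved
        IHtmlDocument doc = new HtmlParser().ParseDocument(SampleHtml);

        Console.WriteLine(GetTitle(doc, 1));        // Ebi Fry
        Console.WriteLine(GetTitle(doc, 2));        // Karaage
        Console.WriteLine(GetTitle(doc, 3).Length); // 0, because there is no third item
    }
}
```

QuerySelector returns null when nothing matches, which is why the `?.` and `??` chain is there.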

For how to get the CSS selector of an element, this page is easy to follow:

gammasoft.jp