Google Search in Java

November 28, 2019 29 replies Solved

ASHAMM November 28, 2019

Can anyone explain why this returns nothing? It only says it ran successfully, but with no output at all! I don't see anything wrong.

/*
 * To change this license header, choose License Headers in Project Properties.
 * To change this template file, choose Tools | Templates
 * and open the template in the editor.
 */
package googlesearchdemo;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.swing.text.Document;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/**
 *
 * @author filip
 */
public class GoogleSearchDemo {

  // pattern for extracting the link such as www.codeforeach.com/java/ ( domain
  // name + path )
  private static final Pattern p = Pattern
      .compile("([a-zA-Z0-9]([a-zA-Z0-9\\-]{0,61}[a-zA-Z0-9])?\\.)+[a-zA-Z]{2,6}(/[^&]*)*");

  public static void main(String[] args) throws IOException {
    String searchQuery = "animais";
    List<String> links = searchGoogle(searchQuery);
    for (String link : links) {
      System.out.println(link);
    }

  }

  public static List<String> searchGoogle(String searchQuery) throws IOException {
    List<String> result = new ArrayList<>();
    // lets get the top results counting to nearly 15
    String request = "https://www.google.com/search?q=" + searchQuery + "&num=15";

      org.jsoup.nodes.Document doc = Jsoup.connect(request)
        .userAgent("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)").get();
    // get the required content from response . Here ( h3 a ) is the selector
    // pattern for selecting all heading links
    Elements links = doc.select("h3 a[href]");

    for (Element link : links) {
      String hrefValue = link.attr("href");
      if (hrefValue.startsWith("/url?q="))
        result.add(extractLink(hrefValue));
    }

    return result;
  }

  // extract required link from href value
  private static String extractLink(String href) {
    String result = null;
    Matcher m = p.matcher(href);

    if (m.find()) {
      result = m.group();
    }

    return result;
  }

}

29 Replies

j-menezes Nov 28, 2019 1 like

Have you debugged the code?

ASHAMM Nov 29, 2019

I debugged it and got the following error:
Not able to submit breakpoint MethodBreakpoint [teste_email.Teste_email$1].getPasswordAuthentication '()Ljava/net/PasswordAuthentication;', reason: Breakpoint belongs to disabled source root 'C:\Users\filip\Documents\NetBeansProjects\teste_email\src'. See Window/Debugging/Sources.

Do you know how to fix it?

j-menezes Nov 29, 2019

Which IDE are you using?

ASHAMM Nov 29, 2019

I use NetBeans

j-menezes Nov 29, 2019

Which NetBeans version, and which Java version is your project compiled with?

ASHAMM Nov 29, 2019

I use NetBeans 8.2 and the Java version is 1.8.0_191
Thanks for the help!

j-menezes Nov 29, 2019

Sorry, I can only take a look at your code over the weekend. If you solve it before then, mark the thread as solved here on the forum.

ASHAMM Nov 29, 2019

OK! I appreciate your help! I'll keep trying!

j-menezes Nov 29, 2019

Which version of jsoup are you using?

ASHAMM Nov 29, 2019

I'm using jsoup 1.12.1

j-menezes Nov 29, 2019 1 like

OK, I'll look at it tomorrow

j-menezes Nov 30, 2019 3 likes

I checked the program and found only one small error, in how the Google page is parsed: the selector h3 a[href] had to become a[href]. I also added a URL decode step so I could check that the result was correct when pasting the address into the browser. It worked correctly.

import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;


public class GoogleSearchDemo {

  // pattern for extracting the link such as www.codeforeach.com/java/ ( domain
  // name + path )
  private static final Pattern p = Pattern
      .compile("([a-zA-Z0-9]([a-zA-Z0-9\\-]{0,61}[a-zA-Z0-9])?\\.)+[a-zA-Z]{2,6}(/[^&]*)*");

  public static void main(String[] args) throws IOException {
    String searchQuery = "animais";
    List<String> links = searchGoogle(searchQuery);
    for (String link : links) {
      System.out.println(link);
    }

  }

  public static List<String> searchGoogle(String searchQuery) throws IOException {
    List<String> result = new ArrayList<>();
    // lets get the top results counting to nearly 15
    String request = "https://www.google.com/search?q=" + searchQuery + "&num=15";
    
      org.jsoup.nodes.Document doc = Jsoup.connect(request)
        .userAgent("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)").get();
    // select every link that has an href; the original "h3 a[href]" selector
    // apparently matched nothing in the HTML that Google returns to jsoup
    // System.out.println("--> \n" + doc.toString()); // uncomment to inspect the raw HTML

    Elements links = doc.select("a[href]");
  
    for (Element link : links) {
      String hrefValue = link.attr("href");
      
      if (hrefValue.startsWith("/url?q=")) {
          
         try {
            hrefValue = URLDecoder.decode(hrefValue, StandardCharsets.UTF_8.toString());
            
            result.add(extractLink(hrefValue));
            
         } catch (UnsupportedEncodingException ex) {
            // wrap the exception itself; ex.getCause() is normally null here
            throw new RuntimeException(ex);
         }
          
      }
    }

    return result;
  }

  // extract required link from href value
  private static String extractLink(String href) {
    String result = null;
    Matcher m = p.matcher(href);

    if (m.find()) {
      result = m.group();
    }

    return result;

  }
  
}
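
A side note on the request URL in the listing above: the query is concatenated as-is, which is fine for a single word like "animais" but would likely break for queries containing spaces, accents or "&". A minimal sketch of encoding it first with the standard URLEncoder follows. The class and method names (SearchUrlBuilder, buildSearchUrl) are made up for illustration and are not part of the original program.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original program.
final class SearchUrlBuilder {

  // Builds the request URL with the query percent-encoded, so that spaces,
  // accented characters and '&' do not break the URL.
  static String buildSearchUrl(String searchQuery, int num) {
    try {
      // the (String, String) overload is used because it exists on Java 8
      String encoded = URLEncoder.encode(searchQuery, StandardCharsets.UTF_8.name());
      return "https://www.google.com/search?q=" + encoded + "&num=" + num;
    } catch (UnsupportedEncodingException ex) {
      // UTF-8 is always available, so this should never happen
      throw new IllegalStateException(ex);
    }
  }

  public static void main(String[] args) {
    System.out.println(buildSearchUrl("animais selvagens", 15));
  }
}
```

In searchGoogle, the line that builds the request string could then call buildSearchUrl(searchQuery, 15) instead of concatenating the raw query.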

Happy coding

ASHAMM Nov 30, 2019 1 like

It works now! Thank you so much for the help, really!

ASHAMM Nov 30, 2019

By the way, could you explain the logic if I wanted to fetch the names instead of the links? Would that be easier?

j-menezes Dec 1, 2019 1 like

Well, first you need to look at how the HTML is structured.

That is quite simple: set a breakpoint in the debugger on the line where you want to stop and look at what jsoup is returning.

You will see something like this:

<div class="kCrYT">
    <a href="/url?q=https://www.youtube.com/watch%3Fv%3D1D55PsWF7A4&amp;sa=U&amp;ved=2ahUKEwia-cPR55TmAhXbH7kGHfbmCHsQtwIwFHoECAMQAQ&amp;usg=AOvVaw3iFsYZd9AwNsX18oUV2YWk">
     <div class="BNeawe vvjwJb AP7Wnd">
      7 BATALHAS DE ANIMAIS GRAVADAS EM VÍDEO 4 - YouTube
     </div>
     <div class="BNeawe UPmit AP7Wnd">
      https://www.youtube.com › watch
     </div></a>
   </div>

Now take a look at the source and also at the jsoup documentation.
I'll post the source with this change so you can study it and make the changes you need.
I'd also point out that there is another way to do this: scraping with WebEngine and WebView, which can be kept hidden if necessary, calling a JavaScript function through the WebEngine.

import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javafx.util.Pair;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;


public class GoogleSearchDemo {

  // pattern for extracting the link such as www.codeforeach.com/java/ ( domain
  // name + path )
  private static final Pattern p = Pattern
      .compile("([a-zA-Z0-9]([a-zA-Z0-9\\-]{0,61}[a-zA-Z0-9])?\\.)+[a-zA-Z]{2,6}(/[^&]*)*");

  public static void main(String[] args) throws IOException {
    String searchQuery = "animais";
    List<Pair> links = searchGoogle(searchQuery);
    for (Pair link : links) {
     
      // the key of each pair is the result title, the value is its URL
      System.out.println("name=" + link.getKey() + "  url=" + link.getValue());
    }
    
  }

  public static List<Pair> searchGoogle(String searchQuery) throws IOException {
    List<Pair> result = new ArrayList<>();
    
    // lets get the top results counting to nearly 15
    String request = "https://www.google.com/search?q=" + searchQuery + "&num=15";
     
      org.jsoup.nodes.Document doc = Jsoup.connect(request)        
        .userAgent("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)").get();
      
    // each result block is a div with class "kCrYT" in the HTML that jsoup
    // receives; the title and the link are read from inside that block
    
    //  System.out.println( "--> \n" + doc.toString() );
  
    Elements links = doc.select(".kCrYT");
    
    for (Element link : links) {
      Elements el_a = link.select("a");
     
      String hrefValue = el_a.attr("href");    
      Elements el_divs = el_a.select("div");
      
      String nome = "";
      if(el_divs.size() > 0) {
         nome = el_divs.get(0).html();
      }
       
      if (hrefValue.startsWith("/url?q=")) {
          
         try {
             
            String slink = extractLink(hrefValue);
            
            if( slink != null ) {             

               hrefValue = URLDecoder.decode(slink, StandardCharsets.UTF_8.toString());
            
               Pair pair = new Pair(nome, hrefValue );
            
               result.add( pair );
               
            }
             
         } catch (UnsupportedEncodingException ex) {
            // wrap the exception itself; ex.getCause() is normally null here
            throw new RuntimeException(ex);
         } catch(java.lang.IndexOutOfBoundsException ie) {
            // log the problem and move on to the next result
            ie.printStackTrace();
         }
          
      }
    }

    return result;
  }

  // extract required link from href value
  private static String extractLink(String href) {
    String result = null;
    Matcher m = p.matcher(href);

    if (m.find()) {
      result = m.group();
    }

    return result;

  }
  
}
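
As an alternative to the regex in extractLink, the target address can be read directly from the q parameter of the href. This is only a sketch, assuming hrefs shaped like the "/url?q=...&sa=..." example quoted earlier in the thread; RedirectHref and targetUrl are hypothetical names, not part of the program above. One difference worth noting: this version keeps the scheme (https://), which the regex appears to drop because it only matches the domain name and path.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original program.
final class RedirectHref {

  private static final String PREFIX = "/url?q=";

  // Returns the decoded target URL of a "/url?q=..." href, or null if the
  // href does not have that shape.
  static String targetUrl(String href) {
    if (href == null || !href.startsWith(PREFIX)) {
      return null;
    }
    // the q value ends at the first '&' after the prefix, or at the end of the string
    int end = href.indexOf('&', PREFIX.length());
    String raw = end < 0
        ? href.substring(PREFIX.length())
        : href.substring(PREFIX.length(), end);
    try {
      // the (String, String) overload is used because it exists on Java 8
      return URLDecoder.decode(raw, StandardCharsets.UTF_8.name());
    } catch (UnsupportedEncodingException ex) {
      // UTF-8 is always available, so this should never happen
      throw new IllegalStateException(ex);
    }
  }

  public static void main(String[] args) {
    String href = "/url?q=https://www.youtube.com/watch%3Fv%3D1D55PsWF7A4&sa=U";
    System.out.println(targetUrl(href));
  }
}
```

Note that jsoup's attr("href") already turns the &amp; entities of the raw HTML into plain "&", which is why the sketch splits on "&".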

ASHAMM Dec 1, 2019 1 like

That's exactly it! Thanks for the help!

ASHAMM Dec 4, 2019

Thank you very much for your help! But I have one more question:
How can I extract the date of the published news item or article?
This:
2 hours ago

Thanks

j-menezes Dec 4, 2019 1 like

You need to scan a block of the HTML that contains that data.

ASHAMM Dec 4, 2019

Thanks
Would it be possible to run the search on the Google News site?

ASHAMM Dec 4, 2019

Can you explain this to me? I don't understand where the .kCrYT came from

rodriguesabner Dec 4, 2019

It's the class name of the div (the leading dot is the CSS class selector).

ASHAMM Dec 4, 2019

I can't find that .kCrYT element in the source code of the Google page!

rodriguesabner Dec 4, 2019

It must have a different name. Look for the element you want to fetch and put it in your code.

Accepted solution

j-menezes Dec 4, 2019

If you use the Inspect option in the browser, it won't show up; if you look at the content as jsoup receives it, before the browser displays it, it will.
The site probably calls some internal function that changes the name; I saw that "kCrYT" appears as "r" in the browser.

It looks like the same thing, but it isn't necessarily.

jsoup gets the raw response from the server. When the browser renders that content, the page itself, as I said, may call some function that changes things, whether for security or because of how the page works internally.

ASHAMM Dec 4, 2019

Sorry to be a pain, but how do I do that? I've already searched but found nothing

j-menezes Dec 4, 2019

You have to intercept the content after jsoup returns it.

On some pages the content that arrives is the same as what gets displayed, but that is not a rule.
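
One rough way to find a usable selector such as .kCrYT is to list which class names occur in the HTML that jsoup actually received (for example the text of doc.toString()) and look for the ones that repeat once per result. The sketch below does this with a plain regex over the text. ClassNameCounter and countClassNames are illustrative names, and the HTML in main is a shortened, made-up fragment modelled on the block quoted earlier in the thread. A regex over HTML is crude (it only sees double-quoted class attributes), so treat it as a debugging aid rather than a parser.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical debugging helper, not part of the original program.
final class ClassNameCounter {

  // matches class="..." attributes written with double quotes
  private static final Pattern CLASS_ATTR = Pattern.compile("class=\"([^\"]*)\"");

  // Counts how often each class name occurs in the given HTML text.
  static Map<String, Integer> countClassNames(String html) {
    Map<String, Integer> counts = new TreeMap<>();
    Matcher m = CLASS_ATTR.matcher(html);
    while (m.find()) {
      // a class attribute may hold several names separated by whitespace
      for (String name : m.group(1).trim().split("\\s+")) {
        if (!name.isEmpty()) {
          counts.merge(name, 1, Integer::sum);
        }
      }
    }
    return counts;
  }

  public static void main(String[] args) {
    // made-up fragment; in the real program this would be doc.toString()
    String html = "<div class=\"kCrYT\"><a href=\"/url?q=https://example.com/\">"
        + "<div class=\"BNeawe vvjwJb AP7Wnd\">Title</div>"
        + "<div class=\"BNeawe UPmit AP7Wnd\">example.com</div></a></div>";
    for (Map.Entry<String, Integer> e : countClassNames(html).entrySet()) {
      System.out.println(e.getKey() + " -> " + e.getValue());
    }
  }
}
```

A class that shows up about as many times as there are results on the page is a reasonable candidate to try in doc.select(".name"), and the result should then be checked in the debugger.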

j-menezes Dec 4, 2019

Besides, if you want the content exactly as it appears when the page is displayed, jsoup is not the right tool. Many people think it is the same thing, but it isn't, and the Google search is the proof.

If you want exactly what the browser shows, you have to scrape using WebView and WebEngine.

ASHAMM Dec 4, 2019

I adapted the code to search for a word on Google News and return the title and link of each news item. But it returns nothing! What is wrong? Can you help me? I've already intercepted the content through jsoup and I don't understand why it isn't working!

Here is the code:

package googlesearch;

import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import static java.nio.charset.StandardCharsets.UTF_8;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javafx.util.Pair;
import javax.swing.text.Document;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;


public class GoogleSearch {

  // pattern for extracting the link such as www.codeforeach.com/java/ ( domain
  // name + path )
    
    
    private static final Pattern p = Pattern
      .compile("([a-zA-Z0-9]([a-zA-Z0-9\\-]{0,61}[a-zA-Z0-9])?\\.)+[a-zA-Z]{2,6}(/[^&]*)*");
    
    private static String request;
    
    public static void main(String[] args) throws IOException {
        Scanner ler = new Scanner(System.in);
        System.out.println("Insira a palavra: ");
        String pesquisa = ler.nextLine();
        pesquisa = pesquisa.toLowerCase();
        String searchQuery = pesquisa;
        List<Pair> links = searchGoogle(searchQuery);
        for (Pair titulosResultados : links) {
            System.out.println("Titulo: " + titulosResultados.getKey());
            System.out.println("Link: " + titulosResultados.getValue());
        }
        
    }
    public static List<Pair> searchGoogle(String searchQuery) throws IOException {
        
        List<Pair> result = new ArrayList<>();
    
    // lets get the top results counting to nearly 15
        request = "https://news.google.com/search?q=" + searchQuery + "&hl=pt-PT&gl=PT&ceid=PT%3Apt-150";
        org.jsoup.nodes.Document doc = Jsoup.connect(request)        
        .userAgent("Mozilla/5.0 (compatible; Googlebot/2.1; +http://news.google.com/bot.html)").get();
      
    // get the required content from response . Here ( h3 a ) is the selector
    // pattern for selecting all heading links
    
    //  System.out.println( "--> \n" + doc.toString() );
        
        Elements links = doc.select(".xrnccd");
    
        for (Element link : links) {
            Elements el_a = link.select("h3");
     
            String hrefValue = el_a.attr("a");    
            Elements el_divs = el_a.select("div");
      
            String nome = "";
            if(el_divs.size() > 0) {
                nome = el_divs.get(0).html();
            }
       
            if (hrefValue.startsWith("/url?q=")) {
          
                try {
             
                    String slink = extractLink(hrefValue);
            
                    if( slink != null ) {             

                    hrefValue = URLDecoder.decode(slink, StandardCharsets.UTF_8.toString());
            
                    Pair pair = new Pair(nome, hrefValue );
            
                    result.add( pair );
               
                    }
             
                } catch (UnsupportedEncodingException ex) {
                    throw new RuntimeException(ex.getCause());
                } catch(java.lang.IndexOutOfBoundsException ie) {
                    ie.printStackTrace();
             
                }
          
            }
        }

        return result;
    }

  // extract required titulosResultados from href value
    private static String extractLink(String href) {
        
        String result = null;
        Matcher m = p.matcher(href);

        if (m.find()) {
            result = m.group();
        }

        return result;

    }

}

Thanks for the help

j-menezes Dec 4, 2019

This is a job of patience and testing.
When you debug and stop at the line of interest, keep trying jsoup calls until they return exactly what you want.

Created November 28, 2019

Last reply Dec 4, 2019

Replies 29

Participants 3
