Patent search is offered to users by large services such as Yandex.Patents, Google Patents, Espacenet, the United States Patent and Trademark Office (USPTO), and Dimensions, which hold large collections of patent documents. Many of these search engines support various search criteria, for example by date, title, applicant, and category. The problem with such services is that they do not provide direct access to their databases, offering only a web interface for viewing information. This chapter discusses the development of a module that forms a patent sample from the data (natural-language text and metadata) of the US patent array (USPTO) for various analysis tasks, such as building a patent landscape through clustering procedures, identifying patent trends, and so on. To solve this problem, three algorithms were developed: a patent filtering algorithm that checks whether a patent's main class appears in a clarifying list read from a configuration file; a parsing algorithm that extracts the required description elements from the patent documents under consideration; and a clustering algorithm for the patent sample. The parsing module is built with the lxml library and Beautiful Soup. Clustering, together with the evaluation of its precision and completeness, uses the sklearn library; for organizing information storage, the ClickHouse database management system (DBMS) and the HDFS distributed file system were selected. Testing of the developed software showed that partitioning the patent sample by meta-information (IPC classes) coincides, with high accuracy, with the results of clustering based on the analysis of the patents' textual content.
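The filtering step described above can be sketched as follows. This is a minimal illustration, not the chapter's implementation: the config file layout, the key names (`allowed_ipc_classes`, `main_ipc_class`), and the sample records are all assumptions.

```python
import json

def load_allowed_classes(config_path):
    # Load the clarifying list of IPC classes from a JSON config file
    # (hypothetical layout: {"allowed_ipc_classes": ["G06F", ...]})
    with open(config_path) as f:
        return set(json.load(f)["allowed_ipc_classes"])

def passes_filter(patent, allowed_classes):
    # Keep a patent only if its main (basic) class is on the allow-list
    return patent.get("main_ipc_class") in allowed_classes

# Usage: filter a list of patent records down to the sample
patents = [
    {"id": "US1234567", "main_ipc_class": "G06F"},
    {"id": "US7654321", "main_ipc_class": "A61K"},
]
allowed = {"G06F", "H04L"}
sample = [p for p in patents if passes_filter(p, allowed)]
```

Checking the main class against a small in-memory set keeps the filter cheap enough to run over the full USPTO array before any parsing or clustering work is done.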
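The parsing step with Beautiful Soup over lxml might look like the sketch below. The element names (`invention-title`, `abstract`, `classification-ipcr`) follow the general shape of USPTO full-text XML but are assumptions here, as is the inline sample document.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment in the spirit of a USPTO grant record
SAMPLE_XML = """<us-patent-grant><invention-title>Method for clustering documents</invention-title><abstract><p>A method for grouping patent texts.</p></abstract><classification-ipcr><section>G</section><class>06</class></classification-ipcr></us-patent-grant>"""

def parse_patent(xml_text):
    # "lxml-xml" makes Beautiful Soup use lxml's XML parser under the hood
    soup = BeautifulSoup(xml_text, "lxml-xml")
    return {
        "title": soup.find("invention-title").get_text(strip=True),
        "abstract": soup.find("abstract").get_text(" ", strip=True),
        "ipc": soup.find("classification-ipcr").get_text("", strip=True),
    }

record = parse_patent(SAMPLE_XML)
```

The extracted fields (title, abstract text, IPC class) are exactly what the later stages need: the class feeds the filter, the text feeds the clustering.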
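The clustering and its comparison against the IPC-based split can be sketched with sklearn, whose `homogeneity_score` and `completeness_score` are one natural reading of "precision and completeness of clustering." The toy corpus and its two notional IPC classes are assumptions for illustration; the chapter's actual features and cluster count are not specified here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import homogeneity_score, completeness_score

# Toy abstracts drawn from two notional IPC classes (G06F vs. A61K)
texts = [
    "neural network data processing system",
    "processor memory data computing system",
    "pharmaceutical composition for treating disease",
    "drug compound for medical treatment",
]
ipc_labels = ["G06F", "G06F", "A61K", "A61K"]

# Vectorize the texts and cluster them, ignoring the IPC metadata
X = TfidfVectorizer().fit_transform(texts)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Compare the text-based clusters with the IPC-based partition:
# both scores are 1.0 when the two splits coincide exactly
h = homogeneity_score(ipc_labels, clusters)
c = completeness_score(ipc_labels, clusters)
```

Scores close to 1.0 correspond to the chapter's finding that the metadata-based partition and the text-based clustering agree with high accuracy.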